URL normalization


URL normalization (or URL canonicalization) is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URL into a normalized or canonical URL so it is possible to determine if two syntactically different URLs are equivalent.

Search engines employ URL normalization in order to assign importance to web pages and to reduce indexing of duplicate pages. Web crawlers perform URL normalization in order to avoid crawling the same resource more than once. Web browsers may perform normalization to determine if a link has been visited or to determine if a page has been cached.
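As a rough illustration of the crawler use case, the sketch below deduplicates a list of URLs by their normalized form. The names `dedupe` and `strip_fragment` are hypothetical, and the normalizer shown only drops the fragment; a real crawler would likely plug in a fuller normalizer.

```python
from typing import Callable, Iterable, List, Set
from urllib.parse import urldefrag


def strip_fragment(url: str) -> str:
    """Toy normalizer (assumption): only removes the '#fragment' part."""
    return urldefrag(url).url


def dedupe(urls: Iterable[str], normalize: Callable[[str], str]) -> List[str]:
    """Return normalized URLs in first-seen order, skipping equivalents."""
    seen: Set[str] = set()
    unique: List[str] = []
    for url in urls:
        key = normalize(url)  # equivalent URLs map to the same key
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


to_crawl = dedupe(
    [
        "http://foo.org/bar.html#section1",
        "http://foo.org/bar.html",
        "http://foo.org/baz.html",
    ],
    strip_fragment,
)
```

Passing the normalizer in as a parameter keeps the deduplication logic independent of which normalization steps are applied.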

Normalization process

There are several types of normalization that may be performed:

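  • Converting the scheme and host to lower case – The scheme and host components of a URL are case-insensitive, so most normalizers convert them to lowercase. Example:
  • HTTP://www.FooBar.com/ → http://www.foobar.com/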
  • Converting the entire URL to lower case – Some web servers that run on top of case-insensitive file systems allow URLs to be case insensitive. Therefore all URLs from a case-insensitive web server may be converted to lowercase to avoid ambiguity. Example:
  • http://foo.org/BAR.html → http://foo.org/bar.html
  • Capitalizing hexadecimal digits – All hexadecimal digits within a percent-encoding triplet (e.g., "%3a") are case-insensitive, and therefore the digits A-F should be capitalized. Example:
  • http://foo.org/?mode=%3a%b1+abc → http://foo.org/?mode=%3A%B1+abc
  • Removing the fragment – The fragment portion of a URL is usually removed because a URL with and without the fragment represent the same resource. Example:
  • http://foo.org/bar.html#section1 → http://foo.org/bar.html
  • Removing port 80 – The default HTTP port (80) may be removed from (or added to) a URL. Example:
  • http://foo.org:80/bar.html → http://foo.org/bar.html
  • Removing ".." and "." segments – The ".." and "." segments are usually removed from a URL. Many normalizers use the algorithm described in RFC 3986 (or a similar algorithm) to remove the segments. Example:
  • http://foo.org/../a/b/../c/./d.html → http://foo.org/a/c/d.html
  • Adding a terminating slash – A terminating slash may be added at the end of a URL that points to a directory. Most web servers will redirect HTTP requests that are missing a terminating slash to a URL with the terminating slash. Examples:
  • http://foo.org → http://foo.org/
  • http://foo.org/dir → http://foo.org/dir/
  • Removing "www" prefix – Some websites can be reached both with and without a "www" prefix. For example, http://foo.org/ and http://www.foo.org/ may access the same website. Although many websites will redirect the user to the non-www version (or vice versa), some do not. A normalizer may perform extra processing to determine whether a non-www version exists and then normalize all URLs to the non-www form. Example:
  • http://www.foo.org/ → http://foo.org/
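The purely syntactic steps in the list above can be sketched in Python using the standard `urllib.parse` module. This is a minimal sketch, not a definitive normalizer: the function names (`normalize`, `remove_dot_segments`, `upper_hex`) are hypothetical, the dot-segment handling is a simplified take on the RFC 3986 procedure, and userinfo and IPv6 literals are not handled. Lowercasing the whole URL, adding a slash to directory paths, and removing "www" are left out because they depend on knowledge of the server.

```python
import re
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80}  # only the default mentioned in the text


def upper_hex(text: str) -> str:
    """Capitalize the hex digits of percent-encoding triplets (%3a -> %3A)."""
    return re.sub(r"%[0-9a-fA-F]{2}", lambda m: m.group(0).upper(), text)


def remove_dot_segments(path: str) -> str:
    """Drop "." and ".." segments from an absolute path (simplified)."""
    if not path.startswith("/"):
        return path  # sketch: leave empty or relative paths untouched
    segments = path.split("/")[1:]
    out = []
    for seg in segments:
        if seg == ".":
            continue
        if seg == "..":
            if out:
                out.pop()  # ".." at the root is simply discarded
            continue
        out.append(seg)
    if segments[-1] in (".", ".."):
        out.append("")  # a trailing dot segment leaves a trailing slash
    return "/" + "/".join(out)


def normalize(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""  # urlsplit lowercases the hostname
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"  # keep only non-default ports
    path = remove_dot_segments(upper_hex(parts.path)) or "/"
    query = upper_hex(parts.query)
    return urlunsplit((scheme, netloc, path, query, ""))  # fragment dropped


print(normalize("HTTP://www.FooBar.com:80/../a/b/../c/./d.html#top"))
```

Two URLs can then be treated as equivalent when `normalize` returns the same string for both.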


    From Wikipedia, the Free Encyclopedia.
    All text is available under the terms of the GNU Free Documentation License. See Wikipedia Copyrights for details.
