Domain Names containing International Characters
When the world wide web first came into existence it used the English language, and a limited set of characters was enough to represent the name of any location on the web. All domain names were made up from the basic letters of the English alphabet, plus the digits 0-9 and the hyphen character. The basic 7-bit ASCII character set was used to represent these characters.
Some time later, it was recognized that there was a need for domain names containing non-English characters - for example European letters with accents, letters from the Cyrillic alphabet, Arabic script, etc. Unfortunately, the use of the basic ASCII character set was by then firmly embedded in the system used for registering domain names and translating them to IP addresses, and so a method needed to be found whereby these new symbols could be used without having to change the existing infrastructure of the internet.
Under the scheme known as IDNA2003 (Internationalising Domain Names in Applications), domain names containing characters outside the basic ASCII set are encoded as special ASCII sequences before being used on the internet. The encoding scheme takes as its input a domain name consisting of a string of Unicode codepoints, which can represent not only the ASCII symbols but also an enormous range of accented letters and symbols in non-English alphabets.
All uppercase characters - including accented characters - are first converted to lowercase and then the string is examined for any characters which are not representable as ASCII alphanumeric characters (or hyphens). If none are found, then the string is simply used as-is, with no further processing required. If non-ASCII characters are detected, however, then the string is encoded to produce an ASCII-compatible string using a scheme known as punycode.
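As a rough illustration of this first step, the check can be sketched in Python. The function name `prepare_label` is hypothetical, and the sketch only lowercases the input and tests for non-ASCII characters; the full IDNA2003 preparation step performs further character mappings that are not shown here.

```python
def prepare_label(label: str) -> tuple:
    """Lowercase a label and report whether it would need punycode encoding."""
    lowered = label.lower()                 # also lowercases accented letters
    needs_encoding = not lowered.isascii()  # True if any non-ASCII character remains
    return lowered, needs_encoding


print(prepare_label("EXAMPLE"))  # ASCII only, so it can be used as-is
print(prepare_label("BÜCHER"))   # contains 'ü', so it must be encoded
```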
Punycode conversion
First, all basic ASCII characters in the input string are collected together without changing their order; if this string is non-empty, a single hyphen is appended as a delimiter. This is followed by additional ASCII characters, generated by an algorithm that encodes both the positions and the values of all non-ASCII characters from the original string. Finally, the result is prefixed with 'xn--', a special sequence indicating that this part of the domain name has been encoded using punycode. This encoding is applied individually to each section of a full domain name, the sections being separated by '.' characters.
Some example encodings are:
IDN | Encoded | Notes |
---|---|---|
www.example.com | www.example.com | No encoding required. |
www.bücher.de | www.xn--bcher-kva.de | The 'bcher' characters can be represented using basic ASCII. The single 'ü' requires punycoding: the hyphen acts as a delimiter and the 'kva' that follows encodes the position and value of the 'ü'. The 'xn--' prefix indicates that the string is a punycode string. |
кто.рф | xn--j1ail.xn--p1ai | The strings before and after the '.' are both individually coded. No basic ASCII characters are present in either string. |
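The encodings in the table can be reproduced with a short Python sketch that uses the standard library's `punycode` codec. The helper names `to_ascii_label` and `to_ascii_domain` are hypothetical, and the sketch deliberately skips the other IDNA preparation steps (it only lowercases), so it is an approximation rather than a complete implementation.

```python
def to_ascii_label(label: str) -> str:
    """Encode one dot-separated section of a domain name."""
    label = label.lower()
    if label.isascii():
        return label  # no encoding required
    # The codec yields the basic characters, the '-' delimiter and the suffix;
    # the 'xn--' prefix has to be added separately.
    return "xn--" + label.encode("punycode").decode("ascii")


def to_ascii_domain(domain: str) -> str:
    """Encode each section individually and re-join them with '.'."""
    return ".".join(to_ascii_label(part) for part in domain.split("."))


for name in ("www.example.com", "www.bücher.de", "кто.рф"):
    print(name, "->", to_ascii_domain(name))
```

Python also ships an `idna` codec (`"www.bücher.de".encode("idna")`) that applies the IDNA2003 rules in one step; the manual version above is only meant to make the per-section structure visible.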
Note that punycode encoding is performed automatically by your browser when it encounters an HTML link containing non-ASCII characters - this allows web authors to create pages with links written in their native character set, and lets the browser display them correctly in the address bar or link preview, whilst also presenting the correctly encoded version when requesting pages from the internet.
Exceptions to the Rule
The IDNA2003 system specified that the ß character (Sharp-s or Eszett) should be translated to 'ss', and that the ς character (used in Greek when a sigma appears at the end of a word) should be translated to a standard sigma σ prior to encoding. These rules were in part due to difficulties with lowercase conversion - there was no uppercase equivalent of ß (it was written as 'SS'), and the uppercase versions of ς and σ are both Σ, so the conversion to lowercase would have been ambiguous. For example, SCHLOSS.DE could be converted either to schloss.de or schloß.de, leading to two different websites.
A new IDNA2008 standard was therefore published in which, amongst other things, all uppercase characters are disallowed (along with some other symbols, punctuation, and variant characters). Consequently there was no longer any need to translate the ß and ς characters into other letters, so they are permitted in their own right under IDNA2008. However, this leads to a problem.
If a website contains a link to, say, http://faß.de then under IDNA2003 this would have been translated to http://fass.de whereas under IDNA2008 the ß would trigger a punycode conversion to http://xn--fa-hia.de which is an entirely different domain. A website containing an HTML anchor
<A href="http://faß.de">Faß</A>
would link to one page under IDNA2003 and another under IDNA2008. Since the encoding of domain names is handled by the browser, this leads to differences in behavior depending on whether a browser follows one standard or the other.
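The two outcomes can be compared in Python. The built-in `idna` codec follows the 2003 rules, so it maps ß to 'ss'. The second function, `encode_without_mapping`, is a hypothetical helper that only approximates the 2008 behavior for this one case by punycoding the section directly, without mapping ß first; it is not a full IDNA2008 implementation.

```python
def encode_without_mapping(domain: str) -> str:
    """Punycode each non-ASCII section as-is, leaving ß unmapped (assumption:
    this mimics the IDNA2008 result for this example only)."""
    encoded = []
    for label in domain.lower().split("."):
        if label.isascii():
            encoded.append(label)
        else:
            encoded.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(encoded)


host = "faß.de"
idna2003 = host.encode("idna").decode("ascii")  # built-in codec, 2003 rules
idna2008_like = encode_without_mapping(host)

print(idna2003)       # the 'ss' form
print(idna2008_like)  # the punycode form
```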
In practice, most browser vendors seem to have elected not to break pre-existing websites, and they continue to interpret ß and ς characters under the 2003 rules whilst generally following the 2008 scheme elsewhere. However, a notable exception is Firefox, which encodes ß and ς under the 2008 rules. So following a link to http://faß.de (or entering it in the address bar) will go to http://xn--fa-hia.de in Firefox, but to http://fass.de in Chrome or Internet Explorer.
Dead Link Checker pragmatically sides with the majority of browser vendors and interprets the link as http://fass.de even though this is not the most up-to-date standard. If browser behavior changes in the future, Dead Link Checker will be updated accordingly.
Security Considerations
Theoretically, when one of these domain names containing a ß or ς character is registered, the registrar ought to take steps to ensure that the alternative domain is either owned by the same client or is offered for registration simultaneously - this helps to avoid security issues where a malicious actor could divert traffic intended for a genuine domain name, potentially serving malware from the alternative website. However, even if the same client owns both variants of the domain name, there is nothing to force them to present the same content on each domain.
IDNs also present a security risk with homoglyphs - different characters which are written the same way but are represented by different Unicode codepoints. For example, many letters in the Cyrillic alphabet (а, с, е, о, р, х and у) look identical to letters from the Latin alphabet, so a domain name containing these Cyrillic characters can appear to be a valid link to one location whereas in fact it leads somewhere entirely different. Most browsers attempt to guard against these attacks by displaying the punycode version of the domain name in the status/address bar whenever mixed or ambiguous character sets are present. The results are inconsistent - http://www.pace.com (using Latin characters) displays as expected in the status bar of both Chrome and Firefox. However, http://www.расе.com (in which the 'pace' is written using Cyrillic characters) displays as punycode in Chrome and Internet Explorer but not in Firefox, where it looks identical to www.pace.com. This type of behavior could lead to links taking the user to an unexpected and potentially malicious destination.
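A very crude way to spot mixed character sets is sketched below in Python. It assumes that the first word of each letter's Unicode name (for example 'LATIN' or 'CYRILLIC') is a good enough stand-in for its script; the function names are hypothetical, and real browsers apply considerably more elaborate rules.

```python
import unicodedata


def scripts_in_label(label: str) -> set:
    """Return the set of script names (first word of the Unicode name) used
    by the letters in a label. Digits and hyphens are ignored."""
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}


def looks_mixed(label: str) -> bool:
    """True if the letters of a label come from more than one script."""
    return len(scripts_in_label(label)) > 1


latin = "pace"
cyrillic = "\u0440\u0430\u0441\u0435"  # 'расе' written with Cyrillic letters
mixed = "p\u0430ce"                    # Latin 'p', 'c', 'e' with a Cyrillic 'а'

for label in (latin, cyrillic, mixed):
    print(ascii(label), scripts_in_label(label), looks_mixed(label))
```

Note that a label written entirely in Cyrillic is not flagged as mixed by this check even though it is visually indistinguishable from the Latin one, which is one reason a simple mixed-script test does not catch every homoglyph.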
URL Encoding
The section of a URL following the domain name - the optional path and page name - is not encoded using punycode. Instead, any characters not in the basic ASCII set are first converted to sequences of 8-bit data values using an encoding such as UTF-8, and then these values (which are typically between 128 and 255, still outside the basic ASCII values) are represented as a percent symbol followed by the value converted to a two-digit hexadecimal string. So for example, the Σ character with Unicode value 931 is converted to the two 8-bit values 206, 163 which are then represented in %-encoded hexadecimal as %CE%A3
So, the URL http://letters.com/Greek/Σ.html would become http://letters.com/Greek/%CE%A3.html and http://φκφ.com/Σ.html would become http://xn--vxaxb.com/%CE%A3.html
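The same conversion can be sketched with Python's standard library. `urllib.parse.quote` percent-encodes the UTF-8 bytes of the path, and the built-in `idna` codec handles the domain name. The helper `to_ascii_url` is hypothetical and, to keep it short, ignores ports and user information in the URL.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url: str) -> str:
    """Punycode the host and percent-encode the path of a URL (a sketch)."""
    parts = urlsplit(url)
    host = parts.hostname.encode("idna").decode("ascii")  # domain name: punycode
    path = quote(parts.path)                              # path: UTF-8, then %XX
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


sigma = "Σ"
print(ord(sigma))                   # Unicode value of the character
print(list(sigma.encode("utf-8")))  # the 8-bit values produced by UTF-8
print(quote(sigma))                 # the %-encoded form
print(to_ascii_url("http://letters.com/Greek/Σ.html"))
```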
Further reading:
ICANN - Guidelines for implementing IDN
Punycoder - an online punycode converter
Wikipedia - UTF-8 encoding