Web addresses containing non-ASCII characters are becoming increasingly common. In this blog, we will discuss web addresses with non-ASCII characters and how they actually work. I will explain using examples based on HTML and HTTP.
What is the purpose of I18n URLs?
At present, web addresses are URIs (Uniform Resource Identifiers), whose syntax is defined in RFC 3986 (STD 66). That syntax allows only a limited set of characters: the upper and lower case letters of the English alphabet, European numerals, and a small number of symbols.
With the worldwide use of the internet, there is a growing demand from users for non-ASCII characters in web addresses. Web addresses in a user's own language offer many benefits: they are easier to create, remember, transcribe, and guess.
Some examples:
- http://ar.aichi-u.ac.jp/iri/日本語ディレクトリ/ (Japanese URL)
- http://清华大学.cn (Chinese URL)
- http://日本語.jp/case/accessible/all.html (Japanese URL)
Concept of Internationalized Resource Identifiers (IRI):-
We will use the fictitious web address below for the explanations in this blog:
The above IRI consists of three parts:
Scheme: - The "http" at the start identifies the scheme, which does not contain any non-ASCII characters.
Domain Name: - It can contain non-ASCII characters.
Path: - It may or may not contain non-ASCII characters.
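As a rough illustration of these three parts, the sketch below splits one of the example addresses listed earlier using Python's standard urllib.parse module. The choice of Python and of that particular address is an assumption made for illustration only.

```python
from urllib.parse import urlsplit

# One of the example addresses listed earlier in this post.
iri = "http://日本語.jp/case/accessible/all.html"

parts = urlsplit(iri)

scheme = parts.scheme    # the scheme is always plain ASCII
domain = parts.hostname  # the domain name may contain non-ASCII characters
path = parts.path        # the path may or may not contain non-ASCII characters

print(scheme, domain, path, sep="\n")
```

Note that splitting the address does not convert anything: the domain name and the path are still in their original, non-ASCII form at this point.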
Next, we will discuss how non-ASCII characters are handled in the domain name and in the path. The two are treated differently.
Domain Name Handling:-
Domain names are allocated and managed by domain name registration organizations around the world. Domain names requested for multilingual web addresses are registered in their Punycode representation.
Punycode is a method of representing Unicode code points using only ASCII characters.
Such domain names are referred to as Internationalised Domain Names (IDNs).
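To get a feel for Punycode, here is a minimal sketch that uses the "punycode" codec shipped with Python's standard library. It encodes a single label only; the label 日本語 is taken from one of the example addresses, and the variable names are just illustrative.

```python
# Python's built-in "punycode" codec encodes one label at a time.
label = "日本語"

encoded = label.encode("punycode")     # ASCII-only bytes
decoded = encoded.decode("punycode")   # back to the original Unicode label

print(encoded.decode("ascii"))
print(decoded)
```

The codec returns the bare Punycode label. In a real domain name, each encoded label additionally carries the "xn--" prefix.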
Now let's look at what actually happens when a user enters a web address in their native language. When the user types a non-ASCII web address (an IRI) into the address bar of a user agent, the non-ASCII characters may be in any encoding. Here is an example of the domain name:
If the domain name is not already in Unicode, the user agent converts it to Unicode and then normalizes the string, for example by mapping equivalent character forms to a single form and rejecting characters that are not permitted in domain names.
The user agent then converts each non-ASCII label of the Unicode string into its Punycode representation and prepends the special marker "xn--", which identifies the label as an encoded non-ASCII name. The output for the above domain name would be:
xn--wgv71a119e.jp -> Punycode representation of the non-ASCII domain name
This Punycode representation is then resolved in the same way as any ASCII domain name.
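The steps a user agent goes through can be sketched roughly as follows. This is only an approximation under stated assumptions: the helper name domain_to_ascii is hypothetical, Shift_JIS is assumed as the encoding of the typed input, and Python's built-in "idna" codec implements the older IDNA 2003 rules, which may differ from what a current browser does.

```python
import unicodedata


def domain_to_ascii(raw: bytes, source_encoding: str) -> str:
    """Hypothetical helper: legacy-encoded domain name -> ASCII form."""
    # Step 1: convert the typed bytes into a Unicode string.
    domain = raw.decode(source_encoding)
    # Step 2: normalize. The idna codec applies its own preparation as well;
    # the explicit NFKC call is here only to make the step visible.
    domain = unicodedata.normalize("NFKC", domain)
    # Step 3: Punycode every non-ASCII label and add the "xn--" prefix.
    return domain.encode("idna").decode("ascii")


# Assumption: the user typed the name in a Shift_JIS environment.
typed = "日本語.jp".encode("shift_jis")

ascii_domain = domain_to_ascii(typed, "shift_jis")
round_trip = ascii_domain.encode("ascii").decode("idna")

print(ascii_domain)
print(round_trip)
```

Decoding the ASCII form with the same codec gives back the Unicode domain name, which is how a user agent can still display the name in the user's own script.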
Many libraries provide APIs for converting an IDN to ASCII. For example, ICU (a set of C/C++ and Java libraries that provide Unicode and globalization support for software applications) offers the function "uidna_IDNToASCII", which converts an IDN into its ASCII form.
Path Handling:-
Handling the path is more difficult than handling the domain name. Domain names are already registered in their ASCII-based Punycode representation, whereas a path can be any folder path, provided in any encoding. The IETF Proposed Standard RFC 3987 (Internationalized Resource Identifiers (IRIs)) defines how to handle this.
A path with non-ASCII characters is represented using percent-encoding, in which each byte is written as a % sign followed by a two-digit hexadecimal number. The path as entered may be stored in a non-Unicode encoding. In that case it is first converted to Unicode, normalized using Unicode Normalization Form C, and encoded using the UTF-8 encoding.
/iri/日本語ディレクトリ/ -> Path with non-ASCII characters
The user agent then converts the non-ASCII bytes to percent-escapes. Our example now looks like this:
/iri/%E6%97%A5%E6%9C%AC%E8%AA%9E%E3%83%87%E3%82%A3%E3%83%AC%E3%82%AF%E3%83%88%E3%83%AA/ -> Percent Encoding Format
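The path conversion can be tried out with a few lines of Python. The sketch below is an illustration, not a full RFC 3987 implementation: the helper name path_to_percent_encoding is hypothetical, and EUC-JP is assumed as the non-Unicode encoding in which the path was originally typed.

```python
import unicodedata
from urllib.parse import quote, unquote


def path_to_percent_encoding(raw: bytes, source_encoding: str) -> str:
    """Hypothetical helper: legacy-encoded path -> percent-encoded path."""
    # Convert the path to Unicode.
    path = raw.decode(source_encoding)
    # Normalize using Unicode Normalization Form C.
    path = unicodedata.normalize("NFC", path)
    # Encode as UTF-8 and percent-escape every non-ASCII byte,
    # leaving the "/" separators untouched.
    return quote(path, safe="/", encoding="utf-8")


# Assumption: the path was typed in an EUC-JP environment.
typed = "/iri/日本語ディレクトリ/".encode("euc_jp")

escaped = path_to_percent_encoding(typed, "euc_jp")
restored = unquote(escaped)  # what a server would see after unescaping

print(escaped)
print(restored)
```

The normalization step matters because the same visible text can be stored in more than one way. A デ typed as テ followed by a combining voiced sound mark is composed into a single code point by Form C, so both spellings end up with the same percent-encoded path.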
This is only a brief introduction to I18n URLs; there is a lot more detail to this topic.