Google Blog: UTF-8 won the internet →
https://marco.org/2008/05/06/google-blog-utf-8-won-the-internet
It’s important to distinguish Unicode from UTF-8 here. This post’s language is incomplete and slightly misleading.
Unicode is the character set. It associates characters with numbers. ‘A’ is number 65, ‘ḿ’ is number 7743, etc.
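A quick sketch in Python 3, whose strings are sequences of Unicode code points, makes that mapping concrete (Python is just my choice of language for the example):

```python
# ord() maps a character to its Unicode code point; chr() goes the other way.
print(ord('A'))   # 65
print(ord('ḿ'))   # 7743 (U+1E3F)
print(chr(7743))  # 'ḿ'
```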
UTF-8 is one method of encoding those numbers into bytes. Normally, a 1-byte character only holds values 0-255, so there’s no direct way to represent 7743. There are a few solutions to this problem with various tradeoffs:
- 16-bit characters: Easy to process, but incompatible with most old C string functions, and wasteful of space. The full Unicode standard also has some infrequently used characters that won’t fit in 16 bits, so the platforms that take this approach (Windows, OS X, Java, and .NET) actually use UTF-16, which encodes those rare characters as pairs of 16-bit units called surrogate pairs.
- 32-bit characters (UTF-32): Fits the whole Unicode standard, but even more wasteful of space. Almost nothing uses it for storage or transmission; it mostly shows up as an in-memory representation, such as wchar_t on Linux.
- Multi-byte encoding, such as UTF-8: ASCII characters (0-127) stay as single bytes, but anything above 127 is encoded as a sequence of 2 to 4 bytes. Very space-efficient for mostly-ASCII text and backwards compatible with plain ASCII, but there are a few behavioral quirks: for example, there’s no way to know how many characters (not bytes) are in a string without scanning through the whole thing, and you can’t safely split a string at an arbitrary byte boundary because you might cut a multi-byte character in half (both quirks show up in the sketch below). This is now what most web pages use.
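To make the tradeoffs concrete, here’s a small Python 3 sketch (the string ‘naḿe’ is just an illustrative example) that encodes the same text all three ways and demonstrates the UTF-8 quirks:

```python
# Compare byte sizes of the three approaches for the same 4-character string.
s = 'naḿe'

utf8 = s.encode('utf-8')       # multi-byte: ASCII stays 1 byte, 'ḿ' becomes 3
utf16 = s.encode('utf-16-le')  # 16-bit units: 2 bytes per character here
utf32 = s.encode('utf-32-le')  # 32-bit units: 4 bytes per character

print(len(s), len(utf8), len(utf16), len(utf32))  # 4 6 8 16
print(utf8.hex())  # 6e61e1b8bf65 -- 'ḿ' (U+1E3F) is the 3-byte sequence e1 b8 bf

# Quirk 1: counting characters in UTF-8 means scanning/decoding the whole thing.
print(len(utf8.decode('utf-8')))  # 4 characters, even though there are 6 bytes

# Quirk 2: splitting at an arbitrary byte boundary can cut a character in half.
try:
    utf8[:4].decode('utf-8')  # the cut lands in the middle of 'ḿ'
except UnicodeDecodeError as err:
    print('bad split:', err)
```

The byte counts track the point above: UTF-16 and UTF-32 pay a fixed overhead on every character, while UTF-8 only pays for the characters that need it.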
Congratulations, you now know more about character encoding than 99% of people who write software for a living.