UTF16 is not really a curse for languages that require it. String operations in non-English languages are very fast because of it, and most software these days has to deal with localization.
UTF-16 is the worst of all worlds: it's less efficient than UTF8 for most use cases, requires you to think about endianness, but is still a variable-length encoding. (And the cases that require variable-length encoding are rarer than they are for UTF-8, meaning you're less likely to hit them in testing)
No, it uses the fixed-width LATIN1 (ISO-8859-1) encoding for compact strings. It wouldn't make much sense to use another variable-width encoding like UTF-8.
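The width difference is easy to see directly. A quick Python sketch, just to illustrate why a fixed-width Latin-1 representation is attractive for compact strings (byte counts follow from the encodings themselves):

```python
# Latin-1 (ISO-8859-1) is fixed width: one byte per character,
# even for accented letters in the 0x80-0xFF range.
s = "café"
assert len(s.encode("latin-1")) == 4   # 4 characters -> 4 bytes
assert len(s.encode("utf-8")) == 5     # variable width: 'é' takes 2 bytes

# Fixed width means byte offset == character index, so the string
# can be indexed directly without decoding.
assert s.encode("latin-1")[3] == 0xE9  # 'é' is the single byte 0xE9
```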
While technically UTF16 is variable length, 99.99% of cases use a single 16-bit word per character. I.e. on modern hardware with branch prediction and speculative execution, these branches don't affect speed. With UTF8, the CPU mispredicts branches all the time because spaces, punctuation and newlines are single bytes even in non-Latin-1 text.
I think the most common operations are comparisons for equality and copying anyways. UTF-8 is faster for those.
I tried out how fast I could make UTF-8 strlen, with an assumption of a valid UTF-8 string. The routine ran at 18 GB/s on a single core using SSE.
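The property that makes a vectorized UTF-8 strlen possible is that code point starts are trivially recognizable: every byte that is not a continuation byte (top bits `10`) begins a code point. A scalar sketch of the same idea in Python (the SSE version presumably does this comparison 16 bytes at a time; the 18 GB/s figure is the commenter's measurement, not reproduced here):

```python
def utf8_codepoint_len(data: bytes) -> int:
    # A byte is a continuation byte iff its top two bits are 10.
    # Counting the bytes that are NOT continuations yields the number
    # of code points, with no per-character branching on sequence length.
    return sum((b & 0xC0) != 0x80 for b in data)

assert utf8_codepoint_len("hello".encode("utf-8")) == 5
assert utf8_codepoint_len("héllo".encode("utf-8")) == 5   # 'é' is 2 bytes
assert utf8_codepoint_len("日本語".encode("utf-8")) == 3  # 3 bytes each
```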
> With UTF8, CPU mispredicts branches all the time because spaces, punctuations and newlines are single bytes
I don't understand this sentence. Why would there be any more mispredictions because of those being single bytes? These days code is so often bandwidth limited if anything, so smaller data helps.
> Indexing code points in both UTF-8 and UTF-16 requires reading the whole string up to index location. Substrings are the same as well.
Java's String functions don't index by Unicode code points, though. Java strings are stored as UTF-16, but the API indexes 16-bit code units, so it has to pretend the encoding is UCS-2.
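The distinction only shows up outside the BMP: a character like 𝄞 (U+1D11E) is one code point but two UTF-16 code units, and Java's `length()`/`charAt()` count the units. A Python sketch of the same arithmetic, emulating the code-unit count by encoding to UTF-16:

```python
s = "a𝄞b"  # 'a', MUSICAL SYMBOL G CLEF (U+1D11E), 'b'

# Python's len() counts code points:
assert len(s) == 3

# Java's String.length() counts UTF-16 code units, which we can
# emulate by encoding and dividing by two:
utf16_units = len(s.encode("utf-16-le")) // 2
assert utf16_units == 4  # U+1D11E becomes a surrogate pair
```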
Even in 2017, not everyone is a Web or Electron developer. I certainly am not.
I don’t advocate using UTF16 for the web, but people still code native desktop apps, mobile apps, embedded software, videogames, store stuff in various databases, etc. For such use, markup is irrelevant.
For some of these things we don’t have much choice, because the encoding is part of some lower-level API (file system, OpenGL, CLI), which usually don’t accept arbitrary encoding. They accept only one, and unless you want to waste time converting, you better use that exact encoding.
Other stuff like IDs, shaders before GL 4.2, and many text protocols aren’t Unicode at all.
For configs I usually use UTF-8 myself, because I don’t like writing parsers for custom formats and just use XML, and any standard-compliant parser supports all of them.
Represent indexes not as number-of-code-points from start but as byte offsets, and index/substring doesn't need to decode code points. You lose the ability to easily say "get me the 100th code point in this string", but I'm hard-pressed to actually think of any case where that is actually valuable.
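In other words, the offsets you get back from searches are already byte offsets, and slicing with them needs no decoding as long as they land on code point boundaries. A sketch in Python, working on the raw UTF-8 bytes:

```python
text = "naïve café".encode("utf-8")

# A search returns a byte offset, not a code point index:
start = text.find("café".encode("utf-8"))
assert start == 7  # 'ï' is 2 bytes, so 'café' starts at byte 7, not 6

# Substring by byte offset: a plain slice, no decoding required.
assert text[start:].decode("utf-8") == "café"
```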
> Right, and for 1 billion Chinese speaking people UTF16 is 2 bytes/character, UTF8 is 3 bytes/character.
The information density of a single hanzi character is roughly equivalent to 5 letters in English. A Chinese plaintext document in UTF-8 is still smaller in memory footprint than an equivalent English document in ASCII. Of course, most documents aren't plaintext, and where people use characters for metadata (e.g., email, HTML), there is enough ASCII metadata in those documents that UTF-8 is still smaller than UTF-16 even for East Asian languages.
Of course, it's moot since the people who don't like UTF-8 in China and Japan aren't using UTF-16 either. They're using GB18030 or ISO-2022-JP for their documents.
> A Chinese plaintext document in UTF-8 is still smaller in memory footprint than an equivalent English document in ASCII.
When you need to process Chinese text you don’t care how much an equivalent English document would take. You only care about the difference between different encodings of Chinese language. And UTF16 is more compact for East Asian languages.
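Both the per-character counts and the effect of ASCII metadata are easy to check in a few lines (a Python sketch; the byte counts follow from the encodings themselves):

```python
zh = "汉字编码"  # four common hanzi, all in the BMP

assert len(zh.encode("utf-8")) == 12     # 3 bytes per character
assert len(zh.encode("utf-16-le")) == 8  # 2 bytes per character

# Mixing in even a little ASCII markup narrows or reverses the gap:
mixed = "<p>汉字</p>"
assert len(mixed.encode("utf-8")) == 13      # 7 ASCII bytes + 2 * 3
assert len(mixed.encode("utf-16-le")) == 18  # 9 characters * 2
```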
> most documents aren't plaintext
That’s true for the web, and that’s why UTF8 is the clear winner there. In a desktop software, in a videogame, in a database — not so much.
In non-Latin text, if most characters are 2 bytes but a large minority are 1 byte, the branch predictor in charge of guessing between the different code point representation lengths expects 2 bytes and fails very often.
Speculative execution (counting in two or three ways simultaneously) might mitigate the performance hit.
> In non-Latin text, if most characters are 2 bytes but a large minority are 1 byte, the branch prediction in charge of guessing between the different codepoint representation lengths expects 2 bytes and fails very often
You wouldn't want to process a single code point (or unit) at a time anyways, but 16, 32 or 64 code units (or bytes) at once.
That UTF-8 strlen I wrote had no mispredicts, because it was vectorized.
Indexing is slow, but the difference to UTF-16 is not significant.
I guess locale based comparisons or case insensitive operations could be slow, but then again, they'll need a slow array lookup anyways.
You don't need to check the representation doing anything specifically with spaces or newlines. All 0x0A bytes are newline characters in UTF8 and all 0x20 bytes are spaces.
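This works because every byte of a multi-byte UTF-8 sequence has the high bit set, so bytes below 0x80 can only ever be the ASCII character itself. Splitting on newlines is therefore safe on the raw bytes, as in this sketch:

```python
data = "第一行\n第二行\n".encode("utf-8")

# No multi-byte sequence can contain 0x0A: lead bytes are >= 0xC2
# and continuation bytes are 0x80-0xBF.
assert all(b >= 0x80 for b in "第一行".encode("utf-8"))

# So a plain byte-level split on 0x0A is correct:
lines = data.split(b"\n")
assert [l.decode("utf-8") for l in lines[:2]] == ["第一行", "第二行"]
```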
The only place you really need to decode UTF8 characters is when you convert it to another format (which you hopefully won't need to do anymore in the far future) or display it (where the decoding is a minuscule factor in performance)
UTF-8 is self-synchronizing, which means you can treat it as a byte string for most operations, including finding substrings. You don't need to convert UTF-8 to a sequence of codepoints for most tasks (particularly if you drop the insistence of using character boundaries). When you do have to do so, you're usually applying a complex Unicode algorithm like case conversion, and so the branch misprediction overhead of creating characters is likely small in comparison to the actual cost of doing the algorithm.
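Concretely: because no encoded UTF-8 sequence can appear inside another code point's encoding, a byte-level search for an encoded needle can only match at real character boundaries. A sketch:

```python
haystack = "żółć and zolc".encode("utf-8")
needle = "ółć".encode("utf-8")

# Plain byte search finds the substring at the right place;
# self-synchronization guarantees no false match starting in the
# middle of another character's multi-byte sequence.
pos = haystack.find(needle)
assert pos == 2            # 'ż' is 2 bytes, so 'ółć' starts at byte 2
assert haystack[pos:pos + len(needle)].decode("utf-8") == "ółć"
```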
http://www.baeldung.com/java-9-compact-string