
Java uses UTF-8 in the latest release for strings without special characters.

http://www.baeldung.com/java-9-compact-string

UTF16 is not really a curse for languages that require it. String operations in non-English languages are very fast because of it, and most software these days has to deal with localization.



UTF-16 is the worst of all worlds: it's less efficient than UTF-8 for most use cases, requires you to think about endianness, and is still a variable-length encoding. (And the cases that require variable-length handling are rarer than they are for UTF-8, meaning you're less likely to hit them in testing.)


No, it uses the fixed-width LATIN1 (ISO-8859-1) encoding for compact strings. It wouldn't make much sense to use another variable-width encoding like UTF-8.
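A quick Python sketch (illustrating the idea, not Java's actual internals) of why ISO-8859-1 works as a fixed-width backing store: a string qualifies exactly when all its code points fit in one byte, whereas UTF-8 is variable-width even for Latin-1 text:

```python
def fits_latin1(s: str) -> bool:
    """A string can use the one-byte backing store iff every code point <= U+00FF."""
    return all(ord(c) <= 0xFF for c in s)

assert fits_latin1("café")        # U+00E9 fits in one byte
assert not fits_latin1("日本語")   # would need the two-byte (UTF-16) backing store

# Fixed width: byte length equals character count.
assert len("café".encode("latin-1")) == len("café")
# UTF-8 is variable width even here: é takes two bytes.
assert len("café".encode("utf-8")) == len("café") + 1
```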


Well, not necessarily a curse, but a suboptimal solution. Are there any situations where UTF16 is a clear upgrade over UTF8?


Any non-Latin string operations.

While UTF-16 is technically variable length, in 99.99% of cases it uses a single 16-bit code unit per character. I.e. on modern hardware with branch prediction and speculative execution, these branches don't affect speed. With UTF-8, the CPU mispredicts branches all the time because spaces, punctuation and newlines are single bytes even in non-Latin text.
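A small Python illustration of the width pattern being described (it shows the encodings' shapes, not the branch-prediction effect itself, which would need a benchmark): in a Cyrillic sentence every character is one UTF-16 code unit, while the UTF-8 byte width flips between 1 and 2 within the same string:

```python
ru = "Привет, мир! Это пример."

# UTF-16: every character here is a single 16-bit unit (fixed stride).
assert len(ru.encode("utf-16-le")) == 2 * len(ru)

# UTF-8: per-character byte length alternates between 1 (spaces,
# punctuation) and 2 (Cyrillic letters) throughout the string.
widths = {len(c.encode("utf-8")) for c in ru}
assert widths == {1, 2}
```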


I think the most common operations are comparisons for equality and copying anyways. UTF-8 is faster for those.

I tried out how fast I could make UTF-8 strlen, with an assumption of a valid UTF-8 string. The routine ran at 18 GB/s on a single core using SSE.
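The trick behind a fast UTF-8 strlen is that continuation bytes all match the bit pattern 0b10xxxxxx, so counting code points is just counting non-continuation bytes. A scalar Python sketch of the same idea (the SSE version would do this comparison 16 bytes at a time):

```python
def utf8_length(data: bytes) -> int:
    """Count code points in valid UTF-8 by skipping continuation bytes.

    Continuation bytes match 0b10xxxxxx, so counting lead bytes gives
    the code point count with no per-character branching.
    """
    return sum((b & 0xC0) != 0x80 for b in data)

assert utf8_length("hello".encode("utf-8")) == 5
assert utf8_length("héllo".encode("utf-8")) == 5
assert utf8_length("日本語".encode("utf-8")) == 3
```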

> With UTF8, CPU mispredicts branches all the time because spaces, punctuations and newlines are single bytes

I don't understand this sentence. Why would there be any more mispredictions because of those being single bytes? These days code is so often bandwidth limited if anything, so smaller data helps.


> most common operations are comparisons for equality and copying anyways

Indexing & substrings are common, too.

> These days code is so often bandwidth limited if anything

Right, and for 1 billion Chinese speaking people UTF16 is 2 bytes/character, UTF8 is 3 bytes/character.


> Indexing & substrings are common, too.

Indexing code points in both UTF-8 and UTF-16 requires reading the whole string up to the index location. Substrings are the same as well.

> Right, and for 1 billion Chinese speaking people UTF16 is 2 bytes/character, UTF8 is 3 bytes/character.

That's true for a text file without markup. But most text is not like that in 2017. HTML is probably the most common text format nowadays.

So let's see how a popular Chinese language website does.

  curl http://language.chinadaily.com.cn/ --silent | wc -c
  52678

  curl http://language.chinadaily.com.cn/ --silent | iconv -f utf8 -t utf-16le | wc -c
  93368
So UTF-8 seems to be quite a bit more efficient in this case: 52678 bytes, versus 93368 bytes for the same page converted to UTF-16.
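The effect is easy to reproduce in miniature with Python on a made-up HTML fragment (the snippet below is hypothetical, not taken from that page): the ASCII markup doubles in size under UTF-16, which outweighs the one-byte-per-hanzi savings:

```python
# A toy HTML fragment: ASCII markup wrapped around Chinese content.
page = '<div class="headline"><a href="/news/2017">中文新闻标题</a></div>'

utf8 = len(page.encode("utf-8"))
utf16 = len(page.encode("utf-16-le"))

# The ASCII markup costs 2 bytes/char in UTF-16, outweighing the
# hanzi savings (2 vs 3 bytes each), so UTF-8 wins overall.
assert utf8 < utf16
```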


> Indexing code points in both UTF-8 and UTF-16 requires reading the whole string up to index location. Substrings are the same as well.

Java's String functions don't index by Unicode code points, though. Java strings are indexed by 16-bit code units, so the API effectively pretends they are UCS-2.
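A quick sketch of the mismatch, using Python only to count UTF-16 code units (the length/charAt behavior described is Java's): a character outside the BMP is one code point but two UTF-16 units, so a code-unit-indexed API sees a different length:

```python
s = "a𝄞b"  # U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP

# In UTF-16 the clef needs a surrogate pair: two 16-bit code units.
units = len(s.encode("utf-16-le")) // 2

assert len(s) == 3   # three code points
assert units == 4    # but four UTF-16 code units -- the count a
                     # code-unit-indexed API like Java's would report
```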


That can sure lead to interesting bugs!


Right. Same in C#, C++ STL, and in Apple’s obj-c/swift.


Swift takes a different approach than Objective-C [0].

[0] https://www.mikeash.com/pyblog/friday-qa-2015-11-06-why-is-s...


Even in 2017, not everyone is a Web or Electron developer. I certainly am not.

I don’t advocate using UTF16 for the web, but people still code native desktop apps, mobile apps, embedded software, videogames, store stuff in various databases, etc. For such use, markup is irrelevant.


Even outside Web, you still have mostly-ASCII:

* filenames

* identifiers

* config files

* text protocols

* host names, email addresses

* embedded scripts (including SQL and OpenGL shaders)

* command line interfaces

* translations for languages using Latin alphabets

I don't think 2/3 size reduction for some languages will offset the cost in all the other places.
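For the kinds of data in that list, the size gap is exact and easy to check in Python (the sample strings below are made up for illustration): pure ASCII is 1 byte/char in UTF-8 and UTF-16 doubles every one of them:

```python
samples = [
    "/usr/local/bin/python3",           # a filename
    "Host: news.ycombinator.com",       # a text protocol header
    "SELECT id FROM users WHERE id=1",  # embedded SQL
]

for s in samples:
    # Pure ASCII: UTF-8 is 1 byte per character...
    assert len(s.encode("utf-8")) == len(s)
    # ...and UTF-16 exactly doubles it.
    assert len(s.encode("utf-16-le")) == 2 * len(s)
```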


For some of these things we don’t have much choice, because the encoding is part of some lower-level API (file system, OpenGL, CLI), which usually accepts only one encoding, not an arbitrary one, and unless you want to waste time converting, you’d better use that exact encoding.

Other stuff like IDs, shaders before GL 4.2, and many text protocols aren’t Unicode at all.

For configs I usually use UTF-8 myself, because I don’t like writing parsers for custom formats and just use XML, and any standard-compliant parser supports all the standard encodings.


If English is your world yeah.

Some of us use other languages and like to use them everywhere we can.


Represent indexes not as number-of-code-points from the start but as byte offsets, and index/substring doesn't need to decode code points. You lose the ability to easily say "get me the 100th code point in this string", but I'm hard-pressed to think of any case where that is actually valuable.
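A minimal Python sketch of byte-offset indexing: once you have a byte offset (here from a byte-level search, but it could come from an earlier scan or a parser match), substring extraction is a plain slice with no decoding walk:

```python
data = "naïve café".encode("utf-8")

# Get a byte offset once (e.g. from a search or a previous scan)...
start = data.find("café".encode("utf-8"))

# ...then substring extraction is an O(1) byte slice, no code point
# decoding required. ("café" is 5 bytes in UTF-8: é takes two.)
assert data[start:start + 5].decode("utf-8") == "café"
```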

> Right, and for 1 billion Chinese speaking people UTF16 is 2 bytes/character, UTF8 is 3 bytes/character.

The information density of a single hanzi character is roughly equivalent to 5 letters in English. A Chinese plaintext document in UTF-8 is still smaller in memory footprint than an equivalent English document in ASCII. Of course, most documents aren't plaintext, and where people use characters for metadata (e.g., email, HTML), there is enough ASCII metadata in those documents that UTF-8 is still smaller than UTF-16, even for East Asian languages.

Of course, it's moot since the people who don't like UTF-8 in China and Japan aren't using UTF-16 either. They're using GB18030 or ISO-2022-JP for their documents.


> A Chinese plaintext document in UTF-8 is still smaller in memory footprint than an equivalent English document in ASCII.

When you need to process Chinese text you don’t care how much an equivalent English document would take. You only care about the difference between different encodings of Chinese language. And UTF16 is more compact for East Asian languages.

> most documents aren't plaintext

That’s true for the web, and that’s why UTF8 is the clear winner there. In a desktop software, in a videogame, in a database — not so much.


In non-Latin text, if most characters are 2 bytes but a large minority are 1 byte, the branch prediction in charge of guessing between the different codepoint representation lengths expects 2 bytes and fails very often. Speculative execution (counting in two or three ways simultaneously) might mitigate the performance hit.


> In non-Latin text, if most characters are 2 bytes but a large minority are 1 byte, the branch prediction in charge of guessing between the different codepoint representation lengths expects 2 bytes and fails very often

You wouldn't want to process a single code point (or unit) at a time anyways, but 16, 32 or 64 code units (or bytes) at once.

That UTF-8 strlen I wrote had no mispredicts, because it was vectorized.

Indexing is slow, but the difference to UTF-16 is not significant.

I guess locale based comparisons or case insensitive operations could be slow, but then again, they'll need a slow array lookup anyways.

Which string operation(s) are you talking about?


You don't need to check the representation when doing anything specific with spaces or newlines. All 0x0A bytes are newline characters in UTF-8 and all 0x20 bytes are spaces.
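This follows from UTF-8's design: continuation bytes always have the high bit set, so no multi-byte sequence can ever contain a byte below 0x80. A short Python sketch showing that splitting raw UTF-8 bytes on 0x0A is therefore safe even for non-Latin text:

```python
text = "первая строка\nвторая строка\n".encode("utf-8")

# In UTF-8, byte 0x0A can only ever be the newline character itself:
# every byte of a multi-byte sequence has the high bit set, so no
# character's encoding contains 0x0A. Byte-level splitting is safe.
lines = text.split(b"\n")

assert lines[0].decode("utf-8") == "первая строка"
assert lines[1].decode("utf-8") == "вторая строка"
```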

The only place you really need to decode UTF-8 characters is when you convert it to another format (which you hopefully won't need to do anymore in the far future) or display it (where the decoding is a minuscule factor in performance).


Not really.

UTF-8 is self-synchronizing, which means you can treat it as a byte string for most operations, including finding substrings. You don't need to convert UTF-8 to a sequence of codepoints for most tasks (particularly if you drop the insistence on using character boundaries). When you do have to do so, you're usually applying a complex Unicode algorithm like case conversion, and so the branch misprediction overhead of creating characters is likely small in comparison to the actual cost of running the algorithm itself.



