Hacker News

> For CJK characters, they unified all semantically similar han-characters, even when they have visual forms that are quite different between Japanese, Chinese and Korean.

This isn't true. 青 and 靑 are the same character written differently; they have their own codepoints. Ditto for a huge number of simplified Chinese characters; 语 is mainland Chinese and 語 is the same character in Japanese.
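
For anyone who wants to check this, here is a minimal Python sketch that prints the code points of the two pairs mentioned above. The helper name `codepoint` is made up for illustration.

```python
def codepoint(ch: str) -> str:
    """Return the U+XXXX notation for a single character."""
    return f"U+{ord(ch):04X}"

# Pairs from the comment above: each member has its own code point.
pairs = [("青", "靑"), ("语", "語")]
for a, b in pairs:
    print(a, codepoint(a), "|", b, codepoint(b))
```

Each pair prints two different values, so plain text alone is enough to tell these forms apart.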



It is true for lots of characters (so I guess I was being a little hyperbolic when I said "all"), and you cannot rely on choosing the right code points to make a text display as Japanese or Chinese. You need to tell your rendering program (often through the choice of font) whether things are to be rendered with Japanese or Chinese forms.

I wouldn't know how to show you examples here, as 直 and 直 will display the same since they have the same code point, even though the character is written with a different number of strokes in Japanese and Chinese.

https://en.m.wikipedia.org/wiki/Han_unification
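
A rough Python sketch of the point above, assuming the output ends up in an HTML page: the character is one code point either way, so the only way to ask for a Japanese or Chinese form is out-of-band metadata such as the HTML `lang` attribute. The helper `with_lang` is hypothetical, and whether the glyph actually changes depends on the browser and the installed fonts.

```python
from html import escape

def with_lang(text: str, lang: str) -> str:
    """Wrap text in a span carrying a BCP 47 language tag."""
    return f'<span lang="{escape(lang, quote=True)}">{escape(text)}</span>'

ch = "直"
print(f"U+{ord(ch):04X}")        # a single code point, whatever the language
print(with_lang(ch, "ja"))       # hint: use Japanese glyph forms
print(with_lang(ch, "zh-Hans"))  # hint: use simplified Chinese glyph forms
```

The two spans contain byte-for-byte identical text; only the language tag differs.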


Aren't they putting the disunified characters into the U+2xxxx plane now?

Han unification is generally seen as a bad choice in retrospect, but it was something Unicode had to do when it looked like 2^16 codepoints were all they were going to get.
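
To put numbers on the 2^16 limit, a small Python sketch (the names `BMP_SIZE` and `utf16_units` are just for illustration): anything in the U+2xxxx range lies outside the original 16-bit space and needs two 16-bit units in UTF-16.

```python
BMP_SIZE = 2 ** 16  # 65536 code points, U+0000..U+FFFF

def utf16_units(ch: str) -> int:
    """Number of 16-bit code units needed to encode ch in UTF-16."""
    return len(ch.encode("utf-16-be")) // 2

bmp_char = "青"             # fits in the original 16-bit range
plane2_char = chr(0x20000)  # first code point in the U+2xxxx range

print(BMP_SIZE)
print(utf16_units(bmp_char), utf16_units(plane2_char))
```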


Never heard of that, but I would appreciate it if all the characters with different glyphs had different codepoints. Do you have a source? Do you know what happens to the "unified" codepoints?


It is true to some extent. While 青 and 靑 have different codepoints, there are plenty of characters that share a codepoint but are rendered differently depending on the language specified:

https://en.wikipedia.org/wiki/Han_unification#Examples_of_la...

Han characters that are traditionally viewed as variants of one another, or that are simplified from more complex logograms (such as 龜, which was simplified into 亀 in Japan and 龟 in mainland China) tend to have different codepoints, but the stylistically different ones usually belong to the same codepoint.


I do know about the issue; it causes problems for me. But I couldn't let the claim that all semantically equivalent characters were unified pass.

> the stylistically different ones usually belong to the same codepoint

Fair enough. Do you happen to know why 青 and 靑 weren't unified?


Han Unification "rules" were an inconsistent mess, but I do know that in Japanese 靑 was at one time a printer's simplification of 青, so you could find either in texts, and the Consortium tended to encode a character separately if you could find printed examples of both in the same language.



