
> There are obviously more characters than the token vocabulary size of typical current models.

I'm pretty sure the token vocabulary also contains all 256 individual byte values to cover such cases.



Not sure what you mean, but the transformer model can only predict one token at a time. The final output layer needs as many nodes (or neurons, if you will) as there are distinct tokens in the vocabulary. So a large token vocabulary is expensive, and that's why GPT-3 and LLaMA have only about 50000 different tokens and use BPE to find a set of useful tokens. They can still express every possible English text because the vocabulary contains the whole Latin alphabet.
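To make that "fallback to single letters" point concrete, here is a toy sketch. The vocabulary is hypothetical and the encoder is greedy longest-match rather than real learned BPE merges, but it shows why a vocabulary that includes every letter can encode any Latin-alphabet text:

    # Toy sketch, not GPT's real tokenizer: greedy longest-match over a
    # hypothetical vocabulary. Single letters act as a guaranteed fallback.
    VOCAB = ["hello", "wor", "ld", "he"] + list("abcdefghijklmnopqrstuvwxyz ")

    def encode(text):
        tokens, i = [], 0
        while i < len(text):
            # pick the longest vocabulary entry that matches at position i
            match = max((t for t in VOCAB if text.startswith(t, i)), key=len)
            tokens.append(match)
            i += len(match)
        return tokens

    print(encode("hello world"))  # ['hello', ' ', 'wor', 'ld']
    print(encode("qzx"))          # ['q', 'z', 'x'] -- pure single-letter fallback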

Unicode 15 has nearly 150000 characters, and CJK languages have even more distinct characters than that, since Han unification folds many variants into a single code point.

A model like GPT-3 can only output a very primitive version of Chinese. My question is how real Chinese models deal with this and specifically how tokenization works in that case.


Yes, but the tokens are translated into bytes, not characters. There are only 256 distinct byte values, so GPT models can easily be trained to produce any character. The remaining question is how sensible or learnable the byte-level encoding of Chinese characters is, but that is a problem for the model, not the tokenizer.
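A quick standard-library illustration of that byte view: a Chinese character is one code point but several UTF-8 bytes, so a 256-entry byte vocabulary already covers it, and BPE merges only make common sequences cheaper:

    # One Unicode character can be several UTF-8 bytes; 256 byte tokens are
    # enough to represent any of them, and BPE can later merge frequent
    # byte sequences into longer tokens.
    for ch in "汉a":
        raw = ch.encode("utf-8")
        print(ch, [hex(b) for b in raw], f"{len(raw)} byte(s)")
    # 汉 ['0xe6', '0xb1', '0x89'] 3 byte(s)
    # a ['0x61'] 1 byte(s)

    # A partial byte sequence is not valid UTF-8 on its own, which is why
    # streaming decoders have to buffer incomplete characters.
    print("汉".encode("utf-8")[:2].decode("utf-8", errors="replace"))  # prints U+FFFD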


Ok, I understand. That helped, thanks.

And also suddenly the B in BPE makes a lot of sense.


The way I think about it: a token can be one to many bytes long, so it can be longer or shorter than a single character.
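You can see both directions with the third-party tiktoken package, if it is installed (an assumption; it ships OpenAI's BPE vocabularies under names like "gpt2"). A common English word tends to come out as one multi-byte token, while a CJK character is typically split into several shorter byte-level tokens:

    # Assumes the tiktoken package is available; "gpt2" is its name for the
    # GPT-2/GPT-3 style ~50k-token byte-level BPE vocabulary.
    import tiktoken

    enc = tiktoken.get_encoding("gpt2")
    for text in [" hello", "汉"]:
        ids = enc.encode(text)
        pieces = [enc.decode_single_token_bytes(i) for i in ids]
        print(repr(text), [(p, len(p)) for p in pieces])
    # " hello" is usually a single token covering 6 bytes, while "汉"
    # usually splits into several tokens of 1-2 bytes each.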




