Across a broad enough dataset, (char count / 4) is very close to the actual token count in English -- we verified this across millions of queries. We had to switch to an actual tokenizer for Chinese and other non-Latin-script languages, as that simple formula misses the mark for context stuffing.
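
A rough sketch of that split, in Python. The names `estimate_tokens`, `count_with_tokenizer`, and the 0.9 ASCII threshold are made up for illustration, not what we actually ran; the tokenizer is passed in as a plain callable so any real one can be plugged in.

```python
from typing import Callable, Optional


def estimate_tokens(
    text: str,
    count_with_tokenizer: Optional[Callable[[str], int]] = None,
    ascii_threshold: float = 0.9,  # assumed cutoff, tune on your own data
) -> int:
    """Estimate the token count of `text`.

    Mostly-ASCII text uses the char-count / 4 heuristic. Anything else is
    handed to a real tokenizer, because the heuristic drifts badly there.
    """
    if not text:
        return 0

    ascii_chars = sum(1 for ch in text if ord(ch) < 128)
    if ascii_chars / len(text) >= ascii_threshold:
        # Round up so short strings never estimate to zero tokens.
        return (len(text) + 3) // 4

    if count_with_tokenizer is None:
        raise ValueError("non-ASCII text needs a real tokenizer")
    return count_with_tokenizer(text)
```

For English-like input this never touches the tokenizer, which is the whole point of the shortcut; the per-character scan is cheap compared to tokenizing.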

The more complicated part is what is effectively a bin-packing problem, which emerges depending on how many different contextual sources you have.
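
To make the packing problem concrete, here is a minimal greedy sketch in Python. It is one simple heuristic (highest priority first, skip whatever no longer fits), not necessarily optimal and not a description of any particular production system. `Source`, `priority`, and `pack_context` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Source:
    name: str        # label for the contextual source (hypothetical)
    tokens: int      # token count, measured or estimated
    priority: float  # higher means more important to include


def pack_context(sources: List[Source], budget: int) -> List[Source]:
    """Greedily pick sources that fit within a token budget.

    Walks the sources from highest to lowest priority and keeps each one
    that still fits in the remaining budget. Sources that do not fit are
    skipped whole rather than truncated.
    """
    chosen: List[Source] = []
    remaining = budget
    for src in sorted(sources, key=lambda s: s.priority, reverse=True):
        if src.tokens <= remaining:
            chosen.append(src)
            remaining -= src.tokens
    return chosen


if __name__ == "__main__":
    candidates = [
        Source("docs", 600, 3.0),
        Source("history", 500, 2.0),
        Source("profile", 300, 1.0),
    ]
    picked = pack_context(candidates, budget=1000)
    print([s.name for s in picked])  # ['docs', 'profile']
```

Greedy selection can leave budget unused that a different combination would fill, which is exactly why this gets complicated as the number of sources grows.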