You're assuming all you can do is prompt it. Surely you could also constrain its output to tokens that genuinely contain no e's (or allow at most 4 letters per word). LLMs actually output a probability distribution over next tokens; ChatGPT just samples from it (usually something near the top), but you could filter that list by any constraint you want before sampling.
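To make that concrete, here's a rough sketch of what such a filter could look like using Hugging Face transformers' LogitsProcessor hook. The model name and prompt are just placeholders; any causal LM works the same way.

    import torch
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              LogitsProcessor, LogitsProcessorList)

    class BanLetterProcessor(LogitsProcessor):
        """Push the logit of every token whose surface text contains a banned
        character to -inf, so it can never be sampled."""
        def __init__(self, tokenizer, banned="eE"):
            vocab = tokenizer.get_vocab()
            self.banned_ids = [
                tid for tok, tid in vocab.items()
                if tid not in tokenizer.all_special_ids  # keep EOS etc. usable
                and any(c in tokenizer.convert_tokens_to_string([tok]) for c in banned)
            ]

        def __call__(self, input_ids, scores):
            scores[:, self.banned_ids] = float("-inf")
            return scores

    tok = AutoTokenizer.from_pretrained("gpt2")              # example model
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tok("Writing without that fifth symbol is", return_tensors="pt")
    out = model.generate(
        **inputs,
        max_new_tokens=40,
        do_sample=True,
        logits_processor=LogitsProcessorList([BanLetterProcessor(tok)]),
    )
    print(tok.decode(out[0], skip_special_tokens=True))

The catch is exactly what the replies below point out: the masking is greedy, so the model can still paint itself into a corner mid-word.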
But the problem is that the tokens are subwords, which means that if you simply disallowed tokens containing an 'e', you'd make it hard to complete a word given a prefix.
For example, the generation may reach a point like "This is a way to solv-" or "This is th-".
If I understand it correctly, that's a valid concern, but structured generation libraries like outlines[1] can explore multiple candidate continuations in parallel (beam search).
One beam could be "This is a way to solv-", with no obvious "good" next token.
Another beam could be "This way is solv-", with "ing" as the obvious next token.
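Very roughly, the idea is: extend each partial output only with continuations that still satisfy the constraint, score them, and keep the best few, so a beam stuck on a dead end like "solv-" simply drops out. A toy sketch of that (not outlines' actual API; step, allowed and constrained_beam_search are made-up names):

    import heapq

    def constrained_beam_search(step, allowed, prompt, beam_width=3, max_steps=20):
        """step(text) -> list of (token, logprob) candidates from the model;
        allowed(text) -> True if the text still satisfies the constraint."""
        beams = [(0.0, prompt)]
        for _ in range(max_steps):
            candidates = []
            for score, text in beams:
                for token, logprob in step(text):
                    new_text = text + token
                    if allowed(new_text):               # prune rule-breaking continuations
                        candidates.append((score + logprob, new_text))
            if not candidates:                          # every beam is stuck
                break
            beams = heapq.nlargest(beam_width, candidates)
        return max(beams)[1]

    # allowed() encodes the rule, e.g. "no letter e anywhere":
    no_e = lambda text: "e" not in text
    # step() would wrap a real model call returning its top next-token candidates.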
Yes, that would probably work quite well, given enough training data. However, I interpreted the question/claim as being about a task that LLMs excel at, i.e. whether a general-purpose LLM can write text while avoiding a certain character on its own.
I tried something like that some time ago. The problem with that strategy is the lack of backtracking.
Let's say I prompt my LLM to exclusively use the letters 'aefghilmnoprst' and the LLM generates "that's one small step for a man, one giant leap for man-"[1]. Since the next token with the highest probability ("-kind") isn't allowed, it may very well be that the next appropriate word is something really generic or, if your grammar is really restrictive, straight up nonsense because nothing fits. And then there's pathological stuff like "... one giant leap for man, one small step for a man, one giant leap for man- ...".
[1] Toy example - I'm sure these specific rules are not super restrictive and "management" is right there.
What I will add is that constrained generation is supported by the major inference engines (llama.cpp, vllm and the like), so what you are describing is actually trivial with locally hosted models: you just have to provide a regex or grammar that prevents the letter 'e' from appearing in the output.
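With llama.cpp, for instance, that's a one-line GBNF grammar passed on the command line. This is just a sketch: the model path is a placeholder and the exact binary/flag names may differ between versions.

    # grammar.gbnf: only characters other than 'e'/'E' are allowed
    root ::= [^eE]*

    ./llama-cli -m model.gguf --grammar-file grammar.gbnf \
        -p "Writing without that symbol is a fun trick"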
That is not a counterpoint! The output is a probability distribution over tokens, so you can assign zero probability to any e-containing token and scale everything else up accordingly.
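In toy numbers (a made-up 4-token vocabulary, just to show the rescaling):

    import numpy as np

    tokens = ["the", "a", "of", "and"]              # made-up mini-vocabulary
    probs = np.array([0.5, 0.2, 0.2, 0.1])          # model's next-token distribution

    keep = np.array(["e" not in t for t in tokens]) # drop every e-containing token
    filtered = probs * keep
    filtered /= filtered.sum()                      # rescale so it sums to 1 again

    print(filtered)                                 # [0.  0.4 0.4 0.2]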