Hacker News

Here's one thing I don't get.

Why all the rigamarole of hoping you get a valid response, adding last-mile validators to detect invalid responses, trying to beg the model to pretty please give me the syntax I'm asking for...

...when you can guarantee valid JSON syntax by only sampling tokens that are valid? Instead of greedily picking the highest-scoring token every time, you select the highest-scoring token that conforms to the requested format.

This is what Guidance does already, also from Microsoft: https://github.com/microsoft/guidance

But OpenAI apparently does not expose the full scores of all tokens; it only exposes the highest-scoring token. Which is so odd, because if you run models locally, using Guidance is trivial, and you can guarantee your JSON is correct every time. It's faster to generate, too!
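The idea sketched here in Python, assuming you can see the scores of all candidate tokens at each step. The validity check is deliberately naive (brace balance only) and the toy vocabulary is my own illustration; a real implementation like Guidance tracks full parser state for the target grammar:

```python
def is_valid_prefix(text: str) -> bool:
    """Could `text` still grow into balanced JSON? (Checks braces only.)"""
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:  # closed more objects than were opened
                return False
    return True

def constrained_pick(prefix: str, token_scores: dict[str, float]) -> str:
    """Pick the highest-scoring token that keeps the output valid so far."""
    valid = {t: s for t, s in token_scores.items()
             if is_valid_prefix(prefix + t)}
    if not valid:
        raise ValueError("no candidate token keeps the output valid")
    return max(valid, key=valid.get)

# The model scores a stray "}" highest, but that would unbalance the
# object, so the newline wins instead.
print(constrained_pick('{"a": 1}', {"}": 0.9, "\n": 0.5}))
```

The point is that the model never gets a chance to emit a syntax-breaking token in the first place, so there is nothing to validate after the fact.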



It’s like the story of the brown M&Ms[0]. If the model is returning semantically correct data, you would hope that it can at least get the syntax correct. And if it can’t, then you ought to throw the response away anyway.

Also I believe that such a method cannot capture the full complexity of TypeScript types.

[0] https://www.snopes.com/fact-check/brown-out/


That's a great analogy! I'd been wondering for a while whether that's a problem with this approach; to be honest I still don't know whether it is, so it would be good to see someone test it empirically.


> when you can guarantee valid JSON syntax by only sampling tokens that are valid? Instead of greedily picking the highest-scoring token every time, you select the highest-scoring token that conforms to the requested format.

Yes, you can guarantee syntactically correct JSON that way, but will it be semantically correct? If the model really really really wanted to put another token there, but you are forcing it to put a {, maybe the following generated text won't be as good.

I'm not sure, I'm just wondering out loud.


Well, if the output doesn't conform to the format it's useless. If the model can't produce good and correct output then it's simply not up to the task.


In my experience, a fair share of LLM responses are semantically useful but do not precisely adhere to the requested format. If I chose a strongly typed language for LLM parsing, perhaps I would be tempted to eliminate complexity, simply throw structural outliers away, and explain to the suits that a certain percentage of our queries/expenses are unusable. Instead, more sophisticated coercion techniques can be applied to increase output utilization.
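A minimal sketch of such coercion in Python. The fence-stripping regex and the fallback order are my own assumptions, not a standard recipe; real pipelines layer in more repairs (trailing commas, single quotes, etc.):

```python
import json
import re

def coerce_json(raw: str):
    """Best-effort recovery of a JSON object from an LLM reply.
    Returns the parsed value, or None if nothing salvageable is found."""
    # 1. Optimistic direct parse.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # 2. Strip the ```json ... ``` Markdown fences models often add.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    if fenced:
        try:
            return json.loads(fenced.group(1))
        except json.JSONDecodeError:
            pass
    # 3. Grab the outermost {...} span and try that.
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        try:
            return json.loads(raw[start:end + 1])
        except json.JSONDecodeError:
            pass
    return None  # caller decides whether to retry or discard
```

Each rung of the ladder recovers responses the previous one would have discarded, so the unusable percentage you report upstream shrinks without touching the model.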


That really strongly depends on your task. Lots of tasks can accept a non-zero failure rate in return for better results on the successful cases. I'm not sure I can think of any off the top of my head where you'd use an LLM and can never deal with a failure, particularly if you're using an external service where you're guaranteed to have to deal with errors or downtime at some point.


I agree that sampling only valid tokens is a very promising approach.

I experimented a bit with finetuning open source LLMs for JSON parsing (without guided token sampling). Depending on one's use case, 70B parameters might be overkill. I've seen promising results with much, much smaller models. Finetuning a small model combined with guided token sampling would be interesting.

Then again, finetuning is perhaps not perfect for very general applications. When you get input that you didn't anticipate in your training dataset, you're in trouble.


The LLM will be able to handle more complex scenarios. I could imagine a use case: if you are ordering from a vending machine, instead of having to go through the whole process, you just say your order out loud. You can say, for example, "a couple of chocolate bars", and the LLM tries to match that against the inventory.

Of course, if you are on the web, it makes no sense. It is much easier to use the mouse to click on a couple of items.


Llama.cpp recently added grammar-based sampling, which constrains token selection to follow a rigid format like you describe.

https://github.com/ggerganov/llama.cpp/pull/1773
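The grammars use a BNF-like notation (GBNF). A fragment along these lines constrains output to a single-key JSON object; this is an illustrative sketch written from memory, not copied from the repo, so check the grammar files shipped with llama.cpp for the exact syntax:

```
root   ::= "{" ws "\"name\"" ws ":" ws string ws "}"
string ::= "\"" [a-zA-Z0-9 ]* "\""
ws     ::= [ \t\n]*
```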


OpenAI doesn’t expose this information because full token scores would make it vastly easier to train your own model off theirs.



