Hacker News

Here's one thing I don't get.

Why all the rigamarole of hoping you get a valid response, adding last-mile validators to detect invalid responses, trying to beg the model to pretty please give me the syntax I'm asking for...

...when you can guarantee valid JSON syntax by only sampling tokens that are valid? Instead of greedily picking the highest-scoring token every time, you select the highest-scoring token that conforms to the requested format.

This is what Guidance does already, also from Microsoft: https://github.com/microsoft/guidance

But OpenAI apparently does not expose the full scores of all tokens; it only exposes the highest-scoring token. Which is so odd, because if you run models locally, using Guidance is trivial, and you can guarantee your JSON is correct every time. It's faster to generate, too!
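The idea sketched here in Python, assuming you can see the scores of all candidate tokens at each step. The validity check is deliberately naive (brace balance only) and the toy vocabulary is my own illustration; a real implementation like Guidance tracks full parser state for the target grammar:

```python
def is_valid_prefix(text: str) -> bool:
    """Could `text` still grow into balanced JSON? (Checks braces only.)"""
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:  # closed more objects than were opened
                return False
    return True

def constrained_pick(prefix: str, token_scores: dict[str, float]) -> str:
    """Pick the highest-scoring token that keeps the output valid so far."""
    valid = {t: s for t, s in token_scores.items()
             if is_valid_prefix(prefix + t)}
    if not valid:
        raise ValueError("no candidate token keeps the output valid")
    return max(valid, key=valid.get)

# The model scores a stray "}" highest, but that would unbalance the
# object, so the newline wins instead.
print(constrained_pick('{"a": 1}', {"}": 0.9, "\n": 0.5}))
```

The point is that the model never gets a chance to emit a syntax-breaking token in the first place, so there is nothing to validate after the fact.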



It’s like the story of the brown M&Ms[0]. If the model is returning semantically correct data, you would hope that it can at least get the syntax correct. And if it can’t, then you ought to throw the response away anyway.

Also I believe that such a method cannot capture the full complexity of TypeScript types.

[0] https://www.snopes.com/fact-check/brown-out/


That's a great analogy! I'd been wondering for a while whether that's a problem with this approach; to be honest I still don't know whether it is, so it would be good to see someone test it empirically.


> when you can guarantee valid JSON syntax by only sampling tokens that are valid? Instead of greedily picking the highest-scoring token every time, you select the highest-scoring token that conforms to the requested format.

Yes, you can guarantee syntactically correct JSON that way, but will it be semantically correct? If the model really really really wanted to put another token there, but you are forcing it to put a {, maybe the following generated text won't be as good.

I'm not sure, I'm just wondering out loud.


Well, if the output doesn't conform to the format it's useless. If the model can't produce good and correct output then it's simply not up to the task.


In my experience, a fair share of LLM responses are semantically useful but do not precisely adhere to the requested format. If I chose a strongly typed language for LLM parsing, perhaps I would be tempted to eliminate complexity, simply throw structural outliers away, and explain to the suits that a certain percentage of our queries/expenses are unusable. Instead, more sophisticated coercion techniques can be applied to increase output utilization.
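A minimal sketch of such coercion in Python. The fence-stripping regex and the fallback order are my own assumptions, not a standard recipe; real pipelines layer in more repairs (trailing commas, single quotes, etc.):

```python
import json
import re

def coerce_json(raw: str):
    """Best-effort recovery of a JSON object from an LLM reply.
    Returns the parsed value, or None if nothing salvageable is found."""
    # 1. Optimistic direct parse.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # 2. Strip the ```json ... ``` Markdown fences models often add.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    if fenced:
        try:
            return json.loads(fenced.group(1))
        except json.JSONDecodeError:
            pass
    # 3. Grab the outermost {...} span and try that.
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        try:
            return json.loads(raw[start:end + 1])
        except json.JSONDecodeError:
            pass
    return None  # caller decides whether to retry or discard
```

Each rung of the ladder recovers responses the previous one would have discarded, so the unusable percentage you report upstream shrinks without touching the model.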


That really strongly depends on your task. Lots of tasks can accept a non-zero failure rate in return for better results on the successful cases. I'm not sure I can think of any off the top of my head where you'd use an LLM and can never deal with a failure, particularly if you're using an external service where you're guaranteed to have to deal with errors or downtime at some point.


I agree that sampling only valid tokens is a very promising approach.

I experimented a bit with finetuning open source LLMs for JSON parsing (without guided token sampling). Depending on one's use case, 70B parameters might be overkill. I've seen promising results with much, much smaller models. Finetuning a small model combined with guided token sampling would be interesting.

Then again, finetuning is perhaps not perfect for very general applications. When you get input that you didn't anticipate in your training dataset, you're in trouble.


The LLM will be able to handle more complex scenarios. I could imagine a use case: if you are ordering from a vending machine, instead of having to go through the whole process, you just say your order out loud. You can say, for example, "a couple of chocolate bars", and the LLM tries to match that against the inventory.

Of course, if you are on the web, it makes no sense. It is much easier to use the mouse to click on a couple of items.


Llama.cpp recently added grammar-based sampling, which constrains token selection to follow a rigid format like you describe.

https://github.com/ggerganov/llama.cpp/pull/1773
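The grammars use a BNF-like notation (GBNF). A fragment along these lines constrains output to a single-key JSON object; this is an illustrative sketch written from memory, not copied from the repo, so check the grammar files shipped with llama.cpp for the exact syntax:

```
root   ::= "{" ws "\"name\"" ws ":" ws string ws "}"
string ::= "\"" [a-zA-Z0-9 ]* "\""
ws     ::= [ \t\n]*
```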


OpenAI doesn’t expose this information because full token scores would make it vastly easier to train your own model off theirs.



