
There are constructed languages that preserve the expressivity of natural human languages but without the implicit ambiguity, though; most notably, Loglan and its successor Lojban. If you read Golden Age sci-fi, Loglan sometimes shows up there specifically in this role - e.g. "The Moon Is a Harsh Mistress":

> By then Mike had voder-vocoder circuits supplementing his read-outs, print-outs, and decision-action boxes, and could understand not only classic programming but also Loglan and English, and could accept other languages and was doing technical translating—and reading endlessly. But in giving him instructions was safer to use Loglan. If you spoke English, results might be whimsical; multi-valued nature of English gave option circuits too much leeway.

For those unfamiliar with it, it's not that Lojban is perfectly unambiguous. It's that its design strives to ensure that ambiguity is always deliberate by making it explicit.

The obvious problem with all this is that Lojban is a very niche language with a fairly small corpus, so training AI on it is a challenge (although it's interesting to note that existing SOTA models can read and write it even so, better than many obscure human languages). However, Lojban has the nice property of being fully machine parseable - it has a PEG grammar. And, once you parse it, you can use dictionaries to construct a semantic tree of any Lojban snippet.
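
To give a flavor of what "fully machine parseable" means here, a toy sketch in Python using the parsimonious PEG library. This is not the real Lojban grammar (that lives in parsers like camxes); it is just a fragment that accepts a bare "sumti selbri sumti" bridi such as "mi prami do" ("I love you"):

    # Toy sketch, not the real Lojban PEG: a tiny grammar fragment that only
    # parses sentences of the shape "sumti selbri sumti".
    from parsimonious.grammar import Grammar

    toy_grammar = Grammar(r"""
        bridi  = sumti ws selbri ws sumti
        sumti  = "mi" / "do"
        selbri = ~"[a-z]{5}"
        ws     = " "
    """)

    tree = toy_grammar.parse("mi prami do")
    print(tree)  # parse tree; a dictionary pass over its leaves gives the semantic tree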

When it comes to LLMs, this property can be used in two ways. First, you can use structured output driven by the grammar to constrain the model to output only syntactically valid Lojban at any point. Second, you can parse the fully constructed text once it has been generated, add semantic annotations, and feed the tree back into the model to have it double-check that what it ended up writing means exactly what it wanted to mean.
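
A rough sketch of the second idea - parse, gloss, feed back - with the same toy grammar and a three-entry gloss table standing in for the full Lojban PEG grammar and a real gismu/cmavo dictionary:

    # Sketch of the parse-then-verify step; toy_grammar and GLOSS are stand-ins
    # for the full Lojban grammar and dictionary.
    from parsimonious.grammar import Grammar

    toy_grammar = Grammar(r"""
        bridi  = sumti ws selbri ws sumti
        sumti  = "mi" / "do"
        selbri = ~"[a-z]{5}"
        ws     = " "
    """)

    GLOSS = {"mi": "I/we", "do": "you", "prami": "x1 loves x2"}

    def annotate(text):
        # Flatten the parse tree into word/gloss pairs for the model to review.
        tree = toy_grammar.parse(text)
        words = [node.text for node in tree.children if node.text.strip()]
        return "\n".join(f"{w} = {GLOSS.get(w, '?')}" for w in words)

    generated = "mi prami do"
    check_prompt = (
        "You produced this Lojban: " + generated + "\n"
        "Here is its parse with dictionary glosses:\n" + annotate(generated) + "\n"
        "Does it mean what you intended? If not, revise it."
    )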

With SOTA models, in fact, you don't even need the structured output - you can just give them a parser as a tool and have them iterate. I did that with Claude and had it produce Lojban translations that, while not perfect, were very good. So I think that it might be possible, in principle, to generate Lojban training data out of other languages, and I can't help but wonder what would happen if you trained a model primarily on that; I suspect it would reduce hallucinations and generally improve metrics, but this is just a gut feel. Unfortunately this is a hypothesis that requires a lot of $$$ to properly test...
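
For the curious, the tool loop is nothing fancy. A minimal sketch with the Anthropic SDK; run_lojban_parser is a stub for whatever parser you actually invoke (e.g. camxes), and the model id is only an example:

    # Minimal sketch of the parser-as-a-tool loop with the Anthropic SDK.
    import anthropic

    client = anthropic.Anthropic()

    PARSER_TOOL = {
        "name": "parse_lojban",
        "description": "Parse Lojban text; returns the parse tree or an error.",
        "input_schema": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    }

    def run_lojban_parser(text):
        # Stub: call a real Lojban PEG parser here and return its output.
        return "parse of: " + text

    messages = [{"role": "user", "content":
                 "Translate 'I love you' into Lojban. Check your output with "
                 "the parse_lojban tool before giving a final answer."}]

    while True:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",  # example model id
            max_tokens=1024,
            tools=[PARSER_TOOL],
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            break  # model is satisfied with its translation
        messages.append({"role": "assistant", "content": response.content})
        results = [
            {"type": "tool_result", "tool_use_id": b.id,
             "content": run_lojban_parser(b.input["text"])}
            for b in response.content if b.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})

    print("".join(b.text for b in response.content if b.type == "text"))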


