
Finally, some serious writing about LLMs that doesn’t follow the hype and faces the reality of what can and can’t be useful with these tools.

Really interesting read, although I can’t stand the word “agent” for a for-loop that recursively calls an LLM. But this industry is not famous for being sharp at naming things, so here we are.
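
Roughly, the loop in question is just this (a sketch; call_llm is a hypothetical stand-in for whatever chat-completion API you use, not a real library call):

    MAX_STEPS = 10
    history = [{"role": "user", "content": "do the task"}]
    for _ in range(MAX_STEPS):                 # the much-maligned for-loop
        reply = call_llm(history)              # hypothetical model call
        history.append({"role": "assistant", "content": reply})
        if "DONE" in reply:                    # model signals it is finished
            break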

edit: grammar



It seems like an excellent name, given that people understand it so readily, but what else would you suggest? LoopGPT?


RePT


I’m no better at naming things! Shall we propose LLM feedback loop systems? It’s more grounded in reality. Agent is like Retina Display to my ears, at least at this stage!


Agent is clear in that it acts on behalf of the user.

"LLM feedback loop systems" could be to do with training, customer service, etc.

> Agent is like Retina Display to my ears, at least at this stage!

Retina is a great name. People know what it means: high-quality screens.


>Agent is clear in that it acts on behalf of the user.

Yes, but you could say that AI-orchestrated workflows are also acting on behalf of the user, and the "Agentic AI" people seem to be going to great lengths to distinguish AI Agents from AI Workflows. Really, the only things that distinguish an AI Agent are running the LLM in a loop and the LLM creating structured output.


> Really, the only things that distinguish an AI Agent are running the LLM in a loop and the LLM creating structured output.

Well, that UI is what makes agent such an apt name.


Retina Display means nothing. Just because Apple pushed hard to make it common to everyone doesn’t mean it’s a good technical name.


You’re right that it’s branding, but it also has meaning: a display resolution that (approximately) matches the resolution of the human retina, under typical viewing conditions. The fact that the term is easily understood by the lay public is what makes it a good name and smart branding. BTW the term ‘retinal display’ existed long before Apple used it, and refers to a display that projects directly onto the retina.


A screen that directly projects onto the retina sounds like a great reason to call it a retinal display. So then Apple hijacking the term to mean high DPI... how does that fit in?

There aren't that many results about this from before Apple's 2010 announcement, and many of them are reporting on science rather than general-public media: https://www.google.com/search?q=retinal+display&sca_esv=3689... Clearly not something anyone really used for an actual (not research-grade) display, especially not in the sense of high DPI.

This isn't an especially easily understood term: that it means "good" would have been obvious no matter what this premium brand came up with. The fact that it's from Apple makes you assume it's good. (And the screens are good.)


> So then Apple hijacking the term to mean high DPI... how does that fit in?

It fits in quite easily and obviously: Just typical Apple being Apple, marketing hyperbole.


The trademark ‘retina display’ was defined to mean the display resolution approximately matches the human retina, which is why ‘retina display’ seems obvious and easy to understand. That it’s good is implied, but “good” is not the definition of the term. I know a lot of non-technical people who understand it without any trouble. Come to think of it, I’ve never met anyone who doesn’t understand it or has trouble with it. Are you saying you had a hard time understanding what it means?

The branding term is slightly different from ‘retinal display’. The term in use may have been ‘virtual retinal display’. Dropping the ‘l’ off ‘retinal’ and changing it from an adjective to a noun perhaps helped their trademark application, but since the term wasn’t in widespread use and isn’t exactly the same, that starts to contradict the idea that they were ‘hijacking’ it.

The fact that any company advertised it implies that it’s supposed to be good. It doesn’t matter that it was Apple, nor that it’s a premium brand; when a company advertises something, it is never suggesting anything other than that it’s a good thing.


> The trademark ‘retina display’ was defined to mean the display resolution approximately matches the human retina, which is why ‘retina display’ seems obvious and easy to understand.

Wait, because it's a trademark, it must be easy and obvious to understand? And do you really think people don't just assume it means something positive, but can identify that it must specifically refer to display resolution without any prior exposure to Apple marketing material or to people talking about that marketing material?

> I’ve never met anyone who doesn’t understand it or has trouble with it. Are you saying you had a hard time understanding what it means?

This thread is the first time I've heard of this specific definition, as far as I remember, but tech media explained the marketing material as meaning "high resolution", so it's not like my mental dictionary didn't have an entry for "retina display -> see high resolution". Does that mean I had trouble understanding the definition? I guess it depends on whether you're asking about the alleged underlying reason for the name or about the general meaning of the term.


> Wait, because it's a trademark, it must be easy and obvious to understand?

That’s not what I said, where did you read that? The sentence you quoted doesn’t say that. I did suggest that the fact that it’s easy to understand makes it a good name, and I think that’s also what makes it a good trademark. The causal direction is opposite of what you’re assuming.

> retina display -> see high resolution

The phrase ‘high resolution’ or ‘high DPI’ is relative, vague and non-specific. High compared to what? The phrase ‘Retina Display’ is making a specific statement about a resolution high enough to match the human retina.

You said the phrase wasn’t easily understood. I’m curious why not, since the non-technical lay public seems to have easily understood the term for 15 years, and nobody’s been complaining about it, by and large.

I suspect you might be arguing a straw man about whether the term is understood outside of Apple’s definition, and whether people will assume what it means without being told or having any context. It might be true that not everyone would make the same assumption about the phrase if they heard it without any context or knowledge, but that wasn’t the point of this discussion, nor a claim that anyone here challenged.


You can argue that Apple haven't achieved it, but it has a very clear technical meaning: a sufficiently high DPI that pixels become imperceptible to the average healthy human eye from a typical viewing distance.


> [retina] it has a very clear technical meaning

Retina does not mean that, not even slightly or in connotation.

Even today, no other meanings are listed: https://www.merriam-webster.com/dictionary/retina

It comes from something that means "net-like tunic" (if you want to stretch possible things someone might understand from it): https://en.m.wiktionary.org/wiki/retina

They could have named it rods and cones, cells, eye, eyecandy, iris, ultra max, infinite, or just about anything else that isn't negative, and you could still make this comment of "clearly this adjective before »screen« means it's high definition". Anything else is believing Apple marketing "on their blue eyes", as we say in Dutch.

> imperceptible to the average healthy human eye from a typical viewing distance

That's most non-CRT (aquarium) displays. What's different about high DPI (and why we need display scaling now) is that the pixels are imperceptible even if you put your nose onto the screen: there are so many of them that you can't see any at any distance, unless you have better-than-100% vision or put a water droplet or other magnifier on the screen.


The term is ‘retina display’, not ‘retina’.

> That’s most non-CRT (aquarium) displays. What’s different about high DPI (and why we need display scaling now) is that the pixels are imperceptible even if you put your nose onto the screen

Neither of those claims is true.

Retina Display was 2x-3x higher PPI (and 4x-9x higher pixel area density) than the vast majority of displays at the time it was introduced, in 2010. The fact that many displays today are as high-DPI as Apple’s Retina Display means the competition caught up, and that high DPI had a market and was temporarily a competitive advantage.

The rationale for Retina Display was, in fact, the DPI needed for pixels to be imperceptible at the typical viewing distance, not when touching your nose. It has been argued that 300 DPI was not high enough for pixels to be imperceptible at a distance of 12 inches. That has been debated, and some people say it’s enough. But it was never argued that pixels should or would be imperceptible at a distance of less than 12 inches. And people with perfect vision can see the pixels of a current Retina Display iPhone if it’s held up to their nose.

https://en.wikipedia.org/wiki/Retina_display#Rationale_and_d...


> Retina Display means nothing.

It means a high-quality screen and is named after the innermost part of the eye, which evokes focused perception.

> Just because Apple pushed hard to make it common to everyone doesn’t mean it’s a good technical name.

It's an excellent technical name, just like AI agent. People understand what it means with minimal education and their hunch about that meaning is usually right.


A downward spiral


Call it Reznor to imply it’s a downward spiral?


A state machine, or more specifically a Moore Machine.


I agree with not liking the author’s definition of an Agent being … “a for loop which contains an LLM call”.

Instead, it is an LLM calling tools/resources in a loop. The difference is subtle, and a question of what is in charge.


Although implementation-wise it's not wrong to say it's just an LLM call in a loop. If the LLM responds with a tool call, you (the implementor) need to program the call to happen, then loop back and let the LLM continue.

The model/weights themselves do not execute tool calls; the tooling around them has to run each call and keep the loop going.
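
Concretely, that tooling is dispatch code along these lines (a sketch; call_llm and the tool functions are hypothetical stand-ins, not a real API):

    TOOLS = {"read_file": read_file, "run_tests": run_tests}  # hypothetical tool fns

    def agent(task):
        messages = [{"role": "user", "content": task}]
        while True:
            reply = call_llm(messages)         # hypothetical model call
            messages.append(reply)
            call = reply.get("tool_call")
            if call is None:                   # no tool requested: final answer
                return reply["content"]
            # the harness, not the model, actually executes the tool...
            output = TOOLS[call["name"]](**call["arguments"])
            # ...and loops the result back into the context
            messages.append({"role": "tool", "content": str(output)})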


I liked the phrase “tools in a loop” for agents. I think Simon said that.


He was quoting someone else. Please take care not to attribute falsely, as it creates a falsehood likely to spread and become the new (un)truth.


You are right. During a “Prompting for Agents” workshop at an Anthropic developer conference, Hannah Moran described agents as “models using tools in a loop.”


I saw a LinkedIn post (I know, I know) talking about how soon agents will replace apps...

Because of course, LLM calls in a for loop somehow aren't applications anymore.


I actually take some minor issue with OP's definition of an agent. IMO an agent isn't just an LLM in a loop.

IMO the defining feature of an agent is that the LLM's behavior is being constrained or steered by some other logical component. Some of these components are deterministic, while others are also ML-powered (including LLMs).

Which is to say, the LLM is being programmed in some way.

For example, prompting the LLM to build and run tests after code edits is a great way to get better performance out of it. But the idea is that you're designing a system where a deterministic layer (your tests) is nudging the LLM to do more useful things.

Likewise, many "agentic reasoning" systems deliberately force the LLM to write out a plan before execution. Sometimes these plans can even be validated deterministically, and the LLM forced to re-generate if the plan is no good.
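
A plan gate can be as simple as this sketch (call_llm, validate_plan, and execute_plan are hypothetical; validate_plan is the deterministic checker you'd write yourself):

    def plan_then_execute(task, max_attempts=3):
        prompt = f"Write a step-by-step plan for: {task}"
        for _ in range(max_attempts):
            plan = call_llm(prompt)            # hypothetical model call
            problems = validate_plan(plan)     # deterministic: returns [] if OK
            if not problems:
                return execute_plan(plan)      # hypothetical executor
            # force a re-gen, telling the model why the plan was rejected
            prompt += f"\nThat plan was rejected: {problems}. Try again."
        raise RuntimeError("no valid plan after retries")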

The idea that the LLM is feeding itself isn't inaccurate, but IMO it misses the defining way these systems are useful: they're being intentionally guided along the way by various other components that oversee the LLM's behavior.


Can you explain the interface between the LLM and the deterministic system? I’m not understanding how a probabilistic machine output can reliably map onto a strict input schema.


It's pretty early days for these kinds of systems, so there's no "one true" architecture that people have settled on. There are two broad variations that I see:

1 - The LLM is in charge and at the top of the stack. The deterministic bits are exposed to the LLM as tools, but you instruct the LLM specifically to use them in a particular way. For example: "Generate this code, and then run the build and tests. Do not proceed with more code generation until the build and tests successfully pass. Fix any errors reported at the build and test step before continuing." This mostly works fine, but is of course subject to the LLM not following instructions reliably (which gets worse as the context gets longer).

2 - A deterministic system is at the top, and uses LLMs in an otherwise-scripted program. This potentially works better when the domain the LLM is meant to solve is narrow and well-understood. In this case the structure of the system is more like a traditional program, but one that calls out to LLMs as-needed to fulfill certain tasks.
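
A toy version of variation 2, where the LLM fills one narrow slot in an otherwise scripted program (call_llm and route_to_queue are hypothetical):

    # deterministic program on top; the LLM is just a subroutine
    def triage_ticket(ticket_text):
        category = call_llm(                   # hypothetical model call
            "Classify this support ticket as exactly one of "
            f"billing/bug/feature: {ticket_text}"
        ).strip().lower()
        if category not in {"billing", "bug", "feature"}:
            category = "bug"                   # deterministic fallback
        route_to_queue(category, ticket_text)  # ordinary scripted logic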

> "I’m not understanding how a probabilistic machine output can reliably map onto a strict input schema."

So there are two tricks to this:

1 - You can actually force the machine output into strict schemas. Basically all of the large model providers now support outputting in defined schemas - heck, Apple just announced their on-device LLM which can do that as well. If you want the LLM to output in a specified schema with guarantees of correctness, this is trivial to do today! This is fundamental to tool-calling.

2 - But often you don't actually want to force the LLM into strict schemas. For the coding tool example above where the LLM runs build/tests, it's often much more productive to directly expose stdout/stderr to the LLM. If the program crashed on a test, it's often very productive to just dump the stack trace as plaintext at the LLM, rather than try to coerce the data into a stronger structure and then show it to the LLM.

How much structure vs. freeform is very much domain-specific, but the important realization is that more structure isn't always good.
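
To sketch trick 1 client-side (provider-side constrained decoding is the real mechanism; this just shows the shape, with call_llm as a hypothetical stand-in and jsonschema being the actual Python package):

    import json
    import jsonschema  # pip install jsonschema

    SCHEMA = {
        "type": "object",
        "properties": {"file": {"type": "string"}, "diff": {"type": "string"}},
        "required": ["file", "diff"],
    }

    def structured_edit(prompt, retries=3):
        prompt += f"\nReply with JSON matching this schema: {json.dumps(SCHEMA)}"
        for _ in range(retries):
            raw = call_llm(prompt)             # hypothetical model call
            try:
                out = json.loads(raw)
                jsonschema.validate(out, SCHEMA)  # strict, deterministic gate
                return out
            except (json.JSONDecodeError, jsonschema.ValidationError) as err:
                prompt += f"\nYour last reply was invalid: {err}. Try again."
        raise ValueError("no schema-valid output after retries")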

To make the example concrete, an example would be something like:

[LLM generates a bunch of code, in a structured format that your IDE understands and can convert into a diff]

[LLM issues the `build_and_test` tool call to your IDE. Your IDE executes the build and tests.]

[Build and tests (deterministic) complete, IDE returns the output to the LLM. This can be unstructured or structured.]

[LLM does the next thing]
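
For what it's worth, the build_and_test tool in that example can be nearly trivial. A sketch, assuming a pytest project:

    import subprocess

    def build_and_test():
        # run the suite and hand the raw output straight back to the model,
        # per point 2 above: plaintext stack traces are often most useful
        proc = subprocess.run(
            ["pytest", "-x"], capture_output=True, text=True, timeout=600
        )
        return {"exit_code": proc.returncode,
                "output": proc.stdout + proc.stderr}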


So, to summarize, there is a feedback loop like this: LLM <--> deterministic agent? And there's an asymmetry in strictness, i.e. LLM --> agent funnels probabilistic output into 1+ structured fields, whereas agent --> LLM can be more freeform (stderr plaintext). Is that right?

A few questions:

1) how does the LLM know where to put output tokens when given more than one structured field option?

2) Is this loop effective for projects from scratch? How good is it at proper design (understanding tradeoffs in algorithms, etc)?


> "there is a feedback loop like this: LLM <--> deterministic agent?"

More or less, though the agent doesn't have to be deterministic. There's a sliding scale of how much determinism you want in the "overseer" part of the system. This is a huge area of active development with not a lot of settled stances.

There's a lot of work being put into making the overseer/agent a LLM also. The neat thing is that it doesn't have to be the same LLM, it can be something fine-tuned to specifically oversee this task. For example, "After code generation and build/test has finished, send the output to CodeReviewerBot. Incorporate its feedback into the next round of code generation." - where CodeReviewerBot is a different probabilistic model trained for the task.
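
As a sketch of that shape (codegen_llm and reviewer_llm are hypothetical, and deliberately distinct; the reviewer can be a different, fine-tuned model):

    def generate_with_review(task, rounds=3):
        code = codegen_llm(task)               # hypothetical generator model
        for _ in range(rounds):
            feedback = reviewer_llm(code)      # hypothetical reviewer model
            if feedback.strip() == "LGTM":     # reviewer has no complaints
                return code
            # incorporate the reviewer's feedback into the next generation
            code = codegen_llm(f"{task}\nRevise to address: {feedback}")
        return code                            # best effort after N rounds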

You could even put a human in as part of the agent: "do this stuff, then upload it for review, and continue only after the review has been approved" is a totally reasonable system where (part of) the agent is literal people.

> "And there's a asymmetry in strictness, i.e. LLM --> agent funnels probabilistic output into 1+ structured fields, whereas agent --> LLM can be more freeform (stderr plaintext). Is that right?"

Yes, though some flexibility exists here. If LLM --> deterministic agent, then you'd want to squeeze the output into structured fields. But if the agent is itself probabilistic/a LLM, then you can also just dump unstructured data at it.

It's kind of the wild west right now in this whole area. There's not a lot of common wisdom besides "it works better if I do it this way".

> "1) how does the LLM know where to put output tokens given more than one structured field options?"

Prompt engineering and a bit of praying. The trick is that there are methods for ensuring the LLM doesn't hallucinate things that break the schema (fields that don't exist for example), but output quality within the schema is highly variable!

For example, you can force the LLM to output a schema that references a previous commit ID... but it might hallucinate a non-existent ID. You can make it output a list of desired code reviewers, and it'll respect the format... but hallucinate non-existent reviewers.

Smart prompt engineering can reduce the chances of this kind of undesired behavior, but given that it's a giant ball of probabilities, performance is never truly guaranteed. Remember also that this is a language model, so it's sensitive to the schema itself. Obtuse naming within the schema will negatively impact reliability.

This is actually part of the role of the agent. "This code reviewer doesn't exist. Try again. The valid reviewers are: ..." is a big part of why these systems work at all.
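
i.e. something like this sketch (call_llm is hypothetical; the set of valid reviewers is the ground truth the model doesn't have):

    VALID_REVIEWERS = {"alice", "bob", "carol"}  # hypothetical ground truth

    def pick_reviewers(change_summary, retries=3):
        prompt = f"Suggest comma-separated reviewers for: {change_summary}"
        for _ in range(retries):
            names = {n.strip() for n in call_llm(prompt).split(",")}  # hypothetical
            bogus = names - VALID_REVIEWERS
            if not bogus:
                return names
            # deterministic check failed: tell the model and loop
            prompt += (f"\n{sorted(bogus)} don't exist. Valid reviewers are: "
                       f"{sorted(VALID_REVIEWERS)}. Try again.")
        raise ValueError("model kept hallucinating reviewers")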

> "2) Is this loop effective for projects from scratch? How good is it at proper design (understanding tradeoffs in algorithms, etc)?"

This is where the quality of the initial prompt and the structure of the agent come into play. I don't have a great answer here, besides that making these agents better at decomposing higher-level tasks (including understanding tradeoffs) is a lot of what's at the bleeding edge.


Wait, so you just tell the LLM the schema, and hope it replicates it verbatim with content filled into it? I was under the impression that you say "hey, please tell me what to put in this box" repeatedly until your data model is done. That sort of surprises me!

This interface interests me the most because it sits right in the middle of the reliability-flexibility tradeoff that people are constantly debating with the new AI tech. Are there "mediator" agents with some reliability AND some flexibility? I could see a loosey-goosey LLM passing things off to Mr. Stickler agent leading to failure all the time. Is the mediator just humans?


> "Wait, so you just tell the LLM the schema, and hope it replicates it verbatim with content filled into it?"

In the early stages of LLMs yes ("get me all my calendar events for next week and output in JSON format" and pray the format it picks is sane), but nowadays there are specific model features that guarantee output constrained to the schema. The term of art here is "constrained decoding".

The structuring is also a bit of a dark art - overall system performance can improve/degrade depending on the shape of the data structure you constrain to. Sometimes you want the LLM to output into an intermediate and more expressive data structure before converting to a less expressive final data structure that your deterministic piece expects.

> "Are there "mediator" agents with some reliability AND some flexibility?"

Pretty much, and this is basically where "agentic" stuff is at the moment. What mediates the LLM's outputs? Is it some deterministic system? Is it a probabilistic system? Is it kind of both? Is it a machine? Is it a human?

Specifically with coding tools, it seems like the mediator(s) are some mixture of sticklers (compilers, tests) and loosey-goosey components (other LLMs, the same LLM).

This gets a bit wilder with multimodal models too: think about a workflow step like "The user asked me to make a web page that looks like [insert user input here], here is my work, including a screenshot of the rendered page. Hey mediator, does this look like what the user asked for? If not, give me specific feedback on what's wrong."

And then feed that back into codegen. There have been some surprisingly good results from the mediator being a multimodal LLM.


> prompting the LLM to build and run tests after code edits

Isn't that done by passing function definitions or "tools" to the LLM?


Thanks for this comment, I totally agree. Not to say this article isn't good; it's great!



