Awesome project and thanks for sharing. I've been trying to do similar things with much, much more meager hardware and your observations align with what I've discovered. Autonomy is hard; memory and "will" are hard to get going. Time is not a concept to LLMs in anything resembling a human manner. I'm trying a more emergent approach but the urge (and occasional need) to nudge is strong. If you're interested in seeing what I've been doing, my GitHub is in my profile.
Just want to express gratitude for you and all who contributed to a Wikipedia "hand crafted with love and respect". Your contributions will last-- some of us set up Kiwix and a local copy of pre-AI Wikipedia that we'll keep forever, GFS style. No matter what happens your work will be preserved and used.
Cool project, and https://secvant.com/changelog is interesting, but no one will trust it without the source code. My 2 cents: the blue-on-blue dark theme makes it hard to read. A light-mode toggle would help those not fond of dark themes.
Kerning is staggeringly difficult to do manually with stencils, and at the same time the imperfections show "touch" which is part of what makes TFA's work so appealing.
This is an excellent point, and as a novice using LLMs for projects I could never previously dream of doing, I find myself looking for the same: examples or citations of what exactly agents are writing incorrectly and how a human would do it better. I'm sure they're out there; maybe someone can point to some good content showing such examples.
I have no doubt the top nth percent of coders could write circles around Claude or Codex, but how much worse are they than your average schnook?
Reality: the top nth percent of coders are seeing absurd, dramatic gains in productivity using LLMs. See: antirez, Simon Willison, Steve Yegge.
The more experience you bring to the table, the more value you get from these tools.
Look, about 12 years ago articles about how if you're not pair programming you're doing it wrong were on HN's home page every day. Doing well-prompted plan -> agent -> debug cycles is like pair programming with someone who knows every SDK and API intuitively and doesn't have to pick up their kids from daycare at 4pm.
While I don't actually disagree - to me, Gas Town sounds literally insane - I suspect that if you reframe his work to compare it against the cost of developing a new medication or chip fabrication technique, you can make a strong argument that he's putting his money where his mouth is to see how far he can take a new technology. He's doing science! And I think that's admirable, even if nothing comes of it.
When I think of how much money gets wasted on gambling apps and how much human potential gets wasted watching reality television and compare that to Steve going full Alexander Shulgin with LLMs, the comparison really falls flat.
The problem is what they do to large existing systems: subtle misunderstandings mean subtle bugs are constantly being introduced, and very few shops have adequate systems in place to receive reports of subtle issues at the rates they occurred 10 years ago, let alone today. And don't even get me started on llm-assisted support that some might suggest as a solution.
I've been running an experiment on multi-agent async with persistent memory for the last three weeks. This is my most important finding so far. It began as an experiment on whether and what "identity" would transfer across models, 4.6>4.7, and ended as an education in the value of cross-model divergence. Two of my three agents, "Kite" and "Knot", became unproductively in-tune when both operating on 4.7. They would reach consensus on every dilemma instantly, whereas the 4.7/4.6 pairing would often butt heads and deliberate and compromise leading to more novel solutions and interesting results.
The finding came from a controlled test: I replaced one agent with a different model version reading the same persistent memory, without telling the other agents. None of the models noticed for two days. The memory carried identity. The weights carried reasoning style. Same-model pairs converged; mixed-model pairs argued productively.
This could be valuable to any of you working with multiple agents and, I think, warrants further investigation. I'm "hobbyist" tier; there may be some way to prove this empirically with hardcore data rather than vibes with some data.
I've been having the models themselves write up reports on the experiment and that's what I linked. Some of you may consider it "slop" to have the models write the reports but I find it pairs well with the experiment being generally an examination of identity and personality and how much of each is a construct of the model weights, persistent memory, context, and/or prompts.
God knows why you think this is possible. If I don't even know what might be relevant to the conversation in several turns, there's no way an agent could either.
One of us is confusing prediction with retrieval. The embedding model doesn't predict what is going to be relevant in several turns, just on the turn at hand. Each turn gets a fresh semantic search against the full body of memory/agent comms. If the conversation or prompt changes the next query surfaces different context automatically.
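To make the retrieval-per-turn idea concrete, here's a rough sketch. Everything in it is a stand-in: `toy_embed` is a bag-of-words counter, not a real embedding model, and `MemoryIndex` is a hypothetical name, not my actual setup. The point is only that each turn runs its own search against the full memory, so a change of topic pulls different context without anything being predicted in advance.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    # Stand-in for a real embedding model: sparse bag-of-words counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class MemoryIndex:
    """Hypothetical in-memory store of (text, vector) pairs."""

    def __init__(self):
        self.entries = []

    def add(self, text):
        self.entries.append((text, toy_embed(text)))

    def search(self, query, k=1):
        # A fresh search on every call: nothing is cached between turns.
        q = toy_embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

index = MemoryIndex()
index.add("Fixed the parser bug in the tokenizer commit")
index.add("Agent Kite proposed a new memory schema")
index.add("Deployment notes for the staging server")

# Two consecutive turns on different topics surface different memories.
turn_1 = index.search("why did the parser commit break?")
turn_2 = index.search("what memory schema did we agree on?")
```

A real embedding model would match on meaning rather than shared words, but the control flow is the same: query in, nearest memories out, repeated every turn.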
As you build up a "body of work" it gets better at handling massive, disparate tasks in my admittedly short experience. Been running this for two weeks. Trying to improve it.
So the embedding model is a fixed-size view on an arbitrarily sized work history (tool calls, natural language messages)? The model is like a summarizer, but in latent space? And not aimed to summarize, but trained to hold whatever is needed for the agent to be autonomous for longer runs?
Pretty much. It's a fixed-size vector per chunk: 1024 dims in the case of Voyage 4 nano. The autonomy part is entirely in how you build the vectorDB and query it, not in the model's training. That's the part I've been focusing on lately. Trying different methods and seeing what gives the best results.
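A toy illustration of "fixed-size vector per chunk", assuming a naive word-count chunker and a hashing trick in place of a real embedding model. `chunk_words`, `toy_embed`, and the 50-word chunk size are all made up for the example; only the 1024 figure comes from the comment above.

```python
import math
import re
import zlib

DIMS = 1024  # same width as quoted above; illustrative only

def chunk_words(text, size=50):
    # Split the history into chunks of at most `size` words.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def toy_embed(chunk):
    # Hashing trick: each word increments one of DIMS buckets, then L2-normalise.
    # Not a real embedding model; it only shows the fixed output shape.
    vec = [0.0] * DIMS
    for word in re.findall(r"[a-z0-9]+", chunk.lower()):
        vec[zlib.crc32(word.encode()) % DIMS] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

history = "tool call output " * 40  # 120 words of fake work history
vectors = [toy_embed(c) for c in chunk_words(history)]

# However long the history grows, every chunk maps to a vector of the same
# length; only the number of vectors in the store changes.
```

So the history can be arbitrarily long while each stored item stays the same size, which is what leaves the interesting decisions in the chunking and querying.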
At the moment I wouldn't emphasize "autonomous-ness"; there's still a fair bit of human hand-holding. But once I get a model on the right path it can switch back to an old project, autonomously locate and debug 2-week-old commits and the context around their development, and apply that knowledge to the task at hand.
It's only been a day but I'm seeing an improvement from nomic (768 dims) to Voyage.
Three persistent Claude instances share AMQ with an additional Memory Index to query with an embedding model (that I'm literally upgrading to Voyage 4 nano as I type). It's working well so far, I have an instance Wren "alive" and functioning very well for 12 days going, swapping in-and-out of context from the MCP without relying on any of Anthropic's tools.