> LLMs still seem as terrible at this as they'd been in the GPT-3.5 age. Software agents break down once the codebase becomes complex enough, game-playing agents get stuck in loops out of which they break out only by accident, etc.
This has been my observation. I got into GitHub Copilot as soon as it launched, back when GPT-3 was the model. By that time (late 2021) Copilot could already write tests for my Rust functions, and simple documentation. That was revolutionary. We haven't had another similar moment since.
The GitHub Copilot Vim plugin is always on. As you keep typing, it keeps suggesting the rest of the code in faded text. Because it is always on, I can kind of read into the AI "mind". The more I coded, the more I realized it's just search with structured results. The results got better with 3.5/4, but after that only slightly, and sometimes not even that (i.e., 4o or o1).
I don't care what anyone says; just yesterday I made a comment that truth has essentially died (https://news.ycombinator.com/item?id=43308513). If you have a revolutionary intelligence product, why is it not working for me?
Ultimately, every AI thing I've tried in this era seems to want to make me happy, even if it's wrong, instead of helping me.
I describe it like "an eager intern who can summarize a 20-min web search session instantly, but ultimately has insufficient insight to actually help you". (Note to current interns: I'm mostly describing myself some years ago; you may be fantastic so don't take it personally!)
Most of my interactions with it via text prompt or builtin code suggestions go like this:
1. Me: I want to do X in C++. Show me how to do it only using stdlib components (no external libraries).
2. LLM: Gladly! Here is solution X
3. Me: Remove the undefined behavior from foo() and fix the methods that call it
4. LLM: Sure! Here it is (produces solution X again)
5. Me: No, you need to remove the use of uninitialized variables as the out parameters (see the sketch after this exchange).
6. LLM: Oh certainly! Here is the correct solution (produces a completely different solution that also has issues)
7. Me: No, go back to the first one
etc.
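For concreteness, here is a minimal sketch of the uninitialized-out-parameter problem from step 5 (my own illustration with made-up names, not the actual code from those sessions; stdlib-only, per the original constraint):

    #include <iostream>
    #include <optional>
    #include <string>

    // The buggy shape the LLM kept producing: foo_buggy() can return without
    // writing `out`, and a careless caller then reads an uninitialized int,
    // which is undefined behavior.
    bool foo_buggy(const std::string& s, int& out) {
        if (s.empty()) return false;   // `out` left unwritten on this path
        out = static_cast<int>(s.size());
        return true;
    }

    // The fix being asked for: make "no value" unrepresentable instead of
    // relying on every caller to check a flag before touching `out`.
    std::optional<int> foo_fixed(const std::string& s) {
        if (s.empty()) return std::nullopt;
        return static_cast<int>(s.size());
    }

    int main() {
        int value;                        // uninitialized
        if (!foo_buggy("", value)) {
            // Reading `value` here anyway would be UB; nothing in the
            // signature stops a caller from doing it.
        }
        if (auto v = foo_fixed("hello"))  // caller cannot forget the check
            std::cout << *v << '\n';
    }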
For the ones that suggest code, it can at least produce some very simple boilerplate easily (e.g. gtest and gmock stuff for C++), but asking it to do anything more significant is a real gamble. Often I end up spending more time scrutinizing the suggested code than I would spend writing a version of it myself.
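The boilerplate tier it gets right almost every time looks something like this (a generic sketch with hypothetical names like Clock/MockClock, not code from my project):

    #include <cstdint>
    #include <gmock/gmock.h>
    #include <gtest/gtest.h>

    // A hypothetical interface, just to illustrate the gtest/gmock
    // boilerplate tier that suggestions handle reliably.
    class Clock {
    public:
        virtual ~Clock() = default;
        virtual std::int64_t NowMs() const = 0;
    };

    class MockClock : public Clock {
    public:
        MOCK_METHOD(std::int64_t, NowMs, (), (const, override));
    };

    TEST(ClockTest, ReturnsStubbedTime) {
        MockClock clock;
        EXPECT_CALL(clock, NowMs()).WillOnce(testing::Return(42));
        EXPECT_EQ(clock.NowMs(), 42);
    }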
The difference is that interns can learn, and can benefit from reference items like a prior report, whose format and structure they can follow when working on the revisions.
AI is just AI. You can upload a reference file for it to summarize, but it's not going to be able to look at the structure of the file and use that as a template for future reports. You'll still have to spoon-feed it constantly.
Step 7 is the worst part about trying to review my coworker's code that I'm 99% confident is Copilot output - and to be clear, I don't really care how someone chooses to write their code, I'll still review it as evenly as I can.
I'll very rarely ask someone to completely rewrite a patch, but so often a few minor comments get addressed with an entire new block of code that forces me to do a full re-review, and I can't get it across to him that that's not what I'm asking for.
Interns can generally also tell me "tbh i have no damn idea", while AI just talks out of its virtual ass, and I can't read from its voice or behavior that maybe it's not sure.
Interns can also be clever and think outside the box. This is mostly not good, but sometimes they will surprise you in a good way. The AI, by definition, can only copy what someone else has done.
The last line has been my experience as well. I only trust what I've verified firsthand now because the Internet is just so rife with people trying to influence your thoughts in a way that benefits them, over a good faith sharing of the truth.
I just recently heard this quote from a clip of Jeff Bezos: "When the data and the anecdotes disagree, the anecdotes are usually right.", and I was like... wow. That quote is the zeitgeist.
If it's so revolutionary, it should be immediately obvious to me. I knew Uber, Netflix, Spotify were revolutionary the first time I used them. With LLMs for coding, it's like I'm groping in the dark trying to find what others are seeing, and it's just not there.
> I knew Uber, Netflix, Spotify were revolutionary the first time I used them.
Maybe re-tune your revolution sensor. None of those are revolutionary companies. Profitable and well executed, sure, but those turn up all the time.
Uber's entire business model was running over the legal system so quickly that taxi licenses didn't have time to catch up. Other than that it was a pretty obvious idea. It is a taxi service. The innovations they made were almost completely legal ones: figuring out how to skirt employment and taxi law.
Netflix was anticipated online by YouTube, and is probably inferior to it, except for the fact that they have a pretty traditional content creator lab tacked on the side to do their own programs. And torrenting had already been a thing for a long time, showing how to do online distribution of video content.
They were revolutionary as product genres, not necessarily as individual companies. Ordering a cab without making a phone call was revolutionary. Netflix, at least with its initial promise of having all the world's movies and TV, was revolutionary, though it didn't live up to that. Spotify was revolutionary because of how cheap and easy it made access to all the music; this was the era when people were paying 99c per song on iTunes.
I've tried some AI code completion tools and none of them hit me that way. My first reaction was "nobody is actually going to use this stuff" and that opinion hasn't really changed.
And if you think those 3 companies weren't revolutionary then AI code completion is even less than that.
> They were revolutionary as product genres, not necessarily as individual companies.
Even then, they were evolutionary at best.
Before Netflix and Spotify, streaming movies and music were already there as a technology; ask anybody with a Megaupload or Sopcast account. What changed was that the DMCA acquired political muscle and cross-border reach, wiping out waves of torrent sites and P2P networks. That left a new generation of users on locked-down mobile devices with no option but legitimate apps that had deals in place with the record labels and movie studios.
Even the concept of "downloading MP3s" disappeared because every mobile OS vendor hated the idea of giving their customers access to the filesystem, and iOS didn't even have a file manager app until well into the next decade (2017).
> What changed was that the DMCA acquired political muscle and cross-border reach, wiping out waves of torrent sites and P2P networks.
Half true - that was happening to some extent, but it wasn't why music piracy mostly died out. The DMCA worked on centralized platforms like YouTube, but the various avenues for downloading music that people used back then still exist; they're just not used as much anymore. Spotify was proof that piracy is mostly a service problem: it suddenly became easier for most people to get the music they wanted through official channels than through piracy.
DMCA claims took out huge numbers of public torrent trackers, which were how 99% of people accessed contraband media. All the way back in 2008, the loss of TorrentSpy.com probably shifted everybody to private trackers, but it's a whack-a-mole game there too and most people won't bother.
The DMCA also led to the development of ContentID and the automated copyright-strike system on YouTube, but it didn't block you from downloading the stream as a high-bitrate MP3, which is possible even now.
> streaming movies and music were already there as a technology; ask anybody with a Megaupload or Sopcast account.
You can't have a revolution without users. It's the ability to reach a large audience - through superior UX, superior business model, superior marketing, etc. - that creates the possibility for revolutionary impact.
Which is why Megaupload and Sopcast didn't revolutionize anything.
Yes, but Google intentionally left that functionality half-baked, letting third-party developers fill the void. Even now the Google Files app feels like a toy compared to Fossify Explorer or Solid Explorer.
There was a gain in precision going from phone call to app. There is a loss of precision going from app to voice. The tradeoff of precision for convenience is rarely worth it.
Because if it were, Uber would just make a widget asking "Where do you want to go?" and you'd enter "Airport" and that would be it. If a widget of some action is a bad idea, so is the voice command.
"Do something existing with a different mechanism" is innovative, but not revolutionary, and certainly not a new "product genre". My parents used to order pizza by phone calls, then a website, then an app. It's the same thing. (The friction is a little bit less, but maybe forcing another human to bring food to you because you're feeling lazy should have a little friction. And as a side effect, we all stopped being as comfortable talking to real people on phone calls!)
The experiences of Netflix, Spotify, and Uber were revolutionary. They felt like the future, and they worked as expected. Sure, we didn't realize the poison these products were introducing into many creative and labor ecosystems, nor did we fully appreciate how they would widen the income-inequality gap by concentrating more profits in the hands of executives. But they fit cleanly into many of our lives immediately.
Debating whether that's "revolutionary" or "innovative" or "whatever-other-word" is just a semantic sideshow common to online discourse. It's missing the point. I'll use whatever word you want, but it doesn't change the point.
"Simple, small" and "good marketing" seem like obvious undersells considering the titanic impacts Netflix and Spotify (for instance) have had on culture, personal media consumption habits, and the economics of industries. But if that's the semantic construction that works for you, so be it.
> The innovations they made were almost completely legal ones; figuring out how to skirt employment and taxi law.
The impact of this was quite revolutionary.
> except for the fact that they have a pretty traditional content creator lab tacked on the side to do their own programs.
The way in which they did this was quite innovative, if not "revolutionary". They used the data they had from the watching habits of their large user base to decide what kinds of content to invest in creating.
In screwing over a lot of people around the world, yes. Otherwise, not really. Ordering rides by app was an obvious next step that's already been pursued independently everywhere.
> They used the data they had from the watching habits of their large user base to decide what kinds of content to invest in creating.
And they successfully created a line of content universally known as something to avoid. Tracks with the "success" of recommendation systems in general.
I strongly disagree about Netflix. It came out when I was in high school without a car. Being able to get whatever DVD I wanted without having to bum a ride from my parents--and also never have to pay late fees--was a major game changer.
Not only were Uber/Grab (and delivery apps) revolutionary, they are still revolutionary. I could live without LLMs, and my life would be only slightly impacted when coding. If delivery apps were not available, my life would be severely degraded. The other day I was sick. I got medicine and dinner with Grab, delivered to the condo lobby, which was as far as I could get. That is revolutionary.
Practically or functionally? Airbnb was invented by people posting on Craigslist message boards, and even existed before the Internet, if you had rich friends with spare apartments. But by packaging it up into an online platform it became a company with $2.5 billion in revenue last year. So you can dismiss ordering from a screen instead of looking at a piece of paper and using the phone as not being revolutionary, because if you squint, they're the same thing; but I can now order takeout from restaurants I previously would never have ordered from, and Uber Eats generated $13.7 billion in revenue last year, up from $12.2 billion.
Again, the "revolutionary" aspect that made Uber and AirBnB big names, as opposed to any of the plethora of competitors who were doing the same thing at the same time or before, is that these two gained "innovative" competitive advantage by breaking the law around the world.
Obviously you can get ahead if you ignore the rules everyone else plays by.
If we throw away the laws, there's a lot more unrealized "innovation" waiting.
The taxi cab companies were free to innovate and create their own app. And we could have continued to have drivers whose credit card machine didn't work until suddenly it does because you don't have any cash. Regulatory capture is anti-capitalism.
Yes, let's throw away the bad laws that are only there to prop up ossified power structures that exist for no good reason, and innovate!
Some laws are good, some laws are bad. We don't have to agree on which ones are which, but it's an oversimplification to frame it as merely that.
Honestly, yes. Calling in an order can result in the restaurant botching the order, with no way for you to challenge it unless you recorded the call. Also, as someone who's been on both sides of the transaction, some people have poor audio quality or speak accented English, which is difficult to understand. Ordering from a screen saves everyone valuable time and reduces confusion.
I’ve had app delivery orders get botched, drivers get lost on their way to my apartment, food show up cold or ruined, etc.
The worst part is that when DoorDash fucks up an order, the standard remediation process every other business respects—either a full refund or come back, pick up the wrong order, and bring you the correct order—is just not something they ever do. And if you want to avoid DoorDash, you can’t because if you order from the restaurant directly it often turns out to be white label DoorDash.
Some days I wish there was a corporate death penalty and that it could be applied to DoorDash.
Before the proliferation of Uber Eats, Doordash, GrubHub, etc, most of the places I've lived had 2 choices for delivered food: pizza and Chinese.
It has absolutely massively expanded the kinds of food I can get delivered living in a suburban bordering on rural area. It might be a different experience in cities where the population size made delivery reasonable for many restaurants to offer on their own.
Now if anyone solved the problem that, for most cuisines, ordered food is vastly inferior to freshly served meals - that would be revolutionary.
Crisp fries and pizza. Noodles perfectly al dente, and risotto that has not started to thicken.
It's far from a perfect solution, but I applaud businesses that have tried to improve the situation through packaging changes. IHOP is a stand-out here, in my experience. Their packaging is very sturdy and isolates each component in its own space. I've occasionally been surprised at how hot the food is.
Revolutionary things are things that change how society actually works at a fundamental level. I can think of four technologies of the past 40 years that fit that bill:
- the personal computer
- the internet
- the internet-connected phone
- social media
Those technologies are revolutionary because they caused fundamental changes to how people behave. People who behaved differently in the "old world" were forced to adapt to a "new world" with those technologies, whether they wanted to or not. A newer, more convenient way of ordering a taxicab, watching a movie, or listening to music is a great consumer product story, and certainly a big money maker. But those don't cause complex and not fully understood changes to the way people work, play, interact, self-identify, etc., the way that revolutionary technologies do.
Language models feel like they have the potential to be a full blown sociotechnological phenomenon like the above four. They don't have a convenient consumer product story beyond ChatGPT today. But they are slowly seeping into the fabric of things, especially on social media, and changing the way people apply to jobs, draft emails, do homework, maybe eventually communicate and self-identify at a basic level.
I'd almost say that the lack of a smash bang consumer product story is even more evidence that the technology is diffusing all over the place.
Build the much-maligned Todo app with Aider and Claude for yourself. Give it one sentence and have it spit out working, if imperfect, code. Iterate. Add a graph for completion or something and watch it pick and find a library without you having to know the details of that library. Fine, sure, it's just a Todo app, and it'll never work for a "real" codebase, whatever that means, but holy shit, just how much programming did you used to need to get down and dirty with to build that "simple" Todo app? Obviously building a Todo app before LLMs was possible, but abstracted out, the fact that it can be generated like that isn't a game changer?
How is getting an LLM to spit out a clone of a very common starter project evidence that it can generate non-trivial and valuable code - as in, not a clone of overabundant codebases - on demand?
Because in actually doing the exercise, and not just talking about it, you'd come up with your own tweak on the Todo app that couldn't directly be in the training data. You, as a smart human, could come up with a creative feature for your Todo app that no one else would make, showing that these things can compose the things in their training data and produce a unique combination that didn't exist before. Copying example-todo.app to my-todo.app isn't what's impressive; having it add features that aren't in the example app is. If it only has a box of Lego and can only build things from them, and can't invent new Lego blocks, there's still a large number of things it can be told to build. That it can assemble those blocks into a new model that isn't in the instruction manual might not be the most surprising thing in the world, but when that's what most software development is, is the fact that it can't invent new blocks really going to hold it back that much?
I think you're getting too distracted by the analogies. LLMs have been shown to easily spit out code for patterns that are very common, such as todo apps. That isn't very impressive in itself.
LLMs are cool and show some interesting emergent properties, but knowing how they work, getting them to make a very common type of app doesn't show a great deal of emergent ability beyond searching a compressed space and reproducing a common pattern found in it.
It's easy to underestimate the amount and variability of the training data used. There are possibly tens of thousands of variations on todo apps available on github alone.
So be creative and ask it to make something that isn't a Todo app with thousands of examples on GitHub. Knowing that it's really, really, really advanced autocomplete based on matrix math, one that has some funny edge cases because of a numbering subreddit, is an interesting degenerate case, but the sheer vastness of the training data it has picked up makes it able to dig into a simultaneously deep and shallow Lego box. It does fall over and go in circles, and, having learned to program before the Internet was mainstream, I'm able to go into the code and fix it manually; but that doesn't impeach its ability to get that far in the first place.
If it's only able to do the first 90% of the work, and I have to do the last 90% of the work, it's still saved me from doing that first 90% of the work.
I may be wrong, but I'm coming to the conclusion that the promise that a piece of software can generate any piece of software you can possibly desire and describe is basically saying that P=NP and that the halting problem can be solved by a Turing machine that approaches infinite speed.
I'm not claiming that it can generate any piece of software you can possibly desire, but that there are enough examples of quite a lot of pieces of them, and that it can compose them into something it hasn't directly seen before. Like the astronaut-on-a-horse image that was popular for Stable Diffusion: it didn't have that directly in the training data, but it had enough of the pieces to create a reasonable-looking version of one. That it produces too many fingers on hands and has no concept of writing is a glaring, obvious shortcoming, just like LLMs generating buggy, inefficient code is another one. Catching its fuckups is missing the point.
My point is that it's such a game changer that you ignore it at your own peril. Just go into a side project half-cocked and get shit built. Yes, the code will be ugly, but it got built. Maybe. It has its limits, as does the operator's patience, so it's entirely possible you'll run into a bug it can't fix. But a smart operator knows when to stop it and dig into the problem and fix it manually.
Funnily enough though, if you give it some toy code that doesn't ever complete, like an unbounded Fibonacci number generator, and ask it if it will halt, it's able to point out that it won't. That, of course, is because those are in the training data, but it's cute nonetheless.
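Something like this, say (a toy of my own construction, not the exact snippet I gave it):

    #include <cstdint>
    #include <iostream>

    // Toy non-terminating program: prints Fibonacci numbers forever.
    // There is no exit condition, so it never halts (unsigned arithmetic
    // just wraps around on overflow rather than stopping anything).
    int main() {
        std::uint64_t a = 0, b = 1;
        while (true) {
            std::cout << a << '\n';
            std::uint64_t next = a + b;
            a = b;
            b = next;
        }
    }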
But I think if you know how to code well you can use code generators, whereas if you know how to use code generators you don't necessarily learn how to code. And being able to review generated code stems from mastering code, not from fiddling with prompts.
More generally, I think the premise that using code generators is a skill is empty. They don't require special skill beyond understanding programming well, so it is probably a better use of my time to keep on coding; and if someone wants to give me money to use code generators, I am in a better position for having spent lots of time writing code.
And if code generators become so good that it requires no skill to create software, then all this is moot. But of course I think that is nonsense because it seems to me that depends on the foundations of computer science being wrong.
Hm, that's fair. I suppose I should be more transparent as to my bias. I learned to program as the Internet was coming online and before Google, Stack Overflow, and now, ChatGPT. Not trying to brag, merely trying to show that I know how to code.
I can't say how I'd actually feel if I were just starting out, but your position isn't unreasonable. I will disagree and say there is a certain skill to prompting, even though calling it "prompt engineering" is maybe ridiculous. Given the way the industry is trending, noticing when a code generator has backed itself into a corner and is running around in circles, and then having the skill to fix the code by hand, quickly, is the skill to hone. Whether that comes from practicing writing code from scratch, or from using code generation and then fixing the bugs it creates, is up for debate. I obviously think the latter, but that's just the opinion of a random person who's not in your shoes.
Best of luck to you in your career! Hopefully it goes well, no matter which direction it takes.
While I don't disagree with that observation, it falls into the "well, duh!" category for me. The models are built with no mechanism for long-term memory and thus suck at tasks that require long-term memory. There is nothing surprising here. There was never any expectation that LLMs would magically develop long-term memory, as that's impossible given the architecture. They predict the next word, and once the old text moves out of the context window, it's gone. The models neither learn as they work nor remember the past.
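In caricature, the "memory" is just a bounded buffer (a sketch of my own, not any real model's code; the ContextWindow class is made up for illustration):

    #include <cstddef>
    #include <deque>
    #include <string>

    // The model only ever conditions on the last N tokens. Anything
    // evicted from this buffer is simply gone - there is no store to
    // consult later, which is the whole "no long-term memory" point.
    class ContextWindow {
    public:
        explicit ContextWindow(std::size_t max_tokens) : max_tokens_(max_tokens) {}

        void push(const std::string& token) {
            tokens_.push_back(token);
            if (tokens_.size() > max_tokens_)
                tokens_.pop_front();  // oldest token is unrecoverable
        }

        // Everything visible when predicting the next token.
        const std::deque<std::string>& visible() const { return tokens_; }

    private:
        std::size_t max_tokens_;
        std::deque<std::string> tokens_;
    };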
It's not even like humans are all that different here. Strip a human of their tools (pen&paper, keyboard, monitor, etc.) and have them try solving problems with nothing but the power of their brain and they'll struggle a hell of a lot too, since our memory ain't exactly perfect either. We don't have perfect recall, we look things up when we need to, a large part of our "memory" is out there in the world around us, not in our head.
The open question is how to move forward. But calling AI progress a dead end before we've even started exploring long-term memory, tool use, and on-the-fly learning is a tad premature. It's like calling it quits on the development of the car before you put the wheels on.
> If you have a revolutionary intelligence product, why is it not working for me?
Is programming itself revolutionary? Yes. Does it work for most people? I don't even know how to parse that question; most people aren't programmers and need to spend a lot of effort to be able to harness a tool like programming. Especially in the early days of software dev, when programming was much harder.
Your position of "I'll only trust things I see with my own eyes" is not a very good one, IMO. I mean, for sure the internet is full of hype and tricksters, but your comment yesterday was on a Tweet by Steve Yegge, a famous and influential software developer and software blogger, who some of us have been reading for twenty years and has taught us tons.
He's not a trickster, not a fraud, and if he says "this technology is actually useful for me, in practice" then I believe he has definitely found an actual use of the technology. Whether I can find a similar use for that technology is a question - it's not always immediate. He might be working in a different field, with different constraints, etc. But most likely, he's just doing something he's learned how to do and I don't, meaning I want to learn it.
Nope. I try the latest models as they come, and I have a self-made custom setup (as in a custom Lua plugin) in Neovim. What I am not is selling AI or AI-driven solutions.
Similar experience. I try so hard to make AI useful, and there are some decent spots here and there. Overall, though, I see the fundamental problem being that people need information. Language isn't strictly information, and LLMs are very good at language, but they aren't great at information. I think anything more than the novelty of "talking" to the AI is very overhyped.
There is some usefulness to be had for sure, but I don't know if the usefulness is there with the non-subsidized models.
Yeah, but why does the fact that it's VC-subsidized matter to you? The price is the price. I don't go to the store and look at eggs and lettuce and consider how much of my tax money goes into subsidizing farmers before buying their products. Maybe the prices will go up, maybe they'll go down due to competition. That doesn't stop me from using them though.
Because if they're not covering their costs now, then eventually they will which either means service degradation (cough ads cough) or price increases.
I applaud the GP for thinking about this before it becomes an issue.
It's worth actually trying Cursor, because it is a valuable step change over previous products and you might find it's better in some ways than your custom setup. The processes they use for creating the context seems to be really good. And their autocomplete is far better than Copilot's in ways that could provide inspiration.
That said, you're right that it's not as overwhelmingly revolutionary as the internet would lead you to believe. It's a step change over Copilot.
Do you mean that you have successfully managed to get the same experience in cursor but in neovim? I have been looking for something like that to move back to my neovim setup instead of using cursor. Any hints would be greatly appreciated!
Start with Avante or CopilotChat. Create your own Lua config/plugin (easy with Claude 3.5 ;) ) and then use their chat window to run copilot/models. Most of my custom config was built with Claude 3.5 and some trial/error/success.