Self-driving isn't a sensor problem, it's a software problem.
From how humans drive, it's pretty clear that there exists some latent-space representation of our immediate surroundings inside our brains that doesn't require a lot of data. If you had a driving sim wheel and four monitors (one for each direction) plus three smaller ones for the rear-view mirrors, connected to a real-world car with sufficiently high-definition cameras, you could probably drive the car remotely about as well as you could in real life, all because the images would map to the same latent space.
But the advantage humans have is an innate understanding of basic physics, built from experience interacting with the world, which we can deduce from something as simple as a 2D representation, and that is very much a big part of that latent space. You wouldn't be able to drive a car if you didn't have some "understanding" of things like velocity, acceleration, object collision, etc.
So my bet is that, just like with LLMs, research will be published at some point showing that, given a few frames of video, a model can extrapolate the physical interactions that will occur, including things like collisions and relative distances. Once that is in place, self-driving systems will get MASSIVELY better.
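To make that concrete, here's a minimal, purely hypothetical sketch of what such a model could look like in PyTorch: encode frames into a latent state, then roll that state forward with a learned dynamics model. The module names, shapes, and rollout scheme are all my own assumptions, not any published architecture.

```python
# Purely illustrative sketch of a "latent physics" video model; module names,
# shapes, and the rollout scheme are assumptions, not any published system.
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Compress an RGB frame into a small latent state vector."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, latent_dim)

    def forward(self, frame):                  # frame: (B, 3, H, W)
        return self.fc(self.conv(frame).flatten(1))

class LatentDynamics(nn.Module):
    """Predict the next latent state from the current one (the learned 'physics')."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, z):
        return self.net(z)

# Encode the last observed frame, then roll the latent state forward.
# A downstream head could decode each predicted latent into collision risk,
# relative distances, and so on.
encoder, dynamics = FrameEncoder(), LatentDynamics()
frames = torch.randn(1, 4, 3, 128, 128)        # four observed video frames
z = encoder(frames[:, -1])
predicted = []
for _ in range(10):                            # extrapolate ten steps ahead
    z = dynamics(z)
    predicted.append(z)
```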
It's both. Your eyes have much better dynamic range and FPS than modern self driving systems & cameras. If you can reduce the amount of guessing your robot does (e.g. laser says _with certainty_ that you'll collide with an object ahead), you should do it.
Self-driving is still a robotics problem, and robots are probabilistic operators with many component dependencies. If you have three 99%-reliable systems strung together running 24 hours a day, that's about 43 minutes a day of unreliability ((1 - 0.99^3) * 1440). Multi-modality allows your systems to provide redundancy for one another and reduces how errors accumulate and correlate.
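Worked out as a quick sketch (same illustrative 99%/three-system numbers as above):

```python
# Series reliability of three independent 99%-reliable subsystems,
# using the same illustrative numbers as the comment above.
per_system = 0.99
n_systems = 3
minutes_per_day = 24 * 60

combined = per_system ** n_systems                  # ~0.9703: all three must work
downtime = (1 - combined) * minutes_per_day         # expected unreliable minutes/day

print(f"combined reliability: {combined:.4f}")
print(f"unreliable minutes/day: {downtime:.1f}")    # ~42.8
```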
Check out this NOVA video on how limited your acute vision actually is. It is only by rapidly moving our eyes around that we have high-quality vision. In the places you are not looking, your brain is computing what it thinks is happening, not actually watching it.
I should have said that eyes and brain in combination have much better dynamic range and FPS perception than self-driving systems. The point remains unchanged -- which sensor you use is tied to the computation you need to do. What you see is the sum of computation + sensor, so it's impossible for the sensor not to matter.
Tangential: event cameras work more like our eyes but aren't ready for AVs yet.
It's only "kind of" if they compensate for the reduced specs. As the root commenter said, they don't compensate yet. It's just less safe in those situations.
Whether it's fine to be less safe in certain situations because it's safer overall is a different question.
> Your eyes have much better dynamic range and FPS than modern self driving systems & cameras. If you can reduce the amount of guessing your robot does (e.g. laser says _with certainty_ that you'll collide with an object ahead), you should do it.
You could drive fine at 30fps on a regular monitor (SDR). More fps would help with aggressive/sporty driving of course.
> You could drive fine at 30fps on a regular monitor (SDR). More fps would help with aggressive/sporty driving of course.
What? This is preposterous.
Have you tried playing a shooter video game at 30 FPS? It's atrocious, you get rekt. There is a reason all gamers are getting 120 FPS and up.
30 FPS means 33 ms of latency. Driving on a highway, the car moves over a meter before the camera even detects an obstacle. The display has its own input lag, and so does the operating system. Your total latency is going to be over 100 ms, so the car will have travelled several meters. If a motorcyclist in front of you falls, you will feel the car crashing into his body before the image even appears on the screen.
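Rough arithmetic behind that (the highway speed is my own assumed figure; the latencies are the ones in the comment):

```python
# How far a car travels during camera/display latency at highway speed.
speed_kmh = 110                          # assumed highway speed
speed_ms = speed_kmh / 3.6               # ~30.6 m/s

frame_latency_s = 1 / 30                 # one 30 fps frame: ~33 ms
pipeline_latency_s = 0.100               # camera + OS + display, as estimated above

print(f"one frame:       {speed_ms * frame_latency_s:.2f} m")      # ~1.0 m
print(f"100 ms pipeline: {speed_ms * pipeline_latency_s:.2f} m")   # ~3.1 m
```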
There are plenty of racing games that you can play just fine at 30 FPS. Obviously more FPS is a better experience, but it's not like it becomes impossible to drive.
Also, if you truly are only a few meters behind a motorcyclist when driving at highway speeds, you are by definition being unsafe. The rule I learned in driving school was roughly one car length per 10 mph of space, so you should be ~90 feet (about 27 meters) away.
Finally, the average reaction time for people driving in real life is something like 3/4 of a second: 750 ms to transition from accelerating to braking. A self-driving car being able to make decisions on a 100 ms time frame is FAR superior.
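Put in distance terms (assuming 65 mph; the 750 ms and 100 ms reaction times are the figures above):

```python
# Distance travelled before braking even begins, human vs. automated reaction.
speed_ms = 65 * 0.44704                  # 65 mph in metres per second, ~29 m/s

human_reaction_s = 0.75                  # ~750 ms, typical driver
machine_reaction_s = 0.10                # ~100 ms decision loop

print(f"human:   {speed_ms * human_reaction_s:.1f} m before braking")    # ~21.8 m
print(f"machine: {speed_ms * machine_reaction_s:.1f} m before braking")  # ~2.9 m
```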
I agree this is preposterous, but one nit to pick: event loops on self-driving cars really are that slow, and they must use very good behavior prediction plus speculative reasoning to deal with scenarios like the one you described.
Have you tried doing this in the dark? Have you tried spotting the little arrow in the green traffic light that says you can turn left, consistently, in your video feed even facing a low sun?
Only if that monitor was hooked up to a camera that could dynamically adjust its gain to achieve best possible image contrast in everything from bright sunlight to moonlit night.
You’d also lose depth perception entirely, which can’t be good for your driving.
You can test this pretty easily, it's not like that model doesn't exist. Play your average driving videogame at 30fps in first-person mode. Crank up the brightness until you can barely see if you like. We do it just fine because the model exists in our head, not because there's some inherent perfection in our immediate sensing abilities.
Yeah. I mean, you're right and wrong at the same time, imo. I won't hypothesize about how humans drive; I think for the most part it's a futile exercise, and I'll leave that to people who have a better understanding of neuroscience. (I hate when ML/CS people pretend to be experts at everything.)
That being said, this idea of a latent-space representation of the world is the right tree to be barking up (imo). The problem with "scale it like an LLM" right now is that 3D scene understanding (currently) requires labels, and LLMs scale the way they do because they don't require labels. They structure the problem as next-token prediction and can scale up unsupervised (their state space/vocabulary is also much smaller). And without going into too much detail, others in this field and I are actively doing research to resolve these issues, so perhaps we really will get there someday.
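For what it's worth, the "scale up unsupervised" part is easy to illustrate: the training signal is just the next token of the text itself, no annotation needed. A minimal sketch (toy dimensions, a stand-in recurrent model rather than any particular architecture):

```python
# Minimal illustration of next-token prediction: the "label" for each
# position is simply the following token, so no human annotation is needed.
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64
embed = nn.Embedding(vocab_size, d_model)
lstm = nn.LSTM(d_model, d_model, batch_first=True)    # stand-in for a transformer
head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (8, 33))         # a batch of raw text, tokenized
inputs, targets = tokens[:, :-1], tokens[:, 1:]        # shift by one: predict what comes next

hidden, _ = lstm(embed(inputs))
logits = head(hidden)                                  # (8, 32, vocab_size)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()                                        # gradients flow with zero labeling cost
```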
Until then, however, sensors are king, and anyone selling you "self-driving" without them is lying to you :)
I think you may be over-indexing on the word "selling". I didn't mean it literally as in for sale to you (the customer) directly. That is what Tesla FSD is claiming and I agree with you that we're some indeterminate amount of time away from it.
However Waymo, Cruise and others do exist. If you haven't already, check out JJRicks videos on YouTube. I think you might be changing the number of years in your estimation ;)
Each time I see functional FSD, it is in a very specific and limited scope. Ultra-precise maps, low speeds, good roads, a suitable climate, and a system that can just bail and stop the car are common themes. I would also be interested to hear whether places with Waymo have traffic rules where pedestrians/cyclists have priority without relying on traffic signs.
> If you had a driving sim wheel and four monitors (one for each direction) plus three smaller ones for the rear-view mirrors, connected to a real-world car with sufficiently high-definition cameras, you could probably drive the car remotely about as well as you could in real life, all because the images would map to the same latent space.
I disagree. When in a car, we are using more than our eyes. We have sound as well, of course, something that provides feedback even in the quietest cars. We also have the ability to feel vibration, gravity, and acceleration. Sitting in a sim without at least some of these additional forms of feedback would be a different skill.
There was an event where they took the top iRacing sim driver and put him in a real F1 car, and he was able to do VERY well in terms of lap times.
There was another event where they took another sim driver and put him in a real drift car, and he was able to drift very well.
Both videos are on YouTube. Yes, real-world driving has more variables, and yes, the racing drivers had force-feedback wheels, but in general, if a person is able to control a car so precisely as to put the virtual wheel in the right square foot of the virtual track to take a corner optimally, it's likely that most people could drive very well solely from visual feedback. Sound and IMUs can provide additional corrective information, but the key point remains: whatever software runs has to deduce physics from visual images.
I recommend watching this NOVA video on human perception. When doing any number of tasks, especially ones we do commonly, we're using a ton of unconscious perception and prediction based upon our internal representation of physics and human modeling.
For example, when I was younger I noticed that I was commonly aware that a car was going to get over before it did so. I kept an eye out trying to determine why this was the case, and I noticed two things. One is that people commonly turn their head and check the mirrors before they even signal to get over. The other is that they'll make a slight jerk of the wheel in that direction before making the lane change.
The assertion "Self-driving isn't a sensor problem, it's a software problem" is hard to support today. Your human-vision analogy leaves out a lot of both sensor and processing differences between what we call machine vision and human vision.
Even if parity with human vision can be attained, humans kill 42,000 other American humans each year on the roads. If human driven cars were invented today, and pitched as killing only 42,000 people per year, the inventor would get thrown into a special prison for supervillains.
Not much would change. The idiotic idea of removing traffic lights in favor of self-driving cars zipping past each other forgets about those pesky pedestrians we should be designing cities for.
When I wrote the comment, I was envisioning the current world, but with some Bluetooth-type protocol that cars could use to send beacons to help other cars near them.
The most basic example of how this could be helpful is if the car ahead of you turns a sharp corner and crashes into a truck stopped in the road. Without car-to-car networking, you won't brake until the crash is in your line of sight.
Have you ever seen those YouTube videos of massive car pile-ups on highways caused by a crash, followed by a cascade of additional crashes, e.g. in icy conditions or dense fog? What if the original crash could communicate to cars behind it? Wouldn't that be helpful when the crash isn't yet in the driver's (or car's) line of sight?
I agree "not much would change" overnight. It's just another input for the car's software to have at its disposal.
With the current hardware on the roads, I don't think it's technically possible for autos to achieve legitimate self-driving (if that's even the goal anymore?) - there are way too many edge cases that are way too difficult to solve for with just software.
And what happens if there is a child on the road? Or are we going to need implanted transmitter chips in the future, so we can safely go outside and not be run over by "smart" cars?
Even if every car is required to be part of the network, there may be badly maintained cars that don't work properly, or even malicious cars that send wrong data on purpose.
Something more is necessary if "self-driving" is going to actually live up to its name at some point in the future, and I don't think the answer is 100% software.
At this point it's all about edge cases. Certain edge cases are impossible to overcome with software + cameras alone.
Most humans can drive fairly well in a heavy downpour, solely from the brake lights of the car ahead and occasional glimpses of road markings. That's almost equivalent to a very poor sensor suite.
For this to work, either (1) the network has to be reliable, and all cars have to be trustworthy (both from a security and fault tolerance perspective), or (2) the cars have to be safe even when disconnected from the network, such as during an evacuation.
We already know for sure that we can’t solve (1), which means we have to solve (2). Therefore, car-to-car communication is, at best, a value add, not the enabling technology.
> Imagine if Car A could improve its own understanding of the environment using inputs/sensor data from nearby Car B.
You can't rely on this in real time because urban canyons make it hard to get consistent cell signal (for one thing), but you can definitely improve your models on this data once the data's been uploaded to your offline systems, and some SDC companies do this.
A system of this sort could use some local-area networking (think infrared, RF, or even lasers) to create an ad-hoc mesh network. It's how I imagine cars being networked in the future, at least.
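Purely as a sketch of what such a beacon might carry -- the field names and JSON encoding are made up for illustration, not taken from any real V2V standard:

```python
# Hypothetical ad-hoc V2V beacon payload; fields and encoding are invented
# for illustration, not drawn from any real standard.
import json
import time

def make_beacon(vehicle_id, lat, lon, speed_ms, heading_deg, hazard=None):
    """Build a small broadcastable status/hazard message."""
    return json.dumps({
        "id": vehicle_id,
        "ts": time.time(),          # sender timestamp
        "pos": [lat, lon],
        "speed": speed_ms,          # metres per second
        "heading": heading_deg,
        "hazard": hazard,           # e.g. "crash", "ice", or None
    }).encode()

# A crashed car could broadcast something like this so cars behind it,
# still out of line of sight, can start slowing down early.
payload = make_beacon("car-123", 52.5200, 13.4050, 0.0, 270.0, hazard="crash")
print(payload)
```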