> If the verifier finds any unsafe code, the program is rejected and not executed. The verifier is rigorous -- the Linux implementation has over 20,000 lines of code [0] -- with contributions from industry (e.g., Meta, Isovalent, Google) and academia (e.g., Rutgers University, University of Washington).
/* bpf_check() is a static code analyzer that walks eBPF program
* instruction by instruction and updates register/stack state.
* All paths of conditional branches are analyzed until 'bpf_exit' insn.
*
* The first pass is depth-first-search to check that the program is a DAG.
* It rejects the following programs:
* - larger than BPF_MAXINSNS insns
* - if loop is present (detected via back-edge)
...
I haven't inspected the code, but I thought that checking for infinite loops would imply solving the halting problem. Where's the catch?
I'm not able to comment on what this code is doing, but as for the theory:
The halting problem is only unsolvable in the general case. You cannot prove that any arbitrary piece of code will stop, but you can prove that specific types of code will stop and reject anything that you're unable to prove. The trivial case is "no jumps"—if your code executes strictly linearly and is itself finite then you know it will terminate. More advanced cases can also be proven, like a loop over a very specific bound, as long as you can place constraints on how the code can be structured.
As an example, take a look at Dafny, which places a lot of restrictions on loops [0], only allowing the subset that it can effectively analyze.
Adding on (and it's not terribly relevant to eBPF), it's also worth noting that there are trivial programs you can prove DON'T halt.
A trivial example[1]:
int main() {
while (true) {}
int x = foo();
return x;
}
This program trivially runs forever[2], and indeed many static code analyzers will point out that everything after the `while (true) {}` line is unreachable.
I feel like the halting problem is incredibly widely misunderstood to be similar to be about "ANY program" when it really talks about "ALL programs".
[1]: In C++, this is undefined behavior technically, but C and most other programming languages define the behavior of this (or equivalent) function.
The halting problem cannot be solved in the general case, but in many cases you can prove that a program halts. eBPF only allows verifiably-halting programs to run.
the halting problem is only true for _arbitrary_ programs
but there are always sets of programs for which it is clearly possible to guarantee their termination
e.g. the program `return 1+1;` is guaranteed to halt
e.g. given program like `while condition(&mut state) { ... }` with where `condition()` is guaranteed to halt but otherwise unknown is not guaranteed to halt, but if you turn it into `for _ in 0..1000 { if !condition(&mut state) { break; } ... }` then it is guaranteed to halt after at most 1000 iterations
or in other words eBPF only accepts programs which it can proof will halt in at most maxins "instruction" (through it's more strict then my example, i.e. you would need to unroll the for-loop to make it pass validation)
the thing with programs which are provable halting is that they tend to also not be very convenient to write and/or quite limited in what you can do with them, i.e. they are not suitable as general purpose programming languages at all
This, others have said it less concisely, but a program without loops and arbitrary jumps is guaranteed to halt if we assume the external functions it calls into will halt.
I'm glad to hear that Meta and Google code is "rigorous". I'd prefer INRIA, universities that fund theorem provers, industries where correctness matters like aerospace or semiconductors.
Windows doesn't use the Linux eBPF verifier, they have their own implementation named PREVAIL[0] that is based on an abstract interpretation model that has formal small step semantics. The actual implementation isn't formally proven, however.
The halting problem is exhaustive, there isn't an algorithm that is valid for all programs. You can still check for some kinds of infinite loops though!
More specifically, you can accept a set of programs that you are certain do halt, and reject all others, at the expense of rejecting some that will halt. As long as that set is large enough to be practical, the result can be useful. If you eg forbid code paths that jump "backwards", you can't really loop at all. Or require loops to be bounded by constants.
I have no insight into this particular project but you could work around the halting problem by only allowing loops you can proof will not go infinite. That would of course imply rejecting loops that won't go infinite but can't be proven not to.
If the verifier can't determine that the loop will halt, the program is disallowed. Also, if the program gets passed and then runs too long anyway, it's force-halted. So... I guess that solves the halting problem.
So this "solves" the halting problem by creating a new class "might-not-halt-but-not-sure" and lumping it with "does-not-halt". I find it hard to believe the new class is small enough for this to be useful, in the sense that it will avoid all kernel crashes.
I rather expect useful or needed code would be rejected due to "not-sure-it-halts", and then people will use some kind of exception or not use the verifier at all, and then we are back to square one.
Well it is useful in practice, there are some pretty useful products based on eBPF on Linux, most notably Cilium (and, shameless plug for the one I’m working on: Parca, an eBPF-based CPU profiler).
Bad wording on my part, and I still don't know how to word it better. I'm sure this thing is useful, I don't think everyone who contributed code was just clueless.
However, the claim "in the future, computers will not crash due to bad software updates, even those updates that involve kernel code" must be false. There is no way it is true. Whatever Cilium is, I cannot believe it generally prevents kernel crashes.
Correct, you will never be able to write any possible arbitrary code and have it run in eBPF. It necessarily constrains the class of programs you can write. But the constrained set is still quite useful and probably includes the crowdstrike agent.
Also, although this isn't the case now, it's possible to imagine that the verifier could be relaxed to allow a Turing-complete subset of C that supports infinite loops while still rejecting sources of UB/crashes like dereferencing an invalid pointer. I suspect from reading this post that that is the future Mr. Gregg has in mind.
> Whatever Cilium is, I cannot believe it generally prevents kernel crashes.
It doesn't magically prevent all kernel crashes from unrelated code. But what we can say is that Cilium itself can't crash the kernel unless there are bugs in the eBPF verifier.
My point is that the verifier could be relaxed to accept programs that never halt, thus not needing to solve the halting problem. You could then have the kernel just kill it after running over a certain maximum amount of time.
Why do you think the kernel crashes when crowdstrike attempts to reference some unavailable address (or whatever it does) instead of just denying that operation and continuing on? That would be the solution using this philosophy "just kill long running program". And no need for eBPF or anything complicated. But it doesn't work that way in practice.
This is just such a naive view. "We can prevent programs from crashing by just taking care to stop them when they do bad things". Well, sure, that's why you have a kernel and userland. But it turns out, some things need to run in the kernel. Or "just deny permission". Then it turns out some programs need to run as admin. And so on.
There is a generality in the halting problem, and saying "we'll just kill long runing programs" just misses the point entirely.
Likely what will happen is that you will kill useful long-running programs, then an exception mechanism will be invented so some programs will not be killed, because they need to run longer, then one of those programs will go into an infinite loop despite all your mechanisms preventing it. Just like the crowdstrike driver managed to bring down the OS despite all the work that is supposed to prevent the entire computer crashing if a single program tries something stupid.
> Why do you think the kernel crashes when crowdstrike attempts to reference some unavailable address (or whatever it does) instead of just denying that operation and continuing on?
Linux and windows are completely monolithic kernels; the crowdstrike agent isn't running in a sandbox and has complete unfettered access to the entire kernel address space. There is no separate "the kernel" to detect when the agent does something wrong; once a kernel module is loaded, IT IS the kernel.
Lots of people have indeed realized this is undesirable and that there should be a sandboxed way to run kernel code such that bugs in it can't cause arbitrarily bad undefined behavior. Thus they invented eBPF. That's precisely what eBPF is.
I don't know whether it's literally true that someday you will be able to write all possibly useful kernel-mode code in eBPF. But the spirit of the claim is true: there's a huge amount of useful software that could be written in eBPF today on Linux instead of as kernel modules, and this includes crowdstrike. Thus Windows supporting eBPF, and crowdstrike choosing to use it, would have solved this problem. That set of software will increase as the eBPF verifier is enhanced to accept a wider variety of programs.
Just like you can write pretty much any useful program in JavaScript today -- a sandboxed language.
You're also correct that due to the halting problem, we'll either have to accept that eBPF will never be Turing complete, OR accept that some eBPF programs will never halt and deal with the issues in other ways. Just like Chrome's JavaScript engine has to do. I don't really view this as a fundamentally unsolvable issue with the nature of eBPF.
The claim isn't that eBPF generally prevents kernel crashes. It's that it prevents crashes in the subset of programs it's designed for, in particular for instrumentation, which Crowdstrike is (in this author's conception) an instance of.
It's referring to Windows security software. If you have a lot of context with eBPF, which Gregg obviously does, the notion that eBPF will subsume the entire kernel doesn't even need to be said: you can't express arbitrary programs in eBPF. eBPF is safe because the verifier rejects the vast majority of valid programs.
It is not, programs that are accepted are proved to terminate. Large and more complex programs are accepted by BPF as of now, which might give the impression that it's now Turing complete, when it is definitely not the case.
I should clarify that individual eBPF programs have to terminate, but more complex problems can be solved with multiple eBPF programs, and can be "scheduled" indefinitely using BPF timers
> If the verifier finds any unsafe code, the program is rejected and not executed. The verifier is rigorous -- the Linux implementation has over 20,000 lines of code [0] -- with contributions from industry (e.g., Meta, Isovalent, Google) and academia (e.g., Rutgers University, University of Washington).
[0] links to https://github.com/torvalds/linux/blob/master/kernel/bpf/ver... which has this interesting comment at the top:
I haven't inspected the code, but I thought that checking for infinite loops would imply solving the halting problem. Where's the catch?