Fun fact: there have been 3 soundness bugs in Lean 4 so far. (They were all fixed within hours.) I expect we have not yet found them all, but I also don't lose sleep worrying that mathematics will come tumbling down, because these are invariably implementation bugs (either in the literal sense of the implementation of the system, or in the implementation of mathematics in the type theory). If something is wrong, it is almost certainly going to be the system and not the mathematics. But I'm saying this as someone who works on the proof assistant itself (i.e. hold the mathematics fixed, vary the proof assistant). Kevin Buzzard will say the exact opposite because he is working on the mathematics (vary the mathematics, hold the proof assistant fixed); in that case the likely failure modes are that a given proof is written poorly, a definition has incorrect behavior on edge cases, etc., but only the current proof is under threat, not some completely unrelated result, e.g. a proof of the infinitude of primes deep in the library.
If this is your situation, you should absolutely be asking more questions on Zulip. It is really easy to get guidance on how to use mathlib, what things exist and where they are.
The issue with stacked casts is mostly solved by the `norm_cast` tactic. Again, ask more questions on Zulip - even if you don't ask about this in particular, if you suggest it in passing, or your code gives indications of an unnecessarily complicated proof style, you will get suggestions about tactics you may not be aware of.
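For a flavor, here is a minimal sketch of the kind of goal it handles (assuming a full mathlib import; the statements are illustrative, not taken from the thread):

```lean
import Mathlib

-- Stacked casts ℕ → ℤ → ℝ get collapsed automatically:
example (n : ℕ) : ((n : ℤ) : ℝ) = (n : ℝ) := by norm_cast

-- A hypothesis about naturals can close a goal about their images in ℝ:
example (a b : ℕ) (h : a ≤ b) : (a : ℝ) ≤ (b : ℝ) := by exact_mod_cast h
```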
One way you can focus a question like this, if you don't know what techniques to use but just have a feeling that formalization is too hard, is to isolate an example where you really had to work hard to get a proof and your proof is unsatisfying to you, and challenge people to golf it down. These kinds of questions are generally well received, and everyone learns a lot from them.
I'm familiar with `norm_cast` and `push_cast`, but they don't always do what I expect or solve every problem. Then there's also the issue (in my experience at least) that in order to define, e.g., the real exp or cos function, you need to compose the complex function with the real projection and then manually use theorems such as "for all reals r, exp(r_inj_c(r)) = r_inj_c(re_exp(r))". These are easy enough to prove, but it's still more work than in "regular" mathematics, where you just say "re_exp = exp restricted to the reals" (and then implicitly use the fact that exp maps reals to reals).
I can usually find a way around such things, but it does make proofs more tedious. As I said, I'm sure I'm not using Lean optimally, but I wouldn't know what to ask for specifically; it's a case of "unknown unknowns".
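For what it's worth, mathlib itself takes exactly this composition route and then pays the bookkeeping cost only once, by tagging the compatibility fact for the tactics (sketch; I believe the lemma is called `Complex.ofReal_exp`, but treat the names as illustrative):

```lean
import Mathlib

-- mathlib defines `Real.exp x` (roughly) as `(Complex.exp x).re`, and the
-- "exp maps reals to reals" fact is proved once and tagged
-- `@[simp, norm_cast]`, so the tactics apply it silently thereafter:
example (x : ℝ) : ((Real.exp x : ℝ) : ℂ) = Complex.exp (x : ℂ) :=
  Complex.ofReal_exp x  -- or simply `by norm_cast`
```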
> One way you can focus a question like this if you don't know what techniques to use but just have a feeling that formalization is too hard, is to isolate an example where you really had to work hard to get a proof and your proof is unsatisfying to you, and challenge people to golf it down.
Fair, that's something I could try. For example, my proof that cos 2 ≤ -1/3 (which is used for showing that cos has a zero in (0,2) with which I can define pi) is unreasonably complicated, while the proof on paper is just a couple of lines. There are a bunch of techniques used when estimating series, such as (re)grouping terms, that I found difficult to do rigorously.
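To make the estimate concrete (my own reconstruction of the standard paper argument, not code from the project): the partial sum of the cosine series through the $2^4/4!$ term is already $-1/3$, and the tail, grouped in pairs, is nonpositive:

```latex
\cos 2 = \sum_{n \ge 0} \frac{(-1)^n 2^{2n}}{(2n)!}
       = \underbrace{1 - 2 + \tfrac{2}{3}}_{= -1/3}
         - \left(\tfrac{2^6}{6!} - \tfrac{2^8}{8!}\right)
         - \left(\tfrac{2^{10}}{10!} - \tfrac{2^{12}}{12!}\right) - \cdots
       \le -\tfrac{1}{3}
```

since $2^{2n}/(2n)!$ is decreasing for $n \ge 1$, each bracket is nonnegative. It is exactly that regrouping of an infinite sum that is fiddly to justify formally.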
No, Lean is not suitable for axiomatic investigations, it comes with too much baggage from "classical foundations". As Randall said above, Lean is axiomatically much stronger than NF, and that's even with "no axioms"! You can use Lean to prove things about axiom systems, but you have to model the axiom system explicitly as a "deep embedding" with syntax and a typing judgment. For metatheory work like the one reported on here this is exactly what you want, but if you want to actually work in the theory then it's an extra layer of indirection which makes things a lot more cumbersome compared to using Lean's own logic.
Metamath is much more configurable in this regard, you just directly specify the axiom system you want to work in and there is no special status given to first order logic or anything like that.
> Speaking of Miri, is the long term goal to say for certain whether or not a program execution encountered UB? (Which is, of course, different than verifying it before execution at compile time.)
That's more than a long term goal, that's the present behavior, to the extent that the UB rules are defined in the first place. Miri will tell you when it has to make approximating assumptions (e.g. accessing FFI or using nondeterministic operations), and it doesn't happen very often. This is very much the intent of the tool.
Yes, that is incorrect. If you write `partial def foo : Nat := foo + 1` it will be accepted, but `def foo : Nat := foo + 1` will not. So while Lean checks that functions terminate by default, it is possible to define and run programs containing general recursion. This is subject to some restrictions, though:
* You can't unfold the definition to try to prove `foo = foo + 1` (which is of course false for any natural number), it is an "opaque" definition and its value for specification purposes is essentially arbitrary and does not need to match the definition.
* Even then there is a possibility of proving false things, as in `partial def loop : False := loop`, so to prevent inconsistency the target type (`Nat` in the previous example, `False` in this one) must be inhabited (established automatically by the typeclass machinery, via an `Inhabited` instance). So it would reject the `loop` example but not `foo`.
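A minimal Lean 4 sketch of both restrictions:

```lean
-- Accepted: `Nat` has an `Inhabited` instance, so `partial` is allowed.
-- But `foo` is opaque: `example : foo = foo + 1 := rfl` would not
-- typecheck, and evaluating `foo` simply loops forever.
partial def foo : Nat := foo + 1

-- Rejected: `False` has no `Inhabited` instance, so this definition
-- (which would otherwise "prove" `False`) is refused:
-- partial def loop : False := loop
```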
The jargon is a bit confusing sometimes. In Lean, `refl` does a whole lot more than prove x = x. It is of course available if you want to prove x = x, but the real power of `refl` is that it also proves x = y where x and y are definitionally equal. Or at least that's the idea; it turns out that Lean's definitional equality relation is not decidable, so sometimes it will fail even when x and y are defeq, and this is the theoretically distasteful aspect that came up on the linked Coq issue. In practice the theoretical undecidability never bites; however, there is a related problem: depending on what gets unfolded, a proof by `refl` can take anywhere from seconds to minutes, and if alternative external proof checkers don't follow exactly the same unfolding heuristics, a seconds-long proof can turn into a thousand-year goose chase. By comparison, methods like `simp` have much more controllable performance because they actually produce a proof term, so they tend to be preferable.
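A small illustration of the distinction, in plain Lean 4 (no mathlib):

```lean
def double (n : Nat) : Nat := n + n

-- `rfl` proves more than syntactic `x = x`: the two sides only need to be
-- *definitionally* equal, i.e. reduce to a common form by unfolding.
example : 2 + 2 = 4 := rfl
example : double 3 = 6 := rfl

-- `simp` instead builds an explicit proof term out of rewrite steps,
-- which is why its performance is easier to predict and audit:
example : double 3 = 6 := by simp [double]
```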
Thanks for the explanation. Is the defeq undecidability a bug of Lean that should be fixed in the future? Or is it an intentional design decision for it to function properly for other types of proofs?
Defeq undecidability is a feature of Lean in the sense that it is a conscious design decision. As we have seen both in this thread and elsewhere, this design decision puts off some people interested in the foundations of type theory from working with Lean. However, it turns out that for people interested in other kinds of questions (e.g. the mathematics of Scholze), defeq undecidability is of no relevance at all.
Here's an analogy. It's like saying "Gödel proved that maths is undecidable, therefore maths is unusable and you shouldn't prove theorems." The point is that Gödel's observation is of no practical relevance to a modern number theorist and does not present any obstruction in practice to the kind of things that number theorists think about.
I am quite confident that the developers of Lean consider it a feature (or at least, not a bug). Though I'm also not sure why they are building on a complex metatheory like CiC if they are willing to accept undecidable typechecking, since you can simplify a lot of things if you give that up.
I think one of the reasons is that Lean's approximation of defeq still works well enough in practice. As mentioned by others, you never really run into the kind of counterexamples that break theoretical properties, but rather into examples where checking defeq is unacceptably slow, which remains an issue even when defeq is decidable.
From my understanding, CiC is used because it allows a unified language for proofs, specifications, programs, and types, which greatly simplifies the system. For example, in Lean 4, more or less the entire system is built on the same core term language: proofs, propositions/specifications, programs, tactics (i.e. metaprograms), the output of tactics, and even Lean 4 itself are all written in that same language, even if they don't always use the same method of computation. In my eyes, for the purposes of writing and checking proofs, CiC is quite complex; but if you want to capture all of the above and have it all checked by the same kernel, CiC seems like a fairly minimal metatheory, even with undecidable type checking.
For example, if you are willing to accept undecidable typechecking, you can define inductive types in a much simpler theory (inductive types and dependent pattern matching are by far the most complex parts of CiC): https://homepage.cs.uiowa.edu/~astump/papers/from-realizabil... .
There are a few other examples but this is the one that immediately sprung to mind. The complexity of CiC is very much about decidable typechecking, it's not fundamental to writing a sound dependently typed proof assistant. And anyone who has actually worked on a kernel for CiC knows that while it is "minimal" compared to what people want to do with it, no efficient kernel is nearly as minimal as something like λP2 with dependent intersections.
Thanks! But does that not already answer your question as to why Lean would use CIC and not a simpler metatheory?
That particular paper is from 2016 and the type theory presented there is still lacking a number of features from CIC, with those particular extensions being listed as further research. Work on Lean started in 2013 and the kernel in particular has not been touched in a long time for good reason. Around that time, CIC was already battle-proven with there being popular and well-tested reference implementations to learn from. Since then, all the fancy work has been done on the system surrounding the kernel.
I believe one of the goals at the start was that the kernel should be sufficiently simple and small that it is easy to ensure it is correct. Despite the redundant complexity you point to, the implementation still seems to be simple enough to reach this goal, as no proof of false resulting from a kernel bug has been reported yet. CIC being well-tested surely also contributed to that.
Another reason is that, as I understand it, it was not always obvious that Lean would break decidability of defeq. Afaik proof irrelevance was only baked in with Lean 3, while it was an option before that. By that point I don't see much of a reason to rewrite an entire kernel that has worked fine so far.
I also vaguely remember that there were arguments related to performance in favor of the current system (e.g. versus W-types), but I don't really understand these things well enough to make that point with confidence :)
How are the ergonomics of a proof assistant based on λP2? Lean does a lot of elaboration of metavariables and typeclass instances and such before handing things off for checking by the kernel -- I'm curious about whether λP2's undecidability is decidable enough, so to speak.
That's currently an open research question, but the answer most likely is that the ergonomics could be made just as good, considering that elaboration of metavariables, typeclass instances, etc. already uses possibly-undecidable algorithms distinct from those used in the kernel in just about every proof assistant I'm familiar with (Lean included).
In terms of stuff that does go more or less directly through the kernel, some follow-up papers to the one I linked discuss how to translate inductive definitions etc. automatically into a form accepted by the simpler kernel, so I am even more confident that this would not present a problem.
The fact remains though that Lean 3 worked well enough to do Scholze level mathematics. I never quite know what to make of people saying "if the designers had done it in some other way, it would have been better". Of course if someone comes along doing it another way, and then they go on to also achieve Scholze level mathematics, then I will be all ears.
That's a very strange line of argument. Many great programs have been written in C. That doesn't mean that there is no way to improve on C. Perhaps to even the playing field, this hypothetical competitor should be given the resources of Microsoft Research as well :)
I think Kevin is asserting that to make strong comparative claims about the quality of either system, it's good to have a large and functionally similar body of work to compare the systems with, not that there is no way to improve on Lean's core :)
MSR funds a single developer. The Lean community consists of at least 30 or so active contributors right now, with many more having contributed in the past. Most folks work in academia. Much of Lean's success is absolutely due to the work of that single developer, but I don't think that the playing field is uneven due to MSR's funding.
> implementation in C++ rather than a nice functional language for dogfooding
To be fair, the main author of Lean has been cloistered for two years writing the next version, Lean 4, which is written in its own (pure functional) language. The siren song is strong :)
For those who are more into watching talks on YouTube than reading papers, and are interested in the low level hardware formalization aspect of this project, there is a recording of the ITP 2019 presentation here: "x86 verification from scratch" https://www.youtube.com/watch?v=7hAShC6K_vA
Software gets more complicated, so bugs become more common. Hardware gets faster, so speed pressure on software is reduced, so software gets more layered (and hence slower), and also more complicated. More layers of abstraction mean more things to get wrong, and the world is increasingly reliant on software, so correctness becomes a serious concern.
That sums up the "software crisis" as it relates to correctness. Most theorem provers have been swept up in it. They are written in high-level languages and frameworks made by people who paid no thought to verification, and only slightly more than most to correctness. What's more, even some languages in the verification world have dubious correctness or semantics: Rust and Dafny both make a big deal about formal correctness but have no formal spec. (Rust is working on one.) Coq is not known to be consistent (the literature is full of "here is a proof that some subsystem of Coq has this and that property", but no one can handle the whole thing), not to mention that it has had a few soundness bugs in its history, one of which (the reals violate EM) wasn't even clearly a bug, because no one knew what the model was supposed to be.
Agda is a free-for-all as far as I can tell, and Isabelle has some strange thing with axiomatic typeclasses that takes it out of the HOL world. HOL Light is pretty decent metatheory-wise, and the logic at least has been formalized inside itself, but the implementation has not, and OCaml's `Obj.magic` is still there, ready to cause inconsistency if you feel like using it.
Lean has not had any soundness bugs to date, and the theory is well understood, but it is a massive C++ application and there are doubtless dozens of implementation bugs in there waiting to be found.
None of the big theorem provers is really a viable target for a correctness proof, because either it is not known what that would even mean, or it's trivially false and we all try to pretend otherwise, or it's an intractable open question.
These are the paragons of correctness in our world.
Can we do better? What does it even mean to do better? This paper aims to find out.
Lean has independent proof checkers, and Coq could have, too. The fact that these are big systems doesn't harm the de Bruijn principle: you only need to trust the checkers…
(btw I couldn't find the source code for the proof checker in the paper, is it available?)
The fast proof checker I talk about is mm0-c: https://github.com/digama0/mm0/tree/master/mm0-c . It is (deliberately) bare bones, and I'm planning on formalizing approximately the assembly that you get from running that through gcc.
There is a second proof checker for MM0 in mm0-hs: https://github.com/digama0/mm0/tree/master/mm0-hs , which has a few more features like the MM1 server mode to give interactive feedback, as well as a verifier that reads a text version of the proof file (not described in the paper, the .mmu format).
The footnotes are supposed to be links to the github repo but I got the versions mixed up and submitted a partially anonymized version to arXiv.
Do we even know if it's possible to prove correctness of a theorem prover, in a very general sense? At some point, that would imply that some theorem prover would have to prove its own correctness (either by proving itself directly, or through a loop of theorem provers). This seems intuitively like it starts to run up against Gödel-style incompleteness problems.
It does, but only if you state the theorem in a particular way. Let's say that you have a verifier A, that checks theorems in Peano Arithmetic (PA). Inside that logic, you can define what a computer is, how it executes, what PA is, what a verifier is, and what it means for a program to verify theorems of PA. Then you can have a sequence of bytes (verifier B), and prove that they represent a program that verifies theorems of PA.
What have you proven? Well, verifier A (using PA) proves that "verifier B proves theorems of PA". That means that "if B says phi is true, PA |- phi". We would have a hard time actually running verifier B inside the logic of verifier A (that entails running all the steps of the computation B does as theorem applications inside verifier A), but even if we did, we would obtain a proof of "PA |- phi". If we assume A correctly implements PA, then that means PA |- "PA |- phi". In general, this is the best we can do. If we assume further that PA is a sound axiom system, i.e. when it proves facts about numbers then we won't find concrete counterexamples, then this strengthens to PA |- phi and then to phi itself, so we've learned that phi is true.
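In symbols (writing $\square$ for "PA proves"), the chain above is:

```latex
\mathrm{PA} \vdash \square\varphi
\;\Longrightarrow\;
\mathrm{PA} \vdash \varphi
\;\Longrightarrow\;
\varphi
\qquad (\text{each step assuming PA is sound})
```

and Gödel's second incompleteness theorem is essentially why that soundness assumption cannot itself be discharged inside PA.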
The plan is to prove inside verifier A (that is, MM0) the statement "Here is a sequence of bytes. It is an executable that when executed has the property that if it succeeds on input "T |- phi", then T |- phi." The bootstrapping part is that the sequence of bytes in question is in fact the code of verifier A. In order to support statements of the form "here is a sequence of bytes", the verifier also has the ability to output bytes as part of its operation (it's not just a yes/no oracle), and assert facts about the encoding of those bytes in the logic.
Seems like the ideal would be a number of theorem provers, each capable of verifying the next, with the first one simple enough to be verified with pen and paper.
Provers in the LCF tradition have tiny cores that avoid having to trust lots of code. There are also reflective provers that can admit self-verified extensions. One project, called Milawa, had several layers of this.
(I'm the author of the paper BTW.) Regarding TLA+ and CakeML: TLA+ seems like a good idea, Leslie Lamport talks a lot of sense in most of his papers. I especially recommend "Composition: A Way to Make Proofs Harder" (https://lamport.azurewebsites.net/pubs/lamport-composition.p...). It sums up a lot of my feelings about "simplifying" a problem by adding more layers and generalizing the layers so that they have more work to do individually. I'm not sure I'm totally on board the idea of using temporal logic; my plan is to just use plain first order logic to talk about the states of a computer.
Regarding CakeML, I'm really impressed by the work they've done, and I think they more than anyone else are leading the charge on getting verification as pervasive as possible at every level (especially the low levels, where historically it has been difficult to get past the closed source barrier). That said, I think they are not doing an optimal job at being a bootstrap. They have a system that does that and so much more that if you just want a bootstrap it's way overkill. And you pay for that overkill - 14 hour compilation per target (see https://cakeml.org/regression.cgi).
(I edited my comment to remove my questions about TLA+ and CakeML, when I got to the "Related work" part of the paper. I was also wondering about Milawa but couldn't remember the name! Kudos sir!)