Not directly, but COPRA, which they compare against (and beat by about 3% of theorems proved), is based on GPT-4 in an agent framework. And in the COPRA paper, they compare against using GPT-4 directly as a one-shot prover, and find it proves only 10% of theorems in miniF2F, as opposed to 23% with their agent-based approach. So if the evals line up correctly, we would see GPT-4 one-shot proving 10.6% of theorems in miniF2F, as opposed to Llemma-7b proving 26.23%, a pretty significant improvement. Still not as good as other specialized tools though, especially when you look at tools in other theorem proving languages (see my other comment for more detail about a cross-language comparison).
