Nice work on these benchmarks, Simon. I’ve followed your blog closely since your great talk at the AI Engineers World Fair, and I want to thank you for all the high-quality content you share for free. It’s become my primary source for keeping up to date.
I’ve been working on a few benchmarks to test how well LLMs can recreate interfaces from screenshots (https://github.com/alechewitt/llm-ui-challenge). From my basic tests, it seems GPT-5.2 is slightly better at these UI recreations. For example, in the MS Word replica it implemented the undo/redo buttons as well as the bold/italic formatting that GPT-5.1 handled, and it generally seemed a bit closer to the original screenshot (https://alechewitt.github.io/llm-ui-challenge/outputs/micros...).

In the VS Code test, it also added the tabs that weren’t visible in the screenshot! (https://alechewitt.github.io/llm-ui-challenge/outputs/vs_cod...).
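For anyone curious about the setup: the idea is just to show the model a screenshot and ask it to rebuild the interface. Below is a minimal sketch of that kind of call using the OpenAI Python client. It is not the actual harness from the repo, and the model name and the single-HTML-file output format are my assumptions here.

    import base64
    from openai import OpenAI

    client = OpenAI()

    def recreate_ui(screenshot_path, model="gpt-5.2"):
        # Illustrative sketch only, not the benchmark's real harness.
        # Assumes this model name is available through the API.
        with open(screenshot_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        response = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Recreate this interface as a single self-contained "
                             "HTML file (inline CSS and JS, no external assets)."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        # The reply is expected to be the HTML document itself.
        return response.choices[0].message.content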
My team at Amazon is currently hiring software engineers, scientists, and machine learning engineers to help us control and optimize Amazon's growing fleet of wind and solar farms. We are a new team with a lot of scope ahead of us.
You can see some more info below:
135 GW was the total installed electricity generation capacity in France in 2009.
Of that, 80.1% was from nuclear power, or 108 GW.
The situation has worsened: nowadays nuclear is only about 70%, i.e. 61 GW.
And that is only the installed power. In terms of energy produced, it's even worse. The 61 GW of nuclear can at least run almost 24 hours a day, i.e. roughly 1,464 GWh per day, while solar panels work at best half the time, and in practice much less; a lot of their energy is also lost in the conversions required (storage in batteries, building and recycling those batteries, etc.). That may be acceptable for mobile applications, but for powering factories or stationary computers it's a waste.
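Rough numbers to make the capacity-factor point concrete (the ~15% solar capacity factor below is my own ballpark assumption for France, not a sourced figure):

    # Daily energy from 61 GW of nuclear running ~24h/day,
    # versus the same 61 GW of installed solar at an assumed ~15% capacity factor.
    nuclear_gw = 61
    solar_capacity_factor = 0.15  # assumption / rough ballpark, not a sourced figure

    nuclear_gwh_per_day = nuclear_gw * 24                         # = 1464 GWh/day
    solar_gwh_per_day = nuclear_gw * 24 * solar_capacity_factor   # ~220 GWh/day

    print(nuclear_gwh_per_day, solar_gwh_per_day)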
But my point is that this is far from enough to replace carbon-based energy, and that we're too late to build the number of nuclear plants that would be needed worldwide.