I also think a lot of coding benchmarks, and perhaps even RL environments, don't account for the messy back-and-forth of real-world software development, which is why there's always a gap between the promise and the reality.
I have had a user story and a research plan, and only realized deep into the implementation that a fundamental detail about how the code works was missing (specifically, that the types and SDKs are generated from the OpenAPI spec). Because that detail was missing (I didn't read carefully enough), the plan was wrong and the implementation was a mess.
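To make that concrete, here's a hypothetical sketch (made-up names and spec, not the actual project) of why that one detail mattered:

```ts
// user.gen.ts -- GENERATED from the OpenAPI spec (e.g. by a tool like
// openapi-typescript); hand edits here get overwritten on the next codegen run.
export interface User {
  id: string;
  email: string;
}

// A plan that says "add `displayName` to User" has to change the OpenAPI
// spec and regenerate, not patch this file:
//
//   components:
//     schemas:
//       User:
//         type: object
//         properties:
//           displayName:
//             type: string
```

If the plan doesn't know the types are codegen output, every change lands in the wrong place.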
Yeah, I agree. There's a lot more needed than just the User Story. One way I'm thinking about it: the "core" is the deliverable business value, and the "shells" are the context required for the fine-grained details (rough sketch below). There will likely need to be a step that verifies the result against the acceptance criteria.
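A rough sketch of the shape I have in mind (all names hypothetical, just to pin it down):

```ts
// Hypothetical "core + shells" unit of work: the core is the deliverable
// business value, the shells are context needed for fine-grained detail.
interface UnitOfWork {
  core: {
    userStory: string;            // the deliverable business value
    acceptanceCriteria: string[]; // what "done" means
  };
  shells: {
    researchPlan: string;    // how the change maps onto the codebase
    codebaseNotes: string[]; // e.g. "types/SDKs are generated from the OpenAPI spec"
  };
}

// The verification step: check the result against each acceptance criterion.
function verify(work: UnitOfWork, passes: (criterion: string) => boolean): boolean {
  return work.core.acceptanceCriteria.every(passes);
}
```

The shells are exactly where a detail like "the types are generated" would live.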
I hope to back up this hypothesis with actual data and experiments!
An abstraction for this that seems promising to me, because it's complete but still compact, is a User Story paired with a research plan(?).
This works well for many kinds of applications and emphasizes shipping concrete business value for every unit of work.
I wrote about some of it here: https://blog.nilenso.com/blog/2025/09/15/ai-unit-of-work/