LangGraph and LlamaIndex Workflows are generating a lot of buzz right now, and we wanted to see how they measure up in practice versus just writing the code. To do that, we took a straightforward agent architecture—one we've built and deployed in code without a framework—and implemented it using LangGraph and Workflows. Our main goal was to explore how these frameworks translate a simple agent design into their abstractions and assess the impact on the development and debugging process.
We want to share our findings with the community, providing practical examples and honest observations about these frameworks: where they introduce friction and where they shine. There's a lot of hype out there, and we hope to offer some clarity with real code examples and unbiased perspectives.
For context, we’ve been running our own Co-pilot agent/assistant in production for about eight months. We’ve also helped clients troubleshoot their assistants at scale, so we’ve seen a wide range of use cases and challenges.
The architecture we tested is a single-tier LLM router, a pattern we see often in client implementations. A single LLM router uses function calling to route requests to tasks or skills; a skill might make its own LLM call before returning control to the router. It's a simple but versatile pattern.
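In plain code, that router loop is only a few lines. Here's a minimal, stubbed sketch (the function and skill names are illustrative, not from our repo; `router_llm` stands in for the actual function-calling model):

```python
# Single-tier LLM router sketch. A router "LLM" picks a skill via
# function calling; each skill may make its own LLM call, then
# control returns to the router.

def summarize_skill(query: str) -> str:
    # In the real agent this would wrap another LLM call.
    return f"summary of: {query}"

def sql_skill(query: str) -> str:
    # e.g. text-to-SQL generation followed by execution
    return f"SQL results for: {query}"

SKILLS = {
    "summarize": summarize_skill,
    "run_sql": sql_skill,
}

def router_llm(query: str) -> str:
    # Stub for the function-calling router model: it returns the name
    # of the skill to invoke. A real implementation would pass SKILLS
    # to the model as tool/function schemas and parse its tool call.
    return "run_sql" if "sql" in query.lower() else "summarize"

def run_agent(query: str) -> str:
    skill_name = router_llm(query)      # router chooses a skill
    result = SKILLS[skill_name](query)  # skill runs (may call an LLM)
    return result                       # control returns to the caller

print(run_agent("generate sql for monthly revenue"))
```

The frameworks express the same loop as a graph (LangGraph) or an event-driven workflow (LlamaIndex Workflows); the comparison in the linked write-up is essentially about what those abstractions buy you over this plain version.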
Here's the Towards Data Science write-up we did on the project: https://towardsdatascience.com/choosing-between-llm-agent-frameworks-69019493b259
Full code: https://github.com/Arize-ai/phoenix/tree/main/examples/agent_framework_comparison
Hot take #1: For experienced developers, framework abstractions can add unnecessary complexity.
Hot take #2: Built-in parallelism, while promising, can complicate debugging a lot.
Hot take #3: For less experienced development teams without existing scaffolding, these frameworks can offer useful structure, at least in the POC phase.
We're repeating this process now with CrewAI and AutoGen - learnings to follow soon.
And if you want to dig into the logs of any of these, we've published the traces captured with Arize Phoenix below.
Pure code: https://phoenix-demo.arize.com/projects/UHJvamVjdDo2
LangGraph: https://phoenix-demo.arize.com/projects/UHJvamVjdDoy
LlamaIndex Workflows: https://phoenix-demo.arize.com/projects/UHJvamVjdDo1
We’re curious to hear what others think. What’s been your experience with these frameworks, and how do they compare to rolling your own agent solutions?