TL;DR
- Built the same multi-agent task on all 6 frameworks: LangGraph, Google ADK, Claude SDK, AutoGen, CrewAI, Semantic Kernel
- Same LLM (Claude Haiku 4.5 via AWS Bedrock) and same search tool (Serper API) across every framework
- Ran each benchmark on 3 environments: Local MacBook, AWS EC2 (t3.medium), and GCP VM (e2-medium)
- 3 parallel researchers + 1 analyst — same task structure, same prompts, same tools
- Found a real connector bug in Semantic Kernel on AWS Bedrock — documented, root-caused, and fixed
What Is an Agent?
A regular AI answers your question. An AI agent goes further — it breaks down your goal, searches for information, uses tools, and takes actions step by step until the job is done.
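The loop described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not any framework's API: `TOOLS`, `fake_model`, and `run_agent` are hypothetical names, and the scripted model stands in for a real LLM call.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> plain Python function.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda q: f"results for '{q}'",
}

def fake_model(history: list[str]) -> dict:
    """Stand-in for an LLM: requests one search, then finishes."""
    if not any(line.startswith("tool:") for line in history):
        return {"action": "search", "input": history[0]}
    return {"action": "finish", "input": f"answer based on {history[-1]}"}

def run_agent(goal: str, max_steps: int = 5) -> str:
    """Ask the model for the next action, run the tool, feed the result back."""
    history = [goal]
    for _ in range(max_steps):
        step = fake_model(history)
        if step["action"] == "finish":
            return step["input"]
        observation = TOOLS[step["action"]](step["input"])
        history.append(f"tool:{observation}")
    return "stopped: step limit reached"
```

A real agent swaps `fake_model` for an LLM call and keeps the same shape: decide, act, observe, repeat, with a step limit as a safety net.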
What Is an Agent Building Framework?
An agent building framework is a toolkit that provides the structure, tools, and components to build AI agents faster and more easily. Instead of building everything from scratch, you get pre-built components like LLM integrations, tool management, memory handling, and execution loops, so you can focus on what makes your agent unique.
The Benchmark
To go beyond feature checklists, I built the same multi-agent task across all 6 frameworks:
- 3 parallel researchers — each assigned one query, searches the web, and summarizes findings:
- Researcher 1: "Main AI agent frameworks available in 2025 overview comparison"
- Researcher 2: "Key features capabilities architecture of AI agent frameworks 2025"
- Researcher 3: "Real world use cases production deployments AI agent frameworks 2025"
- 1 analyst — synthesizes all 3 research outputs into a final report
- Same LLM: Claude Haiku 4.5 via AWS Bedrock across every framework
- Same search tool: Serper API across every framework
Run across 3 environments: Local (MacBook), AWS EC2 (t3.medium, us-east-1), and GCP (e2-medium).
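The task shape (3 parallel researchers feeding 1 analyst) can be sketched framework-free with `asyncio`. This is only an illustration: the researcher and analyst bodies are placeholders that sleep or join strings instead of calling Serper or the LLM, and the speedup formula (sum of per-researcher times divided by the wall-clock time of the parallel phase) is an assumed definition, since the article does not spell out how speedup was computed.

```python
import asyncio
import time

QUERIES = [
    "Main AI agent frameworks available in 2025 overview comparison",
    "Key features capabilities architecture of AI agent frameworks 2025",
    "Real world use cases production deployments AI agent frameworks 2025",
]

async def researcher(query: str) -> tuple[str, float]:
    """Placeholder researcher: sleeps instead of calling search and the LLM."""
    start = time.perf_counter()
    await asyncio.sleep(0.05)  # stands in for network-bound work
    return f"summary of: {query}", time.perf_counter() - start

async def analyst(summaries: list[str]) -> str:
    """Placeholder analyst: joins the summaries instead of calling the LLM."""
    return "\n".join(summaries)

async def run_benchmark() -> tuple[str, float]:
    wall_start = time.perf_counter()
    # Fan out: all three researchers run concurrently.
    results = await asyncio.gather(*(researcher(q) for q in QUERIES))
    parallel_wall = time.perf_counter() - wall_start
    # Fan in: the analyst sees all three summaries.
    report = await analyst([summary for summary, _ in results])
    # Assumed definition: back-to-back time / parallel wall-clock time.
    speedup = sum(elapsed for _, elapsed in results) / parallel_wall
    return report, speedup
```

Running `asyncio.run(run_benchmark())` returns the combined report and a speedup that approaches 3.0 when the three researchers overlap almost perfectly.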
1. LangGraph — by LangChain
LangGraph is a stateful, directed graph-based execution framework for building LLM-driven workflows and agents. It gives you full control to design agent logic by representing workflows as nodes and edges — each node performs a task and edges determine what happens next based on current state.
Supports Python and JavaScript/TypeScript. Works with any LLM: OpenAI, Groq, Anthropic, Gemini, and open-source models.
Key Features
- Stateful Graphs — each node carries information forward, enabling continuous memory and context across workflow steps
- Cyclical Graphs — supports workflows where steps repeat, essential for complex agent runtimes
- Human-in-the-Loop — pause execution at any node and require human approval before proceeding
- Tool Integration — deep integration with LangChain's tool ecosystem, supporting custom functions and MCP-compatible tools
- LangSmith Monitoring — built-in observability platform for tracking execution flows, costs, latency, and performance
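The stateful and cyclical graph ideas above can be illustrated with a tiny hand-rolled executor. This is plain Python, not LangGraph's API: `NODES`, `route`, and `run_graph` are hypothetical names chosen for the sketch, and the review rule is arbitrary.

```python
from typing import Callable

State = dict  # shared state passed from node to node

def draft(state: State) -> State:
    """Node: produce (or revise) a draft and count the attempt."""
    attempts = state.get("attempts", 0) + 1
    return {**state, "attempts": attempts, "draft": f"draft v{attempts}"}

def review(state: State) -> State:
    """Node: approve once the draft has been revised at least once."""
    return {**state, "approved": state["attempts"] >= 2}

NODES: dict[str, Callable[[State], State]] = {"draft": draft, "review": review}

def route(current: str, state: State) -> str:
    """Edges: draft -> review; review loops back to draft until approved."""
    if current == "draft":
        return "review"
    return "END" if state["approved"] else "draft"

def run_graph(entry: str, state: State, max_steps: int = 10) -> State:
    """Run nodes in sequence, following the edge chosen from current state."""
    node = entry
    for _ in range(max_steps):
        state = NODES[node](state)
        node = route(node, state)
        if node == "END":
            return state
    raise RuntimeError("graph did not reach END within max_steps")
```

Calling `run_graph("draft", {})` walks draft, review, draft, review before ending, which is the cycle that a purely linear pipeline cannot express. A human-in-the-loop pause would fit naturally as another routing decision before `END`.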
Pros
- Maximum control — you define every node, edge, and conditional transition
- Deep observability — LangSmith provides full visibility into traces, costs, and latency
- Production-ready — designed for reliable, complex systems at scale
Cons
- Steeper learning curve — you need to understand graphs, nodes, and state logic
- More setup required — agents need explicit design of nodes, edges, and transitions
- Higher effort for simple tasks — overkill if you just need a basic agent
Benchmark Results

LangGraph is the leanest framework tested — under 1 MB memory across all environments. EC2 total time of 32.93s was the fastest single result in the entire benchmark. Consistent 2.45-2.93x parallel speedup.
LangGraph is a good fit for:
- Complex workflows that need maximum reliability and control
- Systems where you want to see exactly what's happening at every step
- Teams already using LangChain who want seamless integration
- Human-in-the-loop workflows where the framework handles state and waiting
2. Google ADK — by Google
Google ADK (Agent Development Kit) is a flexible framework for building, managing, evaluating, and deploying AI-powered agents. Supports Python, TypeScript, Java, and Go. Optimized for Gemini models but works with other LLMs through its BaseLLM interface.
Key Features
- Multi-Agent System Design — build applications with multiple specialized agents that coordinate, delegate, and collaborate
- Flexible Orchestration — sequential, parallel, or loop agents alongside LLM-driven dynamic routing
- Rich Tool Ecosystem — custom functions, built-in tools, external APIs, and MCP as both consumer and producer
- Native Streaming Support — real-time bidirectional streaming for text and audio via Gemini Live API
- Integrated Developer Tooling — built-in CLI and Developer UI for running, inspecting, and debugging agents locally
- OpenTelemetry Tracing — emits traces to any OTel-compatible backend, giving you a hierarchical span view of LLM reasoning, tool calls, and external API requests end-to-end
Pros
- Easy to build multi-agent systems — built-in support for multiple agents working together
- Great for Google ecosystem — easy to deploy with Google Cloud and Gemini models
- Real-time streaming — supports live text and audio interactions out of the box
Cons
- Best with Gemini — most features and optimizations are built around Gemini models
- Less control than LangGraph — high-level abstractions make deep customization harder
- Highest memory usage in benchmark — 85-88 MB peak across all environments
Benchmark Results

Google ADK uses 85-88 MB peak memory — significantly more than every other framework tested. The speedup is estimated at 3.0x because ADK's native ParallelAgent runs all researchers internally and doesn't expose per-agent timing.
ADK is a good fit for:
- Teams building within the Google ecosystem who want rapid deployment
- Multi-agent systems that need real-time streaming (text and audio)
- Projects requiring parallel task execution across multiple specialized agents
3. Claude Agent SDK — by Anthropic
Claude SDK is a Python and TypeScript/Node.js framework that handles the entire autonomous agent execution loop automatically. Works exclusively with Claude models.
Key Features
- Autonomous Agent Loop — handles the entire execution loop automatically via a single query() call
- Built-in Production Tools — pre-built tools like Read, Write, Edit, Bash, Glob, and Grep require zero setup
- Fine-Grained Permission Control — auto-approve specific tools, block others, or require approval for everything
- Smart Context Management — automatic context compaction and prompt caching to reduce cost and latency
- Session Flexibility — sessions can be continued, resumed, or forked even across different hosts
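The permission model in the list above (auto-approve some tools, block others, ask for the rest) boils down to a small decision function. The sketch below is a generic illustration, not the SDK's actual configuration surface: `Decision` and `decide` are hypothetical names, and the precedence (block wins over allow) is an assumption.

```python
from enum import Enum

class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    ASK = "ask"

def decide(tool: str, allowed: set[str], blocked: set[str]) -> Decision:
    """Blocked tools are denied, allowed tools pass, the rest need approval."""
    if tool in blocked:
        return Decision.DENY
    if tool in allowed:
        return Decision.ALLOW
    return Decision.ASK
```

With `allowed={"Read"}` and `blocked={"Bash"}`, a `Read` call goes straight through, `Bash` is refused, and `Write` waits for a human.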
Pros
- Zero setup for coding tasks — built-in tools mean you can start building coding agents immediately
- Production ready — automatic error handling, session management, and monitoring from day one
- Cost efficient — automatic prompt caching reduces cost and latency for repeated information
Cons
- Claude only — works exclusively with Claude models, no support for other LLMs
- Best for code tasks — less ideal for general-purpose or domain-specific workflows
- Requires Claude Code CLI on every machine — the SDK is a Python wrapper around the claude CLI process; without the CLI installed, the SDK has nothing to execute. Deploying to EC2 or GCP means installing the CLI there too, not just the Python package
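Because of the CLI dependency noted above, a deployment script can fail fast with a preflight check. This is a minimal sketch using only the standard library; `cli_available` is a hypothetical helper, and the binary name `claude` is taken from the article rather than verified against any particular install.

```python
import shutil

def cli_available(binary: str = "claude") -> bool:
    """Return True if the named executable is found on PATH."""
    return shutil.which(binary) is not None
```

Calling `cli_available()` at startup on a fresh EC2 or GCP instance turns a confusing runtime failure into a clear "install the CLI first" message.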
Benchmark Results

Claude SDK shows the most consistent parallel efficiency across environments — 2.96x locally, 2.90x on EC2, 2.85x on GCP. Memory footprint is under 1 MB across all environments, matching LangGraph.
Claude SDK is a good fit for:
- Automated code review agents
- Bug fixing and code generation tasks
- Teams already using Claude Code CLI who want SDK access
- When you need production-ready tools with automatic cost and latency optimization
4. AutoGen — by Microsoft
AutoGen is an open-source framework for building agents that collaborate through conversational patterns to accomplish tasks. Supports Python and .NET. Works with any OpenAI-compatible endpoint — Groq, Azure OpenAI, and local models.
Key Features
- Three-Layer Architecture — Core (scalable distributed network), AgentChat (conversational AI assistants), Extensions (expandable capabilities)
- Multiple Predefined Agent Types — User Proxy Agent, Assistant Agent, and Tool/Function Agent with distinct roles
- Flexible Conversation Patterns — one-to-one, group chat, and hierarchical conversations where agents can delegate tasks
- AutoGen Studio — visual tool to rapidly prototype multi-agent workflows
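The group chat pattern above can be pictured as agents taking turns over a shared transcript until one of them signals completion. The sketch below is plain Python rather than AutoGen's API: `writer`, `critic`, and `group_chat` are hypothetical, and the "approve after two drafts" rule is arbitrary.

```python
from typing import Callable

Agent = Callable[[list[str]], str]  # takes the transcript, returns a message

def writer(transcript: list[str]) -> str:
    """Scripted agent that always submits a draft."""
    return "writer: here is a draft"

def critic(transcript: list[str]) -> str:
    """Scripted agent that approves once the writer has spoken twice."""
    drafts = sum(m.startswith("writer:") for m in transcript)
    return "critic: APPROVE" if drafts >= 2 else "critic: please revise"

def group_chat(agents: list[Agent], task: str, max_turns: int = 8) -> list[str]:
    """Round-robin: agents take turns until one says APPROVE or turns run out."""
    transcript = [f"user: {task}"]
    for turn in range(max_turns):
        message = agents[turn % len(agents)](transcript)
        transcript.append(message)
        if "APPROVE" in message:
            break
    return transcript
```

`group_chat([writer, critic], "compare frameworks")` produces a draft, a revision request, a second draft, and an approval. The `max_turns` cap matters in practice because conversational agents can otherwise loop indefinitely.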
Pros
- Easy multi-agent collaboration — agents work together through conversational patterns
- No-code prototyping — AutoGen Studio lets you visually prototype and test workflows without writing extra code
- Works with multiple LLMs — compatible with any OpenAI-compatible endpoint
Cons
- Documentation and community gaps — smaller community than the more established frameworks, with fewer tutorials, Stack Overflow answers, and real-world examples outside Microsoft's own docs
Benchmark Results

AutoGen delivers solid, consistent performance — 40.99s locally, 46.29s on EC2, 53.32s on GCP. Memory at 10.56-10.99 MB is moderate. Parallel speedup of 2.71-2.83x is reliable across environments.
Note: Microsoft positions Microsoft Agent Framework (MAF) as the enterprise-ready successor to both AutoGen and Semantic Kernel.
AutoGen is a good fit for:
- Research-style workflows and collaborative problem solving
- Multi-stage validation pipelines where agents review and critique each other's outputs
5. CrewAI — by CrewAI Inc.
CrewAI empowers developers to build production-ready multi-agent systems by combining collaborative Crews with precise control via Flows. Supports Python only. Works with any model supported via LiteLLM.
Key Features
- Flows — reliable, stateful workflows for long-running processes and complex logic
- Autonomous Crews — teams of agents that plan, execute, and collaborate to achieve high-level goals
- Role-Based Agent Design — agents defined by role, goal, and backstory to guide behavior
- Enterprise Security — designed with security and compliance in mind
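Role-based design, as listed above, means an agent is described by a role, a goal, and a backstory that steer the model's behavior. A rough sketch of the idea follows. `RoleAgent` and its prompt template are hypothetical and do not reflect how CrewAI actually assembles prompts.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RoleAgent:
    """Hypothetical role-based agent definition."""
    role: str
    goal: str
    backstory: str

    def system_prompt(self) -> str:
        """Render the three fields into a single system prompt string."""
        return f"You are {self.role}. {self.backstory} Your goal: {self.goal}"
```

For example, `RoleAgent("a research analyst", "synthesize three research summaries into one report", "You have reviewed agent frameworks for years.").system_prompt()` yields one prompt string that can be passed to any LLM.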
Pros
- Ease of use — move from idea to execution quickly with role, goal, and backstory customization
- Great for prototyping — simple enough to get a multi-agent system running fast
- Customizable agent creation — role-based definitions increase task performance
Cons
- Limited native tool integrations — fewer ready-made connectors for niche tools
- Not production-ready at scale — lacks the mature monitoring and debugging tooling needed for large systems
Benchmark Results

CrewAI is a good fit for:
- Multi-agent systems where clear role separation and structured task delegation are the primary design requirements
- Role-based automation workflows like onboarding, scheduling, and multi-step approvals
- Teams that need rapid prototyping with simple-to-mid-scale agent setups
6. Semantic Kernel — by Microsoft
Semantic Kernel is a framework for building AI agents using reusable plugins and tools. Available in Python, C#, and Java. Works with OpenAI, Azure OpenAI, and custom endpoints, with deep integration into Azure AI services.
Key Features
- Multiple Agent Types — ChatCompletionAgent, OpenAIAssistantAgent, AzureAIAgent, and custom agents
- Plugin-Driven Architecture — collections of functions and tools exposed to AI models
- Native Azure Integration — deep integration with Azure AI services, Monitor, and Application Insights
- Flexible Invocation Modes — both streaming and non-streaming agent invocation
Pros
- Deep Azure integration — seamless monitoring via Azure Monitor and Application Insights
- Flexible plugin architecture — supports native plugins, MCP plugins, and OpenAPI plugins
- Enterprise-grade observability — best monitoring option for teams already in the Azure ecosystem
Cons
- Poor parallelism support — limited ability to run multiple agents simultaneously
- Complex memory management — different agent types use different memory systems
- AWS Bedrock multi-tool bug — hard crash when LLM batches multiple tool calls (see below)
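The multi-tool issue in the list above concerns turns where the model requests several tool calls at once. The root cause is not detailed here, so the sketch below only shows a general defensive pattern: answer every call in the batch, keyed by its id, and report failures as results instead of raising. The dict shapes, `PLUGINS`, `plugin`, and `handle_tool_calls` are all hypothetical and are not Semantic Kernel or Bedrock types.

```python
from typing import Callable

# Hypothetical plugin registry: function name -> callable.
PLUGINS: dict[str, Callable[..., str]] = {}

def plugin(fn):
    """Register a function so the model can call it by name."""
    PLUGINS[fn.__name__] = fn
    return fn

@plugin
def web_search(query: str) -> str:
    """Placeholder search tool."""
    return f"results for {query}"

def handle_tool_calls(calls: list[dict]) -> list[dict]:
    """Answer every call in the batch, preserving each call's id."""
    results = []
    for call in calls:
        fn = PLUGINS.get(call["name"])
        if fn is None:
            content, status = f"unknown tool: {call['name']}", "error"
        else:
            content, status = fn(**call["arguments"]), "success"
        results.append({"id": call["id"], "status": status, "content": content})
    return results
```

The design choice is that the number of results always equals the number of requested calls, so the next model turn never sees an unanswered tool request.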
Benchmark Results

Semantic Kernel is a good fit for:
- Organizations already using Azure Monitor and Application Insights for observability
- Applications requiring deep integration with Microsoft's cloud services
Benchmark Results — All Environments
Local (MacBook)

AWS EC2 (t3.medium, us-east-1)

GCP

What the Numbers Tell You
Lowest memory: LangGraph and Claude SDK both stay under 1 MB peak heap — extremely lean for production workloads with memory constraints.
Highest memory: Google ADK uses 85-88 MB consistently across all environments — more than 100x the memory of LangGraph for the same task.
Most consistent parallel efficiency: Claude SDK achieves 2.85-2.96x speedup across all 3 environments — the tightest range in the benchmark.
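To reproduce a "peak heap" figure on your own workload, one option is Python's built-in `tracemalloc`. This is an assumption about method, since the measurement tool is not stated above, and `tracemalloc` only sees allocations made through Python's allocator, not native memory held by C extensions. `measure_peak_mb` is a hypothetical helper.

```python
import tracemalloc

def measure_peak_mb(fn, *args, **kwargs):
    """Run fn and return (result, peak traced Python heap in MB)."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak_bytes / (1024 * 1024)
```

Wrapping each framework's entry point in `measure_peak_mb` gives comparable numbers, as long as every run starts in a fresh process so earlier imports do not skew the peak.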
Choosing the Right Framework
| Factor | LangGraph | Google ADK | Claude SDK | AutoGen | CrewAI | Semantic Kernel |
|---|---|---|---|---|---|---|
| Creator | LangChain | Google | Anthropic | Microsoft | CrewAI Inc. | Microsoft |
| Languages | Python, JS/TS | Python, TS, Java, Go | Python, Node.js | Python, .NET | Python | Python, C#, Java |
| LLM Support | Any | Gemini-optimized; others via BaseLLM | Claude only | Any OpenAI-compatible | Any via LiteLLM | OpenAI, Azure, custom |
| Best For | Complex workflows, full control | Google ecosystem, streaming | Coding agents | Research, collaboration | Role-based multi-agent | Azure/Microsoft ecosystem |
| Monitoring | LangSmith | OpenTelemetry (OTel) | Cost/turn controls | AutoGen Studio | Limited | Azure Monitor |
There is no single best framework — only the right one for your use case:
- LangGraph for complex workflows needing full control and maximum observability
- Google ADK if you're building in the Google ecosystem and need streaming
- Claude SDK for coding-focused agents with the most consistent parallel performance
- AutoGen for collaborative, research-style multi-agent workflows
- CrewAI for role-based systems where clarity of agent responsibilities matters most
- Semantic Kernel if you're already deep in the Microsoft/Azure world
Agent frameworks are moving fast. The developers experimenting now will be the ones leading the next wave of AI-powered products. Pick your framework, run the same task I ran, and see how it behaves on your workload. The numbers will tell you everything feature lists won't.




