claude codetoken analysisai internalsprompt cachingdeveloper toolscost optimization

I Analyzed 165.9 Million Tokens to Understand How Claude Code Actually Works

V
Vishal MakwanaSDE-1
Mar 10, 2026·15 min read
I Analyzed 165.9 Million Tokens to Understand How Claude Code Actually Works

Everyone talks about what Claude Code does. Nobody explains how. I spent days reverse-engineering the internals by analyzing real session data, and what I discovered changed how I think about AI-assisted development.

This post is for engineers who want to understand the internal mechanics — not the marketing pitch. We'll look at real JSONL files, real token counts, and real cost breakdowns from 4 days of heavy Claude Code usage across multiple projects.


Architecture Overview

AI-assisted coding requires solving a fundamental tension: the AI model runs in the cloud, but your code lives on your machine. Claude Code's architecture resolves this cleanly:

Developer Terminal
       ↕
   stdin/stdout
       ↕
  Node.js CLI (Local) ——— File I/O, Shell ———→ Local Filesystem
       ↕
   HTTPS / SSE
       ↕
  Anthropic API (Cloud)
       ↕
   Inference
       ↕
   GPU Cluster

The key insight: Claude Code is a local Node.js CLI that acts as an orchestration layer between you and Anthropic's API. The AI proposes actions (read file, edit code, run command), and the CLI executes them locally in a security sandbox.

One important thing to understand: the GitHub repo contains the full source code — Claude Code was open-sourced in 2025. What we can analyze beyond the code itself is the session data it produces — and that's exactly what I did.


Session Storage & Folder Structure

Claude Code stores every conversation as append-only JSONL files in your local filesystem. There are two default locations:

  • ~/.claude/projects/ — the old default
  • ~/.config/claude/projects/ — the new default (after a breaking change in Claude Code)

Path Encoding

Your project directory gets encoded into the folder name by replacing / with -:

/Users/john/Desktop/projects/saas-platform
→ -Users-john-Desktop-projects-saas-platform

Real Directory Structure

Here's what a real project with multiple sessions looks like:

~/.claude/projects/-Users-john-Desktop-projects-saas-platform/
├── 1888752c-cd02-48f6-848e-ccf8567e690f.jsonl     ← Session 1 (8.9 MB, 4,260 lines)
│   └── subagents/
│       ├── agent-a62021154333096fe.jsonl            ← Explore agent (Haiku)
│       ├── agent-a70c323807aca87dd.jsonl            ← Explore agent (Haiku)
│       ├── agent-acompact-295a86e6a5d3ef66.jsonl   ← Compact agent (Sonnet)
│       └── ... (17 subagent files total)
├── 2fed1aa9-0d0a-48f3-8f96-a9715076752a.jsonl     ← Session 2 (3.5 MB)
└── 5dbbdf47-81b5-46a6-a203-ca95a41fae19.jsonl     ← Session 3 (4.4 MB)

Each session file is named with its UUID session ID. Subagent files live in a subagents/ subdirectory under each session.

Why Append-Only JSONL?

The design choice of append-only JSONL is deliberate:

  • Crash-safe — partial writes don't corrupt earlier data
  • Streamable — you can tail the file in real-time
  • Easy to parse — one JSON object per line, no complex structure
  • Natural audit log — every event is preserved in order

JSONL Session Format — The 6 Record Types

Every line in a session JSONL file has a type field. After analyzing thousands of lines across multiple sessions, I identified exactly 6 record types:

1. file-history-snapshot

Saves file state before edits, enabling undo. Appears at the start of each user turn.

2. user

Your messages to Claude. Contains metadata like sessionId, cwd (current working directory), version (Claude Code version), and gitBranch. No token data.

3. system

System prompt injections — this is where CLAUDE.md content, tool definitions, and other context gets recorded. No token data.

4. progress

Streaming chunks during response generation. These are the incremental updates that make Claude's response appear word-by-word in your terminal. A single response generates 50-100+ progress records. No token data.

5. assistant

The final response record. This is the only record type that contains message.usage with token data. This is the goldmine.

{
  "type": "assistant",
  "message": {
    "model": "claude-sonnet-4-20250514",
    "usage": {
      "input_tokens": 3,
      "output_tokens": 169,
      "cache_creation_input_tokens": 716,
      "cache_read_input_tokens": 26894
    }
  },
  "costUSD": 0.00
}

Note: The costUSD field was removed after Claude Code v1.0.9 (June 2025) — it now always reads $0.00. You have to calculate costs from token counts yourself.

6. queue-operation

Task queue management for background tasks. Minimal data, used internally by the CLI.

Line Distribution in a Real Session

Here's the breakdown from one of my research sessions (408 lines total):

Record TypeCount% of Lines
progress36489.2%
assistant245.9%
user163.9%
file-history-snapshot41.0%
system20.5%

The takeaway: ~90% of JSONL lines are streaming progress records with no token data. Only the assistant records matter for usage analysis.


The 4 Token Types — Anatomy of a Response

Every assistant record contains exactly 4 token counts. Understanding what each one means is critical:

Real Data: 4-Day Breakdown (165.9M tokens)

Token TypeCount% of TotalWhat It Is
cache_read_input_tokens161,299,00097.2%Context retrieved from GPU cache
cache_creation_input_tokens4,246,2532.6%Context computed for first time
output_tokens328,4280.2%Claude's generated response
input_tokens64,8520.04%Actual new prompt content
Total165,938,533100%

Read that again: 97.2% of all tokens are cache reads. Your actual "new" content — the thing you typed plus Claude's response — is just 0.24% of total tokens.

The Hidden Sub-Object

Inside the assistant record, there's a nested cache_creation object that reveals which cache tier was used:

"cache_creation": {
    "ephemeral_1h_input_tokens": 690262,
    "ephemeral_5m_input_tokens": 0
}

From my analysis: 100% of cache creation tokens use the 1-hour cache tier. Zero tokens used the 5-minute tier. This has significant cost implications (more on this later).


Request Lifecycle — What Happens When You Send One Message

Let's trace exactly what happens when you type a message, using real data from my sessions.

Step-by-Step

  1. You type a message in the terminal (say: "Fix the auth bug")

  2. CLI assembles the full prompt (~33,000 tokens):

    • System prompt: ~10,000 tokens
    • Tool definitions: ~8,000 tokens
    • CLAUDE.md files: ~5,000 tokens
    • Full conversation history: ~10,000+ tokens
    • Your new message: ~20 tokens
  3. CLI marks the reusable prefix with cache_control: { type: "ephemeral" } so the API knows what can be cached

  4. Full payload sent to Anthropic API via HTTPS

  5. API checks KV cache for matching prefix → cache hit on 97%+ of tokens

  6. GPU computes only new tokens (your message + response)

  7. Response streams back (generating 50-100+ progress records in JSONL)

  8. If response contains tool_use (e.g., "Read file X") → CLI executes locally → result sent back → another API call

  9. Loop repeats until Claude responds with text only (no more tool calls)

  10. Final assistant record written to JSONL with complete usage data

The Tool Use Loop

User Message
    ↓
Assemble Prompt (~33K tokens)
    ↓
Send to Anthropic API
    ↓
┌── Response Type? ──┐
│                    │
↓ tool_use           ↓ text only
Execute Locally      Display Response
│                    │
↓                    ↓
Send Result ────→    Write to JSONL
(back to API)

Real Example: One Message = Multiple API Calls

When I typed "we need to deploy this buddy," it triggered:

API CallActionCache ReadCache CreateOutput
1Read deployment config22,71413487
2Edit Dockerfile22,714131,203
3Run build command22,71413892
4Final response22,714133,036

1 message → 4 API calls → 229,017 total tokens

Notice how cache_read is identical across all 4 calls — the same conversation prefix is reused every time.


The Caching Mechanism — Engineering Deep Dive

This is where Claude Code's engineering really shines. Understanding the caching system explains why the product economics work.

What is KV Cache?

When a transformer model processes text, each layer computes Key and Value matrices from the attention mechanism. These matrices represent the model's "understanding" of the input at each layer. Computing them is the most expensive part of inference.

KV Cache stores these pre-computed matrices in GPU RAM so they don't need to be recomputed when the same prefix appears again.

How Prefix Caching Works

Turn 1:  [System Prompt + Tools + CLAUDE.md + Message 1]
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          ALL computed fresh → stored in GPU cache

Turn 2:  [System Prompt + Tools + CLAUDE.md + Message 1 + Response 1 + Message 2]
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          Retrieved from cache (FREE)                                ^^^^^^^^^^^
                                                                     Computed fresh

Turn 10: [System Prompt + Tools + CLAUDE.md + Msg1 + R1 + ... + Msg10]
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          99.9% from cache (FREE)                                    ^^
                                                                     0.1% new

The Math

Without cache:

  • 27,000 token prefix × 27,000 attention calculations = 729,000,000 computations

With cache:

  • 3 new tokens × 27,000 = 81,000 computations

That's ~9,000x less compute. This is why Anthropic can offer flat-rate pricing on Pro and Max plans.

Cache Growth Timeline (Real Data)

TurnCachedNew Compute
Turn 10%100% (fresh compute)
Turn 295%5%
Turn 598.5%1.5%
Turn 1099.9%0.1%

Cache Tiers and Pricing

TierDurationWrite CostRead Cost
5-minute ephemeral5 min1.25× input price0.1× input price
1-hour ephemeral1 hour2× input price0.1× input price

Key finding from my analysis: Claude Code exclusively uses the 1-hour cache tier. Every single ephemeral_5m_input_tokens value was 0. This means cache writes cost 2× the input token price — but reads (97% of tokens) cost only 10%.

Tiered Pricing Above 200K Tokens

Anthropic applies different rates when context exceeds 200K tokens. This is another invisible detail the CLI handles for you.


Subagents & Multi-Model Routing

Claude Code doesn't use one model for everything. It runs a multi-model architecture that routes tasks to the cheapest capable model.

Model Routing Strategy

Task TypeModelCost per MTok (Input)
Complex reasoning, code generationOpus 4.6$15.00
Standard coding, editingSonnet 4.6$3.00
Research, file exploration, code searchHaiku 4.5$0.80
Context compressionSonnet 4.6$3.00

Agent Types Discovered

From analyzing subagent files, I found these agent types:

  • Explore — Uses Haiku for fast codebase search and file exploration
  • general-purpose — Uses Sonnet/Opus for complex multi-step tasks
  • Plan — Uses Sonnet for implementation planning
  • Compact agents — Uses Sonnet to compress conversation context

Real Data: SaaS Platform Session

My largest session (SaaS Platform project) tells the story:

  • 1,070 turns across the main conversation
  • 271 subagents spawned automatically
  • Each subagent gets its own JSONL file in subagents/
  • 793 duplicate records correctly filtered via deduplication

The Deduplication Challenge

When a subagent runs, its responses are logged in both the parent session file AND the subagent's own agent-*.jsonl file. Without deduplication, you'd double-count tokens and costs.

The dedup mechanism uses a composite key: messageId:requestId. Any record with the same composite key is counted only once.

From my research session:

  • 210 raw Haiku entries in JSONL files
  • 137 duplicates (same message in both parent + subagent file)
  • 73 unique entries after dedup

This multi-model routing is completely invisible to you as a developer. You don't configure it. The CLI decides when to spawn a cheap Haiku agent for research vs. using Opus for complex reasoning.

Subagent Architecture

Main Session (Opus/Sonnet)
├── Explore Agent (Haiku)     → subagents/agent-a620.jsonl
├── Explore Agent (Haiku)     → subagents/agent-a70c.jsonl
├── General Purpose (Sonnet)  → subagents/agent-a186.jsonl
└── Compact Agent (Sonnet)    → subagents/agent-acompact-295a.jsonl

Context Compression (Compaction)

What happens when your conversation gets too long? Claude Code has an elegant solution.

How It Works

When context approaches ~200K tokens, the CLI auto-triggers a compact agent that:

  1. Takes the entire conversation history
  2. Summarizes it into ~5K tokens, preserving key context
  3. Replaces the old history with the summary
  4. Continues the conversation with a fresh, smaller context

This is observable in JSONL as system messages containing summary content, and as agent-acompact-*.jsonl files in the subagents directory.

Trade-offs

  • Benefit: Sessions can run indefinitely without hitting context limits
  • Cost: Some detail is lost in compression
  • Frequency: My SaaS Platform session (1,070 turns) had 3 compression events

You can manually trigger compression with the /compact command when you notice context growing large.


CLI vs API — Technical Cost Analysis

This is the question everyone asks: Why is the CLI so much cheaper than the same usage on the API?

Real Data Comparison (4 Days, 4 Projects)

ProjectAPI Cost (Pay-Per-Token)CLI Cost
SaaS Platform$80.44
Backend API$41.65
Dev Tools$42.39
Chatbot Service$5.76
Total$170.51Flat-rate (Pro/Max plan)

Monthly projection at this usage rate: ~$1,280/month on API vs flat-rate pricing on CLI.

Per-Model Cost Breakdown (SaaS Platform, 2 Days)

ModelTokensAPI Cost
Sonnet 4.678.5M$31.43
Opus 4.622.0M$47.95
Haiku 4.54.8M$1.05
Total105.3M$80.44

The 5 Reasons CLI Is Cheaper

1. Automatic Prompt Caching

The CLI handles cache_control headers and prefix optimization automatically. Building this yourself means implementing cache-aware prompt assembly, managing cache TTLs, and optimizing prefix ordering. The CLI gets 97%+ cache hit rates out of the box.

2. Smart Model Routing

Haiku at $0.80/MTok for research tasks instead of Opus at $15/MTok for everything. The CLI dynamically routes to the cheapest model that can handle each sub-task.

3. Context Compression

Auto-compaction prevents context from growing unbounded. Without it, you'd hit the context limit and need to start a new session (losing all context) or pay for increasingly expensive API calls.

4. Subscription Economics

Humans type slower than API bots. Anthropic can offer flat-rate pricing because human typing speed creates a natural rate limit. A developer sending 50 messages/day with think time between them is predictable and bounded. An API script sending 50 requests/second is not.

5. Cache Reuse Within Sessions

Within a session, 97% of every API call is a cache read. The incremental cost of each additional message is tiny compared to the first message. This is why long sessions are surprisingly economical.

When API IS Better

The API is the right choice when you're:

  • Building customer-facing products
  • Need programmatic access at scale
  • Running custom pipelines or batch processing
  • Require fine-grained control over prompts and parameters

Design Patterns & Engineering Decisions

Stepping back, several design patterns emerge from this analysis:

PatternImplementationWhy
Append-only logJSONL filesCrash-safe, streamable, auditable
Security sandboxCLI executes, AI proposesNo direct filesystem access for the model
Cache-first architecture1-hour ephemeral cache, prefix optimization9,000x compute reduction
Multi-model routingHaiku/Sonnet/Opus per task type18x cost difference between cheapest and most expensive
Dedup by composite keymessageId:requestId hashHandles parent/child session overlap
Context compressionCompact agents with SonnetEnables infinite-length sessions

Practical Insights for Developers

Based on this analysis, here's what actually matters for your usage:

  1. Don't fear long conversations within a session. Cache reads are 97% of tokens and cost 10% of input price. The incremental cost of message N+1 is small.

  2. Be specific in prompts. Vague prompts cause more tool-use round-trips. "Fix the auth bug in src/auth/login.ts line 42" = 2 API calls. "Something's wrong with login" = 6+ API calls.

  3. Use /compact proactively. When you notice context growing large, manually trigger compression before the auto-trigger kicks in. You maintain more control over what gets preserved.

  4. Break big tasks into focused sessions. While long sessions are cheap, new sessions start with a fresh, small context — faster responses and more focused results.

  5. Subagents are free optimization. When the CLI spawns Haiku agents for research, that's 18x cheaper than using the main Opus model. Don't fight it.

  6. New sessions reset the cache. The first message in a new session is the most expensive (100% cache miss). Subsequent messages drop to 95%+ cache hits within 2-3 turns.


Engineering Takeaways

After analyzing 165.9 million tokens across 4 days of real usage, here's what I've concluded:

Claude Code is fundamentally a cache-optimized, multi-model orchestration layer. The CLI's value isn't the terminal UI — it's the invisible infrastructure that manages caching, model routing, context compression, and tool execution.

The 92% cost difference between API and CLI isn't a pricing trick. It's architectural. The CLI is purpose-built to exploit prefix caching (97% cache hits), multi-model routing (18x cost range), and human typing speed as a natural rate limiter.

The JSONL session format is a goldmine for anyone building custom tooling. Every API call, every token count, every model switch is recorded. If you want to build your own usage analytics, cost tracking, or session analysis — the data is sitting right there in your home directory.

Understanding these internals won't make you a better prompt engineer. But it will help you understand why certain patterns are cheap (long focused sessions), why others are expensive (starting fresh sessions frequently), and why Claude Code can offer flat-rate pricing that seems too good to be true.

It's not too good to be true. It's just good engineering.


All data in this post comes from real Claude Code sessions analyzed over 4 days across 4 active projects, totaling 165.9M tokens and $170.51 in equivalent API costs.

Tags:claude codetoken analysisai internalsprompt cachingdeveloper toolscost optimization