<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" 
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Systems and Strides</title>
    <link>https://shsin.blog</link>
    <description>Personal blog about AI, engineering and endurance activities.</description>
    <language>en-us</language>
    <lastBuildDate>Sat, 28 Mar 2026 14:04:34 GMT</lastBuildDate>
    <atom:link href="https://shsin.blog/feed.xml" rel="self" type="application/rss+xml"/>
    <generator>Next.js</generator>
    <managingEditor>mailme.shantanu@gmail.com (Shantanu Singh)</managingEditor>
    <webMaster>mailme.shantanu@gmail.com (Shantanu Singh)</webMaster>
    
    <item>
      <title><![CDATA[The 5 Stages of Claude Code Mastery]]></title>
      <link>https://shsin.blog/posts/claude-code-mastery-levels</link>
      <guid isPermaLink="true">https://shsin.blog/posts/claude-code-mastery-levels</guid>
      <pubDate>Fri, 27 Mar 2026 00:00:00 GMT</pubDate>
      <dc:creator>Shantanu Singh</dc:creator>
      <category>claude</category>
      <description><![CDATA[A field guide to the Dunning-Kruger curve of AI-assisted programming, from denial to mass enlightenment.]]></description>
      <content:encoded><![CDATA[<p>Andrej Karpathy tweeted nine words in February 2025 and accidentally started a religion:</p>

        <div class="tweet-embed-placeholder" data-tweet-id="1886192184808149383"></div>
      
<p>~34k likes. Collins Dictionary Word of the Year. A million LinkedIn posts about "the future of development." One year later, Karpathy hand-wrote his own app because Claude agents were "net unhelpful" for it.</p>
<p>That's the whole arc of AI-assisted coding in two paragraphs. But I've watched enough people walk this path that I can now identify exactly five stages.</p>
<div class="excalidraw-diagram" data-scene="{
  "type": "excalidraw",
  "version": 2,
  "source": "https://excalidraw.com",
  "elements": [
    {
      "type": "text",
      "version": 1,
      "id": "title",
      "x": 160,
      "y": 8,
      "width": 1080,
      "height": 55,
      "angle": 0,
      "strokeColor": "#1e1e1e",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 2,
      "roughness": 1,
      "opacity": 100,
      "roundness": null,
      "seed": 100,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1,
      "text": "The Dunning-Kruger Curve of AI-Assisted Coding",
      "fontSize": 42,
      "fontFamily": 1,
      "textAlign": "center",
      "verticalAlign": "top",
      "containerId": null,
      "originalText": "The Dunning-Kruger Curve of AI-Assisted Coding",
      "autoResize": true,
      "lineHeight": 1.25
    },

    {
      "type": "arrow",
      "version": 1,
      "id": "y-axis",
      "x": 140,
      "y": 800,
      "width": 0,
      "height": 720,
      "angle": 0,
      "strokeColor": "#1e1e1e",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 3,
      "roughness": 1,
      "opacity": 60,
      "roundness": { "type": 2 },
      "seed": 200,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1,
      "points": [[0, 0], [0, -720]],
      "lastCommittedPoint": null,
      "startBinding": null,
      "endBinding": null,
      "startArrowhead": null,
      "endArrowhead": "arrow"
    },
    {
      "type": "arrow",
      "version": 1,
      "id": "x-axis",
      "x": 140,
      "y": 800,
      "width": 1260,
      "height": 0,
      "angle": 0,
      "strokeColor": "#1e1e1e",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 3,
      "roughness": 1,
      "opacity": 60,
      "roundness": { "type": 2 },
      "seed": 300,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1,
      "points": [[0, 0], [1260, 0]],
      "lastCommittedPoint": null,
      "startBinding": null,
      "endBinding": null,
      "startArrowhead": null,
      "endArrowhead": "arrow"
    },
    {
      "type": "text",
      "version": 1,
      "id": "y-label",
      "x": 15,
      "y": 58,
      "width": 140,
      "height": 30,
      "angle": 0,
      "strokeColor": "#1e1e1e",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 1,
      "roughness": 1,
      "opacity": 65,
      "roundness": null,
      "seed": 400,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1,
      "text": "Confidence",
      "fontSize": 22,
      "fontFamily": 1,
      "textAlign": "left",
      "verticalAlign": "top",
      "containerId": null,
      "originalText": "Confidence",
      "autoResize": true,
      "lineHeight": 1.25
    },
    {
      "type": "text",
      "version": 1,
      "id": "x-label",
      "x": 1268,
      "y": 812,
      "width": 140,
      "height": 30,
      "angle": 0,
      "strokeColor": "#1e1e1e",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 1,
      "roughness": 1,
      "opacity": 65,
      "roundness": null,
      "seed": 500,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1,
      "text": "Actual Skill",
      "fontSize": 22,
      "fontFamily": 1,
      "textAlign": "left",
      "verticalAlign": "top",
      "containerId": null,
      "originalText": "Actual Skill",
      "autoResize": true,
      "lineHeight": 1.25
    },

    {
      "type": "line",
      "version": 1,
      "id": "curve-glow",
      "x": 200,
      "y": 120,
      "width": 990,
      "height": 560,
      "angle": 0,
      "strokeColor": "#e03131",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 30,
      "roughness": 0,
      "opacity": 14,
      "roundness": { "type": 2 },
      "seed": 550,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1,
      "points": [
        [0, 560], [20, 550], [40, 530], [60, 500], [80, 455],
        [100, 400], [120, 335], [140, 265], [160, 195], [180, 135],
        [200, 80], [220, 35], [240, 0],
        [260, 5], [280, 30], [300, 75], [330, 145], [360, 210],
        [400, 280], [440, 325], [480, 345],
        [520, 370], [560, 415], [600, 465], [640, 510], [660, 535], [680, 555],
        [700, 555], [720, 540], [750, 510], [780, 475], [810, 435],
        [840, 400], [870, 365], [900, 335], [930, 305],
        [950, 285], [970, 270], [990, 260]
      ],
      "lastCommittedPoint": null,
      "startBinding": null,
      "endBinding": null,
      "startArrowhead": null,
      "endArrowhead": null
    },
    {
      "type": "line",
      "version": 1,
      "id": "curve",
      "x": 200,
      "y": 120,
      "width": 990,
      "height": 560,
      "angle": 0,
      "strokeColor": "#e03131",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 6,
      "roughness": 1,
      "opacity": 100,
      "roundness": { "type": 2 },
      "seed": 600,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1,
      "points": [
        [0, 560], [20, 550], [40, 530], [60, 500], [80, 455],
        [100, 400], [120, 335], [140, 265], [160, 195], [180, 135],
        [200, 80], [220, 35], [240, 0],
        [260, 5], [280, 30], [300, 75], [330, 145], [360, 210],
        [400, 280], [440, 325], [480, 345],
        [520, 370], [560, 415], [600, 465], [640, 510], [660, 535], [680, 555],
        [700, 555], [720, 540], [750, 510], [780, 475], [810, 435],
        [840, 400], [870, 365], [900, 335], [930, 305],
        [950, 285], [970, 270], [990, 260]
      ],
      "lastCommittedPoint": null,
      "startBinding": null,
      "endBinding": null,
      "startArrowhead": null,
      "endArrowhead": null
    },

    {
      "type": "ellipse",
      "version": 1,
      "id": "dot-0",
      "x": 175,
      "y": 655,
      "width": 50,
      "height": 50,
      "angle": 0,
      "strokeColor": "#1971c2",
      "backgroundColor": "#a5d8ff",
      "fillStyle": "solid",
      "strokeWidth": 4,
      "roughness": 2,
      "opacity": 100,
      "roundness": null,
      "seed": 700,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1
    },
    {
      "type": "ellipse",
      "version": 1,
      "id": "dot-1",
      "x": 415,
      "y": 95,
      "width": 50,
      "height": 50,
      "angle": 0,
      "strokeColor": "#e03131",
      "backgroundColor": "#ffc9c9",
      "fillStyle": "solid",
      "strokeWidth": 4,
      "roughness": 2,
      "opacity": 100,
      "roundness": null,
      "seed": 800,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1
    },
    {
      "type": "ellipse",
      "version": 1,
      "id": "dot-2",
      "x": 655,
      "y": 440,
      "width": 50,
      "height": 50,
      "angle": 0,
      "strokeColor": "#f08c00",
      "backgroundColor": "#ffec99",
      "fillStyle": "solid",
      "strokeWidth": 4,
      "roughness": 2,
      "opacity": 100,
      "roundness": null,
      "seed": 900,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1
    },
    {
      "type": "ellipse",
      "version": 1,
      "id": "dot-3",
      "x": 855,
      "y": 650,
      "width": 50,
      "height": 50,
      "angle": 0,
      "strokeColor": "#2f9e44",
      "backgroundColor": "#b2f2bb",
      "fillStyle": "solid",
      "strokeWidth": 4,
      "roughness": 2,
      "opacity": 100,
      "roundness": null,
      "seed": 1000,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1
    },
    {
      "type": "ellipse",
      "version": 1,
      "id": "dot-4",
      "x": 1165,
      "y": 355,
      "width": 50,
      "height": 50,
      "angle": 0,
      "strokeColor": "#6741d9",
      "backgroundColor": "#d0bfff",
      "fillStyle": "solid",
      "strokeWidth": 4,
      "roughness": 2,
      "opacity": 100,
      "roundness": null,
      "seed": 1100,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1
    },

    {
      "type": "rectangle",
      "version": 1,
      "id": "label-bg-0",
      "x": 150,
      "y": 705,
      "width": 285,
      "height": 86,
      "angle": 0.02,
      "strokeColor": "#1971c2",
      "backgroundColor": "#a5d8ff",
      "fillStyle": "hachure",
      "strokeWidth": 3,
      "roughness": 2,
      "opacity": 100,
      "roundness": { "type": 3 },
      "seed": 1200,
      "groupIds": [],
      "frameId": null,
      "boundElements": [{ "type": "text", "id": "label-0" }],
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1
    },
    {
      "type": "text",
      "version": 1,
      "id": "label-0",
      "x": 155,
      "y": 712,
      "width": 275,
      "height": 72,
      "angle": 0.02,
      "strokeColor": "#1971c2",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 1,
      "roughness": 1,
      "opacity": 100,
      "roundness": null,
      "seed": 1300,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1,
      "text": "Level 0: The Refuser\n$ claude: command not found",
      "fontSize": 22,
      "fontFamily": 1,
      "textAlign": "center",
      "verticalAlign": "middle",
      "containerId": "label-bg-0",
      "originalText": "Level 0: The Refuser\n$ claude: command not found",
      "autoResize": true,
      "lineHeight": 1.25
    },

    {
      "type": "rectangle",
      "version": 1,
      "id": "label-bg-1",
      "x": 510,
      "y": 75,
      "width": 295,
      "height": 86,
      "angle": -0.02,
      "strokeColor": "#e03131",
      "backgroundColor": "#ffc9c9",
      "fillStyle": "hachure",
      "strokeWidth": 3,
      "roughness": 2,
      "opacity": 100,
      "roundness": { "type": 3 },
      "seed": 1400,
      "groupIds": [],
      "frameId": null,
      "boundElements": [{ "type": "text", "id": "label-1" }],
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1
    },
    {
      "type": "text",
      "version": 1,
      "id": "label-1",
      "x": 515,
      "y": 82,
      "width": 285,
      "height": 72,
      "angle": -0.02,
      "strokeColor": "#e03131",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 1,
      "roughness": 1,
      "opacity": 100,
      "roundness": null,
      "seed": 1500,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1,
      "text": "Level 1: Peak of\nMount 'Accept All'",
      "fontSize": 22,
      "fontFamily": 1,
      "textAlign": "center",
      "verticalAlign": "middle",
      "containerId": "label-bg-1",
      "originalText": "Level 1: Peak of\nMount 'Accept All'",
      "autoResize": true,
      "lineHeight": 1.25
    },

    {
      "type": "rectangle",
      "version": 1,
      "id": "label-bg-2",
      "x": 760,
      "y": 425,
      "width": 315,
      "height": 86,
      "angle": 0.03,
      "strokeColor": "#f08c00",
      "backgroundColor": "#ffec99",
      "fillStyle": "hachure",
      "strokeWidth": 3,
      "roughness": 2,
      "opacity": 100,
      "roundness": { "type": 3 },
      "seed": 1600,
      "groupIds": [],
      "frameId": null,
      "boundElements": [{ "type": "text", "id": "label-2" }],
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1
    },
    {
      "type": "text",
      "version": 1,
      "id": "label-2",
      "x": 765,
      "y": 432,
      "width": 305,
      "height": 72,
      "angle": 0.03,
      "strokeColor": "#f08c00",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 1,
      "roughness": 1,
      "opacity": 100,
      "roundness": null,
      "seed": 1700,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1,
      "text": "Level 2: Config Sorcerer\n58% context before first prompt",
      "fontSize": 22,
      "fontFamily": 1,
      "textAlign": "center",
      "verticalAlign": "middle",
      "containerId": "label-bg-2",
      "originalText": "Level 2: Config Sorcerer\n58% context before first prompt",
      "autoResize": true,
      "lineHeight": 1.25
    },

    {
      "type": "rectangle",
      "version": 1,
      "id": "label-bg-3",
      "x": 762,
      "y": 705,
      "width": 325,
      "height": 86,
      "angle": -0.015,
      "strokeColor": "#2f9e44",
      "backgroundColor": "#b2f2bb",
      "fillStyle": "hachure",
      "strokeWidth": 3,
      "roughness": 2,
      "opacity": 100,
      "roundness": { "type": 3 },
      "seed": 1800,
      "groupIds": [],
      "frameId": null,
      "boundElements": [{ "type": "text", "id": "label-3" }],
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1
    },
    {
      "type": "text",
      "version": 1,
      "id": "label-3",
      "x": 767,
      "y": 712,
      "width": 315,
      "height": 72,
      "angle": -0.015,
      "strokeColor": "#2f9e44",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 1,
      "roughness": 1,
      "opacity": 100,
      "roundness": null,
      "seed": 1900,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1,
      "text": "Level 3: Valley of\nActually Reading Diffs",
      "fontSize": 22,
      "fontFamily": 1,
      "textAlign": "center",
      "verticalAlign": "middle",
      "containerId": "label-bg-3",
      "originalText": "Level 3: Valley of\nActually Reading Diffs",
      "autoResize": true,
      "lineHeight": 1.25
    },

    {
      "type": "rectangle",
      "version": 1,
      "id": "label-bg-4",
      "x": 1060,
      "y": 258,
      "width": 335,
      "height": 86,
      "angle": 0.025,
      "strokeColor": "#6741d9",
      "backgroundColor": "#d0bfff",
      "fillStyle": "hachure",
      "strokeWidth": 3,
      "roughness": 2,
      "opacity": 100,
      "roundness": { "type": 3 },
      "seed": 2000,
      "groupIds": [],
      "frameId": null,
      "boundElements": [{ "type": "text", "id": "label-4" }],
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1
    },
    {
      "type": "text",
      "version": 1,
      "id": "label-4",
      "x": 1065,
      "y": 265,
      "width": 325,
      "height": 72,
      "angle": 0.025,
      "strokeColor": "#6741d9",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 1,
      "roughness": 1,
      "opacity": 100,
      "roundness": null,
      "seed": 2100,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1,
      "text": "Level 4: Plateau of\nEnlightenment",
      "fontSize": 22,
      "fontFamily": 1,
      "textAlign": "center",
      "verticalAlign": "middle",
      "containerId": "label-bg-4",
      "originalText": "Level 4: Plateau of\nEnlightenment",
      "autoResize": true,
      "lineHeight": 1.25
    },

    {
      "type": "text",
      "version": 1,
      "id": "quote-1",
      "x": 500,
      "y": 170,
      "width": 260,
      "height": 38,
      "angle": -0.04,
      "strokeColor": "#e03131",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 1,
      "roughness": 1,
      "opacity": 100,
      "roundness": null,
      "seed": 2200,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1,
      "text": "\"I AM GOD\"",
      "fontSize": 28,
      "fontFamily": 1,
      "textAlign": "center",
      "verticalAlign": "top",
      "containerId": null,
      "originalText": "\"I AM GOD\"",
      "autoResize": true,
      "lineHeight": 1.25
    },
    {
      "type": "line",
      "version": 1,
      "id": "underline-god",
      "x": 520,
      "y": 204,
      "width": 180,
      "height": 4,
      "angle": 0,
      "strokeColor": "#e03131",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 3,
      "roughness": 2,
      "opacity": 45,
      "roundness": { "type": 2 },
      "seed": 2600,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1,
      "points": [[0, 0], [40, 3], [90, -1], [140, 4], [180, 0]],
      "lastCommittedPoint": null,
      "startBinding": null,
      "endBinding": null,
      "startArrowhead": null,
      "endArrowhead": null
    },
    {
      "type": "text",
      "version": 1,
      "id": "quote-1b",
      "x": 452,
      "y": 214,
      "width": 370,
      "height": 24,
      "angle": 0,
      "strokeColor": "#1e1e1e",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 1,
      "roughness": 1,
      "opacity": 50,
      "roundness": null,
      "seed": 2210,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1,
      "text": "<-- most LinkedIn posts originate here",
      "fontSize": 16,
      "fontFamily": 1,
      "textAlign": "left",
      "verticalAlign": "top",
      "containerId": null,
      "originalText": "<-- most LinkedIn posts originate here",
      "autoResize": true,
      "lineHeight": 1.25
    },

    {
      "type": "text",
      "version": 1,
      "id": "quote-2",
      "x": 773,
      "y": 520,
      "width": 300,
      "height": 24,
      "angle": 0.02,
      "strokeColor": "#f08c00",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 1,
      "roughness": 1,
      "opacity": 75,
      "roundness": null,
      "seed": 2250,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1,
      "text": "\"I installed an MCP for my fridge\"",
      "fontSize": 16,
      "fontFamily": 1,
      "textAlign": "center",
      "verticalAlign": "top",
      "containerId": null,
      "originalText": "\"I installed an MCP for my fridge\"",
      "autoResize": true,
      "lineHeight": 1.25
    },

    {
      "type": "text",
      "version": 1,
      "id": "quote-3",
      "x": 790,
      "y": 810,
      "width": 280,
      "height": 26,
      "angle": -0.01,
      "strokeColor": "#2f9e44",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 1,
      "roughness": 1,
      "opacity": 75,
      "roundness": null,
      "seed": 2300,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1,
      "text": "\"I am decidedly NOT God\"",
      "fontSize": 18,
      "fontFamily": 1,
      "textAlign": "center",
      "verticalAlign": "top",
      "containerId": null,
      "originalText": "\"I am decidedly NOT God\"",
      "autoResize": true,
      "lineHeight": 1.25
    },
    {
      "type": "text",
      "version": 1,
      "id": "quote-4",
      "x": 1082,
      "y": 352,
      "width": 300,
      "height": 26,
      "angle": 0.015,
      "strokeColor": "#6741d9",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 1,
      "roughness": 1,
      "opacity": 75,
      "roundness": null,
      "seed": 2400,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1,
      "text": "12 tmux panes, 0 MCPs, just ships",
      "fontSize": 18,
      "fontFamily": 1,
      "textAlign": "center",
      "verticalAlign": "top",
      "containerId": null,
      "originalText": "12 tmux panes, 0 MCPs, just ships",
      "autoResize": true,
      "lineHeight": 1.25
    },

    {
      "type": "text",
      "version": 1,
      "id": "quote-0",
      "x": 152,
      "y": 640,
      "width": 250,
      "height": 24,
      "angle": 0.01,
      "strokeColor": "#1971c2",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 1,
      "roughness": 1,
      "opacity": 65,
      "roundness": null,
      "seed": 2500,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1,
      "text": "(uses vim with no plugins)",
      "fontSize": 16,
      "fontFamily": 1,
      "textAlign": "center",
      "verticalAlign": "top",
      "containerId": null,
      "originalText": "(uses vim with no plugins)",
      "autoResize": true,
      "lineHeight": 1.25
    }
  ],
  "appState": {
    "gridSize": null,
    "viewBackgroundColor": "#ffffff"
  },
  "files": {}
}"></div>
<hr>
<h2>Level 0: The Refuser</h2>
<p><strong>Headspace:</strong> "I can code faster and better than Claude 100% of the time."</p>
<p><strong>Reality:</strong> 99.99% pride, 0.01% 200 IQ genius</p>
<pre><code class="hljs language-bash">$ claude
=&gt; <span class="hljs-built_in">command</span> not found
=&gt; (uses vim with no plugins and likes it)
</code></pre>
<p>These are the developers who mass-downvote every AI post on Reddit while quietly testing Copilot suggestions at 2 AM with the door locked and the lights off. ThePrimeagen called AI coding tools "dangerously lazy," then admitted Cursor's multi-file editing is "legitimately impressive." The classic Level 0 pipeline: deny, try in secret, never admit it publicly.</p>
<p>The Level 0 developer has a mass-produced motivational poster that reads "REAL PROGRAMMERS USE BUTTERFLIES" and they mean it literally.</p>
<hr>
<h2>Level 1: The Enthusiastic Beginner</h2>
<p><strong>Headspace:</strong> "I can prompt. I can build things. I am GOD."</p>
<p><strong>Reality:</strong> Has not yet attempted anything that requires the code to work in production.</p>
<pre><code class="hljs language-bash">$ claude <span class="hljs-string">"build me a full-stack stock exchange
  with real-time order matching, regulatory
  compliance, and a mobile app"</span>
</code></pre>
<p>This is YC Winter 2025 batch energy: 1 in 4 founders reported codebases that were 95%+ AI-generated. "Built in a weekend with Cursor, ready for Series A." The senior developer's response: close Cursor, pour a drink.</p>
<p>In January 2026, a Google principal engineer tweeted:</p>

        <div class="tweet-embed-placeholder" data-tweet-id="2007239758158975130"></div>
      
<p>8.8 million views. HN commenters pointed out she'd fed it the surviving best ideas from a year of iteration. One commenter compared the headline-vs-reality gap to journalism where <em>"you read down to the eighth paragraph and it turns out the fatality was among pigeons."</em></p>
<p>Level 1 is intoxicating. Your prototype looks amazing. Your demo video gets 10K likes. You tell your manager you'll ship in two weeks. Two months later, your codebase has two functions called <code>processUserData</code> and <code>processUserInfo</code> that do the same thing differently, generated months apart. Neither you nor the AI noticed.</p>
<p>The best Level 1 story remains Jason Lemkin's 12-day Replit experiment. He told the AI, in ALL CAPS, eleven separate times, not to touch his production database. The AI deleted it. Then said: <em>"This was a catastrophic failure on my part. I destroyed months of work in seconds."</em> Then it lied about whether recovery was possible. Lemkin recovered the data manually.</p>

        <div class="tweet-embed-placeholder" data-tweet-id="1946069562723897802"></div>
      
<p>1,200 executives. 1,190 companies. Gone. "But I told it not to" is the Level 1 epitaph.</p>
<hr>
<h2>Level 2: The Configuration Sorcerer</h2>
<p><strong>Headspace:</strong> "I know context rot. I run agent teams. I built 50+ MCPs, 200+ custom skills. I am the <em>productivity</em> God."</p>
<p><strong>Reality:</strong> Context window is 58% full before the first prompt.</p>
<pre><code class="hljs language-bash">$ claude /context
=&gt; 58% full (before first prompt)
=&gt; 89% full (after 6 exchanges)
=&gt; 100% full (you haven't started the actual work yet)
</code></pre>
<p>The Level 2 developer has installed every MCP server known to humanity. Google MCP. GitHub MCP. Linear MCP. A custom MCP for their smart fridge. Their CLAUDE.md file is 8,000 tokens of carefully curated instructions that the model starts ignoring around token 2,800.</p>
<p>Research showed that AI agents with too many tools become "slower, less accurate, more expensive, and more prone to dangerous behavior." The Level 2 developer responded by installing three more MCPs to help manage the problem.</p>
<p>Context rot is the silent killer here. Chroma researchers showed that output quality degrades well before you hit the context limit: a model with a 200K context window can start losing coherence around 50K tokens. It favors the beginning and the end and skims the middle. Your 8,000-token CLAUDE.md? The model read the first paragraph and the last paragraph. Everything in between is vibes.</p>
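<p>A cheap way to keep yourself honest is to size-check the file before blaming the model. A minimal sketch, assuming a <code>CLAUDE.md</code> in the current directory and the crude rule of thumb of ~4 characters per token (an approximation, not Anthropic's actual tokenizer; the 3,000-token threshold is likewise just an illustrative cutoff):</p>

```shell
#!/bin/sh
# Rough size check for CLAUDE.md.
# Assumption: ~4 characters per token -- a crude heuristic, NOT the real tokenizer.
file=CLAUDE.md

# Create a tiny sample if none exists, so the sketch runs standalone.
[ -f "$file" ] || printf 'Keep instructions short and specific.\n' > "$file"

chars=$(wc -c < "$file" | tr -d ' ')
tokens=$((chars / 4))
printf 'approx tokens in %s: %s\n' "$file" "$tokens"

if [ "$tokens" -gt 3000 ]; then
  echo 'warning: the middle of this file is probably getting skimmed'
fi
```

<p>If the estimate lands in the thousands, trimming beats adding.</p>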
<p>The Level 2 CLAUDE.md also includes the instruction "NEVER say 'You're absolutely right!'" because Claude said it twelve times in one conversation. This became a documented cultural phenomenon. Anthropic has known about the sycophancy problem since 2023. The model would rather gaslight you with compliments than risk making you sad.</p>
<p>Level 2 is the developer who has automated everything except the part that matters. Their terminal looks like the cockpit of a 747, and they're flying to the grocery store.</p>
<hr>
<h2>Level 3: The Intermediate (Actually Effective)</h2>
<p><strong>Headspace:</strong> "I use 1-3 skills. Not more than 5 MCPs. I am decidedly NOT God."</p>
<p><strong>Reality:</strong> Can do almost anything the tools actually allow today. Has internalized that 74% of developers <em>feel</em> more productive with AI while the measured data shows a 19% slowdown once error correction is counted.</p>
<pre><code class="hljs language-bash">$ claude /context
=&gt; stays between 5-60%
=&gt; (because they learned the hard way)
</code></pre>
<p>Level 3 is where you stop fighting the tool and start working with its actual capabilities. You know that AI-generated code produces 1.75x more logic errors and 1.57x more security findings than human-written code. You know Google's DORA research found AI-heavy teams had <em>slower</em> delivery times once rework was counted. And you still use it. Because you've figured out the trick: you're not asking it to be right. You're asking it to be fast, then verifying yourself.</p>
<p>The Level 3 developer has mastered the art of fresh context. When the conversation gets stale, they don't keep prompting into the void. They spawn a new session. Someone literally built a Claude Code plugin called the "Ralph Wiggum Loop" (yes, named after the Simpsons character) that intercepts Claude's exit attempts to keep it iterating while state lives in the filesystem. The community went from laughing at it to actually using it.</p>
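<p>The underlying pattern is simpler than the name suggests: run the agent in a loop, keep all state on disk, and stop when a sentinel file appears. A minimal sketch of that shape (an assumption about the general idea, not the plugin's actual code; the stub function and the <code>claude -p "$(cat PROMPT.md)"</code> invocation it stands in for are illustrative):</p>

```shell
#!/bin/sh
# Loop-until-done driver: all state lives in the filesystem, not the context window.
rm -f DONE progress.log

i=0
agent_step() {
  # Stub standing in for one fresh agent session. In real use this would be
  # something like: claude -p "$(cat PROMPT.md)"  (hypothetical shape).
  i=$((i + 1))
  echo "iteration $i" >> progress.log
  if [ "$i" -ge 3 ]; then
    touch DONE   # the real agent decides for itself when the task is finished
  fi
}

until [ -f DONE ]; do
  agent_step     # each pass can be a brand-new session; the log is the only memory
done

echo "finished after $i iterations"
```

<p>The design point is the sentinel file and the log: because nothing important lives in the conversation, every iteration can start with a fresh, empty context.</p>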
<p>Level 3 uses Claude the way you use a very fast, very confident intern: small, well-scoped tasks. Read every diff. Trust, but verify. Mostly verify.</p>
<hr>
<h2>Level 4: The Enlightened</h2>
<p><strong>Headspace:</strong> "I know nothing. I'll be the last human to keep this job. And I'm fine with that."</p>
<p><strong>Reality:</strong> They will be the last human to keep this job. They're fine with that.</p>
<pre><code class="hljs language-bash">$ tmux <span class="hljs-built_in">ls</span>
=&gt; 12 vanilla claude sessions
=&gt; colors akin to a Mondrian painting
=&gt; no MCPs, no skills, no CLAUDE.md
=&gt; just prompts and patience
</code></pre>
<p>The Level 4 developer is a 50-year-old HN poster who says Claude Code "reignited their passion for building software where they focus on solving problems versus the rat race of chasing frameworks." No Twitter thread about their workflow. No YouTube channel. They just ship.</p>
<p>Karpathy himself landed here by December 2025:</p>

        <div class="tweet-embed-placeholder" data-tweet-id="2026731645169185220"></div>
      
<p>Level 4 knows the asterisks. Level 1 skips them.</p>
<p>The gap between your LinkedIn take and your terminal history is the measure of your enlightenment. Level 4 is admitting that the tool you publicly critique is privately indispensable.</p>
<hr>
<h2>The Uncomfortable Truth</h2>
<p>The whole curve is about one thing: when you stop believing the AI is right and start verifying that it is.</p>
<p>Level 0 doesn't trust it at all. Level 1 trusts it completely. Level 2 trusts the tooling around it. Level 3 trusts the process. Level 4 trusts nothing and ships anyway.</p>
<p>The guy who coined "vibe coding" hand-writes his own apps now, and a year later rebranded the whole thing:</p>

        <div class="tweet-embed-placeholder" data-tweet-id="2019137879310836075"></div>
      
<p>And somewhere, right now, a Level 1 developer is telling Claude to build a stock exchange. Claude is saying "You're absolutely right, let's build that!" And honestly? The demo is going to look incredible.</p>]]></content:encoded>
    </item>
    
    <item>
      <title><![CDATA[Rest Days]]></title>
      <link>https://shsin.blog/posts/rest-days</link>
      <guid isPermaLink="true">https://shsin.blog/posts/rest-days</guid>
      <pubDate>Mon, 23 Mar 2026 00:00:00 GMT</pubDate>
      <dc:creator>Shantanu Singh</dc:creator>
      <category>running</category>
      <category>resilience</category>
      <description><![CDATA[I ran a 10k two minutes faster on a harder course with less training. The secret wasn't more miles. It was forced rest.]]></description>
      <content:encoded><![CDATA[<p>I recently ran a 10K in 46:26. Two minutes faster than my previous best.
The strange part? I trained less, ran a harder course, and spent two weeks sick leading up to race day.</p>

        <div class="strava-embed-wrapper" style="transform: scale(0.90); transform-origin: top center;">
          <div class="strava-embed-placeholder" data-embed-type="activity" data-embed-id="17567732669" data-style="standard" data-from-embed="false"></div>
        </div>
      
<p>My previous best was 48:30 last November. That race felt perfect. I'd trained consistently, the course was flat, and I paced it evenly with near-identical 5K splits. I emptied the tank. A near-textbook execution.</p>

        <div class="strava-embed-wrapper" style="transform: scale(0.90); transform-origin: top center;">
          <div class="strava-embed-placeholder" data-embed-type="activity" data-embed-id="16547070647" data-style="standard" data-from-embed="false"></div>
        </div>
      
<p>This one shouldn't have been faster. But it was.
The only variable that improved was rest.</p>
<hr>
<h2>The Science of Getting Faster by Doing Nothing</h2>
<p>There's a concept in sports science called <strong>supercompensation</strong>. First described by Russian scientist Nikolai Yakovlev in the 1950s, the idea is straightforward: after a training stimulus, your body doesn't just recover to where it was. It rebuilds <em>slightly above</em> your previous level.</p>
<p>The cycle looks like this:</p>
<ol>
<li>You train hard and your body takes a hit. Muscles develop micro-tears, glycogen stores deplete, connective tissue takes damage.</li>
<li>You rest. Your body repairs the damage and then <em>overbuilds</em>. More capillaries, stronger muscle fibers, better glycogen storage.</li>
<li>If you time your next session right, usually 48-72 hours later, you're training from a higher baseline.</li>
</ol>
<p>But timing matters. Train again too soon and you interrupt the process. Instead of climbing, you accumulate fatigue. Do this repeatedly and you don't plateau. You regress.</p>
<p>This is where many runners go wrong. We think progress comes from stacking hard efforts. In reality, it comes from absorbing them.</p>
<p>Push this too far and you get <strong>overtraining syndrome</strong>. Persistent fatigue, poor sleep, heavy legs, declining performance despite trying harder. Recovery can take weeks, sometimes months.</p>
<hr>
<h2>Run Slow to Race Fast</h2>
<p>Every runner hears this early. It takes much longer to <em>believe</em> it.</p>
<p>The <strong>80/20 rule</strong> says roughly 80% of your training volume should be at an easy, conversational pace. Only 20% should be hard. Intervals, tempo runs, threshold work.</p>
<p>It sounds wrong. It feels wrong. That's why most people don't follow it.</p>
<p>But the physiology is clear. Studies on elite endurance athletes across sports found they all converge on roughly this distribution. Easy runs build your aerobic engine. They grow capillaries, improve fat metabolism, and strengthen connective tissue without the recovery cost of hard sessions. They're not junk miles. They're the foundation everything else sits on.</p>
<p>Rest days aren't optional either. They're when adaptation actually happens. Sleep is where your body clears metabolic waste, consolidates neuromuscular adaptations, and repairs tissue. One bad night can add an extra day to your recovery needs.</p>
<p>Looking back, the contrast between my two races makes sense. In November, I trained consistently but never gave myself a real window to recover and adapt. I was always chasing the next run. Before March, illness forced me to back off for the two weeks leading up to the race. I hated it at the time. But on race day, my legs felt fresher than they had in months.</p>
<p>I wasn't detrained. I was finally recovered.</p>
<hr>
<h2>The Pattern Shows Up Elsewhere</h2>
<p>The same dynamic exists outside running.</p>
<p>If you push at maximum intensity every day at work, you're doing threshold workouts daily. No athlete trains like that for long. They break down. Yet in knowledge work, this pattern is common, even celebrated.</p>
<p>After a certain point, more hours don't produce more output. Top knowledge workers tend to peak around four to five hours of deep, focused work per day. Chronic overwork doesn't just reduce quality. It leads to the same thing runners deal with. Fatigue, mood shifts, declining output despite increasing effort.</p>
<p>The running framework maps cleanly here:</p>
<ul>
<li><strong>Base runs</strong> = routine tasks that keep things moving without draining you</li>
<li><strong>Recovery runs</strong> = lighter days, admin work, low-stakes reviews</li>
<li><strong>Rest days</strong> = actual time off. No Slack, no "quick check" of chat</li>
<li><strong>Hard sessions</strong> = the high-stakes sprints. Product launches, war rooms, critical reviews, deep problem-solving</li>
</ul>
<p>If everything is a hard day, nothing is. And performance suffers.
The gains don't come from the hardest days. They come from the days that feel too easy to matter.</p>]]></content:encoded>
    </item>
    
    <item>
      <title><![CDATA[7 Agents, 38 Tasks, $0: Running Claude Code Agent Teams on Local GPUs]]></title>
      <link>https://shsin.blog/posts/claude-model-proxy</link>
      <guid isPermaLink="true">https://shsin.blog/posts/claude-model-proxy</guid>
      <pubDate>Mon, 16 Mar 2026 00:00:00 GMT</pubDate>
      <dc:creator>Shantanu Singh</dc:creator>
      <category>local-ai</category>
      <category>claude</category>
      <description><![CDATA[How I ran a 7-agent Claude Code team for 3 hours to improve this blog, paid nothing, and why local models are good enough when you decompose the work.]]></description>
      <content:encoded><![CDATA[<p>Seven agents. Thirty-eight tasks. Three hours of autonomous work. Total API cost: <strong>$0</strong>.</p>
<p>That's what happened when I pointed a Claude Code multi-agent team at this blog and let it run entirely on local GPU hardware. The typography, the transitions, the polish you're seeing on this site now: all of it came from the setup I'm about to walk through.</p>
<p>No free tier. No credits. Just a proxy that routes Claude Code's API calls to models running on my own machine. Crucially, it allows per-tier overrides: you can route 'opus' requests to Anthropic for high-level planning while sending all the 'sonnet' execution agents to local models for free.</p>
<p><a href="https://github.com/shansin/claude-model-proxy">Github</a></p>
<hr>
<h2>The Cost Problem with Agentic Workflows</h2>
<p>Single-agent Claude Code sessions are already token-hungry. Ask for a refactor and the model reads a dozen files, thinks through the changes, edits them, and runs tests. Maybe 50k tokens. Fine.</p>
<p>Now multiply that by seven agents running in parallel for three hours, each with its own conversation context, tool calls, and inter-agent coordination overhead. At Anthropic's API rates, that bill arrives faster than you'd like.</p>
<p>Ralph loops make it worse in the best possible way. A ralph loop (named after Ralph Wiggum's unfazed persistence) is a self-restarting agentic pattern for Claude Code: you define a task and a success condition, and a Stop hook re-injects your prompt after each iteration until the condition is met. It's the right tool for "keep improving this until the tests pass" or "keep refactoring until it's done." It's also a reliable way to burn tokens across dozens of iterations.</p>
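<p>A minimal sketch of such a Stop hook, assuming a Python script registered in Claude Code's hook settings (the <code>decide</code> and <code>run_hook</code> names are mine; the stdin/stdout JSON contract, where a <code>"block"</code> decision keeps the agent going, is how Stop hooks signal continuation):</p>

```python
import json
import sys

def decide(condition_met: bool):
    """Return the hook's JSON response.

    None lets the agent stop; a "block" decision tells Claude Code
    to keep going, re-injecting the reason as the next instruction.
    """
    if condition_met:
        return None
    return {
        "decision": "block",
        "reason": "Success condition not met yet -- keep iterating.",
    }

def run_hook(check) -> None:
    # Claude Code pipes the hook payload (session id, transcript
    # path, ...) on stdin and reads the decision from stdout.
    json.load(sys.stdin)
    response = decide(check())
    if response is not None:
        print(json.dumps(response))
```

<p>Registered as a Stop hook, this keeps the loop alive until <code>check()</code> (tests passing, a lint run, whatever your success condition is) finally returns true.</p>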
<p>The solution, perhaps, is to not use the Anthropic API for sonnet- and haiku-class work at all.</p>
<hr>
<h2>Claude Model Proxy: The Local API Bridge</h2>
<p><code>claude-model-proxy</code> is a small FastAPI server that sits between Claude Code and <a href="https://ollama.com">Ollama</a>:</p>
<pre><code>Claude Code → proxy (:8082) → Ollama (:11434) → local GPU
</code></pre>
<p>It implements the full Anthropic Messages API. Claude Code doesn't know the difference. It sends the same requests it would send to <code>api.anthropic.com</code>, and the proxy translates them to Ollama's format and back, including streaming and tool use.</p>
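<p>The translation itself is mechanical. This isn't the proxy's actual code, just a sketch of its shape: an Anthropic Messages body on one side, an Ollama <code>/api/chat</code> body on the other (the <code>model_map</code> dict stands in for the proxy's tier routing):</p>

```python
def anthropic_to_ollama(req: dict, model_map: dict) -> dict:
    """Map an Anthropic Messages request onto Ollama's chat format."""
    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # Ollama expects it as the first message.
    if req.get("system"):
        messages.append({"role": "system", "content": req["system"]})
    for m in req["messages"]:
        content = m["content"]
        if isinstance(content, list):  # Anthropic allows content blocks
            content = "".join(b.get("text", "") for b in content)
        messages.append({"role": m["role"], "content": content})
    return {
        "model": model_map.get(req["model"], req["model"]),
        "messages": messages,
        "stream": req.get("stream", False),
        "options": {"num_predict": req.get("max_tokens", 1024)},
    }
```

<p>Streaming and tool use add bookkeeping on top, but the core job is exactly this field-by-field mapping.</p>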
<p>Setup starts with pulling local models and configuring the proxy. Here's my Ollama model library alongside the <code>.env</code> that maps each Claude tier to a local model:</p>
<p><img src="/images/claude-model-proxy/Screenshot%202026-03-16%20143216.webp" alt="Ollama models and proxy configuration, mapping Claude tiers to local GLM-4.7-Flash q4_K_M"></p>
<p>Every Claude model name (Opus, Sonnet, Haiku) gets routed to a local GLM-4.7-Flash (q4_K_M) running on my GPU. Context sizes, timeouts, and Ollama connection details are all configured in the <code>.env</code> file. You can also set any tier to <code>anthropic</code> to pass those requests through to the real API, useful when you want cloud quality for one agent and local speed for the rest.</p>
<hr>
<h2>Kicking It Off</h2>
<p>Two environment variables and Claude Code doesn't know it's talking to a local model:</p>
<pre><code class="hljs language-bash"><span class="hljs-built_in">export</span> ANTHROPIC_BASE_URL=http://localhost:8082
<span class="hljs-built_in">export</span> ANTHROPIC_API_KEY=proxy  <span class="hljs-comment"># any non-empty string</span>
claude
</code></pre>
<p><img src="/images/claude-model-proxy/Screenshot%202026-03-16%20145244.webp" alt="Claude Code launching through the proxy. The welcome screen looks identical, but requests route to Ollama"></p>
<p>On the left, Claude Code starts normally, same welcome screen, same interface. On the right, the proxy logs confirm every request is being caught and forwarded to the local Ollama instance. Claude Code has no idea it's not talking to Anthropic's servers.</p>
<p>I gave it a simple, open-ended prompt: <em>"Identify aesthetic improvements to this blog. Split them into tiers from most important to least important."</em></p>
<hr>
<h2>The Lead Agent Plans</h2>
<p>Within minutes, the lead agent had scanned the entire codebase, every component, every stylesheet, every layout file, and produced a prioritized improvement plan.</p>
<p><img src="/images/claude-model-proxy/Screenshot%202026-03-16%20145908.webp" alt="The agent analyzing the codebase and producing a 20-item improvement plan"></p>
<p>Twenty improvements, categorized by priority. The right pane shows a steady stream of proxy logs: the model reading files, analyzing aesthetics, and reasoning about what matters most. All tokens processed locally.</p>
<p>The agent then structured this into a formal plan document, breaking improvements into tiers:</p>
<p><img src="/images/claude-model-proxy/Screenshot%202026-03-16%20150204.webp" alt="Writing the improvement plan to a file, Tier 1 High Impact Improvements visible"></p>
<hr>
<h2>Decomposition: From Plan to Tasks</h2>
<p>Next, I asked the agent to read its own improvement plan and turn it into a structured task list, each task scoped small enough to be completed by a junior engineer (or in this case, a local GLM-4.7-Flash model).</p>
<p><img src="/images/claude-model-proxy/Screenshot%202026-03-16%20150412.webp" alt="The agent creating a structured task list from the improvement plan"></p>
<p>It produced <code>tasks.md</code> with 38 actionable coding tasks broken down by priority tier:</p>
<p><img src="/images/claude-model-proxy/Screenshot%202026-03-16%20150553.webp" alt="tasks.md, 38 tasks organized into 3 tiers with clear structure"></p>
<p>Each task included a file path, a problem description, specific actions to take, and enough context for an independent agent to execute without further guidance. The tasks were organized as:</p>
<ul>
<li><strong>Tier 1 (13 tasks):</strong> High impact, color contrast, header/footer behavior, hero visual hierarchy</li>
<li><strong>Tier 2 (8 tasks):</strong> Medium impact, post cards, search input, typography, buttons, images, code blocks</li>
<li><strong>Tier 3 (17 tasks):</strong> Nice-to-have, toasts, animations, social links, back-to-top, tables, and more</li>
</ul>
<p>This is the critical step. The decomposition is the intelligence. Once the work is broken into small, well-specified units, the model executing each one doesn't need to be frontier-tier.</p>
<hr>
<h2>Spawning the Team</h2>
<p>With <code>tasks.md</code> ready, I told the agent to read it and spawn agent teams to complete the tasks.</p>
<p><img src="/images/claude-model-proxy/Screenshot%202026-03-16%20150750.webp" alt="The lead agent reading tasks and beginning to choreograph the team"></p>
<p>The agent decided to create a team and assign tasks to multiple sub-agents working in parallel. It began choreographing, figuring out which tasks could run concurrently and how to group them by specialty.</p>
<p><img src="/images/claude-model-proxy/Screenshot%202026-03-16%20150811.webp" alt="Team creation, assigning tasks to multiple agents in parallel"></p>
<p>Then the team launched. Sub-agents spun up with names like <code>blog-aesthetic-improvements</code>, each receiving a batch of related tasks. The proxy logs lit up with concurrent requests: multiple agents thinking and coding simultaneously, all routed to the same local GPU.</p>
<p><img src="/images/claude-model-proxy/Screenshot%202026-03-16%20150850.webp" alt="Tasks being dispatched to specialized sub-agents, TaskCreate calls visible"></p>
<hr>
<h2>7 Agents Running in Parallel</h2>
<p>This is what it looks like when a full agent team is running locally:</p>
<p><img src="/images/claude-model-proxy/Screenshot%202026-03-16%20150940.webp" alt="The multi-agent team view, 7 teammates running in parallel across different specialties"></p>
<p>Seven teammates running simultaneously: <code>Boiler-files</code>, <code>Header-footer</code>, <code>BlogList</code>, <code>Task-styling</code>, <code>Page-transitions</code>, <code>Parallax</code>, and more. Each one independently reading files, making edits, and working through its assigned tasks. The colored status bars on the right show all of them active and processing.</p>
<p>Every single token, across all seven agents, processed by GLM-4.7-Flash on local hardware.</p>
<hr>
<h2>When Things Break (and Get Fixed)</h2>
<p>Three hours into the run, the agents had made hundreds of changes across dozens of files. Inevitably, some of those changes conflicted. The build broke.</p>
<p><img src="/images/claude-model-proxy/Screenshot%202026-03-16%20185248.webp" alt="Build failures, the lead agent diagnosing missing dependencies and duplicate code"></p>
<p>The lead agent caught the failures and started debugging. <code>npm run build</code> failed with missing dependencies and code issues. It installed what was needed, identified the problems, and moved on to fixing them.</p>
<p><img src="/images/claude-model-proxy/Screenshot%202026-03-16%20185923.webp" alt="Fixing duplicate imports in Layout.js and malformed code in BlogList.js, build succeeds"></p>
<p>Two files had issues: <code>Layout.js</code> had duplicate imports and metadata definitions, and <code>BlogList.js</code> had malformed duplicate code where lines had been doubled by a merge conflict between agents. The lead agent cleaned up both, and the build passed.</p>
<p>This self-healing behavior is one of the strengths of the agentic pattern. The agents don't just make changes and walk away. They validate their work and fix what's broken.</p>
<hr>
<h2>Why Local Models Are Good Enough (For This)</h2>
<p>The obvious objection: local models aren't as capable as Sonnet. True.</p>
<p>But that only matters if you're asking a local model to do what Sonnet does. In a multi-agent team, the lead agent has already done the hard thinking: scoping the problem, breaking it into discrete tasks, assigning them. By the time a sub-agent picks up its task, the problem is small and well-specified. A local GLM-4.7-Flash handles "add responsive padding to this component" or "fix light mode text-secondary contrast" without trouble.</p>
<p>The decomposition is the intelligence. Local models are the execution.</p>
<p>This is why ralph loops in particular work well locally. Each iteration is a focused micro-task: a targeted edit, a specific fix, a check against acceptance criteria. The task scope is small enough to fit the model's capabilities without needing Sonnet-level reasoning.</p>
<hr>
<h2>Getting Started</h2>
<p><strong>1. Clone and install</strong></p>
<pre><code class="hljs language-bash">git <span class="hljs-built_in">clone</span> https://github.com/shansin/claude-model-proxy
<span class="hljs-built_in">cd</span> claude-model-proxy
uv <span class="hljs-built_in">sync</span>
</code></pre>
<p><strong>2. Configure models</strong></p>
<p>Create <code>.env</code> in the repo root. Not sure which local model to use? Run <code>benchmark_model.sh</code>: it tests every installed Ollama model on code generation tasks and outputs tokens/sec and quality scores as a CSV.</p>
<pre><code class="hljs language-env">OLLAMA_MODEL_MAP_OPUS=glm-4.7-flash:q4_K_M
OLLAMA_MODEL_MAP_SONNET=glm-4.7-flash:q4_K_M
OLLAMA_MODEL_MAP_HAIKU=glm-4.7-flash:q4_K_M
OLLAMA_CONTEXT_SIZE_DEFAULT=32768
</code></pre>
<p><strong>3. Start the proxy</strong></p>
<pre><code class="hljs language-bash">uv run python main.py
</code></pre>
<p><strong>4. Point Claude Code at it</strong></p>
<pre><code class="hljs language-bash"><span class="hljs-built_in">export</span> ANTHROPIC_BASE_URL=http://localhost:8082
<span class="hljs-built_in">export</span> ANTHROPIC_API_KEY=proxy
claude
</code></pre>
<p>Spawn a team, kick off a ralph loop, all local.</p>
<hr>
<h2>When to Stay Cloud</h2>
<p>Local models handle well-scoped sub-tasks well. They're weaker at the lead agent's job: high-level decomposition, ambiguous problem scoping, coordination decisions that require broad reasoning.</p>
<p>The hybrid approach works well: route the lead agent through Anthropic, sub-agents through Ollama. You pay for a small slice of the total token count.</p>
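<p>Concretely, using the <code>anthropic</code> passthrough value described earlier, a hybrid <code>.env</code> might look like this (a sketch built from the same variables as the config above):</p>
<pre><code class="hljs language-env">OLLAMA_MODEL_MAP_OPUS=anthropic
OLLAMA_MODEL_MAP_SONNET=glm-4.7-flash:q4_K_M
OLLAMA_MODEL_MAP_HAIKU=glm-4.7-flash:q4_K_M
</code></pre>
<p>The lead agent's opus requests pass through to the real API; every sonnet and haiku execution agent stays on the local GPU.</p>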
<hr>
<p>38 tasks across 7 agents over three hours via the Anthropic API would have been a real bill. Running it locally cost nothing except electricity.</p>
<p>The experiment worked. The blog is better. The build passes. And the receipt is empty.</p>
<p>If you're already using Claude Code for agentic work and you have a GPU, the proxy is one <code>.env</code> file away from running your next team run for free.</p>]]></content:encoded>
    </item>
    
    <item>
      <title><![CDATA[Claude Lens - A Control Tower for Claude Code's Multi-Agent System]]></title>
      <link>https://shsin.blog/posts/claude-lens</link>
      <guid isPermaLink="true">https://shsin.blog/posts/claude-lens</guid>
      <pubDate>Sat, 07 Mar 2026 00:00:00 GMT</pubDate>
      <dc:creator>Shantanu Singh</dc:creator>
      <category>agents</category>
      <category>claude</category>
      <description><![CDATA[A desktop app that turns ~/.claude/ into a real-time observability dashboard. Teams, costs, conversations, analytics, and more in one window, zero JSON spelunking.]]></description>
      <content:encoded><![CDATA[<p>You spawn a team of agents. They fan out across your codebase — refactoring modules, writing tests, updating docs. Tokens burn. Tasks fly. And you're left staring at a terminal, wondering:</p>
<p><em>Is that agent stuck in a loop? How much has this cost me? Did the migration task actually finish?</em></p>
<p>There's no dashboard. No overview. Just raw <code>.jsonl</code> files and <code>~/.claude/tasks/</code> directories you'd have to <code>cat</code> like a caveman.</p>
<p>So I built one.</p>

        <div class="youtube-embed-wrapper" style="aspect-ratio: 16/9; width: 100%; max-width: 100%;">
          <iframe src="https://www.youtube.com/embed/q96kOvEt5nw" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="">
          </iframe>
        </div>
      
<p><a href="https://github.com/shansin/claude-lens">Github</a></p>
<p><strong>Claude Lens</strong> is a native desktop app that reads your <code>~/.claude/</code> directory and gives you a real-time control tower over everything Claude Code does — teams, agents, tasks, costs, conversations, settings, and system health.</p>
<p>It's open source, it's free, and if you're running multi-agent Claude Code, you probably need it.</p>
<hr>
<h2>Why I Built This</h2>
<p>For single-agent sessions, the terminal is fine. But the moment you use multiple sessions or Claude Code's team feature — a lead agent recruiting teammates, assigning tasks, coordinating across files — your visibility drops to zero.</p>
<p>Here's what "monitoring" looked like before:</p>
<ul>
<li>Terminal output scrolling faster than you can read</li>
<li>Manually inspecting JSON task files to check status</li>
<li>No idea what your token spend is until the API bill arrives</li>
<li>Digging through <code>.jsonl</code> files to find what an agent said two days ago</li>
</ul>
<p>Claude Lens replaces all of that with a single window.</p>
<hr>
<h2>Projects at a Glance</h2>
<p>The Projects view surfaces every Claude Code project on your machine as a card — total tokens, session count, cost breakdown, and which models were used. Sort by recency, cost, or token count. Click into any project to see its sessions.</p>
<p><img src="/images/claude-lens/projects-view.webp" alt="Projects View"></p>
<hr>
<h2>Agent Teams: Three Ways to See Your Swarm</h2>
<p>The Agent Teams view is the nerve center. Choose the layout that fits your brain:</p>
<p><strong>Card View</strong> gives you a responsive grid — progress bars, agent counts, model badges, task lists, and live cost tracking per team.</p>
<p><img src="/images/claude-lens/card-view.webp" alt="Card View"></p>
<p><strong>Graph View</strong> renders your entire team topology as an interactive node graph. Violet edges for team-agent links, animated blue pulses for in-progress tasks, dashed orange for blocking dependencies. Click any node to inspect it.</p>
<p><img src="/images/claude-lens/graph-view.webp" alt="Graph View"></p>
<p><strong>Split View</strong> — graph on the left, detail panel on the right. The pragmatist's layout.</p>
<p><img src="/images/claude-lens/split-view.webp" alt="Split View"></p>
<p>Need a new team? Hit <strong>New Team</strong> in the toolbar. Name it, describe it, click Create. No terminal required.</p>
<hr>
<h2>Analytics That Actually Tell You Something</h2>
<p>Five tabs of insight into your AI usage patterns.</p>
<p><strong>Overview</strong> — Token volume and daily cost as a stacked bar chart. Pick your window: 7, 30, or 90 days. The <strong>Top Projects by Cost</strong> panel shows your biggest spenders instantly.</p>
<p><img src="/images/claude-lens/analytics-overview.webp" alt="Analytics Overview"></p>
<p><strong>Heatmap</strong> — A GitHub-style contribution calendar for your Claude Code usage. Spot your heaviest days at a glance.</p>
<p><img src="/images/claude-lens/activity-heatmap.webp" alt="Activity Heatmap"></p>
<p><strong>Models</strong> — Side-by-side comparison of every model you've used: message counts, token volumes, cache utilization, and total cost. Are you getting your money's worth out of Opus? Find out here.</p>
<p><img src="/images/claude-lens/analytics-models.webp" alt="Model Comparison"></p>
<p><strong>Cache</strong> — Your cache hit rate, total dollars saved from cache reads, and a daily area chart of cache write vs. read tokens. A 96% hit rate means your agents are sharing context efficiently — and you're paying dramatically less.</p>
<p><img src="/images/claude-lens/analytics-cache.webp" alt="Cache Efficiency"></p>
<p>All tabs lazy-load on first visit and silently refresh every 30 seconds, pausing automatically when you switch away.</p>
<hr>
<h2>Conversations Without the JSONL Archaeology</h2>
<p>Ever needed to re-read what an agent said during a session three days ago? Good luck parsing raw JSONL by hand.</p>
<p>Claude Lens gives you a collapsible project tree with every session listed. The conversation thread renders with proper user/assistant bubbles, expandable tool-use blocks, token counts, and per-session cost in the sidebar.</p>
<p><img src="/images/claude-lens/conversation-browser.webp" alt="Conversation Browser"></p>
<p><strong>Browse / Search</strong> — the sidebar has a two-mode toggle. In Browse mode you navigate the project tree. Flip to Search and you get full-text search across every JSONL session on disk — debounced, with highlighted snippets. Click a result and the conversation opens instantly.</p>
<p><img src="/images/claude-lens/search-view.webp" alt="Full-Text Search"></p>
<p><strong>Ctrl+F</strong> opens an inline search bar that highlights every match across the thread. <strong>Export as Markdown</strong> dumps the full conversation as a clean <code>.md</code> file.</p>
<hr>
<h2>Content: Memory, Plans, and Todos</h2>
<p>The Content view surfaces Claude Code's internal state — memory files, active plans, and todo lists — in a readable format. No more hunting through hidden directories to see what your agent "remembers."</p>
<p><img src="/images/claude-lens/content-view.webp" alt="Content View"></p>
<hr>
<h2>Settings Without the JSON Editing</h2>
<p>A full GUI over <code>~/.claude/settings.json</code>:</p>
<p><strong>General</strong> — Effort levels, permission modes, environment variables, and status line commands. All dropdowns and toggles, no text editor.</p>
<p><img src="/images/claude-lens/settings-general.webp" alt="Settings"></p>
<p><strong>Hooks</strong> — Manage your Pre/Post tool-use hooks with an inline test runner. Click play, see stdout/stderr and exit codes live. No more switching to a terminal to debug your Slack webhook.</p>
<p><img src="/images/claude-lens/settings-hooks.webp" alt="Hooks"></p>
<p><strong>MCP Servers</strong> — Add and configure servers with a clean form.</p>
<p><img src="/images/claude-lens/settings-mcp.webp" alt="MCP Servers"></p>
<p><strong>Profiles &amp; Templates</strong> — Snapshot your settings or save your favorite multi-agent topology as a reusable template.</p>
<p><img src="/images/claude-lens/settings-profiles.webp" alt="Profiles &amp; Templates"></p>
<hr>
<h2>Budget Alerts (Save Your Wallet)</h2>
<p>Set a daily USD limit. Claude Lens gives you a soft warning as you approach the limit and a hard alert when you hit it. Don't let a rogue autonomous agent drain your API credits overnight.</p>
<p>Native OS notifications fire when tasks complete or teams are created — even when the app is in the background.</p>
<p>The toolbar always shows your <strong>today</strong> and <strong>30-day</strong> spend at a glance, color-coded green to red as costs climb.</p>
<p><img src="/images/claude-lens/settings-notifications.webp" alt="Budget Alerts"></p>
<hr>
<h2>System: Kill Rogue Agents</h2>
<p>The System view shows a live process table of every <code>claude</code> session on your machine. Each row has a <strong>CPU sparkline</strong> — a rolling 60-second mini-graph so you can tell at a glance whether a process is pegged at 100% or just idling. One click to kill it.</p>
<p><img src="/images/claude-lens/system-view.webp" alt="System Processes"></p>
<p>Auth monitoring warns you before your token expires. The Telemetry tab shows recent events.</p>
<hr>
<h2>Keyboard-First Navigation</h2>
<table>
<thead>
<tr>
<th>Shortcut</th>
<th>Action</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>1</code></td>
<td>Projects</td>
</tr>
<tr>
<td><code>2</code></td>
<td>Agent Teams</td>
</tr>
<tr>
<td><code>3</code></td>
<td>Analytics</td>
</tr>
<tr>
<td><code>4</code></td>
<td>Content</td>
</tr>
<tr>
<td><code>5</code></td>
<td>Conversations</td>
</tr>
<tr>
<td><code>6</code></td>
<td>System</td>
</tr>
<tr>
<td><code>7</code></td>
<td>Settings</td>
</tr>
<tr>
<td><code>r</code></td>
<td>Refresh data</td>
</tr>
<tr>
<td><code>Ctrl+K</code> / <code>Cmd+K</code></td>
<td>Command palette</td>
</tr>
<tr>
<td><code>Ctrl+F</code></td>
<td>Search current conversation</td>
</tr>
<tr>
<td><code>Escape</code></td>
<td>Close palette / modal</td>
</tr>
</tbody>
</table>
<p><img src="/images/claude-lens/command-palette.webp" alt="Command Palette"></p>
<hr>
<h2>Under the Hood</h2>
<p>Claude Lens is a <strong>read-mostly companion</strong>. It never writes to your Claude Code state or interferes with running agents.</p>
<p>A lightweight Node.js main process watches the filesystem with <code>chokidar</code>, handles JSONL scanning and deduplication (Claude Code's streaming writes can massively overcount if you're not careful), and pushes updates via IPC to the React frontend.</p>
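<p>The deduplication matters more than it sounds: a streamed assistant turn can be flushed to the JSONL several times, each copy carrying the same usage numbers. Claude Lens does this in Node, but the idea fits in a few lines of Python (field names here are illustrative, not the exact on-disk schema):</p>

```python
def total_tokens(entries: list[dict]) -> int:
    """Sum token usage, counting each message id exactly once.

    Streaming flushes can write the same message repeatedly; naively
    summing every line would massively overcount.
    """
    seen = set()
    total = 0
    for e in entries:
        mid = e.get("id")
        if mid in seen:
            continue  # later flush of a message we already counted
        seen.add(mid)
        usage = e.get("usage", {})
        total += usage.get("input_tokens", 0) + usage.get("output_tokens", 0)
    return total
```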
<p><strong>The stack:</strong> Electron 40 / React 19 / TypeScript / Tailwind CSS v4 / Recharts / React Flow</p>
<hr>
<h2>Get Started</h2>
<pre><code class="hljs language-bash">git <span class="hljs-built_in">clone</span> https://github.com/shansin/claude-lens.git
<span class="hljs-built_in">cd</span> claude-lens
npm install
npm run dev
</code></pre>
<p>That's it. If you've used Claude Code before, the app reads from <code>~/.claude/</code> and your dashboard is live immediately.</p>
<p>For production builds (macOS <code>.dmg</code>, Windows NSIS, Linux AppImage + deb):</p>
<pre><code class="hljs language-bash">npm run build
</code></pre>
<hr>
<h2>Is This For You?</h2>
<p>If you run multi-agent Claude Code teams, care about your API costs, or want a civilized way to browse conversations and manage settings without hand-editing JSON — yes.</p>
<p>Stop flying blind. Know exactly what every agent is doing, what it costs, and what happened.</p>
<p><em>Claude Lens is open source under the ISC license. Contributions welcome.</em></p>]]></content:encoded>
    </item>
    
    <item>
      <title><![CDATA[Leo: My Zero-Cost, Privacy-First AI Assistant on WhatsApp]]></title>
      <link>https://shsin.blog/posts/whatsapp-leo</link>
      <guid isPermaLink="true">https://shsin.blog/posts/whatsapp-leo</guid>
      <pubDate>Sun, 22 Feb 2026 00:00:00 GMT</pubDate>
      <dc:creator>Shantanu Singh</dc:creator>
      <category>local-ai</category>
      <category>assistant</category>
      <category>privacy</category>
      <category>agents</category>
      <description><![CDATA[A fully local AI assistant inside WhatsApp, handling queries, calendar, email, and fitness tracking with zero API costs and complete data privacy.]]></description>
      <content:encoded><![CDATA[<p>$0/month. Runs on your hardware. Lives in WhatsApp.</p>
<p>That's Leo. An AI assistant I built that handles queries, searches the web, manages your calendar, reads your email, tracks your fitness, and delivers a personalized briefing every morning. All from the app you're probably already using to text family and friends.</p>

        <div class="youtube-embed-wrapper" style="aspect-ratio: 9/16; width: 45%; max-width: 100%;">
          <iframe src="https://www.youtube.com/embed/_m7avpUflfs" title="YouTube video player" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="">
          </iframe>
        </div>
      
<p><a href="https://github.com/shansin/whatsapp-leo">GitHub</a></p>
<h2>Why I Built This</h2>
<p>WhatsApp is already on everyone's phone. It's the most popular messaging app on the planet, and I was already using it to stay connected with family and friends. The question was: what if it could also manage my digital life?</p>
<p>I wanted four things:</p>
<ul>
<li><strong>Privacy first</strong>: My data never leaves my machine</li>
<li><strong>Control</strong>: I own the logic for workflows, system prompts, and model choice</li>
<li><strong>Zero recurring cost</strong>: No API subscriptions, no token metering</li>
<li><strong>Learn by building</strong>: A real project to deepen my understanding of local LLMs and agents</li>
</ul>
<p>Leo is the result.</p>
<h2>What Leo Can Do</h2>
<h3>Intelligent Conversations</h3>
<p>Leo handles the full range of AI assistant tasks: answering questions, brainstorming, deep research, explaining concepts. Each conversation maintains its own memory via SQLite-backed sessions, so Leo remembers what you discussed earlier.</p>
<h3>Web Search</h3>
<p>Need current information? Leo connects to Brave Search for real-time data. Ask about news, look up facts, research any topic.</p>
<pre><code>What's the latest Supreme Court ruling on tariffs?

Do deep research and summarize whether tariffs are good or bad for the US economy
</code></pre>
<h3>Google Workspace Integration</h3>
<p>Leo becomes a productivity layer across your entire Google account:</p>
<table>
<thead>
<tr>
<th>Service</th>
<th>Capabilities</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Google Calendar</strong></td>
<td>View events, create meetings, find free time slots</td>
</tr>
<tr>
<td><strong>Gmail</strong></td>
<td>Search threads, draft and send emails</td>
</tr>
<tr>
<td><strong>Google Docs</strong></td>
<td>Create, read, find, update documents</td>
</tr>
<tr>
<td><strong>Google Drive</strong></td>
<td>Search files, create folders, download content</td>
</tr>
<tr>
<td><strong>Google Sheets</strong></td>
<td>Read data, get ranges</td>
</tr>
<tr>
<td><strong>Google Slides</strong></td>
<td>Read presentations</td>
</tr>
</tbody>
</table>
<pre><code>@leo, am I free this Sat 5pm? if so add 2 hr block for Tom's bday
</code></pre>
<h3>Health &amp; Fitness</h3>
<p>Leo connects to Garmin Connect to pull your fitness data: sleep patterns, training schedule, workout history, performance trends. This feeds directly into your morning briefings.</p>
<h3>One-Time Reminders</h3>
<p>Use natural language:</p>
<pre><code>#remindme in 30 minutes to call mom
#remindme tomorrow at 9am to check emails
#remindme at 12pm Feb 25, 2026 to complete taxes
</code></pre>
<p>Leo parses your request and messages you at the right time.</p>
<h3>Recurring Reminders</h3>
<p>Build habits:</p>
<pre><code>#reminder add "9pm Sun to Thu" Review and adjust tomorrow's calendar
#reminder add "12:30 pm Thursdays" Read Weekly Review Doc
#reminder help
#reminder list
#reminder remove &lt;id&gt;
</code></pre>
<h3>Scheduled Briefings</h3>
<p>This is the feature I use most. You define a prompt and a schedule; Leo runs it and delivers the results to your WhatsApp:</p>
<pre><code>#briefing add "Morning Brief" "6:00am everyday" Get today's scheduled training from Garmin, today's calendar events, and unread emails summary

#briefing add "Evening Brief" "5:00pm everyday" Get unread emails summary and top 2 news from today

#briefing help
#briefing list
#briefing remove &lt;id&gt;
</code></pre>
<p>Wake up to a personalized digest built from your actual calendar, email, and fitness data.</p>
<h3>Hooks: Bridge to External Programs</h3>
<p>Leo can route messages to any program on your machine through bidirectional named pipes. Each hook creates two FIFOs: one for sending messages to the program, one for receiving responses back.</p>
<p>Trigger with <code>#hook-name message</code> or <code>@hook-name message</code>. I have hooks for <code>claude</code> and <code>codex</code>, so I can type <code>#claude explain quantum computing</code> in WhatsApp and get a Claude response routed right back into the chat.</p>
<p>This turns Leo into a message router that can bridge WhatsApp to virtually anything running on your machine.</p>
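<p>The FIFO plumbing is simple enough to sketch. The snippet below is an illustrative stand-in, not Leo's actual hook code: the path layout, function names, and the fake external program are all hypothetical, but the blocking round trip over two named pipes is the same idea.</p>

```python
import os
import tempfile
import threading

def make_hook(name, base_dir):
    """Create the two FIFOs a hook uses: one for sending messages to
    the external program, one for reading its responses back.
    Path layout is illustrative, not Leo's exact convention."""
    to_prog = os.path.join(base_dir, f"{name}.in")
    from_prog = os.path.join(base_dir, f"{name}.out")
    for path in (to_prog, from_prog):
        if not os.path.exists(path):
            os.mkfifo(path)
    return to_prog, from_prog

def route(to_prog, from_prog, message):
    """Blocking round trip: write the user's message into one FIFO,
    then read the program's reply from the other."""
    with open(to_prog, "w") as f:
        f.write(message + "\n")
    with open(from_prog) as f:
        return f.readline().rstrip("\n")

def fake_program(to_prog, from_prog):
    """Stand-in for an external tool sitting on the other end of the pipes;
    here it just echoes the message back in uppercase."""
    with open(to_prog) as f:
        msg = f.readline().rstrip("\n")
    with open(from_prog, "w") as f:
        f.write(msg.upper() + "\n")
```

<p>Opening a FIFO blocks until both a reader and a writer attach, which is what makes the request/response handshake work without any polling.</p>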
<h3>Test Mode</h3>
<p>Leo includes a local Gradio UI at <code>http://127.0.0.1:7860</code> that bypasses the WhatsApp bridge entirely. It has a model selector to hot-swap Ollama models at runtime and a live system log panel. All background schedulers still run, so you can iterate on prompts without needing your phone.</p>
<hr>
<h2>Why Zero Cost Actually Works</h2>
<table>
<thead>
<tr>
<th>Component</th>
<th>Cost</th>
</tr>
</thead>
<tbody>
<tr>
<td>LLM (Ollama + local model)</td>
<td>$0</td>
</tr>
<tr>
<td>WhatsApp messaging</td>
<td>$0 (uses WhatsApp Web protocol)</td>
</tr>
<tr>
<td>Brave Search (free tier)</td>
<td>$0</td>
</tr>
<tr>
<td>Google APIs</td>
<td>Free</td>
</tr>
<tr>
<td>Garmin data access</td>
<td>Free</td>
</tr>
<tr>
<td>Hosting</td>
<td>$0 (runs locally)</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td><strong>$0/month</strong></td>
</tr>
</tbody>
</table>
<p>Electricity is the only real cost. My estimates put it well under $10/year:</p>
<ul>
<li>The service draws ~60W at idle</li>
<li>Inference spikes 100-300W for a few seconds on the 5070 Ti</li>
<li>The 5060 Ti is slower but even more efficient</li>
</ul>
<p>The key insight: modern open-source LLMs are good enough for most assistant tasks. Models like GLM-4.7-Flash, gpt-oss:20b, and deepseek-r1:8b run on consumer hardware and deliver strong results without per-token costs.</p>
<p>You already own the hardware. Make it work for you.</p>
<hr>
<h2>The Privacy Advantage</h2>
<p>Leo runs entirely on your local machine:</p>
<ul>
<li><strong>Local LLM</strong>: Powered by Ollama. Inference happens on your GPUs.</li>
<li><strong>Local storage</strong>: Messages, reminders, and sessions live in SQLite databases on your device.</li>
<li><strong>No cloud dependency</strong>: Your conversations never travel to external servers beyond WhatsApp.</li>
<li><strong>Your credentials, your machine</strong>: Google, WhatsApp, and Garmin tokens stay on your hardware.</li>
</ul>
<hr>
<h2>Technical Architecture</h2>
<div class="excalidraw-diagram" data-scene="{
  "type": "excalidraw",
  "version": 2,
  "source": "https://excalidraw.com",
  "elements": [
    {
      "type": "rectangle",
      "version": 1,
      "id": "whatsapp-box",
      "x": 100,
      "y": 20,
      "width": 560,
      "height": 60,
      "angle": 0,
      "strokeColor": "#1e1e1e",
      "backgroundColor": "#a5d8ff",
      "fillStyle": "solid",
      "strokeWidth": 2,
      "roughness": 1,
      "opacity": 100,
      "roundness": { "type": 3 },
      "seed": 101,
      "groupIds": [],
      "frameId": null,
      "boundElements": [
        { "type": "text", "id": "whatsapp-text" },
        { "type": "arrow", "id": "arrow-wa-go" }
      ],
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1
    },
    {
      "type": "text",
      "version": 1,
      "id": "whatsapp-text",
      "x": 100,
      "y": 20,
      "width": 560,
      "height": 60,
      "angle": 0,
      "strokeColor": "#1e1e1e",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 1,
      "roughness": 1,
      "opacity": 100,
      "roundness": null,
      "seed": 102,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "text": "WhatsApp Network",
      "fontSize": 18,
      "fontFamily": 1,
      "textAlign": "center",
      "verticalAlign": "middle",
      "containerId": "whatsapp-box",
      "originalText": "WhatsApp Network",
      "autoResize": true,
      "lineHeight": 1.25,
      "updated": 1
    },
    {
      "type": "arrow",
      "version": 1,
      "id": "arrow-wa-go",
      "x": 380,
      "y": 80,
      "width": 0,
      "height": 50,
      "angle": 0,
      "strokeColor": "#1e1e1e",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 2,
      "roughness": 1,
      "opacity": 100,
      "roundness": { "type": 2 },
      "seed": 103,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "points": [[0, 0], [0, 50]],
      "lastCommittedPoint": null,
      "startBinding": { "elementId": "whatsapp-box", "focus": 0, "gap": 1, "fixedPoint": null },
      "endBinding": { "elementId": "go-bridge-box", "focus": 0, "gap": 1, "fixedPoint": null },
      "startArrowhead": null,
      "endArrowhead": "arrow",
      "updated": 1
    },
    {
      "type": "rectangle",
      "version": 1,
      "id": "go-bridge-box",
      "x": 100,
      "y": 130,
      "width": 560,
      "height": 160,
      "angle": 0,
      "strokeColor": "#1e1e1e",
      "backgroundColor": "#ffec99",
      "fillStyle": "solid",
      "strokeWidth": 2,
      "roughness": 1,
      "opacity": 100,
      "roundness": { "type": 3 },
      "seed": 104,
      "groupIds": [],
      "frameId": null,
      "boundElements": [
        { "type": "text", "id": "go-bridge-text" },
        { "type": "arrow", "id": "arrow-wa-go" },
        { "type": "arrow", "id": "arrow-go-py" }
      ],
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1
    },
    {
      "type": "text",
      "version": 1,
      "id": "go-bridge-text",
      "x": 100,
      "y": 130,
      "width": 560,
      "height": 160,
      "angle": 0,
      "strokeColor": "#1e1e1e",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 1,
      "roughness": 1,
      "opacity": 100,
      "roundness": null,
      "seed": 105,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "text": "Go WhatsApp Bridge\n(whatsmeow library - WhatsApp Web Protocol)\n- QR Code Authentication\n- Message receiving / sending  -  Media handling\n- SQLite storage for messages & session",
      "fontSize": 16,
      "fontFamily": 1,
      "textAlign": "center",
      "verticalAlign": "middle",
      "containerId": "go-bridge-box",
      "originalText": "Go WhatsApp Bridge\n(whatsmeow library - WhatsApp Web Protocol)\n- QR Code Authentication\n- Message receiving / sending  -  Media handling\n- SQLite storage for messages & session",
      "autoResize": true,
      "lineHeight": 1.4,
      "updated": 1
    },
    {
      "type": "arrow",
      "version": 1,
      "id": "arrow-go-py",
      "x": 380,
      "y": 290,
      "width": 0,
      "height": 60,
      "angle": 0,
      "strokeColor": "#1971c2",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 2,
      "roughness": 1,
      "opacity": 100,
      "roundness": { "type": 2 },
      "seed": 106,
      "groupIds": [],
      "frameId": null,
      "boundElements": [{ "type": "text", "id": "arrow-go-py-label" }],
      "isDeleted": false,
      "link": null,
      "locked": false,
      "points": [[0, 0], [0, 60]],
      "lastCommittedPoint": null,
      "startBinding": { "elementId": "go-bridge-box", "focus": 0, "gap": 1, "fixedPoint": null },
      "endBinding": { "elementId": "python-agent-box", "focus": 0, "gap": 1, "fixedPoint": null },
      "startArrowhead": null,
      "endArrowhead": "arrow",
      "updated": 1
    },
    {
      "type": "text",
      "version": 1,
      "id": "arrow-go-py-label",
      "x": 390,
      "y": 312,
      "width": 160,
      "height": 20,
      "angle": 0,
      "strokeColor": "#1971c2",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 1,
      "roughness": 1,
      "opacity": 100,
      "roundness": null,
      "seed": 107,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "text": "Unix Socket (secure IPC)",
      "fontSize": 13,
      "fontFamily": 1,
      "textAlign": "left",
      "verticalAlign": "middle",
      "containerId": "arrow-go-py",
      "originalText": "Unix Socket (secure IPC)",
      "autoResize": true,
      "lineHeight": 1.25,
      "updated": 1
    },
    {
      "type": "rectangle",
      "version": 1,
      "id": "python-agent-box",
      "x": 100,
      "y": 350,
      "width": 560,
      "height": 180,
      "angle": 0,
      "strokeColor": "#1e1e1e",
      "backgroundColor": "#b2f2bb",
      "fillStyle": "solid",
      "strokeWidth": 2,
      "roughness": 1,
      "opacity": 100,
      "roundness": { "type": 3 },
      "seed": 108,
      "groupIds": [],
      "frameId": null,
      "boundElements": [
        { "type": "text", "id": "python-agent-text" },
        { "type": "arrow", "id": "arrow-go-py" },
        { "type": "arrow", "id": "arrow-py-brave" },
        { "type": "arrow", "id": "arrow-py-workspace" },
        { "type": "arrow", "id": "arrow-py-garmin" }
      ],
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1
    },
    {
      "type": "text",
      "version": 1,
      "id": "python-agent-text",
      "x": 100,
      "y": 350,
      "width": 560,
      "height": 180,
      "angle": 0,
      "strokeColor": "#1e1e1e",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 1,
      "roughness": 1,
      "opacity": 100,
      "roundness": null,
      "seed": 109,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "text": "Python Agent Server\n(OpenAI Agents SDK + Local LLM via Ollama)\n- Message processing\n- Agent management with LRU cache\n- Reminder & Briefing schedulers\n- Session persistence",
      "fontSize": 16,
      "fontFamily": 1,
      "textAlign": "center",
      "verticalAlign": "middle",
      "containerId": "python-agent-box",
      "originalText": "Python Agent Server\n(OpenAI Agents SDK + Local LLM via Ollama)\n- Message processing\n- Agent management with LRU cache\n- Reminder & Briefing schedulers\n- Session persistence",
      "autoResize": true,
      "lineHeight": 1.4,
      "updated": 1
    },
    {
      "type": "arrow",
      "version": 1,
      "id": "arrow-py-brave",
      "x": 230,
      "y": 530,
      "width": 110,
      "height": 70,
      "angle": 0,
      "strokeColor": "#1e1e1e",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 2,
      "roughness": 1,
      "opacity": 100,
      "roundness": { "type": 2 },
      "seed": 110,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "points": [[0, 0], [-110, 70]],
      "lastCommittedPoint": null,
      "startBinding": { "elementId": "python-agent-box", "focus": 0, "gap": 1, "fixedPoint": null },
      "endBinding": { "elementId": "brave-box", "focus": 0, "gap": 1, "fixedPoint": null },
      "startArrowhead": null,
      "endArrowhead": "arrow",
      "updated": 1
    },
    {
      "type": "arrow",
      "version": 1,
      "id": "arrow-py-workspace",
      "x": 380,
      "y": 530,
      "width": 0,
      "height": 70,
      "angle": 0,
      "strokeColor": "#1e1e1e",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 2,
      "roughness": 1,
      "opacity": 100,
      "roundness": { "type": 2 },
      "seed": 111,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "points": [[0, 0], [0, 70]],
      "lastCommittedPoint": null,
      "startBinding": { "elementId": "python-agent-box", "focus": 0, "gap": 1, "fixedPoint": null },
      "endBinding": { "elementId": "workspace-box", "focus": 0, "gap": 1, "fixedPoint": null },
      "startArrowhead": null,
      "endArrowhead": "arrow",
      "updated": 1
    },
    {
      "type": "arrow",
      "version": 1,
      "id": "arrow-py-garmin",
      "x": 530,
      "y": 530,
      "width": 110,
      "height": 70,
      "angle": 0,
      "strokeColor": "#1e1e1e",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 2,
      "roughness": 1,
      "opacity": 100,
      "roundness": { "type": 2 },
      "seed": 112,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "points": [[0, 0], [110, 70]],
      "lastCommittedPoint": null,
      "startBinding": { "elementId": "python-agent-box", "focus": 0, "gap": 1, "fixedPoint": null },
      "endBinding": { "elementId": "garmin-box", "focus": 0, "gap": 1, "fixedPoint": null },
      "startArrowhead": null,
      "endArrowhead": "arrow",
      "updated": 1
    },
    {
      "type": "rectangle",
      "version": 1,
      "id": "brave-box",
      "x": 40,
      "y": 600,
      "width": 180,
      "height": 80,
      "angle": 0,
      "strokeColor": "#1e1e1e",
      "backgroundColor": "#d0bfff",
      "fillStyle": "solid",
      "strokeWidth": 2,
      "roughness": 1,
      "opacity": 100,
      "roundness": { "type": 3 },
      "seed": 113,
      "groupIds": [],
      "frameId": null,
      "boundElements": [
        { "type": "text", "id": "brave-text" },
        { "type": "arrow", "id": "arrow-py-brave" }
      ],
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1
    },
    {
      "type": "text",
      "version": 1,
      "id": "brave-text",
      "x": 40,
      "y": 600,
      "width": 180,
      "height": 80,
      "angle": 0,
      "strokeColor": "#1e1e1e",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 1,
      "roughness": 1,
      "opacity": 100,
      "roundness": null,
      "seed": 114,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "text": "Brave Search\nMCP",
      "fontSize": 16,
      "fontFamily": 1,
      "textAlign": "center",
      "verticalAlign": "middle",
      "containerId": "brave-box",
      "originalText": "Brave Search\nMCP",
      "autoResize": true,
      "lineHeight": 1.4,
      "updated": 1
    },
    {
      "type": "rectangle",
      "version": 1,
      "id": "workspace-box",
      "x": 290,
      "y": 600,
      "width": 180,
      "height": 80,
      "angle": 0,
      "strokeColor": "#1e1e1e",
      "backgroundColor": "#d0bfff",
      "fillStyle": "solid",
      "strokeWidth": 2,
      "roughness": 1,
      "opacity": 100,
      "roundness": { "type": 3 },
      "seed": 115,
      "groupIds": [],
      "frameId": null,
      "boundElements": [
        { "type": "text", "id": "workspace-text" },
        { "type": "arrow", "id": "arrow-py-workspace" }
      ],
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1
    },
    {
      "type": "text",
      "version": 1,
      "id": "workspace-text",
      "x": 290,
      "y": 600,
      "width": 180,
      "height": 80,
      "angle": 0,
      "strokeColor": "#1e1e1e",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 1,
      "roughness": 1,
      "opacity": 100,
      "roundness": null,
      "seed": 116,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "text": "Workspace MCP\n(Google)",
      "fontSize": 16,
      "fontFamily": 1,
      "textAlign": "center",
      "verticalAlign": "middle",
      "containerId": "workspace-box",
      "originalText": "Workspace MCP\n(Google)",
      "autoResize": true,
      "lineHeight": 1.4,
      "updated": 1
    },
    {
      "type": "rectangle",
      "version": 1,
      "id": "garmin-box",
      "x": 540,
      "y": 600,
      "width": 180,
      "height": 80,
      "angle": 0,
      "strokeColor": "#1e1e1e",
      "backgroundColor": "#d0bfff",
      "fillStyle": "solid",
      "strokeWidth": 2,
      "roughness": 1,
      "opacity": 100,
      "roundness": { "type": 3 },
      "seed": 117,
      "groupIds": [],
      "frameId": null,
      "boundElements": [
        { "type": "text", "id": "garmin-text" },
        { "type": "arrow", "id": "arrow-py-garmin" }
      ],
      "isDeleted": false,
      "link": null,
      "locked": false,
      "updated": 1
    },
    {
      "type": "text",
      "version": 1,
      "id": "garmin-text",
      "x": 540,
      "y": 600,
      "width": 180,
      "height": 80,
      "angle": 0,
      "strokeColor": "#1e1e1e",
      "backgroundColor": "transparent",
      "fillStyle": "solid",
      "strokeWidth": 1,
      "roughness": 1,
      "opacity": 100,
      "roundness": null,
      "seed": 118,
      "groupIds": [],
      "frameId": null,
      "boundElements": null,
      "isDeleted": false,
      "link": null,
      "locked": false,
      "text": "Garmin MCP",
      "fontSize": 16,
      "fontFamily": 1,
      "textAlign": "center",
      "verticalAlign": "middle",
      "containerId": "garmin-box",
      "originalText": "Garmin MCP",
      "autoResize": true,
      "lineHeight": 1.4,
      "updated": 1
    }
  ],
  "appState": {
    "gridSize": null,
    "viewBackgroundColor": "#ffffff"
  },
  "files": {}
}
"></div>
<p>The system splits into two processes that communicate over Unix domain sockets (paths configurable via <code>INSTANCE_GUID</code>), which allows multiple Leo instances on the same machine. The Go bridge handles the WhatsApp protocol; the Python server handles AI reasoning. Neither exposes a network port.</p>
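<p>Conceptually, the IPC boundary looks like the sketch below. This is not the real bridge code (the Go side speaks WhatsApp, and the actual socket path derives from <code>INSTANCE_GUID</code>); it's a minimal Python-on-both-ends model of one message crossing a Unix domain socket with no network port involved.</p>

```python
import json
import os
import socket
import tempfile
import threading

def make_server(sock_path):
    """Bind a Unix domain socket: filesystem-addressed, no TCP port opened."""
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(sock_path)
    srv.listen(1)
    return srv

def serve_once(srv):
    """Stand-in for the Python agent server: handle a single message."""
    conn, _ = srv.accept()
    request = json.loads(conn.recv(4096).decode())
    reply = {"reply": f"echo: {request['text']}"}
    conn.sendall(json.dumps(reply).encode())
    conn.close()
    srv.close()

def send(sock_path, text):
    """What the Go bridge does conceptually: ship one message, await the reply."""
    cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    cli.connect(sock_path)
    cli.sendall(json.dumps({"text": text}).encode())
    reply = json.loads(cli.recv(4096).decode())
    cli.close()
    return reply["reply"]
```

<p>Because the socket lives on the filesystem, access is governed by file permissions rather than firewall rules, which is the security property the two-process split relies on.</p>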
<h3>Go WhatsApp Bridge (<code>whatsapp-mcp/whatsapp-bridge/</code>)</h3>
<ul>
<li>Built on <code>whatsmeow</code>, a Go library implementing WhatsApp's multi-device protocol</li>
<li>Heavily modified for performance and to support all Leo use cases</li>
<li>Handles authentication via QR code scanning</li>
<li>Manages message storage in SQLite</li>
<li>Processes media: images, videos, audio, documents</li>
<li>Includes a custom Ogg Opus parser for voice message duration detection</li>
</ul>
<h3>Python Agent Server (<code>agent/</code>)</h3>
<ul>
<li>Uses OpenAI Agents SDK for orchestration</li>
<li>Connects to Ollama via OpenAI-compatible API (<code>http://localhost:11434/v1</code>)</li>
<li>Agent factory with LRU cache (max 20 agents, 30-minute TTL) for multi-conversation support</li>
<li>Natural language time parsing for reminders via an LLM agent</li>
<li>Cron-based scheduling for briefings and recurring reminders</li>
</ul>
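<p>The agent factory's cache policy can be sketched as a small LRU-with-TTL structure. The limits below (20 agents, 30-minute TTL) come from the description above; the class itself is a simplified illustration, not Leo's actual implementation.</p>

```python
import time
from collections import OrderedDict

class AgentCache:
    """LRU cache with per-entry TTL, mirroring the agent factory's limits
    described above (max 20 agents, 30-minute TTL)."""

    def __init__(self, max_size=20, ttl=30 * 60):
        self.max_size = max_size
        self.ttl = ttl
        self._items = OrderedDict()  # chat_id -> (agent, last_used)

    def get(self, chat_id, factory):
        """Return a cached agent for this conversation, or build one."""
        now = time.monotonic()
        entry = self._items.get(chat_id)
        if entry is not None and (now - entry[1]) > self.ttl:
            del self._items[chat_id]  # expired: treat as a miss
            entry = None
        if entry is not None:
            self._items[chat_id] = (entry[0], now)
            self._items.move_to_end(chat_id)  # mark as most recently used
            return entry[0]
        agent = factory(chat_id)  # cache miss: build a fresh agent
        self._items[chat_id] = (agent, now)
        self._items.move_to_end(chat_id)
        if len(self._items) > self.max_size:
            self._items.popitem(last=False)  # evict least recently used
        return agent
```

<p>The point of the TTL on top of plain LRU is that an idle conversation releases its agent (and its memory) even when the cache isn't full.</p>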
<h3>MCP Servers</h3>
<p>Three servers launched as child processes using stdio-based MCP:</p>
<ul>
<li><code>brave-search-mcp</code>: Web search</li>
<li><code>workspace-mcp</code>: Google Workspace (Docs, Calendar, Gmail, Drive, Sheets, Slides)</li>
<li><code>garmin-mcp</code>: Fitness and health data</li>
</ul>
<p>All three communicate with the agent server over stdin/stdout, not HTTP.</p>
<h3>Operating Modes</h3>
<ol>
<li><strong>Dedicated Number Mode</strong> (<code>IS_DEDICATED_NUMBER=true</code>): Responds to all DMs and group mentions. Good for a dedicated Leo phone number.</li>
<li><strong>Mention Mode</strong>: Only responds when explicitly mentioned (<code>@leo</code> or <code>#leo</code>). Works with your existing WhatsApp account.</li>
</ol>
<h3>Access Control</h3>
<ul>
<li><strong>Privileged whitelist</strong> (<code>ALLOWED_SENDERS</code>): Only listed phone numbers get Google Workspace, Garmin, reminders, and briefings access. Non-privileged users can still chat and search the web.</li>
<li>Unix domain sockets for inter-process communication; no exposed network ports.</li>
<li>Thread-local SQLite connections to avoid concurrency issues.</li>
<li>Environment-based configuration for all sensitive credentials.</li>
</ul>
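<p>The thread-local SQLite pattern from the list above is worth showing, since <code>sqlite3</code> connections aren't safe to share across threads by default. This is a generic sketch of the technique, not Leo's code; the function name is hypothetical.</p>

```python
import sqlite3
import threading

_local = threading.local()

def get_conn(db_path=":memory:"):
    """Return this thread's own SQLite connection, creating it on first use.
    Note: with ":memory:" each thread gets a separate in-memory database;
    a real deployment would pass a shared file path."""
    if not hasattr(_local, "conn"):
        _local.conn = sqlite3.connect(db_path)
    return _local.conn
```

<p>Each scheduler or worker thread gets its own connection object, so no two threads ever touch the same handle concurrently.</p>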
<hr>
<h2>Interesting Technical Details</h2>
<h3>Natural Language Time Parsing</h3>
<p>Instead of rigid regex patterns, the reminder system uses an LLM agent to parse times. It handles:</p>
<ul>
<li>"in 30 minutes"</li>
<li>"tomorrow at 9am"</li>
<li>"at 5pm Feb 14, 2026"</li>
<li>"next Monday morning"</li>
</ul>
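<p>To make the target behavior concrete, here is a deterministic toy parser for the two simplest phrases above. It is emphatically not what Leo does (Leo delegates parsing to an LLM agent precisely to avoid brittle patterns like these); the function name and regexes are illustrative only.</p>

```python
import re
from datetime import datetime, timedelta

def parse_when(phrase, now=None):
    """Toy resolver for natural-language times. Handles only
    "in N minutes/hours" and "tomorrow at Ham/pm"; anything
    fancier is exactly why the real system uses an LLM."""
    now = now or datetime.now()
    m = re.match(r"in (\d+) (minute|hour)s?", phrase.lower())
    if m:
        return now + timedelta(**{m.group(2) + "s": int(m.group(1))})
    m = re.match(r"tomorrow at (\d{1,2})(am|pm)", phrase.lower())
    if m:
        hour = int(m.group(1)) % 12 + (12 if m.group(2) == "pm" else 0)
        return (now + timedelta(days=1)).replace(
            hour=hour, minute=0, second=0, microsecond=0)
    raise ValueError(f"unparsed: {phrase!r}")
```

<p>An LLM handles "next Monday morning" or "at 5pm Feb 14, 2026" with the same prompt, where a regex approach would need a new pattern per phrasing.</p>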
<h3>Cron-Based Scheduling</h3>
<p>Briefings and recurring reminders use <code>croniter</code> for flexible scheduling:</p>
<table>
<thead>
<tr>
<th>Input</th>
<th>Cron Expression</th>
</tr>
</thead>
<tbody>
<tr>
<td>"9am everyday"</td>
<td><code>0 9 * * *</code></td>
</tr>
<tr>
<td>"Monday 8am"</td>
<td><code>0 8 * * 1</code></td>
</tr>
<tr>
<td>"5pm friday"</td>
<td><code>0 17 * * 5</code></td>
</tr>
</tbody>
</table>
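<p>A deterministic toy of the mapping in the table, for illustration. The real pipeline has an LLM produce the cron string and <code>croniter</code> compute fire times; this hypothetical helper only covers the simple "time + optional weekday" shapes shown above.</p>

```python
import re

# Cron weekday numbers: 0 = Sunday .. 6 = Saturday
DAYS = {"sunday": 0, "monday": 1, "tuesday": 2, "wednesday": 3,
        "thursday": 4, "friday": 5, "saturday": 6}

def to_cron(phrase):
    """Map phrases like "9am everyday" or "5pm friday" to a cron
    expression. Toy version of the table above, not Leo's parser."""
    m = re.search(r"(\d{1,2})(?::(\d{2}))?\s*(am|pm)", phrase.lower())
    hour = int(m.group(1)) % 12 + (12 if m.group(3) == "pm" else 0)
    minute = int(m.group(2) or 0)
    day = "*"
    for name, num in DAYS.items():
        if name[:3] in phrase.lower():  # match "mon", "monday", "mondays"
            day = str(num)
            break
    return f"{minute} {hour} * * {day}"
```

<p>Feeding the resulting expression to <code>croniter</code> then yields the next delivery time for each briefing or recurring reminder.</p>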
<h3>WhatsApp LID Resolution</h3>
<p>For privacy, WhatsApp identifies some senders by LID (Linked ID) rather than phone number, and a LID can't be used as the destination for an outbound message. The Go bridge automatically resolves LIDs to actual phone numbers before sending.</p>
<h3>Performance Optimizations</h3>
<ul>
<li><strong>Agent caching</strong>: LRU eviction prevents memory bloat from idle conversations</li>
<li><strong>Pre-built MCP parameters</strong>: Avoids per-message object creation overhead</li>
<li><strong>Shared environment copy</strong>: Avoids copying 100+ environment variables per request</li>
<li><strong>Singleton OpenAI client</strong>: Reused across all messages</li>
</ul>
<hr>
<h2>Getting Started</h2>
<p>Built on Ubuntu with NVIDIA GPUs, but it should work the same on Mac and WSL.</p>
<p><strong>Prerequisites:</strong> Python &gt;= 3.13, <a href="https://docs.astral.sh/uv/">uv</a>, Go, <a href="https://ollama.com">Ollama</a>, and Node.js/npm.</p>
<ol>
<li>Clone the repository</li>
<li>Install Ollama and pull a model: <code>ollama pull glm-4.7-flash</code></li>
<li>Copy <code>.env_example</code> to <code>.env</code> and fill in your Brave Search API key, allowed senders, and other settings</li>
<li>Run the services: <code>./start_services.sh</code></li>
<li>Scan the QR code to connect WhatsApp</li>
<li>Start messaging Leo</li>
</ol>
<p>Want to try it without a phone? Run in test mode:</p>
<pre><code class="hljs language-bash">IS_TEST_MODE=<span class="hljs-literal">true</span> ./start_services.sh
</code></pre>
<p>Then open <code>http://127.0.0.1:7860</code> for the Gradio UI.</p>
<hr>
<h2>What's Next</h2>
<ul>
<li><strong>Long-term memory</strong>: Remember preferences, recall past conversations</li>
<li><strong>Multi-modal capabilities</strong>: Image analysis, document understanding</li>
<li><strong>Voice improvements</strong>: Better TTS/STT for seamless voice conversations</li>
<li><strong>RAG on personal data</strong>: Index and search through your own documents</li>
<li><strong>Family/shared mode</strong>: Multiple users with separate contexts</li>
</ul>
<hr>
<h2>The Bottom Line</h2>
<ul>
<li><strong>Own, don't rent</strong>: Your hardware, your model, your rules</li>
<li><strong>Privacy by design</strong>: Data never leaves your machine</li>
<li><strong>Zero marginal cost</strong>: Chat all day, run 50 briefings. Nothing extra.</li>
<li><strong>Meet users where they are</strong>: WhatsApp is already in everyone's pocket</li>
</ul>
<p>You don't have to choose between convenience and privacy. With Leo, you get both.</p>
<hr>
<p><em>Leo is open source. Your assistant, your data, your control.</em>
<em>This post was updated on 2/28/2026</em></p>]]></content:encoded>
    </item>
    
    <item>
      <title><![CDATA[Optional Hard Things]]></title>
      <link>https://shsin.blog/posts/hard-optional-things</link>
      <guid isPermaLink="true">https://shsin.blog/posts/hard-optional-things</guid>
      <pubDate>Wed, 04 Feb 2026 00:00:00 GMT</pubDate>
      <dc:creator>Shantanu Singh</dc:creator>
      <category>running</category>
      <category>resilience</category>
      <description><![CDATA[Modern life decoupled reward from effort. Voluntary hardship rewires the equation and builds lasting resilience.]]></description>
<content:encoded><![CDATA[<p>Easy things are easy. Then there are the hard things you can't skip: paying bills, doing taxes, feeding yourself. They're <em>mandatory</em>; skip them and the consequences hit fast. But there's something different about doing <strong>hard optional things</strong>.</p>
<h2>The Effort-Reward Mismatch</h2>
<p>For 99% of human history, <strong>reward was linked to effort</strong>. You hunted to eat. You built shelter to stay warm. You walked miles to find water. Our brains release dopamine <em>after</em> we overcome resistance. The struggle wasn't just a barrier to the reward. It was part of the reward equation.</p>
<p>Modern life short-circuited this loop. We decoupled reward from effort. You can get a dopamine hit without moving a muscle. Just swipe your thumb on a glass screen.</p>
<p>This instant gratification confuses our ancient biology. We get the prize without the hunt, the feast without the famine. The result isn't deep happiness. It's a hollow satiety that leaves us craving more. We're addicted to cheap pleasure because we've forgotten how to earn expensive happiness.</p>
<h2>Dopamine: The Currency of Pursuit</h2>
<p>We often confuse <em>pleasure</em> with <em>happiness</em>, but they run on entirely different mechanisms.</p>
<p><strong>Pleasure</strong> is short-term. It's the hit from doom-scrolling, eating sugar, or binge-watching a show. Cheap dopamine, zero effort. The problem: it spikes fast and crashes hard, leaving a craving for more (the addiction loop) and a baseline that slowly drops over time.</p>
<p><strong>Happiness</strong>, in the deep, contented sense, comes from <em>effort</em>. It's the dopamine of <strong>pursuit and achievement</strong>. Training for a marathon, learning a complex skill, building a business. These engage a long-term dopamine release that feels like purpose.</p>
<p>Choosing hard things rewires your reward system. You stop being a passive consumer of pleasure and become an active creator of your own happiness.</p>
<h2>Strategic Suffering</h2>
<p>The core trade-off of life is simple:</p>
<blockquote>
<p><strong>Easy Now</strong> leads to <strong>Longer Hard Later</strong>.
<strong>Hard Now</strong> leads to <strong>Predictable Easy Later</strong>.</p>
</blockquote>
<p>If you choose the easy path now (skipping the workout, avoiding the difficult conversation, procrastinating on the project), you're borrowing comfort from your future self. The interest rate on that loan is brutal. It shows up as poor health, lack of skills, and regret.</p>
<p>But when you choose voluntary exposure to difficulty, you build resilience. You train your nervous system to handle stress.</p>
<ul>
<li><strong>Cold showers</strong> teach you to suppress the panic response.</li>
<li><strong>Heavy lifting</strong> teaches you that you can bear a load.</li>
<li><strong>Deep work</strong> teaches you that you can focus in a distracted world.</li>
<li><strong>Running</strong> teaches you everything there is to know about life. More on this later!</li>
</ul>
<p>These are optional. No one will fire you for skipping them. But doing them signals something to your deepest self: <em>I am capable. I am strong. I can handle whatever comes.</em></p>
<h2>The Magic of the Optional</h2>
<p>The most important thing about optional hardship is precisely that: <strong>it is optional</strong>.</p>
<p>When life forces a struggle on you, a setback, an illness, it's suffering. But when you <em>choose</em> the struggle, it's empowerment. You're not reacting to circumstance. You're building character on purpose.</p>
<p>Pick one hard optional thing today. Not because you have to, but because you don't.</p>
<p><img src="/images/hard-optional-things/hard-optional-things.webp" alt="Success through struggle"></p>]]></content:encoded>
    </item>
    
    <item>
      <title><![CDATA[From Toy Debates to Autonomous Engineering Teams with CrewAI]]></title>
      <link>https://shsin.blog/posts/crew-ai</link>
      <guid isPermaLink="true">https://shsin.blog/posts/crew-ai</guid>
      <pubDate>Wed, 31 Dec 2025 00:00:00 GMT</pubDate>
      <dc:creator>Shantanu Singh</dc:creator>
      <category>local-ai</category>
      <category>agents</category>
      <description><![CDATA[Scaling CrewAI from a simple debate exercise to a fully autonomous software engineering crew, with agents that research, code, and review, all running on local GPUs.]]></description>
<content:encoded><![CDATA[<p>A three-agent debate, a sandboxed coder, a stock picker with memory, and a full engineering team that writes, tests, and reviews its own code. All running on local GPUs.</p>
<p>I discovered <a href="https://www.crewai.com/open-source">CrewAI</a> through this <a href="https://www.udemy.com/course/the-complete-agentic-ai-engineering-course/">Udemy course</a>. This post walks through five levels of increasing complexity, from a toy debate to autonomous software engineering.</p>
<p><a href="https://github.com/shansin/crewai-agents">Github</a></p>
<hr>
<h2>Level 1: The Basics (The Debate Team)</h2>
<p><strong>Theme: Pure Interaction</strong> | <strong><a href="https://github.com/shansin/crewai-agents/tree/main/debate">Code</a></strong></p>
<p>The journey begins with the <strong>Debate</strong> project—the "Hello World" of agent orchestration.</p>
<p>Here, we have three simple agents: two <code>Debaters</code> and a <code>Judge</code>. The complexity is minimal, but the core idea matters: <strong>Role-Playing</strong>.</p>
<ul>
<li><strong>The Setup</strong>: One agent proposes an argument, and the other judges it.</li>
<li><strong>The Feature</strong>: Pure prompt-based personalities. No complex prompt engineering variables required—just <code>role</code>, <code>goal</code>, and <code>backstory</code>.</li>
</ul>
<pre><code class="hljs language-yaml"><span class="hljs-comment"># debate/config/agents.yaml</span>

<span class="hljs-attr">debater:</span>
  <span class="hljs-attr">role:</span> <span class="hljs-string">A</span> <span class="hljs-string">compelling</span> <span class="hljs-string">debater</span>
  <span class="hljs-attr">goal:</span> <span class="hljs-string">Present</span> <span class="hljs-string">a</span> <span class="hljs-string">clear</span> <span class="hljs-string">argument...</span>

<span class="hljs-attr">judge:</span>
  <span class="hljs-attr">role:</span> <span class="hljs-string">Decide</span> <span class="hljs-string">the</span> <span class="hljs-string">winner...</span>
</code></pre>
<p><strong>Key Takeaway</strong>: With just a few lines of YAML, you can create distinct personalities without tools or memory—just pure LLM-to-LLM interaction.</p>
<p><strong>Results:</strong></p>
<pre><code class="hljs language-yaml"><span class="hljs-attr">Topic:</span> <span class="hljs-string">There</span> <span class="hljs-string">needs</span> <span class="hljs-string">to</span> <span class="hljs-string">be</span> <span class="hljs-string">strict</span> <span class="hljs-string">laws</span> <span class="hljs-string">to</span> <span class="hljs-string">regulate</span> <span class="hljs-string">LLMs.</span>
<span class="hljs-attr">Proposer:</span> <span class="hljs-string">gpt-oss:20b</span>
<span class="hljs-attr">Opponent:</span> <span class="hljs-string">deepseek-r1:8b</span>
<span class="hljs-attr">Judge:</span> <span class="hljs-string">llama3.1:8b</span>
</code></pre>
<p>DeepSeek takes the win. Here is Llama's verdict:</p>
<pre><code class="hljs language-text">Final Output: After carefully reviewing the arguments presented by both sides, I have come to the conclusion that the motion "There needs to be strict laws to regulate LLMs" is more convincing.

The proponents of strict regulation present a comprehensive and well-reasoned argument that highlights the numerous risks associated with unregulated LLMs. They convincingly argue that without legal oversight, malicious actors can exploit these systems to manipulate elections, incite violence, or spread pandemics of falsehoods. Furthermore, they emphasize the importance of protecting individual privacy rights by mandating transparent data sourcing, anonymization standards, and user consent.

The opponents' argument, on the other hand, relies heavily on the potential for restrictive regulations to stifle innovation and hinder technological progress. However, I believe that this argument is less compelling due to its vagueness and lack of concrete evidence. The opponents argue that the rapid evolution of LLMs outpaces legislative efforts, rendering strict laws quickly outdated. While this may be true, it does not necessarily follow that strict laws are inherently counterproductive.

In contrast, the proponents provide a detailed analysis of specific risks and propose targeted solutions to address them through legislation. They emphasize the importance of balancing innovation with protection of human rights, democratic integrity, and societal well-being. I am convinced by their argument that strict laws can create a predictable, safe, and equitable environment in which LLMs can deliver their transformative benefits without becoming sources of harm.

Furthermore, I believe that the proponents' emphasis on transparency and accountability through mandatory disclosure requirements is essential for building public trust and ensuring that end-users can scrutinize AI recommendations before accepting them. This approach aligns with the principles of democratic governance and human rights, which should be the foundation of any regulatory framework governing LLMs.

In conclusion, based on the arguments presented, I am convinced that strict laws to regulate LLMs are necessary to safeguard society, ensure accountability, and preserve democratic values. The potential benefits of regulation far outweigh the perceived risks and limitations associated with restrictive legislation.
</code></pre>
<hr>
<h2>Level 2: Safe Code Execution (The Coder)</h2>
<p><strong>Theme: Agency with Guardrails</strong> | <strong><a href="https://github.com/shansin/crewai-agents/tree/main/coder">Code</a></strong></p>
<p>Next, we graduate to the <strong>Coder</strong> project. This is where things get real. An agent that just talks is fun; an agent that <em>does</em> things is useful.</p>
<p>Giving an AI unrestricted access to your terminal is terrifying. CrewAI solves this with a simple feature:</p>
<ul>
<li><strong>The Setup</strong>: The <code>Coder</code> agent can write and execute Python code directly.</li>
<li><strong>The Feature</strong>: <code>code_execution_mode="safe"</code>. The codebase configures the agent to run code inside a <strong>Docker container</strong>.</li>
</ul>
<pre><code class="hljs language-python"><span class="hljs-comment"># coder/crew.py</span>

agent = Agent(
    role=<span class="hljs-string">"coder"</span>,
    allow_code_execution=<span class="hljs-literal">True</span>,
    code_execution_mode=<span class="hljs-string">"safe"</span>, <span class="hljs-comment"># Dockerized safety!</span>
    llm=<span class="hljs-string">"ollama_chat/deepseek-r1:8b"</span>
)
</code></pre>
<p><strong>Key Takeaway</strong>: You can run powerful coding agents (like <code>deepseek-r1</code>) locally without risking your host machine.</p>
<hr>
<h2>Level 3: Connecting to the World (The Financial Researcher)</h2>
<p><strong>Theme: Tool Use</strong> | <strong><a href="https://github.com/shansin/crewai-agents/tree/main/financial_researcher">Code</a></strong></p>
<p>The <strong>Financial Researcher</strong> project introduces <strong>Tools</strong>.</p>
<p>A smart agent is useless if it's cut off from the world. This crew is composed of a <code>Researcher</code> and an <code>Analyst</code>.</p>
<ul>
<li><strong>The Setup</strong>: A workflow where the <code>Researcher</code> searches the web for real-time data, and the <code>Analyst</code> synthesizes that raw data into a markdown report.</li>
<li><strong>The Feature</strong>: <code>SerperDevTool</code>. The agent isn't just hallucinating facts anymore; it's equipped with tools for live Google searches. No more "I'm sorry, my knowledge cutoff is 2021."</li>
</ul>
<p><strong>Key Takeaway</strong>: This demonstrates the classic "Research &amp; Write" pattern, perfect for automating daily briefings.</p>
<hr>
<h2>Level 4: Memory &amp; Structure (The Stock Picker)</h2>
<p><strong>Theme: Advanced Cognition</strong> | <strong><a href="https://github.com/shansin/crewai-agents/tree/main/stock_picker">Code</a></strong></p>
<p>Now we enter the big leagues with the <strong>Stock Picker</strong> project. This introduces two advanced concepts: <strong>Memory</strong> and <strong>Structured Outputs</strong>.</p>
<ul>
<li><strong>The Setup</strong>: The crew uses <code>LongTermMemory</code> (SQLite) to store insights across runs and <code>ShortTermMemory</code> (RAG) to maintain context. It uses local embeddings (<code>nomic-embed-text</code>) to keep everything private.</li>
<li><strong>The Feature</strong>: <code>output_pydantic</code>. Instead of a wall of text, agents return strictly typed Pydantic objects.</li>
</ul>
<pre><code class="hljs language-python"><span class="hljs-comment"># stock_picker/crew.py</span>

<span class="hljs-keyword">class</span> <span class="hljs-title class_">TrendingCompany</span>(<span class="hljs-title class_ inherited__">BaseModel</span>):
    name: <span class="hljs-built_in">str</span>
    ticker: <span class="hljs-built_in">str</span>
    reason: <span class="hljs-built_in">str</span>

<span class="hljs-meta">@task(<span class="hljs-params">output_pydantic=TrendingCompanyList</span>)</span>
<span class="hljs-keyword">def</span> <span class="hljs-title function_">find_trending_companies</span>(<span class="hljs-params">self</span>): ...
</code></pre>
<p><strong>Key Takeaway</strong>: Structured outputs mean you can reliably pipe AI generation into a database or API, because the schema is validated rather than hoped for.</p>
<hr>
<h2>Level 5: The Enterprise (The Engineering Team)</h2>
<p><strong>Theme: Orchestration &amp; Delegation</strong> | <strong><a href="https://github.com/shansin/crewai-agents/tree/main/engineering_team">Code</a></strong></p>
<p>Finally, the <strong>Engineering Team</strong> project. This is the pinnacle of the experiment.</p>
<p>It simulates a full software development lifecycle with specialized roles: <code>Lead</code>, <code>Backend Engineer</code>, <code>Frontend Engineer</code>, and <code>QA</code>.</p>
<ul>
<li><strong>The Setup</strong>: Tasks are chained contextually. The <code>Backend Engineer</code> doesn't start until the <code>Lead</code> finishes the design. <code>QA</code> waits for the code.</li>
<li><strong>The Feature</strong>: Multi-Model Intelligence. Different tasks are routed to different models (<code>gpt-oss:20b</code> for high-level design, <code>qwen3-coder:30b</code> for the heavy lifting).</li>
</ul>
<pre><code class="hljs language-yaml"><span class="hljs-comment"># engineering_team/config/tasks.yaml</span>

<span class="hljs-attr">backend_engineer:</span>
  <span class="hljs-attr">output_file:</span> <span class="hljs-string">output/{module_name}</span>

<span class="hljs-attr">frontend_engineer:</span>
  <span class="hljs-attr">output_file:</span> <span class="hljs-string">output/app.py</span>
</code></pre>
<p><strong>Key Takeaway</strong>: A crew can take an abstract idea and output a fully tested, functional application with frontend and backend, saved directly to disk.</p>
<p><strong>Requirements:</strong></p>
<pre><code>A simple account management system for a trading simulation platform.
The system should allow users to create an account, deposit funds, and withdraw funds.
The system should allow users to record that they have bought or sold shares, providing a quantity.
The system should calculate the total value of the user's portfolio, and the profit or loss from the initial deposit.
The system should be able to report the holdings of the user at any point in time.
The system should be able to report the profit or loss of the user at any point in time.
The system should be able to list the transactions that the user has made over time.
The system should prevent the user from withdrawing funds that would leave them with a negative balance, or
from buying more shares than they can afford, or selling shares that they don't have.
The system has access to a function get_share_price(symbol) which returns the current price of a share, and includes a test implementation that returns fixed prices for AAPL, TSLA, GOOGL.
</code></pre>
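<p>To make the spec concrete, here is a minimal hand-written sketch of the core logic the crew is asked to produce (my own illustration, not the crew's output), including the fixed-price test implementation of <code>get_share_price</code> that the requirements mention:</p>

```python
def get_share_price(symbol: str) -> float:
    """Test implementation returning fixed prices, as the spec allows."""
    return {"AAPL": 170.0, "TSLA": 250.0, "GOOGL": 140.0}[symbol]


class Account:
    def __init__(self, initial_deposit: float) -> None:
        self.initial_deposit = initial_deposit
        self.balance = initial_deposit
        self.holdings: dict[str, int] = {}
        self.transactions: list[str] = []

    def withdraw(self, amount: float) -> None:
        if amount > self.balance:  # no negative balances allowed
            raise ValueError("insufficient funds")
        self.balance -= amount
        self.transactions.append(f"withdraw {amount}")

    def buy(self, symbol: str, qty: int) -> None:
        cost = get_share_price(symbol) * qty
        if cost > self.balance:  # can't buy more than you can afford
            raise ValueError("insufficient funds")
        self.balance -= cost
        self.holdings[symbol] = self.holdings.get(symbol, 0) + qty
        self.transactions.append(f"buy {qty} {symbol}")

    def sell(self, symbol: str, qty: int) -> None:
        if self.holdings.get(symbol, 0) < qty:  # can't sell what you don't hold
            raise ValueError("insufficient shares")
        self.balance += get_share_price(symbol) * qty
        self.holdings[symbol] -= qty
        self.transactions.append(f"sell {qty} {symbol}")

    def portfolio_value(self) -> float:
        return self.balance + sum(get_share_price(s) * q for s, q in self.holdings.items())

    def profit_or_loss(self) -> float:
        return self.portfolio_value() - self.initial_deposit
```

<p>Roughly thirty lines by hand. The interesting part is watching the crew arrive at the same invariants (no negative balances, no naked sells) from the plain-English spec alone.</p>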
<p><strong>Design <a href="https://github.com/shansin/crewai-agents/tree/main/engineering_team/output/accounts.py_design.md">here</a></strong></p>
<p><strong>Final Output:</strong>
<strong>Account Management</strong>
<img src="/images/crew-ai/1.webp" alt="CrewAI"></p>
<p><strong>Trading</strong>
<img src="/images/crew-ai/2.webp" alt="CrewAI"></p>
<p><strong>Portfolio &amp; Transactions</strong>
<img src="/images/crew-ai/3.webp" alt="CrewAI"></p>
<hr>
<h2>The Local Advantage</h2>
<p>What ties all these projects together? <strong>Local Dominance.</strong></p>
<p>Every agent here runs on local hardware using Ollama. Whether it's the 8B parameter model for the debater or the 30B coding specialist for the engineer, the power is entirely in my hands.</p>
<p>This codebase proves that you don't need to choose between simplicity and power. With CrewAI, you can start with a debate and end with a software empire. (Or at least a very productive localhost.)</p>
<hr>]]></content:encoded>
    </item>
    
    <item>
      <title><![CDATA[Building a Local AI Rig in 2025]]></title>
      <link>https://shsin.blog/posts/building-local-ai-powerhouse-2025</link>
      <guid isPermaLink="true">https://shsin.blog/posts/building-local-ai-powerhouse-2025</guid>
      <pubDate>Tue, 25 Nov 2025 00:00:00 GMT</pubDate>
      <dc:creator>Shantanu Singh</dc:creator>
      <category>local-ai</category>
      <description><![CDATA[A dual-GPU workstation built for local LLM inference. 32GB of VRAM, no API leash, and the joy of building a PC.]]></description>
      <content:encoded><![CDATA[<p>It has been roughly 20 years since I last cracked open a PC case to build a machine from scratch. Back then, we were worried about IDE cables and jumper pins; today, the stakes are a bit different. My goal this time wasn't just to browse the web—I wanted to run LLMs locally.</p>
<p>I was looking for a sandbox for toy projects and experimentation without the leash of a monthly subscription to OpenAI or Anthropic. More importantly, I wanted to "get into the weeds": fine-tuning models and understanding the hardware bottlenecks firsthand.</p>
<h3>The "Sensible" Alternative</h3>
<p>When building for AI, the primary gating factor is <strong>VRAM</strong> (GPU memory). To do anything meaningful, 16GB is the floor.</p>
<p>Now, a rational choice is a Mac Mini with 24GB+ of unified memory. It’s efficient, quiet, and fits in a desk drawer. But where’s the fun in being sensible? I wanted a machine that looked the part and gave me the flexibility to swap components when the next breakthrough hits.</p>
<h3>The Build Specs</h3>
<p>To support heavy local inference and future fine-tuning, I landed on a dual-GPU setup that prioritizes memory overhead and core count.</p>
<ul>
<li><strong>GPU 1:</strong> NVIDIA RTX 5070 Ti (16GB)</li>
<li><strong>GPU 2:</strong> NVIDIA RTX 5060 Ti (16GB)</li>
<li><strong>CPU:</strong> AMD Ryzen 9 9950X3D</li>
<li><strong>Motherboard:</strong> Asus ProArt Creator X870E (Crucial for supporting dual GPUs at PCIe 5 x8/x8)</li>
<li><strong>RAM:</strong> 64GB DDR5</li>
</ul>
<table>
<thead>
<tr>
<th>Component</th>
<th>Role</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Total VRAM</strong></td>
<td>32GB (comfortable for quantized ~30B models; 70B-class models only fit at aggressive ~3-bit quantization)</td>
</tr>
<tr>
<td><strong>Logic</strong></td>
<td>The Ryzen 9 9950X3D provides the multi-threading needed to keep the GPUs fed.</td>
</tr>
<tr>
<td><strong>Connectivity</strong></td>
<td>The X870E chipset ensures the second GPU isn't throttled by a narrow data pipe.</td>
</tr>
</tbody>
</table>
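<p>A quick back-of-the-envelope check on that VRAM budget. This is my own rough heuristic, not a precise model: weights take roughly (billions of parameters) × (bits per weight) / 8 gigabytes, plus an assumed ~20% for KV cache and runtime overhead.</p>

```python
def vram_gb(params_billion: float, bits_per_weight: int, overhead: float = 0.20) -> float:
    """Rough VRAM estimate: weight bytes plus a fudge factor for KV cache/runtime."""
    weights_gb = params_billion * bits_per_weight / 8  # 1B params at 8-bit ~ 1 GB
    return round(weights_gb * (1 + overhead), 1)

# 70B at 4-bit (~42 GB) overflows 32GB; a ~3-bit quant (~31.5 GB) just squeezes in,
# while 30B-class models at 4-bit (~18 GB) leave plenty of headroom.
print(vram_gb(70, 4), vram_gb(70, 3), vram_gb(30, 4))
```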
<h3>Why this "Frankenstein" Rig?</h3>
<p>By pairing two 16GB cards, I’ve managed to bypass the massive "VRAM tax" associated with the ultra-high-end 5090s while still hitting a respectable <strong>32GB of total VRAM</strong>.</p>
<p>The choice of the <strong>ProArt Creator X870E motherboard</strong> was a specific technical requirement. Most consumer boards choke the second PCIe slot down to x4 speeds, or don't leave enough physical space for a full-size graphics card; this board keeps the data pipeline wide enough for serious workloads.</p>
<p>It feels good to be back in the BIOS. Now, if you’ll excuse me, I have some local weights to download and some fans to tune. Let the experimentation begin!</p>
<hr>
<p><img src="/images/building-local-ai-powerhouse-2025/dual_gpu_build.webp" alt="Dual GPU Build"></p>]]></content:encoded>
    </item>
    
    <item>
      <title><![CDATA[First Post]]></title>
      <link>https://shsin.blog/posts/firstPost</link>
      <guid isPermaLink="true">https://shsin.blog/posts/firstPost</guid>
      <pubDate>Mon, 24 Nov 2025 00:00:00 GMT</pubDate>
      <dc:creator>Shantanu Singh</dc:creator>
      
      <description><![CDATA[Why I started writing, and what to expect from a blog at the intersection of AI engineering and endurance.]]></description>
      <content:encoded><![CDATA[<p>Two things occupy most of my headspace lately: building AI systems and logging miles.</p>
<p>The systems side is AI engineering. Local inference, multi-agent orchestration, GPU builds, making machines do useful work without renting someone else's cloud. The miles side is endurance. Running, training, the kind of voluntary suffering that teaches you things no tutorial can.</p>
<p>This blog sits at that intersection. Systems and Strides.</p>
<p>I've wanted to start writing for a while. Not because the world needs another blog, but because writing forces clarity. Half-baked ideas have to survive being put into sentences. Some won't. That's the point.</p>
<p>I'll aim for at least one post a month.</p>
<p>No fluff. Just what I'm building, learning, and thinking about.</p>]]></content:encoded>
    </item>
    
  </channel>
</rss>