<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Elevate]]></title><description><![CDATA[Addy Osmani's newsletter on elevating your effectiveness. Join his community of 600,000 readers across social media.]]></description><link>https://addyo.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!8WxC!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3704470-b6d5-48a9-a9d1-564bd833fc5c_1280x1280.png</url><title>Elevate</title><link>https://addyo.substack.com</link></image><generator>Substack</generator><lastBuildDate>Fri, 01 May 2026 06:58:35 GMT</lastBuildDate><atom:link href="https://addyo.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Addy Osmani]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[addyo@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[addyo@substack.com]]></itunes:email><itunes:name><![CDATA[Addy Osmani]]></itunes:name></itunes:owner><itunes:author><![CDATA[Addy Osmani]]></itunes:author><googleplay:owner><![CDATA[addyo@substack.com]]></googleplay:owner><googleplay:email><![CDATA[addyo@substack.com]]></googleplay:email><googleplay:author><![CDATA[Addy Osmani]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Long-running Agents]]></title><description><![CDATA[A long-running AI agent can keep making progress over hours, days, or weeks.]]></description><link>https://addyo.substack.com/p/long-running-agents</link><guid isPermaLink="false">https://addyo.substack.com/p/long-running-agents</guid><dc:creator><![CDATA[Addy Osmani]]></dc:creator><pubDate>Thu, 30 Apr 2026 
14:30:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!FqTC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf57c226-c3eb-4c62-86c4-083008b5f2f1_1376x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><strong>A long-running AI agent can keep making progress over hours, days, or weeks. It can do this across many context windows and sandboxes, recover from failure, leave structured artifacts behind, and resume where it left off.</strong></p></blockquote><p>For two years the dominant image of an &#8220;AI agent&#8221; has been a chat window with a clever loop in it. You type a goal, the agent calls some tools, you watch tokens stream by, and you stop watching when your patience runs out or the context window fills up. That paradigm got us a long way, but it has a ceiling. The model forgets. It declares &#8220;task complete&#8221; when it isn&#8217;t. It re-introduces a bug it fixed nine turns ago. 
The whole thing is structured around a single sitting.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4O50!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcda5ccdb-7770-425c-9e92-c72938025a32_1375x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4O50!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcda5ccdb-7770-425c-9e92-c72938025a32_1375x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!4O50!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcda5ccdb-7770-425c-9e92-c72938025a32_1375x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!4O50!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcda5ccdb-7770-425c-9e92-c72938025a32_1375x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!4O50!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcda5ccdb-7770-425c-9e92-c72938025a32_1375x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4O50!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcda5ccdb-7770-425c-9e92-c72938025a32_1375x768.jpeg" width="1375" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cda5ccdb-7770-425c-9e92-c72938025a32_1375x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1375,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Long-running AI agents&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Long-running AI agents" title="Long-running AI agents" srcset="https://substackcdn.com/image/fetch/$s_!4O50!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcda5ccdb-7770-425c-9e92-c72938025a32_1375x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!4O50!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcda5ccdb-7770-425c-9e92-c72938025a32_1375x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!4O50!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcda5ccdb-7770-425c-9e92-c72938025a32_1375x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!4O50!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcda5ccdb-7770-425c-9e92-c72938025a32_1375x768.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Long-running agents are what comes next. The idea is easy to state: an agent that keeps making forward progress on a goal across many sessions and many sandboxes, possibly many days or weeks, while leaving the workspace clean enough that the next session can pick up where the last one left off. The engineering is harder. You have to solve for persistence, recovery, and verification in a way that doesn&#8217;t just paper over the cracks. 
You have to build a state layer that lives outside the model&#8217;s context window, and you have to design the handoff between sessions so the agent doesn&#8217;t lose its mind when it wakes up and finds itself in a different sandbox with a different context window.</p><p>This post is my attempt to lay out what&#8217;s changed, who&#8217;s pushing on it, and how an engineer can use long-running agents today without writing the whole thing from scratch.</p><div><hr></div><h2><strong>What &#8220;long-running&#8221; actually means</strong></h2><p>&#8220;Long-running&#8221; gets used to mean at least three different things in practice, and it helps to keep them separate.</p><p><strong>Long-horizon reasoning.</strong> The agent has to plan and execute over many dependent steps. This is mostly a model-quality story: coherence, planning, the ability to recover from a wrong turn ten steps ago. METR has been tracking this with their <em>time horizon</em> metric, which estimates how long a task a frontier model can complete with 50% reliability. The headline finding is that the metric has been <a href="https://metr.org/time-horizons/">doubling roughly every seven months</a> since 2019, and their <a href="https://metr.org/blog/2026-1-29-time-horizon-1-1/">TH1.1 update</a> earlier this year doubled the count of 8-hour-plus tasks in the eval set. <strong>If that curve holds, frontier agents complete tasks at the day scale by 2028 and the year scale by 2034.</strong></p><p><strong>Long-running execution.</strong> The agent&#8217;s <em>process</em> runs for hours or days. Maybe it&#8217;s a coding job, maybe it&#8217;s a research sweep, maybe it&#8217;s a 24/7 monitoring service. The model might be invoked thousands of times across the run. This is mostly a <em>harness</em> story, and it&#8217;s the one this post is mostly about.</p><p><strong>Persistent agency.</strong> The agent has an identity that outlives any single task. 
It accumulates memory, learns user preferences, and is always available. This is the <a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/overview">Memory Bank</a> flavor of long-running.</p><p>In practice the three blur together. A real production agent does long-horizon reasoning <em>inside</em> a long-running execution <em>backed by</em> persistent agency. But the engineering problems are different in each, and so are the products that solve them.</p><div><hr></div><h2><strong>Why this matters</strong></h2><p>There are two reasons I believe this work matters a lot right now.</p><p>The first is a phase change in what&#8217;s economically feasible to delegate. An agent that runs for ten minutes can answer a question, summarize a doc, fix a small bug. An agent that runs for ten hours can own an entire feature, finish a migration that was on the backlog for six quarters, or do the kind of overnight research sweep that used to require a junior analyst. One of Anthropic&#8217;s <a href="https://www.anthropic.com/news/claude-sonnet-4-5">Claude Sonnet announcements</a> put concrete numbers on this last fall: 30+ hours of autonomous coding in internal tests, including <a href="https://venturebeat.com/ai/anthropics-new-claude-can-code-for-30-hours-think-of-it-as-your-ai-coworker">one run</a> that produced an 11,000-line Slack-style app. <strong>That&#8217;s already past the threshold where the answer to &#8220;should I delegate this?&#8221; is no longer obvious.</strong></p><p>The second is that persistence changes what the agent <em>is</em>. A stateless agent answers your question and disappears. A long-running one accumulates context: which competitor moved which way last week, which test flaked twice on Tuesday, what you usually mean by &#8220;the dashboard.&#8221; Anthropic&#8217;s <a href="https://www.anthropic.com/research/project-vend-1">Project Vend</a> was the most public early demonstration of this. 
They had a Claude instance run an actual office vending business for a month, managing inventory, setting prices, talking to suppliers. It failed in informative ways, and <a href="https://www.anthropic.com/research/project-vend-2">the second phase</a> ran much better, but the point wasn&#8217;t profitability. The point was watching what kinds of weird coherence problems show up when an agent has to maintain identity across weeks instead of turns.</p><p>Those are the same problems every team building production agents now hits.</p><div><hr></div><h2><strong>The three walls every long-running agent hits</strong></h2><p>Three walls show up in basically every write-up I&#8217;ve read this year.</p><p><strong>Finite context.</strong> Even a 1M-token window fills. And <a href="https://addyosmani.com/blog/agent-harness-engineering/">context rot</a>, the steady degradation of model performance as the window gets full, kicks in well before the hard limit. A 24-hour run is not going to fit in any context window the field has on its roadmap. Something has to give.</p><p><strong>No persistent state.</strong> A new session starts blank. Anthropic&#8217;s framing in their <a href="https://www.anthropic.com/research/long-running-Claude">scientific computing post</a> is the cleanest version I&#8217;ve seen: <em>&#8220;imagine a software project staffed by engineers working in shifts, where each new engineer arrives with no memory of what happened on the previous shift.&#8221;</em> Without an explicit persistence story, every shift change is a productivity disaster.</p><p><strong>No self-verification.</strong> Models reliably skew positive when they grade their own work. Asked &#8220;are you done?&#8221; they answer &#8220;yes&#8221; more often than they should. Without a separate signal that the work meets a bar, you get the agent that ships at 30% complete with full confidence.</p><p>Long-running agent designs are mostly answers to these three problems. 
<strong>The major labs have converged on similar shapes of answer, but with very different surface area.</strong></p><div><hr></div><h2><strong>The Ralph loop: one of the simpler practitioner versions of long-running agents</strong></h2><p>The <strong>Ralph loop</strong> (sometimes called the Ralph Wiggum technique) is one of the &#8220;simpler&#8221; practitioner versions of long-running agents, popularized by <a href="https://ghuntley.com/ralph/">Geoffrey Huntley</a> and <a href="https://github.com/snarktank/ralph">Ryan Carson</a>. The reference implementation is <a href="https://ghuntley.com/ralph/">literally a bash script</a> that loops:</p><ol><li><p>Pick the next unfinished task from a list (<code>prd.json</code> or equivalent).</p></li><li><p>Build a prompt with the task, the relevant context, and any persistent notes.</p></li><li><p>Call the agent.</p></li><li><p>Run tests or other checks.</p></li><li><p>Append what happened to <code>progress.txt</code>.</p></li><li><p>Update the task list (done, failed, blocked).</p></li><li><p>Go back to step 1.</p></li></ol><p>The reason it works is the same reason any of the harnesses below work: state lives outside the agent&#8217;s context. <code>prd.json</code> is the plan, <code>progress.txt</code> is the lab notes, <code>AGENTS.md</code> is the rolling rulebook. <strong>The agent itself is amnesiac, but the filesystem isn&#8217;t.</strong> Each iteration starts fresh and reads enough state from disk to keep going. 
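</p><p>A minimal sketch of that loop, in Python rather than bash, looks like the following. The file names (<code>prd.json</code>, <code>progress.txt</code>) follow the convention above, but <code>run_agent</code> and <code>run_checks</code> are stand-ins for whatever agent CLI and test command you actually use, and the task fields are hypothetical.</p>

```python
import json
from pathlib import Path

PRD, NOTES = Path("prd.json"), Path("progress.txt")

def run_agent(prompt: str) -> None:
    # Stand-in: a real loop shells out to an agent CLI here, with a fresh context.
    print(f"[agent] {prompt.splitlines()[0]}")

def run_checks() -> bool:
    # Stand-in: a real loop runs the test suite and returns pass/fail.
    return True

def ralph_iteration() -> bool:
    """One pass: pick a task, prompt the agent, verify, record, update the list."""
    tasks = json.loads(PRD.read_text())
    todo = [t for t in tasks if t["status"] == "todo"]
    if not todo:
        return False                                   # nothing left; stop looping
    task = todo[0]
    prior = NOTES.read_text() if NOTES.exists() else "(none)"
    run_agent(f"Task: {task['desc']}\nPrior progress:\n{prior}\nMake incremental progress.")
    task["status"] = "done" if run_checks() else "failed"
    with NOTES.open("a") as f:                         # lab notes survive the session
        f.write(f"{task['id']}: {task['status']}\n")
    PRD.write_text(json.dumps(tasks, indent=2))        # the plan survives too
    return True

# Seed a tiny task list, then loop until it is exhausted.
PRD.write_text(json.dumps([{"id": "t1", "desc": "add login", "status": "todo"},
                           {"id": "t2", "desc": "fix flaky test", "status": "todo"}]))
while ralph_iteration():
    pass
```

<p>The real versions add QA gates, retry limits, and a blocked state, but the skeleton is genuinely this small.</p><p>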
Carson&#8217;s <a href="https://github.com/snarktank/compound-product">Compound Product</a> extends the idea by chaining multiple loops (an analysis loop that reads daily reports, a planning loop that emits a PRD, an execution loop that writes the code), which is roughly the open-source version of the planner-generator-evaluator triad Anthropic landed on independently.</p><p>I went deeper on all of this in <a href="https://addyosmani.com/blog/self-improving-agents/">Self-improving agents</a>: task list structure, progress files, QA gates, monitoring, the failure modes you&#8217;ll actually hit. The short version is that you can build a working long-running agent in an evening with a bash script and a JSON file. Most of what Google and Anthropic have productized is the work of making this pattern recoverable, secure, and observable at scale.</p><p>The big-lab stories below are different ways of paying for that production-readiness.</p><div><hr></div><h2><strong>Anthropic: harnesses, then the brain/hands/session split</strong></h2><p>Anthropic has been the most public about the engineering. Two posts are worth reading end-to-end.</p><p>The first is <a href="https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents">&#8220;Effective harnesses for long-running agents&#8221;</a>, which lays out a two-agent harness for autonomous full-stack development. An <strong>initializer agent</strong> runs once at the start of a project to set up the environment, expand the prompt into a structured <code>feature-list.json</code>, and write an <code>init.sh</code> that future sessions will run on boot. A <strong>coding agent</strong> is then woken up over and over, each session asked to make incremental progress on one feature, run tests, leave a <code>claude-progress.txt</code> note, and commit. 
A test ratchet (<em>&#8220;it is unacceptable to remove or edit tests because this could lead to missing or buggy functionality&#8221;</em>) sits in the prompt to stop the very common failure of an agent deleting failing tests to &#8220;make them pass.&#8221; <a href="https://www.infoq.com/news/2026/04/anthropic-three-agent-harness-ai/">InfoQ&#8217;s writeup</a> extends this into a planner, generator, and evaluator triad, on the same logic that separating generation from evaluation matters because models grade their own work too generously.</p><p>The second is <a href="https://www.anthropic.com/engineering/managed-agents">&#8220;Scaling Managed Agents: Decoupling the brain from the hands&#8221;</a>, the architectural post behind <a href="https://platform.claude.com/docs/en/managed-agents/overview">Claude Managed Agents</a> (Anthropic&#8217;s hosted runtime, launched in early April). The argument is that an agent has three components that should be independently replaceable. The Brain is the model and the harness loop that calls it. The Hands are sandboxed, ephemeral execution environments where tools actually run. The Session is an append-only event log of every thought, tool call, and observation.</p><p>This sounds abstract, but it isn&#8217;t. Anthropic&#8217;s framing: <em>&#8220;every component in a harness encodes an assumption about what the model can&#8217;t do on its own.&#8221;</em> When you couple them, an assumption that goes stale (e.g., the model used to need an explicit planner and now plans natively) means the whole system has to change at once. When you decouple them, the harness becomes stateless, sandboxes become <em>cattle, not pets</em>, and a brain crash doesn&#8217;t lose the run. A fresh container calls <code>wake(sessionId)</code> and reconstitutes the state from the log. 
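</p><p>The session-as-log idea fits in a screenful of code. The sketch below is illustrative only, not Anthropic&#8217;s implementation: the JSONL layout, the event kinds, and the <code>wake</code> signature are invented to show the shape of replay-based recovery.</p>

```python
import json
from pathlib import Path

LOG_DIR = Path("sessions")
LOG_DIR.mkdir(exist_ok=True)

def append_event(session_id: str, kind: str, **payload) -> None:
    # Durable and append-only: one JSON line per thought, tool call, or observation.
    with (LOG_DIR / f"{session_id}.jsonl").open("a") as f:
        f.write(json.dumps({"kind": kind, **payload}) + "\n")

def wake(session_id: str) -> dict:
    # Reconstitute agent state from the log alone; no live process is required.
    state = {"messages": [], "pending_tool_call": None}
    for line in (LOG_DIR / f"{session_id}.jsonl").read_text().splitlines():
        event = json.loads(line)
        if event["kind"] == "message":
            state["messages"].append(event["text"])
        elif event["kind"] == "tool_call":
            state["pending_tool_call"] = event["name"]   # issued...
        elif event["kind"] == "tool_result":
            state["pending_tool_call"] = None            # ...and answered
    return state

append_event("s1", "message", text="Plan the migration")
append_event("s1", "tool_call", name="run_tests")
# The container dies here. A fresh one replays the log and carries on:
state = wake("s1")
# state["pending_tool_call"] is still "run_tests", so the new brain re-issues it.
```

<p>Because the log is the source of truth, the harness process holds nothing that can be lost, and that is what makes sandboxes disposable.</p><p>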
They reported <a href="https://www.anthropic.com/engineering/managed-agents">time-to-first-token dropped ~60% at p50 and over 90% at p95</a> just from being able to start inference before the sandbox is ready.</p><p><strong>The session-as-event-log idea is the part most teams underappreciate.</strong> It is what makes a long-running agent recoverable. Without it, a container failure is a session failure and you&#8217;re debugging into a stale snapshot. With it, the agent&#8217;s memory is a queryable artifact that lives outside whatever process happens to be running at the moment.</p><p>For the scientific computing crowd, Anthropic&#8217;s <a href="https://www.anthropic.com/research/long-running-Claude">long-running Claude post</a> reduces all of this to a simpler stack: <code>CLAUDE.md</code> as a living plan the agent edits as it learns, <code>CHANGELOG.md</code> as portable lab notes, <code>tmux</code> plus <code>SLURM</code> plus <code>git</code> as the execution and coordination layer, and the <strong>Ralph loop</strong>, a <code>for</code> loop that kicks the agent back into context whenever it claims completion and asks if it&#8217;s <em>really</em> done. Their flagship case study is a Boltzmann solver Claude Opus built over a few days that reached sub-percent agreement with a reference CLASS implementation. Months-to-years of researcher time, compressed.</p><p>Same patterns across all three posts: an explicit plan file, an explicit progress file, structured handoffs between sessions, separate generation from evaluation, and a loop that refuses to let the agent stop early.</p><div><hr></div><h2><strong>Cursor: planners, workers, judges</strong></h2><p><a href="https://cursor.com/blog/scaling-agents">Cursor&#8217;s &#8220;Scaling long-running autonomous coding&#8221;</a> is the other essential read this year. 
They walked into walls that Anthropic mostly papered over.</p><p>Their first attempt was a flat coordination model: equal-status agents writing to shared files with locks. It became a bottleneck and made the agents risk-averse, churning rather than committing. Their second attempt swapped locks for optimistic concurrency control, which removed the bottleneck but didn&#8217;t fix the coordination problem. The third design is what&#8217;s running in production now and what they describe as solving most of the problem:</p><ul><li><p><strong>Planners</strong> continuously explore the codebase and emit tasks. They can recursively spawn sub-planners.</p></li><li><p><strong>Workers</strong> are focused executors. They don&#8217;t coordinate with each other and they don&#8217;t worry about the big picture.</p></li><li><p><strong>Judges</strong> decide when an iteration is finished and when to restart.</p></li></ul><p>Two things stand out from the post. One: <em>&#8220;a surprising amount of the system&#8217;s behavior comes down to how we prompt the agents&#8221;</em> more than the harness or the model. Two: different models slot into different roles. Their reported finding is that a GPT model was better than Opus for <em>extended autonomous work</em> specifically because Opus tended to stop early and take shortcuts. <strong>Same task, different role, different model.</strong> The matching is becoming part of the design surface.</p><p>This pairs with <a href="https://cursor.com/blog/composer">Composer 2</a> (their proprietary frontier coding model that ships in <a href="https://cursor.com/changelog/2-0">Cursor 3</a>) and their <strong>background cloud agents</strong>: long-running tasks that run on Anysphere&#8217;s cloud infrastructure rather than your laptop. Eight-hour refactors and codebase-wide migrations survive a closed lid. You can start a task locally, hit <em>run in cloud</em> when you realize it&#8217;ll take 30 minutes, and re-attach later from your phone. 
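</p><p>The planner/worker/judge triad is easy to render in miniature. In the sketch below the three roles are trivial Python stand-ins for illustration; in Cursor&#8217;s system each role is a separate model invocation with its own prompt, possibly a different model per role.</p>

```python
def planner(codebase: list[str]) -> list[str]:
    # Explores the repo and emits tasks; real planners can spawn sub-planners.
    return [f"migrate {f}" for f in codebase if f.endswith(".js")]

def worker(task: str) -> str:
    # Focused executor: one task, no coordination, no big picture.
    return task.replace(".js", ".ts")

def judge(result: str) -> bool:
    # Separate evaluator: decides "done" so the worker never grades itself.
    return result.endswith(".ts")

def run(codebase: list[str], max_rounds: int = 3) -> list[str]:
    done: list[str] = []
    for _ in range(max_rounds):
        tasks = [t for t in planner(codebase) if t not in done]
        if not tasks:
            break                        # judge-approved work covers the plan
        for task in tasks:
            result = worker(task)
            if judge(result):
                done.append(task)        # accepted; retire the task
            # else: leave it for the next round (or restart the worker)
    return done

print(run(["auth.js", "db.ts", "ui.js"]))  # → ['migrate auth.js', 'migrate ui.js']
```

<p>In production the loop above is distributed rather than in-process.</p><p>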
Each agent runs in an isolated git worktree and merges back via PR. The handoff between local and remote is the part most teams haven&#8217;t figured out yet, and Cursor&#8217;s bet is that it has to be its own product surface.</p><p>The shape ends up close to Anthropic&#8217;s: roles are split, sessions are durable, judges sit beside the worker, and a long task runs in a cloud sandbox with git as the coordination substrate.</p><div><hr></div><h2><strong>Google: long-running agents on the Agent Platform</strong></h2><p>Google&#8217;s announcement at <a href="https://cloud.google.com/blog/products/ai-machine-learning/introducing-gemini-enterprise-agent-platform">Cloud Next &#8216;26</a> two weeks ago folded Vertex AI into the <strong>Gemini Enterprise Agent Platform</strong> and turned long-running agents into a named product, with named SLAs.</p><p>The pieces that matter for this post:</p><ul><li><p><strong>Agent Runtime</strong> supports agents that <em>&#8220;run autonomously for days at a time&#8221;</em> with sub-second cold starts and on-demand sandbox provisioning. The launch post&#8217;s example use case is a sales prospecting sequence that takes a week to play out, which is roughly the right shape for it.</p></li><li><p><strong>Agent Sessions</strong> persist conversation and event history. You can pin them to a custom session ID that maps to your own CRM or DB record, so the agent&#8217;s state lives next to the business state instead of in a separate AI silo.</p></li><li><p><strong><a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/scale/memory-bank">Agent Memory Bank</a></strong> is the persistent long-term memory layer, generally available as of Next &#8216;26. It curates memories from sessions, scopes them to a user identity, and exposes a search API so the next agent invocation can pull what&#8217;s relevant. 
Payhawk reported that auto-submitting expenses through a Memory-Bank-backed agent cut submission time by over 50%.</p></li><li><p><strong>Agent Sandbox</strong> handles hardened code execution.</p></li><li><p><strong>Agent-to-Agent Orchestration</strong>, <strong>Agent Registry</strong>, <strong>Agent Identity</strong>, <strong>Agent Gateway</strong>, <strong>Agent Observability</strong>, and <strong>Agent Simulation</strong> cover basically every operational concern you&#8217;d otherwise build by hand for a production fleet, including the cryptographic-identity-and-audit-log story enterprises actually need to ship.</p></li></ul><p>Architecturally this is the same brain/hands/session split Anthropic described, just productized at platform scale and bundled with <a href="https://google.github.io/adk-docs/">ADK</a> (the code-first dev kit) and Agent Studio (the visual one). If you&#8217;re building inside Google Cloud, you don&#8217;t have to design a session log or a memory store from scratch anymore. You wire an ADK agent into Memory Bank and Sessions, deploy onto Agent Runtime, and the persistence question is answered.</p><p>Notice how much this looks like the pattern Anthropic and Cursor describe, just unbundled into named services with SLAs. Three years ago you&#8217;d have built all of this yourself. <strong>Now you pick which version of &#8220;decoupled brain, hands, and session&#8221; you want to rent.</strong></p><div><hr></div><h2><strong>Five patterns for long-running agents in production</strong></h2><p>Shubham Saboo and I <a href="https://x.com/GoogleCloudTech/status/2046989964077146490">wrote up</a> five design patterns we&#8217;ve seen separate working long-running agents from demos. They aren&#8217;t Google-specific, but they map cleanly onto the primitives Agent Runtime now exposes, so it&#8217;s worth walking through them here in shortened form.</p><p><strong>Checkpoint-and-resume.</strong> The most common multi-day failure is context loss. 
An agent processes 200 documents over four hours, hits an error on document 201, and without a checkpoint you start from scratch. Treat the agent like a long-running server process: write intermediate state to disk, checkpoint every N units of work, recover from failures. The Agent Runtime sandbox gives you a persistent filesystem, but choosing the right checkpoint granularity (not every step, not only the end) is on you.</p><p><strong>Delegated approval (human-in-the-loop).</strong> Most &#8220;human-in-the-loop&#8221; implementations are: serialize state to JSON, fire a webhook, hope someone responds. The state goes stale, the notification gets buried, the agent re-deserializes into a slightly different world. Long-running runtimes let the agent pause in place with full execution state intact: reasoning chain, working memory, tool history, pending action. Hours of human time pass, the agent consumes zero compute, and it resumes with sub-second latency. Mission Control is Google&#8217;s inbox for this. The pattern works regardless of vendor.</p><p><strong>Memory-layered context.</strong> A seven-day agent needs more than session state. Memory Bank handles long-term curated memory, Memory Profiles add low-latency lookups, and the failure mode you&#8217;ll hit in production is <strong>memory drift</strong>: the agent learns a procedural shortcut from a few atypical interactions and starts applying it broadly. <strong>Govern memory like you govern microservices.</strong> Agent Identity controls who can read and write which banks. Agent Registry tracks which version of which agent is running. Agent Gateway enforces policy on the wire. The auditing question stops being &#8220;what are my agents doing?&#8221; and becomes &#8220;what are my agents remembering, and how is that changing their behavior?&#8221;</p><p><strong>Ambient processing.</strong> Not every long-running agent talks to a human. 
Some sit on a Pub/Sub stream or a BigQuery table and act on events as they arrive: content moderation, anomaly detection, inbox triage. The architectural decision worth making early is to not hardcode policy into the agent. Define it in the Gateway and the fleet picks up policy changes without redeploys. Ambient agents run unsupervised for long stretches, and the only sane way to update a hundred of them is to update the policy layer once.</p><p><strong>Fleet orchestration.</strong> In real systems, you rarely have one agent. A coordinator delegates sub-tasks to specialists (a Lead Researcher Agent, a Scoring Agent, an Outreach Agent), each running independently for different durations. Each specialist gets its own Identity (so the Outreach Agent can&#8217;t read financial data meant for Scoring), its own policy enforcement, its own Registry entry. This is the same coordinator/worker shape distributed systems have used for decades. What&#8217;s new is that ADK handles it declaratively with graph-based workflows, and a bad deployment in one specialist doesn&#8217;t cascade to the others.</p><p>The patterns compose. A compliance system might use checkpointing for document processing, delegated approval for review gates, memory layering for cross-session knowledge, and fleet orchestration to coordinate the specialists. The opening question is always the same: <em>what&#8217;s the longest uninterrupted unit of work your agent needs to perform?</em> Minutes, and you don&#8217;t need long-running agents. Hours or days, and these patterns are where to start. 
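</p><p>To make the first pattern concrete, a minimal checkpoint-and-resume skeleton might look like the following. The names and the failure are illustrative; on Agent Runtime the checkpoint would live on the sandbox&#8217;s persistent filesystem.</p>

```python
import json
from pathlib import Path

CHECKPOINT = Path("checkpoint.json")

def process(doc: str) -> str:
    # Stand-in for the real unit of work (an LLM call, an extraction, a review).
    if doc == "doc-201":
        raise RuntimeError("transient failure on document 201")
    return doc.upper()

def run(docs: list[str], every: int = 50) -> list[str]:
    # Resume from the last checkpoint instead of restarting a four-hour run.
    if CHECKPOINT.exists():
        state = json.loads(CHECKPOINT.read_text())
    else:
        state = {"next": 0, "results": []}
    for i in range(state["next"], len(docs)):
        state["results"].append(process(docs[i]))
        state["next"] = i + 1
        if state["next"] % every == 0:            # every N units, not every step
            CHECKPOINT.write_text(json.dumps(state))
    CHECKPOINT.write_text(json.dumps(state))      # final checkpoint on success
    return state["results"]

docs = [f"doc-{n}" for n in range(1, 251)]
try:
    run(docs)                                     # dies on document 201...
except RuntimeError:
    pass
resume_point = json.loads(CHECKPOINT.read_text())["next"]
print(resume_point)  # → 200: a rerun resumes at doc-201 instead of doc-1
```

<p>The judgment call the pattern leaves to you is <code>every</code>: checkpoint too often and you pay constant I/O for state that is cheap to recompute; checkpoint only at the end and you lose the run.</p><p>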
The <a href="https://x.com/GoogleCloudTech/status/2046989964077146490">full write-up with code samples</a> covers each pattern in depth.</p><div><hr></div><h2><strong>So how do you actually build one today?</strong></h2><p>This is the practical question and it has a different answer depending on what you&#8217;re building.</p><p><strong>You&#8217;re a developer who wants long-running coding work on your own repo.</strong> Just use <a href="https://addyosmani.com/blog/agent-harness-engineering/">Claude Code</a> (or Antigravity, Cursor, or Codex). The harness is already there. Treat your <code>AGENTS.md</code> like a pilot&#8217;s checklist: short, every line earned by a real failure. Add hooks for typecheck and lint that surface failures back to the agent. Write a plan file before the agent starts. Use <a href="https://addyosmani.com/blog/self-improving-agents/">the Ralph loop</a> when the agent claims it&#8217;s done and you don&#8217;t believe it. For multi-hour or overnight jobs, run in a worktree so a closed laptop doesn&#8217;t kill the run, and have it commit progress every meaningful unit of work. <strong>This is the path most people should take, and it&#8217;s where the most leverage is right now.</strong></p><p><strong>You&#8217;re building a hosted agent product.</strong> Don&#8217;t build the runtime. Pick a managed one. The three real options today: <a href="https://cloud.google.com/products/gemini-enterprise-agent-platform">Google&#8217;s Agent Platform</a> (Agent Engine + Memory Bank + Sessions), <a href="https://platform.claude.com/docs/en/managed-agents/overview">Claude Managed Agents</a>, or roll something on top of <a href="https://google.github.io/adk-docs/">ADK</a>, the <a href="https://www.anthropic.com/engineering/building-agents-with-the-claude-agent-sdk">Claude Agent SDK</a>, or <a href="https://platform.openai.com/docs/codex">Codex SDK</a> and host it yourself. The trade-off is the usual one. 
Managed gets you the brain/hands/session split, observability, identity, and an audit trail out of the box. Self-hosted gets you control and the ability to use weird models for weird roles (Cursor&#8217;s pattern). For most teams, the right starting point is a managed runtime plus your own ADK or SDK code for the actual loop.</p><p><strong>You&#8217;re doing something autonomous and operational</strong> (monitoring, research, ops). Memory Bank-style persistence is what you want, and it&#8217;s the part that doesn&#8217;t exist in Claude Code. ADK + Memory Bank + Cloud Run + Cloud Scheduler is the cleanest stack I&#8217;ve seen for &#8220;agent runs every N hours, accumulates state, alerts on a threshold.&#8221; This is also where Cursor&#8217;s planner/worker/judge split starts to matter more than it does for IDE coding, because the work is genuinely parallel and the failure modes are different.</p><p>A few things matter regardless of which path you take.</p><p><em>Write down the done-condition before the agent starts.</em> This is the single highest-leverage move for long runs. The Anthropic harness post calls it the feature list; Cursor calls it the planner&#8217;s task spec. Either way, it&#8217;s an external file with explicit, testable completion criteria, and it exists so the agent can&#8217;t quietly redefine <em>done</em> mid-run.</p><p><em>Separate the evaluator from the generator.</em> Self-grading is the failure mode. A planner / worker / judge pipeline, or a generator / evaluator pair, is a real architectural pattern, not a stylistic preference, even if it&#8217;s the same model in different roles with different prompts.</p><p><em>Invest in the session log, not just the prompt.</em> The append-only event log is what makes the agent recoverable, debuggable, and auditable. 
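</p><p><em>A minimal sketch of that kind of log, as JSON Lines appended to a local file (the field names are illustrative; a real deployment would append to durable storage):</em></p>

```python
import json
import time

LOG_PATH = "session.jsonl"  # illustrative path; use durable storage in production

def log_event(event_type, payload):
    # One JSON record per line, appended and never rewritten.
    record = {"ts": time.time(), "type": event_type, "payload": payload}
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

def replay():
    # Reconstruct the run by reading the log back in order.
    with open(LOG_PATH) as f:
        return [json.loads(line) for line in f]
```

<p>Because the file is append-only, replaying it in order reconstructs the run: every tool call, result, and decision, timestamped.</p><p>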
If you can&#8217;t reconstruct what the agent did in the last 24 hours from durable storage, what you have is a long-running shell script that happens to call an LLM, not a long-running agent.</p><p><em>Treat compaction and context resets as first-class.</em> Anthropic is explicit that summarization-as-compaction wasn&#8217;t enough for very long jobs; they had to do full context resets where the harness tears the session down and rebuilds it from a structured handoff file. It is essentially how humans onboard a new engineer.</p><div><hr></div><h2><strong>There are some real limitations right now</strong></h2><p>A few things are still genuinely unsolved.</p><p><strong>Cost.</strong> A 24-hour run with a frontier model and a few tools is not cheap. Without budgets, circuit breakers, and a hard cap on tool spend, an agent can quietly burn through a week&#8217;s API budget in an afternoon. This is solvable, but it&#8217;s an explicit step you have to take.</p><p><strong>Security.</strong> A long-running agent with API keys, cloud access, and the ability to run shell commands has a much larger attack surface than a chat session. The brain/hands separation pattern matters here too: credentials should be unreachable from the sandbox where model-generated code runs, which is one of the benefits Anthropic calls out for Managed Agents.</p><p><strong>Alignment drift.</strong> Over many context windows, agents drift. The original goal gets summarized, then re-summarized, then loses fidelity. This is the part hooks and judges exist to defend against. It is also the most common reason &#8220;the agent went off and did something I didn&#8217;t ask for.&#8221;</p><p><strong>Verification.</strong> Auditing 24 hours of autonomous activity is a real human-time problem. Observability and structured artifacts (PRs, commits, briefings, test runs) are how you make this tractable. 
Without them, you&#8217;re scrolling logs and you&#8217;ll miss what matters.</p><p><strong>The human role.</strong> Defining work crisply enough that an agent can run for a day on it is harder than doing the work yourself. The skill that&#8217;s appreciating in value isn&#8217;t writing code. It&#8217;s writing specs that survive contact with an autonomous executor.</p><div><hr></div><h2><strong>Where this is going</strong></h2><p>Google, Anthropic, and Cursor have converged on roughly the same shape. <strong>Separate the model loop from the execution sandbox from the durable session log. Split planning from generation from evaluation. Bake in compaction, hooks, and context resets. Expose memory as a managed service that any agent invocation can query.</strong></p><p>Surface area is what differs. Google&#8217;s Agent Platform is the enterprise-stack version, with the identity and audit trail story baked in. The patterns underneath are the same. Claude Managed Agents is &#8220;Anthropic&#8217;s harness, hosted.&#8221; Cursor&#8217;s background agents are &#8220;long-running coding, pulled out of the IDE and into the cloud.&#8221;</p><p>The harder problems for the next year aren&#8217;t in any of those layers individually. They&#8217;re in the coordination above them. Many long-running agents on a shared codebase. Agents that read their own traces and patch their own harnesses. Harnesses that assemble tools and context just-in-time for a task instead of being pre-configured at startup. That&#8217;s where the agent stops looking like a smarter chat window and starts looking like a colleague who&#8217;s been on the project longer than you have.</p><p>The model is still load-bearing. But the gap between a chat window and an agent you can leave running overnight is mostly in the state, sessions, and structured handoffs wrapped around it. 
That&#8217;s where I&#8217;d spend my learning time right now.</p><p><em>You might be interested in checking out some of my O&#8217;Reilly books such as <a href="https://beyond.addy.ie/">Beyond Vibe Coding</a>, <a href="https://www.oreilly.com/library/view/the-effective-software/9798341638167/">The Effective Software Engineer</a> or <a href="https://www.oreilly.com/library/view/web-performance-engineering/9798341660182/">Web Perf engineering in the age of AI</a>.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FqTC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf57c226-c3eb-4c62-86c4-083008b5f2f1_1376x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FqTC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf57c226-c3eb-4c62-86c4-083008b5f2f1_1376x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!FqTC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf57c226-c3eb-4c62-86c4-083008b5f2f1_1376x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!FqTC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf57c226-c3eb-4c62-86c4-083008b5f2f1_1376x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!FqTC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf57c226-c3eb-4c62-86c4-083008b5f2f1_1376x768.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!FqTC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf57c226-c3eb-4c62-86c4-083008b5f2f1_1376x768.jpeg" width="1376" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf57c226-c3eb-4c62-86c4-083008b5f2f1_1376x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1376,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:41180,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/195959711?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf57c226-c3eb-4c62-86c4-083008b5f2f1_1376x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FqTC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf57c226-c3eb-4c62-86c4-083008b5f2f1_1376x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!FqTC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf57c226-c3eb-4c62-86c4-083008b5f2f1_1376x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!FqTC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf57c226-c3eb-4c62-86c4-083008b5f2f1_1376x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!FqTC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf57c226-c3eb-4c62-86c4-083008b5f2f1_1376x768.jpeg 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div>]]></content:encoded></item><item><title><![CDATA[The Agent Stack Bet]]></title><description><![CDATA[The bet every serious developer needs to make on their agent stack]]></description><link>https://addyo.substack.com/p/the-agent-stack-bet</link><guid isPermaLink="false">https://addyo.substack.com/p/the-agent-stack-bet</guid><dc:creator><![CDATA[Addy Osmani]]></dc:creator><pubDate>Sat, 18 Apr 2026 17:16:43 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!w6mN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618c5adc-46c0-4142-9254-4ed4c5ab0eca_2556x1632.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Peek under the hood of most &#8220;production agents&#8221; shipping today and you won&#8217;t find intelligence. You&#8217;ll find custom plumbing, fragile session logic, shared service accounts, and a security model held together by hope. This can be so much better.</p><p>If you&#8217;ve spent the last 18 months putting agents into production, you already know the models and tools have gotten <em>dramatically</em> better. You also know the problems that are still burning your on-call rotation are not problems you can prompt your way out of. We are running into a <strong>stack ceiling</strong>, and it is quietly creating a <strong>governance</strong> and <strong>reliability gap</strong> that the next generation of agentic systems cannot grow through.</p><p>Right now the industry is living with what I&#8217;d call <em>excessive agency</em>: <strong>autonomous systems given broad permissions to get things done</strong>, then left to discover - at runtime, in production - that a schema drifted, an API changed, or a downstream service started returning PII it wasn&#8217;t supposed to. Agents mark tasks &#8220;complete&#8221; while leaving a trail of corrupted state behind them. The humans find out on Monday.</p><p>This is not a failure of the people building agents. 
It is a failure of the stack they&#8217;re building on.</p><p>Here are the four architectural bets I think every serious team has to make in the next twelve months.</p><h2><strong>1) Agents need identities, not shared credentials</strong></h2><p>Every engineer who has shipped agents to production knows this specific flavor of dread: you have agents doing useful work, and effectively zero visibility into which tools they touched, which data they moved, or which credentials they used to do it. I call this <em>governance debt</em> - the silent accumulation of security and audit risk that eventually forces a full rewrite, usually right after the first incident that reaches the CISO.</p><p>The root cause is that most agents today are ghosts. They don&#8217;t have identities. They borrow a service account, inherit a human&#8217;s OAuth token, and &#8220;promise&#8221; - in application code, in a prompt - to stay inside the lines. In a real enterprise environment, a promise in a prompt is not a policy.</p><p><strong>My bet is that agent identity has to move from the application layer down into the platform layer.</strong> </p><p>The difference is between bolted-on vs. embedded security. Bolted-on looks like middleware in front of every tool call, politely asking the agent to behave: easy to bypass, expensive in latency, and invisible to your existing IAM. Embedded looks like a badge reader welded into a steel frame. The agent has a distinct, unforgeable identity recognized at the network and platform level, and policy is enforced at the source. If the agent reaches for a database it isn&#8217;t cleared for, the connection never opens. 
No middleware, no vibes.</p><p>Done right, this turns &#8220;a fleet of liabilities&#8221; into something that looks a lot more like a managed workforce: every action attributable, every permission auditable, every agent revocable with one call.</p><h2><strong>2) Agents need universal context, not scraped windows</strong></h2><p>Context management is a tax every builder is currently paying. Teams are burning a huge share of their engineering hours (and tokens) on undifferentiated plumbing - custom serialization, bespoke session stores, hand-rolled memory layers - just to keep an agent from forgetting its mission halfway through a multi-step task.</p><p>Worse, the context agents <em>can</em> get their hands on is usually siloed. A browser-based agent can see the open tab. A desktop wrapper can see the files a user happened to drag in. Neither of them can easily reason across the systems where the business actually lives - the CRM, the ERP, the data warehouse, the ticketing system, the transcripts, the project plans - at the same time.</p><p><strong>Agents need universal context that integrates at the platform level.</strong> If we don&#8217;t fix this, we should be honest that the ceiling of agentic AI is &#8220;slightly better spreadsheet autocomplete,&#8221; and we should stop writing vision pieces about it.</p><h2><strong>3) Agents need to survive your laptop closing</strong></h2><p>Here&#8217;s the uncomfortable version of this: a lot of what ships today as &#8220;an agent&#8221; isn&#8217;t yet ready to deploy across a business. </p><p>I want to be precise, because the frontier has genuinely moved in the last six months. Environments like Claude Code, OpenClaw, and similar platforms are capable - persistent task state, scheduled execution, multi-agent coordination, and long-running sessions that survive disconnects are no longer aspirational. These are not toys. 
The question has moved on.</p><p>The question now is whether an agent can run for a week instead of an hour. Whether it can cross three handoffs, two credential rotations, and an approval gate without a human babysitting the session. Whether the work it did on Tuesday is auditable on Friday by someone who wasn&#8217;t in the room. A session that survives a dropped WebSocket is table stakes. A mission that survives a quarter is the bar enterprises actually need.</p><p>Real work doesn&#8217;t fit in a session, and most of it doesn&#8217;t fit in a day either. A procurement workflow spans weeks and a dozen handoffs. A compliance audit runs for a month. An incident investigation outlives three on-call rotations. </p><p><strong>Most agents today hit a hard ceiling - sometimes time-based, sometimes token-based, sometimes governance-based - and when they hit it, the mission fails and a human picks up the pieces from wherever the transcript ended.</strong></p><p>Enterprise-grade autonomy requires durable, cloud-native execution with a much higher floor than &#8220;the session stayed up.&#8221; Concretely, that means:</p><ul><li><p><strong>State</strong> and <strong>checkpointing</strong> that survives restarts, disconnects, redeploys, and model version changes by default - not bolted on with a local Redis and a prayer.</p></li><li><p><strong>Context that outlives the window</strong>: long-horizon memory, summarization, and handoff between agent instances, so a multi-week task doesn&#8217;t die because a single run exhausted its tokens.</p></li><li><p><strong>Missions that outlive sessions</strong>: agents that stay on the job across days, handoffs, and credential rotations, with an auditable trail of what happened while you were asleep.</p></li><li><p><strong>First-class human-in-the-loop primitives,</strong> so the agent can pause and ask for permission to do something new instead of silently deciding it has the authority.</p></li></ul><p>Persistence with guardrails. 
That&#8217;s the bar. Anything less and you&#8217;re building demos that happen to run for a long time.</p><h2><strong>4) Agents need platforms</strong></h2><p>The pattern I see most often in strong teams is the saddest one: brilliant engineers draining their bandwidth into stack problems that do not differentiate their product. Custom memory. Bespoke eval harnesses. Homegrown observability. Handwritten retry logic. A tracing system that almost works. None of this is the hard part of the agentic era, and none of it is what your users are paying you for.</p><p>The real value lives in domain reasoning and business logic - the judgment calls that are specific to your company, your customers, your regulatory environment. Everything underneath should be the platform you <em>build on</em>, not the plumbing you <em>build</em>.</p><p>This is why the maturation of open primitives matters right now. Open-source orchestration frameworks exist precisely so the scaffolding isn&#8217;t locked behind any single vendor&#8217;s roadmap. The model that worked for cloud compute, containers, and CI/CD - start local on open primitives, graduate to a managed platform when you&#8217;re ready to scale - is the model agent platforms need to copy. </p><p><strong>Teams should be able to prototype on their laptop with the same building blocks they&#8217;ll run in production, and cross that boundary without a rewrite.</strong></p><p>That&#8217;s the engineering standard that lets teams stop fighting plumbing and get back to the product.</p><h2><strong>The five-year horizon</strong></h2><p>The teams that pull ahead in the next five years will not pull ahead by being smarter at writing boilerplate. 
They&#8217;ll pull ahead by <strong>choosing the right agent foundation</strong> and spending their engineering hours on the problems <em><strong>only they can solve</strong></em>.</p><p>Every month spent rebuilding the common stack - identity, context, persistence, orchestration - is a month not spent on the logic that actually makes your agents worth deploying. </p><p><strong>The agent stack has to become a solved problem.</strong> The only real question is whether you want to solve it yourself, again, or build on a foundation that was engineered for agents from the ground up.</p><p>My bet is on the latter. I think yours should be too.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!w6mN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618c5adc-46c0-4142-9254-4ed4c5ab0eca_2556x1632.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!w6mN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618c5adc-46c0-4142-9254-4ed4c5ab0eca_2556x1632.jpeg 424w, https://substackcdn.com/image/fetch/$s_!w6mN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618c5adc-46c0-4142-9254-4ed4c5ab0eca_2556x1632.jpeg 848w, https://substackcdn.com/image/fetch/$s_!w6mN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618c5adc-46c0-4142-9254-4ed4c5ab0eca_2556x1632.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!w6mN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618c5adc-46c0-4142-9254-4ed4c5ab0eca_2556x1632.jpeg 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!w6mN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618c5adc-46c0-4142-9254-4ed4c5ab0eca_2556x1632.jpeg" width="1456" height="930" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/618c5adc-46c0-4142-9254-4ed4c5ab0eca_2556x1632.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:930,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:85329,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/194581773?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618c5adc-46c0-4142-9254-4ed4c5ab0eca_2556x1632.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!w6mN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618c5adc-46c0-4142-9254-4ed4c5ab0eca_2556x1632.jpeg 424w, https://substackcdn.com/image/fetch/$s_!w6mN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618c5adc-46c0-4142-9254-4ed4c5ab0eca_2556x1632.jpeg 848w, https://substackcdn.com/image/fetch/$s_!w6mN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618c5adc-46c0-4142-9254-4ed4c5ab0eca_2556x1632.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!w6mN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F618c5adc-46c0-4142-9254-4ed4c5ab0eca_2556x1632.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>]]></content:encoded></item><item><title><![CDATA[Is the IDE dead?]]></title><description><![CDATA[How Agent orchestration is replacing the editor as the center of developer work]]></description><link>https://addyo.substack.com/p/death-of-the-ide</link><guid isPermaLink="false">https://addyo.substack.com/p/death-of-the-ide</guid><dc:creator><![CDATA[Addy 
Osmani]]></dc:creator><pubDate>Fri, 20 Mar 2026 14:31:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!wgTu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2368a63a-b3fc-4358-a6c5-57f2a33c6fd8_1376x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>The </strong><em><strong>center</strong></em><strong> of developer work is moving.</strong> Not disappearing - moving. Away from continuous, line-by-line editing inside a single window, and <strong>toward supervising agents</strong> that can plan, rewrite files, run tests, and propose changes for review. IDEs as we know them may stop being the primary tool for software work, or heavily evolve.</p><p>Across the tools many developers, including myself, are already using daily - <a href="https://conductor.build/">Conductor</a>, <a href="https://code.claude.com/docs/en/claude-code-on-the-web">Claude Code Web</a>, <a href="https://github.com/copilot/agents">GitHub Copilot Agent</a>, <a href="http://jules.google">Jules</a>, <a href="https://www.vibekanban.com/">Vibe KanBan</a>, even cmux - the same shift keeps showing up: <strong>the control plane is becoming the primary surface, and the editor is becoming one of several instruments underneath it.</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Av7X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbfdde15-b9fc-4cf8-a399-5769c44274e7_2400x1350.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Av7X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbfdde15-b9fc-4cf8-a399-5769c44274e7_2400x1350.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!Av7X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbfdde15-b9fc-4cf8-a399-5769c44274e7_2400x1350.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Av7X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbfdde15-b9fc-4cf8-a399-5769c44274e7_2400x1350.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Av7X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbfdde15-b9fc-4cf8-a399-5769c44274e7_2400x1350.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Av7X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbfdde15-b9fc-4cf8-a399-5769c44274e7_2400x1350.jpeg" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fbfdde15-b9fc-4cf8-a399-5769c44274e7_2400x1350.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:466191,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/191542117?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbfdde15-b9fc-4cf8-a399-5769c44274e7_2400x1350.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!Av7X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbfdde15-b9fc-4cf8-a399-5769c44274e7_2400x1350.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Av7X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbfdde15-b9fc-4cf8-a399-5769c44274e7_2400x1350.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Av7X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbfdde15-b9fc-4cf8-a399-5769c44274e7_2400x1350.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Av7X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbfdde15-b9fc-4cf8-a399-5769c44274e7_2400x1350.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Cursor just shipped <a href="https://cursor.com/glass">Glass</a> - a new interface explicitly built to make &#8220;working with agents clear, intuitive, and in your control,&#8221; where agent management is the primary experience and the traditional editor is something you reach for when you need to go deeper. The <a href="https://x.com/F2aldi/status/2034801927041818823">reaction</a> from developers was immediate: </p><blockquote><p><em>Now Cursor feels more like an Agent Orchestrator than an IDE. Managing agents in parallel is easier</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P0AV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48500cf2-98ed-4280-8ae9-25e3ae8d0669_1920x1080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P0AV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48500cf2-98ed-4280-8ae9-25e3ae8d0669_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!P0AV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48500cf2-98ed-4280-8ae9-25e3ae8d0669_1920x1080.png 848w, 
https://substackcdn.com/image/fetch/$s_!P0AV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48500cf2-98ed-4280-8ae9-25e3ae8d0669_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!P0AV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48500cf2-98ed-4280-8ae9-25e3ae8d0669_1920x1080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!P0AV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48500cf2-98ed-4280-8ae9-25e3ae8d0669_1920x1080.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/48500cf2-98ed-4280-8ae9-25e3ae8d0669_1920x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:619387,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/191542117?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48500cf2-98ed-4280-8ae9-25e3ae8d0669_1920x1080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!P0AV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48500cf2-98ed-4280-8ae9-25e3ae8d0669_1920x1080.png 424w, 
https://substackcdn.com/image/fetch/$s_!P0AV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48500cf2-98ed-4280-8ae9-25e3ae8d0669_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!P0AV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48500cf2-98ed-4280-8ae9-25e3ae8d0669_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!P0AV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48500cf2-98ed-4280-8ae9-25e3ae8d0669_1920x1080.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>But Glass is one data point in a much larger pattern. Terminal UIs like <a href="https://cmux.com/">cmux</a> highlight how the surfaces we&#8217;re used to are evolving to better manage agent workflows.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BGo7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F936597fd-b43d-4b9e-8e0c-f6ae8aaa9929_3840x2224.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><img src="https://substackcdn.com/image/fetch/$s_!BGo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F936597fd-b43d-4b9e-8e0c-f6ae8aaa9929_3840x2224.png" width="1456" height="843" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/936597fd-b43d-4b9e-8e0c-f6ae8aaa9929_3840x2224.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:843,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;cmux terminal app screenshot&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="cmux terminal app screenshot" title="cmux terminal app screenshot" srcset="https://substackcdn.com/image/fetch/$s_!BGo7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F936597fd-b43d-4b9e-8e0c-f6ae8aaa9929_3840x2224.png 424w, https://substackcdn.com/image/fetch/$s_!BGo7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F936597fd-b43d-4b9e-8e0c-f6ae8aaa9929_3840x2224.png 848w, https://substackcdn.com/image/fetch/$s_!BGo7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F936597fd-b43d-4b9e-8e0c-f6ae8aaa9929_3840x2224.png 1272w, https://substackcdn.com/image/fetch/$s_!BGo7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F936597fd-b43d-4b9e-8e0c-f6ae8aaa9929_3840x2224.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>From editing files to steering workstreams</h2><p>Historically, IDEs optimized for a tight inner loop: open files &#8594; edit &#8594; build &#8594; debug &#8594; repeat. <strong>The &#8220;death&#8221; argument is that this loop is no longer the dominant unit of productivity once agents can execute most of it autonomously.</strong></p><p>The new loop looks like this: <strong>specify intent &#8594; delegate &#8594; observe &#8594; review diffs &#8594; merge</strong>. What makes it different from &#8220;autocomplete with a chat window&#8221; is tool-using autonomy combined with interfaces designed to make that autonomy governable.</p><p>You can see this playing out across tools already in heavy use. Claude Code Web (or Desktop) and Codex let developers hand off well-defined tasks to agents running in isolated cloud environments, with progress visible in a browser - no terminal, no local setup required. 
</p><p>GitHub Copilot&#8217;s Agents plan and implements multi-file changes independently, creates branches, runs tests, and surfaces a PR for review; the developer&#8217;s primary job becomes reviewing the outcome and iterating, not directing each step. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!L3L8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f9eef48-c593-4a33-a05c-cd2602ef85ff_3018x1664.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!L3L8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f9eef48-c593-4a33-a05c-cd2602ef85ff_3018x1664.png 424w, https://substackcdn.com/image/fetch/$s_!L3L8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f9eef48-c593-4a33-a05c-cd2602ef85ff_3018x1664.png 848w, https://substackcdn.com/image/fetch/$s_!L3L8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f9eef48-c593-4a33-a05c-cd2602ef85ff_3018x1664.png 1272w, https://substackcdn.com/image/fetch/$s_!L3L8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f9eef48-c593-4a33-a05c-cd2602ef85ff_3018x1664.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!L3L8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f9eef48-c593-4a33-a05c-cd2602ef85ff_3018x1664.png" width="549" height="302.779532967033" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8f9eef48-c593-4a33-a05c-cd2602ef85ff_3018x1664.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:803,&quot;width&quot;:1456,&quot;resizeWidth&quot;:549,&quot;bytes&quot;:549881,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/191542117?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f9eef48-c593-4a33-a05c-cd2602ef85ff_3018x1664.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!L3L8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f9eef48-c593-4a33-a05c-cd2602ef85ff_3018x1664.png 424w, https://substackcdn.com/image/fetch/$s_!L3L8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f9eef48-c593-4a33-a05c-cd2602ef85ff_3018x1664.png 848w, https://substackcdn.com/image/fetch/$s_!L3L8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f9eef48-c593-4a33-a05c-cd2602ef85ff_3018x1664.png 1272w, https://substackcdn.com/image/fetch/$s_!L3L8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f9eef48-c593-4a33-a05c-cd2602ef85ff_3018x1664.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Conductor takes a different approach: a desktop app for running multiple Claude Code agents simultaneously in isolated workspaces, with live progress monitoring across all of them. And Google&#8217;s Jules handles asynchronous background tasks - you assign work, it runs, you review the result when it&#8217;s done. </p><p>What these tools share is a mental model: <strong>the agent is the unit of work, not the file</strong>. 
The interface worth optimizing is the one that helps you direct, monitor, and review agents - not the one that helps you type faster.</p><div><hr></div><h2>The orchestration layer taking shape</h2><p>The displacement story becomes persuasive only when you look at the specific interface patterns converging across tools.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Uidv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13e9d9a9-4541-4381-b6b8-14d07e95a2c2_2500x1774.avif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Uidv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13e9d9a9-4541-4381-b6b8-14d07e95a2c2_2500x1774.avif 424w, https://substackcdn.com/image/fetch/$s_!Uidv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13e9d9a9-4541-4381-b6b8-14d07e95a2c2_2500x1774.avif 848w, https://substackcdn.com/image/fetch/$s_!Uidv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13e9d9a9-4541-4381-b6b8-14d07e95a2c2_2500x1774.avif 1272w, https://substackcdn.com/image/fetch/$s_!Uidv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13e9d9a9-4541-4381-b6b8-14d07e95a2c2_2500x1774.avif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Uidv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13e9d9a9-4541-4381-b6b8-14d07e95a2c2_2500x1774.avif" width="562" height="398.72664835164835" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/13e9d9a9-4541-4381-b6b8-14d07e95a2c2_2500x1774.avif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1033,&quot;width&quot;:1456,&quot;resizeWidth&quot;:562,&quot;bytes&quot;:136737,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/avif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/191542117?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13e9d9a9-4541-4381-b6b8-14d07e95a2c2_2500x1774.avif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Uidv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13e9d9a9-4541-4381-b6b8-14d07e95a2c2_2500x1774.avif 424w, https://substackcdn.com/image/fetch/$s_!Uidv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13e9d9a9-4541-4381-b6b8-14d07e95a2c2_2500x1774.avif 848w, https://substackcdn.com/image/fetch/$s_!Uidv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13e9d9a9-4541-4381-b6b8-14d07e95a2c2_2500x1774.avif 1272w, https://substackcdn.com/image/fetch/$s_!Uidv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13e9d9a9-4541-4381-b6b8-14d07e95a2c2_2500x1774.avif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Work isolation as a primitive.</strong> Parallel agents need to not step on each other. Virtually every serious tool in this space has landed on git worktrees (or similar) as the answer. Conductor maps each agent session to its own isolated workspace. Vibe Kanban (shown above) does the same for its kanban-driven agent workflow. The pattern is near ubiquitous because the problem is real: without isolation, parallel agents produce chaos.</p><p><strong>Planning and task state as the primary UI.</strong> Tools like Vibe Kanban have replaced &#8220;tabs and files&#8221; with &#8220;tasks and states&#8221; as the top-level mental model. 
You create task cards (a landing page, a backend service, an email integration), assign each to an agent and a model, and manage the whole effort like a lightweight project board - except the &#8220;team&#8221; is running autonomously. This is a project management surface that happens to have agents doing the implementation.</p><p><strong>Background agents and async-first design.</strong> Some of the most interesting tools in this space don&#8217;t even try to keep you in the loop during execution. Cursor, Copilot and Antigravity support background agents that run without requiring your presence - you define intent, step away, and review when they&#8217;re done. Jules works similarly: assign a task, come back to a diff. The implicit promise is that your attention is too valuable to spend watching a progress bar. That&#8217;s a significant departure from the IDE&#8217;s real-time, synchronous feedback loop.</p><p><strong>Attention management for parallel agents.</strong> When many agents run concurrently, the real bottleneck becomes knowing which one needs you <em>right now</em>. This is why tools like Conductor surface live progress across sessions and cmux introduced notification rings and unread badges for terminal panes. &#8220;Agent needs attention&#8221; is becoming a first-class event in the developer environment - something to route and triage, not just notice.</p><p><strong>Agents embedded into the software lifecycle.</strong> GitHub&#8217;s Copilot coding agent is asynchronous, secured by a control layer, and powered by GitHub Actions - attached to how code actually ships (issues &#8594; PRs &#8594; CI &#8594; merge), not just how it gets written. </p><p>None of these tools claim IDEs are obsolete - many still interoperate with them. 
But the repeated patterns (parallel workspaces, diff-first review, task state, background execution, lifecycle integration) are precisely what &#8220;death of the IDE&#8221; proponents mean when they talk about a center-of-gravity shift.</p><div><hr></div><h2>Why developers still reach for an IDE</h2><p><strong>The best critique of &#8220;the IDE is dead&#8221; is that the IDE </strong><em><strong>still</strong></em><strong> compresses several genuinely hard problems into a high-fidelity feedback loop</strong>: precise navigation, local reasoning, interactive debugging, and the ability to <em>understand</em> a system by directly manipulating it.</p><p>Even the most ambitious orchestration tools keep a manual-edit escape hatch. For example, reviewing diffs in-thread, commenting on changes, and then opening the result in your editor for manual adjustments. That&#8217;s an acknowledgment that human intervention is part of the intended workflow.</p><p>Agent tooling itself highlights where the limits still are. Multi-file refactorings in large repositories remain among the toughest challenges for software engineering agents. These are exactly the situations where interactive code navigation and human judgment still matter most - where you need to hold a mental model of the system that the agent can&#8217;t fully reconstruct from context alone.</p><p>The failure mode that keeps developers anchored to IDE-level inspection is agents being <em>almost</em> right. When something is 90% correct and subtly broken, the cost of finding the issue often exceeds what it would have taken to write it yourself. 
For high-stakes changes, the IDE remains the best instrument for that kind of deep, precise inspection.</p><div><hr></div><h2>The new costs: review fatigue and governance overhead</h2><p>If development becomes &#8220;run many agents in parallel&#8221;, the workflow inherits problems that look less like text editing and more like distributed systems management - observability, permissions, isolation, and governance.</p><p>Agent workflows invert the labor. Instead of writing, you&#8217;re reviewing. That sounds like an improvement until you&#8217;re staring at twelve diffs from twelve parallel agents at the end of the day. Review fatigue is real, and it&#8217;s one of the reasons the most thoughtful tools in this space focus on attention routing, structured plans, and review-first gates rather than pushing for full autonomy by default.</p><p>The security surface also expands as agents gain access to more tools, repos, and external systems. Once agents can browse the web, query databases, write to filesystems, and trigger deploys, what they&#8217;re <em>allowed</em> to do becomes as important as what they&#8217;re <em>capable</em> of doing.</p><p>On observability and control, IDE-integrated agent modes are already pushing toward explicit tool logs and approval gates. The governance question isn&#8217;t optional once agents act asynchronously and touch CI pipelines.</p><div><hr></div><h2>What survives: the IDE, the control plane, or both</h2><p>A clear reading of the landscape is that &#8220;death of the IDE&#8221; is directionally right about the <em>center of gravity</em>, but wrong as a literal forecast.</p><p>The strongest version of the claim is this: <strong>the IDE stops being the primary workspace and becomes one of several subordinate instruments</strong> - used for targeted inspection, debugging, and final edits - while planning, orchestration, review, and agent management move into dashboards, issue trackers, observability terminals, and cloud control planes. 
</p><p>The &#8220;bigger IDE&#8221; framing is equally well-supported. The new &#8220;IDE&#8221; is a system that provides multi-agent orchestration, isolated workspaces, permissions and audit logs, diff-first review, reliable tool connectivity, and attention routing. <strong>The file editor is still there. It&#8217;s just no longer the front door.</strong></p><p>The IDE isn&#8217;t dying. It&#8217;s being <em>de-centered</em>. The work is moving outward - into orchestration surfaces where humans define intent, delegate to parallel agent runtimes, and spend more time supervising, reviewing, and governing than typing. </p><p><strong>The IDE remains critical for correctness, comprehension, and the hard problems agents still struggle with. But it&#8217;s no longer the only place where programming happens - and for a growing number, it&#8217;s no longer the first place they go.</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wgTu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2368a63a-b3fc-4358-a6c5-57f2a33c6fd8_1376x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wgTu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2368a63a-b3fc-4358-a6c5-57f2a33c6fd8_1376x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!wgTu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2368a63a-b3fc-4358-a6c5-57f2a33c6fd8_1376x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!wgTu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2368a63a-b3fc-4358-a6c5-57f2a33c6fd8_1376x768.jpeg 
1272w, https://substackcdn.com/image/fetch/$s_!wgTu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2368a63a-b3fc-4358-a6c5-57f2a33c6fd8_1376x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wgTu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2368a63a-b3fc-4358-a6c5-57f2a33c6fd8_1376x768.jpeg" width="1376" height="768" class="sizing-normal" alt="" sizes="100vw" loading="lazy"></picture></div></a></figure></div>]]></content:encoded></item><item><title><![CDATA[14 More lessons from 14 years at Google]]></title><description><![CDATA[This 
time about teams, trust, and the systems around the code.]]></description><link>https://addyo.substack.com/p/14-more-lessons-from-14-years-at</link><guid isPermaLink="false">https://addyo.substack.com/p/14-more-lessons-from-14-years-at</guid><dc:creator><![CDATA[Addy Osmani]]></dc:creator><pubDate>Thu, 12 Feb 2026 15:30:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!4cMX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f101eb-1d56-49a1-8b88-1c8890a86dbb_7838x7838.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A while back, I wrote down <a href="https://addyo.substack.com/p/21-lessons-from-14-years-at-google">21 lessons from my time at Google</a>. The response caught me off guard because of <em>which</em> ones stuck. It wasn&#8217;t tech-specific advice. It was the stuff about people, decisions, and the messy reality of building things together.</p><p>That made me realize I&#8217;d left a lot on the table. The first list skewed toward individual craft - how to write better code, how to think about your career. But some of the hardest lessons I&#8217;ve learned aren&#8217;t about how you work. They&#8217;re about how teams work: how decisions actually get made, where coordination breaks down, what separates the groups that ship from the ones that spin.</p><p>These lessons pick up where the first left off. They&#8217;re less about being a better individual engineer and more about the systems around the engineering.</p><h2><strong>1. 
The best engineers pick the right problems to solve.</strong></h2><p>Every yes is an implicit no to something else.</p><p>I&#8217;ve watched talented engineers burn out because they said yes to everything - every bug, every feature request, every &#8220;quick favor.&#8221; Their calendar filled up with other people&#8217;s priorities, and their own roadmap became a graveyard of half-finished ideas.</p><p>Sometimes it&#8217;s just because they truly do care so much about the product. Protect your bandwidth from &#8220;nice to haves&#8221; the same way you protect production from outages. The skill is doing the right things and letting the wrong things stay undone.</p><p>The engineers who create disproportionate impact aren&#8217;t necessarily faster or smarter. They&#8217;re more ruthless about what deserves their attention. They&#8217;ve learned that the opportunity cost of working on the wrong thing is not working on the right thing.</p><h2><strong>2. If you can&#8217;t say what decision you&#8217;re asking for, you&#8217;re not ready for the meeting.</strong></h2><p>Most meetings fail not because they&#8217;re unnecessary, but because they&#8217;re disguised journaling. I&#8217;ve sat through hundreds of hours where smart people talked around a problem without ever naming what they needed. The meeting ends with vibes and no owner.</p><p>I learned to start with the ask: approve, choose, unblock, or inform.</p><p>Just those four words changed how I prepare for every meeting. If I can&#8217;t pick one, I&#8217;m not ready to take anyone&#8217;s time. And when I&#8217;m on the receiving end, I&#8217;ve started asking &#8220;what decision do you need from me?&#8221; within the first two minutes. It sounds blunt, but people are usually relieved - they often didn&#8217;t realize they hadn&#8217;t defined it themselves.</p><p>The hidden cost of vague meetings isn&#8217;t just the hour you lose. 
It&#8217;s the week of drift that follows while everyone waits for clarity that never came.</p><h2><strong>3. &#8220;We should&#8221; is not a plan. &#8220;On Tuesday, I will&#8221; is a plan.</strong></h2><p>The difference between motion and progress is specificity.</p><p>Teams drown in intentions. I&#8217;ve watched roadmaps fill up with &#8220;we should improve the onboarding flow&#8221; and &#8220;we should reduce latency&#8221; and &#8220;we should document the API.&#8221; Months later, the same items are still there, gathering dust and guilt. You might think that&#8217;s a solved problem now that we have <a href="https://addyosmani.com/blog/agentic-engineering/">agentic engineering</a>, but not quite.</p><p>Convert talk into the smallest next action someone can actually do, then put a name and a date on it. Not &#8220;we should improve onboarding&#8221; but &#8220;On Tuesday, Sarah will run three user sessions and document the top friction points.&#8221;</p><p>This is about respecting that humans need traction to make progress. Vague intentions create anxiety. Specific commitments create momentum. The plan doesn&#8217;t have to be perfect - it just has to be concrete enough that someone can actually start.</p><h2><strong>4. Slow code is sometimes a symptom. Slow decisions are always a problem.</strong></h2><p>Speed is about removing the friction that makes smart people hesitate. &#8220;Bias towards action&#8221; when you can.</p><p>When a project drags, the instinct is to blame velocity: people aren&#8217;t working hard enough, the codebase is messy, there aren&#8217;t enough engineers. But in my experience, slow code is often a symptom. Slow decisions are the disease.</p><p>If decisions routinely take weeks or months, look deeper. Missing context means people can&#8217;t evaluate tradeoffs. Unclear ownership means everyone&#8217;s waiting for someone else to decide. 
Fear of accountability means people hedge instead of commit.</p><p>The fastest engineering team I ever worked with wasn&#8217;t the one with the best programmers. It was the one where decisions happened in hours instead of weeks because the authority was clear, the context was shared, and being wrong wasn&#8217;t a career risk.</p><h2><strong>5. Reliability is a product feature. Treat it like one.</strong></h2><p>Users don&#8217;t praise reliability but they do notice its absence.</p><p>This creates a dangerous dynamic: reliability work is invisible until it fails, which means it&#8217;s perpetually under-resourced compared to shiny new features.</p><p>Error budgets are one way to make the tradeoff explicit. If your service has an SLO of 99.9% uptime, you have a &#8220;budget&#8221; of 0.1% downtime to spend on innovation. Burn through it, and you focus on reliability until you&#8217;ve earned it back. This is a framework for having honest conversations about risk.</p><p>The teams that maintain both velocity and reliability don&#8217;t do it through heroics. They do it by treating reliability as a first-class product feature with its own roadmap, its own metrics, and its own advocates. </p><p>You wouldn&#8217;t ship a feature without product review. Don&#8217;t ship a system without some kind of reliability discussion.</p><h2><strong>6. You can&#8217;t &#8220;communication&#8221; your way out of a bad interface between teams.</strong></h2><p>Team interaction modes exist for a reason: collaboration (working closely together), service (clear API and SLAs), or facilitation (one team helping another build capability). </p><p>Most cross-team pain isn&#8217;t about effort or good intentions. It&#8217;s about unclear boundaries and messy contracts. I&#8217;ve watched teams &#8220;improve communication&#8221; by adding more meetings, more Slack channels, more syncs - and it doesn&#8217;t make things better.</p><p>The problem isn&#8217;t that people aren&#8217;t talking. 
It&#8217;s that the interface between teams is undefined. Who owns what? What&#8217;s the contract? What can team A depend on team B for, and vice versa?</p><p>Choose deliberately, and you&#8217;ll need fewer meetings to make things work. Try to paper over a bad interface with communication, and you&#8217;ll burn out your most collaborative people while the underlying dysfunction remains.</p><h2><strong>7. The best escalation comes with a proposal.</strong></h2><p>&#8220;Here&#8217;s the problem&#8221; is half the job. I used to think my role was to identify issues and bring them to leadership. That&#8217;s necessary but insufficient.</p><p>&#8220;Here are two options, the tradeoffs, and what I recommend&#8221; is how you get unblocked and earn trust. It shows you&#8217;ve done the thinking. It gives decision-makers something super specific to react to instead of an open-ended problem to solve.</p><p>It makes their job easier, which makes them more likely to give you what you need.</p><p>The difference between &#8220;I need help&#8221; and &#8220;I need you to choose between A and B, and here&#8217;s why I lean toward B&#8221; is the difference between being a problem-raiser and being a problem-solver.</p><p>Both identify issues. Only one earns increasing trust and autonomy.</p><h2><strong>8. Avoid hero culture. Build systems that don&#8217;t require heroes.</strong></h2><p>The hero is burned out, undocumented, and a single point of failure.</p><p>If one person saving the day is a recurring pattern, that&#8217;s a failure mode rather than a badge of honor. I&#8217;ve seen teams celebrate their heroes while ignoring the dysfunction that made heroism necessary.</p><p>When they leave - and they always leave eventually - the team discovers that no one else really knows how things work. The celebration of heroism masks a systemic problem: the path for &#8220;normal humans on a normal day&#8221; doesn&#8217;t work.</p><p>Make the normal path the default. 
Document the system. Spread the knowledge. Design for the average Tuesday, not the exceptional crisis. Heroes should be unnecessary, and if they&#8217;re necessary, you should be working to make them unnecessary.</p><h2><strong>9. Make observability part of the feature.</strong></h2><p>A feature without telemetry is a liability in disguise.</p><p>If you ship a feature without knowing how it behaves in production, you shipped uncertainty. </p><p>I&#8217;ve watched teams celebrate launches only to discover weeks later that their feature was silently failing for 20% of users. They had no logs, no metrics, no dashboards - just a gap where understanding should be. This can cause all kinds of pain if you want to fix it, including unshipping just to properly A/B test with observability in place.</p><p>Logs, traces, dashboards, and alerts aren&#8217;t &#8220;ops work.&#8221; They&#8217;re how you learn. They&#8217;re how you know whether the thing you built actually works for real people doing real things in real conditions.</p><p>The best engineers I know treat observability as part of the definition of done. Not &#8220;I wrote the code&#8221; but &#8220;I wrote the code and I can see it working.&#8221;</p><h2><strong>10. Small PRs are kindness. Especially if the PR is AI-generated.</strong></h2><p>Small changes are easier to review, easier to reason about, and easier to revert.</p><p>I used to write large pull requests. I liked the idea of a complete feature being reviewable at once. I was optimizing for my convenience at the expense of my reviewers&#8217; sanity. Smaller PRs are often better for everyone.</p><p>They ship faster because they don&#8217;t sit in a review queue while someone tries to find an hour to understand your thousand-line diff. If you want teammates to trust your pace, make your work reviewable.</p><p>The hidden benefit is that small PRs force you to think in increments. Instead of one monolithic change, you build up capability piece by piece. 
Each piece gets feedback. Each piece can be rolled back independently. It&#8217;s slower per-PR but faster to actual production.</p><h2><strong>11. When you add a team, you add edges, not just nodes.</strong></h2><p>Coordination cost grows faster than headcount.</p><p>This is why &#8220;just throw more people at the problem&#8221; often fails, and why adding heads late in a project can make it later. Every new person adds communication overhead with everyone they need to coordinate with. The graph gets denser, not just larger.</p><p>I&#8217;ve seen managers genuinely puzzled when a team doubled in size but output barely changed. The answer is always the same: the new edges ate the new capacity. More people meant more alignment meetings, more context-sharing, more waiting for decisions that now required more stakeholders.</p><p>The solution isn&#8217;t to stop hiring. It&#8217;s to be intentional about reducing edges. Clear ownership. Autonomous teams with minimal dependencies. Interfaces that let people work in parallel instead of in lockstep. The best organizations aren&#8217;t the ones with the most people - they&#8217;re the ones with the most leverage per person.</p><h2><strong>12. The migration is never just a migration</strong></h2><p>Every migration is a negotiation between the system you have, the system you want, and the people who didn&#8217;t ask for either.</p><p>I&#8217;ve seen migrations estimated at one quarter stretch to years. Not because the technical work was wrong, but because nobody accounted for the human work: convincing teams to prioritize your migration over their roadmap, supporting the long tail of edge cases nobody knew existed, and maintaining two systems in parallel while the old one refuses to die.</p><p>The technical plan is the easy part. The hard part is designing for coexistence. You will run old and new simultaneously for longer than you think. 
You will discover that the &#8220;legacy&#8221; system encodes decisions nobody documented and workflows nobody remembers designing but everyone depends on. You will need an adoption strategy that doesn&#8217;t require every team to drop what they&#8217;re doing at once.</p><p>The migrations that actually finish share three traits: a sponsor who stays engaged past the kickoff, a team that really owns the migration instead of treating it as a side quest, and a clear deprecation date that people believe is real. Without all three, you get a migration that&#8217;s perpetually &#8220;almost done&#8221; - which is worse than not starting, because now you&#8217;re paying the cost of two systems indefinitely.</p><p>If you&#8217;re not willing to fund the finish, don&#8217;t start the migration.</p><h2><strong>13. AI makes drafts cheap. Taste becomes expensive.</strong></h2><p>Everyone can generate code now. The barrier to producing code, content, designs - it&#8217;s largely collapsing. AI will write you ten versions of anything in the time it used to take to write one.</p><p>The differentiator is choosing: what to build, what to delete, what to simplify, what not to ship, and what &#8220;good&#8221; looks like. Taste - the ability to distinguish between options and pick the right one - becomes the scarce resource.</p><p>Use AI to explore options fast, then apply judgment ruthlessly. The engineers who thrive in this environment won&#8217;t be the ones who generate the most. They&#8217;ll be the ones who curate the best.</p><p>Production is cheap. Editing is expensive. Selection is everything.</p><h2><strong>14. Trust is a latency optimization for teams.</strong></h2><p>This is the highest-leverage thing you can build. Not a system but credibility.</p><p>When people trust you, they don&#8217;t need five meetings to approve a decision. They assume competence, good intent, and follow-through. 
Decisions that would take weeks in a low-trust environment take hours in a high-trust one.</p><p>Every time you deliver on a promise, every time you&#8217;re honest about a mistake, every time you make someone else&#8217;s life easier, you&#8217;re depositing into an account that will pay dividends for years.</p><p>I&#8217;ve watched engineers with modest technical skills accomplish enormous things because everyone trusted them. I&#8217;ve watched brilliant engineers accomplish little because nobody would take their calls.</p><p>The code doesn&#8217;t matter if you can&#8217;t get anyone to ship it with you.</p><h2><strong>A final thought</strong></h2><p>The first time around, I said these lessons come down to staying curious, staying humble, and remembering that the work is about people. I still believe that.</p><p>But if this second list has a through-line, it&#8217;s something more specific: the work is about making it easier for normal people to do extraordinary things on a normal day. 
A career in engineering gives you plenty of time to learn these things the hard way and I&#8217;ve certainly learned a lot during my time at Google so far.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IMBS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F797bdb7f-ae19-4038-a929-14265678f331_1181x1574.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IMBS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F797bdb7f-ae19-4038-a929-14265678f331_1181x1574.jpeg 424w, https://substackcdn.com/image/fetch/$s_!IMBS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F797bdb7f-ae19-4038-a929-14265678f331_1181x1574.jpeg 848w, https://substackcdn.com/image/fetch/$s_!IMBS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F797bdb7f-ae19-4038-a929-14265678f331_1181x1574.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!IMBS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F797bdb7f-ae19-4038-a929-14265678f331_1181x1574.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IMBS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F797bdb7f-ae19-4038-a929-14265678f331_1181x1574.jpeg" width="308" height="410.49280270956814" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/797bdb7f-ae19-4038-a929-14265678f331_1181x1574.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1574,&quot;width&quot;:1181,&quot;resizeWidth&quot;:308,&quot;bytes&quot;:234411,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/187716319?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F797bdb7f-ae19-4038-a929-14265678f331_1181x1574.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IMBS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F797bdb7f-ae19-4038-a929-14265678f331_1181x1574.jpeg 424w, https://substackcdn.com/image/fetch/$s_!IMBS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F797bdb7f-ae19-4038-a929-14265678f331_1181x1574.jpeg 848w, https://substackcdn.com/image/fetch/$s_!IMBS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F797bdb7f-ae19-4038-a929-14265678f331_1181x1574.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!IMBS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F797bdb7f-ae19-4038-a929-14265678f331_1181x1574.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I hope a few of them save you a scar or two. And if they do, share what you&#8217;ve figured out with someone earlier in the journey. 
</p><p>That&#8217;s how the good lessons travel.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4cMX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f101eb-1d56-49a1-8b88-1c8890a86dbb_7838x7838.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4cMX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f101eb-1d56-49a1-8b88-1c8890a86dbb_7838x7838.png 424w, https://substackcdn.com/image/fetch/$s_!4cMX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f101eb-1d56-49a1-8b88-1c8890a86dbb_7838x7838.png 848w, https://substackcdn.com/image/fetch/$s_!4cMX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f101eb-1d56-49a1-8b88-1c8890a86dbb_7838x7838.png 1272w, https://substackcdn.com/image/fetch/$s_!4cMX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f101eb-1d56-49a1-8b88-1c8890a86dbb_7838x7838.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4cMX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f101eb-1d56-49a1-8b88-1c8890a86dbb_7838x7838.png" width="1456" height="1456" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d8f101eb-1d56-49a1-8b88-1c8890a86dbb_7838x7838.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1246393,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/187716319?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f101eb-1d56-49a1-8b88-1c8890a86dbb_7838x7838.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4cMX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f101eb-1d56-49a1-8b88-1c8890a86dbb_7838x7838.png 424w, https://substackcdn.com/image/fetch/$s_!4cMX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f101eb-1d56-49a1-8b88-1c8890a86dbb_7838x7838.png 848w, https://substackcdn.com/image/fetch/$s_!4cMX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f101eb-1d56-49a1-8b88-1c8890a86dbb_7838x7838.png 1272w, https://substackcdn.com/image/fetch/$s_!4cMX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f101eb-1d56-49a1-8b88-1c8890a86dbb_7838x7838.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p>]]></content:encoded></item><item><title><![CDATA[The 80% Problem in Agentic Coding]]></title><description><![CDATA[Managing comprehension debt when leaning on AI to code]]></description><link>https://addyo.substack.com/p/the-80-problem-in-agentic-coding</link><guid isPermaLink="false">https://addyo.substack.com/p/the-80-problem-in-agentic-coding</guid><dc:creator><![CDATA[Addy Osmani]]></dc:creator><pubDate>Wed, 28 Jan 2026 17:20:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lmAD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9189021-cd66-44d9-8683-520663047835_2400x1350.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Andrej 
Karpathy&quot;,&quot;id&quot;:23972309,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f6d0938b-93a9-4ead-933f-26da5da1bafc_400x400.jpeg&quot;,&quot;uuid&quot;:&quot;17d7fb74-2769-43c2-9344-1afaa83fa8c8&quot;}" data-component-name="MentionToDOM"></span> said something this week that made me pause: </p><blockquote><p><strong>&#8220;I rapidly went from about 80% manual+autocomplete coding and 20% agents to 80% agent coding and 20% edits+touchups. I really am mostly programming in English now.&#8221;</strong></p></blockquote><p>The inversion happened over a few weeks in late 2025. While this may apply to new (greenfield) or personal projects more than existing or legacy apps, I imagine <strong>AI still takes you further than it did a year ago. </strong>You can thank models, specs, skills, MCPs and our workflows improving. </p><p>Boris Cherny, creator of Claude Code, has recently echoed similar sentiments:</p><blockquote><p><strong>&#8220;Pretty much 100% of our code is written by Claude Code + Opus 4.5. For me personally it has been 100% for two+ months now, I don&#8217;t even make small edits by hand. I shipped 22 PRs yesterday and 27 the day before, each one 100% written by Claude. I think most of the industry will see similar stats in the coming months - it will take more time for some vs others.&#8221;</strong></p></blockquote><p>Some time ago I wrote about &#8220;<a href="https://addyo.substack.com/p/the-70-problem-hard-truths-about">the 70% problem</a>&#8221; - where AI coding took you to 70% completion, then left the final 30% - the last mile - to humans. That framing may now be evolving. 
The percentage may shift to 80% or higher for certain kinds of projects, but the nature of the problem changed more dramatically than the numbers suggest.</p><p>Armin Ronacher&#8217;s <a href="https://x.com/mitsuhiko/status/2010446141817844207">poll</a> of 5,000 developers complements this story: 44% now write less than 10% of their code manually. Another 26% are in the 10-50% range. We&#8217;ve crossed a threshold. But here&#8217;s what the triumphalist narrative misses: the problems didn&#8217;t disappear, they shifted. And some got worse.</p><p><strong>I want to caveat: I&#8217;ve definitely felt the shift to 80%+ agent coding on new side-projects; however, this is </strong><em><strong>very</strong></em><strong> different in large or existing apps, especially where teams are involved. Expectations differ, but this is a taste of where we&#8217;re headed.</strong></p><h2>The mistakes changed</h2><p><strong>AI errors evolved from syntax bugs to conceptual failures - the kind a sloppy, hasty junior may make under time pressure.</strong></p><p>Karpathy catalogs what still breaks: </p><blockquote><p><strong>&#8220;The models make wrong assumptions on your behalf and run with them without checking. They don&#8217;t manage confusion, don&#8217;t seek clarifications, don&#8217;t surface inconsistencies, don&#8217;t present tradeoffs, don&#8217;t push back when they should. They&#8217;re still a little too sycophantic.&#8221;</strong></p></blockquote><p><strong>Assumption propagation</strong>: The model misunderstands something early and builds an entire feature on faulty premises. You don&#8217;t notice until you&#8217;re five PRs deep and the architecture is cemented. This is a kind of two-steps-back pattern.</p><p><strong>Abstraction bloat</strong>: Given free rein, agents can overcomplicate relentlessly. They&#8217;ll scaffold 1,000 lines where 100 would suffice, creating elaborate class hierarchies where a function would do. 
You have to actively push back: &#8220;Couldn&#8217;t you just...?&#8221; The response is always &#8220;Of course!&#8221; followed by immediate simplification. They&#8217;re optimizing for looking comprehensive, not for maintainability.</p><p><strong>Dead code accumulation</strong>: They often don&#8217;t clean up after themselves. Old implementations linger. Comments get removed as side effects. Code they don&#8217;t fully understand gets altered anyway because it was adjacent to the task.</p><p><strong>Sycophantic agreement</strong>: They don&#8217;t always push back. No &#8220;Are you sure?&#8221; or &#8220;Have you considered...?&#8221; Just enthusiastic execution of whatever you described, even if your description was incomplete or contradictory.</p><p><strong>It&#8217;s possible to mitigate some of this via Skills if you know what to watch for.</strong></p><p>These patterns persist despite system prompts, despite CLAUDE.md instructions, despite plan mode. They&#8217;re not bugs to be fixed - they&#8217;re sometimes inherent to how these systems work. </p><p><strong>Agents optimize for coherent output, not for questioning your premises.</strong></p><p>I&#8217;ve watched this happen on my own teams - code that looks right in review but breaks three commits later when someone touches an adjacent system. </p><p>If you&#8217;re data-minded, recent <a href="https://www.sonarsource.com/blog/ai-coding-trust-gap/">survey data</a> suggests a &#8220;verification bottleneck&#8221; has emerged: only 48% of developers consistently check AI-assisted code before committing it, even though 38% find that reviewing AI-generated logic actually requires more effort than reviewing human-written code. <strong>We&#8217;re generating correct code faster, but may be accumulating technical debt even faster.</strong></p><h2>Comprehension debt: a hidden cost we don&#8217;t track</h2><p><strong>Generation (writing code) and discrimination (reading code) are different cognitive capabilities. 
You can review code competently even after your ability to write it from scratch has atrophied. But there&#8217;s a threshold where &#8220;review&#8221; becomes &#8220;rubber stamping.&#8221;</strong></p><p><a href="https://x.com/jeremytwei/status/2015886793955229705">Jeremy Twei</a> coined the perfect term for this: <em>comprehension debt</em>. It&#8217;s certainly tempting to just move on when the LLM one-shotted something that seems to work. This is the insidious part. The agent doesn&#8217;t get tired. It will sprint through implementation after implementation with unwavering confidence. The code looks plausible. The tests pass (or seem to). You&#8217;re under pressure to ship. You move on.</p><p><strong>Over time, you may understand less of your own codebase.</strong></p><p>I caught myself doing this last week. Claude implemented a feature I&#8217;d been putting off for days. The tests passed. I skimmed it, nodded, merged. Three days later I couldn&#8217;t explain how it worked.</p><p>Yoko Li <a href="https://x.com/stuffyokodraws/status/2013373307291340870">captured</a> the addiction loop perfectly: </p><blockquote><p><strong>&#8220;The agent implements an amazing feature and got maybe 10% of the thing wrong, and you&#8217;re like &#8216;hey I can fix this if I just prompt it for 5 more mins.&#8217; And that was 5 hrs ago.&#8221;</strong></p></blockquote><p>You&#8217;re always <em>almost</em> there. The final 10% feels tantalizingly close. Just one more prompt. Just one more iteration. The psychological hook is real.</p><p>Someone <a href="https://news.ycombinator.com/user?id=vibeprofessor">else</a> put it differently: </p><blockquote><p>&#8220;I spend most of my time babysitting agents. The AGI vibes are real, but so is the micromanagement tax. You&#8217;re not coding anymore, you&#8217;re supervising. Watching. Redirecting. 
It&#8217;s a different kind of exhausting.&#8221;</p></blockquote><p><strong>The dangerous part: it&#8217;s trivially easy to review code you can no longer write from scratch.</strong> If your ability to &#8220;read&#8221; doesn&#8217;t scale with the agent&#8217;s ability to &#8220;output,&#8221; you&#8217;re not engineering anymore. You&#8217;re hoping.</p><h2>The productivity paradox: More code, same throughput</h2><p><strong>Individual output surged 98% in high-adoption teams, but PR review time increased by as much as 91%. </strong></p><p>The data from <a href="https://www.faros.ai/blog/key-takeaways-from-the-dora-report-2025">Faros AI</a> and Google&#8217;s <a href="https://dora.dev/research/2025/dora-report/">DORA report</a> are interesting:</p><ul><li><p>Teams with high AI adoption merged 98% more PRs</p></li><li><p>Those same teams saw review times balloon 91%</p></li><li><p>PR size increased 154% on average</p></li><li><p>Code review became the new bottleneck</p></li></ul><p>Atlassian&#8217;s 2025 survey captured the paradox in stark terms: 99% of AI-using developers reported saving 10+ hours per week, yet most reported <em>no decrease in overall workload</em>. The time saved writing code was consumed by organizational friction - more context switching, more coordination overhead, managing the higher volume of changes.</p><p><strong>We got faster cars, but the roads got more congested.</strong></p><p>We're producing more code but spending more time reviewing it. The bottleneck just moved. When you make a resource cheaper (in this case, code generation), consumption increases faster than efficiency improves, and total resource use goes up - the Jevons paradox. </p><p>We&#8217;re not writing less code. We&#8217;re writing <em>vastly</em> more code, and someone still has to understand much of it.
There are, of course, developers who argue that if AI can understand the code, humans no longer need to.</p><h2>Where the 80/20 split actually works</h2><p><strong>The 80% threshold is most accessible in greenfield contexts where you control the entire stack and comprehension debt stays manageable through small team size.</strong></p><p>This actually works in a few contexts. </p><ul><li><p>Personal projects where you control everything</p></li><li><p>MVPs where &#8220;good enough&#8221; is actually good enough</p></li><li><p>Startups in greenfield territory without legacy constraints</p></li><li><p>Teams small enough that comprehension debt stays manageable</p></li></ul><p>In these environments, the agent&#8217;s weaknesses matter less. You can scaffold rapidly, refactor aggressively, throw away code without political friction. The pace of iteration outweighs occasional misdirection.</p><p>In mature codebases with complex invariants, the calculus inverts. The agent doesn&#8217;t know what it doesn&#8217;t know. It can&#8217;t intuit the unwritten rules. Its confidence scales inversely with context understanding.</p><p>Someone pointed out the obvious thing I was tiptoeing around: the first 90% might be easy, but the last 10% can take a long time. 90% accuracy is fine for non-mission-critical stuff. For the parts that actually matter, it's nowhere close. Self-driving cars work great until they don't, and that's why L2 is everywhere but L4 is still mostly vaporware.</p><p>For non-engineers, the wall is lower but still real. Tools like AI Studio, v0 and Bolt can turn sketches into working prototypes instantly. But hardening that prototype for production - handling real user data at scale, ensuring security and compliance - still requires engineering fundamentals.
AI gets you 80% to an MVP; the last 20% requires patience, learning the fundamentals deeply, or hiring engineers.</p><h2>Two different populations</h2><p><strong>We&#8217;re not seeing a smooth curve of adoption - we&#8217;re seeing a split between those who&#8217;ve crossed the threshold and everyone else. The gap between early adopters and the rest is widening, not closing.</strong></p><p>Armin&#8217;s poll revealed what raw adoption numbers obscure: alongside the 44% who now write less than 10% of their code by hand, a large share still write most of theirs manually. We have a bimodal distribution, not a bell curve. On one side: people like Karpathy and the Claude Code team, shipping dozens of PRs daily with 100% AI-written code, iterating faster than ever before. On the other: the vast majority, incrementally adopting copilot-style tools but not fundamentally changing their workflow.</p><p>The age split may be visible in discourse too. Younger developers seem more willing to adapt their workflow radically. Older developers are more skeptical - not because they can't use the tools, but because they've seen enough cycles to know the difference between a temporary productivity boost and a sustainable practice. Both might be right.</p><p>Stack Overflow&#8217;s 2025 survey showed only 16% reported &#8220;great&#8221; productivity improvements. Half saw modest gains. The top frustrations: &#8220;AI solutions that are almost right, but not quite&#8221; (66%) and &#8220;debugging AI code takes longer than writing it myself&#8221; (45%).</p><p>The engineers who <em>appear</em> to be thriving in 2026 aren&#8217;t just using better tools. They&#8217;ve reconceptualized their role from <em>implementer</em> to <em>orchestrator</em>. They&#8217;ve learned to think declaratively rather than imperatively. They&#8217;ve accepted that their job is now architectural oversight and quality control, not line-by-line coding.</p><p>Those struggling are trying to use AI as a faster typewriter. They haven&#8217;t adapted their workflow.
They&#8217;re fighting the agent&#8217;s approach instead of redirecting its goals. They haven&#8217;t invested in learning to prompt effectively, which is now as critical as writing good documentation or design specs ever was.</p><p>There's an uncomfortable truth here: <strong>orchestrating agents feels a lot like <a href="https://addyosmani.com/blog/coding-agents-manager/">management</a></strong>. Delegating tasks. Reviewing output. Redirecting when things go sideways. If you became an engineer because you didn't want to be a manager, this shift might feel like a betrayal. The role changed underneath you.</p><p><strong>The gap seems to be widening. The people who&#8217;ve figured out how to work with these tools are shipping stuff I can barely keep up with. Everyone else is... still figuring it out.</strong></p><p>This split may make some uncomfortable. I&#8217;ve always said I&#8217;m a builder, but I also enjoyed programming. The idea that these are now diverging paths - that you have to pick one - feels reductive. Like we&#8217;re forcing a binary on something more complicated. Someone in the comments said it perfectly: both viewpoints are valid, just different wiring. Neither is wrong.</p><h2>From imperative to declarative: The real leverage</h2><p><strong>Don&#8217;t tell the AI what to do - give it success criteria and watch it loop. The magic isn&#8217;t in the agent writing code, it&#8217;s in the agent iterating until it satisfies conditions you specify.</strong></p><p>Karpathy&#8217;s observation about leverage cuts to the core: </p><blockquote><p><strong>&#8220;LLMs are exceptionally good at looping until they meet specific goals and this is where most of the &#8216;feel the AGI&#8217; magic is to be found.&#8221;</strong></p></blockquote><p>The shift from imperative to declarative development:</p><p><strong>Old model (imperative)</strong>: &#8220;Write a function that takes X and returns Y. Use this library. Handle these edge cases.
Make sure to...&#8221;</p><p><strong>New model (declarative)</strong>: &#8220;Here are the requirements. Here are the tests that must pass. Here are the success criteria. Figure out how.&#8221;</p><p>This works because agents never get demoralized. They&#8217;ll try approaches you wouldn&#8217;t have patience for. They iterate relentlessly. If you specify the destination clearly, they&#8217;ll navigate there - even if it takes 30 failed attempts.</p><p>The patterns that work:</p><ul><li><p>Write tests first, let the agent iterate until they pass</p></li><li><p>Hook it up to a browser via MCP, let it verify behavior visually</p></li><li><p>Implement the naive correct version, then optimize while preserving correctness</p></li><li><p>Define the API contract, let it implement to spec</p></li></ul><p>But this only works if your success criteria are actually correct. Garbage in, garbage out scales with capability.</p><p>The developers succeeding with this approach spend 70% of their time on problem definition and verification strategy, 30% on execution. The ratios inverted from traditional development, but the total time decreased dramatically.</p><h2>The slopacolypse question</h2><p><strong>When anyone can generate thousands of lines of code in minutes, the ability to say &#8216;we don&#8217;t need this&#8217; becomes more valuable.</strong></p><p>Karpathy warned: </p><blockquote><p>&#8220;<strong>I am bracing for 2026 as the year of the slopacolypse across all of github, substack, arxiv, X/instagram, and generally all digital media.&#8221;</strong></p></blockquote><p>The concern is straightforward: when anyone can generate arbitrarily large volumes of plausible-looking code, content, papers, or posts, how do we maintain signal-to-noise ratio?</p><p>Boris Cherny offers a counterpoint: &#8220;My bet is that there will be no slopcopolypse because the model will become better at writing less sloppy code and at fixing existing code issues.
In the meantime, what helps is having the model code review its code using a fresh context window.&#8221;</p><p>Both can be true simultaneously. The capability for slop exists at unprecedented scale. The tooling to prevent it is emerging. The question is which scales faster.</p><p><strong>The slopacolypse will be driven by people who mistake velocity for productivity.</strong> Agents are marathon runners with no sense of direction unless you give them one. They will sprint ten miles into a brick wall if you don&#8217;t audit the &#8220;code actions&#8221; where necessary.</p><p>The teams I&#8217;ve seen handle this well tend to do a few things:</p><ul><li><p>Fresh-context code reviews help, though it feels weird asking the same model to critique its own code. But it works - give it a clean slate and it catches its own mistakes.</p></li><li><p>Automated verification at every step (CI/CD, linters, type checkers, tests as guardrails)</p></li><li><p>Deliberate constraints on agent autonomy (bounded tasks, clear success criteria)</p></li><li><p>High emphasis on human-in-the-loop at architectural decision points</p></li></ul><p>The code quality problems Karpathy describes - overcomplication, abstraction bloat, dead code - these improve as models improve. But they won&#8217;t disappear. They&#8217;re emergent from how these systems approach problems.</p><h2>What actually works: practical patterns</h2><p><strong>The future belongs to those who can maintain a coherent mental model of the macro while agents handle the tactical drudgery of the micro.</strong></p><p>After watching teams adapt over the past year, effective patterns have crystallized:</p><p><strong>1. Agent-first drafts with tight iteration loops</strong> </p><p>Don&#8217;t use AI for one-off suggestions. Generate entire first drafts, then refine. The Claude Code team practice: have the model review its own code with a fresh context window. This catches issues before human review.</p><p><strong>2.
Declarative communication</strong> </p><p>Spend 70% of effort on problem definition, 30% on execution. Write comprehensive specs, define success criteria, provide test cases up front. Guide the agent&#8217;s goals, not its methods.</p><p><strong>3. Automated verification</strong> </p><p>If you repeatedly fix the same class of mistake, write a test or lint rule preemptively. Make the agent explain its code and flag potential problems before you review.</p><p><strong>4. Deliberate learning vs. pure production focus</strong> </p><p>Use AI as a learning tool, not a crutch (you&#8217;ve heard this a few times now). When the agent writes something you don&#8217;t understand, that&#8217;s a signal to dig deeper. Treat AI-generated code like code from a mentor - review it to learn, not just to ship.</p><p><strong>5. Architectural hygiene</strong> </p><p>More modularization, clearer API boundaries. Well-documented style guides fed into prompts. High-level architecture descriptions provided before coding begins. The planning phase expanded; the coding phase compressed; the review phase focused on design rather than syntax.</p><p><strong>The developers who thrive won&#8217;t be those who generate the most code. They&#8217;ll be those who know which code to generate, when to question the output, and how to maintain comprehension even as their hands leave the keyboard.</strong></p><h2>The uncomfortable truth about skill development</h2><p><strong>If your ability to &#8220;read&#8221; doesn&#8217;t scale at the same rate as the agent&#8217;s ability to &#8220;output,&#8221; you aren&#8217;t engineering anymore. You&#8217;re rubber stamping.</strong></p><blockquote><p>&#8220;It&#8217;s been like the boiling frog for me. Started by copy-pasting more into ChatGPT. Then more in-IDE prompting. Then agent tools. Suddenly I barely hand code anymore.
The transition was so gradual I didn&#8217;t notice until I was already there&#8221; [<a href="https://news.ycombinator.com/user?id=shawabawa3">HN</a>]</p></blockquote><p>There&#8217;s early evidence of skill atrophy in heavy AI users. Junior developers who rely on AI for everything report feeling less confident in problem-solving abilities over time. It&#8217;s the Google effect applied to coding - when you outsource constantly, your brain stops retaining.</p><p>I don&#8217;t know what the solution is, but I&#8217;ve been trying a few things:</p><ul><li><p>Use TDD: write tests (or think through test cases) before letting AI implement</p></li><li><p>Pair with seniors: discuss AI suggestions in real-time to learn the decision-making process</p></li><li><p>Ask for explanations: have the AI justify its approach, not just generate solutions</p></li><li><p>Alternate: write some features manually to maintain muscle memory</p></li></ul><p><strong>The risk is real: it&#8217;s dangerously easy to review code you can no longer write from scratch.</strong> When that happens, you&#8217;ve become dependent on the tool in a way that limits your growth.</p><p>The engineers who will thrive long-term are those who use AI to accelerate gaining experience, not to bypass it entirely. They maintain their fundamentals while leveraging AI to explore more territory faster.</p><h2>Where this leaves us</h2><p><strong>The shift from 70% to 80% isn&#8217;t about percentages - it&#8217;s about the gap between prototype and production-ready software. That gap is narrowing, but it hasn&#8217;t closed.</strong></p><p>Karpathy asks the right questions: </p><blockquote><p><strong>&#8220;What happens to the &#8216;10X engineer&#8217; - the ratio of productivity between the mean and the max engineer? It&#8217;s quite possible that this grows a lot. 
Armed with LLMs, do generalists increasingly outperform specialists?&#8221;</strong></p></blockquote><p>These questions will define the next few years.</p><p>One thing is certain: AI wrote 80% of code for early adopters in late 2025. Even if your percentage is much lower, it&#8217;s likely higher than a year ago. This places disproportionate emphasis on the human&#8217;s role: owning outcomes, maintaining quality bars, ensuring tests actually validate behavior.</p><p><strong>The danger isn&#8217;t that the agent fails. I think it&#8217;s that it succeeds so confidently in the wrong direction that you stop checking the compass.</strong></p><p>DORA&#8217;s 2025 report crystallized the reality: AI is an amplifier of your development practices. Good processes get better (high-performing teams saw 55-70% faster delivery). Bad processes get worse (accumulating debt at unprecedented speed). There is no silver bullet.</p><p>Karpathy&#8217;s final observation resonates most: </p><blockquote><p><strong>&#8220;I didn&#8217;t anticipate that with agents programming feels </strong><em><strong>more</strong></em><strong> fun because a lot of the fill in the blanks drudgery is removed and what remains is the creative part. I also feel less blocked/stuck and I experience a lot more courage because there&#8217;s almost always a way to work hand in hand with it to make some positive progress.&#8221;</strong></p></blockquote><p>He also notes: &#8220;LLM coding will split up engineers based on those who primarily liked coding and those who primarily liked building.&#8221;</p><p><strong>That&#8217;s probably the most insightful prediction about where this is headed.</strong></p><p>If you liked the act of writing code itself - the craft of it, the meditation of it - this transition might feel like loss. If you liked building things and code was the necessary means, this feels like liberation.</p><p>Neither response is wrong. 
But the tooling is optimizing for the latter.</p><h2>For the skeptics (you&#8217;re right to be skeptical)</h2><p>The productivity claims are often overhyped. AI still makes mistakes a competent junior wouldn&#8217;t. Comprehension debt is real and poorly understood. The slopacolypse risk is genuine.</p><p>But the shift is real. When Karpathy admits he barely writes code directly anymore, when the Claude Code team ships 20+ PRs daily with 100% AI-written code, we&#8217;re past the point of dismissing this as hype.</p><p><strong>As software engineers, our identity was never &#8220;the person who can write code&#8221; - it was &#8220;the person who can solve problems with software.&#8221;</strong></p><p>AI isn&#8217;t replacing engineers. It&#8217;s amplifying them - for better and for worse.</p><p>My advice: embrace the tools, but own the outcome. Use AI to accelerate learning, not skip it. Focus on the fundamentals: robust architecture, clean code, thorough tests, thoughtful UX. These remain as important as ever - maybe more so, since implementation is no longer the bottleneck.</p><p>I don&#8217;t know where this goes. Karpathy&#8217;s probably right that it&#8217;ll split people between those who liked coding and those who liked building.
</p><p>We&#8217;re all figuring this out in public, one PR at a time.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lmAD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9189021-cd66-44d9-8683-520663047835_2400x1350.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lmAD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9189021-cd66-44d9-8683-520663047835_2400x1350.png 424w, https://substackcdn.com/image/fetch/$s_!lmAD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9189021-cd66-44d9-8683-520663047835_2400x1350.png 848w, https://substackcdn.com/image/fetch/$s_!lmAD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9189021-cd66-44d9-8683-520663047835_2400x1350.png 1272w, https://substackcdn.com/image/fetch/$s_!lmAD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9189021-cd66-44d9-8683-520663047835_2400x1350.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lmAD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9189021-cd66-44d9-8683-520663047835_2400x1350.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d9189021-cd66-44d9-8683-520663047835_2400x1350.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2876019,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/185933546?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9189021-cd66-44d9-8683-520663047835_2400x1350.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lmAD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9189021-cd66-44d9-8683-520663047835_2400x1350.png 424w, https://substackcdn.com/image/fetch/$s_!lmAD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9189021-cd66-44d9-8683-520663047835_2400x1350.png 848w, https://substackcdn.com/image/fetch/$s_!lmAD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9189021-cd66-44d9-8683-520663047835_2400x1350.png 1272w, https://substackcdn.com/image/fetch/$s_!lmAD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9189021-cd66-44d9-8683-520663047835_2400x1350.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div>]]></content:encoded></item><item><title><![CDATA[How to write a good spec for AI agents]]></title><description><![CDATA[How to structure, plan, and iterate for high-performance coding agents]]></description><link>https://addyo.substack.com/p/how-to-write-a-good-spec-for-ai-agents</link><guid isPermaLink="false">https://addyo.substack.com/p/how-to-write-a-good-spec-for-ai-agents</guid><dc:creator><![CDATA[Addy Osmani]]></dc:creator><pubDate>Mon, 19 Jan 2026 15:31:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!qALe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f80e0c9-a1ae-468a-a20b-7bfdd64d1cab_2400x1350.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR: Aim for a clear spec covering just enough nuance (this may 
include structure, style, testing, boundaries) to guide the AI without overwhelming it. Break large tasks into smaller ones vs. keeping everything in one large prompt. Plan first in read-only mode, then execute and iterate continuously.</strong></p><blockquote><p><em>&#8220;I&#8217;ve heard a lot about writing good specs for AI agents, but haven&#8217;t found a solid framework yet. I could write a spec that rivals an RFC, but at some point the context is too large and the model breaks down.&#8221;</em></p></blockquote><p>Many developers share this frustration. Simply throwing a massive spec at an AI agent doesn&#8217;t work - context window limits and the model&#8217;s &#8220;attention budget&#8221; get in the way. The key is to write smart specs: documents that guide the agent clearly, stay within practical context sizes, and evolve with the project. This guide distills best practices from my use of coding agents including Claude Code and Gemini CLI into a framework for spec-writing that keeps your AI agents focused and productive.</p><p>We&#8217;ll cover five principles for great AI agent specs, each starting with a bolded takeaway.</p><h2><strong>1. Start with a high-level vision and let the AI draft the details</strong></h2><p><strong>Kick off your project with a concise high-level spec, then have the AI expand it into a detailed plan.</strong></p><p>Instead of over-engineering upfront, begin with a clear goal statement and a few core requirements. Treat this as a &#8220;product brief&#8221; and let the agent generate a more elaborate spec from it. This leverages the AI&#8217;s strength in elaboration while you maintain control of the direction. This works well unless you already feel you have very specific technical requirements that must be met from the start.</p><p><strong>Why this works:</strong> LLM-based agents excel at fleshing out details when given a solid high-level directive, but they need a clear mission to avoid drifting off course. 
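</p><p>For illustration, the detailed spec an agent drafts from a short brief often takes a shape like the following. This is a hypothetical outline, not a prescribed template - the section names are my own, so adapt them to your project:</p><pre><code># SPEC.md (agent-drafted, human-reviewed)

## Overview
One-paragraph goal statement and target users.

## Features
Core feature list, each with a short rationale.

## Tech Stack
Proposed stack, flagged as suggestions for you to approve.

## Data Model
Key entities and relationships.

## Step-by-Step Plan
Ordered, reviewable milestones.
</code></pre><p>Treat every section as a draft to be reviewed, not a decision already made.</p><p>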
By providing a short outline or objective description and asking the AI to produce a full specification (e.g. a spec.md), you create a persistent reference for the agent. Planning in advance matters even more with an agent - you can iterate on the plan first, then hand it off to the agent to write the code. The spec becomes the first artifact you and the AI build together.</p><p><strong>Practical approach:</strong> Start a new coding session by prompting:</p><blockquote><p>&#8220;You are an AI software engineer. Draft a detailed specification for [project X] covering objectives, features, constraints, and a step-by-step plan.&#8221; </p><p>Keep your initial prompt high-level - e.g. &#8220;Build a web app where users can track tasks (to-do list), with user accounts, a database, and a simple UI&#8221;. </p></blockquote><p>The agent might respond with a structured draft spec: an overview, feature list, tech stack suggestions, data model, and so on. This spec then becomes the &#8220;source of truth&#8221; that both you and the agent can refer back to. GitHub&#8217;s AI team promotes <a href="https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/">spec-driven development</a> where &#8220;specs become the shared source of truth&#8230; living, executable artifacts that evolve with the project&#8221;. Before writing any code, review and refine the AI&#8217;s spec. Make sure it aligns with your vision and correct any hallucinations or off-target details.</p><p><strong>Use Plan Mode to enforce planning-first:</strong> Tools like Claude Code offer a <a href="https://code.claude.com/docs/en/common-workflows">Plan Mode</a> that restricts the agent to read-only operations - it can analyze your codebase and create detailed plans but won&#8217;t write any code until you&#8217;re ready. 
This is ideal for the planning phase: start in Plan Mode (Shift+Tab in Claude Code), describe what you want to build, and let the agent draft a spec while exploring your existing code. Ask it to clarify ambiguities by questioning you about the plan. Have it review the plan for architecture, best practices, security risks, and testing strategy. The goal is to refine the plan until there&#8217;s no room for misinterpretation. Only then do you exit Plan Mode and let the agent execute. This workflow prevents the common trap of jumping straight into code generation before the spec is solid.</p><p><strong>Use the spec as context:</strong> Once approved, save this spec (e.g. as SPEC.md) and feed relevant sections into the agent as needed. Many developers using a strong model do exactly this - the spec file persists between sessions, anchoring the AI whenever work resumes on the project. This mitigates the forgetfulness that can happen when the conversation history gets too long or when you have to restart an agent. It&#8217;s akin to how one would use a Product Requirements Document (PRD) in a team: a reference that everyone (human or AI) can consult to stay on track. Experienced folks often &#8220;<a href="https://simonwillison.net/2025/Oct/7/vibe-engineering/">write good documentation first</a> and the model may be able to build the matching implementation from that input alone&#8221; as one engineer observed. The spec is that documentation.</p><p><strong>Keep it goal-oriented:</strong> A high-level spec for an AI agent should focus on what and why, more than the nitty-gritty how (at least initially). Think of it like the user story and acceptance criteria: Who is the user? What do they need? What does success look like? (e.g. &#8220;User can add, edit, complete tasks; data is saved persistently; the app is responsive and secure&#8221;). This keeps the AI&#8217;s detailed spec grounded in user needs and outcome, not just technical to-dos. 
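</p><p>For example, the goal-oriented portion of a spec for the to-do app above might read like this (a hypothetical fragment, with wording of my own invention):</p><pre><code>## Users &amp; Goals
- Who: small teams tracking shared tasks
- Need: add, edit, and complete tasks from any device

## Success Criteria
- Tasks persist across sessions (saved to the database)
- The app stays responsive on mobile and desktop
- User data is protected (authentication, no secrets in client code)
</code></pre><p>Notice there is no &#8220;how&#8221; here yet - the criteria describe observable outcomes the agent can verify against.</p><p>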
As the <a href="https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/">GitHub Spec Kit docs</a> put it, provide a high-level description of what you&#8217;re building and why, and let the coding agent generate a detailed specification focusing on user experience and success criteria. Starting with this big-picture vision prevents the agent from losing sight of the forest for the trees when it later gets into coding.</p><h2><strong>2. Structure the spec like a professional PRD (or SRS)</strong></h2><p><strong>Treat your AI spec as a structured document (PRD) with clear sections, not a loose pile of notes.</strong></p><p>Many developers treat specs for agents much like traditional Product Requirement Documents (PRDs) or System Design docs - comprehensive, well-organized, and easy for a &#8220;literal-minded&#8221; AI to parse. This formal approach gives the agent a blueprint to follow and reduces ambiguity.</p><p><strong>The six core areas:</strong> GitHub&#8217;s analysis of <a href="https://github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/">over 2,500 agent configuration files</a> revealed a clear pattern: the most effective specs cover six areas. Use this as a checklist for completeness:</p><p><strong>1. Commands:</strong> Put executable commands early - not just tool names, but full commands with flags: <code>npm test</code>, <code>pytest -v</code>, <code>npm run build</code>. The agent will reference these constantly.</p><p><strong>2. Testing:</strong> How to run tests, what framework you use, where test files live, and what coverage expectations exist.</p><p><strong>3. Project structure:</strong> Where source code lives, where tests go, where docs belong. Be explicit: &#8220;<code>src/</code> for application code, <code>tests/</code> for unit tests, <code>docs/</code> for documentation.&#8221;</p><p><strong>4. 
Code style:</strong> One real code snippet showing your style beats three paragraphs describing it. Include naming conventions, formatting rules, and examples of good output.</p><p><strong>5. Git workflow:</strong> Branch naming, commit message format, PR requirements. The agent can follow these if you spell them out.</p><p><strong>6. Boundaries:</strong> What the agent should never touch - secrets, vendor directories, production configs, specific folders. &#8220;Never commit secrets&#8221; was the single most common helpful constraint in the GitHub study.</p><p><strong>Be specific about your stack:</strong> Say &#8220;React 18 with TypeScript, Vite, and Tailwind CSS&#8221; not &#8220;React project.&#8221; Include versions and key dependencies. Vague specs produce vague code.</p><p><strong>Use a consistent format:</strong> Clarity is king. Many devs use Markdown headings or even XML-like tags in the spec to delineate sections, because AI models handle well-structured text better than free-form prose. For example, you might structure the spec as:</p><pre><code><code># Project Spec: My team's tasks app

## Objective
- Build a web app for small teams to manage tasks...

## Tech Stack
- React 18+, TypeScript, Vite, Tailwind CSS
- Node.js/Express backend, PostgreSQL, Prisma ORM

## Commands
- Build: `npm run build` (compiles TypeScript, outputs to dist/)
- Test: `npm test` (runs Jest, must pass before commits)
- Lint: `npm run lint -- --fix` (auto-fixes ESLint errors; note the `--` so npm passes the flag through to ESLint)

## Project Structure
- `src/` &#8211; Application source code
- `tests/` &#8211; Unit and integration tests
- `docs/` &#8211; Documentation

## Boundaries
- &#9989; Always: Run tests before commits, follow naming conventions
- &#9888;&#65039; Ask first: Database schema changes, adding dependencies
- &#128683; Never: Commit secrets, edit node_modules/, modify CI config
</code></code></pre><p>This level of organization not only helps you think clearly, it helps the AI find information. Anthropic engineers recommend <a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents">organizing prompts into distinct sections</a> (like &lt;background&gt;, &lt;instructions&gt;, &lt;tools&gt;, &lt;output_format&gt; etc.) for exactly this reason - it gives the model strong cues about which info is which. And remember, &#8220;minimal does not necessarily mean short&#8221; - don&#8217;t shy away from detail in the spec if it matters, but keep it focused.</p><p><strong>Integrate specs into your toolchain:</strong> Treat specs as &#8220;executable artifacts&#8221; tied to version control and CI/CD. The <a href="https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/">GitHub Spec Kit</a> uses a four-phase, gated workflow that makes your specification the center of your engineering process. Instead of writing a spec and setting it aside, the spec drives the implementation, checklists, and task breakdowns. Your primary role is to steer; the coding agent does the bulk of the writing. 
Each phase has a specific job, and you don&#8217;t move to the next one until the current phase is fully validated:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Z2M7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49888e6f-2aaf-4689-aeab-22957a766bf9_2784x1536.jpeg" width="1456" height="803" alt="Spec Driven Development Workflow" title="Spec Driven Development Workflow" loading="lazy"></figure></div><p><strong>1. Specify:</strong> You provide a high-level description of what you&#8217;re building and why, and the coding agent generates a detailed specification. This isn&#8217;t about technical stacks or app design - it&#8217;s about user journeys, experiences, and what success looks like. Who will use this? What problem does it solve? How will they interact with it? Think of it as mapping the user experience you want to create, and letting the coding agent flesh out the details. This becomes a living artifact that evolves as you learn more.</p><p><strong>2. Plan:</strong> Now you get technical. You provide your desired stack, architecture, and constraints, and the coding agent generates a comprehensive technical plan. If your company standardizes on certain technologies, this is where you say so. 
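</p><p>A plan-phase input, for example, might enumerate constraints like these (the specific systems named are illustrative):</p><pre><code>## Constraints
- Standard stack: React 18 + TypeScript, Express, PostgreSQL
- Must integrate with the existing single sign-on service
- Audit logging required for all data changes (compliance)
- No new third-party dependencies without approval
</code></pre><p>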
If you&#8217;re integrating with legacy systems or have compliance requirements, all of that goes here. You can ask for multiple plan variations to compare approaches. If you make internal docs available, the agent can integrate your architectural patterns directly into the plan.</p><p><strong>3. Tasks:</strong> The coding agent takes the spec and plan and breaks them into actual work - small, reviewable chunks that each solve a specific piece of the puzzle. Each task should be something you can implement and test in isolation, almost like test-driven development for your AI agent. Instead of &#8220;build authentication,&#8221; you get concrete tasks like &#8220;create a user registration endpoint that validates email format.&#8221;</p><p><strong>4. Implement:</strong> Your coding agent tackles tasks one by one (or in parallel). Instead of reviewing thousand-line code dumps, you review focused changes that solve specific problems. The agent knows what to build (specification), how to build it (plan), and what to work on (task). Crucially, your role is to verify at each phase: Does the spec capture what you want? Does the plan account for constraints? Are there edge cases the AI missed? The process builds in checkpoints for you to critique, spot gaps, and course-correct before moving forward.</p><p>This gated workflow prevents what Willison calls &#8220;house of cards code&#8221; - fragile AI outputs that collapse under scrutiny. Anthropic&#8217;s Skills system offers a similar pattern, letting you define reusable Markdown-based behaviors that agents invoke. 
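</p><p>A minimal Skill, for instance, is a folder containing a SKILL.md whose YAML frontmatter tells the agent when to invoke it (the structure follows Anthropic's Skills format; the body text here is an illustrative sketch):</p><pre><code>---
name: review-against-spec
description: Compare a finished change against SPEC.md and report unmet requirements
---

When invoked, read SPEC.md, check each stated requirement against the
current change, and output a checklist marking every item met or unmet.
</code></pre><p>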
By embedding your spec in these workflows, you ensure the agent can&#8217;t proceed until the spec is validated, and changes propagate automatically to task breakdowns and tests.</p><p><strong>Consider agents.md for specialized personas:</strong> For tools like GitHub Copilot, you can create <a href="https://github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/">agents.md files</a> that define specialized agent personas - a @docs-agent for technical writing, a @test-agent for QA, a @security-agent for code review. Each file acts as a focused spec for that persona&#8217;s behavior, commands, and boundaries. This is particularly useful when you want different agents for different tasks rather than one general-purpose assistant.</p><p><strong>Design for Agent Experience (AX):</strong> Just as we design APIs for developer experience (DX), consider designing specs for &#8220;Agent Experience.&#8221; This means clean, parseable formats: OpenAPI schemas for any APIs the agent will consume, llms.txt files that summarize documentation for LLM consumption, and explicit type definitions. The Agentic AI Foundation (AAIF) is standardizing protocols like MCP (Model Context Protocol) for tool integration - specs that follow these patterns are easier for agents to consume and act on reliably.</p><p><strong>PRD vs SRS mindset:</strong> It helps to borrow from established documentation practices. For AI agent specs, you&#8217;ll often blend these into one document (as illustrated above), but covering both angles serves you well. Writing it like a PRD ensures you include user-centric context (&#8220;the why behind each feature&#8221;) so the AI doesn&#8217;t optimize for the wrong thing. Expanding it like an SRS ensures you nail down the specifics the AI will need to actually generate correct code (like what database or API to use). 
Developers have found that this extra upfront effort pays off by drastically reducing miscommunications with the agent later.</p><p><strong>Make the spec a &#8220;living document&#8221;:</strong> Don&#8217;t write it and forget it. Update the spec as you and the agent make decisions or discover new info. If the AI had to change the data model or you decided to cut a feature, reflect that in the spec so it remains the ground truth. Think of it as version-controlled documentation. In <a href="https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/">spec-driven workflows</a>, the spec drives implementation, tests, and task breakdowns, and you don&#8217;t move to coding until the spec is validated. This habit keeps the project coherent, especially if you or the agent step away and come back later. Remember, the spec isn&#8217;t just for the AI - it helps you as the developer maintain oversight and ensure the AI&#8217;s work meets the real requirements.</p><h2><strong>3. Break tasks into modular prompts and context, not one big prompt</strong></h2><p><strong>Divide and conquer: give the AI one focused task at a time rather than a monolithic prompt with everything at once.</strong></p><p>Experienced AI engineers have learned that trying to stuff the entire project (all requirements, all code, all instructions) into a single prompt or agent message is a recipe for confusion. Not only do you risk hitting token limits, you also risk the model losing focus due to the &#8220;<a href="https://maxpool.dev/research-papers/curse_of_instructions_report.html">curse of instructions</a>&#8221; - too many directives causing it to follow none of them well. 
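</p><p>In practice, decomposition means trading one overloaded prompt for a short sequence of focused ones (the prompts below are illustrative):</p><pre><code># Instead of one prompt carrying ten rules at once:
"Build auth with JWT, validate emails, rate-limit login, log errors,
write tests, follow the style guide, never touch CI config, ..."

# Issue focused prompts in sequence:
1. "Implement the registration endpoint per SPEC.md section 3.1."
2. "Add email-format validation to registration and update its tests."
3. "Add rate limiting to the auth routes per the security constraints."
</code></pre><p>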
The solution is to design your spec and workflow in a modular way, tackling one piece at a time and pulling in only the context needed for that piece.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!BNjq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f49a9b2-c2fd-4e49-9452-73c9d3f88901_2784x1536.jpeg" width="1456" height="803" alt="Modular AI Specs" title="Modular AI Specs" loading="lazy"></figure></div><p><strong>The curse of too much context/instructions:</strong> Research has confirmed what many devs anecdotally saw: as you pile on more instructions or data into the prompt, the model&#8217;s performance in adhering to each one <a href="https://openreview.net/pdf/848f1332e941771aa491f036f6350af2effe0513.pdf">drops significantly</a>. One study dubbed this the &#8220;curse of instructions&#8221;, showing that even GPT-4 and Claude struggle when asked to satisfy many requirements simultaneously. In practical terms, if you present 10 bullet points of detailed rules, the AI might obey the first few and start overlooking others. The better strategy is iterative focus. <a href="https://maxpool.dev/research-papers/curse_of_instructions_report.html">Guidelines from industry</a> suggest decomposing complex requirements into sequential, simple instructions as a best practice. Focus the AI on one sub-problem at a time, get that done, then move on. 
This keeps the quality high and errors manageable.</p><p><strong>Divide the spec into phases or components:</strong> If your spec document is very long or covers a lot of ground, consider splitting it into parts (either physically separate files or clearly separate sections). For example, you might have a section for &#8220;Backend API Spec&#8221; and another for &#8220;Frontend UI Spec.&#8221; You don&#8217;t need to always feed the frontend spec to the AI when it&#8217;s working on the backend, and vice versa. Many devs using multi-agent setups even create separate agents or sub-processes for each part - e.g. one agent works on database/schema, another on API logic, another on frontend - each with the relevant slice of the spec. Even if you use a single agent, you can emulate this by copying only the relevant spec section into the prompt for that task. Avoid context overload: Don&#8217;t mix authentication tasks with database schema changes in one go, as the <a href="https://docs.digitalocean.com/products/gradient-ai-platform/concepts/context-management/">DigitalOcean AI guide</a> warns. Keep each prompt tightly scoped to the current goal.</p><p><strong>Extended TOC / Summaries for large specs:</strong> One clever technique is to have the agent build an extended Table of Contents with summaries for the spec. This is essentially a &#8220;spec summary&#8221; that condenses each section into a few key points or keywords, and references where details can be found. For example, if your full spec has a section on &#8220;Security Requirements&#8221; spanning 500 words, you might have the agent summarize it to: &#8220;Security: use HTTPS, protect API keys, implement input validation (see full spec &#167;4.2)&#8221;. By creating a hierarchical summary in the planning phase, you get a bird&#8217;s-eye view that can stay in the prompt, while the fine details remain offloaded unless needed. 
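</p><p>An extended TOC for the tasks-app spec might look like this sketch (section numbers are illustrative):</p><pre><code># Spec Index
1. Objective - task tracker for small teams; persistence required (§1)
2. Tech Stack - React 18 + TypeScript, Express, PostgreSQL (§2)
3. API - REST endpoints, JWT auth; contracts listed in full (§3)
4. Security - HTTPS only, protect API keys, validate all input (§4.2)
5. Testing - Jest; all tests must pass before commit (§5)
</code></pre><p>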
This extended TOC acts as an index: the agent can consult it and say &#8220;aha, there&#8217;s a security section I should look at&#8221;, and you can then provide that section on demand. It&#8217;s similar to how a human developer skims an outline and then flips to the relevant page of a spec document when working on a specific part.</p><p>To implement this, you can prompt the agent after writing the spec: &#8220;Summarize the spec above into a very concise outline with each section&#8217;s key points and a reference tag.&#8221; The result might be a list of sections with one or two sentence summaries. That summary can be kept in the system or assistant message to guide the agent&#8217;s focus without eating up too many tokens. This <a href="https://addyo.substack.com/p/context-engineering-bringing-engineering">hierarchical summarization approach</a> is known to help LLMs maintain long-term context by focusing on the high-level structure. The agent carries a &#8220;mental map&#8221; of the spec.</p><p><strong>Utilize sub-agents or &#8220;skills&#8221; for different spec parts:</strong> Another advanced approach is using multiple specialized agents (what Anthropic calls subagents or what you might call &#8220;skills&#8221;). Each subagent is configured for a specific area of expertise and given the portion of the spec relevant to that area. For instance, you might have a Database Designer subagent that only knows about the data model section of the spec, and an API Coder subagent that knows the API endpoints spec. The main agent (or an orchestrator) can route tasks to the appropriate subagent automatically. The benefit is each agent has a smaller context window to deal with and a more focused role, which can <a href="https://10xdevelopers.dev/structured/claude-code-with-subagents/">boost accuracy and allow parallel work</a> on independent tasks. Anthropic&#8217;s Claude Code supports this by letting you define subagents with their own system prompts and tools. 
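</p><p>In Claude Code, a subagent is defined in a Markdown file with YAML frontmatter (the file location and fields follow Claude Code's subagent format; the prompt body is an illustrative sketch):</p><pre><code># .claude/agents/db-designer.md
---
name: db-designer
description: Designs and migrates the database schema. Use for any data-model task.
tools: Read, Edit, Bash
---

You are the database specialist. Work only from the "Data Model" section
of SPEC.md. Never modify API handlers or frontend code.
</code></pre><p>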
&#8220;Each subagent has a specific purpose and expertise area, uses its own context window separate from the main conversation, and has a custom system prompt guiding its behavior,&#8221; as their docs describe. When a task comes up that matches a subagent&#8217;s domain, Claude can delegate that task to it, with the subagent returning results independently.</p><p><strong>Parallel agents for throughput:</strong> Running multiple agents simultaneously is emerging as &#8220;the next big thing&#8221; for developer productivity. Rather than waiting for one agent to finish before starting another task, you can spin up parallel agents for non-overlapping work. Willison describes this as &#8220;<a href="https://simonwillison.net/2025/Oct/7/vibe-engineering/">embracing parallel coding agents</a>&#8221; and notes it&#8217;s &#8220;surprisingly effective, if mentally exhausting&#8221;. The key is scoping tasks so agents don&#8217;t step on each other - one agent codes a feature while another writes tests, or separate components get built concurrently. Orchestration frameworks like LangGraph or OpenAI Swarm can help coordinate these agents, and shared memory via vector databases (like Chroma) lets them access common context without redundant prompting.</p><p><strong>Single vs. 
multi-agent: when to use each</strong></p><table><thead><tr><th>Aspect</th><th>Single Agent</th><th>Parallel/Multi-Agent</th></tr></thead><tbody><tr><td><strong>Strengths</strong></td><td>Simpler setup; lower overhead; easier to debug and follow</td><td>Higher throughput; handles complex interdependencies; specialists per domain</td></tr><tr><td><strong>Challenges</strong></td><td>Context overload on big projects; slower iteration; single point of failure</td><td>Coordination overhead; potential conflicts; needs shared memory (e.g., vector DBs)</td></tr><tr><td><strong>Best For</strong></td><td>Isolated modules; small-to-medium projects; early prototyping</td><td>Large codebases; one codes + one tests + one reviews; independent features</td></tr><tr><td><strong>Tips</strong></td><td>Use spec summaries; refresh context per task; start fresh sessions often</td><td>Limit to 2-3 agents initially; use MCP for tool sharing; define clear boundaries</td></tr></tbody></table><p>In practice, using subagents or skill-specific prompts might look like: you maintain multiple spec files (or prompt templates) - e.g. SPEC_backend.md, SPEC_frontend.md - and you tell the AI, &#8220;For backend tasks, refer to SPEC_backend; for frontend tasks refer to SPEC_frontend.&#8221; Or in a tool like Cursor/Claude, you actually spin up a subagent for each. This is certainly more complex to set up than a single-agent loop, but it mimics what human developers do - we mentally compartmentalize a large spec into relevant chunks (you don&#8217;t keep the whole 50-page spec in your head at once; you recall the part you need for the task at hand, and have a general sense of the overall architecture). The challenge, as noted, is managing interdependencies: the subagents must still coordinate (the frontend needs to know the API contract from the backend spec, etc.). A central overview (or an &#8220;architect&#8221; agent) can help by referencing the sub-specs and ensuring consistency.</p><p><strong>Focus each prompt on one task/section:</strong> Even without fancy multi-agent setups, you can manually enforce modularity. 
For example, after the spec is written, your next move might be: &#8220;Step 1: Implement the database schema.&#8221; You feed the agent the Database section of the spec only, plus any global constraints from the spec (like tech stack). The agent works on that. Then for Step 2, &#8220;Now implement the authentication feature&#8221;, you provide the Auth section of the spec and maybe the relevant parts of the schema if needed. By refreshing the context for each major task, you ensure the model isn&#8217;t carrying a lot of stale or irrelevant information that could distract it. As one guide suggests: &#8220;<a href="https://docs.digitalocean.com/products/gradient-ai-platform/concepts/context-management/">Start fresh: begin new sessions</a> to clear context when switching between major features&#8221;. You can always remind the agent of critical global rules (from the spec&#8217;s Constraints section) each time, but don&#8217;t shove the entire spec in if it&#8217;s not all needed.</p><p><strong>Use in-line directives and code TODOs:</strong> Another modularity trick is to use your code or spec as an active part of the conversation. For instance, scaffold your code with // TODO comments that describe what needs to be done, and have the agent fill them one by one. Each TODO essentially acts as a mini-spec for a small task. This keeps the AI laser-focused (&#8220;implement this specific function according to this spec snippet&#8221;) and you can iterate in a tight loop. It&#8217;s similar to giving the AI a checklist item to complete rather than the whole checklist at once.</p><p>The bottom line: small, focused context beats one giant prompt. This improves quality and keeps the AI from getting &#8220;overwhelmed&#8221; by too much at once. As one set of best practices sums up, provide &#8220;One Task Focus&#8221; and &#8220;Relevant info only&#8221; to the model, and avoid dumping everything everywhere. 
By structuring the work into modules - and using strategies like spec summaries or sub-spec agents - you&#8217;ll navigate around context size limits and the AI&#8217;s short-term memory cap. Remember, a well-fed AI is like a well-fed function: give it only the <a href="https://addyo.substack.com/p/context-engineering-bringing-engineering">inputs it needs for the job at hand</a>.</p><h2><strong>4. Build in self-checks, constraints, and human expertise</strong></h2><p><strong>Make your spec not just a to-do list for the agent, but also a guide for quality control - and don&#8217;t be afraid to inject your own expertise.</strong></p><p>A good spec for an AI agent anticipates where the AI might go wrong and sets up guardrails. It also takes advantage of what you know (domain knowledge, edge cases, &#8220;gotchas&#8221;) so the AI doesn&#8217;t operate in a vacuum. Think of the spec as both coach and referee for the AI: it should encourage the right approach and call out fouls.</p><p><strong>Use three-tier boundaries:</strong> The <a href="https://github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/">GitHub analysis of 2,500+ agent files</a> found that the most effective specs use a three-tier boundary system rather than a simple list of don&#8217;ts. 
This gives the agent clearer guidance on when to proceed, when to pause, and when to stop:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!7iHk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff789bc70-7a7d-498b-b2c8-8e003e43682a_2784x1536.jpeg" width="1456" height="803" alt="Three tier boundaries for AI agent specs" title="Three tier boundaries for AI agent specs" loading="lazy"></figure></div><p><strong>&#9989; Always do:</strong> Actions the agent should take without asking. &#8220;Always run tests before commits.&#8221; &#8220;Always follow the naming conventions in the style guide.&#8221; &#8220;Always log errors to the monitoring service.&#8221;</p><p><strong>&#9888;&#65039; Ask first:</strong> Actions that require human approval. &#8220;Ask before modifying database schemas.&#8221; &#8220;Ask before adding new dependencies.&#8221; &#8220;Ask before changing CI/CD configuration.&#8221; This tier catches high-impact changes that might be fine but warrant a human check.</p><p><strong>&#128683; Never do:</strong> Hard stops. 
&#8220;Never commit secrets or API keys.&#8221; &#8220;Never edit node_modules/ or vendor/.&#8221; &#8220;Never remove a failing test without explicit approval.&#8221; &#8220;Never commit secrets&#8221; was the single most common helpful constraint in the study.</p><p>This three-tier approach is more nuanced than a flat list of rules. It acknowledges that some actions are always safe, some need oversight, and some are categorically off-limits. The agent can proceed confidently on &#8220;Always&#8221; items, flag &#8220;Ask first&#8221; items for review, and hard-stop on &#8220;Never&#8221; items.</p><p><strong>Encourage self-verification:</strong> One powerful pattern is to have the agent verify its work against the spec automatically. If your tooling allows, you can integrate checks like unit tests or linting that the AI can run after generating code. But even at the spec/prompt level, you can instruct the AI to double-check: e.g. &#8220;After implementing, compare the result with the spec and confirm all requirements are met. List any spec items that are not addressed.&#8221; This pushes the LLM to reflect on its output relative to the spec, catching omissions. It&#8217;s a form of self-audit built into the process.</p><p>For instance, you might append to a prompt: &#8220;(After writing the function, review the above requirements list and ensure each is satisfied, marking any missing ones).&#8221; The model will then (ideally) output the code followed by a short checklist indicating if it met each requirement. This reduces the chance it forgets something before you even run tests. 
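</p><p>At the prompt level you can template that footer so it is never forgotten. A minimal sketch in Python - the function name, requirement list, and exact wording here are illustrative, not any particular tool&#8217;s API:</p>

```python
def build_prompt(task: str, requirements: list[str]) -> str:
    """Wrap a task prompt with an explicit self-verification footer."""
    numbered = "\n".join(f"{i}. {r}" for i, r in enumerate(requirements, 1))
    return (
        f"{task}\n\n"
        f"Requirements:\n{numbered}\n\n"
        "After implementing, compare the result with the requirements above "
        "and list each one, marking it MET or MISSING, before you finish."
    )

# Hypothetical task and requirements, for illustration only.
prompt = build_prompt(
    "Implement a password-reset endpoint.",
    ["Hash tokens with bcrypt", "Expire tokens after 15 minutes"],
)
```

<p>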
It&#8217;s not foolproof, but it helps.</p><p><strong>LLM-as-a-Judge for subjective checks:</strong> For criteria that are hard to test automatically - code style, readability, adherence to architectural patterns - consider using &#8220;LLM-as-a-Judge.&#8221; This means having a second agent (or a separate prompt) review the first agent&#8217;s output against your spec&#8217;s quality guidelines. Anthropic and others have found this effective for subjective evaluation. You might prompt: &#8220;Review this code for adherence to our style guide. Flag any violations.&#8221; The judge agent returns feedback that either gets incorporated or triggers a revision. This adds a layer of semantic evaluation beyond syntax checks.</p><p><strong>Conformance testing:</strong> Willison advocates building conformance suites - language-independent tests (often YAML-based) that any implementation must pass. These act as a contract: if you&#8217;re building an API, the conformance suite specifies expected inputs/outputs, and the agent&#8217;s code must satisfy all cases. This is more rigorous than ad-hoc unit tests because it&#8217;s derived directly from the spec and can be reused across implementations. Include conformance criteria in your spec&#8217;s Success section (e.g., &#8220;Must pass all cases in conformance/api-tests.yaml&#8221;).</p><p><strong>Leverage testing in the spec:</strong> If possible, incorporate a test plan or even actual tests in your spec and prompt flow. In traditional development, we use TDD or write test cases to clarify requirements - you can do the same with AI. For example, in the spec&#8217;s Success Criteria, you might say &#8220;These sample inputs should produce these outputs&#8230;&#8221; or &#8220;the following unit tests should pass.&#8221; The agent can be prompted to run through those cases in its head or actually execute them if it has that capability. 
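</p><p>The conformance idea is easy to sketch. In practice the cases would live in a language-independent file like conformance/api-tests.yaml; they are inlined here so the sketch is self-contained, and the handle function is a stand-in for the implementation under test:</p>

```python
# Inlined stand-in for a YAML conformance suite such as conformance/api-tests.yaml.
CASES = [
    {"input": {"path": "/health"}, "expect": {"status": 200}},
    {"input": {"path": "/missing"}, "expect": {"status": 404}},
]

def handle(request: dict) -> dict:
    """Stand-in for the implementation under test."""
    return {"status": 200 if request["path"] == "/health" else 404}

def run_conformance(handler, cases) -> list[str]:
    """Return one failure message per case that does not match its expectation."""
    failures = []
    for i, case in enumerate(cases):
        got = handler(case["input"])
        if got != case["expect"]:
            failures.append(f"case {i}: expected {case['expect']}, got {got}")
    return failures
```

An empty failure list means the implementation satisfies the contract, and the same cases can be replayed unchanged against any other implementation.<p>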
Simon Willison noted that having a <a href="https://simonwillison.net/2025/Oct/7/vibe-engineering/">robust test suite</a> is like giving the agents superpowers - they can validate and iterate quickly when tests fail. In an AI coding context, writing a bit of pseudocode for tests or expected outcomes in the spec can guide the agent&#8217;s implementation. Additionally, you can use a dedicated &#8220;<a href="https://10xdevelopers.dev/structured/claude-code-with-subagents/">test agent</a>&#8221; in a subagent setup that takes the spec&#8217;s criteria and continuously verifies the &#8220;code agent&#8217;s&#8221; output.</p><p><strong>Bring your domain knowledge:</strong> Your spec should reflect insights that only an experienced developer or someone with context would know. For example, if you&#8217;re building an e-commerce agent and you know that &#8220;products&#8221; and &#8220;categories&#8221; have a many-to-many relationship, state that clearly (don&#8217;t assume the AI will infer it - it might not). If a certain library is notoriously tricky, mention pitfalls to avoid. Essentially, pour your mentorship into the spec. The spec can contain advice like &#8220;If using library X, watch out for memory leak issue in version Y (apply workaround Z).&#8221; This level of detail is what turns an average AI output into a truly robust solution, because you&#8217;ve steered the AI away from common traps.</p><p>Also, if you have preferences or style guidelines (say, &#8220;use functional components over class components in React&#8221;), encode that in the spec. The AI will then emulate your style. Many engineers even include small examples in the spec, e.g., &#8220;All API responses should be JSON. E.g. 
{"error": "message"} for errors.&#8221; By giving a quick example, you anchor the AI to the exact format you want.</p><p><strong>Minimalism for simple tasks:</strong> While we advocate thorough specs, part of expertise is knowing when to keep it simple. For relatively simple, isolated tasks, an overbearing spec can actually confuse more than help. If you&#8217;re asking the agent to do something straightforward (like &#8220;center a div on the page&#8221;), you might just say, &#8220;Make sure to keep the solution concise and do not add extraneous markup or styles.&#8221; No need for a full PRD there. Conversely, for complex tasks (like &#8220;implement an OAuth flow with token refresh and error handling&#8221;), that&#8217;s when you break out the detailed spec. A good rule of thumb: adjust spec detail to task complexity. Don&#8217;t under-spec a hard problem (the agent will flail or go off-track), but don&#8217;t over-spec a trivial one (the agent might get tangled or use up context on unnecessary instructions).</p><p><strong>Maintain the AI&#8217;s &#8220;persona&#8221; if needed:</strong> Sometimes, part of your spec is defining how the agent should behave or respond, especially if the agent interacts with users. For example, if building a customer support agent, your spec might include guidelines like &#8220;Use a friendly and professional tone&#8221; and &#8220;If you don&#8217;t know the answer, ask for clarification or offer to follow up, rather than guessing.&#8221; These kinds of rules (often included in system prompts) help keep the AI&#8217;s outputs aligned with expectations. They are essentially spec items for AI behavior. Keep them consistent and remind the model of them if needed in long sessions (LLMs can &#8220;drift&#8221; in style over time if not kept on a leash).</p><p><strong>You remain the exec in the loop:</strong> The spec empowers the agent, but you remain the ultimate quality filter.
If the agent produces something that technically meets the spec but doesn&#8217;t feel right, trust your judgement. Either refine the spec or directly adjust the output. The great thing about AI agents is they don&#8217;t get offended - if they deliver a design that&#8217;s off, you can say, &#8220;Actually, that&#8217;s not what I intended, let&#8217;s clarify the spec and redo it.&#8221; The spec is a living artifact in collaboration with the AI, not a one-time contract you can&#8217;t change.</p><p>Simon Willison humorously likened working with AI agents to &#8220;a very weird form of management&#8221; and even &#8220;getting good results out of a coding agent feels <a href="https://simonwillison.net/2025/Oct/7/vibe-engineering/">uncomfortably close to managing a human intern</a>&#8221;. You need to provide clear instructions (the spec), ensure they have the necessary context (the spec and relevant data), and give actionable feedback. The spec sets the stage, but monitoring and feedback during execution are key. If an AI was a &#8220;weird digital intern who will absolutely cheat if you give them a chance&#8221;, the spec and constraints you write are how you prevent that cheating and keep them on task.</p><p>Here&#8217;s the payoff: a good spec doesn&#8217;t just tell the AI what to build, it also helps it self-correct and stay within safe boundaries. By baking in verification steps, constraints, and your hard-earned knowledge, you drastically increase the odds that the agent&#8217;s output is correct on the first try (or at least much closer to correct). This reduces iterations and those &#8220;why on Earth did it do that?&#8221; moments.</p><h2><strong>5. 
Test, iterate, and evolve the spec (and use the right tools)</strong></h2><p><strong>Think of spec-writing and agent-building as an iterative loop: test early, gather feedback, refine the spec, and leverage tools to automate checks.</strong></p><p>The initial spec is not the end - it&#8217;s the beginning of a cycle. The best outcomes come when you continually verify the agent&#8217;s work against the spec and adjust accordingly. Also, modern AI devs use various tools to support this process (from CI pipelines to context management utilities).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nsmd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2a7534e-6004-4a41-b28e-c3e437c692af_1024x459.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nsmd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2a7534e-6004-4a41-b28e-c3e437c692af_1024x459.jpeg 424w, https://substackcdn.com/image/fetch/$s_!nsmd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2a7534e-6004-4a41-b28e-c3e437c692af_1024x459.jpeg 848w, https://substackcdn.com/image/fetch/$s_!nsmd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2a7534e-6004-4a41-b28e-c3e437c692af_1024x459.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!nsmd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2a7534e-6004-4a41-b28e-c3e437c692af_1024x459.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!nsmd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2a7534e-6004-4a41-b28e-c3e437c692af_1024x459.jpeg" width="1024" height="459" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a2a7534e-6004-4a41-b28e-c3e437c692af_1024x459.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:459,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Spec iteration loop: Test, Feedback, Refine, Tools&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Spec iteration loop: Test, Feedback, Refine, Tools" title="Spec iteration loop: Test, Feedback, Refine, Tools" srcset="https://substackcdn.com/image/fetch/$s_!nsmd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2a7534e-6004-4a41-b28e-c3e437c692af_1024x459.jpeg 424w, https://substackcdn.com/image/fetch/$s_!nsmd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2a7534e-6004-4a41-b28e-c3e437c692af_1024x459.jpeg 848w, https://substackcdn.com/image/fetch/$s_!nsmd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2a7534e-6004-4a41-b28e-c3e437c692af_1024x459.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!nsmd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2a7534e-6004-4a41-b28e-c3e437c692af_1024x459.jpeg 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Continuous testing:</strong> Don&#8217;t wait until the end to see if the agent met the spec. After each major milestone or even each function, run tests or at least do quick manual checks. If something fails, update the spec or prompt before proceeding. For example, if the spec said &#8220;passwords must be hashed with bcrypt&#8221; and you see the agent&#8217;s code storing plain text - stop and correct it (and remind the spec or prompt about the rule). Automated tests shine here: if you provided tests (or write them as you go), let the agent run them. 
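</p><p>One minimal shape for that loop, with run_tests and ask_agent as hypothetical stand-ins for your test runner and your model call:</p>

```python
def agentic_loop(ask_agent, run_tests, max_rounds: int = 3) -> bool:
    """Code -> test -> fix -> repeat, until the tests pass or we give up."""
    feedback = None
    for _ in range(max_rounds):
        ask_agent(feedback)       # generate or revise code
        failures = run_tests()    # e.g. shell out to your test command
        if not failures:
            return True           # the spec's "done" criteria are met
        # Feed the failures back so the next round targets them.
        feedback = "Your output didn't meet spec:\n" + "\n".join(failures)
    return False
```

<p>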
In many coding agent setups, you can have an agent run npm test or similar after finishing a task. The results (failures) can then feed back into the next prompt, effectively telling the agent &#8220;your output didn&#8217;t meet spec on X, Y, Z - fix it.&#8221; This kind of agentic loop (code -&gt; test -&gt; fix -&gt; repeat) is extremely powerful and is how tools like Claude Code or Copilot Labs are evolving to handle larger tasks. Always define what &#8220;done&#8221; means (via tests or criteria) and check for it.</p><p><strong>Iterate on the spec itself:</strong> If you discover that the spec was incomplete or unclear (maybe the agent misunderstood something or you realized you missed a requirement), update the spec document. Then explicitly re-sync the agent with the new spec: &#8220;I have updated the spec as follows&#8230; Given the updated spec, adjust the plan or refactor the code accordingly.&#8221; This way the spec remains the single source of truth. It&#8217;s similar to how we handle changing requirements in normal dev - but in this case you&#8217;re also the product manager for your AI agent. Keep version history if possible (even just via commit messages or notes), so you know what changed and why.</p><p><strong>Utilize context-management and memory tools:</strong> There&#8217;s a growing ecosystem of tools to help manage AI agent context and knowledge. For instance, retrieval-augmented generation (RAG) is a pattern where the agent can pull in relevant chunks of data from a knowledge base (like a vector database) on the fly. If your spec is huge, you could embed sections of it and let the agent retrieve the most relevant parts when needed, instead of always providing the whole thing. There are also frameworks implementing the Model Context Protocol (MCP), which automates feeding the right context to the model based on the current task. 
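</p><p>A rudimentary version of that retrieval can be a few lines: score each spec section by word overlap with the current task. This is a toy sketch - production systems use embeddings, and the section names below are invented:</p>

```python
import re

def words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def most_relevant_section(task: str, sections: dict[str, str]) -> str:
    """Pick the section whose title plus body shares the most words with the task."""
    task_words = words(task)
    return max(sections, key=lambda t: len(task_words & words(t + " " + sections[t])))

# Hypothetical spec, split on its headings.
spec = {
    "Payments": "Charge cards via the payment gateway; retry on timeout.",
    "Auth": "Sessions expire after 30 minutes; hash passwords with bcrypt.",
}
```

<p>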
One example is <a href="https://docs.digitalocean.com/products/gradient-ai-platform/concepts/context-management/">Context7</a> (context7.com), which can auto-fetch relevant context snippets from docs based on what you&#8217;re working on. In practice, this might mean the agent notices you&#8217;re working on &#8220;payment processing&#8221; and it pulls the &#8220;Payments&#8221; section of your spec or documentation into the prompt. Consider leveraging such tools or setting up a rudimentary version (even a simple search in your spec document).</p><p><strong>Parallelize carefully:</strong> Some developers run multiple agent instances in parallel on different tasks (as mentioned earlier with subagents). This can speed up development - e.g., one agent generates code while another simultaneously writes tests, or two features are built concurrently. If you go this route, ensure the tasks are truly independent or clearly separated to avoid conflicts (the spec should note any dependencies). For example, don&#8217;t have two agents writing to the same file at once. One workflow is to have an agent generate code and another review it in parallel, or to have separate components built that integrate later. This is advanced usage and can be mentally taxing to manage (as Willison admitted, running multiple agents is <a href="https://simonwillison.net/2025/Oct/7/vibe-engineering/">surprisingly effective, if mentally exhausting</a>!). Start with at most 2-3 agents to keep things manageable.</p><p><strong>Version control and spec locks:</strong> Use Git or your version control of choice to track what the agent does. <a href="https://simonwillison.net/2025/Oct/7/vibe-engineering/">Good version control habits</a> matter even more with AI assistance. Commit the spec file itself to the repo. This not only preserves history, but the agent can even use git diff or blame to understand changes (LLMs are quite capable of reading diffs). 
Some advanced agent setups let the agent query the VCS history to see when something was introduced - surprisingly, models can be &#8220;fiercely competent at Git&#8221;. By keeping your spec in the repo, you allow both you and the AI to track evolution. There are tools (like GitHub Spec Kit mentioned earlier) that integrate spec-driven development into the git workflow - for instance, gating merges on updated specs or generating checklists from spec items. While you don&#8217;t need those tools to succeed, the takeaway is to treat the spec like code - maintain it diligently.</p><p><strong>Cost and speed considerations:</strong> Working with large models and long contexts can be slow and expensive. A practical tip is to use model selection and batching smartly. Perhaps use a cheaper/faster model for initial drafts or repetitions, and reserve the most capable (and expensive) model for final outputs or complex reasoning. Some developers use GPT-4 or Claude for planning and critical steps, but offload simpler expansions or refactors to a local model or a smaller API model. If using multiple agents, maybe not all need to be top-tier; a test-running agent or a linter agent could be a smaller model. Also consider throttling context size: don&#8217;t feed 20k tokens if 5k will do. As we discussed, <a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents">more tokens can mean diminishing returns</a>.</p><p><strong>Monitor and log everything:</strong> In complex agent workflows, logging the agent&#8217;s actions and outputs is essential. Check the logs to see if the agent is deviating or encountering errors. Many frameworks provide trace logs or allow printing the agent&#8217;s chain-of-thought (especially if you prompt it to think step-by-step). Reviewing these logs can highlight where the spec or instructions might have been misinterpreted. 
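</p><p>Even a thin wrapper that records every tool call pays for itself the first time something goes wrong. A sketch, with read_file as a hypothetical agent tool:</p>

```python
import time

def logged(tool, log: list):
    """Record each call's tool name, arguments, result or error, and timestamp."""
    def wrapper(*args, **kwargs):
        entry = {"tool": tool.__name__, "args": args, "ts": time.time()}
        try:
            entry["result"] = tool(*args, **kwargs)
            return entry["result"]
        except Exception as exc:
            entry["error"] = repr(exc)
            raise
        finally:
            log.append(entry)
    return wrapper

def read_file(path: str) -> str:  # hypothetical agent tool
    return f"<contents of {path}>"

log: list[dict] = []
read_file = logged(read_file, log)
read_file("spec.md")
```

<p>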
It&#8217;s not unlike debugging a program - except the &#8220;program&#8221; is the conversation/prompt chain. If something weird happens, go back to the spec/instructions to see if there was ambiguity.</p><p><strong>Learn and improve:</strong> Finally, treat each project as a learning opportunity to refine your spec-writing skill. Maybe you&#8217;ll discover that a certain phrasing consistently confuses the AI, or that organizing spec sections in a certain way yields better adherence. Incorporate those lessons into the next spec. The field of AI agents is rapidly evolving, so new best practices (and tools) emerge constantly. Stay updated via blogs (like the ones by Simon Willison, Andrej Karpathy, etc.), and don&#8217;t hesitate to experiment.</p><p>A spec for an AI agent isn&#8217;t &#8220;write once, done.&#8221; It&#8217;s part of a continuous cycle of instructing, verifying, and refining. The payoff for this diligence is substantial: by catching issues early and keeping the agent aligned, you avoid costly rewrites or failures later. As one AI engineer quipped, using these practices can feel like having &#8220;an army of interns&#8221; working for you, but you have to manage them well. A good spec, continuously maintained, is your management tool.</p><h2><strong>Avoid common pitfalls</strong></h2><p>Before wrapping up, it&#8217;s worth calling out anti-patterns that can derail even well-intentioned spec-driven workflows. The <a href="https://github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/">GitHub study of 2,500+ agent files</a> revealed a stark divide: &#8220;Most agent files fail because they&#8217;re too vague.&#8221; Here are the mistakes to avoid:</p><p><strong>Vague prompts:</strong> &#8220;Build me something cool&#8221; or &#8220;Make it work better&#8221; gives the agent nothing to anchor on. 
As Baptiste Studer puts it: &#8220;Vague prompts mean wrong results.&#8221; Be specific about inputs, outputs, and constraints. &#8220;You are a helpful coding assistant&#8221; doesn&#8217;t work. &#8220;You are a test engineer who writes tests for React components, follows these examples, and never modifies source code&#8221; does.</p><p><strong>Overlong contexts without summarization:</strong> Dumping 50 pages of documentation into a prompt and hoping the model figures it out rarely works. Use hierarchical summaries (as discussed in Principle 3) or RAG to surface only what&#8217;s relevant. Context length is not a substitute for context quality.</p><p><strong>Skipping human review:</strong> Willison has a personal rule: &#8220;I won&#8217;t commit code I couldn&#8217;t explain to someone else.&#8221; Just because the agent produced something that passes tests doesn&#8217;t mean it&#8217;s correct, secure, or maintainable. Always review critical code paths. The &#8220;house of cards&#8221; metaphor applies: AI-generated code can look solid but collapse under edge cases you didn&#8217;t test.</p><p><strong>Conflating vibe coding with production engineering:</strong> Rapid prototyping with AI (&#8220;vibe coding&#8221;) is great for exploration and throwaway projects. But shipping that code to production without rigorous specs, tests, and review is asking for trouble. I distinguish &#8220;vibe coding&#8221; from &#8220;AI-assisted engineering&#8221; - the latter requires the discipline this guide describes. Know which mode you&#8217;re in.</p><p><strong>Ignoring the &#8220;lethal trifecta&#8221;:</strong> Willison&#8217;s &#8220;lethal trifecta&#8221; is an agent that combines access to private data, exposure to untrusted content, and the ability to communicate externally - a recipe for prompt injection and data exfiltration, so constrain at least one leg of it. Alongside that, three practical properties make agents risky: speed (they work faster than you can review), non-determinism (same input, different outputs), and cost pressure (encouraging corner-cutting on verification). Your spec and review process must account for all of these.
Don&#8217;t let speed outpace your ability to verify.</p><p><strong>Missing the six core areas:</strong> If your spec doesn&#8217;t cover commands, testing, project structure, code style, git workflow, and boundaries, you&#8217;re likely missing something the agent needs. Use the six-area checklist from Section 2 as a sanity check before handing off to the agent.</p><h2><strong>Conclusion</strong></h2><p>Writing an effective spec for AI coding agents requires solid software engineering principles combined with adaptation to LLM quirks. Start with clarity of purpose and let the AI help expand the plan. Structure the spec like a serious design document - covering the six core areas and integrating it into your toolchain so it becomes an executable artifact, not just prose. Keep the agent&#8217;s focus tight by feeding it one piece of the puzzle at a time (and consider clever tactics like summary TOCs, subagents, or parallel orchestration to handle big specs). Anticipate pitfalls by including three-tier boundaries (Always/Ask first/Never), self-checks, and conformance tests - essentially, teach the AI how to not fail. And treat the whole process as iterative: use tests and feedback to refine both the spec and the code continuously.</p><p>Follow these guidelines and your AI agent will be far less likely to &#8220;break down&#8221; under large contexts or wander off into nonsense.</p><p>Happy spec-writing!</p><div><hr></div><p><em>I&#8217;m excited to share I&#8217;ve released a new <a href="https://beyond.addy.ie/">AI-assisted engineering book</a> with O&#8217;Reilly. 
There are a number of free tips on the book site in case interested.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qALe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f80e0c9-a1ae-468a-a20b-7bfdd64d1cab_2400x1350.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qALe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f80e0c9-a1ae-468a-a20b-7bfdd64d1cab_2400x1350.png 424w, https://substackcdn.com/image/fetch/$s_!qALe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f80e0c9-a1ae-468a-a20b-7bfdd64d1cab_2400x1350.png 848w, https://substackcdn.com/image/fetch/$s_!qALe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f80e0c9-a1ae-468a-a20b-7bfdd64d1cab_2400x1350.png 1272w, https://substackcdn.com/image/fetch/$s_!qALe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f80e0c9-a1ae-468a-a20b-7bfdd64d1cab_2400x1350.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qALe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f80e0c9-a1ae-468a-a20b-7bfdd64d1cab_2400x1350.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f80e0c9-a1ae-468a-a20b-7bfdd64d1cab_2400x1350.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:545033,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/184990361?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f80e0c9-a1ae-468a-a20b-7bfdd64d1cab_2400x1350.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qALe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f80e0c9-a1ae-468a-a20b-7bfdd64d1cab_2400x1350.png 424w, https://substackcdn.com/image/fetch/$s_!qALe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f80e0c9-a1ae-468a-a20b-7bfdd64d1cab_2400x1350.png 848w, https://substackcdn.com/image/fetch/$s_!qALe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f80e0c9-a1ae-468a-a20b-7bfdd64d1cab_2400x1350.png 1272w, https://substackcdn.com/image/fetch/$s_!qALe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f80e0c9-a1ae-468a-a20b-7bfdd64d1cab_2400x1350.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p>]]></content:encoded></item><item><title><![CDATA[Code Review in the Age of AI]]></title><description><![CDATA[AI writes faster. Humans still have to prove it works.]]></description><link>https://addyo.substack.com/p/code-review-in-the-age-of-ai</link><guid isPermaLink="false">https://addyo.substack.com/p/code-review-in-the-age-of-ai</guid><dc:creator><![CDATA[Addy Osmani]]></dc:creator><pubDate>Mon, 05 Jan 2026 15:30:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ggv3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe894c835-4aef-4558-b047-83acb3be2053_7838x7838.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>AI did not kill code review. It made the burden of proof explicit. 
Ship changes with evidence like manual verification and automated tests, then use review for risk, intent, and accountability. Solo developers lean on automation to keep up with AI speed, while teams use review to build shared context and ownership.</strong></p><p>If your pull request doesn&#8217;t contain evidence that it works, you&#8217;re not shipping faster - you&#8217;re just moving work downstream.</p><p>By early 2026,<a href="https://www.infoworld.com/article/4049949/senior-developers-let-ai-do-more-of-the-coding-survey.html"> over 30% of senior developers</a> report shipping mostly AI-generated code. The challenge? AI excels at drafting features but falters on logic, security, and edge cases - <a href="https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report">logic errors alone are 75% more common than in human-written code</a>. This splits workflows: solos &#8220;vibe&#8221; at inference speed with test suites as backstops, while teams demand human eyes for context and compliance. Done right, both treat AI as an accelerator, but verification - who, what, and when - defines the difference.</p><p>As I&#8217;ve said before: if you haven&#8217;t seen the code do the right thing yourself, it doesn&#8217;t work.
AI amplifies this rule, not excuses it.</p><h2><strong>How developers use AI for review</strong></h2><ul><li><p><strong>Ad-hoc LLM checks</strong>: Paste diffs into Claude, Gemini or GPT for quick bug/style scans before committing.</p></li><li><p><strong>IDE integrations</strong>: Tools like Cursor, Claude Code, or Gemini CLI for inline suggestions and refactors during coding.</p></li><li><p><strong>PR bots and scanners</strong>: GitHub Copilot or custom agents to flag issues in PRs; pair with static/dynamic analysis like Snyk for security.</p></li><li><p><strong>Automated testing loops</strong>: Use AI to generate and run tests, enforcing coverage &gt;70% as a gate.</p></li><li><p><strong>Multi-model reviews</strong>: Run code through different LLMs (e.g., Claude for generation, a security-focused model for audit) to catch biases.</p></li></ul><p>The workflow and mindset differ dramatically depending on whether you&#8217;re solo or working in a team where others maintain your code.</p><h2><strong>Solo vs. 
Team: A quick comparison</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NNHx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d29555-37cb-44ed-b012-3fb4d5d73fb6_2528x896.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NNHx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d29555-37cb-44ed-b012-3fb4d5d73fb6_2528x896.png 424w, https://substackcdn.com/image/fetch/$s_!NNHx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d29555-37cb-44ed-b012-3fb4d5d73fb6_2528x896.png 848w, https://substackcdn.com/image/fetch/$s_!NNHx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d29555-37cb-44ed-b012-3fb4d5d73fb6_2528x896.png 1272w, https://substackcdn.com/image/fetch/$s_!NNHx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d29555-37cb-44ed-b012-3fb4d5d73fb6_2528x896.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NNHx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d29555-37cb-44ed-b012-3fb4d5d73fb6_2528x896.png" width="1456" height="516" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/60d29555-37cb-44ed-b012-3fb4d5d73fb6_2528x896.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:516,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:247807,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/183169342?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d29555-37cb-44ed-b012-3fb4d5d73fb6_2528x896.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NNHx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d29555-37cb-44ed-b012-3fb4d5d73fb6_2528x896.png 424w, https://substackcdn.com/image/fetch/$s_!NNHx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d29555-37cb-44ed-b012-3fb4d5d73fb6_2528x896.png 848w, https://substackcdn.com/image/fetch/$s_!NNHx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d29555-37cb-44ed-b012-3fb4d5d73fb6_2528x896.png 1272w, https://substackcdn.com/image/fetch/$s_!NNHx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d29555-37cb-44ed-b012-3fb4d5d73fb6_2528x896.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2><strong>Solo Devs: Shipping at &#8220;inference speed&#8221;</strong></h2><p><strong>Solo developers increasingly &#8220;trust the vibe&#8221; of AI-generated code - shipping features rapidly by reviewing only the key parts and relying on tests to catch issues.</strong></p><p>This workflow treats coding agents as powerful interns that can handle massive refactors largely on their own. As<a href="https://blog.kilo.ai/p/senior-engineers-use-ai-now"> Peter Steinberger admits</a>: <em>&#8220;I don&#8217;t read much code anymore.
I watch the stream and sometimes look at key parts, but most code I don&#8217;t read.&#8221;</em> The bottleneck becomes<a href="https://steipete.me/posts/2025/shipping-at-inference-speed"> inference time</a> - waiting for the AI to generate output - not typing.</p><p><strong>There&#8217;s a catch: perceived speed gains vanish without strong testing practices.</strong> Build those first. If you skip review, you don&#8217;t eliminate work - you defer it. The developers who succeed with AI at high velocity aren&#8217;t the ones who blindly trust it; they&#8217;re the ones who&#8217;ve built verification systems that catch issues before they reach production.</p><p>That isn&#8217;t to say solos throw caution to the wind. The responsible ones employ <strong>extensive automated testing as a safety net</strong> - aiming for high coverage (often &gt;70%) and using AI to generate tests that catch bugs in real-time. Modern coding agents are surprisingly good at designing sophisticated end-to-end tests.</p><p><strong>For solos, the game-changer is language-independent, data-driven tests.</strong> If comprehensive, they let an agent build (or fix) implementations in any language, verifying as it goes. I start projects with a spec.md the AI drafts, approve it, then loop: write &#8594; test &#8594; fix.</p><p>Crucially, solo coders still do <strong>manual testing and critical reasoning</strong> on the final product. Run the application, click through the UI, use the feature yourself. When higher stakes are involved, read more code and add extra checks. 
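</p><p><em>The language-independent, data-driven loop described above can be sketched in miniature: the cases live in plain data, so an agent can reimplement the function in any language and verify against the same cases. This is an illustrative sketch, not code from the article - the function, case file, and field names are all hypothetical.</em></p>

```python
import json

# Language-neutral test data. In a real project this would live in a
# cases.json file next to the spec.md; it is inlined here for illustration.
CASES = json.loads("""
[
  {"name": "adds_tax",  "input": {"price": 100, "rate": 0.2}, "expected": 120.0},
  {"name": "zero_rate", "input": {"price": 50,  "rate": 0.0}, "expected": 50.0}
]
""")

def price_with_tax(price, rate):
    # Implementation under test. An agent can rewrite this in any
    # language, as long as the same cases still pass.
    return round(price * (1 + rate), 2)

def run_cases():
    # Collect (name, got, expected) for every failing case.
    failures = []
    for case in CASES:
        got = price_with_tax(**case["input"])
        if got != case["expected"]:
            failures.append((case["name"], got, case["expected"]))
    return failures

if __name__ == "__main__":
    print(run_cases())  # an empty list means every case passed
```

<p>Because the cases are data rather than code, the same file can gate a rewrite in TypeScript, Go, or anything else the agent chooses.</p><p>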
And despite moving fast, fix ugly code when you see it rather than letting the mess accumulate.</p><p>Even in this bleeding-edge paradigm:<a href="https://simonwillison.net/2025/Dec/18/code-proven-to-work/"> </a><em><a href="https://simonwillison.net/2025/Dec/18/code-proven-to-work/">your job is to deliver code you have proven to work</a>.</em></p><h2><strong>Teams: AI shifts review bottlenecks</strong></h2><p><strong>In team settings, AI is a powerful assistant for code review, but </strong><em><strong>cannot replace</strong></em><strong> the human judgment needed for quality, security, and maintainability.</strong></p><p>When multiple engineers collaborate, the cost of mistakes and longevity of code are much higher concerns. Teams have started using AI-based review bots for an initial pass on PRs, but they still require a human to sign off. As<a href="https://devclass.com/2025/03/19/graphite-debuts-diamond-ai-code-reviewer-insists-ai-will-never-replace-human-code-review/"> Greg Foster of Graphite</a> puts it: <em>&#8220;I don&#8217;t ever see [AI agents] becoming a stand-in for an actual human engineer signing off on a pull request.&#8221;</em></p><p><strong>The biggest practical problem isn&#8217;t that AI reviewers miss style issues - it&#8217;s that AI increases volume and shifts the burden onto humans.</strong><a href="https://jellyfish.co/blog/ai-assisted-pull-requests-are-18-larger/"> PRs are getting larger</a> (~18% more additions as AI adoption increases),<a href="https://www.cortex.io/post/ai-is-making-engineering-faster-but-not-better-state-of-ai-benchmark-2026"> incidents per PR are up ~24%, and change failure rates up ~30%</a>. When output increases faster than verification capacity, review becomes the rate limiter. 
As Foster notes: <em>&#8220;If we&#8217;re shipping code that&#8217;s never actually read or understood by a fellow human, we&#8217;re running a huge risk.&#8221;</em></p><p>In teams, AI floods reviewers with volume, so enforce incrementalism: break agent output into digestible commits. Human sign-off isn&#8217;t going away - it&#8217;s evolving to focus on what AI misses, like roadmap alignment and institutional context that AI can&#8217;t grasp.</p><h3><strong>Security: AI&#8217;s predictable weaknesses</strong></h3><p><strong>One area where human oversight is absolutely non-negotiable is security.</strong><a href="https://www.veracode.com/blog/ai-generated-code-security-risks/"> Approximately 45% of AI-generated code contains security flaws</a>.<a href="https://dl.acm.org/doi/10.1145/3716848"> Logic errors appear at 1.75&#215; the rate of human-written code, and XSS vulnerabilities occur at 2.74&#215; higher frequency</a>.</p><p>Beyond code issues,<a href="https://www.tomshardware.com/tech-industry/cyber-security/researchers-uncover-critical-ai-ide-flaws-exposing-developers-to-data-theft-and-rce"> agentic tooling and AI-integrated IDEs have created new attack paths</a> - prompt injection, data exfiltration, even RCE vulnerabilities. AI expands attack surfaces, so hybrid approaches win: AI flags, humans verify.</p><p><strong>Rule: If code touches auth, payments, secrets, or untrusted input, treat AI as a high-speed intern and require a human threat model review plus a security tool pass before merge.</strong></p><h3><strong>Review as knowledge transfer</strong></h3><p><strong>Code review is also how teams share system context. If AI writes the code and nobody can explain it, on-call becomes expensive.</strong></p><p>When a developer submits AI-generated code they don&#8217;t fully understand, they&#8217;re breaking the knowledge transfer mechanism that makes teams resilient.
If the original author can&#8217;t explain why the code works, how will the on-call engineer debug it at 2 AM?</p><p><a href="https://devclass.com/2025/11/27/ocaml-maintainers-reject-massive-ai-generated-pull-request/">The OCaml maintainers&#8217; rejection of a 13,000-line AI-generated PR</a> crystallizes this issue. The code wasn&#8217;t necessarily bad, but no one had bandwidth to review such a huge change, and reviewing AI-generated code is <em>&#8220;more taxing&#8221;</em> than reviewing human code. The lesson: <strong>AI can flood you with code, but teams must manage volume to avoid a review bottleneck.</strong></p><h3><strong>Making AI review tools work</strong></h3><p>User experiences with AI review tools are decidedly mixed. On the positive side, teams report catching 95%+ of bugs in some cases - null pointer exceptions, missing test coverage, anti-patterns. On the negative side, some developers dismiss AI review comments as &#8220;text noise&#8221; - generic observations that add no value.</p><p><strong>The lesson: AI review tools require thoughtful configuration.</strong> Tune sensitivity levels, disable unhelpful comment types, and establish clear opt-in/opt-out policies. Properly configured,<a href="https://graphite.dev/guides/what-is-ai-code-review"> AI reviewers can catch 70-80% of low-hanging fruit</a>, freeing humans to focus on architecture and business logic.</p><p>Many teams encourage <strong>smaller, stackable pull requests</strong> even if AI could do a giant change all at once.<a href="https://medium.com/@addyosmani/my-llm-coding-workflow-going-into-2026-52fe1681325e"> Commit early and often</a> - treat each self-contained change as a separate commit/PR with clear messages.</p><p>Importantly, <strong>teams maintain a hard line of human accountability.</strong> No matter how much AI contributed, a human must take responsibility. As an old IBM training saying goes: <em>&#8220;A computer can never be held accountable. 
That&#8217;s your job as the human in the loop.&#8221;</em></p><h2><strong>The PR Contract: What authors owe reviewers</strong></h2><p><strong>Whether solo or in a team, the emerging best practice is to<a href="https://addyo.substack.com/p/treat-ai-generated-code-as-a-draft"> treat AI-generated code as a helpful draft</a> that </strong><em><strong>must</strong></em><strong> be verified.</strong></p><p>The most successful teams have converged on a simple framework:</p><h3><strong>PR Contract</strong></h3><ol><li><p><strong>What/why</strong>: Intent in 1-2 sentences.</p></li><li><p><strong>Proof it works</strong>: Tests passed, manual steps (screenshots/logs).</p></li><li><p><strong>Risk + AI role</strong>: Tier and which parts were AI-generated (e.g., high=payments).</p></li><li><p><strong>Review focus</strong>: 1-2 areas for human input (e.g., architecture).</p></li></ol><p>This isn&#8217;t bureaucracy - it&#8217;s respect for reviewer time and a forcing function for author accountability. If you can&#8217;t fill this out, you don&#8217;t understand your own change well enough to ask someone else to approve it.</p><h3><strong>Core Principles</strong></h3><p><strong>Insist on proof, not promises.</strong> Make &#8220;working code&#8221; the baseline. Prompt AI agents to execute code or run unit tests after generation. Demand evidence: logs, screenshots, results. <strong>No PR goes up without either new tests or a demo of the change working.</strong></p><p><strong>Use AI as first-pass reviewer, not final arbiter.</strong> Treat AI review output as advisory - a dialog where one AI writes code, another reviews it, and the human orchestrates fixes. Think of AI reviews as spellcheck, not an editor.</p><p><strong>Focus human review on what AI misses.</strong> Does the change introduce a security hole? Does it duplicate existing code (a common AI flaw)? Is the approach maintainable? 
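</p><p><em>Codified, the four-point PR Contract above becomes a pull request template. A minimal sketch - the section wording is illustrative, not a standard:</em></p>

```markdown
## What / why
<!-- Intent in 1-2 sentences -->

## Proof it works
- [ ] Tests added or updated, and passing (link the CI run)
- [ ] Manual verification done (attach screenshots or logs)

## Risk + AI role
<!-- Risk tier (low / medium / high) and which parts were AI-generated -->

## Review focus
<!-- 1-2 areas where human judgment is most needed -->
```

<p>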
<strong>AI triages the easy stuff; humans tackle the hard stuff.</strong></p><p><strong>Enforce incremental development.</strong> Break work into small pieces - easier for AI to produce and for humans to review. Small commits with clear messages serve as checkpoints. <strong>Never commit code you can&#8217;t explain.</strong></p><p><strong>Maintain high testing standards.</strong><a href="https://medium.com/@addyosmani/my-llm-coding-workflow-going-into-2026-52fe1681325e"> Those who get the most out of coding agents</a> have strong testing practices. Ask AI to draft tests - it&#8217;s good at generating edge-case tests you might not think of.</p><h2><strong>Looking Ahead: The bottleneck has moved</strong></h2><p><strong>AI is transforming code review from line-by-line gatekeeping into higher-level quality control - but human judgment remains the safety-critical component.</strong></p><p>What we&#8217;re seeing is workflow evolution, not elimination. Code reviews now involve reviewing a <em>conversation</em> or <em>plan</em> between AI and author as much as the code diff itself. The human reviewer&#8217;s role becomes more like an editor or architect: focusing on what&#8217;s important and trusting automation for mundane checks.</p><p>For solo developers, the path ahead is exhilarating - new tools will further streamline development. Even then, the wise developer will &#8220;trust but verify.&#8221;</p><p>In larger teams, expect growing emphasis on AI governance. Companies will formalize policies about AI contributions, requiring sign-offs that code was reviewed by an employee. Roles like &#8220;AI code auditor&#8221; will emerge. Enterprise platforms will evolve to offer better multi-repository context and custom policy enforcement.</p><p><strong>No matter the advances, the core principle remains</strong>: code review ensures software meets requirements, is secure, robust, and maintainable. 
AI doesn&#8217;t change those fundamentals - it just changes how we get there.</p><p>The bottleneck moved from writing code to proving it works. The best code reviewers in the age of AI will embrace this shift - letting AI accelerate the mechanical work while holding the line on accountability. They&#8217;ll let AI <strong>accelerate</strong> the process, never <strong>abdicate</strong> it. As engineers are learning, it&#8217;s about<a href="https://blog.kilo.ai/p/senior-engineers-use-ai-now"> </a><em><a href="https://blog.kilo.ai/p/senior-engineers-use-ai-now">&#8220;proof over vibes&#8221;</a></em> in coding.</p><p>Code review isn&#8217;t dead but it&#8217;s becoming more <strong>strategic</strong>. And whether you&#8217;re a solo hacker deploying at 2 AM or a team lead signing off a critical system change, one truth holds: the <em>human</em> is ultimately responsible for what the AI delivers.</p><p>Embrace the AI, but never forget to <strong>double-check the work.</strong></p><div><hr></div><p><em>I&#8217;m excited to share I&#8217;ve released a new<a href="https://beyond.addy.ie/"> AI-assisted engineering book</a> with O&#8217;Reilly. 
There are a number of free tips on the book site if you&#8217;re interested.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ggv3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe894c835-4aef-4558-b047-83acb3be2053_7838x7838.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ggv3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe894c835-4aef-4558-b047-83acb3be2053_7838x7838.png 424w, https://substackcdn.com/image/fetch/$s_!ggv3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe894c835-4aef-4558-b047-83acb3be2053_7838x7838.png 848w, https://substackcdn.com/image/fetch/$s_!ggv3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe894c835-4aef-4558-b047-83acb3be2053_7838x7838.png 1272w, https://substackcdn.com/image/fetch/$s_!ggv3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe894c835-4aef-4558-b047-83acb3be2053_7838x7838.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ggv3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe894c835-4aef-4558-b047-83acb3be2053_7838x7838.png" width="1456" height="1456"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e894c835-4aef-4558-b047-83acb3be2053_7838x7838.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:12838196,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/183169342?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe894c835-4aef-4558-b047-83acb3be2053_7838x7838.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ggv3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe894c835-4aef-4558-b047-83acb3be2053_7838x7838.png 424w, https://substackcdn.com/image/fetch/$s_!ggv3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe894c835-4aef-4558-b047-83acb3be2053_7838x7838.png 848w, https://substackcdn.com/image/fetch/$s_!ggv3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe894c835-4aef-4558-b047-83acb3be2053_7838x7838.png 1272w, https://substackcdn.com/image/fetch/$s_!ggv3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe894c835-4aef-4558-b047-83acb3be2053_7838x7838.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p>]]></content:encoded></item><item><title><![CDATA[How Good Is AI at Coding React (Really)?]]></title><description><![CDATA[A data-driven look at what AI can and can&#8217;t do for React developers - and what you can do about it]]></description><link>https://addyo.substack.com/p/how-good-is-ai-at-coding-react-really</link><guid isPermaLink="false">https://addyo.substack.com/p/how-good-is-ai-at-coding-react-really</guid><dc:creator><![CDATA[Addy Osmani]]></dc:creator><pubDate>Mon, 29 Dec 2025 15:31:06 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!4Pug!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf33944-9ed1-4ea4-9531-0b9d47869b1c_1880x1046.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>tl;dr: AI coding benchmarks
show models excel at isolated React tasks like scaffolding components or implementing explicit specs, achieving ~40% success, but success drops to ~25% on multi-step integrations due to a &#8220;complexity cliff&#8221; in state management and design taste. The gap between &#8220;AI helped me ship&#8221; and &#8220;AI gave me a mess&#8221; is context engineering and explicit constraints. Deep React and domain knowledge enable you to spot when AI goes off the rails and understand </strong><em><strong>why</strong></em><strong> it repeats mistakes. Guide it without blindly accepting the output.</strong></p><p>This article is based on my closing keynote at React Summit by <a href="https://gitnation.com/contents/how-good-is-ai-at-coding-react-really">GitNation</a> (video).</p><div id="youtube2-jgAqpk3ZI6E" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;jgAqpk3ZI6E&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/jgAqpk3ZI6E?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><div><hr></div><h2><strong>The Question everyone&#8217;s asking (but nobody&#8217;s answering well)</strong></h2><p>Let me be direct: most conversations about AI and coding are stuck on vibes. Either AI is magic that will replace us all, or it&#8217;s garbage that can&#8217;t do anything useful, or it&#8217;s perpetually &#8220;one prompt away&#8221; from shipping production code.
All three takes miss what&#8217;s actually interesting.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4Pug!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf33944-9ed1-4ea4-9531-0b9d47869b1c_1880x1046.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4Pug!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf33944-9ed1-4ea4-9531-0b9d47869b1c_1880x1046.png 424w, https://substackcdn.com/image/fetch/$s_!4Pug!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf33944-9ed1-4ea4-9531-0b9d47869b1c_1880x1046.png 848w, https://substackcdn.com/image/fetch/$s_!4Pug!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf33944-9ed1-4ea4-9531-0b9d47869b1c_1880x1046.png 1272w, https://substackcdn.com/image/fetch/$s_!4Pug!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf33944-9ed1-4ea4-9531-0b9d47869b1c_1880x1046.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4Pug!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf33944-9ed1-4ea4-9531-0b9d47869b1c_1880x1046.png" width="1456" height="810" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/daf33944-9ed1-4ea4-9531-0b9d47869b1c_1880x1046.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:810,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1438853,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf33944-9ed1-4ea4-9531-0b9d47869b1c_1880x1046.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4Pug!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf33944-9ed1-4ea4-9531-0b9d47869b1c_1880x1046.png 424w, https://substackcdn.com/image/fetch/$s_!4Pug!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf33944-9ed1-4ea4-9531-0b9d47869b1c_1880x1046.png 848w, https://substackcdn.com/image/fetch/$s_!4Pug!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf33944-9ed1-4ea4-9531-0b9d47869b1c_1880x1046.png 1272w, https://substackcdn.com/image/fetch/$s_!4Pug!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaf33944-9ed1-4ea4-9531-0b9d47869b1c_1880x1046.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>After spending a year analyzing benchmarks, building with these tools at Google, and watching React developers struggle (and succeed) with AI assistants, here&#8217;s what I&#8217;ve learned: AI is <em>already</em> useful for React developers, but its usefulness is extremely uneven.
The unevenness is predictable if you know what to look for.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9LSy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33d19e66-2f44-4dfb-b88c-bec57fb63470_1886x1054.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9LSy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33d19e66-2f44-4dfb-b88c-bec57fb63470_1886x1054.png 424w, https://substackcdn.com/image/fetch/$s_!9LSy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33d19e66-2f44-4dfb-b88c-bec57fb63470_1886x1054.png 848w, https://substackcdn.com/image/fetch/$s_!9LSy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33d19e66-2f44-4dfb-b88c-bec57fb63470_1886x1054.png 1272w, https://substackcdn.com/image/fetch/$s_!9LSy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33d19e66-2f44-4dfb-b88c-bec57fb63470_1886x1054.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9LSy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33d19e66-2f44-4dfb-b88c-bec57fb63470_1886x1054.png" width="1456" height="814" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/33d19e66-2f44-4dfb-b88c-bec57fb63470_1886x1054.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:814,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:553092,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33d19e66-2f44-4dfb-b88c-bec57fb63470_1886x1054.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9LSy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33d19e66-2f44-4dfb-b88c-bec57fb63470_1886x1054.png 424w, https://substackcdn.com/image/fetch/$s_!9LSy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33d19e66-2f44-4dfb-b88c-bec57fb63470_1886x1054.png 848w, https://substackcdn.com/image/fetch/$s_!9LSy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33d19e66-2f44-4dfb-b88c-bec57fb63470_1886x1054.png 1272w, https://substackcdn.com/image/fetch/$s_!9LSy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33d19e66-2f44-4dfb-b88c-bec57fb63470_1886x1054.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>More importantly - and this is the part most articles skip - you have far more control over outcomes than you think.</p><h2><strong>The Core Thesis: Two sides of the same coin</strong></h2><p>This article covers two critical angles:</p><p><strong>What the data tells us:</strong> Benchmarks like <a href="https://designarena.ai/">Design Arena</a>, <a href="https://lmarena.ai/leaderboard/webdev">Web Dev Arena</a>, <a href="https://openai.com/index/introducing-swe-bench-verified/">SWE-Bench</a>, and <a href="https://huggingface.co/spaces/bytedance-research/Web-Bench-Leaderboard">Web-Bench</a> reveal clear patterns about where AI excels (isolated components, scaffolding, implementing explicit requirements) and where it struggles (multi-step integration, design taste, complex state management). 
Understanding these patterns means you can predict what will work before you waste time.</p><p><strong>What you can control:</strong> The difference between &#8220;AI helped me ship&#8221; and &#8220;AI gave me a mess to untangle&#8221; almost never comes down to just model selection. It comes down to context engineering, prompt specificity, workflow structure, and guardrails. These are all in your power to fix.</p><p>Let&#8217;s start with the foundation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mv8Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f84c291-2bb3-448e-a0fe-09ca43f307a4_1878x1044.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mv8Y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f84c291-2bb3-448e-a0fe-09ca43f307a4_1878x1044.png 424w, https://substackcdn.com/image/fetch/$s_!mv8Y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f84c291-2bb3-448e-a0fe-09ca43f307a4_1878x1044.png 848w, https://substackcdn.com/image/fetch/$s_!mv8Y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f84c291-2bb3-448e-a0fe-09ca43f307a4_1878x1044.png 1272w, https://substackcdn.com/image/fetch/$s_!mv8Y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f84c291-2bb3-448e-a0fe-09ca43f307a4_1878x1044.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!mv8Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f84c291-2bb3-448e-a0fe-09ca43f307a4_1878x1044.png" width="1456" height="809" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f84c291-2bb3-448e-a0fe-09ca43f307a4_1878x1044.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:809,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:540281,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f84c291-2bb3-448e-a0fe-09ca43f307a4_1878x1044.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mv8Y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f84c291-2bb3-448e-a0fe-09ca43f307a4_1878x1044.png 424w, https://substackcdn.com/image/fetch/$s_!mv8Y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f84c291-2bb3-448e-a0fe-09ca43f307a4_1878x1044.png 848w, https://substackcdn.com/image/fetch/$s_!mv8Y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f84c291-2bb3-448e-a0fe-09ca43f307a4_1878x1044.png 1272w, https://substackcdn.com/image/fetch/$s_!mv8Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f84c291-2bb3-448e-a0fe-09ca43f307a4_1878x1044.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>90% of developers use AI for coding in some way. 
In an AI-assisted world, <strong>the value of frameworks like React depends on how effectively AI can use them.</strong> If AI can&#8217;t handle a framework well, the <strong>quality of experiences </strong>you can build without a lot of manual work is limited.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6tIw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F231f9c94-674d-4c2e-a1b8-d5b8aeeaa4e5_1884x1052.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6tIw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F231f9c94-674d-4c2e-a1b8-d5b8aeeaa4e5_1884x1052.png 424w, https://substackcdn.com/image/fetch/$s_!6tIw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F231f9c94-674d-4c2e-a1b8-d5b8aeeaa4e5_1884x1052.png 848w, https://substackcdn.com/image/fetch/$s_!6tIw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F231f9c94-674d-4c2e-a1b8-d5b8aeeaa4e5_1884x1052.png 1272w, https://substackcdn.com/image/fetch/$s_!6tIw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F231f9c94-674d-4c2e-a1b8-d5b8aeeaa4e5_1884x1052.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6tIw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F231f9c94-674d-4c2e-a1b8-d5b8aeeaa4e5_1884x1052.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/231f9c94-674d-4c2e-a1b8-d5b8aeeaa4e5_1884x1052.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:605984,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F231f9c94-674d-4c2e-a1b8-d5b8aeeaa4e5_1884x1052.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6tIw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F231f9c94-674d-4c2e-a1b8-d5b8aeeaa4e5_1884x1052.png 424w, https://substackcdn.com/image/fetch/$s_!6tIw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F231f9c94-674d-4c2e-a1b8-d5b8aeeaa4e5_1884x1052.png 848w, https://substackcdn.com/image/fetch/$s_!6tIw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F231f9c94-674d-4c2e-a1b8-d5b8aeeaa4e5_1884x1052.png 1272w, https://substackcdn.com/image/fetch/$s_!6tIw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F231f9c94-674d-4c2e-a1b8-d5b8aeeaa4e5_1884x1052.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A lot of AI code quality comes down to context. <strong>You want to squeeze the most value out of the token budget in your context window</strong>. But the model and the tools that sit on top of it - all of these layers play an important role.</p><h2><strong>AI Changes what is easy, not what is true</strong></h2><p>AI is a force multiplier. It amplifies everything: good requirements, good architecture, good taste. It also amplifies the bad: vague specs, messy state, and the temptation to ship something you haven&#8217;t really understood. 
Give it a weak brief, and it will happily hand you a 10,000-line maze you&#8217;ll later delete.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!--eg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c5eb255-4302-4d6c-8c2b-6090cb07cc6d_1878x1050.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!--eg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c5eb255-4302-4d6c-8c2b-6090cb07cc6d_1878x1050.png 424w, https://substackcdn.com/image/fetch/$s_!--eg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c5eb255-4302-4d6c-8c2b-6090cb07cc6d_1878x1050.png 848w, https://substackcdn.com/image/fetch/$s_!--eg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c5eb255-4302-4d6c-8c2b-6090cb07cc6d_1878x1050.png 1272w, https://substackcdn.com/image/fetch/$s_!--eg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c5eb255-4302-4d6c-8c2b-6090cb07cc6d_1878x1050.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!--eg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c5eb255-4302-4d6c-8c2b-6090cb07cc6d_1878x1050.png" width="1456" height="814" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9c5eb255-4302-4d6c-8c2b-6090cb07cc6d_1878x1050.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:814,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1151719,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c5eb255-4302-4d6c-8c2b-6090cb07cc6d_1878x1050.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!--eg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c5eb255-4302-4d6c-8c2b-6090cb07cc6d_1878x1050.png 424w, https://substackcdn.com/image/fetch/$s_!--eg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c5eb255-4302-4d6c-8c2b-6090cb07cc6d_1878x1050.png 848w, https://substackcdn.com/image/fetch/$s_!--eg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c5eb255-4302-4d6c-8c2b-6090cb07cc6d_1878x1050.png 1272w, https://substackcdn.com/image/fetch/$s_!--eg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c5eb255-4302-4d6c-8c2b-6090cb07cc6d_1878x1050.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I draw a hard line between <strong><a href="https://addyo.substack.com/p/vibe-coding-is-not-the-same-as-ai">AI-assisted vibe coding</a></strong><a href="https://addyo.substack.com/p/vibe-coding-is-not-the-same-as-ai"> and </a><strong><a href="https://addyo.substack.com/p/vibe-coding-is-not-the-same-as-ai">AI-assisted engineering</a></strong>. Vibe coding is trusting high-level prompts and prioritizing speed over review. AI-assisted engineering is integrating AI inside a structured process where the human stays in control and accountable for the output.</p><p>Why does this distinction matter for React?</p><p>Because React apps aren&#8217;t just code. They&#8217;re product behavior, user experience, reliability, security, performance, accessibility, and long-term maintenance. 
AI can help with all of that - but only if you treat it like a teammate you&#8217;re pairing with and have <strong>oversight over</strong>, not a vending machine dispensing code.</p><h2><strong>The Monoculture problem (and opportunity)</strong></h2><p>One of the most under-discussed parts of the AI coding story: &#8220;how well AI codes&#8221; is not a universal property. It depends on what the model has seen in training, what tools it has access to, and what the ecosystem has standardized on.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d39I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae9dce94-8531-47b8-9cef-6171c199a55f_1878x1054.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d39I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae9dce94-8531-47b8-9cef-6171c199a55f_1878x1054.png 424w, https://substackcdn.com/image/fetch/$s_!d39I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae9dce94-8531-47b8-9cef-6171c199a55f_1878x1054.png 848w, https://substackcdn.com/image/fetch/$s_!d39I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae9dce94-8531-47b8-9cef-6171c199a55f_1878x1054.png 1272w, https://substackcdn.com/image/fetch/$s_!d39I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae9dce94-8531-47b8-9cef-6171c199a55f_1878x1054.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!d39I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae9dce94-8531-47b8-9cef-6171c199a55f_1878x1054.png" width="1456" height="817" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ae9dce94-8531-47b8-9cef-6171c199a55f_1878x1054.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:817,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1100840,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae9dce94-8531-47b8-9cef-6171c199a55f_1878x1054.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!d39I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae9dce94-8531-47b8-9cef-6171c199a55f_1878x1054.png 424w, https://substackcdn.com/image/fetch/$s_!d39I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae9dce94-8531-47b8-9cef-6171c199a55f_1878x1054.png 848w, https://substackcdn.com/image/fetch/$s_!d39I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae9dce94-8531-47b8-9cef-6171c199a55f_1878x1054.png 1272w, https://substackcdn.com/image/fetch/$s_!d39I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae9dce94-8531-47b8-9cef-6171c199a55f_1878x1054.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Large language models effectively set the ceiling for how much leverage you get out of a framework in an AI-assisted workflow. If AI struggles with a framework, you feel that as friction and quality limits.</p><p>Most AI tools converge on a stack that looks like: React, TypeScript, Tailwind, shadcn/ui. 
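</p><p>If your stack strays from that list, it can help to hand the model explicit constraints before it writes a line of code. Here is a minimal sketch of a rules file you might feed your AI tool - the filename, stack, and paths are hypothetical, purely illustrative:</p><pre><code># ai-rules.md (hypothetical - adapt to your tool's rules-file format)

- Framework: SolidJS + TypeScript. Never emit React APIs
  (no useState/useEffect); use createSignal/createEffect instead.
- Styling: CSS Modules only. No Tailwind utility classes.
- Data fetching: go through src/lib/api.ts; never call fetch() directly.
- If an API isn't in the docs provided, ask instead of guessing.
</code></pre><p>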
That stack dominates training data and tool optimization, so models are competent there and noticeably shakier off the beaten path.</p><p><strong>This has two implications for practicing React developers:</strong></p><ol><li><p><strong>If you&#8217;re on the mainstream stack, your &#8220;AI assistance ceiling&#8221; is higher.</strong> You&#8217;ll get better scaffolds, better component generation, and fewer hallucinated APIs.</p></li><li><p><strong>If you&#8217;re not, you need to compensate</strong> with better context, doc retrieval, and stricter constraints - or you&#8217;ll watch the model confidently build an alternate universe version of your app.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Iedm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87c6d29-4dea-4be6-923b-f10ab41afee5_1880x1054.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Iedm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87c6d29-4dea-4be6-923b-f10ab41afee5_1880x1054.png 424w, https://substackcdn.com/image/fetch/$s_!Iedm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87c6d29-4dea-4be6-923b-f10ab41afee5_1880x1054.png 848w, https://substackcdn.com/image/fetch/$s_!Iedm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87c6d29-4dea-4be6-923b-f10ab41afee5_1880x1054.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Iedm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87c6d29-4dea-4be6-923b-f10ab41afee5_1880x1054.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Iedm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87c6d29-4dea-4be6-923b-f10ab41afee5_1880x1054.png" width="1456" height="816" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b87c6d29-4dea-4be6-923b-f10ab41afee5_1880x1054.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1124916,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87c6d29-4dea-4be6-923b-f10ab41afee5_1880x1054.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Iedm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87c6d29-4dea-4be6-923b-f10ab41afee5_1880x1054.png 424w, https://substackcdn.com/image/fetch/$s_!Iedm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87c6d29-4dea-4be6-923b-f10ab41afee5_1880x1054.png 848w, 
https://substackcdn.com/image/fetch/$s_!Iedm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87c6d29-4dea-4be6-923b-f10ab41afee5_1880x1054.png 1272w, https://substackcdn.com/image/fetch/$s_!Iedm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87c6d29-4dea-4be6-923b-f10ab41afee5_1880x1054.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>There&#8217;s also a second-order effect: monoculture can slow innovation. 
React skills are likely to stay very relevant (comforting), but it also means that newer frameworks and alternative patterns can face headwinds until models and tooling catch up.</p><p>The good news? If a framework gains traction, AI makers will fine-tune models on it. Docs MCPs can bridge gaps in the interim. But short-term, React&#8217;s position is extremely strong because AI &#8220;knows&#8221; it best.</p><div><hr></div><h2><strong>The big reality check: The complexity cliff</strong></h2><p>If you remember nothing else from this article, remember this: <strong>AI handles simple tasks well and then falls off a cliff as complexity rises.</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OchZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c900599-9f98-4fed-836c-bdadea0d05e3_1872x1056.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OchZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c900599-9f98-4fed-836c-bdadea0d05e3_1872x1056.png 424w, https://substackcdn.com/image/fetch/$s_!OchZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c900599-9f98-4fed-836c-bdadea0d05e3_1872x1056.png 848w, https://substackcdn.com/image/fetch/$s_!OchZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c900599-9f98-4fed-836c-bdadea0d05e3_1872x1056.png 1272w, https://substackcdn.com/image/fetch/$s_!OchZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c900599-9f98-4fed-836c-bdadea0d05e3_1872x1056.png 1456w"
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OchZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c900599-9f98-4fed-836c-bdadea0d05e3_1872x1056.png" width="1456" height="821" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6c900599-9f98-4fed-836c-bdadea0d05e3_1872x1056.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:821,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:513018,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c900599-9f98-4fed-836c-bdadea0d05e3_1872x1056.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OchZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c900599-9f98-4fed-836c-bdadea0d05e3_1872x1056.png 424w, https://substackcdn.com/image/fetch/$s_!OchZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c900599-9f98-4fed-836c-bdadea0d05e3_1872x1056.png 848w, https://substackcdn.com/image/fetch/$s_!OchZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c900599-9f98-4fed-836c-bdadea0d05e3_1872x1056.png 1272w, 
https://substackcdn.com/image/fetch/$s_!OchZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c900599-9f98-4fed-836c-bdadea0d05e3_1872x1056.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>A form component, a utility function, a small isolated widget: great. Multi-step work across a real codebase: much less reliable.</p><p>I used a mix of <strong>objective</strong> and <strong>human-rated benchmarks</strong> to show this pattern.
We need both, because pass/fail benchmarks tell you &#8220;can it solve the issue,&#8221; while human-rated arenas tell you something equally important for frontend work: &#8220;do humans actually want to use what it builds.&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gofe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fd3c45e-20ec-4b3f-abc4-3ba98b9941a5_1882x1056.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Gofe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fd3c45e-20ec-4b3f-abc4-3ba98b9941a5_1882x1056.png 424w, https://substackcdn.com/image/fetch/$s_!Gofe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fd3c45e-20ec-4b3f-abc4-3ba98b9941a5_1882x1056.png 848w, https://substackcdn.com/image/fetch/$s_!Gofe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fd3c45e-20ec-4b3f-abc4-3ba98b9941a5_1882x1056.png 1272w, https://substackcdn.com/image/fetch/$s_!Gofe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fd3c45e-20ec-4b3f-abc4-3ba98b9941a5_1882x1056.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Gofe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fd3c45e-20ec-4b3f-abc4-3ba98b9941a5_1882x1056.png" width="1456" height="817" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6fd3c45e-20ec-4b3f-abc4-3ba98b9941a5_1882x1056.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:817,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:841476,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fd3c45e-20ec-4b3f-abc4-3ba98b9941a5_1882x1056.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Gofe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fd3c45e-20ec-4b3f-abc4-3ba98b9941a5_1882x1056.png 424w, https://substackcdn.com/image/fetch/$s_!Gofe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fd3c45e-20ec-4b3f-abc4-3ba98b9941a5_1882x1056.png 848w, https://substackcdn.com/image/fetch/$s_!Gofe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fd3c45e-20ec-4b3f-abc4-3ba98b9941a5_1882x1056.png 1272w, https://substackcdn.com/image/fetch/$s_!Gofe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fd3c45e-20ec-4b3f-abc4-3ba98b9941a5_1882x1056.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3><strong>What the numbers show</strong></h3><p><strong>On objective benchmarks, the complexity cliff is visible:</strong></p><ul><li><p><strong>Next.js eval tasks:</strong> The best models land around 42% success (roughly 21 of 50 tasks completed). Even on framework-specific challenges, failures are common.</p></li><li><p><strong>Web-Bench multi-step full-stack tasks:</strong> Around 25% of tasks solved. Failures multiply as steps chain together.</p></li><li><p><strong>SWE-Bench Pro:</strong> Around 20-43% on the Pro public set, versus over 70% on SWE-Bench Verified.
Increasing complexity collapses performance.</p></li></ul><p>The gap between benchmark performance and your real codebase is the important thing to calibrate to.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!31x7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1842ca46-0566-4117-9ae4-8dbd7e30071c_1888x1050.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!31x7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1842ca46-0566-4117-9ae4-8dbd7e30071c_1888x1050.png 424w, https://substackcdn.com/image/fetch/$s_!31x7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1842ca46-0566-4117-9ae4-8dbd7e30071c_1888x1050.png 848w, https://substackcdn.com/image/fetch/$s_!31x7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1842ca46-0566-4117-9ae4-8dbd7e30071c_1888x1050.png 1272w, https://substackcdn.com/image/fetch/$s_!31x7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1842ca46-0566-4117-9ae4-8dbd7e30071c_1888x1050.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!31x7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1842ca46-0566-4117-9ae4-8dbd7e30071c_1888x1050.png" width="1456" height="810" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1842ca46-0566-4117-9ae4-8dbd7e30071c_1888x1050.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:810,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:358686,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1842ca46-0566-4117-9ae4-8dbd7e30071c_1888x1050.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!31x7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1842ca46-0566-4117-9ae4-8dbd7e30071c_1888x1050.png 424w, https://substackcdn.com/image/fetch/$s_!31x7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1842ca46-0566-4117-9ae4-8dbd7e30071c_1888x1050.png 848w, https://substackcdn.com/image/fetch/$s_!31x7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1842ca46-0566-4117-9ae4-8dbd7e30071c_1888x1050.png 1272w, https://substackcdn.com/image/fetch/$s_!31x7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1842ca46-0566-4117-9ae4-8dbd7e30071c_1888x1050.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uI5j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0098d15d-8b6a-43ea-a42c-87d6e25d8119_1880x1056.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uI5j!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0098d15d-8b6a-43ea-a42c-87d6e25d8119_1880x1056.png 424w,
https://substackcdn.com/image/fetch/$s_!uI5j!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0098d15d-8b6a-43ea-a42c-87d6e25d8119_1880x1056.png 848w, https://substackcdn.com/image/fetch/$s_!uI5j!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0098d15d-8b6a-43ea-a42c-87d6e25d8119_1880x1056.png 1272w, https://substackcdn.com/image/fetch/$s_!uI5j!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0098d15d-8b6a-43ea-a42c-87d6e25d8119_1880x1056.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uI5j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0098d15d-8b6a-43ea-a42c-87d6e25d8119_1880x1056.png" width="1456" height="818" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0098d15d-8b6a-43ea-a42c-87d6e25d8119_1880x1056.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:818,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:405164,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0098d15d-8b6a-43ea-a42c-87d6e25d8119_1880x1056.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!uI5j!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0098d15d-8b6a-43ea-a42c-87d6e25d8119_1880x1056.png 424w, https://substackcdn.com/image/fetch/$s_!uI5j!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0098d15d-8b6a-43ea-a42c-87d6e25d8119_1880x1056.png 848w, https://substackcdn.com/image/fetch/$s_!uI5j!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0098d15d-8b6a-43ea-a42c-87d6e25d8119_1880x1056.png 1272w, https://substackcdn.com/image/fetch/$s_!uI5j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0098d15d-8b6a-43ea-a42c-87d6e25d8119_1880x1056.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>My practical translation of the complexity cliff for React developers:</strong></p><ul><li><p>AI is <strong>great</strong> at first drafts</p></li><li><p>AI is <strong>mediocre</strong> at integration</p></li><li><p>AI is <strong>unreliable</strong> at long multi-step changes unless you give it strong tooling and context</p></li><li><p>AI gets you to &#8220;it works&#8221; faster than it gets you to &#8220;it&#8217;s a codebase I want to own&#8221;</p></li></ul><div><hr></div><h2><strong>Design Arena and Web Dev Arena: Where React developers should pay attention</strong></h2><p>React developers spend a lot of time in the space between &#8220;the code runs&#8221; and &#8220;this is good.&#8221; That space includes UI quality, hierarchy, spacing, accessibility, and whether the end result feels intentional.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dCGE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dfa85d9-93db-47ee-9ef5-cec9d0f1d6a9_2294x1482.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dCGE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dfa85d9-93db-47ee-9ef5-cec9d0f1d6a9_2294x1482.png 424w,
https://substackcdn.com/image/fetch/$s_!dCGE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dfa85d9-93db-47ee-9ef5-cec9d0f1d6a9_2294x1482.png 848w, https://substackcdn.com/image/fetch/$s_!dCGE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dfa85d9-93db-47ee-9ef5-cec9d0f1d6a9_2294x1482.png 1272w, https://substackcdn.com/image/fetch/$s_!dCGE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dfa85d9-93db-47ee-9ef5-cec9d0f1d6a9_2294x1482.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dCGE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dfa85d9-93db-47ee-9ef5-cec9d0f1d6a9_2294x1482.png" width="1456" height="941" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3dfa85d9-93db-47ee-9ef5-cec9d0f1d6a9_2294x1482.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:941,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:337700,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dfa85d9-93db-47ee-9ef5-cec9d0f1d6a9_2294x1482.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!dCGE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dfa85d9-93db-47ee-9ef5-cec9d0f1d6a9_2294x1482.png 424w, https://substackcdn.com/image/fetch/$s_!dCGE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dfa85d9-93db-47ee-9ef5-cec9d0f1d6a9_2294x1482.png 848w, https://substackcdn.com/image/fetch/$s_!dCGE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dfa85d9-93db-47ee-9ef5-cec9d0f1d6a9_2294x1482.png 1272w, https://substackcdn.com/image/fetch/$s_!dCGE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dfa85d9-93db-47ee-9ef5-cec9d0f1d6a9_2294x1482.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Design Arena is interesting because it&#8217;s explicitly human preference&#8211;driven. Here&#8217;s how it works:</p><ul><li><p>Users come to the platform to explore and use AI-powered tools for tasks like website and game generation.</p></li><li><p>Design Arena presents multiple versions of the same experience (a website, agent, or builder), and users save their favorite.</p></li><li><p>Rankings emerge from these aggregated choices across categories like website generation, agents, and builders, reflecting real usage preferences rather than curated rubrics.</p></li><li><p>Elo-style scores are calculated using a Bradley&#8211;Terry model, with models below a minimum comparison threshold filtered out.</p></li><li><p>The leaderboard updates live (every three hours, per their methodology), powered by interactions from over 850,000 users across 145 countries.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qEMX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d8de171-6e05-437b-b8ff-eedd03ea2b15_1882x1058.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qEMX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d8de171-6e05-437b-b8ff-eedd03ea2b15_1882x1058.png 424w,
https://substackcdn.com/image/fetch/$s_!qEMX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d8de171-6e05-437b-b8ff-eedd03ea2b15_1882x1058.png 848w, https://substackcdn.com/image/fetch/$s_!qEMX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d8de171-6e05-437b-b8ff-eedd03ea2b15_1882x1058.png 1272w, https://substackcdn.com/image/fetch/$s_!qEMX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d8de171-6e05-437b-b8ff-eedd03ea2b15_1882x1058.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qEMX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d8de171-6e05-437b-b8ff-eedd03ea2b15_1882x1058.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0d8de171-6e05-437b-b8ff-eedd03ea2b15_1882x1058.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:884780,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d8de171-6e05-437b-b8ff-eedd03ea2b15_1882x1058.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!qEMX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d8de171-6e05-437b-b8ff-eedd03ea2b15_1882x1058.png 424w, https://substackcdn.com/image/fetch/$s_!qEMX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d8de171-6e05-437b-b8ff-eedd03ea2b15_1882x1058.png 848w, https://substackcdn.com/image/fetch/$s_!qEMX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d8de171-6e05-437b-b8ff-eedd03ea2b15_1882x1058.png 1272w, https://substackcdn.com/image/fetch/$s_!qEMX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d8de171-6e05-437b-b8ff-eedd03ea2b15_1882x1058.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>By using a pairwise comparison system, Design Arena generates leaderboards that rank AI models based on human preferences, helping to measure and drive improvements in design quality, usability, and aesthetics.</em></p><p>Similarly, <a href="https://web.lmarena.ai">Web Dev Arena</a> is an open-source benchmarking platform from LMArena designed to evaluate LLMs on their ability to build functional, interactive web applications. Users submit a prompt and compare anonymous AI models generating code side-by-side, contributing to a community-driven Elo leaderboard that ranks top models for complex web development tasks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oS8c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bb5604d-1a65-4da2-8988-5739c3953e23_3012x1504.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oS8c!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bb5604d-1a65-4da2-8988-5739c3953e23_3012x1504.png 424w, https://substackcdn.com/image/fetch/$s_!oS8c!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bb5604d-1a65-4da2-8988-5739c3953e23_3012x1504.png 848w,
https://substackcdn.com/image/fetch/$s_!oS8c!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bb5604d-1a65-4da2-8988-5739c3953e23_3012x1504.png 1272w, https://substackcdn.com/image/fetch/$s_!oS8c!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bb5604d-1a65-4da2-8988-5739c3953e23_3012x1504.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oS8c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bb5604d-1a65-4da2-8988-5739c3953e23_3012x1504.png" width="1456" height="727" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1bb5604d-1a65-4da2-8988-5739c3953e23_3012x1504.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:727,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:480229,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bb5604d-1a65-4da2-8988-5739c3953e23_3012x1504.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oS8c!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bb5604d-1a65-4da2-8988-5739c3953e23_3012x1504.png 424w, 
https://substackcdn.com/image/fetch/$s_!oS8c!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bb5604d-1a65-4da2-8988-5739c3953e23_3012x1504.png 848w, https://substackcdn.com/image/fetch/$s_!oS8c!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bb5604d-1a65-4da2-8988-5739c3953e23_3012x1504.png 1272w, https://substackcdn.com/image/fetch/$s_!oS8c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1bb5604d-1a65-4da2-8988-5739c3953e23_3012x1504.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So what do React developers learn from such arenas?</p><h3><strong>The Core Finding: AI has mastered logic, but not taste</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SbhK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db0f7bd-5c10-44f6-9b76-93bae9c1e1e1_1882x1056.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SbhK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db0f7bd-5c10-44f6-9b76-93bae9c1e1e1_1882x1056.png 424w, https://substackcdn.com/image/fetch/$s_!SbhK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db0f7bd-5c10-44f6-9b76-93bae9c1e1e1_1882x1056.png 848w, https://substackcdn.com/image/fetch/$s_!SbhK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db0f7bd-5c10-44f6-9b76-93bae9c1e1e1_1882x1056.png 1272w, https://substackcdn.com/image/fetch/$s_!SbhK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db0f7bd-5c10-44f6-9b76-93bae9c1e1e1_1882x1056.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SbhK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db0f7bd-5c10-44f6-9b76-93bae9c1e1e1_1882x1056.png" width="1456" height="817" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8db0f7bd-5c10-44f6-9b76-93bae9c1e1e1_1882x1056.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:817,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:611725,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db0f7bd-5c10-44f6-9b76-93bae9c1e1e1_1882x1056.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SbhK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db0f7bd-5c10-44f6-9b76-93bae9c1e1e1_1882x1056.png 424w, https://substackcdn.com/image/fetch/$s_!SbhK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db0f7bd-5c10-44f6-9b76-93bae9c1e1e1_1882x1056.png 848w, https://substackcdn.com/image/fetch/$s_!SbhK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db0f7bd-5c10-44f6-9b76-93bae9c1e1e1_1882x1056.png 1272w, https://substackcdn.com/image/fetch/$s_!SbhK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db0f7bd-5c10-44f6-9b76-93bae9c1e1e1_1882x1056.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This slide is the thesis of the whole talk. 
Models can solve hard reasoning problems and still produce UIs with basic design failures: off-key color choices, inconsistent spacing, and weak visual hierarchy.</p><p>I call this the <strong>capability divide:</strong></p><ul><li><p>AI is <strong>strong</strong> at logic, data flow, and implementing explicit requirements</p></li><li><p>AI is <strong>weak</strong> at taste, usability awareness, and aesthetic judgment</p></li></ul><p><strong>If you&#8217;re a React developer, this should change how you delegate:</strong></p><ul><li><p>Delegate boilerplate and mechanical implementation</p></li><li><p>Keep design intent, API design, and architecture decisions human-led</p></li><li><p>Treat &#8220;pretty&#8221; as an explicit requirement, not a default outcome</p></li></ul><h3><strong>The Surprise: Tools and scaffolding matter more than you think</strong></h3><p>Design Arena also found something counterintuitive: <strong>general agents are more variable than specialists</strong>, and the scaffolding and workflow around the base model drive a lot of the performance spread.</p><p>Put differently: two products can wrap the same base model and feel wildly different because of tooling, retrieval, iteration loops, and guardrails.</p><p>This is great news, because it means <strong>you have leverage even when you don&#8217;t control the base model.</strong></p><div><hr></div><h2><strong>Arena by Arena: What React developers should steal from the data</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4uWz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc722a14-d454-4ea6-b7cb-44a71435f883_1874x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp"
srcset="https://substackcdn.com/image/fetch/$s_!4uWz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc722a14-d454-4ea6-b7cb-44a71435f883_1874x1048.png 424w, https://substackcdn.com/image/fetch/$s_!4uWz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc722a14-d454-4ea6-b7cb-44a71435f883_1874x1048.png 848w, https://substackcdn.com/image/fetch/$s_!4uWz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc722a14-d454-4ea6-b7cb-44a71435f883_1874x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!4uWz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc722a14-d454-4ea6-b7cb-44a71435f883_1874x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4uWz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc722a14-d454-4ea6-b7cb-44a71435f883_1874x1048.png" width="1456" height="814" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bc722a14-d454-4ea6-b7cb-44a71435f883_1874x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:814,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:358564,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc722a14-d454-4ea6-b7cb-44a71435f883_1874x1048.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4uWz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc722a14-d454-4ea6-b7cb-44a71435f883_1874x1048.png 424w, https://substackcdn.com/image/fetch/$s_!4uWz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc722a14-d454-4ea6-b7cb-44a71435f883_1874x1048.png 848w, https://substackcdn.com/image/fetch/$s_!4uWz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc722a14-d454-4ea6-b7cb-44a71435f883_1874x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!4uWz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc722a14-d454-4ea6-b7cb-44a71435f883_1874x1048.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Let me walk through five arenas and extract the practical lessons for each one.</p><h3><strong>1. Website Arena: Prompt to website (and why Purple keeps happening)</strong></h3><p>The Website Arena measures how well models generate complete single-page sites from a prompt, with instructions to add modern UI/UX practices, accessibility, and responsive design.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oRLU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7854a835-eee7-4830-ad6a-a4fa28576d26_1882x1064.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oRLU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7854a835-eee7-4830-ad6a-a4fa28576d26_1882x1064.png 424w, https://substackcdn.com/image/fetch/$s_!oRLU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7854a835-eee7-4830-ad6a-a4fa28576d26_1882x1064.png 848w, https://substackcdn.com/image/fetch/$s_!oRLU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7854a835-eee7-4830-ad6a-a4fa28576d26_1882x1064.png 1272w, 
https://substackcdn.com/image/fetch/$s_!oRLU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7854a835-eee7-4830-ad6a-a4fa28576d26_1882x1064.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oRLU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7854a835-eee7-4830-ad6a-a4fa28576d26_1882x1064.png" width="1456" height="823" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7854a835-eee7-4830-ad6a-a4fa28576d26_1882x1064.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:823,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:439242,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7854a835-eee7-4830-ad6a-a4fa28576d26_1882x1064.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oRLU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7854a835-eee7-4830-ad6a-a4fa28576d26_1882x1064.png 424w, https://substackcdn.com/image/fetch/$s_!oRLU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7854a835-eee7-4830-ad6a-a4fa28576d26_1882x1064.png 848w, 
https://substackcdn.com/image/fetch/$s_!oRLU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7854a835-eee7-4830-ad6a-a4fa28576d26_1882x1064.png 1272w, https://substackcdn.com/image/fetch/$s_!oRLU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7854a835-eee7-4830-ad6a-a4fa28576d26_1882x1064.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The important nuance is how winners tend to win: it&#8217;s not always the flashiest layout, it&#8217;s often <strong>the most
coherent and complete page.</strong></p><p>If your goal is something shippable, bias your prompts toward coherence and structure, not &#8220;make it look cool&#8221;.</p><h4><strong>Why is there so much purple?</strong></h4><p>I joked about this in the talk because once you see it, you can&#8217;t unsee it: models converge on safe, generic design patterns, and &#8220;purple gradient plus glassmorphism&#8221; is one of those defaults.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7cgp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7036a6ef-4aa0-4f8e-a63d-9cd7e33f0cd2_1874x1046.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7cgp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7036a6ef-4aa0-4f8e-a63d-9cd7e33f0cd2_1874x1046.png 424w, https://substackcdn.com/image/fetch/$s_!7cgp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7036a6ef-4aa0-4f8e-a63d-9cd7e33f0cd2_1874x1046.png 848w, https://substackcdn.com/image/fetch/$s_!7cgp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7036a6ef-4aa0-4f8e-a63d-9cd7e33f0cd2_1874x1046.png 1272w, https://substackcdn.com/image/fetch/$s_!7cgp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7036a6ef-4aa0-4f8e-a63d-9cd7e33f0cd2_1874x1046.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!7cgp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7036a6ef-4aa0-4f8e-a63d-9cd7e33f0cd2_1874x1046.png" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7036a6ef-4aa0-4f8e-a63d-9cd7e33f0cd2_1874x1046.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:340995,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7036a6ef-4aa0-4f8e-a63d-9cd7e33f0cd2_1874x1046.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7cgp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7036a6ef-4aa0-4f8e-a63d-9cd7e33f0cd2_1874x1046.png 424w, https://substackcdn.com/image/fetch/$s_!7cgp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7036a6ef-4aa0-4f8e-a63d-9cd7e33f0cd2_1874x1046.png 848w, https://substackcdn.com/image/fetch/$s_!7cgp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7036a6ef-4aa0-4f8e-a63d-9cd7e33f0cd2_1874x1046.png 1272w, https://substackcdn.com/image/fetch/$s_!7cgp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7036a6ef-4aa0-4f8e-a63d-9cd7e33f0cd2_1874x1046.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>That&#8217;s not just a meme.
It&#8217;s <strong>distributional convergence</strong>: under uncertainty, models gravitate toward common patterns in the data.</p><h4><strong>How do you fix it?</strong></h4><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vuc-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b8dcfd-e8a0-4293-afef-771b1a00755c_1886x1060.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vuc-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b8dcfd-e8a0-4293-afef-771b1a00755c_1886x1060.png 424w, https://substackcdn.com/image/fetch/$s_!vuc-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b8dcfd-e8a0-4293-afef-771b1a00755c_1886x1060.png 848w, https://substackcdn.com/image/fetch/$s_!vuc-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b8dcfd-e8a0-4293-afef-771b1a00755c_1886x1060.png 1272w, https://substackcdn.com/image/fetch/$s_!vuc-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b8dcfd-e8a0-4293-afef-771b1a00755c_1886x1060.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vuc-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b8dcfd-e8a0-4293-afef-771b1a00755c_1886x1060.png" width="1456" height="818" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/69b8dcfd-e8a0-4293-afef-771b1a00755c_1886x1060.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:818,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:846531,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b8dcfd-e8a0-4293-afef-771b1a00755c_1886x1060.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!vuc-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b8dcfd-e8a0-4293-afef-771b1a00755c_1886x1060.png 424w, https://substackcdn.com/image/fetch/$s_!vuc-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b8dcfd-e8a0-4293-afef-771b1a00755c_1886x1060.png 848w, https://substackcdn.com/image/fetch/$s_!vuc-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b8dcfd-e8a0-4293-afef-771b1a00755c_1886x1060.png 1272w, https://substackcdn.com/image/fetch/$s_!vuc-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69b8dcfd-e8a0-4293-afef-771b1a00755c_1886x1060.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>One approach is tooling rather than &#8220;better prompts forever.&#8221; Anthropic pushed some of this into Skills (markdown files that Claude reads on demand) instead of trying to brute force it through training. 
Their <a href="https://github.com/anthropics/claude-code/blob/main/plugins/frontend-design/skills/frontend-design/SKILL.md">frontend-design skill</a> is worth checking out.</p><p>Even if you never use Claude Skills specifically, the lesson is broader:</p><ul><li><p>Some failures are better solved by scaffolding and constraints than by model selection</p></li><li><p>You want repeatable, shareable &#8220;taste primitives&#8221; that don&#8217;t require rewriting your entire prompt every time</p></li></ul><h4><strong>My website generation checklist for React teams</strong></h4><p><strong>What I ask for up front:</strong></p><ul><li><p><strong>Anchor the layout first:</strong> Specify the page sections you want before code</p></li><li><p><strong>Specify stack and routing:</strong> Call out Next App Router, file names, and RSC vs client components so it doesn&#8217;t invent structure</p></li><li><p><strong>Describe content density:</strong> Minimal landing page vs long-form so spacing doesn&#8217;t default to sludge</p></li><li><p><strong>Ask for responsive constraints:</strong> Breakpoints and collapse behavior</p></li><li><p><strong>Bake in accessibility:</strong> Semantic landmarks, skip links, labels, safe contrast</p></li><li><p><strong>Convert HTML to real React files:</strong> Map sections to components and wire them up in page.tsx</p></li></ul><p><strong>What I do after generation, before trusting it:</strong></p><ul><li><p><strong>Strip inline scripts</strong> and move DOM logic into client components with hooks and typed props</p></li><li><p><strong>Normalize layout primitives</strong> and refactor div soup into your real Shell, Container, Stack components</p></li><li><p><strong>Run a11y and perf checks:</strong> Lint, Lighthouse, and add tests for critical flows</p></li><li><p><strong>Freeze the visual system:</strong> Snap palette, spacing, typography into Tailwind config or tokens</p></li><li><p><strong>Keep the model on a leash:</strong> Use it for slices 
and variants, not wholesale rewrites of a tuned page</p></li></ul><p><strong>Single sentence summary:</strong> Be radically explicit in your instructions, and enforce your design system and coding standards so the model can&#8217;t drift.</p><p><strong>Poor prompt:</strong></p><pre><code><code>Make a landing page for a SaaS product</code></code></pre><p><strong>Strong prompt:</strong></p><pre><code><code>Create a Next.js App Router landing page (app/page.tsx) for a developer tools SaaS:

Layout sections:
1. Hero with headline, subheadline, CTA
2. Features (3 columns, icon + title + description each)
3. Social proof (logos grid)
4. CTA

Stack: Next.js 15, TypeScript, Tailwind
Density: Spacious landing page (not cramped)
Colors: Avoid purple/pink gradients - use neutral gray with blue accent
Responsive: Stack features vertically below 768px

Accessibility:
- Semantic HTML (header, main, section)
- Alt text for all images
- Sufficient color contrast (WCAG AA)</code></code></pre><div><hr></div><h3><strong>2. Agent Arena: Most failures are context failures now</strong></h3><p>The Agent Arena is a step up: multi-step tasks like writing code, fixing bugs, running tests, running browsers, debugging. This is where &#8220;agent loops&#8221; show up.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-pUd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87110bdc-9ffa-4bba-aa17-4dd41f10d7ee_1884x1054.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-pUd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87110bdc-9ffa-4bba-aa17-4dd41f10d7ee_1884x1054.png 424w, https://substackcdn.com/image/fetch/$s_!-pUd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87110bdc-9ffa-4bba-aa17-4dd41f10d7ee_1884x1054.png 848w, https://substackcdn.com/image/fetch/$s_!-pUd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87110bdc-9ffa-4bba-aa17-4dd41f10d7ee_1884x1054.png 1272w, https://substackcdn.com/image/fetch/$s_!-pUd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87110bdc-9ffa-4bba-aa17-4dd41f10d7ee_1884x1054.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-pUd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87110bdc-9ffa-4bba-aa17-4dd41f10d7ee_1884x1054.png" width="1456" height="815" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/87110bdc-9ffa-4bba-aa17-4dd41f10d7ee_1884x1054.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:815,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:374430,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87110bdc-9ffa-4bba-aa17-4dd41f10d7ee_1884x1054.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-pUd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87110bdc-9ffa-4bba-aa17-4dd41f10d7ee_1884x1054.png 424w, https://substackcdn.com/image/fetch/$s_!-pUd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87110bdc-9ffa-4bba-aa17-4dd41f10d7ee_1884x1054.png 848w, https://substackcdn.com/image/fetch/$s_!-pUd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87110bdc-9ffa-4bba-aa17-4dd41f10d7ee_1884x1054.png 1272w, https://substackcdn.com/image/fetch/$s_!-pUd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87110bdc-9ffa-4bba-aa17-4dd41f10d7ee_1884x1054.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here&#8217;s the biggest trap: when agents fail, it often looks like &#8220;the model is dumb.&#8221; Increasingly, that&#8217;s not true.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!a_vj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f3b6d65-19dc-4171-a84d-e27daa2bfcfd_1872x1050.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!a_vj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f3b6d65-19dc-4171-a84d-e27daa2bfcfd_1872x1050.png 424w, 
https://substackcdn.com/image/fetch/$s_!a_vj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f3b6d65-19dc-4171-a84d-e27daa2bfcfd_1872x1050.png 848w, https://substackcdn.com/image/fetch/$s_!a_vj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f3b6d65-19dc-4171-a84d-e27daa2bfcfd_1872x1050.png 1272w, https://substackcdn.com/image/fetch/$s_!a_vj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f3b6d65-19dc-4171-a84d-e27daa2bfcfd_1872x1050.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!a_vj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f3b6d65-19dc-4171-a84d-e27daa2bfcfd_1872x1050.png" width="1456" height="817" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4f3b6d65-19dc-4171-a84d-e27daa2bfcfd_1872x1050.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:817,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:372295,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f3b6d65-19dc-4171-a84d-e27daa2bfcfd_1872x1050.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!a_vj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f3b6d65-19dc-4171-a84d-e27daa2bfcfd_1872x1050.png 424w, https://substackcdn.com/image/fetch/$s_!a_vj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f3b6d65-19dc-4171-a84d-e27daa2bfcfd_1872x1050.png 848w, https://substackcdn.com/image/fetch/$s_!a_vj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f3b6d65-19dc-4171-a84d-e27daa2bfcfd_1872x1050.png 1272w, https://substackcdn.com/image/fetch/$s_!a_vj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f3b6d65-19dc-4171-a84d-e27daa2bfcfd_1872x1050.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Most agent failures are context failures.</strong> If the agent doesn&#8217;t see the right logs, tests, or constraints, it makes confident but wrong changes. Fixing context is often higher leverage than switching models.</p><p>I also called out something that will resonate if you&#8217;ve ever spent time tweaking prompts: <strong>prompt engineering failures often come from context mismanagement</strong>, not &#8220;the wrong magic words.&#8221;</p><h4><strong>How I run agents like a React team lead</strong></h4><p><strong>Treat agents like junior hires:</strong></p><ul><li><p>Give a written task brief, acceptance criteria, and constraints</p></li><li><p>Declare the sandbox: disposable branch, test DB, temporary env vars</p></li><li><p>Ask for a plan first: files it will touch, tools it will call, risks it sees</p></li><li><p>Cap blast radius: constrain write access to app, src, config</p></li><li><p>Require tests as part of fixes: reproduce bug first, then patch</p></li><li><p>Force small PRs: reviewable commits, not a mega diff</p></li></ul><p><strong>Then add operational guardrails:</strong></p><ul><li><p>Point them at logs and monitors: build logs, Sentry traces, Playwright failures</p></li><li><p>Snap to house style: ESLint config, Prettier rules, naming conventions</p></li><li><p>Disable auto-merge and require human approval for agent changes</p></li></ul><p>That&#8217;s the difference between &#8220;agentic coding&#8221; and &#8220;outsourcing your codebase to a stochastic parrot.&#8221;</p><p><strong>Poor prompt:</strong></p><pre><code><code>Fix the bug in the checkout
flow</code></code></pre><p><strong>Strong prompt:</strong></p><pre><code><code>Task: Fix abandoned cart bug in checkout

Context:
- File: app/checkout/page.tsx
- Error: Cart resets on page refresh
- Expected: Cart persists via localStorage
- Test: Run `npm test checkout.test.tsx` to verify

Plan required before implementation:
1. Identify where cart state is managed
2. Add localStorage persistence
3. Add hydration logic
4. Update tests
5. Verify in Playwright

Constraints:
- Only modify app/checkout/* and lib/cart.ts
- Maintain existing TypeScript types
- Follow our ESLint rules</code></code></pre><div><hr></div><h3><strong>3. Context Engineering: The highest leverage skill for agentic React</strong></h3><p>If context is the bottleneck, then <strong>context engineering</strong> is the discipline.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GJFk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3af5800-ea23-4d90-8117-3d88e3f6142c_1886x1054.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GJFk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3af5800-ea23-4d90-8117-3d88e3f6142c_1886x1054.png 424w, https://substackcdn.com/image/fetch/$s_!GJFk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3af5800-ea23-4d90-8117-3d88e3f6142c_1886x1054.png 848w, https://substackcdn.com/image/fetch/$s_!GJFk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3af5800-ea23-4d90-8117-3d88e3f6142c_1886x1054.png 1272w, https://substackcdn.com/image/fetch/$s_!GJFk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3af5800-ea23-4d90-8117-3d88e3f6142c_1886x1054.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GJFk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3af5800-ea23-4d90-8117-3d88e3f6142c_1886x1054.png" width="1456" height="814" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c3af5800-ea23-4d90-8117-3d88e3f6142c_1886x1054.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:814,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:450343,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3af5800-ea23-4d90-8117-3d88e3f6142c_1886x1054.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GJFk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3af5800-ea23-4d90-8117-3d88e3f6142c_1886x1054.png 424w, https://substackcdn.com/image/fetch/$s_!GJFk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3af5800-ea23-4d90-8117-3d88e3f6142c_1886x1054.png 848w, https://substackcdn.com/image/fetch/$s_!GJFk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3af5800-ea23-4d90-8117-3d88e3f6142c_1886x1054.png 1272w, https://substackcdn.com/image/fetch/$s_!GJFk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3af5800-ea23-4d90-8117-3d88e3f6142c_1886x1054.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the talk I described context engineering as the art and science of filling the context window with just the right information to guide the agent&#8217;s performance. 
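One way to make &#8220;just the right information&#8221; concrete: score candidate context snippets, then pack only the highest-signal ones into a fixed token budget instead of pasting everything into the prompt. The sketch below is my own illustration, not any framework&#8217;s API; the four-characters-per-token estimate and the hand-assigned signal scores are assumptions.

```typescript
// Minimal sketch: pick the smallest high-signal set of context snippets
// that fits a token budget, instead of dumping everything into the prompt.
type Snippet = { label: string; text: string; signal: number };

// Crude heuristic (assumption): roughly 4 characters per token for English prose.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function packContext(snippets: Snippet[], budgetTokens: number): string[] {
  const packed: string[] = [];
  let used = 0;
  // Highest-signal snippets first; skip anything that would blow the budget.
  for (const s of [...snippets].sort((a, b) => b.signal - a.signal)) {
    const cost = estimateTokens(s.text);
    if (used + cost > budgetTokens) continue;
    used += cost;
    packed.push(`## ${s.label}\n${s.text}`);
  }
  return packed;
}

// Hypothetical inputs: the failing test and the constraint fit the budget;
// the low-signal README dump does not and gets dropped.
const context = packContext(
  [
    { label: "Failing test output", text: "checkout.test.tsx: expected cart to persist after reload", signal: 0.9 },
    { label: "Constraint", text: "Only modify app/checkout/* and lib/cart.ts", signal: 0.8 },
    { label: "Unrelated README dump", text: "x".repeat(4000), signal: 0.1 },
  ],
  200,
);
```

The same shape works for logs, failing tests, and architectural constraints: score them, pack the winners, drop the rest.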
It&#8217;s more than clever prompting.</p><p><strong>Two specific tips I want most React developers to internalize:</strong></p><ol><li><p><strong>Visual context is powerful.</strong> Screenshots can enable one-shot solutions for UI bugs or design tasks.</p></li><li><p><strong>Structure beats volume.</strong> Unstructured dumps confuse the model, competing information distracts it, and overload overwhelms it.</p></li></ol><p>Under the hood, this ties back to a core principle: <strong>&#8220;Find the smallest possible set of high-signal tokens that maximize the likelihood of your desired outcome.&#8221;</strong></p><p>Every token you waste is context you cannot spend on:</p><ul><li><p>The actual API surface you need</p></li><li><p>The architectural constraints you care about</p></li><li><p>The failing test output that would prevent a bad patch</p></li></ul><div><hr></div><h3><strong>4. Tooling: If you can&#8217;t control the base model, control the layer around it</strong></h3><p>As I said in the &#8220;mastering the tools&#8221; section: you probably don&#8217;t control the base model, but you can absolutely steer the tooling around it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jWTi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed042fe-f7de-4dee-b591-b75ef2509300_1872x1038.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jWTi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed042fe-f7de-4dee-b591-b75ef2509300_1872x1038.png 424w,
https://substackcdn.com/image/fetch/$s_!jWTi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed042fe-f7de-4dee-b591-b75ef2509300_1872x1038.png 848w, https://substackcdn.com/image/fetch/$s_!jWTi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed042fe-f7de-4dee-b591-b75ef2509300_1872x1038.png 1272w, https://substackcdn.com/image/fetch/$s_!jWTi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed042fe-f7de-4dee-b591-b75ef2509300_1872x1038.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jWTi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed042fe-f7de-4dee-b591-b75ef2509300_1872x1038.png" width="1456" height="807" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4ed042fe-f7de-4dee-b591-b75ef2509300_1872x1038.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:807,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:365503,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed042fe-f7de-4dee-b591-b75ef2509300_1872x1038.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!jWTi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed042fe-f7de-4dee-b591-b75ef2509300_1872x1038.png 424w, https://substackcdn.com/image/fetch/$s_!jWTi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed042fe-f7de-4dee-b591-b75ef2509300_1872x1038.png 848w, https://substackcdn.com/image/fetch/$s_!jWTi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed042fe-f7de-4dee-b591-b75ef2509300_1872x1038.png 1272w, https://substackcdn.com/image/fetch/$s_!jWTi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed042fe-f7de-4dee-b591-b75ef2509300_1872x1038.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A concrete example is doc and state retrieval. Let me show you a few tools that demonstrate this pattern:</p><h4><strong>Context7 MCP</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FkZp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8dc6d9b-6818-4e43-946c-3ecdf64773ad_1886x1054.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FkZp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8dc6d9b-6818-4e43-946c-3ecdf64773ad_1886x1054.png 424w, https://substackcdn.com/image/fetch/$s_!FkZp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8dc6d9b-6818-4e43-946c-3ecdf64773ad_1886x1054.png 848w, https://substackcdn.com/image/fetch/$s_!FkZp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8dc6d9b-6818-4e43-946c-3ecdf64773ad_1886x1054.png 1272w, https://substackcdn.com/image/fetch/$s_!FkZp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8dc6d9b-6818-4e43-946c-3ecdf64773ad_1886x1054.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!FkZp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8dc6d9b-6818-4e43-946c-3ecdf64773ad_1886x1054.png" width="1456" height="814" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e8dc6d9b-6818-4e43-946c-3ecdf64773ad_1886x1054.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:814,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:304211,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8dc6d9b-6818-4e43-946c-3ecdf64773ad_1886x1054.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FkZp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8dc6d9b-6818-4e43-946c-3ecdf64773ad_1886x1054.png 424w, https://substackcdn.com/image/fetch/$s_!FkZp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8dc6d9b-6818-4e43-946c-3ecdf64773ad_1886x1054.png 848w, https://substackcdn.com/image/fetch/$s_!FkZp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8dc6d9b-6818-4e43-946c-3ecdf64773ad_1886x1054.png 1272w, https://substackcdn.com/image/fetch/$s_!FkZp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8dc6d9b-6818-4e43-946c-3ecdf64773ad_1886x1054.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong><a href="https://context7.com/">Context7</a></strong> pulls fresh, version-specific docs and examples from source sites and injects them into the model&#8217;s working set, reducing guessing and stale snippets.
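The underlying retrieval shape is easy to sketch. What follows is an illustrative mock, not Context7&#8217;s actual API: a version-pinned doc index filtered by topic, with a cap on how many snippets get injected. The index entries and function names here are invented for the example.

```typescript
// Illustrative mock of the retrieval pattern, NOT Context7's actual API:
// look up version-pinned doc entries, filter by topic, and cap how many
// snippets are injected so docs don't crowd out the rest of the context.
type DocEntry = { library: string; version: string; topic: string; snippet: string };

// Hypothetical local index standing in for a real docs source.
const docsIndex: DocEntry[] = [
  { library: "next", version: "15", topic: "routing", snippet: "app/page.tsx defines the root route in the App Router." },
  { library: "next", version: "15", topic: "routing", snippet: "Dynamic segments use [slug] directory names." },
  { library: "next", version: "15", topic: "hooks", snippet: "useRouter is imported from next/navigation, not next/router." },
];

function retrieveDocs(library: string, version: string, topic: string, maxSnippets: number): string[] {
  return docsIndex
    .filter((d) => d.library === library && d.version === version && d.topic === topic)
    .slice(0, maxSnippets) // the "cap how much to bring in" knob
    .map((d) => `[${d.library}@${d.version}] ${d.snippet}`);
}

const routingDocs = retrieveDocs("next", "15", "routing", 1);
```

Swapping the in-memory index for a live docs fetch is the tool's job; the topic filter and the cap are what keep the injected context high-signal.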
You can nudge it toward topics like routing or hooks and cap how much to bring in.</p><h4><strong>Next.js DevTools MCP</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hEVK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d50070d-ffb6-4bf8-838b-dd65d93c4f16_1882x1058.png"><div class="image2-inset"><picture><img src="https://substackcdn.com/image/fetch/$s_!hEVK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d50070d-ffb6-4bf8-838b-dd65d93c4f16_1882x1058.png" width="1456" height="819" class="sizing-normal" alt="" loading="lazy"></picture></div></a></figure></div><p>The modern Next dev server exposes a built-in MCP endpoint.
The <strong><a href="https://nextjs.org/docs/app/guides/mcp">Next.js DevTools MCP server</a></strong> connects to it so an agent can ask for real data about your running app:</p><ul><li><p>Current build or runtime errors</p></li><li><p>Routes and layouts</p></li><li><p>Component metadata</p></li><li><p>Server actions and dev logs</p></li><li><p>Playwright paths for simple browser checks</p></li></ul><p>It also ships with a Next-specific knowledge base and helpers for common tasks like upgrades.</p><h4><strong>Chrome DevTools MCP</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hkSK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f4afb67-0ecf-4177-9330-664d2be8a881_1882x1050.png"><div class="image2-inset"><picture><img src="https://substackcdn.com/image/fetch/$s_!hkSK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f4afb67-0ecf-4177-9330-664d2be8a881_1882x1050.png" width="1456" height="812" class="sizing-normal" alt="" loading="lazy"></picture></div></a></figure></div><p><strong><a href="https://github.com/ChromeDevTools/chrome-devtools-mcp">Chrome DevTools MCP</a></strong> gives the agent eyes and hands in a real browser. It can open pages, click through flows, read console and network logs, take screenshots, and record performance traces to investigate things like high LCP or blocking time. Under the hood it rides on Chrome DevTools and Puppeteer, so you get reliable automation instead of brittle scripts.
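</p><p>Wiring it into an agent is typically one entry in your MCP client&#8217;s config file. Here is a minimal sketch; the package name comes from the project&#8217;s README, while the <code>--isolated</code> flag (which keeps sessions in a throwaway browser profile) is an assumption you should verify against the current docs:</p>

```json
{
  "mcpServers": {
    "chrome-devtools": {
      "command": "npx",
      "args": ["chrome-devtools-mcp@latest", "--isolated"]
    }
  }
}
```

<p>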
Because it can see page content, you still want sensible flags and isolation from personal browsing, but treated as scoped tooling it is very powerful.</p><h4><strong>How MCPs fit together</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!r5oN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbf2b9db-998e-41c8-9803-c7445bdade06_1860x1052.png"><div class="image2-inset"><picture><img src="https://substackcdn.com/image/fetch/$s_!r5oN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbf2b9db-998e-41c8-9803-c7445bdade06_1860x1052.png" width="1456" height="824" class="sizing-normal" alt="" loading="lazy"></picture></div></a></figure></div><p>Context7 gives your assistant the right external knowledge. Next DevTools MCP gives it your app&#8217;s truth. Chrome DevTools MCP proves the result in a real browser. Used together, you turn a guessing assistant into a closed&#8209;loop coder and debugger that cites sources, places changes correctly, and verifies outcomes before you hit commit.</p><p>This is the pattern I expect more React teams to adopt: rather than hoping the model remembers today&#8217;s Next.js behavior, wire it to an always-correct source of truth.</p><div><hr></div><h3><strong>5.
Builder Arena: Vibe Coding tools used responsibly</strong></h3><p>Builder tools are designed for rapid prompt-driven product creation, not just &#8220;write me a component.&#8221; They optimize for cohesion and perceived completeness.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BEzj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9ec6cea-d6ff-411a-9102-de485745452e_1882x1046.png"><div class="image2-inset"><picture><img src="https://substackcdn.com/image/fetch/$s_!BEzj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9ec6cea-d6ff-411a-9102-de485745452e_1882x1046.png" width="1456" height="809" class="sizing-normal" alt="" loading="lazy"></picture></div></a></figure></div><p>Design Arena&#8217;s builder results were surprising precisely because builders are not just base models.
They&#8217;re base models plus scaffolding plus UX and post-processing.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!khIi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d42ef91-5ccb-4fbe-9f54-0b1c963cdb42_1878x1050.png"><div class="image2-inset"><picture><img src="https://substackcdn.com/image/fetch/$s_!khIi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3d42ef91-5ccb-4fbe-9f54-0b1c963cdb42_1878x1050.png" width="1456" height="814" class="sizing-normal" alt="" loading="lazy"></picture></div></a></figure></div><p><strong>My guidance for React developers:</strong></p><ul><li><p><strong>Use builders as idea generators.</strong> Harvest layout, copy, micro-interactions, then rebuild cleanly in your codebase</p></li><li><p><strong>Normalize APIs.</strong> Refactor generated fetch calls, hooks, stores to your patterns</p></li><li><p><strong>Consolidate CSS.</strong> Pull scattered styles into tokens and your component library to avoid spawning a second design system</p></li><li><p><strong>Archive failure cases.</strong> Save screenshots and diffs to refine prompts and tool settings over time</p></li></ul><p><strong>And if you want the &#8220;before you even start&#8221; checklist:</strong></p><ul><li><p>Start with a written product spec: features, user types, flows</p></li><li><p>Lock your design system: your existing shadcn, Radix, or in-house
primitives</p></li><li><p>Describe the vibe in concrete terms: reference sites, adjectives, motion levels</p></li><li><p>Limit surface area: use builders for a single flow rather than your entire shell</p></li></ul><p>If you treat builder output as production code by default, you&#8217;ll end up maintaining a foreign codebase you never chose.</p><div><hr></div><h3><strong>6. UI Components Arena: where React developers win</strong></h3><p>The UI Components Arena is the most directly applicable to React teams: generating isolated, reusable components. Scope is focused, success rates are high, and output can be close to production-ready.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kUQw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcda54739-ee8a-4dbc-b532-73d77639d270_1890x1060.png"><div class="image2-inset"><picture><img src="https://substackcdn.com/image/fetch/$s_!kUQw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcda54739-ee8a-4dbc-b532-73d77639d270_1890x1060.png" width="1456" height="817" class="sizing-normal" alt="" loading="lazy"></picture></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nNLz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd207b9b-7778-4047-befc-ee4a91ed9617_1888x1052.png"><div class="image2-inset"><picture><img src="https://substackcdn.com/image/fetch/$s_!nNLz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd207b9b-7778-4047-befc-ee4a91ed9617_1888x1052.png" width="1456" height="811" class="sizing-normal" alt="" loading="lazy"></picture></div></a></figure></div><p>It&#8217;s also where the &#8220;logic not taste&#8221; lesson shows up cleanly: models can wire up props and state and still make ugly, inconsistent decisions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6zux!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32f037f5-be58-4556-86c3-a0d2c1dbe3ac_1888x1050.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6zux!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32f037f5-be58-4556-86c3-a0d2c1dbe3ac_1888x1050.png 424w,
https://substackcdn.com/image/fetch/$s_!6zux!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32f037f5-be58-4556-86c3-a0d2c1dbe3ac_1888x1050.png 848w, https://substackcdn.com/image/fetch/$s_!6zux!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32f037f5-be58-4556-86c3-a0d2c1dbe3ac_1888x1050.png 1272w, https://substackcdn.com/image/fetch/$s_!6zux!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32f037f5-be58-4556-86c3-a0d2c1dbe3ac_1888x1050.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6zux!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32f037f5-be58-4556-86c3-a0d2c1dbe3ac_1888x1050.png" width="1456" height="810" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/32f037f5-be58-4556-86c3-a0d2c1dbe3ac_1888x1050.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:810,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:557841,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32f037f5-be58-4556-86c3-a0d2c1dbe3ac_1888x1050.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!6zux!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32f037f5-be58-4556-86c3-a0d2c1dbe3ac_1888x1050.png 424w, https://substackcdn.com/image/fetch/$s_!6zux!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32f037f5-be58-4556-86c3-a0d2c1dbe3ac_1888x1050.png 848w, https://substackcdn.com/image/fetch/$s_!6zux!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32f037f5-be58-4556-86c3-a0d2c1dbe3ac_1888x1050.png 1272w, https://substackcdn.com/image/fetch/$s_!6zux!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32f037f5-be58-4556-86c3-a0d2c1dbe3ac_1888x1050.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So I use AI here heavily, but with a specific protocol.</p><h4><strong>Start by forcing a Component contract</strong></h4><p>These are the things I want in the prompt before the model writes JSX:</p><ul><li><p>Define prop names, types, variants, and states</p></li><li><p>Provide examples and edge cases, including weird inputs</p></li><li><p>Demand accessibility by default: keyboard nav, ARIA, focus management, error messaging</p></li><li><p>Avoid anonymous div wrappers; use semantic structure where it matters</p></li><li><p>Separate styling concerns: Tailwind classes or your utility system, not inline styles</p></li><li><p>Request story files: Storybook or MDX usage examples</p></li></ul><h4><strong>Then do the React integration work AI is usually bad at</strong></h4><p>Once you have a plausible component, integrate it like a senior engineer:</p><ul><li><p><strong>Convert state to idiomatic React:</strong> Replace query selectors and global variables with hooks and props</p></li><li><p><strong>Make behavior composable:</strong> Refactor complex pieces into hooks you own</p></li><li><p><strong>Test the contract, not the implementation:</strong> Focus tests on props and events so internals can evolve</p></li><li><p><strong>Snap components into your design system</strong> before exposing them broadly</p></li></ul><p>This is the pattern I recommend to teams: let AI get you 70% of the way on structure, then deliberately take ownership of API shape, composition, and design tokens.</p><p><strong>Poor prompt:</strong></p><pre><code><code>Create a sign-up button component with different variants</code></code></pre><p><strong>Strong 
prompt:</strong></p><pre><code><code>Create a sign-up Button component with:

Props:
- variant: 'primary' | 'secondary' | 'ghost'
- size: 'sm' | 'md' | 'lg'
- disabled: boolean
- loading: boolean

Requirements:
- Use Tailwind classes
- Show loading spinner when loading=true
- Disable pointer events when disabled
- Support keyboard navigation (Enter/Space)
- Include focus-visible ring
- ARIA: use aria-disabled, aria-busy

Example usage:
&lt;Button variant="primary" size="md" loading={isSubmitting}&gt;
  Submit
&lt;/Button&gt;</code></code></pre><p>Put the two prompts side by side and the difference is immediate: the poor one generates inconsistent spacing, misses accessibility, and falls back to inline styles, while the strong one hits every requirement.</p><div><hr></div><h3><strong>7. 3D and Data Viz: Let AI generate assets and data, not your entire integration</strong></h3><p>The 3D and Data Viz arenas stress more structured generation tasks, relevant for interactive dashboards, WebGL, and data-heavy apps.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XKdS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ae074fe-d108-4a11-a586-158cec594707_1890x1052.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XKdS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ae074fe-d108-4a11-a586-158cec594707_1890x1052.png 424w, https://substackcdn.com/image/fetch/$s_!XKdS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ae074fe-d108-4a11-a586-158cec594707_1890x1052.png 848w, https://substackcdn.com/image/fetch/$s_!XKdS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ae074fe-d108-4a11-a586-158cec594707_1890x1052.png 1272w, https://substackcdn.com/image/fetch/$s_!XKdS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ae074fe-d108-4a11-a586-158cec594707_1890x1052.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!XKdS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ae074fe-d108-4a11-a586-158cec594707_1890x1052.png" width="1456" height="810" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3ae074fe-d108-4a11-a586-158cec594707_1890x1052.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:810,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:438825,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ae074fe-d108-4a11-a586-158cec594707_1890x1052.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XKdS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ae074fe-d108-4a11-a586-158cec594707_1890x1052.png 424w, https://substackcdn.com/image/fetch/$s_!XKdS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ae074fe-d108-4a11-a586-158cec594707_1890x1052.png 848w, https://substackcdn.com/image/fetch/$s_!XKdS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ae074fe-d108-4a11-a586-158cec594707_1890x1052.png 1272w, https://substackcdn.com/image/fetch/$s_!XKdS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ae074fe-d108-4a11-a586-158cec594707_1890x1052.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2dCT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82714f0f-f3ac-4226-a70e-cbe700f5593c_1884x1056.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!2dCT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82714f0f-f3ac-4226-a70e-cbe700f5593c_1884x1056.png 424w, https://substackcdn.com/image/fetch/$s_!2dCT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82714f0f-f3ac-4226-a70e-cbe700f5593c_1884x1056.png 848w, https://substackcdn.com/image/fetch/$s_!2dCT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82714f0f-f3ac-4226-a70e-cbe700f5593c_1884x1056.png 1272w, https://substackcdn.com/image/fetch/$s_!2dCT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82714f0f-f3ac-4226-a70e-cbe700f5593c_1884x1056.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2dCT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82714f0f-f3ac-4226-a70e-cbe700f5593c_1884x1056.png" width="1456" height="816" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/82714f0f-f3ac-4226-a70e-cbe700f5593c_1884x1056.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:471845,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82714f0f-f3ac-4226-a70e-cbe700f5593c_1884x1056.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2dCT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82714f0f-f3ac-4226-a70e-cbe700f5593c_1884x1056.png 424w, https://substackcdn.com/image/fetch/$s_!2dCT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82714f0f-f3ac-4226-a70e-cbe700f5593c_1884x1056.png 848w, https://substackcdn.com/image/fetch/$s_!2dCT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82714f0f-f3ac-4226-a70e-cbe700f5593c_1884x1056.png 1272w, https://substackcdn.com/image/fetch/$s_!2dCT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82714f0f-f3ac-4226-a70e-cbe700f5593c_1884x1056.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The lesson from these arenas is not &#8220;AI writes your Three.js app.&#8221; It&#8217;s:</p><ul><li><p><strong>Decide what AI should generate:</strong> Ask for geometries, datasets, configuration, not full integration code</p></li><li><p><strong>Specify the target library:</strong> React Three Fiber, Drei, Recharts, Victory, Visx</p></li><li><p><strong>Request low poly first</strong> and iterate toward fidelity once performance is proven</p></li><li><p><strong>Keep performance under control:</strong> Lazy load heavy assets, guard frame rate, keep fallbacks</p></li></ul><p>In practice, this is how you avoid an &#8220;AI demo&#8221; becoming a performance incident.</p><div><hr></div><h2><strong>React-Specific tips I want more teams to operationalize</strong></h2><p>The talk includes a slide of &#8220;React AI coding tips&#8221; that I keep coming back to because it captures what actually works in practice.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0aF5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eda09af-dbe7-40a3-91d9-243c1ecbd735_1884x1050.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0aF5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eda09af-dbe7-40a3-91d9-243c1ecbd735_1884x1050.png 424w, 
https://substackcdn.com/image/fetch/$s_!0aF5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eda09af-dbe7-40a3-91d9-243c1ecbd735_1884x1050.png 848w, https://substackcdn.com/image/fetch/$s_!0aF5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eda09af-dbe7-40a3-91d9-243c1ecbd735_1884x1050.png 1272w, https://substackcdn.com/image/fetch/$s_!0aF5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eda09af-dbe7-40a3-91d9-243c1ecbd735_1884x1050.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0aF5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eda09af-dbe7-40a3-91d9-243c1ecbd735_1884x1050.png" width="1456" height="811" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0eda09af-dbe7-40a3-91d9-243c1ecbd735_1884x1050.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:811,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:629248,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eda09af-dbe7-40a3-91d9-243c1ecbd735_1884x1050.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!0aF5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eda09af-dbe7-40a3-91d9-243c1ecbd735_1884x1050.png 424w, https://substackcdn.com/image/fetch/$s_!0aF5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eda09af-dbe7-40a3-91d9-243c1ecbd735_1884x1050.png 848w, https://substackcdn.com/image/fetch/$s_!0aF5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eda09af-dbe7-40a3-91d9-243c1ecbd735_1884x1050.png 1272w, https://substackcdn.com/image/fetch/$s_!0aF5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0eda09af-dbe7-40a3-91d9-243c1ecbd735_1884x1050.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here are the ones I see paying off immediately on real teams:</p><ul><li><p><strong>Start prompts with the component API:</strong> Declare props, variants, states, and then tell the model to implement exactly that</p></li><li><p><strong>Name interactive states explicitly:</strong> hover, focus, loading, disabled</p></li><li><p><strong>Ask for a plan, then generate in steps:</strong> Small increments beat one big shot, and help avoid the complexity cliff</p></li><li><p><strong>Codify taste so AI can follow it:</strong> Lock spacing, colors, components in Tailwind config and your design system</p></li><li><p><strong>Be explicit about routes, layouts, server actions, loading and error boundaries</strong></p></li><li><p><strong>Bake conventions into the repo:</strong> Document App Router defaults, Server Components, Suspense so assistants align automatically</p></li><li><p><strong>Run checks only on what changed:</strong> Husky with lint-staged to run typecheck, lint, tests on staged files</p></li><li><p><strong>Control cache behavior explicitly:</strong> Fetch cache options and revalidation windows as part of the prompt, so the model doesn&#8217;t guess your policy</p></li></ul><p><strong>The meta-message is the same:</strong> The difference between &#8220;AI helped me ship&#8221; and &#8220;AI gave me a mess&#8221; is almost always the level of specificity and the strength of your guardrails.</p><div><hr></div><h2><strong>How I debug AI coding failures: it&#8217;s a pipeline, not a model</strong></h2><p>Once you accept the complexity cliff, the question becomes: how do you consistently get good outcomes?</p><p>I use a mental 
model I showed near the end of the talk: <strong>when AI code works or fails, it&#8217;s rarely &#8220;just the model.&#8221; It&#8217;s the whole pipeline.</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!na-W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b51dd38-b014-4122-bc97-32ece34d1cb1_1870x1052.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!na-W!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b51dd38-b014-4122-bc97-32ece34d1cb1_1870x1052.png 424w, https://substackcdn.com/image/fetch/$s_!na-W!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b51dd38-b014-4122-bc97-32ece34d1cb1_1870x1052.png 848w, https://substackcdn.com/image/fetch/$s_!na-W!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b51dd38-b014-4122-bc97-32ece34d1cb1_1870x1052.png 1272w, https://substackcdn.com/image/fetch/$s_!na-W!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b51dd38-b014-4122-bc97-32ece34d1cb1_1870x1052.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!na-W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b51dd38-b014-4122-bc97-32ece34d1cb1_1870x1052.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1b51dd38-b014-4122-bc97-32ece34d1cb1_1870x1052.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:332597,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b51dd38-b014-4122-bc97-32ece34d1cb1_1870x1052.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!na-W!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b51dd38-b014-4122-bc97-32ece34d1cb1_1870x1052.png 424w, https://substackcdn.com/image/fetch/$s_!na-W!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b51dd38-b014-4122-bc97-32ece34d1cb1_1870x1052.png 848w, https://substackcdn.com/image/fetch/$s_!na-W!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b51dd38-b014-4122-bc97-32ece34d1cb1_1870x1052.png 1272w, https://substackcdn.com/image/fetch/$s_!na-W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b51dd38-b014-4122-bc97-32ece34d1cb1_1870x1052.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The pipeline:</p><ol><li><p>Base model</p></li><li><p>System prompt and instructions</p></li><li><p>Your user prompt</p></li><li><p>Fine-tuning and code training</p></li><li><p>Tools and retrieval (RAG)</p></li><li><p>Agent loops (iteration)</p></li><li><p>Post-processing</p></li></ol><p>If you&#8217;re disappointed, you can almost always point to a weak link: wrong model for task, vague prompt, missing context, no iteration.</p><p>When things work, it&#8217;s usually because multiple layers aligned well: strong model, good prompt, necessary context, and iteration to iron out kinks.</p><p><strong>This is actionable, because most of those layers are under your control as a user</strong>, even if you don&#8217;t own the base model.</p><div><hr></div><h2><strong>The workflow I recommend for agentic React coding</strong></h2><p>I 
summarized it in the deck as the <strong>&#8220;new flow state&#8221;:</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uQ4k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd674a8c-c225-46c6-b653-77a4ee119ea3_1864x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uQ4k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd674a8c-c225-46c6-b653-77a4ee119ea3_1864x1048.png 424w, https://substackcdn.com/image/fetch/$s_!uQ4k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd674a8c-c225-46c6-b653-77a4ee119ea3_1864x1048.png 848w, https://substackcdn.com/image/fetch/$s_!uQ4k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd674a8c-c225-46c6-b653-77a4ee119ea3_1864x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!uQ4k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd674a8c-c225-46c6-b653-77a4ee119ea3_1864x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uQ4k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd674a8c-c225-46c6-b653-77a4ee119ea3_1864x1048.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd674a8c-c225-46c6-b653-77a4ee119ea3_1864x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:576421,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180999655?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd674a8c-c225-46c6-b653-77a4ee119ea3_1864x1048.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uQ4k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd674a8c-c225-46c6-b653-77a4ee119ea3_1864x1048.png 424w, https://substackcdn.com/image/fetch/$s_!uQ4k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd674a8c-c225-46c6-b653-77a4ee119ea3_1864x1048.png 848w, https://substackcdn.com/image/fetch/$s_!uQ4k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd674a8c-c225-46c6-b653-77a4ee119ea3_1864x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!uQ4k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd674a8c-c225-46c6-b653-77a4ee119ea3_1864x1048.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol><li><p>Define clear requirements (maybe write tests)</p></li><li><p>Prompt with context (stack, docs, examples)</p></li><li><p>Ask for plan, review</p></li><li><p>Generate code in small steps</p></li><li><p>Run, test, refine</p></li><li><p>Iterate until production-ready</p></li></ol><p>This sounds like &#8220;normal engineering,&#8221; and <strong>that&#8217;s the point.</strong> The best teams I see using AI are not doing anything mystical. 
They&#8217;re turning implicit engineering discipline into explicit instructions and then using AI to accelerate the boring parts.</p><p><strong>If you want one sentence:</strong> You&#8217;re not just typing code anymore; you&#8217;re orchestrating code creation.</p><div><hr></div><h2><strong>So, how good is AI at coding React, really?</strong></h2><p>Where I land after looking at these benchmarks and using these tools day to day:</p><p>&#9989; <strong>AI is genuinely strong at:</strong></p><ul><li><p>Isolated React components</p></li><li><p>Scaffolding</p></li><li><p>Converting clearly specified requirements into working code</p></li></ul><p>&#9888;&#65039; <strong>AI is still unreliable at:</strong></p><ul><li><p>Multi-step integration tasks without strong tooling, strong context, and iteration loops</p></li></ul><p>&#10060; <strong>AI is consistently weaker at:</strong></p><ul><li><p>Taste, hierarchy, and nuanced UX decisions - it is far stronger at &#8220;code that runs&#8221; than at design judgment</p></li><li><p>The aesthetic gap is real</p></li></ul><p>&#128161; <strong>The highest-leverage strategy is not &#8220;pick the best model.&#8221;</strong> It&#8217;s:</p><ul><li><p>Reduce context failures</p></li><li><p>Codify your conventions</p></li><li><p>Force stepwise work</p></li></ul><p><strong>And the part I find most exciting:</strong> The opportunity keeps expanding. Models and tools change fast, but the underlying skills that make you effective don&#8217;t. 
<strong>You are still the architect.</strong></p><div><hr></div><h2><strong>Learn More</strong></h2><p>If you want to keep exploring this space:</p><ul><li><p><strong>Talk video:</strong> <a href="https://gitnation.com/contents/how-good-is-ai-at-coding-react-really">https://gitnation.com/contents/how-good-is-ai-at-coding-react-really</a></p></li><li><p><strong>Design Arena leaderboard:</strong> <a href="https://www.designarena.ai/leaderboard">https://www.designarena.ai/leaderboard</a></p></li><li><p><strong>Design Arena methodology:</strong> <a href="https://notes.designarena.ai/in-pursuit-of-a-benchmark-for-human-taste/">https://notes.designarena.ai/in-pursuit-of-a-benchmark-for-human-taste/</a></p></li></ul><p><em>And if you want to dive deeper into related topics, I&#8217;ve written two books on the topic: &#8220;<a href="https://beyond.addy.ie">Beyond Vibe Coding</a>&#8221; and &#8220;<a href="https://largeapps.dev/">Building large-scale web apps with React</a>&#8221;</em></p><p></p><div><hr></div><p></p>]]></content:encoded></item><item><title><![CDATA[My LLM coding workflow going into 2026]]></title><description><![CDATA[Best practices for staying in control while coding with AI]]></description><link>https://addyo.substack.com/p/my-llm-coding-workflow-going-into</link><guid isPermaLink="false">https://addyo.substack.com/p/my-llm-coding-workflow-going-into</guid><dc:creator><![CDATA[Addy Osmani]]></dc:creator><pubDate>Thu, 18 Dec 2025 15:30:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ukkU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc5ca0ef-614c-42e0-85f3-3663e9871580_7838x7838.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>AI coding assistants became game-changers this year, but harnessing them effectively takes skill and structure.</strong> These tools dramatically increased what LLMs can do for real-world coding, and many developers 
(myself included) embraced them.</p><p>At Anthropic, for example, engineers adopted Claude Code so heavily that <strong><a href="https://newsletter.pragmaticengineer.com/p/software-engineering-with-llms-in-2025#:~:text=,%E2%80%9D">today</a> ~90% of the code for Claude Code is written by Claude Code itself</strong>. Yet, using LLMs for programming is <em>not</em> a push-button magic experience - it&#8217;s &#8220;difficult and unintuitive&#8221; and getting great results requires learning new patterns. <a href="https://addyo.substack.com/p/critical-thinking-during-the-age">Critical thinking</a> remains key. Over a year of projects, I&#8217;ve converged on a workflow similar to what many experienced devs are discovering: treat the LLM as a powerful pair programmer that <strong>requires clear direction, context and oversight</strong> rather than autonomous judgment.</p><p>In this article, I&#8217;ll share how I plan, code, and collaborate with AI going into 2026, distilling tips and best practices from my experience and the community&#8217;s collective learning. 
It&#8217;s a more disciplined <strong>&#8220;AI-assisted engineering&#8221;</strong> approach - leveraging AI aggressively while <strong>staying proudly accountable for the software produced</strong>.</p><div id="youtube2-FoXHScf1mjA" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;FoXHScf1mjA&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/FoXHScf1mjA?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>If you&#8217;re interested in more on my workflow, see &#8220;The AI-Native Software Engineer&#8221;, otherwise let&#8217;s dive straight into some of the lessons I learned.</p><h2>Start with a clear plan (specs before code)</h2><p><strong>Don&#8217;t just throw wishes at the LLM - begin by defining the problem and planning a solution.</strong> </p><p>One common mistake is diving straight into code generation with a vague prompt. In my workflow, and in many others&#8217;, the first step is <strong>brainstorming a detailed specification</strong> <em>with</em> the AI, then outlining a step-by-step plan, <em>before</em> writing any actual code. For a new project, I&#8217;ll describe the idea and ask the LLM to <strong>iteratively ask me questions</strong> until we&#8217;ve fleshed out requirements and edge cases. By the end, we compile this into a comprehensive <strong>spec.md</strong> - containing requirements, architecture decisions, data models, and even a testing strategy. This spec forms the foundation for development.</p><p>Next, I feed the spec into a reasoning-capable model and prompt it to <strong>generate a project plan</strong>: break the implementation into logical, bite-sized tasks or milestones. 
The AI essentially helps me do a mini &#8220;design doc&#8221; or project plan. I often iterate on this plan - editing and asking the AI to critique or refine it - until it&#8217;s coherent and complete. <em>Only then</em> do I proceed to coding. This upfront investment might feel slow, but it pays off enormously. As Les Orchard <a href="https://blog.lmorchard.com/2025/06/07/semi-automatic-coding/#:~:text=Accidental%20waterfall%20">put it</a>, it&#8217;s like doing a <strong>&#8220;waterfall in 15 minutes&#8221;</strong> - a rapid structured planning phase that makes the subsequent coding much smoother. </p><p>Having a clear spec and plan means when we unleash the codegen, both the human and the LLM know exactly what we&#8217;re building and why. In short, <strong>planning first</strong> forces you and the AI onto the same page and prevents wasted cycles. It&#8217;s a step many people are tempted to skip, but experienced LLM developers now treat a robust spec/plan as the cornerstone of the workflow.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xGPR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279d764d-c05b-4b6e-848e-4b481a8c0eeb_1894x1052.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xGPR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279d764d-c05b-4b6e-848e-4b481a8c0eeb_1894x1052.png 424w, https://substackcdn.com/image/fetch/$s_!xGPR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279d764d-c05b-4b6e-848e-4b481a8c0eeb_1894x1052.png 848w, 
https://substackcdn.com/image/fetch/$s_!xGPR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279d764d-c05b-4b6e-848e-4b481a8c0eeb_1894x1052.png 1272w, https://substackcdn.com/image/fetch/$s_!xGPR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279d764d-c05b-4b6e-848e-4b481a8c0eeb_1894x1052.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xGPR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279d764d-c05b-4b6e-848e-4b481a8c0eeb_1894x1052.png" width="1456" height="809" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/279d764d-c05b-4b6e-848e-4b481a8c0eeb_1894x1052.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:809,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1468306,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/181957927?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279d764d-c05b-4b6e-848e-4b481a8c0eeb_1894x1052.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xGPR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279d764d-c05b-4b6e-848e-4b481a8c0eeb_1894x1052.png 424w, 
https://substackcdn.com/image/fetch/$s_!xGPR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279d764d-c05b-4b6e-848e-4b481a8c0eeb_1894x1052.png 848w, https://substackcdn.com/image/fetch/$s_!xGPR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279d764d-c05b-4b6e-848e-4b481a8c0eeb_1894x1052.png 1272w, https://substackcdn.com/image/fetch/$s_!xGPR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F279d764d-c05b-4b6e-848e-4b481a8c0eeb_1894x1052.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h2>Break work into small, iterative chunks</h2><p><strong>Scope management is everything - feed the LLM manageable tasks, not the whole codebase at once.</strong> </p><p>A crucial lesson I&#8217;ve learned is to avoid asking the AI for large, monolithic outputs. Instead, we <strong>break the project into iterative steps or tickets</strong> and tackle them <a href="https://blog.fsck.com/2025/10/05/how-im-using-coding-agents-in-september-2025/#:~:text=please%20write%20out%20this%20plan%2C,in%20full%20detail%2C%20into%20docs%2Fplans">one by one</a>. This mirrors good software engineering practice, but it&#8217;s even more important with AI in the loop. LLMs do best when given focused prompts: implement one function, fix one bug, add one feature at a time. For example, after planning, I will prompt the codegen model: <em>&#8220;Okay, let&#8217;s implement Step 1 from the plan&#8221;</em>. We code that, test it, then move to Step 2, and so on. Each chunk is small enough that the AI can handle it within context and you can understand the code it produces.</p><p>This approach guards against the model going off the rails. If you ask for too much in one go, it&#8217;s likely to get confused or produce a <strong>&#8220;jumbled mess&#8221;</strong> that&#8217;s hard to untangle. Developers <a href="https://albertofortin.com/writing/coding-with-ai#:~:text=No%20consistency%2C%20no%20overarching%20plan,the%20other%209%20were%20doing">report</a> that when they tried to have an LLM generate huge swaths of an app, they ended up with inconsistency and duplication - &#8220;like 10 devs worked on it without talking to each other,&#8221; one said. I&#8217;ve felt that pain; the fix is to <strong>stop, back up, and split the problem into smaller pieces</strong>. Each iteration, we carry forward the context of what&#8217;s been built and incrementally add to it. 
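</line>
To make the one-task-at-a-time loop concrete, here is a minimal sketch of feeding an agent one plan step at a time while carrying forward what has already been built (the plan text and function names are illustrative, not taken from any specific tool):

```python
# Hypothetical plan: one focused, reviewable task per step.
PLAN = [
    "Step 1: Add a failing test for the discount calculation.",
    "Step 2: Implement calculate_discount() so the test passes.",
    "Step 3: Handle the zero-quantity edge case and refactor.",
]

def next_prompt(plan, completed):
    """Build the prompt for the next step, summarizing progress so far."""
    if completed >= len(plan):
        return None  # plan finished
    done = "\n".join(plan[:completed]) or "(nothing yet)"
    return (
        f"Completed so far:\n{done}\n\n"
        f"Now do only this step, nothing more:\n{plan[completed]}"
    )
```

Each prompt stays small enough for the model to hold in context, and each step's output can be reviewed before moving on.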
This also fits nicely with a <strong>test-driven development (TDD)</strong> approach - we can write or generate tests for each piece as we go (more on testing soon).</p><p>Several coding-agent tools now explicitly support this chunked workflow. For instance, I often generate a structured <strong>&#8220;prompt plan&#8221;</strong> file that contains a sequence of prompts for each task, so that tools like Cursor can execute them one by one. The key point is to <strong>avoid huge leaps</strong>. By iterating in small loops, we greatly reduce the chance of catastrophic errors and can course-correct quickly. LLMs excel at quick, contained tasks - use that to your advantage.</p><h2>Provide extensive context and guidance</h2><p><strong>LLMs are only as good as the context you provide - </strong><em><strong>show them</strong></em><strong> the relevant code, docs, and constraints.</strong> </p><p>When working on a codebase, I make sure to <strong>feed the AI all the information it needs</strong> to perform well. That includes the code it should modify or refer to, the project&#8217;s technical constraints, and any known pitfalls or preferred approaches. Modern tools help with this: for example, Anthropic&#8217;s Claude can import an entire GitHub repo into its context in &#8220;Projects&#8221; mode, and IDE assistants like Cursor or Copilot auto-include open files in the prompt. But I often go further - I will either use an MCP server like <a href="https://context7.com/">Context7</a> or manually copy important pieces of the codebase or API docs into the conversation if I suspect the model doesn&#8217;t have them.</p><p>Expert LLM users emphasize this &#8220;context packing&#8221; step: for example, doing a <strong>&#8220;brain dump&#8221;</strong> of everything the model should know before coding - high-level goals and invariants, examples of good solutions, and warnings about approaches to avoid. 
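In the spirit of repo-dump utilities, the context-packing step can be sketched as a few lines that bundle the relevant files, labeled by path, into one prompt-ready text file (a minimal sketch; the output name and size cap are illustrative):

```python
from pathlib import Path

MAX_CHARS = 200_000  # crude guard so the bundle can't blow the context window

def pack_context(paths, out="context.txt"):
    """Concatenate the given source files, labeled by path, into one bundle."""
    chunks = []
    for p in paths:
        chunks.append(f"=== {p} ===\n{Path(p).read_text(encoding='utf-8')}")
    bundle = "\n\n".join(chunks)[:MAX_CHARS]
    Path(out).write_text(bundle, encoding="utf-8")
    return bundle
```

If a bug fix touches four modules, pack those four modules - and nothing else - so the model sees complete but focused information.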
If I&#8217;m asking an AI to implement a tricky solution, I might tell it which naive solutions are too slow, or provide a reference implementation from elsewhere. If I&#8217;m using a niche library or a brand-new API, I&#8217;ll paste in the official docs or README so the AI isn&#8217;t flying blind. All of this upfront context dramatically improves the quality of its output, because the model isn&#8217;t guessing - it has the facts and constraints in front of it.</p><p>There are now utilities to automate context packaging. I&#8217;ve experimented with tools like <strong><a href="https://gitingest.com/">gitingest</a></strong> or <strong><a href="https://github.com/abinthomasonline/repo2txt">repo2txt</a></strong>, which essentially <strong>&#8220;dump&#8221; the relevant parts of your codebase into a text file for the LLM to read</strong>. These can be a lifesaver when dealing with a large project - you generate an output.txt bundle of key source files and let the model ingest that. The principle is: <strong>don&#8217;t make the AI operate on partial information</strong>. If a bug fix requires understanding four different modules, show it those four modules. Yes, we must watch token limits, but current frontier models have huge context windows (hundreds of thousands of tokens or more). Use them wisely. I often selectively include just the portions of code relevant to the task at hand, and explicitly tell the AI what <em>not</em> to focus on if something is out of scope (to save tokens).</p><p>I think <strong><a href="https://github.com/anthropics/skills">Claude Skills</a></strong> have potential because they turn what used to be fragile repeated prompting into something <strong>durable and reusable</strong> by packaging instructions, scripts, and domain-specific expertise into modular capabilities that tools can automatically apply when a request matches the Skill. 
This means you get more reliable, context-aware results than a generic prompt ever could, and you move away from one-off interactions toward workflows that encode repeatable procedures and team knowledge in a consistent way. A number of community-curated <a href="https://www.x-cmd.com/skill/">Skills collections</a> exist, but one of my favorite examples is the <a href="https://x.com/trq212/status/1989061937590837678">frontend-design</a> skill, which can &#8220;end&#8221; the purple design aesthetic prevalent in LLM-generated UIs. Until more tools support Skills officially, <a href="https://github.com/intellectronica/skillz">workarounds</a> exist.</p><p>Finally, <strong>guide the AI with comments and rules inside the prompt</strong>. I might precede a code snippet with: &#8220;Here is the current implementation of X. We need to extend it to do Y, but be careful not to break Z.&#8221; These little hints go a long way. LLMs are <strong>literalists</strong> - they&#8217;ll follow your instructions to the letter, so make those instructions detailed and contextual. 
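The &#8220;comments and rules inside the prompt&#8221; pattern can be sketched as a tiny template that pairs the code with an explicit goal and guardrail (illustrative only; the wording is not a fixed recipe):

```python
def guarded_prompt(current_code, goal, invariant):
    """Assemble a change request with explicit context and a guardrail."""
    return (
        "Here is the current implementation:\n"
        "-----\n" + current_code + "\n-----\n"
        f"Extend it to: {goal}\n"
        f"Be careful not to break: {invariant}\n"
        "Return only the modified code."
    )
```

Spelling out the invariant ("don't break Z") is what keeps the model from "helpfully" rewriting parts of the code you wanted left alone.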
By proactively providing context and guidance, we minimize hallucinations and off-base suggestions and get code that fits our project&#8217;s needs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gnQO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff92ab8a8-fa3b-49cb-ad4c-7e3059a8e2de_1884x1050.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gnQO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff92ab8a8-fa3b-49cb-ad4c-7e3059a8e2de_1884x1050.png 424w, https://substackcdn.com/image/fetch/$s_!gnQO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff92ab8a8-fa3b-49cb-ad4c-7e3059a8e2de_1884x1050.png 848w, https://substackcdn.com/image/fetch/$s_!gnQO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff92ab8a8-fa3b-49cb-ad4c-7e3059a8e2de_1884x1050.png 1272w, https://substackcdn.com/image/fetch/$s_!gnQO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff92ab8a8-fa3b-49cb-ad4c-7e3059a8e2de_1884x1050.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gnQO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff92ab8a8-fa3b-49cb-ad4c-7e3059a8e2de_1884x1050.png" width="1456" height="811" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f92ab8a8-fa3b-49cb-ad4c-7e3059a8e2de_1884x1050.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:811,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:752827,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/181957927?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff92ab8a8-fa3b-49cb-ad4c-7e3059a8e2de_1884x1050.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gnQO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff92ab8a8-fa3b-49cb-ad4c-7e3059a8e2de_1884x1050.png 424w, https://substackcdn.com/image/fetch/$s_!gnQO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff92ab8a8-fa3b-49cb-ad4c-7e3059a8e2de_1884x1050.png 848w, https://substackcdn.com/image/fetch/$s_!gnQO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff92ab8a8-fa3b-49cb-ad4c-7e3059a8e2de_1884x1050.png 1272w, https://substackcdn.com/image/fetch/$s_!gnQO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff92ab8a8-fa3b-49cb-ad4c-7e3059a8e2de_1884x1050.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Choose the right model (and use multiple when needed)</h2><p><strong>Not all coding LLMs are equal - pick your tool with intention, and don&#8217;t be afraid to swap models mid-stream.</strong> </p><p>In 2025 we&#8217;ve been spoiled with a variety of capable code-focused LLMs. Part of my workflow is <strong>choosing the model or service best suited to each task</strong>. Sometimes it can be valuable to even try two or more LLMs in parallel to cross-check how they might approach the same problem differently.</p><p>Each model has its own &#8220;personality&#8221;. The key is: <strong>if one model gets stuck or gives mediocre outputs, try another.</strong> I&#8217;ve literally copied the same prompt from one chat into another service to see if it can handle it better. 
This &#8220;<a href="https://blog.lmorchard.com/2025/06/07/semi-automatic-coding/#:~:text=I%20bounced%20between%20Claude%20Sonnet,Each%20had%20its%20own%20personality">model musical chairs</a>&#8221; can rescue you when you hit a model&#8217;s blind spot.</p><p>Also, make sure you&#8217;re using <em>the best version</em> available. If you can, use the newest &#8220;pro&#8221; tier models - because quality matters. And yes, it often means paying for access, but the productivity gains can justify it. Ultimately, pick the AI pair programmer whose <strong>&#8220;vibe&#8221; meshes with you</strong>. I know folks who prefer one model simply because they like how its responses <em>feel</em>. That&#8217;s valid - when you&#8217;re essentially in a constant dialogue with an AI, the UX and tone make a difference. </p><p>Personally, I gravitate towards Gemini for a lot of coding work these days because the interaction feels more natural and it often understands my requests on the first try. But I will not hesitate to switch to another model if needed; sometimes a second opinion helps the solution emerge. In summary: <strong>use the best tool for the job, and remember you have an arsenal of AIs at your disposal.</strong></p><h2>Leverage AI coding across the lifecycle</h2><p><strong>Supercharge your workflow with coding-specific AI help across the SDLC.</strong> </p><p>On the command line, new AI agents have emerged. <strong>Claude Code, OpenAI&#8217;s Codex CLI</strong> and <strong>Google&#8217;s Gemini CLI</strong> are CLI tools you can chat with directly in your project directory - they can read files, run tests, and even work through multi-step fixes. I&#8217;ve used Google&#8217;s <strong>Jules</strong> and GitHub&#8217;s <strong>Copilot Agent</strong> as well - these are <strong>asynchronous coding agents</strong> that actually clone your repo into a cloud VM and work on tasks in the background (writing tests, fixing bugs, then opening a PR for you). 
It&#8217;s a bit eerie to witness: you issue a command like &#8220;refactor the payment module for X&#8221; and a little while later you get a pull request with code changes and passing tests. We are truly living in the future. You can read more about this in <a href="https://addyo.substack.com/p/conductors-to-orchestrators-the-future">conductors to orchestrators</a>.</p><p>That said, <strong>these tools are not infallible, and you must understand their limits</strong>. They accelerate the mechanical parts of coding - generating boilerplate, applying repetitive changes, running tests automatically - but they still benefit greatly from your guidance. For instance, when I use an agent like Claude or Copilot to implement something, I often supply it with the plan or to-do list from earlier steps so it knows the exact sequence of tasks. If the agent supports it, I&#8217;ll load up my spec.md or plan.md in the context before telling it to execute. This keeps it on track.</p><p><strong>We&#8217;re not at the stage of letting an AI agent code an entire feature unattended</strong> and expecting perfect results. Instead, I use these tools in a supervised way: I&#8217;ll let them generate and even run code, but I keep an eye on each step, ready to step in when something looks off. There are also orchestration tools like <strong>Conductor</strong> that let you run multiple agents in parallel on different tasks (essentially a way to scale up AI help) - some engineers are experimenting with running 3-4 agents at once on separate features. I&#8217;ve dabbled in this &#8220;massively parallel&#8221; approach; it&#8217;s surprisingly effective at getting a lot done quickly, but it&#8217;s also mentally taxing to monitor multiple AI threads! 
For most cases, I stick to one main agent at a time and maybe a secondary one for reviews (discussed below).</p><p>Just remember these are power tools - you still control the trigger and guide the outcome.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!F31O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e8398cc-f2c9-4c04-b3a1-37437dd3d82a_1888x1054.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!F31O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e8398cc-f2c9-4c04-b3a1-37437dd3d82a_1888x1054.png 424w, https://substackcdn.com/image/fetch/$s_!F31O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e8398cc-f2c9-4c04-b3a1-37437dd3d82a_1888x1054.png 848w, https://substackcdn.com/image/fetch/$s_!F31O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e8398cc-f2c9-4c04-b3a1-37437dd3d82a_1888x1054.png 1272w, https://substackcdn.com/image/fetch/$s_!F31O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e8398cc-f2c9-4c04-b3a1-37437dd3d82a_1888x1054.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!F31O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e8398cc-f2c9-4c04-b3a1-37437dd3d82a_1888x1054.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3e8398cc-f2c9-4c04-b3a1-37437dd3d82a_1888x1054.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:388469,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/181957927?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e8398cc-f2c9-4c04-b3a1-37437dd3d82a_1888x1054.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!F31O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e8398cc-f2c9-4c04-b3a1-37437dd3d82a_1888x1054.png 424w, https://substackcdn.com/image/fetch/$s_!F31O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e8398cc-f2c9-4c04-b3a1-37437dd3d82a_1888x1054.png 848w, https://substackcdn.com/image/fetch/$s_!F31O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e8398cc-f2c9-4c04-b3a1-37437dd3d82a_1888x1054.png 1272w, https://substackcdn.com/image/fetch/$s_!F31O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e8398cc-f2c9-4c04-b3a1-37437dd3d82a_1888x1054.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>A full overview of where AI can improve the developer experience. This spans design, inner, submit, and outer loops - highlighting every point where AI can meaningfully reduce toil.</em></p><h2>Keep a human in the loop - verify, test, and review everything</h2><p><strong>AI will happily produce plausible-looking code, but </strong><em><strong>you</strong></em><strong> are responsible for quality - always review and test thoroughly.</strong> One of my cardinal rules is never to blindly trust an LLM&#8217;s output. As Simon Willison aptly <a href="https://simonwillison.net/2025/Mar/11/using-llms-for-code/#:~:text=Instead%2C%20use%20them%20to%20augment,on%20tedious%20tasks%20without%20complaint">says</a>, think of an LLM pair programmer as <strong>&#8220;over-confident and prone to mistakes&#8221;</strong>. 
It writes code with complete conviction - including bugs or nonsense - and won&#8217;t tell you something is wrong unless you catch it. So I treat every AI-generated snippet as if it came from a junior developer: I read through the code, run it, and test it as needed. <strong>You absolutely have to test what it writes</strong> - run those unit tests, or manually exercise the feature, to ensure it does what it claims. Read more about this in <a href="https://addyo.substack.com/p/vibe-coding-is-not-an-excuse-for">vibe coding is not an excuse for low-quality work</a>.</p><p>In fact, I weave testing into the workflow itself. My earlier planning stage often includes generating a list of tests or a testing plan for each step. If I&#8217;m using a tool like Claude Code, I&#8217;ll instruct it to run the test suite after implementing a task, and have it debug failures if any occur. This kind of tight feedback loop (write code &#8594; run tests &#8594; fix) is something AI excels at <em>as long as the tests exist</em>. It&#8217;s no surprise that those who get the most out of coding agents tend to be those with strong testing practices. An agent like Claude can &#8220;fly&#8221; through a project with a good test suite as a safety net. Without tests, the agent might blithely assume everything is fine (&#8220;sure, all good!&#8221;) when in reality it&#8217;s broken several things. So, <strong>invest in tests</strong> - they amplify the AI&#8217;s usefulness and your confidence in the result.</p><p>Even beyond automated tests, <strong>do code reviews - both manual and AI-assisted</strong>. I routinely pause and review the code that&#8217;s been generated so far, line by line. Sometimes I&#8217;ll spawn a second AI session (or a different model) and ask <em>it</em> to critique or review code produced by the first. 
For example, I might have Claude write the code and then ask Gemini, &#8220;Can you review this function for any errors or improvements?&#8221; This can catch subtle issues. The key is to <em>not</em> skip the review just because an AI wrote the code. If anything, AI-written code needs <strong>extra scrutiny</strong>, because it can sometimes be superficially convincing while hiding flaws that a human might not immediately notice.</p><p>I also use <a href="https://github.com/chromeDevTools/chrome-devtools-mcp/">Chrome DevTools MCP</a>, which my last team and I built, in my <strong>debugging and quality loop</strong> to bridge the gap between static code analysis and live browser execution. It &#8220;gives your agent eyes&#8221;. It lets my AI tools see what the browser sees: they can inspect the DOM and pull performance traces, console logs, and network activity. This integration eliminates the friction of manual context switching, allowing for automated UI testing directly through the LLM. It means bugs can be diagnosed and fixed with high precision based on actual runtime data. </p><p>The dire consequences of skipping human oversight have been documented. One developer who leaned heavily on AI generation for a rush project <a href="https://albertofortin.com/writing/coding-with-ai#:~:text=No%20consistency%2C%20no%20overarching%20plan,the%20other%209%20were%20doing">described</a> the result as an inconsistent mess - duplicate logic, mismatched method names, no coherent architecture. He realized he&#8217;d been &#8220;building, building, building&#8221; without stepping back to really see what the AI had woven together. The fix was a painful refactor and a vow to never let things get that far out of hand again. I&#8217;ve taken that to heart. <strong>No matter how much AI I use, I remain the accountable engineer</strong>.</p><p>In practical terms, that means I only merge or ship code after I&#8217;ve understood it. 
If the AI generates something convoluted, I&#8217;ll ask it to add comments explaining it, or I&#8217;ll rewrite it in simpler terms. If something doesn&#8217;t feel right, I dig in - just as I would if a human colleague contributed code that raised red flags.</p><p>It&#8217;s all about mindset: <strong>the LLM is an assistant, not an autonomously reliable coder</strong>. I am the senior dev; the LLM is there to accelerate me, not replace my judgment. Maintaining this stance not only results in better code, it also protects your own growth as a developer. (I&#8217;ve heard some express concern that relying too much on AI might dull their skills - I think as long as you stay in the loop, actively reviewing and understanding everything, you&#8217;re still sharpening your instincts, just at a higher velocity.) In short: <strong>stay alert, test often, review always.</strong> It&#8217;s still your codebase at the end of the day.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-yfn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd35bc06-47df-4660-9f2f-d0beaa489165_1858x974.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-yfn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd35bc06-47df-4660-9f2f-d0beaa489165_1858x974.png 424w, https://substackcdn.com/image/fetch/$s_!-yfn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd35bc06-47df-4660-9f2f-d0beaa489165_1858x974.png 848w, 
https://substackcdn.com/image/fetch/$s_!-yfn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd35bc06-47df-4660-9f2f-d0beaa489165_1858x974.png 1272w, https://substackcdn.com/image/fetch/$s_!-yfn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd35bc06-47df-4660-9f2f-d0beaa489165_1858x974.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-yfn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd35bc06-47df-4660-9f2f-d0beaa489165_1858x974.png" width="1456" height="763" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fd35bc06-47df-4660-9f2f-d0beaa489165_1858x974.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:763,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:259438,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/181957927?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd35bc06-47df-4660-9f2f-d0beaa489165_1858x974.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-yfn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd35bc06-47df-4660-9f2f-d0beaa489165_1858x974.png 424w, 
https://substackcdn.com/image/fetch/$s_!-yfn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd35bc06-47df-4660-9f2f-d0beaa489165_1858x974.png 848w, https://substackcdn.com/image/fetch/$s_!-yfn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd35bc06-47df-4660-9f2f-d0beaa489165_1858x974.png 1272w, https://substackcdn.com/image/fetch/$s_!-yfn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd35bc06-47df-4660-9f2f-d0beaa489165_1858x974.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Commit often and use version control as a safety net. Never commit code you can&#8217;t explain.</h2><p><strong>Frequent commits are your save points - they let you undo AI missteps and understand changes.</strong> </p><p>When working with an AI that can generate a lot of code quickly, it&#8217;s easy for things to veer off course. I mitigate this by adopting ultra-granular version control habits. I commit early and often, even more than I would in normal hand-coding. After each small task or each successful automated edit, I&#8217;ll make a git commit with a clear message. This way, if the AI&#8217;s next suggestion introduces a bug or a messy change, I have a recent checkpoint to revert to (or cherry-pick from) without losing hours of work. One practitioner likened it to treating commits as <strong>&#8220;save points in a game&#8221;</strong> - if an LLM session goes sideways, you can always roll back to the last stable commit. I&#8217;ve found that advice incredibly useful. It&#8217;s much less stressful to experiment with a bold AI refactor when you know you can undo it with a git reset if needed.</p><p>Proper version control also helps when collaborating with the AI. Since I can&#8217;t rely on the AI to remember everything it&#8217;s done (context window limitations, etc.), the git history becomes a valuable log. I often scan my recent commits to brief the AI (or myself) on what changed. In fact, LLMs themselves can leverage your commit history if you provide it - I&#8217;ve pasted git diffs or commit logs into the prompt so the AI knows what code is new or what the previous state was. Amusingly, LLMs are <em>really</em> good at parsing diffs and using tools like git bisect to find where a bug was introduced. They have infinite patience to traverse commit histories, which can augment your debugging. 
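The save-point habit itself is plain git discipline; a runnable sketch (file contents and commit messages are invented):

```shell
# Each small task gets its own commit - a save point you can roll back to.
git init -q demo && cd demo
git config user.email "dev@example.com" && git config user.name "Dev"

echo "fee = total * 0.02" > fees.py
git add -A && git commit -q -m "Task 1: extract fee calculation"

echo "fee = total * 99" > fees.py         # suppose the next AI edit goes wrong
git add -A && git commit -q -m "Task 2: AI refactor (broken)"

git reset -q --hard HEAD~1                # roll back to the last good save point
cat fees.py                               # prints: fee = total * 0.02
```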
But this only works if you have a tidy commit history to begin with.</p><p>Another benefit: small commits with good messages essentially document the development process, which helps when doing code review (AI or human). If an AI agent made five changes in one go and something broke, having those changes in separate commits makes it easier to pinpoint which commit caused the issue. If everything is in one giant commit titled &#8220;AI changes&#8221;, good luck! So I discipline myself: <em>finish task, run tests, commit.</em> This also meshes well with the earlier tip about breaking work into small chunks - each chunk ends up as its own commit or PR.</p><p>Finally, don&#8217;t be afraid to <strong>use branches or worktrees</strong> to isolate AI experiments. One advanced workflow I&#8217;ve adopted (inspired by folks like Jesse Vincent) is to spin up a fresh git worktree for a new feature or sub-project. This lets me run multiple AI coding sessions in parallel on the same repo without them interfering, and I can later merge the changes. It&#8217;s a bit like having each AI task in its own sandbox branch. If one experiment fails, I throw away that worktree and nothing is lost in main. If it succeeds, I merge it in. This approach has been crucial when I&#8217;m, say, letting an AI implement Feature A while I (or another AI) work on Feature B simultaneously. Version control is what makes this coordination possible. 
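Worktrees are a built-in git feature; the sandbox-per-experiment flow looks roughly like this (branch and file names are invented):

```shell
# One checkout per experiment: each agent gets its own worktree and branch.
git init -q app && cd app
git config user.email "dev@example.com" && git config user.name "Dev"
echo "v1" > main.txt && git add -A && git commit -q -m "baseline"

git worktree add ../feature-a -b feature-a   # sandbox for Feature A
git worktree add ../feature-b -b feature-b   # sandbox for Feature B

echo "feature A work" > ../feature-a/a.txt   # the agents edit independently
echo "feature B work" > ../feature-b/b.txt

# Feature A succeeds: commit and merge it. Feature B fails: just discard it.
(cd ../feature-a && git add -A && git commit -q -m "Feature A")
git merge -q feature-a
git worktree remove --force ../feature-b
git branch -D feature-b
```

Nothing from the discarded worktree ever touches main, and the merged one arrives as ordinary commits.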
In short: <strong>commit often, organize your work with branches, and embrace git</strong> as the control mechanism to keep AI-generated changes manageable and reversible.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OfO1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2702481a-ef4d-4dc7-9247-0b331eb70568_1886x1042.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OfO1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2702481a-ef4d-4dc7-9247-0b331eb70568_1886x1042.png 424w, https://substackcdn.com/image/fetch/$s_!OfO1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2702481a-ef4d-4dc7-9247-0b331eb70568_1886x1042.png 848w, https://substackcdn.com/image/fetch/$s_!OfO1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2702481a-ef4d-4dc7-9247-0b331eb70568_1886x1042.png 1272w, https://substackcdn.com/image/fetch/$s_!OfO1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2702481a-ef4d-4dc7-9247-0b331eb70568_1886x1042.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OfO1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2702481a-ef4d-4dc7-9247-0b331eb70568_1886x1042.png" width="1456" height="804" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2702481a-ef4d-4dc7-9247-0b331eb70568_1886x1042.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:804,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1039878,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/181957927?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2702481a-ef4d-4dc7-9247-0b331eb70568_1886x1042.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OfO1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2702481a-ef4d-4dc7-9247-0b331eb70568_1886x1042.png 424w, https://substackcdn.com/image/fetch/$s_!OfO1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2702481a-ef4d-4dc7-9247-0b331eb70568_1886x1042.png 848w, https://substackcdn.com/image/fetch/$s_!OfO1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2702481a-ef4d-4dc7-9247-0b331eb70568_1886x1042.png 1272w, https://substackcdn.com/image/fetch/$s_!OfO1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2702481a-ef4d-4dc7-9247-0b331eb70568_1886x1042.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Customize the AI&#8217;s behavior with rules and examples</h2><p><strong>Steer your AI assistant by providing style guides, examples, and even &#8220;rules files&#8221; - a little upfront tuning yields much better outputs.</strong></p><p>One thing I learned is that you don&#8217;t have to accept the AI&#8217;s default style or approach - you can influence it heavily by giving it guidelines. For instance, I have a <strong>CLAUDE.md</strong> file that I update periodically, which contains process rules and preferences for Claude (Anthropic&#8217;s model) to follow (and similarly a GEMINI.md when using Gemini CLI). This includes things like &#8220;write code in our project&#8217;s style, follow our lint rules, don&#8217;t use certain functions, prefer functional style over OOP,&#8221; etc. 
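Mine evolves with the project, but a trimmed, illustrative CLAUDE.md might look like this (every rule below is an example, not a prescription):

```markdown
# CLAUDE.md - project rules for the agent

## Code style
- Follow the ESLint config in the repo; never disable rules inline.
- Prefer small pure functions over classes; no new OOP hierarchies.
- Use descriptive variable names; no single-letter names outside loops.

## Process
- Work one plan.md task at a time; stop after each for review.
- Run the test suite after every change; never report "done" with failing tests.
- If context is missing, ask - do not invent files, APIs, or dependencies.
```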
When I start a session, I feed this file to Claude to align it with our conventions. It&#8217;s surprising how well this works to keep the model &#8220;on track&#8221;, as Jesse Vincent <a href="https://blog.fsck.com/2025/10/05/how-im-using-coding-agents-in-september-2025/#:~:text=I%27m%20still%20primarily%20using%20Claude,Code">noted</a> - it reduces the tendency of the AI to go off-script or introduce patterns we don&#8217;t want.</p><p>Even without a fancy rules file, you can <strong>set the tone with custom instructions or system prompts</strong>. GitHub Copilot and Cursor both introduced features to let you configure the AI&#8217;s behavior <a href="https://benjamincongdon.me/blog/2025/02/02/How-I-Use-AI-Early-2025/#:~:text=stuck,my%20company%E2%80%99s%20%2F%20team%E2%80%99s%20codebase">globally</a> for your project. I&#8217;ve taken advantage of that by writing a short paragraph about our coding style, e.g. &#8220;Use 4-space indentation, avoid arrow functions in React, prefer descriptive variable names, code should pass ESLint.&#8221; With those instructions in place, the AI&#8217;s suggestions adhere much more closely to what a human teammate might write. Ben Congdon <a href="https://benjamincongdon.me/blog/2025/02/02/How-I-Use-AI-Early-2025/#:~:text=roughly%20on%20par,get%20past%20a%20logical%20impasse">mentioned</a> how shocked he was that few people use <strong>Copilot&#8217;s custom instructions</strong>, given how effective they are - he could guide the AI to output code matching his team&#8217;s idioms by providing some examples and preferences upfront. I echo that: take the time to teach the AI your expectations.</p><p>Another powerful technique is providing <strong>in-line examples</strong> of the output format or approach you want. 
If I want the AI to write a function in a very specific way, I might first show it a similar function already in the codebase: &#8220;Here&#8217;s how we implemented X, use a similar approach for Y.&#8221; If I want a certain commenting style, I might write a comment myself and ask the AI to continue in that style. Essentially, <em>prime</em> the model with the pattern to follow. LLMs are great at mimicry - show them one or two examples and they&#8217;ll continue in that vein.</p><p>The community has also come up with creative &#8220;rulesets&#8221; to tame LLM behavior. You might have heard of the <a href="https://harper.blog/2025/04/17/an-llm-codegen-heros-journey/#:~:text=repository,it%20in%20a%20few%20steps">&#8220;Big Daddy&#8221; rule</a> or adding a &#8220;no hallucination/no deception&#8221; clause to prompts. These are basically tricks to remind the AI to be truthful and not fabricate code or APIs that don&#8217;t exist. For example, I sometimes prepend a prompt with: &#8220;If you are unsure about something or the codebase context is missing, ask for clarification rather than making up an answer.&#8221; This reduces hallucinations. Another rule I use is: &#8220;Always explain your reasoning briefly in comments when fixing a bug.&#8221; This way, when the AI generates a fix, it will also leave a comment like &#8220;// Fixed: Changed X to Y to prevent Z (as per spec).&#8221; That&#8217;s super useful for later review.</p><p>In summary, <strong>don&#8217;t treat the AI as a black box - tune it</strong>. By configuring system instructions, sharing project docs, or writing down explicit rules, you turn the AI into a more specialized developer on your team. It&#8217;s akin to onboarding a new hire: you&#8217;d give them the style guide and some starter tips, right? Do the same for your AI pair programmer. 
The return on investment is huge: you get outputs that need less tweaking and integrate more smoothly with your codebase.</p><h2>Embrace testing and automation as force multipliers</h2><p><strong>Use your CI/CD, linters, and code review bots - AI will work best in an environment that catches mistakes automatically.</strong> </p><p>This is a corollary to staying in the loop and providing context: a well-oiled development pipeline enhances AI productivity. I ensure that any repository where I use heavy AI coding has a robust <strong>continuous integration setup</strong>. That means automated tests run on every commit or PR, code style checks (like ESLint, Prettier, etc.) are enforced, and ideally a staging deployment is available for any new branch. Why? Because I can let the AI trigger these and evaluate the results. For instance, if the AI opens a pull request via a tool like Jules or GitHub Copilot Agent, our CI will run tests and report failures. I can feed those failure logs back to the AI: &#8220;The integration tests failed with XYZ, let&#8217;s debug this.&#8221; It turns bug-fixing into a collaborative loop with quick feedback, which AIs handle quite well (they&#8217;ll suggest a fix, we run CI again, and iterate).</p><p>Automated code quality checks (linters, type checkers) also guide the AI. I actually include linter output in the prompt sometimes. If the AI writes code that doesn&#8217;t pass our linter, I&#8217;ll copy the linter errors into the chat and say &#8220;please address these issues.&#8221; The model then knows exactly what to do. It&#8217;s like having a strict teacher looking over the AI&#8217;s shoulder. In my experience, once the AI is aware of a tool&#8217;s output (like a failing test or a lint warning), it will try very hard to correct it - after all, it &#8220;wants&#8221; to produce the right answer. This ties back to providing context: give the AI the results of its actions in the environment (test failures, etc.) 
and it will learn from them.</p><p>AI coding agents themselves are increasingly incorporating automation hooks. Some agents will refuse to say a code task is &#8220;done&#8221; until all tests pass, which is exactly the diligence you want. Code review bots (AI or otherwise) act as another filter - I treat their feedback as additional prompts for improvement. For example, if CodeRabbit or another reviewer comments &#8220;This function is doing X, which is not ideal,&#8221; I will ask the AI, &#8220;Can you refactor based on this feedback?&#8221;</p><p>By combining AI with automation, you start to get a virtuous cycle. The AI writes code, the automated tools catch issues, the AI fixes them, and so forth, with you overseeing the high-level direction. It feels like having an extremely fast junior dev whose work is instantly checked by a tireless QA engineer. But remember, <em>you</em> set up that environment. If your project lacks tests or other automated checks, subtle bugs and quality problems in the AI&#8217;s work can slip through, only surfacing much later. </p><p>So as we head into 2026, one of my goals is to bolster the quality gates around AI code contribution: more tests, more monitoring, perhaps even AI-on-AI code reviews. It might sound paradoxical (AIs reviewing AIs), but I&#8217;ve seen it catch things one model missed. 
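Feeding tool output back works the same whatever the tool; here is a sketch in which a stand-in script plays the linter/CI role (the script and its error message are invented for illustration):

```shell
# Stand-in for a real linter or CI job - emits a failure the way a real run would.
cat > check.sh <<'EOF'
#!/bin/sh
echo "error: unused variable 'x' at src/app.js:12"
exit 1
EOF
chmod +x check.sh

# Capture the output; a nonzero exit means there is something to feed back.
if ! ./check.sh > lint.log 2>&1; then
  echo "check failed - paste lint.log into the agent prompt:"
  cat lint.log
fi
```

The captured log, not a paraphrase of it, is what goes into the prompt - the model does best with the tool's exact words.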
Bottom line: <strong>an AI-friendly workflow is one with strong automation - use those tools to keep the AI honest</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T25F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01a7a816-646f-4723-a44e-7177e2dbc2ae_1882x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T25F!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01a7a816-646f-4723-a44e-7177e2dbc2ae_1882x1048.png 424w, https://substackcdn.com/image/fetch/$s_!T25F!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01a7a816-646f-4723-a44e-7177e2dbc2ae_1882x1048.png 848w, https://substackcdn.com/image/fetch/$s_!T25F!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01a7a816-646f-4723-a44e-7177e2dbc2ae_1882x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!T25F!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01a7a816-646f-4723-a44e-7177e2dbc2ae_1882x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T25F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01a7a816-646f-4723-a44e-7177e2dbc2ae_1882x1048.png" width="1456" height="811" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/01a7a816-646f-4723-a44e-7177e2dbc2ae_1882x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:811,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2234413,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/181957927?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01a7a816-646f-4723-a44e-7177e2dbc2ae_1882x1048.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!T25F!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01a7a816-646f-4723-a44e-7177e2dbc2ae_1882x1048.png 424w, https://substackcdn.com/image/fetch/$s_!T25F!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01a7a816-646f-4723-a44e-7177e2dbc2ae_1882x1048.png 848w, https://substackcdn.com/image/fetch/$s_!T25F!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01a7a816-646f-4723-a44e-7177e2dbc2ae_1882x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!T25F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01a7a816-646f-4723-a44e-7177e2dbc2ae_1882x1048.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Continuously learn and adapt (AI amplifies your skills)</h2><p><strong>Treat every AI coding session as a learning opportunity - the more you know, the more the AI can help you, creating a virtuous cycle.</strong></p><p>One of the most exciting aspects of using LLMs in development is how much <em>I</em> have learned in the process. Rather than replacing my need to know things, AIs have actually exposed me to new languages, frameworks, and techniques I might not have tried on my own.</p><p>This pattern holds generally: if you come to the table with solid software engineering fundamentals, the AI will <strong>amplify</strong> your productivity multifold. If you lack that foundation, the AI might just amplify confusion. 
Seasoned devs have observed that LLMs &#8220;reward existing best practices&#8221; - things like writing clear specs, having good tests, doing code reviews, etc., all become even more powerful when an AI is involved. In my experience, the AI lets me operate at a higher level of abstraction (focusing on design, interface, architecture) while it churns out the boilerplate, but I need to <em>have</em> those high-level skills first. As Simon Willison notes, almost everything that makes someone a <strong>senior engineer</strong> (designing systems, managing complexity, knowing what to automate vs hand-code) is what now yields the best outcomes with AI. So using AIs has actually pushed me to <strong>up my engineering game</strong> - I&#8217;m more rigorous about planning and more conscious of architecture, because I&#8217;m effectively &#8220;managing&#8221; a very fast but somewhat na&#239;ve coder (the AI).</p><p>For those worried that using AI might degrade their abilities: I&#8217;d argue the opposite, if done right. By reviewing AI code, I&#8217;ve been exposed to new idioms and solutions. By debugging AI mistakes, I&#8217;ve deepened my understanding of the language and problem domain. I often ask the AI to explain its code or the rationale behind a fix - kind of like constantly interviewing a candidate about their code - and I pick up insights from its answers. I also use AI as a research assistant: if I&#8217;m not sure about a library or approach, I&#8217;ll ask it to enumerate options or compare trade-offs. It&#8217;s like having an encyclopedic mentor on call. All of this has made me a more knowledgeable programmer.</p><p>The big picture is that <strong>AI tools amplify your expertise</strong>. Going into 2026, I&#8217;m not afraid of them &#8220;taking my job&#8221; - I&#8217;m excited that they free me from drudgery and allow me to spend more time on creative and complex aspects of software engineering. 
But I&#8217;m also aware that for those without a solid base, AI can lead to Dunning-Kruger on steroids (it may <em>seem</em> like you built something great, until it falls apart). So my advice: continue honing your craft, and use the AI to accelerate that process. Be intentional about periodically coding without AI too, to keep your raw skills sharp. In the end, the developer + AI duo is far more powerful than either alone, and the <em>developer</em> half of that duo has to hold up their end.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1K_1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6831a588-c75e-4992-9813-84dee28de46d_1876x1054.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1K_1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6831a588-c75e-4992-9813-84dee28de46d_1876x1054.png 424w, https://substackcdn.com/image/fetch/$s_!1K_1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6831a588-c75e-4992-9813-84dee28de46d_1876x1054.png 848w, https://substackcdn.com/image/fetch/$s_!1K_1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6831a588-c75e-4992-9813-84dee28de46d_1876x1054.png 1272w, https://substackcdn.com/image/fetch/$s_!1K_1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6831a588-c75e-4992-9813-84dee28de46d_1876x1054.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!1K_1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6831a588-c75e-4992-9813-84dee28de46d_1876x1054.png" width="1456" height="818" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6831a588-c75e-4992-9813-84dee28de46d_1876x1054.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:818,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:427794,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/181957927?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6831a588-c75e-4992-9813-84dee28de46d_1876x1054.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1K_1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6831a588-c75e-4992-9813-84dee28de46d_1876x1054.png 424w, https://substackcdn.com/image/fetch/$s_!1K_1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6831a588-c75e-4992-9813-84dee28de46d_1876x1054.png 848w, https://substackcdn.com/image/fetch/$s_!1K_1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6831a588-c75e-4992-9813-84dee28de46d_1876x1054.png 1272w, https://substackcdn.com/image/fetch/$s_!1K_1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6831a588-c75e-4992-9813-84dee28de46d_1876x1054.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Conclusion</h2><p>As we enter 2026, I&#8217;ve fully embraced AI in my development workflow - but in a considered, expert-driven way. My approach is essentially <strong>&#8220;AI-augmented software engineering&#8221;</strong> rather than AI-automated software engineering.</p><p>I&#8217;ve learned: <strong>the best results come when you apply classic software engineering discipline to your AI collaborations</strong>. 
It turns out all our hard-earned practices - design before coding, write tests, use version control, maintain standards - not only still apply, but are even more important when an AI is writing half your code.</p><p>I&#8217;m excited for what&#8217;s next. The tools keep improving and my workflow will surely evolve alongside them. Perhaps fully autonomous &#8220;AI dev interns&#8221; will tackle more grunt work while we focus on higher-level tasks. Perhaps new paradigms of debugging and code exploration will emerge. No matter what, I plan to stay <em>in the loop</em> - guiding the AIs, learning from them, and amplifying my productivity responsibly.</p><p>The bottom line for me: <strong>AI coding assistants are incredible force multipliers, but the human engineer remains the director of the show.</strong></p><p>With that&#8230;happy building in 2026! &#128640;</p><p><em>I&#8217;m excited to share I&#8217;ve released a new <a href="https://beyond.addy.ie/">AI-assisted engineering book</a> with O&#8217;Reilly. 
There are a number of free tips on the book site in case interested.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ukkU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc5ca0ef-614c-42e0-85f3-3663e9871580_7838x7838.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ukkU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc5ca0ef-614c-42e0-85f3-3663e9871580_7838x7838.png 424w, https://substackcdn.com/image/fetch/$s_!ukkU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc5ca0ef-614c-42e0-85f3-3663e9871580_7838x7838.png 848w, https://substackcdn.com/image/fetch/$s_!ukkU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc5ca0ef-614c-42e0-85f3-3663e9871580_7838x7838.png 1272w, https://substackcdn.com/image/fetch/$s_!ukkU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc5ca0ef-614c-42e0-85f3-3663e9871580_7838x7838.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ukkU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc5ca0ef-614c-42e0-85f3-3663e9871580_7838x7838.png" width="1456" height="1456" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc5ca0ef-614c-42e0-85f3-3663e9871580_7838x7838.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2203490,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/181957927?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc5ca0ef-614c-42e0-85f3-3663e9871580_7838x7838.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ukkU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc5ca0ef-614c-42e0-85f3-3663e9871580_7838x7838.png 424w, https://substackcdn.com/image/fetch/$s_!ukkU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc5ca0ef-614c-42e0-85f3-3663e9871580_7838x7838.png 848w, https://substackcdn.com/image/fetch/$s_!ukkU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc5ca0ef-614c-42e0-85f3-3663e9871580_7838x7838.png 1272w, https://substackcdn.com/image/fetch/$s_!ukkU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc5ca0ef-614c-42e0-85f3-3663e9871580_7838x7838.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[21 Lessons from 14 Years at Google]]></title><description><![CDATA[On code, careers, and the human side of engineering]]></description><link>https://addyo.substack.com/p/21-lessons-from-14-years-at-google</link><guid isPermaLink="false">https://addyo.substack.com/p/21-lessons-from-14-years-at-google</guid><dc:creator><![CDATA[Addy Osmani]]></dc:creator><pubDate>Thu, 04 Dec 2025 15:30:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-kh4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F746cd1ff-8111-4f8f-b7f9-84db223f998f_7838x7838.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When I joined Google ~14 years ago, I thought the job was about writing great code. 
I was partly right. But the longer I&#8217;ve stayed, the more I&#8217;ve realized that the engineers who thrive aren&#8217;t necessarily the best programmers - they&#8217;re the ones who&#8217;ve figured out how to navigate everything around the code: the people, the politics, the alignment, the ambiguity.</p><p>These lessons are what I wish I&#8217;d known earlier. Some would have saved me months of frustration. Others took years to fully understand. None of them are about specific technologies - those change too fast to matter. They&#8217;re about the patterns that keep showing up, project after project, team after team.</p><p>I&#8217;m sharing them because I&#8217;ve benefited enormously from engineers who did the same for me. Consider this my attempt to pay it forward.</p><h2>1. The best engineers are obsessed with solving user problems.</h2><p>It&#8217;s seductive to fall in love with a technology and go looking for places to apply it. I&#8217;ve done it. Everyone has. But the engineers who create the most value work backwards: they become obsessed with understanding user problems deeply, and let solutions emerge from that understanding.</p><p>User obsession means spending time in support tickets, talking to users, watching users struggle, asking &#8220;why&#8221; until you hit bedrock. The engineer who truly understands the problem often finds that the elegant solution is simpler than anyone expected.</p><p>The engineer who starts with a solution tends to build complexity in search of a justification.</p><h2>2. Being right is cheap. Getting to right together is the real work.</h2><p>You can win every technical argument and lose the project. I&#8217;ve watched brilliant engineers accrue silent resentment by always being the smartest person in the room. The cost shows up later as &#8220;mysterious execution issues&#8221; and &#8220;strange resistance.&#8221;</p><p>The skill isn&#8217;t being right. 
It&#8217;s entering discussions to align on the problem, creating space for others, and remaining skeptical of your own certainty.</p><p>Strong opinions, weakly held - not because you lack conviction, but because decisions made under uncertainty shouldn&#8217;t be welded to identity.</p><h2>3. Bias towards action. Ship. You can edit a bad page, but you can&#8217;t edit a blank one.</h2><p>The quest for perfection is paralyzing. I&#8217;ve watched engineers spend weeks debating the ideal architecture for something they&#8217;ve never built. The perfect solution rarely emerges from thought alone - it emerges from contact with reality. AI can in many ways help here.</p><p>First do it, then do it right, then do it better. Get the ugly prototype in front of users. Write the messy first draft of the design doc. Ship the MVP that embarrasses you slightly. You&#8217;ll learn more from one week of real feedback than a month of theoretical debate.</p><p>Momentum creates clarity. Analysis paralysis creates nothing.</p><h2>4. Clarity is seniority. Cleverness is overhead.</h2><p>The instinct to write clever code is almost universal among engineers. It feels like proof of competence. </p><p>But software engineering is what happens when you add time and other programmers. In that environment, clarity isn&#8217;t a style preference - it&#8217;s operational risk reduction.</p><p>Your code is a strategy memo to strangers who will maintain it at 2am during an outage. Optimize for their comprehension, not your elegance. The senior engineers I respect most have learned to trade cleverness for clarity, every time.</p><h2>5. Novelty is a loan you repay in outages, hiring, and cognitive overhead.</h2><p>Treat your technology choices like an organization with a small &#8220;innovation token&#8221; budget. Spend one each time you adopt something materially non-standard. 
You can&#8217;t afford many.</p><p>The punchline isn&#8217;t &#8220;never innovate.&#8221; It&#8217;s &#8220;innovate only where you&#8217;re uniquely paid to innovate.&#8221; Everything else should default to boring, because boring has known failure modes. </p><p>The &#8220;best tool for the job&#8221; is often the &#8220;least-worst tool across many jobs&#8221; - because operating a zoo becomes the real tax.</p><div><hr></div><h2>6. Your code doesn&#8217;t advocate for you. People do.</h2><p>Early in my career, I believed great work would speak for itself. I was wrong. Code sits silently in a repository. Your manager mentions you in a meeting, or they don&#8217;t. A peer recommends you for a project, or someone else.</p><p>In large organizations, decisions get made in meetings you&#8217;re not invited to, using summaries you didn&#8217;t write, by people who have five minutes and twelve priorities. If no one can articulate your impact when you&#8217;re not in the room, your impact is effectively optional.</p><p>This isn&#8217;t strictly about self-promotion. It&#8217;s about making the value chain legible to everyone - including yourself.</p><h2>7. The best code is the code you never had to write.</h2><p>We celebrate creation in engineering culture. Nobody gets promoted for deleting code, even though deletion often improves a system more than addition. Every line of code you don&#8217;t write is a line you never have to debug, maintain, or explain.</p><p>Before you build, exhaust the question: &#8220;What would happen if we just&#8230; didn&#8217;t?&#8221; Sometimes the answer is &#8220;nothing bad,&#8221; and that&#8217;s your solution. </p><p>The problem isn&#8217;t that engineers can&#8217;t write code or use AI to do so. It&#8217;s that we&#8217;re so good at writing it that we forget to ask whether we should.</p><div><hr></div><h2>8. 
At scale, even your bugs have users.</h2><p>With enough users, every observable behavior becomes a dependency - regardless of what you promised. Someone is scraping your API, automating your quirks, caching your bugs.</p><p>This creates a career-level insight: you can&#8217;t treat compatibility work as &#8220;maintenance&#8221; and new features as &#8220;real work.&#8221; Compatibility is product. </p><p>Design your deprecations as migrations with time, tooling, and empathy. Most &#8220;API design&#8221; is actually &#8220;API retirement.&#8221;</p><h2>9. Most &#8220;slow&#8221; teams are actually misaligned teams.</h2><p>When a project drags, the instinct is to blame execution: people aren&#8217;t working hard enough, the technology is wrong, there aren&#8217;t enough engineers. Usually none of that is the real problem.</p><p>In large companies, teams are your unit of concurrency, but coordination costs grow geometrically as teams multiply. Most slowness is actually alignment failure - people building the wrong things, or the right things in incompatible ways. </p><p>Senior engineers spend more time clarifying direction, interfaces, and priorities than &#8220;writing code faster&#8221; because that&#8217;s where the actual bottleneck lives.</p><h2>10. Focus on what you can control. Ignore what you can&#8217;t.</h2><p>In a large company, countless variables are outside your control - organizational changes, management decisions, market shifts, product pivots. Dwelling on these creates anxiety without agency.</p><p>The engineers who stay sane and effective zero in on their sphere of influence. You can&#8217;t control whether a reorg happens. You can control the quality of your work, how you respond, and what you learn. When faced with uncertainty, break problems into pieces and identify the specific actions available to you. </p><p>This isn&#8217;t passive acceptance - it&#8217;s strategic focus. 
Energy spent on what you can&#8217;t change is energy stolen from what you can.</p><h2>11. Abstractions don&#8217;t remove complexity. They move it to the day you&#8217;re on call.</h2><p>Every abstraction is a bet that you won&#8217;t need to understand what&#8217;s underneath. Sometimes you win that bet. But something always leaks, and when it does, you need to know what you&#8217;re standing on.</p><p>Senior engineers keep learning &#8220;lower level&#8221; things even as stacks get higher. Not out of nostalgia, but out of respect for the moment when the abstraction fails and you&#8217;re alone with the system at 3am. Use your stack. </p><p>But keep a working model of its underlying failure modes.</p><h2>12. Writing forces clarity. The fastest way to learn something better is to try teaching it.</h2><p>Writing forces clarity. When I explain a concept to others - in a doc, a talk, a code review comment, even just chatting with AI - I discover the gaps in my own understanding. The act of making something legible to someone else makes it more legible to me.</p><p>This doesn&#8217;t mean that you&#8217;re going to learn how to be a surgeon by teaching it, but the premise still holds largely true in the software engineering domain.</p><p>This isn&#8217;t just about being generous with knowledge. It&#8217;s a selfish learning hack. If you think you understand something, try to explain it simply. The places where you stumble are the places where your understanding is shallow. </p><p>Teaching is debugging your own mental models.</p><h2>13. The work that makes other work possible is priceless - and invisible.</h2><p>Glue work - documentation, onboarding, cross-team coordination, process improvement - is vital. But if you do it unconsciously, it can stall your technical trajectory and burn you out. The trap is doing it as &#8220;helpfulness&#8221; rather than treating it as deliberate, bounded, visible impact.</p><p>Timebox it. Rotate it. 
Turn it into artifacts: docs, templates, automation. And make it legible as impact, not as personality trait. </p><p>Priceless and invisible is a dangerous combination for your career.</p><h2>14. If you win every debate, you&#8217;re probably accumulating silent resistance.</h2><p>I&#8217;ve learned to be suspicious of my own certainty. When I &#8220;win&#8221; too easily, something is usually wrong. People stop fighting you not because you&#8217;ve convinced them, but because they&#8217;ve given up trying - and they&#8217;ll express that disagreement in execution, not meetings.</p><p>Real alignment takes longer. You have to actually understand other perspectives, incorporate feedback, and sometimes change your mind publicly. </p><p>The short-term feeling of being right is worth much less than the long-term reality of building things with willing collaborators.</p><h2>15. When a measure becomes a target, it stops measuring.</h2><p>Every metric you expose to management will eventually be gamed. Not through malice, but because humans optimize for what&#8217;s measured. </p><p>If you track lines of code, you&#8217;ll get more lines. If you track velocity, you&#8217;ll get inflated estimates.  </p><p>The senior move: respond to every metric request with a pair. One for speed. One for quality or risk. Then insist on interpreting trends, not worshiping thresholds. The goal is insight, not surveillance.</p><h2>16. Admitting what you don&#8217;t know creates more safety than pretending you do.</h2><p>Senior engineers who say &#8220;I don&#8217;t know&#8221; aren&#8217;t showing weakness - they&#8217;re creating permission. When a leader admits uncertainty, it signals that the room is safe for others to do the same. The alternative is a culture where everyone pretends to understand and problems stay hidden until they explode.</p><p>I&#8217;ve seen teams where the most senior person never admitted confusion, and I&#8217;ve seen the damage. Questions don&#8217;t get asked. 
Assumptions don&#8217;t get challenged. Junior engineers stay silent because they assume everyone else gets it. </p><p>Model curiosity, and you get a team that actually learns.</p><h2>17. Your network outlasts every job you&#8217;ll ever have.</h2><p>Early in my career, I focused on the work and neglected networking. In hindsight, this was a mistake. Colleagues who invested in relationships - inside and outside the company - reaped benefits for decades. </p><p>They heard about opportunities first, could build bridges faster, got recommended for roles, and co-founded ventures with people they&#8217;d built trust with over years.</p><p>Your job isn&#8217;t forever, but your network is. Approach it with curiosity and generosity, not transactional hustle. </p><p>When the time comes to move on, it&#8217;s often relationships that open the door.</p><h2>18. Most performance wins come from removing work, not adding cleverness.</h2><p>When systems get slow, the instinct is to add: caching layers, parallel processing, smarter algorithms. Sometimes that&#8217;s right. But I&#8217;ve seen more performance wins from asking &#8220;what are we computing that we don&#8217;t need?&#8221;</p><p>Deleting unnecessary work is almost always more impactful than doing necessary work faster. The fastest code is code that never runs. </p><p>Before you optimize, question whether the work should exist at all.</p><h2>19. Process exists to reduce uncertainty, not to create paper trails.</h2><p>The best process makes coordination easier and failures cheaper. The worst process is bureaucratic theater - it exists not to help but to assign blame when things go wrong.</p><p>If you can&#8217;t explain how a process reduces risk or increases clarity, it&#8217;s probably just overhead. </p><p>And if people are spending more time documenting their work than doing it, something has gone deeply wrong.</p><h2>20. Eventually, time becomes worth more than money. 
Act accordingly.</h2><p>Early in your career, you trade time for money - and that&#8217;s fine. But at some point, the calculus inverts. You start to realize that time is the non-renewable resource.</p><p>I&#8217;ve watched senior engineers burn out chasing the next promo level, optimizing for a few more percentage points of compensation. Some of them got it. Most of them wondered, afterward, if it was worth what they gave up.</p><p>The answer isn&#8217;t &#8220;don&#8217;t work hard.&#8221; It&#8217;s &#8220;know what you&#8217;re trading, and make the trade deliberately.&#8221;</p><h2>21. There are no shortcuts, but there is compounding.</h2><p>Expertise comes from deliberate practice - pushing slightly beyond your current skill, reflecting, repeating. For years. There&#8217;s no condensed version.</p><p>But here&#8217;s the hopeful part: learning compounds when it creates new options, not just new trivia. Write - not for engagement, but for clarity. Build reusable primitives. Collect scar tissue into playbooks.</p><p>The engineer who treats their career as compound interest, not lottery tickets, tends to end up much further ahead.</p><h2>A final thought</h2><p>Twenty-one lessons sounds like a lot, but they really come down to a few core ideas: stay curious, stay humble, and remember that the work is always about people - the users you&#8217;re building for and the teammates you&#8217;re building with.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-JAK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ff24d22-2b08-4900-b733-bb857e7e4459_2736x2737.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!-JAK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ff24d22-2b08-4900-b733-bb857e7e4459_2736x2737.jpeg 424w, https://substackcdn.com/image/fetch/$s_!-JAK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ff24d22-2b08-4900-b733-bb857e7e4459_2736x2737.jpeg 848w, https://substackcdn.com/image/fetch/$s_!-JAK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ff24d22-2b08-4900-b733-bb857e7e4459_2736x2737.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!-JAK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ff24d22-2b08-4900-b733-bb857e7e4459_2736x2737.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-JAK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ff24d22-2b08-4900-b733-bb857e7e4459_2736x2737.jpeg" width="360" height="360.24725274725273" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ff24d22-2b08-4900-b733-bb857e7e4459_2736x2737.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1457,&quot;width&quot;:1456,&quot;resizeWidth&quot;:360,&quot;bytes&quot;:4066966,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180675155?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ff24d22-2b08-4900-b733-bb857e7e4459_2736x2737.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-JAK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ff24d22-2b08-4900-b733-bb857e7e4459_2736x2737.jpeg 424w, https://substackcdn.com/image/fetch/$s_!-JAK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ff24d22-2b08-4900-b733-bb857e7e4459_2736x2737.jpeg 848w, https://substackcdn.com/image/fetch/$s_!-JAK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ff24d22-2b08-4900-b733-bb857e7e4459_2736x2737.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!-JAK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ff24d22-2b08-4900-b733-bb857e7e4459_2736x2737.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A career in engineering is long enough to make plenty of mistakes and still come out ahead. The engineers I admire most aren&#8217;t the ones who got everything right - they&#8217;re the ones who learned from what went wrong, shared what they discovered, and kept showing up.</p><p>If you&#8217;re early in your journey, know that it gets richer with time. 
If you&#8217;re deep into it, I hope some of these resonate.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-kh4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F746cd1ff-8111-4f8f-b7f9-84db223f998f_7838x7838.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-kh4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F746cd1ff-8111-4f8f-b7f9-84db223f998f_7838x7838.png 424w, https://substackcdn.com/image/fetch/$s_!-kh4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F746cd1ff-8111-4f8f-b7f9-84db223f998f_7838x7838.png 848w, https://substackcdn.com/image/fetch/$s_!-kh4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F746cd1ff-8111-4f8f-b7f9-84db223f998f_7838x7838.png 1272w, https://substackcdn.com/image/fetch/$s_!-kh4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F746cd1ff-8111-4f8f-b7f9-84db223f998f_7838x7838.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-kh4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F746cd1ff-8111-4f8f-b7f9-84db223f998f_7838x7838.png" width="1456" height="1456" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/746cd1ff-8111-4f8f-b7f9-84db223f998f_7838x7838.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1216515,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/180675155?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F746cd1ff-8111-4f8f-b7f9-84db223f998f_7838x7838.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-kh4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F746cd1ff-8111-4f8f-b7f9-84db223f998f_7838x7838.png 424w, https://substackcdn.com/image/fetch/$s_!-kh4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F746cd1ff-8111-4f8f-b7f9-84db223f998f_7838x7838.png 848w, https://substackcdn.com/image/fetch/$s_!-kh4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F746cd1ff-8111-4f8f-b7f9-84db223f998f_7838x7838.png 1272w, https://substackcdn.com/image/fetch/$s_!-kh4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F746cd1ff-8111-4f8f-b7f9-84db223f998f_7838x7838.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p>]]></content:encoded></item><item><title><![CDATA[Treat AI-Generated code as a draft]]></title><description><![CDATA[Keep human eyes, judgment, and ownership at the center of AI written code]]></description><link>https://addyo.substack.com/p/treat-ai-generated-code-as-a-draft</link><guid isPermaLink="false">https://addyo.substack.com/p/treat-ai-generated-code-as-a-draft</guid><dc:creator><![CDATA[Addy Osmani]]></dc:creator><pubDate>Tue, 25 Nov 2025 16:43:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!esjQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c76c36d-5242-4274-a946-99821cd84da8_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>tl;dr:</strong> <strong>Treat AI-generated code as a draft. 
It can write the first version, but never outsource the reading.</strong> No human review means no reliable trace from behavior back to intent. When you stop reviewing AI drafts, you stop knowing why the code works at all. Practically, hold AI-written code to the same standards as code written by human teammates.</p><h2><strong>Never outsource the reading - always review AI&#8217;s first draft</strong></h2><p><strong>AI can write a first version of code, but </strong><em><strong>humans must do the reading and reviewing</strong></em><strong> to ensure intent and quality.</strong> </p><p>If you stop reviewing AI-generated drafts, you stop knowing why the code works (or if it truly does) &#8211; there&#8217;s no reliable trace from behavior back to intent. In other words, <em>LLMs don&#8217;t ship bad code, teams do</em>. When no one takes responsibility for checking AI-written code, bad code slips through not because the model failed, but because the workflow failed to demand a higher standard <a href="https://asymm.com/the-new-rules-of-ai-generated-code-accountability/#:~:text=LLMs%20Don%E2%80%99t%20Ship%20Bad%20Code%2C,Teams%20Do">[1]</a>. </p><p>Treat the AI&#8217;s output as <strong>untrusted input</strong> &#8211; it might be syntactically correct and even pass tests, but it hasn&#8217;t earned your trust until a human verifies it. AI models often produce <em>plausible-looking but subtly flawed code</em>, including hallucinated functions or insecure patterns <a href="https://www.metacto.com/blogs/establishing-code-review-standards-for-ai-generated-code#:~:text=The%20Phantom%20Menace%20of%20%E2%80%9CHallucinated%E2%80%9D,Code">[2]</a>. So never merge code that hasn&#8217;t been read and understood by a human. 
As one engineer put it, blindly trusting AI output without verification risks immediate bugs <em>and</em> &#8220;systematically degrades our ability to catch these errors&#8221; because the very skills needed to validate code atrophy from disuse <a href="https://codebytom.blog/2025/07/09/the-hidden-cost-of-ai-reliance/comment-page-1/#:~:text=When%20we%20blindly%20trust%20AI,ones%20that%20atrophy%20from%20disuse">[3]</a>. </p><p>In short, always insist on a human-in-the-loop: AI can draft, but only a human can ensure the code&#8217;s behavior matches the intended purpose.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QEAT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf803e1f-1e2a-4e71-b3da-3ee98dc891b6_1892x1058.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QEAT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf803e1f-1e2a-4e71-b3da-3ee98dc891b6_1892x1058.png 424w, https://substackcdn.com/image/fetch/$s_!QEAT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf803e1f-1e2a-4e71-b3da-3ee98dc891b6_1892x1058.png 848w, https://substackcdn.com/image/fetch/$s_!QEAT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf803e1f-1e2a-4e71-b3da-3ee98dc891b6_1892x1058.png 1272w, https://substackcdn.com/image/fetch/$s_!QEAT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf803e1f-1e2a-4e71-b3da-3ee98dc891b6_1892x1058.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!QEAT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf803e1f-1e2a-4e71-b3da-3ee98dc891b6_1892x1058.png" width="1456" height="814" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cf803e1f-1e2a-4e71-b3da-3ee98dc891b6_1892x1058.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:814,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:264046,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/179880341?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf803e1f-1e2a-4e71-b3da-3ee98dc891b6_1892x1058.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QEAT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf803e1f-1e2a-4e71-b3da-3ee98dc891b6_1892x1058.png 424w, https://substackcdn.com/image/fetch/$s_!QEAT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf803e1f-1e2a-4e71-b3da-3ee98dc891b6_1892x1058.png 848w, https://substackcdn.com/image/fetch/$s_!QEAT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf803e1f-1e2a-4e71-b3da-3ee98dc891b6_1892x1058.png 1272w, https://substackcdn.com/image/fetch/$s_!QEAT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf803e1f-1e2a-4e71-b3da-3ee98dc891b6_1892x1058.png 1456w" 
sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2><strong>Blind reliance on AI erodes critical thinking and skills</strong></h2><p><strong>Engineering leaders worry that developers who blindly accept AI-generated code will lose their critical thinking abilities.</strong> </p><p>The concern isn&#8217;t hypothetical &#8211; early research bears it out. Studies have found that heavy use of AI assistants correlates with <em>lower brain engagement and reduced critical thinking performance </em><a href="https://codebytom.blog/2025/07/09/the-hidden-cost-of-ai-reliance/comment-page-1/#:~:text=Research%20consistently%20points%20to%20concerning,These">[4]</a>. In practice, developers dependent on AI may skip fundamental tasks like reading documentation or debugging errors themselves. One veteran engineer confessed that using AI&#8217;s instant answers made him &#8220;worse at his own craft.&#8221; </p><p>He stopped reading docs (&#8220;why bother when an LLM can explain it instantly?&#8221;) and even stopped analyzing errors &#8211; instead, he&#8217;d copy-paste stack traces into the AI and paste the AI&#8217;s answers back into the code. &#8220;I&#8217;ve become a human clipboard,&#8221; he lamented <a href="https://codebytom.blog/2025/07/09/the-hidden-cost-of-ai-reliance/comment-page-1/#:~:text=What%20does%20this%20look%20like,and%20solutions%20back%20to%20code">[5]</a>. This kind of cognitive offloading means the developer isn&#8217;t reasoning through problems anymore; the AI is doing the thinking, and the human is just transcribing. 
The result is not only diminished skill, but also less vigilance &#8211; if developers assume the AI is always right, they may miss subtle bugs or security issues they would have caught before. In fact, the ease and polish of AI output can lull engineers into a false sense of security, lowering their skepticism during reviews <a href="https://asymm.com/the-new-rules-of-ai-generated-code-accountability/#:~:text=1,Skepticism">[6]</a>. </p><p>The irony is that AI was supposed to boost productivity, but over-reliance can make individuals <em>less</em> capable. &#8220;We&#8217;re not becoming 10&#215; developers with AI, we&#8217;re becoming 10&#215; dependent on AI,&#8221; as one author observed &#8211; trading long-term understanding for short-term speed <a href="https://codebytom.blog/2025/07/09/the-hidden-cost-of-ai-reliance/comment-page-1/#:~:text=When%20we%20consistently%20choose%20the,that%20leads%20to%20breakthrough%20innovations">[7]</a>. The takeaway: to maintain your engineering sharpness, you must stay intellectually engaged with the code. Use AI as a tool, not a crutch &#8211; always challenge and verify its solutions rather than accepting them blindly.</p><h2><strong>Skipping the learning process in favor of speed hurts growth</strong></h2><p><strong>Many teams have leapt straight into using AI for speed, bypassing the learning and understanding that should accompany its use.</strong> </p><p>The promise of AI coding tools is high velocity &#8211; <em>generate, generate, generate</em> &#8211; but this often comes at the expense of developers truly grasping what they&#8217;re building. When you rely on AI to write code you don&#8217;t fully understand, you are <em>skipping the essential learning process</em> that makes you a better engineer <a href="https://matthewmartin.dev/posts/20250202-dont-outsource-what-you-dont-understand/#:~:text=Worse%20still%2C%20is%20that%20this,because%20you%20use%20AI%20without">[8]</a>. 
</p><p>The mistakes, trial-and-error, and research that traditionally accompany coding aren&#8217;t just hurdles &#8211; they are the training ground where critical skills develop. By outsourcing the heavy lifting to AI, junior devs in particular may never acquire the depth of knowledge to assess or improve the code being produced. </p><p>This creates a vicious cycle: <em>you produce poor code because you use AI without experience, and you never gain experience because you keep using AI</em><a href="https://matthewmartin.dev/posts/20250202-dont-outsource-what-you-dont-understand/#:~:text=Worse%20still%2C%20is%20that%20this,because%20you%20use%20AI%20without">[8]</a>. As one commentator bluntly asked, <em>if your role is reduced to just prompting AI for code you don&#8217;t understand, what value are you adding?</em><a href="https://matthewmartin.dev/posts/20250202-dont-outsource-what-you-dont-understand/#:~:text=The%20mistakes%2C%20the%20trial%20and,value%20are%20you%20really%20adding">[9]</a>.</p><p>We&#8217;ve largely skipped the phase where AI could be used as a learning aid or tutor, and jumped straight to using it as an auto-coder for output. Ideally, developers would use AI to <strong>improve understanding</strong> &#8211; for example, asking an AI to explain a tricky piece of code, or to suggest why a solution works &#8211; and even do a local &#8220;self review&#8221; with the AI before handing code to others. But in practice, many are just hitting &#8220;accept&#8221; on suggestions and moving on. This means they might deliver a feature faster, but with only shallow knowledge of how it works or why certain patterns were used. </p><p>Over time, that lack of understanding accumulates into a serious skill gap. Senior engineers worry about newcomers who can pump out code with AI assistance yet struggle to debug or extend it, because they never <em>learned</em> the underlying concepts. 
Indeed, engineering leaders report that while juniors now ship features faster than ever, when something breaks &#8220;they struggle to debug code they don&#8217;t understand&#8221; <a href="https://www.softwareseni.com/why-ai-coding-speed-gains-disappear-in-code-reviews/#:~:text=AI,debug%20code%20they%20don%E2%80%99t%20understand">[10]</a>. </p><p>The craft of software engineering is about far more than producing code that <em>runs</em> &#8211; it&#8217;s about knowing <em>why</em> the code is written that way, and how to evolve it. If we sidestep that journey, we risk creating a generation of programmers who can only operate with an AI on autopilot. To counteract this, treat AI output as an opportunity to learn: don&#8217;t just copy-paste answers, <strong>read them, question them, and ensure you could explain them</strong> to a colleague. Use AI to accelerate your work, not bypass your growth as an engineer <a href="https://matthewmartin.dev/posts/20250202-dont-outsource-what-you-dont-understand/#:~:text=Worse%20still%2C%20is%20that%20this,because%20you%20use%20AI%20without">[11]</a><a href="https://matthewmartin.dev/posts/20250202-dont-outsource-what-you-dont-understand/#:~:text=The%20mistakes%2C%20the%20trial%20and,AI%20for%20code%20you%20don%E2%80%99t">[12]</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!z54T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb041a203-0570-4199-b680-9d464a63ab3f_1824x1020.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!z54T!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb041a203-0570-4199-b680-9d464a63ab3f_1824x1020.png 424w, 
https://substackcdn.com/image/fetch/$s_!z54T!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb041a203-0570-4199-b680-9d464a63ab3f_1824x1020.png 848w, https://substackcdn.com/image/fetch/$s_!z54T!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb041a203-0570-4199-b680-9d464a63ab3f_1824x1020.png 1272w, https://substackcdn.com/image/fetch/$s_!z54T!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb041a203-0570-4199-b680-9d464a63ab3f_1824x1020.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!z54T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb041a203-0570-4199-b680-9d464a63ab3f_1824x1020.png" width="1456" height="814" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b041a203-0570-4199-b680-9d464a63ab3f_1824x1020.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:814,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:779761,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/179880341?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb041a203-0570-4199-b680-9d464a63ab3f_1824x1020.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!z54T!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb041a203-0570-4199-b680-9d464a63ab3f_1824x1020.png 424w, https://substackcdn.com/image/fetch/$s_!z54T!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb041a203-0570-4199-b680-9d464a63ab3f_1824x1020.png 848w, https://substackcdn.com/image/fetch/$s_!z54T!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb041a203-0570-4199-b680-9d464a63ab3f_1824x1020.png 1272w, https://substackcdn.com/image/fetch/$s_!z54T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb041a203-0570-4199-b680-9d464a63ab3f_1824x1020.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>I&#8217;ve heard of seniors sending back PRs where it&#8217;s clear AI was used but the person didn&#8217;t understand what they were doing. When a junior submits an AI-generated PR, the review becomes the primary venue for mentorship. Ask Socratic questions that force them to explain the AI&#8217;s output. This ensures understanding, not just functionality. Reviews become about comprehension, not just correctness.</em></p><h2><strong>Code reviews are straining under AI-generated code</strong></h2><p><strong>Traditional code review practices are struggling to cope with AI-generated code, leaving teams unsure how to maintain quality.</strong> </p><p>Code reviews have always been the safety net for catching errors and ensuring code quality. But AI assistance changes the game: AI can produce <em>much larger diffs</em> in an instant, often touching many lines or files, which means reviewers face more volume and potentially more complexity in each pull request. In fact, studies found that pull requests heavy with Copilot-generated code take about <strong>26% longer to review on average</strong>, because reviewers must untangle unfamiliar patterns and double-check for AI-specific mistakes <a href="https://thenewstack.io/how-to-measure-the-roi-of-ai-coding-assistants/">[13]</a>. 
</p><p>Reviewers also report a psychological effect: when examining code they didn&#8217;t write, especially if it&#8217;s syntactically polished, their confidence drops &#8211; they take longer to validate logic and may second-guess their understanding <a href="https://www.softwareseni.com/why-ai-coding-speed-gains-disappear-in-code-reviews/#:~:text=requests%2C%20unfamiliar%20code%20patterns%2C%20and,take%20longer%20to%20validate%20logic">[14]</a>. AI can churn out code that <em>looks</em> clean and modern (consistent naming, proper formatting) which can lower reviewers&#8217; skepticism <a href="https://asymm.com/the-new-rules-of-ai-generated-code-accountability/#:~:text=1,Skepticism">[6]</a>. It&#8217;s easy to assume the code is sound if it &#8220;looks professional,&#8221; making it more likely that subtle bugs or design flaws slip through.</p><p>Another complication is <strong>lost intent</strong>. In a traditional review, the reviewer can discuss &#8220;what the author meant to do&#8221; &#8211; there&#8217;s a human intention to compare against the implementation. With AI-generated code, the code&#8217;s author might not fully grasp the intent behind every line, because they didn&#8217;t <em>write</em> it in the conventional sense. The original prompt given to the AI is essentially the spec, but reviewers often don&#8217;t see that prompt <a href="https://asymm.com/the-new-rules-of-ai-generated-code-accountability/#:~:text=2,the%20Prompt%20Asked%20For">[15]</a>. This means a reviewer is left guessing at the requirements and whether the AI&#8217;s solution actually meets them, rather than just reviewing whether the code works. </p><p>As one report noted, <em>reviewers are no longer assessing what the developer meant to do, but rather what the model actually did </em><a href="https://asymm.com/the-new-rules-of-ai-generated-code-accountability/#:~:text=functionally%20equivalent%20to%20writing%20a,inputs%2C%20implicit%20assumptions%2C%20or%20insecure">[16]</a>. 
Traditional code review checklists (focused on style, obvious logic errors, etc.) aren&#8217;t enough, because AI code can fail in non-traditional ways &#8211; e.g. using an outdated algorithm that a junior dev wouldn&#8217;t know, or introducing an edge-case bug that isn&#8217;t immediately obvious.</p><p>Teams are also encountering <strong>review overload</strong>. An AI pair programmer can generate code faster than a human, which means a single developer can open very large pull requests or many pull requests in a short time. This &#8220;velocity&#8221; can overwhelm the team&#8217;s capacity to give thorough reviews. It&#8217;s akin to slop in code form &#8211; flooding the reviewer with so much output that it&#8217;s hard to pinpoint the issues <a href="https://www.reddit.com/r/SoftwareEngineering/comments/1kjwiso/maintaining_code_quality_with_widespread_ai/#:~:text=The%20code%20lacks%20clear%20architecture%3F,Suggest%20refactoring">[17]</a>. In such cases, some organizations have instituted new policies: for example, if a PR is more than 30% AI-generated (by lines or content), it might trigger a required extra level of review or a more senior reviewer <a href="https://www.softwareseni.com/why-ai-coding-speed-gains-disappear-in-code-reviews/#:~:text=Harness%E2%80%99s%20engineering%20teams%20report%20that,pattern%20usage%20and%20architectural%20misalignment">[18]</a>. </p><p>The idea is to acknowledge that AI-heavy code needs <em>different</em> scrutiny levels, not business-as-usual. Another emerging practice is labeling AI contributions: explicitly marking in the pull request or commit message that &#8220;this code was assisted by AI.&#8221; This can cue reviewers to be extra vigilant. 
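</p><p>Such a 30% policy can even be automated as a merge gate. The sketch below is purely illustrative &#8211; the per-line provenance labels and the exact cutoff are assumptions, since no standard tooling exposes them today:</p>

```python
# Hypothetical sketch of a "too much AI code" review gate.
# Assumes each changed line in a PR carries a provenance label
# ("ai" or "human"), e.g. gathered from editor telemetry or commit
# metadata; the 30% threshold mirrors the policy described above.

def needs_extra_review(line_provenance, threshold=0.30):
    """Return True if the AI-generated share of lines exceeds threshold."""
    if not line_provenance:
        return False
    ai_lines = sum(1 for label in line_provenance if label == "ai")
    return ai_lines / len(line_provenance) > threshold

# A PR where 200 of 500 changed lines (40%) were AI-drafted is escalated:
print(needs_extra_review(["ai"] * 200 + ["human"] * 300))  # True
```

<p>However the provenance signal is gathered, the point is that scrutiny becomes an explicit, checkable policy rather than reviewer intuition. 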
Indeed, experts recommend <strong>tagging and tracking AI-generated code</strong> for accountability &#8211; it helps reviewers know what to look for and helps teams trace bugs later (&#8220;was this bug from AI-written code?&#8221;)<a href="https://asymm.com/the-new-rules-of-ai-generated-code-accountability/#:~:text=4,Contributions">[19]</a>.</p><p>However, openly tagging AI involvement comes with a cultural challenge: developers must feel <strong>psychologically safe</strong> to disclose AI usage. If people fear judgment for using AI (&#8220;will my team think I&#8217;m lazy or less competent?&#8221;), they may hide it &#8211; and that&#8217;s worse for the team. Hidden AI usage means the team doesn&#8217;t know where potential risk lies and can&#8217;t adjust their reviews accordingly <a href="https://jellyfish.co/library/ai-in-software-development/responsibility-of-developers-generative-ai/#:~:text=">[20]</a>. To counter this, forward-thinking teams encourage transparency without stigma. </p><p>Using AI should be treated like using any tool &#8211; it&#8217;s fine to use it, but you must own the output. As one guide put it, <em>never blame the AI for bugs</em> or quality issues; the engineer who committed the code owns it, period <a href="https://jellyfish.co/library/ai-in-software-development/responsibility-of-developers-generative-ai/#:~:text=What%20developers%20must%20do%3A">[21]</a>. If everyone embraces that mindset, then saying &#8220;I used Cursor to help with this module&#8221; is simply a factual statement, not an admission of guilt. It allows the team to collectively ensure the AI-generated sections get proper attention. </p><p>Right now, our code review tools and norms are still catching up to these needs. We don&#8217;t yet have widespread automated detectors for AI code in PRs, and most diff viewers don&#8217;t show the AI&#8217;s prompt or reasoning. 
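</p><p>One lightweight bridge, in the meantime, is a team convention of declaring AI assistance in the commit message &#8211; the &#8220;AI-Assisted:&#8221; trailer below is an invented convention for illustration, not a git standard &#8211; so the team can filter for those changes when triaging:</p>

```python
# Hypothetical convention: commits declare AI involvement via an
# "AI-Assisted:" trailer line in the message body. This helper filters
# commit messages for that trailer so a team can trace which changes
# involved AI when debugging later.

def ai_assisted(commit_messages):
    """Return the messages that declare AI assistance via a trailer."""
    return [
        msg for msg in commit_messages
        if any(line.startswith("AI-Assisted:") for line in msg.splitlines())
    ]

history = [
    "Add retry logic to payment client\n\nAI-Assisted: yes (reviewed by author)",
    "Fix typo in README",
]
for msg in ai_assisted(history):
    print(msg.splitlines()[0])  # -> Add retry logic to payment client
```

<p>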
So, we need to rely on process and team agreements to fill the gap &#8211; explicitly calling out AI-written code, reviewing tests more rigorously, and possibly setting size limits on what we&#8217;ll accept from an AI without checkpoints for human review. </p><p><strong>If questionable code is making it past PR review unchallenged, the issue is not just AI &#8211; it&#8217;s that the review process isn&#8217;t robust enough</strong> to catch these problems <a href="https://www.reddit.com/r/SoftwareEngineering/comments/1kjwiso/maintaining_code_quality_with_widespread_ai/#:~:text=darknessgp">[22]</a>. It&#8217;s a call to action: code review practices must evolve alongside AI adoption.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7oR8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd72ae4b8-4580-4898-98c4-a9ba25b5d2a0_1824x1026.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7oR8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd72ae4b8-4580-4898-98c4-a9ba25b5d2a0_1824x1026.png 424w, https://substackcdn.com/image/fetch/$s_!7oR8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd72ae4b8-4580-4898-98c4-a9ba25b5d2a0_1824x1026.png 848w, https://substackcdn.com/image/fetch/$s_!7oR8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd72ae4b8-4580-4898-98c4-a9ba25b5d2a0_1824x1026.png 1272w, 
https://substackcdn.com/image/fetch/$s_!7oR8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd72ae4b8-4580-4898-98c4-a9ba25b5d2a0_1824x1026.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7oR8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd72ae4b8-4580-4898-98c4-a9ba25b5d2a0_1824x1026.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d72ae4b8-4580-4898-98c4-a9ba25b5d2a0_1824x1026.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:460018,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/179880341?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd72ae4b8-4580-4898-98c4-a9ba25b5d2a0_1824x1026.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7oR8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd72ae4b8-4580-4898-98c4-a9ba25b5d2a0_1824x1026.png 424w, https://substackcdn.com/image/fetch/$s_!7oR8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd72ae4b8-4580-4898-98c4-a9ba25b5d2a0_1824x1026.png 848w, 
https://substackcdn.com/image/fetch/$s_!7oR8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd72ae4b8-4580-4898-98c4-a9ba25b5d2a0_1824x1026.png 1272w, https://substackcdn.com/image/fetch/$s_!7oR8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd72ae4b8-4580-4898-98c4-a9ba25b5d2a0_1824x1026.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>66% of developers in a <a href="https://survey.stackoverflow.co/2025">Stack Overflow survey</a> said the most common frustration 
with AI assistants is that the code is &#8220;almost right, but not quite.&#8221; 45% of developers in the Stack Overflow survey reported that time spent debugging AI-generated code was their biggest time sink. Quality gates and validation are now the critical path.</em></p><h2><strong>Best practices: treating AI-generated code as a draft</strong></h2><p>To use AI coding tools effectively, we must adjust our habits and processes. Think of AI output as a <em>first draft from a junior developer</em> &#8211; valuable, but in need of careful review and refinement. Here are some pragmatic best practices to ensure that AI-generated code boosts productivity <em>without</em> sacrificing quality or understanding:</p><ul><li><p><strong>Never merge code you don&#8217;t understand.</strong> If an AI helped produce some code, the onus is on <em>you</em> (the developer) to read every line and make sure you get it. You should be able to explain what the code does and why. If there&#8217;s any part of the AI-generated snippet that you can&#8217;t follow, treat that as a red flag &#8211; either refine the prompt, have the AI explain it, or rewrite that part yourself. Some open-source projects explicitly require that contributors <em>certify they understand the code they submit</em>, even if AI wrote it. In professional settings, the same principle applies: take full ownership of any code you commit, regardless of who (or what) authored it<a href="https://jellyfish.co/library/ai-in-software-development/responsibility-of-developers-generative-ai/#:~:text=What%20developers%20must%20do%3A">[21]</a>. In practice, this means <strong>running the code, writing or reviewing tests for it, and stepping through its logic</strong> before it ever hits your team&#8217;s repository.</p></li><li><p><strong>Treat AI code like an intern&#8217;s code &#8211; don&#8217;t trust, verify.</strong> AI doesn&#8217;t possess context or wisdom; it&#8217;s more like a very fast, eager junior developer. 
It will confidently produce a solution, but that solution might be overly simplistic, miss edge cases, or use patterns that are out of place for your codebase. As a best practice, approach AI contributions with healthy skepticism. Check boundary conditions, look for off-by-one errors, thread safety issues, or other corner cases that a less-experienced coder might overlook <a href="https://blog.bonfy.ai/code-review-in-the-age-of-ai-best-practices-for-reviewing-ai-generated-code#:~:text=AI,It%20may%20work%2C%20but">[23]</a><a href="https://blog.bonfy.ai/code-review-in-the-age-of-ai-best-practices-for-reviewing-ai-generated-code#:~:text=,that%20can%20not%20be%20tested">[24]</a>. Often, AI will do exactly what you asked, <em>not necessarily what you truly need</em>. So cross-verify the output against the requirements. If it&#8217;s a complex or critical piece of code, consider manually reimplementing it after seeing the AI&#8217;s draft &#8211; you might catch nuances the AI missed. Remember the mantra for AI output: <strong>&#8220;Don&#8217;t trust. Verify.&#8221; </strong><a href="https://blog.bonfy.ai/code-review-in-the-age-of-ai-best-practices-for-reviewing-ai-generated-code#:~:text=Best%20practice%3A%20Don%E2%80%99t%20trust">[25]</a></p></li><li><p><strong>Use AI as a coding assistant, not an author &#8211; incorporate it into your own thinking.</strong> Instead of just asking AI to spit out code and blindly pasting it, use it in a conversational, explanatory way. For example, you can ask the AI to <em>explain</em> the code it just suggested, or to generate comments for it. You can have it suggest test cases for the code, which you then run to see if the code truly works. AI can also help by summarizing a large diff or identifying potential problem areas in a PR (some advanced code review tools now offer AI-generated summaries). All these uses keep you, the human, in the driver&#8217;s seat. You&#8217;re leveraging AI to augment your understanding, not replace it. 
One recommended practice is to <strong>review tests first</strong> for AI-generated changes <a href="https://blog.bonfy.ai/code-review-in-the-age-of-ai-best-practices-for-reviewing-ai-generated-code#:~:text=1,Mindset%3A%20Review%20Tests%20First">[26]</a> &#8211; ensure there&#8217;s a solid test suite covering the new code. If tests are weak or missing, that&#8217;s your cue to write more before trusting the code. Also, use strict linting and static analysis on AI code: AI might not follow your team&#8217;s idioms out-of-the-box, so enforce style and architecture rules with automated tools <a href="https://blog.bonfy.ai/code-review-in-the-age-of-ai-best-practices-for-reviewing-ai-generated-code#:~:text=2,Enhanced%20Linting%20Rules">[27]</a><a href="https://blog.bonfy.ai/code-review-in-the-age-of-ai-best-practices-for-reviewing-ai-generated-code#:~:text=,rules%20to%20reflect%20your%20norms">[28]</a>. If the AI suggests something that doesn&#8217;t fit your usual patterns, don&#8217;t hesitate to refactor it. Essentially, make AI your <em>pair programmer</em> who writes draft code and gives ideas, but <em>you</em> still make all final edits and decisions.</p></li><li><p><strong>Thoroughly test and secure AI-generated code.</strong> It&#8217;s crucial to apply the same (or higher) level of testing to AI-written code as you would to handmade code. Write unit tests and integration tests to cover the functionality. Specifically look for edge cases and potential failure modes &#8211; AI is notorious for handling the &#8220;happy path&#8221; but ignoring unusual inputs or error handling. 
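</p><p>Concretely, that means writing the unhappy-path checks yourself. A minimal sketch &#8211; <code>normalize_email</code> here is a hypothetical stand-in for whatever helper the AI drafted, with invented validation rules:</p>

```python
# normalize_email stands in for any AI-drafted helper you are vetting.
# The function body covers the happy path; the checks below probe the
# edge cases a reviewer must add before trusting it.

def normalize_email(addr: str) -> str:
    """Lowercase and trim an address; reject clearly malformed input."""
    cleaned = addr.strip().lower()
    if "@" not in cleaned or cleaned.startswith("@") or cleaned.endswith("@"):
        raise ValueError(f"invalid email: {addr!r}")
    return cleaned

# Happy path -- the part AI output usually gets right:
assert normalize_email("  Alice@Example.COM ") == "alice@example.com"

# Edge cases -- the part you must verify yourself:
for bad in ["", "no-at-sign", "@missing-local", "missing-domain@"]:
    try:
        normalize_email(bad)
        raise AssertionError(f"accepted malformed input: {bad!r}")
    except ValueError:
        pass
```

<p>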
Also consider security: common vulnerabilities like SQL injection, XSS, insecure deserialization, etc., might slip in if the AI drew from a code example with a flaw <a href="https://blog.bonfy.ai/code-review-in-the-age-of-ai-best-practices-for-reviewing-ai-generated-code#:~:text=4">[29]</a><a href="https://blog.bonfy.ai/code-review-in-the-age-of-ai-best-practices-for-reviewing-ai-generated-code#:~:text=,functions">[30]</a>. Use security linters or scanners (tools like Semgrep or Bandit can catch obvious issues <a href="https://blog.bonfy.ai/code-review-in-the-age-of-ai-best-practices-for-reviewing-ai-generated-code#:~:text=">[31]</a>). If the AI generated any dependency or configuration, ensure you review those for secrets or insecure defaults. Treat the AI&#8217;s code as if you hired a contractor whose work you don&#8217;t fully trust &#8211; double-check everything, because ultimately <strong>your team is accountable for any bugs or security holes</strong>, no matter who wrote the code.</p></li><li><p><strong>Leverage AI for self-review before seeking peer review.</strong> One productive pattern is to ask the AI to critique its own output <em>before</em> you open a pull request. For example, after getting a code suggestion, you might prompt, &#8220;What potential issues do you see in this code? Any edge cases or improvements?&#8221; The AI might point out a condition you didn&#8217;t consider or a more idiomatic approach. It&#8217;s like a spell-check for logic &#8211; not infallible, but it can catch low-hanging fruit. This doesn&#8217;t replace a human review, but it can help you <strong>clean up the draft</strong> so that your peers aren&#8217;t distracted by obvious problems. Think of it as you collaborating with the AI to polish the code, then handing it to your team. This also helps you learn, as the AI&#8217;s review comments can highlight areas you need to think about. 
Just remember to verify any AI feedback; sometimes it might &#8220;hallucinate&#8221; problems that aren&#8217;t real, so use your judgment.</p></li><li><p><strong>If an AI-generated change is too large or confusing, break it down.</strong> Don&#8217;t let the AI&#8217;s speed force you into merging giant, monolithic changes. If Cursor spews out 500 lines of mixed modifications, it might be better to treat that as a prototype. Perhaps run the code to see if the approach works, then <em>reimplement the solution in smaller, comprehensible pieces</em>. One developer likened an initial AI-generated draft to a <strong>spike solution</strong> &#8211; a quick and dirty implementation to prove a concept<a href="https://www.reddit.com/r/SoftwareEngineering/comments/1kjwiso/maintaining_code_quality_with_widespread_ai/#:~:text=In%20my%20experience%2C%20it%20tends,as%20a%20kind%20of%20spike">[32]</a>. You wouldn&#8217;t merge a spike into production; you&#8217;d refine it. Similarly, take the AI draft and iteratively improve it: maybe split that big PR into multiple commits or pull requests that are easier to review. Often the second draft (written with the insight gained from the first) is much cleaner and more maintainable<a href="https://www.reddit.com/r/SoftwareEngineering/comments/1kjwiso/maintaining_code_quality_with_widespread_ai/#:~:text=In%20my%20experience%2C%20it%20tends,as%20a%20kind%20of%20spike">[32]</a>. This disciplined approach prevents the &#8220;gish gallop&#8221; effect where the AI dumps so much code that reviewers can&#8217;t effectively review it. By breaking it down, you ensure that each piece gets adequate human attention.</p></li><li><p><strong>Document and label AI contributions when sharing with the team.</strong> In your pull request description or code comments, it can be helpful to note which parts were generated by AI or if you relied heavily on an AI for a solution. 
For example: &#8220;Used Gemini/Opus/GPT to generate the initial implementation of this sorting algorithm; reviewed and modified the result.&#8221; This kind of transparency helps reviewers know where to focus. It&#8217;s not about blaming the AI or you but about <em>context</em>. In fact, marking AI-generated code with clear comments or annotations is encouraged as a way to create accountability and traceability<a href="https://jellyfish.co/library/ai-in-software-development/responsibility-of-developers-generative-ai/#:~:text=The%20problem%3A%20When%20developers%20hide,which%20AI%20tool%20generated%20it">[33]</a>. If an odd bug appears later, the team can trace it back and see, &#8220;Oh, this chunk was AI-written based on prompt X&#8221; and that might make debugging easier. Of course, do this in a supportive culture (see next section) &#8211; the goal is to collectively safeguard quality, not to call someone out. Some teams even keep a log of AI-assisted changes for auditing purposes <a href="https://jellyfish.co/library/ai-in-software-development/responsibility-of-developers-generative-ai/#:~:text=The%20problem%3A%20When%20developers%20hide,which%20AI%20tool%20generated%20it">[33]</a>. At the very least, consider sharing the prompt you used with your reviewers, e.g. in a PR comment. That way the reviewer understands <em>what you asked for</em> and can judge if the AI&#8217;s code actually matches the intent<a href="https://asymm.com/the-new-rules-of-ai-generated-code-accountability/#:~:text=2,the%20Prompt%20Asked%20For">[15]</a>. This prompt-as-spec technique can bridge the gap between intention and implementation.</p></li></ul><p>In summary, treating AI code as a draft means <em>applying all the same rigor you would to a human novice&#8217;s code</em>: you review it deeply, test it thoroughly, and don&#8217;t assume anything is correct until proven. 
The AI can drastically speed up writing boilerplate and even suggest solutions, but <strong>you are the engineer</strong> &#8211; you must integrate those suggestions into the codebase responsibly.</p><h2><strong>Establish team agreements for AI-generated code</strong></h2><p><strong>To successfully integrate AI into development, teams should set clear guidelines &#8211; essentially a &#8220;contract&#8221; &#8211; on how to handle AI-generated code.</strong> This is a new frontier, and misalignment can cause friction or quality issues. A team working agreement might include rules, responsibilities, and cultural norms around AI usage. Here are some key elements teams are adopting:</p><ul><li><p><strong>Ensure accountability doesn&#8217;t lapse.</strong> Make it explicit that whoever integrates AI-generated code into the codebase is responsible for it, full stop. No pointing fingers at the AI. If a bug is introduced, it&#8217;s treated like any other bug you&#8217;d introduce. This principle, supported by industry guides, says developers must <em>take full ownership of any code they commit, regardless of who wrote it, and test AI-generated code as thoroughly as their own </em><a href="https://jellyfish.co/library/ai-in-software-development/responsibility-of-developers-generative-ai/#:~:text=What%20developers%20must%20do%3A">[21]</a>. Management should reinforce that using AI is not an excuse for lower quality. Code reviewers and approvers also share responsibility &#8211; if you approve a change, you&#8217;re vouching for it as usual. Essentially, AI doesn&#8217;t change the definition of &#8220;code owner.&#8221;</p></li><li><p><strong>Define how and when AI should be used.</strong> As a team, discuss what types of tasks are appropriate for AI assistance. 
For example, you might agree that AI is great for generating unit tests, boilerplate, scaffolding, or exploring multiple approaches &#8211; but perhaps you&#8217;ll avoid using it for core complex algorithms without additional review. Some teams may forbid AI use for security-sensitive code or critical algorithms, unless a senior engineer supervises closely. Others might say it&#8217;s fine to use AI for anything as long as you follow the other rules (understand it, test it, etc.). The key is to set expectations. This also ties into <strong>ethical and legal considerations</strong> (e.g. ensuring AI output doesn&#8217;t include copied licensed code, or doesn&#8217;t introduce biases), but that&#8217;s another essay in itself. The point is, an agreed policy prevents misunderstandings like one dev merging huge AI-written chunks that others aren&#8217;t comfortable with.</p></li><li><p><strong>Emphasize transparency and psychological safety.</strong> The team contract should encourage developers to be open about AI involvement. For instance, a guideline could be: &#8220;If AI assisted significantly in a change, mention it in the PR.&#8221; Leaders must foster an environment where this admission is seen positively (as due diligence), not negatively. A lack of transparency can lead to &#8220;shadow AI&#8221; in your codebase &#8211; code that is AI-written but nobody realizes it, making debugging and maintenance harder <a href="https://jellyfish.co/library/ai-in-software-development/responsibility-of-developers-generative-ai/#:~:text=">[20]</a>. To avoid that, make transparency the norm. One practice is adding a simple comment in the code like // Code generated with AI assistance or using a tag in PRs. 
The team might also agree on documenting prompts in the project wiki or in the code review for future reference <a href="https://jellyfish.co/library/ai-in-software-development/responsibility-of-developers-generative-ai/#:~:text=The%20problem%3A%20When%20developers%20hide,which%20AI%20tool%20generated%20it">[33]</a>. If someone feels they don&#8217;t fully understand an AI-generated section, they should feel safe to say so and ask for help or extra review <a href="https://jellyfish.co/library/ai-in-software-development/responsibility-of-developers-generative-ai/#:~:text=The%20problem%3A%20When%20developers%20hide,which%20AI%20tool%20generated%20it">[33]</a>. It&#8217;s far better to admit &#8220;I&#8217;m not 100% confident in what Copilot produced here&#8221; than to pretend everything is fine. Psychological safety ensures people speak up, which ultimately protects the code quality and the developers&#8217; growth.</p></li><li><p><strong>Integrate AI-awareness into the review process.</strong> Teams should update their code review checklists or definitions-of-done to account for AI. For example, a review checklist might add items like &#8220;If code was AI-generated, has the author provided the prompt or described the intent?&#8221; or &#8220;For AI-generated code, double-check for common issues (edge cases, security, style consistency).&#8221; Some organizations formalize this by requiring an extra pair of eyes on AI-heavy code, as noted earlier<a href="https://www.softwareseni.com/why-ai-coding-speed-gains-disappear-in-code-reviews/#:~:text=To%20address%20this%20challenge%2C%20some,code%20requires%20different%20scrutiny%20levels">[34]</a>. Training sessions can help too &#8211; a team might do a brownbag meeting on &#8220;typical AI mistakes&#8221; so all reviewers know what to watch for (e.g. unnecessary complexity, missing null checks, etc.). The team could also adopt tools to assist, like AI-powered code analysis that flags likely problematic code patterns. 
Ultimately, the whole review culture may shift to treat AI contributions with a bit more rigor. As a shared rule, you might say: <em>No AI-generated code gets merged without thorough human review, no exceptions</em>. It seems obvious, but stating it sets the tone that speed will not trump quality.</p></li><li><p><strong>Support continuous learning and skill development.</strong> To address the critical thinking atrophy issue, a team agreement can explicitly encourage practices that keep skills sharp. For instance, pair programming sessions where one person doesn&#8217;t use AI and explains their thought process, or rotations on challenging bug fixes without AI. Or even simply encouraging developers to occasionally implement things &#8220;the hard way&#8221; first, before using AI to optimize. Some companies have gone as far as tracking how AI impacts debugging time and making sure employees still know how to troubleshoot without the tool<a href="https://www.softwareseni.com/why-ai-coding-speed-gains-disappear-in-code-reviews/#:~:text=AI,debug%20code%20they%20don%E2%80%99t%20understand">[10]</a>. An agreement could be: &#8220;We use AI to speed up routine tasks, but we still expect engineers to understand and be able to manually handle the complex parts.&#8221; By acknowledging this in your team principles, you validate the importance of human expertise. Leads and managers in particular should lead by example &#8211; demonstrating in code reviews that they scrutinize AI-generated code just as they would any code, asking thoughtful questions. Junior devs will take cues from that and learn that AI is not a get-out-of-thinking-free card.</p></li></ul><p>In essence, a team&#8217;s AI code agreement is about <strong>maintaining quality, clarity, and trust</strong>. Everyone should know how AI is being used and agree on the standards its output must meet. This &#8220;contract&#8221; might be a living document that evolves as you gain experience. 
The goal is to prevent the scenario where AI quietly degrades your codebase or your engineers&#8217; skills. Instead, with rules in place, AI can be harnessed as a powerful accelerator <em>with guardrails</em>. It forces conversations about topics that were previously implicit (like &#8220;do you understand what you committed?&#8221;), making them explicit.</p><h2><strong>Conclusion: AI is not a replacement for understanding</strong></h2><p>AI coding tools are here to stay, and they <strong>excel at generating drafts</strong> &#8211; the scaffolding, the boilerplate, even complex code that might take a human much longer to write from scratch. Embracing them can lead to huge gains in productivity and free developers from drudgery. But the moment we start treating AI-generated code as &#8220;fire-and-forget,&#8221; we undermine the very benefits we seek. </p><p>The true value of AI in software engineering comes when we pair its speed with our judgment. <strong>That means always reviewing AI output with a critical eye, staying curious about </strong><em><strong>why</strong></em><strong> the code works, and insisting on clarity and correctness.</strong> When you treat AI-generated code as a draft, you acknowledge it&#8217;s a work in progress &#8211; to be massaged and perfected by human insight.</p><p>By maintaining high standards for code quality and developer education, we ensure that AI is a tool that <strong>augments our capabilities rather than atrophying them</strong>. We keep the &#8220;why&#8221; and &#8220;how&#8221; in focus even as the &#8220;what&#8221; is delivered to us on a platter. In practical terms: don&#8217;t stop reading code. </p><p>Whether written by an intern, an AI, or a seasoned colleague, code must be understood to be trusted. If you never outsource the reading and thinking, you retain the ability to connect code&#8217;s behavior back to the intent behind it &#8211; which is the essence of software engineering. 
</p><p><strong>Use AI to move faster, by all means, but </strong><em><strong>keep your hands on the wheel</strong></em><strong>.</strong> </p><p>The code that lands in production should always have a human&#8217;s eyes (and heart) behind it. That way, we get the best of both worlds: the efficiency of AI-generated first drafts and the reliability of human-reviewed, well-understood final code.</p><p><em>I&#8217;m excited to share that I&#8217;ve released a new <a href="https://beyond.addy.ie/">AI-assisted engineering book</a> with O&#8217;Reilly. There are a number of free tips on the book site in case you&#8217;re interested.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!esjQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c76c36d-5242-4274-a946-99821cd84da8_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!esjQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c76c36d-5242-4274-a946-99821cd84da8_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!esjQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c76c36d-5242-4274-a946-99821cd84da8_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!esjQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c76c36d-5242-4274-a946-99821cd84da8_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!esjQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c76c36d-5242-4274-a946-99821cd84da8_1024x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!esjQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c76c36d-5242-4274-a946-99821cd84da8_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2c76c36d-5242-4274-a946-99821cd84da8_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:353186,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/179880341?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c76c36d-5242-4274-a946-99821cd84da8_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!esjQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c76c36d-5242-4274-a946-99821cd84da8_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!esjQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c76c36d-5242-4274-a946-99821cd84da8_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!esjQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c76c36d-5242-4274-a946-99821cd84da8_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!esjQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c76c36d-5242-4274-a946-99821cd84da8_1024x1024.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div>]]></content:encoded></item><item><title><![CDATA[Critical Thinking during the age of AI]]></title><description><![CDATA[Who, what, where, when, why, how]]></description><link>https://addyo.substack.com/p/critical-thinking-during-the-age</link><guid isPermaLink="false">https://addyo.substack.com/p/critical-thinking-during-the-age</guid><dc:creator><![CDATA[Addy Osmani]]></dc:creator><pubDate>Fri, 21 Nov 2025 15:31:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!54oK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e90581d-19f6-4a27-bec6-26e9ce06b3c5_5246x3496.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In a time when AI can generate code, design ideas, and occasionally plausible answers on demand, the need for <strong>human critical thinking</strong> is greater than ever. Even the smartest automation can&#8217;t replace the ability to ask the right questions, challenge assumptions, and think independently.</p><p>This essay explores the importance of critical thinking skills for software engineers and technical teams, using the classic <strong>&#8220;Who, what, where, when, why, how&#8221;</strong> framework to structure pragmatic guidance.
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b1bu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22c3b29f-fdec-4f47-8ab2-ca0804a30ed4_1536x1536.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b1bu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22c3b29f-fdec-4f47-8ab2-ca0804a30ed4_1536x1536.jpeg 424w, https://substackcdn.com/image/fetch/$s_!b1bu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22c3b29f-fdec-4f47-8ab2-ca0804a30ed4_1536x1536.jpeg 848w, https://substackcdn.com/image/fetch/$s_!b1bu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22c3b29f-fdec-4f47-8ab2-ca0804a30ed4_1536x1536.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!b1bu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22c3b29f-fdec-4f47-8ab2-ca0804a30ed4_1536x1536.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b1bu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22c3b29f-fdec-4f47-8ab2-ca0804a30ed4_1536x1536.jpeg" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/22c3b29f-fdec-4f47-8ab2-ca0804a30ed4_1536x1536.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;No 
alternative text description for this image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="No alternative text description for this image" title="No alternative text description for this image" srcset="https://substackcdn.com/image/fetch/$s_!b1bu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22c3b29f-fdec-4f47-8ab2-ca0804a30ed4_1536x1536.jpeg 424w, https://substackcdn.com/image/fetch/$s_!b1bu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22c3b29f-fdec-4f47-8ab2-ca0804a30ed4_1536x1536.jpeg 848w, https://substackcdn.com/image/fetch/$s_!b1bu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22c3b29f-fdec-4f47-8ab2-ca0804a30ed4_1536x1536.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!b1bu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22c3b29f-fdec-4f47-8ab2-ca0804a30ed4_1536x1536.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><strong>tl;dr: Critical thinking checklist for AI-augmented teams</strong></p><ul><li><p><strong>Who:</strong> Don&#8217;t rely on AI as an oracle. Verify its output.</p></li><li><p><strong>What:</strong> Define the <em>real</em> problem before rushing to a solution.</p></li><li><p><strong>Where:</strong> Context is king. A fix that works in a sandbox might break in production.</p></li><li><p><strong>When:</strong> Know when to use a quick heuristic (triage) vs. deep analysis (root cause).</p></li><li><p><strong>Why:</strong> Use the &#8220;5 Whys&#8221; technique to uncover underlying causes.</p></li><li><p><strong>How:</strong> Communicate with evidence and data, not just opinions.</p></li></ul><p>We&#8217;ll dive into how each of these question categories applies to decision-making in an AI-augmented world, with concrete examples and common pitfalls. The goal is to show how humble curiosity and evidence-based reasoning can keep projects on track and avoid downstream issues.</p><h2>Who: involve the right people and perspectives</h2><p><strong>Who is involved in defining or solving the problem?</strong> In technical projects, critical thinking starts by identifying the <em>who</em>.
</p><p>This means knowing who the stakeholders are (e.g. engineers, product managers, users, domain experts) and making sure the right people are engaged in decision-making. Problems in engineering, for example, are rarely solved in isolation &#8211; they affect users and often span multiple teams. A critical thinker asks: <em>Who should we consult or inform? Who might have relevant expertise or a different perspective?</em> Including diverse viewpoints is essential. </p><p>Otherwise, teams risk falling into <strong>groupthink</strong>, where everyone converges on the same idea and dissenting opinions are silenced. Groupthink can fool a team into validating only their aligned views without questioning whether those views rest on good data or sound assumptions. To counter this, effective teams encourage questions from all members and even bring in outsiders for fresh eyes. In short, <em>who</em> is in the room (and <em>who</em> isn&#8217;t) can make or break the objectivity of technical decisions.</p><p><strong>Who should we listen to &#8211; human or AI?</strong> In the age of AI assistants, we must also critically assess <em>who</em> an answer is coming from. Is it the output of a large language model or a seasoned colleague? An AI might confidently provide an answer that sounds authoritative, but remember it&#8217;s a statistical engine. <strong>Who &#8220;said&#8221; it matters.</strong> </p><p>A critical thinker treats an AI&#8217;s output as just another input to examine, not an oracle. If an entity (like an AI) hands us a plausible-sounding answer, our human tendency is to accept it and not dig deeper. This cognitive laziness isn&#8217;t new &#8211; it&#8217;s a general human weakness to take the easy, <em>&#8220;sounds good&#8221;</em> solution and run with it. But in engineering, blindly trusting an answer can be dangerous. If an AI code assistant suggests a code snippet or architecture, ask: <em>Who authored this suggestion?
Does the AI actually understand our context?</em> Treat AI outputs as if coming from an inexperienced intern &#8211; verify everything. </p><p>For example, if Cursor provides a code fix for a bug, a critical engineer would review that code thoroughly and test it, just as they would review a junior developer&#8217;s work. The <em>who</em> question reminds us that <strong>accountability and understanding lie with the humans</strong>, regardless of AI involvement.</p><p><strong>Who is responsible and who is affected?</strong> Finally, critical thinking means staying aware of who will be affected by technical decisions. Shipping a quick-and-dirty fix might satisfy a manager in the short term, but who will maintain the code later? If a system fails, who bears the cost &#8211; is it the end-users, the on-call engineers, the company&#8217;s reputation? </p><p>Considering the human impact grounds our problem-solving in reality. It fosters <em>humility</em> &#8211; a recognition that our decisions affect real people and that we ourselves don&#8217;t have all the answers. Great engineers and product people cultivate this humility. They know there&#8217;s always more to learn and that no single person has the complete picture. 
By adopting a posture of learning and asking colleagues questions, they fill in their knowledge gaps and catch mistakes early.</p><p>In practice, this might mean a backend developer double-checks a feature&#8217;s impact with a frontend teammate (&#8220;Could this API change break the mobile app?&#8221;) or a developer seeks a security review from the infosec team rather than assuming &#8220;it&#8217;s probably fine.&#8221; In short, critical thinking in teams is a social endeavor: it thrives when <em>who</em> is involved includes a mix of people willing to question each other and themselves.</p><h2>What: define the real problem and gather evidence</h2><p><strong>What problem are we actually trying to solve?</strong> This is perhaps the most important question. A classic pitfall in engineering is rushing to solve <em>something</em> without confirming it&#8217;s the <em>right</em> thing. </p><p>In fact, Harvard Business Review has <a href="https://hbr.org/2012/09/the-power-of-defining-the-prob">emphasized</a> that rigorously defining the problem upfront ensures we address the right challenges and avoid wasted effort. In practice, this means taking time to clarify requirements and success criteria. Imagine a scenario: a product manager requests <em>&#8220;an AI feature to summarize user data&#8221;</em>. Jumping straight to coding a summarization algorithm would be premature without first asking <em>what</em> the end goal is. Is the goal to help users understand their data trends? If so, maybe the &#8220;right problem&#8221; is actually that users are overwhelmed by raw data, and the solution might involve better visualization rather than just a summary. </p><p>Critical thinking urges us to explicitly articulate the problem and question initial assumptions.
Our natural instinct is to go into <strong>&#8220;<a href="https://www.sngular.com/insights/337/are-we-solving-the-right-problems">problem-solving mode</a>&#8221;</strong> immediately, but this <em>tends to lead us to quick, surface-level fixes rather than more strategic and thoughtful solutions.</em> In other words, if we don&#8217;t slow down to define <em>what</em> needs solving, we risk fixing a symptom or a poorly understood issue. A thoughtful engineer will ask early: <em>&#8220;How do we know we&#8217;re solving the right problem?&#8221;</em> &#8211; a simple question that can save immense time and prevent downstream headaches.</p><p>Concretely, defining <em>what</em> the problem is involves gathering evidence and facts. For example, suppose users are complaining that a system is &#8220;slow.&#8221; Rather than blindly optimizing random code, a critical thinker will ask: <em>What is slow, exactly?</em> Is it page load time, a specific query, or the entire app? <em>What evidence do we have?</em> Maybe logs show one database query is taking 5 seconds. That frames the problem clearly: improving that query&#8217;s performance. Similarly, in debugging scenarios, asking <em>&#8220;What changed?&#8221;</em> when something broke often points to the cause &#8211; a recent deployment or config update. This investigative mindset ensures we tackle the <em>actual</em> cause rather than just the first guess.</p><p><strong>What evidence supports our solution or conclusion?</strong> Critical thinking in engineering is fundamentally about evidence-based decision making. It&#8217;s not enough to have an idea; we need to justify it with data or logical reasoning. Always ask: <em>&#8220;Does the evidence support this conclusion?&#8221;</em> For instance, if an AI model suggests that a bug is due to a null pointer exception, don&#8217;t accept it at face value &#8211; check the logs or write a unit test to confirm.
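</p><p>One way to make that habit concrete is to turn the AI&#8217;s claimed diagnosis into a small regression test and let it pass or fail, rather than trusting the explanation. The function and scenario below are hypothetical &#8211; a sketch of the practice, not anyone&#8217;s real codebase.</p>

```python
# Encode the AI's claim ("the bug is a missing-user check") as a test
# instead of accepting it at face value. All names here are invented.

def summarize_user(user):
    """AI-suggested fix: guard against a missing user record."""
    if user is None:  # the claimed root cause
        return "unknown user"
    return f"{user.get('name', 'anonymous')} ({user.get('plan', 'free')})"

def test_claimed_fix():
    # Reproduce the reported failure first...
    assert summarize_user(None) == "unknown user"
    # ...then confirm the fix didn't break the happy path...
    assert summarize_user({"name": "Ada", "plan": "pro"}) == "Ada (pro)"
    # ...and probe a nearby edge case the explanation never mentioned.
    assert summarize_user({}) == "anonymous (free)"

test_claimed_fix()
```

<p>If the test fails, the plausible-sounding diagnosis was wrong, and that failure is worth more than any amount of confident prose. 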
If a performance test indicates improvement, verify the results on multiple runs or environments. In modern AI-assisted development, this is especially vital. </p><p>Large language models (LLMs) often produce answers that <strong>sound</strong> correct. They&#8217;re excellent at sounding confident, which can trick even experienced engineers.</p><blockquote><p>If some entity gives you a good enough result, probably you aren&#8217;t going to spend much time improving it unless there is a good reason to do so. Likewise you probably aren&#8217;t going to spend a lot of time researching something that AI tells you if it sounds plausible. This is certainly a weakness, but it&#8217;s a general weakness in human cognition, and has little to do with AI in and of itself. - <a href="https://news.ycombinator.com/item?id=43057907#:~:text=,sounds%20plausible">Hacker News</a></p></blockquote><p>However, a plausible answer isn&#8217;t necessarily a true one. LLM answers are &#8220;almost always&#8221; <em>plausible-sounding but with no guarantee of being correct &#8211; a tremendous flaw with real consequences</em>. A critical thinker treats any proposed solution (whether from AI or a teammate) as a hypothesis to be tested, not a fact. They gather evidence to confirm or refute it. This might involve running an experiment, collecting metrics, or searching for analogous past incidents.</p><p>Consider the example of evaluating an AI-generated code snippet. Suppose Cursor provides a solution for timezone conversions. Instead of simply copy-pasting and assuming validity, a critical developer tests it against various formats and edge cases. If they discover the code fails on complex offsets, this evidence dictates the next step &#8211; perhaps switching to a dedicated library. By asking &#8220;What data supports this?&#8221;, engineers avoid the trap of confirmation bias.</p><p>Instead, they actively look for <em>falsifying</em> evidence.
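</p><p>The timezone example shows what looking for falsifying evidence means in practice: probe the suggestion with awkward offsets, not just the cases it was written for. Here <em>convert_to_utc</em> stands in for the AI-suggested snippet; the name and scenario are invented for illustration.</p>

```python
# Try to falsify an AI-suggested timezone conversion with awkward,
# non-whole-hour offsets rather than only the obvious happy path.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def convert_to_utc(naive, tz_name):
    """Stand-in for the suggested snippet: localize, then convert to UTC."""
    return naive.replace(tzinfo=ZoneInfo(tz_name)).astimezone(timezone.utc)

# UTC+05:45 -- a 45-minute offset many hand-rolled converters get wrong.
kathmandu = convert_to_utc(datetime(2024, 6, 1, 12, 0), "Asia/Kathmandu")
assert (kathmandu.hour, kathmandu.minute) == (6, 15)

# UTC+05:30 -- a half-hour offset.
kolkata = convert_to_utc(datetime(2024, 6, 1, 12, 0), "Asia/Kolkata")
assert (kolkata.hour, kolkata.minute) == (6, 30)
```

<p>A single failing probe like this outweighs a dozen passing happy-path runs. 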
In technical debates, confirmation bias might lead someone to defend their initial design choice and ignore alternative approaches. The antidote is to seek out data or feedback that challenges your idea: if you believe a new feature improved load time, also look at any cases where it regressed performance. <strong>What</strong> we know (and don&#8217;t know) should drive decisions, not just what we <em>feel</em>. Good critical thinkers are almost like scientists &#8211; they gather facts, run tests, and let evidence rather than ego determine the path forward.</p><h2>Where: consider context and scope</h2><p><strong>Where does this problem occur, and where will a solution apply?</strong> Context is everything in engineering. A fix that works perfectly in one environment might fail in another. Critical thinking means being mindful of <em>where</em> our assumptions hold true. Engineers should ask: <em>Where is the boundary of this issue? Where in the system or workflow are we seeing the effects?</em> </p><p>For example, if an AI ops tool flags an anomaly in system metrics, we should pinpoint where &#8211; which server, which module &#8211; before reacting. A spike in CPU on one microservice doesn&#8217;t mean the whole system is failing. By localizing <em>where</em> the problem lives, we avoid over-generalizing or deploying unnecessary global &#8220;fixes.&#8221; Similarly, consider <em>where</em> a solution will be used. Is the code running on a user&#8217;s low-powered smartphone or on a beefy cloud server? The context might dictate very different approaches. A critical thinker is always aware of the environment: <em>&#8220;Where will this code run? Where are the users encountering difficulties?&#8221;</em></p><p><strong>Where are the gaps in our knowledge?</strong> Asking &#8220;where&#8221; also means identifying where we need more information. 
If we&#8217;re debugging a distributed system, we might realize we don&#8217;t know where a specific request fails &#8211; is it at the client, the API gateway, or the database? That&#8217;s a cue to gather more data (e.g. add logging at various points) to determine the location of failure. Similarly, if a product idea is being discussed, critical thinking prompts us to ask <em>where in the user journey this idea fits</em>. This prevents solving a non-issue; perhaps the &#8220;cool feature&#8221; is addressing a part of the app that users rarely visit. Knowing <em>where</em> helps allocate effort to where it matters most.</p><p>To illustrate, imagine planning an experiment for a new feature rollout. A critical question is: <em>Where will we test it &#8211; in a staging environment, with internal users, or as a small percentage A/B test in production?</em> Each context has pros and cons. Testing in a realistic environment (like a small percentage of live users) may reveal issues that an isolated lab test won&#8217;t. On the other hand, some experiments should stay in a sandbox to avoid impacting real users. By explicitly considering <em>where</em> an experiment runs, engineers ensure they approach testing with appropriate rigor given the constraints. It&#8217;s easy to get false confidence from a perfect lab result that doesn&#8217;t hold in the messy real world.</p><p>Finally, &#8220;where&#8221; can be metaphorical: <em>Where could this solution cause side effects? Where might this decision have downstream impact?</em> Thinking a few steps ahead is a hallmark of seasoned engineers. For example, when modifying a shared library, ask where else that library is used. This way, you anticipate ripple effects and can check those places or alert those teams before problems occur. In sum, <strong>contextual awareness</strong> &#8211; spatial, environmental, and systemic &#8211; is a key part of critical thinking. It prevents tunnel vision. 
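</p><p>The &#8220;add logging at various points&#8221; tactic mentioned above can be sketched concretely. The stage names and the failing hop below are invented for illustration; the point is that each hop reports itself, so the logs answer <em>where</em> a request died instead of leaving you to guess.</p>

```python
# Localize *where* a request fails by logging each hop it passes through.
import logging

logging.basicConfig(format="%(levelname)s %(message)s")
log = logging.getLogger("request-trace")
log.setLevel(logging.INFO)

failed_stages = []  # recorded so the failure site is queryable, not just logged

def trace(stage):
    """Wrap one hop of the request path with entry/success/failure logging."""
    def wrap(fn):
        def inner(payload):
            log.info("enter %s", stage)
            try:
                result = fn(payload)
                log.info("ok %s", stage)
                return result
            except Exception:
                log.error("FAIL %s", stage)  # this line answers "where?"
                failed_stages.append(stage)
                raise
        return inner
    return wrap

@trace("client")
def client(payload):
    return payload

@trace("api-gateway")
def gateway(payload):
    return payload

@trace("database")
def database(payload):
    raise TimeoutError("query timed out")  # the hidden failure site

try:
    database(gateway(client({"id": 1})))
except TimeoutError:
    pass

# The trace pinpoints the failing hop without guessing.
assert failed_stages == ["database"]
```

<p>Once the logs name the failing hop, effort goes to the right place instead of a global &#8220;fix.&#8221; 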
Great engineers don&#8217;t just solve <em>a</em> problem; they solve the problem <em>in the right place</em> and with full awareness of the setting.</p><h2>When: timing, timelines, and when to dive deep</h2><p><strong>When did or will something happen?</strong> The dimension of time is crucial in technical work. Critical thinking involves asking <em>when</em> both in diagnosing issues and in planning work. In troubleshooting, understanding <em>when</em> a bug first appeared or <em>when</em> a system behaves differently often reveals the cause. (&#8220;The system crashed at 3 AM last night &#8211; what happened around that time?&#8221; Perhaps a nightly job or a deployment coincided with the crash.) Experienced engineers habitually ask: <em>&#8220;When did it last work? What&#8217;s changed since then?&#8221;</em> This line of questioning is often more effective at finding the root cause than blindly guessing. It ties into evidence gathering &#8211; a deploy timeline or version history might show exactly when a faulty piece of code went live.</p><p><strong>When should we apply more rigor, and when is a quick heuristic enough?</strong> Not every decision warrants days of analysis; part of critical thinking is knowing <em>when</em> to go deep. In engineering, we constantly balance thoroughness against time constraints. Project deadlines and on-call incidents can create immense pressure to act quickly. Under stress or tight timelines, humans tend to rely on intuition and mental shortcuts &#8211; what cognitive scientists call <em>heuristics</em>. These are useful, but they also open the door to biases and mistakes. 
</p><p>Research at NASA has <a href="https://appel.nasa.gov/2018/04/11/mitigating-cognitive-bias-in-engineering-decision-making/#:~:text=Although%20cognitive%20bias%20is%20a,likelihood%20that%20bias%20will%20occur">noted</a> that when engineers are under stress or have limited time, they make faster decisions that are <strong>more prone to error</strong> than those made with time to reflect. This doesn&#8217;t mean we can always avoid urgency, but it means we should <strong>acknowledge the risk</strong>. A critical thinker under time pressure will consciously slow down on the most crucial aspects of the decision. For instance, if you&#8217;re debugging a production outage at 2 AM, you might use a quick fix to get the system running (that&#8217;s a heuristic &#8211; e.g. restart a service). But a critical mindset means you&#8217;ll also note, <em>&#8220;This is a band-aid. I need to investigate the root cause in the morning.&#8221;</em> In other words, know <em>when</em> to apply a temporary fix and <em>when</em> to invest in a permanent solution.</p><p>Approaching rigor with limited time often involves triage: prioritizing which questions need deep answers now and which can be answered later. A useful prompt is, <em>&#8220;How do we approach this with rigor given time constraints?&#8221;</em> For example, in planning a new feature under a tight deadline, critical thinking might lead a team to identify the riskiest assumption and test it early (even in a quick-and-dirty way), rather than trying to perfect every detail. They focus on <em>when</em> each piece of information is needed. Is it okay to decide the UI later, but crucial to validate the algorithm now? If so, time is allocated accordingly.</p><p>Good critical thinkers also develop a sense of timing for interventions. 
<em>When should we ask for help?</em> If a problem remains unsolved after a certain amount of time, a critical engineer knows it might be time to get a second pair of eyes or escalate to a wider team discussion. <em>When should we pause and reconsider?</em> On teams practicing Agile, this might be at sprint boundaries or before major releases &#8211; essentially built-in &#8220;when&#8221; checkpoints to ask if they&#8217;re on the right track. And <em>when have we done enough analysis?</em> There is a point of diminishing returns. </p><p>Being rigorous doesn&#8217;t mean being paralyzed by analysis. It means doing the <em>right amount</em> of thinking for the decision at hand. As an example, if you have a day to debug an issue, spending the first 4 hours to methodically gather data is wise; spending 23 hours to get a perfect answer might mean missing the deadline. Critical thinking helps balance these through self-awareness: knowing when you&#8217;re falling into analysis paralysis versus when you&#8217;re leaping to conclusions too soon.</p><h2>Why: questioning motives, causes, and rationale</h2><p><strong>Why are we doing this?</strong> The &#8220;why&#8221; questions get to the heart of motivation and causality. In an engineering context, constantly asking <em>why</em> serves two big purposes: (1) ensuring there&#8217;s a sound rationale for actions (so we&#8217;re not just doing things because &#8220;someone said so&#8221;), and (2) drilling down to find root causes of problems rather than treating symptoms. A critical thinker faced with a task &#8211; say, implementing a new AI tool &#8211; will ask: <em>&#8220;Why do we need this tool? What problem will it solve and why is that important?&#8221;</em></p><p>If the best answer the team has is &#8220;because it&#8217;s trendy&#8221; or &#8220;our competitor has it,&#8221; that should spark concern. 
Chasing buzzwords without a clear <em>why</em> can lead teams to invest in solutions that don&#8217;t actually address their users&#8217; needs. On the other hand, articulating a strong why (e.g. &#8220;to reduce the time users spend analyzing their data by automating summaries&#8221;) aligns the team on the real goal. It fosters independent thinking &#8211; an engineer confident in the <em>why</em> can independently make better decisions during implementation, because they understand the end goal deeply rather than just following orders.</p><p><strong>Why did this happen?</strong> When something goes wrong (or right), asking &#8220;why&#8221; repeatedly is a proven technique to get beyond superficial answers. In fact, the <strong><a href="https://reliability.com/resources/articles/5-whys-technique-root-cause-analysis-example-and-template/#:~:text=The%205%20Whys%20Technique%20is,level%20symptoms">Five Whys</a></strong> technique in root cause analysis is essentially institutionalized critical thinking &#8211; it forces you to peel back causes layer by layer. The idea is to <a href="https://www.qualitygurus.com/five-whys-analysis-how-to-use-it-to-improve-business-performance/">avoid jumping</a> on the first explanation and instead uncover the chain of causality. 
</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!-ez0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18628494-1acb-4eda-bcf0-e4500d5ff54d_1920x1080.webp" width="1456" height="819" alt="" loading="lazy"></figure></div><p>For example, imagine a machine learning model&#8217;s accuracy suddenly drops. A naive response might be, <em>&#8220;The model is bad, retrain it.&#8221;</em> A critical approach would ask: <em>&#8220;Why did the accuracy drop? Because the input data distribution changed.&#8221;</em> Why did that happen? Perhaps a new data source was added. Why was that not accounted for? Maybe the data pipeline lacked validation for distribution shifts. By the time you&#8217;ve asked &#8220;why&#8221; five times (or as many times as needed), you likely have a much clearer picture &#8211; maybe the real root cause was a flawed monitoring process that failed to catch the data drift early. 
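</p><p>The fix that this Five Whys chain points at - validating for distribution shifts in the pipeline - can start small. Below is a minimal sketch in plain Python (the function name and the z-score threshold are illustrative, not from any particular pipeline): it flags an incoming batch whose mean sits too many baseline standard deviations from the baseline mean.</p>

```python
import statistics

def check_distribution_shift(baseline, current, max_z=3.0):
    """Flag a batch whose mean drifts too far from the baseline.

    Deliberately simple: measure the new batch's mean in units of the
    baseline's standard deviation. A real pipeline would use a proper
    statistical test, but even this catches gross drift.
    """
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline) or 1e-9  # guard a zero-variance baseline
    z = abs(statistics.mean(current) - mu) / sigma
    return z > max_z

baseline = [0.48, 0.51, 0.50, 0.49, 0.52, 0.50]
print(check_distribution_shift(baseline, [0.50, 0.49, 0.51]))  # stable batch: False
print(check_distribution_shift(baseline, [0.90, 0.95, 0.92]))  # shifted batch: True
```

<p>A guard like this, wired into the ingestion step, fails loudly before a degraded model ships - exactly the upstream check the monitoring fix implies. 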
</p><p>The difference is huge: a quick fix might just retrain the model (addressing the symptom of low accuracy temporarily), but the Five Whys approach might lead you to improve the monitoring system, preventing the issue from recurring. As one guide on the 5 Whys explains, this method <em>&#8220;aims to get to the heart of the matter rather than just addressing surface-level symptoms,&#8221;</em> encouraging teams to move beyond quick fixes to sustainable solutions.</p><p>However, <em>why</em> can be a double-edged sword if we&#8217;re not careful. Humans are prone to biases when answering why. One common pitfall is <strong>confirmation bias</strong> &#8211; we might latch onto a convenient explanation that fits our preconceptions and <a href="https://www.sngular.com/insights/337/are-we-solving-the-right-problems">stop</a> investigating further. For instance, an engineer might assume <em>&#8220;The server crashed because of a memory leak, which has happened before,&#8221;</em> and not consider other causes like a new configuration change, simply because the memory leak fits their mental model. If they don&#8217;t seek evidence to <em>disconfirm</em> their memory leak hypothesis, they might miss the real cause. </p><p>The earlier-mentioned <em>plunging-in bias</em> is another &#8220;why&#8221; trap: it&#8217;s the tendency to rush into solving a perceived problem without fully understanding it. Studies have found that this bias &#8211; jumping to conclusions and imposing pre-determined solutions &#8211; leads to addressing symptoms rather than root causes in about half of failed decisions studied. In other words, not asking &#8220;why&#8221; enough (or not the right why) can sink projects. The Harley-Davidson company in the 1980s famously misdiagnosed why they were losing market share, blaming external factors and thus implementing wrong solutions, when the real issues were internal practices. 
It took years for them to correct course, exemplifying how failing to pin down the true &#8220;why&#8221; can prolong pain.</p><p>Good critical thinkers are almost relentlessly curious about <em>why</em>. They maintain a <strong>humble curiosity</strong> &#8211; an openness to finding out that their initial assumption was wrong. They ask questions like: <em>&#8220;Why do we believe this approach will work? Is it because of actual data or just our gut? Why are users asking for this feature &#8211; what&#8217;s the underlying need?&#8221;</em> By drilling into reasons, they often catch logical gaps or uncover hidden requirements. Importantly, they also communicate the <em>why</em> behind decisions to others, which helps teams stay aligned and spot flaws. If you can&#8217;t clearly explain <em>why</em> a particular technical decision was made, that&#8217;s a red flag &#8211; either the decision lacks solid reasoning or that reasoning isn&#8217;t shared (both are dangerous). On the flip side, when everyone understands the rationale (the why), they can independently verify if new developments still support that rationale or if it needs revisiting.</p><h2>How: apply rigor and communicate clearly</h2><p>After exploring &#8220;who, what, where, when, why,&#8221; the final question is <strong>&#8220;How do we actually practice critical thinking day to day?&#8221;</strong> This is about the methods and mindset &#8211; how to approach problems rigorously yet efficiently, and how to carry solutions through with clear communication. Good critical thinkers tend to have a systematic approach. They <strong>formulate questions clearly</strong>, <strong>validate evidence</strong>, and <strong>communicate solutions logically</strong>. Let&#8217;s break this down:</p><ul><li><p><strong>How to approach problems methodically:</strong> Often it starts with asking better questions. Instead of vague queries, they ask specific, open-ended questions that lead to insight. 
For example, rather than &#8220;Is this design good?&#8221;, a critical thinker might ask &#8220;How does this design address the user&#8217;s primary need and how could it fail?&#8221; It&#8217;s important to avoid loaded or leading questions that just confirm what we already think. Maintaining an open mind and <a href="https://daily.dev/blog/critical-thinking-key-skill-for-software-developers">probing</a> for details yields much more useful information. A practical habit is to enumerate what you know and what you <strong>don&#8217;t</strong> know, then plan how to test or learn the latter. Think like a scientist: if you have a hypothesis (e.g. &#8220;the database is the bottleneck&#8221;), figure out how to prove or disprove it (perhaps by profiling or looking at query times). This structured interrogation of problems is at the core of critical thinking.</p></li><li><p><strong>How to validate evidence and avoid bias:</strong> Once you have data or answers, a critical thinker validates them. Does the data actually support the conclusion, or are there alternative interpretations? This might mean cross-checking metrics from two sources, reproducing a bug in a test environment to ensure it&#8217;s not a fluke, or getting a code review for an assumption you&#8217;ve made. It also means being aware of your own biases. As discussed, if you find yourself gravitating to an explanation too quickly, pause and ask, <em>&#8220;Am I considering all the evidence, or just the bits that confirm my theory?&#8221;</em> One strategy is actively seeking contradictory evidence. If you think a new feature improved retention, look at any cohort where retention didn&#8217;t improve &#8211; what&#8217;s different there? By <strong>welcoming negative data</strong>, you ensure you&#8217;re not kidding yourself. This is essentially a quality assurance mindset but applied to thinking: test the robustness of your ideas like you test your code. 
Additionally, frameworks and checklists can help maintain rigor. Some teams, for example, use a <strong>premortem</strong> exercise (imagining a future where the project failed and writing down reasons why) to surface potential issues and assumptions that weren&#8217;t initially considered. Such techniques enforce a more critical evaluation of a plan <em>before</em> it&#8217;s executed.</p></li><li><p><strong>How to communicate solutions and reasoning:</strong> A brilliant solution isn&#8217;t worth much if it can&#8217;t be communicated and implemented by the team. Critical thinking shines in how solutions are explained. Good engineers organize their explanation logically: start with the problem definition (the <em>what</em> and <em>why</em>), state the proposed solution (the <em>how</em>), and provide the evidence or reasoning backing it. They make their assumptions explicit and describe the trade-offs considered. This kind of communication not only helps others understand the proposal, it also serves as a final self-check for the thinker: if you can&#8217;t articulate it clearly, maybe your thinking isn&#8217;t clear yet. Importantly, critical thinkers use <strong>facts and data</strong> to bolster their communication, rather than hyperbole or opinion. As one engineering leadership article notes, <em>humble engineers prefer facts instead of opinions</em> &#8211; they will <a href="https://dangoslen.me/blog/a-case-for-being-a-humble-engineer/#:~:text=Lastly%2C%20humble%20engineers%20use%20facts,constant%2C%20and%20lead%20to%20solutions">cite the data</a> (&#8220;this change improved load time by 25% as measured on the dashboard&#8221;) rather than make boastful claims. This approach builds credibility. It shows you&#8217;re guided by evidence, which makes it easier for colleagues and stakeholders to trust the solution. Furthermore, clear communication involves listening and inviting feedback. 
A critical thinker doesn&#8217;t deliver a monologue; they encourage others to poke holes and ask questions, because that scrutiny will either validate the idea or help improve it. In meetings, this might look like: <em>&#8220;Here&#8217;s what I propose and why. Does anyone see any gap in this reasoning or have concerns?&#8221;</em> By fostering an open dialogue, they ensure the solution is robust and agreed upon, not just the loudest voice winning.</p></li></ul><p>Finally, <strong>&#8220;How do we ensure we&#8217;re continuously improving our critical thinking?&#8221;</strong> This meta-question is worth asking. The answer is practice and reflection. Just as we do retrospectives for projects, doing mini-retrospectives on decisions can sharpen thinking skills. For instance, if a rushed decision led to a bug, a team can analyze: how did we miss it, and how can we catch such things next time? </p><p>Over time, engineers build a mental library of lessons learned (e.g. &#8220;Remember to check X, because last time we assumed and it burned us&#8221;). Many top engineers also cultivate habits like reading post-mortems from other companies&#8217; failures or studying cognitive biases to become familiar with traps they might fall into. Critical thinking isn&#8217;t a one-and-done checkbox; it&#8217;s a continuous, career-long discipline of staying curious, humble, and evidence-driven.</p><h2>Conclusion</h2><p>As AI becomes ever more widely used, critical thinking is <strong>not optional, but essential</strong>. </p><p>We should ask <em>Who</em> should be involved, <em>What</em> is the real problem, <em>Where</em> is the context, <em>When</em> to dig deeper, <em>Why</em> something is done, and <em>How</em> to do it properly. By using this classic framework pragmatically, technical teams can navigate complexity with clarity. 
</p><p>It means a culture where independent thinking is valued: team members feel safe to question a proposed solution (<em>&#8220;How do we know this is truly the fix and not a band-aid?&#8221;</em>), to challenge assumptions (<em>&#8220;Why are we sure the users want this feature?&#8221;</em>), and to demand evidence (<em>&#8220;Does the data actually show an improvement, or are we seeing what we want to see?&#8221;</em>). Embracing humble curiosity &#8211; the idea that no matter how experienced we are, we could be missing something &#8211; keeps engineers from falling prey to confirmation bias or overconfidence.</p><p>Critical thinking also protects against the allure of quick fixes. It&#8217;s understandably tempting to patch a problem and move on, especially under pressure. But as we&#8217;ve seen, failing to think critically about a quick fix can mean the same problem resurfaces later or, worse, that we fix the wrong thing entirely. By asking the tough questions upfront and validating before acting, we actually <strong>save time and trouble in the long run</strong>. We avoid downstream issues by catching them upstream &#8211; whether it&#8217;s discovering a design flaw before code is written or realizing an AI&#8217;s output is flawed before it reaches customers.</p><p>In conclusion, while AI and automation will continue to evolve and handle more routine work, <strong>critical thinking remains a uniquely human advantage</strong>. 
It&#8217;s how we ensure that we&#8217;re solving the right problems, in the right way, for the right reasons.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!54oK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e90581d-19f6-4a27-bec6-26e9ce06b3c5_5246x3496.jpeg" width="1456" height="970" alt="" loading="lazy"></figure></div>]]></content:encoded></item><item><title><![CDATA[Conductors to Orchestrators: The Future of Agentic Coding]]></title><description><![CDATA[From micro-manager to macro-manager: coding's asynchronous future]]></description><link>https://addyo.substack.com/p/conductors-to-orchestrators-the-future</link><guid isPermaLink="false">https://addyo.substack.com/p/conductors-to-orchestrators-the-future</guid><dc:creator><![CDATA[Addy Osmani]]></dc:creator><pubDate>Sat, 01 Nov 2025 14:30:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!knBl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdac571f-5afb-495e-ab15-794c18d7702c_5246x3496.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>AI coding assistants</strong> have quickly moved from 
novelty to necessity, with up to 90% of software engineers now using some form of AI for coding. But a new paradigm is emerging in software development - one where engineers leverage <strong>fleets of autonomous coding agents</strong>. In this agentic future, the role of the software engineer is evolving from <strong>implementer</strong> to <strong>manager</strong>, or in other words, from <em>coder</em> to <strong>conductor</strong> and ultimately <strong><a href="https://www.youtube.com/watch?v=sQFIiB6xtIs">orchestrator</a></strong>.</p><p>Over time, developers will increasingly <strong>guide AI agents to build the right code</strong> and coordinate multiple agents working in concert. This write-up explores the distinction between <strong>Conductors</strong> and <strong>Orchestrators</strong> in AI-assisted coding, defines these roles, and examines how today&#8217;s cutting-edge tools embody each approach. Senior engineers may start to see the writing on the wall: our jobs are shifting from <em>&#8220;How do I code this?&#8221;</em> to <em>&#8220;How do I get the right code built?&#8221;</em> - a subtle but profound change.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!xumY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faeb45e5d-d4fd-4f87-b6cc-8e49d07b7830_1678x936.png" width="1456" height="812" alt="" fetchpriority="high"></figure></div><p>What&#8217;s the tl;dr of an orchestrator tool? It supports multi-agent workflows where you can run many agents in parallel without them interfering with one another. 
But let&#8217;s talk terminology more first.</p><h2><strong>The Conductor: Guiding a single AI agent</strong></h2><p>In the context of AI coding, acting as a <strong>Conductor</strong> means working closely with a single AI agent on a specific task, much like a conductor guiding a soloist through a performance.</p><p>The engineer remains in the loop at each step, dynamically steering the agent&#8217;s behavior, tweaking prompts, intervening when needed, and iterating in real-time. This is the logical extension of the &#8220;AI pair programmer&#8221; model many developers are already familiar with. With conductor-style workflows, <strong>coding happens in a synchronous, interactive session between human and AI</strong>, typically in your IDE or CLI.</p><p><strong>Key characteristics:</strong> A conductor keeps a tight feedback loop with one agent, verifying or modifying each suggestion, much as a driver navigates with a GPS. The AI helps write code, but the developer still performs many manual steps - creating branches, running tests, writing commit messages, etc., and ultimately decides which suggestions to accept.</p><p>Crucially, <strong>most of this interaction is ephemeral</strong>: once code is written and the session ends, the AI&#8217;s role is done and any context or decisions not captured in code may be lost. This mode is powerful for focused tasks and allows fine-grained control, but it doesn&#8217;t fully exploit what multiple AIs could do in parallel.</p><p><strong>Modern tools as Conductors:</strong> Several current AI coding tools exemplify the conductor pattern:</p><ul><li><p><strong>Claude Code (Anthropic):</strong> Anthropic&#8217;s Claude model offers a coding assistant mode (accessible via a CLI tool or editor integration) where the developer converses with Claude to generate or modify code. 
For example, with the <strong>Claude Code CLI</strong>, you navigate your project in a shell, ask Claude to implement a function or refactor code, and it prints diffs or file updates for you to approve. You remain the conductor: you trigger each action and review the output immediately. While Claude Code has features to handle long-running tasks and tools, in the basic usage it&#8217;s essentially a smart co-developer working step-by-step under human direction.</p></li><li><p><strong>Gemini CLI (Google):</strong> A command-line assistant powered by Google&#8217;s Gemini model, used for planning and coding with a very large context window. An engineer can prompt Gemini CLI to analyze a codebase or draft a solution plan, then iterate on results interactively. The human directs each step and Gemini responds within the CLI session. It&#8217;s a one-at-a-time collaborator, not running off to make code changes on its own (at least in this conductor mode).</p></li><li><p><strong>Cursor (Editor AI Assistant):</strong> The Cursor editor (a specialized AI-augmented IDE) can operate in an inline or chat mode where you ask it questions or to write a snippet, and it immediately performs those edits or gives answers within your coding session. Again, you guide it one request at a time. Cursor&#8217;s strength as a conductor is its deep context integration - it indexes your whole codebase so the AI can answer questions about any part of it. But the hallmark is that <strong>you, the developer, initiate and oversee each change</strong> in real time.</p></li><li><p><strong>VSCode, Cline, Roo Code (in-IDE chat):</strong> Similar to above, other coding agents also fall into this category. They suggest code or even multi-step fixes, but always under continuous human guidance.</p></li></ul><p>This conductor-style AI assistance has already boosted productivity significantly. It feels like having a junior engineer or pair programmer always by your side. 
However, it&#8217;s inherently <strong>one-agent-at-a-time and synchronous</strong>. To truly leverage AI at scale, we need to go beyond being a single-agent conductor. This is where the <strong>Orchestrator</strong> role comes in.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!51RX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb352c4-7273-45d7-810e-737b4133508d_1670x934.png" width="1456" height="814" alt="" loading="lazy"></figure></div><h2><strong>The Orchestrator: Managing a fleet of agents</strong></h2><p>If a conductor works with one AI &#8220;musician,&#8221; an <strong>Orchestrator</strong> oversees the entire symphony of multiple AI agents working in parallel on different parts of a project. The orchestrator sets high-level goals, defines tasks, and lets a team of autonomous coding agents independently carry out the implementation details. </p><p>Instead of micromanaging every function or bug fix, the human focuses on <strong>coordination, quality control, and integration</strong> of the agents&#8217; outputs. In practical terms, this often means an engineer can <strong>assign tasks to AI agents (e.g. via issues or prompts) and have those agents asynchronously produce code changes - often as ready-to-review pull requests</strong>. 
The engineer&#8217;s job becomes reviewing, giving feedback, and merging the results, rather than writing all the code personally.</p><p>This asynchronous, parallel workflow is a fundamental shift. It moves AI assistance from the foreground to the background. <strong>While you attend to higher-level design or other work, your &#8220;AI team&#8221; is coding in the background.</strong> When they&#8217;re done, they hand you completed work (with tests, docs, etc.) for review. It&#8217;s akin to being a project tech lead delegating tasks to multiple devs and later reviewing their pull requests, except the &#8220;devs&#8221; are AI agents.</p><p><strong>Key characteristics:</strong> An orchestrator deals with <strong>autonomous agents</strong> that can plan and execute multi-step coding tasks with minimal intervention. These agents have more agency: they can clone your repo, create new git branches, edit multiple files, compile/run tests, and iteratively refine their solution before presenting it.</p><p>The orchestrator doesn&#8217;t see every intermediate step (unless they choose to peek in); they mainly ensure the final outcome aligns with requirements. Importantly, all this happens in a <strong>tracked, persistent workflow</strong> (often leveraging version control and CI pipelines) rather than ephemeral suggestions. For example, GitHub&#8217;s coding agent operates entirely via pull requests on GitHub, so every change is logged and reviewable. 
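</p><p>In pseudocode terms, the shape of this workflow is roughly the following. This is an illustrative Python sketch, not any vendor's actual API - every name in it (run_agent, review, orchestrate) is hypothetical:</p>

```python
# Illustrative sketch of the orchestrator workflow (hypothetical names,
# not a real vendor API): the human defines tasks, agents work
# autonomously, and every result comes back as a reviewable artifact.
from dataclasses import dataclass


@dataclass
class PullRequest:
    task: str
    branch: str
    approved: bool = False


def run_agent(task: str) -> PullRequest:
    # Stand-in for an autonomous agent: it would plan, edit files,
    # run tests, and finally open a PR on a fresh branch.
    branch = "agent/" + task.lower().replace(" ", "-")
    return PullRequest(task=task, branch=branch)


def review(pr: PullRequest) -> bool:
    # Placeholder for the human quality gate: read the diff, request
    # changes, or approve. Here we simply approve everything.
    return True


def orchestrate(tasks: list[str]) -> list[PullRequest]:
    # The orchestrator never sees intermediate steps, only finished PRs.
    prs = [run_agent(t) for t in tasks]
    for pr in prs:
        pr.approved = review(pr)
    return [pr for pr in prs if pr.approved]


merged = orchestrate(["Fix login bug", "Add dark mode"])
print([pr.branch for pr in merged])
# ['agent/fix-login-bug', 'agent/add-dark-mode']
```

<p>The point of the sketch is the shape of the loop: the agent's intermediate steps stay out of view, and the human only ever touches finished, reviewable artifacts.</p><p>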
Another hallmark is concurrency: an orchestrator can spin up multiple agents to tackle different tasks simultaneously, dramatically parallelizing development.</p><p><strong>Modern tools as Orchestrators:</strong> Over just the past year, several tools have emerged that embody this orchestrator paradigm:</p><ul><li><p><strong>GitHub Copilot </strong><em><strong>Coding Agent</strong></em> (Microsoft): This upgrade to Copilot transforms it from an in-editor assistant into an <strong>autonomous background developer</strong> (I cover it in <a href="https://www.youtube.com/watch?v=sQFIiB6xtIs">this video</a>). You can assign a GitHub issue to Copilot&#8217;s agent or invoke it via the VS Code agents panel, telling it (for example) &#8220;Implement feature X&#8221; or &#8220;Fix bug Y&#8221;. Copilot then <strong>spins up an ephemeral dev environment via GitHub Actions, checks out your repo, creates a new branch, and begins coding</strong>. It can run tests, linters, even spin up the app if needed, all without human babysitting. When finished, it opens a pull request with the changes, complete with a description and meaningful commit messages. It then asks for your review. You, the human orchestrator, review the PR (perhaps using Copilot&#8217;s AI-assisted <strong>code review</strong> to get an initial analysis). If changes are needed, you can leave comments like @copilot please update the unit tests for edge case Z, and the agent will iterate on the PR. <strong>This is asynchronous, autonomous code generation in action.</strong> Notably, Copilot automates the tedious bookkeeping: branch creation, committing, opening PRs, etc., which used to cost developers time. All the grunt work around writing code (aside from the design itself) is handled, allowing developers to focus on reviewing and guiding at a high level. 
GitHub&#8217;s agent effectively lets one engineer supervise many &#8220;AI juniors&#8221; working in parallel across different issues (and you can even create multiple specialized agents for different task types).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rmBg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ced1b20-bbb4-4450-aa2d-6b108c0b0f2a_3010x1564.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rmBg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ced1b20-bbb4-4450-aa2d-6b108c0b0f2a_3010x1564.png 424w, https://substackcdn.com/image/fetch/$s_!rmBg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ced1b20-bbb4-4450-aa2d-6b108c0b0f2a_3010x1564.png 848w, https://substackcdn.com/image/fetch/$s_!rmBg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ced1b20-bbb4-4450-aa2d-6b108c0b0f2a_3010x1564.png 1272w, https://substackcdn.com/image/fetch/$s_!rmBg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ced1b20-bbb4-4450-aa2d-6b108c0b0f2a_3010x1564.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rmBg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ced1b20-bbb4-4450-aa2d-6b108c0b0f2a_3010x1564.png" width="1456" height="757" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ced1b20-bbb4-4450-aa2d-6b108c0b0f2a_3010x1564.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:757,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:540032,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/177541153?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ced1b20-bbb4-4450-aa2d-6b108c0b0f2a_3010x1564.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rmBg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ced1b20-bbb4-4450-aa2d-6b108c0b0f2a_3010x1564.png 424w, https://substackcdn.com/image/fetch/$s_!rmBg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ced1b20-bbb4-4450-aa2d-6b108c0b0f2a_3010x1564.png 848w, https://substackcdn.com/image/fetch/$s_!rmBg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ced1b20-bbb4-4450-aa2d-6b108c0b0f2a_3010x1564.png 1272w, https://substackcdn.com/image/fetch/$s_!rmBg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ced1b20-bbb4-4450-aa2d-6b108c0b0f2a_3010x1564.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p></li><li><p><strong>Jules, Google&#8217;s Coding Agent:</strong> <strong>Jules</strong> is an autonomous coding agent. Jules is <strong>&#8220;not a co-pilot, not a code-completion sidekick, but an autonomous agent that reads your code, understands your intent, and gets to work.&#8221;</strong> Integrated with Google Cloud and GitHub, Jules lets you connect a repository and then ask it to perform tasks much as you would a developer on your team. Under the hood, Jules <strong>clones your entire codebase into a secure cloud VM</strong> and analyzes it with a powerful model. 
You might tell Jules: &#8220;Add user authentication to our app&#8221; or &#8220;Upgrade this project to the latest Node.js and fix any compatibility issues.&#8221; It will formulate a plan, present it to you for approval, and once you approve, execute the changes asynchronously. It makes commits on a new branch and can even open a pull request for you to merge. Jules handles writing new code, updating tests, bumping dependencies, etc., all while you could be doing something else. Crucially, Jules provides <strong>transparency and control</strong>: it shows you its proposed plan and reasoning before making changes, and allows you to intervene or modify instructions at any point (a feature Google calls &#8220;user steerability&#8221;). This is akin to giving an AI intern the spec and watching over their shoulder less frequently - you trust them to get it mostly right, but you still verify the final diff. Jules also boasts unique touches like <strong>audio changelogs</strong> (it generates spoken summaries of code changes) and the ability to run multiple tasks concurrently in the cloud. 
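</p><p>To make the "plan first, then execute" loop concrete, here is a minimal Python sketch of an approval gate. The function names are hypothetical, not the Jules API:</p>

```python
# Minimal sketch of a plan-then-approve gate (hypothetical names, not
# the Jules API): nothing executes until the human approves the plan.
def propose_plan(task: str) -> list[str]:
    # A real agent derives this from analyzing the codebase.
    return [
        f"Analyze code relevant to: {task}",
        "Draft changes on a new branch",
        "Run tests and fix failures",
        "Open a pull request",
    ]


def execute(plan: list[str]) -> list[str]:
    # Each step would drive real edits and tool calls; we just log it.
    return [f"done: {step}" for step in plan]


def run_with_approval(task: str, approve) -> list[str]:
    plan = propose_plan(task)
    if not approve(plan):  # user steerability: reject or edit the plan
        return []
    return execute(plan)


log = run_with_approval("Add user authentication", approve=lambda plan: True)
print(len(log))  # 4
```

<p>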
In short, Google&#8217;s Jules demonstrates the orchestrator model: you define the task, Jules does the heavy lifting asynchronously, and you oversee the result.<br></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DjpK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea1ebab2-c5d9-4097-ba85-68707df9df17_1400x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DjpK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea1ebab2-c5d9-4097-ba85-68707df9df17_1400x1048.png 424w, https://substackcdn.com/image/fetch/$s_!DjpK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea1ebab2-c5d9-4097-ba85-68707df9df17_1400x1048.png 848w, https://substackcdn.com/image/fetch/$s_!DjpK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea1ebab2-c5d9-4097-ba85-68707df9df17_1400x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!DjpK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea1ebab2-c5d9-4097-ba85-68707df9df17_1400x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DjpK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea1ebab2-c5d9-4097-ba85-68707df9df17_1400x1048.png" width="1400" height="1048" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ea1ebab2-c5d9-4097-ba85-68707df9df17_1400x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1048,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&#128025; Google Jules: The AI Coding Agent That Actually Works Autonomously | by  Elio Verhoef | Medium&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="&#128025; Google Jules: The AI Coding Agent That Actually Works Autonomously | by  Elio Verhoef | Medium" title="&#128025; Google Jules: The AI Coding Agent That Actually Works Autonomously | by  Elio Verhoef | Medium" srcset="https://substackcdn.com/image/fetch/$s_!DjpK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea1ebab2-c5d9-4097-ba85-68707df9df17_1400x1048.png 424w, https://substackcdn.com/image/fetch/$s_!DjpK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea1ebab2-c5d9-4097-ba85-68707df9df17_1400x1048.png 848w, https://substackcdn.com/image/fetch/$s_!DjpK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea1ebab2-c5d9-4097-ba85-68707df9df17_1400x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!DjpK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea1ebab2-c5d9-4097-ba85-68707df9df17_1400x1048.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p></li><li><p><strong>OpenAI Codex (Cloud Agent):</strong> OpenAI introduced a new cloud-based <strong>Codex agent</strong> to complement ChatGPT. This evolved Codex (different from the 2021 Codex model) is described as <strong>&#8220;a cloud-based software engineering agent that can work on many tasks in parallel&#8221;</strong>. It&#8217;s available as part of ChatGPT Plus/Pro under the name <em>OpenAI Codex</em> and via an npm CLI (npm i -g @openai/codex). With the Codex CLI or its VS Code/Cursor extensions, you can delegate tasks to OpenAI&#8217;s agent much as you can with Copilot or Jules. 
For instance, from your terminal you might say: <em>&#8220;Hey Codex, implement dark mode for the settings page&#8221;</em>. Codex then launches into your repository, edits the necessary files, perhaps runs your test suite, and when done, presents the diff for you to merge. It operates in an isolated sandbox for safety, running each task in a container with your repo and environment. Like others, OpenAI&#8217;s Codex agent integrates with developer workflows: you can even kick off tasks from a <strong>ChatGPT mobile app</strong> on your phone and get notified when the agent is done. OpenAI emphasizes seamless switching <strong>&#8220;between real-time collaboration and async delegation&#8221;</strong> with Codex. In practice, this means you have the flexibility to use it in conductor mode (pair-programming in your IDE) or orchestrator mode (hand off a background task to the cloud agent). Codex can also be invited into your Slack channels - teammates can assign tasks to @Codex in Slack and it will pull context from the conversation and your repo to execute them. It&#8217;s a vision of ubiquitous AI assistance, where coding tasks can be delegated from anywhere. Early users report that Codex can autonomously identify and fix bugs, or generate significant features, given a well-scoped prompt. 
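</p><p>The "many tasks in parallel, each in its own sandbox" model can be sketched in a few lines of Python. Here a temp directory stands in for the per-task container, and delegate is a hypothetical stand-in for the real agent:</p>

```python
# Sketch of parallel, sandboxed delegation (illustrative names): each
# scoped task gets its own isolated workspace, and results come back
# as finished diffs. A temp directory stands in for the per-task
# container a real agent would run in.
import tempfile
from concurrent.futures import ThreadPoolExecutor


def delegate(task: str) -> str:
    with tempfile.TemporaryDirectory(prefix="sandbox-") as workdir:
        # A real agent would clone the repo into workdir, edit files,
        # and run the test suite there before producing a diff.
        return f"diff ready for review: {task}"


tasks = ["implement dark mode", "fix flaky checkout test", "bump Node.js"]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(delegate, tasks))
print(len(results))  # 3
```

<p>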
All of this again aligns with the orchestrator workflow: the human defines the goal, the AI agent autonomously delivers a solution.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pW8l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F884081a1-6333-408f-b74b-fe797d666bb2_1274x847.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pW8l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F884081a1-6333-408f-b74b-fe797d666bb2_1274x847.png 424w, https://substackcdn.com/image/fetch/$s_!pW8l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F884081a1-6333-408f-b74b-fe797d666bb2_1274x847.png 848w, https://substackcdn.com/image/fetch/$s_!pW8l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F884081a1-6333-408f-b74b-fe797d666bb2_1274x847.png 1272w, https://substackcdn.com/image/fetch/$s_!pW8l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F884081a1-6333-408f-b74b-fe797d666bb2_1274x847.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pW8l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F884081a1-6333-408f-b74b-fe797d666bb2_1274x847.png" width="1274" height="847" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/884081a1-6333-408f-b74b-fe797d666bb2_1274x847.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:847,&quot;width&quot;:1274,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;OpenAI Codex: The Autonomous AI Coding Agent | by Komal Raut | AI  Simplified in Plain English | Medium&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="OpenAI Codex: The Autonomous AI Coding Agent | by Komal Raut | AI  Simplified in Plain English | Medium" title="OpenAI Codex: The Autonomous AI Coding Agent | by Komal Raut | AI  Simplified in Plain English | Medium" srcset="https://substackcdn.com/image/fetch/$s_!pW8l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F884081a1-6333-408f-b74b-fe797d666bb2_1274x847.png 424w, https://substackcdn.com/image/fetch/$s_!pW8l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F884081a1-6333-408f-b74b-fe797d666bb2_1274x847.png 848w, https://substackcdn.com/image/fetch/$s_!pW8l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F884081a1-6333-408f-b74b-fe797d666bb2_1274x847.png 1272w, https://substackcdn.com/image/fetch/$s_!pW8l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F884081a1-6333-408f-b74b-fe797d666bb2_1274x847.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p><strong>Anthropic Claude Code (for Web):</strong> Anthropic has offered Claude as an AI chatbot for a while, and their Claude Code CLI has been a favorite for interactive coding. Anthropic took the next step by launching <strong>Claude Code for Web</strong>, effectively a hosted version of their coding agent. Using Claude Code for Web, you point it at your GitHub repo (with configurable sandbox permissions) and give it a task. The agent then runs in Anthropic&#8217;s managed container, just like the CLI version, but now you can trigger it from a web interface or even a mobile app. 
It queues up multiple prompts and steps, executes them, and when done, pushes a branch to your repo (and can open a PR). Essentially, Anthropic took their single-agent Claude Code and made it an orchestratable service in the cloud. They even provided a &#8220;teleport&#8221; feature to transfer the session to your local environment if you want to take over manually. The rationale for this web version aligns with orchestrator benefits: convenience and scale. You don&#8217;t need to run long jobs on your machine; Anthropic&#8217;s cloud handles the heavy lifting, with <strong>filesystem and network isolation</strong> for safety. Claude Code for Web acknowledges that <em>autonomy with safety</em> is key - by sandboxing the agent, they reduce the need for constant permission prompts, letting the agent operate more freely (less babysitting by the user). In effect, Anthropic has made it easier to use Claude as an autonomous coding worker you launch on demand.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6VKc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa78c062a-d656-478d-99ea-19a7c3619790_3550x1990.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6VKc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa78c062a-d656-478d-99ea-19a7c3619790_3550x1990.png 424w, https://substackcdn.com/image/fetch/$s_!6VKc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa78c062a-d656-478d-99ea-19a7c3619790_3550x1990.png 848w, 
https://substackcdn.com/image/fetch/$s_!6VKc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa78c062a-d656-478d-99ea-19a7c3619790_3550x1990.png 1272w, https://substackcdn.com/image/fetch/$s_!6VKc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa78c062a-d656-478d-99ea-19a7c3619790_3550x1990.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6VKc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa78c062a-d656-478d-99ea-19a7c3619790_3550x1990.png" width="1456" height="816" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a78c062a-d656-478d-99ea-19a7c3619790_3550x1990.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2423312,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/177541153?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa78c062a-d656-478d-99ea-19a7c3619790_3550x1990.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6VKc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa78c062a-d656-478d-99ea-19a7c3619790_3550x1990.png 424w, 
https://substackcdn.com/image/fetch/$s_!6VKc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa78c062a-d656-478d-99ea-19a7c3619790_3550x1990.png 848w, https://substackcdn.com/image/fetch/$s_!6VKc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa78c062a-d656-478d-99ea-19a7c3619790_3550x1990.png 1272w, https://substackcdn.com/image/fetch/$s_!6VKc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa78c062a-d656-478d-99ea-19a7c3619790_3550x1990.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p></li><li><p><strong>Cursor Background Agents:</strong> tl;dr - Cursor 2.0 introduces a <a href="https://cursor.com/blog/2-0#the-multi-agent-interface">multi-agent interface</a> organized around agents rather than files. Cursor 2 expands its <a href="https://cursor.com/docs/cloud-agent">Background Agents</a> feature into a full-fledged orchestration layer for developers. Beyond serving as an interactive assistant, Cursor 2 lets you spawn autonomous background agents that operate asynchronously in a managed cloud workspace. When you delegate a task, Cursor 2&#8217;s agents now clone your GitHub repository, spin up an ephemeral environment, and check out an isolated branch where they execute work end-to-end. These agents can handle the entire development loop - from editing and running code, to installing dependencies, executing tests, running builds, and even searching the web or referencing documentation to resolve issues. Once complete, they push commits and open a detailed pull request summarizing their work. Cursor 2 introduces multi-agent orchestration, allowing several background agents to run concurrently across different tasks - for instance, one refining UI components while another optimizes backend performance or fixes tests. Each agent&#8217;s activity is visible through a real-time dashboard that can be accessed from desktop or mobile, enabling you to monitor progress, issue follow-ups, or intervene manually if needed. This new system effectively treats each agent as part of an on-demand AI workforce, coordinated through the developer&#8217;s high-level intent. 
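</p><p>The monitoring side reduces to a simple status map. A hedged Python sketch, with hypothetical Agent and dashboard names:</p>

```python
# Sketch of the fleet dashboard (hypothetical names): each background
# agent reports a status, and the human watches one aggregated view,
# intervening only where an agent needs input.
from dataclasses import dataclass


@dataclass
class Agent:
    task: str
    status: str = "queued"  # queued -> running -> done / needs-input


def dashboard(agents: list[Agent]) -> dict[str, list[str]]:
    view: dict[str, list[str]] = {}
    for a in agents:
        view.setdefault(a.status, []).append(a.task)
    return view


fleet = [
    Agent("refine UI components", "running"),
    Agent("optimize backend queries", "running"),
    Agent("fix failing tests", "needs-input"),
]
print(dashboard(fleet))
# {'running': ['refine UI components', 'optimize backend queries'],
#  'needs-input': ['fix failing tests']}
```

<p>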
Cursor 2&#8217;s focus on parallel, asynchronous execution dramatically amplifies a single engineer&#8217;s throughput - fully realizing the orchestrator model where humans oversee a fleet of cooperative AI developers rather than a single assistant.<br></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RsSA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd10a8fb0-9331-4162-bdf4-3b27c4acb9db_3010x1694.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RsSA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd10a8fb0-9331-4162-bdf4-3b27c4acb9db_3010x1694.png 424w, https://substackcdn.com/image/fetch/$s_!RsSA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd10a8fb0-9331-4162-bdf4-3b27c4acb9db_3010x1694.png 848w, https://substackcdn.com/image/fetch/$s_!RsSA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd10a8fb0-9331-4162-bdf4-3b27c4acb9db_3010x1694.png 1272w, https://substackcdn.com/image/fetch/$s_!RsSA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd10a8fb0-9331-4162-bdf4-3b27c4acb9db_3010x1694.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RsSA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd10a8fb0-9331-4162-bdf4-3b27c4acb9db_3010x1694.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d10a8fb0-9331-4162-bdf4-3b27c4acb9db_3010x1694.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1529052,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/177541153?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd10a8fb0-9331-4162-bdf4-3b27c4acb9db_3010x1694.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RsSA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd10a8fb0-9331-4162-bdf4-3b27c4acb9db_3010x1694.png 424w, https://substackcdn.com/image/fetch/$s_!RsSA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd10a8fb0-9331-4162-bdf4-3b27c4acb9db_3010x1694.png 848w, https://substackcdn.com/image/fetch/$s_!RsSA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd10a8fb0-9331-4162-bdf4-3b27c4acb9db_3010x1694.png 1272w, https://substackcdn.com/image/fetch/$s_!RsSA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd10a8fb0-9331-4162-bdf4-3b27c4acb9db_3010x1694.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p></li><li><p><strong>Agent Orchestration Platforms:</strong> Beyond individual product offerings, there are also emerging <strong>platforms and open-source projects</strong> aimed at orchestrating multiple agents. For instance, <strong><a href="https://conductor.build/">Conductor</a></strong> by Melty Labs (despite its name!) is actually an orchestration tool that lets you deploy and manage multiple Claude Code agents on your own machine in parallel. With Conductor, each agent gets its own isolated Git worktree to avoid conflicts, and you can see a dashboard of all agents (&#8220;who&#8217;s working on what&#8221;) and review their code as they progress. The idea is to make running a small swarm of coding agents as easy as running one. 
Similarly, <strong><a href="https://smtg-ai.github.io/claude-squad/">Claude Squad</a></strong> is a popular open-source terminal app that essentially multiplexes Anthropic&#8217;s Claude - it can spawn several Claude Code instances working concurrently in separate tmux panes, allowing you to give each a different task and thus code &#8220;10x faster&#8221; by parallelizing. These orchestration tools underscore the trend: developers want to coordinate <em>multiple</em> AI coding agents and have them collaborate or divide work. Even Microsoft&#8217;s Azure AI services are enabling this - at Build 2025 they announced tools for developers to <strong>&#8220;orchestrate multiple specialized agents to handle complex tasks&#8221;</strong>, with SDKs supporting Agent-to-Agent communication so your fleet of agents can talk to each other and share context. All of this infrastructure is being built to support the <strong>orchestrator engineer</strong>, who might eventually oversee dozens of AI processes tackling different parts of the software development lifecycle.<br></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AFu_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1977a6f3-15ec-4112-b30d-95510af7df13_3256x2200.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AFu_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1977a6f3-15ec-4112-b30d-95510af7df13_3256x2200.webp 424w, https://substackcdn.com/image/fetch/$s_!AFu_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1977a6f3-15ec-4112-b30d-95510af7df13_3256x2200.webp 848w, 
https://substackcdn.com/image/fetch/$s_!AFu_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1977a6f3-15ec-4112-b30d-95510af7df13_3256x2200.webp 1272w, https://substackcdn.com/image/fetch/$s_!AFu_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1977a6f3-15ec-4112-b30d-95510af7df13_3256x2200.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AFu_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1977a6f3-15ec-4112-b30d-95510af7df13_3256x2200.webp" width="1456" height="984" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1977a6f3-15ec-4112-b30d-95510af7df13_3256x2200.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:984,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:586276,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/177541153?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1977a6f3-15ec-4112-b30d-95510af7df13_3256x2200.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AFu_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1977a6f3-15ec-4112-b30d-95510af7df13_3256x2200.webp 424w, 
https://substackcdn.com/image/fetch/$s_!AFu_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1977a6f3-15ec-4112-b30d-95510af7df13_3256x2200.webp 848w, https://substackcdn.com/image/fetch/$s_!AFu_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1977a6f3-15ec-4112-b30d-95510af7df13_3256x2200.webp 1272w, https://substackcdn.com/image/fetch/$s_!AFu_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1977a6f3-15ec-4112-b30d-95510af7df13_3256x2200.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li></ul><blockquote><p>&#8220;I found <a href="https://www.linkedin.com/redir/suspicious-page?url=https%3A%2F%2Fconductor%2ebuild%2F&amp;lipi=urn%3Ali%3Apage%3Ad_flagship3_detail_base%3B4m9RhAtxR6ebkFVf%2FmOdlg%3D%3D">Conductor</a> to make the most sense to me. It was a perfect balance of talking to an agent and seeing my changes in a pane next to it. Its Github integration feels seamless; e.g. after merging PR, it immediately showed a task as &#8220;Merged&#8221; and provided an &#8220;Archive&#8221; button.&#8221; - <a href="https://www.linkedin.com/in/juriyzaytsev?miniProfileUrn=urn%3Ali%3Afsd_profile%3AACoAAACPjPoB242NjG3ty49SjbsQdnWjb4xr0Tg&amp;lipi=urn%3Ali%3Apage%3Ad_flagship3_detail_base%3B4m9RhAtxR6ebkFVf%2FmOdlg%3D%3D">Juriy Zaytsev</a>, Staff SWE, LinkedIn</p><p>He also tried <a href="https://www.magnet.run/">Magnet</a>: &#8220;The idea of tying tasks to a Kanban board is interesting and makes sense. As such, Magnet feels very product-centric.&#8221;</p></blockquote><h2><strong>Conductor vs Orchestrator - Differences</strong></h2><p><strong>Many engineers will continue to engage in conductor-style workflows (single-agent, interactive) even as orchestrator patterns mature. The two modes will co-exist.</strong></p><p>It&#8217;s clear that &#8220;conductor&#8221; and &#8220;orchestrator&#8221; aren&#8217;t just fancy terms - they describe a genuine shift in how we work with AI:</p><ul><li><p><strong>Scope of control:</strong> A conductor operates at the micro level, guiding one agent through a single task or a narrow problem. An orchestrator operates at the macro level, defining broader tasks and objectives for multiple agents or for a powerful single agent that can handle multi-step projects. 
The conductor asks, &#8220;How do I solve this function or bug with the AI&#8217;s help?&#8221; The orchestrator asks, &#8220;What set of tasks can I delegate to AI agents today to move this project forward?&#8221;</p></li><li><p><strong>Degree of autonomy:</strong> In conductor mode, the AI&#8217;s autonomy is low - it waits for user prompts each step of the way. In orchestrator mode, we give the AI high autonomy - it might plan and execute dozens of steps internally (writing code, running tests, adjusting its approach) before needing human feedback. A GitHub Copilot agent or Jules will try to complete a feature from start to finish once assigned, whereas Copilot&#8217;s IDE suggestions only go line-by-line as you type.</p></li><li><p><strong>Synchronous vs Asynchronous:</strong> Conductor interactions are typically synchronous - you prompt, AI responds within seconds, you immediately integrate or iterate. It&#8217;s a real-time loop. Orchestrator interactions are asynchronous - you might dispatch an agent and check back minutes or hours later when it&#8217;s done (somewhat like kicking off a long CI job). This means orchestrators must handle waiting, context-switching, and possibly managing multiple things concurrently, which is a different workflow rhythm for developers.</p></li><li><p><strong>Artifacts and traceability:</strong> A subtle but important difference: orchestrator workflows produce persistent artifacts like branches, commits, and pull requests that are preserved in version control. The agent&#8217;s work is fully recorded (and often linked to an issue/ticket), which improves traceability and collaboration. With conductor-style workflows (IDE chat, etc.), unless the developer manually commits intermediate changes, a lot of the AI&#8217;s involvement isn&#8217;t explicitly documented. In essence, orchestrators leave a paper trail (or rather a git trail) that others on the team can see or even trigger themselves. 
This can help bring AI into team processes more naturally.</p></li><li><p><strong>Human Effort Profile:</strong> For a conductor, the human is actively engaged nearly 100% of the time the AI is working - reviewing each output, refining prompts, etc. It&#8217;s interactive work. For an orchestrator, the human&#8217;s effort is front-loaded (writing a good task description or spec for the agent, setting up the right context) and back-loaded (reviewing the final code and testing it), but not much is needed in the middle. This means one orchestrator can manage more total work in parallel than would ever be possible by working with one AI at a time. Essentially, orchestrators leverage <strong>automation at scale</strong>, trading off fine-grained control for breadth of throughput.<br></p></li></ul><p>To illustrate, consider a common scenario: adding a new feature that touches both the frontend and backend and requires new tests. As a conductor, you might open your AI chat and implement the backend logic with the AI&#8217;s help, then separately implement the frontend, then ask it to generate some tests - doing each step sequentially with you in the loop throughout. As an orchestrator, you could assign the backend implementation to one agent (Agent A), the frontend UI changes to another (Agent B), and test creation to a third (Agent C). You give each a prompt or an issue description, then step back and let them work concurrently. </p><p>After a short time, you get perhaps three PRs: one for backend, one for frontend, one for tests. Your job then is to review and integrate them (and maybe have Agent C adjust tests if Agents A/B&#8217;s code changed during integration). In effect, you managed a mini &#8220;AI team&#8221; to deliver the feature. 
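</p><p>For a flavor of how that Agent A/B/C split works at the repository level, here is a minimal sketch of workspace isolation using git worktrees - one checkout and branch per agent, so their edits cannot collide. The agent and branch names are hypothetical, and it assumes a machine with git on the PATH:</p>

```python
import pathlib, subprocess, tempfile

def git(*args, cwd=None):
    # Thin wrapper so the sketch reads like the underlying git commands.
    subprocess.run(["git", *args], cwd=cwd, check=True, capture_output=True)

root = pathlib.Path(tempfile.mkdtemp())
repo = root / "project"
git("init", "-q", str(repo))
git("-c", "user.email=demo@example.com", "-c", "user.name=demo",
    "commit", "-q", "--allow-empty", "-m", "baseline", cwd=repo)

# One isolated worktree + branch per agent: parallel edits never collide,
# and each agent's output lands as an ordinary reviewable branch.
assignments = {"agent-a": "agent/backend",
               "agent-b": "agent/frontend",
               "agent-c": "agent/tests"}
for workdir, branch in assignments.items():
    git("worktree", "add", str(root / workdir), "-b", branch, cwd=repo)

print(sorted(p.name for p in root.iterdir() if p.is_dir()))
# ['agent-a', 'agent-b', 'agent-c', 'project']
```

<p>This is essentially what tools like Conductor automate for you: each agent works in its own checkout, and integration happens through normal branch review.</p><p>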
This example highlights how orchestrators think in terms of <strong>task distribution and integration</strong>, whereas conductors focus on <strong>step-by-step implementation</strong>.</p><p>It&#8217;s worth noting that <strong>these roles are fluid, not rigid categories</strong>. A single developer might act as a conductor in one moment and an orchestrator the next. For example, you might kick off an asynchronous agent to handle one task (orchestrator mode) while you personally work with another AI on a tricky algorithm in the meantime (conductor mode). Tools are also blurring lines: as OpenAI&#8217;s Codex marketing suggests, you can seamlessly switch between collaborating in real-time and delegating async tasks. So, think of &#8220;conductor&#8221; vs &#8220;orchestrator&#8221; as two ends of a spectrum of AI-assisted development, with many hybrid workflows in between.</p><h2><strong>Why Orchestrators matter</strong></h2><p>Experts are suggesting that this shift to orchestration could be one of the biggest leaps in programming productivity we&#8217;ve ever seen. Consider the historical trends: we went from writing assembly to using high-level languages, then to using frameworks and libraries, and recently to leveraging AI for autocompletion. Each step abstracted away more low-level work. <strong>Autonomous coding agents are the next abstraction layer</strong> - instead of manually coding every piece, you describe what you need at a higher level and let multiple agents build it.</p><p>As orchestrator-style agents ramp up, we could imagine even larger percentages of code being drafted by AIs. What does a software team look like when AI agents generate, say, 80% or 90% of the code, and humans provide the 10% critical guidance and oversight? Many believe it doesn&#8217;t mean replacing developers - it means <strong>augmenting developers to build better software</strong>. 
We may witness an explosion of productivity where a small team of engineers, effectively managing dozens of agent processes, can accomplish what once took an army of programmers months. (Note: I continue to believe the code review loop - the place where our human skills will stay focused - needs real work if all this code is not to become slop.)</p><p>One intriguing possibility is that <strong>every engineer becomes, to some degree, a </strong><em><strong>manager</strong></em><strong> of AI developers</strong>. It&#8217;s a bit like everyone having a personal team of interns or junior engineers. Your effectiveness will depend on how well you can break down tasks, communicate requirements to AI, and verify the results. Human judgment will remain vital: deciding what to build, ensuring correctness, handling ambiguity, and injecting creativity or domain knowledge where AI might fall short. In other words, the skillset of an orchestrator - good planning, prompt engineering, validation, and oversight - is going to be in high demand. Far from making engineers obsolete, these agents could <strong>elevate engineers into more strategic, supervisory roles</strong> on projects.</p><h2><strong>Toward an &#8220;AI Team&#8221; of specialists</strong></h2><p>Today&#8217;s coding agents mostly tackle implementation: write code, fix code, write tests, etc. But the vision doesn&#8217;t stop there. Imagine a full software development pipeline where <strong>multiple specialized AI agents handle different phases of the lifecycle, coordinated by a human orchestrator</strong>. This is already on the horizon. 
Researchers and companies have floated architectures where, for example, you have:</p><ul><li><p>a <strong>Planning Agent</strong> that analyzes feature requests or bug reports and breaks them into specific tasks</p></li><li><p>a <strong>Coding Agent</strong> (or several) that implement the tasks in code</p></li><li><p>a <strong>Testing Agent</strong> that generates and runs tests to verify the changes</p></li><li><p>a <strong>Code Review Agent</strong> that checks the pull requests for quality and standards compliance</p></li><li><p>a <strong>Documentation Agent</strong> that updates README or docs to reflect the changes</p></li><li><p>possibly a <strong>Deployment/Monitoring Agent</strong> that can roll out the change and watch for issues in production.</p></li></ul><p>In this scenario, the human engineer&#8217;s role becomes one of <strong>oversight and orchestration across the whole flow</strong>: you might initiate the process with a high-level goal (e.g., &#8220;Add support for payment via cryptocurrency in our app&#8221;), the planning agent turns that into sub-tasks, coding agents implement each sub-task asynchronously, the testing agent and review agent catch problems or polish the code, and finally everything gets merged and deployed under watch of monitoring agents. </p><p>The human would step in to approve plans, resolve any conflicts or questions the agents raise, and give final approval to deploy. This is essentially an <strong>&#8220;AI swarm&#8221;</strong> tackling software development end-to-end, with the engineer as the conductor of the orchestra.</p><p>While this might sound futuristic, we see early signs. Microsoft&#8217;s Azure AI Foundry now provides building blocks for multi-agent workflows and agent orchestration in enterprise settings, implicitly supporting the idea that multiple agents will collaborate on complex, multi-step tasks. 
Internal experiments at tech companies have agents creating pull requests that other agent reviewers automatically critique, forming an AI-to-AI interaction with a human in the loop at the end. In open-source communities, people have chained tools like Claude Squad (parallel coders) with additional scripts that integrate their outputs. And the conversation has started about standards like <strong>Model Context Protocol (MCP)</strong> for agents sharing state and communicating results to each other.</p><p>I&#8217;ve noted before that &#8220;<em>specialized agents for Design, Implementation, Test, and Monitoring could work together to develop, launch, and land features in complex environments</em>&#8221; - with developers onboarding these AI agents to their team and guiding/overseeing their execution. In such a setup, agents would <em>&#8220;coordinate with other agents autonomously, request human feedback, reviews and approvals&#8221;</em> at key points, and otherwise handle the busywork amongst themselves. The goal is a <strong>central platform where we can deploy specialized agents across the workflow, without humans micromanaging each individual step</strong> - instead, the human oversees the entire operation with full context. </p><p>This could transform how software projects are managed: more like running an automated assembly line where engineers ensure quality and direction, rather than hand-crafting each component on the line.</p><h2><strong>Challenges and Human Role in orchestration</strong></h2><p>Does this mean programming becomes a push-button activity where you sit back and let the AI factory run? Not quite - and likely never entirely. There are significant challenges and open questions with the orchestrator model:</p><ul><li><p><strong>Quality control &amp; trust:</strong> Orchestrating multiple agents means you&#8217;re not eyeballing every single change as it&#8217;s made. Bugs or design flaws might slip through if you solely rely on AI. 
Human oversight remains <strong>critical</strong> as the final failsafe. Indeed, current tools explicitly require the human to review the AI&#8217;s pull requests before merging. The relationship is often compared to managing a team of junior developers: they can get a lot done, but you wouldn&#8217;t ship their code without review. The orchestrator engineer must be vigilant about checking the AI&#8217;s work, writing good test cases, and having monitoring in place. AI agents can make mistakes or produce logically correct but undesirable solutions (for instance, implementing a feature in a convoluted way). Part of the orchestration skillset is knowing <strong>when to intervene</strong> versus when to trust the agent&#8217;s plan. As the CTO of Stack Overflow wrote, <em>&#8220;developers maintain expertise to evaluate AI outputs&#8221;</em> and will need new <strong>&#8220;trust models&#8221;</strong> for this collaboration.</p></li><li><p><strong>Coordination &amp; conflict:</strong> When multiple agents work on a shared codebase, coordination issues arise - much like multiple developers can conflict if they touch the same files. We need strategies to prevent merge conflicts or duplicated work. Current solutions use <em>workspace isolation</em> (each agent works on its own git branch or separate environment) and clear task separation - for example, one agent per task, with tasks designed to minimize overlap. Some orchestrator tools can even automatically merge changes or rebase agent branches, but usually it falls to the human to integrate. Ensuring agents don&#8217;t step on each other&#8217;s toes is an active area of development. 
It&#8217;s conceivable that in the future agents might negotiate with each other (via something like agent-to-agent communication protocols) to avoid conflicts, but today the orchestrator sets the boundaries.</p></li><li><p><strong>Context, shared state and hand-offs: </strong>Coding workflows are rich in state: repository structure, dependencies, build systems, test suites, style guidelines, team practices, legacy code, branching strategies, etc. Multi-agent orchestration demands shared context, memory, and smooth transitions. In enterprise settings, context sharing across agents is non-trivial: without a unified &#8220;workflow orchestration layer&#8221;, each agent becomes a silo, working well in its own domain but failing to mesh with the others. In an engineering team this may translate into: one agent creates a feature branch, another runs unit tests, a third merges to the mainline - and if the first agent doesn&#8217;t tag the metadata the second expects, you get breakdowns.</p></li><li><p><strong>Prompting and specifications:</strong> Ironically, as the AI handles more coding, <strong>the human&#8217;s &#8220;coding&#8221; moves up a level to writing specifications and prompts</strong>. The quality of an agent&#8217;s output is highly dependent on how well you specify the task. Vague instructions lead to subpar results or agents going astray. Best practices that have emerged include writing mini design docs or acceptance criteria for the agents - essentially treating them like contractors who need a clear definition of done. This is why we&#8217;re seeing ideas like <em>spec-driven development</em> for AI: you feed the agent a detailed spec of what to build, so it can execute predictably. Engineers will need to hone their ability to describe problems and desired solutions unambiguously. Paradoxically, it&#8217;s a very old-school skill (writing good specs and tests) made newly important in the AI era. 
As agents improve, prompts might get simpler (&#8220;write me a mobile app for X and Y with these features&#8221;) and yet yield more complex results, but we&#8217;re not quite at the point of the AI intuiting everything unsaid. For now, orchestrators must be excellent communicators to their digital workforce.</p></li><li><p><strong>Tooling and debugging:</strong> With a human developer, if something goes wrong, they can debug in real time. With autonomous agents, if something goes wrong (say the agent gets stuck on a problem or produces a failing PR), the orchestrator has to debug the situation: Was it a bad prompt? Did the agent misinterpret the spec? Do we roll back and try again or step in and fix it manually? New tools are being added to help here: for instance, <strong>checkpointing and rollback</strong> commands let you undo an agent&#8217;s changes if it went down a wrong path. Monitoring dashboards can show if an agent is taking too long or has errors. But effectively, orchestrators might at times have to drop down to conductor mode to fix an issue, then go back to orchestration. This interplay will improve as agents get more robust, but it highlights that orchestrating isn&#8217;t just &#8220;fire and forget&#8221; - it requires active monitoring. AI observability tools (tracking cost, performance, accuracy of agents) are likely to become part of the developer&#8217;s toolkit.</p></li><li><p><strong>Ethics and responsibility:</strong> Another angle - if an AI agent writes most of the code, who is responsible for license compliance, security vulnerabilities, or bias in that code? Ultimately the human orchestrator (or their organization) carries responsibility. This means orchestrators should incorporate practices like security scanning of AI-generated code and verifying dependencies. 
Interestingly, some agents like Copilot and Jules include built-in safeguards (they won&#8217;t introduce known vulnerable versions of libraries, for instance, and can be directed to run security audits). But at the end of the day, <em>&#8220;trust, but verify&#8221;</em> is the mantra. The human remains accountable for what ships, so orchestrators will need to ensure AI contributions meet the team&#8217;s quality and ethical standards.<br></p></li></ul><p>In summary, the rise of orchestrator-style development doesn&#8217;t remove the human from the loop - it <strong>changes the human&#8217;s position in the loop</strong>. We move from being the one turning the wrench to the one designing and supervising the machine that turns the wrench. It&#8217;s a higher-leverage position, but also one that demands broader awareness. </p><p>Developers who adapt to being effective conductors and orchestrators of AI will likely be <strong>even more valuable</strong> in this new landscape.</p><h2><strong>Conclusion: Every engineer a maestro?</strong></h2><p>Will every engineer become an orchestrator of multiple coding agents? It&#8217;s a provocative question, but trends suggest we&#8217;re headed that way for a large class of programming tasks. The day-to-day reality of a software engineer in the late 2020s could involve less heads-down coding and more high-level supervision of code that&#8217;s mostly written by AIs. </p><p>Today we&#8217;re already seeing early adopters treating AI agents as teammates - for example, some developers report delegating 10+ pull requests per day to AI, effectively <strong>treating the agent as an independent teammate rather than a smart autocomplete</strong>. Those developers free themselves to focus on system design, tricky algorithms, or simply coordinating even more work.</p><p>That said, the transition won&#8217;t happen overnight for everyone. 
Junior developers might start as &#8220;AI conductors,&#8221; getting comfortable working with a single agent, before they take on orchestrating many. Seasoned engineers are more likely to adopt orchestrator workflows early, since they have the experience to architect tasks and evaluate outcomes. In many ways, it mirrors career growth: junior engineers implement (now with AI help), senior engineers design and integrate (soon with AI agent teams). </p><p>The tools we discussed - from GitHub&#8217;s coding agent to Google&#8217;s Jules to OpenAI&#8217;s Codex - are rapidly lowering the barrier to try this approach, so expect it to go mainstream quickly. Hyperbole aside, it&#8217;s true that these capabilities can dramatically amplify what an individual developer can do.</p><p>So, will we all be orchestrators? Probably to some extent - yes. We&#8217;ll still write code, especially for novel or complex pieces that defy simple specification. But much of the boilerplate, routine patterns, and even a lot of sophisticated glue code could be offloaded to AI. The role of &#8220;software engineer&#8221; may evolve to emphasize product thinking, architecture, and validation, with the actual coding being a largely automated act. In this envisioned future, asking an engineer to crank out thousands of lines of mundane code by hand would feel as inefficient as asking a modern accountant to calculate ledgers with pencil and paper. Instead, the engineer would delegate that to their AI agents and focus on the creative and critical-thinking aspects around it.</p><p>Btw, yes, there&#8217;s plenty to be cautious about. We need to ensure these agents don&#8217;t introduce more problems than they solve. And the developer experience of orchestrating multiple agents is still maturing - it can be clunky at times. But the trajectory is clear. 
Just as continuous integration and automated testing became standard practice, <strong>continuous delegation to AI</strong> could become a normal part of the development process. The engineers who master both modes - knowing when to be a precise conductor and when to scale up as an orchestrator - will be in the best position to leverage this &#8220;agentic&#8221; world.</p><p>One thing is certain: the way we build software in the next 5-10 years will look quite different from the last 10. <strong>I want to stress that not all or most code will be agent-driven within a year or two, but that&#8217;s a direction we&#8217;re heading in.</strong> The keyboard isn&#8217;t going away, but alongside our keystrokes we&#8217;ll be issuing high-level instructions to swarms of intelligent helpers. In the end, the human element remains irreplaceable: it&#8217;s our judgment, creativity, and understanding of real-world needs that guides these AI agents toward meaningful outcomes. </p><p><strong>The future of coding isn&#8217;t AI or human, it&#8217;s AI </strong><em><strong>and</strong></em><strong> human - with humans at the helm as conductors and orchestrators, directing a powerful ensemble to achieve our software ambitions.</strong></p><p><em>I&#8217;m excited to share I&#8217;ve written a new <a href="https://beyond.addy.ie">AI-assisted engineering book</a> with O&#8217;Reilly. 
If you&#8217;ve enjoyed my writing here you may be interested in checking it out.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!knBl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdac571f-5afb-495e-ab15-794c18d7702c_5246x3496.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!knBl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdac571f-5afb-495e-ab15-794c18d7702c_5246x3496.png 424w, https://substackcdn.com/image/fetch/$s_!knBl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdac571f-5afb-495e-ab15-794c18d7702c_5246x3496.png 848w, https://substackcdn.com/image/fetch/$s_!knBl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdac571f-5afb-495e-ab15-794c18d7702c_5246x3496.png 1272w, https://substackcdn.com/image/fetch/$s_!knBl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdac571f-5afb-495e-ab15-794c18d7702c_5246x3496.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!knBl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdac571f-5afb-495e-ab15-794c18d7702c_5246x3496.png" width="1456" height="970" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bdac571f-5afb-495e-ab15-794c18d7702c_5246x3496.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:970,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7188581,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/177541153?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdac571f-5afb-495e-ab15-794c18d7702c_5246x3496.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!knBl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdac571f-5afb-495e-ab15-794c18d7702c_5246x3496.png 424w, https://substackcdn.com/image/fetch/$s_!knBl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdac571f-5afb-495e-ab15-794c18d7702c_5246x3496.png 848w, https://substackcdn.com/image/fetch/$s_!knBl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdac571f-5afb-495e-ab15-794c18d7702c_5246x3496.png 1272w, https://substackcdn.com/image/fetch/$s_!knBl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdac571f-5afb-495e-ab15-794c18d7702c_5246x3496.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://addyo.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Elevate is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Gemini CLI Tips & Tricks]]></title><description><![CDATA[~30 pro-tips for effectively using Gemini CLI for agentic coding]]></description><link>https://addyo.substack.com/p/gemini-cli-tips-and-tricks</link><guid isPermaLink="false">https://addyo.substack.com/p/gemini-cli-tips-and-tricks</guid><dc:creator><![CDATA[Addy Osmani]]></dc:creator><pubDate>Tue, 21 Oct 2025 16:04:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!3ImM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58f305f-a049-48f5-a77a-489c731faa7f_1736x1068.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This guide is also available to star/follow on our <a href="https://github.com/addyosmani/gemini-cli-tips">GitHub repository</a>.</em></p><p><strong><a href="https://github.com/google-gemini/gemini-cli">Gemini CLI</a></strong> is an open-source AI assistant that brings the power of Google&#8217;s Gemini model directly into your <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=The%20Gemini%20CLI%20is%20an,via%20a%20Gemini%20API%20key">terminal</a>. 
It functions as a conversational, &#8220;agentic&#8221; command-line tool - meaning it can reason about your requests, choose tools (like running shell commands or editing files), and execute multi-step plans to help with your development <a href="https://cloud.google.com/blog/topics/developers-practitioners/agent-factory-recap-deep-dive-into-gemini-cli-with-taylor-mullen#:~:text=The%20Gemini%20CLI%20%20is,understanding%20of%20the%20developer%20workflow">workflow</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3ImM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58f305f-a049-48f5-a77a-489c731faa7f_1736x1068.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3ImM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58f305f-a049-48f5-a77a-489c731faa7f_1736x1068.png 424w, https://substackcdn.com/image/fetch/$s_!3ImM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58f305f-a049-48f5-a77a-489c731faa7f_1736x1068.png 848w, https://substackcdn.com/image/fetch/$s_!3ImM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58f305f-a049-48f5-a77a-489c731faa7f_1736x1068.png 1272w, https://substackcdn.com/image/fetch/$s_!3ImM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58f305f-a049-48f5-a77a-489c731faa7f_1736x1068.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!3ImM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58f305f-a049-48f5-a77a-489c731faa7f_1736x1068.png" width="1456" height="896" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e58f305f-a049-48f5-a77a-489c731faa7f_1736x1068.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:896,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:521169,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/176589430?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58f305f-a049-48f5-a77a-489c731faa7f_1736x1068.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3ImM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58f305f-a049-48f5-a77a-489c731faa7f_1736x1068.png 424w, https://substackcdn.com/image/fetch/$s_!3ImM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58f305f-a049-48f5-a77a-489c731faa7f_1736x1068.png 848w, https://substackcdn.com/image/fetch/$s_!3ImM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58f305f-a049-48f5-a77a-489c731faa7f_1736x1068.png 1272w, https://substackcdn.com/image/fetch/$s_!3ImM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe58f305f-a049-48f5-a77a-489c731faa7f_1736x1068.png 1456w" 
sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In practical terms, Gemini CLI acts like a supercharged pair programmer and command-line assistant. It excels at coding tasks, debugging, content generation, and even system automation, all through natural language prompts. Before diving into pro tips, let&#8217;s quickly recap how to set up Gemini CLI and get it running.</p><h2><strong>Getting Started</strong></h2><p><strong>Installation:</strong> You can install Gemini CLI via npm. 
For a global install, use:</p><pre><code>npm install -g @google/gemini-cli</code></pre><p>Or run it without installing using <code>npx</code>:</p><pre><code>npx @google/gemini-cli</code></pre><p>Gemini CLI is available on all major platforms (it&#8217;s built with Node.js/TypeScript). Once installed, simply run the <code>gemini</code> command in your terminal to launch the interactive <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Interactive%20Mode%20,conversational%20session">CLI</a>.</p><p><strong>Authentication:</strong> On first use, you&#8217;ll need to authenticate with the Gemini service. You have two options: (1) <strong>Google Account Login (free tier)</strong> - this lets you use Gemini 2.5 Pro for free with generous usage limits (about 60 requests/minute and 1,000 requests per <a href="https://blog.google/technology/developers/introducing-gemini-cli-open-source-ai-agent/#:~:text=Unmatched%20usage%20limits%20for%20individual,developers">day</a>). On launch, Gemini CLI will prompt you to sign in with a Google account (no billing <a href="https://genmind.ch/posts/Howto-Supercharge-Your-Terminal-with-Gemini-CLI/#:~:text=%2A%20Google,Google%20AI%20Studio%2C%20then%20run">required</a>). 
(2) <strong>API Key (paid or higher-tier access)</strong> - you can get an API key from Google AI <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=1,key%20from%20Google%20AI%20Studio">Studio</a> and set the environment variable <code>GEMINI_API_KEY</code> to use <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Method%201%3A%20Shell%20Environment%20Variable,zshrc">it</a>.</p><p>API key usage can offer higher quotas and enterprise data&#8209;use protections; prompts aren&#8217;t used for training on paid/billed usage, though logs may be retained for <a href="https://genmind.ch/posts/Howto-Supercharge-Your-Terminal-with-Gemini-CLI/#:~:text=responses%20may%20be%20logged%20for,Google%20AI%20Studio%2C%20then%20run">safety</a>.</p><p>For example, add to your shell profile:</p><pre><code>export GEMINI_API_KEY="YOUR_KEY_HERE"</code></pre><p><strong>Basic Usage:</strong> To start an interactive session, just run <code>gemini</code> with no arguments. You&#8217;ll get a <code>gemini&gt;</code> prompt where you can type requests or commands. For instance:</p><pre><code>$ gemini
gemini&gt; Create a React recipe management app using SQLite</code></pre><p>You can then watch as Gemini CLI creates files, installs dependencies, runs tests, etc., to fulfill your request. If you prefer a one-shot invocation (non-interactive), use the <code>-p</code> flag with a prompt, for example:</p><pre><code>gemini -p "Summarize the main points of the attached file. @./report.txt"</code></pre><p>This will output a single response and <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=gemini">exit</a>. You can also pipe input into Gemini CLI: for example, <code>echo "Count to 10" | gemini</code> will feed the prompt via <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=gemini%20,txt">stdin</a>.</p><p><strong>CLI Interface:</strong> Gemini CLI provides a rich REPL-like interface. It supports <strong>slash commands</strong> (special commands prefixed with <code>/</code> for controlling the session, tools, and settings) and <strong>bang commands</strong> (prefixed with <code>!</code> to execute shell commands directly). We&#8217;ll cover many of these in the pro tips below. By default, Gemini CLI operates in a safe mode where any action that modifies your system (writing files, running shell commands, etc.) will ask for confirmation. When a tool action is proposed, you&#8217;ll see a diff or command and be prompted (<code>Y/n</code>) to approve or reject it. This ensures the AI doesn&#8217;t make unwanted changes without your consent.</p><p>With the basics out of the way, let&#8217;s explore a series of pro tips and hidden features to help you get the most out of Gemini CLI. Each tip is presented with a simple example first, followed by deeper details and nuances. These tips incorporate advice and insights from the tool&#8217;s creators (e.g. 
Taylor Mullen) and the Google Developer Relations team, as well as the broader community, to serve as a <strong>canonical guide for power users</strong> of Gemini CLI.</p><h2><strong>Tip 1: Use </strong><code>GEMINI.md</code><strong> for Persistent Context</strong></h2><p><strong>Quick use-case:</strong> Stop repeating yourself in prompts. Provide project-specific context or instructions by creating a <code>GEMINI.md</code> file, so the AI always has important background knowledge without being told every <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Context%20Files%20%28">time</a>.</p><p>When working on a project, you often have certain overarching details - e.g. coding style guidelines, project architecture, or important facts - that you want the AI to keep in mind. Gemini CLI allows you to encode these in one or more <code>GEMINI.md</code> files. Simply create a <code>.gemini</code> folder (if not already present) in your project, and add a Markdown file named <code>GEMINI.md</code> with whatever notes or instructions you want the AI to persist. For example:</p><pre><code><strong># Project Phoenix - AI Assistant</strong>

- All Python code must follow PEP 8 style.  
- Use 4 spaces for indentation.  
- The user is building a data pipeline; prefer functional programming paradigms.</code></pre><p>Place this file in your project root (or in subdirectories for more granular context). Now, whenever you run <code>gemini</code> in that project, it will automatically load these instructions into <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Context%20Files%20%28">context</a>. This means the model will <em>always</em> be primed with them, avoiding the need to prepend the same guidance to every prompt.</p><p><strong>How it works:</strong> Gemini CLI uses a hierarchical context loading <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Hierarchical%20Loading%3A%20The%20CLI%20combines,The%20loading%20order%20is">system</a>. It will combine <strong>global context</strong> (from <code>~/.gemini/GEMINI.md</code>, which you can use for cross-project defaults) with your <strong>project-specific </strong><code>GEMINI.md</code>, and even context files in subfolders. More specific files override more general ones. You can inspect what context was loaded at any time by using the command:</p><pre><code>/memory show</code></pre><p>This will display the full combined context the AI <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=,current%20conversation%20with%20a%20tag">sees</a>. If you make changes to your <code>GEMINI.md</code>, use <code>/memory refresh</code> to reload the context without restarting the <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=,current%20conversation%20with%20a%20tag">session</a>.</p><p><strong>Pro Tip:</strong> Use the <code>/init</code> slash command to quickly generate a starter <code>GEMINI.md</code>. Running <code>/init</code> in a new project creates a template context file with information like the tech stack detected, a summary of the project, <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=,directory%20workspace%20%28e.g.%2C%20%60add">etc</a>. 
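To make the hierarchy concrete, here is a minimal sketch. The file names follow the convention above, but the directory layout and contents are illustrative, and a local demo directory stands in for the real locations:

```shell
# Illustrative sketch of hierarchical context: a project-level GEMINI.md
# plus a more specific one in a subfolder. (The global file would live at
# ~/.gemini/GEMINI.md; a local demo-project/ directory stands in here.)
mkdir -p demo-project/packages/api

cat > demo-project/GEMINI.md <<'EOF'
# Project context (applies repo-wide)
- All Python code must follow PEP 8 style.
EOF

cat > demo-project/packages/api/GEMINI.md <<'EOF'
# Package context (more specific; overrides the project-level file)
- This package is a REST API; prefer small, pure handler functions.
EOF

# Running `gemini` inside demo-project would combine these, most specific last.
ls demo-project/GEMINI.md demo-project/packages/api/GEMINI.md
```
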
You can then edit and expand that file. For large projects, consider breaking the context into multiple files and <strong>importing</strong> them into <code>GEMINI.md</code> with the <code>@</code> import syntax. For example, your main <code>GEMINI.md</code> could have lines like <code>@./docs/prompt-guidelines.md</code> to pull in additional context <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Modularizing%20Context%20with%20Imports%3A%20You,files">files</a>. This keeps your instructions organized.</p><p>With a well-crafted <code>GEMINI.md</code>, you essentially give Gemini CLI a &#8220;memory&#8221; of the project&#8217;s requirements and conventions. This <strong>persistent context</strong> leads to more relevant responses and less back-and-forth prompt engineering.</p><h2><strong>Tip 2: Create Custom Slash Commands</strong></h2><p><strong>Quick use-case:</strong> Speed up repetitive tasks by defining <a href="https://cloud.google.com/blog/topics/developers-practitioners/gemini-cli-custom-slash-commands">your own slash commands</a>. For example, you could make a command <code>/test:gen</code> that generates unit tests from a description, or <code>/db:reset</code> that drops and recreates a test database. This extends Gemini CLI&#8217;s functionality with one-liners tailored to your workflow.</p><p>Gemini CLI supports <strong>custom slash commands</strong> that you can define in simple configuration files. </p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;174128e6-9681-41f6-8cae-af7180cbf82c&quot;,&quot;duration&quot;:null}"></div><p>Under the hood, these are essentially pre-defined prompt templates. 
To create one, make a directory <code>commands/</code> under either <code>~/.gemini/</code> for global commands or in your project&#8217;s <code>.gemini/</code> folder for project-specific <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Custom%20Commands">commands</a>. Inside <code>commands/</code>, create a TOML file for each new command. The file name format determines the command name: e.g. a file <code>test/gen.toml</code> defines a command <code>/test:gen</code>.</p><p>Let&#8217;s walk through an example. Say you want a command to generate a unit test from a requirement description. You could create <code>~/.gemini/commands/test/gen.toml</code> with the following content:</p><pre><code><strong># Invoked as: /test:gen "Description of the test"  </strong>
description = "Generates a unit test based on a requirement."
prompt = """
You are an expert test engineer. Based on the following requirement, please write a comprehensive unit test using the Jest framework.

Requirement: {{args}}  
"""</code></pre><p>Now, after reloading or restarting Gemini CLI, you can simply type:</p><pre><code>/test:gen "Ensure the login button redirects to the dashboard upon success"</code></pre><p>Gemini CLI will recognize <code>/test:gen</code> and substitute the <code>{{args}}</code> in your prompt template with the provided argument (in this case, the requirement). The AI will then proceed to generate a Jest unit test <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Example%3A%20%60">accordingly</a>. The <code>description</code> field is optional but is used when you run <code>/help</code> or <code>/tools</code> to list available commands.</p><p>This mechanism is extremely powerful - effectively, you can script the AI with natural language. The community has created numerous useful custom commands. For instance, Google&#8217;s DevRel team shared a set of <em>10 practical workflow commands</em> (via an open-source repo) demonstrating how you can script common flows like creating API docs, cleaning data, or setting up boilerplate <a href="https://cloud.google.com/blog/topics/developers-practitioners/agent-factory-recap-deep-dive-into-gemini-cli-with-taylor-mullen#:~:text=,to%20generate%20a%20better%20output">code</a>. By defining a custom command, you package a complex prompt (or series of prompts) into a reusable shortcut.</p><p><strong>Pro Tip:</strong> Custom commands can also be used to enforce formatting or apply a &#8220;persona&#8221; to the AI for certain tasks. For example, you might have a <code>/review:security</code> command that always prefaces the prompt with &#8220;You are a security auditor...&#8221; to review code for vulnerabilities. This approach ensures consistency in how the AI responds to specific categories of tasks.</p><p>To share commands with your team, you can commit the TOML files in your project&#8217;s repo (under <code>.gemini/commands</code> directory). 
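For example (the layout follows the conventions above, and the command body repeats the earlier <code>/test:gen</code> example), creating a project-scoped command to check in might look like this:

```shell
# Create a project-scoped custom command; Gemini CLI surfaces this file
# as /test:gen for anyone working in the repo.
mkdir -p .gemini/commands/test

cat > .gemini/commands/test/gen.toml <<'EOF'
description = "Generates a unit test based on a requirement."
prompt = """
You are an expert test engineer. Based on the following requirement,
please write a comprehensive unit test using the Jest framework.

Requirement: {{args}}
"""
EOF

# Then commit it so teammates pick the command up automatically, e.g.:
# git add .gemini/commands/test/gen.toml && git commit -m "Add /test:gen"
```
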
Team members who have Gemini CLI will automatically pick up those commands when working in the project. This is a great way to <strong>standardize AI-assisted workflows</strong> across a team.</p><h2><strong>Tip 3: Extend Gemini with Your Own </strong><code>MCP</code><strong> Servers</strong></h2><p><strong>Quick use-case:</strong> Suppose you want Gemini to interface with an external system or a custom tool that isn&#8217;t built-in - for example, query a proprietary database, or integrate with Figma designs. You can do this by running a custom <strong>Model Context Protocol (MCP) server</strong> and plugging it into Gemini <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Extend%20the%20CLI%20with%20your,add%7Clist%7Cremove%3E%60%20commands">CLI</a>. MCP servers let you add new tools and abilities to Gemini, effectively <strong>extending the agent</strong>.</p><p>Gemini CLI comes with several MCP servers out-of-the-box (for instance, ones enabling Google Search, code execution sandboxes, etc.), and you can add your own. An MCP server is essentially an external process (it could be a local script, a microservice, or even a cloud endpoint) that speaks a simple protocol to handle tasks for Gemini. This architecture is what makes Gemini CLI so <a href="https://blog.google/technology/developers/introducing-gemini-cli-open-source-ai-agent/#:~:text=,interactively%20within%20your%20scripts">extensible</a>.</p><p><strong>Examples of MCP servers:</strong> Some community and Google-provided MCP integrations include a <strong>Figma MCP</strong> (to fetch design details from Figma), a <strong>Clipboard MCP</strong> (to read/write from your system clipboard), and others. 
In fact, in an internal demo, the Gemini CLI team showcased a &#8220;Google Docs MCP&#8221; server that allowed saving content directly to Google <a href="https://cloud.google.com/blog/topics/developers-practitioners/agent-factory-recap-deep-dive-into-gemini-cli-with-taylor-mullen#:~:text=%2A%20Utilize%20the%20google,summary%20directly%20to%20Google%20Docs">Docs</a>. The idea is that whenever Gemini needs to perform an action that the built-in tools can&#8217;t handle, it can delegate to your MCP server.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;f6f5b0b7-9579-41fc-9ddd-b45906f672e2&quot;,&quot;duration&quot;:null}"></div><p><em>Above is a <a href="https://medium.com/google-cloud/gemini-cli-figma-mcp-server-turn-design-into-code-in-minutes-88ba219615c6">demo</a> of Gemini CLI being used with the Figma MCP for design =&gt; code</em></p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;9a7a2858-814d-4e87-aadd-b7005d0c54ee&quot;,&quot;duration&quot;:null}"></div><p><em>The <a href="https://developer.chrome.com/blog/chrome-devtools-mcp">Chrome DevTools MCP</a> is also of course a favorite :)</em></p><p><strong>How to add one:</strong> You can configure MCP servers via your <code>settings.json</code> or using the CLI. For a quick setup, try the CLI command:</p><pre><code>gemini mcp add myserver --command "python3 my_mcp_server.py" --port 8080</code></pre><p>This would register a server named &#8220;myserver&#8221; that Gemini CLI will launch by running the given command (here a Python module) on port 8080. In <code>~/.gemini/settings.json</code>, it would add an entry under <code>mcpServers</code>. For example:</p><pre><code>"mcpServers": {
  "myserver": {
    "command": "python3",
    "args": ["-m", "my_mcp_server", "--port", "8080"],
    "cwd": "./mcp_tools/python",
    "timeout": 15000
  }
}</code></pre><p>This configuration (based on the official docs) tells Gemini how to start the MCP server and <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Example%20">where</a>. Once running, the tools provided by that server become available to Gemini CLI. You can list all MCP servers and their tools with the slash command:</p><pre><code>/mcp</code></pre><p>This will show any registered servers and what tool names they <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Command%20Description%20,List%20active%20extensions">expose</a>.</p><p><strong>Power of MCP:</strong> MCP servers can provide <strong>rich, multi-modal results</strong>. For instance, a tool served via MCP could return an image or a formatted table as part of the response to Gemini <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Capabilities%3A">CLI</a>. They also support OAuth 2.0, so you can securely connect to APIs (like Google&#8217;s APIs, GitHub, etc.) via an MCP tool without exposing <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Extend%20the%20CLI%20with%20your,add%7Clist%7Cremove%3E%60%20commands">credentials</a>. Essentially, if you can code it, you can wrap it as an MCP tool - turning Gemini CLI into a hub that orchestrates many services.</p><p><strong>Default vs. custom:</strong> By default, Gemini CLI&#8217;s built-in tools cover a lot (reading files, web search, executing shell commands, etc.), but MCP lets you go beyond. Some advanced users have created MCP servers to interface with internal systems or to perform specialized data processing. For example, you could have a <code>database-mcp</code> that provides a <code>/query_db</code> tool for running SQL queries on a company database, or a <code>jira-mcp</code> to create tickets via natural language.</p><p>When creating your own, be mindful of security: by default, custom MCP tools require confirmation unless you mark them as trusted. 
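As a sketch of what that looks like (a local file stands in for <code>~/.gemini/settings.json</code>, and the entry extends the <code>mcpServers</code> example shown earlier; treat the exact shape as an assumption):

```shell
# Sketch: a per-server "trust" flag, so the server's tool calls are
# auto-approved. Writing to a local settings.json for illustration;
# the real file is ~/.gemini/settings.json.
cat > settings.json <<'EOF'
{
  "mcpServers": {
    "myserver": {
      "command": "python3",
      "args": ["-m", "my_mcp_server", "--port", "8080"],
      "trust": true
    }
  }
}
EOF
grep '"trust"' settings.json
```
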
You can control safety with settings like <code>trust: true</code> for a server (which auto-approves its tool actions) or by whitelisting specific safe tools and blacklisting dangerous <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=,takes%20precedence">ones</a>.</p><p>In short, <strong>MCP servers unlock limitless integration</strong>. They&#8217;re a pro feature that lets Gemini CLI become the glue between your AI assistant and whatever system you need it to work with. If you&#8217;re interested in building one, check out the official <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Transport%20">MCP guide</a> and community examples.</p><h2><strong>Tip 4: Leverage Memory Addition &amp; Recall</strong></h2><p><strong>Quick use-case:</strong> Keep important facts at your AI&#8217;s fingertips by adding them to its long-term memory. For example, after figuring out a database port or an API token, you can do:</p><pre><code>/memory add "Our staging RabbitMQ is on port 5673"</code></pre><p>This will store that fact so you (or the AI) don&#8217;t forget it <a href="https://binaryverseai.com/gemini-cli-open-source-ai-tool/#:~:text=Gemini%20CLI%20Ultimate%20Agent%3A%2060,a%20branch%20of%20conversation">later</a>. You can then recall everything in memory with <code>/memory show</code> at any time.</p><p>The <code>/memory</code> commands provide a simple but powerful mechanism for <em>persistent memory</em>. When you use <code>/memory add &lt;text&gt;</code>, the given text is appended to your project&#8217;s global context (technically, it&#8217;s saved into the global <code>~/.gemini/GEMINI.md</code> file or the project&#8217;s <code>GEMINI.md</code>). 
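Conceptually, that append is simple - this is a rough sketch rather than the CLI's actual implementation, with a local file standing in for <code>~/.gemini/GEMINI.md</code>:

```shell
# Rough sketch of what /memory add does: append the fact as a bullet
# to a persistent context file the CLI reloads on each session.
MEMFILE=GEMINI.md        # stands in for ~/.gemini/GEMINI.md
touch "$MEMFILE"
echo '- Our staging RabbitMQ is on port 5673' >> "$MEMFILE"
grep 'RabbitMQ' "$MEMFILE"
```
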
</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!pgoR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F058154ba-d894-488a-8d44-6590fa2d8a6e_2048x1047.jpeg" width="1456" height="744" alt="" loading="lazy"></figure></div><p>It&#8217;s a bit like taking a note and pinning it to the AI&#8217;s virtual bulletin board. Once added, the AI will always see that note in the prompt context for future interactions, across sessions.</p><p>Consider an example: you&#8217;re debugging an issue and discover a non-obvious insight (&#8220;The config flag <code>X_ENABLE</code> must be set to <code>true</code> or the service fails to start&#8221;). If you add this to memory, then later, when you or the AI revisit a related problem, it won&#8217;t overlook this critical detail - it&#8217;s in the context.</p><p><strong>Using </strong><code>/memory</code><strong>:</strong></p><ul><li><p><code>/memory add "&lt;text&gt;"</code> - Add a fact or note to memory (persistent context). 
This updates the <code>GEMINI.md</code> immediately with the new entry.</p></li><li><p><code>/memory show</code> - Display the full content of the memory (i.e. the combined context file that&#8217;s currently loaded).</p></li><li><p><code>/memory refresh</code> - Reload the context from disk (useful if you manually edited the <code>GEMINI.md</code> file outside of Gemini CLI, or if multiple people are collaborating on it).</p></li></ul><p>Because the memory is stored in Markdown, you can also manually edit the <code>GEMINI.md</code> file to curate or organize the info. The <code>/memory</code> commands are there for convenience during conversation, so you don&#8217;t have to open an editor.</p><p><strong>Pro Tip:</strong> This feature is great for &#8220;decision logs.&#8221; If you decide on an approach or rule during a chat (e.g., a certain library to use, or an agreed code style), add it to memory. The AI will then recall that decision and avoid contradicting it later. It&#8217;s especially useful in long sessions that might span hours or days - by saving key points, you mitigate the model&#8217;s tendency to forget earlier context when the conversation gets long.</p><p>Another use is personal notes. Because <code>~/.gemini/GEMINI.md</code> (global memory) is loaded for all sessions, you could put general preferences or information there. For example, &#8220;The user&#8217;s name is Alice. Speak politely and avoid slang.&#8221; It&#8217;s like configuring the AI&#8217;s persona or global knowledge. Just be aware that global memory applies to <em>all</em> projects, so don&#8217;t clutter it with project-specific info.</p><p>In summary, <strong>Memory Addition &amp; Recall</strong> helps Gemini CLI maintain state. Think of it as a knowledge base that grows with your project. 
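</p><p>To make this concrete, after the <code>/memory add</code> calls above your <code>GEMINI.md</code> might contain a section along these lines (the exact formatting varies; the entries echo the examples from this tip):</p><pre><code># Added memories

- Our staging RabbitMQ is on port 5673
- The config flag X_ENABLE must be set to true or the service fails to start</code></pre><p>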
Use it to avoid repeating yourself or to remind the AI of facts it would otherwise have to rediscover from scratch.</p><h2><strong>Tip 5: Use Checkpointing and </strong><code>/restore</code><strong> as an Undo Button</strong></h2><p><strong>Quick use-case:</strong> If Gemini CLI makes a series of changes to your files that you&#8217;re not happy with, you can <em>instantly roll back</em> to a prior state. Enable checkpointing when you start Gemini (or in settings), and use the <code>/restore</code> command to undo changes like a lightweight Git <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=,Exit%20the%20Gemini%20CLI">revert</a>. <code>/restore</code> rolls back your workspace to the saved checkpoint; conversation state may be affected depending on how the checkpoint was captured.</p><p>Gemini CLI&#8217;s <strong>checkpointing</strong> feature acts as a safety net. When enabled, the CLI takes a snapshot of your project&#8217;s files <em>before</em> each tool execution that modifies <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=When%20,snapshot%20before%20tools%20modify%20files">files</a>. If something goes wrong, you can revert to the last known good state. It&#8217;s essentially version control for the AI&#8217;s actions, without you needing to manually commit to Git each time.</p><p><strong>How to use it:</strong> You can turn on checkpointing by launching the CLI with the <code>--checkpointing</code> flag:</p><pre><code>gemini --checkpointing</code></pre><p>Alternatively, you can make it the default by adding to your config (<code>"checkpointing": { "enabled": true }</code> in <code>settings.json</code>). 
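</p><p>Put differently, a minimal <code>settings.json</code> that turns checkpointing on by default would look like this (any other settings you already have stay alongside it):</p><pre><code>{
  "checkpointing": {
    "enabled": true
  }
}</code></pre><p>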
Once active, you&#8217;ll notice that each time Gemini is about to write to a file, it says something like &#8220;Checkpoint saved.&#8221;</p><p>If you then realize an AI-made edit is problematic, you have two options:</p><ul><li><p>Run <code>/restore list</code> (or just <code>/restore</code> with no arguments) to see a list of recent checkpoints with timestamps and descriptions.</p></li><li><p>Run <code>/restore &lt;id&gt;</code> to roll back to a specific checkpoint. If you omit the id and there&#8217;s only one pending checkpoint, it will restore that by <a href="https://medium.com/@ferreradaniel/gemini-cli-free-ai-tool-upgrade-5-new-features-you-need-right-now-04cfefac5e93#:~:text=Step">default</a>.</p></li></ul><p>For example:</p><pre><code>/restore</code></pre><p>Gemini CLI might output:</p><pre><code>0: [2025-09-22 10:30:15] Before running 'apply_patch'
1: [2025-09-22 10:45:02] Before running 'write_file'</code></pre><p>You can then do <code>/restore 0</code> to revert all file changes (and possibly the conversation context) back to how things were at that checkpoint. In this way, you can &#8220;undo&#8221; a mistaken code refactor or any other changes Gemini <a href="https://medium.com/@ferreradaniel/gemini-cli-free-ai-tool-upgrade-5-new-features-you-need-right-now-04cfefac5e93#:~:text=1,point%20and%20roll%20back%20instantly">made</a>.</p><p><strong>What gets restored:</strong> The checkpoint captures the state of your working directory (all files that Gemini CLI is allowed to modify); conversation state may also be rolled back, depending on how the checkpoint was captured. When you restore, files are overwritten with their checkpointed versions, effectively rewinding the workspace to that snapshot. It&#8217;s like time-traveling the AI agent back to before it made the wrong turn. 
Note that it won&#8217;t undo external side effects (for example, if the AI ran a database migration, it can&#8217;t undo that), but anything in the file system and chat context is fair game.</p><p><strong>Best practices:</strong> It&#8217;s a good idea to keep checkpointing on for non-trivial tasks. The overhead is small, and it provides peace of mind. If you find you don&#8217;t need a checkpoint (everything went well), you can always clear it or just let the next one overwrite it. The development team recommends using checkpointing especially before multi-step code <a href="https://medium.com/@ferreradaniel/gemini-cli-free-ai-tool-upgrade-5-new-features-you-need-right-now-04cfefac5e93#:~:text=Tips%20to%20avoid%20messy%20rollbacks">edits</a>. For mission-critical projects, though, you should still use proper version control (<code>git</code>) as your primary safety <a href="https://medium.com/@ferreradaniel/gemini-cli-free-ai-tool-upgrade-5-new-features-you-need-right-now-04cfefac5e93#:~:text=No,VS%20Code%20is%20already%20free">net</a> - consider checkpoints as a convenience for quick undo rather than a full VCS.</p><p>In essence, <code>/restore</code> lets you use Gemini CLI with confidence. You can let the AI attempt bold changes, knowing you have an <em>&#8220;OH NO&#8221; button</em> to rewind if needed.</p><h2><strong>Tip 6: Read Google Docs, Sheets, and More</strong></h2><p><strong>Quick use-case:</strong> Imagine you have a Google Doc or Sheet with some specs or data that you want the AI to use. Instead of copy-pasting the content, you can provide the link, and with a configured Workspace MCP server Gemini CLI can fetch and read it.</p><p>For example:</p><pre><code>Summarize the requirements from this design doc: https://docs.google.com/document/d/&lt;id&gt;</code></pre><p>Gemini can pull in the content of that Doc and incorporate it into its response. 
Similarly, it can read Google Sheets or Drive files by link.</p><p><strong>How this works:</strong> These capabilities are typically enabled via <strong>MCP integrations</strong>. Google&#8217;s Gemini CLI team  is working on connectors for Google Workspace. One approach is running a small MCP server that uses Google&#8217;s APIs (Docs API, Sheets API, etc.) to retrieve document content when given a URL or <a href="https://github.com/google-gemini/gemini-cli/issues/7175">ID</a>. When configured, you might have slash commands or tools like <code>/read_google_doc</code> or simply an auto-detection that sees a Google Docs link and invokes the appropriate tool to fetch it.</p><p>For example, in an Agent Factory podcast demo, the team used a <strong>Google Docs MCP</strong> to save a summary directly to a <a href="https://cloud.google.com/blog/topics/developers-practitioners/agent-factory-recap-deep-dive-into-gemini-cli-with-taylor-mullen#:~:text=%2A%20Utilize%20the%20google,summary%20directly%20to%20Google%20Docs">doc</a> - which implies they could also read the doc&#8217;s content in the first place. In practice, you might do something like:</p><pre><code>@https://docs.google.com/document/d/XYZ12345</code></pre><p>Including a URL with <code>@</code> (the context reference syntax) signals Gemini CLI to fetch that resource. With a Google Doc integration in place, the content of that document would be pulled in as if it were a local file. From there, the AI can summarize it, answer questions about it, or otherwise use it in the conversation.</p><p>Similarly, if you paste a Google Drive <strong>file link</strong>, a properly configured Drive tool could download or open that file (assuming permissions and API access are set up). 
<strong>Google Sheets</strong> could be made available via an MCP that runs queries or reads cell ranges, enabling you to ask things like &#8220;What&#8217;s the sum of the budget column in this Sheet [link]?&#8221; and have the AI calculate it.</p><p><strong>Setting it up:</strong> As of this writing, the Google Workspace integrations may require some tinkering (obtaining API credentials, running an MCP server such as the one described by <a href="https://medium.com/google-cloud/managing-google-docs-sheets-and-slides-by-natural-language-with-gemini-cli-and-mcp-62f4dfbef2d5#:~:text=To%20implement%20this%20approach%2C%20I,methods%20for%20each%20respective%20API">Kanshi Tanaike</a>, etc.). Keep an eye on the official Gemini CLI repository and community forums for ready-to-use extensions - for example, an official Google Docs MCP might become available as a plugin/extension. If you&#8217;re eager, you can write one following guides on how to use Google APIs within an MCP <a href="https://github.com/google-gemini/gemini-cli/issues/7175#:~:text=">server</a>. It typically involves handling OAuth (which Gemini CLI supports for MCP servers) and then exposing tools like <code>read_google_doc</code>.</p><p><strong>Usage tip:</strong> When you have these tools, using them can be as simple as providing the link in your prompt (the AI might automatically invoke the tool to fetch it) or using a slash command like <code>/doc open &lt;URL&gt;</code>. Check <code>/tools</code> to see what commands are available - Gemini CLI lists all tools and custom commands <a href="https://dev.to/therealmrmumba/7-insane-gemini-cli-tips-that-will-make-you-a-superhuman-developer-2d7h#:~:text=Gemini%20CLI%20includes%20dozens%20of,can%20supercharge%20your%20dev%20process">there</a>.</p><p>In summary, <strong>Gemini CLI can reach out beyond your local filesystem</strong>. Whether it&#8217;s Google Docs, Sheets, Drive, or other external content, you can pull data in by reference. 
This pro tip saves you from manual copy-paste and keeps the context flow natural - just refer to the document or dataset you need, and let the AI grab what&#8217;s needed. It makes Gemini CLI a true <strong>knowledge assistant</strong> for all the information you have access to, not just the files on your disk.</p><p><em>(Note: Accessing private documents of course requires the CLI to have the appropriate permissions. Always ensure any integration respects security and privacy. In corporate settings, setting up such integrations might involve additional auth steps.)</em></p><h2><strong>Tip 7: Reference Files and Images with </strong><code>@</code><strong> for Explicit Context</strong></h2><p><strong>Quick use-case:</strong> Instead of describing a file&#8217;s content or an image verbally, just point Gemini CLI directly to it. Using the <code>@</code> syntax, you can attach files, directories, or images into your prompt. This guarantees the AI sees exactly what&#8217;s in those files as <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Reference%20files%20or%20directories%20in,PDFs%2C%20audio%2C%20and%20video%20files">context</a>. For example:</p><pre><code>Explain this code to me: @./src/main.js</code></pre><p>This will include the contents of <code>src/main.js</code> in the prompt (up to Gemini&#8217;s context size limits), so the AI can read it and explain <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Include%20a%20single%20file%3A">it</a>.</p><p>This <code>@</code> <em>file reference</em> is one of Gemini CLI&#8217;s most powerful features for developers. It eliminates ambiguity - you&#8217;re not asking the model to rely on memory or guesswork about the file, you&#8217;re literally handing it the file to read. You can use this for source code, text documents, logs, etc. 
Similarly, you can reference <strong>entire directories</strong>:</p><pre><code>Refactor the code in @./utils/ to use async/await.</code></pre><p>By appending a path that ends in a slash, Gemini CLI will recursively include files from that <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Include%20a%20whole%20directory%20">directory</a> (within reason, respecting ignore files and size limits). This is great for multi-file refactors or analyses, as the AI can consider all relevant modules together.</p><p>Even more impressively, you can reference <strong>binary files like images</strong> in prompts. Gemini CLI (using the Gemini model&#8217;s multimodal capabilities) can understand images. For example:</p><pre><code>Describe what you see in this screenshot: @./design/mockup.png</code></pre><p>The image will be fed into the model, and the AI might respond with something like &#8220;This is a login page with a blue sign-in button and a header image,&#8221; <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Include%20an%20image%3A">etc</a>. You can imagine the uses: reviewing UI mockups, organizing photos (as we&#8217;ll see in a later tip), or extracting text from images (Gemini can do OCR as well).</p><p>A few notes on using <code>@</code> references effectively:</p><ul><li><p><strong>File limits:</strong> Gemini 2.5 Pro has a huge context window (up to 1 million <a href="https://blog.google/technology/developers/introducing-gemini-cli-open-source-ai-agent/#:~:text=To%20use%20Gemini%20CLI%20free,per%20day%20at%20no%20charge">tokens</a>), so you can include quite large files or many files. However, extremely large files might be truncated. If a file is enormous (say, hundreds of thousands of lines), consider summarizing it or breaking it into parts. 
Gemini CLI will warn you if a reference is too large or if it skipped something due to size.</p></li><li><p><strong>Automatic ignoring:</strong> By default, Gemini CLI respects your <code>.gitignore</code> and <code>.geminiignore</code> files when pulling in directory <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Reference%20files%20or%20directories%20in,PDFs%2C%20audio%2C%20and%20video%20files">context</a>. So if you <code>@./</code> a project root, it will not dump huge ignored folders (like <code>node_modules</code>) into the prompt. You can customize ignore patterns with <code>.geminiignore</code> similarly to how <code>.gitignore</code> works.</p></li><li><p><strong>Explicit vs implicit context:</strong> Taylor Mullen (the creator of Gemini CLI) emphasizes using <code>@</code> for <em>explicit context injection</em> rather than relying on the model&#8217;s memory or summarizing things yourself. It&#8217;s more precise and ensures the AI isn&#8217;t hallucinating content. Whenever possible, point the AI to the source of truth (code, config files, documentation) with <code>@</code> references. This practice can significantly improve accuracy.</p></li><li><p><strong>Chaining references:</strong> You can include multiple files in one prompt, like:</p></li></ul><pre><code>Compare @./foo.py and @./bar.py and tell me differences.</code></pre><p>The CLI will include both files. Just be mindful of token limits; multiple large files might consume a lot of the context window.</p><p>Using <code>@</code> is essentially how you <strong>feed knowledge into Gemini CLI on the fly</strong>. It turns the CLI into a multi-modal reader that can handle text and images. As a pro user, get into the habit of leveraging this - it&#8217;s often faster and more reliable than asking the AI something like &#8220;Open the file X and do Y&#8221; (which it may or may not do on its own). 
Instead, you explicitly give it X to work with.</p><h2><strong>Tip 8: On-the-Fly Tool Creation (Have Gemini Build Helpers)</strong></h2><p><strong>Quick use-case:</strong> If a task at hand would benefit from a small script or utility, you can ask Gemini CLI to create that tool for you - right within your session. For example, you might say, &#8220;Write a Python script to parse all JSON files in this folder and extract the error fields.&#8221; Gemini can generate the script, which you can then execute via the CLI. In essence, you can <strong>dynamically extend the toolset</strong> as you go.</p><p>Gemini CLI is not limited to its pre-existing tools; it can use its coding abilities to fabricate new ones when needed. This often happens implicitly: if you ask for something complex, the AI might propose writing a temporary file (with code) and then running it. As a user, you can also guide this process explicitly:</p><ul><li><p><strong>Creating scripts:</strong> You can prompt Gemini to create a script or program in the language of your choice. It will likely use the <code>write_file</code> tool to create the file. For instance:</p></li></ul><pre><code>Generate a Node.js script that reads all &#8216;.log&#8217; files in the current directory and reports the number of lines in each.</code></pre><p>Gemini CLI will draft the code, and with your approval, write it to a file (e.g. <code>script.js</code>). You can then run it by either using the <code>!</code> shell command (e.g. <code>!node script.js</code>) or by asking Gemini CLI to execute it (the AI might automatically use <code>run_shell_command</code> to execute the script it just wrote, if it deems it part of the plan).</p><ul><li><p><strong>Temporary tools via MCP:</strong> In advanced scenarios, the AI might even suggest launching an MCP server for some specialized tasks. 
For example, if your prompt involves some heavy text processing that might be better done in Python, Gemini could generate a simple MCP server in Python and run it. While this is more rare, it demonstrates that the AI can set up a new &#8220;agent&#8221; on the fly. (One of the slides from the Gemini CLI team humorously referred to &#8220;MCP servers for everything, even one called LROwn&#8221; - suggesting you can have Gemini run an instance of itself or another model, though that&#8217;s more of a trick than a practical use!).</p></li></ul><p>The key benefit here is <strong>automation</strong>. Instead of you manually stopping to write a helper script, you can let the AI do it as part of the flow. It&#8217;s like having an assistant who can create tools on-demand. This is especially useful for data transformation tasks, batch operations, or one-off computations that the built-in tools don&#8217;t directly provide.</p><p><strong>Nuances and safety:</strong> When Gemini CLI writes code for a new tool, you should still review it before running. The <code>/diff</code> view (Gemini will show you the file diff before you approve writing it) is your chance to inspect the <a href="https://medium.com/@ferreradaniel/gemini-cli-free-ai-tool-upgrade-5-new-features-you-need-right-now-04cfefac5e93#:~:text=Nobody%20enjoys%20switching%20between%20windows,track%20changes%20line%20by%20line">code</a>. Ensure it does what you expect and nothing malicious or destructive (the AI shouldn&#8217;t produce something harmful unless your prompt explicitly asks, but just like any code from an AI, double-check logic, especially for scripts that delete or modify lots of data).</p><p><strong>Example scenario:</strong> Let&#8217;s say you have a CSV file and you want to filter it in a complex way. You ask Gemini CLI to do it, and it might say: &#8220;I will write a Python script to parse the CSV and apply the filter.&#8221; It then creates <code>filter_data.py</code>. 
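</p><p>Such a <code>filter_data.py</code> might be little more than a dozen lines. Here&#8217;s a sketch of the kind of script it could produce (the column name and threshold are invented for illustration):</p><pre><code>import csv

def filter_rows(src_path, dst_path, column="amount", threshold=1000.0):
    # Copy rows from src_path to dst_path, keeping only those whose
    # numeric `column` value exceeds `threshold`.
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if float(row[column]) > threshold:
                writer.writerow(row)</code></pre><p>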
After you approve and it runs, you get your result, and you might never need that script again. This ephemeral creation of tools is a pro move - it shows the AI effectively extending its capabilities autonomously.</p><p><strong>Pro Tip:</strong> If you find the script useful beyond the immediate context, you can promote it into a permanent tool or command. For instance, if the AI generated a great log-processing script, you might later turn it into a custom slash command (Tip #2) for easy reuse. The combination of Gemini&#8217;s generative power and the extension hooks means your toolkit can continuously evolve as you use the CLI.</p><p>In summary, <strong>don&#8217;t restrict Gemini to what it comes with</strong>. Treat it as a junior developer who can whip up new programs or even mini-servers to help solve the problem. This approach embodies the agentic philosophy of Gemini CLI - it will figure out what tools it needs, even if it has to code them on the spot.</p><h2><strong>Tip 9: Use Gemini CLI for System Troubleshooting &amp; Configuration</strong></h2><p><strong>Quick use-case:</strong> You can run Gemini CLI outside of a code project to help with general system tasks - think of it as an intelligent assistant for your OS. For example, if your shell is misbehaving, you could open Gemini in your home directory and ask: &#8220;Fix my <code>.bashrc</code> file, it has an error.&#8221; Gemini can then open and edit your config file for you.</p><p>This tip highlights that <strong>Gemini CLI isn&#8217;t just for coding projects - it&#8217;s your AI helper for your whole development environment</strong>. Many users have used Gemini to customize their dev setup or fix issues on their machine:</p><ul><li><p><strong>Editing dotfiles:</strong> You can load your shell configuration (<code>.bashrc</code> or <code>.zshrc</code>) by referencing it (<code>@~/.bashrc</code>) and then ask Gemini CLI to optimize or troubleshoot it. 
For instance, &#8220;My <code>PATH</code> isn&#8217;t picking up Go binaries, can you edit my <code>.bashrc</code> to fix that?&#8221; The AI can insert the correct <code>export</code> line. It will show you the diff for confirmation before saving changes.</p></li><li><p><strong>Diagnosing errors:</strong> If you encounter a cryptic error in your terminal or an application log, you can copy it and feed it to Gemini CLI. It will analyze the error message and often suggest steps to resolve it. This is similar to how one might use StackOverflow or Google, but with the AI directly examining your scenario. For example: &#8220;When I run <code>npm install</code>, I get an <code>EACCES</code> permission error - how do I fix this?&#8221; Gemini might detect it&#8217;s a permissions issue in <code>node_modules</code> and guide you to change directory ownership or use a proper node version manager.</p></li><li><p><strong>Running outside a project:</strong> By default, if you run <code>gemini</code> in a directory without a <code>.gemini</code> context, it just means no project-specific context is loaded - but you can still use the CLI fully. This is great for ad-hoc tasks like system troubleshooting. You might not have any code files for it to consider, but you can still run shell commands through it or let it fetch web info. Essentially, you&#8217;re treating Gemini CLI as an AI-powered terminal that can <em>do</em> things for you, not just chat.</p></li><li><p><strong>Workstation customization:</strong> Want to change a setting or install a new tool? You can ask Gemini CLI, &#8220;Install Docker on my system&#8221; or &#8220;Configure my Git to sign commits with GPG.&#8221; The CLI will attempt to execute the steps. It might fetch instructions from the web (using the search tool) and then run the appropriate shell commands. Of course, always watch what it&#8217;s doing and approve the commands - but it can save time by automating multi-step setup processes. 
One real example: a user asked Gemini CLI to &#8220;set my macOS Dock preferences to auto-hide and remove the delay,&#8221; and the AI was able to execute the necessary <code>defaults write</code> commands.</p></li></ul><p>Think of this mode as using Gemini CLI as a <strong>smart shell</strong>. In fact, you can combine this with Tip 16 (shell passthrough mode) - sometimes you might drop into <code>!</code> shell mode to verify something, then go back to AI mode to have it analyze output.</p><p><strong>Caveat:</strong> When doing system-level tasks, be cautious with commands that have widespread impact (like <code>rm -rf</code> or system config changes). Gemini CLI will usually ask for confirmation, and it doesn&#8217;t run anything without you seeing it. But as a power user, you should have a sense of what changes are being made. If unsure, ask Gemini to explain a command before running (e.g., &#8220;Explain what <code>defaults write com.apple.dock autohide-delay -float 0</code> does&#8221; - it will gladly explain rather than just execute if you prompt it in that way).</p><p><strong>Troubleshooting bonus:</strong> Another neat use is using Gemini CLI to parse logs or config files looking for issues. For instance, &#8220;Scan this Apache config for mistakes&#8221; (with <code>@httpd.conf</code>), or &#8220;Look through syslog for errors around 2 PM yesterday&#8221; (with an <code>@/var/log/syslog</code> if accessible). It&#8217;s like having a co-administrator. It can even suggest likely causes for crashes or propose fixes for common error patterns.</p><p>In summary, <strong>don&#8217;t hesitate to fire up Gemini CLI as your assistant for environment issues</strong>. It&#8217;s there to accelerate all your workflows - not just writing code, but maintaining the system that you write code on. 
Many users report that customizing their dev environment with Gemini&#8217;s help feels like having a tech buddy always on call to handle the tedious or complex setup steps.</p><h2><strong>Tip 10: YOLO Mode - Auto-Approve Tool Actions (Use with Caution)</strong></h2><p><strong>Quick use-case:</strong> If you&#8217;re feeling confident (or adventurous), you can let Gemini CLI run tool actions without asking for your confirmation each time. This is <strong>YOLO mode</strong> (You Only Live Once). It&#8217;s enabled by the <code>--yolo</code> flag or by pressing <code>Ctrl+Y</code> during a <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=,prompt%20in%20an%20external%20editor">session</a>. In YOLO mode, as soon as the AI decides on a tool (like running a shell command or writing to a file), it executes it immediately, without that &#8220;Approve? (y/n)&#8221; prompt.</p><p><strong>Why use YOLO mode?</strong> Primarily for speed and convenience <strong>when you trust the AI&#8217;s actions</strong>. Experienced users might toggle YOLO on if they&#8217;re doing a lot of repetitive safe operations. For example, if you ask Gemini to generate 10 different files one after another, approving each can slow down the flow; YOLO mode would just let them all be written automatically. Another scenario is using Gemini CLI in a completely automated script or CI pipeline - you might run it headless with <code>--yolo</code> so it doesn&#8217;t pause for confirmation.</p><p>To start in YOLO mode from the get-go, launch the CLI with:</p><pre><code>gemini --yolo</code></pre><p>Or the short form <code>gemini -y</code>. You&#8217;ll see some indication in the CLI (like a different prompt or a notice) that auto-approve is <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=initial%20prompt.%20%2A%20%60,to%20revert%20changes">on</a>. 
During an interactive session, you can toggle it by pressing <strong>Ctrl+Y</strong> at any <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=,prompt%20in%20an%20external%20editor">time</a> - the CLI will usually display a message like &#8220;YOLO mode enabled (all actions auto-approved)&#8221; in the footer.</p><p><strong>Big warning:</strong> YOLO mode is powerful but <strong>risky</strong>. The Gemini team itself labels it for &#8220;daring users&#8221; - meaning you should be aware that the AI could potentially execute a dangerous command without asking. In normal mode, if the AI decided to run <code>rm -rf /</code> (worst-case scenario), you&#8217;d obviously decline. In YOLO mode, that command would run immediately (and likely ruin your day). While such extreme mistakes are unlikely (the AI&#8217;s system prompt includes safety guidelines), the whole point of confirmations is to catch any unwanted action. YOLO removes that safety net.</p><p><strong>Best practices for YOLO:</strong> If you want some of the convenience without full risk, consider <em>allow-listing</em> specific commands. For example, you can configure in settings that certain tools or command patterns don&#8217;t require confirmation (like allowing all <code>git</code> commands, or read-only actions). In fact, Gemini CLI supports a config for skipping confirmation on specific commands: e.g., you can set something like <code>"tools.shell.autoApprove": ["git ", "npm test"]</code> to always run <a href="https://google-gemini.github.io/gemini-cli/docs/cli/configuration.html#:~:text=match%20at%20L247%20%60%5B,Default%3A%20%60undefined">those</a>. This way, you might not need YOLO mode globally - you selectively YOLO only safe commands. 
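</p><p>As a concrete sketch, assuming the <code>settings.json</code> key spelled above (the exact key name and nesting can differ between CLI releases, so verify against the configuration docs for your version), the allow-list might look like:</p><pre><code>{
  "tools.shell.autoApprove": ["git ", "npm test"]
}</code></pre><p>With something like that in place, <code>git status</code> or <code>npm test</code> would run without a prompt, while anything outside the list still asks for approval.</p><p>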
Another approach: run Gemini in a sandbox or container when using YOLO, so even if it does something wild, your system is insulated (Gemini has a <code>--sandbox</code> flag to run tools in a Docker <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=echo%20,gemini">container</a>).</p><p>Many advanced users toggle YOLO on and off frequently - turning it on when doing a string of minor file edits or queries, and off when about to do something critical. You can do the same, using the keyboard shortcut as a quick toggle.</p><p>In summary, <strong>YOLO mode eliminates friction at the cost of oversight</strong>. It&#8217;s a pro feature to use sparingly and wisely. It truly demonstrates trust in the AI (or recklessness!). If you&#8217;re new to Gemini CLI, you should probably avoid YOLO until you clearly understand the patterns of what it tends to do. If you do use it, double down on having version control or backups - just in case.</p><p><em>(If it&#8217;s any consolation, you&#8217;re not alone - many in the community joke about &#8220;I YOLO&#8217;ed and Gemini did something crazy.&#8221; So use it, but... well, you only live once.)</em></p><h2><strong>Tip 11: Headless &amp; Scripting Mode (Run Gemini CLI in the Background)</strong></h2><p><strong>Quick use-case:</strong> You can use Gemini CLI in scripts or automation by running it in <strong>headless mode</strong>. This means you provide a prompt (or even a full conversation) via command-line arguments or environment variables, and Gemini CLI produces an output and exits. It&#8217;s great for integrating with other tools or triggering AI tasks on a schedule.</p><p>For instance, to get a one-off answer without opening the REPL, you&#8217;ve seen you can use <code>gemini -p &#8220;...prompt...&#8221;</code>. 
This is already headless usage: it prints the model&#8217;s response and returns to the <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Non,and%20get%20a%20single%20response">shell</a>. But there&#8217;s more you can do:</p><ul><li><p><strong>System prompt override:</strong> If you want to run Gemini CLI with a custom system persona or instruction set (different from the default), you can use the environment variable <code>GEMINI_SYSTEM_MD</code>. By setting this, you tell Gemini CLI to ignore its built-in system prompt and use your provided file <a href="https://medium.com/google-cloud/practical-gemini-cli-bring-your-own-system-instruction-19ea7f07faa2#:~:text=The%20,rather%20than%20its%20hardcoded%20defaults">instead</a>. For example:</p></li></ul><pre><code>export GEMINI_SYSTEM_MD="/path/to/custom_system.md"
gemini -p "Perform task X with high caution"</code></pre><p>This would load your <code>custom_system.md</code> as the system prompt (the &#8220;role&#8221; and rules the AI follows) before executing the <a href="https://medium.com/google-cloud/practical-gemini-cli-bring-your-own-system-instruction-19ea7f07faa2#:~:text=The%20feature%20is%20enabled%20by,specific%20configurations">prompt</a>. Alternatively, if you set <code>GEMINI_SYSTEM_MD=true</code>, the CLI will look for a file named <code>system.md</code> in the current project&#8217;s <code>.gemini</code> <a href="https://medium.com/google-cloud/practical-gemini-cli-bring-your-own-system-instruction-19ea7f07faa2#:~:text=The%20feature%20is%20enabled%20by,specific%20configurations">directory</a>. This feature is very advanced - it essentially allows you to <em>replace the built-in brain</em> of the CLI with your own instructions, which some users do for specialized workflows (like simulating a specific persona or enforcing ultra-strict policies). Use it carefully, as replacing the core prompt can affect tool usage (the core prompt contains important directions for how the AI selects and uses <a href="https://medium.com/google-cloud/practical-gemini-cli-bring-your-own-system-instruction-19ea7f07faa2#:~:text=If%20you%20read%20my%20previous,proper%20functioning%20of%20Gemini%20CLI">tools</a>).</p><ul><li><p><strong>Direct prompt via CLI:</strong> Aside from <code>-p</code>, there&#8217;s also <code>-i</code> (interactive prompt) which starts a session with an initial prompt, and then keeps it open. For example: <code>gemini -i "Hello, let's debug something"</code> will open the REPL and already have said hello to the model. This is useful if you want the first question to be asked immediately when starting.</p></li><li><p><strong>Scripting with shell pipes:</strong> You can pipe not just text but also files or command outputs into Gemini. 
For example: <code>gemini -p "Summarize this log:" &lt; big_log.txt</code> will feed the content of <code>big_log.txt</code> into the prompt (after the phrase &#8220;Summarize this log:&#8221;). Or you might do <code>some_command | gemini -p "Given the above output, what went wrong?"</code>. This technique allows you to compose Unix tools with AI analysis. It&#8217;s headless in the sense that it&#8217;s a single-pass operation.</p></li><li><p><strong>Running in CI/CD:</strong> You could incorporate Gemini CLI into build processes. For instance, a CI pipeline might run a test and then use Gemini CLI to automatically analyze failing test output and post a comment. Using the <code>-p</code> flag and environment auth, this can be scripted. (Of course, ensure the environment has the API key or auth needed.)</p></li></ul><p>One more headless trick: <strong>the </strong><code>--format=json</code><strong> flag</strong> (or config setting). Gemini CLI can output responses in JSON format instead of the human-readable text if you configure <a href="https://google-gemini.github.io/gemini-cli/docs/cli/configuration.html#:~:text=">it</a>. This is useful for programmatic consumption - your script can parse the JSON to get the answer or any tool action details.</p><p><strong>Why headless mode matters:</strong> It transforms Gemini CLI from an interactive assistant into a <strong>backend service</strong> or utility that other programs can call. You could schedule a cronjob that runs a Gemini CLI prompt nightly (imagine generating a report or cleaning up something with AI logic). You could wire up a button in an IDE that triggers a headless Gemini run for a specific task.</p><p><strong>Example:</strong> Let&#8217;s say you want a daily summary of a news website. 
You could have a script:</p><pre><code>gemini -p "Web-fetch \"https://news.site/top-stories\" and extract the headlines, then write them to headlines.txt"</code></pre><p>With <code>--yolo</code> perhaps, so it won&#8217;t ask for confirmation before writing the file. This would use the web fetch tool to get the page and the file write tool to save the headlines. All automatically, no human in the loop. The possibilities are endless once you treat Gemini CLI as a scriptable component.</p><p>In summary, <strong>Headless Mode</strong> enables automation. It&#8217;s the bridge between Gemini CLI and other systems. Mastering it means you can scale up your AI usage - not just when you&#8217;re typing in the terminal, but even when you aren&#8217;t around, your AI agent can do work for you.</p><p><em>(Tip: For truly long-running non-interactive tasks, you might also look into Gemini CLI&#8217;s &#8220;Plan&#8221; mode or how it can generate multi-step plans without intervention. However, those are advanced topics beyond this scope. In most cases, a well-crafted single prompt via headless mode can achieve a lot.)</em></p><h2><strong>Tip 12: Save and Resume Chat Sessions</strong></h2><p><strong>Quick use-case:</strong> If you&#8217;ve been debugging an issue with Gemini CLI for an hour and need to stop, you don&#8217;t have to lose the conversation context. Use <code>/chat save &lt;name&gt;</code> to save the session. Later (even after restarting the CLI), you can use <code>/chat resume &lt;name&gt;</code> to pick up where you left <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=,help%20information%20and%20available%20commands">off</a>. This way, long-running conversations can be paused and continued seamlessly.</p><p>Gemini CLI essentially has a built-in chat session manager. 
The commands to know are:</p><ul><li><p><code>/chat save &lt;tag&gt;</code> - Saves the current conversation state under a tag/name you <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=,help%20information%20and%20available%20commands">provide</a>. The tag is like a filename or key for that session. Save as often as you like; saving to an existing tag overwrites it. (Using a descriptive name is helpful - e.g., <code>/chat save fix-docker-issue</code>.)</p></li><li><p><code>/chat list</code> - Lists all your saved sessions (the tags you&#8217;ve <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=,help%20information%20and%20available%20commands">used</a>). This helps you remember what you named previous saves.</p></li><li><p><code>/chat resume &lt;tag&gt;</code> - Resumes the session with that tag, restoring the entire conversation context and history to how it was when <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=,help%20information%20and%20available%20commands">saved</a>. It&#8217;s like you never left. You can then continue chatting from that point.</p></li><li><p><code>/chat share</code> - Saves the conversation to a file, which is useful because you can hand the entire chat to someone else who can continue the session. Almost collaboration-like.</p></li></ul><p>Under the hood, these sessions are likely stored in <code>~/.gemini/chats/</code> or a similar location. They include the conversation messages and any relevant state. This feature is super useful for cases such as:</p><ul><li><p><strong>Long debugging sessions:</strong> Sometimes debugging with an AI can be a long back-and-forth. If you can&#8217;t solve it in one go, save it and come back later (maybe with a fresh mind). 
The AI will still &#8220;remember&#8221; everything from before, because the whole context is reloaded.</p></li><li><p><strong>Multi-day tasks:</strong> If you&#8217;re using Gemini CLI as an assistant for a project, you might have one chat session for &#8220;Refactor module X&#8221; that spans multiple days. You can resume that specific chat each day so the context doesn&#8217;t reset daily. Meanwhile, you might have another session for &#8220;Write documentation&#8221; saved separately. Switching contexts is just a matter of saving one and resuming the other.</p></li><li><p><strong>Team hand-off:</strong> This is more experimental, but in theory, you could share the content of a saved chat with a colleague (the saved files are likely portable). If they put it in their <code>.gemini</code> directory and resume, they could see the same context. The <strong>practical simpler approach</strong> for collaboration is just copying the relevant Q&amp;A from the log and using a shared <code>GEMINI.md</code> or prompt, but it&#8217;s interesting to note that the session data is yours to keep.</p></li></ul><p><strong>Usage example:</strong></p><pre><code>/chat save api-upgrade</code></pre><p><em>(Session saved as &#8220;api-upgrade&#8221;)</em></p><pre><code>/quit</code></pre><p><em>(Later, reopen CLI)</em></p><pre><code>$ gemini
gemini&gt; /chat list</code></pre><p><em>(Shows: api-upgrade)</em></p><pre><code>gemini&gt; /chat resume api-upgrade</code></pre><p>Now the model greets you with the last exchange&#8217;s state ready. You can confirm by scrolling up that all your previous messages are present.</p><p><strong>Pro Tip:</strong> Use meaningful tags when saving <a href="https://medium.com/@ferreradaniel/gemini-cli-free-ai-tool-upgrade-5-new-features-you-need-right-now-04cfefac5e93#:~:text=Naming%20conventions%20to%20keep%20projects,organized">chats</a>. Instead of <code>/chat save session1</code>, give it a name related to the topic (e.g. <code>/chat save memory-leak-bug</code>). This will help you find the right one later via <code>/chat list</code>. There is no strict limit announced on how many sessions you can save, but cleaning up old ones occasionally might be wise just for organization.</p><p>This feature turns Gemini CLI into a persistent advisor. You don&#8217;t lose knowledge gained in a conversation; you can always pause and resume. It&#8217;s a differentiator compared to some other AI interfaces that forget context when closed. For power users, it means <strong>you can maintain parallel threads of work</strong> with the AI. Just like you&#8217;d have multiple terminal tabs for different tasks, you can have multiple chat sessions saved and resume the one you need at any given time.</p><h2><strong>Tip 13: Multi-Directory Workspace - One Gemini, Many Folders</strong></h2><p><strong>Quick use-case:</strong> Do you have a project split across multiple repositories or directories? You can launch Gemini CLI with access to <em>all of them</em> at once, so it sees a unified workspace. 
For example, if your frontend and backend are separate folders, you can include both so that Gemini can edit or reference files in both.</p><p>There are two ways to use <strong>multi-directory mode</strong>:</p><ul><li><p><strong>Launch flag:</strong> Use the <code>--include-directories</code> (or <code>-I</code>) flag when starting Gemini CLI. For example:</p></li></ul><pre><code>gemini --include-directories "../backend:../frontend"</code></pre><p>This assumes you run the command from, say, a <code>scripts</code> directory and want to include two sibling folders. You provide a colon-separated list of paths. Gemini CLI will then treat all those directories as part of one big workspace.</p><ul><li><p><strong>Persistent setting:</strong> In your <code>settings.json</code>, you can define <code>"includeDirectories": ["path1", "path2", ...]</code>. This is useful if you always want certain common directories loaded (e.g., a shared library folder that multiple projects use). The paths can be relative or absolute; environment variables and home-relative paths (like <code>~/common-utils</code>) are <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=,61AFEF%22%2C%20%22AccentPurple">allowed</a>.</p></li></ul><p>When multi-dir mode is active, the CLI&#8217;s context and tools consider files across all included locations. The <code>&gt; /directory show</code> command will list which directories are in the current <a href="https://medium.com/@ferreradaniel/gemini-cli-free-ai-tool-upgrade-5-new-features-you-need-right-now-04cfefac5e93#:~:text=How%20to%20add%20multiple%20directories,step">workspace</a>. 
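</p><p>Putting the persistent form together, a <code>settings.json</code> sketch (the folder names here are illustrative) might look like:</p><pre><code>{
  "includeDirectories": ["../backend", "../frontend", "~/common-utils"]
}</code></pre><p>Every listed root then joins the workspace on each launch, with no flag needed.</p><p>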
You can also dynamically add directories during a session with <code>/directory add &lt;path&gt;</code> - it will then load that on the fly (potentially scanning it for context like it does on startup).</p><p><strong>Why use multi-directory mode?</strong> In microservice architectures or modular codebases, it&#8217;s common that one piece of code lives in one repo and another piece in a different repo. If you only ran Gemini in one, it wouldn&#8217;t &#8220;see&#8221; the others. By combining them, you enable cross-project reasoning. For example, you could ask, &#8220;Update the API client in the frontend to match the backend&#8217;s new API endpoints&#8221; - Gemini can open the backend folder to see the API definitions and simultaneously open the frontend code to modify it accordingly. Without multi-dir, you&#8217;d have to do one side at a time and manually carry info over.</p><p><strong>Example:</strong> Let&#8217;s say you have <code>client/</code> and <code>server/</code>. You start:</p><pre><code>cd client
gemini --include-directories "../server"</code></pre><p>Now at the <code>gemini&gt;</code> prompt, if you do <code>&gt; !ls</code>, you&#8217;ll see it can list files in both <code>client</code> and <code>server</code> (it might show them as separate paths). You could do:</p><pre><code>Open server/routes/api.py and client/src/api.js side by side to compare function names.</code></pre><p>The AI will have access to both files. Or you might say:</p><pre><code>The API changed: the endpoint "/users/create" is now "/users/register". Update both backend and frontend accordingly.</code></pre><p>It can simultaneously create a patch in the backend route and adjust the frontend fetch call.</p><p>Under the hood, Gemini merges the file index of those directories. There might be some performance considerations if each directory is huge, but generally it handles multiple small-medium projects fine. The cheat sheet notes that this effectively creates one workspace with multiple <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=%22includeDirectories%22%3A%20%5B%22..%2Fshared,98C379%22%2C%20%22AccentYellow">roots</a>.</p><p><strong>Tip within a tip:</strong> Even if you don&#8217;t use multi-dir all the time, know that you can still reference files across the filesystem by absolute path in prompts (<code>@/path/to/file</code>). However, without multi-dir, Gemini might not have permission to edit those or know to load context from them proactively. Multi-dir formally includes them in scope so it&#8217;s aware of all files for tasks like search or code generation across the whole set.</p><p><strong>Remove directories:</strong> If needed, <code>/directory remove &lt;path&gt;</code> (or a similar command) can drop a directory from the workspace. This is less common, but maybe if you included something accidentally, you can remove it.</p><p>In summary, <strong>multi-directory mode unifies your context</strong>. 
It&#8217;s a must-have for polyrepo projects or any situation where code is split up. It makes Gemini CLI act more like an IDE that has your entire solution open. As a pro user, this means no part of your project is out of the AI&#8217;s reach.</p><h2><strong>Tip 14: Organize and Clean Up Your Files with AI Assistance</strong></h2><p><strong>Quick use-case:</strong> Tired of a messy <code>Downloads</code> folder or disorganized project assets? You can enlist Gemini CLI to act as a smart organizer. By providing it an overview of a directory, it can classify files and even move them into subfolders (with your approval). For instance, &#8220;Clean up my <code>Downloads</code>: move images to an <code>Images</code> folder, PDFs to <code>Documents</code>, and delete temporary files.&#8221;</p><p>Because Gemini CLI can read file names, sizes, and even peek into file contents, it can make informed decisions about file <a href="https://github.com/google-gemini/gemini-cli/discussions/7890#:~:text=We%20built%20a%20CLI%20tool,trash%20folder%20for%20manual%20deletion">organization</a>. One community-created tool dubbed <strong>&#8220;Janitor AI&#8221;</strong> showcases this: it runs via Gemini CLI to categorize files as important vs junk, and groups them <a href="https://github.com/google-gemini/gemini-cli/discussions/7890#:~:text=We%20built%20a%20CLI%20tool,trash%20folder%20for%20manual%20deletion">accordingly</a>. The process involved scanning the directory, using Gemini&#8217;s reasoning on filenames and metadata (and content if needed), then moving files into categories. 
Notably, it didn&#8217;t automatically delete junk - rather, it moved them to a <code>Trash</code> folder for <a href="https://github.com/google-gemini/gemini-cli/discussions/7890#:~:text=organize%20files,trash%20folder%20for%20manual%20deletion">review</a>.</p><p>Here&#8217;s how you might replicate such a workflow with Gemini CLI manually:</p><ol><li><p><strong>Survey the directory:</strong> Use a prompt to have Gemini list and categorize. For example:</p></li></ol><pre><code>List all files in the current directory and categorize them as &#8220;images&#8221;, &#8220;videos&#8221;, &#8220;documents&#8221;, &#8220;archives&#8221;, or &#8220;others&#8221;.</code></pre><p>Gemini might use <code>!ls</code> or similar to get the file list, then analyze the names/extensions to produce categories.</p><ol><li><p><strong>Plan the organization:</strong> Ask Gemini how it would like to reorganize. For example:</p></li></ol><pre><code>Propose a new folder structure for these files. I want to separate by type (Images, Videos, Documents, etc.). Also identify any files that seem like duplicates or unnecessary.</code></pre><p>The AI might respond with a plan: e.g., <em>&#8220;Create folders: </em><code>Images/</code><em>, </em><code>Videos/</code><em>, </em><code>Documents/</code><em>, </em><code>Archives/</code><em>. Move </em><code>X.png</code><em>, </em><code>Y.jpg</code><em> to </em><code>Images/</code><em>; move </em><code>A.mp4</code><em> to </em><code>Videos/</code><em>; etc. The file </em><code>temp.txt</code><em> looks unnecessary (maybe a temp file).&#8221;</em></p><ol><li><p><strong>Execute moves with confirmation:</strong> You can then instruct it to carry out the plan. It may use shell commands like <code>mv</code> for each file. Since this modifies your filesystem, you&#8217;ll get confirmation prompts for each (unless you YOLO it). Carefully approve the moves. 
After completion, your directory will be neatly organized as suggested.</p></li></ol><p>Throughout, Gemini&#8217;s natural language understanding is key. It can reason, for instance, that <code>IMG_001.png</code> is an image or that <code>presentation.pdf</code> is a document, even if not explicitly stated. It can even open an image (using its vision capability) to see what&#8217;s in it - e.g., differentiating between a screenshot vs a photo vs an icon - and name or sort it <a href="https://dev.to/therealmrmumba/7-insane-gemini-cli-tips-that-will-make-you-a-superhuman-developer-2d7h#:~:text=If%20your%20project%20folder%20is,using%20relevant%20and%20descriptive%20terms">accordingly</a>.</p><p><strong>Renaming files by content:</strong> A particularly magical use is having Gemini rename files to be more descriptive. The Dev Community article &#8220;7 Insane Gemini CLI Tips&#8221; describes how Gemini can <strong>scan images and automatically rename them</strong> based on their <a href="https://dev.to/therealmrmumba/7-insane-gemini-cli-tips-that-will-make-you-a-superhuman-developer-2d7h#:~:text=If%20your%20project%20folder%20is,using%20relevant%20and%20descriptive%20terms">content</a>. For example, a file named <code>IMG_1234.jpg</code> might be renamed to <code>login_screen.jpg</code> if the AI sees it&#8217;s a screenshot of a login <a href="https://dev.to/therealmrmumba/7-insane-gemini-cli-tips-that-will-make-you-a-superhuman-developer-2d7h#:~:text=If%20your%20project%20folder%20is,using%20relevant%20and%20descriptive%20terms">screen</a>. 
To do this, you could prompt:</p><pre><code>For each .png image here, look at its content and rename it to something descriptive.</code></pre><p>Gemini will open each image (via vision tool), get a description, then propose a <code>mv IMG_1234.png login_screen.png</code> <a href="https://dev.to/therealmrmumba/7-insane-gemini-cli-tips-that-will-make-you-a-superhuman-developer-2d7h#:~:text=If%20your%20project%20folder%20is,using%20relevant%20and%20descriptive%20terms">action</a>. This can dramatically improve the organization of assets, especially in design or photo folders.</p><p><strong>Two-pass approach:</strong> The Janitor AI discussion noted a two-step process: first broad categorization (important vs junk vs other), then refining <a href="https://github.com/google-gemini/gemini-cli/discussions/7890#:~:text=organize%20files,trash%20folder%20for%20manual%20deletion">groups</a>. You can emulate this: first separate files that likely can be deleted (maybe large installer <code>.dmg</code> files or duplicates) from those to keep. Then focus on organizing the keepers. Always double-check what the AI flags as junk; its guess might not always be right, so manual oversight is needed.</p><p><strong>Safety tip:</strong> When letting the AI loose on file moves or deletions, have backups or at least be ready to undo (with <code>/restore</code> or your own backup). It&#8217;s wise to do a dry-run: ask Gemini to print the commands it <em>would</em> run to organize, without executing them, so you can review. For instance: &#8220;List the <code>mv</code> and <code>mkdir</code> commands needed for this plan, but don&#8217;t execute them yet.&#8221; Once you review the list, you can either copy-paste execute them, or instruct Gemini to proceed.</p><p>This is a prime example of using Gemini CLI for &#8220;non-obvious&#8221; tasks - it&#8217;s not just writing code, it&#8217;s doing <strong>system housekeeping with AI smarts</strong>. 
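</p><p>The dry-run idea above is worth internalizing even outside Gemini. As a plain-shell sketch (no AI involved; the file names and folder mapping are invented for illustration), this is the kind of reviewable plan you&#8217;d want the AI to print before anything actually moves:</p><pre><code># Create a scratch directory with some sample files
mkdir -p /tmp/organize-demo
cd /tmp/organize-demo
touch photo.png clip.mp4 report.pdf notes.txt

# Print the commands a reorganization WOULD run, without executing them
for f in *; do
  case "$f" in
    *.png|*.jpg) dest=Images ;;
    *.mp4|*.mov) dest=Videos ;;
    *.pdf|*.txt) dest=Documents ;;
    *)           dest=Others ;;
  esac
  echo "mkdir -p $dest; mv $f $dest/"
done</code></pre><p>Nothing moves until you (or the AI, once you approve) run the printed commands - the same review-then-execute loop described above.</p><p>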
It can save time and bring a bit of order to chaos. After all, as developers we accumulate clutter (logs, old scripts, downloads), and an AI janitor can be quite handy.</p><h2><strong>Tip 15: Compress Long Conversations to Stay Within Context</strong></h2><p><strong>Quick use-case:</strong> If you&#8217;ve been chatting with Gemini CLI for a long time, you might hit the model&#8217;s context length limit or just find the session getting unwieldy. Use the <code>/compress</code> command to summarize the conversation so far, replacing the full history with a concise <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Command%20Description%20,files">summary</a>. This frees up space for more discussion without starting from scratch.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lF0Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27408abe-16e2-4b3a-95ae-3b0b3ac4f8eb_866x418.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lF0Q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27408abe-16e2-4b3a-95ae-3b0b3ac4f8eb_866x418.png 424w, https://substackcdn.com/image/fetch/$s_!lF0Q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27408abe-16e2-4b3a-95ae-3b0b3ac4f8eb_866x418.png 848w, https://substackcdn.com/image/fetch/$s_!lF0Q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27408abe-16e2-4b3a-95ae-3b0b3ac4f8eb_866x418.png 1272w, 
https://substackcdn.com/image/fetch/$s_!lF0Q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27408abe-16e2-4b3a-95ae-3b0b3ac4f8eb_866x418.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lF0Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27408abe-16e2-4b3a-95ae-3b0b3ac4f8eb_866x418.png" width="866" height="418" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/27408abe-16e2-4b3a-95ae-3b0b3ac4f8eb_866x418.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:418,&quot;width&quot;:866,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!lF0Q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27408abe-16e2-4b3a-95ae-3b0b3ac4f8eb_866x418.png 424w, https://substackcdn.com/image/fetch/$s_!lF0Q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27408abe-16e2-4b3a-95ae-3b0b3ac4f8eb_866x418.png 848w, https://substackcdn.com/image/fetch/$s_!lF0Q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27408abe-16e2-4b3a-95ae-3b0b3ac4f8eb_866x418.png 1272w, 
https://substackcdn.com/image/fetch/$s_!lF0Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27408abe-16e2-4b3a-95ae-3b0b3ac4f8eb_866x418.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Large language models have a fixed context window (Gemini 2.5 Pro&#8217;s is very large, but not infinite). If you exceed it, the model may start forgetting earlier messages or lose coherence. 
The <code>/compress</code> feature is essentially an <strong>AI-generated tl;dr</strong> of your session that keeps important points.</p><p><strong>How it works:</strong> When you type <code>/compress</code>, Gemini CLI will take the entire conversation (except system context) and produce a summary. It then replaces the chat history with that summary as a single system or assistant message, preserving essential details but dropping minute-by-minute dialogue. It will indicate that compression happened. For example, after <code>/compress</code>, you might see something like:</p><p>--- Conversation compressed ---<br>Summary of discussion: The user and assistant have been debugging a memory leak in an application. Key points: The issue is likely in <code>DataProcessor.js</code>, where objects aren&#8217;t being freed. The assistant suggested adding logging and identified a possible infinite loop. The user is about to test a fix.<br>--- End of summary ---</p><p>From that point on, the model only has that summary (plus new messages) as context for what happened before. This usually is enough if the summary captured the salient info.</p><p><strong>When to compress:</strong> Ideally before you <em>hit</em> the limit. If you notice the session is getting lengthy (several hundred turns or a lot of code in context), compress proactively. The cheat sheet mentions an automatic compression setting (e.g., compress when context exceeds 60% of <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=%22includeDirectories%22%3A%20%5B%22..%2Fshared,98C379%22%2C%20%22AccentYellow">max</a>). If you enable that, Gemini might auto-compress and let you know. Otherwise, manual <code>/compress</code> is in your toolkit.</p><p><strong>After compressing:</strong> You can continue the conversation normally. If needed, you can compress multiple times in a very long session. 
Each time, you lose some granularity, so don&#8217;t compress too frequently for no reason - you might end up with an overly brief remembrance of a complex discussion. But generally the model&#8217;s own summarization is pretty good at keeping the key facts (and you can always restate anything critical yourself).</p><p><strong>Context window example:</strong> Let&#8217;s illustrate. Suppose you fed in a large codebase by referencing many files and had a 1M token context (the max). If you then want to shift to a different part of the project, rather than starting a new session (losing all that understanding), you could compress. The summary will condense the knowledge gleaned from the code (like &#8220;We loaded modules A, B, C. A has these functions... B interacts with C in these ways...&#8221;). Now you can proceed to ask about new things with that knowledge retained abstractly.</p><p><strong>Memory vs Compression:</strong> Note that compression doesn&#8217;t save to long-term memory, it&#8217;s local to the conversation. If you have facts you <em>never</em> want lost, consider Tip 4 (adding to <code>/memory</code>) - because memory entries will survive compression (they&#8217;ll just be reinserted anyway since they are in <code>GEMINI.md</code> context). Compression is more about ephemeral chat content.</p><p><strong>A minor caution:</strong> after compression, the AI&#8217;s style might slightly change because it&#8217;s effectively seeing a &#8220;fresh&#8221; conversation with a summary. It might reintroduce itself or change tone. You can instruct it like &#8220;Continue from here... (we compressed)&#8221; to smooth it out. In practice, it often continues fine.</p><p>To summarize (pun intended), <strong>use </strong><code>/compress</code><strong> as your session grows long</strong> to maintain performance and relevance. It helps Gemini CLI focus on the bigger picture instead of every detail of the conversation&#8217;s history. 
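</p><p>If you&#8217;d rather not remember to do this by hand, the automatic compression threshold mentioned above can be set in <code>settings.json</code>. A sketch - the key names here follow recent CLI versions and may change, so check the docs for your release:</p><pre><code>{
  "chatCompression": {
    "contextPercentageThreshold": 0.6
  }
}</code></pre><p>With this in place, the CLI compresses the conversation on its own once context usage crosses 60% of the model&#8217;s window.</p><p>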
This way, you can have marathon debugging sessions or extensive design discussions without running out of the &#8220;mental paper&#8221; the AI is writing on.</p><h2><strong>Tip 16: Passthrough Shell Commands with </strong><code>!</code><strong> (Talk to Your Terminal)</strong></h2><p><strong>Quick use-case:</strong> At any point in a Gemini CLI session, you can run actual shell commands by prefixing them with <code>!</code>. For example, if you want to check the git status, just type <code>!git status</code> and it will execute in your <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Run%20a%20single%20command%3A">terminal</a>. This saves you from switching windows or context - you&#8217;re still in the Gemini CLI, but you&#8217;re essentially telling it &#8220;let me run this command real quick.&#8221;</p><p>This tip is about <strong>Shell Mode</strong> in Gemini CLI. There are two ways to use it:</p><ul><li><p><strong>Single command:</strong> Just put <code>!</code> at the start of your prompt, followed by any command and arguments. This will execute that command in the current working directory and display the output <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Run%20shell%20commands%20directly%20in,the%20CLI">in-line</a>. For example:</p></li></ul><pre><code>!ls -lh src/</code></pre><p>will list the files in the <code>src</code> directory, outputting something like you&#8217;d see in a normal terminal. After the output, the Gemini prompt returns so you can continue chatting or issue more commands.</p><ul><li><p><strong>Persistent shell mode:</strong> If you enter <code>!</code> alone and hit Enter, Gemini CLI switches into a sub-mode where you get a shell prompt (often it looks like <code>shell&gt;</code> or <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=">similar</a>). Now you can type multiple shell commands interactively. It&#8217;s basically a mini-shell within the CLI. 
You exit this mode by typing <code>!</code> on an empty line again (or <code>exit</code>). For instance:</p></li></ul><pre><code>!
shell&gt; pwd
/home/alice/project
shell&gt; python --version
Python 3.x.x
shell&gt; !</code></pre><p>After the final <code>!</code>, you&#8217;re back to the normal Gemini prompt.</p><p><strong>Why is this useful?</strong> Because development is a mix of actions and inquiries. You might be discussing something with the AI and realize you need to compile the code or run tests to see something. Instead of leaving the conversation, you can quickly do it and feed the result back into the chat. In fact, Gemini CLI often does this for you as part of its tool usage (it might automatically run <code>!pytest</code> when you ask to fix tests, for <a href="https://genmind.ch/posts/Howto-Supercharge-Your-Terminal-with-Gemini-CLI/#:~:text=">example</a>). But as the user, you have full control to do it manually too.</p><p><strong>Examples:</strong></p><ul><li><p>After Gemini suggests a fix in code, you can do <code>!npm run build</code> to see if it compiles, then copy any errors and ask Gemini to help with those.</p></li><li><p>If you want to open a file in <code>vim</code> or <code>nano</code>, you could even launch it via <code>!nano filename</code> (though note that since Gemini CLI has its own interface, using an interactive editor inside it might be a bit awkward - better to use the built-in editor integration or copy to your editor).</p></li><li><p>You can use shell commands to gather info for the AI: e.g., <code>!grep TODO -R .</code> to find all TODOs in the project, then you might ask Gemini to help address those TODOs.</p></li><li><p>Or simply use it for environment tasks: <code>!pip install some-package</code> if needed, etc., without leaving the CLI.</p></li></ul><p><strong>Seamless interplay:</strong> One cool aspect is how the conversation can refer to outputs. 
For example, you could do <code>!curl http://example.com</code> to fetch some data, see the output, then immediately say to Gemini, &#8220;Format the above output as JSON&#8221; - since the output was printed in the chat, the AI has it in context to work with (provided it&#8217;s not too large).</p><p><strong>Terminal as a default shell:</strong> If you find yourself always prefacing commands with <code>!</code>, you can actually make the shell mode persistent by default. One way is launching Gemini CLI with a specific tool mode (there&#8217;s a concept of default tool). But easier: just drop into shell mode (<code>!</code> with nothing) at session start if you plan to run a lot of manual commands and only occasionally talk to AI. Then you can exit shell mode whenever you want to ask a question. It&#8217;s almost like turning Gemini CLI into your normal terminal that happens to have an AI readily available.</p><p><strong>Integration with AI planning:</strong> Sometimes Gemini CLI itself will propose to run a shell command. If you approve, it effectively does the same as <code>!command</code>. Understanding that, you know you can always intervene. If Gemini is stuck or you want to try something, you don&#8217;t have to wait for it to suggest - you can just do it and then continue.</p><p>In summary, the <code>!</code> <strong>passthrough</strong> means <em>you don&#8217;t have to leave Gemini CLI for shell tasks</em>. It collapses the boundary between chatting with the AI and executing commands on your system. As a pro user, this is fantastic for efficiency - your AI and your terminal become one continuous environment.</p><h2><strong>Tip 17: Treat Every CLI Tool as a Potential Gemini Tool</strong></h2><p><strong>Quick use-case:</strong> Realize that Gemini CLI can leverage <strong>any</strong> command-line tool installed on your system as part of its problem-solving. 
The AI has access to the shell, so if you have <code>cURL</code>, <code>ImageMagick</code>, <code>git</code>, <code>Docker</code>, or any other tool, Gemini can invoke it when appropriate. In other words, <em>your entire </em><code>$PATH</code><em> is the AI&#8217;s toolkit</em>. This greatly expands what it can do - far beyond its built-in tools.</p><p>For example, say you ask: &#8220;Convert all PNG images in this folder to WebP format.&#8221; If you have ImageMagick&#8217;s <code>convert</code> utility installed, Gemini CLI might plan something like: use a shell loop with <code>convert</code> command for each <a href="https://genmind.ch/posts/Howto-Supercharge-Your-Terminal-with-Gemini-CLI/#:~:text=%3E%20%21for%20f%20in%20,png%7D.webp%22%3B%20done">file</a>. Indeed, one of the earlier examples from a blog showed exactly this, where the user prompted to batch-convert images, and Gemini executed a shell one-liner with the <code>convert</code> <a href="https://genmind.ch/posts/Howto-Supercharge-Your-Terminal-with-Gemini-CLI/#:~:text=">tool</a>.</p><p>Another scenario: &#8220;Deploy my app to Docker.&#8221; If <code>Docker CLI</code> is present, the AI could call <code>docker build</code> and <code>docker run</code> steps as needed. Or &#8220;Use FFmpeg to extract audio from <code>video.mp4</code>&#8220; - it can construct the <code>ffmpeg</code> command.</p><p>This tip is about mindset: <strong>Gemini isn&#8217;t limited to what&#8217;s coded into it</strong> (which is already extensive). It can figure out how to use other programs available to achieve a <a href="https://medium.com/google-cloud/gemini-cli-tutorial-series-part-4-built-in-tools-c591befa59ba#:~:text=In%20this%20part%2C%20we%20looked,In%20the%20next%20part%2C%20we">goal</a>. It knows common syntax and can read help texts if needed (it could call <code>--help</code> on a tool). The only limitation is safety: by default, it will ask confirmation for any <code>run_shell_command</code> it comes up with. 
But as you become comfortable, you might allow certain benign commands automatically (see YOLO or allowed-tools config).</p><p><strong>Be mindful of the environment:</strong> &#8220;With great power comes great responsibility.&#8221; Since every shell tool is fair game, you should ensure that your <code>$PATH</code> doesn&#8217;t include anything you wouldn&#8217;t want the AI to run inadvertently. This is where Tip 19 (custom PATH) comes in - some users create a restricted <code>$PATH</code> for Gemini, so it can&#8217;t, say, directly call system destructive commands or maybe not call <code>gemini</code> recursively (to avoid loops). The point is, by default if <code>gcc</code> or <code>terraform</code> or anything is in <code>$PATH</code>, Gemini could invoke it. It doesn&#8217;t mean it will randomly do so - only if the task calls for it - but it&#8217;s possible.</p><p><strong>Train of thought example:</strong> Imagine you ask Gemini CLI: &#8220;Set up a basic HTTP server that serves the current directory.&#8221; The AI might think: &#8220;I can use Python&#8217;s built-in server for this.&#8221; It then issues <code>!python3 -m http.server 8000</code>. Now it just used a system tool (Python) to launch a server. That&#8217;s an innocuous example. Another: &#8220;Check the memory usage on this Linux system.&#8221; The AI might use the <code>free -h</code> command or read from <code>/proc/meminfo</code>. It&#8217;s effectively doing what a sysadmin would do, by using available commands.</p><p><strong>All tools are extensions of the AI:</strong> This is somewhat futuristic, but consider that any command-line program can be seen as a &#8220;function&#8221; the AI can call to extend its capability. Need to solve a math problem? It could call <code>bc</code> (calculator). Need to manipulate an image? It could call an image processing tool. Need to query a database? If the CLI client is installed and credentials are there, it can use it. 
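</p><p>To make one of those concrete: the batch image conversion from earlier in this tip typically comes down to a single shell loop. This is a sketch assuming ImageMagick&#8217;s <code>convert</code> is installed; review the command before approving it:</p><pre><code>for f in *.png; do
  convert "$f" "${f%.png}.webp"   # ${f%.png} strips the extension, keeping the base name
done</code></pre><p>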
The possibilities are expansive. In other AI agent frameworks, this is known as tool use, and Gemini CLI is designed with a lot of trust in its agent to decide the right <a href="https://cloud.google.com/blog/topics/developers-practitioners/agent-factory-recap-deep-dive-into-gemini-cli-with-taylor-mullen#:~:text=The%20Gemini%20CLI%20%20is,understanding%20of%20the%20developer%20workflow">tool</a>.</p><p><strong>When it goes wrong:</strong> The flip side is if the AI misunderstands a tool or has a hallucination about one. It might try to call a command that doesn&#8217;t exist, or use wrong flags, resulting in errors. This isn&#8217;t a big deal - you&#8217;ll see the error and can correct or clarify. In fact, the system prompt of Gemini CLI likely guides it to first do a dry-run (just propose the command) rather than executing blindly. So you often get a chance to catch these. Over time, the developers are improving the tool selection logic to reduce these missteps.</p><p>The main takeaway is to <strong>think of Gemini CLI as having a very large Swiss Army knife</strong> - not just the built-in blades, but every tool in your OS. You don&#8217;t have to instruct it on how to use them if it&#8217;s something standard; usually it knows or can find out. This significantly amplifies what you can accomplish. It&#8217;s like having a junior dev or devops engineer who knows how to run pretty much any program you have installed.</p><p>As a pro user, you can even install additional CLI tools specifically to give Gemini more powers. For example, if you install a CLI for a cloud service (AWS CLI, GCloud CLI, etc.), in theory Gemini can utilize it to manage cloud resources if prompted to. Always ensure you understand and trust the commands run, especially with powerful tools (you wouldn&#8217;t want it spinning up huge cloud instances accidentally). 
But used wisely, this concept - <strong>everything is a Gemini tool</strong> - is what makes it <em>exponentially</em> more capable as you integrate it into your environment.</p><h2><strong>Tip 18: Utilize Multimodal AI - Let Gemini See Images and More</strong></h2><p><strong>Quick use-case:</strong> Gemini CLI isn&#8217;t limited to text - it&#8217;s multimodal. This means it can analyze images, diagrams, or even PDFs if given. Use this to your advantage. For instance, you could say &#8220;Here&#8217;s a screenshot of an error dialog, <code>@./error.png</code> - help me troubleshoot this.&#8221; The AI will &#8220;see&#8221; the image and respond accordingly.</p><p>One of the standout features of Google&#8217;s Gemini model (and its precursor PaLM2 in Codey form) is image understanding. In Gemini CLI, if you reference an image with <code>@</code>, the model receives the image data. It can output descriptions, classifications, or reason about the image&#8217;s content. We already discussed renaming images by content (Tip 14) and describing screenshots (Tip 7). But let&#8217;s consider other creative uses:</p><ul><li><p><strong>UI/UX feedback:</strong> If you&#8217;re a developer working with designers, you can drop a UI image and ask Gemini for feedback or to generate code. &#8220;Look at this UI mockup <code>@mockup.png</code> and produce a React component structure for it.&#8221; It could identify elements in the image (header, buttons, etc.) and outline code.</p></li><li><p><strong>Organizing images:</strong> Beyond renaming, you might have a folder of mixed images and want to sort by content. 
&#8220;Sort the images in <code>./photos/</code> into subfolders by theme (e.g., sunsets, mountains, people).&#8221; The AI can look at each photo and categorize it (this is similar to what some photo apps do with AI - now you can do it with your own script via Gemini).</p></li><li><p><strong>OCR and data extraction:</strong> If you have a screenshot of error text or a photo of a document, Gemini can often read the text from it. For example, &#8220;Extract the text from <code>invoice.png</code> and put it into a structured format.&#8221; As shown in a Google Cloud blog example, Gemini CLI can process a set of invoice images and output a table of their <a href="https://medium.com/google-cloud/gemini-cli-tutorial-series-part-4-built-in-tools-c591befa59ba#:~:text=Press%20enter%20or%20click%20to,view%20image%20in%20full%20size">info</a>. It basically did OCR + understanding to get invoice numbers, dates, amounts from pictures of invoices. That&#8217;s an advanced use-case but entirely possible with the multimodal model under the hood.</p></li><li><p><strong>Understanding graphs or charts:</strong> If you have a graph screenshot, you could ask &#8220;Explain this chart&#8217;s key insights <code>@chart.png</code>.&#8221; It might interpret the axes and trends. Accuracy can vary, but it&#8217;s a nifty try.</p></li></ul><p>To make this practical: when you <code>@image.png</code>, ensure the image isn&#8217;t too huge (though the model can handle reasonably large images). The CLI will likely encode it and send it to the model. The response might include descriptions or further actions. You can mix text and image references in one prompt too.</p><p><strong>Non-image modalities:</strong> The CLI and model potentially can handle PDFs and audio too, by converting them via tools. For example, if you <code>@report.pdf</code>, Gemini CLI might use a PDF-to-text tool under the hood to extract text and then summarize. 
If you <code>@audio.mp3</code> and ask for a transcript, it might use an audio-to-text tool (like a speech recognition function). The cheat sheet suggests that referencing PDFs, audio, and video files is <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Reference%20files%20or%20directories%20in,PDFs%2C%20audio%2C%20and%20video%20files">supported</a>, presumably by invoking appropriate internal tools or APIs. So, &#8220;transcribe this interview audio: <code>@interview.wav</code>&#8221; could actually work (if not now, likely soon, since underlying Google APIs for speech-to-text could be plugged in).</p><p><strong>Rich outputs:</strong> Multimodal also means the AI can return images in responses if integrated (though in CLI it usually won&#8217;t <em>display</em> them directly, but it could save an image file or output ASCII art, etc.). The MCP capability mentioned that tools can return <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Capabilities%3A">images</a>. For instance, an AI drawing tool could generate an image and Gemini CLI could present it (maybe by opening it or giving a link).</p><p><strong>Important:</strong> The CLI itself is text-based, so you won&#8217;t <em>see</em> the image in the terminal (unless it&#8217;s capable of ASCII previews). You&#8217;ll just get the analysis. So this is mostly about reading images, not displaying them. If you&#8217;re in VS Code integration, it might show images in the chat view.</p><p>In summary, <strong>don&#8217;t let the &#8220;CLI&#8221; in the name fool you</strong> - Gemini can handle the visual just as well as the textual in many cases. This opens up workflows like visual debugging, design help, data extraction from screenshots, etc., all under the same tool. It&#8217;s a differentiator that some other CLI tools may not have yet. 
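</p><p>The same file references work when you script the CLI. For instance, a headless run over one of the invoice images mentioned above might look like this (a sketch - it assumes <code>@</code> references are resolved in non-interactive prompts in your CLI version):</p><pre><code>gemini -p "Extract the text from @invoice.png and output it as CSV"</code></pre><p>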
And as models improve, this multimodal support will only get more powerful, so it&#8217;s a future-proof skill to exploit.</p><h2><strong>Tip 19: Customize the </strong><code>$PATH</code><strong> (and Tool Availability) for Stability</strong></h2><p><strong>Quick use-case:</strong> If you ever find Gemini CLI getting confused or invoking the wrong programs, consider running it with a tailored <code>$PATH</code>. By limiting or ordering the available executables, you can prevent the AI from, say, calling a similarly named script that you didn&#8217;t intend. Essentially, you sandbox its tool access to known-good tools.</p><p>For most users, this isn&#8217;t an issue, but for pro users with lots of custom scripts or multiple versions of tools, it can be helpful. One reason mentioned by the developers is avoiding infinite loops or weird <a href="https://github.com/google-gemini/gemini-cli/discussions/7890#:~:text=We%20built%20a%20CLI%20tool,trash%20folder%20for%20manual%20deletion">behavior</a>. For example, if <code>gemini</code> itself is in <code>$PATH</code>, an AI gone awry might recursively call <code>gemini</code> from within Gemini (a strange scenario, but theoretically possible). Or perhaps you have a command named <code>test</code> that conflicts with something - the AI might call the wrong one.</p><p><strong>How to set PATH for Gemini:</strong> Easiest is inline on launch:</p><pre><code>PATH=/usr/bin:/usr/local/bin gemini</code></pre><p>This runs Gemini CLI with a restricted <code>$PATH</code> of just those directories. You might exclude directories where experimental or dangerous scripts lie. Alternatively, create a small shell script wrapper that purges or adjusts <code>$PATH</code> then exec&#8217;s <code>gemini</code>.</p><p>Another approach is using environment or config to explicitly disable certain tools. 
For instance, if you absolutely never want the AI to use <code>rm</code> or some destructive tool, you could technically create an alias or dummy <code>rm</code> in a safe <code>$PATH</code> that does nothing (though this could interfere with normal operations, so maybe not that one). A better method is the <strong>exclude list</strong> in settings. In an extension or <code>settings.json</code>, you can exclude tool <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=">names</a>. E.g.,</p><pre><code>"excludeTools": ["run_shell_command"]</code></pre><p>This extreme example would stop <em>all</em> shell commands from running (making Gemini effectively read-only). For more granular control, you might configure something like:</p><pre><code>"tools": {
  "exclude": ["apt-get", "shutdown"]
}</code></pre><p><em>(This syntax is illustrative; consult docs for exact usage.)</em></p><p>The principle is, by controlling the environment, you reduce risk of the AI doing something dumb with a tool it shouldn&#8217;t. It&#8217;s akin to child-proofing the house.</p><p><strong>Prevent infinite loops:</strong> One user scenario was a loop where Gemini kept reading its own output or re-reading files <a href="https://support.google.com/gemini/thread/337650803/infinite-loops-with-tool-code-in-answers?hl=en#:~:text=Community%20support,screen%20with%20weird%20scrolling">repeatedly</a>. Custom <code>$PATH</code> can&#8217;t directly fix logic loops, but one cause could be if the AI calls a command that triggers itself. Ensuring it can&#8217;t accidentally spawn another AI instance (like calling <code>bard</code> or <code>gemini</code> command, if it thought to do so) is good. Removing those from <code>$PATH</code> (or renaming them for that session) helps.</p><p><strong>Isolation via sandbox:</strong> Another alternative to messing with <code>$PATH</code> is using <code>--sandbox</code> mode (which uses Docker or Podman to run tools in an isolated <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=echo%20,gemini">environment</a>). In that case, the AI&#8217;s actions are contained and have only the tools that sandbox image provides. You could supply a Docker image with a curated set of tools. This is heavy-handed but very safe.</p><p><strong>Custom PATH for specific tasks:</strong> You might have different <code>$PATH</code> setups for different projects. For example, in one project you want it to use a specific version of Node or a local toolchain. Launching <code>gemini</code> with the <code>$PATH</code> that points to those versions will ensure the AI uses the right one. Essentially, treat Gemini CLI like any user - it uses whatever environment you give it. 
So if you need it to pick <code>gcc-10</code> vs <code>gcc-12</code>, adjust <code>$PATH</code> or <code>CC</code> env var accordingly.</p><p><strong>In summary:</strong> <em>Guard rails.</em> As a power user, you have the ability to fine-tune the operating conditions of the AI. If you ever find a pattern of undesirable behavior tied to tool usage, tweaking <code>$PATH</code> is a quick remedy. For everyday use, you likely won&#8217;t need this, but it&#8217;s a pro tip to keep in mind if you integrate Gemini CLI into automation or CI: give it a controlled environment. That way, you know exactly what it can and cannot do, which increases reliability.</p><div><hr></div><h2><strong>Tip 20: Track and Reduce Token Spend with Token Caching and Stats</strong></h2><p>If you run long chats or repeatedly attach the same big files, you can cut cost and latency by turning on token caching and monitoring usage. With an API key or Vertex AI auth, Gemini CLI automatically reuses previously sent system instructions and context, so follow&#8209;up requests are cheaper. You can see the savings live in the CLI.</p><p><strong>How to use it</strong></p><p><strong>Use an auth mode that enables caching.</strong> Token caching is available when you authenticate with a Gemini API key or Vertex AI. It is not available with OAuth login today (see the <a href="https://google-gemini.github.io/gemini-cli/docs/cli/token-caching.html">token caching docs</a>).</p><p><strong>Inspect your usage and cache hits.</strong> Run the <code>/stats</code> command during a session. It shows total tokens and a <code>cached</code> field when caching is active.</p><pre><code>/stats</code></pre><p>The command&#8217;s description and cached reporting behavior are documented in the <a href="https://google-gemini.github.io/gemini-cli/docs/cli/commands.html">commands reference</a> and FAQ.</p><p><strong>Capture metrics in scripts.</strong> 
When running headless, output JSON and parse the <code>stats</code> block, which includes <code>tokens.cached</code> for each model:</p><pre><code>gemini -p "Summarize README" --output-format json</code></pre><p>The <a href="https://google-gemini.github.io/gemini-cli/docs/cli/headless.html">headless guide</a> documents the JSON schema with cached token counts.</p><p><strong>Save a session summary to a file.</strong> For CI or budget tracking, write a JSON session summary to disk.</p><pre><code>gemini -p "Analyze logs" --session-summary usage.json</code></pre><p>This flag is listed in the <a href="https://google-gemini.github.io/gemini-cli/docs/changelogs/">changelog</a>.</p><p>With API key or Vertex auth, the CLI automatically reuses previously sent context so later turns send fewer tokens. Keeping <code>GEMINI.md</code> and large file references stable across turns increases cache hits; you&#8217;ll see that reflected in stats as cached tokens.</p><h2><strong>Tip 21: Use </strong><code>/copy</code><strong> for Quick Clipboard Copy</strong></h2><p><strong>Quick use-case:</strong> Instantly copy the latest answer or code snippet from Gemini CLI to your system clipboard, without any extraneous formatting or line <a href="https://google-gemini.github.io/gemini-cli/docs/cli/commands.html#:~:text=,for%20easy%20sharing%20or%20reuse">numbers</a>. This is perfect for quickly pasting AI-generated code into your editor or sharing a result with a teammate.</p><p>When Gemini CLI provides an answer (especially a multi-line code block), you often want to reuse it elsewhere. The <code>/copy</code> slash command makes this effortless by copying <em>the last output produced by the CLI</em> directly to your <a href="https://google-gemini.github.io/gemini-cli/docs/cli/commands.html#:~:text=,for%20easy%20sharing%20or%20reuse">clipboard</a>. 
Unlike manual selection (which can grab line numbers or prompt text), <code>/copy</code> grabs only the raw response content. For example, if Gemini just generated a 50-line Python script, simply typing <code>/copy</code> will put that entire script into your clipboard, ready to paste - no need to scroll and select text. Under the hood, Gemini CLI uses the appropriate clipboard utility for your platform (e.g. <code>pbcopy</code> on macOS, <code>clip</code> on <a href="https://google-gemini.github.io/gemini-cli/docs/cli/commands.html#:~:text=,clip">Windows</a>). Once you run the command, you&#8217;ll typically see a confirmation message, and then you can paste the copied text wherever you need it.</p><p><strong>How it works:</strong> The <code>/copy</code> command requires that your system has a clipboard tool <a href="https://google-gemini.github.io/gemini-cli/docs/cli/commands.html#:~:text=,clip">available</a>. On macOS and Windows, the required tools (<code>pbcopy</code> and <code>clip</code> respectively) are usually pre-installed. On Linux, you may need to install <code>xclip</code> or <code>xsel</code> for <code>/copy</code> to <a href="https://google-gemini.github.io/gemini-cli/docs/cli/commands.html#:~:text=,clip">function</a>. After ensuring that, you can use <code>/copy</code> anytime after Gemini CLI prints an answer. It will capture the <em>entire</em> last response (even if it&#8217;s long) and omit any internal numbering or formatting the CLI may show on-screen. This saves you from dealing with unwanted artifacts when transferring the content. It&#8217;s a small feature, but a huge time-saver when you&#8217;re iterating on code or compiling a report generated by the AI.</p><p><strong>Pro Tip:</strong> If you find the <code>/copy</code> command isn&#8217;t working, double-check that your clipboard utilities are installed and accessible. 
For instance, Ubuntu users should run <code>sudo apt install xclip</code> to enable clipboard <a href="https://google-gemini.github.io/gemini-cli/docs/cli/commands.html#:~:text=,clip">copying</a>. Once set up, <code>/copy</code> lets you share Gemini&#8217;s outputs with zero friction - copy, paste, and you&#8217;re done.</p><h2><strong>Tip 22: Master </strong><code>Ctrl+C</code><strong> for Shell Mode and Exiting</strong></h2><p><strong>Quick use-case:</strong> Cleanly interrupt Gemini CLI or exit shell mode with a single keypress - and quit the CLI entirely with a quick double-tap - thanks to the versatile <strong>Ctrl+C</strong> <a href="https://www.howtouselinux.com/post/the-complete-google-gemini-cli-cheat-sheet-and-guide#:~:text=Shortcut%20Description%20,Press%20twice%20to%20confirm">shortcut</a>. This gives you immediate control when you need to stop or exit.</p><p>Gemini CLI operates like a REPL, and knowing how to break out of operations is essential. Pressing <strong>Ctrl+C</strong> once will cancel the current action or clear any input you&#8217;ve started typing, essentially acting as an &#8220;abort&#8221; <a href="https://www.howtouselinux.com/post/the-complete-google-gemini-cli-cheat-sheet-and-guide#:~:text=Shortcut%20Description%20,Press%20twice%20to%20confirm">command</a>. For example, if the AI is generating a lengthy answer and you&#8217;ve seen enough, hit <code>Ctrl+C</code> - the generation stops immediately. If you had started typing a prompt but want to discard it, <code>Ctrl+C</code> will wipe the input line so you can start <a href="https://www.howtouselinux.com/post/the-complete-google-gemini-cli-cheat-sheet-and-guide#:~:text=Shortcut%20Description%20,Press%20twice%20to%20confirm">fresh</a>. 
Additionally, if you are in <strong>shell mode</strong> (activated by typing <code>!</code> to run shell commands), a single <code>Ctrl+C</code> will exit shell mode and return you to the normal Gemini prompt (it sends an interrupt to the shell process <a href="https://milvus.io/ai-quick-reference/how-do-i-use-gemini-cli-for-shell-command-generation#:~:text=The%20shell%20integration%20also%20includes,where%20you%20can%20generate%20commands">running</a>). This is extremely handy if a shell command is hanging or you simply want to get back to AI mode.</p><p>Pressing <strong>Ctrl+C twice</strong> in a row is the shortcut to exit Gemini CLI <a href="https://www.howtouselinux.com/post/the-complete-google-gemini-cli-cheat-sheet-and-guide#:~:text=Shortcut%20Description%20,Press%20twice%20to%20confirm">entirely</a>. Think of it as &#8220;<code>Ctrl+C</code> to cancel, and <code>Ctrl+C</code> again to quit.&#8221; This double-tap signals the CLI to terminate the session (you&#8217;ll see a goodbye message or the program will close). It&#8217;s a faster alternative to typing <code>/quit</code> or closing the terminal window, allowing you to gracefully shut down the CLI from the keyboard. Do note that a single <code>Ctrl+C</code> will not quit if there&#8217;s input to clear or an operation to interrupt - it requires that second press (when the prompt is idle) to fully <a href="https://www.howtouselinux.com/post/the-complete-google-gemini-cli-cheat-sheet-and-guide#:~:text=Shortcut%20Description%20,Press%20twice%20to%20confirm">exit</a>.
This design prevents accidentally closing the session when you only meant to stop the current output.</p><p><strong>Pro Tip:</strong> In shell mode, you can also press the <strong>Esc</strong> key to leave shell mode and return to Gemini&#8217;s chat mode without terminating the <a href="https://milvus.io/ai-quick-reference/how-do-i-use-gemini-cli-for-shell-command-generation#:~:text=The%20shell%20integration%20also%20includes,where%20you%20can%20generate%20commands">CLI</a>. And if you prefer a more formal exit, the <code>/quit</code> command is always available to cleanly end the session. Lastly, Unix users can use <strong>Ctrl+D</strong> (EOF) at an empty prompt to exit as well - Gemini CLI will prompt for confirmation if <a href="https://www.howtouselinux.com/post/the-complete-google-gemini-cli-cheat-sheet-and-guide#:~:text=Shortcut%20Description%20,Press%20twice%20to%20confirm">needed</a>. But for most cases, mastering the single- and double-tap of <code>Ctrl+C</code> is the quickest way to stay in control.</p><h2><strong>Tip 23: Customize Gemini CLI with </strong><code>settings.json</code></h2><p><strong>Quick use-case:</strong> Adapt the CLI&#8217;s behavior and appearance to your preferences or project conventions by editing the <code>settings.json</code> config file, instead of sticking with one-size-fits-all <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=%2A%20%60autoAccept%60%3A%20Auto,to%20disable%20usage%20statistics">defaults</a>. This lets you enforce things like theme, tool usage rules, or editor mode across all your sessions.</p><p>Gemini CLI is highly configurable. In your home directory (<code>~/.gemini/</code>) or project folder (<code>.gemini/</code> within your repo), you can create a <code>settings.json</code> file to override default <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Customize%20the%20CLI%20by%20creating,applied%20with%20the%20following%20precedence">settings</a>. 
Nearly every aspect of the CLI can be tuned here - from visual theme to tool permissions. The CLI merges settings from multiple levels: system-wide defaults, your user settings, and project-specific settings (project settings override user <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Customize%20the%20CLI%20by%20creating,applied%20with%20the%20following%20precedence">settings</a>). For example, you might have a global preference for a dark theme, but a particular project might require stricter tool sandboxing; you can handle this via different <code>settings.json</code> files at each level.</p><p>Inside <code>settings.json</code>, options are specified as JSON key-value pairs. Here&#8217;s a snippet illustrating some useful customizations:</p><pre><code>{
  "theme": "GitHub",
  "autoAccept": false,
  "vimMode": true,
  "sandbox": "docker",
  "includeDirectories": ["../shared-library", "~/common-utils"],
  "usageStatisticsEnabled": true
}</code></pre><p>In this example, we set the theme to &#8220;GitHub&#8221; (a popular color scheme), disable <code>autoAccept</code> (so the CLI will always ask before running potentially altering tools), enable Vim keybindings for the input editor, and enforce using Docker for tool sandboxing. We also added some directories to the workspace context (<code>includeDirectories</code>) so Gemini can see code in shared paths by <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=%7B%20,utils">default</a>. Finally, we kept <code>usageStatisticsEnabled</code> true to collect basic usage stats (which feeds into telemetry, if <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=%2A%20%60autoAccept%60%3A%20Auto,to%20disable%20usage%20statistics">enabled</a>). There are many more settings available - like defining custom color themes, adjusting token limits, or whitelisting/blacklisting specific tools - all documented in the configuration <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=%2A%20%60autoAccept%60%3A%20Auto,to%20disable%20usage%20statistics">guide</a>. By tailoring these, you ensure Gemini CLI behaves optimally for <em>your</em> workflow (for instance, some developers always want <code>vimMode</code> on for efficiency, while others might prefer the default editor).</p><p>One convenient way to edit settings is via the built-in settings UI. Run the command <code>/settings</code> in Gemini CLI, and it will open an interactive editor for your <a href="https://google-gemini.github.io/gemini-cli/docs/cli/commands.html#:~:text=,their%20current%20values%2C%20and%20modify">configuration</a>. This interface lets you browse and search settings with descriptions, and prevents JSON syntax errors by validating inputs.
You can tweak colors, toggle features like <code>yolo</code> (auto-approval), adjust checkpointing (file save/restore behavior), and more through a friendly <a href="https://google-gemini.github.io/gemini-cli/docs/cli/commands.html#:~:text=,their%20current%20values%2C%20and%20modify">menu</a>. Changes are saved to your <code>settings.json</code>, and some take effect immediately (others might require restarting the CLI).</p><p><strong>Pro Tip:</strong> Maintain separate project-specific <code>settings.json</code> files for different needs. For example, on a team project you might set <code>"sandbox": "docker"</code> and <code>"excludeTools": ["run_shell_command"]</code> to lock down dangerous operations, while your personal projects might allow direct shell commands. Gemini CLI will automatically pick up the nearest <code>.gemini/settings.json</code> in your project directory tree and merge it with your global <code>~/.gemini/settings.json</code>. Also, don&#8217;t forget you can quickly adjust visual preferences: try <code>/theme</code> to interactively switch themes without editing the file, which is great for finding a comfortable <a href="https://www.howtouselinux.com/post/the-complete-google-gemini-cli-cheat-sheet-and-guide#:~:text=Command%20Description%20,tag%3E%60Save%20the%20current%20conversation">look</a>.
Once you find one, put it in <code>settings.json</code> to make it permanent.</p><h2><strong>Tip 24: Leverage IDE Integration (VS Code) for Context &amp; Diffs</strong></h2><p><strong>Quick use-case:</strong> Supercharge Gemini CLI by hooking it into VS Code - the CLI will automatically know which files you&#8217;re working on and even open AI-proposed code changes in VS Code&#8217;s diff editor for <a href="https://developers.googleblog.com/en/gemini-cli-vs-code-native-diffing-context-aware-workflows/?source=post_page-----26afd3422028---------------------------------------#:~:text=,working%20on%20at%20the%20moment">you</a>. This creates a seamless loop between AI assistant and your coding workspace.</p><p>One of Gemini CLI&#8217;s powerful features is its <strong>IDE integration</strong> with Visual Studio Code. By installing the official <em>Gemini CLI Companion</em> extension in VS Code and connecting it, you allow Gemini CLI to become &#8220;context-aware&#8221; of your <a href="https://developers.googleblog.com/en/gemini-cli-vs-code-native-diffing-context-aware-workflows/?source=post_page-----26afd3422028---------------------------------------#:~:text=,working%20on%20at%20the%20moment">editor</a>. What does this mean in practice? When connected, Gemini knows about the files you have open, your current cursor location, and any text you&#8217;ve selected in VS <a href="https://developers.googleblog.com/en/gemini-cli-vs-code-native-diffing-context-aware-workflows/?source=post_page-----26afd3422028---------------------------------------#:~:text=,working%20on%20at%20the%20moment">Code</a>. All that information is fed into the AI&#8217;s context. So if you ask, &#8220;Explain this function,&#8221; Gemini CLI can see the exact function you&#8217;ve highlighted and give a relevant answer, without you needing to copy-paste code into the prompt. 
The integration shares up to your 10 most recently opened files, plus selection and cursor info, giving the model a rich understanding of your <a href="https://gemini-cli.xyz/docs/en/ide-integration#:~:text=,reject%20the%20suggested%20changes%20seamlessly">workspace</a>.</p><p>Another huge benefit is <strong>native diffing</strong> of code changes. When Gemini CLI suggests modifications to your code (for example, &#8220;refactor this function&#8221; and it produces a patch), it can open those changes in VS Code&#8217;s diff viewer <a href="https://developers.googleblog.com/en/gemini-cli-vs-code-native-diffing-context-aware-workflows/?source=post_page-----26afd3422028---------------------------------------#:~:text=%2A%20Native%20in,the%20code%20right%20within%20this">automatically</a>. You&#8217;ll see a side-by-side diff in VS Code showing the proposed edits. You can then use VS Code&#8217;s familiar interface to review the changes, make any manual tweaks, and even accept the patch with a click. The CLI and editor stay in sync - if you accept the diff in VS Code, Gemini CLI knows and continues the session with those changes applied. This tight loop means you no longer have to copy code from the terminal to your editor; the AI&#8217;s suggestions flow straight into your development environment.</p><p><strong>How to set it up:</strong> If you start Gemini CLI inside VS Code&#8217;s integrated terminal, it will detect VS Code and usually prompt you to install/connect the extension <a href="https://medium.com/google-cloud/gemini-cli-tutorial-series-part-10-gemini-cli-vs-code-integration-26afd3422028#:~:text=Press%20enter%20or%20click%20to,view%20image%20in%20full%20size">automatically</a>. You can agree and it will run the necessary <code>/ide install</code> step. If you don&#8217;t see a prompt (or you&#8217;re enabling it later), simply open Gemini CLI and run the command: <code>/ide install</code>. 
This will fetch and install the &#8220;Gemini CLI Companion&#8221; extension into VS Code for <a href="https://developers.googleblog.com/en/gemini-cli-vs-code-native-diffing-context-aware-workflows/?source=post_page-----26afd3422028---------------------------------------#:~:text=2%3A%20One,install%20the%20necessary%20companion%20extension">you</a>. Next, run <code>/ide enable</code> to establish the <a href="https://developers.googleblog.com/en/gemini-cli-vs-code-native-diffing-context-aware-workflows/?source=post_page-----26afd3422028---------------------------------------#:~:text=3%3A%20Toggle%20integration%3A%20After%20the,can%20easily%20manage%20the%20integration">connection</a> - the CLI will then indicate it&#8217;s linked to VS Code. You can verify at any time with <code>/ide status</code>, which will show if it&#8217;s connected and list which editor and files are being <a href="https://gemini-cli.xyz/docs/en/ide-integration#:~:text=Checking%20the%20Status">tracked</a>. From then on, Gemini CLI will automatically receive context from VS Code (open files, selections) and will open diffs in VS Code when needed. It essentially turns Gemini CLI into an AI pair programmer that lives in your terminal but operates with full awareness of your IDE.</p><p>Currently, VS Code is the primary supported editor for this <a href="https://gemini-cli.xyz/docs/en/ide-integration#:~:text=better%20and%20enables%20powerful%20features,editor%20diffing">integration</a>. (Other editors that support VS Code extensions, like VSCodium or some JetBrains via a plugin, may work via the same extension, but officially it&#8217;s VS Code for now.) The design is open though - there&#8217;s an IDE Companion Spec for developing similar integrations with other <a href="https://gemini-cli.xyz/docs/en/ide-integration#:~:text=better%20and%20enables%20powerful%20features,editor%20diffing">editors</a>. 
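As a quick recap, the whole setup described above is three commands at the Gemini CLI prompt:</p><pre><code>/ide install   # fetches the Gemini CLI Companion extension into VS Code
/ide enable    # establishes the connection to the editor
/ide status    # verifies the link and shows which files are tracked</code></pre><p>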
So down the road we might see first-class support for IDEs like IntelliJ or Vim via community extensions.</p><p><strong>Pro Tip:</strong> Once connected, you can use VS Code&#8217;s Command Palette to control Gemini CLI without leaving the <a href="https://gemini-cli.xyz/docs/en/ide-integration#:~:text=,Ctrl%2BShift%2BP">editor</a>. For example, press <strong>Ctrl+Shift+P</strong> (Cmd+Shift+P on Mac) and try commands like <strong>&#8220;Gemini CLI: Run&#8221;</strong> (to launch a new CLI session in the terminal), <strong>&#8220;Gemini CLI: Accept Diff&#8221;</strong> (to approve and apply an open diff), or <strong>&#8220;Gemini CLI: Close Diff Editor&#8221;</strong> (to reject <a href="https://gemini-cli.xyz/docs/en/ide-integration#:~:text=,Ctrl%2BShift%2BP">changes</a>). These shortcuts can streamline your workflow even further. And remember, you don&#8217;t always have to start the CLI manually - if you enable the integration, Gemini CLI essentially becomes an AI co-developer inside VS Code, watching context and ready to help as you work on code.</p><h2><strong>Tip 25: Automate Repo Tasks with </strong><code>Gemini CLI GitHub Action</code></h2><p><strong>Quick use-case:</strong> Put Gemini to work on GitHub - use the <strong>Gemini CLI GitHub Action</strong> to autonomously triage new issues and review pull requests in your repository, acting as an AI teammate that handles routine dev <a href="https://blog.google/technology/developers/introducing-gemini-cli-github-actions/#:~:text=1,write%20tests%20for%20this">tasks</a>.</p><p>Gemini CLI isn&#8217;t just for interactive terminal sessions; it can also run in CI/CD pipelines via GitHub Actions. Google has provided a ready-made <strong>Gemini CLI GitHub Action</strong> (currently in beta) that integrates into your repo&#8217;s <a href="https://blog.google/technology/developers/introducing-gemini-cli-github-actions/#:~:text=It%E2%80%99s%20now%20in%20beta%2C%20available,cli">workflows</a>.
</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;723a644c-821d-4567-a544-298f360ab582&quot;,&quot;duration&quot;:null}"></div><p>This effectively deploys an AI agent into your project on GitHub. It runs in the background, triggered by repository <a href="https://blog.google/technology/developers/introducing-gemini-cli-github-actions/#:~:text=Triggered%20by%20events%20like%20new,do%2C%20and%20gets%20it%20done">events</a>. For example, when someone opens a <strong>new issue</strong>, the Gemini Action can automatically analyze the issue description, apply relevant labels, and even prioritize it or suggest duplicates (this is the &#8220;intelligent issue triage&#8221; <a href="https://blog.google/technology/developers/introducing-gemini-cli-github-actions/#:~:text=1,attention%20on%20what%20matters%20most">workflow</a>). When a <strong>pull request</strong> is opened, the Action kicks in to provide an <strong>AI code review</strong> - it will comment on the PR with insights about code quality, potential bugs, or stylistic <a href="https://blog.google/technology/developers/introducing-gemini-cli-github-actions/#:~:text=attention%20on%20what%20matters%20most,more%20complex%20tasks%20and%20decisions">improvements</a>. This gives maintainers immediate feedback on the PR before any human even looks at it. Perhaps the coolest feature is <strong>on-demand collaboration</strong>: team members can mention <code>@gemini-cli</code> in an issue or PR comment and give it an instruction, like &#8220;<code>@gemini-cli</code> please write unit tests for this&#8221;. The Action will pick that up and Gemini CLI will attempt to fulfill the request (adding a commit with new tests, for <a href="https://blog.google/technology/developers/introducing-gemini-cli-github-actions/#:~:text=freeing%20up%20reviewers%20to%20focus,write%20tests%20for%20this">instance</a>).
It&#8217;s like having an AI assistant living in your repo, ready to do chores when asked.</p><p>Setting up the Gemini CLI GitHub Action is straightforward. First, ensure you have Gemini CLI version <strong>0.1.18 or later</strong> installed locally (this ensures compatibility with the <a href="https://blog.google/technology/developers/introducing-gemini-cli-github-actions/#:~:text=Gemini%20CLI%20GitHub%20Actions%20is,for%20individual%20users%20available%20soon">Action</a>). Then, in Gemini CLI run the special command: <code>/setup-github</code>. This command generates the necessary workflow files in your repository (it will guide you through authentication if needed). Specifically, it adds YAML workflow files (for issue triage, PR review, etc.) under <code>.github/workflows/</code>. You will need to add your Gemini API key to the repo&#8217;s secrets (as <code>GEMINI_API_KEY</code>) so the Action can use the Gemini <a href="https://github.com/google-github-actions/run-gemini-cli#:~:text=Store%20your%20API%20key%20as,in%20your%20repository">API</a>. Once that&#8217;s done and the workflows are committed, the GitHub Action springs to life - from that point on, Gemini CLI will autonomously respond to new issues and PRs according to those workflows.</p><p>Because this Action is essentially running Gemini CLI in an automated way, you can customize it just like you would your CLI. The default setup comes with three workflows (issue triage, PR review, and a general mention-triggered assistant) which are <strong>fully open-source and <a href="https://blog.google/technology/developers/introducing-gemini-cli-github-actions/#:~:text=Think%20of%20these%20initial%20workflows,into%20Gemini%20CLI%20GitHub%20Actions">editable</a></strong>. You can tweak the YAML to adjust what the AI does, or even add new workflows.
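To give a flavor, the trigger portion of such a workflow might look roughly like this (an illustrative sketch only - the files generated by <code>/setup-github</code> and the action&#8217;s own README are the authoritative reference for inputs and version pinning):</p><pre><code># .github/workflows/gemini-triage.yml (illustrative sketch)
name: Gemini issue triage
on:
  issues:
    types: [opened]
jobs:
  triage:
    runs-on: ubuntu-latest
    steps:
      - uses: google-github-actions/run-gemini-cli@main  # pin a released version in practice
        # supply the GEMINI_API_KEY repository secret per the action README</code></pre><p>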
For instance, you might create a nightly workflow that uses Gemini CLI to scan your repository for outdated dependencies or to update a README based on recent code changes - the possibilities are endless. The key benefit here is offloading mundane or time-consuming tasks to an AI agent so that human developers can focus on harder problems. And since it runs on GitHub&#8217;s infrastructure, it doesn&#8217;t require your intervention - it&#8217;s truly a &#8220;set and forget&#8221; AI helper.</p><p><strong>Pro Tip:</strong> Keep an eye on the Action&#8217;s output in the GitHub Actions logs for transparency. The Gemini CLI Action logs will show what prompts it ran and what changes it made or suggested. This can both build trust and help you refine its behavior. Also, the team has built enterprise-grade safeguards into the Action - e.g., you can require that all shell commands the AI tries to run in a workflow are allow-listed by <a href="https://blog.google/technology/developers/introducing-gemini-cli-github-actions/#:~:text=in%20your%20environment%2C%20drastically%20reducing,your%20preferred%20observability%20platform%2C%20like">you</a>. So don&#8217;t hesitate to use it even on serious projects. 
And if you come up with a cool custom workflow using Gemini CLI, consider contributing it back to the community - the project welcomes new ideas in their repo!</p><h2><strong>Tip 26: Enable Telemetry for Insights and Observability</strong></h2><p><strong>Quick use-case:</strong> Gain deeper insight into how Gemini CLI is being used and performing by turning on its built-in <strong>OpenTelemetry</strong> instrumentation - monitor metrics, logs, and traces of your AI sessions to analyze usage patterns or troubleshoot <a href="https://google-gemini.github.io/gemini-cli/docs/cli/telemetry.html#:~:text=,across%20teams%2C%20track%20costs%2C%20ensure">issues</a>.</p><p>For developers who like to measure and optimize, Gemini CLI offers an observability feature that exposes what&#8217;s happening under the hood. By leveraging <strong>OpenTelemetry (OTEL)</strong>, Gemini CLI can emit structured telemetry data about your <a href="https://google-gemini.github.io/gemini-cli/docs/cli/telemetry.html#:~:text=Built%20on%20OpenTelemetry%20%E2%80%94%20the,Gemini%20CLI%E2%80%99s%20observability%20system%20provides">sessions</a>. This includes things like metrics (e.g. how many tokens used, response latency), logs of actions taken, and even traces of tool calls. With telemetry enabled, you can answer questions like: <em>Which custom command do I use most often? How many times did the AI edit files in this project this week? What&#8217;s the average response time when I ask the CLI to run tests?</em> Such data is invaluable for understanding usage patterns and <a href="https://google-gemini.github.io/gemini-cli/docs/cli/telemetry.html#:~:text=,across%20teams%2C%20track%20costs%2C%20ensure">performance</a>. Teams can use it to see how developers are interacting with the AI assistant and where bottlenecks might be.</p><p>By default, telemetry is <strong>off</strong> (Gemini respects privacy and performance). 
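The opt-in itself is a small <code>settings.json</code> addition; one common shape nests the fields under a <code>telemetry</code> block (check the telemetry docs for the exact schema your CLI version expects):</p><pre><code>{
  "telemetry": {
    "enabled": true,
    "target": "local"
  }
}</code></pre><p>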
You can opt in by setting <code>"telemetry.enabled": true</code> in your <code>settings.json</code> or by starting Gemini CLI with the flag <code>--telemetry</code>. Additionally, you choose the <strong>target</strong> for the telemetry data: it can be logged <strong>locally</strong> or sent to a backend like Google Cloud. For a quick start, you might set <code>"telemetry.target": "local"</code> - with this, Gemini will simply write telemetry data to a local file (by default) or to a custom path you specify via the <code>"outfile"</code> setting. The local telemetry includes JSON logs you can parse or feed into tools. For more robust monitoring, set <code>"target": "gcp"</code> (Google Cloud) or even integrate with other OpenTelemetry-compatible systems like Jaeger or <a href="https://google-gemini.github.io/gemini-cli/docs/cli/telemetry.html#:~:text=,between%20backends%20without%20changing%20your">Datadog</a>. In fact, Gemini CLI&#8217;s OTEL support is vendor-neutral - you can export data to just about any observability stack you prefer (Google Cloud Operations, Prometheus, <a href="https://google-gemini.github.io/gemini-cli/docs/cli/telemetry.html#:~:text=,between%20backends%20without%20changing%20your">etc.</a>). Google provides a streamlined path for Cloud: if you point to GCP, the CLI can send data directly to Cloud Logging and Cloud Monitoring in your project, where you can use the usual dashboards and alerting <a href="https://google-gemini.github.io/gemini-cli/docs/cli/telemetry.html#:~:text=2,explorer%20%2A%20Traces%3A%20https%3A%2F%2Fconsole.cloud.google.com%2Ftraces%2Flist">tools</a>.</p><p>What kind of insights can you get? The telemetry captures events like tool executions, errors, and important milestones.
It also records metrics such as prompt processing time and token counts per <a href="https://medium.com/google-cloud/gemini-cli-tutorial-series-part-13-gemini-cli-observability-c410806bc112#:~:text=,integrate%20with%20existing%20monitoring%20infrastructure">prompt</a>. For usage analytics, you might aggregate how many times each slash command is used across your team, or how often code generation is invoked. For performance monitoring, you could track if responses have gotten slower, which might indicate hitting API rate limits or model changes. And for debugging, you can see errors or exceptions thrown by tools (e.g., a <code>run_shell_command</code> failure) logged with context. All this data can be visualized if you send it to a platform like Google Cloud&#8217;s Monitoring - for example, you can create a dashboard of &#8220;tokens used per day&#8221; or &#8220;error rate of tool X&#8221;. It essentially gives you a window into the AI&#8217;s &#8220;brain&#8221; and your usage, which is especially helpful in enterprise settings to ensure everything runs <a href="https://medium.com/google-cloud/gemini-cli-tutorial-series-part-13-gemini-cli-observability-c410806bc112#:~:text=resource%20utilization%20%2A%20%20Real,integrate%20with%20existing%20monitoring%20infrastructure">smoothly</a>.</p><p>Enabling telemetry does introduce some overhead (extra data processing), so you might not keep it on 100% of the time for personal use. However, it&#8217;s fantastic for debugging sessions or for intermittent health checks. One approach is to enable it on a CI server or in your team&#8217;s shared environment to collect stats, while leaving it off locally unless needed. Remember, you can always toggle it on the fly: update settings and use <code>/memory refresh</code> if needed to reload, or restart Gemini CLI with the <code>--telemetry</code> flag.
Also, all telemetry is under your control - it respects your environment variables for endpoint and credentials, so data goes only where you intend it to. This feature turns Gemini CLI from a black box into an observatory, shining light on how the AI agent interacts with your world, so you can continuously improve that interaction.</p><p><strong>Pro Tip:</strong> If you just want a quick view of your current session&#8217;s stats (without full telemetry), use the <code>/stats</code> command. It will output metrics like token usage and session length right in the <a href="https://www.howtouselinux.com/post/the-complete-google-gemini-cli-cheat-sheet-and-guide#:~:text=Command%20Description%20,tag%3E%60Save%20the%20current%20conversation">CLI</a>. This is a lightweight way to see immediate numbers. But for long-term or multi-session analysis, telemetry is the way to go. And if you&#8217;re sending telemetry to a cloud project, consider setting up dashboards or alerts (e.g., alert if error rate spikes or token usage hits a threshold) - this can proactively catch issues in how Gemini CLI is being used in your team.</p><h2><strong>Tip 27: Keep an Eye on the Roadmap (Background Agents &amp; More)</strong></h2><p><strong>Quick use-case:</strong> Stay informed about upcoming Gemini CLI features - by following the public <strong>Gemini CLI roadmap</strong>, you&#8217;ll know about major planned enhancements (like <em>background agents for long-running tasks</em>) before they <a href="https://google-gemini.github.io/gemini-cli/ROADMAP.html#:~:text=quality.%20,related%20to%20security%20and%20privacy">arrive</a>, allowing you to plan and give feedback.</p><p>Gemini CLI is evolving rapidly, with new releases coming out frequently, so it&#8217;s wise to track what&#8217;s on the horizon. 
Google maintains a <strong>public roadmap</strong> for Gemini CLI on GitHub, detailing the key focus areas and features targeted for the near <a href="https://google-gemini.github.io/gemini-cli/ROADMAP.html#:~:text=This%20document%20outlines%20our%20approach,live%20in%20our%20GitHub%20Issues">future</a>. This is essentially a living document (and set of issues) where you can see what the developers are working on and what&#8217;s in the pipeline. </p><p>For instance, one exciting item on the roadmap is support for <strong>background agents</strong> - the ability to spawn autonomous agents that run in the background to handle tasks continuously or <a href="https://google-gemini.github.io/gemini-cli/ROADMAP.html#:~:text=quality.%20,related%20to%20security%20and%20privacy">asynchronously</a>. According to the roadmap discussion, these background agents would let you delegate long-running processes to Gemini CLI without tying up your interactive session. You could, say, start a background agent that monitors your project for certain events or periodically executes tasks, either on your local machine or even by deploying to a service like Cloud <a href="https://github.com/google-gemini/gemini-cli/issues/4168#:~:text=How%20will%20it%20work%3F">Run</a>. This feature aims to &#8220;enable long-running, autonomous tasks and proactive assistance&#8221; right from the <a href="https://google-gemini.github.io/gemini-cli/ROADMAP.html#:~:text=quality.%20,related%20to%20security%20and%20privacy">CLI</a>, essentially extending Gemini CLI&#8217;s usefulness beyond just on-demand queries.</p><p>By keeping tabs on the roadmap, you&#8217;ll also learn about other planned features. These could include new tool integrations, support for additional Gemini model versions, UI/UX improvements, and more. The roadmap is usually organized by &#8220;areas&#8221; (for example, <em>Extensibility</em>, <em>Model</em>, <em>Background</em>, etc.) 
and often tagged with milestones (like a target quarter for <a href="https://google-gemini.github.io/gemini-cli/ROADMAP.html#:~:text=Our%20roadmap%20is%20managed%20directly,more%20detailed%20list%20of%20tasks">delivery</a>). It&#8217;s not a guarantee of when something will land, but it gives a good idea of the team&#8217;s priorities. Since the project is open-source, you can even dive into the linked GitHub issues for each roadmap item to see design proposals and progress. For developers who rely on Gemini CLI, this transparency means you can anticipate changes - maybe an API is adding a feature you need, or a breaking change might be coming that you want to prepare for.</p><p>Following the roadmap can be as simple as bookmarking the GitHub project board or issue labeled &#8220;Roadmap&#8221; and checking periodically. Some major updates (like the introduction of Extensions or the IDE integration) were hinted at in the roadmap before they were officially announced, so you get a sneak peek. Additionally, the Gemini CLI team often encourages community feedback on those future features. If you have ideas or use cases for something like background agents, you can usually comment on the issue or discussion thread to influence its development.</p><p><strong>Pro Tip:</strong> Since Gemini CLI is open source (Apache 2.0 licensed), you can do more than just watch the roadmap - you can participate! The maintainers welcome contributions, especially for items aligned with the <a href="https://google-gemini.github.io/gemini-cli/ROADMAP.html#:~:text=As%20an%20Apache%202,opening%20an%20issue%20for%20discussion">roadmap</a>. If there&#8217;s a feature you really care about, consider contributing code or testing once it&#8217;s in preview. At the very least, you can open a feature request if something you need isn&#8217;t on the roadmap <a href="https://google-gemini.github.io/gemini-cli/ROADMAP.html#:~:text=As%20an%20Apache%202,opening%20an%20issue%20for%20discussion">yet</a>.
The roadmap page itself provides guidance on how to propose changes. Engaging with the project not only keeps you in the loop but also lets you shape the tool that you use. After all, Gemini CLI is built with community involvement in mind, and many recent features (like certain extensions and tools) started as community suggestions.</p><h2><strong>Tip 28: Extend Gemini CLI with </strong><code>Extensions</code></h2><p><strong>Quick use-case:</strong> Add new capabilities to Gemini CLI by installing plug-and-play <strong>extensions</strong> - for example, integrate with your favorite database or cloud service - expanding the AI&#8217;s toolset without any heavy lifting on your <a href="https://blog.google/technology/developers/gemini-cli-extensions/#:~:text=Gemini%20CLI%20is%20an%20open,design%20platforms%20to%20payment%20services">part</a>. It&#8217;s like installing apps for your CLI to teach it new tricks.</p><p>Extensions are a game-changer introduced in late 2025: they allow you to <strong>customize and expand</strong> Gemini CLI&#8217;s functionality in a modular <a href="https://blog.google/technology/developers/gemini-cli-extensions/#:~:text=Gemini%20CLI%20is%20an%20open,design%20platforms%20to%20payment%20services">way</a>. An extension is essentially a bundle of configurations (and optionally code) that connects Gemini CLI to an external tool or service. One of my <a href="https://x.com/rseroter/status/1973809454564134970">favorite examples</a> was the Nano Banana extension, as highlighted by Richard Seroter:</p><blockquote><p>Fire up the Gemini CLI and install the nano-banana extension. I just did. 
Generate or edit images, create icons, even produce technical diagrams or mockups.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0Cuq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a01a315-17e0-45b3-89a2-19c8d569717f_1820x1512.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0Cuq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a01a315-17e0-45b3-89a2-19c8d569717f_1820x1512.png 424w, https://substackcdn.com/image/fetch/$s_!0Cuq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a01a315-17e0-45b3-89a2-19c8d569717f_1820x1512.png 848w, https://substackcdn.com/image/fetch/$s_!0Cuq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a01a315-17e0-45b3-89a2-19c8d569717f_1820x1512.png 1272w, https://substackcdn.com/image/fetch/$s_!0Cuq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a01a315-17e0-45b3-89a2-19c8d569717f_1820x1512.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0Cuq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a01a315-17e0-45b3-89a2-19c8d569717f_1820x1512.png" width="1456" height="1210" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a01a315-17e0-45b3-89a2-19c8d569717f_1820x1512.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1210,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!0Cuq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a01a315-17e0-45b3-89a2-19c8d569717f_1820x1512.png 424w, https://substackcdn.com/image/fetch/$s_!0Cuq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a01a315-17e0-45b3-89a2-19c8d569717f_1820x1512.png 848w, https://substackcdn.com/image/fetch/$s_!0Cuq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a01a315-17e0-45b3-89a2-19c8d569717f_1820x1512.png 1272w, https://substackcdn.com/image/fetch/$s_!0Cuq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a01a315-17e0-45b3-89a2-19c8d569717f_1820x1512.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Google also released a suite of extensions for Google Cloud - there&#8217;s one that helps deploy apps to Cloud Run, one for managing BigQuery, one for analyzing application security, and <a href="https://blog.google/technology/developers/gemini-cli-extensions/#:~:text=In%20just%20three%20months%20since,source%20community">more</a>. Partners and community developers have built extensions for all sorts of things: Dynatrace (monitoring), Elastic (search analytics), Figma (design assets), Shopify, Snyk (security scans), Stripe (payments), and the list is <a href="https://blog.google/technology/developers/gemini-cli-extensions/#:~:text=In%20just%20three%20months%20since,source%20community">growing</a>. By installing an appropriate extension, you instantly grant Gemini CLI the ability to use new domain-specific tools. 
The beauty is that these extensions come with a pre-defined <strong>&#8220;playbook&#8221;</strong> that teaches the AI how to use the new tools <a href="https://blog.google/technology/developers/gemini-cli-extensions/#:~:text=Gemini%20CLI%20is%20an%20open,design%20platforms%20to%20payment%20services">effectively</a>. That means once installed, you can ask Gemini CLI to perform tasks with those services and it will know the proper APIs or commands to invoke, as if it had that knowledge built-in.</p><p>Using extensions is very straightforward. The CLI has a command to manage them: <code>gemini extensions install &lt;URL&gt;</code>. Typically, you provide the URL of the extension&#8217;s GitHub repo or a local path, and the CLI will fetch and install <a href="https://blog.google/technology/developers/gemini-cli-extensions/#:~:text=It%E2%80%99s%20easy%20to%20install%20an,%E2%80%9D%20from%20your%20command%20line">it</a>. For example, to install an official extension, you might run: <code>gemini extensions install https://github.com/google-gemini/gemini-cli-extension-cloud-run</code>. Within seconds, the extension is added to your environment (stored under <code>~/.gemini/extensions/</code> or your project&#8217;s <code>.gemini/extensions/</code> folder). You can then see it by running <code>/extensions</code> in the CLI, which lists active <a href="https://google-gemini.github.io/gemini-cli/docs/cli/commands.html#:~:text=,See%20Gemini%20CLI%20Extensions">extensions</a>. From that point on, the AI has new tools at its disposal. If it&#8217;s a Cloud Run extension, you could say &#8220;Deploy my app to Cloud Run,&#8221; and Gemini CLI will actually be able to execute that (by calling the underlying <code>gcloud</code> commands through the extension&#8217;s tools). Essentially, extensions function as first-class expansions of Gemini CLI&#8217;s capabilities, but you opt-in to the ones you need.</p><p>There&#8217;s an <strong>open ecosystem</strong> around extensions. 
Google has an official Extensions page listing available <a href="https://blog.google/technology/developers/gemini-cli-extensions/#:~:text=Access%20an%20open%2C%20growing%20ecosystem,of%20partners%20and%20builders">extensions</a>, and because the framework is open, anyone can create and share their own. If you have a particular internal API or workflow, you can build an extension for it so that Gemini CLI can assist with it. Writing an extension is easier than it sounds: you typically create a directory (say, <code>my-extension/</code>) with a file <code>gemini-extension.json</code> describing what tools or context to <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Extensions">add</a>. You might define new slash commands or specify remote APIs the AI can call. No need to modify Gemini CLI&#8217;s core - just drop in your extension. The CLI is designed to load these at runtime. Many extensions consist of adding custom <em>MCP tools</em> (Model Context Protocol servers or functions) that the AI can use. For example, an extension could add a <code>/translate</code> command by hooking into an external translation API; once installed, the AI knows how to use <code>/translate</code>. The key benefit is <strong>modularity</strong>: you install only the extensions you want, keeping the CLI lightweight, but you have the option to integrate virtually anything.</p><p>To manage extensions, besides the <code>install</code> command, you can update or remove them via similar CLI commands (<code>gemini extensions update</code> or just by removing the folder). It&#8217;s wise to occasionally check for updates on extensions you use, as they may receive improvements. The CLI might introduce an &#8220;extensions marketplace&#8221; style interface in the future, but for now, exploring the GitHub repositories and official catalog is the way to discover new ones. 
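</p>
<p><strong>Example:</strong> as a rough sketch, the manifest for a hypothetical private extension might look like this. The field names follow the published extension format, but treat the exact schema as something to verify against the official Extensions Guide, and note that <code>translator</code> and <code>translate-server.js</code> are made-up names:</p>

```json
{
  "name": "my-extension",
  "version": "1.0.0",
  "mcpServers": {
    "translator": {
      "command": "node",
      "args": ["./tools/translate-server.js"]
    }
  },
  "contextFileName": "GEMINI.md"
}
```

<p>The manifest simply tells the CLI to launch that MCP server and load the bundled context file whenever the extension is active; dropping a directory containing this file into <code>~/.gemini/extensions/</code> (or a project&#8217;s <code>.gemini/extensions/</code>) is enough for the CLI to pick it up.</p>
<p>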
Some popular ones at launch include the GenAI <strong>Genkit</strong> extension (for building generative AI apps), and a variety of Google Cloud extensions that cover CI/CD, database admin, and more.</p><p><strong>Pro Tip:</strong> If you&#8217;re building your own extension, start by looking at existing ones for examples. The official documentation provides an <strong>Extensions Guide</strong> with the schema and <a href="https://www.philschmid.de/gemini-cli-cheatsheet#:~:text=Extensions">capabilities</a>. A simple way to create a private extension is to use the <code>@include</code> functionality in <code>GEMINI.md</code> to inject scripts or context, but a full extension gives you more power (like packaging tools). Also, since extensions can include context files, you can use them to preload domain knowledge. Imagine an extension for your company&#8217;s internal API that includes a summary of the API and a tool to call it - the AI would then know how to handle requests related to that API. In short, extensions open up a new world where Gemini CLI can interface with anything. Keep an eye on the extensions marketplace for new additions, and don&#8217;t hesitate to share any useful extension you create with the community - you might just help thousands of other <a href="https://blog.google/technology/developers/gemini-cli-extensions/#:~:text=Gemini%20CLI%20extensions%20are%20here,and%20build%20your%20own%20extension">developers</a>.</p><h2><strong>Additional Fun: Corgi Mode Easter Egg &#128021;</strong></h2><p>Lastly, not a productivity tip but a delightful easter egg - try the command <code>/corgi</code> in Gemini CLI. This toggles <strong>&#8220;corgi mode&#8221;</strong>, which makes a cute corgi animation run across your <a href="https://medium.com/@ferreradaniel/gemini-cli-free-ai-tool-upgrade-5-new-features-you-need-right-now-04cfefac5e93#:~:text=Easter%20Egg%3A%20Corgi%20Mode%20in,Gemini%20CLI">terminal</a>! 
It doesn&#8217;t help you code any better, but it can certainly lighten the mood during a long coding session. You&#8217;ll see an ASCII art corgi dashing in the CLI interface. To turn it off, just run <code>/corgi</code> again.</p><p>This is a purely for-fun feature the team added (and yes, there&#8217;s even a tongue-in-cheek <a href="https://github.com/google-gemini/gemini-cli/issues/5674#:~:text=How%20about%20you%20NOT%20implement,this%20needed%3F%20Because%20people">debate</a> about spending dev time on corgi mode). It shows that the creators hide some whimsy in the tool. So when you need a quick break or a smile, give <code>/corgi</code> a try. &#128021;&#127881;</p><p><em>(Rumor has it there might be other easter eggs or modes - who knows? Perhaps a &#8220;/partyparrot&#8221; or similar. The cheat sheet or help command lists </em><code>/corgi</code><em>, so it&#8217;s not a secret, just underused. Now you&#8217;re in on the joke!)</em></p><div><hr></div><h1><strong>Conclusion</strong></h1><p>We&#8217;ve covered a comprehensive list of pro tips and features for Gemini CLI. From setting up persistent context with <code>GEMINI.md</code>, to writing custom commands and using advanced tools like MCP servers, to leveraging multi-modal inputs and automating workflows, there&#8217;s a lot this AI command-line assistant can do. As an external developer, you can integrate Gemini CLI into your daily routine - it&#8217;s like a powerful ally in your terminal that can handle tedious tasks, provide insights, and even troubleshoot your environment.</p><p>Gemini CLI is evolving rapidly (being open-source with community contributions), so new features and improvements are constantly on the horizon. By mastering the pro tips in this guide, you&#8217;ll be well-positioned to harness the full potential of this tool. 
It&#8217;s not just about using an AI model - it&#8217;s about integrating AI deeply into how you develop and manage software.</p><p>Happy coding with Gemini CLI, and have fun exploring just how far your &#8220;AI agent in the terminal&#8221; can take you.</p><p><strong>You now have a Swiss-army knife of AI at your fingertips - use it wisely, and it will make you a more productive (and perhaps happier) developer</strong>!</p><p><em>I&#8217;m excited to share I&#8217;ve released a new <a href="https://beyond.addy.ie/">AI-assisted engineering book</a> with O&#8217;Reilly. There are a number of free tips on the book site in case interested.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!B4nC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb0e235e-62c4-4acb-bc97-395c406eda51_5246x3496.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!B4nC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb0e235e-62c4-4acb-bc97-395c406eda51_5246x3496.png 424w, https://substackcdn.com/image/fetch/$s_!B4nC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb0e235e-62c4-4acb-bc97-395c406eda51_5246x3496.png 848w, https://substackcdn.com/image/fetch/$s_!B4nC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb0e235e-62c4-4acb-bc97-395c406eda51_5246x3496.png 1272w, https://substackcdn.com/image/fetch/$s_!B4nC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb0e235e-62c4-4acb-bc97-395c406eda51_5246x3496.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!B4nC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb0e235e-62c4-4acb-bc97-395c406eda51_5246x3496.png" width="1456" height="970" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cb0e235e-62c4-4acb-bc97-395c406eda51_5246x3496.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:970,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:414461,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/176589430?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb0e235e-62c4-4acb-bc97-395c406eda51_5246x3496.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!B4nC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb0e235e-62c4-4acb-bc97-395c406eda51_5246x3496.png 424w, https://substackcdn.com/image/fetch/$s_!B4nC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb0e235e-62c4-4acb-bc97-395c406eda51_5246x3496.png 848w, https://substackcdn.com/image/fetch/$s_!B4nC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb0e235e-62c4-4acb-bc97-395c406eda51_5246x3496.png 1272w, 
https://substackcdn.com/image/fetch/$s_!B4nC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb0e235e-62c4-4acb-bc97-395c406eda51_5246x3496.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p>]]></content:encoded></item><item><title><![CDATA[How modern browsers work]]></title><description><![CDATA[A web developers guide to browser internals]]></description><link>https://addyo.substack.com/p/how-modern-browsers-work</link><guid isPermaLink="false">https://addyo.substack.com/p/how-modern-browsers-work</guid><dc:creator><![CDATA[Addy 
Osmani]]></dc:creator><pubDate>Sat, 13 Sep 2025 14:30:31 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!YPnw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a42e9cf-a2e1-4d77-b7bb-f9501c423f29_5246x3496.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><strong>Note: </strong>For those eager to dive deep into how browsers work, an <strong>excellent</strong> resource is <strong>Browser Engineering</strong> by Pavel Panchekha and Chris Harrelson (available at <a href="https://browser.engineering/">browser.engineering</a><strong>). </strong>Please do check it out. This article is an overview of how browsers work.</em></p><p>Web developers often treat the browser as a <strong>black box</strong> that magically transforms HTML, CSS, and JavaScript into interactive web applications. In truth, a modern web browser like Chrome (<a href="https://www.chromium.org/chromium-projects/">Chromium</a>), Firefox (<a href="https://firefox-source-docs.mozilla.org/overview/gecko.html">Gecko</a>) or Safari (<a href="https://webkit.org/">WebKit</a>) is a complex piece of software. It orchestrates networking, parses and executes code, renders graphics with GPU acceleration, and isolates content in sandboxed processes for security.</p><p>This article dives into <strong>how modern browsers work</strong> - focusing on <strong>Chromium</strong>'s architecture and internals, while noting where other engines differ. We'll explore everything from the networking stack and parsing pipeline to the rendering process via <a href="https://www.chromium.org/blink/">Blink</a>, JavaScript engine via <a href="http://v8.dev">V8</a>, module loading, multi-process architecture, security sandboxing, and developer tooling. 
The goal is a developer-friendly explanation that demystifies what happens behind the scenes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zbep!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75ce677-87a8-497a-ab9f-495c406c056c_2650x1502.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zbep!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75ce677-87a8-497a-ab9f-495c406c056c_2650x1502.jpeg 424w, https://substackcdn.com/image/fetch/$s_!zbep!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75ce677-87a8-497a-ab9f-495c406c056c_2650x1502.jpeg 848w, https://substackcdn.com/image/fetch/$s_!zbep!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75ce677-87a8-497a-ab9f-495c406c056c_2650x1502.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!zbep!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75ce677-87a8-497a-ab9f-495c406c056c_2650x1502.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zbep!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75ce677-87a8-497a-ab9f-495c406c056c_2650x1502.jpeg" width="1456" height="825" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d75ce677-87a8-497a-ab9f-495c406c056c_2650x1502.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:825,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:624454,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/173324218?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75ce677-87a8-497a-ab9f-495c406c056c_2650x1502.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zbep!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75ce677-87a8-497a-ab9f-495c406c056c_2650x1502.jpeg 424w, https://substackcdn.com/image/fetch/$s_!zbep!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75ce677-87a8-497a-ab9f-495c406c056c_2650x1502.jpeg 848w, https://substackcdn.com/image/fetch/$s_!zbep!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75ce677-87a8-497a-ab9f-495c406c056c_2650x1502.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!zbep!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd75ce677-87a8-497a-ab9f-495c406c056c_2650x1502.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Let's begin our journey through the browser's internals.</p><h2><strong>Networking and Resource Loading</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ixg8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eb2ffcb-56f8-4d93-8dd4-a64e0a2c44cc_1600x742.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ixg8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eb2ffcb-56f8-4d93-8dd4-a64e0a2c44cc_1600x742.png 424w, 
https://substackcdn.com/image/fetch/$s_!ixg8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eb2ffcb-56f8-4d93-8dd4-a64e0a2c44cc_1600x742.png 848w, https://substackcdn.com/image/fetch/$s_!ixg8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eb2ffcb-56f8-4d93-8dd4-a64e0a2c44cc_1600x742.png 1272w, https://substackcdn.com/image/fetch/$s_!ixg8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eb2ffcb-56f8-4d93-8dd4-a64e0a2c44cc_1600x742.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ixg8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eb2ffcb-56f8-4d93-8dd4-a64e0a2c44cc_1600x742.png" width="1456" height="675" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2eb2ffcb-56f8-4d93-8dd4-a64e0a2c44cc_1600x742.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:675,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ixg8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eb2ffcb-56f8-4d93-8dd4-a64e0a2c44cc_1600x742.png 424w, 
https://substackcdn.com/image/fetch/$s_!ixg8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eb2ffcb-56f8-4d93-8dd4-a64e0a2c44cc_1600x742.png 848w, https://substackcdn.com/image/fetch/$s_!ixg8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eb2ffcb-56f8-4d93-8dd4-a64e0a2c44cc_1600x742.png 1272w, https://substackcdn.com/image/fetch/$s_!ixg8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eb2ffcb-56f8-4d93-8dd4-a64e0a2c44cc_1600x742.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" 
y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Every page load begins with the browser's networking stack fetching resources from the web. When you enter a URL or click a link, the browser's UI thread (running in the "<a href="https://www.chromium.org/developers/design-documents/multi-process-architecture/">browser process</a>") kicks off a navigation request.</p><blockquote><p>The <strong>browser process</strong> is the main, controlling process that manages all other processes and the browser's user interface. Everything that happens outside of a specific web page tab is controlled by the browser process. </p></blockquote><p>The steps include:</p><p><strong>URL parsing and security checks</strong>: The browser parses the URL to determine the scheme (http, https, etc.) and target domain. It also decides if the input is a search query or URL (in Chrome's omnibox, for example). Security features like blocklists may be checked here to avoid phishing sites.</p><p><strong>DNS lookup</strong>: The network stack resolves the domain name to an IP address (unless it's cached). This may involve contacting a DNS server. Modern browsers might use OS DNS services or even DNS over HTTPS (DoH) if configured, but ultimately they obtain an IP for the host.</p><p><strong>Establishing a connection</strong>: If no open connection to the server exists, the browser opens one. For HTTPS URLs, this includes a TLS handshake to securely exchange keys and verify certificates. The browser's network thread handles protocols like TCP/TLS setup transparently.</p><p><strong>Sending the HTTP request</strong>: Once connected, an HTTP GET request (or other method) is sent for the resource. Browsers today default to HTTP/2 or HTTP/3 if the server supports it, which allows multiplexing multiple resource requests over one connection. This improves performance by avoiding the old limit of ~6 parallel connections per host (HTTP/1.1). 
For example, with HTTP/2 the HTML, CSS, JS, images can all be fetched concurrently on one TCP/TLS link, and with HTTP/3 (over QUIC UDP) setup latency is further reduced.</p><p><strong>Receiving the response</strong>: The server responds with an HTTP status and headers, followed by the response body (HTML content, JSON data, etc.). The browser reads the response stream. It may need to sniff the MIME type if the Content-Type header is missing or incorrect, to decide how to handle the content. For example, if a response looks like HTML but isn't labeled as such, the browser will still try to treat it as HTML (per permissive web standards). There are security measures here too: the network layer checks Content-Type and may block suspicious MIME mismatches or disallowed cross-origin data (Chrome's CORB - Cross-Origin Read Blocking - is one such mechanism). The browser also consults Safe Browsing or similar services to block known malicious payloads.</p><p><strong>Redirects and next steps</strong>: If the response is an HTTP redirect (e.g. 301 or 302 with a Location header), the network code will follow the redirect (after informing the UI thread) and repeat the request to the new URL. Only once a final response with actual content is obtained does the browser move on to processing that content.</p><p>All these steps happen in the network stack, which in Chromium is run in a dedicated Network Service (now typically a separate process, as part of Chrome's "<a href="https://www.chromium.org/servicification/">servicification</a>" effort). The browser process's network thread coordinates the low-level work of socket communication, using the OS networking APIs under the hood. 
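</p><p>These network phases are also observable from page script via the Resource Timing API. A minimal sketch, assuming a timing record shaped like a <code>PerformanceResourceTiming</code> entry (a plain object here, so the arithmetic is self-contained):</p>

```javascript
// Split a PerformanceResourceTiming-like entry into the network phases
// described above. In a real page you would obtain entries via
// performance.getEntriesByType("resource"); the plain object below is a
// stand-in with assumed millisecond timestamps.
function timingPhases(entry) {
  return {
    dns: entry.domainLookupEnd - entry.domainLookupStart,
    connect: entry.connectEnd - entry.connectStart,
    // The TLS handshake is the tail end of the connect phase
    // (secureConnectionStart is 0 for plain-HTTP connections).
    tls: entry.secureConnectionStart > 0
      ? entry.connectEnd - entry.secureConnectionStart
      : 0,
    request: entry.responseStart - entry.requestStart, // time to first byte
    response: entry.responseEnd - entry.responseStart, // body download
  };
}

const phases = timingPhases({
  domainLookupStart: 10, domainLookupEnd: 30,
  connectStart: 30, secureConnectionStart: 45, connectEnd: 80,
  requestStart: 80, responseStart: 180, responseEnd: 260,
});
// phases.dns === 20, phases.tls === 35, phases.request === 100
```

<p>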
Importantly, this design means the renderer (which will execute the page's code) doesn't directly access the network - it asks the browser process to fetch what it needs, a security win.</p><h3><strong>Speculative Loading and Resource Optimization</strong></h3><p>Modern browsers implement sophisticated performance optimizations in the networking stage. Chrome will proactively perform a DNS prefetch or open a TCP connection when you hover over a link or start typing a URL (using the Predictor or preconnect mechanisms) so that if you click, some latency is already shaved off. There's also HTTP caching: the network stack can satisfy requests from the browser cache if the resource is cached and fresh, avoiding a network trip.</p><p><strong>Preload scanner operation</strong>: Chromium implements a sophisticated <a href="https://web.dev/articles/preload-scanner">preload scanner</a> that tokenizes HTML markup ahead of the main parser. When the primary HTML parser is blocked by CSS or synchronous JavaScript, the preload scanner continues examining the raw markup to identify resources like images, scripts, and stylesheets that can be fetched in parallel. This mechanism is fundamental to modern browser performance and operates automatically without developer intervention. The preload scanner cannot discover resources injected via JavaScript, making such resources likely to be loaded consecutively rather than concurrently.</p><p><strong>Early Hints (HTTP 103)</strong>: <a href="https://developer.chrome.com/docs/web-platform/early-hints">Early Hints</a> allows servers to send resource hints while generating the main response, using HTTP 103 status codes. This enables preconnect and preload hints to be sent during server think-time, potentially improving Largest Contentful Paint by several hundred milliseconds. 
Early Hints are only available for navigation requests and support preconnect and preload directives, but not prefetch.</p><p><strong>Speculation Rules API</strong>: <a href="https://developer.chrome.com/docs/web-platform/implementing-speculation-rules">The Speculation Rules API</a> is a recent web standard that allows defining rules to dynamically prefetch and prerender URLs based on user interaction patterns. Unlike traditional link prefetch, this API can prerender entire pages including JavaScript execution, leading to near-instant load times. The API uses JSON syntax within script elements or HTTP headers to specify which URLs should be speculatively loaded. Chrome has limits to prevent overuse, with different capacity settings based on urgency levels.</p><p><strong>HTTP/2 and HTTP/3</strong>: Most Chromium-based browsers and Firefox support HTTP/2 fully, and <a href="https://alexandrehtrb.github.io/posts/2024/03/http2-and-http3-explained/">HTTP/3</a> (based on QUIC) is also widely supported (Chrome has it enabled by default for supporting sites). These protocols improve page load by allowing concurrent transfers and reducing handshake overhead. From a developer perspective, this means you may no longer need sprite sheets or domain sharding tricks - the browser can efficiently fetch many small files in parallel on one connection.</p><p><strong>Resource prioritization</strong>: The browser also prioritizes certain resources over others. Typically, HTML and CSS are high priority (as they block rendering), parser-blocking scripts are medium or high, async/defer scripts lower, and images are usually lower still. Chromium's network stack assigns weights and can even cancel or defer requests to prioritize what's needed for an initial render.
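</p><p>A toy model of this prioritization logic (the buckets below are illustrative, not Chromium's actual priority table):</p>

```javascript
// Illustrative priority buckets for the scheme described above.
// The real Chromium logic is more nuanced; this is a sketch.
function resourcePriority(type, attrs = {}) {
  // An explicit fetchpriority attribute overrides the defaults.
  if (attrs.fetchpriority === "high") return "high";
  if (attrs.fetchpriority === "low") return "low";
  switch (type) {
    case "document":
    case "stylesheet":
      return "high"; // these block rendering
    case "script":
      return attrs.async || attrs.defer ? "low" : "medium";
    case "image":
      return attrs.inViewport ? "medium" : "low";
    default:
      return "low";
  }
}

resourcePriority("stylesheet");                       // "high"
resourcePriority("script", { defer: true });          // "low"
resourcePriority("image", { fetchpriority: "high" }); // "high"
```

<p>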
Developers can use <a href="https://web.dev/articles/preload-critical-assets">link rel=preload</a> and <a href="https://web.dev/articles/fetch-priority">Fetch Priority</a> to influence resource prioritization.</p><p>By the end of the networking phase, the browser has the initial HTML for the page (assuming it was an HTML navigation). At this point, Chrome's browser process chooses a renderer process to handle the content. Chrome will often launch a new renderer process in parallel with the network request (speculatively) so that it's ready to go when the data arrives. This renderer process is isolated (more on multi-process architecture later) and will take over for parsing and rendering the page.</p><p>Once the response is fully received (or as it streams in), the browser process commits the navigation: it signals the renderer process to take the stream of bytes and start processing the page. At this moment, the address bar updates and the security indicator (HTTPS lock, etc.) is shown for the new site. Now the action moves to the renderer process: parsing the HTML, loading sub-resources, executing scripts, and painting the page.</p><h2><strong>Parsing HTML, CSS, and JavaScript</strong></h2><p>When the renderer process receives the HTML content, its main thread begins to parse it according to the HTML specification. The result of parsing HTML is the DOM (Document Object Model) - a tree of objects representing the page structure. 
Parsing is incremental and can interleave with network reading (browsers parse HTML in a streaming fashion, so the DOM can start being built even before the entire HTML file is downloaded).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!A1DI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a9bf2c4-1b89-434d-81be-cf841789a6b2_1600x742.png" data-component-name="Image2ToDOM"><div class="image2-inset"><img src="https://substackcdn.com/image/fetch/$s_!A1DI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a9bf2c4-1b89-434d-81be-cf841789a6b2_1600x742.png" width="1456" height="675" class="sizing-normal" alt="" loading="lazy"></div></a></figure></div><p><strong>HTML parsing and DOM construction</strong>: HTML parsing is defined by the HTML Standard as an error-tolerant process that will produce a DOM no matter how malformed the markup is. This means even if you forget a closing &lt;/p&gt; tag or nest tags incorrectly, the parser will implicitly fix or adjust the DOM tree so that it's valid. For example, &lt;p&gt;Hello &lt;div&gt;World&lt;/div&gt; will automatically end the &lt;p&gt; before the &lt;div&gt; in the DOM structure. The parser creates DOM elements and text nodes for each tag or text in the HTML. Each element is placed in a tree reflecting the nesting in the source.</p><p>One important aspect is that the HTML parser may encounter resources to fetch as it goes: for instance, encountering a &lt;link rel="stylesheet" href="..."&gt; will prompt the browser to request the CSS file (on the network thread), and encountering an &lt;img src="..."&gt; will trigger an image request. These happen in parallel to parsing.
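</p><p>This kind of discovery can be sketched as a scan over raw markup for fetchable URLs - which is also the essence of the preload scanner mentioned earlier. A real implementation tokenizes properly; a regex is enough for illustration:</p>

```javascript
// Toy resource discovery: scan raw markup for fetchable URLs, the way
// the parser (and preload scanner) kick off requests as tags are seen.
function discoverResources(html) {
  const urls = [];
  const re = /<(?:img|script)[^>]*\bsrc="([^"]+)"|<link[^>]*\bhref="([^"]+)"/g;
  let m;
  while ((m = re.exec(html)) !== null) urls.push(m[1] || m[2]);
  return urls;
}

discoverResources(
  '<link rel="stylesheet" href="app.css"><img src="hero.png"><script src="main.js"></script>'
);
// → ["app.css", "hero.png", "main.js"]
```

<p>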
The parser can keep going while those loads occur, with one big exception: scripts.</p><p><strong>Handling &lt;script&gt; tags</strong>: If the HTML parser comes across a &lt;script&gt; tag, it pauses parsing and must execute the script before continuing (by default). This is because scripts can use document.write() or other DOM manipulation that can alter the page structure or content that's still coming in. By executing immediately at that point, the browser preserves the correct order of operations relative to the HTML. The parser therefore hands off the script to the JavaScript engine for execution, and only when the script finishes (and any DOM changes it did are applied) can HTML parsing resume. This script execution blocking behavior is why including large &lt;script&gt; files in the head can slow down page rendering - the HTML parsing can't continue until the script is downloaded and run.</p><p>However, developers can modify this behavior with attributes: adding <a href="https://web.dev/articles/efficiently-load-third-party-javascript">defer or async</a> to a &lt;script&gt; tag (or using modern ES module scripts) changes how the browser handles it. With async, the script file is fetched in parallel and executed as soon as it's ready, without pausing HTML parsing (the parse doesn't wait, and the script doesn't guarantee execution in original order relative to other async scripts). With defer, the script is fetched in parallel but execution is deferred until the HTML parsing is done (and will execute in the original order at that later time). In both cases, the parser isn't blocked waiting on the script, which is generally better for performance. ES6 modules (using &lt;script type="module"&gt;) are automatically deferred as well (and they can also use import statements - we'll cover module loading separately). 
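</p><p>Those scheduling rules can be sketched as a toy simulation (not a real parser; async scripts are omitted because their run time depends on when their fetch completes):</p>

```javascript
// Toy simulation of the scheduling rules above: blocking scripts run
// the moment the parser reaches them; defer scripts run after parsing
// finishes, in document order.
function simulateParse(tokens) {
  const log = [];
  const deferred = [];
  for (const t of tokens) {
    if (t.tag === "script" && t.defer) deferred.push(t); // fetch in parallel, run later
    else if (t.tag === "script") log.push(`run ${t.src}`); // parser blocks here
    else log.push(`dom ${t.tag}`); // ordinary element: append to the DOM
  }
  for (const t of deferred) log.push(`run ${t.src}`); // after parsing, in order
  return log;
}

const order = simulateParse([
  { tag: "h1" },
  { tag: "script", src: "a.js", defer: true },
  { tag: "script", src: "b.js" },
  { tag: "p" },
]);
// order → ["dom h1", "run b.js", "dom p", "run a.js"]
```

<p>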
By using these techniques, the browser can continue building the DOM without long pauses, making pages load faster.</p><p><strong>CSS Parsing and the CSSOM</strong>: Alongside HTML, CSS text must be parsed into a structure the browser can work with - often called the CSSOM (CSS Object Model). The <a href="https://web.dev/articles/critical-rendering-path/constructing-the-object-model">CSSOM</a> is essentially a representation of all the styles (rules, selectors, properties) that apply to the document. The browser's CSS parser reads CSS files (or &lt;style&gt; blocks) and turns them into a list of CSS rules (along with internal acceleration structures, such as bloom filters, that speed up selector matching). Then, as the DOM is being constructed (or once both DOM and CSSOM are ready), the browser will compute the style for each DOM node. This step is usually called style resolution or style calculation. The browser combines the DOM and CSSOM to determine, for each element, which CSS rules apply and what the final computed styles are (after applying the cascade, inheritance, and default styles). The output is often conceptualized as an association of each DOM node with a computed style (the resolved, final CSS properties for that element, e.g. an element's color, font, size, etc.).</p><p>It's worth noting that even without any author CSS, every element has default browser styles (the user-agent stylesheet). For example, an &lt;h1&gt; has a default font-size and margin in practically all browsers. The browser's built-in style rules are applied with the lowest priority, and they ensure some reasonable default presentation. Developers can view computed styles in DevTools to see exactly what CSS properties an element ends up with.
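</p><p>That origin precedence can be sketched as a last-wins merge (deliberately ignoring specificity and <code>!important</code>, which the real cascade also considers):</p>

```javascript
// Toy cascade: later origins win (user-agent < user < author),
// mirroring the precedence described above. Specificity and
// importance are omitted from this sketch.
function computedStyle(...origins) {
  return Object.assign({}, ...origins);
}

const uaStyles = { display: "block", fontSize: "32px", marginTop: "21.44px" }; // e.g. <h1> defaults
const authorStyles = { fontSize: "24px", color: "navy" };

const style = computedStyle(uaStyles, authorStyles);
// style.fontSize === "24px" (author wins); style.display === "block" (UA default survives)
```

<p>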
The style calculation step uses all applicable styles (user agent, user styles, author styles) to finalize each element's styling.</p><p><strong>Render-blocking behavior</strong>: While HTML parsing can proceed without CSS fully loaded, there is a <a href="https://web.dev/learn/performance/understanding-the-critical-path">render-blocking relationship</a>: browsers typically wait to perform the first render until CSS is loaded (for CSS in the &lt;head&gt;). This is because applying an incomplete stylesheet could flash unstyled content. In practice, if a &lt;script&gt; that is not marked async/defer appears after a CSS &lt;link&gt; in the HTML, it will additionally wait for that CSS to load before executing (since scripts may query style information via DOM APIs). As a rule of thumb, put stylesheet links in the head (they block rendering but are needed early) and put non-critical or large scripts with defer/async or at the bottom so they don't delay DOM parsing.</p><p>Now the browser has (1) the DOM constructed from HTML, (2) the CSS rules parsed (CSSOM), and (3) the computed styles for each DOM node. These together form the basis for the next stage: layout. But before moving on, we should consider the JavaScript side in more detail - specifically how the JS engine (V8 in Chrome's case) executes code. We touched on script blocking, but what happens when the JS runs? We'll dedicate a later section to the internals of V8 and JS execution. For now, assume that as scripts run, they might modify the DOM or CSSOM (for example, calling document.createElement or setting element styles). The browser may have to respond to those changes by recalculating styles or layout as needed (which can incur performance costs if done repeatedly). The initial run of scripts during parsing often includes things like setting up event handlers, or maybe manipulating the DOM (e.g. templating).
After that, the page is usually fully parsed and we move into layout and rendering.</p><h2><strong>Styling and Layout</strong></h2><p>At this stage, the browser's renderer process knows the structure of the DOM and each element's computed style. The next question is: where do all these elements go on the screen? How big are they? This is the job of layout (also known as "reflow" or "layout calculation"). In this phase, the browser calculates the geometry of each element - their size and position - according to the CSS rules (flow, box model, flexbox or grid, etc.) and the DOM hierarchy.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SDsn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85e725ec-3cf8-4ee9-935a-2d71a87b22bb_2272x1030.png" data-component-name="Image2ToDOM"><div class="image2-inset"><img src="https://substackcdn.com/image/fetch/$s_!SDsn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85e725ec-3cf8-4ee9-935a-2d71a87b22bb_2272x1030.png" width="1456" height="660" class="sizing-normal" alt="" loading="lazy"></div></a></figure></div><p><strong>Layout tree construction</strong>: The browser walks the DOM tree and generates a layout tree (sometimes called the render tree or frame tree). The layout tree is similar to the DOM tree in structure, but it omits non-visual elements (e.g. script or meta tags don't produce boxes) and may split some elements into multiple boxes if needed (for example, a single HTML element that is flowed across multiple lines might correspond to multiple layout boxes).
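</p><p>The filtering step can be sketched over a plain-object "DOM" (a toy: display:none subtrees produce no box, while visibility:hidden nodes keep one):</p>

```javascript
// Toy layout-tree construction over a plain-object "DOM".
// display:none subtrees are dropped entirely; visibility:hidden
// elements still get a box (they occupy space, they just aren't painted).
function buildLayoutTree(node) {
  if (node.style && node.style.display === "none") return null;
  const box = { tag: node.tag, children: [] };
  for (const child of node.children || []) {
    const childBox = buildLayoutTree(child);
    if (childBox) box.children.push(childBox);
  }
  return box;
}

const layoutTree = buildLayoutTree({
  tag: "body",
  children: [
    { tag: "p" },
    { tag: "div", style: { display: "none" }, children: [{ tag: "span" }] },
    { tag: "em", style: { visibility: "hidden" } },
  ],
});
// layoutTree.children holds boxes for "p" and "em" only
```

<p>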
Each node in the layout tree holds the computed style for that element and has information like the node's content (text or image) and computed properties that affect layout (like width, height, padding, etc.).</p><p>During layout, the browser computes the exact position (x, y coordinates) and size (width, height) for each element's box. This involves algorithms defined by CSS specifications: for example, in a normal document flow, block-level elements stack top-to-bottom, each taking full width by default, whereas inline elements flow within lines and cause line breaks as needed. Modern layout modes like <a href="https://web.dev/learn/css/flexbox">flexbox</a> or <a href="https://web.dev/learn/css/grid">grid</a> have their own algorithms. The engine has to consider font metrics to break lines (so text layout involves measuring text runs), and it must handle margins, padding, border, etc. There are many edge cases (e.g. margin collapsing rules, floats, absolutely positioned elements that are removed from flow, etc.), making layout a surprisingly complex process. Even a "simple" top-to-bottom layout has to figure out line breaks in text which depend on available width and font sizes. Browser engines have dedicated teams and many years of development to handle layout accurately and efficiently.</p><p>Some details about the layout tree:</p><ul><li><p>Elements with display:none are omitted entirely from the layout tree (they don't produce any box). In contrast, elements that are simply not visible (e.g. visibility:hidden) do get a layout box (taking up space), just not painted later.</p></li><li><p>Pseudo-elements like ::before or ::after that generate content are included in the layout tree (since they do have visual boxes).</p></li><li><p>The layout tree nodes know their geometry. 
For example, a &lt;p&gt; element's layout node will know its position relative to the viewport and its dimensions, and have children for each line or inline box inside it.</p></li></ul><p><strong>Layout calculation</strong>: Layout is typically a recursive process. Starting from the root (the &lt;html&gt; element), the browser computes the size of the viewport (for &lt;html&gt;/&lt;body&gt;) and then lays out child elements within it, and so on. Many elements' sizes depend on their children or parent (e.g. a container might expand to fit children, or a child might be 50% of its parent's width). The layout algorithm often has to do multiple passes for things like floats or for certain complex interactions, but generally it proceeds in one direction (top-down) with possible backtracking if needed.</p><p>By the end of this stage, each element's position and size on the page is known. We can now conceptually think of the page as a bunch of boxes (with text or images inside). But we still haven't actually drawn anything on the screen yet - that's the next step, painting.</p><p>However, one key concept: layout can be an expensive operation, especially if done repeatedly. If JavaScript later changes the size of an element or adds content, it can force a relayout of some or all of the page. Developers often hear advice about avoiding layout thrashing (like reading layout info in JS right after modifying DOM, which can force synchronous recalculation). The browser tries to optimize by noting what parts of the layout tree are "dirty" and only recomputing those. But worst-case, changes high up in the DOM could require recalculating the entire layout for large pages. 
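</p><p>The read/write batching advice can be sketched with plain objects standing in for elements (in a real page, <code>offsetHeight</code> reads and <code>style</code> writes are what force and dirty layout):</p>

```javascript
// Thrashing version: each iteration reads layout (offsetHeight) right
// after the previous iteration's write dirtied it, forcing a reflow
// per element in a real browser.
function resizeAllThrashing(boxes) {
  for (const box of boxes) {
    const h = box.parentNode.offsetHeight; // read → forces layout
    box.style.height = h / 2 + "px";       // write → dirties layout
  }
}

// Batched version: all reads first, then all writes, so the browser
// can satisfy every read from a single layout pass.
function resizeAllBatched(boxes) {
  const heights = boxes.map((b) => b.parentNode.offsetHeight); // reads
  boxes.forEach((b, i) => { b.style.height = heights[i] / 2 + "px"; }); // writes
}
```

<p>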
This is why costly style/layout operations should be minimized for better performance.</p><p><strong>Style and layout recap</strong>: To summarize, from HTML and CSS the browser builds:</p><ul><li><p>DOM tree - structure and content</p></li><li><p>CSSOM - parsed CSS rules</p></li><li><p>Computed Styles - the result of matching CSS rules to each DOM node</p></li><li><p>Layout tree - DOM tree filtered to visual elements, with geometry for each node</p></li></ul><p>Each stage builds on the last. If any stage changes (e.g. if a script alters the DOM or modifies a CSS property), the subsequent stages may need to update. For example, if you change a CSS class on an element, the browser may recalc style for that element (and children if inheritance changes), then might have to redo layout if that style change affects geometry (say display or size), then would have to repaint. This chain means layout and paint are dependent on up-to-date style, and so on. We'll discuss performance implications of this in the DevTools section (as the browser provides tools to see when these steps occur and how long they take).</p><p>With layout done, we move to the next major phase: painting.</p><h2><strong>Painting, Compositing, and GPU Rendering</strong></h2><p>Painting is the process of taking the structured layout information and actually producing pixels on the screen. In traditional terms, the browser would traverse the layout tree and issue drawing commands for each node ("draw background, draw text, draw image at these coordinates"). 
Modern browsers still conceptually do this, but they often split the work into multiple stages and leverage the GPU for efficiency.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!A40I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F871da49b-23af-4a7c-9671-e9228635348c_1600x1116.png"><img src="https://substackcdn.com/image/fetch/$s_!A40I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F871da49b-23af-4a7c-9671-e9228635348c_1600x1116.png" width="1456" height="1016" alt="" loading="lazy"></a></figure></div><p><strong>Painting / Rasterization</strong>: On the renderer's main thread, after layout, Chrome generates paint records (or a display list) by walking the layout tree. This is basically a list of drawing operations with their coordinates, much like an artist planning how to paint the scene: e.g. "draw rect at (x,y) with width W and height H and fill color blue, then draw text 'Hello' at (x2,y2) with font XYZ, then draw an image at &#8230;" and so on. This list is in the correct z-index order (so that overlapping elements paint correctly). For example, if an element has a higher z-index, its paint commands will come later (on top of) lower z-index content. The browser must consider stacking contexts, transparency, etc. to get the right ordering.</p><p>In the past, browsers might have simply drawn each element directly to the screen in order. But that approach can be inefficient if parts of the page change (you'd have to repaint everything). 
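<p>A paint-record list can be pictured as ordinary data. The sketch below is illustrative (these are not Chrome's real display-item types): it records drawing commands and orders them by stacking order, the way the paint step puts higher z-index content on top:</p>

```javascript
// A "display list" is just an ordered list of drawing commands.
// Each op notes what to draw, where, and its stacking (z-index) order.
const displayList = [
  { op: "drawRect", x: 0, y: 0, w: 300, h: 100, fill: "blue", z: 0 },
  { op: "drawText", x: 10, y: 40, text: "Hello", font: "16px sans-serif", z: 1 },
  { op: "drawImage", x: 200, y: 20, src: "logo.png", z: 2 },
];

// Paint order: lower z first, so higher-z content ends up on top.
// A stable sort preserves document order for equal z values.
function paintOrder(list) {
  return [...list].sort((a, b) => a.z - b.z).map((item) => item.op);
}
```

Replaying these recorded ops (against a canvas, or into tiles) is exactly what the later raster stage of the pipeline does.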
Modern browsers instead often record these drawing commands and then use a compositing step to assemble the final image, especially when using GPU acceleration.</p><p><strong>Layering and compositing</strong>: Compositing is an optimization where the page is split into several layers that can be handled independently. For example, a positioned element with a CSS transform or an animation might get its own layer. Layers are like separate "scratch canvases" - the browser can rasterize (draw) each layer separately, and then the compositor can blend them together on the screen, often using the GPU. </p><p>In Chromium's pipeline, after paint records are generated, there's a step to build the layer tree (this corresponds to which elements are on which layer). Some layers are created automatically (e.g. a video element, or a canvas, or elements with certain CSS will be promoted to layers), and developers can hint with will-change or CSS properties like transform to get a layer. The reason layers are helpful is that movement or opacity changes on a layer can be composited (i.e. just that layer re-rendered or moved) without re-painting the whole page. Too many layers, however, can be memory-heavy and add overhead, so browsers choose carefully.</p><p>After determining layers, Chrome's main thread hands off to the compositor thread. The compositor thread runs in the renderer process but separate from the main thread (so it can keep working even if the main JS thread is busy, which is great for smooth scrolling and animations). The compositor thread's job is to take the layers, rasterize them (convert the drawings into actual pixel bitmaps), and compose them into frames.</p><p><strong>Rasterization with GPU assistance</strong>: Raster work can also be distributed. In Chrome, the compositor thread breaks layers into smaller tiles (think 256x256 or 512x512 pixel chunks; tiles are often larger when GPU rasterization is enabled, which is the common case today). 
It then dispatches these to several raster worker threads (which may even run across multiple CPU cores) for concurrent rasterization. Each raster worker takes a tile - essentially a list of drawing commands for that region of a layer - and produces a bitmap (pixel data). Importantly, Skia (Chrome's graphics library) can rasterize on either the CPU or the GPU; historically Chrome's raster threads rendered pixels on the CPU and uploaded the results to GPU memory, but GPU rasterization is now the default on most hardware. Firefox's newer WebRender takes a different approach we'll mention later. The rasterized tiles are stored in GPU memory as textures. Once all needed tiles are drawn, the compositor thread has essentially a set of textured layers ready.</p><p>The compositor then assembles a compositor frame - basically a message to the browser process that includes all the quads (tiles of layers) that make up the screen, their positions, etc. This compositor frame is submitted via IPC back to the browser process, where ultimately the browser's GPU process (a separate process in Chrome for accessing GPU) will take these and display them. The browser process's own UI (like the tab bar) is also drawn via compositor frames, and they all get mixed in the final step. The GPU process receives the frames, and uses the GPU (via OpenGL/DirectX/Metal etc.) to composite them - basically drawing each texture at the right place on screen, applying transforms, etc., very quickly. The result is the final image you see displayed.</p><p>The advantage of this pipeline is apparent when you scroll or animate. For example, scrolling a page mostly just changes the viewport on a larger page texture. The compositor can just shift the layer positions and ask the GPU to redraw the new portion coming into view, without the main thread having to repaint everything. 
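<p>That scrolling fast path can be sketched as a toy compositor (the names here are hypothetical; a real compositor also tracks damage regions, tile visibility, and GPU fences). The point is that scrolling only changes a layer's offset, so no paint records are replayed:</p>

```javascript
// Toy compositor layer: owns a rastered texture plus a scroll offset.
class Layer {
  constructor(name) {
    this.name = name;
    this.offsetY = 0;
    this.rasterCount = 0; // how many times the pixels were (re)painted
  }
  raster() {
    this.rasterCount += 1; // the expensive step: replaying paint records
  }
}

// Scrolling is compositor-only: shift the layer, don't re-raster it.
function scrollBy(layer, dy) {
  layer.offsetY += dy;
}

const content = new Layer("page content");
content.raster();        // painted once at load
scrollBy(content, 500);  // scroll: offset changes, rasterCount stays 1
```

This is the same reason transform/opacity animations stay smooth while the main thread is busy: they touch only the layer's compositing parameters, never the paint records.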
If an animation is just a transform (say moving an element that is its own layer), the compositor thread can update that element's position each frame and produce new frames without involving the main thread or re-running style and layout. This is why animations that are "compositing-only" (changing transform or opacity, which don't trigger layout) are recommended for better performance - they can run at 60 FPS smoothly even if the main thread is busy. In contrast, animating something like height or background-color might force re-layout or re-paint each frame, which janks if the main thread can't keep up.</p><p>To put it succinctly, Chrome's rendering pipeline is: DOM &#8594; style &#8594; layout &#8594; paint (record display items) &#8594; layerize &#8594; raster (tiles) &#8594; composite (GPU). Firefox's pipeline is conceptually similar up to the display list stage, but with WebRender it skips explicit layer construction and instead sends a display list to the GPU process, which then handles almost all drawing using GPU shaders (more on this in the comparison section). WebKit (Safari) also uses a multi-threaded compositor and GPU rendering via "CALayers" on macOS. All modern engines thus take advantage of GPUs for rendering, especially for compositing and rasterizing graphics-intensive parts, to achieve high frame rates and offload work from the CPU.</p><p>Before moving on, let's discuss the GPU's role in more detail. In Chromium, the GPU process is a separate process whose job is to interface with the graphics hardware. It receives drawing commands (mostly high-level, like "draw these textures at these coords") from all renderer compositors and also the browser UI. It then translates that into actual GPU API calls. By isolating it in a process, a buggy GPU driver that crashes won't take down the whole browser - only the GPU process, which can be restarted. 
Also, it provides a sandbox boundary: GPU drivers process potentially untrusted content (canvas drawing, WebGL, etc.) and have had security bugs, so running them out-of-process mitigates the risk.</p><p>The result of the compositing is finally sent to the display (the OS window or context the browser is running in). For each animation frame (target 60fps or 16.7ms per frame for smooth results), the compositor aims to produce a frame. If the main thread is busy (say JavaScript took a long time), the compositor may skip frames or be unable to update, leading to visible jank. Developer tools can show dropped frames in the performance timeline. Techniques like requestAnimationFrame align JS updates to frame boundaries to help with smooth rendering.</p><p>In summary, the browser's rendering engine carefully breaks down the page content and styles into a set of geometry (layout) and drawing instructions, then uses layers and GPU compositing to efficiently turn that into the pixels you see. This complex pipeline is what enables the rich graphics and animations on the web to run at interactive frame rates. Next, we will peek into the JavaScript engine to understand how the browser executes scripts (which we've so far treated as a black box).</p><h2><strong>Inside the JavaScript Engine (V8)</strong></h2><p>JavaScript drives the interactive behavior of web pages. In Chromium browsers, the V8 engine executes JavaScript (and WebAssembly). Understanding V8's workings can help developers write performant JS. While an exhaustive deep-dive would be book-length, we'll focus on the key stages of the JS execution pipeline: parsing/compiling the code, executing it, and managing memory (garbage collection). 
We'll also note how V8 handles modern features like Just-In-Time (JIT) compilation tiers and ES modules.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x62F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37778d69-97f7-432e-8028-39675d784a6f_1600x1116.png"><img src="https://substackcdn.com/image/fetch/$s_!x62F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37778d69-97f7-432e-8028-39675d784a6f_1600x1116.png" width="1456" height="1016" alt="" loading="lazy"></a></figure></div><h3><strong>Modern V8 Parsing and Compilation Pipeline</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YJWi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8306080-5174-480f-8e35-fb316aa97806_2400x830.jpeg"><img src="https://substackcdn.com/image/fetch/$s_!YJWi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8306080-5174-480f-8e35-fb316aa97806_2400x830.jpeg" width="1456" height="504" alt="" loading="lazy"></a></figure></div><p><strong>Background compilation</strong>: Starting with Chrome 66, V8 compiles JavaScript source code on a background thread, reducing the amount of time spent compiling on the main thread by 5% to 20% on typical websites. Since version 41, Chrome has supported parsing of JavaScript source files on a background thread via V8's StreamedSource API. V8 can start parsing JavaScript source code as soon as the first chunk is downloaded from the network and continue parsing in parallel while streaming the file. Almost all script compilation occurs on background threads, with only short AST internalization and bytecode finalization steps happening on the main thread just before script execution. Currently, top-level script code and immediately invoked function expressions are compiled on background threads, while inner functions are still compiled lazily on the main thread when first executed.</p><p><strong>Parsing and bytecode</strong>: When a &lt;script&gt; is encountered (either during HTML parse or loaded later), V8 first parses the JavaScript source code. This produces an Abstract Syntax Tree (AST) representation of the code. The preparser is a copy of the parser that does the bare minimum needed to skip over functions. It verifies that functions are syntactically valid and produces all information needed for outer functions to be compiled correctly. When a preparsed function is later called, it is fully parsed and compiled on-demand.</p><p>Rather than interpreting directly from the AST, V8 uses a bytecode interpreter called Ignition (introduced in 2016). Ignition compiles the JavaScript into a compact bytecode format, which is essentially a sequence of instructions for a virtual machine. This initial compilation is quite fast and the bytecode is fairly low-level (Ignition is a register-based VM). 
The goal is to start executing the code quickly with minimal upfront cost (important for page load times).</p><p><strong>AST internalization process</strong>: AST internalization involves allocating literal objects (strings, numbers, object-literal boilerplate) on the V8 heap for use by generated bytecode. To enable background compilation, this process was moved later in the compilation pipeline, after bytecode compilation, requiring modifications to access raw literal values embedded in the AST instead of internalized on-heap values.</p><p><strong>Explicit Compile Hints</strong>: V8 has introduced a new feature called "<a href="https://v8.dev/blog/explicit-compile-hints">Explicit Compile Hints</a>" which allows developers to instruct V8 to parse and compile code immediately on load through eager compilation. Files with this hint are compiled on background threads, whereas deferred compilation happens on the main thread. Experiments with popular web pages showed performance improvements in 17 out of 20 cases, with an average 630ms reduction in foreground parse and compile times. Developers can add explicit compile hints to JavaScript files using special comments to enable eager compilation on background threads for critical code paths.</p><p><strong>Scanner and parser optimizations</strong>: V8's scanner has been significantly optimized, resulting in improvements across the board: single token scanning improved by roughly 1.4&#215;, string scanning by 1.3&#215;, multiline comment scanning by 2.1&#215;, and identifier scanning by 1.2-1.5&#215; depending on identifier length.</p><p>When the script runs, Ignition interprets the bytecode, executing the program. Interpretation is generally slower than optimized machine code, but it allows the engine to start running and also collect profiling information about the code's behavior. As the code runs, V8 gathers data on how it's being used: types of variables, which functions are called frequently, etc. 
This information will be used to make the code run faster in subsequent steps.</p><h3><strong>JIT Compilation Tiers</strong></h3><p>V8 doesn't stop at interpretation. It employs multiple tiers of Just-In-Time compilers to accelerate hot code. The idea is to spend more compilation effort on code that runs a lot, to make it faster, while not wasting time optimizing code that runs only once.</p><ol><li><p><strong>Ignition</strong> (interpreting the bytecode).</p></li><li><p><strong>Sparkplug</strong>: Sparkplug is V8's baseline JIT (launched around 2021). It takes the bytecode and compiles it to machine code quickly, without heavy optimizations. This yields native code that is faster than interpretation, but Sparkplug doesn't do deep analysis - it's meant to be almost as quick to start as the interpreter, while producing code that runs a bit faster.</p></li><li><p><strong>Maglev</strong>: In 2023, V8 introduced Maglev, a mid-tier optimizing compiler that is now actively deployed. Maglev compiles code nearly 20 times slower than Sparkplug does, but 10 to 100 times faster than TurboFan, effectively bridging the gap for functions that are moderately hot but not hot enough to justify TurboFan's compilation cost. As of Chrome M117, Maglev can handle many cases, resulting in faster startup for web apps that spend time in "warm" code (not cold, not super hot).</p></li><li><p><strong>TurboFan</strong>: As functions or loops get executed many times, V8 will engage its most powerful optimizing compiler. TurboFan takes the code and uses the collected type feedback to generate highly optimized machine code, applying advanced optimizations (inlining functions, eliding bounds checks, etc.). 
This optimized code can run much faster if the assumptions hold.</p></li></ol><p>So V8 now effectively has four execution tiers: Ignition interpreter, Sparkplug baseline JIT, Maglev optimizing JIT, and TurboFan optimizing JIT. This is analogous to how Java's HotSpot VM has multiple JIT levels (C1 and C2). The engine can dynamically decide which functions to optimize and when, based on execution profiles. If a function suddenly is called a million times, it will likely end up TurboFan-optimized for maximum speed.</p><p>Intel has also developed <a href="https://community.intel.com/t5/Blogs/Tech-Innovation/Client/Profile-Guided-Tiering-in-the-V8-JavaScript-Engine/post/1679340">Profile-Guided Tiering</a> that enhances V8's efficiency, leading to approximately 5% improvement on Speedometer 3 benchmarks. Recent V8 updates include static roots optimization, which allows accurate prediction of memory addresses at compile time for commonly used objects, significantly improving access speed.</p><p>One challenge with JIT optimization is that JavaScript is dynamically typed. V8 might optimize code under certain assumptions (e.g. this variable is always an integer). If a later call violates those assumptions (say the variable becomes a string), the optimized code is invalid. V8 then performs a deoptimization: it falls back to a less optimized version (or re-generates code with new assumptions). This mechanism relies on "inline caches" and type feedback to quickly adapt. The existence of deopt means sometimes peak performance isn't sustained if your code has unpredictable types, but generally V8 tries to handle typical patterns (like a function consistently being passed the same type of object).</p><h3><strong>Bytecode Flushing and Memory Management</strong></h3><p>V8 implements bytecode flushing where if a function remains unused after multiple garbage collections, its bytecode will be reclaimed. 
When executed again, the parser uses previously stored results to regenerate the bytecode more quickly. This mechanism is crucial for memory management but can lead to parsing inconsistencies in edge cases.</p><p><strong>Memory Management (Garbage Collection)</strong>: V8 automatically manages memory for JS objects using a garbage collector. Over the years, V8's GC has evolved into what's known as the Orinoco GC, which is a generational, incremental, and concurrent garbage collector. Key points:</p><ul><li><p><strong>Generational</strong>: V8 segregates objects by age. New objects are allocated in the young generation (or "nursery"). These are collected frequently with a very fast scavenging algorithm (copying live objects to a new space and reclaiming the rest). Objects that survive enough cycles get promoted to the old generation.</p></li><li><p><strong>Mark-and-sweep/compact</strong>: For the old generation, V8 uses a mark-and-sweep collector with compaction. This means it will occasionally stop the world (stop JS execution briefly), mark all reachable objects (tracing from roots like the global object), then sweep to reclaim memory from unreferenced objects. It may also compact memory (moving objects to reduce fragmentation). However, Orinoco has made much of the marking concurrent - it can do a lot of the marking on a background thread while JS is still running, to minimize pause times.</p></li><li><p><strong>Incremental GC</strong>: V8 performs garbage collection in small slices rather than one big pause when possible. This incremental approach spreads work out to avoid jank. 
For example, it can interleave a bit of marking work between script executions, using idle time.</p></li><li><p><strong>Parallel GC</strong>: On multi-core machines, V8 can perform parts of GC (like marking or sweeping) in parallel threads as well.</p></li></ul><p>The net effect is that the V8 team has managed to drastically reduce GC pause times over the years, making garbage collection mostly unnoticeable even in large applications. Minor GCs (new object scavenge) usually happen very fast. Major GCs (old gen) are rarer and mostly concurrent now. If you open Chrome's Task Manager or DevTools Memory panel, you might see V8's heap broken into "Young space" and "Old space" reflecting this generational design.</p><p>For developers, this means manual memory management isn't needed, but you should still be mindful: e.g. avoid creating tons of short-lived objects in tight loops (though V8 is quite good at handling short-lived objects) and be aware that holding onto large data structures will keep them around in memory. Tools like DevTools can force a garbage collection or record memory profiles to see what is using memory.</p><p><strong>V8 and Web APIs</strong>: It's worth mentioning that V8 covers the core JavaScript language and runtime (execution, standard JS objects, etc.), but many "browser APIs" (like DOM methods, alert(), network XHR/fetch, etc.) are not part of V8 itself. Those are provided by the browser and are exposed to JS via bindings. For instance, when you call document.querySelector, under the hood it enters the engine's binding to the C++ DOM implementation. V8 handles calling into C++ and getting results back, and there is a lot of machinery to make this boundary fast (Chrome uses an IDL to generate efficient bindings).</p><p>Having covered how the browser fetches resources, parses HTML/CSS, computes layout, paints with the GPU, and runs JS, we now have a picture of the entire process of loading and rendering a page. 
But there's more to explore: how ES modules are handled (since modules involve their own loading mechanism), how the browser's multi-process architecture is organized, and how security features like sandboxing and site isolation work.</p><h2><strong>Module Loading and Import Maps</strong></h2><p><a href="https://v8.dev/features/modules">JavaScript modules</a> (ES6 modules) introduce a different loading and execution model compared to classic &lt;script&gt; tags. Instead of a big script file that might create globals, modules are files that explicitly import/export values. Let's see how browsers (and specifically V8 in Chrome) load modules and how features like dynamic import() and import maps come into play.</p><p><strong>Static module imports</strong>: When the browser encounters a &lt;script type="module" src="main.js"&gt;, it treats main.js as a module entry point. The loading process works as follows: the browser will fetch main.js, then parse it as an ES module. During parsing, it will find any import statements (e.g. import { foo } from './utils.js';). Rather than executing code immediately, the browser constructs a module dependency graph. It will initiate fetching of any imported modules (utils.js in this case), and recursively, each of those modules is parsed for their imports, fetched, and so on. This happens asynchronously. Only once the entire graph of modules is fetched and parsed can the browser evaluate the modules. Module scripts are deferred by nature - the browser doesn't execute the module code until all dependencies are ready. 
Then it executes them in dependency order (ensuring that if module A imports B, B runs first).</p><p>This static import process is why ES modules can't be loaded from file:// in some cases unless allowed, and why they require CORS by default for cross-origin scripts - the browser is actively linking and loading multiple files, not just dropping a &lt;script&gt; into the page.</p><p><strong>Dynamic import()</strong>: In addition to static import statements, ES2020 introduced import(moduleSpecifier) as an expression. This allows code to load a module on the fly (returning a promise that resolves to the module exports). For example, you might do const module = await import('./analytics.js') in response to a user action, thereby code-splitting your application. Under the hood, import() triggers the browser to fetch the requested module (and its dependencies, if not already loaded), then instantiate and execute it, and resolve the promise with the module namespace object. V8 and the browser coordinate here: the browser's module loader handles fetching and parsing, V8 handles the compilation and execution once ready. Dynamic import is powerful because it can be used in non-module scripts too (e.g. an inline script can dynamically import a module). It essentially gives the developer control to load JS on demand. The difference from a static import is that static imports are resolved ahead of time (before any module code runs, the entire graph is loaded), whereas dynamic import behaves more like loading a new script at runtime (except with module semantics and promises).</p><p><strong>Import maps</strong>: One challenge with ES modules in the browser was module specifiers. In Node or with bundlers, you often import by package name (e.g. import { useState } from 'react'). On the web, without a bundler, 'react' is not a valid URL - the browser would treat it as a relative path (which would fail). This is where Import Maps come in. 
An import map is a JSON configuration that tells the browser how to resolve module specifiers to real URLs. It's provided via a &lt;script type="importmap"&gt; tag in HTML. For example, an import map might say that the specifier "react" maps to "https://cdn.example.com/react@19.0.0/index.js" (some full URL to the actual script). Then, when any module does import 'react', the browser uses the map to find the URL and loads that. Essentially, import maps allow "bare" specifiers (like package names) to work on the web by mapping them to CDN URLs or local paths.</p><p>Import maps have been a game-changer for unbundled development. Since 2023, import maps are supported in all major browsers (Chrome 89+, Firefox 108+, Safari 16.4+ - all three engines). They are especially useful for local development or simple apps where you want to use modules without a build step. For production, large apps often still bundle for performance (to reduce the number of requests), but as browsers and HTTP/2/3 improve, serving many small modules becomes more viable.</p><p>The module loader in the browser thus consists of: a module map (tracking what's been loaded), possibly an import map (for custom resolution), and the fetching/parsing logic. Once fetched and compiled, module code executes in strict mode and with its own top-level scope (no leaking to window unless explicitly attached). The exports are cached so if another module imports the same module later, it doesn't re-run it (it reuses the already evaluated module record).</p><p>One more aspect to mention is that ES modules, unlike scripts, defer execution and also execute in order for a given graph. If main.js imports util.js and util.js imports dep.js, the evaluation order will be: dep.js first, then util.js, then main.js (depth-first, post-order). 
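</p><p>Both dynamic import() and this post-order evaluation can be observed directly with inline data: URL modules - hypothetical stand-ins for dep.js and util.js that work in modern browsers and recent Node:</p>

```javascript
// Turn module source text into an importable data: URL (a stand-in for a file).
const toModuleUrl = (code) => 'data:text/javascript,' + encodeURIComponent(code);

const depUrl = toModuleUrl('console.log("dep evaluated"); export const d = 1;');
const utilUrl = toModuleUrl(
  `import { d } from "${depUrl}"; console.log("util evaluated"); export const u = d + 1;`
);

// import() fetches util's graph, evaluates dep first (post-order), then util,
// and resolves with util's module namespace object.
const util = await import(utilUrl);
console.log(util.u); // → 2
```

<p>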
This deterministic order can avoid the need for things like DOMContentLoaded in some cases, since by the time your main module runs, all its imports are loaded and executed.</p><p>From V8's perspective, modules are handled by the same compilation pipeline, but they create separate ModuleRecords. The engine ensures that a module's top-level code only runs once all dependencies are ready. V8 also has to deal with cyclic module imports (which are allowed and can lead to partially initialized exports). The details are per spec - but essentially, the engine will create all module instances, then resolve cycles by giving them placeholders, and then execute in an order that respects dependencies (the spec algorithm is the "DAG" topological sort of the module graph).</p><p>In summary, module loading in browsers is a coordinated dance between the network (fetching module files), the module resolver (using import maps or standard URL resolution), and the JS engine (compiling and evaluating modules in the correct order). It's more involved than old &lt;script&gt; loading, but results in a more modular and maintainable code structure. For developers, the key takeaways are: use modules to organize code, use import maps if you want bare imports, and know that you can dynamically load modules when needed via import(). The browser will handle the heavy lifting of making sure everything executes in the right sequence.</p><p>Now that we've covered how a single page's internals work, let's zoom out and examine the browser architecture that allows multiple pages, tabs, and web apps to all run simultaneously without interfering with each other. This brings us to the multi-process model.</p><h2><strong>Browser Multi-Process Architecture</strong></h2><p>Modern browsers (Chrome, Firefox, Safari, Edge, etc.) all use a multi-process architecture for stability, security, and performance isolation. 
Instead of running the entire browser as one giant process (which was how early browsers worked), different aspects of the browser run in different processes. Chrome was a pioneer of this approach in 2008, and others followed suit in various forms. Let's focus on Chromium's architecture and note differences in Firefox and Safari.</p><p>In Chromium (Chrome, Edge, Brave, etc.), there is one <strong>Browser Process</strong> that is central. This browser process is responsible for the UI (the address bar, bookmarks, menus - all the browser chrome) and for coordinating high-level tasks like resource loading and navigation. When you open Chrome and see one entry in your OS task manager, that's the browser process. It's also the parent that spawns other processes.</p><p>Then, for each tab (and sometimes for each site in a tab), Chrome creates a <strong>Renderer Process</strong>. A renderer process runs the Blink rendering engine and V8 JS engine for the content of that tab. In general, each tab gets at least one renderer process. 
</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!-21l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46fb7fe0-b266-4767-bd4d-de835fac5126_1536x1024.jpeg" width="1456" height="971" alt="" loading="lazy"></figure></div><p>If you have multiple unrelated sites open, they'll be in separate processes (Site A in one, Site B in another, etc.). Chrome even isolates cross-origin iframes into separate processes (more on that in site isolation). The renderer process is sandboxed and cannot directly access your file system or network arbitrarily - it has to go through the browser process for those privileged operations.</p><p>Other key processes in Chrome include:</p><ul><li><p><strong>GPU Process</strong>: a process dedicated to communicating with the GPU (as described earlier). All rendering and compositing requests from renderers go to the GPU process, which actually issues graphics API calls. 
This process is sandboxed and separate so that a GPU crash doesn't take down renderers.</p></li><li><p><strong>Network Process</strong>: (In older Chrome versions the network was a thread in browser process, but now it's often a separate process through "servicification"). This process handles network requests, DNS, etc. and can be sandboxed separately.</p></li><li><p><strong>Utility Processes</strong>: these are for various services (like audio playback, image decoding, etc.) that Chrome may offload.</p></li><li><p><strong>Plugin Process</strong>: in the era of Flash and NPAPI plugins, plugins ran in their own process. Flash is deprecated now, so this is less relevant, but the architecture remains ready for plugins to not run in the main browser process.</p></li><li><p><strong>Extension Processes</strong>: Chrome extensions (which are essentially scripts that can act on web pages or the browser) run in separate processes as well, isolated from websites for security.</p></li></ul><p>A simplified view is: one Browser process coordinates multiple Renderer processes (one per tab or per site instance), plus one GPU process and a few others for services. Chrome's task manager (Shift+Esc on Windows or via More Tools &gt; Task Manager) will actually list each process type and its memory usage.</p><p><strong>Benefits of Multi-Process</strong>: The primary benefits are:</p><ul><li><p><strong>Stability</strong>: If a web page (renderer process) crashes or leaks memory, it doesn't crash the whole browser - you can close that tab and the rest stays alive. In one-process browsers, a single bad script could tear down everything. Chrome can show the "Aw, Snap" error for a single tab when its process dies, and you can reload it independently.</p></li><li><p><strong>Security (Sandboxing)</strong>: By running web content in a restricted process, the browser can limit what that code can do on your system. 
Even if an attacker finds a vulnerability in the rendering engine, they are trapped in the sandbox - the renderer process typically cannot read your files or arbitrarily open network connections or launch programs. It must request the browser process for things like file access, which can be validated or denied. This sandbox is enforced at the OS level (using job objects, seccomp filters, etc. depending on platform).</p></li><li><p><strong>Performance Isolation</strong>: Intensive work in one tab (a heavy webapp or an infinite loop) is mostly confined to that tab's renderer process. Other tabs (different process) can remain responsive because their processes aren't blocked. Also, the OS can schedule processes on different CPU cores - so two heavy pages can run in parallel on a multi-core system better than if they were threads of one process.</p></li><li><p><strong>Memory segmentation</strong>: Each process has its own address space, so memory is not shared. This prevents one site from snooping on data of another and also means when a tab is closed, the OS can reclaim all memory from that process efficiently. The downside is some overhead due to duplicated resources and processes (each renderer loads its own copy of the JS engine, etc.).</p></li></ul><p><strong>Site Isolation</strong>: Initially, Chrome's model was one process per tab. Over time they evolved it to one process per site (especially after Spectre - see next section on security). As of 2024, site isolation is enabled by default for 99% of Chrome users across desktop platforms, with Android support continuing to be refined. This means if you have two tabs both open to example.com, Chrome might decide to use one process for both (to save memory, because they're the same site and thus less risky to put together). But a tab with example.com and an iframe of evil.com would by default put evil.com's iframe in a separate process from the parent page (to protect the example.com data). 
This enforcement is what Chrome calls "Strict Site Isolation" (launched around Chrome 67 as a default). Site isolation causes Chrome to use 10-13% more system resources due to increased process creation, but provides crucial security benefits.</p><p>Firefox's architecture, called <a href="https://blog.mozilla.org/addons/2016/04/11/the-why-of-electrolysis/">Electrolysis</a> (e10s), was historically one content process for all tabs (for many years Firefox was single-process and only enabled a few content processes around 2017). As of 2021, Firefox uses multiple content processes (by default 8 for web content). With <a href="https://blog.mozilla.org/security/2021/05/18/introducing-site-isolation-in-firefox/">Project Fission</a> (site isolation), Firefox is moving toward isolating sites similarly - it can spin up new processes for cross-site iframes, and in Firefox 108+ they enabled site isolation by default, increasing the number of processes to potentially one per site like Chrome. Firefox also has a GPU process (for WebRender and compositing) and a separate networking process, similar to Chrome's split. So in practice, Firefox now has a very Chrome-like model: a parent process, a GPU process, a network process, a few content (renderer) processes, and some utility processes (for extensions, media decoding, etc. - e.g. a media plugin can run isolated).</p><p>Safari (WebKit) likewise moved to a multi-process model (WebKit2) where each tab's content is in a separate WebContent process and a central UI process controls them. Safari's WebContent processes are also sandboxed and cannot directly access devices or files without going through the UI process. Safari also has a networking process that is shared (and possibly other helpers). So while implementations differ, the concept is consistent: isolate each webpage's code in its own sandboxed environment.</p><p>One important point is inter-process communication (IPC): How do these processes talk to each other? 
Browsers use IPC mechanisms (on Windows, often named pipes or other OS IPC; on Linux, maybe Unix domain sockets or shared memory; Chrome has its own IPC library, Mojo). For example, when a network response arrives in the Network process, it needs to be delivered to the correct Renderer process (via the Browser process coordinating). Similarly, when a page calls fetch(), the JS engine will call into a network API which sends a request to the Network process and so on. IPC adds complexity, but browsers optimize heavily (e.g. using shared memory for transferring large data like images efficiently, and posting asynchronous messages to avoid blocking).</p><p><strong>Process Allocation Strategies</strong>: Chrome doesn't always create a brand new process for every single tab - there are limits (particularly on devices with low memory, it may reuse processes for same-site tabs). Chrome will reuse an existing renderer if you open another tab to the same site, to conserve memory (this is why two tabs of the same site sometimes share a process). It also has a limit on total processes (which can scale based on RAM). When the limit is hit, it might start putting multiple unrelated sites in one process, though it tries hard to avoid mixing sites if site isolation is enabled. On Android, Chrome uses fewer processes because of the memory constraints (often a max of 5-6 processes for content).</p><p>One more concept in Chromium is <strong>servicification</strong>: splitting browser components into services that could run in separate processes. For example, the Network Service was made a separate module that can run out-of-process. The idea is modularity - powerful systems can run each service in its own process, whereas constrained devices might consolidate some services back into one process to save overhead. Chrome can decide at runtime or build time how to deploy these services. On high-end hardware it might split everything (UI, net, GPU, etc. 
all separate), and on low-end (Android) it might combine browser &amp; network in one process to cut down overhead.</p><p>The takeaway: Chromium's architecture is designed to run the browser UI and each site in different sandboxes, using processes as the isolation boundary. Firefox and Safari have converged on similar designs. This architecture greatly improves security and reliability at the cost of more memory usage. The web content processes are treated as untrusted, and that's where site isolation (next section) comes into play to even isolate different origins from each other within separate processes.</p><h2><strong>Site Isolation and Sandboxing</strong></h2><p>Site isolation and sandboxing are security features that build on the multi-process foundation. They aim to ensure that even if malicious code runs in the browser, it cannot easily steal data from other sites or access your system.</p><p><strong>Site Isolation</strong>: We've touched on this - it means that different websites (strictly, different "sites": scheme plus registrable domain) run in different renderer processes. Chrome's site isolation was boosted after the <a href="https://developer.chrome.com/blog/meltdown-spectre">Spectre vulnerability</a> came to light in 2018. Spectre showed that malicious JavaScript could potentially read memory it shouldn't (by exploiting CPU speculative execution). If two sites were in the same process, a malicious site could use Spectre to snoop on the memory of a sensitive site (like your banking site). The only robust solution is to not let them share a process at all. So Chrome made site isolation a default: every site gets its own process, including cross-origin iframes. Firefox has followed with Project Fission (enabled by default in recent versions), which aims for the same - isolating every site in its own process for security. 
This is a significant change from the past where if you had a parent page and multiple iframes from various domains, they might all live in one process (especially if they were in one tab). Now, those iframes would be split so that e.g. an &lt;iframe src="https://evil.com"&gt; on a good site page is forced into a different process, preventing even low-level attacks from leaking info between them.</p><p>From a developer point of view, site isolation is mostly transparent. One implication is that communications between an embedded iframe and its parent might cross process boundaries now, so things like postMessage between them are implemented via IPC under the hood. But the browser makes this seamless; you as a dev just use the APIs as normal.</p><p><strong>Sandboxing</strong>: Each renderer process (and other auxiliary processes) run in a sandbox with restricted privileges. For example, on Windows, Chrome uses a job object and drops privileges so the renderer can't call most Win32 APIs that access the system. On Linux, it uses namespaces and seccomp filters to limit syscalls. The renderer basically can compute and render content but if it tries to open a file or camera or microphone, it will be blocked (unless going through proper channels that ask user permission via the browser process). WebKit's documentation explicitly notes that WebContent processes have no direct access to filesystem, clipboard, devices, etc. - they must request via the UI process which mediates. This is why, for example, when a site tries to use your microphone, the permission prompt is shown by the browser UI (browser process) and if allowed, the actual recording is done in a controlled process. The sandbox is a crucial line of defense. Even if an attacker finds a bug to run native code in the renderer, they then face the sandbox barrier - they'd need a separate exploit (an "escape") to break out to the system. 
This layered approach (called site isolation + sandbox) is state-of-the-art for browser security.</p><p>Firefox's sandboxing is also quite strict now (it was weaker in early e10s days but they ramped it up). Firefox content processes can't directly access much either; and Firefox also sandboxes the GPU process to handle graphics driver issues.</p><p><strong>Out-of-Process iframes (OOPIF)</strong>: In Chrome's implementation of site isolation, they invented the term <a href="https://www.chromium.org/developers/design-documents/oop-iframes/">OOPIF</a> for out-of-process iframe. From a user's perspective, nothing changes, but in Chrome's internal architecture, each frame of a page can potentially be backed by a different renderer process. The top-level frame and same-site frames share one process; cross-site frames use different processes. All those processes "cooperate" to render a single tab's content, coordinated by the browser process. This is pretty complex, but Chrome has a frame tree that can span processes. It means your one tab might be running N processes (one for the main document, others for each cross-site subdocument). They communicate via IPC for things like DOM events crossing the boundary or certain JavaScript calls that involve cross-context. The web platform (through specs like <a href="https://web.dev/articles/coop-coep">COOP/COEP</a>, <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/SharedArrayBuffer">SharedArrayBuffer,</a> etc.) is evolving with these constraints in mind after Spectre.</p><p><strong>Memory and Performance Costs</strong>: Site isolation does increase memory usage because more processes are used. Chrome devs noted it could be a 10-20% memory <a href="https://www.thurrott.com/mobile/chrome-os/162980/spectre-mitigation-increases-chrome-memory-usage-google-says">overhead</a> in some cases. 
They mitigated some of this with "best-effort process consolidation" for same-site pages, and by limiting how many processes can be spawned (as mentioned earlier). Firefox initially didn't isolate every site due to memory concerns, but after Spectre they found ways to do it more efficiently with a cap on content processes (eight by default) and on-demand process creation. Safari historically has a strong process model but I'm not sure if it isolates cross-site iframes; WebKit2 certainly isolates top-level pages. Apple's focus is often also on privacy (Intelligent Tracking Prevention will partition cookies, etc.), but that's a different layer.</p><p>Cross-site prefetches are limited for privacy reasons and will currently only work if the user has no cookies set for the destination site, preventing sites from tracking user activity via prefetched pages that may never be visited.</p><p>All in all, site isolation ensures that the principle of least privilege is applied: code from origin A cannot access data from origin B unless via web APIs with explicit consent (like postMessage or storage that's partitioned). And the sandbox ensures that even if code is rogue, it can't touch your system directly. These measures make browser exploits much harder - an attacker typically needs a chain of exploits now (one to compromise the renderer, one to escape the sandbox) to do serious damage, which raises the bar significantly.</p><p>As a web developer, you might not directly feel site isolation, but you benefit from it through a safer web. One thing to be aware of is that cross-origin interactions might have slightly more overhead (because of IPC) and that some optimizations like in-process script sharing aren't possible across origins. 
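</p><p>For example, parent-to-iframe messaging uses the same postMessage API it always has, even though with site isolation the message may now cross a process boundary via IPC (the origins and frame here are hypothetical):</p>

```javascript
// Parent page at https://example.com, embedding
// <iframe id="widget" src="https://widgets.example.net/chat.html">:
const frame = document.getElementById('widget');
frame.contentWindow.postMessage(
  { type: 'init', theme: 'dark' },
  'https://widgets.example.net' // target origin: only deliver to this origin
);

// Inside the iframe (a different renderer process under site isolation):
window.addEventListener('message', (event) => {
  if (event.origin !== 'https://example.com') return; // always validate sender
  console.log('received', event.data.type);
});
```

<p>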
But browsers are continuously optimizing the messaging between processes to minimize any performance hit.</p><p>Now, after covering security, let's turn to tools and performance instrumentation - essentially, how we developers can peek into this pipeline and measure or debug it.</p><h2><strong>Comparing Chromium, Gecko, and WebKit</strong></h2><p>We've mainly described Chrome/Chromium's behavior (Blink engine for HTML/CSS, V8 for JS, multi-process via Aura/Chromium infrastructure). Other major engines - Mozilla's Gecko (used in Firefox) and Apple's WebKit (used in Safari) - share the same fundamental goals and a broadly similar pipeline, but there are noteworthy differences and historical divergences.</p><p><strong>Shared Concepts</strong>: All engines parse HTML into a DOM, parse CSS into style data, compute layout, and paint/composite. All have JS engines with JITs and garbage collection. And all modern ones are multi-process (or at least multi-threaded) for parallelism and security.</p><h3><strong>Differences in CSS/Style System</strong></h3><p>One interesting difference is how CSS style computation is implemented by rendering engine:</p><ul><li><p><strong>Blink (Chromium)</strong>: Uses a single-threaded style engine in C++ (historically based on WebKit's). It computes style sequentially for the DOM tree. It has had incremental style invalidation optimizations, but by and large it's one thread doing the work (apart from some minor parallelization in animation).</p></li><li><p><strong>Gecko (Firefox)</strong>: In the Quantum project (2017), Firefox integrated Stylo, a new CSS engine written in Rust, which is multi-threaded. Firefox can calculate style for different DOM subtrees in parallel using all CPU cores. This was a major performance improvement for CSS in Gecko. So, style recalculation in Firefox might use 4 cores to do what Blink does on 1. 
This is one advantage of Gecko's approach (at the cost of complexity).</p></li><li><p><strong>WebKit (Safari)</strong>: WebKit's style engine is single-threaded like Blink (since Blink forked from WebKit in 2013, they shared architecture up to that point). WebKit has done interesting things like a bytecode JIT for CSS selectors matching. It may transform CSS selectors into bytecode and JIT compile a matcher for speed. Blink did not adopt that (it uses iterative matching).</p></li></ul><p>So, in CSS, Gecko stands out with parallel style computation via Rust. Blink and WebKit rely on optimized C++ and maybe some JIT tricks (in WebKit's case).</p><h3><strong>Layout and Graphics</strong></h3><p>All three engines implement the CSS box model and layout algorithms. Specific features might land in one before others (e.g. at one time WebKit was ahead in CSS Grid support, then Blink caught up - often they share code through standards bodies).</p><p>Firefox (Gecko) made a huge change by introducing <strong>WebRender</strong> as its compositor/rasterizer. WebRender is now the default rendering engine in Firefox and has contributed to significant performance improvements, particularly for graphics-intensive web content. WebRender (also Rust) basically takes the display list and renders it on the GPU directly, handling things like tessellating shapes, text, etc. with the GPU. It's like moving more painting work to the GPU. In Chrome's pipeline, rasterization is still done on CPU (for most content) then sent to GPU as bitmaps. WebRender tries to avoid making bitmaps for whole layers and instead draw vectors on GPU (except for text glyphs which it caches as atlas textures). This means Firefox can potentially animate more content at high performance because it doesn't need to re-rasterize everything if only small portions change - it can redraw via GPU very quickly. It's akin to how a game engine redraws a scene every frame using GPU calls. 
The downside is it's complex to implement and tune, and can stress the GPU more. But as GPU power grows, this approach is forward-looking. Chrome's team considered a similar approach (the "Skia GPU" path) but has not done a full WebRender-style overhaul.</p><p>Safari (WebKit) uses an approach more similar to older Chrome: it composites with layers (CALayers, since on Mac and iOS it uses Core Animation). Safari was early to move to GPU compositing (iPhone OS and Safari 4 in 2009 had hardware-accelerated compositing for certain CSS like transforms). Safari and Chrome diverged but conceptually both do tiling and compositing. Safari also offloads a lot to the GPU (and uses tiling, especially on iOS where tile drawing was fundamental for smooth scrolling).</p><p><strong>Mobile optimizations</strong>: Each engine has special cases for mobile. For example, WebKit has the concept of tile coverage for scrolling (used in iOS's UIWebView historically). Chrome on Android uses "tiling" and tries to keep raster tasks minimal to hit frame rates. Firefox's WebRender came from the mobile-first Servo project.</p><h3><strong>JavaScript Engines</strong></h3><ul><li><p><strong>V8 (Chromium)</strong>: as described earlier - Ignition, Sparkplug, TurboFan, and Maglev as of 2023.</p></li><li><p><strong>SpiderMonkey (Firefox)</strong>: It historically had an interpreter, then a Baseline JIT and an optimizing JIT (IonMonkey). Recent work (Warp) changed how the JIT tiers work, simplifying Ion and making it more like TurboFan's approach of using cached bytecode and type info. SpiderMonkey also has a different GC (also generational, incremental since 2012, and now mostly incremental/concurrent).</p></li><li><p><strong>JavaScriptCore (Safari)</strong>: As noted, it has 4 tiers (LLInt, Baseline, DFG, FTL). It uses a different GC (a generational mark-sweep collector, known in its modern, mostly-concurrent form as Riptide). 
JSC's FTL tier originally used LLVM as its backend and now uses WebKit's own B3 compiler (V8 and SpiderMonkey have their own top-tier compilers as well). The FTL can yield very fast code, but its compilation is heavy. JSC tends to prioritize peak performance on certain benchmarks (it often shines on some, but V8 tends to catch up; they leapfrog).</p></li></ul><p>In terms of ES features, all three engines are pretty much up-to-date with the latest standards, thanks to test262 and each other's competition.</p><h3><strong>Multi-Process Model Differences</strong></h3><ul><li><p><strong>Chrome</strong>: each tab typically separate, site isolation at the site level, lots of processes (can be dozens).</p></li><li><p><strong>Firefox</strong>: fewer processes by default (8 content processes handling all tabs, plus more if needed for cross-site iframes with Fission). So, it's not necessarily one process per tab; tabs share content processes in a pool. This means Firefox might have lower memory usage under many-tab scenarios, but it also means one content process crash can take out multiple tabs (though it tries to group by site, so maybe all Facebook tabs in one process, etc.).</p></li><li><p><strong>Safari</strong>: likely one process per tab (or per a few tabs) - on iOS, WKWebView definitely isolates each webview. Safari desktop historically did each tab separate as well. Not sure if they isolate cross-origin iframes yet - Apple hasn't talked about Spectre mitigations much, but Safari does have a process per domain for top-level pages at least.</p></li></ul><p><strong>Interprocess Coordination</strong>: All engines have to solve similar problems, like how to implement alert() (which blocks JS) in a multi-process environment - typically the browser process shows the alert UI and pauses that script context. Or how to handle prompt/confirm, modal dialogs, etc. There are subtle differences (e.g. Chrome doesn't truly block the thread for alert - it spins a nested runloop in the renderer, etc. 
whereas Firefox may still freeze that tab's process).</p><p><strong>Crash handling</strong>: Chrome and Firefox both have crash reporters that can restart a crashed content process and show an error in the tab. When Safari's WebContent process crashes, it typically displays a simpler error message in the content area.</p><h3><strong>Feature Implementation Divergence</strong></h3><p>Some web platform features land engine-first: e.g. Chrome was the first to ship the View Transitions API (document.startViewTransition) for seamless DOM transitions, leaning on Blink's architecture. Firefox may implement a feature differently or later, but standards eventually converge.</p><p><strong>Developer tools</strong>: Chrome's DevTools is very advanced. Firefox's DevTools are also very good (with some unique features, like early CSS Grid highlighters and a shape editor). Safari's Web Inspector is fine but not as full-featured in some areas. These differences can matter to devs debugging in each browser.</p><h3><strong>Performance Trade-offs</strong></h3><p>Historically, Chrome was lauded for faster JS and overall performance thanks to its multi-process design and V8. Firefox with Quantum closed a lot of gaps, sometimes surpassing Chrome in graphics (WebRender can be very fast for complex pages). Safari often excels in graphics and low power usage on Apple hardware (it optimizes heavily for power).</p><p><strong>Memory</strong>: Chrome has a reputation for high memory usage (all those processes). Firefox tries to be a bit more conservative. Safari is very memory efficient on iOS out of necessity (limited RAM), and a lot of memory optimization happens in WebKit.</p><p><strong>External Contributors</strong>: Interestingly, many improvements in these engines come from external teams like Igalia (e.g. implementing CSS Grid in both WebKit and Blink).
So sometimes features land roughly simultaneously.</p><p>From a web developer's perspective, the differences often manifest as:</p><ul><li><p>Needing to test on all engines, because there may be slight differences or bugs in one engine's implementation of a CSS feature or an API.</p></li><li><p>Performance differences (for example, a particular JS workload might be faster in one engine than another due to JIT heuristics).</p></li><li><p>Some APIs being unavailable in one engine (Safari is often the last to ship new APIs, like WebRTC features or newer IndexedDB versions, though it usually gets there eventually).</p></li></ul><p>But the core concepts we discussed (network -&gt; parse -&gt; layout -&gt; paint -&gt; composite -&gt; JS execution) apply to all, just with varying internal approaches or names:</p><ul><li><p>In Gecko: parse -&gt; frame tree -&gt; display list -&gt; WebRender scene (or layer tree, if WebRender is disabled) -&gt; composite.</p></li><li><p>In WebKit: parse -&gt; render tree -&gt; graphics layers -&gt; composite (via Core Animation).</p></li></ul><p>And all have analogous subsystems (DOM, styling, layout, graphics, JS engine, networking, processes/threads).</p><p>Knowing these helps in debugging: e.g. if something is janky in Safari but not Chrome, it could be that WebKit paints differently. Or if CSS is slow in Firefox, maybe it's hitting a path that isn't parallelized by Stylo (though that's rare).</p><p>To sum up, while Chromium, Gecko, and WebKit have different implementations and even some different innovations (parallel CSS in Gecko, WebRender's GPU rendering, etc.), they increasingly implement the same web standards and even collaborate on many. The choice of engine matters more for platform vendors and open-web diversity; as a developer you mostly care that your site runs everywhere.
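</p><p>When a page is janky in one engine but not another, one engine-agnostic first step is to time suspected code paths yourself against the roughly 50 ms "long task" threshold that profiling tools flag. A minimal sketch (the helper name and reporting are illustrative, not any browser API):</p>

```javascript
// Hypothetical wrapper: report any call that holds the main thread
// longer than the ~50 ms threshold commonly flagged as a "long task".
const LONG_TASK_MS = 50;

function flagLongTasks(name, fn, report = console.warn) {
  return function (...args) {
    const start = performance.now();
    const result = fn.apply(this, args); // run the wrapped work
    const elapsed = performance.now() - start;
    if (elapsed > LONG_TASK_MS) {
      report(`${name} blocked for ${elapsed.toFixed(1)} ms`);
    }
    return result;
  };
}

// Wrap a hot path and compare what each browser reports.
const renderList = flagLongTasks("renderList", function (n) {
  let total = 0;
  for (let i = 0; i < n; i++) total += i; // stand-in for real rendering work
  return total;
});
```

<p>Browsers also expose native long-task reporting (e.g. via <code>PerformanceObserver</code>), but support varies by engine, which is itself an instance of the divergence above.</p><p>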
Under the hood, each engine's unique architecture can lead to different performance profiles or bugs, which is why testing and using the performance diagnostics in each (like Firefox's performance tool vs Chrome's) can be insightful. It's beyond our scope to list all the differences, but hopefully this gives an idea of the landscape: the engines are convergent in high-level design (multi-process, similar pipelines) yet divergent in specific technical solutions.</p><h2><strong>Conclusion and Further Reading</strong></h2><p>We've journeyed through the life of a web page inside a modern browser - from the moment a URL is entered, through networking and navigation, HTML parsing, styling, layout, painting, and JavaScript execution, all the way to the GPU putting pixels on the screen. We've seen that browsers are essentially mini operating systems: managing processes, threads, memory, and a slew of complex subsystems to ensure web content loads fast and runs securely. For web developers, understanding these internals can demystify why certain best practices (like minimizing reflows or using async scripts) matter for performance, or why some security policies (like not mixing origins in iframes) exist.</p><p>A few key takeaways for developers:</p><p><strong>Optimize network usage</strong>: Fewer round trips and smaller files mean a faster first render. The browser does a lot on its own (HTTP/2, caching, speculative loading), but you should still leverage techniques like resource hints and efficient caching. The networking stack is high-performance, but latency is always a killer.</p><p><strong>Structure your HTML/CSS for efficiency</strong>: A well-structured DOM and lean CSS (avoid very deep trees and overly complex selectors) help the parsing and style systems. The DOM and CSS together produce computed styles, then layout computes geometry - heavy DOM manipulation or style changes can trigger these recalculations repeatedly.</p><p><strong>Batch DOM updates</strong> to avoid repeated style/layout thrashing: group your reads and writes rather than interleaving them.
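</p><p>As a minimal sketch of that idea (plain functions over a hypothetical list of elements; the names are illustrative), interleaving reads and writes can force a fresh layout on every iteration, while grouping all reads before all writes lets the engine lay out once:</p>

```javascript
// Thrashing: each offsetHeight read after a style write can force
// the engine to redo layout synchronously, once per item.
function padHeightsThrashing(items) {
  for (const item of items) {
    const h = item.offsetHeight;          // read (may force layout)
    item.style.height = (h + 10) + "px";  // write (invalidates layout)
  }
}

// Batched: all reads first, then all writes, so the reads trigger
// at most one layout pass before the write phase begins.
function padHeightsBatched(items) {
  const heights = items.map((item) => item.offsetHeight); // read phase
  items.forEach((item, i) => {
    item.style.height = (heights[i] + 10) + "px";         // write phase
  });
}
```

<p>In real code, scheduling the write phase in a <code>requestAnimationFrame</code> callback is a common way to enforce this separation.</p><p>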
Use the DevTools Performance panel to catch when your script is causing many layouts or paints.</p><p><strong>Use compositing-friendly CSS for animations</strong>: Animations of transform or opacity can stay off the main thread, running on the compositor for smooth results. Avoid animating layout-bound properties if possible.</p><p><strong>Mind your JS execution</strong>: Though JS engines are extremely fast, long tasks still block the main thread. Break up long operations (so the page stays responsive) and, in some cases, consider Web Workers for background tasks. Also remember that heavy JS can cause GC pauses (rarely long nowadays, but possible if memory balloons).</p><p><strong>Security features</strong>: Embrace them - e.g. use iframe sandbox or rel=noopener when appropriate. You now know the browser will isolate those contexts anyway; cooperating with it is good.</p><p><strong>DevTools is your friend</strong>: The Performance and Network panels in particular are gold mines for seeing exactly what the browser is doing. If something is slow or janky, the tools often point to the cause (a long layout, a slow paint, etc.).</p><p><strong>For those eager to dive even deeper, an excellent resource is Browser Engineering by Pavel Panchekha and Chris Harrelson (available at <a href="http://browser.engineering">browser.engineering</a>).</strong> </p><p>It's essentially a free online book that guides you through building a simple web browser, covering networking, HTML/CSS parsing, layout, and more in an accessible way. It can serve as a more in-depth companion to everything we discussed, solidifying the knowledge by example. Additionally, the Chrome team's multi-part series "<a href="https://developer.chrome.com/blog/inside-browser-part1">Inside look at modern web browser</a>" provides a readable overview with diagrams. The V8 blog (<a href="http://v8.dev">v8.dev</a>) and <a href="https://hacks.mozilla.org/">Mozilla's Hacks blog</a> are great for learning about engine advances (e.g.
new JIT compiler tiers or WebRender internals).</p><p>In conclusion, modern browsers are marvels of software engineering. They successfully abstract away all this complexity so that as developers we mostly just write HTML/CSS/JS and trust the browser to handle it. Yet, by peering under the hood, we gain insights that help us write more performant, robust applications. We appreciate why certain techniques improve user experience (e.g. avoiding blocking the main thread, or reducing unnecessary DOM complexity) because we see how the browser has to work under the covers. The next time you debug a webpage or wonder why Chrome or Firefox behaves a certain way, you'll have a mental model of the browser's internals to guide you. </p><p>Happy building, and remember that the web platform's depth rewards those who explore it - there's always more to learn, and tools to help you learn it.</p><h3><strong>Further Reading</strong></h3><ul><li><p><strong><a href="https://browser.engineering/">Web Browser Engineering</a></strong> - How browsers work deep-dive book</p></li><li><p><strong><a href="https://www.youtube.com/playlist?list=PL9ioqAuyl6ULp1f36EEjIN1vSBEfsb-0a">Chromium University</a> - </strong>Free series of deep-dive videos into how Chromium works, including the excellent <a href="https://www.youtube.com/watch?v=K2QHdgAKP-s&amp;list=PL9ioqAuyl6ULp1f36EEjIN1vSBEfsb-0a&amp;index=3&amp;pp=iAQB">Life of a Pixel talk</a></p></li><li><p><strong><a href="https://developer.chrome.com/blog/inside-browser-part1">Inside the Browser (Chrome Developers Blog series)</a></strong> - parts 1-4 cover architecture, navigation flow, rendering pipeline, and input/controller threads.</p></li><li><p><strong><a href="https://addyosmani.com/blog/chrome-17th/">Google Chrome at 17 - A history of our browser</a></strong></p></li></ul><p><em>Illustrations in this piece were commissioned from Susie Lu. 
</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YPnw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a42e9cf-a2e1-4d77-b7bb-f9501c423f29_5246x3496.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YPnw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a42e9cf-a2e1-4d77-b7bb-f9501c423f29_5246x3496.png 424w, https://substackcdn.com/image/fetch/$s_!YPnw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a42e9cf-a2e1-4d77-b7bb-f9501c423f29_5246x3496.png 848w, https://substackcdn.com/image/fetch/$s_!YPnw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a42e9cf-a2e1-4d77-b7bb-f9501c423f29_5246x3496.png 1272w, https://substackcdn.com/image/fetch/$s_!YPnw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a42e9cf-a2e1-4d77-b7bb-f9501c423f29_5246x3496.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YPnw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a42e9cf-a2e1-4d77-b7bb-f9501c423f29_5246x3496.png" width="1456" height="970" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a42e9cf-a2e1-4d77-b7bb-f9501c423f29_5246x3496.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:970,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:741093,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/173324218?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a42e9cf-a2e1-4d77-b7bb-f9501c423f29_5246x3496.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YPnw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a42e9cf-a2e1-4d77-b7bb-f9501c423f29_5246x3496.png 424w, https://substackcdn.com/image/fetch/$s_!YPnw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a42e9cf-a2e1-4d77-b7bb-f9501c423f29_5246x3496.png 848w, https://substackcdn.com/image/fetch/$s_!YPnw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a42e9cf-a2e1-4d77-b7bb-f9501c423f29_5246x3496.png 1272w, https://substackcdn.com/image/fetch/$s_!YPnw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a42e9cf-a2e1-4d77-b7bb-f9501c423f29_5246x3496.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p>]]></content:encoded></item><item><title><![CDATA[Vibe coding is not the same as AI-Assisted engineering.]]></title><description><![CDATA[Can you really 'vibe' your way to production-ready software?]]></description><link>https://addyo.substack.com/p/vibe-coding-is-not-the-same-as-ai</link><guid isPermaLink="false">https://addyo.substack.com/p/vibe-coding-is-not-the-same-as-ai</guid><dc:creator><![CDATA[Addy Osmani]]></dc:creator><pubDate>Sat, 30 Aug 2025 14:57:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!wQ9T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86161744-11ac-4187-a945-0dc5cc7a6920_5246x3496.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Vibe coding is not the same as AI-Assisted engineering.</strong> A 
recent <a href="https://www.reddit.com/r/vibecoding/comments/1myakhd/how_we_vibe_code_at_a_faang/#:~:text=Software%20development.,Go%20to%20comments%201%20Share">Reddit post</a> described how a FAANG team uses AI and it sparked an important conversation about semantics: "vibe coding" and professional "AI-assisted engineering". </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iddJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6122dc48-ab21-46f4-a005-f709ecb3550d_1688x1182.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iddJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6122dc48-ab21-46f4-a005-f709ecb3550d_1688x1182.jpeg 424w, https://substackcdn.com/image/fetch/$s_!iddJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6122dc48-ab21-46f4-a005-f709ecb3550d_1688x1182.jpeg 848w, https://substackcdn.com/image/fetch/$s_!iddJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6122dc48-ab21-46f4-a005-f709ecb3550d_1688x1182.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!iddJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6122dc48-ab21-46f4-a005-f709ecb3550d_1688x1182.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iddJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6122dc48-ab21-46f4-a005-f709ecb3550d_1688x1182.jpeg" width="1456" height="1020" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6122dc48-ab21-46f4-a005-f709ecb3550d_1688x1182.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1020,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;how we vibe code at faang&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="how we vibe code at faang" title="how we vibe code at faang" srcset="https://substackcdn.com/image/fetch/$s_!iddJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6122dc48-ab21-46f4-a005-f709ecb3550d_1688x1182.jpeg 424w, https://substackcdn.com/image/fetch/$s_!iddJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6122dc48-ab21-46f4-a005-f709ecb3550d_1688x1182.jpeg 848w, https://substackcdn.com/image/fetch/$s_!iddJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6122dc48-ab21-46f4-a005-f709ecb3550d_1688x1182.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!iddJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6122dc48-ab21-46f4-a005-f709ecb3550d_1688x1182.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>While the post was framed as an example of the former, the process it detailed - complete with technical design documents, stringent code reviews, and test-driven development - is a clear example of the latter imo. This distinction is critical because conflating the two risks both devaluing the discipline of engineering and giving newcomers a dangerously incomplete picture of what it takes to build robust, production-ready software. </p><p><strong>Vibe coding is great for momentum, but without structure it collapses under production demands.</strong></p><p>As a reminder: "vibe coding" is about fully giving in to the creative flow with an AI (high-level prompting), essentially forgetting the code exists. 
It involves accepting AI suggestions without deep review and focusing on rapid, iterative experimentation, making it ideal for prototypes, MVPs, learning, and what Karpathy calls "throwaway weekend projects." This approach is a powerful way for developers to build intuition and for beginners to flatten the steep learning curve of programming. It prioritizes speed and exploration over the correctness and maintainability required for professional applications. </p><p>There is a spectrum here: from pure vibe coding, to vibe coding with a bit more planning (spec-driven development, supplying enough context, and so on), to full AI-assisted engineering across the software development lifecycle. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cYxB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb76118-dc78-43e6-8bdb-8b9c6703753f_1536x1024.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cYxB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb76118-dc78-43e6-8bdb-8b9c6703753f_1536x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!cYxB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb76118-dc78-43e6-8bdb-8b9c6703753f_1536x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!cYxB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb76118-dc78-43e6-8bdb-8b9c6703753f_1536x1024.webp 1272w,
https://substackcdn.com/image/fetch/$s_!cYxB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb76118-dc78-43e6-8bdb-8b9c6703753f_1536x1024.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cYxB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb76118-dc78-43e6-8bdb-8b9c6703753f_1536x1024.webp" width="514" height="342.78434065934067" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4bb76118-dc78-43e6-8bdb-8b9c6703753f_1536x1024.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:514,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;AI-Assisted Development Spectrum&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="AI-Assisted Development Spectrum" title="AI-Assisted Development Spectrum" srcset="https://substackcdn.com/image/fetch/$s_!cYxB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb76118-dc78-43e6-8bdb-8b9c6703753f_1536x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!cYxB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb76118-dc78-43e6-8bdb-8b9c6703753f_1536x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!cYxB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb76118-dc78-43e6-8bdb-8b9c6703753f_1536x1024.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!cYxB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bb76118-dc78-43e6-8bdb-8b9c6703753f_1536x1024.webp 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In contrast, the process described in the Reddit post is a methodical integration of AI into a mature software development lifecycle. This is "AI-assisted engineering," where AI acts as a powerful collaborator, not a replacement for engineering principles. 
In this model, developers use AI as a "force multiplier" to handle tasks like generating boilerplate code or writing initial test cases, but always within a structured framework. <strong>Crucially, the big difference is that the human engineer remains firmly in control: responsible for the architecture, reviewing and understanding every line of AI-generated code, and ensuring the final product is secure, scalable, and maintainable.</strong> The 30% increase in development speed mentioned in the post is a result of augmenting a solid process, not abandoning it. </p><p>For engineers, labeling disciplined, AI-augmented workflows as "vibe coding" misrepresents the skill and rigor involved. For those new to the field, it creates the false and risky impression that one can simply prompt one's way to a viable product without understanding the underlying code or engineering fundamentals. If you're looking to do this right, start with a solid design, subject everything to rigorous human review, and treat AI as an incredibly powerful tool in your engineering toolkit - not as a magic wand that replaces the craft itself.</p><h2>Vibe Coders, Rodeo Cowboys, and Prisoners</h2><p>The community is split: optimists see a revolution, skeptics see old cowboy coding in new clothes. Optimists call vibe coding the next abstraction layer - like moving from assembly to Python - and believe outsiders will push it forward until it works. </p><p>Realists use it for spikes but enforce discipline afterward. <em>Use AI like a junior dev: helpful, but never unsupervised. </em>Skeptics dismiss it as marketing spin - <em>If I can tell it's vibe-coded, it's bad.
Good software is good software.</em> The consensus middle-ground is pragmatic: vibe coding is a sandbox for creativity, but scaling demands engineering.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CX6m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa52f32d-d2fe-4c08-86f2-f99da6e28d88_1536x1536.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CX6m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa52f32d-d2fe-4c08-86f2-f99da6e28d88_1536x1536.jpeg 424w, https://substackcdn.com/image/fetch/$s_!CX6m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa52f32d-d2fe-4c08-86f2-f99da6e28d88_1536x1536.jpeg 848w, https://substackcdn.com/image/fetch/$s_!CX6m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa52f32d-d2fe-4c08-86f2-f99da6e28d88_1536x1536.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!CX6m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa52f32d-d2fe-4c08-86f2-f99da6e28d88_1536x1536.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CX6m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa52f32d-d2fe-4c08-86f2-f99da6e28d88_1536x1536.jpeg" width="1456" height="1456" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa52f32d-d2fe-4c08-86f2-f99da6e28d88_1536x1536.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;diagram, venn diagram&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="diagram, venn diagram" title="diagram, venn diagram" srcset="https://substackcdn.com/image/fetch/$s_!CX6m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa52f32d-d2fe-4c08-86f2-f99da6e28d88_1536x1536.jpeg 424w, https://substackcdn.com/image/fetch/$s_!CX6m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa52f32d-d2fe-4c08-86f2-f99da6e28d88_1536x1536.jpeg 848w, https://substackcdn.com/image/fetch/$s_!CX6m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa52f32d-d2fe-4c08-86f2-f99da6e28d88_1536x1536.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!CX6m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa52f32d-d2fe-4c08-86f2-f99da6e28d88_1536x1536.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>A fun Venn diagram by <a href="https://forrestbrazeal.com/">Forrest Brazeal</a> contrasts three engineering personas in the age of AI assistance &#8211; "Vibe Coders," "Rodeo Cowboys," and "Prisoners" &#8211; through the metaphor of</em> <em>rope</em> <em>(freedom vs. constraint). Each archetype highlights extreme approaches that are</em> <em>unlikely to deliver production&#8209;grade software.</em></p><h2>Rope, Risk, and Developer Personas in AI Coding</h2><p>In the context of software development, "rope" represents the level of freedom and risk a developer is allowed &#8211; or allows themselves &#8211; when building.</p><p><strong>"Rodeo Cowboys" thrive with </strong><em><strong>too much</strong></em><strong> rope</strong>, embracing a wild-west style of coding with high risk tolerance and minimal oversight. 
They'll happily lasso together new features or fixes on the fly, riding on pure adrenaline. <strong>"Prisoners" operate with </strong><em><strong>too little</strong></em><strong> rope</strong>, bound by rigid constraints, heavy governance, or fear of mistakes &#8211; they move slowly and cautiously, if at all. And in between lies the modern <em>AI-powered</em> coder: <strong>"Vibe Coders" are given </strong><em><strong>just enough</strong></em><strong> rope to hang themselves</strong>.</p><p>They rapidly generate code by prompting AI with natural-language "vibes" and trust the output, often without fully understanding or testing it.</p><p>Each persona has a distinct relationship with AI assistance and risk:</p><ul><li><p><strong>Vibe Coders</strong> &#8211; These developers collaborate with large language models (LLMs) in a <em>free-flowing, conversational</em> manner, describing what they want and letting the AI fill in the implementation. It's an exhilarating level of freedom &#8211; <em>"just tell the AI to add a login page or fix this bug"</em>. The <strong>upside</strong> is speed and creativity; vibe coders act more as <em>orchestrators</em> than manual coders, focusing on ideas over syntax. But the <strong>downside</strong> is a lack of control. They often run with no safety harness: minimal code review, sparse tests, and blind trust in AI outputs. In other words, a <em>lot of rope with little guidance</em>, which can result in codebases that are brittle and opaque. 
One engineer noted that vibe coding without review is like <em><a href="https://www.reddit.com/r/programming/comments/1l4x5tu/the_illusion_of_vibe_coding_there_are_no/">"an electrician just threw a bunch of cables through your walls and hoped it all worked out, instead of running them with intention"</a></em> &#8211; things might function initially, but hidden flaws lurk behind the walls.</p></li><li><p><strong>Rodeo Cowboys</strong> &#8211; The classic "cowboy coder" isn't new to software engineering, and in the AI era this persona still exists (sometimes <em>augmented</em> by AI, sometimes not). Rodeo cowboys are those developers who push code to production with daring speed and little process &#8211; they'll prototype in production, hotfix live systems at 2 AM, and generally <strong>embrace risk in exchange for velocity</strong>. They have <strong>high risk tolerance</strong> (shared with vibe coders) but may rely more on their own intuition and experience than on AI. If vibe coders are guided by AI "voices," rodeo cowboys follow their gut. They do have <em>some</em> rope constraints (even rodeos occur in fenced arenas), but often just enough rope to nearly hang themselves. The overlap of these styles is obvious: a vibe coder can <em>become</em> a rodeo cowboy when they start shipping AI-generated code directly to production in a blaze of glory. The results can be spectacular&#8230; or disastrous.</p></li><li><p><strong>Prisoners</strong> &#8211; On the opposite extreme, we have engineers who are so constrained by process, bureaucracy, or self-imposed caution that they can barely move. These "prisoners" might work in heavily regulated industries or legacy systems where every line of code is a battle. They have <strong>almost no rope</strong> &#8211; tight guardrails, mandatory approvals, perhaps skepticism or outright bans on AI assistance. While this mindset ensures safety and predictability, it also stifles innovation. 
Prisoner-type developers may watch the AI revolution from the sidelines, unable to partake due to organizational rules or fear of the unknown. They won't hang themselves with rope because they're never given any slack, but they also might not deliver anything new and exciting. Interestingly, prisoners and vibe coders share one trait: <strong>being "ordered around by disembodied voices."</strong> In the prisoner's case, the voices are process checklists, ticketing systems, or bureaucratic policies dictating every move &#8211; whereas for vibe coders it's the AI's suggestions. Neither is truly in control.</p></li></ul><p>In reality, engineers aren't binary labels &#8211; a single person might embody elements of all three personas depending on the project and pressures. The Venn diagram's punchline is that all three extremes fail in the long run: too much freedom <em>or</em> too many constraints both hinder sustainable engineering. </p><p><strong>The key is finding balance &#8211; giving developers enough rope to innovate, but not so much that they (or the codebase) end up strangled by bugs and technical debt.</strong> The rest of this report explores why unchecked "vibe coding" has drawn sharp criticism from industry leaders and online communities, and how teams can harness AI-assisted development more responsibly.</p><p><strong>Bottom line from the community:</strong> vibe coding accelerates exploration but introduces hidden liabilities that blow up in production. Recent discussions highlight recurring problems: </p><ul><li><p><a href="https://www.vktr.com/ai-technology/vibe-coding-explained-use-cases-risks-and-developer-guidance/">Security flaws</a>: API keys left in code, no input sanitization, or naive auth logic</p></li><li><p><a href="https://www.reddit.com/r/programming/comments/1l4x5tu/the_illusion_of_vibe_coding_there_are_no/">Fragile debugging</a>: non-engineers hit walls when even minor changes cause cascading failures. 
</p></li></ul><h2>Will you vibe code your way to production?</h2><p>Across the software industry, seasoned engineering leaders are issuing a clear warning: <strong>AI-assisted "vibe coding" may rapidly create more problems than it solves in a production codebase.</strong> <a href="https://www.vktr.com/ai-technology/vibe-coding-explained-use-cases-risks-and-developer-guidance/">Canva's CTO Brendan Humphreys</a> captured this sentiment bluntly: </p><blockquote><p><strong>"No, you won't be vibe coding your way to production &#8211; not if you prioritize quality, safety, security and long-term maintainability at scale."</strong> </p></blockquote><p>These tools require <em>careful supervision by skilled engineers</em>, especially for mission-critical code. In other words, AI can accelerate development, but it cannot replace the hard disciplines of software engineering. When those disciplines are skipped, the result is often a fragile system and a mountain of hidden issues, as <a href="https://www.reddit.com/r/vibecoding/comments/1mge229/my_vibe_coding_journey/">visualized</a> by /r/vibecoding:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GNI0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fded23f71-746e-44b2-ad4f-901483ab6f9e_640x601.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GNI0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fded23f71-746e-44b2-ad4f-901483ab6f9e_640x601.jpeg 424w, https://substackcdn.com/image/fetch/$s_!GNI0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fded23f71-746e-44b2-ad4f-901483ab6f9e_640x601.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!GNI0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fded23f71-746e-44b2-ad4f-901483ab6f9e_640x601.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!GNI0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fded23f71-746e-44b2-ad4f-901483ab6f9e_640x601.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GNI0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fded23f71-746e-44b2-ad4f-901483ab6f9e_640x601.jpeg" width="384" height="360.6" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ded23f71-746e-44b2-ad4f-901483ab6f9e_640x601.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:601,&quot;width&quot;:640,&quot;resizeWidth&quot;:384,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;r/vibecoding - Vibe- Coding Vibe- Debugging&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="r/vibecoding - Vibe- Coding Vibe- Debugging" title="r/vibecoding - Vibe- Coding Vibe- Debugging" srcset="https://substackcdn.com/image/fetch/$s_!GNI0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fded23f71-746e-44b2-ad4f-901483ab6f9e_640x601.jpeg 424w, https://substackcdn.com/image/fetch/$s_!GNI0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fded23f71-746e-44b2-ad4f-901483ab6f9e_640x601.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!GNI0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fded23f71-746e-44b2-ad4f-901483ab6f9e_640x601.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!GNI0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fded23f71-746e-44b2-ad4f-901483ab6f9e_640x601.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Recent findings back up these warnings. 
In an August 2025 survey by <a href="https://www.finalroundai.com/blog/what-ctos-think-about-vibe-coding">Final Round AI</a>, <strong>18 CTOs were asked about vibe coding and 16 reported experiencing production disasters directly caused by AI-generated code</strong>. These are tech leaders with no incentive to hype a trend &#8211; their perspective comes from hard lessons in the field. As one summary <a href="https://www.finalroundai.com/blog/what-ctos-think-about-vibe-coding">put it</a>:</p><blockquote><p>"AI promised to make us all 10x developers, but instead it's making juniors into prompt engineers and seniors into code janitors cleaning up AI's mess." </p></blockquote><p>The ability to ship features faster means little when those features are riddled with flaws that wake someone up at 2 AM.</p><p><strong>What kinds of failures are we talking about?</strong> The CTOs gave examples spanning performance meltdowns, security breaches, and maintainability nightmares:</p><ul><li><p>A <strong><a href="https://www.finalroundai.com/blog/what-ctos-think-about-vibe-coding">performance disaster</a></strong> was recounted by a CTO who watched an AI-generated database query <em>work perfectly in testing</em> but then bring their system to its knees in production. The query was syntactically correct (no obvious errors), so the developer assumed it was fine. But it was <em>woefully inefficient</em> at scale, something an experienced engineer or a proper code review could have caught. <em>"It worked for a small dataset, but as soon as real-world traffic hit, the system slowed to a crawl,"</em> the CTO said. The team wasted a week debugging why the app was hanging &#8211; a week they would not have lost had the code been thoughtfully designed. This highlights a key danger: <strong>AI doesn't understand your system's architecture or non-functional requirements</strong> unless you explicitly guide it. 
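</p><p>To make that concrete, here is a hedged sketch of one common shape of this failure, the classic "N+1" query pattern: correct output on a small test dataset, but one database round-trip per row once real traffic arrives. The schema, table, and function names below are invented for illustration, not taken from the survey.</p>

```python
import sqlite3

# Hypothetical illustration of the "works in testing, crawls in production"
# query pattern described above. Both functions return the same answer; the
# first issues one query per user (N+1), the second a single batched query.

def total_spend_n_plus_1(db, user_ids):
    total = 0
    for uid in user_ids:  # one round-trip per user: fine for 10, fatal for 10 million
        (s,) = db.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE user_id = ?",
            (uid,),
        ).fetchone()
        total += s
    return total

def total_spend_batched(db, user_ids):
    placeholders = ",".join("?" * len(user_ids))  # one round-trip total
    (s,) = db.execute(
        f"SELECT COALESCE(SUM(amount), 0) FROM orders WHERE user_id IN ({placeholders})",
        list(user_ids),
    ).fetchone()
    return s

# Tiny in-memory fixture: 3 users, 2 orders of 10 each.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (user_id INTEGER, amount INTEGER)")
db.executemany("INSERT INTO orders VALUES (?, ?)",
               [(u, 10) for u in range(3) for _ in range(2)])
assert total_spend_n_plus_1(db, [0, 1, 2]) == total_spend_batched(db, [0, 1, 2]) == 60
```

<p>Both versions pass a unit test on three users; only at production row counts does the difference surface, which is why review has to look at query shape, not just correctness.</p><p>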
It can produce code that <em>looks</em> good and passes basic tests, yet falls apart under real workloads. As another leader put it, vibe coding creates an illusion of success until <em>"the system begins to wobble under workloads"</em> &#8211; then it <strong>catastrophically fails without warning</strong>.</p></li><li><p>A <strong>security lapse</strong> was described by an architect who caught a devastating bug in an AI-written authentication module. A junior developer had "vibed" their way through building a user permissions system by copy-pasting AI suggestions and Stack Overflow snippets. It passed the initial tests and even QA. But two weeks after launch, they discovered a critical flaw: <strong>users with deactivated accounts still had access to certain admin tools</strong>. The AI had inverted a truthy check (e.g. using a negation incorrectly), a subtle bug that slipped through. Because no one deeply understood that autogenerated code, the issue went unnoticed. <em>"It seemed to work at the time,"</em> the developer had said. This is a classic AI-generated mistake &#8211; inverted logic that a human might catch if they wrote it themselves, but the AI's code was treated as a black box. The result was a serious security breach. A senior engineer spent two days untangling that one-line bug in a sea of AI code. 
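</p><p>This bug class is worth seeing in miniature. Below is a hypothetical reconstruction of an inverted truthy check in authorization code (the field names are invented, not from the incident), together with the one regression test that would have caught it:</p>

```python
# Hypothetical sketch of the inverted-check bug class described above.
# "deactivated" and "is_admin" are invented field names for illustration.

def can_access_admin(user: dict) -> bool:
    # The flavor the AI emitted: the negation on "deactivated" is missing,
    # so exactly the wrong set of accounts passes the check.
    return user["deactivated"] and user["is_admin"]

def can_access_admin_fixed(user: dict) -> bool:
    return (not user["deactivated"]) and user["is_admin"]

# The negative-case regression test that would have caught the flaw pre-launch:
deactivated_admin = {"deactivated": True, "is_admin": True}
assert can_access_admin(deactivated_admin) is True         # the bug: access granted
assert can_access_admin_fixed(deactivated_admin) is False  # correct: access denied
```

<p>The point is not this specific check but the review posture: a one-line boolean in generated auth code deserves an explicit negative-case test, because it reads as plausibly correct either way.</p><p>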
The architect dubbed this <strong><a href="https://www.finalroundai.com/blog/what-ctos-think-about-vibe-coding">"trust debt"</a></strong> &#8211; <em>"it puts pressure on your senior engineers to be permanent code detectives, reverse-engineering vibe-driven logic just to ship a stable update."</em> <strong>In other words, every time you trust AI output without verification, you incur a debt that must be paid by someone combing through that code later to actually understand and fix it.</strong></p></li><li><p>A <strong>maintainability and complexity nightmare</strong> came from a story of an AI-generated feature that technically worked fine&#8230; until requirements changed. One team allowed a developer to vibe-code an entire user authentication flow with AI, stitching together random npm packages and Firebase rules in minutes. Initially, <em>"on the surface, things shipped &#8211; clients were happy, everyone's high-fiving,"</em> said one engineering manager. But when the team later needed to extend the auth system for new roles and region-specific privacy rules, <em>"it collapsed. No one could trace what was connected to what. Middleware was scattered across six files. There was no mental model, just vibes."</em> In the end, they had to <strong>rewrite the whole thing from scratch</strong>, because debugging the AI's spaghetti code was <em><a href="https://www.finalroundai.com/blog/what-ctos-think-about-vibe-coding">"like archaeology."</a></em> This highlights how <strong>lack of structure and consistency in AI outputs can lead to unmaintainable code.</strong> </p></li><li><p>A <strong>false sense of security</strong> is another insidious danger. AI-generated code often <em>appears perfectly neat and even idiomatic</em>. It might pass unit tests you wrote. So developers let their guard down. 
One CTO observed that vibe coding's most dangerous characteristic is code that <a href="https://www.finalroundai.com/blog/what-ctos-think-about-vibe-coding">"appears to work perfectly until it </a><strong><a href="https://www.finalroundai.com/blog/what-ctos-think-about-vibe-coding">catastrophically fails</a></strong><a href="https://www.finalroundai.com/blog/what-ctos-think-about-vibe-coding">."</a> It lulls teams into production with a smile, then bites hard. Even code review can fail here: reviewing a 1000-line AI-written PR is not much easier than writing it from scratch, especially if the reviewer assumes the code is mostly correct. And if an AI is used to assist code review as well (yes, that's a thing), then we get the blind leading the blind &#8211; <a href="https://shiftmag.dev/the-illusion-of-vibe-coding-5297/">"trusting machines to verify machines"</a>, as one commentary put it. Maintainability suffers because no one truly owns the code's logic. </p></li></ul><p><strong>The consensus from leaders is that vibe coding puts critical software qualities at risk: security, clarity, maintainability, and team knowledge.</strong> </p><p>In summary, industry leaders aren't luddites resisting a new technology &#8211; they're the ones responsible for keeping systems running reliably. Their message is a <strong>critical but constructive</strong> one: <em>use AI to assist, not to abdicate.</em> Code still needs human judgment, especially if it's destined for production. 
As one veteran put it, <strong><a href="https://www.finalroundai.com/blog/what-ctos-think-about-vibe-coding">"AI tools are copilots, not autopilots."</a></strong> They can help fly the plane, but a human pilot must chart the course and be ready to grab the controls when turbulence hits.</p><h2>"This isn't engineering, it's hoping."</h2><p>It's not just CTOs and thought leaders sounding the alarm &#8211; the rank-and-file developer community (the ones in the trenches with AI tools daily) have been vigorously debating vibe coding throughout 2025. On <a href="https://www.reddit.com/r/programming/comments/1l4x5tu/the_illusion_of_vibe_coding_there_are_no/">Reddit</a> and <a href="https://news.ycombinator.com/item?id=44959069">Hacker News</a>, threads about vibe coding have garnered hundreds of upvotes, with seasoned developers sharing war stories and sharp critiques, as well as a few counterexamples and success stories. The overall mood: <strong>high skepticism of using un-reviewed AI code in serious projects, mixed with some optimism for limited use cases.</strong></p><p>On the critical side, developers recount how vibe coding has negatively impacted their workflows and team dynamics. A top comment on one Reddit thread <a href="https://www.reddit.com/r/programming/comments/1l4x5tu/comment/mwd6n9j/?utm_source=share&amp;utm_medium=web3x&amp;utm_name=web3xcss&amp;utm_term=1&amp;utm_content=share_button">lamented</a>: </p><blockquote><p>"I just wish people would stop pinging me on PRs they obviously haven't even read themselves, expecting me to review 1000 lines of completely new vibe-coded feature that isn't even passing CI." </p></blockquote><p>The frustration here is palpable &#8211; <em>code review</em> becomes a farce if the author themselves doesn't understand the AI-generated code or bother to run tests. It shifts the burden onto unwitting teammates. 
Another commenter <a href="https://www.reddit.com/r/programming/comments/1l4x5tu/comment/mwdhsz8/?utm_source=share&amp;utm_medium=web3x&amp;utm_name=web3xcss&amp;utm_term=1&amp;utm_content=share_button">responded</a> that this behavior <em>"feels so far below the minimum bar of professionalism"</em>, likening it to a tradesperson doing a shoddy job that others have to fix. <strong>Peer review and team trust break down</strong> when one developer dumps AI output on others without due diligence.</p><p>The cultural backlash is evident &#8211; nobody wants to work with a so-called engineer whose contribution is copy-pasting from ChatGPT and shrugging when things break.</p><p>The phrase <em><a href="https://shiftmag.dev/the-illusion-of-vibe-coding-5297/">"this isn't engineering, it's hoping"</a></em> has been used in these discussions (a paraphrase of a line from the ShiftMag article). It captures the sentiment that vibe coding without proper review/testing is akin to hoping the software works by magic. Many developers point out that <strong>coding is supposed to be an engineering discipline</strong>, not wish fulfillment. </p><p>However, amid the criticism, there are also <strong>counterpoints and nuanced views</strong> emerging from the community. Not everyone is writing off vibe coding entirely; some are experimenting and finding <em>niche scenarios where it excels</em>. A highly upvoted <a href="https://news.ycombinator.com/item?id=44959069">Hacker News comment</a> framed it through the classic <a href="https://news.ycombinator.com/item?id=44959069">Innovator's Dilemma</a>: today's experts dismiss vibe coding as a toy, but tomorrow it could evolve and render old methods obsolete. 
While many responses disagreed with the inevitability of that outcome, the discussion opened up a more <em>optimistic</em> perspective: maybe we're just in the early clunky phase of AI coding, and improvements in AI or methodologies will address current flaws.</p><p>A practical example was <a href="https://news.ycombinator.com/item?id=44960238">given</a> by one HN user who successfully <strong>vibe-coded a bespoke internal tool</strong> in record time. </p><blockquote><p>"Last week I converted a bunch of Docker Compose configs to run on Terraform (Opentofu) &#8211; took me maybe an hour or two with Claude, while watching TV. Would've been a week easy if I did it the artisanal way of reading docs and Stack Overflow." </p></blockquote><p>This developer wasn't building a customer-facing product, but rather automating a tedious infrastructure task. For that use-case, vibe coding was a huge win: it saved time and the code was "good enough" for an internal tool that he controlled. Many others chimed in to agree that <strong>for small-scale or one-off scripts and glue code, vibe coding can be a massive productivity booster</strong>. It's the classic 80/20 trade-off &#8211; if you need something quick and you're the only stakeholder, the AI can crank out a solution in minutes instead of days.</p><p>Crucially, the developer in this story <em>knew the limits</em>: he treated the AI as an assistant to speed up his own work (even multitasking entertainment while coding), and presumably he validated the output. This aligns with a refrain seen in several discussions: <em>"Good time saver,</em> <em>if you know what you're doing."</em> The implication is that an experienced dev can use vibe coding like a power tool &#8211; accelerating the grunt work &#8211; but they still architect, guide, and double-check the result. 
In inexperienced hands, the same power tool can wreak havoc.</p><p>There's also an interesting analogy drawn in these debates: <strong><a href="https://news.ycombinator.com/item?id=44959069">"Coding with agentic LLMs is just project management."</a></strong> Instead of writing code, your job becomes breaking down tasks for the AI, verifying each chunk, and integrating the results &#8211; essentially acting as a project manager for a very junior (but very fast) developer. Some developers say this <em>feels</em> easier or lazier &#8211; hence "vibe" &#8211; while others find it still requires solid development chops, just applied differently. The ones who succeed treat it systematically: break problems into small, verifiable prompts (tasks that fit in context windows), run tests at each step, and iterate. The ones who fail just throw a big vague prompt at GPT and paste whatever comes out.</p><p><strong>Bottom line from the community:</strong> <em>Vibe coding is not a silver bullet.</em> Experienced devs mostly view it as a fun or useful technique for prototyping and automating trivial tasks, <strong>not</strong> a replacement for disciplined development on <a href="https://news.ycombinator.com/item?id=44959069">complex systems</a>. The hype that "LLMs will write all our software" is being met with healthy skepticism on the ground. At the same time, there's recognition that AI coding assistants are here to stay and <em>can</em> provide huge productivity boosts when <a href="https://www.vktr.com/ai-technology/vibe-coding-explained-use-cases-risks-and-developer-guidance/">used judiciously</a>. The focus is shifting toward <strong>how to integrate AI into workflows without losing the rigor of software engineering</strong> &#8211; which we'll explore next.</p><h2>Where <em>can</em> Vibe Coding help?</h2><p>If pure vibe coding is risky for production, is it <em>good for anything</em>? 
The answer from both industry leaders and practitioners is <em>yes</em>: <strong>vibe coding shines in certain scenarios &#8211; especially in rapid prototyping, exploratory projects, and as a creative aid &#8211; but one must know when to stop vibing and start engineering</strong>.</p><p>Several CTOs in the <a href="https://www.finalroundai.com/blog/what-ctos-think-about-vibe-coding">FinalRound survey</a> acknowledged that they don't <a href="https://www.finalroundai.com/blog/what-ctos-think-about-vibe-coding">"condemn vibe coding entirely"</a>. Instead, they <strong>compartmentalize its use</strong> to get the best of both worlds. <a href="https://www.finalroundai.com/blog/what-ctos-think-about-vibe-coding">Matt Cumming</a>, founder at LittleHelp, shared that he can <em>"create and deploy a functional micro-SaaS web app in a day with no issues, which is insane."</em> Over a weekend, he took an idea from thought to a live product by leveraging AI for the heavy lifting. This kind of speed is unprecedented &#8211; essentially compressing what might be a 2&#8211;3 week MVP build into 48 hours. <strong>For hackathons, demos, internal tools, and validating product-market fit quickly, vibe coding can be a game-changer.</strong> It allows small teams (or even solo devs) to punch far above their weight in terms of feature output.</p><p>However &#8211; and this is a big caveat &#8211; every leader who touted such successes added <em>major warnings and limits</em>. Matt Cumming himself had learned the hard way that unleashing AI without guardrails can backfire. He described how an early project of his was <em><a href="https://www.finalroundai.com/blog/what-ctos-think-about-vibe-coding">"completely destroyed by AI in a few minutes"</a></em> after a month of work, due to some AI-generated corruption or error. 
Chastened by that experience, he established a disciplined approach: </p><blockquote><p>"We start any AI-assisted coding by collaborating with the AI to write the functional spec and steps required in a project document. The other crucial thing I learned is to get the agent to do a security check on any new functionality you add." </p></blockquote><p>In other words, <strong>use the AI to help plan and review, not just to spit out code</strong>. By having the AI outline the design first, he ensures he's thought through the architecture. By having it perform security analysis on the output, he adds an automated sanity check for vulnerabilities. His team also confines vibe coding to "throwaway projects and prototypes, not production systems that need to scale". The vibe-coded app serves as a proof of concept, which they might later <em>rewrite or harden</em> for real-world use.</p><p>Another leader, <a href="https://www.finalroundai.com/blog/what-ctos-think-about-vibe-coding">Brett Farmiloe</a> (CEO at Featured), echoed this: </p><blockquote><p>"Vibe coding is great when you're starting from scratch... we built a site using v0 (an AI tool) and deployed quickly. But with our established production codebase, we only use vibe-coded components as a starting point &#8211; then technical team members take over to finish." </p></blockquote><p>Both he and Cumming treat AI-generated code as <strong>scaffolding</strong>. It's fantastic for erecting a structure rapidly to see if it holds the shape you want, but you wouldn't leave the rickety scaffolding in place for the final building. You replace or reinforce it with solid materials. <strong>In software terms: the AI prototype must be refactored, tested, and owned by human engineers before it becomes the permanent solution.</strong></p><p>Another approved use case is <strong><a href="https://www.vktr.com/ai-technology/vibe-coding-explained-use-cases-risks-and-developer-guidance/">legacy code refactoring</a></strong>. 
As noted in one analysis, some companies let AI rewrite portions of old code into newer frameworks or languages as a "head start". Since that code is going to be reviewed and tested thoroughly anyway, using AI to do the brute-force translation or rote work can save time. Similarly, AI can help squash known bug patterns: e.g. <em>"solve these 5 specific defects in our code"</em> &#8211; a targeted application rather than carte blanche generation. In these cases, the <strong>scope is limited and the outputs are verified</strong>, aligning more with <strong>AI-assisted programming</strong> than blind vibe coding.</p><p>We can summarize where vibe coding <em>adds the most value</em> in 2025:</p><ul><li><p><strong>Rapid Prototyping &amp; hackathons:</strong> Need to demo an idea by tomorrow? AI code generation can materialize a working prototype incredibly fast. It's okay if the code is messy, as long as it showcases the concept. <strong>Speed and iteration matter more than robustness</strong> at this stage. Vibe coding lets you try three different approaches in a day &#8211; something traditional coding would never allow. </p></li><li><p><strong>One-off scripts and internal tools:</strong> If the code doesn't need long-term maintenance and is only used by the author, the risk is relatively low. Writing a quick data analysis script, converting file formats, automating server configs &#8211; those kinds of tasks can often be safely vibe-coded because if it breaks, the person who wrote it (with AI) will fix it, and there aren't many external consequences. Basically, <strong>personal projects or automation</strong> are fertile ground for vibe coding, saving engineers from drudge work.</p></li><li><p><strong>Greenfield development (with caution):</strong> Starting a new project from scratch is easier to vibe code than integrating AI into a sprawling legacy system. When nothing exists yet, there are fewer constraints and no established style to adhere to. 
A team might vibe code the first version of a new microservice or frontend app to hit an aggressive deadline, then polish it later. </p></li></ul><p>In all these scenarios, though, it's assumed that <strong>experienced developers are in the loop</strong>. The people achieving success with vibe coding still apply their judgment to prompt the AI effectively and to verify the outputs. They treat it as <strong>pair programming with a tireless but error-prone junior dev</strong>. Contrast this with novices who might think the AI is a magic shortcut to skip learning &#8211; that approach often ends in frustration. Vibe coding, if overused by beginners, can short-circuit the learning process and produce grads who have impressive projects on their resume but lack fundamental skills &#8211; another long-term risk noted by educators. Ultimately, when paired with human experience, it can meaningfully reduce time on menial work, but you likely can&#8217;t <a href="https://www.reddit.com/r/vibecoding/comments/1mu6t8z/whats_the_point_of_vibe_coding_if_i_still_have_to/">skip</a> human review and refinement entirely:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!F8pz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F339d50d3-4daa-43c4-8392-97da564101db_1866x708.png" width="1456" height="552" alt="" loading="lazy"></figure></div><p><strong>The key insight is that vibe coding can be extremely valuable as an ideation and acceleration tool, but it should almost never be the </strong><em><strong>last step</strong></em><strong> for anything that lives on.</strong> Use it to get from zero to demo, or to churn out boilerplate, or explore multiple options. Then comes the crucial phase: <em>the hand-off from vibes to rigorous engineering</em>. How to do that effectively is our next focus.</p><h2>Spec-Driven: An antidote to prompt-chaos?</h2><p>One promising development in response to the pitfalls of vibe coding is the rise of <strong>spec-driven and <a href="https://medium.com/@takafumi.endo/software-3-0-blueprint-from-vibe-coding-to-verified-intelligence-swarms-23b4537f12fa">"agentic" AI coding approaches</a></strong>. These methods aim to retain the productivity benefits of AI generation while introducing more structure, planning, and verification &#8211; essentially adding <em>rails</em> to the free-form vibe coding process.</p><p><strong>Spec-driven AI development</strong> means starting with a clear specification or design, often created in collaboration with the AI, <em>before</em> any code is written. Instead of immediately prompting "Hey AI, build me a feature," an engineer might prompt the AI to "Help me outline the requirements and modules for this feature." By having an explicit spec (be it a written paragraph, a formal design doc, or a list of steps and functions), the developer ensures both they and the AI have a shared understanding of the goal. It's akin to writing pseudo-code or user stories first. This addresses one major issue of vibe coding: lack of direction. 
Writing a functional spec with the AI can keep a team on track and prevent the AI from wandering off into irrelevant complexity.</p><p>In practice, a spec-driven workflow can involve <strong>prompting the AI to generate high-level plans, interface definitions, and even test cases up front</strong>, and iterating on those until the human is satisfied that the plan makes sense. Only then do they ask the AI to implement the pieces of that plan. This is similar to how good engineers work with juniors &#8211; you wouldn't let a junior developer code an entire subsystem solo on day one; you'd first agree on a design together. With AI, we're learning to do the same: treat it like a junior that needs a blueprint. Early evidence suggests this yields better results than ad-hoc prompting. </p><p><strong><a href="https://medium.com/@takafumi.endo/software-3-0-blueprint-from-vibe-coding-to-verified-intelligence-swarms-23b4537f12fa">Agentic AI approaches</a></strong> take this further by enabling AI to not only follow a spec but also <strong>perform some self-directed actions like running code, testing, and refining</strong>. The term "agentic" here refers to AI agents that can take higher-level goals and then act in an autonomous, iterative way to achieve them (within bounds). For example, some tools allow the AI to do things like execute the code it wrote, observe the results, and fix errors &#8211; all without the user explicitly asking at each step. </p><p>Spec-driven and agentic approaches contrast with naive vibe coding in a few key ways:</p><ul><li><p><strong>Upfront intent vs. after-the-fact fixes:</strong> Instead of writing code and then trying to retrofit understanding (or just hoping it works), the spec-first approach encodes intent clearly from the start. The AI's output is judged against a known target. 
This reduces the "surprise" factor of AI code that technically does something you asked but not in the way you wanted (a common complaint when prompts are not specific enough).</p></li><li><p><strong>Small iterations vs. big bang:</strong> Agentic workflows encourage small, testable increments. Rather than asking for a thousand-line program in one go, you ask for one function, see it pass tests, then proceed. Essentially, it mimics <strong>test-driven development</strong> but with AI as the one writing the implementation from your test descriptions. If vibe coding is like typing a novel in one prompt, spec-driven is like pair-writing it chapter by chapter with continuous editorial review.</p></li><li><p><strong>AI in the loop vs. human alone:</strong> Interestingly, agentic approaches can also take <em>more</em> burden off humans in some respects by letting the AI handle tedious verification steps. For instance, if every AI-written PR must come with an explanation of <em>why</em> the changes were made, an AI agent can be tasked with generating an initial draft of that explanation, which the developer then edits for accuracy. This ensures no code is merged without context. In effect, <strong>these approaches try to weave AI into the fabric of software development best practices &#8211; not replace them</strong>. Instead of ignoring testing and review (as vibe coding often does), they automate parts of testing and review.</p></li></ul><p>In practice, teams exploring these approaches use tactics like:</p><ul><li><p><strong>Requiring design docs</strong> for any major AI-generated component (even if it's one page, written with AI assistance). 
This ensures thought was given to how the component fits the system.</p></li><li><p><strong>Using AI to generate unit tests</strong> or property-based tests for its own code, catching obvious errors immediately.</p></li><li><p><strong>Locking down dependencies and focusing on security</strong>: For example, instructing the AI to <em>only</em> use approved libraries and run a security scan. As noted, one team actually uses vibe coding in a <em><a href="https://www.vktr.com/ai-technology/vibe-coding-explained-use-cases-risks-and-developer-guidance/">controlled way to intentionally generate insecure code</a>, so they can study it and improve their security scanners</em> &#8211; turning the AI into a pen-testing tool rather than a production coder.</p></li><li><p><strong>Preferring integrated AI tools (in-IDE like Cursor or VS Code Copilot)</strong> over copy-paste from ChatGPT. Integration means the AI suggestions are applied in a context where the developer can see the entire diff and run the code immediately, reducing the chance of inadvertently introducing something you don't notice.</p></li><li><p><strong>Keeping humans in the decision loop</strong>: e.g. an AI agent might propose a code change, but it can't merge it &#8211; a human must approve. This is analogous to how continuous integration (CI) systems run tests and linters, but ultimately a dev checks the PR. The AI is a CI assistant here, not an autopilot deploying to prod on its own.</p></li></ul><p>All these measures are attempts to <strong>mitigate the "vibes" with actual engineering rigor</strong>. They acknowledge that large language models are incredibly useful &#8211; they really can understand intent and produce working code for a huge variety of tasks &#8211; but they function best when <strong>given clear direction and boundaries</strong>. 
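The small-iterations, test-first pattern described above can be sketched as a toy loop. This is purely illustrative: `ask_model` is a hypothetical stand-in for any real LLM call (stubbed here with canned attempts so the sketch runs), and the human-written spec is expressed as two executable checks.

```python
# Sketch of a spec-first, test-first AI loop. `ask_model` is a
# hypothetical stand-in for an LLM call, stubbed with canned drafts.

def spec_tests(impl):
    """The human-written spec, expressed as executable checks."""
    failures = []
    if impl("hello") != "HELLO":
        failures.append("must upper-case its input")
    if impl("") != "":
        failures.append("must handle the empty string")
    return failures

def ask_model(feedback):
    # Stubbed "LLM": the first draft is buggy; once feedback mentions
    # the empty-string case, the next draft fixes it.
    if any("empty" in f for f in feedback):
        return lambda s: s.upper()
    return lambda s: s.upper() if s else None  # buggy first attempt

def tdd_loop(max_rounds=3):
    feedback = []
    for _ in range(max_rounds):
        impl = ask_model(feedback)      # small increment from the AI
        feedback = spec_tests(impl)     # verify against the spec
        if not feedback:
            return impl                 # passes; still needs human review
    raise RuntimeError(f"spec not met: {feedback}")

impl = tdd_loop()
```

Even in this toy form, the loop only terminates when the spec's checks pass, and the returned implementation is still a candidate for human review, not an auto-merge.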
Left unguided, they'll happily drift into the weeds or produce a solution that passes superficial muster but fails in edge cases.</p><p>In essence, spec-driven and agentic methods are about <strong>marrying the best of AI and human strengths</strong>: humans excel at defining problems, understanding context, and making judgment calls; AIs excel at traversing solution spaces quickly, writing boilerplate, and even coordinating tasks when set up to do so. The future of AI-assisted engineering likely lies in this middle ground &#8211; not in pure prompt-and-pray vibe coding, but in <em>augmented workflows</em> where AI amplifies human design and humans rein in AI's excesses.</p><p>The way forward is hybrid: a sandbox phase where you vibe freely, test ideas, and build prototypes, followed by a production phase where you apply engineering discipline: testing, refactoring, design, and security. </p><h2>Conclusion</h2><p><strong>In summary, developers and teams can evolve vibe-coded prototypes into robust systems by re-injecting all the traditional software engineering practices that might have been bypassed in the rush of AI generation: design it, test it, review it, own it.</strong> </p><p>Speeding through the first draft is fine &#8211; even commendable &#8211; as long as everyone understands that <em>you then switch out of "vibe mode" and into "engineering mode."</em> High-performing teams will likely develop an intuition for when to employ the AI fast lane and when to merge back onto the steady highway of tested, reviewed code. The end goal is the same as it's always been: deliver software that works, is secure, and can be maintained by the team. </p><p>The tools and methods are evolving, but <strong>accountability, craftsmanship, and collaboration remain paramount in the age of AI-assisted engineering</strong>.</p><p><em>I&#8217;m excited to share that I&#8217;m writing a new <a href="https://beyond.addy.ie">AI-assisted engineering book</a> with O&#8217;Reilly. 
If you&#8217;ve enjoyed my writing here you may be interested in checking it out. I&#8217;ve included a number of free tips on the book site.</em></p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!wQ9T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F86161744-11ac-4187-a945-0dc5cc7a6920_5246x3496.png" width="1456" height="970" alt="" loading="lazy"></figure></div>]]></content:encoded></item><item><title><![CDATA[The reality of AI-Assisted software engineering productivity]]></title><description><![CDATA[What the data really shows about AI coding tools in 2025]]></description><link>https://addyo.substack.com/p/the-reality-of-ai-assisted-software</link><guid isPermaLink="false">https://addyo.substack.com/p/the-reality-of-ai-assisted-software</guid><dc:creator><![CDATA[Addy Osmani]]></dc:creator><pubDate>Sat, 16 Aug 2025 14:30:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Sp4c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c497c2-42b5-4e28-a5cf-a3c89db18e41_5246x5246.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>tl;dr: AI functions as a situational force 
multiplier - providing modest, uneven boosts that augment rather than transform engineering productivity. Individual developers and those working on &#8220;new&#8221; projects see speed boosts with AI tools, but these gains aren't (yet) translating to overall team productivity:</strong></p><ul><li><p>AI excels at <em>greenfield projects</em> but struggles with complex legacy codebases</p></li><li><p>84% of devs use AI tools; only 60% view them favorably, down from 70% in 2023</p></li><li><p>Studies show 20-30% productivity improvements, far from &#8220;10x&#8221; claims</p></li><li><p>Most use <em>basic autocomplete</em> features, not full autonomous coding agents</p></li><li><p>66% cite AI's "almost correct" solutions as their biggest time sink due to debugging </p></li></ul><h2>Adoption soars, trust plummets: the 2025 developer sentiment</h2><p><strong>AI coding assistants have rapidly become part of the developer toolkit &#8211; but confidence in their output has declined.</strong></p><p>According to <a href="https://survey.stackoverflow.co/2025/">Stack Overflow&#8217;s 2025 Developer Survey</a> (49,000+ devs globally), <strong>84% of respondents are using or planning to use AI tools in their development process</strong>, up from 76% a year prior. Over half of professional developers now use AI coding tools <em>daily</em>. This represents a remarkable adoption curve &#8211; AI pair programmers went from novelty to normalcy in under two years. 
</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!W11L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab9f1223-c45e-4e70-ad3e-16b122854d31_2400x1110.png" width="1456" height="673" alt="" fetchpriority="high"></figure></div><p>Developers are primarily leveraging these tools for help with coding problems and tedious tasks: the survey found the top uses of AI were searching for answers (54% of respondents), generating code or synthetic data (36%), learning new concepts (33%), and even writing documentation (30%). In short, AI is touching many parts of the dev workflow.</p><p>Paradoxically, <strong>as usage has increased, positive sentiment has fallen</strong>. Stack Overflow reports that favorable views of AI tools dropped from over 70% in 2023 to just ~60% in 2025. In practice, 46% of developers say they don&#8217;t trust the accuracy of AI output &#8211; a sharp rise in skepticism from 31% last year. The data suggests many developers have encountered the limitations and flaws of these tools firsthand. 
</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Ghtg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0141f5a-8fe5-441c-9ad0-03e11f3ad384_2400x1200.png" width="1456" height="728" alt=""></figure></div><p>The <strong>number-one frustration</strong>, cited by 66% of devs, is AI solutions that are <em>&#8220;almost right, but not quite,&#8221;</em> which often leads to time-consuming debugging. Another 45% specifically complained that <strong>debugging AI-generated code is more work</strong> than it&#8217;s worth. This sentiment came through loud and clear in the survey and echoes across developer forums: AI helpers often <em>accelerate typing</em> but can inject subtle bugs or nonsense that soak up time.</p><blockquote><p><strong>&#8220;AI solutions that are </strong><em><strong>almost</strong></em><strong> right, but not quite, are now my biggest time sink. 
The code </strong><em><strong>looks</strong></em><strong> plausible but I end up spending more time fixing those &#8216;helpful&#8217; suggestions.&#8221;</strong> &#8211; <em><a href="https://stackoverflow.blog/2025/07/29/developers-remain-willing-but-reluctant-to-use-ai-the-2025-developer-survey-results-are-here/#:~:text=%2A%20The%20number,last%20year%20learning%20new%20coding">Survey respondent, cited by Stack Overflow</a></em></p></blockquote><p><strong>Crucially, most developers are not (yet) using AI to fully automate programming or &#8220;agentically&#8221; build entire applications.</strong></p><p>The Stack Overflow survey asked about &#8220;vibe coding&#8221; &#8211; meaning letting an AI generate whole programs from high-level prompts &#8211; and found that <strong><a href="https://stackoverflow.blog/2025/07/29/developers-remain-willing-but-reluctant-to-use-ai-the-2025-developer-survey-results-are-here/#:~:text=developers%20say%20agents%20have%20affected,a%20threat%20to%20their%20job">nearly 72% said vibe coding is </a></strong><em><strong><a href="https://stackoverflow.blog/2025/07/29/developers-remain-willing-but-reluctant-to-use-ai-the-2025-developer-survey-results-are-here/#:~:text=developers%20say%20agents%20have%20affected,a%20threat%20to%20their%20job">not</a></strong></em><strong><a href="https://stackoverflow.blog/2025/07/29/developers-remain-willing-but-reluctant-to-use-ai-the-2025-developer-survey-results-are-here/#:~:text=developers%20say%20agents%20have%20affected,a%20threat%20to%20their%20job"> part of their professional work, with an additional 5% &#8220;emphatically&#8221; avoiding it</a></strong>.</p><p>In other words, roughly <strong>77% of developers do </strong><em><strong>no</strong></em><strong> whole-app generation</strong> on the job. 
Most are using AI in a more incremental, assistive capacity (like code completion, example generation, or Q&amp;A), not as autonomous project-builders.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TSCw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc01954c7-fdea-494c-82fc-72035cd87e4d_2400x1380.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TSCw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc01954c7-fdea-494c-82fc-72035cd87e4d_2400x1380.png 424w, https://substackcdn.com/image/fetch/$s_!TSCw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc01954c7-fdea-494c-82fc-72035cd87e4d_2400x1380.png 848w, https://substackcdn.com/image/fetch/$s_!TSCw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc01954c7-fdea-494c-82fc-72035cd87e4d_2400x1380.png 1272w, https://substackcdn.com/image/fetch/$s_!TSCw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc01954c7-fdea-494c-82fc-72035cd87e4d_2400x1380.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TSCw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc01954c7-fdea-494c-82fc-72035cd87e4d_2400x1380.png" width="1456" height="837" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c01954c7-fdea-494c-82fc-72035cd87e4d_2400x1380.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:837,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:131366,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/170851153?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc01954c7-fdea-494c-82fc-72035cd87e4d_2400x1380.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TSCw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc01954c7-fdea-494c-82fc-72035cd87e4d_2400x1380.png 424w, https://substackcdn.com/image/fetch/$s_!TSCw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc01954c7-fdea-494c-82fc-72035cd87e4d_2400x1380.png 848w, https://substackcdn.com/image/fetch/$s_!TSCw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc01954c7-fdea-494c-82fc-72035cd87e4d_2400x1380.png 1272w, https://substackcdn.com/image/fetch/$s_!TSCw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc01954c7-fdea-494c-82fc-72035cd87e4d_2400x1380.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p> This aligns with the finding that while <strong>52% say AI &#8220;agents&#8221; have affected how they work, the primary benefit cited is <a href="https://stackoverflow.blog/2025/07/29/developers-remain-willing-but-reluctant-to-use-ai-the-2025-developer-survey-results-are-here/#:~:text=The%20adoption%20of%20AI%20agents,not%20a%20threat%20to%20their">personal productivity boosts</a></strong> (69% saw an increase in their own throughput) &#8211; not fundamental changes to how software is delivered. 
And despite all the &#8220;AI will replace programmers&#8221; media chatter, <strong>64% of developers do not see AI as a threat to their jobs</strong> (though that&#8217;s down slightly from 68% last year, indicating a bit more unease).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!enyj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F518a6c80-0963-47ab-ab01-2131481d6918_2400x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!enyj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F518a6c80-0963-47ab-ab01-2131481d6918_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!enyj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F518a6c80-0963-47ab-ab01-2131481d6918_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!enyj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F518a6c80-0963-47ab-ab01-2131481d6918_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!enyj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F518a6c80-0963-47ab-ab01-2131481d6918_2400x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!enyj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F518a6c80-0963-47ab-ab01-2131481d6918_2400x1200.png" width="1456" height="728" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/518a6c80-0963-47ab-ab01-2131481d6918_2400x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:133014,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/170851153?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F518a6c80-0963-47ab-ab01-2131481d6918_2400x1200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!enyj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F518a6c80-0963-47ab-ab01-2131481d6918_2400x1200.png 424w, https://substackcdn.com/image/fetch/$s_!enyj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F518a6c80-0963-47ab-ab01-2131481d6918_2400x1200.png 848w, https://substackcdn.com/image/fetch/$s_!enyj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F518a6c80-0963-47ab-ab01-2131481d6918_2400x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!enyj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F518a6c80-0963-47ab-ab01-2131481d6918_2400x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In summary, right now <strong>AI-assisted coding is mainstream, but wariness is high</strong>. Developers appreciate the time-savers but have learned to &#8220;<strong>trust, but verify</strong>&#8221; every output. As Stack Overflow&#8217;s report put it, <em>more developers are using AI tools, but their trust in those tools is falling</em>. 
These cracks in the foundation set the stage: <em>why</em> aren&#8217;t these tools living up to the wild productivity promises?</p><h2>Hype vs reality: Why &#8220;10&#215; Engineers&#8221; remain unicorns</h2><p><strong>Amid the exuberance, many experienced engineers have pushed back on the notion that AI is making devs &#8220;10&#215; more productive&#8221; overnight.</strong></p><p>A notable example is Colton Voege&#8217;s essay, <em>&#8220;<a href="https://colton.dev/blog/curing-your-ai-10x-engineer-imposter-syndrome/">No, AI is Not Making Engineers 10&#215; as Productive &#8211; Curing Your AI &#8216;10&#215; Engineer&#8217; Imposter Syndrome.</a>&#8221;</em> Voege addresses the anxiety some developers feel seeing social media posts claiming that <em>&#8220;real engineers&#8221;</em> are now using LLMs to churn out 10&#8211;100&#215; more output by spinning up numerous agent instances in parallel. He admits even he momentarily wondered if he was being left behind.</p><p>But after deep experimentation with various AI coding approaches, his conclusion was that the <strong>10&#215; claims don&#8217;t withstand scrutiny</strong>:</p><blockquote><p><strong>&#8220;I wouldn&#8217;t be surprised to learn AI helps many engineers do certain tasks 20&#8211;50% faster, but the nature of software bottlenecks means this </strong><em><strong>doesn&#8217;t translate</strong></em><strong> to a 20% productivity increase &#8211; and certainly not a 10&#215; increase.&#8221;</strong></p></blockquote><p>In other words, <strong>AI can speed up coding</strong> tasks, but overall engineering <strong>outcomes</strong> (features delivered, systems deployed) are constrained by many other factors. 
<strong>Writing code is often not the slowest part of software development</strong>; tasks like designing architecture, clarifying requirements, code reviewing, testing, fixing bugs, and coordinating with teammates don&#8217;t magically compress just because you can generate a function faster.</p><p>Voege walks through a simple reality check: <em>&#8220;10&#215; productivity means what you used to ship in a quarter you now ship in a week and a half&#8221;</em>. That would require every step &#8211; product planning, code reviews, QA, deployments &#8211; to happen 10&#215; faster, which is implausible in any real-world team. As he dryly notes, <em>&#8220;You can&#8217;t compress the back-and-forth of 3 months of code review into 1.5 weeks&#8230; This simply cannot be done.&#8221;</em></p><p>The human processes around coding have not accelerated at anywhere near the rate that AI can spit out code. Pull requests still need careful review (often more so if AI wrote the code), test suites still must run, and users still have evolving needs. A senior engineer on Hacker News echoed this, saying <strong><a href="https://simonwillison.net/2025/Aug/6/not-10x/#:~:text=translate%20to%20a%2020,certainly%20not%20a%2010x%20increase">&#8220;all the other stuff involved in building software makes the 10&#215; thing unrealistic in most cases.&#8221;</a></strong></p><p>Importantly, others point out that the loudest &#8220;AI makes us 10&#215; faster&#8221; claims tend to come from <em><a href="https://colton.dev/blog/curing-your-ai-10x-engineer-imposter-syndrome/#:~:text=If%20you%20are%20running%20an,and%20your%20boss%20asks%20you">biased sources</a></em> &#8211; <strong>tech CEOs, investors, or consultants</strong> &#8211; rather than rank-and-file developers in the trenches. 
There are <a href="https://colton.dev/blog/curing-your-ai-10x-engineer-imposter-syndrome/#:~:text=If%20you%20are%20running%20an,and%20your%20boss%20asks%20you">strong </a><strong><a href="https://colton.dev/blog/curing-your-ai-10x-engineer-imposter-syndrome/#:~:text=If%20you%20are%20running%20an,and%20your%20boss%20asks%20you">incentives</a></strong> for startup founders to over-hype productivity (to attract funding) and for bosses to suggest huge gains (to pressure employees or justify AI investments). Thus a kind of echo chamber can form, detached from ground truth. Meanwhile, <strong>front-line engineers&#8217; actual experiences are more &#8220;varied and much more muted in their praise&#8221;</strong> &#8211; they see AI as a useful autocomplete and sometimes a &#8220;magic&#8221; assistant, but also one that <strong>often needs you to take the wheel back</strong> when it veers off course.</p><p>To be clear, <strong>developers are seeing meaningful boosts from AI &#8211; just not an order of magnitude.</strong> Voege concedes that &#8220;AI helps with boilerplate&#8221; and routine coding, estimating perhaps a 20&#8211;50% speed-up on certain sub-tasks for many engineers. Likewise, Simon Willison, a well-known developer and AI blogger, says he is &#8220;a huge proponent of AI-assisted development&#8221; and finds that <a href="https://simonwillison.net/2025/Aug/6/not-10x/#:~:text=I%27m%20a%20pretty%20huge%20proponent,do%20as%20a%20software%20engineer">LLMs make him 2&#8211;5&#215; more productive for the coding portions</a> of his work. </p><p>But he immediately qualifies that <strong>coding is only a fraction of his job</strong>, so the <strong><a href="https://simonwillison.net/2025/Aug/6/not-10x/#:~:text=those%2010x%20claims%20convincing,do%20as%20a%20software%20engineer">overall productivity gain is much smaller</a></strong>. 
This sentiment is common: using an AI code editor or vibe-coding tool can often help crank out a unit test file or convert some data format in seconds, which is awesome, but it might shave only an hour off a week-long project that is bottlenecked by design discussions, integration testing, and production debugging.</p><p>So, <strong>where does this leave us?</strong> The hype has been tempered by reality: <strong>AI coding tools are best viewed as </strong><em><strong>assistants that save you keystrokes and sometimes ideas, not silver bullets that remove engineering toil altogether</strong></em><strong>.</strong> </p><p>The <em>true</em> value may lie in <a href="https://colton.dev/blog/curing-your-ai-10x-engineer-imposter-syndrome/#:~:text=Let%27s%20start%20by%20looking%20at,seen%20have%2010x%20productivity%20gains">preventing wasted effort</a> (by quickly retrieving solutions or generating scaffolding) rather than in simply cranking out more questionable code faster.</p><h2>What the data says: Mixed results from studies and surveys</h2><p><strong>Beyond anecdotes and surveys, 2024 and 2025 produced some rigorous studies on AI&#8217;s impact on developer productivity</strong>. These range from controlled experiments to large-scale data analyses. However, keep in mind that results depend heavily on <strong>model quality</strong> and on <strong>when</strong> each study was conducted. Let&#8217;s break down the key findings:</p><h3>Controlled trials: modest speed-ups in enterprise settings</h3><ul><li><p><strong>Google&#8217;s Internal RCT (2024)</strong> &#8211; Google conducted a randomized controlled trial on ~100 of its own software engineers to measure AI&#8217;s impact using <em>multiple</em> in-house AI coding tools (code completion, smart paste, and a natural language-to-code assistant). 
The task was a realistic, <a href="https://linearb.io/blog/gen-AI-research-software-development-productivity-at-google#:~:text=option%20to%20have%20the%20AI,assistant%20make%20recommendations">&#8220;enterprise-grade&#8221; coding assignment</a> integrating with Google&#8217;s build and test systems (adding a new logging feature across 10 files, ~474 LOC). The result: developers <strong>using AI completed the task ~21% faster</strong> on average than those without AI. The AI group finished in ~96 minutes vs 114 minutes for the control group. So, about a one-fifth time savings. Notably, this was less dramatic than some earlier studies in simpler scenarios &#8211; a point we&#8217;ll revisit. Google&#8217;s study also found that, somewhat surprisingly, <strong>the senior developers saw </strong><em><strong>slightly larger</strong></em><strong> gains than junior devs</strong> in this experiment. They speculate that <a href="https://linearb.io/blog/gen-AI-research-software-development-productivity-at-google#:~:text=One%20surprising%20result%20from%20Google%E2%80%99s,for%20their%20lack%20of%20experience">seniors leveraged the AI more effectively</a> on complex codebase tasks, whereas juniors might have been overwhelmed or not known how to best use it. However, the sample of seniors was small, so that could be noise. The key takeaway is that even in a high-context enterprise environment, AI tools provided a <strong><a href="https://linearb.io/blog/gen-AI-research-software-development-productivity-at-google#:~:text=Key%20Findings%3A%2021,But%20Context%20Matters">measurable but moderate productivity boost (~20%)</a></strong>, not an earth-shattering one. 
The researchers also emphasized that code quality was not evaluated &#8211; so faster doesn&#8217;t necessarily mean <em>better</em> code, just that tests passed more quickly.</p></li></ul><ul><li><p><strong>Multi-Company Industry RCT (2024)</strong> &#8211; Another large study (published via SSRN) <a href="https://medium.com/@sahin.samia/can-ai-really-boost-developer-productivity-new-study-reveals-a-26-increase-1f34e70b5341#:~:text=,into%2010%20hours%20of%20output">spanned three organizations</a> &#8211; Microsoft, Accenture, and a Fortune 100 enterprise &#8211; and nearly 5,000 developers, measuring the effect of GitHub Copilot in real work settings. It found an average <strong><a href="https://medium.com/@sahin.samia/can-ai-really-boost-developer-productivity-new-study-reveals-a-26-increase-1f34e70b5341#:~:text=,into%2010%20hours%20of%20output">26% increase in productivity</a></strong> for developers with Copilot access. In practical terms, the authors frame it as <em><a href="https://medium.com/@sahin.samia/can-ai-really-boost-developer-productivity-new-study-reveals-a-26-increase-1f34e70b5341#:~:text=three%20major%20companies%20%E2%80%94%20Microsoft%2C,into%2010%20hours%20of%20output">&#8220;turning an 8-hour workday into 10 hours of output&#8221;</a></em>. This was determined by metrics like tasks completed (pull requests merged), code written, and build success rates. Importantly, they reported <strong>no drop in code quality or increase in errors</strong> &#8211; Copilot users actually had slightly <em>higher</em> successful build rates, implying the AI suggestions often prevented certain mistakes. 
However, the benefits were <strong>not evenly distributed</strong>: <em>&#8220;newer, less experienced developers reaped the most benefits,&#8221;</em> seeing as high as a <strong><a href="https://medium.com/@sahin.samia/can-ai-really-boost-developer-productivity-new-study-reveals-a-26-increase-1f34e70b5341#:~:text=The%20study%20also%20highlighted%20a,enhanced%20productivity%20across%20the%20board">35&#8211;39% speed-up</a></strong>, whereas seasoned developers saw smaller (8&#8211;16%) improvements. Essentially, Copilot acted like an &#8220;always-available mentor&#8221; for juniors &#8211; helping them write code they might otherwise struggle with &#8211; while senior devs used it more selectively for boilerplate and got modest gains. This contrast with Google&#8217;s finding about seniors could be due to different tasks or simply that in the wild, junior devs lean on AI more heavily. Regardless, the <strong>26% average boost</strong> from this study is often cited as evidence that AI can significantly accelerate coding <em>when integrated well into team workflows</em>. (Caveat: The study was careful to control for many factors over months of usage, but one can imagine enthusiastic participants might also work differently knowing they&#8217;re in a trial.)</p></li></ul><ul><li><p><strong>Upwork Freelancer experiment (2023)</strong> &#8211; A well-known earlier experiment by Peng et al. (2023) hired 95 freelance programmers on Upwork to build a web server, and found the group with access to Copilot completed the task <strong><a href="https://www.researchgate.net/publication/368473822_The_Impact_of_AI_on_Developer_Productivity_Evidence_from_GitHub_Copilot#:~:text=,">55% faster</a></strong> than the <a href="https://www.researchgate.net/publication/368473822_The_Impact_of_AI_on_Developer_Productivity_Evidence_from_GitHub_Copilot#:~:text=significant%20productivity%20effects%20of%20generative,">control group</a>. 
Another referenced study even reported <a href="https://www.researchgate.net/publication/368473822_The_Impact_of_AI_on_Developer_Productivity_Evidence_from_GitHub_Copilot#:~:text=,desarrolladores%20juniors%20muestran%20los%20mayores">&#8220;2&#215; faster&#8221; completion</a> with Copilot for certain tasks. These results, while impressive, were on relatively contained tasks and often with less experienced coders. They represent something like a best-case scenario (single focused task, no legacy code, motivated participants). In real teams, you might not see such big jumps &#8211; which is exactly what the Google and multi-company trials, with more context and longer duration, confirmed (around 20&#8211;30% gains, not 50%).</p></li></ul><p>On the whole, <strong>controlled experiments suggest AI</strong> can <strong>provide a notable productivity uplift (roughly 20&#8211;30% faster coding) in both enterprise and broad industry settings</strong> &#8211; <em>when properly used</em>. This is consistent with what I&#8217;ve observed at Google.</p><p>That&#8217;s nothing to scoff at: a quarter more output is significant at scale. But it&#8217;s a far cry from 10&#215;, and the effect depends on context. The multi-company study&#8217;s authors explicitly note that <strong>newcomers benefit more</strong> (which makes intuitive sense &#8211; AI can help fill knowledge gaps), whereas <strong>seasoned devs adopt it more slowly and use it for narrower cases</strong>. </p><p>Perhaps the <strong>most striking controlled study</strong> came from a different angle &#8211; not showing a boost, but a <em>slowdown</em>:</p><h3>A reality check: When AI <strong>slows</strong> experienced devs</h3><p>In July 2025, a group of researchers (Becker et al. 
via METR) published results of a <strong><a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/#:~:text=We%20conduct%20a%20randomized%20controlled,from%20AI%20R%26D%20automation%201">randomized controlled trial on 16 experienced open-source developers</a></strong> working on <a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/#:~:text=To%20directly%20measure%20the%20real,Developers%20complete">their own large OSS projects</a>. These were devs with years of experience on repos &gt;1M lines, solving real issues from their bug trackers. </p><p>The twist: tasks (~2 hours each) were randomly assigned to &#8220;AI-allowed&#8221; or &#8220;AI-disallowed&#8221; conditions for each developer, who would either use <a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/#:~:text=To%20directly%20measure%20the%20real,Developers%20complete">state-of-the-art AI tools</a> (they mostly used Cursor Pro with Claude 3.5/3.7) or <a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/#:~:text=allowed%2C%20developers%20can%20use%20any,their%20participation%20in%20the%20study">work completely solo on each issue</a>. The outcome was surprising: <strong>when using AI, these seasoned devs took</strong> <a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/#:~:text=We%20conduct%20a%20randomized%20controlled,from%20AI%20R%26D%20automation%201">19% longer</a> <strong>on average to complete the tasks</strong>. In other words, <a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/#:~:text=Core%20Result">AI made them </a><em><a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/#:~:text=Core%20Result">slower</a></em>. 
The authors dubbed it a snapshot of &#8220;early-2025 AI capabilities&#8221; in a realistic setting &#8211; and it wasn&#8217;t very flattering.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!q01Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0172195b-38ef-400f-8446-3ac197270e08_2562x1540.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q01Q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0172195b-38ef-400f-8446-3ac197270e08_2562x1540.png 424w, https://substackcdn.com/image/fetch/$s_!q01Q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0172195b-38ef-400f-8446-3ac197270e08_2562x1540.png 848w, https://substackcdn.com/image/fetch/$s_!q01Q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0172195b-38ef-400f-8446-3ac197270e08_2562x1540.png 1272w, https://substackcdn.com/image/fetch/$s_!q01Q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0172195b-38ef-400f-8446-3ac197270e08_2562x1540.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!q01Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0172195b-38ef-400f-8446-3ac197270e08_2562x1540.png" width="1456" height="875" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0172195b-38ef-400f-8446-3ac197270e08_2562x1540.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:875,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:276148,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/170851153?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0172195b-38ef-400f-8446-3ac197270e08_2562x1540.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!q01Q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0172195b-38ef-400f-8446-3ac197270e08_2562x1540.png 424w, https://substackcdn.com/image/fetch/$s_!q01Q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0172195b-38ef-400f-8446-3ac197270e08_2562x1540.png 848w, https://substackcdn.com/image/fetch/$s_!q01Q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0172195b-38ef-400f-8446-3ac197270e08_2562x1540.png 1272w, https://substackcdn.com/image/fetch/$s_!q01Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0172195b-38ef-400f-8446-3ac197270e08_2562x1540.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Even more intriguing was the <strong>perception gap</strong>: the developers <em>expected</em> AI would speed them up by ~24%, and even after the experiment, they <em>believed</em> they had been faster by ~20% when using AI. In reality, the screen recordings and time logs told a different story &#8211; a significant slowdown. This finding generated a lot of discussion. How could skilled devs be slower <em>with</em> an advanced code assistant?</p><p>The paper offers some explanations. They analyzed 20 factors and <a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/#:~:text=Factor%20Analysis">identified a few likely causes for the slowdown</a>.</p><ul><li><p><strong>Overhead of integrating AI suggestions:</strong> Developers spent extra time verifying, debugging, and adjusting the AI&#8217;s output. 
The AI might produce tangential or incorrect code that then needed human correction. Essentially, <em>&#8220;hallucinations&#8221; and missteps introduced extra cycles</em>.</p></li></ul><ul><li><p><strong>Familiarity and inefficiency in use:</strong> These devs were relatively new to the specific AI tools. Only one participant had &gt;50 hours experience with Cursor; notably, that <a href="https://news.ycombinator.com/item?id=44739060#:~:text=EnnEmmEss%20%20%2025%20,next%20%5B%E2%80%93">one experienced user </a><em><a href="https://news.ycombinator.com/item?id=44739060#:~:text=EnnEmmEss%20%20%2025%20,next%20%5B%E2%80%93">did</a></em><a href="https://news.ycombinator.com/item?id=44739060#:~:text=EnnEmmEss%20%20%2025%20,next%20%5B%E2%80%93"> see a positive speedup</a>, suggesting a <a href="https://news.ycombinator.com/item?id=44739060#:~:text=mentioned%20in%20the%20research%20paper,is">learning curve effect</a>. Others may have used the AI sub-optimally or gotten stuck following it down wrong paths.</p></li></ul><ul><li><p><strong>Task complexity and context:</strong> The issues required deep understanding of a large codebase. AI can struggle with such context unless carefully guided. The devs possibly had to rewrite or heavily edit AI&#8217;s code to fit project conventions, eating up time.</p></li></ul><ul><li><p><strong>Cognitive interruptions:</strong> Switching between one&#8217;s own thought process and the AI&#8217;s suggestions can incur &#8220;context switching&#8221; overhead. If the AI outputs something that&#8217;s partially useful but needs fixes, the dev must reconcile it, which can be slower than writing a correct solution directly (especially for experts who <em>know</em> the codebase).</p></li></ul><ul><li><p><strong>False sense of security:</strong> Developers might accept AI output too readily and then debug for longer when it&#8217;s wrong, rather than writing a simpler correct solution. 
The study noted that even after experiencing the slowdown, devs still <em><a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/#:~:text=Core%20Result">felt</a></em><a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/#:~:text=Core%20Result"> like AI helped</a> &#8211; a kind of cognitive bias because the AI made things <em>feel</em> easier even if it wasn&#8217;t actually faster.</p></li></ul><p>The METR authors are careful to say this doesn&#8217;t prove &#8220;AI never speeds up devs&#8221; &#8211; just that <strong><a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/#:~:text=Given%20both%20the%20importance%20of,evidence%20for%20in%20Table%202">in this particular realistic scenario, current tools didn&#8217;t help</a></strong>. They acknowledge AI is evolving fast and that <a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/#:~:text=We%20do%20not%20provide%20evidence,the%20past%20five%20years%203">better prompting or more experienced users might achieve positive results</a>. But the study is a valuable counterpoint to optimistic lab results. It underscores that <strong>the effectiveness of AI assistance varies wildly with context</strong>. Give it a newbie doing a well-defined task and it shines; give it a veteran in a messy codebase and it might slow them down.</p><p>On Hacker News, this report sparked debate. Some engineers remarked that it <em><a href="https://news.ycombinator.com/item?id=44739060#:~:text=,has%20some%20connection%20to%20reality">matched their intuition</a></em>: <em>&#8220;LLM-based coding tools seem to actually hurt programmers&#8217; productivity [in complex scenarios]. &#8216;Hallucinations&#8217; aren&#8217;t going away&#8230;they just sometimes happen to generate something usable&#8221;</em>. 
Others pushed back, sharing their personal wins with Copilot or agents for certain languages or simpler tasks.</p><blockquote><p>One commenter <a href="https://news.ycombinator.com/item?id=44739060#:~:text=,Pro%20user%20since%202021%2C%20still">wrote</a>: <em>&#8220;Whether productivity is tanking or not, I will find it incredibly hard to stop using LLMs&#8230; I must note though, it might be too soon to put a mark on productivity &#8211; it&#8217;s a function of how well new technologies are integrated into processes, which happens over years, not months.&#8221;</em></p></blockquote><p>Another pointed out that <strong>familiarity matters</strong>: <em><a href="https://news.ycombinator.com/item?id=44739060#:~:text=EnnEmmEss%20%20%2025%20,next%20%5B%E2%80%93">&#8220;Our one dev with &gt;50h of Cursor experience saw a speedup &#8211; so maybe there&#8217;s a high skill ceiling to using these tools effectively&#8221;</a></em>. In essence, early adopters believe things will improve as we learn to co-work with AI, but at least in early 2025, <strong>the &#8220;AI Productivity Boom&#8221; hasn&#8217;t universally materialized</strong>.</p><h3>The AI productivity paradox: more code &#8800; more productivity</h3><p>Perhaps the most comprehensive look at AI&#8217;s impact on engineering came from the <strong>2025 DORA/Faros &#8220;AI Productivity Paradox&#8221; report</strong>. This research analyzed telemetry from over <strong><a href="https://www.faros.ai/blog/ai-software-engineering#:~:text=Drawing%20on%20telemetry%20from%20over,recent%20landmark%20research%20report%20confirms">10,000 developers across 1,255 teams</a></strong> (using data from source control, task trackers, CI pipelines, etc.) to see how heavy AI adoption correlates with team and organizational performance.
The findings reveal a fundamental mismatch between individual output and organizational outcomes:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9XdI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ae39f7e-afa5-44b8-ae18-7c4e2fcdb83b_999x651.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9XdI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ae39f7e-afa5-44b8-ae18-7c4e2fcdb83b_999x651.png 424w, https://substackcdn.com/image/fetch/$s_!9XdI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ae39f7e-afa5-44b8-ae18-7c4e2fcdb83b_999x651.png 848w, https://substackcdn.com/image/fetch/$s_!9XdI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ae39f7e-afa5-44b8-ae18-7c4e2fcdb83b_999x651.png 1272w, https://substackcdn.com/image/fetch/$s_!9XdI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ae39f7e-afa5-44b8-ae18-7c4e2fcdb83b_999x651.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9XdI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ae39f7e-afa5-44b8-ae18-7c4e2fcdb83b_999x651.png" width="999" height="651" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ae39f7e-afa5-44b8-ae18-7c4e2fcdb83b_999x651.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:651,&quot;width&quot;:999,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48708,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/170851153?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ae39f7e-afa5-44b8-ae18-7c4e2fcdb83b_999x651.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9XdI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ae39f7e-afa5-44b8-ae18-7c4e2fcdb83b_999x651.png 424w, https://substackcdn.com/image/fetch/$s_!9XdI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ae39f7e-afa5-44b8-ae18-7c4e2fcdb83b_999x651.png 848w, https://substackcdn.com/image/fetch/$s_!9XdI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ae39f7e-afa5-44b8-ae18-7c4e2fcdb83b_999x651.png 1272w, https://substackcdn.com/image/fetch/$s_!9XdI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ae39f7e-afa5-44b8-ae18-7c4e2fcdb83b_999x651.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><ul><li><p>Teams with heavy AI tool use <strong><a href="https://www.faros.ai/blog/ai-software-engineering#:~:text=,queues%20balloon">completed 21% more tasks</a></strong><a href="https://www.faros.ai/blog/ai-software-engineering#:~:text=,queues%20balloon"> and </a><strong><a href="https://www.faros.ai/blog/ai-software-engineering#:~:text=,queues%20balloon">merged 98% more pull requests</a></strong> &#8211; confirming that AI users tend to crank out more code and work items. However, their <strong><a href="https://www.faros.ai/blog/ai-software-engineering#:~:text=Developers%20on%20teams%20with%20high,a%20critical%20bottleneck%3A%20human%20approval">PR review times ballooned by 91%</a></strong>, creating a new bottleneck at the human approval stage.
Essentially, <em>AI let devs throw code over the wall faster, but the work then piled up higher on the other side of that wall (code review, QA)</em>. The report invokes <a href="https://www.faros.ai/blog/ai-software-engineering#:~:text=bottleneck%3A%20human%20approval">Amdahl&#8217;s Law</a>: overall speed is dictated by the stages you don&#8217;t accelerate. Without speeding up code review and deployment processes, <strong>the extra code just queues up waiting for humans</strong>.</p></li></ul><ul><li><p>AI-enabled developers indeed <strong>parallelize more</strong>: high-AI teams saw devs touching <a href="https://www.faros.ai/blog/ai-software-engineering#:~:text=,day">9% more distinct tasks per day</a> and <a href="https://www.faros.ai/blog/ai-software-engineering#:~:text=Developers%20on%20teams%20with%20high,more%20pull%20requests%20per%20day">47% more PRs per day</a>. This indicates more context-switching and multi-threaded work. The report suggests a <a href="https://www.faros.ai/blog/ai-software-engineering#:~:text=AI%20is%20shifting%20that%20benchmark%2C,generated%20contributions%20across%20multiple%20workstreams">new &#8220;operating model&#8221; is emerging</a> where devs orchestrate multiple AI-assisted threads of work rather than focusing on one at a time. This isn&#8217;t entirely negative &#8211; it might mean devs can juggle more things (review one AI-suggested PR while another runs tests, etc.) &#8211; but it challenges the conventional wisdom that context-switching is always bad. In an AI world, <em>some</em> increased context-switching might be normal as devs oversee multiple semi-autonomous efforts. Still, it can also be mentally taxing.</p></li></ul><ul><li><p><strong>Code quantity vs. quality:</strong> The Faros data found code <em>structure</em> and hygiene might improve with AI (they observed slightly fewer code smells and higher test coverage in some cases), <strong>but bug rates actually increased</strong>.
Specifically, AI adoption was associated with a <strong><a href="https://www.faros.ai/blog/ai-software-engineering#:~:text=,quality%20worsens">9% increase in bugs per developer</a></strong> and, astonishingly, a <strong><a href="https://www.faros.ai/blog/ai-software-engineering#:~:text=While%20we%20observe%20a%20modest,increase%20in%20average%20PR%20size">+154% increase in average PR size</a></strong>! So PRs got <em>much larger</em> when AI was involved, likely because AI can generate big chunks quickly. Larger PRs are harder to review and more bug-prone. Indeed, more bugs slipped through. The tooling may encourage a spray of code that isn&#8217;t fully digested by the author, putting more burden on downstream QA.</p></li></ul><ul><li><p><strong>No overall acceleration at the org level:</strong> When looking at big-picture metrics (like <em>DORA&#8217;s four key DevOps metrics</em> of deployment frequency, lead time, change fail rate, and MTTR, as well as overall throughput), the analysis found <strong><a href="https://www.faros.ai/blog/ai-software-engineering#:~:text=,from%20AI">no significant correlation between AI adoption and better outcomes at the company level</a></strong>. In other words, companies with lots of AI usage didn&#8217;t ship faster or more reliably than those without, once you aggregate the data. The individual team boosts were getting absorbed by cross-team dependencies and bottlenecks.</p></li></ul><p>This disconnect is what Faros calls the <strong><a href="https://www.faros.ai/blog/ai-software-engineering#:~:text=AI%20Is%20Everywhere">&#8220;AI Productivity Paradox&#8221;</a></strong>: <em>AI is everywhere, yet impact isn&#8217;t</em>. By 2025, <strong>75% of engineers use AI tools, yet most orgs see no measurable performance gains</strong> in delivery.
The report offers insightful reasons why these gains haven&#8217;t materialized, crystallized into <strong>four patterns</strong>:</p><ol><li><p><strong>Adoption is very recent:</strong> Widespread usage (60%+ of devs using weekly) <a href="https://www.faros.ai/blog/ai-software-engineering#:~:text=Even%20with%20rising%20usage%2C%20we,often%20fail%20to%20scale%2C%20namely">only took off in the last 2&#8211;3 quarters</a> in most companies. The tooling and practices are immature; teams are basically &#8220;beta-testing&#8221; AI in real time. There hasn&#8217;t been enough time to re-engineer processes around it.</p></li></ol><ol start="2"><li><p><strong>Usage is uneven across teams:</strong> Even if overall company adoption is high, it <a href="https://www.faros.ai/blog/ai-software-engineering#:~:text=1,Usage%20is%20highest%20among">varies team by team</a>. Some teams may be &#8220;AI super-users&#8221; cranking out code, while others are more traditional. Since software delivery is often cross-team, one fast team won&#8217;t dramatically speed up a whole project if the adjacent teams or downstream reviewers aren&#8217;t equally augmented. It&#8217;s the <a href="https://www.faros.ai/blog/ai-software-engineering#:~:text=developing,In">&#8220;weakest link&#8221; effect</a> again.</p></li></ol><ol start="3"><li><p><strong>Adoption skews toward newer engineers:</strong> The data showed <strong><a href="https://www.faros.ai/blog/ai-software-engineering#:~:text=organizational%20level,the%20dataset%2C%20most%20developers%20use">new hires and less-tenured engineers use AI the most</a></strong>, whereas many <a href="https://www.faros.ai/blog/ai-software-engineering#:~:text=engineers%20who%20are%20newer%20to,system%20knowledge%20and%20organizational%20context">senior engineers and veterans are using it less or not at all</a>. 
(Not to confuse <em>tenure</em> with age or skill &#8211; this specifically means engineers new to the company, who often lean on AI to navigate unfamiliar codebases). Seniors may be more skeptical of AI&#8217;s help on complex tasks, or simply creatures of habit. The implication is that <strong>the people designing systems and making big architectural decisions (often senior staff) are using AI least</strong>, while juniors cranking out code use it most. So the <em>type</em> of work AI is doing is likely more on the periphery (small feature PRs, minor fixes) than at the core architectural level. This limits the impact on major outcomes, at least for now.</p></li></ol><ol start="4"><li><p><strong>Usage remains shallow (autocomplete-overdrive):</strong> Most developers are using only the <strong>most basic AI capabilities &#8211; primarily <a href="https://www.faros.ai/blog/ai-software-engineering#:~:text=ability%20to%20support%20more%20complex,task%20execution%20remain%20largely%20untapped">code autocomplete</a></strong> in the IDE. Advanced uses like integrated AI chat for troubleshooting, AI-assisted code review, or autonomous agents opening merge requests are <em>rare</em>. The report explicitly notes <em><a href="https://www.faros.ai/blog/ai-software-engineering#:~:text=ability%20to%20support%20more%20complex,task%20execution%20remain%20largely%20untapped">&#8220;advanced capabilities&#8230; remain largely untapped&#8221;</a></em>. So, despite all the talk of &#8220;agentic AI&#8221; that can file pull requests or automatically fix bugs, the reality is that here in 2025 the typical dev just has a smarter autocomplete that occasionally writes a function for them. That&#8217;s useful, but it&#8217;s incremental.
The full transformative potential of AI (if it exists) isn&#8217;t being realized because the tooling and adoption of those capabilities are nascent.</p></li></ol><p>The Faros report suggests that <strong>to get real value, organizations need to deliberately adapt</strong>: invest in training developers on effective AI use, update code review practices (maybe even use AI to help review the AI-generated code), improve test automation to catch the extra bugs, and foster knowledge sharing of successful AI workflows. </p><p>A handful of <em>&#8220;rare companies&#8221;</em> were seeing tangible performance gains, and they were the ones that treated AI not as a plug-and-play gadget, but as a strategic initiative with <strong>&#8220;five enablers &#8211; workflow design, governance, infrastructure, training, and cross-functional alignment&#8221;</strong> backing it. In plain terms: If you don&#8217;t change your development process and upskill people, throwing AI in the mix might just create faster chaos, not faster delivery.</p><h2>Where AI helps the most (and least)</h2><p>The utility of AI coding assistance can vary dramatically depending on the scenario. Let&#8217;s break down <strong>use-cases where developers are seeing clear gains</strong> versus areas where AI still struggles or even hinders:</p><ul><li><p><strong>&#9989; Greenfield projects &amp; prototyping:</strong> When starting a new app or feature from scratch (with little existing code or legacy constraints), AI can be a turbocharger. Developers often report that <em>&#8220;vibe coding&#8221;</em> &#8211; letting the AI generate a substantial initial codebase or component &#8211; works best in greenfield situations or throwaway prototypes. The AI is less likely to conflict with established patterns because there aren&#8217;t any yet, and the cost of mistakes is lower. 
One engineer described the first time using AI on a new project as <em>&#8220;sipping rocket fuel&#8221;</em> &#8211; you get a burst of speed early on. Boilerplate for common frameworks (spinning up a React frontend or a basic Express server) is done in seconds. In hackathons or early-stage startup projects, some developers essentially use AI as an extra pair of hands to churn out a minimum viable product quickly. <strong>The benefit here</strong> is psychological as much as practical: AI-generated code can help you <em>iterate faster</em>, since you can quickly scaffold something, run it, and then refine. As Colton Voege noted, it&#8217;s <a href="https://colton.dev/blog/curing-your-ai-10x-engineer-imposter-syndrome/#:~:text=And%20it%20was,leading%20to%20significant%20security%20vulnerabilities">good at the generic stuff</a> (especially in well-trodden domains like JavaScript/React). So you feel super-productive initially. <em>However</em>, even in greenfield work, AI might not architect the solution optimally &#8211; it&#8217;s great for stubs and examples, but a human still needs to guide the overall design.</p></li></ul><ul><li><p><strong>&#9989; Boilerplate and repetitive code:</strong> Perhaps the most agreed-upon strength: AI excels at writing the boring bits. This includes unit tests that follow a pattern, boilerplate CRUD methods, converting one data structure to another, writing serialization/deserialization code, glue code between APIs, etc. If you have examples to imitate, AI will mimic them.
David Cramer (engineering leader at Sentry) gave a practical tip &#8211; if you have to write new tests similar to existing ones, <strong><a href="https://cra.mr/built-with-borrowed-hands/#:~:text=Ignore%20the%20claims%20about%20vibe,feed%20it%20the%20Issue%20URL">generate them with AI</a></strong> to save time, then <em><a href="https://cra.mr/built-with-borrowed-hands/#:~:text=to%20write%20code,feed%20it%20the%20Issue%20URL">&#8220;dive in and change what you need to&#8221;</a></em>. This speeds up the rote parts of coding. Similarly, routine functions (parsers, format converters, simple algorithms) can be knocked out quickly by prompting the AI. The key is that <em>you</em> as the developer know exactly what needs to be done and roughly how &#8211; you just let the AI fill in the syntax and edge cases. Many devs report significant time saved not having to search Stack Overflow for that one-off regex or not having to manually write dull boilerplate. It&#8217;s like having an encyclopedia and snippet library on tap.</p></li></ul><ul><li><p><strong>&#9989; Documentation and learning:</strong> An underrated use of AI in development is as a learning and explanation tool. Over <a href="https://stackoverflow.blog/2025/07/29/developers-remain-willing-but-reluctant-to-use-ai-the-2025-developer-survey-results-are-here/#:~:text=still%20ask%20another%20person%20for,invest%20time%20in%20AI%20programming">44% of devs learning a new language or tech in the past year used AI help to do so</a>. Tools like Cursor can explain code, translate one language to another, or answer &#8220;how do I do X in framework Y&#8221; much faster than combing documentation. When encountering a new API, developers often paste an error message or function signature into the AI chat and get a quick explanation or sample usage. This accelerates the &#8220;research&#8221; phase of coding. 
Rather than scouring Google and skimming docs for 30 minutes, an AI might give you the gist in 30 seconds (sometimes even with a runnable example). This isn&#8217;t direct &#8220;code productivity&#8221;, but it reduces time spent stuck or reading manuals. The Stack Overflow survey indicates <a href="https://www.admin-magazine.com/News/Stack-Overflow-Survey-66-of-Developers-Frustrated-by-AI-Inaccuracy#:~:text=Overall%2C%20developers%20say%20they%E2%80%99re%20using,AI%20for%20tasks%20such%20as">searching for answers</a> is the #1 use of AI tools by devs &#8211; essentially AI as a smart assistant for Q&amp;A. That said, <em>distrust</em> of AI answers is high (for good reason &#8211; they can sound confident but be wrong), so developers are double-checking anything important. But as a supplement to official docs, AI chat can be a great tutor or rubber duck.</p></li></ul><ul><li><p><strong>&#9989; Onboarding to codebases:</strong> New hires or contributors unfamiliar with a large codebase have found AI assistants helpful in navigating and understanding the code. Because AI models can ingest a lot of context, you can ask things like &#8220;what does this module do?&#8221;, &#8220;summarize how data flows from class A to B&#8221;, or &#8220;where in the code is the logic for X handled?&#8221; and get pointed in the right direction. The Faros report noted that <strong><a href="https://www.faros.ai/blog/ai-software-engineering#:~:text=organizational%20level,the%20dataset%2C%20most%20developers%20use">newer engineers lean on AI to navigate unfamiliar code and accelerate early contributions</a></strong>. This suggests a good use-case: easing the steep learning curve of complex systems. Instead of constantly pestering senior team members with questions, a junior dev can ask the AI and often get useful answers (again, with caution about accuracy).
Even something like &#8220;generate a quick example usage of internal library Z from our repo&#8221; can give a template to work from. AI won&#8217;t have the true architectural understanding a senior dev has, but it can index the codebase and surface relevant bits quickly. In essence, it&#8217;s like an interactive documentation/search tool for your own code. This can save time and help newer team members become productive faster &#8211; a legitimate productivity gain at the team level if it shortens onboarding time.</p></li></ul><ul><li><p><strong>&#9989; &#8220;Hands-on&#8221; debugging aids:</strong> We are seeing developers use AI as a debugging assistant in creative ways. For instance, some will paste a stack trace or error log and ask the AI for likely causes or solutions. Others have started to integrate AI into their monitoring &#8211; e.g., feeding an issue or bug description to an internal AI agent (like <a href="https://cra.mr/built-with-borrowed-hands/#:~:text=favorite%20IDE,feed%20it%20the%20Issue%20URL">Sentry&#8217;s MCP tool that David Cramer mentioned</a>) to get analysis on what might be wrong. One HN user described a workflow: <em><a href="https://news.ycombinator.com/item?id=44739060#:~:text=Any%20research%20will%20be%20limited,what%20the%20researchers%20control%20for">&#8220;when I get a well-written bug report or detailed logs, my instinct is to feed it to an agent and let it figure it out in the background while I work on other things&#8221;</a></em>. They claimed this parallel approach often surfaces insights or even fixes. This hints that AI&#8217;s value isn&#8217;t only in writing new code &#8211; it can also help understand existing broken code. These uses can trim down debugging time, which is historically a huge part of development.
However, this area is still emerging; AI can misdiagnose issues too, so it&#8217;s another tool in the toolbox rather than a magic debugger.</p></li></ul><p>On the flip side, <strong>situations where AI assistance struggles or backfires:</strong></p><ul><li><p><strong>&#10060; Large, complex legacy codebases (brownfield):</strong> In mature enterprise codebases with lots of domain-specific context, custom patterns, and interdependent components, AI often flounders. Developers note that an AI might write code that doesn&#8217;t fit the existing architecture or misses subtle requirements, causing integration headaches. Colton Voege pointed out that <strong><a href="https://colton.dev/blog/curing-your-ai-10x-engineer-imposter-syndrome/#:~:text=And%20it%20was,leading%20to%20significant%20security%20vulnerabilities">AI &#8220;is not good at keeping up with the standards and utilities of </a></strong><em><strong><a href="https://colton.dev/blog/curing-your-ai-10x-engineer-imposter-syndrome/#:~:text=And%20it%20was,leading%20to%20significant%20security%20vulnerabilities">your</a></strong></em><strong><a href="https://colton.dev/blog/curing-your-ai-10x-engineer-imposter-syndrome/#:~:text=And%20it%20was,leading%20to%20significant%20security%20vulnerabilities"> codebase&#8221;</a></strong> and tends to <a href="https://colton.dev/blog/curing-your-ai-10x-engineer-imposter-syndrome/#:~:text=standards%20and%20utilities%20of%20your,leading%20to%20significant%20security%20vulnerabilities">fail if you use non-mainstream libraries</a>. It might call APIs that <em>almost</em> do what&#8217;s needed but not quite, or use outdated approaches. In such environments, integrating an AI-generated piece can take as long as writing it manually because you must rework it to match the codebase&#8217;s idioms. 
David Cramer&#8217;s experiment at Sentry, where he tried to rely <strong>100% on agents to build a real service</strong>, ended up confirming this: <em><a href="https://cra.mr/built-with-borrowed-hands/#:~:text=The%20example%20I%20used%20above,importantly%2C%20they%20don%E2%80%99t%20replace%20engineering">&#8220;you cannot use these agents to build software today&#8230; they don&#8217;t replace hands-on-keyboards. Most importantly, they don&#8217;t replace engineering.&#8221;</a></em> He found that for non-trivial new features in a complex system, the agent kept producing <strong><a href="https://cra.mr/built-with-borrowed-hands/#:~:text=The%20problem%20you%20see%20next,is%20an%20existing%20code%20base">&#8220;absolutely unmaintainable&#8221;</a></strong><a href="https://cra.mr/built-with-borrowed-hands/#:~:text=The%20problem%20you%20see%20next,is%20an%20existing%20code%20base"> code</a> or got stuck, and he eventually had to <a href="https://cra.mr/built-with-borrowed-hands/#:~:text=The%20example%20I%20used%20above,importantly%2C%20they%20don%E2%80%99t%20replace%20engineering">&#8220;hit eject&#8221; and do it the traditional way</a>. The AI could generate lots of code, but it wasn&#8217;t the <em>right</em> code. <strong><a href="https://cra.mr/built-with-borrowed-hands/#:~:text=The%20problem%20you%20see%20next,is%20an%20existing%20code%20base">Duplicate code, unused code, and incorrect abstractions</a></strong> were common when the agent tried to <a href="https://cra.mr/built-with-borrowed-hands/#:~:text=shot%20it%20,is%20an%20existing%20code%20base">extend a complex project</a>. This highlights that in brownfield development, <strong>deep understanding of the existing system is critical, and AI doesn&#8217;t truly understand &#8211; it guesses based on patterns</strong>. 
Until we have AI that can deeply ingest and reason about millions of lines of bespoke code (and maybe have the product context), human engineers will still be needed to ensure coherence and correctness in large systems. Thus, the productivity boost in brownfield scenarios is much smaller &#8211; some engineers estimate only a 10&#8211;30% speed-up at best in these cases, and sometimes a slowdown if the AI suggestions lead you astray.</p></li></ul><ul><li><p><strong>&#10060; &#8220;Agentic&#8221; autonomous coding:</strong> 2025 saw a lot of hype around coding agents &#8211; AI that can iterate on its own, e.g. writing code, running tests, reading the results, and refining. In theory, you could tell an agent &#8220;build me X feature&#8221; and it will write code, compile, test, fix bugs, and so on with minimal intervention. In practice, as Armin Ronacher documented in <em>&#8220;Agentic Coding Things That Didn&#8217;t Work,&#8221;</em> these workflows are <strong><a href="https://lucumr.pocoo.org/2025/7/30/things-that-didnt-work/#:~:text=But%20oddly%20enough%2C%20very%20little,from%20these%20failures%20for%20others">fragile and often more trouble than they&#8217;re worth</a></strong>. Armin enthusiastically tried features like <a href="https://lucumr.pocoo.org/2025/7/30/things-that-didnt-work/#:~:text=Using%20Claude%20Code%20and%20other,exploring%20everything%20on%20my%20plate">slash-commands to automate tasks</a>, background hooks, and <a href="https://lucumr.pocoo.org/2025/7/30/things-that-didnt-work/#:~:text=So%20I%20end%20up%20doing,be%20good%20use%20of%20copy%2Fpaste">&#8220;YOLO&#8221; modes that let the AI run wild on his codebase</a>. He ultimately abandoned most of these complex setups. Why? They didn&#8217;t consistently yield good results and added complexity to his workflow. 
<strong><a href="https://lucumr.pocoo.org/2025/7/30/things-that-didnt-work/#:~:text=But%20oddly%20enough%2C%20very%20little,from%20these%20failures%20for%20others">&#8220;Most of my attempts didn&#8217;t last&#8230; I ended up doing the simplest thing: just talk to the machine more, give it more context&#8230; That is 95% of my workflow.&#8221;</a></strong> In other words, all the fancy autonomous behaviors were less useful than a straightforward interactive chat with the AI. He would <a href="https://lucumr.pocoo.org/2025/7/30/things-that-didnt-work/#:~:text=commands%20cluttering%20your%20workspace%20and,confusing%20others">dictate or write what he wanted in detail</a> (often via speech-to-text to be more verbose) and <a href="https://lucumr.pocoo.org/2025/7/30/things-that-didnt-work/#:~:text=So%20I%20end%20up%20doing,be%20good%20use%20of%20copy%2Fpaste">guide the AI step by step</a>. The agent features either failed or he forgot to use them. This sentiment is echoed by many: current coding agents are cool demos, but in day-to-day use they can go off the rails and require constant babysitting. They might run the wrong command, misunderstand a test failure, or get stuck in loops. So, <strong>fully hands-off coding is not reliable in 2025</strong>. It&#8217;s still a <strong>human-in-the-loop game</strong>, where the human provides direction and judgment. The most effective &#8220;automation&#8221; remains partial &#8211; e.g., using AI to autofix simple lint errors or generate a PR draft, but not expecting it to deliver a shippable feature without human oversight. 
As Cramer concluded after his two-month agent experiment, <em><a href="https://cra.mr/built-with-borrowed-hands/#:~:text=went%20through%20the%20embedded%20agent,importantly%2C%20they%20don%E2%80%99t%20replace%20engineering">&#8220;I wasted three days trying to get the agent to design a feature I could&#8217;ve done in an afternoon&#8230; what matters is simply: you cannot use these agents to build software today.&#8221;</a></em> He emphasizes that <strong>AI in its current form will not replace the keyboard or the need for engineering skill</strong>. Instead, its value is in <em>augmenting</em> engineers, not substituting them.</p></li></ul><ul><li><p><strong>&#10060; Unvalidated &#8220;almost-right&#8221; code:</strong> This ties to the trust issue. When an AI produces code that looks plausible, there&#8217;s a temptation to accept and run with it. But as <a href="https://www.admin-magazine.com/News/Stack-Overflow-Survey-66-of-Developers-Frustrated-by-AI-Inaccuracy#:~:text=%2A%2046,last%20year">66% of devs noted, </a><em>almost-right can be worse than wrong</em>. An obviously wrong answer you&#8217;ll discard immediately, but an almost-correct snippet might slip through and later cause a subtle bug. This leads to scenarios where developers unknowingly introduce issues or technical debt by over-relying on AI output. For example, one might use an AI-generated algorithm that works on typical cases but fails on edge cases &#8211; and if the dev doesn&#8217;t thoroughly test it (trusting the AI got it right), that bug goes to production. Or AI might use a deprecated function that mostly works but breaks in a future update. Without vigilance, AI can actually <strong>decrease code quality</strong>. The Faros study&#8217;s finding of <a href="https://www.faros.ai/blog/ai-software-engineering#:~:text=,quality%20worsens">increased bug density</a> on AI-heavy teams underscores this. 
So any productivity gain from writing code faster could be wiped out by time spent fixing the resulting bugs. Many teams have learned to treat AI suggestions with the same scrutiny as code from a junior developer: useful, but must be reviewed line by line. This of course eats into the productivity gains. As one senior dev on HN noted, <em><a href="https://news.ycombinator.com/item?id=44739060#:~:text=indoordin0saur%20%20%2030%20,%E2%80%93">&#8220;I wouldn&#8217;t use it to write anything lengthy&#8230; overall it has improved my productivity, though I could see how it might hurt junior engineers [who trust it too much].&#8221;</a></em> The key is that <strong>experience is needed to validate AI output</strong>. Inexperienced devs who blindly trust AI may produce code faster, but potentially <em>worse</em> code &#8211; leading to a net negative productivity once you account for QA and maintenance. This is why some in the community worry that AI assistance could become a crutch that impedes learning (junior devs might cargo-cult AI code without understanding it). The optimists argue it&#8217;s akin to Stack Overflow &#8211; you still have to know enough to integrate the answer. Regardless, <strong>blind trust in AI is a recipe for trouble</strong>; successful use requires a healthy dose of skepticism and old-fashioned testing.</p></li></ul><ul><li><p><strong>&#10060; Tasks requiring creative insight or novel solutions:</strong> Language models are fundamentally pattern mimickers. When faced with a truly novel problem that doesn&#8217;t map well to known examples, AI tools often flail or produce very generic suggestions. For instance, designing a new algorithm or inventing an architecture for a brand-new paradigm &#8211; these high-level creative engineering tasks are not something current AIs excel at. They can help by brainstorming or enumerating options (which might inspire the human), but they are unlikely to produce an innovative solution outright. 
Thus, for the most intellectually challenging parts of engineering &#8211; deciding <em>what</em> to build, <em>why</em> and <em>how</em> at a conceptual level &#8211; human engineers are still very much in the driver&#8217;s seat. The code assistants come into play more in the later stage of <em>how to implement this logic</em> in syntax. So one could argue AI hasn&#8217;t changed the nature of software design; it&#8217;s just sped up the mechanical aspects of coding. Real productivity leaps would require AI to contribute at the design/problem-solving level, which we aren&#8217;t seeing yet except in trivial ways.</p></li></ul><p>In summary, <strong>AI currently shines for well-defined, repetitive, or insulated tasks</strong> &#8211; writing boilerplate, tests, simple functions, and answering how-to questions. It falters in situations requiring holistic understanding of large systems, creative problem-solving, or strict correctness and maintainability. This delineation suggests why startups and individual projects might feel a bigger benefit (they can afford to move fast and break things with AI-generated code), whereas big mature products can&#8217;t tolerate mistakes as easily and thus can&#8217;t unleash AI without caution.</p><h2>Managing developer attention with AI suggestions</h2><p>A recurring design challenge is <strong>managing developer attention</strong>. When assistants surface <strong>too many suggestions</strong>, utility saturates and developers start to ignore them. Chen et al. observe that higher&#8209;frequency suggestion modes led participants to copy suggestions into code less often despite productivity gains (<a href="https://arxiv.org/abs/2410.04596">Chen et al., 2025</a>). 
In program&#8209;design workflows, developers also <strong>struggled to keep up with LLM&#8209;originated changes</strong> and experienced <strong>information overload</strong>, underscoring the need to carefully gate what is shown and when (<a href="https://arxiv.org/abs/2503.06911">Zamfirescu&#8209;Pereira et al., 2025</a>; <a href="https://dl.acm.org/doi/10.1145/3706598.3714154">ACM DOI</a>).</p><p>Actionably, future agents should <strong>time and target interventions based on what the developer is focused on</strong>. That includes monitoring context such as current activity in the IDE and recent interactions to offer <strong>well&#8209;timed, context&#8209;aware suggestions</strong> rather than a constant stream. Chen et al., 2025 recommend showing contextually relevant information and timing suggestions based on user workflow. Empirically, <a href="https://arxiv.org/abs/2503.06911">Pu et al.</a> show that <strong>subtask&#8209;boundary heuristics</strong> (program execution, code&#8209;block completion, and user comments) were effective triggers, while signals like idleness and code selection created false positives and disruptions.</p><p>The bottom line: <strong>proactivity pays off when it reduces intent&#8209;expression and interpretation effort</strong>, especially in debugging and refactoring, but it must be <strong>attention&#8209;aware</strong>. Favor interventions at subtask boundaries, allow users to tune frequency, and defer to the engineer&#8217;s flow during implementation.</p><h2>Adapting workflows: How developers are integrating AI</h2><p>To harness AI effectively, developers are evolving their workflows and tools. Some notable trends and best practices emerging in 2025:</p><ul><li><p><strong>&#8220;AI Pair-programming&#8221; via chat:</strong> Rather than relying solely on inline code completions, many developers keep an AI chat window open as they work. 
They treat it like a colleague they can rapidly iterate with. For example, they might paste a function and say &#8220;hey, can you refactor this to use approach X&#8221; or &#8220;find the bug in this code&#8221; or &#8220;write unit tests for this function.&#8221; This interactive approach often yields better results than expecting the AI to do everything autonomously. As Armin Ronacher noted, <strong>the simplest and most effective use of these tools is just to </strong><em><strong><a href="https://lucumr.pocoo.org/2025/7/30/things-that-didnt-work/#:~:text=commands%20cluttering%20your%20workspace%20and,confusing%20others">talk to them more</a></strong></em>. He even uses <strong><a href="https://lucumr.pocoo.org/2025/7/30/things-that-didnt-work/#:~:text=commands%20cluttering%20your%20workspace%20and,confusing%20others">voice input</a></strong> to stream-of-consciousness describe what he wants, because <a href="https://lucumr.pocoo.org/2025/7/30/things-that-didnt-work/#:~:text=So%20I%20end%20up%20doing,be%20good%20use%20of%20copy%2Fpaste">speaking can be faster than typing and encourages providing more context</a>. The AI then responds with code or answers. This <em>conversational</em> coding style is becoming more common, especially with the advent of voice-enabled coding assistants and editor plugins. It&#8217;s like having a rubber duck that talks back with suggestions. The benefit is you can guide the AI step by step, rather than letting it guess the whole solution. This mitigates misunderstanding and lets you course-correct in real time. Tools like Cursor or VS Code Copilot Chat allow highlighting code and asking questions or for modifications, which fits naturally into a developer&#8217;s flow. The takeaway: <strong>treat the AI as a collaborator, not an autonomous coder</strong>. 
Continuous back-and-forth yields better outcomes than one-shot prompts.</p></li></ul><ul><li><p><strong>Personal prompt libraries &amp; reusable recipes:</strong> Developers who regularly use AI are building up a set of &#8220;prompt patterns&#8221; or semi-structured commands that work well for their needs. For instance, a prompt to &#8220;explain this code&#8221;, a prompt to &#8220;optimize this function without changing its API&#8221;, or a prompt to &#8220;generate a SQL query for X based on these tables.&#8221; Some IDE plugins let you save these as shortcuts (Armin experimented with <strong><a href="https://lucumr.pocoo.org/2025/7/30/things-that-didnt-work/#:~:text=Slash%20Commands">slash commands</a></strong> for common tasks). While he found many of them ended up unused, the idea of having <strong>custom AI commands</strong> is still compelling for some &#8211; e.g., a /doc command that when you highlight a function, it generates documentation comments for it. Or /tests to scaffold test cases for a given code snippet. Armin discovered <a href="https://lucumr.pocoo.org/2025/7/30/things-that-didnt-work/#:~:text=session,I%20ended%20up%20never%20using">limitations in implementation</a> (lack of parameterization, etc., which frustrated him), but the concept may improve as tooling matures. Even without formal slash commands, some devs keep a text snippet of their favorite prompts to copy-paste when needed (for example, the precise phrasing to get an AI to produce output in a desired format). This is analogous to having shell scripts or editor macros &#8211; you learn how to &#8220;code&#8221; the AI with prompts and reuse what works.</p></li></ul><ul><li><p><strong>AI-aware code reviews:</strong> One interesting development is engineers beginning to use AI <em>during code review</em>. For instance, if they receive a pull request (possibly AI-generated or not), they might ask an AI to summarize the changes or identify potential issues. 
GitHub has started previewing an &#8220;AI-assisted code review&#8221; that will highlight risky code or suggest improvements. While still early, this could help deal with the onslaught of <a href="https://www.faros.ai/blog/ai-software-engineering#:~:text=,quality%20worsens">larger PRs from AI-generated code</a>. It&#8217;s almost a necessity: if AI doubles the code written, to avoid review becoming the bottleneck, perhaps AI needs to aid in review too. Some developers already manually use AI chat: paste a diff and say &#8220;review this diff for any bugs or style issues.&#8221; The AI might catch things or at least provide a second opinion. However, caution is required &#8211; an AI code reviewer might miss context or enforce pedantic rules. But this area will likely grow, as it directly addresses the slowest link (human review) that Faros identified. If AI can pre-filter changes and auto-approve trivial ones (some companies already auto-merge PRs below X lines or with low risk), that could free humans to focus on complex changes, boosting throughput.</p></li></ul><ul><li><p><strong>Smaller, incremental changes (batch size adjustments):</strong> Some teams are adjusting their practices to better accommodate AI. One tactic is encouraging <strong>smaller, more incremental commits/PRs</strong> when using AI. Since AI can spew a lot of code quickly, there&#8217;s a temptation to do a huge change in one go. But that leads to the 150% larger PRs and long reviews. Instead, savvy devs are learning to break tasks into smaller sub-tasks, get AI to help with each, and commit in pieces. This ties into an old best practice (small batches) that becomes even more important with AI output. Smaller AI contributions are easier to verify and less likely to introduce big bugs. It&#8217;s the principle of <strong>keeping the human in control</strong> by not letting the AI run away. 
For example, instead of &#8220;Implement the payment system&#8221; in one shot, do &#8220;Implement the payment API client&#8221; (review it), then &#8220;Implement the payment processing function&#8221;, etc. This also helps psychologically; the dev stays engaged and doesn&#8217;t lose track of what the AI is doing.</p></li></ul><ul><li><p><strong>Investing in tests &amp; tooling:</strong> A pattern emerging in teams using AI is doubling down on automated testing and CI quality gates. Since AI code may have unknown flaws, having a robust test suite is your safety net. Some companies require that any AI-generated code must come with tests (often AI-written tests!) to ensure it&#8217;s exercised. Others run new static analysis or AI-driven analysis on the code to catch issues. Essentially, if AI increases the volume and velocity of code, the verification and validation steps must catch up too. So, teams are adding more linting, more types (in typed languages), and more checks to avoid regressions. This isn&#8217;t glamorous productivity stuff, but it&#8217;s necessary to actually realize net gains. If AI lets you write code 30% faster, but you have no tests and spend an extra 40% of time debugging in production, you lost the game. Smart teams realize this and shore up their quality pipelines. In a way, AI is forcing better engineering discipline: you can&#8217;t rely on intuition that the code is right if you didn&#8217;t write it fully yourself &#8211; you write tests to be sure.</p></li></ul><ul><li><p><strong>Knowledge sharing and training:</strong> Developers are learning tips and tricks from each other on how to coax the best out of AI. Internal brown-bags or Slack channels dedicated to AI tools are common now in companies. People share prompt techniques (&#8220;Ask it like <em>this</em> and it will include import statements correctly&#8221; etc.), or warn about pitfalls (&#8220;Don&#8217;t use it for X, it always messes up thread safety&#8221;). 
Treating AI proficiency as a skill that can be taught and learned is important. Google&#8217;s study and others noted how <em><a href="https://news.ycombinator.com/item?id=44739060#:~:text=The%20report%20is%20definitely%20worth,1%5D%20is">experience with the tool</a></em><a href="https://news.ycombinator.com/item?id=44739060#:~:text=The%20report%20is%20definitely%20worth,1%5D%20is"> mattered</a>: one person with a lot of AI usage under their belt performed better. So ramping everyone up on AI &#8220;literacy&#8221; can improve overall team productivity. We&#8217;re seeing new roles or informal leads for this &#8211; e.g., an &#8220;AI champion&#8221; on a team who stays updated on the latest features (like VS Code adding a new AI refactoring command) and helps teammates use them.</p></li></ul><p>In essence, teams that get value from AI are those that <strong>treat it as an evolving capability to be managed</strong>, not a magic box. They iterate on how they integrate AI into their dev process, much like adopting any new tool. We&#8217;re at an interesting juncture where even seasoned engineers are somewhat <strong>&#8220;junior&#8221; at using AI tools</strong> &#8211; there&#8217;s a learning curve, and those who climb it reap more benefits.</p><p>One overarching theme: <strong>keeping the</strong> human in charge. All these workflow adaptations &#8211; from interactive prompting to extra testing &#8211; are about channeling AI&#8217;s strengths while mitigating its weaknesses, under human guidance. As David Cramer aptly advised fellow engineers:</p><blockquote><p>&#8220;<a href="https://cra.mr/built-with-borrowed-hands/">Ignore the claims</a> about <em>vibe coding</em> and claims that <em>you don&#8217;t need to know how to write code</em>. Instead look for ways to augment what you do. Those tests you need to write for that new API route? 
They look awfully similar to those other tests: so generate them.&#8221;</p></blockquote><p>He emphasizes that <strong>AI won&#8217;t replace the craft of engineering</strong>, and that&#8217;s okay. You still need to understand code &#8211; AI just helps you generate and verify it faster. And if using AI makes you miserable (some devs find &#8220;vibe coding&#8221; dull because it takes the fun out of writing code), it&#8217;s okay not to push it to the max. <em><a href="https://colton.dev/blog/curing-your-ai-10x-engineer-imposter-syndrome/#:~:text=Even%20after%20I%20got%20over,higher%20output%20path%20is%20available">&#8220;It&#8217;s okay to sacrifice some productivity to make work enjoyable,&#8221;</a></em> Colton Voege reminds, noting that <a href="https://colton.dev/blog/curing-your-ai-10x-engineer-imposter-syndrome/#:~:text=No,codebase%20will%20benefit%20from%20it">forcing yourself to code in a way you hate</a> (whether that&#8217;s writing everything by hand <em>or</em> wrangling an AI for every line) can lead to burnout and worse outcomes. In other words, <strong>productivity isn&#8217;t everything &#8211; developer happiness and creativity matter too, and there&#8217;s a balance to strike</strong>.</p><h2>Proactive AI Agents for debugging and refactoring</h2><p>Recent studies suggest that <strong>proactive AI agents are particularly useful during debugging and refactoring</strong>. In a CHI 2025 study of a proactive coding assistant, participants engaged with the AI most often during implementation (38.2 percent of all AI interactions) and debugging (26.4 percent), with lower rates for the analyze, design, organize, and refactor stages. This suggests engagement with the assistant concentrates in the implementation and debugging phases. 
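</p><p><em>A minimal, hypothetical sketch of the attention&#8209;aware gating these findings point to &#8211; every name below is illustrative, not an API from any of the cited systems:</em></p>

```python
# Hypothetical sketch: gate proactive suggestions by inferred work stage.
# Pattern: proactivity is welcomed in debugging/refactoring but disruptive
# during heads-down implementation, where the assistant should interject
# only at subtask boundaries, and users should be able to tune frequency.
from dataclasses import dataclass

WELCOME_STAGES = {"debugging", "refactoring"}
SUBTASK_BOUNDARIES = {"tests_ran", "block_completed", "comment_written"}

@dataclass
class EditorContext:
    stage: str       # e.g. "implementation", "debugging", "refactoring"
    last_event: str  # most recent editor event observed
    frequency: str   # user setting: "high", "low", or "off"

def should_surface_suggestion(ctx: EditorContext) -> bool:
    """Decide whether a proactive suggestion may interrupt the developer."""
    if ctx.frequency == "off":
        return False
    if ctx.stage in WELCOME_STAGES:
        return True  # fix/cleanup phases: proactivity tends to pay off
    # Implementation: stay quiet except at subtask boundaries, and only
    # for users who opted into frequent suggestions.
    return ctx.last_event in SUBTASK_BOUNDARIES and ctx.frequency == "high"
```

<p>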
These interaction-stage figures are from Pu et al., 2025 (<a href="https://arxiv.org/abs/2502.18658">paper</a>; <a href="https://dl.acm.org/doi/10.1145/3706598.3713357">ACM DOI</a>).</p><p>Participants also reported that <strong>increased AI proactivity led to higher efficiency</strong>, while prompt&#8209;only tools demanded more effort to use. One participant contrasted the proactive modes with a prompt&#8209;only baseline, saying they &#8220;had to keep on prompting and asking&#8221; (Pu et al., 2025 <a href="https://arxiv.org/abs/2502.18658">paper</a>). Independent work from Chen et al. found similar pain points with purely reactive chat assistants, with participants noting &#8220;I really wasn&#8217;t sure what to ask for with the non&#8209;proactive chat&#8221; (Chen et al., 2025 <a href="https://arxiv.org/abs/2410.04596">paper</a>; <a href="https://www.microsoft.com/en-us/research/publication/need-help-designing-proactive-ai-assistants-for-programming/">Microsoft Research summary</a>; <a href="https://dl.acm.org/doi/10.1145/3706598.3714002">ACM DOI</a>).</p><p>Critically, <strong>proactive suggestions were perceived as least disruptive during debugging and refactoring, and most disruptive during implementation</strong>. In Pu et al., disruptions clustered in implementation (32.7 percent of all disruptions) versus far fewer in debugging (7.27 percent) and refactoring (1.82 percent), reinforcing that proactivity should be applied thoughtfully: welcome during &#8220;fix&#8221; or cleanup phases, restrained during heads&#8209;down feature work. 
Pu et al., 2025 <a href="https://arxiv.org/abs/2502.18658">paper</a>.</p><h2>What&#8217;s the engineering leadership perspective?</h2><p>The <a href="https://leaddev.com/the-ai-impact-report-2025">LeadDev AI Impact Report 2025</a> surveyed 883 engineering leaders across the US, UK, Europe, and beyond to capture how AI tooling is changing teams and process.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oIGn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a917f49-6a74-4b59-b673-b7112944912d_2382x1298.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oIGn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a917f49-6a74-4b59-b673-b7112944912d_2382x1298.png 424w, https://substackcdn.com/image/fetch/$s_!oIGn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a917f49-6a74-4b59-b673-b7112944912d_2382x1298.png 848w, https://substackcdn.com/image/fetch/$s_!oIGn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a917f49-6a74-4b59-b673-b7112944912d_2382x1298.png 1272w, https://substackcdn.com/image/fetch/$s_!oIGn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a917f49-6a74-4b59-b673-b7112944912d_2382x1298.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oIGn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a917f49-6a74-4b59-b673-b7112944912d_2382x1298.png" width="1456" height="793" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a917f49-6a74-4b59-b673-b7112944912d_2382x1298.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:793,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:353132,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/170851153?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a917f49-6a74-4b59-b673-b7112944912d_2382x1298.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oIGn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a917f49-6a74-4b59-b673-b7112944912d_2382x1298.png 424w, https://substackcdn.com/image/fetch/$s_!oIGn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a917f49-6a74-4b59-b673-b7112944912d_2382x1298.png 848w, https://substackcdn.com/image/fetch/$s_!oIGn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a917f49-6a74-4b59-b673-b7112944912d_2382x1298.png 1272w, https://substackcdn.com/image/fetch/$s_!oIGn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a917f49-6a74-4b59-b673-b7112944912d_2382x1298.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>While 59% of leaders report feeling more productive with AI tools, the reality is more nuanced. Automated code generation dominates usage at 48%, followed by summarization (39%) and documentation (36%), while critical areas like code review (17%) and testing (7%) lag significantly. The strongest perceived gains appear in very small teams (fewer than 5 engineers), where 59% cite improvements exceeding 10%. 
However, 60% of organizations cite the lack of clear metrics as their biggest AI-related challenge, with only 18% currently measuring impact systematically.</p><h2>Conclusion: A clear-eyed, data-driven perspective</h2><p>By now, the consensus among many software engineers &#8211; especially the characteristically skeptical Hacker News crowd &#8211; is that <strong>AI coding tools provide </strong><em><strong>useful boosts</strong></em><strong> but not miracles</strong>. The data backs this up:</p><ul><li><p><strong>Adoption is high</strong> and growing, because these tools <em>do</em> help developers work faster or reduce tedious work.</p></li></ul><ul><li><p><strong><a href="https://medium.com/@sahin.samia/can-ai-really-boost-developer-productivity-new-study-reveals-a-26-increase-1f34e70b5341#:~:text=,into%2010%20hours%20of%20output">Productivity gains in the 20-30% range</a></strong> are being observed in controlled settings and some real teams. That&#8217;s significant, but it&#8217;s a linear gain, not exponential.</p></li></ul><ul><li><p><strong>Individual experiences vary</strong>: <a href="https://medium.com/@sahin.samia/can-ai-really-boost-developer-productivity-new-study-reveals-a-26-increase-1f34e70b5341#:~:text=The%20study%20also%20highlighted%20a,enhanced%20productivity%20across%20the%20board">novices might feel supercharged</a> and reach near-senior output on certain tasks, while some <a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/#:~:text=We%20conduct%20a%20randomized%20controlled,from%20AI%20R%26D%20automation%201">veterans find the AI more distraction than help</a> and <a href="https://news.ycombinator.com/item?id=44739060#:~:text=indoordin0saur%20%20%2030%20,%E2%80%93">proceed cautiously</a>.</p></li></ul><ul><li><p><strong><a href="https://www.admin-magazine.com/News/Stack-Overflow-Survey-66-of-Developers-Frustrated-by-AI-Inaccuracy#:~:text=URL%3A%20https%3A%2F%2Fwww.admin">Trust is a major issue</a></strong>, 
with nearly half of developers not trusting AI&#8217;s output. This lack of trust is warranted given the tendency of models to err &#8211; yet ironically, developers can be <a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/#:~:text=Core%20Result">overconfident after using AI</a>, as the METR study showed. Navigating between over-reliance and under-utilization is the new skill to master.</p></li></ul><ul><li><p><strong><a href="https://www.faros.ai/blog/ai-software-engineering#:~:text=,from%20AI">Output &#8800; Outcome</a></strong>: Without adapting processes, more code faster can just mean more code waiting in review or more bugs to fix. Organizations are learning that they must invest in how AI is rolled out (training people, updating workflows) to see a true productivity payoff.</p></li></ul><p>So, <strong>is AI making engineers more productive?</strong> Yes, <strong>but modestly and unevenly</strong>. Forget the splashy &#8220;10x&#8221; headlines &#8211; the reality is a story of incremental improvements and second-order effects. It&#8217;s telling that the biggest benefits cited are often <em>qualitative</em>: e.g., developers feeling <strong>less mental load</strong> on grunt work, being able to focus on higher-level problems while the AI handles boilerplate, or learning new tech quicker with AI assistance. These are real improvements to the developer experience, even if they don&#8217;t neatly show up as 1000% more output. </p><p>From a manager or CTO perspective, the message is to be <strong>realistic</strong>. If your engineers report a 30% productivity gain with AI, that&#8217;s actually in line with the best studies &#8211; be happy with that, and skeptical of any claims far above it without extraordinary proof. Also, look at where that 30% is coming from: it might be 30% more <em>code</em> written, which as we&#8217;ve discussed doesn&#8217;t automatically equal 30% more <em>value</em> delivered. 
Monitor your bug rates, review times, and developer satisfaction alongside raw output.</p><p>For engineers themselves, the advice is <strong>pragmatic</strong>: <em>experiment with these tools, keep what works, discard what doesn&#8217;t.</em> There&#8217;s a lot of noise and &#8220;grift&#8221; in the AI tooling space, so focus on concrete improvements in your workflow. If using an AI assistant helps you write your code more efficiently or enjoyably, great &#8211; use it. If it sometimes slows you down, figure out why (are you trusting it too much? Using it on the wrong problems? Spending too long crafting the perfect prompt?) and adjust accordingly. It&#8217;s a learning process for everyone.</p><p>Crucially, <strong>continue honing core software engineering skills</strong>. AI might change the nature of coding over time, but in 2025 it&#8217;s clear that understanding how to design a system, how to debug, how to test, and how to maintain code are still vital. In fact, those skills become <em>more</em> important when an AI is doing the easy stuff, because the human must handle the hard stuff. As Cramer wrote, <em><a href="https://cra.mr/built-with-borrowed-hands/#:~:text=it%20in%20an%20afternoon,importantly%2C%20they%20don%E2%80%99t%20replace%20engineering">&#8220;they don&#8217;t replace engineering&#8221;</a></em> &#8211; AI won&#8217;t turn a bad programmer into a great one, but it can make a good programmer faster. Think of it like power tools: a nail gun lets a skilled carpenter frame a house faster, but an unskilled person with a nail gun can also just make a big dangerous mess faster. 
The skill and judgment remain paramount.</p><p>So to the question likely on every engineer&#8217;s mind: <strong>Will AI take my job or make my role obsolete?</strong> The data so far suggests <strong>no &#8211; but it might change your job somewhat.</strong> Developers are still needed to conceive ideas, break down problems, review AI&#8217;s work, and ensure the final product meets real-world requirements. AI is not replacing those creative and analytical parts; it&#8217;s just shaving some of the manual labor off the edges. A majority of developers (<a href="https://stackoverflow.blog/2025/07/29/developers-remain-willing-but-reluctant-to-use-ai-the-2025-developer-survey-results-are-here/#:~:text=about%20,a%20threat%20to%20their%20job">64%</a>) still feel secure that AI isn&#8217;t a threat to their jobs, though that confidence wavered slightly this year. The best approach is probably to embrace the tools and make them <em>complement</em> your skills. In other words, <strong>be the engineer who&#8217;s 1.3&#215; as productive with AI, rather than the one who refuses to use it and falls behind</strong>. Even a pessimist can appreciate a 30% speed-up on the dull parts of coding.</p><p>Finally, a note on <strong>mindset</strong>: The initial magic of AI coding wears off, and what&#8217;s left is figuring out how to incorporate this capability sustainably. <strong>We&#8217;re past the honeymoon phase now</strong> &#8211; on the ground, assessing what these tools can <em>really</em> do for us. And the picture is not one of a revolution rendering programmers irrelevant, but of a gradual evolution in how programmers do their work. 
</p><p>The <strong>canonical, comprehensive take</strong> at this point is: <strong>AI coding tools are helpful assistants that, when used wisely, can make engineers moderately more productive and happier by automating some drudgery &#8211; but they are not a substitute for human insight, and they introduce new challenges (verification, coordination) that must be managed.</strong></p><p><strong>In short: </strong><em><strong>The future of coding is likely</strong></em><strong> </strong><em><strong>human+AI, not AI-alone</strong>.</em> Embrace the helper, but keep your hands on the wheel and your engineering fundamentals sharp. </p><p>That&#8217;s how you&#8217;ll truly reap the productivity gains without getting lost in the hype.</p><p><em>I&#8217;m excited to share I&#8217;m writing a new <a href="https://beyond.addy.ie">AI-assisted engineering book</a> with O&#8217;Reilly. If you&#8217;ve enjoyed my writing here you may be interested in checking it out.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Sp4c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c497c2-42b5-4e28-a5cf-a3c89db18e41_5246x5246.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Sp4c!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c497c2-42b5-4e28-a5cf-a3c89db18e41_5246x5246.png 424w, https://substackcdn.com/image/fetch/$s_!Sp4c!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c497c2-42b5-4e28-a5cf-a3c89db18e41_5246x5246.png 848w, 
https://substackcdn.com/image/fetch/$s_!Sp4c!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c497c2-42b5-4e28-a5cf-a3c89db18e41_5246x5246.png 1272w, https://substackcdn.com/image/fetch/$s_!Sp4c!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c497c2-42b5-4e28-a5cf-a3c89db18e41_5246x5246.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Sp4c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c497c2-42b5-4e28-a5cf-a3c89db18e41_5246x5246.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e9c497c2-42b5-4e28-a5cf-a3c89db18e41_5246x5246.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:578496,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/170851153?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c497c2-42b5-4e28-a5cf-a3c89db18e41_5246x5246.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Sp4c!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c497c2-42b5-4e28-a5cf-a3c89db18e41_5246x5246.png 424w, 
https://substackcdn.com/image/fetch/$s_!Sp4c!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c497c2-42b5-4e28-a5cf-a3c89db18e41_5246x5246.png 848w, https://substackcdn.com/image/fetch/$s_!Sp4c!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c497c2-42b5-4e28-a5cf-a3c89db18e41_5246x5246.png 1272w, https://substackcdn.com/image/fetch/$s_!Sp4c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9c497c2-42b5-4e28-a5cf-a3c89db18e41_5246x5246.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div>]]></content:encoded></item><item><title><![CDATA[Coding for the Future Agentic World]]></title><description><![CDATA[The promise and reality of autonomous coding agents in 2025]]></description><link>https://addyo.substack.com/p/coding-for-the-future-agentic-world</link><guid isPermaLink="false">https://addyo.substack.com/p/coding-for-the-future-agentic-world</guid><dc:creator><![CDATA[Addy Osmani]]></dc:creator><pubDate>Tue, 29 Jul 2025 18:30:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!pxOC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa723053-031c-4c09-ac2d-283213dda75f_2912x2096.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Looking forward, we're entering an <strong>agentic</strong> coding era &#8211; one where autonomous AI agents could handle more significant parts of the software development lifecycle (SDLC) on our behalf, with humans remaining in the loop. In this post I'll explore how programming is evolving in this new world of <strong>CLI</strong> <strong>coding agents</strong>, <strong>orchestrators</strong>, and <strong>background</strong> <strong>AI-driven automation</strong>. I&#8217;ll touch on everything from Claude Code to AmpCode and beyond. Let's dive in.</p><h2><strong>The rise of coding agents</strong></h2><p>It helps to frame where we are coming from. Over the past couple of years, we've seen AI assistance in coding go from simple autocomplete (e.g. early Copilot) to more interactive pair-programming. 
These tools claim to dramatically boost individual productivity &#8211; Microsoft recently reported that <em>over 30% of new code at the company is now AI-generated</em>, with <a href="https://simple.ai/p/coding-agents-are-having-their-chatgpt-moment#:~:text=But%20now%2C%20the%20transformation%20is,faster%20than%20most%20people%20realize">similar numbers</a> at Meta and <a href="https://9to5google.com/2025/06/30/google-engineers-ai-code/#:~:text=Google%20issues%20company%2Dwide%20AI%20coding%20guidance%20to%20software%20engineers&amp;text=Back%20in%20April%2C%20Sundar%20Pichai,for%20coding%20and%20their%20work.">Google</a>. This isn&#8217;t just happening in sandbox projects; it&#8217;s happening with production code running in systems used by billions. It&#8217;s clear we&#8217;ve hit an inflection point.</p><blockquote><p><strong>Tip:</strong> To improve AI coding, don&#8217;t just jump straight into tasks without <strong>planning</strong>. Planning first makes a big difference, as does <a href="https://addyo.substack.com/p/context-engineering-bringing-engineering">including sufficient context</a>. You can even <a href="https://www.youtube.com/watch?v=fD4ktSkNCw4">set up custom rules</a> for exactly how your mini-PRDs should be written.</p></blockquote><p>The next leap is <strong>autonomous coding agents</strong>. Instead of just offering suggestions or one-file edits, these agents can execute high-level tasks: e.g. <em>&#8220;Add a dark mode to my app&#8221;</em> or <em>&#8220;Fix all the failing tests and update outdated dependencies&#8221;</em>. They will plan the work, modify multiple files, run tests or commands, and even produce pull requests &#8211; all with minimal human intervention beyond approval. 
In other words, we&#8217;re shifting from <em>&#8220;AI as a coding assistant&#8221;</em> to <em>&#8220;AI as an autonomous coder&#8221;</em>.</p><p>In just a few months, <strong>every major AI lab and tech company rolled out their take on an autonomous coding agent</strong>. Anthropic launched <strong>Claude Code</strong>, Google unveiled <strong>Gemini CLI</strong> and an async agent called <strong>Jules</strong>, OpenAI introduced a research preview of a new <strong>Codex agent</strong>, and Microsoft/GitHub announced a <strong>Copilot &#8220;autonomous agent&#8221;</strong> mode. Social media has been abuzz with talk of rapid development in AI coding platforms. It feels reminiscent of ChatGPT&#8217;s breakout moment &#8211; suddenly <em>coding agents</em> are everywhere, and developers are taking notice.</p><p>So where does this leave developers? The answer many are converging on is that <strong>developers will evolve from "coders" to "conductors"</strong>. We&#8217;ll spend less time grinding out boilerplate or hunting bugs, and more time orchestrating AI agents, providing high-level guidance, and verifying the results. In other words, the next level of abstraction in software engineering is here. Just as we moved from assembly to high-level languages, and from on-prem servers to cloud, we&#8217;re now moving to a world where you might <strong>&#8220;orchestrate fleets of agents, letting AI handle the implementation&#8221;</strong>. The best developers will be those who can <em>communicate goals effectively to AIs, review and improve AI-generated code, and know when human insight must override AI output</em>.</p><blockquote><p><strong>Tip:</strong> Having AI write tests first (e.g. TDD-style) and then the least amount of code needed to make them pass can minimize things going off the rails.</p></blockquote><p>So, what does this agentic future look like in practice? 
Let&#8217;s explore some key categories of tools and patterns that are emerging:</p><ul><li><p><strong>AI Coding Agents in the terminal (CLI agents)</strong> &#8211; interactive command-line tools (Claude Code, Google&#8217;s Gemini CLI, OpenCode, etc.) that act as AI engineers living in your terminal. They can understand your entire codebase and perform multi-step coding tasks via a conversational interface.</p></li><li><p><strong>Orchestrating multiple agents in parallel</strong> &#8211; tools that let you run multiple AI agents simultaneously on different tasks or parts of a project. Think of this as having an <em>AI pair programming team</em> at your disposal, managed via a single interface (Claude Squad, Conductor, Agent Farm, Magnet, etc.).</p></li><li><p><strong>Asynchronous background coders</strong> &#8211; agents that run in the cloud or in the background (like Google&#8217;s Jules or OpenAI&#8217;s Codex agent), which you can assign tasks to. They work autonomously and come back with code changes or pull requests, while you continue other work.</p></li><li><p><strong>AI-Assisted testing and CI</strong> &#8211; applying agentic AI to the adjacent stages of development: writing unit tests, detecting and fixing build failures, and automating code reviews. This includes &#8220;self-healing&#8221; CI pipelines that automatically correct failing tests, and bots that can propose fixes via PRs.</p></li><li><p><strong>Integrated AI-First dev environments</strong> &#8211; new IDEs and platforms built around AI from the ground up (e.g. Cursor, Windsurf, Replit Ghostwriter, etc.), which blur the line between writing code and prompting an agent. These often combine aspects of the above categories into a unified experience.</p></li><li><p><strong>Project orchestration &amp; management</strong> &#8211; AI agents hooking into our project management tools (issues, tickets, docs) to close the loop. 
For example, starting from a GitHub or Linear issue and letting an AI agent implement the required code changes and update the ticket.</p></li></ul><p>Along the way, I&#8217;ll also touch on challenges and open questions &#8211; like how we maintain code quality, handle AI mistakes, control costs, and ensure developers remain in the loop. Let&#8217;s start with the CLI agent revolution, since that&#8217;s been one of the most talked-about developments.</p><h2><strong>Coding with CLI agents: AI in your terminal</strong></h2><blockquote><p><strong>The terminal is the new IDE &#8211; CLI agents turn your shell into an action&#8209;oriented interface where prompts translate into multi&#8209;file commits and tests, collapsing traditional editor boundaries.</strong></p></blockquote><p>One striking trend is the resurgence of the command-line interface as a place for powerful developer tools &#8211; now supercharged with AI. Historically, many devs have preferred the GUI of an IDE for serious coding, but the new generation of AI coding agents is making the terminal <em>shockingly productive</em>. I&#8217;ve personally found myself living in the terminal more, thanks to these tools.</p><p><strong>Anthropic&#8217;s <a href="https://docs.anthropic.com/en/docs/claude-code/overview#:~:text=Claude%20Code%20overview">Claude Code</a></strong> is a prime example. It&#8217;s an <em>&#8220;agentic coding tool that lives in your terminal and helps turn ideas into code faster&#8221;</em>. Under the hood, Claude Code is essentially Anthropic&#8217;s Claude AI (a large language model like ChatGPT) augmented with the ability to act on your filesystem and execute commands. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1AVK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa51ab44e-b868-4bdf-b32c-16f217ed4587_1400x785.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1AVK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa51ab44e-b868-4bdf-b32c-16f217ed4587_1400x785.png 424w, https://substackcdn.com/image/fetch/$s_!1AVK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa51ab44e-b868-4bdf-b32c-16f217ed4587_1400x785.png 848w, https://substackcdn.com/image/fetch/$s_!1AVK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa51ab44e-b868-4bdf-b32c-16f217ed4587_1400x785.png 1272w, https://substackcdn.com/image/fetch/$s_!1AVK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa51ab44e-b868-4bdf-b32c-16f217ed4587_1400x785.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1AVK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa51ab44e-b868-4bdf-b32c-16f217ed4587_1400x785.png" width="1400" height="785" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a51ab44e-b868-4bdf-b32c-16f217ed4587_1400x785.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:785,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Claude Code: From Zero to 
Hero. What is Claude Code? | by Daniel Avila |  Jun, 2025 | Medium&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Claude Code: From Zero to Hero. What is Claude Code? | by Daniel Avila |  Jun, 2025 | Medium" title="Claude Code: From Zero to Hero. What is Claude Code? | by Daniel Avila |  Jun, 2025 | Medium" srcset="https://substackcdn.com/image/fetch/$s_!1AVK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa51ab44e-b868-4bdf-b32c-16f217ed4587_1400x785.png 424w, https://substackcdn.com/image/fetch/$s_!1AVK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa51ab44e-b868-4bdf-b32c-16f217ed4587_1400x785.png 848w, https://substackcdn.com/image/fetch/$s_!1AVK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa51ab44e-b868-4bdf-b32c-16f217ed4587_1400x785.png 1272w, https://substackcdn.com/image/fetch/$s_!1AVK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa51ab44e-b868-4bdf-b32c-16f217ed4587_1400x785.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 
17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>You install it via npm (npm install -g @anthropic-ai/claude-code), navigate to a project directory, and just run claude. This drops you into an interactive CLI where you can have a conversation with Claude about your codebase. 
But unlike a chat in the browser, this agent can <strong>directly edit files, run tests, git commit changes, and more</strong>.</p><div id="youtube2-U_vwfQBhVSY" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;U_vwfQBhVSY&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/U_vwfQBhVSY?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p><em>There&#8217;s an excellent beginner&#8217;s guide to Claude Code by <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;ian nuttall&quot;,&quot;id&quot;:307634712,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e13e0bec-b862-4cbe-9b5e-e4331947b8a7_2335x2335.jpeg&quot;,&quot;uuid&quot;:&quot;76e1fcc6-23d3-4451-9436-a8a55f54c0fc&quot;}" data-component-name="MentionToDOM">ian nuttall</span> above.</em></p><p>What can Claude Code do for you? Quite a lot:</p><ul><li><p><strong>Build features from descriptions</strong> &#8211; You can literally tell Claude Code what you want to build in plain English. It will break the task into steps, write or modify code across multiple files, and ensure it runs. Essentially, it handles implementation from a spec.</p></li><li><p><strong>Debug and fix issues</strong> &#8211; If you describe a bug or paste an error, Claude Code will analyze the codebase to locate the problem and then <em>implement a fix</em> automatically. 
It&#8217;s not just suggesting a fix; it can apply it.</p></li><li><p><strong>Navigate and answer questions about the codebase</strong> &#8211; Because it indexes your entire project (and even can fetch external info via Anthropic&#8217;s <strong>Model Context Protocol (<a href="https://addyo.substack.com/p/mcp-what-it-is-and-why-it-matters">MCP</a>)</strong> hooks), you can ask questions like <em>&#8220;Where is the user authentication logic defined?&#8221;</em> or <em>&#8220;What does this error mean in our context?&#8221;</em>, and get answers that reference your actual code.</p></li><li><p><strong>Automate tedious tasks</strong> &#8211; Need to do a boring refactor or cleanup? For example, <em>&#8220;remove all deprecated API calls and update them to the new interface&#8221;</em> &#8211; the agent can handle that. Fix lint issues, resolve merge conflicts, generate release notes, you name it.</p></li></ul><blockquote><p>Here are some great <a href="https://x.com/jasonzhou1993/status/1948334295120314793">Claude Code pro-tips</a> from <strong><a href="https://x.com/jasonzhou1993">Jason Zhou</a>:</strong></p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;5d2c5a2e-4bb2-4cd5-b91b-6918d45c84e6&quot;,&quot;duration&quot;:null}"></div></blockquote><p>In short, Claude Code tries to be a <strong>hands-on AI software engineer</strong> working alongside you. It not only chats, but <em>takes action</em>. It was important to Anthropic that this agent can follow the Unix philosophy &#8211; meaning you can script it and compose it with other tools. A fun example from their docs: you could pipe logs to it and have it monitor them, e.g. tail -f app.log | claude -p "Slack me if you see any anomalies". 
Another: run it in CI to automatically raise a PR if certain conditions are met (like new strings needing translation). This shows how deeply an AI agent can integrate into developer workflows beyond just writing code &#8211; it can observe and act on events. </p><blockquote><p><strong>Tip: </strong><a href="https://frontendatscale.com/issues/49">Spec-driven development</a> is becoming increasingly common when using a CLI coding agent. This is for when you need a more structured approach with a detailed task breakdown vs. just going with the vibes via a simple prompt.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DyUF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F017bdb96-6128-41f1-b53a-4e7793d86caa_2094x1186.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DyUF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F017bdb96-6128-41f1-b53a-4e7793d86caa_2094x1186.png 424w, https://substackcdn.com/image/fetch/$s_!DyUF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F017bdb96-6128-41f1-b53a-4e7793d86caa_2094x1186.png 848w, https://substackcdn.com/image/fetch/$s_!DyUF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F017bdb96-6128-41f1-b53a-4e7793d86caa_2094x1186.png 1272w, https://substackcdn.com/image/fetch/$s_!DyUF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F017bdb96-6128-41f1-b53a-4e7793d86caa_2094x1186.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!DyUF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F017bdb96-6128-41f1-b53a-4e7793d86caa_2094x1186.png" width="1456" height="825" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/017bdb96-6128-41f1-b53a-4e7793d86caa_2094x1186.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:825,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:55074,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/169029578?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F017bdb96-6128-41f1-b53a-4e7793d86caa_2094x1186.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DyUF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F017bdb96-6128-41f1-b53a-4e7793d86caa_2094x1186.png 424w, https://substackcdn.com/image/fetch/$s_!DyUF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F017bdb96-6128-41f1-b53a-4e7793d86caa_2094x1186.png 848w, https://substackcdn.com/image/fetch/$s_!DyUF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F017bdb96-6128-41f1-b53a-4e7793d86caa_2094x1186.png 1272w, https://substackcdn.com/image/fetch/$s_!DyUF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F017bdb96-6128-41f1-b53a-4e7793d86caa_2094x1186.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></blockquote><p>CLI agents have been evolving further lately, and Claude Code just got a new feature: <a href="https://docs.anthropic.com/en/docs/claude-code/sub-agents">custom subagents</a>. Subagents let you create &#8220;teams&#8221; of custom agents, each designed to handle specialized tasks. These include architects, reviewers and testers. Each agent has its own context and conversation history. 
Type `/agents` to get started:</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;f438aa72-a7c2-4a80-88af-74bf0600a5c9&quot;,&quot;duration&quot;:null}"></div><p>Anthropic&#8217;s approach has resonated especially with more advanced developers who love the terminal. There&#8217;s no new editor to learn &#8211; <em>the agent meets you where you already work</em>. If you&#8217;re a vim + CLI person, Claude Code doesn&#8217;t force a GUI on you. And if you <em>do</em> prefer an IDE, they&#8217;ve made it easy to connect Claude Code to popular editors (via a local server that the IDE can talk to). </p><p>Personally, I enjoy the focus of the CLI &#8211; when I run claude in a project, I&#8217;m in a single-pane environment where the AI agent is front and center, not tucked in a small sidebar. A great description from one review: <em>&#8220;that single terminal pane felt like a better interface&#8230; now that the agent was doing so much, I found myself saying: do I really need the file editor to be the primary focus?&#8221;</em>. I can relate to that sentiment; when an agent is effectively handling multi-file changes, the traditional editor UI starts to feel like overhead.</p><p>Anthropic isn&#8217;t alone here. <strong><a href="https://blog.google/technology/developers/introducing-gemini-cli-open-source-ai-agent/#:~:text=does%20the%20demand%20for%20integrated,AI%20assistance">Google&#8217;s Gemini CLI</a></strong> is another major entrant. Announced in June 2025, Gemini CLI brings Google&#8217;s premier LLM (Gemini 2.5 Pro) right into your terminal. It&#8217;s free and open-source &#8211; you just log in with a Google account to get generous access (Google touts <em>&#8220;unmatched free usage limits&#8221;</em> for individuals, including a 1 million token context window for Gemini 2.5 Pro). 
</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-FbD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1956c1f5-87ca-4557-b8bf-4398634b8f66_1280x720.jpeg"><img src="https://substackcdn.com/image/fetch/$s_!-FbD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1956c1f5-87ca-4557-b8bf-4398634b8f66_1280x720.jpeg" width="1280" height="720" alt="Google introduces Gemini CLI, a light open-source AI agent that brings Gemini directly into the terminal" title="Google introduces Gemini CLI, a light open-source AI agent that brings Gemini directly into the terminal" loading="lazy"></a></figure></div><p>That context size is massive &#8211; it means the agent can theoretically take into account your entire codebase or large chunks of documentation when helping you. My teams and I have been enjoying using Gemini CLI for both work and personal projects.</p><div id="youtube2-eyYmFAFxiJ4" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;eyYmFAFxiJ4&quot;,&quot;startTime&quot;:&quot;2s&quot;,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/eyYmFAFxiJ4?start=2s&amp;rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>Like Claude Code, Gemini CLI can do a lot more than code completion: it handles <em>&#8220;coding, content generation, problem-solving, and task management&#8221;</em> in the terminal. 
In practice, it overlaps many of Claude Code&#8217;s capabilities &#8211; editing code, answering questions, running commands, etc. Google integrated it tightly with their AI ecosystem: if you use <strong>Gemini Code Assist</strong> in VS Code, you can seamlessly move to the CLI agent and vice versa. It&#8217;s clear Google sees CLI agents as a cornerstone, not just a novelty.</p><blockquote><p>Some Gemini CLI safety pro-tips from creator <a href="https://x.com/ntaylormullen">N. Taylor Mullen</a>: </p><ul><li><p>gemini --sandbox is your safe playground (macOS Seatbelt, Docker and Podman support) </p></li><li><p>gemini --checkpointing = instant "undo" button with /restore </p></li></ul><p>There&#8217;s a whole <a href="https://www.philschmid.de/gemini-cli-cheatsheet">Gemini CLI cheat-sheet</a> also available from <a href="https://x.com/_philschmid/status/1950206886071992667">Philipp Schmid</a>, who has shared some great tips including using / commands for <strong>creating a plan</strong> based on your request and existing codebase:</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;3a24a025-66b9-4aad-b3fa-0e1ecaefd1ee&quot;,&quot;duration&quot;:null}"></div></blockquote><p>There are other <strong>open-source CLI agents</strong> making waves. One <a href="https://www.youtube.com/watch?v=hJm_iVhQD6Y#:~:text=ClaudeCode%21%20www,CLI%2C%20AI%20coding%20agent">noteworthy</a> project is <strong><a href="https://github.com/opencode-ai/opencode">OpenCode CLI</a></strong> (by the team at SST) &#8211; a &#8220;powerful AI coding agent built for the terminal&#8221;. OpenCode is model-agnostic: you can plug in different providers (OpenAI, Anthropic, local models via Ollama, etc.) 
and it will use them to drive the agent.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SMin!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5d9fd4a-757a-4f52-b1c2-38a5d3e0767e_1200x737.png"><img src="https://substackcdn.com/image/fetch/$s_!SMin!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5d9fd4a-757a-4f52-b1c2-38a5d3e0767e_1200x737.png" width="1200" height="737" alt="OpenCode: Open Source Claude Code Alternative is Here" title="OpenCode: Open Source Claude Code Alternative is Here" loading="lazy"></a></figure></div><p>OpenCode&#8217;s philosophy is <a href="https://dev.opencode.ai/docs/cli/#:~:text=Running%20the%20opencode%20CLI%20starts,it%20for%20the%20current%20directory">similar</a>: you navigate to your project and launch opencode to start an AI session in that context. It even supports a non-interactive mode (opencode run "your prompt") for quick answers or scripting usage. Being open source, developers can inspect the code and extend it &#8211; which builds trust. And because it&#8217;s not tied to a single model, you&#8217;re free to use whichever AI backend is best (or most cost-effective) for you.</p><p><strong>What&#8217;s it like to use a CLI coding agent?</strong> In my experience, it feels like pair programming with a supercharged assistant who never gets tired. For example, I can say: <em>&#8220;Add a new API endpoint for uploading a profile picture. 
Use Express and ensure the image is stored in Cloud Storage&#8221;</em>. The agent will typically respond with a plan (e.g. &#8220;I will create a new route, a middleware for handling uploads, update the user model, and write a test for it&#8221;), then proceed to implement it step by step, asking for confirmation before running potentially destructive commands or making big changes. Claude Code, for instance, asks yes/no before applying diffs or running a test suite. You remain the human in charge &#8211; but you&#8217;re delegating the heavy lifting.</p><p>Using these CLI agents effectively also requires <strong>prompting skill</strong> and a willingness to trust automation. The first few times I watched an agent refactor code, I was nervous &#8211; it was changing code faster than I could fully parse. It&#8217;s important to review diffs carefully (the tools make this easier by showing colorized diffs in the terminal and summarizing changes). Over time, I&#8217;ve grown more comfortable, especially as I&#8217;ve seen that I can always roll back a commit if needed. The CLI context actually encourages a commit-per-change workflow, which is nice (Claude Code will often commit changes with descriptive messages as it goes, so you have a history of what the AI did).</p><blockquote><p><strong>Tip: </strong>For <strong>cost savings</strong>, you can ask Claude Code or another CLI agent to <a href="https://x.com/iannuttall/status/1938695506013601802">use Gemini CLI to build a plan for Claude</a> to act on. Gemini CLI&#8217;s 1M-token context window and free tier (used in non-interactive mode) make it well suited to researching your codebase before the pricier agent starts work.</p></blockquote><p>Finally, cost is a consideration. Some CLI agents are free or let you use your own API keys, but with heavy use, token usage can add up. I expect we&#8217;ll see more pricing options and competition driving costs down. 
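</p><p>To make the plan-then-implement tip concrete, here&#8217;s a rough sketch. The prompts are invented; <code>-p</code> is the non-interactive &#8220;print&#8221; flag both CLIs expose for scripting:</p><pre><code># Let Gemini CLI (free tier, 1M-token context) survey the repo and draft a plan
gemini -p "Read this codebase and write a step-by-step refactoring plan" > plan.md

# Then have Claude Code execute against that plan
claude -p "Implement step 1 of plan.md, committing as you go"
</code></pre><p>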
For now, I treat the heavy agents like I&#8217;d treat running a cloud dev VM &#8211; immensely powerful, but remember to shut it down when not needed or consider using more cost-effective models.</p><p>The bottom line: <strong>CLI-based AI coding agents are here and surprisingly effective.</strong> They turn the humble terminal into a smart, action-oriented IDE. For many tasks, I&#8217;ve found myself preferring the CLI agent over clicking around an editor. It&#8217;s a different way of working &#8211; more conversational and higher-level. And as these tools improve (with bigger contexts, integration of web search, etc.), the gap between &#8220;describe what you want&#8221; and &#8220;get working code&#8221; is closing fast.</p><h2>Toad</h2><p>The technical implementation of terminal-based agents has room for improvement, as highlighted by <a href="https://willmcgugan.github.io/announcing-toad/">Will McGugan's </a><strong><a href="https://willmcgugan.github.io/announcing-toad/">Toad</a></strong><a href="https://willmcgugan.github.io/announcing-toad/"> project</a>. McGugan, former CEO of Textualize and creator of the Textual framework, identified fundamental user experience issues in existing CLI agents including visual flickering, poor text selection, and interface degradation during terminal resizing. These problems stem from how current agents update terminal content by removing and rewriting entire sections rather than performing targeted updates.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;8f75072c-9c29-45c3-b36b-882542588e7e&quot;,&quot;duration&quot;:null}"></div><p>Toad demonstrates an alternative architectural approach that separates the user interface layer from the AI interaction logic through a JSON-based protocol over stdin/stdout. 
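</p><p>I won&#8217;t guess at Toad&#8217;s exact schema, but the general shape of such a protocol is easy to picture: the frontend and the AI backend exchange small JSON messages over stdin/stdout, and an update targets an existing element by id instead of repainting the screen. The field names here are purely illustrative:</p><pre><code>{"type": "append", "id": 7, "role": "agent", "markdown": "Running tests..."}
{"type": "update", "id": 7, "markdown": "Running tests... 42 passed."}
</code></pre><p>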
This separation enables flicker-free updates, proper text selection capabilities, and smooth scrolling while maintaining the lightweight, keyboard-driven experience that developers expect from terminal applications. The architecture also allows for language-agnostic backend implementations, meaning the AI processing can be written in any language while the interface remains consistent.</p><p>While still in early development and <a href="https://willmcgugan.github.io/announcing-toad/">available to GitHub sponsors</a>, Toad highlights an important consideration for the future of terminal-based development tools: the quality of the user interface implementation significantly impacts developer productivity and adoption, even when the underlying AI capabilities are sophisticated.</p><p>Before I get carried away letting a single agent handle everything, it&#8217;s worth noting another development: why use one agent when you can use <em>many</em>? That leads us to orchestrators.</p><h2><strong>From solo agent to AI team: Orchestrating multiple coding agents</strong></h2><blockquote><p><strong>Parallel agent orchestration transforms a single helper into an AI &#8220;team&#8221;. You assign boundaries, isolate work via branching, and supervise a swarm &#8211; scaling development horizontally with human oversight.</strong></p></blockquote><p>If one AI coding assistant can make you twice as productive, what could <em>five or ten</em> working in parallel do? This question has led to the rise of <strong>agent orchestrators</strong> &#8211; tools that let you spin up multiple AI coding agents to work concurrently on a project. It&#8217;s like having an army of AI developers, each handling a piece of the work, with you as the manager overseeing them. 
This pattern is quickly becoming feasible on everyday developer machines.</p><p>One of the pioneers here is an open-source project called <strong><a href="https://github.com/smtg-ai/claude-squad#:~:text=GitHub%20github,in%20separate%20workspaces">Claude Squad</a></strong> (by @smtg-ai). Claude Squad is a terminal app that manages multiple Claude Code (and even other agents like OpenAI Codex or Aider) sessions in parallel. The idea is simple: each agent gets its own isolated Git workspace (e.g., using git worktree or branches) so they don&#8217;t step on each other&#8217;s toes, and you can assign different tasks to each. The tool provides a unified TUI (text UI) where you can monitor all agents at <a href="https://smtg-ai.github.io/claude-squad/#:~:text=Claude%20Squad%20,%C2%B7%20Review%20work%20before%20shipping">once</a>. Why do this? Because certain large tasks can be broken down. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vQDT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e667220-41ae-4010-b7c5-f32e8db52c6d_2700x1974.png"><img src="https://substackcdn.com/image/fetch/$s_!vQDT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e667220-41ae-4010-b7c5-f32e8db52c6d_2700x1974.png" width="1456" height="1064" alt="" loading="lazy"></a></figure></div><p>For example, if you have a big legacy codebase to modernize, you might run one agent to upgrade the frontend framework, another to refactor the database layer, and another to add tests &#8211; all in parallel. Claude Squad lets you supervise and coordinate this within one interface, and importantly, it <em>isolates the changes</em> until you choose to merge them, preventing conflicts. Early users have reported massive productivity boosts, essentially <em>multiplying</em> their output by N agents.</p><p>A similar commercial tool is <strong><a href="https://conductor.build/#:~:text=URL%3A%20https%3A%2F%2Fconductor,of%20Claude%20Codes%20in%20parallel">Conductor</a></strong> (currently Mac-only). Conductor&#8217;s tagline is literally <em>&#8220;Run a bunch of Claude Codes in parallel&#8221;</em>. It provides a desktop app where you connect a repo and deploy multiple Claude Code agents simultaneously &#8211; each agent gets its own workspace (Conductor uses git worktrees under the hood, as confirmed in their <a href="https://conductor.build/#:~:text=Does%20Conductor%20use%20worktrees%3F">FAQ</a>). </p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;380b4262-3559-452c-addc-fe245d8f388a&quot;,&quot;duration&quot;:null}"></div><p>The UI shows a list of agents and their status: you can see who&#8217;s working on what, who might be waiting for input, and importantly, what files have changed in each workspace. You act as the conductor (hence the name): assign tasks, monitor progress, and review code changes. Once satisfied, you can merge the changes back. As of now, Conductor supports Claude Code as the AI backend, but they hint at adding others soon. 
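</p><p>The isolation trick both tools lean on is <code>git worktree</code>, which checks out additional branches of one repository into separate directories. A minimal self-contained sketch (the throwaway repo and branch names are mine):</p><pre><code>set -e
git init -q demo
git -C demo -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "init"
# One directory and branch per agent, so their edits never collide
git -C demo worktree add ../agent-frontend -b agent/frontend-upgrade
git -C demo worktree add ../agent-db -b agent/db-refactor
git -C demo worktree list
</code></pre><p>When an agent finishes, you review its branch and merge it back &#8211; the &#8220;isolate, then merge&#8221; flow described above.</p><p>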
The fact that this is a user-friendly app (with a nice GUI) indicates that orchestrating multiple agents isn&#8217;t just a hacker&#8217;s experiment &#8211; it&#8217;s headed for mainstream developer workflows.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8cfk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ea50a86-86a1-4319-a817-47581bf340b3_4096x2768.webp"><img src="https://substackcdn.com/image/fetch/$s_!8cfk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ea50a86-86a1-4319-a817-47581bf340b3_4096x2768.webp" width="1456" height="984" alt="" loading="lazy"></a></figure></div><p><em>The Conductor app orchestrating multiple Claude Code agents on a project (each agent runs in an isolated git branch). The UI shows agent statuses and allows reviewing their code changes before merging.</em></p><p>Another exciting open-source project is <strong><a href="https://github.com/Dicklesworthstone/claude_code_agent_farm#:~:text=Claude%20Code%20Agent%20Farm%20is,scale%20code%20improvements">Claude Code Agent Farm</a></strong> by Jeffrey Emanuel (@doodlestein). With a name evoking &#8220;farming&#8221; a whole crop of Claude agents, it delivers on that imagery. Agent Farm is described as an orchestration framework to run <em>&#8220;multiple Claude Code sessions in parallel to systematically improve your codebase&#8221;</em>. It supports running <em>20 or more agents simultaneously (configurable up to 50!)</em>. 
</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;e6b12532-4e38-444e-bf1d-fecaf5152ccd&quot;,&quot;duration&quot;:null}"></div><p>This blows my mind a bit &#8211; 50 AI instances attacking a codebase in concert. Of course, to manage that, the framework includes some sophisticated coordination features: a <em>lock-based system to prevent conflicts</em>, so agents won&#8217;t overwrite each other&#8217;s changes on the same file. </p><p>It&#8217;s also highly configurable, supporting <em>34 different tech stacks</em> out of the box with custom workflow scripts for each. This means you can, say, unleash a swarm of agents to apply a set of best practices across a polyglot monorepo &#8211; one agent might handle updating React components, another fixes Python lint issues, another updates CI configs, etc., all guided by the workflows the framework provides. Agent Farm also provides a real-time dashboard with heartbeats and context usage warnings, automatic recovery if an agent crashes, and neat features like generating an HTML report of everything the agents did. Essentially, it&#8217;s turning the idea of a &#8220;code mod&#8221; (mass refactoring) into an AI-driven, parallelizable process.</p><p>While running dozens of agents locally might stress some machines (and certainly can rack up API usage costs), the concept is compelling. It reminds me of how build systems went from single-threaded to massively parallel &#8211; why not do the same for AI coding tasks? Many small changes that are independent (or can be partitioned by area) are perfect for parallelization. Of course, humans can&#8217;t effectively write or review 10 changes at once, but an orchestrator + agents can. The human (you) just needs to oversee the results.</p><blockquote><p><strong>Tip:</strong> Open-source models like Qwen3 have become capable enough that developers are seriously evaluating them for these use-cases. 
Do keep in mind cost for self-hosted usage <a href="https://x.com/StochasticGhost/status/1947945075230539881">can add up fast</a>, especially if running multiple agents async.</p></blockquote><p>Besides Claude-focused tools, we have <strong><a href="http://magnet.run">Magnet</a></strong> (by Nicolae Rusan and team), which brands itself as <em>&#8220;the AI workspace for agentic coding&#8221;</em>. Magnet is a bit like an IDE + orchestrator + project management hybrid. It allows async AI workflows triggered by issue trackers. </p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;f942378a-6a7f-4c17-a353-72997ea88500&quot;,&quot;duration&quot;:null}"></div><p>For <a href="https://www.linkedin.com/posts/nicolaerusan_magnet-magnetrun-now-offers-linear-github-activity-7095193885928214528-XQMe?trk=public_profile_like_view#:~:text=Magnet%20%28magnet,the%20codebase%20but%20also%20asks">example</a>, you can start an AI task directly from a Linear or GitHub issue inside Magnet. Magnet will automatically pull in relevant context (e.g. files related to that issue, based on the issue description), and even ask clarifying questions before proceeding. It&#8217;s designed to guide you end-to-end on a feature or bug fix. </p><p>One thing I love is that Magnet doesn&#8217;t just spew code &#8211; it often <em>suggests considerations or alternative approaches</em> to you as it works, acting like a thoughtful junior engineer might. Nicolae shared an <a href="https://www.linkedin.com/posts/nicolaerusan_magnet-magnetrun-now-offers-linear-github-activity-7095193885928214528-XQMe?trk=public_profile_like_view#:~:text=caught%20myself%2C%20leading%20me%20to,to%20implement%20a%20feature%20in">example</a> of shipping a new feature where Magnet not only correctly implemented it, but <em>introduced him to a package he didn&#8217;t know about that offered a better solution</em>. 
That kind of AI augmentation &#8211; not just doing the task but improving the solution &#8211; is powerful. Magnet also supports grouping tasks across multiple repositories (&#8220;Projects&#8221; feature). If a feature requires coordinated changes in backend, frontend, and database, the AI can handle each in the appropriate repo and tie them together. This hints at a future where AI agents aren&#8217;t limited to one codebase at a time, but can work across an entire software stack cohesively.</p><p>If running multiple agents sounds complex, it can be at first. But tools like the above are rapidly smoothing the experience. Even simple approaches work: I know folks who just open several terminal windows (using tmux, <a href="https://iterm2.com/">iTerm</a> splits, or the new <a href="https://ghostty.org/">Ghostty</a> terminal emulator) and run separate agent sessions manually in each. Ghostty, for instance, makes it <a href="https://www.bitdoze.com/ghostty-terminal/#:~:text=Ghostty%20Terminal%3A%20A%20Complete%20Setup,This%20feature">easy</a> to tile many terminals in one window, so you can visually monitor multiple agent sessions side by side. </p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;7b79b903-499e-417f-ab3f-741c54e3a388&quot;,&quot;duration&quot;:null}"></div><p>I&#8217;ve tried this &#8211; running e.g. Gemini CLI in one pane doing a frontend change, and Claude Code CLI in another doing a backend change. It actually worked out fine, though I had to manage integrating the changes myself. Purpose-built orchestrators like Claude Squad or Magnet handle a lot of that housekeeping (branching, merging, preventing conflicts) automatically, which is definitely nicer.</p><p>We are still in early days of figuring out <em>best practices for multi-agent dev</em>. 
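The lock-based coordination Agent Farm uses to keep agents off each other's files boils down to a small primitive: an agent atomically claims a file before editing it, and backs off if another agent got there first. A minimal sketch in Python (the lock directory and function names are my own illustration, not Agent Farm's actual implementation):

```python
import os
from pathlib import Path

# Hypothetical lock directory: one lock file per claimed source file.
LOCK_DIR = Path(".agent_locks")
LOCK_DIR.mkdir(exist_ok=True)

def _lock_path(target: str) -> Path:
    return LOCK_DIR / (target.replace("/", "__") + ".lock")

def try_lock(target: str, agent_id: str) -> bool:
    """Atomically claim `target` for one agent; False if another agent holds it."""
    try:
        # O_CREAT | O_EXCL makes creation atomic: exactly one agent can win.
        fd = os.open(_lock_path(target), os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.write(fd, agent_id.encode())
    os.close(fd)
    return True

def release(target: str) -> None:
    """Free the claim so other agents can edit the file."""
    _lock_path(target).unlink(missing_ok=True)
```

With this in place, two agents racing for the same file resolve deterministically: the first `try_lock` wins, the second skips the file and moves on to other work.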
Some challenges to consider: dividing tasks well (so that agents don&#8217;t depend on each other&#8217;s yet-to-be-done work), ensuring consistency (if two agents need to agree on an interface contract, how do we coordinate that?), and not overwhelming the human overseer. There&#8217;s research happening on having the agents themselves coordinate &#8211; e.g. one agent acts as a &#8220;planner&#8221; and assigns subtasks to other agent instances. That starts to sound like AI project managers and AI developers working as a team. I wouldn&#8217;t be surprised if in a year or two, we have an &#8220;AI scrum master&#8221; that can spin up and manage a swarm of coding agents dynamically.</p><p>For now, the onus is on the developer to partition work and review outputs. But even in this form, parallel AI development is a force multiplier. I&#8217;ve felt a bit like a tech lead directing multiple junior devs &#8211; except these juniors work at superhuman speed and never get tired of dull tasks. If this pattern becomes common, our development workflows might shift towards queuing up tasks for AI agents to tackle overnight or while we handle higher-level design. Imagine coming into work to find <strong>overnight AI PRs</strong> for all the refactoring tasks you queued up &#8211; ready for your review.</p><p>That segues nicely into the next topic: agents that work asynchronously in the background, producing code while you do other things.</p><h2><strong>Async background coders: agents that code while you aren't watching</strong></h2><blockquote><p><strong>Background agents turn coding into delegated background work: submit a task, let it run in the cloud, review a completed PR later - coding as queued, asynchronous workflow.</strong></p></blockquote><p>One of the promises of agentic coding is the ability to <em>offload coding tasks to an AI and let it work autonomously</em>, notifying you when it&#8217;s done or if it needs input. 
This frees you to do other work (or get some sleep!). Several efforts in 2024-2025 have focused on these <strong>asynchronous coding agents</strong>. They differ from the interactive CLI agents in that you don&#8217;t babysit them in a live session; instead, you trigger them and they come back with results later (minutes or hours later, depending on the task).</p><p>The poster child here is <strong><a href="https://jules.google/">Google&#8217;s Jules</a></strong>. Jules was first introduced in late 2024 as an experiment in Google Labs, and by May 2025 Google put it into <a href="https://blog.google/technology/google-labs/jules/#:~:text=Jules%20is%20an%20asynchronous%2C%20agentic,and%20performs%20tasks%20such%20as">public beta</a> for everyone. They explicitly describe Jules not as a &#8220;co-pilot&#8221; or autocomplete assistant, but as <em>&#8220;an autonomous agent that reads your code, understands your intent, and gets to work&#8221;</em>. </p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;8f6da879-8613-4ed4-a6cf-1444c1da814a&quot;,&quot;duration&quot;:null}"></div><p>Here&#8217;s how Jules works, in a nutshell:</p><ul><li><p>You integrate Jules with your GitHub account/repositories. (Jules runs as a cloud service &#8211; specifically, it spins up a secure Google Cloud VM for each task).</p></li><li><p>You give Jules a high-level task. This can be done through a web UI or VS Code plugin (and soon by simply labeling a GitHub issue with &#8220;assign-to-jules&#8221;!). 
For example: <em>&#8220;Write unit tests for all functions in the utils/ folder&#8221;</em> or <em>&#8220;Upgrade the project to Next.js v15 and refactor to the new app directory structure&#8221;</em>.</p></li><li><p>Jules clones your repo into the VM, <em>understands the full context</em> of the codebase (thanks to using Gemini 2.5 which can handle huge context), and then <strong>generates a plan</strong> of action for the task. It actually shows you this plan for approval. For instance, it might say: <em>&#8220;Plan: Update 22 files to migrate to Next.js 15 conventions&#8221;</em>.</p></li><li><p>Once you approve, Jules executes the plan autonomously. It makes the code changes, runs tests if relevant, and so on. This might involve multiple internal steps, but you don't need to supervise them.</p></li><li><p>When done, Jules presents you with the <em>diffs of all the changes</em> it made for review. You can browse through the code changes (in the UI or as a PR on GitHub).</p></li><li><p>If you&#8217;re happy, you can tell Jules to create a pull request with those changes. Jules will then open a PR on your repository with the commits. (At this point, it&#8217;s just regular code &#8211; you or teammates can review, run CI, and merge as usual).</p></li><li><p>As a bonus, Jules can generate an <strong>audio summary</strong> of the changes &#8211; a spoken changelog you can listen to, highlighting what was done. 
This is a neat touch; I&#8217;ve tried it and it feels like a quick way to catch up on an agent&#8217;s work while I&#8217;m, say, commuting or doing something away from the screen.</p></li></ul><div id="youtube2-Fm6MQpzwhwA" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;Fm6MQpzwhwA&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/Fm6MQpzwhwA?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>All of this happens asynchronously. Jules is doing the heavy lifting in the cloud VM, possibly utilizing parallelism under the hood (Google mentioned it can handle <em>concurrent tasks with speed and precision</em> thanks to the cloud setup). Meanwhile, you could be focusing on another part of the project, or multiple Jules tasks could even run in parallel. It really is like having a background worker.</p><blockquote><p>Some <a href="https://x.com/julesagent/status/1927791418942132659">Jules pro-tips</a> are available:</p><ul><li><p>For cleaner results with Jules, give each distinct job its own task. E.g., 'write documentation' and 'fix tests' should be separate tasks in Jules.</p></li><li><p>Help Jules write better code: When prompting, ask Jules to 'compile the project and fix any linter or compile errors' after coding.</p></li><li><p>Do you have an <em>instructions.md</em> or other prompt related markdown files? Explicitly tell Jules to review that file and use the contents as context for the rest of the task</p></li><li><p>Jules can surf the web! Give Jules a URL and it can do web lookups for info, docs, or examples</p></li></ul></blockquote><p>What kind of tasks is Jules good at? 
These include: <em>writing tests, building new features, providing audio changelogs, fixing bugs, bumping dependency versions</em>. The common theme is tasks that have a clear goal and can be relatively well-defined in a prompt. Jules shines at maintenance chores like dependency upgrades or adding tests &#8211; things we often procrastinate on as developers. A developer in the Jules beta shared on HN that they could just tell Jules <em>&#8220;achieve 100% test coverage&#8221;</em> and it went off to write a comprehensive test suite. Another user mentioned Jules automatically fixed a bug and made a PR, saying <em>&#8220;Jules just made her first contribution to a project I&#8217;m working on&#8221;</em> &#8211; a sentence that felt sci-fi only a year ago!</p><p>OpenAI, not to be outdone, <em>surprised the industry</em> by releasing a research preview of their own coding agent, referred to as <strong><a href="https://openai.com/index/introducing-codex/">Codex</a></strong>. This new Codex agent reportedly can <em>&#8220;write, fix bugs and answer codebase questions in a separate sandbox&#8221;</em>, much like Jules.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UG2H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980b9b2d-b350-433f-a894-09e3a9c2d534_1290x1074.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UG2H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980b9b2d-b350-433f-a894-09e3a9c2d534_1290x1074.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!UG2H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980b9b2d-b350-433f-a894-09e3a9c2d534_1290x1074.jpeg 848w, https://substackcdn.com/image/fetch/$s_!UG2H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980b9b2d-b350-433f-a894-09e3a9c2d534_1290x1074.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!UG2H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980b9b2d-b350-433f-a894-09e3a9c2d534_1290x1074.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UG2H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980b9b2d-b350-433f-a894-09e3a9c2d534_1290x1074.jpeg" width="480" height="399.6279069767442" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/980b9b2d-b350-433f-a894-09e3a9c2d534_1290x1074.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1074,&quot;width&quot;:1290,&quot;resizeWidth&quot;:480,&quot;bytes&quot;:75325,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/169029578?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980b9b2d-b350-433f-a894-09e3a9c2d534_1290x1074.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!UG2H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980b9b2d-b350-433f-a894-09e3a9c2d534_1290x1074.jpeg 424w, https://substackcdn.com/image/fetch/$s_!UG2H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980b9b2d-b350-433f-a894-09e3a9c2d534_1290x1074.jpeg 848w, https://substackcdn.com/image/fetch/$s_!UG2H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980b9b2d-b350-433f-a894-09e3a9c2d534_1290x1074.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!UG2H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F980b9b2d-b350-433f-a894-09e3a9c2d534_1290x1074.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>GitHub (and Microsoft) also announced <strong><a href="https://github.blog/news-insights/product-news/github-copilot-meet-the-new-coding-agent/">GitHub Copilot &#8220;agent&#8221;</a></strong> features around May 2025. At Build 2025, they demoed a <strong>Copilot that can handle entire coding workflows asynchronously</strong>. This includes monitoring your repo for to-do items, self-assigning tasks like generating a PR to fix an issue, etc. Essentially, GitHub is extending Copilot from inline suggestions to an autonomous mode that can act on your repository &#8211; in many ways, this sounds like their answer to Jules and Codex. Given GitHub&#8217;s deep integration in developer workflows, Copilot Agent could directly live in the GitHub UI (imagine a &#8220;Copilot, fix this issue&#8221; button on a Pull Request that triggers the agent).</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;6f2f5f96-8823-4078-8f01-34c73442aeda&quot;,&quot;duration&quot;:null}"></div><p>One challenge with background agents is <strong>trust</strong>. If an AI is working for an hour on your repo unsupervised, you want to be sure it doesn&#8217;t go off the rails. The current implementations mitigate risk by <em>operating in sandboxes or branches</em>, and requiring human review before changes get to main. Jules, for instance, works on a separate branch and requires you to approve its diff and PR. OpenAI&#8217;s Codex sandbox implies it doesn&#8217;t directly touch your actual repo until you merge its output. 
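That sandbox-plus-branch gate can be sketched with plain git: the agent only ever commits to its own branch, and nothing lands on main until a human has read the diff. The branch and file names below are hypothetical, and the sketch assumes git 2.28+ for `init -b`:

```python
import os
import subprocess
import tempfile

def git(*args: str, cwd: str) -> str:
    """Run a git command in `cwd` and return its stdout."""
    return subprocess.run(["git", *args], cwd=cwd, check=True,
                          capture_output=True, text=True).stdout

repo = tempfile.mkdtemp()
git("init", "-b", "main", cwd=repo)
git("config", "user.email", "agent@example.com", cwd=repo)
git("config", "user.name", "agent", cwd=repo)

# Baseline commit on main.
with open(os.path.join(repo, "app.py"), "w") as f:
    f.write("print('v1')\n")
git("add", ".", cwd=repo)
git("commit", "-m", "baseline", cwd=repo)

# The agent works only on its own branch, never on main.
git("checkout", "-b", "agent/task-1", cwd=repo)
with open(os.path.join(repo, "app.py"), "w") as f:
    f.write("print('v2')\n")
git("commit", "-am", "agent change", cwd=repo)

# Human checkpoint: the full diff is reviewable before any merge happens.
diff = git("diff", "main...agent/task-1", cwd=repo)
```

Until someone explicitly merges agent/task-1, main still serves the original code, which is exactly the safety property these tools rely on.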
This is good &#8211; it keeps the human in the loop at key checkpoints.</p><p>Another challenge is <strong>scoping the task well</strong>. If you give an overly broad instruction, an agent might attempt a huge refactor that becomes hard to validate. The best use cases are somewhat bounded tasks. I've found that even if you ultimately want something big, it helps to break it down. For example, instead of "migrate my whole app from Django to FastAPI" (which is enormous), you might start with "generate a plan for migrating from Django to FastAPI" and then feed sub-tasks. Future agents might be smart enough to do this decomposition themselves (some already attempt it &#8211; Jules internally creates a plan). But guiding them with manageable tasks makes for better outcomes today.</p><p>One very cool aspect of Jules: <em>it can handle multiple requests simultaneously in parallel</em>. That means if you have several tasks (write tests, update deps, etc.), Jules will utilize multiple agents or threads to do them concurrently in the cloud VM. In the demo at Google I/O, they showed Jules fixing several bugs in parallel. This parallelism in async mode is like combining the ideas of the previous section (multiple agents) with the cloud scale &#8211; you as a developer might just fire off a batch of tasks and come back later to find a set of PRs ready. It&#8217;s a glimpse of a near-future developer experience where a lot of the "busy work" in coding truly happens without constant human attention.</p><p>The developer community&#8217;s reaction to these async agents has been a mix of awe and healthy skepticism. Awe, because when it works, it feels magical (folks sharing on Twitter how Jules fixed something while they worked on something else). Skepticism, because sometimes the agents get things wrong or produce suboptimal code, and a human has to clean it up. 
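The scoping advice above (ask for a plan first, then feed each bounded step back as its own task) can be sketched as a two-phase loop; the plan contents and both stubs here are hypothetical:

```python
def plan(goal: str) -> list[str]:
    """Stand-in for the planning pass a real agent would generate."""
    return [
        f"{goal}: inventory the affected modules",
        f"{goal}: migrate one module and run its tests",
        f"{goal}: repeat per module, then delete legacy code",
    ]

def run_task(step: str) -> str:
    """Stand-in for one bounded, reviewable agent run."""
    return f"done: {step}"

# Broad goal in, sequence of small, verifiable tasks out.
goal = "migrate from Django to FastAPI"
results = [run_task(step) for step in plan(goal)]
```

Each sub-task produces something small enough to validate on its own, which is what keeps the overall migration reviewable.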
From my perspective, even if these agents are <a href="https://addyo.substack.com/p/the-70-problem-hard-truths-about">only 70% good</a> right now, they&#8217;re improving rapidly. And even 70% is a huge help if they can knock out the boring stuff.</p><p>I personally tried Jules on a few tasks in a side project &#8211; one being &#8220;upgrade this project to the latest version of Node.js and fix any compatibility issues&#8221;. It actually <em>did it</em>: it bumped the Node version in config, updated a couple of libraries, found some deprecated API usage and changed it, and all tests passed. It left a few things untouched that maybe could be improved, but it saved me a chunk of time. Another time, I asked it to generate an audio summary of recent changes in a repo, and listening to that felt like a mini podcast of my codebase&#8217;s progress. These experiences felt like early glimpses of offloading the &#8220;maintenance engineer&#8221; duties to an AI.</p><p>In summary, <strong>async coding agents</strong> like Jules and Codex are turning coding into a background activity for certain tasks. You assign a job and check back on results. It&#8217;s a different workflow &#8211; almost like how you delegate to a build server or a CI pipeline. In fact, it&#8217;s natural to integrate these agents with CI, which leads us to the next topic: what about using AI agents to <em>maintain</em> code quality and fix issues continuously?</p><h2>When multiple capabilities converge</h2><div><hr></div><blockquote><p><strong>Enterprise-grade platforms are emerging that combine CLI, IDE, orchestration, and async capabilities into unified development experiences, challenging the single-purpose tool approach.</strong></p></blockquote><p>While specialized tools excel in specific categories - CLI agents, orchestrators, or async coders - a new class of integrated platforms is emerging that combines multiple agentic capabilities into comprehensive development ecosystems. 
These platforms represent the maturation of AI coding tools from experimental single-purpose agents to production-ready development environments that can handle end-to-end workflows.</p><p>In recent months, <a href="https://ampcode.com/">Sourcegraph&#8217;s </a><strong><a href="https://ampcode.com/">Amp</a></strong> (or AmpCode) has gained significant attention, especially among enterprise users seeking robust AI coding agents. Positioned in a research preview phase, it already demonstrates unique strengths that distinguish it from other agents. Installation is seamless: users can enable Amp via a<a href="https://marketplace.visualstudio.com/items?itemName=sourcegraph.amp"> VS Code extension</a> (including Cursor, VSCodium, Insiders forks), or choose to run it directly as a <a href="https://www.npmjs.com/package/@sourcegraph/amp">command-line tool</a>. This flexibility reflects Amp&#8217;s design philosophy - deeply integrated yet adaptable to diverse developer workflows. Here&#8217;s a demo from Amp&#8217;s <a href="https://x.com/daniel_mac8/status/1945227816926167321">Daniel Mac</a> showing how to add a new model provider to an AI chat app, using 'Oracle', Amp&#8217;s o3-based sub-agent, to research and create a plan:</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;ab8601f1-7dd6-46d2-b2f8-f0462be8aa19&quot;,&quot;duration&quot;:null}"></div><p>One of Amp&#8217;s most discussed technical advantages is its fixed <strong>200,000-token context window</strong>. The design aggressively leverages this to retain as much relevant codebase context as possible. 
When conversations grow large, developers can use features like compact thread summarization or spawn new threads initialized with summaries - preserving continuity without exceeding token constraints (<a href="https://www.reddit.com/r/cursor/comments/1kpin6e/tried_amp_sourcegraphs_new_ai_coding_agent_heres">Reddit discussion</a>).</p><p>In terms of extensibility, Amp incorporates <a href="https://addyo.substack.com/p/mcp-what-it-is-and-why-it-matters">MCP</a> in a straightforward interface accessible from its chat UI. Pre-bundled MCP servers support dynamic tasks - such as generating mermaid graphs inside conversation threads - and developers can define command allowlists stored directly in their repository. These features form a thoughtful foundation for secure and auditable agent execution in enterprise environments (<a href="https://www.reddit.com/r/cursor/comments/1kpin6e/tried_amp_sourcegraphs_new_ai_coding_agent_heres">Reddit overview</a>). Here&#8217;s a demo from <a href="https://x.com/jdorfman/status/1940272202487914735">Justin Dorfman</a> using Playwright MCP, one of the pre-bundled MCPs, to identify slow-loading pages on localhost:</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;24209b2c-977e-44d7-a2e1-39e1b0204c03&quot;,&quot;duration&quot;:null}"></div><p>One user I spoke to appreciated Amp&#8217;s ability to act like a junior engineer - compiling, running tests, making coordinated changes across classes or modules - while another cautioned about token costs and the importance of review processes. As one thread noted, <em>&#8220;Amp does walk you through the changes it&#8217;s making&#8230; I actually prefer this over Cursor&#8217;s one-by-one change&#8230;&#8221;</em></p><p>However, Amp&#8217;s design choices are not without tradeoffs. 
Current iterations rely <a href="https://ampcode.com/manual#:~:text=Using%20Amp-,How%20to%20Prompt,-Amp%20currently%20uses">heavily</a> on <strong>Claude Sonnet</strong>, with little support for other models, open API keys, or private deployments. Sourcegraph frames this as a deliberate decision to deeply optimize around a single model rather than spreading across many, but it limits flexibility for organizations requiring model governance or internal hosting.</p><blockquote><p>&#8220;No model selector, always the best models. You don&#8217;t pick models, we do. Instead of offering selectors and checkboxes and building for the lowest common denominator, Amp is built to use the full capabilities of the best models.&#8221; is an intentional principle in <a href="https://ampcode.com/manual">their docs</a>.</p></blockquote><p>Several aspects of Amp&#8217;s state also raised caution among reviewers. Threads are stored on Sourcegraph&#8217;s servers by default, which may conflict with stringent data policies; edit operations can be auto-applied (though Git review remains available as a safety net); and features like leaderboards or shared prompts may feel misaligned with professional developer norms - even though they can be disabled upon user request. Context leakage across large monorepos was an initial concern, later clarified by insiders: Amp does not pull in entire repositories inadvertently - unlike its predecessor Cody - though prompt token usage remains nontrivial (<a href="https://www.reddit.com/r/cursor/comments/1kpin6e/tried_amp_sourcegraphs_new_ai_coding_agent_heres">Reddit thread</a>).</p><p>Across from Amp, <strong><a href="http://warp.dev">Warp&#8217;s Agentic Development Environment</a></strong> offers a very different vision of platform integration. 
Rethinking the terminal from the ground up,<a href="https://www.fastcompany.com/91356732/warps-new-agentic-development-environment-helps-developers-work-with-ai-coding-agents"> Warp 2.0</a> delivers a Rust-based, GPU-accelerated interface where traditional shell commands coexist with natural-language agent interactions. Warp allows users to <strong>manage multiple agent sessions in parallel</strong>, share commands via Warp Drive, and selectively approve AI-generated diffs before they are applied.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K9-e!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7057bd2-ebdf-4cb5-86ef-23d612beea07_1440x823.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K9-e!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7057bd2-ebdf-4cb5-86ef-23d612beea07_1440x823.jpeg 424w, https://substackcdn.com/image/fetch/$s_!K9-e!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7057bd2-ebdf-4cb5-86ef-23d612beea07_1440x823.jpeg 848w, https://substackcdn.com/image/fetch/$s_!K9-e!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7057bd2-ebdf-4cb5-86ef-23d612beea07_1440x823.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!K9-e!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7057bd2-ebdf-4cb5-86ef-23d612beea07_1440x823.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!K9-e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7057bd2-ebdf-4cb5-86ef-23d612beea07_1440x823.jpeg" width="1440" height="823" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b7057bd2-ebdf-4cb5-86ef-23d612beea07_1440x823.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:823,&quot;width&quot;:1440,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Warp Agents panel: your central hub for managing agents&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Warp Agents panel: your central hub for managing agents" title="Warp Agents panel: your central hub for managing agents" srcset="https://substackcdn.com/image/fetch/$s_!K9-e!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7057bd2-ebdf-4cb5-86ef-23d612beea07_1440x823.jpeg 424w, https://substackcdn.com/image/fetch/$s_!K9-e!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7057bd2-ebdf-4cb5-86ef-23d612beea07_1440x823.jpeg 848w, https://substackcdn.com/image/fetch/$s_!K9-e!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7057bd2-ebdf-4cb5-86ef-23d612beea07_1440x823.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!K9-e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7057bd2-ebdf-4cb5-86ef-23d612beea07_1440x823.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><blockquote><p>Warp agents notify you when they need human help: approving a code diff, running a kubectl command, approving a commit message. 
Plus, you can keep track of all your agents in one centralized panel.</p></blockquote><p>Warp&#8217;s approach is to create a workspace that is neither IDE nor plain terminal: it is built to let developers coordinate and supervise AI agents across tasks like deployment, debugging, and log analysis. Here&#8217;s a demo I liked by <a href="https://x.com/zachlloydtweets/status/1917292654708035658">Zach Lloyd</a> showing how he uses Warp&#8217;s Agent Mode to update PR descriptions as he codes:</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;3fbf66d0-c26f-44f5-8d19-6bc4bd2da3d5&quot;,&quot;duration&quot;:null}"></div><p>Warp&#8217;s controls restrict risky actions (for example, deleting files) until explicit human consent is given. This agentic model mirrors classic command-line workflows but adds integrated AI oversight.</p><p>Warp's performance credentials are impressive - scoring #1 on<a href="https://www.tbench.ai/"> Terminal-Bench</a> and achieving 71% on SWE-Bench Verified demonstrates that its integrated approach doesn't sacrifice capability for convenience. 
According to<a href="https://www.producthunt.com/products/warp/reviews?review=1285338"> user feedback</a>, users value Warp&#8217;s speed, command-sharing features, secret redaction and seamless customization of terminal blocks - all positioning it as a terminal that truly harnesses AI while preserving developer control.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hu9L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d665dc-98d4-48ff-a68a-e0cafd4cb194_2564x1290.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hu9L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d665dc-98d4-48ff-a68a-e0cafd4cb194_2564x1290.png 424w, https://substackcdn.com/image/fetch/$s_!hu9L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d665dc-98d4-48ff-a68a-e0cafd4cb194_2564x1290.png 848w, https://substackcdn.com/image/fetch/$s_!hu9L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d665dc-98d4-48ff-a68a-e0cafd4cb194_2564x1290.png 1272w, https://substackcdn.com/image/fetch/$s_!hu9L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d665dc-98d4-48ff-a68a-e0cafd4cb194_2564x1290.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hu9L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d665dc-98d4-48ff-a68a-e0cafd4cb194_2564x1290.png" width="1456" height="733" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/44d665dc-98d4-48ff-a68a-e0cafd4cb194_2564x1290.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:733,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:158066,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/169029578?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d665dc-98d4-48ff-a68a-e0cafd4cb194_2564x1290.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hu9L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d665dc-98d4-48ff-a68a-e0cafd4cb194_2564x1290.png 424w, https://substackcdn.com/image/fetch/$s_!hu9L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d665dc-98d4-48ff-a68a-e0cafd4cb194_2564x1290.png 848w, https://substackcdn.com/image/fetch/$s_!hu9L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d665dc-98d4-48ff-a68a-e0cafd4cb194_2564x1290.png 1272w, https://substackcdn.com/image/fetch/$s_!hu9L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44d665dc-98d4-48ff-a68a-e0cafd4cb194_2564x1290.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>That said,<a href="https://www.reddit.com/r/WarpTerminal/comments/1ljoj7k/introducing_warp_20_the_agentic_development"> Reddit commentary</a> flags areas for improvement in large-scale use: some users report indexing challenges with large repositories, and concerns around high request volume - prompt cost multiplied across heavy agent activity - are increasingly common. On the other hand, you can toggle Warp&#8217;s AI functionality off entirely; with AI disabled, the terminal reverts to its previous lightweight state.</p><p>Taken together, <strong>Amp and Warp</strong> exemplify two complementary trajectories in agentic developer tooling. Amp is designed for deep code-intelligence workflows and enterprise collaboration, excelling where coordinated refactoring across extensive codebases matters. 
Warp instead focuses on synergies between command-line power and agent orchestration, delivering a modern terminal that supports supervised AI activity across development tasks.</p><p>Both platforms signal a broader industry shift from narrow, single-purpose assistants toward holistic, outcome-oriented AI environments. Developers can now work with agents capable of multitasking, context-driven planning, and end-to-end execution - minimizing manual intervention while ensuring procedural oversight and enterprise readiness. As individual tools, they embody the new generation of AI-augmented development platforms rather than incremental, model-bound utilities.</p><h2><strong>Self-healing codebases? AI in testing, debugging and CI/CD</strong></h2><blockquote><p><strong>CI becomes self&#8209;healing when agents not only detect failures but propose, validate, and apply fixes - making builds proactively resilient and developer flow unbroken.</strong></p></blockquote><p>Coding doesn&#8217;t stop at writing features. A huge part of the SDLC is testing, debugging, code review, and continuous integration (CI) &#8211; ensuring that the software works and continues to work with each change. It&#8217;s in these &#8220;adjacent&#8221; stages that AI is also making a big impact, often in ways that complement the coding agents we discussed. After all, what good is an AI-generated PR if it breaks your build or introduces subtle bugs? Thankfully, we are seeing patterns for <strong>AI-assisted testing and CI</strong> that could make our codebases more resilient (perhaps <em>more</em> resilient than purely human-driven ones).</p><p>One exciting development is <strong>AI-powered self-healing CI pipelines</strong>. The idea: when your CI tests fail, instead of just notifying you and waiting for you to fix it, an AI agent can automatically diagnose and even fix the issue. 
The Nx team (creators of the Nx build system for monorepos) recently announced <strong><a href="https://nx.dev/blog/nx-self-healing-ci#:~:text=Every%20developer%20knows%20this%20workflow%3A">Nx Cloud Self-Healing CI</a></strong>, which does exactly this. </p><div id="youtube2-JW5Ki3PkRWA" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;JW5Ki3PkRWA&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/JW5Ki3PkRWA?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>They described a scenario: you push some code, and CI fails due to a silly mistake (like a missing import or a small test assertion fail). Normally, you might not notice for 30 minutes, then have to context-switch to fix it, push again, and waste time. With Self-Healing CI, what happens instead is &#8220;magical&#8221;:</p><ul><li><p><strong>Failure detected</strong>: Upon a test failure, automatically start an AI agent to investigate.</p></li><li><p><strong>AI diagnosis</strong>: The agent looks at the error logs and, importantly, has knowledge of your codebase structure. So it knows where to look.</p></li><li><p><strong>Proposes a fix</strong>: The agent comes up with a fix &#8211; e.g., add the missing import, adjust the test expectation, etc. &#8211; and presents it to you, either in your IDE or as a comment on the PR.</p></li><li><p><strong>Validates the fix</strong>: In parallel, the agent runs the tests again (or the subset that failed) with the proposed change to ensure it actually resolves the issue.</p></li><li><p><strong>Human approval</strong>: You get a notification (e.g., in VS Code) that says &#8220;AI fix available: added missing import X&#8221;. You can review the one-line change. 
If it looks good and the validation passed, you hit approve.</p></li><li><p><strong>Apply and re-run</strong>: Once approved, the fix commit is applied to your PR automatically, and the full CI runs again (which now likely passes).</p></li></ul><p>All of this can happen in a short time window. </p><p>Crucially, they emphasize that the developer <strong>stays in control</strong> &#8211; the AI doesn&#8217;t just commit changes on its own without your OK. This &#8220;human in the loop&#8221; approach is wise; it builds trust. Over time, if these AI fixes prove consistently correct, maybe teams will auto-approve minor ones, but it&#8217;s good that initial designs assume oversight.</p><p>Self-healing CI addresses a real pain: so much time is lost in the dev cycle due to trivial breakages and the latency of feedback. If an AI can shave that down, it keeps developers in flow. And it&#8217;s not just Nx; we&#8217;re seeing similar ideas elsewhere. There&#8217;s a tool called <strong><a href="https://gitauto.ai/docs/triggers/test-failure#:~:text=GitAuto%20Test%20Failure%20Trigger%20,cause%2C%20and%20creates%20fix%20commits">GitAuto</a></strong> that can analyze a failing CI run and automatically create a fix PR with the necessary changes. CircleCI published a guide on using AI (with their platform&#8217;s APIs) to resolve test failures with zero guesswork. GitLab is integrating AI to <a href="https://about.gitlab.com/blog/quickly-resolve-broken-ci-cd-pipelines-with-ai/#:~:text=Quickly%20resolve%20broken%20CI%2FCD%20pipelines,within%20the%20DevSecOps%20platform">suggest fixes</a> right within Merge Request UI when something fails.</p><p>Beyond CI, consider <strong>runtime debugging</strong>. 
There was a rather bold experiment shared on Reddit: a developer let an AI agent <a href="https://www.reddit.com/r/ChatGPTCoding/comments/1jibmtc/i_made_ai_fix_my_bugs_in_production_for_27_days/#:~:text=I%20made%20AI%20fix%20my,wanted%20to%20share%20the%20results">attempt to fix</a> production bugs <em>for 27 days straight</em>, automatically generating PRs for any exceptions caught in production. According to their post, the AI managed to resolve a bunch of issues on its own, and the dev team only intervened occasionally. That&#8217;s a bit bleeding-edge and risky for many, but it shows where things could head: an AI ops agent that monitors logs/errors and continuously improves the code.</p><p><strong>AI code review</strong> is another adjacent area. I anticipate that having an &#8220;AI reviewer&#8221; will become common. GitHub is already previewing features where Copilot will automatically suggest improvements in a pull request, highlight insecure code, or summarize a PR&#8217;s changes for the human reviewers. Open source projects like <strong><a href="https://github.com/qodo-ai/pr-agent">pr-agent</a></strong> hook models into your PR workflow to do things like explain the code changes or point out potential issues. While not fully trusted to approve/deny PRs, these AI reviews are like having a diligent junior reviewer who never gets tired. They can enforce style guides, catch obvious bugs, ensure test coverage, etc. I&#8217;ve used one such tool to get a second opinion on my PRs &#8211; sometimes it&#8217;s surface-level, but occasionally it points out something I missed, like <em>&#8220;This function isn&#8217;t handling X case; is that intentional?&#8221;</em>. It&#8217;s easy to see every team having an AI reviewer integrated into GitHub or GitLab, raising the baseline quality of code reviews.</p><p>Let&#8217;s talk about <strong>flaky tests and maintenance</strong> as well. 
Companies like <a href="http://trunk.io">Trunk.io</a> advertise an <em>&#8220;AI DevOps agent&#8221;</em> that, among other things, can detect flaky tests, quarantine them, or suggest fixes. Maintenance tasks such as updating dependencies, cleaning up warnings, etc., can be handled by periodic AI agent runs. Some teams have scheduled jobs where an AI opens PRs weekly for dependency bumps (tools like Dependabot do minor version bumps, but an AI could handle major upgrades that need code changes). This merges into the idea of <em>continuous improvement bots</em>.</p><p>In essence, the <em>agentic future</em> isn&#8217;t just writing new features &#8211; it&#8217;s also maintaining and improving code continuously. A lot of engineering effort (some say ~30-50%) in a mature codebase goes into maintenance, refactoring, keeping tests green, etc. If AI can take a chunk of that, it frees humans to focus on more complex design and new functionality.</p><p>To illustrate how far things have come, consider this: Just two years ago, &#8220;AI in coding&#8221; mostly meant code completion or maybe a bot that could create a simple PR from a template. Now we&#8217;re talking about AI autonomously <strong>monitoring, coding, and fixing</strong> software in a loop. And it&#8217;s actually happening in real projects.</p><p>Of course, developers still have to make judgment calls. Not every failing test should be &#8220;fixed&#8221; by changing the code &#8211; sometimes the test caught a real bug in the logic or the fix might have side effects. So a human needs to ensure the AI&#8217;s solution is correct for the broader requirements. This again highlights the evolving role: we become the reviewers and approvers, ensuring the AI&#8217;s outputs align with what&#8217;s truly needed.</p><p><strong>In summary</strong>, the future of coding involves AI not just in writing code but in <em>verifying and polishing</em> it. 
</p><p>We&#8217;ll have agents that act as testers, reviewers, and ops engineers. The codebase could become more of a living thing that partially maintains itself. This doesn&#8217;t eliminate the need for human oversight &#8211; rather, it raises the baseline so that humans deal with the more complex issues. </p><p>Now, let&#8217;s step back and consider how all these pieces come together in our day-to-day tools and workflows. The line between &#8220;coding&#8221; and &#8220;prompting&#8221; is blurring. IDEs are changing, and new ones are emerging, built from the ground up for this agentic paradigm.</p><p><strong>Multi-modal and &#8220;full stack&#8221; agents</strong> &#8211; As AI models gain capabilities like vision, we&#8217;ll see development agents that can handle not just code, but also UI design, graphic assets, etc. OpenAI&#8217;s <em>Operator</em> agent for web browsing shows that an AI can <a href="https://openai.com/index/introducing-operator/#:~:text=Operator%20is%20powered%20by%20a,people%20see%20on%20a%20screen">operate</a> a browser UI like a human. Translate that to dev tools: an AI could use a GUI builder or drag-drop interface. Perhaps the agent of the future could open Figma designs and generate corresponding code, or vice versa. In fact, Anthropic&#8217;s Claude Code via MCP can pull in design docs or Figma assets to understand what needs to be built. That&#8217;s an early step toward multi-modal development assistance.</p><p><strong>Collaboration with human developers</strong> is also being rethought. For instance, suppose a team of 5 devs is working on a project. In the future, maybe each dev has their own AI agent that learns their coding style and assists them, and the agents also communicate with each other (with permission) to ensure their code changes align. 
It sounds wild, but I imagine something like a &#8220;team of 5 devs + 5 AI sidekicks&#8221; where the AIs exchange notes (like one agent says &#8220;hey, Alice&#8217;s agent, I&#8217;m updating the API interface, you might want to update Alice&#8217;s UI code usage of it&#8221;). This would essentially mirror how a well-coordinated team works, but at light speed. Early hints of this appear in orchestrator tools that share context between agents. Magnet&#8217;s approach of clarifying questions and context sharing between tasks is a manual form of that. Perhaps soon the agents will negotiate task splits themselves.</p><p><strong>The norm likely to emerge:</strong> A few years down the line, I suspect it will be completely normal for a software engineer to have a development environment where:</p><ul><li><p>You describe what you want at a high level, and an AI agent generates an initial implementation.</p></li><li><p>You then engage in a back-and-forth with the agent to refine it (maybe using both CLI and IDE interactions).</p></li><li><p>If it&#8217;s a big task, you delegate subtasks to multiple agents (possibly with specialized skills).</p></li><li><p>When you hit &#8220;run tests&#8221; or &#8220;deploy&#8221;, AI agents automatically fix small issues and ensure the pipeline is green.</p></li><li><p>Code reviews are partially automated &#8211; AI gives feedback, humans focus on higher-level concerns.</p></li><li><p>Documentation and ancillary artifacts are updated by AI.</p></li><li><p>The developer&#8217;s main job is defining goals, constraints, reviewing important changes, and making architectural decisions. Essentially the human provides vision and oversight, the AI handles execution details.</p></li></ul><p>We&#8217;re already living a prototype of that workflow in 2025; it&#8217;s just not evenly distributed. Some bleeding-edge teams are close to this, while others are just now trying a Copilot suggestion for the first time. 
But the trajectory is clear enough that I feel confident this will be widespread.</p><h2><strong>Challenges, Limitations, and the human touch</strong></h2><blockquote><p><strong>As humans shift from coder to conductor, the core skill becomes oversight - writing good briefs, reviewing agent output, and catching edge&#8209;case failures. AI amplifies mistakes if unchecked.</strong></p></blockquote><p>Before concluding, it&#8217;s important to temper the excitement with some reality checks. As amazing as these tools are becoming, they introduce challenges &#8211; technical, ethical, and professional.</p><p><strong>Quality and correctness</strong>: AI agents can produce wrong or inefficient code. They lack true understanding and may not foresee edge cases. We&#8217;ve all seen LLMs hallucinate or make up facts; in code, that might mean introducing a subtle bug or using an outdated API. Testing mitigates this, but not everything is easily testable (e.g. did the AI inadvertently create a security vulnerability or a performance bottleneck?). So we still need skilled engineers to audit and refine AI output. In the near future, one of the most valuable skills will be <em>AI-assisted debugging</em>: understanding where the AI&#8217;s reasoning might have gone astray and correcting it. It&#8217;s the &#8220;<a href="https://addyo.substack.com/p/the-trust-but-verify-pattern-for">verify and validate</a>&#8221; loop where humans excel.</p><p><strong>Prompt engineering and task specification</strong>: Getting the best out of an agent requires clear communication. If you underspecify a task, the AI might do something unexpected. If you overspecify, you might as well code it yourself. We&#8217;ll need to learn how to write effective &#8220;AI specs&#8221; &#8211; akin to writing a good spec for a junior developer, but even more explicit at times. Interestingly, this might force us to think more clearly about what we want before diving in, which could be a good thing. 
I&#8217;ve found myself writing out the desired behavior in detail for the AI, and realizing in the process that I had gaps in my own plan.</p><p><strong>Cost considerations</strong>: While many examples we discussed have free tiers or personal limits (Gemini CLI free access, Jules free beta), ultimately someone pays for those cycles. Running dozens of agents in parallel might burn through tokens (and $$) quickly if not managed. Organizations will have to factor AI compute into their dev costs. It&#8217;s possible that AI-augmented development, despite saving time, could increase cloud costs (at least until model inference gets cheaper). Open-source models running locally offer an alternative for some cases, but the best capabilities still reside in large proprietary models as of 2025. However, given the competition, we may see more cost-effective options or even hardware optimized for these tasks (imagine your dev machine having a built-in AI accelerator to run 20 local model instances cheaply).</p><p>Despite these challenges, I&#8217;m optimistic. Every new abstraction in programming (from assembly to high-level, from manual memory to garbage collection, etc.) faced skepticism, yet ultimately made us more productive and enabled building more complex systems. Agentic AI for coding feels like a similar leap. We just have to apply the same rigor we always have in software engineering: code review, testing, monitoring in production &#8211; those practices don&#8217;t go away; they adapt to include AI in the loop.</p><h2><strong>Conclusion</strong></h2><p>The way we write software in the post-2025 world is transforming into something both exhilarating and a bit surreal. 
Programming is becoming less about typing out every line and more about <strong>guiding, supervising, and collaborating</strong> with these agentic tools.</p><p>To recap the key patterns shaping this future:</p><ul><li><p><strong>AI coding agents (CLI and IDE)</strong> are enabling us to work at a higher level of abstraction &#8211; focusing on <em>what</em> we want to build, while the AI figures out <em>how</em> in code. Tools like Claude Code, Gemini CLI, OpenCode, Cursor, and Windsurf exemplify this, each bringing its own twist but all moving in the same direction.</p></li><li><p><strong>Parallel AI development</strong> allows scaling our efforts horizontally &#8211; it&#8217;s now feasible for a single developer to supervise multiple AI &#8220;developers&#8221; working simultaneously. Early orchestrators (Claude Squad, Conductor, Agent Farm, Magnet) show that this can drastically speed up complex or large-scale code modifications.</p></li><li><p><strong>Asynchronous agents</strong> like Jules and Codex let coding happen in the background, turning software development into a more continuous, autonomous process. You can go to lunch and come back to find a feature implemented or a bug fixed (with a PR and diff ready to review).</p></li><li><p><strong>AI in testing/CI</strong> closes the loop, catching and fixing issues so that the code an AI writes is also verified by AI. Self-healing CI, automated test generation, and AI code reviews collectively push us toward a world of <em>self-maintaining codebases</em>.</p></li><li><p><strong>AI-first workflows and environments</strong> are emerging, from fully AI-driven IDEs to deeper integration in platforms like GitHub. 
They point to a norm where having multiple AI &#8220;assistants&#8221; as part of your dev team is just how things are done.</p></li><li><p><strong>The developer&#8217;s role is shifting</strong> to higher-level decision making, providing oversight, and handling the creative and complex aspects that AI still struggles with. We&#8217;ll increasingly act as architects and conductors of software, not just bricklayers of code.</p></li><li><p><strong>New tools and projects are popping up constantly</strong>, especially in open source. The community on Reddit, Hacker News, and Twitter is actively discussing and iterating on these ideas. For every major product like Copilot or Jules, there&#8217;s an open-source equivalent or experiment that often sparks the next innovation.</p></li></ul><p>So, how do we prepare and adapt? My approach has been to <strong>embrace these tools early</strong> and experiment with integrating them into real workflows. If you&#8217;re a software engineer reading this, I encourage you to try some of the mentioned tools on a personal project. Experience the feeling of hitting a button and watching an AI write your tests or refactor your code. It&#8217;s eye-opening. Simultaneously, practice the skill of critically reviewing AI output. Treat it like you would a human colleague&#8217;s work &#8211; with respect but also scrutiny.</p><p>As we head into this agentic world, I keep an eye on one guiding principle: <strong>developer experience (DX)</strong>. The best tools will be the ones that <em>feel natural</em> in our workflow and amplify our abilities without getting in the way. It&#8217;s easy to be seduced by autonomy for its own sake, but ultimately these agents must serve the developer&#8217;s intent. </p><p>I&#8217;m encouraged that many of the projects I discussed are being built by developers for developers, with active feedback loops. 
</p><p><em>I&#8217;m excited to share I&#8217;m writing a new <a href="https://www.oreilly.com/library/view/vibe-coding-the/9798341634749/">AI-assisted engineering book</a> with O&#8217;Reilly. If you&#8217;ve enjoyed my writing here you may be interested in checking it out.</em></p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!pxOC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa723053-031c-4c09-ac2d-283213dda75f_2912x2096.png" width="1456" height="1048" alt="" loading="lazy"></figure></div>]]></content:encoded></item><item><title><![CDATA[Context Engineering: Bringing Engineering Discipline to Prompts]]></title><description><![CDATA[A practical guide to information architecture of AI prompts]]></description><link>https://addyo.substack.com/p/context-engineering-bringing-engineering</link><guid isPermaLink="false">https://addyo.substack.com/p/context-engineering-bringing-engineering</guid><dc:creator><![CDATA[Addy Osmani]]></dc:creator><pubDate>Sun, 13 Jul 2025 19:34:01 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xA9A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17b73676-5487-4cf2-8240-813e707703b7_7869x7869.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR:</strong> <em>&#8220;Context engineering&#8221; 
means providing an AI (like an LLM) with all the information and tools it needs to successfully complete a task &#8211; not just a cleverly worded prompt. It&#8217;s the evolution of <a href="https://addyo.substack.com/p/the-prompt-engineering-playbook-for">prompt engineering</a>, reflecting a broader, more system-level approach. </em></p><h2><strong>Context engineering tips:</strong></h2><p><strong>To get the best results from an AI, you need to provide clear and specific context. The quality of the AI's output directly depends on the quality of your input.</strong></p><p><strong>How to improve your AI prompts</strong></p><ul><li><p><strong>Be precise:</strong> Vague requests lead to vague answers. The more specific you are, the better your results will be.</p></li><li><p><strong>Provide relevant code: </strong>Share the specific files, folders, or code snippets that are central to your request.</p></li><li><p><strong>Include design documents: </strong>Paste or attach sections from relevant design docs to give the AI the bigger picture.</p></li><li><p><strong>Share full error logs: </strong>For debugging, always provide the complete error message and any relevant logs or stack traces.</p></li><li><p><strong>Show database schemas: </strong>When working with databases, a screenshot of the schema helps the AI generate accurate code for data interaction.</p></li><li><p><strong>Use PR feedback: </strong>Comments from a pull request make for context-rich prompts.</p></li><li><p><strong>Give examples: </strong>Show an example of what you want the final output to look like.</p></li><li><p><strong>State your constraints: </strong>Clearly list any requirements, such as libraries to use, patterns to follow, or things to avoid.</p></li></ul><h2><strong>From &#8220;Prompt Engineering&#8221; to &#8220;Context Engineering&#8221;</strong></h2><p><strong>Prompt engineering was about cleverly phrasing a question; context engineering is about constructing an entire information 
environment so the AI can solve the problem reliably.</strong></p><p>&#8220;Prompt engineering&#8221; became a buzzword essentially meaning the skill of phrasing inputs to get better outputs. It taught us to &#8220;program in prose&#8221; with clever one-liners. But outside the AI community, many took prompt engineering to mean just typing fancy requests into a chatbot. The term never fully conveyed the real sophistication involved in using LLMs effectively.</p><p>As applications grew more complex, the limitations of focusing only on a single prompt became obvious. One analysis quipped: <em>Prompt engineering walked so context engineering could run.</em> In other words, a witty one-off prompt might have wowed us in demos, but building <strong>reliable, industrial-strength LLM systems</strong> demanded something more comprehensive.</p><p>This realization is why our field is coalescing around <strong>&#8220;context engineering&#8221;</strong> as a better descriptor for the craft of getting great results from AI. Context engineering means constructing the entire <strong>context window</strong> an LLM sees &#8211; not just a short instruction, but all the relevant background info, examples, and guidance needed for the task.</p><p>The phrase was popularized by developers like Shopify&#8217;s CEO Tobi L&#252;tke and AI leader Andrej Karpathy in mid-2025. </p><blockquote><p><em>&#8220;I really like the term &#8216;context engineering&#8217; over prompt engineering,&#8221;</em> wrote Tobi. <em>&#8220;It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM.&#8221;</em> Karpathy emphatically agreed, noting that <em>people associate prompts with short instructions, whereas in every serious LLM application, <strong>context engineering</strong> is the delicate art and science of filling the context window with just the right information for each step</em>. 
</p></blockquote><p>In other words, real-world LLM apps don&#8217;t succeed by luck or one-shot prompts &#8211; they succeed by carefully assembling context around the model&#8217;s queries.</p><p>The change in terminology reflects an evolution in approach. If prompt engineering was about coming up with a magical sentence, context engineering is <a href="https://analyticsindiamag.com/ai-features/context-engineering-is-the-new-vibe-coding/#:~:text=If%20prompt%20engineering%20was%20about,in%20favour%20of%20context%20engineering">about</a> <strong>writing the full screenplay</strong> for the AI. It&#8217;s a structural shift: prompt engineering ends once you craft a good prompt, whereas context engineering begins with designing whole systems that bring in memory, knowledge, tools, and data in an organized way. </p><p>As Karpathy explained, doing this well involves everything from <strong>clear task instructions</strong> and explanations, to providing few-shot examples, retrieved facts (RAG), possibly multimodal data, relevant tools, state history, and careful compacting of all that into a limited window. <strong>Too little context (or the wrong kind) and the model will lack the information to perform optimally; too much irrelevant context and you waste tokens or even degrade performance.</strong> <strong>The sweet spot is non-trivial to find.</strong> No wonder Karpathy calls it both a science and an art.</p><p>The term <strong>context engineering</strong> is catching on because it intuitively captures what we actually do when building LLM solutions. &#8220;Prompt&#8221; sounds like a single short query; &#8220;context&#8221; implies a richer information state we prepare for the AI. </p><p>Semantics aside, why does this shift matter? Because it marks a maturing of our mindset for AI development. We&#8217;ve learned that <strong>generative AI in production is less like casting a single magic spell and more like engineering an entire environment</strong> for the AI. 
A one-off prompt might get a cool demo, but for robust solutions you need to control what the model &#8220;knows&#8221; and &#8220;sees&#8221; at each step. It often means retrieving relevant documents, summarizing history, injecting structured data, or providing tools &#8211; whatever it takes so the model isn&#8217;t guessing in the dark. The result is we no longer think of prompts as one-off instructions we hope the AI can interpret. We think in terms of <strong>context pipelines</strong>: all the pieces of information and interaction that set the AI up for success.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!wHL3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb844018-f295-4063-93ea-6aa8cf72e322_1024x1024.png" width="548" height="548" alt="" loading="lazy"></figure></div><p>To illustrate, consider the difference in perspective. Prompt engineering was often an exercise in clever wording (&#8220;Maybe if I phrase it this way, the LLM will do what I want&#8221;). Context engineering, by contrast, feels more like traditional engineering: <em>What inputs (data, examples, state) does this system need? How do I get those and feed them in? In what format? 
At what time?</em> We&#8217;ve essentially gone from squeezing performance out of a single prompt to designing <em>LLM-powered systems</em>.</p><h2><strong>What Exactly </strong><em><strong>Is</strong></em><strong> Context Engineering?</strong></h2><p><strong>Context engineering means dynamically giving an AI everything it needs to succeed &#8211; the instructions, data, examples, tools, and history &#8211; all packaged into the model&#8217;s input context at runtime.</strong></p><p>A useful <a href="https://blog.langchain.com/context-engineering-for-agents/#:~:text=As%20Andrej%20Karpathy%20puts%20it%2C,Karpathy%20summarizes%20this%20well">mental model</a> (suggested by Andrej Karpathy and others) is to think of an LLM like a CPU, and its context window (the text input it sees at once) as the RAM or working memory. As an engineer, your job is akin to an operating system: <strong>load that working memory with just the right code and data for the task</strong>. In practice, this context can come from many sources: the user&#8217;s query, system instructions, retrieved knowledge from databases or documentation, outputs from other tools, and summaries of prior interactions. Context engineering is about orchestrating all these pieces into the prompt that the model ultimately sees. 
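</p><p>In sketch form, that &#8220;operating system&#8221; job is a function that assembles the context window from several sources at request time. Here <code>retrieve_docs</code> and <code>summarize_history</code> are hypothetical stand-ins for whatever retrieval and memory layers a real system would use:</p>

```python
def retrieve_docs(query: str, top_k: int = 3) -> list[str]:
    # Stand-in for a real vector search over your docs (illustrative only).
    corpus = {
        "auth": "Tokens expire after 24h; refresh them via /auth/refresh.",
        "db": "Always use connection pooling for database queries.",
    }
    return [text for key, text in corpus.items() if key in query.lower()][:top_k]

def summarize_history(history: list[str]) -> str:
    # Stand-in for an LLM-generated summary: keep only the last few turns.
    return " / ".join(history[-3:])

def build_context(query: str, state: dict) -> str:
    """Assemble the context window for one model call from several sources."""
    sections = ["You are an expert coding assistant."]   # instructional context
    for doc in retrieve_docs(query):                     # knowledge context (retrieval)
        sections.append(f'Relevant documentation:\n"{doc}"')
    if state.get("history"):                             # state / memory
        sections.append("Conversation summary:\n" + summarize_history(state["history"]))
    sections.append(f"User question: {query}")           # the live query, last
    return "\n\n".join(sections)

context = build_context("Why does auth fail?", {"history": ["We discussed the login flow."]})
```

<p>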
It&#8217;s not a static prompt but a dynamic assembly of information at runtime.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!nNCu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21bfaea3-9430-4fb7-97c5-1c984ec1ae87_1024x1024.png" width="550" height="550" alt="" loading="lazy"></figure></div><p><em>Illustration: multiple sources of information are composed into an LLM&#8217;s context window (its &#8220;working memory&#8221;). The context engineer&#8217;s goal is to fill that window with the right information, in the right format, so the model can accomplish the task effectively.</em></p><p>Let&#8217;s break down what this involves:</p><ul><li><p><strong>It&#8217;s a system, not a one-off prompt.</strong> In a well-engineered setup, the final prompt the LLM sees might include several components: e.g. a role instruction written by the developer, plus the latest user query, plus relevant data fetched on the fly, plus perhaps a few examples of desired output format. All of that is woven together programmatically. 
For example, imagine a coding assistant AI that gets the query &#8220;How do I fix this authentication bug?&#8221; The system behind it might automatically search your codebase for related code, retrieve the relevant file snippets, and then construct a prompt like: <em>&#8220;You are an expert coding assistant. The user is facing an authentication bug. Here are relevant code snippets: [code]. The user&#8217;s error message: [log]. Provide a fix.&#8221;</em> Notice how that final prompt is built from multiple pieces. <strong>Context engineering is the logic that decides which pieces to pull in and how to join them.</strong> It&#8217;s akin to writing a function that prepares arguments for another function call &#8211; except here the &#8220;arguments&#8221; are bits of context and the function is the LLM invocation.</p></li><li><p><strong>It&#8217;s dynamic and situation-specific.</strong> Unlike a single hard-coded prompt, context assembly happens <em>per request</em>. The system might include different info depending on the query or the conversation state. If it&#8217;s a multi-turn conversation, you might include a summary of the conversation so far, rather than the full transcript, to save space (and sanity). If the user&#8217;s question references some document (&#8220;What does the design spec say about X?&#8221;), the system might fetch that spec from a wiki and include the relevant excerpt. In short, context engineering logic <em>responds</em> to the current state &#8211; much like how a program&#8217;s behavior depends on input. This dynamic nature is crucial. You wouldn&#8217;t feed a translation model the exact same prompt for every sentence you translate; you&#8217;d feed it the new sentence each time. 
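</p><p>A minimal sketch of that per-request compaction: keep the latest turns verbatim and fold older ones into a digest. The character budget here is a simplification; a production system would count tokens and generate the digest with an LLM call:</p>

```python
def compact_history(turns: list[str], keep_recent: int = 4, budget_chars: int = 800) -> str:
    """Keep the latest turns verbatim; fold older ones into a one-line digest."""
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    parts = []
    if older:
        # A production system would ask an LLM for this digest instead.
        parts.append("Summary of earlier discussion: " + "; ".join(t[:60] for t in older))
    parts.extend(recent)
    # Hard character cap as a last resort; real systems budget in tokens.
    return "\n".join(parts)[-budget_chars:]

turns = [f"turn {i}" for i in range(10)]
compacted = compact_history(turns)
```

<p>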
Similarly, in an AI agent, you&#8217;re constantly updating what context you give as the state evolves.</p></li><li><p><strong>It blends multiple types of content.</strong> LangChain <a href="https://blog.langchain.com/context-engineering-for-agents/#:~:text=What%20are%20the%20types%20of,a%20few%20different%20context%20types">describes</a> context engineering as an umbrella that covers at least three facets of context: (1) <strong>Instructional context</strong> &#8211; the prompts or guidance we provide (including system role instructions and few-shot examples), (2) <strong>Knowledge context</strong> &#8211; domain information or facts we supply, often via retrieval from external sources, and (3) <strong>Tools context</strong> &#8211; information coming from the model&#8217;s environment via tools or API calls (e.g. results from a web search, database query, or code execution). A robust LLM application often needs all three: clear instructions about the task, relevant knowledge plugged in, and possibly the ability for the model to use tools and then incorporate the tool results back into its thinking. Context engineering is the discipline of managing all these streams of information and merging them coherently.</p></li><li><p><strong>Format and clarity matter.</strong> It&#8217;s not just <em>what</em> you include in the context, but <em>how</em> you present it. Communicating with an AI model has surprising parallels to communicating with a human: if you dump a huge blob of unstructured text, the model might get confused or miss the point, whereas a well-organized input will guide it. Part of context engineering is figuring out how to compress and structure information so the model grasps what&#8217;s important. This could mean summarizing long texts, using bullet points or headings to highlight key facts, or even formatting data as JSON or pseudo-code if that helps the model parse it. 
For instance, if you retrieved a document snippet, you might preface it with something like &#8220;Relevant documentation:&#8221; and put it in quotes, so the model knows it&#8217;s reference material. If you have an error log, you might show only the last 5 lines rather than 100 lines of stack trace. Effective context engineering often involves creative <strong>information design</strong> &#8211; making the input as digestible as possible for the LLM.</p></li></ul><p>Above all, context engineering is about <strong>setting the AI up for success</strong>. </p><p>Remember, an LLM is powerful but not psychic &#8211; it can only base its answers on what&#8217;s in its input plus what it learned during training. If it fails or hallucinates, often the root cause is that we didn&#8217;t give it the right context, or we gave it poorly structured context. When an LLM &#8220;agent&#8221; misbehaves, usually <em>&#8220;the appropriate context, instructions and tools have not been communicated to the model.&#8221;</em> Garbage in, garbage out. Conversely, if you <em>do</em> supply all the relevant info and clear guidance, the model&#8217;s performance improves dramatically.</p><p><strong>Feeding high-quality context: practical tips</strong></p><p>Now, concretely, how do we ensure we&#8217;re giving the AI everything it needs? Here are some pragmatic tips that I&#8217;ve found useful when building AI coding assistants and other LLM apps:</p><ul><li><p><strong>Include relevant source code and data.</strong> If you&#8217;re asking an AI to work on code, provide the relevant code files or snippets. Don&#8217;t assume the model will recall a function from memory &#8211; show it the actual code. Similarly, for Q&amp;A tasks include the pertinent facts or documents (via retrieval). <em>Low context guarantees low-quality output.</em> The model can&#8217;t answer what it hasn&#8217;t been given.</p></li><li><p><strong>Be precise in instructions.</strong> Clearly state what you want. 
If you need the answer in a certain format (JSON, specific style, etc.), mention that. If the AI is writing code, specify constraints like which libraries or patterns to use (or avoid). Ambiguity in your request can lead to meandering answers.</p></li><li><p><strong>Provide examples of the desired output.</strong> Few-shot examples are powerful. If you want a function documented in a certain style, show one or two examples of properly documented functions in the prompt. Modeling the output helps the LLM understand exactly what you&#8217;re looking for.</p></li><li><p><strong>Leverage external knowledge.</strong> If the task needs domain knowledge beyond the model&#8217;s training (e.g. company-specific details, API specs), retrieve that info and put it in the context. For instance, attach the relevant section of a design doc or a snippet of the API documentation. LLMs are far more accurate when they can cite facts from provided text rather than recalling from memory.</p></li><li><p><strong>Include error messages and logs when debugging.</strong> If asking the AI to fix a bug, show it the full error trace or log snippet. These often contain the critical clue needed. Similarly, include any test outputs if asking why a test failed.</p></li><li><p><strong>Maintain conversation history (smartly).</strong> In a chat scenario, feed back important bits of the conversation so far. Often you don&#8217;t need the entire history &#8211; a concise summary of key points or decisions can suffice and saves token space. This gives the model context of what&#8217;s already been discussed.</p></li><li><p><strong>Don&#8217;t shy away from metadata and structure.</strong> Sometimes telling the model <em>why</em> you&#8217;re giving a piece of context can help. For example: <em>&#8220;Here is the user&#8217;s query.&#8221;</em> or <em>&#8220;Here are relevant database schemas:&#8221;</em> as prefacing labels. 
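</p><p>Put together, labeling sections and trimming noisy inputs might look like this sketch (the function and field names are illustrative, not from any particular library):</p>

```python
def format_debug_prompt(question: str, error_log: str, schema: str) -> str:
    """Label each piece of context and trim noisy inputs before sending."""
    log_tail = "\n".join(error_log.splitlines()[-5:])  # last lines usually hold the clue
    return (
        "Relevant database schema:\n" + schema.strip() + "\n\n"
        "Error log (last 5 lines):\n" + log_tail + "\n\n"
        "User question: " + question
    )

log = "\n".join(f"line {i}" for i in range(20))
prompt = format_debug_prompt("Why does login return a 500?", log, "users(id INT, email TEXT)")
```

<p>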
Simple section headers like &#8220;User Input: &#8230; / Assistant Response: &#8230;&#8221; help the model parse multi-part prompts. Use formatting (markdown, bullet lists, numbered steps) to make the prompt logically clear.</p></li></ul><p>Remember the golden rule: <strong>LLMs are powerful but they aren&#8217;t mind-readers.</strong> The quality of output is directly proportional to the quality and relevance of the context you provide. Too little context (or missing pieces) and the AI will fill gaps with guesses (often incorrect). Irrelevant or noisy context can be just as bad, leading the model down the wrong path. So our job as context engineers is to feed the model exactly what it needs and nothing it doesn&#8217;t.</p><h2>Addressing the skeptics</h2><p>Let&#8217;s be direct about the criticisms. Many experienced developers see &#8220;context engineering&#8221; as either rebranded prompt engineering or, worse, pseudoscientific buzzword creation. These concerns aren&#8217;t unfounded, but the distinction is real: traditional prompt engineering focuses on the instructions you give an LLM, while context engineering encompasses the entire information ecosystem &#8211; dynamic data retrieval, memory management, tool orchestration, and state maintenance across multi-turn interactions. The skeptics also have a point about rigor: much of current AI work lacks the discipline we expect from engineering fields. There&#8217;s too much trial-and-error, not enough measurement, and insufficient systematic methodology. And let&#8217;s be honest: even with perfect context engineering, LLMs still hallucinate, make logical errors, and fail at complex reasoning. Context engineering isn&#8217;t a silver bullet &#8211; it&#8217;s damage control and optimization within current constraints. 
</p><h2><strong>The Art and Science of Effective Context</strong></h2><p><strong>Great context engineering strikes a balance &#8211; include everything the model truly needs, but avoid irrelevant or excessive detail that could distract it (and drive up cost).</strong></p><p>As Karpathy described, context engineering is a delicate mix of <em>science</em> and <em>art</em>. </p><p>The &#8220;science&#8221; part involves following certain principles and techniques to systematically improve performance. For example: if you&#8217;re doing code generation, it&#8217;s practically a given that you should include relevant code and error messages; if you&#8217;re doing question-answering, it&#8217;s logical to retrieve supporting documents and provide them to the model. There are established methods like few-shot prompting, retrieval-augmented generation (RAG), and chain-of-thought prompting that we know (from research and trial) can boost results. There&#8217;s also a science to respecting the model&#8217;s constraints &#8211; every model has a context length limit, and overstuffing that window can not only increase latency/cost but potentially <em>degrade</em> the quality if the important pieces get lost in the noise.</p><blockquote><p>Karpathy summed it up well: <em>&#8220;Too little or of the wrong form and the LLM doesn&#8217;t have the right context for optimal performance. Too much or too irrelevant and the LLM costs might go up and performance might come down.&#8221;</em></p></blockquote><p>So the science is in techniques for selecting, pruning, and formatting context optimally. For instance, using embeddings to find the most relevant docs to include (so you&#8217;re not inserting unrelated text), or compressing long histories into summaries. 
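</p><p>The selection-and-pruning idea can be sketched with a toy relevance score; a real system would swap in embedding similarity and a proper token counter:</p>

```python
def select_snippets(query: str, snippets: list[str], budget_words: int = 120) -> list[str]:
    """Greedily pick the most query-relevant snippets that fit a word budget.

    Relevance here is toy word overlap; swap in embedding similarity for real use."""
    query_words = set(query.lower().split())
    ranked = sorted(snippets,
                    key=lambda s: len(query_words & set(s.lower().split())),
                    reverse=True)
    chosen, used = [], 0
    for snippet in ranked:
        cost = len(snippet.split())
        if used + cost <= budget_words:
            chosen.append(snippet)
            used += cost
    return chosen

snippets = [
    "auth token refresh flow",
    "unrelated billing notes " * 30,   # long and irrelevant: should be dropped
    "auth error causes and fixes",
]
picked = select_snippets("auth error", snippets, budget_words=20)
```

<p>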
Researchers have even catalogued failure modes of <em>long</em> contexts &#8211; things like <strong>context poisoning</strong> (where an earlier hallucination in the context leads to further errors) or <strong>context distraction</strong> (where too much extraneous detail causes the model to lose focus). Knowing these pitfalls, a good engineer will curate the context carefully.</p><p>Then there&#8217;s the &#8220;art&#8221; side &#8211; the intuition and creativity born of experience. </p><p>This is about understanding LLM quirks and subtle behaviors. Think of it like a seasoned programmer who &#8220;just knows&#8221; how to structure code for readability: an experienced context engineer develops a feel for how to structure a prompt for a given model. For example, you might sense that one model tends to do better if you first outline a solution approach before diving into specifics, so you include an initial step like &#8220;Let&#8217;s think step by step&#8230;&#8221; in the prompt. Or you notice that the model often misunderstands a particular term in your domain, so you preemptively clarify it in the context. These aren&#8217;t in a manual &#8211; you learn them by observing model outputs and iterating. <strong>This is where prompt-crafting (in the old sense) still matters</strong>, but now it&#8217;s in service of the larger context. It&#8217;s similar to software design patterns: there&#8217;s science in understanding common solutions, but art in knowing when and how to apply them.</p><p>Let&#8217;s explore a few common strategies and patterns context engineers use to craft effective contexts:</p><ul><li><p><strong>Retrieval of relevant knowledge:</strong> One of the most powerful techniques is Retrieval-Augmented Generation. If the model needs facts or domain-specific data that isn&#8217;t guaranteed to be in its training memory, have your system fetch that info and include it. 
For example, if building a documentation assistant, you might vector-search your documentation and insert the top matching passages into the prompt before asking the question. This way, the model&#8217;s answer will be grounded in real data you provided, rather than its sometimes outdated internal knowledge. Key skills here include designing good search queries or embedding spaces to get the right snippet, and formatting the inserted text clearly (with citations or quotes) so the model knows to use it. When LLMs &#8220;hallucinate&#8221; facts, it&#8217;s often because we failed to provide the actual fact &#8211; retrieval is the antidote to that.</p></li><li><p><strong>Few-shot examples and role instructions:</strong> This harkens back to classic prompt engineering. If you want the model to output something in a particular style or format, show it examples. For instance, to get structured JSON output, you might include a couple of example inputs and outputs in JSON in the prompt, then ask for a new one. Few-shot context effectively teaches the model by example. Likewise, setting a <strong>system role</strong> or persona can guide tone and behavior (&#8220;You are an expert Python developer helping a user&#8230;&#8221;). These techniques are staples because they work: they bias the model towards the patterns you want. In the context-engineering mindset, prompt wording and examples are just one part of the context, but they remain crucial. In fact, you could say prompt engineering (crafting instructions and examples) is now a <em>subset</em> of context engineering &#8211; it&#8217;s one tool in the toolkit. We still care a lot about phrasing and demonstrative examples, but we&#8217;re also doing all these other things around them.</p></li><li><p><strong>Managing state and memory:</strong> Many applications involve multiple turns of interaction or long-running sessions. 
The context window isn&#8217;t infinite, so a major part of context engineering is deciding how to handle conversation history or intermediate results. A common technique is <strong>summary compression</strong> &#8211; after each few interactions, summarize them and use the summary going forward instead of the full text. For example, Anthropic&#8217;s Claude assistant automatically does this when conversations get lengthy, to avoid context overflow (you&#8217;ll see it produce a &#8220;[Summary of previous discussion]&#8221; that condenses earlier turns). Another tactic is to explicitly write important facts to an external store (a file, database, etc.) and then later retrieve them when needed rather than carrying them in every prompt. This is like an external memory. Some advanced agent frameworks even let the LLM generate &#8220;notes to self&#8221; that get stored and can be recalled in future steps. The art here is figuring out <strong>what</strong> to keep, <strong>when</strong> to summarize, and <strong>how</strong> to resurface past info at the right moment. Done well, it lets an AI maintain coherence over very long tasks &#8211; something that pure prompting would struggle with.</p></li><li><p><strong>Tool use and environmental context:</strong> Modern AI agents can use tools (e.g. calling APIs, running code, web browsing) as part of their operation. When they do, each tool&#8217;s output becomes new context for the next model call. Context engineering in this scenario means instructing the model <em>when and how</em> to use tools and then feeding the results back in. 
For example, an agent might have a rule: &#8220;If the user asks a math question, call the calculator tool.&#8221; After using it, the result (say 42) is inserted into the prompt: <em>&#8220;Tool output: 42.&#8221;</em> This requires formatting the tool output clearly and maybe adding a follow-up instruction like <em>&#8220;Given this result, now answer the user&#8217;s question.&#8221;</em> A lot of work in agent frameworks (LangChain, etc.) is essentially context engineering around tool use &#8211; giving the model a list of available tools, syntactic guidelines for invoking them, and templating how to incorporate results. The key is that you, the engineer, <strong>orchestrate</strong> this dialogue between the model and the external world.</p></li><li><p><strong>Information formatting and packaging:</strong> We&#8217;ve touched on this, but it deserves emphasis. Often you have more info than fits or is useful to include fully. So you compress or format it. If your model is writing code and you have a large codebase, you might include just function signatures or docstrings rather than entire files, to give it context. If the user query is verbose, you might highlight the main question at the end to focus the model. Use headings, code blocks, tables &#8211; whatever structure best communicates the data. For example, rather than: &#8220;User data: [massive JSON]&#8230; Now answer question.&#8221; you might extract the few fields needed and present: &#8220;User&#8217;s Name: X, Account Created: Y, Last Login: Z.&#8221; This is both easier for the model to parse and uses fewer tokens. In short, think like a UX designer, but your &#8220;user&#8221; is the LLM &#8211; design the prompt for <strong>its</strong> consumption.</p></li></ul><p>The impact of these techniques is huge. When you see an impressive LLM demo solving a complex task (say, debugging code or planning a multi-step process), you can bet it wasn&#8217;t just a single clever prompt behind the scenes. 
There was a pipeline of context assembly enabling it. </p><p>For instance, an AI pair programmer might implement a workflow like: </p><ol><li><p>Search the codebase for relevant code</p></li><li><p>Include those code snippets in the prompt with the user&#8217;s request</p></li><li><p>If the model proposes a fix, run tests in the background</p></li><li><p>If tests fail, feed the failure output back into the prompt for the model to refine its solution</p></li><li><p>Loop until tests pass. </p></li></ol><p>Each step has carefully engineered context: the search results, the test outputs, etc., are each fed into the model in a controlled way. It&#8217;s a far cry from &#8220;just prompt an LLM to fix my bug&#8221; and hoping for the best.</p><h2>The challenge of context rot</h2><p>As we get better at assembling rich context, we run into a new problem: context can actually poison itself over time. This phenomenon, aptly termed "context rot" by developer <a href="https://news.ycombinator.com/item?id=44308711#44310054">Workaccount2</a> on Hacker News, describes how <strong>context quality degrades as conversations grow longer and accumulate distractions, dead ends, and low-quality information.</strong></p><p>The pattern is frustratingly common: you start a session with a well-crafted context and clear instructions. The AI performs beautifully at first. But as the conversation continues - especially if there are false starts, debugging attempts, or exploratory rabbit holes - the context window fills with increasingly noisy information. The model's responses gradually become less accurate, more confused, or start hallucinating. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RZQa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6aefa3f-7cf3-4aba-b2c3-1ed4aa5ce3d6_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RZQa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6aefa3f-7cf3-4aba-b2c3-1ed4aa5ce3d6_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!RZQa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6aefa3f-7cf3-4aba-b2c3-1ed4aa5ce3d6_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!RZQa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6aefa3f-7cf3-4aba-b2c3-1ed4aa5ce3d6_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!RZQa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6aefa3f-7cf3-4aba-b2c3-1ed4aa5ce3d6_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RZQa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6aefa3f-7cf3-4aba-b2c3-1ed4aa5ce3d6_1024x1024.png" width="550" height="550" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c6aefa3f-7cf3-4aba-b2c3-1ed4aa5ce3d6_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:550,&quot;bytes&quot;:1179851,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/168177841?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6aefa3f-7cf3-4aba-b2c3-1ed4aa5ce3d6_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RZQa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6aefa3f-7cf3-4aba-b2c3-1ed4aa5ce3d6_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!RZQa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6aefa3f-7cf3-4aba-b2c3-1ed4aa5ce3d6_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!RZQa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6aefa3f-7cf3-4aba-b2c3-1ed4aa5ce3d6_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!RZQa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6aefa3f-7cf3-4aba-b2c3-1ed4aa5ce3d6_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Why does this happen? Context windows aren't just storage - they're the model's working memory. When that memory gets cluttered with failed attempts, contradictory information, or tangential discussions, it's like trying to work at a desk covered in old drafts and unrelated papers. The model struggles to identify what's currently relevant versus what's historical noise. Earlier mistakes in the conversation can compound, creating a feedback loop where the model references its own poor outputs and spirals further off track.</p><p>This is especially problematic in iterative workflows - exactly the kind of complex tasks where context engineering shines. Debugging sessions, code refactoring, document editing, or research projects naturally involve false starts and course corrections. 
But each failed attempt leaves traces in the context that can interfere with subsequent reasoning.</p><p>Practical strategies for managing context rot include:</p><ul><li><p><strong>Context pruning and refresh:</strong> Workaccount2's solution is "I work around it by regularly making summaries of instances, and then spinning up a new instance with fresh context and feed in the summary of the previous instance." This approach preserves the essential state while discarding the noise. You're essentially doing garbage collection for your context.</p></li><li><p><strong>Structured context boundaries:</strong> Use clear markers to separate different phases of work. For example, explicitly mark sections as "Previous attempts (for reference only)" versus "Current working context." This helps the model understand what to prioritize.</p></li><li><p><strong>Progressive context refinement:</strong> After significant progress, consciously rebuild the context from scratch. Extract the key decisions, successful approaches, and current state, then start fresh. It's like refactoring code&#8212;occasionally you need to clean up the accumulated cruft.</p></li><li><p><strong>Checkpoint summaries:</strong> At regular intervals, have the model summarize what's been accomplished and what the current state is. Use these summaries as seeds for fresh context when starting new sessions.</p></li><li><p><strong>Context windowing:</strong> For very long tasks, break them into phases with natural boundaries where you can reset context. Each phase gets a clean start with only the essential carry-over from the previous phase.</p></li></ul><p>This challenge also highlights why "just dump everything into the context" isn't a viable long-term strategy. 
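</p><p>Several of these strategies boil down to the same loop: measure how full the context is, and when it crosses a budget, checkpoint and restart. A sketch of that garbage-collection pass, where <code>call_model</code> is a stand-in for whatever LLM client you use and the token estimate is a deliberately crude heuristic (real systems would use a proper tokenizer):</p><pre><code class="language-python">TOKEN_BUDGET = 4000  # assumed per-session budget

def estimate_tokens(messages):
    # Crude heuristic: roughly four characters per token.
    return sum(len(m) // 4 for m in messages)

def maybe_refresh(messages, call_model):
    # When history nears the budget, replace it with a checkpoint summary
    # and carry on with a fresh, compact context.
    if estimate_tokens(messages) >= TOKEN_BUDGET:
        summary = call_model(
            "Summarize the key decisions, current state, and open questions:\n"
            + "\n".join(messages)
        )
        return ["[Checkpoint summary] " + summary]
    return messages
</code></pre><p>Run between turns, this keeps the essential state while discarding the dead ends that cause rot.</p><p>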
Like good software architecture, <strong>good context engineering requires intentional information management</strong> - deciding not just what to include, but when to exclude, summarize, or refresh.</p><h2><strong>Context engineering in the Big Picture of LLM applications</strong></h2><p><strong>Context engineering is crucial, but it&#8217;s just one component of a larger stack needed to build full-fledged LLM applications &#8211; alongside things like control flow, model orchestration, tool integration, and guardrails.</strong></p><p>In Karpathy&#8217;s words, context engineering is <em>&#8220;one small piece of an emerging thick layer of non-trivial software&#8221;</em> that powers real LLM apps. So while we&#8217;ve focused on how to craft good context, it&#8217;s important to see where that fits in the overall architecture.</p><p>A production-grade LLM system typically has to handle many concerns beyond just prompting, for example:</p><ul><li><p><strong>Problem decomposition and control flow:</strong> Instead of treating a user query as one monolithic prompt, robust systems often break the problem down into sub-tasks or multi-step workflows. For instance, an AI agent might first be prompted to outline a plan, then in subsequent steps be prompted to execute each step. Designing this flow (which prompts to call in what order, how to decide branching or looping) is a classic programming task &#8211; except the &#8220;functions&#8221; are LLM calls with context. Context engineering fits here by making sure each step&#8217;s prompt has the info it needs, but the decision <em>to have steps at all</em> is a higher-level design. This is why you see frameworks where you essentially write a script that coordinates multiple LLM calls and tool uses.</p></li><li><p><strong>Model selection and routing:</strong> You might use different AI models for different jobs. 
Perhaps a lightweight model for simple tasks or preliminary answers, and a heavyweight model for final solutions. Or a code-specialized model for coding tasks versus a general model for conversational tasks. The system needs logic to route requests to the appropriate model. Each model might have different context length limits or formatting requirements, which the context engineering must account for (e.g. truncating context more aggressively for a smaller model). This aspect is more engineering than prompting: think of it as matching the tool to the job.</p></li><li><p><strong>Tool integrations and external actions:</strong> If your AI can perform actions (like calling an API, database queries, opening a web page, running code), your software needs to manage those capabilities. That includes providing the AI with a list of available tools and instructions on usage, as well as actually executing those tool calls and capturing the results. As we discussed, the results then become new context for further model calls. Architecturally, this means your app often has a loop: prompt model -&gt; if model output indicates a tool to use -&gt; execute tool -&gt; incorporate result -&gt; prompt model again. Designing that loop reliably is a challenge.</p></li><li><p><strong>User interaction and UX flows:</strong> Many LLM applications involve the user in the loop. For example, a coding assistant might propose changes and then ask the user to confirm applying them. Or a writing assistant might offer a few draft options for the user to pick from. These UX decisions affect context, too. If the user says &#8220;Option 2 looks good but shorten it,&#8221; you need to carry that feedback into the next prompt (e.g. <em>&#8220;The user chose draft 2 and asks to shorten it.&#8221;</em>). Designing a smooth human-AI interaction flow is part of the app, though not directly about prompts. 
Still, context engineering supports it by ensuring each turn&#8217;s prompt accurately reflects the state of the interaction (like remembering which option was chosen, or what the user edited manually).</p></li><li><p><strong>Guardrails and safety:</strong> In production, you have to consider misuse and errors. This might include content filters (to prevent toxic or sensitive outputs), authentication and permission checks for tools (so the AI doesn&#8217;t, say, delete a database because it was in the instructions), and validation of outputs. Some setups use a second model or rules to double-check the first model&#8217;s output. For example, after the main model generates an answer, you might run another check: <em>&#8220;Does this answer contain any sensitive info? If so, redact it.&#8221;</em> Those checks themselves can be implemented as prompts or as code. In either case, they often add additional instructions into the context (like a system message: <em>&#8220;If the user asks for disallowed content, refuse.&#8221;</em> is part of many deployed prompts). So the context might always include some safety boilerplate. Balancing that (ensuring the model follows policy without compromising helpfulness) is yet another piece of the puzzle.</p></li><li><p><strong>Evaluation and monitoring:</strong> Suffice to say, you need to constantly monitor how the AI is performing. Logging every request and response (with user consent and privacy in mind) allows you to analyze failures and outliers. You might incorporate real-time evals &#8211; e.g., scoring the model&#8217;s answers on certain criteria and if the score is low, automatically having the model try again or route to a human fallback. While evaluation isn&#8217;t part of generating a single prompt&#8217;s content, it feeds back into improving prompts and context strategies over time. 
Essentially, you treat the prompt and context assembly as something that can be <em>debugged</em> and optimized using data from production.</p></li></ul><p>We&#8217;re really talking about <strong>a new kind of application architecture</strong>. It&#8217;s one where the core logic involves managing information (context) and adapting it through a series of AI interactions, rather than just running deterministic functions. Karpathy listed elements like control flows, model dispatch, memory management, tool use, verification steps, etc., on top of context filling. All together, they form what he jokingly calls &#8220;an emerging thick layer&#8221; for AI apps &#8211; thick because it&#8217;s doing a lot! When we build these systems, we&#8217;re essentially writing meta-programs: programs that choreograph another &#8220;program&#8221; (the AI&#8217;s output) to solve a task.</p><p>For us software engineers, this is both exciting and challenging. It&#8217;s exciting because it opens capabilities we didn&#8217;t have &#8211; e.g., building an assistant that can handle natural language, code, and external actions seamlessly. It&#8217;s challenging because many of the techniques are new and still in flux. We have to think about things like prompt versioning, AI reliability, and ethical output filtering, which weren&#8217;t standard parts of app development before. In this context, <strong>context engineering lies at the heart</strong> of the system: if you can&#8217;t get the right information into the model at the right time, nothing else will save your app. But as we see, even perfect context alone isn&#8217;t enough; you need all the supporting structure around it.</p><p>The takeaway is that <strong>we&#8217;re moving from prompt design to system design</strong>. Context engineering is a core part of that system design, but it lives alongside many other components. 
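</p><p>The prompt-model, execute-tool, incorporate-result loop described above can be sketched in a few lines. The <code>TOOL:</code> convention and the function names here are illustrative assumptions, not any particular framework's API:</p><pre><code class="language-python">def run_agent(user_goal, call_model, tools, max_steps=5):
    # Orchestrate the model-tool dialogue: each tool result is fed
    # back in as fresh context for the next model call.
    context = [f"Goal: {user_goal}"]
    for _ in range(max_steps):
        reply = call_model("\n".join(context))
        if reply.startswith("TOOL:"):
            name, _, arg = reply[5:].partition(" ")
            result = tools[name](arg)
            context.append(f"Tool output ({name}): {result}")
        else:
            return reply  # final answer, no tool requested
    return "Stopped: step limit reached"
</code></pre><p>Production frameworks add schemas, retries, and guardrails around this skeleton, but the core choreography is exactly this.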
</p><h2><strong>Conclusion</strong></h2><p><strong>Key takeaway:</strong> <em>By mastering the assembly of complete context (and coupling it with solid testing), we can increase the chances of getting the best output from AI models.</em></p><p>For experienced engineers, much of this paradigm is familiar at its core &#8211; it&#8217;s about good software practices &#8211; but applied in a new domain. Think about it:</p><ul><li><p>We always knew <strong>garbage in, garbage out</strong>. Now that principle manifests as &#8220;bad context in, bad answer out.&#8221; So we put more work into ensuring quality input (context) rather than hoping the model will figure it out.</p></li><li><p>We value <strong>modularity and abstraction</strong> in code. Now we&#8217;re effectively abstracting tasks to a high level (describe the task, give examples, let AI implement) and building modular pipelines of AI + tools. We&#8217;re orchestrating components (some deterministic, some AI) rather than writing all logic ourselves.</p></li><li><p>We practice <strong>testing and iteration</strong> in traditional dev. Now we&#8217;re applying the same rigor to AI behaviors, writing evals and refining prompts as one would refine code after profiling.</p></li></ul><p>In embracing context engineering, you&#8217;re essentially saying: <em>I, the developer, am responsible for what the AI does.</em> It&#8217;s not a mysterious oracle; it&#8217;s a component I need to configure and drive with the right data and rules. </p><p>This mindset shift is empowering. It means we don&#8217;t have to treat the AI as unpredictable magic &#8211; we can tame it with solid engineering techniques (plus a bit of creative prompt artistry).</p><p>Practically, how can you adopt this context-centric approach in your work?</p><ul><li><p><strong>Invest in data and knowledge pipelines.</strong> A big part of context engineering is having the data to inject. 
So, build that vector search index of your documentation, or set up that database query that your agent can use. Treat knowledge sources as first-class citizens in development. For example, if your AI assistant is for coding, make sure it can pull in code from the repo or reference the style guide. A lot of the value you&#8217;ll get from an AI comes from the <em>external knowledge</em> you supply to it.</p></li><li><p><strong>Develop prompt templates and libraries.</strong> Rather than ad-hoc prompts, start creating structured templates for your needs. You might have a template for &#8220;answer with citation&#8221; or &#8220;generate code diff given error&#8221;. These become like functions you reuse. Keep them in version control. Document their expected behavior. This is how you build up a toolkit of proven context setups. Over time, your team can share and iterate on these, just as they would on shared code libraries.</p></li><li><p><strong>Use tools and frameworks that give you control.</strong> Avoid black-box &#8220;just give us a prompt, we do the rest&#8221; solutions if you need reliability. Opt for frameworks that let you peek under the hood and tweak things &#8211; whether that&#8217;s a lower-level library like LangChain or a custom orchestration you build. The more visibility and control you have over context assembly, the easier to debug when something goes wrong.</p></li><li><p><strong>Monitor and instrument everything.</strong> In production, log the inputs and outputs (within privacy limits) so you can later analyze them. Use observability tools (like LangSmith, etc.) to trace how context was built for each request. When an output is bad, trace back and see what the model saw &#8211; was something missing? Was something formatted poorly? This will guide your fixes. 
Essentially, treat your AI system as a somewhat unpredictable service that you need to monitor like any other &#8211; dashboards for prompt usage, success rates, etc.</p></li><li><p><strong>Keep the user in the loop.</strong> Context engineering isn&#8217;t just about machine-machine info; it&#8217;s ultimately about solving a user&#8217;s problem. Often, the user can provide context if asked the right way. Think about UX designs where the AI asks clarifying questions or where the user can provide extra details to refine the context (like attaching a file, or selecting which codebase section is relevant). The term &#8220;AI-assisted&#8221; goes both ways &#8211; AI assists user, but user can assist AI by supplying context. A well-designed system facilitates that. For example, if an AI answer is wrong, let the user correct it and feed that correction back into context for next time.</p></li><li><p><strong>Train your team (and yourself).</strong> Make context engineering a shared discipline. In code reviews, start reviewing prompts and context logic too (&#8220;Is this retrieval grabbing the right docs? Is this prompt section clear and unambiguous?&#8221;). If you&#8217;re a tech lead, encourage team members to surface issues with AI outputs and brainstorm how tweaking context might fix it. Knowledge sharing is key because the field is new &#8211; a clever prompt trick or formatting insight one person discovers can likely benefit others. I&#8217;ve personally learned a ton just reading others&#8217; prompt examples and post-mortems of AI failures.</p></li></ul><p>As we move forward, I expect <strong>context engineering to become second nature</strong> &#8211; much like writing an API call or a SQL query is today. It will be part of the standard repertoire of software development. Already, many of us don&#8217;t think twice about doing a quick vector similarity search to grab context for a question; it&#8217;s just part of the flow. 
In a few years, &#8220;Have you set up the context properly?&#8221; will be as common a code review question as &#8220;Have you handled that API response properly?&#8221;</p><p>In embracing this new paradigm, we don&#8217;t abandon the old engineering principles &#8211; we reapply them in new ways. If you&#8217;ve spent years honing your software craft, that experience is incredibly valuable now: it&#8217;s what allows you to design sensible flows, to spot edge cases, to ensure correctness. AI hasn&#8217;t made those skills obsolete; it&#8217;s amplified their importance in guiding AI. The role of the software engineer is not diminishing &#8211; it&#8217;s evolving. We&#8217;re becoming <strong>directors</strong> and <strong>editors</strong> of AI, not just writers of code. And context engineering is the technique by which we direct the AI effectively.</p><p><strong>Start thinking in terms of what information you provide to the model, not just what question you ask.</strong> Experiment with it, iterate on it, and share your findings. By doing so, you&#8217;ll not only get better results from today&#8217;s AI, but you&#8217;ll also be preparing yourself for the even more powerful AI systems on the horizon. Those who understand how to feed the AI will always have the advantage.</p><p>Happy context-coding!</p><p><em>I&#8217;m excited to share that I&#8217;m writing a new <a href="https://www.oreilly.com/library/view/vibe-coding-the/9798341634749/">AI-assisted engineering book</a> with O&#8217;Reilly. 
If you&#8217;ve enjoyed my writing here you may be interested in checking it out.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xA9A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17b73676-5487-4cf2-8240-813e707703b7_7869x7869.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xA9A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17b73676-5487-4cf2-8240-813e707703b7_7869x7869.png 424w, https://substackcdn.com/image/fetch/$s_!xA9A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17b73676-5487-4cf2-8240-813e707703b7_7869x7869.png 848w, https://substackcdn.com/image/fetch/$s_!xA9A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17b73676-5487-4cf2-8240-813e707703b7_7869x7869.png 1272w, https://substackcdn.com/image/fetch/$s_!xA9A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17b73676-5487-4cf2-8240-813e707703b7_7869x7869.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xA9A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17b73676-5487-4cf2-8240-813e707703b7_7869x7869.png" width="1456" height="1456" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/17b73676-5487-4cf2-8240-813e707703b7_7869x7869.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4339581,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/168177841?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17b73676-5487-4cf2-8240-813e707703b7_7869x7869.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xA9A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17b73676-5487-4cf2-8240-813e707703b7_7869x7869.png 424w, https://substackcdn.com/image/fetch/$s_!xA9A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17b73676-5487-4cf2-8240-813e707703b7_7869x7869.png 848w, https://substackcdn.com/image/fetch/$s_!xA9A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17b73676-5487-4cf2-8240-813e707703b7_7869x7869.png 1272w, https://substackcdn.com/image/fetch/$s_!xA9A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17b73676-5487-4cf2-8240-813e707703b7_7869x7869.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[The AI-Native Software Engineer]]></title><description><![CDATA[A practical playbook for integrating AI into your daily engineering workflow]]></description><link>https://addyo.substack.com/p/the-ai-native-software-engineer</link><guid isPermaLink="false">https://addyo.substack.com/p/the-ai-native-software-engineer</guid><dc:creator><![CDATA[Addy Osmani]]></dc:creator><pubDate>Tue, 01 Jul 2025 18:17:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!WFGE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faba8cf11-c1d1-4cb8-8400-1fa7b7b91d83_5246x5246.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>An </strong><em><strong>AI-native software 
engineer</strong></em><strong> is one who deeply integrates AI into their daily workflow, treating it as a partner to amplify their abilities.</strong></p><p>This requires a fundamental <strong>mindset shift</strong>. Instead of thinking &#8220;AI might replace me,&#8221; an AI-native engineer asks of every task: <em>&#8220;Could AI help me do this faster, better, or differently?&#8221;</em></p><p>The mindset is optimistic and proactive - you see AI as a multiplier of your productivity and creativity, not a threat. With the right approach, <strong>AI could 2x, 5x, or perhaps 10x your output as an engineer</strong>. Experienced developers especially find that their expertise lets them prompt AI in ways that yield high-level results; a senior engineer can get answers akin to what a peer might deliver by asking AI the right questions with appropriate <a href="https://x.com/karpathy/status/1937902205765607626">context-engineering</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!t2_Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ae2c5e0-27c6-4959-b37d-67f2c40b2e09_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!t2_Y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ae2c5e0-27c6-4959-b37d-67f2c40b2e09_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!t2_Y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ae2c5e0-27c6-4959-b37d-67f2c40b2e09_1024x1024.png 848w,
https://substackcdn.com/image/fetch/$s_!t2_Y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ae2c5e0-27c6-4959-b37d-67f2c40b2e09_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!t2_Y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ae2c5e0-27c6-4959-b37d-67f2c40b2e09_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!t2_Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ae2c5e0-27c6-4959-b37d-67f2c40b2e09_1024x1024.png" width="450" height="450" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ae2c5e0-27c6-4959-b37d-67f2c40b2e09_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:450,&quot;bytes&quot;:1939902,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/165160941?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ae2c5e0-27c6-4959-b37d-67f2c40b2e09_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!t2_Y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ae2c5e0-27c6-4959-b37d-67f2c40b2e09_1024x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!t2_Y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ae2c5e0-27c6-4959-b37d-67f2c40b2e09_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!t2_Y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ae2c5e0-27c6-4959-b37d-67f2c40b2e09_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!t2_Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ae2c5e0-27c6-4959-b37d-67f2c40b2e09_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Being AI-native means embracing <strong>continuous learning and adaptation</strong> - engineers build software with AI-based assistance and automation baked in from the beginning. This mindset leads to excitement about the possibilities rather than fear.</p><p>Yes, there may be uncertainty and a learning curve - many of us have ridden the emotional rollercoaster of excitement, fear, and back again - but ultimately the goal is to land on excitement and opportunity. The AI-native engineer views AI as a way to delegate the repetitive or time-consuming parts of development (like boilerplate coding, documentation drafting, or test generation) and free themselves to focus on higher-level problem solving and innovation.</p><p><strong>Key principle - AI as collaborator, not replacement:</strong> An AI-native engineer treats AI like a knowledgeable, if junior, pair-programmer who is available 24/7.</p><p>You still drive the development process, but you constantly leverage the AI for ideas, solutions, and even warnings. For example, you might use an AI assistant to brainstorm architectural approaches, then refine those ideas with your own expertise. This collaboration can dramatically speed up development while also enhancing quality - <em>if</em> you maintain oversight. </p><p>Importantly, you don&#8217;t abdicate responsibility to the AI. Think of it as working with a junior developer who has read every StackOverflow post and API doc: they have a ton of information and can produce code quickly, but <strong>you are responsible for guiding them and verifying the output</strong>. 
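</p><p>In practice, that verification can be made partly mechanical. Below is a minimal sketch (the <code>verify_suggestion</code> helper and the sample <code>slugify</code> suggestions are invented for illustration) of gating an AI-suggested function behind known test cases before accepting it:</p>

```python
def verify_suggestion(code: str, fn_name: str, cases) -> bool:
    """Accept AI-suggested code only if it defines fn_name and passes every known case."""
    namespace = {}
    try:
        exec(code, namespace)  # load the suggestion into an isolated namespace
    except Exception:
        return False  # the suggestion does not even parse or run
    fn = namespace.get(fn_name)
    if not callable(fn):
        return False
    try:
        return all(fn(*args) == expected for args, expected in cases)
    except Exception:
        return False  # the suggestion crashes on a known input

# One plausible suggestion and one subtly wrong one (both invented):
good = "def slugify(s):\n    return s.strip().lower().replace(' ', '-')"
bad = "def slugify(s):\n    return s.lower().replace(' ', '_')"
cases = [(("Hello World",), "hello-world"), ((" Trim Me ",), "trim-me")]
```

<p>However the check is implemented, the principle is the same: the assistant proposes, but your tests and your review decide what gets accepted.</p><p>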
This &#8220;<a href="https://addyo.substack.com/p/the-trust-but-verify-pattern-for">trust, but verify</a>&#8221; mindset is crucial and we&#8217;ll revisit it later.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qzj1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ff8ce3d-bcdc-45ed-933b-c0a1038c63ea_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qzj1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ff8ce3d-bcdc-45ed-933b-c0a1038c63ea_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!qzj1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ff8ce3d-bcdc-45ed-933b-c0a1038c63ea_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!qzj1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ff8ce3d-bcdc-45ed-933b-c0a1038c63ea_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!qzj1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ff8ce3d-bcdc-45ed-933b-c0a1038c63ea_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qzj1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ff8ce3d-bcdc-45ed-933b-c0a1038c63ea_1024x1024.png" width="488" height="488" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4ff8ce3d-bcdc-45ed-933b-c0a1038c63ea_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:488,&quot;bytes&quot;:2016817,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/165160941?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ff8ce3d-bcdc-45ed-933b-c0a1038c63ea_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qzj1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ff8ce3d-bcdc-45ed-933b-c0a1038c63ea_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!qzj1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ff8ce3d-bcdc-45ed-933b-c0a1038c63ea_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!qzj1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ff8ce3d-bcdc-45ed-933b-c0a1038c63ea_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!qzj1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ff8ce3d-bcdc-45ed-933b-c0a1038c63ea_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Let's be blunt: AI-generated slop is real and is not an excuse for <a href="https://addyo.substack.com/p/vibe-coding-is-not-an-excuse-for">low-quality work</a>. A persistent risk in using these tools is a combination of rubber-stamped suggestions, subtle hallucinations, and simple laziness that falls far below professional engineering standards. This is why the "verify" part of the mantra is non-negotiable. As the engineer, you are not just a user of the tool; you are the ultimate guarantor. 
You remain fully and directly responsible for the quality, readability, security, and correctness of every line of code you commit.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Qzt7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61fdf82a-2bcc-4a81-853e-779af54c24a2_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Qzt7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61fdf82a-2bcc-4a81-853e-779af54c24a2_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!Qzt7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61fdf82a-2bcc-4a81-853e-779af54c24a2_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!Qzt7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61fdf82a-2bcc-4a81-853e-779af54c24a2_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Qzt7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61fdf82a-2bcc-4a81-853e-779af54c24a2_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Qzt7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61fdf82a-2bcc-4a81-853e-779af54c24a2_1024x1024.png" width="387" height="387" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61fdf82a-2bcc-4a81-853e-779af54c24a2_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:387,&quot;bytes&quot;:2240492,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/165160941?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61fdf82a-2bcc-4a81-853e-779af54c24a2_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Qzt7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61fdf82a-2bcc-4a81-853e-779af54c24a2_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!Qzt7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61fdf82a-2bcc-4a81-853e-779af54c24a2_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!Qzt7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61fdf82a-2bcc-4a81-853e-779af54c24a2_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Qzt7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61fdf82a-2bcc-4a81-853e-779af54c24a2_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Key principle - Every engineer is a manager now:</strong> The role of the engineer is fundamentally changing. With AI agents, you orchestrate the work rather than executing all of it yourself.</p><p>You remain responsible for every commit into main, but you focus more on defining and &#8220;assigning&#8221; the work to get there. In the not-too-distant future we may increasingly say &#8220;<a href="https://x.com/levie/status/1938647740554092586">Every engineer is a manager now</a>.&#8221; Legitimate work can be directed to background agents like Jules or Codex, or you can task Claude Code, Gemini CLI, or OpenCode with chewing through an analysis or code migration project. The engineer needs to intentionally shape the codebase so that it&#8217;s easier for the AI to work with, using rule files (e.g.
GEMINI.md), good READMEs, and well-structured code. This puts the engineer into the role of <a href="https://www.infoworld.com/article/3994519/the-tough-task-of-making-ai-code-production-ready.html">supervisor, mentor, and validator</a>. AI-first teams are smaller, able to accomplish more, and capable of <a href="https://www.forrester.com/blogs/appgen-is-here-say-goodbye-to-software-development-as-you-know-it/">compressing steps of the SDLC</a> to deliver better quality, <a href="https://newsletter.getdx.com/p/how-much-does-ai-impact-development-speed">faster</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wUui!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7cfea4-0f99-4c8d-81bd-d4c81070eee4_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wUui!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7cfea4-0f99-4c8d-81bd-d4c81070eee4_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!wUui!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7cfea4-0f99-4c8d-81bd-d4c81070eee4_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!wUui!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7cfea4-0f99-4c8d-81bd-d4c81070eee4_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!wUui!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7cfea4-0f99-4c8d-81bd-d4c81070eee4_1024x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!wUui!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7cfea4-0f99-4c8d-81bd-d4c81070eee4_1024x1024.png" width="483" height="483" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0f7cfea4-0f99-4c8d-81bd-d4c81070eee4_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:483,&quot;bytes&quot;:1989439,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/165160941?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7cfea4-0f99-4c8d-81bd-d4c81070eee4_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wUui!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7cfea4-0f99-4c8d-81bd-d4c81070eee4_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!wUui!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7cfea4-0f99-4c8d-81bd-d4c81070eee4_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!wUui!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7cfea4-0f99-4c8d-81bd-d4c81070eee4_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!wUui!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f7cfea4-0f99-4c8d-81bd-d4c81070eee4_1024x1024.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>High-level benefits:</strong> By fully embracing AI in your workflow, you can achieve some serious productivity leaps, potentially shipping more features faster without sacrificing quality (this of course has nuance such as keeping task complexity in mind).</p><p>Routine tasks (from formatting code to writing unit tests) can be handled in seconds. Perhaps more importantly, AI can augment your understanding: it&#8217;s like having an expert on call to explain code or propose solutions in areas outside your normal expertise. 
The result is that an AI-native engineer can take on more ambitious projects or handle the same workload with a smaller team. In essence, <strong>AI extends what you&#8217;re capable of</strong>, allowing you to work at a higher level of abstraction. The caveat is that it requires skill to use effectively - that&#8217;s where the right mindset and practices come in.</p><p><strong>Example - Mindset in action:</strong> Imagine you&#8217;re debugging a tricky issue or evaluating a new tech stack. A traditional approach might involve lots of Googling or reading documentation. An AI-native approach is to engage an AI assistant that supports Search grounding or deep research: describe the bug or ask for pros/cons of the tech stack, and let the AI provide insights or even code examples.</p><p>You remain in charge of interpretation and implementation, but the AI accelerates gathering information and possible solutions. This collaborative problem-solving becomes second nature once you get used to it. Make it a habit to ask, <em>&#8220;How can AI help with this task?&#8221;</em> until it&#8217;s reflex. Over time you&#8217;ll develop instincts for what AI is good at and how to prompt it effectively.</p><p>In summary, <strong>being AI-native means internalizing AI as a core part of how you think about solving problems and building software</strong>. It&#8217;s a mindset of partnership with machines: using their strengths (speed, knowledge, pattern recognition) to complement your own (creativity, judgment, context). With this foundation in mind, we can move on to practical steps for integrating AI into your daily work.</p><h2><strong>Getting Started - Integrating AI into your daily workflow</strong></h2><p>Adopting an AI-native workflow can feel daunting if you&#8217;re completely new to it. The key is to <strong>start small and build up</strong> your AI fluency over time. 
In this section, we&#8217;ll provide concrete guidance to go from zero to productive with AI in your day-to-day engineering tasks. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OGxs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef50d3c-e6f1-4b5e-a850-2177e561bbc1_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OGxs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef50d3c-e6f1-4b5e-a850-2177e561bbc1_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!OGxs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef50d3c-e6f1-4b5e-a850-2177e561bbc1_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!OGxs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef50d3c-e6f1-4b5e-a850-2177e561bbc1_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!OGxs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef50d3c-e6f1-4b5e-a850-2177e561bbc1_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OGxs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef50d3c-e6f1-4b5e-a850-2177e561bbc1_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3ef50d3c-e6f1-4b5e-a850-2177e561bbc1_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2100732,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/165160941?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef50d3c-e6f1-4b5e-a850-2177e561bbc1_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OGxs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef50d3c-e6f1-4b5e-a850-2177e561bbc1_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!OGxs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef50d3c-e6f1-4b5e-a850-2177e561bbc1_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!OGxs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef50d3c-e6f1-4b5e-a850-2177e561bbc1_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!OGxs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef50d3c-e6f1-4b5e-a850-2177e561bbc1_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>The above is a speculative look at where we may end up with AI in the software lifecycle. I continue to strongly believe human-in-the-loop (engineering, design, product, UX etc) will be needed to ensure that quality doesn&#8217;t suffer.</em></p><p><strong>Step 1: The first change? You often start with AI.</strong> </p><p>An AI-native workflow isn&#8217;t about occasionally looking for tasks AI can help with; it's often about giving the task to an AI model first to see how it performs. <a href="https://www.ignorance.ai/p/ai-at-pulley">One team noted</a>: </p><blockquote><p>The typical workflow involves giving the task to an AI model first (via Cursor or a CLI program)... with the understanding that plenty of tasks are still hit or miss. </p></blockquote><p>Are you studying a domain or a competitor?
Start with Gemini Deep Research. Find yourself stuck in an endless debate over some aspect of design? While your team argued, you could have built three prototypes with AI to prove out the idea. Googlers are already <a href="https://x.com/rmedranollamas/status/1938305816185966898">using AI to build slides, debug production incidents, and much more</a>.</p><p>When you hear &#8220;But LLMs hallucinate and chatbots give lousy answers&#8221;, it's time to update your toolchain. Anybody <a href="https://fly.io/blog/youre-all-nuts/">seriously coding with AI today is using agents</a>. Hallucinations can be significantly mitigated and managed with proper <a href="https://blog.langchain.com/the-rise-of-context-engineering/">context engineering</a> and agentic feedback loops. The mindset shift is foundational: all of us should be AI-first right now.</p><p><strong>Step 2: Get the right AI tools in place.</strong> </p><p>To integrate AI smoothly, you&#8217;ll want to set up at least one coding assistant in your environment. Many engineers start with <strong>GitHub Copilot</strong> in VS Code, which offers code autocomplete and code generation. If you use an IDE like VS Code, consider installing an AI extension (for example, <strong>Cursor</strong> is a dedicated AI-enhanced code editor, and <strong><a href="https://addyo.substack.com/p/why-i-use-cline-for-ai-engineering">Cline</a></strong> is a VS Code plugin for an AI agent - more on these later). These tools are great for beginners because they work in the background, suggesting code in real-time for whatever file you&#8217;re editing. Outside your editor, you might also explore <strong>ChatGPT, Gemini, or Claude</strong> in a separate window for question-answer style assistance. Starting with tooling is important because it lowers the friction to use AI.
Once installed, the AI is only a keystroke away whenever you think &#8220;maybe the AI can help with this.&#8221;</p><p><strong>Step 3: Learn prompt basics - be specific and provide context.</strong> </p><p>Using AI effectively is a skill, and the core of that skill is <strong><a href="https://addyo.substack.com/p/the-prompt-engineering-playbook-for">prompt engineering</a></strong>. A common mistake new users make is giving the AI an overly vague instruction and then being disappointed with the result. Remember, the AI isn&#8217;t a mind reader; it reacts to the prompt you give. A little extra context or clarity goes a long way. For instance, if you have a piece of code and you want an explanation or unit tests for it, don&#8217;t just say <em>&#8220;Write tests for this.&#8221;</em> Instead, <strong>describe the code&#8217;s intended behavior and requirements in your prompt</strong>. Compare these two prompts for writing tests for a React login form component:</p><ul><li><p><strong>Poor prompt:</strong> &#8220;Can you write tests for my React component?&#8221;</p></li><li><p><strong>Better prompt:</strong> &#8220;I have a LoginForm React component with an email field, password field, and submit button. It displays a success message on successful submit and an error message on failure, via an onSubmit callback. <strong>Please write a Jest test file</strong> that: (1) renders the form, (2) fills in valid and invalid inputs, (3) submits the form, (4) asserts that onSubmit is called with the right data, and (5) checks that success and error states render appropriately.&#8221;</p></li></ul><p>The second prompt is longer, but it gives the AI exactly what we need. The result will be <strong>far more accurate and useful</strong> because the AI isn&#8217;t guessing at our intentions - we spelled them out. In practice, spending an extra minute to clarify your prompt can save you <strong>hours</strong> of fixing AI-generated code later. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NG0k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d693ee6-61be-4250-ad42-b43ace337365_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NG0k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d693ee6-61be-4250-ad42-b43ace337365_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!NG0k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d693ee6-61be-4250-ad42-b43ace337365_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!NG0k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d693ee6-61be-4250-ad42-b43ace337365_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!NG0k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d693ee6-61be-4250-ad42-b43ace337365_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NG0k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d693ee6-61be-4250-ad42-b43ace337365_1024x1024.png" width="411" height="411" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4d693ee6-61be-4250-ad42-b43ace337365_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:411,&quot;bytes&quot;:1816939,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/165160941?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d693ee6-61be-4250-ad42-b43ace337365_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NG0k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d693ee6-61be-4250-ad42-b43ace337365_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!NG0k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d693ee6-61be-4250-ad42-b43ace337365_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!NG0k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d693ee6-61be-4250-ad42-b43ace337365_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!NG0k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d693ee6-61be-4250-ad42-b43ace337365_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Effective prompting is such an important skill that Google has published entire guides on it (see <strong><a href="https://workspace.google.com/learning/content/gemini-prompt-guide">Google&#8217;s Prompting Guide 101</a></strong> for a great starting point). As you practice, you&#8217;ll get a feel for how to phrase requests. A couple of quick tips: be clear about the format you want (e.g., &#8220;return the output as JSON&#8221;), break complex tasks into ordered steps or bullet points in your prompt, and provide examples when possible. These techniques help the AI understand your request better.</p><p><strong>Step 4: Use AI for code generation and completion.</strong> </p><p>With tools set up and a grasp of how to prompt, start applying AI to actual coding tasks. A good first use-case is generating boilerplate or repetitive code.
For instance, if you need a function to parse a date string in multiple formats, ask the AI to draft it. You might say: <em>&#8220;Write a Python function that takes a date string which could be in formats X, Y, or Z, and returns a datetime object. Include error handling for invalid formats.&#8221;</em></p><p>The AI will produce an initial implementation. <strong>Don&#8217;t accept it blindly</strong> - read through it and run tests. This hands-on practice builds your intuition for when the AI is reliable. Many developers are pleasantly surprised at how the AI produces a decent solution in seconds, which they can then tweak. Over time, you can move to more significant code generation tasks, like scaffolding entire classes or modules. As an example, <strong>Cursor</strong> even offers features to generate entire files or refactor code based on a description. Early on, lean on the AI for <em>helper code</em> - things you understand but would take time to write - rather than core algorithmic logic that&#8217;s critical. This way, you build confidence in the AI&#8217;s capabilities on low-risk tasks.</p><p><strong>Step 5: Integrate AI into non-coding tasks.</strong> </p><p>Being AI-native isn&#8217;t just about writing code faster; it&#8217;s about improving all facets of your work. A great way to start is using AI for writing or analysis tasks that surround coding. For example, try using AI to <strong>write a commit message or a Pull Request description</strong> after you make code changes. You can paste a git diff and ask, &#8220;Summarize these changes in a professional PR description.&#8221; The AI will draft something that you can refine.</p><p>This is a key differentiator between casual users and true AI-native engineers.
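</p><p>To make Step 4 concrete, here is the kind of draft the date-parsing prompt above might come back with. The three formats below are hypothetical stand-ins for the &#8220;X, Y, or Z&#8221; in the prompt:</p>

```python
from datetime import datetime

# Hypothetical stand-ins for the "formats X, Y, or Z" in the prompt.
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y"]

def parse_date(date_string: str) -> datetime:
    """Try each known format in turn; raise ValueError if none match."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(date_string, fmt)
        except ValueError:
            continue  # wrong format, try the next one
    raise ValueError(f"Unrecognized date format: {date_string!r}")
```

<p>Treat output like this as a draft, not a verdict: run it against real samples (does &#8220;01/02/2026&#8221; mean January 2nd in your data?) before merging.</p><p>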
<strong>The best engineers have always known that their primary value isn't just typing code, but in the thinking, planning, research, and communication that surrounds it.</strong> Applying AI to these areas - to accelerate research, clarify documentation, or structure a project plan - is a massive force multiplier. Seeing AI as an assistant for the entire engineering process, not just the coding part, is critical to unlocking its full potential for velocity and innovation.</p><p>Along these lines, use AI to <strong>document code</strong>: have it generate docstrings or even entire sections of technical documentation based on your codebase. Another idea is to use AI for <strong>planning</strong> - if you&#8217;re not sure how to implement a feature, describe the requirement and ask the AI to outline a possible approach. This can give you a starting blueprint which you then adjust. Don&#8217;t forget about everyday communications: many engineers use AI to draft emails or Slack messages, especially when communicating complex ideas.</p><p>For instance, if you need to explain to a product manager why a certain bug is tricky, you can ask the AI to help articulate the explanation clearly. This might sound trivial, but it&#8217;s a real productivity boost and helps ensure you communicate effectively. Remember, <em>&#8220;it&#8217;s not always all about the code&#8221;</em> - AI can assist in meetings, brainstorming, and articulating ideas too. An AI-native engineer leverages these opportunities.</p><p><strong>Step 6: Iterate and refine through feedback.</strong> </p><p>As you begin using AI day-to-day, treat it as a learning process for yourself. Pay attention to where the AI&#8217;s output needed fixing and try to deduce why. Was the prompt incomplete? Did the AI assume the wrong context? Use that feedback to craft better prompts next time. 
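</p><p>One way to act on that feedback is to promote one-off prompts into small reusable helpers you refine over time. As a sketch, Step 5&#8217;s diff-to-PR-description flow might look like this (the prompt wording and helper names are illustrative, not a standard):</p>

```python
import subprocess

# Illustrative wording - tune it to match your team's PR template.
PROMPT_TEMPLATE = (
    "Summarize these changes in a professional PR description. "
    "Cover what changed, why, and anything risky to review closely.\n\n"
    "DIFF:\n{diff}"
)

def build_pr_prompt(diff: str) -> str:
    """Wrap a git diff in the reusable prompt above."""
    return PROMPT_TEMPLATE.format(diff=diff.strip())

def staged_diff() -> str:
    """Return whatever is currently staged in git."""
    result = subprocess.run(
        ["git", "diff", "--staged"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Typical use: paste build_pr_prompt(staged_diff()) into your assistant.
```

<p>Each time a draft disappoints, edit the template once and every future PR benefits.</p><p>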
Most AI coding assistants allow an iterative process: you can say &#8220;Oops, that function is not handling empty inputs correctly, please fix that&#8221; and the AI will refine its answer. Take advantage of this interactivity - it&#8217;s often faster to correct an AI&#8217;s draft by telling it what to change than writing from scratch.</p><p>Over time, you&#8217;ll develop a library of prompt patterns that work well. For example, you might discover that <em>&#8220;Explain X like I&#8217;m a new team member&#8221;</em> yields a very good high-level explanation of a piece of code for documentation purposes. Or that providing a short example input and output in your prompt dramatically improves an AI&#8217;s answer for data transformation tasks. Build these discoveries into your workflow.</p><p><strong>Step 7: Always verify and test AI outputs.</strong> </p><p>This cannot be stressed enough: <strong>never assume the AI is 100% correct</strong>. Even if the code compiles or the answer looks reasonable, do your due diligence. Run the code, write additional tests, or sanity-check the reasoning. Many AI-generated solutions work on the surface but fail on edge cases or have subtle bugs.</p><p>You are the engineer; the AI is an assistant. Use all your normal best practices (code reviews, testing, static analysis) on AI-written code just as you would on human-written code. In practice, this means budgeting some time to go through what the AI produced. The good news is that reading and understanding code is usually faster than writing it from scratch, so even with verification, you come out ahead productivity-wise.</p><p>As you gain experience, you&#8217;ll also learn which kinds of tasks the AI is <strong>weak</strong> at - for example, many LLMs struggle with precise arithmetic or highly domain-specific logic - and you&#8217;ll know to double-check those parts extra carefully or perhaps avoid using AI for those. 
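</p><p>That double-checking can be as light as a few edge-case assertions. Suppose the AI drafted this (hypothetical) moving-average helper; probing the edges takes seconds and surfaces decisions it made silently:</p>

```python
def moving_average(values, window):
    """An imagined AI-drafted helper: mean of each sliding window."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# The happy path looks fine...
assert moving_average([1, 2, 3, 4], 2) == [1.5, 2.5, 3.5]

# ...but probe the edges before trusting it:
assert moving_average([], 1) == []        # empty input: returns [], no crash
assert moving_average([1, 2], 5) == []    # window > data: silently returns []
# Is [] the right answer for an oversized window, or should it raise?
# That's a decision the AI made for you - notice it and decide.
```

<p>If an edge case fails, feed it straight back into the next prompt.</p><p>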
Building this intuition ensures that by the time you trust an AI-generated change enough to commit or deploy, you&#8217;ve mitigated risks. A useful mental model is to <strong>treat AI like a highly efficient but not infallible teammate</strong>: you value its contributions but always perform the final review yourself.</p><p><strong>Step 8: Expand to more complex uses gradually.</strong> </p><p>Once you&#8217;re comfortable with AI handling small tasks, you can explore more advanced integrations. For example, move from using AI in a reactive way (asking for help when you think of it) to a proactive way: let the AI monitor as you code. Tools like <strong>Cursor</strong> or <strong>Windsurf</strong> can run in agent mode where they watch for errors or TODO comments and suggest fixes automatically. Or you might try an <strong>autonomous agent</strong> mode like what <strong>Cline</strong> offers, where the AI can plan out a multi-step task (create a file, write code in it, run tests, etc.) with your approval at each step.</p><p>These advanced uses can unlock even greater productivity, but they also require more vigilance (imagine giving a junior dev more autonomy - you&#8217;d still check in regularly).</p><p>A powerful intermediate step is to use AI for <strong>end-to-end prototyping</strong>. For instance, challenge yourself on a weekend to build a simple app using mostly AI assistance: describe the app you want and see how far a tool like Replit&#8217;s AI or Bolt can get you, then use your skills to fill the gaps. This kind of exercise is fantastic for understanding the current limits of AI and learning how to direct it better. And it&#8217;s fun - you&#8217;ll feel like you have a superpower when, in a couple of hours, you have a working prototype that might have taken days or weeks to code by hand.</p><p>By following these steps and ramping up gradually, you&#8217;ll go from an AI novice to someone who instinctively weaves AI into their development workflow. 
The next section will dive deeper into the landscape of <strong>tools and platforms</strong> available - knowing what tool to use for which job is an important part of being productive with AI.</p><h2><strong>AI Tools and Platforms - from prototyping to production</strong></h2><p>One of the reasons it&#8217;s an exciting time to be an engineer is the sheer variety of AI-powered tools now available. As an AI-native software engineer, part of your skillset is knowing <strong>which tools to leverage for which tasks</strong>. In this section, we&#8217;ll survey the landscape of AI coding tools and platforms, and offer guidance on choosing and using them effectively. We&#8217;ll broadly categorize them into two groups - <strong>AI coding assistants</strong> (which integrate into your development environment to help with code you write) and <strong>AI-driven prototyping tools</strong> (which can generate entire project scaffolds or applications from a prompt). Both are valuable, but they serve different needs.</p><p><em>Before diving into specific tools, it's crucial for any professional to adopt a "data privacy firewall" as a core part of their mindset. Always ask yourself: "Would I be comfortable with this prompt and its context being logged on a third-party server?" This discipline is fundamental to using these tools responsibly. An AI-native engineer learns to distinguish between tasks safe for a public cloud AI and tasks that demand an enterprise-grade, privacy-focused, or even a self-hosted, local model.</em></p><h3><strong>AI Coding Assistants in the IDE</strong></h3><p>These tools act like an &#8220;AI pair programmer&#8221; integrated with your editor or IDE. They are invaluable when you&#8217;re working on an existing codebase or building a project in a traditional way (writing code, file by file). 
Here are some notable examples and their nuances:</p><ul><li><p><strong>GitHub&#8239;Copilot </strong>has transformed from an autocomplete tool into a true coding agent: once you assign it an issue or task, it can autonomously analyze your codebase, spin up environments (like via GitHub Actions), propose multi&#8209;file edits, run commands/tests, fix errors, and submit draft pull requests complete with its reasoning in the logs. Built on state&#8209;of&#8209;the&#8209;art models, it supports multi&#8209;model selection and leverages Model Context Protocol (MCP) to integrate external tools and workspace context, enabling it to navigate complex repo structures including monorepos, CI pipelines, image assets, API dependencies, and more. Despite these advances, it&#8217;s optimized for low&#8209; to medium&#8209;complexity tasks and still requires human oversight - especially for security, deep architectural work, and multi&#8209;agent coordination</p></li><li><p><strong>Cursor - AI-native code editor:</strong> <strong>Cursor</strong> is a modified VS Code editor with AI deeply integrated. Unlike Copilot, which is an add-on, Cursor is built around AI from the ground up. It can do things like AI-aware navigation (ask it to find where a function is used, etc.) and smart refactorings. Notably, Cursor has features to generate tests, explain code, and even an &#8220;Agent&#8221; mode where it will attempt larger tasks on command. Cursor&#8217;s philosophy is to &#8220;supercharge&#8221; a developer, especially in <strong>large codebases</strong>. If you&#8217;re working in a monorepo or enterprise-scale project, Cursor&#8217;s ability to understand project-wide context (and even customize it with project-specific rules using something like a .cursorrules file) can be a game changer. Many developers use Cursor in &#8220;Ask&#8221; mode to begin with - you ask for what you want, get confirmation, then let it apply changes - which helps ensure it does the right thing.
The trade-off with Cursor is that it&#8217;s a standalone editor (though familiar to VS Code users) and currently it&#8217;s a paid product. It&#8217;s very popular, with millions of developers using it, including in enterprises, which speaks to its effectiveness.</p></li><li><p><strong>Windsurf - AI agent for coding with large context:</strong> <strong>Windsurf</strong> is another AI-augmented development environment. Windsurf emphasizes enterprise needs: it has strong data privacy (no data retention, self-hosting options) and even compliance certifications like HIPAA and FedRAMP, making it attractive for companies concerned about code security. Functionally, Windsurf can do many of the same assistive tasks (code completion, suggesting changes, etc.), but anecdotally it&#8217;s especially useful in scenarios where you might feed entire files or lots of documentation to the AI. If you are working on a codebase with tens of thousands of lines and need the AI to be aware of most of it (for instance, a sweeping refactor across many files), a tool like Windsurf is worth considering.</p></li><li><p><strong>Cline - autonomous AI coding agent for VS Code:</strong> <strong>Cline</strong> takes a unique approach by acting as an <em>autonomous agent</em> within your editor. It&#8217;s an open-source VS Code extension that not only suggests code, but can create files, execute commands, and perform multi-step tasks with your permission. Cline operates in dual modes: <em>Plan</em> (where it outlines what it intends to do) and <em>Act</em> (where it executes those steps) under human supervision. The idea is to let the AI handle more complex chores, like setting up a whole feature: it could plan &#8220;Add a new API endpoint, including route, controller, and database migration&#8221; and then implement each part, asking for confirmation. 
This aligns AI assistance with professional engineering workflows by giving the developer <strong>control and visibility into each step</strong>. I&#8217;ve noted that Cline &#8220;treats AI not just as a code generator but as a systems-level engineering tool&#8221; meaning it can reason about the project structure and coordinate multiple changes coherently. The downsides: because it can run code or modify many files, you have to be careful and review its plans. There&#8217;s also cost if you connect it to powerful models (some users note it can use a lot of tokens, hence $$, when running very autonomously). But for serious use - say you want to quickly prototype a new module in your app with tests and docs - Cline can be incredibly powerful. It&#8217;s like having an eager junior engineer that asks &#8220;Should I proceed with doing X?&#8221; at each step. Many developers appreciate this more collaborative style (Cline &#8220;asks more questions&#8221; by design) because it reduces the chance of the AI going off-track.<br></p></li></ul><p>Use <strong>AI coding assistants</strong> when you&#8217;re iteratively building or maintaining a codebase - these tools fit naturally into your cycle of edit&#8209;compile&#8209;test. They&#8217;re ideal for tasks like writing new functions (just type a signature and they&#8217;ll often co&#8209;complete the body), refactoring (&#8220;refactor this function to be more readable&#8221;), or understanding unfamiliar code (&#8220;explain this code&#8221; - and you get a concise summary). They&#8217;re not meant to build an entire app in one pass; instead, they augment your day&#8209;to&#8209;day workflow. 
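</p><p>The Plan/Act pattern described above is, at its core, a simple control loop: nothing executes without an explicit yes. A conceptual sketch (not Cline&#8217;s actual implementation):</p>

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    description: str        # e.g. "Add the /api/jobs route"
    run: Callable[[], str]  # the action that applies this change

def plan_act_loop(steps: List[Step],
                  approve: Callable[[str], bool]) -> List[str]:
    """Run each planned step only after explicit human approval."""
    log = []
    for step in steps:
        if approve(f"Should I proceed with: {step.description}?"):
            log.append(f"done: {step.run()}")
        else:
            log.append(f"skipped: {step.description}")
    return log
```

<p>Real agents plan steps dynamically and revise after failures, but the control structure - plan, ask, act - is the part that keeps you in charge.</p><p>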
For seasoned engineers, invoking an AI assistant becomes second nature - like an on&#8209;demand search engine - used dozens of times daily for quick help or insights.</p><p>Under the hood, modern <strong>asynchronous coding agents</strong> like <a href="https://openai.com/codex/">OpenAI&#8239;Codex</a> and <a href="https://jules.google/">Google&#8217;s Jules</a> go a step further. Codex operates as an autonomous cloud agent - handling parallel tasks in isolated sandboxes: writing features, fixing bugs, running tests, generating full PRs - then presents logs and diffs for review.</p><p>Google&#8217;s Jules, powered by Gemini&#8239;2.5&#8239;Pro, brings asynchronous autonomy to your GitHub workflow: you assign an issue (such as upgrading Next.js), it clones your repo in a VM, plans its multi&#8209;file edits, executes them, summarizes the changes (including an audio recap), and issues a pull request - all while you continue working. These agents differ from inline autocomplete: they&#8217;re autonomous collaborators that tackle defined tasks in the background and return completed work for your review, letting you stay focused on higher&#8209;level challenges.</p><h3><strong>AI-Driven prototyping and MVP builders</strong></h3><p>Separate from the in-IDE assistants, a new class of tools can generate entire working applications or substantial chunks of them from high-level prompts. These are great when you want to <strong>bootstrap a new project or feature quickly</strong> - essentially to get from zero to a first version (the &#8220;v0&#8221;) with minimal manual coding.
They won&#8217;t usually produce final production-quality code without further iteration, but they create a remarkable starting point.</p><ul><li><p><strong><a href="http://bolt.new">Bolt (bolt.new)</a></strong> - <em>one-prompt full-stack app generator:</em> <strong>Bolt</strong> is built on the premise that you can type a natural language description of an app and get a deployable full-stack MVP in minutes. For example, you might say &#8220;A job board with user login and an admin dashboard&#8221; and Bolt will generate a React frontend (using Tailwind CSS for styling) and a Node.js/Prisma backend with a database, complete with the basic models for jobs and users. In testing, Bolt has proven to be <strong>extremely fast</strong> - often assembling a project in 15 seconds or so. The output code is generally clean and follows modern practices (React components, REST/GraphQL API, etc.), so you can open it in your IDE and continue development. Bolt excels at <strong>rapid iteration</strong>: you can tweak your prompt and regenerate, or use its UI to adjust what it built. It even has an &#8220;export to GitHub&#8221; feature for convenience. This makes it ideal for <strong>founders, hackathon participants, or any developer who wants to shortcut the initial setup of an app</strong>. The trade-off is that Bolt&#8217;s creativity is bounded by its training - it might use certain styling by default and might not handle very unique requirements without guidance. But as a starting point, it&#8217;s often impressive. In comparisons, users noted Bolt produces great-looking UIs very consistently and was a top pick for quickly getting a prototype UI that &#8220;wows&#8221; users or stakeholders.</p></li><li><p><strong><a href="http://v0.dev">v0 (v0.dev by Vercel)</a></strong> - <em>text to Next.js app generator:</em> <strong>v0</strong> is a tool from Vercel that similarly generates apps, especially focusing on Next.js (since Vercel is behind Next.js). 
You give it a prompt for what you want, and it creates a project. One thing to note about v0: it has a distinct design aesthetic. Testers observed that <strong>v0 tends to style everything in the popular ShadCN UI style</strong> - basically a trendy minimalist component library - whether you asked for it or not. This can be good if you like that style out of the box, but it means if you wanted a very custom design, v0 might not match it precisely. In one comparison, v0 was found to &#8220;re-theme designs&#8221; towards its default look instead of faithfully matching a given spec. So, v0 might be best if your goal is a quick <em>functional</em> prototype and you&#8217;re flexible on appearance. The code output is usually Next.js React code with whatever backend you specify (it might set up a simple API or use Vercel&#8217;s Edge Functions, etc.). As part of Vercel&#8217;s ecosystem, it&#8217;s also oriented toward <em>deployability</em> - the idea is you could take what it gives you and deploy on Vercel immediately. If you&#8217;re a fan of Next.js or building a web product that you plan to host on Vercel, v0 is a natural choice. Just keep in mind you might need to do some re-theming if you have your own design, since v0 has &#8220;opinions&#8221; about how things should look.</p></li><li><p><strong><a href="https://lovable.dev/">Lovable</a></strong> - <em>prompt-to-UI mockups (with some code):</em> <strong>Lovable</strong> is aimed more at beginners or non-engineers who want to build apps through a simpler interface. It lets you describe an app and provides a visual editor as well. Users have noted that Lovable&#8217;s <strong>strength is ease of use</strong> - it&#8217;s quite guided and has a nice UI for assembling your app - but its weakness is when you need to dive into code, it can be cumbersome. 
It tends to hide complexity (which is good if you want no-code), but if you are an engineer who wants to tweak what it built, you might find the experience frustrating. In terms of output, Lovable can create both UI and some logic, but perhaps not as completely as Bolt or v0. In one test, Lovable interestingly did better when given a screenshot to imitate than when given a Figma design - a bit inconsistent. It&#8217;s targeted at quick prototyping and maybe building simple apps with minimal coding. If you&#8217;re a tech lead working with a designer or PM who can&#8217;t code, Lovable might be something to let them play with to visualize ideas, which you then refine in code. However, for a seasoned engineer, Lovable might feel a bit limiting.</p></li><li><p><strong><a href="http://replit.com">Replit</a>:</strong> Replit&#8217;s online IDE has an AI mode where you can type a prompt like &#8220;Create a 2D Zelda-like game&#8221; or &#8220;Build a habit tracker app&#8221; and it will generate a project in their cloud environment. Replit&#8217;s strength is that it can run and host the result immediately, and it often takes care of both frontend and backend seamlessly since it&#8217;s all in one environment. A standout example: when asked to make a simple game, Replit&#8217;s AI agent not only wrote the code, but <em>ran it and iteratively improved it</em> by checking its own work with screenshots. In comparisons, Replit sometimes produced the most <strong>functionally complete</strong> result (for instance, a working game with enemies and collision when others barely produced a moving character). However, it might take longer to run and use more computational resources in doing so. Replit is great if you want a one-shot outcome that is actually runnable and possibly closer to production. It&#8217;s like having an AI that not only writes code, but also <em>tests it live</em> and fixes it. 
For full-stack apps, Replit likewise can wire up client and server and even set up a database if asked. The output might not be the cleanest or most idiomatic code in every case, but it&#8217;s often a very workable starting point. One consideration: because Replit&#8217;s agent runs in the cloud and can execute code, you might hit some limits for very big apps (and you need to be careful if you prompt it to do something that could run malicious code - though it&#8217;s sandboxed). Overall, if your goal is <em>&#8220;I want an app that I can run immediately and play with, and I don&#8217;t mind if the code needs refactoring later&#8221;</em>, Replit is a top choice.</p></li><li><p><strong><a href="http://firebase.studio">Firebase Studio</a></strong> is Google&#8217;s cloud-based, agentic IDE powered by Gemini, which lets you rapidly prototype and ship full&#8209;stack, AI&#8209;infused apps entirely in your browser. You can import existing codebases - or start from scratch using natural&#8209;language, image, or sketch prompts via the App Prototyping agent - to generate a working Next.js prototype (frontend, backend, Firestore, Auth, hosting, etc.) and immediately preview it live. From there, you can seamlessly switch into full&#8209;coding mode in a Code&#8209;OSS (VS&#8239;Code) workspace powered by Nix, with integrated Firebase emulators. Gemini in Firebase offers inline code suggestions, debugging, test generation, documentation, and migrations, and can even run terminal commands and interpret their output. You can prompt &#8220;Build a photo&#8209;gallery app with uploads and authentication&#8221;, see the app spun up end to end, tweak it, deploy it to Hosting or Cloud Run, and monitor usage - all without switching tools.</p></li></ul><p><strong>When to use prototyping tools:</strong> These shine when you are <strong>starting a new project or feature</strong> and want to eliminate the grunt work of initial setup. 
For instance, if you&#8217;re a tech lead needing a quick proof-of-concept to show stakeholders, using Bolt or v0 to spin up the base and then deploying it can save days of effort. They are also useful for exploring ideas - you can generate multiple variations of an app to see different approaches. However, expect to iterate. Think of what these tools produce as a <strong>first draft</strong>.</p><p>After generating, you&#8217;ll likely bring the code into your own IDE (perhaps with an AI assistant there to help) and refine it. In many cases, the best workflow is <strong>hybrid</strong>: <em>prototype with a generation tool, then refine with an in-IDE assistant</em>. For example, you might use Bolt to create the MVP of an app, then open that project in Cursor to continue development with AI pair-programming on the finer details. These approaches aren&#8217;t mutually exclusive at all - they complement each other. Use the right tool for each phase: prototypers for initial scaffolding and high-level layout, assistants for deep code work and integration.</p><p>Another consideration is <strong>limitations and learning</strong>: by examining what these prototyping tools generate, you can learn common patterns. It&#8217;s almost like reading the output of a dozen framework tutorials in one go. But also note what they <em>don&#8217;t</em> do - often they won&#8217;t get the last <a href="https://addyo.substack.com/p/beyond-the-70-maximizing-the-human">20-30%</a> of an app done (things like polish, performance tuning, handling edge-case business logic), which will fall to you.</p><p>This is akin to the &#8220;<a href="https://addyo.substack.com/p/the-70-problem-hard-truths-about">70% problem</a>&#8221; observed in AI-assisted coding: AI gets you a big chunk of the way, but the final mile requires human insight. Knowing this, you can budget time accordingly. 
The good news is that the initial 70% (spinning up UI components, setting up routes, hooking up basic CRUD) is usually the boring part - and if AI does that, you can focus your energy on the interesting parts (custom logic, UX finesse, etc.). Just don&#8217;t be lulled into a false sense of security; always review the generated code for things like security (e.g., did it hardcode an API key?) or correctness.</p><p><strong>Summary of tools vs use-cases:</strong> It&#8217;s helpful to recap and simplify how these tools differ. In a nutshell: <em>Use an IDE assistant when you&#8217;re evolving or maintaining a codebase; use a generative prototype tool when you need a new codebase or module quickly.</em> If you already have a large project, something like Cursor or <a href="http://cline.bot">Cline</a> plugged into VS Code will be your day-to-day ally, helping you write and modify code intelligently.</p><p>If you&#8217;re starting a project from scratch, tools like Bolt or v0 can do the heavy lifting of setup so you aren&#8217;t spending a day configuring build tools or creating boilerplate files. And if your work involves both (which is common: starting new services and maintaining old ones), you might very well use both types regularly. Many teams report success in <strong>combining</strong> them: for instance, generate a prototype to kickstart development, then manage and grow that code with an AI-augmented IDE.</p><p>Lastly, be aware of the <strong>&#8220;not invented here&#8221; stigma some might have toward AI-generated code.</strong> It&#8217;s important to communicate within your team about using these tools. Some traditionalists may be skeptical of code they didn&#8217;t write themselves. The best way to overcome that is by demonstrating the benefits (speed, and code quality that holds up once you&#8217;ve reviewed it) and making AI use collaborative. 
For example, share the prompt and output in a PR description (&#8220;This controller was generated using v0.dev based on the following description...&#8221;). This demystifies the AI&#8217;s contribution and can invite constructive review just like human-generated code.</p><p>Now that we&#8217;ve looked at tools, in the next section we&#8217;ll zoom out and walk through <strong>how to apply AI across the entire software development lifecycle</strong>, from design to deployment. AI&#8217;s role isn&#8217;t limited to coding; it can assist in requirements, testing, and more.</p><h2><strong>AI across the Software Development Lifecycle</strong></h2><p>An AI-native software engineer doesn&#8217;t only use AI for writing code - they leverage it at <strong>every stage of the <a href="https://www.geeksforgeeks.org/software-engineering/software-development-life-cycle-sdlc/">software development lifecycle</a> (SDLC)</strong>. This section explores how AI can be applied pragmatically in each phase of engineering work, making the whole process more efficient and innovative. We&#8217;ll keep things domain-agnostic, with a slight bias to common web development scenarios for examples, but these ideas apply to many domains of software (from cloud services to mobile apps).</p><h3><strong>1. Requirements &amp; ideation</strong></h3><p>The first step in any project is figuring out <em>what</em> to build. AI can act as a brainstorming partner and a requirements analyst. </p><p>For example, if you have a high-level product idea (&#8220;We need an app for X&#8221;), you can ask an AI to help <strong>brainstorm features or user stories</strong>. A prompt like: <em>&#8220;I need to design a mobile app for a personal finance tracker. 
What features should it have for a great user experience?&#8221;</em> can yield a list of features (e.g., budgeting, expense categorization, charts, reminders) that you might not have initially considered.</p><p>The AI can aggregate ideas from countless apps and articles it has ingested. Similarly, you can task the AI with writing preliminary <strong>user stories or use cases</strong>: <em>&#8220;List five user stories for a ride-sharing service&#8217;s MVP.&#8221;</em> This can jumpstart your planning with well-structured stories that you can refine. AI can also help clarify requirements: if a requirement is vague, you can ask <em>&#8220;What questions should I ask about this requirement to clarify it?&#8221;</em> - and the AI will propose the key points that need definition (e.g., for &#8220;add security to login&#8221;, AI might suggest asking about 2FA, password complexity, etc.). This ensures you don&#8217;t overlook things early on.</p><p>Another ideation use: <strong>competitive analysis</strong>. You could prompt: <em>&#8220;What are the common features and pitfalls of task management web apps? Provide a summary.&#8221;</em> The AI will list what such apps usually do and common complaints or challenges (e.g., data sync, offline support). This information can shape your requirements to either include best-in-class features or avoid known issues. Essentially, AI can serve as a <strong>research assistant</strong>, scanning the collective knowledge base so you don&#8217;t have to read 10 blog posts manually.</p><p>Of course, all AI output needs critical evaluation - use your judgment to filter which suggestions make sense in context. But at the early stage, quantity of ideas can be more useful than quality, because it gives you options to discuss with your team or stakeholders. Engineers with an AI-native mindset often walk into planning meetings with an AI-generated list of ideas, which they then augment with their own insights. 
This accelerates the discussion and shows initiative.</p><p>AI can also help <strong>non-technical stakeholders</strong> at this stage. If you&#8217;re a tech lead working with, say, a business analyst, you might generate a draft product requirements document (PRD) with AI&#8217;s help and then share it for review. It&#8217;s faster to edit a draft than to write from scratch. Google&#8217;s prompt guide even suggests role-specific prompts for such cases - e.g., <em>&#8220;Act as a business analyst and outline the requirements for a payroll system upgrade&#8221;</em>. The result gives everyone something concrete to react to. In sum, in requirements and ideation, AI is about casting a wide net of possibilities and organizing thoughts, which provides a strong starting foundation.</p><h3><strong>2. System design &amp; architecture</strong></h3><p>Once requirements are in place, designing the system is next. Here, AI can function as a <strong>sounding board for architecture</strong>. For instance, you might describe the high-level architecture you&#8217;re considering - &#8220;We plan to use a microservice for the user service, an API gateway, and a React frontend&#8221; - and ask the AI for its opinion: <em>&#8220;What are the pros and cons of this approach? Any potential scalability issues?&#8221;</em> An AI well-versed in tech will enumerate points similar to what an experienced colleague might raise (e.g., microservices allow independent deployment but add complexity in devops, etc.). This is useful to validate your thinking or uncover angles you missed.</p><p>AI can also help with <strong>specific design questions</strong>: &#8220;Should we choose SQL or NoSQL for this feature store?&#8221; or &#8220;What&#8217;s a robust architecture for real-time notifications in a chat app?&#8221; It will provide a rationale for different choices. 
While you shouldn&#8217;t take its answer as gospel, it can surface considerations (latency, consistency, cost) that guide your decision. Sometimes hearing the reasoning spelled out helps you make a case to others or solidify your own understanding. Think of it as rubber-ducking your architecture to an AI - except the duck talks back with fairly reasonable points!</p><p>Another use is generating <strong>diagrams or mappings</strong> via text. There are tools where if you describe an architecture, the AI can output a pseudo-diagram (in Mermaid markdown, for example) that you can visualize. For example: <em>&#8220;Draw a component diagram: clients -&gt; load balancer -&gt; 3 backend services -&gt; database.&#8221;</em> The AI could produce a Mermaid code block that renders to a diagram. This is a quick way to go from concept to documentation. Or you can ask for <strong>API design suggestions</strong>: <em>&#8220;Design a REST API for a library system with endpoints for books, authors, and loans.&#8221;</em> The AI might list endpoints (GET /books, POST /loans, etc.) along with example payloads, which can be a helpful starting point that you then adjust.</p><p>A particularly powerful use of AI at this stage is <strong>validating assumptions</strong> by asking it to think of failure cases. For example: <em>&#8220;We plan to use an in-memory cache for session data in one data center. What could go wrong?&#8221;</em> The AI might remind you of scenarios like cache crashes, data center outage, or scaling issues. It&#8217;s a bit like a <strong>risk checklist</strong> generator. This doesn&#8217;t replace doing a proper design review, but it&#8217;s a nice supplement to catch obvious pitfalls early.</p><p>On the flip side, if you encounter pushback on a design and need to articulate your reasoning, AI can help you <strong>frame arguments clearly</strong>. You can feed the context to AI and have it help articulate the concerns and explore alternatives. 
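<p>To make the endpoint-design example concrete, a first-pass AI answer to the library-system prompt might resemble the sketch below, captured as plain data so it can be reviewed before any code is written. The resource names and payload fields are illustrative assumptions, not a definitive design:</p>

```python
# Hypothetical first-draft REST API surface for a library system,
# as an AI assistant might propose it: endpoint -> what it accepts/returns.
library_api = {
    "GET /books": {"returns": "list of all books"},
    "POST /books": {"body": {"title": "Dune", "author_id": 7}, "returns": "created book"},
    "GET /authors/{id}": {"returns": "author details plus their books"},
    "POST /loans": {"body": {"book_id": 1, "member_id": 42}, "returns": "loan with due date"},
    "GET /loans": {"query": {"member_id": 42}, "returns": "active loans for a member"},
}

# Print a quick summary - the kind of thing you might paste into a design doc.
for endpoint, spec in library_api.items():
    print(f"{endpoint} -> {spec['returns']}")
```

<p>Treating the proposal as data like this also makes it easy to diff revisions of the design as you iterate with the AI.</p>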
The AI will enumerate issues and you can use that to formulate a respectful, well-structured response. In essence, AI can bolster your <strong>communication</strong> around design, which is as important as the design itself in team settings.</p><p>A more profound shift is that <strong>we&#8217;re moving to spec-driven development</strong>. It&#8217;s not about code-first; in fact, we&#8217;re practically <a href="https://x.com/danshipper/status/1937888424800719283">hiding the code</a>! Modern software engineers are creating (or asking AI for) <a href="https://x.com/_philschmid/status/1937887668710355265">implementation plans first</a>. Some start projects by asking the tool to create a technical design (saved to a markdown file) and an implementation plan (similarly saved locally and fed in later).</p><p><a href="https://www.ignorance.ai/p/ai-at-pulley">Some</a> note that they find themselves &#8220;thinking less about writing code and more about writing specifications - translating the ideas in my head into clear, repeatable instructions for the AI.&#8221; These design specs have <a href="https://writing.nikunjk.com/p/the-work-behind-the-work-is-dead">massive follow-on value</a>; they can be used to generate the PRD, the first round of product documentation, deployment manifests, marketing messages, and even training decks for the sales field. Today&#8217;s best engineers are great at documenting intent that in turn spawns the technical solution.</p><p>This strategic application of AI has profound implications for what defines a senior engineer today. It marks a shift from being a superior problem-solver to becoming a forward-thinking solution-shaper. A senior AI-native engineer doesn't just use AI to write code faster; they use it to see around corners - to model future states, analyze industry trends, and shape technical roadmaps that anticipate the next wave of innovation. 
Leveraging AI for this kind of architectural foresight is no longer just a nice-to-have; it's rapidly becoming a core competency for technical leadership.</p><h3><strong>3. Implementation (Coding)</strong></h3><p>This is the phase most people immediately think of for AI assistance, and indeed it&#8217;s one of the most transformative. We covered in earlier sections how to use coding assistants in your IDE, so here let&#8217;s structure it around typical coding sub-tasks:</p><ul><li><p><strong>Scaffolding and setup:</strong> Setting up new modules, libraries, or configuration files can be tedious. AI can generate boilerplate configs (Dockerfiles, CI pipelines, ESLint configs, etc.) based on descriptions. For example, <em>&#8220;Provide a minimal Vite and TypeScript config for a React app&#8221;</em> may yield decent config files that you might only need to tweak slightly. Similarly, if you need to use a new library (say authentication or logging), you can ask AI, <em>&#8220;Show an example of integrating Library X into an Express.js server.&#8221;</em> It often can produce a minimal working example, saving you from combing through docs for the basics.</p></li><li><p><strong>Feature implementation:</strong> When coding a feature, use AI as a partner. You might start writing a function and hit a moment of doubt - you can simply ask, <em>&#8220;What&#8217;s the best way to implement X?&#8221;</em> Perhaps you need to parse a complex data format - the AI might even recall the specific API you need to use. It&#8217;s like having Stack Overflow threads summarized for you on the fly. Many AI-native devs actually use a rhythm: they outline a function in comments (steps it should take), then prompt the AI to fill it in code. This often yields a nearly complete function which you then adjust. 
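<p>That outline-then-fill rhythm might look like the following sketch, where only the docstring and step comments were written by hand and the body is the kind of fill-in an assistant typically produces (the dedupe task here is a hypothetical example):</p>

```python
def dedupe_by_id(items):
    """Remove duplicates from a list of dicts, treating items with the same 'id' as equal."""
    # Step 1: track which ids we have already seen.
    # Step 2: walk the list in order, keeping only the first occurrence of each id.
    # Step 3: return the filtered list.
    seen = set()
    result = []
    for item in items:
        if item["id"] not in seen:
            seen.add(item["id"])
            result.append(item)
    return result

print(dedupe_by_id([{"id": 1}, {"id": 2}, {"id": 1}]))  # -> [{'id': 1}, {'id': 2}]
```

<p>You stay in control of the intent (the comments); the assistant handles the mechanical translation into code, which you then verify.</p>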
It&#8217;s a different way of coding: you focus on logic and intent, the AI fleshes out syntax and repetitive parts.</p></li><li><p><strong>Code reuse and referencing:</strong> Another everyday scenario - you vaguely remember writing similar code before or know there&#8217;s an algorithm for this. You can describe it and ask the AI. For instance, <em>&#8220;I need to remove duplicates from a list of objects in Python, treating objects with same id as duplicates. How to do that efficiently?&#8221;</em> And if the first answer isn&#8217;t what you need, you can refine or just say &#8220;that&#8217;s not quite it, I need to consider X&#8221; and it will try again. This interactive Q&amp;A for coding is a huge quality-of-life improvement.</p></li><li><p><strong>Maintaining consistency and patterns:</strong> In a large project, you often follow patterns (say a certain way to handle errors or logging). AI can be taught these if you provide context (some tools let you add a style guide or have it read parts of your repo). Even without explicit training, if you point the AI to an existing file as an example, you can prompt <em>&#8220;Create a new module similar to this one but for [some new entity]&#8221;</em>. It will mimic the style and structure, which means the new code fits in naturally. It&#8217;s like having an assistant who read your entire codebase and documentation and always writes code following those conventions (one day, AI might truly do this seamlessly with features like the Model Context Protocol to plug into different environments).</p></li><li><p><strong>Generating tests alongside code:</strong> A highly effective habit is to have AI generate unit tests immediately after writing a piece of code. Many tools (Cursor, Copilot, etc.) can suggest tests either on demand or even automatically. 
For example, after writing a function, you could prompt: <em>&#8220;Generate a unit test for the above function, covering edge cases.&#8221;</em> The AI will create a test method or test case code. This serves two purposes: it gives you quick tests, and it also serves as a quasi-review of your code (if the AI&#8217;s expected behavior in tests differs from your code, maybe your code has an issue or the requirements were misunderstood). It&#8217;s like doing TDD where the AI writes the test and you verify it matches intent. Even if you prefer writing tests yourself, AI can suggest additional cases you might miss (like large input, weird characters, etc.), acting as a safety net.</p></li><li><p><strong>Debugging assistance:</strong> When you hit a bug or an error message, AI can help diagnose it. For instance, you can copy an error stack trace or exception and ask, <em>&#8220;What might be causing this error?&#8221;</em> Often, it will explain in plain terms what the error means and common causes. If it&#8217;s a runtime bug without obvious errors, you can describe the behavior: <em>&#8220;My function returns null for input X when it shouldn&#8217;t. Here&#8217;s the code snippet&#8230; Any idea why?&#8221;</em> The AI might spot a logic flaw. It&#8217;s not guaranteed, but even just explaining your code in writing (to the AI) sometimes makes the solution apparent to you - and the AI&#8217;s suggestions can confirm it. Some AI tools integrated into runtime (like tools in Replit) can even execute code and check intermediate values, acting like an interactive debugger. You could say, &#8220;Run the above code with X input and show me variable Y at each step&#8221; and it will simulate that. This is still early, but it&#8217;s another dimension of debugging that will grow.</p></li><li><p><strong>Performance tuning &amp; refactoring:</strong> If you suspect a piece of code is slow or could be cleaner, you can ask the AI to refactor it for performance or readability. 
For instance: <em>&#8220;Refactor this function to reduce its time complexity&#8221;</em> or <em>&#8220;This code is doing a triple nested loop, can you make it more efficient?&#8221;</em> The AI might recognize a chance to use a dictionary lookup or a better algorithm (e.g., going from O(n^2) to O(n log n)). Or for readability: <em>&#8220;Refactor this 50-line function into smaller functions and add comments.&#8221;</em> It will attempt to do so. Always double-check the changes (especially for subtle bugs), but it&#8217;s a great way to see alternative implementations quickly. It&#8217;s like having a second pair of eyes that isn&#8217;t tired and can rewrite code in seconds for comparison.</p></li></ul><p>In all these coding scenarios, the theme is <strong>AI accelerates the mechanical parts of coding and provides just-in-time knowledge</strong>, while you remain the decision-maker and quality control. It&#8217;s important to interject a note on <strong>version control and code reviews</strong>: treat AI contributions like you would a junior developer&#8217;s pull request. Use git diligently, diff the changes the AI made, run your test suite after major edits, and do code reviews (even if you&#8217;re reviewing code the AI wrote for you!). This ensures robustness in your implementation phase.</p><h3><strong>4. Testing &amp; quality assurance</strong></h3><p>Testing is an area where AI can shine by reducing the toil. We already touched on unit test generation, but let&#8217;s dive deeper:</p><ul><li><p><strong>Unit tests generation:</strong> You can systematically use AI to generate unit tests for existing code. One approach: take each public function or class in your module, and prompt AI with a short description of what it should do (if there isn&#8217;t clear documentation, you might have to infer or write a one-liner spec) and ask for a test. For example, <em>&#8220;Function normalizeName(name) should trim whitespace and capitalize the first letter. 
Write a few PyTest cases for it.&#8221;</em> The AI will output tests including typical and edge cases like empty string, all caps input, etc. This is extremely helpful for legacy code where tests are missing - it&#8217;s like AI-driven test retrofitting. Keep in mind the AI doesn&#8217;t <em>know</em> your exact business logic beyond what you describe, so verify that the asserted expectations match the intended behavior. But even if they don&#8217;t, it&#8217;s informative: an AI might make an assumption about the function that&#8217;s wrong, which highlights that the function&#8217;s purpose wasn&#8217;t obvious or could be misused. You then improve either the code or clarify the test.</p></li><li><p><strong>Property-based and fuzz testing:</strong> You can use AI to suggest properties for property-based tests. For instance, <em>&#8220;What properties should hold true for a sorting function?&#8221;</em> might yield answers like &#8220;the output list is sorted, has same elements as input, idempotent if run twice&#8221; etc. You can turn those into property tests with frameworks like Hypothesis or fast-check. The AI can even help write the property test code. Similarly, for fuzzing or generating lots of input combinations, you could ask AI to generate a variety of inputs in a format. <em>&#8220;Give me 10 JSON objects representing edge-case user profiles (some missing fields, some with extra fields, etc.)&#8221;</em> - use those as test fixtures to see if your parser breaks.</p></li><li><p><strong>Integration and end-to-end tests:</strong> For more complex tests like API endpoints or UI flows, AI can assist by outlining test scenarios. <em>&#8220;List some end-to-end test scenarios for an e-commerce checkout process.&#8221;</em> It will likely enumerate scenarios: normal purchase, invalid payment, out-of-stock item, etc. You can then script those. 
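<p>For the <em>normalizeName</em> prompt above, the generated tests might look like the sketch below. The implementation shown is an assumed one (trim, then capitalize the first letter), included only so the tests can run:</p>

```python
def normalizeName(name):
    # Assumed implementation: trim surrounding whitespace, capitalize the first letter.
    trimmed = name.strip()
    return trimmed[:1].upper() + trimmed[1:] if trimmed else ""

# AI-suggested pytest-style cases, covering typical and edge inputs.
def test_trims_and_capitalizes():
    assert normalizeName("  alice ") == "Alice"

def test_whitespace_only_becomes_empty():
    assert normalizeName("   ") == ""

def test_already_normalized_is_unchanged():
    assert normalizeName("Bob") == "Bob"

test_trims_and_capitalizes()
test_whitespace_only_becomes_empty()
test_already_normalized_is_unchanged()
```

<p>If the AI&#8217;s tests assert behavior your implementation doesn&#8217;t have (say, lower-casing the rest of the name), that mismatch itself is useful review feedback.</p>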
If you&#8217;re using a test framework like Cypress for web UI, you could ask AI to write a test script given a scenario description. It might produce pseudo-code that you tweak into real code (Cypress or Selenium commands). This again saves time on boilerplate and ensures you consider various paths.</p></li><li><p><strong>Test data generation:</strong> Creating realistic test data (like a valid JSON of a complex object) is mundane. AI can generate fake data that looks real. For example, <em>&#8220;Generate an example JSON for a university with departments, professors, and students.&#8221;</em> It will fabricate names, arrays, and so on. This data can then be used in tests or to manually try out an API. It&#8217;s like having an infinite supply of realistic dummy data without writing it yourself. Just be mindful of privacy - if you prompt with real data, ensure you anonymize it first.</p></li><li><p><strong>Exploratory testing via agents:</strong> A frontier area: using AI agents to simulate users or adversarial inputs. There are experimental tools where an AI can crawl your web app like a user, testing different inputs to see if it can break something. Anthropic&#8217;s <em>Claude Code</em> best practices talk about multi-turn debugging, where the AI iteratively finds and fixes issues. You might be able to say, &#8220;Here&#8217;s my function, try different inputs to make it fail&#8221; and the AI will do a mini fuzz test mentally. This isn&#8217;t foolproof, but as a concept it points to AI helping in QA beyond static test cases - by actively trying to find bugs like a QA engineer would.</p></li><li><p><strong>Reviewing test coverage:</strong> If you have tests and want to ensure they cover the logic, you can ask AI to analyze whether certain scenarios are missing. For example, provide a function or feature description and the current tests, and ask <em>&#8220;Are there any important test cases not covered here?&#8221;</em>. 
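<p>The sorting-function properties mentioned earlier can be turned into a quick randomized check. This is a minimal stdlib sketch; a real property-based suite would use a framework like Hypothesis to generate and shrink failing inputs:</p>

```python
import random

def check_sort_properties(sort_fn, trials=200):
    """Randomized check of the classic sorting-function properties."""
    for _ in range(trials):
        data = [random.randint(-50, 50) for _ in range(random.randint(0, 20))]
        out = sort_fn(data)
        # Property 1: the output is ordered.
        assert all(a <= b for a, b in zip(out, out[1:]))
        # Property 2: the output has exactly the same elements as the input.
        assert sorted(out) == sorted(data)
        # Property 3: sorting is idempotent - sorting again changes nothing.
        assert sort_fn(out) == out

check_sort_properties(sorted)  # the built-in passes; a buggy sort would trip an assert
```

<p>Each assert encodes one of the properties the AI suggested, so a violation immediately tells you which invariant broke.</p>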
The AI might notice, e.g., &#8220;the tests didn&#8217;t cover when input is null or empty&#8221; or &#8220;no test for negative numbers&#8221;, etc. It&#8217;s like a second opinion on your test suite. It won&#8217;t know if something is truly missing unless obvious, but it can spot some gaps.</p></li></ul><p>The end goal is higher quality with less manual effort. Testing is typically something engineers <em>know</em> they should do more of, but time pressure often limits it. AI helps remove some friction by automating the creation of tests or at least the scaffolding of them. This makes it likelier you&#8217;ll have a more robust test suite, which pays off in fewer regressions and easier maintenance.</p><h3><strong>5. Debugging &amp; maintenance</strong></h3><p>Bugs and maintenance tasks consume a large portion of engineering time. AI can reduce that burden too:</p><ul><li><p><strong>Explaining legacy code:</strong> When you inherit a legacy codebase or revisit code you wrote long ago, understanding it is step one. You can use AI to <strong>summarize or document code</strong> that lacks clarity. For instance, copy a 100-line function and ask, <em>&#8220;Explain in simple terms what this function does step by step.&#8221;</em> The AI will produce a narrative of the code&#8217;s logic. This often accelerates your comprehension, especially if the code is dense or not well-commented. It might also identify what the code is supposed to do versus what it actually does (catching subtle bugs). Some tools integrate this - you can click a function and get an AI-generated docstring or summary. This is invaluable when you maintain systems with scarce documentation.</p></li><li><p><strong>Identifying the root cause:</strong> When facing a bug report like &#8220;Feature X is crashing under condition Y&#8221; you can involve AI as a rubber duck to reason through the possible causes. 
Describe the situation and the code path as you know it, and ask for theories: <em>&#8220;Given this code snippet and the error observed, what could be causing the null pointer exception?&#8221;</em> The AI might point out, &#8220;if data can be null then data.length would throw that exception, check if that can happen in condition Y.&#8221; It&#8217;s akin to having a knowledgeable colleague to bounce ideas off of; even if they can&#8217;t see your whole system, they often generalize from known patterns. This can save time compared to going down the wrong path in debugging.</p></li><li><p><strong>Fixing code with AI suggestions:</strong> If you&#8217;ve localized a bug to a piece of code, you can simply tell the AI to fix it. <em>&#8220;Fix the bug where this function fails on empty input.&#8221;</em> The AI will provide a patch (like adding a check for empty input). You still have to ensure that&#8217;s the correct fix and doesn&#8217;t break other things, but it&#8217;s quicker than writing it yourself, especially for trivial fixes. Some IDEs do this automatically: for example, if a test fails, an AI could suggest a code change to make the test pass. One must be careful here - always run tests after accepting such changes to ensure no side effects. But for maintenance tasks like upgrading a library version and fixing deprecated calls, AI can be a huge help (e.g., &#8220;We upgraded to React Router v7, update this v6 code to v7 syntax&#8221; - it will rewrite the code using the new API, a big time saver).</p></li><li><p><strong>Refactoring and improving old code:</strong> Maintenance often involves refactoring for clarity or performance. You can employ AI to do large-scale refactors semi-automatically. For instance, <em>&#8220;Our code uses a lot of callback-based async. 
Convert these examples to async/await syntax.&#8221;</em> It can show you how to update a representative snippet, which you can then apply across code (perhaps with a search/replace or with the AI&#8217;s help file by file). Or at a smaller scale, <em>&#8220;Refactor this class to use dependency injection instead of hardcoding the database connection.&#8221;</em> The AI will outline or even implement a cleaner pattern. This is how AI helps you <strong>keep the codebase modern and clean</strong> without spending excessive time on rote transformations.</p></li><li><p><strong>Documentation and knowledge management:</strong> Maintaining software also means keeping docs up to date. AI can make documenting changes easier. After implementing a feature or fix, you can ask AI to draft a short summary or update documentation. For example, <em>&#8220;Generate a changelog entry: Fixed the payment module to handle expired credit cards by adding a retry mechanism.&#8221;</em> It will produce a nicely worded entry. If you need to update an API doc, you can feed it the new function signature and ask for a description. The AI may not know your entire system&#8217;s context, but it can create a good first draft of docs which you then tweak to be perfectly accurate. This lowers the activation energy to write documentation.</p></li><li><p><strong>Communication with team/users:</strong> Maintenance involves communication - explaining to others what changed, what the impact is, etc. AI can help write <strong>release notes</strong> or <strong>migration guides</strong>. E.g., <em>&#8220;Write a short guide for developers migrating from API v1 to v2 of our service, highlighting changed endpoints.&#8221;</em> If you give it a list of changes, it can format it into a coherent guide. For user-facing notes, <em>&#8220;Summarize these bug fixes in non-technical terms for our monthly update.&#8221;</em> Once again, you&#8217;ll refine it, but the heavy lifting of prose is handled. 
This ensures important information actually gets communicated (since writing these can often fall by the wayside when engineers are busy).</p></li></ul><p>In essence, <strong>AI can be thought of as an ever-present helper throughout maintenance</strong>. It can search through code faster than you (if integrated), recall how something should work, and even keep an eye out for potential issues. For example, if you let an AI agent scan your repository, it might flag suspicious patterns (like an API call made without error handling in many places).</p><p>Anthropic&#8217;s <a href="https://www.anthropic.com/engineering/claude-code-best-practices">approach</a> with a CLAUDE.md to give the AI context about your repo is one technique to enable more of this. In time, we may see AI tools that proactively create tickets or PRs for certain classes of issues (security or style). As an AI-native engineer, you will welcome these assists - they handle the drudgery, you handle the final judgment and creative problem-solving.</p><h3><strong>6. Deployment &amp; operations</strong></h3><p>Even after code is written and tested, deploying and operating software is a big part of the lifecycle. AI can help here, too:</p><ul><li><p><strong>Infrastructure as code:</strong> Terraform configurations and Kubernetes manifests are essentially code - and AI can generate them. If you need a quick Terraform script for an AWS EC2 instance with certain settings, you can prompt, <em>&#8220;Write a Terraform configuration for an AWS EC2 instance with Ubuntu, t2.micro, in us-west-2.&#8221;</em> It&#8217;ll give a reasonable config that you can adjust. Similarly, <em>&#8220;Create a Kubernetes Deployment and Service for a Node.js app called myapp, image from ECR, 3 replicas.&#8221;</em> The YAML it produces will be a good starting point. This saves a lot of time trawling through documentation for syntax. 
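As an illustration, the starting point for the Kubernetes prompt above might look roughly like this (a sketch, not actual AI output; the ECR account ID, region, and container port are placeholder assumptions you would substitute):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          # Placeholder ECR image URL - substitute your account ID, region, and tag
          image: 123456789012.dkr.ecr.us-west-2.amazonaws.com/myapp:latest
          ports:
            - containerPort: 3000
---
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: 3000
```

Even a sketch like this gives you the right shape to fill in: the label/selector wiring and port plumbing are exactly the boilerplate that is tedious to recall from docs.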
One caution: verify all credentials, security groups, etc., but the structure will be there.</p></li><li><p><strong>CI/CD pipelines:</strong> If you&#8217;re setting up a continuous integration (CI) workflow (like a GitHub Actions YAML or a Jenkins pipeline), ask AI to draft it. For example: <em>&#8220;Write a GitHub Actions workflow YAML that lints, tests, and deploys a Python Flask app to Heroku on push to main.&#8221;</em> The AI will outline the jobs and steps pretty well. It might not get every key exactly right (since these syntaxes change over time), but it&#8217;s far easier to correct a minor key name than to write the whole file yourself. As CI pipelines can be finicky, having the AI handle the boilerplate while you fix small errors is a huge time saver.</p></li><li><p><strong>Monitoring and alert queries:</strong> If you use monitoring tools (like writing a Datadog query or a Grafana alert rule), you can describe what you want and let the AI propose the config. E.g., <em>&#8220;In PromQL, how do I write an alert that fires if error_rate &gt; 5% over 5 minutes on service X?&#8221;</em> It will craft a query that you can plug in. This is particularly handy because these domain-specific languages (like PromQL, Splunk query language, etc.) can be obscure - AI has likely seen examples and can adapt them for you.</p></li><li><p><strong>Incident analysis:</strong> When something goes wrong in production, you often have logs, metrics, and traces to look at. AI can assist in analyzing those. For instance, paste a block of logs from around the time of failure and ask <em>&#8220;What stands out as a possible issue in these logs?&#8221;</em>. It might pinpoint an exception stack trace in the noise or a suspicious delay. Or describe the symptom and ask <em>&#8220;What are possible root causes of high CPU usage on the database at midnight?&#8221;</em> It could list scenarios (backup running, batch job, etc.), helping your investigation. 
OpenAI&#8217;s enterprise guide emphasizes using AI to surface insights from data and logs - this is an emerging use case, often called AIOps.</p></li><li><p><strong>ChatOps and automation:</strong> Some teams integrate AI into their ops chat. For example, a Slack bot backed by an LLM that you can ask, &#8220;Hey, what&#8217;s the status of the latest deploy? Any errors?&#8221; and it could fetch data and summarize. While this requires some setup (wiring your CI or monitoring into an AI-friendly format), it&#8217;s an interesting direction. Even without that, you can do it manually: copy some output (like test results or deployment logs) and have AI summarize it or highlight failures. It&#8217;s a bit like a personal assistant that reads long scrollbacks of text for you and says &#8220;here&#8217;s the gist: 2 tests failed, looks like a database connection issue.&#8221; You then know where to focus.</p></li><li><p><strong>Scaling and capacity planning:</strong> If you need to reason about scaling (e.g., &#8220;If each user makes X requests and we have Y users, how many instances do we need?&#8221;), AI can help do the math and even account for factors you mention. This isn&#8217;t magic - it&#8217;s just calculation and estimation, but phrasing it to AI can sometimes yield a formatted plan or table, saving you some mental load. Additionally, AI might recall known benchmarks (like &#8220;Usually a t2.micro can handle ~100 req/s for a simple app&#8221;) which can aid rough capacity planning. Always validate such numbers against official sources, but it&#8217;s a quick first estimate.</p></li><li><p><strong>Documentation &amp; runbooks:</strong> Finally, operations teams rely on runbooks - documents outlining what to do in certain scenarios. AI can assist by drafting these from incident post-mortems or instructions. If you solved a production issue, you can feed the steps to AI and ask for a well-structured procedure write-up. 
It will give a neat sequence of steps in markdown that you can put in your runbook repository. This lowers the friction to document operational knowledge, which is often a big win for teams (tribal knowledge gets documented in accessible form). Anthropic&#8217;s enterprise trust guide emphasizes process and people - having clear AI-assisted docs is one way to spread knowledge responsibly.</p></li></ul><p>By integrating AI throughout deployment and ops, you essentially have a co-pilot not just in coding but in DevOps. It reduces the lookup time (how often do we google for a particular YAML snippet or AWS CLI command?), providing directly usable answers. However, always remember to double-check anything AI suggests when it comes to infrastructure - a small mistake in a Terraform script could be costly. Validate in a safe environment when possible. Over time, as you fine-tune prompts or use certain verified AI &#8220;recipes&#8221;, you&#8217;ll gain confidence in which suggestions are solid.</p><div><hr></div><p>As we&#8217;ve seen, across the entire lifecycle from conception to maintenance, there are opportunities to inject AI assistance. </p><p>The pattern is: <strong>AI takes on the grunt work and provides knowledge, while you provide direction, oversight, and final judgment.</strong> </p><p>This elevates your role - you spend more time on creative design, critical thinking, and decision-making, and less on boilerplate and hunting for information. The result is often a faster development cycle and, if managed well, improved quality and developer happiness. 
In the next section, we&#8217;ll discuss some best practices to ensure you&#8217;re using AI effectively and responsibly, and how to continuously improve your AI-augmented workflow.</p><h2><strong>Best practices for effective and responsible AI-augmented engineering</strong></h2><p>Using AI in software development can be transformative, but to truly reap the benefits, one must follow best practices and avoid common pitfalls. In this section, we distill key principles and guidelines for being highly effective with AI in your engineering workflow. These practices ensure that AI remains a powerful ally rather than a source of errors or false confidence.</p><h3><strong>1. Craft clear, contextual prompts</strong></h3><p>We&#8217;ve said it multiple times: <strong>effective prompting is critical</strong>. Think of writing prompts as a new core skill in your toolkit - much like writing good code or good commit messages. A well-crafted prompt can mean the difference between an AI answer that is spot-on and one that is useless or misleading. As a best practice, <strong>always provide the AI with sufficient context</strong>. If you&#8217;re asking about code, include the relevant code snippet or a description of the function&#8217;s purpose. Instead of &#8220;How do I optimize this?&#8221;, say &#8220;Given this code [include snippet], how can I optimize it for speed, especially the sorting part?&#8221; This helps the AI focus on what you care about.</p><p>Be specific about the desired output format too. If you want JSON, say so; if you expect a step-by-step explanation, mention that. For example, <em>&#8220;Explain why this test is failing, step by step&#8221;</em> or <em>&#8220;Return the result as a JSON object with keys X, Y&#8221;</em>. Such instructions yield more predictable, useful results. A great technique from prompt engineering is to <strong>break the task into steps or provide an example</strong>. You might prompt: &#8220;First, analyze the input. 
Then propose a solution. Finally, give the solution code.&#8221; This structure can guide the AI through complex tasks. Google&#8217;s advanced prompt engineering guide covers methods like chain-of-thought prompting and providing examples to reduce guesswork. If you ever get a completely off-base answer, don&#8217;t just sigh - <strong>refine the prompt and try again</strong>. Sometimes iterating on the prompt (&#8220;Actually ignore the previous instruction about X and focus only on Y&#8230;&#8221;) will correct the course.</p><p>It&#8217;s also worthwhile to maintain a <strong>library of successful prompts</strong>. If you find a way of asking that consistently yields good results (say, a certain format for writing test cases or explaining code), save it. Over time, you build a personal playbook. Some engineers even have a text snippet manager for prompts. Given that companies like Google have published extensive prompt guides, you can see how valued this skill is becoming. In short: <strong>invest in learning to speak AI&#8217;s language effectively</strong>, because it pays dividends in quality of output.</p><h3><strong>2. Always review and verify AI outputs </strong></h3><p>No matter how impressive the AI&#8217;s answer is, <strong><a href="https://addyo.substack.com/p/the-trust-but-verify-pattern-for">never blindly trust it</a></strong>. This mantra cannot be overstated. Treat AI output as you would a human junior developer&#8217;s work: likely useful, but in need of review and testing. There are countless anecdotes of bugs slipping in because someone accepted AI code without understanding it. Make it a habit to inspect the changes the AI suggests. If it wrote a piece of code, walk through it mentally or with a debugger. Add tests to validate it (which AI can help write, as we discussed). If it gave you an explanation or analysis, cross-check key points. 
For instance, if AI says &#8220;This API is O(N^2) and that&#8217;s causing slowdowns&#8221;, go verify the complexity against official docs or by reasoning it out yourself.</p><p>Be particularly wary of <strong>factually precise-looking statements</strong>. AI has a tendency to hallucinate details - like function names or syntaxes that look plausible but don&#8217;t actually exist. If an AI answer cites an API or a config key, confirm it in official documentation. In an enterprise context, never trust AI with company-specific facts (like &#8220;according to our internal policy&#8230;&#8221;) unless you fed those to it and it&#8217;s just rephrasing them.</p><p>For code, a good practice is to run whatever quick checks you have: linters, type-checkers, test suites. AI code might not adhere to your style guidelines or could use deprecated methods. Running a linter/formatter not only fixes style but can catch certain errors (unused variables, for example). Some AI tools integrate this - for example, an AI might run the code in a sandbox and adjust if it sees exceptions, but that&#8217;s not foolproof. So you as the engineer must be the safety net.</p><p>In security-sensitive or critical systems, apply extra caution. Don&#8217;t use AI to generate secrets or credentials. If AI provides a code snippet that handles authentication or encryption, double-check it against known secure practices. There have been cases of AI coming up with insecure algorithms because it optimized for passing tests rather than actual security. The <strong>responsibility lies with you</strong> to ensure all outputs are safe and correct.</p><p>One helpful tip: <strong>use AI to verify AI</strong>. For example, after getting a piece of code from the AI, you can ask the same (or another) AI, &#8220;Is there any bug or security issue in this code?&#8221; It might point out something you missed (like, &#8220;It doesn&#8217;t sanitize input here&#8221; or &#8220;This could overflow if X happens&#8221;). 
While this second opinion from AI isn&#8217;t a guarantee either, it can be a quick sanity check. OpenAI&#8217;s and Anthropic&#8217;s coding guides even suggest this approach of iterative prompting and review - essentially debugging with the AI&#8217;s help.</p><p>Finally, maintain a healthy skepticism. If something in the output strikes you as odd or too good to be true, investigate further. AI is great at sounding confident. Part of becoming AI-native is learning where the AI is strong and where it tends to falter. Over time, you&#8217;ll gain an intuition (e.g., &#8220;I know LLMs tend to mess up date math, so I&#8217;ll double-check that part&#8221;). This intuition, combined with thorough review, keeps you in the driver&#8217;s seat.</p><h3><strong>3. Manage scope: use AI to amplify, not to autopilot entire projects</strong></h3><p>While the idea of clicking a button and having AI build an entire system is alluring, in practice it&#8217;s rarely that straightforward or desirable. A best practice is to <strong>use AI to amplify your productivity, not to completely automate what you don&#8217;t oversee</strong>. In other words, keep a human in the loop for any non-trivial outcome. If you use an autonomous agent to generate an app (as we saw with prototyping tools), treat the output as a prototype or draft, not a finished product. Plan to <strong>iterate</strong> on it yourself or with your team.</p><p>Break big tasks into smaller AI-assisted chunks. For instance, instead of saying &#8220;Build me a full e-commerce website&#8221;, you might break it down: use AI to generate the frontend pages first (and review them), then use AI to create a basic backend (review it), then integrate and refine. This modular approach ensures you maintain understanding and control. It also leverages AI&#8217;s strengths on focused tasks, rather than expecting it to juggle very complex interdependent tasks (which is often where it may drop something important). 
Remember that AI doesn&#8217;t truly &#8220;understand&#8221; your project&#8217;s higher objectives; that&#8217;s your job as the engineer or tech lead. You decide the architecture and constraints, and then use AI as a powerful assistant to implement parts of that vision.</p><p><strong>Resist the temptation of over-reliance.</strong> It can be tempting to just ask the AI every little thing, even stuff you know, out of convenience. While it&#8217;s fine to use it for rote tasks, make sure you&#8217;re still <strong>learning and understanding</strong>. An AI-native engineer doesn&#8217;t turn off their brain - quite the opposite, they use AI to free their brain for more important thinking. For example, if AI writes a complex algorithm for you, take the time to understand that algorithm (or at least verify its correctness) before deploying. Otherwise, you might accumulate &#8220;AI technical debt&#8221; - code that works but no one truly groks, which can bite you later.</p><p>One way to manage scope is to set <strong>clear boundaries for AI agents</strong>. If you use something like Cline or Devin (autonomous coding agents), configure them with your rules (e.g., don&#8217;t install new dependencies without asking, don&#8217;t make network calls, etc.). And use features like dry-run or plan mode. For instance, have the agent show you its plan (like Cline does) and approve it step by step. This ensures the AI doesn&#8217;t go on a tangent or take actions you wouldn&#8217;t. Essentially, you act as a project manager for the AI worker - you wouldn&#8217;t let a junior dev just commit straight to main without code review; likewise, don&#8217;t let an AI do that.</p><p>By keeping AI&#8217;s role scoped and supervised, you avoid situations where something goes off the rails unnoticed. You also maintain your own engagement with the project, which is critical for quality and for your own growth. 
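One concrete way to set such boundaries is a rules file the agent reads before acting - Cline, for example, picks up custom instructions from a .clinerules file in the project root. The specific rules below are illustrative, not a canonical set:

```markdown
# .clinerules (illustrative example)
- Do not install, upgrade, or remove dependencies without asking first.
- Do not make outbound network calls other than to our package registry.
- Never commit directly to main; create a branch and summarize the change.
- Present a plan and wait for approval before editing more than one file.
- Run the test suite after any code change and report failures verbatim.
```

A file like this acts as standing guardrails, so you aren&#8217;t re-stating the ground rules in every prompt.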
The flip side is also true: do use AI for all those <em>small</em> things that eat time but don&#8217;t need creative heavy lifting. Let it write the 10th variant of a CRUD endpoint or the boilerplate form validation code while you focus on the tricky integration logic or the performance tuning that requires human insight. This division of labor - AI for grunt work, human for oversight and creative problem solving - is a sweet spot in current AI integration.</p><h3><strong>4. Continue learning and stay updated</strong></h3><p>The field of AI and the tools available are evolving incredibly fast. Being &#8220;AI-native&#8221; today is different from what it will be a year from now. So a key principle is: <strong>never stop learning</strong>. Keep an eye on new tools, new model capabilities, and new best practices. Subscribe to newsletters or communities (there are developer newsletters dedicated to AI tools for coding). Share experiences with peers: what prompt strategies worked for them, what new agent framework they tried, etc. The community is figuring this out together, and being engaged will keep you ahead.</p><p>One practical way to learn is to integrate AI into side projects or hackathons. The stakes are lower, and you can freely explore capabilities. Try building something purely with AI assistance as an experiment - you&#8217;ll discover both its superpowers and its pain points, which you can then apply back to your day job carefully. Perhaps in doing so, you&#8217;ll figure out a neat workflow (like chaining a prompt from GPT to Copilot in the editor) that you can teach your team. In fact, <strong>mentoring others</strong> in your team on AI usage will also solidify your own knowledge. Run a brown bag session on prompt engineering, or share a success story of how AI helped solve a hairy problem. This not only helps colleagues but often they will share their own tips, leveling up everyone.</p><p>Finally, invest in your fundamental skills as well. 
AI can automate a lot, but the better your foundation in computer science, system design, and problem-solving, the better questions you&#8217;ll ask the AI and the better you&#8217;ll assess its answers. Human creativity and deep understanding of systems are not being replaced - in fact, they&#8217;re more important, because now you&#8217;re guiding a powerful tool. As one of my articles suggests, focus on <em><a href="https://news.ycombinator.com/item?id=43361801">maximizing the &#8220;human 30%&#8221;</a></em> - the portion of the work where human insight is irreplaceable. That means things like defining the problem, making judgment calls, and critical debugging. Strengthen those muscles through continuous learning, and let AI handle the rote 70%.</p><h3><strong>5. Collaborate and establish team practices</strong></h3><p>If you&#8217;re working in a team setting (most of us are), it&#8217;s important to <strong>collaborate on AI usage practices</strong>. Share what you learn with teammates and also listen to their experiences. Maybe you found that using a certain AI tool improved your commit velocity; propose it to the team to see if everyone wants to adopt it. Conversely, be open to guidelines - for example, some teams decide &#8220;We will not commit AI-generated code without at least one human review and testing&#8221; (a sensible rule). Consistency helps; if everyone follows similar approaches, the codebase stays coherent and people trust each other&#8217;s AI-augmented contributions.</p><p>You might even formalize this into team conventions. For instance, if using AI for code generation, some teams annotate the PR or code comments like // Generated with Gemini, needs review. This transparency helps code reviewers focus attention. It&#8217;s similar to how we treated code from automated tools (like &#8220;this file was scaffolded by Rails generator&#8221;). 
Knowing something was AI-generated might change how you review - perhaps more thoroughly in certain aspects.</p><p>Encourage pair programming with AI. A neat practice is <em>AI-driven code review</em>: when someone opens a pull request, they might run an AI on the diff to get an initial list of review comments, and then use that to refine the PR before a human even sees it. As a team, you could adopt this as a step (with the caveat that AI might not catch all issues or understand business context). Another collaborative angle is documentation: maybe maintain an internal FAQ of &#8220;How do I ask AI to do X for our codebase?&#8221; - e.g., how to prompt it with your specific stack. This could be part of onboarding new team members to AI usage in your project.</p><p>On the flip side, respect those who are cautious or skeptical of AI. Not everyone may be immediately comfortable or convinced. Demonstrating results in a non-threatening way works better than evangelizing abstractly. Show how it caught a bug or saved a day of work by drafting tests. Be honest about failures too (e.g., &#8220;We tried AI for generating that module, but it introduced a subtle bug we caught later. Here&#8217;s what we learned.&#8221;). This builds collective wisdom. A team that learns together will integrate AI much more effectively than individuals pulling in different directions.</p><p>From a leadership perspective (for tech leads and managers), think about how to <strong>integrate AI training and guidelines</strong>. Possibly set aside time for team members to experiment and share findings (hack days or lightning talks on AI tools). Also, decide as a team how to handle licensing or IP concerns of AI-generated code - e.g., code generation tools have different licenses or usage terms. 
Ensure compliance with those and any company policies (some companies restrict use of public AI services for proprietary code - in that case, perhaps you invest in an internal AI solution or use open-source models that you can run locally to avoid data exposure).</p><p>In short, <strong>treat AI adoption as a team sport</strong>. Everyone should be rowing in the same direction and using roughly compatible tools and approaches, so that the codebase remains maintainable and the benefits are multiplied across the team. AI-nativeness at an organization level can become a strong competitive advantage, but it requires alignment and collective learning.</p><h3><strong>6. Use AI responsibly and ethically</strong></h3><p>Last but certainly not least, always use AI responsibly. This encompasses a few things:</p><ul><li><p><strong>Privacy and security:</strong> Be mindful of what data you feed into AI services. If you&#8217;re using a hosted service like OpenAI&#8217;s API or an IDE plugin, the code or text you send might be stored or seen by the provider under certain conditions. For sensitive code (security-related, proprietary algorithms, user data, etc.), consider using self-hosted models or at least strip out sensitive bits before prompting. Many AI tools now have enterprise versions or on-prem options to alleviate this. Check your company&#8217;s policy: for example, a bank might forbid using any external AI for code. Anthropic&#8217;s enterprise guide suggests a three-pronged approach including process and tech to deploy AI safely. It&#8217;s your duty to follow those guidelines. Also, be cautious of phishing or malicious code - ironically, AI could potentially insert something if it were trained on malicious examples. So code review for security issues stays important.</p></li><li><p><strong>Bias and fairness:</strong> If AI helps generate user-facing content or decisions, be aware of biases. 
For instance, if you&#8217;re using AI to generate interview questions or analyze r&#233;sum&#233;s (just hypothetically), remember the models may carry biases from training data. In software contexts, this might be less direct, but imagine AI generating code comments or documentation that inadvertently uses non-inclusive language. You should still run such outputs through your usual processes for DEI (Diversity, Equity, Inclusion) standards. OpenAI&#8217;s guides on enterprise AI discuss ensuring fairness and checking model outputs for biased assumptions. As an engineer, if you see AI produce something problematic (even in a joke or example), don&#8217;t propagate it. We have to be the ethical filter.</p></li><li><p><strong>Transparency with AI usage:</strong> If part of your product uses AI (say, an AI-written response or a feature built by AI suggestions), consider being transparent with users where appropriate. This is more about product decisions, but it&#8217;s a growing expectation that users know when they&#8217;re reading content written by AI or interacting with a bot. From an engineering perspective, this might mean instrumenting logs to indicate AI involvement or tagging outputs. It could also mean putting guardrails: e.g., if an AI might free-form answer a user query in your app, put in checks or moderation on that output.</p></li><li><p><strong>Intellectual property (IP) concerns:</strong> The legal understanding is still evolving, but be cautious when using AI on licensed material. If you ask AI to generate code &#8220;like library X&#8221;, ensure you&#8217;re not inadvertently copying licensed code (the models sometimes regurgitate training data). Similarly, be mindful of attribution - if the AI produced a result influenced by a specific source, it won&#8217;t cite it unless prompted. For now, treating AI outputs as if they were your own work (with respect to licensing) is prudent - meaning you take responsibility as if you wrote it. 
Some companies even restrict using Copilot due to IP uncertainty for generated code. Keep an eye on updates in this area, and when in doubt, consult with legal or stick to well-known algorithms.</p></li><li><p><strong>Managing expectations and human oversight:</strong> Ethically, engineers should prevent over-reliance on AI in critical areas where mistakes could be harmful (e.g., AI in medical software or autonomous driving). Even if you personally work on a simple web app, the principle stands: ensure there&#8217;s a human fallback for important decisions. For example, if AI summarizes a client&#8217;s requirements, have a human confirm the summary with the client. Don&#8217;t let AI be the sole arbiter of truth in places where it matters. This responsible stance protects you, your users, and your organization.<br></p></li></ul><p>In sum, being an AI-native engineer also means being a <strong>responsible engineer</strong>. Our core duty to build reliable, safe, and user-respecting systems doesn&#8217;t change; we just have more powerful tools now. Use them in a way you&#8217;d be proud of if you had written it all yourself (because effectively, you are accountable for it). Many companies and groups (OpenAI, Google, Anthropic) have published guidelines and playbooks on responsible AI usage - those can be excellent further reading to deepen your understanding of this aspect (see the <strong>Further Reading</strong> section).</p><h3><strong>7. For leaders and managers: cultivate an AI-first engineering culture</strong></h3><p>If you lead an engineering team, your role is not just to permit AI usage, but to <strong>champion it</strong> strategically. This means moving from passive acceptance to active cultivation by focusing on a few key areas:</p><ul><li><p><strong>Leading by example:</strong> Demonstrate how AI can be used for strategic tasks like planning or drafting proposals, and articulate a clear vision for how it will make the team and its products better. 
Model the learning process by openly sharing both your successes and stumbles with AI. An AI-native culture starts at the top and is fostered by authenticity, not just mandates.</p></li><li><p><strong>Investing in skills:</strong> Go beyond mere permission and actively provision resources for learning. Sponsor premium tool licenses, formally sanction time for experimentation (like hack days or exploration sprints), and create forums (demos, shared wikis) for the team to build a collective library of best practices and effective prompts. This signals that skill development is a genuine priority.</p></li><li><p><strong>Fostering psychological safety: </strong>Create an environment where engineers feel safe to experiment, share failures, and ask foundational questions without judgment. Explicitly address the fear of incompetence by framing AI adoption as a collective journey, and counter the fear of replacement by emphasizing how AI augments, rather than automates, the critical thinking and judgment that define senior engineering.</p></li><li><p><strong>Revisiting roadmaps and processes:</strong> Proactively identify which parts of your product or development cycle are ripe for AI-driven acceleration. Be prepared to adjust timelines, estimation, and team workflows to reflect that the nature of engineering work is shifting from writing boilerplate to specifying, verifying, and integrating. Evolve your code review process to place a higher emphasis on the critical human validation of AI-generated outputs.</p></li></ul><div><hr></div><p>Following these best practices will help ensure that your integration of AI into engineering yields positive results - higher productivity, better code, faster learning - without the downsides of sloppy usage. It&#8217;s about combining the best of what AI can do with the best of what <strong>you</strong> can do as a skilled human. 
The next and final section will conclude our discussion, reflecting on the journey to AI-nativeness and the road ahead, along with additional resources to continue your exploration.</p><h2><strong>Conclusion: Embracing the future</strong></h2><p>We&#8217;ve traveled through what it means to be an AI-native software engineer - from mindset, to practical workflows, to tool landscapes, to lifecycle integration, and best practices. It&#8217;s clear that the role of software engineers is evolving in tandem with AI&#8217;s growing capabilities. Rather than rendering engineers obsolete, AI is proving to be a powerful augmentation to human skills. By embracing an AI-native approach, you position yourself to <strong>build faster, learn more, and tackle bigger challenges</strong> than ever before.</p><p>To summarize a few key takeaways: being AI-native starts with seeing AI as a multiplier for your skills, not a magic black box or a threat. It&#8217;s about continuously asking, &#8220;How can AI help me with this?&#8221; and then judiciously using it to accelerate routine tasks, explore creative solutions, and even catch mistakes. It involves new skills like prompt engineering and agent orchestration, but also elevates the importance of timeless skills - architecture design, critical thinking, and ethical judgment - because those guide the AI&#8217;s application. The AI-native engineer is always learning: learning how to better use AI, and leveraging AI to learn other domains faster (a virtuous circle!).</p><p>Practically, we saw that there is a rich ecosystem of tools. There&#8217;s no one-size-fits-all AI tool - you&#8217;ll likely assemble a personal toolkit (IDE assistants, prototyping generators, etc.) tailored to your work. The best engineers will know when to grab which tool, much like a craftsman with a well-stocked toolbox. And they&#8217;ll keep that toolbox up-to-date as new tools emerge. 
Importantly, AI becomes a collaborative partner across all stages of work - not just coding, but writing tests, debugging, generating documentation, and even brainstorming in the design phase. The more areas you involve AI in, the more you can focus your unique human talents where they matter most.</p><p>We also stressed caution and responsibility. The excitement of AI&#8217;s capabilities should be balanced with healthy skepticism and rigorous verification. By following best practices - clear prompts, code reviews, small iterative steps, staying aware of limitations - you can avoid pitfalls and build trust in using AI. As an experienced professional (especially if you are an IC or tech lead, as many of you are), you have the background to guide AI effectively and to mitigate its errors. In a sense, your experience is more valuable than ever: junior engineers can get a boost from AI to produce mid-level code, but it takes a senior mindset to prompt AI to solve complex problems in a robust way and to integrate it into a larger system gracefully.</p><p>Looking ahead, one can only anticipate that AI will get more powerful and more integrated into the tools we use. Future IDEs might have AI running continuously, checking our work or even optimizing code in the background. We might see specialized AIs for different domains (an AI that is an expert in frontend UX vs. one for database tuning). Being AI-native means you&#8217;ll adapt to these advancements smoothly - you&#8217;ll treat them as a natural progression of your workflow. Perhaps eventually &#8220;AI-native&#8221; will simply be <em>&#8220;software engineer&#8221;</em>, because using AI will be as ubiquitous as using Stack Overflow or Google is today. 
Until then, those who pioneer this approach (like you, reading and applying these concepts) will have an edge.</p><p>There&#8217;s also a broader impact: By accelerating development, AI can free us to focus on more ambitious projects and more creative aspects of engineering. It could usher in an era of rapid prototyping and experimentation. As I&#8217;ve mused in one of my pieces, we might even see a shift in <em>who</em> builds software - with AI lowering barriers, more people (even non-traditional coders) could bring ideas to life. As an AI-native engineer, you might play a role in enabling that, by building the tools or by mentoring others in using them. It&#8217;s an exciting prospect: engineering becomes more about imagination and design, while repetitive toil is handled by our AI assistants.</p><p>In closing, adopting AI in your daily engineering practice is not just a one-time shift, but a journey. Start where you are: try one new tool or apply AI to one part of your next task. Gradually expand that comfort zone. Celebrate the wins (like the first time an AI-generated test catches a bug you missed), and learn from the hiccups (maybe the time AI refactoring broke something - it&#8217;s a lesson to improve prompting).</p><p>Encourage your team to do the same, building an AI-friendly engineering culture. With pragmatic use and continuous learning, you&#8217;ll find that AI not only boosts your productivity but can also rekindle joy in development - letting you concentrate on creative problem-solving and seeing faster results from idea to reality.</p><p>The era of AI-assisted development is here, and those who skillfully ride this wave will define the next chapter of software engineering. By reading this and experimenting on your own, you&#8217;re already on that path. 
Keep going, stay curious, and code on - with your new AI partners at your side.</p><h2><strong>Further reading</strong></h2><p>To deepen your understanding and keep improving your AI-assisted workflow, here are some excellent free guides and resources from leading organizations. These cover everything from prompt engineering to building agents and deploying AI responsibly:</p><ul><li><p><strong><a href="https://services.google.com/fh/files/misc/gemini-for-google-workspace-prompting-guide-101.pdf">Google - Prompting Guide 101 (Second Edition)</a></strong> - A quick-start handbook for writing effective prompts, packed with tips and examples for Google&#8217;s Gemini model. Great for learning prompt fundamentals and how to phrase queries to get the best results.</p></li><li><p><strong><a href="https://www.kaggle.com/whitepaper-prompt-engineering">Google - &#8220;More Signal, Less Guesswork&#8221; prompt engineering whitepaper</a></strong> - A 68-page Google whitepaper that dives into advanced prompt techniques (for API usage, chain-of-thought prompts, using temperature/top-p settings, etc.). Excellent for engineers looking to refine their prompt engineering beyond the basics.</p></li><li><p><strong><a href="https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf">OpenAI - </a></strong><em><strong><a href="https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf">A Practical Guide to Building Agents</a></strong></em> - OpenAI&#8217;s comprehensive guide (~34 pages) on designing and implementing AI agents that work in real-world scenarios. 
It covers agent architectures (single vs multi-agent), tool integration, iteration loops, and important safety considerations when deploying autonomous agents.</p></li><li><p><strong><a href="https://www.anthropic.com/engineering/claude-code-best-practices">Anthropic - </a></strong><em><strong><a href="https://www.anthropic.com/engineering/claude-code-best-practices">Claude Code: Best Practices for Agentic Coding</a></strong></em> - A guide from Anthropic&#8217;s engineers on getting the most out of Claude (their AI) in coding scenarios. It includes tips like structuring your repo with a CLAUDE.md for context, prompt formats for debugging and feature building, and how to iteratively work with an AI coding agent. Useful for anyone using AI in an IDE or planning to integrate an AI agent with their codebase.</p></li><li><p><strong><a href="https://cdn.openai.com/business-guides-and-resources/identifying-and-scaling-ai-use-cases.pdf">OpenAI - </a></strong><em><strong><a href="https://cdn.openai.com/business-guides-and-resources/identifying-and-scaling-ai-use-cases.pdf">Identifying and Scaling AI Use Cases</a></strong></em> - This guide helps organizations (and teams) find high-leverage opportunities for AI and scale them effectively. It introduces a methodology to identify where AI can add value, how to prototype quickly, and how to roll out AI solutions across an enterprise sustainably. Great for tech leads and managers strategizing AI adoption.</p></li><li><p><strong><a href="https://assets.anthropic.com/m/66daaa23018ab0fd/original/Anthropic-enterprise-ebook-digital.pdf">Anthropic - </a></strong><em><strong><a href="https://assets.anthropic.com/m/66daaa23018ab0fd/original/Anthropic-enterprise-ebook-digital.pdf">Building Trusted AI in the Enterprise</a></strong></em><strong><a href="https://assets.anthropic.com/m/66daaa23018ab0fd/original/Anthropic-enterprise-ebook-digital.pdf"> (Trust in AI)</a></strong> - An enterprise-focused e-book on deploying AI responsibly. 
It outlines a three-dimensional approach (people, process, technology) to ensure AI systems are reliable, secure, and aligned with organizational values. It also devotes sections to AI security and governance best practices - a must-read for understanding risk management in AI projects.</p></li><li><p><strong><a href="https://cdn.openai.com/business-guides-and-resources/ai-in-the-enterprise.pdf">OpenAI - </a></strong><em><strong><a href="https://cdn.openai.com/business-guides-and-resources/ai-in-the-enterprise.pdf">AI in the Enterprise</a></strong></em><a href="https://cdn.openai.com/business-guides-and-resources/ai-in-the-enterprise.pdf"> </a>- OpenAI&#8217;s 24-page report on how top companies are using AI and lessons learned from those collaborations. It provides strategic insights and case studies, including practical steps for integrating AI into products and operations at scale. Useful for seeing the bigger picture of AI&#8217;s business impact and getting inspiration for high-level AI integration.</p></li><li><p><strong><a href="https://www.kaggle.com/whitepaper-agent-companion">Google - </a></strong><em><strong><a href="https://www.kaggle.com/whitepaper-agent-companion">Agents Companion</a></strong></em><strong><a href="https://www.kaggle.com/whitepaper-agent-companion"> Whitepaper</a></strong> - Google&#8217;s advanced &#8220;102-level&#8221; technical companion to their prompting guide, focusing on AI agents. This guide explores complex topics like agent evaluation, tool use, and orchestrating multiple agents. It&#8217;s a deep dive for developers looking to push the envelope with agent development and deployment - essentially a toolkit for advanced AI builders.</p></li></ul><p>Each of these resources can help you further develop your AI-native engineering skills, offering both theoretical frameworks and practical techniques. 
They are all freely available (no paywalls), and reading them will reinforce many of the concepts discussed in this section while introducing new insights from industry experts. </p><p>Happy learning, and happy building!</p><p><em>I&#8217;m excited to share I&#8217;m writing a new <a href="https://www.oreilly.com/library/view/vibe-coding-the/9798341634749/">AI-assisted engineering book</a> with O&#8217;Reilly. If you&#8217;ve enjoyed my writing here you may be interested in checking it out.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WFGE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faba8cf11-c1d1-4cb8-8400-1fa7b7b91d83_5246x5246.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WFGE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faba8cf11-c1d1-4cb8-8400-1fa7b7b91d83_5246x5246.png 424w, https://substackcdn.com/image/fetch/$s_!WFGE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faba8cf11-c1d1-4cb8-8400-1fa7b7b91d83_5246x5246.png 848w, https://substackcdn.com/image/fetch/$s_!WFGE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faba8cf11-c1d1-4cb8-8400-1fa7b7b91d83_5246x5246.png 1272w, https://substackcdn.com/image/fetch/$s_!WFGE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faba8cf11-c1d1-4cb8-8400-1fa7b7b91d83_5246x5246.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!WFGE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faba8cf11-c1d1-4cb8-8400-1fa7b7b91d83_5246x5246.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aba8cf11-c1d1-4cb8-8400-1fa7b7b91d83_5246x5246.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4499366,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://addyo.substack.com/i/165160941?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faba8cf11-c1d1-4cb8-8400-1fa7b7b91d83_5246x5246.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WFGE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faba8cf11-c1d1-4cb8-8400-1fa7b7b91d83_5246x5246.png 424w, https://substackcdn.com/image/fetch/$s_!WFGE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faba8cf11-c1d1-4cb8-8400-1fa7b7b91d83_5246x5246.png 848w, https://substackcdn.com/image/fetch/$s_!WFGE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faba8cf11-c1d1-4cb8-8400-1fa7b7b91d83_5246x5246.png 1272w, https://substackcdn.com/image/fetch/$s_!WFGE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faba8cf11-c1d1-4cb8-8400-1fa7b7b91d83_5246x5246.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p>]]></content:encoded></item></channel></rss>