What Every Developer Should Know About Background Coding Agents
What they can and can't do, and why true autonomy remains distant
This essay is an excerpt from a book I’m writing for developers to use AI effectively in their day-to-day work.
If you’d like to see the book published, subscribe to my newsletter.
The Autonomy Spectrum
Agents differ in how much independence they have when performing tasks, and this autonomy is either adjustable or fixed by design. Autonomy determines what and how much an agent can do without your intervention, which directly affects both productivity and risk. It helps to think of autonomy as a spectrum rather than fixed steps, one that can be adjusted depending on context, confidence, and the cost of failure.
No Autonomy: Traditional IDEs
Traditional IDEs have no initiative and don’t act independently. They require a developer to be in charge—without your interaction, they remain idle.
While they may offer code generation, suggestions, and completions to improve productivity, these features are typically algorithmic and template-based, adapting minimally to the specific project. However, this lack of autonomy makes them highly predictable and low-risk, giving developers complete control over every action. The tradeoff is lower productivity relative to modern IDEs equipped with coding agents.
Partial Autonomy: Human-in-the-Loop
Partially autonomous agents can plan, suggest, and execute actions, but they’re designed with a human in the loop. They can modify source code, run scripts, execute commands, and answer queries, yet they pause and ask for approval before taking actions that aren’t pre-approved. You grant permissions, abort or redirect when needed, and maintain control. Partial autonomy strikes a balance between safety and utility, keeping you engaged while offloading trivial and repetitive work.
Most coding agents available in the market today are designed to be partially autonomous, whether they’re IDE-based agents (GitHub Copilot, Cursor, Windsurf, Cline, etc.) or command-line agents (Codex, Claude Code, Aider, amp, Gemini CLI, etc.). Either way, these systems augment rather than replace developer judgment.
Problem statements in software engineering are open-ended. This makes partial autonomy not merely a stepping stone toward full autonomy, but likely the default paradigm for the foreseeable future. Partial autonomy isn’t an inferior version of full autonomy; it’s a deliberate design choice that acknowledges the complexity, ambiguity, and creative judgment involved in software development, as well as the current limitations of LLM architectures. In software engineering, human oversight remains essential for navigating uncertain requirements and making the nuanced trade-offs that define quality engineering work.
Full (?) Autonomy: Asynchronous Background Agents
I find the subject of autonomous background agents fascinating. This space is rife with marketing hype, with some vendors even claiming you can delegate entire features to their agents asynchronously. Meanwhile, developers have formed expectations and misconceptions, fueled in part by vendor promises, that do not align with reality. I’ll objectively present the current landscape and potential future trajectory.
Defining Autonomy
Before discussing autonomous background agents, it helps to establish what autonomy means in this context. Developers are comfortable shipping code to production when they have trustworthy quality control mechanisms in place. This includes both manual and automated verification throughout the development, build, and deployment stages. An agent’s output flows through these same quality control mechanisms.
Today, when we talk about autonomous background agents, their autonomy is limited to a specific window: it begins when a developer hands over a problem statement and ends when the agent raises a PR for human review. Outside this window, the agent requires human direction or oversight.
Therefore, viable autonomy hinges on the success rate—how often an agent produces an acceptable solution. If an agent completes tasks successfully most of the time, matches or exceeds typical developer performance, and does so faster, then it becomes economically acceptable. In practical terms, this means the resulting PR gets merged without significant rework in most cases, relative to what you’d expect from a developer.
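To make “economically acceptable” concrete, here is a deliberately simple back-of-envelope model. All of the numbers (review time, rework time, the four-hour developer baseline) are illustrative assumptions, not measurements; the point is how quickly the math turns against the agent as the success rate drops.

```python
# Illustrative only: every agent PR costs review time, and failed attempts
# also cost human rework. The numbers below are assumptions, not benchmarks.
def expected_human_hours(success_rate: float, review_hours: float, rework_hours: float) -> float:
    """Expected human time per task when an agent does the first pass."""
    return review_hours + (1 - success_rate) * rework_hours

# Hypothetical task a developer would finish in 4 hours on their own:
print(expected_human_hours(0.8, 1.0, 4.0))  # 1.8 hours of human time: a clear win
print(expected_human_hours(0.3, 1.0, 4.0))  # 3.8 hours: barely better than doing it yourself
```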
Current Autonomous Agent Landscape
After diving into the source code of a few open-source CLI-first coding agents, I noticed something interesting: they’re typically containerized and deployed as autonomous background agents. You assign tasks through issue trackers like GitHub Issues, messaging apps like Slack, or other interfaces that communicate with the agent. This agent works autonomously in its own isolated environment and usually raises a PR for review.
In other words, we’re packaging CLI agents designed for partial autonomy and deploying them for asynchronous background workflows. If you’re using a CLI agent already, the background version is likely the same tool deployed in the cloud with Slack or GitHub integration.
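As a rough sketch of what that packaging looks like in practice, the handler below reacts to a GitHub issue event by running a containerized CLI agent against the repository. Everything specific here is hypothetical: the label name, container image, and agent command are placeholders, not any vendor’s actual interface.

```python
# Sketch only: wrapping a CLI coding agent for event-driven background use.
# The label, image name, and agent CLI below are hypothetical placeholders.
import subprocess

def handle_issue_event(payload: dict) -> None:
    """Dispatch a labeled GitHub issue to an isolated, containerized agent run."""
    issue = payload.get("issue", {})
    labels = [label["name"] for label in issue.get("labels", [])]
    if "agent-task" not in labels:
        return  # only act on issues explicitly labeled for the agent

    task = f"{issue.get('title', '')}\n\n{issue.get('body') or ''}"

    # Run the same CLI agent you would use locally, but inside a container
    # with a fresh clone of the repository. The agent is expected to push a
    # branch and open a PR for human review when it finishes.
    subprocess.run(
        [
            "docker", "run", "--rm",
            "-e", "REPO_URL=https://github.com/example/repo.git",
            "example/cli-agent:latest",          # hypothetical image
            "agent", "run", "--task", task,      # hypothetical agent CLI
        ],
        check=True,
        timeout=3600,
    )
```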
It’s worth noting that “asynchronous” and “autonomous” here describe the workflow characteristics rather than the agent’s nature. The agent’s capabilities remain largely unchanged—what shifts is the execution context: from terminal interaction to event-driven background operation that can proceed without human presence. Full autonomy, however, remains a different problem entirely.
Why Full Autonomy Is a Different Problem
The gap between current “autonomous” agents and genuine full autonomy becomes clear when we consider the capabilities of present-day LLMs and agents alongside software development’s inherent nature: an open-ended problem space with several viable solutions and competing trade-offs. Both problem statements and solutions evolve as clarity emerges through iteration. Several fundamental limitations stand in the way of full autonomy.
Search and Planning Limitations
Software engineering, at its core, is a search and planning problem, one that humans tackle through deliberate reasoning, exploration, and backtracking. In computer science, search and planning have long been recognized as complex topics with dedicated textbooks and research fields. Yet existing LLMs’ core architecture and training optimize for pattern recognition and sequence prediction over goal-oriented reasoning. This creates a fundamental mismatch between agent capabilities and the demands of the software engineering domain. Reasoning models attempt to bridge this gap through extended “thinking” phases and show promise on well-defined problem statements. However, most software engineering tasks are ill-defined, and success criteria emerge through iterating on and clarifying the problem statement. Recent work (Hariharan et al., 2025 (https://arxiv.org/pdf/2509.02761)) demonstrates that even for well-defined tasks, LLMs produce plans with logical errors, unnecessary actions, and inefficiencies that require iterative verification and refinement, suggesting software engineering problems may present an even greater challenge.
Model Accuracy and Compounding Errors
Errors compound across multi-step work: a model with 90% accuracy on individual steps succeeds only about 35% of the time over a 10-step process. Mistakes made early cascade through subsequent steps, and each additional step can degrade the quality of the solution and risks diverging from the task’s intent. Reliable verification and feedback mechanisms can mitigate this compounding effect, so their precision, granularity, quality, and latency directly shape an agent’s overall accuracy and reliability; advances in these mechanisms translate directly into more capable agents.
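A quick back-of-envelope calculation makes both halves of this argument concrete. The per-step accuracy, the step count, and the assumption that verification catches every failure are all simplifications for illustration.

```python
# Illustrative only: how per-step errors compound, and how verified retries
# mitigate them (assuming, optimistically, that verification catches failures).
def end_to_end_success(per_step: float, steps: int, verified_retries: int = 0) -> float:
    """Probability that every step in a chain ends up correct."""
    step_success = 1 - (1 - per_step) ** (verified_retries + 1)
    return step_success ** steps

print(end_to_end_success(0.90, 10))                      # ~0.35: errors compound
print(end_to_end_success(0.90, 10, verified_retries=2))  # ~0.99: feedback restores reliability
```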
Context Relevance and Sufficiency
Irrelevant information in the model’s context window is counterproductive when performing a task, regardless of the window size. Contrary to popular belief, including an entire codebase doesn’t help and actually hurts performance. Context should be limited to what’s relevant for the current task (Shi et al., 2023 (https://arxiv.org/pdf/2302.00093)). The challenge is twofold: first, determining and gathering only the relevant files, dependencies, and historical decisions that matter; second, fitting that information within the available context window while discarding what’s irrelevant. Poor selection pollutes the context and degrades the agent’s reasoning ability. Unlike humans, who can fluidly zoom in and out of high-level understanding and specific details, agents struggle to discern what to retain versus discard, risking incomplete or incorrect mental models of the systems they modify.
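The second half of the problem, fitting what matters into a fixed budget, at least has a simple mechanical shape. The sketch below is a naive greedy selector; the hard part is hidden inside score_relevance, a hypothetical stand-in for whatever retrieval signal you trust (embeddings, the dependency graph, recent change history, and so on).

```python
# A minimal sketch of context selection under a token budget. The relevance
# scorer and token counter are injected; both are assumptions, not a real API.
from typing import Callable

def select_context(
    candidates: dict[str, str],                      # path -> file contents
    task: str,
    token_budget: int,
    score_relevance: Callable[[str, str], float],    # hypothetical ranking signal
    count_tokens: Callable[[str], int],
) -> list[str]:
    """Greedily pick the highest-scoring files that fit within the budget."""
    ranked = sorted(
        candidates.items(),
        key=lambda item: score_relevance(task, item[1]),
        reverse=True,
    )
    selected, used = [], 0
    for path, text in ranked:
        cost = count_tokens(text)
        if used + cost <= token_budget:
            selected.append(path)
            used += cost
    return selected
```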
Tacit Knowledge Gap
Domain expertise, product decisions, and the reasoning behind architectural design choices exist as tacit knowledge that is nearly impossible to fully externalize into prompts or documentation. Much of this knowledge remains undocumented because it cannot be easily articulated or is simply taken for granted. Developers and other stakeholders develop this understanding through conversations, experimentation, hands-on development, code reviews, and iterative problem-solving—a continuous process that shapes judgment over time. Without participating in these ongoing exchanges, agents lack the context needed to make nuanced decisions. Even when agents gather context from codebases and documentation, that context is inherently incomplete without these human-held insights.
Validating Semantic Correctness
Background agents typically operate in isolated environments with minimal feedback to validate their changes. They rely on compilers, linters, and build systems to enforce syntactic correctness (proper grammar and structure) and static semantic correctness (type safety, import resolution, and interface conformance). However, they cannot determine whether program behavior matches its intended specification or requirements (behavioral semantic correctness). Tests could validate behavioral semantics only when they fully capture all requirements, but ensuring that is itself an unsolved problem in software engineering.
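A deliberately simple, hypothetical example shows the gap. Everything an isolated agent can check here comes back green, yet the code still violates the intended requirement, and the test never notices.

```python
# Hypothetical illustration: this parses, type-checks, lints cleanly, and the
# test passes, so every static signal an isolated agent relies on is green.
# But if the requirement is "round monetary amounts half UP to the nearest
# cent", it is behaviorally wrong: Python's round() uses round-half-to-even.
def round_to_cents(amount: float) -> float:
    return round(amount, 2)

def test_round_to_cents() -> None:
    assert round_to_cents(2.34) == 2.34      # passes
    assert round_to_cents(19.99) == 19.99    # passes
    # The spec-violating case is never exercised:
    # round_to_cents(2.125) returns 2.12, while the requirement expects 2.13.
```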
Risk Asymmetry
Agents’ access to powerful tools amplifies both their capabilities and the consequences of their actions. Tools unlock new possibilities but also introduce proportionally severe risks when things go wrong.
Verification Burden
As problem scope increases, agents generate proportionally larger change sets that require human review. Reviewers must verify and validate the agent’s changes, examining the implementation, its security implications, and edge cases. This verification work scales with the problem’s scope and complexity, quickly becoming time-consuming and unsustainable, and it undermines the efficiency gains autonomous agents promise to deliver.
Scaling Laws and Diminishing Returns
Early generations of language models saw exponential gains in capability, but recent models from different providers have converged in performance despite massive increases in training compute. We’ve hit a new phase where doubling model capability requires far more than double the compute—the economics have fundamentally shifted. Even with exponentially more resources, we’re now seeing only incremental improvements in reasoning and planning. This means the current model architectures won’t simply scale their way to autonomous software engineering. We need fundamentally different approaches to how models reason, plan, and verify their work, not just bigger or better versions of what we have today.
As of 2025, full autonomy represents a qualitatively different challenge than the capabilities that LLMs and agents possess today. The limitations aren’t merely technical—they highlight fundamental mismatches between agents and software engineering’s nature as a domain.
Meaningful progress toward full autonomy requires addressing limitations on several fronts. Beyond better ways to define problem statements, we need verification and validation methods well beyond what is currently available; breakthroughs in validating semantic correctness and in planning capabilities are prerequisites. We must also reimagine software development practices: create abstractions and workflows that account for human cognitive limits and support verification beyond line-by-line reviews. And how to transfer, or circumvent, our reliance on tacit knowledge remains a puzzle. The path forward lies not in perfecting autonomous agents in isolation, but in redesigning how humans and agents collaborate throughout the development lifecycle.
Usefulness and Future Trajectory
The fundamental limitations to full autonomy remain; they limit but do not negate the usefulness of today’s class of autonomous coding agents. Software engineering has always been about creatively working around constraints and making trade-offs to build systems that work for us. Working with LLMs and coding agents is no different—we often operate within constraints and environments that may not be ideal, but we make them work through careful engineering.
I’m not downplaying the significance of coding agents or language models. I’m highlighting the variables at play so you understand what you can control and optimize. Multiple factors influence an agent’s performance: the philosophy behind its design and architecture, system prompts, available tools, tool descriptions and feedback mechanisms, the model in use, your tech stack and the language model’s training dataset, your codebase’s structure and uniformity, how clearly domain concepts are expressed through naming, and documentation of the architectural patterns used throughout your codebase. How you communicate problem statements to your agents also matters significantly. There’s plenty of room to experiment with these variables to nudge results toward what you want.
It’s important to have realistic expectations: coding agents running autonomously in the cloud are fundamentally the same as those running on your local machine. Given the same coding agent, underlying model, and problem statement, if the agent struggles locally, it’s likely that the autonomous agent will face similar challenges. The environment may differ, but the core capabilities and limitations remain consistent. Understanding this helps you calibrate your expectations toward autonomous agents based on your local development experience.
Better models and agents with improved planning and execution capabilities continue to emerge. Whether these improvements will be incremental or exponential is difficult to predict. But even today, understanding the variables at play can help you delegate tasks effectively to autonomous background agents. Those who regularly use CLI coding agents gain a distinct advantage—through daily use, they develop intuition about which classes of problems produce reliable results on the first try. When they work with background agents deployed using the same underlying tool, they’ve already learned how to communicate problem statements effectively in a way that produces desired results consistently. That intuition transfers directly and is immensely valuable.
I relate to how Andrej Karpathy reflected on the progress of self-driving cars. They took nearly a decade to reach where they are today and still aren’t ubiquitous. But driving is a fairly constrained problem space with relatively structured environments and clear success criteria.
Software engineering is far more complex. It’s an open-ended domain where requirements are ambiguous, solutions emerge through iteration, and success criteria shift as understanding deepens. We’ll make progress, but whether we reach genuine autonomy in software development remains to be seen.
Next section in the chapter: Developer-Agent Interaction
You’ve read an early excerpt from my upcoming book. If this essay resonated with you, subscribe to follow along and help shape what goes into it.