Imagine sixteen AI assistants working together for two weeks, coordinating through a shared code repository, and producing a functional C compiler capable of building a Linux kernel. That’s exactly what Anthropic researcher Nicholas Carlini demonstrated in a recent experiment that cost approximately $20,000 in API fees. The project involved 16 instances of the Claude Opus 4.6 AI model working in parallel, creating a 100,000-line Rust-based compiler that can build major open-source projects including PostgreSQL, SQLite, and Redis, and can even compile the classic game Doom.
The Technical Achievement and Its Caveats
Carlini used Anthropic’s new “agent teams” feature, where each Claude instance ran in its own Docker container, independently identifying problems to solve and pushing completed code back upstream. The resulting compiler achieved a 99% pass rate on the GCC torture test suite and can compile for x86, ARM, and RISC-V architectures. However, the limitations are telling: the compiler lacks the 16-bit x86 backend needed to boot Linux from real mode, produces less efficient code than GCC even with GCC’s optimizations disabled, and has buggy assembler and linker components.
“The resulting compiler has nearly reached the limits of Opus’s abilities,” Carlini wrote. “I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.” This pattern of diminishing returns at around 100,000 lines suggests a practical ceiling for autonomous agentic coding with current models.
The Human Scaffolding Behind AI Autonomy
While the headline suggests autonomous AI work, the reality involves significant human engineering. Carlini spent considerable effort building test harnesses, continuous integration pipelines, and feedback systems specifically tuned for how language models fail. He discovered that verbose test output polluted the model’s context window, causing it to lose track of tasks, so he designed test runners that printed only summary lines. He also found that Claude has no sense of time and would spend hours running tests without progress, requiring him to build a fast mode that samples only 1-10% of test cases.
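The two fixes Carlini describes, summary-only test output and a sampled “fast mode”, can be combined in one small harness. The sketch below is an illustration of the idea, not Carlini’s actual test runner; the function names and the 5% default sample rate are hypothetical:

```python
import random

def run_tests(cases, run_case, fast=False, sample_rate=0.05, seed=0):
    """Run test cases but print only a one-line summary, so verbose
    per-test logs never flood the model's context window.
    In fast mode, run a small deterministic random sample of the
    suite instead of every case (hypothetical sketch, not Carlini's
    actual harness)."""
    if fast:
        rng = random.Random(seed)  # fixed seed keeps runs reproducible
        k = max(1, int(len(cases) * sample_rate))
        cases = rng.sample(cases, k)
    failures = [c for c in cases if not run_case(c)]
    # One summary line is all the agent ever sees.
    print(f"{len(cases) - len(failures)}/{len(cases)} passed"
          + (f"; first failures: {failures[:5]}" if failures else ""))
    return failures
```

The design choice mirrors the article’s point: the full failure list is returned for tooling, but only a short summary reaches stdout, and fast mode trades coverage for wall-clock time on a model that cannot feel time passing.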
When all 16 agents got stuck trying to fix the same Linux kernel bug simultaneously, Carlini used GCC as a reference oracle, randomly compiling most kernel files with GCC and only a subset with Claude’s compiler. “Claude will work autonomously to solve whatever problem I give it,” Carlini noted. “So it’s important that the task verifier is nearly perfect, otherwise Claude will solve the wrong problem.”
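One way to implement such a reference-oracle split is a drop-in cc wrapper that deterministically routes a small fraction of source files to the compiler under test and the rest to GCC. The sketch below illustrates the technique only; the compiler paths, the 10% fraction, and the hash-based routing are all hypothetical details, not Carlini’s published setup:

```python
import hashlib
import subprocess

# Hypothetical paths and split; Carlini's actual configuration differs.
REFERENCE_CC = "gcc"          # known-good oracle compiler
CANDIDATE_CC = "./claude-cc"  # compiler under test
FRACTION = 0.10               # share of files routed to the candidate

def pick_compiler(source_path: str) -> str:
    """Deterministically route a small subset of files to the candidate
    compiler, so a miscompile can be bisected to one file rather than
    blamed on the whole kernel build."""
    h = int(hashlib.sha256(source_path.encode()).hexdigest(), 16)
    return CANDIDATE_CC if (h % 100) < FRACTION * 100 else REFERENCE_CC

def compile_unit(argv):
    """Invoke the chosen compiler as a drop-in cc replacement,
    deciding the route from the last .c argument."""
    sources = [a for a in argv if a.endswith(".c")]
    cc = pick_compiler(sources[-1]) if sources else REFERENCE_CC
    return subprocess.call([cc] + argv)
```

Hashing the path rather than rolling a fresh random number keeps the split stable across rebuilds: the same files always go to the same compiler, which is what makes bisection meaningful.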
Industry Context: The AI Agent Race Intensifies
This experiment arrives amid intensifying competition in the AI agent space. Just days before Carlini’s demonstration, Anthropic released Opus 4.6 with its new “agent teams” feature that allows multiple AI agents to split and coordinate tasks in parallel. Scott White, Head of Product at Anthropic, explained: “Instead of one agent working through tasks sequentially, you can split the work across multiple agents – each owning its piece and coordinating directly with the others.”
Simultaneously, OpenAI launched GPT-5.3 Codex, an upgraded agentic coding model that’s 25% faster than its predecessor and expands beyond code generation to handle the entire software lifecycle. According to OpenAI, the model can create complex apps from scratch over days and was instrumental in its own creation. Both companies are positioning their tools not just for developers but for broader knowledge workers, with Anthropic noting that many non-developers are using Claude Code because “it was a really amazing engine to do tasks.”
Productivity Impacts and Industry Debate
The timing of these agentic coding tools coincides with measurable productivity gains in software development. According to Financial Times analysis, GitHub code pushes in the US were running 30% above pre-2025 trend levels by Q3 2025, iOS app releases grew 55% in January 2026 compared with January 2025, and global website registrations increased 34% year-over-year after years of stability.
Anthropic engineer Boris Cherny reports that “pretty much 100% of our code is written by Claude Code + Opus 4.5. For me personally it has been 100% for two+ months now, I don’t even make small edits by hand. I shipped 22 PRs yesterday and 27 the day before, each one 100% written by Claude.”
The Clean Room Controversy and Ethical Considerations
Anthropic describes the compiler as a “clean-room implementation” because the agents had no internet access during development. However, this framing has drawn criticism from developers who note that the underlying model was trained on enormous quantities of publicly available source code, almost certainly including GCC, Clang, and numerous smaller C compilers. In traditional software development, a “clean-room” implementation specifically means the implementers have never seen the original code.
On Hacker News, one commenter described the project as “rather a brute force attempt to decompress fuzzily stored knowledge contained within the network.” The $20,000 figure also deserves context – it covers only API token costs and excludes the billions spent training the model, the human labor invested in building scaffolding, and decades of work by compiler engineers who created the test suites and reference implementations.
Carlini himself raised concerns rooted in his previous career in penetration testing, noting that “the thought of programmers deploying software they’ve never personally verified is a real concern.” This highlights a fundamental tension in the agentic coding revolution: as AI systems become more capable of autonomous work, human oversight and verification become both more critical and potentially more challenging.
What This Means for Businesses and Developers
The C compiler experiment reveals both the promise and limitations of current AI agent technology. For businesses, the methodology of parallel agents coordinating through Git with minimal human supervision represents a novel approach that could be adapted for other complex software projects. The engineering tricks Carlini developed – context-aware test output, time-boxing, and reference oracles for parallelization – could become standard practices in agentic software development.
However, the experiment also demonstrates that AI agents excel at well-defined tasks with comprehensive test suites and reference implementations. Most real-world software projects lack these advantages. As Carlini noted, “The hard part of most development isn’t writing code that passes tests; it’s figuring out what the tests should be in the first place.”
For developers and businesses considering AI agent adoption, the key insight is that these tools work best within carefully engineered environments with robust verification systems. They can dramatically accelerate development on well-understood problems but may struggle with novel challenges or projects that exceed certain complexity thresholds. As the industry races toward more autonomous AI systems, the most successful implementations will likely be those that maintain the right balance between AI autonomy and human oversight.

