By Ryan Lopopolo, Member of the Technical Staff
Overview
OpenAI's Harness team built and shipped an internal beta product over five months with zero lines of manually written code, using Codex agents exclusively. The article details how they redefined the engineering role around designing environments, specifying intent, and building feedback loops rather than writing code directly. The result was ~1 million lines of agent-generated code across ~1,500 pull requests, delivered in roughly 1/10th of the time manual development would have taken.
What You'll Learn
How to structure a repository and documentation system optimized for AI agent legibility using progressive disclosure patterns
Why enforcing architectural invariants mechanically (via custom linters and structural tests) is critical for agent-generated codebases
How to give coding agents access to observability tooling (logs, metrics, traces) so they can self-validate and fix issues autonomously
When to treat AGENTS.md as a table of contents rather than an encyclopedia to avoid context overload
How to implement continuous 'garbage collection' processes to prevent entropy and architectural drift in agent-generated code
Prerequisites & Requirements
- Experience with software engineering workflows including CI/CD, code review, and pull request processes
- Understanding of software architecture patterns such as layered architecture and dependency management
- Familiarity with AI coding agents (e.g., OpenAI Codex, GitHub Copilot) and prompt-driven development
- Experience with observability tools (logs, metrics, traces) and query languages like LogQL or PromQL (optional)
Key Questions Answered
Can you build a real software product entirely with AI coding agents and zero manually written code?
How should you structure AGENTS.md files for maximum AI agent effectiveness?
What role do human engineers play in an agent-first software development workflow?
How do you prevent architectural drift and code quality decay in an AI-generated codebase?
How can AI coding agents validate their own UI changes and bug fixes?
What is the 'progressive disclosure' approach for AI agent context management?
Why do boring technologies work better for AI coding agents?
How does high agent throughput change traditional merge and code review practices?
Key Actionable Insights
1. Treat your AGENTS.md as a table of contents, not an encyclopedia. Keep it to roughly 100 lines that serve as a map with pointers to deeper documentation in a structured docs/ directory. A monolithic instruction file crowds out task context, rots quickly, and causes agents to pattern-match locally rather than navigate intentionally.
   The team found that too much upfront guidance becomes 'non-guidance': when everything is marked important, nothing is. Progressive disclosure lets agents start with a stable entry point and find relevant context on demand.
2. Invest in making your application directly legible to agents by exposing UI state, logs, metrics, and traces through programmatic interfaces. Wire Chrome DevTools Protocol into agent runtimes for DOM snapshots and screenshots, and provide local observability stacks with queryable APIs (LogQL, PromQL, TraceQL).
   The team's bottleneck shifted from code throughput to human QA capacity. By making the application itself inspectable by agents, they enabled autonomous bug reproduction, fix validation, and performance verification without human intervention.
3. Enforce architectural invariants mechanically through custom linters and structural tests rather than relying on documentation alone. Write custom lint error messages that inject remediation instructions directly into agent context, turning every violation into a learning opportunity for the agent.
   In a human-first workflow, strict linting rules might feel pedantic. With agents, they become multipliers: once encoded, they apply everywhere at once, preventing drift across a million-line codebase generated at high throughput.
4. Implement automated 'garbage collection' for your codebase by encoding golden principles and running recurring cleanup agents that scan for deviations, update quality grades, and open targeted refactoring pull requests. This catches bad patterns daily rather than letting them compound.
   The team initially spent 20% of their week (every Friday) manually cleaning up 'AI slop.' By automating this into background Codex tasks with codified principles, they scaled cleanup proportionally to code generation throughput.
5. When agents struggle with a task, resist the urge to 'try harder' or write the code manually. Instead, diagnose what capability is missing (tools, guardrails, abstractions, or documentation) and have the agent itself build that missing capability into the repository.
   This depth-first approach compounds over time: each missing capability that gets encoded becomes infrastructure for all future agent tasks, steadily increasing the scope of what agents can accomplish autonomously.
6. Push all relevant team knowledge into the repository as versioned, co-located artifacts. Slack discussions, Google Docs, and tacit human knowledge are invisible to agents; if it isn't discoverable in the repo, it effectively doesn't exist for the agent and leads to misaligned output.
   This mirrors the new-hire onboarding problem: anything not written down is lost context. The team treats the repo as the single system of record for product principles, engineering norms, architecture decisions, and even team culture preferences.
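The first insight's guidance (an AGENTS.md of roughly 100 lines that points into docs/) is the kind of rule that can be checked mechanically. The sketch below is one possible CI check, not the team's actual tooling: the function name `check_agents_md`, the 100-line budget constant, and the `docs/...` pointer pattern are all assumptions made for illustration.

```python
import re
from typing import List, Set

MAX_LINES = 100  # assumed budget, based on the "roughly 100 lines" guidance
DOC_POINTER = re.compile(r"docs/[\w./-]+")  # assumed shape of a pointer into docs/


def check_agents_md(text: str, known_docs: Set[str], max_lines: int = MAX_LINES) -> List[str]:
    """Return a list of problems found in an AGENTS.md body (empty list means OK)."""
    problems = []
    line_count = len(text.splitlines())
    if line_count > max_lines:
        problems.append(
            f"AGENTS.md has {line_count} lines; budget is {max_lines}. "
            "Move detail into docs/ and link to it."
        )
    # Strip trailing sentence punctuation that the pattern may have swallowed.
    pointers = {match.rstrip(".,;:)") for match in DOC_POINTER.findall(text)}
    if not pointers:
        problems.append("AGENTS.md contains no pointers into docs/; it should act as a map.")
    for pointer in sorted(pointers - known_docs):
        problems.append(f"Dangling pointer: {pointer} does not exist.")
    return problems
```

In practice `known_docs` would be built from the files actually present under docs/, so a renamed or deleted document fails the check instead of silently leaving a stale pointer.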
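For the second insight, a small helper that an agent can call to query a local observability stack might look like the following. This is a minimal sketch under stated assumptions: the base addresses are the usual local defaults for Prometheus and Loki rather than anything the article specifies, the helper names are hypothetical, and the code only builds request URLs and reads an instant-query response; it does not perform network calls.

```python
from typing import Optional
from urllib.parse import urlencode

PROM_BASE = "http://localhost:9090"  # assumed local Prometheus address
LOKI_BASE = "http://localhost:3100"  # assumed local Loki address


def prom_instant_query_url(promql: str, base: str = PROM_BASE) -> str:
    """Build a URL for Prometheus's instant-query HTTP endpoint."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


def loki_range_query_url(logql: str, start_ns: int, end_ns: int,
                         limit: int = 100, base: str = LOKI_BASE) -> str:
    """Build a URL for Loki's range-query HTTP endpoint (nanosecond timestamps)."""
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base}/loki/api/v1/query_range?{urlencode(params)}"


def max_sample(prom_response: dict) -> Optional[float]:
    """Return the largest value in an instant-vector response, or None if empty or failed."""
    if prom_response.get("status") != "success":
        return None
    values = [float(sample["value"][1]) for sample in prom_response["data"]["result"]]
    return max(values) if values else None
```

An agent validating a performance fix could compare `max_sample(...)` against a latency budget before and after its change, which is the kind of self-validation the insight describes.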
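The third insight, lint errors that carry their own remediation instructions, can be illustrated with a small structural check. The sketch below assumes a hypothetical four-layer layout (`models`, `repo`, `service`, `ui`) in which a module may import only from its own layer or a lower one; the layer names, the function names, and the message wording are invented for illustration and are not taken from the team's codebase.

```python
import ast
from typing import List, Optional

# Hypothetical layer order, lowest first. A layer may import itself or anything below it.
LAYERS = ["models", "repo", "service", "ui"]


def layer_of(module: str) -> Optional[int]:
    """Map a dotted module name to its layer index, or None if it is outside the layers."""
    top = module.split(".")[0]
    return LAYERS.index(top) if top in LAYERS else None


def lint_imports(module_name: str, source: str) -> List[str]:
    """Return one error per upward import, each ending with a concrete fix."""
    own = layer_of(module_name)
    if own is None:
        return []
    errors = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            targets = [node.module]
        else:
            continue
        for target in targets:
            dep = layer_of(target)
            if dep is not None and dep > own:
                errors.append(
                    f"{module_name}:{node.lineno}: layer '{LAYERS[own]}' must not import "
                    f"'{target}' from higher layer '{LAYERS[dep]}'. "
                    f"Fix: move the shared code into '{LAYERS[own]}' or below, "
                    f"or have '{LAYERS[dep]}' pass the value in as an argument."
                )
    return errors
```

Because the fix is part of the error text, an agent that reads the linter output gets the remediation in its context without needing to look up a separate architecture document.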
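The recurring 'garbage collection' pass in the fourth insight could start from something as simple as the scanner below. This is a sketch only: the two example principles, the regex-based detection, and the A/B/C grading thresholds are assumptions chosen to keep the example short, and the article does not describe how the team encodes its golden principles or grades.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Principle:
    name: str
    pattern: "re.Pattern[str]"
    remedy: str


# Hypothetical golden principles; a real set would live in the repository.
PRINCIPLES = [
    Principle("no-bare-except", re.compile(r"^[ \t]*except[ \t]*:", re.M),
              "Catch a specific exception type."),
    Principle("no-print-logging", re.compile(r"^[ \t]*print\(", re.M),
              "Use the structured logger instead of print."),
]


def scan(files: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each file path to the list of principle violations found in its source."""
    findings = {}
    for path, source in files.items():
        findings[path] = [
            f"{principle.name}: {principle.remedy}"
            for principle in PRINCIPLES
            for _ in principle.pattern.finditer(source)
        ]
    return findings


def grade(violation_count: int) -> str:
    """Assumed grading scale: no violations is A, one or two is B, more is C."""
    if violation_count == 0:
        return "A"
    return "B" if violation_count <= 2 else "C"
```

A scheduled job could run `scan` over the repository, record the grade per file or module, and hand each non-empty finding list to an agent as the brief for a targeted refactoring pull request.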