Harness engineering: leveraging Codex in an agent-first world

By Ryan Lopopolo, Member of the Technical Staff

OpenAI

Overview

OpenAI's Harness team built and shipped an internal beta product over five months with zero lines of manually written code, using Codex agents exclusively. The article details how the team redefined the engineering role around designing environments, specifying intent, and building feedback loops rather than writing code directly. The result was ~1 million lines of agent-generated code across ~1,500 pull requests, in an estimated 1/10th of the time that writing the code by hand would have taken.

What You'll Learn

1. How to structure a repository and documentation system optimized for AI agent legibility using progressive disclosure patterns
2. Why enforcing architectural invariants mechanically (via custom linters and structural tests) is critical for agent-generated codebases
3. How to give coding agents access to observability tooling (logs, metrics, traces) so they can self-validate and fix issues autonomously
4. When to treat AGENTS.md as a table of contents rather than an encyclopedia to avoid context overload
5. How to implement continuous 'garbage collection' processes to prevent entropy and architectural drift in agent-generated code

Prerequisites & Requirements

Experience with software engineering workflows including CI/CD, code review, and pull request processes
Understanding of software architecture patterns such as layered architecture and dependency management
Familiarity with AI coding agents (e.g., OpenAI Codex, GitHub Copilot) and prompt-driven development
Experience with observability tools (logs, metrics, traces) and query languages like LogQL or PromQL (optional)

Key Questions Answered

Can you build a real software product entirely with AI coding agents and zero manually written code?

Yes. OpenAI's Harness team built and shipped an internal beta product over five months with zero lines of manually written code. The product has internal daily users and external alpha testers, and contains approximately one million lines of code across application logic, infrastructure, tooling, and documentation. The team estimates it was built in roughly 1/10th of the time that writing the code by hand would have taken. The key was investing in environment design, feedback loops, and agent tooling rather than writing code directly.

How should you structure AGENTS.md files for maximum AI agent effectiveness?

Treat AGENTS.md as a table of contents (~100 lines) rather than a comprehensive encyclopedia. A monolithic instruction file fails because it crowds out task context, causes agents to pattern-match locally instead of navigating intentionally, rots instantly as rules go stale, and is hard to verify mechanically. Instead, use AGENTS.md as a map with pointers to a structured docs/ directory containing design docs, execution plans, product specs, and references—enabling progressive disclosure where agents discover context as needed.
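One way to make the "map, not encyclopedia" rule mechanically verifiable is a small check that fails when AGENTS.md outgrows its budget or points at documents that no longer exist. The sketch below is illustrative only: the 100-line budget mirrors the figure above, while `check_agents_md`, the link pattern, and the sample paths are assumptions, not the team's tooling.

```python
import re

# Matches markdown links such as [Architecture](docs/architecture.md).
LINK_PATTERN = re.compile(r"\[[^\]]+\]\(([^)#\s]+)\)")


def check_agents_md(text, existing_paths, max_lines=100):
    """Return a list of problems found in an AGENTS.md body.

    `existing_paths` is the set of repo-relative files known to exist;
    passing it in keeps this hypothetical check free of filesystem access.
    """
    problems = []
    line_count = len(text.splitlines())
    if line_count > max_lines:
        problems.append(
            f"AGENTS.md has {line_count} lines (budget {max_lines}); "
            "move detail into docs/ and link to it."
        )
    for target in LINK_PATTERN.findall(text):
        if target.startswith(("http://", "https://")):
            continue  # external links are out of scope for this check
        if target not in existing_paths:
            problems.append(f"AGENTS.md links to missing file: {target}")
    return problems


sample = "# Map\n- [Architecture](docs/architecture.md)\n- [Plans](docs/plans/index.md)\n"
print(check_agents_md(sample, {"docs/architecture.md"}))
```

Run in CI, a check like this would flag a stale pointer as soon as the linked document is moved or deleted.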

What role do human engineers play in an agent-first software development workflow?

Human engineers shift from writing code to designing environments, specifying intent, and building feedback loops. They prioritize work, translate user feedback into acceptance criteria, validate outcomes, and when agents struggle, identify what's missing—tools, guardrails, or documentation—and feed it back into the repository. Humans interact primarily through prompts, describing tasks and allowing agents to open pull requests. Over time, even code review shifts from human-driven to agent-to-agent.

How do you prevent architectural drift and code quality decay in an AI-generated codebase?

Use a combination of rigid layered architecture with mechanically enforced dependency rules, custom linters with remediation instructions in error messages, structural tests, and 'taste invariants' covering structured logging, naming conventions, and file size limits. Additionally, implement recurring background cleanup tasks ('garbage collection') where agents scan for deviations from 'golden principles,' update quality grades, and open targeted refactoring pull requests on a regular cadence.
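A mechanically enforced dependency rule with remediation text in the error message might look roughly like the following. This is a sketch under stated assumptions: the layer names and their order, the "first path segment is the layer" convention, and the function names are all hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical layer order, lowest first; a module may import only from
# its own layer or from layers below it.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: index for index, name in enumerate(LAYERS)}


def layer_of(module):
    """Assume the first path segment names the layer, e.g. 'service/billing'."""
    return module.split("/", 1)[0]


def find_violations(imports):
    """`imports` maps each module to the list of modules it imports."""
    violations = []
    for module, targets in sorted(imports.items()):
        for target in targets:
            src, dst = layer_of(module), layer_of(target)
            if RANK[dst] > RANK[src]:
                # The message carries the fix, so it lands in agent context.
                violations.append(
                    f"{module} imports {target}: layer '{src}' must not depend "
                    f"on higher layer '{dst}'. Fix: move the shared code down "
                    f"into '{src}' or a lower layer, or invert the dependency "
                    f"by passing it in from '{dst}'."
                )
    return violations


graph = {
    "service/billing": ["repo/invoices", "types/money"],
    "repo/invoices": ["ui/invoice_table"],  # illegal upward import
}
for message in find_violations(graph):
    print(message)
```

Because the error text states the remedy, an agent that trips the rule receives the correction in the same step as the failure.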

How can AI coding agents validate their own UI changes and bug fixes?

By making the application bootable per git worktree so the agent can launch one instance per change, and wiring the Chrome DevTools Protocol into the agent runtime. This allows agents to take DOM snapshots, capture screenshots, navigate the UI, reproduce bugs, validate fixes, and reason about UI behavior directly. Combined with a local observability stack exposing logs via LogQL and metrics via PromQL, agents can run multi-hour validation loops autonomously.
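Booting one instance per git worktree requires that parallel instances not collide on ports or local state. A minimal sketch of one way to arrange that, assuming settings are derived from the worktree path; `instance_config`, the port range, and the directory layout are hypothetical and not taken from the article:

```python
import hashlib


def instance_config(worktree_path, base_port=20000, port_span=10000):
    """Derive a stable port and data directory from a worktree path.

    Hashing the path means the same worktree always gets the same
    settings, and different worktrees usually get different ports.
    A real setup would still need to handle the rare port collision.
    """
    digest = hashlib.sha256(worktree_path.encode("utf-8")).hexdigest()
    port = base_port + int(digest[:8], 16) % port_span
    return {
        "port": port,
        "data_dir": f"/tmp/app-instances/{digest[:12]}",
        "base_url": f"http://127.0.0.1:{port}",
    }


a = instance_config("/work/repo-worktrees/fix-login")
b = instance_config("/work/repo-worktrees/add-export")
print(a["base_url"], b["base_url"])
```

An agent could then point its browser automation and log queries at `base_url` for the change it is validating.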

What is the 'progressive disclosure' approach for AI agent context management?

Progressive disclosure means agents start with a small, stable entry point (a ~100-line AGENTS.md) and are taught where to look next, rather than being overwhelmed with all context up front. The repository's knowledge base lives in a structured docs/ directory with indexed design documents, architecture maps, quality grades, and execution plans. Agents navigate to deeper context on demand, keeping their working context focused on the current task while maintaining access to the full knowledge base.

Why do boring technologies work better for AI coding agents?

Technologies described as 'boring' tend to be easier for agents to model due to their composability, API stability, and strong representation in training data. The team favored dependencies that could be fully internalized and reasoned about within the repository. In some cases, reimplementing subsets of functionality (like building a custom map-with-concurrency helper instead of using p-limit) was cheaper than working around opaque upstream behavior, while also achieving tighter integration with instrumentation and 100% test coverage.
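To give a sense of how small such a helper can be, here is a Python `asyncio` analogue of a map-with-concurrency function. It is not the team's implementation (which is not shown in the article, and p-limit is a JavaScript library); the name `map_with_concurrency` and its signature are assumptions, and the instrumentation hooks mentioned above are omitted.

```python
import asyncio


async def map_with_concurrency(items, worker, limit):
    """Apply async `worker` to every item, at most `limit` at a time.

    Results come back in input order, like the built-in map().
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(run_one(item) for item in items))


async def demo():
    active = 0
    peak = 0

    async def double(n):
        # Track how many workers run at once to show the limit holds.
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return n * 2

    results = await map_with_concurrency(range(6), double, limit=2)
    return results, peak


results, peak = asyncio.run(demo())
print(results, peak)
```

Owning a helper of this size in the repository makes it straightforward to test exhaustively and to wrap with tracing, which is the trade-off the paragraph above describes.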

How does high agent throughput change traditional merge and code review practices?

With agent throughput far exceeding human attention, corrections become cheap while waiting becomes expensive. The team operates with minimal blocking merge gates and short-lived pull requests, and addresses test flakes with follow-up runs rather than blocking progress. Code review shifts from human-driven to agent-to-agent: agents review their own changes locally, request additional agent reviews, respond to feedback, and iterate until all agent reviewers are satisfied before merging.
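The iterate-until-satisfied review loop can be sketched as follows. Reviewers and the revising agent are modeled as plain callables standing in for agent invocations; every name here is hypothetical, and the round cap with escalation to a human is an added assumption rather than something the article specifies.

```python
def review_until_approved(change, reviewers, revise, max_rounds=5):
    """Iterate until every reviewer returns no feedback.

    `reviewers` are callables returning a list of comments (an empty list
    means approval); `revise` returns an updated change given the comments.
    """
    for round_number in range(1, max_rounds + 1):
        comments = [c for reviewer in reviewers for c in reviewer(change)]
        if not comments:
            return change, round_number
        change = revise(change, comments)
    # Assumed behavior: stop looping and hand off after max_rounds.
    raise RuntimeError("reviewers still unsatisfied; escalate to a human")


def needs_tests(change):
    return [] if "tests" in change else ["add tests"]


def needs_docs(change):
    return [] if "docs" in change else ["update docs"]


def revise(change, comments):
    additions = {"add tests": "tests", "update docs": "docs"}
    return change | {additions[c] for c in comments}


final, rounds = review_until_approved({"code"}, [needs_tests, needs_docs], revise)
print(sorted(final), rounds)
```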

Key Statistics & Figures

Lines of manually written code

0

Every line of code in the product was written by Codex agents over 5 months

Estimated development time reduction

~1/10th

Compared to writing the code by hand

Total codebase size

~1 million lines of code

Across application logic, infrastructure, tooling, documentation, and internal developer utilities

Pull requests opened and merged

~1,500

Over the 5-month development period

Average PR throughput

3.5 PRs per engineer per day

The team started with 3 engineers; throughput increased as it grew to 7

Starting team size

3 engineers

Small team driving Codex agents, later growing to 7 engineers

AGENTS.md size

~100 lines

Kept deliberately short as a map/table of contents rather than comprehensive documentation

Duration of long single Codex runs

6+ hours

Single Codex runs regularly work on one task for this long, often while the humans are asleep

Weekly time spent on manual cleanup (before automation)

20%

Every Friday, before cleanup was automated
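The pull-request figures above can be cross-checked with back-of-envelope arithmetic: dividing total PRs by the per-engineer rate gives the implied engineer-days, which can then be spread over the five months. The days-per-month values below are assumptions (the source does not say whether the average counts calendar or working days), so treat the result as a plausibility check only.

```python
total_prs = 1500             # "~1,500" from the figures above
prs_per_engineer_day = 3.5   # reported average throughput
months = 5

# Engineer-days implied by the two reported figures.
engineer_days = total_prs / prs_per_engineer_day

# Assumed day counts: ~30 calendar days or ~21 working days per month.
for label, days_per_month in (("calendar", 30), ("working", 21)):
    days = months * days_per_month
    print(f"{label} days: {engineer_days / days:.1f} engineers on average")
```

Either assumption yields an average headcount of roughly three to four engineers, which is consistent with a team that started at 3 and grew to 7.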

Technologies & Tools

AI Coding Agent

OpenAI Codex

Primary tool for all code generation, review, testing, and documentation across the entire project

AI Model

GPT-5

Used via Codex CLI for initial repository scaffold generation

Browser Automation

Chrome DevTools Protocol

Wired into agent runtime for DOM snapshots, screenshots, and UI navigation to validate changes

Query Language

LogQL

Used by agents to query logs from the local observability stack

Query Language

PromQL

Used by agents to query metrics from the local observability stack

Query Language

TraceQL

Used by agents to query traces from the local observability stack

Observability Pipeline

Vector

Fans out app data (logs, metrics, traces) to the observability stack components

Log Storage

VictoriaLogs

Part of the local observability stack for storing and querying logs

Metrics Storage

VictoriaMetrics

Part of the local observability stack for storing and querying metrics

Observability Framework

OpenTelemetry

Used for instrumentation, integrated with custom map-with-concurrency helper

Validation Library

Zod

Chosen by the agent (not prescribed by humans) for data shape validation at boundaries

Developer Tool

GitHub CLI (gh)

Used by agents to pull review feedback, respond inline, push updates, and manage pull requests

AI Agent

Aardvark

Referenced as another agent working on the codebase alongside Codex
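To illustrate what querying the local observability stack can look like, the sketch below builds one LogQL and one PromQL query string. The helper names, label names, and metric name are hypothetical; the query shapes (a stream selector with a `|=` line filter, and `histogram_quantile` over a `_bucket` rate) follow standard LogQL and PromQL syntax. Values are interpolated without escaping, which a real helper would need to handle.

```python
def logql_errors(labels, needle="error"):
    """Build a LogQL query: stream selector plus a line-contains filter."""
    selector = ", ".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return f'{{{selector}}} |= "{needle}"'


def promql_p95(histogram, window="5m"):
    """Build a PromQL p95 query over a histogram's _bucket series."""
    return (
        "histogram_quantile(0.95, "
        f"sum by (le) (rate({histogram}_bucket[{window}])))"
    )


print(logql_errors({"app": "web", "worktree": "fix-login"}))
print(promql_p95("http_request_duration_seconds"))
```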

Key Actionable Insights

1
Treat your AGENTS.md as a table of contents, not an encyclopedia. Keep it to roughly 100 lines that serve as a map with pointers to deeper documentation in a structured docs/ directory. A monolithic instruction file crowds out task context, rots quickly, and causes agents to pattern-match locally rather than navigate intentionally.
The team found that too much upfront guidance becomes 'non-guidance'—when everything is marked important, nothing is. Progressive disclosure lets agents start with a stable entry point and find relevant context on demand.

2
Invest in making your application directly legible to agents by exposing UI state, logs, metrics, and traces through programmatic interfaces. Wire Chrome DevTools Protocol into agent runtimes for DOM snapshots and screenshots, and provide local observability stacks with queryable APIs (LogQL, PromQL, TraceQL).
The team's bottleneck shifted from code throughput to human QA capacity. By making the application itself inspectable by agents, they enabled autonomous bug reproduction, fix validation, and performance verification without human intervention.

3
Enforce architectural invariants mechanically through custom linters and structural tests rather than relying on documentation alone. Write custom lint error messages that inject remediation instructions directly into agent context, turning every violation into a learning opportunity for the agent.
In a human-first workflow, strict linting rules might feel pedantic. With agents, they become multipliers—once encoded, they apply everywhere at once, preventing drift across a million-line codebase generated at high throughput.

4
Implement automated 'garbage collection' for your codebase by encoding golden principles and running recurring cleanup agents that scan for deviations, update quality grades, and open targeted refactoring pull requests. This catches bad patterns daily rather than letting them compound.
The team initially spent 20% of their week (every Friday) manually cleaning up 'AI slop.' By automating this into background Codex tasks with codified principles, they scaled cleanup proportionally to code generation throughput.

5
When agents struggle with a task, resist the urge to 'try harder' or write the code manually. Instead, diagnose what capability is missing—tools, guardrails, abstractions, or documentation—and have the agent itself build that missing capability into the repository.
This depth-first approach compounds over time: each missing capability that gets encoded becomes infrastructure for all future agent tasks, steadily increasing the scope of what agents can accomplish autonomously.

6
Push all relevant team knowledge into the repository as versioned, co-located artifacts. Slack discussions, Google Docs, and tacit human knowledge are invisible to agents—if it isn't discoverable in the repo, it effectively doesn't exist for the agent and leads to misaligned output.
This mirrors the new-hire onboarding problem: anything not written down is lost context. The team treats the repo as the single system of record for product principles, engineering norms, architecture decisions, and even team culture preferences.
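The recurring cleanup described in insight 4 can be sketched as a scan that grades each area of the codebase and selects the worst areas for targeted refactoring pull requests. The grading thresholds, domain names, and data shape below are assumptions for illustration; the article does not specify how quality grades are computed.

```python
def grade(violation_count, lines_of_code):
    """Map violations per 1,000 lines to a letter grade (thresholds assumed)."""
    density = violation_count / max(lines_of_code, 1) * 1000
    if density < 1:
        return "A"
    if density < 5:
        return "B"
    return "C"


def plan_cleanup(domains, max_prs=2):
    """Grade every domain and pick the worst ones for refactoring PRs."""
    graded = {
        name: grade(stats["violations"], stats["loc"])
        for name, stats in domains.items()
    }
    worst_first = sorted(
        (name for name, g in graded.items() if g != "A"),
        key=lambda name: (graded[name], domains[name]["violations"]),
        reverse=True,
    )
    return graded, worst_first[:max_prs]


scan = {
    "billing": {"violations": 12, "loc": 2000},  # 6.0 per 1,000 lines
    "search": {"violations": 3, "loc": 1500},    # 2.0 per 1,000 lines
    "auth": {"violations": 0, "loc": 900},       # 0.0 per 1,000 lines
}
grades, targets = plan_cleanup(scan)
print(grades, targets)
```

Capping the number of PRs per run keeps each refactoring small and reviewable, in line with the short-lived pull requests the team favors.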

Common Pitfalls

1

Creating a monolithic AGENTS.md file that tries to be an encyclopedia of all project knowledge. A giant instruction file crowds out the actual task, code, and relevant docs from the agent's context, causing it to either miss key constraints or optimize for the wrong ones. When everything is marked as 'important,' nothing is, and agents end up pattern-matching locally instead of navigating intentionally.

The team experienced this firsthand and found that the file also 'rots instantly': it becomes a graveyard of stale rules that humans stop maintaining and agents cannot distinguish from current guidance.

2

Attempting to manually clean up agent-generated code quality issues ('AI slop') on a fixed schedule. The team initially spent every Friday (20% of engineering time) on manual cleanup, which didn't scale as agent throughput increased and the codebase grew.

The solution was encoding 'golden principles' into the repository and building automated recurring cleanup processes where background Codex tasks scan for deviations and open targeted refactoring PRs.

3

Trying harder or writing code manually when agents struggle with a task, rather than diagnosing and fixing the underlying environmental gap. The fix is almost never 'try harder'—it's identifying what capability, tool, guardrail, or documentation is missing and making it legible and enforceable for the agent.

The team found that every struggle was a signal about underspecified environments. Building the missing capability into the repo compounds over time, unlocking progressively more complex tasks.

4

Keeping important context in Google Docs, Slack threads, or team members' heads rather than in the repository. From the agent's perspective, anything it can't access in-context while running effectively doesn't exist, leading to misaligned output and repeated mistakes.

Just as a new hire joining three months later wouldn't know about an undocumented Slack discussion, agents can't reason about information that isn't versioned and co-located in the repository.

5

Applying conventional blocking merge gates and extensive human code review processes in a high-throughput agent environment. When agent throughput far exceeds human attention, waiting becomes more expensive than the cost of occasional corrections, and traditional gates become bottlenecks.

The team shifted to minimal blocking merge gates, short-lived PRs, and agent-to-agent review. Test flakes are addressed with follow-up runs rather than blocking progress. This tradeoff only works in high-throughput environments with strong architectural guardrails.

6

Expecting early agent-driven progress to be fast without sufficient upfront investment in environment specification. The team found initial progress was slower than expected not because the agent was incapable, but because the environment lacked the tools, abstractions, and internal structure needed for the agent to make progress toward high-level goals.

The work required going depth-first: breaking larger goals into smaller building blocks, having the agent construct those blocks, then using them to unlock more complex tasks in a compounding fashion.
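Pitfall 5 mentions handling test flakes with follow-up runs instead of blocking merges. One minimal interpretation, assuming a test is simply rerun a bounded number of times and labeled by outcome, is sketched below; `classify`, the rerun count, and the three labels are hypothetical.

```python
def classify(run_test, reruns=2):
    """Return 'pass', 'flaky', or 'fail' for one test callable.

    A test that fails once but passes on a follow-up run is reported
    as flaky so it can be tracked without blocking the merge.
    """
    if run_test():
        return "pass"
    for _ in range(reruns):
        if run_test():
            return "flaky"
    return "fail"


def make_test(outcomes):
    """Build a fake test that yields the given outcomes in order."""
    remaining = iter(outcomes)
    return lambda: next(remaining)


print(classify(make_test([True])))
print(classify(make_test([False, True])))
print(classify(make_test([False, False, False])))
```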

Related Concepts

AI-assisted Software Development

Agent-first Engineering Workflows

Codex CLI

AGENTS.md Design Patterns

Progressive Disclosure For AI Context Management

Layered Domain Architecture

Mechanical Invariant Enforcement

Custom Linters For AI Agents

Observability-driven Development

Chrome DevTools Protocol Automation

Agent-to-agent Code Review

Technical Debt As Garbage Collection

Repository-as-system-of-record

Execution Plans

Ralph Wiggum Loop (Iterative Agent Review)

Parse, Don't Validate Pattern

Git Worktree Isolation

Boring Technology Selection For AI Agents