Tony Kim
Mar 04, 2026 17:29
OpenAI and Pacific Northwest National Laboratory launch DraftNEPABench, showing AI agents could save 1-5 hours per subsection on federal environmental reviews.
OpenAI and the U.S. Department of Energy’s Pacific Northwest National Laboratory have developed a benchmark showing AI coding agents could cut drafting time for federal environmental permitting documents by up to 15%. The collaboration, announced February 26, 2026, produced DraftNEPABench, a testing framework that evaluated AI performance across 102 drafting tasks drawn from 18 federal agencies.
The benchmark specifically targets National Environmental Policy Act workflows, the process in place since 1970 that requires federal agencies to document environmental impacts before approving infrastructure projects like power plants, bridges, and manufacturing facilities. These reviews often take years and involve hundreds of pages of technical reports.
What the Testing Showed
Nineteen NEPA subject matter experts evaluated AI-generated drafts on a 1-5 scale measuring structure, clarity, accuracy, and proper reference use. The agents—running on OpenAI’s Codex CLI with GPT-5—demonstrated potential to save 1-5 hours per document subsection.
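To make that rubric concrete, here is a minimal Python sketch of how one expert rating might be recorded and rolled up into a task score. The class, field names, and unweighted averaging are illustrative assumptions, not details from the published benchmark.

from dataclasses import dataclass
from statistics import mean

# Hypothetical illustration only: field names and scoring logic are a sketch
# of a 1-5 rubric across the four criteria described in the article, not the
# actual DraftNEPABench schema.

@dataclass
class ExpertRating:
    structure: int       # 1-5
    clarity: int         # 1-5
    accuracy: int        # 1-5
    reference_use: int   # 1-5

    def overall(self) -> float:
        # Unweighted average across the four criteria.
        return mean([self.structure, self.clarity, self.accuracy, self.reference_use])

def task_score(ratings: list[ExpertRating]) -> float:
    # Average a drafting task's score across all expert reviewers.
    return mean(r.overall() for r in ratings)

if __name__ == "__main__":
    reviews = [ExpertRating(4, 5, 3, 4), ExpertRating(5, 4, 4, 4)]
    print(f"Task score: {task_score(reviews):.2f} / 5")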
That doesn’t sound dramatic until you consider the scale. Environmental Impact Statements contain dozens of subsections, each requiring cross-referencing technical reports, regulatory requirements, and multiple data sources. A few hours saved per section adds up fast on projects that currently take months or years to clear.
The AI agents were required to read and synthesize documents spanning hundreds of pages, verify facts across environmental and regulatory sources, and produce structured reports meeting specific legal criteria. Tasks covered document sections from agencies across the federal government.
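As a rough picture of what one of those drafting tasks could look like when handed to an agent, the sketch below assumes a simple task record and prompt builder. None of these names or fields reflect the actual DraftNEPABench format; they only restate the requirements described above in code form.

from dataclasses import dataclass, field

# Hypothetical sketch of how a single benchmark drafting task might be
# specified; the structure is assumed, not taken from the benchmark release.

@dataclass
class DraftingTask:
    agency: str                  # originating federal agency
    section_title: str           # subsection the agent must draft
    source_documents: list[str]  # technical reports and regulatory sources to synthesize
    legal_criteria: list[str] = field(default_factory=list)  # requirements the draft must satisfy

def build_prompt(task: DraftingTask) -> str:
    # Assemble a plain-text prompt handing the agent its sources and requirements.
    sources = "\n".join(f"- {p}" for p in task.source_documents)
    criteria = "\n".join(f"- {c}" for c in task.legal_criteria)
    return (
        f"Draft the '{task.section_title}' section for {task.agency}.\n"
        f"Base every claim on these source documents:\n{sources}\n"
        f"The draft must satisfy:\n{criteria}\n"
    )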
Limitations Worth Noting
PNNL and OpenAI were upfront about what this benchmark doesn’t prove. It evaluates performance on well-specified drafting tasks where relevant context is available—not the messy ambiguity of real permitting decisions.
When reviewing failure cases, researchers found that some “errors” stemmed from outdated references and weak evaluation criteria rather than model mistakes. Real deployments would also include expert feedback loops, which are expected to improve performance beyond the benchmark results.
If source materials are incomplete or inconsistent, the models won’t necessarily flag problems without explicit instructions. Human oversight remains essential.
The Bigger Picture
This partnership sits within PNNL’s broader PermitAI initiative, funded by the Department of Energy’s Office of Policy. The goal isn’t replacing human reviewers; it’s giving government workers AI agents that handle time-consuming document work so they can focus on judgment calls and complex decisions.
OpenAI says the collaboration will continue refining PermitAI applications. The partners expect average approval times for federally reviewed infrastructure projects to eventually drop from months to weeks, though no specific timeline was given for achieving that reduction.
For the AI industry, this is another government-validated use case, demonstrating that frontier models can handle real regulatory workflows, not just chatbot conversations. Whether that translates into broader federal AI adoption depends on how subsequent pilots perform under actual permitting conditions.

