If You Can't Read the Diff, You Need a Better Test

The first time an AI coding agent hands you a working tool, it feels like cheating.

You described the thing you wanted. It created files, installed packages, wrote functions, fixed errors, and gave you something that actually runs. A few years ago, that gap between “I want this” and “I built this” was where the whole project died.

Now the code exists. The uncomfortable part starts after that: can you tell whether it is right?

I keep coming back to this because I cannot code in the way developers mean it. I can read around code. I can usually understand the shape of what is happening. I can debug with help. I can ask decent questions.

But I cannot honestly pretend that “read the diff” is my main quality-control process.

A lot of AI coding advice assumes the human at the end of the loop is a developer reviewing a pull request. For people like me, that is the wrong safety net.

The Old Barrier Was Code

For me, AI mostly killed the syntax wall.

If you can describe a small tool clearly enough, a coding agent can often build a useful version of it. Scripts, dashboards, scrapers, data cleanup tools, internal workflows, little utilities that save an hour here and there. I use this stuff constantly.

This is not just me messing around with scripts after dinner. OpenAI says more than 85% of its own company uses Codex every week. Anthropic says users are handing off difficult coding work that previously needed close supervision. A 2026 research dataset called AIDev includes 932,791 agent-authored pull requests across 116,211 GitHub repositories.

My own problem shows up after the first working version: I can get past the blank-file stage and still have no good reason to trust the output.

A script that crashes is annoying. A script that confidently produces wrong data is expensive. A tool that fails to launch is easy to reject. A tool that works in the demo and quietly corrupts your assumptions is much harder to spot.

The Tools Are Built for Supervision

The product design tells you what role the human is supposed to play.

OpenAI’s Codex app, announced in February 2026, goes well past editor autocomplete. It is a command center for multiple coding agents: separate threads, built-in worktrees, diff review, editor handoff, skills, and automations.

OpenAI described the developer challenge as a shift from what agents can do to how people direct, supervise, and collaborate with them at scale.

Then in May, OpenAI added Codex remote access from the ChatGPT mobile app. You can start or continue threads, answer questions, change direction, approve actions, review findings, and move across connected hosts from your phone. The host Mac has to stay awake, online, and running Codex.

Anthropic is describing the same operating model from another angle. Claude Code’s best practices are basically management advice:

Let it explore the repo before it edits.
Make it plan before it starts changing files.
Give it tests, screenshots, or expected outputs.
Watch the context window because performance degrades as it fills.
Use separate agents for investigation and review.

For a non-developer, the useful instruction in that list is “give it tests, screenshots, or expected outputs.” If you cannot reliably judge the code, you need to judge the behavior.

Better Models Make Bad Trust Easier

OpenAI announced GPT-5.5 in April 2026 and positioned it as especially strong in agentic coding, computer use, knowledge work, and scientific research. The company reports 82.7% on Terminal-Bench 2.0, 58.6% on SWE-Bench Pro, and 73.1% on its internal Expert-SWE benchmark.

Anthropic announced Claude Opus 4.7 around the same time, describing it as better at difficult software engineering tasks and long-running coding work. Anthropic says the model follows instructions closely and devises ways to verify its own outputs before reporting back.

As the tools improve, the output gets easier to trust for the wrong reasons.

A bad coding agent fails in ways you can see. It errors out. It cannot install the package. It loops on the same mistake. It gives you nonsense.

A good coding agent fails in ways that look professional. It creates clean files. It explains itself well. It runs a test. It tells you the implementation is complete. It gives you enough polish to make you feel like the remaining risk is low.

Sometimes the remaining risk is low. Sometimes the output is wrong in the one place that matters.

If you are a developer, you might catch that in the diff. If you are not, you need another way to force the truth out of the tool.

Task Shape Matters

The pull request data backs this up in a useful way.

A later analysis of 7,156 PRs from the AIDev dataset found that acceptance depends heavily on task type. Documentation tasks were accepted 82.1% of the time. New features were accepted 66.1% of the time. OpenAI Codex ranged from 59.6% to 88.6% across nine task categories. No single tool dominated every category.

Bounded work is easier to verify. Documentation has a visible output. A bug fix with a reproduction step either fixes the case or it does not. A data script can be tested against a known input and known output.

Feature work is fuzzier. The requirement is incomplete. The edge cases live in someone’s head. The “right” answer depends on product judgment, customer context, and weird assumptions no model can infer from a vague prompt.

This is why “build me a dashboard” is risky and “take this CSV, group rows by month, output a table with these five columns, and match this sample result” is much safer.

The agent can write code either way. The second task gives you something you can check.
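
Here is roughly what that second kind of task looks like as a check. This is a minimal sketch in Python, not how any particular agent would build it. The sample rows, the `date` and `amount` column names, and the `monthly_totals` function standing in for the agent's code are all made up for illustration. The part that matters is that the expected totals were typed in by hand first.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data, small enough to total by hand.
SAMPLE_CSV = """date,amount
2026-01-05,10.00
2026-01-20,5.50
2026-02-03,7.25
"""

# Written down by hand BEFORE the agent builds anything.
EXPECTED = {"2026-01": 15.50, "2026-02": 7.25}


def monthly_totals(csv_text):
    """Stand-in for the agent-built tool: sum `amount` per YYYY-MM."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        month = row["date"][:7]  # "2026-01-05" -> "2026-01"
        totals[month] += float(row["amount"])
    return dict(totals)


def differences(actual, expected, tolerance=0.005):
    """Return a list of readable mismatches; an empty list means they agree."""
    problems = []
    for key in sorted(set(actual) | set(expected)):
        if key not in actual:
            problems.append(f"missing month {key}")
        elif key not in expected:
            problems.append(f"unexpected month {key}")
        elif abs(actual[key] - expected[key]) > tolerance:
            problems.append(f"{key}: got {actual[key]}, expected {expected[key]}")
    return problems


if __name__ == "__main__":
    problems = differences(monthly_totals(SAMPLE_CSV), EXPECTED)
    print("MATCH" if not problems else "\n".join(problems))
```

I do not need to understand every line of `monthly_totals` to use this. I need to understand the three sample rows and the two numbers I expect back.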

The New Skill Is Verification Design

For developers, the new bottleneck may be code review.

For people like me, the bottleneck is verification design.

In practice, you turn the thing you want into checks you can actually evaluate:

Known inputs: Give the agent a tiny sample dataset where you already know the answer.
Expected outputs: Make it produce a specific table, file, screenshot, or response you can compare against.
Visible behavior: If it is a web tool, open it and click through the actual workflow.
Failure cases: Give it bad input and see whether it breaks cleanly or lies.
Before/after checks: If it changes data, inspect a copy first and compare counts, totals, and samples.
Plain-English explanation: Make it explain what changed in terms you can understand, then challenge the explanation.
Independent review: Ask a second model to look for risks, missing tests, and edge cases.
Rollback path: Keep the work isolated so you can throw it away when it gets weird.
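
The before/after check is the one I find easiest to turn into something repeatable. The sketch below is one possible shape for it, assuming the data is a CSV with a numeric column. The `id` and `amount` columns, the sample rows, and the function names are hypothetical. It only reports counts and totals, so I still have to decide whether the change is the one I asked for.

```python
import csv
import io


def summarize(csv_text, amount_column):
    """Row count and column total for one CSV, standard library only."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    total = sum(float(row[amount_column]) for row in rows)
    return {"rows": len(rows), "total": round(total, 2)}


def before_after_report(before_text, after_text, amount_column):
    """Compare the untouched copy with the changed copy and list what moved."""
    before = summarize(before_text, amount_column)
    after = summarize(after_text, amount_column)
    return {
        "rows_before": before["rows"],
        "rows_after": after["rows"],
        "rows_dropped": before["rows"] - after["rows"],
        "total_before": before["total"],
        "total_after": after["total"],
        "total_changed": before["total"] != after["total"],
    }


# Hypothetical data: the cleanup was supposed to remove one exact duplicate row.
BEFORE = "id,amount\n1,10.00\n2,5.50\n2,5.50\n3,7.25\n"
AFTER = "id,amount\n1,10.00\n2,5.50\n3,7.25\n"

report = before_after_report(BEFORE, AFTER, "amount")
for name, value in report.items():
    print(f"{name}: {value}")
```

If I expected one duplicate to disappear, one dropped row and a total that falls by exactly that row's amount is a good sign. Forty dropped rows is a reason to stop and ask questions.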

None of this requires pretending you are a senior engineer. It requires being honest about what you can and cannot evaluate.

If I cannot tell whether a function is elegant, I should not make elegance the quality bar. I can still check whether the output matches a known result. I can still check whether the tool handles empty files. I can still check whether the totals reconcile. I can still check whether the command runs twice without duplicating data.
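
Two of those checks, the empty input and the run-it-twice test, fit in a few lines. This is a sketch under assumptions: `import_rows` is a hypothetical stand-in for whatever importer the agent wrote, and the CSV "store" is invented so the example runs on its own. The check itself is just counting rows after the first run and again after the second.

```python
import csv
import tempfile
from pathlib import Path


def import_rows(source_rows, store_path):
    """Hypothetical stand-in for an agent-built importer: append rows to a
    CSV store, skipping any id (first column) that is already present."""
    store = Path(store_path)
    existing = set()
    if store.exists():
        with store.open(newline="") as handle:
            existing = {row[0] for row in csv.reader(handle)}
    with store.open("a", newline="") as handle:
        writer = csv.writer(handle)
        for row in source_rows:
            if row[0] not in existing:
                writer.writerow(row)
                existing.add(row[0])


def count_rows(store_path):
    """Number of rows currently in the store (0 if it does not exist)."""
    store = Path(store_path)
    if not store.exists():
        return 0
    with store.open(newline="") as handle:
        return sum(1 for _ in csv.reader(handle))


def run_twice_check(rows):
    """Run the importer twice on the same input and return both row counts."""
    with tempfile.TemporaryDirectory() as scratch:
        store = Path(scratch) / "store.csv"
        import_rows(rows, store)
        first = count_rows(store)
        import_rows(rows, store)
        second = count_rows(store)
    return first, second


first, second = run_twice_check([["1", "alice"], ["2", "bob"]])
print("safe to rerun" if first == second else f"duplicated: {first} -> {second}")

# Empty input should leave an empty store, not crash.
empty_first, empty_second = run_twice_check([])
print(f"empty input rows: {empty_first}, {empty_second}")
```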

For a non-developer, that is a better safety net than staring at unfamiliar code and hoping understanding arrives.

The Workflow I Trust

The workflow I trust looks like this:

Define the job in plain language.
Create a tiny test case by hand.
Write down the expected result before the agent builds anything.
Ask the agent to explain the plan.
Make it build the smallest useful version.
Run it against the tiny test case.
Compare the output yourself.
Try one ugly edge case.
Ask another model to review the approach.
Only then use it on real data or a real workflow.

Yes, this is slower than “just build it.” Fine. The goal is not to feel fast while creating a tool you cannot trust. The goal is to build small pieces of personal infrastructure that actually make your life easier.

Deterministic code still matters here. If the agent helps you write a script that calculates something the same way every time, great. Let code do that. But the verification step should be outside the model’s vibes. Known input. Known output. Repeatable command. Visible result.
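
One way to keep that step outside the model is a tiny checker that runs the command and compares what it prints with an answer saved earlier. The sketch below assumes the tool prints its result to standard output. The command shown is a hypothetical one-line stand-in for the agent's script, and `run_and_compare` is a name I made up.

```python
import subprocess
import sys


def run_and_compare(command, expected_output):
    """Run a command, capture what it prints, and compare it to a saved answer."""
    result = subprocess.run(command, capture_output=True, text=True)
    actual = result.stdout.strip().splitlines()
    expected = expected_output.strip().splitlines()
    return {
        "exit_ok": result.returncode == 0,
        "matches": actual == expected,
        "actual": actual,
        "expected": expected,
    }


# Hypothetical stand-in for "the script the agent wrote": a one-line Python
# program that prints a two-row table. Swap in your own command.
COMMAND = [sys.executable, "-c", "print('month,total'); print('2026-01,15.5')"]

# The known output, written down before running anything.
EXPECTED = "month,total\n2026-01,15.5\n"

check = run_and_compare(COMMAND, EXPECTED)
print("PASS" if check["exit_ok"] and check["matches"] else "FAIL")
```

The same command and the same expected text can be rerun after every change the agent makes, which is the point: the result does not depend on how convincing the summary sounded.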

Leverage Still Requires Ownership

AI coding agents are a huge deal for technical-adjacent people. They let you build things that used to be blocked by syntax, framework knowledge, and setup pain.

But access is not the same as competence.

The dangerous move is letting the agent’s confidence substitute for your own verification. If you cannot read the diff, say that plainly and build a process around it. Do not cosplay as a code reviewer. Do not merge or deploy something important because the summary sounded good.

Use agents where the downside is limited, the output is visible, and the result can be checked. Start with tools that chew through copied data, generate reports, clean files, or automate annoying local workflows. Avoid giving a model write access to anything you cannot restore.
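
Working on copies can be checked too. A minimal sketch, assuming the data is a single file: fingerprint the original, hand the tool a copy in a scratch folder, and confirm afterward that the original has not changed. The file name, the contents, and the helper names are hypothetical, and the "tool" here is a single line that rewrites the copy.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_fingerprint(path):
    """SHA-256 of a file's bytes; it changes if even one byte changes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def make_working_copy(original, scratch_dir):
    """Copy the original into a scratch folder and return the copy's path."""
    copy_path = Path(scratch_dir) / Path(original).name
    shutil.copy2(original, copy_path)
    return copy_path


with tempfile.TemporaryDirectory() as folder:
    # Hypothetical "real" file, created here only so the sketch runs.
    original = Path(folder) / "customers.csv"
    original.write_text("id,name\n1,alice\n")
    before = file_fingerprint(original)

    scratch = Path(folder) / "scratch"
    scratch.mkdir()
    copy_path = make_working_copy(original, scratch)

    # Stand-in for the agent's tool: it only ever touches the copy.
    copy_path.write_text("id,name\n1,ALICE\n")

    original_untouched = file_fingerprint(original) == before
    copy_changed = file_fingerprint(copy_path) != before

print(f"original untouched: {original_untouched}, copy changed: {copy_changed}")
```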

The real skill is knowing what “done” looks like before the code exists.

If you can define that clearly, the agent becomes useful. If you cannot, better code generation just gets you to false confidence faster.
