If you grew up around arcades (or emulators), you probably remember the special kind of annoyance certain games produced. Ghosts ’n Goblins wasn’t "hard" in the heroic way a good challenge is hard. It was hard in the petty way. One hit and your armor is gone. Another hit and you’re done. It trained you to feel the cost of a mistake immediately.

That’s why that game stuck in my head when I finally read the Remote Labor Index (RLI) prompt closely. Because the RLI setup is, in a very real sense, worse than Ghosts ’n Goblins.

At least that game gave you two states: armored and unarmored. A warning shot. A chance to adjust. In RLI’s core setup, you don’t even get that. You get one shot at the deliverable. And you are explicitly forbidden from doing the one thing knowledge workers rely on to avoid needless failure: asking questions.

Imagine a normal office assignment. A brief. Some input files. A deadline. This could be engineering work, a proposal, an analysis, a deck, a piece of content, a prototype, a report. You open it and immediately see the gaps--not necessarily because the brief is "bad," but because real work rarely arrives as a complete formal specification. There are always unstated priorities, hidden constraints, and tacit expectations about what "good" means.

So you do what competent people do. You clarify. You ask what matters most. You show an early draft. You get a quick "yes, but…" while it’s still cheap to course-correct. That back-and-forth isn’t bureaucracy; it’s how modern knowledge work functions. It’s how weak signals become explicit constraints.

Now remove it.

RLI is a benchmark built from real end-to-end freelance projects: a brief, input files, and a professional human deliverable. The authors evaluate multiple agent systems and score whether the agent’s output would be acceptable as commissioned work to a "reasonable client." The headline number that everyone repeats is that the best-performing system in their evaluation reaches an automation rate of about 2.5%.
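To make the headline number concrete: the metric behind it is just a pass rate over projects. Here is a minimal sketch, assuming hypothetical field names (`deliverable_accepted` is my stand-in for RLI's human review judgment, which in the real benchmark comes from human reviewers, not a boolean in a dataset):

```python
from dataclasses import dataclass

@dataclass
class ProjectResult:
    """One RLI-style project. Field names are illustrative, not RLI's schema."""
    project_id: str
    deliverable_accepted: bool  # would a "reasonable client" accept this as commissioned work?

def automation_rate(results: list[ProjectResult]) -> float:
    """Fraction of projects whose one-shot deliverable was judged acceptable."""
    if not results:
        return 0.0
    return sum(r.deliverable_accepted for r in results) / len(results)
```

Note what a metric like this hides: a near-miss that one clarifying question would have rescued scores the same zero as a total failure.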

That number isn’t meaningless. But it is routinely interpreted as something it isn’t.

Because RLI is not testing "can an AI help a person do work." It’s testing "can an agent do this unattended, end-to-end, from a static brief, with no interactive clarification." And yes, the authors provide proper agent environments and scaffolds. The point isn’t that they did something naïve on the tooling side. The point is that the benchmark deliberately chooses a regime that is structurally anti-human: it bans the conversational mechanism humans use to make ambiguity survivable.

A quick aside, because it matters for reader intuition: the paper mentions "ChatGPT agent," and that phrase causes immediate confusion. Most people picture the chat interface. That’s not the mental model you should have here. RLI is evaluating agent systems, not your casual back-and-forth in a web UI. Whether you personally like a given agent product or harness is secondary. The experiment is about what happens when you demand one-shot delivery without dialogue.

If you’ve ever done consulting, engineering, product work, or anything stakeholder-facing, you already know what that constraint does. It forces you into guessing. And guessing is expensive.

Here’s the part people don’t want to hear: this constraint doesn’t only punish AI agents. It punishes humans too.

RLI doesn’t measure a human baseline under "no questions allowed." The human deliverables are gold standards created under real-world freelancing conditions, where humans can clarify, iterate, and calibrate taste. So you can’t honestly treat the automation rate as "percent of human work replaceable." What you can do is triangulate from decades of research in adjacent domains, measuring what happens to human performance when clarification and context are removed.

In requirements-driven work, allowing communication and clarification materially improves the quality of produced artifacts compared to text-only regimes. In medicine, controlled studies show that adding clinical context improves interpretive accuracy--because context collapses ambiguity. In instruction-driven work like annotation, clearer rules outperform vague guidance by meaningful margins. Different fields, same structural truth: when "good" is partly tacit, performance depends on the feedback loop that makes it explicit.

So when I see "2.5%," I don’t see a comforting "AI is weak" headline. I see a one-life stress test. I see a benchmark that says: "Do this job the way nobody actually wants to do the job--no questions, no calibration--and then tell me how often you land an acceptable outcome."

Under that framing, 2.5% can be read two ways, and both readings can be true.

It’s low if you expected full autonomy across the messy breadth of remote work, where the deliverable isn’t just "produce output," but "discover what output is wanted." A static brief rarely contains enough signal. Humans solve that by asking, by showing drafts, by interpreting reactions. The benchmark forbids the main human solution.

It’s also, strangely, not trivial if you take the constraint seriously. One-shot delivery to "reasonable client acceptance" is an unforgiving bar even for people. If you forced skilled humans into the same regime--no clarification, no stakeholder feedback, no iteration--you would expect a significant failure rate, because failure often comes from mismatched intent, not from inability to execute.
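There’s a back-of-the-envelope way to see why, with numbers that are purely illustrative assumptions, not anything RLI measured. Suppose a static brief leaves a handful of genuinely ambiguous decisions, and even a skilled worker guesses each one the way the client wanted only most of the time. The odds of matching intent end-to-end decay geometrically:

```python
# Purely illustrative assumptions -- not RLI measurements.
p_per_decision = 0.8      # chance of guessing one ambiguous choice the client's way
ambiguous_decisions = 5   # unstated priorities, tacit quality bars, format expectations

p_intent_match = p_per_decision ** ambiguous_decisions
print(f"{p_intent_match:.0%}")  # ~33% -- and that's before any execution errors
```

Five coin flips weighted in your favor still multiply out to failure most of the time. Clarification exists precisely to collapse those flips into known constraints.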

And that’s where the real disruption sits today, especially in software engineering.

The story most people want is binary: AI replaces humans, or it doesn’t. But what’s actually happening is more uncomfortable: AI-augmented humans outcompete unaugmented humans. The unit that wins is the person (or pair) who can convert weak signals into explicit constraints, steer an agent system through ambiguity, and verify outputs cheaply enough that iteration becomes safe.

This is why I’m skeptical of "replacement meters" as the main lens. The bigger risk is distributional and operational. A centaur--one strong operator with a good harness, clear constraints, and a verification loop--can replace a surprising amount of what used to require a whole team. Not because the AI is a human substitute, but because execution has become cheap and the bottleneck moved up the stack: judgment, coordination, and verification.

RLI, read carefully, doesn’t contradict that. It actually reinforces it.

It tells you that unattended autonomy is still rare under harsh constraints. It also tells you why: when you ban dialogue, you force everything into the brief. And when everything has to be in the brief, the real work becomes specification and verification, not generation.
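The structural difference is easy to state in code. Here is a deliberately simplified sketch of the two regimes; `generate`, `review`, and `clarify` are hypothetical stand-ins for an agent system, a client’s acceptance check, and a round of questions, and none of this is RLI’s actual harness:

```python
from typing import Callable

def one_shot(brief: str, generate: Callable[[str], str]) -> str:
    """RLI-style regime: everything must already be in the brief.
    No questions, no drafts, no feedback. One deliverable, graded once."""
    return generate(brief)

def with_dialogue(
    brief: str,
    generate: Callable[[str], str],
    review: Callable[[str], tuple[bool, str]],  # returns (accepted?, feedback)
    clarify: Callable[[str], str],              # ask questions, get a sharper brief back
    max_rounds: int = 3,
) -> str:
    """How knowledge work actually runs: the brief is a starting point,
    and each round converts weak signals into explicit constraints."""
    spec = clarify(brief)                 # surface unstated priorities up front
    draft = generate(spec)
    for _ in range(max_rounds):
        accepted, feedback = review(draft)
        if accepted:
            break
        spec += "\n" + feedback           # fold the "yes, but..." back into the spec
        draft = generate(spec)
    return draft
```

The `generate` step is identical in both regimes. Everything that differs lives in specification and verification, which is exactly the point.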

That’s the "one-life game" lesson.

Ghosts ’n Goblins was annoying because tiny mistakes had immediate consequences. The RLI regime is worse because it doesn’t even let you reduce mistakes the normal way--by clarifying what you’re supposed to build. If you want to argue about the 2.5%, argue about what it actually measures: one-shot delivery from a static brief.

Then ask the question that matters: what happens when we stop pretending the goal is replacement, and instead optimize for what’s already reshaping teams--AI as execution leverage under human judgment?

That’s the difficulty curve we’re actually on. And unlike arcade games, it isn’t optional.
