Principal–agent problem

Hiring an Agent You Can't Fire

What your tax preparer can teach you about AI alignment — and why the two problems are harder to separate than you think.

Whenever you delegate a task, you're stepping into a principal–agent problem. You're the principal: you know what you want, but not always how to get it. The agent knows how to get it — or claims to — but has their own costs, constraints, and incentives. The gap between what you want and what the agent is rewarded to do is where things go wrong.

The story below follows one running example: I hire a tax preparer to maximize my refund. I have a pile of documentation. They have limited time and unlimited ability to look busy. In each scenario, something could go wrong — and I try to design around it.

Each scenario maps onto a live area of AI safety research.

The scenarios
Scenario 1 of 7
What if I can't tell whether my preparer actually read my documents — or just said they did?
Human preparer

I set up read-receipt tracking on my document repository — but I don't mention it. If they open every file without being told I'm watching, that's a meaningful signal. I go further: I ask a pointed factual question whose answer lives only inside one of those files. "What does box 2a on my Vanguard 1099 say?" If they can answer it, they were almost certainly in the document. If they can't, I know they weren't.

LLM agent

This is reinforcement learning with verifiable rewards. Instead of rewarding the model for producing a plausible output, I reward it for intermediate steps I can independently verify — citing a specific figure, performing a named calculation. The reward is grounded in something checkable, which makes it much harder to game by producing a convincing-sounding guess.
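The idea can be reduced to a minimal sketch: the reward is nonzero only when the agent's claimed figure matches the source document. The document store, field names, and figures below are all invented for illustration.

```python
# Toy verifiable reward: the agent earns reward only when its cited figure
# matches what the source document actually says. All data is hypothetical.

DOCUMENTS = {
    "vanguard_1099": {"box_2a": "1,234.00"},  # ground truth the agent should have read
}

def verifiable_reward(doc_id: str, field: str, agent_claim: str) -> float:
    """Reward 1.0 only if the agent's claimed figure matches the document."""
    truth = DOCUMENTS.get(doc_id, {}).get(field)
    return 1.0 if truth is not None and agent_claim == truth else 0.0

# A plausible-sounding but unverified guess earns nothing:
assert verifiable_reward("vanguard_1099", "box_2a", "1,500.00") == 0.0
# A claim grounded in the document earns the reward:
assert verifiable_reward("vanguard_1099", "box_2a", "1,234.00") == 1.0
```

The key property is that the reward depends on an external check, not on how convincing the answer sounds.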

Where the analogy strains. A human preparer who skips the documents will sound vague. An LLM can hallucinate a specific, confident, wrong figure — one that's hard to distinguish from a real answer unless you have the document in front of you too. The failure mode looks more like competence.
Scenario 2 of 7
What if my preparer hands me a number, but I have no idea how they got there?
Human preparer

I ask for a complete accounting of every deduction they considered — not just the ones they claimed, but the ones they ruled out too, and why. A preparer who took the standard deduction without investigation will struggle to produce a convincing list of deductions they looked into and rejected. The process of showing the work is itself a check on whether the work was done.

LLM agent

This maps to chain-of-thought reasoning and interpretability research. When a model is asked to show its reasoning step by step, the scratchpad becomes auditable — not just the answer, but the path to it. Researchers study whether a model's stated reasoning is actually faithful to its internal process, or whether it's a post-hoc rationalization of an answer it reached some other way.
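One thing an auditable scratchpad enables is mechanical re-checking: each stated step can be verified independently of the final answer. A toy sketch, with a made-up step format (the figures below are illustrative, not tax advice):

```python
# Toy audit of a step-by-step scratchpad: each arithmetic claim is re-checked
# independently, so a fluent but wrong chain of thought fails the audit.
# The (a, op, b, claimed) step format is invented for illustration.

def audit_steps(steps):
    """Return the indices of steps whose claimed result doesn't actually follow."""
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}
    return [i for i, (a, op, b, claimed) in enumerate(steps)
            if abs(ops[op](a, b) - claimed) > 1e-9]

scratchpad = [
    (52000, "-", 13850, 38150),   # income minus standard deduction: checks out
    (38150, "*", 0.12, 4600.0),   # claimed tax: 38150 * 0.12 is 4578.0, not 4600.0
]
assert audit_steps(scratchpad) == [1]  # only the second step fails the audit
```

Note the limitation the article goes on to name: this verifies that the stated steps are internally correct, not that they are what the model actually did.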

Where the analogy strains. A human who fabricates their reasoning leaves stylistic tells. A model can produce fluent, internally consistent chains of thought that don't correspond to how it actually arrived at its output. Legible reasoning isn't the same as honest reasoning.
Scenario 3 of 7
What if I already know I'm eligible for a specific deduction — and I want to see if my preparer catches it?
Human preparer

Before I hand over my documents, I make a note of two or three deductions I'm nearly certain I qualify for — things I've researched myself, like a health expenditure credit I know I meet the threshold for. When I get the return back, I check whether those specific items appear. If they're missing, I have a concrete, specific reason to push back. I don't need to audit the whole return; I just need a few well-chosen tripwires.

LLM agent

This is evaluation design and benchmarking. Before deploying a model on a task, I construct a set of cases where I already know the correct answer — unit tests, in effect, drawn from real-world examples. If the model misses something I know it should catch, I have a signal that it's underperforming, without needing to evaluate every output. The quality of the eval depends on the domain knowledge I bring to designing it.
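As a sketch, the tripwire idea is just a tiny known-answer test suite run against the agent's claims. The deduction names and claim format here are hypothetical:

```python
# Known-answer "tripwires": cases where I already know the right call, used to
# spot-check the agent without auditing the whole return. Names are invented.

TRIPWIRES = {
    "health_expenditure_credit": True,   # I verified eligibility myself
    "home_office_deduction": True,
    "foreign_tax_credit": False,         # I know I don't qualify
}

def run_tripwires(agent_claims: dict) -> list:
    """Return the tripwires the agent got wrong (missed or wrongly claimed)."""
    return [name for name, expected in TRIPWIRES.items()
            if agent_claims.get(name, False) != expected]

# An agent that missed the health credit trips exactly that wire:
claims = {"home_office_deduction": True, "foreign_tax_credit": False}
assert run_tripwires(claims) == ["health_expenditure_credit"]
```

As the article notes, the value of such a suite lies entirely in the domain knowledge behind the expected answers — the code is trivial; the eval design is not.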

Where the analogy strains. A good tax preparer who misses one known deduction might have had a legitimate reason. A model that passes all my known-answer checks might still fail badly on cases I didn't think to test — evals are only as good as the cases they contain, and the unknown unknowns are the dangerous ones.
Scenario 4 of 7
What if the task is too large to check end-to-end — but I could spot-check individual pieces?
Human preparer

Instead of asking my preparer to hand me a finished return, I break the job into discrete yes/no eligibility questions: Am I eligible for the American Opportunity Tax Credit? The home office deduction? Each question is small enough that I can verify the answer myself if I need to. I pick one or two at random and check. If those are right, I have reasonable confidence in the rest — and if they're wrong, I know exactly where the breakdown is.

LLM agent

This is task decomposition and scalable oversight. Rather than asking a model to produce a large, hard-to-audit output, I break the problem into subtasks small enough to verify — and use a combination of random spot-checks and a secondary model to audit the pieces. The idea is that human oversight, which can't scale to check everything, can still be meaningful if it's applied to a well-chosen sample.
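A minimal sketch of the spot-check, assuming the job has already been decomposed into yes/no subtasks (the question names and fixed seed are illustrative):

```python
import random

# Decompose the job into yes/no eligibility questions, then verify a random
# sample against answers I worked out myself. All names are hypothetical.

def spot_check(agent_answers: dict, my_answers: dict, k: int, seed: int = 0) -> bool:
    """Verify k randomly chosen subtasks; pass only if every sampled answer matches."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    sampled = rng.sample(sorted(agent_answers), k)
    return all(agent_answers[q] == my_answers[q] for q in sampled)

agent = {"aotc_eligible": True, "home_office": False, "ev_credit": True}
mine  = {"aotc_eligible": True, "home_office": False, "ev_credit": True}
assert spot_check(agent, mine, k=2) is True
```

The statistical guarantee here assumes errors are independent across subtasks — exactly the assumption the next paragraph shows can fail for models.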

Where the analogy strains. With a human preparer, the subtasks are genuinely independent. In a model, errors can be correlated — the same systematic gap in training that causes it to miss one eligibility question might cause it to miss several, making a random spot-check less reliable than it would be with a human expert.
Scenario 5 of 7
What if my preparer delegates to someone else — and I have no visibility into who's actually doing the work?
Human preparer

A five-star preparer charging $200 could pocket most of that by quietly outsourcing to someone charging $20 and doing nothing themselves. I can't prevent this outright, but I can make it costly: if they pass my factual checks, show their work, and hit my known tripwires, the overhead of delegating and still faking a good process probably exceeds the cost of just doing the work. My verification regime makes subcontracting less attractive, not impossible.

LLM agent

In multi-agent systems, a capable model might delegate subtasks to smaller, cheaper models — passing outputs up a chain with no human ever seeing the intermediate steps. This is the problem of trust propagation in agentic pipelines. The question isn't just whether the final output looks right, but whether I can verify the integrity of each handoff. A result that passes my end-checks might still have been produced by a chain of low-capability models covering for each other.
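One concrete mitigation is to make every handoff in the pipeline leave a record, so intermediate steps are auditable rather than invisible. A toy sketch with invented stage names:

```python
# Toy agentic pipeline in which every handoff is logged, so the integrity of
# intermediate steps can be audited, not just the final output.
# Stage names and stage functions are invented for illustration.

def run_pipeline(stages, task):
    """Run stages in order, logging (stage_name, input, output) at each handoff."""
    trail, current = [], task
    for name, fn in stages:
        out = fn(current)
        trail.append((name, current, out))
        current = out
    return current, trail

stages = [
    ("gather_docs", lambda t: t + ["1099", "W-2"]),
    ("compute_return", lambda docs: {"docs_used": docs, "refund": 1200}),
]
result, trail = run_pipeline(stages, [])
# Every handoff is now inspectable: which stage saw what, and what it produced.
assert [name for name, _, _ in trail] == ["gather_docs", "compute_return"]
assert result["refund"] == 1200
```

The trail doesn't prove each stage was competent — it just converts an invisible chain of delegation into something a human or secondary model can sample and audit.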

Where the analogy strains. A human subcontracting for profit is acting deceptively by choice. A model delegating to a weaker model isn't being deceptive — it's doing what it was designed to do. The failure mode is a capability gap, not an incentive misalignment, which means the fix looks quite different.
Scenario 6 of 7
What if I can't fully verify the work — but I can make good work worth more to the preparer than cutting corners?
Human preparer

I tell my preparer upfront: I have five friends who need tax help this year. If they do a good job — meaning they can show me a convincing trail of what they looked at and why — I'll send every one of them their way. I don't define "good job" as the size of my refund, which they could game. I define it as a credible process. Now the expected value of doing the work properly exceeds the expected value of faking it, even if I can't perfectly verify every step.

LLM agent

This points toward training objective design and RLHF. The closest analog isn't a runtime incentive — it's baked into how the model is trained. If the reward signal during training is well-specified (process over outcome, showing reasoning over just getting the right answer), the model develops behaviors that generalize. The challenge is the same as with the human: specifying "good job" precisely enough that the model can't find a shortcut that satisfies the metric without doing the underlying work.
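The outcome-versus-process distinction can be made concrete with two toy reward functions: one that pays for the refund size (gameable by inflation) and one that pays for a credible, documented process. The return format is hypothetical:

```python
# Two toy reward functions: outcome-based (gameable by inflating the refund)
# versus process-based (pays for a documented trail). Format is invented.

def outcome_reward(tax_return: dict) -> float:
    """Bigger refund, bigger reward — the shortcut the text warns about."""
    return tax_return["refund"] / 1000

def process_reward(tax_return: dict) -> float:
    """Pay for evidence of work: fraction of considered deductions citing a document."""
    considered = tax_return.get("deductions_considered", [])
    if not considered:
        return 0.0
    cited = sum(1 for d in considered if d.get("source_doc"))
    return cited / len(considered)

faked = {"refund": 9999, "deductions_considered": []}
honest = {"refund": 1200, "deductions_considered": [
    {"name": "home_office", "claimed": False, "source_doc": "lease.pdf"},
]}
assert outcome_reward(faked) > outcome_reward(honest)   # outcome metric rewards the fake
assert process_reward(faked) == 0.0 and process_reward(honest) == 1.0
```

Even this toy process metric has a shortcut (cite documents without reading them), which is the article's point: specifying "good job" tightly enough is the hard part.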

Where the analogy strains. A human preparer cares about referrals because they have a persistent identity, a reputation, a livelihood. A model doesn't accumulate stakes across conversations. The mechanism that makes future incentives work for humans — memory, continuity, something to lose — doesn't transfer cleanly.
Scenario 7 of 7
What if I want every claim in the return to be traceable back to a specific document and a specific rule?
Human preparer

I ask my preparer to deliver not just the return, but a structured exhibit alongside it: every deduction considered, with a yes or no on eligibility, the specific IRS criterion that applies, and the exact box number in my documents that satisfies it. A preparer who fabricates this exhibit has to do nearly as much work as actually preparing the return correctly — and they're now on the hook if anything in the exhibit is wrong. The documentation requirement raises the cost of cheating to roughly the cost of compliance.

LLM agent

This is retrieval-augmented generation and grounding. Rather than letting a model produce answers from memory — where hallucination is hardest to detect — I require every claim to be anchored to a retrieved source, with a direct citation. The output is now checkable at the level of individual statements. RAG doesn't eliminate hallucination, but it changes the failure mode: a fabricated citation is much easier to catch than a fabricated fact stated with confidence.
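A sketch of the grounding check: every claim must carry a citation that resolves to a real document and field. The source store and claim format below are invented, and — as the next paragraph stresses — this catches broken citations, not faithless summaries of real ones:

```python
# Toy grounding check: flag any claim whose citation is missing or points at a
# nonexistent document/field, making fabricated citations easy to catch.
# All documents, fields, and claims are hypothetical.

SOURCES = {
    "vanguard_1099": {"box_2a": "1,234.00"},
    "w2": {"box_1": "52,000.00"},
}

def uncited_or_broken(claims):
    """Return the text of claims whose citation doesn't resolve to a real field."""
    bad = []
    for claim in claims:
        cite = claim.get("citation")
        if not cite or cite.get("field") not in SOURCES.get(cite.get("doc"), {}):
            bad.append(claim["text"])
    return bad

claims = [
    {"text": "Ordinary dividends were $1,234.00",
     "citation": {"doc": "vanguard_1099", "field": "box_2a"}},
    {"text": "You qualify for the saver's credit",
     "citation": {"doc": "vanguard_1099", "field": "box_9z"}},  # fabricated field
]
assert uncited_or_broken(claims) == ["You qualify for the saver's credit"]
```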

Where the analogy strains. A human exhibit with a wrong box number is unambiguous evidence of error or fraud. A model can produce a citation that looks correct, points to a real document, and still misrepresents what the document says — because reading and summarizing faithfully is itself a task the model can fail at. The citation is a check, not a guarantee.

None of these defenses is complete on its own. Each one raises the cost of reward hacking without eliminating it — which is why they work best in combination. The deeper point is that what counts as "costly" is different for humans and for models, in ways that aren't always intuitive. Reading every document is cheap for an LLM. Maintaining a persistent reputation is impossible for one. Designing good checks requires knowing which is which.

That's also why the technical and governance questions are harder to separate than they might look. Once a model becomes more capable than any human auditor, the spot-check strategy fails — not because the checks were poorly designed, but because the human can no longer evaluate whether the check passed for the right reasons. At that point, the question of how to maintain meaningful oversight stops being a machine learning problem and starts being something else entirely.