There's a narrative in the AI world right now that goes something like this: agents are getting smarter, more autonomous, more capable. The next generation will reason better, plan longer, operate with less supervision. The trajectory is toward full autonomy.
And then there's what people are actually shipping.
A research team at UC Berkeley just published the first large-scale empirical study of AI agents in production. Not demos. Not prototypes. Not "coming soon" product announcements. Eighty-six deployed systems, serving hundreds to millions of real users, across 26 industries. Twenty in-depth interviews with the teams that built them. Three hundred six practitioners surveyed.
The paper is called "Measuring Agents in Production" — MAP. And if you're trying to figure out whether AI agents actually work for real business operations, this is the most useful thing published this year.
The headline finding
Production agents are deliberately boring.
Not because the teams building them don't know about the latest research. Not because they can't build something more sophisticated. Because they tried the sophisticated version and it broke in production.
The numbers: 70% use off-the-shelf models with no fine-tuning — just prompting. 80% use structured workflows rather than open-ended autonomy. 68% execute fewer than 10 steps before requiring human intervention. 85% of case study teams build custom implementations rather than using off-the-shelf frameworks like LangChain or CrewAI.
The researchers' conclusion: "Practitioners deliberately choose simple, controllable methods not from lack of sophistication, but because they offer reliable agent performance and fast development cycles."
Read that again. The teams closest to production, with the most at stake, are actively choosing less autonomy, less complexity, and less novelty. On purpose.
Why simple wins
The paper identifies a reliability paradox. Reliability is the number one development challenge: 38% of teams rank it as their top priority, far above compliance (17%) or governance (3%). And yet these agents are in production, serving real users, handling real business processes. How?
The answer isn't better models. It's better systems.
Teams achieve reliability through what the researchers call "system-level design." Constrained environments. Wrapper APIs that hide production system details from the agent. Read-only modes where agents can analyze but not modify. Role-based access controls. Fixed action sequences with human approval at critical steps. Sandbox verification before anything touches production.
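To make that concrete, here's a rough sketch of what the wrapper-API pattern might look like. None of this comes from the paper's code; the class names, the ERP client, and the approval queue are placeholders I made up. The shape is the point: the agent gets a narrow, mostly read-only surface, and anything that would modify data becomes a proposal for a human to approve.

```python
# Hypothetical sketch of the wrapper-API pattern: the agent never touches the
# production system directly. It sees a narrow, mostly read-only surface, and
# anything that would write data becomes a review item instead of a write.

from dataclasses import dataclass


@dataclass
class ProposedChange:
    """A write the agent wants to make, parked until a human approves it."""
    action: str
    payload: dict
    approved: bool = False


class ConstrainedERPWrapper:
    """The only interface handed to the agent: reads pass through, writes queue."""

    def __init__(self, erp_client, approval_queue: list):
        self._erp = erp_client          # the real system stays behind the wrapper
        self._queue = approval_queue    # humans work this queue, not the agent

    # Read-only surface: safe for the agent to call on its own.
    def get_open_invoices(self):
        return self._erp.list_invoices(status="open")

    def get_chart_of_accounts(self):
        return self._erp.list_accounts()

    # Gated surface: the agent can only *propose*, never post.
    def propose_journal_entry(self, payload: dict) -> ProposedChange:
        change = ProposedChange(action="post_journal_entry", payload=payload)
        self._queue.append(change)
        return change
```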
This should sound familiar to anyone who's worked in accounting. We've been doing this for 500 years. You don't give one person the ability to create a vendor, approve an invoice, and cut a check. You separate duties. You build in review steps. You constrain what any single actor can do. The system is designed so that the controls work even when the individual actors make mistakes.
The AI industry is discovering what accounting figured out in the Renaissance: the system is more reliable than any individual component inside it.
The fine-tuning myth
One finding surprised me, though maybe it shouldn't have. Only 30% of deployed agents use any form of fine-tuning or post-training. The other 70% use foundation models exactly as they come from the provider — Claude, GPT, whatever — and rely entirely on prompting to shape behavior.
The reason is practical, not ideological. Teams report that fine-tuning is brittle to model upgrades. You spend months tuning a model, the provider releases a better base model, and you have to retune. The fine-tuning advantage evaporates. The engineering hours are gone. Meanwhile, the team that was just prompting switches to the new base model in an afternoon.
I wrote about this a few weeks ago — the reasoning layer changes every six months, the tools layer doesn't. This paper provides the empirical evidence. The teams in production aren't fine-tuning because the ROI doesn't survive the next model release cycle.
Human-in-the-loop isn't a phase
Here's the one that matters most if you're thinking about AI for accounting or finance.
74% of deployed agents rely primarily on human-in-the-loop evaluation. Not as a stopgap while they build something more autonomous. As the architecture. The paper explicitly states that teams "deliberately incorporate human-in-the-loop as an architectural component." The agent does the work. The human checks the work. That's the design, not a limitation of it.
93% of deployed agents serve human users — not other software systems, not other agents. Humans. And 52% serve internal employees specifically, with humans acting as "final verifiers of agent outputs."
This maps perfectly to how we think about AI in accounting. The agent processes the invoice, maps the GL codes, creates the entry. The CPA reviews it. Not because the agent can't do it alone — it probably can, most of the time — but because "most of the time" isn't good enough when you're dealing with someone's financial statements. The review step isn't overhead. It's the product.
The research confirms what we've seen in practice: you don't need a fully autonomous agent to deliver massive value. You need an agent that does 90% of the work reliably, and a human who spends their time on the 10% that requires judgment. That's a completely different staffing model than having humans do 100% of the work — and it scales in a way that pure human labor doesn't.
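If you want to picture that staffing model in code, it looks something like the routing sketch below. The flag names are invented for illustration; the idea is simply that every draft still gets reviewed, but the reviewer's attention is steered toward the items that actually need judgment.

```python
# Illustrative routing for the human-in-the-loop model: the agent drafts every
# entry, and human review time is concentrated on the items that need judgment.
# Flag names are made up for the example; entry["flags"] is assumed to be a
# collection of strings attached by upstream checks.

REVIEW_FLAGS = {"new_vendor", "unmatched_po", "unusual_amount", "low_confidence"}


def route_for_review(draft_entries):
    """Split agent drafts into a quick-check pile and a needs-judgment pile."""
    quick_check, needs_judgment = [], []
    for entry in draft_entries:
        if REVIEW_FLAGS & set(entry["flags"]):
            needs_judgment.append(entry)   # the ~10% that takes real time
        else:
            quick_check.append(entry)      # the ~90% the reviewer can scan
    return quick_check, needs_judgment
```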
What nobody's figured out yet
The paper is honest about what's still broken, and this is the part worth paying attention to.
75% of teams have no formal benchmarks for evaluating agent output. No golden datasets. No standardized test suites. They rely on A/B testing, user feedback, and expert review — which work, but don't scale and don't transfer across deployments.
The researchers found three reasons. First, regulated domains like healthcare and finance require expensive expert-labeled data that takes months to create. Second, every client deployment is different — proprietary toolsets, localized workflows, domain-specific edge cases — making standardized benchmarks impractical. Third, many real-world tasks are genuinely hard to verify automatically. How do you programmatically check whether a customer support response was "good"? How do you automatically verify that a journal entry was coded to the right account?
For coding agents, this is partially solved — you can compile the code and run the tests. For most business processes, there's no equivalent of "it compiles." The feedback signal is slow and expensive. An insurance agent's mistake might not surface until a claim is denied months later. An accounting error might not surface until the audit.
I don't think this is a permanent condition. But it means that right now, the teams doing this well are building their own evaluation infrastructure from scratch — establishing expected-output sets, collecting user interactions, iteratively expanding with expert review. It's manual, expensive, and custom per deployment. There's no shortcut.
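The shape of that infrastructure is simple even if filling it in isn't. Here's a minimal sketch of an expected-output set and a comparison loop, using invoice-coding fields as a made-up example. The code is the easy part; the months of work live in producing the expert-labeled expected values.

```python
# Minimal sketch of a home-grown evaluation harness: a small set of
# expert-labeled cases, run through the agent, compared field by field.
# The expensive part is not this code; it's producing the expected outputs.

golden_cases = [
    {
        "input": {"vendor": "Acme Supply", "amount": 1250.00, "memo": "toner, Q3"},
        "expected": {"gl_account": "6100", "department": "OPS"},
    },
    # ...grown over time from real interactions reviewed by an expert
]


def evaluate(agent_fn, cases):
    """Return per-field accuracy of agent_fn's output against the golden set."""
    hits, totals = {}, {}
    for case in cases:
        predicted = agent_fn(case["input"])
        for field, expected_value in case["expected"].items():
            totals[field] = totals.get(field, 0) + 1
            if predicted.get(field) == expected_value:
                hits[field] = hits.get(field, 0) + 1
    return {f: hits.get(f, 0) / totals[f] for f in totals}
```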
Where accounting has an advantage
There's a reason finance and banking is the second-most-common domain for deployed agents (44% of surveyed systems), behind only technology. Accounting processes have properties that make them unusually well-suited to the production agent patterns this paper describes.
Latency tolerance. 66% of deployed agents operate with minute-scale latency or longer. Nobody cares if an invoice takes two minutes to process instead of two seconds. The alternative was a human taking two hours. Or two days, if the person who handles that vendor is out sick.
Structured workflows. AP processing follows a defined sequence — receive invoice, match to PO, code GL accounts, create entry, route for approval, post. Bank reconciliation has explicit steps. Month-end close has a checklist. These aren't open-ended reasoning problems. They're structured processes with known steps and verifiable outputs — exactly the kind of workflow that 80% of production agents use.
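A structured agent workflow of this kind can be as plain as a fixed list of steps, as in the sketch below. The step names and functions are hypothetical, but notice what's absent: there is no open-ended planner deciding what to do next, and the first failure escalates to a person.

```python
# Sketch of a structured AP workflow: a fixed sequence of named steps, not an
# open-ended planning loop. Each step is an ordinary function; the model is
# consulted only inside the steps that need language understanding
# (extraction, GL coding). Step names and functions are hypothetical.

AP_PIPELINE = [
    "receive_invoice",
    "match_to_po",
    "code_gl_accounts",
    "create_entry",
    "route_for_approval",   # the human approval gate lives here
    "post",
]


def escalate_to_human(state, failed_step):
    """Park the item for a person instead of letting the agent improvise."""
    state["needs_human"] = failed_step
    return state


def run_ap_workflow(invoice, steps):
    """Run the fixed step list in order; stop and escalate on the first failure."""
    state = {"invoice": invoice}
    for step_name in AP_PIPELINE:
        step_fn = steps[step_name]        # mapping of step name -> function
        state, ok = step_fn(state)
        if not ok:
            return escalate_to_human(state, failed_step=step_name)
    return state
```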
Built-in verification. The accounting system itself provides guardrails that most domains don't have. Debits must equal credits. GL accounts must exist in the chart of accounts. Entries must balance. The bank statement is an independent source of truth you can reconcile against. These aren't AI safety features — they're centuries-old accounting controls. But they happen to work perfectly as automated verification layers for agent output.
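Those controls translate directly into cheap, deterministic checks you can run on every agent-drafted entry before a person ever looks at it. A toy version, assuming a simple entry format where debits are positive and credits are negative:

```python
# The oldest accounting controls double as automated verification for agent
# output. These checks are deterministic and cheap; no model is involved.
# Assumed entry format: {"lines": [{"account": "6100", "amount": 120.00}, ...]}
# with debits positive and credits negative.

def verify_entry(entry, chart_of_accounts):
    """Flag an agent-drafted journal entry that violates basic controls."""
    problems = []

    # Debits must equal credits (allowing for rounding).
    total = sum(line["amount"] for line in entry["lines"])
    if abs(total) > 0.005:
        problems.append(f"entry does not balance: off by {total:.2f}")

    # Every account must exist in the chart of accounts.
    for line in entry["lines"]:
        if line["account"] not in chart_of_accounts:
            problems.append(f"unknown account: {line['account']}")

    return problems   # an empty list means the entry passes the automated checks
```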
And the human-in-the-loop model isn't new to the profession. Every set of financial statements has a review process. Every audit has a review partner. The CPA's entire job is structured around the idea that work product gets checked by someone with judgment. Adding an AI agent to that workflow doesn't change the review model — it changes what's being reviewed. Instead of reviewing the work of a staff accountant, you're reviewing the work of an agent. The review skill is the same.
What to take from this
If you're evaluating AI for your accounting operations — or honestly, for any business process — the MAP paper suggests a few things worth internalizing.
Distrust anyone promising full autonomy. The teams with production deployments are building constrained systems with human oversight. If a vendor is telling you their agent "handles everything end to end with no human involvement," they're either not in production yet or they're not telling you about the failure rate.
The boring approach is the correct approach. Off-the-shelf models. Structured workflows. Human review at critical steps. Limited autonomous steps. This isn't a lack of ambition. It's what the data says works.
Ask about architecture, not intelligence. The paper shows that production reliability comes from system design — how the agent is constrained, how it's evaluated, how failures are caught — not from how smart the underlying model is. When you're evaluating an AI accounting solution, ask how they handle errors, how credentials are isolated, what the agent can and can't do. If the answer is mostly about how clever the AI is, that's the wrong answer.
Expect to build evaluation. Nobody has this figured out yet — not Google, not the biggest AI startups, not the research labs. Evaluation is custom, manual, and expensive. Any honest vendor will tell you that measuring agent quality is an ongoing process, not a solved problem.
The 86 production agents in this study aren't impressive because they're sophisticated. They're impressive because they're reliable — and they got there by choosing boring over brilliant, every time.
That's not a limitation. That's engineering discipline.