
Every time I speak at a conference or sit down with a compliance team exploring AI, someone asks about hallucinations. I understand why. The fear is legitimate, and the stakes in financial crime detection are real. We are talking about catching human trafficking, money laundering, and terrorist financing. A system that makes things up is not just an inconvenience; it is a liability.
But here is the honest answer: AI hallucinations in financial crime compliance have been engineered around. They were genuinely scary in 2023. In 2026, there are well-established, production-tested techniques that make AI agents reliable enough to process hundreds of thousands of alerts a month with measurable accuracy. Unit21's agents have completed over 500,000 alert reviews. We know exactly how we reach that level of reliability, and we know exactly what we do about the cases where the system is not confident enough to act on its own.
What follows is not theory. It is a breakdown of the four techniques we use in production to make AI output trustworthy enough to put in front of regulators, plus the infrastructure that handles the remaining error rate.
Before getting into solutions, it helps to reframe the problem. Technically speaking, all LLM output is generative. When a model responds to a question, it is producing a statistically likely sequence of tokens, not retrieving a fact from a database. In that philosophical sense, everything an LLM says is a "hallucination."
That framing is not useful for compliance work. The real question is: does the AI output accurately represent the underlying data? If a customer's age field says 65, does the agent return 65 or does it return 27? If a name does not appear on a sanctions list, does the agent correctly say "no match"?
That is the standard we build to. Not "did it sound confident?" but "is it factually correct, every single time, at scale?"
What it is: Eval sets are structured test suites that measure whether an AI agent produces correct outputs across a large, representative sample of real-world cases. You run the agent against hundreds or thousands of examples, compare its outputs to known-correct answers, measure the accuracy rate, and iterate until you reach a threshold you can defend.
This is the technique most teams underestimate, and the one where Unit21 has the greatest structural advantage. It is also the reason we lead with it: everything else in this list matters, but eval sets are the mechanism that actually tells you whether your system works.
Here is how it works in practice. Imagine you want to test whether an AI agent can correctly determine whether a transaction is a true positive for sanctions exposure. You need test cases: real transactions, real entity data, and known outcomes (was this actually a sanctions hit or not?). You run the agent. You compare. You score.
The accuracy percentage tells you how reliable the agent is. If it is at 60%, you adjust the prompt, change the context, and run again. You keep iterating until it hits 95% or above. Only then do you deploy.
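The run, compare, score loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Unit21's actual harness: `EvalCase`, `score_agent`, `ready_to_deploy`, and the toy agent are hypothetical names, and the 0.95 default simply mirrors the threshold mentioned in the text.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    alert: dict    # input shown to the agent (hypothetical shape)
    expected: str  # known-correct disposition for this alert


def score_agent(agent: Callable[[dict], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases where the agent matches the known answer."""
    if not cases:
        raise ValueError("eval set is empty")
    correct = sum(1 for case in cases if agent(case.alert) == case.expected)
    return correct / len(cases)


def ready_to_deploy(accuracy: float, threshold: float = 0.95) -> bool:
    """Deployment gate: accuracy must meet or exceed the chosen threshold."""
    return accuracy >= threshold


def toy_agent(alert: dict) -> str:
    """Stand-in for a real agent; it only looks at one field."""
    return "true_positive" if alert["name_on_list"] else "false_positive"


cases = [
    EvalCase({"name_on_list": True}, "true_positive"),
    EvalCase({"name_on_list": False}, "false_positive"),
    EvalCase({"name_on_list": False}, "false_positive"),
    # A name match that analysts cleared; the toy agent gets this one wrong.
    EvalCase({"name_on_list": True}, "false_positive"),
]

accuracy = score_agent(toy_agent, cases)
print(f"accuracy={accuracy:.2f}, deploy={ready_to_deploy(accuracy)}")
```

In this toy run the agent scores three out of four, so the gate stays closed and the next step would be another iteration on the prompt or context.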
Two things make this work at Unit21 that most vendors cannot replicate.
First, we have seven to eight years of human analysts performing real investigations inside our platform. Every disposition, every decision, every alert review is a data point. That is our ground truth. Competitors who did not build a compliance platform before building AI have no comparable ground truth for their eval sets.
Second, we have a QA layer built into our historical data. Not all human analysts are equally reliable. We have been randomly sampling and scoring analyst work for years, which means we know which historical decisions came from high-performing analysts and which did not. Our eval sets are benchmarked against the best, not the average. That distinction matters enormously for the quality of what you are measuring against.
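To make the idea of benchmarking against the best analysts concrete, here is a small sketch of filtering historical decisions by analyst QA score before they enter an eval set. The field names, the scores, and the 0.9 cutoff are all assumptions made for illustration; the source does not describe the actual schema or threshold.

```python
def select_ground_truth(dispositions, analyst_qa_scores, min_qa_score=0.9):
    """Keep only decisions made by analysts whose QA score meets the bar.

    Analysts with no QA score on record are treated as unverified and excluded.
    """
    return [
        d for d in dispositions
        if analyst_qa_scores.get(d["analyst_id"], 0.0) >= min_qa_score
    ]


# Hypothetical historical decisions and QA scores.
dispositions = [
    {"alert_id": "a-1", "analyst_id": "ana", "decision": "false_positive"},
    {"alert_id": "a-2", "analyst_id": "ben", "decision": "true_positive"},
    {"alert_id": "a-3", "analyst_id": "cy", "decision": "false_positive"},
]
analyst_qa_scores = {"ana": 0.97, "ben": 0.72}  # "cy" was never sampled

ground_truth = select_ground_truth(dispositions, analyst_qa_scores)
print([d["alert_id"] for d in ground_truth])
```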
When we say an agent task is ready to deploy, it means it has passed eval sets drawn from real, high-quality human investigations. That is not a confidence score on paper. It is verified performance on data that reflects actual compliance work.
For a deeper look at how this methodology works end to end, read "Inside Unit21's AI Suite."
What it is: Rather than letting the LLM look directly at raw transaction data and summarize what it sees, we have the model generate code (SQL or Python) that then queries the data deterministically. The LLM decides what to look for. The code retrieves it precisely.
Here is the problem with giving an LLM raw data and asking it to summarize: even at low temperature, the model can misread a field, conflate two values, or introduce subtle inaccuracies in how it represents what it saw. The output sounds correct. It may not be.
When the LLM generates a query instead, we break the process into two steps with a verification layer in between. Step one: the LLM decides what data to retrieve and writes the logic to get it. Step two: deterministic code executes that logic and returns exact values. The model then summarizes those values, but the retrieval itself is controlled and testable.
This shifts the failure mode from hallucinated answers to potentially incorrect queries, which is a much more tractable problem. If the model writes a bad query, the wrong answer shows up in the eval set results and the prompt gets refined until it doesn't. The eval sets catch this the same way they catch everything else: by comparing output to known-correct answers across thousands of cases.
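A minimal sketch of the two-step pattern, using Python's built-in `sqlite3` module. The model call is replaced by a hypothetical stub (`fake_llm_write_query`), the table and column names are invented, and the read-only check is a deliberately simple guard rather than a complete SQL validator.

```python
import sqlite3


def fake_llm_write_query(question: str) -> str:
    """Stand-in for a model call; a real system would prompt an LLM here."""
    return "SELECT age FROM customers WHERE customer_id = ?"


def is_read_only(sql: str) -> bool:
    """Accept only a single SELECT statement (simple illustrative guard)."""
    stripped = sql.strip().rstrip(";")
    return stripped.upper().startswith("SELECT") and ";" not in stripped


def run_generated_query(conn: sqlite3.Connection, sql: str, params: tuple) -> list:
    """Step two: execute the generated query deterministically."""
    if not is_read_only(sql):
        raise ValueError("generated query rejected: only a single SELECT is allowed")
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id TEXT PRIMARY KEY, age INTEGER)")
conn.execute("INSERT INTO customers VALUES ('c-1', 65)")

# Step one: the model decides what to retrieve and writes the query.
sql = fake_llm_write_query("How old is customer c-1?")
rows = run_generated_query(conn, sql, ("c-1",))
print(rows)  # the value comes from the database, not from the model's reading of it
```

The design point is that the value handed back for summarization is whatever the database returns. If the generated query is wrong, the error is a reproducible query bug that an eval set can surface.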
This architecture, combining LLM reasoning with deterministic execution, is what separates reliable compliance AI from demos that fall apart in production. It is also one reason why the rules vs. machine learning debate has evolved: the best systems use both, with deterministic logic as the guardrail. The LLM is the engine. The deterministic layer is the rest of the car.
What it is: Context engineering is the practice of carefully controlling what information you give the LLM so it has exactly what it needs to make a correct decision, and nothing more.
This sounds simple. It is not.
There is an intuitive assumption that more information equals better outputs. The opposite tends to be true for LLMs. Every model has a finite context window, and performance degrades as the input approaches that limit. Even well before the limit, too much information confuses the model. It starts conflating fields, drawing on the wrong signals, and producing outputs that look reasonable but are not accurate.
We learned this empirically, not theoretically. In our earliest experiments in 2022 and 2023, we fed large dumps of transaction data into models to see if they could detect fraud. The results were poor. Through iteration, we found that giving the model less data, carefully selected, produced measurably better outcomes against our eval sets.
Context engineering means figuring out exactly what information a given AI task needs to make a correct decision, and engineering the input to include that and only that. For a sanctions screening task, the agent needs entity name data, list-matching logic, and relationship context. It does not need three years of transaction history, device data, or open account notes.
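One simple way to enforce "that and only that" is a per-task allowlist of fields. The sketch below assumes a flat record and invents the field names; it illustrates the principle and is not a description of how Unit21 assembles context.

```python
# Hypothetical allowlist: which fields each task is permitted to see.
TASK_CONTEXT_FIELDS = {
    "sanctions_screening": (
        "entity_name",
        "aliases",
        "watchlist_candidates",
        "related_entities",
    ),
}


def build_context(task: str, record: dict) -> dict:
    """Return only the fields the given task needs, dropping everything else."""
    allowed = TASK_CONTEXT_FIELDS[task]
    return {field: record[field] for field in allowed if field in record}


record = {
    "entity_name": "Acme Trading Ltd",
    "aliases": ["Acme Trade"],
    "watchlist_candidates": ["ACME TRADING LLC"],
    "related_entities": ["J. Doe (director)"],
    "transaction_history": ["...three years of rows..."],
    "device_data": {"os": "android"},
    "account_notes": "Customer called about a card replacement.",
}

context = build_context("sanctions_screening", record)
print(sorted(context))
```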
Our architecture takes this further than scoping individual prompts. Rather than stuffing everything into one giant model call, the system routes an investigation across specialized parallel workers, each focused on a narrow objective with a narrow slice of context. A data-gathering worker pulls transaction histories and baseline entity data. A behavioral-analysis worker evaluates deviation from normal patterns. A sanctions worker checks watchlists. Each operates independently and feeds results into a shared data store. A separate decision layer then aggregates their outputs to reach a final determination. The effect is that no single model call is asked to do too much, which is where context overload typically causes quality to degrade.
This also means the system gets smarter across passes. If the behavioral-analysis worker needs data that the data-gathering worker has not yet produced, it waits. On the next pass, that data is available in the shared store, and the analysis proceeds. The investigation converges through multiple passes rather than trying to get everything right in a single shot.
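The multi-pass behavior can be sketched with a shared dictionary and workers that declare which keys they need. This is a simplified, single-threaded illustration with hypothetical worker names and toy logic; a production system would run the workers concurrently and call models inside them.

```python
from typing import Callable

# (output key, keys required from the shared store, function producing the output)
Worker = tuple[str, tuple[str, ...], Callable[[dict], object]]


def run_passes(workers: list[Worker], store: dict, max_passes: int = 5) -> int:
    """Run workers in passes until all have produced output; return passes used."""
    pending = list(workers)
    passes = 0
    while pending and passes < max_passes:
        passes += 1
        snapshot = dict(store)  # within a pass, workers see only earlier passes' results
        still_waiting = []
        for output_key, needs, fn in pending:
            if all(key in snapshot for key in needs):
                store[output_key] = fn(snapshot)
            else:
                still_waiting.append((output_key, needs, fn))  # waits for the next pass
        if len(still_waiting) == len(pending):
            raise RuntimeError("no worker can make progress")
        pending = still_waiting
    return passes


workers: list[Worker] = [
    ("behavior", ("transactions",), lambda s: {"max_amount": max(s["transactions"])}),
    ("transactions", (), lambda s: [120, 80, 9500]),
    ("sanctions", (), lambda s: "no match"),
    ("decision", ("behavior", "sanctions"),
     lambda s: "escalate" if s["behavior"]["max_amount"] > 5000 else "close"),
]

store: dict = {}
passes_used = run_passes(workers, store)
print(passes_used, store["decision"])
```

Here the behavioral worker waits during the first pass because no transactions are in the store yet, runs on the second, and the decision step runs on the third once both of its inputs exist.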
When customers build their own custom AI agent tasks inside Unit21, context management is the primary constraint we help them think through. If you want to understand what that looks like in practice, this piece on how custom AI agents are transforming fraud and AML operations goes deeper on the configuration decisions that separate good agents from great ones. Getting the context right is the difference between a task that performs at 70% accuracy and one that performs at 95%+.
What it is: Temperature is a parameter that controls how "creative" or random a language model's outputs are. A high temperature produces varied, exploratory responses. A temperature of zero or near-zero produces more deterministic, consistent ones.
For compliance use cases, we set temperature to zero or as close to it as the model allows. There is no reason a Sanctions Agent needs to be creative. We want the same input to produce the same output, every time.
Temperature control is the simplest of the four techniques and the least sufficient on its own. Setting temperature to zero does not prevent a model from being confidently wrong. It just makes the model more likely to be wrong in the same way every time, which makes the problem easier to catch in eval sets. It removes one source of variability from a system designed to minimize all of them. Combined with the three techniques above, it completes the picture. On its own, it would not get you very far.
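For readers who want the mechanics, temperature divides the model's raw token scores (logits) before they are turned into probabilities. The sketch below is the standard softmax-with-temperature calculation on made-up logits. Treating a temperature of exactly zero as "always pick the top token" is a common convention assumed here, not a statement about any specific vendor's API.

```python
import math


def token_probabilities(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits scaled by temperature; zero means greedy selection."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
greedy = token_probabilities(logits, 0.0)
warm = token_probabilities(logits, 1.0)
hot = token_probabilities(logits, 2.0)
print(greedy, warm, hot)
```

Lowering the temperature concentrates probability on whichever token already scores highest. If that token is wrong, it is chosen consistently, which is why this setting helps repeatability rather than correctness.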
The four techniques above are how we get to high accuracy before deploying an agent. But no responsible system stops there. The question an engineer should ask is: "What about the cases where the agent gets it wrong?"
We handle this with two additional layers.
Continuous monitoring in production. We run what the AI industry calls an "LLM-as-judge" system: a separate model that evaluates the primary model's work in real time, scoring every output for consistency and accuracy against the established baseline. This runs continuously across every stage of the agent's work — not just the final answer, but every intermediate step. If quality degrades, whether from a model update, a shift in the data, or a prompt that works well on one class of alerts but poorly on another, the monitoring system flags it and recommends updates: prompt adjustments, parameter changes, or swapping in a different model for that specific task. It also detects drift over time, catching the slow degradation that is harder to notice than a sudden failure.
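The drift-detection part of this monitoring can be illustrated with a rolling average of judge scores compared against a baseline. Everything here is an assumption for the sake of the sketch: the `DriftMonitor` class, the 0 to 1 score scale, the window size, and the tolerance. The judge model itself is out of scope, so its scores are supplied as plain numbers.

```python
from collections import deque


class DriftMonitor:
    """Flag when the rolling mean of judge scores falls below baseline - tolerance."""

    def __init__(self, baseline: float, tolerance: float = 0.05, window: int = 100):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores: deque[float] = deque(maxlen=window)

    def record(self, judge_score: float) -> bool:
        """Add one score; return True if quality has drifted below the floor."""
        self.scores.append(judge_score)
        window_full = len(self.scores) == self.scores.maxlen
        rolling_mean = sum(self.scores) / len(self.scores)
        # Wait for a full window so a single early score cannot trigger a flag.
        return window_full and rolling_mean < self.baseline - self.tolerance


monitor = DriftMonitor(baseline=0.95, tolerance=0.05, window=4)
judge_scores = [0.96, 0.95, 0.97, 0.94, 0.50, 0.50]  # quality drops at the end
flags = [monitor.record(score) for score in judge_scores]
print(flags)
```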
Human-in-the-loop escalation. The AI agent does the investigation, gathers the evidence, writes the narrative, and makes a recommendation. But on any case where the system's confidence is not high enough, or where the decision carries real regulatory weight, the output goes to a human analyst for review. The system is not trying to reach 100% automation. It is trying to automate the cases that are clearly false positives — roughly the 60 to 80 percent of alert volume that experienced analysts close in minutes — and escalate everything else. The residual error rate is managed not by pretending it does not exist, but by building the infrastructure to detect it, learn from it, and route around it.
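The escalation rule described above reduces to a small routing function: auto-close only when the agent recommends a false positive, is confident, and the case does not carry heavy regulatory weight. The names and the 0.9 confidence cutoff below are hypothetical and chosen only to make the sketch runnable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentResult:
    recommendation: str           # e.g. "false_positive" or "true_positive"
    confidence: float             # assumed to be on a 0 to 1 scale
    high_regulatory_weight: bool  # e.g. a decision that could lead to a filing


def route(result: AgentResult, min_confidence: float = 0.9) -> str:
    """Auto-close only clear, low-stakes false positives; escalate everything else."""
    if (
        result.recommendation == "false_positive"
        and result.confidence >= min_confidence
        and not result.high_regulatory_weight
    ):
        return "auto_close"
    return "human_review"


print(route(AgentResult("false_positive", 0.98, False)))
print(route(AgentResult("false_positive", 0.70, False)))
print(route(AgentResult("true_positive", 0.99, False)))
print(route(AgentResult("false_positive", 0.99, True)))
```

The default path is human review. Automation has to earn its way out of that default on every case.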
This is the part that makes the "hallucination" framing outdated. The question is not "does the model ever get it wrong?" Of course it does. The question is whether the error rate of the AI system, including its monitoring and escalation layers, is lower than the error rate of the manual process it replaces. For well-defined compliance tasks, it is — and we can prove it with data.
The techniques above apply to every AI agent task we ship. But there is an additional layer worth mentioning: not all LLMs are equally good at every task.
We run a multi-model architecture across Unit21's platform. At any given time, we are using around 11 to 12 different models for different task types, selected based on current benchmark performance. Document analysis, AML transaction monitoring, and narrative drafting all have different capability profiles across models. We route tasks to the model best suited for each one.
Because we have eval sets for every task, we can swap models systematically. When a new model scores better on our benchmarks, we run it through our eval sets. If it meets or exceeds the current model's performance, we deploy it. If not, we stay with the current model. The decision is empirical, not speculative.
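The swap decision can be expressed as a per-task comparison of eval-set accuracy. The sketch below uses invented model names, task names, and accuracy figures. It assumes a candidate replaces the current model for a task only if it was evaluated on that task and scored at least as well.

```python
def update_routing(
    routing: dict[str, str],
    candidate: str,
    eval_accuracy: dict[tuple[str, str], float],
) -> dict[str, str]:
    """Return a new task-to-model routing after considering one candidate model."""
    updated = {}
    for task, current in routing.items():
        current_score = eval_accuracy[(task, current)]
        candidate_score = eval_accuracy.get((task, candidate))  # None if not evaluated
        if candidate_score is not None and candidate_score >= current_score:
            updated[task] = candidate
        else:
            updated[task] = current
    return updated


# Hypothetical routing table and eval-set results keyed by (task, model).
routing = {"document_analysis": "model-a", "narrative_drafting": "model-b"}
eval_accuracy = {
    ("document_analysis", "model-a"): 0.95,
    ("document_analysis", "model-c"): 0.97,
    ("narrative_drafting", "model-b"): 0.96,
    ("narrative_drafting", "model-c"): 0.93,
}

new_routing = update_routing(routing, "model-c", eval_accuracy)
print(new_routing)
```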
This means our agents get better over time without any additional configuration from customers. The underlying infrastructure is always being optimized against measured accuracy.
I want to be direct about something that often gets lost in discussions about AI accuracy: this is not primarily a technical problem. It is a moral one.
Financial crimes such as human trafficking, drug trafficking, and sanctions evasion are not abstractions. They are happening now, funded through the financial system. Every time a compliance team chooses to stay in "governance committee mode," debating AI risk rather than deploying tested, reliable AI, real criminal activity continues undetected.
The argument against using AI is sometimes framed as caution. But caution has a cost too. If a well-tested AI agent would have flagged a bad actor that a human analyst missed under a crushing alert load, the decision not to deploy that agent was not neutral. I wrote more about this tension in "Why I Believe AI Agents Are the Last Chance to Actually Win Against Financial Crime."
Unit21 has built its AI to be auditable and explainable, and has designed it with the requirements of the EU AI Act in mind. The Act is currently the strictest AI regulation in the world. Our agents produce regulator-ready outputs with full transparency on which models were used, what tests were run, and what bias evaluations were performed. The governance question has been answered. The technology exists. The tools are in production.
The question now is whether teams will use them.
The hallucination conversation is important to have. But it should end with action, not paralysis.
The four techniques that make AI reliable in financial crime compliance — eval sets, deterministic code generation, context engineering, and temperature control — are not research concepts. They are production-tested at scale. And they are backstopped by continuous quality monitoring and human-in-the-loop escalation that handle the cases where the system is not confident enough to act alone. Unit21 processes over 213,000 alerts per month through AI agents built on these foundations. We know they work because we measure them, constantly, against real data.
If your team is still treating hallucinations as a barrier to deployment rather than an engineering problem with known solutions, the risk is not that you will ship something unreliable. The risk is that you will not ship anything at all, while the problem keeps growing.
See what production-grade compliance AI looks like for your team.
Unit21's AI agents process over 213,000 alerts per month across financial institutions and fintechs worldwide, built on the techniques described above. If you want to see how they translate into real fraud and AML workflows, book a demo with the Unit21 team.
Do AI hallucinations still happen in compliance tools?
Not in the way they did in 2023. With eval set benchmarking, deterministic code generation, context engineering, and temperature control applied together — plus continuous monitoring and human-in-the-loop escalation — modern AI agents can achieve 95%+ accuracy on well-defined compliance tasks, with infrastructure to catch and route the remainder. The question is not "does your AI hallucinate?" but "how was it tested, what is its verified accuracy on real compliance data, and what happens when the system is not confident?"
What is an eval set and why does it matter for AI reliability?
An eval set is a structured test suite of real examples with known correct outputs. You run your AI agent against these examples and measure accuracy before deploying. The quality of the eval set is as important as the quality of the model: at Unit21, our eval sets are drawn from years of real analyst reviews, scored against high-performing analysts only.
How does context engineering differ from prompt engineering?
Prompt engineering focuses on how you phrase instructions to the model. Context engineering focuses on what data you provide alongside those instructions. In compliance AI, context engineering is the more important of the two: providing too much data degrades accuracy. Providing exactly the right data for each task type is what enables reliable outputs.
Can you build reliable AI agents in-house without a compliance data history?
You can build AI agents, but building reliable ones is significantly harder without years of human analyst review data to benchmark against. Eval sets require ground truth. Ground truth in compliance comes from verified human decisions. Without that history, teams are essentially flying blind when measuring AI accuracy.
Is Unit21's AI compliant with the EU AI Act?
Unit21's AI agents are designed with the requirements of the EU AI Act in mind. The Act is currently the strictest AI regulation globally. Customers can request on-demand reports detailing which models are in use, what testing was performed, and what bias evaluations were completed.

Kunal Datta is the Chief Product Officer at Unit21. Prior to Unit21, he led the Product team for Checkout at Fast, and prior to that, led the Product teams responsible for automating aerial wildfire safety inspections at Pacific Gas & Electric.
He has a background leading Product teams that use AI to automate processes at regulated entities, as well as teams building financial products, machine learning products, web applications, mobile applications, hardware products, and data products. Kunal is a Fulbright Scholar and studied Civil and Environmental Engineering and Music, Science, and Technology at Stanford University.