
I've spent a lot of time lately listening to compliance teams talk about AI, and the pattern is the same almost every time. "We know we need to do something here. We don't know where to start. And we're worried about what happens when an examiner asks us about it."
That worry is fair. But I think it's also stopping teams from doing something that matters. The reason to deploy an AI agent isn't cost or headcount. It's that you can finally do a deep review of every alert, not just the ones your team has bandwidth for. That's a genuinely new capability. It didn't exist five years ago. And it happens to be exactly what FinCEN's proposed effectiveness standard is asking you to demonstrate.
So this is the order I'd actually do it in. Seven steps, with a note on why each one matters and what I've seen go wrong.
The biggest mistake I see is teams trying to deploy AI broadly before they've proven it works narrowly. One agent, every alert type, every queue. That's how you end up with a system that's mediocre everywhere and defensible nowhere.
Pick a process where you already have a clean dataset, a decision pattern your team understands well, and a volume you can actually manage. A lot of teams start with sanctions, and for good reason. The decision logic is fairly clear: name matching, demographic comparison, match score analysis. The agent doesn't need to understand your entire transaction monitoring program. It needs to do one thing well. Sanctions is the clean entry point. L1 AML triage works too, if you're ready for a bit more configuration work.
Configure the agent with two or three tasks to begin with, no more. And do it in plain language rather than SQL or a logic tree.
This part surprised me when we were building out agent creation at Unit21. You describe what you want, say, "tell me when the sender name matches the receiver name and the account has a high onboarding risk score," and the system works out which fields in your data actually correspond to that intent. It finds your KYC risk score column. It notices when that field is sparse and suggests a better one. It proposes a threshold. It flags the ambiguities in your prompt before anything goes live.
This matters for more than convenience. When a task is built from plain language, you end up with a written record of what you were trying to build, not only what the configuration happens to do. That distinction carries real weight when you're explaining an agent's decision to a validator or a regulator. The intent behind the design becomes part of the audit trail.
You have a lot to draw on. Pre-built tasks already cover current activity, prior case history, network analysis, customer risk rating, sanctions and PEP matching, and document review. Custom tasks let you go further, into tailored KYC profile analysis, counterparty risk checks, whatever is specific to how risk shows up at your institution.
A lot of deployments go wrong right here. So be deliberate about it before you go any further.
A well-designed investigation agent does not decide suspicion. It does not recommend a filing. It makes one of two calls: close this as a false positive, or send it for human review. We have not solved the accountability problem with AI, and the human still has to be the final decision-maker. What the agent does is make that human's job better, either by documenting why something isn't worth their attention, or by showing exactly why something is.
Both calls come with a full reasoning chain. Low match score, significant demographic discrepancies, no indication it's the same individual. Or: two prior cases, high-frequency outgoing transactions, a spike in activity, possible structuring. Every task output, and every piece of data the agent looked at, stays visible to the analyst, and to the examiner when they show up.
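To make the "one of two calls, always with reasoning" constraint concrete, here is a minimal sketch of what a decision record could look like. The class and field names (`Disposition`, `Finding`, `AgentDecision`) are hypothetical and chosen for illustration; they are not Unit21's data model.

```python
from dataclasses import dataclass
from enum import Enum


class Disposition(Enum):
    # The only two calls the agent is allowed to make.
    CLOSE_FALSE_POSITIVE = "close_false_positive"
    SEND_FOR_HUMAN_REVIEW = "send_for_human_review"


@dataclass(frozen=True)
class Finding:
    task: str                        # hypothetical task name, e.g. "match_score"
    observation: str                 # what that task concluded
    data_refs: tuple[str, ...] = ()  # pointers to the data the task looked at


@dataclass(frozen=True)
class AgentDecision:
    alert_id: str
    disposition: Disposition
    findings: tuple[Finding, ...]

    def __post_init__(self):
        # No "suspicious" or "file a report" option exists by construction.
        if not isinstance(self.disposition, Disposition):
            raise ValueError("disposition must be a Disposition")
        # A call without a reasoning chain is not a valid decision.
        if not self.findings:
            raise ValueError("a decision needs at least one finding")

    def reasoning_chain(self) -> str:
        return "; ".join(f"{f.task}: {f.observation}" for f in self.findings)
```

The design choice worth noticing is that the limits live in the type itself: there is no value for "suspicious," and a decision with no findings cannot be constructed.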
Run the agent against historical alerts. Compare its recommendations to the decisions your humans actually made. Make sure it isn't hallucinating or sailing past obvious patterns.
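A back-test can be as simple as tallying agreement between the agent's call and the historical human call. The sketch below assumes each alert has been reduced to a pair of labels, `"close"` or `"review"`; the function name and the label values are illustrative assumptions, not a prescribed format.

```python
from collections import Counter


def backtest(pairs):
    """pairs: iterable of (agent_call, human_call), each "close" or "review"."""
    counts = Counter(pairs)
    total = sum(counts.values())
    agree = counts[("close", "close")] + counts[("review", "review")]
    # The costly disagreement: the agent would have closed an alert
    # that a human actually sent on for review.
    missed = counts[("close", "review")]
    human_review = missed + counts[("review", "review")]
    return {
        "alerts": total,
        "agreement_rate": agree / total if total else None,
        "missed_escalations": missed,
        "missed_escalation_rate": missed / human_review if human_review else None,
    }
```

Overall agreement hides the asymmetry that matters, so the sketch reports the agent-closed, human-escalated cases separately. Those are the ones to read line by line.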
This is the obvious part, and most teams do it. It's the next step that separates a working program from one that just looks busy.
Of all seven steps, this is the one I think matters most for demonstrating effectiveness: compare the agent's output to your best analysts, not your median ones.
Think about what effectiveness actually means. If your agent performs at the level of an average analyst, that's table stakes. The benchmark that means something is whether it catches what your best people catch, and closes what they'd close. So when you run your random sampling, pull a set of false positive closures and put them in front of your top analysts. If the agent's decisions hold up under that review, you have a defensible record. If they don't, you've found the problem before an examiner does.
This is how we evaluate our own pre-built tasks, incidentally. We benchmark against the top performers who've used the platform, not against the distribution. Under an effectiveness framework, that's the only standard that means anything. Whatever methodology you land on, write it down as you go. That document is half your evidence.
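One way to make the sampling itself defensible is to make it reproducible. This is a sketch, assuming you can list the IDs of the alerts the agent closed as false positives; `sample_closures` is a hypothetical helper name.

```python
import random


def sample_closures(closed_alert_ids, sample_size, seed):
    """Draw a reproducible simple random sample of agent-closed alerts."""
    # Deduplicate and sort so the same seed always yields the same sample,
    # regardless of the order the IDs arrived in.
    ids = sorted(set(closed_alert_ids))
    k = min(sample_size, len(ids))
    rng = random.Random(seed)
    return rng.sample(ids, k)
```

If you record the seed, the sample size, and the date of the population snapshot in your methodology document, anyone can re-draw the identical sample later and confirm nothing was cherry-picked.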
Deploy the agent to auto-review first. It does the work and surfaces its reasoning, and a human still confirms every outcome.
Auto-close is a feature, not a default. Treat it as optional until you've been through a few sampling cycles and you trust what you're seeing. There's no prize for turning it on early, and there's real downside if you do it before the evidence is there.
The teams I've watched do this well don't try to cover everything at once. They take one thing, prove it, and build out from there. Once one process holds up, the next one is faster, because you already have the pattern: narrow scope, plain-language tasks, back-test, sample against your best people, auto-review before auto-close.
And the documentation you generate along the way, the task configuration, the back-test results, the sampling records, the reasoning chains, becomes your evidence of effectiveness, for an examiner and, just as much, for your own team.
If you follow those steps, you'll have the answers ready. FinCEN's proposed rule moves the question from "do you have a program?" to "how do you know your program is working?" For AI agents in particular, there are three things I'd prepare to answer.
First, is the logic conceptually sound? Does the task configuration actually correspond to the risk it's meant to catch? You should be able to pull up the reasoning chain for any decision, along with the task creation record that shows what you set out to build. That's Steps 2 and 3.
Second, is the input data reliable? The biggest failure point in a risk model usually isn't the model. It's the data feeding it. When the agent was configured, did it surface data quality problems? Did you fix them? Is that written down anywhere? The exploration step in Step 2 produces exactly this kind of evidence.
Third, was it tested, and what happened? Historical back-test results, your sampling methodology, the comparison to human benchmarks. That's Steps 4 and 5, and it should be actual results with citations, not a narrative.
One more thing on auditability, worth settling before you need it. Treat AI agents and humans identically in the audit trail: every action timestamped, every change attributed, none of it alterable. And ask where those records live. If your platform goes down, gets acquired, or your contract lapses, can you still produce the decision history for an alert that closed eighteen months ago? That's a fair question to put to any vendor, us included.
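As an illustration of "timestamped, attributed, not alterable," here is a minimal sketch of a hash-chained log in which agent and human actions share one record format. It is an assumption about how such a trail could be built, not a description of any vendor's implementation, and it makes edits detectable rather than impossible. Real immutability also needs storage-level controls.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _digest(entry):
    # Hash every field except the hash itself, in a stable key order.
    body = {k: v for k, v in entry.items() if k != "hash"}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_entry(log, actor, action, alert_id, now=None):
    """Append one action; actor may be a human or an agent, same format."""
    entry = {
        "timestamp": (now or datetime.now(timezone.utc)).isoformat(),
        "actor": actor,
        "action": action,
        "alert_id": alert_id,
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    entry["hash"] = _digest(entry)
    log.append(entry)
    return entry


def verify(log):
    """Return True only if no entry was altered, removed, or reordered."""
    prev = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev or entry["hash"] != _digest(entry):
            return False
        prev = entry["hash"]
    return True
```

Because each entry commits to the one before it, an exported copy of the log can be verified independently of the platform that produced it, which speaks to the "where do those records live" question.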
This is the part I find most interesting about where agentic compliance is headed.
Consider how investigations get prioritized today. You work by risk, or by deadline, or some combination of the two. You take the highest-risk alerts first, and you clear whatever the regulatory clock forces you to clear. The volume in the queue decides where your people spend their hours.
I think that inverts. The goal stops being to investigate everything, or even to investigate strictly by risk. It becomes to investigate the minimum set of cases needed to stay confident that auto-close is working, and to back that confidence with statistical evidence instead of a gut feel. The agent closes the clear false positives. The human's job starts to look more like an auditor's: sample enough to prove, mathematically, that the close rate holds up, and stop there. You spend less time re-clearing what the agent already cleared, and more time on the cases that actually carry risk. And you have a number you can put in front of an examiner that says here is how confident we are, and here is why.
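The "number you can put in front of an examiner" can be computed in a few lines. The sketch below uses a one-sided exact binomial (Clopper-Pearson) upper bound, which is one reasonable choice among several. It assumes the reviewed closures are a simple random sample of everything the agent closed and that the reviewers' verdicts are treated as ground truth.

```python
from math import comb


def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1))


def upper_bound(errors, sample_size, confidence=0.95):
    """One-sided exact upper bound on the true wrong-closure rate."""
    if errors >= sample_size:
        return 1.0
    alpha = 1 - confidence
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection; 60 halvings is ample precision
        mid = (lo + hi) / 2
        # The CDF falls as p rises; find the p where it reaches alpha.
        if binom_cdf(errors, sample_size, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi
```

For example, if top analysts review 300 sampled closures and disagree with none, `upper_bound(0, 300)` comes out just under 1%. That supports a statement like "we are 95% confident the agent's wrong-closure rate is below 1%." Whether 1% is an acceptable tolerance is a risk decision for the institution, not something the math settles.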
There's a second shift worth watching, on the detection side. I've talked to a lot of compliance people who build genuinely good detection rules: low noise, good coverage, steady iteration. When I ask them how, the answer is almost always the same: "I talk to my analysts." The people reviewing alerts are the first to notice when a rule is firing garbage, and the first to spot a pattern starting to form. They keep the detection logic sharp. The trouble is that the loop is slow, undocumented, and entirely dependent on individual relationships and initiative.
AI closes that gap. When an investigation agent reviews alerts and writes structured findings into the case, that output becomes signal. The detection agent reads it. It notices that a given rule fires constantly but almost never leads to an escalation, and it suggests adjusting a threshold, or excluding a transaction type, or adding a geographic filter. The suggestion arrives with citations: here are the specific investigations behind it, here's the field that would have filtered the noise, here's the shadow mode result. You look at it, you agree or you push back, and you deploy. The loop tightens. And unlike the informal version, all of it is written down.
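The "fires constantly but almost never leads to an escalation" check is easy to sketch once investigation outcomes are structured. The record fields (`rule_id`, `alert_id`, `escalated`) and both thresholds below are assumptions for illustration. The output carries the alert IDs behind each suggestion so that it arrives with its citations.

```python
from collections import defaultdict


def noisy_rules(investigations, min_alerts=50, max_escalation_rate=0.02):
    """investigations: iterable of dicts with rule_id, alert_id, escalated."""
    by_rule = defaultdict(list)
    for inv in investigations:
        by_rule[inv["rule_id"]].append(inv)

    suggestions = []
    for rule_id, invs in sorted(by_rule.items()):
        escalated = sum(1 for i in invs if i["escalated"])
        rate = escalated / len(invs)
        # Only flag rules with enough volume to judge and very few escalations.
        if len(invs) >= min_alerts and rate <= max_escalation_rate:
            suggestions.append({
                "rule_id": rule_id,
                "alerts": len(invs),
                "escalation_rate": rate,
                "cited_alert_ids": [i["alert_id"] for i in invs],
            })
    return suggestions
```

This only nominates candidates. Deciding whether the fix is a threshold change, an exclusion, or a geographic filter, and checking it in shadow mode, is still a step a person reviews before anything is deployed.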
That's what effectiveness looks like in practice. The point was never to process alerts faster. The point is a program that gets better over time, with a documented trail of how and why. That's the standard worth building toward.

Kunal Datta is the Chief Product Officer at Unit21. Prior to Unit21, he led the Product team for Checkout at Fast, and prior to that, led the Product teams responsible for automating aerial wildfire safety inspections at Pacific Gas & Electric.
He has a background leading Product teams that use AI to automate processes at regulated entities, as well as Product teams for financial products, machine learning products, web applications, mobile applications, hardware products, and data products. Kunal is a Fulbright Scholar and studied Civil and Environmental Engineering and Music, Science, and Technology at Stanford University.