AI

How I Evaluate and Test AI in a Compliance Program

Published: April 8, 2026
Read time: 7 mins
Guy Huber, Principal at FS Vector

Over the past year, I’ve spoken with a wide range of fintechs, banks, and compliance teams that are evaluating how to implement agentic AI in their workflows. One question comes up frequently: how should we evaluate AI solutions in a way that holds up in practice?

There are many AI solutions available today, and the pace of innovation shows no sign of slowing. As organizations begin the tough work of implementing agentic tools, they need an approach for filtering out agentic solutions that are not positioned to meaningfully improve workflows.

Last week, I joined a great webinar discussion with Unit21, Liminal, and Equifax on this topic. What follows are some practical suggestions for evaluating and testing AI in a compliance context. This is not a comprehensive framework, but rather a set of lenses I’ve found useful when advising clients.

1. Look Beyond Efficiency. Ask What Actually Changed

Has the introduction of AI fundamentally changed the workflow, or simply accelerated it?

It is very common to see solutions that layer a chatbot or assistant onto an existing process. In those cases:

  • Alerts are still generated the same way
  • Investigators follow the same steps
  • Outputs look largely unchanged


Done right, this can increase speed. But typically, those efficiencies are capped, in part because you are adding manual steps to the existing flow (e.g., querying the chatbot) that offset some of the gains. 


The more compelling implementations tend to reshape the process itself, removing steps, redistributing decision-making, or changing how teams are structured.


A simple test I often suggest:

  • Map your current workflow
  • Map the “AI-enabled” workflow
  • Identify what has actually been eliminated or rethought

If the answer is “not much,” it is worth probing further.
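To make the mapping exercise concrete, the comparison can be as simple as diffing two ordered step lists. Here is a minimal sketch in Python; the step names are hypothetical placeholders for your own workflow maps:

```python
# Hypothetical step names; substitute the steps from your own workflow maps.
current = [
    "generate_alert", "gather_kyc_data", "pull_transaction_history",
    "query_negative_news", "draft_narrative", "qa_review", "disposition",
]
ai_enabled = [
    "generate_alert", "agent_gathers_evidence", "agent_drafts_narrative",
    "human_reviews_exceptions", "disposition",
]

eliminated = [step for step in current if step not in ai_enabled]
introduced = [step for step in ai_enabled if step not in current]

print("Eliminated or rethought:", eliminated)
print("Introduced:", introduced)
# If `eliminated` is nearly empty, the AI layer is probably accelerating
# the old process rather than reshaping it.
```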

2. Don’t Rely on the Demo. Test the Edge Cases

Vendor demos are designed to succeed. That is their purpose. But compliance work rarely operates in clean, ideal scenarios. To properly evaluate an AI solution, you need to see how it behaves under pressure. Some of the most valuable testing I’ve seen comes from deliberately introducing friction:

  • Incomplete or inconsistent data
  • Ambiguous alerts
  • Cases where there is no clear “right” answer

This is where systems tend to diverge.


Practically, this means:

  • Bringing your own sample cases into evaluations
  • Asking vendors to walk through less-than-ideal scenarios
  • Observing how the system handles uncertainty, not just accuracy

That exercise often reveals far more than a polished demonstration ever will.
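One way to operationalize this is to maintain a small library of deliberately messy cases and run every candidate system against it. A minimal sketch, where `evaluate` is a hypothetical stand-in for whatever vendor interface you are testing:

```python
from dataclasses import dataclass

@dataclass
class EdgeCase:
    name: str
    payload: dict            # deliberately incomplete or ambiguous input
    expect_escalation: bool  # should the system defer to a human here?

# Illustrative cases only; build yours from real (sanitized) case files.
EDGE_CASES = [
    EdgeCase("missing_dob", {"name": "J. Smith", "dob": None}, True),
    EdgeCase("conflicting_countries",
             {"name": "A. Jones", "addresses": ["US", "IR"]}, True),
    EdgeCase("clean_profile", {"name": "B. Lee", "dob": "1990-01-01"}, False),
]

def run_suite(evaluate):
    """`evaluate` is a hypothetical vendor call returning a dict like
    {"decision": str, "escalated": bool, "rationale": str}."""
    for case in EDGE_CASES:
        result = evaluate(case.payload)
        ok = result["escalated"] == case.expect_escalation
        print(f"{case.name}: {'PASS' if ok else 'FAIL'} ({result['rationale']})")
```

The point is less the pass/fail count than whether the system recognizes uncertainty at all.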

3. Make “Human in the Loop” Concrete

“Human in the loop” is a phrase that comes up in nearly every conversation, particularly when regulators are involved. But it is often left undefined.

In reality, there are several very different ways this can be implemented:

  • The human reviews every AI output
  • The human only handles exceptions
  • The AI makes certain decisions autonomously, with risk-based human oversight

Each of these carries different implications for risk, governance, and operational design.

When evaluating a solution, I encourage teams to get very specific:

  • Where exactly does the human intervene?
  • What authority does the AI have on its own?
  • How are overrides handled and recorded?

Clarity here is critical, not just for internal comfort, but for how the program will stand up under scrutiny.
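To illustrate how different these models really are, here is a rough sketch of risk-based routing with overrides captured as first-class records. The thresholds and field names are hypothetical and would need calibration to your own risk appetite:

```python
import datetime
import json

# Hypothetical thresholds; calibrate these to your own risk appetite.
AUTONOMOUS_MAX_RISK = 0.2   # AI may act alone below this score
ESCALATION_MIN_RISK = 0.7   # everything above goes straight to a human

def route(risk_score: float) -> str:
    """Decide where the human intervenes for a given case."""
    if risk_score < AUTONOMOUS_MAX_RISK:
        return "ai_autonomous"            # AI acts; decisions sampled later
    if risk_score < ESCALATION_MIN_RISK:
        return "human_reviews_ai_output"  # human confirms or overrides
    return "human_handles_directly"       # AI assists only

def record_override(alert_id: str, ai_decision: str,
                    human_decision: str, reason: str) -> dict:
    """Overrides are recorded, not silently corrected."""
    entry = {
        "alert_id": alert_id,
        "ai_decision": ai_decision,
        "human_decision": human_decision,
        "reason": reason,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    print(json.dumps(entry))  # in practice: append to an immutable audit store
    return entry
```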

4. Prioritize Explainability That Works in Practice

Explainability is often discussed at a technical level, but in compliance, it needs to work operationally.

The question is not whether a model can produce an explanation, but whether that explanation is usable by:

  • Compliance officers
  • Internal audit
  • Regulators

In strong implementations, you see:

  • Clear, plain-language reasoning
  • Traceable inputs and outputs
  • A narrative that connects the data to the decision

If understanding a decision requires technical interpretation, that creates friction and risk downstream.
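In practice, "usable" often means a structured record that a non-technical reviewer can read end to end. The fields below are illustrative rather than any standard:

```python
from dataclasses import dataclass, field

@dataclass
class DecisionExplanation:
    decision: str                 # e.g. "escalate_to_l2"
    summary: str                  # plain-language reasoning
    inputs_used: list = field(default_factory=list)  # traceable source records
    model_version: str = "unknown"

    def narrative(self) -> str:
        """A narrative that connects the data to the decision."""
        sources = ", ".join(self.inputs_used) or "none recorded"
        return (f"Decision: {self.decision}\n"
                f"Reasoning: {self.summary}\n"
                f"Based on: {sources}\n"
                f"Model version: {self.model_version}")

# Hypothetical example record.
print(DecisionExplanation(
    decision="escalate_to_l2",
    summary="Counterparty name closely matches a sanctioned entity and the "
            "date of birth could not be verified.",
    inputs_used=["txn_0417", "sanctions_list_2026_04", "kyc_profile_8812"],
    model_version="v1.3",
).narrative())
```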

5. Start with Contained Use Cases

A few use cases stand out as sensible starting places for agentic AI: areas where a clear “human-in-the-loop” model supports control and evaluation.

Common entry points include:

  • Sanctions screening
  • Negative news or PEP reviews
  • Level 1 alert triage

These workflows tend to be:

  • High volume
  • Repetitive
  • Based on relatively structured data

That makes them well-suited for testing performance, building confidence, and establishing governance before expanding further.
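For a contained use case like Level 1 triage, initial validation can be as simple as scoring the system against historical analyst dispositions. A minimal sketch, assuming you can export labeled alerts; the data here is invented for illustration:

```python
def validate_triage(ai_labels, analyst_labels, positive="escalate"):
    """Compare AI triage to historical analyst dispositions.
    Inputs are parallel lists of "escalate" / "close" labels."""
    pairs = list(zip(ai_labels, analyst_labels))
    tp = sum(a == h == positive for a, h in pairs)
    fp = sum(a == positive != h for a, h in pairs)
    fn = sum(h == positive != a for a, h in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # missed escalations matter most
    return {"precision": round(precision, 3), "recall": round(recall, 3)}

# Invented labels for illustration only.
print(validate_triage(
    ai_labels=["escalate", "close", "escalate", "close"],
    analyst_labels=["escalate", "close", "close", "close"],
))
```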

6. Treat Evaluation as an Ongoing Process

As with other critical vendor relationships, evaluating AI is less of a one-time decision and more of a continuous process:

  • Initial testing and validation
  • Ongoing monitoring and tuning
  • Periodic reassessment as models and use cases evolve

This requires coordination across compliance, risk, and technology teams, and a willingness to iterate.

There is no meaningful shortcut here. The diligence is part of the process.
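One lightweight way to implement the monitoring piece is to track AI-versus-human agreement over a rolling window and flag drift. The window size and threshold below are placeholders, not recommendations:

```python
from collections import deque

class AgreementMonitor:
    """Tracks how often humans agree with AI outputs over a rolling window."""

    def __init__(self, window: int = 500, alert_below: float = 0.90):
        self.results = deque(maxlen=window)  # True = human agreed with the AI
        self.alert_below = alert_below       # placeholder threshold

    def record(self, human_agreed: bool) -> None:
        self.results.append(human_agreed)

    def agreement_rate(self) -> float:
        return sum(self.results) / len(self.results) if self.results else 1.0

    def needs_review(self) -> bool:
        # Require enough volume before treating a dip as drift.
        return len(self.results) >= 100 and self.agreement_rate() < self.alert_below
```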

7. Anchor Everything in Governance and Accountability

Finally, evaluation cannot be separated from governance. Much of the broader discussion around agentic AI has focused on governance and oversight, a topic that warrants its own article, but the essentials apply here as well.

Before deploying any AI capability, organizations should be able to answer a few foundational questions:

  • Who is accountable for the outcomes produced by this system?
  • How are decisions documented and validated?
  • How does this fit within existing compliance and risk frameworks?

In my experience, the institutions that are most successful are the ones that address these questions early and bring regulators along in the process.

Getting Started: A Practical Approach

For teams just beginning this journey, I typically recommend a measured, structured approach:

  • Define a narrow use case where success can be clearly measured
  • Run the AI alongside existing processes rather than replacing them immediately
  • Evaluate performance across multiple dimensions—accuracy, consistency, explainability, and failure handling
  • Document everything, particularly how decisions are made and reviewed

The importance of the last two bullets cannot be overstated. Auditors and examiners will scrutinize how you configure, test, and oversee your AI tools. If you cannot demonstrate this, you risk setting back your AI roadmap. 
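Running the AI alongside existing processes is often implemented as a shadow mode: both pipelines see the same cases, only the existing workflow's decision takes effect, and every divergence is logged for review. A rough sketch, with hypothetical callables and field names:

```python
import csv
import datetime

def shadow_compare(cases, legacy_decide, ai_decide, out_path="shadow_log.csv"):
    """Run both pipelines on the same cases; only `legacy_decide` is binding.
    `legacy_decide` and `ai_decide` are hypothetical callables; `cases` is an
    iterable of dicts with at least an "id" key."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["case_id", "legacy", "ai", "diverged", "logged_at"])
        for case in cases:
            legacy = legacy_decide(case)
            ai = ai_decide(case)
            writer.writerow([
                case["id"], legacy, ai, legacy != ai,
                datetime.datetime.now(datetime.timezone.utc).isoformat(),
            ])
    # The divergent rows become the review queue that drives tuning and
    # the documentation trail examiners will ask for.
```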

Final Thoughts

AI has the potential to materially improve how compliance programs operate. That much is clear. What is less clear, and still evolving, is how to separate meaningful capability from superficial progress.

The organizations that will benefit most are not necessarily those that move fastest, but those that evaluate rigorously, test thoughtfully, and build with governance in mind from the outset.


That approach takes more time upfront. In my experience, it more than pays for itself over the long term.


Tune into the full discussion I had last week with leaders from Unit21, Liminal, and Equifax for a much deeper dive and an array of advice.

Guy Huber
Principal at FS Vector

Guy helps clients launch products, build and improve compliance programs, and navigate bank partnerships.

Prior to joining FS Vector, Guy was a Senior Managing Consultant at Promontory Financial Group. At Promontory, Guy advised a range of domestic and international financial institutions on regulatory compliance, with a focus on financial crimes and complex operational transformations. Guy also has deep experience assisting clients with regulatory remediation strategy.

Guy earned a J.D. from Tulane University Law School.

Learn more about Unit21
Unit21 is the leader in AI Risk Infrastructure, trusted by over 200 customers across 90 countries, including Sallie Mae, Chime, Intuit, and Green Dot. Our platform unifies fraud and AML with agentic AI that executes investigations end-to-end—gathering evidence, drafting narratives, and filing reports—so teams can scale safely without expanding headcount.