
Over the past year, I’ve spoken with a wide range of fintechs, banks, and compliance teams that are evaluating how to implement agentic AI in their workflows. One question comes up frequently: how should we evaluate AI solutions in a way that holds up in practice?
There are many AI solutions available today, and the pace of innovation shows no sign of slowing. As organizations begin the hard work of implementing agentic tools, they need an approach that helps them filter out solutions that are not positioned to meaningfully improve their workflows.
Last week, I joined a great webinar discussion with Unit21, Liminal, and Equifax on this topic. What follows are some practical suggestions for evaluating and testing AI in a compliance context. This is not a comprehensive framework, but rather a set of lenses I’ve found useful when advising clients.
Has the introduction of AI fundamentally changed the workflow, or simply accelerated it?
It is very common to see solutions that layer a chatbot or assistant onto an existing process. Done right, this can increase speed. But those efficiencies are typically capped, in part because you are adding manual steps to the existing flow (e.g., querying the chatbot) that offset some of the gains.
The more compelling implementations tend to reshape the process itself, removing steps, redistributing decision-making, or changing how teams are structured.
A simple test I often suggest: ask what the workflow will actually look like once the solution is in place, and how it differs from the process you run today. If the answer is “not much,” it is worth probing further.
Vendor demos are designed to succeed. That is their purpose. But compliance work rarely operates in clean, ideal scenarios. To properly evaluate an AI solution, you need to see how it behaves under pressure. Some of the most valuable testing I’ve seen comes from deliberately introducing friction, the messy inputs and edge cases that polished demos avoid. This is where systems tend to diverge.
Practically, this means bringing your own scenarios to the table, drawn from the hardest cases your team actually handles, rather than relying on the vendor’s curated examples. That exercise often reveals far more than a polished demonstration ever will.
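To make that concrete, here is a minimal sketch of what a friction test might look like, assuming a hypothetical review_alert interface; the scenarios, field names, and expected outcomes are all invented for illustration:

```python
# Hypothetical friction-testing harness for an agentic alert-review tool.
# `review_alert` stands in for whatever interface the vendor exposes; every
# scenario below is an invented example of the messy input real work produces.

HARD_CASES = [
    # Incomplete data: a key field is missing from the alert payload
    {"alert": {"amount": 9500, "counterparty": None}, "expect": "escalate"},
    # Conflicting signals: name matches a sanctions list, date of birth does not
    {"alert": {"name_match": True, "dob_match": False}, "expect": "escalate"},
    # Ambiguity: activity consistent with both payroll and structuring
    {"alert": {"pattern": "recurring_just_under_threshold"}, "expect": "escalate"},
]

def run_friction_tests(review_alert):
    """Return the cases where the system guessed instead of escalating."""
    failures = []
    for case in HARD_CASES:
        decision = review_alert(case["alert"])
        if decision != case["expect"]:
            failures.append((case["alert"], decision))
    return failures
```

The specific cases matter less than the posture they test: whether the system recognizes when it does not know enough to decide, which a scripted demo will never show you.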
“Human in the loop” is a phrase that comes up in nearly every conversation, particularly when regulators are involved. But it is often left undefined.
In reality, there are several very different ways this can be implemented, and each carries different implications for risk, governance, and operational design.
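To make the differences concrete, here is a minimal sketch of three oversight patterns that come up in these conversations; the names, thresholds, and routing logic are my own illustration, not a standard taxonomy:

```python
from enum import Enum, auto

class OversightMode(Enum):
    PRE_APPROVAL = auto()    # a human must approve before any action is taken
    SAMPLED_REVIEW = auto()  # actions execute; a fixed share is reviewed afterward
    EXCEPTION_ONLY = auto()  # a human sees only low-confidence or high-risk cases

def requires_human(mode: OversightMode, confidence: float, risk_score: float) -> bool:
    """Decide whether a given agent decision must route to a person first."""
    if mode is OversightMode.PRE_APPROVAL:
        return True   # every decision is gated before it takes effect
    if mode is OversightMode.SAMPLED_REVIEW:
        return False  # review happens post hoc on a sample, not inline
    # EXCEPTION_ONLY: escalate when the agent is unsure or the stakes are high
    return confidence < 0.8 or risk_score > 0.7
```

Each mode answers the “human in the loop” question differently: the first caps throughput, the second shifts risk to after-the-fact detection, and the third depends entirely on how well the system calibrates its own confidence.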
When evaluating a solution, I encourage teams to get very specific about which of these models the vendor actually means and where human judgment enters the process. Clarity here is critical, not just for internal comfort, but for how the program will stand up under scrutiny.
Explainability is often discussed at a technical level, but in compliance, it needs to work operationally.
The question is not whether a model can produce an explanation, but whether that explanation is usable by the people who depend on it: analysts working cases, auditors testing controls, and examiners reviewing the program. In strong implementations, the explanation stands on its own, in the operational language those reviewers already use. If understanding a decision requires technical interpretation, that creates friction and risk downstream.
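As one hypothetical illustration of the difference, consider an explanation structured for the analyst rather than the data scientist; every field name and value here is invented:

```python
from dataclasses import dataclass, field

@dataclass
class DecisionRecord:
    """An explanation an analyst or examiner can use without a data scientist."""
    decision: str                  # e.g., "escalate for SAR review"
    rationale: str                 # plain-language reasoning, not feature weights
    evidence: list[str] = field(default_factory=list)     # source records cited
    policy_refs: list[str] = field(default_factory=list)  # internal policy cited

record = DecisionRecord(
    decision="escalate for SAR review",
    rationale="Five cash deposits just under $10,000 within eight days, "
              "inconsistent with the customer's stated occupation.",
    evidence=["txn-4411", "txn-4417", "txn-4432", "kyc-profile-2024"],
    policy_refs=["AML-OPS-3.2: structuring indicators"],
)
```

Contrast that with a vector of feature importances: technically an explanation, but one that pushes interpretive work onto people who are not equipped, or authorized, to do it.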
There are a few use cases that stand out as sensible starting places for agentic AI: areas with a clear “human-in-the-loop” model, which supports control and evaluation. The common entry points share some traits: they are well-defined, high-volume, and straightforward to check against established procedures. That makes them well-suited for testing performance, building confidence, and establishing governance before expanding further.
As with other critical vendor relationships, evaluating AI is less a one-time decision than a continuous process. This requires coordination across compliance, risk, and technology teams, and a willingness to iterate.
There is no meaningful shortcut here. The diligence is part of the process.
Finally, evaluation cannot be separated from governance. Much of the discussion around agentic AI has focused on governance and oversight, a topic that warrants its own article.
Before deploying any AI capability, organizations should be able to answer a few foundational questions about ownership, accountability, and ongoing oversight. In my experience, the institutions that are most successful are the ones that address these questions early and bring regulators along in the process.
For teams just beginning this journey, I typically recommend a measured, structured approach: start with contained use cases, expand deliberately, document how the tools are configured and tested, and build in ongoing oversight. The importance of documentation and oversight cannot be overstated. Auditors and examiners will scrutinize how you configure, test, and oversee your AI tools. If you cannot demonstrate this, you risk setting back your AI roadmap.
AI has the potential to materially improve how compliance programs operate. That much is clear. What is less clear, and still evolving, is how to separate meaningful capability from superficial progress.
The organizations that will benefit most are not necessarily those that move fastest, but those that evaluate rigorously, test thoughtfully, and build with governance in mind from the outset.
That approach takes more time upfront. In my experience, it more than pays for itself over the long term.
Tune in to the full discussion I had last week with leaders from Unit21, Liminal, and Equifax for a much deeper dive and a wider range of advice.

Guy helps clients launch products, build and improve compliance programs, and navigate bank partnerships.
Prior to joining FS Vector, Guy was a Senior Managing Consultant at Promontory Financial Group, where he advised a range of domestic and international financial institutions on regulatory compliance, with a focus on financial crimes and complex operational transformations. He also has deep experience assisting clients with regulatory remediation strategy.
Guy earned a J.D. from Tulane University Law School.