Fighting financial crime with hybrid AI

I’ve been in the data game long enough to see plenty of AI projects crash and burn.

I started my career building data warehouses for telcos and banks, then moved into machine learning consulting, where I led hundreds of projects across industries. Now I’m leading data analytics and machine learning at Phenom, and I want to share something we recently built that actually works.

Let me be clear about what I mean when I say “Gen AI” here. I’m talking about LLMs and the tools built on top of them. The “old school ML” I’ll reference means those low-complexity supervised models we’ve been using for years, the ones that are fast, cheap, and reliable by nature.

The reality of building AI in fintech

Phenom provides banking solutions for SMEs across Europe, but at our core, we’re a B2B fintech scale-up. Each of these words carries weight.

Being B2B means every single client counts. We can’t mess around with client communications or operations. Everything that touches our clients’ needs to meet a certain standard, no exceptions.

Being a fintech means we love technology, sure, but we’re also bound by regulations. The Financial Crimes Enforcement Network doesn’t care how innovative your solution is if it doesn’t meet compliance standards.

And being a scale-up? That means we can’t afford AI theater. We have some budget for innovation and experimentation, but every investment needs to demonstrate real efficiency gains and positive ROI.

These constraints shaped our entire approach to AI and machine learning at Phenom. We’ve established two fundamental pillars that guide everything we build.

First, we successfully convinced leadership (all the way up to the board) that while AI is nice, having a solid data foundation and platform is even better. When you’re dealing with regulatory reporting or enabling better tactical and strategic business decisions, that foundation matters more than any flashy AI feature.
Second, we developed clear ground rules for when to use which technology. When we need stability and structured signals, we reach for traditional machine learning first. When we’re dealing with messy input data like customer reviews or unstructured text, we consider generative AI.

High-risk scenarios involving financial crime, regulations, or customer care always get hybrid solutions with humans in the loop. Low-risk internal use cases? That’s where we let AI shine and can afford the occasional mistake.

The problem with transaction monitoring

Let me walk you through a specific use case where we applied these principles: financial crime and risk management.

Transaction monitoring follows a pretty standard process. A transaction happens. Rules and models analyze it to generate alerts about potentially suspicious activity, sanctions violations, AML concerns, and fraud indicators.

Some alerts resolve automatically, but most go to financial crime analysts who must manually review them and create a paper trail for regulators.

💡

In major banks like ABN AMRO or Deutsche Bank, thousands of people work in these departments, meticulously reviewing every alert. It’s mandatory work, but the resource requirements are enormous.

We looked at this process and thought: let’s replace as much manual work as possible with an AI agent. We gave this agent three tasks:

Draft regulatory-grade rationales for alert resolution
Research counterparties and gather supporting evidence
Recommend decisions on alert resolution with human-in-the-loop control

Round one: Replicating human logic

We started by trying to replicate a mid-level analyst. We conducted dozens of interviews, diving deep into how these people work, think, and make decisions. What do they look at? What do they ignore? How do they reason through complex cases?

We formalized all the logic they use within their process. The results looked promising, 35% automatic resolution by AI represents huge savings.

But there was a problem…

The head of the department came to us with mixed feelings. “The savings are nice,” she said, “but I’m not satisfied with the quality. I won’t allow this AI to replace my analysts across all use cases.”

When we pointed out that human analysts make similar mistakes (sometimes even more than the AI), she remained firm. “I expect AI to do better than humans in many cases.”

Round two: Thinking beyond human limitations

That feedback triggered a complete rethink of our approach. Instead of just replacing one piece of the process, we needed to consider the entire end-to-end workflow. This led to several insights.

To increase quality, we needed to step outside that single box and move upstream to provide better alerts over time.

More importantly, we shouldn’t limit ourselves to replicating human logic. Robots can employ more sophisticated, robust logic than humans.

💡

We split our AI analyst into two components. First, evidence generation: all the unstructured input that analysts review gets processed by AI and structured into a table with over 100 evidence points. Second, the decision logic based on that evidence now runs through a classical ML model (specifically XGBoost) rather than an LLM.

This approach lets us feed the model not just what analysts see, but all available data within the process. The results exceeded our expectations.

The power of structured evidence extraction

The evidence extraction process deserves more detail. We inherited seven blocks of features from our analysts’ workflow. They check whether company names look legitimate, which gives us four features right there.

They verify if counterparty names match payment descriptions. If a counterparty is “Printing LLC” but the payment says “office supplies,” something’s fishy.

💡

Overall, we started with about 30 text-based features extracted by the LLM. These features became inputs for our XGBoost model, alongside the traditional structured data like transaction counts, dates, times, amounts, and historical patterns with specific counterparties.

The beauty of this approach is flexibility. When an analyst comes up with a new way to detect fraudulent behavior, we don’t need to understand how they’d integrate it into their decision-making process.

We just feed it into the system and let XGBoost figure out the optimal weighting.

Why hybrid beats pure AI

You might wonder why we didn’t just use a more sophisticated LLM for everything. There are several reasons.

Speed matters. When a transaction comes in, we need to make our initial classification (whether to fire an alert, within a second or faster. We can’t process extensive unstructured data at that speed. But once an alert fires, we have time for thorough LLM-based evidence extraction.

Reliability matters too. When you try to encode complex logic into LLM prompts, you end up with pages of “if this, then that” rules. LLMs don’t follow these instructions precisely every time. There’s always a margin of error when trying to formalize human logic, and you can never capture 100% of what humans do.

Testability matters most. Classical machine learning models come with established evaluation processes, accuracy, F1 scores, precision, and recall. You can’t skip these fundamentals just because you’re using fancy new technology.

Lessons from the trenches

Several key lessons emerged from this project that apply broadly to AI implementation in regulated industries.

Start with existing processes. People ask how to identify where to implement AI. In our world, it’s simple: there needs to be an existing process first. Without a baseline process, you’re taking a leap of faith with no way to measure improvement.
Mix your metrics. Global end-to-end process metrics combined with local metrics for specific process components create a complete picture. They augment each other and provide real insight into system performance.
Test human performance first. We should have started by measuring the error rate of human analysts. Only after establishing that baseline should we have asked leadership about their tolerance for AI errors. Surprisingly, expectations for AI accuracy often exceed those for human performance.
Plan for human imperfection. Humans make mistakes, and that’s inevitable. The key insight is that we can patch human behavior with classical ML models to make reasoning more solid and reliable.

The unexpected benefits

The technical improvements were just part of the story. We discovered several organizational benefits that made the project truly successful.

💡

We reduced time-to-market for new evidence types dramatically. Analysts can propose new fraud detection methods without us needing to understand their exact reasoning process. We simply add it to the evidence generation pipeline.

More importantly, we achieved a holistic evaluation. Instead of measuring quality in isolated process boxes, we now assess the entire financial crime resolution process end-to-end.

Team dynamics improved significantly. Previously, the AI team, ML team, and analysts worked in silos. Now they share overarching metrics influenced by all their contributions. They see how their work affects each other and collaborate to improve the overall process. There’s still some healthy competition, but it’s collaborative rather than adversarial.

The path forward

This project taught us that successful AI implementation in financial services isn’t about replacing humans with machines. It’s about augmenting human capabilities while respecting the constraints of our industry.

We’re not trying to build AGI here. We’re building practical systems that make our analysts more effective, our processes more efficient, and our compliance more robust.

The hybrid approach, combining the pattern recognition of traditional ML with the language understanding of LLMs, gives us the best of both worlds.

For those of you working on similar challenges, remember that the flashiest solution isn’t always the best one.

Sometimes, a well-tuned XGBoost model processing structured evidence beats a complex LLM trying to replicate human reasoning. Sometimes human-in-the-loop processes are features, not bugs.

The key is understanding your constraints, measuring what matters, and building systems that enhance rather than replace human judgment. In the world of financial crime detection, that balanced approach makes all the difference.