Why all ML and AI use cases need standardized testing

Nick Payton
October 24, 2024

What do AI leaders really think about AI today? We’ve had over 1,000 hours of conversations with AI leaders at more than 100 Fortune 500 enterprises. This is the second post in a three-part series summarizing lessons from these conversations, in which we’ll cover why leaders today believe that all AI/ML components—not just generative AI—need deeper testing. Read the first post on what makes AI so hard to test here.

Before jumping in, it’s worth noting that I’m the CRO here at Distributional and the only one of our 25 team members without a technical degree. My take is therefore intended to be more accessible, less technical and, hopefully, a quick read. If you want to dive deeper, let’s find time to talk.

Traditional AI/ML is here to stay

Although going from zero to one on Generative AI is the top priority for nearly every company we meet, traditional AI and machine learning are here to stay. Generative AI will certainly replace traditional AI/ML for certain tasks, but more often we see it being used to expand the variety of tasks that AI handles or to augment specific parts of existing AI/ML applications.

Machine learning is particularly well suited to many of the tasks it handles today, especially in contexts where stability and explainability are important. Time series, tabular data, and even some image and vision tasks will continue to be handled on their own by more traditional machine learning or deep learning models. In short, ML isn't going anywhere.

But these use cases are relatively mature, so a fair question is: do they need better testing methods? From our discussions with dozens of AI/ML team leaders, the answer is a resounding yes. Let’s dig into a few reasons why this is the case. 

Ongoing AI/ML testing challenges

Visibility into process

Today, many AI/ML teams lack a clear way to understand application behavior on a global scale. “We have hundreds of models in production, but we don’t have a way to visualize how their behavior is shifting over time in a standard way, let alone in a single place,” said an AI product leader at an automotive company. “And this exposes us to outsized risk.”

Even though AI/ML has been in production at scale for a decade at some companies, teams may still struggle to get visibility into the behavior of these applications.

To help with this, Distributional was designed to make it easy to unify how all behavioral metrics are collected, analyzed, tested and reported, and to do so in a customizable way, so the dashboard you see contains only the information, naming, and conventions that make sense to your team.

Catching real issues

All AI/ML applications have some degree of non-determinism and non-stationarity. Non-stationarity in particular is amplified in production, where both the application and how it is used can shift at the same time, making it hard to identify real issues and isolate where they are coming from.

To catch these issues, some teams monitor thresholds on summary statistics for their applications, but summary statistics can hide behavioral problems lurking in the broader distribution of usage data. Others only look at specific, recent windows of time, causing them to miss more subtle long-term shifts in behavior that may be worrisome.
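To make the first failure mode concrete, here is a minimal sketch (using synthetic data and SciPy's two-sample Kolmogorov-Smirnov test, not Distributional's product) of how a threshold on a mean can stay green while the underlying distribution of behavior has clearly shifted:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Baseline week: model scores centered at 0.5 with moderate spread.
baseline = rng.normal(loc=0.5, scale=0.10, size=5_000)

# Current week: same mean, but usage has split into two very different modes.
current = np.concatenate([
    rng.normal(loc=0.3, scale=0.05, size=2_500),
    rng.normal(loc=0.7, scale=0.05, size=2_500),
])

# A threshold on the summary statistic sees nothing wrong.
mean_shift = abs(current.mean() - baseline.mean())
print(f"mean shift: {mean_shift:.3f}")  # roughly 0.00, far below a typical alert threshold

# A statistical test on the full distributions flags the change immediately.
result = stats.ks_2samp(baseline, current)
print(f"KS statistic: {result.statistic:.3f}, p-value: {result.pvalue:.2e}")

The averages match, so a dashboard tracking only the mean stays quiet, even though the population of scores has split into two distinct modes that a distributional test catches right away.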

Distributional is designed to address these challenges with statistical tests on distributions of data and dynamic baselines that make it easy to explore multiple time windows for production AI systems. The goal is to make sure nothing stays hidden in your data: to test it fully and catch the issues lurking in an AI application in production.

Avoiding false alarms

Too many false alerts can make it impossible to identify which ones are real, and make it even harder to rally the team to address them. “We get so many alerts that we just don’t pay any attention to them anymore,” said an AI product manager at a large financial services company. “And it is hard to get the development team to take action on these issues if we don’t have a way to dynamically calibrate these tests or thresholds to fit the application, and show the development team the approach we took to do so.” 

AI/ML teams need new testing methods that help them find the signal within the noise. Distributional fills this gap with a workflow these teams can use to adapt their tests: they can automatically recalibrate tests at scale over time using reinforcement learning, and they can do deep root cause analysis directly within the same platform to analyze the specific data that is causing an issue.
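To illustrate what calibrating tests to the application (rather than relying on a fixed, hand-picked cutoff) can look like, here is a hedged sketch that sets an alert threshold from the empirical quantiles of recent test statistics so the alert rate stays roughly constant as normal behavior drifts. This is a generic recalibration pattern for illustration only, not a description of Distributional's reinforcement learning approach:

import numpy as np

def recalibrated_threshold(recent_statistics, target_alert_rate=0.01):
    """Pick the alert threshold from recent history so that, on data resembling
    that history, roughly target_alert_rate of runs would trigger an alert."""
    return float(np.quantile(recent_statistics, 1.0 - target_alert_rate))

# Drift statistics from the last 90 daily test runs (synthetic for this sketch).
rng = np.random.default_rng(1)
history = rng.gamma(shape=2.0, scale=0.05, size=90)

threshold = recalibrated_threshold(history, target_alert_rate=0.01)
todays_statistic = 0.42

if todays_statistic > threshold:
    print(f"alert: {todays_statistic:.2f} exceeds calibrated threshold {threshold:.2f}")
else:
    print(f"ok: {todays_statistic:.2f} is within calibrated threshold {threshold:.2f}")

Because the threshold is re-derived from the application's own recent behavior, alerts track genuine departures from that behavior instead of firing whenever a one-size-fits-all cutoff is crossed.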

Results analysis

When AI is more mature and operating at a significant scale, this scale itself can become an issue. How do you do root cause analysis when you “stare at a wall of numbers,” as the technology leader at an automotive company recently told us? 

There are three things teams need to address this. First, they need custom dashboards designed to help them understand the status of their unique AI applications. Then they need to know which test results point to the cause of the issue. Finally, they need to be able to tag tests and filter data so they can explore various segments and determine what is actually causing a particular issue. All of this helps these teams understand issues with their AI applications, triage them, and take appropriate action to resolve them before they create deeper problems in production.
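As a toy picture of that tag-and-filter step (with hypothetical column names, not Distributional's actual schema), a table of test results tagged by model, segment, and test type can be sliced to show which segment is driving the failures:

import pandas as pd

# Hypothetical test results, one row per test run, tagged for later filtering.
results = pd.DataFrame([
    {"model": "churn_v3", "segment": "new_customers",      "test": "score_drift", "passed": False},
    {"model": "churn_v3", "segment": "existing_customers", "test": "score_drift", "passed": True},
    {"model": "churn_v3", "segment": "new_customers",      "test": "null_rate",   "passed": False},
    {"model": "fraud_v8", "segment": "card_present",       "test": "score_drift", "passed": True},
])

# Failure rate by model and segment points at where to start root cause analysis.
failure_rate = (
    results.assign(failed=~results["passed"])
           .groupby(["model", "segment"])["failed"]
           .mean()
           .sort_values(ascending=False)
)
print(failure_rate)

Instead of staring at a wall of numbers, the team sees immediately that failures concentrate in one model and one segment, and can dig into that slice first.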

Cross-team workflow

Large teams collaborating on ML/AI naturally run into complexity due to the scale of the operation. How do you get everyone on the same page? “We have one team that productionalizes models and houses all of the data, and another team who develops the ML models,” said an AI engineering leader at a large financial services company. “So when there is an issue, the team that productionalizes sends data to the team that developed the model, but then this development team lacks context to analyze the actual issue.”

To solve this, you need to get multiple teams on the same page, working off of the same information and with the same workflow to calibrate tests or resolve issues. As a bonus, doing this well builds better relationships between teams, which can have a large halo effect on the broader organization. A better workflow starts with standardizing how tests are created, run, tracked and triaged, which requires a software platform, not a smattering of information across notebooks, docs, sheets and reports.
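One lightweight way to picture that standardization (an illustrative sketch only, not Distributional's API) is a shared, declarative test definition that both the production team and the development team create, review, and track against, rather than thresholds buried in someone's notebook:

from dataclasses import dataclass, field

@dataclass
class BehaviorTest:
    """A shared definition of one behavioral test, owned jointly by the teams."""
    name: str
    model: str
    metric: str              # e.g. "score_drift" or "null_rate"
    comparison: str          # e.g. "less_than"
    threshold: float
    owners: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)

tests = [
    BehaviorTest(
        name="churn score drift stays small",
        model="churn_v3",
        metric="score_drift",
        comparison="less_than",
        threshold=0.15,
        owners=["ml-platform", "churn-modeling"],
        tags=["production", "weekly"],
    ),
]

Because the definition lives in one place, the team that runs the models and the team that built them are looking at the same test, with the same owners, tags and thresholds, rather than debating whose spreadsheet is current.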

Reporting

Whether related to product efficacy, internal standards, or external regulations, governance is critical to company-wide adoption of AI. For GenAI, this entails designing a new process. But even for legacy AI/ML applications, there is often a meaningful gap in the information available to governance teams who need to analyze, resolve, and evolve AI test suites and results. An AI leader at a regulated company shared, “We send a point-in-time email with a set of daily eval results to our leadership team, but we need a way to do this systematically, continuously and with greater depth.” Another leader of continuous integration for traditional ML commented, “I need better ways to report out on lineage across tests and test results for internal and external compliance purposes.”

AI testing isn't enabled by a better framework alone. It needs to include a full audit trail of issues identified and actions taken, and these tests need to connect to a dashboard of test results that can be analyzed and shared with the various parties involved in AI governance. Putting all of these pieces together in a single software solution is what enables better AI decisions, reduces AI risk and allows for more sustainable AI governance.
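As a simple picture of what that audit trail could capture (illustrative field names only, not Distributional's format), each test run and the action taken on it can be written as an append-only record that links the result back to the test version and the data it ran against:

import datetime
import json

def audit_record(test_name, test_version, dataset_id, result, action, actor):
    """Build one append-only audit entry linking a test result to the action taken."""
    return {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "test_name": test_name,
        "test_version": test_version,
        "dataset_id": dataset_id,
        "result": result,     # e.g. "passed" or "failed"
        "action": action,     # e.g. "recalibrated threshold" or "escalated to modeling team"
        "actor": actor,
    }

with open("audit_log.jsonl", "a") as log:
    log.write(json.dumps(audit_record(
        test_name="churn score drift stays small",
        test_version="v4",
        dataset_id="prod-scores-2024-10-21",
        result="failed",
        action="escalated to modeling team",
        actor="ml-platform",
    )) + "\n")

A log like this gives governance teams the lineage they ask for: for any point-in-time decision, they can trace which test fired, on which data, and what was done about it.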

It’s time for a testing upgrade

Generative AI is all the rage, but traditional AI and ML are here to stay. These types of AI applications are often chained together in a single application pipeline, so testing solutions need to cover all application types, not just generative AI.

In summary, despite a decade of at-scale use, more traditional AI/ML applications still need a testing upgrade. Metrics are often computed and displayed by individual data scientists, with no cross-cutting visibility into behavior in a unified dashboard. False negatives are often hidden in data you already have. A high volume of false positives is hard to handle without a workflow that addresses them. Results can be hard to parse with current tooling, especially when operating at significant scale. Multiple teams are usually involved in the AI software lifecycle, and it is hard for them to work productively together without a software platform that standardizes their workflow. And being able to track lineage and report on it is often just as important as resolving any given point-in-time issue.

These challenges are hard, and they deserve a better approach to AI testing. If you are interested in learning more, let's find time to talk.
