Why testing is an enterprise problem that requires an enterprise solution
What do AI leaders really think about AI today? We’ve had over 1,000 hours of conversations with AI leaders at more than 100 Fortune 500 enterprises. This is the third post in a three-part series summarizing lessons from these conversations, focused on why AI testing is an enterprise problem that requires an enterprise solution. Read the first two posts here and here.
Before jumping in, it’s worth noting that I’m the CRO here at Distributional and the only one of our 25 team members without a technical degree (the team collectively holds more than 35 of them). My take is therefore intended to be more accessible, less technical, and, hopefully, a quick read. If you want to dive deeper, let’s find time to talk.
Enterprise versus developer products
Even the most established enterprises typically give AI development teams a lot of latitude in tool selection while they’re iterating or prototyping in the model and application development phase. This includes evals, where teams typically start from open source benchmarks and evolve them with their own domain expertise to define, and then reach, a level of performance that fits their needs. The goal at this stage is to minimize friction and constraints in a relatively free-form research process.
AI testing, however, requires a very different approach. The goal of testing is to enable teams to confidently define a steady state for any AI application, confirm that the application still meets this definition, and, where it deviates, figure out what needs to evolve or be fixed to return to steady state. This process needs to be discoverable, logged, organized, consistent, integrated, and scalable. It is an enterprise problem that needs an enterprise solution. It requires that you start standardizing tests as checks in development, evolve them into a standard suite in deployment, and expand on them to cover scenarios where multiple components are shifting in production.
Below are a few more specific challenges related to AI testing that lend themselves to an enterprise solution.
Enterprise testing needs
Production-grade consistency
There needs to be a standard way to collect testing data, augment this data, define tests, evaluate results, validate actions, and give visibility into the testing process in every phase of the AI/ML software lifecycle. And these tests need to study behavioral properties across all components of every AI application. This level of depth and continuity gives enterprises visibility into AI app behavior, which translates to confidence in productionalizing these apps. And this process starts at the very beginning of the application lifecycle. As one ML platform product leader in financial services explained, “I need a set of pre-deployment checks that I run consistently to know what to expect with each AI app in production. But this process of defining the tests needs to start in development and evolve in production.”
Multi-component visibility
It’s possible to use free or open source libraries to run one-off evals that give you point-in-time confidence in a given AI or ML model. But AI applications typically aren’t a single model; rather, they’re a series of components that may be constantly shifting in different directions. A single developer rarely controls the whole pipeline, so a developer tool is a poor fit for gaining visibility into what is actually wrong across the full pipeline. Instead, teams need an enterprise solution capable of standardizing how tests are run for every component of an AI application, so teams know where the issue is.
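To make this concrete, below is a minimal sketch of what a standardized, multi-component check might look like. Everything in it is hypothetical (the component names, the single latency check, the steady-state bound); it illustrates the idea of running the same test against every component of a pipeline, not Distributional’s actual API.

```python
# Hypothetical sketch: run one standardized check against every component of an
# AI application so a failure points at a component, not just at the app.
from dataclasses import dataclass

@dataclass
class TestResult:
    component: str
    test: str
    passed: bool
    detail: str

def check_latency_p95(samples_ms, bound_ms=2000):
    """Does 95th-percentile latency stay within the agreed steady-state bound?"""
    ordered = sorted(samples_ms)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    return p95 <= bound_ms, f"p95={p95}ms (bound {bound_ms}ms)"

def run_suite(components):
    """Apply the same check to each component and collect per-component results."""
    results = []
    for name, samples in components.items():
        passed, detail = check_latency_p95(samples)
        results.append(TestResult(name, "latency_p95", passed, detail))
    return results

# An AI app is rarely a single model: test retriever, reranker, and LLM alike.
app_latencies_ms = {
    "retriever": [120, 140, 180, 210, 950],
    "reranker": [80, 85, 90, 95, 110],
    "llm": [1200, 1500, 1800, 2100, 3400],
}
for r in run_suite(app_latencies_ms):
    print(f"[{'PASS' if r.passed else 'FAIL'}] {r.component}/{r.test}: {r.detail}")
```

A real suite would run many more checks than one latency bound, but the shape is the same: consistent tests, applied per component, producing results that say exactly where the issue is.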
Multi-team usability
There are multiple users with diverse needs in enterprise AI testing. AI development, engineering, product, and governance teams may all need to run tests, but each has different workflows and goals associated with testing. Sometimes the goal is complete configurability, so an AI developer can customize the experience to their domain. Other times, it’s automating metric computation and test configuration so users can consume test results at a more abstract level. Any enterprise platform needs to support this diverse set of users and use cases. Distributional is built to do this. An AI product leader at a large consumer technology company told us, “I love how your platform empowers me to explore my applications without needing to get in the weeds of selecting the right tests or even defining the full set of metrics.”
Integration
Testing solutions aren’t libraries that individual developers spin up and tear down when they have what they need. They need to be fully integrated into the enterprise software stack so they can enable easy access to data and trigger actions/alerts when tests pass or fail. And it is important to enable this in three ways:
- First, build primitives in the product that make it easy to integrate into other tools regardless of the enterprise stack—as each enterprise will have a unique collection of systems and tools.
- Second, invest in lightweight integrations that are more native for some of the most widely used infrastructure to make it even easier to get started in these cases.
- Third, provide implementation services that meet the needs of custom configurations. Each enterprise will have nuanced integration needs, but it is critical that testing is fully integrated to deliver the most value.
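As a hedged illustration of the first point above, an integration primitive can be as plain as a JSON webhook, so test failures flow into whatever alerting stack the enterprise already runs. The function, payload shape, and endpoint below are all hypothetical, not Distributional’s API.

```python
# Hypothetical sketch: push failing test results into an existing alerting tool
# (anything that accepts a JSON POST, e.g. a Slack or PagerDuty-style webhook).
import json
import urllib.request

def alert_on_failures(results, webhook_url):
    """POST a summary of failing tests; stay quiet if everything passed."""
    failures = [r for r in results if not r["passed"]]
    if not failures:
        return  # the app still matches its steady-state definition
    payload = {
        "text": f"{len(failures)} AI test(s) failed",
        "details": [f"{r['component']}/{r['test']}: {r['detail']}" for r in failures],
    }
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)  # fire the alert into the existing stack

# Usage (endpoint is hypothetical):
# alert_on_failures(
#     [{"component": "llm", "test": "latency_p95", "passed": False, "detail": "p95=2100ms"}],
#     "https://hooks.example.com/ai-testing",
# )
```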
Lineage
After tests have failed and the AI app has been debugged and re-productionalized, the various teams responsible for the AI application or involved in AI governance need to be able to see what happened and why. This information can’t be isolated in a developer notebook or trashed once the fix is made. There needs to be both persistence and provenance in this audit trail, along with a way to easily report on it to these teams. This matters because teams need to know what was done in order to feel confident re-deploying the AI application.
It’s also important for consistently managing reputational, regulatory, and operational risk. Teams don’t want to end up on the front page of the Journal for a chatbot going rogue, much less find themselves on the wrong end of an eight-figure AI error.
The enterprise platform for AI testing
Enterprises need testing that is standard for AI applications in production. They need it to cut across all AI components to give visibility into what is actually causing an issue, not one-off evaluations of single models. They need teammates with varying appetites for diving deep to be able to use the testing platform or consume information from it. They need the testing platform to be fully integrated with data sources, CI/CD pipelines, and alerts. And they need lineage on what has happened so they can audit the process and report on it to multiple constituencies.
In short, testing is an enterprise problem that requires an enterprise solution. This is what we’ve built at Distributional, and I’d be happy to show you how it would work for you.
If you are interested in learning more, here are a few ways: