Why generative AI creates unique testing challenges

Nick Payton
October 18, 2024

What do AI leaders really think about AI today? In a three-part series, we’re sharing the insights we’ve gleaned from over 1,000 hours of conversations with AI leaders at more than 100 Fortune 500 enterprises. This first post delineates the unique challenges companies face when they deploy generative AI, as compared with traditional software. The second covers why leaders today believe that all AI/ML components—not just generative AI—need deeper testing. And the final post focuses on why AI testing is an enterprise problem in need of an enterprise solution. 

Before jumping in, it’s worth noting that I’m the CRO here at Distributional and the only one of our 25 team members without a technical degree. My take is therefore intended to be more accessible, less technical and, hopefully, a quick read. If you want to dive deeper, let’s find time to talk.

Generative AI is on the rise

Generative AI (GenAI) has real potential to transform the enterprise. Whether building agents, copilots or apps designed for specific tasks, teams have started to deploy large language models (LLMs) for both internal and external use cases.

This opportunity is well established, but so are the challenges. Teams that have deployed GenAI struggle to detect and mitigate undesired behavior, resulting in hallucinations, incorrectness or an unreliable customer experience in production, among other issues. Other teams have a long backlog of applications they want to deploy, but lack confidence in how those applications will behave or the capacity to satisfy governance requirements, so these use cases are withering on the vine.

No matter where you fall on this spectrum, a more complete approach to AI testing is one of the critical components of addressing these issues, much as traditional testing is for traditional software. But testing can mean different things to different people, so we spoke with over a hundred CIOs, CTOs, Chief AI Officers, VPs of AI, Directors of AI engineering and AI product managers to understand how they define AI testing. Digging deeper has yielded a few insights into how they think about this challenge, along with some of the potential solutions.

Challenges with testing AI

Non-determinism

Non-determinism means that the same input can yield many different outputs. For example, you might prompt an LLM with “What are your best recommendations for Florence?” and get a cooking class, a walking tour, or something else entirely each time you ask.

The variance in LLM responses is one of the characteristics that make them such powerful tools; it is ultimately a desirable trait. But it can also make them hard to properly test and evaluate. Distributional tackles this problem by testing distributions of many inputs and outputs rather than single usages of an application. Any given usage of an LLM should vary, but the distribution of usages should behave roughly the same, so tests need to run on these distributions, which is exactly what Distributional was designed to do. Many companies still face this challenge today. Prior to trying Distributional, a CTO at a data services company told us, “Non-deterministic apps need to be continuously checked to ensure they’re at a steady state based on our use case, and we lack solutions to do this well today.”
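
To make this concrete, here is a minimal Python sketch of what distribution-level testing can look like in general terms. It is illustrative only, not Distributional’s product API: call_llm is a hypothetical stand-in for the application under test, and response length is just one simple property you might track.

```python
# Illustrative sketch of distribution-level testing; not Distributional's API.
import random
from scipy.stats import ks_2samp

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real, non-deterministic LLM call.
    options = [
        "Take a cooking class in the Oltrarno.",
        "Join a walking tour of the historic center.",
        "Visit the Uffizi early in the morning.",
    ]
    return random.choice(options)

def response_lengths(prompt: str, n_samples: int = 100) -> list[int]:
    # Sample the application many times and record one property per output.
    return [len(call_llm(prompt).split()) for _ in range(n_samples)]

def shifted(baseline: list[int], current: list[int], alpha: float = 0.01) -> bool:
    # Flag a change only if the two samples are unlikely to share a distribution.
    _, p_value = ks_2samp(baseline, current)
    return p_value < alpha

prompt = "What are your best recommendations for Florence?"
baseline = response_lengths(prompt)  # collected when the app was released
current = response_lengths(prompt)   # collected later, e.g. after an upstream change
print("shift detected:", shifted(baseline, current))
```

Any individual call still varies, but the test only fails when the distribution of outputs drifts away from the baseline.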

Non-stationarity

Most AI applications, whether GenAI or traditional ML, include non-stationary components, and this is particularly pervasive for LLMs. The entire application could be managed by a third-party vendor. The LLM powering an application could be a third-party-managed API. There could be evolving datasets supporting RAG applications or driving financial risk scoring. The underlying infrastructure running these applications may shift underneath you as teams update their models. Any of these upstream shifts can change the behavior of an application that relies on them, even if nothing in the application itself changed. All of these are instances of non-stationarity, and they require the data from these AI/ML applications, including their upstream components, to be continuously and adaptively tested over time.

Defining behavior

“Many GenAI tasks are subjective, so how do you measure its behavior? Are judge metrics reliable? Do you automatically compute these off of raw text?” asked an ML engineer at a large consumer technology company recently. Most conversations we’ve had with our customers include some variation of these questions. GenAI is so early that most teams are still parsing which metrics best represent model performance, let alone behavior. To streamline this process, teams use Distributional to automatically derive many performance and behavioral properties from their text data, adding a slew of information about model behavior that they can test alongside any custom evals they’ve developed. This creates a rich representation of behavior from potentially limited application data. Our platform then automatically recommends tests and adaptively calibrates them to fit each unique AI application, ensuring this behavior doesn’t deviate over time.
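
As a rough illustration of what deriving properties from raw text can mean, the short Python sketch below computes a handful of simple behavioral properties for a single response. The specific properties are assumptions chosen for the example, not Distributional’s built-in metric set; in practice each property becomes a distribution across many responses that can be tested for drift alongside your custom evals.

```python
# Illustrative sketch of deriving behavioral properties from raw text outputs.
# These example properties are assumptions, not Distributional's metric set.
import re

def text_properties(response: str) -> dict[str, float]:
    words = response.split()
    return {
        "word_count": float(len(words)),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "question_marks": float(response.count("?")),
        "contains_refusal": float(bool(re.search(r"\b(cannot|can't|unable to)\b", response, re.I))),
        "numeric_tokens": float(sum(w.strip(".,").isdigit() for w in words)),
    }

print(text_properties("I can't book the hotel for you, but here are 3 neighborhoods to consider."))
```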

Pipelines

AI applications rarely exist in isolation. They depend on upstream components that produce features, third-party packages that may need to be upgraded, third-party data sources that shift over time, and third-party APIs, such as hosted LLM endpoints. There are typically multiple non-stationary components in any given AI or ML pipeline, and if non-stationarity or non-determinism exists anywhere, it propagates through the system. AI leaders are aware of this issue. One AI leader at a consumer electronics company told us, “I run evals that give us a good sense of how our core app performs today, but we rely on upstream dependencies that shift over time and need a way to standardize how tests are run across each of these components as well.” The entire computational graph representing an application needs to be tested in unison to trace any issue or behavioral change back to its origin.
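
The sketch below shows one way this per-component view could work, assuming you already log a simple property distribution for each stage of the pipeline. The stage names and numbers are hypothetical; the point is that comparing every stage against its own baseline localizes the change to the retriever rather than the LLM.

```python
# Illustrative sketch of per-stage pipeline testing; stage names and data are
# hypothetical and chosen only to show how a shift gets localized.
from scipy.stats import ks_2samp

def stage_shifts(baseline: dict[str, list[float]],
                 current: dict[str, list[float]],
                 alpha: float = 0.01) -> dict[str, bool]:
    # Compare every stage's property distribution against its own baseline.
    return {
        stage: ks_2samp(baseline[stage], current[stage]).pvalue < alpha
        for stage in baseline
    }

baseline = {
    "retriever_doc_count": [4, 5, 5, 4, 6, 5, 5, 4],
    "llm_word_count": [120, 131, 118, 140, 125, 133, 129, 122],
}
current = {
    "retriever_doc_count": [1, 2, 1, 2, 1, 1, 2, 1],  # upstream index changed
    "llm_word_count": [119, 128, 135, 124, 130, 126, 121, 132],
}
print(stage_shifts(baseline, current))  # flags the retriever, not the LLM
```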

Democratization

Growing use of pre-trained models—and LLMs in particular—makes it easier than ever for anyone to develop their own AI applications. Yet a lack of confidence is holding companies back from reaping the full benefits. One CTO of a large insurance company told us, “This is the first time in my life where we’re getting pull from business lines to use a specific technology. But unless I standardize testing and other operations around LLMs, we’re not in a position to enable any of these use cases.”

Leaders want to empower teams to unlock LLM use cases for their business lines, but they also know that doing so without proper checks risks reputational, operational or regulatory harm. There is a huge opportunity to democratize AI development, but it must come with standardized enterprise tooling that brings repeatability, consistency and visibility to the process. The goal is to allow flexibility in what gets built, and then to verify through testing that it meets the standards the company sets.

Opportunity cost

The challenge of building reliable AI today comes with a massive opportunity cost for companies. A VP of AI at a financial technology company told me, “With traditional ML, development took the vast majority of my team’s time. Now with LLMs, testing and evaluation takes 5-10x as much time as development.” Similarly, the CPO of a large enterprise technology company shared, “I know we need better testing for our GenAI, but we are prioritizing building new revenue-generating AI features instead.”

In all cases, teams need a workflow that automates testing and validation of their applications so this step takes less of their valuable time. This is why Distributional has prioritized automating the entire process in our product, from data collection to augmentation to testing to adaptive recalibration, so AI teams can quickly build confidence in their AI applications.

A challenge that Distributional solves

In summary, generative AI represents a large opportunity for most companies, but it also carries significant risk. It is prone to non-determinism that traditional software testing can’t account for. It often includes many non-stationary components over which teams have varying levels of control. It is also so new that there isn’t a standard set of metrics teams can align on. LLMs are often chained together or embedded in pipelines with other ML components, so any non-determinism or non-stationarity propagates through the entire system. With pre-trained LLMs and LLM API endpoints, it’s easier than ever for a business line to develop an AI app, but also easier than ever to ship one without a clear picture of how it behaves. And proper testing is both time-intensive and orthogonal to an AI team’s daily responsibilities. Distributional is building the solution to all of these challenges.

