
Distributional Simplifies Adaptive Testing with Similarity Index and Key Insights

Written by Jaron Parnala

While seemingly everyone is experimenting with AI, only a few have been able to bridge the AI confidence gap to reach production and handle real-world usage. And even fewer are confident enough to tackle the most valuable, and riskiest, AI use cases. We believe adaptive testing is the key. But why is it that current approaches to testing are breaking down?

It’s not that teams aren’t doing testing. In fact, nearly all enterprise teams that we’ve met with have implemented some form of testing, especially during the development of their AI applications. But this testing tends to be incomplete, static, and hard to act upon. For example, teams may start by testing different prompts and comparing individual responses one by one. The best-performing responses in turn get annotated and become a golden dataset, and this golden dataset is used to test different models and select the best-performing one.

Once in production, teams may also start tracking some aggregate summary statistics, as well as user feedback on the quality of responses. These tests are run periodically on a subset of responses just to be sure, until response performance degrades and these teams are forced back to square one to figure out what’s wrong, resulting in entirely new research and development cycles to try to get performance back on track. For AI applications, testing needs to go beyond simply understanding performance and provide an understanding of behavior as a whole.

“GenAI has created unique challenges that aren’t well handled by existing MLOps platforms and workflows. Most notably, enterprises seek ways to ensure that GenAI systems behave as expected and don’t introduce unpredictable behavior that could result in reputational harm, poor user experiences, or costly business consequences,” says Sam Charrington, Creator and Host of the TWIML AI Podcast. “New approaches to system profiling and testing are required to meet this need, and Distributional’s adaptive testing solution is purpose-built to solve this for enterprise teams.”

Helping these teams understand the full behavior of their AI apps so they can deploy with confidence is why we created Distributional, the first enterprise platform for adaptive testing. Using Distributional’s platform, customers are able to maximize the uptime of AI apps, resolve issues faster, and unlock the development of higher-value applications, all while minimizing risk to the business with the confidence that these applications are behaving, and will continue to behave, as desired.

Today, we’re excited to introduce some new capabilities to help enterprise teams more easily implement adaptive testing. But let’s first take a deeper look at the adaptive testing workflow to better understand where these features will fit in.   

Adaptive testing workflow

Distributional's adaptive testing workflow

Distributional’s platform is focused on providing a simple and automated workflow to help AI teams across enterprises continuously define, understand, and improve AI application behavior. 

Define

Distributional first helps teams define the desired behavior for their applications. The platform automatically creates a behavioral fingerprint using the app’s runtime logs as well as any existing development metrics, and it generates associated tests to detect changes in that behavior over time. Teams can also use this behavioral fingerprint to specify behaviors they do or do not want the application to exhibit at a statistical level.
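
To make the idea of a behavioral fingerprint concrete, here is a minimal sketch in Python. The log schema and metric names are assumptions for illustration only; this is not how Distributional’s platform is actually implemented.

```python
# Hypothetical sketch (not Distributional's actual implementation): summarize an
# app's runtime logs into a per-metric behavioral "fingerprint" that later tests
# can compare against.
from dataclasses import dataclass

import numpy as np


@dataclass
class MetricFingerprint:
    name: str
    mean: float
    std: float
    quantiles: np.ndarray  # 5th/25th/50th/75th/95th percentiles


def fingerprint(log_records: list[dict[str, float]]) -> dict[str, MetricFingerprint]:
    """Summarize each numeric metric found in the runtime logs."""
    by_metric: dict[str, list[float]] = {}
    for record in log_records:
        for name, value in record.items():
            by_metric.setdefault(name, []).append(value)
    return {
        name: MetricFingerprint(
            name=name,
            mean=float(np.mean(values)),
            std=float(np.std(values)),
            quantiles=np.percentile(values, [5, 25, 50, 75, 95]),
        )
        for name, values in by_metric.items()
    }


# Example: logs carrying per-response metrics such as response length and readability.
baseline_fingerprint = fingerprint([
    {"response_length": 412, "readability": 62.1},
    {"response_length": 388, "readability": 58.9},
    {"response_length": 431, "readability": 60.4},
])
```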

Understand

Once that behavior is defined, teams are able to use the platform to understand changes in behavior and deviations from desired behavior as these apps are being used in production. They get alerted when there are changes to app behavior, understand what is changing, and pinpoint, at any level of depth, what is causing the change so they can quickly take appropriate action.
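
As a rough illustration of this kind of change detection (standing in for the platform’s own methods, which are not shown here), a two-sample statistical test can flag when a metric’s production distribution no longer matches its baseline:

```python
# Illustrative sketch of detecting a behavioral deviation in production: compare a
# metric's production window against its baseline with a two-sample KS test.
# This stands in for the platform's own change detection, which is not shown here.
import numpy as np
from scipy.stats import ks_2samp


def has_changed(baseline: np.ndarray, production: np.ndarray, alpha: float = 0.01) -> bool:
    """Return True if the production distribution differs significantly from baseline."""
    _, p_value = ks_2samp(baseline, production)
    return p_value < alpha


rng = np.random.default_rng(0)
baseline_lengths = rng.normal(loc=400, scale=40, size=500)    # baseline response lengths
production_lengths = rng.normal(loc=470, scale=40, size=500)  # distribution drifted right

if has_changed(baseline_lengths, production_lengths):
    print("Response length distribution has shifted; alert the team.")
```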

Improve

Finally, these teams continuously improve both their tests and their app based on any changes they observe. By easily adding, removing, or recalibrating tests over time, teams now have a dynamic and accurate representation of the desired state against which to test new models, roll out new upgrades, or accelerate new app development.
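
For instance, recalibrating a test might look something like the sketch below, where a threshold is recomputed from a recent window of behavior the team has reviewed and accepted. The metric and percentile are assumptions for illustration, not the platform’s actual mechanics.

```python
# Hypothetical sketch of recalibrating a test: after reviewing a change and accepting
# the new behavior as desired, recompute the test's threshold from a recent window of
# accepted production data instead of keeping the stale baseline value.
import numpy as np


def recalibrated_threshold(accepted_values: np.ndarray, percentile: float = 95.0) -> float:
    """New alert threshold: flag future values beyond this percentile of accepted behavior."""
    return float(np.percentile(accepted_values, percentile))


accepted_response_lengths = np.array([452, 470, 438, 491, 463, 477, 455])
new_threshold = recalibrated_threshold(accepted_response_lengths)
print(f"New response-length alert threshold: {new_threshold:.0f}")
```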

This workflow is game-changing in driving more consistent and predictable app behavior. And we will constantly innovate in the product to ensure teams have what they need to be confident in their AI, especially as they support and scale more apps. Today’s innovations continue that commitment, providing clarity into what changed and whether action is needed.

Introducing the Similarity Index & Key Insights 

Pinpointing change with Similarity Index

While it’s necessary to take into account a much more robust set of attributes to get a comprehensive, testable understanding of behavior, teams need to balance this with the ability to quickly answer “is the current behavior of my AI app similar to what I want it to be? If not, where is it least similar, and do I care?” As these teams scale to more usage and app complexity, the ability to easily answer these questions becomes even more critical, which is why we developed the Similarity Index and Key Insights.

The Similarity Index (Sim Index) is a single numerical value — between 0 and 100 — that quantifies how much an application, or a subset of an application, has changed between two points in time. Think of it as a signal that gives you an instant read on whether your app is behaving consistently, or if something has meaningfully shifted.

Sim Index operates across three levels:

  • Application-level: how much your app as a whole has drifted
  • Column-level: which specific inputs, outputs, or intermediate data have changed
  • Metric-level: which specific properties or evals of the column-level data (e.g., readability, accuracy) are driving the change

The Sim Index is automatically computed on every test session with no setup required, and it can be integrated with existing alerting tools like PagerDuty so you can get notified when there is a change in the value.
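
The exact formula behind the Sim Index isn’t spelled out here, but the intuition is to collapse many bounded per-metric distances into one score. A toy sketch of that idea, using the Kolmogorov–Smirnov statistic as the per-metric distance (an assumption, not the platform’s actual formula):

```python
# Toy sketch of a similarity-style score; the real Sim Index formula is not public
# and is assumed to differ. Here a bounded per-metric distance (the KS statistic,
# which lies in [0, 1]) is averaged and mapped to a 0-100 scale.
import numpy as np
from scipy.stats import ks_2samp


def similarity_index(baseline: dict[str, np.ndarray], current: dict[str, np.ndarray]) -> float:
    """100 means identical metric distributions; lower means a larger behavioral change."""
    distances = [
        ks_2samp(baseline[name], current[name]).statistic
        for name in baseline
        if name in current
    ]
    return 100.0 * (1.0 - float(np.mean(distances)))


rng = np.random.default_rng(1)
baseline = {"response_length": rng.normal(400, 40, 500), "readability": rng.normal(60, 5, 500)}
current = {"response_length": rng.normal(470, 40, 500), "readability": rng.normal(48, 5, 500)}
print(f"Sim Index: {similarity_index(baseline, current):.0f}")
```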

Similarity Index

When there is a drop in the Sim Index — say from 94 to 46 — you immediately know that there has been a significant change. Understanding exactly what changed is where Key Insights come in.

Understanding change with Key Insights

Alongside every Sim Index are Key Insights, which are designed to give you an actionable interpretation of what has changed at a glance. The Distributional platform automatically generates these human-readable summaries that tell you exactly what changed and why it matters. 

For example:

“Answer similarity has dropped significantly — response length distribution substantially drifted to the right and readability decreased by 20%.”

Each Insight is tied to the relevant metric-level details and includes visual comparisons and historical context so you get instant clarity and can take immediate action, such as triaging issues or validating hypotheses. Even more powerful is the ability to create new tests and set thresholds on critical metrics with a single click, so you can continuously add behavioral test coverage that best represents your desired behavior for your specific application.
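
To show the flavor of how metric-level changes might be rendered as a readable summary like the example above, here is an illustrative sketch; the cutoff, metric names, and wording are assumptions, not the platform’s actual generation logic.

```python
# Illustrative sketch (not the platform's actual generation logic) of turning
# metric-level changes into a human-readable insight like the example above.
import numpy as np


def describe_change(name: str, baseline: np.ndarray, current: np.ndarray) -> str | None:
    """Describe the direction and relative size of a shift in a metric's mean, if notable."""
    baseline_mean = float(np.mean(baseline))
    relative = (float(np.mean(current)) - baseline_mean) / baseline_mean
    if abs(relative) < 0.10:  # ignore shifts smaller than 10% (an assumed cutoff)
        return None
    direction = "increased" if relative > 0 else "decreased"
    return f"{name} {direction} by {abs(relative):.0%}"


rng = np.random.default_rng(2)
changes = [
    describe_change("response length", rng.normal(400, 40, 500), rng.normal(470, 40, 500)),
    describe_change("readability", rng.normal(60, 5, 500), rng.normal(48, 5, 500)),
]
print("Key Insight: " + "; ".join(c for c in changes if c) + ".")
```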

Key Insights

Together, Similarity Index and Key Insights give teams:

  • A broad view of application behavior and how it’s changing over time
  • A fast, guided path to root cause, with the ability to drill down from the application level to the column level and even the metric level

These features are now available in Distributional’s platform and are automatically included as part of the default setup, making it easier than ever to gain confidence in the behavioral stability of your AI apps from day one. To see a full demo of these features in action, check out this video.

If you’re interested in trying out these new capabilities for yourself, reach out to the team and we’d be happy to get you set up.
