Driving confidence in AI: Understanding deployment testing through model version updates

Tobi Andreasen
April 3, 2025

Building on my previous blog post, I’d like to share more examples to help readers develop intuition around AI testing. This time, rather than focusing on RAG apps, we’ll dive into deployment testing. The goal is to illustrate how deployment testing can provide deeper insights into the consistency of an AI-powered application after it reaches production.

At its core, deployment testing is about understanding the consistency of an AI application over time. This can range from a narrow focus of continuously testing a single LLM-endpoint, to the broader scope of continuously testing an entire AI application.

Regardless of the scope, the goal is to extract a meaningful signal that indicates whether or not something is changing. Interestingly, testing the full AI application is often easier, because we generally have a clearer understanding of what the application is supposed to do. In contrast, testing a single LLM-endpoint can be more challenging, as its expected behavior is harder to define.

There are multiple ways to extract meaningful signals. I personally like to extract signals through summarization. A simple system diagram for a summarization application is shown below:

System diagram for AI-powered summarization application.

Since our focus is on testing a single LLM-endpoint, the goal isn’t to build a world-class summarization application. Instead, the aim is to create a solution that’s reliable enough to extract a meaningful signal of change.

So what is the signal?

The signal comes from the idea that by repeatedly exposing an LLM-endpoint to the same task at a regular cadence, i.e. deployment testing, we can continuously compute and compare a set of metrics over time. The change in these metrics, or the delta, is what I refer to as the signal.
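
As a rough illustration of that delta, here is a minimal Python sketch; the threshold and the monthly averages in it are made-up placeholders, not numbers from this post:

```python
# Minimal sketch of the "signal": compare this month's average metric against
# last month's and flag the delta if it exceeds a threshold. The numbers below
# are placeholders, not results from this post.

THRESHOLD = 0.05  # how much month-over-month drift we are willing to tolerate

def metric_shifted(previous_avg: float, current_avg: float, threshold: float = THRESHOLD) -> bool:
    """Return True if the metric moved more than the allowed threshold."""
    delta = current_avg - previous_avg
    print(f"delta = {delta:+.3f}")
    return abs(delta) > threshold

# Made-up monthly averages, purely illustrative
if metric_shifted(previous_avg=0.42, current_avg=0.51):
    print("Metric shifted more than expected -- worth investigating the endpoint.")
```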

For summarization, the task is to repeatedly summarize the same set of texts at a regular cadence and compare each generated summary to a fixed reference summary. To achieve this, we rely on what's commonly known as a Golden Dataset. For this blog post, I've created a simple Golden Dataset consisting of 60 different pieces of text across six categories:

  • Economics & Finance
  • Health & Medicine
  • History
  • News
  • Scientific Papers
  • Technology

Each text is paired with a fixed reference summary. Since the goal is just to extract a meaningful signal, we don't need a perfect reference summary, just one that's good enough to compare against the generated summary.

An Economics & Finance example from the Golden Dataset can be seen below:

And similarly from the Technology category:

Obviously, if we were testing a full AI-powered application, this Golden Dataset should reflect the tasks we want the application to complete. For testing a single LLM-endpoint, however, we only need a meaningful Golden Dataset that ideally reflects the business relying on that endpoint.
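
Structurally, a Golden Dataset like this can be as simple as a file of text/reference pairs tagged with a category. Here is a minimal loading sketch; the field names and file path are illustrative assumptions, not the actual dataset used in this post:

```python
import json

# Illustrative schema for one Golden Dataset record; the field names and file
# path are assumptions for this sketch, not the actual dataset behind this post.
# golden_dataset.jsonl holds one JSON object per line, e.g.:
# {"category": "Economics & Finance", "text": "...", "reference_summary": "..."}

def load_golden_dataset(path: str = "golden_dataset.jsonl") -> list[dict]:
    """Load text / reference-summary pairs, one record per line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

records = load_golden_dataset()
print(f"{len(records)} records across {len({r['category'] for r in records})} categories")
```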

What about the metrics?

When evaluating behavior, relying on a single metric is never sufficient. However, since the focus of this blog is on building intuition, we’ll keep it simple and examine just two key metrics.

For this example, we'll look at the BLEU and ROUGE scores, two well-established NLP metrics that quantify how similar two pieces of text are. One could imagine supplementing them with all sorts of other metrics, such as toxicity levels and readability scores.
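
As a sketch of how these per-pair scores can be computed, here is one possible implementation, assuming the rouge_score and sacrebleu packages (any comparable BLEU/ROUGE implementation would do):

```python
# Sketch: score one generated summary against its reference.
# Assumes `pip install rouge-score sacrebleu`; other implementations work too.
from rouge_score import rouge_scorer
import sacrebleu

_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def score_pair(reference: str, generated: str) -> dict:
    """Return ROUGE-L F1 and sentence-level BLEU for one summary pair."""
    rouge_l = _rouge.score(reference, generated)["rougeL"].fmeasure
    # sacrebleu reports BLEU on a 0-100 scale; normalize to 0-1 here.
    bleu = sacrebleu.sentence_bleu(generated, [reference]).score / 100.0
    return {"rouge": rouge_l, "bleu": bleu}

print(score_pair("The cat sat on the mat.", "A cat was sitting on the mat."))
```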

And what about the LLM-endpoint?

From countless conversations I’ve had, there are three primary ways teams access LLMs:

  1. Third-Party Endpoints – Services like ChatGPT and Together AI that provide external API access.
  2. Self-Hosted – Teams running their own LLMs on dedicated infrastructure.
  3. Company-Hosted – Organization-wide initiatives where LLMs are deployed on internal infrastructure for company-wide use.

Each approach comes with its own trade-offs. For this blog, we'll focus on the company-hosted model and act as an engineering team using a Llama LLM-endpoint provided through a company-wide GenAI platform. This means we have very little control over the LLM-endpoint and will be using whatever is provided to us by a different team.

Additionally, we’ll be looking at an 11-month time period from May 2024 to March 2025.

Do vibe checks work?

We’ve found that most companies start with basic vibe checks. So, let’s begin by examining a single piece of text alongside its reference summary.

Next, we can query our LLM-endpoint to retrieve and log the summary generated for the above example each month.
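
What that monthly query might look like is sketched below, assuming the company-hosted endpoint is OpenAI-compatible; the base URL, model name, and environment variable are placeholders for whatever your platform team provides:

```python
# Sketch: ask the (assumed OpenAI-compatible) company-hosted endpoint for a
# summary and log it with a timestamp. Base URL, model name, and env var are
# placeholders, not the actual platform's values.
import datetime
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://genai.internal.example.com/v1",  # placeholder
    api_key=os.environ.get("GENAI_PLATFORM_API_KEY", "not-set"),
)

def summarize(text: str, model: str = "llama-endpoint") -> str:
    """Ask the endpoint for a short summary of the given text."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Summarize the following text in a few sentences."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

summary = summarize("<text from the Golden Dataset>")
print(datetime.datetime.now().isoformat(), summary)
```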

May 2024 through March 2025 – summaries of the above example, one generated each month.

We can observe that the generated summaries vary from month to month, which aligns with the non-deterministic nature of an LLM. However, as humans, we can't easily determine whether these differences are purely due to non-determinism or if other factors are at play.

This is where a statistical lens can help!

A statistical perspective?

With our Golden Dataset in place, we no longer need to rely on simple month-over-month vibe checks. We can instead examine the distributions of key metrics and track their evolution over time.

This kind of analysis is at the core of what the Distributional platform enables at scale. However, to build intuition, we’ll continue with the simple two-metric example outlined earlier.

The workflow is as follows:

Each month, we will:

  1. Generate a summary for each text in the Golden Dataset.
  2. Compute ROUGE and BLEU scores between each generated summary and its reference summary.
  3. Calculate the average ROUGE and BLEU scores for each subclass in the Golden Dataset.
  4. Compute the overall average ROUGE and BLEU scores across all data.
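
Putting those four steps together, a rough end-to-end sketch could look like this; it reuses the illustrative load_golden_dataset, summarize, and score_pair helpers sketched earlier, which are assumptions of this walkthrough rather than a real library:

```python
# Sketch of one monthly pass: summarize, score, and aggregate per subclass.
# Relies on load_golden_dataset(), summarize(), and score_pair() from the
# earlier sketches, which are illustrative helpers, not an existing library.
from collections import defaultdict
from statistics import mean

def run_monthly_test(records: list[dict]) -> dict:
    """One monthly pass over the Golden Dataset: summarize, score, aggregate."""
    per_category = defaultdict(list)

    # Steps 1 and 2: generate a summary and score it against the reference.
    for record in records:
        generated = summarize(record["text"])
        per_category[record["category"]].append(
            score_pair(record["reference_summary"], generated)
        )

    # Step 3: average ROUGE and BLEU per subclass.
    subclass_averages = {
        category: {
            "rouge": mean(s["rouge"] for s in scores),
            "bleu": mean(s["bleu"] for s in scores),
        }
        for category, scores in per_category.items()
    }

    # Step 4: overall averages across all data.
    all_scores = [s for scores in per_category.values() for s in scores]
    overall = {
        "rouge": mean(s["rouge"] for s in all_scores),
        "bleu": mean(s["bleu"] for s in all_scores),
    }
    return {"per_subclass": subclass_averages, "overall": overall}

results = run_monthly_test(load_golden_dataset())
print(results["overall"])
```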

From here, we can visualize how both ROUGE and BLEU scores change month over month for both the entire Golden Dataset and each of the subclasses.
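
A minimal matplotlib sketch of that visualization, assuming the monthly results have been collected into a dict keyed by month label:

```python
# Sketch: plot monthly per-subclass averages plus the global average.
# `monthly_results` is assumed to map a month label to the output of
# run_monthly_test(), e.g. {"2024-05": {...}, "2024-06": {...}, ...}.
import matplotlib.pyplot as plt

def plot_metric(monthly_results: dict, metric: str = "rouge") -> None:
    months = sorted(monthly_results)
    categories = monthly_results[months[0]]["per_subclass"].keys()
    for category in categories:
        values = [monthly_results[m]["per_subclass"][category][metric] for m in months]
        plt.plot(months, values, alpha=0.5, label=category)
    plt.plot(months, [monthly_results[m]["overall"][metric] for m in months],
             linewidth=3, color="black", label="global average")
    plt.ylabel(metric.upper())
    plt.xticks(rotation=45)
    plt.legend()
    plt.tight_layout()
    plt.show()
```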

Monthly ROUGE score for each subclass and the global average.

When analyzing the ROUGE score over time, we observe that our AI application is gradually improving. However, its performance isn't consistent across all subclasses: some are summarized more effectively than others relative to our reference summaries. Interestingly, as the BLEU score below shows, an improvement in one metric doesn't necessarily mean all metrics will improve in parallel.

Monthly BLEU score for each subclass and the global average.

The BLEU score follows a similar trend to the ROUGE score, but at certain points we observe a regression that isn't reflected in the ROUGE metric: the BLEU score drops in both August 2024 and September 2024, while the ROUGE score does not. This is exactly why we want people to think about change in terms of a distributional fingerprint rather than a single metric.

What is happening?

Why are we seeing an increase in performance over time? From the plots below, we can see that the team serving the LLM-endpoint has been following the Llama release schedule, constantly updating the underlying LLM-endpoint to the most recent Llama model.

Monthly ROUGE score for each subclass and the global average, including the LLM used.
Monthly BLEU score for each subclass and the global average, including the LLM used.

Whether or not teams are doing this type of model update out in the wild is not the point; there are plenty of arguments for why one should or shouldn't do it. What I want to show is that even for a simple task like summarization, the right dataset gives a strong indication of whether or not something is changing.

For situations where you are constantly updating models to the newest version, having a system like the above to extract a continuous signal makes a lot of sense. But for anyone who relies on a model provider for their LLM-endpoints, it is also a simple way to understand whether those providers are making changes to their models, and what the impact of those changes is.

Is there a signal?

One last thing. It’s fair to ask whether everything we’ve seen so far is just a reflection of the inherent non-determinism of LLMs or if there’s actually a meaningful difference between these models. To dig into that, I ran the data through each LLM 10 times to see if the results truly vary or if we’re just seeing randomness in action.
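
The repeated-run check itself is straightforward; a sketch, again reusing the illustrative helpers from above:

```python
# Sketch: run the same Golden Dataset through one endpoint several times and
# summarize the spread, to separate genuine model differences from noise.
# Relies on run_monthly_test() and load_golden_dataset() from earlier sketches.
from statistics import mean, stdev

def repeated_runs(records: list[dict], n_runs: int = 10, metric: str = "rouge") -> tuple[float, float]:
    """Return (mean, standard deviation) of the overall metric across runs."""
    overall_scores = [run_monthly_test(records)["overall"][metric] for _ in range(n_runs)]
    return mean(overall_scores), stdev(overall_scores)

avg, spread = repeated_runs(load_golden_dataset())
print(f"ROUGE over 10 runs: {avg:.3f} +/- {spread:.3f}")
```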

Average ROUGE score, including standard deviation, for each LLM used.
Average BLEU score, including standard deviation, for each LLM used.

From the two plots, it is clear that there is a difference between the four models. Whether that difference will have an impact on an AI-powered application is a topic we'll cover in a later blog post.

Automating deployment testing

Distributional is helping companies define, understand, and improve the reliability of their AI applications. The above scenario represents just a subset of the challenges we’re helping to solve.

Our goal is to help teams develop confidence in the AI applications they’re taking to production, so they can rely on more than just a vibe check to ensure their applications are working as intended.

Interested in exploring how Distributional can help your organization? Sign up for access to our product or get in touch with our team to learn more.
