We raised $11M for better AI testing

Written by

Scott Clark

Summary

As the capacity of AI across enterprise tasks grows, so does its potential risk to these businesses and their customers. Every day there is a new report of AI bias, instability, failure, error or other issues.
This is a problem with massive scale. Marc Andreessen has called AI correctness and security trillion-dollar software problems.
Distributional is building the modern enterprise AI testing and evaluation platform designed to enable our customers to identify, understand and address AI risk before AI-enabled products are deployed.
Distributional has an 11-person founding team led by Scott Clark, co-founder and CEO of SigOpt (acquired by Intel in 2020), and made up of AI, platform and research engineers from Bloomberg, Google, Intel, Meta, SigOpt, Slack, Stripe, Uber and Yelp.
To fuel our product vision, we are announcing an $11M seed round led by Andreessen Horowitz with participation from Operator Stack, Point72 Ventures, SV Angel, Two Sigma and Willowtree Investments.

Introducing Distributional

“How do you test these models today?” I asked the head of the AI platform engineering team at a company that relies on thousands of models in production as part of its core business.

“We have over 500 engineers and analysts who are responsible for deep testing and retesting of every model on a daily basis. If these models shift, they are responsible for finding, evaluating and fixing these issues.”

“Do they have any standardized tools to do this work systematically? Do you aspire to this?”

“No, they each choose their own approach. And, yes, we would like to automate testing but haven’t found the right approach yet.”

In recent months, I have had what feels like the same conversation with dozens of AI leaders in finance, technology, energy, semiconductors, pharmaceuticals, consulting, software and manufacturing. AI – whether traditional machine learning, deep learning, generative AI or the large language models (LLMs) dominating the generative space – is complex, often unpredictable, and constantly changing. Whether from hallucinations, instability, inaccuracy, integration or dozens of other potential challenges, these teams struggle to identify, understand and address AI risk with depth or at scale.

I am often astonished by the differences between traditional software engineering and AI-enabled software development. Testing is standard for traditional software. Teams try to maximize the coverage of their tests and root out “flaky” tests that spuriously fail as they bridge the gap between development and deployment. Engineering teams run unit, regression and integration tests in scalable CI/CD pipelines before putting code in production.

But when AI is added, introducing more math and randomness, the complexity of testing these systems explodes along several dimensions at once. Standard tools no longer fit the purpose. AI models are often given a pass because they are “too complex” and unpredictable. With this much uncertainty, test coverage alone is no longer enough. Proper AI testing also requires depth, which is a hard problem to solve.

This is in part why AI has been described as the high-interest credit card of technical debt. A huge part of this debt is insufficient testing. Most teams choose to assume model behavior risk and accept that models will have issues. Some may try ad hoc manual testing to find these issues, which is often resource-intensive, disorganized, and inherently incomplete. Others may try to passively catch these issues with monitoring tools after AI is in production. In many cases, teams choose to avoid AI even when it could be useful for their applications. In all cases, these teams know there is significant risk around their AI-enabled applications and that they need more robust testing to understand and address it. And, increasingly, these teams may also be required to do this by shareholder pressure, government regulation or industry standards.

We founded Distributional to solve this problem. Our mission is to empower our customers to actively make their AI-based products safer, more reliable, and more secure before they deploy them. We aim to catch harm before their customers or users do.

To pursue this mission, we raised an $11M seed led by Andreessen Horowitz with Martin Casado joining the board and with participation from Operator Stack, Point72 Ventures, SV Angel, Two Sigma, Willowtree Investments, and more than 40 other AI leaders in industry and academia as angel investors. In a recent interview with Martin, Marc Andreessen said, “to make AI generally useful in a way that is guaranteed to be correct or secure – these are two of the biggest opportunities I’ve ever seen in my career.” Armed with their deep support and expertise, our founding team of 11 is poised to realize this opportunity.

A Decade of AI Testing Problems

Our partners, customers, and investors give our team a broad perspective on this problem. But what makes our perspective unique is that we combine their insights with our direct experience attempting to solve versions of this problem for nearly a decade.

2014: AI Evaluation

We first saw this problem in our own software at SigOpt, the AI startup I previously founded, when building our optimization and experimentation platform for enterprise scale in 2014. We had developed cutting-edge ways to efficiently optimize complex systems, but were constantly exploring new techniques to improve this solution. To feel confident deploying new algorithmic solutions, we needed to rigorously test them and have confidence in their robustness.

We considered A/B testing, but we couldn’t run these tests in production due to the risk of real customer harm. Not to mention, this approach was antithetical to our value proposition of extremely efficient optimization. We also considered standard frameworks for benchmarking optimization methods, but couldn’t find one designed to robustly compare results from stochastic methods. With no available solution, our team instead built an evaluation framework and published it at the ICML workshop on optimization in 2016.
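To make the difficulty with stochastic methods concrete, here is a minimal sketch of one way to compare two randomized optimizers: run each across many seeds and compare the full sets of results rather than a single run. It is an illustration only, not the framework we published. The toy objective, the two methods, and the names `random_search`, `single_guess` and `probability_of_superiority` are all assumptions made for this example.

```python
import random


def objective(x):
    """Toy 1-D objective to minimize (hypothetical); its minimum is 0.0 at x = 0.3."""
    return (x - 0.3) ** 2


def random_search(seed, budget=30):
    """Best objective value found by `budget` uniform random samples in [0, 1]."""
    rng = random.Random(seed)
    return min(objective(rng.uniform(0.0, 1.0)) for _ in range(budget))


def single_guess(seed):
    """Deliberately weak baseline: the objective value of one random sample."""
    rng = random.Random(seed)
    return objective(rng.uniform(0.0, 1.0))


def probability_of_superiority(a_results, b_results):
    """Fraction of (a, b) pairs in which a is lower (better); ties count as half."""
    wins = 0.0
    for a in a_results:
        for b in b_results:
            if a < b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(a_results) * len(b_results))


# One result per seed for each method, so run-to-run noise is part of the comparison.
seeds = range(20)
search_results = [random_search(s) for s in seeds]
guess_results = [single_guess(s) for s in seeds]

score = probability_of_superiority(search_results, guess_results)
print(f"P(random_search beats single_guess) = {score:.2f}")
```

A score near 0.5 would mean the two methods are hard to tell apart across seeds, while a score near 1.0 would mean the first method wins almost every pairing. Any single run of either method could be lucky or unlucky, which is why the comparison is made over whole sets of runs.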

Being able to confidently claim we had the best, and most tested, optimization framework became one of our strongest competitive advantages in the years to come. More importantly, this evaluation process exposed valuable insights on product performance. It was often shocking which methods looked good in a paper but did not perform well when exposed to rigorous testing. By continuously testing, we were able to cut out poorly performing techniques before they ever made it to our users. Although we were proud of our invention, even our team believed it would have been great to use standardized tooling here instead of needing to build it ourselves from scratch.

2016-2020: AI Robustness

After we established SigOpt as a reliable, sample-efficient product for optimizing black box systems, our product was increasingly used by sophisticated companies deploying AI as a core component of their product or revenue strategy. These teams had high upside from boosting the performance of their models, but also significant downside if the models didn’t perform as expected. So they often valued robustness as much as or more than performance.

For example, if one of our clients were to utilize a brittle model to make important business decisions, subtle shifts in inputs may lead to widely varying outputs and suboptimal outcomes. As SigOpt made these models better and more powerful, the need for robustness – and the tradeoff between robustness and maximum potential performance – became more important.

It is often better to have a solution at 90% of perfect all the time than a solution that wildly oscillates between 99% and 10%. This is a very difficult problem in high dimensions of input and output where traditional perturbation analysis is prohibitively expensive.
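The arithmetic behind both claims can be sketched in a few lines. This is a rough illustration under stated assumptions: the oscillating solution is assumed to spend half its time at each level, and "perturbation analysis" is simplified to a full grid with three test values per input dimension. The helper name `grid_evaluations` is made up for this example.

```python
from statistics import mean

# Assumption: each solution is observed at two equally likely moments.
steady = [0.90, 0.90]
oscillating = [0.99, 0.10]

for name, scores in (("steady", steady), ("oscillating", oscillating)):
    # The average and the worst case tell very different stories.
    print(f"{name}: mean={mean(scores):.3f}, worst={min(scores):.2f}")


def grid_evaluations(points_per_dimension, dimensions):
    """Model evaluations needed to try every combination on a full grid."""
    return points_per_dimension ** dimensions


# Even a coarse grid of 3 values per input becomes infeasible as dimensions grow.
for d in (2, 10, 50):
    print(f"{d} dimensions -> {grid_evaluations(3, d)} evaluations")
```

Under these assumptions the oscillating solution averages well below the steady one and has a far worse floor, while the grid cost grows exponentially with the number of inputs.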

Once you find an optimal model, how can you evaluate whether it is brittle? And how do you make sure this understanding of optimal performance and relative brittleness doesn’t shift over time?

As we saw the rise of this use case across our user base, we designed a purpose-built solution to this problem called Constraint Active Search and published it at ICML 2021. This algorithmic technique allowed these teams to set constraints on a variety of metrics and run experiments that would actively probe and produce a variety of performant models that satisfied these constraints. Users loved this feature because it allowed them to effectively and efficiently optimize their model reliably against different permutations of input parameters in ways they never could before. In turn, they built more intuition on model robustness and had more confidence that the model they deployed wouldn’t significantly degrade with shifts in input distributions.

2022: Continuous AI Testing at Scale

In October 2020, Intel acquired SigOpt. At Intel, I had the privilege of leading the AI and HPC software teams in the Supercomputing division that was bringing Intel’s next generation of GPUs and HPC-oriented CPUs to market. In this role, I managed over one hundred engineers with the purpose of running, evaluating, debugging and evolving AI and HPC workloads for each new processor we were bringing to market. Given the sophistication of our customers, most of this work involved complex AI and physical modeling. This process translated to our teams orchestrating up to thousands of AI test workloads daily.

As we built out the full software stack for this task, there were robust frameworks in place for traditional software testing, but nothing similar for AI. As a result, our team was forced to spend most of its time and energy manually designing, instrumenting, executing and analyzing tests for AI workloads. We explored options for supporting this workflow with software, but couldn’t find a robust enough solution or a reliable testing framework. Although we had ambitions for continuous testing, this simply wasn’t attainable without automation in place. One member of the executive team called AI testing a “million dollar per day problem for companies operating at this size and scale.” This was a huge problem, but there were no good off-the-shelf solutions internally or externally to address it.

Better Testing, Greater AI Impact

Through conversations with AI product leaders in finance, energy and tech, I have come to realize that these are quite common issues. These leaders agree that software requires testing, but traditional testing methods and frameworks were built around assumptions that do not hold for applications built on AI.

Engineers are often forced to test these models by partially fitting them into legacy testing frameworks (often only testing metric thresholds and summary statistics), applying qualitative analysis as they build models (using visualizations or hand-constructed examples in notebooks to gain intuition and confidence), or shifting their problem to their users and customers by letting them test it live via online monitoring. As a consequence, AI is incompletely and non-continuously tested today. This exposes the business to significant risk, high opportunity cost, or both.
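A small sketch shows why a threshold on a summary statistic can miss a real change in behavior. The two score lists, the limits `MEAN_FLOOR` and `KS_LIMIT`, and the choice of a two-sample Kolmogorov-Smirnov statistic are all assumptions picked for illustration, not a description of any particular product or team's tests.

```python
from bisect import bisect_right
from statistics import mean


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical cumulative distribution functions of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


# Hypothetical model scores: same average, very different spread.
baseline = [0.40, 0.45, 0.50, 0.55, 0.60]
candidate = [0.05, 0.10, 0.50, 0.90, 0.95]

MEAN_FLOOR = 0.45  # illustrative threshold on a summary statistic
KS_LIMIT = 0.3     # illustrative limit on distributional shift

threshold_test_passes = mean(candidate) >= MEAN_FLOOR
shift = ks_statistic(baseline, candidate)
distribution_test_passes = shift <= KS_LIMIT

print(f"mean threshold test passes: {threshold_test_passes}")
print(f"KS statistic: {shift:.2f}, distribution test passes: {distribution_test_passes}")
```

Here the candidate clears the mean threshold even though its scores have moved to the extremes, and only the comparison of the full distributions flags the change.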

When I ask CIOs how they know that their models haven’t introduced bias or gone off the rails, they often say that their only recourse is to constantly monitor metrics and feedback, passively waiting for something to go wrong. While monitoring is an important part of any software product, the confidence that comes from rigorous, active testing allows product teams to deploy without fear that something catastrophic is right around the corner, discoverable only by their users after the fact.

As AI becomes more powerful and prevalent, it becomes increasingly important to make sure it is tested and performing as expected. We hope to usher in a virtuous cycle for our customers. With better testing, teams will have more confidence deploying AI in their applications. As they deploy more AI, they will see its impact grow exponentially. And as they see this impact scale, they will apply it to more complex and meaningful problems, which in turn will need even more testing to ensure it is safe, reliable, and secure.

What’s Next?

We aspire to help our customers realize this future and would love your help along the way. We are collaborating with more than a dozen co-design partners to build the modern enterprise platform for AI testing, but are always interested in expanding the scope of our collaboration.

Here are ways to get involved: