We raised a $19M Series A for enterprise AI testing

We’ve made a lot of progress over the last year at Distributional. Since raising our seed round, we have validated the enterprise need for confidence in AI applications, grown our team, and started deploying our enterprise testing platform to address this problem in collaboration with a dozen design partners. I couldn’t be more proud of our team. 

To push us even faster on this journey, I’m thrilled to announce our $19M Series A led by Two Sigma Ventures with participation from Andreessen Horowitz, Operator Collective, Oregon Venture Fund, Essence Venture Capital, Alumni Ventures, and dozens of angel investors. We are excited to partner more deeply with Two Sigma Ventures, who have had a front-row seat on our journey and a unique perspective on the AI testing problem—especially for financial and regulated industries. We are using this fresh capital to continue to expand our team, accelerate our roadmap, and scale our enterprise deployments.

Why are we so excited to pour fuel on this fire? During our first year, we’ve had thousands of hours of conversations with over 100 large financial, industrial, and technology enterprises. These conversations have made a few things clear: confidence in AI applications is a critical problem, we have a unique solution through an enterprise-first testing platform, and there is a lot of opportunity to expand on what we’ve built.

Challenge: AI testing is a unique, operationally intensive problem

Testing is the primary way to gain confidence that traditional software applications are behaving as expected. But AI is complex, which makes it difficult to test with traditional approaches. It is non-deterministic, which requires writing statistical tests on distributions of many data properties to quantify behavior. It is non-stationary, which requires continuous and adaptive testing throughout the AI lifecycle, from development through deployment and production, to catch behavioral change. And it is multi-component, which requires testing all dependencies to pinpoint and resolve potential issues.
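
To make the distributional framing concrete, here is a minimal sketch of what one such statistical test could look like, assuming a single derived property (response length) collected from a baseline run and a candidate run. The data, significance level, and use of scipy are illustrative assumptions, not Distributional’s actual API.

```python
# Sketch: a two-sample statistical test on the distribution of one derived
# property (response length), comparing a baseline run against a candidate run.
# All values here are made up for illustration.
from scipy import stats

baseline_lengths = [412, 388, 455, 371, 430, 402, 395, 441, 420, 384]
candidate_lengths = [630, 598, 612, 655, 571, 640, 605, 622, 588, 649]

statistic, p_value = stats.ks_2samp(baseline_lengths, candidate_lengths)

ALPHA = 0.01  # illustrative significance threshold
if p_value < ALPHA:
    print(f"Shift detected in response length (KS={statistic:.2f}, p={p_value:.4f})")
else:
    print("Response length distribution is consistent with the baseline")
```

The same pattern repeats for every property worth tracking, which is why the operational burden adds up quickly without automation.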

Because of all of this, testing is a gap in the AI stack today. Developer tools focus on helping teams rapidly prototype AI applications, from eval tools for constructing performance-related benchmarks to workbenches that help pull together an end-to-end proof of concept. But these tools don’t give teams a standardized process to gain confidence that their AI app is behaving consistently, especially once the application goes into production. Monitoring tools often focus on higher-level metrics and specific instances of outliers, which gives a limited sense of consistency but offers no insight into broader application behavior before it affects business metrics. Testing fills the gap between these two solutions, but it also enhances them. The path to more robust evaluation and validation runs through better testing in development and deployment. Monitoring becomes more insightful with continuous, adaptive testing in production.

We didn’t invent this approach—statistical testing has been around for centuries. AI teams tend not to fully implement these techniques because they are operationally intensive. It takes a combination of AI, engineering, statistical, product, and platform expertise to build a deep, automated, and standardized approach to this problem—one that is rarely the core job of a single team.

Solution: AI testing with depth, automation and standardization

So what have we done about it? We’ve built the modern platform for enterprise AI testing to address this problem and remove the operational burden of enterprises building and maintaining their own solutions or cobbling together incomplete alternatives from other tools. By proactively addressing these testing problems with Distributional, AI teams can deploy with more confidence and catch issues with AI applications before they cause significant damage in production.

We’ve designed our platform with a simple principle in mind: make it easy to get to actionable value and empower customization to increase this value over time. By making this process more efficient, teams are freed up to focus on their mandate of creating value by building better applications, and to resolve issues with confidence when they do arise. This platform has three primary capabilities:

  • Depth: To handle the non-deterministic, non-stationary, and multi-component nature of AI applications, teams need to write statistical tests on distributions of properties of their applications and data. We’ve designed the first purpose-built platform with this approach to testing, enabling AI teams to get visibility into the consistency and performance of all components of their AI applications and take action with insightful analysis.
  • Automation: The platform allows for quickly achieving value by automating the collection of application data, derivation of testable properties, creation of statistical tests, surfacing of insights for analysis, and recalibration of tests to fit expectations. The user can provide further contextual information or feedback to enhance this process. Additionally, they can completely customize aspects of the process to fit the bespoke testing needs of the application.
  • Standardization: We built our solution to address the needs of enterprises. We provide visibility across the organization into what, when, and how AI applications were tested and how that has changed over time. We provide consistency in how teams approach AI testing, which enables governance and leadership teams to audit, through reporting, how risk is mitigated for each AI application throughout its lifecycle. And we increase the efficiency of AI teams with a repeatable testing process for similar applications, built on sharable templates, configurations, filters, and tags.

Onward: solving the AI testing problem at scale

I’m incredibly proud of what we’ve built in such a short period of time, but there is a lot more to do. The path to reliable AI starts with having a reliable AI testing platform. We’re excited to use this fresh round of funding to seize this opportunity.

If you’re just as excited to tackle this opportunity, there are a few ways to engage with us:

  • Join the team: We have an experienced team addressing this critical problem with a unique solution. We’re planning to double in size over the coming months and would love for you to join us. Apply on our careers page.
  • Get product access: We’re also scaling our customer deployments. If you relate to any of the challenges covered above, reach out to get product access. Our team will get in touch with you to learn more about your circumstances.
  • Learn more: Read our blog, watch our demos, follow us on LinkedIn or X, and sign up for updates.

Why generative AI creates unique testing challenges

What do AI leaders really think about AI today? In a three-part series, we’re sharing the insights we’ve gleaned from over 1,000 hours of conversations with AI leaders at more than 100 Fortune 500 enterprises. This first post delineates the unique challenges companies face when they deploy generative AI, as compared with traditional software. The second covers why leaders today believe that all AI/ML components—not just generative AI—need deeper testing. And the final post focuses on why AI testing is an enterprise problem in need of an enterprise solution. 

Before jumping in, it’s worth noting that I’m the CRO here at Distributional and the only one of our 25 team members without a technical degree. My take is therefore intended to be more accessible, less technical and, hopefully, a quick read. If you want to dive deeper, let’s find time to talk.

Generative AI is on the rise

Generative AI (GenAI) has real potential to transform the enterprise. Whether agents, copilots or apps designed for specific tasks, teams have started to deploy Large Language Models (LLMs) for both internal and external use cases. 

This opportunity is well established, but so are the challenges. Teams who have deployed GenAI struggle to detect and mitigate undesired behavior, resulting in hallucinations, incorrectness or an unreliable customer experience in production, among other issues. Other teams have a long backlog of these applications they want to deploy, but struggle with confidence in their behavior or capacity to satisfy governance needs—so these use cases are withering on the vine. 

No matter where you fall on this spectrum, a more complete approach to AI testing is one of the critical components to addressing these issues, much as traditional testing is for traditional software. But testing can mean different things to different people, so we spoke with over a hundred CIOs, CTOs, Chief AI Officers, VPs of AI, Directors of AI engineering and AI product managers to understand how they define AI testing. Digging deeper has yielded a few insights into how they think about this challenge—and some of the potential solutions.

Challenges with testing AI

Non-determinism

Non-determinism is when the same input can yield a multitude of possible outputs. For example, this might mean you prompt an LLM “What are your best recommendations for Florence?” and it might recommend a cooking class, a walking tour, or something else entirely each time you ask.
The level of variance within LLM responses is one of the characteristics that makes them such powerful tools—it is ultimately a desirable trait. But it can also make them hard to properly test and evaluate. Distributional tackles this problem by testing on distributions of many inputs and outputs rather than single usages of an application. Any given usage of an LLM should vary, but the distribution of usages should behave roughly the same, so tests need to be run on these distributions—something Distributional was explicitly designed to do. Many companies still face this challenge today. Prior to trying Distributional, the CTO of a data services company told us, “Non-deterministic apps need to be continuously checked to ensure they’re at a steady state based on our use case, and we lack solutions to do this well today.”
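
As a hedged illustration of testing the distribution of usages rather than any single response, the sketch below asks whether the mix of answer categories for the same prompt has shifted between two model versions, using a chi-square test. The categories, counts, and significance level are invented for illustration and are not Distributional’s API.

```python
# Sketch: treat non-deterministic outputs distributionally. Run the same prompt
# many times under two model versions, bucket the answers into coarse
# categories, and test whether the category mix has shifted.
from scipy.stats import chi2_contingency

categories = ["cooking class", "walking tour", "museum", "other"]
counts_v1 = [34, 41, 15, 10]  # answers from 100 runs of model version 1
counts_v2 = [12, 38, 35, 15]  # answers from 100 runs of model version 2

for category, a, b in zip(categories, counts_v1, counts_v2):
    print(f"{category:15s} v1={a:3d}  v2={b:3d}")

chi2, p_value, dof, expected = chi2_contingency([counts_v1, counts_v2])

if p_value < 0.01:
    print(f"Answer mix shifted between versions (chi2={chi2:.1f}, p={p_value:.4f})")
else:
    print("Answer mix is statistically consistent across versions")
```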

Non-stationarity

Most AI applications, whether GenAI or traditional ML, include non-stationary components, and this is particularly pervasive for LLMs. The entire application could be managed by a third-party vendor. The LLM powering an application could be a third-party-managed API. There could be evolving datasets supporting RAG applications or driving financial risk scoring. The underlying infrastructure running these applications may shift underneath you as teams update their models. These upstream shifts can change the behavior of the application that relies on them, even if nothing in the application itself was changed. All of these are instances of non-stationarity, and they require the data of these AI/ML applications—including their upstream components—to be tested continuously and adaptively over time.
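
One hedged way to picture continuous, adaptive testing for non-stationarity is a rolling comparison of each production window against a reference window. The sketch below uses the population stability index on synthetic data with a common rule-of-thumb cutoff; none of it reflects Distributional’s implementation.

```python
# Sketch: continuous testing for non-stationarity. Each production window is
# compared against a fixed reference window using the population stability
# index (PSI) on one numeric property. Data and thresholds are illustrative.
import numpy as np

def psi(reference, current, bins=10):
    """Population stability index between two samples of one property."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    # Floor the fractions to avoid division by zero and log(0).
    ref_frac = np.clip(ref_frac, 1e-6, None)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.70, scale=0.05, size=1_000)  # e.g. relevance scores at launch
week_3 = rng.normal(loc=0.71, scale=0.05, size=1_000)     # mild, acceptable variation
week_9 = rng.normal(loc=0.58, scale=0.09, size=1_000)     # an upstream change shifted behavior

for label, window in [("week 3", week_3), ("week 9", week_9)]:
    score = psi(reference, window)
    status = "DRIFT" if score > 0.2 else "ok"  # 0.2 is a common rule-of-thumb cutoff
    print(f"{label}: PSI={score:.3f} -> {status}")
```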

Defining behavior

“Many GenAI tasks are subjective, so how do you measure its behavior? Are judge metrics reliable? Do you automatically compute these off of raw text?” asked an ML engineer at a large consumer technology company recently. Most conversations we’ve had with our customers include some variation of these questions. GenAI is so early that most teams are still parsing which metrics best represent model performance, let alone behavior. To streamline this process, teams use Distributional to automatically derive many performance and behavioral properties from their text data, adding a slew of information about model behavior that they can test alongside any custom evals they’ve developed. This creates a rich representation of behavior from potentially limited application data. Our platform then automatically recommends tests and adaptively calibrates them to fit each unique AI application, ensuring this behavior doesn’t deviate over time.
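
For illustration only, the sketch below shows the kind of simple properties that can be derived from raw response text and then tested as distributions. The specific properties are assumptions made for the example, not the set Distributional derives.

```python
# Sketch: deriving simple, testable properties from raw LLM responses. Each
# property becomes a distribution that can be tested over time alongside any
# custom evals. The property choices are illustrative, not exhaustive.
import re

def derive_properties(response: str) -> dict:
    words = response.split()
    return {
        "char_length": len(response),
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "digit_fraction": sum(c.isdigit() for c in response) / max(len(response), 1),
        "question_marks": response.count("?"),
        "contains_refusal": bool(re.search(r"\b(cannot|can't|unable to)\b", response, re.I)),
    }

responses = [
    "Florence has wonderful walking tours; I can suggest three of them.",
    "I'm unable to book tickets, but a cooking class near the Duomo is popular.",
]
for props in map(derive_properties, responses):
    print(props)
```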

Pipelines

AI applications rarely exist in isolation. They depend on upstream components that produce features, third-party packages that may need to be upgraded, third-party data sources that shift over time, or third-party APIs, such as hosted LLM endpoints. There are typically multiple non-stationary components in any given AI or ML pipeline, and if non-stationarity or non-determinism exists anywhere, it propagates through the system. AI leaders are aware of this issue. One AI leader at a consumer electronics company told us, “I run evals that give us a good sense of how our core app performs today, but we rely on upstream dependencies that shift over time and need a way to standardize how tests are run across each of these components as well.” The entire computational graph representing an application needs to be tested in unison to effectively root-cause the origin of any potential issue or behavioral change.
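
As a rough sketch of component-level testing, assume a simple three-stage pipeline and one monitored property per component. Checking components in upstream-to-downstream order lets a behavioral change be traced to where it first appears; the component names, data, and choice of statistical test are illustrative.

```python
# Sketch: test every component of a pipeline so a behavioral change can be
# traced to the component where it first appears. Values are illustrative.
from scipy import stats

# Per-component samples of one monitored property: (baseline, current).
pipeline = {
    "retriever":  ([0.81, 0.78, 0.83, 0.80, 0.79], [0.80, 0.82, 0.77, 0.81, 0.79]),
    "reranker":   ([0.64, 0.66, 0.63, 0.65, 0.67], [0.48, 0.51, 0.45, 0.50, 0.47]),
    "llm_answer": ([0.72, 0.74, 0.70, 0.73, 0.71], [0.58, 0.61, 0.55, 0.60, 0.57]),
}
upstream_order = ["retriever", "reranker", "llm_answer"]

# The first failing component in upstream order is the most likely origin of
# any downstream behavioral change.
for name in upstream_order:
    baseline, current = pipeline[name]
    _, p_value = stats.mannwhitneyu(baseline, current, alternative="two-sided")
    if p_value < 0.05:
        print(f"Shift first detected at component: {name} (p={p_value:.3f})")
        break
else:
    print("No component-level shift detected")
```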

Democratization

Growing use of pre-trained models—and LLMs in particular—makes it easier than ever for anyone to develop their own AI applications. Yet a lack of confidence is holding companies back from reaping the full benefits. One CTO of a large insurance company told us, “This is the first time in my life where we’re getting pull from business lines to use a specific technology. But unless I standardize testing and other operations around LLMs, we’re not in a position to enable any of these use cases.”

Leaders want to empower teams to unlock LLM use cases for their business lines, but also know that doing so without proper checks risks reputational, operational or regulatory harm. There is a huge opportunity to democratize AI development, but it must come with standardized enterprise tooling that can provide repeatability, consistency and visibility to the process. The goal is to enable flexibility in terms of what is being developed, but then to verify it fits standards the company sets through testing.

Opportunity cost

The challenges of building reliable AI today come with a massive opportunity cost for companies. A VP of AI at a financial technology company told me, “With traditional ML, development took the vast majority of my team’s time. Now with LLMs, testing and evaluation takes 5-10x as much time as development.” Similarly, the CPO of a large enterprise technology company shared, “I know we need better testing for our GenAI, but we are prioritizing building new revenue-generating AI features instead.”

In all cases, teams need a workflow that automates how testing and validation is done on their applications so this step takes up less valuable team time. This is why at Distributional we’ve prioritized automating the process, from data collection to augmentation to testing to adaptive recalibration, so AI teams can quickly reap the benefits of testing and gain more confidence in their AI applications.

A challenge that Distributional solves

In summary, generative AI represents a large opportunity for most companies, but it also carries significant risk. It’s prone to non-determinism, which traditional software testing can’t account for. It often includes many non-stationary components with varying levels of team control. It is also so new that there isn’t a standard set of metrics teams can align on. LLMs are often chained together or embedded in pipelines with other ML components, so any non-determinism or non-stationarity propagates through the entire system. With pre-trained LLMs and LLM API endpoints, it’s easier than ever for a business line to develop an AI app, but also easier than ever to ship one without a clear picture of its behavior. And proper testing is both time-intensive and orthogonal to an AI team’s daily responsibilities. Distributional is building the solution to all of these challenges.

If you are interested in learning more, here are a few ways:

Distributional: Empowering trustworthy AI in the enterprise

Frances Schwiep and Vin Sachidananda from Two Sigma Ventures are no strangers to what makes AI applications so uniquely challenging. In fact, back in 2021 they predicted a multibillion-dollar opportunity in AI testing, evaluation, and monitoring solutions, with a clear perspective on this challenge.

Their perspective is informed by the space they occupy at the intersection of a few highly technical markets. As technology investors, Frances and Vin have played instrumental roles enabling the growth of some of the most successful deep tech startups riding enterprise market waves. As part of a fund connected to one of the most successful quantitative hedge funds of all time, they are biased toward technical products for technical users. And through their connection to the broader financial services industry, they have felt the pain of reputational, regulatory, and financial risk from AI products that misbehave.

This combination makes Frances, Vin, and the Two Sigma Ventures team exceptional partners for what we are building, which is why we are so thrilled they are leading our Series A. 

Read their full post explaining their take on us and this market opportunity, and sign up for Distributional to start building reliable AI today.

Announcing our $19M Series A

Today, we announced we raised $19 million in Series A funding led by Two Sigma Ventures with participation from Andreessen Horowitz, Operator Collective, Oregon Venture Fund, Essence VC, Alumni Ventures, and dozens of angel investors.

Below is the press release announcement in full. Please sign up for Distributional to learn more.


Distributional Secures $19 Million in Series A Funding and Debuts the Industry’s Only Enterprise AI Testing Platform

With fresh funding, Distributional helps enterprises rein in AI code development by standardizing and automating AI testing

Distributional, the modern enterprise platform for AI testing, announced today that it has raised $19 million in Series A funding led by Two Sigma Ventures with participation from Andreessen Horowitz, Operator Collective, Oregon Venture Fund, Essence VC, Alumni Ventures, and dozens of angel investors. The new round brings Distributional’s total capital raised to $30 million less than one year since incorporation. The milestone also aligns with the initial enterprise deployments of its AI testing platform that gives AI engineering and product teams confidence in the reliability of their AI applications, reducing operational AI risk in the process.  

Unlike traditional software testing, AI testing needs to be done consistently and adaptively over time on a meaningful amount of data due to AI being inherently probabilistic and dynamic. As the power and pervasiveness of AI applications grows, so does the need for better AI testing. The various operational risks of deploying faulty products are becoming increasingly significant to a business’s financial, regulatory and reputational bottom line. 

“Between my previous line of work optimizing AI applications at SigOpt and deploying AI applications at Intel, and through conversations with Fortune 500 CIOs, it became clear that reliability of AI applications is both critical and challenging to assess,” said Scott Clark, co-founder and CEO of Distributional. “With Distributional, we have built a scalable statistical testing platform to discover, triage, root cause, and resolve issues with the consistency of AI/ML application behavior, giving teams confidence to bring and keep these applications in production.” 

Distributional is built to test the consistency of any AI/ML application, especially generative AI, which is particularly unreliable since it is prone to non-determinism, or varying outputs from a given input. Generative AI is also more likely to be non-stationary, with many shifting components that are outside of the control of developers. As AI leaders come under increasing pressure to ship generative AI, Distributional helps automate AI testing with intelligent suggestions for augmenting application data, recommended tests, and a feedback loop that adaptively calibrates these tests for each AI application being tested.

“We are inspired by Distributional’s mission of making AI reliable so teams are confident deploying it across their full set of use cases, maximizing the impact of AI on their organizations in the process,” said Frances Schwiep of Two Sigma Ventures. “By building for enterprise scale, precision, and flexibility from day one, Distributional occupies a unique position in the broader landscape of AI testing, monitoring, and operations. We have strong conviction in the Distributional team’s deep expertise in the field, as evidenced by how the company is already addressing both the complexity and scale of its design partners in finance, technology, and industrial sectors.”

Distributional’s platform allows AI product teams to proactively and continuously identify, understand and address AI risk before customer impact. Prominent features include: 

  • Extensible Test Framework: Distributional’s extensible test framework enables AI application teams to collect and augment data, test on this data, alert on test results, triage these results, and resolve these alerts through either adaptive calibration or analysis driven debugging. This framework can be deployed as a self-managed solution in a customer VPC and is fully integrated with existing datastores, workflow systems, and alerting platforms.
  • Configurable Test Dashboard: Teams use Distributional’s configurable test dashboards to collaborate on test repositories, analyze test results, triage failed tests, calibrate tests, capture test session audit trails, and report test outcomes for governance processes. This enables multiple teams to collaborate on an AI testing workflow throughout the lifecycle of the underlying application, and standardize it across AI platform, product, application, and governance teams. 
  • Intelligent Test Automation: Distributional makes it easy for teams to get started and scale AI testing by automating data augmentation, test selection, and the calibration of these steps in an adaptive preference learning process. Intelligence is the flywheel that fine-tunes a test suite to a given AI application throughout its production lifecycle and scales testing across all properties of all components of all AI applications.

For more information on Distributional, visit www.distributional.com

About Distributional 

Distributional is building the modern enterprise platform for consistent, adaptive, and reliable AI testing. Unlike traditional software testing, AI testing needs to be done consistently over time on a meaningful amount of data. Enterprise CIOs, CTOs, and AI product teams use Distributional to proactively and continuously identify, understand, and address AI risk before it harms their customers. Notable backers include Two Sigma Ventures, Andreessen Horowitz, Operator Stack, SV Angel, Two Sigma Investments, Willowtree Investments, and dozens of AI leaders. Distributional was founded in September 2023 by a team with experience building, optimizing, and testing AI systems at Bloomberg, Google, Intel, Meta, SigOpt, Slack, Stripe, Uber, and Yelp.

Media Contact

Inkhouse for Distributional

distributional@inkhouse.com 

We raised $11M for better AI testing

Summary

  • As the capacity of AI across enterprise tasks grows, so does its potential risk to these businesses and their customers. Every day there is a new report of AI bias, instability, failure, error or other issues. 
  • This is a problem with massive scale. Marc Andreessen has called AI correctness and security trillion-dollar software problems. 
  • Distributional is building the modern enterprise AI testing and evaluation platform designed to enable our customers to identify, understand and address AI risk before AI-enabled products are deployed.
  • Distributional has an 11-person founding team led by Scott Clark, co-founder and CEO of SigOpt, acquired by Intel in 2020, as well as a team of AI, platform and research engineers from Bloomberg, Google, Intel, Meta, SigOpt, Slack, Stripe, Uber and Yelp.
  • To fuel our product vision, we are announcing an $11M Seed round led by Andreessen Horowitz with participation from Operator Stack, Point72 Ventures, SV Angel, Two Sigma and Willowtree Investments.

Introducing Distributional

“How do you test these models today?” I asked the head of the AI platform engineering team at a company that relies on thousands of models in production as part of its core business. 

“We have over 500 engineers and analysts who are responsible for deep testing and retesting of every model on a daily basis. If these models shift, they are responsible for finding, evaluating and fixing these issues.” 

“Do they have any standardized tools to do this work systematically? Do you aspire to this?”

“No, they each choose their own approach. And, yes, we would like to automate testing but haven’t found the right approach yet.”

In recent months, I have had what feels like the same conversation with dozens of AI leaders in finance, technology, energy, semiconductors, pharmaceuticals, consulting, software and manufacturing. AI – whether traditional machine learning, deep learning, generative AI or the large language models (LLMs) dominating the generative space – is complex, often unpredictable, and constantly changing. Whether from hallucinations, instability, inaccuracy, integration or dozens of other potential challenges, these teams struggle to identify, understand and address AI risk with depth or at scale.

I am often astonished by the differences between traditional software engineering and AI-enabled software development. Testing is standard for traditional software. Teams try to maximize the coverage of their tests and root out “flaky” tests that spuriously fail as they bridge the gap between development and deployment. Engineering teams run unit, regression and integration tests in scalable CI/CD pipelines before putting code in production.

But when AI is added, introducing more math and randomness, the complexity of testing these systems explodes on a variety of dimensions at once. Standard tools no longer work for this purpose. AI models are often given a pass because they are “too complex” and unpredictable. This introduces too much uncertainty, so coverage alone is no longer enough. Proper AI testing requires depth, which is a hard problem to solve.

This is in part why AI has been described as the high-interest credit card of technical debt. A huge part of this debt is insufficient testing. Most teams choose to assume model behavior risk and accept that models will have issues. Some may try ad-hoc manual testing to find these issues, which is often resource intensive, disorganized, and inherently incomplete. Others may try to passively catch these issues with monitoring tools after AI is in production. In many cases, teams choose to avoid AI even when it could be useful for their applications. In all cases, these teams know there is significant risk around their AI-enabled applications and that they need more robust testing to understand and address it. And, increasingly, these teams may also be required to do this through shareholder pressure, government regulation, or industry standards.

We founded Distributional to solve this problem. Our mission is to empower our customers to actively make their AI-based products safer, more reliable, and more secure before they deploy them. We aim to catch harm before their customers or users do.

To pursue this mission, we raised an $11M seed led by Andreessen Horowitz with Martin Casado joining the board and with participation from Operator Stack, Point72 Ventures, SV Angel, Two Sigma, Willowtree Investments, and more than 40 other AI leaders in industry and academia as angel investors. In a recent interview with Martin, Marc Andreessen said, “to make AI generally useful in a way that is guaranteed to be correct or secure – these are two of the biggest opportunities I’ve ever seen in my career.” Armed with their deep support and expertise, our founding team of 11 is poised to realize this opportunity. 

A Decade of AI Testing Problems

Our partners, customers, and investors give our team a broad perspective on this problem. But what makes our perspective unique is that we combine their insights with our direct experience attempting to solve versions of this problem for nearly a decade. 

2014: AI Evaluation

We first saw this problem in our own software at SigOpt, the AI startup I previously founded, when building our optimization and experimentation platform for enterprise scale in 2014. We had developed cutting-edge ways to efficiently optimize complex systems, but were constantly exploring new techniques to improve this solution. To feel confident deploying new algorithmic solutions, we needed to rigorously test them and have confidence in their robustness.

We considered A/B testing, but we couldn’t run these tests in production due to the risk of real customer harm. Not to mention, this approach was antithetical to our value proposition of extremely efficient optimization. We also considered standard frameworks for benchmarking optimization methods, but couldn’t find one designed to robustly compare results from stochastic methods. With no available solution, our team instead built an evaluation framework and published it at the ICML workshop on optimization in 2016. 

Being able to confidently claim we had the best, and most tested, optimization framework became one of our strongest competitive advantages in the years to come. More importantly, this evaluation process exposed valuable insights on product performance. It was often shocking which methods looked good in a paper but did not perform well when exposed to rigorous testing. By testing continuously, we were able to cut out poor-performing techniques before they ever made it to our users. Although we were proud of our invention, even our team believed it would have been great to use standardized tooling here instead of needing to build it ourselves from scratch.

2016-2020: AI Robustness

After we established SigOpt as a reliable, sample-efficient product for optimizing black box systems, our product was increasingly used by sophisticated companies deploying AI as a core component of their product or revenue strategy. These teams had high upside for boosting performance of their models, but also significant downside if they didn’t perform as expected. So they often valued robustness as much or more than performance. 

For example, if one of our clients were to utilize a brittle model to make important business decisions, subtle shifts in inputs could lead to widely varying outputs and suboptimal outcomes. As SigOpt made these models better and more powerful, the need for robustness – and the tradeoff between robustness and maximum potential performance – became more important.

It is often better to have a solution at 90% of perfect all the time than a solution that wildly oscillates between 99% and 10%. This is a very difficult problem in high dimensions of input and output where traditional perturbation analysis is prohibitively expensive. 

Once you find an optimal model, how can you evaluate whether it is brittle? And how do you make sure this understanding of optimal performance and relative brittleness doesn’t shift over time? 

As we saw the rise of this use case across our user base, we designed a purpose-built solution to this problem called Constraint Active Search and published it at ICML 2021. This algorithmic technique allowed these teams to set constraints on a variety of metrics and run experiments that would actively probe and produce a variety of performant models that satisfied these constraints. Users loved this feature because it allowed them to effectively and efficiently optimize their model reliably against different permutations of input parameters in ways they never could before. In turn, they built more intuition on model robustness and had more confidence that the model they deployed wouldn’t significantly degrade with shifts in input distributions.

2022: Continuous AI Testing at Scale

In October 2020, Intel acquired SigOpt. At Intel, I had the privilege of leading the AI and HPC software teams in the Supercomputing division that was bringing Intel’s next generation of GPUs and HPC-oriented CPUs to market. In this role, I managed over one hundred engineers with the purpose of running, evaluating, debugging and evolving AI and HPC workloads for each new processor we were bringing to market. Given the sophistication of our customers, most of this work involved complex AI and physical modeling. This process translated to our teams orchestrating up to thousands of AI test workloads daily.

As we built out the full software stack for this task, there were robust frameworks in place for traditional software testing, but nothing similar for AI.  As a result, our team was forced to spend most of its time and energy manually designing, instrumenting, executing and analyzing tests for AI workloads. We explored options for supporting this workflow with software, but couldn’t find a robust enough solution or a reliable testing framework. Although we had ambitions for continuous testing, this simply wasn’t attainable without automation in place. One member of the executive team called AI testing a “million dollar per day problem for companies operating at this size and scale.” This was a huge problem, but there were no good off-the-shelf solutions internally or externally to address it.

Better Testing, Greater AI Impact

Through conversations with AI product leaders in finance, energy and tech, I have come to realize that these are quite common issues. These leaders agree that software requires testing, but traditional testing methods and frameworks were built around assumptions that do not hold for applications built on AI. 

Engineers are often forced to test these models by partially fitting them into legacy testing frameworks (often only testing metric thresholds and summary statistics), applying qualitative analysis as they build models (using visualizations or hand-constructed examples in notebooks to gain intuition and confidence), or shifting their problem to their users and customers by letting them test it live via online monitoring. As a consequence, AI is incompletely and non-continuously tested today. This exposes the business to significant risk, high opportunity cost, or both. 
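
A small synthetic example of why threshold-and-summary-statistic testing falls short: both runs below pass a mean-score threshold, yet the second run’s scores have split into two modes, which a distributional test catches immediately. The data and threshold are fabricated purely for illustration.

```python
# Sketch: a hard threshold on a summary statistic can miss a real change.
# Both runs pass "mean score >= 0.7", but the second run's distribution has
# split into two modes; a distributional test flags it. Data is synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
baseline = rng.normal(loc=0.75, scale=0.03, size=500)        # tight around 0.75
current = np.concatenate([rng.normal(0.95, 0.02, 250),       # roughly the same mean,
                          rng.normal(0.55, 0.02, 250)])      # but now bimodal

print(f"baseline mean={baseline.mean():.3f}, current mean={current.mean():.3f}")
print(f"mean-threshold check passes for both: {baseline.mean() >= 0.7 and current.mean() >= 0.7}")

_, p_value = stats.ks_2samp(baseline, current)
print(f"distributional (KS) test p-value: {p_value:.2e} -> clear behavioral change")
```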

When I ask CIOs how they know that their models haven’t introduced bias or gone off the rails, they often say that their only recourse is to constantly monitor metrics and feedback, passively waiting for something to go wrong. While monitoring is an important part of any software product, the confidence of rigorous, active testing allows product teams to deploy without fear that something catastrophic is right around the corner, discoverable only by their users after the fact.

As AI becomes more powerful and prevalent, it becomes increasingly important to make sure it is tested and performing as expected. We hope to usher in a virtuous cycle for our customers. With better testing, teams will have more confidence deploying AI in their applications. As they deploy more AI, they will see its impact grow exponentially. And as they see this impact scale, they will apply it to more complex and meaningful problems, which in turn will need even more testing to ensure it is safe, reliable, and secure.

What’s Next?

We aspire to help our customers realize this future and would love your help along the way. We are collaborating with more than a dozen co-design partners to build the modern enterprise platform for AI testing, but are always interested in expanding the scope of our collaboration.

Here are ways to get involved: 

  • Sign up for early access to our private beta 
  • Let us know your interest in joining the team
  • Read this post on the market opportunity from Martin Casado at a16z
  • Read this post on the product problem from Noah Carr at Point72 Ventures
  • Follow us on LinkedIn, X/Twitter and Youtube
  • Reach out to share your perspective

Announcing our seed funding to make AI safe, reliable and secure

Today, we announced we raised $11 million in seed funding led by Andreessen Horowitz with participation from Operator Stack, Point72 Ventures, SV Angel, Two Sigma and Willowtree Investments. Below is the press release announcement in full. Please reach out to the team at contact@distributional.com or fill out our form at distributional.com/sign-up if you want to learn more!

===============

Today, Distributional announced that it has raised $11 million to build the modern enterprise platform for artificial intelligence (AI) testing and evaluation, with the goal of making all forms of AI safe, secure and reliable. The Seed round was led by Andreessen Horowitz with participation from Operator Stack, Point72 Ventures, SV Angel, Two Sigma, Willowtree Investments and dozens of AI leaders as angel investors. 

“I directly experienced this testing problem while applying AI at Yelp, optimizing models for customers at SigOpt and running a hundred-person AI & HPC engineering team at Intel,” says Scott Clark, Co-Founder and CEO of Distributional. “I learned that to robustly test AI I needed to evaluate distributions of outcomes and that there is no purpose-built software for this task.”

AI is complex, unpredictable and constantly changing. Whether due to hallucinations, instability, inaccuracy, or dozens of other potential challenges, it can be hard to identify, understand and address AI risk. To meet this challenge, some AI product teams rely on insights gathered during training that rarely translate to model behavior in production. Others rely on monitoring to quickly catch errors in production, but this leaves their customers exposed to potential harm or a poor user experience. And some teams run bespoke tests on their models prior to production, but these tests are inconsistent, incomplete and insufficient. 

“Lack of reliability in AI systems is one of the biggest barriers to widespread enterprise adoption,” says Martin Casado, General Partner at Andreessen Horowitz. “We are excited for Distributional to address this problem by building a platform for robust and repeatable AI testing.” 

Distributional is working with more than a dozen design partners to build an active testing platform that makes it easy for AI product teams across finance, technology, energy and manufacturing industries to get a complete view of AI risk. The platform will handle all model types, including statistical models, machine learning, deep learning, large language models and other forms of generative AI. With Distributional, AI product teams will continuously catch and address issues before production. 

“A number of AI product managers that I have spoken with have told me models are failing in production with increasing regularity,” says Noah Carr, partner at Point72 Ventures. “As a result, I believe generative foundation models are becoming more critical. As demand for implementations grows, so does the potential risk that applications leveraging these models will be pulled offline due to issues related to model shift or exposure to misinformation. We are excited to back Distributional’s efforts to enable these teams to catch such issues before their customers do.”

Distributional was founded by CEO Scott Clark and an 11-person founding team with experience testing complex AI systems at Bloomberg, Google, Meta, Intel, SigOpt, Slack, Stripe, Uber and Yelp. Scott previously co-founded the pioneering AI startup SigOpt, which was funded by Andreessen Horowitz in 2016 and acquired by Intel in 2020. 

Distributional is remote first and will use the investment to further develop its product and grow its team. The company plans to launch its enterprise product in the second half of 2024. 

For more information, please visit distributional.com.  

About Distributional

Distributional is building the modern enterprise platform for AI testing and evaluation to make AI safe, secure and reliable. As the power of AI applications grows, so does the risk of harm. AI product teams use our platform to proactively and continuously identify, understand and address AI risk before it harms their customers in production. Distributional is backed by Andreessen Horowitz, Operator Stack, Point72 Ventures, SV Angel, Two Sigma, Willowtree Investments and dozens of AI leaders. Distributional was founded in September 2023 by a team with experience testing AI systems at Bloomberg, Google, Intel, Meta, SigOpt, Slack, Stripe, Uber and Yelp. 

Contact: press@distributional.com

Purpose built testing for AI

Testing is standard for traditional software. Teams try to maximize the coverage of their tests and root out “flaky” tests that spuriously fail as they bridge the gap between development and deployment. Engineering teams run unit, regression and integration tests in scalable CI/CD pipelines before putting code in production.

But in AI, the complexity of testing explodes on a variety of dimensions at once. Inputs, or the distribution of these inputs, often shift over time. Upstream and downstream dependencies for feature generation or model execution are tough to assess individually, let alone together in an entire system or pipeline. Summary statistics hide bias or other issues. Hard-coded thresholds fail to evolve with models in production, throwing off meaningless alerts that development teams learn to ignore. Underlying frameworks evolve, and integration creates bugs. Some AI architectures, such as the transformers behind generative AI, are intrinsically random and can produce different outputs when given the same input. AI requires purpose-built testing.
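
To illustrate the alternative to hard-coded thresholds, here is a minimal sketch of calibrating an alert threshold from accepted baseline behavior and recalibrating as newly reviewed behavior is accepted. The percentile choice and latency data are assumptions for illustration, not how any particular product implements it.

```python
# Sketch: calibrate an alert threshold from a baseline distribution instead of
# hard-coding it, then recalibrate as accepted behavior accumulates.
import numpy as np

def calibrate_threshold(accepted_values, percentile=99):
    """Alert only on values beyond what accepted history supports."""
    return float(np.percentile(accepted_values, percentile))

rng = np.random.default_rng(7)
accepted_latency_ms = list(rng.lognormal(mean=6.0, sigma=0.25, size=2_000))

threshold = calibrate_threshold(accepted_latency_ms)
print(f"calibrated alert threshold: {threshold:.0f} ms")

# When a flagged batch is reviewed and accepted as expected behavior, fold it
# back in and recalibrate, so the test adapts instead of alerting forever.
reviewed_batch = list(rng.lognormal(mean=6.2, sigma=0.25, size=200))
accepted_latency_ms.extend(reviewed_batch)
threshold = calibrate_threshold(accepted_latency_ms)
print(f"recalibrated alert threshold: {threshold:.0f} ms")
```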

I was excited to hear that Noah Carr from Point72 Ventures had been evolving his own nuanced understanding of this problem. He had detailed examples of how the lack of rigorous testing was a significant barrier to enterprise productionalization of AI products. And he had a clear view on the complexity that would be required to develop the right solution. When we decided to form Distributional, we were excited to partner with Noah on this journey.

Noah recently published his take on why purpose-built testing is so critical for AI models and the AI products powered by them. I encourage you to read it.

Read Noah’s post: https://p72.vc/perspectives/our-investment-in-distributional/

Addressing a critical barrier to enterprise AI adoption

Our mission is to build software to help make AI safe, reliable and secure. Our vision is for this type of testing to enable safe use of AI across all use cases, maximizing its impact. We’ve assembled a founding team of researchers, engineers and leaders with experience designing these systems at Bloomberg, Google, Intel, Meta, SigOpt, Slack, Stripe and Uber. And we raised $11M in Seed funding led by Andreessen Horowitz to make this a reality for our enterprise customers. 

Our relationship with Andreessen Horowitz dates back to 2016 when they funded SigOpt, the AI startup that Scott Clark, Distributional’s CEO, co-founded, ran, scaled and sold to Intel in 2020. We’ve been lucky to work with and stay in touch with Martin Casado and Matt Bornstein as enterprise AI has evolved from one-off projects to at scale products, from random forests to transformers.

During this time, we’ve seen the potential for AI explode. AlphaFold solved protein folding to enable drug and materials discovery. Stable Diffusion reduced the marginal cost of image and video generation by orders of magnitude. And ChatGPT popularized these techniques for the masses – my parents are Plus users today. These are just a few examples and it feels like we have only scratched the surface on AI’s potential.

As the power of these systems grows, so does their potential for harm. If you can discover a new molecule to save a life, you can also discover a molecule to take one. Although examples from the hard sciences are often more striking, this is also true in the enterprise. We’ve spoken to many companies that have shelved AI products because they lack ways to understand and mitigate this potential for harm.

Martin and Matt have heard similar stories and believe this is a critical problem to solve. And they wrote a post explaining their take that I encourage you to read. We are thrilled to partner with them to remove this substantial barrier to enterprise adoption of these powerful AI systems.

Read a16z’s post on investing in Distributional: https://a16z.com/announcement/investing-in-distributional/