Purpose-built testing for AI
Testing is standard practice for traditional software. Teams work to maximize test coverage and root out “flaky” tests that fail spuriously, bridging the gap between development and deployment. Engineering teams run unit, regression, and integration tests in scalable CI/CD pipelines before putting code into production.
But with AI, the complexity of testing explodes along many dimensions at once. Inputs, or the distribution of those inputs, shift over time. Upstream and downstream dependencies for feature generation or model execution are hard to assess individually, let alone together across an entire system or pipeline. Summary statistics hide bias and other issues. Hard-coded thresholds fail to evolve with models in production, firing meaningless alerts that development teams learn to ignore. Underlying frameworks evolve, and integrating them introduces bugs. Some AI architectures, such as the transformers behind generative AI, sample their outputs stochastically and can produce different results for the same input. AI requires purpose-built testing.
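To make the threshold problem concrete, here is a minimal sketch in plain NumPy/SciPy (not Distributional’s product; the variable names, cutoffs, and simulated scores are illustrative assumptions): a fixed mean-based check can keep passing while the model’s output distribution quietly drifts, whereas a test over the full distributions catches the shift.

```python
# Hypothetical example: brittle hard-coded threshold vs. a distribution-level check.
# `baseline_scores` and `current_scores` stand in for model outputs captured
# at deploy time and again weeks later.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
baseline_scores = rng.normal(loc=0.80, scale=0.05, size=1_000)  # outputs at deploy time
current_scores = rng.normal(loc=0.74, scale=0.09, size=1_000)   # outputs weeks later

# Brittle test: a single summary statistic against a fixed threshold.
# It still passes even though the shape and spread of the outputs have changed.
assert current_scores.mean() > 0.70, "mean score regression"

# Distribution-aware test: compare the full output distributions.
# A two-sample Kolmogorov-Smirnov test flags the shift the mean check missed.
statistic, p_value = ks_2samp(baseline_scores, current_scores)
if p_value < 0.01:
    print(f"Output distribution shifted (KS={statistic:.3f}, p={p_value:.2e})")
```

The point of the sketch is the contrast in what each test looks at, not the specific statistic; a real testing system has to handle many outputs, dependencies, and evolving baselines at once.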
I was excited to hear that Noah Carr from Point72 Ventures had been developing his own nuanced understanding of this problem. He had detailed examples of how the lack of rigorous testing was a significant barrier to enterprises putting AI products into production. And he had a clear view of the complexity that would be required to develop the right solution. When we decided to form Distributional, we were excited to partner with Noah on this journey.
Noah recently published his take on why purpose-built testing is so critical for AI models and the AI products powered by them. I encourage you to read it.
Read Noah’s post: https://p72.vc/perspectives/our-investment-in-distributional/