It's not only developers who need to test A.I. apps

Composo wants to help less technical experts get involved

PreSeed Now brings a new super-early-stage B2B or deep tech startup every Tuesday and Thursday - profiled in depth.

Sure, you want to build an A.I. app, but have you thought about how you’ll test it to make sure it doesn’t do something horribly wrong and upset your customers?

While some tools already do this for developers themselves, Composo wants to make life easier for less technical experts who need to test the A.I. too. Read all about them below.


  • Hey startups: the reborn Tech Nation’s Rising Stars competition looks well worth applying to, and there’s less than a week to get your application in

– Martin

Composo wants to help non-developers test generative A.I. apps, too

Composo co-founders Luke Markham and Seb Fox

As developers around the world race to apply large language models to any and every software application you can think of, something’s holding the careful ones back.

If you’re a large company and you want to introduce an LLM chatbot for your customers, you absolutely do not want it to start inventing product lines that don’t exist or begin spouting questionable opinions about irrelevant topics.

But that’s the kind of thing LLMs tend to do if you don’t rein them in enough.

Enter Composo, which wants to replace manual, subjective human review of A.I. apps with something more sophisticated, aimed not at developers, but at subject matter experts who need to make sure the A.I. is fit for purpose.

It’s designed to help companies find the highest performing setup for their prompts, models, or architectures. The aim is to maximise accuracy and quality for end users, minimise hallucinations and cost, and pressure test for safety and compliance.

How it works

As CEO Seb Fox explains, Composo can be integrated into A.I. software with a few lines of code. 

It allows non-technical or less-technical users to then conduct experiments with the LLM from the Composo interface, to test its performance and automatically evaluate it.

You can find a video walkthrough of the product on Loom.

“The way we do that evaluation is essentially by looking at the end output, whether it's a legal contract that's generated or it's a conversation with a medical chatbot, in relation to what the input was, and use a series of criteria or metrics to evaluate whether that is good or not,” says Fox.

Composo offers its own built-in criteria for factors like hallucinations, harmful outputs, and the like. But users can specify their own, too. So an HR chatbot might need to be tested to make sure it stays focused on the background of a specific job candidate when talking with them, for example.
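To make the idea of built-in and user-defined criteria concrete, here is a minimal Python sketch. It is not Composo's API or method: the names `Criterion`, `non_empty`, `avoids_terms` and `evaluate` are hypothetical, and the simple keyword check stands in for the far richer judgement a real evaluation of natural language would need.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Criterion:
    """A named check applied to an (input, output) pair. Hypothetical type."""
    name: str
    check: Callable[[str, str], bool]


def non_empty(user_input: str, output: str) -> bool:
    """Generic criterion: the app must return some text."""
    return bool(output.strip())


def avoids_terms(banned: List[str]) -> Callable[[str, str], bool]:
    """Build a custom criterion that fails if any banned term appears in the output."""
    lowered = [term.lower() for term in banned]

    def check(user_input: str, output: str) -> bool:
        text = output.lower()
        return not any(term in text for term in lowered)

    return check


def evaluate(user_input: str, output: str, criteria: List[Criterion]) -> Dict[str, bool]:
    """Run every criterion against one input/output pair."""
    return {c.name: c.check(user_input, output) for c in criteria}


# Illustrative criteria for an HR chatbot (the banned terms are made up).
criteria = [
    Criterion("non_empty", non_empty),
    Criterion("stays_on_candidate_background", avoids_terms(["politics"])),
]

candidate_message = "I led a team of five engineers."
on_topic = evaluate(
    candidate_message, "Thanks. What did you learn from leading that team?", criteria
)
off_topic = evaluate(
    candidate_message, "Interesting. What are your views on politics?", criteria
)
print(on_topic)
print(off_topic)
```

Keeping each criterion as a named function means a subject matter expert's requirement can be added to the list without touching the evaluation loop.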

At present, this kind of testing is done in Composo by simply asking the human testing the LLM to confirm when outputs have the right format, type of information, etc., but Fox says the next stage will be to take a more sophisticated approach.

This will see the human choose a preference for what the best output is, and provide a comment or label as to why it is good. That data will be used to train reward models that will be able to identify good and bad results automatically. 
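As a rough illustration of how preference data can train a reward model, the Python sketch below fits a tiny linear model with a pairwise (Bradley-Terry style) objective, a common approach for learning from "A is better than B" labels. Everything in it is an assumption made for illustration: the article does not say how Composo will build its reward models, and the `Preference` record, the two hand-written features and the example data are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Preference:
    """One human judgement: which output was better, and why (hypothetical record)."""
    prompt: str
    chosen: str
    rejected: str
    comment: str


def features(output: str) -> List[float]:
    """Toy features; a real reward model would learn from the text itself."""
    words = output.split()
    return [
        1.0 if "source:" in output.lower() else 0.0,  # does it cite a source?
        min(len(words), 100) / 100.0,                 # capped, scaled length
    ]


def reward(weights: List[float], output: str) -> float:
    """Linear reward: higher should mean 'more like what the expert preferred'."""
    return sum(w * f for w, f in zip(weights, features(output)))


def train(prefs: List[Preference], epochs: int = 200, lr: float = 0.5) -> List[float]:
    """Gradient ascent on log sigmoid(reward(chosen) - reward(rejected))."""
    weights = [0.0, 0.0]
    for _ in range(epochs):
        for p in prefs:
            fc, fr = features(p.chosen), features(p.rejected)
            margin = sum(w * (a - b) for w, a, b in zip(weights, fc, fr))
            scale = 1.0 - 1.0 / (1.0 + math.exp(-margin))  # 1 - sigmoid(margin)
            weights = [w + lr * scale * (a - b) for w, a, b in zip(weights, fc, fr)]
    return weights


# Made-up preference data from a legal-summary review.
prefs = [
    Preference(
        prompt="Summarise the liability clause.",
        chosen="Liability is capped at fees paid. Source: clause 4.",
        rejected="Liability is probably limited somehow.",
        comment="Cites the clause it relies on",
    ),
    Preference(
        prompt="Summarise the termination terms.",
        chosen="Either party may terminate with 30 days notice. Source: clause 9.",
        rejected="You can end the contract whenever you like.",
        comment="Accurate and referenced",
    ),
]

weights = train(prefs)
good = "Clause 4 limits liability. Source: contract page 2"
bad = "Clause 4 limits liability. Probably fine, trust me"
print(reward(weights, good) > reward(weights, bad))
```

The expert's comment is stored alongside each pair but not used by this toy model; in a fuller system it could guide which features or criteria to learn.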

“Using that, we can create a much more tight feedback loop to optimise applications for real-world use cases,” says Fox.

A demo screenshot of Composo in action

Why it’s important

Obviously, no-one wants their commercial A.I. product to be misleading, confusing, or unexpectedly offensive. But Fox explains that things can get quite nuanced when evaluating an LLM trained on data from certain fields. 

“For a lawyer to assess whether a summary of a corpus of legal documents is a good summary or not, is quite subjective. And so that's where we need in particular, much more of these flexible evaluation criteria that are able to look at this natural language output and use all that richness of what a subject matter expert will be looking for, and evaluate for that. 

“At the moment in a lot of organisations, these domain experts are doing quite manual subjective evaluation. Typically, companies are passing an Excel spreadsheet with a load of outputs to a biomedical expert, or a lawyer, [or insert your favourite domain expert here].

“They then manually go through and mark a rating of whether these are good or not, and pass it back to a developer who then has to figure out how they try and update prompts or other elements.”

In the future, as A.I. regulations begin to be introduced, compliance will be another important reason for rigorous testing, Fox predicts. And of course, the advent of more complex and powerful A.I. applications will require more careful and comprehensive evaluation than is often carried out today.

The story so far

Fox originally studied medicine at Oxford. But then he changed direction and became an associate at McKinsey, where he developed a focus on building and implementing A.I. and life sciences products for customers. 

After shifting over to McKinsey’s A.I.-focused QuantumBlack division for a year as a product manager, he broke out as a startup founder in September last year.

Fox founded Composo with Luke Markham, whom he met at university and who has since become a machine learning engineer.

Their first product experiments were around building A.I. apps for the digital health market. They quickly discovered the kinds of problems with generative A.I. that they’re now trying to solve.

They have now built the Composo product.

Fox says that until now, they’ve been iterating on it with a number of design partners. Those companies have now received access to the full product, and a wider roll-out is planned from next month.

“From Q2, we’re looking to start to really grow the go-to-market, particularly with mid to large enterprises,” says Fox.

As 2024 progresses, Composo plans to move beyond pre-deployment testing of A.I. apps to monitor their performance in use, too.

They also plan to tailor the product more tightly to specific use cases and verticals, and to automate more of the testing process to make it simpler to use.

And there’s more!

Premium subscribers get the full story about Composo:

Upgrade your subscription now to learn about:

  • Composo’s funding and investment plans

  • Seb Fox’s vision for what Composo will ultimately achieve

  • How Composo squares up to a competitive field

  • What challenges face the startup as it grows
