How Good is Good Enough: Subjective Testing and Manual LLM Evaluation

In the previous article, we talked about the highest level of testing and evaluation for LLMs, and went into detail about some of the most commonly used benchmarks for validating LLM performance at a high level. Today, we’re going to look at some more fine-grained evaluation metrics that you can use while building an LLM-based tool.

Here we make the distinction between statistical metrics - that is, those computed using a statistical model - and more generalised metrics that attempt to measure the more ‘subjective’ elements of LLM performance, such as those used in manual testing and those that use AI to evaluate how useful a model is in its given context.

In this article we’ll give an overview of the different classes of metrics used and cover human evaluation and its importance before moving on to common statistical metrics and LLM-as-Judge evaluations in the following articles.

But first, let’s make one thing clear…

This is probably a very obvious statement. As we’ve said in previous articles, LLMs are non-deterministic and cover a huge test surface. A lot of elements of their output have a subjective quality (is this helpful? is the tone correct? how well does the output align with the prompt? etc.) that is difficult to evaluate. LLMs also have a lot of moving parts and component level complexities.

Thus, it’s generally advisable to have a wide variety of signals to evaluate their performance, and not to discount the value of targeted manual and exploratory testing. Statistical metrics have their place, and are generally reliable, but they lack accuracy when capturing the complexity of LLM outputs. 

For an assessment that’s more ‘accurate’ (i.e. closer to human evaluation) but less reliable (less deterministic), LLMs can in turn be used to evaluate the output of other LLMs. This is known as LLM-as-a-Judge.

I know how it sounds - this is an ouroboros where AI validates AI and invariably leads to a Terminator Skynet scenario. Okay, I’m obviously being hyperbolic here, but LLM-as-a-Judge evaluations more closely resemble how humans evaluate the outputs from LLMs, and are thus very useful. Ultimately, though, a mix of metrics and human oversight will get you the most useful overall evaluation of your LLM.

There are also combinatorial metrics that use a mix of LLM and statistical evaluators for a “best of both worlds” approach, which we will cover in a later article.

Manual Testing and Subjective Evaluation Metrics

Manual testing is often underestimated, undervalued, and underused - and not just in the AI sphere. Why? Because it’s expensive, and with the vastness of the test surface of some tools - especially LLMs - it’s impossible for human testers to give quantitative assessments. That said, manual testing done by humans will give you data points that are difficult (or impossible) for automated testing and instrumentation to pick up.

The value of manual testing is particularly clear when dealing with subjective elements and when simulating user behaviour - it probably goes without saying that humans inherently simulate human behaviour better than machines.

Thus I recommend that anyone building LLM-based products not discount the value - or even necessity - of manual testing done by humans. LLM-as-a-Judge goes some of the way to replicating human test activities, and can operate at a larger scale than human testers, but it isn’t quite the same thing.

The most obvious starting point is exploratory testing - give your product to testers and ask them to do anything and everything from: 

There’s a lot more to be said about exploratory testing - in my long career in QA, some of the most critical bugs my team or I have discovered were found through exploratory testing - but this isn’t the focus of this article. You came here for metrics after all - let’s look at how we can generate metrics based on human testing.

To get the most value out of human-generated metric scores, a common approach is to have human testers work through a fixed-size test set with the LLM product and score each response on subjective values. This closely mirrors supervised learning, and the scores can be partitioned by question type and aggregated over many testers and problem spaces.
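As a rough sketch of that partition-and-aggregate step, the snippet below groups tester scores by question type and metric and averages them. The `Rating` fields, the tester names, and the "billing" and "returns" question types are all hypothetical, chosen only for illustration; a real setup would likely carry more context per rating.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Rating:
    tester: str         # who gave the score
    question_type: str  # partition key, e.g. "billing" (hypothetical)
    metric: str         # subjective value being scored, e.g. "helpfulness"
    score: int          # the tester's score for this response


def aggregate(ratings):
    """Return the mean score for each (question_type, metric) pair."""
    buckets = defaultdict(list)
    for r in ratings:
        buckets[(r.question_type, r.metric)].append(r.score)
    return {key: mean(scores) for key, scores in buckets.items()}


# Made-up ratings from two testers across two question types.
ratings = [
    Rating("tester_a", "billing", "helpfulness", 4),
    Rating("tester_b", "billing", "helpfulness", 2),
    Rating("tester_a", "returns", "helpfulness", 5),
    Rating("tester_b", "returns", "helpfulness", 4),
]

summary = aggregate(ratings)
for (question_type, metric), avg in sorted(summary.items()):
    print(f"{question_type:10} {metric:12} {avg:.2f}")
```

Averaging is only one possible choice here; with a small pool of testers you may prefer to look at the raw scores side by side before collapsing them into a single number.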

There are many ways to generate test sets - LLMs or other tools can be used to generate synthetic test data, or you can use real-world production data - but if you’re using human testers it’s probably best to give them looser, less deterministic test cases that have a goal but no test steps - i.e. User-Acceptance Testing (UAT).

Test cases - also called user stories in the context of UAT - should be simple and non-specific to guide a more naturalistic type of testing.

These tests should cover a variety of domains that reflect the intended capabilities of the product, and have a user profile that aligns with their goals.

For example your UAT tests could look something like:

Leaving the test cases loose and goal-oriented means that testers won’t get caught in the weeds or fuss over details, and they will approach the task in a way that more closely aligns with real user behaviour.
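One way to keep test cases loose in practice is to store only a user profile, a goal, and the domain being exercised, and nothing resembling test steps. The sketch below assumes a customer-service chatbot; the `UATCase` class, the `brief` helper, and both example cases are hypothetical and only meant to show the shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UATCase:
    user_profile: str  # who the tester is pretending to be
    goal: str          # what that user wants to achieve
    domain: str        # product capability being exercised


# Hypothetical cases for a customer-service chatbot.
cases = [
    UATCase("First-time customer", "Find out how to return a damaged item", "returns"),
    UATCase("Frustrated long-term customer", "Dispute a charge you do not recognise", "billing"),
]


def brief(case):
    """Render the instruction card handed to a tester: a goal, no steps."""
    return f"You are: {case.user_profile}. Your goal: {case.goal}."


for case in cases:
    print(brief(case))
```

Note that there is deliberately no field for steps or expected output: how the tester gets to the goal is left up to them.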

There’s plenty to choose from (and some of these will overlap with metrics that you can use to scale up with LLM-as-a-Judge solutions), but the important thing is that you choose metrics that align with the goal of your product.

Of course, the metrics to use depend entirely on the type of product you’re creating. If you’re building an LLM product for a specific use case, you should consider which subjective metrics are useful to you. For example, if you’re designing a chatbot to handle customer service interactions, you probably want to consider measuring something like “helpfulness”, “politeness”, or “professionalism”.

Scoring can be bounded however you want, but I’d usually suggest scoring each metric on a scale of 1 to 5. This is a pretty good range to choose because it’s fine-grained enough to allow for subtlety but short enough to provide an intuitive grading system (e.g. very good, good, neutral, bad, very bad). It’s much easier to rate something out of five than out of ten.
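If you collect scores through a form or spreadsheet, it can help to pin the scale down in code so that out-of-range values are caught early. This is a minimal sketch, assuming the five labels above map to 1 through 5 with 5 as the best score; the `Score` enum and `parse_score` helper are hypothetical names.

```python
from enum import IntEnum


class Score(IntEnum):
    """Five-point scale; 5 is assumed to be the best score."""
    VERY_BAD = 1
    BAD = 2
    NEUTRAL = 3
    GOOD = 4
    VERY_GOOD = 5


def parse_score(raw):
    """Convert a tester's raw input to a Score, rejecting invalid values."""
    try:
        return Score(int(raw))
    except ValueError as exc:
        raise ValueError(f"score must be a whole number from 1 to 5, got {raw!r}") from exc


print(parse_score("4").name)  # GOOD
```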

You will probably want to have specialised metrics for anything related to sensitive or harmful content too, and to design specific test cases to address this. For example, you probably don’t want your LLM to say anything racist, or to give harmful medical advice.

We’ve discussed the importance of manual testing, the problem caused by subjectivity, and the strengths and weaknesses of different genres of metrics. Manual testing is unmatched in qualitative assessment, while statistical metrics are reliable and scalable but aren’t accurate on subjective outputs. In truth, both are required to assess complex products - especially LLMs.

Remember: Humans are subjective creatures, and no two testers are likely to give you the exact same scores. But the insights gained at this stage can be invaluable for spotting patterns in product behaviour and discovering issues of a subjective nature. Humans also excel at discovering potentially harmful or sensitive content.

In our next article we’ll cover some of the most common statistical metrics you can use to validate your LLMs.
