Common Statistical LLM Evaluation Metrics and what they Mean

In a previous article, we touched on statistical metrics and how they can be used in evaluation, and we briefly discussed precision, recall, and F1-score. Today, we’ll go into more detail on how to apply these metrics directly, and look at more complex metrics derived from them that can be used to assess LLM performance.

Precision is a standard measure in statistics and has long been used to measure the performance of ML systems. In simple terms, it measures how many samples a model correctly predicts as positive (true positives) out of all the samples it predicted to be positive (true positives + false positives).

If we take a simple example of an ML tool that takes a photo as an input and tells you if there is a dog in the picture, this would be:

(Samples where model correctly identifies a dog) / (Samples where the model identifies a dog)

Precision is not often used directly to measure LLM performance outside of some specific use cases, but it underpins many statistical metrics and thus is worth knowing about.

Recall is often used in concert with precision and, like precision, isn’t often used directly as an LLM metric but underpins many statistical approaches. This metric refers to the ratio of correctly identified positive samples to the total number of samples that are actually positive (true positives + false negatives).

To go back to our example with dog pictures:

(Samples where model correctly identifies a dog) / (Total number of dog pictures in the test set)
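As a minimal sketch, both ratios for the dog-picture example can be computed from paired lists of true labels and model predictions. The labels below are made-up illustrative data, and the function name is hypothetical.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 means 'dog'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero when nothing is predicted or nothing is positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical test set: 1 = picture contains a dog, 0 = no dog.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # 0.6 0.75
```

Here the model finds 3 of the 4 dogs (recall 0.75) but also flags 2 pictures without dogs, so only 3 of its 5 positive predictions are right (precision 0.6).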

If this is still confusing, the graphic below (thanks Wikipedia) details both precision and recall in an easy-to-understand way.

The F1-score combines precision and recall into a single metric bounded from 0 to 1 where a value of 1 indicates perfect precision and recall. In other words, the F1-score is the harmonic mean of precision and recall. 

Specifically, the F1-score gives equal weight to precision and recall, which may not always be desirable. To offset this, the Fβ score can be used where β is a weight that can emphasise or de-emphasise the importance of either value.
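A short sketch of the general Fβ formula, (1 + β²) · P · R / (β² · P + R), may make the weighting concrete. With β = 1 it reduces to the F1-score; the input values and the function name are illustrative assumptions.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.6, 0.75))            # F1: harmonic mean of the two values
print(f_beta(0.6, 0.75, beta=2))    # F2: pulled towards recall (0.75)
print(f_beta(0.6, 0.75, beta=0.5))  # F0.5: pulled towards precision (0.6)
```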

Now we’re getting into statistical measures that are often directly used to measure LLM performance. Perplexity aims to quantify the level of uncertainty a model experiences when trying to predict the next token or action in a sequence. This metric has been around for a while - since 1977 when IBM engineers developed it to evaluate speech recognition technology.

It uses language entropy (i.e. the amount of information in a word or sequence of words) as a measure of the degree of unpredictability in a language’s word distribution. Higher entropy values signify lower predictability.

Let’s take an example, since that’s rather confusing. Say you (a human, presumably) are given a set of characters and asked to predict the next one. If you have the characters ‘T-H-I-’ you can make a pretty good guess that the next character will be an S or an N (lower entropy), but if you’re asked to guess the next character after receiving ‘T-H-E- -’ then it’s very difficult, since pretty much any noun in the English language could come next (high entropy).

Perplexity takes the cross-entropy between the model’s predicted distribution and the true distribution (in other words, the difference between how likely the model is to select a given token and the true likelihood that the token is correct) and exponentiates it to measure the level of uncertainty in selecting a token.

Essentially, perplexity is a measure of how many tokens the model finds plausible on average: lower values indicate fewer options considered (higher confidence), and higher values indicate a high degree of uncertainty. When we talk about the perplexity of a model, we generally mean the average perplexity across all predicted tokens.
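To make this concrete, here is a minimal sketch that computes perplexity as the exponential of the average negative log-probability the model assigned to each correct token. The probabilities are invented for illustration; in practice they would come from a model’s output.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities a model assigned to each correct next token.
confident = [0.9, 0.8, 0.95, 0.85]
uncertain = [0.25, 0.25, 0.25, 0.25]

print(perplexity(confident))  # close to 1: few plausible options
print(perplexity(uncertain))  # about 4: like choosing among 4 equally likely tokens
```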

One of perplexity’s advantages is that it’s relatively straightforward to calculate and intuitive to understand, but it’s limited by a narrow focus: showing a model’s confidence does not evaluate its ultimate correctness nor its ability to ‘understand’, and it’s prone to fluctuations based on the type of tokenization used.

The Bilingual Evaluation Understudy (BLEU) scorer originates in the field of machine translation, built on the central idea of comparing machine-generated text to a high-quality reference sample produced by a human. This metric can be used in any natural language processing domain, provided that a reference sample exists. Thus, it has become common in evaluating the text output of LLMs.

To do this, it generates n-grams (sequences of n consecutive words) from the reference text and the output, then calculates the precision of the LLM output against the reference text, producing a score between 0 and 1 where 1 is a perfect match.

In the example below, if we take n=1 (i.e. unigrams, where we check each word individually), we have a perfect match, as each word is matched between the reference (first sentence) and the translation (second sentence).

If we change one of the words - for example if the translation was “on the mat sat 

 cat” we would have a BLEU score of 6/7, or about 0.86.

The BLEU score usually also includes a brevity penalty (BP) in the event that an output is shorter than the reference translation. If the output is at least as long as the reference, the BP is set to one; otherwise it is calculated as exp(1 - r/c), where r is the length of the reference and c is the length of the translation. The geometric mean of the n-gram precisions (the exponentiated average of their logarithms) is then multiplied by the BP value to give a final BLEU score.
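The pieces described above can be put together in a simplified, single-reference sketch. It is not a reference implementation: it uses equal weights, applies no smoothing (so any n-gram order with zero matches gives a score of 0), and the sentences are illustrative assumptions rather than the example from the original graphic.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with counts clipped."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total

def brevity_penalty(candidate, reference):
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c >= r else math.exp(1 - r / c)

def bleu(candidate, reference, max_n=4):
    precisions = [clipped_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, reference) * math.exp(log_mean)

reference = "the cat sat on the mat".split()
print(bleu("the cat sat on the mat".split(), reference))            # 1.0
print(bleu("on the mat sat the cat".split(), reference, max_n=1))   # 1.0: unigrams ignore word order
print(bleu("the cat sat".split(), reference, max_n=1))              # below 1: brevity penalty applies
```

Clipping the counts stops a candidate from scoring well by repeating one matching word many times, which plain precision would reward.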

Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is similar to BLEU, in that it also uses n-grams of an output and reference text and calculates the overlap between them. The main difference, as the name implies, is that ROUGE uses recall whereas BLEU uses precision.

In other words, ROUGE counts the n-grams of a given length that appear in both the reference text and the generated text, and divides that count by the total number of n-grams in the reference text.

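ROUGE scores are branched into ROUGE-N, ROUGE-L, and ROUGE-S, where ROUGE-N measures the overlap of n-grams of length N, ROUGE-L is based on the longest common subsequence shared by the two texts, and ROUGE-S measures the overlap of skip-bigrams (pairs of words that appear in the same order, allowing gaps between them).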
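A rough sketch of the recall side of ROUGE-N and ROUGE-L follows. It only computes recall against a single reference (full implementations typically also report precision and an F-measure), and the sentences and function names are illustrative assumptions.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Overlapping n-grams divided by the number of n-grams in the reference."""
    def grams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

def lcs_length(a, b):
    """Length of the longest common subsequence, the basis of ROUGE-L."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
print(rouge_n_recall(candidate, reference, n=1))          # 5 of 6 reference unigrams
print(rouge_n_recall(candidate, reference, n=2))          # 3 of 5 reference bigrams
print(lcs_length(candidate, reference) / len(reference))  # ROUGE-L recall: LCS of 5 over 6
```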

Given what we covered at the start of this section, you might be thinking: why not combine BLEU and ROUGE to have something like an F1-score for text evaluation? 

That is essentially what the Metric for Evaluation of Translation with Explicit Ordering (METEOR) does, taking the harmonic mean of precision (n-gram matches, as computed by BLEU) and recall (n-gram overlaps, as calculated by ROUGE). But it goes even further, adding weights and adjustments for word order and using third-party lexical databases such as WordNet to check for the use of synonyms.

Thus it’s the most comprehensive of the three common n-gram metrics, but since it still requires reference text it shares the same limitations. N-gram metrics focus on surface-level textual similarity, so they struggle with things like paraphrases and idioms, and generally ignore the relationship between distant words. For example, an output of “X because Y” will generally score well against the reference “Y because X” despite having the inverse meaning.

Ready-made implementations are available for calculating METEOR when presented with a prediction and reference. In the screenshot below you can see an output with a single row where we compare “hello my name is joe” vs “hello my name is joseph” and get a score of ~0.8, which makes sense, since 4/5 tokens are predicted correctly in this example.
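To see where a score of roughly 0.8 comes from, here is a heavily simplified METEOR sketch. It assumes exact word matches only (no stemming or synonym lookup) and a greedy left-to-right alignment, whereas real implementations search for the alignment with the fewest chunks; the weighting follows the original METEOR formulation, which favours recall and applies a fragmentation penalty for out-of-order matches. The function name is hypothetical.

```python
def simple_meteor(candidate, reference):
    """Exact-match METEOR sketch: no stemming or synonym lookup."""
    used = [False] * len(reference)
    alignment = []  # (candidate index, reference index) pairs
    for i, word in enumerate(candidate):
        for j, ref_word in enumerate(reference):
            if not used[j] and word == ref_word:
                used[j] = True
                alignment.append((i, j))
                break
    matches = len(alignment)
    if matches == 0:
        return 0.0
    precision = matches / len(candidate)
    recall = matches / len(reference)
    # Harmonic mean weighted 9:1 towards recall.
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # A chunk is a run of matches that is contiguous in both texts.
    chunks = 1
    for (i1, j1), (i2, j2) in zip(alignment, alignment[1:]):
        if i2 != i1 + 1 or j2 != j1 + 1:
            chunks += 1
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)

score = simple_meteor("hello my name is joe".split(), "hello my name is joseph".split())
print(round(score, 2))  # 0.79
```

Precision and recall are both 4/5 here, and the four matched words form a single chunk, so the penalty is tiny and the score lands just under 0.8.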

The Levenshtein Distance, named after Soviet mathematician Vladimir Levenshtein who introduced the metric in 1965, is another scorer that focuses on textual similarity between an output and a reference. The Levenshtein Distance between two sequences of characters is given by the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one sequence into the other.

For example, the Levenshtein distance between the strings "kitten" and "sitting" is 3, since at least 3 edits are needed to change one into the other: substitute "k" with "s", substitute "e" with "i", and insert "g" at the end.
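The distance can be computed with a standard dynamic-programming approach, sketched below. It keeps only two rows of the edit table at a time; the function name is an illustrative choice.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))  # distance from empty prefix of a to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```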

Now, you might notice a trend here. Nearly all of these textual metrics require a reference sample to compare against. This is a strong limiting factor, since there are often many correct strings of tokens for a given prompt, and reference sets have to be created for any niche use cases. This is part of the reason there’s been a push in the last few years to find evaluations that mix the rigour and scalability of statistical methods with the subjective, qualitative insight of human testing.

We’ve covered the most well-known statistical methods, and I hope it’s clear that while each of these metrics is valuable, their value is very contextual and limited to a given use case.

In our next article we’ll look at “LLM-as-a-Judge” metrics, in which AI models are used to generate metrics, sometimes approximating (but not equalling!) those that might be generated by human testers, and at metrics that combine LLMs and statistical methods (e.g. BERTScore) to evaluate performance.
