How Good is Good Enough: Subjective Testing and Manual LLM Evaluation
In our previous article, we talked about the highest level of testing and evaluation for LLMs, and went into detail about some of the most commonly used benchmarks for validating LLM performance at a high level. Today, we’re going to look at some more fine-grained evaluation metrics that you can use while building an LLM-based tool. Here we make the distinction between statistical metrics (that is, those computed using a statistical model) and more generalised metrics that attempt to measure the more ‘subjective’ elements of LLM performance, whether through manual testing by humans or by using AI to evaluate how useful a model is in its given context. In this article, we’ll give an overview of the different classes of metrics used and cover human evaluation and its importance, before moving on to common statistical metrics and LLM-as-Judge evaluations in the following articles.