Evaluating Retrieval-Augmented Generation (RAG) systems is essential for any production-grade use. Without proper evaluation, we end up in the world of “it works on my machine”. In AI, the equivalent is “it works on my questions”.

Whether you are an engineer seeking to refine your RAG systems, are simply intrigued by the nuances of RAG evaluation, or are eager to read more after the first part of this series (Evaluating retrieval in RAGs: a gentle introduction), you are in the right place.

This article equips you with the knowledge needed to navigate RAG evaluation and a framework for systematically comparing existing evaluation libraries. The framework covers four areas: benchmark creation, evaluation metrics, parameter space, and experiment tracking.