Evalugator: Rapid, Agile Development and Evaluation of Retrieval Augmented Generation Systems Without Labels
Published in ECIR 2026 (accepted), 2026
Recommended citation: Bevan Koopman, Hang Li, Shuai Wang and Guido Zuccon. 2025. Evalugator: Rapid, Agile Development and Evaluation of Retrieval Augmented Generation Systems Without Labels (Accepted ECIR-2026). https://hangli.me/files/ecir2026-evalugator.pdf
Abstract
Evaluating complex Retrieval Augmented Generation (RAG) systems in real-world settings is challenging. Fine-grained labelled data is often lacking, as are comprehensive evaluation tools that can assess individual components of a pipeline. This hinders rapid, rigorous development, particularly for agentic RAG systems. We report on our experience at GuideStream.AI, a startup developing an AI for clinical guideline recommendation. To address this gap, we developed Evalugator, a suite of agentic components to support agile development and evaluation. Evalugator features: (1) generation of synthetic queries, relevance assessments, answers and evaluation criteria for training and evaluation in new domains; (2) LLM-based judging agents; and (3) simple UI and API tools to launch experiments and analyse results. This paper uses Evalugator as a case study to demonstrate how a principled, agent-based evaluation framework can support the rapid development of complex RAG systems in a startup environment.
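
To give a rough flavour of the "LLM-based judging agents" idea described in the abstract, below is a minimal, illustrative sketch of scoring a RAG answer without gold labels using an LLM judge. The function name, prompt wording, criterion, and model choice are assumptions made for illustration only; they are not Evalugator's actual API (see the paper linked above for the real system).

```python
# Illustrative sketch only: an LLM-as-judge scoring one RAG answer on one criterion,
# without any gold labels. Names and prompt wording are hypothetical, not Evalugator's API.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_PROMPT = """You are evaluating an answer produced by a retrieval-augmented
generation system for a clinical guideline question.

Question: {question}
Retrieved context: {context}
Answer: {answer}

Rate the answer from 1 (poor) to 5 (excellent) on this criterion: {criterion}.
Reply with a single integer."""


def judge_answer(question: str, context: str, answer: str, criterion: str) -> int:
    """Ask an LLM judge to score one answer against one evaluation criterion."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable chat model could be substituted
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, context=context, answer=answer, criterion=criterion
            ),
        }],
        temperature=0,
    )
    # The prompt asks for a bare integer; parsing may need to be more robust in practice.
    return int(response.choices[0].message.content.strip())


if __name__ == "__main__":
    score = judge_answer(
        question="What is the first-line treatment recommended for condition X?",
        context="Guideline excerpt retrieved by the RAG pipeline ...",
        answer="The guideline recommends ...",
        criterion="faithfulness to the retrieved guideline text",
    )
    print(f"Judge score: {score}")
```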
