Seminar: "Measuring Distributional Realism in Synthetic Query-Passage Datasets for Information Retrieval"

Abstract

One promising application of Large Language Models (LLMs) is the generation of synthetic textual datasets for training, particularly in information retrieval (IR), where the task is, given a query, to retrieve passages containing the relevant information. Synthetic data offers several advantages: it can be generated at scale, applied across diverse scenarios, and produced without human supervision. However, a key challenge lies in evaluating the quality of such data. While human evaluation remains the gold standard, it is often impractical due to the high cost and time required. To address this limitation, we propose a novel automatic evaluation method that assesses the realism of synthetic datasets by comparing the distribution of query–passage similarity scores with that observed in human-authored datasets. Specifically, we quantify realism by measuring the degree of overlap between the two distributions. This overlap provides a reasonable approximation of how closely a synthetic dataset resembles human-created data. We show that our metric captures improvements achieved when moving from baseline generation approaches to more refined prompt-engineering methods. Finally, we validate our results using both an LLM-as-a-judge framework and human evaluation on a subset of the data.
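The abstract does not specify how the similarity scores are computed or how distributional overlap is measured. The sketch below illustrates one plausible instantiation of the idea, assuming cosine similarity over sentence embeddings (the `encode` function is a placeholder for any text encoder) and histogram intersection as the overlap measure; neither choice is confirmed by the talk.

```python
# Minimal sketch of the distribution-overlap idea described in the abstract.
# Assumptions (not stated in the abstract): similarity scores are cosine
# similarities of query/passage embeddings, and "overlap" is the intersection
# of two normalized histograms over a shared binning.
import numpy as np


def similarity_scores(queries, passages, encode):
    """Cosine similarity between each paired query and passage.

    `encode` is any function mapping a list of strings to a 2-D array of
    embeddings (e.g. a sentence encoder); it is a hypothetical placeholder.
    """
    q = encode(queries)
    p = encode(passages)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    return np.sum(q * p, axis=1)


def distribution_overlap(human_scores, synthetic_scores, bins=50):
    """Overlap in [0, 1] between two score distributions (histogram intersection)."""
    lo = min(human_scores.min(), synthetic_scores.min())
    hi = max(human_scores.max(), synthetic_scores.max())
    edges = np.linspace(lo, hi, bins + 1)
    h, _ = np.histogram(human_scores, bins=edges)
    s, _ = np.histogram(synthetic_scores, bins=edges)
    h = h / h.sum()
    s = s / s.sum()
    return float(np.minimum(h, s).sum())  # 1.0 means identical distributions


# Example with stand-in scores (no real data involved):
rng = np.random.default_rng(0)
human = rng.normal(0.6, 0.10, 1000)   # hypothetical human query-passage scores
synth = rng.normal(0.5, 0.15, 1000)   # hypothetical synthetic-dataset scores
print(f"overlap = {distribution_overlap(human, synth):.3f}")
```

A higher overlap would then be read as the synthetic dataset's similarity profile more closely matching the human-authored one; other divergence measures (e.g. Wasserstein distance) could serve the same role.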

Date: Nov 13, 2025, 13:00–14:00
Location: Abacws

Invited Speaker: Marco Cuccarini

Nedjma Ousidhoum
Lecturer