Seminar: "NoLiMa: Long-Context Evaluation Beyond Literal Matching"

Abstract

Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a “needle” (relevant information) from a “haystack” (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information.

Date
Jun 19, 2025 13:00 — 14:00
Location
Abacws

Invited Speaker: Ali Modarressi (LMU Munich)

Bio: I am a third-year PhD student at the Center for Information and Language Processing (CIS) at LMU Munich, working under the supervision of Prof. Hinrich Schütze. My current research focuses on memory-augmented large language models and, more broadly, on long-context language modeling. Closely related, I have also worked on interactive language generation and information extraction. My NLP journey began during my MSc, supervised by Mohammad Taher Pilehvar, where I worked on explainability methods and interpretability of pre-trained language models, an area that remains relevant to my current research, particularly in analyzing retrieval models and knowledge probing.

Nedjma Ousidhoum
Lecturer