arxiv:2410.01945

LFQA-E: Carefully Benchmarking Long-form QA Evaluation

Published on Oct 2, 2024

AI-generated summary

LFQA-E is a multilingual, reference-based benchmark for evaluating automatic metrics for long-form question answering. It comprises 1618 questions and 7323 pairwise comparisons across 15 topics, and it shows that current metrics fail to capture the dense information in comprehensive responses.

Abstract

Long-Form Question Answering (LFQA) involves generating comprehensive, paragraph-level responses to open-ended questions, which poses a significant challenge for evaluation due to the richness of information and flexible response format. Existing LFQA-evaluation benchmarks often lack reference answers and are limited in size and topic coverage, reducing their reliability. To address this gap, we introduce LFQA-E, a well-constructed, multilingual, and reference-based benchmark designed to rigorously evaluate automatic metrics for LFQA. LFQA-E comprises 1618 questions and 7323 pairwise comparisons across 15 topics, drawn from diverse sources such as online queries and examination questions, thereby enabling a comprehensive assessment of evaluation metrics. We examine five categories of metrics, encompassing 17 specific methods, using LFQA-E. The results demonstrate that none of the existing automatic metrics perform comparably to human judgments, highlighting their inability to capture the dense information in long-form responses. Furthermore, we present a detailed analysis of the failure cases and the generalization capacity of these metrics, offering insights to guide the future development of LFQA evaluation methods. The benchmark and code are available at https://github.com/YuchenFan48/LFQA-E.
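As a rough illustration of how an automatic metric can be scored against human pairwise judgments, the sketch below computes the fraction of comparisons where a reference-based metric prefers the same answer as the human annotator. It is not the benchmark's own evaluation code: the field names (`reference`, `answer_a`, `answer_b`, `human_preference`), the toy examples, and the token-overlap F1 metric are all assumptions made for illustration, and the actual data format in the LFQA-E repository may differ.

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Toy reference-based metric: whitespace-token overlap F1 (illustrative only)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def pairwise_accuracy(comparisons, metric) -> float:
    """Fraction of comparisons where the metric agrees with the human preference.

    Ties in metric score are counted as disagreements.
    """
    if not comparisons:
        raise ValueError("comparisons must not be empty")
    correct = 0
    for item in comparisons:
        score_a = metric(item["answer_a"], item["reference"])
        score_b = metric(item["answer_b"], item["reference"])
        if score_a > score_b:
            predicted = "a"
        elif score_b > score_a:
            predicted = "b"
        else:
            predicted = "tie"
        if predicted == item["human_preference"]:
            correct += 1
    return correct / len(comparisons)


# Hypothetical toy comparisons; the field names are assumptions, not the LFQA-E schema.
comparisons = [
    {
        "reference": "water boils at 100 degrees celsius at sea level",
        "answer_a": "water boils at 100 degrees celsius",
        "answer_b": "water freezes quickly",
        "human_preference": "a",
    },
    {
        "reference": "the moon orbits the earth",
        "answer_a": "the sun is hot",
        "answer_b": "the moon orbits the earth every month",
        "human_preference": "b",
    },
    {
        # A lexical metric rewards answer_b for copying reference words,
        # even though the human annotator prefers the short correct answer.
        "reference": "plants make food by photosynthesis using sunlight",
        "answer_a": "photosynthesis",
        "answer_b": "plants make food by eating soil using roots",
        "human_preference": "a",
    },
]

accuracy = pairwise_accuracy(comparisons, token_f1)
print(f"pairwise agreement with human preference: {accuracy:.3f}")
```

The third toy comparison is constructed so that the lexical metric disagrees with the human label, a small-scale instance of the kind of failure the abstract attributes to existing metrics on information-dense answers.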
