
The BETTER Cross-Language Datasets

Published: 18 July 2023

Abstract

The IARPA BETTER (Better Extraction from Text Towards Enhanced Retrieval) program held three evaluations of information retrieval (IR) and information extraction (IE). For both tasks, the only training data available was in English, but systems had to perform cross-language retrieval and extraction from Arabic, Farsi, Chinese, Russian, and Korean. Pooled assessment and information extraction annotation were used to create reusable IR test collections. These datasets are freely available to researchers working in cross-language retrieval, information extraction, or the conjunction of IR and IE. This paper describes the datasets, how they were constructed, and how they might be used by researchers.
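The abstract refers to pooled assessment, the TREC-style technique for making human judging tractable: each participating system contributes its top-ranked documents per topic, and the union of those contributions, rather than the full collection, is what assessors judge. The Python below is a minimal sketch of depth-k pooling only; the function name, data layout, and pool depth are illustrative assumptions, not details taken from the BETTER evaluations.

```python
# Minimal sketch of depth-k pooling, the mechanism behind "pooled
# assessment": for each topic, take the union of the top-k documents
# from every submitted run; that union is the pool sent to assessors.
# The data layout and the default depth of 50 are assumptions for
# illustration, not details from the BETTER evaluations.
from collections import defaultdict

def build_pools(runs, depth=50):
    """runs maps run_name -> {topic_id: [doc_id, ...] ranked best-first}.
    Returns {topic_id: set of doc_ids}, the per-topic assessment pools."""
    pools = defaultdict(set)
    for ranking_by_topic in runs.values():
        for topic_id, ranked_docs in ranking_by_topic.items():
            pools[topic_id].update(ranked_docs[:depth])
    return dict(pools)

# Two toy runs over a single topic, pooled to depth 2:
runs = {
    "run_a": {"T1": ["d3", "d7", "d1"]},
    "run_b": {"T1": ["d7", "d9", "d2"]},
}
print(build_pools(runs, depth=2))  # {'T1': {'d3', 'd7', 'd9'}}
```

Because unjudged documents are typically scored as non-relevant, pool depth and the diversity of contributing runs determine how reusable the resulting test collection is for systems that did not contribute to the pools.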


Published In

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2023
3567 pages
ISBN: 9781450394086
DOI: 10.1145/3539618
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. cross-language information retrieval
  2. information extraction
  3. test collections

Qualifiers

  • Research-article

Conference

SIGIR '23

Acceptance Rates

Overall Acceptance Rate: 792 of 3,983 submissions (20%)
