
The BETTER Cross-Language Datasets

Published: 18 July 2023

Abstract

The IARPA BETTER (Better Extraction from Text Towards Enhanced Retrieval) program held three evaluations of information retrieval (IR) and information extraction (IE). For both tasks, the only training data available was in English, but systems had to perform cross-language retrieval and extraction from Arabic, Farsi, Chinese, Russian, and Korean. Pooled assessment and information extraction annotation were used to create reusable IR test collections. These datasets are freely available to researchers working in cross-language retrieval, information extraction, or the conjunction of IR and IE. This paper describes the datasets, how they were constructed, and how they might be used by researchers.
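The abstract refers to pooled assessment, the TREC-style technique for making human judging tractable: each participating system contributes its top-ranked documents per topic, and the union of those contributions, rather than the full collection, is what assessors judge. The Python below is a minimal sketch of depth-k pooling only; the function name, data layout, and pool depth are illustrative assumptions, not details taken from the BETTER evaluations.

```python
# Minimal sketch of depth-k pooling, the mechanism behind "pooled
# assessment": for each topic, take the union of the top-k documents
# from every submitted run; that union is the pool sent to assessors.
# The data layout and the default depth of 50 are assumptions for
# illustration, not details from the BETTER evaluations.
from collections import defaultdict

def build_pools(runs, depth=50):
    """runs maps run_name -> {topic_id: [doc_id, ...] ranked best-first}.
    Returns {topic_id: set of doc_ids}, the per-topic assessment pools."""
    pools = defaultdict(set)
    for ranking_by_topic in runs.values():
        for topic_id, ranked_docs in ranking_by_topic.items():
            pools[topic_id].update(ranked_docs[:depth])
    return dict(pools)

# Two toy runs over a single topic, pooled to depth 2:
runs = {
    "run_a": {"T1": ["d3", "d7", "d1"]},
    "run_b": {"T1": ["d7", "d9", "d2"]},
}
print(build_pools(runs, depth=2))  # {'T1': {'d3', 'd7', 'd9'}}
```

Because unjudged documents are typically scored as non-relevant, pool depth and the diversity of contributing runs determine how reusable the resulting test collection is for systems that did not contribute to the pools.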


Published In

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2023
3567 pages
ISBN: 9781450394086
DOI: 10.1145/3539618
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. cross-language information retrieval
  2. information extraction
  3. test collections

Qualifiers

  • Research-article

Conference

SIGIR '23

Acceptance Rates

Overall Acceptance Rate: 792 of 3,983 submissions (20%)
