DOI: 10.1145/3404835.3463256
Conference proceedings · Short paper

REGIS: A Test Collection for Geoscientific Documents in Portuguese

Published: 11 July 2021

Abstract

Experimental validation is key to the development of Information Retrieval (IR) systems. The standard evaluation paradigm requires a test collection with documents, queries, and relevance judgments. Creating test collections demands significant human effort, mainly for providing relevance judgments. As a result, many domains and languages still lack a proper evaluation testbed. Portuguese is an example of a major world language that has been overlooked in IR research: the only test collection available consists of news articles from 1994 and a hundred queries. To bridge this gap, we developed REGIS (Retrieval Evaluation for Geoscientific Information Systems), a test collection for the geoscientific domain in Portuguese. REGIS contains 20K documents and 34 query topics along with relevance assessments. We describe the procedures for document collection, topic creation, and relevance assessment. In addition, we report results of standard IR techniques on REGIS so that they can serve as baselines for future research.
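
The Cranfield-style evaluation described above pairs each system's ranked output (a "run") with the collection's relevance judgments ("qrels"). As a minimal sketch of how such a collection is consumed, assuming REGIS ships its judgments and runs in the standard TREC text formats (qrels line: topic, iteration, doc_id, relevance; run line: topic, Q0, doc_id, rank, score, tag), the Python snippet below computes mean average precision (MAP) and P@10. The file names are hypothetical, and in practice one would use trec_eval or pytrec_eval rather than hand-rolled metrics; the sketch only makes the qrels/run contract explicit.

    from collections import defaultdict

    def load_qrels(path):
        # TREC qrels format: topic iteration doc_id relevance
        qrels = defaultdict(dict)
        with open(path) as f:
            for line in f:
                topic, _, doc_id, rel = line.split()
                qrels[topic][doc_id] = int(rel)
        return qrels

    def load_run(path):
        # TREC run format: topic Q0 doc_id rank score tag
        run = defaultdict(list)
        with open(path) as f:
            for line in f:
                topic, _, doc_id, _, score, _ = line.split()
                run[topic].append((float(score), doc_id))
        # Order each topic's documents by descending retrieval score.
        return {t: [d for _, d in sorted(pairs, reverse=True)]
                for t, pairs in run.items()}

    def average_precision(ranking, judged):
        relevant = {d for d, rel in judged.items() if rel > 0}
        if not relevant:
            return 0.0
        hits, ap = 0, 0.0
        for i, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                hits += 1
                ap += hits / i
        return ap / len(relevant)

    qrels = load_qrels("regis_qrels.txt")  # hypothetical file name
    run = load_run("bm25_run.txt")         # hypothetical file name
    topics = sorted(t for t in run if t in qrels)
    map_score = sum(average_precision(run[t], qrels[t]) for t in topics) / len(topics)
    p_at_10 = sum(sum(qrels[t].get(d, 0) > 0 for d in run[t][:10]) / 10
                  for t in topics) / len(topics)
    print(f"MAP: {map_score:.4f}  P@10: {p_at_10:.4f}")

Averaging only over topics present in both files mirrors the usual trec_eval behavior of scoring just the judged topics.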

Supplementary Material

MP4 File (REGIS A Test Collection for Geoscientific Documents in Portuguese.mp4)
Presentation video of the resource paper "REGIS: A Test Collection for Geoscientific Documents in Portuguese", submitted to SIGIR 2021.


Cited By

  • Petro NLP. Computers & Geosciences 193:C (2024). https://doi.org/10.1016/j.cageo.2024.105714
  • Evaluating and mitigating the impact of OCR errors on information retrieval. International Journal on Digital Libraries 24:1 (2023), 45-62. https://doi.org/10.1007/s00799-023-00345-6


Published In

SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2021, 2998 pages
ISBN: 9781450380379
DOI: 10.1145/3404835

Publisher

Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. geoscientific data
    2. information retrieval
    3. test collection

    Qualifiers

    • Short paper

    Conference

    SIGIR '21

    Acceptance Rates

    Overall acceptance rate: 792 of 3,983 submissions (20%)

