The information retrieval anthology 2021: inaugural status report and challenges ahead

Published: 16 July 2021

Abstract

The Information Retrieval Anthology, IR Anthology for short, is an endeavor to create a comprehensive collection of metadata and full texts of IR-related publications. We report on its first release, the use cases it can serve, as well as the challenges lying ahead to develop it towards a resource that serves the IR community for years to come. The IR Anthology's metadata browser and full text search engine are available at IR.webis.de.

Published In

ACM SIGIR Forum, Volume 55, Issue 1 (June 2021), 157 pages
ISSN: 0163-5840
DOI: 10.1145/3476415

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery, New York, NY, United States
