Research article. DOI: 10.1145/1316874.1316894

Document retrieval for question answering: a quantitative evaluation of text preprocessing

Published: 09 November 2007

Abstract

Question Answering (QA) has attracted sustained research interest, motivated in part by international QA evaluation forums: the Text REtrieval Conference (TREC) and, more recently, the Cross Language Evaluation Forum (CLEF) through QA@CLEF, which has included the Portuguese language since 2004. In these forums, participants receive a collection of written documents and a set of questions that their systems must answer. Each system is evaluated as a whole on its capacity to answer the questions, and relatively few published results focus on the performance of individual components and their influence on overall system performance. That is the case for the Information Retrieval (IR) component, which is widely used in QA systems.
Our work concentrates on the different options for preprocessing Portuguese text before feeding it to the IR component, and evaluates their impact on IR performance in the specific context of QA, so that the choice among them can be well founded. We conclude that the basic preprocessing techniques, case folding and removal of punctuation marks, offer a clear advantage. Among the other techniques considered, stop word removal improved the performance of the IR system, whereas stemming and lemmatization did not.
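
To make the three techniques that the abstract reports as beneficial more concrete, here is a minimal Python sketch of case folding, punctuation removal, and stop word removal applied before indexing or querying. It is an illustration under stated assumptions, not the authors' implementation: the function names, the whitespace tokenization, and the tiny stop list are all hypothetical, and the stop list the paper actually used is not given in the abstract.

```python
import re

# Tiny illustrative Portuguese stop list. This is an assumption made for the
# example; it is NOT the stop list used in the paper.
STOP_WORDS = frozenset({"a", "o", "as", "os", "de", "do", "da", "em", "e", "que", "é"})


def case_fold(text: str) -> str:
    """Map the text to a case-insensitive form (Unicode-aware)."""
    return text.casefold()


def remove_punctuation(text: str) -> str:
    """Replace every character that is neither a word character nor
    whitespace with a space. In Python 3, \\w matches accented letters."""
    return re.sub(r"[^\w\s]", " ", text)


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in stop_words]


def preprocess(text: str, drop_stop_words: bool = True):
    """Case-fold, strip punctuation, split on whitespace, and optionally
    remove stop words. Returns a list of tokens."""
    tokens = remove_punctuation(case_fold(text)).split()
    return remove_stop_words(tokens) if drop_stop_words else tokens


if __name__ == "__main__":
    print(preprocess("Qual é a capital de Portugal?"))
```

The same function would be applied to both documents and questions so that indexed terms and query terms match. Replacing punctuation with a space rather than deleting it splits hyphenated forms into separate tokens, which is one possible design choice among several. Stemming and lemmatization are left out because the abstract reports that they did not improve retrieval performance.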

      Published In

      PIKM '07: Proceedings of the ACM first Ph.D. workshop in CIKM
      November 2007
      184 pages
      ISBN: 9781595938329
      DOI: 10.1145/1316874

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. information retrieval
      2. question answering

      Qualifiers

      • Research-article

      Conference

      CIKM07

      Acceptance Rates

      Overall Acceptance Rate 25 of 62 submissions, 40%

      Cited By

      • (2022) Enhancing GAN-LCS Performance Using an Abbreviations Checker in Automatic Short Answer Scoring. Computers 11:7 (108). DOI: 10.3390/computers11070108. Online publication date: 1-Jul-2022
      • (2022) Comparison of text preprocessing methods. Natural Language Engineering (1-45). DOI: 10.1017/S1351324922000213. Online publication date: 13-Jun-2022
      • (2021) Using Online Data in Predicting Stock Price Movements. Research Anthology on Strategies for Using Social Media as a Service and Tool in Business (1056-1083). DOI: 10.4018/978-1-7998-9020-1.ch053. Online publication date: 2021
      • (2019) Using Online Data in Predicting Stock Price Movements. Techno-Social Systems for Modern Economical and Governmental Infrastructures (125-159). DOI: 10.4018/978-1-5225-5586-5.ch006. Online publication date: 2019
      • (2018) An Experimental Study of Text Preprocessing Techniques for Automatic Short Answer Grading in Indonesian. 2018 3rd International Conference on Information Technology, Information System and Electrical Engineering (ICITISEE) (230-234). DOI: 10.1109/ICITISEE.2018.8720957. Online publication date: Nov-2018
      • (2015) Effective 20 Newsgroups Dataset Cleaning. 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT) (98-101). DOI: 10.1109/WI-IAT.2015.90. Online publication date: Dec-2015
      • (2015) Interdependence of Text Mining Quality and the Input Data Preprocessing. Artificial Intelligence Perspectives and Applications (141-150). DOI: 10.1007/978-3-319-18476-0_15. Online publication date: 2015
      • (2014) A Weighted Density-Based Approach for Identifying Standardized Items that are Significantly Related to the Biological Literature. Data Mining for Service (79-96). DOI: 10.1007/978-3-642-45252-9_6. Online publication date: 4-Jan-2014
      • (2012) Searching a mixed corpus in the light of the new portuguese orthographic norm. Proceedings of the 10th international conference on Computational Processing of the Portuguese Language (56-62). DOI: 10.1007/978-3-642-28885-2_6. Online publication date: 17-Apr-2012
      • (2009) IdSay: Question Answering for Portuguese. Evaluating Systems for Multilingual and Multimodal Information Access (345-352). DOI: 10.1007/978-3-642-04447-2_40. Online publication date: 2009
