Minimally supervised question classification on fine-grained taxonomies

  • Regular Paper
  • Knowledge and Information Systems

Abstract

This article presents a minimally supervised approach to question classification on fine-grained taxonomies. We define an algorithm that automatically obtains a list of weighted terms for each class in the taxonomy, identifying the terms that are both highly related to a class and highly discriminative between classes. These lists are then applied to the task of question classification. Our approach is based on the divergence of probability distributions of terms in plain text retrieved from the Web, so no corpus of questions is needed to train the classifier. As the system relies purely on statistical information, it requires no additional linguistic resources or tools. The experiments were performed on English questions and their Spanish translations. The results show that our system outperforms current supervised approaches on this task, with a significant improvement in the experiments carried out.
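The idea of comparing term distributions by their divergence can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract only says that a divergence between probability distributions of terms is used, so the Jensen-Shannon divergence below is an assumed choice, and the function names (`ngrams`, `term_distribution`, `js_divergence`) and the sample snippets are hypothetical. Terms are taken to be unigrams, bigrams, and trigrams, and binary logarithms are used, as stated in notes 5 and 7.

```python
import math
from collections import Counter


def ngrams(tokens, max_n=3):
    """Yield every unigram, bigram and trigram as a space-joined string."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def term_distribution(texts, max_n=3):
    """Relative frequency of each n-gram term over a list of plain-text snippets."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text.lower().split(), max_n))
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs (assumed measure, range 0 to 1)."""
    divergence = 0.0
    for term in set(p) | set(q):
        pt, qt = p.get(term, 0.0), q.get(term, 0.0)
        mt = 0.5 * (pt + qt)  # mixture distribution at this term
        if pt > 0:
            divergence += 0.5 * pt * math.log2(pt / mt)
        if qt > 0:
            divergence += 0.5 * qt * math.log2(qt / mt)
    return divergence


# Hypothetical snippets standing in for Web text retrieved for two classes.
city_snippets = ["the capital city of the region", "a city on the river"]
actor_snippets = ["the actor starred in the film", "an award winning actor"]

p_city = term_distribution(city_snippets)
p_actor = term_distribution(actor_snippets)
print(js_divergence(p_city, p_actor))
```

A value near 0 means the two classes use terms in almost the same proportions, while a value near 1 means their vocabularies barely overlap. Per-term contributions to such a divergence are one plausible way to weight terms by how discriminative they are, although the actual weighting scheme is defined in the full article.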

Notes

  1. Text Retrieval Conference: http://trec.nist.gov/.

  2. Cross Language Evaluation Forum: http://clef-campaign.org/.

  3. NII-NACSIS Test Collection for IR Systems: http://research.nii.ac.jp/ntcir/.

  4. Text Analysis Conference: http://www.nist.gov/tac/.

  5. In our experiments, we extended the concept of term to unigrams, bigrams, and trigrams.

  6. We employed Yahoo! Search BOSS: http://developer.yahoo.com/search/boss/.

  7. Binary logarithms were used in our experiments.

  8. The value \(-1\) is not included in the interval because it always produces a negative \(\omega\).

  9. Freely available at http://trec.nist.gov/data/qa.html.

  10. All the sets of questions and seeds employed in the evaluation are available at http://www.dlsi.ua.es/~dtomas/resources/.

  11. We employed the Apache Lucene search engine: http://lucene.apache.org.

  12. The set of folds was the same as that used in the experiments with SVM.

Acknowledgments

This research has been partially funded by the Spanish Government under project TEXTMESS 2.0 (TIN2009-13391-C04-01) and by the University of Alicante under project GRE10-33.

Author information

Correspondence to David Tomás.

About this article

Cite this article

Tomás, D., Vicedo, J.L. Minimally supervised question classification on fine-grained taxonomies. Knowl Inf Syst 36, 303–334 (2013). https://doi.org/10.1007/s10115-012-0557-y

  • DOI: https://doi.org/10.1007/s10115-012-0557-y
