Minimally supervised question classification on fine-grained taxonomies

Tomás, David; Vicedo, José L.

doi:10.1007/s10115-012-0557-y

Regular Paper
Published: 18 September 2012

Volume 36, pages 303–334 (2013)

Knowledge and Information Systems

David Tomás¹,² & José L. Vicedo¹

401 Accesses
9 Citations

Abstract

This article presents a minimally supervised approach to question classification on fine-grained taxonomies. We define an algorithm that automatically obtains a list of weighted terms for each class in the taxonomy, identifying the terms that are both highly related to a class and highly discriminative between classes. These lists are then applied to the task of question classification. The approach is based on the divergence of probability distributions of terms in plain text retrieved from the Web, so no corpus of questions is needed to train the classifier. Because the system relies purely on statistical information, it requires no additional linguistic resources or tools. The experiments were performed on English questions and their Spanish translations. The results show that our system surpasses current supervised approaches in this task, obtaining a significant improvement in the experiments carried out.
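
The idea of weighting terms by the divergence between term distributions can be illustrated with a small sketch. The abstract does not specify the divergence measure or the weighting formula, so what follows is an assumption for illustration only, not the authors' algorithm: it uses the Jensen-Shannon divergence with binary logarithms and assigns each term its signed share of that divergence. The function names (`term_distribution`, `js_divergence`, `term_weights`) and the toy texts are hypothetical.

```python
import math
from collections import Counter


def term_distribution(texts):
    """Relative frequency of each lowercased, whitespace-separated token."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base-2 logs) between two distributions."""
    def kl_to_mixture(a, b):
        # KL divergence from `a` to the mixture m = (a + b) / 2.
        total = 0.0
        for term, pa in a.items():
            m = 0.5 * (pa + b.get(term, 0.0))
            total += pa * math.log2(pa / m)
        return total

    return 0.5 * kl_to_mixture(p, q) + 0.5 * kl_to_mixture(q, p)


def term_weights(p, q):
    """Signed per-term share of the Jensen-Shannon divergence.

    Assumption for illustration: a term gets a positive weight when it is
    more probable in `p` (the class) than in `q` (everything else), and a
    negative weight otherwise.
    """
    weights = {}
    for term in set(p) | set(q):
        pt, qt = p.get(term, 0.0), q.get(term, 0.0)
        m = 0.5 * (pt + qt)
        share = 0.0
        if pt > 0:
            share += 0.5 * pt * math.log2(pt / m)
        if qt > 0:
            share += 0.5 * qt * math.log2(qt / m)
        weights[term] = share if pt >= qt else -share
    return weights


# Toy texts standing in for plain text retrieved from the Web.
class_texts = ["paris is the capital city of france",
               "madrid is the capital city of spain"]
other_texts = ["the river flows to the sea", "the mountain is high"]

p = term_distribution(class_texts)
q = term_distribution(other_texts)
weights = term_weights(p, q)
top_terms = sorted(weights, key=weights.get, reverse=True)[:3]
```

In this sketch, terms that occur only in the class texts (such as "capital") receive positive weights, while terms that are more frequent elsewhere (such as "river" or "the") receive negative ones. The absolute values of the weights sum to the overall divergence between the two distributions.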

Notes

Text Retrieval Conference: http://trec.nist.org/.
Cross Language Evaluation Forum: http://clef-campaign.org/.
NII-NACSIS Test Collection for IR Systems: http://research.nii.ac.jp/ntcir/.
Text Analysis Conference: http://www.nist.gov/tac/.
In our experiments, we extended the concept of term to unigrams, bigrams, and trigrams.
We employed Yahoo! Search BOSS: http://developer.yahoo.com/search/boss/.
Binary logarithms were used in our experiments.
The value \(-1\) is not included in the interval because it always produces a negative \(\omega\).
Freely available at http://trec.nist.gov/data/qa.html.
All the sets of questions and seeds employed in the evaluation are available at http://www.dlsi.ua.es/~dtomas/resources/.
We employed the Apache Lucene search engine: http://lucene.apache.org.
The set of folds was the same as that used in the experiments with SVM.
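
The note above on extending the concept of term to unigrams, bigrams, and trigrams can be made concrete with a minimal sketch. The tokenization (lowercasing and whitespace splitting) and the function name `ngram_terms` are assumptions for illustration; the notes do not describe how the text was actually tokenized.

```python
def ngram_terms(tokens, max_n=3):
    """Return all contiguous n-grams of length 1..max_n as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


# Hypothetical question used only to show the shape of the output.
terms = ngram_terms("who wrote hamlet".lower().split())
# terms == ["who", "wrote", "hamlet", "who wrote", "wrote hamlet", "who wrote hamlet"]
```

Inputs shorter than `max_n` tokens simply yield fewer n-grams, since the inner range is empty when `n` exceeds the number of tokens.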

Acknowledgments

This research has been partially funded by the Spanish Government under project TEXTMESS 2.0 (TIN2009-13391-C04-01) and by the University of Alicante under project GRE10-33.

Author information

Authors and Affiliations

Department of Software and Computing Systems, University of Alicante, 03080, Alicante, Spain
David Tomás & José L. Vicedo
Depto. de Lenguajes y Sistemas Informáticos, Universidad de Alicante, 03080, Alicante, Spain
David Tomás

Corresponding author

Correspondence to David Tomás.

About this article

Cite this article

Tomás, D., Vicedo, J.L. Minimally supervised question classification on fine-grained taxonomies. Knowl Inf Syst 36, 303–334 (2013). https://doi.org/10.1007/s10115-012-0557-y

Received: 22 December 2010
Revised: 30 April 2012
Accepted: 22 August 2012
Published: 18 September 2012
Issue Date: August 2013
DOI: https://doi.org/10.1007/s10115-012-0557-y
