Vocabulary Filtering for Term Weighting in Archived Question Search

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2010)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6118)

Included in the following conference series: PAKDD: Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

This paper proposes the notion of vocabulary filtering within a term weighting framework that consists of three filters at the document level, collection level, and vocabulary level. While term frequency and document frequency, along with their variations, are the dominant term weighting factors at the document level and collection level respectively, vocabulary-level factors are seldom considered in current models. Stopword removal can be seen as a vocabulary-level filter, but it is not well integrated into current term weighting models. In this paper, we propose a vocabulary filtering and multi-level term weighting model by integrating a point-wise divergence based measure into the commonly used TF-IDF model. With the proposed model, the specificity of the vocabulary is captured as a new factor in term weighting, and stopwords are handled naturally within the model rather than being removed according to a separately constructed list. Experiments on searching for similar questions in a large community-based question answering archive show that: (a) our proposed multi-level term weighting model is consistently better than single-level models for the retrieval task; and (b) the proposed vocabulary filter effectively distinguishes salient from trivial terms and can be used to construct stopword lists.
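
The abstract does not spell out the exact formulation, but the general idea of a third, vocabulary-level factor multiplied into TF-IDF can be illustrated with a minimal sketch. The sketch below assumes the vocabulary-level filter scores each term by a point-wise KL-style divergence between its probability in the question archive and its probability in a general background corpus; all function and variable names (pointwise_divergence, multilevel_weight, p_collection, p_background) are hypothetical and not the authors' own formulation.

```python
import math
from collections import Counter

def pointwise_divergence(p_collection: float, p_background: float) -> float:
    """Assumed vocabulary-level factor: p_collection * log(p_collection / p_background).
    Near zero for terms distributed similarly in the archive and the background
    corpus (e.g. stopwords), larger for archive-specific, salient terms."""
    if p_collection <= 0.0 or p_background <= 0.0:
        return 0.0
    return p_collection * math.log(p_collection / p_background)

def multilevel_weight(term: str, doc_terms: Counter, doc_freq: dict, n_docs: int,
                      p_collection: dict, p_background: dict) -> float:
    """Hypothetical three-level weight: document-level TF, collection-level IDF,
    and the vocabulary-level specificity factor above."""
    tf = doc_terms.get(term, 0)                                  # document level
    idf = math.log((n_docs + 1) / (doc_freq.get(term, 0) + 1))   # collection level (smoothed)
    vocab = pointwise_divergence(p_collection.get(term, 0.0),
                                 p_background.get(term, 0.0))    # vocabulary level
    return tf * idf * vocab
```

Under such a combination, stopwords receive vocabulary-level scores near zero and are down-weighted automatically rather than filtered against a separately maintained list, which is the behaviour the abstract attributes to the proposed model.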

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ming, Z.Y., Wang, K., Chua, T.S. (2010). Vocabulary Filtering for Term Weighting in Archived Question Search. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2010. Lecture Notes in Computer Science (LNAI), vol 6118. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13657-3_42

  • DOI: https://doi.org/10.1007/978-3-642-13657-3_42

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-13656-6

  • Online ISBN: 978-3-642-13657-3

  • eBook Packages: Computer Science, Computer Science (R0)
