Pure High-Order Word Dependence Mining via Information Geometry

Hou, Yuexian; He, Liang; Zhao, Xiaozhao; Song, Dawei

doi:10.1007/978-3-642-23318-0_8

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6931)

Included in the following conference series:

Conference on the Theory of Information Retrieval

Abstract

Classical bag-of-words models fail to capture contextual associations between words. We propose to investigate the “high-order pure dependence” among a number of words forming a semantic entity, i.e., the high-order dependence that cannot be reduced to the random coincidence of lower-order dependence. We believe that identifying these high-order pure dependence patterns will lead to a better representation of documents. We first present two formal definitions of pure dependence: Unconditional Pure Dependence (UPD) and Conditional Pure Dependence (CPD). Deciding UPD or CPD, however, is an NP-hard problem. We hence prove a series of sufficient criteria that entail UPD and CPD within the well-principled Information Geometry (IG) framework, leading to a more feasible UPD/CPD identification procedure. We further develop novel methods to extract word patterns with high-order pure dependence, which can then be used to extend the original unigram document models. Our methods are evaluated in the context of query expansion. Compared with the original unigram model and its extensions with term associations derived from constant n-grams and Apriori association rule mining, our IG-based methods prove to be mathematically more rigorous and empirically more effective.
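To make the idea of dependence that "cannot be reduced to lower-order dependence" concrete, the following is a minimal sketch and not the authors' UPD/CPD criteria, which the abstract does not spell out. It assumes three binary word-occurrence variables and uses the standard log-linear (theta) coordinates from information geometry, in which the third-order coefficient is zero whenever the joint distribution has no genuine three-way interaction. The function names `theta_123` and `pairwise_independent`, and both toy distributions, are hypothetical and chosen only for illustration.

```python
import math
from itertools import product

STATES = list(product((0, 1), repeat=3))


def theta_123(p):
    """Third-order log-linear coefficient of a strictly positive joint
    distribution over three binary variables (p maps (x1, x2, x3) -> prob)."""
    num = p[(1, 1, 1)] * p[(1, 0, 0)] * p[(0, 1, 0)] * p[(0, 0, 1)]
    den = p[(1, 1, 0)] * p[(1, 0, 1)] * p[(0, 1, 1)] * p[(0, 0, 0)]
    return math.log(num / den)


def pairwise_independent(p, tol=1e-12):
    """Check that every pair of variables is independent under p."""
    for i, j in ((0, 1), (0, 2), (1, 2)):
        pi = sum(v for x, v in p.items() if x[i] == 1)
        pj = sum(v for x, v in p.items() if x[j] == 1)
        pij = sum(v for x, v in p.items() if x[i] == 1 and x[j] == 1)
        if abs(pij - pi * pj) > tol:
            return False
    return True


# Toy case 1: three fully independent "words" (illustrative marginals).
marg = (0.3, 0.5, 0.2)
indep = {
    x: math.prod(m if xi else 1.0 - m for xi, m in zip(x, marg))
    for x in STATES
}

# Toy case 2: noisy parity. The third word mostly occurs exactly when one of
# the first two occurs but not both; every pair of words is still independent.
eps = 0.05
parity = {
    x: (1.0 - eps) / 4 if (x[0] ^ x[1]) == x[2] else eps / 4
    for x in STATES
}

print("independent:", theta_123(indep), pairwise_independent(indep))
print("noisy parity:", theta_123(parity), pairwise_independent(parity))
```

In the noisy-parity case all pairwise statistics look like chance, yet the third-order coefficient is far from zero, which is the kind of pattern a purely pairwise association measure would miss.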

Author information

Authors and Affiliations

School of Computer Sci & Tec, Tianjin University, Tianjin, China
Yuexian Hou, Liang He & Xiaozhao Zhao
School of Computing, The Robert Gordon University, Aberdeen, United Kingdom
Dawei Song

Editor information

Editors and Affiliations

Fondazione Ugo Bordoni, Viale del Policlinico 147, 00161, Rome, Italy
Giambattista Amati
Faculty of Informatics, University of Lugano, 6900, Lugano, Switzerland
Fabio Crestani

About this paper

Cite this paper

Hou, Y., He, L., Zhao, X., Song, D. (2011). Pure High-Order Word Dependence Mining via Information Geometry. In: Amati, G., Crestani, F. (eds) Advances in Information Retrieval Theory. ICTIR 2011. Lecture Notes in Computer Science, vol 6931. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23318-0_8

DOI: https://doi.org/10.1007/978-3-642-23318-0_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23317-3
Online ISBN: 978-3-642-23318-0
eBook Packages: Computer Science, Computer Science (R0)
