Abstract
Classical bag-of-words models fail to capture contextual associations between words. We propose to investigate the “high-order pure dependence” among a number of words forming a semantic entity, i.e., high-order dependence that cannot be reduced to the random coincidence of lower-order dependencies. We believe that identifying these high-order pure dependence patterns will lead to a better representation of documents. We first present two formal definitions of pure dependence: Unconditional Pure Dependence (UPD) and Conditional Pure Dependence (CPD). Deciding UPD or CPD, however, is an NP-hard problem. We therefore prove a series of sufficient criteria that entail UPD and CPD within the well-principled Information Geometry (IG) framework, leading to a more feasible UPD/CPD identification procedure. We further develop novel methods to extract word patterns with high-order pure dependence, which can then be used to extend the original unigram document models. Our methods are evaluated in the context of query expansion. Compared with the original unigram model and its extensions with term associations derived from constant n-grams and Apriori association rule mining, our IG-based methods prove mathematically more rigorous and empirically more effective.
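To make the notion of dependence that is not reducible to lower orders concrete, the following is a minimal illustrative sketch. It does not reproduce the UPD/CPD definitions or the sufficient criteria of the paper, which are not stated in this abstract. It assumes three binary word-occurrence indicators and computes the standard third-order θ-coordinate of the log-linear expansion used in information geometry. The function and variable names (`third_order_theta`, `independent`, `parity`) are hypothetical.

```python
import math
from itertools import product


def third_order_theta(p):
    """Third-order log-linear interaction for three binary variables.

    `p` maps each (x1, x2, x3) in {0, 1}^3 to a strictly positive
    probability. The value is zero when the joint distribution has no
    third-order interaction in the log-linear sense.
    Assumption: this is an illustrative IG quantity, not the paper's
    UPD/CPD test.
    """
    num = p[(1, 1, 1)] * p[(1, 0, 0)] * p[(0, 1, 0)] * p[(0, 0, 1)]
    den = p[(1, 1, 0)] * p[(1, 0, 1)] * p[(0, 1, 1)] * p[(0, 0, 0)]
    return math.log(num / den)


def independent(a, b, c):
    """Joint distribution of three independent binary variables
    with P(x1=1)=a, P(x2=1)=b, P(x3=1)=c."""
    marg = [(1 - a, a), (1 - b, b), (1 - c, c)]
    return {
        x: marg[0][x[0]] * marg[1][x[1]] * marg[2][x[2]]
        for x in product((0, 1), repeat=3)
    }


# Toy "parity" distribution: cells with an odd number of ones get more
# mass. Every pair of variables is exactly independent (each pairwise
# cell sums to 0.25), yet the three variables are jointly dependent.
parity = {
    x: (0.2 if sum(x) % 2 == 1 else 0.05)
    for x in product((0, 1), repeat=3)
}

if __name__ == "__main__":
    print(third_order_theta(independent(0.3, 0.5, 0.7)))
    print(third_order_theta(parity))
```

For the independent distribution the interaction term vanishes up to floating-point error, while for the parity distribution it is approximately log 256 even though no pair of variables is dependent. That is the kind of structure the term "pure" high-order dependence points at.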
© 2011 Springer-Verlag Berlin Heidelberg
Cite this paper
Hou, Y., He, L., Zhao, X., Song, D. (2011). Pure High-Order Word Dependence Mining via Information Geometry. In: Amati, G., Crestani, F. (eds) Advances in Information Retrieval Theory. ICTIR 2011. Lecture Notes in Computer Science, vol 6931. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23318-0_8
DOI: https://doi.org/10.1007/978-3-642-23318-0_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23317-3
Online ISBN: 978-3-642-23318-0
eBook Packages: Computer Science (R0)