Pure High-Order Word Dependence Mining via Information Geometry

  • Conference paper
Advances in Information Retrieval Theory (ICTIR 2011)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6931)

Abstract

Classical bag-of-words models fail to capture contextual associations between words. We investigate the “high-order pure dependence” among a number of words forming a semantic entity, i.e., high-order dependence that cannot be reduced to the random coincidence of lower-order dependencies. We believe that identifying these high-order pure dependence patterns will lead to a better representation of documents. We first present two formal definitions of pure dependence: Unconditional Pure Dependence (UPD) and Conditional Pure Dependence (CPD). Deciding UPD or CPD, however, is an NP-hard problem. We therefore prove a series of sufficient criteria that entail UPD and CPD within the well-principled Information Geometry (IG) framework, leading to a more feasible UPD/CPD identification procedure. We further develop novel methods to extract word patterns with high-order pure dependence, which can then be used to extend the original unigram document models. Our methods are evaluated in the context of query expansion. Compared with the original unigram model and its extensions with term associations derived from constant n-grams and Apriori association rule mining, our IG-based methods are mathematically more rigorous and empirically more effective.
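The abstract does not state the UPD/CPD criteria themselves, so the following is only an illustrative sketch of the underlying idea, not the authors' procedure. It assumes three binary word-occurrence indicators and uses the standard log-linear (theta) coordinates from information geometry, in which the third-order coefficient is zero whenever the joint distribution can be explained by first- and second-order interactions alone. The function names and the toy distributions are hypothetical.

```python
import math
from itertools import product


def theta_123(p):
    """Third-order log-linear coefficient of a joint over (x1, x2, x3) in {0,1}^3.

    `p` maps each binary triple to a strictly positive probability.
    In the expansion log p(x) = sum_i t_i x_i + sum_{i<j} t_ij x_i x_j
    + t_123 x1 x2 x3 - psi, this returns t_123.
    """
    num = p[(1, 1, 1)] * p[(1, 0, 0)] * p[(0, 1, 0)] * p[(0, 0, 1)]
    den = p[(1, 1, 0)] * p[(1, 0, 1)] * p[(0, 1, 1)] * p[(0, 0, 0)]
    return math.log(num / den)


def independent_joint(m1, m2, m3):
    """Joint distribution of three independent binary variables (toy data)."""
    marginals = (m1, m2, m3)
    return {
        x: math.prod(m if xi else 1.0 - m for xi, m in zip(x, marginals))
        for x in product((0, 1), repeat=3)
    }


def parity_biased_joint(odd=0.2, even=0.05):
    """Toy joint that is pairwise independent but not jointly independent.

    Outcomes with an odd number of ones get weight `odd`, the rest `even`;
    with the defaults the eight weights sum to 1.
    """
    return {
        x: (odd if sum(x) % 2 == 1 else even)
        for x in product((0, 1), repeat=3)
    }


if __name__ == "__main__":
    print(theta_123(independent_joint(0.3, 0.5, 0.7)))  # close to 0
    print(theta_123(parity_biased_joint()))             # clearly non-zero
```

In the parity-biased toy distribution every pair of variables is independent, yet the three together are not, which is the kind of dependence that lower-order statistics cannot account for.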

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hou, Y., He, L., Zhao, X., Song, D. (2011). Pure High-Order Word Dependence Mining via Information Geometry. In: Amati, G., Crestani, F. (eds) Advances in Information Retrieval Theory. ICTIR 2011. Lecture Notes in Computer Science, vol 6931. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23318-0_8

  • DOI: https://doi.org/10.1007/978-3-642-23318-0_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23317-3

  • Online ISBN: 978-3-642-23318-0

  • eBook Packages: Computer Science, Computer Science (R0)
