
Stochastic Modelling of Scientific Terms Distribution in Publications

  • Conference paper
Mathematical Knowledge Management (MKM 2006)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4108)



In this paper, we address the problem of automatically assigning keywords to scientific publications. We consider the idea of identifying appropriate keywords from textual traces learned from training data in a supervised manner. We introduce the transparent concept of an identification cloud as a means of representing the semantics of scientific terms. This concept is defined mathematically through models of the stochastic distributions of scientific terms over publication texts. We present characteristics of these models, as well as procedures for both non-parametric and parametric estimation of the probability distributions.
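To make the idea of learning textual traces from labelled training data more concrete, the following is a minimal Python sketch. It is not the model defined in the paper: it assumes that a "cloud" can be approximated by smoothed relative word frequencies over the texts labelled with a keyword (a simple non-parametric estimate), and that a new text is assigned the keyword whose cloud gives its words the highest average log-probability. The function names `fit_clouds`, `score` and `assign`, the smoothing parameter `alpha`, and the toy training data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def fit_clouds(docs, alpha=1.0):
    """Estimate one word distribution per keyword (assumed stand-in for a cloud).

    docs  -- list of (text, keywords) pairs used as training data
    alpha -- additive smoothing constant (an assumption of this sketch)
    """
    counts = defaultdict(Counter)
    vocab = set()
    for text, keywords in docs:
        words = text.lower().split()
        vocab.update(words)
        for kw in keywords:
            counts[kw].update(words)
    clouds = {}
    for kw, c in counts.items():
        total = sum(c.values()) + alpha * len(vocab)
        # Smoothed relative frequency of every vocabulary word for this keyword.
        clouds[kw] = {w: (c[w] + alpha) / total for w in vocab}
    return clouds


def score(text, cloud):
    """Average log-probability of the known words of `text` under `cloud`."""
    words = [w for w in text.lower().split() if w in cloud]
    if not words:
        return float("-inf")
    return sum(math.log(cloud[w]) for w in words) / len(words)


def assign(text, clouds):
    """Return the keyword whose cloud scores `text` highest."""
    return max(clouds, key=lambda kw: score(text, clouds[kw]))


# Toy training data, invented purely for illustration.
train = [
    ("markov chain transition matrix stationary distribution", ["markov chains"]),
    ("transition probabilities of a markov chain", ["markov chains"]),
    ("prime numbers and the riemann zeta function", ["number theory"]),
    ("distribution of prime numbers", ["number theory"]),
]
clouds = fit_clouds(train)
best = assign("stationary markov chain", clouds)
```

A parametric variant would replace the smoothed frequencies with a distribution family whose parameters are fitted to the same counts; which family suits the data is a modelling decision this sketch does not address.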





Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Rudzkis, R., Balys, V., Hazewinkel, M. (2006). Stochastic Modelling of Scientific Terms Distribution in Publications. In: Borwein, J.M., Farmer, W.M. (eds) Mathematical Knowledge Management. MKM 2006. Lecture Notes in Computer Science, vol 4108. Springer, Berlin, Heidelberg.



  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-37104-5

  • Online ISBN: 978-3-540-37106-9

  • eBook Packages: Computer Science, Computer Science (R0)
