Abstract
In this paper, we address the problem of automatically assigning keywords to scientific publications. We consider the idea of identifying appropriate keywords from textual traces learned from training data in a supervised manner. We introduce the transparent concept of an identification cloud as a means of representing the semantics of scientific terms. This concept is defined mathematically through stochastic models of the distribution of scientific terms over publication texts. Characteristics of the models are presented, along with procedures for both non-parametric and parametric estimation of the probability distributions.
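The abstract does not give the models themselves, so the following is only a minimal sketch of the general idea of learning textual traces for each keyword from labelled texts. It assumes a simple non-parametric estimate (smoothed document frequencies) and a log-ratio weighting; the function names `fit_clouds` and `rank_keywords`, the smoothing constant, and the toy training data are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict

def fit_clouds(labelled_texts, smoothing=1.0):
    """Build one word cloud per keyword from (text, keywords) pairs.

    Assumption: a word belongs to a keyword's cloud when it occurs in texts
    labelled with that keyword more often than in texts overall.
    """
    n_docs = 0
    doc_freq = Counter()                 # word -> number of texts containing it
    kw_docs = Counter()                  # keyword -> number of texts labelled with it
    kw_word_freq = defaultdict(Counter)  # keyword -> word -> number of labelled texts containing it
    for text, keywords in labelled_texts:
        words = set(text.lower().split())
        n_docs += 1
        doc_freq.update(words)
        for kw in keywords:
            kw_docs[kw] += 1
            kw_word_freq[kw].update(words)

    clouds = {}
    for kw, freq in kw_word_freq.items():
        cloud = {}
        for word, count in freq.items():
            # Smoothed relative frequencies (non-parametric estimates).
            p_given_kw = (count + smoothing) / (kw_docs[kw] + 2 * smoothing)
            p_overall = (doc_freq[word] + smoothing) / (n_docs + 2 * smoothing)
            weight = math.log(p_given_kw / p_overall)
            if weight > 0:               # keep only words that point towards the keyword
                cloud[word] = weight
        clouds[kw] = cloud
    return clouds

def rank_keywords(text, clouds):
    """Score each keyword by the summed weights of its cloud words found in the text."""
    words = set(text.lower().split())
    scores = {
        kw: sum(weight for word, weight in cloud.items() if word in words)
        for kw, cloud in clouds.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy training data, for illustration only.
TRAINING = [
    ("markov chain transition matrix stationary", ["Markov chains"]),
    ("markov process transition probability", ["Markov chains"]),
    ("prime number sieve factorization", ["Number theory"]),
    ("prime factorization of integer", ["Number theory"]),
]

clouds = fit_clouds(TRAINING)
ranking = rank_keywords("stationary distribution of a markov chain", clouds)
```

Here `ranking` is a list of `(keyword, score)` pairs in descending order of score. A parametric variant would replace the smoothed frequencies with a fitted distribution family; which family is appropriate is not stated in the abstract.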
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Rudzkis, R., Balys, V., Hazewinkel, M. (2006). Stochastic Modelling of Scientific Terms Distribution in Publications. In: Borwein, J.M., Farmer, W.M. (eds) Mathematical Knowledge Management. MKM 2006. Lecture Notes in Computer Science, vol 4108. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11812289_13
DOI: https://doi.org/10.1007/11812289_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37104-5
Online ISBN: 978-3-540-37106-9
eBook Packages: Computer Science, Computer Science (R0)