Stochastic Modelling of Scientific Terms Distribution in Publications

Rudzkis, Rimantas; Balys, Vaidas; Hazewinkel, Michiel

doi:10.1007/11812289_13

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4108)

Included in the following conference series:

International Conference on Mathematical Knowledge Management

Abstract

In this paper, we address the problem of automatically assigning keywords to scientific publications. We consider the idea of identifying appropriate keywords by means of textual traces learned from training data in a supervised manner. We introduce the transparent concept of an identification cloud as a means of representing the semantics of scientific terms. This concept is defined mathematically through models of the stochastic distributions of scientific terms over publication texts. We present the characteristics of these models, together with procedures for both non-parametric and parametric estimation of the probability distributions.

Author information

Authors and Affiliations

Institute of Mathematics and Informatics, Akademijos st. 4, LT-08663, Vilnius, Lithuania
Rimantas Rudzkis & Vaidas Balys
Centrum voor Wiskunde en Informatica, Kruislaan 413, NL-1098 SJ, Amsterdam, The Netherlands
Michiel Hazewinkel

Editor information

Editors and Affiliations

Faculty of Computer Science, Dalhousie University, 6050 University Avenue, Halifax, B3H 1W5, Nova Scotia, Canada
Jonathan M. Borwein
Department of Computing and Software, McMaster University, Hamilton, Ontario, Canada
William M. Farmer

Cite this paper

Rudzkis, R., Balys, V., Hazewinkel, M. (2006). Stochastic Modelling of Scientific Terms Distribution in Publications. In: Borwein, J.M., Farmer, W.M. (eds) Mathematical Knowledge Management. MKM 2006. Lecture Notes in Computer Science, vol 4108. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11812289_13

DOI: https://doi.org/10.1007/11812289_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37104-5
Online ISBN: 978-3-540-37106-9
eBook Packages: Computer Science, Computer Science (R0)
