Categorization of Multilingual Scientific Documents by a Compound Classification System

Protasiewicz, Jarosław; Mirończuk, Marcin; Dadas, Sławomir

doi:10.1007/978-3-319-59060-8_51

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10246)

Included in the following conference series:

International Conference on Artificial Intelligence and Soft Computing

Abstract

The aim of this study was to propose a classification method for documents that simultaneously contain text in several languages. For this purpose, we constructed a three-level classification system. On its first level, a data processing module prepares a suitable vector space model. Next, in the middle tier, a set of monolingual or multilingual classifiers assigns to each document, or to its parts, the probabilities of belonging to each possible category. The models are trained using the Multinomial Naïve Bayes and Long Short-Term Memory algorithms. Finally, in the last component, a multilingual decision module assigns a target class to each document. The module is built on a logistic regression classifier that takes as its inputs the probabilities produced by the classifiers. The system has been verified experimentally. The reported results suggest that the proposed system can handle textual documents whose content is written in several languages at once. Therefore, the system can be useful for automatically organizing multilingual publications or other documents.
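
The three levels described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses only the Python standard library, omits the Long Short-Term Memory classifiers, restricts the decision module to two categories, and trains both levels on the same toy data. The corpus, the category names ("physics", "biology"), and all function and class names are hypothetical. A missing language part is replaced by a uniform distribution, which is an assumption of this sketch.

```python
import math
from collections import Counter


def tokenize(text):
    # Level 1 stand-in: lowercase whitespace tokens as a bag of words.
    return text.lower().split()


class NaiveBayes:
    """Level 2: multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.counts[label].update(tokenize(text))
        self.vocab = set().union(*self.counts.values())
        return self

    def predict_proba(self, text):
        scores = []
        for c in self.classes:
            denom = sum(self.counts[c].values()) + len(self.vocab)
            score = math.log(self.prior[c] / sum(self.prior.values()))
            for word in tokenize(text):
                if word in self.vocab:  # ignore out-of-vocabulary words
                    score += math.log((self.counts[c][word] + 1) / denom)
            scores.append(score)
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        return [e / sum(exps) for e in exps]


class LogisticRegression:
    """Level 3: binary logistic regression fitted by batch gradient descent."""

    def fit(self, X, y, lr=0.5, epochs=500):
        self.w, self.b = [0.0] * len(X[0]), 0.0
        for _ in range(epochs):
            errors = [self.predict_proba(x) - t for x, t in zip(X, y)]
            for j in range(len(self.w)):
                self.w[j] -= lr * sum(e * x[j] for e, x in zip(errors, X)) / len(X)
            self.b -= lr * sum(errors) / len(X)
        return self

    def predict_proba(self, x):
        z = self.b + sum(w * v for w, v in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))


def features(doc, models):
    # Concatenate the per-language class probabilities; a missing language
    # part contributes a uniform distribution (an assumption of this sketch).
    feats = []
    for lang in sorted(models):
        k = len(models[lang].classes)
        text = doc.get(lang)
        feats += models[lang].predict_proba(text) if text else [1.0 / k] * k
    return feats


# Hypothetical toy corpus: each document has an English and a Polish part.
train = [
    ({"en": "quantum particle energy field", "pl": "kwant cząstka energia pole"}, "physics"),
    ({"en": "laser energy particle wave", "pl": "laser energia fala cząstka"}, "physics"),
    ({"en": "cell protein gene organism", "pl": "komórka białko gen organizm"}, "biology"),
    ({"en": "gene cell enzyme tissue", "pl": "gen komórka enzym tkanka"}, "biology"),
]
docs = [d for d, _ in train]
labels = [y for _, y in train]

# One monolingual classifier per language (level 2).
models = {lang: NaiveBayes().fit([d[lang] for d in docs], labels) for lang in ("en", "pl")}

# Decision module trained on the classifiers' probabilities (level 3).
decider = LogisticRegression().fit(
    [features(d, models) for d in docs],
    [1 if y == "physics" else 0 for y in labels],
)


def classify(doc):
    p_physics = decider.predict_proba(features(doc, models))
    return "physics" if p_physics >= 0.5 else "biology"


print(classify({"en": "particle energy", "pl": "fala energia"}))
```

In a realistic setting the decision module would be trained on held-out predictions of the lower-level classifiers rather than on the same documents, and a multinomial logistic regression would replace the binary one when there are more than two categories.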

Author information

Authors and Affiliations

National Information Processing Institute, Warsaw, Poland
Jarosław Protasiewicz, Marcin Mirończuk & Sławomir Dadas

Corresponding author

Correspondence to Jarosław Protasiewicz.

Editor information

Editors and Affiliations

Częstochowa University of Technology, Częstochowa, Poland
Leszek Rutkowski
Częstochowa University of Technology, Częstochowa, Poland
Marcin Korytkowski
Częstochowa University of Technology, Częstochowa, Poland
Rafał Scherer
AGH University of Science and Technology, Kraków, Poland
Ryszard Tadeusiewicz
University of California, Berkeley, California, USA
Lotfi A. Zadeh
University of Louisville, Louisville, Kentucky, USA
Jacek M. Zurada

About this paper

Cite this paper

Protasiewicz, J., Mirończuk, M., Dadas, S. (2017). Categorization of Multilingual Scientific Documents by a Compound Classification System. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L., Zurada, J. (eds) Artificial Intelligence and Soft Computing. ICAISC 2017. Lecture Notes in Computer Science, vol 10246. Springer, Cham. https://doi.org/10.1007/978-3-319-59060-8_51

DOI: https://doi.org/10.1007/978-3-319-59060-8_51
Published: 24 May 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59059-2
Online ISBN: 978-3-319-59060-8
eBook Packages: Computer Science, Computer Science (R0)
