Abstract
This study proposes a method for classifying documents that contain text in several languages at once. To this end, we constructed a three-level classification system. At the first level, a data processing module prepares a suitable vector space model. At the second level, a set of monolingual or multilingual classifiers assigns to each document, or to each of its parts, the probabilities of belonging to every possible category. These models are trained with the Multinomial Naïve Bayes and Long Short-Term Memory algorithms. At the third level, a multilingual decision module assigns a target class to each document. The module is built on a logistic regression classifier that takes the probabilities produced by the second-level classifiers as its inputs. The system has been verified experimentally. The reported results suggest that the proposed system can handle textual documents whose content is written in several languages at the same time. The system may therefore be useful for automatically organizing multilingual publications and other documents.
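The three levels described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the toy English/German corpus, the category labels `cs` and `bio`, and the helper names (`tokenize`, `TinyMultinomialNB`, `TinyLogisticRegression`, `features`, `classify`) are hypothetical. The two small classes are pure-Python stand-ins for library implementations, and the Long Short-Term Memory classifiers are omitted. For brevity the decision module is trained on the same documents as the second-level classifiers; a real setup would more likely use held-out predictions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Level 1 (simplified): lowercase bag-of-words tokens."""
    return re.findall(r"\w+", text.lower())


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TinyMultinomialNB:
    """Level 2: multinomial Naive Bayes with Laplace smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {w for d in docs for w in d}
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(docs, labels):
            self.counts[label].update(tokens)
        return self

    def predict_proba(self, tokens):
        scores = []
        for c in self.classes:
            denom = sum(self.counts[c].values()) + len(self.vocab)
            score = math.log(self.prior[c])
            for w in tokens:
                if w in self.vocab:  # out-of-vocabulary words are ignored
                    score += math.log((self.counts[c][w] + 1) / denom)
            scores.append(score)
        return softmax(scores)


class TinyLogisticRegression:
    """Level 3: multinomial logistic regression trained by plain SGD."""

    def fit(self, X, y, classes, lr=0.5, epochs=200):
        self.classes = classes
        self.W = [[0.0] * (len(X[0]) + 1) for _ in classes]  # +1 for bias
        for _ in range(epochs):
            for x, label in zip(X, y):
                probs = self.predict_proba(x)
                for k, c in enumerate(classes):
                    err = probs[k] - (1.0 if c == label else 0.0)
                    for j, v in enumerate(x + [1.0]):
                        self.W[k][j] -= lr * err * v
        return self

    def predict_proba(self, x):
        xb = x + [1.0]
        return softmax([sum(w * v for w, v in zip(row, xb)) for row in self.W])

    def predict(self, x):
        probs = self.predict_proba(x)
        return self.classes[probs.index(max(probs))]


# Hypothetical toy corpus: each document has an English and a German part.
LANGS = ["en", "de"]
train_docs = [
    {"en": "neural network training", "de": "neuronales netz training"},
    {"en": "deep learning model", "de": "modell maschinelles lernen"},
    {"en": "protein cell biology", "de": "protein zelle biologie"},
    {"en": "gene expression cell", "de": "gen expression zelle"},
]
train_labels = ["cs", "cs", "bio", "bio"]

# Level 2: one monolingual classifier per language.
models = {}
for lang in LANGS:
    parts = [tokenize(d[lang]) for d in train_docs]
    models[lang] = TinyMultinomialNB().fit(parts, train_labels)


def features(doc):
    """Concatenate per-language class probabilities; a missing part yields the priors."""
    feats = []
    for lang in LANGS:
        feats += models[lang].predict_proba(tokenize(doc.get(lang, "")))
    return feats


# Level 3: the decision module learns from the stacked probabilities.
decider = TinyLogisticRegression().fit(
    [features(d) for d in train_docs], train_labels, classes=models["en"].classes
)


def classify(doc):
    return decider.predict(features(doc))


print(classify({"en": "cell biology", "de": "zelle"}))
```

Feeding class probabilities, not raw text, into the final classifier means the decision module never needs a shared vocabulary across languages. It only has to learn how much to trust each language-specific classifier.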
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Protasiewicz, J., Mirończuk, M., Dadas, S. (2017). Categorization of Multilingual Scientific Documents by a Compound Classification System. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L., Zurada, J. (eds) Artificial Intelligence and Soft Computing. ICAISC 2017. Lecture Notes in Computer Science(), vol 10246. Springer, Cham. https://doi.org/10.1007/978-3-319-59060-8_51
DOI: https://doi.org/10.1007/978-3-319-59060-8_51
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59059-2
Online ISBN: 978-3-319-59060-8
eBook Packages: Computer Science, Computer Science (R0)