Categorization of Multilingual Scientific Documents by a Compound Classification System

  • Conference paper
Artificial Intelligence and Soft Computing (ICAISC 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10246)

Abstract

The aim of this study was to propose a classification method for documents that simultaneously contain text in several languages. For this purpose, we constructed a three-level classification system. On the first level, a data processing module prepares a suitable vector space model. Next, in the middle tier, a set of monolingual or multilingual classifiers estimates, for each document or its parts, the probability of belonging to each possible category. These models are trained using the Multinomial Naïve Bayes and Long Short-Term Memory algorithms. Finally, in the last component, a multilingual decision module assigns a target class to each document. The module is built on a logistic regression classifier that takes as input the probabilities produced by the middle-tier classifiers. The system has been verified experimentally. The reported results suggest that the proposed system can handle textual documents whose content is written in several languages at the same time. The system can therefore be useful for automatically organizing multilingual publications or other documents.
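
The abstract gives no implementation details, so the following is only a minimal sketch of how the three tiers could fit together, not the authors' implementation. It assumes two languages, two categories, a whitespace tokenizer in place of the data processing module, one Multinomial Naïve Bayes model per language (the LSTM variant is omitted), and a binary logistic regression trained on the same toy documents that trained the middle tier. All names (`NaiveBayes`, `BinaryLogisticRegression`, `train`, `classify`) and the toy data are hypothetical.

```python
import math
from collections import Counter

LANGS = ("en", "pl")  # assumed languages; the abstract does not name them


def tokenize(text):
    """Tier 1 (simplified): turn text into lowercase bag-of-words tokens."""
    return text.lower().split()


class NaiveBayes:
    """Tier 2: minimal Multinomial Naive Bayes with Laplace smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.priors = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.counts[label].update(tokenize(text))
        self.vocab = set().union(*self.counts.values())
        return self

    def predict_proba(self, text):
        """Return P(class | text) for each class, in self.classes order."""
        n_docs = sum(self.priors.values())
        log_scores = []
        for c in self.classes:
            denom = sum(self.counts[c].values()) + len(self.vocab)
            score = math.log(self.priors[c] / n_docs)
            for word in tokenize(text):
                if word in self.vocab:  # ignore out-of-vocabulary words
                    score += math.log((self.counts[c][word] + 1) / denom)
            log_scores.append(score)
        top = max(log_scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in log_scores]
        return [e / sum(exps) for e in exps]


class BinaryLogisticRegression:
    """Tier 3: decision module trained by stochastic gradient descent."""

    def fit(self, X, y, lr=0.5, epochs=500):
        self.w, self.b = [0.0] * len(X[0]), 0.0
        for _ in range(epochs):
            for x, target in zip(X, y):
                err = self.predict_proba(x) - target
                self.w = [w - lr * err * xi for w, xi in zip(self.w, x)]
                self.b -= lr * err
        return self

    def predict_proba(self, x):
        z = self.b + sum(w * xi for w, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))


def features(models, doc):
    """Concatenate the per-language class probabilities of one document."""
    row = []
    for lang in LANGS:
        # A missing language part yields the class priors for that language.
        row += models[lang].predict_proba(doc.get(lang, ""))
    return row


def train(docs, labels):
    models = {
        lang: NaiveBayes().fit([d.get(lang, "") for d in docs], labels)
        for lang in LANGS
    }
    classes = models[LANGS[0]].classes
    X = [features(models, d) for d in docs]
    y = [1.0 if label == classes[1] else 0.0 for label in labels]
    return models, BinaryLogisticRegression().fit(X, y), classes


def classify(system, doc):
    models, decider, classes = system
    p = decider.predict_proba(features(models, doc))
    return classes[1] if p >= 0.5 else classes[0]


# Toy documents, each with an English and a Polish part (illustrative only).
DOCS = [
    {"en": "quantum particle energy", "pl": "energia kwant foton"},
    {"en": "cell protein gene", "pl": "bakteria enzym gen"},
    {"en": "particle collider energy", "pl": "foton atom energia"},
    {"en": "gene expression cell", "pl": "ekspresja gen bakteria"},
]
LABELS = ["physics", "biology", "physics", "biology"]

system = train(DOCS, LABELS)
print(classify(system, {"en": "protein gene", "pl": "bakteria gen"}))
```

Feeding class probabilities rather than raw text to the last tier means the decision module only has to learn how much to trust each language-specific classifier. In a realistic setting the decision module would be fitted on probabilities from held-out documents, not on the middle tier's own training data as in this sketch.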

Author information

Corresponding author

Correspondence to Jarosław Protasiewicz.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Protasiewicz, J., Mirończuk, M., Dadas, S. (2017). Categorization of Multilingual Scientific Documents by a Compound Classification System. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L., Zurada, J. (eds) Artificial Intelligence and Soft Computing. ICAISC 2017. Lecture Notes in Computer Science, vol 10246. Springer, Cham. https://doi.org/10.1007/978-3-319-59060-8_51

  • DOI: https://doi.org/10.1007/978-3-319-59060-8_51

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-59059-2

  • Online ISBN: 978-3-319-59060-8

  • eBook Packages: Computer Science, Computer Science (R0)
