Abstract
In this work, we extend the capabilities of the emerging fastText method to open set classification by combining it with the Local Outlier Factor (LOF) algorithm. LOF extends the closed set classifier with an additional class that identifies outliers. The analyzed text documents are represented by averaged word embeddings calculated using the fastText method on training data. We evaluate this approach on the task of categorizing Polish-language Wikipedia articles into 34 subject areas. By conducting the experiment with two different outlier corpora, we show how the LOF parameter (contamination) and the dimension of the feature space (the vector representation of documents) influence the open set classification results. The results show that the proposed extension of fastText is capable of working effectively on the open set classification task. Moreover, experiments with different word embedding dimensions show that even a dimension as low as 5 is sufficient to achieve good results.
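The combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn's LocalOutlierFactor in novelty mode, uses logistic regression as a stand-in for the closed set classifier, and takes word vectors from a plain dictionary instead of a trained fastText model. The names OpenSetClassifier, average_embedding and OUTLIER are hypothetical, and the data are synthetic 5-dimensional vectors.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import LocalOutlierFactor

OUTLIER = -1  # hypothetical label for the additional "outlier" class


def average_embedding(tokens, word_vectors, dim):
    """Represent a document as the mean of its known word vectors."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)


class OpenSetClassifier:
    """Closed set classifier plus an LOF-based outlier class (a sketch)."""

    def __init__(self, contamination=0.1, n_neighbors=20):
        # Assumption: logistic regression stands in for the closed set model.
        self.clf = LogisticRegression(max_iter=1000)
        # novelty=True lets LOF score documents unseen during fitting;
        # contamination sets the share of training data treated as outliers.
        self.lof = LocalOutlierFactor(
            n_neighbors=n_neighbors,
            contamination=contamination,
            novelty=True,
        )

    def fit(self, X, y):
        self.clf.fit(X, y)
        self.lof.fit(X)
        return self

    def predict(self, X):
        labels = self.clf.predict(X)
        is_inlier = self.lof.predict(X) == 1  # LOF returns 1 or -1
        return np.where(is_inlier, labels, OUTLIER)


# Synthetic stand-in for document vectors: two known classes in 5 dimensions.
rng = np.random.default_rng(0)
X_train = np.vstack([
    rng.normal(0.0, 0.3, size=(100, 5)),
    rng.normal(5.0, 0.3, size=(100, 5)),
])
y_train = np.array([0] * 100 + [1] * 100)

model = OpenSetClassifier(contamination=0.1).fit(X_train, y_train)
X_new = np.array([[0.0] * 5, [5.0] * 5, [50.0] * 5])
print(model.predict(X_new))
```

In this sketch a larger contamination value moves the LOF threshold so that more documents are rejected as outliers, which is the trade-off the experiments examine.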
Acknowledgments
This work was sponsored by the National Science Centre, Poland (grant 2016/21/B/ST6/02159).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Walkowiak, T., Datko, S., Maciejewski, H. (2019). Open Set Subject Classification of Text Documents in Polish by Doc-to-Vec and Local Outlier Factor. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J. (eds) Artificial Intelligence and Soft Computing. ICAISC 2019. Lecture Notes in Computer Science, vol 11509. Springer, Cham. https://doi.org/10.1007/978-3-030-20915-5_41
DOI: https://doi.org/10.1007/978-3-030-20915-5_41
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20914-8
Online ISBN: 978-3-030-20915-5
eBook Packages: Computer Science, Computer Science (R0)