Open Set Subject Classification of Text Documents in Polish by Doc-to-Vec and Local Outlier Factor

Walkowiak, Tomasz; Datko, Szymon; Maciejewski, Henryk

doi:10.1007/978-3-030-20915-5_41

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11509)

Included in the following conference series:

International Conference on Artificial Intelligence and Soft Computing

Abstract

In this work, we extend the emerging fastText method to open set classification by means of the Local Outlier Factor (LOF) algorithm, which augments the closed set classifier with an additional class that identifies outliers. The analyzed text documents are represented by averaged word embeddings calculated using the fastText method on training data. We evaluate this approach on the task of categorizing Polish-language Wikipedia articles into 34 subject areas. In experiments with two different outlier corpora, we show how the LOF parameter (contamination) and the dimension of the feature space (the vector representation of documents) influence the open set classification results. The results show that the proposed extension of fastText is capable of working effectively in the open set classification task. Moreover, experiments with different word embedding dimensions show that a dimension as low as 5 is sufficient to achieve good results.
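The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: synthetic 5-dimensional vectors stand in for averaged fastText embeddings, scikit-learn's `LogisticRegression` stands in for the fastText closed set classifier, and the names `average_embedding`, `OpenSetClassifier` and `OUTLIER` are hypothetical. It also assumes a recent scikit-learn release in which `LocalOutlierFactor` accepts `novelty=True`, so that LOF can score documents not seen during training.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import LocalOutlierFactor

OUTLIER = -1  # hypothetical label for the additional "outlier" class


def average_embedding(tokens, word_vectors, dim):
    """Represent a document as the mean of the vectors of its known words."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)


class OpenSetClassifier:
    """Closed set classifier extended with an LOF-based outlier class."""

    def __init__(self, contamination=0.05, n_neighbors=20):
        # Stand-in for the fastText classifier (an assumption of this sketch).
        self.clf = LogisticRegression(max_iter=1000)
        # contamination sets the share of training points treated as outliers,
        # which fixes the decision threshold on the LOF score.
        self.lof = LocalOutlierFactor(
            n_neighbors=n_neighbors,
            contamination=contamination,
            novelty=True,
        )

    def fit(self, X, y):
        self.clf.fit(X, y)
        self.lof.fit(X)
        return self

    def predict(self, X):
        labels = self.clf.predict(X)
        inlier = self.lof.predict(X) == 1  # LOF returns 1 (inlier) or -1 (outlier)
        return np.where(inlier, labels, OUTLIER)


# Synthetic stand-in data: three well-separated classes in a 5-dimensional space.
rng = np.random.default_rng(0)
dim = 5
centers = np.eye(3, dim) * 10.0
y_train = np.repeat([0, 1, 2], 100)
X_train = centers[y_train] + rng.normal(scale=0.5, size=(300, dim))

model = OpenSetClassifier(contamination=0.05).fit(X_train, y_train)

known = model.predict(centers)                   # points typical of each class
unknown = model.predict(np.full((1, dim), 50.0)) # point far from all training data
```

Raising `contamination` moves the LOF threshold so that more documents are rejected as outliers, which is the trade-off the abstract refers to when it studies the influence of this parameter.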

Notes

1. https://scikit-learn.org/0.19/modules/generated/sklearn.neighbors.LocalOutlierFactor.html

Acknowledgments

This work was sponsored by the National Science Centre, Poland (grant 2016/21/B/ST6/02159).

Author information

Authors and Affiliations

Faculty of Electronics, Wrocław University of Science and Technology, Wrocław, Poland
Tomasz Walkowiak, Szymon Datko & Henryk Maciejewski

Corresponding author

Correspondence to Tomasz Walkowiak.

Editor information

Editors and Affiliations

Częstochowa University of Technology, Częstochowa, Poland
Leszek Rutkowski
Częstochowa University of Technology, Częstochowa, Poland
Rafał Scherer
Częstochowa University of Technology, Częstochowa, Poland
Marcin Korytkowski
University of Alberta, Edmonton, AB, Canada
Witold Pedrycz
AGH University of Science and Technology, Kraków, Poland
Ryszard Tadeusiewicz
University of Louisville, Louisville, KY, USA
Jacek M. Zurada

About this paper

Cite this paper

Walkowiak, T., Datko, S., Maciejewski, H. (2019). Open Set Subject Classification of Text Documents in Polish by Doc-to-Vec and Local Outlier Factor. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J. (eds) Artificial Intelligence and Soft Computing. ICAISC 2019. Lecture Notes in Computer Science, vol 11509. Springer, Cham. https://doi.org/10.1007/978-3-030-20915-5_41

DOI: https://doi.org/10.1007/978-3-030-20915-5_41
Published: 27 May 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20914-8
Online ISBN: 978-3-030-20915-5
eBook Packages: Computer Science, Computer Science (R0)
