Open Set Subject Classification of Text Documents in Polish by Doc-to-Vec and Local Outlier Factor

  • Conference paper
Artificial Intelligence and Soft Computing (ICAISC 2019)

Abstract

In this work, we extend the emerging fastText method to open set classification by combining it with the Local Outlier Factor (LOF) algorithm. LOF augments the closed set classifier with an additional class that identifies outliers. The analyzed text documents are represented by averaged word embeddings calculated with fastText on the training data. We evaluate this approach on the task of categorizing Polish-language Wikipedia articles into 34 subject areas. In experiments with two different outlier corpora, we show how the LOF contamination parameter and the dimension of the feature space (the vector representation of documents) influence the open set classification results. The results show that the proposed extension of fastText works effectively for the open set classification task. Moreover, experiments with different word embedding dimensions show that a dimension as low as 5 is sufficient to achieve good results.
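The pipeline outlined in the abstract (averaged word embeddings, a closed set classifier, and LOF as an outlier gate) can be illustrated with a minimal sketch. It is not the authors' implementation. The two-dimensional embeddings, the toy documents, the `SimpleLOF` class, the quantile rule that turns `contamination` into a threshold, and the nearest-centroid classifier standing in for fastText are all assumptions made for illustration. Only the LOF definitions (k-distance, reachability distance, local reachability density) follow Breunig et al.

```python
import math

def average_embedding(tokens, embeddings):
    """Mean of the word vectors of all known tokens (the document vector)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("document has no known tokens")
    return [sum(col) / len(vectors) for col in zip(*vectors)]

class SimpleLOF:
    """Textbook LOF that scores unseen points against the training data."""

    def __init__(self, k=2, contamination=0.1):
        self.k = k
        self.contamination = contamination

    def _neighbors(self, point, skip=None):
        # k nearest training points as (distance, index) pairs.
        dists = sorted((math.dist(point, x), j)
                       for j, x in enumerate(self.train) if j != skip)
        return dists[:self.k]

    def _lrd(self, neighbors):
        # Local reachability density: inverse of the mean reachability distance.
        reach = [max(self.k_dist[j], d) for d, j in neighbors]
        return len(reach) / max(sum(reach), 1e-12)

    def _lof(self, neighbors, own_lrd):
        # Ratio of the neighbours' mean density to the point's own density.
        return sum(self.lrd[j] for _, j in neighbors) / (len(neighbors) * own_lrd)

    def fit(self, train):
        self.train = train
        neigh = [self._neighbors(x, skip=i) for i, x in enumerate(train)]
        self.k_dist = [n[-1][0] for n in neigh]
        self.lrd = [self._lrd(n) for n in neigh]
        scores = sorted(self._lof(n, self.lrd[i]) for i, n in enumerate(neigh))
        # Assumed rule: threshold at the (1 - contamination) quantile of training scores.
        cut = min(int(len(scores) * (1 - self.contamination)), len(scores) - 1)
        self.threshold = scores[cut]
        return self

    def is_outlier(self, point):
        n = self._neighbors(point)
        return self._lof(n, self._lrd(n)) > self.threshold

# Hypothetical 2-dimensional word embeddings (real ones come from fastText training).
EMBEDDINGS = {
    "mecz": [1.0, 0.0], "gol": [0.9, 0.1], "liga": [1.1, 0.1],
    "koncert": [0.0, 1.0], "album": [0.1, 0.9], "piosenka": [0.1, 1.1],
    "kwark": [5.0, 5.0],
}
TRAIN_DOCS = [
    (["mecz", "gol"], "sport"), (["gol", "liga"], "sport"),
    (["mecz", "liga"], "sport"), (["mecz", "gol", "liga"], "sport"),
    (["koncert", "album"], "music"), (["album", "piosenka"], "music"),
    (["koncert", "piosenka"], "music"), (["koncert", "album", "piosenka"], "music"),
]

train_vecs = [average_embedding(tokens, EMBEDDINGS) for tokens, _ in TRAIN_DOCS]
lof = SimpleLOF(k=2, contamination=0.1).fit(train_vecs)

# Stand-in closed set classifier: nearest class centroid (not fastText itself).
centroids = {}
for label in {lbl for _, lbl in TRAIN_DOCS}:
    members = [v for v, (_, lbl) in zip(train_vecs, TRAIN_DOCS) if lbl == label]
    centroids[label] = [sum(col) / len(members) for col in zip(*members)]

def open_set_predict(tokens):
    """Return a known class label, or "outlier" when LOF rejects the document."""
    vec = average_embedding(tokens, EMBEDDINGS)
    if lof.is_outlier(vec):
        return "outlier"
    return min(centroids, key=lambda lbl: math.dist(vec, centroids[lbl]))

print(open_set_predict(["mecz", "gol", "liga", "mecz"]))
print(open_set_predict(["kwark"]))
```

In this toy setup, the first document falls inside the dense sport cluster and receives a closed set label, while the second lies far from every training document and is rejected. Raising `contamination` lowers the threshold, so more documents are rejected as outliers.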

Notes

  1. https://scikit-learn.org/0.19/modules/generated/sklearn.neighbors.LocalOutlierFactor.html

Acknowledgments

This work was sponsored by the National Science Centre, Poland (grant 2016/21/B/ST6/02159).

Author information

Corresponding author

Correspondence to Tomasz Walkowiak.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Walkowiak, T., Datko, S., Maciejewski, H. (2019). Open Set Subject Classification of Text Documents in Polish by Doc-to-Vec and Local Outlier Factor. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J. (eds) Artificial Intelligence and Soft Computing. ICAISC 2019. Lecture Notes in Computer Science, vol 11509. Springer, Cham. https://doi.org/10.1007/978-3-030-20915-5_41

  • DOI: https://doi.org/10.1007/978-3-030-20915-5_41

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-20914-8

  • Online ISBN: 978-3-030-20915-5

  • eBook Packages: Computer Science, Computer Science (R0)
