Distributed Classification of Text Documents on Apache Spark Platform

Semberecki, Piotr; Maciejewski, Henryk

doi:10.1007/978-3-319-39378-0_53

Piotr Semberecki¹⁹ &
Henryk Maciejewski¹⁹

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 9692))

Included in the following conference series:

International Conference on Artificial Intelligence and Soft Computing

1415 Accesses
10 Citations
2 Altmetric

Abstract

This paper presents implementation of the system for subject classification of text documents based on the Apache Spark distributed computing framework. Classification of text documents starts with generation of high-dimensional feature vectors from documents; the task realized with methods and tools for natural language processing. The next steps involve reduction of dimensionality of feature vectors and training classifiers. In the paper we show how these consecutive steps can be realized on the Apache Spark platform dedicated to distributed processing of big data. We illustrate the proposed method by a sample classifier aimed to predict subject category of a document in English-language Wikipedia.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. (CSUR) 34(1), 1–47 (2002)
Article Google Scholar
Koppel, M., Schler, J., Argamon, S.: Authorship attribution in the wild. Lang. Resour. Eval. 45(1), 83–94 (2011)
Article Google Scholar
Stamatatos, E.: A survey of modern authorship attribution methods. JASIST 60(3), 538–556 (2009)
Article Google Scholar
Torkkola, K.: Discriminative features for text document classification. Formal Pattern Anal. Appl. 6(4), 301–308 (2004)
MathSciNet Google Scholar
Jurafsky, D., Manning, C.: Natural Language Processing. https://www.coursera.org/course/nlp
Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. HotCloud 10, 10 (2010)
Google Scholar
Nesi, P., Pantaleo, G., Sanesi, G.: A distributed framework for NLP-based keyword and keyphrase extraction from web pages and documents. In: 21st International Conference on Distributed Multimedia Systems, DMS2015 (2015)
Google Scholar
Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python - Analyzing Text with the Natural Language Toolkit. O’Reilly, Beijing (2009)
MATH Google Scholar
Bijalwan, V., et al.: KNN based machine learning approach for text and document mining. Int. J. Database Theo. Appl. 7(1), 61–70 (2014)
Article Google Scholar
Isa, D., et al.: Text document preprocessing with the Bayes formula for classification using the support vector machine. IEEE Trans. Knowl. Data Eng. 20(9), 1264–1272 (2008)
Article Google Scholar
Wang, L., Zhao, X.: Improved KNN classification algorithms research in text categorization. In: 2nd International Conference Consumer Electronics, Communications and Networks (CECNet), IEEE (2012)
Google Scholar
Perkins, J.: Python 3 Text Processing with NLTK 3 Cookbook. Packt Publishing (2014)
Google Scholar
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: OSDI 2004 Sixth Symposium on Operating System Design and Implementation (2004)
Google Scholar
Rosnova, D.: Practical Natural Language Processing with Hadoop. https://danrosanova.files.wordpress.com/2014/04/practical-natural-language-processing-with-hadoop.pdf
Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The Hadoop Distributed File System. Yahoo!, Sunnyvale, California USA (2010)
Google Scholar
Zaharia, M., Das, T., Li, H., Hunter, T., Shenker, S., Stoica, I.: Discretized streams: fault-tolerant streaming computation at scale. In: Twenty-Fourth ACM Symposium on Operating Systems Principles, pp. 423–438. ACM New York (2013)
Google Scholar
De Smedt, T., Marfia, F., Matteucci, M., Daelemans, W.: Using wiktionary to build an italian part-of-speech tagger. In: Métais, E., Roche, M., Teisseire, M. (eds.) NLDB 2014. LNCS, vol. 8455, pp. 1–8. Springer, Heidelberg (2014)
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computer Engineering, Wrocław University of Technology, Wybrzeże Wyspiańskiego 27, 50-270, Wrocław, Poland
Piotr Semberecki & Henryk Maciejewski

Authors

Piotr Semberecki
View author publications
You can also search for this author in PubMed Google Scholar
Henryk Maciejewski
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Henryk Maciejewski .

Editor information

Editors and Affiliations

Częstochowa University of Technology, Czestochowa, Poland
Leszek Rutkowski
Częstochowa University of Technology, Czestochowa, Poland
Marcin Korytkowski
Częstochowa University of Technology, Czestochowa, Poland
Rafał Scherer
AGH University of Science and Technology, Krakow, Poland
Ryszard Tadeusiewicz
University of California, Berkeley, California, USA
Lotfi A. Zadeh
University of Louisville, Louisville, Kentucky, USA
Jacek M. Zurada

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Semberecki, P., Maciejewski, H. (2016). Distributed Classification of Text Documents on Apache Spark Platform. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L., Zurada, J. (eds) Artificial Intelligence and Soft Computing. ICAISC 2016. Lecture Notes in Computer Science(), vol 9692. Springer, Cham. https://doi.org/10.1007/978-3-319-39378-0_53

Download citation

DOI: https://doi.org/10.1007/978-3-319-39378-0_53
Published: 29 May 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-39377-3
Online ISBN: 978-3-319-39378-0
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics