Abstract
This paper presents an implementation of a system for subject classification of text documents based on the Apache Spark distributed computing framework. Classification of text documents starts with the generation of high-dimensional feature vectors from the documents, a task carried out with natural language processing methods and tools. The next steps are reducing the dimensionality of the feature vectors and training classifiers. We show how these consecutive steps can be realized on the Apache Spark platform, which is dedicated to distributed processing of big data. We illustrate the proposed method with a sample classifier that predicts the subject category of a document in the English-language Wikipedia.
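The three stages named in the abstract (feature vector generation, dimensionality reduction, classifier training) can be illustrated with a minimal single-machine sketch in plain Python. It is not the authors' Spark implementation. The stop-word list, the term-selection score, the choice of a multinomial Naive Bayes classifier, the four training sentences, and all function names are assumptions made for illustration only. On Spark, each stage would instead run as a distributed transformation over the document collection.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (assumption, not from the paper).
STOP_WORDS = {"a", "the", "in", "for"}


def tokenize(text):
    """Lower-case the text, split on non-letters, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def select_terms(labeled_docs, k):
    """Dimensionality reduction stand-in: keep the k terms whose
    total counts differ most between classes (ties broken alphabetically)."""
    per_class = defaultdict(Counter)
    for text, label in labeled_docs:
        per_class[label].update(tokenize(text))
    vocab = set().union(*per_class.values())

    def spread(term):
        counts = [c[term] for c in per_class.values()]
        return max(counts) - min(counts)

    return sorted(vocab, key=lambda t: (-spread(t), t))[:k]


def vectorize(text, terms):
    """Feature generation: a term-count vector over the selected terms."""
    counts = Counter(tokenize(text))
    return [counts[t] for t in terms]


def train_naive_bayes(labeled_docs, terms):
    """Classifier training: multinomial Naive Bayes with add-one smoothing."""
    doc_counts = Counter(label for _, label in labeled_docs)
    term_totals = {label: [0] * len(terms) for label in doc_counts}
    for text, label in labeled_docs:
        for i, c in enumerate(vectorize(text, terms)):
            term_totals[label][i] += c
    model = {}
    for label, totals in term_totals.items():
        denom = sum(totals) + len(terms)
        model[label] = (
            math.log(doc_counts[label] / len(labeled_docs)),  # log prior
            [math.log((c + 1) / denom) for c in totals],      # log likelihoods
        )
    return model


def predict(model, terms, text):
    """Return the class with the highest log posterior score."""
    vec = vectorize(text, terms)

    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(x * w for x, w in zip(vec, log_likelihood))

    return max(model, key=score)


# Hypothetical toy corpus standing in for labeled Wikipedia articles.
TRAIN = [
    ("The striker scored a goal in the match", "sports"),
    ("The team won the football match", "sports"),
    ("The CPU executes program instructions", "computing"),
    ("The compiler translates program code", "computing"),
]

TERMS = select_terms(TRAIN, k=4)
MODEL = train_naive_bayes(TRAIN, TERMS)
print(TERMS)
print(predict(MODEL, TERMS, "A great football match"))
```

Here the full vocabulary of 14 terms is cut down to 4 before training, which mirrors the role dimensionality reduction plays for the much larger feature vectors described above. A real corpus would call for a stronger selection criterion and a proper stop-word list.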
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Semberecki, P., Maciejewski, H. (2016). Distributed Classification of Text Documents on Apache Spark Platform. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L., Zurada, J. (eds) Artificial Intelligence and Soft Computing. ICAISC 2016. Lecture Notes in Computer Science, vol 9692. Springer, Cham. https://doi.org/10.1007/978-3-319-39378-0_53
DOI: https://doi.org/10.1007/978-3-319-39378-0_53
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-39377-3
Online ISBN: 978-3-319-39378-0
eBook Packages: Computer Science, Computer Science (R0)