Distributed Classification of Text Documents on Apache Spark Platform

  • Conference paper
Artificial Intelligence and Soft Computing (ICAISC 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9692)

Abstract

This paper presents an implementation of a system for subject classification of text documents based on the Apache Spark distributed computing framework. Classification of text documents starts with the generation of high-dimensional feature vectors from the documents, a task carried out with natural language processing methods and tools. The next steps involve reducing the dimensionality of the feature vectors and training classifiers. We show how these consecutive steps can be realized on the Apache Spark platform, which is dedicated to distributed processing of big data. We illustrate the proposed method with a sample classifier that predicts the subject category of a document in the English-language Wikipedia.
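The three stages named in the abstract (feature vector generation, dimensionality reduction, classifier training) can be illustrated with a minimal single-process sketch. It is not the authors' implementation and does not use Spark: the tiny stopword list, the document-frequency cutoff `min_df`, the choice of a multinomial naive Bayes classifier, the toy training sentences, and all function names are assumptions made for illustration only. On a distributed platform, the per-document steps would typically be expressed as map operations over a partitioned collection and the document-frequency count as a keyed aggregation.

```python
import math
import re
from collections import Counter

# Illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"a", "an", "and", "the", "of", "is", "in", "to"}


def tokenize(text):
    """Stage 1a: lowercase the text and split it into word tokens."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def select_vocabulary(token_lists, min_df=2):
    """Stage 2: reduce dimensionality by keeping terms found in >= min_df documents."""
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))  # count each term once per document
    return sorted(t for t, n in df.items() if n >= min_df)


def vectorize(tokens, vocabulary):
    """Stage 1b: term-count vector restricted to the selected vocabulary."""
    counts = Counter(tokens)
    return [counts[t] for t in vocabulary]


def train_naive_bayes(vectors, labels):
    """Stage 3: multinomial naive Bayes with add-one smoothing."""
    class_docs = Counter(labels)
    term_totals = {}
    for vec, label in zip(vectors, labels):
        totals = term_totals.setdefault(label, [0] * len(vec))
        term_totals[label] = [a + b for a, b in zip(totals, vec)]
    model = {}
    for label, totals in term_totals.items():
        denom = sum(totals) + len(totals)
        log_prior = math.log(class_docs[label] / len(labels))
        log_probs = [math.log((n + 1) / denom) for n in totals]
        model[label] = (log_prior, log_probs)
    return model


def classify(text, model, vocabulary):
    """Return the label with the highest log-posterior score for the text."""
    vec = vectorize(tokenize(text), vocabulary)

    def score(label):
        log_prior, log_probs = model[label]
        return log_prior + sum(x * lp for x, lp in zip(vec, log_probs))

    return max(model, key=score)


# Toy training set (hypothetical; the paper uses Wikipedia articles).
TRAINING = [
    ("the planet orbits the star", "astronomy"),
    ("a star and a planet form a system", "astronomy"),
    ("the team won the match", "sport"),
    ("a team plays a match every week", "sport"),
]

token_lists = [tokenize(text) for text, _ in TRAINING]
vocabulary = select_vocabulary(token_lists)
vectors = [vectorize(tokens, vocabulary) for tokens in token_lists]
model = train_naive_bayes(vectors, [label for _, label in TRAINING])

print(vocabulary)
print(classify("which planet is near that star", model, vocabulary))
```

Pruning by document frequency is only one simple way to shrink the feature space; it is used here because it is easy to follow, not because the paper is known to use it.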

Author information

Corresponding author

Correspondence to Henryk Maciejewski.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Semberecki, P., Maciejewski, H. (2016). Distributed Classification of Text Documents on Apache Spark Platform. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L., Zurada, J. (eds) Artificial Intelligence and Soft Computing. ICAISC 2016. Lecture Notes in Computer Science, vol 9692. Springer, Cham. https://doi.org/10.1007/978-3-319-39378-0_53

  • DOI: https://doi.org/10.1007/978-3-319-39378-0_53

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-39377-3

  • Online ISBN: 978-3-319-39378-0

  • eBook Packages: Computer Science (R0)
