Text Categorization Using Hyper Rectangular Keyword Extraction: Application to News Articles Classification

  • Conference paper
Relational and Algebraic Methods in Computer Science (RAMICS 2015)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9348)

Abstract

Automatic text categorization remains an important research topic. Typical applications include helping end users archive existing documents or browse an existing corpus of documents hierarchically. Text categorization usually consists of two main steps: keyword extraction and classification. In this paper, a corpus of documents is represented by a binary relation linking each document to the words it contains. From this relation, the Hyper Rectangle Algorithm extracts the most representative words in a hierarchical way. The hyper-rectangle associated with an element of the range of a binary relation is the union of all non-enlargeable rectangles containing it. The extracted keywords are fed into a random forest classifier to predict the category of each document. The method has been validated on the popular Reuters-21578 news articles database. The results are very promising and show the effectiveness of the hyper-rectangular method in extracting relevant keywords.
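
The hyper-rectangle definition in the abstract can be illustrated on a toy corpus. The sketch below is not the authors' implementation: the four documents, the function names, and the use of hyper-rectangle size as a ranking weight are all assumptions made for illustration, and the random forest step is left out. It builds the union of non-enlargeable rectangles by brute force and checks it against an equivalent shortcut: since any pair (d, w) where document d contains both w and the chosen word lies in some non-enlargeable rectangle containing that word, the union equals the relation restricted to the documents containing the word.

```python
from itertools import combinations

# Hypothetical toy corpus: document id -> set of words it contains.
relation = {
    "d1": {"oil", "price", "barrel"},
    "d2": {"oil", "price"},
    "d3": {"wheat", "price"},
    "d4": {"wheat", "harvest"},
}


def common_words(rel, docs):
    """Words shared by every document in docs."""
    return set.intersection(*(rel[d] for d in docs))


def docs_containing(rel, words):
    """Documents that contain every word in words."""
    return {d for d, ws in rel.items() if words <= ws}


def maximal_rectangles(rel):
    """Enumerate non-enlargeable rectangles by brute force (toy data only)."""
    found = set()
    docs = sorted(rel)
    for k in range(1, len(docs) + 1):
        for subset in combinations(docs, k):
            words = common_words(rel, subset)
            if words:
                # Closing the document side makes the rectangle non-enlargeable.
                closed = docs_containing(rel, words)
                found.add((frozenset(closed), frozenset(words)))
    return found


def hyper_rectangle_by_definition(rel, word):
    """Union of all non-enlargeable rectangles whose word side contains word."""
    result = set()
    for docs, words in maximal_rectangles(rel):
        if word in words:
            result |= {(d, w) for d in docs for w in words}
    return result


def hyper_rectangle(rel, word):
    """Shortcut: all (document, word) pairs of documents containing word."""
    return {(d, w) for d, ws in rel.items() if word in ws for w in ws}


def rank_words(rel):
    """Rank words by hyper-rectangle size (an assumed stand-in weight)."""
    vocabulary = set().union(*rel.values())
    return sorted(vocabulary, key=lambda w: (-len(hyper_rectangle(rel, w)), w))


if __name__ == "__main__":
    for word in rank_words(relation):
        print(word, len(hyper_rectangle(relation, word)))
```

In a full pipeline, the top-ranked words would become the feature columns passed to a classifier; the weighting and hierarchical selection actually used in the paper are not specified in the abstract.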

Author information

Corresponding author

Correspondence to Abdelaali Hassaine.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Hassaine, A., Mecheter, S., Jaoua, A. (2015). Text Categorization Using Hyper Rectangular Keyword Extraction: Application to News Articles Classification. In: Kahl, W., Winter, M., Oliveira, J. (eds) Relational and Algebraic Methods in Computer Science. RAMICS 2015. Lecture Notes in Computer Science, vol 9348. Springer, Cham. https://doi.org/10.1007/978-3-319-24704-5_19

  • DOI: https://doi.org/10.1007/978-3-319-24704-5_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-24703-8

  • Online ISBN: 978-3-319-24704-5

  • eBook Packages: Computer Science, Computer Science (R0)
