Cluster Based Text Classification Model

Counterterrorism and Open Source Intelligence

Part of the book series: Lecture Notes in Social Networks ((LNSN))

Abstract

We propose a cluster-based classification model for suspicious email detection and other text classification tasks. Text classification tasks typically involve many training examples, which in turn call for a complex classification model. Classifying within clusters makes the model simpler while also increasing accuracy. A test example is classified using a simpler, smaller model. The training examples in a particular cluster share a common vocabulary. At clustering time, we do not take the labels of the training examples into account. After the clusters have been created, a classifier is trained on each cluster, which has reduced dimensionality and fewer examples. The experimental results show that the proposed model outperforms existing classification models on suspicious email detection and on topic categorization with the Reuters-21578 and 20 Newsgroups datasets. Our model also outperforms the A Decision Cluster Classification (ADCC) and Decision Cluster Forest Classification (DCFC) models on the Reuters-21578 dataset.
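The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract names neither the clustering algorithm nor the per-cluster classifier, so the k-means-style clustering on cosine similarity, the multinomial Naive Bayes classifier, the routing of a test example to its most similar cluster, the toy emails, and all function names (`vectorize`, `cluster`, `fit`, `classify`, and so on) are assumptions made for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a whitespace-tokenized document."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def nearest(vec, centroids):
    """Index of the most similar centroid (first one wins on ties)."""
    return max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))


def cluster(vectors, k, iterations=10):
    """k-means-style clustering (assumed); labels are never consulted."""
    centroids = [Counter(v) for v in vectors[:k]]  # seeds: first k documents
    for _ in range(iterations):
        assign = [nearest(v, centroids) for v in vectors]
        for i in range(k):
            members = [v for v, a in zip(vectors, assign) if a == i]
            if members:  # cosine is scale-invariant, so a sum works as a centroid
                centroids[i] = sum(members, Counter())
    return centroids, [nearest(v, centroids) for v in vectors]


def train_nb(vectors, labels):
    """Multinomial Naive Bayes (assumed) over one cluster's own vocabulary."""
    vocab = {t for v in vectors for t in v}
    class_docs = Counter(labels)
    term_counts = {c: Counter() for c in class_docs}
    for v, y in zip(vectors, labels):
        term_counts[y].update(v)
    return vocab, class_docs, term_counts


def predict_nb(model, vec):
    vocab, class_docs, term_counts = model
    n_docs = sum(class_docs.values())
    best, best_score = None, -math.inf
    for c in sorted(class_docs):
        total = sum(term_counts[c].values())
        score = math.log(class_docs[c] / n_docs)
        for term, freq in vec.items():
            if term in vocab:  # terms outside the cluster vocabulary are ignored
                score += freq * math.log(
                    (term_counts[c][term] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best


def fit(texts, labels, k):
    """Cluster without labels, then train one small classifier per cluster."""
    vectors = [vectorize(t) for t in texts]
    centroids, assign = cluster(vectors, k)
    models = []
    for i in range(k):
        idx = [j for j, a in enumerate(assign) if a == i]
        models.append(train_nb([vectors[j] for j in idx],
                               [labels[j] for j in idx]))
    return centroids, models


def classify(model, text):
    """Route the test example to its nearest cluster, then use that model."""
    centroids, models = model
    vec = vectorize(text)
    return predict_nb(models[nearest(vec, centroids)], vec)


# Hypothetical toy emails: two vocabularies (travel, finance), two labels.
train_texts = [
    "flight hotel booking holiday", "bank loan interest payment",
    "flight hotel attack target", "bank payment weapons deal",
    "holiday booking beach flight", "loan interest rate bank",
    "hotel attack target plan", "payment weapons deal cash",
]
train_labels = ["normal", "normal", "suspicious", "suspicious",
                "normal", "normal", "suspicious", "suspicious"]

model = fit(train_texts, train_labels, k=2)
print(classify(model, "flight attack plan"))  # suspicious
print(classify(model, "bank loan rate"))      # normal
```

In this toy run each per-cluster classifier sees 8 distinct terms and 4 examples instead of the full 16 terms and 8 examples, which is the kind of reduction in dimensionality and training-set size the abstract refers to.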


Notes

  1. The terms categorization and classification will be used interchangeably throughout the paper.

Corresponding author

Correspondence to Sarwat Nizamani.

Copyright information

© 2011 Springer-Verlag/Wien

About this chapter

Cite this chapter

Nizamani, S., Memon, N., Wiil, U.K. (2011). Cluster Based Text Classification Model. In: Wiil, U.K. (eds) Counterterrorism and Open Source Intelligence. Lecture Notes in Social Networks. Springer, Vienna. https://doi.org/10.1007/978-3-7091-0388-3_14

  • DOI: https://doi.org/10.1007/978-3-7091-0388-3_14

  • Publisher Name: Springer, Vienna

  • Print ISBN: 978-3-7091-0387-6

  • Online ISBN: 978-3-7091-0388-3

  • eBook Packages: Computer Science (R0)
