Cluster Based Text Classification Model

Nizamani, Sarwat; Memon, Nasrullah; Wiil, Uffe Kock

doi:10.1007/978-3-7091-0388-3_14

Part of the book series: Lecture Notes in Social Networks (LNSN)

Abstract

We propose a cluster-based classification model for suspicious email detection and other text classification tasks. Text classification tasks typically comprise many training examples, which require a complex classification model. Using clusters for classification makes the model simpler and, at the same time, increases accuracy. The training examples in a particular cluster share a common vocabulary, and the clustering step does not take the labels of the training examples into account. After the clusters have been created, a classifier is trained on each cluster, which has reduced dimensionality and fewer examples. A test example is then classified using a simpler, smaller model. The experimental results show that the proposed model outperforms existing classification models on suspicious email detection and on topic categorization with the Reuters-21578 and 20 Newsgroups datasets. Our model also outperforms the A Decision Cluster Classification (ADCC) and Decision Cluster Forest Classification (DCFC) models on the Reuters-21578 dataset.
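
The abstract does not name the clustering algorithm or the per-cluster classifier, so the following is only a minimal sketch of the cluster-then-classify idea, not the authors' implementation. It assumes k-means-style clustering over bag-of-words counts with cosine similarity, and a multinomial naive Bayes classifier per cluster. The toy emails, the labels, and all function names (`fit`, `predict`, `kmeans`, `train_nb`, `predict_nb`) are hypothetical and chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def kmeans(vectors, k, iterations=10):
    # Assumption: deterministic farthest-first seeding, labels are never used.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centroids)))
    assign = None
    for _ in range(iterations):
        new_assign = [max(range(k), key=lambda j: cosine(v, centroids[j]))
                      for v in vectors]
        if new_assign == assign:
            break  # assignments are stable, so the centroids are too
        assign = new_assign
        sums = [Counter() for _ in range(k)]
        for v, j in zip(vectors, assign):
            sums[j].update(v)
        centroids = [sums[j] if sums[j] else centroids[j] for j in range(k)]
    return centroids, assign

def train_nb(docs, labels):
    # The vocabulary is restricted to terms seen in this cluster only.
    vocab = {t for d in docs for t in d}
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        term_counts[y].update(d)
    return vocab, class_docs, term_counts

def predict_nb(model, doc):
    vocab, class_docs, term_counts = model
    total = sum(class_docs.values())

    def score(y):
        # Log prior plus Laplace-smoothed log likelihood of in-vocabulary terms.
        denom = sum(term_counts[y].values()) + len(vocab)
        s = math.log(class_docs[y] / total)
        for t, c in doc.items():
            if t in vocab:
                s += c * math.log((term_counts[y][t] + 1) / denom)
        return s

    return max(class_docs, key=score) if class_docs else None

def fit(texts, labels, k):
    vectors = [Counter(tokenize(t)) for t in texts]
    centroids, assign = kmeans(vectors, k)
    models = []
    for j in range(k):
        idx = [i for i, a in enumerate(assign) if a == j]
        models.append(train_nb([vectors[i] for i in idx],
                               [labels[i] for i in idx]))
    return centroids, models

def predict(fitted, text):
    # Route the test example to its nearest cluster, then use that
    # cluster's smaller model.
    centroids, models = fitted
    v = Counter(tokenize(text))
    j = max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
    return predict_nb(models[j], v)

# Hypothetical toy data: two topics, each containing both labels.
TRAIN_TEXTS = [
    "flight ticket airport holiday",
    "flight airport bomb attack",
    "bank account salary payment",
    "bank account ransom threat",
]
TRAIN_LABELS = ["normal", "suspicious", "normal", "suspicious"]

fitted = fit(TRAIN_TEXTS, TRAIN_LABELS, k=2)
print(predict(fitted, "airport attack tonight"))  # prints "suspicious"
print(predict(fitted, "salary payment bank"))     # prints "normal"
```

In this toy run each cluster's classifier sees 6 of the 12 distinct training terms and 2 of the 4 examples, which illustrates the reduced dimensionality and smaller training set described above. An empty cluster would yield a model that returns `None`; a fuller implementation would need to handle that case.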

Notes

1. The terms categorization and classification will be used interchangeably throughout the paper.

Author information

Authors and Affiliations

Counterterrorism Research Lab, The Maersk Mc-Kinney Moller Institute, University of Southern Denmark, Odense, Denmark
Sarwat Nizamani, Nasrullah Memon & Uffe Kock Wiil
University of Sindh, Jamshoro, Pakistan
Sarwat Nizamani
Hellenic American University, Manchester, NH, USA
Nasrullah Memon

Corresponding author

Correspondence to Sarwat Nizamani.

Editor information

Editors and Affiliations

The Maersk Mc-Kinney Moller Institute, University of Southern Denmark, Campusvej 55, 5230, Odense, Denmark
Uffe Kock Wiil

About this chapter

Cite this chapter

Nizamani, S., Memon, N., Wiil, U.K. (2011). Cluster Based Text Classification Model. In: Wiil, U.K. (eds) Counterterrorism and Open Source Intelligence. Lecture Notes in Social Networks. Springer, Vienna. https://doi.org/10.1007/978-3-7091-0388-3_14

DOI: https://doi.org/10.1007/978-3-7091-0388-3_14
Published: 26 May 2011
Publisher Name: Springer, Vienna
Print ISBN: 978-3-7091-0387-6
Online ISBN: 978-3-7091-0388-3
eBook Packages: Computer Science, Computer Science (R0)