A Study on Text Clustering Algorithms Based on Frequent Term Sets

Liu, Xiangwei; He, Pilian

doi:10.1007/11527503_42


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3584)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications


Abstract

This paper introduces a new text-clustering algorithm, Frequent Term Set-based Clustering (FTSC), which uses frequent term sets to cluster texts. First, it extracts useful information from the documents and inserts it into a database. Then, it uses the Apriori algorithm from association rule mining to discover the frequent term sets efficiently. Finally, it clusters the documents according to the frequent words in subsets of the frequent term sets. The algorithm efficiently reduces the dimensionality of text data in very large databases, and can thereby improve both the accuracy and the speed of clustering. However, the clusters produced by FTSC cannot reflect overlap between text classes. An improved algorithm based on FTSC, Frequent Term Set-based Hierarchical Clustering (FTSHC), is therefore given. FTSHC determines the overlap of text classes from the overlap of their frequent word sets, and it provides an understandable description of the discovered clusters through their frequent term sets. FTSC, FTSHC and K-Means are evaluated quantitatively in experiments. The experimental results indicate that FTSC and FTSHC are more efficient than K-Means in clustering performance.
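The abstract does not give the details of FTSC or FTSHC, so the following is only a minimal sketch of the general idea it describes: mine frequent term sets with an Apriori-style level-wise search, treat the documents that contain a term set as a candidate cluster, and read cluster overlap off the shared documents. The function names (`frequent_term_sets`, `cluster_overlap`), the toy documents and the `min_support` value are all assumptions made for illustration; they are not taken from the paper.

```python
from itertools import combinations


def frequent_term_sets(docs, min_support):
    """Apriori-style search (illustrative, not the paper's exact procedure).

    docs: list of sets of terms, one set per document.
    min_support: minimum number of documents that must contain a term set.
    Returns {frozenset(terms): set of ids of documents containing all terms}.
    """
    # Level 1: collect the supporting document ids of every single term.
    cover = {}
    for doc_id, terms in enumerate(docs):
        for term in terms:
            cover.setdefault(frozenset([term]), set()).add(doc_id)
    level = {s: c for s, c in cover.items() if len(c) >= min_support}
    frequent = dict(level)

    k = 2
    while level:
        candidates = {}
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) != k or union in candidates:
                continue
            # Apriori pruning: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in level for sub in combinations(union, k - 1)):
                shared_docs = level[a] & level[b]
                if len(shared_docs) >= min_support:
                    candidates[union] = shared_docs
        frequent.update(candidates)
        level = candidates
        k += 1
    return frequent


def cluster_overlap(frequent, terms_a, terms_b):
    """Documents shared by the candidate clusters of two frequent term sets."""
    return frequent[frozenset(terms_a)] & frequent[frozenset(terms_b)]


# Hypothetical toy corpus: each document is already reduced to a set of terms.
docs = [
    {"data", "mining", "cluster"},
    {"data", "mining", "rule"},
    {"text", "cluster", "term"},
    {"text", "cluster", "data"},
]
frequent = frequent_term_sets(docs, min_support=2)
shared = cluster_overlap(frequent, {"data", "cluster"}, {"cluster", "text"})
```

In this sketch each frequent term set doubles as a readable label for its candidate cluster, which mirrors the abstract's point that frequent term sets give an understandable description of the discovered clusters. How the paper selects the final clusters from these candidates is not stated in the abstract and is left open here.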



Author information

Authors and Affiliations

Dept. of Computer Science, Tianjin University, postbox 26#, Tianjin, 300072, China
Xiangwei Liu & Pilian He
Dept. of Computer Science, Tianjin Polytechnic University, Tianjin, 300160, China
Xiangwei Liu


Editor information

Editors and Affiliations

School of Information Technology and Electrical Engineering, The University of Queensland, 4072, Brisbane, Queensland, Australia
Xue Li
The State Key Laboratory for Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, 430072, Wuhan, China
Shuliang Wang
School of ITEE, The Univ of Queensland, St. Lucia, 4072, QLD, Australia
Zhao Yang Dong


About this paper

Cite this paper

Liu, X., He, P. (2005). A Study on Text Clustering Algorithms Based on Frequent Term Sets. In: Li, X., Wang, S., Dong, Z.Y. (eds) Advanced Data Mining and Applications. ADMA 2005. Lecture Notes in Computer Science, vol 3584. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11527503_42


DOI: https://doi.org/10.1007/11527503_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-27894-8
Online ISBN: 978-3-540-31877-4
eBook Packages: Computer Science, Computer Science (R0)
