Abstract
Document clustering classifies a data set of documents into groups of closely related documents, so that the resulting clusters can be used to browse and search documents on a specific topic. In most such applications, new documents are added to the data set incrementally, and the number of words per document can vary widely. This paper proposes an incremental document clustering method for such a growing data set of documents. The normalized inverse document frequency of a word in the data set is introduced to cope with the variation in the number of words per document. In addition, instead of a single similarity measure, the average link method for document clustering uses two similarity measures: a cluster cohesion rate and a cluster participation rate. A category tree for the set of identified clusters is also introduced to assist the incremental clustering of newly added documents. The performance of the proposed method is analyzed through a series of experiments to identify its various characteristics.
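For orientation, the following is a minimal sketch of conventional average-link agglomerative clustering over TF-IDF vectors, the general family of techniques the abstract builds on. It is not the method of the paper: the abstract does not define the normalized inverse document frequency, the cluster cohesion rate, the cluster participation rate, or the category tree, so none of them are reproduced here. The length-normalized term frequency, the `log(N/df) + 1` weighting, the cosine similarity, the stopping `threshold`, and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def tf_idf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Assumption: term frequency is divided by document length, a generic
        # normalization, not the paper's normalized IDF.
        vectors.append({
            word: (count / len(doc)) * (math.log(n_docs / doc_freq[word]) + 1.0)
            for word, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def average_link(cluster_a, cluster_b, vectors):
    """Mean pairwise similarity between the members of two clusters."""
    total = sum(cosine(vectors[i], vectors[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(vectors, threshold):
    """Merge the most similar pair of clusters until none reaches threshold."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for i, j in combinations(range(len(clusters)), 2):
            sim = average_link(clusters[i], clusters[j], vectors)
            if sim > best_sim:
                best_sim, best_pair = sim, (i, j)
        if best_sim < threshold:
            break
        i, j = best_pair  # i < j, so deleting index j leaves index i valid
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    docs = [
        ["apple", "banana", "fruit"],
        ["apple", "fruit", "juice"],
        ["car", "engine", "wheel"],
        ["car", "wheel", "road"],
    ]
    print(agglomerate(tf_idf_vectors(docs), threshold=0.2))
```

This batch version recomputes every pairwise similarity on each pass, so it only illustrates the baseline that an incremental method is meant to avoid recomputing as new documents arrive.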
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Joo, K.H., Lee, S. (2005). An Incremental Document Clustering Algorithm Based on a Hierarchical Agglomerative Approach. In: Chakraborty, G. (eds) Distributed Computing and Internet Technology. ICDCIT 2005. Lecture Notes in Computer Science, vol 3816. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11604655_37
DOI: https://doi.org/10.1007/11604655_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30999-4
Online ISBN: 978-3-540-32429-4
eBook Packages: Computer Science, Computer Science (R0)