A new suffix tree similarity measure for document clustering

Authors:

Xiaotie Deng

WWW '07: Proceedings of the 16th international conference on World Wide Web

Pages 121 - 130

https://doi.org/10.1145/1242572.1242590

Published: 08 May 2007

Abstract

In this paper, we propose a new similarity measure that computes the pairwise similarity of text-based documents using the suffix tree document model. By applying the new suffix tree similarity measure in the Group-average Agglomerative Hierarchical Clustering (GAHC) algorithm, we developed a new suffix tree document clustering algorithm (NSTC). Experimental results on two standard document clustering benchmark corpora, OHSUMED and RCV1, indicate that the new clustering algorithm is very effective. Compared with the traditional word-term-weight tf-idf similarity measure in the same GAHC algorithm, NSTC achieved an improvement of 51% in average F-measure score. Furthermore, we apply the new clustering algorithm to the analysis of Web documents in online forum communities. A topic-oriented clustering algorithm is developed to help people assess, classify and search the Web documents in a large forum community.
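The general idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' NSTC implementation: instead of building a real suffix tree, it enumerates word n-grams (each one standing in for a node of a word-level suffix trie), weights them with a plain tf-idf formula, compares documents by cosine similarity, and merges clusters by average pairwise similarity. The function names, the `max_len` depth cap, the weighting formula and the sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def phrase_counts(text, max_len=3):
    """Count every word n-gram (1..max_len words) in one document.

    Each distinct n-gram stands in for a node of a word-level suffix trie;
    max_len is an assumed depth cap, not a value from the paper.
    """
    words = text.lower().split()
    counts = Counter()
    for i in range(len(words)):
        for n in range(1, max_len + 1):
            if i + n <= len(words):
                counts[tuple(words[i:i + n])] += 1
    return counts


def tfidf_vectors(docs, max_len=3):
    """Weight each phrase by tf * log(N / df); one sparse dict per document."""
    counts = [phrase_counts(d, max_len) for d in docs]
    df = Counter(p for c in counts for p in c)  # document frequency per phrase
    n_docs = len(docs)
    return [{p: tf * math.log(n_docs / df[p]) for p, tf in c.items()}
            for c in counts]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[p] for p, w in u.items() if p in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def gahc(vectors, k):
    """Naive average-link agglomerative clustering down to k clusters."""
    sim = {(i, j): cosine(vectors[i], vectors[j])
           for i, j in combinations(range(len(vectors)), 2)}
    clusters = [[i] for i in range(len(vectors))]

    def avg_sim(a, b):
        # Mean similarity over all cross-cluster document pairs.
        total = sum(sim[min(i, j), max(i, j)] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # Merge the pair of clusters with the highest average similarity.
        x, y = max(combinations(range(len(clusters)), 2),
                   key=lambda p: avg_sim(clusters[p[0]], clusters[p[1]]))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe because x < y
    return clusters


# Hypothetical toy documents, not taken from OHSUMED or RCV1.
docs = [
    "suffix tree clustering of web documents",
    "web documents clustering with suffix tree",
    "cat sits on the mat",
    "the cat sits on a mat",
]
clusters = gahc(tfidf_vectors(docs), k=2)
print(clusters)
```

Because shared multi-word phrases such as "suffix tree" contribute their own dimensions, documents that share word sequences score higher than documents that merely share isolated words, which is the intuition behind using a suffix tree model rather than single-term tf-idf.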

Published In

WWW '07: Proceedings of the 16th international conference on World Wide Web

May 2007

1382 pages

ISBN:9781595936547

DOI:10.1145/1242572

General Chairs:
Carey Williamson, University of Calgary, Canada
Mary Ellen Zurko, IBM, USA

Program Chairs:
Peter Patel-Schneider, Bell Labs Research, USA
Prashant Shenoy, University of Massachusetts at Amherst, USA

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

ACM: Association for Computing Machinery

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

WWW'07

Sponsor:

ACM

WWW'07: 16th International World Wide Web Conference

May 8 - 12, 2007

Banff, Alberta, Canada

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Bibliometrics

Total Citations: 70
Total Downloads: 1,595
Downloads (Last 12 months): 11
Downloads (Last 6 weeks): 2

Reflects downloads up to 20 Jan 2025
