DOI: 10.1145/1242572.1242590

A new suffix tree similarity measure for document clustering

Published: 08 May 2007

Abstract

In this paper, we propose a new similarity measure that computes the pairwise similarity of text-based documents using the suffix tree document model. By applying the new suffix tree similarity measure in the Group-average Agglomerative Hierarchical Clustering (GAHC) algorithm, we developed a new suffix tree document clustering algorithm (NSTC). Experimental results on two standard document clustering benchmark corpora, OHSUMED and RCV1, indicate that the new algorithm clusters documents very effectively. Compared with the traditional word-term tf-idf similarity measure in the same GAHC algorithm, NSTC improved the average F-measure score by 51%. Furthermore, we apply the new clustering algorithm to the analysis of Web documents in online forum communities. A topic-oriented clustering algorithm is developed to help people assess, classify and search the Web documents in a large forum community.
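The abstract does not define the suffix tree similarity measure itself, so the following Python sketch covers only the baseline it is compared against: tf-idf cosine similarity fed into group-average agglomerative clustering. All function names, the whitespace tokenizer, the smoothed tf-idf weighting, and the choice of averaging over cross-cluster pairs are assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (dicts) from whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed weighting: log-scaled tf times a smoothed idf.
        vectors.append(
            {t: (1 + math.log(c)) * math.log(1 + n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def gahc(sim, num_clusters):
    """Group-average agglomerative clustering over a pairwise similarity matrix.

    Repeatedly merges the two clusters whose members have the highest mean
    cross-cluster similarity until num_clusters clusters remain.
    """
    num_clusters = max(1, num_clusters)
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > num_clusters:
        best, best_pair = -1.0, None
        for a, b in combinations(range(len(clusters)), 2):
            total = sum(sim[i][j] for i in clusters[a] for j in clusters[b])
            avg = total / (len(clusters[a]) * len(clusters[b]))
            if avg > best:
                best, best_pair = avg, (a, b)
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a stays valid
    return [sorted(c) for c in clusters]


if __name__ == "__main__":
    docs = [
        "suffix tree clustering of web documents",
        "suffix tree document clustering",
        "stock market prices fall",
        "stock market prices rise",
    ]
    vecs = tfidf_vectors(docs)
    sim = [[cosine(u, v) for v in vecs] for u in vecs]
    print(gahc(sim, 2))
```

Because `gahc` takes only a similarity matrix, a different pairwise measure (such as one derived from a suffix tree) could be substituted for `cosine` without changing the clustering step, which mirrors the comparison setup the abstract describes.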

Published In

WWW '07: Proceedings of the 16th international conference on World Wide Web
May 2007
1382 pages
ISBN:9781595936547
DOI:10.1145/1242572
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. document model
  2. similarity measure
  3. suffix tree

Qualifiers

  • Article

Conference

WWW'07
WWW'07: 16th International World Wide Web Conference
May 8 - 12, 2007
Banff, Alberta, Canada

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
