Improving Suffix Tree Clustering with New Ranking and Similarity Measures

  • Conference paper
Advanced Data Mining and Applications (ADMA 2011)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7121)

Abstract

Retrieving relevant information from the web, which contains an enormous amount of data, is a highly complicated research problem. A landmark contribution to this area is web clustering, which efficiently organizes a large number of web documents into a small number of meaningful and coherent groups [1,2]. Various techniques aim to categorize web pages into clusters accurately and automatically. Suffix Tree Clustering (STC) is a phrase-based, state-of-the-art algorithm for web clustering that automatically groups semantically related documents based on shared phrases. Research has shown that it outperforms other clustering algorithms such as K-means and Buckshot because it uses phrases efficiently to identify clusters. Using STC as the baseline, we introduce a new method for ranking base clusters and new similarity measures for comparing clusters. Our STHAC technique combines the Hierarchical Agglomerative Clustering method with phrase-based Suffix Tree Clustering to improve the cluster merging process. Experimental results show that STHAC outperforms the original STC as well as ESTC (our previous extended version of STC), with a 16% increase in F-measure. This increase is achieved through better filtering of low-scoring clusters, better similarity measures, and efficient cluster merging algorithms.
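
For orientation, the baseline STC idea that the abstract builds on (documents grouped by shared phrases, then overlapping base clusters merged) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the STHAC method, whose ranking and similarity measures are not given in the abstract. It assumes the overlap rule usually attributed to the original STC [3] (two base clusters are similar when their intersection exceeds half of each), and it replaces the suffix tree with a naive n-gram enumeration. The names base_clusters, similar, and merge_clusters, and the parameters max_len, min_docs, and threshold, are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def base_clusters(docs, max_len=3, min_docs=2):
    """Map each shared phrase (word n-gram) to the set of documents containing it.

    Assumption: naive n-gram enumeration stands in for the suffix tree.
    """
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase_docs[tuple(words[i:i + n])].add(doc_id)
    # Keep only phrases shared by at least min_docs documents.
    return {p: d for p, d in phrase_docs.items() if len(d) >= min_docs}


def similar(a, b, threshold=0.5):
    """Baseline overlap rule: the intersection must exceed the threshold
    fraction of both document sets."""
    overlap = len(a & b)
    return overlap / len(a) > threshold and overlap / len(b) > threshold


def merge_clusters(clusters, threshold=0.5):
    """Merge base clusters that are connected in the similarity graph."""
    phrases = list(clusters)
    parent = list(range(len(phrases)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(phrases)), 2):
        if similar(clusters[phrases[i]], clusters[phrases[j]], threshold):
            parent[find(i)] = find(j)

    merged = defaultdict(set)
    for i, phrase in enumerate(phrases):
        merged[find(i)] |= clusters[phrase]
    # Return each final cluster as a sorted list of document ids.
    return sorted(sorted(doc_ids) for doc_ids in merged.values())
```

Calling merge_clusters(base_clusters(docs)) on a list of short text snippets returns the merged clusters as lists of document indices. A hierarchical agglomerative variant would replace the single-pass connected-components step with repeated merging of the most similar pair, which is where a different similarity measure could be plugged in.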

References

  1. Chen, M.-S., Han, J., Yu, P.S.: Data mining: An overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering 8, 866–883 (1996)

  2. Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34, 1–47 (2002)

  3. Zamir, O., Etzioni, O.: Web document clustering: a feasibility demonstration. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1998, pp. 46–54. ACM, New York (1998)

  4. Crabtree, D., Gao, X., Andreae, P.: Improving web clustering by cluster selection. In: Proceedings of 2005 IEEE/WIC/ACM International Conference on Web Intelligence, pp. 172–178 (2005)

  5. Andersson, A., Larsson, N.J., Swanson, K.: Suffix trees on words. In: Hirschberg, D.S., Myers, G. (eds.) CPM 1996. LNCS, vol. 1075, pp. 102–115. Springer, Heidelberg (1996)

  6. Andersson, A., Nilsson, S.: Efficient implementation of suffix trees. Software Practice and Experience 25, 129–141 (1995)

  7. Chim, H., Deng, X.: A new suffix tree similarity measure for document clustering. ACM 978-1-59593-654 (2007)

  8. Lin, D.: Automatic retrieval and clustering of similar words. In: Proceedings of the 17th International Conference on Computational Linguistics, COLING 1998, vol. 2, pp. 768–774. Association for Computational Linguistics, Stroudsburg (1998)

  9. Manning, C.D., Raghavan, P., Schütze, H.: Hierarchical agglomerative clustering, http://nlp.stanford.edu/IR-book/html/htmledition/hierarchical-agglomerative-clustering-1.html (2008)

  10. Hammouda, K., Kamel, M.: Efficient phrase-based document indexing for web document clustering. IEEE Transactions on Knowledge and Data Engineering 16(10), 1279–1296 (2004)

  11. Hammouda, K., Kamel, M.: Phrase-based document similarity based on an index graph model. In: Proceedings of 2002 IEEE International Conference on Data Mining ICDM, pp. 203–210 (2002)

  12. Chim, H., Deng, X.: A new suffix tree similarity measure for document clustering. In: Proceedings of the 16th International Conference on World Wide Web, WWW 2007, pp. 121–130. ACM, New York (2007)

  13. Aas, K., Eikvil, L.: Text categorization: A survey. Technical Report 941, Norwegian Computing Center (1999)

  14. Chim, H., Deng, X.: Efficient phrase-based document similarity for clustering. IEEE Transactions on Knowledge and Data Engineering 20(9), 1217–1229 (2008)

  15. Salton, G., McGill, M.: Introduction to modern information retrieval. McGraw-Hill (1983)

  16. Huang, A.: Similarity measures for text document clustering, pp. 49–56 (2008)

  17. Strehl, A., Ghosh, J., Mooney, R.: Impact of similarity measures on web-page clustering. In: Workshop on Artificial Intelligence for Web Search, AAAI, pp. 58–64 (2000)

  18. McCreight, E.M.: A space-economical suffix tree construction algorithm. Journal of the ACM 23(2), 262–272 (1976)

  19. Janruang, J., Guha, S.: Semantic suffix tree clustering. In: First IRAST International Conference on Data Engineering and Internet Technology, DEIT (2011)

  20. Carpineto, C., Osinski, S., Romano, G., Weiss, D.: A survey of web clustering engines. ACM Computing Surveys 41, 1–38 (2009)

  21. Zamir, O., Etzioni, O.: Grouper: A dynamic clustering interface to web search results. In: Proceedings of the Eighth International World Wide Web Conference, pp. 283–296. Elsevier, Toronto (1999)

  22. Weiner, P.: Linear pattern matching algorithms. In: IEEE Conference Record of 14th Annual Symposium on Switching and Automata Theory, SWAT 1973, pp. 1–11 (1973)

  23. Wang, J., Li, R.: A New Cluster Merging Algorithm of Suffix Tree Clustering. In: Shi, Z., Shimohara, K., Feng, D. (eds.) Intelligent Information Processing III. IFIP AICT, vol. 228, pp. 197–203. Springer, Boston (2006)

  24. Zhang, D., Dong, Y.: Semantic, Hierarchical, online Clustering of Web Search Results. In: Yu, J.X., Lin, X., Lu, H., Zhang, Y. (eds.) APWeb 2004. LNCS, vol. 3007, pp. 69–78. Springer, Heidelberg (2004)

  25. Osinski, S., Weiss, D.: A concept-driven algorithm for clustering search results. IEEE Intelligent Systems 20, 48–54 (2005)

  26. Sameh, A.: Semantic web search results clustering using lingo and wordnet. International Journal of Research and Reviews in Computer Science (IJRRCS) 1(2) (June 2010)

  27. Crabtree, D., Andreae, P., Gao, X.: Query directed web page clustering. In: Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2006, pp. 202–210. IEEE Computer Society, Washington, DC, USA (2006)

  28. Kale, A., Bharambe, U., SashiKumar, M.: A new suffix tree similarity measure and labeling for web search results clustering. In: 2nd International Conference on Emerging Trends in Engineering and Technology (ICETET), pp. 856–861 (2009)

  29. Wu, J., Wang, Z.: Search results clustering in Chinese context based on a new suffix tree. In: Proceedings of the 2008 IEEE 8th International Conference on Computer and Information Technology Workshops, pp. 110–115. IEEE Computer Society, Washington, DC, USA (2008)

  30. Alsabti, K.: An efficient k-means clustering algorithm. In: Proceedings of IPPS/SPDP Workshop on High Performance Data Mining (1998)

  31. Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: Analysis and an algorithm. In: Advances in Neural Information Processing Systems, vol. 14, pp. 849–856. MIT Press (2001)

  32. Beil, F., Ester, M., Xu, X.: Frequent term-based text clustering. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2002, pp. 436–442. ACM, New York (2002)

  33. Milligan, G.W., Sokol, L.: A two stage clustering algorithm with robust recovery characteristics. Educational and Psychological Measurement 40, 755–759 (1980)

  34. Kessler, B.: Computational dialectology in Irish Gaelic. Computing Research Repository cmp-lg/950, 60–66 (1995)

  35. Everitt, B.S., Landau, S., Leese, M.: Cluster analysis, 4th edn. Oxford University Press (1993)

  36. Cilibrasi, R., Vitanyi, P.: The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering 19(3), 370–383 (2007)

  37. Willett, P.: Recent trends in hierarchic document clustering: A critical review. Information Processing and Management 24(5), 577–597 (1988)

  38. Kjos-Hanssen, B., Evangelista, A.J.: Google distance between words. Computing Research Repository abs/0901.4 (2009)

  39. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. In: Information Processing and Management, vol. 24, pp. 513–523. Pergamon Press plc, Great Britain (1988)

  40. Machado, D., Barbosa, T., Pais, S., Martins, B., Dias, G.: Universal Mobile Information Retrieval. In: Stephanidis, C. (ed.) UAHCI 2009, Part II. LNCS, vol. 5615, pp. 345–354. Springer, Heidelberg (2009)

  41. Losee, R.M.: When information retrieval measures agree about the relative quality of document rankings. Journal of the American Society for Information Science 51(9), 834–840 (2000)

  42. Crabtree, D.: Raw data (2005), http://www.danielcrabtree.com/research/wi05/rawdata.zip

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Worawitphinyo, P., Gao, X., Jabeen, S. (2011). Improving Suffix Tree Clustering with New Ranking and Similarity Measures. In: Tang, J., King, I., Chen, L., Wang, J. (eds) Advanced Data Mining and Applications. ADMA 2011. Lecture Notes in Computer Science, vol 7121. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25856-5_5

  • DOI: https://doi.org/10.1007/978-3-642-25856-5_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-25855-8

  • Online ISBN: 978-3-642-25856-5

  • eBook Packages: Computer Science, Computer Science (R0)
