Improving Suffix Tree Clustering with New Ranking and Similarity Measures

Worawitphinyo, Phiradit; Gao, Xiaoying; Jabeen, Shahida

doi:10.1007/978-3-642-25856-5_5

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7121)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications

Abstract

Retrieving relevant information from the web, which contains an enormous amount of data, is a highly complicated research area. A landmark line of research in this area is web clustering, which efficiently organizes a large number of web documents into a small number of meaningful and coherent groups [1,2]. Various techniques aim to categorize web pages into clusters accurately and automatically. Suffix Tree Clustering (STC) is a phrase-based, state-of-the-art algorithm for web clustering that automatically groups semantically related documents based on shared phrases. Research has shown that it outperforms other clustering algorithms such as K-means and Buckshot because of its efficient use of phrases to identify clusters. Using STC as the baseline, we introduce a new method for ranking base clusters and new similarity measures for comparing clusters. Our STHAC technique combines the Hierarchical Agglomerative Clustering method with phrase-based Suffix Tree Clustering to improve the cluster merging process. Experimental results show that STHAC outperforms the original STC as well as ESTC (our previous extended version of STC), with a 16% increase in F-measure. This increase is due to STHAC's better filtering of low-scoring clusters, better similarity measures, and an efficient cluster merging algorithm.
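The abstract does not spell out the new ranking or similarity measures, so the following is only a minimal sketch of the general idea of merging phrase-based base clusters agglomeratively and scoring the result with F-measure. It assumes each base cluster is simply a set of document ids, and it substitutes the document-overlap criterion with a 0.5 threshold from the original STC for the paper's own measures. The names `overlap_similarity`, `merge_base_clusters`, and `f_measure` are hypothetical and not taken from the paper.

```python
from itertools import combinations


def overlap_similarity(a: frozenset, b: frozenset) -> float:
    """Smaller of the two overlap ratios |a & b| / |a| and |a & b| / |b|.

    Assumption: this stands in for the paper's similarity measures,
    which the abstract does not define.
    """
    if not a or not b:
        return 0.0
    shared = len(a & b)
    return min(shared / len(a), shared / len(b))


def merge_base_clusters(base_clusters, threshold=0.5):
    """Agglomeratively merge clusters given as sets of document ids.

    At each step the most similar pair is merged; merging stops once no
    pair has a similarity above `threshold` (0.5 is an assumed default).
    """
    clusters = [frozenset(c) for c in base_clusters]
    while len(clusters) > 1:
        best_score, (i, j) = max(
            (overlap_similarity(clusters[i], clusters[j]), (i, j))
            for i, j in combinations(range(len(clusters)), 2)
        )
        if best_score <= threshold:
            break
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


def f_measure(cluster, topic) -> float:
    """Harmonic mean of precision and recall of a cluster against a topic."""
    shared = len(set(cluster) & set(topic))
    if shared == 0:
        return 0.0
    precision = shared / len(cluster)
    recall = shared / len(topic)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    base = [{1, 2, 3}, {2, 3, 4}, {7, 8}, {7, 8, 9}]
    for cluster in merge_base_clusters(base):
        print(sorted(cluster))
```

In this toy input the two pairs of heavily overlapping base clusters collapse into two final clusters, while unrelated clusters stay apart. A real system would first rank and filter base clusters before merging, a step this sketch omits.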

References

Chen, M.-S., Han, J., Yu, P.S.: Data mining: An overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering 8, 866–883 (1996)
Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34, 1–47 (2002)
Zamir, O., Etzioni, O.: Web document clustering: a feasibility demonstration. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1998, pp. 46–54. ACM, New York (1998)
Crabtree, D., Gao, X., Andreae, P.: Improving web clustering by cluster selection. In: Proceedings of 2005 IEEE/WIC/ACM International Conference on Web Intelligence, pp. 172–178 (2005)
Andersson, A., Larsson, N.J., Swanson, K.: Suffix trees on words. In: Hirschberg, D.S., Meyers, G. (eds.) CPM 1996. LNCS, vol. 1075, pp. 102–115. Springer, Heidelberg (1996)
Andersson, A., Nilsson, S.: Efficient implementation of suffix trees. Software Practice and Experience 25, 129–141 (1995)
Chim, H., Deng, X.: A new suffix tree similarity measure for document clustering. ACM 978-1-59593-654 (2007)
Lin, D.: Automatic retrieval and clustering of similar words. In: Proceedings of the 17th International Conference on Computational Linguistics, COLING 1998, vol. 2, pp. 768–774. Association for Computational Linguistics, Stroudsburg (1998)
Manning, C.D., Raghavan, P., Schütze, H.: Hierarchical agglomerative clustering, http://nlp.stanford.edu/IR-book/html/htmledition/hierarchical-agglomerative-clustering-1.html (2008)
Hammouda, K., Kamel, M.: Efficient phrase-based document indexing for web document clustering. IEEE Transactions on Knowledge and Data Engineering 16(10), 1279–1296 (2004)
Hammouda, K., Kamel, M.: Phrase-based document similarity based on an index graph model. In: Proceedings of 2002 IEEE International Conference on Data Mining ICDM, pp. 203–210 (2002)
Chim, H., Deng, X.: A new suffix tree similarity measure for document clustering. In: Proceedings of the 16th International Conference on World Wide Web, WWW 2007, pp. 121–130. ACM, New York (2007)
Aas, K., Eikvil, L.: Text categorization: A survey. Technical Report 941, Norwegian Computing Center (1999)
Chim, H., Deng, X.: Efficient phrase-based document similarity for clustering. IEEE Transactions on Knowledge and Data Engineering 20(9), 1217–1229 (2008)
Salton, G., McGill, M.: Introduction to modern information retrieval. McGraw-Hill (1983)
Huang, A.: Similarity measures for text document clustering, pp. 49–56 (2008)
Strehl, A., Ghosh, J., Mooney, R.: Impact of similarity measures on web-page clustering. In: Workshop on Artificial Intelligence for Web Search, AAAI, pp. 58–64 (2000)
McCreight, E.M.: A space-economical suffix tree construction algorithm. Journal of the ACM 23(2), 262–272 (1976)
Janruang, J., Guha, S.: Semantic suffix tree clustering. In: First IRAST International Conference on Data Engineering and Internet Technology, DEIT (2011)
Carpineto, C., Osinski, S., Romano, G., Weiss, D.: A survey of web clustering engines. ACM Computing Surveys 41, 1–38 (2009)
Zamir, O., Etzioni, O.: Grouper: A dynamic clustering interface to web search results. In: Proceedings of the Eighth International World Wide Web Conference, pp. 283–296. Elsevier, Toronto (1999)
Weiner, P.: Linear pattern matching algorithms. In: IEEE Conference Record of 14th Annual Symposium on Switching and Automata Theory, SWAT 1973, pp. 1–11 (1973)
Wang, J., Li, R.: A New Cluster Merging Algorithm of Suffix Tree Clustering. In: Shi, Z., Shimohara, K., Feng, D. (eds.) Intelligent Information Processing III. IFIP AICT, vol. 228, pp. 197–203. Springer, Boston (2006)
Zhang, D., Dong, Y.: Semantic, Hierarchical, online Clustering of Web Search Results. In: Yu, J.X., Lin, X., Lu, H., Zhang, Y. (eds.) APWeb 2004. LNCS, vol. 3007, pp. 69–78. Springer, Heidelberg (2004)
Osinski, S., Weiss, D.: A concept-driven algorithm for clustering search results. IEEE Intelligent Systems 20, 48–54 (2005)
Sameh, A.: Semantic web search results clustering using lingo and wordnet. International Journal of Research and Reviews in Computer Science (IJRRCS) 1(2) (June 2010)
Crabtree, D., Andreae, P., Gao, X.: Query directed web page clustering. In: Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2006, pp. 202–210. IEEE Computer Society, Washington, DC, USA (2006)
Kale, A., Bharambe, U., SashiKumar, M.: A new suffix tree similarity measure and labeling for web search results clustering. In: 2nd International Conference on Emerging Trends in Engineering and Technology (ICETET), pp. 856–861 (2009)
Wu, J., Wang, Z.: Search results clustering in Chinese context based on a new suffix tree. In: Proceedings of the 2008 IEEE 8th International Conference on Computer and Information Technology Workshops, pp. 110–115. IEEE Computer Society, Washington, DC, USA (2008)
Alsabti, K.: An efficient k-means clustering algorithm. In: Proceedings of IPPS/SPDP Workshop on High Performance Data Mining (1998)
Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: Analysis and an algorithm. In: Advances in Neural Information Processing Systems, vol. 14, pp. 849–856. MIT Press (2001)
Beil, F., Ester, M., Xu, X.: Frequent term-based text clustering. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2002, pp. 436–442. ACM, New York (2002)
Milligan, G.W., Sokol, L.: A two stage clustering algorithm with robust recovery characteristics. Educational and Psychological Measurement 40, 755–759 (1980)
Kessler, B.: Computational dialectology in Irish Gaelic. Computing Research Repository cmp-lg/950, 60–66 (1995)
Everitt, B.S., Landau, S., Leese, M.: Cluster analysis, 4th edn. Oxford University Press (1993)
Cilibrasi, R., Vitanyi, P.: The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering 19(3), 370–383 (2007)
Willett, P.: Recent trends in hierarchic document clustering: A critical review. Information Processing and Management 24(5), 577–597 (1988)
Kjos-hanssen, B., Evangelista, A.J.: Google distance between words. Computing Research Repository abs/0901.4 (2009)
Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. In: Information Processing and Management, vol. 24, pp. 513–523. Pergamon Press plc, Great Britain (1988)
Machado, D., Barbosa, T., Pais, S., Martins, B., Dias, G.: Universal Mobile Information Retrieval. In: Stephanidis, C. (ed.) UAHCI 2009, Part II. LNCS, vol. 5615, pp. 345–354. Springer, Heidelberg (2009)
Losee, R.M.: When information retrieval measures agree about the relative quality of document rankings. Journal of the American Society for Information Science 51(9), 834–840 (2000)
Crabtree, D.: Raw data (2005), http://www.danielcrabtree.com/research/wi05/rawdata.zip

Author information

Authors and Affiliations

School of Engineering and Computer Science, Victoria University of Wellington, P.O. Box 600, Wellington, New Zealand
Phiradit Worawitphinyo, Xiaoying Gao & Shahida Jabeen

Editor information

Editors and Affiliations

Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, China
Jie Tang & Jianyong Wang
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, SAR, China
Irwin King
Faculty of Engineering and Information Technology, University of Technology, 2007, Sydney, NSW, Australia
Ling Chen

Cite this paper

Worawitphinyo, P., Gao, X., Jabeen, S. (2011). Improving Suffix Tree Clustering with New Ranking and Similarity Measures. In: Tang, J., King, I., Chen, L., Wang, J. (eds) Advanced Data Mining and Applications. ADMA 2011. Lecture Notes in Computer Science, vol 7121. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25856-5_5

DOI: https://doi.org/10.1007/978-3-642-25856-5_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25855-8
Online ISBN: 978-3-642-25856-5
eBook Packages: Computer Science, Computer Science (R0)
