Abstract
Retrieving relevant information from the web, which contains an enormous amount of data, is a highly complicated research problem. A landmark contribution to this area is web clustering, which efficiently organizes a large number of web documents into a small number of meaningful and coherent groups [1,2]. Various techniques aim to categorize web pages into clusters accurately and automatically. Suffix Tree Clustering (STC) is a phrase-based, state-of-the-art algorithm for web clustering that automatically groups semantically related documents based on shared phrases. Research has shown that it outperforms other clustering algorithms such as K-means and Buckshot because it uses phrases efficiently to identify clusters. Using STC as the baseline, we introduce a new method for ranking base clusters and new similarity measures for comparing clusters. Our STHAC technique combines the Hierarchical Agglomerative Clustering method with phrase-based Suffix Tree Clustering to improve the cluster merging process. Experimental results show that STHAC outperforms the original STC as well as ESTC (our previous extended version of STC), with a 16% increase in F-measure. This increase is achieved through better filtering of low-scoring clusters, better similarity measures, and efficient cluster merging algorithms.
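The pipeline outlined in the abstract (find base clusters from shared phrases, rank them, then merge similar clusters agglomeratively) can be illustrated with a minimal sketch. This is not the STHAC method itself, whose ranking function and similarity measures are not given in the abstract. As loudly stated assumptions: word n-grams stand in for suffix-tree phrases, the hypothetical `score` function multiplies document count by phrase length, and Jaccard overlap of document sets with a threshold of 0.5 stands in for the paper's similarity measures. All function names are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def base_clusters(docs, max_len=3, min_docs=2):
    """Map each shared phrase to the set of documents that contain it.

    Assumption: word n-grams up to max_len approximate the phrases a
    suffix tree would expose; a real STC implementation builds the tree.
    """
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                index[tuple(words[i:i + n])].add(doc_id)
    # A base cluster needs a phrase shared by at least min_docs documents.
    return {p: d for p, d in index.items() if len(d) >= min_docs}


def score(phrase, doc_ids):
    """Hypothetical ranking score: more documents and longer phrases rank higher."""
    return len(doc_ids) * len(phrase)


def jaccard(a, b):
    """Overlap of two document sets, used here as a stand-in similarity measure."""
    return len(a & b) / len(a | b)


def merge_agglomeratively(doc_sets, threshold=0.5):
    """Repeatedly merge the most similar pair until no pair reaches the threshold."""
    clusters = [set(s) for s in doc_sets]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for i, j in combinations(range(len(clusters)), 2):
            sim = jaccard(clusters[i], clusters[j])
            if sim > best_sim:
                best_sim, best_pair = sim, (i, j)
        if best_sim < threshold:
            break
        i, j = best_pair
        clusters[i] |= clusters[j]  # j > i, so deleting j keeps i valid
        del clusters[j]
    return clusters


docs = [
    "suffix tree clustering",
    "suffix tree search",
    "web page ranking",
    "web page search",
]
bc = base_clusters(docs)
# Rank base clusters; low-scoring ones could be filtered out before merging.
ranked = sorted(bc.items(), key=lambda kv: score(*kv), reverse=True)
merged = merge_agglomeratively(bc.values())
```

In this toy run, base clusters that cover the same documents collapse into one cluster, while clusters that share only one of their documents stay separate because their overlap falls below the assumed threshold.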
References
Chen, M.S., Han, J., Yu, P.S.: Data mining: An overview from database perspective. IEEE Transactions on Knowledge and Data Engineering 8, 866–883 (1996)
Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34, 1–47 (2002)
Zamir, O., Etzioni, O.: Web document clustering: a feasibility demonstration. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1998, pp. 46–54. ACM, New York (1998)
Crabtree, D., Gao, X., Andreae, P.: Improving web clustering by cluster selection. In: Proceedings of 2005 IEEE/WIC/ACM International Conference on Web Intelligence, pp. 172–178 (2005)
Andersson, A., Larsson, N.J., Swanson, K.: Suffix trees on words. In: Hirschberg, D.S., Meyers, G. (eds.) CPM 1996. LNCS, vol. 1075, pp. 102–115. Springer, Heidelberg (1996)
Andersson, A., Nilsson, S.: Efficient implementation of suffix trees. Software Practice and Experience 25, 129–141 (1995)
Chim, H., Deng, X.: A new suffix tree similarity measure for document clustering. ACM 978-1-59593-654 (2007)
Lin, D.: Automatic retrieval and clustering of similar words. In: Proceedings of the 17th International Conference on Computational Linguistics, COLING 1998, vol. 2, pp. 768–774. Association for Computational Linguistics, Stroudsburg (1998)
Manning, C.D., Raghavan, P., Schütze, H.: Hierarchical agglomerative clustering, http://nlp.stanford.edu/IR-book/html/htmledition/hierarchical-agglomerative-clustering-1.html (2008)
Hammouda, K., Kamel, M.: Efficient phrase-based document indexing for web document clustering. IEEE Transactions on Knowledge and Data Engineering 16(10), 1279–1296 (2004)
Hammouda, K., Kamel, M.: Phrase-based document similarity based on an index graph model. In: Proceedings of 2002 IEEE International Conference on Data Mining ICDM, pp. 203–210 (2002)
Chim, H., Deng, X.: A new suffix tree similarity measure for document clustering. In: Proceedings of the 16th International Conference on World Wide Web, WWW 2007, pp. 121–130. ACM, New York (2007)
Aas, K., Eikvil, L.: Text categorization: A survey. Technical Report 941, Norwegian Computing Center (1999)
Chim, H., Deng, X.: Efficient phrase-based document similarity for clustering. IEEE Transactions on Knowledge and Data Engineering 20(9), 1217–1229 (2008)
Salton, G., McGill, M.: Introduction to modern information retrieval. McGraw-Hill (1983)
Huang, A.: Similarity measures for text document clustering, pp. 49–56 (2008)
Strehl, A., Ghosh, J., Mooney, R.: Impact of similarity measures on web-page clustering. In: Workshop on Artificial Intelligence for Web Search, AAAI, pp. 58–64 (2000)
McCreight, E.M.: A space-economical suffix tree construction algorithm. Journal of the ACM 23(2), 262–272 (1976)
Janruang, J., Guha, S.: Semantic suffix tree clustering. In: First IRAST International Conference on Data Engineering and Internet Technology, DEIT (2011)
Carpineto, C., Osinski, S., Romano, G., Weiss, D.: A survey of web clustering engines. ACM Computing Surveys 41, 1–38 (2009)
Zamir, O., Etzioni, O.: Grouper: A dynamic clustering interface to web search results. In: Proceedings of the Eighth International World Wide Web Conference, pp. 283–296. Elsevier, Toronto (1999)
Weiner, P.: Linear pattern matching algorithms. In: IEEE Conference Record of 14th Annual Symposium on Switching and Automata Theory, SWAT 1973, pp. 1–11 (1973)
Wang, J., Li, R.: A New Cluster Merging Algorithm of Suffix Tree Clustering. In: Shi, Z., Shimohara, K., Feng, D. (eds.) Intelligent Information Processing III. IFIP AICT, vol. 228, pp. 197–203. Springer, Boston (2006)
Zhang, D., Dong, Y.: Semantic, Hierarchical, online Clustering of Web Search Results. In: Yu, J.X., Lin, X., Lu, H., Zhang, Y. (eds.) APWeb 2004. LNCS, vol. 3007, pp. 69–78. Springer, Heidelberg (2004)
Osinski, S., Weiss, D.: A concept-driven algorithm for clustering search results. IEEE Intelligent Systems 20, 48–54 (2005)
Sameh, A.: Semantic web search results clustering using Lingo and WordNet. International Journal of Research and Reviews in Computer Science (IJRRCS) 1(2) (June 2010)
Crabtree, D., Andreae, P., Gao, X.: Query directed web page clustering. In: Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2006, pp. 202–210. IEEE Computer Society, Washington, DC, USA (2006)
Kale, A., Bharambe, U., SashiKumar, M.: A new suffix tree similarity measure and labeling for web search results clustering. In: 2nd International Conference on Emerging Trends in Engineering and Technology (ICETET), pp. 856–861 (2009)
Wu, J., Wang, Z.: Search results clustering in Chinese context based on a new suffix tree. In: Proceedings of the 2008 IEEE 8th International Conference on Computer and Information Technology Workshops, pp. 110–115. IEEE Computer Society, Washington, DC, USA (2008)
Alsabti, K.: An efficient k-means clustering algorithm. In: Proceedings of IPPS/SPDP Workshop on High Performance Data Mining (1998)
Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: Analysis and an algorithm. In: Advances in Neural Information Processing Systems, vol. 14, pp. 849–856. MIT Press (2001)
Beil, F., Ester, M., Xu, X.: Frequent term-based text clustering. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2002, pp. 436–442. ACM, New York (2002)
Milligan, G.W., Sokol, L.: A two stage clustering algorithm with robust recovery characteristics. Educational and Psychological Measurement 40, 755–759 (1980)
Kessler, B.: Computational dialectology in Irish Gaelic. Computing Research Repository cmp-lg/950, 60–66 (1995)
Everitt, B.S., Landau, S., Leese, M.: Cluster analysis, 4th edn. Oxford University Press (1993)
Cilibrasi, R., Vitanyi, P.: The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering 19(3), 370–383 (2007)
Willett, P.: Recent trends in hierarchic document clustering: A critical review. Information Processing and Management 24(5), 577–597 (1988)
Kjos-Hanssen, B., Evangelista, A.J.: Google distance between words. Computing Research Repository abs/0901.4 (2009)
Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. In: Information Processing and Management, vol. 24, pp. 513–523. Pergamon Press plc, Great Britain (1988)
Machado, D., Barbosa, T., Pais, S., Martins, B., Dias, G.: Universal Mobile Information Retrieval. In: Stephanidis, C. (ed.) UAHCI 2009, Part II. LNCS, vol. 5615, pp. 345–354. Springer, Heidelberg (2009)
Losee, R.M.: When information retrieval measures agree about the relative quality of document rankings. Journal of the American Society for Information Science 51(9), 834–840 (2000)
Crabtree, D.: Raw data (2005), http://www.danielcrabtree.com/research/wi05/rawdata.zip
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Worawitphinyo, P., Gao, X., Jabeen, S. (2011). Improving Suffix Tree Clustering with New Ranking and Similarity Measures. In: Tang, J., King, I., Chen, L., Wang, J. (eds) Advanced Data Mining and Applications. ADMA 2011. Lecture Notes in Computer Science, vol 7121. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25856-5_5
DOI: https://doi.org/10.1007/978-3-642-25856-5_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25855-8
Online ISBN: 978-3-642-25856-5
eBook Packages: Computer Science, Computer Science (R0)