Abstract
This paper describes Armil, a meta-search engine that groups the Web snippets returned by auxiliary search engines into disjoint labelled clusters. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to her information need. Striking the right balance between running time and cluster well-formedness was a key point in the design of our system. Both the clustering and the labelling tasks are performed on the fly by processing only the snippets provided by the auxiliary search engines, and use no external sources of knowledge. Clustering is performed by means of a fast version of the furthest-point-first algorithm for metric k-center clustering. Cluster labelling is achieved by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure. We have tested the clustering effectiveness of Armil against Vivisimo, the de facto industrial standard in Web snippet clustering, using as benchmark a comprehensive set of snippets obtained from the Open Directory Project hierarchy. According to two widely accepted “external” metrics of clustering quality, Armil outperforms Vivisimo by 10%. We also report the results of a thorough user evaluation of both the clustering and the cluster labelling algorithms.
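As an illustration of the clustering step named in the abstract, the following is a minimal sketch of the basic furthest-point-first heuristic for metric k-center. It is not the accelerated version used in Armil, whose details are not given here. The function name `furthest_point_first`, the choice of the first point as the initial center, and the use of Euclidean distance on toy 2-D points are all assumptions made for the example; a real system would use a distance defined on snippet representations.

```python
import math


def furthest_point_first(points, k, dist):
    """Basic furthest-point-first heuristic for metric k-center.

    Returns (centers, assignment): `centers` holds indices into `points`,
    and `assignment[i]` is the position in `centers` of the center
    closest to point i.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    # Assumption: the first point seeds the procedure (any point would do).
    centers = [0]
    # nearest[i] is the distance from point i to its closest center so far.
    nearest = [dist(points[0], p) for p in points]
    assignment = [0] * len(points)

    while len(centers) < k:
        # The next center is the point furthest from all current centers.
        new_center = max(range(len(points)), key=nearest.__getitem__)
        centers.append(new_center)
        # Reassign every point that is closer to the new center.
        for i, p in enumerate(points):
            d = dist(points[new_center], p)
            if d < nearest[i]:
                nearest[i] = d
                assignment[i] = len(centers) - 1

    return centers, assignment


# Toy data: three visually separated groups on a plane.
toy_points = [(0, 0), (1, 0), (10, 0), (11, 0), (0, 10)]
toy_centers, toy_assignment = furthest_point_first(toy_points, 3, math.dist)
```

Each round costs one pass over the points, so the basic procedure performs on the order of n·k distance computations for n points and k clusters.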
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Geraci, F., Pellegrini, M., Maggini, M., Sebastiani, F. (2006). Cluster Generation and Cluster Labelling for Web Snippets: A Fast and Accurate Hierarchical Solution. In: Crestani, F., Ferragina, P., Sanderson, M. (eds) String Processing and Information Retrieval. SPIRE 2006. Lecture Notes in Computer Science, vol 4209. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880561_3
DOI: https://doi.org/10.1007/11880561_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45774-9
Online ISBN: 978-3-540-45775-6
eBook Packages: Computer Science (R0)