Cluster Generation and Cluster Labelling for Web Snippets: A Fast and Accurate Hierarchical Solution

Geraci, Filippo; Pellegrini, Marco; Maggini, Marco; Sebastiani, Fabrizio

doi:10.1007/11880561_3

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4209)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

This paper describes Armil, a meta-search engine that groups into disjoint labelled clusters the Web snippets returned by auxiliary search engines. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to her information need. Striking the right balance between running time and cluster well-formedness was a key point in the design of our system. Both the clustering and the labelling tasks are performed on the fly by processing only the snippets provided by the auxiliary search engines, and use no external sources of knowledge. Clustering is performed by means of a fast version of the furthest-point-first algorithm for metric k-center clustering. Cluster labelling is achieved by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure. We have tested the clustering effectiveness of Armil against Vivisimo, the de facto industrial standard in Web snippet clustering, using as benchmark a comprehensive set of snippets obtained from the Open Directory Project hierarchy. According to two widely accepted “external” metrics of clustering quality, Armil achieves better performance levels by 10%. We also report the results of a thorough user evaluation of both the clustering and the cluster labelling algorithms.
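
As a rough illustration of the clustering step named in the abstract, the following is a minimal sketch of the basic furthest-point-first heuristic for metric k-center clustering. It is not Armil's fast version, whose details are not given here. The function name `furthest_point_first`, the use of Euclidean distance, and the choice of the first point as the initial center are assumptions made for illustration only.

```python
import math


def furthest_point_first(points, k, dist=math.dist):
    """Basic furthest-point-first heuristic for metric k-center clustering.

    Returns (centers, assign): the indices of the chosen centers, and for
    each point the position (in `centers`) of its nearest center.
    Assumption: `dist` is a metric; Euclidean distance is used by default.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    centers = [0]  # arbitrary first center (an assumption of this sketch)
    # Distance from every point to its nearest center chosen so far.
    nearest = [dist(p, points[0]) for p in points]
    assign = [0] * len(points)

    while len(centers) < k:
        # The next center is the point furthest from all current centers.
        far = max(range(len(points)), key=nearest.__getitem__)
        centers.append(far)
        # Move points to the new center when it is strictly closer.
        for i, p in enumerate(points):
            d = dist(p, points[far])
            if d < nearest[i]:
                nearest[i] = d
                assign[i] = len(centers) - 1

    return centers, assign


# Toy usage on 2-D points standing in for snippet vectors.
toy = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 9)]
centers, assign = furthest_point_first(toy, 3)
print(centers, assign)
```

Each round costs one pass over the points, so the basic heuristic takes O(nk) distance computations. Keeping the `nearest` array avoids recomputing distances to earlier centers.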

Author information

Authors and Affiliations

Istituto di Informatica e Telematica, Consiglio Nazionale delle Ricerche, Via G Moruzzi 1, 56124, Pisa, Italy
Filippo Geraci & Marco Pellegrini
Dipartimento di Ingegneria dell’Informazione, Università di Siena, Via Roma 56, 53100, Siena, Italy
Filippo Geraci & Marco Maggini
Istituto di Scienza e Tecnologia dell’Informazione, Consiglio Nazionale delle Ricerche, Via G Moruzzi 1, 56124, Pisa, Italy
Fabrizio Sebastiani

Editor information

Editors and Affiliations

Department of Computer and Information Science, University of Strathclyde, Scotland
Fabio Crestani
Dipartimento di Informatica, University of Pisa, Largo B. Pontecorvo 3, 56127, Pisa, Italy
Paolo Ferragina
Department of Information Studies, University of Sheffield, Sheffield, UK
Mark Sanderson

Cite this paper

Geraci, F., Pellegrini, M., Maggini, M., Sebastiani, F. (2006). Cluster Generation and Cluster Labelling for Web Snippets: A Fast and Accurate Hierarchical Solution. In: Crestani, F., Ferragina, P., Sanderson, M. (eds) String Processing and Information Retrieval. SPIRE 2006. Lecture Notes in Computer Science, vol 4209. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880561_3

DOI: https://doi.org/10.1007/11880561_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45774-9
Online ISBN: 978-3-540-45775-6
eBook Packages: Computer Science, Computer Science (R0)
