Elsevier

Data & Knowledge Engineering

Volume 62, Issue 3, September 2007, Pages 504-522

A new algorithm for clustering search results

https://doi.org/10.1016/j.datak.2006.10.006

Abstract

We develop a new algorithm for clustering search results. Unlike many other clustering systems recently proposed as a post-processing step for Web search engines, our system is not based on phrase analysis inside snippets, but instead uses latent semantic indexing on the whole document content. A main contribution of the paper is a novel strategy, called dynamic SVD clustering, to discover the optimal number of singular values to be used for clustering purposes. Moreover, the SVD computation step performs well in practice, which makes clustering feasible whenever term vectors are available. We show that the algorithm has very good classification performance, and that it can be effectively used to cluster the results of a search engine to make them easier for users to browse. The algorithm has been integrated into the Noodles search engine, a tool for searching and clustering Web and desktop documents.

Section snippets

Introduction and motivations

Web search engines like Google [5] are now a cornerstone service for any Internet user. The keyword-based, Boolean search style used by these engines has permeated user habits to such an extent that it is now extending to other classes of applications, such as desktop search [29].

A key factor in the success of Web search engines is their ability to rapidly find good quality results to queries that are based on rather specific terms, like “Java Server Faces” or

Preliminaries

This section introduces a number of techniques that will be used in the rest of the paper.

Clustering algorithm

In this section we introduce the clustering algorithm used by Noodles. We shall first give some insight on the main ideas behind the algorithm, and then elaborate on the technical details.
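
As a rough illustration of the latent semantic indexing step mentioned in the abstract, the following minimal sketch projects documents into a low-rank space and compares them there. It is not the paper's dynamic SVD clustering strategy: the rank k = 2 is fixed by hand rather than discovered, the toy term-by-document matrix is invented, and the function names (gram, top_eigenpairs, lsi_coords, cosine) are hypothetical. The top singular values and right singular vectors are obtained from the eigenpairs of A^T A by simple power iteration, which is only reasonable for tiny matrices.

```python
import math

def gram(term_doc):
    """Document-by-document Gram matrix G = A^T A of a term-by-document matrix A."""
    n_docs = len(term_doc[0])
    return [[sum(row[i] * row[j] for row in term_doc) for j in range(n_docs)]
            for i in range(n_docs)]

def top_eigenpairs(sym, k, iters=500):
    """Top-k (eigenvalue, eigenvector) pairs of a symmetric PSD matrix,
    computed by power iteration with deflation (toy-sized inputs only)."""
    n = len(sym)
    work = [row[:] for row in sym]
    pairs = []
    for _rank in range(k):
        # Deterministic, slightly non-uniform start vector, normalized.
        vec = [1.0 + 0.1 * i for i in range(n)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        for _step in range(iters):
            nxt = [sum(w * v for w, v in zip(row, vec)) for row in work]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:
                break
            vec = [x / norm for x in nxt]
        # Rayleigh quotient v^T W v gives the eigenvalue estimate.
        value = sum(v * sum(w * u for w, u in zip(row, vec))
                    for v, row in zip(vec, work))
        pairs.append((value, vec))
        # Deflate: remove the component just found.
        for i in range(n):
            for j in range(n):
                work[i][j] -= value * vec[i] * vec[j]
    return pairs

def lsi_coords(term_doc, k):
    """Rank-k document coordinates: row j is (sigma_1 * v_1[j], ..., sigma_k * v_k[j])."""
    pairs = top_eigenpairs(gram(term_doc), k)
    n_docs = len(term_doc[0])
    return [[math.sqrt(max(val, 0.0)) * vec[j] for val, vec in pairs]
            for j in range(n_docs)]

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is (near) zero."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na < 1e-12 or nb < 1e-12:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

# Invented toy data: rows are terms, columns are documents d0..d3.
TERMS = ["java", "server", "faces", "coffee", "roast", "beans"]
A = [
    [1, 1, 0, 0],  # java
    [1, 1, 0, 0],  # server
    [1, 0, 0, 0],  # faces
    [0, 0, 1, 1],  # coffee
    [0, 0, 1, 0],  # roast
    [0, 0, 1, 2],  # beans
]

docs = lsi_coords(A, k=2)          # k = 2 is an assumption, not a derived optimum
same_topic = cosine(docs[0], docs[1])
cross_topic = cosine(docs[0], docs[2])
print(f"d0 vs d1: {same_topic:.3f}, d0 vs d2: {cross_topic:.3f}")
```

In this toy setting the two programming-related documents, whose cosine similarity on raw term counts is 2/√6 ≈ 0.82, become nearly collinear in the reduced space, while documents on the other topic stay nearly orthogonal to them. A real system would use a sparse SVD routine over weighted term vectors; how many singular values to retain is exactly the question the paper's dynamic strategy addresses.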

Implementation and experiments

The clustering algorithm has been implemented in the Noodles desktop search engine [30], a snapshot of which is shown in Fig. 4. The system is written in Java, using Apache Lucene as an indexing engine and the Spring Rich Client Platform as a desktop application framework. It has been conceived as a general-purpose search tool that can run both Web and desktop searches. In order to perform desktop searches, it incorporates a crawler

Related works

As we have discussed in Section 1, there are several commercial search engines that incorporate some form of clustering. Besides Vivisimo [35] and Grokker [12], other examples are Ask.com [2], iBoogie [15], Kartoo [17], and WiseNut [36].

In fact, the idea of clustering search results as a means to improve retrieval performance has been investigated extensively in Information Retrieval. A seminal work in this respect is the Scatter/Gather project [14], [31]. Scatter/Gather provides a simple

Conclusions

The paper has introduced a new algorithm for clustering the results of a search engine. Compared with snippet-based clustering algorithms, we have shown that, by considering the whole document content and employing appropriate SVD-based compression techniques, it is possible to achieve very good classification results, significantly better than those obtained by analyzing document snippets only.

We believe that these results represent promising directions to improve the quality of clustering

Acknowledgments

The authors thank Martin Funk, Donatella Occorsio, and Maria Grazia Russo for the insightful discussions during the early stages of this research. Thanks also go to Donatello Santoro for his excellent work in the implementation of the desktop search engine.

References (41)

  • S. Brin et al., The anatomy of a large-scale hypertextual web search engine, Computer Networks and ISDN Systems (1998)
  • C. Papadimitriou et al., Latent semantic indexing: a probabilistic analysis, Journal of Computer and System Sciences (2000)
  • O. Zamir et al., Grouper: a dynamic clustering interface for web search results, Computer Networks (1999)
  • P. Anick, S. Vaithyanathan, Exploiting clustering and phrases for context-based information retrieval, in: ACM SIGIR,...
  • Ask.com Search Engine....
  • T. Berners-Lee et al., The semantic web, Scientific American (2001)
  • M.W. Berry, Large-scale sparse singular value computations, The International Journal of Supercomputer Applications (1992)
  • D. Calvetti et al., An implicitly restarted Lanczos method for large symmetric eigenvalue problems, Electronic Transactions on Numerical Analysis (1994)
  • C. Chekuri, P. Raghavan, Web search using automatic classification, in: Proceedings of the World Wide Web Conference,...
  • E. Chisholm, T.G. Kolda, New term weighting formulas for the vector space method in information retrieval, Technical...
  • S. Deerwester et al., Indexing by latent semantic analysis, Journal of the American Society for Information Science (1990)
  • Open Directory Project (DMOZ)....
  • P. Ferragina, A. Gulli, A personalized search engine based on web snippet hierarchical clustering, in: Proceedings of...
  • Grokker Search Engine....
  • R. Guha, R. Mc Cool, E. Miller, Semantic search, in: Proceedings of the World Wide Web Conference,...
  • M.A. Hearst, J.O. Pedersen, Re-examining the cluster hypothesis: Scatter/gather on retrieval results, in: Proceedings...
  • iBoogie Search Engine....
  • A.K. Jain et al., Data clustering: a review, ACM Computing Surveys (1999)
  • Kartoo Search Engine....
  • K. Kummamuru, R. Lotlikar, S. Roy, K. Singal, R. Krishnapuram, A hierarchical monothetic document clustering algorithm...

Giansalvatore Mecca is a full professor of Computer Science at Universita’ della Basilicata. He graduated in Computer Engineering in 1992 from Universita’ di Roma “La Sapienza”. In 1996 he received his PhD, also from Universita’ di Roma “La Sapienza”. He has been with Universita’ della Basilicata since 1995. His research interests include information extraction and data management techniques for XML and Web data. He has also worked on cooperative database systems, string databases, deductive databases, and object-oriented databases.

Salvatore Raunich is a research assistant at Universita’ della Basilicata. He graduated in Computer Science at Universita’ della Basilicata in 2003, and received a master's degree in Computer Science in 2005. His research interests include Web clustering and data integration.

Alessandro Pappalardo holds a master's degree in Computer Science from Universita’ della Basilicata. He graduated in Computer Science at Universita’ della Basilicata in 2004 and received his master's degree in 2006. His research interests include Web clustering and data integration.
