A new algorithm for clustering search results
Section snippets
Introduction and motivations
Web search engines like Google [5] are now a cornerstone service for any Internet user. The keyword-based, Boolean search style used by these engines has permeated user habits to such an extent that it is now spreading to other classes of applications, such as desktop search [29].
A key factor in the success of Web search engines is their ability to rapidly find good-quality results for queries that are based on rather specific terms, like “Java Server Faces” or
Preliminaries
This section introduces a number of techniques that will be used in the rest of the paper.
Clustering algorithm
In this section we introduce the clustering algorithm used by Noodles. We first give some insight into the main ideas behind the algorithm, and then elaborate on the technical details.
Implementation and experiments
The clustering algorithm has been implemented in the Noodles desktop search engine [30], a snapshot of which is shown in Fig. 4. The system is written in Java, using Apache Lucene as an indexing engine and the Spring Rich Client Platform as a desktop application framework. It has been conceived as a general-purpose search tool that can run both Web and desktop searches. In order to perform desktop searches, it incorporates a crawler
Related works
As we have discussed in Section 1, there are several commercial search engines that incorporate some form of clustering. Besides Vivisimo [35] and Grokker [12], other examples are Ask.com [2], iBoogie [15], Kartoo [17], and WiseNut [36].
In fact, the idea of clustering search results as a means to improve retrieval performance has been investigated quite deeply in Information Retrieval. A seminal work in this respect is the Scatter/Gather project [14], [31]. Scatter/Gather provides a simple
Conclusions
The paper has introduced a new algorithm for clustering the results of a search engine. With respect to snippet-based clustering algorithms, we have shown that, by considering the whole document content and employing appropriate SVD-based compression techniques, it is possible to achieve very good classification results, significantly better than those obtained by analyzing document snippets only.
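As a minimal illustration of the idea summarized above, the following Python sketch projects whole-document term vectors onto a low-rank space obtained from a truncated SVD and compares documents there. It is not the Noodles implementation: the function names, the toy documents, the raw term-frequency weighting, and the use of plain power iteration with deflation (instead of a production sparse SVD solver) are all assumptions made for the example.

```python
import math
import random
from collections import Counter


def term_document_matrix(docs):
    """Build a raw term-frequency matrix: one row per term, one column per document."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    return [[float(c[term]) for c in counts] for term in vocabulary]


def top_singular_pairs(matrix, k, iterations=200, seed=0):
    """Approximate the k largest singular values and right singular vectors
    by power iteration on A^T A, deflating the matrix after each pair."""
    rng = random.Random(seed)
    residual = [row[:] for row in matrix]
    n_docs = len(matrix[0])
    pairs = []
    for _ in range(k):
        # Random positive start vector, normalized to unit length.
        v = [rng.random() + 0.1 for _ in range(n_docs)]
        start_norm = math.sqrt(sum(x * x for x in v))
        v = [x / start_norm for x in v]
        for _ in range(iterations):
            av = [sum(a * x for a, x in zip(row, v)) for row in residual]
            atav = [sum(residual[i][j] * av[i] for i in range(len(residual)))
                    for j in range(n_docs)]
            norm = math.sqrt(sum(x * x for x in atav))
            if norm < 1e-12:
                break
            v = [x / norm for x in atav]
        av = [sum(a * x for a, x in zip(row, v)) for row in residual]
        sigma = math.sqrt(sum(x * x for x in av))
        if sigma < 1e-12:
            break  # the matrix has rank lower than k
        pairs.append((sigma, v))
        # Deflate: subtract sigma * u * v^T, where u = (A v) / sigma.
        for i, row in enumerate(residual):
            for j in range(n_docs):
                row[j] -= av[i] * v[j]
    return pairs


def document_coordinates(pairs, n_docs):
    """Coordinates of each document in the reduced space (rows of V_k * Sigma_k)."""
    return [[sigma * v[j] for sigma, v in pairs] for j in range(n_docs)]


def cosine(x, y):
    """Cosine similarity between two coordinate vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms if norms else 0.0


# Toy corpus (hypothetical): two documents per topic.
docs = [
    "java server faces tutorial",
    "java server pages tutorial",
    "jaguar big cat habitat",
    "jaguar wild cat habitat",
]
matrix = term_document_matrix(docs)
coords = document_coordinates(top_singular_pairs(matrix, k=2), len(docs))
same_topic = cosine(coords[0], coords[1])
different_topic = cosine(coords[0], coords[2])
```

In this toy setting, documents on the same topic end up close to each other in the two-dimensional space, while documents on different topics are nearly orthogonal; a clustering step could then operate on `coords` rather than on the full term vectors.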
We believe that these results represent promising directions to improve the quality of clustering
Acknowledgments
The authors thank Martin Funk, Donatella Occorsio, and Maria Grazia Russo for the insightful discussions during the early stages of this research. Thanks also go to Donatello Santoro for his excellent work in the implementation of the desktop search engine.
References (41)
- et al., The anatomy of a large-scale hypertextual web search engine, Computer Networks and ISDN Systems (1998)
- et al., Latent semantic indexing: a probabilistic analysis, Journal of Computer and System Sciences (2000)
- et al., Grouper: a dynamic clustering interface for web search results, Computer Networks (1999)
- P. Anick, S. Vaithyanathan, Exploiting clustering and phrases for context-based information retrieval, in: ACM SIGIR,...
- Ask.com Search Engine....
- et al., The semantic web, Scientific American (2001)
- Large-scale sparse singular value computations, The International Journal of Supercomputer Applications (1992)
- et al., An implicitly restarted Lanczos method for large symmetric eigenvalue problems, Electronic Transactions on Numerical Analysis (1994)
- C. Chekuri, P. Raghavan, Web search using automatic classification, in: Proceedings of the World Wide Web Conference,...
- E. Chisholm, T.G. Kolda, New term weighting formulas for the vector space method in information retrieval, Technical...
- Indexing by latent semantic analysis, Journal of the American Society for Information Sciences
- Data clustering: a review, ACM Computing Surveys
Giansalvatore Mecca is a full professor of Computer Science at Università della Basilicata. He graduated in Computer Engineering in 1992 from Università di Roma “La Sapienza”, where he also received his PhD in 1996. He has been with Università della Basilicata since 1995. His research interests include information extraction and data management techniques for XML and Web data. He has also worked on cooperative database systems, string databases, deductive databases, and object-oriented databases.
Salvatore Raunich is a research assistant at Università della Basilicata. He graduated in Computer Science at Università della Basilicata in 2003, and received a master's degree in Computer Science in 2005. His research interests include Web clustering and data integration.
Alessandro Pappalardo graduated in Computer Science at Università della Basilicata in 2004, and received a master's degree in Computer Science from the same university in 2006. His research interests include Web clustering and data integration.