Abstract
Current web search engines, such as Google, Bing, and Yahoo!, rank the set of documents SD retrieved in response to a user query and display each document D in SD with a title and a snippet, which serves as an abstract of D. Snippets, however, are not as useful as intended: they are meant to help users quickly identify results of interest, if any exist, yet they are inadequate in providing distinct information and in capturing the main contents of the corresponding documents. Moreover, when the information need specified in a search query is ambiguous, it is very difficult, if not impossible, for a search engine to identify precisely the set of documents that satisfy the user’s intended request without additional input. Furthermore, a document title is not always a good indicator of the content of the corresponding document. All of these design problems can be solved by our proposed query-based cluster and summarizer, called \(Q_{Sum}\). \(Q_{Sum}\) generates a concise and comprehensive summary for each cluster of documents retrieved in response to a user query, which saves the user the time and effort of browsing through the documents one by one in search of specific information of interest. Experimental results show that \(Q_{Sum}\) is effective and efficient in generating a high-quality summary for each cluster of documents on a specific topic.
Notes
A snippet of a document D is treated as a summary of D.
According to a survey on iProspect (iprospect.com), the majority of the queries submitted to web search engines are generally 1–3 words in length.
Although the titles and snippets are inadequate in capturing the contents of their corresponding documents, they provide sufficient information for document clustering.
The labels in the actual interface are listed vertically, rather than horizontally (ordered from left to right), as they are arranged in the figure to save space.
The Text Analysis Conference (TAC) (nist.gov/tac) recommends a multi-document summary with the length of Size.
Word-correlation factors quantify the similarity (degree of closeness) of two words in terms of their semantic meaning.
End-of-sentence punctuation marks, such as periods, question marks, and exclamation points, are less ambiguous as end-of-sentence indicators. However, a period is not used exclusively to indicate sentence breaks; it may also indicate an abbreviation, a decimal point, part of an e-mail address, etc. A list of common abbreviations, such as “i.e.,” “u.s.,” and “e.g.,” is therefore maintained to minimize detection errors.
If a sentence contains a date, then it overrides the publication time of the document, since it explicitly states the time of the information presented in the sentence.
Variance is widely used in statistics, along with standard deviation (which is the square root of the variance), to measure the average dispersion of the scores in a distribution.
The logs of AOL (gregsadetsky.com/aol-data/) include 50 million queries created by millions of AOL users over a three-month period between March 1, 2006, and May 31, 2006, and the AOL logs are available for public use.
A summary is considered useful if it is of high quality (4 or 5 on a 5-point scale) as defined by DUC.
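The note above on end-of-sentence punctuation can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `split_sentences`, the three-entry abbreviation set, and the rule that a mark counts as a candidate break only when followed by whitespace or the end of the text are all assumptions made for illustration. Under that rule, decimal points and periods inside e-mail addresses are never candidates, and a period that ends a listed abbreviation is skipped.

```python
import re

# Illustrative list only; the note names "i.e.", "u.s.", and "e.g." as examples.
ABBREVIATIONS = {"i.e.", "u.s.", "e.g."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text at '.', '?' or '!' unless the period ends a known abbreviation."""
    sentences = []
    start = 0
    # Candidate breaks: a mark followed by whitespace or the end of the text.
    # This alone excludes decimals ("3.5") and e-mail addresses ("a@b.com").
    for match in re.finditer(r"[.?!](?=\s|$)", text):
        end = match.end()
        tokens = text[start:end].split()
        # Last whitespace-delimited token, lowercased, without opening quotes/brackets.
        token = tokens[-1].lower().lstrip("(\"'“") if tokens else ""
        if match.group() == "." and token in abbreviations:
            continue  # the period belongs to an abbreviation, not a sentence break
        sentence = text[start:end].strip()
        if sentence:
            sentences.append(sentence)
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # trailing text without a closing mark
    return sentences


if __name__ == "__main__":
    sample = ("The price rose 3.5 percent in the U.S. last year. "
              "Write to info@example.com for details, e.g. pricing! Is that clear?")
    for sentence in split_sentences(sample):
        print(sentence)
```

One limitation of this sketch is that a sentence ending in a listed abbreviation (for example, one that ends with "U.S.") is merged with the sentence that follows it; handling that case would need additional cues, such as the capitalization of the next word.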
Cite this article
Qumsiyeh, R., Ng, YK. Enhancing web search by using query-based clusters and multi-document summaries. Knowl Inf Syst 47, 355–380 (2016). https://doi.org/10.1007/s10115-015-0852-5
DOI: https://doi.org/10.1007/s10115-015-0852-5