Enhancing web search by using query-based clusters and multi-document summaries

Qumsiyeh, Rani; Ng, Yiu-Kai

doi:10.1007/s10115-015-0852-5

Regular Paper
Published: 30 June 2015

Volume 47, pages 355–380 (2016)

Knowledge and Information Systems

Rani Qumsiyeh¹ &
Yiu-Kai Ng¹

306 Accesses
5 Citations

Abstract

Current web search engines, such as Google, Bing, and Yahoo!, rank the set of documents SD retrieved in response to a user query and display each document D in SD with a title and a snippet, which serves as an abstract of D. Snippets, however, are not as useful as intended: they are designed to help users quickly identify results of interest, if any exist, yet they are inadequate in providing distinct information and capturing the main contents of the corresponding documents. Moreover, when the information need specified in a search query is ambiguous, it is very difficult, if not impossible, for a search engine to identify precisely the set of documents that satisfy the user’s intended request without requiring additional input. Furthermore, a document title is not always a good indicator of the content of the corresponding document. All of these design problems can be solved by our proposed query-based cluster and summarizer, called \(Q_{Sum}\). \(Q_{Sum}\) generates a concise and comprehensive summary for each cluster of documents retrieved in response to a user query, which saves the user time and effort in searching for specific information of interest without having to browse through the documents one by one. Experimental results show that \(Q_{Sum}\) is effective and efficient in generating a high-quality summary for each cluster of documents on a specific topic.

Notes

A snippet of a document D is treated as a summary of D.
According to a survey on iProspect (iprospect.com), the majority of the queries submitted to web search engines are general and 1–3 words in length.
Although the titles and snippets are inadequate in capturing the contents of their corresponding documents, they provide sufficient information for document clustering.
The labels in the actual interface are listed vertically, instead of horizontally (ordered from left to right) as arranged in the figure to save space.
The Text Analysis Conference (TAC) (nist.gov/tac) recommends a multi-document summary with the length of Size.
Word-correlation factors quantify the similarity (degree of closeness) of two words in terms of their semantic meaning.
End-of-sentence punctuation marks, such as periods, question marks, and exclamation points, are relatively unambiguous end-of-sentence indicators. A period, however, is not used exclusively to indicate a sentence break: it may also mark an abbreviation, a decimal point, part of an e-mail address, etc. A list of common abbreviations, such as “i.e.,” “u.s.,” and “e.g.,” is therefore maintained to minimize detection errors.
If a sentence contains a date, then it overrides the publication time of the document, since it explicitly states the time of the information presented in the sentence.
Variance is widely used in statistics, along with standard deviation (which is the square root of the variance), to measure the average dispersion of the scores in a distribution.
The logs of AOL (gregsadetsky.com/aol-data/) include 50 million queries created by millions of AOL users over a three-month period between March 1, 2006, and May 31, 2006, and the AOL logs are available for public use.
A summary is considered useful if it is of high quality (4 or 5 on a 5-point scale) as defined by DUC.
http://www.top-keywords.com/longterm.html.
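
The note above on end-of-sentence punctuation describes a splitter that consults a list of common abbreviations. The following is a minimal sketch of that idea, not the authors' implementation: the function name split_sentences, the regular expression, and the three-entry abbreviation set are assumptions made for illustration (the note names only “i.e.,” “u.s.,” and “e.g.” as examples).

```python
import re

# Hypothetical abbreviation list; the note gives only these three as examples.
ABBREVIATIONS = {"i.e.", "u.s.", "e.g."}

# Candidate boundary: '.', '?', or '!' immediately followed by whitespace.
# Periods inside decimals ("3.5") or e-mail addresses are followed by a
# non-space character, so they are never candidates.
_BOUNDARY = re.compile(r"[.?!](?=\s)")


def split_sentences(text):
    """Split text into sentences, skipping periods that end a known abbreviation."""
    sentences = []
    start = 0
    for match in _BOUNDARY.finditer(text):
        end = match.end()
        # Last whitespace-delimited token ending at this punctuation mark,
        # lowercased and stripped of leading brackets or quotes.
        token = text[start:end].split()[-1].lower().lstrip("([\"'")
        if token in ABBREVIATIONS:
            continue  # treat the period as part of the abbreviation
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = ("The U.S. team scored 3.5 points, e.g. in round two. "
              "Contact a.b@example.com for details! Is that all?")
    for sentence in split_sentences(sample):
        print(sentence)
```

One limitation of this sketch: an abbreviation that genuinely ends a sentence (for example, “... in the U.S. The next ...”) is not split, so the two sentences are merged. A fuller treatment would need additional cues, such as the capitalization of the following word.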

Author information

Authors and Affiliations

Computer Science Department, Brigham Young University, Provo, UT, 84604, USA
Rani Qumsiyeh & Yiu-Kai Ng

Corresponding author

Correspondence to Yiu-Kai Ng.

About this article

Cite this article

Qumsiyeh, R., Ng, YK. Enhancing web search by using query-based clusters and multi-document summaries. Knowl Inf Syst 47, 355–380 (2016). https://doi.org/10.1007/s10115-015-0852-5

Received: 22 August 2014
Revised: 03 May 2015
Accepted: 13 June 2015
Published: 30 June 2015
Issue Date: May 2016
DOI: https://doi.org/10.1007/s10115-015-0852-5
