Enhancing web search by using query-based clusters and multi-document summaries

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Current web search engines, such as Google, Bing, and Yahoo!, rank the set of documents SD retrieved in response to a user query and display each document D in SD with a title and a snippet, which serves as an abstract of D. Snippets, however, are less useful than intended, i.e., they often fail to help users quickly identify results of interest, if any exist. They are inadequate in providing distinct information and in capturing the main contents of the corresponding documents. Moreover, when the information need specified in a search query is ambiguous, it is very difficult, if not impossible, for a search engine to identify precisely the set of documents that satisfy the user’s intended request without requiring additional input. Furthermore, a document title is not always a good indicator of the content of the corresponding document. All of these design problems can be addressed by our proposed query-based clusterer and summarizer, called \(Q_{Sum}\). \(Q_{Sum}\) generates a concise, comprehensive summary for each cluster of documents retrieved in response to a user query, which saves users the time and effort of browsing through the documents one by one in search of specific information of interest. Experimental results show that \(Q_{Sum}\) is effective and efficient in generating a high-quality summary for each cluster of documents on a specific topic.

Notes

  1. A snippet of a document D is treated as a summary of D.

  2. According to a survey on iProspect (iprospect.com), the majority of the queries submitted to web search engines are general, 1–3 words in length.

  3. Although the titles and snippets are inadequate in capturing the contents of their corresponding documents, they provide sufficient information for document clustering.

  4. The labels in the actual interface are listed vertically, instead of horizontally (ordered from left to right) as arranged in the figure to save space.

  5. The Text Analysis Conference (TAC) (nist.gov/tac) recommends a multi-document summary with the length of Size.

  6. Word-correlation factors quantify the similarity (degree of closeness) of two words in terms of their semantic meaning.

  7. End-of-sentence punctuation marks, such as periods, question marks, and exclamation points, are less ambiguous as end-of-sentence indicators. However, a period is not used exclusively to mark sentence breaks: it may also indicate an abbreviation, a decimal point, part of an e-mail address, etc. A list of common abbreviations, such as “i.e.,” “u.s.,” and “e.g.,” is therefore maintained to minimize detection errors.

  8. If a sentence contains a date, then it overrides the publication time of the document, since it explicitly states the time of the information presented in the sentence.

  9. Variance is widely used in statistics, along with standard deviation (which is the square root of the variance), to measure the average dispersion of the scores in a distribution.

  10. The logs of AOL (gregsadetsky.com/aol-data/) include 50 million queries created by millions of AOL users over a three-month period between March 1, 2006, and May 31, 2006, and the AOL logs are available for public use.

  11. A summary is considered useful if it is of high quality (4 or 5 on a 5-point scale) as defined by DUC.

  12. http://www.top-keywords.com/longterm.html.
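
The sentence-boundary rule described in Note 7 can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `split_sentences`, the regular expression, and the three-entry abbreviation set are assumptions made for illustration, and the note names only “i.e.,” “u.s.,” and “e.g.” as examples from a longer list. The sketch treats a period, question mark, or exclamation point as a candidate break only when it is followed by whitespace or the end of the text, which leaves decimal points and e-mail addresses intact, and it skips a period that closes a listed abbreviation.

```python
import re

# Hypothetical abbreviation list; Note 7 gives only these three examples.
ABBREVIATIONS = {"i.e.", "e.g.", "u.s."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text at ., ?, or ! followed by whitespace, skipping listed abbreviations."""
    sentences = []
    start = 0
    # Candidate break: end punctuation followed by whitespace or end of text.
    # Periods inside "3.5" or "a.b@example.com" never match this pattern.
    for match in re.finditer(r"[.?!](?=\s|$)", text):
        end = match.end()
        # Last whitespace-delimited token ending at this punctuation mark.
        token = text[start:end].split()[-1].lower().strip("\"'(),")
        if match.group() == "." and token in abbreviations:
            continue  # the period belongs to an abbreviation, not a sentence end
        sentences.append(text[start:end].strip())
        start = end
    # Keep any trailing text that lacks end punctuation.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


example = (
    "The U.S. economy grew 3.5 percent. "
    "Contact a.b@example.com, i.e. the editor! Is that all?"
)
for sentence in split_sentences(example):
    print(sentence)
```

One limitation of this sketch is that an abbreviation that genuinely ends a sentence (for example, a sentence closing with “U.S.”) is not treated as a break, so the two sentences would be merged.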

Author information

Corresponding author

Correspondence to Yiu-Kai Ng.

About this article

Cite this article

Qumsiyeh, R., Ng, YK. Enhancing web search by using query-based clusters and multi-document summaries. Knowl Inf Syst 47, 355–380 (2016). https://doi.org/10.1007/s10115-015-0852-5

  • DOI: https://doi.org/10.1007/s10115-015-0852-5