
Scalable entity-based summarization of web search results using MapReduce

Distributed and Parallel Databases

Abstract

Although Web search engines index and provide access to huge numbers of documents, user queries typically return only a linear list of hits. While this is often satisfactory for focused search, it does not support exploration or deeper analysis of the results. One way to provide advanced exploration facilities that exploit the availability of structured (and semantic) data in Web search is to enrich the search with entity mining over the full contents of the search results. Such services give users an initial overview of the information space and allow them to restrict it gradually until they locate the desired hits, even if those hits are ranked low. This is especially important in areas of professional search such as medical search and patent search. In this paper we consider a general scenario in which such services are provided as meta-services (that is, layered over systems that support keyword search) without a priori indexing of the underlying document collection(s). To make such services feasible for large amounts of data, we use the MapReduce distributed computation model on a Cloud infrastructure (Amazon EC2). Specifically, we show how the required computational tasks can be factorized and expressed as MapReduce functions. A key contribution of our work is a thorough evaluation of platform configuration and tuning, an aspect that is often disregarded or inadequately addressed in prior work but is crucial for the efficient utilization of resources. Finally, we report experimental results on the speedup achieved in various settings.
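
As a rough illustration of how entity mining over search results might be factorized into map and reduce functions, the following Python sketch runs both phases in a single process. It is not the paper's algorithm: the toy gazetteer lookup, the in-memory shuffle, and the names `GAZETTEER`, `map_entities`, `reduce_entities`, and `run_job` are all assumptions made for this example. A real deployment would presumably rely on a proper named-entity recognizer and a framework such as Hadoop.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical toy gazetteer standing in for a real named-entity recognizer.
GAZETTEER: Dict[str, str] = {
    "athens": "Location",
    "aspirin": "Drug",
    "amazon": "Organization",
}

Key = Tuple[str, str]  # (category, entity)


def map_entities(doc_id: str, text: str) -> Iterator[Tuple[Key, str]]:
    """Map: emit ((category, entity), doc_id) once per distinct entity in a hit."""
    seen = set()
    for token in text.lower().split():
        word = token.strip(".,;:!?()\"'")
        category = GAZETTEER.get(word)
        if category is not None and word not in seen:
            seen.add(word)
            yield (category, word), doc_id


def reduce_entities(key: Key, doc_ids: Iterable[str]) -> Tuple[str, str, List[str]]:
    """Reduce: gather the sorted, de-duplicated hits that mention one entity."""
    category, entity = key
    return category, entity, sorted(set(doc_ids))


def run_job(hits: Dict[str, str]) -> Dict[str, List[Tuple[str, List[str]]]]:
    """Run map, an in-memory shuffle, and reduce over a dict of doc_id -> text."""
    # Shuffle: group the mapper output by key, as a MapReduce framework would.
    groups: Dict[Key, List[str]] = defaultdict(list)
    for doc_id, text in hits.items():
        for key, value in map_entities(doc_id, text):
            groups[key].append(value)

    # Reduce each group and collect the entities under their category.
    summary: Dict[str, List[Tuple[str, List[str]]]] = defaultdict(list)
    for key in sorted(groups):
        category, entity, docs = reduce_entities(key, groups[key])
        summary[category].append((entity, docs))

    # Within a category, list entities by descending number of hits.
    for entities in summary.values():
        entities.sort(key=lambda item: (-len(item[1]), item[0]))
    return dict(summary)
```

Calling `run_job` on a dictionary that maps hit identifiers to their text returns, per category, the entities found together with the hits that mention them. That is the kind of overview a user could then use to narrow the result set.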



Notes

  1. In our implementation any system that supports OpenSearch [14] can straightforwardly be used.

  2. By September 2011, datasets from Linked Open Data (http://linkeddata.org/) had grown to 31 billion RDF triples, interlinked by around 504 million RDF links.

  3. We chose Bing because it does not limit the number of queries submitted, in contrast to Google, which blocks the account for one hour if more than 600 queries are submitted.

  4. This size is chosen to ensure full utilization in all cases, as max(Reusability) × max(Split size) = 20 × 5 MB = 100 MB.

  5. Even with the full functionality, the reduce phase constitutes only about 2 % of the total job time when analyzing 300 MB-SET1 using 4 nodes.

  6. In particular: Person, Location, Organization, Address, Date, Time, Money, Percent, Age, Drug.

References

  1. Allocca, C., d’Aquin, M., Motta, E.: Impact of using relationships between ontologies to enhance the ontology search results. In: Simperl, E., Cimiano, P., Polleres, A., Corcho, O., Presutti, V. (eds.) The Semantic Web: Research and Applications. Lecture Notes in Computer Science, vol. 7295, pp. 453–468. Springer, Berlin (2012)

  2. Amdahl, G.M.: Validity of the single processor approach to achieving large scale computing capabilities, pp. 483–485 (1967)

  3. Apache Software Foundation: The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. http://hadoop.apache.org/. Accessed: 03/05/2013

  4. Armbrust, M., Fox, A., Griffith, R., Joseph, A.D., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A., Stoica, I., Zaharia, M.: A view of cloud computing. Commun. ACM 53(4), 50–58 (2010)


  5. Assel, M., Cheptsov, A., Gallizo, G., Celino, I., Dell’Aglio, D., Bradeško, L., Witbrock, M., Della Valle, E.: Large knowledge collider—a service-oriented platform for large-scale semantic reasoning. In: Proceedings of the International Conference on Web Intelligence, Mining and Semantics (WIMS’11), pp. 41:1–41:9. ACM, New York (2011)


  6. Bonino, D., Ciaramella, A., Corno, F.: Review of the state-of-the-art in patent information and forthcoming evolutions in intelligent patent informatics. World Pat. Inf. 32(1), 30–38 (2010)


  7. Broder, A.: A taxonomy of web search. SIGIR Forum 36(2), 3–10 (2002)


  8. Callaghan, G., Moffatt, L., Szasz, S.: General architecture for text engineering. http://gate.ac.uk/. Accessed: 03/04/2013

  9. Callan, J.: Distributed information retrieval. Advances in Information Retrieval, 7, 127–150, 2002

  10. Caputo, A., Basile, P., Semeraro, G.: Boosting a semantic search engine by named entities. In: Proceedings of the 18th International Symposium on Foundations of Intelligent Systems (ISMIS’09), pp. 241–250. Springer, Berlin (2009)


  11. Carpineto, C., D’Amico, M., Romano, G.: Evaluating subtopic retrieval methods: clustering versus diversification of search results. Inf. Process. Manag. 48(2), 358–373 (2012)

  12. Chen, S., Schlosser, S.W.: Map-reduce meets wider varieties of applications. Technical report IRP-TR-08-05, Intel Research Pittsburgh (2008)

  13. Cheng, T., Yan, X., Chang, K.: Supporting entity search: a large-scale prototype search engine. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data (SIGMOD’07), pp. 1144–1146. ACM, New York (2007)


  14. Clinton, D., Tesler, J., Fagan, M., Snell, J., Suave, A., et al.: OpenSearch is a collection of simple formats for the sharing of search results. http://www.opensearch.org/. Accessed: 03/05/2013

  15. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: A framework and graphical development environment for robust NLP tools and applications. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02) (2002)


  16. Das, D., Martins, A.: A survey on automatic text summarization. Literature Survey for the Language and Statistics II course at CMU 4, 192–195 (2007)


  17. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)


  18. Ernde, B., Lebel, M., Thiele, C., Hold, A., Naumann, F., Barczyński, W., Brauer, F.: ECIR—a lightweight approach for entity-centric information retrieval. In: Proceedings of the 18th Text REtrieval Conference (TREC 2010) (2010)

  19. Fafalios, P., Kitsos, I., Marketakis, Y., Baldassarre, C., Salampasis, M., Tzitzikas, Y.: Web searching with entity mining at query time. In: Proceedings of the 5th Information Retrieval Facility Conference (IRFC 2012), Vienna (2012)


  20. Fafalios, P., Salampasis, M., Tzitzikas, Y.: Exploratory patent search with faceted search and configurable entity mining. In: Proceedings of the 1st International Workshop on Integrating IR Technologies for Professional Search (ECIR 2013) (2013)


  21. Grossman, R.L., Gu, Y.: Data mining using high performance data clouds: experimental studies using sector and sphere. CoRR, abs/0808.3019:920–927, 2008


  22. Halevy, A.Y.: Answering queries using views: a survey. VLDB J. 10(4), 270–294 (2001)


  23. Herzig, D.M., Tran, T.: Heterogeneous web data search using relevance-based on the fly data integration. In: Proceedings of the 21st International Conference on World Wide Web (WWW ’12), pp. 141–150. ACM, New York (2012)


  24. Husain, M., Khan, L., Kantarcioglu, M., Thuraisingham, B.: Data intensive query processing for large RDF graphs using cloud computing tools. In: 2010 IEEE 3rd International Conference on Cloud Computing (CLOUD), pp. 1–10. IEEE Press, New York (2010)

  25. Hwang, J.: IBM pattern modeling and analysis tool for Java garbage collector. https://www.ibm.com/developerworks/community/groups/service/html/communityview?communityUuid=22d56091-3a7b-4497-b36e-634b51838e11 Accessed: 28/01/2013

  26. Jaccard, P.: The distribution of the flora in the alpine zone. New Phytol. 11(2), 37–50 (1912)


  27. Jestes, J., Yi, K., Li, F.: Building wavelet histograms on large data in mapreduce. Proc. VLDB Endow. 5(2), 109–120 (2011)


  28. Jiménez-Ruiz, E., Grau, B.C., Horrocks, I., Berlanga, R.: Ontology integration using mappings: towards getting the right logical consequences. In: The Semantic Web: Research and Applications, pp. 173–187. Springer, Berlin (2009)


  29. Joho, H., Azzopardi, L., Vanderbauwhede, W.: A survey of patent users: an analysis of tasks, behavior, search functionality and system requirements. In: Proc. of the 3rd Symposium on Information Interaction in Context, pp. 13–24. ACM, New York (2010)


  30. Käki, M.: Findex: search result categories help users when document ranking fails. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 131–140. ACM, New York (2005)


  31. Käki, M., Aula, A.: Findex: improving search result use through automatic filtering categories. Interact. Comput. 17(2), 187–206 (2005)


  32. Kitsos, I., Papaioannou, A., Tsikoudis, N., Magoutis, K.: Adapting data-intensive workloads to generic allocation policies in cloud infrastructures. In: Proceedings of IEEE/IFIP Network Operations and Management Symposium (NOMS 2012), pp. 25–33. IEEE Press, New York (2012)


  33. Kohn, A., Bry, F., Manta, A., Ifenthaler, D.: Professional Search: Requirements, Prototype and Preliminary Experience Report, pp. 195–202 (2008)

  34. Kules, B., Capra, R., Banta, M., Sierra, T.: What do exploratory searchers look at in a faceted search interface? In: Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 313–322. ACM, New York (2009)


  35. Kulkarni, P.: Distributed SPARQL query engine using MapReduce. Master’s thesis

  36. Li, B., Mazur, E., Diao, Y., McGregor, A., Shenoy, P.: A platform for scalable one-pass analytics using mapreduce. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (SIGMOD’11), pp. 985–996. ACM, New York (2011)


  37. Marketakis, Y., Tzanakis, M., Tzitzikas, Y.: Prescan: towards automating the preservation of digital objects. In: Proceedings of the International Conference on Management of Emergent Digital EcoSystems (MEDES’09), pp. 60:404–60:411. ACM, New York (2009)


  38. Massie, M., Chun, B., Culler, D.: The ganglia distributed monitoring system: design, implementation, and experience. Parallel Comput. 30(7), 817–840 (2004)


  39. Massie, M., Li, B., Nicholes, B., Vuksan, V., Alexander, R., Buchbinder, J., Costa, F., Dean, A., Josephsen, D., Phaal, P., et al.: Monitoring with Ganglia. O’Reilly Media, Inc., Sebastopol (2012)


  40. McCreadie, R., Macdonald, C., Ounis, I.: Comparing distributed indexing: to mapreduce or not? In: Proc. of LSDS-IR, pp. 41–48 (2009)


  41. McCreadie, R., Macdonald, C., Ounis, I.: Mapreduce indexing strategies: studying scalability and efficiency. Inf. Process. Manag. 48(5), 873–888 (2012)

  42. Mika, P., Tummarello, G.: Web semantics in the clouds. IEEE Intell. Syst. 23(5), 82–87 (2008)


  43. Nenkova, A., McKeown, K.: A survey of text summarization techniques. In: Mining Text Data, pp. 43–76 (2012)

  44. Papadimitriou, S., Sun, J.: Disco: distributed co-clustering with map-reduce: a case study towards petabyte-scale end-to-end mining. In: Eighth IEEE International Conference on Data Mining (ICDM’08), pp. 512–521. IEEE Press, New York (2008)


  45. Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., Dewitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: Proceedings of the 35th SIGMOD International Conference on Management of Data (SIGMOD’09), pp. 165–178. ACM, New York (2009)


  46. Phaal, P.: SFlow is an industry standard technology for monitoring high speed switched networks. http://blog.sflow.com/. Accessed: 03/05/2013

  47. Poosala, V., Haas, P., Ioannidis, Y., Shekita, E.: Improved Histograms for Selectivity Estimation of Range Predicates vol. 25, pp. 294–305. ACM, New York (1996)


  48. Pratt, W., Fagan, L.: The usefulness of dynamically categorizing search results. J. Am. Med. Inform. Assoc. 7(6), 605–617 (2000)


  49. Ramachandran, S.: Google developers: Web metrics. https://developers.google.com/speed/articles/web-metrics. Accessed: 03/05/2013

  50. Sacco, G., Tzitzikas, Y.: Dynamic Taxonomies and Faceted Search. Springer, Berlin (2009)


  51. Thakker, D., Osman, T., Lakin, P.: Java annotation patterns engine. http://en.wikipedia.org/wiki/JAPE_(linguistics). Accessed: 03/04/2013

  52. White, T.: Hadoop: The Definitive Guide. O’Reilly, Sebastopol (2009)

  53. Tzitzikas, Y., Meghini, C.: Ostensive automatic schema mapping for taxonomy-based peer-to-peer systems. In: Cooperative Information Agents VII, pp. 78–92. Springer, Berlin (2003)


  54. Tzitzikas, Y., Spyratos, N., Constantopoulos, P.: Mediators over taxonomy-based information sources. VLDB J. 14(1), 112–136 (2005)


  55. Urbani, J., Kotoulas, S., Oren, E., Van Harmelen, F.: Scalable distributed reasoning using Mapreduce. pp. 634–649 (2009)

  56. van Zwol, R., Garcia Pueyo, L., Muralidharan, M., Sigurbjörnsson, B.: Machine learned ranking of entity facets. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’10), pp. 879–880. ACM, New York (2010)


  57. Venner, J.: Pro Hadoop. Apress, Berkeley (2009)


  58. White, R.W., Kules, B., Drucker, S.M., Schraefel, M.: Supporting exploratory search, introduction (special issue). Commun. ACM 49(4), 36–39 (2006)

  59. Wilson, M., et al.: A longitudinal study of exploratory and keyword search. In: Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL’08), pp. 52–56. ACM, New York (2008)


  60. Yahoo! Inc. Chaining jobs. http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining. Accessed: 09/05/2013

  61. Zhai, K., Boyd-Graber, J., Asadi, N., Alkhouja, M.: Mr. LDA: a flexible large scale topic modeling package using variational inference in Mapreduce. In: Proceedings of the 21st International Conference on World Wide Web (WWW’12), pp. 879–888. ACM, New York (2012)


  62. Zhang, C., Li, F., Jestes, J.: Efficient parallel knn joins for large data in Mapreduce. In: Proceedings of the 15th International Conference on Extending Database Technology, pp. 38–49. ACM, New York (2012)



Acknowledgements

Many thanks to Carlo Allocca and Pavlos Fafalios for their contributions. We gratefully acknowledge the support of the iMarine (FP7 Research Infrastructures, 2011–2014) and PaaSage (FP7 Integrated Project 317715, 2012–2016) EU projects and of Amazon Web Services through an Education Grant. We also acknowledge the interesting discussions we had in the context of the MUMIA COST action (IC1002, 2010–2014).

Author information


Correspondence to Yannis Tzitzikas.

Additional information

Communicated by Feifei Li and Suman Nath.

Appendix A: Vertical search applications


Fig. 19 Two screens from vertical search applications, one for the marine domain (top) and another for patent search (bottom)


About this article

Cite this article

Kitsos, I., Magoutis, K. & Tzitzikas, Y. Scalable entity-based summarization of web search results using MapReduce. Distrib Parallel Databases 32, 405–446 (2014). https://doi.org/10.1007/s10619-013-7133-7


  • DOI: https://doi.org/10.1007/s10619-013-7133-7
