MapReduce for Information Retrieval Evaluation: “Let’s Quickly Test This on 12 TB of Data”

Hiemstra, Djoerd; Hauff, Claudia

doi:10.1007/978-3-642-15998-5_8

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6360)

Included in the following conference series:

International Conference of the Cross-Language Evaluation Forum for European Languages

Abstract

We propose to use MapReduce to quickly test new retrieval approaches on a cluster of machines by sequentially scanning all documents. We present a small case study in which we use a cluster of 15 low-cost machines to search a web crawl of 0.5 billion pages, showing that sequential scanning is a viable approach to running large-scale information retrieval experiments with little effort. The code is available to other researchers at: http://mirex.sourceforge.net
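
The idea of answering queries by sequentially scanning every document, rather than consulting an inverted index, can be sketched in a few lines. The following is a minimal single-machine illustration in Python, not the authors' code and not the Hadoop API: the function names (map_document, reduce_query, run), the whitespace tokenizer, and the term-count scoring are all assumptions made for this example. The map step scores one document against all queries; the reduce step keeps the k best documents per query.

```python
from collections import Counter, defaultdict
import heapq


def map_document(doc_id, text, queries):
    """Map step: scan one document and emit (query_id, (score, doc_id)).

    Scoring is a plain sum of query-term counts, an assumption chosen
    for brevity rather than a ranking model taken from the paper.
    """
    counts = Counter(text.lower().split())
    for query_id, terms in queries.items():
        score = sum(counts[term] for term in terms)
        if score > 0:
            yield query_id, (score, doc_id)


def reduce_query(query_id, scored_docs, k):
    """Reduce step: keep the k highest-scoring documents for one query."""
    return query_id, heapq.nlargest(k, scored_docs)


def run(documents, queries, k=10):
    """Simulate the map, shuffle, and reduce phases in memory."""
    grouped = defaultdict(list)  # shuffle: group emitted pairs by query_id
    for doc_id, text in documents.items():
        for query_id, pair in map_document(doc_id, text, queries):
            grouped[query_id].append(pair)
    return dict(reduce_query(q, pairs, k) for q, pairs in grouped.items())
```

Calling run with a dictionary of documents (doc_id to text) and a dictionary of queries (query_id to list of terms) returns, per query, a ranked list of (score, doc_id) pairs. On a real cluster the documents would be split across machines and each map task would scan its own share, which is what makes a full scan of a large crawl practical.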

Author information

Authors and Affiliations

University of Twente, The Netherlands
Djoerd Hiemstra & Claudia Hauff

Editor information

Editors and Affiliations

Department of Information Engineering, University of Padua, Via Gradenigo 6/a, 35131, Padova, Italy
Maristella Agosti
University of Padua, Padua, Italy
Nicola Ferro
ISTI-CNR, Area Ricerca CNR, Via Moruzzi, 1, 56124, Pisa, Italy
Carol Peters
ISLA, University of Amsterdam, Amsterdam, The Netherlands
Maarten de Rijke
Dublin City University, Dublin, Ireland
Alan Smeaton

About this paper

Cite this paper

Hiemstra, D., Hauff, C. (2010). MapReduce for Information Retrieval Evaluation: “Let’s Quickly Test This on 12 TB of Data”. In: Agosti, M., Ferro, N., Peters, C., de Rijke, M., Smeaton, A. (eds) Multilingual and Multimodal Information Access Evaluation. CLEF 2010. Lecture Notes in Computer Science, vol 6360. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15998-5_8

DOI: https://doi.org/10.1007/978-3-642-15998-5_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15997-8
Online ISBN: 978-3-642-15998-5
eBook Packages: Computer Science, Computer Science (R0)
