Abstract
We propose using MapReduce to quickly test new retrieval approaches on a cluster of machines by sequentially scanning all documents. We present a small case study in which a cluster of 15 low-cost machines searches a web crawl of 0.5 billion pages, showing that sequential scanning is a viable approach to running large-scale information retrieval experiments with little effort. The code is available to other researchers at: http://mirex.sourceforge.net
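To make the idea of sequential scanning concrete, the following is a minimal single-process sketch of how a retrieval run might be phrased as a map step and a reduce step. It is not the authors' code: the function names (map_document, reduce_query, sequential_scan), the toy documents and queries, and the plain term-count score are all assumptions made for illustration, and a real experiment would run the same two steps on a MapReduce cluster rather than in one Python process.

```python
from collections import defaultdict
import heapq


def map_document(doc_id, text, queries):
    """Map step: score one document against every query in a single pass.

    The score is a plain count of query-term occurrences, an assumption
    chosen for brevity rather than a ranking model from the paper.
    """
    counts = defaultdict(int)
    for token in text.lower().split():
        counts[token] += 1
    for query_id, terms in queries.items():
        score = sum(counts[t] for t in terms if t in counts)
        if score > 0:
            # Emit (key, value) with the query as key, as in MapReduce.
            yield query_id, (score, doc_id)


def reduce_query(query_id, scored_docs, k=10):
    """Reduce step: keep the k best-scoring documents for one query."""
    top = heapq.nlargest(k, scored_docs)
    return query_id, [(doc_id, score) for score, doc_id in top]


def sequential_scan(documents, queries, k=10):
    """Simulate the shuffle between map and reduce in one process."""
    grouped = defaultdict(list)
    for doc_id, text in documents:
        for query_id, pair in map_document(doc_id, text, queries):
            grouped[query_id].append(pair)
    return dict(reduce_query(q, pairs, k) for q, pairs in grouped.items())


# Hypothetical toy collection and query set.
documents = [
    ("d1", "map reduce cluster"),
    ("d2", "reduce reduce web"),
    ("d3", "nothing here"),
]
queries = {"q1": ["reduce"], "q2": ["web", "cluster"], "q3": ["absent"]}

results = sequential_scan(documents, queries, k=2)
```

Because every document is read exactly once and no index is built, trying a different scoring function in this sketch only means changing map_document, which is the kind of low-effort experimentation the abstract describes.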
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hiemstra, D., Hauff, C. (2010). MapReduce for Information Retrieval Evaluation: “Let’s Quickly Test This on 12 TB of Data”. In: Agosti, M., Ferro, N., Peters, C., de Rijke, M., Smeaton, A. (eds) Multilingual and Multimodal Information Access Evaluation. CLEF 2010. Lecture Notes in Computer Science, vol 6360. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15998-5_8
DOI: https://doi.org/10.1007/978-3-642-15998-5_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15997-8
Online ISBN: 978-3-642-15998-5
eBook Packages: Computer Science (R0)