EaaS: Evaluation-as-a-Service and Experiences from the VISCERAL Project

Information Retrieval Evaluation in a Changing World

Part of the book series: The Information Retrieval Series (INRE, volume 41)


Abstract

The Cranfield paradigm has dominated information retrieval evaluation for almost 50 years. It has had a major impact on the entire domain of information retrieval since the 1960s and, compared with systematic evaluation in other domains, is very well developed and has greatly helped to advance the field. This chapter summarizes some of the shortcomings of information analysis evaluation and how recent techniques help to overcome them. The term Evaluation-as-a-Service (EaaS) was coined at a workshop that brought together several approaches that do not distribute the data but instead use source code submission, APIs or the cloud to run evaluation campaigns. The outcomes of a white paper and the experiences gained in the VISCERAL project on cloud-based evaluation for medical imaging are explained in this chapter. In the conclusions, the next steps for research infrastructures are sketched, along with the impact that EaaS can have in this context in making research in data science more efficient and effective.
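The "bring the algorithms to the data" pattern summarized above (participants submit code or container images, the organisers run them next to the sequestered data, and only evaluation metrics come back) can be illustrated with a minimal sketch. The Python snippet below is purely hypothetical: the service URL, endpoints, field names and metric names are invented for illustration and do not correspond to the actual VISCERAL infrastructure or to any real EaaS platform API.

```python
"""Minimal sketch of an EaaS-style submission workflow (hypothetical).

The participant never downloads the test data; an algorithm (here, a container
image reference) is registered with the organisers' evaluation service, which
runs it in the cloud next to the data and returns only the metrics.
All URLs, endpoints and field names below are invented for illustration.
"""
import time
import requests  # third-party HTTP client

EVAL_SERVICE = "https://eval.example.org/api"   # hypothetical evaluation service
API_KEY = "participant-secret-token"            # issued by the campaign organisers


def submit_algorithm(image_ref: str, benchmark: str) -> str:
    """Register a container image for evaluation; returns a submission id."""
    resp = requests.post(
        f"{EVAL_SERVICE}/submissions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"image": image_ref, "benchmark": benchmark},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["submission_id"]


def wait_for_metrics(submission_id: str, poll_seconds: int = 60) -> dict:
    """Poll until the cloud-side run finishes; only metrics come back, never data."""
    while True:
        resp = requests.get(
            f"{EVAL_SERVICE}/submissions/{submission_id}",
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=30,
        )
        resp.raise_for_status()
        status = resp.json()
        if status["state"] in ("finished", "failed"):
            return status
        time.sleep(poll_seconds)


if __name__ == "__main__":
    sid = submit_algorithm("registry.example.org/myteam/segmenter:1.0",
                           benchmark="anatomy-segmentation")
    result = wait_for_metrics(sid)
    # e.g. {"dice": 0.87}, computed on the sequestered test set in the cloud
    print(result.get("metrics"))
```

The design choice this sketch highlights is that the private test data and the scoring code stay entirely on the organisers' side; participants only ever see aggregate metrics, which also makes the evaluation reproducible and the submitted algorithms reusable.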



Acknowledgements

The work leading to this chapter was partly funded by the EU FP7 program via the VISCERAL project and by the ESF via the ELIAS project. We also thank all the participants of the related workshops for their input and the rich discussions.

Author information


Correspondence to Henning Müller.


Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter


Cite this chapter

Müller, H., Hanbury, A. (2019). EaaS: Evaluation-as-a-Service and Experiences from the VISCERAL Project. In: Ferro, N., Peters, C. (eds) Information Retrieval Evaluation in a Changing World. The Information Retrieval Series, vol 41. Springer, Cham. https://doi.org/10.1007/978-3-030-22948-1_6

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-22948-1_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-22947-4

  • Online ISBN: 978-3-030-22948-1

  • eBook Packages: Computer Science, Computer Science (R0)
