Big Data Storage and Processing on Azure Clouds: Experiments at Scale and Lessons Learned

Cloud Computing for Data-Intensive Applications

Abstract

Data-intensive computing is increasingly regarded as the basis for a new, fourth paradigm for science. Two factors encourage this trend. First, vast amounts of data are becoming available in more and more application areas. Second, infrastructures that can persistently store these data for sharing and processing are becoming a reality. This makes it possible to unify the knowledge acquired through the previous three paradigms of scientific research (theory, experiments and simulations) with vast amounts of multidisciplinary data. The technical and scientific issues related to this context have been designated the “Big Data” challenges. In this landscape, building a functional infrastructure that meets the requirements of Big Data applications is critical and remains a challenge. An important step has been made thanks to the emergence of cloud infrastructures, which provide the first building blocks for coping with the scale of the Big Data vision. Clouds bring to life the illusion of a (more or less) infinitely scalable infrastructure managed through a fully outsourced ICT service. Instead of buying and managing hardware, users “rent” outsourced resources as needed. However, cloud technologies have not yet reached their full potential. In particular, the capabilities currently available for data storage and processing are still far from meeting application requirements. In this work we investigate several pressing challenges related to Big Data management on clouds. We discuss current state-of-the-art solutions, their limitations and some ways to overcome them. We illustrate our study with a concrete application from the area of joint genetic and neuroimaging data analysis. The goal of this chapter is to present the conclusions of this study, which was performed through a large-scale experiment carried out across three data centers of Microsoft’s Azure cloud platform over 2 weeks and consumed approximately 200,000 compute hours.

Author information

Corresponding author

Correspondence to Alexandru Costan.

Copyright information

© 2014 Springer Science+Business Media New York

About this chapter

Cite this chapter

Tudoran, R., Costan, A., Antoniu, G., Goetz, B. (2014). Big Data Storage and Processing on Azure Clouds: Experiments at Scale and Lessons Learned. In: Li, X., Qiu, J. (eds) Cloud Computing for Data-Intensive Applications. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-1905-5_14

  • DOI: https://doi.org/10.1007/978-1-4939-1905-5_14

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4939-1904-8

  • Online ISBN: 978-1-4939-1905-5

  • eBook Packages: Computer Science, Computer Science (R0)
