High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads

  • Conference paper
Real-Time Business Intelligence and Analytics (BIRTE 2015, BIRTE 2016, BIRTE 2017)

Abstract

Google’s Ads Data Infrastructure systems run the multi-billion dollar ads business at Google. High availability and strong consistency are critical for these systems. While most distributed systems handle machine-level failures well, handling datacenter-level failures is less common. In our experience, handling datacenter-level failures is critical for running true high availability systems. Most of our systems (e.g. Photon, F1, Mesa) now support multi-homing as a fundamental design property. Multi-homed systems run live in multiple datacenters all the time, adaptively moving load between datacenters, with the ability to handle outages of any scale completely transparently.

This paper focuses primarily on stream processing systems, and describes our general approaches for building high availability multi-homed systems, discusses common challenges and solutions, and shares what we have learned in building and running these large-scale systems for over ten years.
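As an illustration of what "adaptively moving load between datacenters" can mean in practice, the following is a minimal sketch rather than the authors' method. It assumes a hypothetical capacity-proportional policy; the function name `assign_load` and the datacenter names and capacities are invented for the example.

```python
def assign_load(total_load, capacity, healthy):
    """Split total_load across healthy datacenters in proportion to capacity.

    capacity: dict mapping datacenter name -> relative capacity (hypothetical values).
    healthy:  set of datacenter names currently able to serve.
    """
    live = {dc: cap for dc, cap in capacity.items() if dc in healthy}
    if not live:
        raise RuntimeError("no healthy datacenter available")
    total_cap = sum(live.values())
    # Each live datacenter takes a share proportional to its capacity.
    return {dc: total_load * cap / total_cap for dc, cap in live.items()}


capacity = {"dc-a": 100, "dc-b": 100, "dc-c": 200}

# All datacenters are live and share the work.
before = assign_load(1000, capacity, healthy={"dc-a", "dc-b", "dc-c"})

# dc-b suffers an outage; its share moves to the remaining datacenters.
after = assign_load(1000, capacity, healthy={"dc-a", "dc-c"})
```

The sketch deliberately ignores whether the surviving datacenters have enough spare capacity to absorb the moved load, and it says nothing about the consistency of shared state, which is the harder problem the paper addresses.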



Acknowledgements

We would like to thank the teams inside Google who built and ran the systems we have described, and the earlier generations of systems that informed our current designs. We would like to thank Divyakant Agrawal for his help preparing this paper.

Author information

Correspondence to Ashish Gupta or Jeff Shute.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Gupta, A., Shute, J. (2019). High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads. In: Castellanos, M., Chrysanthis, P., Pelechrinis, K. (eds) Real-Time Business Intelligence and Analytics. BIRTE 2015, BIRTE 2016, BIRTE 2017. Lecture Notes in Business Information Processing, vol 337. Springer, Cham. https://doi.org/10.1007/978-3-030-24124-7_5

  • DOI: https://doi.org/10.1007/978-3-030-24124-7_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-24123-0

  • Online ISBN: 978-3-030-24124-7

  • eBook Packages: Computer Science, Computer Science (R0)
