High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads

Gupta, Ashish; Shute, Jeff

doi:10.1007/978-3-030-24124-7_5

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 337)

Abstract

Google’s Ads Data Infrastructure systems run the multi-billion-dollar ads business at Google. High availability and strong consistency are critical for these systems. While most distributed systems handle machine-level failures well, handling datacenter-level failures is less common. In our experience, handling datacenter-level failures is critical for running true high-availability systems. Most of our systems (e.g., Photon, F1, Mesa) now support multi-homing as a fundamental design property. Multi-homed systems run live in multiple datacenters all the time, adaptively moving load between datacenters, with the ability to handle outages of any scale completely transparently.

This paper focuses primarily on stream processing systems. It describes our general approaches for building high-availability multi-homed systems, discusses common challenges and solutions, and shares what we have learned in building and running these large-scale systems for over ten years.

Acknowledgements

We would like to thank the teams inside Google who built and ran the systems we have described, and the earlier generations of systems that informed our current designs. We would like to thank Divyakant Agrawal for his help preparing this paper.

Author information

Authors and Affiliations

Google Inc., Mountain View, USA
Ashish Gupta & Jeff Shute

Corresponding authors

Correspondence to Ashish Gupta or Jeff Shute.

Editor information

Editors and Affiliations

Teradata, Santa Clara, CA, USA
Malu Castellanos
University of Pittsburgh, Pittsburgh, PA, USA
Panos K. Chrysanthis
University of Pittsburgh, Pittsburgh, PA, USA
Konstantinos Pelechrinis

About this paper

Cite this paper

Gupta, A., Shute, J. (2019). High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads. In: Castellanos, M., Chrysanthis, P., Pelechrinis, K. (eds) Real-Time Business Intelligence and Analytics. BIRTE 2015, BIRTE 2016, BIRTE 2017. Lecture Notes in Business Information Processing, vol 337. Springer, Cham. https://doi.org/10.1007/978-3-030-24124-7_5

DOI: https://doi.org/10.1007/978-3-030-24124-7_5
Published: 11 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-24123-0
Online ISBN: 978-3-030-24124-7
eBook Packages: Computer Science, Computer Science (R0)
