Abstract
Google’s Ads Data Infrastructure systems run the multi-billion-dollar ads business at Google. High availability and strong consistency are critical for these systems. While most distributed systems handle machine-level failures well, handling datacenter-level failures is less common. In our experience, handling datacenter-level failures is critical for running true high-availability systems. Most of our systems (e.g., Photon, F1, Mesa) now support multi-homing as a fundamental design property. Multi-homed systems run live in multiple datacenters all the time, adaptively moving load between datacenters, with the ability to handle outages of any scale completely transparently.
This paper focuses primarily on stream processing systems. It describes our general approaches for building high-availability multi-homed systems, discusses common challenges and solutions, and shares what we have learned in building and running these large-scale systems for over ten years.
Acknowledgements
We would like to thank the teams inside Google who built and ran the systems we have described, and the earlier generations of systems that informed our current designs. We would like to thank Divyakant Agrawal for his help preparing this paper.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Gupta, A., Shute, J. (2019). High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads. In: Castellanos, M., Chrysanthis, P., Pelechrinis, K. (eds) Real-Time Business Intelligence and Analytics. BIRTE 2015, BIRTE 2016, BIRTE 2017. Lecture Notes in Business Information Processing, vol 337. Springer, Cham. https://doi.org/10.1007/978-3-030-24124-7_5
DOI: https://doi.org/10.1007/978-3-030-24124-7_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-24123-0
Online ISBN: 978-3-030-24124-7
eBook Packages: Computer Science, Computer Science (R0)