Research article. DOI: 10.1145/2391229.2391248

Untangling cluster management with Helix

Published: 14 October 2012

ABSTRACT

Distributed data systems are used in a variety of settings, including online serving, offline analytics, data transport, and search. They let organizations scale out their workloads using cost-effective commodity hardware while retaining key properties like fault tolerance and scalability. At LinkedIn we have built a number of such systems. A key pattern we observe is that, even though they serve different purposes, they share much common functionality and use common building blocks in their architectures. One such building block that is just beginning to receive attention is cluster management, which addresses the complexity of handling a dynamic, large-scale system with many servers. Such systems must handle software and hardware failures, setup tasks such as bootstrapping data, and operational issues such as data placement, load balancing, planned upgrades, and cluster expansion.

This shared complexity, which we see in all of our systems, motivated us to build Helix, a cluster management framework that solves these problems once, in a general way.

Helix provides an abstraction for a system developer to separate coordination and management tasks from component functional tasks of a distributed system. The developer defines the system behavior via a state model that enumerates the possible states of each component, the transitions between those states, and constraints that govern the system's valid settings. Helix does the heavy lifting of ensuring the system satisfies that state model in the distributed setting, while also meeting the system's goals on load balancing and throttling state changes. We detail several Helix-managed production distributed systems at LinkedIn and how Helix has helped them avoid building custom management components. We describe the Helix design and implementation and present an experimental study that demonstrates its performance and functionality.
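
To make the idea of a state model concrete, the following is a minimal sketch of a declarative model with states, allowed transitions, and a constraint on valid settings. It is not Helix's API: the `StateModel` class, the `OFFLINE`/`SLAVE`/`MASTER` state names, and the "at most one master per partition" constraint are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StateModel:
    """Hypothetical declarative state model (illustrative, not the Helix API)."""
    states: frozenset                 # every state a replica may be in
    transitions: frozenset            # allowed (from_state, to_state) pairs
    # Constraint: maximum number of replicas of one partition in a given state.
    max_per_state: dict = field(default_factory=dict)

    def can_transition(self, src, dst):
        """Return True if moving a replica from src to dst is allowed."""
        return (src, dst) in self.transitions

    def is_valid(self, assignment):
        """Check one partition's assignment (replica name -> state)."""
        counts = {}
        for state in assignment.values():
            if state not in self.states:
                return False
            counts[state] = counts.get(state, 0) + 1
        return all(counts.get(state, 0) <= limit
                   for state, limit in self.max_per_state.items())


# Assumed example model: replicas come online as slaves, and one may be promoted.
MASTER_SLAVE = StateModel(
    states=frozenset({"OFFLINE", "SLAVE", "MASTER"}),
    transitions=frozenset({
        ("OFFLINE", "SLAVE"),
        ("SLAVE", "MASTER"),
        ("MASTER", "SLAVE"),
        ("SLAVE", "OFFLINE"),
    }),
    max_per_state={"MASTER": 1},
)
```

In this sketch the developer only declares the model; a separate controller (not shown) would be responsible for choosing and issuing transitions so that every partition's assignment stays valid.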

Published in

SoCC '12: Proceedings of the Third ACM Symposium on Cloud Computing
October 2012, 325 pages
ISBN: 9781450317610
DOI: 10.1145/2391229

Copyright © 2012 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher: Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 169 of 722 submissions (23%)
