DOI: 10.1145/1989323.1989439
research-article

Nova: continuous Pig/Hadoop workflows

Published: 12 June 2011

Abstract

This paper describes a workflow manager developed and deployed at Yahoo called Nova, which pushes continually-arriving data through graphs of Pig programs executing on Hadoop clusters. (Pig is a structured dataflow language and runtime for the Hadoop map-reduce system.)
Nova is like data stream managers in its support for stateful incremental processing, but unlike them in that it deals with data in large batches using disk-based processing. Batched incremental processing is a good fit for a large fraction of Yahoo's data processing use-cases, which deal with continually-arriving data and benefit from incremental algorithms, but do not require ultra-low-latency processing.
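
The idea of batched, stateful incremental processing can be illustrated with a minimal sketch. This is not Nova's API or a Pig program: the function name `process_batch`, the in-memory `Counter` state, and the sample URLs are all hypothetical, and a real deployment would keep state on disk rather than in memory. The sketch only shows that folding each new batch into saved state yields the same result as recomputing over all data seen so far.

```python
from collections import Counter
from typing import Iterable


def process_batch(state: Counter, batch: Iterable[str]) -> Counter:
    """Fold one newly arrived batch into the saved state and return the new state."""
    new_state = Counter(state)  # copy, so the previous state is left untouched
    new_state.update(batch)     # only the records in the new batch are scanned
    return new_state


# Hypothetical batches of URLs arriving over time.
batches = [
    ["a.com", "b.com", "a.com"],
    ["b.com", "c.com"],
]

state = Counter()
for batch in batches:
    state = process_batch(state, batch)

# Recomputing from scratch over every batch gives the same counts.
full = Counter(url for batch in batches for url in batch)
assert state == full
```

Each call touches only the new batch plus the saved state, which is the property that makes incremental algorithms attractive when data arrives continually but results are not needed with very low latency.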

    Published In

    SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
    June 2011
    1364 pages
    ISBN: 9781450306614
    DOI: 10.1145/1989323
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. hadoop
    2. incremental processing
    3. workflow

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '11

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)10
    • Downloads (Last 6 weeks)3
    Reflects downloads up to 14 Feb 2025

    Citations

    Cited By

    • (2024) Evaluating BPMN Extensions for Continuous Processes Based on Use Cases and Expert Interviews. Business & Information Systems Engineering, 66:6 (709-735). DOI: 10.1007/s12599-023-00850-7. Online publication date: 29-Jan-2024
    • (2022) Research on Reverse Skyline Query Algorithm Based on Decision Set. Journal of Database Management, 33:1 (1-28). DOI: 10.4018/JDM.313971. Online publication date: 1-Jan-2022
    • (2022) Analytics at Scale: Evolution at Infrastructure and Algorithmic Levels. 2022 IEEE 38th International Conference on Data Engineering (ICDE) (3217-3220). DOI: 10.1109/ICDE53745.2022.00302. Online publication date: May-2022
    • (2021) STAR. Proceedings of the 29th International Conference on Advances in Geographic Information Systems (606-615). DOI: 10.1145/3474717.3484265. Online publication date: 2-Nov-2021
    • (2019) Baran. Handbook of Research on the Evolution of IT and the Rise of E-Society (124-161). DOI: 10.4018/978-1-5225-7214-5.ch007. Online publication date: 2019
    • (2019) Storage and Security Preservation Using Cloud Based Intelligent Compression Scheme. International Journal of Scientific Research in Science, Engineering and Technology (417-424). DOI: 10.32628/IJSRSET196274. Online publication date: 15-Mar-2019
    • (2019) Secure Data Compression Scheme for Scalable Data in Dynamic Data Storage Environments. International Journal of Scientific Research in Computer Science, Engineering and Information Technology (229-237). DOI: 10.32628/CSEIT195439. Online publication date: 1-Aug-2019
    • (2019) Storage Preservation Using Big Data Based Intelligent Compression Scheme. International Journal of Scientific Research in Computer Science, Engineering and Information Technology (92-100). DOI: 10.32628/CSEIT19539. Online publication date: 1-May-2019
    • (2019) Cloud resource management using 3Vs of Internet of Big data streams. Computing. DOI: 10.1007/s00607-019-00732-5. Online publication date: 3-Jun-2019
    • (2019) Tools and Libraries for Big Data Analysis. Encyclopedia of Big Data Technologies (1679-1685). DOI: 10.1007/978-3-319-77525-8_282. Online publication date: 20-Feb-2019