DOI: 10.1145/3183519.3183528
research-article

Cross-language optimizations in big data systems: a case study of SCOPE

Published: 27 May 2018

ABSTRACT

Building scalable big data programs currently requires programmers to combine relational code (SQL) with non-relational code (Java, C#, Scala). Relational code is declarative: a program describes what the computation is, and the compiler decides how to distribute it. SQL query optimization has enjoyed a rich and fruitful history; however, most research and commercial optimization engines treat non-relational code as a black box and are thus unable to optimize it.

This paper empirically studies over 3 million SCOPE programs across five data centers within Microsoft and finds that programs with non-relational code take between 45% and 70% of data center CPU time. We further explore the potential for SCOPE optimization by generating more native code from the non-relational part. Finally, we present six case studies showing that triggering more generation of native code in these jobs yields significant performance improvements: optimizing just one portion resulted in as much as a 25% improvement for an entire program.

  • Published in

    ICSE-SEIP '18: Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice
    May 2018
    336 pages
    ISBN: 9781450356596
    DOI: 10.1145/3183519

    Copyright © 2018 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States
