research-article

Cross-language optimizations in big data systems: a case study of SCOPE

Authors:
Marija Selakovic (TU Darmstadt, Germany)
Michael Barnett (Microsoft Research)
Madan Musuvathi (Microsoft Research)
Todd Mytkowicz (Microsoft Research)

ICSE-SEIP '18: Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice, May 2018, Pages 45–54. https://doi.org/10.1145/3183519.3183528

Published: 27 May 2018

ABSTRACT

Building scalable big data programs currently requires programmers to combine relational code (SQL) with non-relational code (Java, C#, Scala). Relational code is declarative: a program describes what the computation is, and the compiler decides how to distribute the program. SQL query optimization has enjoyed a rich and fruitful history; however, most research and commercial optimization engines treat non-relational code as a black box and are thus unable to optimize it.

This paper empirically studies over 3 million SCOPE programs across five data centers within Microsoft and finds that programs with non-relational code take between 45% and 70% of data center CPU time. We further explore the potential for SCOPE optimization by generating more native code from the non-relational part. Finally, we present six case studies showing that triggering more generation of native code in these jobs yields significant performance improvement: optimizing just one portion resulted in as much as a 25% improvement for an entire program.

Published in
ICSE-SEIP '18: Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice
May 2018, 336 pages
ISBN: 9781450356596
DOI: 10.1145/3183519
Conference Chairs: Frances Paulisch (Siemens Healthineers, Germany), Jan Bosch (Chalmers University of Technology)
Copyright © 2018 ACM
Publisher: Association for Computing Machinery, New York, NY, United States