ABSTRACT
Building scalable big data programs currently requires programmers to combine relational code (SQL) with non-relational code (Java, C#, Scala). Relational code is declarative: a program describes what the computation is, and the compiler decides how to distribute the program. SQL query optimization has enjoyed a rich and fruitful history; however, most research and commercial optimization engines treat non-relational code as a black box and thus are unable to optimize it.
This paper empirically studies over 3 million SCOPE programs across five data centers within Microsoft and finds that programs with non-relational code take between 45% and 70% of data center CPU time. We further explore the potential for SCOPE optimization by generating more native code from the non-relational part. Finally, we present six case studies showing that triggering more generation of native code in these jobs yields significant performance improvement: optimizing just one portion resulted in as much as a 25% improvement for an entire program.