SCOPE: parallel databases meet MapReduce

Zhou, Jingren; Bruno, Nicolas; Wu, Ming-Chuan; Larson, Per-Ake; Chaiken, Ronnie; Shakib, Darren

doi:10.1007/s00778-012-0280-z

Special Issue Paper
Published: 28 June 2012

The VLDB Journal, Volume 21, pages 611–636 (2012)

Abstract

Companies providing cloud-scale data services have increasing needs to store and analyze massive data sets, such as search logs, click streams, and web graph data. For cost and performance reasons, processing is typically done on large clusters of tens of thousands of commodity machines. Such massive data analysis on large clusters presents new opportunities and challenges for developing a highly scalable and efficient distributed computation system that is easy to program and supports complex system optimization to maximize performance and reliability. In this paper, we describe a distributed computation system, Structured Computations Optimized for Parallel Execution (SCOPE), targeted for this type of massive data analysis. SCOPE combines benefits from both traditional parallel databases and MapReduce execution engines to allow easy programmability and deliver massive scalability and high performance through advanced optimization. Similar to parallel databases, the system has a SQL-like declarative scripting language with no explicit parallelism, while being amenable to efficient parallel execution on large clusters. An optimizer is responsible for converting scripts into efficient execution plans for the distributed computation engine. A physical execution plan consists of a directed acyclic graph of vertices. Execution of the plan is orchestrated by a job manager that schedules execution on available machines and provides fault tolerance and recovery, much like MapReduce systems. SCOPE is used daily for a variety of data analysis and data mining applications over tens of thousands of machines at Microsoft, powering Bing and other online services.
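
The execution model described in the abstract (a plan expressed as a directed acyclic graph of vertices, run by a job manager that recovers from failures) can be illustrated with a toy example. The sketch below is not SCOPE's implementation; `Vertex`, `execute_plan`, `flaky_count`, and the sample data are hypothetical names chosen for illustration, and the plan runs sequentially in one process, whereas the real system schedules vertices across many machines.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Vertex:
    """One unit of work in the plan; `run` receives the upstream outputs."""

    def __init__(self, name, run, inputs=()):
        self.name = name
        self.run = run
        self.inputs = tuple(inputs)


def execute_plan(vertices, max_attempts=3):
    """Run each vertex after its inputs, retrying a vertex that fails."""
    by_name = {v.name: v for v in vertices}
    # Map each vertex to its predecessors; a cycle raises graphlib.CycleError.
    order = TopologicalSorter({v.name: set(v.inputs) for v in vertices})
    results = {}
    for name in order.static_order():
        vertex = by_name[name]
        args = [results[i] for i in vertex.inputs]
        for attempt in range(1, max_attempts + 1):
            try:
                results[name] = vertex.run(*args)
                break
            except RuntimeError:
                # Assumption: RuntimeError stands in for a machine failure.
                if attempt == max_attempts:
                    raise
    return results


attempts = {"count": 0}


def flaky_count(rows):
    """Count occurrences per query; fails on its first call to simulate a fault."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("simulated machine failure")
    counts = {}
    for query in rows:
        counts[query] = counts.get(query, 0) + 1
    return counts


plan = [
    Vertex("extract", lambda: ["bing", "scope", "bing"]),
    Vertex("count", flaky_count, inputs=["extract"]),
    Vertex("output", lambda counts: sorted(counts.items()), inputs=["count"]),
]
results = execute_plan(plan)
```

Here the `count` vertex fails once and is re-run without repeating `extract`, which is the kind of vertex-level recovery the abstract attributes to the job manager. A production scheduler would also run independent vertices concurrently, which this sequential sketch omits.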

Author information

Authors and Affiliations

Microsoft Corp., One Microsoft Way, Redmond, WA, 98052, USA
Jingren Zhou, Nicolas Bruno, Ming-Chuan Wu, Per-Ake Larson, Ronnie Chaiken & Darren Shakib

Corresponding author

Correspondence to Jingren Zhou.

About this article

Cite this article

Zhou, J., Bruno, N., Wu, MC. et al. SCOPE: parallel databases meet MapReduce. The VLDB Journal 21, 611–636 (2012). https://doi.org/10.1007/s00778-012-0280-z

Received: 15 August 2011
Revised: 16 February 2012
Accepted: 14 May 2012
Published: 28 June 2012
Issue Date: October 2012
DOI: https://doi.org/10.1007/s00778-012-0280-z
