SCOPE: parallel databases meet MapReduce

  • Special Issue Paper
The VLDB Journal

Abstract

Companies providing cloud-scale data services have a growing need to store and analyze massive data sets, such as search logs, click streams, and web graph data. For cost and performance reasons, processing is typically done on large clusters of tens of thousands of commodity machines. Such massive data analysis on large clusters presents new opportunities and challenges for developing a highly scalable and efficient distributed computation system that is easy to program and supports complex system optimization to maximize performance and reliability. In this paper, we describe Structured Computations Optimized for Parallel Execution (Scope), a distributed computation system targeted at this type of massive data analysis. Scope combines the benefits of traditional parallel databases and MapReduce execution engines to allow easy programmability and to deliver massive scalability and high performance through advanced optimization. Like parallel databases, the system has a SQL-like declarative scripting language with no explicit parallelism, while being amenable to efficient parallel execution on large clusters. An optimizer is responsible for converting scripts into efficient execution plans for the distributed computation engine. A physical execution plan consists of a directed acyclic graph of vertices. Execution of the plan is orchestrated by a job manager that schedules execution on available machines and provides fault tolerance and recovery, much like MapReduce systems. Scope is used daily for a variety of data analysis and data mining applications over tens of thousands of machines at Microsoft, powering Bing and other online services.
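
To make the execution model in the abstract concrete, the following is a minimal single-process sketch of a plan represented as a directed acyclic graph of vertices, run by a scheduler that retries failed vertices. It is an illustration only and not Scope's implementation: the function `run_dag`, the `max_attempts` parameter, and the four-vertex plan are all hypothetical, and a real job manager would dispatch vertices to many machines in parallel rather than run them one at a time.

```python
from collections import deque
from typing import Callable, Dict, List


def run_dag(
    vertices: Dict[str, Callable[[], None]],
    edges: Dict[str, List[str]],
    max_attempts: int = 3,
) -> List[str]:
    """Run each vertex once all of its upstream vertices have finished.

    `edges[u]` lists the vertices that consume the output of `u`.
    A vertex that raises is retried up to `max_attempts` times.
    Returns the order in which vertices completed.
    """
    # Count unfinished upstream dependencies for every vertex.
    pending = {name: 0 for name in vertices}
    for downstream in edges.values():
        for name in downstream:
            pending[name] += 1

    # Vertices with no inputs are ready immediately.
    ready = deque(sorted(name for name, count in pending.items() if count == 0))
    completed: List[str] = []

    while ready:
        name = ready.popleft()
        for attempt in range(1, max_attempts + 1):
            try:
                vertices[name]()
                break
            except Exception:
                # Simple recovery policy (an assumption): rerun the vertex.
                if attempt == max_attempts:
                    raise RuntimeError(
                        f"vertex {name!r} failed {max_attempts} times"
                    )
        completed.append(name)
        # Completing a vertex may unblock its consumers.
        for consumer in edges.get(name, []):
            pending[consumer] -= 1
            if pending[consumer] == 0:
                ready.append(consumer)

    if len(completed) != len(vertices):
        raise ValueError("plan contains a cycle, so it is not a DAG")
    return completed


# Hypothetical four-stage plan; the aggregate vertex fails once, then succeeds.
attempts = {"count": 0}


def flaky_aggregate() -> None:
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise IOError("simulated machine failure")


plan = {
    "extract": lambda: None,
    "partition": lambda: None,
    "aggregate": flaky_aggregate,
    "output": lambda: None,
}
edges = {
    "extract": ["partition"],
    "partition": ["aggregate"],
    "aggregate": ["output"],
}
order = run_dag(plan, edges)
```

In this sketch the simulated failure in `aggregate` is absorbed by a retry, so downstream vertices never observe it; that is the property the abstract attributes to the job manager, shown here under much simpler assumptions.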

Correspondence to Jingren Zhou.

About this article

Cite this article

Zhou, J., Bruno, N., Wu, MC. et al. SCOPE: parallel databases meet MapReduce. The VLDB Journal 21, 611–636 (2012). https://doi.org/10.1007/s00778-012-0280-z

  • DOI: https://doi.org/10.1007/s00778-012-0280-z
