Abstract
The amount of data available in many areas is increasing faster than our ability to process it. The promise of “infinite” resources offered by the cloud computing paradigm has led to recent interest in exploiting clouds for large-scale data-intensive computing. Data-intensive computing presents new challenges for systems management in the cloud, including new processing frameworks, such as MapReduce, and the costs inherent in large data sets in distributed environments. Workload management, an important component of systems management, is the discipline of effectively managing, controlling and monitoring “workflow” across computing systems. This chapter examines the state of the art in workload management for data-intensive computing in clouds. A taxonomy of workload management for data-intensive computing in the cloud is presented and used to classify and evaluate current workload management mechanisms.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Mian, R., Martin, P., Brown, A., Zhang, M. (2011). Managing Data-Intensive Workloads in a Cloud. In: Fiore, S., Aloisio, G. (eds) Grid and Cloud Database Management. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20045-8_12
DOI: https://doi.org/10.1007/978-3-642-20045-8_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20044-1
Online ISBN: 978-3-642-20045-8
eBook Packages: Computer Science (R0)