ABSTRACT
Text analysis tools are now required to process increasingly large corpora, which are often organized as small files (abstracts, news articles, etc.). Cloud computing offers a convenient, on-demand, pay-as-you-go computing environment for solving such problems. We investigate provisioning on the Amazon EC2 cloud from the user's perspective, aiming to provide a scheduling strategy that is both timely and cost-effective. To construct an execution plan, we rely on the empirical performance of the application of interest on smaller subsets of the data. A first goal of our performance measurements is to determine an optimal file size for our application to consume. Using the subset-sum first-fit heuristic, we reshape the input data by merging files to match the desired file size as closely as possible. Because the output is then less fragmented, this also speeds up retrieval of the application's results. Using predictions of the application's performance based on measurements on small data sets, we devise an execution plan that meets a user-specified deadline while minimizing cost.
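The merging step described in the abstract can be illustrated with a minimal first-fit sketch. It rests on assumptions: the abstract does not give the exact procedure, so the function name `first_fit_merge`, the `(name, size)` input format, and the rule that an oversized file gets a group of its own are all hypothetical choices made for illustration, not the authors' implementation. Sizes are in arbitrary units.

```python
def first_fit_merge(files, target_size):
    """Group files so that each group's total size stays at or below target_size.

    files: iterable of (name, size) pairs (hypothetical input format).
    Each file goes into the first existing group that still has room;
    if none has room, a new group is opened. A file larger than
    target_size therefore ends up alone in its own group.
    """
    groups = []  # each group: {"files": [names], "total": combined size}
    for name, size in files:
        for group in groups:
            if group["total"] + size <= target_size:
                group["files"].append(name)
                group["total"] += size
                break
        else:
            # No existing group could take this file.
            groups.append({"files": [name], "total": size})
    return groups


# Illustrative sizes only; these are not measurements from the paper.
example = [("a", 4), ("b", 8), ("c", 1), ("d", 4), ("e", 2), ("f", 1)]
merged = first_fit_merge(example, target_size=10)
for group in merged:
    print(group["files"], group["total"])
```

Each resulting group would then be concatenated into a single input file, so the application reads fewer, larger files and produces fewer output files to retrieve.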