DOI: 10.1145/1851476.1851540
research-article

Reshaping text data for efficient processing on Amazon EC2

Published: 21 June 2010

ABSTRACT

Text analysis tools are nowadays required to process increasingly large corpora, which are often organized as small files (abstracts, news articles, etc.). Cloud computing offers a convenient, on-demand, pay-as-you-go computing environment for solving such problems. We investigate provisioning on the Amazon EC2 cloud from the user perspective, attempting to provide a scheduling strategy that is both timely and cost-effective. We rely on the empirical performance of the application of interest on smaller subsets of data to construct an execution plan. A first goal of our performance measurements is to determine an optimal file size for our application to consume. Using the subset-sum first-fit heuristic, we reshape the input data by merging files to match the desired file size as closely as possible. This also speeds up retrieval of the application's results, because the output is less segmented. Using predictions of the performance of our application based on measurements on small data sets, we devise an execution plan that meets a user-specified deadline while minimizing cost.
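
The reshaping step can be illustrated with a minimal first-fit sketch. The abstract does not give the details of the heuristic, so this is only one plausible reading: each file is placed into the first merged group that still has room under the desired size, and a new group is opened when none fits. The function name `first_fit_merge`, the `target` parameter, and the file names and sizes are all hypothetical.

```python
from typing import List, Tuple


def first_fit_merge(files: List[Tuple[str, int]], target: int) -> List[List[str]]:
    """Group files so that each group's total size stays at or below target.

    files  : (name, size) pairs, processed in the order given
    target : desired merged-file size (assumed here to be a hard upper bound)
    """
    bins = []  # each bin: {"used": total size so far, "files": [names]}
    for name, size in files:
        for b in bins:
            # First fit: take the first bin with enough remaining room.
            if b["used"] + size <= target:
                b["used"] += size
                b["files"].append(name)
                break
        else:
            # No existing bin fits (or the file alone exceeds target):
            # open a new bin holding just this file.
            bins.append({"used": size, "files": [name]})
    return [b["files"] for b in bins]


# Made-up sizes, purely for illustration.
corpus = [("a.txt", 400), ("b.txt", 700), ("c.txt", 300),
          ("d.txt", 600), ("e.txt", 200)]
print(first_fit_merge(corpus, target=1000))
# → [['a.txt', 'c.txt', 'e.txt'], ['b.txt'], ['d.txt']]
```

Each inner list names the files that would be concatenated into one merged input. Sorting the files by decreasing size before the call gives the first-fit-decreasing variant, which usually packs more tightly; whether the paper sorts its input is not stated in the abstract.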

      • Published in

        HPDC '10: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
        June 2010
        911 pages
        ISBN:9781605589428
        DOI:10.1145/1851476

        Copyright © 2010 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 166 of 966 submissions, 17%
