ABSTRACT
We summarize the important overall issues affecting the use of clouds to support Data Science. We describe the mapping of different applications to HPCC and Cloud systems, and the architecture that supports data analytics interoperable between these systems.
Index Terms
- Large scale data analytics on clouds