Abstract
Current data-intensive scalable computing (DISC) systems, although scalable, achieve embarrassingly low rates of processing per node. We feel that current DISC systems have repeated a mistake of old high-performance systems: focusing on scalability without considering efficiency. This poor efficiency comes with issues in reliability, energy, and cost. As the gap between theoretical performance and what is actually achieved has become glaringly large, we feel there is a pressing need to rethink the design of future data-intensive computing systems and to consider carefully the direction of future research.