Abstract
So-called “big data” is increasingly present in many modern applications, where massive parallel processing is the main approach to achieving acceptable performance. However, as data sizes keep growing, even parallelism will meet its limits unless it is combined with other powerful processing techniques. In this paper we propose to combine parallelism with rewriting, that is, reusing previous results stored in a cache in order to perform new (parallel) computations. To this end, we introduce an abstract framework based on the lattice of partitions of the data set. Our main contributions are: (a) showing that our framework allows the rewriting of parallel computations; (b) deriving the basic principles of optimal cache management; and (c) showing that, in the case of structured data, our approach can leverage both structure and semantics in the data to improve performance.
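The core idea of rewriting over the partition lattice can be illustrated with a small sketch (this example is not from the paper; the data set, the sum aggregate, and the helper names are illustrative assumptions). An aggregate cached over a finer partition, e.g. grouping by (region, month), can be coarsened to answer a query over a coarser partition, e.g. grouping by region, without rescanning the raw data:

```python
# Hypothetical sketch of rewriting via the partition lattice:
# a cached aggregate over a finer partition is reused to answer
# a query over a coarser one, instead of rescanning the data.

from collections import defaultdict

data = [
    {"region": "east", "month": 1, "sales": 10},
    {"region": "east", "month": 2, "sales": 20},
    {"region": "west", "month": 1, "sales": 30},
]

def group_sum(rows, keys):
    """Sum 'sales' over the partition induced by the given keys."""
    out = defaultdict(int)
    for r in rows:
        out[tuple(r[k] for k in keys)] += r["sales"]
    return dict(out)

# Cache the aggregate computed over the finer partition (region, month).
cache = {("region", "month"): group_sum(data, ("region", "month"))}

def rewrite(target_keys):
    """Answer a grouping query, coarsening a cached finer result if one exists."""
    for cached_keys, cached_agg in cache.items():
        if set(target_keys) <= set(cached_keys):  # cached partition refines the target
            idx = [cached_keys.index(k) for k in target_keys]
            out = defaultdict(int)
            for group, total in cached_agg.items():
                out[tuple(group[i] for i in idx)] += total
            return dict(out)
    return group_sum(data, target_keys)  # cache miss: fall back to a full scan

print(rewrite(("region",)))  # {('east',): 30, ('west',): 30}
```

Because sum is distributive, merging the cached per-group totals gives the same answer as recomputing from scratch; in a parallel setting each cached entry would itself be the result of a prior (map-reduce-style) computation over the data partitions.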
© 2013 Springer-Verlag Berlin Heidelberg
Cite this paper
Spyratos, N., Sugibuchi, T. (2013). Parallelism and Rewriting for Big Data Processing. In: Tanaka, Y., Spyratos, N., Yoshida, T., Meghini, C. (eds) Information Search, Integration and Personalization. ISIP 2012. Communications in Computer and Information Science, vol 146. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40140-4_2
Print ISBN: 978-3-642-40139-8
Online ISBN: 978-3-642-40140-4