
Parallelism and Rewriting for Big Data Processing

  • Conference paper
Information Search, Integration and Personalization (ISIP 2012)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 146)

Abstract

So-called “big data” is increasingly present in modern applications, where massively parallel processing is the main approach to achieving acceptable performance. However, as data sizes keep growing, even parallelism will reach its limits unless it is combined with other powerful processing techniques. In this paper we propose combining parallelism with rewriting, that is, reusing previous results stored in a cache in order to perform new (parallel) computations. To this end, we introduce an abstract framework based on the lattice of partitions of the data set. Our main contributions are: (a) showing that our framework allows rewriting of parallel computations; (b) deriving the basic principles of optimal cache management; and (c) showing that, for structured data, our approach can leverage both structure and semantics in the data to improve performance.
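The rewriting idea in the abstract can be illustrated with a minimal sketch (not the authors' code; all names are hypothetical). Assuming a distributive aggregate such as sum, a result cached over a fine partition of the data set — say, blocks keyed by (region, year) — can be re-aggregated to answer a new query over any coarser partition in the lattice, such as grouping by region alone, without rescanning the raw data:

```python
from collections import defaultdict

# Hypothetical cached result: per-block sums for the partition
# induced by the grouping key (region, year).
cache = {
    ("north", 2011): 10, ("north", 2012): 15,
    ("south", 2011): 7,  ("south", 2012): 3,
}

def rewrite_coarser(cache, keep):
    """Re-aggregate cached block sums onto a coarser partition.

    `keep` lists the indices of the key components retained by the
    coarser grouping; every block of the fine partition is contained
    in exactly one block of the coarser one, so summing cached block
    values is a valid rewriting for a distributive aggregate.
    """
    out = defaultdict(int)
    for key, value in cache.items():
        coarse_key = tuple(key[i] for i in keep)
        out[coarse_key] += value
    return dict(out)

print(rewrite_coarser(cache, keep=(0,)))
# {('north',): 25, ('south',): 10}
```

Each cached block can also be re-aggregated in parallel and the partial results merged, which is how rewriting composes with parallel processing in the framework.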




Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Spyratos, N., Sugibuchi, T. (2013). Parallelism and Rewriting for Big Data Processing. In: Tanaka, Y., Spyratos, N., Yoshida, T., Meghini, C. (eds) Information Search, Integration and Personalization. ISIP 2012. Communications in Computer and Information Science, vol 146. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40140-4_2


  • DOI: https://doi.org/10.1007/978-3-642-40140-4_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40139-8

  • Online ISBN: 978-3-642-40140-4

  • eBook Packages: Computer Science, Computer Science (R0)
