A Query Processing Framework for Array-Based Computations

  • Conference paper
Database and Expert Systems Applications (DEXA 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9827)

Abstract

Current scientific applications must analyze enormous amounts of array data using complex mathematical data processing methods. This paper describes a distributed query processing framework for large-scale scientific data analysis that captures array-based computations as SQL-like queries, then optimizes and evaluates them using state-of-the-art parallel processing algorithms. Instead of providing a library of concrete distributed algorithms that implement certain matrix operations efficiently, we generalize these algorithms by making them parametric, so that the efficient implementations of the concrete algorithms also apply to their generic counterparts. By specifying matrix operations as generic algebraic operators, we are able to perform inter-operator optimizations, such as fusing matrix transpose with matrix multiplication, resulting in new instantiations of the generic algebraic operators, without having to introduce new efficient algorithms on the fly. We evaluate the effectiveness of our framework by measuring the performance improvement of matrix factorization when evaluated with inter-operator optimization.
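
The idea of a parametric operator that admits fused instantiations can be illustrated with a small single-machine sketch. It is not the paper's operator or its distributed implementation: the function `generic_join_aggregate`, its parameter names, and the representation of a sparse matrix as a list of `(row, column, value)` triples are all assumptions made for illustration. Under those assumptions, ordinary multiplication and multiplication by a transposed matrix are two instantiations of the same generic operator that differ only in the key functions passed in, so the fused form never materializes the transpose.

```python
from collections import defaultdict


def generic_join_aggregate(xs, ys, x_join, y_join, x_out, y_out,
                           combine, reduce_fn, zero):
    """Hypothetical generic operator (illustration only).

    Joins xs with ys on a join key, groups the joined pairs by an
    output key, and aggregates the combined values per group.
    """
    # Index the right input by its join key.
    index = defaultdict(list)
    for y in ys:
        index[y_join(y)].append(y)

    # Join, then aggregate per output key.
    out = defaultdict(lambda: zero)
    for x in xs:
        for y in index.get(x_join(x), []):
            key = (x_out(x), y_out(y))
            out[key] = reduce_fn(out[key], combine(x, y))
    return dict(out)


def multiply(a, b):
    """C[i,j] = sum over k of A[i,k] * B[k,j]."""
    return generic_join_aggregate(
        a, b,
        x_join=lambda x: x[1],   # column of A
        y_join=lambda y: y[0],   # row of B
        x_out=lambda x: x[0],    # row of A
        y_out=lambda y: y[1],    # column of B
        combine=lambda x, y: x[2] * y[2],
        reduce_fn=lambda s, v: s + v,
        zero=0,
    )


def multiply_by_transpose(a, b):
    """C[i,j] = sum over k of A[i,k] * B[j,k], i.e. A x B^T.

    Same generic operator; only the key functions on B are swapped,
    so B^T is never built.
    """
    return generic_join_aggregate(
        a, b,
        x_join=lambda x: x[1],   # column of A
        y_join=lambda y: y[1],   # column of B (row of B^T)
        x_out=lambda x: x[0],    # row of A
        y_out=lambda y: y[0],    # row of B (column of B^T)
        combine=lambda x, y: x[2] * y[2],
        reduce_fn=lambda s, v: s + v,
        zero=0,
    )


def transpose(m):
    """Explicit transpose, used only to check the fused version."""
    return [(j, i, v) for (i, j, v) in m]


# A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]] as (row, col, value) triples.
A = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
B = [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)]

unfused = multiply(A, transpose(B))
fused = multiply_by_transpose(A, B)
```

Here `fused` and `unfused` hold the same entries, but the fused call makes one pass through the generic operator instead of a transpose followed by a multiplication.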

Acknowledgments

This work is supported in part by the National Science Foundation under grant CCF-1117369. Our performance evaluations were conducted on the Chameleon cloud computing infrastructure (www.chameleoncloud.org), supported by NSF.

Author information

Corresponding author

Correspondence to Leonidas Fegaras.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Fegaras, L. (2016). A Query Processing Framework for Array-Based Computations. In: Hartmann, S., Ma, H. (eds) Database and Expert Systems Applications. DEXA 2016. Lecture Notes in Computer Science, vol 9827. Springer, Cham. https://doi.org/10.1007/978-3-319-44403-1_15

  • DOI: https://doi.org/10.1007/978-3-319-44403-1_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-44402-4

  • Online ISBN: 978-3-319-44403-1

  • eBook Packages: Computer Science, Computer Science (R0)
