Skip to main content

Streaming ETL in Polystore Era

  • Conference paper
  • First Online:
Book cover Algorithms and Architectures for Parallel Processing (ICA3PP 2018)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 11336))

  • 1609 Accesses

Abstract

In today’s digital environment, businesses have to access, store and analyze in a real time fashion vast amounts of data issued from streaming graph-structure data sources. To meet these requirements, companies owning the data warehouse (\(\mathcal {DW}\)) technology have to combine hardware and software solutions to reduce the time latency between a \(\mathcal {DW}\) and its data sources. The explosion of advanced hardware deployment platforms such as polystore represents an opportunity as pointed in recent studies. But, deploying a graph-structure \(\mathcal {DW}\) over a polystore is not a simple task, since it requires two important phases which are data partitioning and allocation. We claim that these phases have to be connected to the ETL (Extract, Transform, Load) phase, especially its loading process. This connection questions the initial schedule of ETL and deployment processes. In this paper, we present a new approach that connects ETL and deployment processes and challenges their traditional scheduling to meet real time analysis requirements.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    In this paper, we use fragmentation and partitioning interchangeably.

  2. 2.

    https://www.w3.org/RDF/.

  3. 3.

    www.w3.org/TR/rdf-SPARQL-query/.

  4. 4.

    http://swat.cse.lehigh.edu/projects/lubm/.

  5. 5.

    http://glaros.dtc.umn.edu/gkhome/metis/metis/overview.

  6. 6.

    http://swat.cse.lehigh.edu/projects/lubm/.

  7. 7.

    http://www.cytoscape.org/.

References

  1. Berkani, N., Bellatreche, L.: A variety-sensitive ETL processes. In: Benslimane, D., Damiani, E., Grosky, W.I., Hameurlain, A., Sheth, A., Wagner, R.R. (eds.) DEXA 2017. LNCS, vol. 10439, pp. 201–216. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-64471-4_17

    Chapter  Google Scholar 

  2. Berkani, N., Bellatreche, L., Benatallah, B.: A value-added approach to design BI applications. In: Madria, S., Hara, T. (eds.) DaWaK 2016. LNCS, vol. 9829, pp. 361–375. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43946-4_24

    Chapter  Google Scholar 

  3. Berkani, N., Bellatreche, L., Ordonez, C.: ETL-aware materialized view selection in semantic data streamwarehouses. In: RCIS. IEEE (2018)

    Google Scholar 

  4. Bondiombouy, C., Valduriez, P.: Query processing in multistore systems: an overview. IJCC 5(4), 309–346 (2016)

    Article  Google Scholar 

  5. Bornea, M.A., Deligiannakis, A., Kotidis, Y., Vassalos, V.: Semi-streamed index join for near-real time execution of ETL transformations. In: ICDE, pp. 159–170 (2011)

    Google Scholar 

  6. Boukorca, A., Bellatreche, L., Cuzzocrea, A.: SLEMAS: an approach for selecting MV under query scheduling constraints. In: COMAD, pp. 66–73 (2014)

    Google Scholar 

  7. Duggan, J., et al.: The bigdawg polystore system. ACM Sigmod Rec. 44(2), 11–16 (2015)

    Article  Google Scholar 

  8. Galárraga, L., Hose, K., Schenkel, R.: Partout: a distributed engine for efficient RDF processing. In: WWW, pp. 267–268. ACM (2014)

    Google Scholar 

  9. Hose, K., Schenkel, R.: WARP: workload-aware replication and partitioning for RDF. In: ICDE Workshops, pp. 1–6 (2013)

    Google Scholar 

  10. Huang, J., Abadi, D.J., Ren, K.: Scalable SPARQL querying of large RDF graphs. PVLDB 4(11), 1123–1134 (2011)

    Google Scholar 

  11. Jörg, T., Deßloch, S.: Towards generating ETL processes for incremental loading. In: IDEAS, pp. 101–110 (2008)

    Google Scholar 

  12. Jörg, T., Dessloch, S.: Formalizing ETL jobs for incremental loading of data warehouses. In: BTW, pp. 327–346 (2009)

    Google Scholar 

  13. Jörg, T., Dessloch, S.: Near real-time data warehousing using state-of-the-art ETL tools. In: Castellanos, M., Dayal, U., Miller, R.J. (eds.) BIRTE 2009. LNBIP, vol. 41, pp. 100–117. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-14559-9_7

    Chapter  Google Scholar 

  14. Karakasidis, A., Vassiliadis, P., Pitoura, E.: ETL queues for active data warehousing. In: IQIS, pp. 28–39 (2005)

    Google Scholar 

  15. Karypis, G., Kumar, V.: Multilevel k-way hypergraph partitioning. In: DAC, pp. 343–348 (1999)

    Google Scholar 

  16. Le, W., Kementsietsidis, A., Duan, S., Li, F.: Scalable multi-query optimization for SPARQL. In: ICDE, pp. 666–677 (2012)

    Google Scholar 

  17. Lee, K., Liu, L.: Scaling queries over big RDF graphs with semantic hash partitioning. Proc. VLDB Endow. 6(14), 1894–1905 (2013)

    Article  Google Scholar 

  18. Mayer, R., Mayer, C., Tariq, M.A., Rothermel, K.: Graphcep: real-time data analytics using parallel complex event and graph processing. In: DEBS, pp. 309–316 (2016)

    Google Scholar 

  19. Meehan, J., Aslantas, C., Zdonik, S., Tatbul, N., Du, J.: Data ingestion for the connected world. In: CIDR (2017)

    Google Scholar 

  20. Ordonez, C., Johnson, T., Urbanek, S., Shkapenyuk, V., Srivastava, D.: Integrating the R language runtime system with a data stream warehouse. In: Benslimane, D., Damiani, E., Grosky, W.I., Hameurlain, A., Sheth, A., Wagner, R.R. (eds.) DEXA 2017. LNCS, vol. 10439, pp. 217–231. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-64471-4_18

    Chapter  Google Scholar 

  21. Ozsu, M.T., Valduriez, P.: Principles of Distributed Database Systems, 3rd edn. Springer, Heidelberg (2011). https://doi.org/10.1007/978-1-4419-8834-8

    Book  Google Scholar 

  22. Peng, P., Zou, L., Chen, L., Zhao, D.: Query workload-based RDF graph fragmentation and allocation. In: EDBT, pp. 377–388 (2016)

    Google Scholar 

  23. Ram, P., Do, L.: Extracting delta for incremental data warehouse maintenance. In: ICDE, pp. 220–229 (2000)

    Google Scholar 

  24. Simitsis, A., Wilkinson, K., Dayal, U., Castellanos, M.: Optimizing ETL workflows for fault-tolerance. In: ICDE, pp. 385–396 (2010)

    Google Scholar 

  25. Skoutas, D., Simitsis, A.: Ontology-based conceptual design of ETL processes for both structured and semi-structured data. Seman. Web 3(4), 1–24 (2007)

    Google Scholar 

  26. Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: a core of semantic knowledge. In: WWW, pp. 697–706 (2007)

    Google Scholar 

  27. Vassiliadis, P., Simitsis, A.: Near real time ETL. In: Vassiliadis, P., Simitsis, A., et al. (eds.) New Trends in Data Warehousing and Data Analysis, pp. 1–31. Springer, Heidelberg (2009). https://doi.org/10.1007/978-0-387-87431-9

    Chapter  Google Scholar 

  28. Waas, F., Wrembel, R., Freudenreich, T., Thiele, M., Koncilia, C., Furtado, P.: On-demand ELT architecture for right-time BI: extending the vision. IJDWM 9(2), 21–38 (2013)

    Google Scholar 

  29. Wu, B., Zhou, Y., Yuan, P., Liu, L., Jin, H.: Scalable SPARQL querying using path partitioning. In: ICDE, pp. 795–806 (2015)

    Google Scholar 

  30. Zeng, K., Yang, J., Wang, H., Shao, B., Wang, Z.: A distributed graph engine for web scale RDF data. PVLDB 6(4), 265–276 (2013)

    Google Scholar 

  31. Zhu, M., Risch, T.: Querying combined cloud-based and relational databases. In: Cloud and Service Computing (CSC), pp. 330–335. IEEE (2011)

    Google Scholar 

  32. Zhu, Y., An, L., Liu, S.: Data updating and query in real-time data warehouse system. In: CSSE, vol. 5, pp. 1295–1297 (2008)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Ladjel Bellatreche .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Berkani, N., Bellatreche, L. (2018). Streaming ETL in Polystore Era. In: Vaidya, J., Li, J. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2018. Lecture Notes in Computer Science(), vol 11336. Springer, Cham. https://doi.org/10.1007/978-3-030-05057-3_42

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-05057-3_42

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-05056-6

  • Online ISBN: 978-3-030-05057-3

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics