Optimization of data-intensive workflows in stream-based data processing models

Ahmad, Saima Gulzar; Liew, Chee Sun; Rafique, M. Mustafa; Munir, Ehsan Ullah

doi:10.1007/s11227-017-1991-0

Published: 08 March 2017

The Journal of Supercomputing, Volume 73, pages 3901–3923 (2017)

Abstract

Stream computing applications require minimum latency and high throughput to process real-time data efficiently. Typically, data-intensive applications that must move large datasets across execution nodes have low-latency requirements. In this paper, a stream-based data processing model is adopted to develop an algorithm for optimally partitioning the input data such that the inter-partition data flow remains minimal. The proposed algorithm improves the execution of data-intensive workflows in heterogeneous computing environments by partitioning the workflow and mapping each partition onto the available heterogeneous resources that offer the minimum execution time. Minimizing data movement between the partitions reduces the latency, which can be reduced further by applying advanced data parallelism techniques. We apply a data parallelism technique to the bottleneck (most compute-intensive) task in each partition, which significantly reduces the latency. We study the effectiveness and performance of the proposed approach using synthesized workflows and real-world applications, such as Montage and Cybershake. Our evaluation shows that the proposed algorithm provides schedules with approximately 12% lower latency and nearly 17% higher throughput than existing state-of-the-art algorithms.
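The three steps named in the abstract (partitioning with low inter-partition data flow, mapping each partition to the fastest resource, and locating the bottleneck task) can be illustrated with a minimal Python sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes a simple greedy heuristic that keeps the heaviest data transfers inside a partition, and it ignores dependency ordering between partitions and resource contention. All function names, the `max_size` limit, and the data layout are hypothetical.

```python
from collections import defaultdict


def partition_by_heavy_edges(tasks, edges, max_size):
    """Greedy sketch (an assumption, not the paper's method): merge the
    endpoints of the heaviest edges first, so that large transfers stay
    inside a partition. `edges` maps (producer, consumer) -> data volume."""
    parent = {t: t for t in tasks}
    size = {t: 1 for t in tasks}

    def find(t):
        # Union-find lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for (u, v), _volume in sorted(edges.items(), key=lambda kv: -kv[1]):
        ru, rv = find(u), find(v)
        if ru != rv and size[ru] + size[rv] <= max_size:
            parent[rv] = ru
            size[ru] += size[rv]

    groups = defaultdict(list)
    for t in tasks:
        groups[find(t)].append(t)
    return list(groups.values())


def inter_partition_flow(partitions, edges):
    """Total data volume on edges that cross a partition boundary."""
    owner = {t: i for i, part in enumerate(partitions) for t in part}
    return sum(vol for (u, v), vol in edges.items() if owner[u] != owner[v])


def map_and_find_bottlenecks(partitions, exec_time):
    """For each partition, pick the resource with the smallest summed
    execution time, then report the most compute-intensive task on it.
    `exec_time[task][resource]` is a hypothetical cost table."""
    plan = []
    for part in partitions:
        resources = exec_time[part[0]].keys()
        best = min(resources, key=lambda r: sum(exec_time[t][r] for t in part))
        bottleneck = max(part, key=lambda t: exec_time[t][best])
        plan.append((part, best, bottleneck))
    return plan
```

For a four-task chain `a → b → c → d` with edge volumes 100, 5, and 80 and `max_size=2`, the sketch groups `a` with `b` and `c` with `d`, leaving only the 5-unit edge between partitions. The bottleneck task reported for each partition is the candidate to which a data parallelism technique would then be applied.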

Acknowledgements

The work presented in this paper is partly supported by the Ministry of Education Malaysia (FRGS FP051-2013A and UMRG RP001F-13ICT).

Author information

Authors and Affiliations

Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia
Saima Gulzar Ahmad & Chee Sun Liew
IBM Research, Mulhuddart, Ireland
M. Mustafa Rafique
Department of Computer Science, COMSATS Institute of Information Technology, Wah Cantt, Pakistan
Ehsan Ullah Munir

Corresponding author

Correspondence to Chee Sun Liew.

About this article

Cite this article

Ahmad, S.G., Liew, C.S., Rafique, M.M. et al. Optimization of data-intensive workflows in stream-based data processing models. J Supercomput 73, 3901–3923 (2017). https://doi.org/10.1007/s11227-017-1991-0

Issue Date: September 2017
DOI: https://doi.org/10.1007/s11227-017-1991-0
