Abstract
Stream computing applications require minimal latency and high throughput to process real-time data efficiently. Data-intensive applications that move large datasets across execution nodes typically require low latency as well. In this paper, a stream-based data processing model is adopted to develop an algorithm that partitions the input data so that the inter-partition data flow remains minimal. The proposed algorithm improves the execution of data-intensive workflows in heterogeneous computing environments by partitioning the workflow and mapping each partition onto the available heterogeneous resource that offers the minimum execution time. Minimal data movement between partitions reduces latency, which can be reduced further by applying advanced data parallelism techniques. We apply a data parallelism technique to the bottleneck (most compute-intensive) task in each partition, which significantly reduces the latency. We study the effectiveness and performance of the proposed approach using synthesized workflows and real-world applications, such as Montage and Cybershake. Our evaluation shows that the proposed algorithm produces schedules with approximately 12% lower latency and nearly 17% higher throughput than existing state-of-the-art algorithms.
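The three steps named in the abstract (partition with minimal inter-partition data flow, map each partition to the resource with the minimum execution time, then identify the bottleneck task as the candidate for data parallelism) can be illustrated with a toy sketch. This is not the paper's algorithm: the four-task chain, the data volumes, the execution times, the brute-force two-way split, and every function name below are assumptions made purely for illustration.

```python
# Hypothetical toy workflow: a chain t1 -> t2 -> t3 -> t4.
# Edge weights are assumed data volumes moved between tasks.
edges = {("t1", "t2"): 8, ("t2", "t3"): 1, ("t3", "t4"): 6}

# Assumed execution time of each task on each heterogeneous resource.
exec_time = {
    "t1": {"r1": 4, "r2": 6},
    "t2": {"r1": 3, "r2": 5},
    "t3": {"r1": 7, "r2": 2},
    "t4": {"r1": 9, "r2": 3},
}


def cut_weight(assignment):
    """Total data volume on edges whose endpoints are in different partitions."""
    return sum(w for (u, v), w in edges.items() if assignment[u] != assignment[v])


def two_way_split(order):
    """Try every split point of the chain and keep the one with the smallest cut."""
    candidates = []
    for i in range(1, len(order)):
        assignment = {t: (0 if k < i else 1) for k, t in enumerate(order)}
        candidates.append((cut_weight(assignment), i))
    _, best_i = min(candidates)
    return order[:best_i], order[best_i:]


def best_resource(tasks):
    """Resource with the smallest summed execution time for one partition."""
    resources = exec_time[tasks[0]].keys()
    return min(resources, key=lambda r: sum(exec_time[t][r] for t in tasks))


def bottleneck(tasks, resource):
    """Most compute-intensive task of a partition on its chosen resource."""
    return max(tasks, key=lambda t: exec_time[t][resource])


def plan(order):
    """Return (partition, resource, bottleneck task) for each partition."""
    result = []
    for part in two_way_split(order):
        resource = best_resource(part)
        result.append((part, resource, bottleneck(part, resource)))
    return result


if __name__ == "__main__":
    for part, resource, hot in plan(["t1", "t2", "t3", "t4"]):
        print(part, "->", resource, "| bottleneck:", hot)
```

In this toy instance the split falls on the lightest edge (t2 to t3), so only one unit of data crosses partitions. A real partitioner would handle general task graphs and more than two partitions, which an exhaustive search over split points does not.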
























Acknowledgements
The work presented in this paper is partly supported by the Ministry of Education Malaysia (FRGS FP051-2013A and UMRG RP001F-13ICT).
Cite this article
Ahmad, S.G., Liew, C.S., Rafique, M.M. et al. Optimization of data-intensive workflows in stream-based data processing models. J Supercomput 73, 3901–3923 (2017). https://doi.org/10.1007/s11227-017-1991-0