Skip to main content
Log in

Use case-based evaluation of workflow optimization strategy in real-time computation system

  • Published:
The Journal of Supercomputing Aims and scope Submit manuscript

Abstract

With the start of big data era, data stream computing has emerged as a well-known approach to optimize data-intensive workflows. Apache STORM is an open-source real-time distributed computation system for processing data streams and has been opted by famous organizations such as Twitter, Yahoo, Alibaba, Baidu, Groupon. The workflows are implemented as topologies in STORM. The main aspect that controls the execution performance of a workflow in STORM is the strategy of scheduling the topology components (spout and bolts). In this paper, we evaluate and analyze the performance of our algorithm Partition-based Data-intensive Workflow optimization Algorithm (PDWA) in Apache STORM using a use case workflow, EURExpressII. It is a real-world application-based workflow that builds a transcriptome-wide atlas of gene expression for the developing mouse embryo established by ribonucleic acid (RNA) in situ hybridization. Our proposed algorithm, PDWA, partitions the application task graph so that the data movement between partitions is minimum. Each partition is then mapped on one machine for the execution of tasks of that partition. It provides minimum execution time for that particular partition. Partial task duplication is also part of this algorithm that enhances the performance. A STORM-based computing cluster is developed in OpenStack cloud which is used as a computing environment. The performance of PDWA-based optimizer is evaluated with the data sets of different sizes. The achieved results show that PDWA performs with 21% improved average execution time for different sizes of data sets and varying execution nodes. In addition, the comparative results show that on average the efficiency of PDWA is 20.4% higher as compared to STORM default scheduler (SDS).

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7

Similar content being viewed by others

References

  1. Hey T, Tansley S, Tolle K (2009) The fourth paradigm: data-intensive scientific discovery. Microsoft Research. https://www.amazon.com/Fourth-Paradigm-Data-Intensive-Scientific-Discovery

  2. Bhadani A, Jothimani D. Big data: challenges, opportunities and realities. http://arxiv.org/abs/1705.04928v1

  3. Kune R, Konugurthi PK, Agarwal A, Chillarige RR, Buyya R (2015) The anatomy of big data computing. Softw Pract Exp 46(1):79–105. https://doi.org/10.1002/spe.2374

    Article  Google Scholar 

  4. Umasri ML, Shyamalagowri D, Kumar S (2014) Aspects and infrastructure of big data. Int J Adv Res Comput Sci Softw Eng 4(1):609–612

    Google Scholar 

  5. Deelman E, Gannon D, Shields M, Taylor I (2009) Workflows and e-science: an overview of workflow system features and capabilities. Fut Gener Comput Syst 25(5):528–540

    Article  Google Scholar 

  6. Laszewski GV, Hategan M, Kodeboyina D (2007) Workflows for e-Science: scientific workflows for grids. Springer, London, pp 340–356

    Book  Google Scholar 

  7. Liew CS, van Hemert JI, Atkinson MP, Han L (2010) Towards optimising distributed data streaming graphs using parallel streams. In: High-Performance Parallel and Distributed Computing, pp 725–736

  8. Dayarathna M, Perera S (2018) Recent advancements in event processing. ACM Comput Surv 51(2):1–36. https://doi.org/10.1145/3170432

    Article  Google Scholar 

  9. Ahmad SG, Liew CS, Rafique MM, Munir EU, Khan SU (2014) Data-intensive workflow optimization based on application task graph partitioning in heterogeneous computing systems. In: Fourth International Conference on Big Data and Cloud Computing, vol 123, pp 129–136

  10. Han L, van Hemert JI, Baldock RA (2011) Automatically identifying and annotating mouse embryo gene expression patterns. Bioinformatics 27(8):1101–1107

    Article  Google Scholar 

  11. Vydyanathan N, Catalyurek U, Kurc T, Sadayappan P, Saltz J (2011) Optimizing latency and throughput of application workflows on clusters. Parallel Comput 37:694–712

    Article  MathSciNet  Google Scholar 

  12. Guirado F, Roig C, Ripoll A (2013) Enhancing throughput for streaming applications running on cluster systems. J Parallel Distrib Comput 73(8):1092–1105

    Article  Google Scholar 

  13. Gu Y, Shenq S-L, Wu Q, Dasgupta D (2012) On a multi-objective evolutionary algorithm for optimizing end-to-end performance of scientific workflows in distributed environments. In: Proceedings of the 45th Annual Simulation Symposium

  14. Agrawal K, Benoit A, Dufosse F, Robert Y (2009) Mapping filtering streaming applications with communication costs. Technical report, Massachusetts Institute of Technology, USA

  15. Gu Y, Wu Q (2010) Maximizing workflow throughput for streaming applications in distributed environments. In: 19th International Conference on Computer Communications and Networks (ICCCN)

  16. Cao F, Zhu MM, Ding D (2014) Distributed workflow scheduling under throughput and budget constraints in grid environments. In: Lecture notes in computer science, Job scheduling strategies for parallel processing. Springer, Berlin, pp 62–80

  17. Agarwalla B, Ahmed N, Hilley D, Ramachandran U (2007) Streamline: a scheduling heuristic for streaming applications on the grid. Multimed Syst 13:69–85

    Article  Google Scholar 

  18. Foster I, Kesselman C (1997) Globus: a metacomputing infrastructure toolkit. Int J Supercomput Appl High Perform Comput 11:115–128

    Google Scholar 

  19. Aniello L, Baldoni R, Querzoni L (2013) Adaptive online scheduling in storm. In: 7th ACM International Conference on Distributed Event-Based Systems, pp 207–218

  20. Sun D, Zhang G, Yang S, Zheng W, Khan SU, Li K (2015) Re-Stream: real-time and energy-efficient resource scheduling in big data stream computing environments. Inf Sci 319:92–112

    Article  MathSciNet  Google Scholar 

  21. Rychly M, Skdo P, Smrz P (2014) Scheduling decisions in stream processing on heterogeneous clusters. In: International Conference on Complex, Intelligent and Software Intensive Systems (CISIS), pp 614–619

  22. Liu X, Buyya R (2017) D-storm: dynamic resource-efficient scheduling of stream processing applications. In: 2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS), IEEE. https://doi.org/10.1109/icpads.2017.00070

  23. Wang J, Hang S, Liu J (2016) Multi-level scheduling algorithm based on storm. KSII Trans Internet Inf Syst. https://doi.org/10.3837/tiis.2016.03.008

    Article  Google Scholar 

  24. Peng B, Hosseini M, Hong Z, Farivar R, Campbell R (2019) R-storm: resource-aware scheduling in storm. In: Proceedings of the 16th Annual Middleware Conference on Middleware. ACM Press. https://doi.org/10.1145/2814576.2814808

  25. Sun LC (2012) Optimisation of the enactment of fine-grained distributed data-intensive workflows. The University of Edinburgh, Edinburgh

    Google Scholar 

  26. Smirnov P, Melnik M, Nasonov D (2017) Performance-aware scheduling of streaming applications using genetic algorithm. In: Proceedings of the International Conference on Computational Science, ICCS 12–14 June 2017. Zurich, Switzerland

  27. Sun D, Gao S, Liu X, Li F, Zheng X, Buyya R (2019) State and runtime-aware scheduling in elastic stream computing systems. Fut Gener Comput Syst (FGCS) 97:194–209

    Article  Google Scholar 

  28. Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Hum Genet 7(2):179–188

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Hikmat Ullah Khan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Gulzar Ahmad, S., Ullah Khan, H., Ijaz, S. et al. Use case-based evaluation of workflow optimization strategy in real-time computation system. J Supercomput 76, 708–725 (2020). https://doi.org/10.1007/s11227-019-03060-9

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11227-019-03060-9

Keywords

Navigation