
Bigflow: A General Optimization Layer for Distributed Computing Frameworks

  • Regular Paper
  • Published in: Journal of Computer Science and Technology

Abstract

As data volumes grow rapidly, distributed computation is widely employed in data centers to provide cheap and efficient methods for processing large-scale parallel datasets. Various computation models have been proposed to improve the abstraction of distributed datasets and hide the details of parallelism. However, most of them follow a single-layer partitioning method, which makes it difficult for developers to express multi-level partitioning operations succinctly. To overcome this problem, we present the NDD (Nested Distributed Dataset) data model, a more compact and expressive extension of Spark RDD (Resilient Distributed Dataset) that removes the burden of manually writing the logic for multi-level partitioning. Based on the NDD model, we develop an open-source framework called Bigflow, which serves as an optimization layer over the computation engines of the most widely used processing frameworks. With the help of Bigflow, several advanced optimization techniques, which would otherwise have to be applied manually by experienced programmers, are enabled automatically in a distributed data processing job. Currently, Bigflow processes about 3 PB of data daily in the data centers of Baidu. According to user feedback, it significantly reduces code length and improves performance compared with the intuitive programming style.
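To make the contrast between single-layer and multi-level partitioning concrete, the sketch below shows how a two-level aggregation is typically written by hand against Spark's RDD API, i.e., the kind of composite-key boilerplate the NDD model is meant to absorb. This is an illustrative example only, not Bigflow's actual API; the dataset, key names, and aggregation are hypothetical.

```python
# Illustrative sketch only: manual two-level (nested) grouping written
# directly against Spark's RDD API. Bigflow's NDD model aims to express
# the same nesting declaratively instead of via hand-built composite keys.
from pyspark import SparkContext

sc = SparkContext(appName="nested-grouping-sketch")

# Hypothetical (website, url, clicks) records.
records = sc.parallelize([
    ("site-a", "/home", 3),
    ("site-a", "/home", 2),
    ("site-a", "/docs", 5),
    ("site-b", "/home", 1),
])

# The inner grouping level is emulated with a composite (website, url) key ...
clicks_per_url = (records
    .map(lambda r: ((r[0], r[1]), r[2]))
    .reduceByKey(lambda a, b: a + b))

# ... and the outer level must be re-derived from that composite key by hand.
top_url_per_site = (clicks_per_url
    .map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))   # key = website
    .reduceByKey(lambda a, b: a if a[1] >= b[1] else b))

print(top_url_per_site.collect())
sc.stop()
```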



Author information


Corresponding author

Correspondence to Xiao-Yang Wang.

Electronic supplementary material

ESM 1 (PDF 81 kb)


About this article


Cite this article

Zhang, YC., Wang, XY., Wang, C. et al. Bigflow: A General Optimization Layer for Distributed Computing Frameworks. J. Comput. Sci. Technol. 35, 453–467 (2020). https://doi.org/10.1007/s11390-020-9702-3


