ABSTRACT
In recent years, the emergence of Big Data (terabytes or petabytes) and Big Models (tens of billions of parameters) has created an ever-increasing need, in both academia and industry, to parallelize machine learning (ML) algorithms. Although existing distributed computing systems such as Hadoop and Spark can parallelize ML algorithms, they provide only synchronous and coarse-grained operators (e.g., Map, Reduce, and Join), which may hinder developers from implementing more efficient algorithms. This motivated us to design a universal distributed platform, termed KunPeng, that combines distributed systems with parallel optimization algorithms to handle the complexities of large-scale ML. Specifically, KunPeng not only encapsulates data/model parallelism, load balancing, model synchronization, sparse representation, and industrial-grade fault tolerance, but also provides an easy-to-use interface that lets users focus on the core ML logic. Empirical results on terabytes of real data with billions of samples and features demonstrate that this design brings compelling performance improvements for ML programs ranging from the Follow-the-Regularized-Leader Proximal (FTRL-Proximal) algorithm to Sparse Logistic Regression and Multiple Additive Regression Trees. Furthermore, KunPeng performs well in several real-world applications, including Alibaba's Double 11 Online Shopping Festival and Ant Financial's transaction risk estimation.
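To make concrete the kind of sparse online learning the abstract refers to, below is a minimal single-machine sketch of the per-coordinate FTRL-Proximal update for logistic loss (McMahan et al., 2013). This is an illustration only: the class and parameter names are our own, and KunPeng's distributed parameter-server machinery (sharded state, asynchronous pulls/pushes) is entirely omitted.

```python
import math

class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression (sketch).

    Keeps z (shifted gradient sum) and n (sum of squared gradients)
    per feature; weights are recomputed lazily from z and n, so
    features whose |z| stays below the L1 threshold remain exactly
    zero -- the sparsity that makes the model cheap to store.
    """

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}   # feature index -> z_i
        self.n = {}   # feature index -> n_i

    def weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 threshold: feature stays out of the model
        n = self.n.get(i, 0.0)
        return -(z - math.copysign(self.l1, z)) / (
            (self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, x):
        # x is a sparse example: {feature index: value}
        s = sum(self.weight(i) * v for i, v in x.items())
        return 1.0 / (1.0 + math.exp(-max(min(s, 35.0), -35.0)))

    def update(self, x, y):
        # y in {0, 1}; gradient of the log loss is (p - y) * x_i
        p = self.predict(x)
        for i, v in x.items():
            g = (p - y) * v
            n_old = self.n.get(i, 0.0)
            n_new = n_old + g * g
            # sigma implements the per-coordinate learning-rate schedule
            sigma = (math.sqrt(n_new) - math.sqrt(n_old)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self.weight(i)
            self.n[i] = n_new
```

In a parameter-server deployment, the `z` and `n` dictionaries would be sharded across server nodes, while workers compute gradients on data partitions and push them asynchronously; the update rule itself is unchanged.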
Index Terms
- KunPeng: Parameter Server based Distributed Learning Systems and Its Applications in Alibaba and Ant Financial