ABSTRACT
A number of online machine learning techniques based on tree models have been studied to meet today's requirement of quickly processing large-scale data sets. We present a design pattern for incremental tree data processing that gradually constructs a tree model in memory on demand. Our approach adopts the actor model so that it can make use of multiple cores and distributed computers without extensive rewriting of the algorithm code. The pattern defines each node in the tree as an actor, the unit of asynchronous processing, and each data instance flows between actor nodes as a message. We study two concrete machine learning algorithms: VFDT, which grows a decision tree top-down, and BIRCH, which grows a hierarchical clustering tree bottom-up. To support VFDT, we propose an extension mechanism that replicates root nodes to address the bottleneck at the entry point of inputs. To support BIRCH, we split the recursive construction process into asynchronous steps and correct the target node by traversing extra horizontal links between sibling nodes. We carried out machine learning tasks with our implementation on top of Akka Java and confirmed reasonable performance on large-scale data sets.
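The core idea of the pattern, a tree node as an actor and a data instance as a message, can be illustrated with a minimal sketch. This is not the paper's Akka Java implementation: it assumes Python's asyncio, a fixed three-node tree, and a single numeric threshold test at the root. The names `NodeActor`, `tell`, `route_all`, and `count_by_leaf` are hypothetical and chosen only for this illustration.

```python
import asyncio


class NodeActor:
    """A tree node that owns a mailbox and handles one message at a time."""

    def __init__(self, name, threshold=None, left=None, right=None):
        self.name = name
        self.threshold = threshold  # None marks a leaf (assumption of this sketch)
        self.left = left
        self.right = right
        self.count = 0              # node-local state, touched only by this actor
        self.mailbox = asyncio.Queue()

    def tell(self, value):
        """Asynchronous send: enqueue the message and return immediately."""
        self.mailbox.put_nowait(value)

    async def run(self):
        """Message loop: a leaf records the instance, an inner node forwards it."""
        while True:
            value = await self.mailbox.get()
            if self.threshold is None:
                self.count += 1
            else:
                child = self.left if value < self.threshold else self.right
                child.tell(value)
            self.mailbox.task_done()


async def route_all(values, threshold=5.0):
    # Build the tree inside the running event loop.
    left, right = NodeActor("left"), NodeActor("right")
    root = NodeActor("root", threshold, left, right)
    nodes = [root, left, right]     # parents are listed before their children
    tasks = [asyncio.create_task(node.run()) for node in nodes]

    for value in values:
        root.tell(value)            # every instance enters at the root

    for node in nodes:              # drain mailboxes top-down
        await node.mailbox.join()
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return {leaf.name: leaf.count for leaf in (left, right)}


def count_by_leaf(values, threshold=5.0):
    """Route each value through the actor tree and count arrivals per leaf."""
    return asyncio.run(route_all(values, threshold))
```

Because every instance enters through `root.tell`, the root's mailbox is the single entry point in this sketch, which is the kind of bottleneck the abstract's root-replication mechanism for VFDT is meant to address.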
Index Terms
- Actor-based incremental tree data processing for large-scale machine learning applications