DOI: 10.1145/3358499.3361220
research-article

Actor-based incremental tree data processing for large-scale machine learning applications

Published: 22 October 2019

ABSTRACT

A number of online machine learning techniques based on tree models have been studied to meet today's need to process large-scale data sets quickly. We present a design pattern for incremental tree data processing that gradually constructs a tree model in memory, on demand. Our approach adopts the actor model to exploit multiple cores and distributed computers without extensively rewriting the algorithms' code. The pattern defines each node in the tree as an actor, the unit of asynchronous processing, and each data instance flows between actor nodes as a message. We study two concrete machine learning algorithms: VFDT, which grows a decision tree top-down, and BIRCH, which grows a hierarchical clustering bottom-up. To support VFDT, we propose an extension mechanism that replicates root nodes to relieve the bottleneck at the point where inputs enter the tree. To support BIRCH, we split the recursive construction process into asynchronous steps and correct the target node by traversing extra horizontal links between sibling nodes. We carried out machine learning tasks with our implementation on top of Akka Java and confirmed reasonable performance on large-scale data sets.
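The core of the pattern, a tree node as an actor and a data instance as a message, can be illustrated with a small sketch. This is not the authors' Akka Java implementation: it uses plain Python `asyncio`, and every name in it (`NodeActor`, `tell`, `walk`, `build`, `split_size`) is hypothetical. The split rule is also a deliberate simplification, a fixed instance count with the mean as threshold, rather than the statistical test VFDT uses. It only shows how a tree can grow top-down, on demand, while instances flow between nodes through mailboxes.

```python
import asyncio
from statistics import mean


class NodeActor:
    """Tree node as an actor: private state, one mailbox, one message at a time."""

    def __init__(self, split_size, acks):
        self.split_size = split_size    # hypothetical split rule: a fixed count
        self.acks = acks                # queue used to report absorbed instances
        self.mailbox = asyncio.Queue()
        self.values = []                # instances absorbed while this node was a leaf
        self.threshold = None           # set once the node becomes internal
        self.left = self.right = None
        self.task = asyncio.create_task(self._run())

    def tell(self, value):
        """Asynchronous send: enqueue the message and return immediately."""
        self.mailbox.put_nowait(value)

    async def _run(self):
        while True:
            value = await self.mailbox.get()
            if self.threshold is not None:
                # Internal node: forward the instance to a child actor.
                child = self.left if value <= self.threshold else self.right
                child.tell(value)
                continue
            # Leaf: absorb the instance, then grow the tree on demand.
            self.values.append(value)
            self.acks.put_nowait(value)
            if (len(self.values) >= self.split_size
                    and min(self.values) < max(self.values)):
                self.threshold = mean(self.values)
                self.left = NodeActor(self.split_size, self.acks)
                self.right = NodeActor(self.split_size, self.acks)


def walk(node):
    """Yield every node of the tree, parents before children."""
    if node is not None:
        yield node
        yield from walk(node.left)
        yield from walk(node.right)


async def build(values, split_size=4):
    """Feed all instances to the root and wait until each reaches a leaf."""
    acks = asyncio.Queue()
    root = NodeActor(split_size, acks)
    for v in values:
        root.tell(v)
    for _ in values:
        await acks.get()                # one acknowledgement per instance
    tasks = [n.task for n in walk(root)]
    for t in tasks:
        t.cancel()                      # stop the actors' message loops
    await asyncio.gather(*tasks, return_exceptions=True)
    return root
```

Calling `asyncio.run(build(range(16)))` returns the root of the grown tree. In this sketch every instance enters through the single root mailbox, which is the kind of input bottleneck that the root-replication mechanism described above is meant to address.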

