research-article

Actor-based incremental tree data processing for large-scale machine learning applications

Authors:
Kouhei Sakurai, Kanazawa University, Japan
Taiki Shimizu, Kanazawa University, Japan

AGERE 2019: Proceedings of the 9th ACM SIGPLAN International Workshop on Programming Based on Actors, Agents, and Decentralized Control, October 2019, Pages 1–10. https://doi.org/10.1145/3358499.3361220

Published: 22 October 2019

ABSTRACT

A number of online machine learning techniques based on tree models have been studied to meet today's requirement of quickly processing large-scale data sets. We present a design pattern for incremental tree data processing that gradually constructs a tree model in memory, on demand. Our approach adopts the actor model to make use of multiple cores and distributed computers without substantially rewriting the code of the algorithms. The pattern defines each node in the tree as an actor, which is the unit of asynchronous processing, and each data instance flows between actor nodes as a message. We study two concrete machine learning algorithms: VFDT, which grows a decision tree top-down, and BIRCH, which grows a hierarchical clustering bottom-up. To support VFDT, we propose an extension mechanism that replicates root nodes to address the bottleneck at the point where inputs enter the tree. To support BIRCH, we split the recursive construction process into asynchronous steps and correct the target node by traversing extra horizontal links between sibling nodes. We carried out machine learning tasks with our implementation on top of Akka (Java) and confirmed reasonable performance on large-scale data sets.
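The node-as-actor idea in the abstract can be illustrated with a minimal sketch. It assumes plain Python threads and queues rather than Akka, and it is not the authors' implementation. The names `NodeActor`, `tell`, `drain`, `leaf_sizes`, and `split_after` are hypothetical, and the split rule (split at the mean once a leaf holds `split_after` values) is a placeholder for illustration, not VFDT's statistical split test.

```python
import queue
import threading


class NodeActor:
    """A tree node that owns a mailbox and handles one message at a time."""

    def __init__(self, split_after=4):
        self.mailbox = queue.Queue()
        self.split_after = split_after  # hypothetical growth rule, not VFDT's test
        self.values = []                # instances held while this node is a leaf
        self.threshold = None           # set when the leaf becomes an inner node
        self.children = []
        threading.Thread(target=self._run, daemon=True).start()

    def tell(self, value):
        """Asynchronous send: enqueue the instance and return immediately."""
        self.mailbox.put(value)

    def _run(self):
        while True:
            value = self.mailbox.get()
            self._receive(value)
            self.mailbox.task_done()

    def _receive(self, value):
        if self.children:
            # Inner node: pass the instance to one child as a message.
            child = self.children[0] if value < self.threshold else self.children[1]
            child.tell(value)
            return
        self.values.append(value)
        can_split = min(self.values) < max(self.values)
        if len(self.values) >= self.split_after and can_split:
            # Grow on demand: split at the mean and hand stored instances down.
            self.threshold = sum(self.values) / len(self.values)
            left, right = NodeActor(self.split_after), NodeActor(self.split_after)
            for v in self.values:
                (left if v < self.threshold else right).tell(v)
            self.values = []
            self.children = [left, right]


def drain(node):
    """Block until every message at or below `node` has been processed."""
    node.mailbox.join()
    for child in node.children:
        drain(child)


def leaf_sizes(node):
    """Number of stored instances per leaf, left to right."""
    if not node.children:
        return [len(node.values)]
    return [n for child in node.children for n in leaf_sizes(child)]


root = NodeActor(split_after=4)
for x in [5, 1, 9, 3, 7, 2, 8, 6]:
    root.tell(x)
drain(root)
print(leaf_sizes(root))
```

Each node mutates only its own state on its own thread, so no locks are needed beyond the mailbox. In this sketch all inputs enter through the single `root` mailbox, which is the kind of entry-point bottleneck the abstract's root-replication mechanism is meant to address.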

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Machine learning approaches
      1. Classification and regression trees
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language features
        Concurrent programming structures

Published in
AGERE 2019: Proceedings of the 9th ACM SIGPLAN International Workshop on Programming Based on Actors, Agents, and Decentralized Control
October 2019
50 pages
ISBN:9781450369824
DOI:10.1145/3358499
General Chairs:
Federico Bergenti,
Elias Castegren,
Joeri De Koster,
Juliana Franco
Copyright © 2019 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Actor Model
Decision Tree
Hierarchical Clustering
Incremental Tree Data Processing
Acceptance Rates
Overall acceptance rate: 19 of 35 submissions, 54%