Distributed Cache Strategies for Machine Learning Classification Tasks over Cluster Computing Resources

Ovalle, John Edilson Arévalo; Ramos-Pollan, Raúl; González, Fabio A.

doi:10.1007/978-3-662-45483-1_4

Part of the book series: Communications in Computer and Information Science (CCIS, volume 485)

Included in the following conference series:

Latin American High Performance Computing Conference

Abstract

Scaling machine learning (ML) methods to large datasets requires distributed data architectures and algorithms that support their iterative nature, in which the same data records are processed several times. Data caching is therefore key to minimizing data transmission across iterations at each node and thus contributes to overall scalability. In this work we propose a two-level caching architecture (disk and memory) and benchmark different caching strategies in a distributed machine learning setup over a cluster with no shared RAM. Our results strongly favour strategies where (1) datasets are partitioned and preloaded throughout the distributed memory of the cluster nodes and (2) algorithms use data locality information to synchronize computations at each iteration. This supports the convergence towards models where “computing goes to data”, as observed in other Big Data contexts, and allows us to align strategies for parallelizing ML algorithms and to configure computing infrastructures appropriately.
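The two-level idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `TwoLevelCache`, the `fetch_remote` callback, the pickle-based disk format, and the LRU eviction policy are all assumptions made for illustration. A partition is looked up first in memory, then on local disk, and only fetched from remote storage when both levels miss.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class TwoLevelCache:
    """Illustrative two-level cache: bounded in-memory LRU backed by local disk.

    All names here are hypothetical; the paper's abstract only states that
    the architecture has a disk level and a memory level.
    """

    def __init__(self, fetch_remote, disk_dir, max_in_memory=2):
        self.fetch_remote = fetch_remote    # assumed callback: key -> partition data
        self.disk_dir = disk_dir
        self.max_in_memory = max_in_memory  # number of partitions kept in RAM
        self.memory = OrderedDict()         # insertion order tracks recency
        self.stats = {"memory": 0, "disk": 0, "remote": 0}
        os.makedirs(disk_dir, exist_ok=True)

    def _disk_path(self, key):
        return os.path.join(self.disk_dir, f"{key}.pkl")

    def get(self, key):
        # Level 1: memory hit, mark the partition as most recently used.
        if key in self.memory:
            self.memory.move_to_end(key)
            self.stats["memory"] += 1
            return self.memory[key]

        # Level 2: local disk hit avoids a network transfer.
        path = self._disk_path(key)
        if os.path.exists(path):
            with open(path, "rb") as f:
                value = pickle.load(f)
            self.stats["disk"] += 1
        else:
            # Both levels missed: fetch once and persist to local disk.
            value = self.fetch_remote(key)
            with open(path, "wb") as f:
                pickle.dump(value, f)
            self.stats["remote"] += 1

        # Promote to memory and evict the least recently used partition.
        self.memory[key] = value
        if len(self.memory) > self.max_in_memory:
            self.memory.popitem(last=False)
        return value


if __name__ == "__main__":
    def fake_fetch(key):
        # Stand-in for reading a dataset partition from remote storage.
        return [key] * 3

    with tempfile.TemporaryDirectory() as tmp:
        cache = TwoLevelCache(fake_fetch, tmp, max_in_memory=3)
        for _ in range(3):                   # three training iterations
            for part in ["p0", "p1", "p2"]:  # same partitions every iteration
                cache.get(part)
        print(cache.stats)
```

In this sketch, when the memory level can hold every partition a node works on, remote fetches happen only in the first iteration. When it cannot, a cyclic access pattern defeats LRU and later iterations fall back to the disk level, which is one way to see why preloading partitions into distributed memory matters for iterative algorithms.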

Author information

Authors and Affiliations

Universidad Nacional de Colombia, Colombia
John Edilson Arévalo Ovalle & Fabio A. González
Unidad de Supercómputo y Cálculo Científico, Universidad Industrial de Santander, Colombia
Raúl Ramos-Pollan

Editor information

Editors and Affiliations

Universidad Santa Maria, Valparaiso, Chile
Gonzalo Hernández
Ciudad Universitaria, Bucaramanga, Colombia
Carlos Jaime Barrios Hernández
Universidad Industrial de Santander, Bucaramanga, Colombia
Gilberto Díaz
Universidad Nacional de Cuyo, Mendoza, Argentina
Carlos García Garino
Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay
Sergio Nesmachnow
Universidad de Valparaíso, Chile
Tomás Pérez-Acle
CIMEC, Santa Fe, Argentina
Mario Storti
Barcelona Supercomputing Center, Spain
Mariano Vázquez

About this paper

Cite this paper

Ovalle, J.E.A., Ramos-Pollan, R., González, F.A. (2014). Distributed Cache Strategies for Machine Learning Classification Tasks over Cluster Computing Resources. In: Hernández, G., et al. High Performance Computing. CARLA 2014. Communications in Computer and Information Science, vol 485. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-45483-1_4

DOI: https://doi.org/10.1007/978-3-662-45483-1_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-45482-4
Online ISBN: 978-3-662-45483-1
eBook Packages: Computer Science, Computer Science (R0)
