Abstract
Many recently proposed big data processing frameworks make programming easier, but they typically expect datasets to fit in the memory of either a single multicore machine or a cluster of multicore machines; when this assumption does not hold, these frameworks fail. We introduce the InfiniMem framework, which enables size-oblivious processing of large collections of objects that do not fit in memory by making them disk-resident. InfiniMem is easy to program with: the user merely indicates which large collections of objects are to be made disk-resident, and InfiniMem transparently handles their I/O management. The InfiniMem library manages a very large number of objects in a uniform manner, even though the objects have different characteristics and relationships that, when processed, give rise to a wide range of access patterns requiring different organizations of data on disk. We demonstrate the ease of programming and the versatility of InfiniMem with three probabilistic analytics algorithms and three size-oblivious graph processing frameworks; each requires minimal effort, just 6–9 additional lines of code. We show that InfiniMem can generate a mesh with 7.5 million nodes and 300 million edges (4.5 GB on disk) in 40 minutes, and that it performs the PageRank computation on a 14 GB graph with 134 million vertices and 805 million edges at 14 minutes per iteration on an 8-core machine with 8 GB RAM; many graph generators and processing frameworks cannot handle graphs of this size. We also exploit InfiniMem on a cluster to scale up an object-based DSM.
This work was supported by NSF Grants CCF-1524852, CCF-1318103, CNS-1157377, CCF-0963996, and CCF-0905509, and by a Google Research Award.
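The abstract describes the core programming model: the user only marks which object collections may outgrow memory, and the library transparently spills them to disk and fetches them on access. The following C++ sketch illustrates that idea in miniature; the container name DiskBackedVector, its put/get methods, and the Vertex record are hypothetical illustrations and are not the InfiniMem API.

```cpp
// Minimal sketch of size-oblivious storage, assuming fixed-size,
// trivially copyable records. Hypothetical names; not the InfiniMem API.
#include <cstdio>
#include <cstdint>
#include <string>
#include <stdexcept>
#include <type_traits>

template <typename T>
class DiskBackedVector {
  static_assert(std::is_trivially_copyable<T>::value,
                "sketch assumes fixed-size, trivially copyable records");
public:
  explicit DiskBackedVector(const std::string& path) {
    file_ = std::fopen(path.c_str(), "w+b");   // backing file on disk
    if (!file_) throw std::runtime_error("cannot open backing file");
  }
  ~DiskBackedVector() { if (file_) std::fclose(file_); }

  // Write record i directly at its byte offset in the backing file.
  void put(uint64_t i, const T& value) {
    std::fseek(file_, static_cast<long>(i * sizeof(T)), SEEK_SET);
    std::fwrite(&value, sizeof(T), 1, file_);
  }

  // Read record i back from disk; only one record is resident at a time.
  T get(uint64_t i) const {
    T value{};
    std::fseek(file_, static_cast<long>(i * sizeof(T)), SEEK_SET);
    if (std::fread(&value, sizeof(T), 1, file_) != 1)
      throw std::out_of_range("record not present");
    return value;
  }

private:
  std::FILE* file_ = nullptr;
};

struct Vertex { uint64_t id; double rank; };   // fixed-size example record

int main() {
  DiskBackedVector<Vertex> vertices("/tmp/vertices.bin");
  for (uint64_t i = 0; i < 1000; ++i)
    vertices.put(i, Vertex{i, 1.0});           // spill records to disk
  Vertex v = vertices.get(42);                 // fetch one record on demand
  std::printf("vertex %llu rank %.2f\n",
              static_cast<unsigned long long>(v.id), v.rank);
  return 0;
}
```

A production design would batch I/O, cache hot records in memory, and choose the on-disk layout per access pattern (e.g., sequential for fixed-size vertex data versus indexed for variable-size adjacency lists); this sketch only shows the disk-resident collection abstraction the abstract refers to.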
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Koduru, S.C., Gupta, R., Neamtiu, I. (2016). Size Oblivious Programming with InfiniMem. In: Shen, X., Mueller, F., Tuck, J. (eds) Languages and Compilers for Parallel Computing. LCPC 2015. Lecture Notes in Computer Science, vol. 9519. Springer, Cham. https://doi.org/10.1007/978-3-319-29778-1_1
DOI: https://doi.org/10.1007/978-3-319-29778-1_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-29777-4
Online ISBN: 978-3-319-29778-1