research-article

Software-defined storage for fast trajectory queries using a deltaFS indexed massive directory

Authors: George Amvrosiadis, Saurabh Kadekodi, Garth A. Gibson, Charles D. Cranor, Bradley W. Settlemyer, Fan Guo

PDSW-DISCS '17: Proceedings of the 2nd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems

Pages 7 - 12

https://doi.org/10.1145/3149393.3149398

Published: 12 November 2017

Abstract

In this paper we introduce the Indexed Massive Directory, a new technique for indexing data within DeltaFS. With its design as a scalable, server-less file system for HPC platforms, DeltaFS scales file system metadata performance with application scale. The Indexed Massive Directory is a novel extension to the DeltaFS data plane, enabling in-situ indexing of massive amounts of data written to a single directory simultaneously, and in an arbitrarily large number of files. We achieve this through a memory-efficient indexing mechanism for reordering and indexing data, and a log-structured storage layout to pack small writes into large log objects, all while ensuring compute node resources are used frugally. We demonstrate the efficiency of this indexing mechanism through VPIC, a widely used simulation code that scales to trillions of particles. With DeltaFS, we modify VPIC to create a file for each particle to receive writes of that particle's output data. Dynamically indexing the directory's underlying storage allows us to achieve a 5000x speedup in single-particle trajectory queries, which require reading all data for a single particle. This speedup increases with application scale while the overhead is fixed at 3% of available memory.
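
The general idea of packing small per-particle writes into one large log while keeping an index for trajectory lookups can be illustrated with a minimal sketch. This is not the DeltaFS implementation or its API: the class `IndexedLog`, its methods, and the helper `full_scan` are hypothetical names introduced here, the log lives in memory, and the index is a plain dictionary rather than the memory-efficient structure the abstract describes.

```python
from collections import defaultdict


class IndexedLog:
    """Toy append-only log with a per-particle index (hypothetical, not the DeltaFS API)."""

    def __init__(self):
        self.log = bytearray()          # stands in for one large log object
        self.index = defaultdict(list)  # particle id -> [(offset, length), ...]

    def append(self, particle_id, record: bytes):
        """Pack a small write at the end of the log and remember where it went."""
        offset = len(self.log)
        self.log += record
        self.index[particle_id].append((offset, len(record)))

    def trajectory(self, particle_id):
        """Return every record written for one particle, in write order."""
        return [bytes(self.log[off:off + n])
                for off, n in self.index.get(particle_id, [])]


def full_scan(records, particle_id):
    """Baseline without an index: inspect every (particle id, record) pair."""
    return [data for pid, data in records if pid == particle_id]
```

In this sketch a trajectory query touches only the byte ranges recorded for the requested particle, whereas the unindexed baseline has to examine every record that was written. The abstract attributes its speedup to indexing of this kind; the sketch makes no claim about the magnitude.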

Published In

PDSW-DISCS '17: Proceedings of the 2nd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems

November 2017

74 pages

ISBN: 9781450351348

DOI: 10.1145/3149393

Program Chairs: Kathryn Mohror (Lawrence Livermore National Laboratory), Brent Welch (Google)

Copyright © 2017 ACM.

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

SIGHPC: ACM Special Interest Group on High Performance Computing
IEEE-CS\DATC: IEEE Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SC '17

SC '17: The International Conference for High Performance Computing, Networking, Storage and Analysis

November 12 - 17, 2017

Denver, Colorado

Acceptance Rates

Overall Acceptance Rate 17 of 41 submissions, 41%

Cited By

Jain A, Cranor C, Zheng Q, Settlemyer B, Amvrosiadis G, Grider G (2024). CARP: Range Query-Optimized Indexing for Streaming Data. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-19. DOI: 10.1109/SC41406.2024.00093. Online publication date: 17-Nov-2024.
https://dl.acm.org/doi/10.1109/SC41406.2024.00093
Chen Y, Tong W, Feng D, Wang Z (2022). Workload-aware storage policies for cloud object storage. Journal of Parallel and Distributed Computing 163, pp. 232-247. DOI: 10.1016/j.jpdc.2022.01.026. Online publication date: May-2022.
https://doi.org/10.1016/j.jpdc.2022.01.026
Zheng Q, Cranor C, Jain A, Ganger G, Gibson G, Amvrosiadis G, Settlemyer B, Grider G (2020). Streaming Data Reorganization at Scale with DeltaFS Indexed Massive Directories. ACM Transactions on Storage 16:4, pp. 1-31. DOI: 10.1145/3415581. Online publication date: 24-Sep-2020.
https://dl.acm.org/doi/10.1145/3415581
Chen Y, Tong W, Feng D, Wang Z (2020). Mass: Workload-Aware Storage Policy for OpenStack Swift. Proceedings of the 49th International Conference on Parallel Processing, pp. 1-11. DOI: 10.1145/3404397.3404427. Online publication date: 17-Aug-2020.
https://dl.acm.org/doi/10.1145/3404397.3404427
Zheng Q, Cranor C, Jain A, Ganger G, Gibson G, Amvrosiadis G, Settlemyer B, Grider G (2019). Compact Filters for Fast Online Data Partitioning. 2019 IEEE International Conference on Cluster Computing (CLUSTER), pp. 1-12. DOI: 10.1109/CLUSTER.2019.8890992. Online publication date: Sep-2019.
https://doi.org/10.1109/CLUSTER.2019.8890992
Zheng Q, Cranor C, Guo D, Ganger G, Amvrosiadis G, Gibson G, Settlemyer B, Grider G, Guo F (2018). Scaling embedded in-situ indexing with deltaFS. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-15. DOI: 10.5555/3291656.3291660. Online publication date: 11-Nov-2018.
https://dl.acm.org/doi/10.5555/3291656.3291660
Zheng Q, Cranor C, Guo D, Ganger G, Amvrosiadis G, Gibson G, Settlemyer B, Grider G, Guo F (2018). Scaling embedded in-situ indexing with deltaFS. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-15. DOI: 10.1109/SC.2018.00006. Online publication date: 11-Nov-2018.
https://dl.acm.org/doi/10.1109/SC.2018.00006
Dorier M, Settlemyer B, Shipman G, Soumagne J, Kowalkowski J, Paterno M, Sehrish S, Carns P, Harms K, Latham R, Ross R, Snyder S, Wozniak J, Gutierrez S, Robey B (2018). Methodology for the Rapid Development of Scalable HPC Data Services. 2018 IEEE/ACM 3rd International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS), pp. 76-87. DOI: 10.1109/PDSW-DISCS.2018.00013. Online publication date: Nov-2018.
https://doi.org/10.1109/PDSW-DISCS.2018.00013
