
Software-defined storage for fast trajectory queries using a deltaFS indexed massive directory

Published: 12 November 2017

Abstract

In this paper we introduce the Indexed Massive Directory, a new technique for indexing data within DeltaFS. With its design as a scalable, server-less file system for HPC platforms, DeltaFS scales file system metadata performance with application scale. The Indexed Massive Directory is a novel extension to the DeltaFS data plane, enabling in-situ indexing of massive amounts of data written to a single directory simultaneously, and in an arbitrarily large number of files. We achieve this through a memory-efficient indexing mechanism for reordering and indexing data, and a log-structured storage layout to pack small writes into large log objects, all while ensuring compute node resources are used frugally. We demonstrate the efficiency of this indexing mechanism through VPIC, a widely-used simulation code that scales to trillions of particles. With DeltaFS, we modify VPIC to create a file for each particle to receive writes of that particle's output data. Dynamically indexing the directory's underlying storage allows us to achieve a 5000x speedup in single particle trajectory queries, which require reading all data for a single particle. This speedup increases with application scale while the overhead is fixed at 3% of available memory.
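The layout described in the abstract can be illustrated with a toy, in-memory sketch. This is not the DeltaFS implementation or its API: the class `IndexedLog`, its methods, and the record format are all hypothetical, and the real system partitions and reorders data across many processes. The sketch only shows why packing small per-particle writes into one append-only log, alongside an index from particle name to log extents, lets a trajectory query touch just that particle's records instead of scanning everything written.

```python
from collections import defaultdict


class IndexedLog:
    """Hypothetical toy stand-in for a log-structured store with a per-key index."""

    def __init__(self):
        self.log = bytearray()           # one large append-only log object
        self.index = defaultdict(list)   # key -> [(offset, length), ...]

    def append(self, key, payload):
        """Pack a small write into the log and record where it landed."""
        offset = len(self.log)
        self.log += payload
        self.index[key].append((offset, len(payload)))

    def read_all(self, key):
        """Return every record written for `key`, in write order."""
        extents = self.index.get(key, [])
        return [bytes(self.log[off:off + n]) for off, n in extents]


# Each "particle" behaves like a tiny file that receives one write per timestep.
store = IndexedLog()
for step in range(3):
    for particle in ("p0", "p1", "p2"):
        store.append(particle, f"{particle}@t{step};".encode())

# A single-particle trajectory query follows the index to three extents
# rather than scanning all nine records in the log.
trajectory = store.read_all("p1")
print(trajectory)
```

In this sketch the index lives entirely in memory and grows with the number of writes; the abstract's claim of a fixed memory overhead depends on the paper's own indexing mechanism, which the sketch does not attempt to reproduce.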



Published In

PDSW-DISCS '17: Proceedings of the 2nd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems
November 2017
74 pages
ISBN:9781450351348
DOI:10.1145/3149393
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

SC '17

Acceptance Rates

Overall Acceptance Rate 17 of 41 submissions, 41%
