DOI: 10.1145/3319647.3325828
research-article

Write optimization of log-structured flash file system for parallel I/O on manycore servers

Published: 22 May 2019

Abstract

In manycore server environments, we observe performance degradation in parallel writes and identify three causes. (i) When multiple threads write to a single file simultaneously, the current POSIX-based F2FS file system does not allow the writes to proceed in parallel, even when the threads write to distinct ranges. (ii) The high processing time of fsync at the file system layer degrades I/O throughput when multiple threads call fsync simultaneously. (iii) The file system periodically checkpoints to recover from system crashes, and all incoming I/O requests are blocked while a checkpoint is running, which significantly degrades overall file system performance. To solve these problems, we first propose that file systems employ a fine-grained, file-level Range Lock that allows multiple threads to write to non-overlapping ranges of a file, in place of the coarse-grained inode mutex lock. Second, we propose NVM Node Logging, which uses NVM as an extended storage space to store file metadata and file system metadata at high speed during fsync and checkpoint operations. In particular, NVM Node Logging consists of (i) a fine-grained inode structure that addresses the write amplification caused by flushing file metadata in block units and (ii) a Pin Point NAT (Node Address Table) Update, which flushes only the modified NAT entries. We implemented Range Lock and NVM Node Logging for F2FS in Linux kernel 4.14.11. Our extensive evaluation on two types of servers (a single-socket 10-core server and a multi-socket 120-core NUMA server) shows significant write throughput improvements on both real and synthetic workloads.
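
The range-lock idea in the abstract can be illustrated with a small user-space sketch. It is not the kernel implementation described in the paper: the names `RangeLock`, `try_acquire`, `locked`, and `parallel_fill` are hypothetical, ranges are assumed to be half-open byte intervals, and held ranges are kept in a plain list, whereas a real file system would more likely use a tree keyed by offset. The sketch only shows the locking rule itself, namely that a writer blocks only on ranges that overlap its own.

```python
import threading
from contextlib import contextmanager


class RangeLock:
    """Illustrative per-file range lock over half-open byte ranges [start, end)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._held = []  # ranges currently locked, stored as (start, end) tuples

    def _conflicts(self, start, end):
        # Two half-open ranges overlap when each one starts before the other ends.
        return any(s < end and start < e for s, e in self._held)

    def try_acquire(self, start, end):
        """Take the range if it is free; return False instead of blocking."""
        if start >= end:
            raise ValueError("empty or inverted range")
        with self._cond:
            if self._conflicts(start, end):
                return False
            self._held.append((start, end))
            return True

    def acquire(self, start, end):
        """Block until no held range overlaps [start, end), then take it."""
        if start >= end:
            raise ValueError("empty or inverted range")
        with self._cond:
            while self._conflicts(start, end):
                self._cond.wait()
            self._held.append((start, end))

    def release(self, start, end):
        """Drop a previously acquired range and wake any waiting writers."""
        with self._cond:
            self._held.remove((start, end))
            self._cond.notify_all()

    @contextmanager
    def locked(self, start, end):
        self.acquire(start, end)
        try:
            yield
        finally:
            self.release(start, end)


def parallel_fill(size, workers):
    """Each worker thread writes its own disjoint slice of one shared buffer."""
    buf = bytearray(size)
    lock = RangeLock()
    step = size // workers

    def write(i):
        start = i * step
        end = size if i == workers - 1 else start + step
        # Disjoint ranges never conflict, so no worker waits on another here.
        with lock.locked(start, end):
            buf[start:end] = bytes([i + 1]) * (end - start)

    threads = [threading.Thread(target=write, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bytes(buf)
```

With a single inode-wide mutex, the workers in `parallel_fill` would run one at a time. With the range lock, two writers are serialized only if their ranges overlap, which is the behavior the abstract attributes to Range Lock.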

    Published In

    SYSTOR '19: Proceedings of the 12th ACM International Conference on Systems and Storage
    May 2019
    211 pages
    ISBN:9781450367493
    DOI:10.1145/3319647

    In-Cooperation

    • USENIX Assoc

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. file system
    2. manycore OS
    3. non-volatile memory

    Qualifiers

    • Research-article

    Conference

    SYSTOR '19

    Acceptance Rates

    Overall Acceptance Rate 108 of 323 submissions, 33%
