Research article
DOI: 10.1145/3542637.3542646

A Disaggregate Data Collecting Approach for Loss-Tolerant Applications

Published: 07 November 2023

Abstract

Datacenters generate operational data at extremely high rates, and datacenter operators collect and analyze this data for problem diagnosis, resource utilization improvement, and performance optimization. However, existing data collection methods cannot efficiently aggregate and store data at such speed and scale. In this paper, we explore a new approach that leverages programmable switches to aggregate data and write it directly to the destination storage. Our proposed data collection system, ALT, uses programmable switches to control NVMe SSDs on remote hosts without involving the remote CPU. To tolerate loss, ALT uses a data structure that enables efficient data recovery when the collected data is retrieved. We implement a prototype of ALT on a Tofino-based programmable switch. Our evaluation shows that ALT can saturate the SSD’s peak performance without any CPU involvement.


Cited By

  • (2024) Label Noise Correction for Federated Learning: A Secure, Efficient and Reliable Realization. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 3600-3612. DOI: 10.1109/ICDE60146.2024.00277. Online publication date: 13-May-2024


      Published In

      APNet '22: Proceedings of the 6th Asia-Pacific Workshop on Networking
      July 2022
      110 pages
      ISBN:9781450397483
      DOI:10.1145/3542637

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. Data collection
      2. NVM Express
      3. Programmable switches
      4. Remote Direct Memory Access

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      APNet 2022

      Acceptance Rates

      Overall Acceptance Rate 50 of 118 submissions, 42%
