Optimizing the data-collection time of a large-scale data-acquisition system through a simulation framework

Published in The Journal of Supercomputing

Abstract

The ATLAS detector at CERN records particle collision “events” delivered by the Large Hadron Collider. Its data-acquisition system identifies, selects, and stores interesting events in near real-time, with an aggregate throughput of several tens of GB/s. It is a distributed software system executed on a farm of roughly 2000 commodity worker nodes communicating via TCP/IP on an Ethernet network. Event data fragments are received from the many detector readout channels and are buffered, collected together, analyzed, and either stored permanently or discarded. This system, and data-acquisition systems in general, are sensitive to the latency of the data transfer from the readout buffers to the worker nodes. Challenges affecting this transfer include the many-to-one communication pattern and the inherently bursty nature of the traffic. This paper addresses the main performance issues brought about by this workload, focusing in particular on the so-called TCP incast pathology. Since systematic studies of these issues are often impeded by operational constraints related to the mission-critical nature of such systems, we developed a simulation model of the ATLAS data-acquisition system. The resulting simulation tool is based on the well-established, widely used OMNeT++ framework. We validated the tool by comparing its simulation results with existing measurements of the system’s behavior. Furthermore, the simulation tool enables the study of the theoretical behavior of the system in numerous what-if scenarios and with modifications that are not immediately applicable to the real system. We take advantage of this to analyze the behavior of the system under different traffic-shaping and scheduling policies, and with network hardware modifications. This analysis leads to conclusions that could guide future system enhancements.
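The many-to-one collection pattern described above can be illustrated with a back-of-the-envelope model: N readout buffers each contribute one fragment of an event, and all fragments must cross a shared bottleneck link into a single collecting worker node. The sketch below is purely illustrative; the function name, the link speed, fragment size, and round-trip-time values are hypothetical assumptions, not figures or a model taken from the paper.

```python
# Illustrative lower-bound model of the many-to-one ("incast") event-building
# pattern: n_senders readout buffers each send one fragment to one collector
# over a shared bottleneck link. All parameter values are hypothetical.

def collection_time(n_senders, fragment_bytes, link_gbps, rtt_s=100e-6):
    """Lower bound on the time to collect one full event.

    The bottleneck link serializes all fragments, so the collection time
    grows linearly with the number of senders; the RTT term stands in for
    the request/response handshake of a pull-based collection scheme.
    (Real incast collapse adds retransmission timeouts on top of this.)
    """
    link_bytes_per_s = link_gbps * 1e9 / 8   # convert Gb/s to bytes/s
    serialization = n_senders * fragment_bytes / link_bytes_per_s
    return rtt_s + serialization

# Doubling the number of readout channels roughly doubles the
# serialization component of the collection time.
t_100 = collection_time(100, 1024, 10)
t_200 = collection_time(200, 1024, 10)
```

This lower bound is optimistic: when the fragments arrive as a synchronized burst, switch buffer overflow triggers TCP retransmission timeouts, and the observed collection time can exceed the serialization bound by orders of magnitude, which is the incast pathology the paper studies.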

Notes

  1. Measurements show that both time intervals are smaller than 0.5 ms.

  2. In this particular sentence, the word “event” refers to simulation events. To avoid confusion, throughout the rest of the paper, “event” will only be used in its high-energy physics meaning, i.e., “collision event”.


Acknowledgments

This work was partially supported by the Ministry of Economy and Competitiveness of Spain (project TIN2012-38341-C04) and the European Commission (ERDF).

Author information

Correspondence to Tommaso Colombo.


Cite this article

Colombo, T., Fröning, H., García, P.J. et al. Optimizing the data-collection time of a large-scale data-acquisition system through a simulation framework. J Supercomput 72, 4546–4572 (2016). https://doi.org/10.1007/s11227-016-1764-1
