DOI: 10.1145/2716281.2836094

Demystifying and mitigating TCP stalls at the server side

Published: 01 December 2015

Abstract

TCP is an important factor affecting the user-perceived performance of Internet applications. Diagnosing the causes behind TCP performance issues in the wild is essential for better understanding the current shortcomings of TCP. This paper presents a TCP flow performance analysis framework that classifies the causes of TCP stalls. The framework forms the basis of a tool that is publicly available to the research community. We use our tool to analyze packet-level traces of three services (cloud storage, software download, and web search) deployed by a popular Chinese service provider. We find that as many as 20% of the flows are stalled for half of their lifetime. Network-related causes, especially timeout retransmission, dominate the stalls. A breakdown of the causes of timeout retransmission stalls reveals that double retransmission and tail retransmission are among the top contributors. The relative importance of these causes, however, depends on the specific service. We also propose S-RTO, a mechanism that mitigates timeout retransmission stalls. S-RTO has been deployed on production front-end servers, and results show that it is effective at improving TCP performance, especially for short flows.
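To make the "stalled for half of their lifetime" metric concrete, the following is a minimal sketch of how such a fraction could be computed from per-flow packet timestamps. It is not the paper's tool: the function names, the input format (a list of timestamps in seconds per flow), and the fixed `stall_threshold` are all assumptions made for illustration, and the abstract does not state how a stall is defined.

```python
from typing import Sequence


def stalled_fraction(timestamps: Sequence[float], stall_threshold: float) -> float:
    """Return the share of a flow's lifetime spent in long inter-packet gaps.

    ASSUMPTION: a "stall" here is any gap between consecutive packets that
    exceeds stall_threshold seconds. The paper's own definition may differ.
    """
    if len(timestamps) < 2:
        return 0.0
    ts = sorted(timestamps)
    lifetime = ts[-1] - ts[0]
    if lifetime <= 0:
        return 0.0
    gaps = (later - earlier for earlier, later in zip(ts, ts[1:]))
    stalled_time = sum(gap for gap in gaps if gap > stall_threshold)
    return stalled_time / lifetime


def share_of_flows_stalled(
    flows: Sequence[Sequence[float]],
    stall_threshold: float,
    min_fraction: float = 0.5,
) -> float:
    """Return the share of flows stalled for at least min_fraction of their lifetime."""
    if not flows:
        return 0.0
    hits = sum(
        1 for ts in flows if stalled_fraction(ts, stall_threshold) >= min_fraction
    )
    return hits / len(flows)


if __name__ == "__main__":
    # Two hypothetical flows: one with a 1-second silence, one without gaps.
    flows = [
        [0.0, 0.1, 0.2, 1.2, 1.3],
        [0.0, 0.1, 0.2, 0.3],
    ]
    print(share_of_flows_stalled(flows, stall_threshold=0.5))
```

A fixed threshold keeps the sketch short; a real analysis would more plausibly tie the threshold to each flow's round-trip time or retransmission timeout, which would require per-flow RTT estimates that this example does not model.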



Published In

CoNEXT '15: Proceedings of the 11th ACM Conference on Emerging Networking Experiments and Technologies
December 2015
483 pages
ISBN:9781450334129
DOI:10.1145/2716281
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. TCP
  2. congestion control
  3. measurement
  4. performance troubleshooting

Qualifiers

  • Research-article

Conference

CoNEXT '15

Acceptance Rates

Overall Acceptance Rate 198 of 789 submissions, 25%

Cited By

  • (2024) Diagnosing application-network anomalies for millions of IPs in production clouds. Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference, pp. 885-899. DOI: 10.5555/3691992.3692046. Online publication date: 10-Jul-2024
  • (2021) Speeding Up TCP with Selective Loss Prevention. 2021 IEEE 29th International Conference on Network Protocols (ICNP), pp. 1-6. DOI: 10.1109/ICNP52444.2021.9651983. Online publication date: 1-Nov-2021
  • (2020) A First Look at Disconnection-Centric TCP Performance on High-Speed Railways. IEEE Journal on Selected Areas in Communications, 38:12, pp. 2723-2733. DOI: 10.1109/JSAC.2020.3005486. Online publication date: Dec-2020
  • (2020) Reducing web latency with coding-based fast multi-path loss recovery. Wireless Networks. DOI: 10.1007/s11276-020-02443-8. Online publication date: 20-Aug-2020
  • (2019) TCP Stalls at the Server Side. IEEE/ACM Transactions on Networking, 27:1, pp. 272-287. DOI: 10.1109/TNET.2018.2886282. Online publication date: 1-Feb-2019
  • (2019) Dynamic TCP Initial Windows and Congestion Control Schemes Through Reinforcement Learning. IEEE Journal on Selected Areas in Communications, 37:6, pp. 1231-1247. DOI: 10.1109/JSAC.2019.2904350. Online publication date: Jun-2019
  • (2018) RAVEN. Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, pp. 557-572. DOI: 10.1145/3241539.3241571. Online publication date: 15-Oct-2018
  • (2018) FUSO. IEEE/ACM Transactions on Networking, 26:3, pp. 1376-1389. DOI: 10.1109/TNET.2018.2830414. Online publication date: 1-Jun-2018
  • (2018) Reducing Web Latency Through Dynamically Setting TCP Initial Window with Reinforcement Learning. 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS), pp. 1-10. DOI: 10.1109/IWQoS.2018.8624175. Online publication date: Jun-2018
  • (2017) TCP WISE: One initial congestion window is not enough. 2017 IEEE 36th International Performance Computing and Communications Conference (IPCCC), pp. 1-8. DOI: 10.1109/PCCC.2017.8280464. Online publication date: Dec-2017
