RAPID: Avoiding TCP Incast Throughput Collapse in Public Clouds with Intelligent Packet Discarding

Yang Xu*, Shikhar Shukla, Zehua Guo, Sen Liu, Adrian S.W. Tam, Kang Xi, H. Jonathan Chao

*Corresponding author for this work

Research output: Contribution to journal > Article > peer-review

13 Citations (Scopus)

Abstract

Many applications in public clouds require high fan-in, many-to-one data communication (known as TCP incast) in modern Data Center Networks (DCNs). Such communication can cause severe incast congestion in switches and result in TCP throughput collapse, substantially degrading application performance. The root cause of throughput collapse is Retransmission Timeouts (RTO) triggered by packet losses in congested switches. Tenants in public clouds can opt to use a variety of TCP versions. However, existing solutions rely on modifications to TCP protocols and on specific switch techniques, and are therefore not always feasible in public clouds. In this paper, inspired by emerging virtualization and network softwarization technologies, we develop a novel scheme called Retransmission timeout Avoidance by Packet Intelligent Discarding (RAPID) using software switches. RAPID considers the number of packets of each incast flow buffered in the switch to selectively discard some packets, ensuring that Fast Retransmission/Fast Recovery rather than RTO is invoked at the sender(s) in response to packet loss. Thus, the long idle period of a timeout and the resulting throughput drop are avoided. We prove that, given a predetermined minimum switch buffer space dedicated to the incast application, RAPID can prevent RTO in all the incast senders. We also present a low-complexity heuristic version of RAPID named RAPID-ED, which combines the principles of RAPID and early detection and is extremely easy to implement on today's software switches. We evaluate the two proposed schemes in a data center network testbed built on the NS-3 simulator. The simulation results confirm the theoretical expectation and show that RAPID and RAPID-ED perform very well in preventing RTO of TCP incast flows and hence throughput collapse. Compared with other incast solutions, RAPID and RAPID-ED do not modify TCP protocols and are therefore more suitable for public clouds.

Original language: English
Article number: 8766803
Pages (from-to): 1911-1923
Number of pages: 13
Journal: IEEE Journal on Selected Areas in Communications
Volume: 37
Issue number: 8
Publication status: Published - Aug 2019

Keywords

  • TCP incast
  • TCP timeout
  • queue management
  • random early detection
