How Does TCP Work?

The transmission control protocol (TCP) does one job very well — it creates an abstraction that makes an unreliable channel look like a reliable network. For applications built over an unreliable network like the Internet, TCP is a godsend that hides a lot of the inherent complexity in building networked applications. A laundry list of TCP features that application developers rely on every day includes: retransmission of lost data, in-order data delivery, data integrity, and congestion control. This article provides an introduction TCP, describing the structure of TCP segments, how TCP connections are established, and the algorithms that govern the flow of data between senders and receivers.

TCP Segments

TCP divides a stream of data into chunks, and then adds a TCP header to each chunk to create a TCP segment. A TCP segment consists of a header and a data section. The TCP header contains 10 mandatory fields, and an optional extension field. The payload data follows the header and contains the data for the application. The following figure from RFC 793 shows the format of a TCP segment, where each - represents one bit.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Acknowledgement Number                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Data |           |U|A|P|R|S|F|                               |
| Offset| Reserved  |R|C|S|S|Y|I|            Window             |
|       |           |G|K|H|T|N|N|                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Checksum            |         Urgent Pointer        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Options                    |    Padding    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             Data                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  • Source Port (16 bits): The client’s port number.
  • Destination Port (16 bits): The server’s port number.
  • Sequence Number (32 bits): A sequence number used for guaranteeing packet order.
  • Acknowledgement Number (32 bits): An acknowledge number notifying senders of the receipt of TCP segments.
  • Data Offset (4 bits): The size of the TCP header (i.e. where the data begins).
  • Reserved (3 bits): Set to zero. Reserved for future use.
  • Flags (9 bits): Flags set TCP control options used to alter the connection. For example, the SYN flag is used to synchronize on sequence numbers, and the FIN flag is used to close the connection. The full list of flags is available on Wikipedia.
  • Window (16 bits): The receive window. The number of bytes that the sender of this segment is willing to receive. A key feature of TCP flow control.
  • Checksum (16 bits): A 16-bit checksum used for error-checking.
  • Urgent pointer (16 bits): If the sender sets the URG flag, then this 16-bit field is an offset from the sequence number indicating the last urgent data byte. Data received with an URG flag is typically placed in a separate incoming buffer to be delivered to the application, where it is up to the application to use this data correctly. For example, FTP and Telnet use the URG flag when sending user commands so that the server can prioritize user commands over in progress data transfer.
  • Options (Variable 0–320 bits, divisible by 32): Options are — as the name would suggest — optional. One example option is the Maximum Segment Size. By setting this option, a sender advertises the maximum size of segments that it can process. The receiver must honour this value when sending segments.
  • Padding (Variable 0–320 bits, divisible by 32): Zeroes. Padding ensures that the TCP header ends and data begins on a 32 bit boundary.
  • Data (Variable length): The application specific payload data for this TCP segment.

Establishing a TCP Connection — The Three-Way Handshake

All TCP connections begin with a connection algorithm called the three-way handshake (because it requires three network requests to successfully complete). During this handshake, both the client and the server must agree on a starting packet sequence number that is used to guarantee ordering of future packets. In point form, the handshake looks like this:

  1. SYN
    • The client picks a random sequence number \(x\).
    • The client sends a packet with the SYN flag set and the Sequence Number field set to \(x\).
  2. SYN-ACK
    • The server increments \(x\) by one.
    • The server picks a random sequence number \(y\).
    • The server sends a packet with the SYN and ACK flags set, the Sequence Number field set to \(y\), and the Acknowledgement Number field set to \(x\).
  3. ACK
    • The client increments increments \(x\) and \(y\) by one.
    • The client sends a packet with the ACK flag sent, the Sequence Number field set to \(x\), and the Acknowledgement Number field set to \(y\).

After the handshake is complete, a client can start sending data packets immediately.

The following figure shows a successful three-way handshake. At the end of the handshake, the client begins sending application data.

assets/three-way-handshake.png

A successful three-way handshake.

Sending Data

After a TCP connection is established using the three-way handshake, TCP segments can be exchanged between the client and server. To guard against the failures introduced by the unreliable network, TCP uses sequence numbers to verify the correct delivery and ordering of TCP segments. The host on either side of a TCP session maintains a 32-bit sequence number it uses to keep track of how much data it has sent. This sequence number is included on each transmitted packet, and acknowledged by the opposite host as an acknowledgement number to inform the sending host that the transmitted data was received successfully.

A fundamental notion in the design is that every octet of data sent over a TCP connection has a sequence number. Since every octet is sequenced, each of them can be acknowledged.

RFC 793 Transmission Control Protocol

More precisely, when a client sends a TCP segment with an acknowledgement number \(x\), it means that the client has correctly received all data up to \(x-1\). TCP uses this mechanism to perform straight-forward duplicate detection of retransmitted segments. To do so, both sides of the connection keep track of a few variables:

  • SND.UNA: oldest unacknowledged sequence number
  • SND.NXT: next sequence number to be sent
  • SEG.ACK: next sequence number expected by the receiving host
  • SEG.SEQ: first sequence number of a segment
  • SEG.LEN: the number of octets of data in the segment
  • SEG.SEQ+SEG.LEN-1: last sequence number of a segment

When the sender creates a segment and transmits it, the sender advances SND.NXT. When the receiver accepts a segment it advances RCV.NXT and sends an acknowledgement. When the sender receives an acknowledgement it advances SND.UNA. The extent to which the values of these variables differ is a measure of the delay in the communication. If the data flow is momentarily idle and all data sent has been acknowledged then the three variables will be equal.

The following figure provides a visual example of data exchange between a client and server.

assets/sending-data.png

Data exchange between a client and server.

The client makes a request to the server for data, and the server creates a response that is divided into three TCP segments. Segment one is returned to the client without issue. Segment two (1) is dropped somewhere by the network, and segment three arrives without issue. The client, after receiving segment one, issues an acknowledgement for that segment. The client then receives segment three out of order, and sends a duplicate acknowledgement (2) for segment one to signal that it has not received data for segment two yet. The server responds to this duplicate acknowledgement (3) by resending both segments two and three, which are acknowledged by the client.

With this algorithm, the serve has to send segment three twice, even though it was already successfully delivered, RFC 2081 improves the efficiency of TCP by allowing a client to say which data it has already received during an acknowledgement. The server uses this information to resend only the missing data.

Closing a Connection

To close a TCP connection, a sender transmits a packet with the FIN flag set, indicating that the sender has no more data it wishes to send. After receipt of a FIN segment the receiver should refuse any additional data from the client. Closing a TCP connection is a one-way operation — both sides of the connection must choose to close independently. The sender who closes the transmission must continue to receive data until the receiver also decides to close the connection.

Congestion Avoidance and Control

The basic TCP protocol as described was originally codified in 1981. In 1984, the Ford company found some problems using TCP over wide-area networks and documented those issues in RFC 896 (later incorporated into RFC 1122 and RFC 6633).

In heavily loaded pure datagram networks with end to end retransmission, as switching nodes become congested, the round trip time through the net increases and the count of datagrams in transit within the net also increases. This is normal behavior under load. As long as there is only one copy of each datagram in transit, congestion is under control. Once retransmission of datagrams not yet delivered begins, there is potential for serious trouble.

John Nagle, RFC 896

In other words, if the roundtrip time between two nodes exceeds the maximum retransmission interval for a host, that host will resend the packet because it thinks that it was lost. The effect is that more and more copies of the same data will be sent into the network. Eventually all of the network routers and switches will be flooded with packets and all data being sent into the network will be dropped. This problem was called “congestion collapse”, and is fixed by introducing flow control, congestion control, and congestion avoidance.

Flow Control

The Window field in a TCP segment is the number of bytes that the sender of this segment is willing to receive in a response. Each side of a TCP connection can control how much data it is willing to receive by setting this receive window. If a sender is under heavy load, it can set the window to a low value to decrease pressure on itself, alternatively, if the sender is under light load and can process more information, it can advertise a high receive window indicating that it is ready to receive and process more information. Senders can also advertise a receive window of zero, indicating that the connection cannot keep up and needs time to clear the data in its buffer. The receive window size is transmitted as part of every ACK packet, allowing both sides of a connection to adjust the amount of data they can receive, optimizing their processing capacity.

In the original TCP specification, the receive window was given 16 bits of data, which implies a maximum window size of \(2^{16}\) or 65,535 bytes. This upper bound, though large at the time, doesn’t allow high bandwidth networks to reach optimal performance. RFC 1323 addresses this problem by introducing an option allowing clients and servers to scale the window size, reaching maximum window sizes of up to a gigabyte. During the three-way handshake, nodes can set the window scaling option representing the number of bits to left-shift the 16-bit window size for future ACKS.

Slow Start

Flow control helps senders and receivers match bandwidth rates to avoid overloading one another, but there is still a possibility of overloading the underlying network that transmits packets between senders and receivers. Slow start, described in RFC 2581 forces TCP senders to set a congestion window variable limiting the amount of data a sender can push into the network before receiving an acknowledgement from a receiver. Slow start also prescribes an algorithm for senders to carefully probe the network to see how much data they should be sending.

At the beginning of a new network connection, there is no way for a sender to know the available bandwidth of the network. TCP Slow Start is conservative and allows a sender to transmit four TCP segment before receiving an acknowledgement. After an acknowledgement is received, the sender can transmit eight TCP segments. This pattern continues so that for every acknowledged packet, two new packets can be sent, up to the receive window limit specified by the receiver. From this, we can derive a new rule for TCP senders: the maximum amount of data that is unacknowledged is the minimum of the receive window and the congestion window.

The following figure shows the TCP Slow Start algorithm with an initial congestion window of one. After each successful roundtrip, the congestion window is doubled.

assets/slow-start.png

TCP Slow Start algorithm.

Congestion Avoidance

The TCP protocol can only ever estimate the bandwidth available in the underlying network, and must use packet loss and the congestion window variables to regulate packet flow. The slow start algorithm increases the amount of data sent to a receiver until either the receiver’s window is reached, or packet loss occurs. If packet loss occurs, the TCP congestion avoidance algorithm starts. Congestion avoidance is designed with the assumption that packet loss is an indicator of an overloaded network, and the solution is to limit the amount of data sent by resetting the congestion window variable that senders use to regulate data flow.

Once the congestion window is reset, congestion avoidance specifies its own algorithms for how to grow the window to minimize further loss. At a certain point, another packet loss event will occur, and the process will repeat once over.

High Performance Browser Networking

When packet loss does occur, TCP undergoes a congestion detection phase, followed by a congestion avoidance phase. In the congestion detection phase, the congestion window is reduced, and in the congestion avoidance phase the congestion window is slowly increased. The original TCP algorithm specified a multiplicative decrease, halving the size of the congestion window when congestion is detected, followed by an additive increase, that slowly increases the size of the congestion window by one each round trip.

The following figure, from ResearchGate, shows the varying size of the congestion window in response to packet loss and the multiplicative decrease, additive increase algorithm.

assets/congestion-control.png

Congestion detection and avoidance.

The congestion detection and avoidance algorithm is a key feature of TCP, and its implementation has a large effect on network performance. In many cases, this algorithm is too conservative, and new algorithms have been developed (up to 13 at this point). Congestion detection and avoidance are still an area of ongoing research.

The Core Principles of TCP

TCP is a wonderful thing. It completes a difficult task in a way that is completely transparent to the application. And even though it is complex, the core principles can be easily explained. In particular, the TCP protocol can be distilled down to a few items:

  • A three-way handshake establishes the parameters and settings for every new connection.
  • Slow-start is applied to avoid overloading the network.
  • Flow and congestion control regulate the connection throughput.

There are a lot of details in each individual item, but the essence of the protocol remains simple. TCP has a long history, and because it is a cornerstone of the internet, it is an area of continued interest and research, culminating in many RFCs and a few good books:

See also

comments powered by Disqus