A WebSocket is a persistent connection between a client and server. WebSockets provide a bidirectional, full-duplex communications channel that operates over HTTP through a single TCP/IP socket connection. At its core, the WebSocket protocol facilitates message passing between a client and server. This article provides an introduction to the WebSocket protocol, including what problem WebSockets solve, and an overview of how WebSockets are described at the protocol level.
Why WebSocket?
The idea of WebSockets was born out of the limitations of HTTP-based technology. With HTTP, a client requests a resource, and the server responds with the requested data. HTTP is a strictly unidirectional protocol: any data sent from the server to the client must first be requested by the client. Long-polling has traditionally acted as a workaround for this limitation. With long-polling, a client makes an HTTP request with a long timeout period, and the server holds that request open until it has data to push to the client. Long-polling works, but comes with a drawback: resources on the server are tied up throughout the length of the long-poll, even when no data is available to send.
WebSockets, on the other hand, allow for sending message-based data, similar to UDP, but with the reliability of TCP. WebSocket uses HTTP as the initial transport mechanism, but keeps the TCP connection alive after the HTTP response is received so that it can be used for sending messages between client and server. WebSockets allow us to build “real-time” applications without the use of long-polling.
Protocol Overview
The protocol consists of an opening handshake followed by basic message framing, layered over TCP.
WebSockets begin life as a standard HTTP request and response. Within that request/response exchange, the client asks to open a WebSocket connection, and the server responds (if it's able to). If this initial handshake is successful, the client and server have agreed to use the existing TCP/IP connection that was established for the HTTP request as a WebSocket connection. Data can now flow over this connection using a basic framed message protocol. Once both parties acknowledge that the WebSocket connection should be closed, the TCP connection is torn down.
Establishing a WebSocket connection — The WebSocket Open Handshake
WebSockets do not use the http:// or https:// scheme (because they do not follow the HTTP protocol). Rather, WebSocket URIs use a new scheme, ws: (or wss: for a secure WebSocket). The remainder of the URI is the same as an HTTP URI: a host, port, path and any query parameters.
"ws:" "//" host [ ":" port ] path [ "?" query ]
"wss:" "//" host [ ":" port ] path [ "?" query ]
WebSocket connections can only be established to URIs that follow this scheme. That is, if you see a URI with a scheme of ws:// (or wss://), then both the client and the server MUST follow the connection protocol defined by the WebSocket specification.
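The URI grammar above can be sketched in a few lines of Python. This is a minimal illustration, not a full validator: the function name parse_ws_uri is hypothetical, and the default ports (80 for ws, 443 for wss) come from RFC 6455 rather than from this article.

```python
from urllib.parse import urlsplit

# Default ports from RFC 6455: 80 for ws, 443 for wss.
DEFAULT_PORTS = {"ws": 80, "wss": 443}


def parse_ws_uri(uri):
    """Split a WebSocket URI into (secure, host, port, resource)."""
    parts = urlsplit(uri)
    if parts.scheme not in DEFAULT_PORTS:
        raise ValueError(f"not a WebSocket URI: {uri!r}")
    if not parts.hostname:
        raise ValueError(f"missing host in {uri!r}")
    if parts.fragment:
        # RFC 6455 does not allow fragment identifiers in ws/wss URIs.
        raise ValueError("WebSocket URIs must not contain a fragment")
    port = parts.port if parts.port is not None else DEFAULT_PORTS[parts.scheme]
    # The resource is the path plus the optional query string.
    resource = parts.path or "/"
    if parts.query:
        resource += "?" + parts.query
    return parts.scheme == "wss", parts.hostname, port, resource
```

For example, parse_ws_uri("ws://example.com:8181/") returns (False, "example.com", 8181, "/").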
WebSocket connections are established by upgrading an HTTP request/response pair. A client that supports WebSockets and wants to establish a connection will send an HTTP request that includes a few required headers:
Connection: Upgrade
- The Connection header generally controls whether or not the network connection stays open after the current transaction finishes. A common value for this header is keep-alive, which keeps the connection persistent to allow for subsequent requests to the same server. During the WebSocket opening handshake we set the header to Upgrade, signaling that we want to keep the connection alive and use it for non-HTTP requests.
Upgrade: websocket
- The Upgrade header is used by clients to ask the server to switch to one of the listed protocols, in descending preference order. We specify websocket here to signal that the client wants to establish a WebSocket connection.
Sec-WebSocket-Key: q4xkcO32u266gldTuKaSOw==
- The Sec-WebSocket-Key is a one-time random value (a nonce) generated by the client. The value is a randomly selected 16-byte value that has been base64-encoded.
Sec-WebSocket-Version: 13
- The only accepted version of the WebSocket protocol is 13. Any other version listed in this header is invalid.
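As a rough sketch of how a server might check these four headers, the following Python function inspects an already-parsed header dictionary. The function name is_websocket_upgrade and the dict-based input are assumptions made for illustration; real servers also handle details this sketch skips, such as an Upgrade header that lists several protocols.

```python
import base64


def is_websocket_upgrade(headers):
    """Return True if a dict of request headers asks for a WebSocket upgrade."""
    # HTTP header names are case-insensitive, so normalise them first.
    h = {name.lower(): value.strip() for name, value in headers.items()}
    # Connection may carry several tokens, e.g. "keep-alive, Upgrade".
    connection = {token.strip().lower() for token in h.get("connection", "").split(",")}
    try:
        # The key must be valid base64 that decodes to exactly 16 bytes.
        key = base64.b64decode(h.get("sec-websocket-key", ""), validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return False
    return (
        "upgrade" in connection
        and h.get("upgrade", "").lower() == "websocket"
        and len(key) == 16
        and h.get("sec-websocket-version") == "13"
    )
```

Splitting the Connection value on commas matters in practice, because some clients send keep-alive and Upgrade together in one header.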
Together, these headers would result in an HTTP GET request from the client to a ws:// URI like in the following example:
GET ws://example.com:8181/ HTTP/1.1
Host: example.com:8181
Connection: Upgrade
Pragma: no-cache
Cache-Control: no-cache
Upgrade: websocket
Sec-WebSocket-Version: 13
Sec-WebSocket-Key: q4xkcO32u266gldTuKaSOw==
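A client could assemble a request like the one above along these lines. This is a minimal sketch: build_handshake_request is a hypothetical helper, it sends only the required headers, and it writes the request line with a plain path (GET / HTTP/1.1) rather than the full ws:// URI shown in the example.

```python
import base64
import os


def build_handshake_request(host, port, resource="/"):
    """Return (request_bytes, key) for a WebSocket opening handshake."""
    # Sec-WebSocket-Key: 16 random bytes, base64-encoded.
    key = base64.b64encode(os.urandom(16)).decode("ascii")
    lines = [
        f"GET {resource} HTTP/1.1",
        f"Host: {host}:{port}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
        "",  # blank line that terminates the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii"), key
```

The function returns the key alongside the request bytes because the client needs it later to verify the server's reply.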
Once a client sends the initial request to open a WebSocket connection, it waits for the server’s reply. The reply must have an HTTP 101 Switching Protocols response code. The HTTP 101 Switching Protocols response indicates that the server is switching to the protocol that the client requested in its Upgrade request header. In addition, the server must include HTTP headers that validate the connection was successfully upgraded:
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: fA9dggdnMPU79lJgAE3W4TRnyDM=
Connection: Upgrade
- Confirms that the connection has been upgraded.
Upgrade: websocket
- Confirms that the connection has been upgraded.
Sec-WebSocket-Accept: fA9dggdnMPU79lJgAE3W4TRnyDM=
- Sec-WebSocket-Accept is a base64-encoded, SHA-1 hashed value. The server generates this value by concatenating the client's Sec-WebSocket-Key nonce and the static value 258EAFA5-E914-47DA-95CA-C5AB0DC85B11 defined in RFC 6455, hashing the result with SHA-1, and base64-encoding the digest. Although the Sec-WebSocket-Key and Sec-WebSocket-Accept headers seem complicated, they exist so that both the client and the server can know that their counterpart supports WebSockets. Since WebSocket reuses the HTTP connection, there are potential security concerns if either side interprets WebSocket data as an HTTP request.
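The Sec-WebSocket-Accept calculation is short enough to sketch directly. The function name compute_accept is made up for this example, and the sample key used in the usage note is the one from RFC 6455 rather than the key shown in this article.

```python
import base64
import hashlib

# Static GUID defined in RFC 6455.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def compute_accept(key):
    """Derive the Sec-WebSocket-Accept value from a Sec-WebSocket-Key."""
    # Concatenate the key (as sent, still base64 text) with the GUID,
    # hash with SHA-1, then base64-encode the raw 20-byte digest.
    digest = hashlib.sha1((key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")
```

With the sample nonce from RFC 6455, compute_accept("dGhlIHNhbXBsZSBub25jZQ==") returns "s3pPLMBiTxaQ9kYGzzhZRbK+xOo=". Note that the key is not base64-decoded before hashing; the header text itself is what gets concatenated.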
After the client receives the server response, the WebSocket connection is open to start transmitting data.
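Before treating the connection as open, a client would typically check the server's reply against the key it sent. The sketch below assumes the raw response headers are available as a string; handshake_succeeded is a hypothetical name, and a production client would use a proper HTTP parser instead of splitting lines by hand.

```python
import base64
import hashlib

# Static GUID defined in RFC 6455.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def handshake_succeeded(response_text, key):
    """Check a raw HTTP response (status line + headers) against our key."""
    head = response_text.split("\r\n\r\n", 1)[0]
    status_line, *header_lines = head.split("\r\n")
    fields = status_line.split(" ", 2)
    # The status code must be 101 (Switching Protocols).
    if len(fields) < 2 or fields[1] != "101":
        return False
    headers = {}
    for line in header_lines:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    # Recompute the value the server should have sent back.
    expected = base64.b64encode(
        hashlib.sha1((key + WS_GUID).encode("ascii")).digest()
    ).decode("ascii")
    connection = [t.strip().lower() for t in headers.get("connection", "").split(",")]
    return (
        headers.get("upgrade", "").lower() == "websocket"
        and "upgrade" in connection
        and headers.get("sec-websocket-accept") == expected
    )
```

If any of the three checks fails, the client should treat the handshake as failed and close the connection rather than start sending frames.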
The WebSocket Protocol
WebSocket is a framed protocol, meaning that a chunk of data (a message) is divided into a number of discrete chunks (frames), with the size of each chunk encoded in the frame. The frame includes a frame type, a payload length, and a data portion. An overview of the frame is given in RFC 6455 and reproduced here.
     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-------+-+-------------+-------------------------------+
    |F|R|R|R| opcode|M| Payload len |    Extended payload length    |
    |I|S|S|S|  (4)  |A|     (7)     |             (16/64)           |
    |N|V|V|V|       |S|             |   (if payload len==126/127)   |
    | |1|2|3|       |K|             |                               |
    +-+-+-+-+-------+-+-------------+ - - - - - - - - - - - - - - - +
    |     Extended payload length continued, if payload len == 127  |
    + - - - - - - - - - - - - - - - +-------------------------------+
    |                               |Masking-key, if MASK set to 1  |
    +-------------------------------+-------------------------------+
    | Masking-key (continued)       |          Payload Data         |
    +-------------------------------- - - - - - - - - - - - - - - - +
    :                     Payload Data continued ...                :
    + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - +
    |                     Payload Data continued ...                |
    +---------------------------------------------------------------+
I won’t cover every piece of the frame protocol here. Refer to RFC 6455 for full details. Rather, I will cover the most important bits so that we can gain an understanding of the WebSocket protocol.
Fin Bit
The first bit of the WebSocket header is the Fin bit. This bit is set if this frame is the last data to complete this message.
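To make the role of the Fin bit concrete, here is a small sketch of how one message might be split across several frames. The fragment helper and its (fin, opcode, chunk) tuples are illustrative assumptions; it relies on the rule from RFC 6455 that only the first frame carries the real opcode, while later frames use the continuation opcode 0x0.

```python
OPCODE_CONTINUATION = 0x0
OPCODE_TEXT = 0x1


def fragment(payload, chunk_size, opcode=OPCODE_TEXT):
    """Yield (fin, opcode, chunk) tuples describing the frames of one message."""
    # Slice the payload; an empty message still produces one (empty) frame.
    chunks = [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)] or [b""]
    for index, chunk in enumerate(chunks):
        fin = index == len(chunks) - 1  # Fin is set only on the last frame
        frame_opcode = opcode if index == 0 else OPCODE_CONTINUATION
        yield fin, frame_opcode, chunk
```

For example, list(fragment(b"hello world", 4)) yields [(False, 1, b"hell"), (False, 0, b"o wo"), (True, 0, b"rld")]. A message that fits in one frame simply has Fin set on its first and only frame.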
RSV1, RSV2, RSV3 Bits
These bits are reserved for future use.
opcode
Every frame has an opcode that determines how to interpret this frame’s payload data.
Opcode value | Description |
---|---|
0x00 | This frame continues the payload from the previous frame. |
0x01 | Denotes a text frame. Text frames are UTF-8 decoded by the receiver. |
0x02 | Denotes a binary frame. Binary frames are delivered unchanged to the receiver. |
0x03-0x07 | Reserved for future use. |
0x08 | Denotes that the sender wishes to close the connection. |
0x09 | A ping frame. Serves as a heartbeat mechanism ensuring the connection is still alive. The receiver must respond with a pong. |
0x0a | A pong frame. Sent in response to a ping, or unsolicited as a one-way heartbeat. The receiver does not need to respond. |
0x0b-0x0f | Reserved for future use. |
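The Fin bit, the three reserved bits, and the opcode all live in the first byte of the frame, so they can be pulled apart with a few bit operations. The function and dictionary names below are made up for this sketch.

```python
OPCODE_NAMES = {
    0x0: "continuation",
    0x1: "text",
    0x2: "binary",
    0x8: "close",
    0x9: "ping",
    0xA: "pong",
}


def decode_first_byte(byte):
    """Split the first frame byte into (fin, rsv, opcode_name)."""
    fin = bool(byte & 0x80)   # top bit
    rsv = (byte >> 4) & 0x07  # RSV1-RSV3 as a 3-bit number
    opcode = byte & 0x0F      # low four bits
    return fin, rsv, OPCODE_NAMES.get(opcode, "reserved")
```

For example, decode_first_byte(0x81) returns (True, 0, "text"): a final frame, no reserved bits set, carrying text.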
Mask
Setting this bit to 1 enables masking. WebSockets require that all payload data sent from the client to the server be obfuscated using a random key (the mask) chosen by the client; frames sent from the server to the client are not masked. The masking key is combined with the payload data using an XOR operation before the data is sent to the server. This masking prevents caches from misinterpreting WebSocket frames as cacheable data. Why should we prevent caching of WebSocket data? Security.
During development of the WebSocket protocol, it was shown that if a compromised server is deployed, and clients connect to that server, it is possible to have intermediate proxies or infrastructure cache the responses of the compromised server so that future clients requesting that data receive the incorrect response. This attack is called cache poisoning, and results from the fact that we cannot control how misbehaving proxies behave in the wild. This is especially problematic when introducing a new protocol like WebSocket that has to interact with the existing infrastructure of the internet.
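The masking step itself is a byte-wise XOR against the 4-byte key, repeating the key as needed. A minimal sketch, with apply_mask as a hypothetical helper name:

```python
def apply_mask(data, masking_key):
    """XOR each payload byte with the masking key, cycling through its 4 bytes."""
    if len(masking_key) != 4:
        raise ValueError("masking key must be exactly 4 bytes")
    return bytes(b ^ masking_key[i % 4] for i, b in enumerate(data))
```

Because XOR is its own inverse, the same function both masks and unmasks: apply_mask(apply_mask(data, key), key) returns the original data. In a real client the key would be freshly generated for every frame, for instance with os.urandom(4).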
Payload len
The Payload len field and Extended payload length field are used to encode the total length of the payload data for this frame. If the payload data is small (under 126 bytes), the length is encoded in the Payload len field. For larger payloads, Payload len is set to 126 and the length follows as a 16-bit integer, or it is set to 127 and the length follows as a 64-bit integer.
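The length rules can be sketched as a small encoder. The name encode_length is an assumption for this example; the mask bit is included because it shares a byte with the 7-bit Payload len field, and the extended lengths are written in network byte order as RFC 6455 specifies.

```python
import struct


def encode_length(length, masked=False):
    """Encode the MASK bit, Payload len and any extended length bytes."""
    if length < 0:
        raise ValueError("length must be non-negative")
    mask_bit = 0x80 if masked else 0x00
    if length < 126:
        # Small payloads fit directly in the 7-bit field.
        return bytes([mask_bit | length])
    if length < 1 << 16:
        # 126 signals that a 16-bit length follows.
        return bytes([mask_bit | 126]) + struct.pack("!H", length)
    if length < 1 << 63:
        # 127 signals that a 64-bit length follows (top bit must be 0).
        return bytes([mask_bit | 127]) + struct.pack("!Q", length)
    raise ValueError("payload too large for a single frame")
```

For example, encode_length(5) is the single byte 0x05, while encode_length(256) is the three bytes 0x7e 0x01 0x00.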
Masking-key
As discussed with the MASK bit, all frames sent from the client to the server are masked by a 32-bit value that is contained within the frame. This field is present if the mask bit is set to 1 and is absent if the mask bit is set to 0.
Payload data
The Payload data includes arbitrary application data and any extension data that has been negotiated between the client and the server. Extensions are negotiated during the initial handshake and allow you to extend the WebSocket protocol for additional uses.
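Putting the fields together, a decoder for one complete frame might look like the sketch below. It assumes the whole frame is already in memory and ignores extensions and the reserved bits; parse_frame is a hypothetical name, and a real implementation would read incrementally from the socket.

```python
import struct


def parse_frame(buf):
    """Parse one complete frame from bytes; return (fin, opcode, payload)."""
    if len(buf) < 2:
        raise ValueError("need at least 2 header bytes")
    fin = bool(buf[0] & 0x80)
    opcode = buf[0] & 0x0F
    masked = bool(buf[1] & 0x80)
    length = buf[1] & 0x7F
    offset = 2
    try:
        if length == 126:
            (length,) = struct.unpack_from("!H", buf, offset)
            offset += 2
        elif length == 127:
            (length,) = struct.unpack_from("!Q", buf, offset)
            offset += 8
    except struct.error:
        raise ValueError("truncated extended length") from None
    key = b""
    if masked:
        key = buf[offset:offset + 4]
        if len(key) != 4:
            raise ValueError("truncated masking key")
        offset += 4
    payload = buf[offset:offset + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    if masked:
        # Unmask with the same XOR used by the sender.
        payload = bytes(b ^ key[i % 4] for i, b in enumerate(payload))
    return fin, opcode, payload
```

Both sample frames from RFC 6455 decode to the text "Hello": the unmasked bytes 81 05 48 65 6c 6c 6f and the masked bytes 81 85 37 fa 21 3d 7f 9f 4d 51 58 each return (True, 1, b"Hello").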
Closing a WebSocket connection — The WebSocket Close Handshake
To close a WebSocket connection, a closing frame is sent (opcode 0x08). In addition to the opcode, the close frame may contain a body that indicates the reason for closing. If either side of a connection receives a close frame, it must send a close frame in response, and no more data should be sent over the connection. Once the close frame has been received by both parties, the TCP connection is torn down. In normal cases, the server is the side that closes the TCP connection first.
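The body of a close frame has a simple layout defined in RFC 6455: an optional 2-byte status code in network byte order, followed by an optional UTF-8 reason. The helpers below are a sketch with hypothetical names; the 123-byte limit on the reason follows from the 125-byte cap that RFC 6455 places on control frame payloads.

```python
import struct

NORMAL_CLOSURE = 1000  # status code for a normal close in RFC 6455


def build_close_payload(code=NORMAL_CLOSURE, reason=""):
    """Build the body of a close frame: 2-byte status code plus UTF-8 reason."""
    reason_bytes = reason.encode("utf-8")
    if len(reason_bytes) > 123:
        raise ValueError("close reason too long for a control frame")
    return struct.pack("!H", code) + reason_bytes


def parse_close_payload(payload):
    """Return (code, reason); code is None when the body is empty."""
    if not payload:
        return None, ""
    if len(payload) < 2:
        raise ValueError("close body must be empty or at least 2 bytes")
    (code,) = struct.unpack("!H", payload[:2])
    return code, payload[2:].decode("utf-8")
```

For example, build_close_payload(1000, "bye") produces the bytes 03 e8 62 79 65, and parse_close_payload turns them back into (1000, "bye").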
More References
This article provides an introduction to the WebSocket protocol, and covers a lot of ground. However, the full protocol has more detail than what I could fit into this blog post. If you want to learn more, there are several great resources to choose from: