Debugging HTTP 502 errors on Google Kubernetes Engine

This is a walkthrough of how I debugged and fixed intermittent HTTP 502 (Bad Gateway) errors on Google Kubernetes Engine (GKE).

Infrastructure Setup

  • GKE Kubernetes cluster
    • 2 Nodes
    • 1 Deployment, scaled to 2 pods, each running a single Node.js HTTP server application
    • 1 Service
    • 1 GCE Ingress, which manages Google Cloud HTTP Load Balancers via the Kubernetes Ingress API. I was using a Network Endpoint Group (NEG) as the backend, which lets the load balancer connect to the pods directly (a sketch of the Service manifest follows below the figures).
[Figure: The vanilla HTTP Load Balancer architecture. In my setup, NEGs replace Instance Groups.]
[Figure: NEGs with containers.]
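
For context, container-native load balancing with NEGs is enabled through an annotation on the Service. The manifest below is only a sketch of such a Service; the name, selector, and ports are illustrative, not my actual configuration:

apiVersion: v1
kind: Service
metadata:
  name: my-app                                    # hypothetical name
  annotations:
    cloud.google.com/neg: '{"ingress": true}'     # expose the pods to the load balancer via a NEG
spec:
  type: ClusterIP
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080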

Application Server Setup

Requests to the application resulted in HTTP 502s seemingly at random. Running the load-test suite was enough to reproduce the issue almost every time.

The HTTP Load Balancing docs have information about timeouts and retries. The load balancer can keep backend TCP connections idle for up to 10 minutes (600 seconds), so the application server’s idle timeout must be longer than this (hence the 620 seconds below) to avoid a race where the server closes a connection the load balancer is about to reuse. My initial Node.js code to do this was as follows, but it did not resolve the issue.

// https://nodejs.org/api/http.html#http_event_connection
server.on('connection', function(socket) {
  // Set the socket timeout to 620 seconds
  socket.setTimeout(620e3);
});

Checking for Known Issues

There was an open issue on the GCE Ingress GitHub with several ideas.

Some suggestions related to switching from externalTrafficPolicy: Cluster (the default for Services) to externalTrafficPolicy: Local. Without NEGs, the GCE ingress creates an Instance Group targeting all nodes in the cluster, so any node not running a pod of the target Service has to forward traffic to another node that is. Using Network Endpoint Groups avoids this situation, as the pods are targeted directly.
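
For reference, the suggested change is a one-line setting on the Service. This is only a sketch of what it looks like (hypothetical name and ports); I did not end up needing it:

apiVersion: v1
kind: Service
metadata:
  name: my-app                   # hypothetical name
spec:
  type: NodePort
  externalTrafficPolicy: Local   # default is Cluster
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080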

There were also suggestions that nodes might be getting terminated while still receiving traffic (common when using preemptible machine types). That was not the issue in my case.

Checking the Logs

Stackdriver Logging collects logs for much of Google Cloud Platform by default, including HTTP Load Balancers. The following filter surfaces the failing requests:

resource.type="http_load_balancer"
httpRequest.status=502

The jsonPayload.statusDetails field had the value backend_connection_closed_before_data_sent_to_client in all cases, indicating that the backend (my application) had closed the connection unexpectedly.

This was puzzling, since I had set the socket timeout in my application code. From inside the container I opened a telnet session to the server without sending any data, and the connection was indeed closed after 620 seconds, indicating that the idle timeout was set correctly.
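
The same check can also be scripted. Here is a minimal sketch, not the exact check I ran; the host, port, and use of Node’s net module are assumptions:

// idle-check.js: open a raw TCP connection, send nothing, and time how long
// the server keeps the idle connection open before closing it.
const net = require('net');

const start = Date.now();
const socket = net.connect({ host: '127.0.0.1', port: 8080 }, () => {
  console.log('connected; sending nothing');
});

socket.on('close', () => {
  const elapsed = (Date.now() - start) / 1000;
  console.log(`server closed the idle connection after ${elapsed}s`);
});

socket.on('error', (err) => console.error('socket error:', err.message));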

The Server’s View

To see what was happening to these failed requests from the server’s view, I installed tshark (the CLI companion to Wireshark). I scaled down the deployment to a single pod, monitored the network traffic during a load-test run, and saved the output to a pcap file. kubectl cp makes it blessedly easy to download files from Kubernetes containers. I then opened the pcap file locally in the Wireshark GUI.

Looking for HTTP 502 errors in the trace would not be fruitful, because those errors were sent by the load balancer, not the server. Instead, I tagged each request with a random X-My-Uuid header and logged the UUIDs of failed requests during the load-test run.
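
A minimal sketch of that kind of tagging, assuming Node’s built-in http client; the target host, port, path, and request rate are placeholders rather than my actual load-test setup:

// tag-requests.js: send requests tagged with a random X-My-Uuid header and
// log the ID of any request that fails, so it can be found in the capture.
const http = require('http');
const crypto = require('crypto');

const TARGET_HOST = process.env.TARGET_HOST || '203.0.113.10'; // load balancer IP (placeholder)

function fireRequest() {
  const uuid = crypto.randomBytes(16).toString('hex');
  const req = http.get(
    { host: TARGET_HOST, port: 80, path: '/', headers: { 'X-My-Uuid': uuid } },
    (res) => {
      if (res.statusCode !== 200) {
        console.log(`failed uuid=${uuid} status=${res.statusCode}`);
      }
      res.resume(); // drain the response body
    }
  );
  req.on('error', (err) => console.log(`failed uuid=${uuid} error=${err.message}`));
}

// one request per second; the real load test was much heavier
setInterval(fireRequest, 1000);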

Using a failed UUID as a display filter in Wireshark let me track down one of the failed requests. I then filtered the trace to only show packets from the same TCP connection.

[Figure: Wireshark view of the TCP connection containing the failed request. The second column is elapsed time in seconds. 130.211.0.143 is the load balancer; 10.56.5.6 is my server.]

Two requests were served correctly in the space of 4 seconds. The failed request came 5 seconds later and resulted in a TCP RST from the server, closing the connection. This is the backend_connection_closed_before_data_sent_to_client seen in the Stackdriver logs.

Debugging the server and looking through the Node.js source for the HTTP server module turned up the following likely-looking code in the ‘response finished’ callback:

// ...
} else if (state.outgoing.length === 0) {
    if (server.keepAliveTimeout && typeof socket.setTimeout === 'function') {
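      // [note: setTimeout(0) clears the previously set socket timeout before the (default 5 s) keepAliveTimeout is applied]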
      socket.setTimeout(0);
      socket.setTimeout(server.keepAliveTimeout);
      state.keepAliveTimeoutSet = true;
    }
  }
// ...

server.keepAliveTimeout (default 5 seconds) was replacing the socket timeout I had set in the connection event listener every time a response finished! Apparently this default keep-alive timeout was new in Node 8; there used to be no default timeout at all.

Overriding the default keep-alive timeout as follows resolved the issue:

server.keepAliveTimeout = 620e3;
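
Putting the two settings together, here is a minimal sketch of the resulting server setup; the request handler and port are placeholders, not my actual application:

const http = require('http');

const server = http.createServer((req, res) => {
  res.end('ok'); // placeholder handler
});

// Keep idle sockets open longer than the load balancer's 10-minute idle timeout.
server.on('connection', (socket) => {
  socket.setTimeout(620e3);
});

// Without this, Node's default 5-second keepAliveTimeout replaces the socket
// timeout above as soon as a response finishes.
server.keepAliveTimeout = 620e3;

server.listen(8080);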

I made a PR to the connection event listener docs to hopefully save someone some time in future.

