How We Upgraded to NGINX 1.25 Ingress Controller in Our EKS Clusters

Introduction

In one of our Kubernetes-centric projects here at QloudX, we host hundreds of containerized applications on Amazon’s Elastic Kubernetes Service (EKS). For most EKS users, the AWS Load Balancer Controller (LBC) would be the first choice of ingress controller. But since LBC creates listener rules in AWS load balancers for ingresses in EKS, it wasn’t a viable solution for us. We wanted all EKS apps to use a single Application Load Balancer (ALB) & given the hundreds of apps in our cluster, we would have hit several hard limits of the ALB, such as the number of listeners per ALB, had we tried to use the LBC. So we chose NGINX as our ingress controller.

As the (Kubernetes) platform engineering team, we update our EKS clusters & all EKS addons (including NGINX) every quarter. However, NGINX 1.25 introduced a breaking change in the way it handles HTTP headers. This meant that upgrading the NGINX ingress controller would not be as simple as updating its Helm chart this time around. This is the story of how we implemented the NGINX 1.25 upgrade in our EKS clusters, including a self-service mechanism that lets app developers migrate their apps to the new NGINX at their own pace, while testing their apps for compatibility along the way.

Breaking Change in NGINX

In NGINX v1.25+, upstream responses with duplicate Content-Length or Transfer-Encoding headers are rejected. Responses with an invalid Content-Length or Transfer-Encoding header are rejected too, as are responses that carry both Content-Length & Transfer-Encoding.
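
To make these rules concrete, here is a minimal Python sketch that flags the header combinations described above. It is an approximation for illustration, not NGINX's actual parser: the function name find_header_problems is hypothetical, and the validity check only covers Content-Length being a non-negative integer.

```python
from collections import Counter

# Headers that may appear at most once in an upstream response.
SINGLETON_HEADERS = ("content-length", "transfer-encoding")


def find_header_problems(headers):
    """Return a list of problems for a list of (name, value) header pairs.

    Hypothetical helper: it approximates the rules described in the text.
    """
    problems = []
    counts = Counter(name.strip().lower() for name, _ in headers)

    # Rule 1: duplicate Content-Length or Transfer-Encoding headers.
    for name in SINGLETON_HEADERS:
        if counts[name] > 1:
            problems.append(f"duplicate {name} ({counts[name]} occurrences)")

    # Rule 2: both headers present in the same response.
    if counts["content-length"] and counts["transfer-encoding"]:
        problems.append("both content-length and transfer-encoding present")

    # Rule 3 (simplified): Content-Length must be a non-negative integer.
    for name, value in headers:
        if name.strip().lower() == "content-length":
            text = value.strip()
            if not (text.isascii() and text.isdigit()):
                problems.append(f"invalid content-length value: {value!r}")

    return problems
```

The input is a plain list of (name, value) pairs, so duplicates are preserved. For a meaningful result, the headers should come from the app itself (for example, by calling its cluster-internal service endpoint) rather than through NGINX.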

This change in NGINX’s behavior revealed “duplicate header” issues in a number of our critical EKS apps. As soon as we upgraded NGINX, those apps stopped working & returned an “HTTP 502 Bad Gateway” error. The error showed up when opening the app URL in a browser, in the browser’s network inspector while navigating the app’s web UI, or in an API response from an app’s endpoint.

When NGINX drops an upstream response, it logs this error in its pod logs:

[error] upstream sent duplicate header line: Transfer-Encoding: chunked, previous value: Transfer-Encoding: chunked while reading response header from upstream, client: 10.x.y.z, server: my-app.example.com, request: GET /my-app-ui
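
To find out which apps are affected, these log lines can be summarized per server name. The following is a minimal sketch that assumes the log format shown above (with or without quotes around the header line); the function name summarize_duplicate_headers is hypothetical.

```python
import re
from collections import Counter

# Matches the offending header name and the server name in the error line.
# The optional quote allows for log lines that quote the header line.
DUPLICATE_HEADER_RE = re.compile(
    r'upstream sent duplicate header line: "?(?P<header>[A-Za-z0-9-]+):'
    r'.*?server: (?P<server>[^,\s]+)'
)


def summarize_duplicate_headers(log_lines):
    """Count duplicate-header errors per (server, header) pair."""
    summary = Counter()
    for line in log_lines:
        match = DUPLICATE_HEADER_RE.search(line)
        if match:
            key = (match.group("server"), match.group("header").lower())
            summary[key] += 1
    return summary
```

Feeding the NGINX pod logs to this function line by line gives a count of errors per app domain & header, which helps decide which teams to contact first.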

Running 2 NGINX Ingress Controllers in Parallel

Our solution was to deploy the new NGINX as an additional ingress controller, so that our EKS clusters had both the old & new versions of NGINX running in parallel. All we needed now was a way for developers to migrate their apps from old to new NGINX.

We run NGINX as a DaemonSet with a NodePort service. This makes it very easy on the ALB end to forward traffic to NGINX: simply register the EKS node group’s auto-scaling group as a backend for the ALB & configure the ALB’s target group to always forward all traffic to the node port that NGINX is listening on. This keeps all routing logic centralized within NGINX, with none on the ALB.

We manage all our EKS addons including NGINX as Helm releases deployed by Flux CD, so deploying a new NGINX is a matter of duplicating the existing Helm release with new Helm values:

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: ingress-nginx
  namespace: ingress-nginx
spec:
  chart:
    spec:
      chart: ingress-nginx
      version: 4.9.1
      sourceRef:
        kind: HelmRepository
        name: ingress-nginx
        namespace: ingress-nginx
  values:
    controller:
      kind: DaemonSet
      ingressClassResource:
        name: nginx
      service:
        type: NodePort
        nodePorts:
          http: 11080
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: ingress-nginx-new
  namespace: ingress-nginx-new
spec:
  chart:
    spec:
      chart: ingress-nginx
      version: 4.10.1
      sourceRef:
        kind: HelmRepository
        name: ingress-nginx
        namespace: ingress-nginx
  values:
    controller:
      kind: DaemonSet
      ingressClassResource:
        name: nginx-new
      service:
        type: NodePort
        nodePorts:
          http: 12080

Now there exist 2 ingress classes in the cluster: nginx & nginx-new. Switching an app from old to new NGINX, then, is a matter of changing its ingress class & ALB target group:

  • Every app is a Helm chart with its ingress class exposed as a Helm value, which the developers can provide as a parameter to their app deployment pipelines
  • And since we manage our AWS infra using Terraform, changing ALB target groups is just a matter of raising a pull request with your app domain

The following Terraform config adds listener rules to the ALB to forward traffic for selected apps to new NGINX, while respecting the ALB’s “5 hosts per listener rule” limit:

locals {
  # No more than 5 hosts are allowed per listener rule
  host_headers = chunklist(var.host_headers, 5)
}

resource "aws_lb_listener_rule" "listener_rule" {
  count        = length(local.host_headers)
  listener_arn = data.aws_lb_listener.listener.arn

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.target_group.arn
  }

  condition {
    host_header {
      values = local.host_headers[count.index]
    }
  }
}

var.host_headers is the complete list of all app domains on new NGINX & chunklist() breaks this list down into chunks of at most 5 hosts, one chunk per listener rule.
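
To illustrate the chunking, here is a small Python sketch that mimics what chunklist() does for a positive chunk size. The host names are made-up examples & the Python function is only an illustration of the behavior, not Terraform's implementation.

```python
def chunklist(items, size):
    """Split a list into consecutive chunks of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive in this sketch")
    return [items[i:i + size] for i in range(0, len(items), size)]


# Seven hypothetical app domains that have moved to new NGINX.
host_headers = [f"app-{n}.example.com" for n in range(1, 8)]

# Each chunk becomes the host_header values of one listener rule.
chunks = chunklist(host_headers, 5)
for index, chunk in enumerate(chunks):
    print(f"listener rule {index}: {len(chunk)} hosts")
```

With 7 domains, this yields 2 listener rules: the first with 5 hosts & the second with the remaining 2.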

Test App on New NGINX

Use cURL to confirm that the app domain is now served by new NGINX.

Before migrating to new NGINX:

curl --silent --head https://my-app.example.com | grep -i nginx
server: nginx/1.21.6

After migrating to new NGINX:

curl --silent --head https://my-app.example.com | grep -i nginx
server: nginx/1.25.3
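
The same check can be scripted across many app domains. The sketch below assumes the Server header exposes the NGINX version as in the output above; the function names are hypothetical & the 1.25.0 threshold is an assumption based on the versions shown.

```python
import re
import urllib.error
import urllib.request


def parse_nginx_version(server_header):
    """Parse 'nginx/1.25.3' into (1, 25, 3); return None if it does not match."""
    text = (server_header or "").strip()
    match = re.fullmatch(r"nginx/(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_new_nginx(server_header, minimum=(1, 25, 0)):
    """True if the Server header reports an NGINX version at or above `minimum`."""
    version = parse_nginx_version(server_header)
    return version is not None and version >= minimum


def fetch_server_header(url, timeout=10):
    """Send a HEAD request and return the Server header.

    HTTP error responses (such as a 502) still carry headers, so they are
    read from the raised HTTPError instead of being treated as failures.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.headers.get("Server")
    except urllib.error.HTTPError as error:
        return error.headers.get("Server")
```

For example, is_new_nginx(fetch_server_header("https://my-app.example.com")) reports whether the domain is already served by new NGINX.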

At this point, the developers can start testing their apps, looking for the HTTP 502 error & fixing the duplicate header issues in their app code, if any.

FAQs

🔴 Does this issue affect incoming requests from clients to EKS apps, or just the responses from EKS apps to clients?

🟢 Just the responses. Even if the requests have duplicate headers, NGINX will let the request flow through to the app in EKS. If the response from the EKS app has duplicate headers, NGINX will not forward it to the client!

🔴 Does this issue affect service-to-service communication inside EKS, or just external traffic?

🟢 External traffic is affected. Service-to-service communication inside EKS is unaffected assuming they call each other using their cluster-internal service endpoints like my-service.my-namespace.svc.cluster.local. But if they use their ingress URLs to call each other, like my-service.example.com, then this traffic flows through NGINX & is affected as well!
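
When auditing an app's configuration for such calls, a rough heuristic can separate cluster-internal URLs from ingress URLs. This sketch assumes the default cluster domain cluster.local & only recognizes host names ending in .svc or .svc.cluster.local; shorter forms such as my-service.my-namespace are not detected. The function name is hypothetical.

```python
from urllib.parse import urlsplit

# Suffixes of cluster-internal service host names (default cluster domain assumed).
CLUSTER_INTERNAL_SUFFIXES = (".svc", ".svc.cluster.local")


def looks_cluster_internal(url):
    """True if the URL's host looks like a cluster-internal service endpoint."""
    host = (urlsplit(url).hostname or "").rstrip(".")
    if not host:
        raise ValueError(f"no host name found in URL: {url!r}")
    return host.endswith(CLUSTER_INTERNAL_SUFFIXES)
```

URLs for which this returns False are candidates for traffic that flows through NGINX & should be tested against the new version.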

About the Author ✍🏻

Harish KM is a Principal DevOps Engineer at QloudX. 👨🏻‍💻

With over a decade of industry experience as everything from a full-stack engineer to a cloud architect, Harish has built many world-class solutions for clients around the world! 👷🏻‍♂️

With over 20 certifications in cloud (AWS, Azure, GCP), containers (Kubernetes, Docker) & DevOps (Terraform, Ansible, Jenkins), Harish is an expert in a multitude of technologies. 📚

These days, his focus is on the fascinating world of DevOps & how it can transform the way we do things! 🚀
