
fix(api,task-processor): Close shutdown drain gap and name container ports#533

Open
germangarces wants to merge 3 commits into main from fix/drain-timers


@germangarces germangarces commented May 13, 2026

Two small, independent fixes to the api and task-processor deployment templates.

1. Close shutdown drain gap

Default `terminationGracePeriodSeconds` to 75 and add a `preStop` hook that runs `sleep 20` on the API container. This gives the load balancer time to finish deregistering the pod before gunicorn stops accepting connections, so rolling deploys and HPA scale-downs no longer cause a brief 5xx spike.
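
As a rough illustration of where these two settings sit in a rendered pod template, here is a minimal sketch. The chart's actual template is not shown in this conversation, so the container name and image are placeholders; only the `terminationGracePeriodSeconds` value and the `preStop` command come from this PR.

```yaml
# Illustrative Deployment fragment, not the chart's real template.
spec:
  template:
    spec:
      # Total time the kubelet waits (preStop included) before SIGKILL.
      terminationGracePeriodSeconds: 75
      containers:
        - name: api                    # hypothetical container name
          image: example/api:latest    # placeholder image
          lifecycle:
            preStop:
              exec:
                # Runs before SIGTERM is sent to the container, so the
                # process keeps serving while the LB deregisters the pod.
                command: ["sleep", "20"]
```

Note that the `exec` form assumes a `sleep` binary is available in the container image.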

2. Name container ports

The api and task-processor container ports were unnamed. PodMonitoring resources that reference them by name (`port: http` or `port: prom`) silently scraped nothing as a result. Adding names fixes that. Existing Service and ServiceMonitor resources are unaffected: they reference ports numerically or via the Service's own port name.
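
To make the name-based lookup concrete, here is a hedged sketch of named container ports alongside a PodMonitoring resource that selects one of them. The port names and the 9100 Prometheus port come from this PR; the `containerPort: 8000` value, the resource name, and the label selector are assumptions for illustration only.

```yaml
# Container ports fragment (belongs under a container in the pod spec).
ports:
  - name: http
    containerPort: 8000    # assumed value, not stated in this PR
    protocol: TCP
  - name: prom             # only declared when prometheus.enabled is true
    containerPort: 9100
    protocol: TCP
---
# PodMonitoring that resolves the scrape target by port name.
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: flagsmith-api                       # hypothetical name
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: flagsmith     # hypothetical label
  endpoints:
    - port: prom        # must match a container port name on the pod
      interval: 30s
```

If no container port carries the referenced name, there is nothing for the endpoint to resolve to, which matches the silent "scraped nothing" behaviour described above.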

Contributes to Flagsmith/infrastructure#317

Signed-off-by: germangarces <german.garces@flagsmith.com>
The api and task-processor container ports were unnamed, so any
PodMonitoring (or other) resource referencing them by name (e.g.
`port: http`) could not resolve them and silently scraped nothing.

Name the existing container port `http`, and declare the Prometheus
port 9100 as `prom` when `prometheus.enabled` is true. Service and
ServiceMonitor resources are unaffected: both reference ports by
numeric value or by the Service's own port name.
@germangarces germangarces changed the title from "fix(api): close graceful-shutdown gap behind LB" to "fix(api,task-processor): Close shutdown drain gap and name container ports" May 13, 2026
@germangarces germangarces requested a review from khvn26 May 19, 2026 08:12
Comment thread charts/flagsmith/values.yaml Outdated
Comment on lines +77 to +85
# Container lifecycle hooks. Default preStop delays SIGTERM so the
# LB / endpoints controller has time to deregister the pod before
# gunicorn closes its listen socket. Without this, rolling deploys
# and HPA scale-down can cause a short 5xx spike on traffic that
# the LB routes to the pod after it has stopped accepting connections.
lifecycle:
  preStop:
    exec:
      command: ["sleep", "20"]
Contributor

I don't quite understand if we're describing the default or the custom exec command we've added here?

Member Author

We're talking about the custom command we've added. I have rephrased the comment so it is easier to understand: 0210430

# Pod termination grace period in seconds. Must exceed the LB's
# connection-draining timeout so the kubelet does not SIGKILL
# the pod while the LB is still draining in-flight connections.
terminationGracePeriodSeconds: 75
Contributor

Why did we land on 75?

Member Author

@germangarces germangarces May 20, 2026

20s preStop + 30s gunicorn default graceful worker shutdown + 25s for possible in-flight requests
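
The breakdown above can be written next to the value as an annotated fragment. This is only a sketch of how the budget adds up; the per-phase comments restate the figures from this reply and are not the wording used in the chart.

```yaml
# Termination budget, in seconds:
#   20  preStop sleep (LB / endpoints deregistration window)
# + 30  gunicorn default graceful worker shutdown
# + 25  headroom for possible in-flight requests
# = 75
terminationGracePeriodSeconds: 75
```

If the `preStop` sleep is lengthened or gunicorn's graceful timeout is overridden, the grace period would presumably need to be raised by the same amount so the sum still fits.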

Signed-off-by: germangarces <german.garces@flagsmith.com>

2 participants