
fix: propagate Kafka broker exit code to pod phase#260

Open
DarkIsDude wants to merge 1 commit into
adobe:masterfrom
DarkIsDude:fix/propagate-kafka-broker-exit-code

Problem

In pkg/resources/kafka/wait-for-envoy-sidecar.sh, the entrypoint finishes with:

/opt/kafka/bin/kafka-server-start.sh /config/broker-config
rm /var/run/wait/do-not-exit-yet

The script's exit status is that of its last command, rm, which exits 0 once
the file is removed, so Kafka's real exit code is discarded. When Kafka crashes
(e.g. KafkaStorageException from a full disk), the broker pod reports phase
Succeeded instead of Failed. Koperator then recreates the pod every ~2 s with
no backoff and no visible failure signal: alerts never fire, and the root cause
is invisible in kubectl get pods.

Fix

Capture Kafka's exit code before rm and propagate it:

/opt/kafka/bin/kafka-server-start.sh /config/broker-config
KAFKA_EXIT=$?
rm /var/run/wait/do-not-exit-yet
exit $KAFKA_EXIT

Once every container in the pod has terminated, a single non-zero exit code is
enough to set the pod phase to Failed, so the failure is immediately visible
and monitoring can alert on it.
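
The masking effect and the fix can be reproduced without Kafka. The following is a minimal sketch, not the PR's script: fake_broker is a hypothetical stand-in for kafka-server-start.sh, its exit code of 1 is an assumption for illustration, and a temporary file plays the role of /var/run/wait/do-not-exit-yet.

```shell
#!/bin/sh
# Hypothetical stand-in for a crashing kafka-server-start.sh (exit code 1 is assumed).
fake_broker() { return 1; }

# Original shape: the cleanup is the last command, so its status (0) is returned.
run_without_capture() {
    marker=$(mktemp)
    fake_broker
    rm "$marker"
}

# Fixed shape: save the broker's status before cleanup, then return it.
run_with_capture() {
    marker=$(mktemp)
    fake_broker
    broker_exit=$?
    rm "$marker"
    return "$broker_exit"
}

# "|| status=$?" records a failure without aborting under "set -e".
status_before=0
run_without_capture || status_before=$?
status_after=0
run_with_capture || status_after=$?

echo "without capture: $status_before, with capture: $status_after"
# prints "without capture: 0, with capture: 1"
```

The only difference between the two functions is that the second reads $? immediately after the broker command, before any other command has a chance to overwrite it.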

Verification

Fill a broker's data disk to trigger a KafkaStorageException:

kubectl exec -n kafka <broker-pod> -- dd if=/dev/urandom of=/kafka-logs/fill-disk bs=1M count=99999 || true

Before fix: pod phase → Succeeded (crash hidden)
After fix: pod phase → Failed (crash visible)
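
To read the phase and the container exit codes directly rather than from the STATUS column, something like the helpers below could be used. This is a sketch under assumptions: pod_phase and container_exit_codes are hypothetical helper names, kubectl is assumed to be configured for the cluster, and the namespace and pod name are passed in by the caller.

```shell
#!/bin/sh
# Hypothetical helper: print the phase of a pod (Pending, Running, Succeeded, Failed, Unknown).
# Usage: pod_phase <namespace> <pod>
pod_phase() {
    kubectl get pod "$2" -n "$1" -o jsonpath='{.status.phase}'
}

# Hypothetical helper: print the exit codes of the pod's terminated containers.
# Usage: container_exit_codes <namespace> <pod>
container_exit_codes() {
    kubectl get pod "$2" -n "$1" \
        -o jsonpath='{.status.containerStatuses[*].state.terminated.exitCode}'
}
```

For example, pod_phase kafka <broker-pod> would be run once the broker has crashed. Because Koperator recreates the pod quickly, the terminated pod may only be visible briefly; kubectl get pods -n kafka -w shows the transitions as they happen.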

The entrypoint script called `rm /var/run/wait/do-not-exit-yet` after
kafka-server-start.sh, so the script exited with the status of `rm`
(0 on success) and the broker pod would show phase Succeeded even after
a crash (e.g. KafkaStorageException). Koperator would then recreate the
pod with no backoff and no visible failure.

Capture Kafka's exit code before the rm and exit with it, so a crashed
broker produces pod phase Failed instead of Succeeded.