Claude/kubernetes migration plan kq jw d #358
Decompose the monolithic Docker container into Kubernetes workloads:
- Streamlit Deployment with health probes and session affinity
- Redis Deployment + Service for job queue
- RQ Worker Deployment for background workflows
- CronJob for workspace cleanup
- Ingress with WebSocket support and cookie-based sticky sessions
- Shared PVC (ReadWriteMany) for workspace data
- ConfigMap for runtime configuration (replaces build-time settings)
- Kustomize base + template-app overlay for multi-app deployment

Code changes:
- Remove unsafe enableCORS=false and enableXsrfProtection=false from config.toml
- Make workspace path configurable via WORKSPACES_DIR env var in clean-up-workspaces.py

CI/CD:
- Add build-and-push-image.yml to push Docker images to ghcr.io
- Add k8s-manifests-ci.yml for manifest validation and kind integration tests

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
kustomization.yaml is a Kustomize config file, not a standard K8s resource, so kubeconform has no schema for it. Exclude it via -ignore-filename-pattern. https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
The integration-test job now uses a matrix with Dockerfile_simple and Dockerfile. Each matrix entry checks if its Dockerfile exists before running — all steps are guarded with an `if` condition so they skip gracefully when a Dockerfile is absent. This allows downstream forks that only have one Dockerfile to pass CI without errors. https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
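The existence guard described above could be sketched as a small step script. This is an illustration under stated assumptions, not the workflow's actual code: the function name `check_dockerfile` and the output key `exists` are hypothetical, while `GITHUB_OUTPUT` is the file GitHub Actions provides for step outputs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: record whether a Dockerfile exists so that later
# steps can be guarded with `if: steps.<id>.outputs.exists == 'true'`.
check_dockerfile() {
  local dockerfile="$1"
  if [ -f "$dockerfile" ]; then
    echo "exists=true" >> "$GITHUB_OUTPUT"
  else
    echo "exists=false" >> "$GITHUB_OUTPUT"
  fi
}
```

Each matrix entry would call the helper once with its own Dockerfile name, so a fork that ships only one of the two files still gets a green run for the other entry.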
- Switch workspace PVC from ReadWriteMany to ReadWriteOnce with cinder-csi storage class (required by de.NBI KKP cluster)
- Increase PVC storage to 500Gi
- Add namespace: openms to kustomization.yaml
- Reduce pod resource requests (1Gi/500m) and limits (8Gi/4 CPU) so all workspace-mounting pods fit on a single node

https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
The workspaces PVC uses ReadWriteOnce (Cinder CSI block storage) which requires all pods mounting it to run on the same node. Without explicit affinity rules, the scheduler was failing silently, leaving pods in Pending state with no events. Adds a `volume-group: workspaces` label and podAffinity with requiredDuringSchedulingIgnoredDuringExecution to streamlit deployment, rq-worker deployment, and cleanup cronjob. This ensures the scheduler explicitly co-locates all workspace-consuming pods on the same node. https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
…ration-plan-KQJwD
The controller pod being Ready doesn't guarantee the admission webhook service is accepting connections. Add a polling loop that waits for the webhook endpoint to have an IP assigned before applying the Ingress resource, preventing "connection refused" errors during kustomize apply. https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
The kustomize overlay deploys into the openms namespace, but the verification steps (Redis wait, Redis ping, deployment checks) were querying the default namespace, causing "no matching resources found". https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
Replace the unreliable endpoint-IP polling with a retry loop on kubectl apply (up to 5 attempts with backoff). This handles the race where the ingress-nginx admission webhook has an endpoint IP but isn't yet accepting TCP connections. https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
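A retry loop of this kind might look like the sketch below. The helper name `retry_with_backoff`, the linear backoff schedule, and the delay values are assumptions for illustration; the commit only states that `kubectl apply` is retried up to 5 times with backoff.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to max_attempts times, sleeping
# longer after each failure (linear backoff: base_delay, 2*base_delay, ...).
retry_with_backoff() {
  local max_attempts="$1" base_delay="$2"
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "command failed after $attempt attempts: $*" >&2
      return 1
    fi
    sleep $((attempt * base_delay))
    attempt=$((attempt + 1))
  done
}

# Possible CI usage (the exact apply command is an assumption):
# retry_with_backoff 5 5 kubectl apply -k k8s/overlays/template-app/
```

Retrying the apply itself sidesteps the question of when the webhook is truly ready: the loop simply repeats until the admission webhook accepts the connection or the attempts run out.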
Kustomize namePrefix renames the Redis service to template-app-redis, but the REDIS_URL env var in streamlit and rq-worker deployments still referenced the unprefixed name "redis", causing the rq-worker to CrashLoopBackOff with "Name or service not known". Add JSON patches in the overlay to set the correct prefixed hostname. https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
📝 Walkthrough
Adds a Traefik IngressRoute resource and includes it in the base kustomization; extends the template-app overlay with JSON6902 patches updating the REDIS_URL env values and the IngressRoute service name; changes ConfigMap key to …
Sequence Diagram(s)
sequenceDiagram
autonumber
actor Client
participant Traefik
participant Streamlit as "streamlit Service"
participant RQWorker as "rq-worker"
participant Redis
Client->>Traefik: HTTP request (PathPrefix /)
Traefik->>Streamlit: Forward to `template-app-streamlit` service:8501 (IngressRoute)
Streamlit->>Redis: Connect using REDIS_URL=redis://template-app-redis:6379/0
Client->>Streamlit: User interactions (websocket/forms)
Streamlit->>RQWorker: Enqueue background tasks (via Redis)
RQWorker->>Redis: Connect using REDIS_URL=redis://template-app-redis:6379/0
Redis-->>Streamlit: Task/session data
Streamlit-->>Client: HTTP response
🧹 Nitpick comments (1)
k8s/overlays/template-app/kustomization.yaml (1)
25-38: Add test operations to precondition JSON patches for robustness.
The patches rely on the `env/0` index, which is order-sensitive. Add `test` ops before each `replace` to verify `REDIS_URL` is at that index; this ensures patching fails fast with a clear error if the environment variable array structure changes.
Suggested diff
 - target:
     kind: Deployment
     name: streamlit
   patch: |
+    - op: test
+      path: /spec/template/spec/containers/0/env/0/name
+      value: REDIS_URL
     - op: replace
       path: /spec/template/spec/containers/0/env/0/value
       value: "redis://template-app-redis:6379/0"
 - target:
     kind: Deployment
     name: rq-worker
   patch: |
+    - op: test
+      path: /spec/template/spec/containers/0/env/0/name
+      value: REDIS_URL
     - op: replace
       path: /spec/template/spec/containers/0/env/0/value
       value: "redis://template-app-redis:6379/0"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@k8s/overlays/template-app/kustomization.yaml` around lines 25 - 38, The JSON patches in kustomization.yaml for the Deployment targets "streamlit" and "rq-worker" currently replace /spec/template/spec/containers/0/env/0/value assuming env[0] is REDIS_URL; add a preceding "test" op for each replace that asserts /spec/template/spec/containers/0/env/0/name equals "REDIS_URL" (so the patch fails clearly if ordering changed), then perform the existing replace of /spec/.../env/0/value with the new redis URL; update both target blocks (Deployment name: streamlit and Deployment name: rq-worker) to include these test operations immediately before their replace ops.
The cluster uses Traefik, not nginx, so the nginx Ingress annotations are ignored. Add a Traefik IngressRoute with PathPrefix(/) catch-all routing and sticky session cookie for Streamlit session affinity. https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
…ests kubeconform doesn't know the Traefik IngressRoute CRD schema, and the kind cluster in integration tests doesn't have Traefik installed. Skip the IngressRoute in kubeconform validation and filter it out with yq before applying to the kind cluster. https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
Kustomize namePrefix doesn't rewrite service references inside CRDs, so the IngressRoute was pointing to 'streamlit' instead of 'template-app-streamlit', causing Traefik to return 404. https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/k8s-manifests-ci.yml:
- Around line 24-27: The CI is skipping Traefik IngressRoute validation via
kubeconform due to the -ignore-filename-pattern 'traefik-ingressroute.yaml' and
the -skip IngressRoute usage; remove those ignore rules and add a
dedicated validation step (e.g., a job/step named
"validate-traefik-ingressroutes") that runs kubeconform or an equivalent
validator against IngressRoute manifests using Traefik's CRD schema. Keep the
existing kubeconform calls for core resources and add a new invocation that
uses -schema-location to point at the Traefik CRD schema, so IngressRoute is
validated instead of being silently skipped.
kubeconform -summary -strict -kubernetes-version 1.28.0 \
  -ignore-filename-pattern 'kustomization.yaml' \
  -ignore-filename-pattern 'traefik-ingressroute.yaml' \
  k8s/base/*.yaml
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Inspect CI workflow handling of Traefik/IngressRoute:"
rg -n -C3 'traefik-ingressroute|IngressRoute|kubeconform|kustomize|yq' .github/workflows
echo
echo "Find Traefik-related manifests in k8s tree:"
fd -i 'traefik|ingressroute' k8s
Repository: OpenMS/streamlit-template
Length of output: 3169
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== K8s directory structure ==="
find k8s -type f -name "*.yaml" | head -20
echo
echo "=== Check if traefik-ingressroute.yaml is in base or overlay ==="
find k8s -name "traefik-ingressroute.yaml" -exec ls -la {} \;
echo
echo "=== Look for all IngressRoute definitions across k8s ==="
rg -l "kind: IngressRoute" k8s
echo
echo "=== Check kustomization.yaml in base and overlay for resources ==="
echo "--- Base kustomization.yaml ---"
cat k8s/base/kustomization.yaml 2>/dev/null | head -30
echo
echo "--- Overlay kustomization.yaml ---"
cat k8s/overlays/template-app/kustomization.yaml 2>/dev/null | head -30
Repository: OpenMS/streamlit-template
Length of output: 1785
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Content of traefik-ingressroute.yaml ==="
cat k8s/base/traefik-ingressroute.yaml
echo
echo "=== Overlay kustomization.yaml (full) ==="
cat k8s/overlays/template-app/kustomization.yaml
echo
echo "=== Verify what kustomize outputs (check if IngressRoute is there) ==="
kubectl kustomize k8s/overlays/template-app/ 2>/dev/null | grep -A5 "kind: IngressRoute" || echo "No IngressRoute found in kustomize output"
Repository: OpenMS/streamlit-template
Length of output: 1640
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Full kustomize output (raw) ==="
kubectl kustomize k8s/overlays/template-app/ 2>/dev/null > /tmp/kustomize-output.yaml
wc -l /tmp/kustomize-output.yaml
echo
echo "=== Check for IngressRoute or traefik references in output ==="
grep -i "ingressroute\|traefik" /tmp/kustomize-output.yaml || echo "No matches found"
echo
echo "=== Show kinds present in output ==="
grep "^kind:" /tmp/kustomize-output.yaml | sort | uniq
echo
echo "=== Check if CRD definitions are in the file ==="
head -100 /tmp/kustomize-output.yaml
Repository: OpenMS/streamlit-template
Length of output: 106
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Verify which tools are available ==="
which kubectl kustomize yq jq python3 2>&1 | head -10 || echo "Checking alternatives..."
echo
echo "=== Check if we can use Python to parse YAML ==="
python3 --version
echo
echo "=== Try to understand the impact by reading the manifests directly ==="
echo "Base includes traefik-ingressroute.yaml in resources list"
echo "Overlay includes base via ../../base"
echo "Therefore overlay kustomize output SHOULD include IngressRoute"
echo
echo "=== Let's check the actual CI workflow logic ==="
echo "Line 24-27: kubeconform validates k8s/base/*.yaml directly (ignoring traefik-ingressroute.yaml by filename)"
echo "Line 39: pipes kustomize output through kubeconform with -skip IngressRoute"
echo
echo "=== Check if the ignore pattern on line 26 is actually effective ==="
echo "The pattern 'traefik-ingressroute.yaml' should match k8s/base/traefik-ingressroute.yaml"
python3 << 'EOF'
import re
# Test the ignore pattern
pattern = 'traefik-ingressroute.yaml'
filename = 'traefik-ingressroute.yaml'
# kubeconform uses filename pattern matching
if pattern in filename:
    print(f"Pattern '{pattern}' matches '{filename}': YES")
else:
    print(f"Pattern '{pattern}' matches '{filename}': NO")
EOF
Repository: OpenMS/streamlit-template
Length of output: 941
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Check kubeconform capabilities regarding CRDs ==="
echo "Searching for kubeconform documentation or usage info in repo..."
rg -i "kubeconform" .github -A2 -B2
echo
echo "=== Research: kubeconform is designed to validate against Kubernetes OpenAPI schemas ==="
echo "The -skip flag is meant for resource kinds that aren't in the core K8s schema"
echo "IngressRoute is a Traefik CRD, not part of Kubernetes core API"
echo
echo "=== Verify: Does kubeconform have CRD schemas built-in? ==="
echo "kubeconform validates against official K8s API versions specified by -kubernetes-version"
echo "CRDs like Traefik IngressRoute require custom schema validation setup"
echo
echo "=== Analysis of the CI configuration ==="
echo "1. Base validation ignores traefik-ingressroute.yaml by filename"
echo "2. Overlay validation skips IngressRoute kind entirely"
echo "3. Integration test (line 91) also filters out IngressRoute"
echo ""
echo "This appears INTENTIONAL, not accidental:"
echo "- kubeconform validates core K8s resources only"
echo "- CRDs cannot be validated without custom schema providers"
echo "- -skip is the documented way to exclude CRDs from validation"
echo
echo "=== Check integration test filtering ==="
grep -A10 "kind cluster uses nginx" .github/workflows/k8s-manifests-ci.yml
Repository: OpenMS/streamlit-template
Length of output: 3044
IngressRoute resources are not validated by kubeconform in either stage
The CI skips validation of Traefik IngressRoute manifests at line 26 (base validation ignores by filename) and line 39 (overlay validation skips by kind). This appears intentional—kubeconform validates only core Kubernetes API resources using official OpenAPI schemas; CRDs like IngressRoute require custom schema providers or manual dry-run validation.
However, this creates an unmonitored path for IngressRoute manifest changes. If validation of Traefik CRD syntax is desired, consider adding a schema-based validation step (e.g., using Traefik's CRD schemas) or enabling CRD validation via a custom kubeconform schema provider, rather than silently skipping the resource kind.
The ConfigMap was replacing the entire settings.json, losing keys like "version" and "repository-name" that the app expects (causing KeyError). Now the ConfigMap only contains deployment-specific overrides, which are merged into the Docker image's base settings.json at container startup using jq. https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
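The startup merge can be sketched as follows. The jq expression `.[0] * .[1]` is the one quoted in the review of this PR; the wrapper function `merge_settings`, its arguments, and the example key `online_deployment` used in the usage note are hypothetical. The sketch requires jq.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: merge deployment-specific overrides into the base
# settings file and replace the output only if the merge succeeded.
merge_settings() {
  local base="$1" overrides="$2" out="$3"
  local tmp
  tmp="$(mktemp)" || return 1
  # With -s, jq reads both files into one array; `*` merges the two objects
  # recursively, and keys from the overrides file win on conflict.
  if jq -s '.[0] * .[1]' "$base" "$overrides" > "$tmp"; then
    mv "$tmp" "$out"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

With a base file containing "version" and "repository-name" and an overrides file containing only `{"online_deployment": true}`, the merged result keeps both base keys and takes the overridden value, which is what avoids the KeyError described above.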
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@k8s/base/rq-worker-deployment.yaml`:
- Around line 36-37: The current startup sequence runs "jq -s '.[0] * .[1]'
/app/settings.json /app/settings-overrides.json > /tmp/settings-merged.json &&
mv /tmp/settings-merged.json /app/settings.json" but then unconditionally runs
"exec rq worker openms-workflows --url $REDIS_URL", so a failed jq merge can be
ignored; change the command so worker start is conditional on a successful merge
(e.g., append && before exec or check the exit status and exit 1 on failure), or
enable "set -e" at the top of the container command; ensure the referenced
commands (/app/settings.json, /app/settings-overrides.json,
/tmp/settings-merged.json, jq, and exec rq worker openms-workflows --url
$REDIS_URL) are part of the conditional so the worker only starts when the merge
succeeded.
In `@k8s/base/streamlit-deployment.yaml`:
- Around line 36-37: The startup script can continue to line exec streamlit run
app.py if the jq merge fails; modify the container startup shell so it fails
fast by adding shell strict mode (set -euo pipefail) at the top of the
script/command that runs the merge and exec, ensuring the jq merge failure stops
execution before exec streamlit run app.py --server.address 0.0.0.0; locate the
block containing the jq -s '.[0] * .[1]' /app/settings.json
/app/settings-overrides.json and the exec streamlit run app.py command and
prepend the strict-mode directive so any error in jq (or other commands) will
abort the start-up.
Addresses CodeRabbit review: if jq merge fails, the container should not start with unmerged settings. https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ
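The difference between a guarded and an unguarded startup sequence can be shown with stand-in functions. This is a sketch only: `merge_step` stands in for the jq merge and `start_app` for the `exec streamlit run ...` or `exec rq worker ...` command, and the commit does not say whether the actual fix used `&&` chaining or `set -e`.

```shell
#!/usr/bin/env bash
# Unguarded: the merge fails, but the next command still runs, so the app
# would start with unmerged settings.
unguarded() (
  set +e  # shell default, stated explicitly for the demonstration
  merge_step() { return 1; }
  start_app() { echo "started"; }
  merge_step
  start_app
)

# Guarded: chaining with && means start_app runs only if merge_step succeeded.
guarded() (
  merge_step() { return 1; }
  start_app() { echo "started"; }
  merge_step && start_app
)
```

Calling `unguarded` prints "started" despite the failed merge, while `guarded` prints nothing and returns a non-zero status, so the container would exit instead of running with incomplete settings.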