Skip to content

refactor(ci): encapsulate test run tracking into lib/test-run-tracker.sh#4366

Merged
openshift-merge-bot[bot] merged 8 commits intoredhat-developer:mainfrom
gustavolira:refactor/deployment-tracking
Mar 11, 2026
Merged

refactor(ci): encapsulate test run tracking into lib/test-run-tracker.sh#4366
openshift-merge-bot[bot] merged 8 commits intoredhat-developer:mainfrom
gustavolira:refactor/deployment-tracking

Conversation

@gustavolira
Copy link
Member

@gustavolira gustavolira commented Mar 4, 2026

Summary

  • Introduces .ci/pipelines/lib/test-run-tracker.sh module that encapsulates the CURRENT_DEPLOYMENT global counter into a clean API (test_run_tracker::register, test_run_tracker::mark_deploy_success, test_run_tracker::mark_deploy_failed, test_run_tracker::mark_test_result)
  • Replaces manual counter manipulation in utils.sh and lib/testing.sh with single-call functions, reducing error-prone boilerplate
  • Removes the CURRENT_DEPLOYMENT export from reporting.sh — state is now managed internally by the test run tracker module

Test plan

  • Verify grep -rn "CURRENT_DEPLOYMENT" .ci/ returns zero matches outside lib/test-run-tracker.sh
  • Run shellcheck on modified files (lib/test-run-tracker.sh, lib/testing.sh, reporting.sh, utils.sh)
  • Validate source chain: reporting.shlib/test-run-tracker.sh loads correctly
  • Run CI pipeline to confirm functional equivalence (counter increments identically)

🤖 Generated with Claude Code

@openshift-ci openshift-ci bot requested review from jrichter1 and psrna March 4, 2026 19:46
@github-actions
Copy link
Contributor

github-actions bot commented Mar 4, 2026

The container image build workflow finished with status: cancelled.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 4, 2026

The container image build workflow finished with status: cancelled.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 4, 2026

The container image build workflow finished with status: cancelled.

@gustavolira
Copy link
Member Author

/review

@rhdh-qodo-merge
Copy link

rhdh-qodo-merge bot commented Mar 4, 2026

PR Reviewer Guide 🔍

(Review updated until commit fdca364)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🔒 Security concerns

Supply chain / remote script execution:
.ci/pipelines/install-methods/operator.sh downloads a script from GitHub based on ${RELEASE_BRANCH_NAME} and makes it executable. If that file is later executed (likely), this introduces a risk of running unpinned remote code. Consider pinning to a specific commit/tag, validating ${RELEASE_BRANCH_NAME} against an allowlist, and/or verifying integrity (checksum/signature) before execution.

⚡ Recommended focus areas for review

State Consistency

The deployment counter is intentionally not exported, while the functions are exported for subshell usage. Please validate that no call sites invoke deployment::* in a subshell (e.g., timeout bash -c ..., pipes, command substitutions) in a way that increments _DEPLOYMENT_COUNTER outside the parent shell, since this can desynchronize IDs from the parent process and lead to statuses being written under unexpected deployment indices.

# Internal state
_DEPLOYMENT_COUNTER=0

deployment::next_id() {
  _DEPLOYMENT_COUNTER=$((_DEPLOYMENT_COUNTER + 1))
  echo "${_DEPLOYMENT_COUNTER}"
}

deployment::current_id() {
  echo "${_DEPLOYMENT_COUNTER}"
}

deployment::register() {
  local label="$1"
  deployment::next_id > /dev/null
  save_status_deployment_namespace "${_DEPLOYMENT_COUNTER}" "$label"
}

deployment::mark_deploy_success() {
  save_status_failed_to_deploy "${_DEPLOYMENT_COUNTER}" false
}

deployment::mark_deploy_failed() {
  local label="$1"
  deployment::register "$label"
  save_status_failed_to_deploy "${_DEPLOYMENT_COUNTER}" true
  save_status_test_failed "${_DEPLOYMENT_COUNTER}" true
  save_overall_result 1
}

deployment::mark_test_result() {
  local passed="$1"
  local num_failures="${2:-}"
  if [[ "$passed" == "true" ]]; then
    save_status_test_failed "${_DEPLOYMENT_COUNTER}" false
  else
    save_status_test_failed "${_DEPLOYMENT_COUNTER}" true
  fi
  if [[ -n "$num_failures" ]]; then
    save_status_number_of_test_failed "${_DEPLOYMENT_COUNTER}" "$num_failures"
  fi
}

# Export all functions for subshell compatibility.
# Note: _DEPLOYMENT_COUNTER is NOT exported because subshells inherit only
# the snapshot at fork time — counter updates in the parent would not propagate.
export -f deployment::next_id
export -f deployment::current_id
export -f deployment::register
export -f deployment::mark_deploy_success
export -f deployment::mark_deploy_failed
export -f deployment::mark_test_result
Double Registration

deployment::mark_deploy_failed calls deployment::register, which always increments the counter. Confirm that failure paths never call deployment::mark_deploy_failed after an earlier deployment::register for the same logical deployment, otherwise you’ll get an extra deployment entry (and the failure will be attributed to the next ID rather than the already-registered one).

deployment::register() {
  local label="$1"
  deployment::next_id > /dev/null
  save_status_deployment_namespace "${_DEPLOYMENT_COUNTER}" "$label"
}

deployment::mark_deploy_success() {
  save_status_failed_to_deploy "${_DEPLOYMENT_COUNTER}" false
}

deployment::mark_deploy_failed() {
  local label="$1"
  deployment::register "$label"
  save_status_failed_to_deploy "${_DEPLOYMENT_COUNTER}" true
  save_status_test_failed "${_DEPLOYMENT_COUNTER}" true
  save_overall_result 1
}
📚 Focus areas based on broader codebase context

Exit Code

Failure paths now call deployment::mark_deploy_failed ... but then return 0 to the caller (e.g., when Backstage is not running or a Helm upgrade times out). Validate that the top-level CI entrypoints still reliably fail the job (non-zero exit) when a deployment/test fails, rather than only recording OVERALL_RESULT. (Ref 1, Ref 5)

  else
    echo "Backstage is not running. Marking deployment as failed and continuing..."
    deployment::mark_deploy_failed "$artifacts_subdir"
    save_all_pod_logs "$namespace"
    return 0
  fi

  # Collect pod logs only on test failure to speed up successful PR runs.
  local _current_id
  _current_id="$(deployment::current_id)"
  if [[ "${STATUS_TEST_FAILED[$_current_id]:-}" == "true" ]]; then
    save_all_pod_logs "$namespace"
  else
    log::info "Tests passed — skipping pod log collection for namespace: ${namespace}"
  fi
  return 0
}

# ==============================================================================
# Upgrade Verification
# ==============================================================================

# Check Helm upgrade rollout status
# Args:
#   $1 - deployment_name: The name of the deployment
#   $2 - namespace: The namespace where the deployment is located
#   $3 - timeout: Timeout in seconds (default: 600)
# Returns:
#   0 - Upgrade completed successfully
#   1 - Upgrade failed or timed out
testing::check_helm_upgrade() {
  local deployment_name="$1"
  local namespace="$2"
  local timeout="${3:-600}"

  if [[ -z "$deployment_name" || -z "$namespace" ]]; then
    log::error "${_TESTING_ERR_MISSING_PARAMS}"
    log::info "Usage: testing::check_helm_upgrade <deployment_name> <namespace> [timeout]"
    return 1
  fi

  log::info "Checking rollout status for deployment: ${deployment_name} in namespace: ${namespace}..."

  if oc rollout status "deployment/${deployment_name}" -n "${namespace}" --timeout="${timeout}s" -w; then
    log::info "RHDH upgrade is complete."
    return 0
  else
    log::error "RHDH upgrade encountered an issue or timed out."
    return 1
  fi
}

# Check upgrade and run tests if successful
# Args:
#   $1 - deployment_name: The name of the deployment
#   $2 - release_name: The Helm release name
#   $3 - namespace: The namespace
#   $4 - playwright_project: The Playwright project to run
#   $5 - url: The URL to test against
#   $6 - timeout: (optional) Timeout in seconds (default: 600)
testing::check_upgrade_and_test() {
  local deployment_name="$1"
  local release_name="$2"
  local namespace="$3"
  local playwright_project="$4"
  local url=$5
  local timeout=${6:-600}

  if [[ -z "$deployment_name" || -z "$release_name" || -z "$namespace" || -z "$playwright_project" || -z "$url" ]]; then
    log::error "${_TESTING_ERR_MISSING_PARAMS}"
    log::info "Usage: testing::check_upgrade_and_test <deployment_name> <release_name> <namespace> <playwright_project> <url> [timeout]"
    return 1
  fi

  if testing::check_helm_upgrade "${deployment_name}" "${namespace}" "${timeout}"; then
    testing::check_and_test "${release_name}" "${namespace}" "${playwright_project}" "${url}"
  else
    log::error "Helm upgrade encountered an issue or timed out. Exiting..."
    deployment::mark_deploy_failed "$namespace"
  fi
  return 0

Reference reasoning: Existing e2e runner scripts treat deployment/container failures as hard failures by exiting with the captured non-zero exit code. This indicates the expected behavior is to propagate failure via process exit status, not just via reporting state files.

📄 References
  1. redhat-developer/rhdh/e2e-tests/container-init.sh [177-182]
  2. redhat-developer/rhdh/e2e-tests/container-init.sh [136-175]
  3. redhat-developer/rhdh/e2e-tests/container-init.sh [86-117]
  4. redhat-developer/rhdh/e2e-tests/container-init.sh [47-84]
  5. redhat-developer/rhdh/e2e-tests/local-run.sh [442-470]
  6. redhat-developer/rhdh/e2e-tests/local-run.sh [403-426]
  7. redhat-developer/rhdh/e2e-tests/local-run.sh [338-360]
  8. redhat-developer/rhdh/e2e-tests/local-run.sh [1-12]

@gustavolira
Copy link
Member Author

/retest

@github-actions
Copy link
Contributor

github-actions bot commented Mar 4, 2026

Image was built and published successfully. It is available at:

@zdrapela
Copy link
Member

zdrapela commented Mar 5, 2026

/test e2e-ocp-helm-nightly

@gustavolira gustavolira force-pushed the refactor/deployment-tracking branch from ce7a406 to 88355be Compare March 6, 2026 14:20
Copy link
Member

@zdrapela zdrapela left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make sure that each file that uses functions from other files sources all the files that it depends on - to ensure dependencies.

We should definitely check that for job triggered by /test e2e-ocp-helm-nightly, all the files in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/redhat-developer_rhdh/4366/pull-ci-redhat-developer-rhdh-main-e2e-ocp-helm-nightly/2029492297959215104/artifacts/e2e-ocp-helm-nightly/redhat-developer-rhdh-ocp-helm-nightly/artifacts/reporting/ have the correct number of entries.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 6, 2026

Image was built and published successfully. It is available at:

@github-actions
Copy link
Contributor

github-actions bot commented Mar 6, 2026

The container image build and publish workflows were skipped (either due to [skip-build] tag or no relevant changes with existing image).

@zdrapela
Copy link
Member

zdrapela commented Mar 9, 2026

/review
-i

@rhdh-qodo-merge
Copy link

Persistent review updated to latest commit fdca364

@zdrapela
Copy link
Member

zdrapela commented Mar 9, 2026

/agentic_review

@rhdh-qodo-merge
Copy link

rhdh-qodo-merge bot commented Mar 9, 2026

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0) 📎 Requirement gaps (0)

Grey Divider


Remediation recommended

1. Missing failure-count entry 🐞 Bug ✧ Quality
Description
deployment::mark_test_result only writes STATUS_NUMBER_OF_TEST_FAILED when a non-empty value is
provided; testing::run_tests can pass an empty value if JUnit parsing returns no match, and
deployment-failure paths don’t set a placeholder count. This can lead to missing lines in
STATUS_NUMBER_OF_TEST_FAILED.txt, reducing report completeness and risking misalignment with other
per-deployment status files.
Code

.ci/pipelines/lib/deployment.sh[R42-52]

+deployment::mark_test_result() {
+  local passed="$1"
+  local num_failures="${2:-}"
+  if [[ "$passed" == "true" ]]; then
+    save_status_test_failed "${_DEPLOYMENT_COUNTER}" false
+  else
+    save_status_test_failed "${_DEPLOYMENT_COUNTER}" true
+  fi
+  if [[ -n "$num_failures" ]]; then
+    save_status_number_of_test_failed "${_DEPLOYMENT_COUNTER}" "$num_failures"
+  fi
Evidence
The deployment module conditionally persists failure counts only when num_failures is non-empty,
so any caller passing an empty string results in no line being appended to
STATUS_NUMBER_OF_TEST_FAILED.txt. testing::run_tests populates failed_tests via grep when
the JUnit file exists but does not default it if grep returns an empty string, and then passes it
through to deployment::mark_test_result. The documentation explicitly describes
STATUS_NUMBER_OF_TEST_FAILED.txt as having failure counts “one per line,” implying each deployment
should contribute an entry.

.ci/pipelines/lib/deployment.sh[42-53]
.ci/pipelines/lib/testing.sh[112-124]
.ci/pipelines/reporting.sh[47-53]
docs/e2e-tests/enhanced-ci-reporting.md[72-81]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`STATUS_NUMBER_OF_TEST_FAILED.txt` can miss an entry for a deployment because `deployment::mark_test_result` only persists the count when the value is non-empty, and callers may pass an empty string (e.g., if JUnit parsing yields no match). This reduces reporting completeness and can desynchronize per-deployment status files.

### Issue Context
- `deployment::mark_test_result` gates writing the failure count on `[[ -n &quot;$num_failures&quot; ]]`.
- `testing::run_tests` computes `failed_tests` via `grep` when the JUnit file exists, but does not default it if `grep` yields an empty result.
- Docs describe `STATUS_NUMBER_OF_TEST_FAILED.txt` as &quot;one per line&quot;, implying each deployment should contribute an entry.

### Fix Focus Areas
- .ci/pipelines/lib/deployment.sh[34-53]
- .ci/pipelines/lib/testing.sh[112-124]
- (optional for consistency) .ci/pipelines/lib/deployment.sh[34-40]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Grey Divider

Previous review results

Review updated until commit fdca364

Results up to commit N/A


Grey Divider

Looking for bugs?

Come back again in a few minutes. An AI review agent is analysing this pull request Grey Divider

Qodo Logo

@zdrapela
Copy link
Member

zdrapela commented Mar 9, 2026

/agentic_review

@rhdh-qodo-merge
Copy link

rhdh-qodo-merge bot commented Mar 9, 2026

Persistent review updated to latest commit fdca364

@gustavolira gustavolira force-pushed the refactor/deployment-tracking branch from fdca364 to d522a3d Compare March 10, 2026 12:43
@zdrapela
Copy link
Member

/test e2e-ocp-helm-nightly

@github-actions
Copy link
Contributor

Image was built and published successfully. It is available at:

@github-actions
Copy link
Contributor

The container image build workflow finished with status: cancelled.

@gustavolira gustavolira changed the title refactor(ci): encapsulate deployment tracking into lib/deployment.sh refactor(ci): encapsulate test run tracking into lib/test-run-tracker.sh Mar 11, 2026
gustavolira and others added 8 commits March 11, 2026 09:49
Replace fragile CURRENT_DEPLOYMENT global counter with a dedicated
deployment module that provides a clean API (deployment::register,
deployment::mark_deploy_success, deployment::mark_deploy_failed,
deployment::mark_test_result). This eliminates manual counter
manipulation scattered across utils.sh and lib/testing.sh.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove misleading `export _DEPLOYMENT_COUNTER` (subshells can't see
  parent counter updates anyway)
- Source deployment.sh directly in testing.sh so job files that skip
  utils.sh/reporting.sh still have deployment::* available
- Consolidate test result reporting into a single deployment::mark_test_result
  call instead of mixing old and new APIs
- Add deployment::mark_deploy_success in initiate_upgrade_base_deployments
  to record deploy status consistently

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…eline scripts

- Fix silent failures: cert extraction pipeline, helm chart version resolution,
  curl without -f flag, and http_status comparison
- Add error checks to all helm upgrade commands in EKS/GKE/AKS deployments
- Add common::require_vars() for envsubst variable validation
- Remove dead functions: namespace::remove_finalizers, operator::install_postgres_k8s
- Export common::base64_encode for subshell usage
- Fix unquoted variables in kubectl commands and sleep

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Move common::require_vars to beginning of deployment functions for early validation
- Add common::require_vars to operator deployment files (AKS, EKS, GKE)
- Add explicit source for common.sh and namespace.sh in K8s deployment files
- Restore namespace::remove_finalizers function (was prematurely removed)
- Add source for reporting.sh in deployment.sh (fixes dependency declaration)
- Add re-source guard to reporting.sh and fix circular dependency
- Use artifacts_subdir (playwright project name) as deployment label instead
  of namespace in deployment::register to fix RHDHBUGS-2726
- Simplify log collection: collect immediately on deploy failure, check
  STATUS_TEST_FAILED only after test execution
- Remove unnecessary "Uses globals: none" comment
- Update docs/e2e-tests/enhanced-ci-reporting.md to reflect deployment:: API

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…atus file alignment

Use playwright_project instead of namespace as the default deployment
label, fixing Slack reports that showed duplicate names (RHDHBUGS-2726).

Restructure testing::check_and_test to use testing::run_tests return
value directly for pod log collection, simplifying the control flow.

Ensure STATUS_NUMBER_OF_TEST_FAILED.txt always has an entry per
deployment by defaulting num_failures to "0" and writing "N/A" for
deploy failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…function prefix

Rename the tracking library from `deployment` to `test_run_tracker` to
better reflect its purpose: tracking test run status (deploy outcome +
test results) for reporting, not managing deployments.

- Rename file: lib/deployment.sh -> lib/test-run-tracker.sh
- Rename function prefix: deployment:: -> test_run_tracker::
- Rename internal state: _DEPLOYMENT_COUNTER -> _TEST_RUN_COUNTER
- Update all call sites in testing.sh and utils.sh

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…er rename

- Update enhanced-ci-reporting.md to use test_run_tracker:: prefix and
  lib/test-run-tracker.sh file path
- Clarify FUTURE MODULE comment in utils.sh to distinguish planned
  lib/deploy.sh from lib/test-run-tracker.sh

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gustavolira gustavolira force-pushed the refactor/deployment-tracking branch from a4b2c9a to 798a2a0 Compare March 11, 2026 12:50
@github-actions
Copy link
Contributor

The container image build workflow finished with status: cancelled.

@sonarqubecloud
Copy link

@github-actions
Copy link
Contributor

Image was built and published successfully. It is available at:

@zdrapela
Copy link
Member

/test e2e-ocp-helm-nightly

@zdrapela
Copy link
Member

/test e2e-ocp-helm

Copy link
Member

@zdrapela zdrapela left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! It's now looking great. I'll just wait for the nightly job to finish, and I'll approve.

@openshift-ci openshift-ci bot added the lgtm label Mar 11, 2026
@openshift-merge-bot openshift-merge-bot bot merged commit ad49057 into redhat-developer:main Mar 11, 2026
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants