We have deployed multiple clusters with chart 0.27.0 and noticed that some of them were failing to populate the Grafana dashboard. After much digging, we found that some DB pods were returning an HTTP 500 from the metrics endpoint instead of the expected 200. This showed up on clusters with low (or no) query volume: the monitoring query was returning rows for both db:postgres and db:app, producing duplicate results that made the metrics exporter fail.
Further digging found:
The pg_stat_statements_top query defined in the Helm chart's monitoringQueriesConfigMap (values.yaml) causes HTTP 500 errors on the metrics endpoint due to duplicate Prometheus label combinations.
charts/cloudnative-pg/values.yaml → monitoringQueriesConfigMap.queries → pg_stat_statements_top
Default values query:
pg_stat_statements_top:
  query: |
    SELECT
      current_database() AS datname,
      queryid::text AS queryid,
      LEFT(query, 100) AS query,
      calls,
      total_exec_time / 1000.0 AS total_time_seconds,
      mean_exec_time / 1000.0 AS mean_time_seconds,
      rows
    FROM pg_stat_statements
    WHERE query NOT LIKE '%pg_stat_statements%'
    ORDER BY total_exec_time DESC
    LIMIT 20
  target_databases:
    - '*'
Two issues combine to cause duplicate Prometheus labels:
1 - Missing dbid filter: With target_databases: ['*'], the query runs once per database. Since pg_stat_statements is cluster-wide, each run returns the same rows regardless of which database is being queried, so a given queryid is reported multiple times, once under each database's datname.
2 - Missing aggregation: In PostgreSQL, pg_stat_statements keys its entries by the composite (userid, dbid, queryid, toplevel). The same query executed by different users therefore creates separate rows with an identical queryid. Without a GROUP BY, these rows produce duplicate label combinations.
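To see the colliding rows directly, a hypothetical diagnostic (not part of the chart, run via psql against any database in the cluster) is:

```sql
-- Lists queryids that occur more than once in pg_stat_statements; each extra
-- row is a potential duplicate Prometheus label combination for the exporter.
SELECT
    queryid,
    COUNT(*)               AS rows_per_queryid,
    COUNT(DISTINCT userid) AS distinct_users,
    COUNT(DISTINCT dbid)   AS distinct_dbs
FROM pg_stat_statements
GROUP BY queryid
HAVING COUNT(*) > 1
ORDER BY rows_per_queryid DESC;
```

The distinct_users and distinct_dbs columns show which of the two causes (multiple users, multiple databases) is producing the duplicates on a given cluster.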
To fix it, we changed the query in the values file to:
pg_stat_statements_top:
  query: |
    SELECT
      current_database() AS datname,
      queryid::text AS queryid,
      LEFT(query, 100) AS query,
      SUM(calls)::bigint AS calls,
      SUM(total_exec_time) / 1000.0 AS total_time_seconds,
      SUM(total_exec_time) / NULLIF(SUM(calls), 0) / 1000.0 AS mean_time_seconds,
      SUM(rows)::bigint AS rows
    FROM pg_stat_statements
    WHERE dbid = (SELECT oid FROM pg_database WHERE datname = current_database())
      AND query NOT LIKE '%pg_stat_statements%'
    GROUP BY queryid, query
    ORDER BY SUM(total_exec_time) DESC
    LIMIT 20
  target_databases:
    - '*'
Summary of Changes:
- Added dbid filter to scope results to the current database
- Added GROUP BY queryid, query with SUM() aggregation for multi-user queries
- Fixed mean calculation using SUM(total_exec_time) / NULLIF(SUM(calls), 0)
- Added ::bigint casts to maintain consistent types
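As a quick sanity check (a sketch, not part of the chart), the effect of the dbid filter and aggregation can be measured via psql in the target database:

```sql
-- Compares raw pg_stat_statements rows against distinct queryids, scoped to
-- the current database the same way the fixed query is. If raw_rows is
-- greater than distinct_queryids, those extra rows are exactly what the
-- GROUP BY now collapses into a single metric per queryid.
SELECT
    COUNT(*)                AS raw_rows,
    COUNT(DISTINCT queryid) AS distinct_queryids
FROM pg_stat_statements
WHERE dbid = (SELECT oid FROM pg_database WHERE datname = current_database());
```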