Add new metrics for long running transactions #17
Conversation
}
defer rows.Close()
// Query for each threshold
for _, threshold := range longRunningTransactionThresholds {
med: I don't love this. We can re-express all this in a single query:
SELECT
    count(*) FILTER (WHERE extract(epoch FROM clock_timestamp() - xact_start) >= 60)   AS count_60,
    count(*) FILTER (WHERE extract(epoch FROM clock_timestamp() - xact_start) >= 300)  AS count_300,
    count(*) FILTER (WHERE extract(epoch FROM clock_timestamp() - xact_start) >= 600)  AS count_600,
    count(*) FILTER (WHERE extract(epoch FROM clock_timestamp() - xact_start) >= 1800) AS count_1800,
    COALESCE(max(extract(epoch FROM clock_timestamp() - xact_start)), 0) AS oldest_timestamp_seconds
FROM
    pg_catalog.pg_stat_activity
WHERE
    state IS DISTINCT FROM 'idle'
    AND query NOT LIKE 'autovacuum:%'
    AND xact_start IS NOT NULL;

We can then break it up based on the results.
oh, yes, much better, will update :)
defer rows.Close()
// Query for each threshold
for _, threshold := range longRunningTransactionThresholds {
	rows, err := db.QueryContext(ctx, longRunningTransactionsQuery, threshold)
med: With the updated query, we can use QueryRowContext instead to avoid manual row closure handling.
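A minimal sketch of that, assuming the combined query above is stored in longRunningTransactionsQuery and the columns come back in the order of the SELECT list (variable names are illustrative):

// Sketch only: QueryRowContext defers any query error until Scan and closes the
// underlying rows for us, so there is no rows handle to manage.
var count60, count300, count600, count1800, oldestSeconds float64
row := db.QueryRowContext(ctx, longRunningTransactionsQuery)
if err := row.Scan(&count60, &count300, &count600, &count1800, &oldestSeconds); err != nil {
	return err
}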
Updated applying your changes, @mble. Check it out and see if it improved :)
| "pg_long_running_transactions", | ||
| "Current number of long running transactions", | ||
| []string{}, | ||
| prometheus.BuildFQName(namespace, longRunningTransactionsSubsystem, "count"), |
note: breaking change as this goes from pg_long_running_transactions to pg_long_running_transactions_count.
We are not using these metrics anywhere, so it shouldn't break any dashboards, and I don't intend to push this upstream. The metric inherently changed: we now have a count for each threshold, so it's a different "metric" warranting a different name. Wdyt?
Was mostly just a call out that if there were consumers, it was a breaking change. No other action needed.
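For illustration, the renamed descriptor with a per-threshold label could look roughly like this; the label name is an assumption for this sketch, not necessarily what the PR uses:

// Hypothetical sketch of the renamed metric, now carrying a threshold label.
longRunningTransactionsCount = prometheus.NewDesc(
	prometheus.BuildFQName(namespace, longRunningTransactionsSubsystem, "count"),
	"Current number of long running transactions",
	[]string{"threshold_seconds"}, // label name assumed for illustration
	prometheus.Labels{},
)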
FROM pg_catalog.pg_stat_activity
WHERE state IS DISTINCT FROM 'idle'
	AND query NOT LIKE 'autovacuum:%'
	AND pg_stat_activity.xact_start IS NOT NULL;
opt: A very minor optimisation here is to extract this into a CTE to avoid calculating the whole EXTRACT(EPOCH FROM clock_timestamp() - pg_stat_activity.xact_start) five times.
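A rough sketch of that CTE form, keeping the same filters and output columns (untested, written here as the Go query constant for illustration):

// Sketch only: the age expression is written once in the CTE instead of five times.
const longRunningTransactionsQuery = `
WITH txn_ages AS (
	SELECT EXTRACT(EPOCH FROM clock_timestamp() - xact_start) AS age_seconds
	FROM pg_catalog.pg_stat_activity
	WHERE state IS DISTINCT FROM 'idle'
	  AND query NOT LIKE 'autovacuum:%'
	  AND xact_start IS NOT NULL
)
SELECT
	count(*) FILTER (WHERE age_seconds >= 60)   AS count_60,
	count(*) FILTER (WHERE age_seconds >= 300)  AS count_300,
	count(*) FILTER (WHERE age_seconds >= 600)  AS count_600,
	count(*) FILTER (WHERE age_seconds >= 1800) AS count_1800,
	COALESCE(max(age_seconds), 0)               AS oldest_timestamp_seconds
FROM txn_ages
`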
longRunningTransactionsAgeInSeconds,
prometheus.GaugeValue,
ageValue,
)
We don't actually have tests running in CI 😬 but these changes will break the relevant test in https://github.com/planetscale/postgres_exporter/blob/main/collector/pg_long_running_transactions_test.go.
The change in the metric itself will break the tests, so I'll update them too.
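If useful, a rough idea of the shape of that update, assuming the test keeps using sqlmock the way the upstream collector tests do (column names follow the combined query discussed above; everything here is illustrative):

// Illustrative only: mock the new five-column result set for the collector.
columns := []string{"count_60", "count_300", "count_600", "count_1800", "oldest_timestamp_seconds"}
mock.ExpectQuery("SELECT").WillReturnRows(
	sqlmock.NewRows(columns).AddRow(2, 1, 0, 0, 312.5),
)
// Then assert that four pg_long_running_transactions_count samples (one per
// threshold) and one oldest-duration gauge are emitted.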
prometheus.GaugeValue,
count1800s,
"1800",
)
Very mild thing, but a loop here would be a little cleaner:
thresholds := []string{"60", "300", "600", "1800"}
counts := []float64{count60s, count300s, count600s, count1800s}
for i, threshold := range thresholds {
	ch <- prometheus.MustNewConstMetric(
		longRunningTransactionsCount,
		prometheus.GaugeValue,
		counts[i],
		threshold,
	)
}
@joaofoltran I've re-added CI, so if you could rebase, that would be great.
Updated both metrics for long running transactions. One returns 4 metrics (one for each threshold: 1min, 5min, 10min, 30min) and the other returns the duration of the longest running transaction.
With this we can check whether a long running transaction was active at the time of any issue that could cause xmin/lsn retention.
The original code just counted the transactions and reported the longest running one; with the new code we also know whether there are multiple long running ones.
Next I'll be adding these to our Grafana dashboards so we can check them when diagnosing issues.