Add new metrics for long running transactions #17
Conversation
}
defer rows.Close()
// Query for each threshold
for _, threshold := range longRunningTransactionThresholds {
med: I don't love this. We can re-express all this in a single query:
SELECT
    count(*) FILTER (WHERE extract(epoch FROM clock_timestamp() - xact_start) >= 60)   AS count_60,
    count(*) FILTER (WHERE extract(epoch FROM clock_timestamp() - xact_start) >= 300)  AS count_300,
    count(*) FILTER (WHERE extract(epoch FROM clock_timestamp() - xact_start) >= 600)  AS count_600,
    count(*) FILTER (WHERE extract(epoch FROM clock_timestamp() - xact_start) >= 1800) AS count_1800,
    COALESCE(max(extract(epoch FROM clock_timestamp() - xact_start)), 0) AS oldest_timestamp_seconds
FROM
    pg_catalog.pg_stat_activity
WHERE
    state IS DISTINCT FROM 'idle'
    AND query NOT LIKE 'autovacuum:%'
    AND xact_start IS NOT NULL;

We can then break it up based on the results.
oh, yes, much better, will update :)
defer rows.Close()
// Query for each threshold
for _, threshold := range longRunningTransactionThresholds {
	rows, err := db.QueryContext(ctx, longRunningTransactionsQuery, threshold)
med: With the updated query, we can use QueryRowContext instead to avoid manual row closure handling.
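A minimal sketch of that, assuming the combined query above is stored in longRunningTransactionsQuery and the columns come back in the order of the SELECT list (variable names are illustrative):

// Sketch only: QueryRowContext defers any query error until Scan and closes the
// underlying rows for us, so there is no rows handle to manage.
var count60, count300, count600, count1800, oldestSeconds float64
row := db.QueryRowContext(ctx, longRunningTransactionsQuery)
if err := row.Scan(&count60, &count300, &count600, &count1800, &oldestSeconds); err != nil {
	return err
}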
Updated applying your changes, @mble. Check it out and see if it improved :)
| "pg_long_running_transactions", | ||
| "Current number of long running transactions", | ||
| []string{}, | ||
| prometheus.BuildFQName(namespace, longRunningTransactionsSubsystem, "count"), |
note: breaking change as this goes from pg_long_running_transactions to pg_long_running_transactions_count.
We are not using these metrics anywhere, so it shouldn't break any dashboards, and I don't intend to push this upstream. The metric inherently changed: we now have a count for each threshold, so it's a different "metric" warranting a different name. Wdyt?
Was mostly just a call out that if there were consumers, it was a breaking change. No other action needed.
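For illustration, the renamed descriptor with a per-threshold label could look roughly like this; the label name is an assumption for this sketch, not necessarily what the PR uses:

// Hypothetical sketch of the renamed metric, now carrying a threshold label.
longRunningTransactionsCount = prometheus.NewDesc(
	prometheus.BuildFQName(namespace, longRunningTransactionsSubsystem, "count"),
	"Current number of long running transactions",
	[]string{"threshold_seconds"}, // label name assumed for illustration
	prometheus.Labels{},
)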
FROM pg_catalog.pg_stat_activity
WHERE state IS DISTINCT FROM 'idle'
	AND query NOT LIKE 'autovacuum:%'
	AND pg_stat_activity.xact_start IS NOT NULL;
opt: A very minor optimisation here is to extract this into a CTE to avoid calculating the whole EXTRACT(EPOCH FROM clock_timestamp() - pg_stat_activity.xact_start) five times.
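A rough sketch of that CTE form, keeping the same filters and output columns (untested, written here as the Go query constant for illustration):

// Sketch only: the age expression is written once in the CTE instead of five times.
const longRunningTransactionsQuery = `
WITH txn_ages AS (
	SELECT EXTRACT(EPOCH FROM clock_timestamp() - xact_start) AS age_seconds
	FROM pg_catalog.pg_stat_activity
	WHERE state IS DISTINCT FROM 'idle'
	  AND query NOT LIKE 'autovacuum:%'
	  AND xact_start IS NOT NULL
)
SELECT
	count(*) FILTER (WHERE age_seconds >= 60)   AS count_60,
	count(*) FILTER (WHERE age_seconds >= 300)  AS count_300,
	count(*) FILTER (WHERE age_seconds >= 600)  AS count_600,
	count(*) FILTER (WHERE age_seconds >= 1800) AS count_1800,
	COALESCE(max(age_seconds), 0)               AS oldest_timestamp_seconds
FROM txn_ages
`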
longRunningTransactionsAgeInSeconds,
prometheus.GaugeValue,
ageValue,
)
We don't actually have tests running in CI 😬 but these changes will break the relevant test in https://github.com/planetscale/postgres_exporter/blob/main/collector/pg_long_running_transactions_test.go.
The change in the metric itself will break the tests, so I'll update them too.
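If useful, a rough idea of the shape of that update, assuming the test keeps using sqlmock the way the upstream collector tests do (column names follow the combined query discussed above; everything here is illustrative):

// Illustrative only: mock the new five-column result set for the collector.
columns := []string{"count_60", "count_300", "count_600", "count_1800", "oldest_timestamp_seconds"}
mock.ExpectQuery("SELECT").WillReturnRows(
	sqlmock.NewRows(columns).AddRow(2, 1, 0, 0, 312.5),
)
// Then assert that four pg_long_running_transactions_count samples (one per
// threshold) and one oldest-duration gauge are emitted.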
prometheus.GaugeValue,
count1800s,
"1800",
)
Very mild thing, but a loop here would be a little cleaner:
thresholds := []string{"60", "300", "600", "1800"}
counts := []float64{count60s, count300s, count600s, count1800s}
for i, threshold := range thresholds {
	ch <- prometheus.MustNewConstMetric(
		longRunningTransactionsCount,
		prometheus.GaugeValue,
		counts[i],
		threshold,
	)
}
@joaofoltran I've re-added CI, so if you could rebase, that would be great.
Updated both metrics for long running transactions. One returns 4 metrics (one for each threshold: 1min, 5min, 10min, 30min) and the other returns the duration of the longest running transaction.
With this we can check whether a long running transaction was active at the time of any issue that could cause xmin/lsn retention.
The original code just counted the transactions and reported the longest running one; with the new code we also know whether there are multiple long running ones.
Next I'll be adding these to our Grafana dashboards so we can check them when diagnosing issues.