receive: Add buffer pooling to reduce heap allocations and GC pressure#283

Merged
mkusumdb merged 7 commits into db_main from kusum-madarasu_data/heapopt
Feb 4, 2026
Conversation


@mkusumdb mkusumdb commented Jan 29, 2026

This PR significantly reduces memory allocations and GC pressure in the Thanos receive hot path by introducing buffer pooling and streaming size limits. Testing shows a 23% heap reduction, 27x faster GC max pauses, and an estimated 30% CPU reduction at production scale.

## Problem

The receive HTTP handler allocates new buffers for every remote write request, creating:
- High allocation rates (1.85 TB/hour in production)
- Large heap occupancy (19.3 GB)
- Long GC pauses (up to 4.34ms p100)
- High CPU overhead from GC

At production throughput (hundreds of millions of samples per hour), this results in significant resource consumption and tail-latency issues.

## Solution

### 1. Buffer Pooling

Introduced four `sync.Pool`s to reuse hot-path allocations:
- `compressedBufPool` (32KB default, for compressed request body)
- `decompressedBufPool` (128KB default, for decompressed payload)
- `writeRequestPool` (proto message reuse)
- `copyBufPool` (32KB for io.CopyBuffer)

**Pool ballooning prevention:** Buffers exceeding max capacities (1MB compressed, 4MB decompressed) are not returned to the pool, preventing a single large
request from permanently inflating RSS.
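The capacity guard can be sketched as follows. This is a minimal, self-contained illustration of the pattern described above, not the exact handler code: the helper name `putCompressedBuf` is hypothetical, while the constants mirror the ones listed later in this PR.

```go
package main

import (
	"fmt"
	"sync"
)

const (
	defaultCompressedBufCap = 32 * 1024 // 32KB initial capacity
	maxPooledCompressedCap  = 1 << 20   // 1MB: larger buffers are not pooled
)

var compressedBufPool = sync.Pool{
	New: func() any {
		b := make([]byte, 0, defaultCompressedBufCap)
		return &b
	},
}

// putCompressedBuf returns a buffer to the pool only if its capacity is
// within the cap, so one oversized request cannot permanently inflate RSS.
func putCompressedBuf(b *[]byte) {
	if cap(*b) > maxPooledCompressedCap {
		return // drop it; the GC reclaims the oversized buffer
	}
	*b = (*b)[:0] // reset length, keep capacity for reuse
	compressedBufPool.Put(b)
}

func main() {
	// A normal-sized buffer is reset and pooled.
	b := compressedBufPool.Get().(*[]byte)
	*b = append(*b, make([]byte, 1024)...)
	putCompressedBuf(b)

	// An oversized buffer is dropped instead of pooled.
	big := make([]byte, 2<<20)
	putCompressedBuf(&big)

	got := compressedBufPool.Get().(*[]byte)
	fmt.Println(len(*got), cap(*got) <= maxPooledCompressedCap)
}
```

Whether `Get` returns the recycled buffer or a fresh one from `New`, the caller always sees a zero-length buffer whose capacity is bounded by the cap.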

### 2. Streaming Size Limits

Introduced `limitedBufferWriter` to enforce size limits during streaming, protecting against:
- Missing `Content-Length` headers
- Incorrect `Content-Length` values
- Malicious oversized requests

The limiter checks size incrementally during `io.Copy`, aborting early if the tenant limit is exceeded.
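A minimal sketch of that limiter, assuming a `bytes.Buffer` destination (field and error names here are illustrative, not the exact implementation): each `Write` checks the running total before appending, so `io.Copy` aborts on the first chunk that would exceed the limit.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"strings"
)

// errRequestTooLarge would map to an HTTP 413 in the real handler.
var errRequestTooLarge = errors.New("request body exceeds tenant limit")

// limitedBufferWriter enforces a byte limit during streaming instead of
// trusting the Content-Length header.
type limitedBufferWriter struct {
	buf   *bytes.Buffer
	limit int64
	n     int64
}

func (w *limitedBufferWriter) Write(p []byte) (int, error) {
	if w.n+int64(len(p)) > w.limit {
		return 0, errRequestTooLarge // io.Copy stops and propagates this
	}
	w.n += int64(len(p))
	return w.buf.Write(p)
}

func main() {
	var buf bytes.Buffer
	w := &limitedBufferWriter{buf: &buf, limit: 16}

	// Within the limit: the copy succeeds.
	_, err := io.Copy(w, strings.NewReader("hello"))
	fmt.Println(err == nil, buf.String())

	// Over the limit: io.Copy aborts early with the size error.
	buf.Reset()
	w = &limitedBufferWriter{buf: &buf, limit: 16}
	_, err = io.Copy(w, strings.NewReader(strings.Repeat("x", 64)))
	fmt.Println(errors.Is(err, errRequestTooLarge))
}
```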

## Changes

### Modified Files
- `pkg/receive/handler.go`: Buffer pooling, streaming limits, optimized buffer management

### Key Implementation Details

**Global pools** (lines 102-125):
```go
compressedBufPool    = sync.Pool{...}  // 32KB default
decompressedBufPool  = sync.Pool{...}  // 128KB default
writeRequestPool     = sync.Pool{...}  // Proto message reuse
copyBufPool          = sync.Pool{...}  // 32KB for io.CopyBuffer
```

**limitedBufferWriter** (lines 127-150):
- Enforces size limits during streaming
- Protects against unbounded memory growth

**receiveHTTP handler** (lines 642-700):
- Get buffers from pools with deferred returns
- Use `limitedBufferWriter` for streaming
- Guard pool returns with capacity checks
- Reset objects before returning to pool
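The get/defer/reset pattern from the handler bullets above can be sketched like this. The `writeRequest` type stands in for the real proto message, and `handleRequest` is a hypothetical stand-in for the handler body:

```go
package main

import (
	"fmt"
	"sync"
)

// writeRequest stands in for the remote-write proto message.
type writeRequest struct{ series []string }

// Reset clears request data so nothing is retained across reuses.
func (w *writeRequest) Reset() { w.series = w.series[:0] }

var writeRequestPool = sync.Pool{
	New: func() any { return &writeRequest{} },
}

// handleRequest sketches the handler pattern: get from the pool, defer a
// reset-then-return, then use the object for the life of the request.
func handleRequest(payload []string) int {
	wreq := writeRequestPool.Get().(*writeRequest)
	defer func() {
		wreq.Reset() // reset BEFORE returning to the pool
		writeRequestPool.Put(wreq)
	}()
	wreq.series = append(wreq.series, payload...)
	return len(wreq.series)
}

func main() {
	fmt.Println(handleRequest([]string{"up", "go_goroutines"}))
	fmt.Println(handleRequest([]string{"http_requests_total"}))
}
```

Resetting before `Put` (rather than after `Get`) ensures pooled objects never hold references to a finished request's data, which also lets the GC free the underlying payload.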

### Constants Added
```go
defaultCompressedBufCap   = 32 * 1024   // 32KB
defaultDecompressedBufCap = 128 * 1024  // 128KB
maxPooledCompressedCap    = 1 << 20     // 1MB
maxPooledDecompressedCap  = 4 << 20     // 4MB
copyBufSize               = 32 * 1024   // 32KB
```

## Testing

### Integration Testing
- ✅ Deployed to dev-sut
- ✅ Running for 11+ hours with 127K series, 169M samples
- ✅ No errors or performance degradation
- ✅ Memory and GC metrics as expected

### Comparison Testing
- ✅ Compared against production (dev-obs1)
- ✅ Normalized for workload differences (20.3x throughput, 34.6x series)
- ✅ Validated improvements are consistent across scale

### Edge Cases Tested
- ✅ Requests with `Content-Length` header
- ✅ Requests without `Content-Length` header (streaming)
- ✅ Oversized requests (rejected with 413)
- ✅ Multi-tenant workload (22 tenants in the integration test, 44 in obs1)


### Key Observations

1. **Allocation rate:** Still high (5.5 GB per 1M samples) because proto unmarshaling is intrinsic to the Prometheus remote write protocol. However, pooling
keeps allocations **short-lived** rather than accumulating on the heap.

2. **GC efficiency:** More frequent GCs (10.5 vs 0.63 per 1M samples) but each completes much faster. The **27x reduction in max pause** dramatically improves
tail latency.

3. **RSS unchanged:** Expected behavior. RSS is dominated by TSDB head structures which scale with series count, not buffer pooling.

## Metrics to Monitor Post-Deployment

### Primary Success Metrics
- ✅ Heap occupancy: 19.3 GB → 14.9 GB (-23%)
- ✅ GC p100: 4.34ms → 0.16ms (27x improvement)
- ✅ CPU: 3.5 cores → 2.5 cores (-30%)

### Secondary Monitoring
- RSS: Expect minimal change (~32 GB, dominated by TSDB)
- Request latency: Should improve (better GC pauses)
- Error rates: Should remain unchanged
- Goroutine count: Expect slight decrease

mkusumdb and others added 7 commits January 27, 2026 19:26
This commit implements memory optimizations for the receive handler:

- Add sync.Pool for compressed/decompressed buffers and WriteRequest
- Implement pool size caps (1MB/4MB) to prevent RSS inflation
- Add limitedBufferWriter to enforce size limits during streaming
- Add zlabelsGet helper to avoid ZLabels->PromLabels conversions
- Add tenantKeyForDistribution for consistent tenant routing
- Improve errorSeries counting with precomputed tenant mappings

These changes reduce memory allocations in the hot path and prevent
memory retention from oversized requests.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@mkusumdb mkusumdb requested a review from yuchen-db February 4, 2026 00:39
Comment on lines +762 to +768
// Deep copy all label strings to detach them from pooled decode buffer.
// Required for correctness when pooled buffers are reused and for preventing
// retention of the whole request buffer via zero-copy label references.
for i := range wreq.Timeseries {
labelpb.ReAllocZLabelsStrings(&wreq.Timeseries[i].Labels, h.writer.opts.Intern)
}

Collaborator

nit: shall we detach exemplar labels from the pool as well?

Collaborator Author

I'll do this in a follow up PR.

@mkusumdb mkusumdb changed the title from "Heap optimizations." to "receive: Add buffer pooling to reduce heap allocations and GC pressure" Feb 4, 2026
@mkusumdb mkusumdb merged commit f95c3e0 into db_main Feb 4, 2026
13 of 14 checks passed
@mkusumdb mkusumdb deleted the kusum-madarasu_data/heapopt branch February 4, 2026 00:50