Commit be421f0

bechols, claude, and brianmacdonald-temporal authored
Add pre-production testing best practices guide (#4175)
* Add pre-production testing best practices guide
* Trigger rebuild
* Update pre-production-testing.mdx: minor copyedits
* PR feedback

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Brian MacDonald <brian.macdonald@temporal.io>
1 parent a80c05e commit be421f0

3 files changed

Lines changed: 385 additions & 2 deletions

File tree

docs/best-practices/index.mdx

Lines changed: 3 additions & 0 deletions
@@ -54,3 +54,6 @@ This section is intended for:
  - **[Worker Deployment and Performance](./worker.mdx)** Best practices for deploying and optimizing Temporal Workers for performance and reliability.
+ - **[Pre-Production Testing](./pre-production-testing.mdx)** Experience-driven testing practices covering failure injection, load testing, and operational validation.
docs/best-practices/pre-production-testing.mdx

Lines changed: 379 additions & 0 deletions
@@ -0,0 +1,379 @@
---
title: Pre-production testing
sidebar_label: Pre-Production Testing
description: Experience-driven testing practices for teams running Temporal applications, covering failure injection, load testing, and operational validation.
toc_max_heading_level: 4
keywords:
- testing
- pre-production
- load testing
- chaos engineering
- best practices
tags:
- Best Practices
- Temporal Cloud
---

This guide collects practical, experience-driven testing practices for teams running Temporal applications.
The goal is not just to verify that things fail and recover, but to build confidence that *recovery*, *correctness*, *consistency*, and *operability* hold under real-world conditions.

The scenarios below assume familiarity with Temporal concepts such as [Namespaces](/namespaces), [Workers](/workers), [Task Queues](/task-queue), [History shards](/temporal-service/temporal-server#history-shard), [Timers](/workflow-execution/timers-delays), and [Workflow replay](/workflow-execution#replay).
Start with [Understanding Temporal](/evaluate/understanding-temporal#durable-execution) if you need background.

Before starting any load testing in Temporal Cloud, we recommend connecting with your Temporal Account team and our Developer Success Engineering team.

## Guiding principles

Before diving into specific experiments, keep these principles in mind:

- **Failure is normal**: Temporal is designed to survive failures, but *your application logic* must be too.
- **Partial failure is often harder than total failure**: Systems that are "mostly working" expose the most flaws.
- **Recovery paths deserve as much testing as steady state**: Analyze how your application behaves while recovering as closely as you analyze how it behaves while failing.
- **Build observability before you break things**: Ensure metrics, logs, and visibility tools are in place before injecting failures.
- **Testing is a continual process**: Testing is never finished; it is a practice.

## Worker testing

**Relevant best practices**: [Worker deployment and performance](/best-practices/worker), appropriate timeouts, managing Worker shutdown, idempotency

- [Worker shutdown](/encyclopedia/workers/worker-shutdown)
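
As a baseline for the shutdown scenarios below, it helps to have graceful shutdown wired up in your Workers. A minimal Go sketch (the Task Queue name and stop timeout are illustrative):

```go
package main

import (
	"log"
	"time"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	c, err := client.Dial(client.Options{}) // connects to localhost:7233 by default
	if err != nil {
		log.Fatalln("unable to create client:", err)
	}
	defer c.Close()

	w := worker.New(c, "load-test", worker.Options{
		// Give in-flight Activities time to finish before the process exits.
		WorkerStopTimeout: 30 * time.Second,
	})
	// Register your Workflows and Activities here.

	// worker.InterruptCh() turns SIGINT/SIGTERM into a graceful stop.
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited:", err)
	}
}
```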

### Kill all Workers, then restart them

**What to test**

Abruptly terminate all Workers processing a Task Queue, then restart them.

**Why it matters**

- Validates at-least-once execution semantics.
- Ensures Activities are idempotent and Workflows replay cleanly.
- Validates Task timeouts and retries, and that Workers can finish business processes.

**How to run this**

Depending on your execution environment:

- **Kubernetes**: Scale the pod count to zero, then restore it:

  ```bash
  kubectl scale deployment <deployment-name> --replicas=0 -n <namespace>
  kubectl scale deployment <deployment-name> --replicas=3 -n <namespace>
  ```

- **Azure App Service**:

  ```bash
  az webapp restart --name <app-name> --resource-group <resource-group>
  ```

**Things to watch**

- Duplicate/improper Activity results
- Workflow failures
- Workflow backlog growth and drain time
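
Surviving this test cleanly usually comes down to Activity idempotency. A minimal Go sketch, assuming a hypothetical downstream payments API that deduplicates on a caller-supplied idempotency key:

```go
package app

import (
	"context"

	"go.temporal.io/sdk/activity"
)

// PaymentsClient stands in for a downstream API that deduplicates requests
// on a caller-supplied idempotency key (hypothetical interface).
type PaymentsClient interface {
	Charge(ctx context.Context, idempotencyKey, orderID string, amountCents int64) error
}

type Activities struct {
	Payments PaymentsClient
}

// Charge is safe to re-execute after a Worker is killed mid-Activity:
// the same logical attempt always presents the same idempotency key,
// so the downstream service can deduplicate instead of double-charging.
func (a *Activities) Charge(ctx context.Context, orderID string, amountCents int64) error {
	info := activity.GetInfo(ctx)
	// Stable across retries of the same scheduled Activity.
	key := info.WorkflowExecution.ID + ":" + info.ActivityID
	return a.Payments.Charge(ctx, key, orderID, amountCents)
}
```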

### Frequent Worker restart

**What to test**

Periodically restart a fixed or random percentage (for example, 20-30%) of your Worker fleet every few minutes.

**Why it matters**

- Mimics failure modes where Workers restart due to high CPU utilization or out-of-memory errors from compute-intensive logic in Activities.
- Ensures Temporal invalidates the affected Sticky Task Queues and reschedules tasks to the associated non-Sticky Task Queue.

**How to run this**

- **Kubernetes**: Build a script using `kubectl` to randomly delete pods in a loop (see the sketch after this list).
- **Chaos Mesh**: [Simulate pod faults](https://chaos-mesh.org/docs/simulate-pod-chaos-on-kubernetes/).
- **App Services**: Scale down and up again.

**Things to watch**

- Replay latency
- Drop in Workflow and Activity completion rates
- Duplicate/improper Activity results
- Workflow failures
- Workflow backlog growth and drain time
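
A sketch of the Kubernetes option, assuming Worker pods carry a hypothetical `app=temporal-worker` label in a `workers` namespace and `kubectl` is pointed at a test cluster:

```go
// chaoskill.go: randomly delete a fraction of Worker pods on an interval.
// A sketch for test clusters only; the label, namespace, and rates are assumptions.
package main

import (
	"log"
	"math/rand"
	"os/exec"
	"strings"
	"time"
)

func main() {
	for {
		out, err := exec.Command("kubectl", "get", "pods",
			"-n", "workers", "-l", "app=temporal-worker", "-o", "name").Output()
		if err != nil {
			log.Fatalf("listing pods: %v", err)
		}
		for _, pod := range strings.Fields(string(out)) {
			// Delete roughly 25% of the fleet each round.
			if rand.Float64() < 0.25 {
				log.Printf("deleting %s", pod)
				exec.Command("kubectl", "delete", "-n", "workers", pod).Run()
			}
		}
		time.Sleep(5 * time.Minute)
	}
}
```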

## Load testing

### Pre-load test setup: expectations for success

1. Have SDK metrics accessible (not just the Cloud metrics); see the sketch after this list.
2. Understand and predict what you should see from these metrics:
   - Rate limiting (`temporal_cloud_v1_resource_exhausted_error_count`)
   - Workflow failures (`temporal_cloud_v1_workflow_failed_count`)
   - Workflow execution time (`workflow_endtoend_latency`)
   - High Cloud latency (`temporal_cloud_v1_service_latency_p95`)
   - [Worker metrics](/develop/worker-performance) (`workflow_task_schedule_to_start_latency` and `activity_schedule_to_start_latency`)
3. Determine throughput requirements ahead of time. Work with your account team to match them to Namespace capacity and avoid rate limiting. Capacity increases are handled through Temporal Support and can be requested short-term for a load test.
4. Automate how you run the load test so you can start and stop it at will, and decide how you will clear Workflow Executions that are just temporary.
5. Define what "success" looks like for this test. Be specific, with metrics and numbers stated in business terms.
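
For item 1, a minimal sketch of exposing Go SDK metrics to Prometheus via the SDK's Tally adapter (the listen address and scrape setup are assumptions):

```go
package main

import (
	"log"
	"time"

	prom "github.com/prometheus/client_golang/prometheus"
	"github.com/uber-go/tally/v4"
	"github.com/uber-go/tally/v4/prometheus"
	"go.temporal.io/sdk/client"
	sdktally "go.temporal.io/sdk/contrib/tally"
)

func main() {
	// Serve SDK metrics (schedule-to-start latency, task failures, and so on)
	// on :9090/metrics for Prometheus to scrape.
	cfg := prometheus.Configuration{ListenAddress: "0.0.0.0:9090", TimerType: "histogram"}
	reporter, err := cfg.NewReporter(prometheus.ConfigurationOptions{
		Registry: prom.NewRegistry(),
		OnError:  func(err error) { log.Println("prometheus reporter error:", err) },
	})
	if err != nil {
		log.Fatalln("unable to create reporter:", err)
	}
	scope, _ := tally.NewRootScope(tally.ScopeOptions{
		CachedReporter:  reporter,
		Separator:       prometheus.DefaultSeparator,
		SanitizeOptions: &sdktally.PrometheusSanitizeOptions,
	}, time.Second)

	c, err := client.Dial(client.Options{
		MetricsHandler: sdktally.NewMetricsHandler(sdktally.NewPrometheusNamingScope(scope)),
	})
	if err != nil {
		log.Fatalln("unable to create client:", err)
	}
	defer c.Close()
	// Register Workers and run the load test with this client.
}
```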

### Validate downstream load capacity

**Relevant best practices**: Idempotent Activities, bounded retries, appropriate timeouts and retry policies, understanding behavior when limits are reached

**What to test**

- Schedule a large number of Actions and Requests by starting many Workflows
- Increase the number until you start overloading downstream systems

**Why it matters**

Validates the behavior of your Temporal application and its dependencies under high load.

**How to run this**

Start Workflows at a rate that surpasses throughput limits. Example: [temporal-ratelimit-tester-go](https://github.com/joshmsmith/temporal-ratelimit-tester-go)
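
If you prefer a minimal starter of your own, a sketch that launches Workflows at a fixed rate (the Workflow type, Task Queue, and rate are assumptions):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"go.temporal.io/sdk/client"
)

func main() {
	c, err := client.Dial(client.Options{}) // connects to localhost:7233 by default
	if err != nil {
		log.Fatalln("unable to create client:", err)
	}
	defer c.Close()

	// Start 50 Workflows per second until interrupted.
	ticker := time.NewTicker(time.Second / 50)
	defer ticker.Stop()
	for i := 0; ; i++ {
		<-ticker.C
		_, err := c.ExecuteWorkflow(context.Background(), client.StartWorkflowOptions{
			ID:        fmt.Sprintf("load-test-%d", i),
			TaskQueue: "load-test",      // assumed Task Queue name
		}, "LoadTestWorkflow") // assumed Workflow type name
		if err != nil {
			log.Println("start failed:", err)
		}
	}
}
```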

**Things to watch**

- Downstream service error rates (HTTP 5xx, database errors)
- Increased downstream service latency and saturation metrics
- Activity failure rates, specifically distinguishing retryable from non-retryable errors
- Activity retry and backoff behavior against the overloaded system
- Workflow backlog growth and drain time
- Correctness and consistency of data (ensuring Activity idempotency holds under duress)
- Worker CPU/memory utilization

### Validate rate limiting behavior

**Relevant best practices**: [Manage Namespace capacity limits](/best-practices/managing-aps-limits), understanding behavior when limits are reached

**What to test**

- Schedule a large number of Actions and Requests by starting many Workflows
- Increase the number until you get rate limited (triggering the metric [`temporal_cloud_v0_resource_exhausted_error_count`](/cloud/metrics/reference#temporal_cloud_v0_resource_exhausted_error_count))

**Why it matters**

Validates the behavior of the Cloud service under high load: "In Temporal Cloud, the effect of rate limiting is increased latency, not lost work. Workers might take longer to complete Workflows."

**How to run this**

1. (Optional) Decrease a test Namespace's rate limits to make it easier to hit them
2. Calculate current APS (Actions per second) at current throughput (in production)
3. Calculate the Workflow throughput needed to surpass limits
4. Start Workflows at a rate that surpasses throughput limits, for example with [temporal-ratelimit-tester-go](https://github.com/joshmsmith/temporal-ratelimit-tester-go)

**Things to watch**

- Worker behavior when rate limited
- Client behavior when rate limited
- Temporal request and long_request failure rates
- Workflow success rates
- Workflow latency
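
On the client side, a sketch of detecting Cloud rate limiting on Workflow starts and backing off rather than hammering the Namespace harder (the retry loop is illustrative):

```go
package loadtest

import (
	"context"
	"errors"
	"log"
	"time"

	"go.temporal.io/api/serviceerror"
	"go.temporal.io/sdk/client"
)

// startWithBackoff retries a Workflow start with exponential backoff
// whenever the service reports ResourceExhausted (rate limiting).
func startWithBackoff(c client.Client, opts client.StartWorkflowOptions, wf string) error {
	backoff := time.Second
	for {
		_, err := c.ExecuteWorkflow(context.Background(), opts, wf)
		var re *serviceerror.ResourceExhausted
		if errors.As(err, &re) {
			log.Printf("rate limited, retrying in %v", backoff)
			time.Sleep(backoff)
			backoff *= 2
			continue
		}
		return err
	}
}
```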

## Failover and availability

**Relevant best practices**: Use [High Availability features](/cloud/high-availability) for critical workloads.

- [High Availability monitoring](/cloud/high-availability/monitoring)

### Test region failover

**What to test**

Trigger a [High Availability](/cloud/high-availability) failover event for a Namespace.

**Why it matters**

- Real outages are messy and rarely isolated.
- Ensures your operational playbooks and automation are resilient.
- Validates Worker and Namespace failover behavior.

**How to run this**

Execute a manual failover per the [manual failovers documentation](/cloud/high-availability/failovers#manual-failovers).

**Things to watch**

- Namespace availability
- Client and Worker connectivity to the failover region
- Workflow Task reassignments
- Human-in-the-loop recovery steps

## Dependency and downstream testing

### Break the things your Workflows call

**What to test**

Intentionally break or degrade downstream dependencies used by Activities:

- Make databases read-only or unavailable
- Inject high latency or error rates into external APIs
- Throttle or pause message queues and event streams

**Why it matters**

- Temporal guarantees Workflow durability, not dependency availability.
- Validates that Activities are retryable, idempotent, and correctly timeout-bounded.
- Ensures Workflows make forward progress instead of livelocking on broken dependencies.

**Things to watch**

- Activity retry and backoff behavior
- Heartbeat effectiveness for long-running Activities
- Database connection exhaustion and retry storms
- API timeouts vs Activity timeouts
- Whether failures propagate as Signals, compensations, or Workflow-level errors

**Anti-patterns this reveals**

- Non-idempotent Activities
- Infinite retries without circuit breaking
- Using Workflow logic to "wait out" broken dependencies
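
A sketch of Activity options in the Go SDK that avoid these anti-patterns: bounded attempts, timeouts on each attempt, and a heartbeat so stuck Workers are detected (all values are illustrative):

```go
package app

import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

func CallDependencyWorkflow(ctx workflow.Context) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		// Bound each attempt; the API client's own timeout should be
		// shorter so the Activity fails cleanly instead of hanging.
		StartToCloseTimeout: 30 * time.Second,
		// Long-running Activities should heartbeat at least this often.
		HeartbeatTimeout: 10 * time.Second,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:        time.Second,
			BackoffCoefficient:     2.0,
			MaximumInterval:        time.Minute,
			MaximumAttempts:        10, // bounded retries, not infinite
			NonRetryableErrorTypes: []string{"InvalidRequest"}, // assumed error type
		},
	})
	return workflow.ExecuteActivity(ctx, "CallDependency").Get(ctx, nil)
}
```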

## Deployment and code-level testing

### Deploy a Workflow change with versioning

**Relevant best practices**: Implement a versioning strategy.

- [Workflow Versioning Strategies - Developer Corner](https://community.temporal.io/t/workflow-versioning-strategies/6911)
- [Worker Versioning](/production-deployment/worker-deployments/worker-versioning)
- [Replay Testing](/evaluate/development-production-features/testing-suite)

**What to test**

- Deploy Workflow code that would introduce non-deterministic errors (NDEs), but use a versioning strategy to deploy it successfully
- Validate Workflow success and clear the backlog of Tasks

**Why it matters**

- Unplanned NDEs can be a painful surprise
- Tests versioning strategy and patching discipline to build production confidence

**Things to watch**

- Workflow Task failure reasons
- Effectiveness of versioning and patching patterns
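
One common patching tool is the Go SDK's `workflow.GetVersion`. A sketch, assuming a hypothetical change that swaps one Activity for another:

```go
package app

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

func OrderWorkflow(ctx workflow.Context) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})

	// Gate the new code path so Workflows started before the deploy
	// replay the old path deterministically.
	v := workflow.GetVersion(ctx, "use-new-shipper", workflow.DefaultVersion, 1)
	if v == workflow.DefaultVersion {
		return workflow.ExecuteActivity(ctx, "ShipViaLegacyCarrier").Get(ctx, nil)
	}
	return workflow.ExecuteActivity(ctx, "ShipViaNewCarrier").Get(ctx, nil)
}
```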

### Deploy a version that causes NDEs, then recover

**Relevant best practices**: Implement a versioning strategy.

- [Workflow Versioning Strategies - Developer Corner](https://community.temporal.io/t/workflow-versioning-strategies/6911)
- [Worker Versioning](/production-deployment/worker-deployments/worker-versioning)
- [Replay Testing](/evaluate/development-production-features/testing-suite)

**What to test**

- Deploy Workflow code that introduces non-deterministic errors (NDEs)
- Attempt a rollback to a known-good version, or apply versioning strategies to land the new changes successfully
- Clear or recover the backlog of Tasks

**Why it matters**

- Unplanned NDEs can be a painful surprise
- Tests versioning strategy, patching discipline, and recovery tooling

**Things to watch**

- Workflow Task failure reasons
- Backlog growth and drain time
- Effectiveness of versioning and patching patterns
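
Replay tests catch these NDEs before deployment. A minimal Go sketch, assuming a Workflow history exported to `history.json` and the `OrderWorkflow` from the previous sketch:

```go
package app

import (
	"testing"

	"go.temporal.io/sdk/worker"
)

// TestReplay fails if the current OrderWorkflow code is not deterministic
// against a history recorded by the previously deployed version.
func TestReplay(t *testing.T) {
	replayer := worker.NewWorkflowReplayer()
	replayer.RegisterWorkflow(OrderWorkflow)
	// history.json: a history exported from a real execution (path assumed).
	if err := replayer.ReplayWorkflowHistoryFromJSONFile(nil, "history.json"); err != nil {
		t.Fatal(err)
	}
}
```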

## Network-level testing

The scenarios below are most relevant if your infrastructure introduces network boundaries (such as firewalls, VPNs, or network policies) between Workers and the Temporal service, or if you need to verify application behavior during prolonged disconnections.

**Relevant best practices**: Idempotent Activities, bounded retries, appropriate timeouts

- [Activity timeouts](https://temporal.io/blog/activity-timeouts)
- [Idempotency and durable execution](https://temporal.io/blog/idempotency-and-durable-execution)

### Remove network connectivity to a Namespace

**What to test**

Temporarily block all network access between Workers and the Temporal service for a Namespace.

**Why it matters**

- Validates Worker retry behavior, Sticky Task Queue behavior, Worker recovery performance, backoff policies, and Workflow replay determinism under prolonged disconnection.
- Ensures no assumptions are made about "always-on" connectivity.

**Temporal failure modes exercised**

- Workflow Task timeouts vs retries
- Activity retry semantics
- Replay correctness after long gaps

**How to run this**

- **Kubernetes**: Apply a NetworkPolicy that denies egress from Worker pods to the Temporal APIs.
- **[Toxiproxy](https://github.com/Shopify/toxiproxy)**: Route Worker traffic through a proxy you can disable at will, proving your application doesn't have single points of failure (see the sketch after this list).
- **Chaos Mesh / Litmus**: NetworkChaos with full packet drop.
- **Local testing**: Block ports with iptables or firewall rules.
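
A sketch of the Toxiproxy option using its Go client, assuming a Toxiproxy server runs locally on its default port and Workers are configured to dial the proxy address rather than the real endpoint (the Cloud endpoint below is an assumption):

```go
package main

import (
	"log"

	toxiproxy "github.com/Shopify/toxiproxy/v2/client"
)

func main() {
	client := toxiproxy.NewClient("localhost:8474") // default Toxiproxy API port

	proxy, err := client.CreateProxy("temporal-grpc",
		"localhost:7233",                 // what Workers connect to
		"your-ns.abc12.tmprl.cloud:7233") // real endpoint (assumed)
	if err != nil {
		log.Fatal(err)
	}

	// Cut connectivity entirely by disabling the proxy...
	if err := proxy.Disable(); err != nil {
		log.Fatal(err)
	}
	// ...observe Worker behavior during the outage, then restore it.
	if err := proxy.Enable(); err != nil {
		log.Fatal(err)
	}
}
```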

**Things to watch**

- Workflow failures (replay, timeout)
- Workflow Task retries
- Activity failures and their classification (retryable vs non-retryable)
- Worker CPU usage during reconnect storms

## Observability checklist

Before (and during) testing, ensure visibility into:

- Workflow Task and Activity failure rates
- Throughput limits and usage
- Workflow and Activity end-to-end latencies
- Task latency and backlog depth
- Workflow History size and event counts
- Worker CPU, memory, and restart counts
- gRPC error codes
- Retry behavior

## Game day runbook

Use this checklist when running tests during a scheduled game day or real incident simulation.

### Before you start

- Make sure people know you're testing and which scenarios you're trying
- Let the teams that support the APIs you're calling know you're testing
- Reach out to the Temporal Cloud Support and Account teams to coordinate
- Dashboards ready for SDK and Cloud metrics:
  - Task latency, backlog depth, Workflow failures, Activity failures
- Alerts muted or routed appropriately
- Known-good deployment artifact available
- Rollback and scale controls verified

### During testing

- Introduce *one variable at a time*
- Record start/stop times of each experiment
- Capture screenshots or logs of unexpected behavior
- Track backlog growth and drain rate

### Recovery validation

- Workflows resume without manual intervention
- No permanent Workflow Task failures (unless intentional)
- Activity retries behave as expected
- Backlogs drain in predictable time

### After action review

- Identify unclear alerts or missing metrics/alerts
- Update retry, timeout, or versioning policies
- Document surprises and operational debt

## Summary

Pre-production testing with Temporal is about more than proving durability: it's about proving *operability under stress*.
You want to go through these exercises and know what to do before you go to production and have to do it for real.

If your system survives:

- Connectivity issues
- Repeated failovers
- Greater-than-expected load
- Mass Worker churn

...then you can have confidence it's ready for many kinds of production chaos.
