Commit 2fbeb18

bechols, claude, and brianmacdonald-temporal authored
Add cost optimization best practices guide (#4174)
* Add cost optimization best practices guide

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Update cost-optimization.mdx

  Copyedits and wording changes

* PR feedback

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Brian MacDonald <brian.macdonald@temporal.io>
1 parent f219e8c commit 2fbeb18

3 files changed

Lines changed: 284 additions & 2 deletions

Lines changed: 280 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,280 @@
1+
---
2+
title: Workflow cost optimization
3+
sidebar_label: Cost Optimization
4+
description: Strategies for optimizing costs associated with workloads running on Temporal Cloud while maintaining workflow reliability and observability.
5+
toc_max_heading_level: 4
6+
keywords:
7+
- cost optimization
8+
- actions
9+
- storage
10+
- pricing
11+
- best practices
12+
tags:
13+
- Best Practices
14+
- Temporal Cloud
15+
---
16+
17+
This guide provides strategies for optimizing costs associated with workloads running on Temporal Cloud while maintaining Workflow reliability and observability.
18+
19+
## Overview
20+
21+
Temporal Cloud uses consumption-based pricing with two primary cost components: [Actions and Storage](/cloud/pricing#action).
22+
Optimization opportunities vary significantly with workload characteristics: Workflows with high Signal volume face different cost drivers than long-running Workflows with large payloads.
23+
24+
:::important
25+
Build Workflows following best practices first, then optimize based on observed costs.
26+
Premature optimization can compromise observability and create operational challenges.
27+
:::
28+
29+
Every optimization involves tradeoffs.
30+
This guide helps you make informed decisions about where and how to optimize based on your specific requirements.
31+
32+
Should you need additional guidance on Workflow design considerations, please reach out to a Temporal Solutions Architect.
33+
34+
## Common anti-patterns
35+
36+
Avoid these patterns that either inflate costs unnecessarily or create problems through aggressive optimization:
37+
38+
### Premature Activity consolidation
39+
40+
Combining Activities before understanding failure modes reduces observability and retry control.
41+
Activities should be split based on failure boundaries and retry requirements, not cost optimization alone.
42+
See [How many Activities should I use in my Temporal Workflow](https://temporal.io/blog/how-many-activities-should-i-use-in-my-temporal-workflow) for a decision framework.
43+
44+
### Inappropriate use of Local Activities
45+
46+
Using Local Activities for all operations without understanding their failure semantics and limitations.
47+
Local Activities don't provide Worker-level isolation and have different retry behavior.
48+
See [Local Activities](/local-activity) for guidance.
49+
50+
### Missing Continue-As-New
51+
52+
Long-running Workflows that don't implement Continue-As-New accumulate large Event Histories, increasing storage costs and impacting performance.
53+
Workflows that run for days or weeks, or that process thousands of events, require [Continue-As-New](/workflow-execution/continue-as-new).
54+
55+
### High volume of Activity retries
56+
57+
Generally, the default values for [Activity retries](/encyclopedia/retry-policies) are quite good.
58+
However, excessive Activity retries often indicate underlying issues like timeouts that are too short or Activities that frequently fail.
59+
Monitor Activity retry frequency and, if it is high, consider increasing retry intervals or Activity timeouts so that fewer attempts fail in the first place.
60+
See [Spooky Stories: Chilling Temporal Anti-Patterns](https://temporal.io/blog/spooky-stories-chilling-temporal-anti-patterns-part-2#2-hiding-behind-the-chainsaws) for guidance on retry defaults and patterns.
61+
62+
### Large payloads in Workflow History
63+
64+
Passing multi-megabyte payloads through Workflows when external storage (S3, blob storage) is more appropriate.
65+
Use [compression](/troubleshooting/blob-size-limit-error#why-does-this-error-occur) or the [claim check pattern](https://dataengineering.wiki/Concepts/Software+Engineering/Claim+Check+Pattern) for large data.
66+
67+
### Over-optimization at the expense of observability
68+
69+
Aggressively optimizing costs without maintaining sufficient visibility for debugging and operational needs.
70+
Balance cost reduction with your team's observability requirements.
71+
72+
For example, merging five separate Activities (validate input, call payment API, update database, send notification, generate receipt) into a single "processOrder" Activity reduces the Action count from 5 to 1, but you lose per-step visibility in the Temporal UI.
73+
When one step fails (say, the notification), you can't see which of the five steps it was, you lose independent retry control (a notification failure retries the entire flow, including the payment call), and you can't filter Workflows by failure stage.
74+
75+
Similarly, removing [Heartbeats](/encyclopedia/detecting-activity-failures#activity-heartbeat) from a long-running data processing Activity saves Actions, but means you can't detect a stuck Worker until the full Activity timeout expires and you lose progress tracking (for example, "processed 500 of 1,000 records").
76+
77+
### Excessive Activity Heartbeats
78+
79+
Each Heartbeat counts as one Action.
80+
Only use Heartbeats for long-running Activities (10+ minutes) where you need to detect Worker failures and track progress.
81+
Short-running Activities that complete in seconds or minutes don't need Heartbeats.
82+
See [Activity Heartbeat documentation](/encyclopedia/detecting-activity-failures#which-activities-should-heartbeat) for guidance.
83+
84+
## Understanding cost drivers
85+
86+
Temporal Cloud pricing consists of Actions, Storage, and Support.
87+
If you are new to Temporal Cloud, see the [pricing documentation](/cloud/pricing#action) to learn more and familiarize yourself with [what results in a billable Action](/cloud/actions) in Temporal Cloud.
88+
89+
### Cost distribution
90+
91+
For most workloads, Actions represent the majority of total costs, with storage typically accounting for 10% or less of a monthly bill.
92+
Focus optimization efforts on what is driving costs for a specific workload:
93+
94+
**High Actions costs generally indicate**:
95+
96+
- [Many Activities per Workflow](https://temporal.io/blog/how-many-activities-should-i-use-in-my-temporal-workflow)
97+
- Frequent [Signals, Queries, or Updates](/encyclopedia/workflow-message-passing#choosing-messages)
98+
- Long-running Activities with [Heartbeats](/cloud/worker-health#manage-worker-heartbeating)
99+
- High [Activity retry rates](/develop/activity-retry-simulator)
100+
- Extensive Query usage
101+
102+
**High Storage costs generally indicate**:
103+
104+
- Large payloads in Workflow inputs, outputs, or Activity results
105+
- Long retention periods with high Workflow volume
106+
- Long-running Workflows without Continue-As-New
107+
- Workflows accumulating large Event Histories
108+
109+
### Optimization priority
110+
111+
1. **Actions optimization**: Usually provides the largest cost reduction opportunity
112+
2. **Active Storage optimization**: Relevant for long-running Workflows or large payloads
113+
3. **Retained Storage optimization**: Relevant for high volume combined with long retention periods
114+
115+
## Measuring
116+
117+
Establish [baseline metrics](https://docs.temporal.io/cloud/metrics/reference) before optimizing and be sure to validate impact after implementation.
118+
Specifically:
119+
120+
- Actions consumption (per Workflow, per day/month, by Namespace)
121+
- Storage consumption (Active and Retained)
122+
- Monthly costs (total, per Namespace, per Workflow Type)
123+
- Observability metrics (time to debug, incident detection)
124+
125+
## Actions optimization
126+
127+
[Actions](https://docs.temporal.io/cloud/actions) encompass Workflow operations, Activity Executions, Signals, Queries, and other interactions with Temporal.
128+
Each represents a unit of consumption.
129+
130+
### Activity granularity
131+
132+
Activity granularity is a fundamental architectural decision that impacts both costs and observability.
133+
More Activities provide better visibility and retry control but increase the Action count.
134+
Fewer Activities reduce costs but limit observability.
135+
136+
For detailed discussion of this tradeoff, see [How many Activities should I use in my Temporal Workflow?](https://temporal.io/blog/how-many-activities-should-i-use-in-my-temporal-workflow)
137+
138+
### Child Workflows vs Activities
139+
140+
[Child Workflows cost 2 Actions](/cloud/actions#child-workflows) compared to an Activity's 1 Action.
141+
See [Child Workflows documentation](/child-workflows) for detailed comparison of capabilities and use cases.
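As a rough illustration of how these per-item counts add up, the following sketch estimates Actions for one Workflow Execution.
It uses only the figures quoted in this guide (1 Action per Activity, 2 per Child Workflow, 1 per retry, 1 per Heartbeat) and ignores every other billable item, so treat it as a back-of-the-envelope aid rather than a billing calculator.
The `WorkflowShape` class and `estimate_actions` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class WorkflowShape:
    """Rough per-Workflow profile; all fields are illustrative assumptions."""
    activities: int = 0        # regular Activity Executions (1 Action each)
    child_workflows: int = 0   # Child Workflows (2 Actions each)
    activity_retries: int = 0  # each retry counts as 1 Action
    heartbeats: int = 0        # each Heartbeat counts as 1 Action


def estimate_actions(shape: WorkflowShape) -> int:
    """Sum Actions using only the per-item counts quoted in this guide."""
    return (
        shape.activities
        + 2 * shape.child_workflows
        + shape.activity_retries
        + shape.heartbeats
    )


# Five steps modeled as Activities versus as Child Workflows.
as_activities = estimate_actions(WorkflowShape(activities=5))
as_children = estimate_actions(WorkflowShape(child_workflows=5))
print(as_activities, as_children)  # 5 10
```

Multiplying the per-Workflow estimate by expected Workflow volume gives a first-order comparison between two candidate designs.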
142+
143+
### Retry Policies
144+
145+
Each Activity retry counts as one Action.
146+
Default [Retry Policies](/encyclopedia/retry-policies) can be aggressive, which is appropriate for most operations but costly for expensive external operations.
147+
148+
For example, consider an Activity that calls a third-party payment API.
149+
With Temporal's default Retry Policy (1s initial interval, 2.0 backoff coefficient, unlimited maximum attempts), if that API goes down for 30 minutes, the retry interval doubles until it reaches the 100s maximum interval cap after about 7 retries (roughly 2 minutes), and the Activity then retries every 100 seconds, for a total of a little over 20 retries during the outage.
150+
Each retry counts as 1 Action.
151+
Across 1,000 concurrent Workflows hitting the same outage, that produces 20,000+ extra Actions from retries alone.
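The arithmetic behind this estimate can be reproduced with a small simulation.
This is a minimal sketch that assumes every attempt fails instantly (so only the backoff waits consume time) and uses the 1s initial interval, 2.0 backoff coefficient, and 100s maximum interval mentioned above; `count_retries` is a hypothetical helper, not part of any Temporal SDK.

```python
def count_retries(outage_seconds: float,
                  initial_interval: float = 1.0,
                  backoff_coefficient: float = 2.0,
                  maximum_interval: float = 100.0,
                  maximum_attempts: int = 0) -> int:
    """Count retries scheduled during an outage.

    Assumes each attempt fails immediately. maximum_attempts=0 means
    unlimited; otherwise it caps total attempts, including the first one.
    """
    elapsed = 0.0
    interval = initial_interval
    retries = 0
    while True:
        # Stop once the attempt cap (first attempt + retries) is reached.
        if maximum_attempts and retries + 1 >= maximum_attempts:
            break
        # Stop once the next retry would land after the outage ends.
        if elapsed + interval > outage_seconds:
            break
        elapsed += interval
        retries += 1
        interval = min(interval * backoff_coefficient, maximum_interval)
    return retries


per_workflow = count_retries(30 * 60)
print(per_workflow, per_workflow * 1000)  # 23 23000
```

Under the same assumptions, `count_retries(30 * 60, maximum_attempts=5)` yields 4 retries, and `count_retries(30 * 60, initial_interval=10, maximum_interval=1000)` yields 7, which shows how much a capped or slower policy reduces retry Actions during a long outage.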
152+
153+
For expensive external operations like this, consider:
154+
155+
- Setting `MaximumAttempts` to cap total retries
156+
- Increasing `InitialInterval` (for example, to 10s) to reduce retry frequency
157+
- Adding error types to `NonRetryableErrorTypes` for errors that won't resolve on retry (such as 4xx HTTP status codes)
158+
- Using [next retry delay](/encyclopedia/retry-policies#per-error-next-retry-delay) to dynamically control retry timing based on failure types (for example, respecting rate-limit headers)
159+
- Implementing an [Activity pause pattern](/cli/activity#pause) to wait for manual intervention rather than automatic retries
160+
161+
Use the [Activity Retry Simulator](/develop/activity-retry-simulator) to visualize how different Retry Policy configurations affect retry behavior and Action consumption.
162+
163+
Refer to this blog post on [Mastering Workflow retry logic for resilient applications](https://temporal.io/blog/failure-handling-in-practice) for additional guidance.
164+
165+
### Local Activities
166+
167+
A [Local Activity](/local-activity#local-activity) is an Activity Execution that executes in the same process as the Workflow Execution that spawns it.
168+
Therefore, multiple Local Activities that run back-to-back [count as a single billable Action](/cloud/actions#activities), whereas each regular Activity counts as its own billable Action.
169+
However, there are tradeoffs to converting regular Activities to Local Activities.
170+
For example, if a specific Local Activity fails, *all* of them will be retried together.
171+
Review [the docs](/local-activity) or reach out to your account team to learn more.
172+
173+
#### When to stick with Regular Activities
174+
175+
Use Regular Activities instead of Local Activities if you require any of the following:
176+
177+
- Support for Activities that may take more than 10 seconds to complete
178+
- Independent retry control for each Activity
179+
- No re-running of expensive Activities when unrelated Activities fail
180+
- Immediate Signal/Update handling during execution
181+
- Separate resource management (like rate limits) for each Activity
182+
183+
### Batching operations
184+
185+
#### Search Attributes
186+
187+
1. [Search Attributes](/search-attribute#custom-search-attribute) provided at Workflow start do not count as billable Actions.
188+
If Search Attribute values are known before starting the Workflow, provide them at Workflow start to eliminate these costs entirely.
189+
2. For Search Attributes that must be updated during Workflow Execution, each `UpsertSearchAttributes` call counts as 1 Action regardless of how many attributes are updated.
190+
Batch multiple related attribute updates into single operations to reduce Actions consumed.
191+
192+
See the [Temporal Cloud Action Documentation](/cloud/actions#workflows) for details.
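One way to apply the batching advice is to collect attribute changes and send them in a single upsert.
The sketch below is SDK-agnostic: `SearchAttributeBatcher` is a hypothetical helper, and its `upsert` argument is a stand-in for whatever upsert call your SDK provides.

```python
from typing import Any, Callable, Dict


class SearchAttributeBatcher:
    """Collect attribute changes and send them in one upsert call.

    `upsert` is a hypothetical stand-in for the SDK's upsert function.
    """

    def __init__(self, upsert: Callable[[Dict[str, Any]], None]) -> None:
        self._upsert = upsert
        self._pending: Dict[str, Any] = {}

    def set(self, key: str, value: Any) -> None:
        # Later writes to the same key overwrite earlier ones.
        self._pending[key] = value

    def flush(self) -> None:
        # One call (one Action) no matter how many attributes changed.
        if self._pending:
            self._upsert(dict(self._pending))
            self._pending.clear()


# Record calls in a list instead of talking to a real service.
calls = []
batcher = SearchAttributeBatcher(calls.append)
batcher.set("OrderStatus", "paid")
batcher.set("Region", "eu-west")
batcher.set("OrderStatus", "shipped")
batcher.flush()
print(len(calls), calls[0])
```

Three attribute changes result in one upsert call here; the tradeoff is that intermediate values (such as `paid`) are never visible to searches.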
193+
194+
#### Signal handling
195+
196+
Where feasible, implement deduplication logic client-side or aggregate data into fewer Signals.
197+
Use `SignalWithStart` instead of separate `StartWorkflow` and `SignalWorkflow` calls when initiating Workflows with Signals.
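Client-side deduplication and aggregation can be sketched as follows.
This is an illustrative pattern only: `SignalAggregator` and its `send_signal` callable are hypothetical, event IDs are assumed to uniquely identify a logical event, and the receiving Workflow's Signal handler is assumed to accept a list of events.

```python
from typing import Callable, Dict, List


class SignalAggregator:
    """Deduplicate events by ID and send them as one batched Signal.

    `send_signal` is a hypothetical callable wrapping the client's
    Signal call.
    """

    def __init__(self, send_signal: Callable[[List[dict]], None],
                 max_batch: int = 50) -> None:
        self._send_signal = send_signal
        self._max_batch = max_batch
        self._pending: Dict[str, dict] = {}

    def add(self, event_id: str, payload: dict) -> None:
        # A repeated event_id replaces the earlier copy instead of
        # producing another Signal.
        self._pending[event_id] = payload
        if len(self._pending) >= self._max_batch:
            self.flush()

    def flush(self) -> None:
        # Send everything collected so far as a single Signal.
        if self._pending:
            self._send_signal(list(self._pending.values()))
            self._pending.clear()


sent = []
aggregator = SignalAggregator(sent.append, max_batch=3)
for event_id, qty in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    aggregator.add(event_id, {"id": event_id, "qty": qty})
print(len(sent), sent[0])
```

Four incoming events produce one Signal carrying three deduplicated entries; the cost is added latency while events wait in the batch, so a production version would likely also flush on a timer.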
198+
199+
## Storage optimization
200+
201+
[Storage costs](/cloud/pricing#storage) are divided into Active Storage (open Workflows) and Retained Storage (closed Workflow History during retention period).
202+
Active Storage is significantly more expensive than Retained Storage.
203+
204+
### Active Storage
205+
206+
Active Storage applies to open Workflows and their Event Histories.
207+
The following sections detail optimization opportunities for Active Storage.
208+
209+
#### Continue-As-New
210+
211+
For long-running Workflows with extended sleep/wait periods, calling Continue-As-New before sleeping closes the current execution (moving to cheaper Retained Storage) and starts fresh when work resumes, reducing Active Storage costs.
212+
See [Continue-As-New documentation](/workflow-execution/continue-as-new) to learn more.
213+
214+
#### Compression
215+
216+
Large payloads increase Active Storage costs.
217+
Implement a custom Data Converter with compression for moderately large payloads (100 KB to 1 MB).
218+
See the [Data Converter documentation](/default-custom-data-converters#custom-data-converter) to learn more.
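The core of such a converter is a threshold-based encode/decode pair.
The sketch below shows only that logic using Python's standard `zlib` module; the threshold, the one-byte marker scheme, and the function names are assumptions made for illustration, and wiring this into a Data Converter is SDK-specific and omitted here.

```python
import zlib

COMPRESS_THRESHOLD = 100 * 1024  # bytes; assumed lower bound for compressing
MARKER_RAW = b"\x00"
MARKER_ZLIB = b"\x01"


def encode(data: bytes) -> bytes:
    """Compress payloads at or above the threshold and tag the result
    with a one-byte marker so decode() knows what to do."""
    if len(data) >= COMPRESS_THRESHOLD:
        return MARKER_ZLIB + zlib.compress(data)
    return MARKER_RAW + data


def decode(blob: bytes) -> bytes:
    """Reverse encode() by inspecting the marker byte."""
    marker, body = blob[:1], blob[1:]
    if marker == MARKER_ZLIB:
        return zlib.decompress(body)
    return body


payload = b'{"status": "ok"}' * 20000  # 320,000 bytes of repetitive JSON
encoded = encode(payload)
assert decode(encoded) == payload
print(len(payload), len(encoded) < len(payload))
```

Small payloads pass through untouched, which avoids spending CPU on data that would not shrink meaningfully.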
219+
220+
#### Claim check pattern
221+
222+
For very large payloads or binary data, store data externally (S3 or GCS) and pass references through Workflows.
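A minimal sketch of the pattern, assuming an in-memory dictionary as a stand-in for S3 or GCS:
`InMemoryBlobStore`, `check_in`, and `check_out` are hypothetical names, and a real implementation would replace the store with calls to your object storage client.

```python
import hashlib
from typing import Dict


class InMemoryBlobStore:
    """Stand-in for S3 or GCS; a real store would make network calls."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # Content-addressed key, so storing the same bytes twice
        # (for example, on an Activity retry) is idempotent.
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def check_in(store: InMemoryBlobStore, data: bytes) -> dict:
    """Store the payload externally and return a small reference
    that is cheap to pass through the Workflow."""
    return {"blob_key": store.put(data), "size": len(data)}


def check_out(store: InMemoryBlobStore, reference: dict) -> bytes:
    """Resolve a reference back to the original payload,
    typically inside the Activity that needs the data."""
    return store.get(reference["blob_key"])


store = InMemoryBlobStore()
reference = check_in(store, b"x" * 5_000_000)
print(reference["size"], len(reference["blob_key"]))  # 5000000 64
```

Only the small reference dictionary enters Event History; the 5 MB payload stays in external storage, whose lifecycle and access control you then need to manage separately.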
223+
224+
### Retained Storage
225+
226+
Retained Storage applies to closed Workflow History during the retention period.
227+
The following sections detail optimization opportunities for Retained Storage.
228+
229+
#### Retention periods
230+
231+
The default Namespace retention is 30 days (configurable between 1 and 90 days).
232+
Adjust based on operational and compliance requirements.
233+
234+
**Considerations**:
235+
236+
- Shorter retention reduces costs but limits historical analysis
237+
- Audit investigation patterns before shortening retention
238+
- Ensure compliance requirements are met
239+
240+
See [Namespace retention documentation](/temporal-service/temporal-server#retention-period) for configuration details.
241+
242+
#### Workflow export
243+
244+
Temporal Cloud supports exporting Workflow Histories to external storage for compliance while maintaining shorter retention periods.
245+
Note that Workflow export costs one Action per export.
246+
247+
See the [Workflow History export documentation](/cloud/export) for more details.
248+
Alternatively, if you are looking to do analysis on closed Workflow Executions, [review this blog post to learn how to gain insights from exported Workflow Histories](https://temporal.io/blog/get-insights-from-workflow-histories-export-on-temporal-cloud).
249+
250+
## Validation
251+
252+
### Validation approach
253+
254+
1. **Test in non-production**: Validate functional correctness before production deployment
255+
2. **Monitor comprehensively**: Leverage the [Usage dashboard](/cloud/actions#usage) in the Cloud UI to track the impact on Actions and Storage after optimizations are made
256+
3. **Progressive rollout**: Deploy to a small percentage, validate, then expand. Review the [Worker Versioning documentation](/production-deployment/worker-deployments/worker-versioning) to learn about rolling out changes to Workflows
257+
4. **Continuous review**: Re-evaluate optimization effectiveness quarterly as the system evolves
258+
259+
### Success criteria
260+
261+
- Cost reduced without increasing mean time to repair (MTTR)
262+
- Workflow success rates maintained or improved
263+
- Mean time to detect (MTTD) incidents does not increase despite any reduction in observability
264+
265+
### Tools
266+
267+
- Temporal Cloud Usage dashboard for Actions and Storage metrics
268+
- Workflow History for per-Workflow billable Actions estimates
269+
- Export metrics to observability platforms (Datadog, Grafana, etc.) for custom monitoring
270+
271+
## When to get help
272+
273+
Engage the Temporal team for Workflow audits when experiencing:
274+
275+
- Complex Workflow patterns with unclear optimization paths
276+
- Compliance requirements limiting optimization options
277+
- Need for custom Data Converters or advanced patterns
278+
- Desire for expert validation of optimization strategies
279+
280+
Contact your Temporal Account Representative or [Temporal support](http://support.temporal.io) to discuss optimization services.

docs/best-practices/index.mdx

Lines changed: 3 additions & 0 deletions
@@ -55,6 +55,9 @@ This section is intended for:
5555
- **[Worker Deployment and Performance](./worker.mdx)** Best practices for deploying and optimizing Temporal Workers for
5656
performance and reliability.
5757

58+
- **[Cost Optimization](./cost-optimization.mdx)** Strategies for optimizing costs associated with workloads running on
59+
Temporal Cloud while maintaining Workflow reliability and observability.
60+
5861
- **[Pre-Production Testing](./pre-production-testing.mdx)** Experience-driven testing practices covering failure
5962
injection, load testing, and operational validation.
6063

sidebars.js

Lines changed: 1 addition & 2 deletions
@@ -651,9 +651,8 @@ module.exports = {
651651
'best-practices/managing-aps-limits',
652652
'best-practices/cloud-access-control',
653653
'best-practices/security-controls',
654-
'best-practices/worker',
654+
'best-practices/cost-optimization',
655655
'best-practices/knowledge-hub',
656-
'production-deployment/multi-tenant-patterns'
657656
],
658657
},
659658
{
