Keep your applications running while AWS is down
Till Rohrmann
In today's global economy, applications need to be available 24/7 across the world.
When AWS's us-east-1 region went down recently, as it also did in December 2021, it didn't just inconvenience users; it cost companies millions in lost revenue.
In the recent outage, ChatGPT couldn't answer questions.
Snapchat failed to deliver messages.
Perplexity wasn't able to search.
In 2021, Netflix couldn't stream.
Disney+ went dark.
Robinhood couldn't process trades.
The culprit in many cases?
A lack of true geo-replication.
Distributing your application across multiple geographic regions ensures that it stays running even when entire regions fail.
The stakes are high
Modern applications run everything from financial transactions to healthcare systems to supply chain management. A regional outage doesn't just mean your website is down; it can mean:
- Financial services can't process payments or trades; imagine a stock exchange frozen during market hours
- E-commerce platforms lose sales during peak shopping periods; think Black Friday with your checkout disabled
- Healthcare systems can't access patient records during emergencies
- Logistics companies can't track shipments or coordinate deliveries
The cost of downtime for enterprise applications can easily reach thousands of dollars per minute, making regional resilience not just a nice-to-have, but a business necessity.
Traditional challenges with geo-replication
Building truly geo-replicated applications has historically been an expert-level undertaking. Here's why:
1. Distributed state management complexity: Traditional applications struggle with keeping state consistent across regions. You're constantly battling the CAP theorem. For example, do you let US and EU users see potentially stale inventory counts (availability), or do you block orders until you can guarantee the count is accurate across regions (consistency)?
2. Complex replication logic: You need to implement sophisticated replication mechanisms, handle conflict resolution, manage leader election (determining which region handles writes), and ensure data doesn't get corrupted during network partitions.
3. Operational overhead: Setting up monitoring, handling failover scenarios, managing different deployment configurations across regions, and coordinating updates becomes a full-time job for entire teams.
4. Application-level changes: Most solutions require you to fundamentally restructure your application code, implement custom retry logic, handle partial failures, and manage distributed transactions.
How Restate transforms building geo-replicated applications
Here's where Restate changes the game completely. What if geo-replication didn't require any special code? What if surviving an entire region failure was just a configuration change? With Restate, building a geo-replicated application becomes primarily a deployment concern rather than an application development concern.
You build applications the same way
When you develop with Restate, you write your business logic as you normally would. Here's an order workflow that handles inventory, payments, and notifications:
```typescript
// Standard business logic. No special code for geo-replication needed
const orderWorkflow = restate.workflow({
  name: "Order",
  handlers: {
    run: async (ctx: restate.WorkflowContext, order) => {
      const inventoryResult = await ctx
        .serviceClient(InventoryService)
        .reserveItems(order.items);
      if (inventoryResult.result === false) {
        return { status: "not enough items" };
      }

      const paymentResult = await ctx
        .serviceClient(PaymentService)
        .processPayment(order.payment);
      if (paymentResult.result === false) {
        await ctx.serviceClient(InventoryService).unreserveItems(order.items);
        return { status: "payment rejected" };
      }

      await ctx.serviceClient(InventoryService).shipItems(order.items);
      await ctx.run("send notification", () => sendNotification(ctx.key));
      return { status: "completed" };
    },
  },
});
```
There's no complex distributed state management, no replication logic, no custom retry logic. You focus on your business logic, and Restate handles the distributed systems complexity.
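For context, `InventoryService` and `PaymentService` are ordinary Restate services. Below is a hypothetical sketch of `InventoryService`; the persistence helpers (`tryReserveInDatabase`, `releaseInDatabase`, `startShipment`) are illustrative stubs, not part of any real API:

```typescript
import * as restate from "@restatedev/restate-sdk";

// Illustrative stubs standing in for real database and shipping calls.
const tryReserveInDatabase = async (items: string[]) => true;
const releaseInDatabase = async (items: string[]) => {};
const startShipment = async (items: string[]) => {};

const inventoryService = restate.service({
  name: "InventoryService",
  handlers: {
    reserveItems: async (ctx: restate.Context, items: string[]) => {
      // ctx.run records the side effect in Restate's durable log, so a
      // retried handler reuses the recorded result instead of re-executing.
      const ok = await ctx.run("reserve", () => tryReserveInDatabase(items));
      return { result: ok };
    },
    unreserveItems: async (ctx: restate.Context, items: string[]) => {
      await ctx.run("unreserve", () => releaseInDatabase(items));
    },
    shipItems: async (ctx: restate.Context, items: string[]) => {
      await ctx.run("ship", () => startShipment(items));
    },
  },
});

// Serve both the service and the workflow over HTTP so Restate can invoke them.
restate.endpoint().bind(inventoryService).bind(orderWorkflow).listen(9080);
```

Once the endpoint is registered with Restate, a new order can be started through Restate's HTTP ingress (port 8080 by default), e.g. with a POST to `/Order/order-123/run`.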
Geo-replication becomes a deployment detail
To make this application geo-replicated and resilient to entire region failures, you simply deploy the Restate cluster and your services across multiple regions with the right configuration. That's it.
How to deploy a geo-replicated Restate application
Let's walk through the practical steps to deploy a geo-replicated Restate application that can survive entire region outages.
Configure location-aware nodes
Each Restate node needs to know its geographic location.
This allows Restate to make intelligent decisions about data placement and replication.
The `location` setting encodes the `region.availability-zone` information of a node.
```toml
cluster-name = "geo-replicated"
node-name = "us-east-1.a"
advertised-address = "10.0.1.100:5122"
default-replication = "{region: 2, node: 3}" # <-- Multi-region replication property
location = "us-east-1.a" # <-- Geographic location for this node
auto-provision = false

[metadata-client]
addresses = [
    "10.0.1.100:5122", # us-east-1.a
    "10.0.1.101:5122", # us-east-1.b
    "10.0.2.100:5122", # us-east-2.a
    "10.0.2.101:5122", # us-east-2.b
    "10.0.3.100:5122", # us-west-1.a
    "10.0.3.101:5122", # us-west-1.b
]

[worker]
experimental-partition-driven-log-trimming = true
durability-mode = "balanced"
trim-delay-interval = "60 min"

[worker.snapshots]
destination = "s3://my-snapshots-us-east-1/"
snapshot-interval-num-records = 10000
```
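For reference, a node in another region uses the same cluster-wide settings and changes only the node-specific fields. A sketch for one of the us-west-1 nodes, using the hypothetical addresses from the `metadata-client` list above:

```toml
cluster-name = "geo-replicated"
node-name = "us-west-1.a"
advertised-address = "10.0.3.100:5122"
default-replication = "{region: 2, node: 3}"
location = "us-west-1.a" # region.availability-zone of this node
auto-provision = false

[worker.snapshots]
destination = "s3://my-snapshots-us-west-1/" # region-local bucket, replicated via CRR (next section)
```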
Set up cross-region replication policies
The `default-replication = "{region: 2, node: 3}"` setting is doing a lot of the heavy lifting here:
- `region: 2` ensures your data is replicated to at least 2 different regions
- `node: 3` ensures your data is replicated to at least 3 different nodes

This means your application can survive the following while remaining fully available:
- ✅ The complete failure of any single region
- ✅ The failure of up to 2 arbitrary nodes across regions
If you need higher availability guarantees, you only need to increase the `default-replication` property and make sure that Restate and your services are deployed in enough regions/availability zones, as sketched below.
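As a hypothetical example, a deployment spanning at least three regions could use a stricter policy. Following the same pattern as above, this would tolerate the simultaneous loss of any two regions, or up to four arbitrary nodes:

```toml
# Every record lives on at least 5 nodes spread across at least 3 regions.
default-replication = "{region: 3, node: 5}"
```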
Configure S3 cross-region replication
Every partition has a lead partition processor that is responsible for creating and uploading state snapshots to S3. Since it uploads snapshots to S3 in the region where it is running, we enable S3's cross-region replication (CRR) to ensure that snapshots will automatically be replicated to the other regions.
Since the S3 replication happens asynchronously, we need to ensure that Restate doesn't trim its internal logs until S3 snapshots have had time to replicate across regions.
The setting `trim-delay-interval = "60 min"` ensures that logs are only trimmed 60 minutes after the snapshot has been uploaded, providing sufficient time for S3's asynchronous cross-region replication to complete.
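Enabling CRR depends on your tooling; with the AWS CLI it could look roughly like the following sketch (bucket names and the IAM role are placeholders, and S3 replication requires versioning on both buckets):

```sh
# Versioning must be enabled on source and destination buckets.
aws s3api put-bucket-versioning --bucket my-snapshots-us-east-1 \
  --versioning-configuration Status=Enabled
aws s3api put-bucket-versioning --bucket my-snapshots-us-west-1 \
  --versioning-configuration Status=Enabled

# Attach a replication rule to the source bucket.
aws s3api put-bucket-replication --bucket my-snapshots-us-east-1 \
  --replication-configuration file://replication.json
```

where `replication.json` replicates all snapshot objects to the sibling bucket:

```json
{
  "Role": "arn:aws:iam::123456789012:role/s3-crr-role",
  "Rules": [
    {
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {},
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Destination": { "Bucket": "arn:aws:s3:::my-snapshots-us-west-1" }
    }
  ]
}
```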
What happens during a region failure?
Now for the impressive part. Let's see what happens when an entire region goes offline. For this test, we have deployed a 6-node Restate cluster across 3 regions, with each node being deployed in its own availability zone. The application services are running next to the Restate servers on the same nodes.

Under a load of 400 req/s, we achieve a P50 latency of ~350ms and a P90 latency of ~1s for our 5-step testing workflow in steady state.

During the failure: Automatic failover
When `us-east-1` goes completely offline, Restate automatically:
- Detects the failure within seconds using its gossip protocol
- Redistributes leadership for affected partitions to the remaining nodes in `us-east-2` and `us-west-1`
- Continues processing requests using the replicated data in `us-east-2` and `us-west-1`
- Maintains consistency: no data is lost, no requests are duplicated

During the brief failover period (less than a minute), P99 latencies temporarily spike to about 30 seconds. However, the cluster remains fully available and continues processing all requests—only 1% of requests experience these higher latencies.

After failover completes, P75 and higher latencies return to nearly normal levels. P50 latency remains slightly elevated because all replication now occurs between the more distant `us-east-2` and `us-west-1` regions. Previously, some replication happened between the closer `us-east-1` and `us-east-2` regions, which had much lower network latency.

The bottom line: Your application experiences zero downtime during a complete regional failure. While some requests see temporary latency increases during the ~60-second failover window, all requests continue to be processed successfully.
When `us-east-1` comes back online, Restate automatically:
- Synchronizes any missed updates from snapshots and logs
- Gradually redistributes load back to achieve a balanced distribution across all regions
- Returns to normal operation with no manual intervention required
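If you want to watch the failover and recovery yourself, and assuming you have Restate's `restatectl` tool pointed at a reachable node, its status overview shows node liveness, partition leadership, and log state as the cluster rebalances:

```sh
# Cluster overview: nodes, logs, and partition processors.
restatectl status
```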
With Restate, you write your application once and deploy it anywhere. Geo-replication becomes a deployment detail, not an application architecture concern.
Getting started with geo-replicated Restate
Ready to build applications that can survive anything? The beauty of Restate is that you can start simple and add geo-replication when you need it, without changing your application code.
Start local, scale global
- 🚀 Begin with a single-region deployment following our quickstart guide
- 🔧 Build and test your application logic using Restate's development tools
- 🌍 Add geo-replication by deploying Restate in additional regions with the configuration shown above
- 💪 Benefit from having a highly available and scalable application that can withstand all kinds of failures, even complete regional outages.
✨ Star us on GitHub and join the conversation on Discord or Slack — we'd love to hear what you're building with Restate.