
Keep your applications running while AWS is down

October 21, 2025

Till Rohrmann

In today's global economy, applications need to be available 24/7 across the world. When AWS's us-east-1 region went down in December 2021, and again in the recent outage, it didn't just inconvenience users; it cost companies millions in lost revenue. In the recent outage, ChatGPT couldn't answer questions. Snapchat failed to deliver messages. Perplexity wasn't able to search. In 2021, Netflix couldn't stream. Disney+ went dark. Robinhood couldn't process trades. The culprit in many cases? A lack of true geo-replication. Distributing your application across multiple geographic regions keeps it running even when an entire region fails.

The stakes are high

Modern applications run everything from financial transactions to healthcare systems to supply chain management. A regional outage doesn't just mean your website is down. It can mean:

  • Financial services can't process payments or trades (imagine a stock exchange frozen during market hours)
  • E-commerce platforms lose sales during peak shopping periods (think Black Friday with your checkout disabled)
  • Healthcare systems can't access patient records during emergencies
  • Logistics companies can't track shipments or coordinate deliveries

The cost of downtime for enterprise applications can easily reach thousands of dollars per minute, making regional resilience not just a nice-to-have, but a business necessity.

Traditional challenges with geo-replication

Building truly geo-replicated applications has historically been an expert-level undertaking. Here's why:

1. Distributed state management complexity: Traditional applications struggle with keeping state consistent across regions. You're constantly battling the CAP theorem. For example, do you let US and EU users see potentially stale inventory counts (availability), or do you block orders until you can guarantee the count is accurate across regions (consistency)?

2. Complex replication logic: You need to implement sophisticated replication mechanisms, handle conflict resolution, manage leader election (determining which region handles writes), and ensure data doesn't get corrupted during network partitions.

3. Operational overhead: Setting up monitoring, handling failover scenarios, managing different deployment configurations across regions, and coordinating updates becomes a full-time job for entire teams.

4. Application-level changes: Most solutions require you to fundamentally restructure your application code, implement custom retry logic, handle partial failures, and manage distributed transactions.

How Restate transforms building geo-replicated applications

Here's where Restate changes the game completely. What if geo-replication didn't require any special code? What if surviving an entire region failure was just a configuration change? With Restate, building a geo-replicated application becomes primarily a deployment concern rather than an application development concern.

You build applications the same way

When you develop with Restate, you write your business logic as you normally would. Here's an order workflow that handles inventory, payments, and notifications:


// Standard business logic. No special code for geo-replication needed.
// InventoryService, PaymentService, and sendNotification are defined elsewhere.
import * as restate from "@restatedev/restate-sdk";

const orderWorkflow = restate.workflow({
  name: "Order",
  handlers: {
    run: async (ctx: restate.WorkflowContext, order) => {
      // Step 1: reserve the items; stop early if stock is insufficient.
      const inventoryResult = await ctx
        .serviceClient(InventoryService)
        .reserveItems(order.items);

      if (inventoryResult.result === false) {
        return { status: "not enough items" };
      }

      // Step 2: charge the customer.
      const paymentResult = await ctx
        .serviceClient(PaymentService)
        .processPayment(order.payment);

      if (paymentResult.result === false) {
        // Compensate: release the reservation made in step 1.
        await ctx.serviceClient(InventoryService).unreserveItems(order.items);
        return { status: "payment rejected" };
      }

      // Step 3: ship the items and notify the customer (ctx.key is the workflow id).
      await ctx.serviceClient(InventoryService).shipItems(order.items);
      await ctx.run("send notification", () => sendNotification(ctx.key));

      return { status: "completed" };
    },
  },
});

There's no complex distributed state management, no replication logic, no custom retry logic. You focus on your business logic, and Restate handles the distributed systems complexity.
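
As a rough illustration of how a client might start this workflow, the sketch below builds an HTTP request against the Restate ingress. It assumes the ingress is reachable at http://localhost:8080 and exposes workflows under /<workflow name>/<workflow id>/run. `buildRunRequest` and the sample order are hypothetical and not part of the Restate SDK, so check the ingress documentation for your version before relying on the path layout.

```typescript
// Hypothetical helper, not part of the Restate SDK: builds the HTTP request
// that starts a workflow run for a given workflow id.
interface RunRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildRunRequest(
  ingressBaseUrl: string, // assumed ingress address, e.g. http://localhost:8080
  workflowName: string,
  workflowId: string,
  payload: unknown,
): RunRequest {
  const base = ingressBaseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return {
    // Assumed path layout: /<workflow name>/<workflow id>/run
    url: `${base}/${workflowName}/${encodeURIComponent(workflowId)}/run`,
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(payload),
  };
}

// Sample order: the field names mirror order.items and order.payment above,
// the values are made up for this example.
const request = buildRunRequest("http://localhost:8080/", "Order", "order-42", {
  items: ["sku-1"],
  payment: { method: "card" },
});
// The request could then be sent with fetch(request.url, request).
```

The same request works unchanged whether the cluster behind the ingress runs in one region or several, which is the point of keeping geo-replication out of the application code.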

Geo-replication becomes a deployment detail

To make this application geo-replicated and resilient to entire region failures, you simply deploy the Restate cluster and your services across multiple regions with the right configuration. That's it.

How to deploy a geo-replicated Restate application

Let's walk through the practical steps to deploy a geo-replicated Restate application that can survive entire region outages.

Configure location-aware nodes

Each Restate node needs to know its geographic location. This allows Restate to make intelligent decisions about data placement and replication. The location setting encodes a node's region and availability zone in the form region.availability-zone, for example us-east-1.a.

cluster-name = "geo-replicated"
node-name = "us-east-1.a"
advertised-address = "10.0.1.100:5122"
default-replication = "{region: 2, node: 3}" # <-- Multi region replication property
location = "us-east-1.a" # <-- Geographic location for this node
auto-provision = false

[metadata-client]
addresses = [
  "10.0.1.100:5122",  # us-east-1.a
  "10.0.1.101:5122",  # us-east-1.b
  "10.0.2.100:5122",  # us-east-2.a
  "10.0.2.101:5122",  # us-east-2.b
  "10.0.3.100:5122",  # us-west-1.a
  "10.0.3.101:5122"   # us-west-1.b
]

[worker]
experimental-partition-driven-log-trimming = true
durability-mode = "balanced"
trim-delay-interval = "60 min"

[worker.snapshots]
destination = "s3://my-snapshots-us-east-1/"
snapshot-interval-num-records = 10000
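
To make the location format concrete, here is a small TypeScript sketch that splits a location string such as us-east-1.a into its region and zone parts and groups the six node locations from the config comments by region. `parseLocation` and `groupByRegion` are hypothetical helpers for illustration only. They are not part of Restate, and they assume every location carries both a region and a zone.

```typescript
interface NodeLocation {
  region: string;
  zone: string;
}

// Splits "us-east-1.a" into { region: "us-east-1", zone: "a" }.
// Assumes the first "." separates region from zone.
function parseLocation(location: string): NodeLocation {
  const separator = location.indexOf(".");
  if (separator <= 0 || separator === location.length - 1) {
    throw new Error(`expected "region.zone", got "${location}"`);
  }
  return {
    region: location.slice(0, separator),
    zone: location.slice(separator + 1),
  };
}

// Groups zones by region, preserving the input order.
function groupByRegion(locations: string[]): Map<string, string[]> {
  const regions = new Map<string, string[]>();
  for (const location of locations) {
    const { region, zone } = parseLocation(location);
    const zones = regions.get(region) ?? [];
    zones.push(zone);
    regions.set(region, zones);
  }
  return regions;
}

// The six locations listed in the metadata-client comments above.
const clusterLocations = [
  "us-east-1.a", "us-east-1.b",
  "us-east-2.a", "us-east-2.b",
  "us-west-1.a", "us-west-1.b",
];
const byRegion = groupByRegion(clusterLocations);
```

Here `byRegion` ends up with three entries, one per region, each holding zones a and b.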

Set up cross-region replication policies

The default-replication = "{region: 2, node: 3}" setting is doing a lot of the heavy lifting here:

  • region: 2 ensures your data is replicated to at least 2 different regions
  • node: 3 ensures your data is replicated to at least 3 different nodes

Figure: Location-aware replication places copies of your data in different regions and nodes, ensuring the replication property is satisfied.

This means your application can survive:

  • ✅ The complete failure of any single region
  • ✅ The failure of up to 2 arbitrary nodes across regions

while remaining fully available.

If you need higher availability guarantees, you only need to increase the default-replication property and make sure that Restate and your services are deployed in enough regions/availability zones.
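
One way to read the replication property is as a check on where the copies of a piece of data live. The TypeScript sketch below tests whether a placement spans at least the required number of regions and nodes. It is an illustration of the two bullet points above, not Restate's placement algorithm. `satisfiesReplication` is a hypothetical name, and each node is identified by its location string, which assumes one node per availability zone as in this post.

```typescript
interface ReplicationProperty {
  region: number; // minimum number of distinct regions holding a copy
  node: number;   // minimum number of distinct nodes holding a copy
}

// placement: one location string ("region.zone") per node that holds a copy.
function satisfiesReplication(
  placement: string[],
  property: ReplicationProperty,
): boolean {
  const nodes = new Set(placement);
  const regions = new Set(placement.map((location) => location.split(".")[0]));
  return regions.size >= property.region && nodes.size >= property.node;
}

const property: ReplicationProperty = { region: 2, node: 3 };

// Three nodes across two regions: satisfies { region: 2, node: 3 }.
const spread = satisfiesReplication(
  ["us-east-1.a", "us-east-1.b", "us-east-2.a"],
  property,
);

// Two regions but only two nodes: the node requirement is not met.
const tooFewNodes = satisfiesReplication(["us-east-1.a", "us-east-2.a"], property);

// Three nodes in a single region: the region requirement is not met.
const singleRegion = satisfiesReplication(
  ["us-east-1.a", "us-east-1.b", "us-east-1.c"],
  property,
);
```

Raising either number in the property tightens the check, which is why a stricter setting also requires Restate to be deployed in enough regions and zones to place the copies.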

Configure S3 cross-region replication

Every partition has a lead partition processor that is responsible for creating and uploading state snapshots to S3. Since it uploads snapshots to S3 in the region where it is running, we enable S3's cross-region replication (CRR) to ensure that snapshots will automatically be replicated to the other regions.

Since the S3 replication happens asynchronously, we need to ensure that Restate doesn't trim its internal logs until S3 snapshots have had time to replicate across regions. The setting trim-delay-interval = "60 min" ensures that logs are only trimmed 60 minutes after the snapshot has been uploaded, providing sufficient time for S3's asynchronous cross-region replication to complete.
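
The timing rule can be sketched in a few lines of TypeScript: a log is only eligible for trimming once the configured delay has elapsed since the snapshot upload. This is a simplified model of the behavior described above, not Restate's implementation. `parseMinutes` and `canTrim` are hypothetical helpers, and the parser only understands the "<number> min" form used in this config.

```typescript
// Parses a duration written as "<number> min" into milliseconds.
function parseMinutes(value: string): number {
  const match = /^(\d+)\s*min$/.exec(value.trim());
  if (match === null) {
    throw new Error(`unsupported duration: "${value}"`);
  }
  return Number(match[1]) * 60 * 1000;
}

// True once at least trimDelayMs has passed since the snapshot was uploaded.
function canTrim(
  snapshotUploadedAtMs: number,
  nowMs: number,
  trimDelayMs: number,
): boolean {
  return nowMs - snapshotUploadedAtMs >= trimDelayMs;
}

const trimDelayMs = parseMinutes("60 min");
const uploadedAt = Date.UTC(2025, 9, 21, 12, 0, 0); // arbitrary example timestamp
const minute = 60 * 1000;

const after59Minutes = canTrim(uploadedAt, uploadedAt + 59 * minute, trimDelayMs); // false
const after60Minutes = canTrim(uploadedAt, uploadedAt + 60 * minute, trimDelayMs); // true
```

The delay should comfortably exceed the time S3 needs to replicate a snapshot to the other regions; if replication regularly takes longer, a larger trim-delay-interval is the safer choice.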

What happens during a region failure?

Now for the impressive part. Let's see what happens when an entire region goes offline. For this test, we have deployed a 6-node Restate cluster across 3 regions, with each node being deployed in its own availability zone. The application services are running next to the Restate servers on the same nodes.

Figure: 6-node Restate cluster deployed across 3 regions, with each node running in its own availability zone.

Under a load of 400 req/s, we achieve a P50 latency of ~350ms and a P90 latency of ~1s for our 5-step testing workflow in steady state.

Figure: Steady-state latency.
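
For readers less familiar with percentile latencies, the following TypeScript sketch computes them with the nearest-rank method: sort the samples and pick the value at position ceil(p/100 × n). The sample values are made up for illustration and are not the measurements from this test. `percentile` is a hypothetical helper, and monitoring systems often use other estimation methods.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

// Made-up end-to-end workflow latencies in milliseconds.
const latenciesMs = [330, 210, 900, 280, 1400, 250, 420, 310, 600, 360];

const p50 = percentile(latenciesMs, 50); // 5th of 10 sorted samples: 330
const p90 = percentile(latenciesMs, 90); // 9th of 10 sorted samples: 900
```

A P90 of about 1s therefore means that roughly nine out of ten workflow runs finished within a second, while the slowest tenth took longer.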

During the failure: Automatic failover

When us-east-1 goes completely offline, Restate automatically:

  1. Detects the failure within seconds using its gossip protocol
  2. Redistributes leadership for affected partitions to the remaining nodes in us-east-2 and us-west-1
  3. Continues processing requests using the replicated data in us-east-2 and us-west-1
  4. Maintains consistency - no data is lost, no requests are duplicated

Figure: An outage of region us-east-1 does not affect the availability of the Restate application.
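
Step 2 above can be pictured with a toy model. The TypeScript sketch below moves leadership for every partition whose leader sits in the failed region to the surviving nodes in round-robin order. This is only an illustration under simplifying assumptions: Restate's actual leader selection is not described in this post, and `reassignLeaders`, the partition numbering, and the initial assignment are all hypothetical.

```typescript
// Maps partition id -> location ("region.zone") of the node leading it.
type Leaders = Map<number, string>;

function regionOf(location: string): string {
  return location.split(".")[0];
}

// Returns a new assignment in which partitions led from the failed region
// are handed to surviving nodes in round-robin order. Other partitions
// keep their current leader.
function reassignLeaders(
  leaders: Leaders,
  nodes: string[],
  failedRegion: string,
): Leaders {
  const survivors = nodes.filter((node) => regionOf(node) !== failedRegion);
  if (survivors.length === 0) throw new Error("no surviving nodes");

  const result: Leaders = new Map();
  let next = 0;
  leaders.forEach((leader, partition) => {
    if (regionOf(leader) === failedRegion) {
      result.set(partition, survivors[next % survivors.length]);
      next++;
    } else {
      result.set(partition, leader);
    }
  });
  return result;
}

const allNodes = [
  "us-east-1.a", "us-east-1.b",
  "us-east-2.a", "us-east-2.b",
  "us-west-1.a", "us-west-1.b",
];

// Hypothetical leadership before the outage.
const before: Leaders = new Map([
  [0, "us-east-1.a"],
  [1, "us-east-1.b"],
  [2, "us-east-2.a"],
  [3, "us-west-1.a"],
]);

const afterOutage = reassignLeaders(before, allNodes, "us-east-1");
// Partitions 0 and 1 move to us-east-2.a and us-east-2.b; 2 and 3 stay put.
```

Because the data was already replicated to at least two regions, the new leaders in this model can continue from copies they hold locally, which matches step 3.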

During the brief failover period (less than a minute), P99 latencies temporarily spike to about 30 seconds. However, the cluster remains fully available and continues processing all requests; only about 1% of requests experience these higher latencies.

Figure: During failover, we see a brief spike in higher percentile latencies before they normalize within less than a minute.

After failover completes, P75 and higher latencies return to nearly normal levels. P50 latency remains slightly elevated because all replication now occurs between the more distant us-east-2 and us-west-1 regions. Previously, some replication happened between the closer us-east-1 and us-east-2 regions, which had much lower network latency.

Figure (failover latency without P99): P50 latencies increase because all replication now occurs between us-east-2 and us-west-1, which has higher network latency than the previous us-east-1 to us-east-2 communication.

The bottom line: Your application experiences zero downtime during a complete regional failure. While some requests see temporary latency increases during the failover window of less than a minute, all requests continue to be processed successfully.

When us-east-1 comes back online, Restate automatically:

  1. Synchronizes any missed updates from snapshots and logs
  2. Gradually redistributes load back to achieve a balanced distribution across all regions
  3. Returns to normal operation with no manual intervention required

With Restate, you write your application once and deploy it anywhere. Geo-replication becomes a deployment detail, not an application architecture concern.

Getting started with geo-replicated Restate

Ready to build applications that can survive anything? The beauty of Restate is that you can start simple and add geo-replication when you need it, without changing your application code.

Start local, scale global

  1. 🚀 Begin with a single-region deployment following our quickstart guide
  2. 🔧 Build and test your application logic using Restate's development tools
  3. 🌍 Add geo-replication by deploying Restate in additional regions with the configuration shown above
  4. 💪 Benefit from having a highly available and scalable application that can withstand all kinds of failures, even complete regional outages

Star us on GitHub and join the conversation on Discord or Slack — we'd love to hear what you're building with Restate.