Unkey uses Restate to run reliable multi-step deployment workflows

Unkey is an API management platform that uses Restate for deployment workflows involving multiple steps like building Docker images and deploying across regions, where any failure requires a reliable retry and recovery mechanism.

We just exchanged all of our internal run functions with restate.run and it mostly just worked. And the UI is much nicer than looking at database columns all day.

-- Andreas Thomas - CTO Unkey
  • Website: https://www.unkey.com/
  • Industry: API management
  • Use Case: Deployment Automation
  • Tech Stack: Kubernetes, Go, MySQL, S3

Before Restate: Architecture & Challenge

Unkey needed a durable solution for multi-step deployment workflows that require reliable retries of failed tasks.

The team initially attempted to build their own system using MySQL to track steps and cron jobs for retries. They eventually stopped this effort because of complex bugs that occurred, such as timing issues and race conditions.

I spent three or four hours debugging why an old job didn’t get picked up after a node crash. Each worker would grab a lease - just a row in the database - to claim a task, but a timing bug made it think another worker still held the lease even when it didn’t. If I’m already spending four hours fixing this now, what happens in production? I just don’t want to maintain that.

-- Andreas Thomas - CTO Unkey

Technical Challenges:

  • Handling failures and retries
  • Coordinating async operations
  • Dealing with race conditions

Why Restate?

After evaluating several alternatives, Unkey decided to go with Restate because of the following benefits:

  • Ease of local development and self-hosting thanks to Restate’s single-binary deployment (no additional dependencies such as databases)
  • Services/workflows don’t need to run on workers managed by Restate and can run anywhere
  • Lightweight, as opposed to competitors
  • Great SDK experience with support for Go
  • Helpful, responsive support team

Switching to Restate went smoothly. Unkey’s internal step execution logic mapped directly to Restate’s programming model:

It was a pretty small change and required no re-architecting. We just exchanged all of our internal run functions with restate.run and it mostly just worked.

-- Andreas Thomas - CTO Unkey

The Results

Unkey now uses a variety of Restate features to implement durable deployment workflows, including:

  • Durable Execution: Replaced custom step-tracking and retry logic.
  • Workflows: For managing stateful deployments and handling infrastructure provisioning, which involves gluing together various APIs that can be fragile.
  • Durable Promises: To durably handle external events like notifications when an S3 upload is complete.
  • Virtual Objects: For concurrency control and serializing critical deployment actions, like switching domains to point to a new deployment, rolling back changes, or promoting deployments to production.

Domain switching during deployment should never happen in parallel. With virtual objects, it’s serialized automatically. Even if a user rage-clicks, only one switch happens. I didn’t have to build any of that logic myself — I just said ‘that’s a virtual object now,’ and it came out of the box.

-- Andreas Thomas - CTO Unkey

Now that they got a taste of what Restate can do, they are seeing more and more use cases for it. For example, they are looking at using Restate to implement a time-delayed deletion system for projects, such as marking a project for deletion and scheduling the actual delete operation for three days later, with the option to cancel.

With Restate, I can schedule something to happen in three days and cancel it easily if needed. Suddenly all kinds of use cases pop up once you already have a workflow engine.

-- Andreas Thomas - CTO Unkey