We replaced 400 lines of StepFunctions ASL with 40 lines of TypeScript by making Lambdas suspendable

Posted November 27, 2023 by Jack Kleeman and Pavel Tcholakov and Stephan Ewen ‐ 12 min read

tl;dr: We show you a new way to build complex business processes on AWS Lambda, where you can do everything with sequential code and RPC-style service calls, no workflow DSL or plumbing events. You can write code the same way you’d write outside of Lambda, but with all the benefits of Lambda for operations and cost, and the reliability of Step Functions and event-driven apps. This is possible because we make Lambda functions suspendable, using durable async/await, implemented by our open-source runtime, Restate.


AWS Lambda is a fantastic service that greatly simplifies deployments, operations, and capacity management. The more you work with Lambda, the more you want to use it for everything. However, the managed runtime imposes some restrictions that become drawbacks if we want to build components that invoke each other in a typical service-oriented architecture. We ideally want to deploy independent components with clearly defined responsibilities that communicate with each other via APIs. That way, different components can be maintained by different teams, with APIs modeling relationships between those teams.

The synchronous RPC model is well understood by developers: distributed components can communicate with each other by invoking a remote service and waiting for a response. This model is challenging to implement on Lambda: we generally want to avoid calling other Lambdas because the caller is billed for the time it spends waiting for a response, so any meaningful call chain of Lambdas gets expensive, fast. By contrast, on an EC2 machine, handlers blocked on IO don’t use any CPU and so share resources very effectively with other requests. Coordinating distributed systems that require long-running processes or that might need multiple retries can also be expensive in an execution-time billing environment.

These constraints push developers towards services like AWS Step Functions as the outermost orchestration layer for a number of short-running Lambda functions. This might also necessitate the use of a state store such as DynamoDB to persist intermediate state. Today we’re going to talk about a different approach that we think you’re going to love: suspendable Lambda functions, which we’ve implemented with durable async/await in a new open-source project called Restate.

Reliable serverless business processes

Business processes generally require a sequence of tasks to be completed by various components, with a strategy to handle failure. It’s important that we don’t leave the system in an invalid state. A common approach to achieve this is the Saga pattern which uses reliable execution to perform “compensating actions” that undo partial progress in the event of a failure.

AWS provides an excellent demonstration of this pattern in the Serverless Sagas example, with Step Functions taking the role of reliable executor. In the example scenario we implement a service that takes bookings for holiday trips. It needs to book a flight, reserve a rental car, process a payment, and then issue a confirmation. If any of these steps fails, the system must roll everything back: refund the payment and cancel any reservations. In the AWS example, each operation on flights, cars, and payments is implemented as a small Lambda handler. The overall business process is captured as a Step Functions workflow, which defines the sequence of operations. It is responsible for accumulating progress as we go through the steps, as well as for retries and rollback.

The AWS reference example uses the CDK States module to define the Step Functions workflow, but we could also build it with the Step Functions Workflow Studio or even directly in the Amazon States Language (ASL) JSON representation. In every case, there is a distinction between the “service code” (the Lambda-hosted program code of our individual steps) and the orchestration code, which is expressed in the specialized Step Functions language.

A simple RPC-based approach

Let’s take a step back. What if we could write a simple script that expressed our desired holiday booking flow in an imperative, top-to-bottom style? Instead of a workflow language, we might write code like this:

const reserve = async (input) => {
  const tripID = uuidv4();

  // create an undo stack
  const undos = [];
  try {
    // RPC the flights service to reserve, keeping track of how to cancel
    const flight_booking = await flights.reserve(tripID, input);
    undos.push(() => flights.cancel(tripID, flight_booking));

    // RPC the rental service to reserve, keeping track of how to cancel
    const car_booking = await carRentals.reserve(tripID, input);
    undos.push(() => carRentals.cancel(tripID, car_booking));

    // RPC the payments service to process, keeping track of how to refund
    const payment = await payments.process(tripID);
    undos.push(() => payments.refund(tripID, payment));

    // confirm the flight and car
    await flights.confirm(flight_booking);
    await carRentals.confirm(car_booking);
  } catch (e) {
    // undo all the steps up to this point
    for (const undo of undos.reverse()) {
      await undo();
    }

    // notify failure
    await sns.send(new PublishCommand({
      TopicArn: process.env.SNS_TOPIC,
      Message: "Your Travel Reservation Failed",
    }));

    // exit with an error
    throw new Error(`Travel reservation failed with err '${e}'`, {
      cause: e,
    });
  }

  // notify success
  await sns.send(new PublishCommand({
    TopicArn: process.env.SNS_TOPIC,
    Message: "Your Travel Reservation is Successful",
  }));
};

Wouldn’t it be great if we could write our business process orchestration logic like this? We can make RPC-style calls to other services and use the AWS SDK directly as needed.

But what if it takes minutes for one of the RPCs to complete? And what happens if this handler (or its runtime) stops due to an unhandled error or a timeout? Our exception handling blocks might never run, and we’ll end up with an unfinished booking. And didn’t we say earlier that it is an anti-pattern to have Lambdas call other Lambdas? We don’t want to get charged for the execution time of the trips service while it’s waiting.

If functions could suspend and resume, for example on RPC calls, we wouldn’t have to worry about any of those things! And thanks to the magic of durable async/await, we can make that happen. Whenever we make an RPC, the calling handler doesn’t need to hang around waiting; whenever you see an await in the code, the Lambda invocation suspends and returns control to Restate. Only one Lambda runtime is actively executing code (and spending execution milliseconds) at any time.

The Restate runtime keeps track of outstanding requests, and re-invokes the suspended Lambda when it is ready to continue. The Restate SDK takes care of resuming the execution exactly where it left off. This way you only pay for the time spent doing actual work. The secret is that every step of the handler has its result journaled by the runtime. When we resume execution, we can replace the already-processed steps with the journaled results, fast-forwarding to where we suspended.
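
To make the journaling idea concrete, here is a toy, in-memory sketch of suspend-and-replay. It is not Restate's implementation or API: `ToyContext`, `SuspendSignal`, and `runToCompletion` are hypothetical names invented for this illustration, the journal lives only in memory, and the handler is synchronous to keep the sketch short.

```typescript
// Toy illustration of journal-based replay. All names are hypothetical;
// this is not the Restate SDK.

// Thrown to unwind the handler when it has to wait for a result.
class SuspendSignal {
  constructor(readonly callName: string) {}
}

class ToyContext {
  private position = 0; // index of the next step within this invocation

  constructor(private readonly journal: string[]) {}

  // Return the journaled result if this step already completed;
  // otherwise suspend the invocation.
  call(callName: string): string {
    const index = this.position++;
    if (index < this.journal.length) {
      return this.journal[index]; // fast-forward: replay the recorded result
    }
    throw new SuspendSignal(callName);
  }
}

type Handler = (ctx: ToyContext) => string;

// Stand-in for the runtime: invoke the handler and, whenever it suspends,
// perform the remote call, journal the result, and invoke the handler again.
function runToCompletion(handler: Handler, remote: (callName: string) => string) {
  const journal: string[] = [];
  let invocations = 0;
  for (;;) {
    invocations++;
    try {
      const result = handler(new ToyContext(journal));
      return { result, invocations, journal };
    } catch (e) {
      if (!(e instanceof SuspendSignal)) throw e;
      journal.push(remote(e.callName));
    }
  }
}

// A handler that makes two sequential calls.
const remoteCalls: string[] = [];
const outcome = runToCompletion(
  (ctx) => {
    const flight = ctx.call("flights.reserve");
    const car = ctx.call("carRentals.reserve");
    return `${flight}+${car}`;
  },
  (callName) => {
    remoteCalls.push(callName); // record every remote call that is performed
    return `ok:${callName}`;
  },
);

console.log(outcome.result, outcome.invocations, remoteCalls.length);
```

In this sketch the handler body is invoked three times, but each remote call is performed only once: the second and third invocations fast-forward through the journal instead of repeating work.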

Because of durable async/await, we can also guarantee that any started executions will run to completion without ever having to redo work. With this guarantee, we can handle cleanup tasks like undo steps without worrying that the code might crash before they get to run. We don’t make a distinction between ‘workflow’ code like the trip handler and ‘step’ code like the handlers for flights, cars, and payments. All the code in the example gets journaled; it’s all laid out as RPC handlers, which can call each other, be called through an HTTP ingress, and return a synchronous result if necessary.

The only catch is that interactions with external services like SNS or sources of non-determinism (such as clocks or random number/id generators) must be wrapped in a sideEffect. The results of side effect closures are captured and journaled by Restate, similar to other steps.
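
The reason for this rule is easier to see in a small sketch. The following toy shows why a journaled closure yields the same value on replay and does not repeat its external effect. `ToyJournal` and its methods are hypothetical names for this illustration only. The real `ctx.sideEffect` is awaited and backed by Restate's durable journal, whereas this version is synchronous and in-memory.

```typescript
// Toy illustration only; hypothetical names, not the Restate SDK.
class ToyJournal {
  private readonly entries: string[] = [];
  private cursor = 0;

  // Start a new invocation of the handler: replay from the beginning.
  rewind(): void {
    this.cursor = 0;
  }

  // Run `fn` only if this step has no journaled result yet.
  sideEffect(fn: () => string): string {
    if (this.cursor < this.entries.length) {
      return this.entries[this.cursor++]; // replay: skip the closure entirely
    }
    const value = fn();
    this.entries.push(value);
    this.cursor++;
    return value;
  }
}

let sent = 0;
const journal = new ToyJournal();

function handler(): string {
  // Non-deterministic value: journaled so that replays see the same ID.
  const id = journal.sideEffect(() => Math.random().toString(36).slice(2));
  // External interaction: journaled so that replays do not send twice.
  journal.sideEffect(() => {
    sent++;
    return "sent";
  });
  return id;
}

const first = handler();
journal.rewind(); // simulate a re-invocation after a suspension
const second = handler();

console.log(first === second, sent); // prints: true 1
```

Without the journal, the second run would generate a different ID and send the notification a second time.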

Writing a real handler

Now that we know a bit more about how Restate handlers work, let’s tweak the code above:

const reserve = async (ctx, input) => {
  const tripID = ctx.rand.uuidv4();

  // set up RPC clients
  const flights = ctx.rpc(flightsService);
  const carRentals = ctx.rpc(carRentalService);
  const payments = ctx.rpc(paymentsService);

  // create an undo stack
  const undos = [];
  try {
    // call the flights Lambda to reserve, keeping track of how to cancel
    const flight_booking = await flights.reserve(tripID, input);
    undos.push(() => flights.cancel(tripID, flight_booking));

    // RPC the rental service to reserve, keeping track of how to cancel
    const car_booking = await carRentals.reserve(tripID, input);
    undos.push(() => carRentals.cancel(tripID, car_booking));

    // RPC the payments service to process, keeping track of how to refund
    const payment = await payments.process(tripID);
    undos.push(() => payments.refund(tripID, payment));

    // confirm the flight and car
    await flights.confirm(tripID, input);
    await carRentals.confirm(tripID, input);
  } catch (e) {
    // undo all the steps up to this point
    for (const undo of undos.reverse()) {
      await undo();
    }

    // notify failure
    await ctx.sideEffect(() => sns.send(new PublishCommand({
      TopicArn: process.env.SNS_TOPIC,
      Message: "Your Travel Reservation Failed",
    })));

    // exit with an error
    throw new TerminalError(`Travel reservation failed with err '${e}'`, {
      cause: e,
    });
  }

  // notify success
  await ctx.sideEffect(() => sns.send(new PublishCommand({
    TopicArn: process.env.SNS_TOPIC,
    Message: "Your Travel Reservation is Successful",
  })));
}

This is a real working example of a workflow that you can deploy with Restate today, backed by AWS Lambda! As you can see, very little has changed from the imperative script approach we started from. Let’s look through the changes we had to make.

  1. Our handler now accepts a ctx object, which gives it a way to interact with Restate.
  2. To create RPC clients we use ctx.rpc(). When RPCs are made through these clients, Restate will suspend execution, and route the request to another service, resuming execution when the result is available.
  3. We have also encapsulated our interactions with Amazon SNS into side effect blocks, using ctx.sideEffect(). This ensures that those steps are executed once, and that their results are recorded in the journal, just as they would be for a Restate RPC. We’ve used the utility ctx.rand.uuidv4() to create a deterministic UUID, but we could have used a side effect for that too.
  4. We’ve thrown a TerminalError instead of a normal Error. This signals to the Restate SDK that the error is not transient, and that it should be treated as a valid completion of the handler.
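
The distinction in the last point can be sketched with a toy retry loop. This is an assumption-laden illustration rather than Restate's retry logic: `ToyTerminalError`, `Outcome`, and `invokeWithRetries` are hypothetical names, and the fixed attempt limit exists only to keep the example finite.

```typescript
// Toy illustration only: hypothetical stand-ins, not the Restate SDK classes.
class ToyTerminalError extends Error {}

type Outcome =
  | { kind: "ok"; value: string; attempts: number }
  | { kind: "failed"; message: string; attempts: number };

// Retry on ordinary errors; stop immediately on a terminal one.
function invokeWithRetries(
  handler: (attempt: number) => string,
  maxAttempts: number,
): Outcome {
  let lastMessage = "no attempts made";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return { kind: "ok", value: handler(attempt), attempts: attempt };
    } catch (e) {
      lastMessage = e instanceof Error ? e.message : String(e);
      if (e instanceof ToyTerminalError) {
        // A terminal error is a final answer, not something to retry.
        return { kind: "failed", message: lastMessage, attempts: attempt };
      }
      // Any other error is treated as transient: loop and try again.
    }
  }
  return { kind: "failed", message: lastMessage, attempts: maxAttempts };
}

// A transient failure that clears up on the third attempt.
const flaky = invokeWithRetries((attempt) => {
  if (attempt < 3) throw new Error("timeout talking to payments");
  return "paid";
}, 5);

// A business failure that retrying cannot fix.
const rejected = invokeWithRetries(() => {
  throw new ToyTerminalError("card declined");
}, 5);

console.log(flaky, rejected);
```

In the trip handler, the compensations have already run by the time the error is thrown, so retrying the whole handler would be wrong. That is why the failure is reported as terminal.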

It’s not just for Lambda

Unlike the original AWS example, our code from above isn’t specific to Lambda or Step Functions. It looks like a regular RPC handler and can also be executed as such. You can deploy the same code with no changes (other than the entry-point reference) in pretty much any other place where you would normally deploy RPC services.

With suspendable functions, Lambda becomes a deployment detail: just a platform that allows for very fine-grained compute. It’s no longer something you need to develop for.

Imagine everything that follows from that:

  • No emulators for local development of Lambda or Step Functions: You can just run it locally as a single auto-reloading Node.js app, using your ~/.aws credentials. You can use all the standard tools, attach a debugger, just like you always did.
  • You can deploy it on Fargate, to handle a higher base load more cost-effectively, perhaps overflowing bursty load to Lambda.
  • You can also mix and match, run different services on different platforms, depending on what is the most appropriate (or cost-effective) place to run that particular part of the application.

Try it on AWS

To try this out, you need access to a Restate service. We are developing Restate both as an open-source project that you can host yourself and as a managed service.

You can sign up here for the managed service waiting list.

To get started with the self-hosted version, the easiest way to bring up an instance is to use the Restate construct for AWS Cloud Development Kit (CDK). Restate is exceptionally easy to deploy and host, because it comes in the form of a self-contained single binary.

The following steps will bring up a complete end-to-end system with a Restate deployment and the demo code. Clone the CDK project:

git clone https://github.com/restatedev/restate-holiday
cd restate-holiday
npm clean-install

This project contains two separate CloudFormation stacks: one with a Restate container in a dedicated VPC, and another with all the Lambda handlers that make up the Reservations service.

You can deploy the complete CDK application with a single command:

npx cdk deploy --all

This will create two separate stacks: one containing a VPC with a single-node EC2 deployment of Restate, and a separate stack containing only the serverless components that make up the Holiday reservation services. (Note: To deploy just the Restate container and start building your own applications, you can use npx cdk deploy RestateStack. Check the Restate Holiday Demo Repository for more details.)

The Restate stack creates a dedicated VPC containing a t4g.micro EC2 instance running Restate, an Application Load Balancer to handle incoming requests to the Restate ingress endpoint, and a NAT Gateway for outbound internet connectivity. Restate service logs are sent to Amazon CloudWatch Logs. The Holiday Service stack contains the sample application’s components – in our case, some service handlers and DynamoDB tables.

When done, this will print out a few values such as the Restate ingress endpoint for your self-hosted Restate runtime. Let’s grab the endpoint address:

export INGRESS=$(aws cloudformation describe-stacks \
    --stack-name RestateStack \
    --query "Stacks[0].Outputs[?OutputKey=='RestateIngressEndpoint'].OutputValue" \
    --output text)

We are now ready to invoke the service!

curl -k $INGRESS/trips/reserve --json '{}'

Note that we use the -k flag to accept self-signed certificates. This demonstration deployment creates its own certificate to enable encryption. If all went well, you should see the following output:

{
  "response": {
    "status": "success",
    "trip_id": "6592ad0b-e442-4397-9465-82837ea2eb0b"
  }
}

You can also simulate a specific type of failure using the run_type request parameter:

curl -k $INGRESS/trips/reserve --json '{"request": {"run_type": "failNotification"}}'

The response to such a failed invocation should return:

{
  "code": "internal",
  "message": "Travel reservation failed with err 'Error: Failed to send notification'; successfully applied 3 compensations"
}
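
If you call the endpoint from code rather than curl, the two response shapes above can be told apart with a pair of type guards. This is a minimal sketch: the field names are copied from the sample outputs, while the type and function names (`TripSuccess`, `TripFailure`, `summarizeTripResponse`) are hypothetical and not part of the demo repository.

```typescript
// Shapes copied from the sample outputs above; the names are hypothetical.
interface TripSuccess {
  response: { status: string; trip_id: string };
}

interface TripFailure {
  code: string;
  message: string;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

function isTripSuccess(body: unknown): body is TripSuccess {
  if (!isRecord(body)) return false;
  const response = body["response"];
  if (!isRecord(response)) return false;
  return (
    typeof response["status"] === "string" &&
    typeof response["trip_id"] === "string"
  );
}

function isTripFailure(body: unknown): body is TripFailure {
  if (!isRecord(body)) return false;
  return typeof body["code"] === "string" && typeof body["message"] === "string";
}

// Turn a raw JSON response body into a one-line summary.
function summarizeTripResponse(raw: string): string {
  const body: unknown = JSON.parse(raw);
  if (isTripSuccess(body)) {
    return `trip ${body.response.trip_id}: ${body.response.status}`;
  }
  if (isTripFailure(body)) {
    return `failed (${body.code}): ${body.message}`;
  }
  return "unrecognized response";
}

console.log(
  summarizeTripResponse(
    '{"response": {"status": "success", "trip_id": "6592ad0b-e442-4397-9465-82837ea2eb0b"}}',
  ),
);
```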

Let’s take a closer look at what really happens with the underlying Lambda handlers when we make an invocation to the Trips Service. Using CloudWatch Logs Insights, we can query across all four handlers’ log groups:

fields substr(@log, 50), @message
| filter (@message like /START/ or @message like /END/ or @message like /INFO/)
  and not (@message like /Registering:/ or @message like /INIT_START/)
| sort @timestamp asc

We can see that individual handlers activate and de-activate just as we expected – the TripHandler in particular starts and stops multiple times over the course of the overall trip booking processing.

The sample application also integrates with Amazon CloudWatch. You can find instructions for how to access the Restate service logs and RPC invocation traces in the Holiday Service repository.

Once you are done exploring this example, you can use the deployed Restate runtime for other services you might write with the Restate SDK. Or you can tear everything down via:

npx cdk destroy --all

Thanks for reading!

In the AWS serverless world, to develop complex processes you need to spread your business logic across multiple platforms and languages, exchanging state as untyped JSON, in order to get the same results that you would get in a long-running handler running on an EC2 instance. We’re very excited about the potential of suspendable functions to change this; you can write maintainable, typed, imperative code which describes the entire process, without losing any of the benefits of Lambda. We think this makes Lambda into the perfect application platform.

Interested in learning more? Go to https://docs.restate.dev!


Join the community and help shape Restate!

Join our Discord!