service-protocol's Introduction

Restate Service Protocol

This repo contains specification documents and Protobuf schemas of the Restate Service Protocol.

Development

To format the spec document:

npx prettier -w service-invocation-protocol.md

service-protocol's People

Contributors

ahmedsoliman, slinkydeveloper, tillrohrmann

service-protocol's Issues

Remove unused messages

At the moment we have no plan to implement the SuspensionMessage anytime soon. Let's remove it until we need it.

Harden the callback specification

  • Describe that the syscall is implemented as two steps: side effect + callback entry
  • If the closure fails, should we create the failed callback entry anyway?
  • How to handle receiving future completions?
  • How to handle the case where the closure sent some message, then the closure fails, and we receive the completion?

Allow any journal entries to require_ack

The REQUIRES_ACK flag is currently restricted to custom journal entries; we should allow it for any entry.

With this mechanism in place, the SDK can decide to wait for acks of specific types of entries, to avoid losing side effects when dealing with the cancellation signal. See https://docs.google.com/document/d/14s7D6KP1IKNS1K3OrVbZNKGNei9TO_-9Sk0YC5mpht8/edit#heading=h.hooeo8f4d03z

I propose to keep this concept of "REQUIRES_ACK" general, so that we can later introduce more "cancellation points" without touching the implementation of this aspect in the runtime (and potentially in the SDKs as well).
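
As a rough illustration of the proposal, the sketch below shows how an SDK could track acks for any entry type that opts in. The bit value of REQUIRES_ACK and the names AckTracker, on_entry_sent, on_ack and safe_to_cancel are hypothetical and not taken from the spec.

```python
from dataclasses import dataclass, field
from typing import Set

REQUIRES_ACK = 0x1  # hypothetical bit position, not taken from the spec


@dataclass
class AckTracker:
    """Tracks journal entries that were sent with REQUIRES_ACK set."""

    pending: Set[int] = field(default_factory=set)

    def on_entry_sent(self, index: int, flags: int) -> None:
        # Any entry type may opt in; the tracker never inspects the type.
        if flags & REQUIRES_ACK:
            self.pending.add(index)

    def on_ack(self, index: int) -> None:
        # Acks for entries that were never flagged are simply ignored.
        self.pending.discard(index)

    def safe_to_cancel(self) -> bool:
        # Assumed rule: a cancellation point is safe once every flagged
        # entry has been acknowledged by the runtime.
        return not self.pending
```

Because the tracker only looks at the flag, adding a new "cancellation point" later would only require setting the flag on another entry type.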

Describe side effects in the spec

We have no description of how side effects work. Even if this is purely an SDK-only feature, we should still describe how it works in the "optional features" section so that users can implement it themselves.

`BatchMessage`

In one of the initial designs of the protocol, we conceived the idea of a BatchMessage, mostly to improve protocol efficiency, but after a while we dropped it, as it didn't seem particularly interesting as an optimization.

@StephanEwen recently suggested that a BatchMessage could in fact make sense, but only in the response stream (from service endpoint to runtime), and not for optimization purposes: it would provide a way to commit either all entries or none. This can be an interesting property for SDKs that want to provide a different programming model, e.g. actor-based models where there are no "await" points and where side effects don't need to be wrapped in a ctx.sideEffect or similar.

An interesting aspect of this message is that it would be an optional part of the protocol contract, as only SDKs that want to push this message to the runtime need to implement it.
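
The "commit all entries or no entries" property can be sketched as follows. This is only an illustration of the idea on the receiving side; apply_batch and the validate callback are hypothetical names, and the journal is modelled as a plain list.

```python
from typing import Callable, List


def apply_batch(journal: List[str], batch: List[str],
                validate: Callable[[str], bool]) -> bool:
    """Append every entry of the batch to the journal, or none of them."""
    staged = list(journal)  # work on a copy so a failure leaves no trace
    for entry in batch:
        if not validate(entry):
            return False  # the journal is left untouched
        staged.append(entry)
    journal[:] = staged  # publish all entries in a single step
    return True
```

For example, with `journal = ["a"]`, a batch `["b", "c"]` that validates leaves `["a", "b", "c"]`, while a batch containing one rejected entry leaves the journal unchanged.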

Some properties of the protocol to describe in the spec

  • ACKs returned only for side effects?
    • Or better, whether to ack or not should be written in a flag. Important for unknown entries as well
  • Completions are received by service endpoint in any order
  • Service endpoint might even receive future completions?
  • When replaying, it's guaranteed no completions are sent by the runtime
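
If completions can arrive in any order, and possibly for entries the service endpoint has not created yet, an SDK needs somewhere to park them. A minimal sketch, assuming completions are keyed by entry index (CompletionBuffer and its methods are hypothetical names):

```python
from typing import Dict, Optional


class CompletionBuffer:
    """Parks completions until the matching journal entry asks for them."""

    def __init__(self) -> None:
        self._results: Dict[int, bytes] = {}

    def on_completion(self, entry_index: int, value: bytes) -> None:
        # Stored unconditionally: the entry may not exist yet
        # (a "future completion") and arrival order is not assumed.
        self._results[entry_index] = value

    def take(self, entry_index: int) -> Optional[bytes]:
        # Returns the completion once, or None if it has not arrived.
        return self._results.pop(entry_index, None)
```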

Add a magic number to discern the header from other bytes

In order to harden the protocol, it would be good to add/prepend a magic number to the header so that one can distinguish a header value from a non-header value. That way, SDKs can guard against interpreting the wrong bytes as the header value.
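
A minimal sketch of such a guard is shown below. The magic value and the header layout (16-bit magic, 16-bit message type, 32-bit payload length, big-endian) are assumptions made for this example and are not the layout defined by the spec.

```python
import struct
from typing import Tuple

MAGIC = 0x5253  # hypothetical 16-bit magic value chosen for this sketch
# Assumed layout: magic, message type, payload length (big-endian).
_HEADER = struct.Struct(">HHI")


def encode_header(message_type: int, length: int) -> bytes:
    return _HEADER.pack(MAGIC, message_type, length)


def decode_header(data: bytes) -> Tuple[int, int]:
    magic, message_type, length = _HEADER.unpack(data[:_HEADER.size])
    if magic != MAGIC:
        # The bytes are not a header, so refuse to interpret them.
        raise ValueError(f"not a header: bad magic 0x{magic:04x}")
    return message_type, length
```

With this check, an SDK that loses frame alignment fails fast instead of reading payload bytes as a message type and length.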

Make suspensions an optional feature

When implementing the new TS SDK, the runtime-initiated suspension mechanism made the implementation harder. It would be simpler for a new SDK to be able to ignore this feature altogether if possible. One idea could be to let the SDK tell the runtime whether it supports runtime-initiated suspensions (without it, the SDK would probably not work on AWS Lambda, though).

Another simplification could be to allow the waiting_for_completed_entries field to be empty which denotes that the invocation will be resumed on any journal entry completion.
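
The proposed semantics for an empty waiting_for_completed_entries field could look like the sketch below; should_resume is a hypothetical helper, not part of the runtime.

```python
from typing import Sequence


def should_resume(waiting_for_completed_entries: Sequence[int],
                  completed_entry_index: int) -> bool:
    """Decide whether a suspended invocation is resumed by a completion."""
    if not waiting_for_completed_entries:
        # Proposed rule: an empty field means "resume on any completion".
        return True
    return completed_entry_index in waiting_for_completed_entries
```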

More details are needed for why the suspension mechanism caused trouble (Stephan, Giselle).

Document service discovery protocol

While we have documented how the service protocol works, we have not documented how the discovery protocol works. To make the implementation of custom SDKs easier, this would probably help.

Pass partialStateFlag as part of StartMessage instead of in header

From the perspective of SDK development, it would be easier to have the partialStateFlag in the start message. This makes the header parsing simpler, and you could just use the protobuf deserializers to get the flags. The flags would then also simply be passed through the code base as part of the message.

Cancellations

This issue includes:

  • #50
  • Introduce cancelled/failed field for background invoke and resolve awakeable
  • Introduce failed completion variant for GetState
  • Introduce StartMessage.cancel field
  • Describe the SDK cancellation behavior in the protocol spec

Make protocol actions more explicit

For some SDKs, it could be helpful to make certain protocol actions such as suspending more explicit by sending an explicit message instead of only closing the request channel. It would have the benefit that a runtime failure which also closes the request channel can be easily distinguished from a suspension. Some languages/frameworks have difficulties detecting the difference right now (see https://restatedev.slack.com/archives/C04KZRLE1SM/p1687219253254139 for more details).

Notifying non-recoverable errors back to the runtime

In the protocol we distinguish three types of errors:

  • User errors
  • Infra recoverable errors, that is, they can be retried
  • Infra unrecoverable errors, that is, retrying them will lead again to failure

For user errors we already have a strategy to handle them, as all the fallible operations have a failure variant where we can write the user failure and propagate it back. It's up to the SDK to correctly implement error handling to write user failures as OutputStreamEntry.failure. For recoverable and unrecoverable errors, we don't yet have a good strategy to distinguish them.

To recap, right now a bidi stream can end in 4 ways:

  • Suspending, by sending a SuspensionMessage
  • Completing the invocation, by sending an OutputStreamEntry
  • Failing, by simply closing the stream correctly, not aborting it
  • Badly failing, by either connection broken or aborting the stream

In the runtime, we already distinguish the cases "Failing" and "Badly failing" thanks to the HTTP/2 protocol frames: https://github.com/restatedev/restate/blob/main/src/invoker/src/invoker.rs#L822 and https://github.com/restatedev/restate/blob/main/src/invoker/src/invocation_task.rs#L453 are the relevant code. In other words, I can always distinguish when the SDK closed the stream, or when it was aborted, or when there was a connection failure.
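
The four outcomes listed above can be summarised in a small classification sketch. It assumes the runtime knows the name of the last message it received and whether the stream was closed cleanly; StreamEnd and classify_stream_end are hypothetical names and do not mirror the invoker code linked above.

```python
from enum import Enum
from typing import Optional


class StreamEnd(Enum):
    SUSPENDED = "suspended"
    COMPLETED = "completed"
    FAILED = "failed"
    BADLY_FAILED = "badly failed"


def classify_stream_end(last_message: Optional[str],
                        closed_cleanly: bool) -> StreamEnd:
    if not closed_cleanly:
        # Broken connection or aborted stream, whatever was sent before
        # (an assumption of this sketch).
        return StreamEnd.BADLY_FAILED
    if last_message == "SuspensionMessage":
        return StreamEnd.SUSPENDED
    if last_message == "OutputStreamEntry":
        return StreamEnd.COMPLETED
    # Clean close without a terminal message.
    return StreamEnd.FAILED
```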

All the considerations above lead to this question: Is there ever a case where the SDK produces on-purpose recoverable errors?

  • If yes, we could simply say that when the SDK closes the stream correctly, and does not send SuspensionMessage or OutputStreamEntry, we know it's an unrecoverable failure, so the runtime won't retry to invoke it, and will simply propagate the error to the caller.
  • If no, then we need some way to let the SDK notify whether the error that occurred is recoverable or not, by sending back something like an ErrorMessage. If we go down this road, we still need to define what to do in case the stream was closed correctly without a SuspensionMessage, OutputStreamEntry or ErrorMessage, which probably means that this approach still implies the other one.

For example, a non-complete list of errors the Java sdk produces is here: https://github.com/restatedev/sdk-java/blob/main/sdk-core-impl/src/main/java/dev/restate/sdk/core/impl/ProtocolException.java, all of those are unrecoverable.

`PARTIAL_STATE` is always false

Quick question on the eager state implementation right now:

• Is Partial State always false, meaning do we get all state always?
• Is there a way we can keep the SDK simple, keep that flag out and just assume that the flag is always false?

Slack Message

Allow end of stream without OutputStreamEntry

Right now to end a stream:

A message stream MUST start with StartMessage and MUST end with either:

  • One OutputStreamEntry
  • One SuspensionMessage
  • One ErrorMessage
  • None of the above, which is equivalent to sending an empty ErrorMessage
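
The rule quoted above can be expressed as a small check. This sketch models messages by name only; effective_terminal is a hypothetical helper and deliberately does not constrain what appears between the first and the last message.

```python
from typing import List

TERMINAL = {"OutputStreamEntry", "SuspensionMessage", "ErrorMessage"}


def effective_terminal(messages: List[str]) -> str:
    """Return the message that effectively ends the stream."""
    if not messages or messages[0] != "StartMessage":
        raise ValueError("a message stream MUST start with StartMessage")
    last = messages[-1]
    # No terminal message is treated like an empty ErrorMessage.
    return last if last in TERMINAL else "ErrorMessage"
```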

This assumes that a correct invocation always ends with a response, which is not the case for restatedev/restate#899. We should modify the end of stream operations to not assume OutputStreamEntry is the last message.

Line up terminology between protocol and SDK

For example, a unidirectional call is called backgroundInvoke in the protocol but oneWayCall in the TS SDK.
This means that users type oneWayCall in their code, but see backgroundInvoke in the logs and traces.

Decide on the type for time

Both BackgroundInvoke and Sleep have a time field. Right now its type is i64, but perhaps we should change it to u64. We should also consider how Rust's time APIs behave with respect to time overflow.
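
To make the trade-off concrete, the sketch below computes a wake-up time with an explicit overflow check, similar in spirit to a checked addition. The millisecond unit and the name checked_wake_up_time are assumptions made for this example.

```python
from typing import Optional

I64_MAX = 2**63 - 1
U64_MAX = 2**64 - 1


def checked_wake_up_time(now_ms: int, delay_ms: int,
                         max_value: int = U64_MAX) -> Optional[int]:
    """Return now + delay, or None if the result is not representable."""
    if now_ms < 0 or delay_ms < 0:
        # Negative values are representable in i64 but carry no meaning here.
        return None
    wake_up = now_ms + delay_ms
    return wake_up if wake_up <= max_value else None
```

With `max_value=I64_MAX` the same addition overflows at half the range, which is the practical difference between the two candidate types.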

Simplify implementation of new SDKs

This is an exploration issue for collecting and evaluating ideas for the service protocol that could simplify the implementation of new SDKs. The goal is that implementing a new SDK should be possible within 2-3 weeks. Key to this are an easy-to-understand description of the protocol, clear concepts, optionality of more advanced features, and ease of use.

Make the journal an immutable log by storing completions separately

Implementing new SDKs could become easier if the journal were an immutable log in which journal entries and completions are stored separately. This would allow us to remove the distinction between completable and non-completable journal entries. It could also help with implementing deterministic futures, since we would record the order in which the journal entries are completed.

One problem that might arise is that with this change, there will be two components that append to the journal: the runtime which appends completions and the SDK appending journal entries. Right now, only the SDK is allowed to append journal entries which makes it quite simple to keep the runtime and SDK view on the journal in sync.

In order to solve this problem, the SDK would probably need to be able to re-order the tail of its journal in case there were completions that were appended before the last journal entries.

A minor disadvantage is that the runtime will lose the cheap capability to check whether a journal entry was completed or not.

More details on how an immutable log can simplify the SDK implementation are needed (Stephan, Giselle).

Transport the opaque sid

Perhaps it makes sense to transport the opaque id rather than invocation_id and service_key in the StartMessage?

We could use this opaque id as part of the awakeable identifier as well.

Remove the `SideEffectEntryMessage`

I propose to remove SideEffectEntryMessage from the "core" protocol and let every sdk define its own SideEffectEntryMessage as Unknown entry type.

The reason for this design choice is to keep the service-protocol small and define only journal entry messages which the runtime must be able to read and process in some way, as done with GetStateEntry, SetStateEntry or InvokeEntry, where the runtime needs to parse the entry, apply respective effects and eventually send back a completion. SideEffectEntryMessage does not fit in this category of messages, as the runtime never needs to parse it. It simply has to store it as blob and ack back "I've stored it" to the SDK. This specific mechanism is already specified by Unknown entries, which the runtime will accept and store, but won't try to parse them [1]. For an example usage of the Unknown entry mechanism, see CombinatorsEntryMessage.

Another nice consequence of this design choice is that we leave SDKs free to define SideEffectEntry as they want. For example, in Java we might be able to record an error's stack trace in a specific format, while in other languages we might need another message structure to record it.

[1] The spec still needs some clarification on this though, like defining when an Unknown entry should be acked or not. See #2.

Finish the invocation with the first OutputStreamEntry

One problem the TS SDK ran into was that a replay could happen after the first OutputStreamEntry was sent. Given that we don't support output streaming (not yet, and probably not in the foreseeable future), we might consider changing the protocol such that an invocation terminates with the first and only OutputStreamEntry.

Agree on the format used to expose the awakeable identifier

Right now we have two approaches in the SDKs:

We should agree on a single format we use in all the SDKs, and use the same format in restatedev/proto#20.

Make SideEffect an explicit journal entry type

While implementing the TS SDK, it turned out that implementing the side effect journal entry as a CustomEntry complicated the code a tad bit. The reason was that for the side effect entry, one always had to handle a CustomEntry and check that it contains a side effect. Maybe we want to make the SideEffect journal entry a first class citizen of the protocol to simplify this aspect even though the runtime does not need to understand it.

More details on what exactly was more complicated to implement for the side effect entry are needed (Stephan, Giselle).
