We would like to add a P2P messaging pattern, that is, a protocol that user functions can use to do P2P messaging through Akka.
What is a peer?
It's important to define what we mean by peer. A peer is an abstract concept that, for each domain, is defined by the domain. It could be a human, or it could be a device - eg, an IoT device - or it could be an entity (eg, an event sourced entity that is pushing updates to pages in real time). If it is a human, they may be interacting through many devices, for example, I have Slack installed on multiple laptops and multiple mobile devices, when someone sends me a message, and I have Slack on all my devices open, I expect to receive that message in real time on all of my devices at once. Typically, a device will have a TCP connection (perhaps gRPC stream or WebSocket) to a serverless service from which it will receive P2P messages. That connection may be over an unreliable network, and when it fails, it will reconnect, but not necessarily back to the same node that it was originally connected to.
Example characteristics
While we probably can't address every possible use case, we want to come up with one or more solutions that cover a broad range of use cases. With that in mind, here are some different characteristics or requirements that some use cases might have.
The P2P messaging may in some cases be between more than 2 peers (eg, a chat room, or multiple IoT devices in a home), there may be multiple publishers for a single topic, and multiple subscribers for a single topic - this may expand the traditional definition of P2P, perhaps we really are talking about addressed communication, but note that address is not a machine or actor address, it is the abstract user/device as defined above.
Various use cases exist for a range of different delivery guarantees. At most once is useful when the current state is being sent, and new messages invalidate previous messages. For example, tracking the location of an IoT enabled vehicle. The other major useful guarantee is effectively once. In this case it's assumed that the device receiving updates can deduplicate (using a domain specific sequence number for example, or unique ids), but needs at least once delivery. Instant messaging is an example of this.
Delivery time guarantees for effectively once messaging vary too. The point of P2P messaging is to allow effectively instant delivery, ie the only latency comes from network, routing, and processing, and that should happen in the happy case. In failure scenarios however, in some use cases there should be a maximum time that it takes for the message to be delivered, in other cases it's ok for the dropped message to not be delivered until the next message is received.
Solutions
Currently, the only out the box solution that Akka provides to implement P2P messaging as described above is distributed pubsub. This can be combined with Akka persistence to achieve at least once delivery, by persisting messages first, then publishing them, and then using the sequence number to detect dropped messages, and the journal to recover.
Distributed pubsub however requires replicating the subscriber state to all nodes, and hence doesn't scale well when there are a very large number of topics being subscribed to.
Here are two other distributed P2P possibilities that we might want to consider. These ideas are very raw and not fully thought out, they may be terrible.
- Sharded mediator. In this case, all messages go through a sharded mediator. Subscribers are required to register subscription with the mediator, and they are required to maintain that subscription, including in cases when the mediator is rebalanced. The mediator may tell the subscriber when it's handing off to another node to assist on this, but the subscriber can't rely on that, and the subscriber should periodically resubscribe to the mediator - the mediator will also expire subscribers that haven't resubscribed for a while. Akka cluster sharding could be used, or a consistent hashing router could be used - if the latter, cluster membership events might trigger resubscribing. This solution has reduced availability because it introduces a third node that needs to be available for communication to succeed.
- Gossip subscription state among publishers. In this case, a sharded contact point might be used to initially discover publishers and subscribers, once discovered, publishers gossip the subscription state between themselves and the contact point. Subscribers could either be part of the gossip cluster themselves, and use that to keep themselves active, or they could regularly tell the sharded contact point that they are active. Because there are potentially many gossip clusters, to keep communication down, gossip intervals would need to be long. Message activity could be used to trigger a temporary increase in gossip frequency (or an immediate gossip to all nodes), and publishers could keep messages that they publish for a time in case they learn of any new subscribers during this period of increased gossip frequency.