
fmi's Introduction

FaaS Message Interface

Serverless platforms provide massive parallelism with very high elasticity and fine-grained billing. Because of these properties, they are increasingly used for stateful, distributed jobs at large scales. However, a major limitation of the commonly used platforms is communication: individual functions cannot communicate directly, and using external storage or databases for ephemeral data can be slow and expensive. We present FMI, the FaaS Message Interface, to overcome this limitation. FMI is an easy-to-use, high-performance framework for general-purpose communication in Function-as-a-Service platforms. It supports different communication channels (including direct communication with our TCP NAT hole punching system), offers model-driven channel selection according to performance or cost, and provides optimized collective implementations that exploit the characteristics of the different channels. In our experiments, FMI speeds up communication for a distributed machine learning job by up to 1,200x while simultaneously reducing cost by up to 365x. It provides a simple interface and can be integrated into existing codebases with a few minor changes.

If you use FMI in your work, please cite our ACM ICS 2023 paper:

@inproceedings{10.1145/3577193.3593718,
author = {Copik, Marcin and B\"{o}hringer, Roman and Calotoiu, Alexandru and Hoefler, Torsten},
title = {FMI: Fast and Cheap Message Passing for Serverless Functions},
year = {2023},
isbn = {9798400700569},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3577193.3593718},
doi = {10.1145/3577193.3593718},
abstract = {Serverless functions provide elastic scaling and a fine-grained billing model, making Function-as-a-Service (FaaS) an attractive programming model. However, for distributed jobs that benefit from large-scale and dynamic parallelism, the lack of fast and cheap communication is a major limitation. Individual functions cannot communicate directly, group operations do not exist, and users resort to manual implementations of storage-based communication. This results in communication times multiple orders of magnitude slower than those found in HPC systems. We overcome this limitation and present the FaaS Message Interface (FMI). FMI is an easy-to-use, high-performance framework for general-purpose point-to-point and collective communication in FaaS applications. We support different communication channels and offer a model-driven channel selection according to performance and cost expectations. We model the interface after MPI and show that message passing can be integrated into serverless applications with minor changes, providing portable communication closer to that offered by high-performance systems. In our experiments, FMI can speed up communication for a distributed machine learning FaaS application by up to 162x, while simultaneously reducing cost by up to 397 times.},
booktitle = {Proceedings of the 37th International Conference on Supercomputing},
pages = {373–385},
numpages = {13},
keywords = {high-performance computing, I/O, serverless, function-as-a-service, faas},
location = {Orlando, FL, USA},
series = {ICS '23}
}

Dependencies

  • C++17 or higher
  • Boost
  • AWS SDK for C++
  • hiredis
  • TCPunch

Installation (C++)

  • Clone this repository
  • Add to your CMakeLists.txt:
add_subdirectory(path_to_repo/FMI/)

target_link_libraries(${PROJECT_NAME} PRIVATE FMI)
target_include_directories(${PROJECT_NAME} PRIVATE ${FMI_INCLUDE_DIRS})
  • Integrate the library into your project:
#include <Communicator.h>
...
FMI::Communicator comm(peer_id, num_peers, "config/fmi.json", "MyApp", 512);
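
A communicator can then be used for point-to-point and collective operations. The following minimal sketch assumes MPI-style method and wrapper names (send, recv, FMI::Comm::Data); check tests/communicator.cpp for the authoritative API:

#include <Communicator.h>

void example(int peer_id, int num_peers) {
    // One communicator per function instance; all peers share the app name.
    FMI::Communicator comm(peer_id, num_peers, "config/fmi.json", "MyApp", 512);
    FMI::Comm::Data<int> value(42);
    if (peer_id == 0)
        comm.send(value, 1);  // peer 0 sends the value to peer 1
    else if (peer_id == 1)
        comm.recv(value, 0);  // peer 1 receives the value from peer 0
}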

Installation (Python)

  • Clone this repository
cd python
mkdir build
cd build
cmake ..
make
  • fmi.so is created in the python/build directory. You can copy it into your Python module path or add the build directory to PYTHONPATH. The library can then be used in your project:
import fmi
comm = fmi.Communicator(peer_id, num_peers, "config/fmi.json", "MyApp", 512)

Docker Images

The Docker images in the FMI-build-docker repository contain all necessary dependencies and set up the environment for you. See that repository for details.

AWS Lambda Layer

For even easier deployment, we provide AWS CloudFormation templates to create Lambda layers in python/aws. Simply run sam build and sam deploy --guided in the folder corresponding to your Python version; this creates a Lambda layer in your account that can be added to your function. Once you have added the layer, you can simply import fmi and work with the library.

Examples

C++ sample code for the library is available in tests/communicator.cpp; usage from Python is demonstrated in python/tests/client.py.

Documentation

The architecture of the system, including a comparison with existing systems and benchmarks, is documented in the ACM ICS'23 paper FMI: Fast and Cheap Message Passing for Serverless Functions. More details can be found in the thesis FMI: The FaaS Message Interface.

Technical documentation of the system (for people who want to extend it) is available at fmi.opencore.ch.

Authors

Contributors: mcopik, opencorech


fmi's Issues

Add new function launcher

While FMI supports running a swarm of function workers, we have so far used ad-hoc Bash scripts to launch many functions in parallel. We need a proper tool, similar to mpiexec, that takes parameters such as the number of functions, the name of the function, and the cloud region, generates a configuration for each function, and quickly launches the desired number of workers (a launcher sketch follows the list below).

  • Implement a basic parallel launcher supporting arbitrary functions.
  • Document the input format describing the function ID and the hierarchy of workers (see #7).
  • Add a local backend to start functions as Docker or native processes on the same system.
  • Add error handling to display failure outputs from functions.
  • Add a tree-based launcher to achieve a logarithmic startup time.
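
As a starting point, the sketch below shows what a basic parallel launcher could look like on AWS. It is only an illustration: the function name, the payload layout, and the fixed worker count are assumptions that a real tool would take from the command line, like mpiexec.

#include <aws/core/Aws.h>
#include <aws/lambda/LambdaClient.h>
#include <aws/lambda/model/InvokeRequest.h>
#include <sstream>
#include <thread>
#include <vector>

int main() {
    Aws::SDKOptions options;
    Aws::InitAPI(options);
    {
        const int num_workers = 4;                      // assumed; a CLI flag in a real tool
        const Aws::String function_name = "fmi-worker"; // hypothetical function name
        Aws::Lambda::LambdaClient client;
        std::vector<std::thread> threads;
        for (int rank = 0; rank < num_workers; ++rank) {
            threads.emplace_back([&, rank]() {
                Aws::Lambda::Model::InvokeRequest request;
                request.SetFunctionName(function_name);
                // Asynchronous ("Event") invocation: fire off the function without waiting.
                request.SetInvocationType(Aws::Lambda::Model::InvocationType::Event);
                // Per-worker configuration: its rank and the world size.
                auto payload = Aws::MakeShared<Aws::StringStream>("launcher");
                *payload << "{\"peer_id\": " << rank << ", \"num_peers\": " << num_workers << "}";
                request.SetBody(payload);
                client.Invoke(request);
            });
        }
        for (auto& t : threads) t.join();
    }
    Aws::ShutdownAPI(options);
    return 0;
}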

Implement shared-memory communication

Serverless functions are becoming more powerful, and AWS Lambda now supports up to 6 vCPUs per function. We could benefit from running multiple "workers" as threads on a single function and letting them communicate locally. We cannot use multiple processes within a single function, as that would require shared memory (shmem), which is not supported on AWS Lambda.

  • Introduce a memory-based communication channel.
  • Adjust the function launcher to support a hierarchy of functions - local and external ones.
  • In collectives and P2P communication, implement the selection between external and local channels based on the recipient.

This might require making collectives natively multi-protocol first (#6).
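
A minimal sketch of such a memory-based channel, assuming a simple send/recv interface over mutex-protected per-pair queues (an illustration, not FMI's actual channel interface):

#include <condition_variable>
#include <cstdint>
#include <map>
#include <mutex>
#include <queue>
#include <utility>
#include <vector>

class MemoryChannel {
public:
    // Deliver a message from src to dst and wake up any waiting receiver.
    void send(int src, int dst, std::vector<uint8_t> data) {
        std::lock_guard<std::mutex> lock(mutex_);
        queues_[{src, dst}].push(std::move(data));
        cv_.notify_all();
    }
    // Block until a message from src arrives at dst.
    std::vector<uint8_t> recv(int src, int dst) {
        std::unique_lock<std::mutex> lock(mutex_);
        auto& queue = queues_[{src, dst}];
        cv_.wait(lock, [&] { return !queue.empty(); });
        std::vector<uint8_t> message = std::move(queue.front());
        queue.pop();
        return message;
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::map<std::pair<int, int>, std::queue<std::vector<uint8_t>>> queues_;
};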

Run LAPACK with FMI

We should be able to run parallel LAPACK routines on serverless functions. However, we might need to implement additional MPI functionality for it.

  • Generate the list of MPI functionalities (#11)
  • Adapt the entrypoint to run on a function.
  • Add sample benchmark code.
  • Test the benchmark for several problem sizes.

Push notification for Redis

While FMI supports communication over Redis, it currently does so only by polling for data. This, in turn, requires a backoff to prevent DDoS-ing the service. However, Redis supports push-based messaging, which we could use to push messages to another function (a sketch of the receiving side follows the list below).

  • Implement latency benchmark for Redis in SeBS.
  • Extend the Redis channel to support push notifications.
  • Add tests with local Redis deployment.
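
A sketch of the receiving side using hiredis pub/sub; the channel naming scheme is a hypothetical example, and the sender would use a matching PUBLISH command:

#include <hiredis/hiredis.h>
#include <cstddef>
#include <cstdio>

int main() {
    redisContext* ctx = redisConnect("127.0.0.1", 6379);
    if (ctx == nullptr || ctx->err) return 1;
    // Hypothetical channel name derived from the app name and the peer pair.
    redisReply* reply = (redisReply*)redisCommand(ctx, "SUBSCRIBE MyApp:0:1");
    freeReplyObject(reply);
    // redisGetReply blocks until the server pushes a message:
    // no polling loop and no backoff are needed.
    if (redisGetReply(ctx, (void**)&reply) == REDIS_OK) {
        // Pub/sub replies are arrays of ["message", channel, payload].
        if (reply->type == REDIS_REPLY_ARRAY && reply->elements == 3)
            printf("received %zu bytes\n", (size_t)reply->element[2]->len);
        freeReplyObject(reply);
    }
    redisFree(ctx);
    return 0;
}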

Multi-channel collectives

Some collective operations can benefit from switching protocols on the fly. An example is broadcast: S3 bandwidth scales roughly linearly with the number of recipients, while a single sender would saturate its TCP bandwidth when sending multiple copies (a heuristic sketch follows the list below).

  • Introduce multi-channel support into collective operations.
  • Implement a simple heuristic-based selection of the channel.
  • Implement broadcast that switches channels depending on the number of clients and message size.
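
A sketch of such a heuristic; the thresholds are illustrative assumptions, not measured values:

#include <cstddef>

enum class Channel { S3, DirectTCP };

// A single TCP sender transmits the message once per recipient, so its cost
// grows with num_recipients * message_bytes. An S3 object is written once and
// read in parallel, which wins for large fan-out and large messages.
Channel select_broadcast_channel(std::size_t message_bytes, int num_recipients) {
    constexpr std::size_t kSmallMessage = 64 * 1024; // assumed threshold
    constexpr int kSmallFanout = 4;                  // assumed threshold
    if (message_bytes <= kSmallMessage && num_recipients <= kSmallFanout)
        return Channel::DirectTCP; // latency-bound: direct connections win
    return Channel::S3;            // bandwidth-bound: one write, parallel reads
}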

Nondeterministic connection breaking on AWS

We have experienced that TCP connections created with TCPunch can randomly fail on AWS. So far, we have not found the root cause; the observed behavior is that a TCP message is suddenly lost after the peers exchange 16-64 kB of data. The data is sent, as verified by Wireshark analysis, but the receiver keeps waiting for a TCP packet that never arrives. We have been able to reproduce the issue between two VMs as well.

So far, we have implemented a workaround that attempts to exchange 64 kB between the two peers and restarts the pairing process if the exchange fails (a sketch follows the questions below).

  • Can we still reproduce the problem?
  • Can changing the default packet size influence the problem?
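
A sketch of the workaround, assuming plain POSIX sockets and a hypothetical pair_with_peer() that performs the TCPunch pairing:

#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>
#include <array>

int pair_with_peer(); // hypothetical: performs NAT hole punching, returns a socket

// Exchange a 64 kB probe with the peer; if the transfer stalls, close the
// socket and redo the pairing. A receive timeout acts as the stall detector.
int paired_socket_with_probe() {
    static std::array<char, 64 * 1024> buffer{};
    for (;;) {
        int fd = pair_with_peer();
        timeval timeout{5, 0}; // assumed 5-second stall timeout
        setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &timeout, sizeof(timeout));
        if (send(fd, buffer.data(), buffer.size(), 0) == (ssize_t)buffer.size() &&
            recv(fd, buffer.data(), buffer.size(), MSG_WAITALL) == (ssize_t)buffer.size())
            return fd; // probe succeeded: the connection is usable
        close(fd);     // probe failed: restart the pairing process
    }
}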

Port communication channels to Google Cloud

We should implement communication channels for Google Cloud.

  • Research storage and communication services.
  • Add Docker containers for Google services and environments.
  • Implement a function launcher for GCP functions.
  • Add tests for all channel types.

The prerequisite for this task is validating hole punching (#2).

Error in installation: SAM build fails for Python39

The documentation says that running sam build and sam deploy --guided will deploy the layer. However, I got an error when I followed these steps.

Steps to reproduce the issue:

  1. Fork the fmi repository
  2. Setup AWS credentials (access key and id)
  3. cd python/aws/python39
  4. Run sam build
  5. The error message I got was:
Build Failed
Error: CustomMakeBuilder:MakeBuild - Make Failed: Unable to find image 'fmi-build-python39:latest' locally
docker: Error response from daemon: pull access denied for fmi-build-python39, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.
See 'docker run --help'.
make: *** [build-python39] Error 125

Possible way to fix this:

Update: I fixed this by building the Docker image locally from the FMI-build-docker repository.

Evaluate TCPunch on Google Cloud

We have verified that TCPunch works on AWS. However, it has yet to be established whether the implemented NAT hole punching will work on Google Cloud. This step is necessary to run TCP communication between two different functions.

We should first run this on two VMs to verify that the connection is established, and then try to establish a TCP connection between a VM and a function.

Optimizing collectives for size

The HPC literature includes many collective algorithms optimized for specific network topologies and message sizes. Based on the message size, MPI implementations might choose completely different communication algorithms to efficiently utilize bandwidth on large messages and hide the latency for small messages, e.g., with pipelining.

We want to evaluate how such algorithms can be ported to serverless communication (a selection sketch follows the list below).

  • Research and evaluate improved broadcast.
  • Research and evaluate improved reduction.
  • Research and evaluate improved scan.
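
As an illustration, MPI-style selection logic could look like the sketch below; the algorithms are standard in the collective-communication literature, but the thresholds are assumptions:

#include <cstddef>

enum class BcastAlgorithm {
    BinomialTree,     // few rounds: good latency for small messages
    ScatterAllgather, // splits the message: better bandwidth for large ones
    Pipelined         // overlapping chunks: hides latency on very large messages
};

BcastAlgorithm select_bcast(std::size_t message_bytes, int num_peers) {
    constexpr std::size_t kEagerLimit = 16 * 1024;      // assumed threshold
    constexpr std::size_t kPipelineLimit = 1024 * 1024; // assumed threshold
    if (message_bytes <= kEagerLimit || num_peers <= 2)
        return BcastAlgorithm::BinomialTree;
    if (message_bytes <= kPipelineLimit)
        return BcastAlgorithm::ScatterAllgather;
    return BcastAlgorithm::Pipelined;
}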

Evaluate collective operations on the Google Cloud

This is the follow-up to task #5: once we port the communication channels, we should run all collective operations, verify that they work correctly, and ensure that the underlying communication channels scale with the number of clients.

Integration with the Serverless framework

The Serverless framework provides the ability to deploy functions and necessary components automatically. FMI relies on having provisioned S3 buckets, DynamoDB tables, or other resources.

Instead of expecting users to configure such resources manually, we could provide a YAML configuration file for the framework to deploy the resources and export the configuration. The user's application could then include FMI code directly in its functions and load the configuration from a file or from environment variables, as sketched below.
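
A minimal sketch of the configuration-loading side; the environment variable names are hypothetical, since the contract between the deployment and the functions is yet to be defined:

#include <cstdlib>
#include <string>

struct FMIResources {
    std::string s3_bucket;
    std::string dynamodb_table;
};

// Read the resource names that the deployment exported into the function's
// environment (the variable names here are assumptions, not an existing contract).
FMIResources load_resources_from_env() {
    auto get = [](const char* name) {
        const char* value = std::getenv(name);
        return std::string(value ? value : "");
    };
    return {get("FMI_S3_BUCKET"), get("FMI_DYNAMODB_TABLE")};
}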

Refresh Docker images

We have a set of Docker images under spcleth/fmi. We should refresh them and ensure they stay up to date.

Map FMI operations to MPI

In FMI, we implemented a subset of MPI functionality but improved the interface to support better typing. To natively support MPI applications, we should create a library wrapper that redirects MPI calls to FMI (a sketch follows the list below).

  • Test with OpenMPI.
  • Test with mpi4py.
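
A sketch of the wrapper idea: the shim library defines the MPI symbols itself and forwards them to FMI. The FMI call and the global communicator are assumptions, and a real wrapper would also translate datatypes and tags:

#include <mpi.h>
#include <Communicator.h>
#include <memory>
#include <vector>

// Hypothetical global FMI communicator, initialized in an MPI_Init wrapper.
static std::unique_ptr<FMI::Communicator> g_comm;

extern "C" int MPI_Send(const void* buf, int count, MPI_Datatype datatype,
                        int dest, int tag, MPI_Comm comm) {
    (void)datatype; (void)tag; (void)comm; // simplification: treats count as bytes
                                           // and assumes a single communicator
    // Assumed FMI interface: wrap the buffer and send it to the destination peer.
    FMI::Comm::Data<std::vector<char>> data(
        std::vector<char>((const char*)buf, (const char*)buf + count));
    g_comm->send(data, dest);
    return MPI_SUCCESS;
}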

Run LULESH with FMI

We should be able to run LULESH in parallel on serverless functions. However, we might need to implement additional MPI functionality for it.

  • Generate the list of MPI functionalities (#11)
  • Adapt the entrypoint to run on a function.
  • Add sample benchmark code.
  • Test the benchmark for several problem sizes.

Implement SQS-based communication

One communication channel we have not yet covered is AWS SQS. We could use it to communicate between functions; the main limitations are the small maximum message size (256 kB) and the need for base64 encoding (a sketch of the sending side follows the list below).

Steps for the solution:

  • Allocate the SQS client and select a proper queue.
  • Implement a new indirect channel.
  • Implement splitting serialized data across multiple messages.
  • Add tests covering the new communication method.
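
A sketch of the sending side, assuming the queue URL has been provisioned elsewhere; the index prefix is a hypothetical reassembly scheme:

#include <aws/core/Aws.h>
#include <aws/core/utils/Array.h>
#include <aws/core/utils/HashingUtils.h>
#include <aws/sqs/SQSClient.h>
#include <aws/sqs/model/SendMessageRequest.h>
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

void sqs_send(const Aws::SQS::SQSClient& client, const Aws::String& queue_url,
              const std::vector<unsigned char>& payload) {
    // Stay below the 256 kB limit after base64 inflation (4/3): ~180 kB raw chunks.
    const std::size_t chunk_size = 180 * 1024;
    const std::size_t num_chunks = (payload.size() + chunk_size - 1) / chunk_size;
    for (std::size_t i = 0; i < num_chunks; ++i) {
        const std::size_t offset = i * chunk_size;
        const std::size_t length = std::min(chunk_size, payload.size() - offset);
        Aws::Utils::ByteBuffer chunk(payload.data() + offset, length);
        Aws::SQS::Model::SendMessageRequest request;
        request.SetQueueUrl(queue_url);
        // Prefix each chunk with its index so the receiver can reassemble the data.
        request.SetMessageBody(Aws::String(std::to_string(i).c_str()) + ":" +
                               Aws::Utils::HashingUtils::Base64Encode(chunk));
        client.SendMessage(request);
    }
}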

Extend MPI support

In FMI, we have implemented a subset of MPI functionalities. We should extend the implementation to support typical HPC use cases, such as communicator manipulation.
