
Gaia - rapid backend development framework in C++

Build Status

Gaia is a set of libraries and a C++ environment that enables efficient and rapid development in C++14 on Linux systems. The focus is mostly backend development, data processing, etc.

  1. Dependency on abseil-cpp
  2. Dependency on Boost 1.69
  3. Uses ninja-build on top of cmake
  4. Build artifacts are docker-friendly.
  5. Generic RPC implementation.
  6. HTTP server implementation.
  7. Many other features.

I will gradually add explanations for the most crucial building blocks in this library.

Setting Up & Building

  1. abseil is integrated as a submodule. To fetch abseil, run:

    git submodule update --init --recursive
    
  2. Building using Docker. Short version: running the asio_fibers example. There is no need to install dependencies on the host machine.

       > docker build -t asio_fibers -f docker/bin_build.Dockerfile --build-arg TARGET=asio_fibers
       > docker run --network host asio_fibers --logtostderr  # server process, from shell A
       > docker run --network host asio_fibers --connect=localhost --count 10000 --num_connections=4  # client process from shell B

    For a longer version please see this document.

  3. Building directly on the host machine. Currently requires Ubuntu 18.04.

    > sudo ./install-dependencies.sh
    > ./blaze.sh -ninja -release
    > cd build-opt && ninja -j4 asio_fibers
    

    The third_party folder is checked out under the build directories.

    Then, from 2 tabs run:

      server> ./asio_fibers --logtostderr
      client> ./asio_fibers --connect=localhost --count 100000 --num_connections=4

Single node Mapreduce

The GAIA library provides a very efficient multi-threaded mapreduce framework for batch processing. It supports JSON parsing, compressed formats (gzip, zstd), local disk I/O and GCS (Google Cloud Storage) out of the box. Using GAIA MR, it's possible to map, re-shard (partition), join and group multiple sources of data very efficiently. Fibers in GAIA make it possible to maximize pipeline execution and balance I/O with CPU workloads in parallel. The example below shows how to process text files and re-shard them based on an imaginary "year" column for each CSV row. Please check out this tutorial to learn more about GAIA MR.

#include "absl/strings/str_cat.h"
#include "mr/local_runner.h"
#include "mr/mr_main.h"
#include "strings/split.h"  // For SplitCSVLineWithDelimiter.

using namespace std;

DEFINE_string(dest_dir, "~/mr_output", "Working dir where the pipeline writes its by-products");

int main(int argc, char** argv) {
  // Sets up IO threads and an optional http console interface (port 8080 by default).
  PipelineMain pm(&argc, &argv);
  vector<string> inputs;
  for (int i = 1; i < argc; ++i) {
    inputs.push_back(argv[i]);  // could be a local file or "gs://...." url.
  }
  CHECK(!inputs.empty()) << "Must provide some inputs to run!";

  Pipeline* pipeline = pm.pipeline();

  // Assuming that the first line of each file is csv header.
  StringTable ss = pipeline->ReadText("read", inputs).set_skip_header(1);
  auto reshard = [](string str) {
    vector<char*> cols;
    SplitCSVLineWithDelimiter(&str.front(), ',', &cols);
    return absl::StrCat("year-", cols[0]);
  };

  // Simplest example: read and repartition by year.
  ss.Write("write_input", pb::WireFormat::TXT)
      .WithCustomSharding(reshard).AndCompress(pb::Output::ZSTD);

  // Environment is abstracted away through mr3::Runner class. LocalRunner is an implementation
  // that comes out of the box.
  LocalRunner* runner = pm.StartLocalRunner(FLAGS_dest_dir);
  pipeline->Run(runner);

  LOG(INFO) << "Pipeline finished";

  return 0;
}

RPC

In addition to great performance, this RPC framework supports a server-streaming API, fully asynchronous processing, and low-latency service. The GAIA RPC framework employs Boost.ASIO and Boost.Fibers as its core libraries for asynchronous processing.

  1. IoContextPool is used for managing a thread-per-core asynchronous engine based on ASIO. For periodic tasks, look at asio/period_task.h.

  2. The listening server (AcceptServer) is protocol agnostic and serves both HTTP and RPC.

  3. RPC-service methods run inside a fiber. That fiber belongs to a thread that probably serves many other fiber-based connections in the server. Using regular locking mechanisms (std::mutex, pthread_mutex) or calling 3rd party libraries (libmysqlcpp) will block the whole thread and all its connections will be stalled. We need to be mindful of this, and as a policy prohibit thread blocking in fiber-based server code.

  4. Nevertheless, RPC service methods might need to issue RPC calls themselves or block for some other reason. To do it correctly, we must use fiber-friendly synchronization routines. Even in this case, we still block the calling fiber (not the thread): all other connections continue processing while this one stalls. By default, there is one dedicated fiber per RPC connection that reads RPC requests and delegates them to the RPC application code. Remember that if higher-level server code stalls its fiber during request processing, it effectively limits the total QPS of that socket connection. For spinlock use cases (i.e. RAM-access locking with rw-spinlocks under low contention), a single fiber per RPC connection is usually enough to sustain high throughput. For more complicated cases, it's advised to implement a fiber pool (currently not exposed in GAIA).

  5. Server-side streaming is needed for responses that can be very large. Such responses can easily be represented by a stream of smaller responses with an identical schema. Think of an SQL response, for example: it may consist of many rows returned by SELECT. Instead of returning all of them as one blob, server-side streaming can send back multiple responses in the context of a single request on the wire. Each small response is propagated to the RPC client via a callback-based interface. As a result, the two systems (client and server) are not required to hold the whole response in RAM at the same time.

While GAIA provides a very efficient RPC core library, it does not provide higher-level RPC bindings. It is possible, though, to build a layer that uses a protobuf-based declaration language on top of this RPC library. For a raw RPC demo, see asio_fibers above.

HTTP

The HTTP handler is implemented using the Boost.Beast library and is integrated with the IoContextPool similarly to the RPC service. Please see http_main.cc for an example. HTTP also provides support for backend monitoring (Varz status page) and for an extensible debugging interface. With monitoring, the C++ backend returns a JSON object that is formatted inside the status page in the browser. To see how it looks, go to localhost:8080 while asio_fibers is running.

Self-profiling

Every HTTP-powered backend has integrated CPU-profiling capabilities using gperf-tools and pprof. Profiling can be triggered in production using magic-URL commands. Enabled profiling usually has very minimal impact on the CPU performance of the running backend.

Logging

Logging is based on Google's glog library. The library is very reliable, performant and solid. It has many features that allow resilient backend development. Unfortunately, Google's version has some bugs, which I fixed (still waiting for review...), so I use my own fork. The glog library gives me the ability to control the logging levels of a backend at run-time without restarting it.

Tests

GAIA uses googletest+gmock unit-test environment.

Conventions

To use abseil code, use #include "absl/...". Third-party packages have a TRDP:: prefix in CMakeLists.txt. absl libraries have the prefix absl_....


gaia's Issues

gaia and microservices on k8s

I'm interested in C++ usage for microservices running on k8s.

I plan to use Istio as service mesh and Minikube for local dev.

If gaia suits microservices well, then I want to contribute an example app, gaia Shop: Cloud-Native Microservices Demo Application

Some questions:

Similar projects:

partial pipeline run

It would be nice if we could rerun the pipeline from the middle by specifying the stage name.
For example, in the word_count example, we could run starting from the group_by stage, assuming that the previous stage has already run.

asio::ssl introduces 2 redundant copies

asio::ssl communicates with SSL via BIO and BIO_new_bio_pair in boost/asio/ssl/detail/impl/engine.ipp. So when SSL writes something into its internal BIO buffer,
stream_core reads it from the external BIO buffer and writes it into its own buffers output_buffer_ and input_buffer_. Only then does it send the data into the underlying socket next_layer_.

The memory copying logic is implemented in boost/asio/ssl/detail/io.hpp in boost::asio::ssl::detail::io function.

What needs to be done:

  1. Extract & refactor the generic asio::ssl code into the gaia codebase (only the synchronous part) and make util::http::SslStream independent of asio::ssl. It should be a self-contained synchronous stream.
  2. Write a test that covers the SSL client code.
  3. Understand how BIO/BIO_pair interact.
  4. Access the BIO directly without copying from it. This should eliminate one level of copying.
  5. Explore the option of adding a new BIO method describing BIO_fiber_socket, where writing to it and reading from it triggers direct communication with the socket.

merge MapperExecutor::GetStats and metric_map_ logic

Currently, GetStats is monitoring-only and is not dumped into the logs.

On the other hand, metric_map_ is customizable and accessible by users, but it is not exposed in varz.

  1. We should move all the metrics provided in GetStats to metric_map_.
  2. GetStats should expose metric_map_.

More helpful docs.

gaia looks great, but needs some docs. Can you provide a bit more helpful documentation on how to use it? Thanks.

Some ideas about docs:

Work stealing mappers

Currently, each mapper executor fiber processes its file fully. This is a simple and efficient approach, but it sometimes creates long-tail latency when some mappers have finished while others are still processing their workload.

  1. Make sure each mapper's incoming queue is short, possibly of length 1.
  2. When mappers finish, they register to take work from other mappers.
  3. Think through how to maintain the "begin_shard/process/end_shard" flow with stealing mappers.

introduce gaia custom error messages

  1. Instead of errc::illegal_byte_sequence in the RPC code, use an rpc::error that carries more detailed meaning.
  2. Scan the util code for more references to system and HTTP errors and see whether gaia errors make more sense.

Stupid question

I have a dumb question, if you don't mind. Which API should I use to check whether we are executing inside a fiber? The code below does not work: it returns a valid fiber id even though it should say not-valid.

#include <boost/fiber/all.hpp>
#include <iostream>

int main() {
  boost::fibers::fiber::id id = boost::this_fiber::get_id();
  std::cout << "fiber id " << id << std::endl;
}

Participate in Hacktoberfest and add more issues labeled Hacktoberfest/good first issue

See https://hacktoberfest.digitalocean.com/details

gaia could be added to resources that curate tasks for beginners:

● Up For Grabs https://up-for-grabs.net/#/

● Issuehub.io http://issuehub.io/

● First Timers Only https://www.firsttimersonly.com/

● Your First PR http://yourfirstpr.github.io/

● Awesome for Beginners https://github.com/mungell/awesome-for-beginners

● Pull Request Roulette http://www.pullrequestroulette.com/

● Code Triage https://www.codetriage.com/

● 24 P[ull] R[equest]s https://24pullrequests.com/

Maybe it is a good idea to add more issues labeled Hacktoberfest/good first issue.

hacktoberfest label https://github.com/search?q=label%3Ahacktoberfest+state%3Aopen&type=Issues
