
Introduction


FATE (Federated AI Technology Enabler) is the world's first industrial grade federated learning open source framework to enable enterprises and institutions to collaborate on data while protecting data security and privacy. It implements secure computation protocols based on homomorphic encryption and multi-party computation (MPC). Supporting various federated learning scenarios, FATE now provides a host of federated learning algorithms, including logistic regression, tree-based algorithms, deep learning and transfer learning.

FATE is an open source project hosted by the Linux Foundation. The Technical Charter sets forth the responsibilities and procedures for technical contribution to, and oversight of, the FATE ("Federated AI Technology Enabler") Project.

https://fate.readthedocs.io/en/latest

Getting Started

FATE can be deployed on a single node or on multiple nodes. Choose the deployment approach that matches your environment. Release versions can be downloaded here.

Version >= 2.0

Standalone deployment

  • Deploying FATE on a single node via PyPI, pre-built Docker images, or installers, for simple testing purposes. Refer to this guide.

Cluster deployment

Deploying FATE to multiple nodes to achieve scalability, reliability and manageability.

Quick Start

More examples

Documentation

FATE Design

  • Architecture: Building Unified and Standardized API for Heterogeneous Computing Engines Interconnection
  • FATE Algorithm Components: Building Standardized Algorithm Components for different Scheduling Engines
  • OSX (Open Site Exchange): Building Open Platform for Cross-Site Communication Interconnection
  • FATE-Flow: Building Open and Standardized Scheduling Platform for Scheduling Interconnection
  • PipeLine Design: Building Scalable Federated DSL for Application Layer Interconnection And Providing Tools For Fast Federated Modeling
  • RoadMap
  • Paper & Conference

Related Repositories (Projects)

  • KubeFATE: An operational tool for the FATE platform using cloud native technologies such as containers and Kubernetes.
  • FATE-Flow: A multi-party secure task scheduling platform for federated learning pipeline.
  • FATE-Board: A suite of visualization tools to explore and understand federated models easily and effectively.
  • FATE-Serving: A high-performance and production-ready serving system for federated learning models.
  • FATE-Cloud: An infrastructure for building and managing industrial-grade federated learning cloud services.
  • EggRoll: A simple high-performance computing framework for (federated) machine learning.
  • AnsibleFATE: A tool to optimize and automate the configuration and deployment operations via Ansible.
  • FATE-Builder: A tool to build package and docker image for FATE and KubeFATE.
  • FATE-Client: A tool to enable fast federated modeling tasks for FATE.
  • FATE-Test: An automated testing tool for FATE, including tests and benchmark comparisons.
  • FATE-LLM : A framework to support federated learning for large language models(LLMs).

Governance

FATE-Community contains all the documents about how community members cooperate with each other.

Getting Involved

Contributing

FATE is an inclusive and open community. We welcome developers who are interested in making FATE better! Contributions of all kinds are welcome. Please refer to the general contributing guideline of all FATE projects and the contributing guideline of each repository.

Mailing list

Join the FATE user mailing list to stay connected with the community and learn the latest news about the FATE project. Discussion of and feedback on the FATE project are welcome.

Bugs or feature requests

File bugs and feature requests via GitHub issues. If you need help, ask questions via the mailing list.

Contact emails

Maintainers: FedAI-maintainers @ groups.io

Security Response Committee: FATE-security @ groups.io

Twitter

Follow us on Twitter @FATEFedAI

FAQ

https://github.com/FederatedAI/FATE/wiki

License

Apache License 2.0


Issues

Add support for feature extraction for federated logistic regression by exploiting DNNs

For now, the federated logistic regression (LR) algorithm only uses structured (i.e., tabular) data, which limits its applications. We may add support for automatic feature engineering to LR so it can deal with various types of inputs, such as text and images.

Neural networks such as RNNs, CNNs and autoencoders are widely used for learning features from text and images. Therefore, we may add these neural networks as local models that parties use to extract features, and then feed the extracted features to LR.

This feature is recommended for FATE v0.3.
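The local-encoder idea above can be sketched with plain numpy (illustrative only; `encode` is a hypothetical stand-in for a trained CNN/RNN/autoencoder, and the LR update shown is a plain non-federated gradient step):

```python
import numpy as np

def encode(X, W_enc):
    """Stand-in local feature extractor (e.g., a trained autoencoder's
    encoder); here just a random projection with a ReLU."""
    return np.maximum(X @ W_enc, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))          # raw party-local inputs
y = rng.integers(0, 2, size=100)        # binary labels
W_enc = rng.normal(size=(32, 8))        # pretend-pretrained encoder weights

H = encode(X, W_enc)                    # extracted dense features
w = np.zeros(H.shape[1])                # LR weights over extracted features
for _ in range(50):                     # plain local gradient descent on LR
    grad = H.T @ (sigmoid(H @ w) - y) / len(y)
    w -= 0.1 * grad
```

In the federated setting, only the LR layer's (encrypted) gradients would be exchanged; the encoder stays local to each party.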

Speed up DNN-based LR

Currently, too much running time is spent on decryption, which accounts for 66% of the running time per iteration.

The cause of this issue is that we currently delegate the whole work of updating the local neural network model (calculating gradients and updating model parameters) to TensorFlow, for the purpose of saving engineering effort and supporting model extensibility. To let TensorFlow automatically update the local model, we need to feed it the plain gradients of all samples; that is, if we have 30,000 samples, we have 30,000 gradients. The bottleneck comes when we decrypt all 30,000 gradients (which is also not safe in terms of data protection), which consumes a lot of time in both computation and communication.

The proposed solution is:

  1. Use eggroll to decrypt gradients in parallel.
  2. Compute the gradients used to update the local model manually (do not use TensorFlow for this part of the work; TensorFlow does not support it for now).
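Item 1 can be sketched as follows (a hypothetical stand-in: a thread pool plays the role of eggroll's parallel map over partitions, and `decrypt` stands in for Paillier decryption):

```python
from concurrent.futures import ThreadPoolExecutor

def decrypt(ciphertext, priv_key):
    """Toy stand-in for Paillier decryption; with python-paillier the
    real call would be priv_key.decrypt(ciphertext)."""
    return ciphertext - priv_key   # pretend "decryption"

def parallel_decrypt(ciphertexts, priv_key, workers=4):
    # In FATE this map would be expressed as an eggroll mapValues over
    # table partitions; a local thread pool stands in for that here.
    # Executor.map preserves input order in its results.
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(lambda c: decrypt(c, priv_key), ciphertexts))

grads = parallel_decrypt([11, 12, 13], priv_key=10)
# grads == [1, 2, 3]
```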

FATE-Board

  • Visualize offline/online task lists, models, etc.
  • Support submitting a task and viewing task status

Tensorflow FL

Google published their version of the TensorFlow FL stack. It seems to be specifically for horizontal FL with a large number of guests, with models wrapped in Keras.

I'm wondering whether the FATE team is also looking into it, and what your thoughts are. Could we even embed it into FATE as part of horizontal FL with deep learning?

Thanks!

Serving support

FATE-Serving

An online service for serving federated learning models.

It should have the following features.

  • Dynamic loading of federated learning models.
  • Real-time prediction using federated learning models.

About the project and contact information

Hello WeBank team, is there contact information for the project members (maintainers)? I am very interested in this open source project and would like to discuss it in detail.

Decentralized FTL

Implement a decentralized encryption scheme for FTL without an arbiter in the loop (refer to the paper).

Docker support


How to interpret the process and performance for guest and host?

I'm trying to interpret the results, understand the training and evaluation process and the loss/accuracy for each of the guest and the host for each round.

For example I run the logistic regression standalone version, and get all the logs: homo_lr_guest.log, homo_lr_host.log and homo_lr_arbiter.log.

Where should I look for the per-round performance of each guest? And for the host model's performance? Is there a good way to interpret the results and get metrics for the whole process (model distribution, guest training performance, model encryption, model merging, host training performance, etc.)?

Thanks!

Training validation in workflow needs to reset flowid

Describe the bug
In the workflow, if the flowid is not reset in the validation stage, the guest will receive a stale federation object, which may raise a bug.

To Reproduce
Steps to reproduce the behavior:
Set totally different ID sets for the train and predict data, and run the examples.

Additional context
If the flowid is reset in the validation stage, it works perfectly.

Storage optimizations mega thread

Is your feature request related to a problem? Please describe.
Storage usage is growing quite fast.

Describe the solution you'd like

  1. Measuring write amplification. (check #47)
  2. Providing a cleanup API. (check #57)
  3. Optimizing existing APIs to provide auto-cleanup options.
  4. Providing an iteration-aware mechanism to reduce duplicate storage usage across epochs.
  5. Implementing a decorator / annotation enabling easy usage of the above mechanisms.
  6. Precise auto cleanup later, depending on dynamic runtime mechanisms.

Describe alternatives you've considered
Precise auto cleanup later depending on dynamic runtime mechanisms.

Additional context
I will fork several sub-threads to track each milestone.
If you have any ideas on storage issues, please reply. Thanks.

Mini-FederatedML Task TestCase

We need to develop a Mini-FederatedML task, which can be seen as a test case for a federated learning task after FATE is deployed.
Through this test case, users can confirm that FATE deployed successfully and that they can run other federated learning tasks.

Merge infrastructure of proxy module with fate-core

Is your feature request related to a problem? Please describe.
No

Describe the solution you'd like
Proxy and other arch components use similar infrastructure but with different implementations. They should be merged into one.

Describe alternatives you've considered
N/A

Additional context
Regression test is required.

How to improve the efficiency of Paillier or other homomorphic encryption algorithms?

I evaluated the computational efficiency of the Paillier implementation in FATE (denoted paillier_fate) and the Paillier implemented in Python (https://github.com/n1analytics/python-paillier , denoted paillier_python); the calculation efficiency of paillier_fate is more than six times higher than that of paillier_python.

The implementations of the two algorithms are similar, so I want to know how to improve the efficiency of Paillier. Are there any other homomorphic encryption libraries or implementation tips that can improve efficiency?
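For context on where the time goes, the textbook scheme is small enough to sketch in pure Python (toy primes, not secure; real implementations gain most of their speed from GMP/gmpy2 big-integer arithmetic, CRT-based decryption, and precomputing the obfuscation factor r^n):

```python
import math
import secrets

def keygen(p, q):
    # Paillier with g = n + 1 (tiny primes here, for illustration only)
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)          # valid choice of mu when g = n + 1
    return (n, n * n), (lam, mu, n)

def encrypt(pub, m):
    n, n2 = pub
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    # c = (1 + n)^m * r^n mod n^2 ; r^n is the expensive part to precompute
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(priv, pub, c):
    lam, mu, n = priv
    _, n2 = pub
    L = (pow(c, lam, n2) - 1) // n        # L(x) = (x - 1) / n
    return (L * mu) % n

pub, priv = keygen(131, 139)              # toy modulus
c1, c2 = encrypt(pub, 42), encrypt(pub, 100)
# additive homomorphism: D(c1 * c2 mod n^2) = m1 + m2
assert decrypt(priv, pub, (c1 * c2) % pub[1]) == 142
```

The `pow(c, lam, n2)` modular exponentiation dominates decryption cost, which is why CRT decryption (working mod p^2 and q^2 separately) and a fast bignum backend make such a large difference between implementations.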

Lightweight ML Model Management module

In practice, we will do a lot of model training experiments and release some models to production. Then we may encounter some problems such as:

  • No record of model history
  • Difficult to reproduce results
  • Can not search for or query models
  • Difficult to collaborate

FATE ModelManager aims to solve these problems. Initially, it provides these features:

  • Versioning
  • Reproducibility
  • Queries, search

In the future,

  • Experiment tracking
  • Collaboration

Add documentation for API

For now, there is a lack of documentation for public APIs (e.g., eggroll APIs and operators such as HeterologisticGradient.fore_gradient()), which makes the whole framework hard to use.

It would be much better to add clear documentation and even some examples for these APIs.

Discussions

This issue is opened specially for discussion. Here, you can give feedback on problems, such as installation problems when installing FATE, or on new features that you think are important.
Finally, we hope you use English to ask your questions.
Thanks.
dylanfan

Add feature selection modules.

Is your feature request related to a problem? Please describe.
Add feature selection methods for federated learning.

Describe the solution you'd like
Add a new workflow for feature selection. Also, provide interfaces for single-side use.

Parallel Execution of Processors in Cluster Mode

Is your feature request related to a problem? Please describe.
All processors of EggRoll run in one Python process, which is pretty slow.

Describe the solution you'd like
Since the GIL problem (feature?) of the PVM cannot be fixed, deploying multiple processors would be an easy solution.

Describe alternatives you've considered
Jython

Is there a method that can get one sample of a DTable?

Is your feature request related to a problem? Please describe.
When we want to initialize a model or do some feature engineering, we might want to get the feature shape first. Thus, getting one instance from a DTable is necessary.

Describe the solution you'd like
Provide an interface in federation.
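One possible shape for such an interface (names assumed, not the actual eggroll API): a `first()` accessor built on the existing `collect()` iterator, shown here against a minimal in-memory stand-in for a DTable:

```python
class DTable:
    """Minimal in-memory stand-in for eggroll's distributed table,
    illustrating what a first() accessor could look like."""
    def __init__(self, items):
        self._items = list(items)

    def collect(self):
        # eggroll's collect() yields (key, value) pairs lazily
        yield from self._items

    def first(self):
        # proposed interface: fetch a single sample, e.g. to infer the
        # feature shape before model initialization, without pulling
        # the whole table to the driver
        return next(iter(self.collect()))

table = DTable(enumerate([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]))
key, sample = table.first()
n_features = len(sample)   # 3
```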

Numeric stability for sigmoid & log_logistic

  1. sigmoid
    if x > 0, sigmoid(x) = 1/(1+exp(-x))
    if x <= 0, sigmoid(x) = exp(x)/(1+exp(x))

  2. log_logistic
    if x > 0, log_logistic(x) = -log(1+exp(-x))
    if x <= 0, log_logistic(x) = x - log(1+exp(x))
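The branches above can be written in vectorized numpy (a sketch; each branch is evaluated only where it is numerically safe, so `exp` never sees a large positive argument, and `log1p` adds accuracy near zero):

```python
import numpy as np

def sigmoid(x):
    # evaluate each branch only on the inputs where it cannot overflow
    out = np.empty_like(x, dtype=float)
    pos = x > 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

def log_logistic(x):
    out = np.empty_like(x, dtype=float)
    pos = x > 0
    out[pos] = -np.log1p(np.exp(-x[pos]))
    out[~pos] = x[~pos] - np.log1p(np.exp(x[~pos]))
    return out

x = np.array([-1000.0, 0.0, 1000.0])
sigmoid(x)        # [0., 0.5, 1.] with no overflow
log_logistic(x)   # [-1000., -log(2), 0.]
```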

SecureBoost online inference


Add Statistic methods

Is your feature request related to a problem? Please describe.
Add statistics methods so that values such as the mean, median, and variance are easy to access. This is useful in feature engineering.

Describe the solution you'd like
Add a statistics module and provide corresponding interfaces.

Storage Optimization 1: Measuring Write Amplification

Is your feature request related to a problem? Please describe.
Referring to #45, this thread tracks the following task:
measuring write amplification.

Describe the solution you'd like
Write different sizes of data across different numbers of keys to measure the amplification scale.

Describe alternatives you've considered

Additional context
We are using LMDB as the storage engine. We will measure different engines to collect data for comparison.
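Whatever the engine, the measurement itself is a ratio of physical to logical bytes; a sketch with a hypothetical append-only file store standing in for LMDB:

```python
import os
import tempfile

def write_amplification(n_keys, value_size):
    """Write n_keys records of value_size bytes each and compare the
    bytes that land on disk with the logical payload size."""
    logical = n_keys * value_size
    with tempfile.NamedTemporaryFile(delete=False) as f:
        for k in range(n_keys):
            # a real engine (e.g. LMDB) adds page, key and metadata
            # overhead; this toy store adds an 8-byte key and a
            # 4-byte length header per record
            f.write(k.to_bytes(8, "big"))
            f.write(value_size.to_bytes(4, "big"))
            f.write(b"\x00" * value_size)
        path = f.name
    physical = os.path.getsize(path)
    os.unlink(path)
    return physical / logical

# sweep value sizes at a fixed key count, as the issue proposes;
# per-record overhead matters most for small values
ratios = {size: write_amplification(1000, size) for size in (16, 256, 4096)}
```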

One-second latency after each transfer event

Note that Proxy.Packet (in arch/networking/proxy/src/main/java/com/webank/ai/fate/networking/proxy/grpc/client/DataTransferPipedClient.java, line 96) does one extra read from a pipe that is already drained, and the read function sets a timeout of 1 second to ensure the pipe has no packets. This may cause a one-second latency at the end of each transfer event.

This can be solved by adding an "isDrained" check before polling packets from the pipe.

I have tried adding a line of code, "if (isDrained()) return result;", between lines 66 and 67 of PacketQueuePipe.java (in arch/networking/proxy/src/main/java/com/webank/ai/fate/networking/proxy/infra/impl/); the training examples work properly and the latency is removed.

Add regression evaluation support

Regression evaluation methods to include:
  1. mean_absolute_error
  2. mean_squared_error
  3. mean_squared_log_error
  4. median_absolute_error
  5. r2_score
  6. root_mean_squared_error

Also fix some formatting issues in the evaluation module.
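The six metrics can be expressed directly in numpy (a sketch; FATE's evaluation module would wrap these behind its own interface, and `mean_squared_log_error` assumes non-negative targets):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "mean_absolute_error": np.mean(np.abs(err)),
        "mean_squared_error": mse,
        # requires y_true, y_pred >= 0
        "mean_squared_log_error": np.mean(
            (np.log1p(y_true) - np.log1p(y_pred)) ** 2),
        "median_absolute_error": np.median(np.abs(err)),
        "r2_score": 1.0 - ss_res / ss_tot,
        "root_mean_squared_error": np.sqrt(mse),
    }

m = regression_metrics([3.0, 5.0, 2.5, 7.0], [2.5, 5.0, 4.0, 8.0])
```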

EggRoll dtable pickling issue in cluster mode

Describe the bug
When pickling a dtable, the following error occurred:

Traceback (most recent call last):
  File "test.py", line 19, in <module>
    test.run()
  File "test.py", line 15, in run
    table = self.data.mapValues(lambda x: self.fun(x))
  File "/data/projects/fate/python/arch/api/cluster/eggroll.py", line 108, in mapValues
    return self.__client.map_values(self, func)
  File "/data/projects/fate/python/arch/api/cluster/eggroll.py", line 290, in map_values
    func_id, func_bytes = self.serialize_and_hash_func(func)
  File "/data/projects/fate/python/arch/api/cluster/eggroll.py", line 195, in serialize_and_hash_func
    pickled_function = cloudpickle.dumps(func)
  File "/data/projects/fate/python/arch/api/utils/cloudpickle.py", line 892, in dumps
    cp.dump(obj)
  File "/data/projects/fate/python/arch/api/utils/cloudpickle.py", line 271, in dump
    return Pickler.dump(self, obj)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 409, in dump
    self.save(obj)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/data/projects/fate/python/arch/api/utils/cloudpickle.py", line 412, in save_function
    self.save_function_tuple(obj)
  File "/data/projects/fate/python/arch/api/utils/cloudpickle.py", line 559, in save_function_tuple
    save(state)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 847, in _batch_setitems
    save(v)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 781, in save_list
    self._batch_appends(obj)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 808, in _batch_appends
    save(tmp[0])
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 521, in save
    self.save_reduce(obj=obj, *rv)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 634, in save_reduce
    save(state)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 852, in _batch_setitems
    save(v)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 521, in save
    self.save_reduce(obj=obj, *rv)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 634, in save_reduce
    save(state)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 847, in _batch_setitems
    save(v)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 521, in save
    self.save_reduce(obj=obj, *rv)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 634, in save_reduce
    save(state)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 847, in _batch_setitems
    save(v)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 521, in save
    self.save_reduce(obj=obj, *rv)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 634, in save_reduce
    save(state)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 847, in _batch_setitems
    save(v)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 496, in save
    rv = reduce(self.proto)
  File "stringsource", line 2, in grpc._cython.cygrpc.Channel.__reduce_cython__
TypeError: no default __reduce__ due to non-trivial __cinit__

To Reproduce
Steps to reproduce the behavior:

from arch.api import eggroll
import functools


eggroll.init("20190404.1425", 1)

class Test(object):
    def __init__(self):
        self.data = eggroll.parallelize(range(1000), include_key=False)
        
    def fun(self, x):
        return x * x

    def run(self):
        table = self.data.mapValues(lambda x: self.fun(x))
        print (table.collect())

test = Test()
test.run()

Desktop (please complete the following information):
Linux

Additional context
Working fine in standalone mode, not working in cluster mode.

Lack of basic dashboard support

Is your feature request related to a problem? Please describe.
It's very hard to observe training progress when there is no dashboard or visualization of the whole process.

Describe the solution you'd like
Something TensorBoard-like, or at least something on the same level as the Spark dashboard.

Describe alternatives you've considered
None.

Support for secret sharing scheme

Is your feature request related to a problem? Please describe.
A secret sharing scheme is a must-have for the FATE project.

Describe the solution you'd like
Do R&D on implementing secret sharing operations, such as:

  1. Creating beaver triples
  2. Addition, multiplication, division, comparison, and others

Once secret sharing operations have been created:

  1. Implement secret-sharing-based LR
  2. Implement secret-sharing-based FTL

This work does not need to be full-fledged for industrial applications. However, it should be able to help us create various secure federated learning algorithms/prototypes.
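The listed primitives can be sketched over a prime field in pure Python (toy, single-process; the dealer producing the beaver triple would in practice be a trusted third party or an OT/HE protocol):

```python
import secrets

P = 2 ** 61 - 1  # prime modulus for additive shares

def share(x):
    """Split x into two additive shares: x = s0 + s1 (mod P)."""
    s0 = secrets.randbelow(P)
    return s0, (x - s0) % P

def reconstruct(s0, s1):
    return (s0 + s1) % P

def beaver_triple():
    # a dealer produces (a, b, c) with c = a * b; simulated locally here
    a, b = secrets.randbelow(P), secrets.randbelow(P)
    return share(a), share(b), share((a * b) % P)

def shared_mul(x_sh, y_sh):
    """Multiply secret-shared x and y using one beaver triple."""
    (a0, a1), (b0, b1), (c0, c1) = beaver_triple()
    # both parties open their shares of d = x - a and e = y - b;
    # d and e reveal nothing about x and y since a and b are random
    d = reconstruct((x_sh[0] - a0) % P, (x_sh[1] - a1) % P)
    e = reconstruct((y_sh[0] - b0) % P, (y_sh[1] - b1) % P)
    # x*y = c + d*b + e*a + d*e, so each party computes its share:
    z0 = (c0 + d * b0 + e * a0 + d * e) % P   # d*e added by one party only
    z1 = (c1 + d * b1 + e * a1) % P
    return z0, z1

x_sh, y_sh = share(6), share(7)
assert reconstruct(*shared_mul(x_sh, y_sh)) == 42
```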

Input data abnormal

Describe the bug
If the input DTable of any component (LR, SecureBoost, feature engineering, etc.) is empty (no keys, or keys whose values are None), what happens?

Toy Example

We need to develop a toy example, which serves as a quickstart example of FATE for users.
A toy example can also act as a test case for a successful deployment.

Optimization of SecureBoost and Quantile Process

  1. Quantile optimization
    Sparse optimization: the quantile process currently costs O(N * max_feature_dimension) time; we will speed it up to O(sum of non-sparse features).

  2. SecureBoost optimization:
    a. Memory optimization: we use a breadth-first-search algorithm to build trees, always using all nodes of one level to build histograms and find splits for the next tree level. We now support specifying the maximum number of nodes processed at a time instead of using all nodes.
    b. Distributed finding of candidate splits.
    c. Speed up the host's federated split-finding process.

Intersection module upgrade

  1. Add an Encode class
  2. RSA intersection will send encrypted intersection results
  3. The RAW intersection's role can be configured
  4. RAW intersection can configure whether to encode for each role

Support for storing tensorflow/keras models

For now, trained models are stored through the eggroll table-save API. This is fine for simple models like LR, but for complex models like CNNs, storing the model into an eggroll table via such an API would be tedious.

It might be better to add a higher-level eggroll API or other storage mechanisms for storing TensorFlow models, so that we can exploit TensorFlow's built-in model save/load APIs.

Storage Optimization 2: Providing a cleanup API

Is your feature request related to a problem? Please describe.
Referring to issue #45, step 2.

Describe the solution you'd like
Providing a cleanup API to enable batch data cleanup

Describe alternatives you've considered
Other steps in #45

Additional context
usage:

eggroll.cleanup(name, namespace, persistent=False)

name: name of the table; '*' is supported as a wildcard.
namespace: exact match of the namespace.
persistent: False to clean up IN_MEMORY tables, True to clean up LMDB persistent tables.

How to configure the proxy route table

There is a route_table.json configuration file under the conf dir,
but I don't know how to configure the guest, host and arbiter. Can anyone give a more detailed route_table.json configuration demo?

ERROR in SendProcessor with limited number of CPUs

We rented multiple machines on Google Cloud for running the FATE project. In the beginning, we used machines with 2 vCPUs. While running the cluster code for hetero_logistic_regression, we got an error from the arbiter when it was sending public keys to the guest and host (console.log in federation):

[ERROR] 2019-03-18T07:11:15,361 [transferJobSchedulerExecutor-2] [SendProcessor:94] - java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:657)
at java.util.ArrayList.get(ArrayList.java:433)
at com.webank.ai.fate.driver.federation.transfer.service.impl.DefaultProxySelectionService.select(DefaultProxySelectionService.java:81)
at com.webank.ai.fate.driver.federation.transfer.communication.processor.SendProcessor.run(SendProcessor.java:70)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

[ERROR] 2019-03-18T07:11:24,396 [transferJobSchedulerExecutor-2] [GrpcChannelFactory:119] - [COMMON][CHANNEL][ERROR] Error getting ManagedChannel after retries
[ERROR] 2019-03-18T07:11:24,397 [transferJobSchedulerExecutor-2] [TransferJobScheduler:127] - [FEDERATION][SCHEDULER] processor failed: transferMetaId: cxz-HeteroLRTransferVariable.paillier_pubkey-HeteroLRTransferVariable.paillier_pubkey.0-2-arbiter-1-guest, exception: java.lang.RuntimeException: should never get here
at com.webank.ai.fate.core.factory.GrpcStubFactory.createGrpcStub(GrpcStubFactory.java:47)
at com.webank.ai.fate.core.factory.GrpcStubFactory.createGrpcStub(GrpcStubFactory.java:56)
at com.webank.ai.fate.core.api.grpc.client.GrpcAsyncClientContext.createStub(GrpcAsyncClientContext.java:207)
at com.webank.ai.fate.core.api.grpc.client.GrpcStreamingClientTemplate.calleeStreamingRpc(GrpcStreamingClientTemplate.java:106)
at com.webank.ai.fate.core.api.grpc.client.GrpcStreamingClientTemplate.calleeStreamingRpcWithImmediateDelayedResult(GrpcStreamingClientTemplate.java:149)
at com.webank.ai.fate.driver.federation.transfer.api.grpc.client.ProxyClient.unaryCall(ProxyClient.java:98)
at com.webank.ai.fate.driver.federation.transfer.api.grpc.client.ProxyClient.requestSendEnd(ProxyClient.java:121)
at com.webank.ai.fate.driver.federation.transfer.communication.processor.SendProcessor.run(SendProcessor.java:98)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

We then tried the same code using machines with 8 vCPUs, and the error could not be reproduced.

Here are the configurations of the two groups of machines:
The former ones: Google Cloud n1-standard-2 machines with 2 vCPUs and 7.5GB RAM

The latter ones: Google Cloud n1-standard-8 machines with 8 vCPUs and 30GB RAM
