
Introduction


FATE (Federated AI Technology Enabler) is the world's first industrial grade federated learning open source framework to enable enterprises and institutions to collaborate on data while protecting data security and privacy. It implements secure computation protocols based on homomorphic encryption and multi-party computation (MPC). Supporting various federated learning scenarios, FATE now provides a host of federated learning algorithms, including logistic regression, tree-based algorithms, deep learning and transfer learning.

FATE is an open source project hosted by the Linux Foundation. The Technical Charter sets forth the responsibilities and procedures for technical contribution to, and oversight of, the FATE ("Federated AI Technology Enabler") Project.

https://fate.readthedocs.io/en/latest

Getting Started

FATE can be deployed on a single node or on multiple nodes. Choose the deployment approach that matches your environment. Release versions can be downloaded here.

Version >= 2.0

Standalone deployment

  • Deploying FATE on a single node via PyPI, pre-built Docker images, or installers, for simple testing purposes. Refer to this guide.

Cluster deployment

Deploying FATE to multiple nodes to achieve scalability, reliability and manageability.

Quick Start

More examples

Documentation

FATE Design

  • Architecture: Building Unified and Standardized API for Heterogeneous Computing Engines Interconnection
  • FATE Algorithm Components: Building Standardized Algorithm Components for different Scheduling Engines
  • OSX (Open Site Exchange): Building Open Platform for Cross-Site Communication Interconnection
  • FATE-Flow: Building Open and Standardized Scheduling Platform for Scheduling Interconnection
  • PipeLine Design: Building Scalable Federated DSL for Application Layer Interconnection And Providing Tools For Fast Federated Modeling
  • RoadMap
  • Paper & Conference

Related Repositories (Projects)

  • KubeFATE: An operational tool for the FATE platform using cloud native technologies such as containers and Kubernetes.
  • FATE-Flow: A multi-party secure task scheduling platform for federated learning pipeline.
  • FATE-Board: A suite of visualization tools to explore and understand federated models easily and effectively.
  • FATE-Serving: A high-performance and production-ready serving system for federated learning models.
  • FATE-Cloud: An infrastructure for building and managing industrial-grade federated learning cloud services.
  • EggRoll: A simple high-performance computing framework for (federated) machine learning.
  • AnsibleFATE: A tool to optimize and automate the configuration and deployment operations via Ansible.
  • FATE-Builder: A tool to build package and docker image for FATE and KubeFATE.
  • FATE-Client: A tool to enable fast federated modeling tasks for FATE.
  • FATE-Test: An automated testing tool for FATE, including tests and benchmark comparisons.
  • FATE-LLM : A framework to support federated learning for large language models(LLMs).

Governance

FATE-Community contains all the documents about how community members cooperate with each other.

Getting Involved

Contributing

FATE is an inclusive and open community. We welcome developers who are interested in making FATE better! Contributions of all kinds are welcome. Please refer to the general contributing guideline of all FATE projects and the contributing guideline of each repository.

Mailing list

Join the FATE user mailing list to stay connected with the community and learn the latest news about the FATE project. Discussion of and feedback on the FATE project are welcome.

Bugs or feature requests

File bugs and feature requests via GitHub issues. If you need help, ask questions via the mailing list.

Contact emails

Maintainers: FedAI-maintainers @ groups.io

Security Response Committee: FATE-security @ groups.io

Twitter

Follow us on Twitter @FATEFedAI

FAQ

https://github.com/FederatedAI/FATE/wiki

License

Apache License 2.0


Issues

Add support for feature extraction for federated logistic regression by exploiting DNNs

For now, the federated logistic regression (LR) algorithm only uses structured (i.e., tabular) data, which limits its applications. We may add support for automatic feature engineering to LR so it can deal with various types of inputs, such as text and images.

Neural networks such as RNNs, CNNs and autoencoders are widely used for learning features from text and images. Therefore, we may add these neural networks as local models that parties use to extract features, and then feed the extracted features to LR.

This feature is recommended for FATE v0.3.
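The local-encoder idea above can be sketched with plain numpy (illustrative only; `encode` is a hypothetical stand-in for a trained CNN/RNN/autoencoder, and the LR update shown is a plain non-federated gradient step):

```python
import numpy as np

def encode(X, W_enc):
    """Stand-in local feature extractor (e.g., a trained autoencoder's
    encoder); here just a random projection with a ReLU."""
    return np.maximum(X @ W_enc, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))          # raw party-local inputs
y = rng.integers(0, 2, size=100)        # binary labels
W_enc = rng.normal(size=(32, 8))        # pretend-pretrained encoder weights

H = encode(X, W_enc)                    # extracted dense features
w = np.zeros(H.shape[1])                # LR weights over extracted features
for _ in range(50):                     # plain local gradient descent on LR
    grad = H.T @ (sigmoid(H @ w) - y) / len(y)
    w -= 0.1 * grad
```

In the federated setting, only the LR layer's (encrypted) gradients would be exchanged; the encoder stays local to each party.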

Speed up DNN-based LR

Currently, too much running time is spent on decryption, which accounts for 66% of the running time per iteration.

The cause of this issue is that we currently delegate the whole work of updating the local neural network model (calculating gradients and updating model parameters) to TensorFlow, for the purpose of saving engineering effort and supporting model extensibility. To let TensorFlow automatically update the local model, we need to feed it the plain gradients of all samples; that is, if we have 30,000 samples, we have 30,000 gradients. The bottleneck comes when we decrypt all 30,000 gradients (which is also not safe in terms of data protection), which consumes a lot of time in both computation and communication.

The proposed solution is:

  1. Use eggroll to decrypt gradients in parallel.
  2. Compute the gradients used to update the local model manually (do not use TensorFlow for this part of the work; TensorFlow does not support it for now).
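Item 1 can be sketched as follows (a hypothetical stand-in: a thread pool plays the role of eggroll's parallel map over partitions, and `decrypt` stands in for Paillier decryption):

```python
from concurrent.futures import ThreadPoolExecutor

def decrypt(ciphertext, priv_key):
    """Toy stand-in for Paillier decryption; with python-paillier the
    real call would be priv_key.decrypt(ciphertext)."""
    return ciphertext - priv_key   # pretend "decryption"

def parallel_decrypt(ciphertexts, priv_key, workers=4):
    # In FATE this map would be expressed as an eggroll mapValues over
    # table partitions; a local thread pool stands in for that here.
    # Executor.map preserves input order in its results.
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(lambda c: decrypt(c, priv_key), ciphertexts))

grads = parallel_decrypt([11, 12, 13], priv_key=10)
# grads == [1, 2, 3]
```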

FATE-Board

  • Visualize offline/online task lists, models, etc.
  • Support submitting a task and viewing task status

Tensorflow FL

Google published their version of the TensorFlow FL stack. It seems to be specifically for horizontal FL with a large number of guests, with models wrapped in Keras.

I'm wondering whether the FATE team is also looking into it, and what your thoughts are. Could we even embed it into FATE as part of horizontal FL with deep learning?

Thanks!

Serving support

FATE-Serving

An online service for serving federated learning models.

It should have the following features.

  • Dynamic loading of federated learning models.
  • Real-time prediction using federated learning models.

About the project and contact information

Hello WeBank team, is there contact information for the project members (maintainers)? I am very interested in this open source project and would like to discuss it in detail.

Decentralized FTL

Implement a decentralized encryption scheme for FTL without an arbiter in the loop (refer to the paper).

Docker support


How to interpret the process and performance for guest and host?

I'm trying to interpret the results, understand the training and evaluation process and the loss/accuracy for each of the guest and the host for each round.

For example I run the logistic regression standalone version, and get all the logs: homo_lr_guest.log, homo_lr_host.log and homo_lr_arbiter.log.

Where should I look for the per-round performance of each guest? And for the host model's performance? Is there a good way to interpret the results and get metrics for the whole process (model distribution, guest training performance, model encryption, model merging, host training performance, etc.)?

Thanks!

Training validation in workflow needs to reset flowid

Describe the bug
In the workflow, if the flowid is not reset in the validation stage, the guest will receive a stale federation object, which may raise a bug.

To Reproduce
Steps to reproduce the behavior:
Set totally different ID sets for the train and predict data, and run the examples.

Additional context
If the flowid is reset in the validation stage, it works perfectly.

Storage optimizations mega thread

Is your feature request related to a problem? Please describe.
Storage usage is growing quite fast.

Describe the solution you'd like

  1. Measuring write amplification. (check #47)
  2. Providing a cleanup API. (check #57)
  3. Optimizing existing APIs to provide auto-cleanup options.
  4. Providing an iteration-aware mechanism to reduce duplicate storage usage across epochs.
  5. Implementing a decorator / annotation enabling easy usage of the above mechanisms.
  6. Precise auto cleanup later, depending on dynamic runtime mechanisms.

Describe alternatives you've considered
Precise auto cleanup later depending on dynamic runtime mechanisms.

Additional context
I will fork several sub-threads to track each milestone.
If you have any ideas on storage issues, please reply. Thanks.

Mini-FederatedML Task TestCase

We need to develop a Mini-FederatedML task, which can be seen as a test case for a federated learning task after FATE is deployed.
Through this test case, users can confirm that FATE deployed successfully and that they can run other federated learning tasks.

Merge infrastructure of proxy module with fate-core

Is your feature request related to a problem? Please describe.
No

Describe the solution you'd like
Proxy and other arch components use similar infrastructure but with different implementations. They should be merged into one.

Describe alternatives you've considered
N/A

Additional context
Regression test is required.

How to improve the efficiency of Paillier or other homomorphic encryption algorithms?

I evaluated the computational efficiency of the Paillier implementation in FATE (denoted paillier_fate) and the Paillier implemented in Python (https://github.com/n1analytics/python-paillier , denoted paillier_python); the calculation efficiency of paillier_fate is more than six times higher than that of paillier_python.

The implementations of the two algorithms are similar, so I want to know how to improve the efficiency of Paillier. Are there any other homomorphic encryption libraries or implementation tips that can improve efficiency?
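For context on where the time goes, the textbook scheme is small enough to sketch in pure Python (toy primes, not secure; real implementations gain most of their speed from GMP/gmpy2 big-integer arithmetic, CRT-based decryption, and precomputing the obfuscation factor r^n):

```python
import math
import secrets

def keygen(p, q):
    # Paillier with g = n + 1 (tiny primes here, for illustration only)
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)          # valid choice of mu when g = n + 1
    return (n, n * n), (lam, mu, n)

def encrypt(pub, m):
    n, n2 = pub
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    # c = (1 + n)^m * r^n mod n^2 ; r^n is the expensive part to precompute
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(priv, pub, c):
    lam, mu, n = priv
    _, n2 = pub
    L = (pow(c, lam, n2) - 1) // n        # L(x) = (x - 1) / n
    return (L * mu) % n

pub, priv = keygen(131, 139)              # toy modulus
c1, c2 = encrypt(pub, 42), encrypt(pub, 100)
# additive homomorphism: D(c1 * c2 mod n^2) = m1 + m2
assert decrypt(priv, pub, (c1 * c2) % pub[1]) == 142
```

The `pow(c, lam, n2)` modular exponentiation dominates decryption cost, which is why CRT decryption (working mod p^2 and q^2 separately) and a fast bignum backend make such a large difference between implementations.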

Lightweight ML Model Management module

In practice, we will do a lot of model training experiments and release some models to production. Then we may encounter some problems such as:

  • No record of model history
  • Difficult to reproduce results
  • Can not search for or query models
  • Difficult to collaborate

FATE ModelManager aims to solve these problems. Initially, it provides these features:

  • Versioning
  • Reproducibility
  • Queries, search

In the future,

  • Experiment tracking
  • Collaboration

Add documentation for API

For now, there is a lack of documentation for public APIs (e.g., eggroll APIs and operators such as HeterologisticGradient.fore_gradient()), which makes the whole framework hard to use.

It would be much better to add clear documentation and even some examples for these APIs.

Discussions

This issue is opened specially for discussion. Here, you can give feedback on problems, such as installation problems when installing FATE, or on new features that you think are important.
Finally, we hope you use English to ask your questions.
Thanks.
dylanfan

Add feature selection modules.

Is your feature request related to a problem? Please describe.
Add feature selection methods for federated learning.

Describe the solution you'd like
Add a new workflow for feature selection. Also, provide interfaces for single-side use.

Parallel Execution of Processors in Cluster Mode

Is your feature request related to a problem? Please describe.
All processors of EggRoll run in one Python process, which is pretty slow.

Describe the solution you'd like
Since the GIL problem (feature?) of the PVM cannot be fixed, deploying multiple processors would be an easy solution.

Describe alternatives you've considered
Jython

Is there a method that can get one sample of a DTable?

Is your feature request related to a problem? Please describe.
When we want to initialize a model or do some feature engineering, we might want to get the feature shape first. Thus, getting one instance from a DTable is necessary.

Describe the solution you'd like
Provide an interface in federation.
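One possible shape for such an interface (names assumed, not the actual eggroll API): a `first()` accessor built on the existing `collect()` iterator, shown here against a minimal in-memory stand-in for a DTable:

```python
class DTable:
    """Minimal in-memory stand-in for eggroll's distributed table,
    illustrating what a first() accessor could look like."""
    def __init__(self, items):
        self._items = list(items)

    def collect(self):
        # eggroll's collect() yields (key, value) pairs lazily
        yield from self._items

    def first(self):
        # proposed interface: fetch a single sample, e.g. to infer the
        # feature shape before model initialization, without pulling
        # the whole table to the driver
        return next(iter(self.collect()))

table = DTable(enumerate([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]))
key, sample = table.first()
n_features = len(sample)   # 3
```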

Numeric stability for sigmoid & log_logistic

  1. sigmoid
    if x > 0, sigmoid(x) = 1/(1+exp(-x))
    if x <= 0, sigmoid(x) = exp(x)/(1+exp(x))

  2. log_logistic
    if x > 0, log_logistic(x) = -log(1+exp(-x))
    if x <= 0, log_logistic(x) = x - log(1+exp(x))
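The branches above can be written in vectorized numpy (a sketch; each branch is evaluated only where it is numerically safe, so `exp` never sees a large positive argument, and `log1p` adds accuracy near zero):

```python
import numpy as np

def sigmoid(x):
    # evaluate each branch only on the inputs where it cannot overflow
    out = np.empty_like(x, dtype=float)
    pos = x > 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

def log_logistic(x):
    out = np.empty_like(x, dtype=float)
    pos = x > 0
    out[pos] = -np.log1p(np.exp(-x[pos]))
    out[~pos] = x[~pos] - np.log1p(np.exp(x[~pos]))
    return out

x = np.array([-1000.0, 0.0, 1000.0])
sigmoid(x)        # [0., 0.5, 1.] with no overflow
log_logistic(x)   # [-1000., -log(2), 0.]
```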

SecureBoost online inference


Add Statistic methods

Is your feature request related to a problem? Please describe.
Add statistics methods so that values such as the mean, median, and variance are easy to access. This is useful in feature engineering.

Describe the solution you'd like
Add a statistics module and provide corresponding interfaces.

Storage Optimization 1: Measuring Write Amplification

Is your feature request related to a problem? Please describe.
Referring to #45, this thread tracks the following task:
measuring write amplification.

Describe the solution you'd like
Write different sizes of data across different numbers of keys to measure the amplification scale.

Describe alternatives you've considered

Additional context
We are using LMDB as the storage engine. We will measure different engines to collect data for comparison.
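Whatever the engine, the measurement itself is a ratio of physical to logical bytes; a sketch with a hypothetical append-only file store standing in for LMDB:

```python
import os
import tempfile

def write_amplification(n_keys, value_size):
    """Write n_keys records of value_size bytes each and compare the
    bytes that land on disk with the logical payload size."""
    logical = n_keys * value_size
    with tempfile.NamedTemporaryFile(delete=False) as f:
        for k in range(n_keys):
            # a real engine (e.g. LMDB) adds page, key and metadata
            # overhead; this toy store adds an 8-byte key and a
            # 4-byte length header per record
            f.write(k.to_bytes(8, "big"))
            f.write(value_size.to_bytes(4, "big"))
            f.write(b"\x00" * value_size)
        path = f.name
    physical = os.path.getsize(path)
    os.unlink(path)
    return physical / logical

# sweep value sizes at a fixed key count, as the issue proposes;
# per-record overhead matters most for small values
ratios = {size: write_amplification(1000, size) for size in (16, 256, 4096)}
```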

One-second latency after each transfer event

Note that Proxy.Packet (in arch/networking/proxy/src/main/java/com/webank/ai/fate/networking/proxy/grpc/client/DataTransferPipedClient.java, line 96) does one extra read from a pipe that is already drained, and the read function sets a timeout of 1 second to ensure the pipe has no packets. This may cause a one-second latency at the end of each transfer event.

This can be solved by adding an "isDrained" check before polling packets from the pipe.

I have tried adding a line of code, "if (isDrained()) return result;", between lines 66 and 67 of PacketQueuePipe.java (in arch/networking/proxy/src/main/java/com/webank/ai/fate/networking/proxy/infra/impl/); the training examples work properly and the latency is removed.

Add regression evaluation support

Regression evaluation methods to include:
  1. mean_absolute_error
  2. mean_squared_error
  3. mean_squared_log_error
  4. median_absolute_error
  5. r2_score
  6. root_mean_squared_error

Also fix some formatting issues in the evaluation module.
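The six metrics can be expressed directly in numpy (a sketch; FATE's evaluation module would wrap these behind its own interface, and `mean_squared_log_error` assumes non-negative targets):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "mean_absolute_error": np.mean(np.abs(err)),
        "mean_squared_error": mse,
        # requires y_true, y_pred >= 0
        "mean_squared_log_error": np.mean(
            (np.log1p(y_true) - np.log1p(y_pred)) ** 2),
        "median_absolute_error": np.median(np.abs(err)),
        "r2_score": 1.0 - ss_res / ss_tot,
        "root_mean_squared_error": np.sqrt(mse),
    }

m = regression_metrics([3.0, 5.0, 2.5, 7.0], [2.5, 5.0, 4.0, 8.0])
```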

EggRoll dtable pickling issue in cluster mode

Describe the bug
When pickling a dtable, the following error occurred:

Traceback (most recent call last):
  File "test.py", line 19, in <module>
    test.run()
  File "test.py", line 15, in run
    table = self.data.mapValues(lambda x: self.fun(x))
  File "/data/projects/fate/python/arch/api/cluster/eggroll.py", line 108, in mapValues
    return self.__client.map_values(self, func)
  File "/data/projects/fate/python/arch/api/cluster/eggroll.py", line 290, in map_values
    func_id, func_bytes = self.serialize_and_hash_func(func)
  File "/data/projects/fate/python/arch/api/cluster/eggroll.py", line 195, in serialize_and_hash_func
    pickled_function = cloudpickle.dumps(func)
  File "/data/projects/fate/python/arch/api/utils/cloudpickle.py", line 892, in dumps
    cp.dump(obj)
  File "/data/projects/fate/python/arch/api/utils/cloudpickle.py", line 271, in dump
    return Pickler.dump(self, obj)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 409, in dump
    self.save(obj)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/data/projects/fate/python/arch/api/utils/cloudpickle.py", line 412, in save_function
    self.save_function_tuple(obj)
  File "/data/projects/fate/python/arch/api/utils/cloudpickle.py", line 559, in save_function_tuple
    save(state)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 847, in _batch_setitems
    save(v)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 781, in save_list
    self._batch_appends(obj)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 808, in _batch_appends
    save(tmp[0])
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 521, in save
    self.save_reduce(obj=obj, *rv)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 634, in save_reduce
    save(state)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 852, in _batch_setitems
    save(v)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 521, in save
    self.save_reduce(obj=obj, *rv)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 634, in save_reduce
    save(state)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 847, in _batch_setitems
    save(v)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 521, in save
    self.save_reduce(obj=obj, *rv)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 634, in save_reduce
    save(state)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 847, in _batch_setitems
    save(v)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 521, in save
    self.save_reduce(obj=obj, *rv)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 634, in save_reduce
    save(state)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 847, in _batch_setitems
    save(v)
  File "/data/projects/common/miniconda3/lib/python3.6/pickle.py", line 496, in save
    rv = reduce(self.proto)
  File "stringsource", line 2, in grpc._cython.cygrpc.Channel.__reduce_cython__
TypeError: no default __reduce__ due to non-trivial __cinit__

To Reproduce
Steps to reproduce the behavior:

from arch.api import eggroll
import functools


eggroll.init("20190404.1425", 1)

class Test(object):
    def __init__(self):
        self.data = eggroll.parallelize(range(1000), include_key=False)
        
    def fun(self, x):
        return x * x

    def run(self):
        table = self.data.mapValues(lambda x: self.fun(x))
        print (table.collect())

test = Test()
test.run()

Desktop (please complete the following information):
Linux

Additional context
Working fine in standalone mode, not working in cluster mode.

Lack of basic dashboard support

Is your feature request related to a problem? Please describe.
It's very hard to observe training progress when there is no dashboard or visualization of the whole process.

Describe the solution you'd like
Something TensorBoard-like, or at least something on the same level as the Spark dashboard.

Describe alternatives you've considered
None.

Support for secret sharing scheme

Is your feature request related to a problem? Please describe.
A secret sharing scheme is a must-have for the FATE project.

Describe the solution you'd like
Do R&D on implementing secret sharing operations, such as:

  1. Creating beaver triples
  2. Addition, multiplication, division, comparison, and others

Once secret sharing operations have been created:

  1. Implement secret-sharing-based LR
  2. Implement secret-sharing-based FTL

This work does not need to be full-fledged for industrial applications. However, it should be able to help us create various secure federated learning algorithms/prototypes.
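The listed primitives can be sketched over a prime field in pure Python (toy, single-process; the dealer producing the beaver triple would in practice be a trusted third party or an OT/HE protocol):

```python
import secrets

P = 2 ** 61 - 1  # prime modulus for additive shares

def share(x):
    """Split x into two additive shares: x = s0 + s1 (mod P)."""
    s0 = secrets.randbelow(P)
    return s0, (x - s0) % P

def reconstruct(s0, s1):
    return (s0 + s1) % P

def beaver_triple():
    # a dealer produces (a, b, c) with c = a * b; simulated locally here
    a, b = secrets.randbelow(P), secrets.randbelow(P)
    return share(a), share(b), share((a * b) % P)

def shared_mul(x_sh, y_sh):
    """Multiply secret-shared x and y using one beaver triple."""
    (a0, a1), (b0, b1), (c0, c1) = beaver_triple()
    # both parties open their shares of d = x - a and e = y - b;
    # d and e reveal nothing about x and y since a and b are random
    d = reconstruct((x_sh[0] - a0) % P, (x_sh[1] - a1) % P)
    e = reconstruct((y_sh[0] - b0) % P, (y_sh[1] - b1) % P)
    # x*y = c + d*b + e*a + d*e, so each party computes its share:
    z0 = (c0 + d * b0 + e * a0 + d * e) % P   # d*e added by one party only
    z1 = (c1 + d * b1 + e * a1) % P
    return z0, z1

x_sh, y_sh = share(6), share(7)
assert reconstruct(*shared_mul(x_sh, y_sh)) == 42
```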

Input data abnormal

Describe the bug
If the input DTable of any component (LR, SecureBoost, feature engineering, etc.) is empty (no keys, or keys whose values are None), what happens?

Toy Example

We need to develop a toy example, which serves as a quickstart example of FATE for users.
A toy example can also act as a test case for a successful deployment.

Optimization of SecureBoost and Quantile Process

  1. Quantile optimization
    Sparse optimization: the quantile process currently costs O(N * max_feature_dimension) time; we will speed it up to O(sum of non-sparse features).

  2. SecureBoost optimization:
    a. Memory optimization: we use a breadth-first-search algorithm to build trees, always using all nodes of one level to build histograms and find splits for the next tree level. We now support specifying the maximum number of nodes processed at a time instead of using all nodes.
    b. Distributed finding of candidate splits.
    c. Speed up the host's federated split-finding process.

Intersection module upgrade

  1. Add an Encode class
  2. RSA intersection will send encrypted intersection results
  3. The RAW intersection's role can be configured
  4. RAW intersection can configure whether to encode for each role

Support for storing tensorflow/keras models

For now, trained models are stored through the eggroll table-save API. This is fine for simple models like LR, but for complex models like CNNs, storing the model into an eggroll table via such an API would be tedious.

It might be better to add a higher-level eggroll API or other storage mechanisms for storing TensorFlow models, so that we can exploit TensorFlow's built-in model save/load APIs.

Storage Optimization 2: Providing a cleanup API

Is your feature request related to a problem? Please describe.
Referring to issue #45, step 2.

Describe the solution you'd like
Providing a cleanup API to enable batch data cleanup

Describe alternatives you've considered
Other steps in #45

Additional context
usage:

eggroll.cleanup(name, namespace, persistent=False)

name: name of the table; '*' is supported as a wildcard.
namespace: exact match of the namespace.
persistent: False to clean up IN_MEMORY tables, True to clean up LMDB persistent tables.

How to configure the proxy route table

There is a route_table.json configuration file under the conf dir,
but I don't know how to configure the guest, host and arbiter. Can anyone give a more detailed route_table.json configuration demo?

ERROR in SendProcessor with limited number of CPUs

We rented multiple machines on Google Cloud for running the FATE project. In the beginning, we used machines with 2 vCPUs. While running the cluster code for hetero_logistic_regression, we got an error from the arbiter when it was sending public keys to the guest and host (console.log in federation):

[ERROR] 2019-03-18T07:11:15,361 [transferJobSchedulerExecutor-2] [SendProcessor:94] - java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:657)
at java.util.ArrayList.get(ArrayList.java:433)
at com.webank.ai.fate.driver.federation.transfer.service.impl.DefaultProxySelectionService.select(DefaultProxySelectionService.java:81)
at com.webank.ai.fate.driver.federation.transfer.communication.processor.SendProcessor.run(SendProcessor.java:70)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

[ERROR] 2019-03-18T07:11:24,396 [transferJobSchedulerExecutor-2] [GrpcChannelFactory:119] - [COMMON][CHANNEL][ERROR] Error getting ManagedChannel after retries
[ERROR] 2019-03-18T07:11:24,397 [transferJobSchedulerExecutor-2] [TransferJobScheduler:127] - [FEDERATION][SCHEDULER] processor failed: transferMetaId: cxz-HeteroLRTransferVariable.paillier_pubkey-HeteroLRTransferVariable.paillier_pubkey.0-2-arbiter-1-guest, exception: java.lang.RuntimeException: should never get here
at com.webank.ai.fate.core.factory.GrpcStubFactory.createGrpcStub(GrpcStubFactory.java:47)
at com.webank.ai.fate.core.factory.GrpcStubFactory.createGrpcStub(GrpcStubFactory.java:56)
at com.webank.ai.fate.core.api.grpc.client.GrpcAsyncClientContext.createStub(GrpcAsyncClientContext.java:207)
at com.webank.ai.fate.core.api.grpc.client.GrpcStreamingClientTemplate.calleeStreamingRpc(GrpcStreamingClientTemplate.java:106)
at com.webank.ai.fate.core.api.grpc.client.GrpcStreamingClientTemplate.calleeStreamingRpcWithImmediateDelayedResult(GrpcStreamingClientTemplate.java:149)
at com.webank.ai.fate.driver.federation.transfer.api.grpc.client.ProxyClient.unaryCall(ProxyClient.java:98)
at com.webank.ai.fate.driver.federation.transfer.api.grpc.client.ProxyClient.requestSendEnd(ProxyClient.java:121)
at com.webank.ai.fate.driver.federation.transfer.communication.processor.SendProcessor.run(SendProcessor.java:98)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

We then tried the same code using machines with 8 vCPUs, and the error could not be reproduced.

Here are the configurations of the two groups of machines:
The former ones: Google Cloud n1-standard-2 machines with 2 vCPUs and 7.5GB RAM

The latter ones: Google Cloud n1-standard-8 machines with 8 vCPUs and 30GB RAM
