
scaleoutsystems / fedn

FEDn: An enterprise-ready federated learning framework. This repository contains the Python framework, CLI and API.

Home Page: https://docs.scaleoutsystems.com

License: Apache License 2.0

Dockerfile 0.41% Python 98.92% Shell 0.57% Smarty 0.10%
federated-machine-learning fedml federated-learning edge-ml fleet-learning keras-tensorflow pytorch scikit-learn tensorflow edge-ai

fedn's Introduction


FEDn: An enterprise-ready federated learning framework

Our goal is to provide a federated learning framework that is secure, scalable, and easy to use. We believe that minimal code changes should be needed to progress from early proof-of-concepts to production. This is reflected in our core design:

  • Minimal server-side complexity for the end-user. Running a proper distributed FL deployment is hard. With FEDn Studio we seek to handle all server-side complexity and provide a UI, REST API and a Python interface to help users manage FL experiments and track metrics in real time.
  • Secure by design. FL clients do not need to open any ingress ports. Industry-standard communication protocols (gRPC), token-based authentication, and RBAC (JSON Web Tokens) provide flexible integration in a range of production environments.
  • ML-framework agnostic. A black-box client-side architecture lets data scientists interface with their framework of choice.
  • Cloud native. By following cloud native design principles, we ensure a wide range of deployment options including private cloud and on-premise infrastructure.
  • Scalability and resilience. Multiple aggregation servers (combiners) can share the workload. FEDn seamlessly recovers from failures in all critical components and manages intermittent client connections.
  • Developer and DevOps friendly. Extensive event logging and distributed tracing enable developers to monitor the system in real time, simplifying troubleshooting and auditing. Extensions and integrations are facilitated by a flexible plug-in architecture.

FEDn is free forever for academic and personal use / small projects. Sign up for a FEDn Studio account and take the Quickstart tutorial to get started with FEDn.

Features

Federated learning:

  • Tiered federated learning architecture enabling massive scalability and resilience.
  • Support for any ML framework (examples for PyTorch, TensorFlow/Keras and Scikit-learn)
  • Extendable via a plug-in architecture (aggregators, load balancers, object storage backends, databases etc.)
  • Built-in federated algorithms (FedAvg, FedAdam, FedYogi, FedAdaGrad, etc.)
  • UI, CLI and Python API.
  • Implement clients in any language (Python, C++, Kotlin etc.)
  • No open ports needed client-side.
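At the aggregation step, the built-in algorithms above come down to weighted parameter averaging. A minimal numpy sketch of FedAvg (illustrative only, not FEDn's actual aggregator code):

```python
import numpy as np

def fedavg(client_updates):
    """Illustrative FedAvg: average client parameters weighted by the
    number of local training examples each client used.

    client_updates: list of (params, n_examples), where params is a list
    of numpy arrays (one per model layer).
    """
    total = sum(n for _, n in client_updates)
    # Accumulate into zero arrays shaped like the first client's parameters.
    avg = [np.zeros_like(p) for p in client_updates[0][0]]
    for params, n in client_updates:
        for i, layer in enumerate(params):
            avg[i] += layer * (n / total)
    return avg
```

For example, two clients holding 1 and 3 examples with layer values [1, 1] and [3, 3] average to [2.5, 2.5].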

From development to FL in production:

  • Secure deployment of server-side / control-plane on Kubernetes.
  • UI with dashboards for orchestrating FL experiments and for visualizing results.
  • Team features - collaborate with other users in shared project workspaces.
  • Features for the trusted-third party: Manage access to the FL network, FL clients and training progress.
  • REST API for handling experiments/jobs.
  • View and export logging and tracing information.
  • Public cloud, dedicated cloud and on-premise deployment options.

Available client APIs:

Getting started

Get started with FEDn in two steps:

  1. Register for a FEDn Studio account
  2. Take the Quickstart tutorial

Use of our multi-tenant, managed deployment of FEDn Studio (SaaS) is free forever for academic research and personal development/testing purposes. For users and teams requiring additional resources, more storage and CPU, dedicated support, and other hosting options (private cloud, on-premise), explore our plans.

Documentation

More details about the architecture, deployment, and how to develop your own application and framework extensions are found in the documentation:

FEDn Project Examples

Our example projects demonstrate different use case scenarios of FEDn and its integration with popular machine learning frameworks like PyTorch and TensorFlow.

FEDn Studio Deployment options

Several hosting options are available to suit different project settings.

  • Public cloud (multi-tenant): Managed multi-tenant deployment in public cloud.
  • Dedicated cloud (single-tenant): Managed, dedicated deployment in a cloud region of your choice (AWS, GCP, Azure, managed Kubernetes)
  • Self-managed: Set up a self-managed deployment in your VPC or on-premise Kubernetes cluster using the Helm chart and container images provided by Scaleout.

Contact the Scaleout team for information.

Support

Community support is available in our Discord server.

Options are available for Dedicated/custom support.

Making contributions

All pull requests will be considered and are much appreciated. For more details please refer to our contribution guidelines.

Citation

If you use FEDn in your research, please cite:

@article{ekmefjord2021scalable,
  title={Scalable federated machine learning with FEDn},
  author={Ekmefjord, Morgan and Ait-Mlouk, Addi and Alawadi, Sadi and {\AA}kesson, Mattias and Stoyanova, Desislava and Spjuth, Ola and Toor, Salman and Hellander, Andreas},
  journal={arXiv preprint arXiv:2103.00148},
  year={2021}
}

License

FEDn is licensed under Apache-2.0 (see LICENSE file for full information).

Use of FEDn Studio is subject to the Terms of Use.

fedn's People

Contributors

ahellander, aitmlouk, benjaminastrand, carmat88, danielzak, dependabot[bot], dstoyanova, eriks-aidotse, etesami, frankjonasmoelle, jxlle, kathellg, li-ju666, mattiasakesson, mcapuccini, morganekmefjord, niklastheman, olas, prasi372, sowmyasris, stefanhellander, sztoor, viktorvaladi, wrede


fedn's Issues

Example in Getting Started: keras-client does not find package KerasSequentialHelper

I am trying to follow the example given in the readme, but get an error when I come to the section "Attach two Clients to the FEDn network". I get the following message when the keras-clients are starting
client_1 | Traceback (most recent call last):
client_1 | File "train.py", line 54, in
client_1 | from fedn.utils.kerassequential import KerasSequentialHelper
client_1 | ModuleNotFoundError: No module named 'fedn.utils.kerassequential'
client_2 | Traceback (most recent call last):
client_2 | File "train.py", line 54, in
client_2 | from fedn.utils.kerassequential import KerasSequentialHelper
client_2 | ModuleNotFoundError: No module named 'fedn.utils.kerassequential'
Current directory is "test/mnist-keras". As far as I can see there is no KerasSequentialHelper, but a KerasHelper that should be called instead.
I use Linux Mint 18.3.

Deprecated key & dynamic port warnings when starting minio service

Severity
Low (select if the issue doesn't significantly affect the user experience, like minor visual bugs)

Describe the bug
Got the following warnings when starting the minio service. Version 0.2.4

minio_1 | WARNING: MINIO_ACCESS_KEY and MINIO_SECRET_KEY are deprecated.
minio_1 | Please use MINIO_ROOT_USER and MINIO_ROOT_PASSWORD
minio_1 | API: http://172.19.0.4:9000 http://127.0.0.1:9000
minio_1 |
minio_1 | Console: http://172.19.0.4:46547 http://127.0.0.1:46547
minio_1 |
minio_1 | Documentation: https://docs.min.io
minio_1 |
minio_1 | WARNING: Console endpoint is listening on a dynamic port (46547), please use --console-address ":PORT" to choose a static port.
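Based on the warnings, both issues can be addressed in the compose file by switching to the non-deprecated credential variables and pinning the console to a static port. A sketch (service name, image and credential values are placeholders to adapt to your own compose file):

```yaml
# docker-compose fragment (illustrative)
minio:
  image: minio/minio
  command: server /data --console-address ":9001"  # static console port
  environment:
    - MINIO_ROOT_USER=fedn_admin      # replaces deprecated MINIO_ACCESS_KEY
    - MINIO_ROOT_PASSWORD=change_me   # replaces deprecated MINIO_SECRET_KEY
```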

Download client config

Is your feature request related to a problem? Please describe.
User can download client configuration conveniently via CLI/SDK and GUI.

Describe the solution you'd like
When implemented, this will provide a way to download a fedn-network.yaml configuration file specifying where the client should connect.

Client with same name leads to communication errors on combiner level

Is your feature request related to a problem? Please describe.

The client name is currently used as the unique id in the combiner queues, so clients with the same name lead to communication errors.

Describe the solution you'd like
One of two options:

  1. Enforce a globally unique name for clients, i.e. reject assignment if the name is already in the db.
  2. Generate a unique identifier when assigning the client and use that for routing communication.
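Option 2 can be sketched with the standard library's uuid module (illustrative only, not FEDn's actual assignment logic; the registry dict stands in for the database):

```python
import uuid

def assign_client(name, registry):
    """Register a client under a generated unique id, keeping the
    human-readable name as metadata only (illustrative sketch)."""
    client_id = uuid.uuid4().hex  # globally unique routing key
    registry[client_id] = {"name": name}
    return client_id
```

With this scheme, two clients that both call themselves "alice" get distinct routing keys and no longer collide in the combiner queues.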

Sometimes validations are plotted in the wrong order

Severity

  • Critical/Blocker (select if the issue makes the application unusable or causes serious data loss)
  • High (select if the issue affects a major feature and there is no workaround or the available workaround is very complex)
  • Medium (select if the issue affects a minor feature or affects a major feature but has an easy enough workaround to not cause any major inconvenience)
  • Low (select if the issue doesn't significantly affect the user experience, like minor visual bugs)

Describe the bug
Sometimes the validation boxplot "folds upon itself" - validations are plotted in the wrong order on the x-axis. Probably there is a problem sorting the validations by timestamp.

Expected behavior
A plot where all validations appear in chronological order.

Screenshots

Contact Details
An e-mail address in case we need to contact you for further details regarding this issue.

Uploading new seed models unavailable

Severity

  • Critical/Blocker (select if the issue makes the application unusable or causes serious data loss)
  • High (select if the issue affects a major feature and there is no workaround or the available workaround is very complex)
  • Medium (select if the issue affects a minor feature or affects a major feature but has an easy enough workaround to not cause any major inconvenience)
  • Low (select if the issue doesn't significantly affect the user experience, like minor visual bugs)

Describe the bug
There doesn't seem to be a concise way of uploading a new model when training on new examples. For example, once the data-center model is uploaded, I cannot re-upload any other model even when pushing the mnist model to minio. The command line still displays that the combiner is attempting to pass on the data-center model. Not sure if I am missing something or it is a bug.

Environment:

  • OS: [windows]
  • Version: [10]
  • Browser [mozilla]
  • Version [83.0]
  • Tensorflow 2.3.1

Reproduction Steps
Steps to reproduce the behavior:

  1. Run first example as data-center
  2. Upload the data-center seed model
  3. Try to run mnist example
  4. No option to upload model on https://localhost:8090/seed
  5. Try to upload file to minio
  6. Try to run mnist example again

Expected behavior
Running new example should use appropriate models that are available

Screenshots

image

Contact Details
[email protected]

Complete CLI start command

Is your feature request related to a problem? Please describe.

"fedn control start" should start rounds from the CLI.

Controller/reducer disconnects during long training processes (but still works in the background): access to the UI is lost

Severity

  • Critical/Blocker (select if the issue makes the application unusable or causes serious data loss)
  • High (select if the issue affects a major feature and there is no workaround or the available workaround is very complex)
  • Medium (select if the issue affects a minor feature or affects a major feature but has an easy enough workaround to not cause any major inconvenience)
  • Low (select if the issue doesn't significantly affect the user experience, like minor visual bugs)

Describe the bug
A clear and concise description of what the bug is.

Environment:

  • OS: [e.g. mac OS]
  • Version: [e.g. macOS Catalina]
  • Browser [e.g. chrome, safari]
  • Version [e.g. 22]

Reproduction Steps
Steps to reproduce the behavior:

  1. Start a long training process (NLP) ~ 1 day
  2. Try to access the UI

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Contact Details
An e-mail address in case we need to contact you for further details regarding this issue.

Support click 8.0.0

Is your feature request related to a problem? Please describe.
Currently we are not compatible with click 8.0.0.

Describe the solution you'd like
Support click 8.0.0

Remove the built-in SSL cert in favor of k8s ingress-controller

Background:

Right now the SSL cert for the Reducer is generated and assigned via Flask. This should be disabled, rendering the pseudo-local/docker versions insecure.

And that is fine as the currently supported production environment is STACKn/Studio (or use own proxy).

TODO:

FEDn: Change Flask to serve http only

Studio/STACKn: Ingress controller will serve HTTPS via an ingress-controller TLS cert

The sandbox docker-compose setup will run in insecure mode only.

After rewrite:

Version after this fix will serve TLS via ingress-controller generated certs.

List model trail via SDK/CLI

Is your feature request related to a problem? Please describe.
As a user I want to conveniently list / iterate over the models in the model trail, as well as their associated metadata and validation scores, in order to facilitate programmatic model selection for downstream serving etc.

Describe the solution you'd like
SDK and CLI functionality to interact with the model trail.

list, get_latest, etc.

docker-compose template for mnist-keras does not work out of the box

Severity

  • Critical/Blocker (select if the issue makes the application unusable or causes serious data loss)
  • High (select if the issue affects a major feature and there is no workaround or the available workaround is very complex)
  • Medium (select if the issue affects a minor feature or affects a major feature but has an easy enough workaround to not cause any major inconvenience)
  • Low (select if the issue doesn't significantly affect the user experience, like minor visual bugs)

Describe the bug
It does not comply with the instructions, as it assumes the "default" docker network, but we define another name when deploying the reducer, combiner and base-services.

Demonstrate HA setup for Reducer

Is your feature request related to a problem? Please describe.

Current examples work with a single Reducer service.

Describe the solution you'd like
A reference example for a HA setup for the FEDn Network/Reducer.

How would the solution positively affect the functionality?
Increased robustness/resilience.

@sztoor Will coordinate.

Upstream model consumption

Is your feature request related to a problem? Please describe.
Allow upstream model consumption through direct integration.
(Replacing the previous workflow-defined or manual flow of models)

Describe the solution you'd like
Implement integrated upstream model consumption

  • Add model on round-complete through webhook.
  • Assemble model entrypoint (as models can be dynamically typed we introduce a new entrypoint to assemble a model from any context) for upstream consumption.
  • Reducer <-> deploy-context integration.

Improve combine_models function (fedavg.py) to better describe root cause of potential combination problems

Is your feature request related to a problem? Please describe.
Improve combine_models function (fedavg.py) to identify the source of the model combining problem (here lack of resources on client-side).

Describe the solution you'd like
If clients fail to train models, or the training process does not finish, an empty model is generated and sent to the combiners. The combiners then throw errors during model combining (FAILED TO UNPACK FROM COMBINER!), which is not very useful for identifying the source of the problem. Hence, adding more exceptions about the lack of resources will help to quickly identify the issue.
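A sketch of the kind of check that would surface the root cause before aggregation (function name and error message are hypothetical, not FEDn's actual code):

```python
def check_model_update(client_name, model_bytes):
    """Reject empty model updates with a descriptive error instead of
    letting aggregation fail later with an opaque unpack error
    (illustrative sketch)."""
    if not model_bytes:
        raise ValueError(
            f"Client '{client_name}' sent an empty model update - "
            "it likely ran out of resources or training did not finish."
        )
    return model_bytes
```

Calling this on each incoming update lets the combiner log which client failed, rather than failing later inside combine_models.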

How would the solution positively affect the functionality?
A clear and concise description of the positive outcomes of the suggested solution.

Describe any drawbacks (if any)
A clear and concise description of the negative outcomes of the suggested solution.

Contact Details
An e-mail address in case we need to contact you for further details regarding this request.

Tensorflow: Could not load dynamic library 'libcuda.so.1'

Severity

  • Critical/Blocker (select if the issue makes the application unusable or causes serious data loss)
  • High (select if the issue affects a major feature and there is no workaround or the available workaround is very complex)
  • Medium (select if the issue affects a minor feature or affects a major feature but has an easy enough workaround to not cause any major inconvenience)
  • Low (select if the issue doesn't significantly affect the user experience, like minor visual bugs)

Describe the bug
Running the federation process seems to instigate the error log below:
I suspect the TensorFlow GPU build does not work well alongside FEDn and conflicts with it. The training process continues as normal without fault after this log is displayed.

Environment:

  • OS: [windows]
  • Version: [10]
  • Browser [mozilla]
  • Version [83.0]
  • Tensorflow 2.3.1

Reproduction Steps
Steps to reproduce the behavior:
Run the mnist example in the documentation with tensorflow 2.3.1 installed

Error log
client_2 | 2020-12-07 21:02:12.975489: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
client_2 | 2020-12-07 21:02:12.975543: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
client_2 | 2020-12-07 21:02:12.975574: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (f00c4eace373): /proc/driver/nvidia/version does not exist
client_2 | 2020-12-07 21:02:12.975758: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use:
AVX2 FMA
client_2 | 2020-12-07 21:02:12.984561: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2592005000 Hz
client_2 | 2020-12-07 21:02:12.985607: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fe608000b20 initialized for platform Host (this does not guarantee that
XLA will be used). Devices:
client_2 | 2020-12-07 21:02:12.985674: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
client_2 | -- RUNNING TRAINING --

Contact Details
[email protected]

Model upload error

Severity

  • [x] Critical/Blocker (select if the issue makes the application unusable or causes serious data loss)
  • High (select if the issue affects a major feature and there is no workaround or the available workaround is very complex)
  • Medium (select if the issue affects a minor feature or affects a major feature but has an easy enough workaround to not cause any major inconvenience)
  • Low (select if the issue doesn't significantly affect the user experience, like minor visual bugs)

Describe the bug
When I try to upload the mnist-keras or mnist-pytorch models at 'https://localhost:8090/control', it gives an HTTP 500 server error.
The log of the reducer contains the following error message:

model = self.load_model(path)
reducer_1 | File "/app/fedn/fedn/utils/pytorchhelper.py", line 31, in load_model
reducer_1 | b = np.load(path)
reducer_1 | File "/usr/local/lib/python3.8/site-packages/numpy/lib/npyio.py", line 457, in load
reducer_1 | raise ValueError("Cannot load file containing pickled data "
reducer_1 | ValueError: Cannot load file containing pickled data when allow_pickle=False

It seems that numpy tries to load a pickled file with the allow_pickle=False option.
This is for PyTorch, but the same error comes with the Keras model as well.
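This is numpy's default behavior: since numpy 1.16.3, np.load refuses arrays that were stored pickled (object dtype) unless allow_pickle=True is passed explicitly. A minimal reproduction:

```python
import os
import tempfile

import numpy as np

# Object-dtype arrays are stored pickled; plain numeric arrays are not.
path = os.path.join(tempfile.mkdtemp(), "weights.npy")
np.save(path, np.array([{"layer": [1, 2, 3]}], dtype=object))

raised = False
try:
    np.load(path)  # allow_pickle defaults to False since numpy 1.16.3
except ValueError:
    raised = True  # "Cannot load file containing pickled data ..."

obj = np.load(path, allow_pickle=True)  # explicit opt-in loads fine
```

A longer-term fix on the helper side would be to store weights as plain numeric arrays (e.g. via np.savez), which need no pickling at all.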

Environment:
Ubuntu 18.04

Contact Details
[email protected]

undefined variable in fedn/fedn/common/control/package.py

Severity

  • Critical/Blocker (select if the issue makes the application unusable or causes serious data loss)
  • High (select if the issue affects a major feature and there is no workaround or the available workaround is very complex)
  • Medium (select if the issue affects a minor feature or affects a major feature but has an easy enough workaround to not cause any major inconvenience)
  • Low (select if the issue doesn't significantly affect the user experience, like minor visual bugs)

Describe the bug
On line 199 a variable, to_path, is used without being defined; this caused the fedbird project to crash.

Environment:

  • OS: [e.g. mac OS]
  • Version: [e.g. macOS Catalina]
  • Browser [e.g. chrome, safari]
  • Version [e.g. 22]

Reproduction Steps
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Contact Details
An e-mail address in case we need to contact you for further details regarding this issue.

fedn client --remote option broken

Severity

  • Critical/Blocker (select if the issue makes the application unusable or causes serious data loss)
  • High (select if the issue affects a major feature and there is no workaround or the available workaround is very complex)
  • Medium (select if the issue affects a minor feature or affects a major feature but has an easy enough workaround to not cause any major inconvenience)
  • Low (select if the issue doesn't significantly affect the user experience, like minor visual bugs)

Describe the bug
There is an option when starting the client to toggle fetching the remote execution context (compute package) or using a local pre-staged version (useful for development of the package).

-r, --remote BOOLEAN Enable remote configured execution context

Unfortunately, it appears broken:

'fedn run client --remote=False' results in:

new_func
return f(get_current_context(), *args, **kwargs)
File "/Users/andreas/github/scaleoutsystems/fedn/fedn/cli/run_cmd.py", line 75, in client_cmd
client = Client(config)
File "/Users/andreas/github/scaleoutsystems/fedn/fedn/fedn/client.py", line 120, in __init__
copy_tree(from_path, run_path)
NameError: name 'run_path' is not defined

Start only trainer, validator or both

Is your feature request related to a problem? Please describe.
Sometimes it would be good to start dedicated validation clients.

Describe the solution you'd like
Via a CLI option, I would like to configure the client to listen on model update requests, model validation requests, or both (today's solution)

How would the solution positively affect the functionality?
More flexibility in client contributions. Can run trainer interface even if dedicated validation data is missing locally.

Describe any drawbacks (if any)
Slightly increased complexity.

Validation broken in mnist-keras

Severity

  • [ ] Critical/Blocker (select if the issue makes the application unusable or causes serious data loss)
  • [x] High (select if the issue affects a major feature and there is no workaround or the available workaround is very complex)
  • Medium (select if the issue affects a minor feature or affects a major feature but has an easy enough workaround to not cause any major inconvenience)
  • Low (select if the issue doesn't significantly affect the user experience, like minor visual bugs)

Describe the bug
Validation errors out for mnist-keras in current develop.

2021-08-18 13:48:16,345 - root - INFO - y_pred = model.predict_classes(x_test)

08/18/2021 01:48:16 PM [process.py:14] AttributeError: 'Sequential' object has no attribute 'predict_classes'

2021-08-18 13:48:16,345 - root - INFO - AttributeError: 'Sequential' object has no attribute 'predict_classes'

08/18/2021 01:48:16 PM [dispatcher.py:28] DONE RUNNING validate /tmp/tmpaiyib12_.npz /tmp/tmp2di2m2ds
2021-08-18 13:48:16,683 - fedn.utils.dispatcher - INFO - DONE RUNNING validate /tmp/tmpaiyib12_.npz /tmp/tmp2di2m2ds
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/app/src/fedn/fedn/fedn/client.py", line 259, in __listen_to_model_validation_request_stream
metrics = self.__process_validation_request(model_id)
File "/app/src/fedn/fedn/fedn/client.py", line 346, in __process_validation_request
validation = json.loads(fh.read())
File "/usr/local/lib/python3.8/json/__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "/usr/local/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/local/lib/python3.8/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
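Two failures chain here: Sequential.predict_classes was removed in newer Keras versions, so the validate entrypoint crashes and writes no output, and the client then feeds the empty validation file to json.loads, which raises the JSONDecodeError above. A sketch of the standard replacement and of a guard for the empty file (plain numpy/stdlib; function names are illustrative, and the array below stands in for model.predict() output):

```python
import json

import numpy as np

def predict_classes(probabilities):
    """Replacement for the removed Sequential.predict_classes:
    argmax over the class axis of model.predict() output."""
    return np.argmax(probabilities, axis=-1)

# Stand-in for model.predict(x_test): 2 samples, 3 classes.
probs = np.array([[0.1, 0.7, 0.2],
                  [0.8, 0.1, 0.1]])
y_pred = predict_classes(probs)

def read_validation(text):
    """Fail with a descriptive error on empty validation output instead
    of an opaque JSONDecodeError (illustrative guard)."""
    if not text.strip():
        raise RuntimeError("validation output is empty - "
                           "did the validate entrypoint crash?")
    return json.loads(text)
```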

Environment:

  • OS: [e.g. mac OS]
  • Version: [e.g. macOS Catalina]
  • Browser [e.g. chrome, safari]
  • Version [e.g. 22]

Reproduction Steps
Run the mnist-keras test.

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Contact Details
An e-mail address in case we need to contact you for further details regarding this issue.

Harmonize CLI command options

Is your feature request related to a problem? Please describe.
fedn run client, combiner and reducer use a non-harmonized naming scheme for command options.

For example, client uses -i as short for client_id, while reducer and combiner use it for port (could be ok, but suboptimal).

reducer uses -i as short for --init, while combiner uses -i. Etc.

A harmonized naming scheme will simplify for users.

Improve combiner startup log

When starting a combiner, the last line printed on the command line is still "starting the combiner", which is confusing: it is unclear whether startup has completed or not. Some words could be added to indicate that the combiner has successfully initialized.

Delete obsolete test/mnist example

Severity

  • Critical/Blocker (select if the issue makes the application unusable or causes serious data loss)
  • High (select if the issue affects a major feature and there is no workaround or the available workaround is very complex)
  • Medium (select if the issue affects a minor feature or affects a major feature but has an easy enough workaround to not cause any major inconvenience)
  • Low (select if the issue doesn't significantly affect the user experience, like minor visual bugs)

Describe the bug
A clear and concise description of what the bug is.

Environment:

  • OS: [e.g. mac OS]
  • Version: [e.g. macOS Catalina]
  • Browser [e.g. chrome, safari]
  • Version [e.g. 22]

Reproduction Steps
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Contact Details
An e-mail address in case we need to contact you for further details regarding this issue.

Timestamp on combiner log

Is your feature request related to a problem? Please describe.
Why not add a timestamp to the combiner log, like the reducer has? At the least it will help to track errors during long training runs.

Describe the solution you'd like
A clear and concise description of what you want to happen.
Instead of having this:
combiner_1 | COMBINER(FEDn_Combiner_addi):0 COMBINER: waiting for model updates: 0 of 4 completed.

it's better to have this:
combiner_1 | COMBINER(FEDn_Combiner_addi): 09/24/2021 09:38:58 AM 0 COMBINER: waiting for model updates: 0 of 4 completed.
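Assuming the combiner logs via Python's standard logging module, the requested format is a one-line Formatter change (illustrative sketch; logger name and message mimic the example above):

```python
import logging

# Formatter matching the requested style: timestamp before the message.
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    fmt="%(name)s: %(asctime)s %(message)s",
    datefmt="%m/%d/%Y %I:%M:%S %p",
))
logger = logging.getLogger("COMBINER")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("waiting for model updates: 0 of 4 completed.")
```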

Describe any drawbacks (if any)
None

Contact Details
An e-mail address in case we need to contact you for further details regarding this request.

Rewrite git-lfs file history

Is your feature request related to a problem? Please describe.

We have removed test/examples which used to hold large files.

Describe the solution you'd like
Get rid of them :-)

Number of clients running different than clients started

Severity

  • Critical/Blocker (select if the issue makes the application unusable or causes serious data loss)
  • High (select if the issue affects a major feature and there is no workaround or the available workaround is very complex)
  • Medium (select if the issue affects a minor feature or affects a major feature but has an easy enough workaround to not cause any major inconvenience)
  • Low (select if the issue doesn't significantly affect the user experience, like minor visual bugs)

Describe the bug
It seems that when I start a federation round, the combiner doesn't always request the same number of clients as are running. I am not sure if this is intended. In my scenario I started 5 clients; when the federation starts:
combiner_1 | COMBINER(combiner):0 waiting for 4 clients to get started, currently: 4
It seems to happen at random; sometimes it waits for 3 clients to get started while "currently" indicates only 2, in the exact same scenario.
Environment:

  • OS: [windows]
  • Version: [10]
  • Browser [mozilla]
  • Version [83.0]

Reproduction Steps
Steps to reproduce the behavior:
Follow the documentation to run the mnist test word for word.

Expected behavior
If I start 5 clients, all 5 clients should be used for the federation process.

Contact Details
[email protected]

Multiple native clients are not compatible with the current log UI, which assumes port 8080 is allocatable.

Severity

  • Critical/Blocker (select if the issue makes the application unusable or causes serious data loss)
  • High (select if the issue affects a major feature and there is no workaround or the available workaround is very complex)
  • Medium (select if the issue affects a minor feature or affects a major feature but has an easy enough workaround to not cause any major inconvenience)
  • Low (select if the issue doesn't significantly affect the user experience, like minor visual bugs)

Describe the bug
When starting multiple clients natively on the same host machine without extra config, the default port for serving the client log UI is (obviously) already in use.

Suggested mitigation

  • Reverse the logic: disable the local UI by default and allow parametric enabling of the local logging UI
  • We should also document that, by exposing that endpoint, you can relay logs to any logging service your org might be using for aggregation of infra/service logs
  • Make configuration of the port easier.

Furthermore

  • Also should we recover and try another port?
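The fallback idea in the last bullet can be sketched with the standard library: try the preferred port, and let the kernel assign a free ephemeral port if it is taken (function name and port number are illustrative):

```python
import socket

def pick_port(preferred=8080):
    """Return the preferred port if it is free, otherwise an
    OS-assigned free port (illustrative fallback sketch)."""
    for port in (preferred, 0):  # 0 -> kernel picks a free ephemeral port
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
                s.bind(("127.0.0.1", port))
                return s.getsockname()[1]
        except OSError:
            continue
    raise RuntimeError("no free port found")
```

The probe socket is closed before the UI server binds, so there is a small race window; for a local log UI that is usually acceptable.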

Could not find a suitable TLS CA certificate bundle, invalid path

Severity

  • Critical/Blocker (select if the issue makes the application unusable or causes serious data loss)
  • High (select if the issue affects a major feature and there is no workaround or the available workaround is very complex)
  • Medium (select if the issue affects a minor feature or affects a major feature but has an easy enough workaround to not cause any major inconvenience)
  • Low (select if the issue doesn't significantly affect the user experience, like minor visual bugs)

Describe the bug
The mnist-keras client fails assignment, complaining about a missing certificate (see screenshot). On inspecting the container, no certs are staged in app/certs.

Environment:

  • Ubuntu Linux 20.04
  • snap docker (as root)

Reproduction Steps
Run the Sandbox getting-started guide.

Expected behavior

Screenshots
Skärmavbild 2021-05-15 kl  20 07 30
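For context, this is the error requests raises when the path in REQUESTS_CA_BUNDLE (or passed via verify=) does not point to an existing file. A hedged diagnostic sketch (the helper name and default path are assumptions based on the report):

```python
import os

def check_ca_bundle(default="app/certs/ca.pem"):
    """Fail fast with an actionable message when the CA bundle is missing.

    Hypothetical helper: 'requests' reports "Could not find a suitable
    TLS CA certificate bundle, invalid path" when the configured bundle
    path does not exist on disk.
    """
    bundle = os.environ.get("REQUESTS_CA_BUNDLE", default)
    if not os.path.isfile(bundle):
        raise FileNotFoundError(
            f"TLS CA bundle not found at '{bundle}'. Make sure the reducer "
            "certificates are staged into the client container (app/certs).")
    return bundle
```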

Generate SDK/API documentation

Is your feature request related to a problem? Please describe.

We need to set up auto-generation of the SDK documentation and add that to docs.

Old compute package in master

Severity

  • [] Critical/Blocker (select if the issue makes the application unusable or causes serious data loss)
  • High (select if the issue affects a major feature and there is no workaround or the available workaround is very complex)
  • Medium (select if the issue affects a minor feature or affects a major feature but has an easy enough workaround to not cause any major inconvenience)
  • Low (select if the issue doesn't significantly affect the user experience, like minor visual bugs)

Describe the bug
The pre-built compute package for mnist-keras in current master does not match the actual client code (it has not been updated for the release).

Environment:

  • OS: [e.g. mac OS]
  • Version: [e.g. macOS Catalina]
  • Browser [e.g. chrome, safari]
  • Version [e.g. 22]

Reproduction Steps
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Contact Details
An e-mail address in case we need to contact you for further details regarding this issue.

Only one combiner listed on network page

Severity

  • Critical/Blocker (select if the issue makes the application unusable or causes serious data loss)
  • High (select if the issue affects a major feature and there is no workaround or the available workaround is very complex)
  • Medium (select if the issue affects a minor feature or affects a major feature but has an easy enough workaround to not cause any major inconvenience)
  • Low (select if the issue doesn't significantly affect the user experience, like minor visual bugs)

Describe the bug
Only one combiner shows up in the table listing on /network (but it shows in the graph).

This worked before, so I suspect a change in the pymongo API or something similar.

Reproduction Steps
Start two combiners.

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Contact Details
An e-mail address in case we need to contact you for further details regarding this issue.

Add message about package uploading on client log to avoid connection errors

Is your feature request related to a problem? Please describe.
After starting the reducer and combiners we will get this error after starting clients:

client_1 | Asking for assignment
client_1 | 09/25/2021 11:50:08 AM [connectionpool.py:971] Starting new HTTPS connection (1): 130.238.29.53:8090
client_1 | 2021-09-25 11:50:08,297 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): 130.238.29.53:8090
client_1 | 09/25/2021 11:50:08 AM [connectionpool.py:452] https://130.238.29.53:8090 "GET /assign?name=client00ba921f HTTP/1.1" 200 24

This confuses the user, but it is just a package problem: one needs to upload the compute package before starting clients.
So the suggestion is to print a clear message in the client log asking the user to upload the package.

Describe the solution you'd like
A clear and concise description of what you want to happen.

How would the solution positively affect the functionality?
A clear and concise description of the positive outcomes of the suggested solution.

Describe any drawbacks (if any)
A clear and concise description of the negative outcomes of the suggested solution.

Contact Details
An e-mail address in case we need to contact you for further details regarding this request.

Enforce combiner names to be globally unique

Is your feature request related to a problem? Please describe.

Currently, things do not work as intended when two or more combiners share the same name.

Describe the solution you'd like
Check the name against already-added combiners upon first connection, and reject the connection if the name is not globally unique.
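The check could look something like this minimal sketch (class and method names are hypothetical, not FEDn's actual discovery-service code):

```python
class CombinerRegistry:
    """Minimal sketch of enforcing globally unique combiner names."""

    def __init__(self):
        self._names = set()

    def register(self, name):
        """Accept a combiner on first connection only if its name is free."""
        if name in self._names:
            raise ValueError(
                f"Combiner name '{name}' is already in use; rejecting connection.")
        self._names.add(name)
```

In practice the set would be backed by the network database, but the rejection logic is the same.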

Upgrade mnist examples to work with TF/Keras 2.6.0

Is your feature request related to a problem? Please describe.

In TensorFlow 2.6, predict_classes has been removed, so validate.py in test/mnist-keras no longer works.

Describe the solution you'd like
An overhaul of the client code to work with the latest TF.
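The standard replacement for the removed predict_classes is an arg-max over the class axis of model.predict. A sketch using a stand-in array instead of a real Keras model (so no TF dependency here):

```python
import numpy as np

def predict_classes(probabilities):
    """Replacement for the removed model.predict_classes(x):
    take the arg-max over the class axis of model.predict(x)."""
    return np.argmax(probabilities, axis=-1)

# Stand-in for model.predict(x_test) on 3 MNIST digits, 10 classes each:
probs = np.zeros((3, 10))
probs[0, 7] = probs[1, 2] = probs[2, 0] = 1.0
print(predict_classes(probs))  # → [7 2 0]
```

In validate.py the call `model.predict_classes(x_test)` would become `np.argmax(model.predict(x_test), axis=-1)`.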

How would the solution positively affect the functionality?

Describe any drawbacks (if any)
A clear and concise description of the negative outcomes of the suggested solution.

Contact Details
An e-mail address in case we need to contact you for further details regarding this request.

Upgrade minio to 7.0.2

Is your feature request related to a problem? Please describe.
The current hard dependency on minio==6.0.0 causes version clashes with stackn, which requires 7.0.2.
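A minimal sketch of relaxing the pin in setup.py, assuming FEDn's storage code is first verified against the 7.x API (several minio method signatures changed between 6.x and 7.x, so a bare pin bump may not be enough):

```python
# setup.py (fragment) — hypothetical; verify FEDn's minio calls
# against the minio 7.x API before relaxing the pin.
install_requires = [
    "minio>=7.0.2,<8",
]
```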

Unable to run data center example

Severity

  • Critical/Blocker (select if the issue makes the application unusable or causes serious data loss)
  • High (select if the issue affects a major feature and there is no workaround or the available workaround is very complex)
  • Medium (select if the issue affects a minor feature or affects a major feature but has an easy enough workaround to not cause any major inconvenience)
  • Low (select if the issue doesn't significantly affect the user experience, like minor visual bugs)

Describe the bug
The data center example fails to run because the read_data file points to ../Data/train.csv, which doesn't exist for the data center example. Not sure whether this is a bug on my end somehow or just an unfinished test file.

Environment:

  • OS: [Windows]
  • Version: [10]
  • Browser [Mozilla]
  • Version [83.0]

Reproduction Steps
Steps to reproduce the behavior:
Simply follow the documentation to run the test, with a single change to the .env file: EXAMPLE=data-center.

Expected behavior
Data center example should run properly and be able to train.

Error code
client_5 | -- RUNNING TRAINING --
client_5 | Using TensorFlow backend.
client_5 | Traceback (most recent call last):
client_5 | File "train.py", line 35, in <module>
client_5 | model = train(model,'../data/train.csv')
client_5 | File "train.py", line 21, in train
client_5 | (x_train, y_train) = read_data(data)
client_5 | File "/app/client/read_data.py", line 16, in read_data
client_5 | data = pd.read_csv(filename, sep = ',',index_col=[0])
client_5 | File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers.py", line 676, in parser_f
client_5 | return _read(filepath_or_buffer, kwds)
client_5 | File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers.py", line 448, in _read
client_5 | parser = TextFileReader(fp_or_buf, **kwds)
client_5 | File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers.py", line 880, in __init__
client_5 | self._make_engine(self.engine)
client_5 | File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
client_5 | self._engine = CParserWrapper(self.f, **self.options)
client_5 | File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers.py", line 1891, in __init__
client_5 | self._reader = parsers.TextReader(src, **kwds)
client_5 | File "pandas/_libs/parsers.pyx", line 374, in pandas._libs.parsers.TextReader.__cinit__
client_5 | File "pandas/_libs/parsers.pyx", line 674, in pandas._libs.parsers.TextReader._setup_parser_source
client_5 | FileNotFoundError: [Errno 2] File ../data/train.csv does not exist: '../data/train.csv'
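One way to make this failure mode clearer is to resolve the data path relative to the client directory and fail early with an actionable message. A hedged sketch (helper name is hypothetical; note that ../data vs ../Data matters, since paths are case-sensitive on Linux):

```python
import os

def resolve_data_path(relative_path, base_dir):
    """Resolve the training-data path relative to the client directory
    and raise an actionable error if the file is missing."""
    path = os.path.normpath(os.path.join(base_dir, relative_path))
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"Training data not found at '{path}'. Stage the data-center "
            "dataset (train.csv) before starting clients, or fix the path "
            "in read_data.py (paths are case-sensitive on Linux).")
    return path
```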

Contact Details
[email protected]

How-to for how to develop a new compute package locally/interactively

Is your feature request related to a problem? Please describe.

When developing the compute package one wants an interactive experience without the need to upload a new package for each change. This is possible, but currently not documented.

Describe the solution you'd like
A how-to in markup describing the recommended way to develop a new package locally.

Add how-to for how we configure and start clients with Singularity containers

Is your feature request related to a problem? Please describe.

Here:
https://github.com/scaleoutsystems/examples/tree/main/how-tos

Describe the solution you'd like
A clear and concise description of what you want to happen.

How would the solution positively affect the functionality?
A clear and concise description of the positive outcomes of the suggested solution.

Describe any drawbacks (if any)
A clear and concise description of the negative outcomes of the suggested solution.

Contact Details
An e-mail address in case we need to contact you for further details regarding this request.

Improved visualization of FEDn network graph

Is your feature request related to a problem? Please describe.

It is often helpful to have a graphical representation of the FEDn network. We have a simple version now in the /network view, but it should be improved to better handle multiple combiners. Also, it would be nice if the actual client names/ids were shown, as well as their status (online/offline).
