
sedna's Introduction

KubeEdge


English | 简体中文

KubeEdge is built upon Kubernetes and extends native containerized application orchestration and device management to hosts at the edge. It consists of a cloud part and an edge part, and provides core infrastructure support for networking, application deployment, and metadata synchronization between cloud and edge. It also supports MQTT, enabling edge devices to communicate through edge nodes.

With KubeEdge it is easy to deploy existing complicated machine learning, image recognition, event processing, and other high-level applications to the edge. With business logic running at the edge, much larger volumes of data can be secured and processed locally, where the data is produced. With data processed at the edge, responsiveness increases dramatically and data privacy is protected.

KubeEdge is an incubation-level hosted project by the Cloud Native Computing Foundation (CNCF). KubeEdge incubation announcement by CNCF.

Advantages

  • Kubernetes-native support: Managing edge applications and edge devices in the cloud with fully compatible Kubernetes APIs.
  • Cloud-Edge Reliable Collaboration: Ensures reliable message delivery, without loss, over an unstable cloud-edge network.
  • Edge Autonomy: Ensures edge nodes run autonomously and edge applications keep running normally when the cloud-edge network is unstable, or when the edge is offline and restarted.
  • Edge Devices Management: Manages edge devices through Kubernetes-native APIs implemented by CRDs.
  • Extremely Lightweight Edge Agent: An extremely lightweight edge agent (EdgeCore) runs on resource-constrained edge nodes.

How It Works

KubeEdge consists of a cloud part and an edge part.

Architecture

In the Cloud

  • CloudHub: a WebSocket server responsible for watching changes on the cloud side, caching messages, and sending them to EdgeHub.
  • EdgeController: an extended kubernetes controller which manages edge nodes and pods metadata so that the data can be targeted to a specific edge node.
  • DeviceController: an extended kubernetes controller which manages devices so that the device metadata/status data can be synced between edge and cloud.

On the Edge

  • EdgeHub: a WebSocket client responsible for interacting with cloud services for edge computing (such as EdgeController in the KubeEdge architecture). This includes syncing cloud-side resource updates to the edge and reporting edge-side host and device status changes to the cloud.
  • Edged: an agent that runs on edge nodes and manages containerized applications.
  • EventBus: an MQTT client to interact with MQTT servers (e.g. mosquitto), offering publish and subscribe capabilities to other components.
  • ServiceBus: an HTTP client to interact with HTTP servers (REST), offering HTTP client capabilities so that cloud components can reach HTTP servers running at the edge.
  • DeviceTwin: responsible for storing device status and syncing device status to the cloud. It also provides query interfaces for applications.
  • MetaManager: the message processor between edged and edgehub. It is also responsible for storing/retrieving metadata to/from a lightweight database (SQLite).

Kubernetes compatibility

|                        | Kubernetes 1.20 | Kubernetes 1.21 | Kubernetes 1.22 | Kubernetes 1.23 | Kubernetes 1.24 | Kubernetes 1.25 | Kubernetes 1.26 | Kubernetes 1.27 | Kubernetes 1.28 | Kubernetes 1.29 |
|------------------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
| KubeEdge 1.12          | ✓               | ✓               | ✓               | -               | -               | -               | -               | -               | -               | -               |
| KubeEdge 1.13          | +               | ✓               | ✓               | ✓               | -               | -               | -               | -               | -               | -               |
| KubeEdge 1.14          | +               | +               | ✓               | ✓               | ✓               | -               | -               | -               | -               | -               |
| KubeEdge 1.15          | +               | +               | +               | ✓               | ✓               | ✓               | -               | -               | -               | -               |
| KubeEdge 1.16          | +               | +               | +               | +               | ✓               | ✓               | ✓               | -               | -               | -               |
| KubeEdge 1.17          | +               | +               | +               | +               | +               | ✓               | ✓               | ✓               | -               | -               |
| KubeEdge HEAD (master) | +               | +               | +               | +               | +               | +               | +               | ✓               | ✓               | ✓               |

Key:

  • ✓ KubeEdge and the Kubernetes version are exactly compatible.
  • + KubeEdge has features or API objects that may not be present in the Kubernetes version.
  • - The Kubernetes version has features or API objects that KubeEdge can't use.

Guides

Get started with this doc.

See our documentation on kubeedge.io for more details.

To learn more deeply about KubeEdge, try some of the examples.

Roadmap

Meeting

Regular Community Meeting:

Resources:

Contact

If you need support, start with the troubleshooting guide, and work your way through the process that we've outlined.

If you have questions, feel free to reach out to us in the following ways:

Contributing

If you're interested in being a contributor and want to get involved in developing the KubeEdge code, please see CONTRIBUTING for details on submitting patches and the contribution workflow.

Security

Security Audit

A third-party security audit of KubeEdge was completed in July 2022. Additionally, the KubeEdge community completed an overall system security analysis of KubeEdge. The detailed reports are as follows.

Reporting security vulnerabilities

We encourage security researchers, industry organizations, and users to proactively report suspected vulnerabilities to our security team ([email protected]); the team will help diagnose the severity of the issue and determine how to address it as soon as possible.

For further details please see Security Policy for our security process and how to report vulnerabilities.

License

KubeEdge is under the Apache 2.0 license. See the LICENSE file for details.


sedna's Issues

Wrong base model in surface defect detection example with the federated learning job

What happened:
I have successfully set up and run the surface defect detection example with the federated learning job. I then used the generated model to run inference on other images in the Magnetic Tile Defect dataset. However, I noticed that the results are not ideal: the model always predicts "No defect" for all the images I fed into it.

How to reproduce it (as minimally and precisely as possible):

fl_instance = FederatedLearning(estimator=Estimator)
fl_instance.load(model_url)

print(fl_instance.predict(test_data))

Anything else we need to know?:

Environment:

Sedna Version
v0.3.0

Add GlobalManager high availability support

What would you like to be added/modified:
Add GlobalManager multi-instance support for high availability and high throughput.

Why is this needed:
Currently GlobalManager is deployed as a k8s Deployment with replicas=1, so only one instance is supported.

For GlobalManager high availability and high throughput, we need to deploy GlobalManager with replicas >= 2.
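
A minimal sketch of what this could look like, assuming the default install layout (namespace sedna, Deployment gm, image kubeedge/sedna-gm; these names, labels, and the version tag are assumptions). Note that with multiple replicas some coordination, such as leader election, would also be needed:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gm
  namespace: sedna
spec:
  replicas: 2   # >= 2 for high availability and throughput
  selector:
    matchLabels:
      sedna: gm
  template:
    metadata:
      labels:
        sedna: gm
    spec:
      containers:
        - name: gm
          image: kubeedge/sedna-gm:v0.3.0   # version tag is an assumption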

repository setup tracking issue

This issue is to track the things that need to be done to completely set up our repository.

  • add the FOSSA checker for license checking
  • add prow bot support for auto-merging PRs, etc.
  • add e2e test cases (currently empty)

Model Management module is needed.

Why is a model management module needed?

Currently, the model management capability of Sedna is fragmented and has the following problems:

1. Unclear module responsibilities.

The code for model uploading and downloading is duplicated across different modules (LC and Lib in Sedna), and is even written in different languages.
In this case, if you want to extend the protocol for saving the model, you have to add code with the same function to each of these modules, which is difficult to maintain.

2. Difficult to leverage basic capabilities.

For example, the model saving function may be used by both model compression and model conversion; it is not necessary to implement it separately in each of them.

3. The interfaces of basic model functions are not designed in a unified manner.

For example, model compression and model deployment may be combined, as may model conversion and model deployment. If there were a basic interface standardizing model functions, it could provide more flexible composition.

4. The style of interfaces exposed to users might be different.

Em.., this point is intuitive.

I will add some concrete examples to illustrate this later.

Current model management requirements

I summarize the requirements for model management of sedna's existing features.

Incremental Learning

Federated Learning

Joint Inference

Future model management requirements

Through a survey, I have learned that models can also have the following behaviors:

  • write
  • read
  • watch
  • version/history
  • evaluate
  • predict/serving
  • publish/deploy
  • conversion
  • compression
  • monitoring
  • searching

For example, multi-task lifelong learning is on the roadmap of Sedna; it requires capabilities such as multi-model serving, model metric recording, and model metadata search.

Summary

Therefore, I hope that we can design an edge-cloud synergy model management component with a unified architecture and interface style based on current or future features.

Model serving should support multi-model

In model inference, multiple models may be composed for inference. For example, model A receives the input and its output is fed to model B for inference to obtain the final result. For more scenarios, see this link.

Add operator to support dynamic neural networks

Dynamic neural network is an emerging technology in deep learning. Compared to static models which have fixed computational graphs and parameters at the inference stage, dynamic networks can adapt their structures or parameters to different inputs, leading to notable advantages in terms of accuracy, computational efficiency, adaptiveness, etc.

In general, dynamic neural networks have the following advantages:

  1. Adaptiveness. Dynamic models are able to achieve a desired trade-off between accuracy and efficiency for dealing with varying computational budgets on the fly. Therefore, they are more adaptable to different hardware platforms and changing environments, compared to static models with a fixed computational cost.

  2. Representation power. Due to the data-dependent network architecture/parameters, dynamic networks have significantly enlarged parameter space and improved representation power.

  3. Compatibility. Dynamic networks are compatible with most advanced techniques in deep learning, such as model compression and neural architecture search (NAS).

Demo:

  1. Model early-exiting with BranchyNet. A real industrial case is here.
  2. Model layer skipping with ResNet.

Model serving should support hot loading

In joint inference or incremental learning scenarios, we sometimes need to redeploy our model. For example, in the incremental learning scenario, after continuous training the model precision reaches the trigger condition and the model needs to be redeployed.
How to redeploy the model without service interruption needs to be considered.
Some serving frameworks support hot loading (for example, PyTorch serving), but some do not (such as MindSpore).

Add nodeSelector for worker pod template

In issue #19, we added pod template support. But nodeSelector is not supported in PR #33 because of the complex changes it requires:

  1. federated learning: since the train workers need to access the aggregation worker, we currently create the train workers injected with the aggregation service's NodeIP:NodePort, so GM needs to know the nodeName first. With nodeSelector, we don't know which node the aggregation worker will be scheduled on when creating the train workers, so we would need to delay the train-worker creation (with the service info) until the aggregation worker is scheduled.
    Another note: with edgemesh support, this is not a problem.
  2. joint inference: same problem as federated learning; the edgeWorker needs to access the cloudWorker.
  3. the downstream controller: it syncs the features to the edge nodes when it gets feature updates, so downstream needs to know the nodeName.
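
For reference, a minimal sketch of what nodeSelector support in a worker pod template could look like (the label key/value below are hypothetical; field names follow the FederatedLearningJob examples in this repo):

aggregationWorker:
  template:
    spec:
      nodeSelector:
        node-role.sedna.io/cloud: "true"   # hypothetical label
      containers:
        - image: kubeedge/sedna-example-federated-learning-surface-defect-detection-aggregation:v0.1.0
          name: agg-worker

With a selector instead of a fixed nodeName, GM only learns the aggregation worker's node after scheduling, which is why the train-worker creation would need to be delayed as described above.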

feature enhancements tracking issue

This issue is to track these known enhancements to be done:

  • add resource limits for worker
  • add gpu support for worker
  • add pod template like support for worker spec
  • add model management with central storage support
  • add dataset management with central storage support
  • add example code style checker
  • add descriptions for CRD fields
  • abstract the worker controllers into one; currently each feature controller has its own similar worker implementation
  • move the feature CR logic embedded in upstream/downstream to respective feature controller
  • replace self-built websocket between gm and lc with KubeEdge message communication
  • improve the state translation implementation of incremental learning
  • make the Python lib interface clearer
  • model serving should support hot loading & multiple models
  • the base TensorFlow images in the examples need to be unified to one version
  • the networking differences need to be considered when the LC is deployed on the cloud

tracking issues of pod template support #19

  • refactor workerSpec of API/implementation into pod template
  • change CRD of all features, and we can avoid this by kube-builder #11
  • update the installation doc
  • update the local-up script
  • update the proposal
  • update all three examples
  • add all example dockerfiles

decouple collaboration feature code and public code

What would you like to be added/modified:

  1. Partitioning folders at "sedna/pkg/globalmanager/" by collaboration feature and place the corresponding code.
  2. Partitioning folders at "sedna/pkg/localcontroller/" by collaboration feature and place the corresponding code.
  3. Decoupling the upstream file by collaboration feature and putting the corresponding code into the collaboration feature folder
  4. Decoupling the downstream file by collaboration feature and putting the corresponding code into the collaboration feature folder
  5. Any other modifications after decoupling

Why is this needed:
Currently, multiple collaboration features are available, and more may be added in the future. Decoupling facilitates the development of the developer ecosystem; for example, after decoupling, developers can easily integrate more features.

The trainingWorkers design in federatedlearningjob.yaml

Hi, I do not quite understand the YAML design here. Why are there 2 dataset sections and 2 template sections under trainingWorkers?

I guess what you mean is 2 edges with different datasets and different configurations. If this is the case, it makes more sense to me to have 2 worker sections under trainingWorkers and assign the dataset and template to each worker:

trainingWorkers:
  - dataset:
      name: "edge1-surface-defect-detection-dataset"
    template:
      spec:
        nodeName: "edge1"
        containers:
          - image: kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.1.0
            name: train-worker
            imagePullPolicy: IfNotPresent
            env:  # user defined environments
              - name: "batch_size"
                value: "32"
              - name: "learning_rate"
                value: "0.001"
              - name: "epochs"
                value: "1"
            resources:  # user defined resources
              limits:
                memory: 2Gi
  - dataset:
      name: "edge2-surface-defect-detection-dataset"
    template:
      spec:
        nodeName: "edge2"
        containers:
          - image: kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.1.0
            name: train-worker
            imagePullPolicy: IfNotPresent
            env:  # user defined environments
              - name: "batch_size"
                value: "32"
              - name: "learning_rate"
                value: "0.001"
              - name: "epochs"
                value: "1"
            resources:  # user defined resources
              limits:
                memory: 2Gi

Joint_inference cannot be connected to the cloud for analysis

What happened:
When I use the joint_inference/helmet_detection_inference example, there are some errors in the log.
helmet-detection-inference-example-edge-5pg66 logs:
[2021-04-16 02:32:48,383][sedna.joint_inference.joint_inference][ERROR][124]: send request to http://192.168.2.211:30691 failed, error is HTTPConnectionPool(host='192.168.2.211', port=30691): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7efe5409dcf8>: Failed to establish a new connection: [Errno 111] Connection refused',)), retry times: 5 [2021-04-16 02:32:48,384][sedna.joint_inference.joint_inference][WARNING][365]: retrieve cloud infer service failed, use edge result

What you expected to happen:
The above error does not exist

How to reproduce it (as minimally and precisely as possible):
Just follow the example: https://github.com/kubeedge/sedna/blob/main/examples/joint_inference/helmet_detection_inference/README.md

Anything else we need to know?:

Environment:

Sedna Version
$ kubectl get -n sedna deploy gm -o jsonpath='{.spec.template.spec.containers[0].image}'
# paste output here
kubeedge/sedna-gm:v0.1.0
$ kubectl get -n sedna ds lc -o jsonpath='{.spec.template.spec.containers[0].image}'
# paste output here

kubeedge/sedna-lc:v0.1.0

Kubernetes Version
$ kubectl version
# paste output here

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.3", GitCommit:"2e7996e3e2712684bc73f0dec0200d64eec7fe40", GitTreeState:"clean", BuildDate:"2020-05-20T12:52:00Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.3", GitCommit:"2e7996e3e2712684bc73f0dec0200d64eec7fe40", GitTreeState:"clean", BuildDate:"2020-05-20T12:43:34Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

KubeEdge Version
$ cloudcore --version
# paste output here
1.6
$ edgecore --version
# paste output here

1.6

CloudSide Environment:

Hardware configuration
$ lscpu
# paste output here
OS
$ cat /etc/os-release
# paste output here
Kernel
$ uname -a
# paste output here
Others

EdgeSide Environment:

Hardware configuration
$ lscpu
# paste output here
OS
$ cat /etc/os-release
# paste output here
Kernel
$ uname -a
# paste output here
Others

Automatically pushing docker-images when creating a release

What would you like to be added/modified:
Add a GitHub Action for automatically pushing docker images when creating a release.
Why is this needed:

  1. push images automatically instead of manually
  2. the pushed image stays consistent with the code of the release
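
A rough sketch of such a workflow, for illustration only (image names, Dockerfile paths, and secret names are placeholders; the real workflow would need to cover every Sedna image):

# .github/workflows/release-images.yml (sketch)
name: push-images-on-release
on:
  release:
    types: [published]
jobs:
  push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}   # placeholder secret names
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - uses: docker/build-push-action@v4
        with:
          context: .
          file: build/gm/Dockerfile                     # placeholder path
          push: true
          tags: kubeedge/sedna-gm:${{ github.event.release.tag_name }}

Tagging with the release tag keeps the pushed image consistent with the released code.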

Feature worker spec should support ConfigMap

What would you like to be added/modified:
Feature worker spec configuration information should be injectable through a ConfigMap as well as environment variables. (Currently, only environment variables are supported.)

Why is this needed:
Developers build their application into the worker container. A ConfigMap is the equivalent of the native application's configuration file; there can be one or more configuration files.
After the container starts, it obtains the ConfigMap content from the host, which is materialized as local files and mapped into a specified directory in the container as a volume.
Applications in the container then read the configuration files from that directory in the usual way.
For the container, the configuration files simply appear in a specific directory inside the container, so the whole process does not intrude on the application. A sketch is shown below.
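
A minimal sketch, assuming a hypothetical ConfigMap name, key, and mount path:

apiVersion: v1
kind: ConfigMap
metadata:
  name: train-worker-config        # hypothetical name
data:
  worker.conf: |
    batch_size = 32
    learning_rate = 0.001
---
# Worker pod template fragment: the ConfigMap is mounted as a read-only volume,
# so the application reads /etc/worker/worker.conf like a normal config file.
spec:
  containers:
    - name: train-worker
      image: kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.1.0
      volumeMounts:
        - name: worker-config
          mountPath: /etc/worker
          readOnly: true
  volumes:
    - name: worker-config
      configMap:
        name: train-worker-config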

Implement automatic model conversion and model serving image selection to support heterogeneous hardware platform

Function description: automatically select the AI serving framework (TensorFlow Serving, Paddle Serving, MindSpore Serving, TensorFlow Lite, Paddle Lite, MNN, OpenVINO, etc.) according to the hardware platform, and automatically perform model format conversion.

Solution:
Provide a k8s operator that realizes the following functions:

  1. Read the edge node hardware type label and select the model serving image. The edge node hardware type is labeled with kubectl as follows:
     kubectl label node edge-node-1 hardware=Ascend
  2. Deploy the model conversion image for model conversion.
  3. Deploy the model serving image.
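
As a sketch of what the operator could generate, the node label can drive both image selection and scheduling via a nodeSelector (the generated pod and image below are hypothetical and only illustrate the idea):

# Hypothetical serving pod generated by the operator for Ascend nodes.
apiVersion: v1
kind: Pod
metadata:
  name: model-serving-ascend
spec:
  nodeSelector:
    hardware: Ascend                              # matches the label applied with kubectl above
  containers:
    - name: serving
      image: example/mindspore-serving:latest     # hypothetical image chosen for Ascend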

[Incremental Learning] Add the support with different `nodeName` of train/eval workers

What would you like to be added/modified:
Currently, in the incremental learning feature, the nodeNames of the dataset and the train/eval workers must be the same.

When the dataset is on shared storage, support for different nodeNames could be added; see the sketch below.

Why is this needed:

  1. In principle we can't require the user to train and evaluate the model on the same node.
  2. The train worker requires many more resources than the eval worker, so they may not be on the same node.
  3. Sometimes the user may need to do evaluation on the same (or a similar) node as the infer worker, e.g. both at the edge.
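
For illustration only, a rough fragment of what this could look like; field names are based on the helmet-detection incremental learning example and should be treated as assumptions:

# Sketch: train and eval workers pinned to different nodes, which requires the
# dataset to live on shared storage (e.g. s3) rather than a host local path.
spec:
  dataset:
    name: "incremental-dataset"                  # hypothetical dataset name
  trainSpec:
    template:
      spec:
        nodeName: "train-node"                   # a resource-rich node
        containers:
          - name: train-worker
            image: example/incremental-train:latest   # hypothetical image
  evalSpec:
    template:
      spec:
        nodeName: "edge-node"                    # a lighter node, close to the infer worker
        containers:
          - name: eval-worker
            image: example/incremental-eval:latest    # hypothetical image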

Contributor Experience Improvement Tracking Issue

One of the top-priority pieces of work for the next few months is to improve the contributor experience and help more new contributors get on board.

This issue is to track the things that need to be done.

General

  • [Docs] Separate user manual and contributor guide docs.
  • [Docs] Code of Conduct

Releases

  • [Docs] Release lifecycle
  • [Tooling] Release automation: automate the process of making a new release.
  • [Docs] Usage of Milestone and Projects view.

Feature guide

  • [Docs] Feature lifecycle
  • [Docs] Proposing Enhancements (features).
  • [Docs] How to write Enhancement (KEP) guide.

Developer self-service

  • [Docs] local env setup (for development), with local-up script already done
  • [Docs] PR flow.
  • [Docs] Debugging guide.
  • [Docs] Reporting Bugs
  • [Tooling] Speed up CI checking.

Add multi-edge collaborative inference support

Motivation

Multi-edge collaborative inference refers to the use of multiple edge computing nodes for collaborative inference. This technology can make full use of the distributed computing resources of edge nodes, reduce the latency of edge AI services, and improve inference accuracy and throughput. It is a key technology of edge intelligence. Therefore, we propose a multi-edge collaborative inference framework to help users easily build multi-edge collaborative AI services based on KubeEdge.

Goals

  • The framework can utilize multiple edge computing nodes for collaborative inference.
  • Utilize KubeEdge's EdgeMesh to realize multi-edge load balancing.
  • Provide typical multi-edge inference case studies (such as ReID, multi-source data fusion, etc.).

Solution

Take pedestrian ReID as an example:

  1. Pedestrian ReID workflow
     (workflow diagram omitted)

The client reads the camera video data and carries out the first stage of inference.
The server receives the region proposals predicted by the client for final target detection and pedestrian feature matching.

  2. CRD example (preliminary design)

apiVersion: sedna.io/v1alpha1
kind: MultiEdgeInferenceService
metadata:
  name: pedestrian-reid
  namespace: default
spec:
  clientWorkers:
    - template:
        spec:
          containers:
            - image: kubeedge/sedna-example-multi-edge-inference-reid-client:v0.1.0
              ......
          nodeSelector:
            ......
  serverWorkers:
    - template:
        metadata:
          labels:
            app: reid-server
        spec:
          containers:
            - image: kubeedge/sedna-example-multi-edge-inference-reid-server:v0.1.0
              ......
          nodeSelector:
            ......
    - template:
        metadata:
          labels:
            app: user-server
        spec:
          containers:
            - image: kubeedge/sedna-example-multi-edge-inference-user-server:v0.1.0
              ......
          nodeSelector:
            ......
    - template:
        metadata:
          labels:
            app: mqtt-server
        spec:
          containers:
            - image: kubeedge/sedna-example-multi-edge-inference-mqtt-server:v0.1.0
              ......
          nodeSelector:
            ......

Open issues

We hope to build a general framework to adapt to a variety of multi-edge collaborative inference scenarios. Therefore, users and developers are welcome to put forward more scenarios, applications and requirements to improve our architecture and CRD interface design.

gm and kb pods are always in pending status

What happened:
Hello, I tried to install Sedna and encountered some problems. I followed the instructions in the Sedna installation document, but the gm and kb pods stayed in Pending status forever.

What you expected to happen:
The gm and kb pods are ready and in Running status.

How to reproduce it (as minimally and precisely as possible):
Follow the instructions in the Sedna installation document.

Anything else we need to know?:
I solved it by adding tolerations to the gm and kb deployments in install.sh. It looks like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kb
...
spec:
...
  template:
  ...
    spec:
    ...
      containers:
      ...
      tolerations:
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
...

Modify the gm deployment in the same way and rerun install.sh; then it works.

What confuses me is that the kb and gm deployments use a nodeSelector of sedna: control-plane, but they can't tolerate the taint node-role.kubernetes.io/master:NoSchedule on the master (cloud) node, which is labeled sedna: control-plane.

Is there any reason?

Lib Refactoring

Description

Some of the current components in the Sedna Lib provide services independently per feature. Each feature provides an independent interface and code structure, which is confusing. To better facilitate those use cases, we need to start thinking about decoupling some of our components. While this is not a specific action item for Sedna, I believe this conversation is worthwhile.

Discussion so far:

  1. Secondary development is difficult: developers cannot replace components with their own customized processing logic (e.g. pre-processing, post-processing, feature engineering, etc.).
  2. Too many variables are defined in baseConfig; since every function has access to them, it becomes increasingly hard to figure out which feature of Sedna actually reads and writes these variables.
  3. Right now the Sedna Lib supports only TensorFlow, which limits the number of possible users. If we can provide a way to extend it to other ML frameworks, it will benefit more developers.
  4. The current feature modules (incremental_learning, joint_inference, federated_learning, etc.) have classes scattered around in a generally unorganized way; it throws users out of their rhythm when they go to learn or use different modules.

Details

During reviews I've seen a few areas where we can decouple concepts:

  1. By using a registry of class-factory functions to emulate virtual constructors, developers can invoke different components by changing variables in the config file.
class ClassFactory(object):
    """A Factory Class to manage all class need to register with config."""

    __registry__ = {}

    @classmethod
    def register(cls, type_name='common', alias=None):
        """Register class into registry.

        :param type_name: type_name: type name of class registry
        :param alias: alias of class name
        :return: wrapper
        """

        def wrapper(t_cls):
            """Register class with wrapper function.

            :param t_cls: class need to register
            :return: wrapper of t_cls
            """
            t_cls_name = alias if alias is not None else t_cls.__name__
            if type_name not in cls.__registry__:
                cls.__registry__[type_name] = {t_cls_name: t_cls}
            else:
                if t_cls_name in cls.__registry__:
                    raise ValueError(
                        "Cannot register duplicate class ({})".format(t_cls_name))
                cls.__registry__[type_name].update({t_cls_name: t_cls})

            return t_cls

        return wrapper

    @classmethod
    def register_cls(cls, t_cls, type_name='common', alias=None):
        """Register class with type name.

        :param t_cls: class need to register.
        :param type_name: type name.
        :param alias: class name.
        :return:
        """
        t_cls_name = alias if alias is not None else t_cls.__name__
        if type_name not in cls.__registry__:
            cls.__registry__[type_name] = {t_cls_name: t_cls}
        else:
            if t_cls_name in cls.__registry__:
                raise ValueError(
                    "Cannot register duplicate class ({})".format(t_cls_name))
            cls.__registry__[type_name].update({t_cls_name: t_cls})
        return t_cls

    @classmethod
    def get_cls(cls, type_name, t_cls_name=None):
        """Get class and bind config to class.

        :param type_name: type name of class registry
        :param t_cls_name: class name
        :return:t_cls
        """
        if not cls.is_exists(type_name, t_cls_name):
            raise ValueError("can't find class type {} class name {} in class registry".format(type_name, t_cls_name))
        if t_cls_name is None:
            raise ValueError("can't find class. class type={}".format(type_name))
        t_cls = cls.__registry__.get(type_name).get(t_cls_name)
        return t_cls
  2. Clean up and redesign the base Config class: each feature maintains its own specific variables, and developers can manually update them.

    class ConfigSerializable(object):
        """Serializable config base class."""
    
        __original__value__ = None
    
        @property
        def __allattr__(self):
            attrs = filter(lambda attr: not (attr.startswith("__") or ismethod(getattr(self, attr))
                                             or isfunction(getattr(self, attr))), dir(self))
            return list(attrs)
    
        def update(self, **kwargs):
            for attr in self.__allattr__:
                if attr not in kwargs:
                    continue
                setattr(self, attr, kwargs[attr])
    
        def to_json(self):
            """Serialize config to a dictionary. It will be very useful in distributed systems. """
            pass
    
        def __getitem__(self, item):
            return getattr(self, item, None)
    
        def get(self, item, default=""):
            return self.__getitem__(item) or default
    
        @classmethod
        def from_json(cls, data):
            """Restore config from a dictionary or a file."""
            pass
    
  3. Refer to tools such as moxing, neuropod, etc.: decouple the ML framework from the features of Sedna, allowing developers to choose their favorite framework.
    A few frameworks which would be nice to support:

  4. Add a common file operation and a unified log format in the common module, use an abstract base class to standardize the feature modules' interfaces, and have features invoked by inheriting from the base class.

    Specifically, the goals are:

    • Clean up the base module's classes and package structure
    • Create a necessary amount of modules based on big features
    • Revise dependency structure for all the modules

Implementation for injecting storage-initializer

In #18, I proposed adding an init-container to download datasets/models before running workers.
Then we need to inject this storage-initializer into workers.

the simple way

The obvious way to implement it is to modify the worker-creation logic of each collaboration feature in GM.
I can abstract the common logic into one func/file.
Its pros: simple and quick.
Its cons: needs to modify the GM.

the more decoupled way

Another good way I found is to leverage the k8s admission webhooks used by KFServing.
Its pros: decoupled from each collaboration feature.
Its cons: adds an extra webhook server; more code work.

What I decided to do now

For simplicity, implement the simple way first, then evolve to the admission-webhook way when needed, since the injecting code can be reused. A rough sketch of the injected result is shown below.
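
For illustration, either approach (GM code or admission webhook) would produce something roughly like the following in the worker pod (the initializer image, bucket URL, and mount path are all hypothetical):

# Hypothetical worker pod fragment with an injected storage-initializer.
spec:
  initContainers:
    - name: storage-initializer
      image: example/sedna-storage-initializer:latest   # hypothetical image
      args:
        - "s3://models/helmet-detection/"               # source (hypothetical)
        - "/mnt/models"                                 # destination shared with the worker
      volumeMounts:
        - name: model-dir
          mountPath: /mnt/models
  containers:
    - name: worker
      image: example/train-worker:latest                # hypothetical worker image
      volumeMounts:
        - name: model-dir
          mountPath: /mnt/models
  volumes:
    - name: model-dir
      emptyDir: {}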

examples: enhance incremental_learning/helmet_detection

tracking the issues of the incremental learning example.

  • the namespace sedna-test should be renamed; default is recommended.
  • there is a useless file named train_data.txt.bak1 in dataset.tar.gz
  • there is a useless directory named base_model/model/ in model.tar.gz
  • need to test the hard-example-mining algorithm because of high hard-example storage
  • with a bad model, the eval worker should report low precision/recall instead of raising an exception
  • alleviate the memory usage of the training worker

Add examples images GPU version

What would you like to be added/modified:
Add GPU versions of the example images.
Why is this needed:
Training and inference of deep learning models on GPU is much quicker than on CPU, so we need to add corresponding GPU versions of our examples.

incremental-learning: data path problem of dataset label file

A label file contains a field specifying the image path; this path can be a relative or an absolute path.
As far as I know, the main case is the relative path, for example:

imgs/black/34290.jpg,1
imgs/black/36347.jpg,1
imgs/white/30049.jpg,0

But in incremental learning, the LC monitors the original dataset, splits it into a train set and a test set, and saves them in another path, so the relative image paths no longer resolve.

Add more e2e testcases

Following #14, which added only one test case (i.e. create dataset), we need to add more e2e test cases:

  • dataset: get/delete
  • model: create/get/delete
  • joint-inference: create/get/delete
  • federated-learning: create/get/delete
  • incremental-learning: create/get/delete

[Enhancement Request] Integrate Plato into Sedna as a backend for supporting federated learning

What would you like to be added/modified:

Plato is a new software framework to facilitate scalable federated learning. So far, Plato already supports PyTorch and MindSpore. Several advantages are summarized as follows:

  1. Simplicity: Plato provides user-friendly APIs.
  2. Scalability: Plato is scalable. Plato also supports running multiple (unlimited) workers, which share one GPU in turn.
  3. Extensibility: Plato manages various machine learning models and aggregation algorithms.
  4. Framework-agnostic: Most of the codebases in Plato can be used with various machine learning libraries.
  5. Hierarchical Design: Plato supports multiple-level cells, including edge-cloud (2 levels) federated learning and device-edge-cloud (3 levels) federated learning.

This proposal discusses how to integrate Plato into Sedna as a backend for supporting federated learning. @li-ch @baochunli @jaypume

Why is this needed:
The motivation of this proposal can be summarized as follows:

  1. Algorithm:
    Sedna (Aggregator) currently supports FedAvg. With Plato, Sedna can choose various aggregation algorithms, such as FedAvg, Adaptive Freezing, Mistnet, and Adaptive sync.
  2. Dataset:
    Sedna needs to manually prepare the user data. With Plato, it can provide a "datasources" module, including various public datasets (e.g., cifar10, cinic10, and coco). Non-iid samplers could also be supported.
  3. Model:
    Sedna specifies the model in the images as a file. It uploads the whole model to the server. With Plato, it can specify all models as user configurations. The Report class can help the worker to determine the strategy of uploading gradients for fast convergence, such as Adaptive Freezing, Nova, Sarah, Mistnet, and so on.

Plans:

  1. Overview
    Sedna aims to provide the following federated learning features:

    • Write easy and short configuration files in Sedna to support flexible federated learning setups.
    • It should handle real datasets in the industry and simulate a non-iid version of public standard dataset in academia.
    • It should consider how to configure a customized model.

    Therefore, two resources are updated:

    • Dataset: The definition of Dataset
    • Model: The definition of model

    Configuration updates in aggregationWorker and trainingWorkers:

    apiVersion: sedna.io/v1alpha1
    kind: FederatedLearningJob
    metadata:
      name: surface-defect-detection
    spec:
      aggregationWorker:
        # read and write
        model:
          name: "surface-defect-detection-model"
        platoConfig: 
          url: "sdd_rcnn.yml" # stored in S3 or github
        template:
          spec:
            nodeName: $CLOUD_NODE
            containers:
              - image: kubeedge/sedna-example-federated-learning-surface-defect-detection-aggregation:v0.1.0
                name:  agg-worker
                imagePullPolicy: IfNotPresent
                # env: # user defined environments
                resources:  # user defined resources
                  limits:
                    memory: 2Gi
        - dataset:
            name: "cloud-surface-defect-detection-dataset"
    
      trainingWorkers:
        # read only
        model:
          name: "surface-defect-detection-model"
        - dataset:
            name: "edgeX-surface-defect-detection-dataset"
          template:
            spec:
              nodeName: $EDGE1_NODE
              containers:
                - image: kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.1.0
                  name:  train-worker
                  imagePullPolicy: IfNotPresent
                  env:  # user defined environments or given by the GlobalManager. 
                    - name: "server_ip"
                      value: "localhost"
                    - name: "server_port"
                      value: "8000"
                  resources:  # user defined resources
                    limits:
                      memory: 2Gi
  2. How to write Plato code in Sedna
    The users only need to prepare the configuration file in public storage. The Plato code lives in the Sedna libraries.
    An example of a configuration file, sdd_rcnn.yml:

    clients:
     # Type
     type: mistnet
     # The total number of clients
     total_clients: 1
     # The number of clients selected in each round
     per_round: 1
     # Should the clients compute test accuracy locally?
     do_test: false
    
    # this will be discarded in the future
    # server:
    #  address: localhost
    #  port: 8000
    
    data:
     datasource: sednaDataResource
     # Number of samples in each partition
     partition_size: 128
     # IID or non-IID?
     sampler: iid
    
    trainer:
     # The type of the trainer
     type: yolov5
     # The maximum number of training rounds
     rounds: 1
     # Whether the training should use multiple GPUs if available
     parallelized: false
     # The maximum number of clients running concurrently
     max_concurrency: 3
     # The target accuracy
     target_accuracy: 0.99
     # Number of epoches for local training in each communication round
     epochs: 500
     batch_size: 16
     optimizer: SGD
     linear_lr: false
     # The machine learning model
     model_name: sednaModelResource
    
    algorithm:
     # Aggregation algorithm
     type: mistnet
     cut_layer: 4
     epsilon: 100
  3. How to integrate the Dataset in Plato
    In this part, several functions are added to Dataset.

    apiVersion: sedna.io/v1alpha1
    kind: Dataset
    metadata:
      name: "edge1-surface-defect-detection-dataset"
    spec:
      name: COCO
      data_params: packages/coco128.yaml
      # if download_url is None, the data should be stored in disk by default
      download_url: https://github.com/ultralytics/yolov5/releases/download/v1.0/coco128.zip 
      data_path: ./data/
        train_path: ./data/COCO/coco128/images/train2017/
        test_path: ./data/COCO/coco128/images/train2017/
      # number of classes
      num_classes: 80
      # image size
      image_size: 640
      # class names
      classes:
          [
              "person",
              "bicycle",
              ...
          ]
      # remark
      format: ""
      nodeName: $EDGE1_NODE
  4. How to integrate the Models management tools in Plato
    In this part, several functions are added to Model.

    apiVersion: sedna.io/v1alpha1
    kind: Model
    metadata:
      name: "surface-defect-detection-model"
    spec:
      model_name: vgg_16
      url: "/model"
      # ONNX (https://onnx.ai/) or specify a framework 
      format: "ckpt"
      framework: "PyTorch"
      model_config: packages/models/vgg_16.yaml
      # if true, the model needs to be loaded from url before training
      pretrained: True

To-Do Lists

  • Enhance aggregationWorker and trainingWorkers interfaces
  • Datasets interface
  • Models management
  • Examples and demo presentation
    • CV: yolo-v5 demo in KubeEdge
    • NLP: huggingface demo in KubeEdge

Incremental learning supports hot model updates.

What would you like to be added/modified:
Currently, models are updated by restarting the infer worker. Incremental learning should support hot model updates rather than updates only through restarts.

Why is this needed:
The inference engine supports model reloading, so it is hoped that dynamic updates can be supported.

the issue of downloading the raw images in dataset

In #18, I proposed adding an init-container to download datasets/models before running workers.

In the incremental-learning example, the user creates a dataset with the url s3://helmet_detection/train_data/train_data.txt, where train_data.txt only contains the label info, not the image blobs.

So where, and by which component, are these images downloaded?

lifelong learning enhancements tracking issue

This issue is to track known enhancements to be done for lifelong learning, which was added in #72:

  • The KB server needs its own standalone code directory instead of being placed in the lib directory
  • The KB API needs to be reviewed again (such as /file/download, /file/upload)
  • Reduce the KB image size (v0.3.0 is 1.28G as shown by docker images)
  • In the multi-task learning lib code, add docs and code comments
  • Remove the scikit-learn requirement for the federated-learning surface-defect-detection example
  • Reduce the lifelong example image size (v0.3.0 is 1.36G)
  • Review the requirements of lib

Add shared storage support for dataset/model

What would you like to be added:

Add shared storage support for dataset/model, such as s3/http protocols.

Why is this needed:

Currently only dataset/model URIs with host local paths are supported, which limits cross-node model training/serving.
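
For illustration, a Model resource referencing shared storage could then look roughly like this (the bucket path is hypothetical; field names follow the existing Model examples in this repo):

apiVersion: sedna.io/v1alpha1
kind: Model
metadata:
  name: helmet-detection-inference-big-model
spec:
  url: "s3://models/helmet-detection/yolov3_darknet.pb"   # hypothetical bucket path
  format: "pb"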

Edge AI Benchmark review

Edge AI Benchmark Draft pdf is here

If you have comments about this proposal, you can directly reply under this issue.
If you want to join the benchmark writing, you can contact Dr. Zimu Zheng.

Dr. Zimu Zheng
Email: [email protected]
WeChat ID: moodzheng


There is something wrong when I deploy the jointinferenceservice example...

What happened?

I successfully deployed Sedna by following the installation documentation; then I saw the gm and lc pods running, like this:

archlab@cloud-master:~/gopath/src$ kubectl get pod -n sedna -o wide
NAME                 READY   STATUS    RESTARTS   AGE   IP               NODE           NOMINATED NODE   READINESS GATES
gm-f58b846ff-mlltx   1/1     Running   0          27h   10.244.0.3       cloud-master   <none>           <none>
lc-g799f             1/1     Running   0          27h   192.168.30.207   cloudnode-1    <none>           <none>
lc-l7m5t             1/1     Running   0          27h   192.168.30.206   cloud-master   <none>           <none>
lc-q7jhf             1/1     Running   0          27h   192.168.60.36    edgenode2201   <none>           <none>

Then I tried the Helmet Detection experiment, and I built the images and models we needed, like this:

archlab@cloud-master:~/gopath/src$ kubectl get Model -A
NAMESPACE   NAME                                      AGE
default     helmet-detection-inference-big-model      9h
default     helmet-detection-inference-little-model   8h

archlab@cloud-master:~/gopath/src$ kubectl get jointinferenceservices.sedna.io
NAME                                 AGE
helmet-detection-inference-example   8h

Finally I mocked a video stream for inference on the edge side:

archlab001@edgenode2201:~/joint_inference_example/data/video$ sudo ffmpeg -re -i video.mp4 -vcodec libx264 -strict -2 -f rtsp rtsp://localhost/video 
sudo: unable to resolve host edgenode2201
ffmpeg version 2.8.17-0ubuntu0.1 Copyright (c) 2000-2020 the FFmpeg developers
  built with gcc 5.4.0 (Ubuntu 5.4.0-6ubuntu1~16.04.12) 20160609
  configuration: --prefix=/usr --extra-version=0ubuntu0.1 --build-suffix=-ffmpeg --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --cc=cc --cxx=g++ --enable-gpl --enable-shared --disable-stripping --disable-decoder=libopenjpeg --disable-decoder=libschroedinger --enable-avresample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libmodplug --enable-libmp3lame --enable-libopenjpeg --enable-libopus --enable-libpulse --enable-librtmp --enable-libschroedinger --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxvid --enable-libzvbi --enable-openal --enable-opengl --enable-x11grab --enable-libdc1394 --enable-libiec61883 --enable-libzmq --enable-frei0r --enable-libx264 --enable-libopencv
  libavutil      54. 31.100 / 54. 31.100
  libavcodec     56. 60.100 / 56. 60.100
  libavformat    56. 40.101 / 56. 40.101
  libavdevice    56.  4.100 / 56.  4.100
  libavfilter     5. 40.101 /  5. 40.101
  libavresample   2.  1.  0 /  2.  1.  0
  libswscale      3.  1.101 /  3.  1.101
  libswresample   1.  2.101 /  1.  2.101
  libpostproc    53.  3.100 / 53.  3.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'video.mp4':
  Metadata:
    major_brand     : mp42
    minor_version   : 0
    compatible_brands: isommp42
    creation_time   : 2019-03-31 14:46:46
  Duration: 00:15:06.00, start: 0.000000, bitrate: 1430 kb/s
    Stream #0:0(und): Video: h264 (Main) (avc1 / 0x31637661), yuv420p(tv, bt709), 1280x720 [SAR 1:1 DAR 16:9], 1298 kb/s, 23.98 fps, 23.98 tbr, 24k tbn, 47.95 tbc (default)
    Metadata:
      creation_time   : 2019-03-31 14:46:46
      handler_name    : ISO Media file produced by Google Inc. Created on: 03/31/2019.
    Stream #0:1(eng): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 128 kb/s (default)
    Metadata:
      creation_time   : 2019-03-31 14:46:46
      handler_name    : ISO Media file produced by Google Inc. Created on: 03/31/2019.
[libx264 @ 0x750740] using SAR=1/1
[libx264 @ 0x750740] using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2
[libx264 @ 0x750740] profile High, level 3.1
[libx264 @ 0x750740] 264 - core 148 r2643 5c65704 - H.264/MPEG-4 AVC codec - Copyleft 2003-2015 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=6 lookahead_threads=1 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=23 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=23.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00
Output #0, rtsp, to 'rtsp://localhost/video':
  Metadata:
    major_brand     : mp42
    minor_version   : 0
    compatible_brands: isommp42
    encoder         : Lavf56.40.101
    Stream #0:0(und): Video: h264 (libx264), yuv420p, 1280x720 [SAR 1:1 DAR 16:9], q=-1--1, 23.98 fps, 90k tbn, 23.98 tbc (default)
    Metadata:
      creation_time   : 2019-03-31 14:46:46
      handler_name    : ISO Media file produced by Google Inc. Created on: 03/31/2019.
      encoder         : Lavc56.60.100 libx264
    Stream #0:1(eng): Audio: aac, 44100 Hz, stereo, fltp, 128 kb/s (default)
    Metadata:
      creation_time   : 2019-03-31 14:46:46
      handler_name    : ISO Media file produced by Google Inc. Created on: 03/31/2019.
      encoder         : Lavc56.60.100 aac
Stream mapping:
  Stream #0:0 -> #0:0 (h264 (native) -> h264 (libx264))
  Stream #0:1 -> #0:1 (aac (native) -> aac (native))
Press [q] to stop, [?] for help
frame=21722 fps= 24 q=-1.0 Lsize=N/A time=00:15:05.99 bitrate=N/A    
video:313537kB audio:19168kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
[libx264 @ 0x750740] frame I:172   Avg QP:21.53  size: 54827
[libx264 @ 0x750740] frame P:11817 Avg QP:24.38  size: 22320
[libx264 @ 0x750740] frame B:9733  Avg QP:27.95  size:  4919
[libx264 @ 0x750740] consecutive B-frames: 21.3% 49.7% 21.3%  7.6%
[libx264 @ 0x750740] mb I  I16..4: 12.7% 70.1% 17.2%
[libx264 @ 0x750740] mb P  I16..4:  4.6% 12.8%  2.5%  P16..4: 44.0% 15.3%  6.2%  0.0%  0.0%    skip:14.7%
[libx264 @ 0x750740] mb B  I16..4:  0.3%  0.5%  0.2%  B16..8: 39.8%  4.2%  0.8%  direct: 2.7%  skip:51.5%  L0:45.2% L1:47.5% BI: 7.3%
[libx264 @ 0x750740] 8x8 transform intra:64.3% inter:63.0%
[libx264 @ 0x750740] coded y,uvDC,uvAC intra: 53.6% 36.5% 3.6% inter: 21.1% 9.3% 0.1%
[libx264 @ 0x750740] i16 v,h,dc,p: 23% 30% 11% 37%
[libx264 @ 0x750740] i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 23% 23% 19%  4%  5%  5%  7%  5%  8%
[libx264 @ 0x750740] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 24% 22% 15%  5%  8%  7%  9%  5%  6%
[libx264 @ 0x750740] i8c dc,h,v,p: 65% 17% 13%  4%
[libx264 @ 0x750740] Weighted P-Frames: Y:21.0% UV:4.7%
[libx264 @ 0x750740] ref P L0: 59.7% 18.9% 15.5%  5.3%  0.6%
[libx264 @ 0x750740] ref B L0: 85.8% 13.6%  0.5%
[libx264 @ 0x750740] ref B L1: 95.8%  4.2%
[libx264 @ 0x750740] kb/s:2835.01

After these steps, I can't see any container or pod running. Why? And there are no results in my output dir.

My Environment Settings

  • create_big_model_resources.yaml
apiVersion: sedna.io/v1alpha1
kind:  Model
metadata:
  name: helmet-detection-inference-big-model
  namespace: default
spec:
  url: "/home/archlab/gopath/src/joint_inference_example/data/big-model/yolov3_darknet.pb"
  format: "pb"
  • create_little_model_resources.yaml
apiVersion: sedna.io/v1alpha1
kind: Model
metadata:
  name: helmet-detection-inference-little-model
  namespace: default
spec:
  url: "/home/archlab/gopath/src/joint_inference_example/data/little-model/yolov3_resnet18.pb"
  format: "pb"
  • create_joint_inference_service.yaml
apiVersion: sedna.io/v1alpha1
kind: JointInferenceService
metadata:
  name: helmet-detection-inference-example
  namespace: default
spec:
  edgeWorker:
    model:
      name: "helmet-detection-inference-little-model"
    hardExampleMining:
      name: "IBT"
      parameters:
        - key: "threshold_img"
          value: "0.9"
        - key: "threshold_box"
          value: "0.9"
    template:
      spec:
        nodeName: edgenode2201
        containers:
        - image: kubeedge/sedna-example-joint-inference-helmet-detection-little:v0.1.0
          imagePullPolicy: IfNotPresent
          name:  little-model
          env:  # user defined environments
          - name: input_shape
            value: "416,736"
          - name: "video_url"
            value: "rtsp://localhost/video"
          - name: "all_examples_inference_output"
            value: "/data/output"
          - name: "hard_example_cloud_inference_output"
            value: "/data/hard_example_cloud_inference_output"
          - name: "hard_example_edge_inference_output"
            value: "/data/hard_example_edge_inference_output"
          resources:  # user defined resources
            requests:
              memory: 64M
              cpu: 100m
            limits:
              memory: 2Gi
          volumeMounts:
            - name: outputdir
              mountPath: /data/
        volumes:   # user defined volumes
          - name: outputdir
            hostPath:
              # user must create the directory in host
              path: /home/archlab001/joint_inference_example/joint_inference/output
              type: Directory

  cloudWorker:
    model:
      name: "helmet-detection-inference-big-model"
    template:
      spec:
        nodeName: cloud-master
        containers:
          - image: kubeedge/sedna-example-joint-inference-helmet-detection-big:v0.1.0
            name:  big-model
            imagePullPolicy: IfNotPresent
            env:  # user defined environments
              - name: "input_shape"
                value: "544,544"
            resources:  # user defined resources
              requests:
                memory: 2Gi

And my dir is:

# In Cloud Side
archlab@cloud-master:~/gopath/src/joint_inference_example$ ls
create_big_model_resources.yaml      create_little_model_resources.yaml  joint_inference
create_joint_inference_service.yaml  data
archlab@cloud-master:~/gopath/src/joint_inference_example$ pwd
/home/archlab/gopath/src/joint_inference_example
# In Edge Side
archlab@cloud-master:~/gopath/src/joint_inference_example/joint_inference/output$ pwd
/home/archlab/gopath/src/joint_inference_example/joint_inference/output

Add arm support

Add Sedna support for the platforms below:

  1. Raspberry Pi
  2. Arm64 Server

How does federated learning server and client communicate through KubeEdge?

Hi,

The server and client of federated learning communicate through a websocket in the example provided here:

f"ws://{self.config.bind_ip}:{self.config.bind_port}")

However, I cannot find the settings that actually bind the IP and port to environment variables. I am also wondering (1) whether the websocket goes through KubeEdge or is a separate connection, and if so, (2) how the websocket connection actually runs over the KubeEdge network.
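
For reference, the Plato-based proposal earlier in this list passes the aggregation address to the train worker as plain environment variables; a fragment like the one below (copied from that example) is what the lib could read. Whether Sedna/KubeEdge injects these automatically, or the user must set them, is exactly the open question here:

env:  # user defined environments or given by the GlobalManager
  - name: "server_ip"
    value: "localhost"
  - name: "server_port"
    value: "8000"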

Thank you.

Add pod template like support for worker spec

What would you like to be added:

Add pod template like support for worker spec

Why is this needed:

Current state of the worker spec definition:

 type WorkerSpec struct {
    ScriptDir        string     `json:"scriptDir"`
    ScriptBootFile   string     `json:"scriptBootFile"`
    FrameworkType    string     `json:"frameworkType"`
    FrameworkVersion string     `json:"frameworkVersion"`
    Parameters       []ParaSpec `json:"parameters"`
 }

 // ParaSpec is a description of a parameter
 type ParaSpec struct {
    Key   string `json:"key"`
    Value string `json:"value"`
 }
  1. ScriptDir/ScriptBootFile is the entrypoint of worker, localpath or shared storage(e.g. s3).
  2. FrameworkType/FrameworkVersion specifies the base container image of worker.
  3. Parameters specifies the environment of worker.

pros:

  1. simple for demos

cons:

  1. need to copy the code script to all known nodes manually.
  2. doesn't support docker-container capabilities: code version management, distribution, etc.
  3. doesn't support k8s-pod-like features: resource limits, user-defined volumes, etc.
  4. needs shared storage (e.g. s3) for the code if it is not a local path.
  5. needs a base image rebuild if the current base image can't satisfy the user's
    requirements (user-defined code package dependencies, or a new framework),
    and then the GM configuration has to be re-edited and GM restarted.
    A pod-template-like worker spec, sketched below, would address most of these.
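
For contrast, a pod-template-like worker spec could look roughly like the following (a sketch only; it mirrors the templates used in the other issues above, and the image name is hypothetical):

# Sketch: the container image replaces FrameworkType/FrameworkVersion plus the
# copied ScriptDir, env replaces Parameters, and resources/volumes come for free.
workerSpec:
  template:
    spec:
      containers:
        - name: train-worker
          image: example/train-worker:v0.1.0    # hypothetical image with the code baked in
          env:
            - name: "batch_size"
              value: "32"
          resources:
            limits:
              memory: 2Gi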
