
akri's People

Contributors

adithyaj, akri-bot, ammmze, bfjelds, brendandburns, britel, charleszheng44, dazwilkin, diconico07, didier-durand, edrickwong, evrardjp, flynnduism, gauravgahlot, github-actions[bot], harrison-tin, hernangatta, jbpaux, jiayihu, jiria, johnsonshih, karok2m, kate-goldenring, koutselakismanos, ragnyll, rishit-dagli, romoh, shantanoo-desai, web-flow, yujinkim-msft


akri's Issues

Allow running Akri controller on non-dedicated control plane nodes

Is your feature request related to a problem? Please describe.
The Akri Helm chart allows specifying whether to run the controller onlyOnMaster. But for that, Akri expects a specific label to be present on the control plane nodes. Not all K8s distributions have such a label. Also, with AKS, you cannot deploy onto control plane nodes.

Is your feature request related to a way you would like Akri extended? Please describe.
Akri should be able to run on clusters without the specific label that Akri is looking for and also on clusters where users cannot deploy onto CP nodes.

Describe the solution you'd like
Consider moving the toleration outside of the onlyOnMaster condition in the Helm chart. This way, Akri could run on a control plane node, but not exclusively; with onlyOnMaster set, it would run only on control plane nodes.

This would not allow disabling the option to run on CP, but I think that can be solved later based on feedback.
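A rough sketch of the idea against the Helm template (the flag name allowOnControlPlane and the exact template structure are illustrative assumptions, not the actual chart): emit the toleration whenever running on control plane nodes is allowed, and keep only the node selector gated on onlyOnMaster.

{{- if .Values.controller.allowOnControlPlane }}
tolerations:
  # Let the controller schedule onto control plane nodes, without requiring it.
  - key: node-role.kubernetes.io/master
    effect: NoSchedule
{{- end }}
{{- if .Values.controller.onlyOnMaster }}
nodeSelector:
  # Restrict to control plane nodes only when explicitly requested.
  node-role.kubernetes.io/master: ""
{{- end }}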

[Extensibility] ZeroConf

Is your feature request related to a problem? Please describe.

I've begun prototyping (!) a ZeroConf protocol implementation for Akri

Is your feature request related to a way you would like Akri extended? Please describe.

ZeroConf is probably (!) a useful protocol implementation particularly due to its reliable naming of (transient) devices.

Describe the solution you'd like

Akri protocol implementation with Broker samples.

DOCS: specify a minimum kubernetes version to work.

Is your feature request related to a problem? Please describe.
No one likes it when their 1.14 cluster won't work. Make it clear which versions DO work, and from which version onward.

Is your feature request related to a way you would like Akri extended? Please describe.
DOCUMENT the minimum version.

Describe the solution you'd like
DOCUMENT the minimum version.


Enable Instance.shared to be determined by Configuration

Every Instance CRD has a shared property, which determines whether the device/capability can be shared by multiple nodes. If an instance is not shared, every node that discovers it will create a new Instance CRD, while if it is shared, that Instance CRD will be shared by all nodes that can discover the instance/capability.

Currently the sharability of an instance is defined by the protocol that was used to discover it. For example, currently, all instances (more specifically ip cameras in this protocol) discovered by the ONVIF protocol are marked as shared, as can be seen in the ONVIF DiscoveryHandler's implementation of are_shared(). Similarly, all instances discovered with udev are marked as unshared, since the udev protocol only discovers devices on a specific node.

Should the "is this shared?" decision be made elsewhere, such as in the Configuration, allowing an operator to specify sharability? Should the "is this shared?" decision be made on a device by device case?

One thing to note about the validity of the current solution of defining sharability by protocol: Right now, if ONVIF cameras were allowed to be unshared, it could result in multiple instances for a single camera, making it hard to regulate capacity and possibly leading to an overloaded device. Also, having two Instance CRDs for a single device seems counter-intuitive.
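For illustration only, a hypothetical Configuration snippet if sharability were moved into the Configuration; the shared field shown here does not exist today, and its name and placement are assumptions:

apiVersion: akri.sh/v0
kind: Configuration
metadata:
  name: akri-onvif
spec:
  protocol:
    onvif:
      discoveryTimeoutSeconds: 1
  # Hypothetical: operator-specified sharability, instead of the protocol's
  # are_shared() implementation deciding it.
  shared: true
  capacity: 2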

Akri udev ATTRS ignored

Describe the bug
Based on the Akri udev grammar it seems that ATTRS might be supported, but when I pass ATTRS as part of my udev rule, the Agent ignores it:

[2020-11-08T09:04:14Z TRACE agent::protocols::udev::discovery_impl] parse_udev_rule - unsupported field ATTRS{manufacturer}

Is v2 not yet supported?

Output of kubectl get pods,akrii,akric -o wide
The agent and controller are running and my configuration is applied, but too many instances are found because the ATTRS parts of the rule are ignored.

Kubernetes Version: K3s

To Reproduce
Supplying a rule such as SUBSYSTEM=="tty"\, ATTRS{manufacturer}=="Silicon Labs"\, ATTRS{idProduct}=="ea60" matches all tty devices.

Expected behavior
Supplying a rule such as SUBSYSTEM=="tty"\, ATTRS{manufacturer}=="Silicon Labs"\, ATTRS{idProduct}=="ea60" should only match tty devices that have matching manufacturer and idProduct attributes on the device or its parents.

Logs (please share snips of applicable logs)
Snippet from Agent log:

[2020-11-08T09:04:14Z TRACE agent::util::config_action] do_periodic_discovery - start for config akri-udev
[2020-11-08T09:04:14Z TRACE agent::util::config_action] do_periodic_discovery - loop iteration for config akri-udev
[2020-11-08T09:04:14Z TRACE agent::protocols::udev::discovery_handler] discover - for udev rules ["SUBSYSTEM==\"tty\", ATTRS{manufacturer}==\"Silicon Labs\", ATTRS{idProduct}==\"ea60\""]
[2020-11-08T09:04:14Z INFO  agent::protocols::udev::discovery_impl] parse_udev_rule - enter for udev rule string SUBSYSTEM=="tty", ATTRS{manufacturer}=="Silicon Labs", ATTRS{idProduct}=="ea60"
[2020-11-08T09:04:14Z TRACE agent::protocols::udev::discovery_impl] parse_udev_rule - parsing udev_rule "SUBSYSTEM==\"tty\", ATTRS{manufacturer}==\"Silicon Labs\", ATTRS{idProduct}==\"ea60\""
[2020-11-08T09:04:14Z TRACE agent::protocols::udev::discovery_impl] parse_udev_rule - unsupported field ATTRS{manufacturer}
[2020-11-08T09:04:14Z TRACE agent::protocols::udev::discovery_impl] parse_udev_rule - unsupported field ATTRS{idProduct}
[2020-11-08T09:04:14Z TRACE agent::protocols::udev::discovery_impl] find_devices - enter with udev_filters [UdevFilter { field: Pair { rule: subsystem, span: Span { str: "SUBSYSTEM", start: 0, end: 9 }, inner: [] }, operation: equality, value: "tty" }]
[2020-11-08T09:04:14Z TRACE agent::protocols::udev::discovery_impl] enumerator_match_udev_filters - enter with udev_filters [UdevFilter { field: Pair { rule: subsystem, span: Span { str: "SUBSYSTEM", start: 0, end: 9 }, inner: [] }, operation: equality, value: "tty" }]
[2020-11-08T09:04:14Z TRACE agent::protocols::udev::discovery_impl] enumerator_nomatch_udev_filters - enter with udev_filters []
[2020-11-08T09:04:14Z TRACE agent::protocols::udev::discovery_impl] filter_by_remaining_udev_filters - enter with udev_filters []
[2020-11-08T09:04:14Z TRACE agent::protocols::udev::discovery_impl] do_parse_and_find - returning discovered devices with devpaths: ["/dev/ttyAMA0", "/dev/ttyUSB0", "/dev/console", "/dev/ptmx", "/dev/tty", "/dev/tty0", "/dev/tty1", "/dev/tty10", "/dev/tty11", "/dev/tty12", "/dev/tty13", "/dev/tty14", "/dev/tty15", "/dev/tty16", "/dev/tty17", "/dev/tty18", "/dev/tty19", "/dev/tty2", "/dev/tty20", "/dev/tty21", "/dev/tty22", "/dev/tty23", "/dev/tty24", "/dev/tty25", "/dev/tty26", "/dev/tty27", "/dev/tty28", "/dev/tty29", "/dev/tty3", "/dev/tty30", "/dev/tty31", "/dev/tty32", "/dev/tty33", "/dev/tty34", "/dev/tty35", "/dev/tty36", "/dev/tty37", "/dev/tty38", "/dev/tty39", "/dev/tty4", "/dev/tty40", "/dev/tty41", "/dev/tty42", "/dev/tty43", "/dev/tty44", "/dev/tty45", "/dev/tty46", "/dev/tty47", "/dev/tty48", "/dev/tty49", "/dev/tty5", "/dev/tty50", "/dev/tty51", "/dev/tty52", "/dev/tty53", "/dev/tty54", "/dev/tty55", "/dev/tty56", "/dev/tty57", "/dev/tty58", "/dev/tty59", "/dev/tty6", "/dev/tty60", "/dev/tty61", "/dev/tty62", "/dev/tty63", "/dev/tty7", "/dev/tty8", "/dev/tty9", "/dev/ttyprintk"]
[2020-11-08T09:04:14Z TRACE agent::protocols::udev::discovery_handler] discover - mapping and returning devices at devpaths {"/dev/tty41", "/dev/tty24", "/dev/tty20", "/dev/tty37", "/dev/tty51", "/dev/tty38", "/dev/tty17", "/dev/tty18", "/dev/tty49", "/dev/tty5", "/dev/tty4", "/dev/tty0", "/dev/tty63", "/dev/tty43", "/dev/tty48", "/dev/tty39", "/dev/tty46", "/dev/tty28", "/dev/tty11", "/dev/tty54", "/dev/tty36", "/dev/tty30", "/dev/tty52", "/dev/tty16", "/dev/tty2", "/dev/tty45", "/dev/tty61", "/dev/tty25", "/dev/tty", "/dev/tty59", "/dev/tty42", "/dev/tty29", "/dev/tty47", "/dev/tty56", "/dev/tty55", "/dev/ttyUSB0", "/dev/tty44", "/dev/tty50", "/dev/tty33", "/dev/tty13", "/dev/tty58", "/dev/tty31", "/dev/ttyAMA0", "/dev/tty6", "/dev/tty10", "/dev/tty8", "/dev/tty35", "/dev/ttyprintk", "/dev/tty32", "/dev/tty14", "/dev/tty60", "/dev/tty23", "/dev/tty27", "/dev/tty40", "/dev/tty3", "/dev/tty7", "/dev/ptmx", "/dev/tty1", "/dev/tty19", "/dev/tty15", "/dev/tty53", "/dev/tty62", "/dev/tty22", "/dev/tty9", "/dev/tty21", "/dev/tty26", "/dev/tty57", "/dev/tty34", "/dev/console", "/dev/tty12"}

'unknown flag: --set agent.host.crictl' on Akri's helm chart install

Describe the bug

I am following the install procedure of end-to-end demo: when executing step 1 of this section, I get

Error: unknown flag: --set agent.host.crictl

I have set variable AKRI_HELM_CRICTL_CONFIGURATION properly (I hope...) via
AKRI_HELM_CRICTL_CONFIGURATION='--set agent.host.crictl=/usr/local/bin/crictl --set agent.host.dockerShimSock=/var/snap/microk8s/common/run/containerd.sock'
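One guess (not verified against the script): if that variable is expanded inside double quotes, the shell passes the whole string as a single argument and helm reports its first token as an unknown flag. A minimal sketch of the difference:

# Quoted expansion passes ONE argument, so helm sees an unknown flag:
#   "--set agent.host.crictl=... --set agent.host.dockerShimSock=..."
microk8s helm3 install akri akri-helm-charts/akri "$AKRI_HELM_CRICTL_CONFIGURATION"

# Unquoted expansion (or an argument array) passes the flags individually:
microk8s helm3 install akri akri-helm-charts/akri $AKRI_HELM_CRICTL_CONFIGURATION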

The akri helm repo was properly added via microk8s helm3 repo add 'akri-helm-charts' 'https://deislabs.github.io/akri/' on which I get "akri-helm-charts" has been added to your repositories

So, what do I do wrong?

Thanks for your help

Didier

Output of kubectl get pods,akrii,akric -o wide

I did not reach this stage yet

Kubernetes Version: [e.g. Native Kubernetes 1.19, MicroK8s 1.19, Minikube 1.19, K3s]

microk8s (1.19/edge) v1.19.4 from Canonical* installed
microk8s          v1.19.4    1810   1.19/edge        canonical*         classic

1810 is the version of installed snap

To Reproduce

run our script at https://github.com/didier-durand/microk8s-akri/blob/main/sh/microk8s-akri.sh or fork repo and run workflow https://github.com/didier-durand/microk8s-akri/blob/main/.github/workflows/microk8s-akri.yml

Expected behavior
Proper installation of Helm chart to be able to continue further in the demo install

Logs (please share snips of applicable logs)

### install akri chart: 
2020-11-20T06:52:02.4560237Z Error: no repositories to show
2020-11-20T06:52:02.8022500Z "akri-helm-charts" has been added to your repositories
2020-11-20T06:52:02.8085732Z critctl path: /usr/local/bin/crictl
2020-11-20T06:52:02.9648848Z Error: unknown flag: --set agent.host.crictl

Full log (Github interactive interface): https://github.com/didier-durand/microk8s-akri/actions/runs/373835396

Full log (text form) available at https://pipelines.actions.githubusercontent.com/YT1CGW2Wkplohu80t7gWKaYhzh929U5dpGIo69C98k0YO1N9dm/_apis/pipelines/1/runs/25/signedlogcontent/3?urlExpires=2020-11-20T06%3A53%3A29.4204408Z&urlSigningMethod=HMACV1&urlSignature=Z2xBZfORSHyClYKfz86VocTNemnkMl4zLflU%2B3S2Dqg%3D


Support of MQTT

Is your feature request related to a way you would like Akri extended? Please describe.

MQTT is a key protocol in the IoT world. It would be extremely beneficial if Akri could directly communicate with devices implementing this protocol to report data / get instructions. That would make the path between device and Kubernetes application much shorter than it is for other projects like KubeEdge, whose architecture entails more elements between device and K8s application.

Many use cases show how it's used today or will be used in the future.

Describe the solution you'd like

Akri may rely on existing protocol implementations to deliver this feature rapidly

Describe alternatives you've considered

The only one known as of now is KubeEdge, with the disadvantage mentioned above.

Additional context

https://dzone.com/articles/why-mqtt-has-become-the-de-facto-iot-standard

a small issue

When I used helm to install Akri, I got an error: no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1". I then found that the CRD uses the apiextensions.k8s.io/v1beta1 apiVersion. But after I fixed that, the following error popped up (maybe my knowledge here is relatively weak):
Error: failed to install CRD crds/akri-configuration-crd.yaml: CustomResourceDefinition.apiextensions.k8s.io "configurations.akri.sh" is invalid: [spec.versions[0].additionalPrinterColumns[0].JSONPath: Required value, spec.versions[0].additionalPrinterColumns[1].JSONPath: Required value, spec.versions: Invalid value: []apiextensions.CustomResourceDefinitionVersion{apiextensions.CustomResourceDefinitionVersion{Name:"v0", Served:true, Storage:true, Schema:(*apiextensions.CustomResourceValidation)(0xc005e8c630), Subresources:(*apiextensions.CustomResourceSubresources)(nil), AdditionalPrinterColumns:[]apiextensions.CustomResourceColumnDefinition{apiextensions.CustomResourceColumnDefinition{Name:"Capacity", Type:"string", Format:"", Description:"The capacity for each Instance discovered", Priority:0, JSONPath:""}, apiextensions.CustomResourceColumnDefinition{Name:"Age", Type:"date", Format:"", Description:"", Priority:0, JSONPath:""}}}}: per-version schemas may not all be set to identical values (top-level validation should be used instead), spec.versions: Invalid value: []apiextensions.CustomResourceDefinitionVersion{apiextensions.CustomResourceDefinitionVersion{Name:"v0", Served:true, Storage:true, Schema:(*apiextensions.CustomResourceValidation)(0xc005e8c630), Subresources:(*apiextensions.CustomResourceSubresources)(nil), AdditionalPrinterColumns:[]apiextensions.CustomResourceColumnDefinition{apiextensions.CustomResourceColumnDefinition{Name:"Capacity", Type:"string", Format:"", Description:"The capacity for each Instance discovered", Priority:0, JSONPath:""}, apiextensions.CustomResourceColumnDefinition{Name:"Age", Type:"date", Format:"", Description:"", Priority:0, JSONPath:""}}}}: per-version additionalPrinterColumns may not all be set to identical values (top-level additionalPrinterColumns should be used instead)]

End to end CI tests do not install crictl or validate that slot-reconciliation is successful

Describe the bug
The e2e tests are using MicroK8s, which does not install crictl. Because of this (and #6), the tests really should be failing, but they are not.

To Reproduce
Any PR that triggers the e2e tests will demonstrate this problem.

Expected behavior
The tests should:

  1. Install crictl
  2. Pass the crictl path and the MicroK8s CRI socket via Helm parameters
  3. Validate that slot reconciliation succeeds (something like: microk8s kubectl logs $(microk8s kubectl get pods -A | grep agent | awk '{print $1}' | first) | grep 'get_node_slots - crictl called successfully')

Create Kubernetes label consistency

Is your feature request related to a problem? Please describe.

  • akri-agent is labeled name (link)
  • akri-controller is labeled app (link)

It's minor but I think consistency here would be helpful (using an arbitrary output spec to prove the point):

TMPL="{.items[].status.phase}"

# Agent requires `name`
LABEL="name"
kubectl get pods \
--selector=${LABEL}=akri-agent \
--output=jsonpath="${TMPL}"
Running

# Controller requires `app`
LABEL="app"
kubectl get pods \
--selector=${LABEL}=akri-controller \
--output=jsonpath="${TMPL}"
Running

Is your feature request related to a way you would like Akri extended? Please describe.

Describe the solution you'd like

To avoid breaking any existing dependencies on the labels, I propose adding the preferred label to the component that doesn't have it. Since name is overused (and exists as part of the metadata), the proposal is: add an app label to akri-agent and swap the matchLabels to use app too:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: akri-agent-daemonset
spec:
  selector:
    matchLabels:
      app: akri-agent
  template:
    metadata:
      labels:
        app: akri-agent
        name: akri-agent

If you're confident there aren't dependencies on the name label, then it could be removed.

Describe alternatives you've considered

N/A


Helm workflow should lint using latest Helm version (as warning, not as failure)

Is your feature request related to a problem? Please describe.
When a new Helm version is released, sometimes its lint capabilities change and it can pick up new things.

It would be great to run the latest linting so that we can see if a potential issue arises. At the same time, I don't think it would make sense to fail a PR CI build because of this. There isn't a Github actions concept that allows for a failure to exist without causing a CI workflow to fail (and show up as a failure on the PR page) ... but there are Github actions Issues about enabling allow-failure for a workflow.

It is not clear if/when/where this will land.

In the meantime, until there is a way for a Github workflow to just "warn" on failure, we can play with the actual lint command ... we could change the existing Helm workflow like this:

jobs:
  lint-with-current-helm:
    # Run workflow pull_request if it is NOT a fork, as pull_request_target if it IS a fork
    if: >-
      ( github.event_name == 'pull_request_target' && github.event.pull_request.head.repo.fork == true ) ||
      ( github.event_name == 'pull_request' && github.event.pull_request.head.repo.fork == false )
    runs-on: ubuntu-latest
    timeout-minutes: 20
    steps:
      - name: Checkout the merged commit from PR and base branch
        uses: actions/checkout@v2
        if: github.event_name == 'pull_request_target'
        with:
          # pull_request_target is run in the context of the base repository
          # of the pull request, so the default ref is master branch and
          # ref should be manually set to the head of the PR
          ref: refs/pull/${{ github.event.pull_request.number }}/head

      - name: Checkout the head commit of the branch
        if: ${{ github.event_name != 'pull_request_target' }}
        uses: actions/checkout@v2

      - uses: azure/setup-helm@v1

      - name: Lint helm chart
        run: helm lint deployment/helm && echo "lint finished successfully" || echo "lint found issues"


  helm:
    # existing flow here

[Extensibility] Nessie gRPC

Is your feature request related to a problem? Please describe.

IIUC Nessie's use of gRPC is redundant.

Having begun implementing a protocol (#85), I used Nessie as a template and realized this.

My lack of familiarity with the gRPC crate used, combined with struggling to understand the architecture of Akri, exacerbated the challenge.

Is your feature request related to a way you would like Akri extended? Please describe.

Remove the gRPC code from Nessie to simplify it.

Describe the solution you'd like

See above.

Describe alternatives you've considered

It's probable that I'm incorrect.


[Extensibility] Too(?)-tightly coupled

Is your feature request related to a problem? Please describe.

Extensibility is tightly-coupled.

Protocol changes require agent, controller and CRD rebuilds and redeployments.

It seems that this system would benefit from being dynamic though I can't provide guidance on how to achieve this beyond it being a good use-case for gRPC.

Resource limit quantity type

Is your feature request related to a problem? Please describe.
The Akri Configuration resource limit quantity is specified as a string:

resources:
  limits:
    "{{PLACEHOLDER}}": "1"

But if you look at Device Plugin Kubernetes doc page, the quantity is numeric, not a string: https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/

apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
    - name: demo-container-1
      image: k8s.gcr.io/pause:2.0
      resources:
        limits:
          hardware-vendor.example/foo: 2

Describe the solution you'd like
Consider using a numeric quantity.
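For comparison, the same limit written as a numeric quantity (placeholder name as in the snippet above); whether Akri's CRD schema should accept this form is exactly what this issue proposes:

resources:
  limits:
    "{{PLACEHOLDER}}": 1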

Udev devices should be mounted through DeviceSpec

Describe the bug
Udev discovered devices are currently mounted through Mount, which does not allow for fine-grained control of access permissions. We should switch to DeviceSpec and expose the desired permission control through the Configuration CRD.


/// Mount specifies a host volume to mount into a container.
/// where device library or tools are installed on host and container
#[derive(Clone, PartialEq, ::prost::Message)]
pub struct Mount {
    /// Path of the mount within the container.
    #[prost(string, tag = "1")]
    pub container_path: std::string::String,
    /// Path of the mount on the host.
    #[prost(string, tag = "2")]
    pub host_path: std::string::String,
    /// If set, the mount is read-only.
    #[prost(bool, tag = "3")]
    pub read_only: bool,
}

/// DeviceSpec specifies a host device to mount into a container.
#[derive(Clone, PartialEq, ::prost::Message)]
pub struct DeviceSpec {
    /// Path of the device within the container.
    #[prost(string, tag = "1")]
    pub container_path: std::string::String,
    /// Path of the device on the host.
    #[prost(string, tag = "2")]
    pub host_path: std::string::String,
    /// Cgroups permissions of the device, candidates are one or more of
    /// * r - allows container to read from the specified device.
    /// * w - allows container to write to the specified device.
    /// * m - allows container to create device files that do not yet exist.
    #[prost(string, tag = "3")]
    pub permissions: std::string::String,
}


Use RBAC to enable Akri

Is your feature request related to a problem? Please describe.
It seems overly permissive to configure admin as the docs advise: kubectl create clusterrolebinding serviceaccounts-cluster-admin --clusterrole=cluster-admin --group=system:serviceaccounts

Is your feature request related to a way you would like Akri extended? Please describe.
Use RBAC to grant the controller and agent only the access they specifically need.

Describe the solution you'd like
RBAC for Controller (add serviceAccountName: 'akri-controller-sa' to controller.yaml)

apiVersion: v1
kind: ServiceAccount
metadata:
  name: akri-controller-sa
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: "akri-controller-role"
rules:
- apiGroups: [""]
  resources: ["pods", "services"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["akri.sh"]
  resources: ["instances"]
  verbs: ["get", "list", "watch", "update", "patch"]
- apiGroups: ["akri.sh"]
  resources: ["configurations"]
  verbs: ["get", "list", "watch"]
---
apiVersion: 'rbac.authorization.k8s.io/v1'
kind: 'ClusterRoleBinding'
metadata:
  name: 'akri-controller-binding'
  namespace: {{ .Release.Namespace }}
roleRef:
  apiGroup: ''
  kind: 'ClusterRole'
  name: 'akri-controller-role'
subjects:
  - kind: 'ServiceAccount'
    name: 'akri-controller-sa'
    namespace: {{ .Release.Namespace }}

RBAC for agent (add serviceAccountName: 'akri-agent-sa' to agent.yaml):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: akri-agent-sa
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: "akri-agent-role"
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["akri.sh"]
  resources: ["instances"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["akri.sh"]
  resources: ["configurations"]
  verbs: ["get", "list", "watch"]
---
apiVersion: 'rbac.authorization.k8s.io/v1'
kind: 'ClusterRoleBinding'
metadata:
  name: 'akri-agent-binding'
  namespace: {{ .Release.Namespace }}
roleRef:
  apiGroup: ''
  kind: 'ClusterRole'
  name: 'akri-agent-role'
subjects:
  - kind: 'ServiceAccount'
    name: 'akri-agent-sa'
    namespace: {{ .Release.Namespace }}

Describe alternatives you've considered
None

Additional context
None

PR builds are failing from forks

Describe the bug
PR CI builds are failing because secrets are not available from external forks

To Reproduce
Steps to reproduce the behavior:

  1. Create fork
  2. Create PR from fork
  3. CI builds fail because they cannot access secrets

Expected behavior
CI builds execute successfully

modprobe: ERROR: could not insert 'v4l2loopback': Unknown symbol in module, or unknown parameter (see dmesg)

Describe the bug

Challenged to run the End-to-End example on cloud (!) VMs (GCP, DigitalOcean). Specifically modprobe v4l2loopback:

sudo modprobe v4l2loopback exclusive_caps=1 video_nr=1,2
modprobe: ERROR: could not insert 'v4l2loopback': Unknown symbol in module, or unknown parameter (see dmesg)

I suspect the issue is that dependent modules aren't included on cloud VM kernels, but I prefer to use cloud VMs because they're disposable and I can keep my primary workstation untainted by random installs.

Output of kubectl get pods,akrii,akric -o wide

N/A

Kubernetes Version: [e.g. Native Kubernetes 1.19, MicroK8s 1.19, Minikube 1.19, K3s]

N/A

To Reproduce
Steps to reproduce the behavior:

  1. Create cloud VM (GCP, DigitalOcean)
  2. Set up mock video devices (link)
  3. Attempt modprobe v4l2loopback

Expected behavior

No errors

dmesg:

v4l2loopback: loading out-of-tree module taints kernel.
v4l2loopback: module verification failed: signature and/or required key missing - tainting kernel
v4l2loopback: Unknown symbol video_ioctl2 (err -2)
v4l2loopback: Unknown symbol v4l2_ctrl_handler_init_class (err -2)
v4l2loopback: Unknown symbol video_devdata (err -2)
v4l2loopback: Unknown symbol v4l2_ctrl_new_custom (err -2)
v4l2loopback: Unknown symbol video_unregister_device (err -2)
v4l2loopback: Unknown symbol video_device_alloc (err -2)
v4l2loopback: Unknown symbol v4l2_device_register (err -2)
v4l2loopback: Unknown symbol __video_register_device (err -2)
v4l2loopback: Unknown symbol v4l2_ctrl_handler_free (err -2)
v4l2loopback: Unknown symbol v4l2_device_unregister (err -2)
v4l2loopback: Unknown symbol video_device_release (err -2)

NOTE the taint and module verification error are warnings only and may be ignored

Additional context

The solution I've found is:

sudo apt update && \
sudo apt -y install modules-extra-$(uname -r) && \
sudo apt -y install dkms

NOTE the installs may possibly be combined

Then:

curl http://deb.debian.org/debian/pool/main/v/v4l2loopback/v4l2loopback-dkms_0.12.5-1_all.deb ...
sudo dpkg -i v4l2loopback-dkms_0.12.5-1_all.deb
sudo modprobe v4l2loopback exclusive_caps=1 video_nr=1,2

Succeeds!

On GCP:

dkms status
v4l2loopback, 0.12.5, 5.4.0-1028-gcp, x86_64: installed

On DigitalOcean:

dkms status
v4l2loopback, 0.12.5, 5.4.0-51-generic, x86_64: installed

Hopefully this will help others that get stuck.

Caveat: referencing images by tag

Describe the bug

Some minor feedback on using image tags rather than digests.

If the developer rebuilds|repushes e.g. the agent image, a Kubernetes cluster may not repull the image even when specs are reapplied. This is because even though the image has changed, its tag has not, e.g. the Helm chart references images by tag, and Kubernetes doesn't (!?) repull an image if it has the image (tag) cached.

helm install akri ... \
...
--set agent.image.repository="${REPO}/agent" \
--set agent.image.tag="v0.0.XX-amd64" \
--set controller.image.repository="${REPO}/controller" \
--set controller.image.tag="v0.0.XX-amd64"

This can cause problems if the developer assumes that repushing causes e.g. Kubernetes to repull the image when the spec is reapplied.

The preferred mechanism would be to always (even in Helm) reference images by digest|hash as this is very likely to change every time the image changes.
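For illustration, a digest reference in a container spec looks like the following (the digest value is a placeholder, not a real image digest):

containers:
  - name: akri-agent
    # A rebuild/repush produces a new digest, so a cached image can never be
    # silently served for this reference.
    image: ghcr.io/deislabs/akri/agent@sha256:<digest-of-the-pushed-image>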

An alternative is to eyeball the image digests after changes to ensure the images cached by Kubernetes reflect the digests of the images in the repo.

In the case of MicroK8s, it's possible to enumerate the cluster's cached images using crictl and to remove stale versions:

sudo crictl --runtime-endpoint=unix:///var/snap/microk8s/common/run/containerd.sock images

sudo crictl --runtime-endpoint=unix:///var/snap/microk8s/common/run/containerd.sock rmi ...

Output of kubectl get pods,akrii,akric -o wide

N/A

Kubernetes Version: [e.g. Native Kubernetes 1.19, MicroK8s 1.19, Minikube 1.19, K3s]

N/A

To Reproduce
Steps to reproduce the behavior:

  1. Create cluster
  2. Install Akri with the Helm command
  3. Enumerate cluster images using crictl
  4. Delete Akri
  5. Change the agent or controller and repush (same tag)
  6. Reinstall Akri
  7. Use crictl to confirm that the most recent image hash was not used

Expected behavior

Any image changes (e.g. agent, controller, brokers) should cause Kubernetes to repull the image on recreates.

Logs (please share snips of applicable logs)

N/A

Additional context

N/A

Akri fails to deploy on armv7l

Is your feature request related to a problem? Please describe.
When trying to deploy Akri Agent on Raspbian running on RPi 3 (armv7l), it fails with:

  Warning  Failed     38s (x3 over 81s)  kubelet, automation  Failed to pull image "ghcr.io/deislabs/akri/agent:latest-dev": rpc error: code = NotFound desc = failed to pull and unpack image "ghcr.io/deislabs/akri/agent:latest-dev": failed to unpack image on snapshotter overlayfs: no match for platform in manifest sha256:22358e94c42efcd6d2f6751ade021fc606742110b3d0e4a917052ff4cf2609df: not found

Akri has been deployed using the latest containers and akri-dev helm chart.

It seems the manifest only contains amd64 and arm64 images:

docker manifest inspect ghcr.io/deislabs/akri/agent:latest-dev
{
   "schemaVersion": 2,
   "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
   "manifests": [
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 1159,
         "digest": "sha256:ff5da6128f12f8ef678dab07f68b22fd32c5d89af73c62620ebca21b283853c3",
         "platform": {
            "architecture": "amd64",
            "os": "linux"
         }
      },
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 1159,
         "digest": "sha256:e6f0aaf7608e57c304151d7aeac60d4aa7bc2cf3e6897dc6d55b6f53c1c3bdc5",
         "platform": {
            "architecture": "arm64",
            "os": "linux"
         }
      }
   ]
}

Looks like arm32 is disabled for now: https://github.com/deislabs/akri/blob/main/Makefile

Describe the solution you'd like
Consider supporting Akri on armv7l to allow for deployments on Raspbian.

Controller not bringing down pending pods in case of `deviceUsage` slot collision

As expected, the kubelet will not schedule a pod if the pod requests an Instance's deviceUsage slot that has already been taken, and the pod will be left in Pending state. At this point, the controller should come and take down that pod (so it can possibly be rescheduled to a different slot if capacity has not been met). However, this is not currently happening. The pod is staying in a Pending state.

Output of kubectl get pods -o wide

akri-agent-daemonset-98p49                         1/1     Running   0          21h
akri-agent-daemonset-x9k9r                         1/1     Running   0          21h
akri-controller-deployment-75d655d869-nfmvz        1/1     Running   0          21h
worker1-akri-debug-echo-foo-8120fe-pod             0/1     Pending   0          21h
worker1-akri-debug-echo-foo-a19705-pod             0/1     Pending   0          21h
worker2-akri-debug-echo-foo-8120fe-pod             1/1     Running   0          21h
worker2-akri-debug-echo-foo-a19705-pod             1/1     Running   0          21h

Kubernetes Version: [e.g. Native Kubernetes 1.19, MicroK8s 1.19, Minikube 1.19, K3s]
MicroK8s 1.19

To reproduce, run on cluster with 2 workers:

helm install akri akri-helm-charts/akri --set debugEcho.enabled=true --set debugEcho.capacity=1 --set debugEcho.shared=true --set agent.allowDebugEcho=true --debug --set agent.host.dockerShimSock=/var/snap/microk8s/common/run/containerd.sock --set agent.host.crictl=/usr/local/bin/crictl --set debugEcho.capacity=1

Expected behavior
Controller should bring down the pending pods, and in this case where capacity is 1, pods should not be rescheduled

Expected problem
The controller allows a grace period to pass before bringing down pods. That grace period is calculated from a timestamp that is only set after a pod is assigned to a node, which never happens if allocate fails. Consequently, we see the pod remain in a Pending state, as the Controller doesn't remove it if no start time was found:

[2020-10-21T16:05:28Z TRACE controller::util::pod_action] time_choice_for_non_running_pods - no start time found ... give it more time? ("worker1-akri-debug-echo-foo-a19705-pod")
[2020-10-21T16:05:28Z TRACE controller::util::pod_action] time_choice_for_non_running_pods - give_it_more_time: (true)

Remove Agent's polling method for slot-reconciliation

Each Agent currently continually checks all running containers to make sure that the Instance slot properties are correct (https://github.com/deislabs/sonar/blob/master/beacon/src/util/slot_reconciliation.rs). The reason this is needed is that K8s' Device-Plugin interface does not include a Deallocate message ... so when a Pod exits/deletes, the Instance needs to be updated.

A better model would be to create a Pod watcher and respond to creation/deletion/modified events. A generalized algorithm would be (a minimal watcher sketch follows the list):

  1. Agent.Allocate
    • update slot with some kind of designation reflecting that the slot is requested (slot-id: "*node-a")
    • add SLOT annotation to container (this exists in the current polling solution)
    • start 5 minute timer
      • if, after 5 minutes, there is no Pod with a sonar.sh/slot label, revoke the Instance slot request/reservation (slot-id: "")
  2. Agent.PodWatcher.Created/Modified (for spec.nodeName == <this node>)
    • if there is no sonar.sh/slot label, query crictl to get the SLOT annotation from the container
      • add slot as Pod sonar.sh/slot label
      • update Instance to show slot claimed (slot-id: "node-a")
  3. Agent.PodWather.Modified/Deleted (for spec.nodeName == <this node>)
    • if the pod has sonar.sh/slot, revoke the Instance slot request (slot-id: "")
  4. On Agent startup, start 5 minute timer for any slot requests on this node
    • if, after 5 minutes, there is no Pod with a sonar.sh/slot label, revoke the Instance slot request/reservation
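To make the list above concrete, a minimal sketch of such a Pod watcher, assuming the kube/kube-runtime and anyhow crates of roughly this era and a slot label key of akri.sh/instance (the label key and all wiring into Allocate are assumptions, not the actual implementation):

use futures::{StreamExt, TryStreamExt};
use k8s_openapi::api::core::v1::Pod;
use kube::{api::ListParams, Api, Client};
use kube_runtime::watcher;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::all(client);
    // Only watch pods carrying the slot label (label key is an assumption).
    let params = ListParams::default().labels("akri.sh/instance");
    let mut events = watcher(pods, params).boxed();
    while let Some(event) = events.try_next().await? {
        match event {
            // Created/Modified: mark the Instance slot as claimed by this node.
            watcher::Event::Applied(_pod) => println!("pod created/modified: claim slot"),
            // Deleted: revoke the Instance slot reservation.
            watcher::Event::Deleted(_pod) => println!("pod deleted: release slot"),
            // Watch restarted: re-list and reconcile all slots on this node.
            watcher::Event::Restarted(pods) => println!("watch restarted: reconcile {} pods", pods.len()),
        }
    }
    Ok(())
}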

[Extensibility] "invalid URL, scheme is not http"

Describe the bug

Output of kubectl get pods,akrii,akric -o wide

kubectl apply --filename=nessie.yaml
configuration.akri.sh/nessie created

And:

kubectl get all
NAME                                              READY   STATUS    RESTARTS   AGE
pod/akri-agent-daemonset-6gjbr                    1/1     Running   0          89s
pod/akri-controller-deployment-848858df4b-cgplb   1/1     Running   0          89s

NAME                 TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
service/kubernetes   ClusterIP   10.152.183.1   <none>        443/TCP   26h

NAME                                  DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/akri-agent-daemonset   1         1         1       1            1           kubernetes.io/os=linux   89s

NAME                                         READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/akri-controller-deployment   1/1     1            1           89s

NAME                                                    DESIRED   CURRENT   READY   AGE
replicaset.apps/akri-controller-deployment-848858df4b   1         1         1       89s

Kubernetes Version: [e.g. Native Kubernetes 1.19, MicroK8s 1.19, Minikube 1.19, K3s]

Expected behavior

Logs (please share snips of applicable logs)

kubectl logs pods/akri-agent-daemonset-6gjbr
akri.sh Agent start
akri.sh KUBERNETES_PORT found ... env_logger::init
[2020-10-27T19:57:34Z TRACE agent] akri.sh KUBERNETES_PORT found ... env_logger::init finished
[2020-10-27T19:57:34Z TRACE agent::util::slot_reconciliation] periodic_slot_reconciliation - start
[2020-10-27T19:57:34Z TRACE akri_shared::k8s] Loading in-cluster config
[2020-10-27T19:57:34Z INFO  agent::util::config_action] do_config_watch - enter
[2020-10-27T19:57:34Z TRACE akri_shared::k8s] Loading in-cluster config
[2020-10-27T19:57:34Z TRACE agent::util::slot_reconciliation] periodic_slot_reconciliation - iteration pre delay_for
[2020-10-27T19:57:34Z TRACE akri_shared::akri::configuration] get_configurations enter
[2020-10-27T19:57:34Z TRACE akri_shared::akri::configuration] get_configurations kube_client.request::<KubeAkriInstanceList>(akri_config_type.list(...)?).await?
[2020-10-27T19:57:34Z TRACE akri_shared::akri::configuration] get_configurations return
[2020-10-27T19:57:34Z TRACE agent::util::config_action] watch_for_config_changes - start
[2020-10-27T19:57:44Z TRACE agent::util::slot_reconciliation] periodic_slot_reconciliation - iteration call reconiler.reconcile
[2020-10-27T19:57:44Z TRACE agent::util::slot_reconciliation] reconcile - thread iteration start [Mutex { data: {} }]
[2020-10-27T19:57:44Z TRACE agent::util::slot_reconciliation] get_node_slots - Command failed to call crictl: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }
[2020-10-27T19:57:44Z TRACE agent::util::slot_reconciliation] reconcile - get_node_slots failed: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }
[2020-10-27T19:57:44Z TRACE agent::util::slot_reconciliation] periodic_slot_reconciliation - iteration end
...
**REPEATS**
...
[2020-10-27T19:58:50Z TRACE agent::util::config_action] handle_config - something happened to a configuration
[2020-10-27T19:58:50Z INFO  agent::util::config_action] handle_config - added DevCapConfig nessie
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: hyper::Error(Connect, "invalid URL, scheme is not http")', agent/src/util/config_action.rs:146:64
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

The Nessie URL is https://www.lochness.co.uk/livecam/img/lochness.jpg
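Not a confirmed root cause, but hyper's default Client only handles plain http URLs, and that Nessie URL is https. A minimal sketch (assuming hyper 0.13/0.14 plus the hyper-tls crate, which are assumptions about the dependency set) of how an HTTPS-capable client is usually built:

use hyper::{Body, Client};
use hyper_tls::HttpsConnector;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Client::new() only speaks http; wrap an HTTPS connector for https URLs.
    let https = HttpsConnector::new();
    let client = Client::builder().build::<_, Body>(https);
    let uri: hyper::Uri = "https://www.lochness.co.uk/livecam/img/lochness.jpg".parse()?;
    let resp = client.get(uri).await?;
    println!("status: {}", resp.status());
    Ok(())
}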


Implementing On-Device Audio

I am currently running a k3s cluster on some Raspberry Pi 4s. Would it be possible for Akri to utilize an audio card on one of these Pi's as an Akri Plugin?

I would assume a device plugin would need to be created. Does this make sense as an Akri use-case?

Support IoT Plug and Play

Is your feature request related to a problem? Please describe.
Microsoft IoT team has come up with a generic way to discover the resources such as temperature sensors or whatever. It would be great if Akri could just discover such resources.

Is your feature request related to a way you would like Akri extended? Please describe.
For each IoT P&P device, it would be great if Akri would create a resource handler automatically.

Additional context
For details, see https://docs.microsoft.com/en-us/azure/iot-pnp/

Akri agent fails to start in managed GKE cluster

Describe the bug
When starting Akri in a Google Cloud Platform cluster, the agent fails to start, complaining about creating the crictl mount source path on a read-only file system.

kubectl get pods -o wide

NAME                                          READY   STATUS             RESTARTS   AGE    IP            NODE                                          NOMINATED NODE   READINESS GATES
akri-agent-daemonset-jr7pl                    0/1     CrashLoopBackOff   5          4m7s   10.128.0.12   gke-cluster-1-16-default-pool-c4e0f4c9-lthv   <none>           <none>
akri-agent-daemonset-trwqm                    0/1     CrashLoopBackOff   5          4m7s   10.128.0.10   gke-cluster-1-16-default-pool-c4e0f4c9-xlsx   <none>           <none>
akri-agent-daemonset-wq77x                    0/1     CrashLoopBackOff   5          4m7s   10.128.0.11   gke-cluster-1-16-default-pool-c4e0f4c9-tbdm   <none>           <none>
akri-controller-deployment-5957d7d7cc-z8rnx   1/1     Running            0          4m6s   10.64.1.7     gke-cluster-1-16-default-pool-c4e0f4c9-lthv   <none>           <none>

Kubernetes Version: 1.16.13-gke.401

To Reproduce
Steps to reproduce the behavior:

  1. Create cluster using GCP: 1.16.13-gke.401
  2. helm install akri akri-helm-charts/akri --set debugEcho.enabled=true --set debugEcho.name=debug-echo --set debugEcho.shared=false --set agent.allowDebugEcho=true --debug --set controller.onlyOnMaster=false

Expected behavior
Expect Akri controller and agent to start without error. Akri controller starts. Akri agent fails:

kubectl describe pod $(kubectl get pods | grep agent | awk '{print $1}' | head -1)

Events:
  Type     Reason     Age                     From               Message
  ----     ------     ----                    ----               -------
  Normal   Scheduled  7m39s                   default-scheduler  Successfully assigned default/akri-agent-daemonset-jr7pl to gke-cluster-1-16-default-pool-c4e0f4c9-lthv
  Warning  Failed     6m55s (x4 over 7m37s)   kubelet            Error: failed to start container "akri-agent": Error response from daemon: error while creating mount source path '/usr/bin/crictl': mkdir /usr/bin/crictl: read-only file system
  Normal   Pulling    6m10s (x5 over 7m38s)   kubelet            Pulling image "ghcr.io/deislabs/akri/agent:v0.0.36-dev"
  Normal   Pulled     6m9s (x5 over 7m37s)    kubelet            Successfully pulled image "ghcr.io/deislabs/akri/agent:v0.0.36-dev"
  Normal   Created    6m9s (x5 over 7m37s)    kubelet            Created container akri-agent
  Warning  BackOff    2m34s (x23 over 7m35s)  kubelet            Back-off restarting failed container

using helm flag "--set udev.enabled=true" get error

Describe the bug
When I set the flag "--set udev.enabled=true" in the helm command to modify the configuration, I get this error message:

Error: UPGRADE FAILED: YAML parse error on akri/templates/udev.yaml: error converting YAML to JSON: yaml: line 10: could not find expected ':'

Output of kubectl get pods,akrii,akric -o wide

NAME                                              READY   STATUS    RESTARTS   AGE    IP            NODE          NOMINATED NODE   READINESS GATES
pod/akri-controller-deployment-5b4bb5cbb5-lgqq6   1/1     Running   0          167m   10.42.0.8     raspberrypi   <none>           <none>
pod/akri-agent-daemonset-krkhr                    1/1     Running   0          119m   192.168.0.9   rpi-3         <none>           <none>
pod/akri-agent-daemonset-wmhdn                    1/1     Running   0          118m   192.168.0.8   raspberrypi   <none>           <none>

NAME                               CAPACITY   AGE
configuration.akri.sh/akri-onvif   1          119m

Kubernetes Version: [e.g. Native Kubernetes 1.19, MicroK8s 1.19, Minikube 1.19, K3s]
K3s

NAME          STATUS   ROLES    AGE   VERSION
raspberrypi   Ready    master   10h   v1.19.3+k3s2
rpi-3         Ready    <none>   9h    v1.19.3+k3s2

To Reproduce
Install or upgrade Akri with the Helm command:

helm upgrade akri akri-helm-charts/akri-dev --set onvif.enabled=true --set udev.enabled=true


Add K3s and Kubernetes to end to end tests

Is your feature request related to a problem? Please describe.
Currently the end-to-end tests only validate MicroK8s.

Is your feature request related to a way you would like Akri extended? Please describe.
Add K3s and Kubernetes to the testing matrix.

Describe the solution you'd like
Expand .github/workflows/run-test-cases.yml strategy.matrix to include a dimension for runtime (K3s, Kubernetes) and add runtime installation details for both K3s and Kubernetes.
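A rough sketch of what that matrix expansion could look like (job and step names, and the install commands, are illustrative assumptions rather than the actual workflow):

jobs:
  test-cases:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        kube-runtime: [MicroK8s, K3s, Kubernetes]
    steps:
      - uses: actions/checkout@v2
      - name: Install MicroK8s
        if: matrix.kube-runtime == 'MicroK8s'
        run: sudo snap install microk8s --classic
      - name: Install K3s
        if: matrix.kube-runtime == 'K3s'
        run: curl -sfL https://get.k3s.io | sh -
      # Kubernetes (e.g. kubeadm) installation steps would go here.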


Handle crictl query error in slot reconciliation

Currently, we are doing a best effort approach to slot reconciliation, which ensures that device usage on an Instance reflects the real state of which pods are using an instance.
To get the real usage, we are using crictl to query the container runtime in search of active Containers that have our slot Annotations that device-plugin adds to a pod upon an Allocate call from kubelet.
If this crictl query fails (possibly due to crictl not being mounted correctly), we are doing an early return on slot reconciliation. This should be handled more specifically in the future.

Alternatively, if Kubernetes adds Deallocate to Device-Plugin, slot reconciliation might not be needed. This is the PR for Deallocate: kubernetes/kubernetes#91190

Consider switching template comments

Is your feature request related to a problem? Please describe.
Helm templates such as https://github.com/deislabs/akri/blob/main/deployment/helm/templates/udev.yaml include YAML comments, which are rendered in the final .yaml files. But at least some comments do not make sense in the final yaml, such as: # Only add broker pod spec if a broker image is provided.

Describe the solution you'd like
Evaluate if template comments would be better and not end up in the generated yaml. See https://helm.sh/docs/chart_best_practices/templates/, specifically Comments (YAML Comments vs. Template Comments).
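For reference, the distinction in a Helm template (illustrative lines, not taken from the chart): a YAML comment survives rendering, while a template comment does not.

# This YAML comment is rendered into the final manifest.
{{- /* This template comment is stripped at render time, so notes like
"Only add broker pod spec if a broker image is provided" would not leak
into the generated yaml. */}}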

k3s/microk8s instructions missing on some doc pages

Is your feature request related to a problem? Please describe.
https://github.com/deislabs/akri/blob/main/docs/user-guide.md describes how to deploy Akri on k3s and microk8s. Specifically, this requires passing additional parameters to helm to properly configure crictl. However, pages such as https://github.com/deislabs/akri/blob/main/docs/udev-configuration.md or https://github.com/deislabs/akri/blob/main/docs/modifying-akri-installation.md do not have such instructions or even mention the need to set crictl.

Describe the solution you'd like
Consider adding a reference, as folks might modify the installation and unset the required configuration, ending up with a misconfigured Akri installation.
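For example, the additional parameters meant here are the crictl-related ones from the user guide; for MicroK8s (paths as used elsewhere in these issues) that looks like:

helm install akri akri-helm-charts/akri \
  --set agent.host.crictl=/usr/local/bin/crictl \
  --set agent.host.dockerShimSock=/var/snap/microk8s/common/run/containerd.sock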

[Extensibility] Compilation errors around `hyper::Client`

Describe the bug

Please see PR #71

When:

PREFIX=ghcr.io/dazwilkin BUILD_AMD64=1 BUILD_ARM32=0 BUILD_ARM64=0 make akri-agent

Errors:

error[E0061]: this function takes 1 argument but 0 arguments were supplied
  --> agent/src/protocols/nessie/discovery_handler.rs:28:28
   |
28 |         if let Ok(_body) = hyper::Client::new().get(url).compat().await {
   |                            ^^^^^^^^^^^^^^^^^^-- supplied 0 arguments
   |                            |
   |                            expected 1 argument

error[E0599]: no method named `compat` found for struct `hyper::client::FutureResponse` in the current scope
  --> agent/src/protocols/nessie/discovery_handler.rs:28:58
   |
28 |         if let Ok(_body) = hyper::Client::new().get(url).compat().await {
   |                                                          ^^^^^^ method not found in `hyper::client::FutureResponse`

Output of kubectl get pods,akrii,akric -o wide

N/A

Kubernetes Version: [e.g. Native Kubernetes 1.19, MicroK8s 1.19, Minikube 1.19, K3s]

N/A

To Reproduce
Steps to reproduce the behavior:

  1. Follow Extensibility as diligently as possible (!?)
  2. Try to build akri-agent (or akri-controller)

Expected behavior

No compilation errors.

Logs (please share snips of applicable logs)

Above.


[End-to-End] MicroK8s: "Install Helm" redundant?

Describe the bug

IIUC the End-to-End MicroK8s step to Install Helm is redundant.

Output of kubectl get pods,akrii,akric -o wide

N/A

Kubernetes Version: [e.g. Native Kubernetes 1.19, MicroK8s 1.19, Minikube 1.19, K3s]

N/A

To Reproduce
Steps to reproduce the behavior:

  1. Follow End-to-End instructions
  2. Don't Install Helm
  3. Do microk8s.enable helm3
  4. For consistency: alias helm=microk8s.helm3
  5. helm version

Yields: version.BuildInfo{Version:"v3.0.2", ...

Expected behavior

N/A

Logs (please share snips of applicable logs)

N/A

Additional context

Recommend:

  • Drop the "Install Helm" step
  • Retain "Enable Helm for MicroK8s"
  • Consider adding alias helm=microk8s.helm3

[Nessie] Get https://ghcr.io/v2/deislabs/akri/rust-crossbuild/manifests/x86_64-unknown-linux-gnu-0.1.16-0.0.6: unauthorized.

Describe the bug

Walking through Extensibility, when trying to build akri-agent or akri-controller in this step, I receive unauthorized from deislabs' registry endpoint on GHCR.

I assume the build needs these intermediate images and that the repository is not public.

Recommend: make it public

PREFIX=ghcr.io/dazwilkin BUILD_AMD64=1 BUILD_ARM32=0 BUILD_ARM64=0 make akri-controller
cargo install cross
    Updating crates.io index
     Ignored package `cross v0.2.1` is already installed, use --force to override
PKG_CONFIG_ALLOW_CROSS=1 cross build --release --target=x86_64-unknown-linux-gnu
Unable to find image 'ghcr.io/deislabs/akri/rust-crossbuild:x86_64-unknown-linux-gnu-0.1.16-0.0.6' locally
docker: Error response from daemon: Get https://ghcr.io/v2/deislabs/akri/rust-crossbuild/manifests/x86_64-unknown-linux-gnu-0.1.16-0.0.6: unauthorized.
See 'docker run --help'.
make: *** [build/akri-containers.mk:44: akri-cross-build-amd64] Error 125

Kubernetes Version: [e.g. Native Kubernetes 1.19, MicroK8s 1.19, Minikube 1.19, K3s]

N/A

To Reproduce

See summary.

Expected behavior

Permitted to pull from deislabs' GHCR registry|repos


Default capacity should not be 5

Describe the bug
Currently, the default broker spec capacity is 5. That seems to be overkill.

To Reproduce
Steps to reproduce the behavior:

  1. Follow the default e2e demo and the default helm chart deployment.
  2. Capacity is set to 5.

Expected behavior
I think it would make more sense to start with 1 or 2. I would vote for 1, as HA can be opt-in.

Extensibility: Nessie should illustrate how to handle failure to find resource vs unexpected error

Describe the bug

Nessie's discovery_handler.rs discover function always returns Ok(T), even on failure:

async fn discover(&self) -> Result<Vec<DiscoveryResult>, failure::Error> {
    let url = self.discovery_handler_config.nessie_url.clone();
    let mut results = Vec::new();

    match reqwest::get(&url).await {
        Ok(resp) => {
            trace!("Found nessie url: {:?} => {:?}", &url, &resp);
            // If the Nessie URL can be accessed, we will return a DiscoveryResult
            // instance
            let mut props = HashMap::new();
            props.insert("nessie_url".to_string(), url.clone());

            results.push(DiscoveryResult::new(&url, props, true));
        }
        Err(err) => {
            println!("Failed to establish connection to {}", &url);
            println!("Error: {}", err);
            return Ok(results);
        }
    };
    Ok(results)
}

Expected behavior

Should this not return Err(failure::Error) on the Err branch?

async fn discover(&self) -> Result<Vec<DiscoveryResult>, failure::Error> {
    let url = self.discovery_handler_config.nessie_url.clone();
    let mut results = Vec::new();

    match reqwest::get(&url).await {
        Ok(resp) => {
            trace!("Found nessie url: {:?} => {:?}", &url, &resp);
            // If the Nessie URL can be accessed, we will return a DiscoveryResult
            // instance
            let mut props = HashMap::new();
            props.insert("nessie_url".to_string(), url.clone());

            results.push(DiscoveryResult::new(&url, props, true));
        }
        Err(err) => {
            println!("Failed to establish connection to {}", &url);
            println!("Error: {}", err);
            return Err(format_err!("failed to establish connection: {}", err));
        }
    };
    Ok(results)
}

Then, since results is only required by the Ok branch and only ever holds a single value, we can simplify and just (implicitly) return the result of the match:

async fn discover(&self) -> Result<Vec<DiscoveryResult>, failure::Error> {
    let url = self.discovery_handler_config.nessie_url.clone();

    match reqwest::get(&url).await {
        Ok(resp) => {
            trace!("Found nessie url: {:?} => {:?}", &url, &resp);
            // If the Nessie URL can be accessed, we will return a DiscoveryResult
            // instance
            let mut props = HashMap::new();
            props.insert("nessie_url".to_string(), url.clone());

           Ok(vec![DiscoveryResult::new(&url, props, true)])
        }
        Err(err) => {
            println!("Failed to establish connection to {}", &url);
            println!("Error: {}", err);

            Err(format_err!("failed to establish connection: {}", err))
        }
    }
}

Additional context

See #89

[Extensibility] HTTP protocol

Is your feature request related to a problem? Please describe.

A proposal.

Is your feature request related to a way you would like Akri extended? Please describe.

Part "can I repro 'Nessie'?".
Part "What could be even simpler?".

I'm noodling a handler based on HTTP:

  • Device endpoints are discovered by GETting a URL from a separate discovery endpoint
  • Devices are represented by an HTTP server with 2 paths: /health and /.
  • The / path returns a random number purporting to be some sensor value; easily extended
  • A companion app that creates an arbitrary number of "devices" and the discovery service using multiple ports on a default 0.0.0.0

Describe the solution you'd like

This solution is very basic, but it's representative (HTTP aside) of an MCU-device-with-sensor(s) scenario.
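To make that concrete, a self-contained sketch (std-only, all details assumed, not the actual sample) of one such mock device serving /health and /:

use std::io::{Read, Write};
use std::net::TcpListener;
use std::time::{SystemTime, UNIX_EPOCH};

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("0.0.0.0:8080")?;
    for stream in listener.incoming() {
        let mut stream = stream?;
        let mut buf = [0u8; 1024];
        let n = stream.read(&mut buf)?;
        let request = String::from_utf8_lossy(&buf[..n]);
        // Very naive routing on the request line only.
        let body = if request.starts_with("GET /health ") {
            "ok".to_string()
        } else {
            // Stand-in "sensor value": low bits of the current time.
            let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap();
            format!("{}", now.subsec_nanos() % 1000)
        };
        let response = format!(
            "HTTP/1.1 200 OK\r\nContent-Length: {}\r\n\r\n{}",
            body.len(),
            body
        );
        stream.write_all(response.as_bytes())?;
    }
    Ok(())
}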

Describe alternatives you've considered

The obvious alternative for MCUs would be to use MQTT and invert the flow of control: devices publish to broker, Akri subscribes to MQTT channels.

My sense (!) is that using MQTT doesn't showcase Akri; I'm still unsure what the ideal use-case is for Akri, but I suspect that discovery and non-trivial interactions|data may be key (!?).

I considered using gRPC for device interaction but the added complexity appears to add no value in explaining Akri.

I considered using named pipes but feel this doesn't enlighten the developer and conveys locally attached producers.

Additional context

Plan to write the tutorial from-scratch.

Build fails on Ubuntu 20.04

I ran:

./build/setup.sh
cargo build

And I see:

bburns@helios:~/src/akri$ RUST_BACKTRACE=1 cargo build
   Compiling hyper v0.13.7
   Compiling async-std v1.6.2
   Compiling agent v0.0.36 (/home/bburns/src/akri/agent)
   Compiling udev-video-broker v0.0.36 (/home/bburns/src/akri/samples/brokers/udev-video-broker)
error: failed to run custom build command for `udev-video-broker v0.0.36 (/home/bburns/src/akri/samples/brokers/udev-video-broker)`

Caused by:
  process didn't exit successfully: `/home/bburns/src/akri/target/debug/build/udev-video-broker-6ad08c5c40d22b43/build-script-build` (exit code: 101)
--- stderr
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', /home/bburns/.cargo/registry/src/github.com-1ecc6299db9ec823/tonic-build-0.1.1/src/lib.rs:219:19
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at /usr/src/rustc-1.43.0/vendor/backtrace/src/backtrace/libunwind.rs:86
   1: backtrace::backtrace::trace_unsynchronized
             at /usr/src/rustc-1.43.0/vendor/backtrace/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print_fmt
             at src/libstd/sys_common/backtrace.rs:78
   3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
             at src/libstd/sys_common/backtrace.rs:59
   4: core::fmt::write
             at src/libcore/fmt/mod.rs:1063
   5: std::io::Write::write_fmt
             at src/libstd/io/mod.rs:1426
   6: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:62
   7: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:49
   8: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:204
   9: std::panicking::default_hook
             at src/libstd/panicking.rs:224
  10: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:474
  11: rust_begin_unwind
             at src/libstd/panicking.rs:378
  12: core::panicking::panic_fmt
             at src/libcore/panicking.rs:85
  13: core::result::unwrap_failed
             at src/libcore/result.rs:1222
  14: core::result::Result<T,E>::unwrap
             at /usr/src/rustc-1.43.0/src/libcore/result.rs:1003
  15: tonic_build::fmt
             at /home/bburns/.cargo/registry/src/github.com-1ecc6299db9ec823/tonic-build-0.1.1/src/lib.rs:219
  16: tonic_build::Builder::compile
             at /home/bburns/.cargo/registry/src/github.com-1ecc6299db9ec823/tonic-build-0.1.1/src/lib.rs:172
  17: build_script_build::main
             at samples/brokers/udev-video-broker/build.rs:2
  18: std::rt::lang_start::{{closure}}
             at /usr/src/rustc-1.43.0/src/libstd/rt.rs:67
  19: std::rt::lang_start_internal::{{closure}}
             at src/libstd/rt.rs:52
  20: std::panicking::try::do_call
             at src/libstd/panicking.rs:303
  21: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:86
  22: std::panicking::try
             at src/libstd/panicking.rs:281
  23: std::panic::catch_unwind
             at src/libstd/panic.rs:394
  24: std::rt::lang_start_internal
             at src/libstd/rt.rs:51
  25: std::rt::lang_start
             at /usr/src/rustc-1.43.0/src/libstd/rt.rs:67
  26: main
  27: __libc_start_main
  28: _start
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

warning: build failed, waiting for other jobs to finish...
error: failed to run custom build command for `agent v0.0.36 (/home/bburns/src/akri/agent)`

Caused by:
  process didn't exit successfully: `/home/bburns/src/akri/target/debug/build/agent-205b11fae37b158b/build-script-build` (exit code: 101)
--- stderr
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', /home/bburns/.cargo/registry/src/github.com-1ecc6299db9ec823/tonic-build-0.1.1/src/lib.rs:219:19
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at /usr/src/rustc-1.43.0/vendor/backtrace/src/backtrace/libunwind.rs:86
   1: backtrace::backtrace::trace_unsynchronized
             at /usr/src/rustc-1.43.0/vendor/backtrace/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print_fmt
             at src/libstd/sys_common/backtrace.rs:78
   3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
             at src/libstd/sys_common/backtrace.rs:59
   4: core::fmt::write
             at src/libcore/fmt/mod.rs:1063
   5: std::io::Write::write_fmt
             at src/libstd/io/mod.rs:1426
   6: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:62
   7: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:49
   8: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:204
   9: std::panicking::default_hook
             at src/libstd/panicking.rs:224
  10: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:474
  11: rust_begin_unwind
             at src/libstd/panicking.rs:378
  12: core::panicking::panic_fmt
             at src/libcore/panicking.rs:85
  13: core::result::unwrap_failed
             at src/libcore/result.rs:1222
  14: core::result::Result<T,E>::unwrap
             at /usr/src/rustc-1.43.0/src/libcore/result.rs:1003
  15: tonic_build::fmt
             at /home/bburns/.cargo/registry/src/github.com-1ecc6299db9ec823/tonic-build-0.1.1/src/lib.rs:219
  16: tonic_build::Builder::compile
             at /home/bburns/.cargo/registry/src/github.com-1ecc6299db9ec823/tonic-build-0.1.1/src/lib.rs:172
  17: build_script_build::main
             at agent/build.rs:3
  18: std::rt::lang_start::{{closure}}
             at /usr/src/rustc-1.43.0/src/libstd/rt.rs:67
  19: std::rt::lang_start_internal::{{closure}}
             at src/libstd/rt.rs:52
  20: std::panicking::try::do_call
             at src/libstd/panicking.rs:303
  21: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:86
  22: std::panicking::try
             at src/libstd/panicking.rs:281
  23: std::panic::catch_unwind
             at src/libstd/panic.rs:394
  24: std::rt::lang_start_internal
             at src/libstd/rt.rs:51
  25: std::rt::lang_start
             at /usr/src/rustc-1.43.0/src/libstd/rt.rs:67
  26: main
  27: __libc_start_main
  28: _start
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

warning: build failed, waiting for other jobs to finish...
error: build failed
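
For context, the panic originates in tonic_build::fmt, which shells out to the rustfmt binary to format the generated protobuf code; the Os { code: 2, ... NotFound } error suggests rustfmt isn't on the PATH (the backtrace points at a distro Rust toolchain, which doesn't ship rustfmt by default). Assuming that's the cause, installing it (e.g. rustup component add rustfmt, or the distro's rustfmt package) should let the build scripts complete; later tonic-build versions also expose a builder option to skip the formatting step entirely.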

[Extensibility] Agent hostNetwork setting breaks cluster DNS lookup

Describe the bug
Using hostNetwork: true for a Pod (as Akri agent does) breaks DNS name resolution for K8s services. See kubernetes/dns#316. Adding dnsPolicy: ClusterFirstWithHostNet fixes the problem.

Details
I'm developing an end:end example using HTTP (See #85).

The discovery handler consistently fails to discover the URL referenced by its handler's discovery_endpoint value.

I don't want to distract you with my noob issues but, if you've any insight into what I'm doing wrong, I'd appreciate it.

Per @bfjelds revised Nessie example, I'm also using reqwest and the agent generates the following error:

[http:discover] Entered
[http:discover] url: http://discovery:9999
[http:discover] Response: Err(reqwest::Error { kind: Request, url: "http://discovery:9999/", source: hyper::Error(Connect, ConnectError("dns error", Custom { kind: Other, error: "failed to lookup address information: Temporary failure in name resolution" })) })
[http:discover] Spoofed results

In the above, I've taken the get out of the control flow and am spoofing the results (see below):

async fn discover(&self) -> Result<Vec<DiscoveryResult>, failure::Error> {
  println!("[http:discover] Entered");
  let url = self.discovery_handler_config.discovery_endpoint.clone();
  println!("[http:discover] url: {}", &url);

  let resp = get(&url).await;
  ..
}

When the discover function returned directly from matching on the get, the error was slightly more informative:

async fn discover(&self) -> Result<Vec<DiscoveryResult>, failure::Error> {
  println!("[http:discover] Entered");
  let url = self.discovery_handler_config.discovery_endpoint.clone();
  println!("[http:discover] url: {}", &url);

  match get(&url).await {
    Ok(resp) => {
      let device_list = &resp.text().await?;
      let result: Vec<DiscoveryResult> = device_list.lines().map(...).collect();
      Ok(result)
    }
    Err(err) => {
      Err(format_err!("unable to parse discovery endpoint results: {:?}", err))
    }
  }
}

Yields:

[http:discover] Entered
[http:discover] url: http://discovery:9999
[http:discover] Failed to connect to discovery endpoint: http://discovery:9999
[http:discover] Error: error sending request for url (http://discovery:9999/): error trying to connect: dns error: failed to lookup address information: Temporary failure in name resolution
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: ErrorMessage { msg: "unable to parse discovery endpoint results: reqwest::Error { kind: Request, url: \"http://discovery:9999/\", source: hyper::Error(Connect, ConnectError(\"dns error\", Custom { kind: Other, error: \"failed to lookup address information: Temporary failure in name resolution\" })) }" }', agent/src/util/config_action.rs:146:64
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I'm relatively (!) confident that this URL is correct and the agent should be able to GET it.

If I run a curl pod in the cluster('s default namespace), the endpoint generates 200 and the correct response:

kubectl run curl --image=radial/busyboxplus:curl --stdin --tty --rm

[ root@curl:/ ]$ curl http://discovery:9999/
0.0.0.0:8000
0.0.0.0:8001
0.0.0.0:8002
0.0.0.0:8003
0.0.0.0:8004
0.0.0.0:8005
0.0.0.0:8006
0.0.0.0:8007
0.0.0.0:8008
0.0.0.0:8009

I'm at a loss to understand why this error arises but it does so consistently and reliably (I've tried ... but will try using the Cluster IP).

I'm able to spoof the correct result by manually creating Vec<DiscoveryResult> with the values provided by the response and then the agent and broker work correctly:

kubectl logs pod/akri-http-dbb47e-pod
[http:main] Entered
[http:main] Device: http://device-8000:8000
[http:main] get_discovery_data
[http:get_discovery_data] Entered
[http:main] Environment:
PATH:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME:akri-http-dbb47e-pod
....
[http:main] Starting gRPC server
[http:serve] Entered
[http:serve] Starting gRPC server: 0.0.0.0:8084
[http:main:loop] Sleep
[http:main:loop] read_sensor(http://device-8000:8000)
[http:read_sensor] Entered
[main:read_sensor] Response status: 200
[main:read_sensor] Response body: Ok("0.2854040649574679")
[http:main:loop] Sleep
[http:main:loop] read_sensor(http://device-8000:8000)
[http:read_sensor] Entered
[main:read_sensor] Response status: 200
[main:read_sensor] Response body: Ok("0.4158989841983801")
[http:main:loop] Sleep
[http:main:loop] read_sensor(http://device-8000:8000)
[http:read_sensor] Entered
[main:read_sensor] Response status: 200
[main:read_sensor] Response body: Ok("0.30926792372133194")
[http:main:loop] Sleep

So, besides this issue, I'm almost at a solution (I still need to come up with an approach for device DNS naming...).
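
For reference, the spoof is just a hard-coded Vec<DiscoveryResult> built the same way the real path would build it; roughly like this (the endpoints and property key here are illustrative, not the exact values from my branch):

// Hedged sketch of the spoofed discovery path
let spoofed_endpoints = vec!["http://device-8000:8000", "http://device-8001:8001"];
let results: Vec<DiscoveryResult> = spoofed_endpoints
    .into_iter()
    .map(|endpoint| {
        let mut props = HashMap::new();
        props.insert("http_device_endpoint".to_string(), endpoint.to_string());
        DiscoveryResult::new(endpoint, props, true)
    })
    .collect();
Ok(results)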

Output of kubectl get pods,akrii,akric -o wide

See above

Kubernetes Version: [e.g. Native Kubernetes 1.19, MicroK8s 1.19, Minikube 1.19, K3s]

MicroK8s.

ci: authentication required prior to pulling images from DockerHub

Starting November 1, 2020, Docker Hub will impose rate limits based on the originating IP. Since Akri pulls down images from DockerHub when certain CI jobs are invoked, and since Github Actions are run from a shared pool of IP addresses, it is highly recommended to use authenticated Docker pulls with Docker Hub to avoid rate limit problems.

https://docs.docker.com/docker-hub/download-rate-limit/
https://www.docker.com/pricing

By default, Docker Hub limits anonymous pull access to 100 pulls per six hours, per IP. When logged in with a free account, the limit is bumped to 200 pulls per six hours. Pro/Team accounts allow unlimited pulls.

Both build-intermediate and build-component-per-arch pull down multiarch/qemu-user-static:

https://github.com/deislabs/akri/blob/a7d719d6d84ce6d9ef2f5a9beb3c64219b51b5c7/.github/actions/build-intermediate/main.js#L28-L29
https://github.com/deislabs/akri/blob/a7d719d6d84ce6d9ef2f5a9beb3c64219b51b5c7/.github/actions/build-component-per-arch/main.js#L28-L29

run-tarpaulin appears to pull xd009642/tarpaulin:0.12.2:

https://github.com/deislabs/akri/blob/a7d719d6d84ce6d9ef2f5a9beb3c64219b51b5c7/.github/workflows/run-tarpaulin.yml#L31

There are two solutions:

  1. migrate to another "hub" without this rate limit like Github Container Registry
  2. log in to DockerHub before pulling these images, as done later on in most workflows with Github Container Registry:

https://github.com/deislabs/akri/blob/a7d719d6d84ce6d9ef2f5a9beb3c64219b51b5c7/.github/actions/build-component-per-arch/main.js#L58-L59

I'd also highly recommend auditing the rest of the code base to ensure you've logged in prior to pulling images from Docker Hub.

[Extensibility] Protocol "skeleton-builder"

Is your feature request related to a problem? Please describe.

Developing protocol extensions requires multiple changes to the existing project code, configuration, specs, etc., on top of adding the protocol-specific agent code and broker implementations.

I think it would encourage developers to have a skeleton-builder to configure Akri for the addition of a new protocol implementation. The builder would either prompt the developer for answers to a set of questions or it would parse a simple specification file in order to generate the appropriate changes to the Akri sources and template stubs ready for development.

In addition, coding the process to make these changes would more effectively 'enshrine' what's needed (the current approach is to have developers complete multiple steps prior to development). This code would form part of the repo and would evolve as the extensibility mechanism evolves.

Describe alternatives you've considered

The current approach is to maintain and follow documentation guidelines upon deciding to add a protocol to Akri. This manual approach is error-prone and tedious and requires the developer to spend time working on Akri before the interesting aspect of development begins.

The current approach will need to be manually revised as Akri evolves. If the process were code, it should include tests that prove it continues to be current. If the process were code that used a simple spec, a developer could re-apply the spec to Akri to regenerate a project outline as Akri evolves.

Additional context
Add any other context or screenshots about the feature request here.

failure is deprecated

Is your feature request related to a problem? Please describe.

The Rust failure crate is deprecated.

Describe the solution you'd like

Consider replacing failure.
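
The most likely replacements are anyhow (application-style error propagation) and/or thiserror (typed errors). As a minimal sketch of what discovery-style code could look like with anyhow (illustrative only, not a committed migration plan; the function shape is simplified):

use anyhow::{anyhow, Result};

async fn discover(url: &str) -> Result<Vec<String>> {
    // reqwest::Error implements std::error::Error, so `?` converts it into anyhow::Error
    let body = reqwest::get(url).await?.text().await?;
    if body.is_empty() {
        return Err(anyhow!("discovery endpoint {} returned an empty device list", url));
    }
    Ok(body.lines().map(|line| line.to_string()).collect())
}

In broad strokes, failure::Error maps to anyhow::Error and format_err! maps to anyhow!, so the mechanical part of the migration should be small.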

Video streaming app pod going to CrashLoopBackOff

Hello,

Describe the bug

When replicating the end-to-end demo with MicroK8s (Ubuntu instance on GCE) on my own, the pod of the video streaming app goes to CrashLoopBackOff.

All other prior install steps were successful.

I added the logs of the failing pod below.

Didier

Output of kubectl get pods,akrii,akric -o wide

microk8s kubectl get all --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system pod/calico-kube-controllers-847c8c99d-65bbl 1/1 Running 0 18m 10.1.54.67 microk8s-akri
kube-system pod/calico-node-xsvqv 1/1 Running 1 18m 10.128.0.35 microk8s-akri
kube-system pod/coredns-86f78bb79c-dbgm9 1/1 Running 0 17m 10.1.54.66 microk8s-akri
default pod/akri-agent-daemonset-c88pb 1/1 Running 0 12m 10.128.0.35 microk8s-akri
default pod/akri-controller-deployment-5b4bb5cbb5-8mlwp 1/1 Running 0 12m 10.1.54.68 microk8s-akri
default pod/akri-video-streaming-app-fd5f4cb7d-fzpcq 0/1 CrashLoopBackOff 7 12m 10.1.54.69 microk8s-akri

NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
default service/kubernetes ClusterIP 10.152.183.1 443/TCP 18m
kube-system service/kube-dns ClusterIP 10.152.183.10 53/UDP,53/TCP,9153/TCP 17m k8s-app=kube-dns
default service/akri-video-streaming-app NodePort 10.152.183.216 80:31671/TCP 12m app=akri-video-streaming-app

NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE CONTAINERS IMAGES SELECTOR
kube-system daemonset.apps/calico-node 1 1 1 1 1 kubernetes.io/os=linux 18m calico-node calico/node:v3.13.2 k8s-app=calico-node
default daemonset.apps/akri-agent-daemonset 1 1 1 1 1 kubernetes.io/os=linux 12m akri-agent ghcr.io/deislabs/akri/agent:latest-dev name=akri-agent

NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE CONTAINERS IMAGES SELECTOR
kube-system deployment.apps/coredns 1/1 1 1 17m coredns coredns/coredns:1.6.6 k8s-app=kube-dns
kube-system deployment.apps/calico-kube-controllers 1/1 1 1 18m calico-kube-controllers calico/kube-controllers:v3.13.2 k8s-app=calico-kube-controllers
default deployment.apps/akri-controller-deployment 1/1 1 1 12m akri-controller ghcr.io/deislabs/akri/controller:latest-dev app=akri-controller
default deployment.apps/akri-video-streaming-app 0/1 1 0 12m akri-video-streaming-app ghcr.io/deislabs/akri/video-streaming-app:latest-dev app=akri-video-streaming-app

NAMESPACE NAME DESIRED CURRENT READY AGE CONTAINERS IMAGES SELECTOR
kube-system replicaset.apps/calico-kube-controllers-847c8c99d 1 1 1 18m calico-kube-controllers calico/kube-controllers:v3.13.2 k8s-app=calico-kube-controllers,pod-template-hash=847c8c99d
kube-system replicaset.apps/coredns-86f78bb79c 1 1 1 17m coredns coredns/coredns:1.6.6 k8s-app=kube-dns,pod-template-hash=86f78bb79c
default replicaset.apps/akri-controller-deployment-5b4bb5cbb5 1 1 1 12m akri-controller ghcr.io/deislabs/akri/controller:latest-dev app=akri-controller,pod-template-hash=5b4bb5cbb5
default replicaset.apps/akri-video-streaming-app-fd5f4cb7d 1 1 0 12m akri-video-streaming-app ghcr.io/deislabs/akri/video-streaming-app:latest-dev app=akri-video-streaming-app,pod-template-hash=fd5f4cb7d

Kubernetes Version: [e.g. Native Kubernetes 1.19, MicroK8s 1.19, Minikube 1.19, K3s]

snap list
Name Version Rev Tracking Publisher Notes
core 16-2.47.1 10185 latest/stable canonical✓ core
core18 20200929 1932 latest/stable canonical✓ base
google-cloud-sdk 316.0.0 157 latest/stable/… google-cloud-sdk✓ classic
helm 3.4.0 302 latest/stable snapcrafters classic
lxd 4.0.3 16922 4.0/stable/… canonical✓ -
microk8s v1.19.2 1769 1.19/stable canonical✓ classic
snapd 2.47.1 9721 latest/stable canonical✓ snapd

To Reproduce
I can publish my script in a repo with a corresponding GitHub workflow to allow you to see the full execution log (and reproduce it by forking if needed).

Expected behavior

Get streaming app pod to Running status

Logs (please share snips of applicable logs)

microk8s kubectl logs akri-video-streaming-app-fd5f4cb7d-fzpcq
Traceback (most recent call last):
  File "./streaming.py", line 33, in <module>
    grpc_port = os.environ[env_var_prefix + 'PORT_GRPC'] # instance services are using the same port by default
  File "/usr/lib/python3.7/os.py", line 678, in __getitem__
    raise KeyError(key) from None
KeyError: 'AKRI_UDEV_VIDEO_SVC_SERVICE_PORT_GRPC'

Additional context

GCE instance with Ubuntu LTS 20.04

Add documentation on how to use udev outside of udevvideo

Is your feature request related to a problem? Please describe.
Currently, two use cases for Akri are spelled out in the docs: onvifvideo and udevvideo. But udev can be used for other devices as well.

Describe the solution you'd like
Please add a top-level entry on how to use udev for non-video leaf devices. Also consider making that the top-level udev page and referencing the video use case from it as an example. A Helm template would be an added bonus, but it isn't necessary as long as a sample Akri Configuration is provided along with an explanation of how to use it.

Simplify (!?) End-to-End

Is your feature request related to a problem? Please describe.

My curiosity is piqued by akri; it sounds very interesting.

The End-to-End is involved and the video dependency is neat but it requires a bunch of dependencies that may (!?) cause problems (see #42).

Is your feature request related to a way you would like Akri extended? Please describe.

IIUC (!?) akri provides a mechanism by which IoT devices can be managed through Kubernetes apps.

The prototypical IoT examples are simple sensors shipping data to consumers and these can be readily emulated with random number generators pushing|pulling via some endpoint.

Describe the solution you'd like

I think it would be useful to have a simpler End-to-End example that consumes, say, random numbers representing temperatures generated by simple off-cluster apps. Perhaps these apps could be simple socket (TCP|UDP, or, less realistically but practically, HTTP) emitters?

This would avoid the need to install dkms, v4l2loopback, etc. to get a developer working with a basic Akri installation.
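
To illustrate how light such an emitter could be, here's a sketch of a UDP "temperature" emitter using only the Rust standard library (the address, port and payload format are made up for the example):

use std::net::UdpSocket;
use std::thread::sleep;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

fn main() -> std::io::Result<()> {
    let socket = UdpSocket::bind("0.0.0.0:0")?;
    loop {
        // Fake a reading around 20C, using the clock as a crude entropy source
        let nanos = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().subsec_nanos();
        let temperature = 20.0 + (nanos % 1000) as f64 / 100.0;
        let payload = format!("temperature={:.2}", temperature);
        // Illustrative consumer address; a broker in the cluster would listen here
        socket.send_to(payload.as_bytes(), "127.0.0.1:9000")?;
        sleep(Duration::from_secs(1));
    }
}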

Describe alternatives you've considered

The Nessie extensibility looks interesting and I'm going to try this. But, it would be helpful to have a baseline, known working installation (using something simple as described above) before proceeding with Nessie.
