project-akri / akri
A Kubernetes Resource Interface for the Edge
Home Page: https://docs.akri.sh/
License: Apache License 2.0
Is your feature request related to a problem? Please describe.
The Akri Helm chart allows specifying whether to run the controller onlyOnMaster. For that, Akri expects a specific label to be present on the control plane nodes, but not all Kubernetes distributions apply such a label. Also, with AKS, you cannot deploy onto control plane nodes.
Is your feature request related to a way you would like Akri extended? Please describe.
Akri should be able to run on clusters without the specific label that Akri is looking for and also on clusters where users cannot deploy onto CP nodes.
Describe the solution you'd like
Consider moving the toleration outside of the onlyOnMaster condition in the Helm chart. This way, Akri could run on control plane nodes, but not only there; with onlyOnMaster, it would run only on control plane nodes.
This would not allow disabling the option to run on CP, but I think that can be solved later based on feedback.
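A minimal sketch of what this could look like in the controller's Deployment template, assuming illustrative value and label names (the actual chart keys may differ):

```yaml
# Sketch only: the toleration is applied unconditionally, while node
# placement stays gated behind onlyOnMaster. Key/label names illustrative.
tolerations:
  - key: node-role.kubernetes.io/master
    effect: NoSchedule
{{- if .Values.controller.onlyOnMaster }}
nodeSelector:
  node-role.kubernetes.io/master: ""
{{- end }}
```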
Is your feature request related to a problem? Please describe.
I've begun prototyping (!) a ZeroConf protocol implementation for Akri
Is your feature request related to a way you would like Akri extended? Please describe.
ZeroConf is probably (!) a useful protocol implementation particularly due to its reliable naming of (transient) devices.
Describe the solution you'd like
Akri protocol implementation with Broker samples.
Is your feature request related to a problem? Please describe.
No one likes it when their 1.14 cluster won't work. Make it clear what minimum version DOES work, and what works beyond it.
Is your feature request related to a way you would like Akri extended? Please describe.
DOCUMENT the minimum version.
Describe the solution you'd like
DOCUMENT the minimum version.
Every Instance CRD has a shared property, which determines whether the device/capability can be shared by multiple nodes. If an instance is not shared, every node that discovers it will create a new Instance CRD, while if it is shared, that Instance CRD will be shared by all nodes that can discover the instance/capability.
Currently the sharability of an instance is defined by the protocol that was used to discover it. For example, all instances (more specifically, IP cameras in this protocol) discovered by the ONVIF protocol are marked as shared, as can be seen in the ONVIF DiscoveryHandler's implementation of are_shared(). Similarly, all instances discovered with udev are marked as unshared, since the udev protocol only discovers devices on a specific node.
Should the "is this shared?" decision be made elsewhere, such as in the Configuration, allowing an operator to specify sharability? Should the decision be made on a device-by-device basis?
One thing to note about the validity of the current solution of defining sharability by protocol: Right now, if ONVIF cameras were allowed to be unshared, it could result in multiple instances for a single camera, making it hard to regulate capacity and possibly leading to an overloaded device. Also, having two Instance CRDs for a single device seems counter-intuitive.
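If the decision moved into the Configuration, one hypothetical shape would be an optional override that falls back to the protocol's default when unset. The `shared` field below does not exist today; its name and placement are illustrative only:

```yaml
apiVersion: akri.sh/v0
kind: Configuration
metadata:
  name: akri-onvif
spec:
  protocol:
    onvif: {}
  # Hypothetical operator-facing override; when absent, the protocol's
  # default sharability (shared for ONVIF, unshared for udev) would apply.
  shared: true
```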
Describe the bug
Based on the Akri udev grammar, it seems that ATTRS might be supported, but when I pass ATTRS as part of my udev rule, the Agent ignores it:
[2020-11-08T09:04:14Z TRACE agent::protocols::udev::discovery_impl] parse_udev_rule - unsupported field ATTRS{manufacturer}
Is v2 not yet supported?
Output of kubectl get pods,akrii,akric -o wide
agent and controller are running, my configuration is specified, but too many instances are found because the ATTRS fields of the rule are ignored
Kubernetes Version: K3s
To Reproduce
Supplying a rule such as SUBSYSTEM=="tty"\, ATTRS{manufacturer}=="Silicon Labs"\, ATTRS{idProduct}=="ea60"
matches all tty devices.
Expected behavior
Supplying a rule such as SUBSYSTEM=="tty"\, ATTRS{manufacturer}=="Silicon Labs"\, ATTRS{idProduct}=="ea60"
should only match tty devices that have matching manufacturer and idProduct attributes on the device or its parents.
Logs (please share snips of applicable logs)
Snippet from Agent log:
[2020-11-08T09:04:14Z TRACE agent::util::config_action] do_periodic_discovery - start for config akri-udev
[2020-11-08T09:04:14Z TRACE agent::util::config_action] do_periodic_discovery - loop iteration for config akri-udev
[2020-11-08T09:04:14Z TRACE agent::protocols::udev::discovery_handler] discover - for udev rules ["SUBSYSTEM==\"tty\", ATTRS{manufacturer}==\"Silicon Labs\", ATTRS{idProduct}==\"ea60\""]
[2020-11-08T09:04:14Z INFO agent::protocols::udev::discovery_impl] parse_udev_rule - enter for udev rule string SUBSYSTEM=="tty", ATTRS{manufacturer}=="Silicon Labs", ATTRS{idProduct}=="ea60"
[2020-11-08T09:04:14Z TRACE agent::protocols::udev::discovery_impl] parse_udev_rule - parsing udev_rule "SUBSYSTEM==\"tty\", ATTRS{manufacturer}==\"Silicon Labs\", ATTRS{idProduct}==\"ea60\""
[2020-11-08T09:04:14Z TRACE agent::protocols::udev::discovery_impl] parse_udev_rule - unsupported field ATTRS{manufacturer}
[2020-11-08T09:04:14Z TRACE agent::protocols::udev::discovery_impl] parse_udev_rule - unsupported field ATTRS{idProduct}
[2020-11-08T09:04:14Z TRACE agent::protocols::udev::discovery_impl] find_devices - enter with udev_filters [UdevFilter { field: Pair { rule: subsystem, span: Span { str: "SUBSYSTEM", start: 0, end: 9 }, inner: [] }, operation: equality, value: "tty" }]
[2020-11-08T09:04:14Z TRACE agent::protocols::udev::discovery_impl] enumerator_match_udev_filters - enter with udev_filters [UdevFilter { field: Pair { rule: subsystem, span: Span { str: "SUBSYSTEM", start: 0, end: 9 }, inner: [] }, operation: equality, value: "tty" }]
[2020-11-08T09:04:14Z TRACE agent::protocols::udev::discovery_impl] enumerator_nomatch_udev_filters - enter with udev_filters []
[2020-11-08T09:04:14Z TRACE agent::protocols::udev::discovery_impl] filter_by_remaining_udev_filters - enter with udev_filters []
[2020-11-08T09:04:14Z TRACE agent::protocols::udev::discovery_impl] do_parse_and_find - returning discovered devices with devpaths: ["/dev/ttyAMA0", "/dev/ttyUSB0", "/dev/console", "/dev/ptmx", "/dev/tty", "/dev/tty0", "/dev/tty1", "/dev/tty10", "/dev/tty11", "/dev/tty12", "/dev/tty13", "/dev/tty14", "/dev/tty15", "/dev/tty16", "/dev/tty17", "/dev/tty18", "/dev/tty19", "/dev/tty2", "/dev/tty20", "/dev/tty21", "/dev/tty22", "/dev/tty23", "/dev/tty24", "/dev/tty25", "/dev/tty26", "/dev/tty27", "/dev/tty28", "/dev/tty29", "/dev/tty3", "/dev/tty30", "/dev/tty31", "/dev/tty32", "/dev/tty33", "/dev/tty34", "/dev/tty35", "/dev/tty36", "/dev/tty37", "/dev/tty38", "/dev/tty39", "/dev/tty4", "/dev/tty40", "/dev/tty41", "/dev/tty42", "/dev/tty43", "/dev/tty44", "/dev/tty45", "/dev/tty46", "/dev/tty47", "/dev/tty48", "/dev/tty49", "/dev/tty5", "/dev/tty50", "/dev/tty51", "/dev/tty52", "/dev/tty53", "/dev/tty54", "/dev/tty55", "/dev/tty56", "/dev/tty57", "/dev/tty58", "/dev/tty59", "/dev/tty6", "/dev/tty60", "/dev/tty61", "/dev/tty62", "/dev/tty63", "/dev/tty7", "/dev/tty8", "/dev/tty9", "/dev/ttyprintk"]
[2020-11-08T09:04:14Z TRACE agent::protocols::udev::discovery_handler] discover - mapping and returning devices at devpaths {"/dev/tty41", "/dev/tty24", "/dev/tty20", "/dev/tty37", "/dev/tty51", "/dev/tty38", "/dev/tty17", "/dev/tty18", "/dev/tty49", "/dev/tty5", "/dev/tty4", "/dev/tty0", "/dev/tty63", "/dev/tty43", "/dev/tty48", "/dev/tty39", "/dev/tty46", "/dev/tty28", "/dev/tty11", "/dev/tty54", "/dev/tty36", "/dev/tty30", "/dev/tty52", "/dev/tty16", "/dev/tty2", "/dev/tty45", "/dev/tty61", "/dev/tty25", "/dev/tty", "/dev/tty59", "/dev/tty42", "/dev/tty29", "/dev/tty47", "/dev/tty56", "/dev/tty55", "/dev/ttyUSB0", "/dev/tty44", "/dev/tty50", "/dev/tty33", "/dev/tty13", "/dev/tty58", "/dev/tty31", "/dev/ttyAMA0", "/dev/tty6", "/dev/tty10", "/dev/tty8", "/dev/tty35", "/dev/ttyprintk", "/dev/tty32", "/dev/tty14", "/dev/tty60", "/dev/tty23", "/dev/tty27", "/dev/tty40", "/dev/tty3", "/dev/tty7", "/dev/ptmx", "/dev/tty1", "/dev/tty19", "/dev/tty15", "/dev/tty53", "/dev/tty62", "/dev/tty22", "/dev/tty9", "/dev/tty21", "/dev/tty26", "/dev/tty57", "/dev/tty34", "/dev/console", "/dev/tty12"}
Describe the bug
I am following the install procedure of the end-to-end demo: when executing step 1 of this section, I get
Error: unknown flag: --set agent.host.crictl
I have set the variable AKRI_HELM_CRICTL_CONFIGURATION properly (I hope...) via
AKRI_HELM_CRICTL_CONFIGURATION='--set agent.host.crictl=/usr/local/bin/crictl --set agent.host.dockerShimSock=/var/snap/microk8s/common/run/containerd.sock'
The akri helm repo was properly added via microk8s helm3 repo add 'akri-helm-charts' 'https://deislabs.github.io/akri/'
on which I get "akri-helm-charts" has been added to your repositories
So, what am I doing wrong?
Thanks for your help
Didier
Output of kubectl get pods,akrii,akric -o wide
I did not reach this stage yet
Kubernetes Version: [e.g. Native Kubernetes 1.19, MicroK8s 1.19, Minikube 1.19, K3s]
microk8s (1.19/edge) v1.19.4 from Canonical* installed
microk8s v1.19.4 1810 1.19/edge canonical* classic
1810 is the revision of the installed snap
To Reproduce
run our script at https://github.com/didier-durand/microk8s-akri/blob/main/sh/microk8s-akri.sh or fork repo and run workflow https://github.com/didier-durand/microk8s-akri/blob/main/.github/workflows/microk8s-akri.yml
Expected behavior
Proper installation of Helm chart to be able to continue further in the demo install
Logs (please share snips of applicable logs)
### install akri chart:
2020-11-20T06:52:02.4560237Z Error: no repositories to show
2020-11-20T06:52:02.8022500Z "akri-helm-charts" has been added to your repositories
2020-11-20T06:52:02.8085732Z critctl path: /usr/local/bin/crictl
2020-11-20T06:52:02.9648848Z Error: unknown flag: --set agent.host.crictl
Full log (Github interactive interface): https://github.com/didier-durand/microk8s-akri/actions/runs/373835396
Full log (text form) available at https://pipelines.actions.githubusercontent.com/YT1CGW2Wkplohu80t7gWKaYhzh929U5dpGIo69C98k0YO1N9dm/_apis/pipelines/1/runs/25/signedlogcontent/3?urlExpires=2020-11-20T06%3A53%3A29.4204408Z&urlSigningMethod=HMACV1&urlSignature=Z2xBZfORSHyClYKfz86VocTNemnkMl4zLflU%2B3S2Dqg%3D
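The symptom matches a common shell-quoting pitfall (an assumption, since the failing script line isn't shown): if the variable is expanded inside double quotes, e.g. helm install akri ... "$AKRI_HELM_CRICTL_CONFIGURATION", helm receives the whole string as a single argument and reports exactly unknown flag: --set agent.host.crictl. A minimal demonstration:

```shell
# Count how many arguments a command would receive.
count_args() { echo "$#"; }

FLAGS='--set agent.host.crictl=/usr/local/bin/crictl --set agent.host.dockerShimSock=/var/snap/microk8s/common/run/containerd.sock'

count_args "$FLAGS"   # quoted: 1 argument -> helm sees one giant unknown flag
count_args $FLAGS     # unquoted: 4 arguments -> two separate --set flags
```

Leaving the expansion unquoted (or building the flags in a bash array and expanding it as "${FLAGS[@]}") lets the shell split the string into separate flags for helm.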
Is your feature request related to a way you would like Akri extended? Please describe.
MQTT is a key protocol in the IoT world. It would be extremely beneficial if Akri could directly communicate with devices implementing this protocol to report data / get instructions. That would make the path between device and Kubernetes application much shorter than it is for other projects like KubeEdge, whose architecture puts more elements between device and K8s application.
Many use cases show how MQTT is used today or will be used in the future.
Describe the solution you'd like
Akri may rely on existing protocol implementations to deliver this feature rapidly.
Describe alternatives you've considered
The only alternative known as of now is KubeEdge, with the disadvantage mentioned above.
Additional context
https://dzone.com/articles/why-mqtt-has-become-the-de-facto-iot-standard
When I used Helm to install Akri, I got an error: no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1". I then found that the CRD uses the apiextensions.k8s.io/v1beta1 apiVersion. But after I fixed that, the following errors popped up. (Maybe my knowledge here is relatively weak.)
Error: failed to install CRD crds/akri-configuration-crd.yaml: CustomResourceDefinition.apiextensions.k8s.io "configurations.akri.sh" is invalid: [spec.versions[0].additionalPrinterColumns[0].JSONPath: Required value, spec.versions[0].additionalPrinterColumns[1].JSONPath: Required value, spec.versions: Invalid value: []apiextensions.CustomResourceDefinitionVersion{apiextensions.CustomResourceDefinitionVersion{Name:"v0", Served:true, Storage:true, Schema:(*apiextensions.CustomResourceValidation)(0xc005e8c630), Subresources:(*apiextensions.CustomResourceSubresources)(nil), AdditionalPrinterColumns:[]apiextensions.CustomResourceColumnDefinition{apiextensions.CustomResourceColumnDefinition{Name:"Capacity", Type:"string", Format:"", Description:"The capacity for each Instance discovered", Priority:0, JSONPath:""}, apiextensions.CustomResourceColumnDefinition{Name:"Age", Type:"date", Format:"", Description:"", Priority:0, JSONPath:""}}}}: per-version schemas may not all be set to identical values (top-level validation should be used instead), spec.versions: Invalid value: []apiextensions.CustomResourceDefinitionVersion{apiextensions.CustomResourceDefinitionVersion{Name:"v0", Served:true, Storage:true, Schema:(*apiextensions.CustomResourceValidation)(0xc005e8c630), Subresources:(*apiextensions.CustomResourceSubresources)(nil), AdditionalPrinterColumns:[]apiextensions.CustomResourceColumnDefinition{apiextensions.CustomResourceColumnDefinition{Name:"Capacity", Type:"string", Format:"", Description:"The capacity for each Instance discovered", Priority:0, JSONPath:""}, apiextensions.CustomResourceColumnDefinition{Name:"Age", Type:"date", Format:"", Description:"", Priority:0, JSONPath:""}}}}: per-version additionalPrinterColumns may not all be set to identical values (top-level additionalPrinterColumns should be used instead)]
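For reference, a hedged sketch of the shape apiextensions.k8s.io/v1 expects: printer columns sit under each version entry (or at the top level when identical across versions) and the field is jsonPath, which is what the "JSONPath: Required value" errors point at. The schema and JSON paths below are illustrative, not Akri's actual definition:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: configurations.akri.sh
spec:
  group: akri.sh
  scope: Namespaced
  names:
    kind: Configuration
    plural: configurations
    singular: configuration
  versions:
    - name: v0
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          x-kubernetes-preserve-unknown-fields: true   # illustrative schema
      additionalPrinterColumns:
        - name: Capacity
          type: string
          description: The capacity for each Instance discovered
          jsonPath: .spec.capacity                     # illustrative path
        - name: Age
          type: date
          jsonPath: .metadata.creationTimestamp
```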
Describe the bug
Codecov action fails to upload for pull_request_target triggers.
Describe the bug
The e2e tests are using MicroK8s, which does not install crictl. Because of this (and #6), the tests really should be failing, but they are not.
To Reproduce
Any PR that triggers the e2e tests will demonstrate this problem.
Expected behavior
The tests should verify that crictl was called successfully, e.g.:
microk8s kubectl logs $(microk8s kubectl get pods -A | grep agent | awk '{print $1}' | head -n 1) | grep 'get_node_slots - crictl called successfully'
Is your feature request related to a problem? Please describe.
It's minor but I think consistency here would be helpful (using an arbitrary output
spec to prove the point):
TMPL="{.items[].status.phase}"
# Agent requires `name`
LABEL="name"
kubectl get pods \
--selector=${LABEL}=akri-agent \
--output=jsonpath="${TMPL}"
Running
# Controller requires `app`
LABEL="app"
kubectl get pods \
--selector=${LABEL}=akri-controller \
--output=jsonpath="${TMPL}"
Running
Is your feature request related to a way you would like Akri extended? Please describe.
Describe the solution you'd like
To avoid breaking any existing dependencies on the labels, propose adding the label you'd prefer alongside the one that doesn't have it. Since name is overused (and exists as part of the metadata), propose: add the app label to akri-agent and swap the matchLabels to use app too:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: akri-agent-daemonset
spec:
  selector:
    matchLabels:
      app: akri-agent
  template:
    metadata:
      labels:
        app: akri-agent
        name: akri-agent
If you're confident there aren't dependencies on the name label, then it could be removed.
Describe alternatives you've considered
N/A
Additional context
Is your feature request related to a problem? Please describe.
When a new Helm version is released, its lint capabilities sometimes change and it can pick up new things.
It would be great to run the latest linting so that we can see if a potential issue arises. At the same time, I don't think it would make sense to fail a PR CI build because of this. There isn't a GitHub Actions concept that allows a failure to exist without causing a CI workflow to fail (and show up as a failure on the PR page) ... but there are GitHub Actions issues about enabling allow-failure for a workflow:
It is not clear if/when/where this will land.
In the meantime, until there is a way for a Github workflow to just "warn" on failure, we can play with the actual lint command ... we could change the existing Helm workflow like this:
jobs:
  lint-with-current-helm:
    # Run the workflow as pull_request if it is NOT a fork, and as pull_request_target if it IS a fork
    if: >-
      ( github.event_name == 'pull_request_target' && github.event.pull_request.head.repo.fork == true ) ||
      ( github.event_name == 'pull_request' && github.event.pull_request.head.repo.fork == false )
    runs-on: ubuntu-latest
    timeout-minutes: 20
    steps:
      - name: Checkout the merged commit from PR and base branch
        uses: actions/checkout@v2
        if: github.event_name == 'pull_request_target'
        with:
          # pull_request_target is run in the context of the base repository
          # of the pull request, so the default ref is master branch and
          # ref should be manually set to the head of the PR
          ref: refs/pull/${{ github.event.pull_request.number }}/head
      - name: Checkout the head commit of the branch
        if: ${{ github.event_name != 'pull_request_target' }}
        uses: actions/checkout@v2
      - uses: azure/setup-helm@v1
      - name: Lint helm chart
        run: helm lint deployment/helm && echo "lint finished successfully" || echo "lint found issues"
  helm:
    # existing flow here
Is your feature request related to a problem? Please describe.
IIUC, Nessie's use of gRPC is redundant.
Having begun implementing a protocol (#85), I used Nessie as a template and realized this.
My lack of familiarity with the gRPC crate used, and my struggle to understand Akri's architecture, exacerbated the challenge.
Is your feature request related to a way you would like Akri extended? Please describe.
Remove the gRPC code from Nessie to simplify it.
Describe the solution you'd like
See above.
Describe alternatives you've considered
It's probable that I'm incorrect.
Additional context
Is your feature request related to a problem? Please describe.
Extensibility is tightly coupled: protocol changes require agent, controller, and CRD rebuilds and redeployments.
It seems that this system would benefit from being dynamic though I can't provide guidance on how to achieve this beyond it being a good use-case for gRPC.
Describe the bug
RBAC yaml should use non-deprecated rbac.authorization.k8s.io/v1 ClusterRole
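For context, ClusterRole and ClusterRoleBinding graduated to rbac.authorization.k8s.io/v1 in Kubernetes 1.8, and the v1beta1 API has been deprecated since 1.17. The change is typically just the apiVersion line; a minimal sketch with an illustrative name:

```yaml
# Typically only the apiVersion line changes; rules carry over unchanged.
apiVersion: rbac.authorization.k8s.io/v1   # instead of .../v1beta1
kind: ClusterRole
metadata:
  name: akri-example-role   # illustrative name
rules: []
```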
Is your feature request related to a problem? Please describe.
Akri Configuration resource limit quantity is specified as string:
resources:
  limits:
    "{{PLACEHOLDER}}": "1"
But if you look at Device Plugin Kubernetes doc page, the quantity is numeric, not a string: https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
    - name: demo-container-1
      image: k8s.gcr.io/pause:2.0
      resources:
        limits:
          hardware-vendor.example/foo: 2
Describe the solution you'd like
Consider using numeric quantity.
Describe the bug
Udev-discovered devices are currently mounted through Mount, which does not allow for fine-grained control of access permissions. We should switch to DeviceSpec and expose the desired permission control through the Configuration CRD.
/// Mount specifies a host volume to mount into a container.
/// where device library or tools are installed on host and container
#[derive(Clone, PartialEq, ::prost::Message)]
pub struct Mount {
    /// Path of the mount within the container.
    #[prost(string, tag = "1")]
    pub container_path: std::string::String,
    /// Path of the mount on the host.
    #[prost(string, tag = "2")]
    pub host_path: std::string::String,
    /// If set, the mount is read-only.
    #[prost(bool, tag = "3")]
    pub read_only: bool,
}

/// DeviceSpec specifies a host device to mount into a container.
#[derive(Clone, PartialEq, ::prost::Message)]
pub struct DeviceSpec {
    /// Path of the device within the container.
    #[prost(string, tag = "1")]
    pub container_path: std::string::String,
    /// Path of the device on the host.
    #[prost(string, tag = "2")]
    pub host_path: std::string::String,
    /// Cgroups permissions of the device, candidates are one or more of
    /// * r - allows container to read from the specified device.
    /// * w - allows container to write to the specified device.
    /// * m - allows container to create device files that do not yet exist.
    #[prost(string, tag = "3")]
    pub permissions: std::string::String,
}
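One hypothetical way the Configuration CRD could surface the cgroup permissions that the agent would copy into DeviceSpec.permissions; the `permissions` field below does not exist today, and its name and placement are illustrative only:

```yaml
apiVersion: akri.sh/v0
kind: Configuration
metadata:
  name: akri-udev-video
spec:
  protocol:
    udev:
      udevRules:
        - 'KERNEL=="video[0-9]*"'
  # Hypothetical: one or more of "r", "w", "m", passed through to
  # DeviceSpec.permissions for each discovered device.
  permissions: "rw"
```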
Is your feature request related to a problem? Please describe.
It seems overly permissive to configure admin as the docs advise: kubectl create clusterrolebinding serviceaccounts-cluster-admin --clusterrole=cluster-admin --group=system:serviceaccounts
Is your feature request related to a way you would like Akri extended? Please describe.
Use RBAC to grant the controller and agent only the access each specifically needs.
Describe the solution you'd like
RBAC for controller (add serviceAccountName: 'akri-controller-sa' to controller.yaml):
apiVersion: v1
kind: ServiceAccount
metadata:
  name: akri-controller-sa
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: "akri-controller-role"
rules:
  - apiGroups: [""]
    resources: ["pods", "services"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["akri.sh"]
    resources: ["instances"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: ["akri.sh"]
    resources: ["configurations"]
    verbs: ["get", "list", "watch"]
---
apiVersion: 'rbac.authorization.k8s.io/v1'
kind: 'ClusterRoleBinding'
metadata:
  name: 'akri-controller-binding'
  namespace: {{ .Release.Namespace }}
roleRef:
  apiGroup: 'rbac.authorization.k8s.io'
  kind: 'ClusterRole'
  name: 'akri-controller-role'
subjects:
  - kind: 'ServiceAccount'
    name: 'akri-controller-sa'
    namespace: {{ .Release.Namespace }}
RBAC for agent (add serviceAccountName: 'akri-agent-sa' to agent.yaml):
apiVersion: v1
kind: ServiceAccount
metadata:
  name: akri-agent-sa
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: "akri-agent-role"
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["akri.sh"]
    resources: ["instances"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["akri.sh"]
    resources: ["configurations"]
    verbs: ["get", "list", "watch"]
---
apiVersion: 'rbac.authorization.k8s.io/v1'
kind: 'ClusterRoleBinding'
metadata:
  name: 'akri-agent-binding'
  namespace: {{ .Release.Namespace }}
roleRef:
  apiGroup: 'rbac.authorization.k8s.io'
  kind: 'ClusterRole'
  name: 'akri-agent-role'
subjects:
  - kind: 'ServiceAccount'
    name: 'akri-agent-sa'
    namespace: {{ .Release.Namespace }}
Describe alternatives you've considered
None
Additional context
None
Describe the bug
PR CI builds are failing because secrets are not available from external forks.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
CI builds execute successfully
Describe the bug
Challenged to run the End-to-End example on cloud (!) VMs (GCP, DigitalOcean). Specifically modprobe v4l2loopback:
sudo modprobe v4l2loopback exclusive_caps=1 video_nr=1,2
modprobe: ERROR: could not insert 'v4l2loopback': Unknown symbol in module, or unknown parameter (see dmesg)
I suspect the issue is that dependent modules aren't included in cloud VM kernels, but I prefer to use cloud VMs because they're disposable and I can keep my primary workstation untainted by random installs.
Output of kubectl get pods,akrii,akric -o wide
N/A
Kubernetes Version: [e.g. Native Kubernetes 1.19, MicroK8s 1.19, Minikube 1.19, K3s]
N/A
To Reproduce
Steps to reproduce the behavior:
modprobe v4l2loopback
Expected behavior
No errors
dmesg:
v4l2loopback: loading out-of-tree module taints kernel.
v4l2loopback: module verification failed: signature and/or required key missing - tainting kernel
v4l2loopback: Unknown symbol video_ioctl2 (err -2)
v4l2loopback: Unknown symbol v4l2_ctrl_handler_init_class (err -2)
v4l2loopback: Unknown symbol video_devdata (err -2)
v4l2loopback: Unknown symbol v4l2_ctrl_new_custom (err -2)
v4l2loopback: Unknown symbol video_unregister_device (err -2)
v4l2loopback: Unknown symbol video_device_alloc (err -2)
v4l2loopback: Unknown symbol v4l2_device_register (err -2)
v4l2loopback: Unknown symbol __video_register_device (err -2)
v4l2loopback: Unknown symbol v4l2_ctrl_handler_free (err -2)
v4l2loopback: Unknown symbol v4l2_device_unregister (err -2)
v4l2loopback: Unknown symbol video_device_release (err -2)
NOTE the taint and module verification error are warnings only and may be ignored
Additional context
The solution I've found is:
sudo apt update && \
sudo apt -y install linux-modules-extra-$(uname -r) && \
sudo apt -y install dkms
NOTE the installs may possibly be combined
Then:
curl http://deb.debian.org/debian/pool/main/v/v4l2loopback/v4l2loopback-dkms_0.12.5-1_all.deb ...
sudo dpkg -i v4l2loopback-dkms_0.12.5-1_all.deb
sudo modprobe v4l2loopback exclusive_caps=1 video_nr=1,2
Succeeds!
On GCP:
dkms status
v4l2loopback, 0.12.5, 5.4.0-1028-gcp, x86_64: installed
On DigitalOcean:
dkms status
v4l2loopback, 0.12.5, 5.4.0-51-generic, x86_64: installed
Hopefully this will help others that get stuck.
Describe the bug
Some minor feedback on using image tags rather than digests.
If the developer rebuilds|repushes e.g. the agent image, a Kubernetes cluster may not repull the image even when specs are reapplied. This is because even though the image has changed, its tag has not: the Helm chart references images by tag, and Kubernetes doesn't (!?) repull an image if it has the image (tag) cached.
helm install akri ... \
...
--set agent.image.repository="${REPO}/agent" \
--set agent.image.tag="v0.0.XX-amd64" \
--set controller.image.repository="${REPO}/controller" \
--set controller.image.tag="v0.0.XX-amd64"
This can cause problems if the developer assumes that repushing causes e.g. Kubernetes to repull the image when the spec is reapplied.
The preferred mechanism would be to always (even in Helm) reference images by digest|hash as this is very likely to change every time the image changes.
An alternative is to eyeball the image digests after changes to ensure the images cached by Kubernetes reflect the digests of the images in the repo.
In the case of MicroK8s, it's possible to enumerate the cluster's cached images using crictl
and to remove stale versions:
sudo crictl --runtime-endpoint=unix:///var/snap/microk8s/common/run/containerd.sock images
sudo crictl --runtime-endpoint=unix:///var/snap/microk8s/common/run/containerd.sock rmi ...
Output of kubectl get pods,akrii,akric -o wide
N/A
Kubernetes Version: [e.g. Native Kubernetes 1.19, MicroK8s 1.19, Minikube 1.19, K3s]
N/A
To Reproduce
Steps to reproduce the behavior:
1. Use crictl to note the hashes of the cached agent or controller images
2. Rebuild agent or controller and repush (same tag)
3. Use crictl to confirm that the most recent image hash was not used
Expected behavior
Any image changes (e.g. agent, controller, brokers) should cause Kubernetes to repull the image on recreates.
Logs (please share snips of applicable logs)
N/A
Additional context
N/A
Is your feature request related to a problem? Please describe.
When trying to deploy Akri Agent on Raspbian running on RPi 3 (armv7l), it fails with:
Warning Failed 38s (x3 over 81s) kubelet, automation Failed to pull image "ghcr.io/deislabs/akri/agent:latest-dev": rpc error: code = NotFound desc = failed to pull and unpack image "ghcr.io/deislabs/akri/agent:latest-dev": failed to unpack image on snapshotter overlayfs: no match for platform in manifest sha256:22358e94c42efcd6d2f6751ade021fc606742110b3d0e4a917052ff4cf2609df: not found
Akri has been deployed using the latest containers and akri-dev helm chart.
It seems the manifest only contains amd64 and arm64 images:
docker manifest inspect ghcr.io/deislabs/akri/agent:latest-dev
{
"schemaVersion": 2,
"mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
"manifests": [
{
"mediaType": "application/vnd.docker.distribution.manifest.v2+json",
"size": 1159,
"digest": "sha256:ff5da6128f12f8ef678dab07f68b22fd32c5d89af73c62620ebca21b283853c3",
"platform": {
"architecture": "amd64",
"os": "linux"
}
},
{
"mediaType": "application/vnd.docker.distribution.manifest.v2+json",
"size": 1159,
"digest": "sha256:e6f0aaf7608e57c304151d7aeac60d4aa7bc2cf3e6897dc6d55b6f53c1c3bdc5",
"platform": {
"architecture": "arm64",
"os": "linux"
}
}
]
}
Looks like arm32 is disabled for now: https://github.com/deislabs/akri/blob/main/Makefile
Describe the solution you'd like
Consider supporting Akri on armv7l to allow for deployments on Raspbian.
As expected, the kubelet will not schedule a pod if the pod requests an Instance's deviceUsage
slot that has already been taken, and the pod will be left in Pending state. At this point, the controller should come and take down that pod (so it can possibly be rescheduled to a different slot if capacity has not been met). However, this is not currently happening. The pod is staying in a Pending state.
Output of kubectl get pods -o wide
akri-agent-daemonset-98p49 1/1 Running 0 21h
akri-agent-daemonset-x9k9r 1/1 Running 0 21h
akri-controller-deployment-75d655d869-nfmvz 1/1 Running 0 21h
worker1-akri-debug-echo-foo-8120fe-pod 0/1 Pending 0 21h
worker1-akri-debug-echo-foo-a19705-pod 0/1 Pending 0 21h
worker2-akri-debug-echo-foo-8120fe-pod 1/1 Running 0 21h
worker2-akri-debug-echo-foo-a19705-pod 1/1 Running 0 21h
Kubernetes Version: [e.g. Native Kubernetes 1.19, MicroK8s 1.19, Minikube 1.19, K3s]
MicroK8s 1.19
To reproduce, run on a cluster with 2 workers:
helm install akri akri-helm-charts/akri --set debugEcho.enabled=true --set debugEcho.capacity=1 --set debugEcho.shared=true --set agent.allowDebugEcho=true --debug --set agent.host.dockerShimSock=/var/snap/microk8s/common/run/containerd.sock --set agent.host.crictl=/usr/local/bin/crictl --set debugEcho.capacity=1
Expected behavior
Controller should bring down the pending pods, and in this case where capacity is 1, pods should not be rescheduled
Suspected problem
The controller has a grace period it allows to pass before bringing down pods. That grace period is calculated from a timestamp that is only set after a pod is assigned to a node, which never happens if allocate fails. Consequently, the pod stays in a Pending state because the Controller doesn't remove it if no start time was found:
[2020-10-21T16:05:28Z TRACE controller::util::pod_action] time_choice_for_non_running_pods - no start time found ... give it more time? ("worker1-akri-debug-echo-foo-a19705-pod")
[2020-10-21T16:05:28Z TRACE controller::util::pod_action] time_choice_for_non_running_pods - give_it_more_time: (true)
Each Agent currently continually checks all running containers to make sure that the Instance slot properties are correct (https://github.com/deislabs/sonar/blob/master/beacon/src/util/slot_reconciliation.rs). The reason this is needed is that K8s' Device-Plugin interface does not include a Deallocate message ... so when a Pod exits/deletes, the Instance needs to be updated.
A better model would be to create a Pod watcher and respond to creation/deletion/modified events. A generalized algorithm would be:
Agent.Allocate
Agent.PodWatcher.Created/Modified (for spec.nodeName == <this node>)
Agent.PodWatcher.Modified/Deleted (for spec.nodeName == <this node>)
Agent startup: start a 5 minute timer for any slot requests on this node
Describe the bug
Output of kubectl get pods,akrii,akric -o wide
kubectl apply --filename=nessie.yaml
configuration.akri.sh/nessie created
And:
kubectl get all
NAME READY STATUS RESTARTS AGE
pod/akri-agent-daemonset-6gjbr 1/1 Running 0 89s
pod/akri-controller-deployment-848858df4b-cgplb 1/1 Running 0 89s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kubernetes ClusterIP 10.152.183.1 <none> 443/TCP 26h
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/akri-agent-daemonset 1 1 1 1 1 kubernetes.io/os=linux 89s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/akri-controller-deployment 1/1 1 1 89s
NAME DESIRED CURRENT READY AGE
replicaset.apps/akri-controller-deployment-848858df4b 1 1 1 89s
Kubernetes Version: [e.g. Native Kubernetes 1.19, MicroK8s 1.19, Minikube 1.19, K3s]
Expected behavior
Logs (please share snips of applicable logs)
kubectl logs pods/akri-agent-daemonset-6gjbr
akri.sh Agent start
akri.sh KUBERNETES_PORT found ... env_logger::init
[2020-10-27T19:57:34Z TRACE agent] akri.sh KUBERNETES_PORT found ... env_logger::init finished
[2020-10-27T19:57:34Z TRACE agent::util::slot_reconciliation] periodic_slot_reconciliation - start
[2020-10-27T19:57:34Z TRACE akri_shared::k8s] Loading in-cluster config
[2020-10-27T19:57:34Z INFO agent::util::config_action] do_config_watch - enter
[2020-10-27T19:57:34Z TRACE akri_shared::k8s] Loading in-cluster config
[2020-10-27T19:57:34Z TRACE agent::util::slot_reconciliation] periodic_slot_reconciliation - iteration pre delay_for
[2020-10-27T19:57:34Z TRACE akri_shared::akri::configuration] get_configurations enter
[2020-10-27T19:57:34Z TRACE akri_shared::akri::configuration] get_configurations kube_client.request::<KubeAkriInstanceList>(akri_config_type.list(...)?).await?
[2020-10-27T19:57:34Z TRACE akri_shared::akri::configuration] get_configurations return
[2020-10-27T19:57:34Z TRACE agent::util::config_action] watch_for_config_changes - start
[2020-10-27T19:57:44Z TRACE agent::util::slot_reconciliation] periodic_slot_reconciliation - iteration call reconiler.reconcile
[2020-10-27T19:57:44Z TRACE agent::util::slot_reconciliation] reconcile - thread iteration start [Mutex { data: {} }]
[2020-10-27T19:57:44Z TRACE agent::util::slot_reconciliation] get_node_slots - Command failed to call crictl: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }
[2020-10-27T19:57:44Z TRACE agent::util::slot_reconciliation] reconcile - get_node_slots failed: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }
[2020-10-27T19:57:44Z TRACE agent::util::slot_reconciliation] periodic_slot_reconciliation - iteration end
...
**REPEATS**
...
[2020-10-27T19:58:50Z TRACE agent::util::config_action] handle_config - something happened to a configuration
[2020-10-27T19:58:50Z INFO agent::util::config_action] handle_config - added DevCapConfig nessie
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: hyper::Error(Connect, "invalid URL, scheme is not http")', agent/src/util/config_action.rs:146:64
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
The Nessie URL is https://www.lochness.co.uk/livecam/img/lochness.jpg
Additional context
I am currently running a k3s cluster on some Raspberry Pi 4s. Would it be possible for Akri to utilize an audio card on one of these Pi's as an Akri Plugin?
I would assume a device plugin would need to be created. Does this make sense as an Akri use-case?
Is your feature request related to a problem? Please describe.
The Microsoft IoT team has come up with a generic way to discover resources such as temperature sensors. It would be great if Akri could just discover such resources.
Is your feature request related to a way you would like Akri extended? Please describe.
For each IoT P&P device, it would be great if Akri would create a resource handler automatically.
Additional context
For details, see https://docs.microsoft.com/en-us/azure/iot-pnp/
Describe the bug
When starting Akri in a Google Cloud Platform cluster, the agent failed to start, complaining about mounting crictl from a read-only file system.
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
akri-agent-daemonset-jr7pl 0/1 CrashLoopBackOff 5 4m7s 10.128.0.12 gke-cluster-1-16-default-pool-c4e0f4c9-lthv <none> <none>
akri-agent-daemonset-trwqm 0/1 CrashLoopBackOff 5 4m7s 10.128.0.10 gke-cluster-1-16-default-pool-c4e0f4c9-xlsx <none> <none>
akri-agent-daemonset-wq77x 0/1 CrashLoopBackOff 5 4m7s 10.128.0.11 gke-cluster-1-16-default-pool-c4e0f4c9-tbdm <none> <none>
akri-controller-deployment-5957d7d7cc-z8rnx 1/1 Running 0 4m6s 10.64.1.7 gke-cluster-1-16-default-pool-c4e0f4c9-lthv <none> <none>
Kubernetes Version: 1.16.13-gke.401
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Expect Akri controller and agent to start without error. Akri controller starts. Akri agent fails:
kubectl describe pod $(kubectl get pods | grep agent | awk '{print $1}' | head -1)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m39s default-scheduler Successfully assigned default/akri-agent-daemonset-jr7pl to gke-cluster-1-16-default-pool-c4e0f4c9-lthv
Warning Failed 6m55s (x4 over 7m37s) kubelet Error: failed to start container "akri-agent": Error response from daemon: error while creating mount source path '/usr/bin/crictl': mkdir /usr/bin/crictl: read-only file system
Normal Pulling 6m10s (x5 over 7m38s) kubelet Pulling image "ghcr.io/deislabs/akri/agent:v0.0.36-dev"
Normal Pulled 6m9s (x5 over 7m37s) kubelet Successfully pulled image "ghcr.io/deislabs/akri/agent:v0.0.36-dev"
Normal Created 6m9s (x5 over 7m37s) kubelet Created container akri-agent
Warning BackOff 2m34s (x23 over 7m35s) kubelet Back-off restarting failed container
Describe the bug
(sorry if I am a bit off topic here but I couldn't ask elsewhere like Slack due to my problem)
I cannot join the project's Slack using the badge on README.md
It will redirect me to the page below at https://kubernetes.slack.com/?redir=%2Fmessages%2Fakri
If I do my usual "Continue with Google" (that I use successfully with other Slacks), it will fail. Is it invite-only or something like that?
Thanks
Didier
How can I create a configuration that will detect leaf devices and advertise them as extended resources, but not deploy any brokers automatically?
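One likely answer (an assumption inferred from the Helm template comment "Only add broker pod spec if a broker image is provided", not verified against the docs): a Configuration that omits the broker pod spec should discover and advertise devices without scheduling any brokers. A hypothetical udev example:

```yaml
# Hypothetical Configuration (field names follow the v0 CRD as used in
# this era of Akri; treat as illustrative). No brokerPodSpec is given,
# so leaf devices are discovered and advertised as extended resources
# but no broker pods are deployed.
apiVersion: akri.sh/v0
kind: Configuration
metadata:
  name: akri-udev-discovery-only
spec:
  protocol:
    udev:
      udevRules:
        - 'KERNEL=="video[0-9]*"'
  capacity: 1
  # brokerPodSpec intentionally omitted: discovery only
```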
Describe the bug
When I set the flag --set udev.enabled=true in the Helm command to modify the configuration, I get this error message:
Error: UPGRADE FAILED: YAML parse error on akri/templates/udev.yaml: error converting YAML to JSON: yaml: line 10: could not find expected ':'
Output of kubectl get pods,akrii,akric -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/akri-controller-deployment-5b4bb5cbb5-lgqq6 1/1 Running 0 167m 10.42.0.8 raspberrypi <none> <none>
pod/akri-agent-daemonset-krkhr 1/1 Running 0 119m 192.168.0.9 rpi-3 <none> <none>
pod/akri-agent-daemonset-wmhdn 1/1 Running 0 118m 192.168.0.8 raspberrypi <none> <none>
NAME CAPACITY AGE
configuration.akri.sh/akri-onvif 1 119m
Kubernetes Version: [e.g. Native Kubernetes 1.19, MicroK8s 1.19, Minikube 1.19, K3s]
K3s
NAME STATUS ROLES AGE VERSION
raspberrypi Ready master 10h v1.19.3+k3s2
rpi-3 Ready <none> 9h v1.19.3+k3s2
To Reproduce
Install or upgrade Akri with the Helm command:
helm upgrade akri akri-helm-charts/akri-dev --set onvif.enabled=true --set udev.enabled=true
Additional context
Wrote a post summarizing how to deploy the akri end-to-end to both Google Compute Engine and DigitalOcean.
Hope it helps foster adoption of akri!
Is your feature request related to a problem? Please describe.
Currently the end-to-end tests only validate MicroK8s.
Is your feature request related to a way you would like Akri extended? Please describe.
Add K3s and Kubernetes to the testing matrix.
Describe the solution you'd like
Expand .github/workflows/run-test-cases.yml strategy.matrix to include a dimension for runtime (K3s, Kubernetes) and add runtime installation details for both K3s and Kubernetes.
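A sketch of the suggested matrix change (key and value names are illustrative; the actual workflow layout may differ):

```yaml
# .github/workflows/run-test-cases.yml (sketch)
jobs:
  test:
    strategy:
      matrix:
        # New dimension: which Kubernetes runtime to install and test.
        kube-runtime: [MicroK8s, K3s, Kubernetes]
```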
Describe alternatives you've considered
Currently, we are doing a best effort approach to slot reconciliation, which ensures that device usage on an Instance reflects the real state of which pods are using an instance.
To get the real usage, we are using crictl to query the container runtime in search of active Containers that have our slot Annotations that device-plugin adds to a pod upon an Allocate call from kubelet.
If this crictl query fails (possibly due to crictl not being mounted correctly), we are doing an early return on slot reconciliation. This should be handled more specifically in the future.
Alternatively, if Kubernetes adds Deallocate to Device-Plugin, slot reconciliation might not be needed. This is the PR for Deallocate: kubernetes/kubernetes#91190
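The best-effort flow can be sketched as follows (a simplification, not the actual agent code; the real crictl query and slot comparison are more involved, and `crictl_path` is a parameter here only for illustration):

```rust
use std::process::Command;

// Query the container runtime for running containers via crictl.
fn get_node_slots(crictl_path: &str) -> Result<String, std::io::Error> {
    let out = Command::new(crictl_path).args(["ps", "-o", "json"]).output()?;
    Ok(String::from_utf8_lossy(&out.stdout).into_owned())
}

// Best-effort reconciliation: if crictl cannot be run (not mounted,
// permission denied, ...), skip this iteration with an early return
// instead of updating the Instance based on unknown state.
// Returns whether reconciliation was actually performed.
fn reconcile(crictl_path: &str) -> bool {
    match get_node_slots(crictl_path) {
        Ok(_containers_json) => {
            // ...compare slot annotations against the Instance here...
            true
        }
        Err(e) => {
            eprintln!("reconcile - get_node_slots failed: {}", e);
            false // early return; reconciliation skipped this iteration
        }
    }
}
```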
Is your feature request related to a problem? Please describe.
Helm templates such as https://github.com/deislabs/akri/blob/main/deployment/helm/templates/udev.yaml include YAML comments, which are rendered in the final .yaml files. But at least some comments do not make sense in the final yaml, such as: # Only add broker pod spec if a broker image is provided.
Describe the solution you'd like
Evaluate whether template comments would be better and not end up in the generated yaml. See https://helm.sh/docs/chart_best_practices/templates/, specifically Comments (YAML Comments vs. Template Comments).
Is your feature request related to a problem? Please describe.
https://github.com/deislabs/akri/blob/main/docs/user-guide.md describes how to deploy Akri on K3s and MicroK8s. Specifically, this requires passing additional parameters to Helm to properly configure crictl. However, pages such as https://github.com/deislabs/akri/blob/main/docs/udev-configuration.md or https://github.com/deislabs/akri/blob/main/docs/modifying-akri-installation.md do not have such instructions or even mention the need to set crictl.
Describe the solution you'd like
Consider adding some reference, as folks might modify the installation and unset the required configuration, ending up with a misconfigured Akri installation.
Describe the bug
Please see PR #71
When:
PREFIX=ghcr.io/dazwilkin BUILD_AMD64=1 BUILD_ARM32=0 BUILD_ARM64=0 make akri-agent
Errors:
error[E0061]: this function takes 1 argument but 0 arguments were supplied
--> agent/src/protocols/nessie/discovery_handler.rs:28:28
|
28 | if let Ok(_body) = hyper::Client::new().get(url).compat().await {
| ^^^^^^^^^^^^^^^^^^-- supplied 0 arguments
| |
| expected 1 argument
error[E0599]: no method named `compat` found for struct `hyper::client::FutureResponse` in the current scope
--> agent/src/protocols/nessie/discovery_handler.rs:28:58
|
28 | if let Ok(_body) = hyper::Client::new().get(url).compat().await {
| ^^^^^^ method not found in `hyper::client::FutureResponse`
Output of kubectl get pods,akrii,akric -o wide
N/A
Kubernetes Version: [e.g. Native Kubernetes 1.19, MicroK8s 1.19, Minikube 1.19, K3s]
N/A
To Reproduce
Steps to reproduce the behavior:
Build akri-agent (or akri-controller).
Expected behavior
No compilation errors.
Logs (please share snips of applicable logs)
Above.
Additional context
Describe the bug
IIUC the End-to-End MicroK8s step to Install Helm is redundant.
Output of kubectl get pods,akrii,akric -o wide
N/A
Kubernetes Version: [e.g. Native Kubernetes 1.19, MicroK8s 1.19, Minikube 1.19, K3s]
N/A
To Reproduce
Steps to reproduce the behavior:
microk8s.enable helm3
alias helm=microk8s.helm3
helm version
Yields: version.BuildInfo{Version:"v3.0.2", ...
Expected behavior
N/A
Logs (please share snips of applicable logs)
N/A
Additional context
Recommend:
alias helm=microk8s.helm3
Describe the bug
Walking through Extensibility, when trying to build akri-agent or akri-controller in this step, I receive unauthorized from deislabs' registry endpoint on GHCR.
I assume the build needs these intermediate images and that the repository is not public.
Recommend: make it public
PREFIX=ghcr.io/dazwilkin BUILD_AMD64=1 BUILD_ARM32=0 BUILD_ARM64=0 make akri-controller
cargo install cross
Updating crates.io index
Ignored package `cross v0.2.1` is already installed, use --force to override
PKG_CONFIG_ALLOW_CROSS=1 cross build --release --target=x86_64-unknown-linux-gnu
Unable to find image 'ghcr.io/deislabs/akri/rust-crossbuild:x86_64-unknown-linux-gnu-0.1.16-0.0.6' locally
docker: Error response from daemon: Get https://ghcr.io/v2/deislabs/akri/rust-crossbuild/manifests/x86_64-unknown-linux-gnu-0.1.16-0.0.6: unauthorized.
See 'docker run --help'.
make: *** [build/akri-containers.mk:44: akri-cross-build-amd64] Error 125
Kubernetes Version: [e.g. Native Kubernetes 1.19, MicroK8s 1.19, Minikube 1.19, K3s]
N/A
To Reproduce
See summary.
Expected behavior
Permitted to pull from deislabs' GHCR registry|repos
Logs (please share snips of applicable logs)
Additional context
Describe the bug
Currently, the default broker spec capacity is 5. That seems like overkill.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I think it would make more sense to start with 1 or 2. I would vote for 1 as HA can be an opt-in.
Describe the bug
Nessie's discovery_handler.rs discover function always returns Ok(T), even on failure:
async fn discover(&self) -> Result<Vec<DiscoveryResult>, failure::Error> {
let url = self.discovery_handler_config.nessie_url.clone();
let mut results = Vec::new();
match reqwest::get(&url).await {
Ok(resp) => {
trace!("Found nessie url: {:?} => {:?}", &url, &resp);
// If the Nessie URL can be accessed, we will return a DiscoveryResult
// instance
let mut props = HashMap::new();
props.insert("nessie_url".to_string(), url.clone());
results.push(DiscoveryResult::new(&url, props, true));
}
Err(err) => {
println!("Failed to establish connection to {}", &url);
println!("Error: {}", err);
return Ok(results);
}
};
Ok(results)
}
Expected behavior
Should this not return Err(failure::Error) on the Err branch?
async fn discover(&self) -> Result<Vec<DiscoveryResult>, failure::Error> {
let url = self.discovery_handler_config.nessie_url.clone();
let mut results = Vec::new();
match reqwest::get(&url).await {
Ok(resp) => {
trace!("Found nessie url: {:?} => {:?}", &url, &resp);
// If the Nessie URL can be accessed, we will return a DiscoveryResult
// instance
let mut props = HashMap::new();
props.insert("nessie_url".to_string(), url.clone());
results.push(DiscoveryResult::new(&url, props, true));
}
Err(err) => {
println!("Failed to establish connection to {}", &url);
println!("Error: {}", err);
return Err(format_err!("failed to establish connection: {}", err));
}
};
Ok(results)
}
Then, since results is only required by the Ok branch and only ever holds a single value, we can simplify and just (implicitly) return the result of the match:
async fn discover(&self) -> Result<Vec<DiscoveryResult>, failure::Error> {
let url = self.discovery_handler_config.nessie_url.clone();
match reqwest::get(&url).await {
Ok(resp) => {
trace!("Found nessie url: {:?} => {:?}", &url, &resp);
// If the Nessie URL can be accessed, we will return a DiscoveryResult
// instance
let mut props = HashMap::new();
props.insert("nessie_url".to_string(), url.clone());
Ok(vec![DiscoveryResult::new(&url, props, true)])
}
Err(err) => {
println!("Failed to establish connection to {}", &url);
println!("Error: {}", err);
Err(format_err!("failed to establish connection: {}", err))
}
}
}
Additional context
See #89
Is your feature request related to a problem? Please describe.
A proposal.
Is your feature request related to a way you would like Akri extended? Please describe.
Part "can I repro 'Nessie'?".
Part "What could be even simpler?".
I'm noodling a handler based on HTTP with two endpoints: /health and /. The / endpoint returns a random number purporting to be some sensor value (easily extended), served on 0.0.0.0.
Describe the solution you'd like
This solution is very basic but it's a realistic (HTTP aside) stand-in for an MCU-device-with-sensors scenario.
Describe alternatives you've considered
The obvious alternative for MCUs would be to use MQTT and invert the flow of control: devices publish to broker, Akri subscribes to MQTT channels.
My sense (!) is that using MQTT doesn't extol Akri; I'm still unsure what the ideal use-case is for Akri but suspect that discovery and non-trivial interactions|data may be key (!?).
I considered using gRPC for device interaction but the added complexity appears to add no value in explaining Akri.
I considered using named pipes but feel this doesn't enlighten the developer and conveys locally attached producers.
Additional context
Plan to write the tutorial from-scratch.
I ran:
./build/setup.sh
cargo build
And I see:
bburns@helios:~/src/akri$ RUST_BACKTRACE=1 cargo build
Compiling hyper v0.13.7
Compiling async-std v1.6.2
Compiling agent v0.0.36 (/home/bburns/src/akri/agent)
Compiling udev-video-broker v0.0.36 (/home/bburns/src/akri/samples/brokers/udev-video-broker)
error: failed to run custom build command for `udev-video-broker v0.0.36 (/home/bburns/src/akri/samples/brokers/udev-video-broker)`
Caused by:
process didn't exit successfully: `/home/bburns/src/akri/target/debug/build/udev-video-broker-6ad08c5c40d22b43/build-script-build` (exit code: 101)
--- stderr
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', /home/bburns/.cargo/registry/src/github.com-1ecc6299db9ec823/tonic-build-0.1.1/src/lib.rs:219:19
stack backtrace:
0: backtrace::backtrace::libunwind::trace
at /usr/src/rustc-1.43.0/vendor/backtrace/src/backtrace/libunwind.rs:86
1: backtrace::backtrace::trace_unsynchronized
at /usr/src/rustc-1.43.0/vendor/backtrace/src/backtrace/mod.rs:66
2: std::sys_common::backtrace::_print_fmt
at src/libstd/sys_common/backtrace.rs:78
3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
at src/libstd/sys_common/backtrace.rs:59
4: core::fmt::write
at src/libcore/fmt/mod.rs:1063
5: std::io::Write::write_fmt
at src/libstd/io/mod.rs:1426
6: std::sys_common::backtrace::_print
at src/libstd/sys_common/backtrace.rs:62
7: std::sys_common::backtrace::print
at src/libstd/sys_common/backtrace.rs:49
8: std::panicking::default_hook::{{closure}}
at src/libstd/panicking.rs:204
9: std::panicking::default_hook
at src/libstd/panicking.rs:224
10: std::panicking::rust_panic_with_hook
at src/libstd/panicking.rs:474
11: rust_begin_unwind
at src/libstd/panicking.rs:378
12: core::panicking::panic_fmt
at src/libcore/panicking.rs:85
13: core::result::unwrap_failed
at src/libcore/result.rs:1222
14: core::result::Result<T,E>::unwrap
at /usr/src/rustc-1.43.0/src/libcore/result.rs:1003
15: tonic_build::fmt
at /home/bburns/.cargo/registry/src/github.com-1ecc6299db9ec823/tonic-build-0.1.1/src/lib.rs:219
16: tonic_build::Builder::compile
at /home/bburns/.cargo/registry/src/github.com-1ecc6299db9ec823/tonic-build-0.1.1/src/lib.rs:172
17: build_script_build::main
at samples/brokers/udev-video-broker/build.rs:2
18: std::rt::lang_start::{{closure}}
at /usr/src/rustc-1.43.0/src/libstd/rt.rs:67
19: std::rt::lang_start_internal::{{closure}}
at src/libstd/rt.rs:52
20: std::panicking::try::do_call
at src/libstd/panicking.rs:303
21: __rust_maybe_catch_panic
at src/libpanic_unwind/lib.rs:86
22: std::panicking::try
at src/libstd/panicking.rs:281
23: std::panic::catch_unwind
at src/libstd/panic.rs:394
24: std::rt::lang_start_internal
at src/libstd/rt.rs:51
25: std::rt::lang_start
at /usr/src/rustc-1.43.0/src/libstd/rt.rs:67
26: main
27: __libc_start_main
28: _start
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
warning: build failed, waiting for other jobs to finish...
error: failed to run custom build command for `agent v0.0.36 (/home/bburns/src/akri/agent)`
Caused by:
process didn't exit successfully: `/home/bburns/src/akri/target/debug/build/agent-205b11fae37b158b/build-script-build` (exit code: 101)
--- stderr
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', /home/bburns/.cargo/registry/src/github.com-1ecc6299db9ec823/tonic-build-0.1.1/src/lib.rs:219:19
stack backtrace:
0: backtrace::backtrace::libunwind::trace
at /usr/src/rustc-1.43.0/vendor/backtrace/src/backtrace/libunwind.rs:86
1: backtrace::backtrace::trace_unsynchronized
at /usr/src/rustc-1.43.0/vendor/backtrace/src/backtrace/mod.rs:66
2: std::sys_common::backtrace::_print_fmt
at src/libstd/sys_common/backtrace.rs:78
3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
at src/libstd/sys_common/backtrace.rs:59
4: core::fmt::write
at src/libcore/fmt/mod.rs:1063
5: std::io::Write::write_fmt
at src/libstd/io/mod.rs:1426
6: std::sys_common::backtrace::_print
at src/libstd/sys_common/backtrace.rs:62
7: std::sys_common::backtrace::print
at src/libstd/sys_common/backtrace.rs:49
8: std::panicking::default_hook::{{closure}}
at src/libstd/panicking.rs:204
9: std::panicking::default_hook
at src/libstd/panicking.rs:224
10: std::panicking::rust_panic_with_hook
at src/libstd/panicking.rs:474
11: rust_begin_unwind
at src/libstd/panicking.rs:378
12: core::panicking::panic_fmt
at src/libcore/panicking.rs:85
13: core::result::unwrap_failed
at src/libcore/result.rs:1222
14: core::result::Result<T,E>::unwrap
at /usr/src/rustc-1.43.0/src/libcore/result.rs:1003
15: tonic_build::fmt
at /home/bburns/.cargo/registry/src/github.com-1ecc6299db9ec823/tonic-build-0.1.1/src/lib.rs:219
16: tonic_build::Builder::compile
at /home/bburns/.cargo/registry/src/github.com-1ecc6299db9ec823/tonic-build-0.1.1/src/lib.rs:172
17: build_script_build::main
at agent/build.rs:3
18: std::rt::lang_start::{{closure}}
at /usr/src/rustc-1.43.0/src/libstd/rt.rs:67
19: std::rt::lang_start_internal::{{closure}}
at src/libstd/rt.rs:52
20: std::panicking::try::do_call
at src/libstd/panicking.rs:303
21: __rust_maybe_catch_panic
at src/libpanic_unwind/lib.rs:86
22: std::panicking::try
at src/libstd/panicking.rs:281
23: std::panic::catch_unwind
at src/libstd/panic.rs:394
24: std::rt::lang_start_internal
at src/libstd/rt.rs:51
25: std::rt::lang_start
at /usr/src/rustc-1.43.0/src/libstd/rt.rs:67
26: main
27: __libc_start_main
28: _start
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
warning: build failed, waiting for other jobs to finish...
error: build failed
Describe the bug
Using hostNetwork: true for a Pod (as the Akri agent does) breaks DNS name resolution for K8s Services. See kubernetes/dns#316. Adding dnsPolicy: ClusterFirstWithHostNet fixes the problem.
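The fix can be illustrated with a minimal pod spec (names and image are illustrative, not the Akri agent manifest):

```yaml
# Sketch: a host-network pod that keeps cluster DNS working.
apiVersion: v1
kind: Pod
metadata:
  name: example-host-network
spec:
  hostNetwork: true
  # Without this, the pod inherits the node's resolv.conf and cannot
  # resolve Kubernetes Service names; ClusterFirstWithHostNet restores
  # cluster DNS for host-network pods.
  dnsPolicy: ClusterFirstWithHostNet
  containers:
    - name: main
      image: busybox
      command: ["sleep", "3600"]
```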
Details
I'm developing an end-to-end example using HTTP (see #85).
The discovery handler consistently fails to discover
the URL referenced by its handler's discovery_endpoint
value.
I don't want to distract you with my noob issues but, if you've any insight into what I'm doing wrong, I'd appreciate it.
Per @bfjelds revised Nessie example, I'm also using reqwest
and the agent generates the following error:
[http:discover] Entered
[http:discover] url: http://discovery:9999
[http:discover] Response: Err(reqwest::Error { kind: Request, url: "http://discovery:9999/", source: hyper::Error(Connect, ConnectError("dns error", Custom { kind: Other, error: "failed to lookup address information: Temporary failure in name resolution" })) })
[http:discover] Spoofed results
In the above, I've taken the get
out of the control flow and am spoofing the results (see below):
async fn discover(&self) -> Result<Vec<DiscoverResult>, failure::Error> {
println!("[http:discover] Entered");
let url = self.discovery_handler_config.discovery_endpoint.clone();
println!("[http:discover] url: {}", &url);
let resp = get(&url).await;
..
}
When the discover
function returned directly from matching on the get
, the error was slightly more informative:
async fn discover(&self) -> Result<Vec<DiscoveryResult>, failure::Error> {
println!("[http:discover] Entered");
let url = self.discovery_handler_config.discovery_endpoint.clone();
println!("[http:discover] url: {}", &url);
match get(&url).await {
Ok(resp) => {
let device_list = &resp.text().await?;
let result: Vec<DiscoveryResult> = device_list.lines().map(...).collect();
Ok(result)
}
Err(err) => {
Err(format_err!("unable to parse discovery endpoint results: {:?}", err))
}
}
}
Yields:
[http:discover] Entered
[http:discover] url: http://discovery:9999
[http:discover] Failed to connect to discovery endpoint: http://discovery:9999
[http:discover] Error: error sending request for url (http://discovery:9999/): error trying to connect: dns error: failed to lookup address information: Temporary failure in name resolution
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: ErrorMessage { msg: "unable to parse discovery endpoint results: reqwest::Error { kind: Request, url: \"http://discovery:9999/\", source: hyper::Error(Connect, ConnectError(\"dns error\", Custom { kind: Other, error: \"failed to lookup address information: Temporary failure in name resolution\" })) }" }', agent/src/util/config_action.rs:146:64
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
I'm relatively (!) confident that this URL is correct and the agent should be able to GET it.
If I run a curl
pod in the cluster('s default namespace), the endpoint generates 200 and the correct response:
kubectl run curl --image=radial/busyboxplus:curl --stdin --tty --rm
[ root@curl:/ ]$ curl http://discovery:9999/
0.0.0.0:8000
0.0.0.0:8001
0.0.0.0:8002
0.0.0.0:8003
0.0.0.0:8004
0.0.0.0:8005
0.0.0.0:8006
0.0.0.0:8007
0.0.0.0:8008
0.0.0.0:8009
I'm at a loss to understand why this error arises but it does so consistently and reliably (I've tried ... but will try using the Cluster IP).
I'm able to spoof the correct result by manually creating Vec<DiscoveryResult>
with the values provided by the response and then the agent and broker work correctly:
kubectl logs pod/akri-http-dbb47e-pod
[http:main] Entered
[http:main] Device: http://device-8000:8000
[http:main] get_discovery_data
[http:get_discovery_data] Entered
[http:main] Environment:
PATH:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME:akri-http-dbb47e-pod
....
[http:main] Starting gRPC server
[http:serve] Entered
[http:serve] Starting gRPC server: 0.0.0.0:8084
[http:main:loop] Sleep
[http:main:loop] read_sensor(http://device-8000:8000)
[http:read_sensor] Entered
[main:read_sensor] Response status: 200
[main:read_sensor] Response body: Ok("0.2854040649574679")
[http:main:loop] Sleep
[http:main:loop] read_sensor(http://device-8000:8000)
[http:read_sensor] Entered
[main:read_sensor] Response status: 200
[main:read_sensor] Response body: Ok("0.4158989841983801")
[http:main:loop] Sleep
[http:main:loop] read_sensor(http://device-8000:8000)
[http:read_sensor] Entered
[main:read_sensor] Response status: 200
[main:read_sensor] Response body: Ok("0.30926792372133194")
[http:main:loop] Sleep
So, besides this issue, I'm almost (still need to come up with a solution for device DNS naming...) at a solution.
Output of kubectl get pods,akrii,akric -o wide
See above
Kubernetes Version: [e.g. Native Kubernetes 1.19, MicroK8s 1.19, Minikube 1.19, K3s]
MicroK8s.
Starting November 1, 2020, Docker Hub will impose rate limits based on the originating IP. Since Akri pulls down images from Docker Hub when certain CI jobs are invoked, and since GitHub Actions run from a shared pool of IP addresses, it is highly recommended to use authenticated Docker pulls to avoid rate-limit problems.
https://docs.docker.com/docker-hub/download-rate-limit/
https://www.docker.com/pricing
By default, Docker Hub limits anonymous pull access to 100 pulls per six hours, per IP. When logged in with a free account, that is bumped to 200 pulls per six hours. Pro/Team accounts allow unlimited pulls.
Both build-intermediate
and build-component-per-arch
pull down multiarch/qemu-user-static
:
https://github.com/deislabs/akri/blob/a7d719d6d84ce6d9ef2f5a9beb3c64219b51b5c7/.github/actions/build-intermediate/main.js#L28-L29
https://github.com/deislabs/akri/blob/a7d719d6d84ce6d9ef2f5a9beb3c64219b51b5c7/.github/actions/build-component-per-arch/main.js#L28-L29
run-tarpaulin appears to pull xd009642/tarpaulin:0.12.2:
There are two solutions:
I'd also highly recommend auditing the rest of the code base to ensure you've logged in prior to pulling images from Docker Hub.
Is your feature request related to a problem? Please describe.
Developing protocol extensions requires multiple changes to the existing project code, configuration, specs, etc., on top of writing the protocol-specific agent code and broker implementations.
I think it would encourage developers to have a skeleton-builder to configure Akri for the addition of a new protocol implementation. The builder would either prompt the developer for answers to a set of questions or it would parse a simple specification file in order to generate the appropriate changes to the Akri sources and template stubs ready for development.
In addition, coding the process to make these changes would more effectively 'enshrine' what's needed (the current approach is to have developers complete multiple steps prior to development). This code would form part of the repo and would evolve as the extensibility mechanism evolves.
Describe alternatives you've considered
The current approach is to maintain and follow documentation guidelines upon deciding to add a protocol to Akri. This manual approach is error-prone and tedious and requires the developer to spend time working on Akri before the interesting aspect of development begins.
The current approach will need to be manually revised as Akri evolves. If the process were code, it should include tests that prove it continues to be current. If the process were code that used a simple spec, a developer could re-apply the spec to Akri to regenerate a project outline as Akri evolves.
Additional context
Is your feature request related to a problem? Please describe.
The Rust failure crate is deprecated.
Describe the solution you'd like
Consider replacing failure.
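A minimal sketch of one migration path, using only the standard library (the anyhow and thiserror crates are the commonly suggested successors; the names below are hypothetical, not existing Akri code):

```rust
use std::fmt;

// A small concrete error type replacing ad-hoc failure::Error values.
#[derive(Debug)]
struct DiscoveryError(String);

impl fmt::Display for DiscoveryError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "discovery error: {}", self.0)
    }
}

impl std::error::Error for DiscoveryError {}

// Result<_, failure::Error> becomes Result<_, Box<dyn std::error::Error>>
// (or anyhow::Result<_> if adopting the anyhow crate).
fn discover(url: &str) -> Result<Vec<String>, Box<dyn std::error::Error>> {
    if url.is_empty() {
        return Err(Box::new(DiscoveryError("empty url".into())));
    }
    Ok(vec![url.to_string()])
}
```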
Hello,
Describe the bug
When replicating the end-to-end demo with MicroK8s (Ubuntu instance on GCE) on my own, the pod of the video streaming app goes to CrashLoopBackOff.
All other prior install steps were successful.
I added the logs of the failing pod below.
Didier
Output of kubectl get pods,akrii,akric -o wide
microk8s kubectl get all --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system pod/calico-kube-controllers-847c8c99d-65bbl 1/1 Running 0 18m 10.1.54.67 microk8s-akri
kube-system pod/calico-node-xsvqv 1/1 Running 1 18m 10.128.0.35 microk8s-akri
kube-system pod/coredns-86f78bb79c-dbgm9 1/1 Running 0 17m 10.1.54.66 microk8s-akri
default pod/akri-agent-daemonset-c88pb 1/1 Running 0 12m 10.128.0.35 microk8s-akri
default pod/akri-controller-deployment-5b4bb5cbb5-8mlwp 1/1 Running 0 12m 10.1.54.68 microk8s-akri
default pod/akri-video-streaming-app-fd5f4cb7d-fzpcq 0/1 CrashLoopBackOff 7 12m 10.1.54.69 microk8s-akri
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
default service/kubernetes ClusterIP 10.152.183.1 443/TCP 18m
kube-system service/kube-dns ClusterIP 10.152.183.10 53/UDP,53/TCP,9153/TCP 17m k8s-app=kube-dns
default service/akri-video-streaming-app NodePort 10.152.183.216 80:31671/TCP 12m app=akri-video-streaming-app
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE CONTAINERS IMAGES SELECTOR
kube-system daemonset.apps/calico-node 1 1 1 1 1 kubernetes.io/os=linux 18m calico-node calico/node:v3.13.2 k8s-app=calico-node
default daemonset.apps/akri-agent-daemonset 1 1 1 1 1 kubernetes.io/os=linux 12m akri-agent ghcr.io/deislabs/akri/agent:latest-dev name=akri-agent
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE CONTAINERS IMAGES SELECTOR
kube-system deployment.apps/coredns 1/1 1 1 17m coredns coredns/coredns:1.6.6 k8s-app=kube-dns
kube-system deployment.apps/calico-kube-controllers 1/1 1 1 18m calico-kube-controllers calico/kube-controllers:v3.13.2 k8s-app=calico-kube-controllers
default deployment.apps/akri-controller-deployment 1/1 1 1 12m akri-controller ghcr.io/deislabs/akri/controller:latest-dev app=akri-controller
default deployment.apps/akri-video-streaming-app 0/1 1 0 12m akri-video-streaming-app ghcr.io/deislabs/akri/video-streaming-app:latest-dev app=akri-video-streaming-app
NAMESPACE NAME DESIRED CURRENT READY AGE CONTAINERS IMAGES SELECTOR
kube-system replicaset.apps/calico-kube-controllers-847c8c99d 1 1 1 18m calico-kube-controllers calico/kube-controllers:v3.13.2 k8s-app=calico-kube-controllers,pod-template-hash=847c8c99d
kube-system replicaset.apps/coredns-86f78bb79c 1 1 1 17m coredns coredns/coredns:1.6.6 k8s-app=kube-dns,pod-template-hash=86f78bb79c
default replicaset.apps/akri-controller-deployment-5b4bb5cbb5 1 1 1 12m akri-controller ghcr.io/deislabs/akri/controller:latest-dev app=akri-controller,pod-template-hash=5b4bb5cbb5
default replicaset.apps/akri-video-streaming-app-fd5f4cb7d 1 1 0 12m akri-video-streaming-app ghcr.io/deislabs/akri/video-streaming-app:latest-dev app=akri-video-streaming-app,pod-template-hash=fd5f4cb7d
Kubernetes Version: MicroK8s v1.19.2 (see snap list below)
snap list
Name Version Rev Tracking Publisher Notes
core 16-2.47.1 10185 latest/stable canonical✓ core
core18 20200929 1932 latest/stable canonical✓ base
google-cloud-sdk 316.0.0 157 latest/stable/… google-cloud-sdk✓ classic
helm 3.4.0 302 latest/stable snapcrafters classic
lxd 4.0.3 16922 4.0/stable/… canonical✓ -
microk8s v1.19.2 1769 1.19/stable canonical✓ classic
snapd 2.47.1 9721 latest/stable canonical✓ snapd
To Reproduce
I can publish my script in a repo with a corresponding GitHub workflow so you can see the full execution log (and reproduce by forking if needed).
Expected behavior
The streaming app pod should reach Running status.
Logs (please share snips of applicable logs)
microk8s kubectl logs akri-video-streaming-app-fd5f4cb7d-fzpcq
Traceback (most recent call last):
File "./streaming.py", line 33, in <module>
grpc_port = os.environ[env_var_prefix + 'PORT_GRPC'] # instance services are using the same port by default
File "/usr/lib/python3.7/os.py", line 678, in __getitem__
raise KeyError(key) from None
KeyError: 'AKRI_UDEV_VIDEO_SVC_SERVICE_PORT_GRPC'
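The traceback shows the broker crashing because it indexes `os.environ` directly and the Akri-injected service variable is absent. A minimal sketch of a more defensive lookup (the prefix and default port here are assumptions taken from the traceback, not from the actual `streaming.py`):

```python
import os

# Hypothetical sketch: read the service port Akri is expected to inject,
# falling back to a default instead of raising KeyError when it is absent.
env_var_prefix = 'AKRI_UDEV_VIDEO_SVC_SERVICE_'  # prefix seen in the traceback

grpc_port = os.environ.get(env_var_prefix + 'PORT_GRPC')
if grpc_port is None:
    # Variable not injected (e.g. service not yet created); use an assumed default.
    grpc_port = '80'

print(grpc_port)
```

With `os.environ.get`, a missing variable surfaces as a handled fallback rather than a CrashLoopBackOff.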
Additional context
GCE instance with Ubuntu LTS 20.04
Is your feature request related to a problem? Please describe.
Currently, two use cases for Akri are spelled out in the docs: ONVIF video and udev video. But udev can be used for other devices as well.
Describe the solution you'd like
Please add a top-level entry on how to use udev for non-video leaf devices, and consider making that the top-level page, with the video use case referenced from it as an example. A Helm template would be a bonus, but is not necessary if a sample Akri Configuration is provided along with an explanation of how to use it.
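To make the request concrete, a hypothetical sketch of what such a non-video sample Configuration might look like, modeled on the udev video sample; the API version, field names, and udev rule are assumptions and may not match Akri's actual schema:

```yaml
# Hypothetical sketch: an Akri Configuration discovering serial (tty) leaf
# devices via udev instead of video4linux devices. Field names follow the
# pattern of the udev video sample and may differ in the real CRD.
apiVersion: akri.sh/v0
kind: Configuration
metadata:
  name: akri-udev-serial
spec:
  protocol:
    udev:
      udevRules:
        - 'SUBSYSTEM=="tty"'   # match serial devices rather than SUBSYSTEM=="video4linux"
  capacity: 1
```

The docs entry could then explain how each field (the udev rule, capacity, and an optional broker pod spec) maps onto the discovered devices.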
Is your feature request related to a problem? Please describe.
My curiosity is piqued by akri; it sounds very interesting.
The end-to-end demo is involved, and the video dependency is neat, but it requires a bunch of dependencies that may (!?) cause problems (see #42).
Is your feature request related to a way you would like Akri extended? Please describe.
IIUC (!?) akri provides a mechanism by which IoT devices can be managed through Kubernetes apps.
The prototypical IoT examples are simple sensors shipping data to consumers, and these can be readily emulated with random number generators pushing or pulling via some endpoint.
Describe the solution you'd like
I think it would be useful to have a simpler end-to-end example that consumes, e.g., random numbers representing temperatures generated by simple off-cluster apps. Perhaps these apps could be simple TCP or UDP socket emitters (or, less realistically but more practically, HTTP)?
This would avoid the need to install dkms, v4l2loopback, etc. to get a developer worker node running with a basic Akri installation.
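The mock sensor described above could be as small as the following sketch: a UDP emitter pushing random "temperature" readings, standing in for a physical leaf device during development. The host, port, and value range are arbitrary assumptions for illustration:

```python
import random
import socket
import time

def emit_readings(host="127.0.0.1", port=9999, count=3):
    """Hypothetical mock temperature sensor: push random readings over UDP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for _ in range(count):
            # A fake reading in an arbitrary 15-30 degree range.
            reading = f"{random.uniform(15.0, 30.0):.2f}"
            sock.sendto(reading.encode(), (host, port))
            time.sleep(0.01)
    finally:
        sock.close()

if __name__ == "__main__":
    emit_readings()
```

A broker sample could then simply listen on the matching socket, with no kernel modules or camera hardware involved.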
Describe alternatives you've considered
The Nessie extensibility example looks interesting and I'm going to try it. But it would be helpful to have a baseline, known-working installation (using something simple, as described above) before proceeding with Nessie.