
container-image-csi-driver's People

Contributors

dependabot[bot], dinesh, ensonic, gesarki, glennpratt, imuni4fun, kalpanathanneeru21, kitt1987, maxfedotov, mbtamuli, mugdha-adhav, sgandon, tomelliot16, vadasambar, yash-acquia


container-image-csi-driver's Issues

Well defined developer workflow

Scope

  • Have a docs/developer.md detailing the workflow for a developer
    • Initial setup
    • Development
    • Testing
    • Anything else expected by the project

Split from #93

Build and push csi-image to the registry from pipeline

Issue faced

Currently the maintainers of warm-metal need to manually build and push the csi-image to the docker registry.

Proposed solution

We want to automatically build and push the csi-image to the Docker registry (since that is what we currently use) from a CI pipeline, to avoid any manual work.

When to push the image

  • On every push to the master branch, update the latest tag of csi-image with the changes from the default (master) branch.
  • On tag creation, an automated build should be triggered which builds and pushes the image for that respective tag.

How to push the image

We might want to use GitHub workflows to run the CI, since that is what we currently use for the integration tests.

Image build doesn't work for me

I can submit another PR updating the README, but right now I run into these errors:

> REGISTRY=gcr.io/myregistry make image
docker buildx build -t gcr.io/myregistry/csi-image:v0.5.1 --push .
[+] Building 30.1s (4/4) FINISHED                                                                                                             
 => [internal] load build definition from Dockerfile                                                                                     0.0s
 => => transferring dockerfile: 397B                                                                                                     0.0s
 => [internal] load .dockerignore                                                                                                        0.0s
 => => transferring context: 2B                                                                                                          0.0s
 => ERROR [internal] load metadata for docker.io/library/alpine:3.13                                                                    30.0s
 => CANCELED [internal] load metadata for docker.io/library/golang:1.16-alpine3.13                                                      30.0s
------
 > [internal] load metadata for docker.io/library/alpine:3.13:
------
Dockerfile:13
--------------------
  11 |     RUN CGO_ENABLED=0 go build -o csi-image-plugin ./cmd/plugin
  12 |     
  13 | >>> FROM alpine:3.13
  14 |     WORKDIR /
  15 |     COPY --from=builder /go/src/csi-driver-image/csi-image-plugin /usr/bin/
--------------------
ERROR: failed to solve: alpine:3.13: failed to do request: Head "https://registry-1.docker.io/v2/library/alpine/manifests/3.13": dial tcp: i/o timeout

This also happens with make sanity. I already learned that one needs a Docker Hub account and has to run docker login first, but that did not help in my case. Any ideas?

If I try to build just using docker, I run into this:

> docker build .
Sending build context to Docker daemon  107.1MB
Step 1/11 : FROM docker.io/library/golang:1.16-alpine3.13 as builder
1.16-alpine3.13: Pulling from library/golang
5758d4e389a3: Pull complete 
04b7a40ca5d5: Pull complete 
452a8c64b8e1: Pull complete 
01da5aed4ae6: Pull complete 
83967ad3b539: Pull complete 
Digest: sha256:c538c29503b9ac4b874ae776a0537fe16fde4581af896d112e53a44a9963b116
Status: Downloaded newer image for golang:1.16-alpine3.13
 ---> e1b239f8b504
Step 2/11 : WORKDIR /go/src/csi-driver-image
 ---> Running in f367a24f22b6
Removing intermediate container f367a24f22b6
 ---> d87a9d13e8fe
Step 3/11 : COPY go.mod go.sum ./
 ---> 20a74138b43c
Step 4/11 : RUN go mod download
 ---> Running in 02f1947e5c3a
go: github.com/BurntSushi/[email protected]: Get "https://proxy.golang.org/github.com/%21burnt%21sushi/toml/@v/v0.3.1.mod": dial tcp: lookup proxy.golang.org: Try again
The command '/bin/sh -c go mod download' returned a non-zero code: 1

Kubernetes v1.26 drops support for CRI v1alpha2

Details

The K8s docs for upgrading to v1.26 mention that support for CRI v1alpha2 is dropped.

We are using CRI v1alpha1 in our code, which won't be supported in Kubernetes 1.26.

This means that containerd minor version 1.5 and older are not supported from k8s 1.26 onwards.
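
As a rough, hedged sketch (not the project's actual code) of what the client-side migration looks like: the CRI image-service client moves to the runtime v1 package of k8s.io/cri-api, while the request and response shapes stay the same. The socket path below is an assumption and varies by distro.

package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimev1 "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func main() {
	// containerd's CRI socket; adjust the path for your setup.
	conn, err := grpc.Dial("unix:///run/containerd/containerd.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// The v1 client replaces the v1alpha2 one.
	client := runtimev1.NewImageServiceClient(conn)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	resp, err := client.ListImages(ctx, &runtimev1.ListImagesRequest{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("node has %d images\n", len(resp.Images))
}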

ToDo

  • Update docs and integration tests to use/support containerd v1.6 and newer only.
  • Check for support/upgrade required in other container runtimes like cri-o.

Missing versioning and release documentation

Currently we try to follow semver versioning when creating GitHub releases, but that's not documented anywhere.

ToDo:

  • Finalize versioning and releases standard.
  • Add documentation for release procedure.
  • Update README with links to documentation created in above steps.
  • Update maintainer docs.

Tests

Hi,

Is there a plan to support CI on this repo?
Also, what is the testing approach? I was wondering if there could be a go command that runs the unit tests without minikube.
Would you please provide details on how to run the e2e tests? It looks like they use the kubectl-dev tool.

Fix pipeline for dependabot branches/PRs

Dependabot PRs as of now will update the actions only. (Maybe in the future we should also consider updating Go dependencies.)

  • Updating actions means the image build/push might break, so I think we should allow dependabot branches to also push to Docker Hub, but add additional cleanup logic for dependabot branches/PRs to delete the Docker image right after the push.
  • If we decide not to run the image build/push workflow on dependabot branches, should we run the other testing workflows at all, since GitHub action version bumps don't really change the code?

Automation for updating tags in Makefile and Chart.yaml

Currently the maintainers or contributors need to manually update the tag in Makefile and Chart.yaml for the tag we want to create.

It would be nice if we could do the following using automation -

  • Creating a tag/release.
  • Updating the Makefile and Chart.yaml with the tag we want to cut (or already have).

ephemeral secret namespace

The readme says ephemeral mounts can specify a secret namespace. It could be very dangerous to allow that without some kind of permissions involved. Other services don't allow a user controlled secret reference outside of the same namespace. How is this safe?

Cancel mechanism for in-flight requests to CRI

This ticket is the outcome of conversation in #71 (comment)

TL;DR:

We don't have a mechanism to cancel in-flight requests when the driver pod restarts, i.e., we don't have a graceful way to cancel contexts.
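
As a rough sketch of one possible approach (names and structure are assumptions, not the driver's code), a termination signal can be wired into a root context via signal.NotifyContext, so every in-flight request derived from it gets cancelled when the pod is asked to stop:

package main

import (
	"context"
	"fmt"
	"os/signal"
	"syscall"
	"time"
)

// pullImage stands in for a long-running CRI call; it honours ctx cancellation.
func pullImage(ctx context.Context, image string) error {
	select {
	case <-time.After(5 * time.Second): // stand-in for the real pull duration
		return nil
	case <-ctx.Done():
		return fmt.Errorf("pull of %q cancelled: %w", image, ctx.Err())
	}
}

func main() {
	// Root context is cancelled on SIGTERM/SIGINT (e.g. kubelet stopping the pod).
	rootCtx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	// Every incoming request derives its context from rootCtx, so a restart
	// cancels all of them instead of leaving them dangling.
	reqCtx, cancel := context.WithTimeout(rootCtx, 10*time.Minute)
	defer cancel()

	if err := pullImage(reqCtx, "docker.io/library/alpine:3.13"); err != nil {
		fmt.Println(err)
	}
}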

TODO

  • come up with an approach to cancel in-flight requests on receiving termination signal for the pod
  • create a PR
  • get it reviewed and merged
  • DONE

Automated image build and push not working on tags

Whenever we create a new tag/release in GitHub, we expect that the image should be pushed for that tag in docker container registry (that's what we support currently).

Yesterday we merged this PR, and this action ran for image build and push on the default (master) branch.

But when we cut tag v0.8.2 for that change, another action should have been triggered to build and push the image with that tag, and it wasn't.

failed: lsetxattr read-only file system on pod start

Hi,

I would like to test out this really promising CSI driver, but I receive the following error:

spec: failed to apply OCI options: relabel "/var/lib/kubelet/pods/ee74a41d-f2b5-4ac3-9722-80454387d5c9/volume-subpaths/source/nginx/18" with "system_u:object_r:data_t:s0:c246,c908" failed: lsetxattr /var/lib/kubelet/pods/ee74a41d-f2b5-4ac3-9722-80454387d5c9/volume-subpaths/source/nginx/18/p: read-only file system

I already changed the mount and the pod to be readable, but I still have that error.
I'm using EKS 1.28 with bottlerocket nodes.

Any ideas what I could try?

Edit: I got it working by setting readOnly: true on the volume directly. Any idea how I can troubleshoot why a writable volume does not work?

Thanks!

Feature Request: Helm chart support.

We have been using this service for over two years in production. We would like to help support a helm chart, which is a different paradigm than this git repo currently has. We actually built a helm chart using the install script and realized that it is not needed once you know the required settings for your cluster/kubelet. I have put the beginnings of one together in hopes that you will merge it in. #46

A pattern I have seen in other repos that keep the helm chart in the same repo as the code is to create tags with the prefix helm-chart-VERSION.

We could always create a new github repo that contains just the helm chart.

Also eventually we could publish the helm chart as well.

Pods with warm metal driver stuck at `ContainerCreating` with `Unable to attach or mount volumes...timed out waiting for the condition` error

You can reproduce the issue using the following workload
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-test-csi-image-test-simple-fs1
spec:
  storageClassName: csi-image.warm-metal.tech
  capacity:
    storage: 5Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: csi-image.warm-metal.tech
    volumeHandle: "docker.io/warmmetal/csi-image-test:simple-fs"
    volumeAttributes:
      # # set pullAlways if you want to ignore local images
      pullAlways: "true"
      # # set secret if the image is private
      # secret: "name of the ImagePullSecret"
      # secretNamespace: "namespace of the secret"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test-csi-image-test-simple-fs1
spec:
  storageClassName: csi-image.warm-metal.tech
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 5Gi
  volumeName: pv-test-csi-image-test-simple-fs1
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-test-csi-image-test-simple-fs2
spec:
  storageClassName: csi-image.warm-metal.tech
  capacity:
    storage: 5Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: csi-image.warm-metal.tech
    volumeHandle: "docker.io/warmmetal/csi-image-test:simple-fs"
    volumeAttributes:
      # # set pullAlways if you want to ignore local images
      pullAlways: "true"
      # # set secret if the image is private
      # secret: "name of the ImagePullSecret"
      # secretNamespace: "namespace of the secret"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test-csi-image-test-simple-fs2
spec:
  storageClassName: csi-image.warm-metal.tech
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 5Gi
  volumeName: pv-test-csi-image-test-simple-fs2
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-test-csi-image-test-simple-fs3
spec:
  storageClassName: csi-image.warm-metal.tech
  capacity:
    storage: 5Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: csi-image.warm-metal.tech
    volumeHandle: "docker.io/warmmetal/csi-image-test:simple-fs"
    volumeAttributes:
      # # set pullAlways if you want to ignore local images
      pullAlways: "true"
      # # set secret if the image is private
      # secret: "name of the ImagePullSecret"
      # secretNamespace: "namespace of the secret"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test-csi-image-test-simple-fs3
spec:
  storageClassName: csi-image.warm-metal.tech
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 5Gi
  volumeName: pv-test-csi-image-test-simple-fs3
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-test-csi-image-test-simple-fs4
spec:
  storageClassName: csi-image.warm-metal.tech
  capacity:
    storage: 5Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: csi-image.warm-metal.tech
    volumeHandle: "docker.io/warmmetal/csi-image-test:simple-fs"
    volumeAttributes:
      # # set pullAlways if you want to ignore local images
      pullAlways: "true"
      # # set secret if the image is private
      # secret: "name of the ImagePullSecret"
      # secretNamespace: "namespace of the secret"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test-csi-image-test-simple-fs4
spec:
  storageClassName: csi-image.warm-metal.tech
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 5Gi
  volumeName: pv-test-csi-image-test-simple-fs4
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-test-csi-image-test-simple-fs5
spec:
  storageClassName: csi-image.warm-metal.tech
  capacity:
    storage: 5Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: csi-image.warm-metal.tech
    volumeHandle: "docker.io/warmmetal/csi-image-test:simple-fs"
    volumeAttributes:
      # # set pullAlways if you want to ignore local images
      pullAlways: "true"
      # # set secret if the image is private
      # secret: "name of the ImagePullSecret"
      # secretNamespace: "namespace of the secret"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test-csi-image-test-simple-fs5
spec:
  storageClassName: csi-image.warm-metal.tech
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 5Gi
  volumeName: pv-test-csi-image-test-simple-fs5
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: core-4472
  labels:
    app: busybox
spec:
  replicas: 10
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      nodeName: <specify-a-nodename-here>
      containers:
        - name: ephemeral-volume
          image: polinux/stress-ng
          command: ["stress-ng"]
          args:
            - "--vm=2"
            - "--cpu=8"
            - "--timeout=0"
          env:
            - name: TARGET1
              value: /target1
            - name: TARGET2
              value: /target2
            - name: TARGET3
              value: /target3
            - name: TARGET4
              value: /target4
            - name: TARGET5
              value: /target5
          volumeMounts:
            - mountPath: /target1
              name: target1
            - mountPath: /target2
              name: target2
            - mountPath: /target3
              name: target3
            - mountPath: /target4
              name: target4
            - mountPath: /target5
              name: target5
      volumes:
        - name: target1
          persistentVolumeClaim:
            claimName: pvc-test-csi-image-test-simple-fs1
        - name: target2
          persistentVolumeClaim:
            claimName: pvc-test-csi-image-test-simple-fs2
        - name: target3
          persistentVolumeClaim:
            claimName: pvc-test-csi-image-test-simple-fs3
        - name: target4
          persistentVolumeClaim:
            claimName: pvc-test-csi-image-test-simple-fs4
        - name: target5
          persistentVolumeClaim:
            claimName: pvc-test-csi-image-test-simple-fs5

Note that I have been able to reproduce this issue on different AWS instance types. Some of them being:

  1. c5.xlarge
  2. m5.xlarge

I think it can be reproduced in other machine types as well.


kubelet: Marshal called with nil; EKS 1.17

Hello again. I've noticed the following error from the kubelet related to a csi-driver-image ephemeral volume.

This is using AWS EKS 1.17.

I'm not sure if this error is triggered by this project, but it seems possible. Sorry I don't have more details handy, but I'm happy to try other things.

"E0604 16:43:03.142973    5169 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/b56d58d9-5c81-4640-b17a-94f5718190eb-drupal-code podName:b56d58d9-5c81-4640-b17a-94f5718190eb nodeName:}" failed. 
No retries permitted until 2021-06-04 16:45:05.142951668 +0000 UTC m=+146313.618357736 (durationBeforeRetry 2m2s). 
Error: "UnmountVolume.TearDown failed for volume \"drupal-code\" (UniqueName: \"kubernetes.io/csi/b56d58d9-5c81-4640-b17a-94f5718190eb-drupal-code\") pod \"b56d58d9-5c81-4640-b17a-94f5718190eb\" (UID: \"b56d58d9-5c81-4640-b17a-94f5718190eb\") :
 kubernetes.io/csi: mounter.TearDownAt failed:
 rpc error: code = Internal desc = grpc: error while marshaling: proto: Marshal called with nil""

Status of this project

Sorry for using an issue to ask questions, but GitHub Discussions are not enabled on this repository.
We (as a company) are very interested in this project. We have a use case where some components are packaged as docker images and need to be shared between many, many pods, and we want to keep the flexibility to update those images independently of their consuming pods (requiring a restart is acceptable).
We have not yet found any satisfactory solution for this use case, and this project is promising.
We have also looked at the https://github.com/kubernetes-csi/csi-driver-image-populator project but it is clearly stated as experimental.
I wanted to know: what is the status of this project? Is it being used somewhere? Will there be any releases?
Thanks for your answers.

Support in-flight requests for warm metal driver

Most of the context around this can be found in the discussion in #71 (comment)

TL;DR:

  1. warm metal can put pressure on disks (via CRI/containerd)
  2. having a way to queue requests instead of processing them immediately can distribute the load on the disks (see the sketch below)
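
A minimal sketch of the queuing idea, with made-up names (not the driver's implementation): a small semaphore bounds how many pulls hit the disk at once, and everything else waits its turn or gives up when its context is cancelled.

package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

type pullQueue struct {
	sem chan struct{} // capacity = max pulls in flight
}

func newPullQueue(maxInFlight int) *pullQueue {
	return &pullQueue{sem: make(chan struct{}, maxInFlight)}
}

// Do blocks until a slot is free (or ctx is cancelled), then runs the pull.
func (q *pullQueue) Do(ctx context.Context, image string, pull func(context.Context, string) error) error {
	select {
	case q.sem <- struct{}{}:
		defer func() { <-q.sem }()
		return pull(ctx, image)
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	q := newPullQueue(2) // at most 2 pulls hit the disk at once
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			_ = q.Do(context.Background(), fmt.Sprintf("image-%d", i),
				func(ctx context.Context, image string) error {
					time.Sleep(200 * time.Millisecond) // stand-in for the real pull
					fmt.Println("pulled", image)
					return nil
				})
		}(i)
	}
	wg.Wait()
}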

TODO

  • decide an approach to implement this
  • create a PR
  • get reviews
  • DONE

Panic after normal operation

I had a few successful test pods operating with ephemeral image volumes, but after 24 hours, I hit a consistent panic:

 plugin I0408 18:52:05.573451       1 driver.go:93] Enabling volume access mode: MULTI_NODE_SINGLE_WRITER                                                                                                                                                      
 plugin I0408 18:52:05.577550       1 server.go:108] Listening for connections on address: &net.UnixAddr{Name:"//csi/csi.sock", Net:"unix"}                                                                                                                    
 plugin I0408 18:52:23.630273       1 node.go:34] request: volume_id:"csi-4f6ad6041b4625e5e585add44e98861b094b966ccc3df6cfd8e05faac735b572" target_path:"/var/snap/microk8s/common/var/lib/kubelet/pods/f0635891-7412-4d86-a6a2-82006a7244da/volumes/kubernete 
 s.io~csi/drupal-code/mount" volume_capability:<mount:<> access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"csi.storage.k8s.io/ephemeral" value:"true" > volume_context:<key:"csi.storage.k8s.io/pod.name" value:"drupal-58fdc8f7cd-78xq8" > volume 
 _context:<key:"csi.storage.k8s.io/pod.namespace" value:"test-env-prod" > volume_context:<key:"csi.storage.k8s.io/pod.uid" value:"f0635891-7412-4d86-a6a2-82006a7244da" > volume_context:<key:"csi.storage.k8s.io/serviceAccount.name" value:"default" > volum 
 e_context:<key:"image" value:"REDACTED:ef0f6fda48eec00356401ee8d929e7d2fb9d9b98" > volume_context:<key:"secret" value:"regcred" >                       
 plugin I0408 18:52:23.641371       1 mounter.go:99] no local image found. Pull image "REDACTED:ef0f6fda48eec00356401ee8d929e7d2fb9d9b98"                
 plugin panic: image "REDACTED:ef0f6fda48eec00356401ee8d929e7d2fb9d9b98": not found                                                                      
 plugin goroutine 81 [running]:                                                                                                                                                                                                                                
 plugin github.com/warm-metal/csi-driver-image/pkg/backend/containerd.(*mounter).getImageRootFSChainID(0xc00014c000, 0x1bd90e0, 0xc000422270, 0xc000675340, 0x1b8df80, 0xc000144fa0, 0xc000224000, 0x97, 0x1, 0xc000226480, ...)                               
 plugin     /go/src/csi-driver-image/pkg/backend/containerd/mounter.go:108 +0xe90                                                                                                                                                                              
 plugin github.com/warm-metal/csi-driver-image/pkg/backend/containerd.(*mounter).refSnapshot(0xc00014c000, 0x1bd90e0, 0xc000422270, 0xc000675340, 0x1b8df80, 0xc000144fa0, 0xc0001da0a0, 0x44, 0xc000224000, 0x97, ...)                                        
 plugin     /go/src/csi-driver-image/pkg/backend/containerd/mounter.go:130 +0xe5                                                                                                                                                                               
 plugin github.com/warm-metal/csi-driver-image/pkg/backend/containerd.(*mounter).Mount(0xc00014c000, 0x1bd90e0, 0xc000422270, 0x1b8df80, 0xc000144fa0, 0xc0001da0a0, 0x44, 0xc000224000, 0x97, 0xc000136300, ...)                                              
 plugin     /go/src/csi-driver-image/pkg/backend/containerd/mounter.go:237 +0x2c5                                                                                                                                                                              
 plugin main.nodeServer.NodePublishVolume(0xc00013dfe0, 0x1b9a3e0, 0xc00014c000, 0x1be0960, 0xc00013dfd0, 0x1bd90e0, 0xc000422270, 0xc0000ae180, 0x18, 0x18, ...)                                                                                              
 plugin     /go/src/csi-driver-image/cmd/plugin/node.go:88 +0x422                                                                                                                                                                                              
 plugin github.com/container-storage-interface/spec/lib/go/csi._Node_NodePublishVolume_Handler.func1(0x1bd90e0, 0xc000422270, 0x191ac80, 0xc0000ae180, 0x18, 0x18, 0x7f14ed854c28, 0xc00041d8c0)                                                               
 plugin     /go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5977 +0x89                                                                                                                                                     
 plugin github.com/kubernetes-csi/drivers/pkg/csi-common.logGRPC(0x1bd90e0, 0xc000422270, 0x191ac80, 0xc0000ae180, 0xc00041d8a0, 0xc00041d8c0, 0xc000416ba0, 0x508d86, 0x18b85c0, 0xc000422270)                                                                
 plugin     /go/pkg/mod/github.com/kubernetes-csi/[email protected]/pkg/csi-common/utils.go:99 +0x15d                                                                                                                                                             
 plugin github.com/container-storage-interface/spec/lib/go/csi._Node_NodePublishVolume_Handler(0x18a5280, 0xc000219230, 0x1bd90e0, 0xc000422270, 0xc000652600, 0x1a88310, 0x1bd90e0, 0xc000422270, 0xc00017c840, 0x28e)                                        
 plugin     /go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi.pb.go:5979 +0x150                                                                                                                                                    
 plugin google.golang.org/grpc.(*Server).processUnaryRPC(0xc0000ca000, 0x1bec140, 0xc000501500, 0xc000580100, 0xc0000c21e0, 0x263c530, 0x0, 0x0, 0x0)                                                                                                          
 plugin     /go/pkg/mod/google.golang.org/[email protected]/server.go:1210 +0x522                                                                                                                                                                                   
 plugin google.golang.org/grpc.(*Server).handleStream(0xc0000ca000, 0x1bec140, 0xc000501500, 0xc000580100, 0x0)                                                                                                                                                
 plugin     /go/pkg/mod/google.golang.org/[email protected]/server.go:1533 +0xd05                                                                                                                                                                                   
 plugin google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc000138b40, 0xc0000ca000, 0x1bec140, 0xc000501500, 0xc000580100)                                                                                                                               
 plugin     /go/pkg/mod/google.golang.org/[email protected]/server.go:871 +0xa5                                                                                                                                                                                     
 plugin created by google.golang.org/grpc.(*Server).serveStreams.func1                                                                                                                                                                                         
 plugin     /go/pkg/mod/google.golang.org/[email protected]/server.go:869 +0x1fd                                                                                                                                                                                    
 plugin stream closed                                                           

Dynamic Volume provisioning

Hi,

It would be nice to have a dynamic volume provisioning feature, to automatically provision PersistentVolumes for user-provided PersistentVolumeClaims. As PersistentVolumes require cluster-wide access, users in multi-tenant clusters won't be able to create them, and using the CSI driver will require cluster-administrator support.

In order to support this, two API methods should be added to the ControllerServer: CreateVolume and DeleteVolume.
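
As a rough, hedged skeleton of what those two methods could look like for this driver (assuming a StorageClass parameter named image; this is not the contributed implementation):

package main

import (
	"context"
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type controllerServer struct{}

func (c *controllerServer) CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error) {
	image := req.GetParameters()["image"] // assumed parameter name
	if image == "" {
		return nil, status.Error(codes.InvalidArgument, `storage class parameter "image" is required`)
	}
	return &csi.CreateVolumeResponse{
		Volume: &csi.Volume{
			VolumeId:      image, // the image ref doubles as the volume handle
			VolumeContext: req.GetParameters(),
		},
	}, nil
}

func (c *controllerServer) DeleteVolume(ctx context.Context, req *csi.DeleteVolumeRequest) (*csi.DeleteVolumeResponse, error) {
	if req.GetVolumeId() == "" {
		return nil, status.Error(codes.InvalidArgument, "volume id is required")
	}
	// Nothing to release on the storage side; the image stays in the node's cache.
	return &csi.DeleteVolumeResponse{}, nil
}

func main() {
	cs := &controllerServer{}
	resp, err := cs.CreateVolume(context.Background(), &csi.CreateVolumeRequest{
		Parameters: map[string]string{"image": "docker.io/warmmetal/csi-image-test:simple-fs"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("provisioned volume handle:", resp.Volume.VolumeId)
}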

Also, the external-provisioner sidecar should be included, and, following the recommended mechanism for deploying CSI drivers, the overall deployment scheme should be split into two parts:

  • DaemonSet with Identity and Node services + node-driver-registrar sidecar
  • Deployment with Identity and Controller services + external-provisioner sidecar.

As we have already implemented this support in our private fork, I would like to contribute it back upstream to share with the community.

Thanks,
Max

GitHub Team for maintainers

Having a GitHub team like @warm-metal/maintainers would make it easy for people creating issues or pull requests to tag the maintainers and bring attention to a particular issue/PR.

Trouble testing with AWS EKS 1.16.15

As far as I know, EKS uses Docker, so I used the docker installation from the readme.

Immediately, I needed to make a change to the containerd socket path:

        - hostPath:
-          path: /run/docker/containerd/containerd.sock
+          path: /run/containerd/containerd.sock
            type: Socket

Running the example, I see this event on the pod:

MountVolume.SetUp failed for volume "target" : kubernetes.io/csi: mounter.SetupAt failed: rpc error: code = Internal desc = no such file or directory

And this from the daemon pod:

 plugin I0405 22:46:45.461145       1 node.go:34] request: volume_id:"csi-5cc35c366d3787f082ed8fc570a7c59f2feb712e15c44337c6b82c33aeaad8c0" target_path:"/var/lib/kubelet/pods/0fed9f68-4752-4ae7-a132-9899bede7947/volumes/kubernetes.io~csi/target/mount" vo 
 lume_capability:<mount:<> access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"csi.storage.k8s.io/ephemeral" value:"true" > volume_context:<key:"csi.storage.k8s.io/pod.name" value:"ephemeral-volume-6jh2z" > volume_context:<key:"csi.storage.k8s. 
 io/pod.namespace" value:"default" > volume_context:<key:"csi.storage.k8s.io/pod.uid" value:"0fed9f68-4752-4ae7-a132-9899bede7947" > volume_context:<key:"csi.storage.k8s.io/serviceAccount.name" value:"default" > volume_context:<key:"image" value:"docker. 
 io/warmmetal/csi-image-test:simple-fs" >                                                                                                                                                                                                                      
 plugin I0405 22:46:45.486677       1 mounter.go:117] image docker.io/warmmetal/csi-image-test:simple-fs unpacked                                                                                                                                              
 plugin I0405 22:46:45.510033       1 mounter.go:136] prepare sha256:07fb94c730eb52c6718a45978209e1249617058da234cd270f4bd587c7ddda87                                                                                                                          
 plugin E0405 22:46:45.545963       1 mounter.go:254] fail to mount image docker.io/warmmetal/csi-image-test:simple-fs to /var/lib/kubelet/pods/0fed9f68-4752-4ae7-a132-9899bede7947/volumes/kubernetes.io~csi/target/mount: no such file or directory         
 plugin E0405 22:46:45.546766       1 mounter.go:245] found error no such file or directory. Prepare removing the snapshot just created                                                                                                                        
 plugin I0405 22:46:45.550714       1 mounter.go:222] found snapshot csi-image.warm-metal.tech-sha256:07fb94c730eb52c6718a45978209e1249617058da234cd270f4bd587c7ddda87 for volume csi-5cc35c366d3787f082ed8fc570a7c59f2feb712e15c44337c6b82c33aeaad8c0. prepar 
 e to unref it.                                                                                                                                                                                                                                                
 plugin I0405 22:46:45.550730       1 mounter.go:178] unref snapshot csi-image.warm-metal.tech-sha256:07fb94c730eb52c6718a45978209e1249617058da234cd270f4bd587c7ddda87, parent sha256:07fb94c730eb52c6718a45978209e1249617058da234cd270f4bd587c7ddda87         
 plugin I0405 22:46:45.550740       1 mounter.go:190] no other mount refs snapshot csi-image.warm-metal.tech-sha256:07fb94c730eb52c6718a45978209e1249617058da234cd270f4bd587c7ddda87, remove it                                                                
 plugin E0405 22:46:45.558154       1 utils.go:101] GRPC error: rpc error: code = Internal desc = no such file or directory   

Stalebot seems too aggressive

Because we have limited contributors at the moment, we are unable to fix the issues or merge the PRs in progress.

I think we should increase the values in the stalebot configuration to something like:

days-before-stale: 90
days-before-close: 30

imagepullsecrets are cached and not refreshed

I've configured the deployment by running:

warm-metal-csi-image-install --pull-image-secret-for-daemonset=<mykeyname> >csi-driver-image.yaml

I then copied the docker.io/warmmetal/csi-image-test:simple-fs image to my private registry (gcr.io/my-private-project) and deployed warm-metal/csi-driver-image/master/sample/ephemeral-volume.yaml with the image patched to my private registry. Initially this works, but if I run this a day later it fails:

MountVolume.SetUp failed for volume "target" : rpc error: code = Aborted desc = unable to pull image "gcr.io/my-private-project/csi-image-test:simple-fs": rpc error: code = Unknown desc = failed to pull and unpack image "gcr.io/giza-workcells/csi-image-test:simple-fs": failed to resolve reference "gcr.io/my-private-project/csi-image-test:simple-fs": pulling from host gcr.io failed with status code [manifests simple-fs]: 401 Unauthorized

If I restart the CSI driver pod (kubectl delete pod -n kube-system csi-image-warm-metal-ndqs5), it re-reads the dockerconfig from the secret and everything works again.

Here is our service that refreshes the dockerconfig in the imagepullsecret: https://github.com/googlecloudrobotics/core/blob/HEAD/src/go/pkg/gcr/update_gcr_credentials.go

I have not found any flag to disable the caching. I can make a PR, but I would like to agree first on how to fix this:

  1. (my preference) if the dockerconfig has Username == "oauth2accesstoken", don't cache it; just read it each time you mount a volume (see the sketch after this list)
  2. add flag to disable caching globally
  3. Handle the 401 error by re-reading the key and retrying (sounds complicated)
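
A minimal sketch of option 1, with assumed types (not the driver's real cache): credentials whose username is oauth2accesstoken are treated as short-lived and are re-read from the secret on every mount instead of being cached.

package main

import (
	"fmt"
	"sync"
)

type dockerAuth struct {
	Username string
	Password string
}

type credCache struct {
	mu    sync.Mutex
	byReg map[string]dockerAuth
}

// fetchFunc reads the pull secret for a registry from the API server; stubbed in main.
type fetchFunc func(registry string) (dockerAuth, error)

// Get returns cached credentials, except for short-lived oauth2accesstoken
// entries, which are always fetched fresh and never stored.
func (c *credCache) Get(registry string, fetch fetchFunc) (dockerAuth, error) {
	c.mu.Lock()
	cached, ok := c.byReg[registry]
	c.mu.Unlock()
	if ok && cached.Username != "oauth2accesstoken" {
		return cached, nil
	}

	fresh, err := fetch(registry)
	if err != nil {
		return dockerAuth{}, err
	}
	if fresh.Username != "oauth2accesstoken" { // only long-lived credentials are cached
		c.mu.Lock()
		c.byReg[registry] = fresh
		c.mu.Unlock()
	}
	return fresh, nil
}

func main() {
	cache := &credCache{byReg: map[string]dockerAuth{}}
	auth, _ := cache.Get("gcr.io", func(string) (dockerAuth, error) {
		return dockerAuth{Username: "oauth2accesstoken", Password: "short-lived-token"}, nil
	})
	fmt.Println(auth.Username) // fetched fresh on every call, never cached
}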

Of course I am open to additional ideas. WDYT?

Support embedded ImagePullSecrets

See #16

Users can set ImagePullSecrets via any of the following:

  1. secret and secretNamespace in the original VolumeContext for the secret name and namespace respectively,
  2. .spec.imagePullSecrets of workload Pods,
  3. .spec.imagePullSecrets of the plugin Pods.

If all of them are set for the same workload pod, secrets are sorted in the order of the above list.
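
A small sketch, with assumed helper types, of that ordering (illustrative only): the volume-context secret comes first, then the workload pod's imagePullSecrets, then the plugin pod's own imagePullSecrets.

package main

import "fmt"

type secretRef struct {
	Namespace, Name string
}

// orderedPullSecrets merges the three sources, highest priority first.
// A caller would try each secret in turn until one authenticates the pull.
func orderedPullSecrets(volumeCtxSecret *secretRef, workloadPodSecrets, pluginPodSecrets []secretRef) []secretRef {
	var out []secretRef
	if volumeCtxSecret != nil {
		out = append(out, *volumeCtxSecret)
	}
	out = append(out, workloadPodSecrets...)
	out = append(out, pluginPodSecrets...)
	return out
}

func main() {
	secrets := orderedPullSecrets(
		&secretRef{Namespace: "default", Name: "from-volume-context"},
		[]secretRef{{Namespace: "default", Name: "from-workload-pod"}},
		[]secretRef{{Namespace: "kube-system", Name: "from-plugin-pod"}},
	)
	fmt.Println(secrets) // highest priority first
}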

Support AWS Docker Credentials

I have successfully tested csi-driver-image with EKS, but I had to specify image pull secrets even though EKS nodes should have read access to ECR in the same account.

I tried assigning the kube-system/aws-node ServiceAccount to the csi-driver-image DaemonSet, but that did not help. Not sure why that is not working. Running a pod manually with this service account, I have the expected AWS Role and can pull from ECR.

Reviewing the code, it also seems like this should just magically work with the CredentialProvider, but it doesn't.

https://github.com/kubernetes/kubernetes/tree/master/pkg/credentialprovider/aws

Ideally this would work on EKS nodes with no additional configuration, but alternatively adding a ServiceAccount to the DaemonSet would be a fine solution.
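
For illustration only, a hedged sketch using aws-sdk-go (not the kubelet credential provider linked above) of how the node's IAM role could be exchanged for docker credentials against ECR; the names and flow are assumptions, not the project's code.

package main

import (
	"encoding/base64"
	"fmt"
	"strings"

	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ecr"
)

func main() {
	// Uses the node's IAM role (or other ambient AWS credentials) automatically.
	sess := session.Must(session.NewSession())
	svc := ecr.New(sess)

	out, err := svc.GetAuthorizationToken(&ecr.GetAuthorizationTokenInput{})
	if err != nil {
		panic(err)
	}

	// The token is base64("AWS:<password>") and is valid for a limited time.
	data := out.AuthorizationData[0]
	decoded, err := base64.StdEncoding.DecodeString(*data.AuthorizationToken)
	if err != nil {
		panic(err)
	}
	parts := strings.SplitN(string(decoded), ":", 2)
	if len(parts) != 2 {
		panic("unexpected token format")
	}
	fmt.Printf("registry=%s user=%s password-len=%d\n", *data.ProxyEndpoint, parts[0], len(parts[1]))
}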

Maintainer workflow missing

Currently, a maintainer workflow doesn't exist for this project. It'd be good to have the process documented so that we ensure standardization and ease of onboarding.

Developer Workflow missing

Currently a well defined developer/contributor workflow doesn't exist for this project. It'd be good to have a few things

  • Well defined developer workflow documentation - #96
  • Well defined new contributor workflow documentation
  • Automations to support these workflows

Support embedded ImagePullSecrets and Credential providers

The plugin currently only supports secrets specified in VolumeContext. Secrets embedded in pod manifests and credential providers are more common in the cloud. I am going to support both of them. Progress will be tracked in the following issues respectively.

  • Embedded ImagePullSecrets - #17, #18
  • built-in credential providers - #19
  • exec plugins - #20

lights out recovery

When a node gets powered off hard and then brought back up, Kubernetes will restart the existing pod rather than recreating it. (This does happen in production. Got the t-shirt :)

emptyDir volumes, which ephemeral image volumes are patterned after, keep their data when the pod restarts in this case.

This driver reverts the volume back to the original image, losing the data. This should probably be fixed to follow the emptyDir pattern. The volumeHandle was designed to be unique to the pod/volume combination to ensure this state can be recovered.

Outdated GitHub issues are dangling around

We might want to add a stale bot to prevent outdated GitHub issues from staying open.

ToDo:

  • Decide the criteria and timelines for closing issues.
  • Add a stale-bot, probably in GitHub workflows.

Plugin pod crashing intermittently

What happened?

The node plugin pod sometimes crashes after printing the message below.

F0129 15:23:21.872324       1 containerd.go:75] unable to retrieve local image "public.ecr.aws/docker/library/amazonlinux:2022.0.20220308.1": image "public.ecr.aws/docker/library/amazonlinux:2022.0.20220308.1": not found

The kubelet is then unable to reach the CSI driver until the new pod gets into the Running state, and we see warning events like:

Events:
  Type     Reason           Age                    From               Message
  ----     ------           ----                   ----               -------
  Warning  FailedMount      2m32s (x2 over 4m47s)  kubelet            MountVolume.SetUp failed for volume "target" : rpc error: code = DeadlineExceeded desc = context deadline exceeded

  Warning  FailedMount      2m28s (x2 over 4m44s)  kubelet            Unable to attach or mount volumes: unmounted volumes=[target], unattached volumes=[target kube-api-access-6k74b]: timed out waiting for the condition

  Warning  FailedMount      2m9s (x2 over 4m33s)   kubelet            MountVolume.SetUp failed for volume "target" : rpc error: code = Unavailable desc = error reading from server: EOF

  Warning  FailedMount      117s (x2 over 2m5s)    kubelet            MountVolume.SetUp failed for volume "target" : kubernetes.io/csi: mounter.SetUpAt failed to determine if the node service has VOLUME_MOUNT_GROUP capability: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /var/lib/kubelet/plugins/csi-image.warm-metal.tech/csi.sock: connect: connection refused"

  Warning  FailedMount      12s                    kubelet            Unable to attach or mount volumes: unmounted volumes=[target], unattached volumes=[kube-api-access-6k74b target]: timed out waiting for the condition

What did you expect?

The node plugin pod shouldn't crash.
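
One common way to keep a single failed image lookup from taking the whole plugin down is to log the error and return a gRPC status to the kubelet instead of exiting fatally. The sketch below is illustrative only; the function names are made up and this is not the driver's actual code.

package main

import (
	"context"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
	"k8s.io/klog/v2"
)

// getLocalImage stands in for the containerd lookup that failed in the log above.
func getLocalImage(ctx context.Context, ref string) (string, error) {
	return "", fmt.Errorf("image %q: not found", ref)
}

func mountImage(ctx context.Context, ref string) error {
	img, err := getLocalImage(ctx, ref)
	if err != nil {
		// Logging with Errorf and returning a status keeps the gRPC server alive;
		// a fatal log here would crash the pod and break all other mounts.
		klog.Errorf("unable to retrieve local image %q: %v", ref, err)
		return status.Errorf(codes.NotFound, "unable to retrieve local image %q: %v", ref, err)
	}
	klog.Infof("mounting image %s", img)
	return nil
}

func main() {
	if err := mountImage(context.Background(), "public.ecr.aws/docker/library/amazonlinux:2022.0.20220308.1"); err != nil {
		fmt.Println("request failed:", err)
	}
}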

How to reproduce the issue?

It doesn't always happen, so reproducing the issue might not be straightforward.

Generally the restarts are seen when containerd fails to pull an image, so you may try pulling multiple large images at the same time.

Anything more to add?

  • The container gets into running state again, but this behaviour causes delays in image pulls.
  • The restarts are noticed in both async and sync modes.

Pulling image failed with status code [manifests simple-fs]: 401 Unauthorized

We recently upgraded the CSI driver to v0.6.1 and started seeing pods that pull images from a private registry fail to mount volumes, with the error below:

MountVolume.SetUp failed for volume "target" : rpc error: code = Aborted desc = unable to pull image "74180267703.dkr.ecr.us-east-1.amazonaws.com/kalpana:0ab620f52e3641cb39f5b7e3b4c152e7f1651960-from-scratch": rpc error: code = Unknown desc = failed to pull image "74180267703.dkr.ecr.us-east-1.amazonaws.com/kalpana:0ab620f52e3641cb39f5b7e3b4c152e7f1651960-from-scratch": failed to resolve reference "74180267703.dkr.ecr.us-east-1.amazonaws.com/kalpana:0ab620f52e3641cb39f5b7e3b4c152e7f1651960-from-scratch": pulling from host 74180267703.dkr.ecr.us-east-1.amazonaws.com failed with status code [manifests simple-fs]: 401 Unauthorized

where the pod contains the below volume details:

volumes:
- name: shared-files
  persistentVolumeClaim:
    claimName: e32de31b-814b-4e0e-a2e3-05c92c59066b-files
- csi:
    driver: csi-image.warm-metal.tech
    volumeAttributes:
      image: 374180267703.dkr.ecr.us-east-1.amazonaws.com/kalpana:0ab620f52e3641cb39f5b7e3b4c152e7f1651960-from-scratch
  name: test-code

I tried the below flag:

--enable-daemon-image-credential-cache=false

but I am still seeing the same behaviour. Is there anything I am missing?
Note: the same pod YAML works with the v0.5.1 CSI image; however, it does not with v0.6.1.

incorrect handling of images created from scratch

Create a PV with the hello-world docker image, which is built from scratch. Create a PVC and a Pod using this PV.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: test
spec:
  storageClassName: system-image
  capacity:
    storage: 10G
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Delete
  csi:
    driver: csi-image.warm-metal.tech
    volumeHandle: "hello-world"
---    
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test
spec:
  accessModes:
  - ReadOnlyMany
  resources:
    requests:
      storage: 10G
  storageClassName: system-image
  volumeMode: Filesystem
---
apiVersion: v1
kind: Pod
metadata:
    name: test
spec:
    containers:
        - name: busybox
          image: docker.io/busybox:latest
          command: ['tail', '-f', '/dev/null']
          volumeMounts:
              - name: test
                mountPath: /test
                readOnly: true
    volumes:
      - name: test
        persistentVolumeClaim:
            claimName: test
            readOnly: true

The pod will be created with no errors:

[root@node ~]# kubectl get po | grep test
test                             1/1     Running   0          4m37s

But if you try to delete this pod, it will get stuck in Terminating status:

[root@node ~]# kubectl get po  | grep test
test                             1/1     Terminating   0          6m11s

And the following error will occur in kubelet logs:

Jun 22 17:16:19 node kubelet[3615]: E0622 17:16:19.593027    3615 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/csi-image.warm-metal.tech^hello-world podName:9cf68544-0fe0-4a30-8ee1-95d845beb776 nodeName:}" failed. No retries permitted until 2023-06-22 17:16:27.592995497 +0000 UTC m=+189462.238591632 (durationBeforeRetry 8s). Error: UnmountVolume.TearDown failed for volume "test" (UniqueName: "kubernetes.io/csi/csi-image.warm-metal.tech^hello-world") pod "9cf68544-0fe0-4a30-8ee1-95d845beb776" (UID: "9cf68544-0fe0-4a30-8ee1-95d845beb776") : kubernetes.io/csi: Unmounter.TearDownAt failed to clean mount dir [/var/lib/kubelet/pods/9cf68544-0fe0-4a30-8ee1-95d845beb776/volumes/kubernetes.io~csi/test/mount]: kubernetes.io/csi: failed to remove dir [/var/lib/kubelet/pods/9cf68544-0fe0-4a30-8ee1-95d845beb776/volumes/kubernetes.io~csi/test/mount]: remove /var/lib/kubelet/pods/9cf68544-0fe0-4a30-8ee1-95d845beb776/volumes/kubernetes.io~csi/test/mount: device or resource busy

That is because the snapshotter used a bind mount for this image:

[root@node ~]#
findmnt | grep csi | grep test
|-/var/lib/kubelet/pods/9cf68544-0fe0-4a30-8ee1-95d845beb776/volumes/kubernetes.io~csi/test/mount                                         /dev/mapper/vg00-root[/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/875/fs]                                                                             xfs        rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,prjquota

And it should be unmounted before deletion.
The node server implements this unmounting logic:
https://github.com/warm-metal/csi-driver-image/blob/dab54201833d97838d6508455c0044ac28c979f3/cmd/plugin/node_server.go#L152-L162

But only if the IsLikelyNotMountPoint function returns false, i.e. the path is a mount point.

According to the comments on this function, it does not detect bind mounts, so the unmount operation will be skipped and the kubelet's UnmountVolume.TearDown will always fail.
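
A minimal sketch, assuming the k8s.io/mount-utils helpers, of how bind mounts could be detected before cleanup: IsNotMountPoint falls back to scanning the mount table, so it catches bind mounts that IsLikelyNotMountPoint misses. The code below is illustrative, not the driver's node server.

package main

import (
	"fmt"
	"os"

	mount "k8s.io/mount-utils"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: checkmount <target-path>")
		os.Exit(2)
	}
	target := os.Args[1] // e.g. the volume's .../volumes/kubernetes.io~csi/test/mount dir
	mounter := mount.New("")

	// IsNotMountPoint resolves bind mounts via the mount table, unlike
	// mounter.IsLikelyNotMountPoint, which only compares device numbers.
	notMnt, err := mount.IsNotMountPoint(mounter, target)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if !notMnt {
		// The path is a mount point (including a bind mount), so unmount it
		// before the kubelet tries to remove the directory.
		if err := mounter.Unmount(target); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	}
}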

Implement session pattern for async pull feature

I was asked to provide an example re-implementation for the asynchronous pull feature based on what I refer to as the session pattern. In today's community standup, we determined that we'd prefer to have an issue created for any PR that is submitted in order to facilitate a more fluid discussion. This ticket was created for that purpose. The associated PR #137 is a re-implementation of the async pull feature.

Some rationale for this request can be seen in the PR description and conversation. In the future, I'll check the issues and start the conversation here before starting the implementation. Since this was a direct request, it was kind of executed backwards.

Anyway, I'm interested in any discussion that flows either here or in the PR.

Devcontainer not working for arm64 architecture

The devcontainer fails when I try to run a kind cluster.

vscode ➜ /go/…/github.com/warm-metal/csi-driver-image $ kind create cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.27.3) 🖼 
 ✗ Preparing nodes 📦  
Deleted nodes: ["kind-control-plane"]
ERROR: failed to create cluster: command "docker run --name kind-control-plane --hostname kind-control-plane --label io.x-k8s.kind.role=control-plane --privileged --security-opt seccomp=unconfined --security-opt apparmor=unconfined --tmpfs /tmp --tmpfs /run --volume /var --volume /lib/modules:/lib/modules:ro -e KIND_EXPERIMENTAL_CONTAINERD_SNAPSHOTTER --detach --tty --label io.x-k8s.kind.cluster=kind --net kind --restart=on-failure:1 --init=false --cgroupns=private --publish=127.0.0.1:38285:6443/TCP -e KUBECONFIG=/etc/kubernetes/admin.conf kindest/node:v1.27.3@sha256:3966ac761ae0136263ffdb6cfd4db23ef8a83cba8a463690e98317add2c9ba72" failed with error: exit status 125
Command Output: 6e439e708e7a15f0e19bcbbe2f3a590d694d5b8f8f883a5cb5d132a9450842c4
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: waiting for init preliminary setup: read init-p: connection reset by peer: unknown.

Note that I am using a MacBook with an Apple M2 chip, so the architecture of the container differs from that of the host device.

Host device

$ uname -a
Darwin Mriyams-MacBook-Air.local 23.3.0 Darwin Kernel Version 23.3.0: Wed Dec 20 21:33:31 PST 2023; root:xnu-10002.81.5~7/RELEASE_ARM64_T8112 arm64 arm Darwin

DevContainer

$ uname -a
Linux 58b19ceaedb6 6.5.11-linuxkit #1 SMP PREEMPT Wed Dec  6 17:08:31 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
