external-provisioner's Introduction

CSI provisioner

The external-provisioner is a sidecar container that dynamically provisions volumes by calling the CreateVolume and DeleteVolume functions of CSI drivers. It is necessary because the internal persistent volume controller running in the Kubernetes controller-manager does not have any direct interface to CSI drivers.

Overview

The external-provisioner is an external controller that monitors PersistentVolumeClaim objects created by users and creates/deletes volumes for them. The Kubernetes Container Storage Interface (CSI) Documentation explains how to develop, deploy, and test a Container Storage Interface (CSI) driver on Kubernetes.

Compatibility

This information reflects the head of this branch.

Compatible with CSI Version | Container Image | Min K8s Version | Recommended K8s Version
CSI Spec v1.7.0 | registry.k8s.io/sig-storage/csi-provisioner | 1.20 | 1.27

Feature status

Various external-provisioner releases come with different alpha/beta features. Check the --help output for the alpha/beta features available in each release.

The following table reflects the head of this branch.

Feature | Status | Default | Description | Provisioner Feature Gate Required
Snapshots | GA | On | Snapshots and Restore. | No
CSIMigration | GA | On | Migrating in-tree volume plugins to CSI. | No
CSIStorageCapacity | GA | On | Publish capacity information for the Kubernetes scheduler. | No
ReadWriteOncePod | Beta | On | Single pod access mode for PersistentVolumes. | No
CSINodeExpandSecret | GA | On | CSI node expansion secret. | No
HonorPVReclaimPolicy | Beta | On | Honor the PV reclaim policy. | No
PreventVolumeModeConversion | Beta | On | Prevent unauthorized conversion of source volume mode. | --prevent-volume-mode-conversion (no in-tree feature gate)
CrossNamespaceVolumeDataSource | Alpha | Off | Cross-namespace volume data source. | --feature-gates=CrossNamespaceVolumeDataSource=true

All other external-provisioner features and the external-provisioner itself are considered GA and fully supported.

Usage

It is necessary to create a new service account and give it enough privileges to run the external-provisioner; see deploy/kubernetes/rbac.yaml. The provisioner is then deployed as a single Deployment, as illustrated below:

kubectl create -f deploy/kubernetes/deployment.yaml

The external-provisioner may run in the same pod as other external CSI controllers such as the external-attacher, external-snapshotter and/or external-resizer.

Note that the external-provisioner does not scale with more replicas. Only one external-provisioner is elected as leader and running; the others wait for the leader to die and re-elect a new active leader within ~15 seconds of the old leader's death.
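
For orientation, here is a minimal sketch of such a Deployment, with the external-provisioner running as a sidecar next to a hypothetical CSI driver container and both sharing the CSI socket through an emptyDir volume. The names, image tag and the driver container are illustrative; the manifests in deploy/kubernetes/ remain the authoritative starting point.

  kind: Deployment
  apiVersion: apps/v1
  metadata:
    name: csi-example-controller
  spec:
    replicas: 1
    selector:
      matchLabels:
        app: csi-example-controller
    template:
      metadata:
        labels:
          app: csi-example-controller
      spec:
        serviceAccountName: csi-provisioner        # bound to the rules from deploy/kubernetes/rbac.yaml
        containers:
        - name: csi-provisioner
          image: registry.k8s.io/sig-storage/csi-provisioner:v3.5.0   # substitute the release you deploy
          args:
          - --csi-address=/run/csi/socket
          - --leader-election
          volumeMounts:
          - name: socket-dir
            mountPath: /run/csi
        - name: example-csi-driver                 # hypothetical CSI driver container
          image: example.com/csi-driver:latest
          volumeMounts:
          - name: socket-dir
            mountPath: /run/csi
        volumes:
        - name: socket-dir
          emptyDir: {}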

Command line options

Recommended optional arguments

  • --csi-address <path to CSI socket>: This is the path to the CSI driver socket inside the pod that the external-provisioner container will use to issue CSI operations (/run/csi/socket is used by default).

  • --leader-election: Enables leader election. This is mandatory when there are multiple replicas of the same external-provisioner running for one CSI driver. Only one of them may be active (= leader). A new leader will be re-elected when the current leader dies or becomes unresponsive for ~15 seconds.

  • --leader-election-namespace: Namespace where the leader election object will be created. It is recommended that this parameter be populated from the Kubernetes Downward API with the namespace in which the external-provisioner runs.

  • --leader-election-lease-duration <duration>: Duration, in seconds, that non-leader candidates will wait to force acquire leadership. Defaults to 15 seconds.

  • --leader-election-renew-deadline <duration>: Duration, in seconds, that the acting leader will retry refreshing leadership before giving up. Defaults to 10 seconds.

  • --leader-election-retry-period <duration>: Duration, in seconds, the LeaderElector clients should wait between tries of actions. Defaults to 5 seconds.

  • --timeout <duration>: Timeout of all calls to the CSI driver. It should be set to a value that accommodates the majority of ControllerCreateVolume and ControllerDeleteVolume calls. See CSI error and timeout handling for details. 15 seconds is used by default.

  • --retry-interval-start <duration>: Initial retry interval of failed provisioning or deletion. It doubles with each failure, up to --retry-interval-max and then it stops increasing. Default value is 1 second. See CSI error and timeout handling for details.

  • --retry-interval-max <duration>: Maximum retry interval of failed provisioning or deletion. Default value is 5 minutes. See CSI error and timeout handling for details.

  • --worker-threads <num>: Number of simultaneously running ControllerCreateVolume and ControllerDeleteVolume operations. Default value is 100.

  • --kube-api-qps <num>: The number of requests per second sent by a Kubernetes client to the Kubernetes API server. Defaults to 5.0.

  • --kube-api-burst <num>: The number of requests to the Kubernetes API server, exceeding the QPS, that can be sent at any given time. Defaults to 10.

  • --cloning-protection-threads <num>: Number of simultaneously running threads, handling cloning finalizer removal. Defaults to 1.

  • --http-endpoint: The TCP network address where the HTTP server for diagnostics, including metrics and the leader election health check, will listen (example: :8080, which corresponds to port 8080 on the local host). The default is the empty string, which means the server is disabled.

  • --metrics-path: The HTTP path where prometheus metrics will be exposed. Default is /metrics.

  • --extra-create-metadata: Enables the injection of extra PVC and PV metadata as parameters when calling CreateVolume on the driver (keys: "csi.storage.k8s.io/pvc/name", "csi.storage.k8s.io/pvc/namespace", "csi.storage.k8s.io/pv/name").

  • --controller-publish-readonly: This option marks the PV as read-only at the ControllerPublishVolume call if the PVC access mode is ROX. Defaults to false.

  • --enable-pprof: Enable pprof profiling on the TCP network address specified by --http-endpoint. The HTTP path is /debug/pprof/.
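
As a rough illustration, the recommended flags above might be combined in the provisioner container roughly as follows. This is a hypothetical fragment of a container spec; the port and flag values are only examples:

  args:
  - --csi-address=/run/csi/socket
  - --leader-election
  - --leader-election-namespace=$(NAMESPACE)     # populated from the Downward API below
  - --http-endpoint=:8080
  - --extra-create-metadata
  env:
  - name: NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace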

Storage capacity arguments

See the storage capacity section below for details.

  • --enable-capacity: This enables producing CSIStorageCapacity objects with capacity information from the driver's GetCapacity call. The default is to not produce CSIStorageCapacity objects.

  • --capacity-ownerref-level <levels>: The level indicates the number of objects that need to be traversed, starting from the pod identified by the POD_NAME and NAMESPACE environment variables, to reach the owning object for CSIStorageCapacity objects: 0 for the pod itself, 1 for a StatefulSet or DaemonSet, 2 for a Deployment, etc. Defaults to 1 (= StatefulSet). Ownership is optional and can be disabled with -1.

  • --capacity-threads <num>: Number of simultaneously running threads, handling CSIStorageCapacity objects. Defaults to 1.

  • --capacity-poll-interval <interval>: How long the external-provisioner waits before checking for storage capacity changes. Defaults to 1m.

  • --capacity-for-immediate-binding <bool>: Enables producing capacity information for storage classes with immediate binding. Not needed for the Kubernetes scheduler, but may be useful for other consumers or for debugging. Defaults to false.
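
For example, a central deployment might enable capacity tracking with a fragment like the following. The values are hypothetical, and the POD_NAME and NAMESPACE environment variables shown in the capacity section below must also be set:

  args:
  - --enable-capacity
  - --capacity-ownerref-level=2      # owner is the Deployment two levels above the pod
  - --capacity-poll-interval=5m
  - --capacity-threads=2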

Distributed provisioning
  • --node-deployment: Enables deploying the external-provisioner together with a CSI driver on nodes to manage node-local volumes. Off by default.

  • --node-deployment-immediate-binding: Determines whether immediate binding is supported when deployed on each node. It is enabled by default; use --node-deployment-immediate-binding=false if not desired. Disabling it may be useful, for example, when a custom controller will select nodes for PVCs.

  • --node-deployment-base-delay: Determines how long the external-provisioner sleeps initially before trying to own a PVC with immediate binding. Defaults to 20 seconds.

  • --node-deployment-max-delay: Determines how long the external-provisioner sleeps at most before trying to own a PVC with immediate binding. Defaults to 60 seconds.

Other recognized arguments

  • --feature-gates <gates>: A set of comma-separated <feature-name>=<true|false> pairs that describe feature gates for alpha/experimental features. See the list of features or the --help output for the list of recognized features. Example: --feature-gates Topology=true to enable the Topology feature in releases where it is disabled by default.

  • --strict-topology: This controls what topology information is passed to CreateVolumeRequest.AccessibilityRequirements in case of delayed binding. See the table below for an explanation of how this option changes the result. This option has no effect if either the Topology feature is disabled or the Immediate volume binding mode is used.

  • --immediate-topology: This controls what topology information is passed to CreateVolumeRequest.AccessibilityRequirements in case of immediate binding. See the table below for an explanation of how this option changes the result. This option has no effect if either the Topology feature is disabled or the WaitForFirstConsumer (= delayed) volume binding mode is used. The default is true, so use --immediate-topology=false to disable it. It should not be disabled if the CSI driver might create volumes in a topology segment that is not accessible in the cluster. Such a driver should use the topology information to create new volumes where they can be accessed.

  • --kubeconfig <path>: Path to the Kubernetes client configuration that the external-provisioner uses to connect to the Kubernetes API server. When omitted, the default token provided by Kubernetes will be used. This option is useful only when the external-provisioner does not run as a Kubernetes pod, e.g. for debugging. Either this or --master needs to be set if the external-provisioner is being run out of cluster.

  • --master <url>: Master URL to build a client config from. When omitted, the default token provided by Kubernetes will be used. This option is useful only when the external-provisioner does not run as a Kubernetes pod, e.g. for debugging. Either this or --kubeconfig needs to be set if the external-provisioner is being run out of cluster.

  • --metrics-address: (deprecated) The TCP network address where the prometheus metrics endpoint will run (example: :8080, which corresponds to port 8080 on the local host). The default is the empty string, which means the metrics endpoint is disabled.

  • --volume-name-prefix <prefix>: Prefix of PersistentVolume names created by the external-provisioner. Default value is "pvc", i.e. created PersistentVolume objects will have name pvc-<uuid>.

  • --volume-name-uuid-length: Length of UUID to be added to --volume-name-prefix. Default behavior is to NOT truncate the UUID.

  • --version: Prints current external-provisioner version and quits.

  • --prevent-volume-mode-conversion: Prevents an unauthorized user from modifying the volume mode when creating a PVC from an existing VolumeSnapshot. Defaults to true.

  • All glog / klog arguments are supported, such as -v <log level> or -alsologtostderr.

Design

External-provisioner interacts with Kubernetes by watching PVCs and PVs and implementing the external provisioner protocol. The design document explains this in more detail.

Topology support

When the Topology feature is enabled* and the driver specifies VOLUME_ACCESSIBILITY_CONSTRAINTS in its plugin capabilities, the external-provisioner prepares CreateVolumeRequest.AccessibilityRequirements while calling Controller.CreateVolume. The driver has to consider these topology constraints while creating the volume. The table below shows how these AccessibilityRequirements are prepared:

Delayed binding | Strict topology | Allowed topologies | Immediate Topology | Resulting accessibility requirements
Yes | Yes | Irrelevant | Irrelevant | Requisite = Preferred = Selected node topology
Yes | No | No | Irrelevant | Requisite = Aggregated cluster topology; Preferred = Requisite with selected node topology as first element
Yes | No | Yes | Irrelevant | Requisite = Allowed topologies; Preferred = Requisite with selected node topology as first element
No | Irrelevant | Yes | Irrelevant | Requisite = Allowed topologies; Preferred = Requisite with randomly selected node topology as first element
No | Irrelevant | No | Yes | Requisite = Aggregated cluster topology; Preferred = Requisite with randomly selected node topology as first element
No | Irrelevant | No | No | Requisite and Preferred both nil

*) Topology feature gate is enabled by default since v5.0.

When enabling topology support in a CSI driver that had it disabled, please make sure the topology is first enabled in the driver's node DaemonSet and topology labels are populated on all nodes. The topology can then be updated in the driver's Deployment and its external-provisioner sidecar.
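
To make the table above concrete, here is a hypothetical StorageClass that combines delayed binding with allowed topologies. The driver name, topology key and zone values are placeholders for whatever the deployed CSI driver actually reports:

  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    name: csi-example-delayed
  provisioner: csi.example.org               # hypothetical CSI driver name
  volumeBindingMode: WaitForFirstConsumer    # delayed binding
  allowedTopologies:
  - matchLabelExpressions:
    - key: topology.example.org/zone         # hypothetical topology key reported by the driver
      values:
      - zone-a
      - zone-b

With this class and without --strict-topology, the external-provisioner would pass the allowed topologies as Requisite and put the selected node's topology first in Preferred, matching the third row of the table above.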

Capacity support

The external-provisioner can be used to create CSIStorageCapacity objects that hold information about the storage capacity available through the driver. The Kubernetes scheduler then uses that information when selecting nodes for pods with unbound volumes that wait for the first consumer.

All CSIStorageCapacity objects created by an instance of the external-provisioner have certain labels:

  • csi.storage.k8s.io/drivername: the CSI driver name
  • csi.storage.k8s.io/managed-by: external-provisioner for central provisioning, external-provisioner-<node name> for distributed provisioning

They get created in the namespace identified with the NAMESPACE environment variable.

Each external-provisioner instance manages exactly those objects with the labels that correspond to the instance.

Optionally, all CSIStorageCapacity objects created by an instance of the external-provisioner can have an owner. This ensures that the objects get removed automatically when uninstalling the CSI driver. The owner is determined with the POD_NAME/NAMESPACE environment variables and the --capacity-ownerref-level parameter. Setting an owner reference is highly recommended whenever possible (i.e. in the most common case that drivers are run inside containers).

If ownership is disabled the storage admin is responsible for removing orphaned CSIStorageCapacity objects, and the following command can be used to clean up orphaned objects of a driver:

kubectl delete csistoragecapacities -l csi.storage.k8s.io/drivername=my-csi.example.com

When switching from a deployment without ownership to one with ownership, managed objects get updated such that they have the configured owner. When switching in the other direction, the owner reference is not removed because the new deployment doesn't know what the old owner was.

To enable this feature in a driver deployment with a central controller (see also the deploy/kubernetes/storage-capacity.yaml example):

  • Set the POD_NAME and NAMESPACE environment variables like this:
   env:
   - name: NAMESPACE
     valueFrom:
       fieldRef:
         fieldPath: metadata.namespace
   - name: POD_NAME
     valueFrom:
       fieldRef:
         fieldPath: metadata.name
  • Add --enable-capacity to the command line flags.
  • Add StorageCapacity: true to the CSIDriver information object (see the sketch after this list). Without it, the external-provisioner will publish information, but the Kubernetes scheduler will ignore it. This can be used to first deploy the driver without that flag and then, once sufficient information has been published, enable the scheduler's use of it.
  • If the external-provisioner is not deployed with a StatefulSet, then use --capacity-ownerref-level to configure which object is meant to own the CSIStorageCapacity objects.
  • Optional: configure how often external-provisioner polls the driver to detect changed capacity with --capacity-poll-interval.
  • Optional: configure how many worker threads are used in parallel with --capacity-threads.
  • Optional: enable producing information also for storage classes that use immediate volume binding with --capacity-for-immediate-binding. This is usually not needed because such volumes are created by the driver without involving the Kubernetes scheduler, and the published information would therefore just be ignored.
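
For reference, a minimal CSIDriver object with capacity tracking switched on might look like this. The driver name is a placeholder, and a real object would normally set further spec fields (such as attachRequired or podInfoOnMount) as appropriate for the driver:

  apiVersion: storage.k8s.io/v1
  kind: CSIDriver
  metadata:
    name: my-csi.example.com     # must match the name reported by the driver's GetPluginInfo
  spec:
    storageCapacity: true        # lets the Kubernetes scheduler consume the published CSIStorageCapacity objects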

To determine how many different topology segments exist, external-provisioner uses the topology keys and labels that the CSI driver instance on each node reports to kubelet in the NodeGetInfoResponse.accessible_topology field. The keys are stored by kubelet in the CSINode objects and the actual values in Node annotations.

CSI drivers must report topology information that matches the storage pool(s) they have access to, with a granularity that matches the most restrictive pool.

For example, if the driver runs in a node with region/rack topology and has access to per-region storage as well as per-rack storage, then the driver should report topology with region/rack as its keys. If it only has access to per-region storage, then it should just use region as key. If it uses region/rack, then redundant CSIStorageCapacity objects will be published, but the information is still correct. See the KEP for details.

For each segment and each storage class, CSI GetCapacity is called once with the topology of the segment and the parameters of the class. If there is no error and the capacity is non-zero, a CSIStorageCapacity object is created or updated (if it already exists from a prior call) with that information. Obsolete objects are removed.

To ensure that CSIStorageCapacity objects get removed when the external-provisioner gets removed from the cluster, they all have an owner and therefore get garbage-collected when that owner disappears. The owner is not the external-provisioner pod itself but rather one of its parents as specified by --capacity-ownerref-level. This way, it is possible to switch between external-provisioner instances without losing the already gathered information.

CSIStorageCapacity objects are namespaced and get created in the namespace of the external-provisioner. Only CSIStorageCapacity objects with the right owner are modified by external-provisioner and their name is generated, so it is possible to deploy different drivers in the same namespace. However, Kubernetes does not check who is creating CSIStorageCapacity objects, so in theory a malfunctioning or malicious driver deployment could also publish incorrect information about some other driver.

The deployment with distributed provisioning is almost the same as above, with one minor change:

  • Use --capacity-ownerref-level=1 and the NAMESPACE/POD_NAME variables to make the DaemonSet that contains the external-provisioner the owner of CSIStorageCapacity objects for the node.

Deployments of the external-provisioner outside the Kubernetes cluster are also possible, albeit only without an owner for the objects. NAMESPACE still needs to be set to some existing namespace in this case, too.

CSI error and timeout handling

The external-provisioner invokes all gRPC calls to the CSI driver with the timeout provided by the --timeout command line argument (15 seconds by default).

The correct timeout value and number of worker threads depend on the storage backend and how quickly it is able to process ControllerCreateVolume and ControllerDeleteVolume calls. The value should be set to accommodate the majority of them. It is fine if some calls time out - such calls will be retried after an exponential backoff (starting with 1s by default); however, this backoff will introduce delay when the call times out several times for a single volume.

The frequency of ControllerCreateVolume and ControllerDeleteVolume retries can be configured by the --retry-interval-start and --retry-interval-max parameters. The external-provisioner starts retries with the retry-interval-start interval (1s by default) and doubles it with each failure until it reaches retry-interval-max (5 minutes by default). The external-provisioner stops increasing the retry interval when it reaches retry-interval-max; however, it still retries provisioning/deletion of a volume until it succeeds. The external-provisioner keeps its own count of provisioning/deletion failures for each volume.

The external-provisioner can invoke up to --worker-threads (100 by default) ControllerCreateVolume calls and up to --worker-threads (100 by default) ControllerDeleteVolume calls in parallel, i.e. these two calls are counted separately. The external-provisioner assumes that the storage backend can cope with such a high number of parallel requests and that the requests are handled in relatively short time (ideally sub-second). A lower value should be used for storage backends that process newly created / deleted volumes more slowly or can handle only a smaller number of parallel calls.
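
As an illustration, a deployment against a slower backend might tune these flags roughly as follows; the values are hypothetical and the right numbers depend entirely on the backend:

  args:
  - --timeout=60s               # allow slow CreateVolume/DeleteVolume calls to finish
  - --retry-interval-start=1s
  - --retry-interval-max=5m
  - --worker-threads=10         # limit parallel calls to the backend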

Details of error handling of individual CSI calls:

  • ControllerCreateVolume: The call might have timed out just before the driver provisioned a volume and was sending a response. For that reason, a timeout from ControllerCreateVolume is considered as "volume may be provisioned" or "volume is being provisioned in the background." The external-provisioner will retry calling ControllerCreateVolume after an exponential backoff until it gets either a successful response or a final (non-timeout) error that the volume cannot be created.
  • ControllerDeleteVolume: This is similar to ControllerCreateVolume: the external-provisioner will retry calling ControllerDeleteVolume with exponential backoff after a timeout until it gets either a successful response or a final error that the volume cannot be deleted.
  • Probe: The external-provisioner retries calling Probe until the driver reports it's ready. It also retries when it receives a timeout from the Probe call, and there is no limit on the number of retries. It is expected that a readiness probe on the driver container will catch the case where the driver takes too long to get ready.
  • GetPluginInfo, GetPluginCapabilitiesRequest, ControllerGetCapabilities: The external-provisioner expects that these calls are quick and does not retry them on any error, including timeout. Instead, it assumes that the driver is faulty and exits. Note that Kubernetes will likely start a new provisioner container and it will start with a Probe call.

HTTP endpoint

The external-provisioner optionally exposes an HTTP endpoint at the address:port specified by the --http-endpoint argument. When set, these two paths are exposed:

  • Metrics path, as set by --metrics-path argument (default is /metrics).
  • Leader election health check at /healthz/leader-election. It is recommended to run a liveness probe against this endpoint when leader election is used, to kill an external-provisioner leader that fails to connect to the API server to renew its leadership; see the sketch after this list. See kubernetes-csi/csi-lib-utils#66 for details.
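
A liveness probe along these lines could be added to the external-provisioner container, assuming it runs with --leader-election and --http-endpoint=:8080; the port name and timing values are only examples:

  ports:
  - name: http-endpoint
    containerPort: 8080
    protocol: TCP
  livenessProbe:
    httpGet:
      path: /healthz/leader-election
      port: http-endpoint
    initialDelaySeconds: 10
    timeoutSeconds: 10
    periodSeconds: 20
    failureThreshold: 1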

Deployment on each node

Normally, external-provisioner is deployed once in a cluster and communicates with a control instance of the CSI driver which then provisions volumes via some kind of storage backend API. CSI drivers which manage local storage on a node don't have such an API that a central controller could use.

To support this case, external-provisioner can be deployed alongside each CSI driver on different nodes. The CSI driver deployment must:

  • support topology, usually with one topology key ("csi.example.org/node") and the Kubernetes node name as value
  • use a service account that has the same RBAC rules as for a normal deployment
  • invoke external-provisioner with --node-deployment
  • tweak --node-deployment-base-delay and --node-deployment-max-delay to match the expected cluster size and desired response times (only relevant when there are storage classes with immediate binding, see below for details)
  • set the NODE_NAME environment variable to the name of the Kubernetes node
  • implement GetCapacity

Usage of --strict-topology and --immediate-topology=false is recommended because it makes the CreateVolume invocations simpler. Topology information is then always derived exclusively from the information returned by the CSI driver that runs on the same node, without combining it with information stored for other nodes. This works as long as each node is in its own topology segment, i.e. usually with a single topology key and one unique value for each node. A sketch of such a per-node provisioner container is shown below.
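
The following is a rough sketch of the external-provisioner container inside the driver's DaemonSet pod; the image tag, flag values and volume name are illustrative:

  - name: csi-provisioner
    image: registry.k8s.io/sig-storage/csi-provisioner:v3.5.0   # substitute the release you deploy
    args:
    - --csi-address=/run/csi/socket
    - --node-deployment
    - --strict-topology
    - --immediate-topology=false
    - --node-deployment-base-delay=20s
    - --node-deployment-max-delay=60s
    env:
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
    volumeMounts:
    - name: socket-dir
      mountPath: /run/csi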

Volume provisioning with late binding works as before, except that each external-provisioner instance checks the "selected node" annotation and only creates volumes if that node is the one it runs on. It also only deletes volumes on its own node.

Immediate binding is also supported, but not recommended. It is implemented by letting the external-provisioner instances assign a PVC to one of them: when they see a new PVC with immediate binding, they all attempt to set the "selected node" annotation with their own node name as value. Only one update request can succeed, all others get a "conflict" error and then know that some other instance was faster. To avoid the thundering herd problem, each instance waits for a random period before issuing an update request.

When a CreateVolume call fails with ResourcesExhausted, the normal recovery mechanism is used, i.e. the external-provisioner instance removes the "selected node" annotation and the process repeats. But this triggers events for the PVC and delays volume creation, in particular when storage is exhausted on most nodes. Therefore, before attempting to own a PVC, the external-provisioner checks with GetCapacity whether the currently available capacity is sufficient for the volume. When it isn't, the PVC is ignored and some other instance can own it.

The --node-deployment-base-delay parameter determines the initial wait period. It also sets the jitter, so in practice the initial wait period will be in the range from zero to the base delay. If the value is high, volumes with immediate binding get created more slowly. If it is low, then the risk of conflicts while setting the "selected node" annotation increases and the apiserver load will be higher.

There is an exponential backoff per PVC which is used for unexpected problems. Normally, an owner for a PVC is chosen during the first attempt, so most PVCs will use the base delays. A maximum can nevertheless be set with --node-deployment-max-delay, to avoid very long delays when something goes wrong repeatedly.

During scale testing with 100 external-provisioner instances, a base delay of 20 seconds worked well. When provisioning 3000 volumes, there were only 500 conflicts which the apiserver handled without getting overwhelmed. The average provisioning rate of around 40 volumes/second was the same as with a delay of 10 seconds. The worst-case latency per volume was probably higher, but that wasn't measured.

Note that the QPS settings of kube-controller-manager and external-provisioner have to be increased at the moment (Kubernetes 1.19) to provision volumes faster than around 4 volumes/second. Those settings will eventually get replaced with flow control in the API server itself.

Beware that if no node has sufficient storage available, then no CreateVolume call is attempted and thus no events are generated for the PVC, i.e. some other means of tracking remaining storage capacity must be used to detect when the cluster runs out of storage.

Because PVCs with immediate binding get distributed randomly among nodes, they get spread evenly. If that is not desirable, then it is possible to disable support for immediate binding in distributed provisioning with --node-deployment-immediate-binding=false and instead implement a custom policy in a separate controller which sets the "selected node" annotation to trigger local provisioning on the desired node.

Deleting local volumes after a node failure or removal

When a node with local volumes gets removed from a cluster before those volumes are deleted, the PV and PVC objects may still exist. It may be possible to remove the PVC normally if the volume was not in use by any pod on the node, but normal deletion of the volume, and thus of the PV, is no longer possible because the CSI driver instance on the node is not available or reachable anymore, so Kubernetes cannot be sure that it is okay to remove the PV.

When an administrator is sure that the node is never going to come back, then the local volumes can be removed manually:

  • force-delete objects: kubectl delete pv <pv> --wait=false --grace-period=0 --force
  • remove all finalizers: kubectl patch pv <pv> -p '{"metadata":{"finalizers":null}}'

It may also be necessary to scrub disks before reusing them because the CSI driver had no chance to do that.

If a PVC was still bound to that PV, it will be moved to phase "Lost". It has to be deleted and re-created if still needed, because no new volume will be created for it. Editing the PVC to revert it to phase "Unbound" is not allowed by the Kubernetes API server.

Community, discussion, contribution, and support

Learn how to engage with the Kubernetes community on the community page.

You can reach the maintainers of this project through the standard Kubernetes SIG Storage channels (Slack and the mailing list).

Code of conduct

Participation in the Kubernetes community is governed by the Kubernetes Code of Conduct.

external-provisioner's Issues

csi.Provision() drops PVC information which existing external-provisioners use

In some existing external-provisioners, PVC information is used on Provision().
For example,

In the current CSI external-provisioner implementation, on the other hand, PVC information (options.PVC) is dropped, because the provisioner only sets part of the information (*1) and passes it to csiClient.CreateVolume() (*2).

(*1) Setting some of parameters to req:
https://github.com/kubernetes-csi/external-provisioner/blob/master/pkg/controller/controller.go#L299
(*2) Passing req to csiClient.CreateVolume():
https://github.com/kubernetes-csi/external-provisioner/blob/master/pkg/controller/controller.go#L345

As a result, external-provisioners which use PVC information to provision their volumes can't be reimplemented on top of the CSI external-provisioner.

Therefore, we should pass PVC information to csiClient.CreateVolume(), either by

  1. Adding PVC information to options.Parameters,
  2. Adding argument for PVC to CreateVolume().

I guess 1. is better because 2. requires a change to the CSI spec. https://github.com/container-storage-interface/spec/blob/master/spec.md#createvolume

Create a SECURITY_CONTACTS file.

As per the email sent to kubernetes-dev[1], please create a SECURITY_CONTACTS
file.

The template for the file can be found in the kubernetes-template repository[2].
A description for the file is in the steering-committee docs[3], you might need
to search that page for "Security Contacts".

Please feel free to ping me on the PR when you make it, otherwise I will see when
you close this issue. :)

Thanks so much, let me know if you have any questions.

(This issue was generated from a tool, apologies for any weirdness.)

[1] https://groups.google.com/forum/#!topic/kubernetes-dev/codeiIoQ6QE
[2] https://github.com/kubernetes/kubernetes-template-project/blob/master/SECURITY_CONTACTS
[3] https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance-template-short.md

PV cannot be deleted and csi-attacher and csi-provisioner fall into an infinite loop.

I1010 15:02:49.860137 1 controller.go:197] Started PV processing "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df"
I1010 15:02:49.860158 1 csi_handler.go:350] CSIHandler: processing PV "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df"
I1010 15:02:49.860199 1 csi_handler.go:386] CSIHandler: processing PV "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df": VA "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6" found
I1010 15:02:53.840700 1 reflector.go:286] github.com/kubernetes-csi/external-attacher/vendor/k8s.io/client-go/informers/factory.go:87: forcing resync
I1010 15:02:53.840779 1 controller.go:167] Started VA processing "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6"
I1010 15:02:53.840792 1 csi_handler.go:76] CSIHandler: processing VA "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6"
I1010 15:02:53.840799 1 csi_handler.go:103] Attaching "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6"
I1010 15:02:53.840805 1 csi_handler.go:208] Starting attach operation for "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6"
I1010 15:02:53.840822 1 csi_handler.go:320] Saving attach error to "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6"
I1010 15:02:53.843389 1 csi_handler.go:330] Saved attach error to "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6"
I1010 15:02:53.843409 1 csi_handler.go:86] Error processing "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6": failed to attach: PersistentVolume "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df" is marked for deletion
I1010 15:02:53.843557 1 controller.go:167] Started VA processing "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6"
I1010 15:02:53.843567 1 csi_handler.go:76] CSIHandler: processing VA "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6"
I1010 15:02:53.843573 1 csi_handler.go:103] Attaching "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6"
I1010 15:02:53.843580 1 csi_handler.go:208] Starting attach operation for "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6"
I1010 15:02:53.843589 1 csi_handler.go:320] Saving attach error to "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6"
I1010 15:02:53.849043 1 reflector.go:286] github.com/kubernetes-csi/external-attacher/vendor/k8s.io/client-go/informers/factory.go:87: forcing resync
I1010 15:02:53.849110 1 controller.go:197] Started PV processing "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df"
I1010 15:02:53.849126 1 csi_handler.go:350] CSIHandler: processing PV "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df"
I1010 15:02:53.849184 1 csi_handler.go:386] CSIHandler: processing PV "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df": VA "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6" found
I1010 15:02:53.852979 1 csi_handler.go:330] Saved attach error to "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6"
I1010 15:02:53.853008 1 csi_handler.go:86] Error processing "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6": failed to attach: PersistentVolume "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df" is marked for deletion
I1010 15:03:03.840906 1 reflector.go:286] github.com/kubernetes-csi/external-attacher/vendor/k8s.io/client-go/informers/factory.go:87: forcing resync
I1010 15:03:03.840984 1 controller.go:167] Started VA processing "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6"
I1010 15:03:03.840997 1 csi_handler.go:76] CSIHandler: processing VA "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6"
I1010 15:03:03.841004 1 csi_handler.go:103] Attaching "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6"
I1010 15:03:03.841011 1 csi_handler.go:208] Starting attach operation for "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6"
I1010 15:03:03.841033 1 csi_handler.go:320] Saving attach error to "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6"
I1010 15:03:03.843640 1 csi_handler.go:330] Saved attach error to "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6"
I1010 15:03:03.843658 1 csi_handler.go:86] Error processing "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6": failed to attach: PersistentVolume "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df" is marked for deletion
I1010 15:03:03.843685 1 controller.go:167] Started VA processing "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6"
I1010 15:03:03.843693 1 csi_handler.go:76] CSIHandler: processing VA "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6"
I1010 15:03:03.843699 1 csi_handler.go:103] Attaching "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6"
I1010 15:03:03.843705 1 csi_handler.go:208] Starting attach operation for "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6"
I1010 15:03:03.843714 1 csi_handler.go:320] Saving attach error to "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6"
I1010 15:03:03.849234 1 reflector.go:286] github.com/kubernetes-csi/external-attacher/vendor/k8s.io/client-go/informers/factory.go:87: forcing resync
I1010 15:03:03.849312 1 controller.go:197] Started PV processing "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df"
I1010 15:03:03.849329 1 csi_handler.go:350] CSIHandler: processing PV "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df"
I1010 15:03:03.849401 1 csi_handler.go:386] CSIHandler: processing PV "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df": VA "csi-5b88618110ae3eacf0b71c415cc1e28cdff01937dbfcc8f965c18c32b3d973f6" found

root@ubuntu:~/example# kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df 2Gi RWO Delete Terminating default/csi-pvc-opensdsplugin csi-sc-opensdsplugin 9d

I1010 15:14:20.957768 1 controller.go:1167] scheduleOperation[delete-pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df[6aef48dc-c4c3-11e8-b1a6-58605f89e6df]]
I1010 15:14:21.848426 1 controller.go:1143] volume "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df" deleted
I1010 15:14:21.851660 1 controller.go:1167] scheduleOperation[delete-pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df[6aef48dc-c4c3-11e8-b1a6-58605f89e6df]]
I1010 15:14:21.851802 1 controller.go:1154] volume "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df" deleted from database
I1010 15:14:34.016156 1 controller.go:1167] scheduleOperation[delete-pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df[6aef48dc-c4c3-11e8-b1a6-58605f89e6df]]
I1010 15:14:35.060507 1 controller.go:1143] volume "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df" deleted
I1010 15:14:35.063969 1 controller.go:1167] scheduleOperation[delete-pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df[6aef48dc-c4c3-11e8-b1a6-58605f89e6df]]
I1010 15:14:35.064131 1 controller.go:1154] volume "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df" deleted from database
I1010 15:14:49.016453 1 controller.go:1167] scheduleOperation[delete-pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df[6aef48dc-c4c3-11e8-b1a6-58605f89e6df]]
I1010 15:14:50.208533 1 controller.go:1143] volume "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df" deleted
I1010 15:14:50.212178 1 controller.go:1154] volume "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df" deleted from database
I1010 15:14:50.212576 1 controller.go:1167] scheduleOperation[delete-pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df[6aef48dc-c4c3-11e8-b1a6-58605f89e6df]]
I1010 15:14:51.109811 1 controller.go:1143] volume "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df" deleted
I1010 15:14:51.113178 1 controller.go:1167] scheduleOperation[delete-pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df[6aef48dc-c4c3-11e8-b1a6-58605f89e6df]]
I1010 15:14:51.113296 1 controller.go:1154] volume "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df" deleted from database
I1010 15:15:04.016610 1 controller.go:1167] scheduleOperation[delete-pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df[6aef48dc-c4c3-11e8-b1a6-58605f89e6df]]
I1010 15:15:05.029889 1 controller.go:1143] volume "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df" deleted
I1010 15:15:05.033290 1 controller.go:1167] scheduleOperation[delete-pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df[6aef48dc-c4c3-11e8-b1a6-58605f89e6df]]
I1010 15:15:05.033499 1 controller.go:1154] volume "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df" deleted from database
I1010 15:15:19.016755 1 controller.go:1167] scheduleOperation[delete-pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df[6aef48dc-c4c3-11e8-b1a6-58605f89e6df]]
I1010 15:15:20.035012 1 controller.go:1143] volume "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df" deleted
I1010 15:15:20.039393 1 controller.go:1167] scheduleOperation[delete-pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df[6aef48dc-c4c3-11e8-b1a6-58605f89e6df]]
I1010 15:15:20.039554 1 controller.go:1154] volume "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df" deleted from database
I1010 15:15:34.016947 1 controller.go:1167] scheduleOperation[delete-pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df[6aef48dc-c4c3-11e8-b1a6-58605f89e6df]]
I1010 15:15:35.039127 1 controller.go:1143] volume "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df" deleted
I1010 15:15:35.042745 1 controller.go:1167] scheduleOperation[delete-pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df[6aef48dc-c4c3-11e8-b1a6-58605f89e6df]]
I1010 15:15:35.042859 1 controller.go:1154] volume "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df" deleted from database
I1010 15:15:49.017144 1 controller.go:1167] scheduleOperation[delete-pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df[6aef48dc-c4c3-11e8-b1a6-58605f89e6df]]
I1010 15:15:50.037500 1 controller.go:1143] volume "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df" deleted
I1010 15:15:50.040434 1 controller.go:1167] scheduleOperation[delete-pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df[6aef48dc-c4c3-11e8-b1a6-58605f89e6df]]
I1010 15:15:50.041084 1 controller.go:1154] volume "pvc-652ea95e-c4c3-11e8-b1a6-58605f89e6df" deleted from database
csi-attacher and csi-provisioner fall into an infinite loop.

NodeStageVolume is called before each call to NodePublishVolume

With a CSI plugin that works with iSCSI targets, I have to support the STAGE_UNSTAGE_VOLUME node capability. If I mount an iSCSI volume into three pods on the same host and later delete all three pods, the spec says I should see this sequence:

  • ControllerPublishVolume
    • NodeStageVolume
      • NodePublishVolume
      • NodePublishVolume
      • NodePublishVolume
      • NodeUnpublishVolume
      • NodeUnpublishVolume
      • NodeUnpublishVolume
    • NodeUnstageVolume
  • ControllerUnpublishVolume

Instead, I see:

  • ControllerPublishVolume
    • NodeStageVolume
      • NodePublishVolume
      • NodeStageVolume
      • NodePublishVolume
      • NodeStageVolume
      • NodePublishVolume
      • NodeUnpublishVolume
      • NodeUnpublishVolume
      • NodeUnpublishVolume
    • NodeUnstageVolume
  • ControllerUnpublishVolume

The extra calls to NodeStageVolume are idempotent/harmless, but it'd be nice to avoid the unnecessary work.

DeleteVolume is called twice when deleting a volume

Around half the time, I see DeleteVolume called twice when deleting a PVC. The delete is idempotent, so the second call is harmless and does not return an error, but it'd be nice to avoid it or at least understand why it happens.

Here's a log snippet of my CSI controller when deleting a single PVC. Note the duplicate DeleteVolume calls about 1 second apart:

DEBU[0613] GRPC call: /csi.v0.Identity/GetPluginCapabilities 
DEBU[0613] GRPC request:                                
DEBU[0613] >>>> GetPluginCapabilities                    Method=GetPluginCapabilities Type=CSI_Identity
DEBU[0613] <<<< GetPluginCapabilities                    Method=GetPluginCapabilities Type=CSI_Identity
DEBU[0613] GRPC response: capabilities:<service:<type:CONTROLLER_SERVICE > >  
DEBU[0613] GRPC call: /csi.v0.Controller/ControllerGetCapabilities 
DEBU[0613] GRPC request:                                
DEBU[0613] >>>> ControllerGetCapabilities                Method=ControllerGetCapabilities Type=CSI_Controller
DEBU[0613] <<<< ControllerGetCapabilities                Method=ControllerGetCapabilities Type=CSI_Controller
DEBU[0613] GRPC response: capabilities:<rpc:<type:CREATE_DELETE_VOLUME > > capabilities:<rpc:<type:PUBLISH_UNPUBLISH_VOLUME > > capabilities:<rpc:<type:LIST_VOLUMES > >  
DEBU[0613] GRPC call: /csi.v0.Identity/GetPluginInfo    
DEBU[0613] GRPC request:                                
DEBU[0613] >>>> GetPluginInfo                            Method=GetPluginInfo Type=CSI_Identity
DEBU[0613] <<<< GetPluginInfo                            Method=GetPluginInfo Type=CSI_Identity
DEBU[0613] GRPC response: name:"io.netapp.trident.csi" vendor_version:"18.07.0"  
DEBU[0613] GRPC call: /csi.v0.Controller/DeleteVolume
DEBU[0613] GRPC request: volume_id:"{\"name\":\"pvc-f3aafe455ad011e8\",\"protocol\":\"file\"}"  
DEBU[0613] >>>> DeleteVolume                             Method=DeleteVolume Type=CSI_Controller
DEBU[0614] <<<< DeleteVolume                             Method=DeleteVolume Type=CSI_Controller
DEBU[0614] GRPC response:                               
DEBU[0614] GRPC call: /csi.v0.Identity/GetPluginCapabilities 
DEBU[0614] GRPC request:                                
DEBU[0614] >>>> GetPluginCapabilities                    Method=GetPluginCapabilities Type=CSI_Identity
DEBU[0614] <<<< GetPluginCapabilities                    Method=GetPluginCapabilities Type=CSI_Identity
DEBU[0614] GRPC response: capabilities:<service:<type:CONTROLLER_SERVICE > >  
DEBU[0614] GRPC call: /csi.v0.Controller/ControllerGetCapabilities 
DEBU[0614] GRPC request:                                
DEBU[0614] >>>> ControllerGetCapabilities                Method=ControllerGetCapabilities Type=CSI_Controller
DEBU[0614] <<<< ControllerGetCapabilities                Method=ControllerGetCapabilities Type=CSI_Controller
DEBU[0614] GRPC response: capabilities:<rpc:<type:CREATE_DELETE_VOLUME > > capabilities:<rpc:<type:PUBLISH_UNPUBLISH_VOLUME > > capabilities:<rpc:<type:LIST_VOLUMES > >  
DEBU[0614] GRPC call: /csi.v0.Identity/GetPluginInfo    
DEBU[0614] GRPC request:                                
DEBU[0614] >>>> GetPluginInfo                            Method=GetPluginInfo Type=CSI_Identity
DEBU[0614] <<<< GetPluginInfo                            Method=GetPluginInfo Type=CSI_Identity
DEBU[0614] GRPC response: name:"io.netapp.trident.csi" vendor_version:"18.07.0"  
DEBU[0614] GRPC call: /csi.v0.Controller/DeleteVolume
DEBU[0614] GRPC request: volume_id:"{\"name\":\"pvc-f3aafe455ad011e8\",\"protocol\":\"file\"}"  
DEBU[0614] >>>> DeleteVolume                             Method=DeleteVolume Type=CSI_Controller
DEBU[0614] Could not delete volume.                      error="volume pvc-f3aafe455ad011e8 not found" volumeName=pvc-f3aafe455ad011e8
DEBU[0614] <<<< DeleteVolume                             Method=DeleteVolume Type=CSI_Controller
DEBU[0614] GRPC response:                               

Scalability: Increase resync period or make parameterizable

The default resync period of 15 seconds is IMO too frequent. Once there are many PVs in the system, and multiple CSI plugins, there will be a lot of unnecessary churn. API watch is supposed to be better now and shouldn't drop events that frequently.

External-provisioner fails to issue ControllerProbe request

According to the CSI spec, the CO (or its components) shall invoke this RPC (controllerprobe) to determine readiness of the service.

A CSI driver that follows the language of the spec will fail if a probe RPC is not invoked prior to other operations. If the CSI language is correct, then this constitutes a bug.

external-provisioner should NOT issue delete calls between retries.

The CSI external-provisioner relies on the external-provisioner library https://github.com/kubernetes-incubator/external-storage/blob/master/lib/controller/controller.go

In that library the provisioning code will attempt to provision a volume, if provisioning fails it will try to delete the volume to cleanup (see https://github.com/kubernetes-incubator/external-storage/blob/master/lib/controller/controller.go#L935).

If the failure was due to the volume provisioning taking too long, this just makes the problem worse: volume provisioning is triggered, it takes too long so the call times out, the volume is deleted, we still want the volume so provisioning is triggered again but times out before it is complete, and then the volume is deleted again, so on every retry it has to start over.

Instead, if we fail due to an error, we should not delete the volume. We should only delete once the request for provisioning is gone (i.e. the PVC is deleted).

CC @sbezverk thoughts?

Reported by @cduchesne

Provisioner should not set node affinity if k8s cluster has CSINodeInfo feature disabled.

If k8s is running on version 1.12 with CSINodeInfo disabled and external-provisioner 0.4.0, volume mounts fail with a node affinity mismatch: when the CSINodeInfo feature is disabled, Node objects are not populated with topology labels, while the provisioner creates a NodeAffinity for the volume regardless.

Provisioner should have a feature flag to enable/disable topology

This is a blocker for 0.4.0 external-provisioner release

/priority critical-urgent
/kind bug

Exponential backoff for VolumeCreate issues

  1. If the PVC is deleted, we don't exit the exponential backoff loop.
  2. Non-timeout errors are not retried within the Provision() call, but won't Provision() be retried by the external-provisioner controller with no exponential backoff?

This probably requires change to external-provisioner library

connectionTimeout is actually operationTimeout

I noticed that when I use a short connection timeout, for example 5 seconds, and then throttle between my csi plugin (csi-scaleio) and my storage api server, any operation that takes more than the timeout fails. This is a big problem because the external-provisioner has only connectionTimeout to complete any request. If a volume creation event took longer than 5 seconds, which I tested via throttling, a new request will kick off to the csi plugin, resulting in duplicate volumes with different names being created. This is because the first command was still sent to the csi plugin which it successfully connected to even though the operation is dropped from the provisioner side.

I find that context.WithTimeout is the culprit here as it is used frequently for every command that is sent to the grpc server. This results in the grpc request being severed once connectionTimeout is reached. This would possibly be fine if all operations were requested in an idempotent manner.

My suggestion is to change connectionTimeout to only occur at startup or whenever connection to the csi plugin is first established, and to maybe introduce a new option, operationTimeout.

There are some fixes required to make sure the same volume is requested each time to make the volume creation idempotent. Every time Provision (in pkg/controller) is called, a new random share name is generated, hence if the Provision command is called multiple times, it will result in duplicate volumes with different names. Only 1 volume will ever be tracked by Kubernetes and the others will be orphaned.

CSI external-provisioner does not pass parameters

The CSI CreateVolumeRequest object accepts parameters:

  // Plugin specific parameters passed in as opaque key-value pairs.
  // This field is OPTIONAL. The Plugin is responsible for parsing and
  // validating these parameters. COs will treat these as opaque.
  map<string, string> parameters = 5;

However, the CSI external-provisioner does not currently pass these along. It should extract the parameters from the StorageClass and pass those along.

PVC.UID for volume name unique but hard for humans to differentiate

Issue raised by @cduchesne

With PR #66, the names of newly provisioned volumes are based on the UID of the Kubernetes PVC object. Although the UIDs are unique, they do not differ much, making it harder for humans to read and differentiate between them.

Example from @cduchesne:

[root@prometheus csi-scaleio]# for i in {01..10}; do curl -s http://127.0.0.1:8080/api/v1/namespaces/scaleio/persistentvolumeclaims/vol$i | jq '.metadata.uid'; done
"f151bfd8-3751-11e8-ba9f-0cc47ac6c23c"
"f16e31ee-3751-11e8-ba9f-0cc47ac6c23c"
"f18a35c0-3751-11e8-ba9f-0cc47ac6c23c"
"f1a737cc-3751-11e8-ba9f-0cc47ac6c23c"
"f1c6e464-3751-11e8-ba9f-0cc47ac6c23c"
"f1e6f657-3751-11e8-ba9f-0cc47ac6c23c"
"f2034a06-3751-11e8-ba9f-0cc47ac6c23c"
"f223b1e4-3751-11e8-ba9f-0cc47ac6c23c"
"f2422240-3751-11e8-ba9f-0cc47ac6c23c"
"f25d4bec-3751-11e8-ba9f-0cc47ac6c23c"

Since the primary issue here is readability and Kubernetes does the same thing (#66 (comment)), I would consider this issue low priority. It would be nice if we could come up with a better naming scheme that is deterministically generated from the PVC without collisions.

Allow `fsType` parameter from StorageClass

The CSIPersistentVolumeSource has a field for fsType.

Any component that needs to use fsType pulls it from the CSI volume source (or at least it should) in the PV object and passes it to the CSI driver (e.g. external-provisioner for ControllerPublishVolume, kubelet for NodePublishVolume, etc.) . The only problem is that nothing sets that value currently, so after dynamic volume provisioning the end user has to go in and manually modify the CSI volume source in the PV object and change the fsType before proceeding with using it.

We should modify the CSI external-provisioner to allow users to specify the fsType as a parameter in the StorageClass object:

fsType: fsType that is supported by kubernetes. Default: "ext4".

The only drawback is that this option should really be part of VolumeMode, but for legacy reasons Kubernetes chose not to do that:

type PersistentVolumeClaimSpec struct {
...
	VolumeMode *PersistentVolumeMode
}

type PersistentVolumeMode string

const (
	PersistentVolumeBlock PersistentVolumeMode = "Block"
	PersistentVolumeFilesystem PersistentVolumeMode = "Filesystem"
)

Adding PVC's annotation to CreateVolumeRequest.Parameters

As the title says, there is a good chance that an application-specific operator would leverage the PVC's annotations for storing custom data. The custom data would tell the storage provisioner how the CSI driver should create the volume correctly.

Here is the already existing logic for provisioning a volume:

	// Create a CSI CreateVolumeRequest and Response
	req := csi.CreateVolumeRequest{

		Name:       pvName,
		Parameters: options.Parameters,
		VolumeCapabilities: []*csi.VolumeCapability{
			{
				AccessType: accessType,
				AccessMode: accessMode,
			},
		},
		CapacityRange: &csi.CapacityRange{
			RequiredBytes: int64(volSizeBytes),
		},
	}

Note: more details

The CSI spec specifies that the CreateVolumeRequest field Parameters is OPTIONAL and opaque.
What I am suggesting is that we could pass in the PVC's annotations in addition to StorageClass.Parameters.

Restore `volume-name-uuid-length`

external-provisioner allows the length of the UUID string in the volume name to be controlled via volume-name-uuid-length (see https://github.com/kubernetes-csi/external-provisioner/blob/master/cmd/csi-provisioner/csi-provisioner.go#L45-L46).

But PR https://github.com/kubernetes-csi/external-provisioner/pull/66/files stopped using volumeNameUUIDLength. The ability for a storage vendor to limit the length of the volume name is very useful (some storage systems only allow names as short as 32 characters), so we should restore this ability.

... := fmt.Sprintf("%s-%s", p.volumeNamePrefix, strings.Replace(string(pvcUID), "-", "", -1)[0:p.volumeNameUUIDLength])

/assign @sbezverk

Adding annotations to PV from CreateVolumeResponse.Volume.attributes

This is to add a feature for adding annotations to a PV that a CSI provisioner provisions.

This is similar to #86 in that it also concerns handling annotations on Provision, but this one is about adding annotations to the PV using information passed from the CSI driver, not about passing the PVC's annotation information to the CSI driver.

As seen in the code below, the current Provision implementation in the CSI provisioner doesn't provide a way for a CSI driver to add annotations to the PV.
https://github.com/kubernetes-csi/external-provisioner/blob/master/pkg/controller/controller.go#L399

However, adding annotations to the PV is something existing provisioners commonly do; therefore, it should be implemented in the CSI provisioner as well. Please see below for examples of use cases for this feature in existing external provisioners.

As for implementation, CreateVolumeResponse.Volume.attributes could be used by CSI drivers to pass information on annotations to be added to PV. (Similar to #86 that is planning to use CreateVolumeRequest.Parameters.)
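
A hedged sketch of the idea: keys in the returned attributes that carry an agreed-upon prefix become PV annotations. The "pv.annotations/" prefix and the helper are hypothetical; the attributes map would come from CreateVolumeResponse.Volume as described above.

package controller

import "strings"

// annotationsFromAttributes extracts attributes that the CSI driver marked
// with a (hypothetical) "pv.annotations/" prefix and returns them as
// annotations to be set on the provisioned PV.
func annotationsFromAttributes(attributes map[string]string) map[string]string {
	const prefix = "pv.annotations/"
	annotations := map[string]string{}
	for k, v := range attributes {
		if strings.HasPrefix(k, prefix) {
			annotations[strings.TrimPrefix(k, prefix)] = v
		}
	}
	return annotations
}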

Orphaned PVs if provisioner gets killed and restarted

Currently there is a check in the provisioner before a PV gets deleted for a provisioner ID, and if it does not match the ID of the current provisioner, the PV does not get deleted. While playing with the hostpath plugin, I killed and re-created the provisioner; as a result the provisioner got a new ID and refused to delete the volumes created by the previous provisioner, so a bunch of PVs became effectively orphaned and needed manual deletion.
Maybe we should not check for provisioner identity? What purpose does this check serve?

Dockerfile in root needs to be deleted

The contents of extras/docker are used to build the container because they produce an image that is only 3MB, instead of the 300MB produced by the Dockerfile in the root.

Should never time out from connecting

The container should never time out connecting to a gRPC socket. Currently, if the connection fails, it just stops trying, and the only way to recover is to restart the container.

It should keep trying forever. Likewise, on a later connection failure it should go back to retrying forever.
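
A minimal sketch of a retry-forever connection loop (the per-attempt 10s timeout, the helper name, and the glog logging are illustrative, not the provisioner's actual connection code):

package connection

import (
	"context"
	"time"

	"github.com/golang/glog"
	"google.golang.org/grpc"
)

// connectForever keeps dialing the CSI socket until it succeeds instead of
// giving up after a fixed number of attempts. Each attempt blocks for at
// most ten seconds.
func connectForever(address string) *grpc.ClientConn {
	for {
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		conn, err := grpc.DialContext(ctx, address, grpc.WithInsecure(), grpc.WithBlock())
		cancel()
		if err == nil {
			return conn
		}
		glog.Warningf("still trying to connect to %s: %v", address, err)
		time.Sleep(time.Second)
	}
}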

Timed-out CreateVolume calls may leak volumes

Currently, external-provisioner's CreateVolume call has a timeout of 10s (hard-coded!) and is retried up to 10 times:

opts := wait.Backoff{Duration: backoffDuration, Factor: backoffFactor, Steps: backoffSteps}
err = wait.ExponentialBackoff(opts, func() (bool, error) {
	ctx, cancel := context.WithTimeout(context.Background(), p.timeout)
	defer cancel()
	rep, err = p.csiClient.CreateVolume(ctx, &req)
	if err == nil {
		// CreateVolume has finished successfully
		return true, nil
	}
	if status, ok := status.FromError(err); ok {
		if status.Code() == codes.DeadlineExceeded {
			// CreateVolume timed out, give it another chance to complete
			glog.Warningf("CreateVolume timeout: %s has expired, operation will be retried", p.timeout.String())
			return false, nil
		}
	}
	// CreateVolume failed, no reason to retry, bailing from ExponentialBackoff
	return false, err
})

The CSI spec says (RPC interactions CreateVolume, DeleteVolume, ListVolumes):

It is worth noting that the plugin-generated volume_id is a REQUIRED field for the DeleteVolume RPC, as opposed to the CO-generated volume name that is REQUIRED for the CreateVolume RPC: these fields MAY NOT contain the same value. If a CreateVolume operation times out, leaving the CO without an ID with which to reference a volume, and the CO also decides that it no longer needs/wants the volume in question then the CO MAY choose one of the following paths:

  • Replay the CreateVolume RPC that timed out; upon success execute DeleteVolume using the known volume ID (from the response to CreateVolume).
  • Execute the ListVolumes RPC to possibly obtain a volume ID that may be used to execute a DeleteVolume RPC; upon success execute DeleteVolume.
  • The CO takes no further action regarding the timed out RPC, a volume is possibly leaked and the operator/user is expected to clean up.

Judging by the implementation, I'm guessing volumes are being leaked deliberately? If so, IMHO at least the timeout should be configurable, and the ABORTED error code should be handled properly, respecting the plugin's ability to protect itself from leaking volumes.

If not, then this is a bug and those retried volumes (which, in the meantime, may have been created successfully) should be disposed of, although this still sounds like a waste of resources. The current policy for picking a successful CreateVolume response seems to be "the first CreateVolume to finish in 10s or less" rather than "the first CreateVolume to finish". Such an approach also has the unfortunate consequence that if no CreateVolume call finishes in under 10s, Provision() will never succeed. This can be easily reproduced with this CSI plugin.
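
A hedged sketch of how the retry loop quoted above could at least make the timeout configurable and also retry on ABORTED (the flag name and helper are illustrative, not existing options):

package controller

import (
	"context"
	"flag"
	"time"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
	"k8s.io/apimachinery/pkg/util/wait"
)

var createVolumeTimeout = flag.Duration("create-volume-timeout", 10*time.Second,
	"Timeout of a single CreateVolume call (illustrative flag, not an existing option).")

// createWithRetries retries CreateVolume on DEADLINE_EXCEEDED and ABORTED so
// that a slow but otherwise healthy plugin eventually gets a final answer
// through instead of the volume being leaked.
func createWithRetries(client csi.ControllerClient, req *csi.CreateVolumeRequest, backoff wait.Backoff) (*csi.CreateVolumeResponse, error) {
	var rep *csi.CreateVolumeResponse
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		ctx, cancel := context.WithTimeout(context.Background(), *createVolumeTimeout)
		defer cancel()
		var err error
		rep, err = client.CreateVolume(ctx, req)
		if err == nil {
			return true, nil
		}
		if s, ok := status.FromError(err); ok {
			switch s.Code() {
			case codes.DeadlineExceeded, codes.Aborted:
				// The plugin is still working on (or protecting) this volume; try again.
				return false, nil
			}
		}
		return false, err
	})
	return rep, err
}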

Provisioner cannot be recreated automatically if the node it's running on is down, resulting in a cluster-wide provisioning outage

Since the provisioner is deployed as a StatefulSet, it cannot recover automatically if the node running it goes down, leaving the whole cluster unable to provision new volumes.

According to kubernetes/kubernetes#47230 (comment), this is by design for StatefulSets.

This was an issue for the attacher too, but the attacher has since implemented a leader election mechanism and is deployed as a Deployment instead of a StatefulSet, so it is no longer an issue there.

Error: could not parse pre-release/metadata

system: suse server 12 sp2
go 1.9.2

linux-i5hz:~ # kubectl get pod
NAME                                 READY     STATUS             RESTARTS   AGE
csi-attacher-opensdsplugin-0         3/3       Running            0          1h
csi-nodeplugin-opensdsplugin-jzl76   3/3       Running            0          1h
csi-provisioner-opensdsplugin-0      2/3       CrashLoopBackOff   27         1h

linux-i5hz:~ # kubectl logs csi-provisioner-opensdsplugin-0 csi-provisioner
I0827 13:32:31.468858 1 csi-provisioner.go:70] Building kube configs for running in cluster...
panic: could not parse pre-release/metadata (-master+$Format:%h$) in version "v0.0.0-master+$Format:%h$"

goroutine 1 [running]:
github.com/kubernetes-csi/external-provisioner/vendor/k8s.io/kubernetes/pkg/util/version.MustParseSemantic(0xc4203107e0, 0x19, 0x0)
/home/lpabon/git/golang/csi-test/src/github.com/kubernetes-csi/external-provisioner/vendor/k8s.io/kubernetes/pkg/util/version/version.go:119 +0x6c
github.com/kubernetes-csi/external-provisioner/vendor/github.com/kubernetes-incubator/external-storage/lib/controller.NewProvisionController(0x1ac85e0, 0xc4201140f0, 0x7fff314e3aa3, 0x11, 0x1aa82c0, 0xc420312900, 0xc4203107e0, 0x19, 0x0, 0x0, ...)
/home/lpabon/git/golang/csi-test/src/github.com/kubernetes-csi/external-provisioner/vendor/github.com/kubernetes-incubator/external-storage/lib/controller/controller.go:346 +0x33e
main.init.0()
/home/lpabon/git/golang/csi-test/src/github.com/kubernetes-csi/external-provisioner/cmd/csi-provisioner/csi-provisioner.go:107 +0x6ef
main.init()
:1 +0x2f8

linux-i5hz:~ # kubectl version
Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.0.0-master+$Format:%h$", GitCommit:"$Format:%H$", GitTreeState:"", BuildDate:"2018-08-27T09:51:29Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"", Minor:"", GitVersion:"v0.0.0-master+$Format:%h$", GitCommit:"$Format:%H$", GitTreeState:"", BuildDate:"2018-08-27T09:51:29Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}

How is this going?

PV Capacity is no longer easy to read.

kubernetes-csi/external-provisioner v0.2.1 changed how capacity is set on the provisioned PV object (#71): instead of reusing the capacity from the PVC, it uses the capacity returned by the volume plugin.

The problem is that the new capacity is hard to read; in this example vol01 was generated using v0.2.1 and vol02 using v0.2.0.

NAME              STATUS    VOLUME                        CAPACITY      ACCESS MODES   STORAGECLASS   AGE
vol01             Bound     csivol-319f49bb440911e8ba9f   25769803776   RWO            scaleio        40s
vol02             Bound     csivol-70ac7c5d43ef11e89665   10Gi          RWO            scaleio        3h

25769803776 bytes is exactly 24 GiB. It is unclear whether the fix should be on the Kubernetes side or on the CSI external-provisioner side.

/assign @davidz627

Consider using Node & CSINodeInfo informers

Currently the topology code makes individual API calls to fetch nodes and CSINodeInfo objects, and in many cases a selector is used to list a subset of objects rather than all of them. Should we use informers instead?

My take is to use an informer for CSINodeInfo objects but not for Node objects. CSINodeInfo objects are only updated on CSI driver installation, which is less frequent than provision calls. Node objects are updated constantly, far more frequently than provision calls are made.

nodeInfo, err := csiAPIClient.CsiV1alpha1().CSINodeInfos().Get(selectedNode.Name, metav1.GetOptions{})

nodeInfos, err := csiAPIClient.CsiV1alpha1().CSINodeInfos().List(metav1.ListOptions{})

selectedNodeInfo, err := csiAPIClient.CsiV1alpha1().CSINodeInfos().Get(selectedNode.Name, metav1.GetOptions{})

nodes, err := kubeClient.CoreV1().Nodes().List(metav1.ListOptions{LabelSelector: selector})
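
A hedged sketch of what a CSINodeInfo informer could look like, built from a plain ListWatch over the same clientset calls shown above. The k8s.io/csi-api import paths and the CsiV1alpha1() accessor are assumptions based on the v1alpha1 client quoted here and may have moved since:

package topology

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/tools/cache"

	csiv1alpha1 "k8s.io/csi-api/pkg/apis/csi/v1alpha1"
	csiclient "k8s.io/csi-api/pkg/client/clientset/versioned"
)

// newCSINodeInfoInformer watches CSINodeInfo objects so that the topology code
// can read them from a local cache instead of issuing a GET per provision call.
func newCSINodeInfoInformer(csiAPIClient csiclient.Interface, resync time.Duration) cache.SharedIndexInformer {
	lw := &cache.ListWatch{
		ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
			return csiAPIClient.CsiV1alpha1().CSINodeInfos().List(options)
		},
		WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
			return csiAPIClient.CsiV1alpha1().CSINodeInfos().Watch(options)
		},
	}
	return cache.NewSharedIndexInformer(lw, &csiv1alpha1.CSINodeInfo{}, resync, cache.Indexers{})
}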

/cc @ddebroy

Need to pass UserCredentials in Create/Delete Image

Currently the StorageClass doesn't have a credentials API; the external-provisioner passes everything into Parameters but nothing into UserCredentials.

This is a problem for certain volumes such as rbd. We currently use parameters to store a plain-text secret to work around this issue.

Ideally, credentials should still be encoded as a Kubernetes Secret. The external-provisioner needs a mechanism to get credentials from the StorageClass, either through a special parameter, an annotation, or new API fields, and then pass them as UserCredentials in the CreateVolume and DeleteVolume calls.

Thanks @sbezverk for the early investigation into this issue.
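
A hedged sketch of the "special parameter" variant: the StorageClass names a Secret, which the provisioner resolves into a string map and passes to the driver as the credentials. The parameter names and helper are hypothetical, not an agreed-upon API, and the context-less Get matches the client-go vintage quoted elsewhere in this document:

package controller

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// credentialsFromClass resolves hypothetical "csiProvisionerSecretName" /
// "csiProvisionerSecretNamespace" StorageClass parameters into a string map
// that would be handed to CreateVolume/DeleteVolume as credentials.
func credentialsFromClass(kubeClient kubernetes.Interface, params map[string]string) (map[string]string, error) {
	name, namespace := params["csiProvisionerSecretName"], params["csiProvisionerSecretNamespace"]
	if name == "" {
		// No credentials configured in the StorageClass.
		return nil, nil
	}
	secret, err := kubeClient.CoreV1().Secrets(namespace).Get(name, metav1.GetOptions{})
	if err != nil {
		return nil, fmt.Errorf("failed to get secret %s/%s: %v", namespace, name, err)
	}
	credentials := map[string]string{}
	for key, value := range secret.Data {
		credentials[key] = string(value)
	}
	return credentials, nil
}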

Add Block volume support for CSI provisioner

CSI provisioner should provision Block volumes properly.

In the CSI provisioner, the following three pieces of logic need to be implemented to add Block volume support (see the sketch after these lists):

  1. Add SupportsBlock that properly returns whether the storage provider's plugin supports block volumes (checked by using ValidateVolumeCapabilities),
  2. Pass BlockVolume instead of MountVolume to CreateVolume if volumeMode is set to Block on Provision,
  3. Set volumeMode on the PV returned by Provision.

In the storage provider's plugin, the following two pieces of logic should be implemented:

  1. Return true from ValidateVolumeCapabilities when BlockVolume is specified and the plugin supports block volumes, and false if it doesn't,
  2. Properly handle CreateVolume with BlockVolume specified, if ValidateVolumeCapabilities with BlockVolume returns true.
    (In most cases no special handling should be needed, as far as I can tell from the existing provisioner/attacher code.)
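
A hedged sketch of item 2 of the first list: choosing the CSI access type from the PVC's volumeMode. The helper is illustrative, fsType handling is omitted, and the AccessMode would still be set by the caller:

package controller

import (
	"github.com/container-storage-interface/spec/lib/go/csi"
	v1 "k8s.io/api/core/v1"
)

// accessTypeForMode returns a BlockVolume capability when the PVC asks for
// volumeMode: Block and a MountVolume capability otherwise. AccessMode is
// filled in elsewhere by the caller.
func accessTypeForMode(volumeMode *v1.PersistentVolumeMode, fsType string) *csi.VolumeCapability {
	if volumeMode != nil && *volumeMode == v1.PersistentVolumeBlock {
		return &csi.VolumeCapability{
			AccessType: &csi.VolumeCapability_Block{
				Block: &csi.VolumeCapability_BlockVolume{},
			},
		}
	}
	return &csi.VolumeCapability{
		AccessType: &csi.VolumeCapability_Mount{
			Mount: &csi.VolumeCapability_MountVolume{FsType: fsType},
		},
	}
}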

Errors from fetching secrets should be reported

The provisioner only logs an error when it fails to fetch a secret configured in the StorageClass, and then continues with provisioning. It should return the error to the user instead.

Secrets are optional in CSI; however, once the admin configures them in the StorageClass, they really should be used. If there is a typo in the secret name, the admin should know and provisioning should fail. Nobody knows what a CSI driver will do with no secret; it may provision a volume that is accounted to a different team / department than the admin intended.

pvName UUID should not be truncated by default

There was a change introduced that truncates the pvName UUID to volumeNameUUIDLength, which is an external-provisioner parameter.

The default behavior should be to NOT truncate the UUID, as truncation can cause naming collisions depending on the method of UUID generation.

The flag should remain for optional truncation.

This fix should be cherry-picked to v0.3.0 so that the behavior does not regress twice:
v0.2.0 -> v0.3.0 -> v0.4.0 should not all have different naming behavior.

AlreadyExists error causes repeated provisioning of a volume

For some unknown reason in our environment, the external-provisioner may provision an already existing volume (the name of a volume is the unique key that decides whether it exists). Based on the current logic, that means it keeps re-provisioning a volume for the given PVC. Worse, there is no external API I can use to delete the volume manually.

In a very simplified view, provisionClaimOperation is responsible for provisioning a volume and can be broken down into the following steps:

  • Obtain the given PVC and map it to a specific storage class;
  • Call the CSI driver to provision a volume;
  • Create a PV object to record the volume.

Based on the CSI RPC specification, the current default behavior is that the CSI driver explicitly raises an AlreadyExists error if the volume already exists. As a result, provisionClaimOperation keeps re-provisioning the volume unless the already existing volume is deleted manually.

volume, err := ctrl.provisioner.Provision(options)
	if err != nil {
		if ierr, ok := err.(*IgnoredError); ok {
			// Provision ignored, do nothing and hope another provisioner will provision it.
			glog.Infof("provision of claim %q ignored: %v", claimToClaimKey(claim), ierr)
			return nil
		}
		strerr := fmt.Sprintf("Failed to provision volume with StorageClass %q: %v", claimClass, err)
		glog.Errorf("Failed to provision volume for claim %q with StorageClass %q: %v", claimToClaimKey(claim), claimClass, err)
		ctrl.eventRecorder.Event(claim, v1.EventTypeWarning, "ProvisioningFailed", strerr)
		return err
	}

I plan to take some time to figure out why more than one provisioning attempt for the same PVC happens.
For now, IMO we could probably treat the AlreadyExists error as an ignored error to improve code robustness.

	opts := wait.Backoff{Duration: backoffDuration, Factor: backoffFactor, Steps: backoffSteps}
	err = wait.ExponentialBackoff(opts, func() (bool, error) {
		ctx, cancel := context.WithTimeout(context.Background(), p.timeout)
		defer cancel()
		rep, err = p.csiClient.CreateVolume(ctx, &req)
		if err == nil {
			// CreateVolume has finished successfully
			return true, nil
		}

		if status, ok := status.FromError(err); ok {
			if status.Code() == codes.DeadlineExceeded {
				// CreateVolume timed out, give it another chance to complete
				glog.Warningf("CreateVolume timeout: %s has expired, operation will be retried", p.timeout.String())
				return false, nil
			}
		}
		// CreateVolume failed, no reason to retry, bailing from ExponentialBackoff
		return false, err
	})

	// treat an `AlreadyExists` error as an ignored error
	if status, ok := status.FromError(err); ok {
		if status.Code() == codes.AlreadyExists {
			// the volume has already been created; do not provision it again
			return nil, &controller.IgnoredError{Reason: "Volume already exists"}
		}
	}

Mount options are not passed from the storage class to the CSI plugin

Storage drivers that use NFS typically need to provide more flexibility for mounting than block-based storage drivers. Before CSI, the mountOptions field in the storage class spec was used to supply these.

Currently, however, the Kubernetes CSI sidecar containers do not pass any mount options from the storage class through to the CSI plugin.

A few things need to change to allow this use case (a sketch of steps 2 and 3 follows the list):

  1. The CSI driver within the Kubernetes project needs to report that mount options are supported (currently support is hard-coded to false).
  2. The external-provisioner needs to copy the mount options from the storage class to the PV spec at creation time.
  3. All of the RPC calls to the CSI plugin that take mount options should include them where appropriate.
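
A hedged sketch of the provisioner side of steps 2 and 3: carry the StorageClass mountOptions into both the CSI mount capability and the PV spec. The helper is illustrative; the surrounding Provision code is not shown:

package controller

import (
	"github.com/container-storage-interface/spec/lib/go/csi"
	v1 "k8s.io/api/core/v1"
)

// applyMountOptions threads the StorageClass mountOptions through to the CSI
// driver (as MountFlags on the mount capability) and records them on the PV
// so the kubelet uses the same options at mount time.
func applyMountOptions(mountOptions []string, mount *csi.VolumeCapability_MountVolume, pv *v1.PersistentVolume) {
	mount.MountFlags = mountOptions
	pv.Spec.MountOptions = mountOptions
}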

external-provisioner does not respect accessMode

external-provisioner hard-codes the access mode to:

	accessMode = &csi.VolumeCapability_AccessMode{
		Mode: csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER,
	}

It does not use the access mode specified on the PV/PVC, which means the driver cannot properly validate the accessMode.

Reported by @cduchesne!
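
A hedged sketch of translating the PVC's access modes instead of hard-coding SINGLE_NODE_WRITER (the mapping follows the CSI spec's access modes; the helper itself is illustrative):

package controller

import (
	"github.com/container-storage-interface/spec/lib/go/csi"
	v1 "k8s.io/api/core/v1"
)

// accessModeFromPVC maps a Kubernetes access mode onto the corresponding CSI
// access mode instead of always assuming SINGLE_NODE_WRITER.
func accessModeFromPVC(mode v1.PersistentVolumeAccessMode) *csi.VolumeCapability_AccessMode {
	m := csi.VolumeCapability_AccessMode_UNKNOWN
	switch mode {
	case v1.ReadWriteOnce:
		m = csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER
	case v1.ReadOnlyMany:
		m = csi.VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY
	case v1.ReadWriteMany:
		m = csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER
	}
	return &csi.VolumeCapability_AccessMode{Mode: m}
}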
