(What's below is copied and slightly edited from an earlier e-mail thread)
As discussed during the meeting earlier today, we'll need a story w.r.t. storage for the demo platform and, obviously, also for production deployments later on.
In general, when no kind of NAS which can be dynamically provisioned is available, and we're 'stuck' with local storage (on all servers comprising the cluster, or maybe even only on some of them), I think
there are 2.5 possible approaches:
- Deploy a NAS with a dynamic volume provisioner ourselves:
  - either host-based,
  - or hyperconverged
- Use local storage as it is: local-only
In the current architecture, our stateful services are all clustered, and don't need shared (NAS-style) storage for their Pods to be rescheduled onto other nodes and continue operating. As such, a local storage solution should suit our needs, without the operational headache and overhead of a shared solution.
Next up: how to provision these volumes.
One possibility is to simply create empty directories under some folder, and deploy the local-storage provisioner. I'm not overly fond of this solution, because it doesn't allow for capacity isolation: all those directories share the capacity of the underlying filesystem, so a volume bound to an ElasticSearch Pod which is solely used for log ingestion could fill up the disk and disrupt production services (other Pods) running on the same node. Less than desirable.
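For reference, a rough sketch of what that would entail (the storageClassMap format is the one used by the external-storage local-volume provisioner; names and paths are placeholders):

```yaml
# Hypothetical config: the provisioner turns every directory it finds
# under hostDir into a PV. Plain subdirectories all live on the same
# filesystem, so statfs reports the full parent capacity for each of
# them, hence no capacity isolation between the resulting volumes.
apiVersion: v1
kind: ConfigMap
metadata:
  name: local-volume-provisioner-config
  namespace: kube-system
data:
  storageClassMap: |
    local-storage:
      hostDir: /mnt/local-volumes
      mountDir: /mnt/local-volumes
```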
In 'real' on-prem deployments, we could aim for an architecture based on one-disk-per-volume, or a 'physical' partition per volume, prepared by some Ansible job (where we should really use by-UUID rules in fstab and associated mount-points such that losing fstab isn't lethal ;-)).
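As a sketch, such a by-UUID mount could be handled with the stock Ansible mount module (the `fs_uuid` variable is a placeholder, e.g. gathered via blkid in an earlier task):

```yaml
# Hypothetical task: mount a pre-formatted disk/partition by UUID, so a
# device being renamed across reboots can't break the mount.
# 'state: mounted' both writes the fstab entry and mounts the filesystem.
- name: Mount volume by UUID
  mount:
    src: "UUID={{ fs_uuid }}"
    path: "/mnt/local-volumes/{{ fs_uuid }}"
    fstype: ext4          # FS choice still TBD, see below
    opts: defaults,noatime
    state: mounted
```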
Alternatively, for both 'physical' as well as VM deployments (i.e. the demo platform), we could bundle 'similar' (SSD vs HDD, etc.) disks into LVM VGs, then create LVs according to our needs, and use them as in the scenario above.
@ballot-scality mentioned using thin provisioning in this case; I'm however not sure that's the right approach, since it'd require constant monitoring of the platform to ensure the thin pool doesn't run out of space...
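For illustration, here's a rough sketch of how the VG/LV part could look with stock Ansible modules (plain, fixed-size LVs rather than thin ones, per the concern above; all names and sizes are placeholders):

```yaml
# Hypothetical tasks: bundle 'similar' disks into a VG, then carve out
# fixed-size (non-thin) LVs so every volume gets hard capacity isolation.
- name: Create a volume group from the SSDs
  lvg:
    vg: my-vg
    pvs: /dev/vdb,/dev/vdc

- name: Create one logical volume per desired size
  lvol:
    vg: my-vg
    lv: "lv{{ item.0 }}"
    size: "{{ item.1 }}"
  with_indexed_items:
    - 50g
    - 5g
    - 10g
```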
In the actual-volumes cases, I think we can't really use the local-volume provisioner though: this provisioner uses statfs to figure out the 'real' size of a volume mounted under the path it monitors, and creates a PV accordingly. It is, however, difficult (if possible at all) to create a disk/LV and FS of exactly the desired size we pre-define in our Charts. As a result, a user would need to override the Chart values which define the desired PVC size request according to the specific cluster deployment, which is undesirable.
Instead, we should (at least, IMHO) pre-provision disks/LVs/FSs of the size we need (given the defaults in Chart values), then have an Ansible task POST the relevant PVs after the K8s cluster has been deployed, i.e. good old static provisioning.
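Mechanically, that POST could be a task along these lines, assuming Ansible's k8s module (2.6+) and a kubeconfig on the controller are available; the template name and the provisioned_volumes list are placeholders:

```yaml
# Hypothetical task: statically register each pre-provisioned volume with
# the API server once the cluster is up (good old static provisioning).
- name: Create pre-provisioned local PersistentVolumes
  k8s:
    state: present
    definition: "{{ lookup('template', 'local-pv.yaml.j2') | from_yaml }}"
  loop: "{{ provisioned_volumes }}"   # one entry per (node, volume) pair
```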
There's some initial work to enhance the story around dynamic provisioning of local storage volumes, see kubernetes-retired/external-storage#651 and the pointers from the people who provided some input. These features, however, will only land in K8s 1.11 at the earliest (though we may want to contribute to the efforts; I have some concerns about the current design which I'll raise in the PR later), so this won't be of any use in the foreseeable future.
To summarize, my current proposal would be to:
- Let a user list the disks they want to dedicate to the K8s cluster in the Ansible configuration, as some kind of dict (per node):
```yaml
my-vg:
  drives: ['/dev/vdb', '/dev/vdc']
  storageClassName: local-ssd
  provisionedVolumeSizes:
    - 50Gi
    - 5Gi
    - 10Gi
```
(up to @Zempashi, @alxf and @ballot-scality to tell me how this is properly done in Ansible ;-))
- Create LVM PVs, VGs and LVs accordingly
- Create some FS on the LVs (TBD which)
- Create `/mnt/my-vg/$UUID` for every volume to be provisioned, add it to fstab, and mount it
- Deploy K8s
- POST a StorageClass for every storageClassName defined in the configuration, also setting the right scheduling options (see the sketch after this list)
- POST PVs for every volume provisioned, including the correct node affinity rules etc., of the defined size (i.e. not using the size as reported by the FS, which may be slightly smaller)
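To make those last two steps concrete, here's a hedged sketch of the objects we'd POST (names, size and node are placeholders; the delayed binding mode is what I mean by 'the right scheduling options', so the scheduler takes the volume's node into account before binding):

```yaml
# Hypothetical StorageClass: no dynamic provisioner, volumes are pre-created.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-ssd
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
# Hypothetical pre-provisioned PV: capacity is the size we provisioned
# (not whatever the FS reports), pinned to its node via node affinity.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-ssd-node1-0          # placeholder name
spec:
  capacity:
    storage: 50Gi                  # the defined size, not the statfs one
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-ssd
  local:
    path: /mnt/my-vg/some-uuid     # placeholder mount-point
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node1            # placeholder node name
```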
By default, we'd pre-create PVs for the PVCs of all stateful services we're aware of.
We should check with TS people whether they feel comfortable using LVM2 for this purpose. I see, however, no reasonable way to achieve this otherwise.