metal3-io / hardware-classification-controller
Controller for matching host hardware characteristics to expected values.
License: Apache License 2.0
Hardware classification currently supports the following parameters:
1. CPU 2. DISK 3. RAM 4. NIC
Suppose a profile is configured with a combination like:
We should validate the YAML when such profiles are applied and display errors for the affected parameters/sub-parameters.
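A minimal sketch of such a check, written in Python purely for illustration. It assumes the hardwareCharacteristics section has already been parsed into a dictionary and uses the minimum/maximum field names that appear in the sample profiles; the function name validate_profile, the RANGE_FIELDS table and the error wording are hypothetical.

```python
# Hypothetical (minimum, maximum) field pairs per parameter, taken from the
# sample profiles; the real schema may differ.
RANGE_FIELDS = {
    "cpu": [("minimumCount", "maximumCount"),
            ("minimumSpeedMHz", "maximumSpeedMHz")],
    "disk": [("minimumCount", "maximumCount"),
             ("minimumIndividualSizeGB", "maximumIndividualSizeGB")],
    "ram": [("minimumSizeGB", "maximumSizeGB")],
    "nic": [("minimumCount", "maximumCount")],
}


def validate_profile(characteristics):
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    for param, section in characteristics.items():
        if param not in RANGE_FIELDS:
            errors.append(f"{param}: unsupported parameter")
            continue
        for low_key, high_key in RANGE_FIELDS[param]:
            low, high = section.get(low_key), section.get(high_key)
            # Reject negative bounds on either side of the range.
            for key, value in ((low_key, low), (high_key, high)):
                if value is not None and value < 0:
                    errors.append(f"{param}.{key}: must not be negative")
            # A minimum above its maximum can never match any host.
            if low is not None and high is not None and low > high:
                errors.append(
                    f"{param}: {low_key} ({low}) exceeds {high_key} ({high})")
    return errors
```

Returning every error at once, rather than stopping at the first, lets the user fix a profile in a single pass.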
The apiextensions.k8s.io/v1beta1 API version of CustomResourceDefinition will no longer be served starting from K8s v1.22. We should upgrade all our CRDs to v1.
Reference document: https://kubernetes.io/docs/reference/using-api/deprecation-guide/#customresourcedefinition-v122
A repository has already been created for HWCC at https://quay.io/repository/metal3-io
The Makefile and Dockerfile need to be modified so that the image is uploaded to quay.io.
A discussion about this will be opened in the Slack channel #cluster-api-baremetal.
This could probably be done later, once it is clear how SRIOV/DPDK configuration would be done for host provisioning. Eventually, though, HCC will need to classify bare-metal hosts based on SRIOV/DPDK functions as well.
User-provided label values are not applied to matching BMHs; currently only the default label value, "matches", is applied.
This issue appeared after the refactoring in HWCC.
A summary of failed hosts is needed when updating their label(s).
Currently the HWC status shows errors per host; similarly, we need to log a summary of failed hosts to tell the user which hosts failed during the label update process.
Current Log Snippet :
2020-08-31T06:13:49.376-0400 ERROR controllers.HardwareClassification.HardwareClassification-Controller Failed to update labels of BareMetalHost{"metal3-hardwareclassification": "metal3/hardwareclassification-sample"}
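The requested summary could look something like the following Python sketch. It assumes the controller has collected a mapping from host name to error message while updating labels; the function name and the message format are hypothetical and not taken from the HWCC code.

```python
def summarize_label_failures(profile, failures):
    """Build one summary line from a {host_name: error_message} mapping."""
    if not failures:
        return f"profile {profile}: labels updated on all hosts"
    # Sort the names so the summary is stable between reconciliations.
    hosts = ", ".join(sorted(failures))
    return (f"profile {profile}: failed to update labels on "
            f"{len(failures)} host(s): {hosts}")
```

A single line per profile keeps the per-host errors in the status while giving the log reader the list of affected hosts at a glance.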
Hardware-classification-controller should update the count of hosts in error (registration error, provisioning error, power management error, inspection error) in the HWCC status field and also label those hosts accordingly.
As of now HCC only updates the profile match status and the error state; it should also update the count of hosts in error.
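As a rough illustration of the counting step, the Python sketch below tallies hosts per error category. It assumes each host is available as a dictionary with an "errorType" key holding one of the four categories named above; that key name and the function name are assumptions made for this example.

```python
from collections import Counter

# Error categories named in the request above.
ERROR_TYPES = ("registration error", "provisioning error",
               "power management error", "inspection error")


def count_error_hosts(hosts):
    """Count hosts per error category.

    `hosts` is a list of dicts with a hypothetical "errorType" key; hosts
    without a recognised error type are ignored.
    """
    counts = Counter(h.get("errorType") for h in hosts)
    # Report only the categories that actually occurred.
    return {t: counts[t] for t in ERROR_TYPES if counts[t]}
```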
Hardware classification currently supports the following parameters:
Suppose a profile is configured with a combination like:
We should validate the YAML before applying such profiles and display errors/warnings for the affected parameters:
HCC should also compare and classify bare-metal hosts based on extra parameters in CPU, Firmware and System Vendor.
While testing HWCC, we found that adding more parameters to the existing resources would let us identify matched hosts more precisely.
CPU architecture is useful if the user requires a 32-bit or 64-bit processor.
New parameters such as Firmware and System Vendor would be an added advantage for identifying matched hosts.
System Vendor is useful as a new parameter if the user needs hardware from a particular vendor to support or deploy their application.
Firmware is useful as a new parameter if the user is looking for specific features in particular releases or versions.
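For string-valued parameters such as CPU architecture or System Vendor, the comparison could be as simple as the Python sketch below. The function name is hypothetical, and both the "empty means no constraint" rule and the case-insensitive comparison are assumptions made for this example rather than documented HCC behaviour.

```python
def matches_expected(expected, actual):
    """Compare an expected string value against the value found on a host.

    An empty or missing expectation places no constraint on the host.
    """
    if not expected:
        return True
    # Assumption: ignore case and surrounding whitespace when comparing.
    return expected.strip().lower() == (actual or "").strip().lower()
```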
HWCC logs should clearly convey the host for which hardware parameters are being compared.
Currently the host name is logged only in the following cases:
Suggestion: during the comparison step, the host name should be logged when the parameters for that host are being compared.
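One way to read this suggestion is sketched below in Python: every comparison entry carries the host name next to the expected and actual values. The function name, the "parameter" field and the message prefix are hypothetical; only the "Expected" and "Actual" keys mirror the existing log lines.

```python
import json


def comparison_log(profile, host, parameter, expected, actual):
    """Format one comparison entry that always carries the host name."""
    fields = {"profile": profile, "host": host, "parameter": parameter,
              "Expected": expected, "Actual": actual}
    return "comparing " + json.dumps(fields)
```

With the host in every entry, a reader no longer has to wait for the final list of valid hosts to work out which machine a comparison belonged to.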
Log Snippet :
"metal3/hardwareclassification-sample1", "Expected": 72, "Actual": 48}
"metal3/hardwareclassification-sample1", "Expected": 38, "Actual": 48}
"metal3/hardwareclassification-sample1", "Expected": 3600, "Actual": 1000}
"metal3/hardwareclassification-sample1", "Expected": 1000, "Actual": 1000}
"metal3/hardwareclassification-sample1", "Expected": 200, "Actual": 192}
"metal3/hardwareclassification-sample1", "Expected": 6, "Actual": 192}
"metal3/hardwareclassification-sample1", "Expected": 12, "Actual": 8}
"metal3/hardwareclassification-sample1", "Expected": 1, "Actual": 8}
"metal3/hardwareclassification-sample1", "Expected": 8, "Actual": 1}
"metal3/hardwareclassification-sample1", "Expected": 1, "Actual": 1}
"metal3/hardwareclassification-sample1", "Expected": 4000, "Actual": 836}
"metal3/hardwareclassification-sample1", "Expected": 200, "Actual": 836}
"metal3/hardwareclassification-sample1", "ValidHosts": ["node-52"]} <<<<<<<<< seen at the end of comparison >>>>>
Please create a new repository for hardware-classification-controller under the quay.io/metal3-io project, so that we can push our image there.
Until now we have been using a Docker Hub repository.
@dhellmann
Applied Yaml File:
apiVersion: metal3.io/v1alpha1
kind: HardwareClassification
metadata:
  name: hardwareclassification-sample
  namespace: metal3
  labels:
    hardwareclassification-sample: matches
spec:
  hardwareCharacteristics:
    cpu:
      architecture: "x86_64"
Logs:
2021-05-26T07:15:21.305Z DEBUG controller Successfully Reconciled {"reconcilerGroup": "metal3.io", "reconcilerKind": "HardwareClassification", "controller": "hardware-classification", "name": "hardwareclassification-sample", "namespace": "metal3"}
2021-05-26T07:15:21.306Z DEBUG controller Successfully Reconciled {"reconcilerGroup": "metal3.io", "reconcilerKind": "BareMetalHost", "controller": "baremetalhost", "name": "node-0", "namespace": "metal3"}
2021-05-26T07:15:21.306Z INFO controllers.BareMetalHost reconciling {"host": "metal3/node-1"}
2021-05-26T07:15:21.393Z INFO classifier CPU {"host": "node-1", "profile": "hardwareclassification-sample", "namespace": "metal3", "minCount": 0, "maxCount": 0, "actualCount": 2, "ok": true}
2021-05-26T07:15:21.395Z INFO classifier CPU {"host": "node-1", "profile": "hardwareclassification-sample", "namespace": "metal3", "minSpeed": 0, "maxSpeed": 0, "actualSpeed": 2694, "ok": true}
2021-05-26T07:15:21.395Z INFO classifier CPU {"host": "node-1", "profile": "hardwareclassification-sample", "namespace": "metal3", "architecture": "x86_64", "actualArchitecture": "x86_64", "ok": true}
2021-05-26T07:15:21.395Z DEBUG controller Successfully Reconciled {"reconcilerGroup": "metal3.io", "reconcilerKind": "BareMetalHost", "controller": "baremetalhost", "name": "node-1", "namespace": "metal3"}
2021-05-26T07:15:21.395Z INFO controllers.BareMetalHost reconciling {"host": "metal3/node-19"}
2021-05-26T07:15:21.395Z INFO classifier CPU {"host": "node-19", "profile": "hardwareclassification-sample", "namespace": "metal3", "minCount": 0, "maxCount": 0, "actualCount": 72, "ok": true}
2021-05-26T07:15:21.395Z INFO classifier CPU {"host": "node-19", "profile": "hardwareclassification-sample", "namespace": "metal3", "minSpeed": 0, "maxSpeed": 0, "actualSpeed": 3700, "ok": true}
2021-05-26T07:15:21.395Z INFO classifier CPU {"host": "node-19", "profile": "hardwareclassification-sample", "namespace": "metal3", "architecture": "x86_64", "actualArchitecture": "x86_64", "ok": true}
2021-05-26T07:15:21.395Z DEBUG controller Successfully Reconciled {"reconcilerGroup": "metal3.io", "reconcilerKind": "BareMetalHost", "controller": "baremetalhost", "name": "node-19", "namespace": "metal3"}
2021-05-26T07:15:21.395Z INFO controllers.BareMetalHost reconciling {"host": "metal3/node-53"}
2021-05-26T07:15:21.395Z INFO classifier CPU {"host": "node-53", "profile": "hardwareclassification-sample", "namespace": "metal3", "minCount": 0, "maxCount": 0, "actualCount": 48, "ok": true}
2021-05-26T07:15:21.395Z INFO classifier CPU {"host": "node-53", "profile": "hardwareclassification-sample", "namespace": "metal3", "minSpeed": 0, "maxSpeed": 0, "actualSpeed": 1000, "ok": true}
2021-05-26T07:15:21.395Z INFO classifier CPU {"host": "node-53", "profile": "hardwareclassification-sample", "namespace": "metal3", "architecture": "x86_64", "actualArchitecture": "x86_64", "ok": true}
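The entries above report minCount and maxCount of 0 together with "ok": true, which suggests that a bound left out of the profile is treated as "no limit". The Python sketch below shows a range check with that behaviour; it is an interpretation of the logs, not the controller's actual code, and the function name is hypothetical.

```python
def within_bounds(actual, minimum=0, maximum=0):
    """Range check in which a bound of 0 means the bound is not set."""
    if minimum and actual < minimum:
        return False
    if maximum and actual > maximum:
        return False
    return True
```

Under this reading, a host with 2 CPUs passes a profile that sets no CPU count, which is consistent with the "ok": true entries above.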
Addresses this.
This has already been done in CAPM3 and IPAM. Check these repos for reference.
Applied Yaml File
apiVersion: metal3.io/v1alpha1
kind: HardwareClassification
metadata:
  name: hardwareclassification-sample
  namespace: metal3
  labels:
    hardwareclassification-sample: matches
spec:
  hardwareCharacteristics:
    disk:
      minimumCount: 1
      maximumCount: 8
Logs:
2021-06-08T11:08:02.568Z DEBUG controller Successfully Reconciled {"reconcilerGroup": "metal3.io", "reconcilerKind": "HardwareClassification", "controller": "hardware-classification", "name": "hardwareclassification-sample", "namespace": "metal3"}
2021-06-08T11:08:02.568Z INFO controllers.BareMetalHost reconciling {"host": "metal3/node-19"}
2021-06-08T11:08:02.568Z INFO classifier Disk Pattern {"host": "node-19", "profile": "hardwareclassification-sample", "namespace": "metal3", "minCount": 5, "maxCount": 5, "actualCount": 21, "ok": false}
2021-06-08T11:08:02.568Z DEBUG controller Successfully Reconciled {"reconcilerGroup": "metal3.io", "reconcilerKind": "BareMetalHost", "controller": "baremetalhost", "name": "node-19", "namespace": "metal3"}
2021-06-08T11:08:02.568Z INFO controllers.BareMetalHost reconciling {"host": "metal3/node-53"}
2021-06-08T11:08:02.568Z INFO classifier Disk Pattern {"host": "node-53", "profile": "hardwareclassification-sample", "namespace": "metal3", "minCount": 5, "maxCount": 5, "actualCount": 2, "ok": false}
2021-06-08T11:08:02.568Z DEBUG controller Successfully Reconciled {"reconcilerGroup": "metal3.io", "reconcilerKind": "BareMetalHost", "controller": "baremetalhost", "name": "node-53", "namespace": "metal3"}
2021-06-08T11:08:02.568Z INFO controllers.BareMetalHost reconciling {"host": "metal3/node-18"}
2021-06-08T11:08:02.568Z INFO classifier Disk Pattern {"host": "node-18", "profile": "hardwareclassification-sample", "namespace": "metal3", "minCount": 5, "maxCount": 5, "actualCount": 3, "ok": false}
2021-06-08T11:08:02.568Z DEBUG controller Successfully Reconciled {"reconcilerGroup": "metal3.io", "reconcilerKind": "BareMetalHost", "controller": "baremetalhost", "name": "node-18", "namespace": "metal3"}
2021-06-08T11:08:02.568Z DEBUG controller Successfully Reconciled {"reconcilerGroup": "metal3.io", "reconcilerKind": "HardwareClassification", "controller": "hardware-classification", "name": "hardwareclassification-sample", "namespace": "metal3"}
2021-06-08T11:08:02.568Z INFO controllers.BareMetalHost reconciling {"host": "metal3/node-19"}
2021-06-08T11:08:02.568Z INFO classifier Disk Pattern {"host": "node-19", "profile": "hardwareclassification-sample", "namespace": "metal3", "minCount": 5, "maxCount": 5, "actualCount": 21, "ok": false}
2021-06-08T11:08:02.568Z DEBUG controller Successfully Reconciled {"reconcilerGroup": "metal3.io", "reconcilerKind": "BareMetalHost", "controller": "baremetalhost", "name": "node-19", "namespace": "metal3"}
2021-06-08T11:08:02.568Z INFO controllers.BareMetalHost reconciling {"host": "metal3/node-53"}
2021-06-08T11:08:02.568Z INFO classifier Disk Pattern {"host": "node-53", "profile": "hardwareclassification-sample", "namespace": "metal3", "minCount": 5, "maxCount": 5, "actualCount": 2, "ok": false}
2021-06-08T11:08:02.568Z DEBUG controller Successfully Reconciled {"reconcilerGroup": "metal3.io", "reconcilerKind": "BareMetalHost", "controller": "baremetalhost", "name": "node-53", "namespace": "metal3"}
2021-06-08T11:08:02.568Z INFO controllers.BareMetalHost reconciling {"host": "metal3/node-18"}
2021-06-08T11:08:02.568Z INFO classifier Disk Pattern {"host": "node-18", "profile": "hardwareclassification-sample", "namespace": "metal3", "minCount": 5, "maxCount": 5, "actualCount": 3, "ok": false}
2021-06-08T11:08:02.568Z DEBUG controller Successfully Reconciled {"reconcilerGroup": "metal3.io", "reconcilerKind": "BareMetalHost", "controller": "baremetalhost", "name": "node-18", "namespace": "metal3"}
When there are no hosts to compare and validate, a proper message such as "no hosts found" should be displayed in the logs.
This can happen when all the hosts are in states other than "Ready" and we try to apply a HardwareClassification profile.
When there are no BareMetalHosts in the BMH list and we apply the profile below, the status column for the profile is updated to "unmatched"; instead, "No BareMetalHost Found" should be shown in the error column.
kind: HardwareClassification
metadata:
  name: hardwareclassification-sample
  namespace: metal3
  labels:
    hardwareclassification-sample: matches
spec:
  hardwareCharacteristics:
    cpu:
      architecture: "x86_64"
      minimumCount: 48
      maximumCount: 72
      minimumSpeedMHz: 1000
      maximumSpeedMHz: 3600
    disk:
      minimumCount: 1
      maximumCount: 8
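The requested behaviour can be sketched in Python as follows. The function name and the (status, error) return shape are hypothetical; "unmatched" and "No BareMetalHost Found" come from the report above, while "matched" and the empty status for the no-host case are assumptions.

```python
def profile_status(ready_hosts, matched_hosts):
    """Return a (status, error) pair for a profile.

    An empty host list is reported as an error rather than as "unmatched".
    """
    if not ready_hosts:
        return ("", "No BareMetalHost Found")
    if matched_hosts:
        return ("matched", "")
    return ("unmatched", "")
```

Separating "nothing to compare" from "compared and nothing matched" tells the user whether the profile or the host inventory needs attention.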
If a label is provided as a key-value pair in the CR and the profile is applied with kubectl apply -f Profile.yaml,
the label on valid hosts is not updated with the value provided in the CR; it is set to the default instead.
apiVersion: metal3.io/v1alpha1
kind: HardwareClassification
metadata:
  name: hardwareclassification-sample
  namespace: metal3
  labels:
    hardwareclassification-sample: match
spec:
  hardwareCharacteristics:
    cpu:
      minimumCount: 48
      maximumCount: 72
      minimumSpeedMHz: 2600
      maximumSpeedMHz: 3800
    disk:
      minimumCount: 1
      maximumCount: 8
      minimumIndividualSizeGB: 200
      maximumIndividualSizeGB: 3000
    ram:
      minimumSizeGB: 6
      maximumSizeGB: 200
    nic:
      minimumCount: 1
      maximumCount: 12
After applying the above Profile CR, the label on valid hosts is set to the default value, "matches".
node-50@node50-PowerEdge-R740xd:~$ kubectl get bmh -n metal3 --show-labels
NAME STATUS PROVISIONING STATUS CONSUMER BMC HARDWARE PROFILE ONLINE ERROR LABELS
node-49 OK ready redfish://192.168.110.49/redfish/v1/Systems/System.Embedded.1/ unknown false hardwareclassification.metal3.io/hardwareclassification-sample=matches
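The expected behaviour can be sketched in Python as follows. The label key prefix and the default value "matches" are taken from the kubectl output above; the function name, and the assumption that the value is looked up under the profile's own name in metadata.labels, are hypothetical.

```python
LABEL_PREFIX = "hardwareclassification.metal3.io/"
DEFAULT_VALUE = "matches"


def label_for(profile_name, profile_labels):
    """Return the (key, value) label to set on a matching host.

    The value comes from the profile's metadata.labels entry named after
    the profile; the default is used only when no value was provided.
    """
    value = profile_labels.get(profile_name) or DEFAULT_VALUE
    return LABEL_PREFIX + profile_name, value
```

For the profile above, this would yield the value "match" rather than the default "matches" that is currently applied.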
Currently HWCC uses an older version of BMO.
Update the package to the latest version, and also update the go.mod and go.sum files.
Steps to reproduce :
Observations: for nodes in the Provisioned/Provisioning state (non-ready states, to be specific), previously applied labels should not be deleted in the above scenario.
HCC validates and classifies hosts based on the user's hardware requirements.
From the user's perspective, it is helpful to classify matched hosts.
HCC should also compare and classify bare-metal hosts based on extra parameters in CPU, DISK, Firmware, NIC and SystemVendor.
HWCC logs should clearly indicate the label value that is applied to matching hosts.
Currently this value is printed in the "Delete Label" method but not in the "Set Label" method.
This will help the user identify the label (default/custom/user-provided) being applied to matching hosts.
Log Snippet :
2020-08-21T06:15:50.472Z INFO controllers.HardwareClassification.HardwareClassification-Controller Set Label {"metal3-hardwareclassification": "metal3/hardwareclassification-sample", "BareMetalHost": "node-52"}
2020-08-21T07:04:39.281Z INFO controllers.HardwareClassification.HardwareClassification-Controller Delete Label {"metal3-hardwareclassification": "metal3/hardwareclassification-sample", "BareMetalHost": "node-52", "hardwareclassification.metal3.io/hardwareclassification-sample": "match"}
The HCC logs should be simplified so that the user can follow the logs for the profile being compared.
Current issue: the logs get bulkier each time a new profile is applied, so it becomes difficult to find the comparison details for a given profile. The list of profiles in a single log line keeps growing, so the logs become lengthy to read and difficult to debug.
Logs are attached for reference.
hcc_logs.txt
While applying the sample Profile, an error appears in the container logs of hardware-classification-controller.
It successfully extracts the data from the Profile, but fails afterwards.
Below are the container logs:
ubuntu@ubuntu-PowerEdge-R640:~$ kubectl logs -f hcc-controller-manager-5b9ccfc986-5k9bt --container manager -n hcc-system
2020-05-14T07:33:40.808Z INFO controller-runtime.metrics metrics server is starting to listen {"addr": "127.0.0.1:8080"}
2020-05-14T07:33:40.808Z INFO setup starting manager
2020-05-14T07:33:40.809Z INFO controller-runtime.manager starting metrics server {"path": "/metrics"}
2020-05-14T07:33:58.207Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"ConfigMap","namespace":"hcc-system","name":"controller-leader-election-helper","uid":"47c5a3fd-6e16-426f-8fd3-bcd9118a9892","apiVersion":"v1","resourceVersion":"3839251"}, "reason": "LeaderElection", "message": "hcc-controller-manager-5b9ccfc986-5k9bt_edc740b9-034d-4b05-84d6-322abac02804 became leader"}
2020-05-14T07:33:58.208Z INFO controller-runtime.controller Starting EventSource {"controller": "hardwareclassification", "source": "kind source: /, Kind="}
2020-05-14T07:33:58.308Z INFO controller-runtime.controller Starting EventSource {"controller": "hardwareclassification", "source": "kind source: /, Kind="}
2020-05-14T07:33:58.308Z INFO controller-runtime.controller Starting Controller {"controller": "hardwareclassification"}
2020-05-14T07:33:58.408Z INFO controller-runtime.controller Starting workers {"controller": "hardwareclassification", "worker count": 1}
2020-05-14T07:58:02.245Z INFO controllers.HardwareClassification Extracted hardware configurations successfully {"Profile": {"CPU":{"minimumCount":48,"maximumCount":72,"minimumSpeed":"2.7","maximumSpeed":"3.6"},"Disk":{"minimumCount":1,"minimumIndividualSizeGB":200,"maximumCount":7,"maximumIndividualSizeGB":3000},"NIC":{"minimumCount":1,"maximumCount":7},"RAM":{"minimumSizeGB":6,"maximumSizeGB":180}}}
E0514 07:58:02.248759 1 reflector.go:123] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:96: Failed to list *v1alpha1.BareMetalHost: baremetalhosts.metal3.io is forbidden: User "system:serviceaccount:hcc-system:default" cannot list resource "baremetalhosts" in API group "metal3.io" at the cluster scope
E0514 07:58:03.252665 1 reflector.go:123] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:96: Failed to list *v1alpha1.BareMetalHost: baremetalhosts.metal3.io is forbidden: User "system:serviceaccount:hcc-system:default" cannot list resource "baremetalhosts" in API group "metal3.io" at the cluster scope
E0514 07:58:04.255663 1 reflector.go:123] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:96: Failed to list *v1alpha1.BareMetalHost: baremetalhosts.metal3.io is forbidden: User "system:serviceaccount:hcc-system:default" cannot list resource "baremetalhosts" in API group "metal3.io" at the cluster scope
E0514 07:58:05.256648 1 reflector.go:123] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:96: Failed to list *v1alpha1.BareMetalHost: baremetalhosts.metal3.io is forbidden: User "system:serviceaccount:hcc-system:default" cannot list resource "baremetalhosts" in API group "metal3.io" at the cluster scope
E0514 07:58:06.259305 1 reflector.go:123] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:96: Failed to list *v1alpha1.BareMetalHost: baremetalhosts.metal3.io is forbidden: User "system:serviceaccount:hcc-system:default" cannot list resource "baremetalhosts" in API group "metal3.io" at the cluster scope
One thing observed in our testing: if I apply profile ABC using kubectl, HCC processes profile ABC and adds labels to the matched hosts.
If I later change the hardware parameters in profile ABC and apply it again,
reconciliation is triggered twice.
########
This issue occurs because the behaviour is intentional in Kubernetes (kubernetes-sigs/controller-runtime#490).
According to this, the branch name "master" has to be changed to "main".
In the CRD file, descriptions for the newly added parameters are missing.
Prow test jobs are not running against open PRs. Is there a label to run those jobs manually, or are they supposed to run automatically against PRs?
One thing noticed in our testing: if the user mistakenly configures the wrong hardware parameters and applies profile ABC using kubectl, HCC processes profile ABC and adds labels to the matching hosts.
After some time, looking at profile ABC, the user realizes that a different configuration is needed, changes the same profile ABC to the right hardware parameters, and applies it again.
In this case, HCC should first delete the labels from the existing hosts, then classify hosts based on the new configuration and add the label to the matching hosts.
Please let me know if you agree with the above workflow for the label deletion feature.
Or
let us know your approach in this case.
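The proposed workflow can be sketched in Python as below. Note that this sketch computes the difference between the currently labelled hosts and the newly matching hosts instead of deleting every label first; the end state is the same, but hosts that still match are left untouched. The function name is hypothetical.

```python
def plan_label_changes(currently_labeled, newly_matching):
    """Return (to_remove, to_add) host-name lists for a changed profile."""
    current, target = set(currently_labeled), set(newly_matching)
    # Hosts that no longer match lose the label; new matches gain it.
    return sorted(current - target), sorted(target - current)
```

For example, if node-19 and node-52 carry the label and the edited profile now matches node-52 and node-53, the label is removed from node-19 and added to node-53.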