
cluster-api-provider-metal3's People

Contributors

adilghaffardev, arvinderpal, dependabot[bot], dhellmann, digambar15, fmuyassarov, furkatgofurov7, huutomerkki, jaakko-os, jan-est, jbrette, kashifest, lentzi90, macaptain, maelk, mboukhalfa, metal3-io-bot, mhrivnak, mquhuy, mykoladzhamatteralytics, namnx228, rozzii, russellb, schrej, slintes, smoshiur1237, sunnatillo, tuminoid, wgslr, zhouhao3


cluster-api-provider-metal3's Issues

Kubernetes version upgrade fails both new and old nodes due to keepalived failure

What steps did you take and what happened:
In metal3-dev-env, set KUBERNETES_VERSION to "v1.18.0"

./scripts/v1alphaX/provision_cluster.sh
./scripts/v1alphaX/provision_controlplane.sh

Once provisioned, ssh into the node and run
kubectl get pods -A. If you see pods, proceed.

Update the Kubernetes version to "v1.18.1" by editing the KCP resource:
kubectl edit kcp -n metal3

After a while the upgrade process should start.

However, neither node is usable, for the following reason.
The new node (upgraded)

ubuntu@node-0:~$ kubectl get pods
The connection to the server 192.168.111.249:6443 was refused - did you specify the right host or port?

The original node

ubuntu@node-3:~$ kubectl get pods
The connection to the server 192.168.111.249:6443 was refused - did you specify the right host or port?

What did you expect to happen:

  • A new node with Kubernetes version v1.18.1 is provisioned.
  • The existing node is deprovisioned.

Anything else you would like to add:

Note: We should focus on why the original node failed, regardless of the new one also being unsuccessful.

Environment:

  • Cluster-api version: v1alpha3
  • CAPM3 version: v1alpha3
  • Minikube version: 1.9.2
  • Environment: metal3-dev-env
  • Kubernetes version: v1.18.0 and v1.18.1

/kind bug

Documentation update

What steps did you take and what happened:
The documentation in the /docs folder still refers to v1alpha2 in some places.

What did you expect to happen:
The documentation should be about v1alpha3 or v1alpha4 depending on the branch

Environment:

  • CAPBM version: v1a3 and v1a4

/kind bug

CAPBM removes networkData from BareMetalHost

What steps did you take and what happened:

  1. Create a BareMetalHost with spec.networkData and a Secret for it.
  2. Look at the description of the BareMetalHost and find "networkData".
  3. Wait for the BMH to reach the "Ready" status.
  4. Create a Cluster, BareMetalCluster, Machine, and BareMetalMachine.
  5. Look at the description of the BareMetalHost and try to find networkData.

What did you expect to happen:
The networkData field should be preserved. Instead, when CAPBM updates the BMH, it removes networkData.

Anything else you would like to add:
Environment:

  • Cluster-api version: v2
  • CAPBM version: v2
  • Minikube version: n/a
  • environment : other
  • Kubernetes version: (use kubectl version): 1.16.3

/kind bug

Infinite reconciliation loop when waiting for provisioning

What steps did you take and what happened:
When a node is being provisioned, the controller keeps reconciling the metal3machine about every second.

What did you expect to happen:

It should not reconcile the metal3machine until the BMH is updated, and probably provisioned.

Anything else you would like to add:
It looks like the call to the updateObject function in https://github.com/metal3-io/cluster-api-provider-metal3/blob/master/baremetal/metal3machine_manager.go#L651 updates the BMH object, which in turn triggers a reconciliation of the metal3machine. Calls to updateObject for the BMH should happen only when the BMH object has actually changed. Some refactoring, and use of the patch helper from CAPI, would go a long way toward preventing this waste of resources.
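A minimal sketch of the idea, with an illustrative stand-in type (not the real BareMetalHost API): compare the stored and desired objects first, and skip the write entirely when nothing changed, so no spurious reconcile is triggered.

```go
package main

import (
	"fmt"
	"reflect"
)

// hostSpec is a minimal stand-in for the BareMetalHost spec; field names
// are illustrative, not the real v1alpha1 types.
type hostSpec struct {
	Image        string
	Online       bool
	ConsumerName string
}

// updateIfChanged only writes the object back (and thereby triggers a new
// reconcile) when something actually changed. Returns whether it wrote.
func updateIfChanged(stored, desired *hostSpec, update func(*hostSpec)) bool {
	if reflect.DeepEqual(stored, desired) {
		return false // no-op: skip the API call, no extra reconcile
	}
	update(desired)
	return true
}

func main() {
	stored := &hostSpec{Image: "img-1", Online: true}
	same := &hostSpec{Image: "img-1", Online: true}
	changed := &hostSpec{Image: "img-2", Online: true}

	calls := 0
	update := func(*hostSpec) { calls++ }

	fmt.Println(updateIfChanged(stored, same, update))    // false
	fmt.Println(updateIfChanged(stored, changed, update)) // true
	fmt.Println(calls)                                    // 1
}
```

CAPI's patch helper achieves the same effect by diffing against the original object and sending only the changed fields.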

Environment:

  • Cluster-api version: 0.3.6
  • CAPM3 version: 0.4.X
  • environment (metal3-dev-env or other): metal3-dev-env

/kind bug

Label Machine based on Host

One of the reasons for supporting the machine API is that we want metal3 clusters to work like other clusters. That means being able to identify machines based on their characteristics, without knowing to look at the host resource. We should therefore have a way to copy label information from hosts onto machines when the host is provisioned.

One way to do that is by merging the labels on the host resource with the labels on the machine resource after the host is fully provisioned (some labels will be added when hardware is discovered).

We should decide whether to propagate the same labels up to the node. That would be especially useful for labels that provide information like where a host is located.
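One possible shape for that merge, as a minimal Go sketch. The types and the precedence rule (machine-defined labels win on conflict) are illustrative assumptions, not a decided policy:

```go
package main

import "fmt"

// mergeLabels copies host labels onto the machine after provisioning.
// Keys already set on the machine take precedence (an assumption here).
func mergeLabels(machine, host map[string]string) map[string]string {
	out := map[string]string{}
	for k, v := range host {
		out[k] = v
	}
	for k, v := range machine { // machine-defined labels win on conflict
		out[k] = v
	}
	return out
}

func main() {
	host := map[string]string{"topology/rack": "r12", "hw/gpu": "true"}
	machine := map[string]string{"role": "worker"}
	fmt.Println(mergeLabels(machine, host))
}
```

The same function could then be reused to propagate the merged set up to the node, if we decide to do so.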

CAPI isn't deployed correctly after "make deploy"

What steps did you take and what happened:
[A clear and concise description on how to REPRODUCE the bug.]
I used a fresh K8s cluster deployed with GCP Kubernetes Engine.
Performing make deploy in a freshly cloned metal3-io/cluster-api-provider-metal3 repo
results in:

capi-kubeadm-bootstrap-controller-manager-5bdc7dbf5d-d5ml5     CrashLoopBackOff
capi-controller-manager-7894bbf65d-hq9xc                       CrashLoopBackOff
capi-controller-manager-6dd9f648cf-zf2hw                       CrashLoopBackOff
capi-kubeadm-bootstrap-controller-manager-867f46d767-k4862     CrashLoopBackOff
kubectl logs capi-controller-manager-7894bbf65d-hq9xc manager -n capi-system
invalid argument "MachinePool=${EXP_MACHINE_POOL:=false},ClusterResourceSet=${EXP_CLUSTER_RESOURCE_SET:=false}" for "--feature-gates" flag: invalid value of MachinePool=${EXP_MACHINE_POOL:=false}, err: strconv.ParseBool: parsing "${EXP_MACHINE_POOL:=false}": invalid syntax

What did you expect to happen:
CAPI controllers up and running

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]
Probably it's related to kubernetes-sigs/cluster-api#3339
but I don't understand what exactly is wrong in my case. I don't see any clusterctl usage during make deploy.

Environment:

  • Cluster-api version: 0.3.7
  • CAPBM version:
  • Minikube version:
  • environment (metal3-dev-env or other): GCP Kubernetes engine
  • Kubernetes version: (use kubectl version): v1.17.8-gke.17

/kind bug

vlanID field misdocumented as vlanId

What steps did you take and what happened:
I applied a Metal3DataTemplate manifest with the following spec.networkData.links.vlans field:

      vlans:
        - id: "eth1.3"
          mtu: 1500
          macAddress:
            fromHostInterface: eth1
          vlanId: 3
          vlanLink: "eth1"

vlanId (lowercase d) is what's documented in the API (https://github.com/metal3-io/cluster-api-provider-metal3/blob/master/docs/api.md#links-specifications, links/vlans/vlanId), but in the code the struct expects the field 'vlanID'.

What did you expect to happen:

I expected my Metal3DataTemplate to have vlanId: 3 set, instead vlanID: 0 was set.

Environment:

  • Cluster-api version: v1alpha4
  • CAPBM version: be9faea
  • Minikube version: N/A
  • environment (metal3-dev-env or other): metal3-dev-env
  • Kubernetes version: (use kubectl version): N/A

/kind bug

The Metal3ClusterToMetal3Machines function returns a machine

What steps did you take and what happened:
The Metal3ClusterToMetal3Machines function returns references to a Machine instead of a Metal3Machine. This is a leftover from v1alpha1.

What did you expect to happen:
It should return a reference to a Metal3Machine, i.e. m.Spec.InfrastructureRef.Name if set.
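A minimal Go sketch of the intended mapping, with illustrative stand-in types (not the real CAPI structs or the controller-runtime handler signature):

```go
package main

import "fmt"

// objectRef and machine are minimal stand-ins; only the fields the
// mapping function needs.
type objectRef struct{ Kind, Name string }

type machine struct {
	Name              string
	InfrastructureRef objectRef
}

// metal3ClusterToMetal3Machines maps each CAPI Machine to the name of its
// Metal3Machine (the InfrastructureRef), skipping Machines whose ref is
// not set yet.
func metal3ClusterToMetal3Machines(machines []machine) []string {
	var out []string
	for _, m := range machines {
		if m.InfrastructureRef.Kind == "Metal3Machine" && m.InfrastructureRef.Name != "" {
			out = append(out, m.InfrastructureRef.Name)
		}
	}
	return out
}

func main() {
	ms := []machine{
		{Name: "machine-a", InfrastructureRef: objectRef{Kind: "Metal3Machine", Name: "m3m-a"}},
		{Name: "machine-b"}, // no infrastructure ref yet: skipped
	}
	fmt.Println(metal3ClusterToMetal3Machines(ms)) // [m3m-a]
}
```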

/kind bug

Upgrade not running to completion when changing OS image

What steps did you take and what happened:

  • Provision a control plane node.
  • Save the Metal3MachineTemplate resource in the metal3 namespace to disk.
  • Edit it by changing its name, OS image URL, and hash to new valid values.
  • Then apply the new file to Kubernetes: kubectl create -f ....
  • Edit the KubeadmControlPlane resource in the metal3 namespace by changing
    the infrastructureTemplate.name to the new name you used.

Then, wait for the Provisioning of a new node and the de-Provisioning of the
existing one.

However,

  • Both VMs will be in provisioned state
  • The kubernetes API server in both VMs are not running.

What did you expect to happen:
That a new node with the new OS image is Provisioned, and
That the existing node is de-provisioned.

Anything else you would like to add:

  • This is a blocker on the ongoing upgrade workflow
  • The same workflow was successful before 16.03.2020
  • The start of provisioning and de-provisioning might take up to 20 minutes each.
  • Relevant logs are added as comments.
  • Repeated entries from the logs are removed for brevity.

I have created a similar issue in CAPI, here
It seems that it is already fixed on the CAPI side and it could be the provider components that are the cause.

Environment:

  • Cluster-api version: v1alpha3
  • CAPBM version: v0.3.0
  • Minikube version: v1.8.2
  • metal3-dev-env
  • Kubernetes version: v1.17.3

/kind bug

Provisioning not starting soon enough when done repeatedly

What steps did you take and what happened:

  • Provision and de-provision a control plane node a few times.

At some point, provisioning does not start at all and the BMH state remains ready.

The waiting time exceeded 30 minutes; normally, it should start within 10 minutes.
It is safe to assume that the provisioning will not start at all.

What did you expect to happen:

  • Each time, the provision and de-provision should start and succeed.
  • The process should be stateless, in that a previous provisioning should not have an impact on the current one as long as there are free BMHs available.

Anything else you would like to add:

  • In order to fix the issue, one needs to run make clean && make.
  • Once in a while, even that may not help, and re-creating the metal3-dev-env machine is the only option.
  • The logs for the relevant controllers are added as comments.

Environment:

  • Cluster-api version: v1alpha3
  • CAPBM version: v0.3.0
  • Minikube version: v1.8.2
  • metal3-dev-env
  • Kubernetes version: v1.17.3

/kind bug

Node.Spec.ProviderID not updated after Move

After a clusterctl move from, say, a bootstrap cluster to a target cluster, the Kubernetes Node.Spec.ProviderID still contains the UID of the old BMH object.

Simply updating the Node.Spec.ProviderID is not currently allowed as validation checks prevent Node.Spec.ProviderID to be changed after it has been set to a non-nil value (see kubernetes/kubernetes#51761)
Forbidden: node updates may not change providerID except from "" to valid, []:

Possible Solutions:
1) Provide an option to have ProviderID be something other than UID of BMH -- for example: BMH 'Namespace/Name' may be unique for most environments.
2) Don’t use BMH.UID which is generated automatically during object creation but instead create a new field (say BMH.ProviderID) that we can populate with a random id. This field would be copied over as part of the move.
3) Open an issue with K/K to allow updates to Node.Spec.ProviderID, i.e. relax the validation checks.
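Option 1 could look like the following minimal sketch. The `metal3://` scheme is from the existing providerID format; deriving the rest from namespace/name (stable across a clusterctl move, unlike the UID, which is regenerated when the object is recreated in the target cluster) is the proposal, not current behavior:

```go
package main

import "fmt"

// providerID derives the ID from the BMH namespace and name, which survive
// a clusterctl move, instead of the UID, which does not.
func providerID(namespace, name string) string {
	return fmt.Sprintf("metal3://%s/%s", namespace, name)
}

func main() {
	fmt.Println(providerID("metal3", "node-0")) // metal3://metal3/node-0
}
```

The caveat noted above applies: namespace/name may not be unique in every environment, e.g. if hosts are deleted and recreated under the same name.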

/kind bug

xref: metal3-io/baremetal-operator#431
@kashifest @maelk

UpdateObject function should operate on a deepcopy of the object

What steps did you take and what happened:
The UpdateObject (and maybe createObject too) functions in metal3machine_manager.go (branch release-0.3) should operate on deep copies of the object. Using the object directly results in a lost status.
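A minimal sketch of why the deep copy matters, with hand-rolled stand-ins (not the real generated DeepCopy or the client Update call): if Update mutates the object it is handed, passing the original wipes the in-memory status the controller still needs.

```go
package main

import "fmt"

type status struct{ Phase string }

type host struct {
	Spec   string
	Status status
}

// deepCopy is a hand-rolled stand-in for the generated DeepCopy() method.
func (h *host) deepCopy() *host {
	c := *h
	return &c
}

// fakeUpdate models an Update call that, like writes through the main
// resource, clears the status on the object it is given.
func fakeUpdate(h *host) { h.Status = status{} }

func main() {
	h := &host{Spec: "v2", Status: status{Phase: "provisioned"}}

	// Updating a deep copy: the caller's in-memory Status survives.
	fakeUpdate(h.deepCopy())
	fmt.Println(h.Status.Phase) // provisioned
}
```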

  • Cluster-api version: 0.3.2
  • CAPBM version: 0.3.0

/kind bug

Add field selector support to hostSelector

PR metal3-io/cluster-api-provider-baremetal#74 added the hostSelector field to the baremetal Machine provider spec. (See doc/api.md)

The initial hostSelector support includes the ability to specify BareMetalHost matching criteria based on matching labels. It would also be interesting to add the option of specifying field selectors, so we can match on more than just labels.

A simple use case would be to match on metadata.name if you wanted to specify a specific BareMetalHost by name. A workaround for this use case would be to add a unique label to each host.

Perhaps there are more interesting use cases around matching on hardware inventory details that are in the BareMetalHost status.
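One possible shape for the extension, as a minimal Go sketch. The selector shape (a single metadata.name field selector alongside matchLabels) and all type names are illustrative, not an API proposal:

```go
package main

import "fmt"

type hostMeta struct{ Name string }

type bareMetalHost struct {
	Meta   hostMeta
	Labels map[string]string
}

// matchesSelector extends label matching with an optional field selector
// on metadata.name; an empty fieldName means "no name constraint".
func matchesSelector(h bareMetalHost, matchLabels map[string]string, fieldName string) bool {
	if fieldName != "" && h.Meta.Name != fieldName {
		return false
	}
	for k, v := range matchLabels {
		if h.Labels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	h := bareMetalHost{Meta: hostMeta{Name: "host-3"}, Labels: map[string]string{"pool": "a"}}
	fmt.Println(matchesSelector(h, map[string]string{"pool": "a"}, "host-3")) // true
	fmt.Println(matchesSelector(h, nil, "host-4"))                            // false
}
```

Matching on hardware inventory in the BMH status would fit the same pattern, with a richer field path than just metadata.name.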

BMH NetworkData and userData cleaned on deprovisioning when they should not

What steps did you take and what happened:
This bug was raised by the airship project. CAPM3 deletes the content of the networkData and metaData fields even when it does not manage them.

What did you expect to happen:
CAPM3 should not always delete those, only when the networkData and metaData fields are set in the status of the metal3machine

Anything else you would like to add:
The problem is on line https://github.com/metal3-io/cluster-api-provider-metal3/blob/master/baremetal/metal3machine_manager.go#L513 and https://github.com/metal3-io/cluster-api-provider-metal3/blob/master/baremetal/metal3machine_manager.go#L518 . There should be a check similar to https://github.com/metal3-io/cluster-api-provider-metal3/blob/master/baremetal/metal3machine_manager.go#L924 and https://github.com/metal3-io/cluster-api-provider-metal3/blob/master/baremetal/metal3machine_manager.go#L930
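A minimal sketch of the check described above, with illustrative stand-in types (not the real BMH/Metal3Machine structs): only clear a field on the BMH if the Metal3Machine status shows CAPM3 set it.

```go
package main

import "fmt"

type secretRef struct{ Name string }

type m3mStatus struct {
	UserData    *secretRef
	NetworkData *secretRef
}

type bmhSpec struct {
	UserData    *secretRef
	NetworkData *secretRef
}

// clearManagedData clears a BMH data field on deprovisioning only when the
// Metal3Machine status records that CAPM3 manages that field.
func clearManagedData(bmh *bmhSpec, st m3mStatus) {
	if st.UserData != nil {
		bmh.UserData = nil
	}
	if st.NetworkData != nil {
		bmh.NetworkData = nil
	}
}

func main() {
	// networkData was set by the user (the airship case), so the
	// Metal3Machine status does not record it.
	bmh := &bmhSpec{
		UserData:    &secretRef{Name: "m3m-user-data"},
		NetworkData: &secretRef{Name: "user-managed-net"},
	}
	st := m3mStatus{UserData: &secretRef{Name: "m3m-user-data"}}

	clearManagedData(bmh, st)
	fmt.Println(bmh.UserData == nil)  // true: CAPM3 owned it
	fmt.Println(bmh.NetworkData.Name) // user-managed-net: preserved
}
```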

Environment:

  • Cluster-api version: NA
  • Cluster-api-provider-metal3 version: v0.4.X
  • Environment (metal3-dev-env or other): NA
  • Kubernetes version: (use kubectl version): NA

/kind bug

Add ability to get more than one IP out of an IP Pool

User Story

As a User with external Storage boxes i want the ability to get multiple IPs out of one IPPool for Multipathing

Detailed Description

We run our servers with dedicated storage boxes (based around iSCSI) and use multipathing to those storage arrays rather than link aggregation.
For this we need one IP per NIC, out of the same subnet/IPPool.
Currently Metal³ doesn't allow that. If I specify

    ipAddressesFromIPPool:
    - key: storage_1
      name: storage-testpool
    - key: storage_2
      name: storage-testpool

in my DataTemplate, only one IPClaim will be generated and both storage_1 and storage_2 will get the same IP.

Anything else you would like to add:

/kind feature

CAPM3 should not set error on transient failures

What steps did you take and what happened:
For example, when failing to update the BMH due to a conflict, CAPM3 reports an error on the Metal3Machine. CAPI picks it up:

apiVersion: v1
items:
- apiVersion: cluster.x-k8s.io/v1alpha3
  kind: Machine
  metadata:
    creationTimestamp: "2020-03-22T08:26:40Z"
    finalizers:
    - machine.cluster.x-k8s.io
    generation: 3
    labels:
      cluster.x-k8s.io/cluster-name: test1
      cluster.x-k8s.io/control-plane: ""
      kubeadm.controlplane.cluster.x-k8s.io/hash: "3288331389"
    name: test1-controlplane-lpnxq
    namespace: metal3
    ownerReferences:
    - apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
      blockOwnerDeletion: true
      controller: true
      kind: KubeadmControlPlane
      name: test1-controlplane
      uid: 834ec5bd-3cbf-4b87-9ccc-f8b5b07140d4
    resourceVersion: "6836"
    selfLink: /apis/cluster.x-k8s.io/v1alpha3/namespaces/metal3/machines/test1-controlplane-lpnxq
    uid: 6be1adb7-c185-4779-be4c-d04b0e396f2e
  spec:
    bootstrap:
      configRef:
        apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
        kind: KubeadmConfig
        name: test1-controlplane-bzpmb
        namespace: metal3
        uid: 5c3bbc07-b8f3-4ccb-98f5-be299edf226e
      dataSecretName: test1-controlplane-bzpmb
    clusterName: test1
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
      kind: Metal3Machine
      name: test1-controlplane-jfnpf
      namespace: metal3
      uid: 1cc69570-8c3d-4fbd-a8d6-c7d25118cb31
    providerID: metal3://d99b9c2b-5aee-45e6-9249-e7b42f7ccc76
    version: v1.17.0
  status:
    addresses:
    - address: 192.168.111.21
      type: InternalIP
    - address: 172.22.0.91
      type: InternalIP
    - address: node-1
      type: Hostname
    - address: node-1
      type: InternalDNS
    bootstrapReady: true
    failureMessage: 'Failure detected from referenced resource infrastructure.cluster.x-k8s.io/v1alpha3,
      Kind=Metal3Machine with name "test1-controlplane-jfnpf": Failed to associate
      the BaremetalHost to the Metal3Machine'
    failureReason: CreateError
    infrastructureReady: true
    lastUpdated: "2020-03-22T08:34:41Z"
    nodeRef:
      name: node-1
      uid: 49aa2fb7-b645-4da3-9600-64e31074bf81
    phase: Failed
- apiVersion: cluster.x-k8s.io/v1alpha3
  kind: Machine
  metadata:
    creationTimestamp: "2020-03-22T08:26:21Z"
    finalizers:
    - machine.cluster.x-k8s.io
    generateName: test1-md-0-78d55dc456-
    generation: 3
    labels:
      cluster.x-k8s.io/cluster-name: test1
      machine-template-hash: "3481187012"
      nodepool: nodepool-0
    name: test1-md-0-78d55dc456-5b86v
    namespace: metal3
    ownerReferences:
    - apiVersion: cluster.x-k8s.io/v1alpha3
      blockOwnerDeletion: true
      controller: true
      kind: MachineSet
      name: test1-md-0-78d55dc456
      uid: 7b2e3dbc-8af2-4f43-8945-6968bc12bd76
    resourceVersion: "9982"
    selfLink: /apis/cluster.x-k8s.io/v1alpha3/namespaces/metal3/machines/test1-md-0-78d55dc456-5b86v
    uid: 3826ad1f-b1ed-4504-b265-2dde934055e9
  spec:
    bootstrap:
      configRef:
        apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
        kind: KubeadmConfig
        name: test1-md-0-5hhsj
        namespace: metal3
        uid: 2d396de3-c3f4-45cc-baff-d915ebcfb20d
      dataSecretName: test1-md-0-5hhsj
    clusterName: test1
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
      kind: Metal3Machine
      name: test1-md-0-hr5jw
      namespace: metal3
      uid: 3477a996-d531-4901-a20e-bd9c93b1a8e6
    providerID: metal3://6fc41e7b-cc2b-4a06-a9d4-883c563d0c47
    version: v1.17.0
  status:
    addresses:
    - address: 172.22.0.87
      type: InternalIP
    - address: 192.168.111.20
      type: InternalIP
    - address: node-0
      type: Hostname
    - address: node-0
      type: InternalDNS
    bootstrapReady: true
    failureMessage: 'Failure detected from referenced resource infrastructure.cluster.x-k8s.io/v1alpha3,
      Kind=Metal3Machine with name "test1-md-0-hr5jw": Failed to associate the BaremetalHost
      to the Metal3Machine'
    failureReason: CreateError
    infrastructureReady: true
    lastUpdated: "2020-03-22T08:45:24Z"
    nodeRef:
      name: node-0
      uid: 8d146172-c593-455e-bf3c-c5c1dcb1bb73
    phase: Failed
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

What did you expect to happen:
Since it is a temporary failure awaiting a requeue, it should not be reported as an error.

Environment:

  • Cluster-api version: v1alpha3
  • CAPBM version: v1alpha3
  • Minikube version: NA
  • environment (metal3-dev-env or other): NA
  • Kubernetes version: (use kubectl version): NA

/kind bug

Metal3machine controller looking for a node with wrong label after pivoting

What steps did you take and what happened:
After pivoting, the BMH UID has changed. However, the providerID of the node and the m3m did not change, so https://github.com/metal3-io/cluster-api-provider-metal3/blob/master/controllers/metal3machine_controller.go#L242 is incorrect, because the providerID exists and is already set to a value different from the BMH UID.

What did you expect to happen:
The controller should use the providerID value from the m3m if it is set, instead of re-deriving it from the BMH UID.
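A minimal sketch of that fix, with illustrative stand-in types (not the real Metal3Machine spec): prefer the providerID already carried by the m3m, and only derive one from the BMH UID for fresh machines.

```go
package main

import "fmt"

// m3m is a minimal stand-in for the Metal3Machine; only the field needed.
type m3m struct{ ProviderID *string }

// nodeProviderID keeps an existing providerID (it survived the pivot) and
// falls back to deriving one from the BMH UID only when unset.
func nodeProviderID(machine m3m, bmhUID string) string {
	if machine.ProviderID != nil && *machine.ProviderID != "" {
		return *machine.ProviderID
	}
	return "metal3://" + bmhUID
}

func main() {
	old := "metal3://old-uid-from-bootstrap-cluster"
	fmt.Println(nodeProviderID(m3m{ProviderID: &old}, "new-uid")) // keeps the pivoted value
	fmt.Println(nodeProviderID(m3m{}, "new-uid"))                 // metal3://new-uid
}
```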

Environment:

  • Cluster-api version: NA
  • CAPM3 version: v1a3 and v1a4
  • Minikube version: NA
  • environment (metal3-dev-env or other): NA
  • Kubernetes version: (use kubectl version): NA

/kind bug

CAPM3 not waiting for BMH to be in ready or available state

What steps did you take and what happened:
When choosing a BMH, CAPM3 does not consider the state the BMH is in. It will associate a Metal3Machine with a BMH as long as that BMH does not have a consumer ref and is not in error. If the BMH is in the inspecting state, for example, and the inspection fails, it might break the deployment of the cluster in some cases.

What did you expect to happen:
CAPM3 should verify that the BMH is in ready or available state.
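A minimal sketch of the proposed check, with illustrative stand-in types and state strings (not the real BMH provisioning-state constants):

```go
package main

import "fmt"

type bmh struct {
	Name        string
	State       string // provisioning state reported by BMO
	ConsumerRef *string
}

// isAvailable picks a host only when it has no consumer AND is in the
// ready/available state, instead of only checking the consumer ref and
// error state.
func isAvailable(h bmh) bool {
	if h.ConsumerRef != nil {
		return false
	}
	return h.State == "ready" || h.State == "available"
}

func main() {
	fmt.Println(isAvailable(bmh{Name: "h1", State: "ready"}))      // true
	fmt.Println(isAvailable(bmh{Name: "h2", State: "inspecting"})) // false
}
```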

Environment:

  • Cluster-api version: v0.3.3
  • CAPBM version: v0.3.0

/kind bug

Wrong replicas count in kubeadmcontrolplane status

What steps did you take and what happened:

  • Run metal3-dev-env with 4 nodes
  • provision a control plane node
  • Scale up by changing the replicas to 5 or 7.

The result: status.replicas is 5.

What did you expect to happen:
status.replicas should be 4

Anything else you would like to add:

  1. Whether you set spec.replicas to 5 or 7 makes no difference:
    status.replicas will be 5.
  2. The logs of the controllers seem to show correct behavior.
  3. This is not a CAPI problem, but a provider problem.
  4. Relevant logs are added as comments below.

Environment:

  • Cluster-api version: v1alpha3
  • CAPBM version: v1alpha3
  • Minikube version: v1.8.2
  • environment: metal3-dev-env
  • Kubernetes version: 1.17.0

/kind bug

Speed up unit tests

User Story

Most of the time of the unit test jobs is spent on installing the tools and generating the code that is automatically generated. We should bake in all tools and not regenerate the code as it should be included in the repository.

/kind feature

capm3-controller-manager invalid memory address error

What steps did you take and what happened:
Provision and de-provision a control plane machine multiple times.
It goes into CrashLoopBackOff with an invalid memory address error.

The full log is added as a comment.
What did you expect to happen:
A success or a graceful failure.

Anything else you would like to add:
Regardless of the cause, please do focus on fixing the invalid memory address error.

This makes the environment unusable. One needs to run make clean and make again.

Environment:

  • Cluster-api version: v1alpha3
  • CAPBM version: v1alpha3
  • Minikube version: v1.8.2
  • environment: metal3-dev-env
  • Kubernetes version: 1.17.0

/kind bug

CAPM3 sets failureReason/failureMessage for recoverable errors

The CAPI docs on failureReason/failureMessage state that the fields should only be used for terminal errors, not transient errors. CAPI also looks at these fields in the provider's Machine object. Once set, it's only possible to clear these fields by deleting the Machine.
Currently, CAPM3 sets failureReason/failureMessage for errors that are recoverable. For example, we noticed the following error show up:

	E0719 14:08:11.589531       1 controller.go:235] controller-runtime/controller "msg"="Reconciler error" "error"="failed to set the target node providerID: unable to update the target node: Operation cannot be fulfilled on nodes \"node01\": the object has been modified; please apply your changes to the latest version and try again" "controller"="metal3machine" "name"="cluster-controlplane-rnx9h" "namespace"="target-infra"

This error occurs here and is recoverable; in fact, CAPM3 eventually clears it on the Metal3Machine object. However, the CAPI Machine remains in the errored state.

What steps did you take and what happened:
This error occurs occasionally, since it's basically a race between CAPM3 updating the Node object and something else (e.g. the kubelet) updating that same object.

What did you expect to happen:

Recoverable error should not cause the Machine to enter a terminal failure state.

Anything else you would like to add:

The discussion here, while a bit outdated, also suggests that providers should not set failureReason/failureMessage for recoverable errors.

Environment:

  • Cluster-api version: 0.3.7
  • Cluster-api-provider-metal3 version: v0.4.0
  • Environment (metal3-dev-env or other): airship dev and test environments

/kind bug

CAPM3 references old BMH spec missing network-data

What steps did you take and what happened:
The BMO referenced in go.mod, github.com/metal3-io/baremetal-operator v0.0.0-20200225121200-8161ce57c5af, references an old BMH spec which is missing network-data (merged in metal3-io/baremetal-operator#348).

When you then go to provision a cluster, CAPM3 edits the BMH object and does not keep the network-data originally defined in it, as it points at an older spec that does not contain the field.

What did you expect to happen:
The network-data defined in the BMH CR should not get wiped during cluster creation; it should persist through it.

Anything else you would like to add:
Keep up the great work on this awesome project :)

Environment:

  • Cluster-api version:
  • CAPBM version: current
  • Minikube version: N/A
  • environment (metal3-dev-env or other):
  • Kubernetes version: (use kubectl version): v1.17.3

/kind bug

Example for providing user-data from Machine CRD.

What steps did you take and what happened:
[A clear and concise description on how to REPRODUCE the bug.]

  1. Created a BareMetalHost resource. It finishes inspection and gets to the Ready state.
---
apiVersion: v1
kind: Secret
metadata:
  name: bm-secret103
type: Opaque
data:
  username: YWRtaW4=
  password: MWYyZDFlMmU2N2Rm
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: bm-host103
  labels:
    key1: host1
spec:
  online: true
  bmc:
    address: ipmi://10.4.29.12
    credentialsName: bm-secret103
  bootMACAddress: 90:38:09:68:91:f1
---
  2. Created a machine object with the following YAML files.
---
apiVersion: cluster.x-k8s.io/v1alpha2
kind: Machine
metadata:
  name: machine
  labels:
    cluster.x-k8s.io/control-plane: "true"
    cluster.x-k8s.io/cluster-name: "cluster"
spec:
  version: "1.16"
  bootstrap:
    configRef:
      apiVersion: bootstrap.cluster.x-k8s.io/v1alpha2
      kind: KubeadmConfig
      name: kubeadmconfig
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha2
    kind: BareMetalMachine
    name: baremetalmachine
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha2
kind: BareMetalMachine
metadata:
  name: baremetalmachine
  namespace: metal3
spec:
  image:
    url: http://192.168.2.1:6180/images/bionic-server-cloudimg-amd64.img
    checksum: http://192.168.2.1:6180/images/bionic-server-cloudimg-amd64.img.md5sum
  userData:
    name: bm-host103-user-data
    namespace: metal3
  hostSelector:
    matchLabels:
      key1: host1

What did you expect to happen:

  • Expected to see my user-data with ssh details as well in my object. But in the ConfigDrive, I can only see the kubeadm config secret with the Kubernetes keys and certificates.
  • I couldn't find an example where I can provide login information user-data from the machine CRD.

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]
I've tried providing user-data to the baremetalhost object and provisioning it through the machine CRD, but the user-data doesn't get injected in ConfigDrive.

Environment:

  • Cluster-api version: Most recent repo version, cloned and deployed.

  • CAPBM version: Most recent repo version, cloned and deployed.

  • Minikube version: Running on baremetal

  • environment (metal3-dev-env or other): Running on baremetal

  • Kubernetes version: (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.0", GitCommit:"70132b0f130acc0bed193d9ba59dd186f0e634cf", GitTreeState:"clean", BuildDate:"2019-12-07T21:20:10Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.2", GitCommit:"59603c6e503c87169aea6106f57b9f242f64df89", GitTreeState:"clean", BuildDate:"2020-01-18T23:22:30Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"linux/amd64"}

/kind bug

CAPM3 Not Setting ProviderID on Node in new Cluster

What steps did you take and what happened:
For some reason, editing the remote node spec and setting the ProviderID does not work. Nothing is set for the providerID. I went through the provisioning process as normal; the first control plane gets spun up as normal, but the providerID does not get set on the remote node, preventing CAPI from setting the control plane initialized status to true and the machine state to running, as it cannot find the providerID on the new remote node.

CAPI complaining about no providerID set on the remote node:

E0309 18:24:28.138031       1 machine_controller_noderef.go:98] controllers/Machine "msg"="Failed to parse ProviderID" "error"="providerID is empty" "providerID"={} "node"="baremetal0008d1mdw1.sendgrid.net"
E0309 18:24:28.138102       1 machine_controller.go:232] controllers/Machine "msg"="Reconciliation for Machine asked to requeue" "error"="cannot assign NodeRef to Machine \"lukasz-controlplane-wp76w\" in namespace \"metal3\", no matching Node: requeue in 10s" "cluster"="lukasz" "machine"="lukasz-controlplane-wp76w" "namespace"="metal3"
E0309 18:24:37.575951       1 machine_controller_noderef.go:98] controllers/Machine "msg"="Failed to parse ProviderID" "error"="providerID is empty" "providerID"={} "node"="baremetal0008d1mdw1.sendgrid.net"
E0309 18:24:37.576027       1 machine_controller.go:232] controllers/Machine "msg"="Reconciliation for Machine asked to requeue" "error"="cannot assign NodeRef to Machine \"lukasz-controlplane-wp76w\" in namespace \"metal3\", no matching Node: requeue in 10s" "cluster"="lukasz" "machine"="lukasz-controlplane-wp76w" "namespace"="metal3"
E0309 18:24:44.527205       1 machine_controller_noderef.go:98] controllers/Machine "msg"="Failed to parse ProviderID" "error"="providerID is empty" "providerID"={} "node"="baremetal0008d1mdw1.sendgrid.net"
E0309 18:24:44.527275       1 machine_controller.go:232] controllers/Machine "msg"="Reconciliation for Machine asked to requeue" "error"="cannot assign NodeRef to Machine \"lukasz-controlplane-wp76w\" in namespace \"metal3\", no matching Node: requeue in 10s" "cluster"="lukasz" "machine"="lukasz-controlplane-wp76w" "namespace"="metal3"
E0309 18:24:47.594796       1 machine_controller_noderef.go:98] controllers/Machine "msg"="Failed to parse ProviderID" "error"="providerID is empty" "providerID"={} "node"="baremetal0008d1mdw1.sendgrid.net"
E0309 18:24:47.594866       1 machine_controller.go:232] controllers/Machine "msg"="Reconciliation for Machine asked to requeue" "error"="cannot assign NodeRef to Machine \"lukasz-controlplane-wp76w\" in namespace \"metal3\", no matching Node: requeue in 10s" "cluster"="lukasz" "machine"="lukasz-controlplane-wp76w" "namespace"="metal3"

ProviderID does not get set on the remote node:

[root@baremetal0008d1mdw1 ~]# kubectl  get nodes -ojson | jq .items[].spec
{
  "podCIDR": "172.16.0.0/24",
  "podCIDRs": [
    "172.16.0.0/24"
  ],
  "taints": [
    {
      "effect": "NoSchedule",
      "key": "node-role.kubernetes.io/master"
    }
  ]
}

View of deployment in time when CAPI cannot find ProviderID:

[root@baremetal0020d1mdw1 deploy-cluster]#  kubectl  get cluster,machine,metal3machine,bmh -nmetal3
NAME                              PHASE
cluster.cluster.x-k8s.io/lukasz   Provisioned

NAME                                                 PROVIDERID                                      PHASE
machine.cluster.x-k8s.io/lukasz-controlplane-wp76w   metal3://3129fc05-0ce8-4850-9ae7-fc4c08a8dd95   Provisioning

NAME                                                                      PROVIDERID                                      READY   CLUSTER   PHASE
metal3machine.infrastructure.cluster.x-k8s.io/lukasz-controlplane-hqzvs   metal3://3129fc05-0ce8-4850-9ae7-fc4c08a8dd95   true    lukasz

NAME                                                       STATUS   PROVISIONING STATUS   CONSUMER                    BMC                   HARDWARE PROFILE   ONLINE   ERROR
baremetalhost.metal3.io/baremetal0004d1mdw1.sendgrid.net   OK       ready                                             ipmi://10.16.6.14/    unknown            true
baremetalhost.metal3.io/baremetal0005d1mdw1.sendgrid.net   OK       ready                                             ipmi://10.16.5.227/   unknown            true
baremetalhost.metal3.io/baremetal0006d1mdw1.sendgrid.net   OK       ready                                             ipmi://10.16.5.241/   unknown            true
baremetalhost.metal3.io/baremetal0007d1mdw1.sendgrid.net   OK       ready                                             ipmi://10.16.5.239/   unknown            false
baremetalhost.metal3.io/baremetal0008d1mdw1.sendgrid.net   OK       provisioned           lukasz-controlplane-hqzvs   ipmi://10.16.5.235/   unknown            true

What did you expect to happen:
ProviderID on the remote node to be set and for CAPI to reconcile it correctly, thus moving on to the other nodes in the provisioning process.

Anything else you would like to add:
My deployment yamls:

cluster

apiVersion: cluster.x-k8s.io/v1alpha3
kind: Cluster
metadata:
  name: lukasz
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - 172.16.0.0/23
    serviceDomain: cluster.local
    services:
      cidrBlocks:
      - 192.168.0.0/16
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
    kind: KubeadmControlPlane
    name: lukasz-controlplane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: Metal3Cluster
    name: lukasz
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: Metal3Cluster
metadata:
  name: lukasz
spec:
  controlPlaneEndpoint:
    host: 172.16.18.246
    port: 6443

control-plane

apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: KubeadmControlPlane
metadata:
  labels:
    cluster.x-k8s.io/cluster-name: lukasz
    cluster.x-k8s.io/control-plane: "true"
  name: lukasz-controlplane
spec:
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: Metal3MachineTemplate
    name: lukasz-controlplane
  kubeadmConfigSpec:
    initConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          node-labels: metal3.io/uuid={{ ds.meta_data.uuid }}
        name: '{{ ds.meta_data.local_hostname }}'
    postKubeadmCommands:
    preKubeadmCommands:
    - commands
  replicas: 3
  version: v1.17.0
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: Metal3MachineTemplate
metadata:
  name: lukasz-controlplane
spec:
  template:
    spec:
      hostSelector:
        matchLabels:
          owner: "yes"
      image:
        checksum: generic-image.com-not-actual-image-in-use.md5sum
        url: generic-image.com-not-actual-image-in-use

Environment:

  • Cluster-api version: v1alpha3
  • CAPBM version: v1alpha3
  • Minikube version:
  • environment (metal3-dev-env or other): other
  • Kubernetes version: (use kubectl version): v1.17.3

/kind bug

controllers/Metal3LabelSync/metal3-label-sync-controller back off not correctly set

What steps did you take and what happened:
I'm currently seeing a lot of these log messages in my CAPM3 logs:

I0701 21:30:57.853102       1 metal3labelsync_controller.go:141] controllers/Metal3LabelSync/metal3-label-sync-controller "msg"="Could not find Node Ref on Machine object, will retry" "metal3-label-sync"={"Namespace":"metal3-refsa2-bn","Name":"ilocz20300gcv.lom.refsa2.bn.schiff.telekom.de"} 
I0701 21:30:58.855701       1 metal3labelsync_controller.go:141] controllers/Metal3LabelSync/metal3-label-sync-controller "msg"="Could not find Node Ref on Machine object, will retry" "metal3-label-sync"={"Namespace":"metal3-refsa2-bn","Name":"ilocz20300gcv.lom.refsa2.bn.schiff.telekom.de"} 
I0701 21:30:59.857972       1 metal3labelsync_controller.go:141] controllers/Metal3LabelSync/metal3-label-sync-controller "msg"="Could not find Node Ref on Machine object, will retry" "metal3-label-sync"={"Namespace":"metal3-refsa2-bn","Name":"ilocz20300gcv.lom.refsa2.bn.schiff.telekom.de"} 

Since these messages basically recur every half second or so, and only pause for 10 seconds after about 10-15 tries, I assume the reconciliation backoff isn't working properly, as it should probably not run every half second.

What did you expect to happen:

Not getting spammed by log messages. :D

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Environment:

  • Cluster-api version: 0.3.x
  • Cluster-api-provider-metal3 version: 0.4.2
  • Environment (metal3-dev-env or other): other
  • Kubernetes version: (use kubectl version): n/a

/kind bug

Error in baremetal node-0

Getting the below error after trying to build the Metal3 setup using https://metal3.io/try-it.html:

ubuntu@metal3:~/metal3-dev-env$ kubectl get baremetalhost -n metal3
NAME STATUS PROVISIONING STATUS CONSUMER BMC HARDWARE PROFILE ONLINE ERROR
node-0 error ready test1-controlplane-0 ipmi://192.168.111.1:6230 unknown true host validation error: Validation of image href http://172.22.0.1/images/CentOS-7-x86_64-GenericCloud-1907.qcow2 failed, reason: Got HTTP code 404 instead of 200 in response to HEAD request.; Validation of image href http://172.22.0.1/images/CentOS-7-x86_64-GenericCloud-1907.qcow2 failed, reason: Got HTTP code 404 instead of 200 in response to HEAD request.
node-1 OK ready ipmi://192.168.111.1:6231 unknown true

Fix the namespaces assumptions and handling

What steps did you take and what happened:
Multiple assumptions were made during development about which namespaces objects belong to, and the code does not apply those assumptions consistently. For example, in the case of the userData secret, if the BMH is in a different namespace, another secret is created. In other places, however, it is assumed that the objects are in the same namespace, and that namespace is referred to via either one object's namespace or the other's.

What did you expect to happen:
We should define a clear, uniform way to handle object namespacing in the code, and verify that the RBAC rules are set in a way that allows the controllers to access the objects so everything keeps working fine.

Anything else you would like to add:
We should consider how CAPI does it, i.e. all objects for a cluster in the same namespace, and the impact it has in Metal3 case. For example, if some BMH would be shared between clusters, those clusters would need to be in the same namespace. That is if BMO is included in the provider-components.

Environment:

  • Cluster-api version: v1alpha3
  • CAPM3 version: v1alpha3 and v1alpha4
  • Minikube version: NA
  • environment (metal3-dev-env or other): NA
  • Kubernetes version: (use kubectl version): NA

/kind bug

IP addresses are incorrect when set in the BMM status

What steps did you take and what happened:
When deploying a machine with Ubuntu as the base OS, the IP addresses change after deployment compared to the addresses gathered by IPA. However, the addresses listed in the BMM status are the addresses coming from the BMH, from before they changed. Hence the IP addresses are incorrect.

What did you expect to happen:
The controller should update the IP addresses with the correct ones.

Anything else you would like to add:

Environment:

  • Cluster-api version: 0.3.0
  • CAPBM version: 0.3.0
  • Minikube version: 0.7
  • environment (metal3-dev-env or other): metal3-dev-env
  • Kubernetes version: (use kubectl version): 1.17.3

/kind bug

De-provisioning and immediately Provisioning a CP node not starting

What steps did you take and what happened:

  1. Provision a controlplane node, wait until it is provisioned.
  2. Then run the following command
    ./scripts/v1alphaX/deprovision_cluster.sh && ./scripts/v1alphaX/provision_cluster.sh && ./scripts/v1alphaX/provision_controlplane.sh

The result is that a new kcp resource is saved to etcd, but no CP node is provisioned.

What did you expect to happen:
The old CP node is de-provisioned and a new CP node is provisioned.

Anything else you would like to add:
Earlier, #42 was reported. However, it was difficult to reproduce the issue predictably.

Please see comment to see the KCP resources (both old and new )
Environment:

  • Cluster-api version: v0.3.7
  • CAPBM version:v0.3.2
  • Minikube version: v1.12.1
  • environment (metal3-dev-env or other):
  • Kubernetes version: (use kubectl version): v1.18.3

/kind bug

failed to associate the Metal3Machine to a BaremetalHost

What steps did you take and what happened:

clusterctl config cluster lab \
  --target-namespace metal3 \
  --kubernetes-version v1.17.3 \
  --control-plane-machine-count 1 \
  --worker-machine-count 2 \
  --infrastructure metal3:v0.3.0 | kubectl apply -f -

Failure Message:  Failure detected from referenced resource infrastructure.cluster.x-k8s.io/v1alpha3, Kind=Metal3Machine with name "lab-controlplane-rxz7b": Failed to associate the BaremetalHost to the Metal3Machine

E0313 14:24:24.180499       1 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="failed to associate the Metal3Machine to a BaremetalHost: Operation cannot be fulfilled on baremetalhosts.metal3.io \"node01\": the object has been modified; please apply your changes to the latest version and try again"  "controller"="metal3machine" "request"={"Namespace":"metal3","Name":"lab-controlplane-rxz7b"}

What did you expect to happen:

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Environment:

  • Cluster-api version: 0.3.0
  • CAPBM version: 0.3.0
  • Minikube version: N/A
  • environment (metal3-dev-env or other): k3s + clusterctl
  • Kubernetes version: (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:14:22Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3+k3s1", GitCommit:"5b17a175ce333dfb98cb8391afeb1f34219d9275", GitTreeState:"clean", BuildDate:"2020-02-27T07:28:53Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}

/kind bug

The cluster to M3Machine mapping should be done using ClusterToInfrastructureMapFunc

What steps did you take and what happened:
We have implemented our own function mapping clusters to M3Machines. CAPI has a more advanced helper that we should use instead: ClusterToInfrastructureMapFunc. We need to replace the function called here: https://github.com/metal3-io/cluster-api-provider-metal3/blob/master/controllers/metal3machine_controller.go#L255 . Similarly, https://github.com/metal3-io/cluster-api-provider-metal3/blob/master/controllers/metal3machine_controller.go#L324 could be replaced by a call to the same function.

/kind bug

Metal3Data and IPAddressClaim objects have unpredictable names

What steps did you take and what happened:
When creating a machine deployment, the Machines, Metal3Machines and Metal3DataClaims use names with a random suffix. When the Metal3DataClaims are fulfilled, those Metal3Data objects aren't using the same randomized names, but incremental ones instead. The IPAddressClaims that get created while creating Metal3Data objects also use those incremental names.

What did you expect to happen:
The Metal3Data and IPAddressClaim objects should use the same names as the Metal3Machine they belong to. This makes it much easier to determine which machine they belong to, and is also more in line with the behavior of Kubernetes.

Anything else you would like to add:
We noticed this issue while developing and testing a custom controller based on ip-address-manager to interface with our internal IPAM.

When deleting a Machine (e.g. to redeploy it), everything that belongs to the Machine gets deleted, including the Metal3Data and IPAddressClaim objects. CAPI then immediately creates a new Machine to keep the desired count of Machines. This leads to a new Metal3Data and IPAddressClaim to get created. But, since they use an incremental name, they are almost identical to the previous ones.

Environment:

  • Cluster-api version: 0.3.20
  • Cluster-api-provider-metal3 version: 0.4.2

/kind bug
cc @MaxRink

Prepare to CAPI v1alpha4

Each CAPI infra provider is required to make some changes before taking CAPI v1alpha4 in.
The CAPI community is working on a document listing the changes required on the provider side when moving from v1alpha3 to v1alpha4.

We can keep track of the work that needs to be done in CAPM3 in this issue.
/kind feature
/priority important-soon

Wrong order of actions when associating the BMH to the host

What steps did you take and what happened:
Currently, some actions are taken to update the BMH's dependent objects (the BMC secret, for example) to add a cluster label. Those steps update the objects before we update the BMH itself, meaning that if the BMH update fails and the association reruns, we might end up with another BMH, while the secret of the first BMH still carries the cluster label.

What did you expect to happen:
The first thing done should be to associate the BMH. Then update its dependencies. Once all this is done, update the Metal3Machine

Anything else you would like to add:
This is for both v1a3 and v1a4

Environment:

  • Cluster-api version: v1a3
  • CAPBM version: v1a3 and v1a4
  • Minikube version: NA
  • environment (metal3-dev-env or other): NA
  • Kubernetes version: (use kubectl version): NA

/kind bug

Provide a solution for control-plane load-balancing

This issue is related to metal3-io/cluster-api-provider-baremetal#101. When provisioning a cluster, we need to set up a load-balancer for the control-plane, since we may not be able to rely on a cloud-provider load-balancer. With this issue, we will go through the different alternatives to set up an HA cluster on baremetal nodes without external load-balancer.

We will start with a design document opened to discussion.

admission webhook "default.metal3machinetemplate.infrastructure.cluster.x-k8s.io" does not support dry run

What steps did you take and what happened:

# cat machine.yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha4
kind: Metal3MachineTemplate
metadata:
  name: workers
  namespace: default
spec:
  template:
    spec:
      image:
        checksum: XXXXXXX
        checksumType: sha256
        format: null
        url: http://image.url/does.not.matter

# kubectl -f machine.yaml apply --dry-run=server
Error from server (BadRequest): error when creating "machine.yaml": admission webhook "default.metal3machinetemplate.infrastructure.cluster.x-k8s.io" does not support dry run
# kubectl -f machine.yaml diff
Error from server (BadRequest): admission webhook "default.metal3machinetemplate.infrastructure.cluster.x-k8s.io" does not support dry run

What did you expect to happen:

The manifest checked on the server side, kubectl shows the diff between the object definition on the cluster and in the file.

Anything else you would like to add:

According to https://kubernetes.io/blog/2019/01/14/apiserver-dry-run-and-kubectl-diff/#how-to-enable-it, to support server-side dry-run a mutating webhook must not have any side effects when it processes dry-run requests, and it must clearly declare this in the manifest via the sideEffects field, set to None or NoneOnDryRun. The manifest at https://github.com/metal3-io/cluster-api-provider-metal3/blob/master/config/webhook/manifests.yaml currently does not set any value. Since it uses admissionregistration.k8s.io/v1beta1, not setting any value of sideEffects means "Unknown", and "Requests with the dryRun attribute will be auto-rejected if they match a webhook with sideEffects == Unknown or Some." - https://kubernetes.io/docs/reference/kubernetes-api/extend-resources/mutating-webhook-configuration-v1/
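Assuming the fix is simply to declare the side-effect class explicitly, each webhook entry in config/webhook/manifests.yaml would gain a sideEffects field along these lines (sketch; clientConfig, rules, etc. unchanged):

```yaml
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
  name: mutating-webhook-configuration
webhooks:
- name: default.metal3machinetemplate.infrastructure.cluster.x-k8s.io
  sideEffects: None   # or NoneOnDryRun; lets the API server send dry-run requests
  # clientConfig, failurePolicy, rules, ... unchanged
```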

Environment:

# clusterctl upgrade plan
Checking cert-manager version...
Cert-Manager will be upgraded from "v0.11.0" to "v0.16.1"

Checking new release availability...

Management group: capi-system/cluster-api, latest release available for the v1alpha3 API Version of Cluster API (contract):

NAME NAMESPACE TYPE CURRENT VERSION NEXT VERSION
bootstrap-kubeadm capi-kubeadm-bootstrap-system BootstrapProvider v0.3.14 Already up to date
control-plane-kubeadm capi-kubeadm-control-plane-system ControlPlaneProvider v0.3.14 Already up to date
cluster-api capi-system CoreProvider v0.3.14 Already up to date
infrastructure-metal3 capm3-system InfrastructureProvider v0.4.1 Already up to date

You are already up to date!

# minikube version
minikube version: v1.17.1
commit: 043bdca07e54ab6e4fc0457e3064048f34133d7e
# kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-14T05:14:17Z", GoVersion:"go1.15.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-21T01:11:42Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
# kubectl get nodes
NAME STATUS ROLES AGE VERSION
clusterapi-control-plane Ready control-plane,master 26h v1.20.2

/kind bug

Remove unsupported v1alpha2

Since we no longer support the v1alpha2 API, we should get rid of api/v1alpha2 in CAPM3.

This could be done when migrating to Cluster API v1alpha4, or separately.

Add e2e tests consuming CAPI's e2e testing framework

User Story
Currently, we do not have e2e tests in the repo, and we would like to consume CAPI's e2e testing framework to make this provider similar to the others and to keep the tests easy to maintain.

As an operator we would like to:

  • Provide consistent e2e expectations across infrastructure providers
  • Have a testing framework to test all upcoming new features into the project
  • Utilize common end-to-end patterns that can be reused across Cluster API providers

Anything else you would like to add:
Relevant links to get started:

/kind feature

Remove v1alpha3 types

We no longer support the v1alpha3 API, so we should get rid of api/v1alpha3 in CAPM3.

/kind feature

Multiple updates operation on the BMH

What steps did you take and what happened:
Currently, CAPM3 can execute multiple update operations on the BMH. This is problematic because each update triggers a reconcile of the BMH while we are still processing it, regularly ending in a conflict that triggers a requeue.

What did you expect to happen:
We should postpone the update operations as much as possible, performing at most one update per reconciliation of the Metal3Machine.

/kind bug

CAPM3 should not select a BMH that is paused

What steps did you take and what happened:
Currently, when selecting a BMH, CAPM3 verifies that the BMH is available and in the ready/available state. This is problematic because the BMH could be paused indefinitely, causing the machine deployment to fail on a timeout from the CAPI side.

What did you expect to happen:
CAPM3 should not consider BMH that have the pause annotation.
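For reference, a paused BaremetalHost is one carrying the baremetal-operator pause annotation, along these lines (sketch; annotation name as used by BMO):

```yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: node-1
  annotations:
    baremetalhost.metal3.io/paused: ""   # CAPM3 should skip this host when selecting
spec:
  online: true
```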

  • Cluster-api version: v0.3.3
  • CAPBM version: v0.3.0

/kind bug

Wrong contract labels for v1alpha4

What steps did you take and what happened:
The cluster labels applied in config/crd/kustomization.yaml are

commonLabels:
  cluster.x-k8s.io/v1alpha2: v1alpha2
  cluster.x-k8s.io/v1alpha3: v1alpha3
  cluster.x-k8s.io/v1alpha4: v1alpha3

What did you expect to happen:
They should be

commonLabels:
  cluster.x-k8s.io/v1alpha2: v1alpha2
  cluster.x-k8s.io/v1alpha3: v1alpha3,v1alpha4

with the correct separator when this PR is merged : kubernetes-sigs/cluster-api#2775

Environment:

  • Cluster-api version: v0.3.2
  • CAPBM version: master

/kind bug

Looking for the right way to provide Static IPs for masters and workers.

I was not sure in which category this fit and hence the blank template.

Use case: I would like to have 1 master and 2 workers in my cluster. I want to create bond interfaces in the baremetalhosts and set static IPs for these workers.

Your documentation here: https://github.com/metal3-io/cluster-api-provider-metal3/blob/master/docs/getting-started.md is quite useful!
However, in my case I would probably need to provide separate env variables for separate workers, right? (The KubeadmConfigTemplate is going to look different for the two hosts.) Is it possible to get the cluster configuration for such a use case through clusterctl? Or would I need to manually edit the output from clusterctl to include two KubeadmConfigTemplates and reference them in two separate Machine CRs?
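For the bond and static IP part: recent CAPM3 releases (v1alpha4 and later) express this through a Metal3DataTemplate referenced from the Metal3MachineTemplate, with addresses drawn from an IPPool. A heavily simplified sketch, assuming the v1alpha4 networkData schema (names like workers-data, workers-pool, and the interface names are placeholders; field names may differ between releases, so check the API types for your version):

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha4
kind: Metal3DataTemplate
metadata:
  name: workers-data
spec:
  clusterName: mycluster
  networkData:
    links:
      bonds:
      - id: bond0
        bondMode: 802.3ad
        bondLinks:
        - eth0
        - eth1
        macAddress:
          fromHostInterface: eth0
    networks:
      ipv4:
      - id: bond0-v4
        link: bond0
        ipAddressFromIPPool: workers-pool   # IPPool object holding the static addresses
```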

Improve logging of Data creation

User Story

As a user I would like to have insight where my Metal3Data creation is and where it might be stuck.

Detailed Description

Right now the creation of Metal3Data objects out of templates doesn't indicate the exact point it has reached or which resources it is still waiting for.
This makes it really hard to find issues in your data templates.
Increased logging with info like "waiting for ipAddressesFromIPPool ip_1 from testpool" would help a lot in that regard.
Anything else you would like to add:

/kind feature

reconcile.Delete() error return might result in objects not being deleted

Currently reconcile.Delete() returns an error in cases where, for example, the bareMetalMachine cannot be updated or the associated userData secret cannot be deleted properly. This might leave the objects handled after the error return undeleted. Ideally the errors should be accumulated and returned at the end, allowing other objects to be deleted even if there are errors in the early stages of the deletion workflow.

Extra control plane is provisioned during downgrading kubernetes version from v1.19.0 to v1.18.0

What steps did you take and what happened:
The test was done in metal3-dev-env with 4 bare metal hosts, where we have 3 control planes and one free BMH. First we upgraded the Kubernetes version from 1.18.0 to 1.19.0 by editing the KCP. The upgrade works properly. After that we downgraded the Kubernetes version from 1.19.0 to 1.18.0 by editing the KCP again. All 3 control planes were downgraded to 1.18.0, but the fourth BMH was provisioned as a control plane and the corresponding machine status is stuck in provisioning. We can ssh to all BMHs. The interesting thing is that when we tested the same scenario with only one control plane, we did not face this issue at all; upgrade and downgrade of the Kubernetes version work properly. We have also tested an upgrade from 1.18.0 to 1.18.2 and a downgrade from 1.18.2 to 1.18.0, and both work properly.

~$ kubectl get bmh -n metal3
NAME     STATUS   PROVISIONING STATUS   CONSUMER                   BMC                                                                                         HARDWARE PROFILE   ONLINE   ERROR
node-0   OK       provisioned           test1-controlplane-fn7md   ipmi://192.168.111.1:6230                                                                   unknown            true     
node-1   OK       provisioned           test1-controlplane-wbp2f   redfish+http://192.168.111.1:8000/redfish/v1/Systems/92dd0619-f87d-408d-988d-d9ab590e1e33   unknown            true     
node-2   OK       provisioned           test1-controlplane-7nzjp   ipmi://192.168.111.1:6232                                                                   unknown            true     
node-3   OK       provisioned           test1-controlplane-zckg2   redfish+http://192.168.111.1:8000/redfish/v1/Systems/a13146e6-052e-4d81-a71e-b19ba6e43200   unknown            true    

~$ kubectl get machine -A
NAMESPACE   NAME          PROVIDERID                                      PHASE
metal3      test1-5pvxk   metal3://69693a2e-16d7-4ba9-8c1e-00ed0d9606dd   Running
metal3      test1-5zv7g   metal3://0fc71fdc-2807-42be-863e-af8f9931f881   Running
metal3      test1-9m497                                                   Provisioning
metal3      test1-nl9k6   metal3://a9ecfaca-023b-49fa-bf1f-f19f083b21d0   Running

$ kubectl get m3m -A -o wide
NAMESPACE   NAME                       PROVIDERID                                      READY   CLUSTER   PHASE
metal3      test1-controlplane-7nzjp   metal3://0fc71fdc-2807-42be-863e-af8f9931f881   true    test1     
metal3      test1-controlplane-fn7md                                                           test1     
metal3      test1-controlplane-wbp2f   metal3://69693a2e-16d7-4ba9-8c1e-00ed0d9606dd   true    test1     
metal3      test1-controlplane-zckg2   metal3://a9ecfaca-023b-49fa-bf1f-f19f083b21d0   true    test1  

What did you expect to happen:
One BMH should be free, the Kubernetes version downgrade from v1.19.0 to v1.18.0 should complete, and 3 control planes should be in the provisioned state.

Anything else you would like to add:
CAPM3 manager log (snippet):

I0901 11:01:54.824447       1 metal3machine_manager.go:647] controllers/Metal3Machine/Metal3Machine-controller "msg"="Updating machine" "cluster"="test1" "machine"="test1-9m497" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-fn7md"} 
I0901 11:01:54.825579       1 metal3machine_manager.go:706] controllers/Metal3Machine/Metal3Machine-controller "msg"="Finished updating machine" "cluster"="test1" "machine"="test1-9m497" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-fn7md"} 
I0901 11:01:54.849891       1 metal3machine_manager.go:1133] controllers/Metal3Machine/Metal3Machine-controller "msg"="Target node is not found, requeuing" "cluster"="test1" "machine"="test1-9m497" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-fn7md"} 
I0901 11:01:54.859687       1 metal3machine_manager.go:647] controllers/Metal3Machine/Metal3Machine-controller "msg"="Updating machine" "cluster"="test1" "machine"="test1-9m497" "metal3-cluster"="test1" "metal3-machine"={"Namespace":"metal3","Name":"test1-controlplane-fn7md"} 

CAPI manager log (snippet):

E0901 12:04:28.020349       1 machine_controller.go:249] controllers/Machine "msg"="Reconciliation for Machine asked to requeue" "error"="Infrastructure provider for Machine \"test1-9m497\" in namespace \"metal3\" is not ready, requeuing: requeue in 30s" "cluster"="test1" "machine"="test1-9m497" "namespace"="metal3" 
I0901 12:04:42.094824       1 machine_controller_noderef.go:52] controllers/Machine "msg"="Machine doesn't have a valid ProviderID yet" "cluster"="test1" "machine"="test1-9m497" "namespace"="metal3" 
E0901 12:04:42.094872       1 machine_controller.go:249] controllers/Machine "msg"="Reconciliation for Machine asked to requeue" "error"="Infrastructure provider for Machine \"test1-9m497\" in namespace \"metal3\" is not ready, requeuing: requeue in 30s" "cluster"="test1" "machine"="test1-9m497" "namespace"="metal3" 
I0901 12:04:46.830650       1 machine_controller_noderef.go:52] controllers/Machine "msg"="Machine doesn't have a valid ProviderID yet" "cluster"="test1" "machine"="test1-9m497" "namespace"="metal3" 
E0901 12:04:46.830691       1 machine_controller.go:249] controllers/Machine "msg"="Reconciliation for Machine asked to requeue" "error"="Infrastructure provider for Machine \"test1-9m497\" in namespace \"metal3\" is not ready, requeuing: requeue in 30s" "cluster"="test1" "machine"="test1-9m497" "namespace"="metal3" 
I0901 12:04:58.053847       1 machine_controller_noderef.go:52] controllers/Machine "msg"="Machine doesn't have a valid ProviderID yet" "cluster"="test1" "machine"="test1-9m497" "namespace"="metal3" 
E0901 12:04:58.054197       1 machine_controller.go:249] controllers/Machine "msg"="Reconciliation for Machine asked to requeue" "error"="Infrastructure provider for Machine \"test1-9m497\" in namespace \"metal3\" is not ready, requeuing: requeue in 30s" "cluster"="test1" "machine"="test1-9m497" "namespace"="metal3" 

KCP status after downgrade:

replicas: 3
  version: v1.18.0
status:
  conditions:
  - lastTransitionTime: "2020-09-01T09:07:42Z"
    message: Rolling 2 replicas with outdated spec (2 replicas up to date)
    reason: RollingUpdateInProgress
    severity: Warning
    status: "False"
    type: Ready
  - lastTransitionTime: "2020-08-31T13:48:49Z"
    status: "True"
    type: Available
  - lastTransitionTime: "2020-08-31T13:41:46Z"
    status: "True"
    type: CertificatesAvailable
  - lastTransitionTime: "2020-09-01T09:07:42Z"
    message: 3 of 4 completed
    reason: WaitingForInfrastructure@Machine/test1-9m497
    severity: Info
    status: "False"
    type: MachinesReady
  - lastTransitionTime: "2020-09-01T09:07:42Z"
    message: Rolling 2 replicas with outdated spec (2 replicas up to date)
    reason: RollingUpdateInProgress
    severity: Warning
    status: "False"
    type: MachinesSpecUpToDate
  - lastTransitionTime: "2020-09-01T09:07:41Z"
    message: Scaling down to 3 replicas (actual 4)
    reason: ScalingDown
    severity: Warning
    status: "False"
    type: Resized
  initialized: true
  observedGeneration: 1
  replicas: 4
  selector: cluster.x-k8s.io/cluster-name=test1,cluster.x-k8s.io/control-plane
  unavailableReplicas: 4
  updatedReplicas: 2

Environment:

  • Cluster-api version: 0.3.8
  • CAPM3 version: v1alpha4
  • kind version: v0.8.1
  • environment (metal3-dev-env or other):
  • Kubernetes version: (use kubectl version): 1.18.0

/kind bug
