Hi!
I'm currently experimenting with FAR on Azure VMs. I've installed NHC and FAR in an OCP 4.13 cluster, created the FAR template, and started the remediation process. This is the FAR template I'm currently using:
apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
kind: FenceAgentsRemediationTemplate
metadata:
  name: fenceagentsremediationtemplate-default
  namespace: openshift-operators
spec:
  template:
    spec:
      sharedparameters:
        '--action': reboot
        '-l': ea6bxxx
        '-p': y~xxx
        '--resourceGroup': jcano-cluster-mfxww-rg
        '--tenantId': 60xxx
        '--subscriptionId': 89xxx
      nodeparameters:
        '--plug=':
          jcano-cluster-mfxww-master-0: jcano-cluster-mfxww-master-0
          jcano-cluster-mfxww-master-1: jcano-cluster-mfxww-master-1
          jcano-cluster-mfxww-master-2: jcano-cluster-mfxww-master-2
          jcano-cluster-mfxww-worker-germanywestcentral1-b58kw: jcano-cluster-mfxww-worker-germanywestcentral1-b58kw
          jcano-cluster-mfxww-worker-germanywestcentral2-h6zwd: jcano-cluster-mfxww-worker-germanywestcentral2-h6zwd
          jcano-cluster-mfxww-worker-germanywestcentral3-xd7h5: jcano-cluster-mfxww-worker-germanywestcentral3-xd7h5
      agent: fence_azure_arm
I first tried the fence_azure_arm tool standalone, running it locally to restart a faulty VM that hosts an OCP node. To get a node into an unhealthy state, I stopped its kubelet process. This worked, but it required a small modification, see: Azure/azure-sdk-for-python#30983 (comment)
However, it does not work with the FAR operator, which logs the following errors:
2023-10-10T15:08:07.128294848Z INFO controllers.FenceAgentsRemediation Begin FenceAgentsRemediation Reconcile
2023-10-10T15:08:07.128341449Z INFO controllers.FenceAgentsRemediation Check FAR CR's name
2023-10-10T15:08:07.138883921Z INFO controllers.FenceAgentsRemediation Finalizer was added {"CR Name": "jcano-cluster-mfxww-worker-germanywestcentral2-h6zwd"}
2023-10-10T15:08:07.138914222Z INFO controllers.FenceAgentsRemediation Updating Status Condition {"processingConditionStatus": "True", "fenceAgentActionSucceededConditionStatus": "Unknown", "succededConditionStatus": "Unknown", "reason": "RemediationStarted", "LastUpdateTime": "2023-10-10 15:08:07.138913322 +0000 UTC m=+23184.695547222"}
2023-10-10T15:08:07.151777431Z INFO controllers.FenceAgentsRemediation Finish FenceAgentsRemediation Reconcile
2023-10-10T15:08:07.151923434Z INFO controllers.FenceAgentsRemediation Begin FenceAgentsRemediation Reconcile
2023-10-10T15:08:07.151954534Z INFO controllers.FenceAgentsRemediation Check FAR CR's name
2023-10-10T15:08:07.152025935Z INFO controllers.FenceAgentsRemediation Try adding FAR (Medik8s) remediation taint {"Fence Agent": "fence_azure_arm", "Node Name": "jcano-cluster-mfxww-worker-germanywestcentral2-h6zwd"}
2023-10-10T15:08:07.170359134Z INFO taints Taint was added {"taint effect": "NoExecute", "taint list": [{"key":"node.kubernetes.io/unreachable","effect":"NoSchedule","timeAdded":"2023-10-10T15:03:06Z"},{"key":"node.kubernetes.io/unreachable","effect":"NoExecute","timeAdded":"2023-10-10T15:03:12Z"},{"key":"medik8s.io/fence-agents-remediation","effect":"NoExecute","timeAdded":"2023-10-10T15:08:07Z"}]}
2023-10-10T15:08:07.170395735Z INFO controllers.FenceAgentsRemediation Fetch FAR's pod
2023-10-10T15:08:07.170512137Z INFO controllers.FenceAgentsRemediation Combine fence agent parameters {"Fence Agent": "fence_azure_arm", "Node Name": "jcano-cluster-mfxww-worker-germanywestcentral2-h6zwd"}
2023-10-10T15:08:07.170539037Z INFO controllers.FenceAgentsRemediation Execute the fence agent {"Fence Agent": "fence_azure_arm", "Node Name": "jcano-cluster-mfxww-worker-germanywestcentral2-h6zwd"}
2023-10-10T15:08:07.340974815Z ERROR executer Failed to run exec command {"stdout": "", "stderr": "time=\"2023-10-10T15:08:07Z\" level=error msg=\"exec failed: unable to start container process: exec: \\\"fence_azure_arm\\\": executable file not found in $PATH\"\n", "error": "command terminated with exit code 255"}
github.com/medik8s/fence-agents-remediation/pkg/cli.executer.Execute
    /remote-source/app/pkg/cli/cliexecuter.go:92
github.com/medik8s/fence-agents-remediation/controllers.(*FenceAgentsRemediationReconciler).Reconcile
    /remote-source/app/controllers/fenceagentsremediation_controller.go:203
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226
2023-10-10T15:08:07.341030816Z ERROR controllers.FenceAgentsRemediation Fence Agent response was a failure {"CR's Name": "jcano-cluster-mfxww-worker-germanywestcentral2-h6zwd", "error": "command terminated with exit code 255"}
github.com/medik8s/fence-agents-remediation/controllers.(*FenceAgentsRemediationReconciler).Reconcile
    /remote-source/app/controllers/fenceagentsremediation_controller.go:206
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226
2023-10-10T15:08:07.350733575Z INFO controllers.FenceAgentsRemediation Finish FenceAgentsRemediation Reconcile
It looks like FAR is not able to find the fence_azure_arm tool in its PATH.
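The "executable file not found in $PATH" message above points at a plain PATH lookup failure. As a minimal sketch of that kind of check, the Python snippet below looks an agent name up on the current PATH. The helper names find_agent and report are hypothetical and not part of FAR, and running it inside the FAR container assumes a Python interpreter is available there, which is not confirmed here.

```python
import os
import shutil
from typing import Optional


def find_agent(agent: str, search_path: Optional[str] = None) -> Optional[str]:
    """Return the full path of `agent` if it is an executable on the search path, else None."""
    # When search_path is None, shutil.which falls back to the PATH environment variable.
    return shutil.which(agent, path=search_path)


def report(agent: str = "fence_azure_arm") -> str:
    """Build a one-line summary saying whether `agent` was found on PATH."""
    location = find_agent(agent)
    if location is None:
        return f"{agent}: not found in PATH ({os.environ.get('PATH', '')})"
    return f"{agent}: found at {location}"


if __name__ == "__main__":
    print(report())
```

If the report says the agent is not found, that would be consistent with the exec error in the log. If it is found, the problem would more likely be in which container or environment the command is executed.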
Environment:
- OCP version: 4.13
- NHC version: 0.6.0
- FAR version: 0.2.0
Thanks in advance!