scalr / agent-helm
Helm chart to install "scalr-agent" for connecting self-hosted runners and VCS to Scalr TACO
Home Page: https://scalr.github.io/agent-helm/
The Scalr agent K8s Helm chart creates a DaemonSet in the worker template that makes use of a hostPath directory set based on the value of agent.data_home. The default value is currently "/home/kubernetes/flexvolume/agent-k8s", a directory that the GKE distribution of Kubernetes uses as its Flexvolume plugin directory.
GKE changes the default Flexvolume plugin directory from /var/lib/kubelet/volumeplugins to /home/kubernetes/flexvolume, in its Kubelet configuration. (Flexvolume is deprecated but still supported.) If this directory exists, Kubelet automatically scans it for new custom volume driver plugins, which causes (non-critical) errors to be constantly logged by the kubelet on every node in the cluster where this chart is installed.
The default value for this directory should be changed to something that no service running on the host should expect to be used for any other purpose. A longer-term fix might be to move away from using a hostPath directly.
Also note that (at least with GKE) most volumes on a worker node are mounted with noexec, and /home/kubernetes/flexvolume was likely chosen because it is not mounted with noexec. A suggested new default for agent.data_home would be "/home/kubernetes/bin/scalr/agent-k8s" or another similar directory that is not mounted with the noexec flag and is also not reserved for some other expected purpose.
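Until the chart default changes, the directory can be overridden at install time through the chart's agent.data_home value. A minimal sketch (the path shown is the suggestion above; verify that the directory you choose is not mounted noexec on your nodes):

```shell
# Override the hostPath used by the agent DaemonSet so the kubelet's
# Flexvolume plugin scan no longer picks up the agent's data directory.
helm upgrade --install scalr-agent-pool scalr-agent-helm/agent-k8s \
  --set agent.url=<url> \
  --set agent.token=<token> \
  --set agent.data_home=/home/kubernetes/bin/scalr/agent-k8s
```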
See also: #32
I'm using agent-k8s (v0.2.4), and getting these errors when booting the agent:
{
"logger_name": "tacoagent.cmd",
"url": "https://ml-ews.scalr.io",
"pool_id": "apool-v0o1iq35t9pm3bufo",
"event": "Connecting to the Scalr agent pool...",
"level": "info",
"timestamp": "2023-08-18 09:12:05.900284"
}
{
"logger_name": "tacoagent.worker.worker",
"event": "Unable to connect to the Scalr. Unable to assign agent to the Scalr agent pool \"apool-v0o1iq35t9pm3bufo\". API Error: POST /api/iacp/v3/agent-pools/apool-v0o1iq35t9pm3bufo/actions/connect: Unable to connect to the pool apool-v0o1iq35t9pm3bufo. HTTP 403 (request id: f568acd6). Please check that Scalr URL and token are correct.",
"level": "critical",
"timestamp": "2023-08-18 09:12:05.979882"
}
The URL looks correct, and the token is plucked directly from the Scalr UI (it is able to pick out the correct agent pool apool-v0o1iq35t9pm3bufo from it). What could be wrong?
Currently the default serviceAccount is mounted into a task pod. Is there a way to set a different service account name? The use case is Vault authentication via the AWS auth backend.
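As a diagnostic sketch, the service account actually mounted into the task pods can be inspected with kubectl. The label selector below is hypothetical; adjust it to whatever labels the chart applies to task pods:

```shell
# Show which ServiceAccount each matching pod is running with.
# The -l selector is an assumption; replace it to match your task pods.
kubectl get pods -l app.kubernetes.io/name=agent-k8s \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.serviceAccountName}{"\n"}{end}'
```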
I'm trying to use the new K8s agent for my Scalr agent pool but have not been successful. The daemonset pods are crashlooping with the following error.
{"logger_name": "tacoagent.cmd", "version": "0.1.33", "backend": "docker", "event": "Starting Scalr agent", "level": "info", "timestamp": "2023-06-09 15:03:53.916851"}
{"logger_name": "tacoagent.cmd", "event": "Cannot connect to the docker daemon. Error while fetching server API version: Cannot connect to the Docker API at unix:///var/run/docker.sock. Is the scalr-docker service running?", "level": "error", "timestamp": "2023-06-09 15:03:53.921063"}
{"logger_name": "tacoagent.worker.worker", "event": "Cannot connect to the Docker daemon (unix:///var/run/docker.sock). Please check that the Docker daemon is running.", "level": "critical", "timestamp": "2023-06-09 15:03:53.921502"}
I'm not sure if it's intended, but I noticed the k8s agent daemonset doesn't utilize any type of dind configuration like the docker-agent does. From my understanding, the K8s agent shouldn't be trying to use Docker but rather schedule new pods for Scalr runs.
Kubernetes Provider: EKS
Kubernetes version: 1.26 (v1.26.4-eks-0a21954)
helm repo add scalr-agent-helm https://scalr.github.io/agent-helm/
helm repo update
helm install scalr-agent-pool scalr-agent-helm/agent-k8s --set agent.url=<> --set agent.token=<>
Workers should be a Deployment or StatefulSet to allow better scaling of the workloads. With it as a DaemonSet, we would have to use taints and selectors to keep it from being deployed on every node in our larger clusters.
Hello.
I have a Terraform plan that runs to what looks like completion, even producing outputs, then suddenly fails.
The k8s log (k8s set up via helm install --set agent.token=$SCALR_TOKEN --set agent.url=$SCALR_URL scalr-agent-helm/agent-k8s --generate-name) follows, and I'm left wondering if this is an architecture thing (I'm on an M1 CPU, which is ARM).
2024-01-21 03:48:15 {"logger_name": "tacoagent.worker.task.agent_task", "event": "Failed to perform terraform plan. Unexpected exit code: 1.", "level": "info", "timestamp": "2024-01-21 10:48:15.510780", "account_id": "acc-v0o69ln0se66iXXXX", "run_id": "run-v0o7n4rhcq04gp9q1", "workspace_id": "ws-v0o7mc9446otjXXXX", "container_id": "f858da19", "agent_id": "agent-v0o7n4pkpu8o9g45a", "name": "terraform.plan.v2", "plan_id": "plan-v0o7n4rhdmp7s70jn", "id": "atask-v0o7n4rhqeonnv8t8", "thread_name": "Worker-1"}
2024-01-21 03:48:15 {"logger_name": "tacoagent.worker.task.agent_task", "previous_status": "in_progress", "code": "error", "event": "The agent task is errored.", "level": "info", "timestamp": "2024-01-21 10:48:15.747213", "account_id": "acc-v0o69ln0se66iXXXX", "run_id": "run-v0o7n4rhcq04gp9q1", "workspace_id": "ws-v0o7mc9446otjXXXX", "container_id": "f858da19", "agent_id": "agent-v0o7n4pkpu8o9g45a", "name": "terraform.plan.v2", "plan_id": "plan-v0o7n4rhdmp7s70jn", "id": "atask-v0o7n4rhqeonnv8t8", "thread_name": "Worker-1"}
2024-01-21 03:48:17 {"event": "terraform.plan.v2[atask-v0o7n4rhqeonnv8t8] executed in 83.407s", "level": "info", "logger_name": "huey", "timestamp": "2024-01-21 10:48:17.657362", "thread_name": "Worker-1"}
2024-01-21 04:01:01 {"event": "Executing terraform.gc.containers[648b5e69-a08c-4597-ab15-4cc33eedcc53]", "level": "info", "logger_name": "huey", "timestamp": "2024-01-21 11:01:01.164559", "thread_name": "Worker-2"}
2024-01-21 04:01:01 {"event": "terraform.gc.containers[648b5e69-a08c-4597-ab15-4cc33eedcc53] executed in 0.000s", "level": "info", "logger_name": "huey", "timestamp": "2024-01-21 11:01:01.165837", "thread_name": "Worker-2"}
2024-01-21 04:01:01 {"event": "Executing terraform.gc.plugins[05d84b5e-e051-4530-bc97-ebcae0468502]", "level": "info", "logger_name": "huey", "timestamp": "2024-01-21 11:01:01.167133", "thread_name": "Worker-2"}
In the UI, here's what I see:
Plan: 75 to add, 0 to change, 0 to destroy.
Changes to Outputs:
+ account_id = "38149XXXXXXX"
Plan operation failed
Error: Failed to perform terraform plan. Unexpected exit code: 1
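Since the open question above is whether this is an ARM/architecture mismatch, one quick check (a sketch; it only needs kubectl access to the cluster the agent runs in) is to compare the architecture of the nodes running the agent with the host:

```shell
# Show the CPU architecture of each node the agent DaemonSet can land on.
kubectl get nodes \
  -o custom-columns='NAME:.metadata.name,ARCH:.status.nodeInfo.architecture'

# On the local M1 machine, confirm the host architecture.
uname -m   # prints arm64 on Apple Silicon
```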
Currently the agent is deployed as a DaemonSet, which will deploy one copy of the Scalr agent to ALL the nodes in the cluster (by default). I don't see the need to have this deployed as a DaemonSet and cannot find it documented anywhere either. I think this should be a StatefulSet or Deployment instead.
You can still have one agent listening to Scalr that acts as the controller, and it will start new pods for each run. For monitoring the pods, the controller doesn't necessarily need to be on the same node as the runner pods. You can query the k8s API to get the IP of a pod using labels, to avoid having to know the exact pod name.
As a comparison, the Atlantis Helm chart uses StatefulSets (source), which makes more sense if you're thinking of persisting data across runs in the container. Comparing with Atlantis might not work here, though, because Scalr agents start a new pod for each run rather than running it directly inside the agent container like Atlantis does.
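The label-based lookup suggested above can be sketched as follows. The app=scalr-agent label is hypothetical; substitute whatever labels the agent attaches to its run pods:

```shell
# Resolve run-pod IPs by label selector instead of by exact pod name,
# so the controller does not need to know pod names in advance.
kubectl get pods -l app=scalr-agent \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.podIP}{"\n"}{end}'
```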