
agent-helm's People

Contributors

artemvang, fad3t, github-actions[bot], maxanderson95, mermoldy, penja, vmotso, volodymyrskliar, yurii-kryvosheia

agent-helm's Issues

Flexvolume plugin conflict in GKE with default agent.data_home value

The Scalr agent K8s Helm chart creates a DaemonSet in the worker template that makes use of a hostPath directory that is set based on the value in agent.data_home. The default value for this is currently "/home/kubernetes/flexvolume/agent-k8s", which is a directory that the GKE distribution of Kubernetes uses as its Flexvolume plugin directory.

In its kubelet configuration, GKE changes the default Flexvolume plugin directory from /var/lib/kubelet/volumeplugins to /home/kubernetes/flexvolume. (Flexvolume is deprecated but still supported.) If this directory exists, the kubelet automatically scans it for new custom volume driver plugins, which causes non-critical errors to be logged constantly by the kubelet on every node in the cluster where this chart is installed.

The default value for this directory should be changed to a path that no service running on the host expects to use for any other purpose. A longer-term fix might be to move away from using a hostPath directly.

Also note that (at least with GKE) most volumes on a worker node are mounted with noexec, and /home/kubernetes/flexvolume was likely chosen precisely because it is not. A suggested new default for agent.data_home would be "/home/kubernetes/bin/scalr/agent-k8s", or a similar directory that is neither mounted noexec nor reserved for another purpose.
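
Until the default changes, the value can be overridden at install time. A minimal sketch, using the chart's existing agent.data_home value and the path suggested above (the release name and placeholders are illustrative):

helm upgrade --install scalr-agent-pool scalr-agent-helm/agent-k8s \
  --set agent.url=<> \
  --set agent.token=<> \
  --set agent.data_home="/home/kubernetes/bin/scalr/agent-k8s"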

See also: #32

Unable to connect to the Scalr

I'm using agent-k8s (v0.2.4), and getting these errors when booting the agent:

{
  "logger_name": "tacoagent.cmd",
  "url": "https://ml-ews.scalr.io",
  "pool_id": "apool-v0o1iq35t9pm3bufo",
  "event": "Connecting to the Scalr agent pool...",
  "level": "info",
  "timestamp": "2023-08-18 09:12:05.900284"
}
{
  "logger_name": "tacoagent.worker.worker",
  "event": "Unable to connect to the Scalr. Unable to assign agent to the Scalr agent pool \"apool-v0o1iq35t9pm3bufo\". API Error: POST /api/iacp/v3/agent-pools/apool-v0o1iq35t9pm3bufo/actions/connect: Unable to connect to the pool apool-v0o1iq35t9pm3bufo. HTTP 403 (request id: f568acd6). Please check that Scalr URL and token are correct.",
  "level": "critical",
  "timestamp": "2023-08-18 09:12:05.979882"
}

The URL looks correct, and the token is plucked directly from the Scalr UI (the agent is able to pick out the correct agent pool, apool-v0o1iq35t9pm3bufo, from it). What could be wrong?
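
One way to narrow this down is to replay the connect call the agent makes, using the endpoint from the log above. This is a debugging sketch only: the Authorization header scheme is an assumption, and agent-pool tokens may use a different one, so check the Scalr API docs before drawing conclusions from the response.

# Hypothetical reproduction of the agent's connect request; the Bearer
# scheme is an assumption, not confirmed for agent-pool tokens.
curl -i -X POST \
  -H "Authorization: Bearer $SCALR_TOKEN" \
  "https://ml-ews.scalr.io/api/iacp/v3/agent-pools/apool-v0o1iq35t9pm3bufo/actions/connect"

A 403 here as well would point at the token (wrong type, expired, or scoped to a different pool) rather than networking.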

Service account name for a task container

Currently the default serviceAccount is mounted into a task pod. Is there a way to set a different service account name?
The use case is Vault authentication via the AWS backend.
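
For reference, the service account a given task pod actually runs under can be confirmed with kubectl (the pod name is a placeholder):

kubectl get pod <task-pod-name> -o jsonpath='{.spec.serviceAccountName}'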

Docker socket error with agent-k8s

I'm trying to use the new K8s agent for my Scalr agent pool but have not been successful. The daemonset pods are crashlooping with the following error.

{"logger_name": "tacoagent.cmd", "version": "0.1.33", "backend": "docker", "event": "Starting Scalr agent", "level": "info", "timestamp": "2023-06-09 15:03:53.916851"}
{"logger_name": "tacoagent.cmd", "event": "Cannot connect to the docker daemon. Error while fetching server API version: Cannot connect to the Docker API at unix:///var/run/docker.sock. Is the scalr-docker service running?", "level": "error", "timestamp": "2023-06-09 15:03:53.921063"}
{"logger_name": "tacoagent.worker.worker", "event": "Cannot connect to the Docker daemon (unix:///var/run/docker.sock). Please check that the Docker daemon is running.", "level": "critical", "timestamp": "2023-06-09 15:03:53.921502"}

I'm not sure if it's intended, but I noticed the k8s agent DaemonSet doesn't utilize any type of DinD configuration like the docker-agent does. From my understanding, the K8s agent shouldn't be trying to utilize Docker but rather schedule new pods for Scalr runs.

Kubernetes Provider: EKS
Kubernetes version: 1.26 (v1.26.4-eks-0a21954)

Steps to reproduce

helm repo add scalr-agent-helm https://scalr.github.io/agent-helm/
helm repo update

helm install scalr-agent-pool scalr-agent-helm/agent-k8s --set agent.url=<> --set agent.token=<>
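
To confirm what the chart actually wires up, the templates can be rendered locally and searched for a Docker socket mount. A quick sketch (the grep pattern is illustrative):

helm template scalr-agent-pool scalr-agent-helm/agent-k8s \
  --set agent.url=<> --set agent.token=<> | grep -n 'docker.sock'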

Why would workers be a daemon set?

Workers should be a Deployment or StatefulSet to allow better scaling of the workloads. With a DaemonSet, we would have to use taints and selectors to keep it from being deployed on every node in our larger clusters.
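
As an interim workaround, pinning the DaemonSet would look roughly like this, assuming the chart passes through standard nodeSelector and tolerations values (an assumption; check the chart's values.yaml first):

cat > scalr-agent-values.yaml <<'EOF'
# Hypothetical values -- only valid if the chart exposes these passthroughs.
nodeSelector:
  scalr-agent: "true"
tolerations:
  - key: scalr-agent
    operator: Exists
    effect: NoSchedule
EOF
helm upgrade --install scalr-agent-pool scalr-agent-helm/agent-k8s \
  --set agent.url=<> --set agent.token=<> \
  -f scalr-agent-values.yaml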

k8s, Docker runner (agent) not working under M1 Mac Mini

Hello.

I have a terraform plan that runs to what looks like completion, even giving outputs, then suddenly fails.

The k8s log (k8s set up via helm install --set agent.token=$SCALR_TOKEN --set agent.url=$SCALR_URL scalr-agent-helm/agent-k8s --generate-name) follows, and I'm left wondering if this is an architecture thing (I'm on an M1 CPU, which is ARM).

2024-01-21 03:48:15 {"logger_name": "tacoagent.worker.task.agent_task", "event": "Failed to perform terraform plan. Unexpected exit code: 1.", "level": "info", "timestamp": "2024-01-21 10:48:15.510780", "account_id": "acc-v0o69ln0se66iXXXX", "run_id": "run-v0o7n4rhcq04gp9q1", "workspace_id": "ws-v0o7mc9446otjXXXX", "container_id": "f858da19", "agent_id": "agent-v0o7n4pkpu8o9g45a", "name": "terraform.plan.v2", "plan_id": "plan-v0o7n4rhdmp7s70jn", "id": "atask-v0o7n4rhqeonnv8t8", "thread_name": "Worker-1"}
2024-01-21 03:48:15 {"logger_name": "tacoagent.worker.task.agent_task", "previous_status": "in_progress", "code": "error", "event": "The agent task is errored.", "level": "info", "timestamp": "2024-01-21 10:48:15.747213", "account_id": "acc-v0o69ln0se66iXXXX", "run_id": "run-v0o7n4rhcq04gp9q1", "workspace_id": "ws-v0o7mc9446otjXXXX", "container_id": "f858da19", "agent_id": "agent-v0o7n4pkpu8o9g45a", "name": "terraform.plan.v2", "plan_id": "plan-v0o7n4rhdmp7s70jn", "id": "atask-v0o7n4rhqeonnv8t8", "thread_name": "Worker-1"}
2024-01-21 03:48:17 {"event": "terraform.plan.v2[atask-v0o7n4rhqeonnv8t8] executed in 83.407s", "level": "info", "logger_name": "huey", "timestamp": "2024-01-21 10:48:17.657362", "thread_name": "Worker-1"}
2024-01-21 04:01:01 {"event": "Executing terraform.gc.containers[648b5e69-a08c-4597-ab15-4cc33eedcc53]", "level": "info", "logger_name": "huey", "timestamp": "2024-01-21 11:01:01.164559", "thread_name": "Worker-2"}
2024-01-21 04:01:01 {"event": "terraform.gc.containers[648b5e69-a08c-4597-ab15-4cc33eedcc53] executed in 0.000s", "level": "info", "logger_name": "huey", "timestamp": "2024-01-21 11:01:01.165837", "thread_name": "Worker-2"}
2024-01-21 04:01:01 {"event": "Executing terraform.gc.plugins[05d84b5e-e051-4530-bc97-ebcae0468502]", "level": "info", "logger_name": "huey", "timestamp": "2024-01-21 11:01:01.167133", "thread_name": "Worker-2"}

In the UI, here's what I see:

Plan: 75 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  + account_id = "38149XXXXXXX"

Plan operation failed

Error: Failed to perform terraform plan. Unexpected exit code: 1
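
If this is an architecture mismatch, comparing the node and image architectures should show it. A debugging sketch (the image reference is an assumption, not the chart's confirmed image name):

# Node architecture; Apple Silicon nodes report arm64:
kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.architecture}'
# Platforms published for the agent image (image name is hypothetical):
docker manifest inspect scalr/agent-k8s:latest | grep architecture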

Use StatefulSets instead of DaemonSet

Currently the agent is deployed as a DaemonSet, which (by default) deploys one copy of the Scalr agent to ALL the nodes in the cluster. I don't see the need for a DaemonSet here and cannot find it documented anywhere either. I think this should be a StatefulSet or Deployment instead.

You can still have one agent listening to Scalr that acts as the controller, and it will start new pods for each run. For monitoring those pods, the controller doesn't need to be on the same node as the runner pods: you can query the k8s API for a pod's IP by label, avoiding the need to know the exact pod name (see the sketch below).
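
For example, assuming the chart applies a standard app.kubernetes.io/name label (an assumption; substitute whatever labels the chart actually sets):

kubectl get pods -l app.kubernetes.io/name=agent-k8s \
  -o jsonpath='{.items[*].status.podIP}'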

For comparison, the Atlantis Helm chart uses a StatefulSet (source), which makes sense if you're thinking of persisting data across runs in the container. The comparison may not fully apply here, though, because the Scalr agent starts new pods for each run rather than running everything directly inside the agent the way Atlantis does.
