
agent-helm's People

Contributors

artemvang, fad3t, github-actions[bot], maxanderson95, mermoldy, penja, vmotso, volodymyrskliar, yurii-kryvosheia

agent-helm's Issues

Flexvolume plugin conflict in GKE with default agent.data_home value

The Scalr agent K8s Helm chart creates a DaemonSet in the worker template that makes use of a hostPath directory that is set based on the value in agent.data_home. The default value for this is currently "/home/kubernetes/flexvolume/agent-k8s", which is a directory that the GKE distribution of Kubernetes uses as its Flexvolume plugin directory.

In its kubelet configuration, GKE changes the default Flexvolume plugin directory from /var/lib/kubelet/volumeplugins to /home/kubernetes/flexvolume. (Flexvolume is deprecated but still supported.) If this directory exists, the kubelet automatically scans it for new custom volume driver plugins, which causes non-critical errors to be logged constantly by the kubelet on every node in the cluster where this chart is installed.

The default value for this directory should be changed to a path that no service running on the host expects to use for any other purpose. A longer-term fix might be to move away from using a hostPath directly.

Also note that (at least with GKE) most volumes on a worker node are mounted with noexec, and /home/kubernetes/flexvolume was likely chosen precisely because it is not. A suggested new default for agent.data_home would be "/home/kubernetes/bin/scalr/agent-k8s", or a similar directory that is neither mounted noexec nor reserved for another purpose.
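
Until the default changes, the value can be overridden at install time. A minimal sketch, using the chart's existing agent.data_home value and the path suggested above (the release name and placeholders are illustrative):

helm upgrade --install scalr-agent-pool scalr-agent-helm/agent-k8s \
  --set agent.url=<> \
  --set agent.token=<> \
  --set agent.data_home="/home/kubernetes/bin/scalr/agent-k8s"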

See also: #32

Unable to connect to the Scalr

I'm using agent-k8s (v0.2.4), and getting these errors when booting the agent:

{
  "logger_name": "tacoagent.cmd",
  "url": "https://ml-ews.scalr.io",
  "pool_id": "apool-v0o1iq35t9pm3bufo",
  "event": "Connecting to the Scalr agent pool...",
  "level": "info",
  "timestamp": "2023-08-18 09:12:05.900284"
}
{
  "logger_name": "tacoagent.worker.worker",
  "event": "Unable to connect to the Scalr. Unable to assign agent to the Scalr agent pool \"apool-v0o1iq35t9pm3bufo\". API Error: POST /api/iacp/v3/agent-pools/apool-v0o1iq35t9pm3bufo/actions/connect: Unable to connect to the pool apool-v0o1iq35t9pm3bufo. HTTP 403 (request id: f568acd6). Please check that Scalr URL and token are correct.",
  "level": "critical",
  "timestamp": "2023-08-18 09:12:05.979882"
}

The URL looks correct, and the token is plucked directly from the Scalr UI (the agent is able to pick out the correct agent pool, apool-v0o1iq35t9pm3bufo, from it). What could be wrong?
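
One way to narrow this down is to replay the connect call the agent makes, using the endpoint from the log above. This is a debugging sketch only: the Authorization header scheme is an assumption, and agent-pool tokens may use a different one, so check the Scalr API docs before drawing conclusions from the response.

# Hypothetical reproduction of the agent's connect request; the Bearer
# scheme is an assumption, not confirmed for agent-pool tokens.
curl -i -X POST \
  -H "Authorization: Bearer $SCALR_TOKEN" \
  "https://ml-ews.scalr.io/api/iacp/v3/agent-pools/apool-v0o1iq35t9pm3bufo/actions/connect"

A 403 here as well would point at the token (wrong type, expired, or scoped to a different pool) rather than networking.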

Service account name for a task container

Currently the default serviceAccount is mounted into a task pod. Is there a way to set a different service account name?
The use case is Vault authentication via the AWS backend.
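
For reference, the service account a given task pod actually runs under can be confirmed with kubectl (the pod name is a placeholder):

kubectl get pod <task-pod-name> -o jsonpath='{.spec.serviceAccountName}'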

Docker socket error with agent-k8s

I'm trying to use the new K8s agent for my Scalr agent pool but have not been successful. The daemonset pods are crashlooping with the following error.

{"logger_name": "tacoagent.cmd", "version": "0.1.33", "backend": "docker", "event": "Starting Scalr agent", "level": "info", "timestamp": "2023-06-09 15:03:53.916851"}
{"logger_name": "tacoagent.cmd", "event": "Cannot connect to the docker daemon. Error while fetching server API version: Cannot connect to the Docker API at unix:///var/run/docker.sock. Is the scalr-docker service running?", "level": "error", "timestamp": "2023-06-09 15:03:53.921063"}
{"logger_name": "tacoagent.worker.worker", "event": "Cannot connect to the Docker daemon (unix:///var/run/docker.sock). Please check that the Docker daemon is running.", "level": "critical", "timestamp": "2023-06-09 15:03:53.921502"}

I'm not sure if it's intended, but I noticed the k8s agent DaemonSet doesn't utilize any type of DinD configuration like the docker-agent does. From my understanding, the K8s agent shouldn't be trying to utilize Docker but rather schedule new pods for Scalr runs.

Kubernetes Provider: EKS
Kubernetes version: 1.26 (v1.26.4-eks-0a21954)

Steps to reproduce

helm repo add scalr-agent-helm https://scalr.github.io/agent-helm/
helm repo update

helm install scalr-agent-pool scalr-agent-helm/agent-k8s --set agent.url=<> --set agent.token=<>
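
To confirm what the chart actually wires up, the templates can be rendered locally and searched for a Docker socket mount. A quick sketch (the grep pattern is illustrative):

helm template scalr-agent-pool scalr-agent-helm/agent-k8s \
  --set agent.url=<> --set agent.token=<> | grep -n 'docker.sock'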

Why would workers be a daemon set?

Workers should be a Deployment or StatefulSet to allow better scaling of the workloads. With a DaemonSet, we would have to use taints and selectors to keep it from being deployed on every node in our larger clusters.
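
As an interim workaround, pinning the DaemonSet would look roughly like this, assuming the chart passes through standard nodeSelector and tolerations values (an assumption; check the chart's values.yaml first):

cat > scalr-agent-values.yaml <<'EOF'
# Hypothetical values -- only valid if the chart exposes these passthroughs.
nodeSelector:
  scalr-agent: "true"
tolerations:
  - key: scalr-agent
    operator: Exists
    effect: NoSchedule
EOF
helm upgrade --install scalr-agent-pool scalr-agent-helm/agent-k8s \
  --set agent.url=<> --set agent.token=<> \
  -f scalr-agent-values.yaml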

k8s, Docker runner (agent) not working under M1 Mac Mini

Hello.

I have a terraform plan that runs to what looks like completion, even giving outputs, then suddenly fails.

The k8s log (k8s set up via helm install --set agent.token=$SCALR_TOKEN --set agent.url=$SCALR_URL scalr-agent-helm/agent-k8s --generate-name) follows, and I'm left wondering if this is an architecture thing (I'm on an M1 CPU, which is ARM).

2024-01-21 03:48:15 {"logger_name": "tacoagent.worker.task.agent_task", "event": "Failed to perform terraform plan. Unexpected exit code: 1.", "level": "info", "timestamp": "2024-01-21 10:48:15.510780", "account_id": "acc-v0o69ln0se66iXXXX", "run_id": "run-v0o7n4rhcq04gp9q1", "workspace_id": "ws-v0o7mc9446otjXXXX", "container_id": "f858da19", "agent_id": "agent-v0o7n4pkpu8o9g45a", "name": "terraform.plan.v2", "plan_id": "plan-v0o7n4rhdmp7s70jn", "id": "atask-v0o7n4rhqeonnv8t8", "thread_name": "Worker-1"}
2024-01-21 03:48:15 {"logger_name": "tacoagent.worker.task.agent_task", "previous_status": "in_progress", "code": "error", "event": "The agent task is errored.", "level": "info", "timestamp": "2024-01-21 10:48:15.747213", "account_id": "acc-v0o69ln0se66iXXXX", "run_id": "run-v0o7n4rhcq04gp9q1", "workspace_id": "ws-v0o7mc9446otjXXXX", "container_id": "f858da19", "agent_id": "agent-v0o7n4pkpu8o9g45a", "name": "terraform.plan.v2", "plan_id": "plan-v0o7n4rhdmp7s70jn", "id": "atask-v0o7n4rhqeonnv8t8", "thread_name": "Worker-1"}
2024-01-21 03:48:17 {"event": "terraform.plan.v2[atask-v0o7n4rhqeonnv8t8] executed in 83.407s", "level": "info", "logger_name": "huey", "timestamp": "2024-01-21 10:48:17.657362", "thread_name": "Worker-1"}
2024-01-21 04:01:01 {"event": "Executing terraform.gc.containers[648b5e69-a08c-4597-ab15-4cc33eedcc53]", "level": "info", "logger_name": "huey", "timestamp": "2024-01-21 11:01:01.164559", "thread_name": "Worker-2"}
2024-01-21 04:01:01 {"event": "terraform.gc.containers[648b5e69-a08c-4597-ab15-4cc33eedcc53] executed in 0.000s", "level": "info", "logger_name": "huey", "timestamp": "2024-01-21 11:01:01.165837", "thread_name": "Worker-2"}
2024-01-21 04:01:01 {"event": "Executing terraform.gc.plugins[05d84b5e-e051-4530-bc97-ebcae0468502]", "level": "info", "logger_name": "huey", "timestamp": "2024-01-21 11:01:01.167133", "thread_name": "Worker-2"}

In the UI, here's what I see:

Plan: 75 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  + account_id = "38149XXXXXXX"

Plan operation failed

Error: Failed to perform terraform plan. Unexpected exit code: 1
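
If this is an architecture mismatch, comparing the node and image architectures should show it. A debugging sketch (the image reference is an assumption, not the chart's confirmed image name):

# Node architecture; Apple Silicon nodes report arm64:
kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.architecture}'
# Platforms published for the agent image (image name is hypothetical):
docker manifest inspect scalr/agent-k8s:latest | grep architecture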

Use StatefulSets instead of DaemonSet

Currently the agent is deployed as a DaemonSet, which (by default) deploys one copy of the Scalr agent to ALL the nodes in the cluster. I don't see the need for a DaemonSet here and cannot find it documented anywhere either. I think this should be a StatefulSet or Deployment instead.

You can still have one agent listening to Scalr that acts as the controller, and it will start new pods for each run. For monitoring those pods, the controller doesn't need to be on the same node as the runner pods: you can query the k8s API for a pod's IP by label, avoiding the need to know the exact pod name (see the sketch below).
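
For example, assuming the chart applies a standard app.kubernetes.io/name label (an assumption; substitute whatever labels the chart actually sets):

kubectl get pods -l app.kubernetes.io/name=agent-k8s \
  -o jsonpath='{.items[*].status.podIP}'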

For comparison, the Atlantis Helm chart uses a StatefulSet (source), which makes sense if you're thinking of persisting data across runs in the container. The comparison may not fully apply here, though, because the Scalr agent starts new pods for each run rather than running everything directly inside the agent the way Atlantis does.
