
datadog-operator's Introduction

Datadog Operator


Overview

Warning

Operator v1.7.0 removes support for DatadogAgent v1alpha1 reconciliation (v2APIEnabled flag). v1.8.0 will remove the conversion webhook as well, and users will not be able to apply the DatadogAgent v1alpha1 manifest.

Operator v1.8.0 will deprecate custom resource definitions using apiextensions.k8s.io/v1beta1. They will be kept in the repository but will not be updated. They will be removed in v1.10.0.

The Datadog Operator aims to provide a new way of deploying the Datadog Agent on Kubernetes. Once deployed, the Datadog Operator provides:

  • Agent configuration validation that limits configuration mistakes.
  • Orchestration of creating/updating Datadog Agent resources.
  • Reporting of Agent configuration status in its Kubernetes CRD resource.
  • Optionally, use of an advanced DaemonSet deployment by leveraging the ExtendedDaemonSet.
  • Many other features to come :).

The Datadog Operator is RedHat certified and available on operatorhub.io.

Datadog Operator vs. Helm chart

You can also use the official Datadog Helm chart or a DaemonSet to install the Datadog Agent on Kubernetes. However, using the Datadog Operator offers the following advantages:

  • The Operator has built-in defaults based on Datadog best practices.
  • Operator configuration is more flexible for future enhancements.
  • As a Kubernetes Operator, the Datadog Operator is treated as a first-class resource by the Kubernetes API.
  • Unlike the Helm chart, the Operator is included in the Kubernetes reconciliation loop.

Datadog fully supports using a DaemonSet to deploy the Agent, but manual DaemonSet configuration leaves significant room for error. Therefore, deploying via a manually configured DaemonSet is not the recommended approach.

Getting started

See the dedicated Getting Started documentation to learn how to deploy the Datadog Operator and your first Agent, and the Configuration page for examples, a list of all configuration keys, and default values.
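For orientation, here is a minimal sketch of a DatadogAgent manifest, assuming the v2alpha1 API and a pre-existing Secret named datadog-secret holding your API key; treat the exact field names as an assumption to verify against the Getting Started and Configuration docs:

apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  global:
    # Assumes a Secret named "datadog-secret" with an "api-key" entry already exists.
    credentials:
      apiSecret:
        secretName: datadog-secret
        keyName: api-key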

Migrating from 0.8.x to 1.0.0

Operator 1.0.0 contains several changes users need to be aware of:

  • DatadogAgent CRD has two versions, v1alpha1 and v2alpha1. They are used as a stored version by Operator 0.8.x and 1.0.0 respectively. See this Kubernetes documentation page for more details about CRD versioning.
  • v1alpha1 and v2alpha1 are not backward or forward compatible. The Datadog Operator 1.0.0 implements a Conversion Webhook to migrate, though it only supports migrating from v1alpha1 to v2alpha1.
  • With the Conversion Webhook enabled, users can run 1.0.0 but continue applying a v1alpha1 manifest. However, they won't be able to retrieve the DatadogAgent manifest as a v1alpha1 object (see the previous item).
  • The Conversion Webhook requires a cert manager. See the migration guide in the public or helm chart documentation for more details.
  • 0.8.x managed PodDisruptionBudgets for the Cluster Agent and Cluster Checks Worker deployments. 1.0.0 does not; however, this is on our roadmap.
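As a rough illustration of the schema change between the two versions (not an exhaustive mapping), the credentials block moves from the top level of the spec in v1alpha1 to spec.global in v2alpha1. The field names below are taken from manifests elsewhere on this page and from the v2alpha1 schema, so treat them as an assumption and consult the migration guide for the authoritative mapping:

# v1alpha1 (stored version for operator 0.8.x)
apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  credentials:
    apiSecret:
      secretName: datadog-secret
      keyName: api-key
---
# v2alpha1 (stored version for operator 1.0.0) -- assumed equivalent of the block above
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  global:
    credentials:
      apiSecret:
        secretName: datadog-secret
        keyName: api-key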

Default Enabled Features

  • Cluster Agent
  • Admission Controller
  • Cluster Checks
  • Kubernetes Event Collection
  • Kubernetes State Core Check
  • Live Container Collection
  • Orchestrator Explorer
  • UnixDomainSocket transport for DogStatsD (and APM if enabled)
  • Process Discovery
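Each of the features listed above maps to a toggle under the DatadogAgent features block, so the defaults can be overridden per feature. A minimal sketch of opting out of one of them, assuming the v2alpha1 field names:

apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  features:
    # Admission Controller is enabled by default; set to false to opt out (assumed field name).
    admissionController:
      enabled: false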

Functionalities

The Datadog Operator also allows you to manage other Datadog resources directly from Kubernetes, for example monitors via the DatadogMonitor CRD.

How to contribute

See the How to Contribute page.

Release

Release process documentation is available here.


datadog-operator's Issues

Mention required token for cluster agent

I just installed the datadog-operator into a cluster and for the most part the process was pretty straightforward, nice work on this!

However, I did run into an issue getting the cluster-agent working: I didn't set a 'secret token' anywhere for the cluster-agent to use, which caused the cluster-agent to fail with a CreateContainerConfigError complaining that secret "datadog" was not found. I didn't realize it at first after reading through the installation/getting started docs, but a token is required and can be set in a secret called 'datadog' (as mentioned here), or in the spec.credentials.token field of the DatadogAgent config. Currently, I see that it is shown in the cluster agent example, but not in the 'all' example. I feel like having to set this token should be called out more explicitly in the docs.

Separately, I did think it was kind of odd that the api_key and the app_key can be configured through the DatadogAgent to get the value from a secret (instead of setting the plain text value in the config), but the token can only be set plainly in the config and can't be fetched from a secret. Was that intentional? That functionality might be nice too, so I could potentially just have one secret with my api/app keys and token for the cluster-agent.
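For reference, a sketch of setting the token directly in the DatadogAgent spec, using the v1alpha1 fields mentioned in this issue; the token value is a placeholder (its name in the issues below suggests a 32-character value is expected):

apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  credentials:
    apiKey: <DATADOG_API_KEY>
    appKey: <DATADOG_APP_KEY>
    # Cluster Agent token set in plain text in the spec (placeholder value).
    token: "<ThirtyX2XcharactersXlongXtoken>"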

Auto-created secret should get an auto-generated name

(this is on operator 0.5.0-rc2)

Right now, if a user describes their agent deployment like this:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  credentials:
    apiKey: <DATADOG_API_KEY>
    appKey: <DATADOG_APP_KEY>
  agent:
    image:
      name: "gcr.io/datadoghq/agent:latest"
  clusterAgent:
    image:
      name: "gcr.io/datadoghq/cluster-agent:latest"

the operator will create a secret called datadog (like the object name), but this way it is fairly easy to have a conflict in names.

Actually, if I try to create that object and I already have that secret created, the operator logs won't show any errors (they probably should) and the failure happens at the Datadog daemonset level:

    state:
      waiting:
        message: couldn't find key api_key in Secret default/datadog
        reason: CreateContainerConfigError

Ideally, we would do as Kubernetes does when automatically creating objects, where it does something like <name>-autogeneratednumber.
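For comparison, the Kubernetes-native way to get such a suffix is metadata.generateName, where the API server appends a random string to the prefix when the object is created (e.g. via kubectl create). A minimal sketch of what the issue is asking the operator to emulate; the key name is a placeholder:

apiVersion: v1
kind: Secret
metadata:
  # The API server replaces this with something like "datadog-x7k2p" at creation time.
  generateName: datadog-
type: Opaque
stringData:
  api_key: <DATADOG_API_KEY>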

Unable to get kubeStateMetricsCore feature working

datadog-cluster-checks-agent (7.26.0)

=========
Collector
=========

  Running Checks
  ==============
    No checks have run yet

  Loading Errors
  ==============
    kubernetes_state_core
    ---------------------
      Core Check Loader:
        Could not configure check kubernetes_state_core: temporary failure in apiserver, will retry later: check resources failed: event collection: "events is forbidden: User \"system:serviceaccount:datadog:datadog-cluster-checks-runner\" cannot list resource \"events\" in API group \"\" at the cluster scope"

      JMX Check Loader:
        check is not a jmx check, or unable to determine if it's so

      Python Check Loader:
        unable to import module 'kubernetes_state_core': No module named 'kubernetes_state_core'

DatadogAgent File (Operator version 0.5.0)

apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  credentials:
    apiSecret:
      keyName: api-key
      secretName: datadog
    appSecret:
      keyName: app-key
      secretName: datadog-appkey
  features:
    kubeStateMetricsCore:
      enabled: true
    orchestratorExplorer:
      enabled: true
    prometheusScrape:
      enabled: true
  agent:
    image:
      name: gcr.io/datadoghq/agent:latest
    log:
      enabled: true
      containerCollectUsingFiles: true
      logsConfigContainerCollectAll: true
    process:
      enabled: true
      processCollection: true
    systemProbe:
      enabled: true
      bpfDebugEnabled: true
    security:
      compliance:
        enabled: true
      runtime:
        enabled: false
    config:
      leaderElection: false
      criSocket:
        criSocketPath: /var/run/dockershim.sock
      tolerations:
        - operator: Exists 
  clusterAgent:
    image:
      name: gcr.io/datadoghq/cluster-agent:1.11.0-rc2
    config:
      clusterChecksEnabled: true
      collectEvents: true
      admissionController:
        enabled: true
        mutateUnlabelled: false
  clusterCheckRunner:
    image:
      name: gcr.io/datadoghq/agent:latest

Describe what happened:

Whenever the DatadogAgent resource is applied, we get all the pods that are expected. The Cluster Agent reports that it is dispatching the check:

2021-03-04 21:29:02 UTC | CLUSTER | INFO | (pkg/clusteragent/clusterchecks/dispatcher_main.go:138 in add) | Dispatching configuration kubernetes_state_core:314b92a3f57cdafa to node CORRECT_EC2_HOSTNAME_HERE

But the Cluster Checks Runner fails with the log below:

2021-03-04 21:29:03 UTC | CORE | INFO | (pkg/autodiscovery/config_poller.go:95 in poll) | cluster-checks provider: collected 1 new configurations, removed 0
2021-03-04 21:29:03 UTC | CORE | ERROR | (pkg/collector/corechecks/loader.go:72 in Load) | core.loader: could not configure check kubernetes_state_core: temporary failure in apiserver, will retry later: check resources failed: event collection: "events is forbidden: User \"system:serviceaccount:datadog:datadog-cluster-checks-runner\" cannot list resource \"events\" in API group \"\" at the cluster scope"
2021-03-04 21:29:03 UTC | CORE | ERROR | (pkg/collector/scheduler.go:248 in GetChecksFromConfigs) | Unable to load the check: unable to load any check from config 'kubernetes_state_core'

Steps to reproduce the issue:

As far as I can tell, all that is required to reproduce this is the following file

features:
  kubeStateMetricsCore:
    enabled: true
clusterAgent:
  image:
    name: "datadog/cluster-agent:1.11.0-rc2"
  config:
    clusterChecksEnabled: true
clusterChecksRunner:
  image:
    name: "datadog/agent:7.26.0"

Additional environment details (Operating System, Cloud provider, etc):

OS: Bottlerocket AMI from AWS
Cloud: AWS EKS

Notes

  • I tried changing the version of the clusterChecksRunner, but it seems to always use the latest tag.
  • I am using 1.11.0-rc2 because of #242
  • kubernetes version 1.19
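The forbidden error above suggests the datadog-cluster-checks-runner service account lacks permission to list events at the cluster scope. A hedged sketch of the kind of RBAC that would grant it; the resource names below are assumptions inferred from the error message, not the operator's actual generated manifests:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: datadog-cluster-checks-runner-events   # assumed name
rules:
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: datadog-cluster-checks-runner-events   # assumed name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: datadog-cluster-checks-runner-events
subjects:
  - kind: ServiceAccount
    name: datadog-cluster-checks-runner        # from the error message
    namespace: datadog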

How do we customize the Admission Controller webhook timeout?

Describe what happened:
The Admission Controller MutatingWebhookConfiguration is created with a 30s timeout. I would like to be able to lower this to 1 or 2 seconds. We ran into an issue over the weekend where a mistake in a NetworkPolicy caused the APIServer -> Datadog Operator API calls to fail. They took 30s to time out, which slowed the entire APIServer down. That ultimately led to a high rate of registration failures for new nodes joining our clusters. All of our other WebhookConfigurations have low timeouts set.

Name:         datadog-webhook
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  admissionregistration.k8s.io/v1
Kind:         MutatingWebhookConfiguration
Metadata:
...
  Resource Version:  42875294
  Self Link:         /apis/admissionregistration.k8s.io/v1/mutatingwebhookconfigurations/datadog-webhook
  UID:               deda10ff-ca41-4e2b-bf3c-ed02bac00af6
Webhooks:
  Admission Review Versions:
    v1beta1
  Client Config:
     Ca Bundle:  LS0t...
    Service:
      Name:        datadog-admission-controller
      Namespace:   datadog-operator
      Path:        /injecttags
      Port:        443
  Failure Policy:  Ignore
  Match Policy:    Exact
  Name:            datadog.webhook.tags
  Namespace Selector:
  Object Selector:
    Match Expressions:
      Key:       admission.datadoghq.com/enabled
      Operator:  NotIn
      Values:
        false
  Reinvocation Policy:  Never
  Rules:
    API Groups:
      
    API Versions:
      v1
    Operations:
      CREATE
    Resources:
      pods
    Scope:          *
  Side Effects:     None
  Timeout Seconds:  30
Events:             <none>

Describe what you expected:
I would like to be able to configure the timeout here and lower it to meet our operational needs. This configuration is not exposed currently (as far as I can tell) in the Datadog Operator or DatadogAgent CRD.
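For illustration only: timeoutSeconds is a standard field on MutatingWebhookConfiguration objects, so one could lower it manually with a JSON patch like the sketch below. This is a stopgap rather than a real fix, and since the operator manages datadog-webhook, it may reconcile the value back (an assumption, not verified):

$ kubectl patch mutatingwebhookconfiguration datadog-webhook --type=json \
    -p='[{"op": "replace", "path": "/webhooks/0/timeoutSeconds", "value": 2}]'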

Additional environment details (Operating System, Cloud provider, etc):

Datadog Operator v0.4.0, EKS Kubernetes 1.18

Configure Cluster Checks dynamically

It would be nice to have a ClusterCheck CRD, which could be deployed independently from the Cluster Agent deployment. Currently it is only possible to configure such a check via a ConfigMap and reference this ConfigMap in the DatadogAgent resource of the Cluster Agent.

This way the Cluster Agent could be deployed centrally without knowing about application specific cluster checks (such as checks on applications, running outside of the cluster).
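For context, the existing ConfigMap-based approach this issue refers to looks roughly like the sketch below, which would then be referenced from the Cluster Agent section of the DatadogAgent resource (exact reference fields depend on the operator version). The check name and endpoint are illustrative placeholders:

apiVersion: v1
kind: ConfigMap
metadata:
  name: http-check-config
  namespace: datadog
data:
  http_check.yaml: |-
    cluster_check: true
    instances:
      - name: example-external-service
        url: https://example.com/healthz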

If appKeyExistingSecret is specified, then the token secret fails to create

Output of the info page (if this is a bug)
What is the info page??

Describe what happened:

I have setup my app key in a secret that I have created previously, called secretapp. I set up my cluster token in plain text under the token field. This is a snippet of my agent spec:

spec:
  credentials:
    apiKeyExistingSecret: secretapi
    appKeyExistingSecret: secretapp
    token: "<ThirtyX2XcharactersXlongXtoken>"

The cluster-agent fails to run because it expects the secretapp secret to have a token key.

  Warning  Failed     2s (x3 over 16s)  kubelet, kind-worker2  Error: couldn't find key token in Secret datadog/secretapp

Describe what you expected:

If the token key is not there, edit the secret and add the token key using the string passed in the token spec.

Steps to reproduce the issue:

Create an agent and cluster agent with the appkey passed as secret and try to pass a token:

spec:
  credentials:
    apiKeyExistingSecret: secretapi
    appKeyExistingSecret: secretapp
    token: "<ThirtyX2XcharactersXlongXtoken>"

Default command to start the agent container should be "agent run"

Describe what happened:

The container starts with the command agent start instead of agent run, producing a warning:

Command "start" is deprecated, Use "run" instead to start the Agent

Describe what you expected:

Container starts with command agent run and no warning gets logged

Possible race condition involving service account token during reconciliation

Describe what happened:
I received a bunch of "no data" alerts from a cluster with agents deployed using this operator (v0.6.0). I checked the agents and several were stuck in an init phase:

# kubectl get pods -n datadog
NAME                                             READY   STATUS     RESTARTS   AGE
datadog-agent-4xncr                              3/3     Running    0          3h12m
datadog-agent-6nkd4                              3/3     Running    0          3h12m
datadog-agent-8pfkn                              3/3     Running    0          3h12m
datadog-agent-cggv9                              0/3     Init:0/3   0          3h13m
datadog-agent-ckxhp                              3/3     Running    0          3h12m
datadog-agent-ddlmq                              0/3     Init:0/3   0          3h13m
datadog-agent-dw2gc                              0/3     Init:0/3   0          3h13m
datadog-agent-hc8tk                              0/3     Init:0/3   0          3h13m
datadog-agent-hj8jv                              0/3     Init:0/3   0          3h13m
datadog-agent-kg9tw                              0/3     Pending    0          3h12m
datadog-agent-l8p5f                              0/3     Init:0/3   0          3h13m
datadog-agent-v6bvt                              3/3     Running    0          3h12m
datadog-agent-w59gz                              3/3     Running    0          3h12m
datadog-agent-x7ssf                              0/3     Init:0/3   0          3h13m
datadog-agent-x8wgx                              3/3     Running    0          3h12m
datadog-cluster-agent-b6b54c6c8-4jpzh            1/1     Running    0          3h13m
datadog-cluster-agent-b6b54c6c8-7k2ps            1/1     Running    0          3h13m
datadog-cluster-checks-runner-57689b4869-7qddl   1/1     Running    0          3h13m
datadog-cluster-checks-runner-57689b4869-q8ldw   1/1     Running    0          3h13m
datadog-operator-65cc54d98b-6w2bl                1/1     Running    9          80d

It was clear that 3 hours prior, the operator restarted, which triggered a redeployment of all child resources associated with our DatadogAgent object during reconciliation.

The error from the operator logs appears to indicate some type of temporary connectivity problem with the Kubernetes API:

# kubectl logs  datadog-operator-65cc54d98b-6w2bl -n datadog -p
 ...
 ...
 ...
{"level":"ERROR","ts":"2021-08-03T20:11:51Z","logger":"klog","msg":"Failed to update lock: Put \"https://10.239.224.1:443/api/v1/namespaces/datadog/configmaps/datadog-operator-lock\": context deadline exceeded\n"}
{"level":"INFO","ts":"2021-08-03T20:11:51Z","logger":"klog","msg":"failed to renew lease datadog/datadog-operator-lock: timed out waiting for the condition\n"}
{"level":"ERROR","ts":"2021-08-03T20:11:51Z","logger":"setup","msg":"Problem running manager","error":"leader election lost"}

Several of the newly-created pods were still referencing an old service account token, which had also been redeployed:

# kubectl describe pod datadog-agent-cggv9 -n datadog
...
...
...

  Events:
  Type     Reason       Age                     From     Message
  ----     ------       ----                    ----     -------
  Warning  FailedMount  9m11s (x94 over 3h14m)  kubelet  MountVolume.SetUp failed for volume "datadog-agent-token-xhnfh" : secret "datadog-agent-token-xhnfh" not found
  Warning  FailedMount  4m3s (x83 over 178m)    kubelet  (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[datadog-agent-token-xhnfh], unattached volumes=[src dsdsocket seccomp-root checksd cgroups config debugfs host-osrelease procdir installinfo datadog-agent-token-xhnfh passwd modules confd datadog-agent-auth datadog-agent-security logdatadog sysprobe-socket-dir system-probe-config runtimesocketdir]: timed out waiting for the condition
# kubectl get secret -n datadog | grep datadog-agent-token-
datadog-agent-token-2jdhr                   kubernetes.io/service-account-token   3      3h14m

It appears that datadog-agent-token-xhnfh was the old token, whereas the current one is actually datadog-agent-token-2jdhr.

Describe what you expected:
The behavior we're seeing deviates from what's expected in a couple of ways:

First, I wouldn't expect a redeployment of all resources in the event of a temporary problem accessing the Kubernetes API (if I'm assessing that as the root cause of this event correctly).
Second, I wouldn't expect newly-generated pods to have specs that reference the old service account secret.

Steps to reproduce the issue:

Additional environment details (Operating System, Cloud provider, etc):
Ubuntu 18.04
Azure AKS (1.19.7)
Datadog Operator v0.6.0

External Metrics / HPA integration not working

Describe what happened:

  • Deployed Datadog Operator (using the latest Datadog Operator built from master)
  • Deployed a DatadogAgent with this configuration:
apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
  namespace: datadog
spec:
  agent:
    apm:
      enabled: true
      hostPort: 8126
    config:
      collectEvents: true
      criSocket:
        criSocketPath: /var/run/containerd/containerd.sock
        useCriSocketVolume: true
      env:
      - name: DD_TAGS
        value: cluster_name:relevant-cheetah,environment:dev,region:eu-west-1,account:eu-dev
      - name: DD_DOGSTATSD_NON_LOCAL_TRAFFIC
        value: "true"
      - name: DD_APM_NON_LOCAL_TRAFFIC
        value: "true"
      - name: DD_AC_EXCLUDE
        value: image:602401143452.dkr.ecr.us-west-2.amazonaws.com/cni-metrics-helper
          image:602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni image:576513738724.dkr.ecr.eu-west-1.amazonaws.com/tools/datadog-agent
          image:datadog/cluster-agent image:datadog/agent
      hostPort: 8125
      leaderElection: true
      logLevel: INFO
      tolerations:
      - operator: Exists
    image:
      name: datadog/agent:7.20.2
    log:
      enabled: true
      logsConfigContainerCollectAll: true
    process:
      enabled: true
    systemProbe:
      bpfDebugEnabled: true
      enabled: true
  clusterAgent:
    config:
      clusterChecksEnabled: true
      env:
      - name: DD_TAGS
        value: cluster_name:relevant-cheetah,environment:dev,region:eu-west-1,account:eu-dev
      - name: DD_EXTRA_CONFIG_PROVIDERS
        value: kube_services
      - name: DD_EXTRA_LISTENERS
        value: kube_services
      - name: DATADOG_HOST
        value: https://app.datadoghq.eu
      externalMetrics:
        enabled: true
      metricsProviderEnabled: true
    image:
      name: datadog/cluster-agent:1.6.0
    replicas: 2
  credentials:
    apiKey: xxxxx
    appKey: xxxxx
  site: datadoghq.eu
  • Deployed a test HPA for an Nginx Deployment and described it:

$ k describe hpa nginxext
Name:                                                nginxext
Namespace:                                           default
Labels:                                              <none>
Annotations:                                         CreationTimestamp:  Wed, 08 Jul 2020 10:05:00 +0200
Reference:                                           Deployment/nginx
Metrics:                                             ( current / target )
  "nginx.net.request_per_s" (target average value):  <unknown> / 9
Min replicas:                                        1
Max replicas:                                        5
Deployment pods:                                     1 current / 0 desired
Conditions:
  Type           Status  Reason                   Message
  ----           ------  ------                   -------
  AbleToScale    True    SucceededGetScale        the HPA controller was able to get the target's current scale
  ScalingActive  False   FailedGetExternalMetric  the HPA was unable to compute the replica count: unable to get external metric default/nginx.net.request_per_s/&LabelSelector{MatchLabels:map[string]string{kube_container_name: nginx,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: nginx.net.request_per_s.external.metrics.k8s.io is forbidden: User "system:serviceaccount:kube-system:horizontal-pod-autoscaler" cannot list resource "nginx.net.request_per_s" in API group "external.metrics.k8s.io" in the namespace "default"
Events:
  Type     Reason                        Age                   From                       Message
  ----     ------                        ----                  ----                       -------
  Warning  FailedComputeMetricsReplicas  10m (x12 over 13m)    horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get nginx.net.request_per_s external metric: unable to get external metric default/nginx.net.request_per_s/&LabelSelector{MatchLabels:map[string]string{kube_container_name: nginx,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: nginx.net.request_per_s.external.metrics.k8s.io is forbidden: User "system:serviceaccount:kube-system:horizontal-pod-autoscaler" cannot list resource "nginx.net.request_per_s" in API group "external.metrics.k8s.io" in the namespace "default"
  Warning  FailedGetExternalMetric       3m32s (x41 over 13m)  horizontal-pod-autoscaler  unable to get external metric default/nginx.net.request_per_s/&LabelSelector{MatchLabels:map[string]string{kube_container_name: nginx,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: nginx.net.request_per_s.external.metrics.k8s.io is forbidden: User "system:serviceaccount:kube-system:horizontal-pod-autoscaler" cannot list resource "nginx.net.request_per_s" in API group "external.metrics.k8s.io" in the namespace "default"
  • After checking the documentation provided by Datadog, I wasn't able to find any of the resources documented in rbac-hpa.yaml actually created.

Describe what you expected:

  • I would expect the HPA to be able to query the external metrics API and get the metrics
  • I would expect the resources described in rbac-hpa.yaml to be created by Datadog Operator when we configure a Cluster Agent with External Metrics enabled.

Steps to reproduce the issue:

  • Deploy Datadog operator in a new cluster
  • Deploy a Datadog Agent with externalMetrics.enabled: true
  • Deploy the test Nginx manifest
  • Deploy the test HPA
  • Run kubectl describe hpa nginxext and you should see similar errors to what I've posted.

Additional environment details (Operating System, Cloud provider, etc):

  • Running EKS 1.16, on AWS, with the latest AMI.
  • Using site Datadoghq.eu

Wondering if I might have missed a configuration flag?
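For reference, the missing piece described above is RBAC allowing the horizontal-pod-autoscaler controller to read the external metrics API. A hedged sketch of what rbac-hpa.yaml is expected to provide; the ClusterRole/ClusterRoleBinding names are assumptions, while the subject is taken directly from the error message:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: external-metrics-reader        # assumed name
rules:
  - apiGroups: ["external.metrics.k8s.io"]
    resources: ["*"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: external-metrics-reader        # assumed name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: external-metrics-reader
subjects:
  - kind: ServiceAccount
    name: horizontal-pod-autoscaler    # from the error message
    namespace: kube-system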

cri check fails when running on docker

Describe what happened:

datadog-agent created by datadog-operator running on docker logs errors like this:

2020-04-06 12:05:53 UTC | CORE | ERROR | (pkg/collector/runner/runner.go:292 in work) | Error running check cri: temporary failure in criutil, will retry later: try delay not elapsed yet
2020-04-06 12:05:58 UTC | CORE | WARN | (pkg/collector/python/datadog_agent.go:118 in LogMessage) | kubelet:d884b5186b651429 | (kubelet.py:429) | GET on kubelet s `/stats/summary` failed: 403 Client Error: Forbidden for url: https://<REDACTED>:10250/stats/summary?verbose=True
2020-04-06 12:06:08 UTC | CORE | WARN | (pkg/collector/corechecks/checkbase.go:165 in Warnf) | Error initialising check: temporary failure in criutil, will retry later: try delay not elapsed yet

This looks like the same issue as criutil failure that was resolved by Do not enable the cri check when running on a docker setup.

The agent has the environment variable

      DD_CRI_SOCKET_PATH:                     /host/var/run/docker.sock

Describe what you expected:

That the agent would not try to use the docker socket as a CRI socket.

Steps to reproduce the issue:

  • Launch a GKE cluster with Ubuntu + Docker.
  • Deploy DataDog operator
  • Deploy an agent CR:
apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog-agent
  namespace: datadog
spec:
  credentials:
    apiKeyExistingSecret: dd-api-key
  agent:
    config:
      tolerations:
      - operator: Exists
    apm:
      enabled: true
    logs:
      enabled: true
    image:
      name: "datadog/agent:7.18.1"
    process:
      enabled: true
    systemProbe:
      enabled: true

Additional environment details (Operating System, Cloud provider, etc):

GKE. Ubuntu with Docker.

Feature request: Configuring the agent via helm-chart values

Describe what happened:

The current recommended way to install the operator is:

  1. Deploy using a helm chart (helm chart values apply to the operator only)
  2. kubectl to apply the config to the agent

Describe what you expected:

Being able to configure the agent by passing the agent config directly to the operator Helm chart. This would save a step and avoid having to store the configuration elsewhere.

(I'm fairly new to k8s / helm so I might be off base here, consider this newbie feedback)

Additional environment details (Operating System, Cloud provider, etc):

We use helm_release in terraform to manage our Helm charts.

Error validating data

Hello.

I omitted some info from the issue template because I don't know what the "Info page" is.

Describe what happened:
Got the error trying to deploy Datadog Agent with Operator:

running kubectl: exit status 1, stderr: error: error validating \"STDIN\": error validating data: ValidationError(DatadogAgent.spec.agent.process): unknown field \"processCollection\" in com.datadoghq.v1alpha1.DatadogAgent.spec.agent.process; if you choose to ignore these errors, turn validation off with --validate=false"

Describe what you expected:
Agent instance is created in desired configuration

Steps to reproduce the issue:

  1. Deploy Datadog Operator using Helm Operator:
apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: datadog-operator
  namespace: monitoring
spec:
  chart:
    repository: "https://helm.datadoghq.com"
    name: "datadog-operator"
    version: "0.4.0"
  values:
    # logLevel -- Set Datadog Operator log level (debug, info, error, panic, fatal)
    logLevel: "error"

    serviceAccount:
      # serviceAccount.name -- The name of the service account to use. If not set name is generated using the fullname template
      name: datadog-operator

    # installCRDs -- Set to true to deploy the Datadog's CRDs
    installCRDs: true

    datadog-crds:
      crds:
        # datadog-crds.crds.datadogAgents -- Set to true to deploy the DatadogAgents CRD
        datadogAgents: true
        # datadog-crds.crds.datadogMetrics -- Set to true to deploy the DatadogMetrics CRD
        datadogMetrics: true
  2. Deploy the Datadog agent resource:
apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog-agent
  namespace: monitoring
spec:
  # see:
  #   * https://github.com/DataDog/datadog-operator/blob/master/docs/secret_management.md
  credentials:
    apiKeyExistingSecret: datadog-api-key

  agent:
    image:
      # Define the image to use Use "datadog/agent:latest" for Datadog Agent 6 Use "datadog/dogstatsd:latest" for
      # Standalone Datadog Agent DogStatsD6 Use "datadog/cluster-agent:latest" for Datadog Cluster Agent
      name: "gcr.io/datadoghq/agent:latest"
    config:
      # Enable this to start event collection from the Kubernetes API
      # ref: https://docs.datadoghq.com/agent/kubernetes/event_collection/
      collectEvents: true

    # ref: https://github.com/DataDog/docker-dd-agent#tracing-from-the-host
    apm:
      # Enable this to enable APM and tracing, on port 8126
      enabled: false

    # ref.: https://docs.datadoghq.com/agent/basic_agent_usage/kubernetes/#log-collection-setup
    log:
      # Enables this to activate Datadog Agent log collection.
      enabled: true
      # Enable this to allow log collection for all containers.
      logsConfigContainerCollectAll: true

    # ref: https://docs.datadoghq.com/graphing/infrastructure/process/#kubernetes-daemonset
    process:
      # Enable this to activate the process Agent and live container monitoring.
      # Note: /etc/passwd is automatically mounted to allow username resolution.
      enabled: true
      # Enable this to activate the process collection in addition to container collection.
      processCollection: true

    rbac:
      # Used to configure RBAC resources creation
      create: true
      # Used to set up the service account name to use Ignored if the field Create is true
      serviceAccountName: datadog-agent

    # ref: https://docs.datadoghq.com/graphing/infrastructure/process/#kubernetes-daemonset
    systemProbe:
      # Enable this to activate live process monitoring. Note: /etc/passwd is automatically mounted to allow username
      # resolution.
      enabled: true
      # BPFDebugEnabled logging for kernel debug
      bpfDebugEnabled: true

  # The site of the Datadog intake to send Agent data to. Set to 'datadoghq.eu' to send data to the EU site.
  site: datadoghq.eu

Additional environment details (Operating System, Cloud provider, etc):

  1. AWS EKS
  2. EKS optimized AMI x64

The operator deletes an existing external metrics API service when `externalMetrics` is set to `false` or unset

Describe what happened:
The operator deletes an existing external metrics APIService when externalMetrics is set to false or unset (false is the default). This is most likely because it calls the cleanupMetricsServerAPIService function when external metrics is not enabled, but it does not check whether the existing APIService was actually created by Datadog.

Currently this blocks us from using the operator and it broke our HPA based on CloudWatch metrics.

Describe what you expected:
I would expect the operator to leave the external metrics APIService in place when they are not owned by the operator

Steps to reproduce the issue:

configure an external metric APIService like this

apiVersion: apiregistration.k8s.io/v1beta1
kind: APIService
metadata:
  name: v1beta1.external.metrics.k8s.io
spec:
  service:
    name: my-adapter
    namespace: my-namespace-not-used-by-datadog
  group: external.metrics.k8s.io
  version: v1beta1
  insecureSkipTLSVerify: true
  groupPriorityMinimum: 100
  versionPriority: 100

Then deploy the datadog operator with externalMetrics: false or unset
After a while it will remove the existing metrics APIService.

Additional environment details (Operating System, Cloud provider, etc):

  • Datadog operator 0.4.0 deployed into separate datadog namespace
  • Kubernetes 1.18
  • Cloud Provider: AWS

Error when trying to deploy the cluster agent with no app key

Version of the operator: 0.4.0.

Describe what happened:

I tried to deploy the node agent and 1 replica of the cluster agent, without metrics server, but got the following error:

Status:
  Conditions:
    Last Transition Time:  2021-01-22T11:54:34Z
    Last Update Time:      2021-01-22T11:59:36Z
    Message:               secrets "datadog-secret" already exists
    Status:                True
    Type:                  ReconcileError
    Last Transition Time:  2021-01-22T11:54:34Z
    Last Update Time:      2021-01-22T11:59:34Z
    Message:               Datadog metrics forwarding error
    Status:                False
    Type:                  ActiveDatadogMetrics
Events:                    <none>

And logs:

2021-01-22T11:54:34.704Z	INFO	DatadogMetricForwarders	New Datadog metrics forwarder registred	{"ID": "default/datadog"}
2021-01-22T11:54:34.710Z	INFO	DatadogMetricForwarders	Starting Datadog metrics forwarder	{"CustomResource.Namespace": "default", "CustomResource.Name": "datadog"}
2021-01-22T11:54:34.713Z	INFO	DatadogMetricForwarders	Getting Datadog credentials	{"CustomResource.Namespace": "default", "CustomResource.Name": "datadog"}
2021-01-22T11:54:34.713Z	INFO	DatadogMetricForwarders	Got Datadog Site	{"CustomResource.Namespace": "default", "CustomResource.Name": "datadog", "site": "https://api.datadoghq.com"}
2021-01-22T11:54:34.713Z	ERROR	DatadogMetricForwarders	cannot get Datadog credentials,  will retry later...	{"CustomResource.Namespace": "default", "CustomResource.Name": "datadog", "error": "Secret \"datadog\" not found"}
2021-01-22T11:54:34.714Z	INFO	DatadogAgent	Reconciling DatadogAgent	{"Request.Namespace": "default", "Request.Name": "datadog"}
2021-01-22T11:54:34.714Z	INFO	DatadogAgent	Adding Finalizer for the DatadogAgent	{"Request.Namespace": "default", "Request.Name": "datadog"}
2021-01-22T11:54:34.734Z	INFO	DatadogAgent	Reconciling DatadogAgent	{"Request.Namespace": "default", "Request.Name": "datadog"}
2021-01-22T11:54:34.734Z	INFO	DatadogAgent	Defaulting values	{"Request.Namespace": "default", "Request.Name": "datadog"}
2021-01-22T11:54:34.764Z	INFO	DatadogAgent	Reconciling DatadogAgent	{"Request.Namespace": "default", "Request.Name": "datadog"}
2021-01-22T11:54:34.804Z	ERROR	controller-runtime.controller	Reconciler error	{"controller": "datadogdeployment-controller", "request": "default/datadog", "error": "secrets \"datadog-secret\" already exists"}
2021-01-22T11:54:35.805Z	INFO	DatadogAgent	Reconciling DatadogAgent	{"Request.Namespace": "default", "Request.Name": "datadog"}
2021-01-22T11:54:35.830Z	ERROR	controller-runtime.controller	Reconciler error	{"controller": "datadogdeployment-controller", "request": "default/datadog", "error": "secrets \"datadog-secret\" already exists"}
2021-01-22T11:54:36.830Z	INFO	DatadogAgent	Reconciling DatadogAgent	{"Request.Namespace": "default", "Request.Name": "datadog"}
2021-01-22T11:54:36.854Z	ERROR	controller-runtime.controller	Reconciler error	{"controller": "datadogdeployment-controller", "request": "default/datadog", "error": "secrets \"datadog-secret\" already exists"}
2021-01-22T11:54:37.855Z	INFO	DatadogAgent	Reconciling DatadogAgent	{"Request.Namespace": "default", "Request.Name": "datadog"}
2021-01-22T11:54:37.883Z	ERROR	controller-runtime.controller	Reconciler error	{"controller": "datadogdeployment-controller", "request": "default/datadog", "error": "secrets \"datadog-secret\" already exists"}

It feels like two issues are happening: the auth token secret is not created automatically, and the operator also tries to create the datadog-secret secret (which I had already created).

Describe what you expected:

The agents are created correctly.

Steps to reproduce the issue:

Use datadog operator 0.4.0 with the following DatadogAgent definition:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  credentials:
    apiSecret:
      secretName: datadog-secret
      keyName: api-key
  agent:
    env:
      - name: DD_KUBELET_TLS_VERIFY
        value: "false"
    image:
      name: "datadog/agent:latest"
    apm:
      enabled: true
    process:
      enabled: true
    log:
      enabled: true
    config:
      tolerations:
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
  clusterAgent:
    image:
      name: "datadog/cluster-agent:latest"
    config:
      clusterChecksEnabled: true
    replicas: 1

Additional environment details (Operating System, Cloud provider, etc):

Automatic sidecar injection of datadog agent (for eg running on AWS Fargate)

When running containers on AWS Fargate, there is no possibility to run datadog-agent as a DaemonSet. The only possible way (AFAIK) is by adding a sidecar container to the pod.

It would be useful if datadog(-operator) would have a feature where it can (optionally) automatically inject datadog-agent containers into pods (similar to what eg Istio or AWS AppMesh do).

Is this feature on the roadmap somewhere?

Any eta on Event Monitor type?

Describe what happened:
When adding a monitor like the "error rate for latest deployment" one, the event alert type is not supported. I have tried with all the supported types: query, metric, service, and composite.

Describe what you expected:
A monitor created similar to the "suggested" monitors for error rate on latest deployment

Steps to reproduce the issue:
add an alert with
events('tags:deployment_analysis,env:{{ $.Values.environment }},service:{{ $serviceName }}-{{ $value }} priority:all').rollup('count').last('15m') > 0
this is returned

error validating monitor: 400 Bad Request: {"errors": ["The value provided for parameter ''query'' is invalid"]}

Additional environment details (Operating System, Cloud provider, etc):
Operator v0.7.6

Add the secrets helper to datadog-operator

The datadog operator should have the default secrets manager helper baked into the docker image like the datadog agent. This would allow to easily use the same secret manager for operator, cluster-agent and agent in one shot.

Further, the operator should be easily configurable to support mounting a secret volume and configuring the command + args variables with the defaults. The user should only have to supply a secretName value and perhaps a flag to enable secrets to be read from file.

https://github.com/DataDog/datadog-agent/blob/master/Dockerfiles/agent/secrets-helper/readsecret.py

Custom Resources for DataDog resources

I apologize in advance if this isn't the right place for these sorts of questions, or if I missed this on the roadmap elsewhere.

Has there been any discussion around the DataDog operator handling custom resources for Monitors, Dashboards, etc? Idea being that I'd like to template and provision DataDog resources using the same kube manifests/charts that I deploy the rest of my infrastructure with. This is similar to how the prometheus operator supports PrometheusRule resources.

I previously hacked out this concept as part of playing with a python operator framework: https://github.com/mzizzi/dogkop

I'd imagined being able to create DataDog Monitor resources as follows:

apiVersion: datadog.mzizzi/v1
kind: Monitor
metadata:
  name: my-monitor
spec:
  type: metric alert
  query: avg(last_5m):sum:system.net.bytes_rcvd{host:host0} > 100
  name: Bytes received on host0
  message: We may need to add web hosts if this is consistently high.
  tags:
    - foo:bar
  options:
    notify_no_data: true
    no_data_timeframe: 20

This could be extremely powerful when coupled with the DataDog agent's OpenMetrics/Prometheus checks.

Monitors modified out-of-band are not reconciled

Describe what happened:

I've successfully created a Monitor with the following manifest:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogMonitor
metadata:
  name: hello-world-web
  namespace: hello-world
spec:
  query: "avg(last_5m):avg:kubernetes_state.hpa.desired_replicas{hpa:hello-world-web,kube_namespace:hello-world} / avg:kubernetes_state.hpa.max_replicas{hpa:hello-world-web,kube_namespace:hello-world} >= 1"
  type: "query alert"
  name: "HPA TEST"
  message: "Number of replicas for HPA hello-world-web in namespace hello-world, has reached maximum threshold @target1"
  tags:
    - "service:hello-world"

I then used the DataDog console to edit that Monitor. I modified the message, changing @target1 to @target2

Describe what you expected:

I expected the controller to reconcile the monitor, changing @target2 back to @target1. This never happened, even though the resource appears to be continually synced successfully:

kubectl describe datadog-monitor hello-world
Name:         hello-world-web
Namespace:    hello-world
Labels:       <none>
Annotations:  <none>
API Version:  datadoghq.com/v1alpha1
Kind:         DatadogMonitor
Metadata:
  Creation Timestamp:  2021-12-15T12:53:41Z
  Finalizers:
    finalizer.monitor.datadoghq.com
  Generation:  3
  Managed Fields:
    API Version:  datadoghq.com/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:message:
        f:name:
        f:query:
        f:type:
    Manager:      kubectl-client-side-apply
    Operation:    Update
    Time:         2021-12-15T12:53:41Z
    API Version:  datadoghq.com/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"finalizer.monitor.datadoghq.com":
      f:spec:
        f:options:
        f:tags:
      f:status:
        .:
        f:conditions:
          .:
          k:{"type":"Active"}:
            .:
            f:lastTransitionTime:
            f:lastUpdateTime:
            f:message:
            f:status:
            f:type:
          k:{"type":"Created"}:
            .:
            f:lastTransitionTime:
            f:lastUpdateTime:
            f:message:
            f:status:
            f:type:
        f:created:
        f:creator:
        f:currentHash:
        f:downtimeStatus:
        f:id:
        f:monitorState:
        f:monitorStateLastTransitionTime:
        f:monitorStateLastUpdateTime:
        f:primary:
        f:syncStatus:
    Manager:         manager
    Operation:       Update
    Time:            2021-12-15T12:53:41Z
  Resource Version:  91886916
  UID:               91c8a22f-818d-40ee-a014-dcf9f8c84a52
Spec:
  Message:  Number of replicas for HPA hello-world-web in namespace hello-world, has reached maximum threshold @slack-team-example
  Name:     [kubernetes] MARCTEST hello-world-web HPA reached max replicas
  Options:
  Query:  avg(last_5m):avg:kubernetes_state.hpa.desired_replicas{hpa:hello-world-web,kube_namespace:hello-world} / avg:kubernetes_state.hpa.max_replicas{hpa:hello-world-web,kube_namespace:hello-world} >= 1
  Tags:
    service:hello-world
    generated:kubernetes
  Type:  query alert
Status:
  Conditions:
    Last Transition Time:  2021-12-15T12:53:41Z
    Last Update Time:      2021-12-15T13:17:41Z
    Message:               DatadogMonitor ready
    Status:                True
    Type:                  Active
    Last Transition Time:  2021-12-15T12:53:41Z
    Last Update Time:      2021-12-15T12:53:41Z
    Message:               DatadogMonitor Created
    Status:                True
    Type:                  Created
  Created:                 2021-12-15T12:53:41Z
  Creator:                 redacted
  Current Hash:            a4ed04577b43c8b209ec6c3bb489b179
  Downtime Status:
  Id:                                  58175053
  Monitor State:                       OK
  Monitor State Last Transition Time:  2021-12-15T12:55:41Z
  Monitor State Last Update Time:      2021-12-15T13:17:41Z
  Primary:                             true
  Sync Status:                         OK
Events:
  Type    Reason                 Age   From            Message
  ----    ------                 ----  ----            -------
  Normal  Create DatadogMonitor  24m   DatadogMonitor  hello-world/hello-world-web

A brief look at the code suggests that a hash of the spec is being compared, so I'm surprised the operator isn't picking this up.

I can't use the chart as a dependency

Output of the info page (if this is a bug)

I created a new chart by specifying datadog-operator as a dependency, in order to describe kind: DatadogAgent in the templates of my chart.

Describe what happened:

Unfortunately, when I try to run it I get

> helm upgrade datadog-operator ops/Helm/addons/datadog --install --wait --namespace=datadog-operator --create-namespace=true --set apiKey=$API_KEY

> Release "datadog-operator" does not exist. Installing it now.
> Error: unable to build kubernetes objects from release manifest: unable to recognize "": no matches for kind "DatadogAgent" in version "datadoghq.com/v1alpha1"

Describe what you expected:

successful installation

Steps to reproduce the issue:

my GitLab CI job

Additional environment details (Operating System, Cloud provider, etc):

The DCA token doesn't seem to be propagating to the process-agent

Describe what happened:

I am trying to set up the Orchestrator Explorer with operator 0.5.0-RC2, this is the DatadogAgent config that I am using:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  clusterName: katacoda
  credentials:
    apiSecret:
      secretName: datadog-secret
      keyName: api-key
    appSecret:
      secretName: datadog-secret
      keyName: app-key
  agent:
    env:
      - name: DD_KUBELET_TLS_VERIFY
        value: "false"
      - name: DD_CLUSTER_NAME
        value: "katacoda"
    image:
      name: "datadog/agent:latest"
    apm:
      enabled: true
    process:
      enabled: true
      processCollectionEnabled: true
    log:
      enabled: true
    config:
      logLevel: debug
      tolerations:
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
  clusterAgent:
    config:
      logLevel: info
    image:
      name: "datadog/cluster-agent:latest"
  features:
    orchestratorExplorer:
      enabled: true

But the pod check fails, because it cannot connect to the cluster agent. This is what it shows in the process-agent logs:

2021-02-16 11:45:18 UTC | PROCESS | INFO | (pkg/forwarder/forwarder.go:379 in Start) | Forwarder started, sending to 1 endpoint(s) with 1 worker(s) each: "https://orchestrator.datadoghq.com" (1 api key(s))
2021-02-16 11:45:18 UTC | PROCESS | DEBUG | (pkg/util/kubernetes/clustername/clustername.go:145 in GetClusterID) | Cluster ID env variable DD_ORCHESTRATOR_CLUSTER_ID is missing, calling the Cluster Agent
2021-02-16 11:45:18 UTC | PROCESS | DEBUG | (pkg/util/clusteragent/clusteragent.go:175 in getClusterAgentEndpoint) | Identified service for the Datadog Cluster Agent: datadog-cluster-agent
2021-02-16 11:45:18 UTC | PROCESS | DEBUG | (pkg/api/security/security.go:196 in getClusterAgentAuthToken) | Empty cluster_agent.auth_token, loading from /etc/datadog-agent/cluster_agent.auth_token
2021-02-16 11:45:18 UTC | PROCESS | DEBUG | (pkg/util/clusteragent/clusteragent.go:92 in GetClusterAgentClient) | Cluster Agent init error: temporary failure in clusterAgentClient, will retry later: empty cluster_agent.auth_token and cannot find "/etc/datadog-agent/cluster_agent.auth_token": stat /etc/datadog-agent/cluster_agent.auth_token: ********
2021-02-16 11:45:18 UTC | PROCESS | ERROR | (collector.go:109 in runCheck) | Unable to run check 'pod': temporary failure in clusterAgentClient, will retry later: empty cluster_agent.auth_token and cannot find "/etc/datadog-agent/cluster_agent.auth_token": stat /etc/datadog-agent/cluster_agent.auth_token: ********

But my node agents seems to be connecting correctly to the DCA:

=====================
Datadog Cluster Agent
=====================
  - Datadog Cluster Agent endpoint detected: https://10.107.209.56:5005
  Successfully connected to the Datadog Cluster Agent.
  - Running: 1.10.0+commit.a285fcc

Describe what you expected:

The process-agent is able to connect to the DCA and the pod check runs successfully

Steps to reproduce the issue:

  1. Deploy operator 0.5.0-RC2
  2. Deploy the DatadogAgent described above

Configuration of Tag Extraction from Node Labels

Datadog supports the extraction of tags from various sources: pod labels, pod annotations, and node labels (see https://docs.datadoghq.com/agent/kubernetes/tag/?tab=containerizedagent).

Currently, it appears that the operator only supports the configuration of tag extraction via pod labels and pod annotations via agent.config.podLabelsAsTags and agent.config.podAnnotationsAsTags, respectively. Is there a reason why something like agent.config.nodeLabelsAsTags doesn't exist?
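For comparison, the pod-level options mentioned above look like the sketch below in the DatadogAgent spec, and a node-level equivalent could plausibly mirror them. The nodeLabelsAsTags field is hypothetical (it is what this issue is asking for) and does not exist in the CRD as described:

agent:
  config:
    podLabelsAsTags:
      app: kube_app
    podAnnotationsAsTags:
      "example.com/team": team
    # Hypothetical field requested by this issue; not currently part of the CRD:
    # nodeLabelsAsTags:
    #   "topology.kubernetes.io/zone": zone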

Consider listing operator in Artifact Hub

Hi! 👋

Have you considered listing the datadog operator directly in Artifact Hub?

At the moment it is already listed there, because the Artifact Hub team has added the community-operators repository. However, listing it yourself directly has some benefits:

  • You add your repository once, and new versions (or even new operators) committed to your git repository will be indexed automatically and listed in Artifact Hub, with no extra PRs needed.
  • You can display the Verified Publisher label in your operators, increasing their visibility and potentially the users' trust in your content.
  • Increased visibility of your organization in urls and search results. Users will be able to see your organization's description, a link to the home page and search for other content published by you.
  • If something goes wrong indexing your repository, you will be notified and you can even inspect the logs to check what went wrong.

If you decide to go ahead, you just need to sign in and add your repository from the control panel. You can add it using a single user or create an organization for it, whatever suits your needs best.

You can find some notes about the expected repository url format and repository structure in the repositories guide. There is also available an example of an operator repository already listed in Artifact Hub in the documentation. Operators are expected to be packaged using the format defined in the Operator Framework documentation to facilitate the process.

Please let me know if you have any questions or if you encounter any issue during the process 🙂

CRI Check Fails on EKS with Bottlerocket

Output of the info page (if this is a bug)

    cri
    ---
      Instance ID: cri [ERROR]
      Configuration Source: file:/etc/datadog-agent/conf.d/cri.d/conf.yaml.default
      Total Runs: 134
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 62ms
      Last Execution Date : 2021-06-25 17:10:15 UTC (1624641015000)
      Last Successful Execution Date : Never
      Error: temporary failure in criutil, will retry later: try delay not elapsed yet
      No traceback
      Warning: Error initialising check: temporary failure in criutil, will retry later: try delay not elapsed yet

Describe what happened:

CRI checks are failing on EKS running Bottlerocket.

2021-06-25 17:13:16 UTC | CORE | ERROR | (pkg/collector/runner/runner.go:301 in work) | Error running check cri: temporary failure in criutil, will retry later: failed to dial: context deadline exceeded
2021-06-25 17:13:30 UTC | CORE | WARN | (pkg/collector/corechecks/checkbase.go:165 in Warnf) | Error initialising check: temporary failure in criutil, will retry later: try delay not elapsed yet
2021-06-25 17:13:30 UTC | CORE | ERROR | (pkg/collector/runner/runner.go:301 in work) | Error running check cri: temporary failure in criutil, will retry later: try delay not elapsed yet

Describe what you expected:

CRI checks should pass.

Steps to reproduce the issue:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
  namespace: monitoring
spec:
  credentials:
    apiSecret:
      secretName: datadog-secret
      keyName: api-key
    appSecret:
      secretName: datadog-secret
      keyName: app-key
  agent:
    image:
      name: "gcr.io/datadoghq/agent:latest"
    apm:
      enabled: false
    process:
      enabled: true
      processCollectionEnabled: true
    log:
      enabled: true
      logsConfigContainerCollectAll: true
      containerCollectUsingFiles: true
    config:
      criSocket:
        criSocketPath: /run/dockershim.sock
      collectEvents: true
  clusterAgent:
    image:
      name: "gcr.io/datadoghq/cluster-agent:latest"

Additional environment details (Operating System, Cloud provider, etc):

EKS v1.20 on AWS with Bottlerocket AMI bottlerocket-aws-k8s-1.20-x86_64-v1.1.1-28c4e013.

Enable DD_APM_NON_LOCAL_TRAFFIC and hostPort by default if apm=true

If a user has a DatadogAgent configuration with apm.enabled: true, it is very confusing that the hostPort is not enabled by default and that DD_APM_NON_LOCAL_TRAFFIC is not set to true by default.

This is the behaviour of the Helm chart, and removing it for the operator is very confusing.
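Until the defaults change, the explicit configuration that matches the Helm chart behaviour looks roughly like this; the fields and env var are taken from the v1alpha1 manifests elsewhere on this page:

agent:
  apm:
    enabled: true
    hostPort: 8126
  config:
    env:
      - name: DD_APM_NON_LOCAL_TRAFFIC
        value: "true"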

datadoghq.eu supported?

Output of the info page (if this is a bug)
Not sure if this is a bug - I think this is more of a case where the proper way to specify something isn't clear (at least not clear to me). I think I need to specify the EU 'flavor' of Datadog to the Operator, and not sure if the needed config surface area already exists, or if it does exist, where it is.

Describe what happened:
As I understand it Datadog users can create their account in the US or the EU.
In my case I created my account on the EU side, i.e. I pull up the dashboard etc at https://app.datadoghq.eu/

Following the instructions here
https://docs.datadoghq.com/agent/kubernetes/?tab=operator
I generated an API key and App key, and installed the datadog-operator using Helm 3.

Describe what you expected:
I was expecting the usual data flow to work, i.e. cluster data etc. flowing into https://app.datadoghq.eu/, but when I check the status of the deploy, the dd resource never reaches 'ACTIVE' status after 30 minutes. Ostensibly this is because the agents are pointing to datadoghq.com, not .eu. Not 100% sure.

$ kubectl  -n datadog get dd
NAME                                     ACTIVE   AGENT   CLUSTER-AGENT   CLUSTER-CHECKS-RUNNER   AGE
datadog-agent-with-operator-1596710136                                                            30m

If I understand right this may be a related fix? #104
I was using this release
https://github.com/DataDog/datadog-operator/releases/tag/v0.2.1

Steps to reproduce the issue:
The above covers it, but in a sentence, try to use https://github.com/DataDog/datadog-operator/releases/tag/v0.2.1 on a .EU hosted Datadog account.
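In case it helps, a minimal sketch of how one might try to point the agents at the EU site by setting the Agent's documented DD_SITE variable through the DatadogAgent spec. The field paths follow the other examples on this page and are assumptions; whether v0.2.1 actually honors this is exactly what this issue is asking.

agent:
  env:
    - name: DD_SITE
      value: datadoghq.eu
clusterAgent:
  config:
    env:
      - name: DD_SITE
        value: datadoghq.eu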

Additional environment details (Operating System, Cloud provider, etc):
Azure AKS, macOS 10.15.6, Helm 3, datadog-operator v0.2.1

Add support for log monitors

The monitor documentation states that currently "only metric alerts, query alerts, and service checks are supported." I'd love to see datadog-operator add support for managing log monitors as well.

For example, I'm trying to create a monitor that looks like this:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogMonitor
metadata:
  finalizers:
  - finalizer.monitor.datadoghq.com
  name: ...
  namespace: datadog
spec:
  message: ...
  name: ...
  options: {}
  query: logs("...").index("main").rollup("count").by("service").last("15m") > 0
  tags: ...
  type: log alert

But currently the operator generates an error for this resource with the following message:

monitor type log alert not supported

Cluster Agent not honoring Site parameter when MetricsProvider is queried

Describe what happened:

I'm using Datadog Cluster Agent with metricsProviderEnabled: true and Site: datadoghq.eu

After configuring an HPA, I started seeing these errors:

2020-05-28 11:17:23 UTC | CLUSTER | ERROR | (pkg/util/kubernetes/autoscalers/datadogexternal.go:84 in queryDatadogExternal) | Error while executing metric query avg:eng_kyc_service.process.files.open{cluster_name:dev-eu-west-1-test,env:dev}.rollup(30): API error 403 Forbidden: {"errors": ["Invalid API key or Application key"]}
2020-05-28 11:17:23 UTC | CLUSTER | ERROR | (pkg/util/kubernetes/autoscalers/processor.go:80 in UpdateExternalMetrics) | Error getting metrics from Datadog: [Error while executing metric query avg:eng_kyc_service.process.files.open{cluster_name:dev-eu-west-1-test,env:dev}.rollup(30): API error 403 Forbidden: {"errors": ["Invalid API key or Application key"]}, strconv.Atoi: parsing "": invalid syntax]

I verified my credentials and they are valid, as long as you point to datadoghq.eu. Setting DATADOG_HOST=https://app.datadoghq.eu solved this.
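A minimal sketch of that workaround expressed in the DatadogAgent spec, assuming the Cluster Agent env can be set through clusterAgent.config.env as in the other examples on this page:

clusterAgent:
  config:
    metricsProviderEnabled: true
    env:
      - name: DATADOG_HOST
        value: https://app.datadoghq.eu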

Describe what you expected:

I would expect one of two things:

  • DATADOG_HOST is populated from DD_SITE and DD_DD_URL when creating the Cluster Agent deployment
  • Datadog Cluster Agent's client library leverages DD_SITE and DD_DD_URL to create the client.

Steps to reproduce the issue:

  • Create a Datadog Cluster Agent deployment with datadoghq.eu credentials and metricsProviderEnabled: true
  • Create an HPA using an external metric

Additional environment details (Operating System, Cloud provider, etc):

Panic occurs: invalid memory address or nil pointer dereference

Describe what happened:

I've been trying to deploy Datadog agents through the datadog-operator, but the datadog-operator panics.

github.com/DataDog/datadog-operator/controllers/datadogagent.isComplianceEnabled(...)
	/workspace/controllers/datadogagent/utils.go:175
github.com/DataDog/datadog-operator/controllers/datadogagent.buildClusterAgentClusterRole(0xc00040f680, 0xc0005f60e0, 0x12, 0x0, 0x0, 0x44)
	/workspace/controllers/datadogagent/clusteragent.go:1298 +0x4b8
github.com/DataDog/datadog-operator/controllers/datadogagent.(*Reconciler).createClusterAgentClusterRole(0xc000451440, 0x1f35c20, 0xc0005c2b80, 0xc00040f680, 0xc0005f60e0, 0x12, 0x0, 0x0, 0xc0002d6000, 0x1ef2060, ...)
	/workspace/controllers/datadogagent/clusteragent.go:838 +0x74
github.com/DataDog/datadog-operator/controllers/datadogagent.(*Reconciler).manageClusterAgentRBACs(0xc000451440, 0x1f35c20, 0xc0005c2b80, 0xc00040f680, 0xc0007b6600, 0x0, 0x0, 0x0)
	/workspace/controllers/datadogagent/clusteragent.go:737 +0x48f
github.com/DataDog/datadog-operator/controllers/datadogagent.(*Reconciler).manageClusterAgentDependencies(0xc000451440, 0x1f35c20, 0xc0005c2b80, 0xc00040f680, 0xc0005ca870, 0x1bc6b20, 0x2, 0x2, 0xc0006f3970)
	/workspace/controllers/datadogagent/clusteragent.go:244 +0x510
github.com/DataDog/datadog-operator/controllers/datadogagent.(*Reconciler).reconcileClusterAgent(0xc000451440, 0x1f35c20, 0xc0005c2b80, 0xc00040f680, 0xc0005ca870, 0x30, 0x1b7fa00, 0xc00030f801, 0xc0005ca870)
	/workspace/controllers/datadogagent/clusteragent.go:36 +0x74
github.com/DataDog/datadog-operator/controllers/datadogagent.(*Reconciler).internalReconcile(0xc000451440, 0x1f2c3c0, 0xc0005ca750, 0xc0006270b0, 0x10, 0xc000627084, 0x4, 0x1baea00, 0x10, 0x1b10c40, ...)
	/workspace/controllers/datadogagent/controller.go:131 +0x4f3
github.com/DataDog/datadog-operator/controllers/datadogagent.(*Reconciler).Reconcile(0xc000451440, 0x1f2c3c0, 0xc0005ca750, 0xc0006270b0, 0x10, 0xc000627084, 0x4, 0x1baea00, 0xc0005ca750, 0x1a9bc60, ...)
	/workspace/controllers/datadogagent/controller.go:67 +0x8e
github.com/DataDog/datadog-operator/controllers.(*DatadogAgentReconciler).Reconcile(0xc00062e050, 0x1f2c3c0, 0xc0005ca750, 0xc0006270b0, 0x10, 0xc000627084, 0x4, 0xc0005ca750, 0x40a3df, 0xc000030000, ...)
	/workspace/controllers/datadogagent_controller.go:140 +0x7b
full panic log
{ "level": "ERROR", "logger": "klog", "msg": "Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 561 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1a9d760, 0x2bc9ff0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:74 +0x95
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:48 +0x89
panic(0x1a9d760, 0x2bc9ff0)
	/usr/local/go/src/runtime/panic.go:969 +0x1b9
github.com/DataDog/datadog-operator/controllers/datadogagent.isComplianceEnabled(...)
	/workspace/controllers/datadogagent/utils.go:175
github.com/DataDog/datadog-operator/controllers/datadogagent.buildClusterAgentClusterRole(0xc00040f680, 0xc0005f60e0, 0x12, 0x0, 0x0, 0x44)
	/workspace/controllers/datadogagent/clusteragent.go:1298 +0x4b8
github.com/DataDog/datadog-operator/controllers/datadogagent.(*Reconciler).createClusterAgentClusterRole(0xc000451440, 0x1f35c20, 0xc0005c2b80, 0xc00040f680, 0xc0005f60e0, 0x12, 0x0, 0x0, 0xc0002d6000, 0x1ef2060, ...)
	/workspace/controllers/datadogagent/clusteragent.go:838 +0x74
github.com/DataDog/datadog-operator/controllers/datadogagent.(*Reconciler).manageClusterAgentRBACs(0xc000451440, 0x1f35c20, 0xc0005c2b80, 0xc00040f680, 0xc0007b6600, 0x0, 0x0, 0x0)
	/workspace/controllers/datadogagent/clusteragent.go:737 +0x48f
github.com/DataDog/datadog-operator/controllers/datadogagent.(*Reconciler).manageClusterAgentDependencies(0xc000451440, 0x1f35c20, 0xc0005c2b80, 0xc00040f680, 0xc0005ca870, 0x1bc6b20, 0x2, 0x2, 0xc0006f3970)
	/workspace/controllers/datadogagent/clusteragent.go:244 +0x510
github.com/DataDog/datadog-operator/controllers/datadogagent.(*Reconciler).reconcileClusterAgent(0xc000451440, 0x1f35c20, 0xc0005c2b80, 0xc00040f680, 0xc0005ca870, 0x30, 0x1b7fa00, 0xc00030f801, 0xc0005ca870)
	/workspace/controllers/datadogagent/clusteragent.go:36 +0x74
github.com/DataDog/datadog-operator/controllers/datadogagent.(*Reconciler).internalReconcile(0xc000451440, 0x1f2c3c0, 0xc0005ca750, 0xc0006270b0, 0x10, 0xc000627084, 0x4, 0x1baea00, 0x10, 0x1b10c40, ...)
	/workspace/controllers/datadogagent/controller.go:131 +0x4f3
github.com/DataDog/datadog-operator/controllers/datadogagent.(*Reconciler).Reconcile(0xc000451440, 0x1f2c3c0, 0xc0005ca750, 0xc0006270b0, 0x10, 0xc000627084, 0x4, 0x1baea00, 0xc0005ca750, 0x1a9bc60, ...)
	/workspace/controllers/datadogagent/controller.go:67 +0x8e
github.com/DataDog/datadog-operator/controllers.(*DatadogAgentReconciler).Reconcile(0xc00062e050, 0x1f2c3c0, 0xc0005ca750, 0xc0006270b0, 0x10, 0xc000627084, 0x4, 0xc0005ca750, 0x40a3df, 0xc000030000, ...)
	/workspace/controllers/datadogagent_controller.go:140 +0x7b
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0004ce000, 0x1f2c300, 0xc000396380, 0x1b0c920, 0xc00000c7e0)
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263 +0x317
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0004ce000, 0x1f2c300, 0xc000396380, 0x0)
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1(0x1f2c300, 0xc000396380)
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:198 +0x4a
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185 +0x37
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000ba0f50)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0006f3f50, 0x1ef2280, 0xc0005ca690, 0xc000396301, 0xc0001141e0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0xad
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000ba0f50, 0x3b9aca00, 0x0, 0x1, 0xc0001141e0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext(0x1f2c300, 0xc000396380, 0xc0005c2af0, 0x3b9aca00, 0x0, 0x1)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:185 +0xa6
k8s.io/apimachinery/pkg/util/wait.UntilWithContext(0x1f2c300, 0xc000396380, 0xc0005c2af0, 0x3b9aca00)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:99 +0x57
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:195 +0x4e7
", "ts": "2021-05-17T02:02:07Z" } 

Describe what you expected:

The datadog-operator deploys the Datadog agent and cluster agent.

Steps to reproduce the issue:

  1. helm install
helm repo add datadog https://helm.datadoghq.com
helm install my-datadog-operator datadog/datadog-operator
  2. Apply the following manifest:
agent.yaml
apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: test
  namespace: datadog
spec:
  credentials:
    apiSecret:
      keyName: api-secret
      secretName: datadog
    appSecret:
      keyName: app-secret
      secretName: datadog
  agent:
    additionalAnnotations:
      sidecar.istio.io/inject: "false"
    additionalLabels:
      app: datadog-agent
    image:
      name: gcr.io/datadoghq/agent:7.27.0
    apm:
      enabled: true
    process:
      enabled: true
      processCollectionEnabled: true
    log:
      enabled: true
    systemProbe:
      bpfDebugEnabled: true
    security:
      compliance:
        enabled: true
      runtime:
        enabled: false
    config:
      tolerations:
      - operator: Exists
    env:
    - name: DD_API_KEY
      valueFrom:
        secretKeyRef:
          name: datadog
          key: api-key
  clusterAgent:
    additionalAnnotations:
      sidecar.istio.io/inject: "false"
    image:
      name: gcr.io/datadoghq/cluster-agent:1.12.0
    config:
      externalMetrics:
        enabled: true
    replicas: 1
    rbac:
      create: true

Additional environment details (Operating System, Cloud provider, etc):

gke version: v1.18.17-gke.700

Agent status not updated

Describe what happened:

I've deployed datadog-operator and created a new DatadogAgent declaring both a normal agent and a cluster agent with the configuration as attached.

Both deployments/agents start up fine and operate as expected.

Then I run kubectl get dd datadog

$ kubectl get dd datadog                                                                                                                                                                           
NAME      ACTIVE   AGENT                 CLUSTER-AGENT     CLUSTER-CHECKS-RUNNER   AGE                                                                                                                                                    
datadog   True     Progressing (0/0/0)   Running (2/2/2)                           179m

The agent is stuck at "Progressing (0/0/0)".

Describe what you expected:

$ kubectl get dd datadog                                                                                                                                                                           
NAME      ACTIVE   AGENT                 CLUSTER-AGENT     CLUSTER-CHECKS-RUNNER   AGE                                                                                                                                                    
datadog   True     Running (1/1/1)   Running (2/2/2)                           179m

Steps to reproduce the issue:

After some extensive debugging, I've found that the Datadog Operator is able to fetch the status, but decides not to update it because of this conditional: https://github.com/DataDog/datadog-operator/blob/master/pkg/controller/datadogagent/datadogagent_controller.go#L289

Here's the output I get:

{"level":"info","ts":1590151823.1246557,"logger":"DatadogAgent","msg":"New status for Agent: Running (1/1/1)","Request.Namespace":"datadog","Request.Name":"datadog"}        # Comes from agent.go:238 / updateDaemonSet                                                              
{"level":"info","ts":1590151823.1248279,"logger":"DatadogAgent","msg":"New agent status: Progressing (0/0/0)","Request.Namespace":"datadog","Request.Name":"datadog"}                                                                     
{"level":"info","ts":1590151823.12484,"logger":"DatadogAgent","msg":"New cluster agent status: Running (2/2/2)","Request.Namespace":"datadog","Request.Name":"datadog"}                                                                   
{"level":"info","ts":1590151823.1248496,"logger":"DatadogAgent","msg":"shouldReturn(result={false, 0s}, %!s(<nil>))","Request.Namespace":"datadog","Request.Name":"datadog"}                                                              
{"level":"info","ts":1590151823.1248574,"logger":"DatadogAgent","msg":"No need to update status","Request.Namespace":"datadog","Request.Name":"datadog"}

And here's my modified code segment where I've added those loggers

reqLogger.Info("Running reconcile functions")
	for _, reconcileFunc := range reconcileFuncs {
		reqLogger.Info("Running reconcile function")
		reqLogger.Info("%v", reconcileFunc)
		result, err = reconcileFunc(reqLogger, instance, newStatus)
		if newStatus.Agent != nil {
			reqLogger.Info(fmt.Sprintf("New agent status: %s", newStatus.Agent.Status))
		}
		if newStatus.ClusterAgent != nil {
			reqLogger.Info(fmt.Sprintf("New cluster agent status: %s", newStatus.ClusterAgent.Status))
		}
		if newStatus.ClusterChecksRunner != nil {
			reqLogger.Info(fmt.Sprintf("New cluster checks runner status: %s", newStatus.ClusterChecksRunner.Status))
		}
		reqLogger.Info(fmt.Sprintf("shouldReturn(result={%v, %s}, %s)", result.Requeue, result.RequeueAfter, err))
		if shouldReturn(result, err) {
			reqLogger.Info("Need to update status")
			return r.updateStatusIfNeeded(reqLogger, instance, newStatus, result, err)
		}
		reqLogger.Info("No need to update status")
	}
	reqLogger.Info("Finished running reconcile functions")

I'm currently investigating this issue, but I could do with some context / guidance, since I'm not quite sure exactly why result.Requeue is false, and if that is expected.

Additional environment details (Operating System, Cloud provider, etc):

EKS 1.16, running on the default EKS AMI.

Add ability to select the secret key for apiKeyExistingSecret and appKeyExistingSecret

Currently, the only way to pass the API key or the App key as a secret is to pass the name of the secret; the operator then expects the API key to be in the api_key key of the secret and the App key to be in the app_key key of the secret.

I think we shouldn't assume the key name.

My suggestion would be to have something like:

apiKeyExistingSecret:
  secretName: string
  keyName: string
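Under that proposal, a DatadogAgent spec could reference arbitrary key names, for example (the secret and key names below are made up for illustration):

credentials:
  apiKeyExistingSecret:
    secretName: my-datadog-credentials
    keyName: dd-api-key
  appKeyExistingSecret:
    secretName: my-datadog-credentials
    keyName: dd-app-key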

cluster-checks-runner gets deployed without asking for it

Datadog operator version 0.4.0

Describe what happened:

I got 2 replicas of the cluster checks runner, but I didn't ask for it in the definition of my DatadogAgent object.

Describe what you expected:

Only the node agent and the cluster agent get deployed.

Steps to reproduce the issue:

  1. Create a token secret with: kubectl create secret generic datadog --from-literal=token='<32-characters-long-token>' (because it doesn't get created automatically due to issue #207)
  2. Apply the following DatadogAgent definition:
apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  credentials:
    apiSecret:
      secretName: datadog-secret
      keyName: api-key
    appSecret:
      secretName: datadog-app-key
      keyName: app-key
  agent:
    env:
      - name: DD_KUBELET_TLS_VERIFY
        value: "false"
    image:
      name: "datadog/agent:latest"
    apm:
      enabled: true
    process:
      enabled: true
    log:
      enabled: true
    config:
      tolerations:
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
  clusterAgent:
    image:
      name: "datadog/cluster-agent:latest"
    config:
      clusterChecksEnabled: true
    replicas: 1

At the end I got the following deployments:

NAME                            READY   UP-TO-DATE   AVAILABLE   AGE
datadog-cluster-agent           1/1     1            1           73m
datadog-cluster-checks-runner   1/2     2            1           73m

I wouldn't expect the datadog-cluster-checks-runner deployment to be created

Additional environment details (Operating System, Cloud provider, etc):

poddisruptionbudget agent causes node to be unable to drain

Output of the info page (if this is a bug)

evicting pod datadog/datadog-cluster-agent-689f579b4f-dmrcn
error when evicting pods/"datadog-cluster-agent-689f579b4f-dmrcn" -n "datadog" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

Describe what happened:
I'm trying to automatically drain my nodes and patch them. During the cluster upgrade I can see that the nodes just hang in the Ready,SchedulingDisabled state.

I manually run:

kubectl drain aks-standard-12457058-vmss00000d --delete-emptydir-data --ignore-daemonsets

and it gives me an error saying that the PDB (PodDisruptionBudget) is causing the issue.
In the end I had to add --disable-eviction=true to my drain command to work around it.

Describe what you expected:
I don't see any need to have a PDB when running the agent as a DaemonSet; the DaemonSet controller already handles this logic.
If running in DaemonSet mode, I wouldn't create a PDB.

If you don't think it should be disabled by default, at least provide a variable in the CRD so I can disable it.

Steps to reproduce the issue:
Drain your nodes.

Additional environment details (Operating System, Cloud provider, etc):
N/A

Operator projects using the removed APIs in k8s 1.22 require changes.

Problem Description

Kubernetes has been deprecating APIs that are removed and no longer available in 1.22. Operator projects using these API versions will not work on Kubernetes 1.22 or on any cluster vendor using this Kubernetes version (1.22), such as OpenShift 4.9+. The following are the APIs your project is most likely to be affected by:

  • apiextensions.k8s.io/v1beta1: (Used for CRDs and available since v1.16)
  • rbac.authorization.k8s.io/v1beta1: (Used for RBAC/rules and available since v1.8)
  • admissionregistration.k8s.io/v1beta1 (Used for Webhooks and available since v1.16)

Therefore, it looks like this project distributes solutions in the repository that do not contain any version compatible with k8s 1.22/OCP 4.9. (More info). The following are some findings from checking the published distributions:

NOTE: The above findings concern only the manifests shipped inside the distribution; the codebase itself was not checked.

How to solve

It would be very nice to see new distributions of this project that no longer use these APIs, so that they work on Kubernetes 1.22 and newer and can be published in the community-operators collection. OpenShift 4.9, for example, will no longer ship operators that still use v1beta1 extension APIs.

Due to the number of options available to build Operators, it is hard to provide direct guidance on updating your operator to support Kubernetes 1.22. Recent versions of the OperatorSDK greater than 1.0.0 and Kubebuilder greater than 3.0.0 scaffold your project with the latest versions of these APIs (all that is generated by tools only). See the guides to upgrade your projects with OperatorSDK Golang, Ansible, Helm or the Kubebuilder one. For APIs other than the ones mentioned above, you will have to check your code for usage of removed API versions and upgrade to newer APIs. The details of this depend on your codebase.

If this project only needs to migrate the API for CRDs and it was built with OperatorSDK versions lower than 1.0.0, then you may be able to solve it with an OperatorSDK version >= v0.18.x and < 1.0.0:

$ operator-sdk generate crds --crd-version=v1
INFO[0000] Running CRD generator.
INFO[0000] CRD generation complete.

Alternatively, you can try to upgrade your manifests with controller-gen (version >= v0.4.1):

If this project does not use Webhooks:

$ controller-gen crd:trivialVersions=true,preserveUnknownFields=false rbac:roleName=manager-role paths="./..."

If this project is using Webhooks:

  1. Add the markers sideEffects and admissionReviewVersions to your webhook (Example with sideEffects=None and admissionReviewVersions={v1,v1beta1}: memcached-operator/api/v1alpha1/memcached_webhook.go):

  2. Run the command:

$ controller-gen crd:trivialVersions=true,preserveUnknownFields=false rbac:roleName=manager-role webhook paths="./..."

For further information and tips see the comment.

Add setting spec.agent.log.useHttp to forward logs over HTTPS

Describe what happened:

There seems to be no way to configure the datadog-agent to export logs over HTTP(S). This makes it difficult to export logs when port 10516 is blocked.

Describe what you expected:

A useHttp field in the datadogagents.datadoghq.com spec.agent.log section:

...
            agent:
                log:
                  description: Log Agent configuration
                  properties:
                    containerLogsPath:
                      description: 'This to allow log collection from container log
                        path. Set to a different path if not using docker runtime.
                        ref: https://docs.datadoghq.com/agent/kubernetes/daemonset_setup/?tab=k8sfile#create-manifest
                        Default to `/var/lib/docker/containers`'
                      type: string
                    enabled:
                      description: 'Enables this to activate Datadog Agent log collection.
                        ref: https://docs.datadoghq.com/agent/basic_agent_usage/kubernetes/#log-collection-setup'
                      type: boolean
                    logsConfigContainerCollectAll:
                      description: 'Enable this to allow log collection for all containers.
                        ref: https://docs.datadoghq.com/agent/basic_agent_usage/kubernetes/#log-collection-setup'
                      type: boolean
                    podLogsPath:
                      description: This to allow log collection from pod log path.
                        Default to `/var/log/pods`
                      type: string
                    tempStoragePath:
                      description: This path (always mounted from the host) is used
                        by Datadog Agent to store information about processed log
                        files. If the Datadog Agent is restarted, it allows to start
                        tailing the log files from the right offset Default to `/var/lib/datadog-agent/logs`
                      type: string
                    useHttp:
                      description: Configure the Datadog Agent to send logs through 
                        HTTPS.
                      type: boolean
                  type: object

Other configuration settings for HTTPS log forwarding can be configured as environment variables. Adding fields for them would be convenient, but not necessary to solve this use case.
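In the meantime, a possible stopgap is to set the Agent's documented logs_config.use_http option through its DD_LOGS_CONFIG_USE_HTTP environment variable on the node agent. A minimal sketch, assuming the env list on the agent works as in the other examples on this page:

agent:
  log:
    enabled: true
  env:
    - name: DD_LOGS_CONFIG_USE_HTTP
      value: "true"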

New feature: allow configuring security context on cluster agent

It's currently possible to customize the security context for agent pods through the agent.config.securityContext.* setting, but this option is not available for the cluster agent pod. This is a feature request to allow configuring the security context on cluster agents too.

For Kubernetes deployments with strict pod security policies, customizing the security context is required because the image does not specify a non-root user, causing pod creation to fail with Error: container has runAsNonRoot and image will run as root.
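A sketch of the kind of field this request is asking for, mirroring the existing agent-side setting; the clusterAgent.config.securityContext path below is hypothetical, not an option the operator currently exposes:

clusterAgent:
  config:
    securityContext:
      runAsNonRoot: true
      runAsUser: 101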

`clusterName` option doesn't get forwarded to the `process-agent`

Describe what happened:

Using 0.5.0-rc2, I wanted to enable the OrchestratorExplorer, so I used this DatadogAgent config:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  clusterName: katacoda
  credentials:
    apiSecret:
      secretName: datadog-secret
      keyName: api-key
    appSecret:
      secretName: datadog-secret
      keyName: app-key
  agent:
    env:
      - name: DD_KUBELET_TLS_VERIFY
        value: "false"
    image:
      name: "datadog/agent:latest"
    apm:
      enabled: true
    process:
      enabled: true
      processCollectionEnabled: true
    log:
      enabled: true
    config:
      logLevel: info
      tolerations:
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
  clusterAgent:
    config:
      logLevel: debug
    image:
      name: "datadog/cluster-agent:latest"
  features:
    orchestratorExplorer:
      enabled: true

But, when the process-agent starts, it logs the following:

2021-02-16 11:20:37 UTC | PROCESS | WARN | (pkg/process/config/config.go:382 in NewAgentConfig) | Failed to auto-detect a Kubernetes cluster name. Pod collection will not start. To fix this, set it manually via the cluster_name config option

It looks like the DCA is picking up the cluster name correctly, as I get the right objects in the Deployment, Nodes, etc. views, but nothing in the Pods view.

Describe what you expected:

The process-agent should use the value of clusterName as the cluster name, and Pod collection should start correctly.
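As a possible workaround while this is investigated, the cluster name could presumably be forced onto the node agent (and therefore the process-agent) via the Agent's documented cluster_name / DD_CLUSTER_NAME setting. A minimal sketch, reusing the clusterName value from the config above:

agent:
  env:
    - name: DD_CLUSTER_NAME
      value: katacoda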

Steps to reproduce the issue:

  1. Deploy operator 0.5.0-rc2
  2. Deploy the DatadogAgent config above

Missing some default pod tags

Describe what happened:

Some default or "out of the box" tags are missing from pods (and maybe other places).

Describe what you expected:

Compared to manual agent deployment, certain tags are missing that are used to populate default facets for log searches. Missing tags that I have noticed in particular include kubernetescluster (used to populate the Kubernetes > Cluster facet) and container_name (populates log header "blocks").

Steps to reproduce the issue:

Agents deployed with the following configuration:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
  namespace: monitoring
spec:
  clusterName: staging-eks01
  credentials:
    apiSecret:
      secretName: datadog-secret
      keyName: api-key
    appSecret:
      secretName: datadog-secret
      keyName: app-key
  agent:
    image:
      name: "gcr.io/datadoghq/agent:latest"
    apm:
      enabled: false
    process:
      enabled: true
      processCollectionEnabled: true
    log:
      enabled: true
      logsConfigContainerCollectAll: true
      containerCollectUsingFiles: true
    config:
      criSocket:
        criSocketPath: /run/dockershim.sock
        dockerSocketPath: null
      collectEvents: true
      leaderElection: true
  clusterAgent:
    image:
      name: "gcr.io/datadoghq/cluster-agent:latest"
    config:
      clusterChecksEnabled: true
      admissionController:
        enabled: true
        mutateUnlabelled: false
  features:
    kubeStateMetricsCore:
      enabled: true

Additional environment details (Operating System, Cloud provider, etc):

Bottlerocket OS 1.1.2, EKS 1.20, Agent (v7.29.0)

Enabling kubeStateMetricsCore disables other clusterAgent checks

Describe what happened:
I am trying to set the clusterAgent up to collect both kubernetes events and kubernetes state metrics. I enabled the kubeStateMetricsCore and collectEvents configs in the DatadogAgent resource (see config below). However, with this config, events were not being collected. On the cluster agent that is elected leader, I ran agent status and saw that there was no kubernetes_apiserver check:

=========
Collector
=========

  Running Checks
  ==============
    No checks have run yet

Looking into the /etc/datadog-agent/conf.d/ folder in the clusterAgent pods, I see that the only file is kubernetes_state_core.yaml. It seems like the ksm-core-config volume is hiding the files that are normally in that directory in the clusterAgent image:

        volumeMounts:
        - mountPath: /etc/datadog-agent/install_info
          name: installinfo
          readOnly: true
          subPath: install_info
        - mountPath: /conf.d
          name: confd
          readOnly: true
        - mountPath: /etc/datadog-agent/conf.d
          name: ksm-core-config
          readOnly: true

As a workaround, I was able to trigger this branch of the code by specifying an explicit config under kubeStateMetricsCore, which makes the volume override only the kubernetes_state_core.yaml file, not the whole conf.d directory.

Describe what you expected:
To be able to configure kubeStateMetricsCore and collectEvents simultaneously without specifying a kubeStateMetricsCore config, and have the agent collect both types of data.

Steps to reproduce the issue:
My DatadogAgent configuration looks like this:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  credentials:
    apiSecret:
      keyName: apiKey
      secretName: datadog-agent-secrets
  agent:
<...snip irrelevant config...>
  clusterAgent:
    image:
      name: "gcr.io/datadoghq/cluster-agent:latest"
    replicas: 3
    config:
      clusterChecksEnabled: true
      collectEvents: true
  features:
    kubeStateMetricsCore:
      enabled: true

When I add the following under the kubeStateMetricsCore object, then I get both events and state metrics:

      conf:
        configData: |
          cluster_check: true
          init_config:
          instances:
            - collectors:
                - pods
                - replicationcontrollers
                - statefulsets
                - nodes
                - cronjobs
                - jobs
                - replicasets
                - deployments
                - configmaps
                - services
                - endpoints
                - daemonsets
                - horizontalpodautoscalers
                - limitranges
                - resourcequotas
                - secrets
                - namespaces
                - persistentvolumeclaims
                - persistentvolumes
              telemetry: true

Additional environment details (Operating System, Cloud provider, etc):
Running on GKE, kubernetes version 1.17.17

Allow configuration of network host monitoring via CRD

Describe what happened:
The operator deployed the datadog-agent to my kubernetes nodes with the network monitoring flag enabled.

Describe what you expected:
An option should be available to disable this feature, as there is for logs/apm/etc. A sketch of what that might look like follows.
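A minimal sketch of the requested CRD option; the features.networkMonitoring.enabled field below is an assumption about what such a knob could look like, not something this operator version exposes:

features:
  networkMonitoring:
    enabled: false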

Steps to reproduce the issue:

  1. Install the operator like normal
  2. Deploy the agent with any of the example configs
  3. See network collection enabled

Additional environment details (Operating System, Cloud provider, etc):
Kubernetes 1.17.9 and 1.16.9
Datadog Agent: v7.23.1
Cluster Agent: 1.9.1+commit.2270e4d
Operator 0.3.1

Error running the pod check for the Orchestrator Explorer

This is using the operator 0.5.0:

I tried enabling the Orchestrator Explorer, and all the collection works except for the pod check in the process-agent, where I get the following error:

2021-02-25 09:35:34 UTC | PROCESS | DEBUG | (pkg/util/kubernetes/clustername/clustername.go:145 in GetClusterID) | Cluster ID env variable DD_ORCHESTRATOR_CLUSTER_ID is missing, calling the Cluster Agent
2021-02-25 09:35:34 UTC | PROCESS | ERROR | (collector.go:109 in runCheck) | Unable to run check 'pod': unexpected status code from cluster agent: 404

This is the DatadogAgent object that I am using:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  clusterName: katacoda
  credentials:
    apiSecret:
      secretName: datadog-secret
      keyName: api-key
    appSecret:
      secretName: datadog-secret
      keyName: app-key
  agent:
    env:
      - name: DD_KUBELET_TLS_VERIFY
        value: "false"
    image:
      name: "datadog/agent:latest"
    apm:
      enabled: true
    process:
      enabled: true
    log:
      enabled: true
    config:
      tolerations:
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
      logLevel: debug
  clusterAgent:
    image:
      name: "datadog/cluster-agent:latest"
  features:
    orchestratorExplorer:
      enabled: true

Different behaviour for creation of objects and update of objects for cluster checks runners deployment

OK, that was kind of a cryptic title for a bug report, but I couldn't find a better one, so let me explain with examples what happened:

This is using operator 0.5.0-RC2.

If I deploy this configuration for the first time in a 1 node cluster:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  credentials:
    apiSecret:
      secretName: datadog-secret
      keyName: api-key
    appSecret:
      secretName: datadog-secret
      keyName: app-key
  agent:
    image:
      name: "datadog/agent:latest"
  clusterAgent:
    image:
      name: "datadog/cluster-agent:latest"
    config:
      clusterChecksEnabled: true

I correctly get 1 pod for the node agent, 1 pod for the cluster agent, and 1 pod for the cluster checks runner:

NAME      ACTIVE   AGENT             CLUSTER-AGENT     CLUSTER-CHECKS-RUNNER   AGE
datadog   True     Running (1/1/1)   Running (1/1/1)           Running (1/1/1)         112s

But, if I first apply the same configuration, but without enabling cluster checks, like this one:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  credentials:
    apiSecret:
      secretName: datadog-secret
      keyName: api-key
    appSecret:
      secretName: datadog-secret
      keyName: app-key
  agent:
    image:
      name: "datadog/agent:latest"
  clusterAgent:
    image:
      name: "datadog/cluster-agent:latest"

then, as expected, I only get 1 node agent and 1 cluster agent:

NAME      ACTIVE   AGENT             CLUSTER-AGENT     CLUSTER-CHECKS-RUNNER   AGE
datadog   True     Running (1/1/1)   Running (1/1/1)                           2m57s

But, if I now edit the object and add the cluster checks option back, so it looks like the original object:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  credentials:
    apiSecret:
      secretName: datadog-secret
      keyName: api-key
    appSecret:
      secretName: datadog-secret
      keyName: app-key
  agent:
    image:
      name: "datadog/agent:latest"
  clusterAgent:
    image:
      name: "datadog/cluster-agent:latest"
    config:
      clusterChecksEnabled: true

The operator notices that there has been an update and starts updating the existing objects, but it won't create a cluster checks runner replica:

NAME      ACTIVE   AGENT              CLUSTER-AGENT      CLUSTER-CHECKS-RUNNER   AGE
datadog   True     Updating (1/1/0)   Updating (2/1/1)                                                          9m30s
NAME                                      READY   STATUS    RESTARTS   AGE
datadog-agent-pzhh2                       1/1     Running   0          29s
datadog-cluster-agent-59958b4dff-whwqr    1/1     Running   0          34s

Describe what you expected:

I would expect the reconciliation loop to see that this configuration requires 1 cluster checks runner and to create it when updating the object.

Missing download link on "Getting Started" page

Describe what happened:
Although I've checked the Getting Started page, the chart download link no longer works.
The link (https://github.com/DataDog/datadog-operator/releases/latest/download/datadog-agent-with-operator.tar.gz) returns "Not Found". I searched the Releases page but couldn't find which file is the proper one.

Describe what you expected:
Fix the download link in the documentation if you can.
If this issue is caused by my misunderstanding, please let me know the correct way.

Containers in a kind cluster are not being reported in Datadog

Describe what happened:

I have created a 3-node kind cluster running 1.18 and I am running datadog-operator v0.2.1 on it.

I deploy the node agent using the following configuration:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  name: datadog-agent
  namespace: datadog
spec:
  credentials:
    apiKeyExistingSecret: secretapi
    appKeyExistingSecret: secretapp
  agent:
    image:
      name: "datadog/agent:latest"
    config:
      logLevel: "DEBUG"
      leaderElection: true
      tolerations:
      - operator: Exists
      criSocket:
        criSocketPath: /var/run/containerd/containerd.sock
        useCriSocketVolume: true
      env:
      - name: DD_KUBELET_TLS_VERIFY
        value: "false"
      - name: DD_KUBERNETES_KUBELET_HOST
        valueFrom:
          fieldRef:
            fieldPath: status.hostIP
    process:
      enabled: true

Describe what you expected:

I expect to get container information in Datadog. I get this instead:

Screenshot 2020-05-26 at 11 31 48

I get process information correctly:

Screenshot 2020-05-26 at 11 32 22

Steps to reproduce the issue:

  1. Create a 3 node kind cluster:
kind create cluster --name datadog-operator  --config cluster-config.yaml

cluster-config.yaml:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  image: kindest/node:v1.18.0
- role: worker
  image: kindest/node:v1.18.0
- role: worker
  image: kindest/node:v1.18.0
  2. Deploy the datadog-operator v0.2.1 using Helm 3
  3. Apply the agent configuration described in the issue

Operator fails to reconcile when enabling external metrics provider

Describe what happened:
Operator failed to reconcile changes to DatadogAgent config to enable external metrics provider.

{"level":"INFO","ts":"2021-09-23T22:21:25Z","logger":"controllers.DatadogAgent","msg":"Reconciling DatadogAgent","datadogagent":"datadog/datadog"}
{"level":"ERROR","ts":"2021-09-23T22:21:25Z","logger":"controller-runtime.manager.controller.datadogagent","msg":"Reconciler error","reconciler group":"datadoghq.com","reconciler kind":"DatadogAgent","name":"datadog","namespace":"datadog","error":"clusterroles.rbac.authorization.k8s.io \"datadog-cluster-agent\" is forbidden: user \"system:serviceaccount:datadog:operator-datadog-operator\" (groups=[\"system:serviceaccounts\" \"system:serviceaccounts:datadog\" \"system:authenticated\"]) is attempting to grant RBAC permissions not currently held:\n{APIGroups:[\"datadoghq.com\"], Resources:[\"extendeddaemonsetreplicasets\"], Verbs:[\"get\"]}"}

DatadogAgent config:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgent
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"datadoghq.com/v1alpha1","kind":"DatadogAgent","metadata":{"annotations":{},"name":"datadog","namespace":"datadog"},"spec":{"agent":{"image":{"name":"gcr.io/datadoghq/agent:latest"}},"clusterAgent":{"image":{"name":"gcr.io/datadoghq/cluster-agent:latest"}},"credentials":{"apiSecret":{"keyName":"api-key","secretName":"datadog-secret"},"appSecret":{"keyName":"app-key","secretName":"datadog-secret"}}}}
  creationTimestamp: "2021-09-23T22:13:12Z"
  finalizers:
  - finalizer.agent.datadoghq.com
  generation: 6
  name: datadog
  namespace: datadog
  resourceVersion: "342519002"
  uid: 81388ecf-6f68-4762-a8d4-68b199afb398
spec:
  agent:
    image:
      name: gcr.io/datadoghq/agent:latest
  clusterAgent:
    config:
      admissionController:
        enabled: true
      externalMetrics:
        enabled: true
    image:
      name: gcr.io/datadoghq/cluster-agent:latest
  clusterChecksRunner: {}
  credentials:
    apiSecret:
      keyName: api-key
      secretName: datadog-secret
    appSecret:
      keyName: app-key
      secretName: datadog-secret
  features: {}
status:
  agent:
    available: 9
    current: 9
    currentHash: 42ce84f28b6184f205126ff7db90f26f
    daemonsetName: datadog-agent
    desired: 9
    lastUpdate: "2021-09-23T22:25:10Z"
    ready: 9
    state: Running
    status: Running (9/9/9)
    upToDate: 9
  clusterAgent:
    availableReplicas: 1
    currentHash: 8ddc4a9732973e98c9158a26d9031924
    deploymentName: datadog-cluster-agent
    generatedToken: QKUWgsARtiJnvteDfXCfnmeiZaKkaQng
    lastUpdate: "2021-09-23T22:13:13Z"
    readyReplicas: 1
    replicas: 1
    state: Running
    status: Running (1/1/1)
    updatedReplicas: 1
  conditions:
  - lastTransitionTime: "2021-09-23T22:25:21Z"
    lastUpdateTime: "2021-09-23T22:25:29Z"
    message: DatadogAgent reconcile error
    status: "False"
    type: Active
  - lastTransitionTime: "2021-09-23T22:13:12Z"
    lastUpdateTime: "2021-09-23T22:25:27Z"
    message: Datadog metrics forwarding ok
    status: "True"
    type: ActiveDatadogMetrics
  - lastTransitionTime: "2021-09-23T22:25:21Z"
    lastUpdateTime: "2021-09-23T22:25:29Z"
    message: |-
      clusterroles.rbac.authorization.k8s.io "datadog-cluster-agent" is forbidden: user "system:serviceaccount:datadog:operator-datadog-operator" (groups=["system:serviceaccounts" "system:serviceaccounts:datadog" "system:authenticated"]) is attempting to grant RBAC permissions not currently held:
      {APIGroups:["datadoghq.com"], Resources:["extendeddaemonsetreplicasets"], Verbs:["get"]}
    status: "True"
    type: ReconcileError
  defaultOverride:
    agent:
      apm:
        enabled: false
      config:
        collectEvents: false
        dogstatsd:
          dogstatsdOriginDetection: false
          unixDomainSocket:
            enabled: false
            hostFilepath: /var/run/datadog/statsd.sock
        healthPort: 5555
        leaderElection: false
        livenessProbe:
          failureThreshold: 6
          httpGet:
            path: /live
            port: 5555
          initialDelaySeconds: 15
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
        logLevel: INFO
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /ready
            port: 5555
          initialDelaySeconds: 15
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
      deploymentStrategy:
        canary:
          autoFail:
            enabled: true
            maxRestarts: 5
          autoPause:
            enabled: true
            maxRestarts: 2
          duration: 5m0s
          nodeSelector: {}
          replicas: 1
        reconcileFrequency: 10s
        rollingUpdate:
          maxParallelPodCreation: 250
          maxPodSchedulerFailure: 10%
          maxUnavailable: 10%
          slowStartAdditiveIncrease: "5"
          slowStartIntervalDuration: 1m0s
        updateStrategyType: RollingUpdate
      enabled: true
      image:
        pullPolicy: IfNotPresent
      networkPolicy:
        create: false
      process:
        enabled: false
      rbac:
        create: true
      security:
        compliance:
          enabled: false
        runtime:
          enabled: false
          syscallMonitor:
            enabled: false
      systemProbe:
        enabled: false
      useExtendedDaemonset: false
    clusterAgent:
      config:
        admissionController:
          mutateUnlabelled: false
          serviceName: datadog-admission-controller
        clusterChecksEnabled: false
        collectEvents: false
        externalMetrics:
          port: 8443
        healthPort: 5555
        logLevel: INFO
      enabled: true
      image:
        pullPolicy: IfNotPresent
      networkPolicy:
        create: false
      rbac:
        create: true
    clusterChecksRunner:
      enabled: false
    credentials:
      useSecretBackend: false
    features:
      kubeStateMetricsCore:
        clusterCheck: false
        enabled: false
      logCollection:
        enabled: false
      networkMonitoring:
        enabled: false
      orchestratorExplorer:
        clusterCheck: false
        enabled: true
        scrubbing:
          containers: true
      prometheusScrape:
        enabled: false

Looks like the operator's RBAC is missing a permission.
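For reference, the permission the error is complaining about would look like the rule below when added to the operator's ClusterRole. This is a sketch derived directly from the error message; which role to patch depends on how the operator was installed (for example, the Helm chart's ClusterRole for the operator service account).

- apiGroups:
  - datadoghq.com
  resources:
  - extendeddaemonsetreplicasets
  verbs:
  - get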

Describe what you expected:
The Operator should accept the changes, create the new APIService, and enable external metrics.

Steps to reproduce the issue:

  1. Install the Datadog operator: helm install operator datadog/datadog-operator -n datadog
  2. Deploy example config from here
  3. Edit config and add yaml snippet for external metrics Example
  4. Get Error :(

Additional environment details (Operating System, Cloud provider, etc):
K8S: 1.21.4
Operator: {"level":"INFO","ts":"2021-09-23T22:32:33Z","logger":"setup","msg":"Version: v0.7.0"}

Publish linux/arm64 images

As a cluster operator of an ARM cluster (e.g. AWS Graviton or Raspberry Pis), I'd like to be able to use the Datadog Operator instead of the Helm Chart for Agent installation, to avoid the requirement of using Helm.

I quickly changed GOARCH=amd64 to GOARCH=arm64 in the Dockerfile to see if the operator compiles and it does, so this should only require rigging up the CI pipelines to build and publish the images themselves.
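For example, something like the following buildx invocation could produce and publish both architectures once the Dockerfile honours the target architecture; the image tag is a placeholder and this is only a sketch of the CI change being suggested:

$ docker buildx build --platform linux/amd64,linux/arm64 -t <registry>/datadog-operator:latest --push .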

Is it possible to disable GCE specific checks when deploying to AWS/EKS?

Trying to disable GCE checks in an AWS environment by using collect_gce_tags: false.

Is this possible with operator?

Describe what happened:

2021-06-01 00:16:12 UTC | CORE | WARN | (pkg/util/gce/gce_tags.go:48 in getCachedTags) | unable to get tags from gce and cache is empty: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/?recursive=true

Describe what you expected:
For this warning to not appear.

Steps to reproduce the issue:
In datadog-agent.yaml, set collect_gce_tags: false (a sketch of how this might be passed through the operator follows).
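A minimal sketch, assuming the v1alpha1 spec's agent.customConfig.configData field can be used to inject raw datadog.yaml settings such as this one:

agent:
  customConfig:
    configData: |
      collect_gce_tags: false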

Additional environment details (Operating System, Cloud provider, etc):
AWS EKS environment.
