
Comments (26)

logicalhan commented on May 23, 2024

/triage accepted
/assign @dgrisonnet


bengoldenberg09 commented on May 23, 2024

@logicalhan @dgrisonnet any estimate of when you are planning to fix this?
I am facing the same issue with the same version.

E0202 22:02:08.428816 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:41082: write: broken pipe"
E0205 04:56:32.728805 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write metrics family: write tcp 10.11.31.68:8080->10.11.20.121:48750: write: connection reset by peer"
E0205 04:56:32.728835 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:48750: write: connection reset by peer"
E0205 04:56:32.728857 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:48750: write: connection reset by peer"
E0205 04:56:32.728870 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:48750: write: connection reset by peer"
E0205 04:56:32.728888 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:48750: write: connection reset by peer"
E0205 04:56:32.928694 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:48750: write: connection reset by peer"
E0205 04:56:32.928740 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:48750: write: connection reset by peer"
E0205 04:56:32.928757 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:48750: write: connection reset by peer"
E0207 09:31:40.528764 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write metrics family: write tcp 10.11.31.68:8080->10.11.20.121:34628: write: broken pipe"
E0207 09:31:40.528815 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:34628: write: broken pipe"
E0207 09:31:40.528829 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:34628: write: broken pipe"
E0207 09:31:40.528848 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:34628: write: broken pipe"
E0207 09:31:40.528861 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:34628: write: broken pipe"
E0207 09:31:40.528875 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:34628: write: broken pipe"
E0207 09:31:40.528894 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:34628: write: broken pipe"


dgrisonnet commented on May 23, 2024

I don't have any bandwidth to investigate right now.

@CatherineF-dev do you perhaps have some time to take a look at this?


CatherineF-dev commented on May 23, 2024

Ok


CatherineF-dev commented on May 23, 2024

@bengoldenberg09

  1. Could you help provide detailed steps to reproduce this issue? Thx
  2. Does the latest KSM still have this issue?


bengoldenberg09 commented on May 23, 2024

This is my deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: monitor
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: exporter
        app.kubernetes.io/name: kube-state-metrics
      annotations:
        karpenter.sh/do-not-evict: 'true'
        prometheus.io/path: /metrics
        prometheus.io/port: '8080'
        prometheus.io/probe: 'true'
        prometheus.io/scrape: 'true'
    spec:
      containers:
      - name: kube-state-metrics
        image: >-
          registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.10.0
        ports:
        - name: http-metrics
          containerPort: 8080
          protocol: TCP
        - name: telemetry
          containerPort: 8081
          protocol: TCP
        resources:
          limits:
            cpu: 10m
            memory: 100Mi
          requests:
            cpu: 10m
            memory: 100Mi
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 5
          timeoutSeconds: 5
          periodSeconds: 10
          successThreshold: 1
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 5
          timeoutSeconds: 5
          periodSeconds: 10
          successThreshold: 1
          failureThreshold: 3
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: false
          runAsUser: 65534
          runAsNonRoot: false
          readOnlyRootFilesystem: false
          allowPrivilegeEscalation: true
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: kube-state-metrics
      serviceAccount: kube-state-metrics
      automountServiceAccountToken: true
      shareProcessNamespace: false
      securityContext: {}
      schedulerName: default-scheduler
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      enableServiceLinks: true
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
So I have the latest one, and it is happening from time to time.
These were my logs when I ran the deployment:
I0207 13:17:41.754839 1 server.go:72] level=info msg="Listening on" address=[::]:8081
I0207 13:17:41.854588 1 server.go:72] level=info msg="TLS is disabled." http2=false address=[::]:8081
I0207 13:17:53.456710 1 trace.go:236] Trace[169429371]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231 (07-Feb-2024 13:17:41.455) (total time: 12001ms):
Trace[169429371]: ---"Objects listed" error:<nil> 11001ms (13:17:52.456)
Trace[169429371]: [12.001555192s] [12.001555192s] END
I0207 13:17:54.557245 1 trace.go:236] Trace[330206141]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231 (07-Feb-2024 13:17:41.454) (total time: 13102ms):
Trace[330206141]: ---"Objects listed" error:<nil> 13102ms (13:17:54.557)
Trace[330206141]: [13.102365236s] [13.102365236s] END
I0207 13:18:11.657805 1 trace.go:236] Trace[1483466474]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231 (07-Feb-2024 13:17:41.455) (total time: 30202ms):
Trace[1483466474]: ---"Objects listed" error:<nil> 30200ms (13:18:11.655)
Trace[1483466474]: [30.202597991s] [30.202597991s] END
I0207 13:18:13.356858 1 trace.go:236] Trace[1575017887]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231 (07-Feb-2024 13:17:41.454) (total time: 31901ms):
Trace[1575017887]: ---"Objects listed" error:<nil> 31900ms (13:18:13.355)
Trace[1575017887]: [31.901926607s] [31.901926607s] END
I0207 13:18:16.554560 1 trace.go:236] Trace[1245028553]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231 (07-Feb-2024 13:17:41.455) (total time: 35099ms):
Trace[1245028553]: ---"Objects listed" error:<nil> 18700ms (13:18:00.155)
Trace[1245028553]: ---"SyncWith done" 16398ms (13:18:16.554)
Trace[1245028553]: [35.099403448s] [35.099403448s] END
I0207 13:18:17.655904 1 trace.go:236] Trace[1419539666]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231 (07-Feb-2024 13:17:41.455) (total time: 36200ms):
Trace[1419539666]: ---"Objects listed" error:<nil> 33899ms (13:18:15.355)
Trace[1419539666]: ---"SyncWith done" 2300ms (13:18:17.655)
Trace[1419539666]: [36.200643647s] [36.200643647s] END
E0207 15:01:13.464963 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write metrics family: write tcp 10.11.34.120:8080->10.11.20.121:52334: write: broken pipe"
E0207 15:01:13.465072 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.34.120:8080->10.11.20.121:52334: write: broken pipe"
E0207 20:02:35.555637 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write metrics family: write tcp 10.11.34.120:8080->10.11.20.121:55058: write: broken pipe"
E0207 20:02:35.557513 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.34.120:8080->10.11.20.121:55058: write: broken pipe"
E0207 20:02:35.557550 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.34.120:8080->10.11.20.121:55058: write: broken pipe"
E0207 20:02:35.557563 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.34.120:8080->10.11.20.121:55058: write: broken pipe"
E0207 20:02:35.557582 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.34.120:8080->10.11.20.121:55058: write: broken pipe"
E0207 20:02:35.557683 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.34.120:8080->10.11.20.121:55058: write: broken pipe"
E0207 20:02:35.557698 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.34.120:8080->10.11.20.121:55058: write: broken pipe"
I am using VictoriaMetrics for monitoring.


towolf commented on May 23, 2024

We're also using VictoriaMetrics and see the same issue. Scraping component is "VMAgent".

image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.10.0

We always see this when the cluster is heavily loaded, with a bazillion Pods running and other churn.

We use k-s-m metrics to identify nodes in a JOIN, and the metrics always drop out when the churn rises. So it must be related to the "amount of stuff happening", for lack of a more cogent description.

See drop-outs here:

[screenshot: metric drop-outs]

Correlated error logs:

[screenshot: correlated error logs]


decipher27 commented on May 23, 2024

Facing the same issue, current KSM version: v2.8.2

Logs:

E0219 03:57:11.745859       1 metrics_handler.go:213] "Failed to write metrics" err="failed to write help text: write tcp 1xx.xxx.63.7:8080->1xx.168.34.xx:55444: write: broken pipe"
E0219 03:57:11.745869       1 metrics_handler.go:213] "Failed to write metrics" err="failed to write help text: write tcp 1xx.xxx.63.7:8080->1xx.168.34.xx:55444: write: broken pipe"
E0219 03:57:11.745878       1 metrics_handler.go:213] "Failed to write metrics" err="failed to write help text: write tcp 1xx.xxx.63.7:8080->1xx.168.34.xx:55444: write: broken pipe"


naweeng commented on May 23, 2024

KSM version: v2.10.0

I have been observing the same when there are a lot of pods in the cluster (over 5K). This works properly when the number of pods is lower (under 500).


CatherineF-dev commented on May 23, 2024

qq: could anyone help provide detailed steps to reproduce this issue? Thx!


CatherineF-dev commented on May 23, 2024

cc @bengoldenberg09, could you paste the deployment YAML again? The code formatting above got messed up.

cc @naweeng, do you remember how to reproduce?


CatherineF-dev commented on May 23, 2024

This error is thrown from https://github.com/kubernetes/kube-state-metrics/blob/v2.8.2/pkg/metricshandler/metrics_handler.go#L210-L215 when the write fails with write: broken pipe or write: connection reset by peer:

        // writer wraps the http.ResponseWriter for the scrape request
        for _, w := range m.metricsWriters {
                err := w.WriteAll(writer)
                if err != nil {
                        klog.ErrorS(err, "Failed to write metrics")
                }
        }

The error ultimately bubbles up from the underlying io.Writer's Write() call.

CatherineF-dev commented on May 23, 2024

I feel this might be related to the Go version. Could you try v2.9.0+, which uses Go 1.19? cc @naweeng


bxy4543 commented on May 23, 2024

I feel this might be related to the Go version. Could you try v2.9.0+, which uses Go 1.19? cc @naweeng

Building with Go 1.19, still the same error. @CatherineF-dev


mrueg commented on May 23, 2024

Can you check the scrape_duration_seconds for the job that scrapes it? If there's a broken pipe, the TCP/HTTP connection might get terminated early.
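
For reference, a minimal sketch of how such a check could be wired up as a recording rule, assuming a Prometheus/vmalert-style rule file; the job label and rule name are illustrative and should be adjusted to your setup:

groups:
  - name: kube-state-metrics-scrape
    rules:
      # Maximum scrape duration across kube-state-metrics targets.
      # Values close to the scrape_timeout suggest the scrape is being cut off mid-write.
      - record: job:scrape_duration_seconds:max
        expr: max by (job) (scrape_duration_seconds{job="kube-state-metrics"})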


CatherineF-dev commented on May 23, 2024

@bxy4543, could you help reproduce with detailed steps?


cody-amtote commented on May 23, 2024

I'm seeing the same issue on our dev cluster (which is fairly large).


bxy4543 commented on May 23, 2024

Can you check the scrape_duration_seconds for the job that scrapes it? If there's a broken pipe, the TCP/HTTP connection might get terminated early.

scrape_duration_seconds{cluster="cluster-name",container="kube-state-metrics",endpoint="http",instance="xxx:8080",job="kube-state-metrics",namespace="vm",pod="victoria-metrics-k8s-stack-kube-state-metrics-847ddd64-splbf",prometheus="vm/victoria-metrics-k8s-stack",service="victoria-metrics-k8s-stack-kube-state-metrics"} 1.602


bxy4543 commented on May 23, 2024

@bxy4543, could you help reproduce with detailed steps?

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "4"
    meta.helm.sh/release-name: victoria-metrics-k8s-stack
    meta.helm.sh/release-namespace: vm
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: victoria-metrics-k8s-stack
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/part-of: kube-state-metrics
    app.kubernetes.io/version: 2.10.0
    helm.sh/chart: kube-state-metrics-5.12.1
  name: victoria-metrics-k8s-stack-kube-state-metrics
  namespace: vm
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: victoria-metrics-k8s-stack
      app.kubernetes.io/name: kube-state-metrics
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: metrics
        app.kubernetes.io/instance: victoria-metrics-k8s-stack
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/part-of: kube-state-metrics
        app.kubernetes.io/version: 2.10.0
        helm.sh/chart: kube-state-metrics-5.12.1
    spec:
      containers:
      - args:
        - --port=8080
        - --resources=certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,leases,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments
        image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.10.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: kube-state-metrics
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources: {}
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 65534
        runAsGroup: 65534
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
      serviceAccount: victoria-metrics-k8s-stack-kube-state-metrics
      serviceAccountName: victoria-metrics-k8s-stack-kube-state-metrics
      terminationGracePeriodSeconds: 30

@CatherineF-dev


bxy4543 commented on May 23, 2024

Can you check the scrape_duration_seconds for the job that scrapes it? If there's a broken pipe, the TCP/HTTP connection might get terminated early.

I found the cause of the problem:
Since some data could still be obtained, I had not checked the target status. When I checked the target status, I found this error:

cannot read stream body in 1 seconds: the response from "http://xxx:8080/metrics" exceeds -promscrape.maxScrapeSize=16777216; either reduce the response size for the target or increase -promscrape.maxScrapeSize

So I increased the vmagent parameter -promscrape.maxScrapeSize (the default is 16777216), and no more errors were reported.
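
For reference, a minimal sketch of where that flag can live when vmagent is managed by the VictoriaMetrics operator; the VMAgent name and namespace below are assumptions. If vmagent runs as a plain container, add -promscrape.maxScrapeSize to its command-line args instead.

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAgent
metadata:
  name: victoria-metrics-k8s-stack   # assumed resource name; adjust to your release
  namespace: vm
spec:
  extraArgs:
    # Raise the per-scrape response limit from the 16 MiB default to 64 MiB.
    promscrape.maxScrapeSize: "67108864"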


CatherineF-dev commented on May 23, 2024

Thx!

cannot read stream body in 1 seconds: the response from "http://xxx:8080/metrics" exceeds -promscrape.maxScrapeSize=16777216; either reduce the response size for the target or increase -promscrape.maxScrapeSize

qq: where did you find this error log? Is it from prometheus?


bxy4543 commented on May 23, 2024

qq: where did you find this error log? Is it from prometheus?

I am using VictoriaMetrics; I found it on the vmagent targets page.


CatherineF-dev commented on May 23, 2024

@jgagnon44 @decipher27 @naweeng @towolf could you try the solution above (increasing the vmagent parameter -promscrape.maxScrapeSize, whose default is 16777216)?

If there are no further issues, I will close this.


CatherineF-dev commented on May 23, 2024

/close


k8s-ci-robot commented on May 23, 2024

@CatherineF-dev: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


hanwangsf commented on May 23, 2024

What's the equivalent of the vmagent parameter promscrape.maxScrapeSize in Prometheus? If it is body_size_limit, that defaults to 0, which means no limit. I'm still having the same issue, so let me know if you folks were able to implement this solution and see results.
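
For reference, a hedged sketch of the Prometheus scrape_config knobs that map most closely (scrape_timeout and body_size_limit are real Prometheus options; the job name, target, timeout, and size are examples only):

scrape_configs:
  - job_name: kube-state-metrics
    scrape_timeout: 30s        # a short timeout can also abort a large scrape mid-write
    body_size_limit: 64MB      # 0 (the default) means the uncompressed body size is unlimited
    static_configs:
      - targets: ["kube-state-metrics.monitor.svc:8080"]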
