/triage accepted
/assign @dgrisonnet
@logicalhan @dgrisonnet any estimate on when you're planning to fix it?
I'm facing the same issue with the same version.
E0202 22:02:08.428816 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:41082: write: broken pipe"
E0205 04:56:32.728805 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write metrics family: write tcp 10.11.31.68:8080->10.11.20.121:48750: write: connection reset by peer"
E0205 04:56:32.728835 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:48750: write: connection reset by peer"
E0205 04:56:32.728857 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:48750: write: connection reset by peer"
E0205 04:56:32.728870 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:48750: write: connection reset by peer"
E0205 04:56:32.728888 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:48750: write: connection reset by peer"
E0205 04:56:32.928694 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:48750: write: connection reset by peer"
E0205 04:56:32.928740 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:48750: write: connection reset by peer"
E0205 04:56:32.928757 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:48750: write: connection reset by peer"
E0207 09:31:40.528764 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write metrics family: write tcp 10.11.31.68:8080->10.11.20.121:34628: write: broken pipe"
E0207 09:31:40.528815 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:34628: write: broken pipe"
E0207 09:31:40.528829 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:34628: write: broken pipe"
E0207 09:31:40.528848 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:34628: write: broken pipe"
E0207 09:31:40.528861 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:34628: write: broken pipe"
E0207 09:31:40.528875 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:34628: write: broken pipe"
E0207 09:31:40.528894 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.31.68:8080->10.11.20.121:34628: write: broken pipe"
I don't have any bandwidth to investigate right now.
@CatherineF-dev do you perhaps have some time to take a look at this?
Ok
- Could you provide detailed steps to reproduce this issue? Thx
- Does the latest KSM still have this issue?
This is my deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: monitor
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: exporter
        app.kubernetes.io/name: kube-state-metrics
      annotations:
        karpenter.sh/do-not-evict: 'true'
        prometheus.io/path: /metrics
        prometheus.io/port: '8080'
        prometheus.io/probe: 'true'
        prometheus.io/scrape: 'true'
    spec:
      containers:
        - name: kube-state-metrics
          image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.10.0
          ports:
            - name: http-metrics
              containerPort: 8080
              protocol: TCP
            - name: telemetry
              containerPort: 8081
              protocol: TCP
          resources:
            limits:
              cpu: 10m
              memory: 100Mi
            requests:
              cpu: 10m
              memory: 100Mi
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
              scheme: HTTP
            initialDelaySeconds: 5
            timeoutSeconds: 5
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /
              port: 8081
              scheme: HTTP
            initialDelaySeconds: 5
            timeoutSeconds: 5
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: false
            runAsUser: 65534
            runAsNonRoot: false
            readOnlyRootFilesystem: false
            allowPrivilegeEscalation: true
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: kube-state-metrics
      serviceAccount: kube-state-metrics
      automountServiceAccountToken: true
      shareProcessNamespace: false
      securityContext: {}
      schedulerName: default-scheduler
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
      enableServiceLinks: true
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
So I have the latest one, and it is happening from time to time.
These were my logs when running the deployment:
I0207 13:17:41.754839 1 server.go:72] level=info msg="Listening on" address=[::]:8081
I0207 13:17:41.854588 1 server.go:72] level=info msg="TLS is disabled." http2=false address=[::]:8081
I0207 13:17:53.456710 1 trace.go:236] Trace[169429371]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231 (07-Feb-2024 13:17:41.455) (total time: 12001ms):
Trace[169429371]: ---"Objects listed" error:<nil> 11001ms (13:17:52.456)
Trace[169429371]: [12.001555192s] [12.001555192s] END
I0207 13:17:54.557245 1 trace.go:236] Trace[330206141]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231 (07-Feb-2024 13:17:41.454) (total time: 13102ms):
Trace[330206141]: ---"Objects listed" error:<nil> 13102ms (13:17:54.557)
Trace[330206141]: [13.102365236s] [13.102365236s] END
I0207 13:18:11.657805 1 trace.go:236] Trace[1483466474]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231 (07-Feb-2024 13:17:41.455) (total time: 30202ms):
Trace[1483466474]: ---"Objects listed" error:<nil> 30200ms (13:18:11.655)
Trace[1483466474]: [30.202597991s] [30.202597991s] END
I0207 13:18:13.356858 1 trace.go:236] Trace[1575017887]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231 (07-Feb-2024 13:17:41.454) (total time: 31901ms):
Trace[1575017887]: ---"Objects listed" error:<nil> 31900ms (13:18:13.355)
Trace[1575017887]: [31.901926607s] [31.901926607s] END
I0207 13:18:16.554560 1 trace.go:236] Trace[1245028553]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231 (07-Feb-2024 13:17:41.455) (total time: 35099ms):
Trace[1245028553]: ---"Objects listed" error:<nil> 18700ms (13:18:00.155)
Trace[1245028553]: ---"SyncWith done" 16398ms (13:18:16.554)
Trace[1245028553]: [35.099403448s] [35.099403448s] END
I0207 13:18:17.655904 1 trace.go:236] Trace[1419539666]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:231 (07-Feb-2024 13:17:41.455) (total time: 36200ms):
Trace[1419539666]: ---"Objects listed" error:<nil> 33899ms (13:18:15.355)
Trace[1419539666]: ---"SyncWith done" 2300ms (13:18:17.655)
Trace[1419539666]: [36.200643647s] [36.200643647s] END
E0207 15:01:13.464963 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write metrics family: write tcp 10.11.34.120:8080->10.11.20.121:52334: write: broken pipe"
E0207 15:01:13.465072 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.34.120:8080->10.11.20.121:52334: write: broken pipe"
E0207 20:02:35.555637 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write metrics family: write tcp 10.11.34.120:8080->10.11.20.121:55058: write: broken pipe"
E0207 20:02:35.557513 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.34.120:8080->10.11.20.121:55058: write: broken pipe"
E0207 20:02:35.557550 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.34.120:8080->10.11.20.121:55058: write: broken pipe"
E0207 20:02:35.557563 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.34.120:8080->10.11.20.121:55058: write: broken pipe"
E0207 20:02:35.557582 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.34.120:8080->10.11.20.121:55058: write: broken pipe"
E0207 20:02:35.557683 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.34.120:8080->10.11.20.121:55058: write: broken pipe"
E0207 20:02:35.557698 1 metrics_handler.go:215] "Failed to write metrics" err="failed to write help text: write tcp 10.11.34.120:8080->10.11.20.121:55058: write: broken pipe"
I am using VictoriaMetrics for monitoring.
We're also using VictoriaMetrics and see the same issue. Scraping component is "VMAgent".
image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.10.0
We see this whenever the cluster is heavily loaded, with a bazillion Pods running and other churn.
We use k-s-m metrics to identify nodes in a JOIN, and the metrics always drop out when the churn rises. So it must be related to the "amount of stuff happening", for lack of a more cogent description.
See drop-outs here: [graph screenshot not included]
Correlated error logs: [log screenshot not included]
Facing the same issue, current KSM version: v2.8.2
Logs:
E0219 03:57:11.745859 1 metrics_handler.go:213] "Failed to write metrics" err="failed to write help text: write tcp 1xx.xxx.63.7:8080->1xx.168.34.xx:55444: write: broken pipe"
E0219 03:57:11.745869 1 metrics_handler.go:213] "Failed to write metrics" err="failed to write help text: write tcp 1xx.xxx.63.7:8080->1xx.168.34.xx:55444: write: broken pipe"
E0219 03:57:11.745878 1 metrics_handler.go:213] "Failed to write metrics" err="failed to write help text: write tcp 1xx.xxx.63.7:8080->1xx.168.34.xx:55444: write: broken pipe"
KSM version: v2.10.0
I have been observing the same when there are a lot of pods in the cluster (over 5K). It works properly when the number of pods is lower (under 500).
qq: could anyone help provide detailed steps to reproduce this issue? Thx!
cc @bengoldenberg09, could you paste the deployment YAML again? The code formatting above got messed up.
cc @naweeng, do you remember how to reproduce?
This error is thrown from https://github.com/kubernetes/kube-state-metrics/blob/v2.8.2/pkg/metricshandler/metrics_handler.go#L210-L215 with the error write: broken pipe or write: connection reset by peer:
// w http.ResponseWriter
for _, w := range m.metricsWriters {
	err := w.WriteAll(writer)
	if err != nil {
		klog.ErrorS(err, "Failed to write metrics")
	}
}
The failure surfaces from the Write() call on the underlying io.Writer.
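To make the failure mode concrete, here is a minimal, self-contained sketch (not KSM's actual code; the handler and payload are invented for illustration): a handler streaming a large response gets exactly this error from Write() once the scraper hangs up early.

package main

import (
	"log"
	"net/http"
	"strings"
	"time"
)

// metricsHandler simulates a large, slow /metrics response. If the client
// (the scraper) disconnects before the response is fully written, e.g.
// because it hit a scrape timeout or a max-response-size limit, the
// remaining Write calls fail with "broken pipe" or "connection reset by peer".
func metricsHandler(w http.ResponseWriter, r *http.Request) {
	chunk := strings.Repeat(`kube_pod_info{pod="demo"} 1`+"\n", 1024)
	for i := 0; i < 1000; i++ {
		if _, err := w.Write([]byte(chunk)); err != nil {
			log.Printf("Failed to write metrics: %v", err) // same symptom KSM logs
			return
		}
		if f, ok := w.(http.Flusher); ok {
			f.Flush() // push bytes onto the socket so the write error surfaces
		}
		time.Sleep(10 * time.Millisecond) // simulate slow generation of a big payload
	}
}

func main() {
	http.HandleFunc("/metrics", metricsHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}

Scraping it with, say, curl --max-time 1 http://localhost:8080/metrics makes the client abort after a second, and the server then logs the broken-pipe error, matching the logs above.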
I feel this might be related to the Go version. Could you try v2.9.0+, which uses Go 1.19? cc @naweeng
I feel this might be related to the Go version. Could you try v2.9.0+, which uses Go 1.19? cc @naweeng

Building with Go 1.19, still the same error @CatherineF-dev
Can you check the scrape_duration_seconds for the job that scrapes it? If there's a broken pipe, the TCP/HTTP connection might get terminated early.
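For example, an illustrative Prometheus-style alert rule for spotting this (the job label and the 25s threshold are assumptions, tuned for a 30s scrape timeout; adapt to your setup):

# Illustrative rule file: fire when KSM scrapes take most of a 30s timeout.
groups:
  - name: ksm-scrape-health
    rules:
      - alert: KubeStateMetricsSlowScrape
        expr: scrape_duration_seconds{job="kube-state-metrics"} > 25
        for: 10m
        annotations:
          summary: "kube-state-metrics scrape is close to the scrape timeout"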
@bxy4543, could you help reproduce with detailed steps?
I'm seeing the same issue on our dev cluster (which is fairly large).
Can you check the scrape_duration_seconds for the job that scrapes it? If there's a broken pipe, the TCP/HTTP connection might get terminated early.
scrape_duration_seconds{cluster="cluster-name",container="kube-state-metrics",endpoint="http",instance="xxx:8080",job="kube-state-metrics",namespace="vm",pod="victoria-metrics-k8s-stack-kube-state-metrics-847ddd64-splbf",prometheus="vm/victoria-metrics-k8s-stack",service="victoria-metrics-k8s-stack-kube-state-metrics"} 1.602
@bxy4543, could you help reproduce with detailed steps?
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "4"
    meta.helm.sh/release-name: victoria-metrics-k8s-stack
    meta.helm.sh/release-namespace: vm
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: victoria-metrics-k8s-stack
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/part-of: kube-state-metrics
    app.kubernetes.io/version: 2.10.0
    helm.sh/chart: kube-state-metrics-5.12.1
  name: victoria-metrics-k8s-stack-kube-state-metrics
  namespace: vm
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: victoria-metrics-k8s-stack
      app.kubernetes.io/name: kube-state-metrics
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: metrics
        app.kubernetes.io/instance: victoria-metrics-k8s-stack
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/part-of: kube-state-metrics
        app.kubernetes.io/version: 2.10.0
        helm.sh/chart: kube-state-metrics-5.12.1
    spec:
      containers:
      - args:
        - --port=8080
        - --resources=certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,leases,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments
        image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.10.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: kube-state-metrics
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources: {}
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 65534
        runAsGroup: 65534
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
      serviceAccount: victoria-metrics-k8s-stack-kube-state-metrics
      serviceAccountName: victoria-metrics-k8s-stack-kube-state-metrics
      terminationGracePeriodSeconds: 30
Can you check the scrape_duration_seconds for the job that scrapes it? If there's a broken pipe, the TCP/HTTP connection might get terminated early.
I found the cause of the problem:
Since some data could still be obtained, I hadn't checked the target status at first. When I did check it, I found this error:
cannot read stream body in 1 seconds: the response from "http://xxx:8080/metrics" exceeds -promscrape.maxScrapeSize=16777216; either reduce the response size for the target or increase -promscrape.maxScrapeSize
So I increased the vmagent parameter -promscrape.maxScrapeSize (default is 16777216), and no more errors were reported.
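For reference, a sketch of where that flag could be set on a vmagent container. The -promscrape.maxScrapeSize flag and its 16777216-byte (16 MiB) default are real vmagent settings; the 64 MiB value, image tag, and surrounding layout are illustrative:

# Illustrative vmagent Deployment fragment: raise the per-target scrape
# response limit from the 16 MiB default.
containers:
  - name: vmagent
    image: victoriametrics/vmagent:latest   # pin to your deployed version
    args:
      - -promscrape.config=/etc/vmagent/scrape.yml   # path is illustrative
      - -promscrape.maxScrapeSize=67108864           # 64 MiB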
Thx!
cannot read stream body in 1 seconds: the response from "http://xxx:8080/metrics" exceeds -promscrape.maxScrapeSize=16777216; either reduce the response size for the target or increase -promscrape.maxScrapeSize
qq: where did you find this error log? Is it from Prometheus?
qq: where did you find this error log? Is it from Prometheus?

I am using victoria-metrics; I found the error on the vmagent targets page.
@jgagnon44 @decipher27 @naweeng @towolf could you try the above solution, "So I increased the vmagent parameter -promscrape.maxScrapeSize (default is 16777216), and no more errors were reported"?
If there are no issues, I will close this.
/close
@CatherineF-dev: Closing this issue.
In response to this:
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What's the equivalent of the vmagent parameter promscrape.maxScrapeSize in Prometheus? If it is body_size_limit, that defaults to 0, which means no limit. I'm still having the same issue; let me know if you folks were able to implement such a solution and see results.
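For context, this is where those knobs sit in a Prometheus scrape config. body_size_limit is a real scrape_config field in recent Prometheus releases and is a cap rather than a fix, so the closer analogue to the vmagent timeout above may be scrape_timeout. An illustrative fragment (job name, target, and values are placeholders):

# Illustrative prometheus.yml fragment; names and targets are placeholders.
scrape_configs:
  - job_name: kube-state-metrics
    scrape_interval: 30s
    scrape_timeout: 30s       # raise if large responses get cut off mid-scrape
    body_size_limit: 64MB     # default 0 = unlimited; scrapes over the limit fail
    static_configs:
      - targets: ["kube-state-metrics.monitoring.svc:8080"]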