Comments (13)

gravis commented on August 23, 2024

I can see some errors now on the hawkular-metrics pod:

20:45:16,092 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0051: Admin console listening on http://127.0.0.1:9990
20:45:16,093 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0025: WildFly Full 10.0.0.Final (WildFly Core 2.0.10.Final) started in 15875ms - Started 374 of 664 services (385 services are lazy, passive or on-demand)
20:48:15,036 INFO  [org.jboss.as.server] (Thread-2) WFLYSRV0220: Server shutdown has been requested.
*** JBossAS process (124) received TERM signal ***
20:48:15,289 INFO  [org.wildfly.extension.undertow] (ServerService Thread Pool -- 65) WFLYUT0022: Unregistered web context: /hawkular/metrics
20:48:15,357 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-2) WFLYUT0019: Host default-host stopping
20:48:15,365 INFO  [org.jboss.weld.deployer] (MSC service thread 1-4) WFLYWELD0010: Stopping weld service for deployment hawkular-metrics-api-jaxrs.war
20:48:15,487 INFO  [org.jboss.as.connector.subsystems.datasources] (MSC service thread 1-1) WFLYJCA0010: Unbound data source [java:jboss/datasources/ExampleDS]
20:48:15,497 INFO  [org.jboss.as.connector.deployers.jdbc] (MSC service thread 1-1) WFLYJCA0019: Stopped Driver service with driver-name = h2
20:48:15,517 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-3) WFLYUT0008: Undertow HTTPS listener httpsServer suspending
20:48:15,517 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-2) WFLYUT0008: Undertow HTTP listener default suspending
20:48:15,522 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-2) WFLYUT0007: Undertow HTTP listener default stopped, was bound to 0.0.0.0:8080
20:48:15,523 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-3) WFLYUT0007: Undertow HTTPS listener httpsServer stopped, was bound to 0.0.0.0:8443
20:48:15,525 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-1) WFLYUT0004: Undertow 1.3.15.Final stopping
20:48:15,572 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-4) WFLYSRV0028: Stopped deployment hawkular-metrics-api-jaxrs.war (runtime-name: hawkular-metrics-api-jaxrs.war) in 521ms
20:48:15,578 WARN  [io.netty.util.ThreadDeathWatcher] (threadDeathWatcher-2-1) Thread death watcher task raised an exception:: java.lang.NoClassDefFoundError: io/netty/util/internal/Cleaner0
    at io.netty.util.internal.PlatformDependent0.freeDirectBuffer(PlatformDependent0.java:147)
    at io.netty.util.internal.PlatformDependent.freeDirectBuffer(PlatformDependent.java:281)
    at io.netty.buffer.PoolArena$DirectArena.destroyChunk(PoolArena.java:448)
    at io.netty.buffer.PoolChunkList.free(PoolChunkList.java:70)
    at io.netty.buffer.PoolThreadCache$MemoryRegionCache.freeEntry(PoolThreadCache.java:466)
    at io.netty.buffer.PoolThreadCache$MemoryRegionCache.free(PoolThreadCache.java:423)
    at io.netty.buffer.PoolThreadCache.free(PoolThreadCache.java:248)
    at io.netty.buffer.PoolThreadCache.free(PoolThreadCache.java:239)
    at io.netty.buffer.PoolThreadCache.free0(PoolThreadCache.java:220)
    at io.netty.buffer.PoolThreadCache.access$000(PoolThreadCache.java:32)
    at io.netty.buffer.PoolThreadCache$1.run(PoolThreadCache.java:58)
    at io.netty.util.ThreadDeathWatcher$Watcher.notifyWatchees(ThreadDeathWatcher.java:195)
    at io.netty.util.ThreadDeathWatcher$Watcher.run(ThreadDeathWatcher.java:130)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: io.netty.util.internal.Cleaner0 from [Module "deployment.hawkular-metrics-api-jaxrs.war:main" from Service Module Loader]
    at org.jboss.modules.ModuleClassLoader.findClass(ModuleClassLoader.java:198)
    at org.jboss.modules.ConcurrentClassLoader.performLoadClassUnchecked(ConcurrentClassLoader.java:363)
    at org.jboss.modules.ConcurrentClassLoader.performLoadClass(ConcurrentClassLoader.java:351)
    at org.jboss.modules.ConcurrentClassLoader.loadClass(ConcurrentClassLoader.java:93)
    ... 15 more

20:48:15,583 INFO  [org.jboss.as] (MSC service thread 1-4) WFLYSRV0050: WildFly Full 10.0.0.Final (WildFly Core 2.0.10.Final) stopped in 438ms
*** JBossAS process (124) received TERM signal ***

from origin-metrics.

gravis commented on August 23, 2024

and heapster was restarting before this (new) problem

mwringe commented on August 23, 2024

I would suspect something like a lifecycle hook doing something strange here, but we don't have any lifecycle scripts for Heapster in origin-metrics. The Hawkular containers do have lifecycle hooks, but since you are seeing the same thing with both, I suspect it's something different.

The fact that you get metrics to display in the console means that all the components have, at least at some point, started up and are functioning, so it's not related to something like a deployment or security issue.

Is there any difference you can think of between the node which is working for you and the one in which pods are being killed?

gravis commented on August 23, 2024

I thought it was the livenessProbe, but even with that part removed, I can still see pod restarts.
The weird thing is that when heapster is restarting, I can see 2 heapster pods starting, then only one is kept:

screenshot 2016-03-02 09 28 51

screenshot 2016-03-02 09 29 38

I really wonder why the RC would create 2 pods for only 1 replica...

gravis commented on August 23, 2024

I just redeployed the metrics in this cluster from scratch (PV, PVC, etc.), and... now it's working.
I don't have an explanation. I did discover a FAILED PV on this cluster, claimed by a project that has since been deleted. I wonder if it could be related (something that would crash heapster silently).

mwringe commented on August 23, 2024

Heapster itself doesn't use a PV though, so it's still a mystery to me what would be causing this.

Please let us know if you encounter this again or if there is any way to reproduce it.

gravis commented on August 23, 2024

and it doesn't get metrics on them either, I know. That's the only difference I can see :(

gravis commented on August 23, 2024

Looks like it's related to one of our nodes.
It was NotReady, so we rebooted it, and now I can see the metrics pods restarting like before, resulting in split metrics:

screenshot 2016-03-09 08 15 46

We're investigating

mwringe commented on August 23, 2024

Do you know which pods in particular are restarting? Is it just the metrics pods, or are other pods on the node also restarting?

gravis commented on August 23, 2024

I can see some unexpected "Killing container with docker id 29d1fb99c377: Need to kill pod." messages in another project, so it's definitely related to something else.

gravis commented on August 23, 2024

Sounds like it's related to SELinux:

Mar 09 17:14:15 prod-node-2 audit[28912]: <audit-1400> avc: denied { write } for pid=28912 comm="java" path="/cassandra_data/commitlog/CommitLog-5-1457540055831.log" dev="fuse" ino=10111675231085273499 scontext=system_u:system_r:svirt_lxc_net_t:s0:c0,c5 tcontext=system_u:object_r:fusefs_t:s0 tclass=file permissive=1

I don't know why this node would be different; they were all installed with the Ansible playbook.
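
For anyone comparing denials across nodes, here is a minimal Python sketch that splits an "avc: denied" line like the one above into its fields. The function name parse_avc is hypothetical, and the regular expressions are an assumption based only on the single sample line in this comment, so other audit record layouts may need adjusting. Note that permissive=1 normally means the access was only logged, not blocked, so this particular denial may not be what kills the pods.

```python
import re

# Sample line copied from the comment above.
AVC_LINE = (
    'Mar 09 17:14:15 prod-node-2 audit[28912]: <audit-1400> avc: denied { write } '
    'for pid=28912 comm="java" '
    'path="/cassandra_data/commitlog/CommitLog-5-1457540055831.log" dev="fuse" '
    'ino=10111675231085273499 scontext=system_u:system_r:svirt_lxc_net_t:s0:c0,c5 '
    'tcontext=system_u:object_r:fusefs_t:s0 tclass=file permissive=1'
)


def parse_avc(line):
    """Return the fields of an 'avc: denied' line as a dict, or None if absent."""
    match = re.search(r"avc:\s+denied\s+\{\s*([^}]*?)\s*\}\s+for\s+(.*)", line)
    if match is None:
        return None
    # The braces hold one or more space-separated permissions, e.g. "write".
    fields = {"permissions": match.group(1).split()}
    # The rest is key=value pairs; values are either quoted or bare tokens.
    for key, quoted, bare in re.findall(r'(\w+)=(?:"([^"]*)"|(\S+))', match.group(2)):
        fields[key] = quoted if quoted else bare
    return fields


def selinux_type(context):
    """Extract the type (third field) from a user:role:type:level context."""
    return context.split(":")[2]


fields = parse_avc(AVC_LINE)
print(fields["permissions"], fields["path"])
print(selinux_type(fields["scontext"]), "->", selinux_type(fields["tcontext"]))
print("permissive:", fields["permissive"])
```

Running the same parser over the audit log of a healthy node and of prod-node-2 would show whether the source and target types actually differ between them.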

gravis commented on August 23, 2024

I think I have a good hint.
I checked the logs on nodes AND master simultaneously, and discovered on master (only):

Mar 19 22:14:23 prod-master-1 origin-master[21162]: I0319 22:14:23.790668   21162 event.go:210] Event(api.ObjectReference{Kind:"DaemonSet", Namespace:"openshift-infra", Name:"newrelic-agent", UID:"6a09d31b-9c84-11e5-8779-005056b12d45", APIVersion:"extensions", ResourceVersion:"7614483", FieldPath:""}): type: 'Warning' reason: 'FailedCreate' Error creating: pods "newrelic-agent-" is forbidden: unable to validate against any security context constraint: [spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.securityContext.hostPID: Invalid value: true: Host PID is not allowed to be used spec.securityContext.hostIPC: Invalid value: true: Host IPC is not allowed to be used spec.containers[0].securityContext.volumes[1]: Invalid value: "hostPath": HostPath volumes are not allowed to be used spec.containers[0].securityContext.volumes[2]: Invalid value: "hostPath": HostPath volumes are not allowed to be used spec.containers[0].securityContext.volumes[3]: Invalid value: "hostPath": HostPath volumes are not allowed to be used spec.containers[0].securityContext.volumes[4]: Invalid value: "hostPath": HostPath volumes are not allowed to be used spec.containers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.containers[0].securityContext.hostPID: Invalid value: true: Host PID is not allowed to be used spec.containers[0].securityContext.hostIPC: Invalid value: true: Host IPC is not allowed to be used]
Mar 19 22:14:23 prod-master-1 origin-master[21162]: E0319 22:14:23.794229   21162 event.go:192] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"newrelic-agent.143b83a3e303dde3", GenerateName:"", Namespace:"openshift-infra", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"DaemonSet", Namespace:"openshift-infra", Name:"newrelic-agent", UID:"6a09d31b-9c84-11e5-8779-005056b12d45", APIVersion:"extensions", ResourceVersion:"7614483", FieldPath:""}, Reason:"FailedCreate", Message:"Error creating: pods \"newrelic-agent-\" is forbidden: unable to validate against any security context constraint: [spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.securityContext.hostPID: Invalid value: true: Host PID is not allowed to be used spec.securityContext.hostIPC: Invalid value: true: Host IPC is not allowed to be used spec.containers[0].securityContext.volumes[1]: Invalid value: \"hostPath\": HostPath volumes are not allowed to be used spec.containers[0].securityContext.volumes[2]: Invalid value: \"hostPath\": HostPath volumes are not allowed to be used spec.containers[0].securityContext.volumes[3]: Invalid value: \"hostPath\": HostPath volumes are not allowed to be used spec.containers[0].securityContext.volumes[4]: Invalid value: \"hostPath\": HostPath volumes are not allowed to be used spec.containers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.containers[0].securityContext.hostPID: Invalid value: true: Host PID is not allowed to be used spec.containers[0].securityContext.hostIPC: Invalid value: true: Host IPC is not allowed to be used]", Source:api.EventSource{Component:"daemon-set", Host:""}, FirstTimestamp:unversi
Mar 19 22:14:23 prod-master-1 origin-master[21162]: oned.Time{Time:time.Time{sec:63593500444, nsec:296076771, loc:(*time.Location)(0x56a0960)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63594018863, nsec:790067873, loc:(*time.Location)(0x56a0960)}}, Count:15753, Type:"Warning"}': 'events "newrelic-agent.143b83a3e303dde3" not found' (will not retry!)
Mar 19 22:14:23 prod-master-1 origin-master[21162]: I0319 22:14:23.854822   21162 event.go:210] Event(api.ObjectReference{Kind:"DaemonSet", Namespace:"openshift-infra", Name:"newrelic-agent", UID:"6a09d31b-9c84-11e5-8779-005056b12d45", APIVersion:"extensions", ResourceVersion:"7614483", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted pod: heapster-32ig2
Mar 19 22:14:23 prod-master-1 origin-master[21162]: E0319 22:14:23.860263   21162 event.go:192] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"newrelic-agent.143c22a7784daee2", GenerateName:"", Namespace:"openshift-infra", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"DaemonSet", Namespace:"openshift-infra", Name:"newrelic-agent", UID:"6a09d31b-9c84-11e5-8779-005056b12d45", APIVersion:"extensions", ResourceVersion:"7614483", FieldPath:""}, Reason:"SuccessfulDelete", Message:"(events with common reason combined)", Source:api.EventSource{Component:"daemon-set", Host:""}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63593675282, nsec:34437858, loc:(*time.Location)(0x56a0960)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63594018863, nsec:854519129, loc:(*time.Location)(0x56a0960)}}, Count:790, Type:"Normal"}': 'events "newrelic-agent.143c22a7784daee2" not found' (will not retry!)
Mar 19 22:14:23 prod-master-1 origin-master[21162]: I0319 22:14:23.885496   21162 event.go:210] Event(api.ObjectReference{Kind:"ReplicationController", Namespace:"openshift-infra", Name:"heapster", UID:"70f3169f-edc9-11e5-8e60-005056b12d45", APIVersion:"v1", ResourceVersion:"7618434", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: heapster-7jj3f

This newrelic-agent DaemonSet was left over from an earlier test that wasn't successful (https://github.com/kubernetes/kubernetes/tree/master/examples/newrelic). We forgot to remove it afterwards.

Guess what: since I removed the DaemonSet, at 5:20:

screenshot 2016-03-19 17 30 15

...
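
In case it helps someone hitting the same symptom, here is a minimal Python sketch of the check that gave it away: scan the master log for controller events and list which controller created or deleted which pod. The names EVENT_RE and pod_events are hypothetical, and the pattern is an assumption derived only from the event.go:210 lines pasted above, so other Origin or Kubernetes versions may format these events differently.

```python
import re

# Two event lines copied from the master log above.
SAMPLE_LOG = """\
Mar 19 22:14:23 prod-master-1 origin-master[21162]: I0319 22:14:23.854822   21162 event.go:210] Event(api.ObjectReference{Kind:"DaemonSet", Namespace:"openshift-infra", Name:"newrelic-agent", UID:"6a09d31b-9c84-11e5-8779-005056b12d45", APIVersion:"extensions", ResourceVersion:"7614483", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted pod: heapster-32ig2
Mar 19 22:14:23 prod-master-1 origin-master[21162]: I0319 22:14:23.885496   21162 event.go:210] Event(api.ObjectReference{Kind:"ReplicationController", Namespace:"openshift-infra", Name:"heapster", UID:"70f3169f-edc9-11e5-8e60-005056b12d45", APIVersion:"v1", ResourceVersion:"7618434", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: heapster-7jj3f
"""

# Captures the controller kind and name, plus the event type, reason and message.
EVENT_RE = re.compile(
    r'Event\(api\.ObjectReference\{Kind:"(?P<kind>[^"]*)", '
    r'Namespace:"(?P<namespace>[^"]*)", Name:"(?P<name>[^"]*)".*?\}\): '
    r"type: '(?P<type>[^']*)' reason: '(?P<reason>[^']*)' (?P<message>.*)"
)


def pod_events(lines):
    """Return (controller kind, controller name, action, pod) for pod events."""
    events = []
    for line in lines:
        match = EVENT_RE.search(line)
        if match is None:
            continue
        # Messages of interest look like "Deleted pod: <name>" or "Created pod: <name>".
        action, sep, pod = match.group("message").partition(" pod: ")
        if sep:
            events.append((match.group("kind"), match.group("name"), action, pod.strip()))
    return events


for kind, name, action, pod in pod_events(SAMPLE_LOG.splitlines()):
    print(f"{kind}/{name} {action.lower()} {pod}")
```

On the sample lines this pairs the newrelic-agent DaemonSet with the deletion of heapster-32ig2 and the heapster ReplicationController with the creation of heapster-7jj3f, which matches the restart pattern described earlier in this thread.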

mwringe commented on August 23, 2024

Housekeeping to close older issues. If you think this issue is not resolved, please reopen it.
