
heapster's Introduction

Heapster

RETIRED: Heapster is now retired. See the deprecation timeline for more information on support. We will not be making changes to Heapster.

The following are potential migration paths for Heapster functionality:

  • For basic CPU/memory HPA metrics: Use metrics-server.

  • For general monitoring: Consider a third-party monitoring pipeline that can gather Prometheus-formatted metrics. The kubelet exposes all the metrics exported by Heapster in Prometheus format. One such monitoring pipeline can be set up using the Prometheus Operator, which deploys Prometheus itself for this purpose.

  • For event transfer: Several third-party tools exist to transfer/archive Kubernetes events, depending on your sink. heptiolabs/eventrouter has been suggested as a general alternative.


Heapster enables Container Cluster Monitoring and Performance Analysis for Kubernetes (versions v1.0.6 and higher), and platforms which include it.

Heapster collects and interprets various signals like compute resource usage, lifecycle events, etc. Note that the model API, formerly used to provide REST access to its collected metrics, is now deprecated. Please see the model documentation for more details.

Heapster supports multiple sources of data. More information here.

Heapster supports the pluggable storage backends described here, and we welcome patches that add additional storage backends. Documentation on storage sinks is here. The current version of the storage schema is documented here.

Running Heapster on Kubernetes

Heapster can run on a Kubernetes cluster with a number of storage backends; see the sink documentation above for some common choices.

Running Heapster on OpenShift

Using Heapster to monitor an OpenShift cluster requires some additional changes to the Kubernetes instructions to allow communication between the Heapster instance and OpenShift's secured endpoints. To run standalone Heapster or a combination of Heapster and Hawkular-Metrics in OpenShift, follow this guide.

Troubleshooting guide here

Community

Contributions, questions, and comments are all welcome and encouraged! Developers hang out on Slack in the #sig-instrumentation channel (get an invitation here). We also have the kubernetes-dev Google Groups mailing list. If you are posting to the list, please prefix your subject with "heapster: ".


heapster's Issues

Heapster stops pushing data into Influxdb on Deis cluster

I have a Deis cluster set up on AWS and I've tried using Heapster to monitor it, but it pushes data into InfluxDB only for a while after a start/restart and then stops. I set everything up as explained in this guide. Here are my configs. Am I doing something wrong, or will Heapster simply not work with Deis?

core@ip-10-21-1-34 ~/monitoring $ cat cadvisor.service 
[Unit]
Description=cAdvisor Service
After=docker.socket
Requires=docker.socket

[Service]
TimeoutStartSec=10m
Restart=always
ExecStartPre=-/usr/bin/docker kill cadvisor
ExecStartPre=-/usr/bin/docker rm -f cadvisor
ExecStartPre=/usr/bin/docker pull google/cadvisor
ExecStart=/usr/bin/docker run --volume=/:/rootfs:ro --volume=/var/run:/var/run:rw --volume=/sys:/sys:ro --volume=/var/lib/docker/:/var/lib/docker:ro --publish=4194:4194 --publish=8080:8080 --name=cadvisor --net=host google/cadvisor:latest --logtostderr
ExecStop=/usr/bin/docker stop -t 2 cadvisor

[X-Fleet]
Global=true
core@ip-10-21-1-34 ~/monitoring $ cat influxdb.service 
[Unit]
Description=InfluxDB Service
After=docker.socket
Requires=docker.socket

[Service]
TimeoutStartSec=10m
Restart=always
ExecStartPre=-/usr/bin/docker kill influxdb
ExecStartPre=-/usr/bin/docker rm -f influxdb
ExecStartPre=/usr/bin/docker pull kubernetes/heapster_influxdb
ExecStart=/usr/bin/docker run --name influxdb -p 8083:8083 -p 8086:8086 -p 8090:8090 -p 8099:8099 kubernetes/heapster_influxdb
ExecStop=/usr/bin/docker stop -t 2 influxdb

[X-Fleet]
MachineID=60023ec21a6643ebb0a71a54454d1bcc
core@ip-10-21-1-34 ~/monitoring $ cat heapster.service 
[Unit]
Description=Heapster Service
After=docker.socket
Requires=docker.socket

[Service]
ExecStartPre=/bin/sh -c "docker history kubernetes/heapster:latest >/dev/null || docker pull kubernetes/heapster:latest"
ExecStart=/usr/bin/docker run -d -e INFLUXDB_HOST=10.21.***.***:8086 -e COREOS=true --name heapster --net=host --restart=on-failure:5 --publish=8082:8082 kubernetes/heapster:latest

[X-Fleet]
MachineOf=influxdb.service
core@ip-10-21-1-34 ~/monitoring $ cat grafana.service 
[Unit]
Description=Grafana service
Requires=docker.socket
After=docker.socket

[Service]
ExecStartPre=/bin/sh -c "docker history kubernetes/heapster_grafana:latest >/dev/null || docker pull kubernetes/heapster_grafana:latest"
ExecStart=/usr/bin/docker run -d -p 8081:80 -e INFLUXDB_HOST=52.11.***.*** -e INFLUXDB_NAME=k8s -e HTTP_USER=admin -e HTTP_PASS=****** kubernetes/heapster_grafana:v0.4

[X-Fleet]
MachineOf=heapster.service

Heapster logs:

core@ip-10-21-1-34 ~/monitoring $ docker logs -f heapster         
+ EXTRA_ARGS=
+ '[' '!' -z true ']'
+ EXTRA_ARGS=' --coreos'
+ '[' '!' -z ']'
+ '[' '!' -x ']'
+ HEAPSTER='/usr/bin/heapster  --coreos '
+ '[' '!' -z ']'
+ '[' '!' -z 10.21.***.***:8086 ']'
+ /usr/bin/heapster --coreos --sink influxdb --sink_influxdb_host 10.21.***.***:8086
I0311 10:28:25.571350       7 heapster.go:44] /usr/bin/heapster --coreos --sink influxdb --sink_influxdb_host 10.21.1.34:8086
I0311 10:28:25.571446       7 heapster.go:45] Heapster version 0.8
I0311 10:28:25.571555       7 heapster.go:46] Flags: alsologtostderr='false' bq_account='' bq_credentials_file='' bq_id='' bq_project_id='' bq_secret='notasecret' cadvisor_port='8080' coreos='true' external_hosts_file='/var/run/heapster/hosts' fleet_endpoints='http://127.0.0.1:4001' kubelet_port='10250' kubernetes_insecure='true' kubernetes_master='' listen_ip='' log_backtrace_at=':0' log_dir='' logtostderr='true' max_procs='0' poll_duration='10s' port='8082' sink='influxdb' sink_influxdb_buffer_duration='10s' sink_influxdb_host='10.21.1.34:8086' sink_influxdb_name='k8s' sink_influxdb_password='root' sink_influxdb_username='root' sink_memory_ttl='1h0m0s' stderrthreshold='2' v='0' vmodule=''
I0311 10:28:25.571585       7 influxdb.go:255] Using influxdb on host "10.21.1.34:8086" with database "k8s"
I0311 10:28:25.573334       7 heapster.go:57] Starting heapster on port 8082

Provide binaries as part of the release

I believe it would be beneficial to users if pre-built binaries were provided as part of the release. Kubernetes and cAdvisor do this, which makes it simpler to get started.

Collecting application level metrics

Currently Heapster collects metrics from cAdvisor to gather container-level metrics. Collecting application-level metrics would provide an even richer set of data. Applications would be required to implement an HTTP endpoint exposing the metrics they want to publish.

As a PoC, we have a Heapster fork, jadvisor (https://github.com/fabric8io/jadvisor), that collects JVM stats via JMX, exposed through Jolokia (http://jolokia.org/) for HTTP access to JMX. Currently the stats collected are hard-coded, or rather discovered from what is available in the JMX tree. I would like to make this more generic so it can handle any HTTP endpoint and hence be stack-agnostic.

The difference this would introduce to Heapster is the need to connect to individual pods rather than just to the kubelet/cAdvisor, which would significantly increase the number of connections made. This would likely require work on clustering and sharding to implement properly.

export diskio stats

cAdvisor supports diskio stats. These should be exported to the storage backends as well.

Collect and export kubernetes events and docker events

InfluxDB supports events natively; other backends might not. We can place structure around some specific events like container starts, deletions, restarts, etc., and export them as metrics, but that might be sub-optimal.
We need to explore how events are currently handled by various monitoring/alerting systems.

heapster and kubernetes: not getting any container stats

I'm currently testing heapster with kubernetes v0.5 and cadvisor 0.6.2.

Grafana doesn't display any graphs with per container stats (total cluster CPU and memory usage is displayed fine), it's just giving the error "Couldn't look up columns for series: stats"

Heapster is also displaying errors constantly:

E1121 09:30:41.665006 00007 kube.go:104] failed to get stats from kubelet on host with ip   192.168.200.230 - Got 'Internal Error: unable to unmarshal container info for "/docker/35c89c2b64d4d76ffda3cd7e3797273a6ba829bae81ca15364ed35765b546fbb" (failed to get container "/docker/35c89c2b64d4d76ffda3cd7e3797273a6ba829bae81ca15364ed35765b546fbb" with error: unknown container "/docker/35c89c2b64d4d76ffda3cd7e3797273a6ba829bae81ca15364ed35765b546fbb"
): invalid character 'i' in literal false (expecting 'l')
': invalid character 'I' looking for beginning of value

So I guess there's some problem with getting the data from cadvisor into InfluxDB?

Best regards,
Julian

Parallelize Heapster Core

Allow the Heapster core to fetch and export multiple streams simultaneously. This should also entail work to allow a single stream to fall behind and be robust to node failures.

Incorrect data type NodeList vs MinionList

I'm running Kubernetes 0.7.0 and Heapster 0.4.0 and I noticed the Heapster container exits after ~10 seconds. I see in its logs:

heapster.go:22] data of kind 'NodeList', obj of type 'MinionList'

CPU spike

Is this normal?

Almost 60% CPU usage on a 2 GB RAM VM (Fedora 20 on DigitalOcean), yet it's not even running in the first place:

[root@k8-master heapster]# k8 get pods
NAME                                   IMAGE(S)                       HOST                LABELS                             STATUS
548e1e240e7bc2780d000003               fedora/apache:latest           <unassigned>        name=548e1e240e7bc2780d000003      Pending
27a27940-83d9-11e4-bea6-040135044f01   kubernetes/heapster_influxdb   188.226.142.107/    name=influxGrafana                 Failed
                                       kubernetes/heapster_grafana                                                           
                                       dockerfile/elasticsearch                                                              
ffed25fd-83e4-11e4-bea6-040135044f01   kubernetes/heapster            188.226.142.107/    name=heapster,uses=influx-master   Failed
55de4dd2-83e8-11e4-bea6-040135044f01   kubernetes/heapster_influxdb   <unassigned>        name=influxGrafana                 Pending
                                       kubernetes/heapster_grafana                                                           
                                       dockerfile/elasticsearch                                                              
66471f57-83e3-11e4-bea6-040135044f01   kubernetes/heapster            188.226.142.107/    name=heapster,uses=influx-master   Failed

marathon

Does it support getting cAdvisor nodes from Marathon?

Duplicate data sent to sinks

While writing the GCM sink I found that Heapster sent a lot of duplicate data to the sinks. Something along the lines of:

Container: / - Time: 0s-10s
Container: / - Time: 0s-10s
Container: / - Time: 0s-10s
Container: / - Time: 30s-40s
Container: / - Time: 30s-40s
Container: / - Time: 30s-40s
Container: / - Time: 60s-70s
Container: / - Time: 60s-70s
Container: / - Time: 60s-70s
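The shape of the problem, and of a dedup pass keyed on the full (container, time window) sample, can be sketched in a couple of shell lines. This is a toy illustration of the idea, not Heapster code:

```shell
# Repeated (container, window) samples collapse to one copy each with sort -u:
printf 'Container: / - Time: 0s-10s\nContainer: / - Time: 0s-10s\nContainer: / - Time: 30s-40s\nContainer: / - Time: 30s-40s\n' \
  | sort -u     # -> 2 unique lines
```

In Heapster itself, deduplication would presumably belong before the sink boundary, so that every backend doesn't have to repeat the same work.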

Make heapster work outside of GCE

Currently Kubernetes does not expose the public IPs of services in GCE, which makes it necessary to obtain the public IP of the InfluxDB service using GCE's metadata servers. If there is no public-IP restriction on some of the supported Kubernetes platforms, heapster can be made to work by splitting influxdb and grafana into separate services.

cc @bmaltais

grafana images keeps restarting on vagrant

Deployed this on Kubernetes running on VirtualBox using Vagrant. The Grafana image keeps failing continuously.

[vagrant@kubernetes-minion-3 ~]$ sudo docker events
[2014-08-30 06:23:27 +0000 UTC] 89317b9961b190fd4d00273f91e04342f6f7208a3e6c111011770551bbca6673: (from vish/k8s_grafana:latest) die
[2014-08-30 06:23:36 +0000 UTC] 0e0427e166bc57924fd52d8631d7e8211a6dca4adc3541ee3d36800b01c486f6: (from vish/k8s_grafana:latest) create
[2014-08-30 06:23:36 +0000 UTC] 0e0427e166bc57924fd52d8631d7e8211a6dca4adc3541ee3d36800b01c486f6: (from vish/k8s_grafana:latest) start

grafana-influxdb-pod.json contains extra ','

I could not get grafana-influxdb-pod.json to install properly on Kubernetes. It would just sit there at the waiting stage... so I dug a bit and noticed two extra ',' in the JSON file. Removing those fixed the issue.

You might want to update the file ;-) Here is the file with the fix:

{
  "id": "influx-grafana",
  "kind": "Pod",
  "apiVersion": "v1beta1",
  "desiredState": {
    "manifest": {
      "version": "v1beta1",
      "id": "influxdb",
      "containers": [{
        "name": "influxdb",
        "image": "vish/k8s_influxdb",
        "ports": [{"containerPort": 8086, "hostPort": 8086},
                  {"containerPort": 8083, "hostPort": 8083},
                  {"containerPort": 8090, "hostPort": 8090},
                  {"containerPort": 8099, "hostPort": 8099}]
      }, {
        "name": "grafana",
        "image": "vish/k8s_grafana",
        "ports": [{"containerPort": 80, "hostPort": 80}],
        "env": [{"name": "HTTP_USER", "value": "admin"},
                {"name": "HTTP_PASS", "value": "admin"}]
      }, {
        "name": "elasticsearch",
        "image": "dockerfile/elasticsearch",
        "ports": [{"containerPort": 9200, "hostPort": 9200},
                  {"containerPort": 9300, "hostPort": 9300}]
      }]
    }
  },
  "labels": {
    "name": "influxdb"
  }
}
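A quick way to catch stray commas like these before handing a manifest to the apiserver is to run it through a strict JSON parser, which rejects trailing commas and reports the position. This sketch pipes a fragment with the offending comma through python3 -m json.tool, used here purely as a convenient validator:

```shell
# A strict parser rejects the trailing comma and prints the error location:
echo '{"labels": {"name": "influxdb",}}' | python3 -m json.tool \
  || echo "rejected: not valid JSON"
```

The same check works on the file itself: python3 -m json.tool grafana-influxdb-pod.json.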

deploy/run.sh doesn't detect COREOS env var

I've been trying to get Heapster up and running on a CoreOS cluster, and have been having issues. Using the v0.7 images, I get the following output from the container:

Feb 23 12:36:05 <snip> docker[18078]: Status: Image is up to date for kubernetes/heapster:v0.7
Feb 23 12:36:05 <snip> systemd[1]: Started Heapster Agent Service.
Feb 23 12:36:06 <snip> docker[18086]: + EXTRA_ARGS=
Feb 23 12:36:06 <snip> docker[18086]: + '[' '!' -z ']'
Feb 23 12:36:06 <snip> docker[18086]: + '[' '!' -z ']'
Feb 23 12:36:06 <snip> docker[18086]: + '[' '!' -x ']'
Feb 23 12:36:06 <snip> docker[18086]: + HEAPSTER='/usr/bin/heapster  '
Feb 23 12:36:06 <snip> docker[18086]: + '[' '!' -z ']'
Feb 23 12:36:06 <snip> docker[18086]: + '[' '!' -z 10.10.0.14:8086 ']'
Feb 23 12:36:06 <snip> docker[18086]: + /usr/bin/heapster --sink influxdb --sink_influxdb_host 10.10.0.14:8086
Feb 23 12:36:06 <snip> docker[18086]: I0223 12:36:06.631846       8 heapster.go:44] /usr/bin/heapster --sink influxdb --sink_influxdb_host 10.10.0.14:8086
Feb 23 12:36:06 <snip> docker[18086]: I0223 12:36:06.632198       8 heapster.go:45] Heapster version 0.7
Feb 23 12:36:06 <snip> docker[18086]: I0223 12:36:06.632341       8 heapster.go:46] Flags: alsologtostderr='false' bq_account='' bq_credentials_file='' bq_id='' bq_project_id='' bq_secret='notasecret' cadvisor_port='8080' coreos='false' external_hosts_file='/var/run/heapster/hosts' fleet_endpoints='http://127.0.0.1:4001' kubelet_port='10250' kubernetes_insecure='true' kubernetes_master='' listen_ip='' log_backtrace_at=':0' log_dir='' logtostderr='true' max_procs='0' poll_duration='10s' port='8082' sink='influxdb' sink_influxdb_buffer_duration='10s' sink_influxdb_host='10.10.0.14:8086' sink_influxdb_name='k8s' sink_influxdb_password='root' sink_influxdb_username='root' sink_memory_ttl='1h0m0s' stderrthreshold='2' v='0' vmodule=''
Feb 23 12:36:06 <snip> docker[18086]: I0223 12:36:06.632825       8 influxdb.go:255] Using influxdb on host "10.10.0.14:8086" with database "k8s"
Feb 23 12:36:07 <snip> docker[18086]: I0223 12:36:07.642797       8 heapster.go:57] Starting heapster on port 8082

The command used to run it is (as per https://github.com/GoogleCloudPlatform/heapster/tree/master/clusters/coreos):

/usr/bin/docker run --name %n -e INFLUXDB_HOST=10.10.0.14:8086 -e COREOS kubernetes/heapster:v0.7

As can be seen, the COREOS environment variable isn't picked up, because it's empty. [ ! -z $COREOS ] checks for a non-empty value, rather than checking whether the variable exists at all. Changing it to [ -v COREOS ] will correctly check for its existence.
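The difference between the two tests is easy to demonstrate ([ -v var ] is a bash 4.2+ conditional, so it is invoked via bash here):

```shell
export COREOS=        # set but empty -- the situation described above
[ ! -z "$COREOS" ] && echo "string test: seen" || echo "string test: missed"
bash -c '[ -v COREOS ]' && echo "-v test: seen" || echo "-v test: missed"
# -> string test: missed
# -> -v test: seen
```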

I applied this patch (3a10535) manually by changing run.sh in the container, and it seems to fix the issue. I can't seem to get Heapster to build to test it properly, though (I've not used Go before).

IPV6 only constraint possibly

Not sure if the issue is environmental (only me) or configuration baked into the pre-canned heapster and grafana images. Every time I test the URL (port 80) of the minion running the heapster pod, I get a "connection refused" error. All ports are open, and this is reproducible even from a curl command run via SSH locally on the heapster minion. It appears to be web-server config rather than something at the infrastructure layer.

I noticed the 'default' file in the grafana folder has "IPV6only=on" as a config item, and potentially the same exists in the underlying tutum/nginx image being used. I tried setting it to "off" and recreating a custom image, with no success.

As stated, this could be environmental, but the presence of "IPV6only=on" suggests this could be why NGINX is rejecting connections from my client browser.

The environment is CoreOS on AWS, deployed via a CloudFormation template.
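For reference, nginx refuses IPv4 connections when its only listen socket is the IPv6 one with ipv6only=on; a dual-stack server block listens on both families explicitly. The following is a hypothetical sketch of the relevant directives, not the image's actual config file:

```nginx
server {
    listen 80;                    # IPv4 socket
    listen [::]:80 ipv6only=on;   # separate IPv6 socket; ipv6only=on stops it
                                  # from handling IPv4 via mapped addresses
    server_name _;
}
```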

Ingest Prometheus metrics.

The Prometheus client model is being adopted to support custom metrics from Kubernetes core components. Heapster needs to be able to scrape and parse Prometheus metrics in order to annotate them and make them available to various backends.

High CPU usage for heapster v0.6

I updated to the latest versions of heapster (kubernetes/heapster:v0.6) and influxdb (kubernetes/heapster_influxdb:v0.3) and found high CPU usage by InfluxDB.

I downgraded to heapster_influxdb:v0.2 and still got high load; then I tried heapster:v0.5, and the CPU load seemed normal.

heapster v0.5 + heapster_influxdb:v0.3: (screenshot)

heapster v0.6 + heapster_influxdb:v0.3: (screenshot)

cAdvisor link

The cAdvisor link has an extra "https://", so the link is broken.

Separate Sources into Collectors

Separate sources into collectors under a stronger API, to allow us to collect from various sources and make that more configurable.

This will require us to annotate data outside of the collectors.

Add a REST endpoint to expose resource usage at various levels

Since heapster aggregates data across all nodes and containers, it can and probably should expose resource usage data aggregated at various levels:

  1. container
  2. pod
  3. service/controller
  4. labels

Instantaneous metrics will be useful for UI purposes, and derived metrics (mean, 90th percentile, max, etc.) will be useful for end users.

Add a deployment tool for kubernetes

As of now, Heapster setup and troubleshooting can be difficult. A deployment tool could verify that cAdvisor is installed on every node, set up all the Heapster components, and ensure that all the components are up and running. This tool could also help configure backends and dynamically generate kube configs.

Separate Sinks with an API

Sinks today are able to export data to a certain backend. We need to separate these so that we can run them out of process from the main Heapster, which will require a clear API.

Suggestion: use InfluxDB continuous queries

In real-life scenarios, the performance of the Grafana dashboard is abysmal, and the metrics are plain wrong, when graphs are built on top of the stats table directly. I suggest using continuous queries.

Why wrong? Imagine a single Docker container with the same name deployed on different machines. There will be multiple rows from different machines, so derivative(cpu_cumulative_usage) will return random values because there is no group by hostname.
Continuous queries also allow slicing the metric across more than one dimension, side-stepping Grafana limitations.

https://github.com/arkadijs/heapster/blob/master/sinks/influxdb.go#L298

Note that group by time(10s) must be twice as large as -poll_duration (5s in my fork), or else InfluxDB won't see lagged data. AFAIK, this is going to change in InfluxDB 0.9.

Also, I don't know your plans for disk-io stats, but filesystem stats greatly multiply the volume of data, as they are added as separate rows per filesystem. Consider writing them at a lower cadence, or aggregating across filesystems (which is not an entirely correct approach) and pushing them into the same record.
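The "wrong derivative" problem described above is easy to reproduce outside InfluxDB: interleave the same cumulative counter from two hosts and take successive differences without a group-by on hostname. A toy illustration, not Heapster code:

```shell
# Two hosts report the same container's cumulative CPU counter. Ungrouped,
# successive differences mix the two series and swing wildly:
printf 'hostA 100\nhostB 5000\nhostA 200\nhostB 5100\n' \
  | awk 'NR > 1 { print "delta", $2 - prev } { prev = $2 }'
# -> delta 4900, delta -4800, delta 4900
```

Grouping by hostname, which a continuous query can bake in, keeps each host's counter differenced against itself.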

Push metrics to backends in a separate thread

As of now, metrics are pushed to backends synchronously. This can cause housekeeping delays. Instead, cache the metrics internally and use a separate goroutine to export them to backends.

Make heapster resilient to kubelet failures

As of now, heapster fails stats collection completely if the kubelet fails to return stats for a container. This is undesirable; heapster should log the error and move on instead of failing completely.

lastQuery handling in cAdvisor source is wrong

There are two issues:

  1. lastQuery is not set in the loop, resulting in data duplication, because cAdvisor returns the last 64 stats instead. Example fix:
    arkadijs@2fc23c2#diff-3917589e63e61c59289ec2acb14ebcc8R66
  2. On a medium-to-large cluster with a medium-to-large number of containers, the logic of one lastQuery shared across all peers would be flawed anyway. It leads to missing data points, because Heapster + cAdvisor are not fast enough and Heapster queries all cAdvisor instances serially. A crappy fix that works in practice:
    arkadijs@2fc23c2#diff-3917589e63e61c59289ec2acb14ebcc8R37
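The first point boils down to advancing lastQuery past the newest sample actually ingested, so the next poll skips the window cAdvisor replays. A toy sketch of that idea (hypothetical, not the fork's actual code):

```shell
# cAdvisor re-serves a window of recent samples on every poll; tracking the
# newest timestamp seen and skipping anything at or before it avoids duplicates.
last=0
for t in 10 20 30 20 30 40; do      # the second poll replays 20 and 30
  if [ "$t" -gt "$last" ]; then
    echo "ingest $t"
    last=$t
  fi
done
# -> ingest 10, ingest 20, ingest 30, ingest 40
```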

Metrics pub-sub

While thinking about other services that would require metrics (e.g. alerting, autoscaling, etc.), switching to a pub-sub model would allow services to be built around the metrics data. For example, if heapster could publish metrics to, say, Apache Spark or Kafka, then other services could subscribe to topics and provide functionality.

We could split topics into a hierarchy, something like namespace / pod / container, aggregate at each level, and enforce security at those layers too.

Not getting graphs on heapster UI.

Below are the pods running on my system:

NAME                                   IMAGE(S)                            HOST                         LABELS                                   STATUS
f95a0a29-a78e-11e4-b11e-984be16ce9ba   kubernetes/heapster_influxdb:v0.2   10.227.125.10/10.227.125.10  name=influxGrafana                       Running
                                       kubernetes/heapster_grafana:v0.2
773cf11c-a78f-11e4-b11e-984be16ce9ba   kubernetes/heapster:v0.5            10.227.125.10/10.227.125.10  name=heapster,uses=monitoring-influxdb   Running
cf7bfe93-a78e-11e4-b11e-984be16ce9ba   google/cadvisor:0.8.0               10.227.125.10/10.227.125.10  name=cadvisor                            Running
cf7bb0a1-a78e-11e4-b11e-984be16ce9ba   google/cadvisor:0.8.0               10.227.125.3/10.227.125.3    name=cadvisor                            Running

I have two minions:

NAME            LABELS
10.227.125.10
10.227.125.3

When I query curl -v http://10.227.125.10:10250/stats/, I get the errors below:

  • Hostname was NOT found in DNS cache
  • Trying 10.227.125.10...
  • Connected to 10.227.125.10 (10.227.125.10) port 10250 (#0)

    GET /stats/ HTTP/1.1
    User-Agent: curl/7.35.0
    Host: 10.227.125.10:10250
    Accept: */*

    < HTTP/1.1 500 Internal Server Error
    < Content-Type: text/plain; charset=utf-8
    < Date: Thu, 29 Jan 2015 09:07:15 GMT
    < Content-Length: 39
    <
    Internal Error: no cadvisor connection
  • Connection #0 to host 10.227.125.10 left intact

While trying curl -v http://10.227.125.3:10250/stats/heapster/heapster/, I get the following:

  • Hostname was NOT found in DNS cache
  • Trying 10.227.125.3...
  • Connected to 10.227.125.3 (10.227.125.3) port 10250 (#0)

    GET /stats/heapster/heapster/ HTTP/1.1
    User-Agent: curl/7.35.0
    Host: 10.227.125.3:10250
    Accept: */*

    < HTTP/1.1 200 OK
    < Date: Thu, 29 Jan 2015 09:19:06 GMT
    < Content-Length: 2
    < Content-Type: text/plain; charset=utf-8
    <
  • Connection #0 to host 10.227.125.3 left intact
    {}

No graph is displayed in the heapster UI. Any suggestions on where it is going wrong?

Suggestion: get rid of the CoreOS / "buddy"

While flexible, the buddy / hosts-file approach complicates the deployment.
I ended up moving the Fleet logic into Heapster and getting rid of a noticeable chunk of the code. As there are currently no other types of buddies (Mesos?), I suggest doing something like the following:
arkadijs@7c4fc6c

  1. If there is Kubernetes - use listMinions.
  2. Otherwise try Fleet.

The code is a little bit messy - it couples KubeSource to ExternalSource - but my baseline is CoreOS with Deis, with optional Kubernetes, hence this practical solution.

Heapster fails to get stats from kubelet

Potentially related to the push of heapster 0.7. I've been seeing all heapster requests to my kubelets' /stats/ endpoints failing. I initially reported this as a Kubelet issue, but it looks possible that the new heapster added some sort of incompatibility with the kubelet.

The errors all look like this, and I expect it's very reproducible, as both clusters I've created since Friday have had these errors:

E0223 18:32:13.471559       9 kubelet.go:80] failed to get stats from kubelet url: http://10.240.173.100:10250/stats/default/kube-dns-0133o/70de926d-bb1a-11e4-b34e-42010af06d77/etcd - Got 'Internal Error: failed to retrieve cadvisor stats
': invalid character 'I' looking for beginning of value
E0223 18:32:33.378167       9 kubelet.go:80] failed to get stats from kubelet url: http://10.240.147.240:10250/stats/ - Got 'Internal Error: received empty response from "container info for \"/\""
': invalid character 'I' looking for beginning of value

Can't load data

I set up Heapster in a Kubernetes cluster with an InfluxDB backend and Grafana, as in the README.
I can log into Grafana using admin/admin and into the InfluxDB UI using root/root.
In the Grafana UI I get this error:

InfluxDB Error: Couldn't look up columns

In the InfluxDB UI, I click "Explore Data" for the k8s database and try this request:

select * from /.*/ limit 1

And I get an error message:

ERROR: Couldn't look up columns

Cannot override service name for influxdb storage when running in kubernetes

As ENTRYPOINT is set to /run.sh, it cannot be overridden in Kubernetes, where you can only specify a command. run.sh also has a hard-coded name for the InfluxDB service, INFLUX_MASTER.

Switching to CMD rather than ENTRYPOINT would allow users to run /usr/bin/heapster directly, passing in appropriate arguments; /run.sh would of course remain the default if no argument is specified.

Influxdb clustering

Unless I'm missing something, the heapster manifests for InfluxDB currently only support a single InfluxDB instance. Clustering InfluxDB will probably be necessary for any reasonably sized deployment. I assume this would be achieved by updating the seeds in config.toml, but I'm not sure of the best way to retrieve those seed servers.

Proxy calls to InfluxDB with nginx

So, having something like

- name: "INFLUXDB_HOST"
  value: '"+window.location.hostname+"/api/v1beta1/proxy/services/monitoring-influxdb'

is very limiting when it comes to running influxdb + grafana, especially inside Kubernetes.

See my suggested solution.

Heapster CoreOS buddy assumes fleet server is running on host

As per @andrewwebber's suggestion, the CoreOS buddy needs to take in fleet server information via a flag.

@MaheshRudrachar reported this issue:
I have set up 5 CoreOS instances on AWS and followed kelseyhightower's kubernetes-fleet-tutorial.
Basically I have:
1 dedicated etcd server,
3 minions, with 1 minion acting as the API server, and
1 dedicated minion for setting up Heapster.
All minions point to the dedicated etcd server.

Now, when I ran the units mentioned above:

  1. cAdvisor, InfluxDB & Grafana - work fine with status running

Heapster Agent: fails with log messages as follows:
Starting Heapster Agent Service...
heapster-agent
Pulling repository vish/heapster-buddy-coreos
Started Heapster Agent Service.
INFO log.go:73: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001:
ERROR log.go:81: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 100m
timeout reached
heapster-agent.service: main process exited, code=exited, status=1/FAILURE
Unit heapster-agent.service entered failed state.

Heapster: fails with log messages as follows:
Pulling repository kubernetes/heapster
Started Heapster Agent Service.
/usr/bin/heapster --sink influxdb --sink_influxdb_host 127.0.0.1:8086
Heapster version 0.2
Cannot stat hosts_file /var/run/heapster/hosts. Error: stat /var/run/heapster/hosts: no such file or directory
heapster.service: main process exited, code=exited, status=1/FAILURE
heapster
Unit heapster.service entered failed state.
heapster.service holdoff time over, scheduling restart.
Stopping Heapster Agent Service...
Starting Heapster Agent Service...
