cloudfoundry / garden-runc-release
License: Apache License 2.0
I'm not sure whether changes have made it upstream, but src/github.com/docker/docker is currently over five months behind. Should it be updated, and should it be pulled in as a submodule rather than copied?
I would like to know how a Garden container updates its resolv.conf.
I added "options rotate timeout:1 retries:1" alongside the nameserver entries in the Garden cell's /etc/resolv.conf.
I am unable to trace the code that modifies resolv.conf in garden-runc-release (1.4.0).
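A hedged inspection sketch (not part of the original question; the depot path comes from the gdn command line quoted later in this document, and the per-handle layout is an assumption): compare the cell's resolv.conf with the per-container state Garden keeps under the depot.
# <handle> is a container id from: ls /var/vcap/data/garden/depot
cat /etc/resolv.conf
ls -la /var/vcap/data/garden/depot/<handle>/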
I am working on a CF v287 deployment and am facing a strange error in the garden health check.
Following is the system/issue info:
OS: Ubuntu 14.04 (Trusty Tahr)
Kernel version: 4.4.0-105-generic
diego-release version 2.0.0
garden-runc-release version 1.11.1
Issue log:
{"timestamp":"1521155111.438632011","source":"guardian","message":"guardian.run.exec.execrunner.runc","log_level":0,"data":{"handle":"executor-healthcheck-a51bbaf6-b013-4d85-438b-3f6a1cde4b7d","id":"executor-healthcheck-a51bbaf6-b013-4d85-438b-3f6a1cde4b7d","message":"exec_ failed: container_linux.go:348: starting container process caused \"chdir to cwd (\\\"/home/vcap\\\") set in config.json failed: permission denied\"\n","path":"/bin/sh","session":"9.2.1"}}
Could someone please help with this issue?
Thanks in advance.
Best Regards
Amit
After pushing the grafana/grafana Docker image to Cloud Foundry, there are missing files in the Garden container.
We recently changed from garden-shed to grootfs/overlay-xfs.
We already faced similar behaviour in two other cases with an earlier garden-runc version (v1.11.1), but an update to a recent version eliminated the issue there.
garden-runc 1.12.1 / grootfs/overlay-xfs
Stemcell: vsphere ubuntu trusty 3541.10 and 3468.21
cf push grafana_fail --docker-image grafana/grafana
The garden log /var/vcap/sys/log/garden/garden.stdout.log (log_level: info) does not highlight any unexpected behaviour.
cf log:
2018-04-06T11:53:53.60+0000 [APP/PROC/WEB/0] OUT t=2018-04-06T11:53:53+0000 lvl=crit msg="Failed to parse /etc/grafana/grafana.ini, open /etc/grafana/grafana.ini: no such file or directory%!(EXTRA []interface {}=[])
Looking inside the app (with a modified start command), the directories are there but the files are not.
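For reference, a hedged repro sketch (the sleep start command, health check override, and cf ssh step are assumptions, not from the original report) that keeps the container alive long enough to inspect it:
# push with a start command that keeps the container up, then look inside
cf push grafana_fail --docker-image grafana/grafana -c "sleep 100000" -u none
cf ssh grafana_fail -c "ls -la /etc/grafana"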
Error from runc: cannot set limits on the memory cgroup, as the container has not joined it
v1.12.1
AWS
ubuntu-trusty: 3541.12
4.4.0-1049-aws
We don't have repro steps yet. The error happened after a deploy:
{"timestamp":"1524090641.830159187","source":"rep","message":"rep.executing-container-operation.task-processor.run-container.containerstore-create.node-create.failed-to-create-container-in-garden","log_level":2,"data":{"container-guid":"35f7ad82-9831-4e09-b7e3-9c41df9dae22","container-state":"reserved","error":"runc run: exit status 1: container_linux.go:348: starting container process caused \"process_linux.go:402: container init caused \\\"process_linux.go:367: setting cgroup config for procHooks process caused \\\\\\\"cannot set limits on the memory cgroup, as the container has not joined it\\\\\\\"\\\"\"\n","guid":"35f7ad82-9831-4e09-b7e3-9c41df9dae22","session":"20.1.3.2.1"}}
Logs and more context can be found in this tracker story. Note that output of garden-ordnance-survey is also included in the story comments.
Suspicion: it appears that the garden memory cgroup wasn't available at container creation time (Tom G).
Subsequent container creations succeeded as soon as 3 seconds after this failure.
Note from @BooleanCat: it might be worth trying to reproduce by slamming garden with container creations while it's starting up.
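A hedged sketch of that repro idea (the endpoint, payload, and address are assumptions; the POST /containers call and the 7777 bind port both appear elsewhere in this document):
# restart garden, then immediately hammer it with container creates
monit restart garden
for i in $(seq 1 50); do
  curl -s -o /dev/null -X POST -H 'Content-Type: application/json' \
       -d '{}' http://127.0.0.1:7777/containers &
done
wait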
When running Concourse tasks, one of our three workers is giving the following error:
runc create: exit status 1: process_linux.go:258: applying cgroup configuration for process caused "mkdir /sys/fs/cgroup/memory/24d5d8e9-250f-4bc8-40ed-03a9d4f9ad52: no space left on device"
Here's an example of a build with the problem: https://runtime.ci.cf-app.com/teams/main/pipelines/cf-bosh-2-0/jobs/update-releases/builds/62
The vm in question is not especially full in any respect:
| VM | State | AZ | VM Type | IPs | Load | CPU | CPU | CPU | Memory Usage | Swap Usage | System | Ephemeral | Persistent |
| | | | | | (avg01, avg05, avg15) | User | Sys | Wait | | | Disk Usage | Disk Usage | Disk Usage |
| worker/1 (56bd4970-30e6-4ae8-b4f2-546bb566561b) | running | z1 | workers | 10.10.48.14 | 0.18, 0.46, 0.61 | 0.4% | 0.2% | 0.0% | 33% (9.8G) | 1% (196.2M) | 59% | 19% | n/a |
The only potentially "weird" configuration we've got on there is:
garden:
max_containers: 1000
network_pool: "10.254.0.0/20"
But we're not really using a lot of that, as fly workers reports:
56bd4970-30e6-4ae8-b4f2-546bb566561b 172 linux none
Suraci said this isn't a disk issue, probably isn't a Concourse issue, and suggested we give y'all a shot at it.
garden/garden.stdout.log
83381-{"timestamp":"1472766267.733082533","source":"guardian","message":"guardian.destroy.destroy.finished","log_level":1,"data":{"handle":"67234b85-750c-47fe-7b2e-921740c5a5b4","session":"998.2"}}
83382-{"timestamp":"1472766267.733111382","source":"guardian","message":"guardian.destroy.finished","log_level":1,"data":{"handle":"67234b85-750c-47fe-7b2e-921740c5a5b4","session":"998"}}
83383-{"timestamp":"1472766267.733154535","source":"guardian","message":"guardian.layer-already-deleted-skipping","log_level":1,"data":{"error":"could not find image: no such id: 37870bde9a1de1d445ed2d7862d2c17d75212ac33b1fe1810cc0401485013b8a","graphID":"67234b85-750c-47fe-7b2e-921740c5a5b4","id":"67234b85-750c-47fe-7b2e-921740c5a5b4"}}
83384-{"timestamp":"1472766267.733173132","source":"guardian","message":"guardian.create.create-failed-cleaningup.cleanedup","log_level":1,"data":{"cause":"runc create: exit status 1: process_linux.go:258: applying cgroup configuration for process caused \"mkdir /sys/fs/cgroup/memory/67234b85-750c-47fe-7b2e-921740c5a5b4: no space left on device\"","handle":"67234b85-750c-47fe-7b2e-921740c5a5b4","session":"997.3"}}
83385-{"timestamp":"1472766267.733195305","source":"guardian","message":"guardian.api.garden-server.create.failed","log_level":2,"data":{"error":"runc create: exit status 1: process_linux.go:258: applying cgroup configuration for process caused \"mkdir /sys/fs/cgroup/memory/67234b85-750c-47fe-7b2e-921740c5a5b4: no space left on device\"","request":{"Handle":"","GraceTime":0,"RootFSPath":"raw:///var/vcap/data/baggageclaim/volumes/live/25d4f36d-9f8f-4cff-4ac1-d441db8e8b71/volume","BindMounts":null,"Network":"","Privileged":true,"Limits":{"bandwidth_limits":{},"cpu_limits":{},"disk_limits":{},"memory_limits":{}}},"session":"3.1.517"}}
83386:{"timestamp":"1472766268.500758410","source":"guardian","message":"guardian.create.start","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"999"}}
83387:{"timestamp":"1472766268.500808477","source":"guardian","message":"guardian.create.gc.start","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"999.1"}}
83388:{"timestamp":"1472766268.500828743","source":"guardian","message":"guardian.create.gc.finished","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"999.1"}}
83389:{"timestamp":"1472766268.500849724","source":"guardian","message":"guardian.create.containerizer-create.start","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"999.2"}}
83390:{"timestamp":"1472766268.521710157","source":"guardian","message":"guardian.create.containerizer-create.depot-create.started","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"999.2.1"}}
83391:{"timestamp":"1472766268.521897554","source":"guardian","message":"guardian.create.containerizer-create.depot-create.finished","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"999.2.1"}}
83392:{"timestamp":"1472766268.521925688","source":"guardian","message":"guardian.create.containerizer-create.lookup.started","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"999.2.2"}}
83393:{"timestamp":"1472766268.521943808","source":"guardian","message":"guardian.create.containerizer-create.lookup.finished","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"999.2.2"}}
83394:{"timestamp":"1472766268.521981955","source":"guardian","message":"guardian.create.containerizer-create.create.creating","log_level":1,"data":{"bundle":"/var/vcap/data/garden/depot/d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","bundlePath":"/var/vcap/data/garden/depot/d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","id":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","logPath":"/var/vcap/data/garden/depot/d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5/create.log","pidFilePath":"/var/vcap/data/garden/depot/d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5/pidfile","runc":"runc","session":"999.2.3"}}
83395:{"timestamp":"1472766268.564662218","source":"guardian","message":"guardian.create.containerizer-create.create.finished","log_level":1,"data":{"bundle":"/var/vcap/data/garden/depot/d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"999.2.3"}}
83396:{"timestamp":"1472766268.564709902","source":"guardian","message":"guardian.create.containerizer-create.runtime-create-failed","log_level":2,"data":{"error":"runc create: exit status 1: process_linux.go:258: applying cgroup configuration for process caused \"mkdir /sys/fs/cgroup/memory/d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5: no space left on device\"","handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"999.2"}}
83397:{"timestamp":"1472766268.564728737","source":"guardian","message":"guardian.create.containerizer-create.finished","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"999.2"}}
83398:{"timestamp":"1472766268.564753771","source":"guardian","message":"guardian.create.create-failed-cleaningup.start","log_level":1,"data":{"cause":"runc create: exit status 1: process_linux.go:258: applying cgroup configuration for process caused \"mkdir /sys/fs/cgroup/memory/d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5: no space left on device\"","handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"999.3"}}
83399:{"timestamp":"1472766268.564774513","source":"guardian","message":"guardian.destroy.started","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"1000"}}
83400:{"timestamp":"1472766268.564793587","source":"guardian","message":"guardian.destroy.state.started","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"1000.1"}}
83401:{"timestamp":"1472766268.572014570","source":"guardian","message":"guardian.destroy.state.finished","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"1000.1"}}
83402:{"timestamp":"1472766268.572041273","source":"guardian","message":"guardian.destroy.state-failed-skipping-delete","log_level":1,"data":{"error":"runc state: runc create: exit status 1: open /run/runc/d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5/state.json: no such file or directory","handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"1000"}}
83403:{"timestamp":"1472766268.572063684","source":"guardian","message":"guardian.destroy.destroy.started","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"1000.2"}}
83404:{"timestamp":"1472766268.572134256","source":"guardian","message":"guardian.destroy.destroy.finished","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"1000.2"}}
83405:{"timestamp":"1472766268.572152138","source":"guardian","message":"guardian.destroy.finished","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"1000"}}
83406:{"timestamp":"1472766268.572190762","source":"guardian","message":"guardian.layer-already-deleted-skipping","log_level":1,"data":{"error":"could not find image: no such id: dd7bb4a2c64e30b97b6bee40971fa86e1b134e2a30a940b22eeb9823be188611","graphID":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","id":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5"}}
83407:{"timestamp":"1472766268.572208643","source":"guardian","message":"guardian.create.create-failed-cleaningup.cleanedup","log_level":1,"data":{"cause":"runc create: exit status 1: process_linux.go:258: applying cgroup configuration for process caused \"mkdir /sys/fs/cgroup/memory/d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5: no space left on device\"","handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"999.3"}}
83408:{"timestamp":"1472766268.572227478","source":"guardian","message":"guardian.api.garden-server.create.failed","log_level":2,"data":{"error":"runc create: exit status 1: process_linux.go:258: applying cgroup configuration for process caused \"mkdir /sys/fs/cgroup/memory/d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5: no space left on device\"","request":{"Handle":"","GraceTime":0,"RootFSPath":"raw:///var/vcap/data/baggageclaim/volumes/live/29dd5ffc-c577-4b89-7b66-dd87f7cfd996/volume","BindMounts":[{"src_path":"/var/vcap/data/baggageclaim/volumes/live/a40b857f-8711-41a6-45a9-e82871eacfd4/volume","dst_path":"/tmp/build/get","mode":1}],"Network":"","Privileged":true,"Limits":{"bandwidth_limits":{},"cpu_limits":{},"disk_limits":{},"memory_limits":{}}},"session":"3.1.518"}}
83409-{"timestamp":"1472766268.948507786","source":"guardian","message":"guardian.list-containers.starting","log_level":1,"data":{"session":"1001"}}
83410-{"timestamp":"1472766268.949084759","source":"guardian","message":"guardian.list-containers.finished","log_level":1,"data":{"session":"1001"}}
83411-{"timestamp":"1472766280.467016459","source":"guardian","message":"guardian.create.start","log_level":1,"data":{"handle":"dfd74a82-61d6-4aa8-5771-3a01c729b534","session":"1002"}}
83412-{"timestamp":"1472766280.467072964","source":"guardian","message":"guardian.create.gc.start","log_level":1,"data":{"handle":"dfd74a82-61d6-4aa8-5771-3a01c729b534","session":"1002.1"}}
83413-{"timestamp":"1472766280.467092991","source":"guardian","message":"guardian.create.gc.finished","log_level":1,"data":{"handle":"dfd74a82-61d6-4aa8-5771-3a01c729b534","session":"1002.1"}}
We've used monit stop beacon to stop this worker from participating in Concourse, but preserved it for any investigation you might like to do.
Please feel free to jump on the environment and poke around. If you want to turn the beacon back on and re-run jobs, the jobs in the cf-release smokes display group or any of the jobs in this pipeline should be fine.
We've also made read access to the repo containing our BOSH manifest available to your team; that is where the BOSH manifest for Concourse lives (in deployments/concourse.yml), and the keypair for SSHing into BOSH jobs is in keypair.
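A hedged diagnostic for this class of failure (not something the reporters ran): ENOSPC from mkdir under /sys/fs/cgroup/memory concerns cgroup accounting rather than disk, so counting live memory cgroups is more informative than df here.
# leaked cgroups from destroyed containers would push this count far above
# the ~172 live containers reported by fly workers
find /sys/fs/cgroup/memory -mindepth 1 -type d | wc -l
# /proc/cgroups columns include num_cgroups per subsystem
grep memory /proc/cgroups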
Instead of "vcap", we maintain a "paas" user for CF module deployment; the "paas" user has sudoer permission as well.
During "garden" job deployment, "greenskeeper" expects the "vcap" user, which is hard-coded:
https://github.com/cloudfoundry/garden-runc-release/blob/master/src/greenskeeper/cmd/greenskeeper/main.go#L20
Is there any plan to make it configurable for other users?
Cloud Foundry Application Runtime operators want to understand how the Grootfs store is using disk on the Garden host machine, including being able to answer the following questions:
This understanding can be important when operators or support agents reconcile disk allocations for containers and actual disk usage on the host, particularly in situations in which containers are not placed or executed successfully:
In addition to assisting CFAR operators in configuring their Diego cells to use disk space efficiently and in debugging failure modes, this level of understanding would help the Diego team conceive of strategies to improve how the cell rep advertises available disk, especially when component disk usage exceeds expected margins.
It is not immediately clear how to use tools such as du and df to determine this information. du -sch * run naively against /var/vcap/data/grootfs/store overcounts disk usage by observing the same files in several different logical locations, and so frequently returns absurdly high numbers. The reporting from df, while apparently more accurate, is often too coarse (reporting on all of /var/vcap/data, in a BOSH-deployed context) or too difficult to relate to the containers.
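A hedged partial answer (our suggestion, not from the original story): pointing df at the two store mounts directly at least separates the stores from the rest of /var/vcap/data.
# per-store totals without walking the per-image overlay mounts
df -h /var/vcap/data/grootfs/store/unprivileged /var/vcap/data/grootfs/store/privileged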
An example from one of the Diego cells in the Diego team electron-cannon CI environment:
# curl -k --cert /var/vcap/jobs/rep/config/certs/tls.crt --key /var/vcap/jobs/rep/config/certs/tls.key https://058fe0af-50ce-419f-8dd6-8a94b7ad44f3.cell.service.cf.internal:1801/container_metrics | jq '.lrps | map({instance_guid,disk_usage_bytes,disk_quota_bytes})'
[
{
"instance_guid": "62eef6c8-002a-40e0-537e-b869",
"disk_usage_bytes": 7544832,
"disk_quota_bytes": 67108864
},
{
"instance_guid": "547cab22-2987-46dc-5235-9713",
"disk_usage_bytes": 7544832,
"disk_quota_bytes": 67108864
},
{
"instance_guid": "369b6f00-aeed-44f7-5520-65dd",
"disk_usage_bytes": 95567872,
"disk_quota_bytes": 1073741824
},
{
"instance_guid": "a94a5e54-bd97-413e-4ebf-8e5e",
"disk_usage_bytes": 7544832,
"disk_quota_bytes": 67108864
},
{
"instance_guid": "fbf0adc4-ce51-42ba-696b-2588",
"disk_usage_bytes": 6430720,
"disk_quota_bytes": 1073741824
},
{
"instance_guid": "6aaecdbb-83a3-49a0-50f6-4f2c",
"disk_usage_bytes": 7544832,
"disk_quota_bytes": 67108864
},
{
"instance_guid": "7fcd3985-fc86-4d49-6318-bd66",
"disk_usage_bytes": 95567872,
"disk_quota_bytes": 1073741824
},
{
"instance_guid": "d3d0080a-bcc4-4acb-7f17-e662",
"disk_usage_bytes": 95830016,
"disk_quota_bytes": 1073741824
},
{
"instance_guid": "b846e975-d019-4a4c-72b4-a55f",
"disk_usage_bytes": 95567872,
"disk_quota_bytes": 1073741824
},
{
"instance_guid": "d0f44fa0-9491-4da2-755f-a864",
"disk_usage_bytes": 7544832,
"disk_quota_bytes": 67108864
},
{
"instance_guid": "096852be-16f2-48fa-41b3-fa97",
"disk_usage_bytes": 7544832,
"disk_quota_bytes": 67108864
},
{
"instance_guid": "fe0ba241-a59b-4fbe-7e6b-b423",
"disk_usage_bytes": 7544832,
"disk_quota_bytes": 67108864
},
{
"instance_guid": "ebe62763-d16c-40a0-6231-fdc6",
"disk_usage_bytes": 95567872,
"disk_quota_bytes": 1073741824
},
{
"instance_guid": "2bf3fe20-9ff3-44e6-7b90-d656",
"disk_usage_bytes": 7544832,
"disk_quota_bytes": 67108864
},
{
"instance_guid": "c95b2780-7ed0-45a6-7dcd-1d44",
"disk_usage_bytes": 7544832,
"disk_quota_bytes": 67108864
},
{
"instance_guid": "07b247bb-b7cc-420f-7c6d-56b9",
"disk_usage_bytes": 7544832,
"disk_quota_bytes": 67108864
},
{
"instance_guid": "46abea19-9fe5-4057-5466-ef57",
"disk_usage_bytes": 95567872,
"disk_quota_bytes": 1073741824
},
{
"instance_guid": "c60f8621-c751-4b91-72c4-c3cc",
"disk_usage_bytes": 7544832,
"disk_quota_bytes": 67108864
},
{
"instance_guid": "1dad94bb-76d7-4ee2-7de3-a4a4",
"disk_usage_bytes": 7544832,
"disk_quota_bytes": 67108864
},
{
"instance_guid": "bb94dfc7-5ccd-473e-4ce8-a1f6",
"disk_usage_bytes": 7544832,
"disk_quota_bytes": 67108864
},
{
"instance_guid": "87b0084e-b6e2-4e35-5f04-b502",
"disk_usage_bytes": 7544832,
"disk_quota_bytes": 67108864
},
{
"instance_guid": "ce58a99a-0b4c-4070-622d-207a",
"disk_usage_bytes": 7544832,
"disk_quota_bytes": 67108864
},
{
"instance_guid": "3c368ff6-d18d-4cd3-4cc5-cb80",
"disk_usage_bytes": 7544832,
"disk_quota_bytes": 67108864
},
{
"instance_guid": "d3e99752-f05d-41c3-5a4c-ea29",
"disk_usage_bytes": 7544832,
"disk_quota_bytes": 67108864
},
{
"instance_guid": "3097f536-5766-4112-559d-ad7a",
"disk_usage_bytes": 7544832,
"disk_quota_bytes": 67108864
},
{
"instance_guid": "75d34c0f-b1c1-4866-48c8-161c",
"disk_usage_bytes": 7544832,
"disk_quota_bytes": 67108864
}
]
df
# df -h
Filesystem Size Used Avail Use% Mounted on
udev 13G 4.0K 13G 1% /dev
tmpfs 2.6G 1.7M 2.6G 1% /run
/dev/sda1 2.8G 1.3G 1.4G 47% /
none 4.0K 0 4.0K 0% /sys/fs/cgroup
none 5.0M 0 5.0M 0% /run/lock
none 13G 0 13G 0% /run/shm
none 100M 0 100M 0% /run/user
/dev/sda3 71G 21G 47G 31% /var/vcap/data
tmpfs 1.0M 32K 992K 4% /var/vcap/data/sys/run
/dev/loop0 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged
/dev/loop1 71G 1.9G 69G 3% /var/vcap/data/grootfs/store/privileged
tmpfs 2.9M 216K 2.6M 8% /var/vcap/data/rep/instance_identity
overlay 1.0G 92M 933M 9% /var/vcap/data/grootfs/store/unprivileged/images/b846e975-d019-4a4c-72b4-a55f/rootfs
overlay 1.0G 6.2M 1018M 1% /var/vcap/data/grootfs/store/unprivileged/images/fbf0adc4-ce51-42ba-696b-2588/rootfs
overlay 1.0G 92M 933M 9% /var/vcap/data/grootfs/store/unprivileged/images/369b6f00-aeed-44f7-5520-65dd/rootfs
overlay 1.0G 92M 933M 9% /var/vcap/data/grootfs/store/unprivileged/images/d3d0080a-bcc4-4acb-7f17-e662/rootfs
overlay 64M 7.2M 57M 12% /var/vcap/data/grootfs/store/unprivileged/images/c60f8621-c751-4b91-72c4-c3cc/rootfs
overlay 64M 7.2M 57M 12% /var/vcap/data/grootfs/store/unprivileged/images/2bf3fe20-9ff3-44e6-7b90-d656/rootfs
overlay 64M 7.2M 57M 12% /var/vcap/data/grootfs/store/unprivileged/images/87b0084e-b6e2-4e35-5f04-b502/rootfs
overlay 1.0G 92M 933M 9% /var/vcap/data/grootfs/store/unprivileged/images/46abea19-9fe5-4057-5466-ef57/rootfs
overlay 64M 7.2M 57M 12% /var/vcap/data/grootfs/store/unprivileged/images/d0f44fa0-9491-4da2-755f-a864/rootfs
overlay 64M 7.2M 57M 12% /var/vcap/data/grootfs/store/unprivileged/images/ce58a99a-0b4c-4070-622d-207a/rootfs
overlay 64M 7.2M 57M 12% /var/vcap/data/grootfs/store/unprivileged/images/6aaecdbb-83a3-49a0-50f6-4f2c/rootfs
overlay 64M 7.2M 57M 12% /var/vcap/data/grootfs/store/unprivileged/images/a94a5e54-bd97-413e-4ebf-8e5e/rootfs
overlay 64M 7.2M 57M 12% /var/vcap/data/grootfs/store/unprivileged/images/62eef6c8-002a-40e0-537e-b869/rootfs
overlay 1.0G 92M 933M 9% /var/vcap/data/grootfs/store/unprivileged/images/7fcd3985-fc86-4d49-6318-bd66/rootfs
overlay 64M 7.2M 57M 12% /var/vcap/data/grootfs/store/unprivileged/images/07b247bb-b7cc-420f-7c6d-56b9/rootfs
overlay 64M 7.2M 57M 12% /var/vcap/data/grootfs/store/unprivileged/images/75d34c0f-b1c1-4866-48c8-161c/rootfs
overlay 64M 7.2M 57M 12% /var/vcap/data/grootfs/store/unprivileged/images/c6d3c65f-f8dd-47d3-4028-cf04/rootfs
overlay 64M 7.2M 57M 12% /var/vcap/data/grootfs/store/unprivileged/images/fe0ba241-a59b-4fbe-7e6b-b423/rootfs
overlay 64M 7.2M 57M 12% /var/vcap/data/grootfs/store/unprivileged/images/c95b2780-7ed0-45a6-7dcd-1d44/rootfs
overlay 64M 7.2M 57M 12% /var/vcap/data/grootfs/store/unprivileged/images/3c368ff6-d18d-4cd3-4cc5-cb80/rootfs
overlay 1.0G 92M 933M 9% /var/vcap/data/grootfs/store/unprivileged/images/ebe62763-d16c-40a0-6231-fdc6/rootfs
overlay 64M 7.2M 57M 12% /var/vcap/data/grootfs/store/unprivileged/images/d3e99752-f05d-41c3-5a4c-ea29/rootfs
overlay 64M 7.2M 57M 12% /var/vcap/data/grootfs/store/unprivileged/images/096852be-16f2-48fa-41b3-fa97/rootfs
overlay 64M 7.2M 57M 12% /var/vcap/data/grootfs/store/unprivileged/images/1dad94bb-76d7-4ee2-7de3-a4a4/rootfs
overlay 64M 7.2M 57M 12% /var/vcap/data/grootfs/store/unprivileged/images/bb94dfc7-5ccd-473e-4ce8-a1f6/rootfs
overlay 64M 7.2M 57M 12% /var/vcap/data/grootfs/store/unprivileged/images/3097f536-5766-4112-559d-ad7a/rootfs
overlay 64M 7.2M 57M 12% /var/vcap/data/grootfs/store/unprivileged/images/547cab22-2987-46dc-5235-9713/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/87b0084e-b6e2-4e35-5f04-b502-liveness-healthcheck-0/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/3097f536-5766-4112-559d-ad7a-liveness-healthcheck-0/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/ce58a99a-0b4c-4070-622d-207a-liveness-healthcheck-0/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/a94a5e54-bd97-413e-4ebf-8e5e-liveness-healthcheck-0/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/c95b2780-7ed0-45a6-7dcd-1d44-liveness-healthcheck-0/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/fbf0adc4-ce51-42ba-696b-2588-liveness-healthcheck-0/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/1dad94bb-76d7-4ee2-7de3-a4a4-liveness-healthcheck-0/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/c60f8621-c751-4b91-72c4-c3cc-liveness-healthcheck-0/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/62eef6c8-002a-40e0-537e-b869-liveness-healthcheck-0/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/369b6f00-aeed-44f7-5520-65dd-liveness-healthcheck-0/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/75d34c0f-b1c1-4866-48c8-161c-liveness-healthcheck-0/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/b846e975-d019-4a4c-72b4-a55f-liveness-healthcheck-0/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/2bf3fe20-9ff3-44e6-7b90-d656-liveness-healthcheck-0/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/096852be-16f2-48fa-41b3-fa97-liveness-healthcheck-0/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/6aaecdbb-83a3-49a0-50f6-4f2c-liveness-healthcheck-0/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/fe0ba241-a59b-4fbe-7e6b-b423-liveness-healthcheck-0/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/07b247bb-b7cc-420f-7c6d-56b9-liveness-healthcheck-0/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/d0f44fa0-9491-4da2-755f-a864-liveness-healthcheck-0/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/bb94dfc7-5ccd-473e-4ce8-a1f6-liveness-healthcheck-0/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/c6d3c65f-f8dd-47d3-4028-cf04-liveness-healthcheck-0/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/ebe62763-d16c-40a0-6231-fdc6-liveness-healthcheck-0/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/d3e99752-f05d-41c3-5a4c-ea29-liveness-healthcheck-0/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/3c368ff6-d18d-4cd3-4cc5-cb80-liveness-healthcheck-0/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/547cab22-2987-46dc-5235-9713-liveness-healthcheck-0/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/d3d0080a-bcc4-4acb-7f17-e662-liveness-healthcheck-0/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/7fcd3985-fc86-4d49-6318-bd66-liveness-healthcheck-0/rootfs
overlay 71G 2.6G 68G 4% /var/vcap/data/grootfs/store/unprivileged/images/46abea19-9fe5-4057-5466-ef57-liveness-healthcheck-0/rootfs
du against Grootfs store directories
# du -sch /var/vcap/data/grootfs/store/*
1.8G /var/vcap/data/grootfs/store/privileged
1.9G /var/vcap/data/grootfs/store/privileged.backing-store
2.6G /var/vcap/data/grootfs/store/unprivileged
4.3G /var/vcap/data/grootfs/store/unprivileged.backing-store
11G total
It would help to have a documented set of techniques and/or tools to use to answer some of the questions posed above on a particular Garden/Grootfs host.
/cc @cloudfoundry/cf-diego @nikhilsuvarna @dmikusa-pivotal
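One hedged technique toward the documented set requested above: du's -x (--one-file-system) flag avoids descending into each image's overlay mount, which is the source of the overcounting described earlier.
# shared layers are counted once instead of once per container image mount
du -shx /var/vcap/data/grootfs/store/unprivileged
du -shx /var/vcap/data/grootfs/store/privileged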
Hello,
even though we have set the property graph_cleanup_threshold_in_mb
in the diego manifest to 66560
, the clean process doesn't seem to work properly. The following is the output of the ephemeral disk usage of one of our cells:
# df -h /var/vcap/data/
Filesystem Size Used Avail Use% Mounted on
/dev/xvdb2 83G 76G 2.8G 97% /var/vcap/data
This issue is affecting multiple cells.
/var/vcap/data# df -h
Filesystem Size Used Avail Use% Mounted on
udev 7.9G 4.0K 7.9G 1% /dev
tmpfs 1.6G 884K 1.6G 1% /run
/dev/xvda1 2.9G 1.2G 1.7G 41% /
none 4.0K 0 4.0K 0% /sys/fs/cgroup
none 5.0M 0 5.0M 0% /run/lock
none 7.9G 8.1M 7.9G 1% /run/shm
none 100M 0 100M 0% /run/user
/dev/xvdb2 83G 76G 2.8G 97% /var/vcap/data
tmpfs 1.0M 24K 1000K 3% /var/vcap/data/sys/run
tmpfs 2.5M 0 2.5M 0% /var/vcap/data/executor/instance_identity
/dev/loop1 1008M 62M 880M 7% /var/vcap/data/garden/graph/aufs/diff/ef9fc27954ba8570ebd62b825309e821cb6e002aa885af2f8da22bf4b6f82344
none 1008M 62M 880M 7% /var/vcap/data/garden/graph/aufs/mnt/ef9fc27954ba8570ebd62b825309e821cb6e002aa885af2f8da22bf4b6f82344
/dev/loop3 1008M 516M 426M 55% /var/vcap/data/garden/graph/aufs/diff/7a56008bf6671e8c1bd67663a7a7c7fb7166d61255dec8265b67324a2e9ed352
none 1008M 516M 426M 55% /var/vcap/data/garden/graph/aufs/mnt/7a56008bf6671e8c1bd67663a7a7c7fb7166d61255dec8265b67324a2e9ed352
/dev/loop11 2.0G 1.1G 786M 59% /var/vcap/data/garden/graph/aufs/diff/8b48e6690d4e23f7f1ed51e5923dab779c60242f371f0c315a036ab8b06b8e68
none 2.0G 1.1G 786M 59% /var/vcap/data/garden/graph/aufs/mnt/8b48e6690d4e23f7f1ed51e5923dab779c60242f371f0c315a036ab8b06b8e68
/dev/loop12 1008M 177M 765M 19% /var/vcap/data/garden/graph/aufs/diff/10d514bbd1479a5a14dee6feaf7b7e74d0ca500febd41c2d877375aa2e0dd8a6
none 1008M 177M 765M 19% /var/vcap/data/garden/graph/aufs/mnt/10d514bbd1479a5a14dee6feaf7b7e74d0ca500febd41c2d877375aa2e0dd8a6
/dev/loop14 2.0G 622M 1.3G 33% /var/vcap/data/garden/graph/aufs/diff/96ccc671e828bd39f40a3f383eefa65240f8f7677a968e5ec720ddfe1637678c
none 2.0G 622M 1.3G 33% /var/vcap/data/garden/graph/aufs/mnt/96ccc671e828bd39f40a3f383eefa65240f8f7677a968e5ec720ddfe1637678c
/dev/loop7 1008M 1.6M 940M 1% /var/vcap/data/garden/graph/aufs/diff/cb46b68904b66880205157a202067efdf51ff97ca632add33e6d1cab8ab58fec
none 1008M 1.6M 940M 1% /var/vcap/data/garden/graph/aufs/mnt/cb46b68904b66880205157a202067efdf51ff97ca632add33e6d1cab8ab58fec
/dev/loop17 1008M 168M 774M 18% /var/vcap/data/garden/graph/aufs/diff/7f76de6e20cee9abda0b3f7667da7a29b9e1dddf4e1477f0ba78bb488327cedf
none 1008M 168M 774M 18% /var/vcap/data/garden/graph/aufs/mnt/7f76de6e20cee9abda0b3f7667da7a29b9e1dddf4e1477f0ba78bb488327cedf
/dev/loop18 1008M 168M 774M 18% /var/vcap/data/garden/graph/aufs/diff/b64a022d1db5b4cb6626a434489423c4c79945642e7513f90f2cdf9833a901b4
none 1008M 168M 774M 18% /var/vcap/data/garden/graph/aufs/mnt/b64a022d1db5b4cb6626a434489423c4c79945642e7513f90f2cdf9833a901b4
/dev/loop21 1008M 159M 783M 17% /var/vcap/data/garden/graph/aufs/diff/8a55d45b59733ca354ea7e0c3254e27c9cb27f1bf94a071cf5b7112566a3639b
none 1008M 159M 783M 17% /var/vcap/data/garden/graph/aufs/mnt/8a55d45b59733ca354ea7e0c3254e27c9cb27f1bf94a071cf5b7112566a3639b
/dev/loop22 1008M 142M 799M 16% /var/vcap/data/garden/graph/aufs/diff/918036fb41e2c309df1b8a1a10f58cb54508965bec22985cd19aed34a7bd4662
none 1008M 142M 799M 16% /var/vcap/data/garden/graph/aufs/mnt/918036fb41e2c309df1b8a1a10f58cb54508965bec22985cd19aed34a7bd4662
/dev/loop23 1008M 578M 364M 62% /var/vcap/data/garden/graph/aufs/diff/0998e4fd0d333ac6223009472d566ef2308b2125f30a1345ec30e28a626fe811
none 1008M 578M 364M 62% /var/vcap/data/garden/graph/aufs/mnt/0998e4fd0d333ac6223009472d566ef2308b2125f30a1345ec30e28a626fe811
/dev/loop24 1008M 45M 897M 5% /var/vcap/data/garden/graph/aufs/diff/64a13d4627af42e5a92bcc1c36a0b1ef509167c92c2847dff1f38c743e24eafc
none 1008M 45M 897M 5% /var/vcap/data/garden/graph/aufs/mnt/64a13d4627af42e5a92bcc1c36a0b1ef509167c92c2847dff1f38c743e24eafc
/dev/loop25 1008M 233M 708M 25% /var/vcap/data/garden/graph/aufs/diff/39eccbf075792c0e7dcd17de2c224bc0498e87b09e0558dd00c795fe39dcb6dd
none 1008M 233M 708M 25% /var/vcap/data/garden/graph/aufs/mnt/39eccbf075792c0e7dcd17de2c224bc0498e87b09e0558dd00c795fe39dcb6dd
/dev/loop26 1008M 8.6M 933M 1% /var/vcap/data/garden/graph/aufs/diff/bc4c5cd17cde53c0715afda5eda116e1576bcd529f6e1a68d679e87469a4eda1
none 1008M 8.6M 933M 1% /var/vcap/data/garden/graph/aufs/mnt/bc4c5cd17cde53c0715afda5eda116e1576bcd529f6e1a68d679e87469a4eda1
/dev/loop0 1008M 167M 775M 18% /var/vcap/data/garden/graph/aufs/diff/0020ec10f6bc5302db0bb3443863e9a42f3020fc776327916c7593f4c230b121
none 1008M 167M 775M 18% /var/vcap/data/garden/graph/aufs/mnt/0020ec10f6bc5302db0bb3443863e9a42f3020fc776327916c7593f4c230b121
/dev/loop5 1008M 23M 919M 3% /var/vcap/data/garden/graph/aufs/diff/77682895ee97cc1f4e8842775328fff239a67d0b3290579deab1a594d4a43875
none 1008M 23M 919M 3% /var/vcap/data/garden/graph/aufs/mnt/77682895ee97cc1f4e8842775328fff239a67d0b3290579deab1a594d4a43875
/dev/loop4 1008M 143M 798M 16% /var/vcap/data/garden/graph/aufs/diff/216b4568a294eadc88697c7a543eae99c13f79a6b6615ccca827d6d1c99e744b
none 1008M 143M 798M 16% /var/vcap/data/garden/graph/aufs/mnt/216b4568a294eadc88697c7a543eae99c13f79a6b6615ccca827d6d1c99e744b
/dev/loop8 1008M 197M 745M 21% /var/vcap/data/garden/graph/aufs/diff/212c22e0713294bc5eeb70a0a02392f4fa522c707d996a1e75e8377ab61f940a
none 1008M 197M 745M 21% /var/vcap/data/garden/graph/aufs/mnt/212c22e0713294bc5eeb70a0a02392f4fa522c707d996a1e75e8377ab61f940a
/dev/loop2 1008M 395M 547M 42% /var/vcap/data/garden/graph/aufs/diff/c4a5870c72d15bf7a286b5164b16c6a8250bcdbcb778ad8bca769ace68bb528c
none 1008M 395M 547M 42% /var/vcap/data/garden/graph/aufs/mnt/c4a5870c72d15bf7a286b5164b16c6a8250bcdbcb778ad8bca769ace68bb528c
/dev/loop6 1008M 139M 802M 15% /var/vcap/data/garden/graph/aufs/diff/9326380ed326af77e0fb3ffadf3decf683c93da677a77ee222bfe7e84249d001
none 1008M 139M 802M 15% /var/vcap/data/garden/graph/aufs/mnt/9326380ed326af77e0fb3ffadf3decf683c93da677a77ee222bfe7e84249d001
/dev/loop10 62M 58M 207K 100% /var/vcap/data/garden/graph/aufs/diff/9e1b57af28bf689fff28b9e621830f9f37dfce8c9ad043391f592362280af6fb
none 62M 58M 207K 100% /var/vcap/data/garden/graph/aufs/mnt/9e1b57af28bf689fff28b9e621830f9f37dfce8c9ad043391f592362280af6fb
When the cleanup happens, is the garden process restarted? If so, it never was:
root 8103 2.6 0.4 3943836 77412 ? S<l Aug17 456:29 /var/vcap/packages/guardian/bin/gdn server --skip-setup --bind-ip=0.0.0.0 --bind-port=7777 --depot=/var/vcap/data/garden/depot --graph=/var/vcap/data/garden/graph --properties-path=/var/vcap/data/garden/props.json --port-pool-properties-path=/var/vcap/data/garden/port-pool-props.json --iptables-bin=/var/vcap/packages/iptables/sbin/iptables --iptables-restore-bin=/var/vcap/packages/iptables/sbin/iptables-restore --init-bin=/var/vcap/packages/guardian/bin/init --dadoo-bin=/var/vcap/packages/guardian/bin/dadoo --nstar-bin=/var/vcap/packages/guardian/bin/nstar --tar-bin=/var/vcap/packages/tar/tar --newuidmap-bin=/var/vcap/packages/garden-idmapper/bin/newuidmap --newgidmap-bin=/var/vcap/packages/garden-idmapper/bin/newgidmap --log-level=error --mtu=0 --network-pool=10.239.0.0/22 --deny-network=0.0.0.0/0 --destroy-containers-on-startup --debug-bind-ip=127.0.0.1 --debug-bind-port=17005 --default-rootfs=/var/vcap/packages/busybox --default-grace-time=0 --default-container-blockio-weight=0 --graph-cleanup-threshold-in-megabytes=66560 --max-containers=250 --cpu-quota-per-share=0 --tcp-memory-limit=0 --runtime-plugin=/var/vcap/packages/runc/bin/runc --persistent-image=/var/vcap/packages/cflinuxfs2/rootfs --apparmor=garden-default --cleanup-process-dirs-on-wait
CF: v270
Diego: v1.23.2
Garden-runc: v1.9.0
Environment: AWS
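A hedged check, not from the original report: the create logs earlier in this document show a gc step inside guardian.create, so graph cleanup runs as part of container creation; comparing actual graph usage with the configured threshold shows how far past it the cell has drifted.
# megabytes used by the shed graph, vs. the
# --graph-cleanup-threshold-in-megabytes=66560 in the gdn command line above
du -sm /var/vcap/data/garden/graph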
Note: This issue comes from the discussion in slack in: https://cloudfoundry.slack.com/archives/C033RE5D6/p1521478035000065
The problem has been identified and diagnosed, and it is unlikely to happen again. Opening this ticket for documentation and reference, as requested by @julz.
While upgrading our CF deployment, which included upgrading garden-runc-release from v1.9.6 to v1.11.1 and introducing grootfs, one application failed to restart after eviction, with the following error in the cf events:
Copying into the container failed: stream-in: nstar: error streaming in: exit status 2. Output: tar: ./app/.cfignore: Cannot create symlink to '# Byte-compiled / ....' File name too long\ntar: Exiting with failure
(see below for the full error).
This was caused by the bug described in cloudfoundry/cli#1349. A spurious file was created in the droplet with an extremely long symlink target (it used the content of the target file rather than its path).
But it was working fine with garden v1.9.6 and started to fail with v1.11.1 after introducing grootfs.
Pushing the application again with a newer version of cf-cli would result in a valid symlink and would fix the problem.
The full error is:
"metadata": {
"instance": "55ee4080-fc4e-4f63-48e0-daeb",
"index": 1,
"reason": "CRASHED",
"exit_description": "Copying into the container failed: stream-in: nstar: error streaming in: exit status 2. Output: tar: ./app/.cfignore: Cannot create symlink to '# Byte-compiled / optimized / DLL files\\n__pycache__/\\n*.py[cod]\\n*$py.class\\n\\n# C extensions\\n*.so\\n\\n# Distribution / packaging\\n.Python\\nenv/\\nbuild/\\ndevelop-eggs/\\ndist/\\ndownloads/\\neggs/\\n.eggs/\\nlib/\\nlib64/\\nparts/\\nsdist/\\nvar/\\nwheels/\\n*.egg-info/\\n.installed.cfg\\n*.egg\\n\\n# PyInstaller\\n# Usually these files are written by a python script from a template\\n# before PyInstaller builds the exe, so as to inject date/other infos into it.\\n*.manifest\\n*.spec\\n\\n# Installer logs\\npip-log.txt\\npip-delete-this-directory.txt\\n\\n# Unit test / coverage reports\\nhtmlcov/\\n.tox/\\n.coverage\\n.coverage.*\\n.cache\\nnosetests.xml\\ncoverage.xml\\n*.cover\\n.hypothesis/\\n\\n# Translations\\n*.mo\\n*.pot\\n\\n# Django stuff:\\n*.log\\nlocal_settings.py\\n\\n# Flask stuff:\\ninstance/\\n.webassets-cache\\n\\n# Scrapy stuff:\\n.scrapy\\n\\n# Sphinx documentation\\ndocs/_build/\\n\\n# PyBuilder\\ntarget/\\n\\n# Jupyter Notebook\\n.ipynb_checkpoints\\n\\n# pyenv\\n.python-version\\n\\n# celery beat schedule file\\ncelerybeat-schedule\\n\\n# SageMath parsed files\\n*.sage.py\\n\\n# dotenv\\n.env\\n\\n# virtualenv\\n.venv\\nvenv/\\nENV/\\n\\n# Spyder project settings\\n.spyderproject\\n.spyproject\\n\\n# Rope project settings\\n.ropeproject\\n\\n# mkdocs documentation\\n/site\\n\\n# mypy\\n.mypy_cache/\\n\\n.direnv\\n.envrc\\n': File name too long\ntar: Exiting with failure status due to previous errors\n",
"crash_count": 1,
"crash_timestamp": 1519298150947758085,
"version": "1f6aa9e0-a404-41bb-b6ec-67620be0febd"
},
To reproduce, use cf v3-push with cf-cli v6.32.0, including a symbolic link to the absolute path of a file larger than 1 KB:
dd count=2 bs=1024 if=/dev/urandom | base64 > foo
ln -s $(pwd)/foo bar
cf-6.32.0 v3-push -p . myapp
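A hedged explanation check (the PATH_MAX reasoning is our interpretation, not stated in the issue): symlink(2) fails with ENAMETOOLONG when the target exceeds PATH_MAX, and the spurious .cfignore symlink's target here was the file's entire contents.
# compare target lengths against the limit; the broken droplet symlink's
# target is far beyond it
getconf PATH_MAX /
readlink bar | wc -c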
Hello,
When installing Diego in a vSphere environment, the following error occurs:
#############################################################
Started Updating instance
Started Updating instance > database_z1/2c9a0212-55c1-4c92-9be7-49a3e1920041 (0) (canary)
Done Updating instance > database_z1/2c9a0212-55c1-4c92-9be7-49a3e1920041 (0) (canary)
Started Updating instance > brain_z1/40c56e23-ef82-4542-9ff8-77543f1d6507 (0) (canary)
Done Updating instance > brain_z1/40c56e23-ef82-4542-9ff8-77543f1d6507 (0) (canary)
Started Updating instance > cell_z1/ce3f079d-6e98-4ea5-b584-0e28da2eff2b (0) (canary)
Started Updating instance > cc_bridge_z1/cb6d6806-8c63-415e-83cb-cc0467f20adc (0) (canary)
Started Updating instance > route_emitter_z1/83a66350-ac04-423f-81cd-df2938d1bfd4 (0) (canary)
Started Updating instance > access_z1/1c6d1586-9137-4895-a579-03712bcf728c (0) (canary)
Done Updating instance > route_emitter_z1/83a66350-ac04-423f-81cd-df2938d1bfd4 (0) (canary)
Done Updating instance > cc_bridge_z1/cb6d6806-8c63-415e-83cb-cc0467f20adc (0) (canary)
Done Updating instance > access_z1/1c6d1586-9137-4895-a579-03712bcf728c (0) (canary)
Failed Updating instance > cell_z1/ce3f079d-6e98-4ea5-b584-0e28da2eff2b (0) (canary)
Error Code : 400007, Message :'cell_z1/0 (ce3f079d-6e98-4ea5-b584-0e28da2eff2b)' is not running after update. Review logs for failed jobs: rep, garden, metron_agent
#############################################################
The monit message was as follows:
#############################################################
chmod u+s /var/vcap/packages/grootfs/bin/tardis
{"timestamp":"1521080887.013528109","source":"grootfs","message":"grootfs.init-store.store-manager-init-store.overlayxfs-init-filesystem.mounting-filesystem-failed","log_level":2,"data":{"error":"exit status 32: mount: wrong fs type, bad option, bad superblock on /dev/loop0,\n missing codepage or helper program, or other error\n In some cases useful info is found in syslog - try\n dmesg | tail or so\n\n","filesystemPath":"/var/vcap/data/grootfs/store/unprivileged.backing-store","session":"1.1.2","spec":{"UIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"GIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"StoreSizeBytes":25571164160},"storePath":"/var/vcap/data/grootfs/store/unprivileged"}}
{"timestamp":"1521080887.013750792","source":"grootfs","message":"grootfs.init-store.store-manager-init-store.initializing-filesystem-failed","log_level":2,"data":{"backingstoreFile":"/var/vcap/data/grootfs/store/unprivileged.backing-store","error":"Mounting filesystem: exit status 32: mount: wrong fs type, bad option, bad superblock on /dev/loop0,\n missing codepage or helper program, or other error\n In some cases useful info is found in syslog - try\n dmesg | tail or so\n\n","session":"1.1","spec":{"UIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"GIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"StoreSizeBytes":25571164160},"storePath":"/var/vcap/data/grootfs/store/unprivileged"}}
{"timestamp":"1521080887.013805628","source":"grootfs","message":"grootfs.init-store.cleaning-up-store-failed","log_level":2,"data":{"error":"initializing filesyztem: Mounting filesystem: exit status 32: mount: wrong fs type, bad option, bad superblock on /dev/loop0,\n missing codepage or helper program, or other error\n In some cases useful info is found in syslog - try\n dmesg | tail or so\n\n","session":"1"}}
#############################################################
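A hedged first check for this class of mount failure (these commands are our suggestion, not from the report): exit status 32 with "wrong fs type, bad option, bad superblock" often means the kernel cannot mount XFS at all, or the backing store was never formatted.
# does the stemcell kernel support XFS, and is the backing store an XFS image?
grep xfs /proc/filesystems
file /var/vcap/data/grootfs/store/unprivileged.backing-store
# dmesg usually carries the real mount error
dmesg | tail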
My environment was:
cf-release version 287
diego-release version 1.34.0
garden-runc-release version 1.11.0
cflinuxfs-release version 1.185.0
Can anyone help with this issue? (The error occurs when installing Diego in a vSphere environment.)
Thanks in Advance. 👍
Best Regards,
JY
After upgrading to garden-runc-release v1.13.0 I'm seeing the following errors in my logs:
/var/vcap/data/baggageclaim/volumes/live/c6e9cb7f-1be9-4d78-48a6-b5903816a20b/volume is not a mount point
TBC
/var/vcap/data/baggageclaim/volumes/live/c6e9cb7f-1be9-4d78-48a6-b5903816a20b/volume is not a mount point
There is a known regression in garden's handling of bind mounts in v1.13.0.
Do not use v1.13.0; instead, bump directly to v1.13.1.
Garden job fails to come up with Pivotal App Service Tile v2.0.8 due to grootfs issue:
diego_cell/59ad51f8-81a2-443f-a77c-a066f5c612e6:~# tail -f /var/vcap/sys/log/garden/garden.std*
==> /var/vcap/sys/log/garden/garden.stderr.log <==
==> /var/vcap/sys/log/garden/garden.stdout.log <==
{"timestamp":"1522086995.059141159","source":"guardian","message":"guardian.create.create-failed-cleaningup.volumizer.image-plugin-destroy.grootfs.delete.failed-to-initialise-image-driver","log_level":2,"data":{"cause":"running image plugin create: reading namespace file: open /var/vcap/data/grootfs/store/unprivileged/meta/namespace.json: input/output error\n: exit status 1","error":"reading namespace file: open /var/vcap/data/grootfs/store/unprivileged/meta/namespace.json: input/output error","handle":"executor-healthcheck-2001547f-aa74-45df-6286-fb6ebf812728","original_timestamp":"2018-03-26T17:56:35.0587776Z","session":"91.3.2.1"}}
{"timestamp":"1522086995.063849688","source":"guardian","message":"guardian.create.create-failed-cleaningup.volumizer.image-plugin-destroy.image-plugin-result","log_level":2,"data":{"action":"destroy","cause":"running image plugin create: reading namespace file: open /var/vcap/data/grootfs/store/unprivileged/meta/namespace.json: input/output error\n: exit status 1","error":"exit status 1","handle":"executor-healthcheck-2001547f-aa74-45df-6286-fb6ebf812728","session":"91.3.2.1","stdout":"reading namespace file: open /var/vcap/data/grootfs/store/unprivileged/meta/namespace.json: input/output error\n"}}
{"timestamp":"1522086995.064245462","source":"guardian","message":"guardian.create.create-failed-cleaningup.destroy-failed","log_level":2,"data":{"cause":"running image plugin create: reading namespace file: open /var/vcap/data/grootfs/store/unprivileged/meta/namespace.json: input/output error\n: exit status 1","error":"running image plugin destroy: reading namespace file: open /var/vcap/data/grootfs/store/unprivileged/meta/namespace.json: input/output error\n: exit status 1","handle":"executor-healthcheck-2001547f-aa74-45df-6286-fb6ebf812728","session":"91.3"}}
{"timestamp":"1522086995.064504623","source":"guardian","message":"guardian.create.create-failed-cleaningup.cleanedup","log_level":1,"data":{"cause":"running image plugin create: reading namespace file: open /var/vcap/data/grootfs/store/unprivileged/meta/namespace.json: input/output error\n: exit status 1","handle":"executor-healthcheck-2001547f-aa74-45df-6286-fb6ebf812728","session":"91.3"}}
{"timestamp":"1522086995.064779043","source":"guardian","message":"guardian.api.garden-server.create.failed","log_level":2,"data":{"error":"running image plugin create: reading namespace file: open /var/vcap/data/grootfs/store/unprivileged/meta/namespace.json: input/output error\n: exit status 1","request":{"Handle":"executor-healthcheck-2001547f-aa74-45df-6286-fb6ebf812728","GraceTime":0,"RootFSPath":"/var/vcap/packages/cflinuxfs2/rootfs.tar","BindMounts":null,"Network":"","Privileged":false,"Limits":{"bandwidth_limits":{},"cpu_limits":{},"disk_limits":{},"memory_limits":{},"pid_limits":{}}},"session":"1.1.122"}}
{"timestamp":"1522087003.623143435","source":"guardian","message":"guardian.list-containers.starting","log_level":1,"data":{"session":"92"}}
{"timestamp":"1522087003.623309374","source":"guardian","message":"guardian.list-containers.finished","log_level":1,"data":{"session":"92"}}
{"timestamp":"1522087035.329714775","source":"guardian","message":"guardian.api.garden-server.waiting-for-connections-to-close","log_level":1,"data":{"session":"1.1"}}
{"timestamp":"1522087035.332157612","source":"guardian","message":"guardian.api.garden-server.stopping-backend","log_level":1,"data":{"session":"1.1"}}
{"timestamp":"1522087035.332249641","source":"guardian","message":"guardian.api.garden-server.stopped","log_level":1,"data":{"session":"1.1"}}
Underlying problem with the grootfs unprivileged store:
~# ls -al /var/vcap/data/grootfs/store/
ls: cannot access /var/vcap/data/grootfs/store/unprivileged: Input/output error
total 113824
drwxr-xr-x 4 root root 4096 Mar 26 17:52 .
drwxr-xr-x 3 root root 4096 Mar 26 17:51 ..
drwx------ 9 root root 118 Mar 26 17:53 privileged
-rw------- 1 root root 118579802112 Mar 26 17:55 privileged.backing-store
d????????? ? ? ? ? ? unprivileged
-rw------- 1 root root 118579802112 Mar 26 17:53 unprivileged.backing-store
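A hedged reading of the listing above (our interpretation, not the reporter's): the d????????? row means stat() itself returns an I/O error, which points at corruption beneath grootfs; the kernel log and a read-only XFS check of the backing store would confirm it.
# xfs_repair -n is a dry run: it reports problems without modifying anything
dmesg | grep -i xfs | tail
xfs_repair -n /var/vcap/data/grootfs/store/unprivileged.backing-store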
The PAS deployment fails to come up, with the garden job on the diego_cell failing.
See above
We believe the root cause was an underlying issue with the IaaS.
Recreating the diego_cell didn't work, and the folder cannot be deleted as there is no permission. This was actually resolved by deleting the VM from the IaaS and redeploying.
Hi all,
We're running the test suite again, this time getting a different error:
scripts/remote-fly ci/nested-tests.yml
executing build 1
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 21.6M 0 21.6M 0 0 12.2M 0 --:--:-- 0:00:01 --:--:-- 12.2M
initializing with docker:///cloudfoundry/garden-ci-ubuntu
running guardian-release/ci/scripts/nested-tests
++ dirname guardian-release/ci/scripts/nested-tests
+ cd guardian-release/ci/scripts/../..
+ export GOROOT=/usr/local/go
+ GOROOT=/usr/local/go
+ export PATH=/usr/local/go/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+ PATH=/usr/local/go/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+ export GOPATH=/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release
+ GOPATH=/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release
+ export PATH=/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/bin:/usr/local/go/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+ PATH=/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/bin:/usr/local/go/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+ pushd src/github.com/cloudfoundry-incubator/guardian
/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/cloudfoundry-incubator/guardian /tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release
+ go install -tags daemon github.com/cloudfoundry-incubator/guardian/cmd/guardian/
+ go install github.com/cloudfoundry-incubator/guardian/rundmc/iodaemon/cmd/iodaemon
+ tmpdir=/tmp/dir
+ rm -fr /tmp/dir
+ mkdir -p /tmp/dir/depot
+ mkdir /tmp/dir/snapshots
+ mkdir /tmp/dir/graph
+ popd
/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release
+ go install github.com/onsi/ginkgo/ginkgo
+ /tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/bin/guardian -iodaemonBin=/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/bin/iodaemon -depot=/tmp/dir/depot -snapshots=/tmp/dir/snapshots -graph=/tmp/dir/graph -bin=/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/cloudfoundry-incubator/guardian/linux_backend/bin -listenNetwork=tcp -listenAddr=0.0.0.0:7777 -denyNetworks= -allowNetworks= -allowHostAccess=false -mtu=1500 -containerGraceTime=5m -logLevel=error -rootfs=/opt/warden/rootfs
ERRO[0000] Failed to GetDriver graph btrfs /tmp/dir/graph
ERRO[0000] Failed to GetDriver graph zfs /tmp/dir/graph
ERRO[0000] Failed to GetDriver graph devicemapper /tmp/dir/graph
ERRO[0000] Failed to GetDriver graph overlay /tmp/dir/graph
ERRO[0000] Failed to GetDriver graph vfs /tmp/dir/graph
{"timestamp":"1449606737.425148010","source":"guardian","message":"guardian.volume-creator.failed-to-construct-graph-driver","log_level":3,"data":{"error":"No supported storage backend found","graphRoot":"/tmp/dir/graph","session":"6","trace":"goroutine 1 [running]:\ngithub.com/pivotal-golang/lager.(*logger).Fatal(0xc2080785a0, 0xa41190, 0x20, 0x7f5eccc41e80, 0xc208095d50, 0x0, 0x0, 0x0)\n\t/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/pivotal-golang/lager/logger.go:131 +0xc8\nmain.wireVolumeCreator(0x7f5eccc46e40, 0xc2080785a0, 0x7ffc7a3b3c47, 0xe, 0xc20806db00)\n\t/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/cloudfoundry-incubator/guardian/cmd/guardian/main.go:343 +0x3fe\nmain.main()\n\t/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/cloudfoundry-incubator/guardian/cmd/guardian/main.go:236 +0x5c4\n"}}
panic: No supported storage backend found
goroutine 1 [running]:
github.com/pivotal-golang/lager.(*logger).Fatal(0xc2080785a0, 0xa41190, 0x20, 0x7f5eccc41e80, 0xc208095d50, 0x0, 0x0, 0x0)
/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/pivotal-golang/lager/logger.go:152 +0x5d0
main.wireVolumeCreator(0x7f5eccc46e40, 0xc2080785a0, 0x7ffc7a3b3c47, 0xe, 0xc20806db00)
/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/cloudfoundry-incubator/guardian/cmd/guardian/main.go:343 +0x3fe
main.main()
/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/cloudfoundry-incubator/guardian/cmd/guardian/main.go:236 +0x5c4
goroutine 5 [syscall]:
os/signal.loop()
/usr/local/go/src/os/signal/signal_unix.go:21 +0x1f
created by os/signal.init·1
/usr/local/go/src/os/signal/signal_unix.go:27 +0x35
+ cd src/github.com/cloudfoundry-incubator/garden-integration-tests
++ hostname
+ export GARDEN_ADDRESS=7g89flel121:7777
+ GARDEN_ADDRESS=7g89flel121:7777
+ ginkgo -p -nodes=4
Running Suite: GardenIntegrationTests Suite
===========================================
Random Seed: 1449606738
Will run 112 of 113 specs
Running in parallel across 4 nodes
• Failure in Spec Setup (JustBeforeEach) [0.003 seconds]
Limits [JustBeforeEach]
/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/cloudfoundry-incubator/garden-integration-tests/limits_test.go:346
LimitMemory
/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/cloudfoundry-incubator/garden-integration-tests/limits_test.go:33
with a memory limit
/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/cloudfoundry-incubator/garden-integration-tests/limits_test.go:32
when the process writes too much to /dev/shm
/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/cloudfoundry-incubator/garden-integration-tests/limits_test.go:31
is killed
/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/cloudfoundry-incubator/garden-integration-tests/limits_test.go:30
Expected error:
<*url.Error | 0xc2080b9f50>: {
Op: "Post",
URL: "http://api/containers",
Err: {
Op: "dial",
Net: "tcp",
Addr: {
IP: "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xff\n\xfe\x00\x06",
Port: 7777,
Zone: "",
},
Err: 0x6f,
},
}
Post http://api/containers: dial tcp 10.254.0.6:7777: connection refused
not to have occurred
/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/cloudfoundry-incubator/garden-integration-tests/garden_integration_tests_suite_test.go:52
and more failures like this afterwards.
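A hedged first check (our suggestion, not part of the original thread): the panic above means every docker graph driver probe failed inside the outer container, so seeing which filesystems that kernel offers there would narrow it down.
# run inside the task container; vfs needs no kernel support, the others do
grep -E 'overlay|btrfs|aufs' /proc/filesystems
# is /tmp/dir/graph on a filesystem the drivers can use?
df -T /tmp/dir/graph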
We saw the error both on a local Concourse Lite and on an AWS Concourse worker.
Have we misconfigured something?
Thanks!
cc @dbellotti
I was trying to push the same app with the same version of the Python buildpack; after the enablement of OCI Phase 1, the container for that application cannot be created anymore.
The same happens on bosh-lite when you enable OCI Phase 1.
App: https://github.com/jszroberto/mkdocs-test
2018-08-15T12:44:35.26+0200 [API/1] OUT App instance exited with guid 4ffa5760-c172-494e-a589-70cc524fb6d3 payload: {"instance"=>"256f40f1-9ef7-46c6-76ba-bf84", "index"=>0, "reason"=>"CRASHED", "exit_description"=>"failed to create container: running image plugin create: pulling the image: unpacking layer `sha256:78060342299850c2553a62002006edb88d2aa86e9c27d7373d2ef52a8092e919`: link ./deps/0/conda/pkgs/ncurses-6.1-hf484d3e_0/bin/clear /home/vcap/deps/0/conda/bin/clear: no such file or directory\n: exit status 1", "crash_count"=>2, "crash_timestamp"=>1534329875224712963, "version"=>"d2a9eda1-dbd3-47a5-a44d-7ee2d538f4c5"}
2018-08-15T12:44:35.26+0200 [CELL/0] OUT Cell c5e766c9-97ee-4005-af9e-aac7b80b1cb2 destroying container for instance 256f40f1-9ef7-46c6-76ba-bf84
2018-08-15T12:44:35.27+0200 [CELL/0] OUT Cell c5e766c9-97ee-4005-af9e-aac7b80b1cb2 successfully destroyed container for instance 256f40f1-9ef7-46c6-76ba-bf84
2018-08-15T12:44:35.45+0200 [CELL/0] OUT Cell c5e766c9-97ee-4005-af9e-aac7b80b1cb2 creating container for instance f9ebd6c1-5e9a-4e77-65e9-45c7
2018-08-15T12:44:54.92+0200 [CELL/0] ERR Cell c5e766c9-97ee-4005-af9e-aac7b80b1cb2 failed to create container for instance f9ebd6c1-5e9a-4e77-65e9-45c7: running image plugin create: pulling the image: unpacking layer `sha256:78060342299850c2553a62002006edb88d2aa86e9c27d7373d2ef52a8092e919`: link ./deps/0/conda/pkgs/ncurses-6.1-hf484d3e_0/bin/clear /home/vcap/deps/0/conda/bin/clear: no such file or directory
2018-08-15T12:44:54.92+0200 [CELL/0] ERR : exit status 1
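A hedged diagnostic (layer.tar below is a hypothetical local copy of the layer named in the error, not a path from the report): the failing entry looks like a tar hard link whose target has not been unpacked yet, so listing the layer's entry order would confirm an ordering problem.
# hard links show as type 'h'; check whether the link appears before its target
tar -tvf layer.tar | grep -n 'bin/clear'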
I was unable to find this documented anywhere, but is CRIU (Checkpoint/Restore In Userspace) supported inside Garden? If not, is there a mechanism to live-migrate a container (or parts of a container, such as active TCP connections)?
We have a Concourse setup running with v1.6.0 and recently upgraded its workers to garden-runc v0.5.0.
We noticed that since the upgrade some of our build jobs on these workers have been failing; to be more specific, they fail when they try to run the bosh-cli command bosh upload stemcell <filename>.
During the stemcell upload the bosh-cli tries to print a progress bar to the screen/terminal, and fails with the error message negative argument:
Verifying stemcell...
File exists and readable OK
Verifying tarball...
Read tarball OK
Manifest exists OK
Stemcell image file OK
Stemcell properties OK
Stemcell info
-------------
Name: bosh-openstack-kvm-ubuntu-trusty-go_agent
Version: 3262.5
Checking if stemcell already exists...
No
Uploading stemcell...
negative argument.
Usage: upload stemcell <stemcell_location> [--skip-if-exists] [--fix] [--sha1 SHA1] [--name NAME] [--version VERSION]
I think this might be an AppArmor issue?
While trying to pinpoint the issue, I downgraded our Concourse workers back to garden-runc v0.4.0, which does not yet have AppArmor, and suddenly the bosh-cli was working again perfectly fine, printing the progress bar with no problem.
Or could there be something else that differs between v0.4.0 and v0.5.0 and might cause this issue?
What's also weird is that when I use the fly cli to hijack into the container and manually run the bosh-cli command, it works (on both v0.4.0 and v0.5.0)!
It looks to me like it could have something to do with the terminal? Unfortunately I'm no expert at this and could not figure out further what the cause might be.
bosh stemcell upload <filename>
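If the terminal theory is right, the arithmetic failure is easy to picture. Below is a minimal Go analogue (hedged: the bosh-cli in question is Ruby, and none of these names are its real API); it only mirrors the suspected failure mode, where a zero or unreported terminal width makes the computed bar width negative:

```go
package main

import (
	"fmt"
	"strings"
)

// bar renders a fixed-width progress bar. With a sane terminal width this
// is fine; with the width reported as 0 (e.g. no real TTY inside the
// container, or the terminal ioctl being blocked) the computed bar width
// goes negative and strings.Repeat panics with a "negative" error, much
// like the Ruby bosh-cli's "negative argument".
func bar(termWidth, done, total int) string {
	decoration := len("Uploading stemcell... [] 100%")
	width := termWidth - decoration // negative when termWidth < decoration
	filled := width * done / total
	return "[" + strings.Repeat("=", filled) + strings.Repeat(" ", width-filled) + "]"
}

func main() {
	fmt.Println(bar(80, 1, 2)) // works
	fmt.Println(bar(0, 1, 2))  // panics: strings: negative Repeat count
}
```

This would also be consistent with fly hijack working: a hijack session allocates a real pseudo-terminal, so the width query succeeds there.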
NB: Are you currently experiencing this issue? Please reach out to us on the #garden slack channel before attempting any of the temporary resolutions or before destroying the affected cell(s). We haven't been able to reproduce this issue ourselves and would really appreciate an opportunity to do some live debugging! Thanks!
Failed to create container: running image plugin create: making image: creating image: applying disk limits: apply disk limit: <nil>: setting quota to /var/vcap/data/grootfs/store/unprivileged/images/b1639404-2935-11e8-b467-0ed5f89f718b: setting quota limit for projid 2: function not implemented: exit status 1
TBC
TBC
This issue occurs when garden is configured to use GrootFS for image management.
It is possible to fall back to the previous image management tool (garden-shed) by setting the following BOSH property:
garden.deprecated_use_garden_shed = true
NB It is recommended to force a recreate of VMs running garden (typically the diego cell VMs) when switching between GrootFS and garden-shed.
guardian [ERROR] 03/15 16:50:56.27 577.2.1 guardian.create.volume-creator.image-plugin-create.grootfs.create.groot-creating.making-image.overlayxfs-creating-image.applying-quotas.run-tardis.tardis.set-quota.setting-quota-failed
Error: setting quota limit for projid 2: function not implemented
guardian [ERROR] 03/15 16:50:56.27 577.2.1 guardian.create.volume-creator.image-plugin-create.grootfs.create.groot-creating.making-image.overlayxfs-creating-image.applying-quotas.run-tardis.tardis-failed
Error: exit status 1
guardian [ERROR] 03/15 16:50:56.27 577.2.1 guardian.create.volume-creator.image-plugin-create.grootfs.create.groot-creating.making-image.creating-image-failed
Error: applying disk limits: apply disk limit: <nil>: setting quota to /var/vcap/data/grootfs/store/unprivileged/images/b1639404-2935-11e8-b467-0ed5f89f718b: setting quota limit for projid 2: function not implemented: exit status 1
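A quick sanity check for the suspected cause: "function not implemented" (ENOSYS) from the quota calls usually means the kernel/filesystem combination lacks XFS project quota support. A minimal diagnostic sketch (our own, not a grootfs tool) that lists xfs mounts and whether prjquota appears in their options:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/mounts")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) < 4 || fields[2] != "xfs" {
			continue
		}
		mountPoint, opts := fields[1], fields[3]
		// GrootFS disk limits need XFS project quotas; /proc/mounts shows
		// "prjquota" in the mount options when they are enabled.
		fmt.Printf("%s: xfs, prjquota=%v\n", mountPoint, strings.Contains(opts, "prjquota"))
	}
}
```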
I am trying out garden-runc and I have been following the github README. I was wondering if there is an easy way to deploy garden-runc on a local machine without the need for virtual box and bosh-lite?
https://github.com/cloudfoundry/garden-runc-release/releases has a v1.1.1 release but http://bosh.io/releases/github.com/cloudfoundry-incubator/garden-runc-release?all=1 has v1.0.0
@cppforlife is this an issue with bosh.io pipeline or something quirky with https://github.com/cloudfoundry/garden-runc-release repo?
Hi there! Thanks for taking time to open an Issue.
Please fill out as much of the following information as you can.
Since runc depends on a CRIU installation to perform checkpoint and restore of running containers, can the criu distribution be packaged along with the release?
Cloud Foundry
Linux kernel version (uname -r prints this information)
A test issue to see if GitBot is working.
Garden fails to destroy containers that contain files whose paths are very long.
1.15.1
bosh-aws-xen-hvm-ubuntu-trusty-go_agent/3586.25
4.4.0-130-generic
We don't have steps to reproduce but there is the AWS volume of the two cells that exhibited this problem. Let me know if you need access to them.
{"timestamp":"1531942576.706936121","source":"guardian","message":"guardian.destroy.volumizer.image-plugin-destroy.grootfs.delete.groot-deleting.deleting-image.destroying-image-failed","log_level":2,"data":{"error":"deleting rootfs folder: lstat /var/vcap/data/grootfs/store/unprivileged/images/12055071-9b46-414f-88d8-b6d2bac389b9/diff/tmp/app/public/public/public/public/public/... <about 200 more public's> .../public/public/public/public/public/public/public/public/public/public/public/public/public/public/public/public/public/public/public/public/public/public/public/public/public/public/bootstrap: file name too long","handle":"12055071-9b46-414f-88d8-b6d2bac389b9","id":"12055071-9b46-414f-88d8-b6d2bac389b9","imageID":"12055071-9b46-414f-88d8-b6d2bac389b9","original_timestamp":"2018-07-18T19:36:16.706365696Z","session":"3305.2.1","storePath":"/var/vcap/data/grootfs/store/unprivileged"}}
We think this is caused by the container creating a filename with a very long path (many nested directories).
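For what it's worth, paths like this exceed PATH_MAX, so any lstat/remove on the full joined path fails with ENAMETOOLONG. A generic sketch (not Garden's code) of the usual fix, descending one component at a time with the *at syscalls so the kernel never has to resolve the over-long path:

```go
package sketch

import "golang.org/x/sys/unix"

// removeChain deletes a chain root/name/name/.../name of nested,
// otherwise-empty directories (like the ~200 "public" directories in the
// log above), even when the joined path exceeds PATH_MAX. Each
// Openat/Unlinkat call only ever sees one short component relative to an
// open directory fd.
func removeChain(root, name string) error {
	fd, err := unix.Open(root, unix.O_DIRECTORY|unix.O_RDONLY, 0)
	if err != nil {
		return err
	}
	fds := []int{fd}
	for {
		child, err := unix.Openat(fd, name, unix.O_DIRECTORY|unix.O_RDONLY, 0)
		if err != nil {
			break // reached the innermost directory
		}
		fds = append(fds, child)
		fd = child
	}
	// Unwind: remove each (now empty) child via its parent's fd.
	for i := len(fds) - 1; i >= 1; i-- {
		unix.Close(fds[i])
		if err := unix.Unlinkat(fds[i-1], name, unix.AT_REMOVEDIR); err != nil {
			return err
		}
	}
	return unix.Close(fds[0])
}
```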
I recently added some ICMP security groups to my CF environment that use a code/type value of -1 (see cloudfoundry/cloud_controller_ng#815). Here is an example rule:
[ {
"code": -1,
"destination": "10.53.248.0/22",
"protocol": "icmp",
"type": -1
}
]
This appears to work great. However, when I later attempt to bosh redeploy a cell running an application with this security group applied, garden fails to restart, with the error below.
I'm not 100% sure if this is Garden or CAPI, but I figured I'd start here, since at the very least Garden probably shouldn't accept iptables config that it cannot restart with.
bulk starter: setting up default chains: iptables: setup-global-chains: + set -o nounset
+ set -o errexit
+ shopt -s nullglob
+ filter_input_chain=w--input
+ filter_forward_chain=w--forward
+ filter_default_chain=w--default
+ filter_instance_prefix=w--instance-
+ nat_prerouting_chain=w--prerouting
+ nat_postrouting_chain=w--postrouting
+ nat_instance_prefix=w--instance-
+ iptables_bin=/var/vcap/packages/iptables/sbin/iptables
+ case "${ACTION}" in
+ setup_filter
+ teardown_filter
+ teardown_deprecated_rules
++ /var/vcap/packages/iptables/sbin/iptables -w -S INPUT
+ rules='-P INPUT ACCEPT
-A INPUT -i w+ -j w--input'
+ echo '-P INPUT ACCEPT
-A INPUT -i w+ -j w--input'
+ sed -e s/-A/-D/ -e 's/\s\+$//'
+ xargs --no-run-if-empty --max-lines=1 /var/vcap/packages/iptables/sbin/iptables -w
+ grep ' -j garden-dispatch'
++ /var/vcap/packages/iptables/sbin/iptables -w -S FORWARD
+ rules='-P FORWARD ACCEPT
-A FORWARD -i w+ -j w--forward'
+ echo '-P FORWARD ACCEPT
-A FORWARD -i w+ -j w--forward'
+ sed -e s/-A/-D/ -e 's/\s\+$//'
+ grep ' -j garden-dispatch'
+ xargs --no-run-if-empty --max-lines=1 /var/vcap/packages/iptables/sbin/iptables -w
+ /var/vcap/packages/iptables/sbin/iptables -w -F garden-dispatch
+ true
+ /var/vcap/packages/iptables/sbin/iptables -w -X garden-dispatch
+ true
++ /var/vcap/packages/iptables/sbin/iptables -w -S w--forward
+ rules='-N w--forward
-A w--forward -i eth0 -j ACCEPT
-A w--forward -j DROP'
+ echo '-N w--forward
-A w--forward -i eth0 -j ACCEPT
-A w--forward -j DROP'
+ xargs --no-run-if-empty --max-lines=1 /var/vcap/packages/iptables/sbin/iptables -w
+ sed -e s/-A/-D/ -e 's/\s\+$//'
+ grep '\-g w--instance-'
++ /var/vcap/packages/iptables/sbin/iptables -w -S
+ rules='-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-N w--default
-N w--forward
-N w--input
-N w--instance-gfulchosvqj
-N w--instance-gfulchosvqj-log
-A INPUT -i w+ -j w--input
-A FORWARD -i w+ -j w--forward
-A w--default -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A w--default -j REJECT --reject-with icmp-port-unreachable
-A w--forward -i eth0 -j ACCEPT
-A w--forward -j DROP
-A w--input -i eth0 -j ACCEPT
-A w--input -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A w--input -j REJECT --reject-with icmp-host-prohibited
-A w--instance-gfulchosvqj -p icmp -m iprange --dst-range 10.53.248.0-10.53.251.255 -m icmp --icmp-type any -m comment --comment 56d562a2-cabd-4def-6340-55dd -j RETURN'
+ echo '-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-N w--default
-N w--forward
-N w--input
-N w--instance-gfulchosvqj
-N w--instance-gfulchosvqj-log
-A INPUT -i w+ -j w--input
-A FORWARD -i w+ -j w--forward
-A w--default -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A w--default -j REJECT --reject-with icmp-port-unreachable
-A w--forward -i eth0 -j ACCEPT
-A w--forward -j DROP
-A w--input -i eth0 -j ACCEPT
-A w--input -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A w--input -j REJECT --reject-with icmp-host-prohibited
-A w--instance-gfulchosvqj -p icmp -m iprange --dst-range 10.53.248.0-10.53.251.255 -m icmp --icmp-type any -m comment --comment 56d562a2-cabd-4def-6340-55dd -j RETURN'
+ xargs --no-run-if-empty --max-lines=1 /var/vcap/packages/iptables/sbin/iptables -w
+ sed -e s/-A/-D/ -e 's/\s\+$//'
+ grep '^-A w--instance-'
iptables: Bad rule (does a matching rule exist in that chain?).
Deploy CF
Create a running security group with an icmp type/code = -1
Deploy an application
Change some cell config and redeploy the cell
See that Garden fails to restart
Garden RunC Release 1.5.0
Ubuntu Stemcell 3363.20
The diagram here refers to an imaginary component called "Ducati."
Ducati is in the attic. We don't talk about Ducati anymore.
The new thing is CF Networking Release which has a garden external networker that allows any CNI plugin to be a garden networker.
Let's update the diagram!
Please try the following before submitting the issue:
Running Concourse jobs were reporting "insufficient subnets remaining in the pool". We were unable to match this error message with a spike in container counts as reported by fly workers. It generally came up when load was high on the system, but it wasn't strictly correlated with metrics like CPU load either.
We went onto the worker VM and ran gaol_linux list. It reported the same number of containers as fly workers. We looked at the iptables -S output on the worker, and saw many more rules than active containers. In fact, we saw almost exactly 250. Logs attached.
We do not set the following parameters in our manifest, so we expect that the defaults were still:
bosh recreate worker worked to reset the iptables; then we had a TSA registration issue (probably not a garden problem) and bosh restart web fixed that.
Is bosh recreate worker necessary any time one upgrades the runC version? Indeed, when runC goes down it should try to keep the containers around; but if it comes back up running a new version of the software, does it leave the existing iptables rules around?
We had been running Concourse 1.6 with runC 0.4.0 for a long time. We were running it hard, such that the error message "insufficient subnets remaining in the pool" usually signified actually hitting the container limit of 250. We bumped from 1.6 + 0.4.0 straight to Concourse 2.2.1 with runC 0.8.0.
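For context, here is a back-of-envelope check of the pool arithmetic (a sketch assuming garden's defaults; the /22 pool and /30 per-container subnets are assumptions, not confirmed from our config):

```go
package main

import "fmt"

// With a /22 network pool and a /30 subnet per container, the pool holds
// 2^(30-22) = 256 subnets. If leaked iptables chains pin ~250 of them,
// "insufficient subnets remaining in the pool" fires even while far fewer
// containers are actually running, which matches what we observed.
func main() {
	poolBits, subnetBits := 22, 30
	fmt.Println("subnets in pool:", 1<<(subnetBits-poolBits)) // 256
}
```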
The iptables -S output on a worker VM when it wasn't working (reporting many more rules than containers were supposedly running)
Logs from the garden job (tgz as provided by bosh logs)
https://drive.google.com/file/d/0B2izPb3rj-6AeHZzejhaaXZvMFk/view?usp=sharing
Pretty elaborate; we didn't get a simple reproduction.
fly workers reports a number of containers less than 250.
Can I use OCI hooks with garden?
Hello,
even though we have set the property graph_cleanup_threshold_in_mb in the diego manifest to 20480, the cleanup process doesn't seem to work properly. The following is the output of the ephemeral disk usage of one of our cells:
Instance Process State AZ IPs VM CID VM Type Uptime Load CPU CPU CPU CPU Memory Swap System Ephemeral Persistent
(1m, 5m, 15m) Total User Sys Wait Usage Usage Disk Usage Disk Usage Disk Usage
cell_z2/d1677fee-37a2-4dcd-b988-039164fa7362 running z2 10.0.137.29 i-00359a04afbe13deb vm_4cpu_16gb - 6.89, 6.82, 6.52 - 49.4% 28.9% 0.0% 35% (5.7 GB) 0% (6.1 MB) 39% (31i%) 92% (23i%) 39% (31i%)
ps aux | grep garden | grep graph from one of our cells:
root 3219545 1.4 0.3 3040368 61208 ? S<l 07:49 4:24 /var/vcap/packages/guardian/bin/gdn server --skip-setup --bind-ip=0.0.0.0 --bind-port=7777 --depot=/var/vcap/data/garden/depot --graph=/var/vcap/data/garden/graph --properties-path=/var/vcap/data/garden/props.json --port-pool-properties-path=/var/vcap/data/garden/port-pool-props.json --iptables-bin=/var/vcap/packages/iptables/sbin/iptables --iptables-restore-bin=/var/vcap/packages/iptables/sbin/iptables-restore --init-bin=/var/vcap/packages/guardian/bin/init --dadoo-bin=/var/vcap/packages/guardian/bin/dadoo --nstar-bin=/var/vcap/packages/guardian/bin/nstar --tar-bin=/var/vcap/packages/tar/tar --newuidmap-bin=/var/vcap/packages/garden-idmapper/bin/newuidmap --newgidmap-bin=/var/vcap/packages/garden-idmapper/bin/newgidmap --log-level=error --mtu=0 --network-pool=10.239.0.0/22 --deny-network=0.0.0.0/0 --destroy-containers-on-startup --debug-bind-ip=127.0.0.1 --debug-bind-port=17005 --default-rootfs=/var/vcap/packages/busybox --default-grace-time=0 --default-container-blockio-weight=0 --graph-cleanup-threshold-in-megabytes=20480 --max-containers=250 --cpu-quota-per-share=0 --tcp-memory-limit=0 --runtime-plugin=/var/vcap/packages/runc/bin/runc --persistent-image=/var/vcap/packages/cflinuxfs2/rootfs --apparmor=garden-default --cleanup-process-dirs-on-wait
du -hcs /var/vcap/data/garden/graph/aufs/diff
67G total
We recently upgraded to PCF 1.12.19, and a (prometheus) docker image that used to work is now failing with file permission issues
cf push some-app -o prom/prometheus
2018-05-15T09:07:09.16-0400 [CELL/0] OUT Creating container
2018-05-15T09:07:09.39-0400 [CELL/0] OUT Successfully destroyed container
2018-05-15T09:07:09.76-0400 [CELL/0] OUT Successfully created container
2018-05-15T09:07:10.01-0400 [CELL/0] OUT Starting health monitoring of container
2018-05-15T09:07:10.18-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.179828917Z caller=main.go:220 msg="Starting Prometheus" version="(version=2.2.1, branch=HEAD, revision=bc6058c81272a8d938c05e75607371284236aadc)"
2018-05-15T09:07:10.18-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.179928593Z caller=main.go:221 build_context="(go=go1.10, user=root@149e5b3f0829, date=20180314-14:15:45)"
2018-05-15T09:07:10.18-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.179978934Z caller=main.go:222 host_details="(Linux 4.4.0-119-generic #143~14.04.1-Ubuntu SMP Mon Apr 2 18:04:36 UTC 2018 x86_64 b939328f-96c4-4131-61b3-d721 (none))"
2018-05-15T09:07:10.18-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.180063828Z caller=main.go:223 fd_limits="(soft=16384, hard=16384)"
2018-05-15T09:07:10.18-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.189856754Z caller=main.go:504 msg="Starting TSDB ..."
2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.19017001Z caller=web.go:382 component=web msg="Start listening for connections" address=0.0.0.0:9090
2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.190173907Z caller=main.go:398 msg="Stopping scrape discovery manager..."
2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.190233993Z caller=main.go:411 msg="Stopping notify discovery manager..."
2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.190271771Z caller=main.go:432 msg="Stopping scrape manager..."
2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.190361913Z caller=manager.go:460 component="rule manager" msg="Stopping rule manager..."
2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.190478723Z caller=manager.go:466 component="rule manager" msg="Rule manager stopped"
2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.190605644Z caller=notifier.go:512 component=notifier msg="Stopping notification manager..."
2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.190643211Z caller=main.go:407 msg="Notify discovery manager stopped"
2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.190861107Z caller=main.go:394 msg="Scrape discovery manager stopped"
2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.191290319Z caller=main.go:426 msg="Scrape manager stopped"
2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.191310256Z caller=main.go:573 msg="Notifier manager stopped"
2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=error ts=2018-05-15T13:07:10.194150223Z caller=main.go:582 err="Opening storage failed open DB in /prometheus: open /prometheus/518918001: permission denied"
2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.194206722Z caller=main.go:584 msg="See you next time!"
2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] OUT Exit status 0
2018-05-15T09:07:10.19-0400 [CELL/0] OUT Exit status 0
2018-05-15T09:07:10.21-0400 [CELL/0] OUT Stopping instance b939328f-96c4-4131-61b3-d721
2018-05-15T09:07:10.21-0400 [CELL/0] OUT Destroying container
2018-05-15T09:07:10.23-0400 [API/1] OUT Process has crashed with type: "web"
2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=error ts=2018-05-15T13:07:10.194150223Z caller=main.go:582 err="Opening storage failed open DB in /prometheus: open /prometheus/518918001: permission denied"
prom/prometheus
unsure, this worked before the PCF upgrade and fails afterwards
???
Hi,
An error occurred while installing Cloud Foundry / Diego on CentOS.
The versions I have used for the releases and BOSH are listed below.
+++++++++++++++++++++++++++++++++++
BOSH-CLI v2
BOSH-RELEASE 264.7.0
CF-RELEASE 287
DIEGO-RELEASE 1.34.0
GARDEN-RUNC 1.11.0
CFLINUX-RELEASE 1.185.0
+++++++++++++++++++++++++++++++++++++
And I used stemcell image v3468.21 in the CentOS environment.
These settings have worked on Ubuntu, but apparently not on CentOS.
When errors occurred while installing Cloud Foundry, I could solve them, as I had access to the logs and the script files could be edited.
However, I have no idea how to deal with the Diego installation issues.
My major questions are:
With a normal installation, it throws an error saying "apparmor_parser: command not found".
If I install the package directly on the Diego cell VM and execute apparmor_parser -r "/var/vcap/jobs/garden/config/garden-default", it responds with "no cache value on read/write".
rep throws an error
#######################
Failed: 'cell_z1/d30f7487-f7ea-42fa-b9a6-a6af4a9f6c6b (0)' is not running after update. Review logs for failed jobs: rep (00:04:21)
Error 400007: 'cell_z1/d30f7487-f7ea-42fa-b9a6-a6af4a9f6c6b (0)' is not running after update. Review logs for failed jobs: rep
#######################
Log as below is found at rep_ctl.log
#######################
Not setting /proc/sys/net/ipv4 parameters, since I'm running inside a linux container
#######################
Log as below is found at garden.stdout.log
#######################
{"timestamp":"1528181050.931253672","source":"guardian","message":"guardian.create.create-failed-cleaningup.volumizer.image-plugin-destroy.start","log_level":0,"data":{"cause":"runc run: exit status 1: container_linux.go:348: starting container process caused "process_linux.go:301: running exec setns process for init caused \"exit status 40\""\n","handle":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","session":"476.4.2.1"}}
{"timestamp":"1528181050.945348501","source":"guardian","message":"guardian.create.create-failed-cleaningup.volumizer.image-plugin-destroy.grootfs.delete.groot-deleting.starting","log_level":1,"data":{"cause":"runc run: exit status 1: container_linux.go:348: starting container process caused "process_linux.go:301: running exec setns process for init caused \"exit status 40\""\n","handle":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","imageID":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","original_timestamp":"2018-06-05T06:44:10.945072896Z","session":"476.4.2.1"}}
{"timestamp":"1528181050.945517778","source":"guardian","message":"guardian.create.create-failed-cleaningup.volumizer.image-plugin-destroy.grootfs.delete.groot-deleting.deleting-image.starting","log_level":1,"data":{"cause":"runc run: exit status 1: container_linux.go:348: starting container process caused "process_linux.go:301: running exec setns process for init caused \"exit status 40\""\n","handle":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","id":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","imageID":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","original_timestamp":"2018-06-05T06:44:10.94520064Z","session":"476.4.2.1","storePath":"/var/vcap/data/grootfs/store/unprivileged"}}
{"timestamp":"1528181050.945569754","source":"guardian","message":"guardian.create.create-failed-cleaningup.volumizer.image-plugin-destroy.grootfs.delete.groot-deleting.deleting-image.overlayxfs-destroying-image.starting","log_level":1,"data":{"cause":"runc run: exit status 1: container_linux.go:348: starting container process caused "process_linux.go:301: running exec setns process for init caused \"exit status 40\""\n","handle":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","id":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","imageID":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","imagePath":"/var/vcap/data/grootfs/store/unprivileged/images/executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","original_timestamp":"2018-06-05T06:44:10.945269248Z","session":"476.4.2.1","storePath":"/var/vcap/data/grootfs/store/unprivileged"}}
{"timestamp":"1528181050.953394413","source":"guardian","message":"guardian.create.create-failed-cleaningup.volumizer.image-plugin-destroy.grootfs.delete.groot-deleting.deleting-image.overlayxfs-destroying-image.ending","log_level":1,"data":{"cause":"runc run: exit status 1: container_linux.go:348: starting container process caused "process_linux.go:301: running exec setns process for init caused \"exit status 40\""\n","handle":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","id":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","imageID":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","imagePath":"/var/vcap/data/grootfs/store/unprivileged/images/executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","original_timestamp":"2018-06-05T06:44:10.953026048Z","session":"476.4.2.1","storePath":"/var/vcap/data/grootfs/store/unprivileged"}}
{"timestamp":"1528181050.953512192","source":"guardian","message":"guardian.create.create-failed-cleaningup.volumizer.image-plugin-destroy.grootfs.delete.groot-deleting.deleting-image.ending","log_level":1,"data":{"cause":"runc run: exit status 1: container_linux.go:348: starting container process caused "process_linux.go:301: running exec setns process for init caused \"exit status 40\""\n","handle":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","id":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","imageID":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","original_timestamp":"2018-06-05T06:44:10.953161728Z","session":"476.4.2.1","storePath":"/var/vcap/data/grootfs/store/unprivileged"}}
#######################
Is anyone familiar with this issue? Thank you in advance.
Hi there! Thanks for taking time to open an Issue.
Please fill out as much of the following information as you can.
Pushing an app with a docker image fails if the default disk quota (1G) is used, since the docker image size is about 1.3G.
[root@mdw ~]# CF_DOCKER_PASSWORD=xxxx cf push img1 --docker-username xxxx --docker-image scottgai/sgai-repo:img1 scottgai -t 600
2018-06-25T23:29:06.20-0400 [CELL/0] OUT Cell 21c1ae94-0f86-4dc5-83ec-7b76dd2dba58 creating container for instance da9af175-a04a-4079-49f7-89d5
2018-06-25T23:29:07.88-0400 [CELL/0] ERR Cell 21c1ae94-0f86-4dc5-83ec-7b76dd2dba58 failed to create container for instance da9af175-a04a-4079-49f7-89d5: running image plugin create: making image: creating image: applying disk limits: disk limit is smaller than volume size
2018-06-25T23:29:07.88-0400 [CELL/0] ERR : exit status 1
2018-06-25T23:29:07.88-0400 [CELL/0] OUT Cell 21c1ae94-0f86-4dc5-83ec-7b76dd2dba58 destroying container for instance da9af175-a04a-4079-49f7-89d5
2018-06-25T23:29:07.88-0400 [CELL/0] OUT Cell 21c1ae94-0f86-4dc5-83ec-7b76dd2dba58 successfully destroyed container for instance da9af175-a04a-4079-49f7-89d5
2018-06-25T23:29:07.90-0400 [API/0] OUT Process has crashed with type: "web"
2018-06-25T23:29:07.90-0400 [API/0] OUT App instance exited with guid 6b783e7e-4a21-4d52-8a6b-a99b7147ed22 payload: {"instance"=>"da9af175-a04a-4079-49f7-89d5", "index"=>0, "reason"=>"CRASHED", "exit_description"=>"failed to initialize container: running image plugin create: making image: creating image: applying disk limits: disk limit is smaller than volume size\n: exit status 1", "crash_count"=>3, "crash_timestamp"=>1529983747883330324, "version"=>"a6bc244b-30cb-49f5-8f22-0778392d1b73"}
cf push with disk quota 2G was successful. However, cf app <app name> only shows a very small disk usage value ("52K of 2G" in the following test; it seems to show only the platform overhead, excluding the docker image size).
[root@mdw ~]# CF_DOCKER_PASSWORD=xxxx cf push img1 --docker-username xxxx --docker-image scottgai/sgai-repo:img1 scottgai -t 600 -k 2G
Using docker repository password from environment variable CF_DOCKER_PASSWORD.
Pushing app img1 to org org1 / space dev as admin...
Getting app info...
Updating app with these attributes...
name: img1
docker image: scottgai/sgai-repo:img1
docker username: scottgai
command: /docker-entrypoint.sh httpd -k start -D FOREGROUND
- disk quota: 1G
+ disk quota: 2G
health check timeout: 600
health check type: port
instances: 1
memory: 1G
stack: cflinuxfs2
routes:
img1.apps-03.haas-50.pez.pivotal.io
Updating app img1...
Mapping routes...
Stopping app...
Waiting for app to start...
name: img1
requested state: started
instances: 1/1
usage: 1G x 1 instances
routes: img1.apps-03.haas-50.pez.pivotal.io
last uploaded: Mon 25 Jun 23:27:32 EDT 2018
stack: cflinuxfs2
docker image: scottgai/sgai-repo:img1
start command: /docker-entrypoint.sh httpd -k start -D FOREGROUND
state since cpu memory disk details
#0 running 2018-06-26T03:48:35Z 0.0% 25M of 1G 52K of 2G
Is the disk usage being displayed correctly?
If known, provide the resolution to the issue here.
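A hedged reading of the two numbers (illustrative Go; the names are ours, not grootfs's API): the create-time check compares the requested quota against the total base image size, while the usage reported back to cf app appears to be only the writable layer's exclusive usage:

```go
package main

import "fmt"

// Two different quantities are in play. The create-time check rejects a
// quota smaller than the base docker image ("disk limit is smaller than
// volume size"), while the usage figure reported to cf app seems to count
// only the container's writable layer, which is tiny right after start.
type imageStats struct {
	baseImageBytes int64 // read-only docker layers (~1.3G here)
	exclusiveBytes int64 // container's writable layer
}

func createAllowed(quotaBytes int64, s imageStats) bool {
	return quotaBytes >= s.baseImageBytes
}

func main() {
	s := imageStats{baseImageBytes: 1300 << 20, exclusiveBytes: 52 << 10}
	fmt.Println(createAllowed(1<<30, s)) // false: 1G quota < 1.3G image
	fmt.Println(createAllowed(2<<30, s)) // true with -k 2G
	fmt.Printf("reported usage: %dK\n", s.exclusiveBytes>>10)
}
```

If that reading is right, "52K of 2G" right after start is expected: it counts only the container's writes, not the 1.3G of read-only docker layers.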
Thanks to @williammartin from post opencontainers/runc#1552: runC can optionally run in host network mode (just like docker-engine's --network="host"), as @cyphar describes.
But per @williammartin, Garden doesn't expose this capability.
As a Concourse user, I think it's typical to run a container in the host namespace, so as to test external devices/systems via the physical machine's NIC.
I believe Garden should provide this capability; Concourse could then leverage this new power.
Concourse issue: concourse/concourse#1455
This is not necessarily an issue, just an observation. We experienced that ps aux hangs on a cell. This happens when accessing information about a java app process.
1.15.1
bosh-aws-xen-hvm-ubuntu-trusty-go_agent/3586.36
4.4.0-133-generic
We didn't try to reproduce the issue yet.
strace output
strace cat /proc/3981001/environ
execve("/bin/cat", ["cat", "/proc/3981001/environ"], [/* 25 vars */]) = 0
brk(0) = 0xe36000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=23920, ...}) = 0
mmap(NULL, 23920, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0d18bac000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P \2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1857312, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0d18bab000
mmap(NULL, 3965632, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0d185c7000
mprotect(0x7f0d18785000, 2097152, PROT_NONE) = 0
mmap(0x7f0d18985000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1be000) = 0x7f0d18985000
mmap(0x7f0d1898b000, 17088, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0d1898b000
close(3) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0d18ba9000
arch_prctl(ARCH_SET_FS, 0x7f0d18ba9740) = 0
mprotect(0x7f0d18985000, 16384, PROT_READ) = 0
mprotect(0x60a000, 4096, PROT_READ) = 0
mprotect(0x7f0d18bb2000, 4096, PROT_READ) = 0
munmap(0x7f0d18bac000, 23920) = 0
brk(0) = 0xe36000
brk(0xe57000) = 0xe57000
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=1607664, ...}) = 0
mmap(NULL, 1607664, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0d18a20000
close(3) = 0
fstat(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 26), ...}) = 0
open("/proc/3981001/environ", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0400, st_size=0, ...}) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
read(3,
cat /proc/3981001/status
Name: java
State: S (sleeping)
Tgid: 3981001
Ngid: 0
Pid: 3981001
PPid: 3980961
TracerPid: 0
Uid: 2000 2000 2000 2000
Gid: 2000 2000 2000 2000
FDSize: 4096
Groups:
NStgid: 3981001 9
NSpid: 3981001 9
NSpgid: 3981001 9
NSsid: 3981001 9
VmPeak: 1153380 kB
VmSize: 1076732 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 309756 kB
VmRSS: 298700 kB
VmData: 961080 kB
VmStk: 136 kB
VmExe: 4 kB
VmLib: 26380 kB
VmPTE: 872 kB
VmPMD: 24 kB
VmSwap: 11096 kB
HugetlbPages: 0 kB
Threads: 37
SigQ: 0/128591
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 2000000181005ccf
CapInh: 00000000a80425fb
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
Seccomp: 2
Speculation_Store_Bypass: vulnerable
Cpus_allowed: 00ff
Cpus_allowed_list: 0-7
Mems_allowed: 00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 55
nonvoluntary_ctxt_switches: 15
The process is in state S
Might be related:
http://rachelbythebay.com/w/2014/10/27/ps/
https://news.ycombinator.com/item?id=8520045
https://www.pivotaltracker.com/n/projects/1158420/stories/133660171
The process is also constantly consuming 1G of memory (probably the app's limit). This is not necessarily an issue, since it's a java app:
3981001 cvcap 10 -10 1076732 298696 21848 S 23.9 0.9 186:41.76 java
The cpu usage on the cell is quite low
%Cpu(s): 1.3 us, 4.0 sy, 0.0 ni, 94.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.1 st
The load is very high (might be unrelated):
load average: 36.16, 36.33, 36.13
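Until the root cause is addressed, reads of /proc/<pid>/environ can be wrapped defensively. A generic workaround sketch (not a Garden feature): do the read in a goroutine with a timeout, since per the links above the read can block while the kernel waits on the target's mmap semaphore:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// readEnvironWithTimeout reads /proc/<pid>/environ in a goroutine and
// gives up after a timeout. Note this only bounds the caller's wait: the
// blocked read itself may linger until the kernel-side wait resolves.
func readEnvironWithTimeout(pid int, timeout time.Duration) ([]byte, error) {
	type result struct {
		data []byte
		err  error
	}
	ch := make(chan result, 1)
	go func() {
		data, err := os.ReadFile(fmt.Sprintf("/proc/%d/environ", pid))
		ch <- result{data, err}
	}()
	select {
	case r := <-ch:
		return r.data, r.err
	case <-time.After(timeout):
		return nil, fmt.Errorf("reading environ for pid %d timed out", pid)
	}
}

func main() {
	if data, err := readEnvironWithTimeout(os.Getpid(), time.Second); err == nil {
		fmt.Printf("read %d bytes\n", len(data))
	}
}
```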
Hi there,
I'm running
./scripts/test
against my local Concourse Lite, and seeing some test failures in GQT:
Summarizing 4 Failures:
[Fail] Run running a process [It] with an absolute path
/tmp/build/7543102e-d686-49fe-615d-24086572a169/guardian-release/src/github.com/cloudfoundry-incubator/guardian/gqt/run_test.go:62
[Fail] Net a second container [It] should have the next IP address
/tmp/build/7543102e-d686-49fe-615d-24086572a169/guardian-release/src/github.com/cloudfoundry-incubator/guardian/gqt/net_test.go:98
[Fail] Net [It] should have a loopback interface
/tmp/build/7543102e-d686-49fe-615d-24086572a169/guardian-release/src/github.com/cloudfoundry-incubator/guardian/gqt/net_test.go:45
[Fail] Net [It] should have a (dynamically assigned) IP address
/tmp/build/7543102e-d686-49fe-615d-24086572a169/guardian-release/src/github.com/cloudfoundry-incubator/guardian/gqt/net_test.go:60
Ran 19 of 19 Specs in 17.812 seconds
FAIL! -- 15 Passed | 4 Failed | 0 Pending | 0 Skipped
Any tips on how I can get the tests passing?
Thank you.
cc @zachgersh
In our CF deployment, we experienced a situation where one app instance was able to dramatically slow down the network traffic on a diego cell.
During the investigation, we observed that the app was using all the available TCP memory on the cell.
In /var/log/kern.log there were "TCP: out of memory" errors.
Through tcpdump we observed that the TCP window was small. New connections could be created, but carried no traffic due to the TCP window.
When we stopped the app instance everything was back to normal.
TCP memory was monitored through the /proc/net/sockstat file.
Interestingly, due to a not-well-written app, we observed high TCP memory usage without seeing a large number of TCP connections.
We are currently mitigating the issue by increasing the TCP memory and by restarting the cell (or the app).
cat /proc/sys/net/ipv4/tcp_mem shows the actual hard limit we have per cell, which currently is 20% of the total memory.
It would be great to limit how much TCP memory a container can use, as well as to have some per-container statistics.
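For monitoring in the meantime, here is a minimal sketch of what we did by hand (generic Go, reading the same proc files named above; values are in pages, typically 4 KiB):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Pull the kernel's TCP page usage from /proc/net/sockstat (the trailing
// "mem <pages>" on the TCP: line) and the hard limit (third field of
// /proc/sys/net/ipv4/tcp_mem). This is host-wide; per-container TCP
// memory accounting is exactly what this issue asks Garden for.
func main() {
	sockstat, _ := os.ReadFile("/proc/net/sockstat")
	var tcpPages int
	for _, line := range strings.Split(string(sockstat), "\n") {
		if strings.HasPrefix(line, "TCP:") {
			f := strings.Fields(line)
			tcpPages, _ = strconv.Atoi(f[len(f)-1]) // "... mem <pages>"
		}
	}
	tcpMem, _ := os.ReadFile("/proc/sys/net/ipv4/tcp_mem")
	limits := strings.Fields(string(tcpMem)) // min pressure max
	max, _ := strconv.Atoi(limits[2])
	fmt.Printf("tcp mem: %d pages used of %d (hard limit)\n", tcpPages, max)
}
```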
We are unable to provide a script to configure additional DNS servers in 1.12 and above. We do this because we run an additional DNS server on the cells for some additional requirements; it listens on the private IP and not on localhost (which consul listens on, or bosh-dns in the near future).
Provide the following in the DNS entry:
dns_servers:
- $(ifconfig eth0|grep -oP 'inet addr:\K([0-9.]+)')
Error in garden.stderr.log ...
/var/vcap/jobs/garden/config/config.ini:38: invalid IP: '$(ifconfig eth0|grep -oP 'inet addr:\K([0-9.]+)')'
The change to use config.ini, instead of passing this in a script, causes the script we used to fail.
In config/blobs.yml, one of the blobs being used is busybox/busybox.tar.gz. Is there any chance someone could add a version to that blob to match the rest of the blobs in the yml? We're unable to determine which version of busybox is being used.
During an upgrade of Cloud Foundry from cf-deployment v1.8.0 to v1.9.0, the BOSH deployment failed with the following error:
Error: 'diego_cell/a8964832-a179-4496-b663-31687fbd564b (2)' is not running after update. Review logs for failed jobs: garden
The deployment succeeded after I kicked off the deployment again.
Task 147
Task 147 | 15:31:59 | Preparing deployment: Preparing deployment (00:00:23)
Task 147 | 15:33:25 | Preparing package compilation: Finding packages to compile (00:00:01)
Task 147 | 15:33:30 | Updating instance diego_cell: diego_cell/270ec524-a205-440d-a16f-d7ca514a4115 (1) (canary) (00:02:15)
Task 147 | 15:35:45 | Updating instance diego_cell: diego_cell/a8964832-a179-4496-b663-31687fbd564b (2) (00:06:19)
L Error: 'diego_cell/a8964832-a179-4496-b663-31687fbd564b (2)' is not running after update. Review logs for failed jobs: garden
Task 147 | 15:42:05 | Error: 'diego_cell/a8964832-a179-4496-b663-31687fbd564b (2)' is not running after update. Review logs for failed jobs: garden
This issue occurs when the garden BOSH job does not startup within the default 30 second monit timeout. This occurs more frequently when garden is running on slow IaaS disks (Azure seems to be particularly sensitive to this issue). You can determine if garden is hitting the timeout or not by looking in the monit.log:
$ grep garden /var/vcap/monit/monit.log
[UTC Mar 21 15:25:00] info : 'garden' start: /bin/sh
[UTC Mar 21 15:55:00] error : 'garden' failed to start
The monit timeout has been bumped to 2 minutes. This should be more than enough time for garden to start up, even on slower environments. This fix is available as of garden-runc-release v1.12.1.
After upgrading from GRR 1.11 to 1.12.21, apps started to fail with out of disk errors.
Not entirely sure, but it seems like there are two different graphs now: /var/vcap/data/grootfs/store/{unprivileged,privileged} and /var/vcap/data/garden/graph/aufs.
Both seem to have files in them.
The GrootFS one looks new; is it possible that cleanup has failed, or that the versions are conflicting somehow?
Tried deploying garden-runc-release from bosh.io but getting sha1 mismatches:
Checking whether release garden-runc/1.0.0 already exists...NO
Using remote release 'https://bosh.io/d/github.com/cloudfoundry-incubator/garden-runc-release?v=1.0.0'
Director task 42532
Started downloading remote release > Downloading remote release. Done (00:00:09)
Started verifying remote release > Verifying remote release. Failed: Release SHA1 `41a9407af7eaa412a6a2dca2cbf29f2623d4e12f' does not match the expected SHA1 `2f29b4d9eedbe6b79efe3703b68a4d3c785649f4' (00:00:01)
Error 30015: Release SHA1 `41a9407af7eaa412a6a2dca2cbf29f2623d4e12f' does not match the expected SHA1 `2f29b4d9eedbe6b79efe3703b68a4d3c785649f4'
Is this an issue with the release or with https://bosh.io/releases/github.com/cloudfoundry/garden-runc-release?version=1.0.0, or am I doing something wrong?
/cc @julz @cppforlife
When using a container with an empty /etc/ directory, the application crashes at start with a cryptic message:
[CELL/0] ERR Cell 6bdffa08-608d-4740-af08-3b289d433ec5 failed to create container for instance d52fd44a-eabb-4bdd-4360-6185: runc run: exit status 1: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"rootfs_linux.go:58: mounting \\\"/var/vcap/packages/healthcheck\\\" to rootfs \\\"/var/vcap/data/grootfs/store/unprivileged/images/d52fd44a-eabb-4bdd-4360-6185/rootfs\\\" at \\\"/var/vcap/data/grootfs/store/unprivileged/images/d52fd44a-eabb-4bdd-4360-6185/rootfs/etc/cf-assets/healthcheck\\\" caused \\\"mkdir /var/vcap/data/grootfs/store/unprivileged/images/d52fd44a-eabb-4bdd-4360-6185/rootfs/etc/cf-assets: permission denied\\\"\""
$ uname -a
Linux c763447a-1984-41d5-b2fd-586c47afe31b 4.4.0-128-generic #154~14.04.1-Ubuntu SMP Fri May 25 14:58:51 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Push a docker image with an empty /etc/ directory (e.g. keymon/empty-etc):
cf push test-container -o keymon/empty-etc
$ cf push test-container -o keymon/empty-etc
Pushing app test-container to org admin / space billing as admin...
Getting app info...
Updating app with these attributes...
name: test-container
- docker image: hector/empty-etc
+ docker image: keymon/empty-etc
disk quota: 1G
health check type: port
instances: 1
memory: 1G
stack: cflinuxfs2
routes:
test-container.hector.dev.cloudpipelineapps.digital
Updating app test-container...
Mapping routes...
Staging app and tracing logs...
Cell cde8f71a-a756-4891-b130-8f6ac230eb59 successfully created container for instance 48f69ea7-aaba-4845-b2f5-90761daa7c46
Staging...
Staging process started ...
Staging process finished
Exit status 0
Staging Complete
Cell cde8f71a-a756-4891-b130-8f6ac230eb59 stopping instance 48f69ea7-aaba-4845-b2f5-90761daa7c46
Cell cde8f71a-a756-4891-b130-8f6ac230eb59 destroying container for instance 48f69ea7-aaba-4845-b2f5-90761daa7c46
Cell cde8f71a-a756-4891-b130-8f6ac230eb59 successfully destroyed container for instance 48f69ea7-aaba-4845-b2f5-90761daa7c46
Waiting for app to start...
Start unsuccessful
TIP: use 'cf logs test-container --recent' for more information
FAILED
$ cf logs test-container --recent
....
2018-07-03T09:27:05.42+0100 [CELL/0] OUT Cell cde8f71a-a756-4891-b130-8f6ac230eb59 creating container for instance afd31154-83ac-425a-66d9-2df0
2018-07-03T09:27:07.07+0100 [CELL/0] ERR Cell cde8f71a-a756-4891-b130-8f6ac230eb59 failed to create container for instance afd31154-83ac-425a-66d9-2df0: runc run: exit status 1: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"rootfs_linux.go:58: mounting \\\"/var/vcap/packages/healthcheck\\\" to rootfs \\\"/var/vcap/data/grootfs/store/unprivileged/images/afd31154-83ac-425a-66d9-2df0/rootfs\\\" at \\\"/var/vcap/data/grootfs/store/unprivileged/images/afd31154-83ac-425a-66d9-2df0/rootfs/etc/cf-assets/healthcheck\\\" caused \\\"mkdir /var/vcap/data/grootfs/store/unprivileged/images/afd31154-83ac-425a-66d9-2df0/rootfs/etc/cf-assets: permission denied\\\"\""
2018-07-03T09:27:07.08+0100 [CELL/0] OUT Cell cde8f71a-a756-4891-b130-8f6ac230eb59 destroying container for instance afd31154-83ac-425a-66d9-2df0
2018-07-03T09:27:07.08+0100 [CELL/0] OUT Cell cde8f71a-a756-4891-b130-8f6ac230eb59 successfully destroyed container for instance afd31154-83ac-425a-66d9-2df0
2018-07-03T09:27:07.11+0100 [API/0] OUT Process has crashed with type: "web"
2018-07-03T09:27:07.11+0100 [API/0] OUT App instance exited with guid 87132572-9f9c-42d3-afdf-258137144597 payload: {"instance"=>"afd31154-83ac-425a-66d9-2df0", "index"=>0, "reason"=>"CRASHED", "exit_description"=>"failed to initialize container: runc run: exit status 1: container_linux.go:348: starting container process caused \"process_linux.go:402: container init caused \\\"rootfs_linux.go:58: mounting \\\\\\\"/var/vcap/packages/healthcheck\\\\\\\" to rootfs \\\\\\\"/var/vcap/data/grootfs/store/unprivileged/images/afd31154-83ac-425a-66d9-2df0/rootfs\\\\\\\" at \\\\\\\"/var/vcap/data/grootfs/store/unprivileged/images/afd31154-83ac-425a-66d9-2df0/rootfs/etc/cf-assets/healthcheck\\\\\\\" caused \\\\\\\"mkdir /var/vcap/data/grootfs/store/unprivileged/images/afd31154-83ac-425a-66d9-2df0/rootfs/etc/cf-assets: permission denied\\\\\\\"\\\"\"\n", "crash_count"=>7, "crash_timestamp"=>1530606427101357593, "version"=>"a0f025d4-e7cf-4b18-8093-5196e1569518"}
keymon/empty-etc
It might be an issue with the layered fs, where an empty directory does not allow mounting a filesystem, or gets created with the wrong permissions.
If instead you drop any file into /etc/, it does not throw that error, but one complaining about the missing /etc/passwd (which is fine).
FROM cbaines/govuk-mini-environment-admin:release_3-test2
RUN touch /etc/foo
unable to find user root: no matching entries in passwd file
Then if we add a valid /etc/passwd the image works.
On a re-deploy, a diego-cell took 51 minutes to come back up. The rep left quite a few containers lying around after evacuation. On start-up of the cell after the update, we saw that Garden deleted pea containers but not the orphaned application containers.
We expected garden to have gone into a cleanup loop to delete and destroy all the orphaned containers on the cell.
We know that destroy_containers_on_start is set to true, so we do expect to see log lines related to: https://github.com/cloudfoundry/guardian/blob/6101577ecf3aa2a29e109a2a1380314cacc7df37/gardener/gardener.go#L479 but we do not.
However, we see that garden did not delete any containers, and the slow startup time was due to the rep's own cleanup loop running.
We are puzzled: how can we end up in a state where garden is not able to immediately find the orphaned containers, but later, upon the rep's request, reports a number of orphaned containers?
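For reference, before the log highlights below, this is the shape of the startup cleanup we expected, loosely modeled on the linked gardener.go (illustrative names only, not the actual source):

```go
package sketch

// destroyOrphansOnStartup is what we expected with
// destroy_containers_on_start enabled: list every container the runtime
// still knows about at startup and destroy each one before the server
// starts accepting work. A failure here would show up in the logs, and
// we see neither the destroys nor any failures.
func destroyOrphansOnStartup(listHandles func() ([]string, error), destroy func(string) error) error {
	handles, err := listHandles()
	if err != nil {
		return err
	}
	for _, h := range handles {
		if err := destroy(h); err != nil {
			return err
		}
	}
	return nil
}
```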
[...]
Some log highlights:
{"timestamp":"1529914162.323287249","source":"guardian","message":"guardian.start.starting","log_level":1,"data":{"session":"4"}}
{"timestamp":"1529914162.323378801","source":"guardian","message":"guardian.start.clean-all-peas.start","log_level":1,"data":{"session":"4.1"}}
[...]
{"timestamp":"1529914162.389882088","source":"guardian","message":"guardian.start.completed","log_level":1,"data":{"session":"4"}}
[...]
{"timestamp":"1529914162.395313263","source":"guardian","message":"guardian.list-containers.starting","log_level":1,"data":{"session":"5"}}
[...]
{"timestamp":"1529914162.399815083","source":"guardian","message":"guardian.list-containers.finished","log_level":1,"data":{"session":"5"}}
[...]
{"timestamp":"1529914162.427858829","source":"guardian","message":"guardian.started","log_level":1,"data":{"addr":"/var/vcap/data/garden/garden.sock","network":"unix"}}
[...]
{"timestamp":"1529914164.685106754","source":"guardian","message":"guardian.list-containers.starting","log_level":1,"data":{"session":"6"}}
{"timestamp":"1529914164.685930967","source":"guardian","message":"guardian.list-containers.finished","log_level":1,"data":{"session":"6"}}
{"timestamp":"1529914164.686795473","source":"guardian","message":"guardian.destroy.start","log_level":1,"data":{"handle":"0016fb0d-1ab7-4c43-7d5a-5004","session":"7"}}
{"timestamp":"1529914164.682427406","source":"rep","message":"rep.wait-for-garden.ping-garden","log_level":1,"data":{"initialTime:":"2018-06-25T08:09:24.682211759Z","session":"1","wait-time-ns:":207110}}
{"timestamp":"1529914164.684116840","source":"rep","message":"rep.wait-for-garden.ping-garden-success","log_level":1,"data":{"initialTime:":"2018-06-25T08:09:24.682211759Z","session":"1","wait-time-ns:":1885043}}
{"timestamp":"1529914164.684298515","source":"rep","message":"rep.executor-fetching-containers-to-destroy","log_level":1,"data":{}}
{"timestamp":"1529914164.686308384","source":"rep","message":"rep.executor-fetched-containers-to-destroy","log_level":1,"data":{"num-containers":58}}
{"timestamp":"1529914164.686429977","source":"rep","message":"rep.executor-destroying-container","log_level":1,"data":{"container-handle":"0016fb0d-1ab7-4c43-7d5a-5004"}}
{"timestamp":"1529914225.345789671","source":"rep","message":"rep.executor-destroyed-stray-container","log_level":1,"data":{"handle":"0016fb0d-1ab7-4c43-7d5a-5004"}}
We can provide more logs directly via secure file exchange.
TBD... maybe resource starvation?
TBD
Trying to deploy garden-runc 1.12.1 today, we hit the following as bosh tried to render the ERB templates:
Task 2485 | 09:16:12 | Error: Unable to render instance groups for deployment. Errors are:
- Unable to render jobs for instance group 'worker'. Errors are:
- Unable to render templates for job 'garden'. Errors are:
- Error filling in template '(unknown)' (line (unknown): garden/bin/garden_start.erb:4: syntax error, unexpected ';'
-; _erbout.concat "\n#!/usr/bin...
^)
- Error filling in template '(unknown)' (line (unknown): garden/config/config.ini.erb:31: syntax error, unexpected ';'
-; _erbout.concat "\n[server]\n; apparmor\n"
^
garden/config/config.ini.erb:34: syntax error, unexpected ';'
; if apparmor_profile_provided -; _erbout.concat "\n apparmor = "
^
garden/config/config.ini.erb:36: syntax error, unexpected keyword_end, expecting end-of-input
; end -; _erbout.concat "\n\n; bin...
^)
We're guessing that the templates have begun to use ERB syntax that is only supported by recent versions of bosh. We're on v257.3. We were previously able to deploy garden-runc 1.9.0, so the breaking change must have been introduced somewhere between 1.9.0 and 1.12.1.
Attempt to deploy garden-runc 1.12.1 with bosh v257.3
We'll attempt to upgrade our bosh. But it'd be helpful if garden could publish a minimum version requirement, or maybe the ERB could be reworked to allow the garden-runc release to work on older boshes?
We have a Docker image where we delete some files during the build. Those files are still listed on the filesystem, but are inaccessible.
The same Docker image worked on 1.12.1.
We are using cf-deployment 1.28.0 at the moment and don't change any of the grootfs settings.
Dockerfile:
FROM tomcat:8.5.23-jre8-alpine
RUN rm -rf /usr/local/tomcat/webapps/*
CMD ["sleep", "1d"]
Commands to run:
bandesz:~/repos/gds/docker-test$ cf push docker-test --docker-image bandesz/tomcat-test -u process
bandesz:~/repos/gds/docker-test$ cf ssh docker-test
bash-4.3# ls -la /usr/local/tomcat/webapps
ls: /usr/local/tomcat/webapps/ROOT: No such file or directory
ls: /usr/local/tomcat/webapps/docs: No such file or directory
ls: /usr/local/tomcat/webapps/examples: No such file or directory
ls: /usr/local/tomcat/webapps/host-manager: No such file or directory
ls: /usr/local/tomcat/webapps/manager: No such file or directory
total 0
drwxr-x--- 1 root root 76 May 11 14:00 .
drwxr-xr-x 1 root root 20 May 11 14:00 ..
I checked /var/vcap/sys/log/garden but nothing relevant was logged.
We might have some sensitive information in other system logs, therefore I'm not including them.
I created one for testing: bandesz/tomcat-test
Hi there! Thanks for taking time to open an Issue.
Please fill out as much of the following information as you can.
Prior to 1.12.0, setting garden.graph_cleanup_threshold_in_mb to 0 resulted in unused layers being cleaned up routinely.
After 1.12.0, if I set grootfs.reserved_space_for_other_jobs_in_mb: -1 and {garden|grootfs}.graph_cleanup_threshold_in_mb: 0, the GC is disabled.
With this configuration, I would expect the legacy property *.graph_cleanup_threshold_in_mb to behave as before (GC routinely), since the legacy property takes precedence over reserved_space_for_other_jobs_in_mb and 0 is considered a user choice.
Deploy Garden 1.11.* with these properties set (note that reserved_space_for_other_jobs_in_mb will do nothing in 1.11.*):
properties:
garden:
graph_cleanup_threshold_in_mb: 0
grootfs:
graph_cleanup_threshold_in_mb: 0
reserved_space_for_other_jobs_in_mb: -1
After creating and destroying a few containers with different rootfs layers, observe cleanup occurring routinely.
Upgrade garden-runc-release to 1.12.0+ and observe GC never occurring.
N/A
The thresholder binary considers a 0 value of graph_cleanup_threshold_in_mb to be less important than the new property reserved_space_for_other_jobs_in_mb. It should not.
If graph_cleanup_threshold_in_mb is set to 0 and reserved_space_for_other_jobs_in_mb is set to -1, then GC should occur routinely.
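In other words, the precedence we expected looks roughly like this (an illustrative sketch, not the actual thresholder code):

```go
package thresholder

// Expected precedence: an explicitly-set legacy
// graph_cleanup_threshold_in_mb always wins over
// reserved_space_for_other_jobs_in_mb, and 0 is a deliberate
// "GC routinely" choice, not an unset value.
const gcDisabled = -1

func effectiveThresholdMB(legacyMB int64, legacySet bool, reservedMB int64) int64 {
	if legacySet {
		return legacyMB // 0 means "collect routinely", not "disabled"
	}
	if reservedMB < 0 {
		return gcDisabled
	}
	return reservedMB // approximate: derive a threshold from reserved space
}
```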
I tried to deploy Cloud Foundry, and it failed during the creation of one cell instance. When I log into the failed cell and run monit summary, I get the following output:
Process 'consul_agent' running
Process 'rep' running
Process 'garden' Execution failed
Process 'route_emitter' running
Process 'metron_agent' running
System 'system_localhost' running
In the CF deployment two other cell instances are running. One of them seems to be empty, as only system_localhost is running, but the other one works as expected. The desired number of cells for the CF deployment is 3.
I couldn't find anything special under /var/vcap/sys/log/garden.
The logs under /var/vcap/sys/log/monit show "exit status 32: mount: Structure needs cleaning".
I executed the analysis script you provided me through slack (https://raw.githubusercontent.com/cloudfoundry/garden-runc-release/develop/scripts/ordnance-survey) and found that the file /var/vcap/jobs/garden/config/config.ini your script expects is not present on the instance.
(I commented out the line with /var/vcap/jobs/garden/config/config.ini and downloaded the report; the .tgz is attached underneath.)
logs from monit and garden https://gist.github.com/gdenn/764af3b8ce92fe063168413b45931833
best,
Dennis
When I run
git clone git@github.com:cloudfoundry/garden-runc-release.git --branch v1.13.1 --recursive
I get:
...
error: no such remote ref 96c505732559caf12840d37cb86e29cbab3e0f50
Fetched in submodule path 'src/github.com/containerd/containerd', but it did not contain 96c505732559caf12840d37cb86e29cbab3e0f50. Direct fetching of that commit failed.
...
Thank you!
We are accustomed to being able to always delete-env on a bosh-lite director (even if there was a deployment that had not yet been deleted). We recently tried bumping garden-runc-release, and now when you deploy something (we used zookeeper) to it and then try to delete-env on the director, it results in this error:
Deleting deployment:
Unmounting disk 'disk-8301596d-916b-421d-6b82-9dc28c0c87f4' from VM 'vm-0fd3666e-29b0-420d-5a57-ed1b31a644cc':
Sending 'get_task' to the agent:
Agent responded with error: Action Failed get_task: Task 102a2a4c-5ffb-437d-5c0b-7ce2ebf63bc8 result: Unmounting persistent disk: Running command: 'umount /dev/sdc1', stdout: '', stderr: 'umount: /var/vcap/store/warden_cpi/persistent_bind_mounts_dir/c5bd8cca-0dfe-4589-6cda-0a52c2d80bfc: target is busy
(In some cases useful info about processes that
use the device is found by lsof(8) or fuser(1).)
': exit status 32
Exit code 1
1.9.4 to 1.16.3
4.15.0-32-generic
tests
dirrun.sh
While investigating on our own, we found it suspect that these processes were left around after running monit stop garden:
bosh/0:/var/vcap/jobs/garden/bin# ps aux | grep runc
root 765911 0.0 0.1 194132 4872 ? S<l 21:52 0:00 /var/vcap/packages/guardian/bin/dadoo -runc-root /run/runc exec /var/vcap/packages/runc/bin/runc /var/vcap/data/garden/depot/8a12412d-4342-4385-5908-adf48fec9f69/processes/0c3fbfb5-d0ce-4ef3-73d7-b768cdc56766 8a12412d-4342-4385-5908-adf48fec9f69
root 765991 0.0 0.1 130004 4920 ? S<l 21:52 0:00 /var/vcap/packages/guardian/bin/dadoo -runc-root /run/runc exec /var/vcap/packages/runc/bin/runc /var/vcap/data/garden/depot/818efc09-5ee5-4815-4492-31f627f4d39e/processes/6de425de-c164-4c08-5f98-310d2975cfd1 818efc09-5ee5-4815-4492-31f627f4d39e
root 765996 0.0 0.1 121808 4964 ? S<l 21:52 0:00 /var/vcap/packages/guardian/bin/dadoo -runc-root /run/runc exec /var/vcap/packages/runc/bin/runc /var/vcap/data/garden/depot/2f02af6a-b0bf-4d60-40f6-3067dddebe36/processes/cb1d5a8b-b6fb-419b-6e59-20704f47a144 2f02af6a-b0bf-4d60-40f6-3067dddebe36
root 766086 0.0 0.1 267608 4976 ? S<l 21:52 0:00 /var/vcap/packages/guardian/bin/dadoo -runc-root /run/runc exec /var/vcap/packages/runc/bin/runc /var/vcap/data/garden/depot/a4932af8-8250-4982-4ded-8ffb340efd43/processes/e8571269-589a-4133-463a-28c3226548fb a4932af8-8250-4982-4ded-8ffb340efd43
root 766100 0.0 0.1 130004 5068 ? S<l 21:52 0:00 /var/vcap/packages/guardian/bin/dadoo -runc-root /run/runc exec /var/vcap/packages/runc/bin/runc /var/vcap/data/garden/depot/b084bdcf-36e6-4b7b-689e-444b83564ce1/processes/9dafa865-d335-4a49-746f-5e00809c4f1f b084bdcf-36e6-4b7b-689e-444b83564ce1
root 791377 0.0 0.0 12944 1088 pts/0 S+ 21:55 0:00 grep --color=auto runc
We expected that after monit stop garden there should be no dadoo or runc processes left.
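A generic way to confirm the suspicion (the same answer fuser/lsof would give, as the umount error hints): scan /proc/<pid>/mountinfo for processes that still reference the busy mount point. A minimal sketch, with the mount path taken from the error above:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// List processes whose mount namespace still references the busy mount
// point, to check whether the leftover dadoo/runc processes are what
// keeps the persistent disk from unmounting. Needs root to read other
// processes' mountinfo.
func main() {
	target := "/var/vcap/store/warden_cpi/persistent_bind_mounts_dir"
	matches, _ := filepath.Glob("/proc/[0-9]*/mountinfo")
	for _, mi := range matches {
		data, err := os.ReadFile(mi)
		if err != nil {
			continue // process exited, or we lack permission
		}
		if strings.Contains(string(data), target) {
			pid := filepath.Base(filepath.Dir(mi))
			fmt.Println("pid still sees the mount:", pid)
		}
	}
}
```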
Please let us know if you need any more information. You can reach us in the #bosh-core-dev channel on the Cloud Foundry slack or on here.
Thanks!
The Apps Manager team has a bosh-lite running on AWS (http://apps.bosh-lite.appsman.cf-app.com/) that has an intermittent problem: hitting app URLs sometimes returns a 502.
The CAPI team has seen the same problem on their bosh-lite environment (which has been deleted).
@michaelxupiv from CAPI said:
we suspected it was a garden container issue because we tracked the traffic reaching the Diego VM, but somehow that packet was lost when trying to reach the container. It's interesting that we don't see this issue on local bosh-lite.
Here are the logs they captured on their environment before it was torn down.
502app-logs.txt
If you wish to investigate this issue, the Apps Manager team can provide credentials to the bosh lite environment.
This is the runtime that we have deployed there:
+-----------------+------------------------------------------------+-----------------------------------------------------+--------------+
| Name | Release(s) | Stemcell(s) | Cloud Config |
+-----------------+------------------------------------------------+-----------------------------------------------------+--------------+
| cf-warden | cf/250 | bosh-warden-boshlite-ubuntu-trusty-go_agent/3312.15 | none |
+-----------------+------------------------------------------------+-----------------------------------------------------+--------------+
| cf-warden-diego | cf/250 | bosh-warden-boshlite-ubuntu-trusty-go_agent/3312.15 | none |
| | cflinuxfs2-rootfs/1.45.0 | | |
| | diego/1.5.3 | | |
| | garden-runc/1.1.1 | | |
I push docker apps to CF. I was on some older version of garden-runc-release that didn't use GrootFS. I upgraded to version 1.11.0 of garden-runc-release. This also happens when upgrading to versions later than 1.11.0.
After upgrading, which resulted in using GrootFS for filesystem management, some of my docker apps started failing with the following error message, which I found in /var/vcap/sys/log/garden/garden.stdout.log:
"error":"running image plugin create: making image: creating image: applying disk limits: disk limit is smaller than volume size\n: exit status 1"
My docker images seemed to be an acceptable size before, why are they failing now?