cloudfoundry / garden-runc-release Goto Github PK

License: Apache License 2.0

Shell 32.14% Go 40.72% HTML 16.05% PowerShell 0.37% Ruby 10.73%

garden-runc-release's Issues

Should docker/docker be updated or vendored as a submodule?

Description

I'm not sure if changes have made it upstream, but src/github.com/docker/docker is currently over five months behind. Should that be updated or submoduled in vs. copied?

garden containers resolv.conf

i would like to know how does garden container updates its resolv.conf .

i add "options rotate timeout:1 retries:1" along with NameServer in garden cell /etc/resolv.conf.

Unable to trace the code which modifies the resolv.conf in garden-runc-release (1.4.0)

I am working on CF V287 deployment. I am facing a strange error in garden health check.
Following are the system/issue info:
OS: 14.04, Trusty Tahr
Kernel version: 4.4.0-105-generic
diego-release version 2.0.0
garden-runc-release version 1.11.1

Issue log:

{"timestamp":"1521155111.438632011","source":"guardian","message":"guardian.run.exec.execrunner.runc","log_level":0,"data":{"handle":"executor-healthcheck-a51bbaf6-b013-4d85-438b-3f6a1cde4b7d","id":"executor-healthcheck-a51bbaf6-b013-4d85-438b-3f6a1cde4b7d","message":"exec_ failed: container_linux.go:348: starting container process caused \"chdir to cwd (\\\"/home/vcap\\\") set in config.json failed: permission denied\"\n","path":"/bin/sh","session":"9.2.1"}}

Please help on this issue? 
Thanks in Advance. +1

Best Regards
Amit

Missing files from docker image using grootfs/overlay-xfs

Description

After pushing the grafana/grafana docker image to cloudfoundry there are missing files in the garden container.
We recently changed from garden-shed to grootfs/overlay-xfs.
We already faced a similar behaviuor in two other cases with using an earlier garden-runc version (v1.11.1), but an update to a recent version eliminated this issue there.

Environment

garden-runc 1.12.1 / grootfs/overlay-xfs
Stemcell: vsphere ubuntu trusty 3541.10 and 3468.21

Steps to reproduce

cf push grafana_fail --docker-image grafana/grafana

Logs

The garden log /var/vcap/sys/log/garden/garden.stdout.log (log_level: info)
does not hightlight an unexpected behaviour.

cf log:
2018-04-06T11:53:53.60+0000 [APP/PROC/WEB/0] OUT t=2018-04-06T11:53:53+0000 lvl=crit msg="Failed to parse /etc/grafana/grafana.ini, open /etc/grafana/grafana.ini: no such file or directory%!(EXTRA []interface {}=[])

And if looking into the app (with modified start command) the directories are there but not the files.

Error from runc: `cannot set limits on the memory cgroup, as the container has not joined it`

Description

Error from runc: cannot set limits on the memory cgroup, as the container has not joined it

Environment

garden-runc-release version: v1.12.1
IaaS: AWS
Stemcell version: ubuntu-trusty: 3541.12
Kernel version: 4.4.0-1049-aws

Steps to reproduce

We don't have repro steps yet. The error happened after a deploy:

cell rep started up at 1524090639 = 2018-04-18T22:30:39Z, after passing its garden healthcheck
cell rep took on this staging task two seconds later, at 1524090641 = 2018-04-18T22:30:41Z
container creation then failed with:

{"timestamp":"1524090641.830159187","source":"rep","message":"rep.executing-container-operation.task-processor.run-container.containerstore-create.node-create.failed-to-create-container-in-garden","log_level":2,"data":{"container-guid":"35f7ad82-9831-4e09-b7e3-9c41df9dae22","container-state":"reserved","error":"runc run: exit status 1: container_linux.go:348: starting container process caused \"process_linux.go:402: container init caused \\\"process_linux.go:367: setting cgroup config for procHooks process caused \\\\\\\"cannot set limits on the memory cgroup, as the container has not joined it\\\\\\\"\\\"\"\n","guid":"35f7ad82-9831-4e09-b7e3-9c41df9dae22","session":"20.1.3.2.1"}}

Logs

Logs and more context can be found in this tracker story. Note output of garden-ordance-survey is also included in the story comments.

Cause

Suspicion : it appears that the garden memory cgroup wasn't available at container creation time (Tom G).

Resolution

Subsequent container creations succeeded as soon as 3 seconds after this failure.

Note from @BooleanCat : It might be worth trying to repro by slamming garden with container creations while it's starting up.

runc "no space left on device" cgroup errors on concourse worker

Description

When running Concourse tasks, one of our three workers is giving the following error:

runc create: exit status 1: process_linux.go:258: applying cgroup configuration for process caused "mkdir /sys/fs/cgroup/memory/24d5d8e9-250f-4bc8-40ed-03a9d4f9ad52: no space left on device"

Here's an example of a build with the problem: https://runtime.ci.cf-app.com/teams/main/pipelines/cf-bosh-2-0/jobs/update-releases/builds/62

The vm in question is not especially full in any respect:

| VM                                              | State   | AZ | VM Type       | IPs         |         Load          | CPU  | CPU  | CPU  | Memory Usage | Swap Usage  | System     | Ephemeral  | Persistent |
|                                                 |         |    |               |             | (avg01, avg05, avg15) | User | Sys  | Wait |              |             | Disk Usage | Disk Usage | Disk Usage |
| worker/1 (56bd4970-30e6-4ae8-b4f2-546bb566561b) | running | z1 | workers       | 10.10.48.14 | 0.18, 0.46, 0.61      | 0.4% | 0.2% | 0.0% | 33% (9.8G)   | 1% (196.2M) | 59%        | 19%        | n/a        |

The only potentially "weird" configuration we've got on there is:

garden:
  max_containers: 1000
  network_pool: "10.254.0.0/20"

But, we're not really using a lot of that, as fly workers reports:

56bd4970-30e6-4ae8-b4f2-546bb566561b  172         linux     none

Suraci said this isn't a disk issue, probably isn't a concourse issue, and suggested we give 'yall a shot at it.

Setup Information

Concourse versions: 1.6 & 2.0.2
garden-runc versions: 0.4.0 & 0.6.0
bosh stemcell: 3232.6

Logging and/or test output

garden/garden.stdout.log
83381-{"timestamp":"1472766267.733082533","source":"guardian","message":"guardian.destroy.destroy.finished","log_level":1,"data":{"handle":"67234b85-750c-47fe-7b2e-921740c5a5b4","session":"998.2"}}
83382-{"timestamp":"1472766267.733111382","source":"guardian","message":"guardian.destroy.finished","log_level":1,"data":{"handle":"67234b85-750c-47fe-7b2e-921740c5a5b4","session":"998"}}
83383-{"timestamp":"1472766267.733154535","source":"guardian","message":"guardian.layer-already-deleted-skipping","log_level":1,"data":{"error":"could not find image: no such id: 37870bde9a1de1d445ed2d7862d2c17d75212ac33b1fe1810cc0401485013b8a","graphID":"67234b85-750c-47fe-7b2e-921740c5a5b4","id":"67234b85-750c-47fe-7b2e-921740c5a5b4"}}
83384-{"timestamp":"1472766267.733173132","source":"guardian","message":"guardian.create.create-failed-cleaningup.cleanedup","log_level":1,"data":{"cause":"runc create: exit status 1: process_linux.go:258: applying cgroup configuration for process caused \"mkdir /sys/fs/cgroup/memory/67234b85-750c-47fe-7b2e-921740c5a5b4: no space left on device\"","handle":"67234b85-750c-47fe-7b2e-921740c5a5b4","session":"997.3"}}
83385-{"timestamp":"1472766267.733195305","source":"guardian","message":"guardian.api.garden-server.create.failed","log_level":2,"data":{"error":"runc create: exit status 1: process_linux.go:258: applying cgroup configuration for process caused \"mkdir /sys/fs/cgroup/memory/67234b85-750c-47fe-7b2e-921740c5a5b4: no space left on device\"","request":{"Handle":"","GraceTime":0,"RootFSPath":"raw:///var/vcap/data/baggageclaim/volumes/live/25d4f36d-9f8f-4cff-4ac1-d441db8e8b71/volume","BindMounts":null,"Network":"","Privileged":true,"Limits":{"bandwidth_limits":{},"cpu_limits":{},"disk_limits":{},"memory_limits":{}}},"session":"3.1.517"}}
83386:{"timestamp":"1472766268.500758410","source":"guardian","message":"guardian.create.start","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"999"}}
83387:{"timestamp":"1472766268.500808477","source":"guardian","message":"guardian.create.gc.start","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"999.1"}}
83388:{"timestamp":"1472766268.500828743","source":"guardian","message":"guardian.create.gc.finished","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"999.1"}}
83389:{"timestamp":"1472766268.500849724","source":"guardian","message":"guardian.create.containerizer-create.start","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"999.2"}}
83390:{"timestamp":"1472766268.521710157","source":"guardian","message":"guardian.create.containerizer-create.depot-create.started","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"999.2.1"}}
83391:{"timestamp":"1472766268.521897554","source":"guardian","message":"guardian.create.containerizer-create.depot-create.finished","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"999.2.1"}}
83392:{"timestamp":"1472766268.521925688","source":"guardian","message":"guardian.create.containerizer-create.lookup.started","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"999.2.2"}}
83393:{"timestamp":"1472766268.521943808","source":"guardian","message":"guardian.create.containerizer-create.lookup.finished","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"999.2.2"}}
83394:{"timestamp":"1472766268.521981955","source":"guardian","message":"guardian.create.containerizer-create.create.creating","log_level":1,"data":{"bundle":"/var/vcap/data/garden/depot/d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","bundlePath":"/var/vcap/data/garden/depot/d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","id":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","logPath":"/var/vcap/data/garden/depot/d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5/create.log","pidFilePath":"/var/vcap/data/garden/depot/d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5/pidfile","runc":"runc","session":"999.2.3"}}
83395:{"timestamp":"1472766268.564662218","source":"guardian","message":"guardian.create.containerizer-create.create.finished","log_level":1,"data":{"bundle":"/var/vcap/data/garden/depot/d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"999.2.3"}}
83396:{"timestamp":"1472766268.564709902","source":"guardian","message":"guardian.create.containerizer-create.runtime-create-failed","log_level":2,"data":{"error":"runc create: exit status 1: process_linux.go:258: applying cgroup configuration for process caused \"mkdir /sys/fs/cgroup/memory/d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5: no space left on device\"","handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"999.2"}}
83397:{"timestamp":"1472766268.564728737","source":"guardian","message":"guardian.create.containerizer-create.finished","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"999.2"}}
83398:{"timestamp":"1472766268.564753771","source":"guardian","message":"guardian.create.create-failed-cleaningup.start","log_level":1,"data":{"cause":"runc create: exit status 1: process_linux.go:258: applying cgroup configuration for process caused \"mkdir /sys/fs/cgroup/memory/d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5: no space left on device\"","handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"999.3"}}
83399:{"timestamp":"1472766268.564774513","source":"guardian","message":"guardian.destroy.started","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"1000"}}
83400:{"timestamp":"1472766268.564793587","source":"guardian","message":"guardian.destroy.state.started","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"1000.1"}}
83401:{"timestamp":"1472766268.572014570","source":"guardian","message":"guardian.destroy.state.finished","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"1000.1"}}
83402:{"timestamp":"1472766268.572041273","source":"guardian","message":"guardian.destroy.state-failed-skipping-delete","log_level":1,"data":{"error":"runc state: runc create: exit status 1: open /run/runc/d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5/state.json: no such file or directory","handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"1000"}}
83403:{"timestamp":"1472766268.572063684","source":"guardian","message":"guardian.destroy.destroy.started","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"1000.2"}}
83404:{"timestamp":"1472766268.572134256","source":"guardian","message":"guardian.destroy.destroy.finished","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"1000.2"}}
83405:{"timestamp":"1472766268.572152138","source":"guardian","message":"guardian.destroy.finished","log_level":1,"data":{"handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"1000"}}
83406:{"timestamp":"1472766268.572190762","source":"guardian","message":"guardian.layer-already-deleted-skipping","log_level":1,"data":{"error":"could not find image: no such id: dd7bb4a2c64e30b97b6bee40971fa86e1b134e2a30a940b22eeb9823be188611","graphID":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","id":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5"}}
83407:{"timestamp":"1472766268.572208643","source":"guardian","message":"guardian.create.create-failed-cleaningup.cleanedup","log_level":1,"data":{"cause":"runc create: exit status 1: process_linux.go:258: applying cgroup configuration for process caused \"mkdir /sys/fs/cgroup/memory/d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5: no space left on device\"","handle":"d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5","session":"999.3"}}
83408:{"timestamp":"1472766268.572227478","source":"guardian","message":"guardian.api.garden-server.create.failed","log_level":2,"data":{"error":"runc create: exit status 1: process_linux.go:258: applying cgroup configuration for process caused \"mkdir /sys/fs/cgroup/memory/d7581e52-2e88-4cdb-5d99-2fa93b4b1cd5: no space left on device\"","request":{"Handle":"","GraceTime":0,"RootFSPath":"raw:///var/vcap/data/baggageclaim/volumes/live/29dd5ffc-c577-4b89-7b66-dd87f7cfd996/volume","BindMounts":[{"src_path":"/var/vcap/data/baggageclaim/volumes/live/a40b857f-8711-41a6-45a9-e82871eacfd4/volume","dst_path":"/tmp/build/get","mode":1}],"Network":"","Privileged":true,"Limits":{"bandwidth_limits":{},"cpu_limits":{},"disk_limits":{},"memory_limits":{}}},"session":"3.1.518"}}
83409-{"timestamp":"1472766268.948507786","source":"guardian","message":"guardian.list-containers.starting","log_level":1,"data":{"session":"1001"}}
83410-{"timestamp":"1472766268.949084759","source":"guardian","message":"guardian.list-containers.finished","log_level":1,"data":{"session":"1001"}}
83411-{"timestamp":"1472766280.467016459","source":"guardian","message":"guardian.create.start","log_level":1,"data":{"handle":"dfd74a82-61d6-4aa8-5771-3a01c729b534","session":"1002"}}
83412-{"timestamp":"1472766280.467072964","source":"guardian","message":"guardian.create.gc.start","log_level":1,"data":{"handle":"dfd74a82-61d6-4aa8-5771-3a01c729b534","session":"1002.1"}}
83413-{"timestamp":"1472766280.467092991","source":"guardian","message":"guardian.create.gc.finished","log_level":1,"data":{"handle":"dfd74a82-61d6-4aa8-5771-3a01c729b534","session":"1002.1"}}

Accessing Our Broken Environment

We've used monit stop beacon to stop this worker from participating in Concourse, but preserve it for any investigation you might like to do.

Please feel free to jump on the environment and poke around. If you want to turn the beacon back on and re-run jobs, the jobs in the cf-release smokes display group or any of the jobs in this pipeline should be fine.

We've also made read access to the repo containing our bosh manifest available to your team:

www.github.com/cloudfoundry/runtime-ci-env

This is where the bosh manifest for concourse lives, in deployments/concourse.yml). The keypair for ssh'ing into bosh jobs is in keypair.

support user other than vcap

Instead of “vcap”, we are maintaining “paas” user for CF module deployment. “paas” user has sudoer permission as well.
Now during “garden” module deployment, “greenskeeper” is expecting “vcap” user and it is hard coded in code.
https://github.com/cloudfoundry/garden-runc-release/blob/master/src/greenskeeper/cmd/greenskeeper/main.go#L20

Is there any plan to make it configurable for other users?

As a CF operator, I expect to be able to understand how Grootfs is currently using disk space for containers and image layers

Description

Cloud Foundry Application Runtime operators want to understand how the Grootfs store is using disk on the Garden host machine, including being able to answer the following questions:

How much disk is used exclusive to container X?
How much exclusive disk is used across all running containers?
How much disk does underlying layer Y (rootfs or, in future, droplet or other non-root layer) use?
How much disk do all the active layers use in total?
How much disk does the store use in total?
How much disk could be reclaimed through pruning unused layers?
Are there categories of grootfs disk usage that should be accounted for, other than the ones above? If so, how much disk usage do they account for?
If I force grootfs to prune its cache, would I be able to reduce total disk usage to Z% of local disk or to W GB?
Is grootfs using more disk than it is configured to use, or is it behaving correctly?

This understanding can be important when operators or support agents reconcile disk allocations for containers and actual disk usage on the host, particularly in situations in which containers are not placed or executed successfully:

Operator observes that the Diego cell rep is out of disk capacity to allocate, but actual disk usage on the Garden host is relatively low, resulting in containers not being placed on the cell.
Operator observes that, conversely, the Diego cell rep thinks it has disk space to allocate, but containers and system components together are using enough disk that the remaining available disk space is less than the allocation, resulting in containers continuing to be placed on the cell and then failing to start.

In addition to assisting CFAR operators in configuring their Diego cells to use disk space efficiently and in debugging failure modes, this level of understanding would help the Diego team conceive of strategies to improve how the cell rep advertises available disk, especially when component disk usage exceeds expected margins.

It is not immediately clear how to use tools such as du and df to determine this information. du -sch * run naively against /var/vcap/data/grootfs/store overcounts disk usage by observing the same files in several different logical locations and so frequently returns absurdly high numbers. The reporting from df, while apparently more accurate, is often too coarse (reporting on all of /var/vcap/data, in a BOSH-deployed context) or too difficult to relate to the containers.

Steps to reproduce

An example from one of the Diego cells in the Diego team electron-cannon CI environment:

CF app container allocations and reported usage, effectively from the Garden API:

# curl -k --cert /var/vcap/jobs/rep/config/certs/tls.crt --key /var/vcap/jobs/rep/config/certs/tls.key https://058fe0af-50ce-419f-8dd6-8a94b7ad44f3.cell.service.cf.internal:1801/container_metrics | jq '.lrps | map({instance_guid,disk_usage_bytes,disk_quota_bytes})'

Output

[
  {
    "instance_guid": "62eef6c8-002a-40e0-537e-b869",
    "disk_usage_bytes": 7544832,
    "disk_quota_bytes": 67108864
  },
  {
    "instance_guid": "547cab22-2987-46dc-5235-9713",
    "disk_usage_bytes": 7544832,
    "disk_quota_bytes": 67108864
  },
  {
    "instance_guid": "369b6f00-aeed-44f7-5520-65dd",
    "disk_usage_bytes": 95567872,
    "disk_quota_bytes": 1073741824
  },
  {
    "instance_guid": "a94a5e54-bd97-413e-4ebf-8e5e",
    "disk_usage_bytes": 7544832,
    "disk_quota_bytes": 67108864
  },
  {
    "instance_guid": "fbf0adc4-ce51-42ba-696b-2588",
    "disk_usage_bytes": 6430720,
    "disk_quota_bytes": 1073741824
  },
  {
    "instance_guid": "6aaecdbb-83a3-49a0-50f6-4f2c",
    "disk_usage_bytes": 7544832,
    "disk_quota_bytes": 67108864
  },
  {
    "instance_guid": "7fcd3985-fc86-4d49-6318-bd66",
    "disk_usage_bytes": 95567872,
    "disk_quota_bytes": 1073741824
  },
  {
    "instance_guid": "d3d0080a-bcc4-4acb-7f17-e662",
    "disk_usage_bytes": 95830016,
    "disk_quota_bytes": 1073741824
  },
  {
    "instance_guid": "b846e975-d019-4a4c-72b4-a55f",
    "disk_usage_bytes": 95567872,
    "disk_quota_bytes": 1073741824
  },
  {
    "instance_guid": "d0f44fa0-9491-4da2-755f-a864",
    "disk_usage_bytes": 7544832,
    "disk_quota_bytes": 67108864
  },
  {
    "instance_guid": "096852be-16f2-48fa-41b3-fa97",
    "disk_usage_bytes": 7544832,
    "disk_quota_bytes": 67108864
  },
  {
    "instance_guid": "fe0ba241-a59b-4fbe-7e6b-b423",
    "disk_usage_bytes": 7544832,
    "disk_quota_bytes": 67108864
  },
  {
    "instance_guid": "ebe62763-d16c-40a0-6231-fdc6",
    "disk_usage_bytes": 95567872,
    "disk_quota_bytes": 1073741824
  },
  {
    "instance_guid": "2bf3fe20-9ff3-44e6-7b90-d656",
    "disk_usage_bytes": 7544832,
    "disk_quota_bytes": 67108864
  },
  {
    "instance_guid": "c95b2780-7ed0-45a6-7dcd-1d44",
    "disk_usage_bytes": 7544832,
    "disk_quota_bytes": 67108864
  },
  {
    "instance_guid": "07b247bb-b7cc-420f-7c6d-56b9",
    "disk_usage_bytes": 7544832,
    "disk_quota_bytes": 67108864
  },
  {
    "instance_guid": "46abea19-9fe5-4057-5466-ef57",
    "disk_usage_bytes": 95567872,
    "disk_quota_bytes": 1073741824
  },
  {
    "instance_guid": "c60f8621-c751-4b91-72c4-c3cc",
    "disk_usage_bytes": 7544832,
    "disk_quota_bytes": 67108864
  },
  {
    "instance_guid": "1dad94bb-76d7-4ee2-7de3-a4a4",
    "disk_usage_bytes": 7544832,
    "disk_quota_bytes": 67108864
  },
  {
    "instance_guid": "bb94dfc7-5ccd-473e-4ce8-a1f6",
    "disk_usage_bytes": 7544832,
    "disk_quota_bytes": 67108864
  },
  {
    "instance_guid": "87b0084e-b6e2-4e35-5f04-b502",
    "disk_usage_bytes": 7544832,
    "disk_quota_bytes": 67108864
  },
  {
    "instance_guid": "ce58a99a-0b4c-4070-622d-207a",
    "disk_usage_bytes": 7544832,
    "disk_quota_bytes": 67108864
  },
  {
    "instance_guid": "3c368ff6-d18d-4cd3-4cc5-cb80",
    "disk_usage_bytes": 7544832,
    "disk_quota_bytes": 67108864
  },
  {
    "instance_guid": "d3e99752-f05d-41c3-5a4c-ea29",
    "disk_usage_bytes": 7544832,
    "disk_quota_bytes": 67108864
  },
  {
    "instance_guid": "3097f536-5766-4112-559d-ad7a",
    "disk_usage_bytes": 7544832,
    "disk_quota_bytes": 67108864
  },
  {
    "instance_guid": "75d34c0f-b1c1-4866-48c8-161c",
    "disk_usage_bytes": 7544832,
    "disk_quota_bytes": 67108864
  }
]

`df`

# df -h

Output

Filesystem      Size  Used Avail Use% Mounted on
udev             13G  4.0K   13G   1% /dev
tmpfs           2.6G  1.7M  2.6G   1% /run
/dev/sda1       2.8G  1.3G  1.4G  47% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
none            5.0M     0  5.0M   0% /run/lock
none             13G     0   13G   0% /run/shm
none            100M     0  100M   0% /run/user
/dev/sda3        71G   21G   47G  31% /var/vcap/data
tmpfs           1.0M   32K  992K   4% /var/vcap/data/sys/run
/dev/loop0       71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged
/dev/loop1       71G  1.9G   69G   3% /var/vcap/data/grootfs/store/privileged
tmpfs           2.9M  216K  2.6M   8% /var/vcap/data/rep/instance_identity
overlay         1.0G   92M  933M   9% /var/vcap/data/grootfs/store/unprivileged/images/b846e975-d019-4a4c-72b4-a55f/rootfs
overlay         1.0G  6.2M 1018M   1% /var/vcap/data/grootfs/store/unprivileged/images/fbf0adc4-ce51-42ba-696b-2588/rootfs
overlay         1.0G   92M  933M   9% /var/vcap/data/grootfs/store/unprivileged/images/369b6f00-aeed-44f7-5520-65dd/rootfs
overlay         1.0G   92M  933M   9% /var/vcap/data/grootfs/store/unprivileged/images/d3d0080a-bcc4-4acb-7f17-e662/rootfs
overlay          64M  7.2M   57M  12% /var/vcap/data/grootfs/store/unprivileged/images/c60f8621-c751-4b91-72c4-c3cc/rootfs
overlay          64M  7.2M   57M  12% /var/vcap/data/grootfs/store/unprivileged/images/2bf3fe20-9ff3-44e6-7b90-d656/rootfs
overlay          64M  7.2M   57M  12% /var/vcap/data/grootfs/store/unprivileged/images/87b0084e-b6e2-4e35-5f04-b502/rootfs
overlay         1.0G   92M  933M   9% /var/vcap/data/grootfs/store/unprivileged/images/46abea19-9fe5-4057-5466-ef57/rootfs
overlay          64M  7.2M   57M  12% /var/vcap/data/grootfs/store/unprivileged/images/d0f44fa0-9491-4da2-755f-a864/rootfs
overlay          64M  7.2M   57M  12% /var/vcap/data/grootfs/store/unprivileged/images/ce58a99a-0b4c-4070-622d-207a/rootfs
overlay          64M  7.2M   57M  12% /var/vcap/data/grootfs/store/unprivileged/images/6aaecdbb-83a3-49a0-50f6-4f2c/rootfs
overlay          64M  7.2M   57M  12% /var/vcap/data/grootfs/store/unprivileged/images/a94a5e54-bd97-413e-4ebf-8e5e/rootfs
overlay          64M  7.2M   57M  12% /var/vcap/data/grootfs/store/unprivileged/images/62eef6c8-002a-40e0-537e-b869/rootfs
overlay         1.0G   92M  933M   9% /var/vcap/data/grootfs/store/unprivileged/images/7fcd3985-fc86-4d49-6318-bd66/rootfs
overlay          64M  7.2M   57M  12% /var/vcap/data/grootfs/store/unprivileged/images/07b247bb-b7cc-420f-7c6d-56b9/rootfs
overlay          64M  7.2M   57M  12% /var/vcap/data/grootfs/store/unprivileged/images/75d34c0f-b1c1-4866-48c8-161c/rootfs
overlay          64M  7.2M   57M  12% /var/vcap/data/grootfs/store/unprivileged/images/c6d3c65f-f8dd-47d3-4028-cf04/rootfs
overlay          64M  7.2M   57M  12% /var/vcap/data/grootfs/store/unprivileged/images/fe0ba241-a59b-4fbe-7e6b-b423/rootfs
overlay          64M  7.2M   57M  12% /var/vcap/data/grootfs/store/unprivileged/images/c95b2780-7ed0-45a6-7dcd-1d44/rootfs
overlay          64M  7.2M   57M  12% /var/vcap/data/grootfs/store/unprivileged/images/3c368ff6-d18d-4cd3-4cc5-cb80/rootfs
overlay         1.0G   92M  933M   9% /var/vcap/data/grootfs/store/unprivileged/images/ebe62763-d16c-40a0-6231-fdc6/rootfs
overlay          64M  7.2M   57M  12% /var/vcap/data/grootfs/store/unprivileged/images/d3e99752-f05d-41c3-5a4c-ea29/rootfs
overlay          64M  7.2M   57M  12% /var/vcap/data/grootfs/store/unprivileged/images/096852be-16f2-48fa-41b3-fa97/rootfs
overlay          64M  7.2M   57M  12% /var/vcap/data/grootfs/store/unprivileged/images/1dad94bb-76d7-4ee2-7de3-a4a4/rootfs
overlay          64M  7.2M   57M  12% /var/vcap/data/grootfs/store/unprivileged/images/bb94dfc7-5ccd-473e-4ce8-a1f6/rootfs
overlay          64M  7.2M   57M  12% /var/vcap/data/grootfs/store/unprivileged/images/3097f536-5766-4112-559d-ad7a/rootfs
overlay          64M  7.2M   57M  12% /var/vcap/data/grootfs/store/unprivileged/images/547cab22-2987-46dc-5235-9713/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/87b0084e-b6e2-4e35-5f04-b502-liveness-healthcheck-0/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/3097f536-5766-4112-559d-ad7a-liveness-healthcheck-0/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/ce58a99a-0b4c-4070-622d-207a-liveness-healthcheck-0/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/a94a5e54-bd97-413e-4ebf-8e5e-liveness-healthcheck-0/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/c95b2780-7ed0-45a6-7dcd-1d44-liveness-healthcheck-0/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/fbf0adc4-ce51-42ba-696b-2588-liveness-healthcheck-0/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/1dad94bb-76d7-4ee2-7de3-a4a4-liveness-healthcheck-0/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/c60f8621-c751-4b91-72c4-c3cc-liveness-healthcheck-0/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/62eef6c8-002a-40e0-537e-b869-liveness-healthcheck-0/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/369b6f00-aeed-44f7-5520-65dd-liveness-healthcheck-0/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/75d34c0f-b1c1-4866-48c8-161c-liveness-healthcheck-0/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/b846e975-d019-4a4c-72b4-a55f-liveness-healthcheck-0/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/2bf3fe20-9ff3-44e6-7b90-d656-liveness-healthcheck-0/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/096852be-16f2-48fa-41b3-fa97-liveness-healthcheck-0/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/6aaecdbb-83a3-49a0-50f6-4f2c-liveness-healthcheck-0/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/fe0ba241-a59b-4fbe-7e6b-b423-liveness-healthcheck-0/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/07b247bb-b7cc-420f-7c6d-56b9-liveness-healthcheck-0/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/d0f44fa0-9491-4da2-755f-a864-liveness-healthcheck-0/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/bb94dfc7-5ccd-473e-4ce8-a1f6-liveness-healthcheck-0/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/c6d3c65f-f8dd-47d3-4028-cf04-liveness-healthcheck-0/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/ebe62763-d16c-40a0-6231-fdc6-liveness-healthcheck-0/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/d3e99752-f05d-41c3-5a4c-ea29-liveness-healthcheck-0/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/3c368ff6-d18d-4cd3-4cc5-cb80-liveness-healthcheck-0/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/547cab22-2987-46dc-5235-9713-liveness-healthcheck-0/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/d3d0080a-bcc4-4acb-7f17-e662-liveness-healthcheck-0/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/7fcd3985-fc86-4d49-6318-bd66-liveness-healthcheck-0/rootfs
overlay          71G  2.6G   68G   4% /var/vcap/data/grootfs/store/unprivileged/images/46abea19-9fe5-4057-5466-ef57-liveness-healthcheck-0/rootfs

`du` against Grootfs store directories

# du -sch /var/vcap/data/grootfs/store/*

Output

1.8G	/var/vcap/data/grootfs/store/privileged
1.9G	/var/vcap/data/grootfs/store/privileged.backing-store
2.6G	/var/vcap/data/grootfs/store/unprivileged
4.3G	/var/vcap/data/grootfs/store/unprivileged.backing-store
11G	total

Resolution

It would help to have a documented set of techniques and/or tools to use to answer some of the questions posed above on a particular Garden/Grootfs host.

/cc @cloudfoundry/cf-diego @nikhilsuvarna @dmikusa-pivotal

[question] graph_cleanup_threshold_in_mb

Hello,
even though we have set the property graph_cleanup_threshold_in_mb in the diego manifest to 66560, the clean process doesn't seem to work properly. The following is the output of the ephemeral disk usage of one of our cells:

# df -h /var/vcap/data/
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdb2       83G   76G  2.8G  97% /var/vcap/data

this issue is affecting multiple cells.

/var/vcap/data# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            7.9G  4.0K  7.9G   1% /dev
tmpfs           1.6G  884K  1.6G   1% /run
/dev/xvda1      2.9G  1.2G  1.7G  41% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
none            5.0M     0  5.0M   0% /run/lock
none            7.9G  8.1M  7.9G   1% /run/shm
none            100M     0  100M   0% /run/user
/dev/xvdb2       83G   76G  2.8G  97% /var/vcap/data
tmpfs           1.0M   24K 1000K   3% /var/vcap/data/sys/run
tmpfs           2.5M     0  2.5M   0% /var/vcap/data/executor/instance_identity
/dev/loop1     1008M   62M  880M   7% /var/vcap/data/garden/graph/aufs/diff/ef9fc27954ba8570ebd62b825309e821cb6e002aa885af2f8da22bf4b6f82344
none           1008M   62M  880M   7% /var/vcap/data/garden/graph/aufs/mnt/ef9fc27954ba8570ebd62b825309e821cb6e002aa885af2f8da22bf4b6f82344
/dev/loop3     1008M  516M  426M  55% /var/vcap/data/garden/graph/aufs/diff/7a56008bf6671e8c1bd67663a7a7c7fb7166d61255dec8265b67324a2e9ed352
none           1008M  516M  426M  55% /var/vcap/data/garden/graph/aufs/mnt/7a56008bf6671e8c1bd67663a7a7c7fb7166d61255dec8265b67324a2e9ed352
/dev/loop11     2.0G  1.1G  786M  59% /var/vcap/data/garden/graph/aufs/diff/8b48e6690d4e23f7f1ed51e5923dab779c60242f371f0c315a036ab8b06b8e68
none            2.0G  1.1G  786M  59% /var/vcap/data/garden/graph/aufs/mnt/8b48e6690d4e23f7f1ed51e5923dab779c60242f371f0c315a036ab8b06b8e68
/dev/loop12    1008M  177M  765M  19% /var/vcap/data/garden/graph/aufs/diff/10d514bbd1479a5a14dee6feaf7b7e74d0ca500febd41c2d877375aa2e0dd8a6
none           1008M  177M  765M  19% /var/vcap/data/garden/graph/aufs/mnt/10d514bbd1479a5a14dee6feaf7b7e74d0ca500febd41c2d877375aa2e0dd8a6
/dev/loop14     2.0G  622M  1.3G  33% /var/vcap/data/garden/graph/aufs/diff/96ccc671e828bd39f40a3f383eefa65240f8f7677a968e5ec720ddfe1637678c
none            2.0G  622M  1.3G  33% /var/vcap/data/garden/graph/aufs/mnt/96ccc671e828bd39f40a3f383eefa65240f8f7677a968e5ec720ddfe1637678c
/dev/loop7     1008M  1.6M  940M   1% /var/vcap/data/garden/graph/aufs/diff/cb46b68904b66880205157a202067efdf51ff97ca632add33e6d1cab8ab58fec
none           1008M  1.6M  940M   1% /var/vcap/data/garden/graph/aufs/mnt/cb46b68904b66880205157a202067efdf51ff97ca632add33e6d1cab8ab58fec
/dev/loop17    1008M  168M  774M  18% /var/vcap/data/garden/graph/aufs/diff/7f76de6e20cee9abda0b3f7667da7a29b9e1dddf4e1477f0ba78bb488327cedf
none           1008M  168M  774M  18% /var/vcap/data/garden/graph/aufs/mnt/7f76de6e20cee9abda0b3f7667da7a29b9e1dddf4e1477f0ba78bb488327cedf
/dev/loop18    1008M  168M  774M  18% /var/vcap/data/garden/graph/aufs/diff/b64a022d1db5b4cb6626a434489423c4c79945642e7513f90f2cdf9833a901b4
none           1008M  168M  774M  18% /var/vcap/data/garden/graph/aufs/mnt/b64a022d1db5b4cb6626a434489423c4c79945642e7513f90f2cdf9833a901b4
/dev/loop21    1008M  159M  783M  17% /var/vcap/data/garden/graph/aufs/diff/8a55d45b59733ca354ea7e0c3254e27c9cb27f1bf94a071cf5b7112566a3639b
none           1008M  159M  783M  17% /var/vcap/data/garden/graph/aufs/mnt/8a55d45b59733ca354ea7e0c3254e27c9cb27f1bf94a071cf5b7112566a3639b
/dev/loop22    1008M  142M  799M  16% /var/vcap/data/garden/graph/aufs/diff/918036fb41e2c309df1b8a1a10f58cb54508965bec22985cd19aed34a7bd4662
none           1008M  142M  799M  16% /var/vcap/data/garden/graph/aufs/mnt/918036fb41e2c309df1b8a1a10f58cb54508965bec22985cd19aed34a7bd4662
/dev/loop23    1008M  578M  364M  62% /var/vcap/data/garden/graph/aufs/diff/0998e4fd0d333ac6223009472d566ef2308b2125f30a1345ec30e28a626fe811
none           1008M  578M  364M  62% /var/vcap/data/garden/graph/aufs/mnt/0998e4fd0d333ac6223009472d566ef2308b2125f30a1345ec30e28a626fe811
/dev/loop24    1008M   45M  897M   5% /var/vcap/data/garden/graph/aufs/diff/64a13d4627af42e5a92bcc1c36a0b1ef509167c92c2847dff1f38c743e24eafc
none           1008M   45M  897M   5% /var/vcap/data/garden/graph/aufs/mnt/64a13d4627af42e5a92bcc1c36a0b1ef509167c92c2847dff1f38c743e24eafc
/dev/loop25    1008M  233M  708M  25% /var/vcap/data/garden/graph/aufs/diff/39eccbf075792c0e7dcd17de2c224bc0498e87b09e0558dd00c795fe39dcb6dd
none           1008M  233M  708M  25% /var/vcap/data/garden/graph/aufs/mnt/39eccbf075792c0e7dcd17de2c224bc0498e87b09e0558dd00c795fe39dcb6dd
/dev/loop26    1008M  8.6M  933M   1% /var/vcap/data/garden/graph/aufs/diff/bc4c5cd17cde53c0715afda5eda116e1576bcd529f6e1a68d679e87469a4eda1
none           1008M  8.6M  933M   1% /var/vcap/data/garden/graph/aufs/mnt/bc4c5cd17cde53c0715afda5eda116e1576bcd529f6e1a68d679e87469a4eda1
/dev/loop0     1008M  167M  775M  18% /var/vcap/data/garden/graph/aufs/diff/0020ec10f6bc5302db0bb3443863e9a42f3020fc776327916c7593f4c230b121
none           1008M  167M  775M  18% /var/vcap/data/garden/graph/aufs/mnt/0020ec10f6bc5302db0bb3443863e9a42f3020fc776327916c7593f4c230b121
/dev/loop5     1008M   23M  919M   3% /var/vcap/data/garden/graph/aufs/diff/77682895ee97cc1f4e8842775328fff239a67d0b3290579deab1a594d4a43875
none           1008M   23M  919M   3% /var/vcap/data/garden/graph/aufs/mnt/77682895ee97cc1f4e8842775328fff239a67d0b3290579deab1a594d4a43875
/dev/loop4     1008M  143M  798M  16% /var/vcap/data/garden/graph/aufs/diff/216b4568a294eadc88697c7a543eae99c13f79a6b6615ccca827d6d1c99e744b
none           1008M  143M  798M  16% /var/vcap/data/garden/graph/aufs/mnt/216b4568a294eadc88697c7a543eae99c13f79a6b6615ccca827d6d1c99e744b
/dev/loop8     1008M  197M  745M  21% /var/vcap/data/garden/graph/aufs/diff/212c22e0713294bc5eeb70a0a02392f4fa522c707d996a1e75e8377ab61f940a
none           1008M  197M  745M  21% /var/vcap/data/garden/graph/aufs/mnt/212c22e0713294bc5eeb70a0a02392f4fa522c707d996a1e75e8377ab61f940a
/dev/loop2     1008M  395M  547M  42% /var/vcap/data/garden/graph/aufs/diff/c4a5870c72d15bf7a286b5164b16c6a8250bcdbcb778ad8bca769ace68bb528c
none           1008M  395M  547M  42% /var/vcap/data/garden/graph/aufs/mnt/c4a5870c72d15bf7a286b5164b16c6a8250bcdbcb778ad8bca769ace68bb528c
/dev/loop6     1008M  139M  802M  15% /var/vcap/data/garden/graph/aufs/diff/9326380ed326af77e0fb3ffadf3decf683c93da677a77ee222bfe7e84249d001
none           1008M  139M  802M  15% /var/vcap/data/garden/graph/aufs/mnt/9326380ed326af77e0fb3ffadf3decf683c93da677a77ee222bfe7e84249d001
/dev/loop10      62M   58M  207K 100% /var/vcap/data/garden/graph/aufs/diff/9e1b57af28bf689fff28b9e621830f9f37dfce8c9ad043391f592362280af6fb
none             62M   58M  207K 100% /var/vcap/data/garden/graph/aufs/mnt/9e1b57af28bf689fff28b9e621830f9f37dfce8c9ad043391f592362280af6fb

When the cleanup is done is the garden process restarted? If so, it never was:

root        8103  2.6  0.4 3943836 77412 ?       S<l  Aug17 456:29 /var/vcap/packages/guardian/bin/gdn server --skip-setup --bind-ip=0.0.0.0 --bind-port=7777 --depot=/var/vcap/data/garden/depot --graph=/var/vcap/data/garden/graph --properties-path=/var/vcap/data/garden/props.json --port-pool-properties-path=/var/vcap/data/garden/port-pool-props.json --iptables-bin=/var/vcap/packages/iptables/sbin/iptables --iptables-restore-bin=/var/vcap/packages/iptables/sbin/iptables-restore --init-bin=/var/vcap/packages/guardian/bin/init --dadoo-bin=/var/vcap/packages/guardian/bin/dadoo --nstar-bin=/var/vcap/packages/guardian/bin/nstar --tar-bin=/var/vcap/packages/tar/tar --newuidmap-bin=/var/vcap/packages/garden-idmapper/bin/newuidmap --newgidmap-bin=/var/vcap/packages/garden-idmapper/bin/newgidmap --log-level=error --mtu=0 --network-pool=10.239.0.0/22 --deny-network=0.0.0.0/0 --destroy-containers-on-startup --debug-bind-ip=127.0.0.1 --debug-bind-port=17005 --default-rootfs=/var/vcap/packages/busybox --default-grace-time=0 --default-container-blockio-weight=0 --graph-cleanup-threshold-in-megabytes=66560 --max-containers=250 --cpu-quota-per-share=0 --tcp-memory-limit=0 --runtime-plugin=/var/vcap/packages/runc/bin/runc --persistent-image=/var/vcap/packages/cflinuxfs2/rootfs --apparmor=garden-default --cleanup-process-dirs-on-wait

CF: v270
Diego: v1.23.2
Garden-runc: v1.9.0
Environment: AWS

Droplet with big symlink failed to deploy with grootfs: Cannot create symlink to: File name too long

Note: This issue comes from the discussion in slack in: https://cloudfoundry.slack.com/archives/C033RE5D6/p1521478035000065

The problem has been identified and diagnosed and it is unlikely to happen again. Opening this ticket for documentation and reference as requested by @julz .

Description

While upgrading our CF deployment, which included upgrade garden-runc-release from v1.9.6 to v1.11.1 and introduce grootfs, one application failed restarting after eviction, with the following error in the cf events:

Copying into the container failed: stream-in: nstar: error streaming in: exit status 2. Output: tar: ./app/.cfignore: Cannot create symlink to '# Byte-compiled / ....' File name too long\ntar: Exiting with failure

(see below for the full error).

This was caused by the bug described in cloudfoundry/cli#1349. A spurious file was created in the droplet with a very very long file target (it was using the content of the target file, rather than the path).

But it was working fine with garden v1.9.6, started to fail with v1.11.1 after introducing grootfs.

Pushing the application again with a newer version of cf-cli would result in a valid symlink and would fix the problem.

Logging and/or test output

The full error is:

            "metadata": {
               "instance": "55ee4080-fc4e-4f63-48e0-daeb",
               "index": 1,
               "reason": "CRASHED",
               "exit_description": "Copying into the container failed: stream-in: nstar: error streaming in: exit status 2. Output: tar: ./app/.cfignore: Cannot create symlink to '# Byte-compiled / optimized / DLL files\\n__pycache__/\\n*.py[cod]\\n*$py.class\\n\\n# C extensions\\n*.so\\n\\n# Distribution / packaging\\n.Python\\nenv/\\nbuild/\\ndevelop-eggs/\\ndist/\\ndownloads/\\neggs/\\n.eggs/\\nlib/\\nlib64/\\nparts/\\nsdist/\\nvar/\\nwheels/\\n*.egg-info/\\n.installed.cfg\\n*.egg\\n\\n# PyInstaller\\n#  Usually these files are written by a python script from a template\\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\\n*.manifest\\n*.spec\\n\\n# Installer logs\\npip-log.txt\\npip-delete-this-directory.txt\\n\\n# Unit test / coverage reports\\nhtmlcov/\\n.tox/\\n.coverage\\n.coverage.*\\n.cache\\nnosetests.xml\\ncoverage.xml\\n*.cover\\n.hypothesis/\\n\\n# Translations\\n*.mo\\n*.pot\\n\\n# Django stuff:\\n*.log\\nlocal_settings.py\\n\\n# Flask stuff:\\ninstance/\\n.webassets-cache\\n\\n# Scrapy stuff:\\n.scrapy\\n\\n# Sphinx documentation\\ndocs/_build/\\n\\n# PyBuilder\\ntarget/\\n\\n# Jupyter Notebook\\n.ipynb_checkpoints\\n\\n# pyenv\\n.python-version\\n\\n# celery beat schedule file\\ncelerybeat-schedule\\n\\n# SageMath parsed files\\n*.sage.py\\n\\n# dotenv\\n.env\\n\\n# virtualenv\\n.venv\\nvenv/\\nENV/\\n\\n# Spyder project settings\\n.spyderproject\\n.spyproject\\n\\n# Rope project settings\\n.ropeproject\\n\\n# mkdocs documentation\\n/site\\n\\n# mypy\\n.mypy_cache/\\n\\n.direnv\\n.envrc\\n': File name too long\ntar: Exiting with failure status due to previous errors\n",
               "crash_count": 1,
               "crash_timestamp": 1519298150947758085,
               "version": "1f6aa9e0-a404-41bb-b6ec-67620be0febd"
            },

Steps to reproduce

Using garden-runc-release v1.9.6
Push an app using cf v3-push with cf-cli v6.32.0, including with symbolic link to the absolute path of file > 1kb:

dd count=2 bs=1024 if=/dev/urandom of=- | base64 > foo
ln -s $(pwd)/foo bar
cf-6.32.0 v3-push -p . myapp

Upgrade to garden-runc-release v1.11.1. The application would fail restarting after eviction.

GrootFS fails to mount XFS filesystem: `exit status 32: mount: wrong fs type, bad option, bad superblock on /dev/loop0`

Hello,

When installing diego in vSphere environment, following error occurs :
#############################################################
Started Updating instance
Started Updating instance > database_z1/2c9a0212-55c1-4c92-9be7-49a3e1920041 (0) (canary)
Done Updating instance > database_z1/2c9a0212-55c1-4c92-9be7-49a3e1920041 (0) (canary)
Started Updating instance > brain_z1/40c56e23-ef82-4542-9ff8-77543f1d6507 (0) (canary)
Done Updating instance > brain_z1/40c56e23-ef82-4542-9ff8-77543f1d6507 (0) (canary)
Started Updating instance > cell_z1/ce3f079d-6e98-4ea5-b584-0e28da2eff2b (0) (canary)
Started Updating instance > cc_bridge_z1/cb6d6806-8c63-415e-83cb-cc0467f20adc (0) (canary)
Started Updating instance > route_emitter_z1/83a66350-ac04-423f-81cd-df2938d1bfd4 (0) (canary)
Started Updating instance > access_z1/1c6d1586-9137-4895-a579-03712bcf728c (0) (canary)
Done Updating instance > route_emitter_z1/83a66350-ac04-423f-81cd-df2938d1bfd4 (0) (canary)
Done Updating instance > cc_bridge_z1/cb6d6806-8c63-415e-83cb-cc0467f20adc (0) (canary)
Done Updating instance > access_z1/1c6d1586-9137-4895-a579-03712bcf728c (0) (canary)
Failed Updating instance > cell_z1/ce3f079d-6e98-4ea5-b584-0e28da2eff2b (0) (canary)
Error Code : 400007, Message :'cell_z1/0 (ce3f079d-6e98-4ea5-b584-0e28da2eff2b)' is not running after update. Review logs for failed jobs: rep, garden, metron_agent
#############################################################

monit message was found as:
#############################################################
chmod u+s /var/vcap/packages/grootfs/bin/tardis
{"timestamp":"1521080887.013528109","source":"grootfs","message":"grootfs.init-store.store-manager-init-store.overlayxfs-init-filesystem.mounting-filesystem-failed","log_level":2,"data":{"error":"exit status 32: mount: wrong fs type, bad option, bad superblock on /dev/loop0,\n missing codepage or helper program, or other error\n In some cases useful info is found in syslog - try\n dmesg | tail or so\n\n","filesystemPath":"/var/vcap/data/grootfs/store/unprivileged.backing-store","session":"1.1.2","spec":{"UIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"GIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"StoreSizeBytes":25571164160},"storePath":"/var/vcap/data/grootfs/store/unprivileged"}}
{"timestamp":"1521080887.013750792","source":"grootfs","message":"grootfs.init-store.store-manager-init-store.initializing-filesystem-failed","log_level":2,"data":{"backingstoreFile":"/var/vcap/data/grootfs/store/unprivileged.backing-store","error":"Mounting filesystem: exit status 32: mount: wrong fs type, bad option, bad superblock on /dev/loop0,\n missing codepage or helper program, or other error\n In some cases useful info is found in syslog - try\n dmesg | tail or so\n\n","session":"1.1","spec":{"UIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"GIDMappings":[{"HostID":4294967294,"NamespaceID":0,"Size":1},{"HostID":1,"NamespaceID":1,"Size":4294967293}],"StoreSizeBytes":25571164160},"storePath":"/var/vcap/data/grootfs/store/unprivileged"}}
{"timestamp":"1521080887.013805628","source":"grootfs","message":"grootfs.init-store.cleaning-up-store-failed","log_level":2,"data":{"error":"initializing filesyztem: Mounting filesystem: exit status 32: mount: wrong fs type, bad option, bad superblock on /dev/loop0,\n missing codepage or helper program, or other error\n In some cases useful info is found in syslog - try\n dmesg | tail or so\n\n","session":"1"}}
#############################################################

My environments were:

cf-release version 287
diego-release version 1.34.0
garden-runc-release version 1.11.0
cflinuxfs-release version 1.185.0

similer issue was found here : https://lists.cloudfoundry.org/g/cf-dev/topic/14491832#7849

Can any one help on this issue? ( Error occurs when installing diego in vSphere environmet )
Thanks in Advance. 👍

Best Regards,
JY

/some/path is not a mount point

Description

After upgrading to garden-runc-release v1.13.0 I'm seeing the following errors in my logs:

/var/vcap/data/baggageclaim/volumes/live/c6e9cb7f-1be9-4d78-48a6-b5903816a20b/volume is not a mount point

Environment

garden-runc-release version: 1.13.0
IaaS: Not applicable / all
Stemcell version: Not applicable / all
Kernel version: Not applicable / all

Steps to reproduce

TBC

Logs

/var/vcap/data/baggageclaim/volumes/live/c6e9cb7f-1be9-4d78-48a6-b5903816a20b/volume is not a mount point

Cause

There is a known regression related to garden's handling of bind mounts in v1.13.0.

Resolution

Do not use v1.13.0, and instead bump directly to v1.13.1.

Problem with grootfs store for unprivileged causes garden failure

Garden job fails to come up with Pivotal App Service Tile v2.0.8 due to grootfs issue:

diego_cell/59ad51f8-81a2-443f-a77c-a066f5c612e6:~# tail -f /var/vcap/sys/log/garden/garden.std*
==> /var/vcap/sys/log/garden/garden.stderr.log <==

==> /var/vcap/sys/log/garden/garden.stdout.log <==
{"timestamp":"1522086995.059141159","source":"guardian","message":"guardian.create.create-failed-cleaningup.volumizer.image-plugin-destroy.grootfs.delete.failed-to-initialise-image-driver","log_level":2,"data":{"cause":"running image plugin create: reading namespace file: open /var/vcap/data/grootfs/store/unprivileged/meta/namespace.json: input/output error\n: exit status 1","error":"reading namespace file: open /var/vcap/data/grootfs/store/unprivileged/meta/namespace.json: input/output error","handle":"executor-healthcheck-2001547f-aa74-45df-6286-fb6ebf812728","original_timestamp":"2018-03-26T17:56:35.0587776Z","session":"91.3.2.1"}}
{"timestamp":"1522086995.063849688","source":"guardian","message":"guardian.create.create-failed-cleaningup.volumizer.image-plugin-destroy.image-plugin-result","log_level":2,"data":{"action":"destroy","cause":"running image plugin create: reading namespace file: open /var/vcap/data/grootfs/store/unprivileged/meta/namespace.json: input/output error\n: exit status 1","error":"exit status 1","handle":"executor-healthcheck-2001547f-aa74-45df-6286-fb6ebf812728","session":"91.3.2.1","stdout":"reading namespace file: open /var/vcap/data/grootfs/store/unprivileged/meta/namespace.json: input/output error\n"}}
{"timestamp":"1522086995.064245462","source":"guardian","message":"guardian.create.create-failed-cleaningup.destroy-failed","log_level":2,"data":{"cause":"running image plugin create: reading namespace file: open /var/vcap/data/grootfs/store/unprivileged/meta/namespace.json: input/output error\n: exit status 1","error":"running image plugin destroy: reading namespace file: open /var/vcap/data/grootfs/store/unprivileged/meta/namespace.json: input/output error\n: exit status 1","handle":"executor-healthcheck-2001547f-aa74-45df-6286-fb6ebf812728","session":"91.3"}}
{"timestamp":"1522086995.064504623","source":"guardian","message":"guardian.create.create-failed-cleaningup.cleanedup","log_level":1,"data":{"cause":"running image plugin create: reading namespace file: open /var/vcap/data/grootfs/store/unprivileged/meta/namespace.json: input/output error\n: exit status 1","handle":"executor-healthcheck-2001547f-aa74-45df-6286-fb6ebf812728","session":"91.3"}}
{"timestamp":"1522086995.064779043","source":"guardian","message":"guardian.api.garden-server.create.failed","log_level":2,"data":{"error":"running image plugin create: reading namespace file: open /var/vcap/data/grootfs/store/unprivileged/meta/namespace.json: input/output error\n: exit status 1","request":{"Handle":"executor-healthcheck-2001547f-aa74-45df-6286-fb6ebf812728","GraceTime":0,"RootFSPath":"/var/vcap/packages/cflinuxfs2/rootfs.tar","BindMounts":null,"Network":"","Privileged":false,"Limits":{"bandwidth_limits":{},"cpu_limits":{},"disk_limits":{},"memory_limits":{},"pid_limits":{}}},"session":"1.1.122"}}
{"timestamp":"1522087003.623143435","source":"guardian","message":"guardian.list-containers.starting","log_level":1,"data":{"session":"92"}}
{"timestamp":"1522087003.623309374","source":"guardian","message":"guardian.list-containers.finished","log_level":1,"data":{"session":"92"}}
{"timestamp":"1522087035.329714775","source":"guardian","message":"guardian.api.garden-server.waiting-for-connections-to-close","log_level":1,"data":{"session":"1.1"}}
{"timestamp":"1522087035.332157612","source":"guardian","message":"guardian.api.garden-server.stopping-backend","log_level":1,"data":{"session":"1.1"}}
{"timestamp":"1522087035.332249641","source":"guardian","message":"guardian.api.garden-server.stopped","log_level":1,"data":{"session":"1.1"}}

Underlying problem with the grootfs unprivileged store:

~# ls -al /var/vcap/data/grootfs/store/
ls: cannot access /var/vcap/data/grootfs/store/unprivileged: Input/output error
total 113824
drwxr-xr-x 4 root root         4096 Mar 26 17:52 .
drwxr-xr-x 3 root root         4096 Mar 26 17:51 ..
drwx------ 9 root root          118 Mar 26 17:53 privileged
-rw------- 1 root root 118579802112 Mar 26 17:55 privileged.backing-store
d????????? ? ?    ?               ?            ? unprivileged
-rw------- 1 root root 118579802112 Mar 26 17:53 unprivileged.backing-store

Running on vsphere with Ops Mgr build 2.0-build-269, PAS v2.0.8, NSX-T tile v2.1 integration

garden-runc-release version: 1.11.1
IaaS: vsphere
Stemcell version: ubuntu trusty 3468.27
Kernel version: 4.4.0-116-generic

Steps to reproduce

PAS deployment fails to come up with the garden job in diego_cell failing.

Logs

See above

Cause

We believe the root cause was an underlying issue with the IaaS.

Resolution

Recreating the diego_cell didnt work. cannot delete the folder as there is no permission. This was actually resolved by deleting the VM from the IaaS and redeploying.

Integration test failures

Hi all,

We're running the test suite again, this time getting a different error:

scripts/remote-fly ci/nested-tests.yml
executing build 1
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 21.6M    0 21.6M    0     0  12.2M      0 --:--:--  0:00:01 --:--:-- 12.2M
initializing with docker:///cloudfoundry/garden-ci-ubuntu
running guardian-release/ci/scripts/nested-tests
++ dirname guardian-release/ci/scripts/nested-tests
+ cd guardian-release/ci/scripts/../..
+ export GOROOT=/usr/local/go
+ GOROOT=/usr/local/go
+ export PATH=/usr/local/go/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+ PATH=/usr/local/go/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+ export GOPATH=/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release
+ GOPATH=/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release
+ export PATH=/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/bin:/usr/local/go/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+ PATH=/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/bin:/usr/local/go/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+ pushd src/github.com/cloudfoundry-incubator/guardian
/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/cloudfoundry-incubator/guardian /tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release
+ go install -tags daemon github.com/cloudfoundry-incubator/guardian/cmd/guardian/
+ go install github.com/cloudfoundry-incubator/guardian/rundmc/iodaemon/cmd/iodaemon
+ tmpdir=/tmp/dir
+ rm -fr /tmp/dir
+ mkdir -p /tmp/dir/depot
+ mkdir /tmp/dir/snapshots
+ mkdir /tmp/dir/graph
+ popd
/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release
+ go install github.com/onsi/ginkgo/ginkgo
+ /tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/bin/guardian -iodaemonBin=/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/bin/iodaemon -depot=/tmp/dir/depot -snapshots=/tmp/dir/snapshots -graph=/tmp/dir/graph -bin=/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/cloudfoundry-incubator/guardian/linux_backend/bin -listenNetwork=tcp -listenAddr=0.0.0.0:7777 -denyNetworks= -allowNetworks= -allowHostAccess=false -mtu=1500 -containerGraceTime=5m -logLevel=error -rootfs=/opt/warden/rootfs
ERRO[0000] Failed to GetDriver graph btrfs /tmp/dir/graph
ERRO[0000] Failed to GetDriver graph zfs /tmp/dir/graph
ERRO[0000] Failed to GetDriver graph devicemapper /tmp/dir/graph
ERRO[0000] Failed to GetDriver graph overlay /tmp/dir/graph
ERRO[0000] Failed to GetDriver graph vfs /tmp/dir/graph
{"timestamp":"1449606737.425148010","source":"guardian","message":"guardian.volume-creator.failed-to-construct-graph-driver","log_level":3,"data":{"error":"No supported storage backend found","graphRoot":"/tmp/dir/graph","session":"6","trace":"goroutine 1 [running]:\ngithub.com/pivotal-golang/lager.(*logger).Fatal(0xc2080785a0, 0xa41190, 0x20, 0x7f5eccc41e80, 0xc208095d50, 0x0, 0x0, 0x0)\n\t/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/pivotal-golang/lager/logger.go:131 +0xc8\nmain.wireVolumeCreator(0x7f5eccc46e40, 0xc2080785a0, 0x7ffc7a3b3c47, 0xe, 0xc20806db00)\n\t/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/cloudfoundry-incubator/guardian/cmd/guardian/main.go:343 +0x3fe\nmain.main()\n\t/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/cloudfoundry-incubator/guardian/cmd/guardian/main.go:236 +0x5c4\n"}}
panic: No supported storage backend found

goroutine 1 [running]:
github.com/pivotal-golang/lager.(*logger).Fatal(0xc2080785a0, 0xa41190, 0x20, 0x7f5eccc41e80, 0xc208095d50, 0x0, 0x0, 0x0)
    /tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/pivotal-golang/lager/logger.go:152 +0x5d0
main.wireVolumeCreator(0x7f5eccc46e40, 0xc2080785a0, 0x7ffc7a3b3c47, 0xe, 0xc20806db00)
    /tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/cloudfoundry-incubator/guardian/cmd/guardian/main.go:343 +0x3fe
main.main()
    /tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/cloudfoundry-incubator/guardian/cmd/guardian/main.go:236 +0x5c4

goroutine 5 [syscall]:
os/signal.loop()
    /usr/local/go/src/os/signal/signal_unix.go:21 +0x1f
created by os/signal.init·1
    /usr/local/go/src/os/signal/signal_unix.go:27 +0x35
+ cd src/github.com/cloudfoundry-incubator/garden-integration-tests
++ hostname
+ export GARDEN_ADDRESS=7g89flel121:7777
+ GARDEN_ADDRESS=7g89flel121:7777
+ ginkgo -p -nodes=4
Running Suite: GardenIntegrationTests Suite
===========================================
Random Seed: 1449606738
Will run 112 of 113 specs

Running in parallel across 4 nodes

• Failure in Spec Setup (JustBeforeEach) [0.003 seconds]
Limits [JustBeforeEach]
/tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/cloudfoundry-incubator/garden-integration-tests/limits_test.go:346
  LimitMemory
  /tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/cloudfoundry-incubator/garden-integration-tests/limits_test.go:33
    with a memory limit
    /tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/cloudfoundry-incubator/garden-integration-tests/limits_test.go:32
      when the process writes too much to /dev/shm
      /tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/cloudfoundry-incubator/garden-integration-tests/limits_test.go:31
        is killed
        /tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/cloudfoundry-incubator/garden-integration-tests/limits_test.go:30

        Expected error:
            <*url.Error | 0xc2080b9f50>: {
                Op: "Post",
                URL: "http://api/containers",
                Err: {
                    Op: "dial",
                    Net: "tcp",
                    Addr: {
                        IP: "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xff\n\xfe\x00\x06",
                        Port: 7777,
                        Zone: "",
                    },
                    Err: 0x6f,
                },
            }
            Post http://api/containers: dial tcp 10.254.0.6:7777: connection refused
        not to have occurred

        /tmp/build/ee32fe39-4383-4d2d-50d8-76e8ae42e05a/guardian-release/src/github.com/cloudfoundry-incubator/garden-integration-tests/garden_integration_tests_suite_test.go:52

and more failures like this afterwards.

We saw the error on both a local Concourse lite and also on an AWS concourse worker.

Have we misconfigured something?

Thanks!

cc @dbellotti

Container creation fails after enabling OCI Phase 1

Description

I was trying to push same app with the same version of the python buildpack, and after the enablement of OCI Phase 1, the container for that application cannot be created anymore.

The same happens on boshlite when you enable OCI Phase 1.

Environment

cf-deployment: v1.28.0

Steps to reproduce

App: https://github.com/jszroberto/mkdocs-test

push the app to the boshlite installation and it should work (git clone [email protected]:jszroberto/mkdocs-test.git"
Enable OCI Phase 1 with -o ~/workspace/cf-deployment/operations/experimental/enable-oci-phase-1.yml
Push the app again and you should see the error below

Logs

  2018-08-15T12:44:35.26+0200 [API/1] OUT App instance exited with guid 4ffa5760-c172-494e-a589-70cc524fb6d3 payload: {"instance"=>"256f40f1-9ef7-46c6-76ba-bf84", "index"=>0, "reason"=>"CRASHED", "exit_description"=>"failed to create container: running image plugin create: pulling the image: unpacking layer `sha256:78060342299850c2553a62002006edb88d2aa86e9c27d7373d2ef52a8092e919`: link ./deps/0/conda/pkgs/ncurses-6.1-hf484d3e_0/bin/clear /home/vcap/deps/0/conda/bin/clear: no such file or directory\n: exit status 1", "crash_count"=>2, "crash_timestamp"=>1534329875224712963, "version"=>"d2a9eda1-dbd3-47a5-a44d-7ee2d538f4c5"}
   2018-08-15T12:44:35.26+0200 [CELL/0] OUT Cell c5e766c9-97ee-4005-af9e-aac7b80b1cb2 destroying container for instance 256f40f1-9ef7-46c6-76ba-bf84
   2018-08-15T12:44:35.27+0200 [CELL/0] OUT Cell c5e766c9-97ee-4005-af9e-aac7b80b1cb2 successfully destroyed container for instance 256f40f1-9ef7-46c6-76ba-bf84
   2018-08-15T12:44:35.45+0200 [CELL/0] OUT Cell c5e766c9-97ee-4005-af9e-aac7b80b1cb2 creating container for instance f9ebd6c1-5e9a-4e77-65e9-45c7
   2018-08-15T12:44:54.92+0200 [CELL/0] ERR Cell c5e766c9-97ee-4005-af9e-aac7b80b1cb2 failed to create container for instance f9ebd6c1-5e9a-4e77-65e9-45c7: running image plugin create: pulling the image: unpacking layer `sha256:78060342299850c2553a62002006edb88d2aa86e9c27d7373d2ef52a8092e919`: link ./deps/0/conda/pkgs/ncurses-6.1-hf484d3e_0/bin/clear /home/vcap/deps/0/conda/bin/clear: no such file or directory
   2018-08-15T12:44:54.92+0200 [CELL/0] ERR : exit status 1

Provide any output you think may be useful in understanding/explaining the issue.
The garden log files are found in /var/vcap/sys/log/garden/ on the VM in which the Garden job is running.

Run ordnance survey (from within the VM running the Garden job) using the following command and attach the resulting tar to the issue: curl bit.ly/garden-ordnance-survey -sSfL | bash.

Docker image (if relevant)

Irrelevant

Cause

Unknown

Resolution

Unknown

Support for checkpoint/restore in Garden?

Todos

Please try the following before submitting the issue:

Upgrade Concourse if you are observing a failure in a test
Use latest BOSH stemcell if the problem occurred in a BOSH VM
Use Go 1.6 if the issue has to do with running unit tests locally

Description

I was unable to find this anywhere but is CRIU (Checkpoint/Restore in User-space) supported inside Garden? If not, is there a mechanism to live migrate a container (or parts of a container like active TCP connections)?

Logging and/or test output

Steps to reproduce

Possible AppArmor issue (bosh-cli fails trying to print progressbar to screen/terminal)

Description

We have a Concourse setup running with v1.6.0 and recently upgraded to using garden-runc v0.5.0 for its workers.
We noticed that since the upgrade some of our build jobs on these workers have been failing, to be more specific they fail when they try to run the bosh-cli command bosh upload stemcell <filename>.
During the stemcell upload the bosh-cli will try to print a progressbar to the screen/terminal, and fails with the error message negative argument.:

Verifying stemcell...
File exists and readable                                     OK
Verifying tarball...
Read tarball                                                 OK
Manifest exists                                              OK
Stemcell image file                                          OK
Stemcell properties                                          OK

Stemcell info
-------------
Name:    bosh-openstack-kvm-ubuntu-trusty-go_agent
Version: 3262.5

Checking if stemcell already exists...
No

Uploading stemcell...

negative argument.

Usage: upload stemcell <stemcell_location> [--skip-if-exists] [--fix] [--sha1 SHA1] [--name NAME] [--version VERSION]

I think this might be an AppArmor issue?
While trying to pinpoint the issue, I downgraded our Concourse workers back to garden-runc v0.4.0 that does not yet have AppArmor, and suddenly the bosh-cli was working again perfectly fine, printing the progressbar with no problem.
Or could there be something else where v0.4.0 and v0.5.0 differ that might cause this issue?
What's also weird is that when I use the fly cli to hijack into the container and manually run the bosh-cli command, it will work (on bosh v0.4.0 and v0.5.0)!
Looks to me like it could have something to do with the terminal? Unfortunately I'm no expert at this and could not figure out further what might be the cause.

Steps to reproduce

concourse v1.6.0
garden-runc v0.5.0
bosh-cli v1.3262.4.0
build job that runs bosh stemcell upload <filename>
docker-image used for build job: https://github.com/JamesClonk/concourse-workspace/blob/master/Dockerfile | ubuntu:16.04

setting quota limit for projid 2: function not implemented

NB: Are you currently experiencing this issue? Please reach out to us on the #garden slack channel before attempting any of the temporary resolutions or before destroying affected the cell(s). We haven't been able to reproduce this issue ourselves and would really appreciate an opportunity to do some live debugging! Thanks!

Environment

IaaS: vSphere
Stemcell: 3468.21
Kernel: 4.4.0-111-generic

Symptom

Failed to create container: running image plugin create: making image: creating image: applying disk limits: apply disk limit: <nil>:  setting quota to /var/vcap/data/grootfs/store/unprivileged/images/b1639404-2935-11e8-b467-0ed5f89f718b: setting quota limit for projid 2: function not implemented: exit status 1

Cause

TBC

Resolution

TBC

Temporary workaround

This issue occurs when garden is configured to use GrootFS for image management.
It is possible to fall back to the previous image managemtn tool (garden-shed) by setting the following BOSH property:

garden.deprecated_use_garden_shed = true

NB It is recommended to force a recreate of VMs running garden (typically the diego cell VMs) when switching between GrootFS and garden-shed.

Additional info

guardian         [ERROR] 03/15 16:50:56.27 577.2.1    guardian.create.volume-creator.image-plugin-create.grootfs.create.groot-creating.making-image.overlayxfs-creating-image.applying-quotas.run-tardis.tardis.set-quota.setting-quota-failed
                                                      Error: setting quota limit for projid 2: function not implemented

guardian         [ERROR] 03/15 16:50:56.27 577.2.1    guardian.create.volume-creator.image-plugin-create.grootfs.create.groot-creating.making-image.overlayxfs-creating-image.applying-quotas.run-tardis.tardis-failed
                                                      Error: exit status 1

guardian         [ERROR] 03/15 16:50:56.27 577.2.1    guardian.create.volume-creator.image-plugin-create.grootfs.create.groot-creating.making-image.creating-image-failed
                                                      Error: applying disk limits: apply disk limit: <nil>:  setting quota to /var/vcap/data/grootfs/store/unprivileged/images/b1639404-2935-11e8-b467-0ed5f89f718b: setting quota limit for projid 2: function notimplemented: exit status 1

Is there a way to deploy garden-runc locally without bosh-lite?

Description

I am trying out garden-runc and I have been following the github README. I was wondering if there is an easy way to deploy garden-runc on a local machine without the need for virtual box and bosh-lite?

bosh.io says latest version is v1.0.0; but much newer releases - ok to use?

https://github.com/cloudfoundry/garden-runc-release/releases has a v1.1.1 release but http://bosh.io/releases/github.com/cloudfoundry-incubator/garden-runc-release?all=1 has v1.0.0

@cppforlife is this an issue with bosh.io pipeline or something quirky with https://github.com/cloudfoundry/garden-runc-release repo?

QUESTION: Can the CRIU library be packaged along with the garden-runc release?

Hi there! Thanks for taking time to open an Issue.
Please fill out as much of the following information as you can.

Description

Since runc depends on a CRIU installation to perform checkpoint and restore of running containers, can the criu distribution be packaged along with the release?

Environment

Cloud Foundry

garden-runc-release version: Any
IaaS: Any
Stemcell version: Any
Kernel version: (uname -r prints this information)

Test GitBot issue

A test issue to see if GitBot is working

Garden fails to destroy containers with long filenames

Description

Garden will fail to destroy container with filenames whose path is very long.

Environment

garden-runc-release version: 1.15.1
IaaS: AWS
Stemcell version: bosh-aws-xen-hvm-ubuntu-trusty-go_agent/3586.25
Kernel version: 4.4.0-130-generic

Steps to reproduce

We don't have steps to reproduce but there is the AWS volume of the two cells that exhibited this problem. Let me know if you need access to them.

Logs

{"timestamp":"1531942576.706936121","source":"guardian","message":"guardian.destroy.volumizer.image-plugin-destroy.grootfs.delete.groot-deleting.deleting-image.destroying-image-failed","log_level":2,"data":{"error":"deleting rootfs folder: lstat /var/vcap/data/grootfs/store/unprivileged/images/12055071-9b46-414f-88d8-b6d2bac389b9/diff/tmp/app/public/public/public/public/public/... <about 200 more public's> .../public/public/public/public/public/public/public/public/public/public/public/public/public/public/public/public/public/public/public/public/public/public/public/public/public/public/bootstrap: file name too long","handle":"12055071-9b46-414f-88d8-b6d2bac389b9","id":"12055071-9b46-414f-88d8-b6d2bac389b9","imageID":"12055071-9b46-414f-88d8-b6d2bac389b9","original_timestamp":"2018-07-18T19:36:16.706365696Z","session":"3305.2.1","storePath":"/var/vcap/data/grootfs/store/unprivileged"}}

Cause

We think this is caused by the container creating a filename with a very long path (many nested directories).

Garden fails to restart when using an icmp security group with code/type=-1

Description

I recently added some icmp security groups to my CF environment that utilizes code/type value of -1 (see cloudfoundry/cloud_controller_ng#815). Here is an example rule:

	[	{
			"code": -1,
			"destination": "10.53.248.0/22",
			"protocol": "icmp",
			"type": -1
		}
	]

This appears to work great. However, when I later attempt to bosh redeploy a cell running an application with this security group applied garden fails to restart. With the error below.

Not 100% sure if this is Garden or CAPI. But figured I'd start here since at the very least Garden probably should be accepting iptables config that it cannot restart with?

Logging and/or test output

bulk starter: setting up default chains: iptables: setup-global-chains: + set -o nounset
+ set -o errexit
+ shopt -s nullglob
+ filter_input_chain=w--input
+ filter_forward_chain=w--forward
+ filter_default_chain=w--default
+ filter_instance_prefix=w--instance-
+ nat_prerouting_chain=w--prerouting
+ nat_postrouting_chain=w--postrouting
+ nat_instance_prefix=w--instance-
+ iptables_bin=/var/vcap/packages/iptables/sbin/iptables
+ case "${ACTION}" in
+ setup_filter
+ teardown_filter
+ teardown_deprecated_rules
++ /var/vcap/packages/iptables/sbin/iptables -w -S INPUT
+ rules='-P INPUT ACCEPT
-A INPUT -i w+ -j w--input'
+ echo '-P INPUT ACCEPT
-A INPUT -i w+ -j w--input'
+ sed -e s/-A/-D/ -e 's/\s\+$//'
+ xargs --no-run-if-empty --max-lines=1 /var/vcap/packages/iptables/sbin/iptables -w
+ grep ' -j garden-dispatch'
++ /var/vcap/packages/iptables/sbin/iptables -w -S FORWARD
+ rules='-P FORWARD ACCEPT
-A FORWARD -i w+ -j w--forward'
+ echo '-P FORWARD ACCEPT
-A FORWARD -i w+ -j w--forward'
+ sed -e s/-A/-D/ -e 's/\s\+$//'
+ grep ' -j garden-dispatch'
+ xargs --no-run-if-empty --max-lines=1 /var/vcap/packages/iptables/sbin/iptables -w
+ /var/vcap/packages/iptables/sbin/iptables -w -F garden-dispatch
+ true
+ /var/vcap/packages/iptables/sbin/iptables -w -X garden-dispatch
+ true
++ /var/vcap/packages/iptables/sbin/iptables -w -S w--forward
+ rules='-N w--forward
-A w--forward -i eth0 -j ACCEPT
-A w--forward -j DROP'
+ echo '-N w--forward
-A w--forward -i eth0 -j ACCEPT
-A w--forward -j DROP'
+ xargs --no-run-if-empty --max-lines=1 /var/vcap/packages/iptables/sbin/iptables -w
+ sed -e s/-A/-D/ -e 's/\s\+$//'
+ grep '\-g w--instance-'
++ /var/vcap/packages/iptables/sbin/iptables -w -S
+ rules='-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-N w--default
-N w--forward
-N w--input
-N w--instance-gfulchosvqj
-N w--instance-gfulchosvqj-log
-A INPUT -i w+ -j w--input
-A FORWARD -i w+ -j w--forward
-A w--default -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A w--default -j REJECT --reject-with icmp-port-unreachable
-A w--forward -i eth0 -j ACCEPT
-A w--forward -j DROP
-A w--input -i eth0 -j ACCEPT
-A w--input -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A w--input -j REJECT --reject-with icmp-host-prohibited
-A w--instance-gfulchosvqj -p icmp -m iprange --dst-range 10.53.248.0-10.53.251.255 -m icmp --icmp-type any -m comment --comment 56d562a2-cabd-4def-6340-55dd -j RETURN'
+ echo '-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-N w--default
-N w--forward
-N w--input
-N w--instance-gfulchosvqj
-N w--instance-gfulchosvqj-log
-A INPUT -i w+ -j w--input
-A FORWARD -i w+ -j w--forward
-A w--default -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A w--default -j REJECT --reject-with icmp-port-unreachable
-A w--forward -i eth0 -j ACCEPT
-A w--forward -j DROP
-A w--input -i eth0 -j ACCEPT
-A w--input -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A w--input -j REJECT --reject-with icmp-host-prohibited
-A w--instance-gfulchosvqj -p icmp -m iprange --dst-range 10.53.248.0-10.53.251.255 -m icmp --icmp-type any -m comment --comment 56d562a2-cabd-4def-6340-55dd -j RETURN'
+ xargs --no-run-if-empty --max-lines=1 /var/vcap/packages/iptables/sbin/iptables -w
+ sed -e s/-A/-D/ -e 's/\s\+$//'
+ grep '^-A w--instance-'
iptables: Bad rule (does a matching rule exist in that chain?).

Steps to reproduce

Deploy CF
Create a running security group with an icmp type/code = -1
Deploy an application
Change some cell config and redeploy the cell
See that Garden fails to restart
Garden RunC Release 1.5.0
Ubuntu Stemcell 3363.20

Ducati is not a thing

The diagram here refers to an imaginary component called "Ducati."

Ducati is in the attic. We don't talk about Ducati anymore.

The new thing is CF Networking Release which has a garden external networker that allows any CNI plugin to be a garden networker.

Let's update the diagram!

Insufficient Subnets even when `gaol_linux list` reports fewer than `max_containers`

Todos

Please try the following before submitting the issue:

Upgrade Concourse if you are observing a failure in a test
Use latest BOSH stemcell if the problem occurred in a BOSH VM
[n/a] Use Go 1.6 if the issue has to do with running unit tests locally

Description

Running Concourse jobs were reporting "insufficient subnets remaining in the pool". We were unable to match this error message with a spike in container counts as reported by fly workers. It generally came up when load was high on the system, but it wasn't strictly correlated with metrics like CPU load either.

We went onto the worker VM, and ran gaol_linux list. It reported the same # of containers as fly workers was reporting. We looked at the iptables -S on the worker, and saw many more rules than active containers. In fact, we saw almost exactly 250. Logs attached.

We do not set the following parameters in our manifest, so we expect that the defaults were still:

max_containers: 250
network_pool: 10.254.0.0/22

bosh recreate worker worked to reset the iptables, and then we had a TSA registration issue (probably not a garden problem) and bosh restart web fixed that.

Is bosh recreate worker necessary any time one upgrades the runC version? For indeed, the runC behavior when it goes down should be to try to keep the containers around. But if it comes back up with new version of the software running, does it leave the existing iptables around?

Background on our particular situation:

We had been running Concourse 1.6 with runC 0.4.0 for a long time. We were running it hard, such that the error message "insufficient subnets remaining in the pool" usually signified hitting the container limit of 250 (actually). We bumped from 1.6 + 0.4.0 straight to Concourse 2.2.1 with runC 0.8.0

Logging and/or test output

The iptables -S on a worker VM when it wasn't working (reporting many more rules than containers were supposedly running)

worker-0-iptables.txt

Logs from the garden job (tgz as provided by bosh logs)

https://drive.google.com/file/d/0B2izPb3rj-6AeHZzejhaaXZvMFk/view?usp=sharing

Steps to reproduce

Pretty elaborate; we didn't get a simple reproduction.

Bosh deploy Concourse 1.6 + runC 0.4.0
Run it hard, hit the container limit, observe "insufficient subnets remaining in the pool"
Upgrade to Concourse 2.2.1 with runC 0.8.0
Run it hard, observe "insufficient subnets remaining in the pool" occur when fly workers reports a number of containers less than 250

OCI hooks

Can I use oci hooks with garden?

graph_cleanup_threshold_in_mb is ignored

Hello,
even though we have set the property graph_cleanup_threshold_in_mb in the diego manifest to 20480, the cleanup process doesn't seem to work properly. The following is the output of the ephemeral disk usage of one of our cells:

Instance                                               Process State  AZ  IPs          VM CID               VM Type       Uptime  Load              CPU    CPU    CPU    CPU   Memory        Swap         System      Ephemeral   Persistent

                                                                                                                                  (1m, 5m, 15m)     Total  User   Sys    Wait  Usage         Usage        Disk Usage  Disk Usage  Disk Usage
cell_z2/d1677fee-37a2-4dcd-b988-039164fa7362           running        z2  10.0.137.29  i-00359a04afbe13deb  vm_4cpu_16gb  -       6.89, 6.82, 6.52  -      49.4%  28.9%  0.0%  35% (5.7 GB)  0% (6.1 MB)  39% (31i%)  92% (23i%)  39% (31i%)

ps aux|grep garden | grep graph from one of our cell:

root     3219545  1.4  0.3 3040368 61208 ?       S<l  07:49   4:24 /var/vcap/packages/guardian/bin/gdn server --skip-setup --bind-ip=0.0.0.0 --bind-port=7777 --depot=/var/vcap/data/garden/depot --graph=/var/vcap/data/garden/graph --properties-path=/var/vcap/data/garden/props.json --port-pool-properties-path=/var/vcap/data/garden/port-pool-props.json --iptables-bin=/var/vcap/packages/iptables/sbin/iptables --iptables-restore-bin=/var/vcap/packages/iptables/sbin/iptables-restore --init-bin=/var/vcap/packages/guardian/bin/init --dadoo-bin=/var/vcap/packages/guardian/bin/dadoo --nstar-bin=/var/vcap/packages/guardian/bin/nstar --tar-bin=/var/vcap/packages/tar/tar --newuidmap-bin=/var/vcap/packages/garden-idmapper/bin/newuidmap --newgidmap-bin=/var/vcap/packages/garden-idmapper/bin/newgidmap --log-level=error --mtu=0 --network-pool=10.239.0.0/22 --deny-network=0.0.0.0/0 --destroy-containers-on-startup --debug-bind-ip=127.0.0.1 --debug-bind-port=17005 --default-rootfs=/var/vcap/packages/busybox --default-grace-time=0 --default-container-blockio-weight=0 --graph-cleanup-threshold-in-megabytes=20480 --max-containers=250 --cpu-quota-per-share=0 --tcp-memory-limit=0 --runtime-plugin=/var/vcap/packages/runc/bin/runc --persistent-image=/var/vcap/packages/cflinuxfs2/rootfs --apparmor=garden-default --cleanup-process-dirs-on-wait

du -hcs /var/vcap/data/garden/graph/aufs/diff
67G     total

Docker file permission issue after upgrading PCF

Description

We recently upgraded to PCF 1.12.19, and a (prometheus) docker image that used to work is now failing with file permission issues

Environment

garden-runc-release version: 1.12.1
IaaS: AWS
Stemcell version: light-bosh-stemcell-3468.30-aws-xen-hvm-ubuntu-trusty-go_agent
Kernel version: not sure, can't start container

Steps to reproduce

cf push some-app -o prom/prometheus

Logs

   2018-05-15T09:07:09.16-0400 [CELL/0] OUT Creating container
   2018-05-15T09:07:09.39-0400 [CELL/0] OUT Successfully destroyed container
   2018-05-15T09:07:09.76-0400 [CELL/0] OUT Successfully created container
   2018-05-15T09:07:10.01-0400 [CELL/0] OUT Starting health monitoring of container
   2018-05-15T09:07:10.18-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.179828917Z caller=main.go:220 msg="Starting Prometheus" version="(version=2.2.1, branch=HEAD, revision=bc6058c81272a8d938c05e75607371284236aadc)"
   2018-05-15T09:07:10.18-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.179928593Z caller=main.go:221 build_context="(go=go1.10, user=root@149e5b3f0829, date=20180314-14:15:45)"
   2018-05-15T09:07:10.18-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.179978934Z caller=main.go:222 host_details="(Linux 4.4.0-119-generic #143~14.04.1-Ubuntu SMP Mon Apr 2 18:04:36 UTC 2018 x86_64 b939328f-96c4-4131-61b3-d721 (none))"
   2018-05-15T09:07:10.18-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.180063828Z caller=main.go:223 fd_limits="(soft=16384, hard=16384)"
   2018-05-15T09:07:10.18-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.189856754Z caller=main.go:504 msg="Starting TSDB ..."
   2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.19017001Z caller=web.go:382 component=web msg="Start listening for connections" address=0.0.0.0:9090
   2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.190173907Z caller=main.go:398 msg="Stopping scrape discovery manager..."
   2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.190233993Z caller=main.go:411 msg="Stopping notify discovery manager..."
   2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.190271771Z caller=main.go:432 msg="Stopping scrape manager..."
   2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.190361913Z caller=manager.go:460 component="rule manager" msg="Stopping rule manager..."
   2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.190478723Z caller=manager.go:466 component="rule manager" msg="Rule manager stopped"
   2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.190605644Z caller=notifier.go:512 component=notifier msg="Stopping notification manager..."
   2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.190643211Z caller=main.go:407 msg="Notify discovery manager stopped"
   2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.190861107Z caller=main.go:394 msg="Scrape discovery manager stopped"
   2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.191290319Z caller=main.go:426 msg="Scrape manager stopped"
   2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.191310256Z caller=main.go:573 msg="Notifier manager stopped"
   2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=error ts=2018-05-15T13:07:10.194150223Z caller=main.go:582 err="Opening storage failed open DB in /prometheus: open /prometheus/518918001: permission denied"
   2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=info ts=2018-05-15T13:07:10.194206722Z caller=main.go:584 msg="See you next time!"
   2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] OUT Exit status 0
   2018-05-15T09:07:10.19-0400 [CELL/0] OUT Exit status 0
   2018-05-15T09:07:10.21-0400 [CELL/0] OUT Stopping instance b939328f-96c4-4131-61b3-d721
   2018-05-15T09:07:10.21-0400 [CELL/0] OUT Destroying container
   2018-05-15T09:07:10.23-0400 [API/1] OUT Process has crashed with type: "web"

2018-05-15T09:07:10.19-0400 [APP/PROC/WEB/0] ERR level=error ts=2018-05-15T13:07:10.194150223Z caller=main.go:582 err="Opening storage failed open DB in /prometheus: open /prometheus/518918001: permission denied"

Docker image (if relevant)

prom/prometheus

Cause

unsure, this worked before the PCF upgrade and fails afterwards

Resolution

???

Error occurs while Installing Cloud Foundry, DIEGO on CentOs.

Hi,

An error occured while installing Cloud Foundry, DIEGO on CentOs..

The versions i have used for Release and BOSH are listed below.
+++++++++++++++++++++++++++++++++++
BOSH-CLI v2

BOSH-RELEASE 264.7.0

CF-RELEASE 287

DIEGO-RELEASE 1.34.0

GARDEN-RUNC 1.11.0

CFLINUX-RELEASE 1.185.0
+++++++++++++++++++++++++++++++++++++
And I used Stemcell image v 3468.21 in CentOs environment.

This settings have worked on Ubuntu. Apparently not on CentOs.

When errors occur while installing Cloud Foundry, it was solved as i had access to logs and script files could be edited.
However, i have no idea how to deal with DIEGO installation issues.

My major quetions are:

1. Does Cloud Foundry& Diego support Centos?

2. Is it able to use "apparmor-paser" while installing DIEGO Cell on CentOs?

In case of normal installation, it throws an error saying "apparmor_parser: command not found"

If i install package directly on DIEGO cell VM and execute : apparmor_parser -r "/var/vcap/jobs/garden/config/garden-default" >> it responds with "no cache value on read/write"

3. If i try to install without the "apparmor" section by annotating them out:

rep throws an error
#######################
Failed: 'cell_z1/d30f7487-f7ea-42fa-b9a6-a6af4a9f6c6b (0)' is not running after update. Review logs for failed jobs: rep (00:04:21)

Error 400007: 'cell_z1/d30f7487-f7ea-42fa-b9a6-a6af4a9f6c6b (0)' is not running after update. Review logs for failed jobs: rep
#######################

Log as below is found at rep_ctl.log
#######################
Not setting /proc/sys/net/ipv4 parameters, since I'm running inside a linux container
#######################

Log as below is found at garden.stdout.log
#######################
Version:1.0 StartHTML:0000000105 EndHTML:0000028323 StartFragment:0000000433 EndFragment:0000028283
{"timestamp":"1528181050.931253672","source":"guardian","message":"guardian.create.create-failed-cleaningup.volumizer.image-plugin-destroy.start","log_level":0,"data":{"cause":"runc run: exit status 1: container_linux.go:348: starting container process caused "process_linux.go:301: running exec setns process for init caused \"exit status 40\""\n","handle":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","session":"476.4.2.1"}}
{"timestamp":"1528181050.945348501","source":"guardian","message":"guardian.create.create-failed-cleaningup.volumizer.image-plugin-destroy.grootfs.delete.groot-deleting.starting","log_level":1,"data":{"cause":"runc run: exit status 1: container_linux.go:348: starting container process caused "process_linux.go:301: running exec setns process for init caused \"exit status 40\""\n","handle":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","imageID":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","original_timestamp":"2018-06-05T06:44:10.945072896Z","session":"476.4.2.1"}}
{"timestamp":"1528181050.945517778","source":"guardian","message":"guardian.create.create-failed-cleaningup.volumizer.image-plugin-destroy.grootfs.delete.groot-deleting.deleting-image.starting","log_level":1,"data":{"cause":"runc run: exit status 1: container_linux.go:348: starting container process caused "process_linux.go:301: running exec setns process for init caused \"exit status 40\""\n","handle":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","id":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","imageID":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","original_timestamp":"2018-06-05T06:44:10.94520064Z","session":"476.4.2.1","storePath":"/var/vcap/data/grootfs/store/unprivileged"}}
{"timestamp":"1528181050.945569754","source":"guardian","message":"guardian.create.create-failed-cleaningup.volumizer.image-plugin-destroy.grootfs.delete.groot-deleting.deleting-image.overlayxfs-destroying-image.starting","log_level":1,"data":{"cause":"runc run: exit status 1: container_linux.go:348: starting container process caused "process_linux.go:301: running exec setns process for init caused \"exit status 40\""\n","handle":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","id":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","imageID":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","imagePath":"/var/vcap/data/grootfs/store/unprivileged/images/executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","original_timestamp":"2018-06-05T06:44:10.945269248Z","session":"476.4.2.1","storePath":"/var/vcap/data/grootfs/store/unprivileged"}}
{"timestamp":"1528181050.953394413","source":"guardian","message":"guardian.create.create-failed-cleaningup.volumizer.image-plugin-destroy.grootfs.delete.groot-deleting.deleting-image.overlayxfs-destroying-image.ending","log_level":1,"data":{"cause":"runc run: exit status 1: container_linux.go:348: starting container process caused "process_linux.go:301: running exec setns process for init caused \"exit status 40\""\n","handle":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","id":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","imageID":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","imagePath":"/var/vcap/data/grootfs/store/unprivileged/images/executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","original_timestamp":"2018-06-05T06:44:10.953026048Z","session":"476.4.2.1","storePath":"/var/vcap/data/grootfs/store/unprivileged"}}
{"timestamp":"1528181050.953512192","source":"guardian","message":"guardian.create.create-failed-cleaningup.volumizer.image-plugin-destroy.grootfs.delete.groot-deleting.deleting-image.ending","log_level":1,"data":{"cause":"runc run: exit status 1: container_linux.go:348: starting container process caused "process_linux.go:301: running exec setns process for init caused \"exit status 40\""\n","handle":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","id":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","imageID":"executor-healthcheck-5be43913-6810-4018-59a3-8bf1af14555c","original_timestamp":"2018-06-05T06:44:10.953161728Z","session":"476.4.2.1","storePath":"/var/vcap/data/grootfs/store/unprivileged"}}
#######################

Anyone familer with this issue? Thank you in advance.

disk usage of app running with docker image

Hi there! Thanks for taking time to open an Issue.
Please fill out as much of the following information as you can.

Description

Pushing an app with docker image would fail if default disk quota (1G) is used since the docker image size is about 1.3G.

[root@mdw ~]# CF_DOCKER_PASSWORD=xxxx cf push img1 --docker-username xxxx  --docker-image scottgai/sgai-repo:img1 scottgai -t 600
   2018-06-25T23:29:06.20-0400 [CELL/0] OUT Cell 21c1ae94-0f86-4dc5-83ec-7b76dd2dba58 creating container for instance da9af175-a04a-4079-49f7-89d5
   2018-06-25T23:29:07.88-0400 [CELL/0] ERR Cell 21c1ae94-0f86-4dc5-83ec-7b76dd2dba58 failed to create container for instance da9af175-a04a-4079-49f7-89d5: running image plugin create: making image: creating image: applying disk limits: disk limit is smaller than volume size
   2018-06-25T23:29:07.88-0400 [CELL/0] ERR : exit status 1
   2018-06-25T23:29:07.88-0400 [CELL/0] OUT Cell 21c1ae94-0f86-4dc5-83ec-7b76dd2dba58 destroying container for instance da9af175-a04a-4079-49f7-89d5
   2018-06-25T23:29:07.88-0400 [CELL/0] OUT Cell 21c1ae94-0f86-4dc5-83ec-7b76dd2dba58 successfully destroyed container for instance da9af175-a04a-4079-49f7-89d5
   2018-06-25T23:29:07.90-0400 [API/0] OUT Process has crashed with type: "web"
   2018-06-25T23:29:07.90-0400 [API/0] OUT App instance exited with guid 6b783e7e-4a21-4d52-8a6b-a99b7147ed22 payload: {"instance"=>"da9af175-a04a-4079-49f7-89d5", "index"=>0, "reason"=>"CRASHED", "exit_description"=>"failed to initialize container: running image plugin create: making image: creating image: applying disk limits: disk limit is smaller than volume size\n: exit status 1", "crash_count"=>3, "crash_timestamp"=>1529983747883330324, "version"=>"a6bc244b-30cb-49f5-8f22-0778392d1b73"}

cf push with disk quota 2G was successful. However cf app <app name> only shows very small value of disk usage ("52K of 2G" in following test. seems only show size of platform overhead excluding docker image size)

[root@mdw ~]# CF_DOCKER_PASSWORD=xxxx cf push img1 --docker-username xxxx  --docker-image scottgai/sgai-repo:img1 scottgai -t 600 -k 2G
Using docker repository password from environment variable CF_DOCKER_PASSWORD.
Pushing app img1 to org org1 / space dev as admin...
Getting app info...
Updating app with these attributes...
  name:                   img1
  docker image:           scottgai/sgai-repo:img1
  docker username:        scottgai
  command:                /docker-entrypoint.sh httpd -k start -D FOREGROUND
- disk quota:             1G
+ disk quota:             2G
  health check timeout:   600
  health check type:      port
  instances:              1
  memory:                 1G
  stack:                  cflinuxfs2
  routes:
    img1.apps-03.haas-50.pez.pivotal.io

Updating app img1...
Mapping routes...

Stopping app...

Waiting for app to start...

name:              img1
requested state:   started
instances:         1/1
usage:             1G x 1 instances
routes:            img1.apps-03.haas-50.pez.pivotal.io
last uploaded:     Mon 25 Jun 23:27:32 EDT 2018
stack:             cflinuxfs2
docker image:      scottgai/sgai-repo:img1
start command:     /docker-entrypoint.sh httpd -k start -D FOREGROUND

     state     since                  cpu    memory      disk        details
#0   running   2018-06-26T03:48:35Z   0.0%   25M of 1G   52K of 2G

Is the disk usage being displayed correctly?

Environment

garden-runc-release version: 1.13.1
IaaS: vSphere
Stemcell version: 3468.46
Kernel version: 4.4.0-127-generic

If known, provide the resolution to the issue here.

Capable to run Container in host's network namespace

Description

Thanks to @williammartin

from post opencontainers/runc#1552 , runC can be optional to run in host network mode (just like docker-engine's --network="host" ) as @cyphar describe.

But per @williammartin , Garden doesn't expose this feasibility.

as a Concourse user, I think it's typical to run a container in host namespace , so that to test external devices/systems via physical machine NIC.

I believe Garden should provide this capability, then Concourse can leverage this new power.

Concourse issue: concourse/concourse#1455

Guardian release version : NA
Linux kernel version : 4.4.0-21-generic
Concourse version: 3.3.4
Go version: NA

'ps aux' hangs on cell

Description

It is not necessarily an issue, just an observation. We experienced that ps aux hangs on a cell. This is caused when accessing information about a java app process.

Environment

garden-runc-release version: 1.15.1
IaaS: AWS
Stemcell version: bosh-aws-xen-hvm-ubuntu-trusty-go_agent/3586.36
Kernel version: 4.4.0-133-generic

Steps to reproduce

We didn't try to reproduce the issue yet.

Cause

strace output

strace cat /proc/3981001/environ
execve("/bin/cat", ["cat", "/proc/3981001/environ"], [/* 25 vars */]) = 0
brk(0)                                  = 0xe36000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=23920, ...}) = 0
mmap(NULL, 23920, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0d18bac000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P \2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1857312, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0d18bab000
mmap(NULL, 3965632, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0d185c7000
mprotect(0x7f0d18785000, 2097152, PROT_NONE) = 0
mmap(0x7f0d18985000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1be000) = 0x7f0d18985000
mmap(0x7f0d1898b000, 17088, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0d1898b000
close(3)                                = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0d18ba9000
arch_prctl(ARCH_SET_FS, 0x7f0d18ba9740) = 0
mprotect(0x7f0d18985000, 16384, PROT_READ) = 0
mprotect(0x60a000, 4096, PROT_READ)     = 0
mprotect(0x7f0d18bb2000, 4096, PROT_READ) = 0
munmap(0x7f0d18bac000, 23920)           = 0
brk(0)                                  = 0xe36000
brk(0xe57000)                           = 0xe57000
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=1607664, ...}) = 0
mmap(NULL, 1607664, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0d18a20000
close(3)                                = 0
fstat(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 26), ...}) = 0
open("/proc/3981001/environ", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0400, st_size=0, ...}) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
read(3,

cat /proc/3981001/status
Name:   java
State:  S (sleeping)
Tgid:   3981001
Ngid:   0
Pid:    3981001
PPid:   3980961
TracerPid:      0
Uid:    2000    2000    2000    2000
Gid:    2000    2000    2000    2000
FDSize: 4096
Groups:
NStgid: 3981001 9
NSpid:  3981001 9
NSpgid: 3981001 9
NSsid:  3981001 9
VmPeak:  1153380 kB
VmSize:  1076732 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:    309756 kB
VmRSS:    298700 kB
VmData:   961080 kB
VmStk:       136 kB
VmExe:         4 kB
VmLib:     26380 kB
VmPTE:       872 kB
VmPMD:        24 kB
VmSwap:    11096 kB
HugetlbPages:          0 kB
Threads:        37
SigQ:   0/128591
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 2000000181005ccf
CapInh: 00000000a80425fb
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
Seccomp:        2
Speculation_Store_Bypass:       vulnerable
Cpus_allowed:   00ff
Cpus_allowed_list:      0-7
Mems_allowed:   00000000,00000001
Mems_allowed_list:      0
voluntary_ctxt_switches:        55
nonvoluntary_ctxt_switches:     15

The process is in state S

The process is also constantly consuming 1G of memory (probably the limit of the app). Not necessarily an issue since its an java app:

3981001 cvcap     10 -10 1076732 298696  21848 S  23.9  0.9 186:41.76 java

The cpu usage on the cell is quite low

%Cpu(s):  1.3 us,  4.0 sy,  0.0 ni, 94.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.1 st

The load is very high (might be unrelated):

 load average: 36.16, 36.33, 36.13

GQT failures

Hi there,

I'm running

./scripts/test

against my local Concourse Lite, and seeing some test failures in GQT:

Summarizing 4 Failures:

[Fail] Run running a process [It] with an absolute path 
/tmp/build/7543102e-d686-49fe-615d-24086572a169/guardian-release/src/github.com/cloudfoundry-incubator/guardian/gqt/run_test.go:62

[Fail] Net a second container [It] should have the next IP address 
/tmp/build/7543102e-d686-49fe-615d-24086572a169/guardian-release/src/github.com/cloudfoundry-incubator/guardian/gqt/net_test.go:98

[Fail] Net [It] should have a loopback interface 
/tmp/build/7543102e-d686-49fe-615d-24086572a169/guardian-release/src/github.com/cloudfoundry-incubator/guardian/gqt/net_test.go:45

[Fail] Net [It] should have a (dynamically assigned) IP address 
/tmp/build/7543102e-d686-49fe-615d-24086572a169/guardian-release/src/github.com/cloudfoundry-incubator/guardian/gqt/net_test.go:60

Ran 19 of 19 Specs in 17.812 seconds
FAIL! -- 15 Passed | 4 Failed | 0 Pending | 0 Skipped

Full output here

Any tips on how I can get the tests passing?

Thank you.

cc @zachgersh

App Instance uses all tcp memory of the cell

In our CF deployment, we experienced a situation where one app instance was
able to slow down dramatically the network traffic in a diego cell.

During investigation, we observed that the app was using all the available tcp
memory on the cell.

In /var/log/kern.log there were errors “TCP: out of memory”.
We observed that the tcp window was small observed through tcpdump. New connections were possible to be created, but no traffic due to the TCP window.

When we stopped the app instance everything was back to normal.

TCP memory was monitored through the /proc/net/sockstat file.

Interestingly, due to not-well written app, we observed high TCP memory usage
without seeing large number of TCP connection.

We are mitigating currently the issue by increasing the TCP memory and by restarting the cell (or the app)
cat /proc/sys/net/ipv4/tcp_mem is the actual hard-limit we have per cell which currently is 20% of the total memory.

It will be great to limit how much tcp memory a container can use, as well as to have some statistics per container.

Version 1.12 can no longer set DNS value as a script to retrieve private IP address.

Description

Unable to provide script to configure additional DNS servers in 1.12 and above. This is done so we have an additional DNS server running on the cells from some additional requirements. This listens on the private IP and not on localhost (which is listened by consul (or in the near future bosh-dns))

Environment

garden-runc-release version: 1.12
IaaS: SoftLayer
Stemcell version: 3468.17.2
Kernel version: 4.4.0-109-generic

Steps to reproduce

Provide the following in the DNS entry:

    dns_servers:
    - $(ifconfig eth0|grep -oP 'inet addr:\K([0-9.]+)')

Logs

Error in garden.stderr.log ...

/var/vcap/jobs/garden/config/config.ini:38: invalid IP: '$(ifconfig eth0|grep -oP 'inet addr:\K([0-9.]+)')'

Cause

Change to use config.ini instead of passing this in a script causes the script we used to fail.

Resolution

Busybox blob is unversioned

In config/blobs.yml, one of the blobs being used is busybox/busybox.tar.gz. Is there any chance someone can add a version to that blob to match the rest of the blobs in the yml? We're unable to determine what version of busybox is being used.

Review logs for failed jobs: garden

Description

During an upgrade of Cloud Foundry from cf-deployment v1.8.0 to v1.9.0, the BOSH deployment failed with the following error:

Error: 'diego_cell/a8964832-a179-4496-b663-31687fbd564b (2)' is not running after update. Review logs for failed jobs: garden

The deployment succeeded after I kicked off the deployment again.

Environment

garden-runc-release version: The upgrade of 1.10.0 -> 1.11.0
IaaS: Azure
Stemcell version: bosh-azure-hyperv-ubuntu-trusty-go_agent v3541.9
Kernel version: 4.4.0-116-generic

Steps to reproduce

Deploy Cloud Foundry using cf-deployment v1.8.0
Upgrade Cloud Foundry to cf-deployment v1.9.0

Logs

Task 147

Task 147 | 15:31:59 | Preparing deployment: Preparing deployment (00:00:23)
Task 147 | 15:33:25 | Preparing package compilation: Finding packages to compile (00:00:01)
Task 147 | 15:33:30 | Updating instance diego_cell: diego_cell/270ec524-a205-440d-a16f-d7ca514a4115 (1) (canary) (00:02:15)
Task 147 | 15:35:45 | Updating instance diego_cell: diego_cell/a8964832-a179-4496-b663-31687fbd564b (2) (00:06:19)
                    L Error: 'diego_cell/a8964832-a179-4496-b663-31687fbd564b (2)' is not running after update. Review logs for failed jobs: garden
Task 147 | 15:42:05 | Error: 'diego_cell/a8964832-a179-4496-b663-31687fbd564b (2)' is not running after update. Review logs for failed jobs: garden

Cause

This issue occurs when the garden BOSH job does not startup within the default 30 second monit timeout. This occurs more frequently when garden is running on slow IaaS disks (Azure seems to be particularly sensitive to this issue). You can determine if garden is hitting the timeout or not by looking in the monit.log:

$ grep garden /var/vcap/monit/monit.log
[UTC Mar 21 15:25:00] info     : 'garden' start: /bin/sh
[UTC Mar 21 15:55:00] error    : 'garden' failed to start

Resolution

The monit timeout has been bumped to 2 minutes. This should be more than enough time for garden to start up, even on slower environments. This fix is available as of garden-runc-release v1.12.1.

Garden graph out of disk space

After upgrading from GRR 1.11 to 1.12.21, apps started to fail with out of disk errors.

Environment

garden-runc-release version: 1.11 -> 1.12.21

Steps to reproduce

Use most of garden disk pre-upgrade to 1.12
Perform 1.12 upgrade
Deploy more applications, deployments will start to fail
Observe on the host that there seems to be adequate space in the graph

Cause

Not entirely sure, but it seems like there are two different graphs now:
/var/vcap/data/grootfs/store/{unprivileged,privileged} and /var/vcap/data/garden/graph/aufs.
Both seem to have files in them.
The GrootFS one looks new, is it possible that cleanup has failed? Or the versions are conflicting somehow?

bosh.io sha1 mismatch

Tried deploying garden-runc-release from bosh.io but getting sha1 mismatches:

Checking whether release garden-runc/1.0.0 already exists...NO
Using remote release 'https://bosh.io/d/github.com/cloudfoundry-incubator/garden-runc-release?v=1.0.0'

Director task 42532
  Started downloading remote release > Downloading remote release. Done (00:00:09)

  Started verifying remote release > Verifying remote release. Failed: Release SHA1 `41a9407af7eaa412a6a2dca2cbf29f2623d4e12f' does not match the expected SHA1 `2f29b4d9eedbe6b79efe3703b68a4d3c785649f4' (00:00:01)

Error 30015: Release SHA1 `41a9407af7eaa412a6a2dca2cbf29f2623d4e12f' does not match the expected SHA1 `2f29b4d9eedbe6b79efe3703b68a4d3c785649f4'

Issue with release or https://bosh.io/releases/github.com/cloudfoundry/garden-runc-release?version=1.0.0 or am I doing something wrong?

/cc @julz @cppforlife

Failed to start container with empty /etc/

Description

When using a container with an empty /etc/ directory, the application crashes starting with a cryptic message:

[CELL/0] ERR Cell 6bdffa08-608d-4740-af08-3b289d433ec5 failed to create container for instance d52fd44a-eabb-4bdd-4360-6185: runc run: exit status 1: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"rootfs_linux.go:58: mounting \\\"/var/vcap/packages/healthcheck\\\" to rootfs \\\"/var/vcap/data/grootfs/store/unprivileged/images/d52fd44a-eabb-4bdd-4360-6185/rootfs\\\" at \\\"/var/vcap/data/grootfs/store/unprivileged/images/d52fd44a-eabb-4bdd-4360-6185/rootfs/etc/cf-assets/healthcheck\\\" caused \\\"mkdir /var/vcap/data/grootfs/store/unprivileged/images/d52fd44a-eabb-4bdd-4360-6185/rootfs/etc/cf-assets: permission denied\\\"\""

Environment

garden-runc-release version: 1.13.3
IaaS: AWS
Stemcell version: 3586.24
Kernel version:

$ uname -a
Linux c763447a-1984-41d5-b2fd-586c47afe31b 4.4.0-128-generic #154~14.04.1-Ubuntu SMP Fri May 25 14:58:51 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Steps to reproduce

Create a container with an empty /etc/ directory (e.g. keymon/empty-etc)
Start the container: cf push test-container -o keymon/empty-etc

Logs

$ cf push test-container -o keymon/empty-etc
Pushing app test-container to org admin / space billing as admin...
Getting app info...
Updating app with these attributes...
  name:                test-container
- docker image:        hector/empty-etc
+ docker image:        keymon/empty-etc
  disk quota:          1G
  health check type:   port
  instances:           1
  memory:              1G
  stack:               cflinuxfs2
  routes:
    test-container.hector.dev.cloudpipelineapps.digital

Updating app test-container...
Mapping routes...

Staging app and tracing logs...
   Cell cde8f71a-a756-4891-b130-8f6ac230eb59 successfully created container for instance 48f69ea7-aaba-4845-b2f5-90761daa7c46
   Staging...
   Staging process started ...
   Staging process finished
   Exit status 0
   Staging Complete
   Cell cde8f71a-a756-4891-b130-8f6ac230eb59 stopping instance 48f69ea7-aaba-4845-b2f5-90761daa7c46
   Cell cde8f71a-a756-4891-b130-8f6ac230eb59 destroying container for instance 48f69ea7-aaba-4845-b2f5-90761daa7c46
   Cell cde8f71a-a756-4891-b130-8f6ac230eb59 successfully destroyed container for instance 48f69ea7-aaba-4845-b2f5-90761daa7c46

Waiting for app to start...
Start unsuccessful

TIP: use 'cf logs test-container --recent' for more information
FAILED

$ cf logs test-container --recent
....
   2018-07-03T09:27:05.42+0100 [CELL/0] OUT Cell cde8f71a-a756-4891-b130-8f6ac230eb59 creating container for instance afd31154-83ac-425a-66d9-2df0
   2018-07-03T09:27:07.07+0100 [CELL/0] ERR Cell cde8f71a-a756-4891-b130-8f6ac230eb59 failed to create container for instance afd31154-83ac-425a-66d9-2df0: runc run: exit status 1: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"rootfs_linux.go:58: mounting \\\"/var/vcap/packages/healthcheck\\\" to rootfs \\\"/var/vcap/data/grootfs/store/unprivileged/images/afd31154-83ac-425a-66d9-2df0/rootfs\\\" at \\\"/var/vcap/data/grootfs/store/unprivileged/images/afd31154-83ac-425a-66d9-2df0/rootfs/etc/cf-assets/healthcheck\\\" caused \\\"mkdir /var/vcap/data/grootfs/store/unprivileged/images/afd31154-83ac-425a-66d9-2df0/rootfs/etc/cf-assets: permission denied\\\"\""
   2018-07-03T09:27:07.08+0100 [CELL/0] OUT Cell cde8f71a-a756-4891-b130-8f6ac230eb59 destroying container for instance afd31154-83ac-425a-66d9-2df0
   2018-07-03T09:27:07.08+0100 [CELL/0] OUT Cell cde8f71a-a756-4891-b130-8f6ac230eb59 successfully destroyed container for instance afd31154-83ac-425a-66d9-2df0
   2018-07-03T09:27:07.11+0100 [API/0] OUT Process has crashed with type: "web"
   2018-07-03T09:27:07.11+0100 [API/0] OUT App instance exited with guid 87132572-9f9c-42d3-afdf-258137144597 payload: {"instance"=>"afd31154-83ac-425a-66d9-2df0", "index"=>0, "reason"=>"CRASHED", "exit_description"=>"failed to initialize container: runc run: exit status 1: container_linux.go:348: starting container process caused \"process_linux.go:402: container init caused \\\"rootfs_linux.go:58: mounting \\\\\\\"/var/vcap/packages/healthcheck\\\\\\\" to rootfs \\\\\\\"/var/vcap/data/grootfs/store/unprivileged/images/afd31154-83ac-425a-66d9-2df0/rootfs\\\\\\\" at \\\\\\\"/var/vcap/data/grootfs/store/unprivileged/images/afd31154-83ac-425a-66d9-2df0/rootfs/etc/cf-assets/healthcheck\\\\\\\" caused \\\\\\\"mkdir /var/vcap/data/grootfs/store/unprivileged/images/afd31154-83ac-425a-66d9-2df0/rootfs/etc/cf-assets: permission denied\\\\\\\"\\\"\"\n", "crash_count"=>7, "crash_timestamp"=>1530606427101357593, "version"=>"a0f025d4-e7cf-4b18-8093-5196e1569518"}

Docker image (if relevant)

keymon/empty-etc

Cause

It might be an issue with the layered fs, where an empty directory does not allow to mount a fs, or gets created with the wrong permissions.

Resolution

If instead you drop any file into /etc/, it does not throw that error, but one complaining about the missing /etc/passwd (which is fine).

FROM cbaines/govuk-mini-environment-admin:release_3-test2
RUN touch /etc/foo

unable to find user root: no matching entries in passwd file

Then if we add a valid /etc/passwd the image works.

Garden failed to clean up containers on startup

Description

On a re-deploy a diego-cell took 51 min to come back up. The rep left quite a few containers laying around after evacuation. On start-up of the cell after the update we saw that Garden deleted peas containers but not application containers that were orphaned.

We expected garden to have gone into a clean up loop to delete and destroy all the orphaned containers on the cell.

We know that destroy_containers_on_start is set to true, as such we do expect to see log lines related to: https://github.com/cloudfoundry/guardian/blob/6101577ecf3aa2a29e109a2a1380314cacc7df37/gardener/gardener.go#L479
but we do not.

However we see that garden did not delete any containers, and the slow startup time was due to the rep's own clean up loop running.

We are puzzled as to how we can end up in this state where garden is not able to immediately find the number of orphaned containers but later upon the rep's request give out a number of orphaned containers?

Environment

garden-runc-release version: 1.10.0
IaaS: unknown
Stemcell version: 3468.21+
Kernel version: unknown

Steps to reproduce

[...]

Logs

Some log highlights:

{"timestamp":"1529914162.323287249","source":"guardian","message":"guardian.start.starting","log_level":1,"data":{"session":"4"}}
{"timestamp":"1529914162.323378801","source":"guardian","message":"guardian.start.clean-all-peas.start","log_level":1,"data":{"session":"4.1"}}
[...]
{"timestamp":"1529914162.389882088","source":"guardian","message":"guardian.start.completed","log_level":1,"data":{"session":"4"}}
[...]
{"timestamp":"1529914162.395313263","source":"guardian","message":"guardian.list-containers.starting","log_level":1,"data":{"session":"5"}}
[...]
{"timestamp":"1529914162.399815083","source":"guardian","message":"guardian.list-containers.finished","log_level":1,"data":{"session":"5"}}
[...]
{"timestamp":"1529914162.427858829","source":"guardian","message":"guardian.started","log_level":1,"data":{"addr":"/var/vcap/data/garden/garden.sock","network":"unix"}}
[...]
{"timestamp":"1529914164.685106754","source":"guardian","message":"guardian.list-containers.starting","log_level":1,"data":{"session":"6"}}
{"timestamp":"1529914164.685930967","source":"guardian","message":"guardian.list-containers.finished","log_level":1,"data":{"session":"6"}}
{"timestamp":"1529914164.686795473","source":"guardian","message":"guardian.destroy.start","log_level":1,"data":{"handle":"0016fb0d-1ab7-4c43-7d5a-5004","session":"7"}}


{"timestamp":"1529914164.682427406","source":"rep","message":"rep.wait-for-garden.ping-garden","log_level":1,"data":{"initialTime:":"2018-06-25T08:09:24.682211759Z","session":"1","wait-time-ns:":207110}}
{"timestamp":"1529914164.684116840","source":"rep","message":"rep.wait-for-garden.ping-garden-success","log_level":1,"data":{"initialTime:":"2018-06-25T08:09:24.682211759Z","session":"1","wait-time-ns:":1885043}}
{"timestamp":"1529914164.684298515","source":"rep","message":"rep.executor-fetching-containers-to-destroy","log_level":1,"data":{}}
{"timestamp":"1529914164.686308384","source":"rep","message":"rep.executor-fetched-containers-to-destroy","log_level":1,"data":{"num-containers":58}}
{"timestamp":"1529914164.686429977","source":"rep","message":"rep.executor-destroying-container","log_level":1,"data":{"container-handle":"0016fb0d-1ab7-4c43-7d5a-5004"}}
{"timestamp":"1529914225.345789671","source":"rep","message":"rep.executor-destroyed-stray-container","log_level":1,"data":{"handle":"0016fb0d-1ab7-4c43-7d5a-5004"}}

We can provide more logs directly via secure file exchange.

Cause

TBD... maybe resource starvation?

Resolution

TBD

@sunjayBhatia

bosh fails to render ERB templates in recent garden-runc releases

Description

Trying to deploy garden-runc 1.12.1 today, we hit the following as bosh tried to render the ERB templates:

Task 2485 | 09:16:12 | Error: Unable to render instance groups for deployment. Errors are:
   - Unable to render jobs for instance group 'worker'. Errors are:
     - Unable to render templates for job 'garden'. Errors are:
       - Error filling in template '(unknown)' (line (unknown): garden/bin/garden_start.erb:4: syntax error, unexpected ';'
      -; _erbout.concat "\n#!/usr/bin...
        ^)
       - Error filling in template '(unknown)' (line (unknown): garden/config/config.ini.erb:31: syntax error, unexpected ';'
      -; _erbout.concat "\n[server]\n; apparmor\n"
        ^
      garden/config/config.ini.erb:34: syntax error, unexpected ';'
      ;  if apparmor_profile_provided -; _erbout.concat "\n  apparmor = "
                                        ^
      garden/config/config.ini.erb:36: syntax error, unexpected keyword_end, expecting end-of-input
      ;  end -; _erbout.concat "\n\n; bin...
            ^)

We're guessing that the templates have begun to use ERB syntax that is only supported by recent versions of bosh. We're on v257.3. We were previously able to deploy garden-runc 1.9.0 so the breaking change must been introduced somewhere between 1.9.0 and 1.12.1.

Environment

garden-runc-release version: 1.12.1
IaaS: vsphere
Stemcell version: bosh-vsphere-esxi-ubuntu-trusty-go_agent/3468.17
Kernel version: 4.4.0-109-generic

Steps to reproduce

Attempt to deploy garden-runc 1.12.1 with bosh v257.3

Resolution

We'll attempt to upgrade our bosh. But it'd be helpful if garden could publish a minimum version requirement, or maybe the ERB could be reworked to allow the garden-runc release to work on older boshes?

Deleted files are listed, but inaccessible in Docker images using grootfs

Description

We have a Docker image where we delete some files during the build. Those files are still listed on the filesystem, but inacessible.

The same Docker image worked on 1.12.1.

Environment

garden-runc-release version: 1.13.1
IaaS: Custom (GOV.UK GDS PaaS)
Stemcell version: 3541.12
Kernel version: 4.4.0-119-generic

We are using cf-deployment 1.28.0 at the moment and don't change any of the grootfs settings.

Steps to reproduce

Dockerfile:

FROM tomcat:8.5.23-jre8-alpine
RUN rm -rf /usr/local/tomcat/webapps/*
CMD ["sleep", "1d"]

Commands to run:

bandesz:~/repos/gds/docker-test$ cf push docker-test --docker-image bandesz/tomcat-test -u process
bandesz:~/repos/gds/docker-test$ cf ssh docker-test
bash-4.3# ls -la /usr/local/tomcat/webapps
ls: /usr/local/tomcat/webapps/ROOT: No such file or directory
ls: /usr/local/tomcat/webapps/docs: No such file or directory
ls: /usr/local/tomcat/webapps/examples: No such file or directory
ls: /usr/local/tomcat/webapps/host-manager: No such file or directory
ls: /usr/local/tomcat/webapps/manager: No such file or directory
total 0
drwxr-x---    1 root     root            76 May 11 14:00 .
drwxr-xr-x    1 root     root            20 May 11 14:00 ..

Logs

I checked /var/vcap/sys/log/garden but nothing relevant were logged.

We might have some sensitive information in other system logs, therefore I'm not including them.

Docker image

I created one for testing: bandesz/tomcat-test

Cannot configure routine GC cleanup with deprecated cleanup property (graph_cleanup_threshold_in_mb)

Hi there! Thanks for taking time to open an Issue.
Please fill out as much of the following information as you can.

Description

Prior to 1.12.0, setting garden.graph_cleanup_threshold_in_mb to 0 resulted in unused layer cleanup routinely.

After 1.12.0, if I set

grootfs.reserved_space_for_other_jobs_in_mb: -1; and
{garden|grootfs}.graph_cleanup_threshold_in_mb: 0

... the GC is disabled.

With this configuration, I would expect the legacy property *.graph_cleanup_threshold_in_mb to behave as before (GC routinely), since the legacy property takes precedence over reserved_space_for_other_jobs_in_mb and 0 is considered a user choice.

Environment

garden-runc-release version: >= 1.12.0
IaaS: *
Stemcell version: *
Kernel version: N/A

Steps to reproduce

Deploy Garden 1.11.* with these properties set (note that reserved_space_for_other_jobs_in_mb will do nothing in 1.11.*):

properties:
      garden:
        graph_cleanup_threshold_in_mb: 0
      grootfs:
        graph_cleanup_threshold_in_mb: 0
        reserved_space_for_other_jobs_in_mb: -1

After creating and destroying a few containers with different rootfs layers, observe cleanup occurring routinely.

Upgrade garden-runc-release to 1.12.0+ and observe GC never occurring.

Logs

N/A

Cause

Thresholder binary considers a 0 value of graph_cleanup_threshold_in_mb to be less important that the new property reserved_space_for_other_jobs_in_mb. It should not.

Resolution

If graph_cleanup_threshold_in_mb is set to 0 and reserved_space_for_other_jobs_in_mb is set to -1, then GC should occur routinely.

garden execution failed during fresh CF deployment: mount: Structure needs cleaning

Description

I tried to deploy Cloud Foundry and failed during the creation of one cell instance. When i log into the failed cell and do a monit summary i get the following output:

Process 'consul_agent' running Process 'rep' running Process 'garden' Execution failed Process 'route_emitter' running Process 'metron_agent' running System 'system_localhost' running

In the CF deployment are two other cell instances running. One of them seems to be empty as there is only system_localhost running but the other run works as expected. The desired amount of cells for the CF deployment is 3.

I couldn't find anything special under /var/vcap/sys/log/garden.
The logs under /var/vcap/sys/log/monit shows "exit status 32: mount: Structure needs cleaning".

I executed the analize script you provided me through slack (https://raw.githubusercontent.com/cloudfoundry/garden-runc-release/develop/scripts/ordnance-survey) and found out the file /var/vcap/jobs/garden/config/config.ini your script expects to find is not present on the instance.
(i commented the line with /var/vcap/jobs/garden/config/config.ini out and downloaded the report, the .tgz is attached underneath).

Environment

garden-runc-release version: 1.11.1
IaaS: vSphere
Stemcell version: bosh-vsphere-esxi-ubuntu-trusty-go_agent 3541.9
Kernel version: 4.4.0-116-generic

Logs

logs from monit and garden https://gist.github.com/gdenn/764af3b8ce92fe063168413b45931833

best,
Dennis

Fetch submodule containerd failed

While I run

git clone [email protected]:cloudfoundry/garden-runc-release.git --branch v1.13.1 --recursive

I got

...
error: no such remote ref 96c505732559caf12840d37cb86e29cbab3e0f50
Fetched in submodule path 'src/github.com/containerd/containerd', but it did not contain 96c505732559caf12840d37cb86e29cbab3e0f50. Direct fetching of that commit failed.
...

Thank you!

Bumping garden-runc in bosh-lite results in `target is busy` when trying to delete the director

Description

We are accustomed to being able to always delete-env on a bosh-lite director (even if there was a deployment that had not yet been deleted). We recently tried bumping garden-runc-release, and now when you deploy something (we used zookeeper) to it and then try to delete-env on the director it results in this error:

Deleting deployment:
  Unmounting disk 'disk-8301596d-916b-421d-6b82-9dc28c0c87f4' from VM 'vm-0fd3666e-29b0-420d-5a57-ed1b31a644cc':
    Sending 'get_task' to the agent:
      Agent responded with error: Action Failed get_task: Task 102a2a4c-5ffb-437d-5c0b-7ce2ebf63bc8 result: Unmounting persistent disk: Running command: 'umount /dev/sdc1', stdout: '', stderr: 'umount: /var/vcap/store/warden_cpi/persistent_bind_mounts_dir/c5bd8cca-0dfe-4589-6cda-0a52c2d80bfc: target is busy         
        (In some cases useful info about processes that
         use the device is found by lsof(8) or fuser(1).)
': exit status 32

Exit code 1

Environment

garden-runc-release version: Upgraded from 1.9.4 to 1.16.3
IaaS: VirtualBox
Stemcell version: xenial (97.12)
Kernel version: 4.15.0-32-generic

Steps to reproduce

Clone bosh-deployment, using this branch here: https://github.com/cloudfoundry/bosh-deployment/tree/wip-bump-garden-runc
cd to the tests dir
run run.sh

Cause?

While investigating on our own, we found it suspect that these processes were left around after running monit stop garden:

bosh/0:/var/vcap/jobs/garden/bin# ps aux | grep runc                                                                                                                                                                                                                                                                          
root      765911  0.0  0.1 194132  4872 ?        S<l  21:52   0:00 /var/vcap/packages/guardian/bin/dadoo -runc-root /run/runc exec /var/vcap/packages/runc/bin/runc /var/vcap/data/garden/depot/8a12412d-4342-4385-5908-adf48fec9f69/processes/0c3fbfb5-d0ce-4ef3-73d7-b768cdc56766 8a12412d-4342-4385-5908-adf48fec9f69      
root      765991  0.0  0.1 130004  4920 ?        S<l  21:52   0:00 /var/vcap/packages/guardian/bin/dadoo -runc-root /run/runc exec /var/vcap/packages/runc/bin/runc /var/vcap/data/garden/depot/818efc09-5ee5-4815-4492-31f627f4d39e/processes/6de425de-c164-4c08-5f98-310d2975cfd1 818efc09-5ee5-4815-4492-31f627f4d39e      
root      765996  0.0  0.1 121808  4964 ?        S<l  21:52   0:00 /var/vcap/packages/guardian/bin/dadoo -runc-root /run/runc exec /var/vcap/packages/runc/bin/runc /var/vcap/data/garden/depot/2f02af6a-b0bf-4d60-40f6-3067dddebe36/processes/cb1d5a8b-b6fb-419b-6e59-20704f47a144 2f02af6a-b0bf-4d60-40f6-3067dddebe36      
root      766086  0.0  0.1 267608  4976 ?        S<l  21:52   0:00 /var/vcap/packages/guardian/bin/dadoo -runc-root /run/runc exec /var/vcap/packages/runc/bin/runc /var/vcap/data/garden/depot/a4932af8-8250-4982-4ded-8ffb340efd43/processes/e8571269-589a-4133-463a-28c3226548fb a4932af8-8250-4982-4ded-8ffb340efd43      
root      766100  0.0  0.1 130004  5068 ?        S<l  21:52   0:00 /var/vcap/packages/guardian/bin/dadoo -runc-root /run/runc exec /var/vcap/packages/runc/bin/runc /var/vcap/data/garden/depot/b084bdcf-36e6-4b7b-689e-444b83564ce1/processes/9dafa865-d335-4a49-746f-5e00809c4f1f b084bdcf-36e6-4b7b-689e-444b83564ce1      
root      791377  0.0  0.0  12944  1088 pts/0    S+   21:55   0:00 grep --color=auto runc

We expected that after monit stop garden there should be no dadoo or runc processes left.

Please let us know if you need any more information. You can reach us in the #bosh-core-dev channel on the Cloud Foundry slack or on here.

Thanks!

@jrussett and @jaresty
BOSH SF Engineering

Cloud Foundry running in Bosh Lite on AWS intermittently returns 502 when making requests to hosted apps

The Apps Manager team has a bosh-lite running on AWS (http://apps.bosh-lite.appsman.cf-app.com/) that has an intermittent problem: hitting app URLs sometimes returns a 502.

The CAPI team has seen the same problem on their bosh-lite environment (which has been deleted).

@michaelxupiv from CAPI said:

we suspected it was a garden container issue because we tracked the traffic reaching the Diego VM, but somehow that packet was lost when trying to reach the container. It's interesting that we don't see this issue on local bosh-lite.

Here are the logs they captured on their environment before it was torn down.
502app-logs.txt

If you wish to investigate this issue, the Apps Manager team can provide credentials to the bosh lite environment.

This is the runtime that we have deployed there:

+-----------------+------------------------------------------------+-----------------------------------------------------+--------------+
| Name            | Release(s)                                     | Stemcell(s)                                         | Cloud Config |
+-----------------+------------------------------------------------+-----------------------------------------------------+--------------+
| cf-warden       | cf/250                                         | bosh-warden-boshlite-ubuntu-trusty-go_agent/3312.15 | none         |
+-----------------+------------------------------------------------+-----------------------------------------------------+--------------+
| cf-warden-diego | cf/250                                         | bosh-warden-boshlite-ubuntu-trusty-go_agent/3312.15 | none         |
|                 | cflinuxfs2-rootfs/1.45.0                       |                                                     |              |
|                 | diego/1.5.3                                    |                                                     |              |
|                 | garden-runc/1.1.1                              |                                                     |              |

Started using GrootFS and now see "applying disk limits: disk limit is smaller than volume size"

Environment

I push docker apps to CF. I was on some older version of garden-runc-release, that didn't use GrootFS. I upgraded to version 1.11.0 of garden-runc-release. This also happens when upgrading to versions later than 1.11.0.

Symptom

After upgrading, which resulted in using GrootFS for filesystem management, some of my docker apps started failing with the following error message, which I found in /var/vcap/sys/log/garden/garden.stdout.log.

"error":"running image plugin create: making image: creating image: applying disk limits: disk limit is smaller than volume size\n: exit status 1"

My docker images seemed to be an acceptable size before, why are they failing now?

cloudfoundry / garden-runc-release Goto Github PK

garden-runc-release's Issues

Description

Description

Environment

Steps to reproduce

Logs

Description

Environment

Steps to reproduce

Logs

Cause

Resolution

Description

Setup Information

Logging and/or test output

Accessing Our Broken Environment

Description

Steps to reproduce

CF app container allocations and reported usage, effectively from the Garden API:

df

du against Grootfs store directories

Resolution

Description

Logging and/or test output

Steps to reproduce

Description

Environment

Steps to reproduce

Logs

Cause

Resolution

Running on vsphere with Ops Mgr build 2.0-build-269, PAS v2.0.8, NSX-T tile v2.1 integration

Steps to reproduce

Logs

Cause

Resolution

Description

Environment

Steps to reproduce

Logs

Docker image (if relevant)

Cause

Resolution

Todos

Description

Logging and/or test output

Steps to reproduce

Description

Steps to reproduce

Environment

Symptom

Cause

Resolution

Temporary workaround

Additional info

Description

Description

Environment

Description

Environment

Steps to reproduce

Logs

Cause

Description

Logging and/or test output

Steps to reproduce

Todos

Description

Background on our particular situation:

Logging and/or test output

Steps to reproduce

Description

Environment

Steps to reproduce

Logs

Docker image (if relevant)

Cause

Resolution

1. Does Cloud Foundry& Diego support Centos?

`df`

`du` against Grootfs store directories