
Comments (45)

bart0sh commented on May 26, 2024

Fixed remaining failed tests. Both jobs are green now:

bart0sh commented on May 26, 2024

@kannon92

We tried increasing the timeout to 6 hours from 4 hours but the timeout still occurs. Something is not right and we could use some help on figuring out why this can be happening.

I don't think there is anything wrong with long-running tests. There are a lot of slow tests in the job, and if the infra is loaded with other tasks, all of the tests will run slower.

Here is a list of test durations for successful and failed job runs. I don't see anything that can't be explained by the load on the infra. Please correct me if I'm wrong here.

$ grep '\[[0-9]\+\.[0-9]\+ seconds\]' build-log.123386.passed.txt | sed 's/.*\[\(.*\) seconds\].*/\1/' | sort -un
0.011
2.033
4.033
5.365
6.196
8.067
10.381
14.344
16.362
18.369
28.415
32.516
33.452
34.098
44.494
46.470
48.319
50.448
52.412
54.300
64.436
70.343
73.407
75.377
81.594
84.577
104.448
106.508
120.254
130.551
146.379
148.473
154.882
179.385
194.662
201.310
203.016
209.513
290.059
327.427
384.706
388.334
402.381
409.849
485.483
486.376
619.149
624.753
850.932
893.079
967.962
980.838
984.849
986.872
991.022
1008.993
1042.057

Total seconds:
$ (grep '\[[0-9]\+\.[0-9]\+ seconds\]' build-log.123386.passed.txt | sed 's/.*\[\(.*\) seconds\].*/\1/' | sort -un | tr "\012" "+" ; echo "0") | bc
15984.321
$ grep '\[[0-9]\+\.[0-9]\+ seconds\]' build-log.123386.failed.txt | sed 's/.*\[\(.*\) seconds\].*/\1/' | sort -un
0.010
2.162
4.038
5.333
6.023
8.082
10.371
14.333
18.370
22.381
34.448
44.623
46.310
56.427
62.302
64.531
70.368
76.363
87.083
106.517
120.064
134.263
146.335
154.730
194.619
201.182
202.974
203.502
303.038
327.277
375.703
388.318
408.687
485.539
486.443
491.590
494.590
496.764
509.069
619.159
627.600
764.817
967.865
980.931
981.038
986.906
991.223
1009.039
1030.938
1043.115
1064.974
1157.038
1408.810
1417.362

Total seconds:
$ (grep '\[[0-9]\+\.[0-9]\+ seconds\]' build-log.123386.failed.txt | sed 's/.*\[\(.*\) seconds\].*/\1/' | sort -un | tr "\012" "+" ; echo "0") | bc
21915.577

One strange thing, though, is that the total time for the failed run (21915.577 sec / 60 = 365.2596 min) is less than the job timeout (420 min). Any ideas why the job timed out?

kannon92 commented on May 26, 2024

An interesting data point I found comparing containerd serial and crio serial:

In containerd:

[sig-node] SeccompDefault [Serial] [Feature:SeccompDefault] [LinuxOnly] with SeccompDefault enabled should use unconfined when specified [sig-node, Serial, Feature:SeccompDefault]
k8s.io/kubernetes/test/e2e_node/seccompdefault_test.go:77
  STEP: Creating a kubernetes client @ 03/15/24 03:08:28.153
  STEP: Building a namespace api object, basename seccompdefault-test @ 03/15/24 03:08:28.154
  I0315 03:08:28.157768 1335 framework.go:275] Skipping waiting for service account
  STEP: Stopping the kubelet @ 03/15/24 03:08:28.169
  I0315 03:08:28.186800 1335 util.go:368] Get running kubelet with systemctl:   UNIT                            LOAD   ACTIVE SUB     DESCRIPTION
    kubelet-20240315T012703.service loaded active running /tmp/node-e2e-20240315T012703/kubelet --kubeconfig /tmp/node-e2e-20240315T012703/kubeconfig --root-dir /var/lib/kubelet --v 4 --config-dir /tmp/node-e2e-20240315T012703/kubelet.conf.d --hostname-override tmp-node-e2e-32a814d3-cos-beta-113-18244-1-14 --container-runtime-endpoint unix:///run/containerd/containerd.sock --config /tmp/node-e2e-20240315T012703/kubelet-config --feature-gates=DisableKubeletCloudCredentialProviders=true --image-credential-provider-config=/tmp/node-e2e-20240315T012703/credential-provider.yaml --image-credential-provider-bin-dir=/tmp/node-e2e-20240315T012703 --kernel-memcg-notification=true --cgroups-per-qos=true --cgroup-root=/ --runtime-cgroups=/system.slice/containerd.service

  LOAD   = Reflects whether the unit definition was properly loaded.
  ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
  SUB    = The low-level unit activation state, values depend on unit type.
  1 loaded units listed.
  , kubelet-20240315T012703
  W0315 03:08:28.224588    1335 util.go:500] Health check on "http://127.0.0.1:10248/healthz" failed, error=Head "http://127.0.0.1:10248/healthz": read tcp 127.0.0.1:56986->127.0.0.1:10248: read: connection reset by peer
  STEP: Starting the kubelet @ 03/15/24 03:08:28.236
  W0315 03:08:28.263213    1335 util.go:500] Health check on "http://127.0.0.1:10248/healthz" failed, error=Head "http://127.0.0.1:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused
  STEP: Creating a pod to test SeccompDefault-unconfined @ 03/15/24 03:08:33.268
  STEP: Saw pod success @ 03/15/24 03:08:37.282
  I0315 03:08:37.284826 1335 output.go:196] Trying to get logs from node tmp-node-e2e-32a814d3-cos-beta-113-18244-1-14 pod seccompdefault-test-04e6a259-188d-4ee1-b296-d97a82096652 container seccompdefault-test-04e6a259-188d-4ee1-b296-d97a82096652: <nil>
  STEP: delete the pod @ 03/15/24 03:08:37.294
  STEP: Stopping the kubelet @ 03/15/24 03:08:37.301
  I0315 03:08:37.319465 1335 util.go:368] Get running kubelet with systemctl:   UNIT                            LOAD   ACTIVE SUB     DESCRIPTION
    kubelet-20240315T012703.service loaded active running /tmp/node-e2e-20240315T012703/kubelet --kubeconfig /tmp/node-e2e-20240315T012703/kubeconfig --root-dir /var/lib/kubelet --v 4 --config-dir /tmp/node-e2e-20240315T012703/kubelet.conf.d --hostname-override tmp-node-e2e-32a814d3-cos-beta-113-18244-1-14 --container-runtime-endpoint unix:///run/containerd/containerd.sock --config /tmp/node-e2e-20240315T012703/kubelet-config --feature-gates=DisableKubeletCloudCredentialProviders=true --image-credential-provider-config=/tmp/node-e2e-20240315T012703/credential-provider.yaml --image-credential-provider-bin-dir=/tmp/node-e2e-20240315T012703 --kernel-memcg-notification=true --cgroups-per-qos=true --cgroup-root=/ --runtime-cgroups=/system.slice/containerd.service

  LOAD   = Reflects whether the unit definition was properly loaded.
  ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
  SUB    = The low-level unit activation state, values depend on unit type.
  1 loaded units listed.
  , kubelet-20240315T012703
  W0315 03:08:37.359305    1335 util.go:500] Health check on "http://127.0.0.1:10248/healthz" failed, error=Head "http://127.0.0.1:10248/healthz": read tcp 127.0.0.1:52932->127.0.0.1:10248: read: connection reset by peer
  STEP: Starting the kubelet @ 03/15/24 03:08:37.37
  W0315 03:08:37.400901    1335 util.go:500] Health check on "http://127.0.0.1:10248/healthz" failed, error=Head "http://127.0.0.1:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused
  I0315 03:08:42.408960 1335 helper.go:121] Waiting up to 7m0s for all (but 0) nodes to be ready
  STEP: Destroying namespace "seccompdefault-test-1780" for this suite. @ 03/15/24 03:08:42.411
• [14.261 seconds]
This test takes 14 seconds.

On crio, it takes 974 seconds.

[sig-node] SeccompDefault [Serial] [Feature:SeccompDefault] [LinuxOnly] with SeccompDefault enabled should use unconfined when specified [sig-node, Serial, Feature:SeccompDefault]
k8s.io/kubernetes/test/e2e_node/seccompdefault_test.go:77
  STEP: Creating a kubernetes client @ 03/15/24 10:56:41.446
  STEP: Building a namespace api object, basename seccompdefault-test @ 03/15/24 10:56:41.446
  I0315 10:56:41.450556 2955 framework.go:275] Skipping waiting for service account
  STEP: Stopping the kubelet @ 03/15/24 10:56:41.456
  I0315 10:58:41.541012 2955 util.go:368] Get running kubelet with systemctl:   UNIT                            LOAD   ACTIVE SUB     DESCRIPTION
    kubelet-20240315T081727.service loaded active running /tmp/node-e2e-20240315T081727/kubelet --kubeconfig /tmp/node-e2e-20240315T081727/kubeconfig --root-dir /var/lib/kubelet --v 4 --config-dir /tmp/node-e2e-20240315T081727/kubelet.conf.d --hostname-override n1-standard-4-fedora-coreos-39-20240225-3-0-gcp-x86-64-82383702 --container-runtime-endpoint unix:///var/run/crio/crio.sock --config /tmp/node-e2e-20240315T081727/kubelet-config --cgroup-driver=systemd --cgroups-per-qos=true --cgroup-root=/ --runtime-cgroups=/system.slice/crio.service --kubelet-cgroups=/system.slice/kubelet.service

  LOAD   = Reflects whether the unit definition was properly loaded.
  ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
  SUB    = The low-level unit activation state, values depend on unit type.
  1 loaded units listed.
  , kubelet-20240315T081727
  W0315 11:02:41.775895    2955 util.go:500] Health check on "http://127.0.0.1:10248/healthz" failed, error=Head "http://127.0.0.1:10248/healthz": read tcp 127.0.0.1:60264->127.0.0.1:10248: read: connection reset by peer
  STEP: Starting the kubelet @ 03/15/24 11:02:41.784
  W0315 11:04:41.838220    2955 util.go:500] Health check on "http://127.0.0.1:10248/healthz" failed, error=Head "http://127.0.0.1:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused
  STEP: Creating a pod to test SeccompDefault-unconfined @ 03/15/24 11:04:46.845
  STEP: Saw pod success @ 03/15/24 11:04:50.858
  I0315 11:04:50.860386 2955 output.go:196] Trying to get logs from node n1-standard-4-fedora-coreos-39-20240225-3-0-gcp-x86-64-82383702 pod seccompdefault-test-fe386ac9-d280-465e-9dd3-4aa471ecf5a7 container seccompdefault-test-fe386ac9-d280-465e-9dd3-4aa471ecf5a7: <nil>
  STEP: delete the pod @ 03/15/24 11:04:50.868
  STEP: Stopping the kubelet @ 03/15/24 11:04:50.874
  I0315 11:06:50.938206 2955 util.go:368] Get running kubelet with systemctl:   UNIT                            LOAD   ACTIVE SUB     DESCRIPTION
    kubelet-20240315T081727.service loaded active running /tmp/node-e2e-20240315T081727/kubelet --kubeconfig /tmp/node-e2e-20240315T081727/kubeconfig --root-dir /var/lib/kubelet --v 4 --config-dir /tmp/node-e2e-20240315T081727/kubelet.conf.d --hostname-override n1-standard-4-fedora-coreos-39-20240225-3-0-gcp-x86-64-82383702 --container-runtime-endpoint unix:///var/run/crio/crio.sock --config /tmp/node-e2e-20240315T081727/kubelet-config --cgroup-driver=systemd --cgroups-per-qos=true --cgroup-root=/ --runtime-cgroups=/system.slice/crio.service --kubelet-cgroups=/system.slice/kubelet.service

  LOAD   = Reflects whether the unit definition was properly loaded.
  ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
  SUB    = The low-level unit activation state, values depend on unit type.
  1 loaded units listed.
  , kubelet-20240315T081727
  W0315 11:10:51.109902    2955 util.go:500] Health check on "http://127.0.0.1:10248/healthz" failed, error=Head "http://127.0.0.1:10248/healthz": read tcp 127.0.0.1:41882->127.0.0.1:10248: read: connection reset by peer
  STEP: Starting the kubelet @ 03/15/24 11:10:51.119
  W0315 11:12:51.216556    2955 util.go:500] Health check on "http://127.0.0.1:10248/healthz" failed, error=Head "http://127.0.0.1:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused
  I0315 11:12:56.223758 2955 helper.go:121] Waiting up to 7m0s for all (but 0) nodes to be ready
  STEP: Destroying namespace "seccompdefault-test-6757" for this suite. @ 03/15/24 11:12:56.225
• [974.782 seconds]
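Reading the timestamps, almost all of the extra time on crio is spent waiting around the "Stopping the kubelet" / "Starting the kubelet" steps (each stop/start plus the healthz wait takes several minutes, versus fractions of a second on containerd). For reference, a minimal sketch of the kind of healthz polling the harness does around kubelet restarts (illustrative only; the real helper in test/e2e_node/util.go differs in detail):

package sketch

import (
	"fmt"
	"net/http"
	"time"
)

// waitForKubeletHealthz polls the kubelet healthz endpoint with HEAD requests
// (matching the log lines above) until it returns 200 OK or the timeout expires.
func waitForKubeletHealthz(url string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Head(url)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("kubelet healthz %s did not become healthy within %v", url, timeout)
}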

bart0sh commented on May 26, 2024

A job run with only the Restart and Resource-usage slow tests enabled took more than 5 hours and didn't produce enough artifacts: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-node-kubelet-serial-crio-cgroupv2/1769052751628603392

bart0sh commented on May 26, 2024

The Restart test is the current suspect. A test run with resource_usage and system_node_critical enabled takes 1h 46m and looks good: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-node-kubelet-serial-crio-cgroupv1/1769276078355910656

I'll try to enable more slow tests with the Restart test excluded.

esotsal commented on May 26, 2024

I was referring to the job shared in the description https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-swap-fedora-serial

bart0sh commented on May 26, 2024

A job run with all tests except Restart:Dbus takes 2h 9m and looks good: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-node-kubelet-serial-crio-cgroupv1/1769659931427868672

So the Restart:Dbus test case seems to slow everything down.

bart0sh commented on May 26, 2024

Is this applicable/needed also for Fedora?

It looks like it restarts dbus and systemd, which theoretically should work on any Linux with systemd and dbus. However, that comment could mean it was never tested on anything except Ubuntu.

kannon92 commented on May 26, 2024

I opened up #124000.

I talked with a few people at Red Hat and it was suggested that this test is problematic. I'm going to try just skipping it on Fedora, since restarting dbus is not recommended.
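For reference, a minimal sketch of what such a skip could look like (hypothetical helper; the actual change in #124000 may differ):

package sketch

import (
	"os"
	"strings"

	"github.com/onsi/ginkgo/v2"
)

// skipDbusRestartOnFedora is a hypothetical helper: it reads /etc/os-release
// and skips the current spec on Fedora (including Fedora CoreOS), where
// restarting dbus is not recommended.
func skipDbusRestartOnFedora() {
	data, err := os.ReadFile("/etc/os-release")
	if err != nil {
		return // cannot detect the distro; run the test as before
	}
	if strings.Contains(strings.ToLower(string(data)), "fedora") {
		ginkgo.Skip("restarting dbus is not recommended on Fedora; skipping the Dbus restart test case")
	}
}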

kannon92 commented on May 26, 2024

I asked @kwilczynski, as he has been working on graceful shutdown, about a few of these problems on RHEL-like OSes. I think skipping should be fine for this.

kannon92 commented on May 26, 2024

Nice job @bart0sh!

https://testgrid.k8s.io/sig-node-cri-o#node-kubelet-serial-crio

k8s-ci-robot commented on May 26, 2024

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

kannon92 commented on May 26, 2024

/cc @ndixita @SergeyKanzhelev

kannon92 commented on May 26, 2024

Looks like our recent PR didn't fix this.

kannon92 commented on May 26, 2024

Looking into the serial logs, it looks like these tests are hitting OOM.

[ 3276.883455] Out of memory: Killed process 39048 (dd) total-vm:10490032kB, anon-rss:3218048kB, file-rss:80kB, shmem-rss:0kB, UID:0 pgtables:6352kB oom_score_adj:1000
[ 3276.898409] Out of memory: Killed process 39048 (dd) total-vm:10490032kB, anon-rss:3218048kB, file-rss:80kB, shmem-rss:0kB, UID:0 pgtables:6352kB oom_score_adj:1000

kannon92 commented on May 26, 2024

/reopen

Memory limits did not help.

k8s-ci-robot commented on May 26, 2024

@kannon92: Reopened this issue.

In response to this:

/reopen

Memory limits did not help.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

kannon92 commented on May 26, 2024

/reopen

k8s-ci-robot commented on May 26, 2024

@kannon92: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

kannon92 commented on May 26, 2024

Bumping machine size didn't fix this.

iholder101 commented on May 26, 2024

/cc
👀

bart0sh commented on May 26, 2024

@kannon92

[ 3276.898409] Out of memory: Killed process 39048 (dd) total-vm:10490032kB, anon-rss:3218048kB, file-rss:80kB, shmem-rss:0kB, UID:0 pgtables:6352kB oom_score_adj:1000

dd is run in a couple of tests as far as I can see:

$ git grep 'dd if' ./test/e2e_node/
test/e2e_node/eviction_test.go: return podWithCommand(volumeSource, resources, diskConsumedMB, name, fmt.Sprintf("dd if=/dev/urandom of=%s${i} bs=1048576 count=1 2>/dev/null; sleep .1;", filepath.Join(path, "file")))
test/e2e_node/oomkiller_linux_test.go:                  "sleep 5 && dd if=/dev/zero of=/dev/null bs=20M",
test/e2e_node/oomkiller_linux_test.go:                  "sleep 5 && dd if=/dev/zero of=/dev/null bs=20M & sleep 86400",
test/e2e_node/oomkiller_linux_test.go:                  "sleep 5 && dd if=/dev/zero of=/dev/null iflag=fullblock count=10 bs=10G",
test/e2e_node/system_node_critical_test.go:                                     "dd if=/dev/urandom of=file${i} bs=10485760 count=1 2>/dev/null; sleep .1;",

eviction_test.go runs dd to simulate disk pressure, and system_node_critical_test.go is similar. So I'd start with oomkiller_linux_test.go. The only Serial test I can find is this one: https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/oomkiller_linux_test.go#L46.
I've removed the Serial label from it to see if that helps; here is a test PR: #120459. Let's see how it goes.
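For context, a rough sketch of the kind of pod the OOM-killer test creates (illustrative only, not the actual test code; the image and the limit value below are assumptions):

package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// oomTargetPod sketches a pod whose container streams /dev/zero through dd
// under a memory limit low enough that the kernel OOM killer terminates it.
func oomTargetPod() *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "oomkill-target-pod"},
		Spec: v1.PodSpec{
			RestartPolicy: v1.RestartPolicyNever,
			Containers: []v1.Container{{
				Name:    "oomkill-target-container",
				Image:   "registry.k8s.io/e2e-test-images/busybox:1.29-2", // assumed image
				Command: []string{"sh", "-c", "sleep 5 && dd if=/dev/zero of=/dev/null bs=20M"},
				Resources: v1.ResourceRequirements{
					Limits: v1.ResourceList{
						v1.ResourceMemory: resource.MustParse("15Mi"), // assumed limit
					},
				},
			}},
		},
	}
}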

bart0sh commented on May 26, 2024

This looks quite dangerous and explains why bigger instances and limits didn't help: https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/oomkiller_linux_test.go#L250
It tries to allocate 100GB of memory.

bart0sh commented on May 26, 2024

The successful run has 'Out of memory: Killed process' messages as well, e.g. https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/123386/pull-kubernetes-node-kubelet-serial-crio-cgroupv1/1763620904253788160/artifacts/n1-standard-2-fedora-coreos-39-20240210-3-0-gcp-x86-64-4346b1d8/serial-1.log

bart0sh commented on May 26, 2024

@kannon92 It looks like the OOMKiller message is expected and doesn't affect the issue. It's produced by the "OOMKiller for pod using more memory than node allocatable [LinuxOnly]" test case mentioned above. Without that test case the job still times out and doesn't produce artifacts. Here is an example of a failed and timed-out job run without that test case: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-node-kubelet-serial-crio-cgroupv2/1768026002467852288

bart0sh commented on May 26, 2024

Without slow tests pull-kubernetes-node-kubelet-serial-crio-cgroupv1 finished in around 1h 25m and produced valid artifacts.

bart0sh commented on May 26, 2024

Without slow tests and with the OOMKiller test, the job still works OK: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-node-kubelet-serial-crio-cgroupv2/1768343770648023040

bart0sh commented on May 26, 2024

With all tests enabled the issue persisted despite the 10h timeout: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-node-kubelet-serial-crio-cgroupv2/1768400136934789120
I'll start bisecting. Hopefully I will spot the test that causes everything to slow down.

kannon92 commented on May 26, 2024

One thing I noticed is that these tests are running twice. You can search for the start-tests banner and see that we restart if the tests failed.

esotsal commented on May 26, 2024

One of the failures in the failed link was because the kubelet restart took more time than the ImageMaximumGCAge of 1 minute. Is this expected? Attempt to fix this: 123949.
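For reference, imageMaximumGCAge is a kubelet configuration field; a minimal sketch of raising it for such a test (the package path and the value here are assumptions, and the real fix in 123949 may take a different approach):

package sketch

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kubeletconfig "k8s.io/kubelet/config/v1beta1"
)

// raiseImageMaximumGCAge sketches bumping imageMaximumGCAge so that a slow
// kubelet restart does not race with image garbage collection.
func raiseImageMaximumGCAge(cfg *kubeletconfig.KubeletConfiguration) {
	cfg.ImageMaximumGCAge = metav1.Duration{Duration: 5 * time.Minute}
}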

kannon92 commented on May 26, 2024

Looking into which tests take the longest, I notice that the *Manager tests are all pretty beefy.

@ffromani @swatisehgal WDYT of moving these to a separate serial job?

kannon92 commented on May 26, 2024

Running this test on GCP with crio, I don't see why it takes that long. I think there is something else going on in this test. In the runs that were passing, this test only takes 14 seconds.

bart0sh commented on May 26, 2024

With all but the last 3 slow tests excluded, the job still ran for more than 8h: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-node-kubelet-serial-crio-cgroupv1/1768671241448722432

Remaining slow tests: Resource-usage, Restart and SystemNodeCriticalPod

In contrast, the job with all slow tests excluded took only 1h 25m: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-node-kubelet-serial-crio-cgroupv1/1768343770585108480

I'm not sure if the reason for the slowness is one of those 3 tests or something else.
I'll remove them and run the job again.

bart0sh commented on May 26, 2024

@kannon92

One thing I noticed is that these tests are running two times. You can search for the start tests banner and see that we restart if the test failed.

I don't think they're running twice. The build log gets duplicated if the test run fails, I guess. Look at the durations for the same test case: they're exactly the same to high precision, which wouldn't be possible for two different test runs.

esotsal commented on May 26, 2024

The successful run is executed on n1-standard-2, the failed one on n1-standard-4. Could it be related to this data point?

bart0sh commented on May 26, 2024

A job run with the resource_usage, system_node_critical, log rotation, density and eviction tests enabled takes 1h 51m and looks good: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-node-kubelet-serial-crio-cgroupv2/1769311222651424768

So far so good.

Side note: I'm not sure I understand why the log_rotation, density and eviction tests are marked slow, as altogether they take around 5 minutes.

bart0sh commented on May 26, 2024

A job run with all tests except Restart takes 2h and still looks good: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-node-kubelet-serial-crio-cgroupv1/1769356436250300416

I'll try to investigate the Restart test.

bart0sh commented on May 26, 2024

A job run with all tests except Restart:Container Runtime and Restart:Dbus takes 2h 4m and still looks good: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-node-kubelet-serial-crio-cgroupv1/1769457335215853568

bart0sh commented on May 26, 2024

@esotsal

The successful run is executed on n1-standard-2, the failed one on n1-standard-4. Could it be related to this data point?

Can you point to those runs, please? In my testing both pull jobs use n1-standard-4 and the results don't differ.

kannon92 commented on May 26, 2024

Nice! So we can just filter out that job then? Not sure if we have a slow lane for crio/containerd?

esotsal commented on May 26, 2024

@kannon92 Checking the code, it seems it is performing an Ubuntu-specific task.

// Allow dbus to be restarted on ubuntu
err := overlayDbusConfig()

var (
	dbusConfPath = "/etc/systemd/system/dbus.service.d/k8s-graceful-node-shutdown-e2e.conf"
	dbusConf     = `
[Unit]
RefuseManualStart=no
RefuseManualStop=no
[Service]
KillMode=control-group
ExecStop=
`
)

func overlayDbusConfig() error {
	err := os.MkdirAll(filepath.Dir(dbusConfPath), 0755)
	if err != nil {
		return err
	}
	err = os.WriteFile(dbusConfPath, []byte(dbusConf), 0644)
	if err != nil {
		return err
	}
	return systemctlDaemonReload()
}

Is this applicable/needed also for Fedora?

bart0sh commented on May 26, 2024

Restart:Dbus is a test case in the Restart test: https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/restart_test.go#L151-L203
We can either remove/comment it out or investigate why it slows everything down.

bart0sh commented on May 26, 2024

The Dbus test case was added by this commit

@endocrimes Can you help us investigate why the Restart:Dbus test case slows everything down on Fedora CoreOS?

bart0sh commented on May 26, 2024

@kannon92 The Dbus test case was added by this PR and was tested only with pull-kubernetes-node-kubelet-serial-containerd, as the crio pull jobs didn't exist at that time. However, it should be failing/slowing down the CI jobs if they exist.

bart0sh commented on May 26, 2024

/assign
