
k8s-service-validator's Introduction

Demos


K8S Service-API Table Tests

CI coverage (GitHub Actions badges):

  • Kube-Proxy: iptables 1.21, iptables 1.21 (performance test), iptables 1.22, ipvs
  • KPNG: nftables, ipvs
  • eBPF dataplanes: Cilium 1.11.0, Calico 3.21, Antrea 1.4

Problem

The upstream sig-network tests (curated by the Kubernetes organization):

  • are difficult to interpret when they fail
  • schedule pods to nodes randomly while pinning endpoints to the first node in a list, which makes failures non-deterministic
  • do not use positive and negative controls

Although we can't deprecate these tests because of their inertia, we need a better tool to diagnose Service / kube-proxy implementations across providers. Examples of such providers are kube-proxy in iptables or IPVS mode, KPNG, and eBPF dataplanes such as Cilium, Calico, and Antrea.

The implementations of kube-proxy diverge over time. Knowing whether load balancing is failing because of the source or the target of the traffic, whether all node proxies or just a few are broken, and whether configurations such as node-local endpoints or terminating-endpoint scenarios are causing specific issues becomes increasingly important when comparing Service implementations.

Solution

To reason about different Service implementations, on different providers, with low cognitive overhead, we visualize service availability in a Kubernetes cluster using... tables!

With table tests we can rapidly compare what may or may not be broken and generate hypotheses, as is done for the NetworkPolicy tests in upstream Kubernetes: https://kubernetes.io/blog/2021/04/20/defining-networkpolicy-conformance-cni-providers/.

As an example, the test below probes 3 endpoints from 3 endpoints. Assuming uniform spreading of services, this can detect problems in service load balancing that are specific to source or target nodes.

-		name-x/a	name-x/b	name-x/c
name-x/a	.		X		X	
name-x/b	.		X		X	
name-x/c	.		X		X	

In this case we can see that pods "b" and "c" in namespace "x" are not reachable from ANY pod, meaning there is a problem with every kube-proxy, on every node, i.e. the load balancing is fundamentally not working (the Xs are failures).
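
As an illustration only (not the repository's actual code), here is a minimal Go sketch of how such a reachability matrix could be represented and compared; the Matrix type and Compare function are hypothetical names:

package matrix

// Matrix records, for each (source pod, target pod) pair, whether the probe
// from source to target succeeded.
type Matrix struct {
	Pods  []string // e.g. "name-x/a"
	Reach map[[2]string]bool
}

func New(pods []string) *Matrix {
	return &Matrix{Pods: pods, Reach: map[[2]string]bool{}}
}

// Set records the outcome of probing target from source.
func (m *Matrix) Set(source, target string, ok bool) {
	m.Reach[[2]string{source, target}] = ok
}

// Compare counts agreements between the expected and observed matrices,
// mirroring the "correct/incorrect" summary printed by the test suite.
func Compare(expected, observed *Matrix) (correct, incorrect int) {
	for _, src := range expected.Pods {
		for _, dst := range expected.Pods {
			key := [2]string{src, dst}
			if expected.Reach[key] == observed.Reach[key] {
				correct++
			} else {
				incorrect++
			}
		}
	}
	return correct, incorrect
}

// String renders the matrix as a table: "." for reachable, "X" for a failure.
func (m *Matrix) String() string {
	out := "-"
	for _, dst := range m.Pods {
		out += "\t" + dst
	}
	out += "\n"
	for _, src := range m.Pods {
		out += src
		for _, dst := range m.Pods {
			if m.Reach[[2]string{src, dst}] {
				out += "\t."
			} else {
				out += "\tX"
			}
		}
		out += "\n"
	}
	return out
}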

This suite of tests validates Service objects and rules in various scenarios, using positive and negative controls as test heuristics. For now the following Service types are covered:

  • ClusterIP
  • ExternalName
LoadBalancer
  • NodePort

It covers features such as hairpin traffic, session affinity, headless Services, hostNetwork, TCP and UDP connections, NodePortLocal, and Services with annotations.

Details/Contributing

This is an initial, experimental repo, but we would like to fully implement this as a KEP and add test coverage to upstream Kubernetes if there is consensus in the sig-network group.

Build and run - development

Create a local cluster using Kind, make sure the cluster has more than one node, and install MetalLB:

$ kind create cluster --config=hack/kind-multi-worker.yaml
$ hack/install_metallb.sh

To run the tests directly you can use:

$ make test

To build the binary and run it, use:

$ make build
$ ./svc-test

Run with Sonobuoy

Install Sonobuoy (https://github.com/vmware-tanzu/sonobuoy#installation), then run:

$ make sonobuoy-run

After the run finishes, retrieve the results:

$ make sonobuoy-retrieve

Running only specific tests

The binary supports flags for running only a subset of the tests; for example, to run only the UDP stale-endpoint tests:

go test -v ./tests/ -labels="type=udp_stale_endpoint"

Other flags include -debug for verbose output and -namespace to pick the namespace the tests run in; when it is not specified, a new random namespace is created.
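
A minimal sketch of how such a label filter can be wired into a Go test binary; the labels flag and skipUnlessLabel helper below are hypothetical illustrations, not necessarily how this suite implements it:

package tests

import (
	"flag"
	"strings"
	"testing"
)

// labels is the comma-separated key=value filter passed on the command line,
// e.g. -labels="type=udp_stale_endpoint".
var labels = flag.String("labels", "", "only run tests matching these key=value labels")

// skipUnlessLabel skips the calling test unless the requested label matches
// the -labels filter (an empty filter runs everything).
func skipUnlessLabel(t *testing.T, key, value string) {
	t.Helper()
	if *labels == "" {
		return
	}
	for _, kv := range strings.Split(*labels, ",") {
		if kv == key+"="+value {
			return
		}
	}
	t.Skipf("skipping: label %s=%s not selected", key, value)
}

func TestUDPStaleEndpoint(t *testing.T) {
	skipUnlessLabel(t, "type", "udp_stale_endpoint")
	// ... build the UDP service, delete its endpoints, and assert on the
	// resulting reachability matrix.
}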

Using E2E tests

Download the Kubernetes repository and build the test binary:

$ make WHAT=test/e2e/e2e.test
+++ [1103 15:29:10] Building go targets for linux/amd64:
    ./vendor/k8s.io/code-generator/cmd/prerelease-lifecycle-gen
> non-static build: k8s.io/kubernetes/./vendor/k8s.io/code-generator/cmd/prerelease-lifecycle-gen
Generating prerelease lifecycle code for 28 targets
+++ [1103 15:29:13] Building go targets for linux/amd64:
    ./vendor/k8s.io/code-generator/cmd/deepcopy-gen
> non-static build: k8s.io/kubernetes/./vendor/k8s.io/code-generator/cmd/deepcopy-gen
Generating deepcopy code for 236 targets
+++ [1103 15:29:19] Building go targets for linux/amd64:
    ./vendor/k8s.io/code-generator/cmd/defaulter-gen
> non-static build: k8s.io/kubernetes/./vendor/k8s.io/code-generator/cmd/defaulter-gen
Generating defaulter code for 94 targets
+++ [1103 15:29:26] Building go targets for linux/amd64:
    ./vendor/k8s.io/code-generator/cmd/conversion-gen
> non-static build: k8s.io/kubernetes/./vendor/k8s.io/code-generator/cmd/conversion-gen
Generating conversion code for 130 targets
+++ [1103 15:29:38] Building go targets for linux/amd64:
    ./vendor/k8s.io/kube-openapi/cmd/openapi-gen
> non-static build: k8s.io/kubernetes/./vendor/k8s.io/kube-openapi/cmd/openapi-gen
Generating openapi code for KUBE
Generating openapi code for AGGREGATOR
Generating openapi code for APIEXTENSIONS
Generating openapi code for CODEGEN
Generating openapi code for SAMPLEAPISERVER
+++ [1103 15:29:47] Building go targets for linux/amd64:
    test/e2e
> non-static build: k8s.io/kubernetes/test/e2e

There will be a new compiled binary under _output/bin/e2e.test (around 150 MB); you can use it to run the sig-network tests as follows:

_output/bin/e2e.test -ginkgo.focus="\[sig-network\]" \
    -ginkgo.skip="\[Feature:(Networking-IPv6|Example|Federation|PerformanceDNS)\]|LB.health.check|LoadBalancer|load.balancer|GCE|NetworkPolicy|DualStack" \
    --provider=local \ 
    --kubeconfig=.kube/config

The kubeconfig is fetched from the sources below, using whichever works first from the top of the list (a sketch of this resolution order follows the list):

  1. env KUBECONFIG
  2. Kubernetes configuration at $HOME/.kube/config
  3. In cluster config.
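
A rough sketch of that resolution order using client-go (an illustration under the stated order, not necessarily the suite's exact code):

package main

import (
	"os"
	"path/filepath"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// restConfig resolves a *rest.Config in the order listed above:
// $KUBECONFIG, then $HOME/.kube/config, then in-cluster config.
func restConfig() (*rest.Config, error) {
	kubeconfig := os.Getenv("KUBECONFIG")
	if kubeconfig == "" {
		if home, err := os.UserHomeDir(); err == nil {
			if path := filepath.Join(home, ".kube", "config"); fileExists(path) {
				kubeconfig = path
			}
		}
	}
	if kubeconfig != "" {
		return clientcmd.BuildConfigFromFlags("", kubeconfig)
	}
	return rest.InClusterConfig()
}

func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	cfg, err := restConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	_ = clientset // the clientset is then used to create the pods, services, and probes
}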

Running on a K8S cluster

This test requires more than one node, and it guarantees the proper spread of pods across the existing nodes by creating len(nodes) pods (sketched below). The examples above use 4 nodes (3 workers + 1 master).
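
As a rough illustration of the pod-spreading guarantee (hypothetical helper, image, and arguments; not the suite's actual code), one probe pod can be pinned to each node via spec.nodeName:

package manager

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createOnePodPerNode creates len(nodes) pods, each pinned to a different
// node, so every node hosts exactly one probe pod.
func createOnePodPerNode(ctx context.Context, cs kubernetes.Interface, namespace string) error {
	nodes, err := cs.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for i, node := range nodes.Items {
		pod := &corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("pod-%d", i+1)},
			Spec: corev1.PodSpec{
				NodeName: node.Name, // schedule directly onto this node
				Containers: []corev1.Container{{
					Name:  "cont-80-tcp",
					Image: "registry.k8s.io/e2e-test-images/agnhost:2.39", // assumed probe image
					Args:  []string{"serve-hostname", "--tcp", "--http=false", "--port", "80"},
				}},
			},
		}
		if _, err := cs.CoreV1().Pods(namespace).Create(ctx, pod, metav1.CreateOptions{}); err != nil {
			return err
		}
	}
	return nil
}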

ClusterIP testing

In this example we have pod-1, pod-2, pod-3 and pod-4. The first lines from probe.go in the log show pod-1 probing the other pods on port 80; this probe is repeated from every other pod, and the reachability matrix shows the outcome of all the connections.

{"level":"info","ts":1621793385.240396,"caller":"manager/helper.go:45","msg":"Validating reachability matrix... (== FIRST TRY ==)"}
{"level":"info","ts":1621793385.3714652,"caller":"manager/probe.go:114","msg":"kubectl exec pod-1 -c cont-80-tcp -n x-12348 -- /agnhost connect s-x-12348-pod-3.x-12348.svc.cluster.local:80 --timeout=5s --protocol=tcp"}
{"level":"info","ts":1621793385.3720098,"caller":"manager/probe.go:114","msg":"kubectl exec pod-1 -c cont-80-tcp -n x-12348 -- /agnhost connect s-x-12348-pod-2.x-12348.svc.cluster.local:80 --timeout=5s --protocol=tcp"}
{"level":"info","ts":1621793385.385194,"caller":"manager/probe.go:114","msg":"kubectl exec pod-1 -c cont-80-tcp -n x-12348 -- /agnhost connect s-x-12348-pod-1.x-12348.svc.cluster.local:80 --timeout=5s --protocol=tcp"}
{"level":"info","ts":1621793385.4150555,"caller":"manager/probe.go:114","msg":"kubectl exec pod-1 -c cont-80-tcp -n x-12348 -- /agnhost connect s-x-12348-pod-4.x-12348.svc.cluster.local:80 --timeout=5s --protocol=tcp"}
...
// repeated for all pods

reachability: correct:16, incorrect:0, result=true

52 <nil>
expected:

-               x-12348/pod-1   x-12348/pod-2   x-12348/pod-3   x-12348/pod-4
x-12348/pod-1   .               .               .               .
x-12348/pod-2   .               .               .               .
x-12348/pod-3   .               .               .               .
x-12348/pod-4   .               .               .               .


176 <nil>
observed:

-               x-12348/pod-1   x-12348/pod-2   x-12348/pod-3   x-12348/pod-4
x-12348/pod-1   .               .               .               .
x-12348/pod-2   .               .               .               .
x-12348/pod-3   .               .               .               .
x-12348/pod-4   .               .               .               .


176 <nil>
comparison:

-               x-12348/pod-1   x-12348/pod-2   x-12348/pod-3   x-12348/pod-4
x-12348/pod-1   .               .               .               .
x-12348/pod-2   .               .               .               .
x-12348/pod-3   .               .               .               .
x-12348/pod-4   .               .               .               .
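
For reference, a hedged sketch of how each probe command in the log above could be assembled (the helper name is hypothetical; only the command shape is taken from the log):

package manager

import "fmt"

// probeCommand builds the kubectl exec / agnhost invocation used to test
// whether sourcePod can reach targetService on the given port, matching the
// commands shown in the log above.
func probeCommand(namespace, sourcePod, container, targetService string, port int, protocol string) string {
	return fmt.Sprintf(
		"kubectl exec %s -c %s -n %s -- /agnhost connect %s.%s.svc.cluster.local:%d --timeout=5s --protocol=%s",
		sourcePod, container, namespace, targetService, namespace, port, protocol,
	)
}

For example, probeCommand("x-12348", "pod-1", "cont-80-tcp", "s-x-12348-pod-3", 80, "tcp") reproduces the first probe line in the log.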

Sketch

diagram

Plan

  • Initial demo at sig-network (done), establishing agreement on the future of the service tests.
  • Establish parity with the existing sig-network tests.
  • Use this repo to do all the validation for KPNG.
  • Demo KPNG with the new service LB tests at sig-network.
  • Establish a kubernetes-sigs/... repo for this framework, to standardize the CI for KPNG.
  • Use these tests as a new test-infra job in sig-network for the in-tree kube-proxy.
  • Eventually remove the old service LB tests from sig-network, if the community agrees.

k8s-service-validator's People

Contributors

danielavyas, hanfa, jayunit100, johnschnake, knabben, sladyn98, yzaccc


k8s-service-validator's Issues

Should Create Endpoints For Unready Pod

create an RC with selectors
create a service with selectors
verify the pod for the RC
wait for the endpoints of the service with a DNS name
scale the replication controller down to zero
update the service to not tolerate unready endpoints
check that the pod is unreachable
update the service to tolerate unready endpoints again
check that the terminating pod is available through the service
remove the pod immediately

LoadBalancer IP in <pending> state for long time

Bug: currently we treat the probe below as connected, which is wrong; the empty host means the service IP has not been created yet.
kubectl exec pod-4 -c cont-80-tcp -n x-3992 -- /agnhost connect :80 --timeout=5s --protocol=udp

Refine log and matrix

  • include logs showing which node each pod was scheduled on
  • refine the matrix to show which node each pod is running on
  • colorize the standard output
  • tbd

[idea] make a fancy DSL

We could express tests like:

   CREATE service w/ spn label
   VERIFY svc1 does not work
   CREATE service w/o spn label
   VERIFY svc2 works
   UPDATE the 2nd service w/ spn label
   VERIFY both services do not work
  • implement a Go TestExample as well to make this more readable (see the sketch below)
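
A hedged sketch of what such a Go TestExample could look like (the step bodies are placeholders, and the whole structure is illustrative only):

package tests

import "testing"

// TestServiceProxyNameDSL expresses the CREATE/VERIFY/UPDATE flow above as a
// table of named steps, so the intent reads like the DSL while staying Go.
func TestServiceProxyNameDSL(t *testing.T) {
	steps := []struct {
		name string
		run  func(t *testing.T)
	}{
		{"CREATE service w/ spn label", func(t *testing.T) { /* create svc1 with the label */ }},
		{"VERIFY svc1 does not work", func(t *testing.T) { /* probe svc1, expect failure */ }},
		{"CREATE service w/o spn label", func(t *testing.T) { /* create svc2 */ }},
		{"VERIFY svc2 works", func(t *testing.T) { /* probe svc2, expect success */ }},
		{"UPDATE the 2nd service w/ spn label", func(t *testing.T) { /* patch svc2 */ }},
		{"VERIFY both services do not work", func(t *testing.T) { /* probe both, expect failure */ }},
	}
	for _, step := range steps {
		if !t.Run(step.name, step.run) {
			t.Fatalf("step %q failed", step.name)
		}
	}
}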

Should Be Able To Update Service Type To Nodeport Listen On Same Port Number But Different Protocols

create a TCP service servicename with type=ClusterIP in the namespace
change the TCP service to type=NodePort
update the NodePort service to listen for TCP- and UDP-based requests over the same port
create a service servicename with type=ExternalName in the namespace
change the ExternalName service to type=ClusterIP
create a service servicename with type=ExternalName in the namespace
change the ExternalName service to type=NodePort
create a service servicename with type=ClusterIP in the namespace
create an active service to test reachability when its FQDN is referred to as the ExternalName of another service
change the ClusterIP service to type=ExternalName
create a service servicename with type=NodePort in the namespace
create an active service to test reachability when its FQDN is referred to as the ExternalName of another service
change the NodePort service to type=ExternalName

Should Be Able To Up And Down Service

create svc1 in the namespace
create svc2 in the namespace
verify service svc1 is up
verify service svc2 is up
stop service svc1
verify service svc1 is no longer up
verify service svc2 is still up
create service svc3 in the namespace
verify service svc2 is still up
verify service svc3 is up

Add unit-tests

To avoid breaking other tests while we refactor the test suite when adding new tests, we should have unit tests for the core modules.

Should Implement service.kubernetes.io/headless

create service-headless in the namespace
create service in the namespace
verify that service is up
verify that service-headless is not up
add the service.kubernetes.io/headless label
verify that service is not up
remove the service.kubernetes.io/headless label
verify that service is up
verify that service-headless is still not up

Should Check Nodeport Outofrange

create service servicename with type NodePort in the namespace
change service servicename to an out-of-range NodePort
delete the original service
create service servicename with an out-of-range NodePort

[validate] Should drop invalid conntrack entries

Regression test for k/k #74839

Packets considered INVALID by conntrack should be dropped.

  1. create a server pod, based on an image which injects an invalid packet to clients once the connection is established
  2. create a service for the pod, for the client connection
  3. create a client that runs nc against the server pod continuously
  4. check that the client pod does not RST the TCP connection when it receives an INVALID packet

Should Allow Create A Basic Sctp Service With Pod And Endpoints

get the state of the SCTP module on the node
create service servicename in the namespace
validate that endpoints do not exist yet
create a pod for the service
validate that endpoints exist
delete the pod
validate that endpoints do not exist anymore
validate that the SCTP module is still not loaded

[validate] should create pod that uses dns

  1. Create two new namespaces
  2. In each namespace, create an rc, svc, and pod based on the cluster-dns examples: https://github.com/kubernetes/examples/tree/master/staging/cluster-dns
  3. Validate that the application is initialized: in each namespace, wait and validate that the pods and services are responding
  4. Verify that namespace-1's DNS name is resolvable; try and retry a socket connect from one of the pods we created in the same namespace
  5. Update the front-end deployment yaml with the newly created namespace-1 backend cluster DNS name
  6. Create the front-end in both namespaces
  7. Verify that the front-end pods in both namespaces can reach the endpoints via the namespace-1 cluster DNS name

sonobuoy plugin

Make a sonobuoy plugin that bundles the service-lb-validator so that we can validate it for vmware-tanzu

Should Create A Pod With Sctp Hostport

get the state of the SCTP module on the selected node
create a pod with a hostPort on the selected node
launch the pod on the node
validate that the SCTP module is still not loaded

ExternalServiceType Local test

Let's build an externalTrafficPolicy=Local LoadBalancer service.

  • make this test fail if the cluster has fewer than 2 nodes
  • will require some refactoring to target pods to specific nodes
  • might be tricky!

@knabben thinks this is easy

[validate] should run iperf2

  1. set up iperf2 server -- a single pod on any node
  2. set up iperf2 client daemonset
  3. wait for server running
  4. wait for all clients running
  5. iterate through the client pods one by one, running iperf2 in client mode to transfer data to the server and back and measure bandwidth
  6. after collecting all the client<->server data, compile and present the results

note:
/* iperf2 command parameters:
* -e: use enhanced reporting giving more tcp/udp and traffic information
* -p %d: server port to connect to
* --reportstyle C: report as Comma-Separated Values
* -i 1: seconds between periodic bandwidth reports
* -c %s: run in client mode, connecting to
*/
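
A small sketch assembling the iperf2 client invocation from the parameters listed above (the server IP and port are placeholder values):

package main

import (
	"fmt"
	"strings"
)

// iperfClientArgs builds the iperf2 client arguments described in the note
// above: enhanced reporting, CSV output, 1-second report interval, client
// mode connecting to the server pod on the given port.
func iperfClientArgs(serverIP string, port int) []string {
	return []string{
		"iperf",
		"-e",                   // enhanced reporting
		"-p", fmt.Sprint(port), // server port to connect to
		"--reportstyle", "C",   // report as comma-separated values
		"-i", "1",              // seconds between periodic bandwidth reports
		"-c", serverIP,         // run in client mode, connecting to the server
	}
}

func main() {
	// Placeholder server address and the conventional iperf2 port.
	fmt.Println(strings.Join(iperfClientArgs("10.96.0.42", 5001), " "))
}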

We should provide a better summary like ginkgo

Having a better summary of the failures would give the operator a more informative overview of the final result; currently we rely only on the exit status code.

Example:

Ran 10 of 7026 Specs in 2928.397 seconds
FAIL! -- 1 Passed | 9 Failed | 0 Pending | 7016 Skipped
--- FAIL: TestE2E (2930.61s)
FAIL
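
A minimal sketch of such a summary line; the counters are hypothetical fields the suite would collect, and calling it with the numbers from the example above reproduces that output:

package main

import "fmt"

// printSummary prints a ginkgo-style one-line summary from pass/fail/skip
// counters.
func printSummary(passed, failed, pending, skipped int, seconds float64) {
	total := passed + failed + pending + skipped
	verdict := "SUCCESS!"
	if failed > 0 {
		verdict = "FAIL!"
	}
	fmt.Printf("Ran %d of %d Specs in %.3f seconds\n", passed+failed, total, seconds)
	fmt.Printf("%s -- %d Passed | %d Failed | %d Pending | %d Skipped\n",
		verdict, passed, failed, pending, skipped)
}

func main() {
	printSummary(1, 9, 0, 7016, 2928.397)
}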

Should Implement service.kubernetes.io/service-proxy-name

create service-disabled in the namespace
create service in the namespace
verify that service is up
verify that service-disabled is not up
add the service.kubernetes.io/service-proxy-name label
verify that service is not up
remove the service.kubernetes.io/service-proxy-name label
verify that service is up
verify that service-disabled is still not up

Support for UDP on pods

All the pods are currently listening only on TCP for testing; we MUST also support UDP, with preference given to the former.

[validate] should check kube-proxy urls

Check the kube-proxy URLs. The /healthz URL is deprecated since Kubernetes v1.16; the more specific /livez and /readyz endpoints should be used instead.

Solution:

  • Create a hostNetwork pod and curl localhost:[proxy bind port]/livez to check the kube-proxy URL.
  • Verify the response; expect it to contain "200 OK".
  • Create a hostNetwork pod and curl localhost:[proxy bind port]/proxyMode to check the kube-proxy URL.
  • Verify the response; expect it to contain "200" (a sketch of these checks follows).
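
A hedged sketch of those checks in Go, run from a hostNetwork pod; the ports are assumptions (10256 is kube-proxy's usual healthz bind port and 10249 its metrics port, which serves /proxyMode) and should be adjusted to the cluster's actual settings:

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// checkURL fetches a kube-proxy endpoint on localhost and reports the status
// line plus body, returning an error unless the response is 200 OK.
func checkURL(url string) error {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	fmt.Printf("%s -> %s %s\n", url, resp.Status, string(body))
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("%s returned %s, want 200 OK", url, resp.Status)
	}
	return nil
}

func main() {
	// Assumed default bind ports; adjust to the cluster's
	// --healthz-bind-address / --metrics-bind-address settings.
	for _, url := range []string{
		"http://localhost:10256/livez",
		"http://localhost:10249/proxyMode",
	} {
		if err := checkURL(url); err != nil {
			fmt.Println("check failed:", err)
		}
	}
}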

Add daemonset probe option

We should add an option to run all the probes from a DaemonSet, and flag that test as slow, since otherwise it might take a long time, e.g. on a 1000-node cluster :).

Use k8devel

A lot of Kubernetes helpers are spread around and could be reused from k8devel.

Add identifiers for different tests in the result message

We run different sets of tests with ./svc-test. With the pretty-printed test result below, it is hard to tell a test's purpose from the result message, e.g. whether it is testing connections for ClusterIP, NodePort, or LoadBalancer.

Amplify the result message for each test, and add a summary message for the results of all tests.

reachability: correct:16, incorrect:0, result=true

52 <nil>
expected:

-		x-55117/pod-1	x-55117/pod-2	x-55117/pod-3	x-55117/pod-4
x-55117/pod-1	.		.		.		.
x-55117/pod-2	.		.		.		.
x-55117/pod-3	.		.		.		.
x-55117/pod-4	.		.		.		.


176 <nil>
observed:

-		x-55117/pod-1	x-55117/pod-2	x-55117/pod-3	x-55117/pod-4
x-55117/pod-1	.		.		.		.
x-55117/pod-2	.		.		.		.
x-55117/pod-3	.		.		.		.
x-55117/pod-4	.		.		.		.


176 <nil>
comparison:

-		x-55117/pod-1	x-55117/pod-2	x-55117/pod-3	x-55117/pod-4
x-55117/pod-1	.		.		.		.
x-55117/pod-2	.		.		.		.
x-55117/pod-3	.		.		.		.
x-55117/pod-4	.		.		.		.


178 <nil>
{"level":"info","ts":1636589916.03517,"caller":"matrix/helper.go:60","msg":"VALIDATION SUCCESSFUL"}
PASS

Should Prevent Nodeport Collisions

create service servicename1 with type NodePort in the namespace
create service servicename2 with a conflicting NodePort
delete service servicename1 to release the NodePort
create service servicename2 with a no-longer-conflicting NodePort

Add Selector for nodes

  1. We want to make sure that our pods are spread to DIFFERENT nodes
  2. Maybe it would be nice if the table explicitly printed out the node names as well?
