
Kelemetry: Global control plane tracing for Kubernetes

Overview

Kelemetry aggregates various data sources, including Kubernetes events, audit logs, and informer updates, into the form of traditional traces, enabling visualization through the Jaeger UI and automatic analysis.

Motivation

As a distributed, asynchronous, declarative API, Kubernetes suffers from lower explainability compared to traditional RPC-based services, as there is no clear causal relationship between events; a change in one object indirectly effects changes in other objects, posing challenges to understanding and troubleshooting the system. Past attempts at tracing in Kubernetes were either limited to single components or excessively intrusive to individual components.

Kelemetry addresses the problem by associating events of related objects into the same trace. By recognizing object relations such as OwnerReferences, related events can be visualized together without prior domain-specific knowledge. The behavior of various components is recorded on the same timeline to reconstruct the causal hierarchy of the actual events.

Features

  • Collect audit logs
  • Collect controller events (i.e. the "Events" section in kubectl describe)
  • Record object diff associated with audit logs
  • Connect objects based on owner references
  • Collect data from custom sources (Plugin API)
  • Connect objects with custom rules with multi-cluster support (Plugin API)
  • Navigate trace with Jaeger UI and API
  • Scalable for multiple large clusters
  • Construct tailor-made metrics based on audit logs
graph TB
    kelemetry[Kelemetry]
    audit-log[Audit log] --> kelemetry
    event[Event] --> kelemetry
    watch[Object watch] --> kelemetry
    kelemetry ---> |OpenTelemetry protocol| storage[Jaeger storage]
    plugin[Kelemetry storage plugin] --> storage
    user[User] --> ui[Jaeger UI] --> plugin

Getting started

Contribution/Development

Code of Conduct

See Code of Conduct.

Community

License

Kelemetry is licensed under Apache License 2.0.


kelemetry's Issues

kelemetry-jaeger-query-1 keeps restarting in dev deployment

Steps to reproduce

  1. check out 0.2.2 tag from kelemetry repo
  2. make stack

Expected behavior

Docker containers are up and running

Actual behavior

kelemetry-jaeger-query-1 keeps restarting with the following logs:

..{"level":"info","ts":1689923838.4368327,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc000202018, {IDLE connection error: desc = \"transport: Error while dialing dial tcp 172.17.0.1:17271: connect: connection refused\"}","system":"grpc","grpc_log":true}
{"level":"info","ts":1689923838.4368534,"caller":"channelz/funcs.go:340","msg":"[core][Channel #1] Channel Connectivity change to IDLE","system":"grpc","grpc_log":true}
{"level":"info","ts":1689923838.4369116,"caller":"channelz/funcs.go:340","msg":"[core][Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1689923838.4369738,"caller":"channelz/funcs.go:340","msg":"[core][Channel #1 SubChannel #2] Subchannel picks a new address \"172.17.0.1:17271\" to connect","system":"grpc","grpc_log":true}
{"level":"info","ts":1689923838.4370942,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc000202018, {CONNECTING <nil>}","system":"grpc","grpc_log":true}
{"level":"info","ts":1689923838.437123,"caller":"channelz/funcs.go:340","msg":"[core][Channel #1] Channel Connectivity change to CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1689923838.4372501,"caller":"grpclog/component.go:71","msg":"[core]Creating new client transport to \"{\\n  \\\"Addr\\\": \\\"172.17.0.1:17271\\\",\\n  \\\"ServerName\\\": \\\"172.17.0.1:17271\\\",\\n  \\\"Attributes\\\": null,\\n  \\\"BalancerAttributes\\\": null,\\n  \\\"Type\\\": 0,\\n  \\\"Metadata\\\": null\\n}\": connection error: desc = \"transport: Error while dialing dial tcp 172.17.0.1:17271: connect: connection refused\"","system":"grpc","grpc_log":true}
{"level":"warn","ts":1689923838.4372795,"caller":"channelz/funcs.go:342","msg":"[core][Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {\n  \"Addr\": \"172.17.0.1:17271\",\n  \"ServerName\": \"172.17.0.1:17271\",\n  \"Attributes\": null,\n  \"BalancerAttributes\": null,\n  \"Type\": 0,\n  \"Metadata\": null\n}. Err: connection error: desc = \"transport: Error while dialing dial tcp 172.17.0.1:17271: connect: connection refused\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1689923838.437295,"caller":"channelz/funcs.go:340","msg":"[core][Channel #1 SubChannel #2] Subchannel Connectivity change to TRANSIENT_FAILURE","system":"grpc","grpc_log":true}
{"level":"info","ts":1689923838.437333,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc000202018, {TRANSIENT_FAILURE connection error: desc = \"transport: Error while dialing dial tcp 172.17.0.1:17271: connect: connection refused\"}","system":"grpc","grpc_log":true}
{"level":"info","ts":1689923838.4373517,"caller":"channelz/funcs.go:340","msg":"[core][Channel #1] Channel Connectivity change to TRANSIENT_FAILURE","system":"grpc","grpc_log":true}
{"level":"info","ts":1689923840.6481597,"caller":"channelz/funcs.go:340","msg":"[core][Channel #1] Channel Connectivity change to SHUTDOWN","system":"grpc","grpc_log":true}
{"level":"info","ts":1689923840.6482139,"caller":"channelz/funcs.go:340","msg":"[core][Channel #1 SubChannel #2] Subchannel Connectivity change to SHUTDOWN","system":"grpc","grpc_log":true}
{"level":"info","ts":1689923840.6482327,"caller":"channelz/funcs.go:340","msg":"[core][Channel #1 SubChannel #2] Subchannel deleted","system":"grpc","grpc_log":true}
{"level":"info","ts":1689923840.6482399,"caller":"channelz/funcs.go:340","msg":"[core][Channel #1] Channel deleted","system":"grpc","grpc_log":true}
{"level":"fatal","ts":1689923840.6482666,"caller":"./main.go:107","msg":"Failed to init storage factory","error":"grpc-plugin builder failed to create a store: error connecting to remote storage: context deadline exceeded","stacktrace":"main.main.func1\n\t./main.go:107\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/[email protected]/command.go:916\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/[email protected]/command.go:1044\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/[email protected]/command.go:968\nmain.main\n\t./main.go:170\nruntime.main\n\truntime/proc.go:250"}
...

dev.docker-compose.yaml

# cat dev.docker-compose.yaml

# docker-compose setup for development setup.
# Use quickstart.docker-compose.yaml if you just want to try out Kelemetry.
# Use the helm chart if you want to deploy in production.
version: "2.2"
services:
  # ETCD cache storage, only required if etcd cache is used
  etcd:
    image: quay.io/coreos/etcd:v3.2
    entrypoint: [etcd]
    command:
      - -name=main
      - -advertise-client-urls=http://etcd:2479
      - -listen-client-urls=http://0.0.0.0:2479
      - -initial-advertise-peer-urls=http://etcd:2380
      - -listen-peer-urls=http://0.0.0.0:2380
      - -initial-cluster-state=new
      - -initial-cluster=main=http://etcd:2380
      - -initial-cluster-token=etcd-cluster-1
      - -data-dir=/var/run/etcd/default.etcd
    volumes:
      - etcd:/var/run/etcd/
    ports:
      - 2479:2379
    restart: always
  # Web frontend for trace view.
  jaeger-query:
    image: jaegertracing/jaeger-query:1.42
    environment:
      SPAN_STORAGE_TYPE: grpc-plugin
      GRPC_STORAGE_SERVER: host.docker.internal:17272 # run on host directly
    ports:
      - 0.0.0.0:16686:16686
    restart: always
  # OTLP collector that writes to Badger
  jaeger-collector:
    image: jaegertracing/jaeger-collector:1.42
    environment:
      COLLECTOR_OTLP_ENABLED: "true"
      SPAN_STORAGE_TYPE: grpc-plugin
      GRPC_STORAGE_SERVER: remote-badger:17271
    ports:
      - 0.0.0.0:4317:4317
    restart: always
  # Backend badger storage
  # Feel free to override environment.SPAN_STORAGE_TYPE to other storages given the proper configuration.
  remote-badger:
    image: jaegertracing/jaeger-remote-storage:1.42
    environment:
      SPAN_STORAGE_TYPE: badger
      BADGER_EPHEMERAL: "false"
      BADGER_DIRECTORY_KEY: /mnt/badger/key
      BADGER_DIRECTORY_VALUE: /mnt/badger/data
    ports:
      - 127.0.0.1:17272:17271
    volumes:
      - badger:/mnt/badger

  # Web frontend for raw trace database view.
  jaeger-query-raw:
    image: jaegertracing/jaeger-query:1.42
    environment:
      SPAN_STORAGE_TYPE: grpc-plugin
      GRPC_STORAGE_SERVER: remote-badger:17271
    ports:
      - 0.0.0.0:26686:16686
    restart: always

volumes:
  etcd: {}
  badger: {}

Kelemetry version

0.2.2

Environment

k8s:1.24.15
Jaeger:1.4.2

metric "diff_decorator_retry_count" was not initialized

Steps to reproduce

Install helm chart

Expected behavior

start normally

Actual behavior

panic: metric "diff_decorator_retry_count" was not initialized
        panic: metric "diff_decorator" was not initialized

Kelemetry version

0.2.2

Environment

Install through helm chart

Quick start has invalid option set

Steps to reproduce

make kind quickstart

Expected behavior

Start process normally

Actual behavior

kelemetry_1         | time="2023-04-20T07:14:56Z" level=fatal msg="Cannot disable \"resource-object-tagger\" because [aggregator] depend on it but are not disabled"

Kelemetry version

940e6fb

Environment

GitHub CI https://github.com/kubewharf/kelemetry/actions/runs/4751485893

Move linking to frontend

Description

Instead of finding the trace as the linker, the aggregator just emits a tag that indicates the cluster/group/resource/namespace/name (CGRNN) of the parent span. Children are emitted with a separate trace ID so that each object has its own trace.

To be precise, we can remove the "parent"/"child" relationship in general and emit a new pseudospan type called "link":

tag name        | tag value
zzz-traceSource | "link"
linkedCluster   | linked object cluster
linkedGroup     | linked object group
linkedResource  | linked object resource
linkedNamespace | linked object namespace
linkedName      | linked object name
linkedRole      | If "parent", the current span is displayed as a child of the linked span. If empty, the linked span is displayed as a child of the current span. Otherwise, the linked span is displayed under a display-only virtual span with the specified role.
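
As a rough, hypothetical sketch (not the actual implementation), the tag set carried by such a "link" pseudospan could look like this in Go; the tag names come from the table above, while the struct literal and concrete values are made up:

package main

import "fmt"

func main() {
	// Hypothetical tag set for a "link" pseudospan, using the tag names
	// proposed in the table above; the concrete values are made up.
	linkTags := map[string]string{
		"zzz-traceSource": "link",
		"linkedCluster":   "cluster-1",
		"linkedGroup":     "apps",
		"linkedResource":  "deployments",
		"linkedNamespace": "default",
		"linkedName":      "example-deployment",
		// "parent": the current span is displayed as a child of the linked span.
		"linkedRole": "parent",
	}
	fmt.Println(linkTags)
}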

User story

Currently, frontend queries that use the exclusive mode have to search the entire trace, making queries unnecessarily slow. Since the user may only be interested in a particular object, we can speed up the query by limiting each trace to one object only.

This allows us to use more aggressive linkers, which may become more significant with #109. Furthermore, dynamic linking means that we can now have multi-object links, such as linking pods to their secrets/nodes (this was previously not possible because nodes can relate to multiple pods).

Chart does not support jaeger-ui config management

Description

Allow configuring link pattern rules in jaeger-ui for the chart deployment method

User story

#25 introduced new tags that could reference other objects, but we do not support utilizing this in the helm chart as there is no jaeger-ui config option.

How to enable the extension trace from apiserver in the chart

Description

Kelemetry was installed via Helm, but the audit events do not appear in the k8s resource trace. With the kind setup, the audit events do appear after uncommenting the extension trace config in tfconfig.yaml. How can this be enabled or configured in a Helm install?

User story

No response

nil deref on ReplaceNameVisitor.Enter

Steps to reproduce

List trace without any filters

Click on a non-object span

Expected behavior

No response

Actual behavior

goroutine 398 [running]:
github.com/kubewharf/kelemetry/pkg/frontend/tf/step.ReplaceNameVisitor.Enter({}, {0xc0466fd800, 0xc0466fd830, 0xc0466fd860, 0xc0466fd890, 0x0}, 0x0)
        /data00/home/chankyin/go/src/github.com/kubewharf/kelemetry/pkg/frontend/tf/step/prune_tags.go:30 +0x6e
github.com/kubewharf/kelemetry/pkg/frontend/tf/tree.spanNode.visit({{0xc0466fd800, 0xc0466fd830, 0xc0466fd860, 0xc0466fd890, 0x0}, 0x0}, {0x408a578, 0x5596210})
        /data00/home/chankyin/go/src/github.com/kubewharf/kelemetry/pkg/frontend/tf/tree/tree.go:120 +0xf9
github.com/kubewharf/kelemetry/pkg/frontend/tf/tree.SpanTree.Visit(...)
        /data00/home/chankyin/go/src/github.com/kubewharf/kelemetry/pkg/frontend/tf/tree/tree.go:116
github.com/kubewharf/kelemetry/pkg/frontend/tf.(*Transformer).Transform(0xc00037bf20, 0xc042691570, 0xc04659f8f0?, 0x459836d0?)
        /data00/home/chankyin/go/src/github.com/kubewharf/kelemetry/pkg/frontend/tf/transform.go:62 +0x2c8
github.com/kubewharf/kelemetry/pkg/frontend/reader.(*spanReader).GetTrace(0xc000514d80, {0x409e6b8, 0xc04659f8f0}, {0x10?, 0xc041061458?})
        /data00/home/chankyin/go/src/github.com/kubewharf/kelemetry/pkg/frontend/reader/reader.go:176 +0x2ca
github.com/jaegertracing/jaeger/plugin/storage/grpc/shared.(*GRPCHandler).GetTrace(0xc000614020, 0xc04659f920, {0x40a6b50, 0xc041c9fe20})
        /home/chankyin/go/pkg/mod/github.com/jaegertracing/[email protected]/plugin/storage/grpc/shared/grpc_handler.go:152 +0xf6
github.com/jaegertracing/jaeger/proto-gen/storage_v1._SpanReaderPlugin_GetTrace_Handler({0x37ddcc0?, 0xc000614020}, {0x40a45c0, 0xc0410613b0})
        /home/chankyin/go/pkg/mod/github.com/jaegertracing/[email protected]/proto-gen/storage_v1/storage.pb.go:1456 +0xf9
google.golang.org/grpc.(*Server).processStreamingRPC(0xc040c24000, {0x40a9ae0, 0xc0435da9c0}, 0xc0464978c0, 0xc00051def0, 0x553e740, 0x0)
        /home/chankyin/go/pkg/mod/google.golang.org/[email protected]/server.go:1639 +0x1fe8
google.golang.org/grpc.(*Server).handleStream(0xc040c24000, {0x40a9ae0, 0xc0435da9c0}, 0xc0464978c0, 0x0)
        /home/chankyin/go/pkg/mod/google.golang.org/[email protected]/server.go:1726 +0xfaf
google.golang.org/grpc.(*Server).serveStreams.func1.2()
        /home/chankyin/go/pkg/mod/google.golang.org/[email protected]/server.go:966 +0xed
created by google.golang.org/grpc.(*Server).serveStreams.func1
        /home/chankyin/go/pkg/mod/google.golang.org/[email protected]/server.go:964 +0x4de

Kelemetry version

d8ffc85

Environment

No response

jaeger UI Service count is 8

Steps to reproduce

jaeger UI Service count is 8

Expected behavior

No response

Actual behavior

No response

Kelemetry version

No response

Environment

No response

kelemetry-consumer and kelemetry-kelemetry-frontend CrashLoopBackOff

Steps to reproduce

  1. check out 0.2.1 tag from kelemetry repo
  2. helm install kelemetry oci://ghcr.io/kubewharf/kelemetry-chart --values values.yaml

Expected behavior

Kelemetry components are up and running as expected

Actual behavior

kelemetry-collector-5ccb7986df-58zdz 1/1 Running 2 (17m ago) 17m
kelemetry-collector-5ccb7986df-5cj8k 1/1 Running 2 (17m ago) 17m
kelemetry-collector-5ccb7986df-vxchw 1/1 Running 2 (17m ago) 17m
kelemetry-consumer-bdb696b46-dpp6c 0/1 CrashLoopBackOff 7 (2m29s ago) 17m
kelemetry-consumer-bdb696b46-f6bht 0/1 CrashLoopBackOff 7 (2m19s ago) 17m
kelemetry-consumer-bdb696b46-xz9xm 0/1 CrashLoopBackOff 7 (2m54s ago) 17m
kelemetry-etcd-0 1/1 Running 0 17m
kelemetry-etcd-1 1/1 Running 0 17m
kelemetry-etcd-2 1/1 Running 0 17m
kelemetry-frontend-5cb6746769-4tc4v 0/2 CrashLoopBackOff 16 (47s ago) 17m
kelemetry-frontend-5cb6746769-kl6xx 0/2 CrashLoopBackOff 17 (38s ago) 17m
kelemetry-frontend-5cb6746769-qhjwd 0/2 CrashLoopBackOff 16 (50s ago) 17m
kelemetry-informers-866df8f86d-2nh7x 1/1 Running 0 17m
kelemetry-informers-866df8f86d-4vlvk 1/1 Running 0 17m
kelemetry-informers-866df8f86d-gwml2 1/1 Running 0 17m
kelemetry-storage-0 1/1 Running 0 17m

  1. kubectl logs kelemetry-frontend-5cb6746769-4tc4v -c storage-plugin

...

time="2023-07-19T14:47:48Z" level=info msg=Initializing mod=jaeger-backend
time="2023-07-19T14:47:48Z" level=info msg=Initializing mod=tf-step/replace-name-visitor
time="2023-07-19T14:47:48Z" level=info msg=Initializing mod=tf-step/prune-tags-visitor
time="2023-07-19T14:47:48Z" level=info msg=Initializing mod=tf-step/compact-duration-visitor
time="2023-07-19T14:47:48Z" level=info msg=Initializing mod=tf-step/extract-nesting-visitor
time="2023-07-19T14:47:48Z" level=info msg=Initializing mod=tf-step/service-operation-replace-visitor
time="2023-07-19T14:47:48Z" level=info msg=Initializing mod=tf-step/cluster-name-visitor
time="2023-07-19T14:47:48Z" level=info msg=Initializing mod=tf-step/group-by-trace-source-visitor
time="2023-07-19T14:47:48Z" level=info msg=Initializing mod=tf-step/object-tags-visitor
time="2023-07-19T14:47:48Z" level=info msg=Initializing mod=tfconfig.RegisteredStep-list
time="2023-07-19T14:47:48Z" level=info msg=Initializing mod=jaeger-transform-config/file
time="2023-07-19T14:47:53Z" level=error msg="error initializing "jaeger-transform-config/file": parse tfconfig modifier error: invalid modifier args: parse extension provider config error: cannot initialize extension storage: grpc-plugin builder failed to create a store: error connecting to remote storage: context deadline exceeded"

  2. kubectl logs kelemetry-consumer-bdb696b46-f6bht
time="2023-07-19T14:46:18Z" level=info msg=Starting mod=pprof
time="2023-07-19T14:46:18Z" level=info msg="Startup complete"
panic: metric "diff_decorator_retry_count" was not initialized
panic: metric "diff_decorator" was not initialized

goroutine 84 [running]:
github.com/kubewharf/kelemetry/pkg/metrics.(*Metric[...]).With(0x4069d9?, 0x4ff15a?)
/src/pkg/metrics/interface.go:158 +0xf4
github.com/kubewharf/kelemetry/pkg/metrics.(*Metric[...]).DeferCount(0xc00071c2f8?, {0x51e68e?, 0x2bde0d0?, 0x3fd9a20?}, 0x2bc0320?)
/src/pkg/metrics/interface.go:169 +0x49
panic({0x226f460, 0xc005d20970})
/usr/local/go/src/runtime/panic.go:884 +0x213
github.com/kubewharf/kelemetry/pkg/metrics.(*Metric[...]).With(0xc005d28390?, 0x3fd9a20?)
/src/pkg/metrics/interface.go:158 +0xf4
github.com/kubewharf/kelemetry/pkg/diff/decorator.(*decorator).tryDecorate(0xc0002e20e0, {0x2bde0d0, 0xc000338730}, {0x2bfccf0, 0xc005d193b0}, 0xc005d10b00, 0xc005cba840)
/src/pkg/diff/decorator/decorator.go:282 +0x1325
github.com/kubewharf/kelemetry/pkg/diff/decorator.(*decorator).Decorate(0xc0002e20e0, {0x2bde0d0, 0xc000338730}, 0xc005d10b00, 0xc005cba840)
/src/pkg/diff/decorator/decorator.go:144 +0x465
github.com/kubewharf/kelemetry/pkg/audit/consumer.(*receiver).handleItem(0xc0001ac0e0, {0x2bde0d0, 0xc000338730}, {0x2bfccf0?, 0xc005d18fc0?}, 0xc005d10b00, 0xc005c71da0)
/src/pkg/audit/consumer/consumer.go:278 +0x12d3
github.com/kubewharf/kelemetry/pkg/audit/consumer.(*receiver).handleMessage(0xc0001ac0e0, {0x2bde0d0, 0xc000338730}, {0x2bfccf0?, 0xc005d18ee0?}, {0xc005d9fbc0, 0x33, 0x0?}, {0xc005dbac80, 0xc58, ...}, ...)
/src/pkg/audit/consumer/consumer.go:175 +0x4de
github.com/kubewharf/kelemetry/pkg/audit/consumer.(*receiver).Init.func1({0x2bde0d0?, 0xc000338730?}, {0x2bfccf0?, 0xc005d18ee0?}, {0xc005d9fbc0?, 0xd40010?, 0xc0005c2f00?}, {0xc005dbac80, 0xc58, 0xc80})
/src/pkg/audit/consumer/consumer.go:130 +0x9d
github.com/kubewharf/kelemetry/pkg/audit/mq/local.(*localConsumer).start.func1()
/src/pkg/audit/mq/local/local.go:203 +0x14c
created by github.com/kubewharf/kelemetry/pkg/audit/mq/local.(*localConsumer).start
/src/pkg/audit/mq/local/local.go:193 +0x95

Kelemetry version

0.2.1

Environment

k8s:1.23.17
Jaeger:1.4.2

frontend: why does the frontend deployment use two identical storage services, and can this be optimized?

Description

frontend deployment manifest

...
      - command:
        - /usr/local/bin/kelemetry
        - --log-level=info
        - --pprof-enable=true
        - --jaeger-backend=jaeger-storage
        - --jaeger-cluster-names=cluster1
        - --jaeger-redirect-server-enable=true
        - --jaeger-storage-plugin-address=:17271   # localhost:17271
        - --jaeger-storage-plugin-enable=true
        - --jaeger-storage.grpc-storage.server=kelemetry-1689762474-storage.kelemetry.svc:17271   # storage-svc
        - --jaeger-storage.span-storage.type=grpc-plugin
        - --jaeger-trace-cache=etcd
        - --jaeger-trace-cache-etcd-endpoints=kelemetry-1689762474-etcd.kelemetry.svc:2379
        - --jaeger-trace-cache-etcd-prefix=/trace/
        - --trace-server-enable=true
        image: ghcr.io/kubewharf/kelemetry:0.1.0
...

User story

This would make it easier for users to become familiar with the project's source code.

Some questions about getting started with the project

Hello, sorry to disturb you. I appreciate this project very much. Is there any design document for this project?
Furthermore, there are hardly any comments in the code, which may make it harder for other contributors to participate. Would it be possible to add some comments to key fields or structures to improve clarity?

GetTrace does not replace TraceID

Steps to reproduce

  1. Visit trace view page by direct URL instead of navigating from list page (or open in new tab)
  2. Press on "Trace JSON"

Expected behavior

Show trace with the same trace ID

Actual behavior

It uses the raw trace ID in the storage backend.

This is because the (*spanReader).GetTrace method does not update TraceID in the trace like FindTraces does:

for _, span := range thumbnail.Spans {
	span.TraceID = cacheId
	for i := range span.References {
		span.References[i].TraceID = cacheId
	}
}

Kelemetry version

0a8b7d6

Environment

No response

how to aggregate multicluster event by kelemetry

From k8s/config/mapoption, there are two fields, targetClusterName and kubeconfig, which seem to mean that Kelemetry can watch multi-cluster events from one master (control-plane) cluster. But I found that the informer only watches targetCluster; the other clusters seem to be used only for the diff API. So I want to know how to aggregate multi-cluster events with Kelemetry. I have some guesses; please correct me if I am wrong:

  1. Deploy kelemetry per cluster in in-cluster mode.
  2. Deploy kelemetry per cluster in out-of-cluster mode in the control-plane cluster.

jaeger-collector is not listening on port 4317

Discussed in #121

Originally posted by calvinxu July 14, 2023

  1. With the audit-log webhook setting as below:
apiVersion: v1
clusters:
- cluster:
    server: http://xxxxx:8080/audit
  name: audit
contexts:
- context:
    cluster: audit
  name: main
current-context: main
kind: Config
preferences: {}

The apiserver logs are normal, but the kelemetry consumer pod log has the following errors:

time="2023-07-14T05:00:40Z" level=error msg="cannot access cluster from object reference" error="cluster \"10.120.127.221\" is not available" mod=owner-linker object=10.120.127.221//pods/default/hello-6c64f775c6-hlmdn
time="2023-07-14T05:00:40Z" level=error msg="cannot access cluster from object reference" error="cluster \"10.120.127.221\" is not available" mod=owner-linker object=10.120.127.221//pods/default/hello-5ccd6d84d6-zldzw
time="2023-07-14T05:00:40Z" level=error msg="cannot access cluster from object reference" error="cluster \"10.120.127.221\" is not available" mod=owner-linker object=10.120.127.221//pods/default/hello-5ccd6d84d6-8dwk6
time="2023-07-14T05:00:41Z" level=error msg="cannot access cluster from object reference" error="cluster \"10.120.127.221\" is not available" mod=owner-linker object=10.120.127.221//pods/default/hello-5ccd6d84d6-qp7g4

Any advice on the webhook setting, or any other setting that needs to be adjusted to make it work?
From the Jaeger UI, there is no tracing data.

#NAME                                   READY   STATUS    RESTARTS   AGE
hello-6c64f775c6-56ghh                 1/1     Running   0          5m31s
hello-6c64f775c6-hlmdn                 1/1     Running   0          5m30s
hello-6c64f775c6-kwbvw                 1/1     Running   0          5m31s
hello-6c64f775c6-nx2gd                 1/1     Running   0          5m31s
hello-6c64f775c6-wb5kz                 1/1     Running   0          5m30s
kelemetry-collector-7459c48b85-psdk9   1/1     Running   0          3h6m
kelemetry-collector-7459c48b85-qzxrn   1/1     Running   0          3h6m
kelemetry-collector-7459c48b85-smm84   1/1     Running   0          3h6m
kelemetry-consumer-549d4b664b-nh7fd    1/1     Running   0          153m
kelemetry-consumer-549d4b664b-xbxtq    1/1     Running   0          153m
kelemetry-consumer-549d4b664b-zfglq    1/1     Running   0          153m
kelemetry-etcd-0                       1/1     Running   0          14h
kelemetry-etcd-1                       1/1     Running   0          14h
kelemetry-etcd-2                       1/1     Running   0          14h
kelemetry-frontend-d984d9fb9-7mb6j     2/2     Running   0          3h6m
kelemetry-frontend-d984d9fb9-p2w7h     2/2     Running   0          3h6m
kelemetry-frontend-d984d9fb9-qp4h4     2/2     Running   0          3h6m
kelemetry-informers-fb8ddb6b4-8wvsn    1/1     Running   0          3h6m
kelemetry-informers-fb8ddb6b4-drh8w    1/1     Running   0          3h6m
kelemetry-informers-fb8ddb6b4-l4cpk    1/1     Running   0          3h6m
kelemetry-storage-0                    1/1     Running   0          14h
# kubectl get svc
NAME                  TYPE           CLUSTER-IP       EXTERNAL-IP      PORT(S)                          AGE
kelemetry-collector   LoadBalancer   10.100.80.232    10.120.127.224   4317:31324/TCP                   80m
kelemetry-etcd        LoadBalancer   10.98.116.229    10.120.127.224   2379:31447/TCP,2380:32611/TCP    75m
kelemetry-query       LoadBalancer   10.110.123.5     10.120.127.224   16686:32328/TCP,8090:30873/TCP   86m
kelemetry-storage     ClusterIP      None             <none>           17271/TCP                        86m
kelemetry-webhook     LoadBalancer   10.108.9.80      10.120.127.224   8080:31597/TCP                   86m
#kubectl logs kelemetry-consumer-8599cb4cc5-cbfkf 
...
2023/07/14 07:21:02 traces export: context deadline exceeded: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.100.80.232:4317: connect: connection refused"
2023/07/14 07:21:02 traces export: context deadline exceeded: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.100.80.232:4317: connect: connection refused"
2023/07/14 07:21:02 traces export: context deadline exceeded: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.100.80.232:4317: connect: connection refused"
2023/07/14 07:21:17 traces export: context deadline exceeded: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.100.80.232:4317: connect: connection refused"
2023/07/14 07:21:27 traces export: context deadline exceeded: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.100.80.232:4317: connect: connection refused"
2023/07/14 07:21:27 traces export: context deadline exceeded: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.100.80.232:4317: connect: connection refused"
2023/07/14 07:21:27 traces export: context deadline exceeded: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.100.80.232:4317: connect: connection refused"
.....

Port 4317 is not listening in the collector container; is something wrong with the collector?

#kubectl exec -it kelemetry-collector-7cf7988fd5-pfvcb -- sh
/ # netstat -tulp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 :::14268                :::*                    LISTEN      1/collector-linux
tcp        0      0 :::14269                :::*                    LISTEN      1/collector-linux
tcp        0      0 :::14250                :::*                    LISTEN      1/collector-linux

Avoid diff cache double-write with leader election

Description

Currently the diff controller runs a multi-leader elector to ensure high availability. However, this puts a lot of stress on the diff cache backend.

To optimize this, we can introduce a new elector for writing to diff cache. While the original lease continues to restrict the number of watch streams, the new elector only restricts the number of diff cache writers.

stateDiagram
  A --> B: acquire watch lease
  B --> C: acquire cache writer lease
  C --> D: lose cache writer lease
  D --> B: maintained state\nfor more than\n15 seconds
State  | Watch lease  | Cache writer lease                                 | Behavior
A      | Not acquired | Do not participate in election                     | Try to acquire watch lease. Do not watch objects.
B      | Acquired     | Not acquired                                       | Try to acquire cache writer lease. Watch objects and maintain a local cache for 15 seconds. Do not write to remote diff cache.
B -> C | Acquired     | Not acquired -> acquired                           | Write the local cache for the previous 15 seconds to the remote diff cache.
C      | Acquired     | Acquired                                           | Watch objects and write to remote diff cache.
D      | Acquired     | Acquired -> lost election for less than 15 seconds | Watch objects and write to remote diff cache.
B      | Acquired     | Lost election for more than 15 seconds             | Try to acquire cache writer lease. Watch objects and maintain a local cache for 15 seconds. Do not write to remote diff cache.
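
As a minimal, hypothetical sketch (not the actual Kelemetry code), the cache-writer gating described in the table above could be structured like this in Go; diffEntry, writerGate, and the writeout callback are all made-up names:

package diffcache

import (
	"sync"
	"time"
)

// diffEntry is a placeholder for one cached object diff.
type diffEntry struct {
	key     string
	payload []byte
	seen    time.Time
}

// writerGate buffers diffs locally while the cache writer lease is not held,
// and flushes the last 15 seconds of buffered diffs once the lease is acquired.
type writerGate struct {
	mu         sync.Mutex
	holdsLease bool
	lostAt     time.Time
	buffer     []diffEntry
	writeout   func(diffEntry) // assumed helper that writes to the remote diff cache
}

func (g *writerGate) Observe(e diffEntry) {
	g.mu.Lock()
	defer g.mu.Unlock()
	// States C and D: write to the remote diff cache while the lease is held,
	// or for up to 15 seconds after losing it.
	if g.holdsLease || time.Since(g.lostAt) < 15*time.Second {
		g.writeout(e)
		return
	}
	// State B: only maintain a rolling 15-second local cache.
	cutoff := time.Now().Add(-15 * time.Second)
	kept := g.buffer[:0]
	for _, old := range g.buffer {
		if old.seen.After(cutoff) {
			kept = append(kept, old)
		}
	}
	g.buffer = append(kept, e)
}

// OnAcquireWriterLease is the B -> C transition: flush the local cache for the
// previous 15 seconds to the remote diff cache, then keep writing directly.
func (g *writerGate) OnAcquireWriterLease() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.holdsLease = true
	for _, e := range g.buffer {
		g.writeout(e)
	}
	g.buffer = nil
}

// OnLoseWriterLease is the C -> D transition; Observe keeps writing remotely
// for 15 more seconds before falling back to state B.
func (g *writerGate) OnLoseWriterLease() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.holdsLease = false
	g.lostAt = time.Now()
}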

User story

No response

Merge traces over multiple half-hours

Description

During FindTraces call, given startTime and endTime, retrieve all trace IDs from floor(startTime, 30 minutes) to floor(endTime, 30 minutes), grouped by the object identity tags (cluster/group/resource/namespace/name). Each group is displayed as a separate trace and has its own tracecache entry. The tracecache value stores the list of trace identifiers instead of a single identifier. During GetTrace call, all identifiers are fetched, then the tf package shall merge spans with the same object identity + nestingLevel tags together (by merging their underlying spans).
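
As a rough sketch under assumed names (findTraceIDs and objectKey are hypothetical, not Kelemetry APIs), flooring the query range to 30-minute buckets and grouping trace IDs by object identity could look like this in Go:

package main

import (
	"fmt"
	"time"
)

// objectKey is the object identity used to group trace IDs into one displayed trace.
type objectKey struct {
	Cluster, Group, Resource, Namespace, Name string
}

// halfHourBuckets returns every 30-minute bucket start from floor(start, 30m)
// up to and including floor(end, 30m).
func halfHourBuckets(start, end time.Time) []time.Time {
	var buckets []time.Time
	last := end.Truncate(30 * time.Minute)
	for t := start.Truncate(30 * time.Minute); !t.After(last); t = t.Add(30 * time.Minute) {
		buckets = append(buckets, t)
	}
	return buckets
}

// findTraceIDs is a stand-in for the storage query that returns the trace ID
// written for each object in the given half-hour bucket.
func findTraceIDs(bucket time.Time) map[objectKey]string {
	return map[objectKey]string{
		{Cluster: "cluster-1", Group: "apps", Resource: "deployments", Namespace: "default", Name: "demo"}: fmt.Sprintf("trace-%d", bucket.Unix()),
	}
}

func main() {
	// A search from 12:15 to 12:45 covers the 12:00 and 12:30 buckets.
	start := time.Date(2023, 7, 20, 12, 15, 0, 0, time.UTC)
	end := time.Date(2023, 7, 20, 12, 45, 0, 0, time.UTC)

	// Group all trace identifiers by object identity; each group becomes one
	// merged trace, with its own tracecache entry holding the full ID list.
	groups := map[objectKey][]string{}
	for _, bucket := range halfHourBuckets(start, end) {
		for key, traceID := range findTraceIDs(bucket) {
			groups[key] = append(groups[key], traceID)
		}
	}
	fmt.Println(groups)
}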

User story

When user searches spans from 12:15 to 12:45, the user expects to see events between this period, not events whose root object span is within this period (i.e. only 12:30). The current behavior is very confusing to users, and very inconvenient to use because debugging an actual event that happened across multiple traces requires opening two separate pages.

Setup GitHub workflow for pull request to render demo trace

Description

When a pull request is created/synchronized, render the sample API response similar to what the generate-page workflow is doing now. Instead of uploading the page, start a chromium headless instance to load the page and take a screenshot, then post it to the pull request.

User story

This helps spot UI breakage in the frontend more easily.

frontend pod CrashLoopBackOff(storage-plugin container) with msg="unknown flag: --trace-server-enable"

Steps to reproduce

Deploy following the steps in https://github.com/kubewharf/kelemetry/blob/a832d2856bccdbdedefbe2a437d05c709e22f5bd/docs/DEPLOY.md

#helm install kelemetry oci://ghcr.io/kubewharf/kelemetry-chart --values values.yaml

Expected behavior

All pods and services come up fine and tracing data can be browsed from the UI.

Actual behavior

The frontend pod is in CrashLoopBackOff; checking the log:

kubectl logs kelemetry-frontend-75c566859f-mw44x -c storage-plugin

time="2023-07-17T13:59:40Z" level=error msg="unknown flag: --trace-server-enable"

Running the docker image shows there is no "--trace-server-enable" flag in kelemetry:0.1.0:
#docker run -d ghcr.io/kubewharf/kelemetry:0.1.0
af6e0f708046cbae26858de58df97a56442fabe8490646cc0aa61689564ac8d4
#docker exec -it af6e0f708046cbae26858de58df97a56442fabe8490646cc0aa61689564ac8d4 sh

kelemetry --help

...
--span-cache-local-cleanup-frequency duration frequency to collect garbage from span cache (default 30m0s)
--tracer string implementation of tracer. Possible values are ["otel"]. (default "otel")
--tracer-otel-endpoint string otel endpoint (default "127.0.0.1:4317")
--tracer-otel-insecure allow insecure otel connections
--tracer-otel-resource-attributes stringToString otel resource service attributes (default [service.version=dev])
--usage string

Since the frontend failed, I am not sure whether we need to switch to kelemetry:0.2.0, and what about the other components, kelemetry-consumer and kelemetry-informers? [Tried with 0.2.0; the informers pod goes into CrashLoopBackOff with
kubectl logs kelemetry-informers-84c49bc575-hs5dg
time="2023-07-18T01:30:12Z" level=error msg="unknown flag: --diff-cache-use-old-rv"]

As for kelemetry-collector, there is already issue #122 for version 1.42: it does not listen on 4317. What versions should we use to get tracing working?

There is also a flood of messages like these in the "kelemetry-collector" pod:
{"level":"info","ts":1689602342.3532155,"caller":"[email protected]/server.go:932","msg":"[core][Server #6] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams received bogus greeting from client: \"\\x16\\x03\\x01\\x01\\x1b\\x01\\x00\\x01\\x17\\x03\\x03a\\x9a\\xd36\\x81\\x87\\xba\\xb0\\xdb\\xe8\\xdcp\\xa9\""","system":"grpc","grpc_log":true}
{"level":"info","ts":1689602359.958463,"caller":"[email protected]/server.go:932","msg":"[core][Server #6] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams received bogus greeting from client: \"\\x16\\x03\\x01\\x01\\x1b\\x01\\x00\\x01\\x17\\x03\\x03c\\uf448\\xb2$\\xf7\\x02\\x1f\\xee\\x13\\xe9\\x87\""","system":"grpc","grpc_log":true}
{"level":"info","ts":1689602562.2759514,"caller":"[email protected]/server.go:932","msg":"[core][Server #6] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams received bogus greeting from client: \"\\x16\\x03\\x01\\x01\\x1b\\x01\\x00\\x01\\x17\\x03\\x03Z\\xfb\\x05A=0\\xb0mi\\x12\\x1a\\xfa\\xd6\""","system":"grpc","grpc_log":true}
{"level":"info","ts":1689602634.1497142,"caller":"[email protected]/server.go:932","msg":"[core][Server #6] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams received bogus greeting from client: \"\\x16\\x03\\x01\\x01\\x1b\\x01\\x00\\x01\\x17\\x03\\x03PO\\xb9#%W־\\x03\\x1a\\xac\\t\\x9a\""","system":"grpc","grpc_log":true}
{"level":"info","ts":1689602778.7651076,"caller":"[email protected]/server.go:932","msg":"[core][Server #6] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams received bogus greeting from client: \"\\x16\\x03\\x01\\x01\\x1b\\x01\\x00\\x01\\x17\\x03\\x03e\\xd6z4\\xccLq\\xb286\\xb4\\x02z\""","system":"grpc","grpc_log":true}
{"level":"info","ts":1689602918.8066566,"caller":"[email protected]/server.go:932","msg":"[core][Server #6] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams received bogus greeting from client: \"\\x16\\x03\\x01\\x01\\x1b\\x01\\x00\\x01\\x17\\x03\\x03\\xb7\\x98\\xea\\xe3\\xd0\\x12\\xb2@\\xe7\\xc7S\\xe0\\x19\""","system":"grpc","grpc_log":true}
{"level":"info","ts":1689602953.5312855,"caller":"[email protected]/server.go:932","msg":"[core][Server #6] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams received bogus greeting from client: \"\\x16\\x03\\x01\\x01\\x1b\\x01\\x00\\x01\\x17\\x03\\x03K\\x15\\x01\\xf9\\x1cϋ\\xfb\\xc6xr7\\xe2\""","system":"grpc","grpc_log":true}
{"level":"info","ts":1689603063.8368163,"caller":"[email protected]/server.go:932","msg":"[core][Server #6] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams received bogus greeting from client: \"\\x16\\x03\\x01\\x01\\x1b\\x01\\x00\\x01\\x17\\x03\\x03\\xa9J\\x9bȨު\\xa8\\xb6\\x1bP\\xca\\xee\""","system":"grpc","grpc_log":true}

Kelemetry version

0.1.0

Environment

Kubernetes version :1.24.15
Jaeger version: 1.4.2

how to configure the storageBackend to use elasticsearch

Description

I want to aggregate multi-cluster events with Kelemetry as mentioned in #136,
but I am confused about how to configure the storageBackend to use Elasticsearch. I get errors when I set storageBackend.type to elasticsearch (I'm using kelemetry-chart-0.2.2):

Error: UPGRADE FAILED: template: kelemetry-chart/templates/frontend.deployment.yaml:57:15: executing "kelemetry-chart/templates/frontend.deployment.yaml" at <include "kelemetry.storage-plugin-options" .>: error calling include: template: kelemetry-chart/templates/_helpers.yaml:263:4: executing "kelemetry.storage-plugin-options" at <include "kelemetry.storage-plugin-options-raw" .>: error calling include: template: kelemetry-chart/templates/_helpers.yaml:276:27: executing "kelemetry.storage-plugin-options-raw" at <include "kelemetry.storage-options-raw" .>: error calling include: template: kelemetry-chart/templates/_helpers.yaml:299:4: executing "kelemetry.storage-options-raw" at <include "kelemetry.storage-options-raw-stateless" .>: error calling include: template: no template "kelemetry.storage-options-raw-stateless" associated with template "gotpl"

Is there any example for this?

User story

No response

Support audit diff for error requests

Description

We can try to infer the diff (although not exact) by reading the RequestObject field and comparing it against the original object state. However, this requires caching each resourceVersion of an object in the database.
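
As a minimal sketch (assuming the cached object state and the audit event's RequestObject are both available as JSON; this is not the actual Kelemetry diff code), an approximate diff could be computed like this:

package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// inferDiff compares the RequestObject of a failed audit event against the
// cached object state and reports which top-level fields would have changed.
// The result is approximate, as discussed above.
func inferDiff(cachedState, requestObject []byte) (map[string]bool, error) {
	var before, after map[string]interface{}
	if err := json.Unmarshal(cachedState, &before); err != nil {
		return nil, err
	}
	if err := json.Unmarshal(requestObject, &after); err != nil {
		return nil, err
	}
	changed := map[string]bool{}
	for key, afterVal := range after {
		if !reflect.DeepEqual(before[key], afterVal) {
			changed[key] = true
		}
	}
	for key := range before {
		if _, ok := after[key]; !ok {
			changed[key] = true
		}
	}
	return changed, nil
}

func main() {
	cached := []byte(`{"metadata":{"name":"demo"},"spec":{"replicas":2}}`)
	request := []byte(`{"metadata":{"name":"demo"},"spec":{"replicas":3}}`)
	diff, err := inferDiff(cached, request)
	if err != nil {
		panic(err)
	}
	fmt.Println(diff) // map[spec:true]
}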

User story

An admission webhook gets a lot of requests, most of which fail. As the admission webhook owner does not know which component created these requests, the audit diff would help them diagnose why the admission webhook rejects them.

Deploy using helm chart failed

Steps to reproduce

helm install kelemetry oci://ghcr.io/kubewharf/kelemetry-chart --values values.yaml

Expected behavior

install successfully

Actual behavior

#helm install kelemetry oci://ghcr.io/kubewharf/kelemetry-chart --values values.yaml

Pulled: ghcr.io/kubewharf/kelemetry-chart:0.2.3
Digest: sha256:5c6ae9ee1e30fc8f5f2409db3a5cfa502b0dabf8c42fba3e122e420317f5e89a
Error: INSTALLATION FAILED: template: kelemetry-chart/templates/storage.deployment.yaml:30:16: executing "kelemetry-chart/templates/storage.deployment.yaml" at <include "kelemetry.storage-options-stateless" .>: error calling include: template: no template "kelemetry.storage-options-stateless" associated with template "gotpl"

Kelemetry version

0.2.3

Environment

k8s:1.2.4
Jaeger:1.42

ERROR: Failed to init storage factory

Steps to reproduce

  1. git clone the repository
  2. edit values.yaml replicas=1
  3. change the templates' storageClass to the local Kubernetes default storageClass
  4. helm install

Expected behavior

all pod is running

Actual behavior

All pods are running except the frontend:

kelemetry-1689762474-collector-79f5c59df4-52q4z   1/1     Running            2 (16h ago)   16h
kelemetry-1689762474-consumer-b88789bb4-5kzbf     1/1     Running            0             16h
kelemetry-1689762474-etcd-0                       1/1     Running            0             10m
kelemetry-1689762474-frontend-755b8f47ff-rl4jl    0/2     CrashLoopBackOff   8 (8s ago)    2m14s
kelemetry-1689762474-informers-76fb5d4458-s5ftw   1/1     Running            0             16h
kelemetry-1689762474-storage-0                    1/1     Running            0             16h

error.log

{"level":"warn","ts":1689822124.0339894,"caller":"channelz/funcs.go:342","msg":"[core][Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {\n  \"Addr\": \"localhost:17271\",\n  \"ServerName\": \"localhost:17271\",\n  \"Attributes\": null,\n  \"BalancerAttributes\": null,\n  \"Type\": 0,\n  \"Metadata\": null\n}. Err: connection error: desc = \"transport: Error while dialing dial tcp [::1]:17271: connect: connection refused\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1689822124.034001,"caller":"channelz/funcs.go:340","msg":"[core][Channel #1 SubChannel #2] Subchannel Connectivity change to TRANSIENT_FAILURE","system":"grpc","grpc_log":true}
{"level":"info","ts":1689822124.0340135,"caller":"grpclog/component.go:71","msg":"[core]pickfirstBalancer: UpdateSubConnState: 0xc000196540, {TRANSIENT_FAILURE connection error: desc = \"transport: Error while dialing dial tcp [::1]:17271: connect: connection refused\"}","system":"grpc","grpc_log":true}
{"level":"info","ts":1689822124.0340188,"caller":"channelz/funcs.go:340","msg":"[core][Channel #1] Channel Connectivity change to TRANSIENT_FAILURE","system":"grpc","grpc_log":true}
{"level":"info","ts":1689822126.1586947,"caller":"channelz/funcs.go:340","msg":"[core][Channel #1] Channel Connectivity change to SHUTDOWN","system":"grpc","grpc_log":true}
{"level":"info","ts":1689822126.1587367,"caller":"channelz/funcs.go:340","msg":"[core][Channel #1 SubChannel #2] Subchannel Connectivity change to SHUTDOWN","system":"grpc","grpc_log":true}
{"level":"info","ts":1689822126.1587467,"caller":"channelz/funcs.go:340","msg":"[core][Channel #1 SubChannel #2] Subchannel deleted","system":"grpc","grpc_log":true}
{"level":"info","ts":1689822126.1587508,"caller":"channelz/funcs.go:340","msg":"[core][Channel #1] Channel deleted","system":"grpc","grpc_log":true}
{"level":"fatal","ts":1689822126.1587627,"caller":"./main.go:107","msg":"Failed to init storage factory","error":"grpc-plugin builder failed to create a store: error connecting to remote storage: context deadline exceeded","stacktrace":"main.main.func1\n\t./main.go:107\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/[email protected]/command.go:916\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/[email protected]/command.go:1044\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/[email protected]/command.go:968\nmain.main\n\t./main.go:170\nruntime.main\n\truntime/proc.go:250"}

Kelemetry version

c42c0ff

Environment

kubernetes version:

Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.3", GitCommit:"9e644106593f3f4aa98f8a84b23db5fa378900bd", GitTreeState:"clean", BuildDate:"2023-03-15T13:40:17Z", GoVersion:"go1.19.7", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.0", GitCommit:"b46a3f887ca979b1a5d14fd39cb1af43e7e5d12d", GitTreeState:"clean", BuildDate:"2022-12-08T19:51:45Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"linux/amd64"}

cloud provider: local vm

Jaeger version: jaegertracing/jaeger-collector:1.42 use deployment collector of Kelemetry

storage: custom nfs storage

Audit log is missing from the tracing data

Steps to reproduce

1. check out 0.2.2 tag from kelemetry repo
2. helm install kelemetry oci://ghcr.io/kubewharf/kelemetry-chart --values values.yaml

Expected behavior

On the Jaeger UI, when browsing the tracing data for a deployment/pod, it should show both event and audit data in the trace.

Actual behavior

It only displays the event data.


It is quite odd that the cluster is "foo"; is it from a test value?
My audit-log setting is as below, enabled in the apiserver setting, but I am not sure whether the webhook setting is correct. I roughly followed the quickstart demo setting.

#cat audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: RequestResponse
# cat webhook-config.yaml
apiVersion: v1
clusters:
- cluster:
    server: http://10.200.112.184:8080/audit
  name: audit
contexts:
- context:
    cluster: audit
  name: main
current-context: main
kind: Config
preferences: {}

In the quickstart deployment, both event and audit data are displayed.


Kelemetry version

0.2.2

Environment

k8s:1.23.17
Jaeger:1.4.2

jaeger ui no data

Steps to reproduce

jaeger ui page error: No trace results. Try another query.

kelemetry-informers container log error: level=error msg="query discovery API failed: unable to retrieve the complete list of server APIs: custom.metrics.k8s.io/v1beta1: the server is currently unable to handle the request" cluster=cluster1 mod=discovery

Expected behavior

No response

Actual behavior

No response

Kelemetry version

v0.2.4

Environment

No response

Remove field pseudospans

Description

With the use of lazy frontend transformation, the field pseudospan no longer helps readability. Instead, it increases the complexity of trace transformation, increases the number of entries in the span cache, and doubles the length of the parent span creation chain (an extra "children" span for every parent object).

User story

No response

Links are prematurely merged during Search

Steps to reproduce

  1. Create a trace with multiple child links:
graph TB
A --> B & C
  2. Search B on frontend with "ancestors"

Expected behavior

Only A and B appear in the output trace

Actual behavior

A, B and C all appear in the output trace

If we change ff22 to ff20, the entire trace still appears.

But if we search in exclusive mode directly to begin with, the behavior is correct.

Kelemetry version

e618aa5

Environment

Jaeger storage backend

Display traces side-by-side

Description

Sometimes Kelemetry's linkers are insufficient to relate two traces or linking is too expensive/large (e.g. linking a leader election object would link all operations that interact with the component). The user can tell us the two traces are related explicitly through a new endpoint e.g. /merge?trace={traceid1}&trace={traceid2} that redirects to a new cached ID.
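
A hypothetical sketch of such an endpoint in Go (the /merge path comes from the description above, while the newCachedID helper and the /trace/ redirect target are assumptions, not existing Kelemetry APIs):

package main

import (
	"fmt"
	"net/http"
)

// newCachedID is hypothetical; a real implementation would persist the list of
// trace IDs in the trace cache and return its identifier.
func newCachedID(traceIDs []string) string {
	return fmt.Sprintf("merged-%d", len(traceIDs))
}

func main() {
	// The user explicitly lists the traces to display side by side, and we
	// redirect to a new cached ID representing the merged view.
	http.HandleFunc("/merge", func(w http.ResponseWriter, r *http.Request) {
		traceIDs := r.URL.Query()["trace"]
		if len(traceIDs) < 2 {
			http.Error(w, "need at least two trace parameters", http.StatusBadRequest)
			return
		}
		http.Redirect(w, r, "/trace/"+newCachedID(traceIDs), http.StatusFound)
	})
	_ = http.ListenAndServe(":8080", nil)
}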

User story

It is often useful to debug why a controller is not working by comparing its output activity against its leader lease update frequency.

etcd CrashLoopBackOff and health check failed

Steps to reproduce

  1. check out 0.2.2 tag from kelemetry repo
  2. helm install kelemetry oci://ghcr.io/kubewharf/kelemetry-chart --values values.yaml

Expected behavior

etcd pod running normal and health check successfully

Actual behavior

kelemetry-etcd-0 0/1 CrashLoopBackOff 18 (83s ago) 74m
kelemetry-etcd-1 1/1 Running 0 74m
kelemetry-etcd-2 1/1 Running 0 74m

# kubectl logs kelemetry-etcd-0
2023-07-20 08:54:32.056926 I | etcdmain: etcd Version: 3.3.13
2023-07-20 08:54:32.056972 I | etcdmain: Git SHA: 98d3084
2023-07-20 08:54:32.056976 I | etcdmain: Go Version: go1.10.8
2023-07-20 08:54:32.056979 I | etcdmain: Go OS/Arch: linux/amd64
2023-07-20 08:54:32.056982 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2023-07-20 08:54:32.057030 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2023-07-20 08:54:32.057261 I | embed: listening for peers on http://0.0.0.0:2380
2023-07-20 08:54:32.057293 I | embed: listening for client requests on 0.0.0.0:2379
2023-07-20 08:54:32.057826 I | pkg/netutil: resolving kelemetry-etcd-0:2380 to 192.168.253.80:2380
2023-07-20 08:54:32.060864 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:33.065013 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:34.069684 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:35.073983 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:36.079512 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:37.083088 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:38.102341 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:39.107733 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:40.112295 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:41.116753 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:42.121587 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:43.126384 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:44.131495 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:45.136237 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:46.141086 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:47.145136 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:48.150200 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:49.157245 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:50.162159 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:51.166931 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:52.172077 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:53.177795 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:54.183211 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:55.187278 I | pkg/netutil: resolving kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 to 192.168.253.80:2380
2023-07-20 08:54:55.197378 C | etcdmain: member 6fa7a00416c5d67d has already been bootstrapped
#kubectl logs kelemetry-etcd-1
...
2023-07-20 08:58:06.740587 W | etcdserver: cannot get the version of member 6fa7a00416c5d67d (Get http://kelemetry-etcd-0.kelemetry-etcd.default.svc:2380/version: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host)
2023-07-20 08:58:08.670076 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2023-07-20 08:58:08.670482 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_SNAPSHOT")
2023-07-20 08:58:10.745502 W | etcdserver: failed to reach the peerURL(http://kelemetry-etcd-0.kelemetry-etcd.default.svc:2380) of member 6fa7a00416c5d67d (Get http://kelemetry-etcd-0.kelemetry-etcd.default.svc:2380/version: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host)
2023-07-20 08:58:10.745557 W | etcdserver: cannot get the version of member 6fa7a00416c5d67d (Get http://kelemetry-etcd-0.kelemetry-etcd.default.svc:2380/version: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host)
2023-07-20 08:58:13.670395 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2023-07-20 08:58:13.670855 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_SNAPSHOT")
2023-07-20 08:58:14.750999 W | etcdserver: failed to reach the peerURL(http://kelemetry-etcd-0.kelemetry-etcd.default.svc:2380) of member 6fa7a00416c5d67d (Get http://kelemetry-etcd-0.kelemetry-etcd.default.svc:2380/version: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host)
2023-07-20 08:58:14.751047 W | etcdserver: cannot get the version of member 6fa7a00416c5d67d (Get http://kelemetry-etcd-0.kelemetry-etcd.default.svc:2380/version: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host)

# kubectl logs kelemetry-etcd-2
...
2023-07-20 08:58:53.156209 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2023-07-20 08:58:53.156809 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_SNAPSHOT")
2023-07-20 08:58:58.156996 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2023-07-20 08:58:58.157404 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_SNAPSHOT")
2023-07-20 08:59:03.157340 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2023-07-20 08:59:03.157662 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_SNAPSHOT")
2023-07-20 08:59:08.157610 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2023-07-20 08:59:08.157900 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_SNAPSHOT")
2023-07-20 08:59:13.158076 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2023-07-20 08:59:13.158478 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_SNAPSHOT")



Kelemetry version

0.2.2

Environment

k8s:1.23.17
jaeger:1.4.2

Add integration tests for Helm charts

Description

Setup tests and CI to ensure the chart is working correctly in different setup scenarios:

  • Default setup
  • Custom tfconfig with apiserver tracing
  • ElasticSearch span storage

User story

The Helm chart often falls out of maintenance due to rapid feature iteration, but our CI only tests the make quickstart script, and development often just uses the make stack script.

etcd is easily overloaded

Description

Kelemetry perfectly meets our needs, and I have been running it for a few days.
But one thing that confused me is that the etcd db size keeps increasing. It puts I/O pressure on the etcd server and sometimes causes pending proposals. This could be due to a high number of k8s events (we have also optimized the etcd data disk by using SSDs).

If Kelemetry could support event filtering, it would be a great help, for example filtering out periodic events that I don't care about.
(Is redactPattern a similar filtering feature? I tried it, but it does not work well.)

User story

No response
