datadog / lading
A suite of data generation and load testing tools
License: MIT License
---- payload::opentelemetry_metric::test::payload_not_exceed_max_bytes stdout ----
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'max len: 241, actual: 242', src/payload/opentelemetry_metric.rs:236:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'max len: 241, actual: 242', src/payload/opentelemetry_metric.rs:236:13
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'max len: 241, actual: 242', src/payload/opentelemetry_metric.rs:236:13
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'max len: 241, actual: 242', src/payload/opentelemetry_metric.rs:236:13
proptest: Saving this and future failures in /Users/runner/work/lading/lading/proptest-regressions/payload/opentelemetry_metric.txt
proptest: If this test was run on a CI system, you may wish to add the following line to your copy of the file. (You may need to create it.)
cc 8f93294fa8263a942d270c84bb3b989b57987f8aebdb0ac8efed5d8fba333bba
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'Test failed: max len: 241, actual: 242; minimal failing input: seed = 11168970835765128255, max_bytes = 241
successes: 224
local rejects: 0
global rejects: 0
', src/payload/opentelemetry_metric.rs:227:5
The lading binary's error handling at the top level panics if any error is encountered. Add some structure and improve the error messages.
Right now lading only supports a single source per configuration. This is fine, but it would be valuable to support multiple sources feeding into a single target program.
I'm running lading off of rev 2e8462a with the following cmd and lading.yml:
generator:
- unix_datagram:
seed: [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53,
59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131]
path: "/tmp/socat.socket"
variant: "dogstatsd"
bytes_per_second: "256 Kb"
block_sizes: ["256b", "512b", "1Kb", "2Kb", "3Kb", "4Kb", "5Kb", "6Kb"]
maximum_prebuild_cache_size_bytes: "10 Kb"
cmdline:
RUST_BACKTRACE=1 RUST_LOG=debug ./target/debug/lading --warmup-duration-seconds 1 --experiment-duration-seconds 4 --prometheus-addr=127.0.0.1:6002 --config-path=./socatlading.yaml --target-stderr-path=./socatstderr.txt --target-stdout-path=./socatstdout.txt --target-path strace /usr/bin/socat UNIX-RECV:/tmp/socat.socket OPEN:data.txt,creat,append
I get the following panic:
thread 'tokio-runtime-worker' panicked at 'attempt to add with overflow', /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/iter/traits/accum.rs:141:1
stack backtrace:
0: rust_begin_unwind
at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/std/src/panicking.rs:575:5
1: core::panicking::panic_fmt
at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/panicking.rs:65:14
2: core::panicking::panic
at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/panicking.rs:115:5
3: <u64 as core::iter::traits::accum::Sum>::sum::{{closure}}
at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/iter/traits/accum.rs:45:28
4: core::iter::adapters::map::map_fold::{{closure}}
at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/iter/adapters/map.rs:84:21
5: core::iter::traits::iterator::Iterator::fold
at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/iter/traits/iterator.rs:2414:21
6: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/iter/adapters/map.rs:124:9
7: <u64 as core::iter::traits::accum::Sum>::sum
at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/iter/traits/accum.rs:42:17
8: core::iter::traits::iterator::Iterator::sum
at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/iter/traits/iterator.rs:3381:9
9: lading::observer::Server::run::{{closure}}
at ./src/observer.rs:181:43
10: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
and if I add some debug logging to see what the value of allstats looks like, I see the following:
2023-02-07T16:56:22.897890Z DEBUG lading::observer: Rss lim list has value: Map { iter: Iter([Stat { pid: 4054447, comm: "strace", state: 't', ppid: 4054445, pgrp: 4054435, session: 3292311, tty_nr: 34821, tpgid: 4054435, flags: 1077936192, minflt: 19, cminflt: 0, majflt: 0
, cmajflt: 0, utime: 0, stime: 0, cutime: 0, cstime: 0, priority: 20, nice: 0, num_threads: 1, itrealvalue: 0, starttime: 8194209, vsize: 3989504, rss: 96, rsslim: 18446744073709551615, startcode: 187650176253952, endcode: 187650177525124, startstack: 281474060964784, kstke
sp: 0, kstkeip: 0, signal: 0, blocked: 0, sigignore: 0, sigcatch: 0, wchan: 1, nswap: 0, cnswap: 0, exit_signal: Some(17), processor: Some(1), rt_priority: Some(0), policy: Some(0), delayacct_blkio_ticks: Some(0), guest_time: Some(0), cguest_time: Some(0), start_data: Some(
187650177591184), end_data: Some(187650177900952), start_brk: Some(187650321068032), arg_start: Some(281474060967839), arg_end: Some(281474060967916), env_start: Some(281474060967916), env_end: Some(281474060967916), exit_code: Some(133) }, Stat { pid: 4054445, comm: "strac
e", state: 'R', ppid: 4054435, pgrp: 4054435, session: 3292311, tty_nr: 34821, tpgid: 4054435, flags: 4194304, minflt: 196, cminflt: 1, majflt: 0, cmajflt: 0, utime: 0, stime: 0, cutime: 0, cstime: 0, priority: 20, nice: 0, num_threads: 1, itrealvalue: 0, starttime: 8194209
, vsize: 3989504, rss: 378, rsslim: 18446744073709551615, startcode: 187650176253952, endcode: 187650177525124, startstack: 281474060964784, kstkesp: 0, kstkeip: 0, signal: 0, blocked: 0, sigignore: 0, sigcatch: 0, wchan: 1, nswap: 0, cnswap: 0, exit_signal: Some(17), proce
ssor: Some(3), rt_priority: Some(0), policy: Some(0), delayacct_blkio_ticks: Some(0), guest_time: Some(0), cguest_time: Some(0), start_data: Some(187650177591184), end_data: Some(187650177900952), start_brk: Some(187650321068032), arg_start: Some(281474060967839), arg_end:
Some(281474060967916), env_start: Some(281474060967916), env_end: Some(281474060967916), exit_code: Some(0) }]) }
It looks like there are two pids, both with the rsslim value set to u64::MAX, which then causes the .sum to overflow.
$ uname -a
Linux lima-beef-nov22 5.15.0-58-generic #64-Ubuntu SMP Thu Jan 5 12:06:43 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
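A minimal sketch of one possible fix, assuming the goal is only that the aggregate must not panic: sum the per-process rsslim values with saturating arithmetic so that "unlimited" (u64::MAX) entries clamp instead of overflowing. The function name is illustrative, not lading's actual API.

```rust
// Hypothetical helper: sum rsslim values where the kernel reports
// "unlimited" as u64::MAX. A plain .sum::<u64>() overflows (and panics
// in debug builds) as soon as two such entries appear.
fn total_rss_limit(limits: &[u64]) -> u64 {
    // Saturate instead of panicking; two u64::MAX entries clamp to u64::MAX.
    limits.iter().fold(0u64, |acc, &v| acc.saturating_add(v))
}

fn main() {
    // Two strace pids, both with unlimited rsslim, as in the report above.
    assert_eq!(total_rss_limit(&[u64::MAX, u64::MAX]), u64::MAX);
    assert_eq!(total_rss_limit(&[100, 200]), 300);
}
```

Whether clamping is the right semantics (versus filtering out "unlimited" entries entirely) is a separate design question; the sketch only shows how to avoid the panic.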
OpenTelemetry consumers may drop payloads with distant timestamps. This could cause significant performance differences if some consumers drop payloads more aggressively than others.
This is a tradeoff. Timestamp generation should optimally be deterministic, which precludes the use of system-time-based bounds. The best we can do while maintaining determinism may be to bound timestamps to a far larger block of time, like the current decade.
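A tiny illustration of the decade-bounding idea, with hypothetical constants and function name: map a deterministic offset into a fixed window so that repeated runs with the same seed produce identical timestamps, while keeping every timestamp within a range consumers are unlikely to drop.

```rust
// Hypothetical bounds: the 2020s decade, in unix seconds.
const DECADE_START: u64 = 1_577_836_800; // 2020-01-01T00:00:00Z
const DECADE_END: u64 = 1_893_456_000;   // 2030-01-01T00:00:00Z

// Fold an arbitrary seeded offset into the fixed window. No system time
// is consulted, so the output is fully deterministic.
fn bounded_timestamp(seed_offset: u64) -> u64 {
    DECADE_START + seed_offset % (DECADE_END - DECADE_START)
}

fn main() {
    let ts = bounded_timestamp(123_456_789_012);
    assert!(ts >= DECADE_START && ts < DECADE_END);
    // Same input always yields the same timestamp: determinism preserved.
    assert_eq!(ts, bounded_timestamp(123_456_789_012));
}
```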
Right now file_gen is able to generate new content but it'd be useful if we could also emit stock content from a flat file at a given rate.
Consider vectordotdev/vector#12447. The blackhole needs to develop the ability to mimic a Firehose healthcheck, plus whatever other responses are sufficient to make at least the Vector use-case functional. I haven't captured exactly what Vector is trying, but relevant docs: https://docs.aws.amazon.com/firehose/latest/APIReference/API_DescribeDeliveryStream.html
lading/lading_payload/src/dogstatsd/metric.rs
Lines 88 to 92 in f6d0e7c
This TODO from the code has now come back to bite us. I put in the comment "should not have a big impact on the agent", but that is incorrect. For histograms and distributions this has a very big impact, as it acts as a multiplier for the traffic, which significantly affects memory usage.
I don't have any great ideas on how to find ground-truth for the distribution of sampleRates that are specified "in the wild", but at a guess we'd see a very large number of missing or 1-valued sample rates and then a long tail of smaller values.
The OpenTelemetry metrics generator should support generating Histogram, ExponentialHistogram, and Summary metrics.
As of this writing blackhole has only a few response payloads:
It would be useful if the following responses were also available:
Currently the body variants are here: https://github.com/blt/lading/blob/5acb3c419958536d43282ebfaa5c05042bd49b40/src/blackhole/http.rs#L32-L39
I ran across this issue while working with workload-checks and smp local-run.
Because most keys in the lading::Config recursive struct have a default, there is no error if they aren't present in the specified config; the deserialization process happily applies the defaults and starts the lading run.
This is tolerable so far, but if you have a significantly old lading config (as specified in the agent's workload-checks here) and feed it into a newer lading, the entire dogstatsd block is no longer valid and so defaults are used for the entire configuration.
This is silent: there is no feedback in the logs that unexpected keys were found, nor a dump of the run-time config. The end-user has no way of knowing that the config they specified is NOT the one being used.
This should only be a real problem with local-run, as the user is supplying a lading version. In CI, we should be updating the format of the experiments as we update lading (which we do; example PR)
Lading's RSS data collected through the observer is not accurate if the target spawns sub-processes. Lading is entirely unaware of sub-processes, meaning that their memory statistics do not contribute to the observer's tally.
The lading blackhole might shut down before the target does, leading to a crash in the target. The blackhole should shut down after the target.
Right now, if the target sub-process exits, lading will log the exit code but will not adjust its own exit code. This means that lading can appear to succeed at collecting samples when in practice it did not, in a way that harms use in CI.
It would be useful to know:
Turn ducks & lading logging up to trace in the integration tests. The test runner will deadlock. Discovered while building the UDP generator.
Much like with file_gen and http_gen, this project needs a udp_gen. Same goals apply.
The generators pre-build their data in 'blocks' which, depending on user settings, can take quite a while to complete. Today we assume that this process is instantaneous, with the consequence that any time-sensitive deadlines -- such as the target process' experiment duration, warmup periods, etc. -- are all inaccurate.
It's fine for the target to be started before block generation is complete but we should not start any of the timers until blocks are ready to go.
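The gating idea can be sketched with a plain channel as a stand-in for whatever async primitive lading would actually use: the target and the block builder start concurrently, but the warmup/experiment clocks are not armed until the builder signals readiness.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

// Illustrative helper: returns how long the clocks had to wait for the
// (simulated) block pre-build before they could start.
fn wait_for_prebuild(build_ms: u64) -> Duration {
    let (ready_tx, ready_rx) = mpsc::channel();
    // Stand-in for the concurrent block pre-build; the target process may
    // already be running at this point, which is fine.
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(build_ms));
        let _ = ready_tx.send(());
    });
    let armed = Instant::now();
    ready_rx.recv().expect("builder thread exited"); // gate the timers here
    armed.elapsed()
}

fn main() {
    // Warmup/experiment clocks should not start until blocks are ready.
    assert!(wait_for_prebuild(50) >= Duration::from_millis(50));
}
```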
We should add the ability for lading to produce metric-style events to unblock additional soak test configurations. Depending on the metrics and "flavor" of metrics this may be a new generator or a new variant of existing generators.
We'll likely want to create separate tickets for the metric formats and flavors we want to include in lading.
There are important project invariants that are, today, demonstrated by running lading against Vector in that project's regression detection scheme. That's... not super great. Invariants that need test demonstration:
The project must be able to generate syslog payloads, of the common varieties. Much like the foundationdb format it would be helpful if the lines had some sensible content internally instead of random ascii noise.
Metrics should be emitted with enough information to be differentiated by external tooling. There are a number of locations where metrics emitted by different modules look the same and cannot be differentiated (e.g. bytes_received in both the HTTP & TCP blackholes).
Consider that the internal governor is generally left with a few bytes left over after each request, as requests are made randomly between the maximum line size and 1. This is surprising -- it took me a minute to figure out -- and we ought to change the algorithm to request the maximum line size and then randomly use only a portion of it.
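A toy model of the proposed algorithm (the names here are invented, not lading's governor API): always require capacity for the maximum line size before proceeding, then consume only a random fraction of that grant.

```rust
// Hypothetical sketch: wait until the budget covers the maximum line size,
// then pick a line length as a fraction of that maximum. This way a small
// line never stalls behind capacity reserved for a larger one.
fn next_line_len(remaining_budget: u64, max_line: u64, rand_fraction: f64) -> Option<u64> {
    if remaining_budget < max_line {
        return None; // block until the full maximum fits
    }
    // Use between 1 and max_line bytes of the acquired capacity.
    Some(((max_line as f64 * rand_fraction) as u64).max(1))
}

fn main() {
    assert_eq!(next_line_len(100, 200, 0.5), None); // budget too small
    assert_eq!(next_line_len(1024, 256, 0.5), Some(128));
    assert_eq!(next_line_len(1024, 256, 0.0), Some(1)); // never zero-length
}
```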
Introspecting datadog-agent-to-vector communications would be much easier if it were possible to run vector as a subprocess in the same way as the target, and collect telemetry from it into the collected output.
Doing so would allow us to compare datapoints from the perspective of both programs, which can be useful when debugging aggregation transforms, for instance.
As of this writing the sources generate load into the target program at a fixed throughput. If the target program is unable to meet this target lading participates in coordinated omission, meaning that it produces -- where possible -- in lockstep with the target. We should support other emission modes:
To the last point, we use governor to ensure we write no faster than the fixed throughput, but the result is somewhat jittery: we prepare fixed-size blocks at startup and loop through them in order, meaning a block that would fit might wait much longer than necessary because a large block has come up. Our current fixed-throughput approach also has downsides. Consider that a system might appear "unusually slow" if lading is configured to a low throughput. That is, it's possible to look at the throughput result and misinterpret it as a property of the target rather than of lading's configuration.
This looks like the same dependabot token restriction that we saw elsewhere recently. Skip this check for dependabot runs.
https://github.com/DataDog/lading/actions/runs/3092084055/jobs/5002947799
There is a stray character at the end of the example file. Also, maximum_bytes_per needs to be maximum_bytes_per_file.
The splunk_hec generator spawns its own sub-tasks which stay active even while the rest of lading shuts down. This leaves background tasks still running, which can hang the program on controlled shutdown. We do try to work around this by using a timed shutdown, but this does not appear to work in all cases, as seen here. The top-level process must signal to its spawned tasks that the time has come to shut down.
CI currently only checks x86-64 Linux & Mac targets. We also publish Aarch64 builds of both and should check both in CI.
Validate the config and print a machine-readable listing of expected output metrics (e.g. expect bytes_written from the TCP blackhole)
https://github.com/DataDog/lading/actions/runs/3512867784/jobs/5885035430
---- observer::tests::observer_observes_process_hierarchy stdout ----
thread 'observer::tests::observer_observes_process_hierarchy' panicked at 'assertion failed: `(left == right)`
left: `["sh", "sh"]`,
right: `["sh", "sleep"]`', src/observer.rs:244:9
The way we generate blocks is almost there but not quite right. We first generate a Vec<impl Arbitrary> from the block size in bytes and then serialize it. The end result is blocks that go well over the actual block size, naturally, which has led to some hacky code to kinda-sorta avoid blowing up the governor.
From #372
From https://docs.datadoghq.com/getting_started/tagging/#define-tags
Tags can be up to 200 characters long and support Unicode letters (which includes most character sets, including languages such as Japanese).
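For illustration, the documented limit can be checked by counting Unicode scalar values rather than bytes; a tag of Japanese characters is several times longer in bytes than in characters. `tag_within_limit` is a hypothetical helper, not lading code.

```rust
// Hypothetical validation against the documented Datadog tag limit:
// up to 200 characters, Unicode letters allowed. Count chars, not bytes.
fn tag_within_limit(tag: &str) -> bool {
    tag.chars().count() <= 200
}

fn main() {
    // 200 CJK characters: 600 bytes in UTF-8, but a valid tag length.
    let tag: String = std::iter::repeat('日').take(200).collect();
    assert_eq!(tag.len(), 600);       // byte length
    assert!(tag_within_limit(&tag));  // character count is what matters
    assert!(!tag_within_limit(&"a".repeat(201)));
}
```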
The logic for outputting the json variant was always iffy. The most recent release has tightened the logic around generated sizes, which is now sometimes violated by the json variant, crashing the program.
The OpenTelemetry generators should create payloads with attached resources. This will be a better representation of typical payloads.
While it's useful to be able to generate a constant load, not every load generation test is served by this notion. Some situations call for ramping up load until the system under examination falls behind. This program should support that kind of ramp. Initial work can be as straightforward as observing a bit of telemetry -- prometheus? -- from the system under test.
It would be valuable to be able to produce logs in the structured forms that kubernetes has:
https://kubernetes.io/docs/concepts/cluster-administration/system-logs/
This is a false-positive error:
2023-11-14T17:07:24.609931Z INFO lading: Starting lading run.
2023-11-14T17:07:24.612361Z INFO lading: target is running, now sleeping for warmup
2023-11-14T17:07:24.612453Z INFO lading::target_metrics::prometheus: Prometheus target metrics scraper running
2023-11-14T17:07:25.614083Z INFO lading::target_metrics::prometheus: failed to get Prometheus uri
2023-11-14T17:07:54.614506Z INFO lading: warmup completed, collecting samples
I'm guessing that the first time we check, the target's openmetrics endpoint has not started yet, so we log an error; at some future time we succeed but don't print an "OK" message.
We should either remove this error or also print a "Success" message when we do find the openmetrics endpoint.
The primary motivation behind this issue is to offer a smooth experience on Apple Silicon.
Consider
Lines 28 to 32 in 5aae408
CI failures have been observed with the following payload length error:
---- payload::opentelemetry_metric::test::payload_not_exceed_max_bytes stdout ----
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'max len: 250, actual: 251', src/payload/opentelemetry_metric.rs:236:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'max len: 250, actual: 251', src/payload/opentelemetry_metric.rs:236:13
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'max len: 250, actual: 251', src/payload/opentelemetry_metric.rs:236:13
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'max len: 250, actual: 251', src/payload/opentelemetry_metric.rs:236:13
proptest: Saving this and future failures in /Users/runner/work/lading/lading/proptest-regressions/payload/opentelemetry_metric.txt
proptest: If this test was run on a CI system, you may wish to add the following line to your copy of the file. (You may need to create it.)
cc 5ada42712d32e65e5b5292cbe437a53495db6a483c0f612a24ea5d1846d2e787
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'Test failed: max len: 250, actual: 251; minimal failing input: seed = 16556747614468308026, max_bytes = 250
successes: 111
local rejects: 0
global rejects: 0
', src/payload/opentelemetry_metric.rs:227:5
In DogStatsD v1.1, multiple values can be packed into dogstatsd messages, with the exception of SET metrics.
In the current lading dogstatsd generator, SET messages can and will contain multiple values.
lading/lading/src/payload/dogstatsd/metric.rs
Lines 272 to 276 in 5f8d48b
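The rule can be sketched as follows. This is a toy formatter illustrating the v1.1 constraint, not lading's actual generator code; the multi-value wire shape `name:v1:v2:v3|type` is allowed for most metric types but a set must carry a single value per message.

```rust
// Toy DogStatsD formatter: pack multiple values for most types, but
// restrict SET metrics ("|s") to a single value per message.
fn format_metric(name: &str, values: &[&str], type_suffix: &str) -> String {
    let vals = if type_suffix == "s" {
        // SET metrics must not be multi-value packed.
        values[..1].join(":")
    } else {
        values.join(":")
    };
    format!("{name}:{vals}|{type_suffix}")
}

fn main() {
    // Gauges may pack multiple values...
    assert_eq!(format_metric("my.gauge", &["1", "2", "3"], "g"), "my.gauge:1:2:3|g");
    // ...but a set keeps only one value per message.
    assert_eq!(format_metric("my.set", &["a", "b"], "s"), "my.set:a|s");
}
```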
Files and HTTP endpoints shouldn't have all the fun!
If a splunk_hec generator peer does not respond with an AckID -- as the protocol does not seem to require -- then the lading generator will fail, crashing the emission task but not lading itself. It should not be the case that lading's send task crashes when an ack ID is missing.
The target's PID today has to be discovered by the inspector at runtime, by means of pidof or similar. This means that the process name of the target must be known at lading's configuration time, and causes issues if there is more than one process with the given name. It also implies a race between the inspector and target tasks in the lading binary: the target is not guaranteed to be online when the inspector is started.
Consider the situation where --target-environment-variables is passed an empty string. It should be that no environment variables are set; however, presently lading exits with an error that key/value pairs must be separated by a =.
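A sketch of the desired behavior, with a hypothetical parse_env_vars helper (lading's actual flag parsing may differ): an empty string yields no variables rather than an error, while malformed pairs still fail.

```rust
// Hypothetical parser for --target-environment-variables. An empty input
// means "set no environment variables"; non-empty input must be
// comma-separated KEY=VALUE pairs.
fn parse_env_vars(raw: &str) -> Result<Vec<(String, String)>, String> {
    if raw.trim().is_empty() {
        return Ok(Vec::new()); // empty string: nothing to set, not an error
    }
    raw.split(',')
        .map(|pair| {
            pair.split_once('=')
                .map(|(k, v)| (k.to_string(), v.to_string()))
                .ok_or_else(|| format!("key/value pairs must be separated by a =: {pair}"))
        })
        .collect()
}

fn main() {
    assert_eq!(parse_env_vars("").unwrap().len(), 0); // no exit-with-error
    assert_eq!(
        parse_env_vars("A=1").unwrap(),
        vec![("A".to_string(), "1".to_string())]
    );
    assert!(parse_env_vars("A").is_err()); // still rejects malformed pairs
}
```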