lading's People

Contributors

001wwang, blt, dependabot[bot], dmalis18, fuchsnj, georgehahn, goxberry, jszwedko, modernplumbing, pablosichert, paulcacheux, safchain, scottopell, spencergilbert, tobz, wiyu

lading's Issues

Target sub-process exit code should influence lading's exit code

Right now, if the target sub-process exits, lading will log the exit code but will not adjust its own exit code. This means that lading can appear to succeed at collecting samples when in practice it did not, in a way that harms use in CI.
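
A minimal sketch of the idea, not lading's actual code: propagate the target's exit status into the parent's own exit code so CI can detect target failures.

use std::process::{Command, ExitCode};

fn main() -> ExitCode {
    // Stand-in for the configured target; `false` exits non-zero, so the
    // parent's own exit code should reflect that instead of always being 0.
    let status = Command::new("false").status().expect("failed to spawn target");

    // ... sample collection would happen here ...

    if status.success() {
        ExitCode::SUCCESS
    } else {
        // Surface the target's failure rather than exiting 0 unconditionally.
        ExitCode::from(status.code().unwrap_or(1) as u8)
    }
}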

Auto-scale based on observed telemetry

While it's useful to be able to generate a constant load, not every load generation test is served by that alone. Some situations call for ramping up load until the system under examination falls behind. This program should support that kind of ramp. Initial work can be as straightforward as observing a bit of telemetry -- prometheus? -- from the system under test.
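
A rough, entirely hypothetical sketch of the kind of ramp loop imagined here; set_rate and observed_lag are placeholder stubs, not real lading APIs, and the cadence and step size are made up.

use std::time::Duration;

async fn set_rate(_bytes_per_sec: u64) { /* reconfigure generators (stub) */ }
async fn observed_lag() -> f64 { 0.1 /* scrape a lag metric from the target (stub) */ }

async fn ramp(mut rate: u64, max_acceptable_lag: f64) -> u64 {
    loop {
        set_rate(rate).await;
        // Hold each rate for a while before judging it (shortened for the sketch).
        tokio::time::sleep(Duration::from_secs(1)).await;
        if observed_lag().await > max_acceptable_lag {
            return rate; // the rate at which the target fell behind
        }
        rate += rate / 10; // ramp by roughly 10% per step
    }
}

#[tokio::main]
async fn main() {
    let rate = ramp(1_000_000, 0.05).await;
    println!("target fell behind at {rate} bytes/s");
}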

Support metrics generation

We should add the ability for lading to produce metric style events to unblock additional soak test configurations. Depending on the metrics and "flavor" of metrics this may be a new generator or a new variant of existing generators.

We'll likely want to create separate tickets for the metric formats and flavors we want to include in lading.

Older format lading configs silently ignored

I ran across this issue while working with workload-checks and smp local-run.

Because most keys in the recursive lading::Config struct have a default, keys that aren't present in the specified config produce no error: the deserialization process happily applies the defaults and starts the lading run.

This is kind of okay so far, but if you have a significantly older lading config (as is specified in the agent's workload-checks here) and feed it into a newer lading, the entire dogstatsd block is no longer valid and therefore defaults are used for the entire configuration.

This is silent: there is no feedback in the logs that unexpected keys were found, nor a dump of the run-time config, so the end user has no way of knowing that the config they specified is NOT the one being used.

This should only be a real problem with local-run, as the user is supplying a lading version. In CI, we should be updating the format of the experiments as we update lading (which we do, example PR).
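
One possible mitigation, sketched with serde: reject unknown keys instead of silently falling back to defaults. The field names here are illustrative, not lading's actual Config definition.

use serde::Deserialize;

#[derive(Debug, Deserialize, Default)]
#[serde(deny_unknown_fields)]
struct DogstatsdConfig {
    #[serde(default)]
    contexts: u32,
}

fn main() {
    // An old-format key such as `context_range` now produces a hard error
    // rather than a silently default-valued config.
    let err = serde_yaml::from_str::<DogstatsdConfig>("context_range: 10").unwrap_err();
    println!("{err}");
}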

Json variant is broken

The logic for outputting the json variant was always iffy. The most recent release has tightened the constraints around generated sizes, which the json variant now sometimes violates, crashing the program.

Make `udp_gen`

Much like with file_gen and http_gen this project needs a udp_gen. Same goals apply.

Support multiple sources

Right now lading only supports a single source per configuration. This is fine, but it would be valuable to be able to feed multiple sources into a target program.

Property test important project invariants

There are important project invariants that are, today, demonstrated by running lading against Vector in that project's regression detection scheme. That's... not super great. Invariants that need test demonstration:

  • Lading will always shut down, that is, will not hang. See #203 and #155.
  • User inputs can be satisfied in block construction. Currently our property tests avoid known-bad inputs, implying we need to constrain user input in some way (a rough sketch of such a test follows this list).
  • Startup ordering is correct, that is, the target comes online before the generator and so on. This will require some documentation of what we guarantee.
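
A minimal proptest sketch of the second invariant; construct_block is a stand-in for lading's real block builder, not its actual API.

use proptest::prelude::*;

fn construct_block(max_bytes: usize) -> Option<Vec<u8>> {
    // Stand-in: a real implementation would serialize a payload no larger
    // than `max_bytes`; here we just allocate to keep the sketch runnable.
    (max_bytes > 0).then(|| vec![0u8; max_bytes])
}

proptest! {
    #[test]
    fn block_fits_within_requested_size(max_bytes in 1usize..=1_000_000) {
        let block = construct_block(max_bytes).expect("construction must succeed");
        prop_assert!(block.len() <= max_bytes);
    }
}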

Allow a `subprocess` blackhole

Introspecting datadog-agent-to-vector communications would be much easier if it were possible to run vector as a subprocess in the same way as the target, and collect telemetry from it into the collected output.

Doing so would allow us to compare datapoints from the perspective of both programs, which can be useful when debugging aggregation transforms, for instance.

Remove generator untagged enum configuration

Consider this comment from lading/src/config.rs, lines 28 to 32 at 5aae408:

/// We have many uses that exist prior to the introduction of multiple
/// blackholes, meaning many configs that _assume_ there is only one blackhole
/// and that they do not exist in an array. In order to avoid breaking those
/// configs we support this goofy structure. A deprecation cycle here is in
/// order someday.

It is reasonable to remove the untagged enum here and require a list of generators only.
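
A sketch of the pattern in question, with illustrative names rather than lading's actual types: an untagged enum lets old configs supply a single generator while new configs supply a list. The proposal is to drop the enum and accept only the Vec form.

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Generator {
    name: String,
}

// The "goofy structure": accepts either a single generator or a list.
#[derive(Debug, Deserialize)]
#[serde(untagged)]
enum OneOrMany {
    One(Generator),
    Many(Vec<Generator>),
}

fn main() {
    let single: OneOrMany = serde_yaml::from_str("name: tcp").unwrap();
    let many: OneOrMany = serde_yaml::from_str("- name: tcp\n- name: http").unwrap();
    println!("{single:?} {many:?}");
}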

Support flat-file format

Right now file_gen is able to generate new content but it'd be useful if we could also emit stock content from a flat file at a given rate.

`splunk_hec` generator requires acks, should not

If a splunk_hec generator peer does not respond with an AckID -- which the protocol does not seem to require -- then the lading generator will fail, crashing the emission task but not lading. Lading's send task should not crash when an ack ID is missing.

`splunk_hec` generator does not participate fully in shutdown, hangs shutdown

The splunk_hec generator spawns its own sub-tasks which stay active even while the rest of lading shuts down. This leaves background tasks still running, which can hang the program on controlled shutdown. We do try to work around this by using a timed shutdown, but this does not appear to work in all cases, as seen here. The top-level process must signal to its spawned tasks that the time has come to shut down.
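
A minimal sketch of the coordinated shutdown being asked for, using a tokio broadcast channel; this mirrors the request, it is not lading's actual shutdown mechanism.

use tokio::sync::broadcast;

#[tokio::main]
async fn main() {
    let (shutdown_tx, _) = broadcast::channel::<()>(1);

    for i in 0..3 {
        let mut shutdown_rx = shutdown_tx.subscribe();
        tokio::spawn(async move {
            loop {
                tokio::select! {
                    _ = shutdown_rx.recv() => {
                        println!("sub-task {i} shutting down");
                        break;
                    }
                    _ = tokio::time::sleep(std::time::Duration::from_millis(100)) => {
                        // normal work, e.g. waiting on acks
                    }
                }
            }
        });
    }

    // ... experiment runs ...
    let _ = shutdown_tx.send(()); // signal all sub-tasks to exit
    tokio::time::sleep(std::time::Duration::from_millis(200)).await; // let them report
}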

Block generation isn't quite right

The way we generate blocks is almost there but not quite right. We first generate a Vec<impl Arbitrary> sized from the block's byte budget and then serialize it. The result is blocks that go well over the intended block size, which has led to some hacky code to more or less avoid blowing up the governor.
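
A sketch of one size-respecting alternative, illustrative only: serialize members one at a time and stop before the block would exceed its byte budget, rather than serializing a whole Vec and hoping it fits.

fn build_block(max_bytes: usize, mut next_member: impl FnMut() -> Vec<u8>) -> Vec<u8> {
    let mut block = Vec::with_capacity(max_bytes);
    loop {
        let member = next_member();
        if block.len() + member.len() > max_bytes {
            break; // adding this member would blow the budget
        }
        block.extend_from_slice(&member);
    }
    block
}

fn main() {
    let mut n = 0u8;
    let block = build_block(64, || {
        n = n.wrapping_add(1);
        vec![n; (n % 7 + 1) as usize] // stand-in for a serialized member
    });
    assert!(block.len() <= 64);
    println!("built {} bytes", block.len());
}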

Panic in observer with integer overflow

I'm running lading off of rev 2e8462a with the following cmd and lading.yml:

lading.yml:

generator:
  - unix_datagram:
      seed: [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53,
             59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131]
      path: "/tmp/socat.socket"
      variant: "dogstatsd"
      bytes_per_second: "256 Kb"
      block_sizes: ["256b", "512b", "1Kb", "2Kb", "3Kb", "4Kb", "5Kb", "6Kb"]
      maximum_prebuild_cache_size_bytes: "10 Kb"

cmdline:

RUST_BACKTRACE=1 RUST_LOG=debug ./target/debug/lading --warmup-duration-seconds 1 --experiment-duration-seconds 4 --prometheus-addr=127.0.0.1:6002 --config-path=./socatlading.yaml --target-stderr-path=./socatstderr.txt --target-stdout-path=./socatstdout.txt --target-path strace /usr/bin/socat UNIX-RECV:/tmp/socat.socket OPEN:data.txt,creat,append

I get the following panic:

thread 'tokio-runtime-worker' panicked at 'attempt to add with overflow', /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/iter/traits/accum.rs:141:1
stack backtrace:
   0: rust_begin_unwind
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/panicking.rs:65:14
   2: core::panicking::panic
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/panicking.rs:115:5
   3: <u64 as core::iter::traits::accum::Sum>::sum::{{closure}}
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/iter/traits/accum.rs:45:28
   4: core::iter::adapters::map::map_fold::{{closure}}
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/iter/adapters/map.rs:84:21
   5: core::iter::traits::iterator::Iterator::fold
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/iter/traits/iterator.rs:2414:21
   6: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/iter/adapters/map.rs:124:9
   7: <u64 as core::iter::traits::accum::Sum>::sum
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/iter/traits/accum.rs:42:17
   8: core::iter::traits::iterator::Iterator::sum
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/iter/traits/iterator.rs:3381:9
   9: lading::observer::Server::run::{{closure}}
             at ./src/observer.rs:181:43
  10: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll

and if I add some debug logging to see what the value of allstats looks like, I see the following:

2023-02-07T16:56:22.897890Z DEBUG lading::observer: Rss lim list has value: Map { iter: Iter([Stat { pid: 4054447, comm: "strace", state: 't', ppid: 4054445, pgrp: 4054435, session: 3292311, tty_nr: 34821, tpgid: 4054435, flags: 1077936192, minflt: 19, cminflt: 0, majflt: 0
, cmajflt: 0, utime: 0, stime: 0, cutime: 0, cstime: 0, priority: 20, nice: 0, num_threads: 1, itrealvalue: 0, starttime: 8194209, vsize: 3989504, rss: 96, rsslim: 18446744073709551615, startcode: 187650176253952, endcode: 187650177525124, startstack: 281474060964784, kstke
sp: 0, kstkeip: 0, signal: 0, blocked: 0, sigignore: 0, sigcatch: 0, wchan: 1, nswap: 0, cnswap: 0, exit_signal: Some(17), processor: Some(1), rt_priority: Some(0), policy: Some(0), delayacct_blkio_ticks: Some(0), guest_time: Some(0), cguest_time: Some(0), start_data: Some(
187650177591184), end_data: Some(187650177900952), start_brk: Some(187650321068032), arg_start: Some(281474060967839), arg_end: Some(281474060967916), env_start: Some(281474060967916), env_end: Some(281474060967916), exit_code: Some(133) }, Stat { pid: 4054445, comm: "strac
e", state: 'R', ppid: 4054435, pgrp: 4054435, session: 3292311, tty_nr: 34821, tpgid: 4054435, flags: 4194304, minflt: 196, cminflt: 1, majflt: 0, cmajflt: 0, utime: 0, stime: 0, cutime: 0, cstime: 0, priority: 20, nice: 0, num_threads: 1, itrealvalue: 0, starttime: 8194209
, vsize: 3989504, rss: 378, rsslim: 18446744073709551615, startcode: 187650176253952, endcode: 187650177525124, startstack: 281474060964784, kstkesp: 0, kstkeip: 0, signal: 0, blocked: 0, sigignore: 0, sigcatch: 0, wchan: 1, nswap: 0, cnswap: 0, exit_signal: Some(17), proce
ssor: Some(3), rt_priority: Some(0), policy: Some(0), delayacct_blkio_ticks: Some(0), guest_time: Some(0), cguest_time: Some(0), start_data: Some(187650177591184), end_data: Some(187650177900952), start_brk: Some(187650321068032), arg_start: Some(281474060967839), arg_end:
Some(281474060967916), env_start: Some(281474060967916), env_end: Some(281474060967916), exit_code: Some(0) }]) }

It looks like there are two pids, both with the rsslim value set to u64::MAX, which then causes the .sum() to overflow.

$ uname -a
Linux lima-beef-nov22 5.15.0-58-generic #64-Ubuntu SMP Thu Jan 5 12:06:43 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
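
A sketch of one way to avoid the panic, not necessarily the fix that was applied: rsslim uses u64::MAX as an "unlimited" sentinel, so summing it across processes overflows in debug builds; saturating addition (or skipping the sentinel) sidesteps this.

fn total_rss_limit(rss_limits: &[u64]) -> u64 {
    rss_limits.iter().fold(0u64, |acc, &lim| acc.saturating_add(lim))
}

fn main() {
    // Two processes with RLIM_INFINITY-style limits, as in the report above.
    let limits = [u64::MAX, u64::MAX];
    assert_eq!(total_rss_limit(&limits), u64::MAX);
    println!("ok");
}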

Open question: should OpenTelemetry payloads have recent timestamps?

OpenTelemetry consumers may drop payloads with distant timestamps. This could cause significant performance differences if some consumers drop payloads more aggressively than others.

This is a tradeoff. Timestamp generation should optimally be deterministic, which precludes the use of system-time-based bounds. The best we can do while maintaining determinism may be to bound timestamps to a far larger block of time, like the current decade.
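
A minimal sketch of the "bound to a fixed decade" idea: timestamps are drawn deterministically from a seeded RNG within a hard-coded window, so runs stay reproducible without consulting the system clock. The bounds are illustrative.

use rand::{rngs::StdRng, Rng, SeedableRng};

// Unix seconds for 2020-01-01 and 2030-01-01 (illustrative bounds).
const DECADE_START: u64 = 1_577_836_800;
const DECADE_END: u64 = 1_893_456_000;

fn deterministic_timestamp(rng: &mut StdRng) -> u64 {
    rng.gen_range(DECADE_START..DECADE_END)
}

fn main() {
    let mut rng = StdRng::seed_from_u64(42);
    let ts = deterministic_timestamp(&mut rng);
    assert!((DECADE_START..DECADE_END).contains(&ts));
    println!("{ts}");
}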

Add a config validation subcommand

Validate the config and print a machine-readable listing of expected output metrics (e.g., expect bytes_written from the TCP blackhole).

Generate syslog payloads

The project must be able to generate syslog payloads of the common varieties. Much like the foundationdb format, it would be helpful if the lines had some sensible content internally instead of random ASCII noise.

[dogstatsd] Add sample_rate_prob to avoid setting a random sample rate on every DSD metric

// TODO sample_rate should be option and have a probability that determines if its present
// Mostly inconsequential for the Agent, for certain metric types the Agent
// applies some correction based on this value. Affects count and histogram computation.
// https://docs.datadoghq.com/metrics/custom_metrics/dogstatsd_metrics_submission/#sample-rates
let sample_rate = rng.gen();

This TODO from the code has now come back to bite us. I put in the comment "should not have a big impact on the agent", but that is incorrect. For histograms and distributions this has a very big impact, as it acts as a multiplier for the traffic, which has a significant effect on memory usage.

I don't have any great ideas on how to find ground-truth for the distribution of sampleRates that are specified "in the wild" but at a guess we'd see a very large number of missing/1-valued sample rates and then a long tail of smaller values.
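
A sketch of the requested behavior: sample_rate becomes optional and is only emitted with a configured probability. sample_rate_prob is the hypothetical knob from the issue title, not an existing lading option, and the value distribution shown is a placeholder.

use rand::Rng;

fn maybe_sample_rate<R: Rng>(rng: &mut R, sample_rate_prob: f64) -> Option<f64> {
    if rng.gen_bool(sample_rate_prob) {
        // Present: pick a rate in (0, 1]; the right distribution is itself an
        // open question per the issue text.
        Some(rng.gen_range(0.1..=1.0))
    } else {
        // Absent: the metric line carries no `|@rate` section at all.
        None
    }
}

fn main() {
    let mut rng = rand::thread_rng();
    match maybe_sample_rate(&mut rng, 0.1) {
        Some(rate) => println!("metric.name:1|d|@{rate:.2}"),
        None => println!("metric.name:1|d"),
    }
}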

Warmup, experiment time should not start until generator blocks are constructed

The generators pre-build their data in 'blocks' which, depending on user settings, can take quite a while to complete. Today we assume this process is instantaneous, with the consequence that any time-sensitive deadlines -- such as the target process's experiment duration, warmup periods, etc. -- are inaccurate.

It's fine for the target to be started before block generation is complete but we should not start any of the timers until blocks are ready to go.
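
A sketch of the proposed ordering: start the target immediately, but hold the warmup/experiment timers until block pre-build signals completion. The channel names are illustrative, not lading's internals.

use std::time::Duration;
use tokio::sync::oneshot;

#[tokio::main]
async fn main() {
    let (blocks_ready_tx, blocks_ready_rx) = oneshot::channel::<()>();

    tokio::spawn(async move {
        // Stand-in for expensive block pre-build.
        tokio::time::sleep(Duration::from_secs(2)).await;
        let _ = blocks_ready_tx.send(());
    });

    // The target could already be running here; only the clocks wait.
    blocks_ready_rx.await.expect("block builder exited");
    println!("blocks ready, starting warmup timer now");
    tokio::time::sleep(Duration::from_secs(1)).await; // warmup
    println!("warmup complete, collecting samples");
}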

CI failure on `main`

---- payload::opentelemetry_metric::test::payload_not_exceed_max_bytes stdout ----
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'max len: 241, actual: 242', src/payload/opentelemetry_metric.rs:236:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'max len: 241, actual: 242', src/payload/opentelemetry_metric.rs:236:13
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'max len: 241, actual: 242', src/payload/opentelemetry_metric.rs:236:13
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'max len: 241, actual: 242', src/payload/opentelemetry_metric.rs:236:13
proptest: Saving this and future failures in /Users/runner/work/lading/lading/proptest-regressions/payload/opentelemetry_metric.txt
proptest: If this test was run on a CI system, you may wish to add the following line to your copy of the file. (You may need to create it.)
cc 8f93294fa8263a942d270c84bb3b989b57987f8aebdb0ac8efed5d8fba333bba
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'Test failed: max len: 241, actual: 242; minimal failing input: seed = 11168970835765128255, max_bytes = 241
	successes: 224
	local rejects: 0
	global rejects: 0
', src/payload/opentelemetry_metric.rs:227:5

Generator & blackhole metrics should be sufficiently tagged

Metrics should be emitted with enough information to be differentiated by external tooling. There are a number of locations where metrics emitted by different modules look the same and cannot be differentiated (e.g., bytes_received in both the HTTP and TCP blackholes).
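
A minimal sketch of the tagging being asked for, assuming the metrics crate: attach a component label so identically-named metrics from different modules become distinct series. The exact macro shape depends on the metrics crate version; this follows the 0.22+ style.

use metrics::counter;

fn record_bytes_received(component: &'static str, bytes: u64) {
    // The label turns one ambiguous name into per-component series.
    counter!("bytes_received", "component" => component).increment(bytes);
}

fn main() {
    // With a recorder installed, these now produce two distinct series.
    record_bytes_received("tcp_blackhole", 1024);
    record_bytes_received("http_blackhole", 2048);
}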

Allow for various throughput settings

As of this writing the sources generate load into the target program at a fixed throughput. If the target program is unable to meet this target, lading participates in coordinated omission, meaning that it produces -- where possible -- in lockstep with the target. We should support other emission modes:

  • coupled vs uncoupled, that is, in lockstep to the target's ability to keep up (current behavior) or not,
  • as fast as possible, or, effectively infinite throughput and
  • different distributions.

To the last point, we use governor to ensure we write no faster than the fixed throughput, but the result is somewhat jittery: we prepare fixed-size blocks at startup and loop through them in order, so a small block that would fit under the remaining budget can wait far longer than necessary because a large block has come up ahead of it. Our current fixed-throughput approach also has downsides. Consider that a system might appear "unusually slow" if lading is configured to a low throughput. That is, it's possible to look at the throughput result and misinterpret it as a property of the target rather than of lading's configuration.
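
A hypothetical shape for an emission-mode setting covering the first two bullets above; none of these variant names exist in lading today, and the "different distributions" bullet would be a further variant omitted here.

use serde::Deserialize;

#[derive(Debug, Deserialize)]
#[serde(rename_all = "snake_case")]
enum EmissionMode {
    /// Current behavior: lockstep with the target's ability to keep up.
    Coupled { bytes_per_second: u64 },
    /// Write at the configured rate regardless of whether the target keeps up.
    Uncoupled { bytes_per_second: u64 },
    /// Effectively infinite throughput.
    AsFastAsPossible,
}

fn main() {
    let modes = [
        EmissionMode::Coupled { bytes_per_second: 1_048_576 },
        EmissionMode::Uncoupled { bytes_per_second: 1_048_576 },
        EmissionMode::AsFastAsPossible,
    ];
    for mode in &modes {
        println!("{mode:?}");
    }
}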

Inspector must be passed the target's PID

Today the target's PID has to be discovered by the inspector at runtime, by means of pidof or similar. This means that the target's process name must be known when lading is configured, and it causes issues if there is more than one process with the given name. It also implies a race between the inspector and target tasks in the lading binary: the target is not guaranteed to be online when the inspector is started.
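
A sketch of handing the PID straight to the inspector at spawn time instead of having it run pidof; the env var name is made up and the commands are stand-ins.

use std::process::{Child, Command};

fn spawn_inspector(inspector_path: &str, target_pid: u32) -> std::io::Result<Child> {
    Command::new(inspector_path)
        .env("LADING_TARGET_PID", target_pid.to_string()) // hypothetical variable
        .spawn()
}

fn main() -> std::io::Result<()> {
    // Stand-in target; in lading this would be the supervised target process.
    let target = Command::new("sleep").arg("30").spawn()?;
    let _inspector = spawn_inspector("/usr/bin/env", target.id())?;
    Ok(())
}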

Typo in example

There is a stray character at the end of the example file. Also, maximum_bytes_per needs to be maximum_bytes_per_file.

'observer' does not collect sub-process RSS

Lading's RSS data collected through the observer is not accurate if the target spawns sub-processes. Lading is entirely unaware of sub-processes, meaning that their memory statistics do not contribute to the observer's tally.
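
A rough sketch of including sub-process memory, using the procfs crate: walk /proc, keep processes whose parent is in the target's family, and sum RSS. Exact types vary by procfs version, and a real implementation would follow the full ancestor chain rather than the single pass shown here.

use std::collections::HashSet;

fn rss_pages_including_children(target_pid: i32) -> procfs::ProcResult<u64> {
    let mut family: HashSet<i32> = HashSet::from([target_pid]);
    let mut total_rss_pages = 0u64;

    for proc in procfs::process::all_processes()? {
        let stat = proc?.stat()?;
        if family.contains(&stat.pid) || family.contains(&stat.ppid) {
            family.insert(stat.pid);
            // rss is in pages; multiply by page size for bytes.
            total_rss_pages = total_rss_pages.saturating_add(stat.rss as u64);
        }
    }
    Ok(total_rss_pages)
}

fn main() -> procfs::ProcResult<()> {
    let pages = rss_pages_including_children(std::process::id() as i32)?;
    println!("~{pages} pages resident across the process family");
    Ok(())
}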

[dogstatsd] SET messages incorrectly contain multiple values

In DogStatsD v1.1, multiple values can be packed into a dogstatsd message, with the exception of SET metrics.

In the current lading dogstatsd generator, the SET messages can and will contain multiple values.

// <METRIC_NAME>:<VALUE1>:<VALUE2>:<VALUE3>|<TYPE>|@<SAMPLE_RATE>|#<TAG_KEY_1>:<TAG_VALUE_1>,<TAG_2>
write!(f, "{name}", name = self.name)?;
for val in &self.value {
    write!(f, ":{val}")?;
}
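
A minimal sketch of the fix, with illustrative types rather than lading's actual ones: when the metric kind is SET, only a single value is written.

use std::fmt;

enum MetricKind {
    Set,
    Count,
}

struct Metric {
    name: String,
    kind: MetricKind,
    values: Vec<u64>,
}

impl fmt::Display for Metric {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.name)?;
        match self.kind {
            // DogStatsD does not allow multi-value packing for sets.
            MetricKind::Set => write!(f, ":{}", self.values[0])?,
            _ => {
                for val in &self.values {
                    write!(f, ":{val}")?;
                }
            }
        }
        let kind = match self.kind {
            MetricKind::Set => "s",
            MetricKind::Count => "c",
        };
        write!(f, "|{kind}")
    }
}

fn main() {
    let m = Metric {
        name: "users.unique".to_string(),
        kind: MetricKind::Set,
        values: vec![42, 43, 44],
    };
    assert_eq!(m.to_string(), "users.unique:42|s");
    println!("{m}");
}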

"failed to get Prometheus uri" error

This is a false-positive error:

2023-11-14T17:07:24.609931Z  INFO lading: Starting lading run.
2023-11-14T17:07:24.612361Z  INFO lading: target is running, now sleeping for warmup
2023-11-14T17:07:24.612453Z  INFO lading::target_metrics::prometheus: Prometheus target metrics scraper running
2023-11-14T17:07:25.614083Z  INFO lading::target_metrics::prometheus: failed to get Prometheus uri
2023-11-14T17:07:54.614506Z  INFO lading: warmup completed, collecting samples

I'm guessing that the first time it checks, the target's openmetrics endpoint has not started yet, so we log an error; at some later time we succeed but never print an "OK" message.

We should either remove this error or also print a "Success" message when we do find the openmetrics endpoint.

Generation throughput gradually ramps up

Consider that the internal governor is generally left with a few bytes left over after each request, since requests are made randomly between 1 and the maximum line size. This is surprising -- it took me a minute to figure out -- and we probably ought to change the algorithm to request the maximum line size and then randomly use only a portion of it.
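
A sketch of the proposed accounting, as a plain simulation rather than real governor calls: charge the budget for the full maximum line size up front, then emit a random portion of it, so leftover budget never accumulates in amounts too small to use.

use rand::{rngs::StdRng, Rng, SeedableRng};

fn main() {
    let mut rng = StdRng::seed_from_u64(7);
    let bytes_per_second = 4096u64;
    let maximum_line_size = 1024u64;
    let mut budget = 0u64;

    for second in 0..3 {
        budget += bytes_per_second;
        while budget >= maximum_line_size {
            budget -= maximum_line_size; // charge the full maximum...
            let emitted = rng.gen_range(1..=maximum_line_size); // ...emit a portion
            println!("t={second}s emitted {emitted} bytes, budget left {budget}");
        }
    }
}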

Improve telemetry

It would be useful to know:

  • per target, how many file duplicates are set
  • per target, what the bytes-per-second value is

Otel Metrics payload can be too long

CI failures have been observed with the following payload length error:

 ---- payload::opentelemetry_metric::test::payload_not_exceed_max_bytes stdout ----
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'max len: 250, actual: 251', src/payload/opentelemetry_metric.rs:236:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'max len: 250, actual: 251', src/payload/opentelemetry_metric.rs:236:13
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'max len: 250, actual: 251', src/payload/opentelemetry_metric.rs:236:13
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'max len: 250, actual: 251', src/payload/opentelemetry_metric.rs:236:13
proptest: Saving this and future failures in /Users/runner/work/lading/lading/proptest-regressions/payload/opentelemetry_metric.txt
proptest: If this test was run on a CI system, you may wish to add the following line to your copy of the file. (You may need to create it.)
cc 5ada42712d32e65e5b5292cbe437a53495db6a483c0f612a24ea5d1846d2e787
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'Test failed: max len: 250, actual: 251; minimal failing input: seed = 16556747614468308026, max_bytes = 250
	successes: 111
	local rejects: 0
	global rejects: 0
', src/payload/opentelemetry_metric.rs:227:5
