lading's People

Contributors

001wwang, blt, dependabot[bot], dmalis18, fuchsnj, georgehahn, goxberry, jszwedko, modernplumbing, pablosichert, paulcacheux, safchain, scottopell, spencergilbert, tobz, wiyu

lading's Issues

Target sub-process exit code should influence lading's exit code

Right now, if the target sub-process exits, lading will log the exit code but will not adjust its own exit code. This means that lading can appear to succeed at collecting samples when in practice it did not, in a way that harms use in CI.
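
A minimal sketch of the idea, not lading's actual code: propagate the target's exit status into the parent's own exit code so CI can detect target failures.

use std::process::{Command, ExitCode};

fn main() -> ExitCode {
    // Stand-in for the configured target; `false` exits non-zero, so the
    // parent's own exit code should reflect that instead of always being 0.
    let status = Command::new("false").status().expect("failed to spawn target");

    // ... sample collection would happen here ...

    if status.success() {
        ExitCode::SUCCESS
    } else {
        // Surface the target's failure rather than exiting 0 unconditionally.
        ExitCode::from(status.code().unwrap_or(1) as u8)
    }
}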

Auto-scale based on observed telemetry

While it's useful to be able to generate a constant load, not every load generation test is served by that alone. Some situations call for ramping up load until the system under examination falls behind. This program should support that kind of ramp. Initial work can be as straightforward as observing a bit of telemetry -- prometheus? -- from the system under test.
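
A rough, entirely hypothetical sketch of the kind of ramp loop imagined here; set_rate and observed_lag are placeholder stubs, not real lading APIs, and the cadence and step size are made up.

use std::time::Duration;

async fn set_rate(_bytes_per_sec: u64) { /* reconfigure generators (stub) */ }
async fn observed_lag() -> f64 { 0.1 /* scrape a lag metric from the target (stub) */ }

async fn ramp(mut rate: u64, max_acceptable_lag: f64) -> u64 {
    loop {
        set_rate(rate).await;
        // Hold each rate for a while before judging it (shortened for the sketch).
        tokio::time::sleep(Duration::from_secs(1)).await;
        if observed_lag().await > max_acceptable_lag {
            return rate; // the rate at which the target fell behind
        }
        rate += rate / 10; // ramp by roughly 10% per step
    }
}

#[tokio::main]
async fn main() {
    let rate = ramp(1_000_000, 0.05).await;
    println!("target fell behind at {rate} bytes/s");
}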

Support metrics generation

We should add the ability for lading to produce metric style events to unblock additional soak test configurations. Depending on the metrics and "flavor" of metrics this may be a new generator or a new variant of existing generators.

We'll likely want to create separate tickets for the metric formats and flavors we want to include in lading.

Older format lading configs silently ignored

I ran across this issue while working with workload-checks and smp local-run.

Because most keys in the recursive lading::Config struct have a default, keys that aren't present in the specified config produce no error: the deserialization process happily applies the defaults and starts the lading run.

This is kind of okay so far, but if you have a significantly older lading config (as is specified in the agent's workload-checks here) and feed it into a newer lading, the entire dogstatsd block is no longer valid and therefore defaults are used for the entire configuration.

This is silent: there is no feedback in the logs that unexpected keys were found, nor a dump of the run-time config, so the end user has no way of knowing that the config they specified is NOT the one being used.

This should only be a real problem with local-run, as the user is supplying a lading version. In CI, we should be updating the format of the experiments as we update lading (which we do, example PR).
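
One possible mitigation, sketched with serde: reject unknown keys instead of silently falling back to defaults. The field names here are illustrative, not lading's actual Config definition.

use serde::Deserialize;

#[derive(Debug, Deserialize, Default)]
#[serde(deny_unknown_fields)]
struct DogstatsdConfig {
    #[serde(default)]
    contexts: u32,
}

fn main() {
    // An old-format key such as `context_range` now produces a hard error
    // rather than a silently default-valued config.
    let err = serde_yaml::from_str::<DogstatsdConfig>("context_range: 10").unwrap_err();
    println!("{err}");
}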

Json variant is broken

The logic for outputting the json variant was always iffy. The most recent release has tightened the constraints around generated sizes, which the json variant now sometimes violates, crashing the program.

Make `udp_gen`

Much like with file_gen and http_gen this project needs a udp_gen. Same goals apply.

Support multiple sources

Right now lading only supports a single source per configuration. This is fine, but it would be valuable to be able to feed multiple sources into a target program.

Property test important project invariants

There are important project invariants that are, today, demonstrated by running lading against Vector in that project's regression detection scheme. That's... not super great. Invariants that need test demonstration:

  • Lading will always shut down, that is, will not hang. See #203 and #155.
  • User inputs can be satisfied in block construction. Currently our property tests avoid known-bad inputs, implying we need to constrain user input in some way (a rough sketch of such a test follows this list).
  • Startup ordering is correct, that is, the target comes online before the generator and so on. This will require some documentation of what we guarantee.
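
A minimal proptest sketch of the second invariant; construct_block is a stand-in for lading's real block builder, not its actual API.

use proptest::prelude::*;

fn construct_block(max_bytes: usize) -> Option<Vec<u8>> {
    // Stand-in: a real implementation would serialize a payload no larger
    // than `max_bytes`; here we just allocate to keep the sketch runnable.
    (max_bytes > 0).then(|| vec![0u8; max_bytes])
}

proptest! {
    #[test]
    fn block_fits_within_requested_size(max_bytes in 1usize..=1_000_000) {
        let block = construct_block(max_bytes).expect("construction must succeed");
        prop_assert!(block.len() <= max_bytes);
    }
}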

Allow a `subprocess` blackhole

Introspecting datadog-agent-to-vector communications would be much easier if it were possible to run vector as a subprocess in the same way as the target, and collect telemetry from it into the collected output.

Doing so would allow us to compare datapoints from the perspective of both programs, which can be useful when debugging aggregation transforms, for instance.

Remove generator untagged enum configuration

Consider this comment from lading/src/config.rs, lines 28 to 32 at 5aae408:

/// We have many uses that exist prior to the introduction of multiple
/// blackholes, meaning many configs that _assume_ there is only one blackhole
/// and that they do not exist in an array. In order to avoid breaking those
/// configs we support this goofy structure. A deprecation cycle here is in
/// order someday.

It is reasonable to remove the untagged enum here and require a list of generators only.
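
A sketch of the pattern in question, with illustrative names rather than lading's actual types: an untagged enum lets old configs supply a single generator while new configs supply a list. The proposal is to drop the enum and accept only the Vec form.

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Generator {
    name: String,
}

// The "goofy structure": accepts either a single generator or a list.
#[derive(Debug, Deserialize)]
#[serde(untagged)]
enum OneOrMany {
    One(Generator),
    Many(Vec<Generator>),
}

fn main() {
    let single: OneOrMany = serde_yaml::from_str("name: tcp").unwrap();
    let many: OneOrMany = serde_yaml::from_str("- name: tcp\n- name: http").unwrap();
    println!("{single:?} {many:?}");
}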

Support flat-file format

Right now file_gen is able to generate new content but it'd be useful if we could also emit stock content from a flat file at a given rate.

`splunk_hec` generator requires acks, should not

If a splunk_hec generator peer does not respond with an AckID -- which the protocol does not seem to require -- then the lading generator will fail, crashing the emission task but not lading. Lading's send task should not crash when an ack ID is missing.

`splunk_hec` generator does not participate fully in shutdown, hangs shutdown

The splunk_hec generator spawns its own sub-tasks which stay active even while the rest of lading shuts down. This leaves background tasks still running, which can hang the program on controlled shutdown. We do try to work around this by using a timed shutdown, but this does not appear to work in all cases, as seen here. The top-level process must signal to its spawned tasks that the time has come to shut down.
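
A minimal sketch of the coordinated shutdown being asked for, using a tokio broadcast channel; this mirrors the request, it is not lading's actual shutdown mechanism.

use tokio::sync::broadcast;

#[tokio::main]
async fn main() {
    let (shutdown_tx, _) = broadcast::channel::<()>(1);

    for i in 0..3 {
        let mut shutdown_rx = shutdown_tx.subscribe();
        tokio::spawn(async move {
            loop {
                tokio::select! {
                    _ = shutdown_rx.recv() => {
                        println!("sub-task {i} shutting down");
                        break;
                    }
                    _ = tokio::time::sleep(std::time::Duration::from_millis(100)) => {
                        // normal work, e.g. waiting on acks
                    }
                }
            }
        });
    }

    // ... experiment runs ...
    let _ = shutdown_tx.send(()); // signal all sub-tasks to exit
    tokio::time::sleep(std::time::Duration::from_millis(200)).await; // let them report
}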

Block generation isn't quite right

The way we generate blocks is almost there but not quite right. We first generate a Vec<impl Arbitrary> sized from the block's byte budget and then serialize it. The result is blocks that go well over the intended block size, which has led to some hacky code to more or less avoid blowing up the governor.
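
A sketch of one size-respecting alternative, illustrative only: serialize members one at a time and stop before the block would exceed its byte budget, rather than serializing a whole Vec and hoping it fits.

fn build_block(max_bytes: usize, mut next_member: impl FnMut() -> Vec<u8>) -> Vec<u8> {
    let mut block = Vec::with_capacity(max_bytes);
    loop {
        let member = next_member();
        if block.len() + member.len() > max_bytes {
            break; // adding this member would blow the budget
        }
        block.extend_from_slice(&member);
    }
    block
}

fn main() {
    let mut n = 0u8;
    let block = build_block(64, || {
        n = n.wrapping_add(1);
        vec![n; (n % 7 + 1) as usize] // stand-in for a serialized member
    });
    assert!(block.len() <= 64);
    println!("built {} bytes", block.len());
}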

Panic in observer with integer overflow

I'm running lading off of rev 2e8462a with the following cmd and lading.yml:

lading.yml:

generator:
  - unix_datagram:
      seed: [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53,
             59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131]
      path: "/tmp/socat.socket"
      variant: "dogstatsd"
      bytes_per_second: "256 Kb"
      block_sizes: ["256b", "512b", "1Kb", "2Kb", "3Kb", "4Kb", "5Kb", "6Kb"]
      maximum_prebuild_cache_size_bytes: "10 Kb"

cmdline:

RUST_BACKTRACE=1 RUST_LOG=debug ./target/debug/lading --warmup-duration-seconds 1 --experiment-duration-seconds 4 --prometheus-addr=127.0.0.1:6002 --config-path=./socatlading.yaml --target-stderr-path=./socatstderr.txt --target-stdout-path=./socatstdout.txt --target-path strace /usr/bin/socat UNIX-RECV:/tmp/socat.socket OPEN:data.txt,creat,append

I get the following panic:

thread 'tokio-runtime-worker' panicked at 'attempt to add with overflow', /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/iter/traits/accum.rs:141:1
stack backtrace:
   0: rust_begin_unwind
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/panicking.rs:65:14
   2: core::panicking::panic
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/panicking.rs:115:5
   3: <u64 as core::iter::traits::accum::Sum>::sum::{{closure}}
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/iter/traits/accum.rs:45:28
   4: core::iter::adapters::map::map_fold::{{closure}}
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/iter/adapters/map.rs:84:21
   5: core::iter::traits::iterator::Iterator::fold
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/iter/traits/iterator.rs:2414:21
   6: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/iter/adapters/map.rs:124:9
   7: <u64 as core::iter::traits::accum::Sum>::sum
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/iter/traits/accum.rs:42:17
   8: core::iter::traits::iterator::Iterator::sum
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/iter/traits/iterator.rs:3381:9
   9: lading::observer::Server::run::{{closure}}
             at ./src/observer.rs:181:43
  10: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll

and if I add some debug logging to see what the value of allstats looks like, I see the following:

2023-02-07T16:56:22.897890Z DEBUG lading::observer: Rss lim list has value: Map { iter: Iter([Stat { pid: 4054447, comm: "strace", state: 't', ppid: 4054445, pgrp: 4054435, session: 3292311, tty_nr: 34821, tpgid: 4054435, flags: 1077936192, minflt: 19, cminflt: 0, majflt: 0
, cmajflt: 0, utime: 0, stime: 0, cutime: 0, cstime: 0, priority: 20, nice: 0, num_threads: 1, itrealvalue: 0, starttime: 8194209, vsize: 3989504, rss: 96, rsslim: 18446744073709551615, startcode: 187650176253952, endcode: 187650177525124, startstack: 281474060964784, kstke
sp: 0, kstkeip: 0, signal: 0, blocked: 0, sigignore: 0, sigcatch: 0, wchan: 1, nswap: 0, cnswap: 0, exit_signal: Some(17), processor: Some(1), rt_priority: Some(0), policy: Some(0), delayacct_blkio_ticks: Some(0), guest_time: Some(0), cguest_time: Some(0), start_data: Some(
187650177591184), end_data: Some(187650177900952), start_brk: Some(187650321068032), arg_start: Some(281474060967839), arg_end: Some(281474060967916), env_start: Some(281474060967916), env_end: Some(281474060967916), exit_code: Some(133) }, Stat { pid: 4054445, comm: "strac
e", state: 'R', ppid: 4054435, pgrp: 4054435, session: 3292311, tty_nr: 34821, tpgid: 4054435, flags: 4194304, minflt: 196, cminflt: 1, majflt: 0, cmajflt: 0, utime: 0, stime: 0, cutime: 0, cstime: 0, priority: 20, nice: 0, num_threads: 1, itrealvalue: 0, starttime: 8194209
, vsize: 3989504, rss: 378, rsslim: 18446744073709551615, startcode: 187650176253952, endcode: 187650177525124, startstack: 281474060964784, kstkesp: 0, kstkeip: 0, signal: 0, blocked: 0, sigignore: 0, sigcatch: 0, wchan: 1, nswap: 0, cnswap: 0, exit_signal: Some(17), proce
ssor: Some(3), rt_priority: Some(0), policy: Some(0), delayacct_blkio_ticks: Some(0), guest_time: Some(0), cguest_time: Some(0), start_data: Some(187650177591184), end_data: Some(187650177900952), start_brk: Some(187650321068032), arg_start: Some(281474060967839), arg_end:
Some(281474060967916), env_start: Some(281474060967916), env_end: Some(281474060967916), exit_code: Some(0) }]) }

It looks like there are two pids, both with the rsslim value set to u64::MAX, which then causes the .sum() to overflow.

$ uname -a
Linux lima-beef-nov22 5.15.0-58-generic #64-Ubuntu SMP Thu Jan 5 12:06:43 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
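
A sketch of one way to avoid the panic, not necessarily the fix that was applied: rsslim uses u64::MAX as an "unlimited" sentinel, so summing it across processes overflows in debug builds; saturating addition (or skipping the sentinel) sidesteps this.

fn total_rss_limit(rss_limits: &[u64]) -> u64 {
    rss_limits.iter().fold(0u64, |acc, &lim| acc.saturating_add(lim))
}

fn main() {
    // Two processes with RLIM_INFINITY-style limits, as in the report above.
    let limits = [u64::MAX, u64::MAX];
    assert_eq!(total_rss_limit(&limits), u64::MAX);
    println!("ok");
}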

Open question: should OpenTelemetry payloads have recent timestamps?

OpenTelemetry consumers may drop payloads with distant timestamps. This could cause significant performance differences if some consumers drop payloads more aggressively than others.

This is a tradeoff. Timestamp generation should optimally be deterministic, which precludes the use of system-time-based bounds. The best we can do while maintaining determinism may be to bound timestamps to a far larger block of time, like the current decade.
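
A minimal sketch of the "bound to a fixed decade" idea: timestamps are drawn deterministically from a seeded RNG within a hard-coded window, so runs stay reproducible without consulting the system clock. The bounds are illustrative.

use rand::{rngs::StdRng, Rng, SeedableRng};

// Unix seconds for 2020-01-01 and 2030-01-01 (illustrative bounds).
const DECADE_START: u64 = 1_577_836_800;
const DECADE_END: u64 = 1_893_456_000;

fn deterministic_timestamp(rng: &mut StdRng) -> u64 {
    rng.gen_range(DECADE_START..DECADE_END)
}

fn main() {
    let mut rng = StdRng::seed_from_u64(42);
    let ts = deterministic_timestamp(&mut rng);
    assert!((DECADE_START..DECADE_END).contains(&ts));
    println!("{ts}");
}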

Add a config validation subcommand

Validate the config and print a machine-readable listing of expected output metrics (e.g., expect bytes_written from the TCP blackhole).

Generate syslog payloads

The project must be able to generate syslog payloads of the common varieties. Much like the foundationdb format, it would be helpful if the lines had some sensible content internally instead of random ASCII noise.

[dogstatsd] Add sample_rate_prob to avoid setting a random sample rate on every DSD metric

// TODO sample_rate should be option and have a probability that determines if its present
// Mostly inconsequential for the Agent, for certain metric types the Agent
// applies some correction based on this value. Affects count and histogram computation.
// https://docs.datadoghq.com/metrics/custom_metrics/dogstatsd_metrics_submission/#sample-rates
let sample_rate = rng.gen();

This TODO from the code has now come back to bite us. I put in the comment "should not have a big impact on the agent", but that is incorrect. For histograms and distributions this has a very big impact, as it acts as a multiplier for the traffic, which has a significant effect on memory usage.

I don't have any great ideas on how to find ground-truth for the distribution of sampleRates that are specified "in the wild" but at a guess we'd see a very large number of missing/1-valued sample rates and then a long tail of smaller values.
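
A sketch of the requested behavior: sample_rate becomes optional and is only emitted with a configured probability. sample_rate_prob is the hypothetical knob from the issue title, not an existing lading option, and the value distribution shown is a placeholder.

use rand::Rng;

fn maybe_sample_rate<R: Rng>(rng: &mut R, sample_rate_prob: f64) -> Option<f64> {
    if rng.gen_bool(sample_rate_prob) {
        // Present: pick a rate in (0, 1]; the right distribution is itself an
        // open question per the issue text.
        Some(rng.gen_range(0.1..=1.0))
    } else {
        // Absent: the metric line carries no `|@rate` section at all.
        None
    }
}

fn main() {
    let mut rng = rand::thread_rng();
    match maybe_sample_rate(&mut rng, 0.1) {
        Some(rate) => println!("metric.name:1|d|@{rate:.2}"),
        None => println!("metric.name:1|d"),
    }
}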

Warmup, experiment time should not start until generator blocks are constructed

The generators pre-build their data in 'blocks' which, depending on user settings, can take quite a while to complete. Today we assume this process is instantaneous, with the consequence that any time-sensitive deadlines -- such as the target process's experiment duration, warmup periods, etc. -- are inaccurate.

It's fine for the target to be started before block generation is complete but we should not start any of the timers until blocks are ready to go.
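
A sketch of the proposed ordering: start the target immediately, but hold the warmup/experiment timers until block pre-build signals completion. The channel names are illustrative, not lading's internals.

use std::time::Duration;
use tokio::sync::oneshot;

#[tokio::main]
async fn main() {
    let (blocks_ready_tx, blocks_ready_rx) = oneshot::channel::<()>();

    tokio::spawn(async move {
        // Stand-in for expensive block pre-build.
        tokio::time::sleep(Duration::from_secs(2)).await;
        let _ = blocks_ready_tx.send(());
    });

    // The target could already be running here; only the clocks wait.
    blocks_ready_rx.await.expect("block builder exited");
    println!("blocks ready, starting warmup timer now");
    tokio::time::sleep(Duration::from_secs(1)).await; // warmup
    println!("warmup complete, collecting samples");
}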

CI failure on `main`

---- payload::opentelemetry_metric::test::payload_not_exceed_max_bytes stdout ----
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'max len: 241, actual: 242', src/payload/opentelemetry_metric.rs:236:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'max len: 241, actual: 242', src/payload/opentelemetry_metric.rs:236:13
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'max len: 241, actual: 242', src/payload/opentelemetry_metric.rs:236:13
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'max len: 241, actual: 242', src/payload/opentelemetry_metric.rs:236:13
proptest: Saving this and future failures in /Users/runner/work/lading/lading/proptest-regressions/payload/opentelemetry_metric.txt
proptest: If this test was run on a CI system, you may wish to add the following line to your copy of the file. (You may need to create it.)
cc 8f93294fa8263a942d270c84bb3b989b57987f8aebdb0ac8efed5d8fba333bba
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'Test failed: max len: 241, actual: 242; minimal failing input: seed = 11168970835765128255, max_bytes = 241
	successes: 224
	local rejects: 0
	global rejects: 0
', src/payload/opentelemetry_metric.rs:227:5

Generator & blackhole metrics should be sufficiently tagged

Metrics should be emitted with enough information to be differentiated by external tooling. There are a number of locations where metrics emitted by different modules look the same and cannot be differentiated (e.g., bytes_received in both the HTTP and TCP blackholes).
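
A minimal sketch of the tagging being asked for, assuming the metrics crate: attach a component label so identically-named metrics from different modules become distinct series. The exact macro shape depends on the metrics crate version; this follows the 0.22+ style.

use metrics::counter;

fn record_bytes_received(component: &'static str, bytes: u64) {
    // The label turns one ambiguous name into per-component series.
    counter!("bytes_received", "component" => component).increment(bytes);
}

fn main() {
    // With a recorder installed, these now produce two distinct series.
    record_bytes_received("tcp_blackhole", 1024);
    record_bytes_received("http_blackhole", 2048);
}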

Allow for various throughput settings

As of this writing the sources generate load into the target program at a fixed throughput. If the target program is unable to meet this target, lading participates in coordinated omission, meaning that it produces -- where possible -- in lockstep with the target. We should support other emission modes:

  • coupled vs uncoupled, that is, in lockstep to the target's ability to keep up (current behavior) or not,
  • as fast as possible, or, effectively infinite throughput and
  • different distributions.

To the last point, we use governor to ensure we write no faster than the fixed throughput, but the result is somewhat jittery: we prepare fixed-size blocks at startup and loop through them in order, so a small block that would fit under the remaining budget can wait far longer than necessary because a large block has come up ahead of it. Our current fixed-throughput approach also has downsides. Consider that a system might appear "unusually slow" if lading is configured to a low throughput. That is, it's possible to look at the throughput result and misinterpret it as a property of the target rather than of lading's configuration.
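
A hypothetical shape for an emission-mode setting covering the first two bullets above; none of these variant names exist in lading today, and the "different distributions" bullet would be a further variant omitted here.

use serde::Deserialize;

#[derive(Debug, Deserialize)]
#[serde(rename_all = "snake_case")]
enum EmissionMode {
    /// Current behavior: lockstep with the target's ability to keep up.
    Coupled { bytes_per_second: u64 },
    /// Write at the configured rate regardless of whether the target keeps up.
    Uncoupled { bytes_per_second: u64 },
    /// Effectively infinite throughput.
    AsFastAsPossible,
}

fn main() {
    let modes = [
        EmissionMode::Coupled { bytes_per_second: 1_048_576 },
        EmissionMode::Uncoupled { bytes_per_second: 1_048_576 },
        EmissionMode::AsFastAsPossible,
    ];
    for mode in &modes {
        println!("{mode:?}");
    }
}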

Inspector must be passed the target's PID

Today the target's PID has to be discovered by the inspector at runtime, by means of pidof or similar. This means that the target's process name must be known when lading is configured, and it causes issues if there is more than one process with the given name. It also implies a race between the inspector and target tasks in the lading binary: the target is not guaranteed to be online when the inspector is started.
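
A sketch of handing the PID straight to the inspector at spawn time instead of having it run pidof; the env var name is made up and the commands are stand-ins.

use std::process::{Child, Command};

fn spawn_inspector(inspector_path: &str, target_pid: u32) -> std::io::Result<Child> {
    Command::new(inspector_path)
        .env("LADING_TARGET_PID", target_pid.to_string()) // hypothetical variable
        .spawn()
}

fn main() -> std::io::Result<()> {
    // Stand-in target; in lading this would be the supervised target process.
    let target = Command::new("sleep").arg("30").spawn()?;
    let _inspector = spawn_inspector("/usr/bin/env", target.id())?;
    Ok(())
}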

Typo in example

There is a stray character at the end of the example file. Also, maximum_bytes_per needs to be maximum_bytes_per_file.

'observer' does not collect sub-process RSS

Lading's RSS data collected through the observer is not accurate if the target spawns sub-processes. Lading is entirely unaware of sub-processes, meaning that their memory statistics do not contribute to the observer's tally.
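
A rough sketch of including sub-process memory, using the procfs crate: walk /proc, keep processes whose parent is in the target's family, and sum RSS. Exact types vary by procfs version, and a real implementation would follow the full ancestor chain rather than the single pass shown here.

use std::collections::HashSet;

fn rss_pages_including_children(target_pid: i32) -> procfs::ProcResult<u64> {
    let mut family: HashSet<i32> = HashSet::from([target_pid]);
    let mut total_rss_pages = 0u64;

    for proc in procfs::process::all_processes()? {
        let stat = proc?.stat()?;
        if family.contains(&stat.pid) || family.contains(&stat.ppid) {
            family.insert(stat.pid);
            // rss is in pages; multiply by page size for bytes.
            total_rss_pages = total_rss_pages.saturating_add(stat.rss as u64);
        }
    }
    Ok(total_rss_pages)
}

fn main() -> procfs::ProcResult<()> {
    let pages = rss_pages_including_children(std::process::id() as i32)?;
    println!("~{pages} pages resident across the process family");
    Ok(())
}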

[dogstatsd] SET messages incorrectly contain multiple values

In DogStatsD v1.1, multiple values can be packed into a dogstatsd message, with the exception of SET metrics.

In the current lading dogstatsd generator, the SET messages can and will contain multiple values.

// <METRIC_NAME>:<VALUE1>:<VALUE2>:<VALUE3>|<TYPE>|@<SAMPLE_RATE>|#<TAG_KEY_1>:<TAG_VALUE_1>,<TAG_2>
write!(f, "{name}", name = self.name)?;
for val in &self.value {
    write!(f, ":{val}")?;
}
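
A minimal sketch of the fix, with illustrative types rather than lading's actual ones: when the metric kind is SET, only a single value is written.

use std::fmt;

enum MetricKind {
    Set,
    Count,
}

struct Metric {
    name: String,
    kind: MetricKind,
    values: Vec<u64>,
}

impl fmt::Display for Metric {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.name)?;
        match self.kind {
            // DogStatsD does not allow multi-value packing for sets.
            MetricKind::Set => write!(f, ":{}", self.values[0])?,
            _ => {
                for val in &self.values {
                    write!(f, ":{val}")?;
                }
            }
        }
        let kind = match self.kind {
            MetricKind::Set => "s",
            MetricKind::Count => "c",
        };
        write!(f, "|{kind}")
    }
}

fn main() {
    let m = Metric {
        name: "users.unique".to_string(),
        kind: MetricKind::Set,
        values: vec![42, 43, 44],
    };
    assert_eq!(m.to_string(), "users.unique:42|s");
    println!("{m}");
}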

"failed to get Prometheus uri" error

This is a false-positive error:

2023-11-14T17:07:24.609931Z  INFO lading: Starting lading run.
2023-11-14T17:07:24.612361Z  INFO lading: target is running, now sleeping for warmup
2023-11-14T17:07:24.612453Z  INFO lading::target_metrics::prometheus: Prometheus target metrics scraper running
2023-11-14T17:07:25.614083Z  INFO lading::target_metrics::prometheus: failed to get Prometheus uri
2023-11-14T17:07:54.614506Z  INFO lading: warmup completed, collecting samples

I'm guessing that the first time it checks, the target's openmetrics endpoint has not started yet, so we log an error; at some later time we succeed but never print an "OK" message.

We should either remove this error or also print a "Success" message when we do find the openmetrics endpoint.

Generation throughput gradually ramps up

Consider that the internal governor is generally left with a few bytes left over after each request, since requests are made randomly between 1 and the maximum line size. This is surprising -- it took me a minute to figure out -- and we probably ought to change the algorithm to request the maximum line size and then randomly use only a portion of it.
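
A sketch of the proposed accounting, as a plain simulation rather than real governor calls: charge the budget for the full maximum line size up front, then emit a random portion of it, so leftover budget never accumulates in amounts too small to use.

use rand::{rngs::StdRng, Rng, SeedableRng};

fn main() {
    let mut rng = StdRng::seed_from_u64(7);
    let bytes_per_second = 4096u64;
    let maximum_line_size = 1024u64;
    let mut budget = 0u64;

    for second in 0..3 {
        budget += bytes_per_second;
        while budget >= maximum_line_size {
            budget -= maximum_line_size; // charge the full maximum...
            let emitted = rng.gen_range(1..=maximum_line_size); // ...emit a portion
            println!("t={second}s emitted {emitted} bytes, budget left {budget}");
        }
    }
}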

Improve telemetry

It would be useful to know:

  • per target, how many file duplicates are set
  • per target, what the bytes-per-second value is

Otel Metrics payload can be too long

CI failures have been observed with the following payload length error:

 ---- payload::opentelemetry_metric::test::payload_not_exceed_max_bytes stdout ----
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'max len: 250, actual: 251', src/payload/opentelemetry_metric.rs:236:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'max len: 250, actual: 251', src/payload/opentelemetry_metric.rs:236:13
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'max len: 250, actual: 251', src/payload/opentelemetry_metric.rs:236:13
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'max len: 250, actual: 251', src/payload/opentelemetry_metric.rs:236:13
proptest: Saving this and future failures in /Users/runner/work/lading/lading/proptest-regressions/payload/opentelemetry_metric.txt
proptest: If this test was run on a CI system, you may wish to add the following line to your copy of the file. (You may need to create it.)
cc 5ada42712d32e65e5b5292cbe437a53495db6a483c0f612a24ea5d1846d2e787
thread 'payload::opentelemetry_metric::test::payload_not_exceed_max_bytes' panicked at 'Test failed: max len: 250, actual: 251; minimal failing input: seed = 16556747614468308026, max_bytes = 250
	successes: 111
	local rejects: 0
	global rejects: 0
', src/payload/opentelemetry_metric.rs:227:5
