
slog-agent's People

Contributors

jiping-s, sgtxn


slog-agent's Issues

config: Reorganize defs/params and make them configurable

Background

defs/params.go contains advanced parameters. Some of the values are shared, while others are specific to certain packages or individual modules. Some also depend on others and cannot be set to arbitrary values.

What to do

  • Move parameters to their corresponding packages or individual units and add them to the configuration
  • For parameters that have inter-dependencies, either calculate the values automatically or check them when loading the config (see the sketch below)
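
As a rough illustration of both points, a parameter moved out of defs/params.go could become a package-local struct that is loaded from the configuration and verifies its inter-dependent values at load time. This is only a sketch; all names below are hypothetical:

```go
package bufferparams

import "fmt"

// Params would replace package-specific constants from defs/params.go and be
// embedded into this package's section of the YAML configuration.
type Params struct {
	ChunkMaxSizeBytes int `yaml:"chunkMaxSizeBytes"`
	ChunkMaxRecords   int `yaml:"chunkMaxRecords"`
	MaxBufferedChunks int `yaml:"maxBufferedChunks"`
}

// VerifyConfig checks inter-dependent values when the config is loaded,
// instead of silently accepting arbitrary combinations.
func (p Params) VerifyConfig() error {
	if p.ChunkMaxSizeBytes <= 0 || p.ChunkMaxRecords <= 0 {
		return fmt.Errorf("chunk limits must be positive: size=%d, records=%d",
			p.ChunkMaxSizeBytes, p.ChunkMaxRecords)
	}
	if p.MaxBufferedChunks < 2 {
		return fmt.Errorf("maxBufferedChunks must be at least 2, got %d", p.MaxBufferedChunks)
	}
	return nil
}
```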

all: Upgrade to go 1.18 and generics

Same process as the 1.16 upgrade:

  • Running benchmarks with fixed GOMAXPROCS=12 against different inputs to make sure there is no slowdown
  • Running benchmarks with GOMAXPROCS=1000 against different inputs to make sure there is no slowdown
  • Running benchmarks with default GOMAXPROCS on a server with at least 500 cores to make sure there is no slowdown
  • Try different defs.IntermediateBufferedChannelSize values to see if it should be changed.

Also, replace code generation with generics, as sketched below.
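
For example (a hypothetical sketch, not the actual generated code), a per-type generated channel wrapper could become a single generic type once the toolchain is on Go 1.18:

```go
package channels

// BufferedQueue wraps a buffered channel for any element type, replacing the
// per-type variants previously produced by code generation.
type BufferedQueue[T any] struct {
	ch chan T
}

func NewBufferedQueue[T any](capacity int) *BufferedQueue[T] {
	return &BufferedQueue[T]{ch: make(chan T, capacity)}
}

// Push blocks when the queue is full.
func (q *BufferedQueue[T]) Push(item T) { q.ch <- item }

// Pop blocks until an item is available or the queue is closed.
func (q *BufferedQueue[T]) Pop() (T, bool) {
	item, ok := <-q.ch
	return item, ok
}

func (q *BufferedQueue[T]) Close() { close(q.ch) }
```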

orchestrate: improve visibility on incoming connections

It's currently impossible to tell which clients are connecting and what they are logging through slog-agent.

The existing information:

  • client addresses
  • orchestration keys

is not sufficient in situations where the keys cannot differentiate clients (e.g. multi-instance services), and the full information cannot be represented in metrics due to concerns about metric label cardinality (M inputs * N pipelines * T steps * U outputs).

A workaround, implemented in major: Use Go 1.18 generics, is to log key fields when an incoming connection requests to dispatch logs to a previously-unused pipeline (identified by orchestration keys), but it turned out to be useless since it logs only the orchestration keys.

One option is to log the metric keys in addition to the orchestration keys, and/or dump raw headers in one of the new logs.

A better solution would be to save samples of incoming logs together with the destination pipelines assigned by the orchestrator, and print them on certain signals, which may be done as part of #8.
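
A minimal sketch of that direction, with entirely hypothetical types and names: keep the most recent samples of incoming connections together with the pipeline they were dispatched to, so they can be dumped later on a signal.

```go
package orchestrate

import "sync"

// ConnectionSample records one sampled incoming log and where it was routed.
type ConnectionSample struct {
	ClientAddr   string
	PipelineKeys []string // orchestration keys identifying the destination pipeline
	RawHeader    string   // raw header of the sampled record, for debugging
}

// SampleStore keeps the most recent samples in a fixed-size ring buffer.
type SampleStore struct {
	mu      sync.Mutex
	samples []ConnectionSample
	next    int
}

func NewSampleStore(capacity int) *SampleStore {
	return &SampleStore{samples: make([]ConnectionSample, capacity)}
}

// Add overwrites the oldest sample once the buffer is full.
func (s *SampleStore) Add(sample ConnectionSample) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.samples[s.next] = sample
	s.next = (s.next + 1) % len(s.samples)
}

// Snapshot returns a copy that is safe to print from a signal handler goroutine.
func (s *SampleStore) Snapshot() []ConnectionSample {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make([]ConnectionSample, len(s.samples))
	copy(out, s.samples)
	return out
}
```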

all: Document full integration testing

Document our full integration testing once we have a prototype setup.

The scope of full integration testing should cover:

  • Input: mixed normal log records (~200 B) and huge log records (~50 KB), with a sizable part of them to be dropped
  • Processing: hit every transform except deprecated or not-recommended-for-production ones
  • Output: network output
  • Upstream: fake server from https://github.com/relex/fluentlib with random failures to trigger recovery
  • Kill agent in the middle and restart to test shutdown & recovery: note this only works for catchable signals such as SIGTERM

Automatic verification needs to cover:

  • Final output logs from upstream in JSON
    • The order of output logs should be the same as the input, per pipeline (key-set); see the verification sketch after this list
      • Except when recovery happens: chunks may be re-sent and possibly duplicated, still in the original order. For example, logs 1 2 3 4 5 may be received as 1 2 3 | 2 3 | 4 5
  • Metric values except buffer metrics regarding on-disk vs in-memory chunks
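
A sketch of how the ordering rule could be verified per pipeline, assuming each test input record carries a sequence number that the test harness can extract (a made-up assumption for illustration):

```go
package verify

import "fmt"

// VerifyPipelineOrder checks that the output of one pipeline preserves the
// input order while allowing re-sent duplicates after a recovery, e.g. input
// 1 2 3 4 5 may legitimately arrive as 1 2 3 | 2 3 | 4 5.
func VerifyPipelineOrder(inputSeq, outputSeq []int) error {
	seen := make(map[int]bool)
	next := 0 // index of the next expected input record
	for _, seq := range outputSeq {
		switch {
		case next < len(inputSeq) && seq == inputSeq[next]:
			seen[seq] = true
			next++
		case seen[seq]:
			// duplicate caused by re-sending after recovery: acceptable
		default:
			return fmt.Errorf("unexpected record %d: neither the next expected record nor a re-sent duplicate", seq)
		}
	}
	if next != len(inputSeq) {
		return fmt.Errorf("missing records: only %d of %d delivered", next, len(inputSeq))
	}
	return nil
}
```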

output: Repeated failures in poor network environment

Reproduce

Start the test server in fluentlibtool with random failure chances (30% for all stages is enough) to simulate a poor network environment, and run the benchmark with minimal repetition. Sometimes the client fails to finish and keeps reconnecting only to send pings (which then fail), while there are still chunks queued in the buffer.

Expected behavior

The buffer & client should be able to push chunks through even in such a situation. Also, periodic pings aren't supposed to start while there are still many chunks queued.

Notes

Feeding from the hybrid buffer to the forwarder client's channel may be delayed for an unknown reason, or perhaps the queued chunks should be placed in a "leftover" queue that gets priority processing.

output: Add fast shutdown and make it optional if possible

Background

The current shutdown process in the fluentd forward client attempts to wait for all sent chunks to be acknowledged by the server, in order to reduce the number of re-sent chunks on the next start.

When there are network troubles, this could take minutes until timeout, and during that whole time the input has already been shut down, meaning nothing is listening on the incoming port to receive logs from applications. A minute of downtime could mean GBs of logs lost or queued up, blocked application operations, or even more logs triggered by logging failures.

What to do

One of the two:

  1. Make fast shutdown a global option and support it in the fluentd forward client: close everything and assume ongoing buffers are lost when the end signal is received - the buffer above the output clients should persist all of them for recovery next time (see the sketch after this list).
  2. If making it optional is difficult, just replace the old approach completely. The current code may still be needed for #6.
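
A hedged sketch of option 1 (all names hypothetical): when fast shutdown is enabled, close the connection immediately and hand every unacknowledged chunk back to the buffer; otherwise keep the old behavior of waiting for acknowledgements.

```go
package output

import (
	"net"
	"time"
)

// ChunkBuffer stands in for the buffer above the output clients, which is
// expected to persist chunks for recovery on the next start.
type ChunkBuffer interface {
	RequeueUnacked(chunkIDs []string)
}

// closeOnShutdown sketches the two shutdown modes. waitAcks represents the
// existing graceful path and returns whatever is still unacknowledged when
// the timeout expires.
func closeOnShutdown(conn net.Conn, buffer ChunkBuffer, unacked []string,
	fastShutdown bool, waitAcks func(timeout time.Duration) []string) {
	if fastShutdown {
		_ = conn.Close()
		buffer.RequeueUnacked(unacked)
		return
	}
	stillUnacked := waitAcks(30 * time.Second)
	_ = conn.Close()
	buffer.RequeueUnacked(stillUnacked)
}
```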

Not in Scope

Shutdown handling in other parts should be fine and needs no change.

output: Limit max duration of outgoing connection to help load balancing

Background

The outgoing connections from the fluentd forward client are designed to be persistent, unlike the pooled async-request approach used in fluent-bit. This works fine except that it's very poor for load balancing on the server side, as connections stick to the same server node forever until either side is restarted.

What to do

Automatically closing the connections after a certain duration (e.g. 30m) should address this; the pause before the next connection attempt should also be skipped in such situations.

Note that a connection shouldn't be closed while there are still pending acknowledgements, which could be a problem if traffic never stops (better ideas needed). Meanwhile a new connection cannot be opened until the old one is closed (current behavior), or the order of logs cannot be guaranteed.
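
A minimal sketch of the proposed check (hypothetical names): rotate a connection once it is older than the configured maximum, but only while there are no pending acknowledgements, since closing earlier would force re-sends and break the ordering guarantee.

```go
package output

import "time"

// shouldRotateConnection reports whether an outgoing connection should be
// closed and reopened to give the server-side load balancer a chance to pick
// a different node.
func shouldRotateConnection(openedAt time.Time, maxAge time.Duration, pendingAcks int) bool {
	if pendingAcks > 0 {
		// Never close with unacknowledged chunks; this could be a problem if
		// traffic never stops and acknowledgements are always pending.
		return false
	}
	return time.Since(openedAt) >= maxAge
}
```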

config: Check for invalid keys in YAML config file

YAML v3 is being used to support custom tags in the config file. This is WIP and there is no checking for invalid keys for now. As a result, most transforms simply require all keys to be specified to reduce the chance of human error.

This needs to be fixed, either in this repository or in the Go YAML library.
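
If the fix lands in this repository, one possible approach with the existing yaml.v3 dependency is Decoder.KnownFields(true), which makes decoding fail on keys that don't map to the target struct. This is only a sketch; it may not cover fields handled by custom unmarshallers for the custom tags:

```go
package config

import (
	"bytes"
	"fmt"

	"gopkg.in/yaml.v3"
)

// decodeStrict rejects unknown keys while decoding a config section.
func decodeStrict(data []byte, out interface{}) error {
	dec := yaml.NewDecoder(bytes.NewReader(data))
	dec.KnownFields(true) // error out on keys not present in the target struct
	if err := dec.Decode(out); err != nil {
		return fmt.Errorf("invalid config: %w", err)
	}
	return nil
}
```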

Simplify internal communication / flows, starting with baseoutput

Currently all internal communication and code flows are built around primitive Go channels and signals; as a result, the complexity has exploded, especially in baseoutput around the shutdown mechanism:

  • There are an async sender and receiver to enable pipelining / reduce latency
  • There are soft stop, hard stop (abort) to reconnect, and hard stop to shut down.
  • Network reading and writing can fail independently and the other side may continue to work (unintended)
  • All pending chunks, whether queued or being processed in functions, need to be collected for recovery
  • The code is difficult to test. Currently the only tests involving real output are in the run package and they can't cover any exceptional situations.

Instead of a free-form approach where each piece of code decides how to use channels and signals, there could be a unified framework or library that helps ensure:

  • The top-level flow can be visualized in a declarative form. It needs to cover all exceptional situations, because dealing with them is the main cause of complexity.
  • Components (e.g. the acknowledger or a sub-unit) have clear inputs, outputs and responsibilities, with no implicit side effects (sketched below)

The internal communication in baseoutput is not performance critical, as it works on compressed log chunks.
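
As a rough illustration of the second point (hypothetical types): each component could declare its communication explicitly instead of sharing ad-hoc channels and stop signals.

```go
package baseoutput

// EncodedChunk stands in for a compressed log chunk traveling through the output.
type EncodedChunk struct {
	ID   string
	Data []byte
}

// AcknowledgerPorts makes the acknowledger's inputs and outputs explicit:
// sent chunks come in, acknowledged IDs and leftover (unacknowledged) chunks
// go out, and a single stop channel replaces the mixed soft/hard signals.
type AcknowledgerPorts struct {
	SentChunks  <-chan EncodedChunk
	AckedIDs    chan<- string
	LeftoverOut chan<- EncodedChunk
	Stop        <-chan struct{}
}
```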

all: Dump current stats and log samples on signal

Background

We need to be able to get very detailed stats of the program when it hangs or crashes.

What to do

  • Make the current stats and the latest in-process logs available: note that log fields are backed by mutable arrays and may be changed or dropped when accessed out of scope.
  • Dump the information on fatal/panic and on specific signals (see the sketch after this list)
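
A minimal sketch of the signal part, assuming SIGUSR1 is the chosen signal (an arbitrary choice here) and that a dump callback gathers the stats and log samples:

```go
package dump

import (
	"log"
	"os"
	"os/signal"
	"syscall"
)

// listenForDumpSignal runs dump() whenever the process receives SIGUSR1.
// Fatal/panic handlers would call the same dump() directly.
func listenForDumpSignal(dump func()) {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGUSR1)
	go func() {
		for range sigCh {
			log.Println("received SIGUSR1, dumping current stats and log samples")
			dump()
		}
	}()
}
```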

all: Upgrade to go 1.16

Go 1.16 should offer a significant performance improvement in the scheduler when GOMAXPROCS is high (e.g. 1000), since golang/go#28808 has been partially addressed.

What needs to be done:

  • Running benchmarks with fixed GOMAXPROCS=12 against different inputs to make sure there is no slowdown
  • Running benchmarks with GOMAXPROCS=1000 against different inputs to make sure there is no slowdown
  • Running benchmarks with default GOMAXPROCS on a server with at least 500 cores to make sure there is no slowdown
  • Try different defs.IntermediateBufferedChannelSize values to see if it should be changed (see the benchmark sketch below).

Input log types:

  • Common log records with around 200 bytes for each record, including syslog headers
  • Huge log records for error dumps, 10-50K. The new errors-input.log may be good enough for this.
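
A sketch of driving the same benchmark body under different GOMAXPROCS values in-process; the real benchmarks are assumed to be run via the GOMAXPROCS environment variable, so this is just one possible harness:

```go
package bench

import (
	"fmt"
	"runtime"
	"testing"
)

// BenchmarkPipelineAtVariousGOMAXPROCS repeats the benchmark body with the
// scheduler configured for few and for very many Ps.
func BenchmarkPipelineAtVariousGOMAXPROCS(b *testing.B) {
	for _, procs := range []int{12, 1000} {
		b.Run(fmt.Sprintf("procs=%d", procs), func(b *testing.B) {
			prev := runtime.GOMAXPROCS(procs)
			defer runtime.GOMAXPROCS(prev)
			for i := 0; i < b.N; i++ {
				// feed ~200-byte and 10-50K input records into the pipeline here
			}
		})
	}
}
```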

all: Limit amounts of logs from the agent itself

Background

Excessive amounts of logs from the agent could be triggered by I/O errors (e.g. disk quota) or broken input logs. In the worst case, we can assume one malformed input log would trigger at least one warning-level log from the agent. This is unacceptable, as it could easily produce TBs of logs in a few minutes.

What to do

Set up a log limit per component and apply it to the smallest unit possible: a TCP connection, a pipeline (not sub-pipelines) or a set of metric keys. The limit should be configurable, for example (see the sketch after this list):

  • max_logging_count: 500
  • max_logging_duration: 1h (reset counter afterwards)
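
A hedged sketch of such a limiter (names hypothetical); max_logging_count and max_logging_duration map to the two constructor arguments:

```go
package slogger

import (
	"sync"
	"time"
)

// logLimiter allows at most maxCount logs per window for one component; the
// counter resets when the window (max_logging_duration) elapses.
type logLimiter struct {
	mu         sync.Mutex
	maxCount   int
	window     time.Duration
	count      int
	windowFrom time.Time
}

func newLogLimiter(maxCount int, window time.Duration) *logLimiter {
	return &logLimiter{maxCount: maxCount, window: window, windowFrom: time.Now()}
}

// Allow reports whether the component may emit another log right now.
func (l *logLimiter) Allow() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if time.Since(l.windowFrom) >= l.window {
		l.count = 0
		l.windowFrom = time.Now() // reset counter after max_logging_duration
	}
	if l.count >= l.maxCount {
		return false
	}
	l.count++
	return true
}
```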

orchestrate: Investigate bottleneck in parallel sub-pipelines

For unknown reasons, the sub-pipelines under the byKeySet orchestrator offer minimal performance improvement. In the most recent tests, num=2 gave about a 30% time saving at the cost of +30% CPU time, while higher values had zero effect.

The problem could be anywhere: in the Distributor, in the generic buffering in PipelineChannel (called from byKeySetOrchestrator), or maybe even in the test input - the current benchmark code feeds input records by itself and may be affected by or affect normal operations.

What is known is that the chunk locks used to ensure each resulting chunk is output in the original order are rarely touched, and disabling them changed nothing.

output: datadog & HTTP2

HTTP/2 should offer built-in multiplexing and avoid the need for a custom pipelining implementation. In the ideal case, merely changing the protocol to HTTP/2 should remove the latency issue.

There might still be a buffer size problem with Go's HTTP/2 implementation: golang/go#47840

Load testing will be needed against the real Datadog API.
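
A minimal sketch of what the switch could look like on the client side, assuming the standard library's HTTP/2 support is sufficient (ForceAttemptHTTP2 negotiates HTTP/2 over TLS even when a customized transport is used):

```go
package output

import (
	"net/http"
	"time"
)

// newHTTP2Client returns a client whose requests are multiplexed over a
// single HTTP/2 connection when the server supports it, removing the need
// for a custom pipelining implementation.
func newHTTP2Client() *http.Client {
	return &http.Client{
		Transport: &http.Transport{ForceAttemptHTTP2: true},
		Timeout:   60 * time.Second,
	}
}
```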

output: Simplify tracking of chunks in acknowledger

The way chunks are tracked for acknowledgement and recovered at the end seems unnecessarily overcomplicated, and it cannot survive a BUG situation.

A redesign is needed here. It's probably enough to track chunks in a sync map instead of having the acknowledger “return” them at the end.
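
A minimal sketch of the sync-map approach (hypothetical names): chunks are stored when sent, deleted when acknowledged, and whatever remains is collected for recovery, even if the acknowledger aborted in a BUG situation.

```go
package output

import "sync"

// chunkTracker tracks in-flight chunks by ID.
type chunkTracker struct {
	inflight sync.Map // chunk ID -> chunk payload (or on-disk path)
}

func (t *chunkTracker) MarkSent(id string, chunk []byte) { t.inflight.Store(id, chunk) }

func (t *chunkTracker) MarkAcknowledged(id string) { t.inflight.Delete(id) }

// CollectUnacknowledged gathers everything not yet acknowledged, e.g. for
// recovery at shutdown.
func (t *chunkTracker) CollectUnacknowledged() map[string][]byte {
	leftover := make(map[string][]byte)
	t.inflight.Range(func(key, value interface{}) bool {
		leftover[key.(string)] = value.([]byte)
		return true
	})
	return leftover
}
```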
