relex / slog-agent

High performance log agent/processor to be used with fluentd

License: MIT License
defs/params.go contains advanced parameters. Some of the values are shared while others are specific to certain packages or individual modules. Some also depend on each other and cannot be set to arbitrary values.
Same process as the 1.16 upgrade, and replace code generation with generics.
It has been impossible to tell which clients are logging through slog-agent and what they are logging.
The existing information:
are not sufficient in situations where the keys cannot differentiate clients (e.g. multi-instance services), and full information cannot be represented in metrics due to concerns about metric label cardinality (M inputs * N pipelines * T steps * U outputs).
A workaround is to log key fields when an incoming connection requests to dispatch logs to a previously-unused pipeline (identified by orchestration keys). This was done as part of "major: Use Go 1.18 generics", but turned out to be of little use since it logs only the orchestration keys.
An option is to log the metric keys in addition to the orchestration keys, and/or dump raw headers in one of the new logs.
A better solution would be to save samples of incoming logs together with the destination pipelines assigned by the orchestrator, and print them on certain signals; this may be done as part of #8.
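A rough sketch of that idea, with hypothetical names (sampleStore, pipeline IDs derived from orchestration keys): keep the latest raw record seen per assigned pipeline and dump all samples when a signal handler asks for them.

```go
// Package orchestrator: hypothetical names sketching the idea.
package orchestrator

import "sync"

// sampleStore keeps the most recent raw input record per assigned
// pipeline, to be dumped on a signal for diagnosis.
type sampleStore struct {
	mu      sync.Mutex
	samples map[string]string // pipeline ID (orchestration keys) -> latest raw record
}

func (s *sampleStore) Record(pipelineID, rawRecord string) {
	s.mu.Lock()
	s.samples[pipelineID] = rawRecord // keep only the latest sample per pipeline
	s.mu.Unlock()
}

// Dump prints every stored sample via the provided log function, e.g.
// from a signal handler.
func (s *sampleStore) Dump(log func(pipelineID, sample string)) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for id, sample := range s.samples {
		log(id, sample)
	}
}
```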
See https://stackoverflow.com/a/73608099/3488757 and https://www.sobyte.net/post/2022-09/string-byte-convertion/
Since Go 1.19, StringHeader/SliceHeader have been deprecated, and their usage in the code should be updated accordingly.
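A minimal sketch of the replacement pattern on Go 1.20+, using unsafe.String/unsafe.StringData/unsafe.SliceData instead of the deprecated headers; the usual caveat applies that the converted values must never be mutated:

```go
package main

import (
	"fmt"
	"unsafe"
)

// BytesToString reinterprets b as a string without copying.
// Callers must guarantee b is never mutated afterwards.
func BytesToString(b []byte) string {
	if len(b) == 0 {
		return ""
	}
	return unsafe.String(unsafe.SliceData(b), len(b))
}

// StringToBytes reinterprets s as a []byte without copying.
// The returned slice must never be written to.
func StringToBytes(s string) []byte {
	if len(s) == 0 {
		return nil
	}
	return unsafe.Slice(unsafe.StringData(s), len(s))
}

func main() {
	fmt.Println(BytesToString([]byte("hello")))
	fmt.Println(string(StringToBytes("world")))
}
```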
Document our full integration testing once we have a prototype setup
The scope of full integration testing should cover:
Automatic verification needs to cover:
logs 1 2 3 4 5 may be sent as 1 2 3 | 2 3 | 4 5, since chunks may be re-sent with overlapping records.
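Verification therefore cannot compare raw sequences directly; a sketch of the deduplication step it would need, assuming each record carries a unique sequence number:

```go
package main

import "fmt"

// dedupe collapses re-sent records by unique sequence number while
// preserving first-seen order, so 1 2 3 | 2 3 | 4 5 verifies as 1 2 3 4 5.
func dedupe(received []int) []int {
	seen := make(map[int]bool)
	var out []int
	for _, seq := range received {
		if !seen[seq] {
			seen[seq] = true
			out = append(out, seq)
		}
	}
	return out
}

func main() {
	fmt.Println(dedupe([]int{1, 2, 3, 2, 3, 4, 5})) // [1 2 3 4 5]
}
```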
Start the test server in fluentlibtool with random failure chances (30% for all stages is enough) to simulate a poor network environment, and run the benchmark with minimal repeating. Sometimes the client fails to end and keeps reconnecting only to send pings (which then fail), while there are still chunks queued in the buffer.
The buffer and client should be able to push chunks through even in such a situation. Also, periodic pings are not supposed to start while there are still many chunks queued.
The feeding from the hybrid buffer to the forwarder client's channel may be delayed for unknown reasons; perhaps leftover chunks should go into a "leftover" queue that gets priority processing.
The current shutdown process in the fluentd forward client attempts to wait for all sent chunks to be acknowledged by the server, in order to reduce the number of re-sent chunks on the next start.
When there are network troubles, this could take minutes until timeout, during which the input has already shut down, meaning nothing is listening on the incoming port to receive logs from applications. A minute of downtime could mean GBs of logs lost or queued up, blocking applications' operations or triggering even more logs due to logging failures.
One of the two:
Shutdown handling in other parts should be fine and needs no change.
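One plausible direction (an assumption; the two options above are not spelled out here) is to bound the wait for acknowledgements. A minimal sketch with hypothetical field names:

```go
// Package forwarder: hypothetical names, sketching a bounded wait.
package forwarder

import (
	"context"
	"net"
	"time"
)

type client struct {
	conn     net.Conn
	allAcked chan struct{} // closed by the ack reader once all sent chunks are acknowledged
}

// shutdown waits at most maxWait for acknowledgements so the input side
// can restart quickly; unacknowledged chunks stay in the buffer and are
// re-sent (possibly duplicated) on the next start.
func (c *client) shutdown(maxWait time.Duration) {
	ctx, cancel := context.WithTimeout(context.Background(), maxWait)
	defer cancel()
	select {
	case <-c.allAcked: // clean: nothing to re-send next start
	case <-ctx.Done(): // give up waiting, accept some duplicate chunks
	}
	c.conn.Close()
}
```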
Skip redaction of URLs whose userinfo looks like an email, e.g. ftp://[email protected] and s3://foo:bar@nowhere.
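A minimal sketch of one way to do this (not the agent's actual transform code; the example.com addresses are made up): find URL userinfo spans first, then redact only emails outside them.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// URLs with userinfo, e.g. ftp://user@host or s3://key:secret@host
	urlWithUser = regexp.MustCompile(`[a-zA-Z][a-zA-Z0-9+.-]*://\S+@\S+`)
	// email-like tokens to redact elsewhere
	email = regexp.MustCompile(`[\w.+-]+@[\w.-]+`)
)

// redactEmails replaces email-like tokens except when they appear as
// userinfo inside a URL.
func redactEmails(line string) string {
	protected := urlWithUser.FindAllStringIndex(line, -1)
	inProtected := func(start, end int) bool {
		for _, p := range protected {
			if start >= p[0] && end <= p[1] {
				return true
			}
		}
		return false
	}
	var out []byte
	last := 0
	for _, m := range email.FindAllStringIndex(line, -1) {
		if inProtected(m[0], m[1]) {
			continue // userinfo in a URL: keep as-is
		}
		out = append(out, line[last:m[0]]...)
		out = append(out, "<redacted>"...)
		last = m[1]
	}
	out = append(out, line[last:]...)
	return string(out)
}

func main() {
	fmt.Println(redactEmails("mail admin@example.com or fetch ftp://user@example.com/x"))
	// -> mail <redacted> or fetch ftp://user@example.com/x
}
```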
The outgoing connections from fluentd forward client have been designed to be persistent, unlike the async request approach used in fluent-bit with pooling. It works fine except it's very poor for load balancing on the server side, as connections stick to the same server node forever until either side is restarted.
Automatically closing connections after a certain duration (e.g. 30m) should address this, and should also avoid pausing before the next connection attempt in such situations.
Note a connection shouldn't be closed while there are still pending acknowledgements, which could be a problem if traffic never stops (better ideas needed). Meanwhile a new connection cannot be opened until the old one is closed (current behavior), or the order of logs cannot be guaranteed.
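A sketch of the rotation condition under those constraints (field names are hypothetical):

```go
// Package forwarder: a sketch with hypothetical field names.
package forwarder

import (
	"net"
	"time"
)

const maxConnAge = 30 * time.Minute // rotation interval suggested above

type conn struct {
	raw         net.Conn
	openedAt    time.Time
	pendingAcks int // maintained by the acknowledgement reader
}

// shouldRotate reports whether the connection is due for replacement.
// A connection with outstanding acknowledgements is never closed, so
// rotation can starve under non-stop traffic (the open problem above);
// the replacement is dialed only after this one is closed, preserving
// log order.
func (c *conn) shouldRotate() bool {
	return time.Since(c.openedAt) > maxConnAge && c.pendingAcks == 0
}
```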
YAML v3 is being used to support custom tags in the config file. This is WIP and there is no checking for invalid keys for now; as a result, most transforms simply require all keys to be specified to reduce the chance of human error.
This needs to be fixed, either in this repository or in the go-yaml library.
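For plain struct decoding, yaml.v3 already offers strict key checking via Decoder.KnownFields. A sketch (transformConfig is a made-up stand-in for one of the transform configs), though nodes handled by custom-tag unmarshalers may bypass this check:

```go
package main

import (
	"fmt"
	"strings"

	"gopkg.in/yaml.v3"
)

// transformConfig is a made-up stand-in for one of the transform configs.
type transformConfig struct {
	Key     string `yaml:"key"`
	Pattern string `yaml:"pattern"`
}

func main() {
	input := "key: level\npatern: debug\n" // note the misspelled "patern"
	dec := yaml.NewDecoder(strings.NewReader(input))
	dec.KnownFields(true) // error out on keys not defined in the struct
	var cfg transformConfig
	if err := dec.Decode(&cfg); err != nil {
		fmt.Println("config error:", err) // catches the typo instead of ignoring it
	}
}
```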
Currently all internal communication and code flow is built around primitive Go channels and signals; as a result, complexity has exploded, especially in baseoutput.
Related to the shutdown mechanism:
Instead of a free-form approach where each piece of code decides how to use channels and signals, a unified framework or library could help make:
The internal communication in baseoutput is not performance critical, as it works on compressed log chunks.
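One candidate for such a framework (an assumption, not a decision made in the repo) is structured concurrency via golang.org/x/sync/errgroup: every component honors a single context, and shutdown becomes one cancel plus one join.

```go
package main

import (
	"context"
	"time"

	"golang.org/x/sync/errgroup"
)

func runAcker(ctx context.Context) error  { <-ctx.Done(); return ctx.Err() }
func runSender(ctx context.Context) error { <-ctx.Done(); return ctx.Err() }

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	g, ctx := errgroup.WithContext(ctx)

	g.Go(func() error { return runAcker(ctx) })  // acknowledgement reader
	g.Go(func() error { return runSender(ctx) }) // chunk sender

	time.Sleep(100 * time.Millisecond) // agent runs...
	cancel()                           // single shutdown signal for every component
	_ = g.Wait()                       // one join point; first real error wins
}
```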
We need to be able to get very detailed stats of the program when it hangs or crashes
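Go already dumps all goroutine stacks on SIGQUIT before exiting; a sketch of doing the same on demand without terminating the process, using a hypothetical SIGUSR1 handler:

```go
package main

import (
	"os"
	"os/signal"
	"runtime"
	"syscall"
)

// dumpStacksOnSignal writes all goroutine stacks to stderr whenever
// SIGUSR1 arrives, which helps diagnose hangs without killing the process.
func dumpStacksOnSignal() {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGUSR1)
	go func() {
		for range ch {
			buf := make([]byte, 1<<20)
			n := runtime.Stack(buf, true) // true = include all goroutines
			os.Stderr.Write(buf[:n])
		}
	}()
}

func main() {
	dumpStacksOnSignal()
	select {} // simulate a long-running agent
}
```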
Go 1.16 should offer a significant performance improvement in the scheduler when GOMAXPROCS is high (e.g. 1000), as golang/go#28808 has been partially addressed.
What needs to be done:
Input log types:
Excessive amounts of logs from the agent itself could be triggered by I/O errors (e.g. disk quota) or broken input logs. In the worst case, we could assume one malformed input log would trigger at least one warning-level log from the agent. This is unacceptable, as it could easily produce TBs of logs in a few minutes.
Set up a log limit per component and apply it to the smallest unit possible: a TCP connection, a pipeline (not sub-pipelines) or a set of metric keys. The limit should be configurable, for example:
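A minimal sketch of per-key self-log limiting built on golang.org/x/time/rate (the wiring into the agent's logger is hypothetical):

```go
package main

import (
	"fmt"
	"sync"

	"golang.org/x/time/rate"
)

// limitedLogger drops agent warnings beyond a configured rate per key,
// where the key is the smallest unit available (connection, pipeline,
// or metric key set). Names here are hypothetical.
type limitedLogger struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	limit    rate.Limit // sustained logs per second per key
	burst    int        // short-term burst allowance
}

func (l *limitedLogger) Warn(key, msg string) {
	l.mu.Lock()
	lim, ok := l.limiters[key]
	if !ok {
		lim = rate.NewLimiter(l.limit, l.burst)
		l.limiters[key] = lim
	}
	l.mu.Unlock()
	if lim.Allow() {
		fmt.Println("WARN:", key, msg)
	}
	// else: drop silently, ideally counting drops in a metric instead
}

func main() {
	l := &limitedLogger{limiters: map[string]*rate.Limiter{}, limit: 10, burst: 20}
	for i := 0; i < 100; i++ {
		l.Warn("tcp-conn-42", "malformed input record") // only the first ~20 pass
	}
}
```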
For unknown reasons, the sub-pipelines under the byKeySet orchestrator offer minimal performance improvement. In the most recent tests, num=2 gave about 30% time saving at the cost of +30% CPU time, while higher values had no effect.
The problem could be anywhere: in the Distributor, in the generic buffering in PipelineChannel (called from byKeySetOrchestrator), or maybe even in the test input - the current benchmark code feeds input records by itself and may be affected by, or affect, normal operations.
What is known is that the chunk locks used to ensure each resulting chunk is output in the original order are rarely touched, and disabling them changed nothing.
HTTP/2 offers built-in multiplexing and should avoid the need for a custom pipelining implementation. In the ideal case, merely changing the protocol to HTTP/2 should remove the latency issue.
There might still be a buffer size problem with Go's HTTP/2 implementation: golang/go#47840
Load testing against the real Datadog API will be needed.
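A sketch of what the output client could look like on HTTP/2 (the endpoint URL and package name are assumptions), where the transport's stream multiplexing replaces the custom pipelining layer:

```go
// Package datadogoutput is a hypothetical name for the Datadog output.
package datadogoutput

import (
	"bytes"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// newClient returns an HTTP client that speaks HTTP/2 directly, so
// concurrent requests are multiplexed over one TLS connection instead of
// being pipelined by hand.
func newClient() *http.Client {
	return &http.Client{
		Timeout: 30 * time.Second,
		// ReadIdleTimeout enables health-check pings to detect dead
		// connections, one of the concerns with Go's HTTP/2 stack.
		Transport: &http2.Transport{ReadIdleTimeout: 15 * time.Second},
	}
}

// send posts one compressed chunk; the endpoint URL is an assumption.
func send(client *http.Client, chunk []byte) error {
	req, err := http.NewRequest(http.MethodPost,
		"https://http-intake.logs.datadoghq.com/api/v2/logs",
		bytes.NewReader(chunk))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}
```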
The way chunks are tracked for acknowledgement and recovered at the end seems unnecessarily complicated, and it cannot survive a BUG situation.
A redesign is needed here. It's probably enough to track chunks in a sync.Map instead of having the acknowledger "return" them at the end.
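A minimal sketch of that redesign with hypothetical names: in-flight chunks live in a sync.Map keyed by chunk ID, so recovery is a simple Range rather than the acknowledger handing chunks back:

```go
// Package forwarder: hypothetical names sketching the redesign.
package forwarder

import "sync"

type chunkTracker struct {
	pending sync.Map // chunk ID -> chunk payload
}

func (t *chunkTracker) Sent(id string, chunk []byte) { t.pending.Store(id, chunk) }
func (t *chunkTracker) Acknowledged(id string)       { t.pending.Delete(id) }

// Unacknowledged returns whatever is still in flight, e.g. for re-queueing
// during shutdown; it survives a panic/BUG path because no goroutine has
// to "return" the chunks.
func (t *chunkTracker) Unacknowledged() map[string][]byte {
	out := make(map[string][]byte)
	t.pending.Range(func(k, v any) bool {
		out[k.(string)] = v.([]byte)
		return true
	})
	return out
}
```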