
fluent-plugin-throttle


A fluentd filter plugin to throttle logs. Logs are grouped by a configurable key. When a group exceeds a configured rate, logs are dropped for that group.

Installation

Install with gem, or with the equivalent td-agent provided command:

# for fluentd
$ gem install fluent-plugin-throttle

# for td-agent
$ td-agent-gem install fluent-plugin-throttle

Usage

<filter **>
  @type throttle
  group_key kubernetes.container_name
  group_bucket_period_s   60
  group_bucket_limit    6000
  group_reset_rate_s     100
</filter>

Configuration

group_key

Default: kubernetes.container_name.

Used to group logs. Groups are rate limited independently.

A dot indicates a key within a sub-object. For example, in the following log, the group key resolves to "random":

{"level": "error", "msg": "plugin test", "kubernetes": { "container_name": "random" } }

Multiple groups can be specified using the fluentd config array syntax, e.g. kubernetes.container_name,kubernetes.pod_name, in which case each unique pair of key values is rate limited independently.

If the group cannot be resolved, an anonymous (nil) group is used for rate limiting.
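A minimal sketch of how such a dotted-key lookup could work (an illustration of the behavior described above, not the plugin's actual implementation; `resolve_group` is a hypothetical helper):

```ruby
# Hypothetical helper: walk a record hash along a dotted key,
# returning nil when any segment is missing (the anonymous group).
def resolve_group(record, dotted_key)
  dotted_key.split(".").reduce(record) do |obj, key|
    obj.is_a?(Hash) ? obj[key] : nil
  end
end

record = { "level" => "error", "msg" => "plugin test",
           "kubernetes" => { "container_name" => "random" } }

resolve_group(record, "kubernetes.container_name") # => "random"
resolve_group(record, "kubernetes.pod_name")       # => nil (anonymous group)
```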

group_bucket_period_s

Default: 60 (seconds).

This is the period of time over which group_bucket_limit applies.

group_bucket_limit

Default: 6000 (logs per group_bucket_period_s).

Maximum number of logs allowed per group over the period of group_bucket_period_s.

This translates to a log rate of group_bucket_limit/group_bucket_period_s. When a group exceeds this rate, logs from this group are dropped.

For example, the default is 6000/60s, making for a rate of 100 logs per second.

Note that this is not expressed as a rate directly because there is a difference between the overall rate and the distribution of logs over a period of time. For example, a burst of logs in the middle of a minute bucket might not exceed the average rate of the full minute.

Consider 60/60s (60 logs over a minute) versus 1/1s (1 log per second). Over a minute, both will emit a maximum of 60 logs, limiting the rate to 60 logs per minute. However, 60/60s will readily emit 60 logs within the first second and then nothing for the remaining 59 seconds, while 1/1s will emit only the first log of every second.
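To make the bucket semantics concrete, here is a small Ruby sketch of a fixed-window counter (an illustration of the idea described above, not the plugin's actual code):

```ruby
# Fixed-window counter: allows up to `limit` logs per `period_s` seconds,
# counted from the start of each bucket. Illustrative only.
class Bucket
  def initialize(limit, period_s)
    @limit = limit
    @period_s = period_s
    @count = 0
    @bucket_start = nil
  end

  # `now` is a timestamp in seconds.
  def allow?(now)
    # start a new bucket on first use or when the period has elapsed
    if @bucket_start.nil? || now - @bucket_start >= @period_s
      @bucket_start = now
      @count = 0
    end
    @count += 1
    @count <= @limit
  end
end

# 60/60s: a burst of 61 logs in the first second passes 60 and drops the 61st.
burst = Bucket.new(60, 60)
results = (1..61).map { burst.allow?(0.5) }
results.count(true) # => 60
```

By contrast, a 1/1s bucket admits only one log per one-second window, which is the smoother distribution the paragraph above describes.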

group_drop_logs

Default: true.

When a group reaches its limit, logs are dropped from further processing if this value is true (the default). To prevent logs from being dropped and instead only receive a warning message when rate limiting would have occurred, set this value to false. This can be useful to fine-tune your group bucket limits before dropping any logs.
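For example, a warn-only configuration might look like this (a sketch based on the options described in this README):

```
<filter **>
  @type throttle
  group_key kubernetes.container_name
  group_bucket_period_s 60
  group_bucket_limit 6000
  group_drop_logs false
</filter>
```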

group_reset_rate_s

Default: group_bucket_limit/group_bucket_period_s (logs per second). Maximum: group_bucket_limit/group_bucket_period_s.

After a group has exceeded its bucket limit, logs are dropped until the per-second rate falls to group_reset_rate_s or below.

The default value is group_bucket_limit/group_bucket_period_s. For example, for 3600 logs per hour, the reset rate defaults to 3600/3600s = 1/s, one log per second.

Taking the example of 3600 logs/hour with the default reset rate of 1 log/s further:

  • Let's say we observe a period of 10 hours.
  • During the first hour, 2 logs/s are produced. After 30 minutes, the hourly bucket has reached its limit and logs are dropped. The rate is still 2 logs/s for the remaining 30 minutes.
  • Because the first hour ended at 2 logs/s, which is higher than the 1 log/s reset rate, all logs are still dropped when the second hour starts. The bucket limit is left untouched since nothing is being emitted.
  • At 2 hours and 30 minutes, the log rate halves to 1 log/s, which is equal to the reset rate. Logs are emitted again, counting toward the bucket limit as normal, allowing up to 3600 logs in the last 30 minutes of the second hour.
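The walkthrough above can be checked with a rough simulation (a simplified fixed-window model built from this README's description, not the plugin's exact algorithm; logs are counted once per simulated second):

```ruby
# Simulate 3600 logs/hour with group_reset_rate_s = 1:
# the source emits 2 logs/s until t = 9000s (2.5 hours), then 1 log/s.
limit, period_s, reset_rate = 3600, 3600, 1

count = 0
bucket_start = 0
muted = false
emitted_seconds = []

(0...4 * 3600).each do |t|
  rate = t < 9000 ? 2 : 1
  # start a fresh bucket each period
  if t - bucket_start >= period_s
    bucket_start = t
    count = 0
  end
  # unmute once the observed rate falls to the reset rate or below
  muted = false if muted && rate <= reset_rate
  next if muted
  count += rate
  if count > limit
    muted = true # bucket exceeded: drop until rate <= reset_rate
  else
    emitted_seconds << t
  end
end

# Emission stops 30 minutes in (t = 1800) and resumes only at t = 9000.
```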

Because this could cause some instability if the log rate hovers around group_bucket_limit/group_bucket_period_s, it is possible to set a different (lower) reset rate.

Note that a value of 0 effectively means the plugin will drop logs forever after a single breach of the limit, until the next restart of fluentd.

A value of -1 disables the feature.

group_warning_delay_s

Default: 10 (seconds).

When a group reaches its limit, and as long as it is not reset, a warning message with the group's current log rate is emitted repeatedly. This is the delay between repetitions.

License

Apache License, Version 2.0

Copyright

Copyright © 2018 Rubrik Inc.

Contributors

adammw, bombela, dterei, rfitzhugh


fluent-plugin-throttle's Issues

Add support to avoid throttling by fields

Hi,
First of all, thanks for your great plugin, I find it very useful and powerful.

Our use case requires grouping records by pod_name + specific log levels.
For example, we want to throttle only records that have log_level warn or error.

Thought it would be nice to have an option to specify a field and value that prevent the record from being throttled. It could be built the way record_modifier does it:

<filter pattern>
  @type record_modifier

  # replace key key1
  <replace>
    # your key name
    key key1
    # your regexp
    expression /^(?<start>.+).{2}(?<end>.+)$/
    # replace string
    replace \\k<start>ors\\k<end>
  </replace>
  # replace key key2
  <replace>
    # your key name
    key key2
    # your regexp
    expression /^(.{1}).{2}(.{1})$/
    # replace string
    replace \\1ors\\2
  </replace>
</filter>

For this plugin we can have:

<ignore_record>
   key log_level
   expression /(info|debug|trace)/
</ignore_record>

What do you think?

td-agent stops emitting logs after rate limit exceeded

Hi,
we limit the application access log to 1 log per second (see config below), but td-agent stops emitting logs completely instead of emitting the expected 1 log per second. Is this a known issue, or do we have the wrong settings?

Thank you!!

<filter access.log>
  @type throttle
  group_key path
  group_bucket_period_s   1
  group_bucket_limit    1
  group_reset_rate_s     1
</filter>

Multiple conditions for a group

Hi,

First of all, thanks for your great plugin, I find it very useful and powerful.
One thing that might prevent us from using that plugin is that we want to have multiple conditions for a group.
For example, we want to have a group which contains container name + log level.
I think a good solution for that limitation is to introduce a new configuration option called group_key_regex that lets you define a group_key by a regex; that way we can enforce two conditions for one group.

I'm starting to write a PR for that, please let me know what you think.

After adding throttling, the kafka plugin is not working properly

throttling config

<filter k8s_log.**>
  @type throttle
  group_key $.kubernetes.namespace_name
  group_bucket_period_s 60
  group_bucket_limit 6000
  group_reset_rate_s 100
</filter>

<match k8s_log.**>
  @id copy_k8s_log
  log_level trace
  @type copy
  <store>
    @id kafka_buffered_k8s_log
    reserve_data true
    @log_level trace
    @type kafka_buffered
    brokers {brokers list}
    default_topic fluent_data
    output_include_tag true
    #output_include_time true
    required_acks 1
    kafka_agg_max_bytes 10000000
    kafka_agg_max_messages 1000000
    max_send_limit_bytes 9000000000 #to avoid MessageSizeTooLarge
    get_kafka_client_log true
    #<buffer>
    #  @type memory
    #  flush_mode immediate
    #  flush_thread_count 20
    #  chunk_limit_size 8MB
    #  total_limit_size 64MB
    #  overflow_action drop_oldest_chunk
    #</buffer>
  </store>
  <store>
    @id out_prometheus_k8s_log
    @type prometheus
    <metric>
      name fluentd_output_status_num_records_total
      type counter
      desc The total number of outgoing records
      <labels>
        tag ${tag}
        hostname ${hostname}
      </labels>
    </metric>
  </store>
</match>
failed to flush the buffer. retry_time=2 next_retry_seconds=2018-09-03 07:18:49 +0000 chunk="574f2567d5ab677a1bd50654f855d45d" error_class=ArgumentError error="wrong number of arguments (given 6, expected 0)"
2018-09-03 07:18:49 +0000 [warn]: #0 suppressed same stacktrace
2018-09-03 07:18:50 +0000 [info]: #0 following tail of /applog/container/logs/json/splunk_2018-09-03.0718.log
2018-09-03 07:18:50 +0000 [trace]: #0 [kafka_buffered_k8s_log] enqueueing all chunks in buffer instance=47336657585960
2018-09-03 07:18:53 +0000 [warn]: #0 [kafka_buffered_fluent_logs] Send exception occurred: wrong number of arguments (given 6, expected 0)
2018-09-03 07:18:53 +0000 [warn]: #0 [kafka_buffered_fluent_logs] Exception Backtrace:
/usr/lib/ruby/gems/2.4.0/gems/ruby-kafka-0.7.0/lib/kafka/pending_message.rb:7:in `initialize'
/usr/lib/ruby/gems/2.4.0/gems/fluent-plugin-kafka-0.6.6/lib/fluent/plugin/kafka_producer_ext.rb:16:in `new'
/usr/lib/ruby/gems/2.4.0/gems/fluent-plugin-kafka-0.6.6/lib/fluent/plugin/kafka_producer_ext.rb:16:in `produce2'
/usr/lib/ruby/gems/2.4.0/gems/fluent-plugin-kafka-0.6.6/lib/fluent/plugin/out_kafka_buffered.rb:322:in `block in write'
/usr/lib/ruby/gems/2.4.0/gems/fluentd-1.1.0/lib/fluent/event.rb:323:in `each'
/usr/lib/ruby/gems/2.4.0/gems/fluentd-1.1.0/lib/fluent/event.rb:323:in `block in each'
/usr/lib/ruby/gems/2.4.0/gems/fluentd-1.1.0/lib/fluent/plugin/buffer/memory_chunk.rb:80:in `open'
/usr/lib/ruby/gems/2.4.0/gems/fluentd-1.1.0/lib/fluent/plugin/buffer/memory_chunk.rb:80:in `open'
/usr/lib/ruby/gems/2.4.0/gems/fluentd-1.1.0/lib/fluent/event.rb:322:in `each'
/usr/lib/ruby/gems/2.4.0/gems/fluent-plugin-kafka-0.6.6/lib/fluent/plugin/out_kafka_buffered.rb:284:in `write'
/usr/lib/ruby/gems/2.4.0/gems/fluentd-1.1.0/lib/fluent/compat/output.rb:131:in `write'
/usr/lib/ruby/gems/2.4.0/gems/fluentd-1.1.0/lib/fluent/plugin/output.rb:1094:in `try_flush'
/usr/lib/ruby/gems/2.4.0/gems/fluentd-1.1.0/lib/fluent/plugin/output.rb:1319:in `flush_thread_run'
/usr/lib/ruby/gems/2.4.0/gems/fluentd-1.1.0/lib/fluent/plugin/output.rb:439:in `block (2 levels) in start'
/usr/lib/ruby/gems/2.4.0/gems/fluentd-1.1.0/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'

Throttling is not resetting after setting the group_reset_rate_s

Hi,

I know this may be a long shot as I haven't seen any activity on this repository, but I was playing around with this plugin and even though it does throttle the logs to 1000, it doesn't seem to reset after 15s. I don't receive any logs after the first 1000. The output is elasticsearch.

<filter throttle.**>
  @type throttle
  group_key random.env
  group_bucket_period_s  60
  group_bucket_limit     1000
  group_drop_logs        true
  group_reset_rate_s     15
</filter>

Below are the versions I am using:

ruby version 2.7.0
fluent-config-regexp-type (1.0.0)
fluent-plugin-beats (1.1.0)
fluent-plugin-elasticsearch (5.2.3)
fluent-plugin-filter-list (0.7.5)
fluent-plugin-record-modifier (2.1.1)
fluent-plugin-rewrite-tag-filter (2.4.0)
fluent-plugin-s3 (1.7.1)
fluent-plugin-throttle (0.0.5)
fluentd (1.15.2)

The logs show it being throttled (note the group shows nil, so I guess it isn't resolving, but it still works as I can see 1000 logs):

2022-12-16 14:25:38 +0000 [warn]: #0 rate exceeded group_key=[nil] rate_s=286 period_s=60 limit=1000 rate_limit_s=16 reset_rate_s=15
2022-12-16 14:25:38.418280984 +0000 fluent.warn: {"group_key":[null],"rate_s":286,"period_s":60,"limit":1000,"rate_limit_s":16,"reset_rate_s":15,"message":"rate exceeded group_key=[nil] rate_s=286 period_s=60 limit=1000 rate_limit_s=16 reset_rate_s=15"}

Any help would be appreciated

multiple groups in group_key resolving to `nil`

Using namespace key group_key "kube_namespace_name":

fluentd-2 fluentd 2018-11-23 10:43:01 +0000 [warn]: rate exceeded group_key="kube-system" rate_s=3 period_s=60 limit=10 rate_limit_s=0 reset_rate_s=0
fluentd-2 fluentd 2018-11-23 10:43:02 +0000 [warn]: rate exceeded group_key="sys-mon" rate_s=1 period_s=60 limit=10 rate_limit_s=0 reset_rate_s=0
fluentd-0 fluentd 2018-11-23 10:42:45 +0000 [warn]: rate exceeded group_key="kube-system" rate_s=3 period_s=60 limit=10 rate_limit_s=0 reset_rate_s=0
fluentd-0 fluentd 2018-11-23 10:42:47 +0000 [warn]: rate exceeded group_key="telecom" rate_s=0 period_s=60 limit=10 rate_limit_s=0 reset_rate_s=0
fluentd-0 fluentd 2018-11-23 10:42:52 +0000 [warn]: rate exceeded group_key="sys-mon" rate_s=0 period_s=60 limit=10 rate_limit_s=0 reset_rate_s=0

using kube_pod_name key group_key "kube_pod_name" :

fluentd-0 fluentd 2018-11-23 10:38:12 +0000 [warn]: rate exceeded group_key="grafana-84dfb4866b-zpdtx" rate_s=0 period_s=60 limit=10 rate_limit_s=0 reset_rate_s=0
fluentd-0 fluentd 2018-11-23 10:38:14 +0000 [warn]: rate exceeded group_key="blackbox-exporter-7f75d97d67-z9xtc" rate_s=0 period_s=60 limit=10 rate_limit_s=0 reset_rate_s=0
fluentd-0 fluentd 2018-11-23 10:38:14 +0000 [warn]: rate exceeded group_key="kube-apiserver-ip-10-66-23-164.eu-west-1.compute.internal" rate_s=1 period_s=60 limit=10 rate_limit_s=0 reset_rate_s=0
fluentd-2 fluentd 2018-11-23 10:38:06 +0000 [warn]: rate exceeded group_key="kube-apiserver-ip-10-66-23-164.eu-west-1.compute.internal" rate_s=1 period_s=60 limit=10 rate_limit_s=0 reset_rate_s=0
fluentd-2 fluentd 2018-11-23 10:38:09 +0000 [warn]: rate exceeded group_key="certificate-authority-77ff9b9fbc-n72m8" rate_s=0 period_s=60 limit=10 rate_limit_s=0 reset_rate_s=0

Using both: group_key "kube_namespace_name,kube_pod_name" :

fluentd-0 fluentd 2018-11-23 10:44:24 +0000 [warn]: rate exceeded group_key=nil rate_s=29 period_s=60 limit=10 rate_limit_s=0 reset_rate_s=0
fluentd-2 fluentd 2018-11-23 10:44:27 +0000 [warn]: rate exceeded group_key=nil rate_s=5 period_s=60 limit=10 rate_limit_s=0 reset_rate_s=0
fluentd-0 fluentd 2018-11-23 10:44:35 +0000 [warn]: rate exceeded group_key=nil rate_s=1 period_s=60 limit=10 rate_limit_s=0 reset_rate_s=0

example message:

{
  "date": 1542973504.01747,
  "log": "I1123 11:45:04.017354       1 logs.go:49] http: TLS handshake error from 10.66.24.95:29201: EOF\n",
  "stream": "stderr",
  "kube_pod_name": "kube-apiserver-ip-10-66-23-81.eu-west-1.compute.internal",
  "kube_namespace_name": "kube-system",
  "kube_pod_id": "f33508e4-dcf6-11e8-9adf-060b9e31fc68",
  "kube_labels": {
    "k8s-app": "apiserver"
  },
  "kube_host": "ip-10-66-23-81.eu-west-1.compute.internal",
  "kube_container_name": "kube-apiserver",
  "kube_docker_id": "e7b3eea81c35c558defcd2418a7953bab7932a33c90981dccdc73c198f730007",
  "origin": "exp-1-aws"
}

support adding a flag for throttled records instead of dropping them

Hi,

For our application we need an additional custom field in the record to indicate throttling instead of discarding it. Based on that, we will do post-processing on that record. This new field would be named "add_field", and the flag associated with it will determine whether the record should be dropped or modified, allowing it to continue with further processing.

   <filter syslog.**>
        @type throttle
        group_key  type
        group_bucket_period_s  60
        group_bucket_limit  5000
        group_reset_rate_s  -1
        add_field  throttled  # added acknowledgment like yes for this field.
        proceed_further  true   #defaults as false.
   </filter> 

We would appreciate your input on this.

Bytes?

Throttling on message count is only one axis. We have software that produces a modest number of log 'messages' but blows up our pipelines because those 'messages' are sometimes > 1MB.

Are there any thoughts on adding message size as something else we can throttle on?
