Version details:
- td-agent-4.3.1
- Syslog plugin gems:
  - fluent-plugin-remote_syslog (1.0.0)
  - remote_syslog_sender (1.2.2)
  - syslog_protocol (0.9.2)
Background:
- Fluentd and rsyslog are deployed via Helm and run as pods in Kubernetes.
- There is a k8s Service 'rsyslog' created for the rsyslog pod.
- Fluentd is configured to tail a file and forward the records to the rsyslog server. A file buffer is used in the match section for on-disk buffering.
The config looks like this:
<source>
  @type tail
  path "/tmp/test.log"
  pos_file /var/log/shivtest-retry-fluentd-container-log.pos
  tag "test"
  <parse>
    @type json
  </parse>
</source>

<match test>
  @type remote_syslog
  @log_level trace
  host rsyslog.shiv-syslog.svc.cluster.local
  port 514
  protocol tcp
  timeout 2
  <buffer>
    @type file
    path /tmp/buffer
    flush_interval 10s
    flush_mode interval
    flush_thread_count 1
    retry_forever true
    chunk_limit_size 2MB
    total_limit_size 100m
  </buffer>
</match>
Issue:
Fluentd successfully connects to the configured syslog endpoint and keeps pushing records at each flush interval.
But when the k8s Service of the syslog server goes down (i.e. the rsyslog pod gets deleted or goes to 0/1), fluentd does not detect any connection failures to syslog.
It also keeps flushing all the chunks out of the file buffer and retains nothing in the buffer, despite the destination being unreachable.
Note that the syslog service really is unreachable, as seen by other clients such as logger and curl:
$ logger "Hello world" -n rsyslog.shiv-syslog.svc.cluster.local -P 514 -T
logger: failed to connect rsyslog.shiv-syslog.svc.cluster.local port 514
$ curl telnet://rsyslog.shiv-syslog.svc.cluster.local:514
curl: (7) Failed connect to rsyslog.shiv-syslog.svc.cluster.local:514; Connection refused
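For completeness, the refusal is also visible from plain Ruby (the runtime fluentd itself uses), independent of fluentd and the plugin; a small probe we can run from a pod inside the cluster:

require "socket"

# Socket-level connectivity probe against the same Service endpoint.
begin
  TCPSocket.new("rsyslog.shiv-syslog.svc.cluster.local", 514).close
  puts "connected"
rescue SocketError, SystemCallError => e
  puts "connect failed: #{e.class}: #{e.message}"  # Errno::ECONNREFUSED here, matching logger/curl
end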
Fluentd logs: there are no errors in fluentd; new chunks keep getting created and then cleared from the file buffer location. Trace-level logging is enabled:
<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=nil, variables=nil, seq=0>
2023-03-21 13:25:45 +0000 [trace]: #0 done to commit a chunk chunk="5f768fa89b48e9499eecee4380ecf53f"
2023-03-21 13:25:45 +0000 [trace]: #0 writing events into buffer instance=1980 metadata_size=1
2023-03-21 13:25:45 +0000 [debug]: #0 Created new chunk chunk_id="5f768fb22636dafcbf2df4c4593736a5" metadata=#<struct Fluent::Plugin::Buffer::Metadata timekey=nil, tag=nil, variables=nil, seq=0>
2023-03-21 13:25:45 +0000 [trace]: #0 chunk /tmp/buffer/buffer.b5f768fb22636dafcbf2df4c4593736a5.log size_added: 22 new_size: 22
2023-03-21 13:25:46 +0000 [trace]: #0 enqueueing all chunks in buffer instance=1980
2023-03-21 13:25:47 +0000 [trace]: #0 enqueueing all chunks in buffer instance=1980
2023-03-21 13:25:47 +0000 [trace]: #0 writing events into buffer instance=1980 metadata_size=1
2023-03-21 13:25:47 +0000 [trace]: #0 chunk /tmp/buffer/buffer.b5f768fb22636dafcbf2df4c4593736a5.log size_added: 22 new_size: 44
2023-03-21 13:25:47 +0000 [trace]: #0 enqueueing all chunks in buffer instance=1980
2023-03-21 13:25:48 +0000 [trace]: #0 enqueueing all chunks in buffer instance=1980
2023-03-21 13:25:49 +0000 [trace]: #0 writing events into buffer instance=1980 metadata_size=1
2023-03-21 13:25:49 +0000 [trace]: #0 chunk /tmp/buffer/buffer.b5f768fb22636dafcbf2df4c4593736a5.log size_added: 22 new_size: 66
So why are the chunks getting flushed from the buffer when the destination is unreachable?
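One plausible mechanism (our assumption from general TCP semantics, not verified against the plugin code): write(2) on an established TCP socket only copies bytes into the local kernel send buffer, so it can "succeed" even after the peer has silently vanished; an error only surfaces on a later write once an RST comes back, and possibly never if packets to the old pod IP are simply dropped. A minimal Ruby sketch, independent of fluentd, that illustrates this:

require "socket"

# write() on an established TCP socket does not fail just because the peer
# went away; it only copies bytes into the local kernel send buffer.
server = TCPServer.new("127.0.0.1", 0)
port   = server.addr[1]

client = TCPSocket.new("127.0.0.1", port)
conn   = server.accept

# Simulate the peer disappearing. A clean close at least sends a FIN;
# a deleted pod (stale conntrack entry) may not even do that, so a real
# client could keep "succeeding" far longer.
conn.close
server.close

p client.write("hello\n")   # => 6, no exception raised

sleep 0.2                   # give the kernel time to process the returning RST

begin
  client.write("world\n")   # only now does the failure surface
rescue Errno::EPIPE, Errno::ECONNRESET => e
  puts "second write failed: #{e.class}"
end

If the flush's write lands in that first-write window (or a write timeout is swallowed somewhere), the plugin would report success and the chunk would be committed and deleted from the buffer.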
Observations:
1) When the syslog server is not running as a k8s pod but as a standalone service on Linux (i.e. managed through systemctl), stopping the service (systemctl stop rsyslog) immediately produces the following error in the fluentd logs when fluentd tries to flush the next chunk from its buffer to the syslog endpoint.
Because the flush fails due to connectivity, fluentd retains the chunk in the file buffer and keeps retrying the flush (as per the configuration).
2023-03-21 10:03:38 +0000 [warn]: #0 failed to flush the buffer. retry_times=0 next_retry_time=2023-03-21 10:03:39 +0000 chunk="5f766277ef08d8f6257ad093f0a07328" error_class=Errno::ECONNREFUSED error="Connection refused - connect(2) for \"rsyslog.shiv-syslog.svc.cluster.local\" port 514"
2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/remote_syslog_sender-1.2.2/lib/remote_syslog_sender/tcp_sender.rb:56:in `initialize'
2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/remote_syslog_sender-1.2.2/lib/remote_syslog_sender/tcp_sender.rb:56:in `new'
2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/remote_syslog_sender-1.2.2/lib/remote_syslog_sender/tcp_sender.rb:56:in `block in connect'
2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/remote_syslog_sender-1.2.2/lib/remote_syslog_sender/tcp_sender.rb:52:in `synchronize'
2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/remote_syslog_sender-1.2.2/lib/remote_syslog_sender/tcp_sender.rb:52:in `connect'
2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/remote_syslog_sender-1.2.2/lib/remote_syslog_sender/tcp_sender.rb:129:in `rescue in send_msg'
2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/remote_syslog_sender-1.2.2/lib/remote_syslog_sender/tcp_sender.rb:108:in `send_msg'
2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/remote_syslog_sender-1.2.2/lib/remote_syslog_sender/sender.rb:49:in `block in transmit'
2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/remote_syslog_sender-1.2.2/lib/remote_syslog_sender/sender.rb:37:in `each'
2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/remote_syslog_sender-1.2.2/lib/remote_syslog_sender/sender.rb:37:in `transmit'
2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluent-plugin-remote_syslog-1.0.0/lib/fluent/plugin/out_remote_syslog.rb:105:in `block (2 levels) in write'
2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluent-plugin-remote_syslog-1.0.0/lib/fluent/plugin/out_remote_syslog.rb:104:in `each_line'
2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluent-plugin-remote_syslog-1.0.0/lib/fluent/plugin/out_remote_syslog.rb:104:in `block in write'
2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.14.6/lib/fluent/plugin/buffer/file_chunk.rb:171:in `open'
2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluent-plugin-remote_syslog-1.0.0/lib/fluent/plugin/out_remote_syslog.rb:103:in `write'
2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.14.6/lib/fluent/plugin/output.rb:1179:in `try_flush'
2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.14.6/lib/fluent/plugin/output.rb:1500:in `flush_thread_run'
2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.14.6/lib/fluent/plugin/output.rb:499:in `block (2 levels) in start'
2023-03-21 10:03:38 +0000 [warn]: #0 /opt/td-agent/lib/ruby/gems/2.7.0/gems/fluentd-1.14.6/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2023-03-21 10:03:38 +0000 [trace]: #0 writing events into buffer instance=1980 metadata_size=1
Later, when the rsyslog service is restarted and up again, fluentd reconnects and successfully pushes all buffered chunks to the destination without any data loss. For example:
2023-03-21 10:23:24 +0000 [warn]: #0 retry succeeded. chunk_id="5f7664fef1ec6c112220332f2732de46"
2) For syslog running on Kubernetes, fluentd does detect a connectivity issue with syslog at startup (i.e. before the first successful connection), and retry/buffering works fine as expected in that case.
But once fluentd has successfully established a connection to syslog and the syslog destination then becomes unreachable, fluentd fails to detect the connection errors.
Concern:
Why is the loss of connectivity not detected when the rsyslog server is running as a Kubernetes service?
Please let us know if there are any additional configs in the syslog plugin that could help us get retry/buffering working properly in this case too.
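For reference, here is the kind of knob we are hoping for. From a quick read of remote_syslog_sender's tcp_sender.rb and the plugin README (so treat the option names and semantics below as our assumption, not a confirmed fix), the TCP sender takes the timeout together with a timeout_exception flag, and write timeouts appear to be swallowed unless timeout_exception is enabled; there also seem to be keep_alive* options that might let the kernel detect a half-open connection left behind when the pod vanishes without a FIN/RST. A sketch of what we would try:

<match test>
  @type remote_syslog
  host rsyslog.shiv-syslog.svc.cluster.local
  port 514
  protocol tcp
  timeout 2
  # Assumption: raise on write timeout instead of silently treating the
  # chunk as flushed, so the buffer retains it and retry kicks in.
  timeout_exception true
  # Assumption: TCP keepalive settings to detect half-open connections.
  keep_alive true
  keep_alive_idle 10
  keep_alive_cnt 3
  keep_alive_intvl 5
  <buffer>
    # same buffer section as above
  </buffer>
</match>

If these options do not apply to these plugin versions, please point us at the right ones.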