Comments (4)
After some investigation we confirmed that the out-of-memory errors are caused by the connector consuming from the Kafka topics faster than it writes to S3. This happens especially when the topics already contain a lot of data.
We found that the offset_flush_interval_ms
configuration property controls how often the connector writes to S3. By default, this happens once every 60 seconds. Setting it to a much lower value (5 seconds or even 1 second) triggers the writes to S3 much more quickly, so memory is freed up in time.
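For reference, on a plain Kafka Connect worker this interval is set via the `offset.flush.interval.ms` worker property (Aiven's console exposes it as `offset_flush_interval_ms`). A minimal sketch with an illustrative 5-second value:

```properties
# connect-distributed.properties (worker-level setting)
# Flush offsets, and thereby trigger S3 writes, every 5 seconds
# instead of the default 60 seconds. The value is illustrative.
offset.flush.interval.ms=5000
```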
Nonetheless, the main problem remains: if the connector needs to copy large amounts of data and writing to S3 takes longer for some reason (possibly also for reasons unrelated to the configured flush interval), the connector will consume the entire topic into memory, thereby overloading the Kafka Connect cluster.
We would like to request a new feature that allows us to configure a maximum size for the buffer of input records. When this limit is reached, the Kafka consumer should pause and resume only after memory has been freed up.
In this way, we can easily configure the maximum memory consumption of each connector and avoid overloading the cluster.
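To illustrate the requested behaviour, here is a minimal, hypothetical sketch (not the connector's actual code) of a bounded buffer that trips a pause flag once a configured byte limit is reached and resumes after a flush to S3 frees the memory. Class and method names are illustrative; in a real sink task the pause/resume would go through `SinkTaskContext.pause()`/`resume()` on the assigned partitions.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of the requested feature: buffer incoming records,
// pause consumption when a configured size limit is reached, and resume
// after a flush to S3 frees the memory. Names are illustrative, not part
// of the connector's actual API.
public class BoundedBufferSketch {
    private final long maxBufferBytes;            // the requested, configurable limit
    private final Queue<byte[]> buffer = new ArrayDeque<>();
    private long bufferedBytes = 0;
    private boolean paused = false;

    public BoundedBufferSketch(long maxBufferBytes) {
        this.maxBufferBytes = maxBufferBytes;
    }

    // Called for each record taken from Kafka; trips the pause flag when full.
    public void put(byte[] record) {
        buffer.add(record);
        bufferedBytes += record.length;
        if (!paused && bufferedBytes >= maxBufferBytes) {
            paused = true;   // real task: context.pause(assignedPartitions)
        }
    }

    // Called after the buffered records have been written to S3.
    public void drain() {
        while (!buffer.isEmpty()) {
            bufferedBytes -= buffer.poll().length;
        }
        if (paused && bufferedBytes < maxBufferBytes) {
            paused = false;  // real task: context.resume(assignedPartitions)
        }
    }

    public boolean isPaused() {
        return paused;
    }
}
```

This caps per-connector memory at roughly the configured limit regardless of how slow the S3 writes are, which is exactly the guarantee the flush-interval workaround cannot provide.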
from s3-connector-for-apache-kafka.
Out-of-memory logs:
[kafka-development-connect-6]2024-01-16T13:41:47.138064[kafka-connect]Terminating due to java.lang.OutOfMemoryError: Java heap space
@vpapanchev thanks for reporting this issue!
We agree with your assessment. This is something we were aware of, and the workaround you described is the current way to deal with this issue. Nonetheless, your feature request is valid, as there should be a better way to avoid OOM in this connector.
Let us know if you are planning to work on this; otherwise we will add it to our backlog.
Thank you @jeqo for the response.
I am not currently planning on working on this, so please add it to your backlog. I would appreciate an update here once you start working on it.
If it's a known issue, then you might be able to help me with my current struggles :)
So far, we have only been able to configure the offset_flush_interval_ms
property on the Kafka Connect cluster itself. Do you know if it is possible to configure it per connector, i.e., whether a connector can override the value configured on the cluster?
We enabled the parameter https://docs.aiven.io/docs/products/kafka/kafka-connect/reference/advanced-params#connector-client-config-override-policy for the Kafka Connect cluster by setting its value to "All". I then tried to configure a specific offset_flush_interval_ms for an S3 connector using various properties, such as:
- offset.flush.interval.ms
- admin.override.offset.flush.interval.ms
- consumer.override.offset.flush.interval.ms
- kafka_connect.offset.flush.interval.ms
- override.kafka_connect.offset.flush.interval.ms
- override.offset.flush.interval.ms
- ...
None of these seemed to work.
If you have any suggestions, that would be great!
Kind regards,
Vasil