Comments (2)
I'm afraid I don't have any specific insights. We were dealing with a URL file that had up to about 10 million records in it; your 2+ billion records (living in just the one transactions.csv file, it seems?) is probably going to run into the limits of what standard Java data structures can support (Integer.MAX_VALUE being 2147483647).
My recommendation would be to break your file up into smaller chunks. That's probably the easiest solution. If you're feeling bold, you could try to revisit my solution to make it more efficient or better utilize streaming.
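For what it's worth, here is a minimal sketch of the "smaller chunks" idea, written as a standalone Python script rather than anything inside DSBulk. The function name `split_csv`, the `chunk_00000.csv` naming scheme, and the default chunk size are all my own assumptions for illustration, and it assumes the file starts with a header row that every chunk should repeat. It streams the input, so the whole file is never held in memory.

```python
import csv
from pathlib import Path


def split_csv(src, out_dir, rows_per_chunk=10_000_000):
    """Stream `src` and write chunk files of at most `rows_per_chunk` records.

    Hypothetical helper: the name, the output naming scheme, and the default
    chunk size are illustrative assumptions, not part of DSBulk.
    Assumes the first row of `src` is a header; it is copied into every chunk.
    Returns the list of chunk paths in the order they were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    out = None
    writer = None
    try:
        with open(src, newline="") as f:
            reader = csv.reader(f)
            header = next(reader, None)
            if header is None:
                return paths  # empty input, nothing to split
            for i, row in enumerate(reader):
                # Start a new chunk file every `rows_per_chunk` records.
                if i % rows_per_chunk == 0:
                    if out is not None:
                        out.close()
                    path = out_dir / f"chunk_{len(paths):05d}.csv"
                    out = open(path, "w", newline="")
                    writer = csv.writer(out)
                    writer.writerow(header)
                    paths.append(path)
                writer.writerow(row)
    finally:
        if out is not None:
            out.close()
    return paths
```

Something like `split_csv("transactions.csv", "chunks")` would then produce files that can each be loaded separately. Using the `csv` module instead of splitting on raw newlines keeps quoted fields with embedded line breaks intact, at some cost in speed.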
Also, in case you haven't already found it, #399 was where I implemented the S3 functionality. There may be more answers that can be gleaned from that PR. I don't think I did anything in particular with processing the records themselves; I just dealt with reading the list from S3 and passing it to the normal DSBulk operation. (It may be worth noting that I have no association with DataStax or DSBulk. I'm just a random dev who needed a new feature and decided to implement it himself.)
I hope that is at least a little helpful!
from dsbulk.
@DavidTaylorCengage do you have any input here?