
Comments (2)

DavidTaylorCengage commented on August 11, 2024

I'm afraid I don't have any specific insights. We were dealing with a URL file that had up to about 10 million records in it; your 2+ billion records (all living in the one transactions.csv file, it seems?) will probably run into the limits of what standard Java data structures can support. Arrays and collections are indexed and sized with int, and Integer.MAX_VALUE is 2147483647.
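To make that ceiling concrete, here is a minimal sketch of what happens when a record count above Integer.MAX_VALUE meets an int-sized structure. The class name, the method name, and the 2.5 billion figure are hypothetical, chosen only for illustration; this does not show where (or whether) DSBulk itself hits the limit.

```java
/** Hypothetical illustration, not DSBulk code. */
class RecordCountLimit {

    /** True if the count can serve as an int index or collection size. */
    static boolean fitsInInt(long recordCount) {
        return recordCount >= 0 && recordCount <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        long urlFileRecords = 10_000_000L;        // roughly the size mentioned above
        long transactionRecords = 2_500_000_000L; // assumed stand-in for "2+ billion"

        System.out.println(fitsInInt(urlFileRecords));     // true
        System.out.println(fitsInInt(transactionRecords)); // false

        // A plain narrowing cast wraps around silently.
        System.out.println((int) transactionRecords);      // -1794967296

        // Math.toIntExact refuses instead of wrapping.
        try {
            Math.toIntExact(transactionRecords);
        } catch (ArithmeticException e) {
            System.out.println("overflow: " + e.getMessage());
        }
    }
}
```

Any code path that collects every record into a single list or array, or counts records with an int, would be affected in this way; a path that streams records one at a time would not.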

My recommendation would be to break your file up into smaller chunks. That's probably the easiest solution. If you're feeling bold, you could revisit my solution to make it more efficient or make better use of streaming.
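For the chunking route, a rough sketch of a splitter might look like the following. It is not part of DSBulk; the class name, the chunk file naming, and the row limit are all hypothetical. It assumes the first line of the file is a header and that each record sits on a single line (no quoted fields containing newlines).

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of DSBulk: splits a large CSV into smaller files. */
class CsvSplitter {

    /**
     * Streams the source line by line and writes at most rowsPerChunk data rows
     * to each chunk file, repeating the header line at the top of every chunk.
     */
    static List<Path> split(Path source, Path outDir, long rowsPerChunk) throws IOException {
        if (rowsPerChunk <= 0) {
            throw new IllegalArgumentException("rowsPerChunk must be positive");
        }
        Files.createDirectories(outDir);
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8)) {
            String header = reader.readLine();
            if (header == null) {
                return chunks; // empty input, nothing to split
            }
            BufferedWriter writer = null;
            long rowsInChunk = 0;
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Start a new chunk file on the first row and whenever the current one is full.
                    if (writer == null || rowsInChunk == rowsPerChunk) {
                        if (writer != null) {
                            writer.close();
                        }
                        Path chunk = outDir.resolve(String.format("chunk-%05d.csv", chunks.size() + 1));
                        chunks.add(chunk);
                        writer = Files.newBufferedWriter(chunk, StandardCharsets.UTF_8);
                        writer.write(header);
                        writer.newLine();
                        rowsInChunk = 0;
                    }
                    writer.write(line);
                    writer.newLine();
                    rowsInChunk++;
                }
            } finally {
                if (writer != null) {
                    writer.close();
                }
            }
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        // Example invocation (Java 11+): java CsvSplitter.java transactions.csv chunks 10000000
        List<Path> chunks = split(Path.of(args[0]), Path.of(args[1]), Long.parseLong(args[2]));
        System.out.println("Wrote " + chunks.size() + " chunk files");
    }
}
```

Only one line is held in memory at a time and the row counter is a long, so the input size is not bounded by Integer.MAX_VALUE. Each resulting chunk could then be loaded as a separate, smaller job.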

Also, in case you haven't already found it, #399 was where I implemented the S3 functionality. There may be more answers that can be gleaned from that PR. I don't think I did anything in particular with processing the records themselves; I just dealt with reading the list from S3 and passing it to the normal DSBulk operation. (It may be worth noting that I have no association with DataStax or DSBulk. I'm just a random dev who needed a new feature and decided to implement it himself.)

I hope that is at least a little helpful!

from dsbulk.

msmygit commented on August 11, 2024

@DavidTaylorCengage do you have any inputs here?

