terashuf's People

Contributors: alexandres

terashuf's Issues

How to use SKIP

SKIP (int): how many lines to skip at beginning of input; defaults to 0. When shuffling CSV files, set to 1 to preserve header.

Do I need to set this parameter when running the command on the command line? Could you give me an example?
Thank you very much.
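
For instance, is something like the following the intended usage? (The file names are illustrative; the environment-variable style follows the MEMORY and SEED examples elsewhere on this page.)

SKIP=1 ./terashuf < data.csv > shuffled.csv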

Silent failure when reading multiple large files via cat

I have a collection of 12 files (~50GB each) that I need to shuffle. Using the instructions from the issue I last opened and the README, which have always worked for me before, I ran

cat file1.txt file2.txt file3.txt file4.txt file5.txt file6.txt file7.txt file8.txt file9.txt file10.txt file11.txt file12.txt | MEMORY=110 ./terashuf | split --line-bytes 10G - shuffled.txt.

The output is

trying to allocate 118111600640 bytes

starting read
skipped 0 lines
mean line-length is 89.47, estimated memory usage is 1.09 * 110.00 GB = 119.73 GB
Tip: If you would like use exactly 110.00 GB of memory, use MEMORY=101.0631 ./terashuf ...

It never outputs anything about reading lines in or writing lines out, nor does it create any temporary files. This bug also occurs if I cat only two of the files together, but it does not occur if I cat one file; in that case, the file shuffles normally.
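
A hedged reproduction sketch (file names follow the report; head -c with a size suffix assumes GNU coreutils): truncated copies of two of the files should show whether the stall is tied to multi-file input rather than to the 110 GB allocation.

head -c 1G file1.txt > small1.txt
head -c 1G file2.txt > small2.txt
cat small1.txt small2.txt | MEMORY=1 ./terashuf > /dev/null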

Seeking clarification on max file descriptors

You write:

When shuffling very large files, terashuf needs to keep open SIZE_OF_FILE_TO_SHUFFLE / MEMORY temporary files. Make sure to set the maximum number of file descriptors to at least this number. By setting a large file descriptor limit, you ensure that terashuf won't abort a shuffle midway, saving precious researcher time.

  • What is SIZE_OF_FILE_TO_SHUFFLE? The number of \n-terminated lines, the total size in bytes, the size in GB, or something else?
  • What is MEMORY? GB or bytes?

Perhaps an example would be self-documenting. If I have 70M records in a 600 GB file and MEMORY is set to the default 4, will I need:

  1. 1 file descriptor (70000000 / (4*2**30)),
  2. 17.5M file descriptors (70000000 / 4),
  3. 150 file descriptors (600 / 4, if size is in GB),

or something totally different?

Thank you!


Edit: after running it, I see it's (3) in my list above. Perhaps the suggestion could be modified to: SIZE_OF_FILE_TO_SHUFFLE_IN_GBS / MEMORY_IN_GBS

Also, a default file descriptor limit is 16K, so realistically a user would need to shuffle a 64 TB file with 4 GB of memory before running into problems. Perhaps a note to that effect could be added so the average user doesn't worry unnecessarily.
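
For example, a rough check of both the estimate and the limit (bash syntax on Linux; raising the soft limit beyond the hard limit requires privileges):

echo $(( 600 / 4 ))   # about 150 temp files for a 600 GB input with MEMORY=4
ulimit -n             # current per-process open-file limit
ulimit -n 4096        # raise the soft limit for this shell if needed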

In the end I shuffled a 900 GB file using 150 GB of memory, and it needed just 6 additional file descriptors.

Very awesome program - took only about 1h! Thank you!

Parallel files are not consistent using the same seed

First, I want to say thank you so much for such an amazing and fast shuffling tool. It's really, really fast!

I had two parallel files, train.src and train.tgt, each around 90M lines long. When I tried to shuffle them using the same seed, I got a different order in each output. Here are the commands I used:

SEED=7 ./terashuf < train.src > train.src.shuf
SEED=7 ./terashuf < train.tgt > train.tgt.shuf

I wish I could upload the files for reproducibility, but each is around 10 GB.

Note:

This command worked fine with valid.src, valid.tgt, test.src, and test.tgt, which have far fewer lines.
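
In the meantime, a possible workaround (assuming train.src contains no tab characters) is to shuffle the pair as a single pasted file and split the columns back out afterwards:

paste train.src train.tgt > train.both
SEED=7 ./terashuf < train.both > train.both.shuf
cut -f1 train.both.shuf > train.src.shuf
cut -f2- train.both.shuf > train.tgt.shuf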

Memory usage too high

The actual memory consumption does not seem to be limited to the amount specified in the MEMORY environment variable. For example, if I run:

$ env MEMORY="20.0" ./terashuf < myfile.txt > shuffled.txt
trying to allocate 21474836480 bytes

starting read

the actual memory usage can be as high as ~40 GB. Is this behavior expected, or is there something I should be doing differently? I am shuffling a tab-delimited file with two columns and a total size of ~60 GB.
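
For reference, one way to measure the peak usage is GNU time's maximum-resident-set-size report (a sketch; /usr/bin/time is the GNU binary, not the shell builtin, and the statistics go to stderr):

env MEMORY="20.0" /usr/bin/time -v ./terashuf < myfile.txt > shuffled.txt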

Thank you

Does not work: only wrote 1 GB of the file

I had a 100 GB text file that I needed to shuffle. terashuf read the file correctly and even created 23 temp files. This is the final message I received:


starting read
lines read: 723456445, gb read: 91
Read 723456445 lines, 98466096665 bytes, have 23 tmp files

starting write to output
lines written: 9126923, gb written: 1
done
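
One hedged thing to check with a truncated write like this is whether the partition holding the output or the temporary files filled up partway through (whether terashuf honors $TMPDIR is an assumption here; temp files commonly land in $TMPDIR or /tmp):

df -h .                    # free space where the shuffled output is written
df -h "${TMPDIR:-/tmp}"    # free space where temp files would typically land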

Shuffling collections of files?

Does the current interface support shuffling a large corpus that is broken down into many files, which is a very common situation in practice? I don't really know C++, so it would be difficult for me to implement this feature myself, but I might do it if no one else does. One can of course cat the files together and split them, but that creates additional disk space overhead.
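
For reference, the cat-and-split workaround mentioned above can at least be written as a single pipeline, so the concatenated input is never materialized as its own file (file names and MEMORY value are illustrative); the shuffled split output still occupies roughly as much disk as the input:

cat corpus_part_*.txt | MEMORY=8 ./terashuf | split --line-bytes 1G - shuffled_part_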

Suggestion for additional comparison benchmark

Hi,

First, I just want to say that I appreciate your work toward faster single-machine shuffling of large data sets. This is often useful.

Second, I wanted to point out some experiments I did to see how fast disk-based shuffling could be done with existing Unix command-line tools. The writeup is here: Shuffling large files, in the eBay tsv-utils GitHub repo.

The approach makes use of a tool from the tsv-utils repo, tsv-sample, so it's not strictly traditional Unix tools, but it's close. The tsv-sample tool is very much like GNU shuf, but with an additional capability needed for the disk-based sampling algorithm. When used standalone for shuffling, tsv-sample reads all data into memory, like shuf.

The disk-based shuffling approach used by terashuf should be able to beat the disk-based shuffling approach I tested. It might make an interesting comparison point should you run your benchmarks again.
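
For context, a minimal sketch of the two regimes being compared (file names illustrative): GNU shuf handles inputs that fit in RAM, while terashuf targets inputs that do not.

shuf input.txt > shuffled.txt                      # in-memory shuffle; whole file must fit in RAM
MEMORY=4 ./terashuf < input.txt > shuffled.txt     # disk-backed shuffle with roughly a 4 GB buffer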
