terashuf's People

Contributors: alexandres

terashuf's Issues

How to use SKIP

SKIP (int): how many lines to skip at beginning of input; defaults to 0. When shuffling CSV files, set to 1 to preserve header.

Do I need to set this parameter when running the command on the command line? Could you give me an example?
Thank you very much.
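
For instance, is something like the following the intended usage? (The file names are illustrative; the environment-variable style follows the MEMORY and SEED examples elsewhere on this page.)

SKIP=1 ./terashuf < data.csv > shuffled.csv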

Silent failure when reading multiple large files via cat

I have a collection of 12 files (~50GB each) that I need to shuffle. Using the instructions from the issue I last opened and the README, which have always worked for me before, I ran

cat file1.txt file2.txt file3.txt file4.txt file5.txt file6.txt file7.txt file8.txt file9.txt file10.txt file11.txt file12.txt | MEMORY=110 ./terashuf | split --line-bytes 10G - shuffled.txt.

The output is

trying to allocate 118111600640 bytes

starting read
skipped 0 lines
mean line-length is 89.47, estimated memory usage is 1.09 * 110.00 GB = 119.73 GB
Tip: If you would like use exactly 110.00 GB of memory, use MEMORY=101.0631 ./terashuf ...

It never outputs anything about reading lines in or writing lines out, nor does it create any temporary files. This bug also occurs if I cat only two of the files together, but it does not occur if I cat one file; in that case, the file shuffles normally.
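
A hedged reproduction sketch (file names follow the report; head -c with a size suffix assumes GNU coreutils): truncated copies of two of the files should show whether the stall is tied to multi-file input rather than to the 110 GB allocation.

head -c 1G file1.txt > small1.txt
head -c 1G file2.txt > small2.txt
cat small1.txt small2.txt | MEMORY=1 ./terashuf > /dev/null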

Seeking clarification on max file descriptors

You write:

When shuffling very large files, terashuf needs to keep open SIZE_OF_FILE_TO_SHUFFLE / MEMORY temporary files. Make sure to set the maximum number of file descriptors to at least this number. By setting a large file descriptor limit, you ensure that terashuf won't abort a shuffle midway, saving precious researcher time.

  • What is SIZE_OF_FILE_TO_SHUFFLE? The number of \n-terminated lines, the total size in bytes, the size in GB, or something else?
  • What is MEMORY? GB or bytes?

Perhaps an example would be self-documenting. If I have 70M records in a 600 GB file and MEMORY is set to the default 4, will I need:

  1. 1 file descriptor (70000000 / (4*2**30)),
  2. 17.5M file descriptors (70000000 / 4),
  3. 150 file descriptors (600 / 4, if size is in GB),

or something totally different?

Thank you!


Edit: after running it, I see it's (3) in my list above. Perhaps the suggestion could be modified to: SIZE_OF_FILE_TO_SHUFFLE_IN_GBS / MEMORY_IN_GBS

Also, a default file descriptor limit is 16K, so realistically a user would need to shuffle a 64 TB file with 4 GB of memory before running into problems. Perhaps a note to that effect could be added so the average user doesn't worry unnecessarily.
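
For example, a rough check of both the estimate and the limit (bash syntax on Linux; raising the soft limit beyond the hard limit requires privileges):

echo $(( 600 / 4 ))   # about 150 temp files for a 600 GB input with MEMORY=4
ulimit -n             # current per-process open-file limit
ulimit -n 4096        # raise the soft limit for this shell if needed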

In the end I shuffled a 900 GB file using 150 GB of memory, and it needed just 6 additional file descriptors.

Very awesome program - took only about 1h! Thank you!

Parallel files are not consistent using the same seed

First, I want to say thank you so much for such an amazing and fast shuffling tool. It's really, really fast!

I had two parallel files, train.src and train.tgt, each around 90M lines long. When I tried to shuffle them using the same seed, I got a different order in each output. Here are the commands I used:

SEED=7 ./terashuf < train.src > train.src.shuf
SEED=7 ./terashuf < train.tgt > train.tgt.shuf

I wish I could upload the files for reproducibility, but each is around 10 GB.

Note:

This command worked fine with valid.src, valid.tgt, test.src, and test.tgt, which have far fewer lines.
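
In the meantime, a possible workaround (assuming train.src contains no tab characters) is to shuffle the pair as a single pasted file and split the columns back out afterwards:

paste train.src train.tgt > train.both
SEED=7 ./terashuf < train.both > train.both.shuf
cut -f1 train.both.shuf > train.src.shuf
cut -f2- train.both.shuf > train.tgt.shuf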

Memory usage too high

The actual memory consumption does not seem to be limited to the amount specified in the MEMORY environment variable. For example, if I run:

$ env MEMORY="20.0" ./terashuf < myfile.txt > shuffled.txt
trying to allocate 21474836480 bytes

starting read

the actual memory usage can be as high as ~40 GB. Is this behavior expected, or is there something I should be doing differently? I am shuffling a tab-delimited file with two columns and a total size of ~60 GB.
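
For reference, one way to measure the peak usage is GNU time's maximum-resident-set-size report (a sketch; /usr/bin/time is the GNU binary, not the shell builtin, and the statistics go to stderr):

env MEMORY="20.0" /usr/bin/time -v ./terashuf < myfile.txt > shuffled.txt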

Thank you

Does not work: only wrote 1 GB of the file

I had a 100 GB text file that I needed to shuffle. terashuf read the file correctly and even created 23 temp files. This is the final message I received:


starting read
lines read: 723456445, gb read: 91
Read 723456445 lines, 98466096665 bytes, have 23 tmp files

starting write to output
lines written: 9126923, gb written: 1
done
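
One hedged thing to check with a truncated write like this is whether the partition holding the output or the temporary files filled up partway through (whether terashuf honors $TMPDIR is an assumption here; temp files commonly land in $TMPDIR or /tmp):

df -h .                    # free space where the shuffled output is written
df -h "${TMPDIR:-/tmp}"    # free space where temp files would typically land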

Shuffling collections of files?

Does the current interface support shuffling a large corpus that is broken down into many files, which is a very common situation in practice? I don't really know C++, so it would be difficult for me to implement this feature myself, but I might do it if no one else does. One can of course cat the files together and split them, but that creates additional disk space overhead.
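
For reference, the cat-and-split workaround mentioned above can at least be written as a single pipeline, so the concatenated input is never materialized as its own file (file names and MEMORY value are illustrative); the shuffled split output still occupies roughly as much disk as the input:

cat corpus_part_*.txt | MEMORY=8 ./terashuf | split --line-bytes 1G - shuffled_part_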

Suggestion for additional comparison benchmark

Hi,

First, I just want to say that I appreciate your work toward faster single-machine shuffling of large data sets. This is often useful.

Second, I wanted to point out some experiments I did to see how fast disk-based shuffling could be done with existing Unix command-line tools. The writeup is here: Shuffling large files, in the eBay tsv-utils GitHub repo.

The approach makes use of a tool from the tsv-utils repo, tsv-sample, so it's not strictly traditional Unix tools, but it's close. The tsv-sample tool is very much like GNU shuf, but with an additional capability needed for the disk-based sampling algorithm. When used standalone for shuffling, tsv-sample reads all data into memory, like shuf.

The disk-based shuffling approach used by terashuf should be able to beat the disk-based shuffling approach I tested. It might make an interesting comparison point should you run your benchmarks again.
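
For context, a minimal sketch of the two regimes being compared (file names illustrative): GNU shuf handles inputs that fit in RAM, while terashuf targets inputs that do not.

shuf input.txt > shuffled.txt                      # in-memory shuffle; whole file must fit in RAM
MEMORY=4 ./terashuf < input.txt > shuffled.txt     # disk-backed shuffle with roughly a 4 GB buffer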
