Comments (23)

ledw commented on September 25, 2024

@dhmay Currently it does not support reading from compressed formats. Does the current version work for you? How long does it take to load all the examples? Any suggestions on more compact formats?

dhmay commented on September 25, 2024

The current version is frustrating but manageable for 200K samples. I shortened the names of my features, which got my file size down to about 2G. :) Loading that 2G file takes ~6min, mostly because of the I/O. It's going to get unwieldy to scale up, and I'd like to scale up to at least 2M samples.

The most flexible thing to do would be to read the training file from stdin if -trainFile isn't supplied. That would allow for any kind of compression that could be decompressed streaming -- I'd probably use gzip. That'll make the on-disk file size hugely smaller, and it ought to provide a good speedup, too.

ledw commented on September 25, 2024

@dhmay Thanks for the suggestion! We'll test that out and update shortly.

ledw commented on September 25, 2024

@dhmay How does reading from stdin work with multiple threads? Any examples? Thanks!

dhmay commented on September 25, 2024

I'm not the best one to give advice, not having done multithreaded stuff in C++ in a long time. But, in the current code, aren't you simply loading all the samples into one corpus per thread? Wouldn't that flow be essentially the same for reading from stdin, rather than a file? Seems to me that the only code that should really need to change is what's in InternDataHandler::loadFromFile, right?

ckingdev commented on September 25, 2024

It's not the most convenient, but you could assign IDs to each of your features: if you used base-64 IDs, you would be able to have 4096 features using no more than 2 characters. If you're I/O-bound, it could help there. You'd have to keep a mapping to and from the original names, but that could at least be kept outside of the training process itself.
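
Rough sketch of what I mean (all names below are made up for illustration, nothing here is StarSpace code):

```cpp
// Sketch: intern long feature names as short base-64 ids, outside of training.
// 2 characters cover 64 * 64 = 4096 distinct features.
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

static const std::string kAlphabet =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

// Encode a non-negative integer as a base-64 string.
std::string encodeId(int id) {
  std::string out;
  do {
    out.insert(out.begin(), kAlphabet[id % 64]);
    id /= 64;
  } while (id > 0);
  return out;
}

int main() {
  std::unordered_map<std::string, std::string> featureToId;  // name -> short id
  std::vector<std::string> idToFeature;                      // index -> name, for decoding later

  auto intern = [&](const std::string& name) -> std::string {
    auto it = featureToId.find(name);
    if (it != featureToId.end()) return it->second;
    std::string shortId = encodeId(static_cast<int>(idToFeature.size()));
    idToFeature.push_back(name);
    featureToId.emplace(name, shortId);
    return shortId;
  };

  std::cout << intern("my_very_long_feature_name") << "\n";  // "A"
  std::cout << intern("another_feature") << "\n";            // "B"
  return 0;
}
```

You'd dump the two maps somewhere so you can translate predictions back to feature names afterwards.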

And if you've got RAM to spare, you could put your data files in something like tmpfs and afterwards stream them back to disk through a compressor. Something like zstd or lz4 (pzstd or lz4mt for multithreaded compression and decompression) speeds up I/O for me considerably, even on my SSD: the reduction in data size more than makes up for the CPU cost.

Lastly, if you're tight on RAM as well and you're on a Linux machine, using zram could be a big help and might make the difference between being able to hold your data in main memory or not.

ledw commented on September 25, 2024

@dhmay Yes, InternDataHandler::loadFromFile is mainly where reading from the file happens. It uses multiple threads, and currently it reads through the file first to determine where each thread starts reading. Would you be interested in adding the functionality of reading from stdin to StarSpace? It would be very helpful to have more people contributing to the StarSpace codebase and community.
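
Roughly, the idea is this (just a sketch, not the actual code); note that it relies on seeking, which is what makes plain stdin tricky for the multithreaded case:

```cpp
#include <fstream>
#include <string>
#include <vector>

// Sketch: compute a start offset per thread so each thread begins on a line
// boundary. Error/EOF edge cases are omitted for brevity.
std::vector<std::streamoff> threadOffsets(const std::string& path, int nThreads) {
  std::ifstream in(path, std::ios::binary | std::ios::ate);
  std::streamoff size = in.tellg();            // total file size in bytes
  std::vector<std::streamoff> offsets;
  for (int i = 0; i < nThreads; ++i) {
    in.clear();
    in.seekg(size * i / nThreads);             // rough cut point for thread i
    if (i > 0) {
      std::string partial;
      std::getline(in, partial);               // skip to the next line boundary
    }
    offsets.push_back(in.tellg());             // thread i starts reading here
  }
  return offsets;
}
```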

dhmay commented on September 25, 2024

I'm worried that there's some kind of complexity here that I don't understand. From my reading of InternDataHandler::loadFromFile, it looks like it literally just walks through the file, line by line, parsing each line and storing it. So changing it to allow reading from stdin would be trivial: the foreach_line() method in util.h would need to be made capable of handling input from either a file or stdin, argument parsing would need to be tweaked, and that's it.
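
Something like this is what I have in mind -- made-up names only; the real foreach_line() in util.h has its own signature:

```cpp
#include <fstream>
#include <functional>
#include <iostream>
#include <string>

// Sketch: iterate over lines of whatever istream we are given.
void forEachLine(std::istream& in, const std::function<void(const std::string&)>& fn) {
  std::string line;
  while (std::getline(in, line)) {
    fn(line);
  }
}

int main(int argc, char** argv) {
  // If a training file is given, read it; otherwise fall back to stdin,
  // so callers can pipe in decompressed data (e.g. from gzip -dc).
  std::ifstream file;
  bool useStdin = (argc < 2);
  if (!useStdin) {
    file.open(argv[1]);
    if (!file) { std::cerr << "cannot open " << argv[1] << "\n"; return 1; }
  }
  std::istream& in = useStdin ? std::cin : file;

  size_t nLines = 0;
  forEachLine(in, [&](const std::string&) { ++nLines; });
  std::cerr << "read " << nLines << " lines\n";
  return 0;
}
```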

But it sounds like you think it's more complicated than that, so I worry that if I took a crack at it I'd screw something up that I don't understand.

ledw commented on September 25, 2024

@dhmay Yes, if we do not consider multithreading, then basically that is it.

dhmay commented on September 25, 2024

@ledw There's a complication. I hadn't realized before that args_->trainFile is read through twice. The first pass is in Dictionary::readFromFile, to build the dictionary, and the second pass is in InternDataHandler::loadFromFile, to load the corpora.

I made some changes to StarSpace::init and readFromFile to read from stdin, and that worked fine for reading the dictionary. But then the input is gone; there's no way to access it again for InternDataHandler::loadFromFile.

If we want to be able to read from stdin, some more structural changes need to be made to do it all in one pass over the training data. loadFromFile would need to build both the dictionary and the corpora, so it'd need to parse the line twice, with parseForDict and parse. On the upside, if that's feasible, it ought to give a big speedup.
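
In pseudocode, the single-pass version would be roughly this (parseForDict / parse stand in for the real methods; their actual signatures and the dictionary size control are glossed over):

```cpp
#include <iostream>
#include <string>

// Placeholder stand-ins for the real StarSpace classes.
struct Dictionary  { void parseForDict(const std::string& line) { /* count / insert tokens */ } };
struct DataHandler { void parse(const std::string& line)        { /* build one example   */ } };

int main() {
  Dictionary dict;
  DataHandler data;
  std::string line;
  // One pass: each line feeds both the dictionary and the corpus, so the
  // input never needs to be rewound -- which is what stdin requires.
  while (std::getline(std::cin, line)) {
    dict.parseForDict(line);
    data.parse(line);
  }
  return 0;
}
```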

I'm not sure I've anticipated all the downstream effects of that change. What do you think... does it make sense to go this route, or would it make more sense for StarSpace to explicitly support a particular type of compression, by leaving the code structure untouched and wrapping all getline calls in some wrapper?

ledw commented on September 25, 2024

@dhmay Thanks for looking into this! You are correct: the dictionary also needs to read the file to construct the dict. Constructing the dictionary on the fly is doable, but it requires some fundamental changes to the dictionary code (for instance, how its size is controlled). I think it makes more sense to support a particular type of compression instead.
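
For reference, one common way to wrap the reads in C++ is Boost.Iostreams -- just an illustration of the wrapper idea, not necessarily how we will implement it (it needs linking against boost_iostreams and zlib):

```cpp
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/iostreams/filtering_stream.hpp>
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char** argv) {
  if (argc < 2) { std::cerr << "usage: readgz <file.gz>\n"; return 1; }

  std::ifstream file(argv[1], std::ios_base::in | std::ios_base::binary);
  boost::iostreams::filtering_istream in;
  in.push(boost::iostreams::gzip_decompressor());  // transparently inflate
  in.push(file);                                   // underlying gzip file

  // Downstream code keeps calling std::getline as before; only the stream
  // setup changes, and the file can simply be reopened for the second pass.
  std::string line;
  size_t n = 0;
  while (std::getline(in, line)) ++n;
  std::cout << n << " lines\n";
  return 0;
}
```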

halflings commented on September 25, 2024

One solution here would be to support binary formats, such as protobuf or parquet, instead of plaintext.

jaseweston commented on September 25, 2024

dhmay commented on September 25, 2024

@jaseweston I'm not sure I understand that suggestion. Once you've run through the input from stdin, it's gone. Caching it to run through again would be arbitrarily memory-intensive.

jaseweston commented on September 25, 2024

dhmay commented on September 25, 2024

Oh, I see. So you'd make a new command, something like starspace makedict, and give starspace train an optional input dictionary file argument.

The two drawbacks I can see are: 1. it's still two passes through the data (though that would only be necessary the first time you ran on a particular dataset), which takes time, and 2. it's less flexible -- if you were generating random data, or something, you'd have to make sure you could do it the same way twice.

But it would allow for arbitrary compression of input, so if that approach is much easier than restructuring to build the dict on the fly, then it sounds good to me!

jaseweston commented on September 25, 2024

rnditdev commented on September 25, 2024

Which data structure and format would be preferable for persisting the dictionary? I'd aim to serialize the vector here. Any other options?

ledw commented on September 25, 2024

@rnditdev Do you mean serializing the whole dictionary or just the vector you pointed out? For the dictionary I think a simple structure would do. You can serialize it to either a binary file or a txt file and then reload the dictionary from that (if I understand correctly).
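
A minimal txt round trip could look like this (a sketch only; the real Dictionary in StarSpace stores more state, e.g. counts):

```cpp
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for the real Dictionary class, reduced to the id <-> word mapping.
struct Dict {
  std::vector<std::string> words;                // id -> word
  std::unordered_map<std::string, int> word2id;  // word -> id
};

void saveDict(const Dict& d, const std::string& path) {
  std::ofstream out(path);
  for (const auto& w : d.words) out << w << "\n";  // line number == id
}

Dict loadDict(const std::string& path) {
  Dict d;
  std::ifstream in(path);
  std::string w;
  while (std::getline(in, w)) {
    d.word2id[w] = static_cast<int>(d.words.size());
    d.words.push_back(w);
  }
  return d;
}
```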

rnditdev commented on September 25, 2024

I'm looking to save the right data structure. As per your comment, it seems it would be preferable to save the whole Dict object. I was looking at external libraries like Cereal for it, but that could add unnecessary overhead. Thoughts?

rnditdev commented on September 25, 2024

I've reviewed the code, and it seems all the dictionary-saving methods are already in place. We should just add a new option (loaddictfromfile), or a modifier to initmodel saying we will use just the dictionary, not the weights.

ledw commented on September 25, 2024

@rnditdev your approach sounds good to me.
@dhmay Support for reading from gzip is available here: #206
Let me know if you want to try it out.

dhmay commented on September 25, 2024

Thanks, @ledw! I should have a good application for this enhancement soon.
