Comments (23)

ledw commented on September 25, 2024

@dhmay Currently it does not support reading from compressed formats. Does the current version work for you? How long does it take to load all the examples? Any suggestions on more compact formats?

dhmay commented on September 25, 2024

The current version is frustrating but manageable for 200K samples. I shortened the names of my features, which got my file size down to about 2G. :) Loading that 2G file takes ~6min, mostly because of the I/O. It's going to get unwieldy to scale up, and I'd like to scale up to at least 2M samples.

The most flexible thing to do would be to read the training file from stdin if -trainFile isn't supplied. That would allow for any kind of compression that could be decompressed streaming -- I'd probably use gzip. That'll make the on-disk file size hugely smaller, and it ought to provide a good speedup, too.

ledw commented on September 25, 2024

@dhmay Thanks for the suggestion! We'll test that out and update shortly.

ledw commented on September 25, 2024

@dhmay How does reading from stdin work with multiple threads? Any examples? Thanks!

dhmay commented on September 25, 2024

I'm not the best one to give advice, not having done multithreaded stuff in C++ in a long time. But, in the current code, aren't you simply loading all the samples into one corpus per thread? Wouldn't that flow be essentially the same for reading from stdin, rather than a file? Seems to me that the only code that should really need to change is what's in InternDataHandler::loadFromFile, right?

ckingdev commented on September 25, 2024

It's not the most convenient, but you could assign IDs to each of your features: if you used base-64 IDs, you would be able to have 4096 features using no more than 2 characters. If you're I/O-bound, it could help there. You'd have to keep a mapping to and from the original names, but that could at least be kept outside of the training process itself.
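
Rough sketch of what I mean (all names below are made up for illustration, nothing here is StarSpace code):

```cpp
// Sketch: intern long feature names as short base-64 ids, outside of training.
// 2 characters cover 64 * 64 = 4096 distinct features.
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

static const std::string kAlphabet =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

// Encode a non-negative integer as a base-64 string.
std::string encodeId(int id) {
  std::string out;
  do {
    out.insert(out.begin(), kAlphabet[id % 64]);
    id /= 64;
  } while (id > 0);
  return out;
}

int main() {
  std::unordered_map<std::string, std::string> featureToId;  // name -> short id
  std::vector<std::string> idToFeature;                      // index -> name, for decoding later

  auto intern = [&](const std::string& name) -> std::string {
    auto it = featureToId.find(name);
    if (it != featureToId.end()) return it->second;
    std::string shortId = encodeId(static_cast<int>(idToFeature.size()));
    idToFeature.push_back(name);
    featureToId.emplace(name, shortId);
    return shortId;
  };

  std::cout << intern("my_very_long_feature_name") << "\n";  // "A"
  std::cout << intern("another_feature") << "\n";            // "B"
  return 0;
}
```

You'd dump the two maps somewhere so you can translate predictions back to feature names afterwards.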

And if you've got RAM to spare, you could put your data files in something like tmpfs and afterwards stream them back to disk through a compressor. Something like zstd or lz4 (pzstd or lz4mt for multithreaded compression and decompression) speeds up I/O for me considerably, even on my SSD: the reduction in data size more than makes up for the CPU cost.

Lastly, if you're tight on RAM as well and you're on a Linux machine, using zram could be a big help and might make the difference between being able to hold your data in main memory or not.

ledw commented on September 25, 2024

@dhmay Yes, InternDataHandler::loadFromFile is mainly where reading from the file happens. It uses multiple threads, and currently it reads through the file first to determine where each thread starts reading. Would you be interested in adding the functionality of reading from stdin to StarSpace? It would be very helpful to have more people contributing to the StarSpace codebase and community.
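
Roughly, the idea is this (just a sketch, not the actual code); note that it relies on seeking, which is what makes plain stdin tricky for the multithreaded case:

```cpp
#include <fstream>
#include <string>
#include <vector>

// Sketch: compute a start offset per thread so each thread begins on a line
// boundary. Error/EOF edge cases are omitted for brevity.
std::vector<std::streamoff> threadOffsets(const std::string& path, int nThreads) {
  std::ifstream in(path, std::ios::binary | std::ios::ate);
  std::streamoff size = in.tellg();            // total file size in bytes
  std::vector<std::streamoff> offsets;
  for (int i = 0; i < nThreads; ++i) {
    in.clear();
    in.seekg(size * i / nThreads);             // rough cut point for thread i
    if (i > 0) {
      std::string partial;
      std::getline(in, partial);               // skip to the next line boundary
    }
    offsets.push_back(in.tellg());             // thread i starts reading here
  }
  return offsets;
}
```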

dhmay commented on September 25, 2024

I'm worried that there's some kind of complexity here that I don't understand. From my reading of InternDataHandler::loadFromFile, it looks like it literally just walks through the file, line by line, parsing each line and storing it. So changing it to allow reading from stdin would be trivial: the foreach_line() method in util.h would need to be made capable of handling input from either a file or stdin, argument parsing would need to be tweaked, and that's it.
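
Something like this is what I have in mind -- made-up names only; the real foreach_line() in util.h has its own signature:

```cpp
#include <fstream>
#include <functional>
#include <iostream>
#include <string>

// Sketch: iterate over lines of whatever istream we are given.
void forEachLine(std::istream& in, const std::function<void(const std::string&)>& fn) {
  std::string line;
  while (std::getline(in, line)) {
    fn(line);
  }
}

int main(int argc, char** argv) {
  // If a training file is given, read it; otherwise fall back to stdin,
  // so callers can pipe in decompressed data (e.g. from gzip -dc).
  std::ifstream file;
  bool useStdin = (argc < 2);
  if (!useStdin) {
    file.open(argv[1]);
    if (!file) { std::cerr << "cannot open " << argv[1] << "\n"; return 1; }
  }
  std::istream& in = useStdin ? std::cin : file;

  size_t nLines = 0;
  forEachLine(in, [&](const std::string&) { ++nLines; });
  std::cerr << "read " << nLines << " lines\n";
  return 0;
}
```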

But it sounds like you think it's more complicated than that, so I worry that if I took a crack at it I'd screw something up that I don't understand.

ledw commented on September 25, 2024

@dhmay Yes, if we do not consider multithreading, then basically that is it.

dhmay commented on September 25, 2024

@ledw There's a complication. I hadn't realized before that args_->trainFile is read through twice. The first pass is in Dictionary::readFromFile, to build the dictionary, and the second pass is in InternDataHandler::loadFromFile, to load the corpora.

I made some changes to StarSpace::init and readFromFile to read from stdin, and that worked fine for reading the dictionary. But then the input is gone; there's no way to access it again for InternDataHandler::loadFromFile.

If we want to be able to read from stdin, some more structural changes need to be made to do it all in one pass over the training data. loadFromFile would need to build both the dictionary and the corpora, so it'd need to parse the line twice, with parseForDict and parse. On the upside, if that's feasible, it ought to give a big speedup.
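
In pseudocode, the single-pass version would be roughly this (parseForDict / parse stand in for the real methods; their actual signatures and the dictionary size control are glossed over):

```cpp
#include <iostream>
#include <string>

// Placeholder stand-ins for the real StarSpace classes.
struct Dictionary  { void parseForDict(const std::string& line) { /* count / insert tokens */ } };
struct DataHandler { void parse(const std::string& line)        { /* build one example   */ } };

int main() {
  Dictionary dict;
  DataHandler data;
  std::string line;
  // One pass: each line feeds both the dictionary and the corpus, so the
  // input never needs to be rewound -- which is what stdin requires.
  while (std::getline(std::cin, line)) {
    dict.parseForDict(line);
    data.parse(line);
  }
  return 0;
}
```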

I'm not sure I've anticipated all the downstream effects of that change. What do you think... does it make sense to go this route, or would it make more sense for StarSpace to explicitly support a particular type of compression, by leaving the code structure untouched and wrapping all getline calls in some wrapper?

ledw commented on September 25, 2024

@dhmay Thanks for looking into this! You are correct: the dictionary also needs to read the file to construct the dict. Constructing the dictionary on the fly is doable, but it requires some fundamental changes to the dictionary code (for instance, how its size is controlled). I think it makes more sense to support a particular type of compression instead.
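
For reference, one common way to wrap the reads in C++ is Boost.Iostreams -- just an illustration of the wrapper idea, not necessarily how we will implement it (it needs linking against boost_iostreams and zlib):

```cpp
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/iostreams/filtering_stream.hpp>
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char** argv) {
  if (argc < 2) { std::cerr << "usage: readgz <file.gz>\n"; return 1; }

  std::ifstream file(argv[1], std::ios_base::in | std::ios_base::binary);
  boost::iostreams::filtering_istream in;
  in.push(boost::iostreams::gzip_decompressor());  // transparently inflate
  in.push(file);                                   // underlying gzip file

  // Downstream code keeps calling std::getline as before; only the stream
  // setup changes, and the file can simply be reopened for the second pass.
  std::string line;
  size_t n = 0;
  while (std::getline(in, line)) ++n;
  std::cout << n << " lines\n";
  return 0;
}
```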

halflings commented on September 25, 2024

One solution here would be to support binary formats, such as protobuf or parquet, instead of plaintext.

jaseweston commented on September 25, 2024

dhmay commented on September 25, 2024

@jaseweston I'm not sure I understand that suggestion. Once you've run through the input from stdin, it's gone. Caching it to run through again would be arbitrarily memory-intensive.

jaseweston commented on September 25, 2024

dhmay commented on September 25, 2024

Oh, I see. So you'd make a new command, something like starspace makedict, and give starspace train an optional input dictionary file argument.

The two drawbacks I can see are: 1. it's still two passes through the data (though that would only be necessary the first time you ran on a particular dataset), which takes time, and 2. it's less flexible -- if you were generating random data, or something, you'd have to make sure you could do it the same way twice.

But it would allow for arbitrary compression of input, so if that approach is much easier than restructuring to build the dict on the fly, then it sounds good to me!

jaseweston commented on September 25, 2024

rnditdev commented on September 25, 2024

Which data structure and format would be preferable for persisting the dictionary? I'd aim to serialize the vector here. Any other options?

ledw commented on September 25, 2024

@rnditdev Do you mean serializing the whole dictionary or just the vector you pointed out? For the dictionary I think a simple structure would do. You can serialize it to either a binary file or a txt file and then reload the dictionary from that (if I understand correctly).
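
A minimal txt round trip could look like this (a sketch only; the real Dictionary in StarSpace stores more state, e.g. counts):

```cpp
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for the real Dictionary class, reduced to the id <-> word mapping.
struct Dict {
  std::vector<std::string> words;                // id -> word
  std::unordered_map<std::string, int> word2id;  // word -> id
};

void saveDict(const Dict& d, const std::string& path) {
  std::ofstream out(path);
  for (const auto& w : d.words) out << w << "\n";  // line number == id
}

Dict loadDict(const std::string& path) {
  Dict d;
  std::ifstream in(path);
  std::string w;
  while (std::getline(in, w)) {
    d.word2id[w] = static_cast<int>(d.words.size());
    d.words.push_back(w);
  }
  return d;
}
```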

rnditdev commented on September 25, 2024

I'm looking to save the right data structure. As per your comment, it seems it would be preferable to save the whole Dict object. I was looking at external libraries like Cereal for it, but that could add unnecessary overhead. Thoughts?

rnditdev commented on September 25, 2024

I've reviewed the code, and it seems all the dictionary-saving methods are already in place. We should just add a new option (loaddictfromfile), or a modifier to initmodel saying we will use just the dictionary, not the weights.

ledw commented on September 25, 2024

@rnditdev your approach sounds good to me.
@dhmay Support for reading from gzip is available here: #206
Let me know if you want to try it out.

dhmay commented on September 25, 2024

Thanks, @ledw! I should have a good application for this enhancement soon.
