Comments (4)

intoraw commented on August 11, 2024

Putting the data on network storage such as NFS, Ceph, or Gluster can fix this issue.

from pytorch-biggraph.

lw commented on August 11, 2024

PBG's input and output are file-based: its config, entity counts, edge lists, checkpoints, ... must all be files on the filesystem, i.e., they must have a path that allows reading from and writing to them using standard operating system interfaces. That's really all PBG needs, so in particular it doesn't require those files to be backed by local disk: one can use network storage, as long as it can be accessed by the above means. In fact, for distributed training, the checkpoint directory must be shared across the machines because it's used to transfer data among the trainers. NFS storage works. I'm not familiar with HDFS, but I expect it can be used with vanilla PBG if it can be mounted as a filesystem.
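To illustrate what "accessible through standard operating system interfaces" means in practice, here is a minimal sketch of a probe one could run against a candidate directory (local, NFS, or any other mount) before pointing PBG at it. The function name `is_usable_storage_dir` and the probe file naming are hypothetical and not part of PBG; the check only uses ordinary Python file operations and says nothing about performance.

```python
import os
import uuid


def is_usable_storage_dir(path: str) -> bool:
    """Return True if `path` supports plain create/write/read/delete operations.

    Hypothetical helper, not part of PBG: it only exercises the ordinary
    file interface that the comment above says PBG relies on.
    """
    probe = os.path.join(path, f".probe-{uuid.uuid4().hex}")
    payload = b"storage-probe"
    try:
        # Write a small file and force it out to the backing storage.
        with open(probe, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        # Read it back through the same path.
        with open(probe, "rb") as f:
            ok = f.read() == payload
    except OSError:
        return False
    finally:
        # Clean up the probe file; ignore errors if it was never created.
        try:
            os.remove(probe)
        except OSError:
            pass
    return ok
```

For distributed training one would presumably run such a probe on every machine against the same shared path, since the checkpoint directory has to be visible to all trainers.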

lw commented on August 11, 2024

Two more things:

  • HDF5 (the library we use to read edgelists and store checkpoints) may be able to use some lower-level interfaces to perform faster I/O when the files are on local disk. I'm using the conditional because I have no idea how HDF5 works internally. However, it is still able to fall back to "regular" file I/O for remote files.
  • During distributed training, the checkpoints are only used to transfer data between one epoch and the next. Within one epoch, the data is passed around between trainers using parameter servers and partition servers. The exception is when partition servers are disabled (which they shouldn't be, unless you operate under strong memory pressure), in which case checkpoints are also used within the epoch. Edgelists, however, are always read from file.

What this boils down to is: in single-machine mode, using local disk may be faster; in distributed mode, it doesn't matter that much how the checkpoint directory is stored, because I/O to the filesystem is rarer anyway.
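As a rough sketch of how this could look in a config, the snippet below builds only the path-related entries, all placed under one shared mount so that every trainer sees the same files. The mount point `/mnt/shared/pbg`, the subdirectory names, and the helper `shared_storage_paths` are assumptions made up for illustration; `entity_path`, `edge_paths`, and `checkpoint_path` are the path keys as I recall them from the PBG config schema, so check them against the documentation for your version. All other required config entries are deliberately omitted.

```python
import os


def shared_storage_paths(root: str) -> dict:
    """Build the path-related part of a PBG-style config under one root.

    Hypothetical helper: `root` is assumed to be a directory mounted at the
    same location on every machine (for example an NFS mount).
    """
    return {
        # Entity counts and names written by the import step.
        "entity_path": os.path.join(root, "entities"),
        # One directory per edge set; edgelists are always read from file.
        "edge_paths": [os.path.join(root, "edges")],
        # Must be shared across machines in distributed mode.
        "checkpoint_path": os.path.join(root, "checkpoints"),
    }


# Assumed mount point, for illustration only.
paths = shared_storage_paths("/mnt/shared/pbg")
```

In single-machine mode the same helper could be pointed at a local directory instead, which per the comment above may be faster.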

intoraw commented on August 11, 2024

@lerks Thanks, that's really helpful.
