Comments (4)
We have not tried MultiWorkerMirroredStrategy. Getting everything to work well with MirroredStrategy was very involved but sufficient for all published results. We have successfully done multi-worker training using TF-Replicator, so in principle it should be achievable.
Are you trying with ADAM or K-FAC? For any porting work, I would suggest first getting ferminet working with ADAM and then investigating K-FAC. I added support for MirroredStrategy to K-FAC but don't know if MultiWorkerMirroredStrategy will work out of the box. I certainly hit many problems with tensor placement and naming when getting MirroredStrategy to work.
Re: JAX multi-node support, please see google/jax#2731. We're not intending to continue developing the TF-version of ferminet.
Got it, thanks a lot for the info and suggestion, @jsspencer! Let me try harder then.
I was indeed trying K-FAC, but it seems to me the error was thrown well before optimization starts.
From the GitHub thread you mentioned about JAX, it seems its multi-node support is still under development. Let me ask them if they have anything to share now.
Thanks again!
These kinds of errors are normally triggered at graph-construction time rather than (later) at training time.
> These kinds of errors are normally triggered at graph-construction time rather than (later) at training time.
You are absolutely right. I've managed to fix this issue and successfully run the TF code with MultiWorkerMirroredStrategy on multiple nodes (except that using K-FAC causes a core dump).
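For anyone attempting the same setup: the thread does not say what the fix was, so the following is only a minimal sketch of one piece that multi-worker runs generally need, namely the per-worker `TF_CONFIG` environment variable that TensorFlow's multi-worker strategies read to discover the cluster. The helper name `make_tf_config` and the host names are hypothetical and not part of ferminet.

```python
import json
import os


def make_tf_config(worker_hosts, task_index):
    """Build the TF_CONFIG JSON string for one worker (hypothetical helper)."""
    if not 0 <= task_index < len(worker_hosts):
        raise ValueError("task_index must index into worker_hosts")
    return json.dumps({
        # Every worker lists the same cluster, in the same order.
        "cluster": {"worker": list(worker_hosts)},
        # Only the index differs from process to process.
        "task": {"type": "worker", "index": task_index},
    })


# Assumed host:port pairs for illustration; replace with the real nodes.
WORKERS = ["node0.example.com:12345", "node1.example.com:12345"]

# Set this in each process before any distribution strategy is created.
os.environ["TF_CONFIG"] = make_tf_config(WORKERS, task_index=0)
```

Each node would run the same script with its own `task_index`; the rest of the training code is unchanged by this step.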