
Comments (6)

tfgg commented on May 29, 2024

Hi, we require jaxlib 0.1.69 to use CUDA unified memory for running long sequences. If you don't need that, you can probably run with 0.1.68, but that might be related to the illegal address error you are seeing. How long was the sequence you were trying to run?

Some of the other open issues about CUDA versions might also be of help.
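As a rough way to check whether an installed jaxlib meets the 0.1.69 requirement, you can compare its release numbers while ignoring the local CUDA suffix (e.g. `+cuda110`). This is a minimal sketch; the helper names are hypothetical and not part of AlphaFold or jaxlib, and it assumes purely numeric version components (a pre-release tag like `rc1` would need a real version parser).

```python
from importlib import metadata


def release_tuple(version: str) -> tuple[int, ...]:
    """Parse '0.1.68+cuda110' into (0, 1, 68), dropping the local '+...' suffix.

    Assumes every dot-separated component is an integer.
    """
    public = version.split("+", 1)[0]
    return tuple(int(part) for part in public.split("."))


def supports_unified_memory(jaxlib_version: str, minimum: str = "0.1.69") -> bool:
    """Hypothetical check: True if jaxlib_version >= minimum (release numbers only)."""
    return release_tuple(jaxlib_version) >= release_tuple(minimum)


if __name__ == "__main__":
    try:
        installed = metadata.version("jaxlib")
    except metadata.PackageNotFoundError:
        print("jaxlib is not installed")
    else:
        print(installed, supports_unified_memory(installed))
```

With this check, `0.1.68+cuda110` reports False and `0.1.69+cuda110` reports True.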


huhlim commented on May 29, 2024

I was benchmarking with the CASP14 targets. T1026 (172 residues) raised the issue.
I found that some of the targets still raise CUDA_ERROR_ILLEGAL_ADDRESS, even though I used jax==0.2.17 and jaxlib==0.1.68+cuda110. The same targets run fine on CPUs.

My system information:

  • NVIDIA driver: 450.36.06
  • CUDA version: 11.0
  • jax: 0.2.17
  • jaxlib: 0.1.68+cuda110
  • tensorflow: 2.5.0


tfgg commented on May 29, 2024

That's a very small protein, so I'm surprised it's an issue. What GPU are you using? Is it possible to try using the Dockerfile?

You could try disabling unified memory by commenting out these two lines in your script, if you have them:
https://github.com/deepmind/alphafold/blob/main/docker/run_docker.py#L171-L172
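For context, the two referenced lines are assumed here to be the container environment settings TF_FORCE_UNIFIED_MEMORY=1 and XLA_PYTHON_CLIENT_MEM_FRACTION=4.0 (check the linked file for the exact contents at your revision). Rather than commenting lines out, a launcher could expose unified memory as a toggle. The sketch below uses a hypothetical helper, build_container_env, not a function from AlphaFold.

```python
def build_container_env(gpu_devices: str = "all", use_unified_memory: bool = True) -> dict[str, str]:
    """Hypothetical helper: environment variables for the AlphaFold container.

    Assumes the two lines referenced in run_docker.py are the unified-memory
    settings below; verify against the linked file before relying on this.
    """
    env = {"NVIDIA_VISIBLE_DEVICES": gpu_devices}
    if use_unified_memory:
        # Ask TensorFlow/XLA to allocate CUDA unified memory so long
        # sequences can exceed physical GPU memory.
        env["TF_FORCE_UNIFIED_MEMORY"] = "1"
        # Fraction of GPU memory the XLA client preallocates; a value
        # above 1.0 only makes sense together with unified memory.
        env["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "4.0"
    return env
```

The resulting dict could be passed as the `environment=` argument of the Docker SDK's `client.containers.run(...)`, so disabling unified memory becomes `build_container_env(use_unified_memory=False)`.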


huhlim commented on May 29, 2024

I tested on a Quadro RTX 6000 and an RTX 2080 Ti, with the following setups:
(1) jaxlib==0.1.68+cuda110, jax==0.2.17, cudatoolkit=11.0.3 in my custom non-Docker installation
(2) jaxlib==0.1.69+cuda110, jax==0.2.17, cudatoolkit=11.0.3 in my custom non-Docker installation
(3) (1) or (2), with the two unified-memory lines commented out
(4) the same versions as (2), but in the original Docker container

There was no issue with (4), so there may be some difference between my non-Docker installation and the original Docker setup, even though I thought I had used exactly the same library versions. I will try again.
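One way to hunt for such a difference is to dump the installed Python package versions in both environments and diff them. This is a minimal sketch; the package list is an assumption (dm-haiku and numpy are included as likely AlphaFold dependencies), and system-level components such as the NVIDIA driver or CUDA toolkit are not pip packages, so they would need to be compared separately (e.g. with `nvidia-smi`).

```python
from importlib import metadata

# Assumed list of packages worth comparing; extend as needed.
PACKAGES = ("jax", "jaxlib", "tensorflow", "dm-haiku", "numpy")


def collect_versions(packages=PACKAGES) -> dict[str, str]:
    """Map each package name to its installed version, or 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def diff_versions(a: dict[str, str], b: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return {package: (version_in_a, version_in_b)} for packages that differ."""
    return {
        name: (a.get(name, "missing"), b.get(name, "missing"))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


if __name__ == "__main__":
    # Run once inside Docker and once in the custom installation, then diff.
    for name, version in collect_versions().items():
        print(f"{name}=={version}")
```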


chrisroat commented on May 29, 2024

@huhlim Did you solve your CUDA_ERROR_ILLEGAL_ADDRESS problems? I just ran ~100 proteins from an internal sample, and this error cropped up in some cases. As I investigate, it would be helpful if you could follow up here with anything you learned and/or how you resolved the problem. (I am using Docker on an A100.)


huhlim commented on May 29, 2024

@chrisroat I could not fully resolve the issue. Turning off the jax.jit compilation of the models (in the initialization of the RunModel class in alphafold/model/model.py) reduced how often the error occurred but did not eliminate it. I have not seen the issue on my Docker system, so I suspect my problem is related to our cluster setup. Unfortunately, I gave up trying to tackle it.
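The workaround above amounts to not wrapping the model functions in jax.jit. A minimal sketch of such a toggle, assuming RunModel wraps its init/apply functions with jax.jit as in model.py of that period; maybe_jit and forward are hypothetical stand-ins, not AlphaFold code. Without jit, JAX executes operations eagerly one at a time instead of compiling a single XLA program, which is typically much slower.

```python
import jax
import jax.numpy as jnp


def maybe_jit(fn, use_jit: bool = True):
    """Return a jit-compiled fn, or fn unchanged for eager (op-by-op) execution."""
    return jax.jit(fn) if use_jit else fn


def forward(x):
    # Hypothetical stand-in for a model forward pass.
    return jnp.tanh(x @ x.T).sum()


if __name__ == "__main__":
    x = jnp.ones((3, 3))
    print(maybe_jit(forward)(x), maybe_jit(forward, use_jit=False)(x))
```

JAX also provides the `jax.disable_jit()` context manager, which turns off compilation for every jitted function called inside the `with` block without editing the wrapping code.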
