
Comments (15)

Vincentwei1021 avatar Vincentwei1021 commented on September 6, 2024 1

@XWwwwww You are right. I ran the interactive scripts on a single-GPU cloud machine and the eva_finetune script on a multi-GPU cloud machine in distributed mode, both with the provided Docker image. The interactive script builds utils.so without problems, but the finetune script fails with ninja: fatal: waitpid(113): No child processes.

Anyway, I just tried copying the utils.so generated by the interactive script to my multi-GPU machine, and it works well now. Thanks!
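
The workaround above (reusing a utils.so that was built successfully on another machine) could be scripted roughly as follows. This is only a sketch: the helper name install_prebuilt_extension, the cache root, and the py38_cu102 tag are assumptions taken from the paths quoted in this thread, and the copy is only likely to work when both machines run the same Python, torch, and CUDA versions.

```python
import shutil
from pathlib import Path


def install_prebuilt_extension(
    prebuilt_so: Path, cache_root: Path, tag: str, name: str = "utils"
) -> Path:
    """Copy a prebuilt shared object to <cache_root>/<tag>/<name>/<name>.so.

    Hypothetical helper: the layout mirrors the path seen in this thread
    (.../torch_extensions/py38_cu102/utils/utils.so), not a documented API.
    """
    target_dir = cache_root / tag / name
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{name}.so"
    shutil.copy2(prebuilt_so, target)  # keeps timestamps and permissions
    return target


if __name__ == "__main__":
    # Assumed locations; adjust to wherever the working build placed utils.so.
    source = Path("utils.so")
    cache = Path.home() / ".cache" / "torch_extensions"
    if source.is_file():
        print(install_prebuilt_extension(source, cache, "py38_cu102"))
```

A library built against a different torch version can still fail at import time with an undefined-symbol error, so matching environments matters more than the copy itself.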

from eva.

Jiaxin-Wen avatar Jiaxin-Wen commented on September 6, 2024

What version of PyTorch are you using in your environment?


Vincentwei1021 avatar Vincentwei1021 commented on September 6, 2024

> What version of PyTorch are you using in your environment?

@XWwwwww

print(torch.__version__)
1.10.1+cu102


t1101675 avatar t1101675 commented on September 6, 2024

Thanks for reporting this. We will check the environment.


Hermes777 avatar Hermes777 commented on September 6, 2024

I encountered the same error. Here's my environment:
torch 1.10.1 + cu111 + gcc 9.3


Hermes777 avatar Hermes777 commented on September 6, 2024

Hi. I upgraded torch to 1.10.2 and CUDA to cu113. That solved the problem.


Vincentwei1021 avatar Vincentwei1021 commented on September 6, 2024

> Hi. I upgraded torch to 1.10.2 and CUDA to cu113. That solved the problem.

@Hermes777 Hi, were you able to find the file /torch_extensions/py38_cu102/utils/utils.so after you upgraded torch and CUDA?


Hermes777 avatar Hermes777 commented on September 6, 2024

From the directory name "py38_cu102", you are evidently using CUDA 10.2, which might not be compatible.
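
To make that reading of the directory name concrete, here is a small sketch that splits a tag such as py38_cu102 into a Python version and a CUDA version. The function parse_extension_tag is hypothetical, and the pattern is an assumption based only on the names that appear in this thread (one digit for the Python major version, and the last CUDA digit taken as the minor version).

```python
import re
from typing import NamedTuple, Optional


class ExtensionTag(NamedTuple):
    python: str          # e.g. "3.8"
    cuda: Optional[str]  # e.g. "10.2", or None for a CPU-only tag


def parse_extension_tag(tag: str) -> ExtensionTag:
    """Split a cache directory tag like 'py38_cu102' into its versions."""
    match = re.fullmatch(r"py(\d)(\d+)_(cpu|cu(\d+))", tag)
    if match is None:
        raise ValueError(f"unrecognised extension tag: {tag!r}")
    major, minor, _, cuda_digits = match.groups()
    python = f"{major}.{minor}"
    if cuda_digits is None:
        return ExtensionTag(python, None)
    # Assumption: the final digit is the CUDA minor version ("102" -> "10.2").
    return ExtensionTag(python, f"{cuda_digits[:-1]}.{cuda_digits[-1]}")


if __name__ == "__main__":
    print(parse_extension_tag("py38_cu102"))
    # → ExtensionTag(python='3.8', cuda='10.2')
```

If the tag of the directory that torch actually loads from does not match the CUDA build you believe is installed, the environment was probably not upgraded where the script runs.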


Hermes777 avatar Hermes777 commented on September 6, 2024

I checked the directory utils/, and it contains utils.so.

Remember to reinstall apex after you upgrade torch and CUDA.


Vincentwei1021 avatar Vincentwei1021 commented on September 6, 2024

> I checked the directory utils/, and it contains utils.so.
>
> Remember to reinstall apex after you upgrade torch and CUDA.

Thanks so much. I will try it later.


Vincentwei1021 avatar Vincentwei1021 commented on September 6, 2024

@t1101675 @XWwwwww Sorry, I'm still having this issue. I don't think the CUDA version is the problem, as I'm using CUDA 10.2, which your README suggests. I'm running with the provided Docker image and I was able to run the interactive scripts, but eva_finetune still gives me the aforementioned error. I checked the log and spotted the following while the extension module utils was being built:

Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: fatal: waitpid(113): No child processes

So the build of utils might be failing? Please help and give me some advice, thanks!
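
The log line above mentions that the number of ninja workers can be overridden through the MAX_JOBS environment variable. A minimal sketch of setting it from Python follows. The helper name limit_ninja_workers is hypothetical, and nothing in this thread confirms that limiting workers avoids the waitpid error; it is only one thing to try.

```python
import os


def limit_ninja_workers(max_jobs: int = 1) -> str:
    """Set MAX_JOBS so a later extension build uses at most max_jobs workers.

    Must run before the code that triggers the build, because the variable
    is read when the build starts.
    """
    if max_jobs < 1:
        raise ValueError("max_jobs must be at least 1")
    os.environ["MAX_JOBS"] = str(max_jobs)
    return os.environ["MAX_JOBS"]


if __name__ == "__main__":
    limit_ninja_workers(1)
    print("MAX_JOBS =", os.environ["MAX_JOBS"])
```

Exporting MAX_JOBS in the shell that launches the finetune script has the same effect and also covers every process spawned in distributed mode.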


Jiaxin-Wen avatar Jiaxin-Wen commented on September 6, 2024

@Vincentwei1021 We have uploaded the missing file to src/ds_fix/utils.so. Let me know if it works.


Vincentwei1021 avatar Vincentwei1021 commented on September 6, 2024

@XWwwwww Hi, thanks for your reply. I tried using the file, and it gives the following error:
ImportError: /.cache/torch_extensions/py38_cu102/utils/utils.so: undefined symbol: _ZNK2at6Tensor6narrowElll


Jiaxin-Wen avatar Jiaxin-Wen commented on September 6, 2024

> @t1101675 @XWwwwww Sorry, I'm still having this issue. I don't think the CUDA version is the problem, as I'm using CUDA 10.2, which your README suggests. I'm running with the provided Docker image and I was able to run the interactive scripts, but eva_finetune still gives me the aforementioned error. I checked the log and spotted the following while the extension module utils was being built:
>
> Building extension module utils...
> Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
> ninja: fatal: waitpid(113): No child processes
>
> So the build of utils might be failing? Please help and give me some advice, thanks!

I just noticed that you have already been able to run the interactive scripts, which means the utils.so file has already been compiled automatically and saved in your cache path, right?
And the current error message is ninja: fatal: waitpid(113): No child processes rather than cache/torch_extensions/py38_cu102/utils/utils.so: cannot open shared object file: No such file or directory?
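
One way to tell those two situations apart is to check whether the compiled library is actually present in the build directory before launching the distributed job. The sketch below assumes the layout quoted in this thread (a utils/ directory containing utils.so); the function names and the default path are hypothetical.

```python
from pathlib import Path
from typing import Optional


def find_built_extension(build_dir: Path, name: str = "utils") -> Optional[Path]:
    """Return the path of <build_dir>/<name>.so if it exists, else None."""
    candidate = build_dir / f"{name}.so"
    return candidate if candidate.is_file() else None


def describe_build_state(build_dir: Path, name: str = "utils") -> str:
    """Summarise whether a compiled library is available in build_dir."""
    found = find_built_extension(build_dir, name)
    if found is None:
        return f"{name}.so is missing from {build_dir}; the build did not finish"
    return f"{name}.so is present at {found}"


if __name__ == "__main__":
    # Assumed cache location; the real one depends on the environment.
    build_dir = Path.home() / ".cache" / "torch_extensions" / "py38_cu102" / "utils"
    print(describe_build_state(build_dir))
```

If the file is missing on the multi-GPU machine but present on the single-GPU one, the build itself is what fails there, which matches the ninja error rather than a loading problem.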


BaiMeiyingxue avatar BaiMeiyingxue commented on September 6, 2024

spotted

> @XWwwwww You are right. I ran the interactive scripts on a single-GPU cloud machine and the eva_finetune script on a multi-GPU cloud machine in distributed mode, both with the provided Docker image. The interactive script builds utils.so without problems, but the finetune script fails with ninja: fatal: waitpid(113): No child processes.
>
> Anyway, I just tried copying the utils.so generated by the interactive script to my multi-GPU machine, and it works well now. Thanks!

I ran into the same situation:

Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: fatal: waitpid(97011): No child processes

and then this error occurred:
....
  File "/home/zhangwenjuan/anaconda3/envs/evacuda11/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1079, in load
    return _jit_compile(
  File "/home/zhangwenjuan/anaconda3/envs/evacuda11/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1317, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/home/zhangwenjuan/anaconda3/envs/evacuda11/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1699, in _import_module_from_library
    file, path, description = imp.find_module(module_name, [path])
  File "/home/zhangwenjuan/anaconda3/envs/evacuda11/lib/python3.8/imp.py", line 296, in find_module
    raise ImportError(_ERR_MSG.format(name), name=name)
ImportError: No module named 'utils'

Could you give me some suggestions? Thank you!
