
Comments (8)

jonathan-laurent commented on May 22, 2024

I can post a pretrained model with the next release; this is a good suggestion!

The error you report looks like a CUDA.jl bug or a problem with unsupported drivers.

Are you using the master version of the repo or the v0.3 release?
Can you try again after a "Pkg.update()"?

from alphazero.jl.

morozig commented on May 22, 2024

Thanks for the quick answer! Initially I tried the current master branch, but after your suggestion I switched to release 0.3.0, where I got a new CUDA-related error. Updating packages and reinstalling an older version of the Nvidia driver didn't help.

The cuBLAS documentation states that CUBLAS_STATUS_EXECUTION_FAILED means:

The GPU program failed to execute. This is often caused by a launch failure of the kernel on the GPU, which can be caused by multiple reasons. To correct: check that the hardware, an appropriate version of the driver, and the cuBLAS library are correctly installed.

So it's hard to pinpoint the actual cause, but it's probably related to my setup, and I'll try running on a different PC. Strangely, self-play games and the explorer work fine, so at least CUDA is installed, but it fails during the forward pass.


Martyn-R commented on May 22, 2024

Following the connect-four tutorial, I get the same error message: ERROR: LoadError: CUBLASError: the GPU program failed to execute (code 13, CUBLAS_STATUS_EXECUTION_FAILED)

My setup is Ubuntu 18.04 with Nvidia driver 440.95.01 and AlphaZero 0.3.0 (8ed9eb0b-7496-408d-8c8b-2119aeea02cd).

Any recommendations? Thanks in advance!


jonathan-laurent commented on May 22, 2024

Would you mind also trying master?


Martyn-R commented on May 22, 2024

On master I get the same error message as on v0.3.0.


jonathan-laurent commented on May 22, 2024

Can I get the whole stacktrace?


Martyn-R commented on May 22, 2024

Stacktrace:
 [1] throw_api_error(::CUDA.CUBLAS.cublasStatus_t) at /home/martijn/.julia/packages/CUDA/dZvbp/lib/cublas/error.jl:47
 [2] macro expansion at /home/martijn/.julia/packages/CUDA/dZvbp/lib/cublas/error.jl:58 [inlined]
 [3] cublasSgemm_v2(::Ptr{Nothing}, ::Char, ::Char, ::Int64, ::Int64, ::Int64, ::Array{Float32,1}, ::CUDA.CuArray{Float32,2}, ::Int64, ::CUDA.CuArray{Float32,2}, ::Int64, ::Array{Float32,1}, ::CUDA.CuArray{Float32,2}, ::Int64) at /home/martijn/.julia/packages/CUDA/dZvbp/lib/utils/call.jl:93
 [4] gemm!(::Char, ::Char, ::Float32, ::CUDA.CuArray{Float32,2}, ::CUDA.CuArray{Float32,2}, ::Float32, ::CUDA.CuArray{Float32,2}) at /home/martijn/.julia/packages/CUDA/dZvbp/lib/cublas/wrappers.jl:721
 [5] gemm_wrapper!(::CUDA.CuArray{Float32,2}, ::Char, ::Char, ::CUDA.CuArray{Float32,2}, ::CUDA.CuArray{Float32,2}, ::Float32, ::Float32) at /home/martijn/.julia/packages/CUDA/dZvbp/lib/cublas/linalg.jl:188
 [6] mul! at /home/martijn/.julia/packages/CUDA/dZvbp/lib/cublas/linalg.jl:222 [inlined]
 [7] mul! at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/matmul.jl:208 [inlined]
 [8] * at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/matmul.jl:160 [inlined]
 [9] (::Flux.Dense{typeof(NNlib.relu),CUDA.CuArray{Float32,2},CUDA.CuArray{Float32,1}})(::CUDA.CuArray{Float32,2}) at /home/martijn/.julia/packages/Flux/05b38/src/layers/basic.jl:123
 [10] Dense at /home/martijn/.julia/packages/Flux/05b38/src/layers/basic.jl:134 [inlined]
 [11] applychain at /home/martijn/.julia/packages/Flux/05b38/src/layers/basic.jl:36 [inlined] (repeats 4 times)
 [12] (::Flux.Chain{Tuple{Flux.Conv{2,4,typeof(identity),CUDA.CuArray{Float32,4},CUDA.CuArray{Float32,1}},Flux.BatchNorm{typeof(NNlib.relu),CUDA.CuArray{Float32,1},CUDA.CuArray{Float32,1},Float32},typeof(Flux.flatten),Flux.Dense{typeof(NNlib.relu),CUDA.CuArray{Float32,2},CUDA.CuArray{Float32,1}},Flux.Dense{typeof(tanh),CUDA.CuArray{Float32,2},CUDA.CuArray{Float32,1}}}})(::CUDA.CuArray{Float32,4}) at /home/martijn/.julia/packages/Flux/05b38/src/layers/basic.jl:38
 [13] forward(::AlphaZero.FluxLib.ResNet{Game}, ::CUDA.CuArray{Float32,4}) at /home/martijn/AlphaZero.jl/src/networks/flux.jl:162
 [14] evaluate(::AlphaZero.FluxLib.ResNet{Game}, ::CUDA.CuArray{Float32,4}, ::CUDA.CuArray{Float32,2}) at /home/martijn/AlphaZero.jl/src/networks/network.jl:253
 [15] losses(::AlphaZero.FluxLib.ResNet{Game}, ::LearningParams, ::Float32, ::Float32, ::Tuple{CUDA.CuArray{Float32,2},CUDA.CuArray{Float32,4},CUDA.CuArray{Float32,2},CUDA.CuArray{Float32,2},CUDA.CuArray{Float32,2}}) at /home/martijn/AlphaZero.jl/src/learning.jl:62
 [16] learning_status(::AlphaZero.Trainer, ::Tuple{CUDA.CuArray{Float32,2},CUDA.CuArray{Float32,4},CUDA.CuArray{Float32,2},CUDA.CuArray{Float32,2},CUDA.CuArray{Float32,2}}) at /home/martijn/AlphaZero.jl/src/learning.jl:142
 [17] (::AlphaZero.var"#71#74"{AlphaZero.Trainer})(::Tuple{CUDA.CuArray{Float32,2},CUDA.CuArray{Float32,4},CUDA.CuArray{Float32,2},CUDA.CuArray{Float32,2},CUDA.CuArray{Float32,2}}) at ./none:0
 [18] iterate at ./generator.jl:47 [inlined]
 [19] collect(::Base.Generator{Base.Generator{Array{Tuple{Array{Float32,2},Array{Float32,4},Array{Float32,2},Array{Float32,2},Array{Float32,2}},1},AlphaZero.Util.var"#9#11"{AlphaZero.var"#70#73"{AlphaZero.Trainer}}},AlphaZero.var"#71#74"{AlphaZero.Trainer}}) at ./array.jl:686
 [20] learning_status(::AlphaZero.Trainer) at /home/martijn/AlphaZero.jl/src/learning.jl:157
 [21] learning_step!(::Env{Game,AlphaZero.FluxLib.ResNet{Game},NamedTuple{(:board, :curplayer),Tuple{StaticArrays.SArray{Tuple{7,6},UInt8,2,42},UInt8}}}, ::Session{Env{Game,AlphaZero.FluxLib.ResNet{Game},NamedTuple{(:board, :curplayer),Tuple{StaticArrays.SArray{Tuple{7,6},UInt8,2,42},UInt8}}}}) at /home/martijn/AlphaZero.jl/src/training.jl:170
 [22] macro expansion at ./timing.jl:310 [inlined]
 [23] macro expansion at /home/martijn/AlphaZero.jl/src/report.jl:229 [inlined]
 [24] train!(::Env{Game,AlphaZero.FluxLib.ResNet{Game},NamedTuple{(:board, :curplayer),Tuple{StaticArrays.SArray{Tuple{7,6},UInt8,2,42},UInt8}}}, ::Session{Env{Game,AlphaZero.FluxLib.ResNet{Game},NamedTuple{(:board, :curplayer),Tuple{StaticArrays.SArray{Tuple{7,6},UInt8,2,42},UInt8}}}}) at /home/martijn/AlphaZero.jl/src/training.jl:295
 [25] resume!(::Session{Env{Game,AlphaZero.FluxLib.ResNet{Game},NamedTuple{(:board, :curplayer),Tuple{StaticArrays.SArray{Tuple{7,6},UInt8,2,42},UInt8}}}}) at /home/martijn/AlphaZero.jl/src/ui/session.jl:383
 [26] top-level scope at /home/martijn/AlphaZero.jl/scripts/alphazero.jl:89
 [27] include(::Function, ::Module, ::String) at ./Base.jl:380
 [28] include(::Module, ::String) at ./Base.jl:368
 [29] exec_options(::Base.JLOptions) at ./client.jl:296
 [30] _start() at ./client.jl:506
in expression starting at /home/martijn/AlphaZero.jl/scripts/alphazero.jl:80


jonathan-laurent commented on May 22, 2024

Thanks! I recommend that you open an issue on CUDA.jl, with your exact config, Pkg.status() and stack trace.
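
For anyone hitting the same error, here is a minimal sketch of how that information could be gathered, together with a standalone check that does not involve AlphaZero.jl at all. The trace above fails inside a Float32 matrix product (cublasSgemm_v2 called from a Flux Dense layer), so the script tries the same kind of product directly on the GPU. The matrix sizes and variable names are arbitrary choices for illustration, and this is only an assumption about what a useful reproduction looks like, not an official diagnostic procedure.

```julia
using Pkg
using InteractiveUtils  # provides versioninfo() outside the REPL
using CUDA

versioninfo()   # Julia version and platform details
Pkg.status()    # package versions in the active environment

# CUDA.functional() reports whether CUDA.jl found a usable GPU setup.
gpu_ok = CUDA.functional()

if gpu_ok
    CUDA.versioninfo()  # driver, toolkit, and device details

    # Float32 matrix product on the GPU, the same kind of operation
    # that reaches cublasSgemm_v2 in the stack trace above.
    A = CUDA.rand(Float32, 64, 32)
    B = CUDA.rand(Float32, 32, 16)
    C = A * B

    # Copying back to the host forces the GPU work to complete.
    C_host = Array(C)
    println("GPU matmul succeeded, result size: ", size(C_host))
else
    println("CUDA.jl reports no functional GPU setup")
end
```

If the product fails here with the same CUBLASError, the problem can be reproduced without AlphaZero.jl or Flux, which should make the CUDA.jl report easier to act on.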

