gorgonia / cu

package cu provides an idiomatic interface to the CUDA Driver API.

License: Apache License 2.0

Go 78.14% C 21.50% Python 0.36%

cu

Package cu interfaces with the CUDA Driver API. It was directly inspired by Arne Vansteenkiste's cu package.

Why Write This Package?

The main reason this package was written (as opposed to just using the already-excellent cu package) is error handling. Specifically, the main difference between this package and Arne's package is that this package returns errors instead of panicking.

Another goal for this package is to have an idiomatic interface for CUDA. For example, instead of exposing cuCtxCreate as CtxCreate, the more idiomatic name MakeContext is used. The primary goal is to make calling the CUDA API as comfortable as calling Go functions or methods. Additional convenience functions and methods are also provided in pursuit of that goal.

Lastly, this package uses the latest CUDA toolkit whereas the original package cu uses a number of deprecated APIs.
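
The error-returning design can be illustrated without touching CUDA at all. The following is a minimal sketch, assuming a C-style status code that is converted to a Go error at the boundary. The names resultCode, asError and fakeAlloc are hypothetical and are not this package's definitions; only the numeric values of the three CUDA status codes in the comments come from cuda.h.

```go
package main

import (
	"errors"
	"fmt"
)

// resultCode mimics a C-style status code such as CUresult.
// The type is illustrative and not part of package cu.
type resultCode int

const (
	success      resultCode = 0   // CUDA_SUCCESS
	invalidValue resultCode = 1   // CUDA_ERROR_INVALID_VALUE
	outOfMemory  resultCode = 2   // CUDA_ERROR_OUT_OF_MEMORY
	unknown      resultCode = 999 // CUDA_ERROR_UNKNOWN
)

// Error makes resultCode satisfy the error interface.
func (r resultCode) Error() string {
	switch r {
	case invalidValue:
		return "InvalidValue"
	case outOfMemory:
		return "OutOfMemory"
	case unknown:
		return "Unknown"
	}
	return fmt.Sprintf("resultCode(%d)", int(r))
}

// asError converts a status code to a Go error; success becomes nil.
func asError(r resultCode) error {
	if r == success {
		return nil
	}
	return r
}

// fakeAlloc stands in for a driver call. It never talks to a GPU.
func fakeAlloc(bytes int64) (uintptr, error) {
	if bytes <= 0 {
		return 0, asError(invalidValue)
	}
	if bytes > 1<<30 {
		return 0, asError(outOfMemory)
	}
	return 0x1000, nil
}

func main() {
	// The caller decides what to do with the failure; nothing panics.
	if _, err := fakeAlloc(1 << 40); errors.Is(err, outOfMemory) {
		fmt.Println("recoverable:", err)
	}
}
```

Because the failure arrives as an ordinary value, the caller can retry with a smaller allocation, fall back to the CPU, or propagate the error upward.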

Installation

This package is go-gettable: go get -u gorgonia.org/cu

This package mostly depends on standard library packages. There are two external dependencies:

errors, which is licenced under an MIT-like licence. This package is used for wrapping errors and providing a debug trail.
assert, which is licenced under an MIT-like licence. This package is used for quick and easy testing.

However, package cu DOES depend on one major external dependency: CUDA. Specifically, it requires the CUDA driver. Thankfully, NVIDIA has made this rather simple: everything that is required can be installed with one click via the CUDA Toolkit.

To verify that this library works, install and run the cudatest program, which accompanies this package:

go install gorgonia.org/cu/cmd/cudatest@latest
cudatest

You should see something like this if successful:

CUDA version: 10020
CUDA devices: 1

Device 0
========
Name      :	"TITAN RTX"
Clock Rate:	1770000 kHz
Memory    :	25393561600 bytes
Compute   : 	7.5
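
The version in this output is an integer rather than a dotted string. As a minimal sketch, assuming the CUDA convention of encoding a version as 1000*major + 10*minor, the values above can be made more readable as follows. The helper names decodeVersion and toGiB are made up for illustration and are not part of this package.

```go
package main

import "fmt"

// decodeVersion splits an integer CUDA version into major and minor
// parts, assuming the 1000*major + 10*minor encoding.
func decodeVersion(v int) (major, minor int) {
	return v / 1000, (v % 1000) / 10
}

// toGiB converts a byte count to gibibytes.
func toGiB(bytes int64) float64 {
	return float64(bytes) / (1 << 30)
}

func main() {
	major, minor := decodeVersion(10020)
	fmt.Printf("CUDA %d.%d\n", major, minor) // prints "CUDA 10.2"

	// Memory and clock rate from the sample output above.
	fmt.Printf("%.2f GiB\n", toGiB(25393561600))
	fmt.Printf("%.2f GHz\n", 1770000/1e6)
}
```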

Windows

To set up CUDA on Windows:

Install CUDA Toolkit
Add %CUDA_PATH%/bin to your %PATH% environment variable (running nvcc from console should work)
Make a symlink mklink /D C:\cuda "c:\Program Files\NVIDIA GPU Computing Toolkit\CUDA" (alternatively, install CUDA toolkit to C:\cuda\)
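
One likely reason for the symlink is that the default install path contains spaces. The sketch below assumes a naive tokenizer that splits a flag string on whitespace; the actual flag handling in cgo and gcc may differ, and splitFlags is a hypothetical helper. It shows how a single include path can turn into several unrelated arguments, while the C:\cuda form stays intact.

```go
package main

import (
	"fmt"
	"strings"
)

// splitFlags mimics a naive tokenizer that splits a flag string on
// whitespace. This is an assumption made for illustration; real build
// tools may support quoting.
func splitFlags(flags string) []string {
	return strings.Fields(flags)
}

func main() {
	withSpaces := "-IC:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v8.0/include"
	viaSymlink := "-IC:/cuda/v8.0/include"

	for _, args := range [][]string{splitFlags(withSpaces), splitFlags(viaSymlink)} {
		fmt.Printf("%d argument(s): %q\n", len(args), args)
	}
}
```

The first flag string yields five arguments, of which only the first still looks like an include flag; the second yields exactly one.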

To set up the compiler chain (MSYS2):

Install MSYS2 (see https://www.msys2.org/)
In c:\msys64\msys2_shell.cmd uncomment the line with set MSYS2_PATH_TYPE=inherit (this makes the Windows PATH variable visible)
Install Go in MSYS2 (64 bit) with pacman -S go

Alternatively, if you already have Go set up and only need to install cgo dependencies:

Install TDM-GCC (see https://jmeubank.github.io/tdm-gcc/download/)
Ensure gcc is in %PATH% environment variable (running gcc from console should work)

FAQ

Here is a list of common problems that you may encounter.

ld: cannot find -lcuda (Linux)

Checklist:

Installed CUDA and applied the relevant post-installation steps?
Checked that the sample programs in the CUDA install all work?
Checked the output of ld -lcuda --verbose?
Checked that there is a libcuda.so in the given search paths?
Checked that the permissions on libcuda.so are correct?

Note that, depending on how you install CUDA on Linux, the .so file is sometimes not properly linked. For example, in CUDA 10.2 on Ubuntu, the default .deb installation installs the shared object file to /usr/lib/x86_64-linux-gnu/libcuda.so.1. However, ld searches only for libcuda.so, so the solution is to symlink libcuda.so.1 to libcuda.so, like so:

sudo ln -s /PATH/TO/libcuda.so.1 /PATH/TO/libcuda.so

Be careful when using ln. This author spent several hours being tripped up by permissions issues.

Progress

Full coverage of the CUDA Driver API is a work in progress. Most of the APIs required for GPGPU purposes are complete, but none of the texture, surface and graphics related APIs are handled yet. Please feel free to send a pull request.

Roadmap

Remaining API to be ported over
All texture, surface and graphics related API have an equivalent Go prototype.
Batching of common operations (see for example Device.Attributes(...))
Generic queueing/batching of API calls (by some definition of generic)

Contributing

This author loves pull requests from everyone. Here's how to contribute to this package:

Fork then clone this repo: git clone git@github.com:YOUR_USERNAME/cu.git
Work on your edits.
Commit with a good commit message.
Push to your fork then submit a pull request.

We understand that this package interfaces with a third-party API. As such, tests may not always be viable. However, please do try to include as many tests as possible.

Licence

The package is licenced with an MIT-like licence. There is one file (cgoflags.go) where code is directly copied and two files (execution.go and memory.go) where code was partially copied from Arne Vansteenkiste's package, which is unlicenced (but to be safe, just assume a GPL-like licence, as mumax/3 is licenced under GPL).

cu's Issues

Tests for `Ctx.Run`; `BatchedContext.Run`

Now that a convenience method Run is provided, proper tests should follow

API for `cuModuleLoadDataEx`

The C declaration is CUresult cuModuleLoadDataEx(CUmodule* module, const void* image, unsigned int numOptions, CUjit_option* options, void** optionValues);

Cannot find CUjit_target_enum members

After calling

go install gorgonia.org/cu/cmd/cudatest@latest

I get this error:

/home/miner/go/pkg/mod/gorgonia.org/[email protected]/jit.go:138:32: could not determine kind of name for C.CU_TARGET_COMPUTE_20
/home/miner/go/pkg/mod/gorgonia.org/[email protected]/jit.go:139:32: could not determine kind of name for C.CU_TARGET_COMPUTE_21

Eliminate cgo usage.

Some basic example with Kernel

Hi!
Please provide a basic example of how to use this library properly:
how to create a basic kernel and use it in a computation on the GPU.

Thanks!

Lack of examples

Hello, I think it'd be really helpful for CUDA newbies if there were full working examples, for example a simple elementwise addition of two large arrays.

How to use with CUDA 9?

Hi! I use Visual Studio 2017, and CUDA 8.0 can't be installed with this version.
How can I use this package with CUDA 9.0 RC?

API for `cuMemAllocHost`

The C declaration is CUresult cuMemAllocHost(void** pp, size_t bytesize);

Check that new security updates don't break cu

golang/go@1dcb583

Hint: use go list -json

CUDA9 Support

Expected to land in version 0.10.

"go get -u gorgonia.org/cu" doesn't work on Windows 10 with CUDA v10.2

When I try to use "go get -u gorgonia.org/cu" to download and build the package, I get:

# gorgonia.org/cu
C:\Users\TEMPDR~1.001\AppData\Local\Temp\go-build193606874\b001\_x003.o: In function `_cgo_a01f1e6a60e1_Cfunc_cuDeviceGetP2PAttribute':
/tmp/go-build/cgo-gcc-prolog:530: undefined reference to `cuDeviceGetP2PAttribute'
collect2.exe: error: ld returned 1 exit status

I added the path to CUDA v10.2 in cgoflags.go using the example from issue #50:

package cu

// This file provides CGO flags to find CUDA libraries and headers.

//#cgo LDFLAGS:-lcuda
//
////default location:
//#cgo linux LDFLAGS:-L/usr/local/cuda/lib64 -L/usr/local/cuda/lib
//#cgo linux CFLAGS: -I/usr/local/cuda/include/
//
////default location if not properly symlinked:
//#cgo linux LDFLAGS:-L/usr/local/cuda-10.2/lib64 -L/usr/local/cuda-10.2/lib
//#cgo linux LDFLAGS:-L/usr/local/cuda-10.1/lib64 -L/usr/local/cuda-10.1/lib
//#cgo linux LDFLAGS:-L/usr/local/cuda-6.0/lib64 -L/usr/local/cuda-6.0/lib
//#cgo linux LDFLAGS:-L/usr/local/cuda-5.5/lib64 -L/usr/local/cuda-5.5/lib
//#cgo linux LDFLAGS:-L/usr/local/cuda-5.0/lib64 -L/usr/local/cuda-5.0/lib
//#cgo linux CFLAGS: -I/usr/local/cuda-10.2/include/
//#cgo linux CFLAGS: -I/usr/local/cuda-10.1/include/
//#cgo linux CFLAGS: -I/usr/local/cuda-6.0/include/
//#cgo linux CFLAGS: -I/usr/local/cuda-5.5/include/
//#cgo linux CFLAGS: -I/usr/local/cuda-5.0/include/
//
////Ubuntu 15.04:
//#cgo linux LDFLAGS:-L/usr/lib/x86_64-linux-gnu/
//#cgo linux CFLAGS: -I/usr/include
//
////arch linux:
//#cgo linux LDFLAGS:-L/opt/cuda/lib64 -L/opt/cuda/lib
//#cgo linux CFLAGS: -I/opt/cuda/include
//
////Darwin:
//#cgo darwin LDFLAGS:-L/usr/local/cuda/lib
//#cgo darwin CFLAGS: -I/usr/local/cuda/include/
//
////WINDOWS:
//#cgo windows LDFLAGS:-LC:/cuda/v5.0/lib/x64 -LC:/cuda/v5.5/lib/x64 -LC:/cuda/v6.0/lib/x64 -LC:/cuda/v6.5/lib/x64 -LC:/cuda/v7.0/lib/x64 -LC:/cuda/v8.0/lib/x64 -LC:/cuda/v9.0/lib/x64 -LC:/cuda/v10.2/lib/x64
//#cgo windows CFLAGS: -IC:/cuda/v5.0/include -IC:/cuda/v5.5/include -IC:/cuda/v6.0/include -IC:/cuda/v6.5/include -IC:/cuda/v7.0/include -IC:/cuda/v8.0/include -IC:/cuda/v9.0/include -IC:/cuda/v10.2/include
import "C"

TensorCore for CUDA9

https://devblogs.nvidia.com/parallelforall/programming-tensor-cores-cuda-9/

Errors when building

GOROOT=C:/Go
GOPATH=D:/Go
C:/Go\bin\go.exe build -i -o C:\Users\Andy\AppData\Local\Temp\___Unnamed.exe github.com/chewxy/cu/cmd/cudatest
# github.com/chewxy/cu
gcc: error: FilesNVIDIA: No such file or directory
gcc: error: GPU: No such file or directory
gcc: error: Computing: No such file or directory
gcc: error: ToolkitCUDAv8.0include: No such file or directory

Documentation

The API is autogenerated-ish. Some documentation should be cribbed over from the nvidia website.

Easy issue

I'm new to this. First I ran go get -u gorgonia.org/cu, then go install gorgonia.org/cu/cmd/cudatest, but I got an error.

cublas silently fails when passed CPU memory instead of GPU memory

Reproduction:

a := tensor.New(tensor.Of(tensor.Float64), tensor.WithShape(2,3)) // allocate []float64 in CPU
b := tensor.New(tensor.Of(tensor.Float64), tensor.WithShape(3))  // allocate []float64 in CPU

// set the engine AFTER the values have been allocated
e := newEngine(cudaCtx)
tensor.WithEngine(e)(a)
tensor.WithEngine(e)(b)

ad := a.Data().([]float64)
for i := range ad {
	ad[i] = float64(i + 1)
}

bd := b.Data().([]float64)
for i := range bd {
	bd[i] = float64(i + 1)
}

var err error
// c2, err = tensor.MatVecMul(a, b, tensor.WithReuse(c))
err = e.MatVecMul(a, b, c)
if err != nil || e.Standard.Err() != nil {
	log.Println(err)
	fmt.Println(e.Standard.Err())
} else {
	fmt.Printf("c %v\n", c.Data())
}

Expected to see: error, or results

Got:

⎡1  2  3⎤
⎣4  5  6⎦

[1  2  3]

c [1000 1000]
2018/07/14 11:50:56 impl.e <nil>

API for `cuMemHostGetFlags`

The C declaration is CUresult cuMemHostGetFlags(unsigned int* pFlags, void* p);

Build failure with cuda toolkit 12.0

I am trying to build the cudatest program on Debian Trixie (testing) as the documentation suggests. It fails to build the cgo part due to missing symbols.

versions

linux kernel 6.1.49
go 1.21.0
cuda 12.0.140
nvidia driver 525.125.06
geforce rtx 3080

logs

% go version
go version go1.21.0 linux/amd64

% go install gorgonia.org/cu/cmd/cudatest
go: 'go install' requires a version when current directory is not in a module
	Try 'go install gorgonia.org/cu/cmd/cudatest@latest' to install the latest version

% go install gorgonia.org/cu/cmd/cudatest@latest
# gorgonia.org/cu
go/pkg/mod/gorgonia.org/[email protected]/jit.go:138:32: could not determine kind of name for C.CU_TARGET_COMPUTE_20
go/pkg/mod/gorgonia.org/[email protected]/jit.go:139:32: could not determine kind of name for C.CU_TARGET_COMPUTE_21

% dpkg -l nvidia-driver-bin nvidia-cuda-toolkit golang-go
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                Version           Architecture Description
+++-===================-=================-============-=======================================>
ii  golang-go:amd64     2:1.21~2          amd64        Go programming language compiler, linke>
ii  nvidia-cuda-toolkit 12.0.140~12.0.1-2 amd64        NVIDIA CUDA development toolkit
ii  nvidia-driver-bin   525.125.06-2      amd64        NVIDIA driver support binaries

% nvidia-smi                                    
Mon Sep 11 19:24:49 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:43:00.0  On |                  N/A |
| 47%   48C    P0    97W / 350W |   1441MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

to reproduce without breaking everything

Note: alias docker=podman if that's what you have; either one should work. And you don't need an actual GPU to build it.

Also note: nvidia libs are kinda big. Make sure you have ~10GB of disk space available for this.

% docker run -it --rm debian:trixie bash
# perl -pi -e 's/main/main contrib non-free/' /etc/apt/sources.list.d/debian.sources
# apt update
# apt install nvidia-cuda-toolkit golang-go
# go install gorgonia.org/cu/cmd/cudatest@latest

API for JIT modules

The C declaration is CUresult cuLinkCreate(unsigned int numOptions, CUjit_option* options, void** optionValues, CUlinkState* stateOut);

This requires the creation of new types, as prototyped below:

type Link uintptr (note that uintptr is just a placeholder. It's only good to use if the underlying type is a pointer, refer to cuda.h)
type JITOption int (with const)
type JITInputType byte (with const)

The APIs that need to be ported:

cuLinkCreate -> func NewLink() (Link, error)
cuLinkComplete -> func (Link) Complete() (fatBin unsafe.Pointer, size int64, err error)
cuLinkAddFile -> func (Link) AddFile(...) error
cuLinkAddData -> func (Link) AddData(...) error

CUDA 12 support

First of all - nice package! 👍

I would like to ask if anyone is working on updating this package to support CUDA 12?

problems with #include <cuda.h>

I am trying to build this library on windows.

I've got CUDA 8 installed as well as cuDNN because I use TensorFlow. When I use go get on this library, I get: fatal error: cuda.h: No such file or directory.

I don't have much experience with cgo. Is there some flag that I have to pass to link the CUDA library to this library so that I can build it?

Move to organization Gorgonia

Add meta redirect at gorgonia.org/cu
Add canonical import path to the package
Move to organization gorgonia

Branch in question is v0.8.0-working

cudatest failed (cu.init returns Unknown error)

With the error:

panic: Unknown

goroutine 1 [running]:
github.com/chewxy/cu.init.1()
	github.com/chewxy/cu/cu.go:13 +0x5e
github.com/chewxy/cu.init()
	github.com/chewxy/cu/_obj/_cgo_import.go:87 +0x11d
main.init()
	github.com/chewxy/cu/cmd/cudatest/main.go:27 +0x49

Go: 1.8
nvidia-smi -i 0 output:

...
NVIDIA-SMI 375.26                 Driver Version: 375.26
...

cannot find -lcuda

windows
cuda 9.0

gorgonia.org/cu

C:/TDM-GCC-64/bin/../lib/gcc/x86_64-w64-mingw32/5.1.0/../../../../x86_64-w64-mingw32/bin/ld.exe: cannot find -lcuda
collect2.exe: error: ld returned 1 exit status

Hello...
Any idea on how to resolve this problem?

TestLargeBatch does not end and gives different results

When I run go test -tags="cuda" -run=LargeBatch -v, the test displays nothing and never finishes.

(base) ➜ ip-172-31-49-229 cu git:(master) ✗  go test -tags="cuda" -run=LargeBatch -v
=== RUN   TestLargeBatch
2019/09/04 23:04:42 Large batch
^Csignal: interrupt
FAIL    gorgonia.org/cu 6.192s

When I run another test first, the output is more verbose, but the result is not the same every time:

(base) ➜ ip-172-31-49-229 cu git:(master) ✗  go test -tags="cuda" -v
=== RUN   TestBatchContext
2019/09/04 23:02:41 BatchContext
--- PASS: TestBatchContext (0.10s)
=== RUN   TestLargeBatch
2019/09/04 23:02:41 Large batch
2019/09/04 23:02:42 Errors found true
2019/09/04 23:02:42 Errors: 
[0]: ContextIsDestroyed
[1]: ContextIsDestroyed
[2]: ContextIsDestroyed
[3]: ContextIsDestroyed
[4]: ContextIsDestroyed
[5]: ContextIsDestroyed
2019/09/04 23:02:42 Queue: 6
        [QUEUE] memcpyHtoD. dest: 0xb047bb000, src: 0xc00012f000, size: 4000
        [QUEUE] launchKernel. KernelParams: 0x7f34d4004120
        [QUEUE] sync. Current Context 0
        [QUEUE] memfreeD. mem: 0xb047ba000
        [QUEUE] memfreeD. mem: 0xb047bb000
        [QUEUE] allocAndCopy. Size: 4000, src: 0xc00012e000
^Csignal: interrupt

(base) ➜ ip-172-31-49-229 cu git:(master) ✗  go test -tags="cuda" -v                
=== RUN   TestBatchContext
2019/09/04 23:06:54 BatchContext
--- PASS: TestBatchContext (0.09s)
=== RUN   TestLargeBatch
2019/09/04 23:06:54 Large batch
2019/09/04 23:06:54 Errors found true
2019/09/04 23:06:54 Errors: 
[0]: ContextIsDestroyed
2019/09/04 23:06:54 Queue: 1
        [QUEUE] mallocD. Size 4000

It is just an assumption, but it may be related to Issue 297 of gorgonia.

cublas.Dnrm2 Get InternalError

cu release v0.9.3 on Windows.
It happens with almost all of the functions that need to return a scalar value, like nrm2 and dot.
I tested these functions in C++ and they work well.
Also, when I avoid the error, other functions seem to be broken, even though Dnrm2 magically worked once.

API for `cuModuleGetTexRef`

The C declaration is CUresult cuModuleGetTexRef(CUtexref* pTexRef, CUmodule hmod, const char* name);

The required Go functions and methods are:

func (mod Module) TexRef(name string) (TexRef, error)
func (ctx *Ctx) ModuleTexRef(name string) (TexRef, error)

ROCm support

Is there any support for AMD graphics cards?

Documentation wrt OS thread locking, and the various contexts
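
Background for this request: the CUDA driver API tracks the current context per OS thread, while goroutines may migrate between threads. One common pattern, sketched here with no CUDA calls at all, is to lock a goroutine to its OS thread with runtime.LockOSThread and funnel all work to it over a channel. The worker type and its methods are hypothetical and are not taken from this package.

```go
package main

import (
	"fmt"
	"runtime"
)

// worker runs every submitted function on a single locked OS thread.
type worker struct {
	work chan func()
	done chan struct{}
}

// newWorker starts the goroutine that owns the locked thread.
func newWorker() *worker {
	w := &worker{work: make(chan func()), done: make(chan struct{})}
	go func() {
		// Pin this goroutine so thread-bound state stays valid.
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		defer close(w.done)
		for f := range w.work {
			f()
		}
	}()
	return w
}

// Do runs f on the worker's thread and waits for it to finish.
func (w *worker) Do(f func()) {
	finished := make(chan struct{})
	w.work <- func() {
		f()
		close(finished)
	}
	<-finished
}

// Close stops the worker and waits for its goroutine to exit.
func (w *worker) Close() {
	close(w.work)
	<-w.done
}

func main() {
	w := newWorker()
	sum := 0
	for i := 1; i <= 4; i++ {
		i := i
		w.Do(func() { sum += i }) // stands in for a thread-bound API call
	}
	w.Close()
	fmt.Println(sum) // prints 10
}
```

In a real program, the functions passed to Do would contain the driver calls, so that every call observes the same current context.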

API for `cuMemHostAlloc`

The C declaration is CUresult cuMemHostAlloc(void** pp, size_t bytesize, unsigned int Flags);

Go get fails with Cuda Toolkit 6.5 installed

Hi! I have Cuda Toolkit 6.5 and Cuda Toolkit 8.0 installed on different machines. The package works well with 8.0, but fails with 6.5.

Error stack:

# gorgonia.org/cu
../../go/src/gorgonia.org/cu/addressing.go:57:8: could not determine kind of name for C.CUmem_advise
../../go/src/gorgonia.org/cu/addressing.go:60:16: could not determine kind of name for C.cuMemAdvise
../../go/src/gorgonia.org/cu/addressing.go:86:16: could not determine kind of name for C.cuMemPrefetchAsync

Test issue

Ignore this issue.

could not determine kind of name for C.cudnnSetRNNDescriptor

Hi, I am trying to run a simple MNIST neural net with gorgonia. The program runs fine when I do not specify -tags="cuda", but when I write that tag, the following error happens:

root@92e58a10ce54:~/Documents/Coding/Go/ai/gorgonia-rl# go run -tags="cuda" .
go: downloading github.com/gorgonia/vulkan v0.0.0-20210704115342-b11687f54f7e
go: downloading github.com/chewxy/math32 v1.0.8
go: downloading go4.org/unsafe/assume-no-moving-gc v0.0.0-20220617031537-928513b29760
go: downloading github.com/vulkan-go/vulkan v0.0.0-20201213112254-a536091a798a
# gorgonia.org/cu
/root/go/pkg/mod/gorgonia.org/[email protected]/addressing.go:3:11: fatal error: cuda.h: No such file or directory
    3 | // #include <cuda.h>
      |           ^~~~~~~~
compilation terminated.
# gorgonia.org/cu/dnn
/root/go/pkg/mod/gorgonia.org/[email protected]/dnn/rnn.go:67:19: could not determine kind of name for C.cudnnSetRNNDescriptor

Here are the troubleshooting steps as described in the README. It looks to me like they are all working:

root@92e58a10ce54:~/Documents/Coding/Go/ai/gorgonia-rl# ld -lcuda --verbose
GNU ld (GNU Binutils for Ubuntu) 2.38
  Supported emulations:
   elf_x86_64
   elf32_x86_64
   elf_i386
   elf_iamcu
   elf_l1om
   elf_k1om
   i386pep
   i386pe
using internal linker script:
==================================================
/* Script for -z combreloc -z separate-code */
/* Copyright (C) 2014-2022 Free Software Foundation, Inc.
   Copying and distribution of this script, with or without modification,
   are permitted in any medium without royalty provided the copyright
   notice and this notice are preserved.  */
OUTPUT_FORMAT("elf64-x86-64", "elf64-x86-64",
              "elf64-x86-64")
OUTPUT_ARCH(i386:x86-64)
ENTRY(_start)
SEARCH_DIR("=/usr/local/lib/x86_64-linux-gnu"); SEARCH_DIR("=/lib/x86_64-linux-gnu"); SEARCH_DIR("=/usr/lib/x86_64-linux-gnu"); SEARCH_DIR("=/usr/lib/x86_64-linux-gnu64"); SEARCH_DIR("=/usr/local/lib64"); SEARCH_DIR("=/lib64"); SEARCH_DIR("=/usr/lib64"); SEARCH_DIR("=/usr/local/lib"); SEARCH_DIR("=/lib"); SEARCH_DIR("=/usr/lib"); SEARCH_DIR("=/usr/x86_64-linux-gnu/lib64"); SEARCH_DIR("=/usr/x86_64-linux-gnu/lib");
SECTIONS
{
  PROVIDE (__executable_start = SEGMENT_START("text-segment", 0x400000)); . = SEGMENT_START("text-segment", 0x400000) + SIZEOF_HEADERS;
  .interp         : { *(.interp) }
  .note.gnu.build-id  : { *(.note.gnu.build-id) }
  .hash           : { *(.hash) }
  .gnu.hash       : { *(.gnu.hash) }
  .dynsym         : { *(.dynsym) }
  .dynstr         : { *(.dynstr) }
  .gnu.version    : { *(.gnu.version) }
  .gnu.version_d  : { *(.gnu.version_d) }
  .gnu.version_r  : { *(.gnu.version_r) }
  .rela.dyn       :
    {
      *(.rela.init)
      *(.rela.text .rela.text.* .rela.gnu.linkonce.t.*)
      *(.rela.fini)
      *(.rela.rodata .rela.rodata.* .rela.gnu.linkonce.r.*)
      *(.rela.data .rela.data.* .rela.gnu.linkonce.d.*)
      *(.rela.tdata .rela.tdata.* .rela.gnu.linkonce.td.*)
      *(.rela.tbss .rela.tbss.* .rela.gnu.linkonce.tb.*)
      *(.rela.ctors)
      *(.rela.dtors)
      *(.rela.got)
      *(.rela.bss .rela.bss.* .rela.gnu.linkonce.b.*)
      *(.rela.ldata .rela.ldata.* .rela.gnu.linkonce.l.*)
      *(.rela.lbss .rela.lbss.* .rela.gnu.linkonce.lb.*)
      *(.rela.lrodata .rela.lrodata.* .rela.gnu.linkonce.lr.*)
      *(.rela.ifunc)
    }
  .rela.plt       :
    {
      *(.rela.plt)
      PROVIDE_HIDDEN (__rela_iplt_start = .);
      *(.rela.iplt)
      PROVIDE_HIDDEN (__rela_iplt_end = .);
    }
  .relr.dyn : { *(.relr.dyn) }
  . = ALIGN(CONSTANT (MAXPAGESIZE));
  .init           :
  {
    KEEP (*(SORT_NONE(.init)))
  }
  .plt            : { *(.plt) *(.iplt) }
.plt.got        : { *(.plt.got) }
.plt.sec        : { *(.plt.sec) }
  .text           :
  {
    *(.text.unlikely .text.*_unlikely .text.unlikely.*)
    *(.text.exit .text.exit.*)
    *(.text.startup .text.startup.*)
    *(.text.hot .text.hot.*)
    *(SORT(.text.sorted.*))
    *(.text .stub .text.* .gnu.linkonce.t.*)
    /* .gnu.warning sections are handled specially by elf.em.  */
    *(.gnu.warning)
  }
  .fini           :
  {
    KEEP (*(SORT_NONE(.fini)))
  }
  PROVIDE (__etext = .);
  PROVIDE (_etext = .);
  PROVIDE (etext = .);
  . = ALIGN(CONSTANT (MAXPAGESIZE));
  /* Adjust the address for the rodata segment.  We want to adjust up to
     the same address within the page on the next page up.  */
  . = SEGMENT_START("rodata-segment", ALIGN(CONSTANT (MAXPAGESIZE)) + (. & (CONSTANT (MAXPAGESIZE) - 1)));
  .rodata         : { *(.rodata .rodata.* .gnu.linkonce.r.*) }
  .rodata1        : { *(.rodata1) }
  .eh_frame_hdr   : { *(.eh_frame_hdr) *(.eh_frame_entry .eh_frame_entry.*) }
  .eh_frame       : ONLY_IF_RO { KEEP (*(.eh_frame)) *(.eh_frame.*) }
  .gcc_except_table   : ONLY_IF_RO { *(.gcc_except_table .gcc_except_table.*) }
  .gnu_extab   : ONLY_IF_RO { *(.gnu_extab*) }
  /* These sections are generated by the Sun/Oracle C++ compiler.  */
  .exception_ranges   : ONLY_IF_RO { *(.exception_ranges*) }
  /* Adjust the address for the data segment.  We want to adjust up to
     the same address within the page on the next page up.  */
  . = DATA_SEGMENT_ALIGN (CONSTANT (MAXPAGESIZE), CONSTANT (COMMONPAGESIZE));
  /* Exception handling  */
  .eh_frame       : ONLY_IF_RW { KEEP (*(.eh_frame)) *(.eh_frame.*) }
  .gnu_extab      : ONLY_IF_RW { *(.gnu_extab) }
  .gcc_except_table   : ONLY_IF_RW { *(.gcc_except_table .gcc_except_table.*) }
  .exception_ranges   : ONLY_IF_RW { *(.exception_ranges*) }
  /* Thread Local Storage sections  */
  .tdata          :
   {
     PROVIDE_HIDDEN (__tdata_start = .);
     *(.tdata .tdata.* .gnu.linkonce.td.*)
   }
  .tbss           : { *(.tbss .tbss.* .gnu.linkonce.tb.*) *(.tcommon) }
  .preinit_array    :
  {
    PROVIDE_HIDDEN (__preinit_array_start = .);
    KEEP (*(.preinit_array))
    PROVIDE_HIDDEN (__preinit_array_end = .);
  }
  .init_array    :
  {
    PROVIDE_HIDDEN (__init_array_start = .);
    KEEP (*(SORT_BY_INIT_PRIORITY(.init_array.*) SORT_BY_INIT_PRIORITY(.ctors.*)))
    KEEP (*(.init_array EXCLUDE_FILE (*crtbegin.o *crtbegin?.o *crtend.o *crtend?.o ) .ctors))
    PROVIDE_HIDDEN (__init_array_end = .);
  }
  .fini_array    :
  {
    PROVIDE_HIDDEN (__fini_array_start = .);
    KEEP (*(SORT_BY_INIT_PRIORITY(.fini_array.*) SORT_BY_INIT_PRIORITY(.dtors.*)))
    KEEP (*(.fini_array EXCLUDE_FILE (*crtbegin.o *crtbegin?.o *crtend.o *crtend?.o ) .dtors))
    PROVIDE_HIDDEN (__fini_array_end = .);
  }
  .ctors          :
  {
    /* gcc uses crtbegin.o to find the start of
       the constructors, so we make sure it is
       first.  Because this is a wildcard, it
       doesn't matter if the user does not
       actually link against crtbegin.o; the
       linker won't look for a file to match a
       wildcard.  The wildcard also means that it
       doesn't matter which directory crtbegin.o
       is in.  */
    KEEP (*crtbegin.o(.ctors))
    KEEP (*crtbegin?.o(.ctors))
    /* We don't want to include the .ctor section from
       the crtend.o file until after the sorted ctors.
       The .ctor section from the crtend file contains the
       end of ctors marker and it must be last */
    KEEP (*(EXCLUDE_FILE (*crtend.o *crtend?.o ) .ctors))
    KEEP (*(SORT(.ctors.*)))
    KEEP (*(.ctors))
  }
  .dtors          :
  {
    KEEP (*crtbegin.o(.dtors))
    KEEP (*crtbegin?.o(.dtors))
    KEEP (*(EXCLUDE_FILE (*crtend.o *crtend?.o ) .dtors))
    KEEP (*(SORT(.dtors.*)))
    KEEP (*(.dtors))
  }
  .jcr            : { KEEP (*(.jcr)) }
  .data.rel.ro : { *(.data.rel.ro.local* .gnu.linkonce.d.rel.ro.local.*) *(.data.rel.ro .data.rel.ro.* .gnu.linkonce.d.rel.ro.*) }
  .dynamic        : { *(.dynamic) }
  .got            : { *(.got) *(.igot) }
  . = DATA_SEGMENT_RELRO_END (SIZEOF (.got.plt) >= 24 ? 24 : 0, .);
  .got.plt        : { *(.got.plt) *(.igot.plt) }
  .data           :
  {
    *(.data .data.* .gnu.linkonce.d.*)
    SORT(CONSTRUCTORS)
  }
  .data1          : { *(.data1) }
  _edata = .; PROVIDE (edata = .);
  . = .;
  __bss_start = .;
  .bss            :
  {
   *(.dynbss)
   *(.bss .bss.* .gnu.linkonce.b.*)
   *(COMMON)
   /* Align here to ensure that the .bss section occupies space up to
      _end.  Align after .bss to ensure correct alignment even if the
      .bss section disappears because there are no input sections.
      FIXME: Why do we need it? When there is no .bss section, we do not
      pad the .data section.  */
   . = ALIGN(. != 0 ? 64 / 8 : 1);
  }
  .lbss   :
  {
    *(.dynlbss)
    *(.lbss .lbss.* .gnu.linkonce.lb.*)
    *(LARGE_COMMON)
  }
  . = ALIGN(64 / 8);
  . = SEGMENT_START("ldata-segment", .);
  .lrodata   ALIGN(CONSTANT (MAXPAGESIZE)) + (. & (CONSTANT (MAXPAGESIZE) - 1)) :
  {
    *(.lrodata .lrodata.* .gnu.linkonce.lr.*)
  }
  .ldata   ALIGN(CONSTANT (MAXPAGESIZE)) + (. & (CONSTANT (MAXPAGESIZE) - 1)) :
  {
    *(.ldata .ldata.* .gnu.linkonce.l.*)
    . = ALIGN(. != 0 ? 64 / 8 : 1);
  }
  . = ALIGN(64 / 8);
  _end = .; PROVIDE (end = .);
  . = DATA_SEGMENT_END (.);
  /* Stabs debugging sections.  */
  .stab          0 : { *(.stab) }
  .stabstr       0 : { *(.stabstr) }
  .stab.excl     0 : { *(.stab.excl) }
  .stab.exclstr  0 : { *(.stab.exclstr) }
  .stab.index    0 : { *(.stab.index) }
  .stab.indexstr 0 : { *(.stab.indexstr) }
  .comment       0 : { *(.comment) }
  .gnu.build.attributes : { *(.gnu.build.attributes .gnu.build.attributes.*) }
  /* DWARF debug sections.
     Symbols in the DWARF debugging sections are relative to the beginning
     of the section so we begin them at 0.  */
  /* DWARF 1.  */
  .debug          0 : { *(.debug) }
  .line           0 : { *(.line) }
  /* GNU DWARF 1 extensions.  */
  .debug_srcinfo  0 : { *(.debug_srcinfo) }
  .debug_sfnames  0 : { *(.debug_sfnames) }
  /* DWARF 1.1 and DWARF 2.  */
  .debug_aranges  0 : { *(.debug_aranges) }
  .debug_pubnames 0 : { *(.debug_pubnames) }
  /* DWARF 2.  */
  .debug_info     0 : { *(.debug_info .gnu.linkonce.wi.*) }
  .debug_abbrev   0 : { *(.debug_abbrev) }
  .debug_line     0 : { *(.debug_line .debug_line.* .debug_line_end) }
  .debug_frame    0 : { *(.debug_frame) }
  .debug_str      0 : { *(.debug_str) }
  .debug_loc      0 : { *(.debug_loc) }
  .debug_macinfo  0 : { *(.debug_macinfo) }
  /* SGI/MIPS DWARF 2 extensions.  */
  .debug_weaknames 0 : { *(.debug_weaknames) }
  .debug_funcnames 0 : { *(.debug_funcnames) }
  .debug_typenames 0 : { *(.debug_typenames) }
  .debug_varnames  0 : { *(.debug_varnames) }
  /* DWARF 3.  */
  .debug_pubtypes 0 : { *(.debug_pubtypes) }
  .debug_ranges   0 : { *(.debug_ranges) }
  /* DWARF 5.  */
  .debug_addr     0 : { *(.debug_addr) }
  .debug_line_str 0 : { *(.debug_line_str) }
  .debug_loclists 0 : { *(.debug_loclists) }
  .debug_macro    0 : { *(.debug_macro) }
  .debug_names    0 : { *(.debug_names) }
  .debug_rnglists 0 : { *(.debug_rnglists) }
  .debug_str_offsets 0 : { *(.debug_str_offsets) }
  .debug_sup      0 : { *(.debug_sup) }
  .gnu.attributes 0 : { KEEP (*(.gnu.attributes)) }
  /DISCARD/ : { *(.note.GNU-stack) *(.gnu_debuglink) *(.gnu.lto_*) }
}


==================================================
ld: mode elf_x86_64
attempt to open /usr/local/lib/x86_64-linux-gnu/libcuda.so failed
attempt to open /usr/local/lib/x86_64-linux-gnu/libcuda.a failed
attempt to open /lib/x86_64-linux-gnu/libcuda.so succeeded
/lib/x86_64-linux-gnu/libcuda.so
libm.so.6 needed by /lib/x86_64-linux-gnu/libcuda.so
attempt to open /usr/local/nvidia/lib/libm.so.6 failed
attempt to open /usr/local/nvidia/lib64/libm.so.6 failed
attempt to open /usr/local/cuda/lib64/libm.so.6 failed
attempt to open /usr/local/cuda/targets/x86_64-linux/lib/libm.so.6 failed
attempt to open /usr/local/cuda-11/targets/x86_64-linux/lib/libm.so.6 failed
attempt to open /usr/local/cuda-11.8/targets/x86_64-linux/lib/libm.so.6 failed
attempt to open /usr/local/lib/libm.so.6 failed
attempt to open /usr/local/nvidia/lib/libm.so.6 failed
attempt to open /usr/local/nvidia/lib64/libm.so.6 failed
attempt to open /usr/local/lib/x86_64-linux-gnu/libm.so.6 failed
found libm.so.6 at /lib/x86_64-linux-gnu/libm.so.6
libc.so.6 needed by /lib/x86_64-linux-gnu/libcuda.so
attempt to open /usr/local/nvidia/lib/libc.so.6 failed
attempt to open /usr/local/nvidia/lib64/libc.so.6 failed
attempt to open /usr/local/cuda/lib64/libc.so.6 failed
attempt to open /usr/local/cuda/targets/x86_64-linux/lib/libc.so.6 failed
attempt to open /usr/local/cuda-11/targets/x86_64-linux/lib/libc.so.6 failed
attempt to open /usr/local/cuda-11.8/targets/x86_64-linux/lib/libc.so.6 failed
attempt to open /usr/local/lib/libc.so.6 failed
attempt to open /usr/local/nvidia/lib/libc.so.6 failed
attempt to open /usr/local/nvidia/lib64/libc.so.6 failed
attempt to open /usr/local/lib/x86_64-linux-gnu/libc.so.6 failed
attempt to open /usr/local/lib/x86_64-linux-gnu/libc.so.6 failed
attempt to open /usr/lib/x86_64-linux-gnu64/libc.so.6 failed
attempt to open /usr/local/lib64/libc.so.6 failed
attempt to open /lib64/libc.so.6 failed
attempt to open /usr/lib64/libc.so.6 failed
attempt to open /usr/local/lib/libc.so.6 failed
attempt to open /lib/libc.so.6 failed
attempt to open /usr/lib/libc.so.6 failed
attempt to open /usr/x86_64-linux-gnu/lib64/libc.so.6 failed
attempt to open /usr/x86_64-linux-gnu/lib/libc.so.6 failed
attempt to open /usr/local/nvidia/lib/libc.so.6 failed
attempt to open /usr/local/nvidia/lib64/libc.so.6 failed
attempt to open /usr/local/cuda/lib64/libc.so.6 failed
attempt to open /usr/local/cuda/targets/x86_64-linux/lib/libc.so.6 failed
attempt to open /usr/local/cuda-11/targets/x86_64-linux/lib/libc.so.6 failed
attempt to open /usr/local/cuda-11.8/targets/x86_64-linux/lib/libc.so.6 failed
attempt to open /usr/local/lib/libc.so.6 failed
attempt to open /usr/local/nvidia/lib/libc.so.6 failed
attempt to open /usr/local/nvidia/lib64/libc.so.6 failed
attempt to open /usr/local/lib/x86_64-linux-gnu/libc.so.6 failed
found libc.so.6 at /lib/x86_64-linux-gnu/libc.so.6
libdl.so.2 needed by /lib/x86_64-linux-gnu/libcuda.so
attempt to open /usr/local/nvidia/lib/libdl.so.2 failed
attempt to open /usr/local/nvidia/lib64/libdl.so.2 failed
attempt to open /usr/local/cuda/lib64/libdl.so.2 failed
attempt to open /usr/local/cuda/targets/x86_64-linux/lib/libdl.so.2 failed
attempt to open /usr/local/cuda-11/targets/x86_64-linux/lib/libdl.so.2 failed
attempt to open /usr/local/cuda-11.8/targets/x86_64-linux/lib/libdl.so.2 failed
attempt to open /usr/local/lib/libdl.so.2 failed
attempt to open /usr/local/nvidia/lib/libdl.so.2 failed
attempt to open /usr/local/nvidia/lib64/libdl.so.2 failed
attempt to open /usr/local/lib/x86_64-linux-gnu/libdl.so.2 failed
found libdl.so.2 at /lib/x86_64-linux-gnu/libdl.so.2
libpthread.so.0 needed by /lib/x86_64-linux-gnu/libcuda.so
attempt to open /usr/local/nvidia/lib/libpthread.so.0 failed
attempt to open /usr/local/nvidia/lib64/libpthread.so.0 failed
attempt to open /usr/local/cuda/lib64/libpthread.so.0 failed
attempt to open /usr/local/cuda/targets/x86_64-linux/lib/libpthread.so.0 failed
attempt to open /usr/local/cuda-11/targets/x86_64-linux/lib/libpthread.so.0 failed
attempt to open /usr/local/cuda-11.8/targets/x86_64-linux/lib/libpthread.so.0 failed
attempt to open /usr/local/lib/libpthread.so.0 failed
attempt to open /usr/local/nvidia/lib/libpthread.so.0 failed
attempt to open /usr/local/nvidia/lib64/libpthread.so.0 failed
attempt to open /usr/local/lib/x86_64-linux-gnu/libpthread.so.0 failed
found libpthread.so.0 at /lib/x86_64-linux-gnu/libpthread.so.0
librt.so.1 needed by /lib/x86_64-linux-gnu/libcuda.so
attempt to open /usr/local/nvidia/lib/librt.so.1 failed
attempt to open /usr/local/nvidia/lib64/librt.so.1 failed
attempt to open /usr/local/cuda/lib64/librt.so.1 failed
attempt to open /usr/local/cuda/targets/x86_64-linux/lib/librt.so.1 failed
attempt to open /usr/local/cuda-11/targets/x86_64-linux/lib/librt.so.1 failed
attempt to open /usr/local/cuda-11.8/targets/x86_64-linux/lib/librt.so.1 failed
attempt to open /usr/local/lib/librt.so.1 failed
attempt to open /usr/local/nvidia/lib/librt.so.1 failed
attempt to open /usr/local/nvidia/lib64/librt.so.1 failed
attempt to open /usr/local/lib/x86_64-linux-gnu/librt.so.1 failed
found librt.so.1 at /lib/x86_64-linux-gnu/librt.so.1
ld-linux-x86-64.so.2 needed by /lib/x86_64-linux-gnu/libm.so.6
attempt to open /usr/local/nvidia/lib/ld-linux-x86-64.so.2 failed
attempt to open /usr/local/nvidia/lib64/ld-linux-x86-64.so.2 failed
attempt to open /usr/local/cuda/lib64/ld-linux-x86-64.so.2 failed
attempt to open /usr/local/cuda/targets/x86_64-linux/lib/ld-linux-x86-64.so.2 failed
attempt to open /usr/local/cuda-11/targets/x86_64-linux/lib/ld-linux-x86-64.so.2 failed
attempt to open /usr/local/cuda-11.8/targets/x86_64-linux/lib/ld-linux-x86-64.so.2 failed
attempt to open /usr/local/lib/ld-linux-x86-64.so.2 failed
attempt to open /usr/local/nvidia/lib/ld-linux-x86-64.so.2 failed
attempt to open /usr/local/nvidia/lib64/ld-linux-x86-64.so.2 failed
attempt to open /usr/local/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 failed
found ld-linux-x86-64.so.2 at /lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
ld: warning: cannot find entry symbol _start; not setting start address
root@92e58a10ce54:~/Documents/Coding/Go/ai/gorgonia-rl# whereis libcuda.so
libcuda.so: /usr/lib/x86_64-linux-gnu/libcuda.so
root@92e58a10ce54:~/Documents/Coding/Go/ai/gorgonia-rl# ls -l /usr/lib/x86_64-linux-gnu/libcuda.so
lrwxrwxrwx 1 root root 12 Mar 24 10:52 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1

Below is the Dockerfile I am using:

FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
#nvidia/cuda:10.2-cudnn8-devel-ubuntu18.04

# Update apt-get and install stuff
RUN apt-get update && apt-get install -y \
    wget \
    nano

RUN wget -q https://go.dev/dl/go1.20.2.linux-amd64.tar.gz && \
    tar -C /usr/local -xzf go1.20.2.linux-amd64.tar.gz && \
    rm go1.20.2.linux-amd64.tar.gz

# Enable color prompt
RUN sed -i 's/#force_color_prompt=yes/force_color_prompt=yes/' /root/.bashrc

# Add Go to PATH (for running go commands in this install file we should still use absolute paths)
ENV PATH $PATH:/usr/local/go/bin

# Download Gorgonia. This is just so you don't have to download it every time you start the container
RUN cd /root && \
    mkdir gorgonia-temp && \
    cd gorgonia-temp && \
    go mod init gorgonia-temp && \
    go get -u gorgonia.org/gorgonia && \
    cd .. && \
    rm -rd gorgonia-temp

# Add some other libs
RUN apt-get install -y libc6-dev libgl1-mesa-glx libgl1-mesa-dri

# Add cuda to PATH (IMPORTANT-UPDATE this when you change the cuda version)
ENV PATH $PATH:/usr/local/cuda/include:/usr/local/cuda/bin
ENV LD_LIBRARY_PATH $LD_LIBRARY_PATH:/usr/local/cuda/lib64

# Give libcuda all permissions
RUN chmod -R 777 /usr/local/cuda-11.8/targets/x86_64-linux/lib/stubs
RUN chmod -R 777 /usr/local/cuda-11.8/compat

RUN find / -name libcuda.so

CMD "bash"

My Go version is: go version go1.20.2 linux/amd64

The output of nvidia-smi from inside the container is:

Fri Mar 24 11:07:55 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:09:00.0  On |                  N/A |
|  0%   50C    P5    22W / 170W |    854MiB / 12288MiB |     11%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

API for `cuModuleGetSurfRef`

The C declaration is CUresult cuModuleGetSurfRef(CUsurfref* pSurfRef, CUmodule hmod, const char* name);

The required Go functions/methods are:

func (m Module) SurfRef(name string) (SurfRef, error)
func (ctx *Ctx) ModuleSurfRef(name string) (SurfRef, error)

API for `cuMemHostGetDevicePointer`

The C declaration is CUresult cuMemHostGetDevicePointer(CUdeviceptr* pdptr, void* p, unsigned int Flags);

[low priority] Inspect dependencies

There are a lot of dependencies in go.mod, which add many transitive dependencies to projects that import gorgonia/cu (like mine).

Inspect what these dependencies are needed for, and eliminate them where possible to shorten download time.

I would like to take a look at this.

There isn't handleCUDACB in CUDA v11.4

Running

go get -u -v gorgonia.org/cu

produces the following warning:

go\pkg\mod\gorgonia.org\[email protected]\params.go: In function 'CallHostFunc':
go\pkg\mod\gorgonia.org\[email protected]\params.go:7:2: warning: implicit declaration of function 'handleCUDACB' [-Wimplicit-function-declaration]
    7 | handleCUDACB(fn)

Batching is Broken in CUDA11

Document Prerequisites

Hi,

I'm quite new to CUDA and Go, so maybe this is just a beginner's problem, but when I go get the project with this command:

CGO_CFLAGS="-I/usr/include/linux -I/usr/include" go get -u gorgonia.org/cu

I'm getting the following error:

# runtime/cgo
In file included from _cgo_export.c:3:0:
/usr/include/stdlib.h:97:8: error: unknown type name ‘size_t’
 extern size_t __ctype_get_mb_cur_max (void) __THROW __wur;
        ^~~~~~
In file included from _cgo_export.c:3:0:
/usr/include/stdlib.h:411:4: error: unknown type name ‘size_t’; did you mean ‘ssize_t’?
    size_t __statelen) __THROW __nonnull ((2));
    ^~~~~~
    ssize_t
/usr/include/stdlib.h:441:4: error: unknown type name ‘size_t’; did you mean ‘ssize_t’?
    size_t __statelen,
    ^~~~~~
    ssize_t
/usr/include/stdlib.h:539:22: error: unknown type name ‘size_t’; did you mean ‘ssize_t’?
 extern void *malloc (size_t __size) __THROW __attribute_malloc__ __wur;
...

testing

Error in initialization, please refer to "https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__INITIALIZE.html" for details on:

I encountered this error while following the instructions to test whether the package is correctly installed.

API for `cuModuleLoadFatBinary`

The C declaration is CUresult cuModuleLoadFatBinary(CUmodule* module, const void* fatCubin);

The Go declarations required:

func LoadFatBinary(fatCubin ???) Module
func (ctx *Ctx) LoadFatBinary(fatCubin ???) Module

bug in cgoflags.go

Hi people,

I had some difficulty trying to install this library. Partly that was because I didn't see that there is a guide until after I got it working. Partly it was because cgoflags.go isn't entirely correct.

I got this error:

$ go install -v gorgonia.org/cu/cmd/cudatest
# gorgonia.org/cu
/home/markkremer/go/pkg/mod/gorgonia.org/[email protected]/addressing.go:3:11: fatal error: cuda.h: No such file or directory
 // #include <cuda.h>
           ^~~~~~~~
compilation terminated.

I did have cuda version 10.2 installed and symlinked to /usr/local/cuda/.

The start of cgoflags.go reads:

package cu

// This file provides CGO flags to find CUDA libraries and headers.

//#cgo LDFLAGS:-lcuda
//
////default location:
//#cgo linux,windows LDFLAGS:-L/usr/local/cuda/lib64 -L/usr/local/cuda/lib
//#cgo linux,windows CFLAGS: -I/usr/local/cuda/include/
//
////default location if not properly symlinked:
//#cgo linux LDFLAGS:-L/usr/local/cuda-10.1/lib64 -L/usr/local/cuda-10.1/lib
//#cgo linux LDFLAGS:-L/usr/local/cuda-6.0/lib64 -L/usr/local/cuda-6.0/lib
//#cgo linux LDFLAGS:-L/usr/local/cuda-5.5/lib64 -L/usr/local/cuda-5.5/lib
//#cgo linux LDFLAGS:-L/usr/local/cuda-5.0/lib64 -L/usr/local/cuda-5.0/lib
//#cgo linux CFLAGS: -I/usr/local/cuda-10.2/include/
//#cgo linux CFLAGS: -I/usr/local/cuda-6.0/include/
//#cgo linux CFLAGS: -I/usr/local/cuda-5.5/include/
//#cgo linux CFLAGS: -I/usr/local/cuda-5.0/include/

When running go install -v -n gorgonia.org/cu/cmd/cudatest (with -n) to view the commands go runs, it doesn't use the top two /usr/local/cuda/... at all:

#
# gorgonia.org/cu
#

mkdir -p $WORK/b033/
cd /home/markkremer/go/pkg/mod/gorgonia.org/[email protected]
CGO_LDFLAGS='"-g" "-O2" "-lcuda" "-L/usr/local/cuda-10.1/lib64" "-L/usr/local/cuda-10.1/lib" "-L/usr/local/cuda-6.0/lib64" "-L/usr/local/cuda-6.0/lib" "-L/usr/local/cuda-5.5/lib64" "-L/usr/local/cuda-5.5/lib" "-L/usr/local/cuda-5.0/lib64" "-L/usr/local/cuda-5.0/lib" "-L/usr/lib/x86_64-linux-gnu/" "-L/opt/cuda/lib64" "-L/opt/cuda/lib"' /usr/local/go/pkg/tool/linux_amd64/cgo -objdir $WORK/b033/ -importpath gorgonia.org/cu -- -I $WORK/b033/ -g -O2 -g -O3 -std=c99 -I/usr/local/cuda-10.1/include/ -I/usr/local/cuda-6.0/include/ -I/usr/local/cuda-5.5/include/ -I/usr/local/cuda-5.0/include/ -I/usr/include -I/opt/cuda/include ./addressing.go ./api.go ./array.go ./attributes.go ./batch.go ./batchedPatterns.go ./cgoflags.go ./context.go ./convenience.go ./ctx.go ./ctx_api.go ./cu.go ./cucontext.go ./device.go ./event.go ./execution.go ./flags.go ./jit.go ./memory.go ./module.go ./occupancy.go ./result.go ./stream.go ./surfref.go ./texref.go
cd $WORK

It does use all the other cuda directories. Unfortunately for me, 10.2 isn't in there.

I think the build constraint tags should be separated by spaces instead of commas. In a constraint list a comma means AND and a space means OR, so linux,windows can never match:

- //#cgo linux,windows LDFLAGS:-L/usr/local/cuda/lib64 -L/usr/local/cuda/lib
+ //#cgo linux windows LDFLAGS:-L/usr/local/cuda/lib64 -L/usr/local/cuda/lib
- //#cgo linux,windows CFLAGS: -I/usr/local/cuda/include/
+ //#cgo linux windows CFLAGS: -I/usr/local/cuda/include/

Now it does show up:

#
# gorgonia.org/cu
#

mkdir -p $WORK/b033/
cd /home/markkremer/go/pkg/mod/gorgonia.org/[email protected]
CGO_LDFLAGS='"-g" "-O2" "-lcuda" "-L/usr/local/cuda/lib64" "-L/usr/local/cuda/lib" "-L/usr/local/cuda-10.2/lib64" "-L/usr/local/cuda-10.2/lib" "-L/usr/local/cuda-6.0/lib64" "-L/usr/local/cuda-6.0/lib" "-L/usr/local/cuda-5.5/lib64" "-L/usr/local/cuda-5.5/lib" "-L/usr/local/cuda-5.0/lib64" "-L/usr/local/cuda-5.0/lib" "-L/usr/lib/x86_64-linux-gnu/" "-L/opt/cuda/lib64" "-L/opt/cuda/lib"' /usr/local/go/pkg/tool/linux_amd64/cgo -objdir $WORK/b033/ -importpath gorgonia.org/cu -- -I $WORK/b033/ -g -O2 -g -O3 -std=c99 -I/usr/local/cuda/include/ -I/usr/local/cuda-10.1/include/ -I/usr/local/cuda-6.0/include/ -I/usr/local/cuda-5.5/include/ -I/usr/local/cuda-5.0/include/ -I/usr/include -I/opt/cuda/include ./addressing.go ./api.go ./array.go ./attributes.go ./batch.go ./batchedPatterns.go ./cgoflags.go ./context.go ./convenience.go ./ctx.go ./ctx_api.go ./cu.go ./cucontext.go ./device.go ./event.go ./execution.go ./flags.go ./jit.go ./memory.go ./module.go ./occupancy.go ./result.go ./stream.go ./surfref.go ./texref.go
cd $WORK

That isn't a Windows path, though, so it's best to remove the Windows part of that build constraint entirely:

- //#cgo linux,windows LDFLAGS:-L/usr/local/cuda/lib64 -L/usr/local/cuda/lib
+ //#cgo linux LDFLAGS:-L/usr/local/cuda/lib64 -L/usr/local/cuda/lib
- //#cgo linux,windows CFLAGS: -I/usr/local/cuda/include/
+ //#cgo linux CFLAGS: -I/usr/local/cuda/include/

I still don't have everything working locally but at least it builds now.
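
The comma-versus-space behaviour described above can be checked with the standard library's go/build/constraint package (Go 1.16 or later), which parses legacy // +build lines. This is only a sketch: it assumes that #cgo directives evaluate their tag list with the same rules, and the helper name matches is hypothetical, introduced here for illustration.

```go
package main

import (
	"fmt"
	"go/build/constraint"
)

// matches reports whether a legacy "+build"-style tag list is satisfied
// when goos is the only tag that is set. (Hypothetical helper.)
func matches(tags, goos string) bool {
	expr, err := constraint.Parse("// +build " + tags)
	if err != nil {
		panic(err)
	}
	return expr.Eval(func(tag string) bool { return tag == goos })
}

func main() {
	// Comma means AND: linux AND windows is never true on one system.
	fmt.Println(matches("linux,windows", "linux")) // false
	// Space means OR: linux OR windows is true on Linux.
	fmt.Println(matches("linux windows", "linux")) // true
}
```

If that assumption holds, it would explain why the two linux,windows lines were dropped from the generated flags while the plain linux lines were kept.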

cuda.h: No such file or directory // #include <cuda.h>

Hello, fellows!
When I ran the command "go install gorgonia.org/cu/cmd/cudatest@latest", it failed with the error "E:\mygo\pkg\mod\gorgonia.org\[email protected]\addressing.go:3:11: fatal error: cuda.h: No such file or directory
// #include <cuda.h>". I have already installed the CUDA Toolkit. What should I do? Your help will be highly appreciated! Thanks.

leading dimension of sgemm in blas

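$$C = \alpha \, \mathrm{op}(A) \, \mathrm{op}(B) + \beta C$$

where $\alpha$ and $\beta$ are scalars, and $A$, $B$ and $C$ are matrices stored in column-major format with dimensions $\mathrm{op}(A)$: $m \times k$, $\mathrm{op}(B)$: $k \times n$ and $C$: $m \times n$, respectively. Also, for matrix $A$

$$\mathrm{op}(A) = \begin{cases} A & \text{if transa == CUBLAS\_OP\_N} \\ A^T & \text{if transa == CUBLAS\_OP\_T} \\ A^H & \text{if transa == CUBLAS\_OP\_C} \end{cases}$$

and $\mathrm{op}(B)$ is defined similarly for matrix $B$.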

Code line https://github.com/gorgonia/cu/blob/master/blas/blas.go#L3514

if ldc*(m-1)+n > len(c) || ldc < max(1, n) {
	panic("blas: index of c out of range")
}

It seems that this check requires ldc to be at least n.

However, according to the NVIDIA cuBLAS API documentation ( https://docs.nvidia.com/cuda/cublas/index.html#cublas-lt-t-gt-gemm ), ldc must be at least m:

C	device	in/out	array of dimensions ldc x n with ldc>=max(1,m).
ldc		input	leading dimension of a two-dimensional array used to store the matrix C.

The same applies to lda and ldb.

So when I follow the NVIDIA API documentation, the parameter check in blas.go fails. When I use the opposite rule for lda/ldb/ldc, an error like "On entry to SGEMM parameter number 10 had an illegal value" is returned.

This confuses me. Could you please help me?
Thanks!
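
One possible reading of the mismatch, offered as an assumption rather than a statement about how this package forwards arguments to cuBLAS: the check quoted above has the shape of a row-major bound (the leading dimension covers the n columns of a row), while the cuBLAS documentation describes column-major storage (the leading dimension covers the m rows of a column). A minimal sketch of the two conventions in plain Go follows; the helper names are hypothetical and nothing here calls the cu or cuBLAS APIs.

```go
package main

import "fmt"

// validRowMajor mirrors the shape of the check quoted above: rows are
// contiguous, so ld must cover the n columns of each row.
func validRowMajor(m, n, ld, length int) bool {
	return ld >= 1 && ld >= n && ld*(m-1)+n <= length
}

// validColMajor is the column-major counterpart described in the cuBLAS
// documentation: columns are contiguous, so ld must cover the m rows.
func validColMajor(m, n, ld, length int) bool {
	return ld >= 1 && ld >= m && ld*(n-1)+m <= length
}

func main() {
	m, n, length := 2, 3, 6 // a dense 2 x 3 matrix C in a 6-element slice

	fmt.Println(validRowMajor(m, n, n, length)) // true:  ld = n passes the row-major rule
	fmt.Println(validRowMajor(m, n, m, length)) // false: ld = m is too small for row-major
	fmt.Println(validColMajor(m, n, m, length)) // true:  ld = m passes the column-major rule
	fmt.Println(validColMajor(m, n, n, length)) // false: ld = n overruns the slice in column-major
}
```

For a non-square matrix, a leading dimension that satisfies one convention is rejected by the other, which matches the symptom of one rule failing the Go-side check and the other being rejected by SGEMM.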