sx-aurora / veda
VEDA (VE Driver API)
License: Other
It seems that the VE10 has up to 10 virtual cores. "cores_enable" is a hex value in which each bit represents one active core; we need to read that file and build a per-VE map of the active cores.
"numa0_cores" and "numa1_cores" show which cores are dedicated to which NUMA partition.
It seems that CMake performs the git clone with --depth 1,
so now that aveo has one more commit than before, git is unable to check out d2b04de. I think we need to set GIT_SHALLOW to FALSE.
[egonzalez@XAIJPVE1 build]$ make
[ 3%] Performing download step (git clone) for 'aveo'
Cloning into 'src'...
error: pathspec 'd2b04de' did not match any file(s) known to git.
CMake Error at tmp/aveo-gitclone.cmake:40 (message):
Failed to checkout tag: 'd2b04de'
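Assuming the aveo download is driven by ExternalProject_Add in VEDA's CMake (the surrounding arguments here are illustrative), the proposed fix would look like:

```cmake
ExternalProject_Add(aveo
    GIT_REPOSITORY https://github.com/sx-aurora/aveo
    GIT_TAG        d2b04de
    # Fetch the full history instead of a --depth 1 clone, so the
    # pinned commit can always be checked out even after new commits:
    GIT_SHALLOW    FALSE
    # ... remaining download/build steps ...
)
```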
I'm using one device and one stream in OMP mode.
My VEDA code roughly corresponds to the lines below:
//once: vedaCtxCreate(&ctx, VEDA_CONTEXT_MODE_OMP, 0)
vedaDevicePrimaryCtxRetain(&ctx, device);
vedaCtxPushCurrent(ctx);
vedaMemAllocAsync
vedaMemcpyHtoDAsync
...
call kernel
vedaCtxPopCurrent(&ctx);
I keep the VEDA calls asynchronous and synchronize them only when I need the results inside the CPU operations.
To my surprise, my async VEDA operations took more time than expected: vedaMemcpyHtoDAsync was synchronizing internally.
I'm getting a linking error in my test project, trying to use VERA:
[ 85%] Linking CXX executable veda_test
libfill_lib.so: undefined reference to `veraInit()'
I understand that I should link my host library against libvera.so
if I'm going to use VERA functionality. For the VEDA functionality, CMake defines a VEDA_LIBRARY variable:
grep -r "FIND_LIBRARY"
veda-0.9.5.1/cmake/FindVE.cmake: FIND_LIBRARY(VEDA_LIBRARY "libveda.so" "libveda.a" PATHS "${VEDA_DIR}/lib64")
but I see no equivalent for the VERA_LIBRARY variable.
Am I doing something wrong?
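A sketch of what the missing entry could look like, mirroring the VEDA line above (VERA_LIBRARY is my suggested variable name; whether a static libvera.a exists is an assumption):

```cmake
FIND_LIBRARY(VERA_LIBRARY "libvera.so" "libvera.a" PATHS "${VEDA_DIR}/lib64")
```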
Some header files seem to be missing when compiling code that uses vera.h.
My CMake compilation output is:
/usr/local/ve/veda-0.9.5/include/vera.h:4:24: fatal error: vera_enums.h: No such file or directory
#include "vera_enums.h"
As a side note, when searching the Aurora system (and the GitHub repository), I also can't find the other file included by vera.h:
vera_types.h
Currently, users who build a .vso file with CMake won't notice that libveda.vso (the VE-side runtime) is linked into it. Users with other build setups need to explicitly add libveda.vso to the link line, which is fine.
However, when using VEDA for something like a fast JIT (creating, loading, and unloading .vso objects all the time), linking libveda.vso into each VE-side shared object increases latency. It would be sufficient to load the libveda.vso runtime just once, when the proc is created (or when the first context for a device is opened in VEDA). This issue was encountered while porting JuliaLang to VE.
A thread_local std::list can cause segfaults.
I hit this problem when I had a singleton wrapping my vedaInit and vedaExit;
after a little investigation, I found that it is due to the combination of thread_local and std::list.
I believe the std::list could either be replaced with a std::vector, or the thread_local could simply be removed.
#0 0x00007ffff5e21143 in std::__cxx11::_List_base<veda::Context*, std::allocator<veda::Context*> >::_M_clear (this=0x7ffff7fcb980) at /usr/include/c++/8/bits/list.tcc:74
#1 0x00007ffff5e21030 in std::__cxx11::list<veda::Context*, std::allocator<veda::Context*> >::clear (this=0x7ffff7fcb980) at /usr/include/c++/8/bits/stl_list.h:1508
#2 0x00007ffff5e2093f in veda::Contexts::shutdown () at /home/qwr/veda/src/veda/Contexts.cpp:54
#3 0x00007ffff5e09041 in vedaExit () at /home/qwr/veda/src/veda.cpp:36
#4 0x0000000001e500e6 in VEDA_HANDLE::~VEDA_HANDLE() ()
#5 0x00007ffff4ae4b0c in __run_exit_handlers () from /lib64/libc.so.6
#6 0x00007ffff4ae4c40 in exit () from /lib64/libc.so.6
#7 0x00007ffff4ace49a in __libc_start_main () from /lib64/libc.so.6
Here is a simple thread_local + std::list example that demonstrates the problem:
https://godbolt.org/z/7jPWWexGP
We have a Java project (dl4j) that calls a C++ library which uses VEDA.
I am facing strange behaviour with VEDA; I don't know whether I'm using it incorrectly or whether it's something else.
Each time on exit I get this annoying lock/wait, with the messages below:
[VH] [TID 3480243] ERROR: wait_req_ack() timeout waiting for ACK req=1604
[VH] [TID 3480243] ERROR: close() child sent no ACK to EXIT. Killing it.
In addition, when I run my inference more than once, I get an error for a method call that works just fine in a single inference session:
[VE] ERROR: sigactionHandler() Interrupt signal 11 received
0x600fffe00000
0x60001004cf40 -> (null)
0x600c01583b60 -> __vthr$_pcall_va
[VH] [TID 3611133] ERROR: unpack_call_result() VE exception 11
�@$+U?�^\s�?(*f@xϾ?�v�A]
[VH] [TID 3611133] ERROR: _progress_nolock() Internal error on executing a command(-4)
[VEDA_ERROR_VEO_COMMAND_EXCEPTION] /home/qwr/veda/src/veda/Context.cpp (435)
I'm using VEDA this way: for now I have integrated your vednn library through VEDA, so it supports a few ops. The call chain is:
java -> javacpp -> nd4j C++ + VEDA -> device vednn lib
// VEDA handle class:
struct VedaHandle {
    function;
    module;
    ctx;
};

// Singleton:
struct Veda {
    std::list<VedaHandle> handles; // for now I'm using one device

    init() {
        // init and load the libraries per device,
        // then also pop the context:
        vedaCtxPopCurrent(&ctx);
        // this way I work around the vedaExit problem I had
        // because of the thread_local std::list
    }
    exit() { vedaExit(); }
};
Here is how I'm calling it (for now I'm using device 0, with the context created in OMP mode):
vedaDevicePrimaryCtxRetain(&ctx, device);
vedaCtxPushCurrent(ctx);
veda Mem calls..
veda Launch
sync
vedaCtxPopCurrent(&ctx);
Instead of using reinterpret_cast-style type punning and enable_if,
overload vedaArgsSet for each type with its proper C setter (int, ..., float, double):
inline VEDAresult vedaArgsSet(VEDAargs args, const int idx, const int32_t value) {
return vedaArgsSetI32(args, idx, value);
}
inline VEDAresult vedaArgsSet(VEDAargs args, const int idx, const int64_t value) {
return vedaArgsSetI64(args, idx, value);
}
inline VEDAresult vedaArgsSet(VEDAargs args, const int idx, const float value) {
return vedaArgsSetF32(args, idx, value);
}
inline VEDAresult vedaArgsSet(VEDAargs args, const int idx, const double value) {
return vedaArgsSetF64(args, idx, value);
}
.....
....
template<typename T, typename... Args>
inline VEDAresult __vedaLaunchKernel(VEDAfunction func, VEDAstream stream, uint64_t* result, VEDAargs args, const int idx, const T value, Args... vargs) {
static_assert(!std::is_same<T, bool>::value, "Don't use bool as data-type when calling a VE function, as it is defined as 1B on VH and 4B on VE!");
vedaArgsSet(args, idx, value);
return __vedaLaunchKernel(func, stream, result, args, idx+1, vargs...);
}
I have been using delayed_malloc as inspiration for a simple test of my own. Instead of copying a char array, I copy two vectors/arrays, add them in a kernel, and write the result into a third array.
When delay-allocating this third array on the VE, and keeping the memcpy, memFree, and ctx synchronization in the same order as in the example, everything seems to work fine. So the order is:
VEDA(vedaMemcpyDtoHAsync(cpu_return, kern_sum, N*sizeof(int), stream));
VEDA(vedaMemFreeAsync(x_device, stream));
VEDA(vedaMemFreeAsync(y_device, stream));
VEDA(vedaMemFreeAsync(kern_sum, stream));
vedaCtxSynchronize();
Now, if I replace the device allocation with a pre-allocation, the returned array has garbage in its first few elements. The only way to get correct data back is to put the vedaCtxSynchronize() call before the vedaMemFreeAsync() calls.
Did I do something wrong, or is this a bug? Intuitively I would have put a sync() before a free() to begin with, but perhaps that misses the point of the vedaMemFreeAsync() functionality?
See attachment for a minimal test case.
Hi,
when using your injection, the MPI_&lt;lang&gt;_COMPILER variables are not set to a valid MPI compiler.
For instance
CMAKE_MINIMUM_REQUIRED(VERSION 3.9)
PROJECT(TEST C CXX)
FIND_PACKAGE(MPI REQUIRED)
message(${MPI_C_COMPILER})
returns /opt/nec/ve/ncc/3.0.8/bin/ncc/mpincc
as the path for the MPI C compiler, which is not a valid compiler path.
Best regards,
Severin
Is there any way to somehow embed the device library in the host binary and load it from within?
Thanks
Hi,
I noticed that with multiple MPI versions installed, VEDA doesn't always select the newest version when using the injection.
With versions 1.3.0, 2.0.0, 2.2.0, 2.3.0, 2.5.0, 2.7.0 and 2.10.0 installed, 2.7.0 ends up being selected (presumably because the version strings are compared lexicographically, so "2.7.0" sorts after "2.10.0"). It would be nice if VEDA would either select the newest MPI version or, even better, use the MPI version that is in PATH. (At RWTH Aachen we use a module system on the Aurora to switch between multiple compiler/MPI versions; it would be nice if VEDA took that into account.)
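The 2.7.0 pick is consistent with a plain lexicographic comparison of version strings, which a version-aware sort avoids (a sketch of the two orderings, not VEDA's actual selection code):

```shell
# Plain lexicographic sort picks 2.7.0 (the reported behavior):
printf '%s\n' 1.3.0 2.0.0 2.2.0 2.3.0 2.5.0 2.7.0 2.10.0 | sort | tail -1
# -> 2.7.0

# Version-aware sort picks the true newest:
printf '%s\n' 1.3.0 2.0.0 2.2.0 2.3.0 2.5.0 2.7.0 2.10.0 | sort -V | tail -1
# -> 2.10.0
```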
Best regards,
Severin