
tal_sh's People

Contributors

ajaypanyala, alexkiryushkin, dmitrylyakh, miroi


tal_sh's Issues

Conjugation of the right argument in tensor contraction

It turns out this is caused by incorrect handling of conjugation of the second argument in a tensor contraction. It only happens when TAL-SH's own BLAS implementation is used; when the code calls an external standard BLAS, everything is fine.
Conjugation on the first argument gives correct results, which yields the following work-around (the first call is the old code that triggers the error, the second is the work-around that gives correct results; the same swap via the C API is sketched after the list). Hope this gives a clue for the fix.

  •          ierr=talsh_tensor_contract("B(p,q)+=A(i,q)*M+(i,p)",mo_tensor,ao_tensor,mocoef_alpha)
    
  •          ierr=talsh_tensor_contract("B(p,q)+=M+(i,p)*A(i,q)",mo_tensor,mocoef_alpha,ao_tensor)
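
For reference, a minimal sketch of the same swap through the TAL-SH C API; the call signature mirrors the talshTensorContract call in the AMD GPU issue further below, and the tensor handles (mo_tensor, mocoef_alpha, ao_tensor) as well as errc, dev_num, dev_kind and task are assumed to be set up as in that example:

 //Fails with the internal BLAS: the conjugated tensor M+ is the second argument:
 errc=talshTensorContract("B(p,q)+=A(i,q)*M+(i,p)",&mo_tensor,&ao_tensor,&mocoef_alpha,
                          1.0,0.0,dev_num,dev_kind,COPY_MTT,YEP,&task);
 //Work-around: swap the arguments so the conjugated tensor comes first:
 errc=talshTensorContract("B(p,q)+=M+(i,p)*A(i,q)",&mo_tensor,&mocoef_alpha,&ao_tensor,
                          1.0,0.0,dev_num,dev_kind,COPY_MTT,YEP,&task);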
    

I can update you on the details at your regular bi-weekly meeting if you can make that one (30 minutes from now).

Memory leak

==3498902==ERROR: AddressSanitizer: heap-use-after-free on address 0x606001bd7d28 at pc 0x000000553e3c bp 0x7fffffff8c50 sp 0x7fffffff8c48
READ of size 4 at 0x606001bd7d28 thread T0
#0 0x553e3b in tensShape_volume /home/users/coe0014/src/TAL_SH/tensor_algebra_gpu.cpp:781
#1 0x58227f in talsh_tensor_c_dissoc /home/users/coe0014/src/TAL_SH/talshc.cpp:367
#2 0x598fb6 in talshTensorPlace /home/users/coe0014/src/TAL_SH/talshc.cpp:3418
#3 0x42cc28 in test_talsh_c /home/users/coe0014/src/TAL_SH/test.cpp:158
#4 0x405d05 in MAIN__ /home/users/coe0014/src/TAL_SH/main.F90:104
#5 0x429a82 in main /home/users/coe0014/src/TAL_SH/main.F90:212
#6 0x15554ff946a2 in __libc_start_main (/lib64/libc.so.6+0x236a2)
#7 0x4059ed in _start (/home/users/coe0014/src/TAL_SH/test_talsh.x+0x4059ed)

0x606001bd7d28 is located 8 bytes inside of 64-byte region [0x606001bd7d20,0x606001bd7d60)
freed by thread T0 here:
#0 0x15555444c860 in __interceptor_free ../../../../cray-gcc-8.1.0-201806150759.6677a227493f2/libsanitizer/asan/asan_malloc_linux.cc:66
#1 0x556107 in tensBlck_destroy /home/users/coe0014/src/TAL_SH/tensor_algebra_gpu_nvidia.hip.cu:3026

previously allocated by thread T0 here:
#0 0x15555444cbe0 in __interceptor_malloc ../../../../cray-gcc-8.1.0-201806150759.6677a227493f2/libsanitizer/asan/asan_malloc_linux.cc:86
#1 0x55602d in tensBlck_create /home/users/coe0014/src/TAL_SH/tensor_algebra_gpu_nvidia.hip.cu:2997
#2 0x59870b in talshTensorPlace /home/users/coe0014/src/TAL_SH/talshc.cpp:3384
#3 0x42cc28 in test_talsh_c /home/users/coe0014/src/TAL_SH/test.cpp:158
#4 0x405d05 in MAIN__ /home/users/coe0014/src/TAL_SH/main.F90:104
#5 0x429a82 in main /home/users/coe0014/src/TAL_SH/main.F90:212
#6 0x15554ff946a2 in __libc_start_main (/lib64/libc.so.6+0x236a2)

SUMMARY: AddressSanitizer: heap-use-after-free /home/users/coe0014/src/TAL_SH/tensor_algebra_gpu.cpp:781 in tensShape_volume
Shadow bytes around the buggy address:
0x0c0c80372f50: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c0c80372f60: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c0c80372f70: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c0c80372f80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c0c80372f90: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x0c0c80372fa0: fa fa fa fa fd[fd]fd fd fd fd fd fd fa fa fa fa
0x0c0c80372fb0: 00 00 00 00 00 00 00 fa fa fa fa fa 00 00 00 00
0x0c0c80372fc0: 00 00 00 00 fa fa fa fa fd fd fd fd fd fd fd fd
0x0c0c80372fd0: fa fa fa fa fd fd fd fd fd fd fd fd fa fa fa fa
0x0c0c80372fe0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c0c80372ff0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==3498902==ABORTING
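
A hedged reading of the trace (not a verified diagnosis): within the talshTensorPlace() call at test.cpp:158, the tensBlck_t created by tensBlck_create (talshc.cpp:3384) is freed by tensBlck_destroy, yet talsh_tensor_c_dissoc (talshc.cpp:367) still reads its shape through tensShape_volume (tensor_algebra_gpu.cpp:781). The triggering call, as it appears in the slimmed-down test further below, is the move of the result tensor back to Host:

 errc=talshTensorPlace(&tens0,0,DEV_HOST,NULL,COPY_M); //test.cpp:158 in the trace above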

aggregate TALSH stats

Currently, calling talshstats() at the end of a large multi-node run (using all available GPUs per node) dumps the stats for every GPU. Can we have a mechanism to aggregate the stats (flop count, etc.) instead of dumping them for every GPU? Something along the lines of the MPI reduction sketched below would cover our use case.
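
A minimal sketch of the kind of aggregation we have in mind; note that talsh_local_flop_count() is a hypothetical per-process getter, not an existing TAL-SH call:

 #include <mpi.h>
 #include <cstdio>

 double talsh_local_flop_count(); //hypothetical: flops accumulated by this process's GPUs

 void print_aggregate_talsh_stats()
 {
  double local_flops=talsh_local_flop_count();
  double total_flops=0.0;
  //Sum the per-process flop counts onto rank 0 and print a single aggregate line:
  MPI_Reduce(&local_flops,&total_flops,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
  int rank; MPI_Comm_rank(MPI_COMM_WORLD,&rank);
  if(rank == 0) printf(" Aggregate TAL-SH flop count = %E\n",total_flops);
 }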

Thanks
Ajay

Where can I find this function "tensor_block_scalar_value"

Hi,
I am reading talshf.F90 and I want to know where I can find the function tensor_block_scalar_value that is called at line 854:

talsh_update_f_scalar=TALSH_SUCCESS
         if(c_associated(tensF)) then
          call c_f_pointer(tensF,ftens)
          if(.not.tensor_block_is_empty(ftens,ierr)) then
           if(ierr.eq.0) then
            if(c_associated(gmem_p)) then
             val=tensor_block_scalar_value(ftens) ! <-- this function
             select case(data_kind)
             case(R4)
              call c_f_pointer(gmem_p,r4p); r4p=real(val,4); r4p=>NULL()
             case(R8)
              call c_f_pointer(gmem_p,r8p); r8p=real(val,8); r8p=>NULL()
             case(C4)
              call c_f_pointer(gmem_p,c4p); c4p=cmplx(real(val),imag(val),4); c4p=>NULL()
             case(C8)
              call c_f_pointer(gmem_p,c8p); c8p=val; c8p=>NULL()

Using this in C

Since I am kind of a newbie, I have problems linking to this library from my C code. After building it, I #include "talsh.h" in my C file and compile with gcc, e.g.: gcc test.c -L./path_to_lib -I./path_to_lib -ltalsh. But I then get errors about the C++ headers included by tensor_algebra.h, e.g. cstddef.

In file included from tensor_algebra.h:54,
from talsh.h:12,
from kram.c:2:
/usr/include/c++/13.2.1/cstddef:52:8: error: expected identifier or '(' before string constant
52 | extern "C++"
| ^~~~~
In file included from /usr/include/c++/13.2.1/cstdint:35,
from tensor_algebra.h:55:
/usr/include/c++/13.2.1/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support must be enabled with the -std=c++11
or -std=gnu++11 compiler options.
32 | #error This file requires compiler and library support
| ^~~~~

Compiling with e.g. -std=c++11 gives the same error (gcc still treats a .c file as C, so the flag does not take effect).
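
Is the intended usage to build the application with a C++ compiler instead? E.g. (a guess on my part; paths illustrative as above):

 g++ -std=c++11 test.c -I./path_to_lib -L./path_to_lib -ltalsh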
Greetings

build failure with tensor_block_pcontract_batch_dlf

To reproduce: set export BLASLIB ?= NONE in the Makefile (which puts -DNO_BLAS on the compile line) and run make:

gfortran -I. -I. -I. -c -fopenmp -O3 -I. -DWITH_LAPACK -DNO_GPU -DNO_AMD -DNO_PHI -DNO_BLAS -DLINUX -fPIC tensor_algebra_cpu.F90 -o ./OBJ/tensor_algebra_cpu.o

tensor_algebra_cpu.F90:287:47:

  287 |         public tensor_block_pcontract_batch_dlf !batched version of tensor_block_pcontract_dlf
      |                                               1
Error: Symbol ‘tensor_block_pcontract_batch_dlf’ at (1) has no IMPLICIT type; did you mean ‘tensor_block_compatible’?
make: *** [Makefile:520: OBJ/tensor_algebra_cpu.o] Error 1

It looks like -DNO_BLAS compiles out the definition of tensor_block_pcontract_batch_dlf while the public declaration at line 287 remains, leaving the symbol without a type.

contraction errors when using C8 types on AMD GPUs

I see error messages when contracting tensors that are of type complex double (C8) on AMD GPUs.

#MESSAGE: Printing TAL-SH task info:
 Device kind -1: Error 106
#END OF MESSAGE

I consistently see this error with ROCm versions 4.5.0, 4.5.2, and 5.1.0.

Below is a slimmer version of test.cpp that runs only the test_talsh_c routine, with the R8 occurrences changed to C8 to reproduce the error. It looks like the call to gpu_tensor_block_contract_dlf is where things go wrong: it returns a task error code > 0 when the tensor type is C8.

#include "talshxx.hpp"
#include "talsh.h"
#include "device_algebra.hip.h"

#include <iostream>
#include <memory>
#include <string>
#include <complex> 

#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <ctime>
#include <cassert>

void test_talsh_c(int * ierr)
{
 const int VDIM_SIZE=30; //virtual
 const int ODIM_SIZE=20; //occupied
 int errc;
 //size_t host_buffer_size=TALSH_NO_HOST_BUFFER;
 size_t host_buffer_size = 1024*1024*1024; //bytes
 int gpu_list[MAX_GPUS_PER_NODE];

 *ierr=0;

//Query the total number of NVIDIA GPUs on the node:
 int ngpu;
 errc=talshDeviceCount(DEV_NVIDIA_GPU,&ngpu); if(errc){*ierr=1; return;};
 printf(" Number of NVIDIA GPU found on node = %d\n",ngpu);

//Initialize TAL-SH (here with a 1 GB Host buffer, in which the tensors below are constructed):
 int host_arg_max;
 for(int i=0; i<ngpu; ++i) gpu_list[i]=i; //list of NVIDIA GPU devices to use in this process
 errc=talshInit(&host_buffer_size,&host_arg_max,ngpu,gpu_list,0,NULL,0,NULL);
 printf(" TAL-SH has been initialized: Status %d: Host buffer size = %lu\n",errc,host_buffer_size); if(errc){*ierr=2; return;};

//Allocate three tensor blocks in Host memory outside of TAL-SH (external application):
 //Tensor block 0:
 int trank0 = 4; //tensor block rank
 const int dims0[] = {VDIM_SIZE,VDIM_SIZE,ODIM_SIZE,ODIM_SIZE}; //tensor block dimension extents
 int trank1 = 4; //tensor block rank
 const int dims1[] = {VDIM_SIZE,VDIM_SIZE,VDIM_SIZE,VDIM_SIZE}; //tensor block dimension extents
 
 int trank2 = 4; //tensor block rank
 const int dims2[] = {ODIM_SIZE,VDIM_SIZE,ODIM_SIZE,VDIM_SIZE}; //tensor block dimension extents

 talsh_tens_t tens0; //declare a TAL-SH tensor block
 errc = talshTensorClean(&tens0); if(errc){*ierr=3; return;}; //clean TAL-SH tensor block object (default ctor)
 errc = talshTensorConstruct(&tens0,C8,trank0,dims0,talshFlatDevId(DEV_HOST,0),NULL,-1,NULL,0.0); //construct tensor block in Host buffer
 //errc = talshTensorConstruct(&tens0,C8,trank0,dims0,talshFlatDevId(DEV_HOST,0),(void*)tblock0); //register tensor block with external memory
 if(errc){*ierr=4; return;};
 size_t vol0 = talshTensorVolume(&tens0);
 //Tensor block 1:
 talsh_tens_t tens1; //declare a TAL-SH tensor block
 errc = talshTensorClean(&tens1); if(errc){*ierr=5; return;}; //clean TAL-SH tensor block object (default ctor)
 errc = talshTensorConstruct(&tens1,C8,trank1,dims1,talshFlatDevId(DEV_HOST,0),NULL,-1,NULL,0.001); //construct tensor block in Host buffer
 //errc = talshTensorConstruct(&tens1,C8,trank1,dims1,talshFlatDevId(DEV_HOST,0),(void*)tblock1); //register tensor block with external memory
 if(errc){*ierr=6; return;};
 size_t vol1 = talshTensorVolume(&tens1);
 //Tensor block 2:
 talsh_tens_t tens2; //declare a TAL-SH tensor block
 errc = talshTensorClean(&tens2); if(errc){*ierr=7; return;}; //clean TAL-SH tensor block object (default ctor)
 errc = talshTensorConstruct(&tens2,C8,trank2,dims2,talshFlatDevId(DEV_HOST,0),NULL,-1,NULL,0.01); //construct tensor block in Host buffer
 //errc=talshTensorConstruct(&tens2,C8,trank2,dims2,talshFlatDevId(DEV_HOST,0),(void*)tblock2); //register tensor block with external memory
 if(errc){*ierr=8; return;};
 size_t vol2 = talshTensorVolume(&tens2);
 double gflops = (sqrt(((double)(vol0))*((double)(vol1))*((double)(vol2)))*2.0)/1e9; //total number of floating point operations (GFlops)
 double theor_norm1 = gflops * 0.01 * 0.001 * 1e9;
 printf(" Three TAL-SH tensor blocks have been constructed: Volumes: %lu, %lu, %lu: GFlops = %f\n",vol0,vol1,vol2,gflops);

//Declare a TAL-SH task handle:
 talsh_task_t task0; //declare a TAL-SH task handle
 errc=talshTaskClean(&task0); //clean TAL-SH task handle object to an empty state
 if(errc){*ierr=9; return;};

//Execute a tensor contraction either on CPU (synchronously) or GPU (asynchronously):
#ifndef NO_GPU
 int dev_kind = DEV_NVIDIA_GPU; //NVIDIA GPU devices
 int dev_num = 0; //specific device number (any from gpu_list[])
#else
 int dev_kind = DEV_HOST; //CPU Host (multicore)
 int dev_num = 0; //CPU Host is always a single device (but multicore)
#endif
 //Schedule:
 clock_t tms = clock();
 errc=talshTensorContract("D(a,b,i,j)+=L(c,b,d,a)*R(j,d,i,c)",&tens0,&tens1,&tens2,2.0,0.0,dev_num,dev_kind,COPY_MTT,YEP,&task0);
 printf(" Tensor contraction has been scheduled for execution: Status %d\n",errc); if(errc){*ierr=10; return;};
 //Test for completion: 
 int sts,done=NOPE;
 while(done != YEP && errc == TALSH_SUCCESS){done=talshTaskComplete(&task0,&sts,&errc);}
 double tm = ((double)(clock() - tms))/CLOCKS_PER_SEC;
 if(errc == TALSH_SUCCESS){
  printf(" Tensor contraction has completed successfully: Status %d: Time %f sec\n",sts,tm);
 }else{
  printf(" Tensor contraction has failed: Status %d: Error %d\n",sts,errc);
  *ierr=11; return;
 }
 //Timing:
 double total_time;
 errc=talshTaskTime(&task0,&total_time); if(errc){*ierr=12; return;};
 printf(" Tensor contraction total time = %f: GFlop/s = %f\n",total_time,gflops/total_time);
 //Destruct the task handle:
 errc=talshTaskDestruct(&task0); if(errc){*ierr=13; return;};
#ifndef NO_GPU
 //If executed on GPU, COPY_MTT parameter in the tensor contraction call above means that the
 //destination tensor image was moved to GPU device (letter M means MOVE).
 //So, let's move it back to Host (to a user-specified memory location):
 errc=talshTensorPlace(&tens0,0,DEV_HOST,NULL,COPY_M); //this will move the resulting tensor block back to Host (letter M means MOVE)
 if(errc){*ierr=14; return;};
#endif
 printf(" Tensor result was moved back to Host: Norm1 = %E: Correct = %E\n",talshTensorImageNorm1_cpu(&tens0),theor_norm1);

//Unregister tensor blocks with TAL-SH:
 errc=talshTensorDestruct(&tens2); if(errc){*ierr=15; return;};
 errc=talshTensorDestruct(&tens1); if(errc){*ierr=16; return;};
 errc=talshTensorDestruct(&tens0); if(errc){*ierr=17; return;};
 printf(" Three external tensor blocks have been unregistered with TAL-SH\n");

//Shutdown TAL-SH:
 errc=talshShutdown();
 printf(" TAL-SH has been shut down: Status %d\n",errc); if(errc){*ierr=18; return;};

 return;
}

int main(int argc, char* argv[]) {
  int ierr=0;
  test_talsh_c(&ierr);
  return ierr; //propagate the test status instead of discarding it
}

Porting To SYCL

Are you interested in having a [SYCL](https://www.intel.com/content/www/us/en/developer/tools/oneapi/training/dpc-essentials.html#gs.bnjiaf) port of TAL_SH as a new backend?

With the SYCL backend, we'd like to extend the existing functionality of TAL_SH by enabling applications to leverage accelerator devices across the NVIDIA, AMD, and Intel vendor platforms.

add topics

I suggest adding the topics tensor, tensor-algebra, and linear-algebra in the About section.
