
Comments (12)

Epliz commented on June 28, 2024

(not an AMD employee)
Your first proposal seems like a good idea for hardware that doesn't have instructions for them (some architectures do have instructions for these atomics).

The second one is maybe best left to the application developers who will know if there will be memory address conflicts?
It might also be good to consider that things seem to get more complicated if the return value of the atomic function is actually used.
Eventually it might however be a good idea to have a compiler optimization for the case where the address is known to be uniform and, in that case, add the warp reduction first, maybe for the simple case where the return value is not used.


Epliz commented on June 28, 2024

Maybe interesting to @arsenm from a compiler perspective, who can perhaps recommend someone to have a look at the HIP implementation of the atomics?


FreddieWitherden commented on June 28, 2024

The second one is maybe best left to the application developers who will know if there will be memory address conflicts?

This would be a reasonable position if HIP exposed the necessary primitives for application developers to implement the functionality efficiently themselves. For example, on NVIDIA hardware, one would do something along the lines of:

__device__ int lane_id() {
    return threadIdx.x % warpSize;           // lane index within the warp
}

__device__ int atomicAggInc(int *ptr) {
    // partition the active lanes by the address they are targeting
    int mask = __match_any_sync(__activemask(), (unsigned long long)ptr);
    int leader = __ffs(mask) - 1;            // select a leader for each group
    int res;
    if (lane_id() == leader)                 // the leader does the update
        res = atomicAdd(ptr, __popc(mask));
    res = __shfl_sync(mask, res, leader);    // get the leader's old value
    return res + __popc(mask & ((1 << lane_id()) - 1)); // compute this lane's old value
}

Here, most of the heavy lifting is done by the SM7.0+ primitive __match_any_sync which partitions threads in the warp according to their address. This means in the case of no conflicts the overhead from aggregation is only a couple of extra instructions (insignificant compared with the latency associated with atomic operations). HIP, however, does not expose such functionality and so you're guaranteed a sub-optimal code path on recent NVIDIA hardware if you implement it yourself. Similarly, there may be a means of efficiently implementing such operations on AMD GPUs using device specific instructions which are not currently exposed through HIP.


Epliz commented on June 28, 2024

I don't know if for CUDA there are really special SASS instructions for the __match_any builtins, but they can be implemented without them, as shown in an answer at https://stackoverflow.com/questions/59879285/whats-the-alternative-for-match-any-sync-on-compute-capability-6 .
On AMD GPUs, they could be implemented like that as well.
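
Roughly, such an emulation could look like the following sketch (my own illustration with ballots and shuffles, hypothetical helper name, not the linked answer copied verbatim):

__device__ unsigned match_any_emulated(unsigned activemask, unsigned long long value) {
  unsigned result = 0;
  unsigned remaining = activemask;           // lanes not yet assigned to a group
  while (remaining) {
    int leader = __ffs(remaining) - 1;       // lowest lane not yet matched
    unsigned long long leader_val = __shfl_sync(activemask, value, leader);
    unsigned group = __ballot_sync(activemask, value == leader_val);
    if (value == leader_val)
      result = group;                        // every lane of the group gets the same mask
    remaining &= ~group;
  }
  return result;
}

With something like that in place, the atomicAggInc above can still be written for pre-Volta parts, just with a higher instruction count.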


FreddieWitherden commented on June 28, 2024

I don't know if for CUDA there are really special SASS instructions for the __match_any builtins, but they can be implemented without them, as shown in an answer at https://stackoverflow.com/questions/59879285/whats-the-alternative-for-match-any-sync-on-compute-capability-6 . On AMD GPUs, they could be implemented like that as well.

It reduces to the PTX match.any.sync.b64 which, on SM70, is turned into the SASS MATCH.ANY.U64 <RA>, <RB>. No reason on a recent NVIDIA GPU that one should be emulating this with a for loop.


arsenm commented on June 28, 2024

Tricks like turning FP atomics into integer atomics would be best done in the backend (though FP atomicrmw does not have fast math flags, so that's a bit annoying).

We do have a pass which tries to rewrite uniform atomics to do a single atomic op and a reduction in the wave. It's currently not enabled by default.
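
In source form, that kind of rewrite would look roughly like the following sketch (illustrative only, hypothetical function name, assuming a fully active wavefront and an unused return value; this is not the pass's actual output):

__device__ void waveUniformAtomicAdd(int *ptr, int v) {
  int sum = v;
  for (int offset = warpSize / 2; offset > 0; offset /= 2)
    sum += __shfl_down(sum, offset);         // tree reduction across the wavefront
  if ((threadIdx.x % warpSize) == 0)
    atomicAdd(ptr, sum);                     // one atomic per wavefront instead of one per lane
}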


Epliz commented on June 28, 2024

Hi @FreddieWitherden ,

I would be interested in seeing what you think about the results you get with a benchmark like the following:

#include <hip/hip_runtime.h>
#include <hip/hip_runtime_api.h>

#include <iostream>
#include <chrono>

#define DIV_ROUND_UP(a, b) (((a) + (b) - 1) / (b))

inline void gpuAssert(hipError_t code, const char *file, int line, bool abort=true)
{
  if (code != hipSuccess) {
	  std::cerr<<"GPUassert: "<<hipGetErrorString(code)<<" "<<file<<" "<<line<<std::endl;
    if (abort) exit(code);
  }
}
#define hipCheckError(ans) { gpuAssert((ans), __FILE__, __LINE__); }


template<typename T>
__global__
void vecadd(T* a, T* counters, int N, int s) {
  int globalThreadId = blockIdx.x * blockDim.x + threadIdx.x;
  if (globalThreadId < N) {
    T retValue = atomicAdd(&counters[globalThreadId % s], T(1));
    a[globalThreadId] = retValue; // to force variants that return values
  }
}

template<typename T>
__global__
void vecmin(T* a, T* counters, int N, int s) {
  int globalThreadId = blockIdx.x * blockDim.x + threadIdx.x;
  if (globalThreadId < N) {
    T retValue = atomicMin(&counters[globalThreadId % s], -T(globalThreadId));
    a[globalThreadId] = retValue; // to force variants that return values
  }
}

__device__
float customAtomicMin(float* addr, float value) {
  // IEEE-754 floats order like sign-magnitude integers: for a non-negative value a signed
  // integer atomicMin matches the float minimum, while for a negative value the bit-pattern
  // ordering is reversed, so an unsigned integer atomicMax is used instead.
  float old;
  old = !signbit(value) ? __int_as_float(atomicMin((int*)addr, __float_as_int(value))) :
      __uint_as_float(atomicMax((unsigned int*)addr, __float_as_uint(value)));

  return old;
}

__device__
int customAtomicMin(int* addr, int value) {
  return atomicMin(addr, value);
}

template<typename T>
__global__
void vecminCustom(T* a, T* counters, int N, int s) {
  int globalThreadId = blockIdx.x * blockDim.x + threadIdx.x;
  if (globalThreadId < N) {
    T retValue = customAtomicMin(&counters[globalThreadId % s], -T(globalThreadId));
    a[globalThreadId] = retValue; // to force variants that return values
  }
}

using namespace std::chrono;

template<typename F, typename R>
void bench_it(size_t num_warmups, size_t num_repeats, F func, R report_func) {

    auto start_warmup = high_resolution_clock::now();  
    for (int r = 0; r < num_warmups; r++) {
      func();
    }
    
    hipCheckError(hipDeviceSynchronize());
    auto stop_warmup = high_resolution_clock::now();
    auto duration_warmupns = (double) duration_cast<nanoseconds>(stop_warmup - start_warmup).count();
    std::cout <<"time warmup: "<< duration_warmupns / 1000000.0 << "ms"<< std::endl;

    // measurement  
    auto start = high_resolution_clock::now();  
  
    for (int r = 0; r < num_repeats; r++) {
      func();
    }

    hipCheckError(hipDeviceSynchronize());
    auto stop = high_resolution_clock::now();
    auto durationns = duration_cast<nanoseconds>(stop - start);
    
    report_func((double) durationns.count() / num_repeats);
}

enum bench_data_type {
  FLOAT = 0,
  INT = 1
};

enum bench_op {
  OP_ADD = 0,
  OP_MIN = 1
};


template<typename T>
void run_benchmark(int N, int s, bench_op op, bool custom) {
  std::cout<<"stride: "<<s<<std::endl;
  
  if (op == OP_ADD) {
    std::cout<<"Testing add"<<std::endl;
  } else if (op == OP_MIN) {
    std::cout<<"Testing min"<<std::endl;
  }
  if (custom) {
    std::cout<<"custom approach if available"<<std::endl;
  }

  size_t size = N * sizeof(T);
  size_t size_counters = s * sizeof(T);

  T* a_device;
  hipCheckError(hipMalloc((void**) &a_device, size));
  hipCheckError(hipMemset((void*)a_device, 0, size));

  T* counters_device;
  hipCheckError(hipMalloc((void**) &counters_device, size_counters));
  hipCheckError(hipMemset((void*)counters_device, 0, size_counters));

  bench_it(
    /*warmups*/ 10,
    /*repetitions */ 100,
    /*function*/ [&]() {
      size_t threads_per_block = 64;
      size_t num_blocks = DIV_ROUND_UP(N, threads_per_block);
      dim3 block_dim(threads_per_block);
      dim3 grid_dim(num_blocks);
      
      if (op == OP_ADD) {
        vecadd<T><<<grid_dim, block_dim, 0, 0>>>(a_device, counters_device, N, s);
      } else if (op == OP_MIN) {
        if (custom) {
          vecminCustom<T><<<grid_dim, block_dim, 0, 0>>>(a_device, counters_device, N, s);
        } else {
          vecmin<T><<<grid_dim, block_dim, 0, 0>>>(a_device, counters_device, N, s);
        }
      }
      
      hipMemsetAsync((void*)counters_device, 0, size_counters, 0);
      
    },
    /*report function*/[=](double durationns){
      double bandwidth = (size) / durationns;
      std::cout <<"time: "<< durationns / 1000000.0 << "ms"<< std::endl;
      std::cout <<"bandwidth: "<< bandwidth << "GB/s"<< std::endl;
    });

  hipFree(counters_device);
  
  hipFree(a_device);
}

int main(int argc, char** argv) {

  // size, type
  if (argc < 6) {
    std::cout<<"Missing argumemts. Usage: <size> <stride> <op> <type> <custom>"<<std::endl;
    return -1;
  }
  
  size_t N = atoi(argv[1]);
  size_t s = atoi(argv[2]);
  bench_op op = (bench_op) atoi(argv[3]);
  bench_data_type type = (bench_data_type) atoi(argv[4]);
  bool custom = atoi(argv[5]);
  
  int num_devices;
  hipGetDeviceCount(&num_devices);
  
  std::cout<<"Num devices: "<<num_devices<<std::endl;
  
  for (int d = 0; d < num_devices; d++) {
    std::cout<<"Device "<<d<<std::endl;
    hipSetDevice(d);
    
    if (type == FLOAT) {
      std::cout<<"Testing float"<<std::endl;
      run_benchmark<float>(N, s, op, custom);
    } else if (type == INT) {
      std::cout<<"Testing int"<<std::endl;
      run_benchmark<int>(N, s, op, custom);
    } else {
      std::cout<<"unknown type"<<std::endl;
      return -1;
    }
  }
  return 0;
}

What I observe when running this on a gfx90c iGPU (Ryzen 5800U) is the following:

  • for the add operation, all operations need to be performed no matter what. When a CAS loop is needed (for float) it is ridiculously slow, while a dedicated instruction (for ints) is OK. For these, a local reduction is probably a good idea in all cases, as the L2 cache atomic units are the bottleneck and it would be acceptable to spend some instructions in the core to reduce the number of requests to the L2.
  • for the min operation, CAS can actually be better than playing the integer atomic instruction trick, presumably because far fewer atomic operations are forwarded to the L2 cache: the core never issues the CAS for threads that cannot lower the value, so the CAS loop already does the local reduction to some extent!

These observations seem to correspond to the ones in section 5 of https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p0493r3.pdf : atomic adds are "read-modify-write", but atomic min/max are "read-and-conditional-store".
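
In code, that "read-and-conditional-store" behaviour of the CAS path looks roughly like this sketch (hypothetical helper, NaNs ignored): the store is only attempted while the lane's value would actually lower the current contents, so many lanes issue no atomic at all.

__device__ float atomicMinFloatCAS(float* addr, float value) {
  float old = *addr;
  while (value < old) {
    int assumed = __float_as_int(old);
    int prev = atomicCAS((int*)addr, assumed, __float_as_int(value));
    if (prev == assumed) break;              // our value was stored
    old = __int_as_float(prev);              // someone else updated it first, re-check
  }
  return old;
}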

So for atomic min/max, it might not be better to simply re-implement them with the approach from your first message: CAS might be better. But maybe combining that with the local reduction from your later message would beat CAS?
I will try that next, when time permits.

EDIT: fixed code to reset counters between runs so that work happens


FreddieWitherden commented on June 28, 2024

So for atomic min/max, it might not be better to just re-implement them with the approach in your first message: CAS might be better.

I think the issue is more subtle than that. Everything you said about a CAS loop for float min/max (avoiding the write if our local value is larger/smaller than the one we just read in from global memory) is also true for int/uint min/max. Thus, if you repeated your experiments with an array of integers I think you would find a CAS based approach outperforming HIP's native functions.

Here, there are two possibilities (i) that the HIP min/max functions for integers are implemented poorly on your platform, or (ii) that they've been optimised for a different level of contention than your test program exhibits. The first case is certainly possible given most HIP development is focused on CDNA rather than GCN and RDNA.


Epliz commented on June 28, 2024

I had a bug in my benchmark where no work was actually happening in later iterations, because the counters being modified were never reset.

After fixing that, if I run my benchmark and vary the contention by changing the stride parameter, we can see that CAS is only better at the highest level of contention (stride == 1).
Here are some results for the MI100 (CDNA1 architecture), the RX 6700XT (RDNA2 architecture) and the Ryzen 5800U iGPU:

Array size 4194304; bandwidth in GB/s for each stride:

| GPU | Method | 1 | 4 | 16 | 64 | 256 | 1024 | 2048 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MI100 | CAS | 8.95 | 9.33 | 10.20 | 16.19 | 34.50 | 77.62 | 148.0 |
| MI100 | Integer atomics | 5.797 | 22.80 | 27.67 | 87.19 | 166.9 | 187.9 | 187.7 |
| RX 6700XT | CAS | 10.52 | 31.63 | 65.37 | 55.45 | 91.57 | 96.86 | 100.0 |
| RX 6700XT | Integer atomics | 10.07 | 39.06 | 95.43 | 95.42 | 123.8 | 119.6 | 119.7 |
| Ryzen 5800U | CAS | 2.75 | 11.51 | 12.79 | 20.73 | 25.86 | 27.90 | 30.73 |
| Ryzen 5800U | Integer atomics | 7.42 | 23.80 | 28.25 | 40.45 | 41.24 | 40.95 | 35.27 |

So indeed, it might be better to implement atomic min/max that way.


FreddieWitherden commented on June 28, 2024

I'll note that these ideas also extend to double precision, with one catch: HIP does not provide long long versions of atomicMin/atomicMax. We get unsigned long long, but not the signed versions, which is strange, as CUDA provides them, as do AMD themselves under OpenCL.
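
Concretely, the double-precision analogue of the customAtomicMin used above would be something along these lines (a sketch only; the non-negative branch is exactly where the missing signed long long atomicMin is needed, so today it only compiles as CUDA):

__device__ double customAtomicMin(double* addr, double value) {
  double old;
  old = !signbit(value) ?
      // needs atomicMin(long long*, long long), which HIP does not currently provide
      __longlong_as_double(atomicMin((long long*)addr, __double_as_longlong(value))) :
      // the unsigned long long atomicMax is available
      __longlong_as_double((long long)atomicMax((unsigned long long*)addr,
                                                (unsigned long long)__double_as_longlong(value)));
  return old;
}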


Epliz commented on June 28, 2024

Given that it is available in CUDA, if you create a ticket just for this, they will probably fix it.


Epliz commented on June 28, 2024

@FreddieWitherden , if you are interested, here is an example implementation of __match_any that should correspond to what you need to implement the atomicAggInc from your earlier message:

// the compiler miscompiles without this in many cases as it doesn't understand
// when it can or cannot hoist calls
// cf. https://github.com/llvm/llvm-project/issues/62477
static inline __device__ int  __optimization_barrier_hack(int in_val)
{
    int out_val;
    __asm__ volatile ("" : "=v"(out_val) : "0"(in_val));
    return out_val;
}

static inline __device__ uint64_t __match_any(int value) {
  value = __optimization_barrier_hack(value);
  bool active = true;
  uint64_t result_mask = 0;

  while (active) {
    // determine what threads have the same value as the currently first active thread
    int first_active_value = __builtin_amdgcn_readfirstlane(value);
    int predicate = (value == first_active_value);
    uint64_t mask = __ballot(predicate);

    // if the current thread has the same value, set its result mask to the current one
    if (predicate) {
      result_mask |= mask;
      active = false;
    }
  }

  return result_mask;
}

Such an intrinsic corresponds to "WaveMatch" in HLSL shader model 6.5 (https://microsoft.github.io/DirectX-Specs/d3d/HLSL_ShaderModel6_5.html#wavematch-function), supported by both Nvidia and AMD GPUs (Vega+ as far as I can tell).
The implementation above is what I could reconstruct from looking at some compiled HLSL code that uses WaveMatch.

I think all the warp intrinsics supported on Nvidia GPUs could be supported on AMD GPUs as well. It is quite surprising to me that they have not all been implemented yet.
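
Building on it, an aggregated atomicAdd along the lines of your earlier CUDA snippet could then look roughly like the following untested sketch (hypothetical name; assumes wave64 and, for brevity, matches only on the low 32 bits of the pointer, which is fine as long as the addresses touched by one wavefront do not alias modulo 2^32):

__device__ int atomicAggIncAMD(int *ptr) {
  uint64_t mask = __match_any((int)(size_t)ptr);            // lanes targeting the same address
  unsigned lane = __lane_id();                              // this lane's index in the wavefront
  unsigned leader = __ffsll((unsigned long long)mask) - 1;  // lowest lane of the group

  int old;
  if (lane == leader)
    old = atomicAdd(ptr, (int)__popcll(mask));              // one atomic per address group
  old = __shfl(old, (int)leader);                           // broadcast the leader's old value
  return old + (int)__popcll(mask & ((1ull << lane) - 1));  // this lane's rank within the group
}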

Best regards,
Epliz

