m4rs-mt / ilgpu.algorithms

The new standard algorithms library for ILGPU

License: Other

C# 100.00%

ilgpu.algorithms's Introduction

ILGPU.Algorithms (! MOVED !)

Please note that the ILGPU.Algorithms library has been merged with the ILGPU repository. Refer to the ILGPU repository (https://github.com/m4rs-mt/ILGPU) for updates and new releases.

ILGPU.Algorithms Library (pre 1.0 Version)

Real-world applications typically require a standard library and a set of standard algorithms that "simply work". The ILGPU Algorithms library meets these requirements by offering a set of auxiliary functions and high-level algorithms (e.g. sorting or prefix sum). All algorithms can be run on all supported accelerator types. The CPU accelerator support is especially useful for kernel debugging.

Build instructions

ILGPU.Algorithms requires Visual Studio 2019 or higher.

Make sure to init/update the ILGPU git submodule using git submodule update --init before building the algorithms library.

License information

ILGPU.Algorithms is licensed under the University of Illinois/NCSA Open Source License. Detailed license information can be found in LICENSE.txt.

License information of required dependencies

Different parts of ILGPU.Algorithms require different third-party libraries.

ILGPU.Algorithms Dependencies
- ILGPU (http://www.ilgpu.net)

Detailed copyright and license information of these dependencies can be found in LICENSE-3RD-PARTY.txt.

Credits

This work was supported by the Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI; German Research Center for Artificial Intelligence).


ilgpu.algorithms's Issues

ILGPU 0.8 compatibility issue

Hi,
I ran into an error with the latest ILGPU 0.8 release when using ILGPU.Algorithms.
Compatibility with ILGPU.Algorithms is broken because the method PTXCodeGenerator.AllocatePrimitive() is missing.

When can ILGPU provide a parallel-for syntax similar to the Alea GPU framework?

When can ILGPU provide a parallel-for function similar to the one in the Alea GPU framework, which supports implicit conversion between variables and ArrayView? For example:
Alea.Parallel.GpuExtension.For(gpu, 0, Points.Count, i =>
{
    xComponent[i] = xComponent[i] - minX;
    yComponent[i] = yComponent[i] - minY;
    zComponent[i] = zComponent[i] - minZ;
});

Expected behaviour of XMath.Rem

What is the expected behaviour of XMath.Rem?

Does it follow the behaviour of IEEERemainder?
Or does it follow the floating-point remainder operator?

The CPU accelerator is using Math.IEEERemainder.
It looks like PTXMath follows the remainder operator.
And the OpenCL remainder looks like it follows IEEERemainder behaviour.
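The two candidate behaviours can be compared on the CPU with plain .NET calls. This is only a sketch of the difference in question; the class and method names are made up for illustration and say nothing about what XMath.Rem is meant to do.

```csharp
using System;

// Hypothetical helper class, not part of ILGPU.
public static class RemainderDemo
{
    // Truncated remainder (the C# % operator): the result takes the sign
    // of the dividend.
    public static double Truncated(double x, double y) => x % y;

    // IEEE 754 remainder: x - y * n, where n is x / y rounded to the nearest
    // integer (ties to even). The result can be negative for positive inputs.
    public static double Ieee(double x, double y) => Math.IEEERemainder(x, y);
}
```

For x = 5 and y = 3 the truncated remainder is 2, while the IEEE remainder is -1 (5 / 3 rounds to 2, and 5 - 3 * 2 = -1), so a kernel can observe different values depending on which definition a backend uses.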

Add missing test cases for Scan, Reduce, Transform etc.

The current test projects do not cover the high-level functions Scan, Reduce, Transform etc. We should include additional test cases to cover these functions.

Custom reduce

I'm trying to write a custom reducer which basically should compute: y = x1*x1 + x2*x2 + etc.

The reducer code looks like:

public readonly struct MyReducer : IScanReduceOperation<int>
{
	public string CLCommand { get; }
	public int Identity { get => 0; }

	public int Apply(int first, int second)
	{
		return first + second * second;
	}
	//
	public void AtomicApply(ref int target, int value)
	{
		Atomic.Add(ref target, value);
	}
}

accl.Reduce<int, MyReducer>(accl.DefaultStream, buffer.View, target.View);

I worked out how to implement this by reading the library code, so it may be incorrect. It works when I test it with the CPU accelerator.

CPU: for buffer = [0, 1, 2, 3] it returns 14
Cuda: (GeForce card) it returns 6740
OpenCL: (Intel card) it crashes - exception message "An internal compiler error has been detected"

Could you help me write the custom reducer correctly?

Many thanks! :-)

PS: Finally a good GPU C# library :-)
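One possible explanation, offered as an assumption rather than a confirmed diagnosis: a parallel reduction combines partial results in an order that differs from a sequential loop, so the combine step has to be associative. The Apply above is not, because a partial result that arrives as the second argument gets squared again. The CPU-only sketch below uses hypothetical names and compares a left fold with a pairwise combination.

```csharp
using System;

// Hypothetical CPU-side illustration, not ILGPU code.
public static class ReduceOrderDemo
{
    // The combine step from the reducer above: not associative.
    public static int Apply(int first, int second) => first + second * second;

    // Sequential left fold starting from the identity 0.
    public static int FoldLeft(int[] data)
    {
        int acc = 0;
        foreach (int x in data) acc = Apply(acc, x);
        return acc;
    }

    // Pairwise (tree) combination over data[lo..hi), hi exclusive and hi > lo.
    // This is one of many orders a parallel reduction might use.
    public static int FoldTree(int[] data, int lo, int hi)
    {
        if (hi - lo == 1) return data[lo];
        int mid = lo + (hi - lo) / 2;
        return Apply(FoldTree(data, lo, mid), FoldTree(data, mid, hi));
    }

    // Associative alternative: square each element first, then add.
    public static int SumOfSquares(int[] data)
    {
        int acc = 0;
        foreach (int x in data) acc += x * x;
        return acc;
    }
}
```

For [0, 1, 2, 3] the left fold gives 14, but the pairwise order gives 122 (the partial result 2 + 3*3 = 11 is squared again). Squaring in a separate pass and then reducing with plain addition does not depend on the combination order.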

Feature request: Histograms and distinct lists

Histograms are quite commonly used for image analysis.
Creating distinct lists, i.e. sorted or unsorted lists in which every element of an arbitrary input list appears exactly once, is also useful for everything from lossless image encoding to data analysis.
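As a CPU-side reference for what is being requested, a minimal sketch for 8-bit values could look as follows. The names are hypothetical; a GPU version would need atomic increments or per-group partial histograms, which are not shown here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical CPU reference, not an ILGPU API.
public static class HistogramDemo
{
    // Counts how often each 8-bit value occurs.
    public static int[] Histogram(byte[] values)
    {
        var bins = new int[256];
        foreach (byte v in values) ++bins[v];
        return bins;
    }

    // Sorted distinct list derived from the histogram: every value that
    // occurs at least once, in ascending order.
    public static byte[] SortedDistinct(byte[] values)
    {
        int[] bins = Histogram(values);
        var result = new List<byte>();
        for (int i = 0; i < bins.Length; ++i)
            if (bins[i] > 0) result.Add((byte)i);
        return result.ToArray();
    }
}
```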

"AlgorithmsRadixSort" can't sort long properly, and also fails under CPU

If the array contains negative long values, the result looks like the following, with the negative values listed after the positive ones:
1,2,3,4,5,-5,-4,-3,-2,-1

It also throws an error if the accelerator is the CPU accelerator.
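The ordering in the output above is what one gets when signed 64-bit values are sorted by their raw bit patterns as if they were unsigned. Whether that is the actual cause in the library is an assumption. The sketch below (hypothetical names, CPU only) reproduces the pattern and shows the usual remedy of flipping the sign bit before sorting.

```csharp
using System;
using System.Linq;

// Hypothetical CPU-side illustration, not the library's radix sort.
public static class SignedRadixKeyDemo
{
    // Reinterpreting the bits as unsigned places negatives after positives.
    public static ulong RawKey(long value) => unchecked((ulong)value);

    // Flipping the sign bit makes the unsigned order match the signed order.
    public static ulong OrderedKey(long value) =>
        unchecked((ulong)value) ^ 0x8000000000000000UL;

    public static long[] SortByRawKey(long[] data) =>
        data.OrderBy(v => RawKey(v)).ToArray();

    public static long[] SortByOrderedKey(long[] data) =>
        data.OrderBy(v => OrderedKey(v)).ToArray();
}
```

Sorting the values -5..-1 and 1..5 by RawKey yields 1,2,3,4,5,-5,-4,-3,-2,-1, while sorting by OrderedKey yields -5,-4,-3,-2,-1,1,2,3,4,5.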

Cublas Scaling Linearly With Streams

Hi,

Just wondering whether cuBLAS is really using the streams correctly. The run time of the code below scales linearly with the number of streams (1 to 100), and it does so whether the matrix dimensions are 2,2,2 or 3000,200,200, so irrespective of load.


System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
sw.Start();
Parallel.ForEach(streamPackages, p =>
{
    using (var blas = new CuBlas(accelerator))
    {
        blas.Stream = p.Stream;
        blas.Gemm(CuBlasOperation.NonTranspose, CuBlasOperation.NonTranspose, m, n, k, 1, p.A, m, p.B, k, 0, p.C, m);
    }
});

// Wait for all streams to finish
foreach (var p in streamPackages)
{
    p.Stream.Synchronize();
}
accelerator.Synchronize();

sw.Stop();
System.Diagnostics.Debug.WriteLine(sw.ElapsedMilliseconds + "ms");

XMath.Pow, XMath.Exp etc. don't work

@m4rs-mt

If XMath.Pow (or Math.Pow), XMath.Exp etc. are used in a kernel, a "too many resources requested" error is thrown when there are too many threads (e.g. 1024 in my case). No error is thrown if I reduce the number of threads (say, to 256), but then the kernel hangs forever. There seems to be something wrong with these complex math functions in ILGPU.

Attached is code to reproduce the problem.

WindowsFormsApp1.zip

TanF, TanhF

Hi, I'm trying to add Tan, Tanh, Log and Exp to my code, but I get the error:

"The function 'TanF' does not have an intrinsic implementation for this backend. 'EnableAlgorithms' from the Algorithms library not invoked?"

EnableAlgorithms has been invoked, and Sin/Cos work.

I tried looking at the code for ILGPU.Algorithms, but getting it to compile is a challenge. From the CUDA documentation, however, the function seems to be natively supported:

// https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__SINGLE.html
__device__ float tanf ( float x )
Calculate the tangent of the input argument.

cheers,
/m

Min reduction on unsigned integer produces incorrect result.

When using the Reduce algorithm with the MinUInt32 or MinUInt64 reduction, the output is not the expected value.

The following code should output Reduced[0] = 0, but instead outputs Reduced[0] = 4294967295.

using ILGPU;
using ILGPU.Algorithms;
using ILGPU.Algorithms.ScanReduceOperations;
using ILGPU.Algorithms.Sequencers;
using ILGPU.Runtime.OpenCL;
using System;
using System.Linq;

namespace AlgorithmsReduce
{
    class Program
    {
        static void Main()
        {
            using var context = new Context();
            context.EnableAlgorithms();
            using var accelerator = new CLAccelerator(context, CLAccelerator.CLAccelerators.First());
            using var buffer = accelerator.Allocate<uint>(64);
            using var target = accelerator.Allocate<uint>(1);

            accelerator.Sequence(accelerator.DefaultStream, buffer.View, new UInt32Sequencer());
            accelerator.Reduce<uint, MinUInt32>(
                accelerator.DefaultStream,
                buffer.View,
                target.View);

            var data = target.GetAsArray(accelerator.DefaultStream);
            for (int i = 0, e = data.Length; i < e; ++i)
                Console.WriteLine($"Reduced[{i}] = {data[i]}");
        }
    }
}

NOTE: Only applies to the OpenCL accelerator, affecting ILGPU.Algorithms v0.9.2 and v0.10.0-beta1.
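For comparison, a CPU-side reference of the expected result is sketched below. It assumes the sequencer fills the buffer with 0, 1, 2, ..., 63; the helper names are hypothetical. The reported value 4294967295 equals uint.MaxValue, the identity of a min reduction over uint, which would be consistent with the identity surviving the reduction, although that is not confirmed here.

```csharp
using System;

// Hypothetical CPU reference, not ILGPU code.
public static class MinReduceReference
{
    // Identity of min over uint: combining it with any value returns that value.
    public const uint Identity = uint.MaxValue; // 4294967295

    public static uint Reduce(uint[] data)
    {
        uint acc = Identity;
        foreach (uint x in data) acc = Math.Min(acc, x);
        return acc;
    }

    // Stand-in for the sequence 0, 1, 2, ... assumed for the report above.
    public static uint[] Sequence(int length)
    {
        var result = new uint[length];
        for (int i = 0; i < length; ++i) result[i] = (uint)i;
        return result;
    }
}
```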

Delete the i-th element of an ArrayView

Hi @m4rs-mt,

I am wondering whether there is a way to delete the i-th element of an ArrayView. For instance, my method scale() converts the captured data frame information to meters. However, the returned value can also be (0,0,0). In this case, I want to delete the i-th element so that this zero entry is not forwarded to the CPU. I couldn't find a delete function. Is there any workaround?

private static void ApplyKernel(
    Index index, /* The global thread index (1D in this case) */
    ArrayView<CameraSpacePoint> pixelArray, /* A view to a chunk of memory (1D in this case) */
    ArrayView<Point3d> pixelArray_pt /* A view to a chunk of memory (1D in this case) */
    )
{
    Point3d tmp = scale(pixelArray[index]);
    if (XMath.Abs(tmp.X) > 0.0001 || XMath.Abs(tmp.Y) > 0.0001 || XMath.Abs(tmp.Z) > 0.0001)
    {
        pixelArray_pt[index] = tmp;
    }
    else
    {
        pixelArray_pt.delete(index);
    }
}
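A view has a fixed length, so the usual workaround for this kind of request is stream compaction: mark the elements to keep, compute each survivor's output slot with an exclusive prefix sum, and scatter the survivors into a smaller buffer. The sketch below shows the idea on the CPU with plain arrays. The names are hypothetical, and it is not an ILGPU kernel.

```csharp
using System;

// Hypothetical CPU-side illustration of stream compaction.
public static class CompactionDemo
{
    // Keeps only the elements for which keep(element) is true, preserving order.
    public static T[] Compact<T>(T[] input, Func<T, bool> keep)
    {
        // Step 1: an exclusive prefix sum over the keep flags gives each
        // surviving element its slot in the output.
        var offsets = new int[input.Length];
        int count = 0;
        for (int i = 0; i < input.Length; ++i)
        {
            offsets[i] = count;
            if (keep(input[i])) ++count;
        }

        // Step 2: scatter the surviving elements to their slots.
        var output = new T[count];
        for (int i = 0; i < input.Length; ++i)
        {
            if (keep(input[i])) output[offsets[i]] = input[i];
        }
        return output;
    }
}
```

On a GPU, the first step would map to a scan over the flags and the second to a kernel that writes each kept element to its computed offset.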

Problem with Algorithm.ScanExtensions

The "ScanInclusive" algorithm is not stable. If the input array has more than 20000 elements, repeating the inclusive scan on the same input several times can give different results, so it can't be used for scanning large arrays. Sample code is attached. I am using ILGPU beta2 and an Nvidia Quadro P1000.

SimpleStructures.zip

System.Numerics.Matrix4x4.Decompose intrinsic support

Matrix4x4.Decompose cannot be compiled because its implementation contains CPU-only IL instructions. Could you consider adding an intrinsic implementation for Matrix4x4.Decompose?

Question: Is it necessary to call context.EnableAlgorithms() to use the algorithm library in the kernel function?

Use Tanh PTX intrinsic on SM_75 or higher

The Tanh PTX intrinsic for float32 is available on SM_75 or higher. ILGPU.Algorithms should be modified to NOT register XMath.Tanh(float) as a replacement intrinsic.

Is there a way to call Cub's DeviceScan functionality directly?

Based on https://nvlabs.github.io/cub/structcub_1_1_device_scan.html vs. https://nvlabs.github.io/cub/structcub_1_1_device_reduce.html, Nvidia's DeviceScan appears to get about 40% of DeviceReduce's speed. On my graphics card, using ILGPU, I'm seeing Scan get about 1/6 of Reduce's speed. Is a library used for ILGPU's Scan, or is it your own code? Is Cub's DeviceScan something you'd consider making available if it isn't already?

Cuda XMath Floor/Ceiling/Truncate functions incorrectly handle large values

On the Cuda Accelerator, the XMath.Floor and XMath.Ceiling functions incorrectly handle the edge cases PositiveInfinity, NegativeInfinity and NaN. They also fail on large numbers (e.g. 2^53).

NB: This also affects the XMath.Truncate function as it uses these internally.

This is currently being tracked in #44.
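A regression check for these cases could compare the accelerator output against the CPU results of the corresponding .NET functions. The sketch below records only the CPU side for the values named above; the class name is hypothetical and no accelerator code is shown.

```csharp
using System;

// Hypothetical CPU reference for the edge cases listed in the report.
public static class FloorEdgeCases
{
    // Returns true if the .NET reference functions behave as expected:
    // infinities and NaN pass through, and 2^53 is already an integer.
    public static bool CpuReferenceHolds()
    {
        double big = 9007199254740992.0; // 2^53
        return double.IsPositiveInfinity(Math.Floor(double.PositiveInfinity))
            && double.IsNegativeInfinity(Math.Ceiling(double.NegativeInfinity))
            && double.IsNaN(Math.Floor(double.NaN))
            && double.IsNaN(Math.Ceiling(double.NaN))
            && Math.Floor(big) == big
            && Math.Ceiling(big) == big
            && Math.Truncate(-big) == -big;
    }
}
```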