CUDSS.jl's Introduction

CUDSS.jl: Julia interface for NVIDIA cuDSS

Overview

CUDSS.jl is a Julia interface to the NVIDIA cuDSS library. NVIDIA cuDSS provides three factorizations for solving sparse linear systems on GPUs: LDU for general matrices, LDLᵀ (LDLᴴ in the complex case) for symmetric and Hermitian matrices, and LLᵀ (LLᴴ) for positive definite matrices. The examples below show how each factorization is selected through the structure argument of CudssSolver.

Why CUDSS.jl?

Unlike the CUDA libraries that are commonly bundled together, cuDSS is currently in preview and is therefore not included in CUDA.jl. To maintain consistency with the naming conventions used for the other CUDA library interfaces (CUBLAS, CUSOLVER, CUSPARSE, etc.), we have named this interface CUDSS.jl.

Installation

CUDSS.jl can be installed and tested through the Julia package manager:

julia> ]
pkg> add CUDSS
pkg> test CUDSS

Content

CUDSS.jl provides a structured approach for leveraging NVIDIA cuDSS functionalities. It introduces the CudssSolver type along with three core routines: cudss, cudss_set, and cudss_get. Additionally, specialized methods for the CuSparseMatrixCSR type have been incorporated for cholesky, ldlt, lu, and \. In-place variants, including cholesky!, ldlt!, lu!, and ldiv!, are also provided; they reuse the symbolic factorization as well as the allocated storage, which makes repeated solves of sparse linear systems on GPUs efficient.
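
For instance, the high-level interface can be used as follows; this is a minimal sketch assembled from the snippets that appear later on this page:

using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays, LinearAlgebra

T = Float64
n = 100
A_gpu = CuSparseMatrixCSR(sprand(T, n, n, 0.05) + I)
b_gpu = CuVector(rand(T, n))
x_gpu = CuVector(zeros(T, n))

F = lu(A_gpu)           # symbolic analysis + numerical factorization
ldiv!(x_gpu, F, b_gpu)  # solve Ax = b in place

A_gpu2 = rand(T) * A_gpu
lu!(F, A_gpu2)          # refactorize, reusing the symbolic analysis
x_gpu = F \ b_gpu       # allocating solve against the new factors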

Examples

Example 1: Sparse unsymmetric linear system with one right-hand side

using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays, LinearAlgebra

T = Float64
n = 100
A_cpu = sprand(T, n, n, 0.05) + I
x_cpu = zeros(T, n)
b_cpu = rand(T, n)

A_gpu = CuSparseMatrixCSR(A_cpu)
x_gpu = CuVector(x_cpu)
b_gpu = CuVector(b_cpu)

solver = CudssSolver(A_gpu, "G", 'F')  # "G": general matrix, 'F': the full matrix is stored

cudss("analysis", solver, x_gpu, b_gpu)
cudss("factorization", solver, x_gpu, b_gpu)
cudss("solve", solver, x_gpu, b_gpu)

r_gpu = b_gpu - A_gpu * x_gpu
norm(r_gpu)

# In-place LU
d_gpu = rand(T, n) |> CuVector
A_gpu = A_gpu + Diagonal(d_gpu)
cudss_set(solver, A_gpu)

c_cpu = rand(T, n)
c_gpu = CuVector(c_cpu)

cudss("factorization", solver, x_gpu, c_gpu)
cudss("solve", solver, x_gpu, c_gpu)

r_gpu = c_gpu - A_gpu * x_gpu
norm(r_gpu)

Example 2: Sparse symmetric linear system with multiple right-hand sides

using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays, LinearAlgebra

T = Float64
R = real(T)
n = 100
p = 5
A_cpu = sprand(T, n, n, 0.05) + I
A_cpu = A_cpu + A_cpu'
X_cpu = zeros(T, n, p)
B_cpu = rand(T, n, p)

A_gpu = CuSparseMatrixCSR(A_cpu |> tril)
X_gpu = CuMatrix(X_cpu)
B_gpu = CuMatrix(B_cpu)

structure = T <: Real ? "S" : "H"
solver = CudssSolver(A_gpu, structure, 'L')  # only the lower triangle of A is stored

cudss("analysis", solver, X_gpu, B_gpu)
cudss("factorization", solver, X_gpu, B_gpu)
cudss("solve", solver, X_gpu, B_gpu)

R_gpu = B_gpu - CuSparseMatrixCSR(A_cpu) * X_gpu  # A_gpu holds one triangle only, so use the full A_cpu
norm(R_gpu)

# In-place LDLᵀ
d_gpu = rand(R, n) |> CuVector
A_gpu = A_gpu + Diagonal(d_gpu)
cudss_set(solver, A_gpu)

C_cpu = rand(T, n, p)
C_gpu = CuMatrix(C_cpu)

cudss("factorization", solver, X_gpu, C_gpu)
cudss("solve", solver, X_gpu, C_gpu)

R_gpu = C_gpu - ( CuSparseMatrixCSR(A_cpu) + Diagonal(d_gpu) ) * X_gpu
norm(R_gpu)

Example 3: Sparse Hermitian positive definite linear system with multiple right-hand sides

using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays, LinearAlgebra

T = ComplexF64
R = real(T)
n = 100
p = 5
A_cpu = sprand(T, n, n, 0.01)
A_cpu = A_cpu * A_cpu' + I
X_cpu = zeros(T, n, p)
B_cpu = rand(T, n, p)

A_gpu = CuSparseMatrixCSR(A_cpu |> triu)
X_gpu = CuMatrix(X_cpu)
B_gpu = CuMatrix(B_cpu)

structure = T <: Real ? "SPD" : "HPD"
solver = CudssSolver(A_gpu, structure, 'U')  # only the upper triangle of A is stored

cudss("analysis", solver, X_gpu, B_gpu)
cudss("factorization", solver, X_gpu, B_gpu)
cudss("solve", solver, X_gpu, B_gpu)

R_gpu = B_gpu - CuSparseMatrixCSR(A_cpu) * X_gpu
norm(R_gpu)

# In-place LLᴴ
d_gpu = rand(R, n) |> CuVector
A_gpu = A_gpu + Diagonal(d_gpu)
cudss_set(solver, A_gpu)

C_cpu = rand(T, n, p)
C_gpu = CuMatrix(C_cpu)

cudss("factorization", solver, X_gpu, C_gpu)
cudss("solve", solver, X_gpu, C_gpu)

R_gpu = C_gpu - ( CuSparseMatrixCSR(A_cpu) + Diagonal(d_gpu) ) * X_gpu
norm(R_gpu)
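
After the factorization phase, cudss_get can also retrieve information computed by cuDSS. A small sketch, reusing the solver above and the parameter names that appear in the issues further down this page ("lu_nnz" and "inertia"); the exact set of supported parameters may depend on the cuDSS version:

nnz_factors = cudss_get(solver, "lu_nnz")  # number of nonzeros in the computed factors

# For symmetric indefinite factorizations (Example 2), the inertia is also available:
# inertia = cudss_get(solver, "inertia")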

CUDSS.jl's Issues

Package management and type piracy: should this be a part of CUDA.jl?

There are a few issues with the package given its separation from CUDA.jl:

  1. https://github.com/exanauts/CUDSS.jl/blob/v0.1.3/src/generic.jl#L15 these linear algebra overloads are type piracy (see the sketch after this list). It's also odd that lu on a CuSPARSE matrix doesn't "just work" out of the box, when this package fixes it to behave more like a regular sparse array. Users would be better served if there weren't a separate package to add on the side that supplies the missing linear algebra functions.
  2. The ergonomics of having it separate are also hard to deal with. For example, with LinearSolve.jl I'd like a user who passes a sparse GPU matrix to get a sparse LU factorization through the extension. But LinearSolve's CUDA.jl extension cannot define the code for making sparse LUs work, because that code needs this package, while the user only needs CUDA.jl for the sparse matrix and is given no indication that another package is required. What I can do is throw an error in the LinearSolve CUDA extension saying that CUDSS.jl must also be added, but given that packages are now required to hand-roll errors for issues the package manager should be able to identify, that's a bit of a red flag that something is going on here.

So, since this is just the sparse factorization of a sparse matrix defined in CUDA.jl, it's very odd that this one function is not in CUDA.jl.
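
To make the type-piracy point concrete, here is a schematic of the pattern (not the actual CUDSS.jl source): a package that owns neither the function LinearAlgebra.lu nor the type CuSparseMatrixCSR defines a method connecting the two, which silently changes behavior for every environment that loads both.

using CUDA, CUDA.CUSPARSE
using CUDSS
using LinearAlgebra

# Schematic only: `LinearAlgebra.lu` belongs to the standard library and
# `CuSparseMatrixCSR` belongs to CUDA.jl, so defining this method in a
# third package (here, CUDSS.jl) is type piracy.
function LinearAlgebra.lu(A::CuSparseMatrixCSR)
    solver = CudssSolver(A, "G", 'F')
    x = CUDA.zeros(eltype(A), size(A, 1))
    b = CUDA.zeros(eltype(A), size(A, 1))
    cudss("analysis", solver, x, b)
    cudss("factorization", solver, x, b)
    return solver
end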

CC @maleadt

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Issue with iterative refinement

using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays, LinearAlgebra

using Random
Random.seed!(666)

function example1(T::DataType, n::Int, ir::Int)
	A_cpu = sprand(T, n, n, 0.05) + I
	x_cpu = zeros(T, n)
	b_cpu = rand(T, n)

	A_gpu = CuSparseMatrixCSR(A_cpu)
	x_gpu = CuVector(x_cpu)
	b_gpu = CuVector(b_cpu)

	solver = CudssSolver(A_gpu, "G", 'F')
	cudss_set(solver, "ir_n_steps", ir)

	cudss("analysis", solver, x_gpu, b_gpu)
	cudss("factorization", solver, x_gpu, b_gpu)
	cudss("solve", solver, x_gpu, b_gpu)

	r_gpu = b_gpu - A_gpu * x_gpu
	norm(r_gpu)
end

function example2(T::DataType, n::Int, ir::Int)
	p = 5
	A_cpu = sprand(T, n, n, 0.05) + I
	A_cpu = A_cpu + A_cpu'
	X_cpu = zeros(T, n, p)
	B_cpu = rand(T, n, p)

	A_gpu = CuSparseMatrixCSR(A_cpu |> tril)
	X_gpu = CuMatrix(X_cpu)
	B_gpu = CuMatrix(B_cpu)

	structure = T <: Real ? "S" : "H"
	solver = CudssSolver(A_gpu, structure, 'L')
	cudss_set(solver, "ir_n_steps", ir)
	
	# Required with CUDSS 0.1.0
	(T <: Complex) && cudss_set(solver, "pivot_type", 'N')

	cudss("analysis", solver, X_gpu, B_gpu)
	cudss("factorization", solver, X_gpu, B_gpu)
	cudss("solve", solver, X_gpu, B_gpu)

	R_gpu = B_gpu - CuSparseMatrixCSR(A_cpu) * X_gpu
	norm(R_gpu)
end

function example3(T::DataType, n::Int, ir::Int)
	p = 5
	A_cpu = sprand(T, n, n, 0.01)
	A_cpu = A_cpu * A_cpu' + I
	X_cpu = zeros(T, n, p)
	B_cpu = rand(T, n, p)

	A_gpu = CuSparseMatrixCSR(A_cpu |> triu)
	X_gpu = CuMatrix(X_cpu)
	B_gpu = CuMatrix(B_cpu)

	structure = T <: Real ? "SPD" : "HPD"
	solver = CudssSolver(A_gpu, structure, 'U')
	cudss_set(solver, "ir_n_steps", ir)

	cudss("analysis", solver, X_gpu, B_gpu)
	cudss("factorization", solver, X_gpu, B_gpu)
	cudss("solve", solver, X_gpu, B_gpu)

	R_gpu = B_gpu - CuSparseMatrixCSR(A_cpu) * X_gpu
	norm(R_gpu)
end

# Size of the random linear systems
n = 100

for ir in (0,1,2)
	println("Number of iterations of iterative refinement: $ir")

	for T in (Float32, Float64, ComplexF32, ComplexF64)
		println("Precision: $T")

		r1 = example1(T, n, ir)
		println("Residual norm for example1: $r1")

		r2 = example2(T, n, ir)
		println("Residual norm for example2: $r2")

		r3 = example3(T, n, ir)
		println("Residual norm for example3: $r3")
	end

	println()
end
Output:

Number of iterations of iterative refinement: 0
Precision: Float32
Residual norm for example1: 3.692087e-5
Residual norm for example2: 0.0002999944
Residual norm for example3: 8.5040307e-7
Precision: Float64
Residual norm for example1: 3.999599835861191e-13
Residual norm for example2: 1.8113843282606318e-13
Residual norm for example3: 1.438189339051409e-15
Precision: ComplexF32
Residual norm for example1: 7.7133656e-5
Residual norm for example2: 0.0015232839
Residual norm for example3: 1.3667711e-6
Precision: ComplexF64
Residual norm for example1: 5.983645231861647e-14
Residual norm for example2: 3.5220521611201605e-12
Residual norm for example3: 2.6205255057235815e-15

Number of iterations of iterative refinement: 1
Precision: Float32
Residual norm for example1: 689.0252
Residual norm for example2: 35.675114
Residual norm for example3: 19.295319
Precision: Float64
Residual norm for example1: 17.012903059690192
Residual norm for example2: 128.73250852726014
Residual norm for example3: 19.50544156059542
Precision: ComplexF32
Residual norm for example1: 91.10225
Residual norm for example2: 169.85542
Residual norm for example3: 37.091965
Precision: ComplexF64
Residual norm for example1: 66.7120050629793
Residual norm for example2: 178.9033657822368
Residual norm for example3: 33.54613616599457

Number of iterations of iterative refinement: 2
Precision: Float32
Residual norm for example1: 398.21548
Residual norm for example2: 144.65608
Residual norm for example3: 53.373383
Precision: Float64
Residual norm for example1: 351.5959719021192
Residual norm for example2: 785.4054754458682
Residual norm for example3: 52.362171222079056
Precision: ComplexF32
Residual norm for example1: 69.65438
Residual norm for example2: 869.5622
Residual norm for example3: 134.29175
Precision: ComplexF64
Residual norm for example1: 4401.508407855525
Residual norm for example2: 2843.681478289806
Residual norm for example3: 117.71330055583913
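
Since the residuals grow with the number of refinement steps instead of shrinking, a first sanity check would be to read the setting back after configuring it. This assumes that a configuration parameter set with cudss_set can also be queried with cudss_get, which I have not verified:

cudss_set(solver, "ir_n_steps", ir)
cudss_get(solver, "ir_n_steps")  # expected to return ir if the setting was applied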

LDL' factorizations are failing with complex Hermitian matrices

using Test
using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays
using LinearAlgebra

using Random
Random.seed!(1)

n = 100
p = 5
T = ComplexF64
R = real(T)
A_cpu = sprand(T, n, n, 0.01) + I
A_cpu = A_cpu + A_cpu'
X_cpu = zeros(T, n, p)
B_cpu = rand(T, n, p)

A_gpu = CuSparseMatrixCSR(A_cpu)
X_gpu = CuMatrix(X_cpu)
B_gpu = CuMatrix(B_cpu)

matrix = CudssMatrix(A_gpu, "H", 'F', index='O')
config = CudssConfig()
data = CudssData()
solver = CudssSolver(matrix, config, data)

cudss("analysis", solver, X_gpu, B_gpu)
cudss("factorization", solver, X_gpu, B_gpu)

nnz_decomposition = cudss_get(solver, "lu_nnz")
inertia = cudss_get(solver, "inertia")

cudss("solve", solver, X_gpu, B_gpu)

R_gpu = B_gpu - CuSparseMatrixCSR(A_cpu) * X_gpu
@test norm(R_gpu) ≤ √eps(R)

Y_cpu = zeros(T, n, n)
Y_gpu = CuMatrix(Y_cpu)
C_cpu = Matrix{T}(I, n, n)
C_gpu = CuMatrix(C_cpu)

cudss("solve", solver, Y_gpu, C_gpu)

RR_gpu = C_gpu - CuSparseMatrixCSR(A_cpu) * Y_gpu
@test norm(RR_gpu) ≤ √eps(R)

using MatrixMarket

MatrixMarket.mmwrite("cudss_A.mtx", A_gpu |> SparseMatrixCSC)
MatrixMarket.mmwrite("cudss_B.mtx", B_gpu |> Matrix |> sparse)
MatrixMarket.mmwrite("cudss_X.mtx", X_gpu |> Matrix |> sparse)
MatrixMarket.mmwrite("cudss_Y.mtx", Y_gpu |> Matrix |> sparse)

Fail to call "cudss_set" and "cudss_get"

Hi,

I tried to change (and extract) the settings of a cuDSS solver beyond the default ones, which involves the functions "cudss_set" and "cudss_get". However, Julia crashes as soon as I try to change any parameter via "cudss_set". Is this currently an issue with the Julia wrapper?
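
For reference, a minimal script exercising cudss_set, adapted from Example 1 in the README above and using the "ir_n_steps" parameter that appears elsewhere on this page, would help pin down whether the crash happens at the cudss_set call itself:

using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays, LinearAlgebra

T = Float64
n = 100
A_gpu = CuSparseMatrixCSR(sprand(T, n, n, 0.05) + I)
x_gpu = CuVector(zeros(T, n))
b_gpu = CuVector(rand(T, n))

solver = CudssSolver(A_gpu, "G", 'F')
cudss_set(solver, "ir_n_steps", 1)  # does this line alone trigger the crash?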

Unable to reuse the analysis if we only store one triangle for symmetric factorizations

using CUDA, CUDA.CUSPARSE
using CUDSS
using LinearAlgebra
using SparseArrays
using Test

println(CUDSS.CUDSS_INSTALLATION)

function cudss_generic()
  n = 100
  p = 5
  @testset "precision = $T" for T in (Float32, Float64, ComplexF32, ComplexF64)
    R = real(T)
    @testset "Unsymmetric -- Non-Hermitian" begin
      A_cpu = sprand(T, n, n, 0.02) + I
      b_cpu = rand(T, n)

      A_gpu = CuSparseMatrixCSR(A_cpu)
      b_gpu = CuVector(b_cpu)

      @testset "ldiv!" begin
        x_cpu = zeros(T, n)
        x_gpu = CuVector(x_cpu)

        solver = lu(A_gpu)
        ldiv!(x_gpu, solver, b_gpu)
        r_gpu = b_gpu - A_gpu * x_gpu
        @test norm(r_gpu) ≤ √eps(R)

        A_gpu2 = rand(T) * A_gpu
        lu!(solver, A_gpu2)
        x_gpu .= b_gpu
        ldiv!(solver, x_gpu)
        r_gpu2 = b_gpu - A_gpu2 * x_gpu
        @test norm(r_gpu2) ≤ √eps(R)
      end

      @testset "\\" begin
        solver = lu(A_gpu)
        x_gpu = solver \ b_gpu
        r_gpu = b_gpu - A_gpu * x_gpu
        @test norm(r_gpu) ≤ √eps(R)

        A_gpu2 = rand(T) * A_gpu
        lu!(solver, A_gpu2)
        x_gpu = solver \ b_gpu
        r_gpu2 = b_gpu - A_gpu2 * x_gpu
        @test norm(r_gpu2) ≤ √eps(R)
      end
    end

    @testset "Symmetric -- Hermitian" begin
          @testset "view = $view" for view in ('F', 'L', 'U')
        A_cpu = sprand(T, n, n, 0.01) + I
        A_cpu = A_cpu + A_cpu'
        B_cpu = rand(T, n, p)

        (view == 'L') && (A_gpu = CuSparseMatrixCSR(A_cpu |> tril))
        (view == 'U') && (A_gpu = CuSparseMatrixCSR(A_cpu |> triu))
        (view == 'F') && (A_gpu = CuSparseMatrixCSR(A_cpu))
        B_gpu = CuMatrix(B_cpu)

        @testset "ldiv!" begin
          X_cpu = zeros(T, n, p)
          X_gpu = CuMatrix(X_cpu)

          solver = ldlt(A_gpu; view)
          ldiv!(X_gpu, solver, B_gpu)
          R_gpu = B_gpu - A_gpu * X_gpu
          @test norm(R_gpu) ≤ √eps(R)

          c = rand(R)
          A_cpu2 = c * A_cpu
          A_gpu2 = c * A_gpu

          ldlt!(solver, A_gpu2)
          X_gpu .= B_gpu
          ldiv!(solver, X_gpu)
          R_gpu2 = B_gpu - CuSparseMatrixCSR(A_cpu2) * X_gpu
          @test norm(R_gpu2) ≤ √eps(R)
        end

        @testset "\\" begin
          solver = ldlt(A_gpu; view)
          X_gpu = solver \ B_gpu
          R_gpu = B_gpu - A_gpu * X_gpu
          @test norm(R_gpu) ≤ √eps(R)

          c = rand(R)
          A_cpu2 = c * A_cpu
          A_gpu2 = c * A_gpu

          ldlt!(solver, A_gpu2)
          X_gpu = solver \ B_gpu
          R_gpu2 = B_gpu - CuSparseMatrixCSR(A_cpu2) * X_gpu
          @test norm(R_gpu2) ≤ √eps(R)
        end
      end
    end

      @testset "SPD -- HPD" begin
    @testset "view = $view" for view in ('F', 'L', 'U')
        A_cpu = sprand(T, n, n, 0.01)
        A_cpu = A_cpu * A_cpu' + I
        B_cpu = rand(T, n, p)

        (view == 'L') && (A_gpu = CuSparseMatrixCSR(A_cpu |> tril))
        (view == 'U') && (A_gpu = CuSparseMatrixCSR(A_cpu |> triu))
        (view == 'F') && (A_gpu = CuSparseMatrixCSR(A_cpu))
        B_gpu = CuMatrix(B_cpu)

        @testset "ldiv!" begin
          X_cpu = zeros(T, n, p)
          X_gpu = CuMatrix(X_cpu)

          solver = cholesky(A_gpu; view)
          ldiv!(X_gpu, solver, B_gpu)
          R_gpu = B_gpu - A_gpu * X_gpu
          @test norm(R_gpu) ≤ √eps(R)

          c = rand(R)
          A_cpu2 = c * A_cpu
          A_gpu2 = c * A_gpu

          cholesky!(solver, A_gpu2)
          X_gpu .= B_gpu
          ldiv!(solver, X_gpu)
          R_gpu2 = B_gpu - CuSparseMatrixCSR(A_cpu2) * X_gpu
          @test norm(R_gpu2) ≤ √eps(R)
        end

        @testset "\\" begin
          solver = cholesky(A_gpu; view)
          X_gpu = solver \ B_gpu
          R_gpu = B_gpu - A_gpu * X_gpu
          @test norm(R_gpu) ≤ √eps(R)

          c = rand(R)
          A_cpu2 = c * A_cpu
          A_gpu2 = c * A_gpu

          cholesky!(solver, A_gpu2)
          X_gpu = solver \ B_gpu
          R_gpu2 = B_gpu - CuSparseMatrixCSR(A_cpu2) * X_gpu
          @test norm(R_gpu2) ≤ √eps(R)
        end
      end
    end
  end
end

cudss_generic()

Irreproducible results for the same linear system

I found that cuDSS cannot reproduce the same result:

using CUDA, CUDA.CUSPARSE
using CUDSS 
using SparseArrays, LinearAlgebra
using Random
Random.seed!(666)

n = 1000
T = Float64
A_cpu = sprand(T, n, n, 0.05) + 0.01*I
x_cpu = zeros(T, n)
b_cpu = rand(T, n)

A_gpu = CuSparseMatrixCSR(A_cpu)
x_gpu = CuVector(x_cpu)
b_gpu = CuVector(b_cpu)

solver = CudssSolver(A_gpu, "G", 'F')

cudss("analysis", solver, x_gpu, b_gpu)
cudss("factorization", solver, x_gpu, b_gpu)
cudss("solve", solver, x_gpu, b_gpu)

x1 = deepcopy(x_gpu)
ldiv!(x_gpu, solver, b_gpu)
x2 = deepcopy(x_gpu)

println("error is ", norm(x1-x2))

The script just solves the same linear system twice, but the values of x_gpu differ, i.e. norm(x1 - x2) is nonzero.

Is this a bug, or an inherent property of the method underlying cuDSS?
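
A quick way to tell whether this is a correctness problem or only run-to-run nondeterminism is to compare residuals rather than solutions, appending the following to the script above. If both residuals are small, x1 and x2 are both valid solutions, and the discrepancy most likely comes from non-deterministic floating-point reductions on the GPU.

r1 = norm(b_gpu - A_gpu * x1)
r2 = norm(b_gpu - A_gpu * x2)
println("residual 1 = $r1, residual 2 = $r2")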
