exanauts / CUDSS.jl
License: MIT License
Is there any reason this isn't allowed?
I should also add an example where the ordering is computed outside CUDSS (like AMD or SCOTCH).
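A minimal sketch of what such an example could look like, assuming AMD.jl provides the fill-reducing ordering and that the wrapper exposes the cuDSS "user_perm" parameter (both are assumptions here, not verified API):

using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays, LinearAlgebra
using AMD  # assumption: AMD.jl supplies the amd ordering

n = 100
A_cpu = sprand(Float64, n, n, 0.05) + I
A_cpu = A_cpu + A_cpu'
b_cpu = rand(Float64, n)

A_gpu = CuSparseMatrixCSR(A_cpu |> tril)
x_gpu = CuVector(zeros(Float64, n))
b_gpu = CuVector(b_cpu)

solver = CudssSolver(A_gpu, "S", 'L')

# Compute the ordering on the CPU and hand it to cuDSS before the analysis.
# "user_perm" is assumed to be the parameter name exposed by the wrapper.
perm = Cint.(amd(A_cpu))
cudss_set(solver, "user_perm", perm)

cudss("analysis", solver, x_gpu, b_gpu)
cudss("factorization", solver, x_gpu, b_gpu)
cudss("solve", solver, x_gpu, b_gpu)
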
using CUDA, CUDA.CUSPARSE
using CUDSS
using LinearAlgebra
using SparseArrays
using Test
println(CUDSS.CUDSS_INSTALLATION)
function cudss_generic()
    n = 100
    p = 5
    @testset "precision = $T" for T in (Float32, Float64, ComplexF32, ComplexF64)
        R = real(T)
        @testset "Unsymmetric -- Non-Hermitian" begin
            A_cpu = sprand(T, n, n, 0.02) + I
            b_cpu = rand(T, n)
            A_gpu = CuSparseMatrixCSR(A_cpu)
            b_gpu = CuVector(b_cpu)

            @testset "ldiv!" begin
                x_cpu = zeros(T, n)
                x_gpu = CuVector(x_cpu)
                solver = lu(A_gpu)
                ldiv!(x_gpu, solver, b_gpu)
                r_gpu = b_gpu - A_gpu * x_gpu
                @test norm(r_gpu) ≤ √eps(R)

                A_gpu2 = rand(T) * A_gpu
                lu!(solver, A_gpu2)
                x_gpu .= b_gpu
                ldiv!(solver, x_gpu)
                r_gpu2 = b_gpu - A_gpu2 * x_gpu
                @test norm(r_gpu2) ≤ √eps(R)
            end

            @testset "\\" begin
                solver = lu(A_gpu)
                x_gpu = solver \ b_gpu
                r_gpu = b_gpu - A_gpu * x_gpu
                @test norm(r_gpu) ≤ √eps(R)

                A_gpu2 = rand(T) * A_gpu
                lu!(solver, A_gpu2)
                x_gpu = solver \ b_gpu
                r_gpu2 = b_gpu - A_gpu2 * x_gpu
                @test norm(r_gpu2) ≤ √eps(R)
            end
        end

        @testset "Symmetric -- Hermitian" begin
            @testset "view = $view" for view in ('F', 'L', 'U')
                A_cpu = sprand(T, n, n, 0.01) + I
                A_cpu = A_cpu + A_cpu'
                B_cpu = rand(T, n, p)
                (view == 'L') && (A_gpu = CuSparseMatrixCSR(A_cpu |> tril))
                (view == 'U') && (A_gpu = CuSparseMatrixCSR(A_cpu |> triu))
                (view == 'F') && (A_gpu = CuSparseMatrixCSR(A_cpu))
                B_gpu = CuMatrix(B_cpu)

                @testset "ldiv!" begin
                    X_cpu = zeros(T, n, p)
                    X_gpu = CuMatrix(X_cpu)
                    solver = ldlt(A_gpu; view)
                    ldiv!(X_gpu, solver, B_gpu)
                    # The residual is computed with the full matrix because
                    # A_gpu only stores one triangle when view is 'L' or 'U'.
                    R_gpu = B_gpu - CuSparseMatrixCSR(A_cpu) * X_gpu
                    @test norm(R_gpu) ≤ √eps(R)

                    c = rand(R)
                    A_cpu2 = c * A_cpu
                    A_gpu2 = c * A_gpu
                    ldlt!(solver, A_gpu2)
                    X_gpu .= B_gpu
                    ldiv!(solver, X_gpu)
                    R_gpu2 = B_gpu - CuSparseMatrixCSR(A_cpu2) * X_gpu
                    @test norm(R_gpu2) ≤ √eps(R)
                end

                @testset "\\" begin
                    solver = ldlt(A_gpu; view)
                    X_gpu = solver \ B_gpu
                    R_gpu = B_gpu - CuSparseMatrixCSR(A_cpu) * X_gpu
                    @test norm(R_gpu) ≤ √eps(R)

                    c = rand(R)
                    A_cpu2 = c * A_cpu
                    A_gpu2 = c * A_gpu
                    ldlt!(solver, A_gpu2)
                    X_gpu = solver \ B_gpu
                    R_gpu2 = B_gpu - CuSparseMatrixCSR(A_cpu2) * X_gpu
                    @test norm(R_gpu2) ≤ √eps(R)
                end
            end
        end

        @testset "SPD -- HPD" begin
            @testset "view = $view" for view in ('F', 'L', 'U')
                A_cpu = sprand(T, n, n, 0.01)
                A_cpu = A_cpu * A_cpu' + I
                B_cpu = rand(T, n, p)
                (view == 'L') && (A_gpu = CuSparseMatrixCSR(A_cpu |> tril))
                (view == 'U') && (A_gpu = CuSparseMatrixCSR(A_cpu |> triu))
                (view == 'F') && (A_gpu = CuSparseMatrixCSR(A_cpu))
                B_gpu = CuMatrix(B_cpu)

                @testset "ldiv!" begin
                    X_cpu = zeros(T, n, p)
                    X_gpu = CuMatrix(X_cpu)
                    solver = cholesky(A_gpu; view)
                    ldiv!(X_gpu, solver, B_gpu)
                    R_gpu = B_gpu - CuSparseMatrixCSR(A_cpu) * X_gpu
                    @test norm(R_gpu) ≤ √eps(R)

                    c = rand(R)
                    A_cpu2 = c * A_cpu
                    A_gpu2 = c * A_gpu
                    cholesky!(solver, A_gpu2)
                    X_gpu .= B_gpu
                    ldiv!(solver, X_gpu)
                    R_gpu2 = B_gpu - CuSparseMatrixCSR(A_cpu2) * X_gpu
                    @test norm(R_gpu2) ≤ √eps(R)
                end

                @testset "\\" begin
                    solver = cholesky(A_gpu; view)
                    X_gpu = solver \ B_gpu
                    R_gpu = B_gpu - CuSparseMatrixCSR(A_cpu) * X_gpu
                    @test norm(R_gpu) ≤ √eps(R)

                    c = rand(R)
                    A_cpu2 = c * A_cpu
                    A_gpu2 = c * A_gpu
                    cholesky!(solver, A_gpu2)
                    X_gpu = solver \ B_gpu
                    R_gpu2 = B_gpu - CuSparseMatrixCSR(A_cpu2) * X_gpu
                    @test norm(R_gpu2) ≤ √eps(R)
                end
            end
        end
    end
end

cudss_generic()
We need to set things up with Moonshot.
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.
If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!
There are a few issues with the package given its separation from CUDA.jl:
lu doesn't "just work" on CUDA.jl's sparse matrices; this package is what fixes it to behave more like a regular sparse array. I would think users would be better served if there weren't a special package to add on the side that supplies the missing linear algebra functions. So, since this is just the sparse factorization of a sparse matrix type defined in CUDA.jl, it's very odd that this one function is not in CUDA.jl.
CC @maleadt
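For context, this is what "just working" looks like once CUDSS.jl is loaded (a sketch reusing the generic API exercised in the tests above; without CUDSS.jl, lu(A_gpu) has no applicable method):

using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays, LinearAlgebra

n = 100
A_cpu = sprand(Float64, n, n, 0.02) + I
b_cpu = rand(Float64, n)

A_gpu = CuSparseMatrixCSR(A_cpu)
b_gpu = CuVector(b_cpu)

# Loading CUDSS.jl extends LinearAlgebra.lu for CuSparseMatrixCSR,
# so the usual factorize-then-solve idiom carries over to the GPU.
solver = lu(A_gpu)
x_gpu = solver \ b_gpu
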
I found that cudss can't reproduce the same result:
using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays, LinearAlgebra
using Random
Random.seed!(666)
n = 1000
T = Float64
A_cpu = sprand(T, n, n, 0.05) + 0.01*I
x_cpu = zeros(T, n)
b_cpu = rand(T, n)
A_gpu = CuSparseMatrixCSR(A_cpu)
x_gpu = CuVector(x_cpu)
b_gpu = CuVector(b_cpu)
solver = CudssSolver(A_gpu, "G", 'F')
cudss("analysis", solver, x_gpu, b_gpu)
cudss("factorization", solver, x_gpu, b_gpu)
cudss("solve", solver, x_gpu, b_gpu)
x1 = deepcopy(x_gpu)
ldiv!(x_gpu, solver, b_gpu)
x2 = deepcopy(x_gpu)
println("error is ", norm(x1-x2))
This just solves the same linear system twice, but the values of x_gpu are different, i.e. norm(x1 - x2) is nonzero.
Is this a bug, or an intrinsic property of the method underlying cudss?
I have the following error message when loading CUDSS.jl with CUDA.jl 5.4:
ERROR: LoadError: MethodError: no method matching CUDA.APIUtils.HandleCache{CUDA.CuContext, Ptr{CUDSS.cudssContext}}()
Closest candidates are:
CUDA.APIUtils.HandleCache{K, V}(::Any, ::Any; max_entries) where {K, V}
@ CUDA ~/.julia/packages/CUDA/75aiI/lib/utils/cache.jl:17
Stacktrace:
[1] top-level scope
@ ~/.julia/packages/CUDSS/ACgsr/src/management.jl:20
[2] include(mod::Module, _path::String)
@ Base ./Base.jl:495
[3] include(x::String)
@ CUDSS ~/.julia/packages/CUDSS/ACgsr/src/CUDSS.jl:1
[4] top-level scope
@ ~/.julia/packages/CUDSS/ACgsr/src/CUDSS.jl:24
It looks like the culprit is:
https://github.com/exanauts/CUDSS.jl/blob/main/src/management.jl#L20
JuliaGPU/CUDA.jl#2262
We should replace the @ccall macro with the @gcsafe_ccall macro.
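The syntax of @gcsafe_ccall mirrors @ccall, so the change is mechanical at each call site. A sketch, assuming the macro is defined in CUDA.APIUtils (the module referenced in the stacktrace above), using a harmless libc call for illustration:

using CUDA
using CUDA.APIUtils: @gcsafe_ccall  # assumption: the macro lives here

# Same call syntax as @ccall; the foreign call is wrapped so other Julia
# tasks can run garbage collection while this thread blocks in C.
t = @gcsafe_ccall time(C_NULL::Ptr{Cvoid})::Clong
println(t)

In management.jl, each @ccall into libcudss would then become @gcsafe_ccall with an unchanged signature.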
I should also update the function check like Tim did in this commit: JuliaGPU/CUDA.jl@41dcc90
using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays, LinearAlgebra
using Random
Random.seed!(666)
function example1(T::DataType, n::Int)
    A_cpu = sprand(T, n, n, 1.0) + I
    x_cpu = zeros(T, n)
    b_cpu = rand(T, n)

    A_gpu = CuSparseMatrixCSR(A_cpu)
    x_gpu = CuVector(x_cpu)
    b_gpu = CuVector(b_cpu)

    solver = CudssSolver(A_gpu, "G", 'F')
    cudss("analysis", solver, x_gpu, b_gpu)
    cudss("factorization", solver, x_gpu, b_gpu)
    cudss("solve", solver, x_gpu, b_gpu)

    r_gpu = b_gpu - A_gpu * x_gpu
    norm(r_gpu)
end

function example2(T::DataType, n::Int)
    A_cpu = sprand(T, n, n, 1.0) + I
    A_cpu = A_cpu + A_cpu' + I
    x_cpu = zeros(T, n)
    b_cpu = rand(T, n)

    A_gpu = CuSparseMatrixCSR(A_cpu |> tril)
    x_gpu = CuVector(x_cpu)
    b_gpu = CuVector(b_cpu)

    structure = T <: Real ? "S" : "H"
    solver = CudssSolver(A_gpu, structure, 'L')
    cudss("analysis", solver, x_gpu, b_gpu)
    cudss("factorization", solver, x_gpu, b_gpu)
    cudss("solve", solver, x_gpu, b_gpu)

    r_gpu = b_gpu - CuSparseMatrixCSR(A_cpu) * x_gpu
    norm(r_gpu)
end

function example3(T::DataType, n::Int)
    A_cpu = sprand(T, n, n, 1.0)
    A_cpu = A_cpu * A_cpu' + I
    x_cpu = zeros(T, n)
    b_cpu = rand(T, n)

    A_gpu = CuSparseMatrixCSR(A_cpu |> triu)
    x_gpu = CuVector(x_cpu)
    b_gpu = CuVector(b_cpu)

    structure = T <: Real ? "SPD" : "HPD"
    solver = CudssSolver(A_gpu, structure, 'U')
    cudss("analysis", solver, x_gpu, b_gpu)
    cudss("factorization", solver, x_gpu, b_gpu)
    cudss("solve", solver, x_gpu, b_gpu)

    r_gpu = b_gpu - CuSparseMatrixCSR(A_cpu) * x_gpu
    norm(r_gpu)
end

for n in (10, 5, 2, 1)
    println("Size of the linear systems: $n")
    for T in (Float32, Float64, ComplexF32, ComplexF64)
        println("Precision: $T")
        r1 = example1(T, n)
        println("Residual norm for example1: $r1")
        r2 = example2(T, n)
        println("Residual norm for example2: $r2")
        r3 = example3(T, n)
        println("Residual norm for example3: $r3")
        println()
    end
end
ERROR: CUDSSError: an internal operation failed (code 7, CUDSS_STATUS_INTERNAL_ERROR)
using Test
using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays
using LinearAlgebra
using Random
Random.seed!(1)
n = 100
p = 5
T = ComplexF64
R = real(T)
A_cpu = sprand(T, n, n, 0.01) + I
A_cpu = A_cpu + A_cpu'
X_cpu = zeros(T, n, p)
B_cpu = rand(T, n, p)
A_gpu = CuSparseMatrixCSR(A_cpu)
X_gpu = CuMatrix(X_cpu)
B_gpu = CuMatrix(B_cpu)
matrix = CudssMatrix(A_gpu, "H", 'F', index='O')
config = CudssConfig()
data = CudssData()
solver = CudssSolver(matrix, config, data)
cudss("analysis", solver, X_gpu, B_gpu)
cudss("factorization", solver, X_gpu, B_gpu)
nnz_decomposition = cudss_get(solver, "lu_nnz")
inertia = cudss_get(solver, "inertia")
cudss("solve", solver, X_gpu, B_gpu)
R_gpu = B_gpu - CuSparseMatrixCSR(A_cpu) * X_gpu
@test norm(R_gpu) ≤ √eps(R)
Y_cpu = zeros(T, n, n)
Y_gpu = CuMatrix(Y_cpu)
C_cpu = Matrix{T}(I, n, n)
C_gpu = CuMatrix(C_cpu)
cudss("solve", solver, Y_gpu, C_gpu)
RR_gpu = C_gpu - CuSparseMatrixCSR(A_cpu) * Y_gpu
@test norm(RR_gpu) ≤ √eps(R)
using MatrixMarket
MatrixMarket.mmwrite("cudss_A.mtx", A_gpu |> SparseMatrixCSC)
MatrixMarket.mmwrite("cudss_B.mtx", B_gpu |> Matrix |> sparse)
MatrixMarket.mmwrite("cudss_X.mtx", X_gpu |> Matrix |> sparse)
MatrixMarket.mmwrite("cudss_Y.mtx", Y_gpu |> Matrix |> sparse)
Hi,
I tried to change (and extract) the settings of a cudss solver beyond the defaults. This seems to be handled by the functions cudss_set and cudss_get. However, Julia crashes as soon as I try to change any parameter via cudss_set. Is this currently an issue with the Julia wrapper?
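For reference, here is the pattern I would expect to work (a minimal sketch reusing only parameters exercised elsewhere on this page: "ir_n_steps" before the analysis, "lu_nnz" after the factorization):

using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays, LinearAlgebra

n = 100
A_cpu = sprand(Float64, n, n, 0.05) + I
A_gpu = CuSparseMatrixCSR(A_cpu)
x_gpu = CuVector(zeros(Float64, n))
b_gpu = CuVector(rand(Float64, n))

solver = CudssSolver(A_gpu, "G", 'F')

# Configuration parameters are set before the analysis phase.
cudss_set(solver, "ir_n_steps", 1)

cudss("analysis", solver, x_gpu, b_gpu)
cudss("factorization", solver, x_gpu, b_gpu)

# Data parameters are queried after the factorization.
nnz_lu = cudss_get(solver, "lu_nnz")

cudss("solve", solver, x_gpu, b_gpu)
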
We must call an analysis phase before the CudssData can be destroyed.
using CUDSS

data = CudssData()
# Destroying a CudssData without a prior analysis phase crashes.
CUDSS.cudssDataDestroy(data.handle, data)
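By contrast, a teardown after an analysis phase should be fine. A sketch, assuming the CudssSolver stores its CudssData in a data field (field name unverified):

using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays, LinearAlgebra

n = 10
A_gpu = CuSparseMatrixCSR(sprand(Float64, n, n, 0.3) + I)
x_gpu = CuVector(zeros(Float64, n))
b_gpu = CuVector(rand(Float64, n))

solver = CudssSolver(A_gpu, "G", 'F')
cudss("analysis", solver, x_gpu, b_gpu)

# After the analysis, the internal data can be destroyed without crashing.
data = solver.data  # assumed field name
CUDSS.cudssDataDestroy(data.handle, data)
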
using CUDA, CUDA.CUSPARSE
using CUDSS
using SparseArrays, LinearAlgebra
using Random
Random.seed!(666)
function example1(T::DataType, n::Int, ir::Int)
    A_cpu = sprand(T, n, n, 0.05) + I
    x_cpu = zeros(T, n)
    b_cpu = rand(T, n)

    A_gpu = CuSparseMatrixCSR(A_cpu)
    x_gpu = CuVector(x_cpu)
    b_gpu = CuVector(b_cpu)

    solver = CudssSolver(A_gpu, "G", 'F')
    cudss_set(solver, "ir_n_steps", ir)
    cudss("analysis", solver, x_gpu, b_gpu)
    cudss("factorization", solver, x_gpu, b_gpu)
    cudss("solve", solver, x_gpu, b_gpu)

    r_gpu = b_gpu - A_gpu * x_gpu
    norm(r_gpu)
end

function example2(T::DataType, n::Int, ir::Int)
    p = 5
    A_cpu = sprand(T, n, n, 0.05) + I
    A_cpu = A_cpu + A_cpu'
    X_cpu = zeros(T, n, p)
    B_cpu = rand(T, n, p)

    A_gpu = CuSparseMatrixCSR(A_cpu |> tril)
    X_gpu = CuMatrix(X_cpu)
    B_gpu = CuMatrix(B_cpu)

    structure = T <: Real ? "S" : "H"
    solver = CudssSolver(A_gpu, structure, 'L')
    cudss_set(solver, "ir_n_steps", ir)
    # Required with CUDSS 0.1.0
    (T <: Complex) && cudss_set(solver, "pivot_type", 'N')
    cudss("analysis", solver, X_gpu, B_gpu)
    cudss("factorization", solver, X_gpu, B_gpu)
    cudss("solve", solver, X_gpu, B_gpu)

    R_gpu = B_gpu - CuSparseMatrixCSR(A_cpu) * X_gpu
    norm(R_gpu)
end

function example3(T::DataType, n::Int, ir::Int)
    p = 5
    A_cpu = sprand(T, n, n, 0.01)
    A_cpu = A_cpu * A_cpu' + I
    X_cpu = zeros(T, n, p)
    B_cpu = rand(T, n, p)

    A_gpu = CuSparseMatrixCSR(A_cpu |> triu)
    X_gpu = CuMatrix(X_cpu)
    B_gpu = CuMatrix(B_cpu)

    structure = T <: Real ? "SPD" : "HPD"
    solver = CudssSolver(A_gpu, structure, 'U')
    cudss_set(solver, "ir_n_steps", ir)
    cudss("analysis", solver, X_gpu, B_gpu)
    cudss("factorization", solver, X_gpu, B_gpu)
    cudss("solve", solver, X_gpu, B_gpu)

    R_gpu = B_gpu - CuSparseMatrixCSR(A_cpu) * X_gpu
    norm(R_gpu)
end

# Size of the random linear systems
n = 100
for ir in (0, 1, 2)
    println("Number of iterations of iterative refinement: $ir")
    for T in (Float32, Float64, ComplexF32, ComplexF64)
        println("Precision: $T")
        r1 = example1(T, n, ir)
        println("Residual norm for example1: $r1")
        r2 = example2(T, n, ir)
        println("Residual norm for example2: $r2")
        r3 = example3(T, n, ir)
        println("Residual norm for example3: $r3")
    end
    println()
end
Number of iterations of iterative refinement: 0
Precision: Float32
Residual norm for example1: 3.692087e-5
Residual norm for example2: 0.0002999944
Residual norm for example3: 8.5040307e-7
Precision: Float64
Residual norm for example1: 3.999599835861191e-13
Residual norm for example2: 1.8113843282606318e-13
Residual norm for example3: 1.438189339051409e-15
Precision: ComplexF32
Residual norm for example1: 7.7133656e-5
Residual norm for example2: 0.0015232839
Residual norm for example3: 1.3667711e-6
Precision: ComplexF64
Residual norm for example1: 5.983645231861647e-14
Residual norm for example2: 3.5220521611201605e-12
Residual norm for example3: 2.6205255057235815e-15
Number of iterations of iterative refinement: 1
Precision: Float32
Residual norm for example1: 689.0252
Residual norm for example2: 35.675114
Residual norm for example3: 19.295319
Precision: Float64
Residual norm for example1: 17.012903059690192
Residual norm for example2: 128.73250852726014
Residual norm for example3: 19.50544156059542
Precision: ComplexF32
Residual norm for example1: 91.10225
Residual norm for example2: 169.85542
Residual norm for example3: 37.091965
Precision: ComplexF64
Residual norm for example1: 66.7120050629793
Residual norm for example2: 178.9033657822368
Residual norm for example3: 33.54613616599457
Number of iterations of iterative refinement: 2
Precision: Float32
Residual norm for example1: 398.21548
Residual norm for example2: 144.65608
Residual norm for example3: 53.373383
Precision: Float64
Residual norm for example1: 351.5959719021192
Residual norm for example2: 785.4054754458682
Residual norm for example3: 52.362171222079056
Precision: ComplexF32
Residual norm for example1: 69.65438
Residual norm for example2: 869.5622
Residual norm for example3: 134.29175
Precision: ComplexF64
Residual norm for example1: 4401.508407855525
Residual norm for example2: 2843.681478289806
Residual norm for example3: 117.71330055583913