
Comments (7)

ngphuoc commented on May 16, 2024

The following is my simple template and benchmark comparing prefetching data on a parallel CPU worker with the normal version without prefetching. Basically, the training loop does not need to wait for data. However, this is only about twice as fast, not 10 times. So I guess the GPU could run 10 trainings in parallel once all the data is on the GPU; probably I need 10 workers for prefetching. Could you give some comments? Many thanks.

macro swap(x,y)
  quote
    local tmp = $(esc(x))
    $(esc(x)) = $(esc(y))
    $(esc(y)) = tmp
  end
end

# some slow function
@everywhere function get_data(i)
  sleep(0.6)
  println("get_data $i")
  i
end

function slow_train(x)
  sleep(0.6)
  println("slow_train $x")
end

function prefetch(rng)
  @assert length(rng) > 1
  rng = collect(rng)
  a = b = nothing
  function _iter()
    for i in 1:length(rng)
      if a === nothing
        # first iteration: request the current and the next batch
        a = remotecall(get_data, 2, rng[i])
        b = remotecall(get_data, 2, rng[i+1])
      else
        if i < length(rng)
          a = remotecall(get_data, 2, rng[i+1])  # request one batch ahead
        end
        @swap(a,b)
      end
      d = fetch(a)  # blocks only if the prefetch has not finished yet
      produce(d)
    end
  end
  return Task(_iter)
end
@time for x in prefetch(1:10)
  slow_train(x)
end
% julia -p 2 test-task.jl
6.957115 seconds (153.23 k allocations: 6.454 MB)
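
A side note for readers on current Julia: the produce/Task coroutine iteration used above was removed around Julia 1.0. Below is a minimal sketch of the same two-stage prefetch using a Channel, assuming Julia >= 1.3 with the Distributed stdlib; the names prefetch_channel and worker are illustrative, not from the original script.

using Distributed
addprocs(2)

@everywhere function get_data(i)
    sleep(0.6)   # simulate slow data loading
    i
end

slow_train(x) = sleep(0.6)   # simulate one training step

# Two-stage prefetch: keep one remotecall in flight while the
# consumer trains on the previous batch.
function prefetch_channel(rng; worker=2)
    Channel{Any}(1) do ch          # buffer of 1 = one batch ahead
        fut = remotecall(get_data, worker, first(rng))
        for i in Iterators.drop(rng, 1)
            nxt = remotecall(get_data, worker, i)  # start next fetch early
            put!(ch, fetch(fut))                   # hand over current batch
            fut = nxt
        end
        put!(ch, fetch(fut))
    end
end

@time for x in prefetch_channel(1:10)
    slow_train(x)
end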
macro swap(x,y)
  quote
    local tmp = $(esc(x))
    $(esc(x)) = $(esc(y))
    $(esc(y)) = tmp
  end
end

# some slow function
@everywhere function get_data(i)
  sleep(0.6)
  println("get_data $i")
  i
end

function slow_train(x)
  sleep(0.6)
  println("slow_train $x")
end

function fetch(rng)  # note: this shadows Base.fetch (unused in this script)
  rng = collect(rng)
  function _iter()
    for i in 1:length(rng)
      d = get_data(rng[i])  # load synchronously; no overlap with training
      produce(d)
    end
  end
  return Task(_iter)
end
@time for x in fetch(1:10)
  slow_train(x)
end

% julia test-task.jl
12.146958 seconds (84.82 k allocations: 3.528 MB)
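
The roughly 2x difference matches simple pipeline arithmetic: with one prefetch stage, loading and training overlap, so the wall-clock time can at best drop from the sum of the two stage times to the larger of the two. A quick check, a sketch using the sleep times from the scripts above:

n, f, t = 10, 0.6, 0.6         # batches, fetch time, train time
serial    = n * (f + t)        # no prefetch: 12.0 s
pipelined = f + n * max(f, t)  # one batch ahead: 6.6 s
# With f == t the two stages are balanced, so ~2x is the ceiling
# no matter how many prefetch workers run; the observed 12.1 s
# vs 7.0 s is close to this bound.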



denizyuret commented on May 16, 2024

I am getting 0.1ms for 1000 batches of 64x1000 Float32. Did I misunderstand the problem? Here is my code:

using Knet

function togpu(a)
    b=Array(Any,length(a))      # Julia 0.5-era syntax for Vector{Any}
    @inbounds for i=1:length(a)
        b[i]=KnetArray(a[i])    # one host-to-device copy per array
    end
    return b
end

a = [ rand(Float32, 64, 1000) for i=1:1000 ]
@time a1=togpu(a);
@time a2=togpu(a);
@time a3=togpu(a);


denizyuret commented on May 16, 2024

To clarify: I meant 0.1ms per transfer.


ngphuoc commented on May 16, 2024

I tried your test and got 0.27 s for the first @time and 0.07 s for the second and third, which is 70 times slower than yours. Is this abnormal? My PC configuration is CPU i7-5820K + GPU GTX 1080.

julia> using Knet
INFO: Knet using GPU 0

julia> function togpu(a)
           b=Array(Any,length(a))
           @inbounds for i=1:length(a)
               b[i]=KnetArray(a[i])
           end
           return b
       end
togpu (generic function with 1 method)

julia> a = [ rand(Float32, 64, 1000) for i=1:1000 ];

julia> @time a1=togpu(a);
  0.276411 seconds (243.87 k allocations: 10.282 MB)

julia> @time a2=togpu(a);
  0.073134 seconds (10.01 k allocations: 289.391 KB)

julia> @time a3=togpu(a);
  0.073607 seconds (10.01 k allocations: 289.391 KB)


denizyuret commented on May 16, 2024

I think this is consistent with my results, not slower. Ignore the first result; it includes compilation time. You are transferring 1000 arrays in 0.073 seconds, which means the per-transfer cost is 0.073 ms, or 73 μs, which is better than my setup. One call to cudaMalloc takes about 10 μs (GPU allocation is slow, which is why I had to write a custom memory manager for Knet). So it seems 63 μs is the cost of the RAM->GPU transfer.
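
As a back-of-the-envelope sanity check of those numbers, a sketch assuming the 64x1000 Float32 arrays and the timings quoted above:

bytes_per_array = 64 * 1000 * sizeof(Float32)  # 256_000 bytes, ~256 KB
per_transfer    = 0.073 / 1000                 # 73 μs per array, from @time
alloc_cost      = 10e-6                        # ~10 μs per cudaMalloc
bandwidth       = bytes_per_array / (per_transfer - alloc_cost)
# ~4.1e9 bytes/s, i.e. about 4 GB/s, a plausible figure for
# pageable-memory host-to-device copies (pinned memory is faster).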


denizyuret commented on May 16, 2024

For another data point, here is what I get on an AWS instance:

[ec2-user@ip-172-31-23-146 ~]$ julia foo.jl
INFO: Knet using GPU 0
  0.488581 seconds (243.72 k allocations: 10.312 MB)
  0.184520 seconds (10.01 k allocations: 289.391 KB)
  0.190965 seconds (10.01 k allocations: 289.391 KB)


ngphuoc commented on May 16, 2024

Thanks for clarifying the CPU-to-GPU transfer time. So it is a common problem and independent of the framework. Judging by your benchmarks, it is fast enough, though.
