
Comments (7)

ngphuoc commented on May 16, 2024

The following is my simple template and benchmark comparing prefetching data on a parallel CPU worker with the normal version without prefetching. Basically, the training loop does not need to wait for data. However, this is only about twice as fast, not 10 times. So I guess the GPU could run 10 trainings in parallel once all the data is on the GPU; probably I need 10 workers for prefetching. Could you give some comments? Many thanks.

macro swap(x,y)
  quote
    local tmp = $(esc(x))
    $(esc(x)) = $(esc(y))
    $(esc(y)) = tmp
  end
end

# some slow function
@everywhere function get_data(i)
  sleep(0.6)
  println("get_data $i")
  i
end

function slow_train(x)
  sleep(0.6)
  println("slow_train $x")
end

function prefetch(rng)
  @assert length(rng) > 1
  rng = collect(rng)
  a = b = nothing
  function _iter()
    for i in 1:length(rng)
      if a === nothing
        # first iteration: request the current and the next batch
        a = remotecall(get_data, 2, rng[i])
        b = remotecall(get_data, 2, rng[i+1])
      else
        if i < length(rng)
          a = remotecall(get_data, 2, rng[i+1])  # request one batch ahead
        end
        @swap(a,b)
      end
      d = fetch(a)  # blocks only if the prefetch has not finished yet
      produce(d)
    end
  end
  return Task(_iter)
end
@time for x in prefetch(1:10)
  slow_train(x)
end
% julia -p 2 test-task.jl
6.957115 seconds (153.23 k allocations: 6.454 MB)
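
A side note for readers on current Julia: the produce/Task coroutine iteration used above was removed around Julia 1.0. Below is a minimal sketch of the same two-stage prefetch using a Channel, assuming Julia >= 1.3 with the Distributed stdlib; the names prefetch_channel and worker are illustrative, not from the original script.

using Distributed
addprocs(2)

@everywhere function get_data(i)
    sleep(0.6)   # simulate slow data loading
    i
end

slow_train(x) = sleep(0.6)   # simulate one training step

# Two-stage prefetch: keep one remotecall in flight while the
# consumer trains on the previous batch.
function prefetch_channel(rng; worker=2)
    Channel{Any}(1) do ch          # buffer of 1 = one batch ahead
        fut = remotecall(get_data, worker, first(rng))
        for i in Iterators.drop(rng, 1)
            nxt = remotecall(get_data, worker, i)  # start next fetch early
            put!(ch, fetch(fut))                   # hand over current batch
            fut = nxt
        end
        put!(ch, fetch(fut))
    end
end

@time for x in prefetch_channel(1:10)
    slow_train(x)
end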
macro swap(x,y)
  quote
    local tmp = $(esc(x))
    $(esc(x)) = $(esc(y))
    $(esc(y)) = tmp
  end
end

# some slow function
@everywhere function get_data(i)
  sleep(0.6)
  println("get_data $i")
  i
end

function slow_train(x)
  sleep(0.6)
  println("slow_train $x")
end

function fetch(rng)  # note: this shadows Base.fetch (unused in this script)
  rng = collect(rng)
  function _iter()
    for i in 1:length(rng)
      d = get_data(rng[i])  # load synchronously; no overlap with training
      produce(d)
    end
  end
  return Task(_iter)
end
@time for x in fetch(1:10)
  slow_train(x)
end

% julia test-task.jl
12.146958 seconds (84.82 k allocations: 3.528 MB)
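
The roughly 2x difference matches simple pipeline arithmetic: with one prefetch stage, loading and training overlap, so the wall-clock time can at best drop from the sum of the two stage times to the larger of the two. A quick check, a sketch using the sleep times from the scripts above:

n, f, t = 10, 0.6, 0.6         # batches, fetch time, train time
serial    = n * (f + t)        # no prefetch: 12.0 s
pipelined = f + n * max(f, t)  # one batch ahead: 6.6 s
# With f == t the two stages are balanced, so ~2x is the ceiling
# no matter how many prefetch workers run; the observed 12.1 s
# vs 7.0 s is close to this bound.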



denizyuret commented on May 16, 2024

I am getting 0.1ms for 1000 batches of 64x1000 Float32. Did I misunderstand the problem? Here is my code:

using Knet

function togpu(a)
    b=Array(Any,length(a))      # Julia 0.5-era syntax for Vector{Any}
    @inbounds for i=1:length(a)
        b[i]=KnetArray(a[i])    # one host-to-device copy per array
    end
    return b
end

a = [ rand(Float32, 64, 1000) for i=1:1000 ]
@time a1=togpu(a);
@time a2=togpu(a);
@time a3=togpu(a);


denizyuret commented on May 16, 2024

To clarify: I meant 0.1ms per transfer.


ngphuoc commented on May 16, 2024

I tried your test and got 0.27 s for the first @time and 0.07 s for the second and third, which is 70 times slower than yours. Is this abnormal? My PC configuration is CPU i7-5820K + GPU GTX 1080.

julia> using Knet
INFO: Knet using GPU 0

julia> function togpu(a)
           b=Array(Any,length(a))
           @inbounds for i=1:length(a)
               b[i]=KnetArray(a[i])
           end
           return b
       end
togpu (generic function with 1 method)

julia> a = [ rand(Float32, 64, 1000) for i=1:1000 ];

julia> @time a1=togpu(a);
  0.276411 seconds (243.87 k allocations: 10.282 MB)

julia> @time a2=togpu(a);
  0.073134 seconds (10.01 k allocations: 289.391 KB)

julia> @time a3=togpu(a);
  0.073607 seconds (10.01 k allocations: 289.391 KB)


denizyuret commented on May 16, 2024

I think this is consistent with my results, not slower. Ignore the first result; it includes compilation time. You are transferring 1000 arrays in 0.073 seconds, which means the per-transfer cost is 0.073 ms, or 73 μs, which is better than my setup. One call to cudaMalloc takes about 10 μs (GPU allocation is slow, which is why I had to write a custom memory manager for Knet). So it seems 63 μs is the cost of the RAM->GPU transfer.
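
As a back-of-the-envelope sanity check of those numbers, a sketch assuming the 64x1000 Float32 arrays and the timings quoted above:

bytes_per_array = 64 * 1000 * sizeof(Float32)  # 256_000 bytes, ~256 KB
per_transfer    = 0.073 / 1000                 # 73 μs per array, from @time
alloc_cost      = 10e-6                        # ~10 μs per cudaMalloc
bandwidth       = bytes_per_array / (per_transfer - alloc_cost)
# ~4.1e9 bytes/s, i.e. about 4 GB/s, a plausible figure for
# pageable-memory host-to-device copies (pinned memory is faster).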


denizyuret commented on May 16, 2024

For another data point, here is what I get on an AWS instance:

[ec2-user@ip-172-31-23-146 ~]$ julia foo.jl
INFO: Knet using GPU 0
  0.488581 seconds (243.72 k allocations: 10.312 MB)
  0.184520 seconds (10.01 k allocations: 289.391 KB)
  0.190965 seconds (10.01 k allocations: 289.391 KB)


ngphuoc commented on May 16, 2024

Thanks for clarifying the CPU-to-GPU transfer time. So it is a common problem and independent of the framework. Judging by your benchmarks, it is fast enough, though.
