juliapomdp / DeepQLearning.jl
Implementation of the Deep Q-learning algorithm to solve MDPs
License: Other
Running the example given in the docs:
using DeepQLearning
using POMDPs
using Flux
using POMDPModels
using POMDPSimulators
using POMDPPolicies
# load MDP model from POMDPModels or define your own!
mdp = SimpleGridWorld();
# Define the Q network (see Flux.jl documentation)
# the gridworld state is represented by a 2 dimensional vector.
model = Chain(Dense(2, 32), Dense(32, length(actions(mdp))))
exploration = EpsGreedyPolicy(mdp, LinearDecaySchedule(start=1.0, stop=0.01, steps=10000/2))
solver = DeepQLearningSolver(qnetwork=model, max_steps=10000,
                             exploration_policy=exploration,
                             learning_rate=0.005, log_freq=500,
                             recurrence=false, double_q=true,
                             dueling=true, prioritized_replay=true)
policy = solve(solver, mdp)
sim = RolloutSimulator(max_steps=30)
r_tot = simulate(sim, mdp, policy)
println("Total discounted reward for 1 simulation: $r_tot")
produces an error as follows. Where can I look to fix this? I am using Julia 1.6.
ERROR: Can't differentiate loopinfo expression
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:33
[2] macro expansion
@ ./simdloop.jl:79 [inlined]
[3] Pullback
@ ./reduce.jl:243 [inlined]
[4] (::typeof(∂(mapreduce_impl)))(Δ::Float32)
@ Zygote ~/.julia/packages/Zygote/6HN9x/src/compiler/interface2.jl:0
[5] Pullback
@ ./reduce.jl:257 [inlined]
[6] (::typeof(∂(mapreduce_impl)))(Δ::Float32)
@ Zygote ~/.julia/packages/Zygote/6HN9x/src/compiler/interface2.jl:0
[7] Pullback
@ ./reduce.jl:415 [inlined]
[8] (::typeof(∂(_mapreduce)))(Δ::Float32)
@ Zygote ~/.julia/packages/Zygote/6HN9x/src/compiler/interface2.jl:0
[9] Pullback
@ ./reducedim.jl:318 [inlined]
[10] Pullback (repeats 2 times)
@ ./reducedim.jl:310 [inlined]
[11] (::typeof(∂(mapreduce)))(Δ::Float32)
@ Zygote ~/.julia/packages/Zygote/6HN9x/src/compiler/interface2.jl:0
[12] Pullback
@ ./reducedim.jl:878 [inlined]
[13] (::typeof(∂(#_sum#682)))(Δ::Float32)
@ Zygote ~/.julia/packages/Zygote/6HN9x/src/compiler/interface2.jl:0
[14] Pullback
@ ./reducedim.jl:878 [inlined]
[15] (::typeof(∂(_sum)))(Δ::Float32)
@ Zygote ~/.julia/packages/Zygote/6HN9x/src/compiler/interface2.jl:0
[16] Pullback (repeats 2 times)
@ ./reducedim.jl:874 [inlined]
[17] (::typeof(∂(sum)))(Δ::Float32)
@ Zygote ~/.julia/packages/Zygote/6HN9x/src/compiler/interface2.jl:0
[18] Pullback
@ ~/.julia/packages/DeepQLearning/jJkAu/src/solver.jl:223 [inlined]
[19] (::typeof(∂(λ)))(Δ::Float32)
@ Zygote ~/.julia/packages/Zygote/6HN9x/src/compiler/interface2.jl:0
[20] (::Zygote.var"#69#70"{Zygote.Params, typeof(∂(λ)), Zygote.Context})(Δ::Float32)
@ Zygote ~/.julia/packages/Zygote/6HN9x/src/compiler/interface.jl:252
[21] gradient(f::Function, args::Zygote.Params)
@ Zygote ~/.julia/packages/Zygote/6HN9x/src/compiler/interface.jl:59
[22] batch_train!(solver::DeepQLearningSolver{EpsGreedyPolicy{LinearDecaySchedule{Float64}, Random._GLOBAL_RNG, NTuple{4, Symbol}}}, env::POMDPModelTools.MDPCommonRLEnv{AbstractArray{Float32, N} where N, SimpleGridWorld, StaticArrays.SVector{2, Int64}}, policy::NNPolicy{SimpleGridWorld, DeepQLearning.DuelingNetwork, Symbol}, optimizer::ADAM, target_q::DeepQLearning.DuelingNetwork, replay::PrioritizedReplayBuffer{Int32, Float32, CartesianIndex{2}, StaticArrays.SVector{2, Float32}, Matrix{Float32}}; discount::Float64)
@ DeepQLearning ~/.julia/packages/DeepQLearning/jJkAu/src/solver.jl:219
[23] batch_train!
@ ~/.julia/packages/DeepQLearning/jJkAu/src/solver.jl:200 [inlined]
[24] dqn_train!(solver::DeepQLearningSolver{EpsGreedyPolicy{LinearDecaySchedule{Float64}, Random._GLOBAL_RNG, NTuple{4, Symbol}}}, env::POMDPModelTools.MDPCommonRLEnv{AbstractArray{Float32, N} where N, SimpleGridWorld, StaticArrays.SVector{2, Int64}}, policy::NNPolicy{SimpleGridWorld, DeepQLearning.DuelingNetwork, Symbol}, replay::PrioritizedReplayBuffer{Int32, Float32, CartesianIndex{2}, StaticArrays.SVector{2, Float32}, Matrix{Float32}})
@ DeepQLearning ~/.julia/packages/DeepQLearning/jJkAu/src/solver.jl:138
[25] solve(solver::DeepQLearningSolver{EpsGreedyPolicy{LinearDecaySchedule{Float64}, Random._GLOBAL_RNG, NTuple{4, Symbol}}}, env::POMDPModelTools.MDPCommonRLEnv{AbstractArray{Float32, N} where N, SimpleGridWorld, StaticArrays.SVector{2, Int64}})
@ DeepQLearning ~/.julia/packages/DeepQLearning/jJkAu/src/solver.jl:56
[26] solve(solver::DeepQLearningSolver{EpsGreedyPolicy{LinearDecaySchedule{Float64}, Random._GLOBAL_RNG, NTuple{4, Symbol}}}, problem::SimpleGridWorld)
@ DeepQLearning ~/.julia/packages/DeepQLearning/jJkAu/src/solver.jl:32
[27] top-level scope
@ REPL[11]:1
Logging of avgR in the terminal may be off; see below.
51000 / 1000000 eps 0.899 | avgR -0.997 | Loss 9.240e-03 | Grad 5.585e-03
51500 / 1000000 eps 0.898 | avgR -0.997 | Loss 2.298e-02 | Grad 1.119e-02
52000 / 1000000 eps 0.897 | avgR -0.997 | Loss 8.478e-02 | Grad 7.080e-02
52500 / 1000000 eps 0.896 | avgR -0.997 | Loss 1.660e-02 | Grad 6.243e-03
53000 / 1000000 eps 0.895 | avgR -0.997 | Loss 1.105e-02 | Grad 4.012e-03
53500 / 1000000 eps 0.894 | avgR -0.997 | Loss 1.182e-02 | Grad 7.963e-03
I think that is redundant, as we switched to CommonRLInterface and RLInterface is no longer in the dependencies.
Hi,
I'm attempting to read in the log files generated by TensorBoardLogger, but am having some issues. When I try the deserialization method recommended in the TensorBoardLogger docs, I get an error about CRC headers, so I'm wondering if there's a specific method that works for reading the logs generated by this package. I've included the error message below.
Alternatively, if there's a way to plot learning curves without reading in the log files that would also be helpful.
Thanks
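For reference, a minimal sketch of the kind of read attempt that triggers the error below; the "log/" directory is an assumption, substitute whatever logdir the solver wrote to:
using TensorBoardLogger
# map_summaries walks the event files and calls the function on each
# logged scalar as (tag, step, value).
map_summaries("log/") do tag, step, value
    println("$tag @ $step = $value")
end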
ERROR: AssertionError: crc_header == crc_header_ck
Stacktrace:
[1] read_event(::IOStream) at /home/ben/.julia/packages/TensorBoardLogger/gv4oF/src/Deserialization/deserialization.jl:16
[2] iterate(::TensorBoardLogger.TBEventFileIterator, ::Int64) at /home/ben/.julia/packages/TensorBoardLogger/gv4oF/src/Deserialization/deserialization.jl:84
[3] iterate at /home/ben/.julia/packages/TensorBoardLogger/gv4oF/src/Deserialization/deserialization.jl:83 [inlined]
[4] iterate(::TensorBoardLogger.TBEventFileCollectionIterator, ::Int64) at /home/ben/.julia/packages/TensorBoardLogger/gv4oF/src/Deserialization/deserialization.jl:59
[5] iterate at /home/ben/.julia/packages/TensorBoardLogger/gv4oF/src/Deserialization/deserialization.jl:52 [inlined]
[6] #map_summaries#158(::Bool, ::Nothing, ::Nothing, ::Bool, ::typeof(map_summaries), ::var"#6#7", ::String) at /home/ben/.julia/packages/TensorBoardLogger/gv4oF/src/Deserialization/deserialization.jl:211
[7] map_summaries(::Function, ::String) at /home/ben/.julia/packages/TensorBoardLogger/gv4oF/src/Deserialization/deserialization.jl:205
[8] top-level scope at REPL[36]:1
This issue is used to trigger TagBot; feel free to unsubscribe.
If you haven't already, you should update your TagBot.yml to include issue comment triggers. Please see this post on Discourse for instructions and more details.
If you'd like for me to do this for you, comment "TagBot fix" on this issue. I'll open a PR within a few hours; please be patient!
If the discount function returns a Float64, then everything gets promoted to Float64.
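A minimal illustration of the promotion (the values are made up):
q_sp = 1.5f0                # Float32 coming out of the network
γ = 0.95                    # Float64 returned by discount
target = 1.0f0 + γ * q_sp   # the whole expression promotes to Float64
typeof(target)              # Float64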
We could:
From Piazza:
I'm troubleshooting in the deep Q learning package and I'm having a problem with the DQExperience object. The object is defined in 'prioritized_experience_replay.jl' and is as follows:
struct DQExperience{N <: Real, T <: Real, Q}
    s::Array{T, Q}
    a::N
    r::T
    sp::Array{T, Q}
    done::Bool
end
The problem is that our states are defined using the StaticArrays type.
Is there a reason that the state is constrained to an array of real numbers? Is there a way to create a subclass of DQExperience, or something that allows for the StaticArrays type?
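For illustration, a hypothetical relaxation (FlexibleDQExperience is my name, not the package's) that widens s and sp to any AbstractArray so that SVector states fit without conversion:
using StaticArrays

struct FlexibleDQExperience{N <: Real, T <: Real, A <: AbstractArray{T}}
    s::A
    a::N
    r::T
    sp::A
    done::Bool
end

# An SVector state is now stored directly, with no copy into a plain Array.
experience = FlexibleDQExperience(SVector(0.0f0, 1.0f0), 1, 0.5f0, SVector(1.0f0, 1.0f0), false)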
The solver makes use of env.state to resolve conflicts arising from changes to the mutable env object during both exploration and evaluation. See the comments here (cf60925).
It would be nice to use something like this: https://github.com/oxinabox/UniversalTensorBoard.jl to log training data and be able to use TensorBoard.
The following error is thrown when attempting to use DeepQLearning.jl:
ERROR: LoadError: LoadError: UndefVarError: Tracker not defined
This appears to be a Flux issue.
The new version JuliaLogging/TensorBoardLogger.jl@e9cbedf changes the Logger() object to a TBLogger() object.
It would be nice to be able to use this package with only CommonRLInterface, without needing to know anything about POMDPs.jl. Currently, the main thing preventing this is the exploration policy.
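For illustration, a sketch of what a POMDPs-free exploration policy could look like, assuming only that the environment implements CommonRLInterface.actions:
using CommonRLInterface
using Random

# Epsilon-greedy over the environment's own action set; no POMDPs.jl involved.
function eps_greedy_action(env, qvals::AbstractVector, ϵ::Real, rng::AbstractRNG)
    acts = collect(actions(env))
    return rand(rng) < ϵ ? rand(rng, acts) : acts[argmax(qvals)]
end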
POMDPs.jl supports state-dependent action spaces.
However, DeepQLearning.jl always picks the full action space.
That's because solve enumerates the actions once here and hands them to the policy, where they are used broadly thereafter.
Can you think of a way to support action masking with the current implementation?
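One hypothetical approach, sketched below: mask the Q-values of invalid actions before the argmax, with the valid vector standing in for a problem-specific validity query:
# qvals: network output for the current state; acts: the enumerated actions.
function masked_action(qvals::AbstractVector, acts, valid::AbstractVector{Bool})
    masked = [valid[i] ? qvals[i] : -Inf32 for i in eachindex(qvals)]
    return acts[argmax(masked)]
end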
The solver seems too slow when using an RNN. This claim needs to be supported by benchmarks, of course.
Potential performance issues:
I am having trouble debugging my use of the DeepQLearning package and looking for some help.
My problem is the mountain car problem, where the state represents the position and velocity of the car and the action is the force you can apply in order to climb the mountain. The car starts in a valley and needs to climb out to get the reward. The force alone is not enough to reach the top, so you need to build up momentum to get up the hill.
The state is a 2-element StaticArrays.SArray{Tuple{2},Float64,1,2} with indices SOneTo(2).
The action space is RealInterval{Float64}(-1.0, 1.0), but I discretized this.
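For reference, the discretization is along these lines (the 7-point grid is just an example):
discrete_actions = collect(range(-1.0, 1.0, length=7))  # 7 evenly spaced force levels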
My network is as follows:
# Define the Q network (input {state, action}, return Q(s,a))
activation = leakyrelu
inputlayer = Dense(3, 50, activation)    # input is the size of the state-action pair
hiddenlayer1 = Dense(50, 50, activation)
outputlayer = Dense(50, 1, activation)
model = Chain(inputlayer, hiddenlayer1, outputlayer)
The environment is an MDPEnvironment and my solver is a DeepQLearningSolver, but running the following results in a dimension mismatch:
policy = solve(solver,env)
DimensionMismatch("A has dimensions (50,3) but B has dimensions (2,32)")
I followed the stacktrace, and it's happening in the Flux library here:
function (a::Dense)(x::AbstractArray)
W, b, σ = a.W, a.b, a.σ
σ.(W*x .+ b)
end
But I have no idea why any of these matrices would be size (2,32). I assume it's this AbstractArray x, since the weights would be size (50,3)... but shouldn't x just be the input, size (3,1)? Is it trying to do some batch processing or something? But that doesn't explain how we get to (2,32). The only place I can imagine those numbers is that the weight matrix itself would be an Array{Float32,2}, which makes no sense but does match up if it's somehow getting transposed... I'm not sure if this is a bug or if I am implementing this incorrectly. Any thoughts would be greatly appreciated. Thanks!
The entire Stacktrace is below for reference:
DimensionMismatch("A has dimensions (50,3) but B has dimensions (2,32)")
Stacktrace:
[1] gemm_wrapper!(::Array{Float32,2}, ::Char, ::Char, ::Array{Float32,2}, ::Array{Float32,2}, ::LinearAlgebra.MulAddMul{true,true,Float32,Float32}) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/LinearAlgebra/src/matmul.jl:545
[2] mul! at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/LinearAlgebra/src/matmul.jl:160 [inlined]
[3] mul! at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/LinearAlgebra/src/matmul.jl:203 [inlined]
[4] * at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/LinearAlgebra/src/matmul.jl:153 [inlined]
[5] (::Dense{typeof(leakyrelu),Array{Float32,2},Array{Float32,1}})(::Array{Float32,2}) at /Users/liamsmith/.julia/packages/Flux/NpkMm/src/layers/basic.jl:115
[6] applychain at /Users/liamsmith/.julia/packages/Flux/NpkMm/src/layers/basic.jl:126 [inlined]
[7] Chain at /Users/liamsmith/.julia/packages/Flux/NpkMm/src/layers/basic.jl:32 [inlined]
[8] batch_train!(::DeepQLearningSolver, ::MDPEnvironment{Array{Float32,1},QuickPOMDPs.QuickMDP{UUID("c4d31997-7cb6-478c-8b46-c104fdaf65ad"),StaticArrays.SArray{Tuple{2},Float64,1,2},Float64,NamedTuple{(:isterminal, :render, :initialstate, :gen, :actions, :discount),Tuple{DMUStudent.HW4.var"#3#10",DMUStudent.HW4.var"#4#11",DMUStudent.HW4.var"#2#9",DMUStudent.HW4.var"#1#8",DMUStudent.HW4.RealInterval{Float64},Float64}}},StaticArrays.SArray{Tuple{2},Float64,1,2},Random.MersenneTwister,false}, ::NNPolicy{QuickPOMDPs.QuickMDP{UUID("c4d31997-7cb6-478c-8b46-c104fdaf65ad"),StaticArrays.SArray{Tuple{2},Float64,1,2},Float64,NamedTuple{(:isterminal, :render, :initialstate, :gen, :actions, :discount),Tuple{DMUStudent.HW4.var"#3#10",DMUStudent.HW4.var"#4#11",DMUStudent.HW4.var"#2#9",DMUStudent.HW4.var"#1#8",DMUStudent.HW4.RealInterval{Float64},Float64}}},Chain{Tuple{Dense{typeof(leakyrelu),Array{Float32,2},Array{Float32,1}},Dense{typeof(leakyrelu),Array{Float32,2},Array{Float32,1}},Dense{typeof(leakyrelu),Array{Float32,2},Array{Float32,1}}}},Float64}, ::ADAM, ::Chain{Tuple{Dense{typeof(leakyrelu),Array{Float32,2},Array{Float32,1}},Dense{typeof(leakyrelu),Array{Float32,2},Array{Float32,1}},Dense{typeof(leakyrelu),Array{Float32,2},Array{Float32,1}}}}, ::PrioritizedReplayBuffer{Int32,Float32,CartesianIndex{2},1}) at /Users/liamsmith/.julia/packages/DeepQLearning/wF0rJ/src/solver.jl:208
[9] dqn_train!(::DeepQLearningSolver, ::MDPEnvironment{Array{Float32,1},QuickPOMDPs.QuickMDP{UUID("c4d31997-7cb6-478c-8b46-c104fdaf65ad"),StaticArrays.SArray{Tuple{2},Float64,1,2},Float64,NamedTuple{(:isterminal, :render, :initialstate, :gen, :actions, :discount),Tuple{DMUStudent.HW4.var"#3#10",DMUStudent.HW4.var"#4#11",DMUStudent.HW4.var"#2#9",DMUStudent.HW4.var"#1#8",DMUStudent.HW4.RealInterval{Float64},Float64}}},StaticArrays.SArray{Tuple{2},Float64,1,2},Random.MersenneTwister,false}, ::NNPolicy{QuickPOMDPs.QuickMDP{UUID("c4d31997-7cb6-478c-8b46-c104fdaf65ad"),StaticArrays.SArray{Tuple{2},Float64,1,2},Float64,NamedTuple{(:isterminal, :render, :initialstate, :gen, :actions, :discount),Tuple{DMUStudent.HW4.var"#3#10",DMUStudent.HW4.var"#4#11",DMUStudent.HW4.var"#2#9",DMUStudent.HW4.var"#1#8",DMUStudent.HW4.RealInterval{Float64},Float64}}},Chain{Tuple{Dense{typeof(leakyrelu),Array{Float32,2},Array{Float32,1}},Dense{typeof(leakyrelu),Array{Float32,2},Array{Float32,1}},Dense{typeof(leakyrelu),Array{Float32,2},Array{Float32,1}}}},Float64}, ::PrioritizedReplayBuffer{Int32,Float32,CartesianIndex{2},1}) at /Users/liamsmith/.julia/packages/DeepQLearning/wF0rJ/src/solver.jl:136
[10] solve(::DeepQLearningSolver, ::MDPEnvironment{Array{Float32,1},QuickPOMDPs.QuickMDP{UUID("c4d31997-7cb6-478c-8b46-c104fdaf65ad"),StaticArrays.SArray{Tuple{2},Float64,1,2},Float64,NamedTuple{(:isterminal, :render, :initialstate, :gen, :actions, :discount),Tuple{DMUStudent.HW4.var"#3#10",DMUStudent.HW4.var"#4#11",DMUStudent.HW4.var"#2#9",DMUStudent.HW4.var"#1#8",DMUStudent.HW4.RealInterval{Float64},Float64}}},StaticArrays.SArray{Tuple{2},Float64,1,2},Random.MersenneTwister,false}) at /Users/liamsmith/.julia/packages/DeepQLearning/wF0rJ/src/solver.jl:58
[11] top-level scope at In[26]:1
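For comparison, a sketch of the shape the solver appears to expect: the input is the state alone (2 numbers here), batched as a (state_dim, batch_size) matrix, which would explain the (2, 32) array, and the output is one Q-value per discrete action. Here n_actions is a hypothetical size for the discretized action set:
using Flux

n_actions = 7  # hypothetical: size of the discretized action set
activation = leakyrelu
model = Chain(Dense(2, 50, activation),    # input: the 2-dimensional state only
              Dense(50, 50, activation),
              Dense(50, n_actions))        # output: one Q-value per action, no activation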
The current package relies on TensorFlow.jl; it might be interesting to test out an implementation using Flux.jl, since it seems to be the future of deep learning for Julia.
One of my students discovered that this package uses POMDPs.actionindex. Would it be possible to make the package use only functions from the RLInterface.jl interface? (i.e., we would need to construct our own action map)
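A sketch of such an action map, built only from the environment's action list (the names here are mine):
# Assuming the environment exposes its action list, e.g. actions(env) in RLInterface.jl:
acts = collect(actions(env))
action_map = Dict(a => i for (i, a) in enumerate(acts))
# action_map[a] then plays the role of POMDPs.actionindex(mdp, a)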
Flux deprecation warning.
Warning: loadparams! will be deprecated eventually. Use loadmodel! instead.
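For reference, a minimal sketch of the replacement the warning points to; the two models here are stand-ins:
using Flux

src = Chain(Dense(2, 32, relu), Dense(32, 4))   # e.g. a trained or saved network
dst = Chain(Dense(2, 32, relu), Dense(32, 4))   # a model with the same architecture
Flux.loadmodel!(dst, src)                       # copies the parameters of src into dst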
Can a maintainer say what the state of GPU support is?
The README says that the gpu-support branch should be used, but that branch was last updated 5 years ago.
Given all the changes that have happened in the meantime, I guess it might be easier to make a new branch and support the GPU from scratch. Also, why a separate branch? Couldn't it be an option? It would be great if someone could provide some insight.
Currently this package automatically converts everything to Float32 for (PO)MDPs, but it does not do so for other environments.
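Until that is unified, a hypothetical manual workaround for a CommonRLInterface environment env is to convert observations by hand:
using CommonRLInterface

# Mirror what the solver already does for (PO)MDPs: hand the network Float32s.
obs32 = Float32.(observe(env))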
This solver uses some functions that are broader than the minimal interface defined in RLInterface and relies on internal fields such as env.problem in many places.
Ideally, the solver should support an RL environment defined using just RLInterface.jl, without necessarily having an MDP or POMDP object associated with it.