
experienced-metronome

Implementing the Deep Deterministic Policy Gradient (DDPG) algorithm for the Inverted Pendulum problem

What is the Deep Deterministic Policy Gradient model?

Deep Deterministic Policy Gradient (DDPG) is a model-free, off-policy algorithm for learning continuous actions. It combines the experience replay and slow-learning target networks from Deep Q-Networks (DQN) with the Deterministic Policy Gradient approach in order to operate over continuous action spaces.

In the classic Inverted Pendulum problem there are only two ways to act: swing left or swing right. However, the action is not one of two discrete values like -1 and +1; it is a torque chosen from the continuous range -2 to +2, giving infinitely many possible actions. This continuous action space is what makes the problem a difficult challenge for Q-learning algorithms.
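
As a quick illustration (a sketch assuming the standard Gym Pendulum-v1 environment, which is what this problem corresponds to), the action space is a single continuous torque bounded by -2 and +2:

```python
import gym  # older Gym releases name this environment "Pendulum-v0"

env = gym.make("Pendulum-v1")

# Observation: (cos(theta), sin(theta), angular velocity); action: one continuous torque value.
print(env.observation_space.shape)                   # (3,)
print(env.action_space.low, env.action_space.high)   # [-2.] [2.]
```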

Apart from the two networks, i.e. the Actor (proposes an action given a state) and the Critic (predicts whether that action is good), DDPG uses two more techniques not present in the original DQN. It uses two Target Networks, because they add stability to training, and it uses Experience Replay: instead of learning only from recent experiences, it samples from all stored experiences, kept as a list of tuples (state, action, reward, next_state), as sketched below.
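
A minimal sketch of such a replay buffer (NumPy-only; the capacity, batch size, and state/action dimensions are illustrative assumptions, not values taken from this repository):

```python
import numpy as np

class ReplayBuffer:
    """Stores (state, action, reward, next_state) tuples and samples them uniformly."""

    def __init__(self, capacity=50_000, state_dim=3, action_dim=1):
        self.capacity = capacity
        self.counter = 0
        self.states = np.zeros((capacity, state_dim), dtype=np.float32)
        self.actions = np.zeros((capacity, action_dim), dtype=np.float32)
        self.rewards = np.zeros((capacity, 1), dtype=np.float32)
        self.next_states = np.zeros((capacity, state_dim), dtype=np.float32)

    def record(self, state, action, reward, next_state):
        idx = self.counter % self.capacity  # overwrite the oldest entries when full
        self.states[idx] = state
        self.actions[idx] = action
        self.rewards[idx] = reward
        self.next_states[idx] = next_state
        self.counter += 1

    def sample(self, batch_size=64):
        max_idx = min(self.counter, self.capacity)
        idx = np.random.randint(0, max_idx, size=batch_size)
        return (self.states[idx], self.actions[idx],
                self.rewards[idx], self.next_states[idx])
```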

To encourage better exploration by the Actor network, an Ornstein-Uhlenbeck process is used to generate noise that is added to the sampled actions.
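
A sketch of the Ornstein-Uhlenbeck noise process (the theta and dt values shown are common defaults, not necessarily this repository's settings):

```python
import numpy as np

class OUActionNoise:
    """Temporally correlated noise: x_{t+1} = x_t + theta*(mu - x_t)*dt + sigma*sqrt(dt)*N(0, 1)."""

    def __init__(self, mean, std_dev, theta=0.15, dt=1e-2):
        self.mean = mean
        self.std_dev = std_dev
        self.theta = theta
        self.dt = dt
        self.x_prev = np.zeros_like(mean)

    def __call__(self):
        x = (self.x_prev
             + self.theta * (self.mean - self.x_prev) * self.dt
             + self.std_dev * np.sqrt(self.dt) * np.random.normal(size=self.mean.shape))
        self.x_prev = x  # keep the last value so successive samples are correlated
        return x

# Example: noise for a 1-D action, centred at 0 with standard deviation 0.2
noise = OUActionNoise(mean=np.zeros(1), std_dev=0.2 * np.ones(1))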

Critic loss - the mean squared error of y - Q(s, a), where y is the expected return as seen by the Target networks and Q(s, a) is the action value predicted by the Critic network. y is a moving target that the Critic model tries to achieve; we make this target more stable by updating the Target networks slowly.
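
In TensorFlow/Keras-style code the critic update could look roughly as follows; the model and batch names (critic, target_actor, target_critic, state_batch, ...) and the discount factor gamma are assumptions for this sketch, not names taken from the repository:

```python
import tensorflow as tf

gamma = 0.99  # assumed discount factor

with tf.GradientTape() as tape:
    # y = r + gamma * Q'(s', mu'(s')) as computed by the slowly-updated target networks
    target_actions = target_actor(next_state_batch, training=True)
    y = reward_batch + gamma * target_critic([next_state_batch, target_actions], training=True)

    # Q(s, a) predicted by the trainable critic for the stored actions
    critic_value = critic([state_batch, action_batch], training=True)

    # Mean squared error between the moving target y and the critic's prediction
    critic_loss = tf.math.reduce_mean(tf.math.square(y - critic_value))

grads = tape.gradient(critic_loss, critic.trainable_variables)
critic_optimizer.apply_gradients(zip(grads, critic.trainable_variables))
```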

Actor loss - computed as the mean of the values the Critic network assigns to the actions taken by the Actor network. We seek to maximize this quantity, so the loss used for gradient descent is its negative.
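
Under the same assumptions as the critic sketch above, the actor update performs gradient ascent on the critic's value by descending its negative mean:

```python
with tf.GradientTape() as tape:
    # Actions the current actor would take in the sampled states
    actions = actor(state_batch, training=True)
    critic_value = critic([state_batch, actions], training=True)

    # Gradient ascent on the critic value, implemented as descent on its negative
    actor_loss = -tf.math.reduce_mean(critic_value)

grads = tape.gradient(actor_loss, actor.trainable_variables)
actor_optimizer.apply_gradients(zip(grads, actor.trainable_variables))
```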

Hence, for a given state, the Actor network is updated so that it produces actions that receive the maximum predicted value as seen by the Critic. The Actor and Critic networks are basic Dense models with ReLU activations. The weights of the Actor's last layer must be initialized between -0.003 and 0.003: since the output layer uses a tanh activation, this prevents the initial outputs from saturating near -1 or +1, which would squash the gradients to zero.
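
A minimal Keras sketch of such an actor network (the layer sizes are illustrative; only the small last-layer initialization and the tanh output scaled to the torque bound follow the description above):

```python
import tensorflow as tf
from tensorflow.keras import layers

def get_actor(state_dim=3, action_bound=2.0):
    # Small uniform initialization keeps the initial tanh outputs away from the
    # saturated +/-1 region, where gradients vanish.
    last_init = tf.random_uniform_initializer(minval=-0.003, maxval=0.003)

    inputs = layers.Input(shape=(state_dim,))
    x = layers.Dense(256, activation="relu")(inputs)
    x = layers.Dense(256, activation="relu")(x)
    outputs = layers.Dense(1, activation="tanh", kernel_initializer=last_init)(x)

    # Scale the tanh output from [-1, 1] to the environment's torque range [-2, 2]
    outputs = outputs * action_bound
    return tf.keras.Model(inputs, outputs)
```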

A policy() function is used to sample actions, and learn() is called to train at each time step, along with updating the Target networks at a rate tau.
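
A sketch of the soft target-network update and the noisy policy() sampler (tau = 0.005 and the action bounds are assumed defaults, not necessarily this repository's values):

```python
import numpy as np
import tensorflow as tf

tau = 0.005  # assumed soft-update rate

def update_target(target_weights, weights, tau):
    # Move every target parameter a small step toward the corresponding learned parameter,
    # e.g. update_target(target_actor.variables, actor.variables, tau)
    for target_var, var in zip(target_weights, weights):
        target_var.assign(var * tau + target_var * (1.0 - tau))

def policy(state, noise_object, actor, lower_bound=-2.0, upper_bound=2.0):
    # Deterministic action from the Actor plus exploration noise, clipped to the valid torque range
    sampled_action = tf.squeeze(actor(state))
    noisy_action = sampled_action.numpy() + noise_object()
    return np.clip(noisy_action, lower_bound, upper_bound)
```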

NOTE: If training proceeds correctly, the average episodic reward will increase with time.

Before vs After Training:
