zeke-xie / adaptive-inertia-adai

[ICML 2022, Oral] The PyTorch Implementation of Adaptive Inertia Methods. The algorithms are based on our paper: "Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate and Momentum".

License: MIT License

Python 63.29% Jupyter Notebook 36.71%

machine-learning optimization ai deep-learning optimizer pytorch

adaptive-inertia-adai's Issues

Linear layer state['step'] increment is 2

If I define a simple linear model like this:

import torch


class TinyModel(torch.nn.Module):

    def __init__(self):
        super(TinyModel, self).__init__()

        self.layer1 = torch.nn.Linear(1000, 100, bias=True)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        x = self.layer1(x)
        x = self.relu(x)
        return x

A linear layer has two parameters, A and b, in y = xA' + b.
In your adai.py code, state['step'] will be 1 for A and 2 for b. If we unroll the for loop:

# Iteration 1: p is A
param_size = param_size + sizeA
grad = p.grad.data  # A's grad
bias_correction2 = 1 - beta2  # i.e. state['step'] == 1
exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
exp_avg_sq_hat_sum += exp_avg_sq.sum() / bias_correction2

# Iteration 2: p is b
param_size = param_size + sizeb
grad = p.grad.data  # b's grad
bias_correction2 = 1 - beta2**2  # i.e. state['step'] == 2
exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
exp_avg_sq_hat_sum += exp_avg_sq.sum() / bias_correction2

Is it a bug for a layer with multiple parameters?
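
One way to check the premise is to look at where the step counter is stored. The sketch below is a pure-Python stand-in, not the code from adai.py. It assumes the counter lives in a per-parameter state dict, the way `torch.optim.Optimizer.state` is keyed by parameter; the helper names `per_param_steps`, `shared_steps`, and `bias_correction2` are hypothetical. Under that assumption A and b both see step 1 on the first call, whereas a single counter shared across the loop would give 1 and 2 as described above.

```python
from collections import defaultdict


def per_param_steps(param_names, num_calls):
    """Step counters when every parameter owns its own state dict."""
    state = defaultdict(dict)  # hypothetical stand-in for optimizer.state
    history = []
    for _ in range(num_calls):
        seen = {}
        for name in param_names:
            s = state[name]
            if not s:  # lazy initialisation on first use
                s["step"] = 0
            s["step"] += 1
            seen[name] = s["step"]
        history.append(seen)
    return history


def shared_steps(param_names, num_calls):
    """Step counters when one counter is shared by all parameters."""
    step = 0
    history = []
    for _ in range(num_calls):
        seen = {}
        for name in param_names:
            step += 1
            seen[name] = step
        history.append(seen)
    return history


def bias_correction2(beta2, step):
    """Second-moment bias correction used in the unrolled loop above."""
    return 1 - beta2 ** step


names = ["layer1.weight", "layer1.bias"]
print(per_param_steps(names, 1))  # [{'layer1.weight': 1, 'layer1.bias': 1}]
print(shared_steps(names, 1))     # [{'layer1.weight': 1, 'layer1.bias': 2}]
```

Printing `state['step']` for each parameter inside the real optimizer loop would show which of the two cases applies.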

Update code for adaiW please

Fitting the Bortfeld function with Adai and Adam: Adai cannot converge at the same learning rate

I tested some simple models in PyTorch, and Adai did perform better than Adam, so I tried using Adai to fit the Bortfeld function. I implemented two MATLAB functions to compare Adai and Adam, and found that Adai does not converge while Adam does. Is my implementation wrong?

The Bortfeld function is used to fit the proton Bragg peak; see "An analytical approximation of the Bragg curve for therapeutic proton beams".

function [theta_best,loss] = adai(depth,para,idd_i,lb,ub,lr)
    % Adaptive Inertia (Adai)
    T = 2000;
    beta0 = 0.1;
    beta1_cum_prod = 1;
    beta2 = 0.99;
    epsilon = 1e-3;
    loss = zeros(T,1);
    m_tm1 = 0;
    v_tm1 = 0;
    theta_tm1 = para;
    v_t_mean = 0;
    
    theta_best = para;
    loss_best = 1e9;
    loss(1) = norm((bf_mex(depth,theta_tm1,'idd') - idd_i),'fro');
    for t = 2:T
        % get gradient = jacobian*error
        g_t = 2*bf_mex(depth,theta_tm1,'jacobian')'*(bf_mex(depth,theta_tm1,'idd') - idd_i);
        % Update biased second raw moment estimate
        v_t = beta2*v_tm1 + (1-beta2)*g_t.^2;
        % Compute bias-corrected second raw moment estimate
        v_t_hat = v_t / (1-beta2^(t-1));
        v_t_mean = mean(v_t_hat);
        beta1t = max(min(1-(v_t_hat./v_t_mean).*beta0, 1-epsilon),0);
        % Update biased first moment estimate
        m_t = beta1t.*m_tm1 + (1-beta1t).*g_t;
        beta1_cum_prod = beta1_cum_prod.*beta1t;
        % Compute bias-corrected first moment estimate
        m_t_hat = m_t ./ (1-beta1_cum_prod);
        % Update parameters
        theta_t = theta_tm1 - lr*m_t_hat;
        % constrain
        theta_t(theta_t < lb) = lb(theta_t < lb);
        theta_t(theta_t > ub) = ub(theta_t > ub);
        
        theta_tm1 = theta_t;
        m_tm1 = m_t;
        v_tm1 = v_t;
            
        idd_pred = bf_mex(depth,theta_t,'idd');
        loss(t) = norm((idd_pred - idd_i),'fro');
        
        if loss(t) < loss_best
           loss_best = loss(t);
           theta_best = theta_t;
        end
        if (abs(loss(t) - loss(t-1)) < 1e-6)
            break;
        end
    end
    
end

function [theta_best,loss] = adam(depth,para,idd_i,lb,ub,lr)
    T = 2000;
    beta1 = 0.9;
    beta2 = 0.999;
    epsilon = 1e-8;
    loss = zeros(T,1);
    m_tm1 = 0;
    v_tm1 = 0;
    theta_tm1 = para;
    
    theta_best = para;
    loss_best = 1e9;
    loss(1) = norm((bf_mex(depth,theta_tm1,'idd') - idd_i),'fro');
    for t = 2:T
        % get gradient = jacobian*error
        g_t = 2*bf_mex(depth,theta_tm1,'jacobian')'*(bf_mex(depth,theta_tm1,'idd') - idd_i);
        % Update biased first moment estimate
        m_t = beta1*m_tm1 + (1-beta1)*g_t;
        % Update biased second raw moment estimate
        v_t = beta2*v_tm1 + (1-beta2)*g_t.^2;
        % Compute bias-corrected first moment estimate
        m_t_hat = m_t / (1-beta1^(t-1));
        % Compute bias-corrected second raw moment estimate
        v_t_hat = v_t / (1-beta2^(t-1));
        % Update parameters
        theta_t = theta_tm1 - lr*m_t_hat./(sqrt(v_t_hat)+epsilon);
        
        % constrain
        theta_t(theta_t < lb) = lb(theta_t < lb);
        theta_t(theta_t > ub) = ub(theta_t > ub);
        
        theta_tm1 = theta_t;
        m_tm1 = m_t;
        v_tm1 = v_t;
            
        idd_pred = bf_mex(depth,theta_t,'idd');
        loss(t) = norm((idd_pred - idd_i),'fro');
        
        if loss(t) < loss_best
           loss_best = loss(t);
           theta_best = theta_t;
        end
        if (abs(loss(t) - loss(t-1)) < 1e-6)
            break;
        end
    end
    
end
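
One difference visible in the two functions above is that the Adam update divides by `sqrt(v_t_hat)` while the Adai update does not, so the same `lr` can produce steps of very different size when gradients are far from unit scale. The sketch below reduces both updates to their first step on a single scalar parameter, using the hyperparameter defaults from the MATLAB code. The function names and the sample gradient values are illustrative assumptions, and the sketch does not diagnose the Bortfeld fit itself.

```python
import math


def adam_first_step(grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the first Adam update for one scalar parameter."""
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1)  # bias correction at step 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


def adai_first_step(grad, lr, beta0=0.1, beta2=0.99, eps=1e-3):
    """Size of the first Adai-style update for one scalar parameter."""
    v = (1 - beta2) * grad ** 2
    v_hat = v / (1 - beta2)
    v_mean = v_hat  # with a single parameter the mean is the value itself
    # Guard added for this sketch only: avoid 0/0 when the gradient is zero.
    ratio = v_hat / v_mean if v_mean > 0 else 1.0
    beta1 = max(min(1 - ratio * beta0, 1 - eps), 0.0)
    m = (1 - beta1) * grad
    m_hat = m / (1 - beta1)  # cumulative product of beta1 is beta1 at step 1
    return lr * m_hat        # no division by sqrt(v_hat)


for g in (1e-3, 1.0, 1e3):
    print(g, adam_first_step(g, lr=0.01), adai_first_step(g, lr=0.01))
```

In this sketch the Adam step stays close to `lr` in magnitude regardless of the gradient scale, while the Adai-style step is proportional to the gradient, so a learning rate tuned for one may be far too large or too small for the other.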

Upload code to PyPI

Can you publish your code on PyPI?

About combining with PNM

Hi,

Firstly, thanks for the great work.

I wonder whether Adai is suitable for combining with positive-negative momentum (PNM), which is another of your works. From my understanding of the two, it should be fine to replace the original momentum in Adai with PNM. Do you have any suggestions or experience with this?

Thanks for the reply.

Test performance of adai on NLP tasks?
