Comments (7)

Ledzy commented on August 24, 2024

We believe we have clearly pointed out the major difference: BAdam is a block coordinate optimizer, while HiFT splits Adam (which requires CPU offloading). These are two entirely different algorithms, and they manage the optimizer states in essentially different ways. The design principles are fundamentally different as well, as we mentioned before. We will discuss this related work further in a later version of our paper.

from badam.

misonsky commented on August 24, 2024

I think this is a different story with a similar idea. Thanks for more discussion.

misonsky commented on August 24, 2024

Our team evaluated BAdam's code and paper, and we agree that the methods of BAdam and HiFT overlap to a high degree. In essence, a coordinate is just the index of the corresponding block. We will continue to follow BAdam and will take measures to protect our rights if necessary.

Ledzy commented on August 24, 2024

Thanks for your comment.

We had a quick read of the HiFT paper. We believe the algorithmic design and implementation of HiFT and BAdam are fundamentally different. In particular, BAdam is based on the classical block coordinate optimization framework, which solves block sub-problems sequentially, with each sub-problem approximately solved using Adam or another optimizer. Following this principle, BAdam is a block coordinate optimizer and is implemented by updating the same block for $K$ Adam steps before switching to the next block. In comparison, HiFT proposes a valuable algorithm that switches the parameters being updated at every iteration. As a result, BAdam only stores the optimizer states of the actively updated block, while HiFT stores the optimizer states of the whole model (they apply an elegant CPU offload technique to reduce the memory cost of the optimizer states).

Additionally, we would like to highlight that BAdam's block partition strategy can be arbitrary and is not restricted to transformer layers, while HiFT focuses on layer-wise partitioning. We have discussed two possible partition strategies and released their implementations: partition by module (the BlockOptimizer) and partition by ratio (the BlockOptimizerRatio).

Please see the discussion "BAdam is not blocklizing Adam" at the end of Section 2.1 of our paper, where we carefully clarify the aforementioned difference and highlight the principle behind our algorithmic design.
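
As a rough illustration of the block coordinate scheme described above, here is a minimal sketch in plain Python on a toy quadratic objective. It is not the released BAdam implementation: the function names (`block_coordinate_adam`, `block_grad`, `loss`), the toy objective, and the default hyperparameters are all assumptions made for this example. It only shows the two properties mentioned in the comment: each block receives K consecutive Adam steps, and Adam states exist only for the active block.

```python
import math

def loss(theta, target):
    # Toy objective (assumption): 0.5 * sum_i (theta_i - target_i)^2
    return 0.5 * sum((t - y) ** 2 for t, y in zip(theta, target))

def block_grad(theta, target, block):
    # Partial derivatives for the active block's coordinates only.
    return {i: theta[i] - target[i] for i in block}

def block_coordinate_adam(theta, target, blocks, K=50, epochs=3,
                          lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    theta = list(theta)          # do not modify the caller's list
    peak_state = 0               # largest number of Adam state entries alive at once
    for _ in range(epochs):
        for block in blocks:
            # Fresh first/second moment estimates for the active block only.
            m = {i: 0.0 for i in block}
            v = {i: 0.0 for i in block}
            peak_state = max(peak_state, len(m) + len(v))
            for t in range(1, K + 1):   # K consecutive Adam steps on this block
                g = block_grad(theta, target, block)
                for i in block:
                    m[i] = beta1 * m[i] + (1 - beta1) * g[i]
                    v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
                    m_hat = m[i] / (1 - beta1 ** t)   # bias correction
                    v_hat = v[i] / (1 - beta2 ** t)
                    theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
            # m and v are dropped here, before the next block becomes active.
    return theta, peak_state

if __name__ == "__main__":
    theta0 = [0.0] * 6
    target = [1.0, -2.0, 3.0, 0.5, -1.0, 2.0]
    blocks = [[0, 1], [2, 3], [4, 5]]   # an arbitrary partition, not tied to layers
    final, peak = block_coordinate_adam(theta0, target, blocks)
    print("loss before:", loss(theta0, target), "loss after:", loss(final, target))
    print("peak Adam state entries:", peak, "of", 2 * len(theta0), "for full Adam")
```

In this sketch the peak optimizer state is two entries per coordinate of the largest block, independent of the total number of parameters. A scheme that switches the updated group at every iteration and keeps each group's moments would need state for all groups, held somewhere (for example offloaded to CPU).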

misonsky commented on August 24, 2024

Blocks in BAdam are similar to groups in HiFT. The biggest difference between BAdam and HiFT is that HiFT updates a given set of parameters once at each step, while BAdam updates the same set of parameters several times in a row. Both BAdam and HiFT adopt the idea of updating only some of the parameters at a time and proceeding layer by layer; they differ only slightly in operation. The overall ideas overlap a lot, and I think this needs to be clarified and distinguished in the paper.

misonsky commented on August 24, 2024

HiFT is model independent and optimizer independent, and it can be applied to any model. Anyway, I think this needs a clarification or comparison.

misonsky commented on August 24, 2024

BlockOptimizerRatio masks the gradient after it has been calculated. This does not reduce the amount of computation or the memory usage compared to normal block-by-block fine-tuning.
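
To make the concern concrete, here is a toy sketch in plain Python contrasting "compute the full gradient, then mask it" with "compute only the active block's gradient". It does not inspect or reproduce BlockOptimizerRatio's actual code; the function names and the quadratic objective are assumptions for illustration only. Both routes produce the same update direction, but the masked route still evaluates every partial derivative.

```python
def full_grad_then_mask(theta, target, mask):
    # Every partial derivative is computed, then entries outside the mask are zeroed.
    g = [t - y for t, y in zip(theta, target)]
    computed = len(g)                      # work done: all coordinates
    masked = [gi if keep else 0.0 for gi, keep in zip(g, mask)]
    return masked, computed

def block_only_grad(theta, target, block):
    # Only the active block's partial derivatives are computed.
    g = [0.0] * len(theta)
    for i in block:
        g[i] = theta[i] - target[i]
    return g, len(block)                   # work done: block coordinates only

if __name__ == "__main__":
    theta = [0.0] * 6
    target = [1.0, -2.0, 3.0, 0.5, -1.0, 2.0]
    block = [2, 3]
    mask = [i in block for i in range(len(theta))]
    masked, n_full = full_grad_then_mask(theta, target, mask)
    blk, n_blk = block_only_grad(theta, target, block)
    print("same update direction:", masked == blk)
    print("partials computed:", n_full, "vs", n_blk)
```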
