This work is almost very similar to existing HiFT, what differences do the authors thi

What is the difference from existing HiFT "A Hierarchical Full Parameter Fine-Tuning Strategy"? about badam (CLOSED)

misonsky commented on August 24, 2024

What is the difference from existing HiFT "A Hierarchical Full Parameter Fine-Tuning Strategy"?

from badam.

Comments (7)

Ledzy commented on August 24, 2024

We believe we have clearly pointed out the major difference: BAdam is a block coordinate optimizer, while HiFT splits Adam (and needs CPU offloading). This results in two entirely different algorithms, which differ essentially in how they manage the optimizer states. The design principles are fundamentally different as well, as we mentioned before. We will provide more discussion of this related work in a later version of our paper.

misonsky commented on August 24, 2024

I think this is a different story built on a similar idea. Thanks for the further discussion.

misonsky commented on August 24, 2024

Our team evaluated BAdam's code and paper, and we agree that the methods of BAdam and HiFT overlap to a high degree. In essence, a coordinate is just the index of the corresponding block. We will continue to follow BAdam and will take measures to protect our rights if necessary.

Ledzy commented on August 24, 2024

Thanks for your comment.

We had a quick read of the HiFT paper. We believe the algorithmic design and implementation of HiFT and BAdam are fundamentally different. In particular, BAdam is based on the classical block coordinate optimization framework, which solves block sub-problems sequentially, with each sub-problem approximately solved using Adam or another optimizer. Following this principle, BAdam is a block coordinate optimizer and is implemented by updating the same block for $K$ Adam steps before switching to the next block. In comparison, HiFT proposes a valuable algorithm that switches the parameters being updated at every iteration. As a result, BAdam stores only the optimizer states of the actively updated block, while HiFT stores the optimizer states of the whole model (they apply an elegant CPU offload technique to reduce the memory cost of the optimizer states).

Additionally, we would like to highlight that BAdam's block partition strategy can be arbitrary and is not restricted to transformer layers, while HiFT focuses on layer-wise partition. We have discussed two possible partition strategies and released their implementations: partition by module (the BlockOptimizer) and partition by ratio (the BlockOptimizerRatio).

Please see the discussion "BAdam is not blocklizing Adam" at the end of Section 2.1 of our paper, where we carefully clarify the difference described above and highlight the principle behind our algorithmic design.
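
As a rough illustration of the scheme described in this comment, here is a minimal pure-Python sketch on a toy quadratic objective. It is not BAdam's implementation: the function name `block_coordinate_adam`, the toy gradient, the block partition, the hyperparameter values, and the choice to start each block from zeroed moments are all assumptions made for this example. It shows only two points: each block receives K consecutive Adam steps, and Adam's moment estimates exist only for the active block.

```python
import math

def block_coordinate_adam(x, blocks, grad_fn, K, epochs,
                          lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Toy block coordinate descent with Adam as the inner solver.

    x       : list of floats (all parameters)
    blocks  : list of index lists, one per block (hypothetical partition)
    grad_fn : function returning the full gradient as a list of floats
    Returns the updated parameters and the peak number of optimizer-state
    scalars alive at any one time.
    """
    x = list(x)
    peak_state = 0
    for _ in range(epochs):
        for block in blocks:
            # Assumption: fresh first/second moments, for the active block only.
            m = [0.0] * len(block)
            v = [0.0] * len(block)
            peak_state = max(peak_state, len(m) + len(v))
            for t in range(1, K + 1):
                # The toy computes the full gradient for simplicity; only
                # the active block's entries are used.
                g = grad_fn(x)
                for j, i in enumerate(block):
                    m[j] = beta1 * m[j] + (1 - beta1) * g[i]
                    v[j] = beta2 * v[j] + (1 - beta2) * g[i] ** 2
                    m_hat = m[j] / (1 - beta1 ** t)   # bias correction
                    v_hat = v[j] / (1 - beta2 ** t)
                    x[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
            # m and v are discarded here, before the next block starts.
    return x, peak_state


def loss(x):
    return 0.5 * sum(xi * xi for xi in x)

def grad(x):
    return list(x)  # gradient of the toy quadratic loss above

x0 = [5.0, -3.0, 2.0, -1.0]
x_final, peak_state = block_coordinate_adam(
    x0, blocks=[[0, 1], [2, 3]], grad_fn=grad, K=50, epochs=3)
```

With two blocks of two parameters each, `peak_state` counts two moments per active parameter, whereas running Adam over all four parameters at once would keep moments for every parameter.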

misonsky commented on August 24, 2024

Blocks in BAdam are similar to groups in HiFT. The biggest difference between BAdam and HiFT is that HiFT updates a given set of parameters only once at each step, while BAdam updates the same set of parameters several times in a row. Both BAdam and HiFT adopt the idea of updating only some of the parameters at a time and proceeding layer by layer; they differ only slightly in operation. The overall ideas overlap a lot, and I think this needs to be clarified and distinguished in the paper.
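
The difference in switching order described in this thread can be sketched as two schedules. This is only a toy model of which block or group is active at each step; the function names are hypothetical, and it does not reproduce either project's code or the way optimizer states are handled.

```python
def repeated_block_schedule(num_blocks, K, total_steps):
    """Active block per step when each block is held for K consecutive steps."""
    return [(t // K) % num_blocks for t in range(total_steps)]

def per_step_schedule(num_groups, total_steps):
    """Active group per step when the group changes at every step."""
    return [t % num_groups for t in range(total_steps)]

print(repeated_block_schedule(3, 2, 8))  # [0, 0, 1, 1, 2, 2, 0, 0]
print(per_step_schedule(3, 8))           # [0, 1, 2, 0, 1, 2, 0, 1]
```

In this toy model the two schedules coincide when K = 1; the management of optimizer states, which the authors point to as the main difference, is not modeled here.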

misonsky commented on August 24, 2024

HiFT is model-independent and optimizer-independent, and it can be applied to any model. In any case, I think this needs a clarification or comparison.

misonsky commented on August 24, 2024

BlockOptimizerRatio masks the gradient after it has been calculated. Compared to normal block-by-block fine-tuning, this does not reduce the amount of computation or the memory usage.
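
The point about masking can be illustrated with a toy sketch. This is not the BlockOptimizerRatio code, and it makes no claim about how that class actually works; the function `masked_ratio_step` and its parameters are hypothetical. It only shows that when a mask is applied after the full gradient has been computed, the number of gradient entries computed stays the same and only the number of entries applied shrinks.

```python
import random

def masked_ratio_step(x, grad_fn, ratio, lr, rng):
    """One plain gradient step applied to a random fraction of coordinates.

    Returns (new_x, gradient entries computed, entries actually applied).
    """
    g = grad_fn(x)                        # full gradient is computed first
    k = max(1, int(len(x) * ratio))       # number of coordinates to keep
    keep = set(rng.sample(range(len(x)), k))
    new_x = [xi - lr * gi if i in keep else xi
             for i, (xi, gi) in enumerate(zip(x, g))]
    return new_x, len(g), k

x = [1.0] * 10
new_x, computed, applied = masked_ratio_step(
    x, grad_fn=lambda p: list(p), ratio=0.2, lr=0.1, rng=random.Random(0))
```

Here `computed` equals the full parameter count while `applied` is only the kept fraction, which is the cost pattern the comment describes.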
