Comments (5)
Gotcha, thank you very much!
from lmflow.
Thanks for your interest in LMFlow! Currently, the optimizer states of intermediate layers are treated as discarded at the end of each LISA interval. Maintaining the m1 and m2 states for all layers would incur a large memory overhead, essentially making LISA's memory footprint the same as full-parameter training's.
The suggestion of maintaining them in a smarter way is a great one! There could be some engineering mechanisms to occasionally offload those states to CPU memory or disk, but this feature is still under implementation and not integrated yet. Hope this information helps 😄
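To make the bookkeeping concrete, here is a minimal pure-Python sketch of the idea described above: first-moment (m1) and second-moment (m2) states are kept only for the currently active layers, and dropped when a LISA interval ends. This is not LMFlow's actual code; the layer names, `run_interval` helper, and toy scalar "parameters" are all illustrative.

```python
import math

def adam_step(param, grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam-style update for a single scalar parameter."""
    state["m1"] = b1 * state["m1"] + (1 - b1) * grad        # first moment
    state["m2"] = b2 * state["m2"] + (1 - b2) * grad ** 2   # second moment
    state["t"] += 1
    m1_hat = state["m1"] / (1 - b1 ** state["t"])           # bias correction
    m2_hat = state["m2"] / (1 - b2 ** state["t"])
    return param - lr * m1_hat / (math.sqrt(m2_hat) + eps)

layers = {f"layer{i}": 1.0 for i in range(4)}   # toy per-layer "parameters"
states = {}                                     # moment states, active layers only

def run_interval(active_layers, grads):
    for name in active_layers:
        # Newly activated layers start from fresh (zeroed) moment states.
        st = states.setdefault(name, {"m1": 0.0, "m2": 0.0, "t": 0})
        layers[name] = adam_step(layers[name], grads[name], st)
    # End of the LISA interval: discard states of layers no longer active.
    for name in list(states):
        if name not in active_layers:
            del states[name]

run_interval({"layer0", "layer2"}, {"layer0": 0.5, "layer2": -0.3})
run_interval({"layer1"}, {"layer1": 0.2})
print(sorted(states))  # ['layer1'] — only the active layer keeps m1/m2
```

Because `states` only ever holds entries for the active subset of layers, peak optimizer memory scales with the number of activated layers rather than the full model, which is the trade-off discussed above.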
Thank you for your prompt response!
I have another question regarding the implementation of AdamW in PyTorch. Specifically, does the native PyTorch implementation of AdamW accommodate dynamic adjustments, such as discarding the states of frozen layers and initializing states for newly activated layers? Or have you customized the AdamW class to support these functionalities for LISA?
Thanks for your comments! It is a great question. In our paper, we avoid this risk by running each LISA interval separately, loading and saving the model each time; this made the implementation easier for early-stage experiments.
We haven't examined our current implementation in LMFlow closely yet, but we have been monitoring the memory consumption and it is much lower than that of full-parameter training, so we conjecture this part is not a serious problem. If so, LISA's memory consumption in LMFlow could be reduced even further, which would be great 😄
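The per-interval save/load scheme mentioned above could look roughly like the sketch below. This is a hedged illustration, not LMFlow's API: `save_checkpoint`/`load_checkpoint` and the random sampling of active layers are stand-ins, and "training" is reduced to a scalar increment.

```python
import json
import os
import random
import tempfile

def save_checkpoint(path, weights):
    with open(path, "w") as f:
        json.dump(weights, f)

def load_checkpoint(path):
    with open(path) as f:
        return json.load(f)

n_layers, n_active, n_intervals = 8, 2, 3
ckpt = os.path.join(tempfile.mkdtemp(), "model.json")
save_checkpoint(ckpt, {f"layer{i}": 0.0 for i in range(n_layers)})

rng = random.Random(0)
for interval in range(n_intervals):
    weights = load_checkpoint(ckpt)                 # fresh run: no stale optimizer states
    active = rng.sample(sorted(weights), n_active)  # LISA: activate a few random layers
    for name in active:
        weights[name] += 1.0                        # stand-in for one interval of training
    save_checkpoint(ckpt, weights)                  # weights persist; optimizer state does not

final = load_checkpoint(ckpt)
print(sum(final.values()))  # n_active * n_intervals = 6 total updates survived
```

Since each interval starts from a checkpoint that contains only weights, the optimizer state is dropped by construction between intervals, which matches the early-stage experimental setup described above.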
Hi @research4pan,
Thanks for your great work.
I have a question about the part of the implementation where the optimizer states of frozen layers are discarded in LMFlow.
I've been trying to locate this particular section in the code, but I couldn't find any corresponding implementation in https://github.com/OptimalScale/LMFlow/blob/main/src/lmflow/pipeline/finetuner.py#L301.
Your help in figuring this out would be greatly appreciated.
Thanks for your time!
Related Issues (20)
- Question HOT 8
- [BUG] LISA Finetune: AttributeError: 'ChatGLMModel' object has no attribute 'h' HOT 3
- [BUG] LISA: same loss regardless of lisa_activated_layers HOT 17
- is support llava model ? HOT 3
- Multiple rounds of training HOT 1
- Running install.sh after git clone requires over 200GB Ram HOT 6
- [DPO is available?] HOT 2
- Unable to activate conda environment on Colab HOT 7
- Cannot open the address http://lmflow.org:5000 HOT 4
- "trust_remote_code=True" problem HOT 1
- Questions about task tuning in medical domain HOT 5
- Evaluation error on PubMedQA dataset HOT 3
- Question Regarding Optimizer Reinitialization in Lisa Implementation HOT 4
- About using multiple GPUs to do lisa fine-tuning HOT 4
- How to set learning rate decay in lisa fine-tuning HOT 2
- [New Feature] Could someone share the finetuned diffusion model which is good at 256x256 resolution?
- Memory problem of Lisa finetuning HOT 5
- Does it support llama3? HOT 5
- Causal LM finetuning HOT 3