Comments (8)

Ablustrund commented on August 14, 2024

Could you share the specific error message?

Liukairong2023 commented on August 14, 2024

Could you share the specific error message?

There is no specific error message. The environment was installed from requestment.yml, run_loramoe.sh was edited accordingly, but running bash run_loramoe.sh did not start the experiment. Below we provide a screenshot of the LoRAMoE project structure, a screenshot of the run_loramoe.sh configuration, and a screenshot of running the bash run_loramoe.sh command.
[Screenshots: LoRAMoE project structure, run_loramoe.sh configuration, and the bash run_loramoe.sh command]
As you can see, the environment is newloramoe, the project was cloned from https://github.com/Ablustrund/LoRAMoE, an output directory was created with a log/xxxx.log file inside it, and the model is a locally downloaded llama-7b-hf checkpoint. When bash run_loramoe.sh is run, the script exits almost immediately and no training output appears in the terminal.

Ablustrund commented on August 14, 2024

That is strange; we have not run into this with this project so far, and there should at least be some error output. From your screenshots, nproc_per_node should be 4, though that is probably not the cause.

Liukairong2023 commented on August 14, 2024

That is strange; we have not run into this with this project so far, and there should at least be some error output. From your screenshots, nproc_per_node should be 4, though that is probably not the cause.

Hello, with CUDA_VISIBLE_DEVICES=0,1,2,3 and nproc_per_node=4 the following error is now reported:
unscale_and_clip_grads
self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
So during training some tensors end up on different devices, and that mismatch triggers the error.

Ablustrund commented on August 14, 2024

Some tensors have not been moved to CUDA (they are still on the CPU). Please check the parts you modified; running on the toy data normally works without problems.
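
Not part of the LoRAMoE code, but as a minimal debugging sketch for this kind of error: assuming model is the PyTorch model object that run_loramoe.py builds, a hypothetical helper like the one below lists any parameters or buffers still left on the CPU just before training starts.

import torch

def find_cpu_tensors(model: torch.nn.Module) -> None:
    # Print every parameter and buffer that is not on a CUDA device;
    # call this right before handing the model to the trainer.
    for name, param in model.named_parameters():
        if param.device.type != "cuda":
            print(f"parameter on {param.device}: {name}")
    for name, buf in model.named_buffers():
        if buf.device.type != "cuda":
            print(f"buffer on {buf.device}: {name}")

Note that when ZeRO-3 CPU offload is enabled, DeepSpeed intentionally keeps its fp32 optimizer partitions on the CPU, so a device mismatch can also come from an inconsistent offload configuration rather than from the model code itself.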

Liukairong2023 commented on August 14, 2024

Hello, when running the experiment code on your team's dataset I get an out-of-memory error. My server has 4×3090 GPUs, i.e. 96 GB of VRAM in total, and DeepSpeed is enabled in the run script. In principle 96 GB should be enough, but it still reports OOM. Could you please take a look? The run screenshots are below:
[Screenshot of the out-of-memory error]
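
Not from the repository, but to narrow down where the 24 GB of each 3090 goes, a small hypothetical helper like the following can be called around the failing step (for example after model construction and after the first optimizer step) to log per-GPU memory:

import torch

def report_gpu_memory(tag: str) -> None:
    # Log allocated / reserved / free memory for every GPU visible to this process.
    gib = 2 ** 30
    for i in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(i)    # free/total device memory in bytes
        allocated = torch.cuda.memory_allocated(i)  # bytes currently held by tensors
        reserved = torch.cuda.memory_reserved(i)    # bytes held by the caching allocator
        print(f"[{tag}] cuda:{i} allocated={allocated / gib:.1f} GiB "
              f"reserved={reserved / gib:.1f} GiB free={free / gib:.1f}/{total / gib:.1f} GiB")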

Liukairong2023 commented on August 14, 2024

echo "Starting the training process...3"
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=1
max_seq_length=1024
output_dir=/home/lkr/LoRAMoE/output
exp_name=0308_debug_format_for_opensource

echo "Starting the training process...4"

deepspeed_config_file=ds_zero2_no_offload.json

deepspeed_config_file=ds_zero3_offload.json

CUDA_VISIBLE_DEVICES=0,1,2,3
CUDA_LAUNCH_BLOCKING=1
echo "Starting the training process...5"
torchrun --nnodes 1 --nproc_per_node 4 --node_rank 0 --master_port 29502
run_loramoe.py \

This is how the GPUs are set up.
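
A hedged side note rather than anything the maintainers prescribe: environment variables assigned without export in a shell script are not inherited by torchrun's worker processes, so a quick check like this at the top of run_loramoe.py confirms the intended four GPUs are actually visible.

import os
import torch

# Confirms CUDA_VISIBLE_DEVICES reached this process and that the number
# of visible GPUs matches --nproc_per_node (4 in the script above).
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible GPUs =", torch.cuda.device_count())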

Liukairong2023 commented on August 14, 2024

How much GPU memory does this experiment need at minimum to run?
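
Not an answer from the maintainers, but a hedged back-of-envelope helps frame the question for a 7B base model: the frozen fp16/bf16 weights cost about 2 bytes per parameter, and only the trainable adapter parameters need gradients plus fp32 Adam state (roughly 16 bytes per trainable parameter), with ZeRO-3 sharding the weights across GPUs and the offload config able to push optimizer state to the CPU. The adapter size below is an assumption, not the project's real number.

# Rough, hedged estimate only; all numbers are assumptions, not measurements.
GIB = 2 ** 30

base_params = 7e9        # frozen LLaMA-7B backbone
adapter_params = 60e6    # assumed adapter/router size; depends on LoRA rank and expert count

weights_gib = base_params * 2 / GIB                      # fp16/bf16 weights
adapter_state_gib = adapter_params * (2 + 2 + 12) / GIB  # fp16 weight + grad, fp32 Adam state

print(f"frozen weights:      ~{weights_gib:.1f} GiB total, sharded by ZeRO-3 across the GPUs")
print(f"adapter train state: ~{adapter_state_gib:.2f} GiB")
print("plus activations, which grow with per-device batch size and max_seq_length=1024")

Under that arithmetic the static footprint per 3090 is well below 24 GB, so an OOM usually points at activation memory (smaller max_seq_length or batch size, or gradient checkpointing) or at full parameters being gathered during the forward pass; treat it as a sketch, not a hard minimum.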
