Comments (8)

Ablustrund commented on August 14, 2024

Could you share the specific error message?

Liukairong2023 commented on August 14, 2024

Could you share the specific error message?

There is no specific error message. The environment was installed from requestment.yml, run_loramoe.sh was edited accordingly, but running bash run_loramoe.sh did not start the experiment. Below we provide a screenshot of the LoRAMoE project structure, a screenshot of the run_loramoe.sh configuration, and a screenshot of running the bash run_loramoe.sh command.
[Screenshots: LoRAMoE project structure, run_loramoe.sh configuration, and the bash run_loramoe.sh command]
As you can see, the environment is newloramoe, the project was cloned from https://github.com/Ablustrund/LoRAMoE, an output directory was created with a log/xxxx.log file inside it, and the model is a locally downloaded llama-7b-hf checkpoint. When bash run_loramoe.sh is run, the script exits almost immediately and no training output appears in the terminal.

Ablustrund commented on August 14, 2024

That is strange; we have not run into this with this project so far, and there should at least be some error output. From your screenshots, nproc_per_node should be 4, though that is probably not the cause.

Liukairong2023 commented on August 14, 2024

That is strange; we have not run into this with this project so far, and there should at least be some error output. From your screenshots, nproc_per_node should be 4, though that is probably not the cause.

Hello, with CUDA_VISIBLE_DEVICES=0,1,2,3 and nproc_per_node=4 the following error is now reported:
unscale_and_clip_grads
self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
So during training some tensors end up on different devices, and that mismatch triggers the error.

Ablustrund commented on August 14, 2024

Some tensors have not been moved to CUDA (they are still on the CPU). Please check the parts you modified; running on the toy data normally works without problems.
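
Not part of the LoRAMoE code, but as a minimal debugging sketch for this kind of error: assuming model is the PyTorch model object that run_loramoe.py builds, a hypothetical helper like the one below lists any parameters or buffers still left on the CPU just before training starts.

import torch

def find_cpu_tensors(model: torch.nn.Module) -> None:
    # Print every parameter and buffer that is not on a CUDA device;
    # call this right before handing the model to the trainer.
    for name, param in model.named_parameters():
        if param.device.type != "cuda":
            print(f"parameter on {param.device}: {name}")
    for name, buf in model.named_buffers():
        if buf.device.type != "cuda":
            print(f"buffer on {buf.device}: {name}")

Note that when ZeRO-3 CPU offload is enabled, DeepSpeed intentionally keeps its fp32 optimizer partitions on the CPU, so a device mismatch can also come from an inconsistent offload configuration rather than from the model code itself.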

Liukairong2023 commented on August 14, 2024

Hello, when running the experiment code on your team's dataset I get an out-of-memory error. My server has 4×3090 GPUs, i.e. 96 GB of VRAM in total, and DeepSpeed is enabled in the run script. In principle 96 GB should be enough, but it still reports OOM. Could you please take a look? The run screenshots are below:
[Screenshot of the out-of-memory error]
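
Not from the repository, but to narrow down where the 24 GB of each 3090 goes, a small hypothetical helper like the following can be called around the failing step (for example after model construction and after the first optimizer step) to log per-GPU memory:

import torch

def report_gpu_memory(tag: str) -> None:
    # Log allocated / reserved / free memory for every GPU visible to this process.
    gib = 2 ** 30
    for i in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(i)    # free/total device memory in bytes
        allocated = torch.cuda.memory_allocated(i)  # bytes currently held by tensors
        reserved = torch.cuda.memory_reserved(i)    # bytes held by the caching allocator
        print(f"[{tag}] cuda:{i} allocated={allocated / gib:.1f} GiB "
              f"reserved={reserved / gib:.1f} GiB free={free / gib:.1f}/{total / gib:.1f} GiB")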

Liukairong2023 commented on August 14, 2024

echo "Starting the training process...3"
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=1
max_seq_length=1024
output_dir=/home/lkr/LoRAMoE/output
exp_name=0308_debug_format_for_opensource

echo "Starting the training process...4"

deepspeed_config_file=ds_zero2_no_offload.json

deepspeed_config_file=ds_zero3_offload.json

CUDA_VISIBLE_DEVICES=0,1,2,3
CUDA_LAUNCH_BLOCKING=1
echo "Starting the training process...5"
torchrun --nnodes 1 --nproc_per_node 4 --node_rank 0 --master_port 29502
run_loramoe.py \

This is how the GPUs are set up.
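
A hedged side note rather than anything the maintainers prescribe: environment variables assigned without export in a shell script are not inherited by torchrun's worker processes, so a quick check like this at the top of run_loramoe.py confirms the intended four GPUs are actually visible.

import os
import torch

# Confirms CUDA_VISIBLE_DEVICES reached this process and that the number
# of visible GPUs matches --nproc_per_node (4 in the script above).
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible GPUs =", torch.cuda.device_count())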

Liukairong2023 commented on August 14, 2024

How much GPU memory does this experiment need at minimum to run?
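
Not an answer from the maintainers, but a hedged back-of-envelope helps frame the question for a 7B base model: the frozen fp16/bf16 weights cost about 2 bytes per parameter, and only the trainable adapter parameters need gradients plus fp32 Adam state (roughly 16 bytes per trainable parameter), with ZeRO-3 sharding the weights across GPUs and the offload config able to push optimizer state to the CPU. The adapter size below is an assumption, not the project's real number.

# Rough, hedged estimate only; all numbers are assumptions, not measurements.
GIB = 2 ** 30

base_params = 7e9        # frozen LLaMA-7B backbone
adapter_params = 60e6    # assumed adapter/router size; depends on LoRA rank and expert count

weights_gib = base_params * 2 / GIB                      # fp16/bf16 weights
adapter_state_gib = adapter_params * (2 + 2 + 12) / GIB  # fp16 weight + grad, fp32 Adam state

print(f"frozen weights:      ~{weights_gib:.1f} GiB total, sharded by ZeRO-3 across the GPUs")
print(f"adapter train state: ~{adapter_state_gib:.2f} GiB")
print("plus activations, which grow with per-device batch size and max_seq_length=1024")

Under that arithmetic the static footprint per 3090 is well below 24 GB, so an OOM usually points at activation memory (smaller max_seq_length or batch size, or gradient checkpointing) or at full parameters being gathered during the forward pass; treat it as a sketch, not a hard minimum.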
