Where can I find the codes to train the open source LLM, please? Trying to build an in

Where can I find the codes to train the open source LLM, please? about h2ogpt HOT 2 CLOSED

h2oai commented on August 15, 2024

Where can I find the codes to train the open source LLM, please?

Comments (2)

flippercy commented on August 15, 2024

@arnocandel @pseudotensor Thank you!

pseudotensor commented on August 15, 2024

#22

For a specific model card, e.g. https://huggingface.co/h2oai/h2ogpt-oasst1-512-20b, you can download the training logs zip linked there. The log file shows the full set of parameters passed to finetune.py.

Training logs: zip

E.g. for that model card, the log file is called 1013.log, and inside you'll see a block like:

local_rank: 6
global rank: 6
local_rank: 0
global rank: 0
Training model with params:
save_code: True
run_id: 1013
tokenizer_base_model: EleutherAI/gpt-neox-20b
data_path: openassistant_oasst1.json
data_col_dict: None
valid_path: None
data_mix_in_path: 0-hero/OIG-small-chip2
data_mix_in_factor: 0.0
data_mix_in_col_dict: {'user': 'instruction', 'chip2': 'output'}
data_mix_in_prompt_type: instruct
output_dir: gpt-neox-20b.openassistant_oasst1.json.6.0_epochs.5a14ea8b3794c0d60476fc262d0a297f98dd712d.1013
lora_weights: 
batch_size: 64
micro_batch_size: 8
gradient_checkpointing: False
fp16: True
num_epochs: 6.0
learning_rate: 0.0003
val_set_size: 0
val_metrics: []
eval_steps: 32000
eval_epochs: None
lora_r: 16
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['query_key_value']
llama_type: False
group_by_length: False
resume_from_checkpoint: None
ddp: True
local_files_only: False
resume_download: True
warmup_steps: 100
logging_steps: 1
save_steps: 2000
add_eos_token: False
world_size: 8
local_rank: 0
rank: 0
gpus: 8
device_map: auto
gradient_accumulation_steps: 8
base_model: EleutherAI/gpt-neox-20b
cutoff_len: 512
prompt_type: plain
train_on_inputs: True
Command: finetune.py --base_model=EleutherAI/gpt-neox-20b --data_path=openassistant_oasst1.json --lora_target_modules=["query_key_value"] --run_id=1013 --batch_size=64 --micro_batch_size=8 --num_epochs=6.0 --val_set_size=0 --eval_steps=32000 --save_steps=2000 --data_mix_in_factor=0.0 --data_mix_in_factor=0.0 --prompt_type=plain --save_code=True --cutoff_len=512 --lora_r=16
Hash: 5a14ea8b3794c0d60476fc262d0a297f98dd712d
Distributed: data parallel

The "Command" line shows the actual command used, and the "Hash" line shows the commit hash of the repo at the time of the run. With both, the run can be reproduced exactly.
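If you want to pull those pieces out of a log programmatically, here is a minimal sketch. It assumes the layout shown in the block above (a "Training model with params:" header, then `key: value` lines, then "Command:" and "Hash:" lines). The helper name `parse_training_log` and the shortened sample log are mine for illustration, not part of h2ogpt.

```python
def parse_training_log(text):
    """Return (params, command, repo_hash) from a finetune.py-style log.

    Assumes the layout seen in 1013.log; all values are kept as raw strings.
    """
    params, command, repo_hash = {}, None, None
    in_params = False
    for line in text.splitlines():
        if line.startswith("Training model with params:"):
            in_params = True  # key: value lines follow this header
        elif line.startswith("Command: "):
            command = line[len("Command: "):].strip()
            in_params = False  # the parameter block ends here
        elif line.startswith("Hash: "):
            repo_hash = line[len("Hash: "):].strip()
        elif in_params and ":" in line:
            # Split on the first colon only, so dict-like values stay intact.
            key, _, value = line.partition(":")
            params[key.strip()] = value.strip()
    return params, command, repo_hash


# Abbreviated excerpt of the log above, shortened for the example.
sample_log = """\
local_rank: 0
global rank: 0
Training model with params:
run_id: 1013
batch_size: 64
data_mix_in_col_dict: {'user': 'instruction', 'chip2': 'output'}
base_model: EleutherAI/gpt-neox-20b
Command: finetune.py --base_model=EleutherAI/gpt-neox-20b --run_id=1013
Hash: 5a14ea8b3794c0d60476fc262d0a297f98dd712d
Distributed: data parallel
"""

params, command, repo_hash = parse_training_log(sample_log)
print(params["base_model"], command, repo_hash, sep="\n")
```

The extracted hash is what you would pass to `git checkout` in a clone of the repo before rerunning the command.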

The only thing you have to account for is system-specific setup. E.g. I trained that 20B model with this line:

WORLD_SIZE=8 CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" torchrun --nproc_per_node=8 --master_port=1234 finetune.py --base_model='EleutherAI/gpt-neox-20b' --data_path='openassistant_oasst1.json' --lora_target_modules='["query_key_value"]' --run_id=1013 --batch_size=64 --micro_batch_size=8 --num_epochs=6.0 --val_set_size=0 --eval_steps=32000 --save_steps=2000 --data_mix_in_factor=0.0 --data_mix_in_factor=0.0  --prompt_type='plain' --save_code=True --cutoff_len=512 --lora_r=16 &> 1013.log

That line also shows how I produced the 1013.log file. The only part missing from 1013.log is the prefix WORLD_SIZE=8 CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" torchrun --nproc_per_node=8 --master_port=1234, which is very system-specific; in this case we ran on 8*A100.
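To adapt the launcher prefix to a different machine, something like the following sketch could rebuild the full line from the logged "Command" string. The helper `build_launch_command` is hypothetical, and it assumes that no logged argument contains spaces and that the output goes to a bash shell (`&>` is bash syntax). It only changes the prefix; whether the logged batch settings still suit a different GPU count is a separate question.

```python
import shlex


def build_launch_command(logged_command, n_gpus, master_port=1234, log_file=None):
    """Prepend a torchrun prefix for n_gpus local GPUs to a logged command.

    Hypothetical helper: assumes arguments are whitespace-separated and
    re-quotes each one for the shell (e.g. the JSON-style list value).
    """
    devices = ",".join(str(i) for i in range(n_gpus))
    args = [shlex.quote(arg) for arg in logged_command.split()]
    parts = [
        f"WORLD_SIZE={n_gpus}",
        f'CUDA_VISIBLE_DEVICES="{devices}"',
        "torchrun",
        f"--nproc_per_node={n_gpus}",
        f"--master_port={master_port}",
        *args,
    ]
    cmd = " ".join(parts)
    if log_file is not None:
        cmd += f" &> {shlex.quote(log_file)}"  # bash: redirect stdout and stderr
    return cmd


# Abbreviated version of the logged Command line, shortened for the example.
logged = (
    "finetune.py --base_model=EleutherAI/gpt-neox-20b "
    '--lora_target_modules=["query_key_value"] --run_id=1013'
)
launch = build_launch_command(logged, n_gpus=8, log_file="1013.log")
print(launch)
```

Re-quoting matters because the log prints the arguments without the shell quotes used in the original invocation, so pasting the "Command" line straight into a shell would mangle values like `["query_key_value"]`.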
