Comments (8)
I tried to fine-tune on another platform without CUDA or a GPU. However, I got another error:
dnabert2-cpu/lib/python3.8/site-packages/torch/cuda/__init__.py:619: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
Traceback (most recent call last):
File "/data5/chenruipu/software/DNABERT_2-main/finetune/train.py", line 314, in <module>
train()
File "/data5/chenruipu/software/DNABERT_2-main/finetune/train.py", line 227, in train
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/transformers/hf_argparser.py", line 346, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "<string>", line 117, in __init__
File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/transformers/training_args.py", line 1337, in __post_init__
raise ValueError(
ValueError: FP16 Mixed precision training with AMP or APEX (`--fp16`) and FP16 half precision evaluation (`--fp16_full_eval`) can only be used on CUDA devices.
from dnabert_2.
If you want to fine-tune the model on a CPU, please remove the --fp16 flag; it only applies to GPUs. We have never tested model fine-tuning on CPUs, so please share more here if you run into other types of errors.
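A minimal sketch of that advice, assuming the flag is passed as a bare --fp16 (not as "--fp16 True"). The helper name build_finetune_args and the sample argument list are hypothetical and are not part of train.py. In a real launcher, torch.cuda.is_available() could supply the cuda_available value.

```python
from typing import List


def build_finetune_args(base_args: List[str], cuda_available: bool) -> List[str]:
    """Return a copy of base_args, keeping half-precision flags only on CUDA."""
    if cuda_available:
        return list(base_args)
    # On non-CUDA devices, transformers raises a ValueError for these flags,
    # so drop them before launching. Assumes they are bare boolean flags.
    return [a for a in base_args if a not in ("--fp16", "--fp16_full_eval")]


# Hypothetical argument list, loosely modelled on a train.py invocation.
args = ["train.py", "--learning_rate", "3e-5", "--fp16", "--num_train_epochs", "5"]
cpu_args = build_finetune_args(args, cuda_available=False)
print(cpu_args)
```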
from dnabert_2.
warnings.warn("Can't initialize NVML")
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
WARNING:root:Perform single sequence classification...
<__main__.SupervisedDataset object at 0x7fc2e0271d00>
WARNING:root:Perform single sequence classification...
WARNING:root:Perform single sequence classification...
Some weights of the model checkpoint at /data5/chenruipu/data/wangchao/model/DNABERT-2-117M_model were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /data5/chenruipu/data/wangchao/model/DNABERT-2-117M_model and are newly initialized: ['classifier.weight', 'classifier.bias', 'bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
***** Running training *****
Num examples = 46,499
Num Epochs = 5
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 4
Gradient Accumulation steps = 1
Total optimization steps = 58,125
Number of trainable parameters = 117,069,313
0%| | 0/58125 [00:00<?, ?it/s]/data5/chenruipu/.cache/huggingface/modules/transformers_modules/DNABERT-2-117M_model/bert_layers.py:433: UserWarning: Increasing alibi size from 512 to 1501
warnings.warn(
Traceback (most recent call last):
File "/data5/chenruipu/software/DNABERT_2-main/finetune/train.py", line 314, in <module>
train()
File "/data5/chenruipu/software/DNABERT_2-main/finetune/train.py", line 296, in train
trainer.train()
File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/transformers/trainer.py", line 1664, in train
return inner_training_loop(
File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/transformers/trainer.py", line 1940, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/transformers/trainer.py", line 2735, in training_step
loss = self.compute_loss(model, inputs)
File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/transformers/trainer.py", line 2767, in compute_loss
outputs = model(**inputs)
File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/data5/chenruipu/.cache/huggingface/modules/transformers_modules/DNABERT-2-117M_model/bert_layers.py", line 859, in forward
outputs = self.bert(
File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/data5/chenruipu/.cache/huggingface/modules/transformers_modules/DNABERT-2-117M_model/bert_layers.py", line 609, in forward
encoder_outputs = self.encoder(
File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/data5/chenruipu/.cache/huggingface/modules/transformers_modules/DNABERT-2-117M_model/bert_layers.py", line 447, in forward
hidden_states = layer_module(hidden_states,
File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/data5/chenruipu/.cache/huggingface/modules/transformers_modules/DNABERT-2-117M_model/bert_layers.py", line 328, in forward
attention_output = self.attention(hidden_states, cu_seqlens, seqlen,
File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/data5/chenruipu/.cache/huggingface/modules/transformers_modules/DNABERT-2-117M_model/bert_layers.py", line 241, in forward
self_output = self.self(input_tensor, cu_seqlens, max_s, indices,
File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/data5/chenruipu/.cache/huggingface/modules/transformers_modules/DNABERT-2-117M_model/bert_layers.py", line 182, in forward
attention = flash_attn_qkvpacked_func(qkv, bias)
File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/torch/autograd/function.py", line 598, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/data5/chenruipu/.cache/huggingface/modules/transformers_modules/DNABERT-2-117M_model/flash_attn_triton.py", line 1021, in forward
o, lse, ctx.softmax_scale = _flash_attn_forward(
File "/data5/chenruipu/.cache/huggingface/modules/transformers_modules/DNABERT-2-117M_model/flash_attn_triton.py", line 781, in _flash_attn_forward
assert q.is_cuda and k.is_cuda and v.is_cuda
AssertionError
0%|
Then I got the more complex error shown above.
from dnabert_2.
Can you try pip uninstall triton?
from dnabert_2.
I also encountered a CUDA out-of-memory error while fine-tuning on 3x A100 GPUs. I first fine-tuned on sample_data, which worked fine. When I switched to my own data, I got the CUDA out-of-memory error (I had also uninstalled triton).
Here is the code for fine-tuning:
cd finetune
export DATA_PATH=../data # e.g., ./sample_data
export MAX_LENGTH=128 # Please set the number as 0.25 * your sequence length.
# e.g., set it as 250 if your DNA sequences have 1000 nucleotide bases
# This is because the tokenizer reduces the sequence length by about 5 times
export LR=3e-5
# Training uses DataParallel
python train.py \
--model_name_or_path zhihan1996/DNABERT-2-117M \
--data_path ${DATA_PATH} \
--kmer -1 \
--run_name DNABERT2_${DATA_PATH} \
--model_max_length ${MAX_LENGTH} \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 16 \
--gradient_accumulation_steps 1 \
--learning_rate ${LR} \
--num_train_epochs 5 \
--fp16 \
--save_steps 200 \
--output_dir output/dnabert2 \
--evaluation_strategy steps \
--eval_steps 200 \
--warmup_steps 50 \
--logging_steps 100 \
--overwrite_output_dir True \
--log_level info \
--find_unused_parameters False
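For reference, the MAX_LENGTH rule in the script comments (0.25 times the sequence length in nucleotides) can be sketched as a small helper. The function name suggest_model_max_length is hypothetical, and rounding up is an assumption; the script itself only states the 0.25 ratio.

```python
import math


def suggest_model_max_length(sequence_length: int, ratio: float = 0.25) -> int:
    """Estimate --model_max_length from the sequence length in nucleotides."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    # Round up (an assumption) so short sequences still get at least one token.
    return max(1, math.ceil(sequence_length * ratio))


# The script's own example: 1000 nucleotide bases.
print(suggest_model_max_length(1000))
```

Under this rule, the MAX_LENGTH=128 used above corresponds to sequences of roughly 512 bases.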
Here is the error I got:
File "train.py", line 303, in <module>
train()
File "train.py", line 285, in train
trainer.train()
File "/work/09059/xliaoyi/ls6/software/miniconda/envs/dna/lib/python3.8/site-packages/transformers/trainer.py", line 1664, in train
return inner_training_loop(
File "/work/09059/xliaoyi/ls6/software/miniconda/envs/dna/lib/python3.8/site-packages/transformers/trainer.py", line 2019, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/work/09059/xliaoyi/ls6/software/miniconda/envs/dna/lib/python3.8/site-packages/transformers/trainer.py", line 2300, in _maybe_log_save_evaluate
metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
File "/work/09059/xliaoyi/ls6/software/miniconda/envs/dna/lib/python3.8/site-packages/transformers/trainer.py", line 3029, in evaluate
output = eval_loop(
File "/work/09059/xliaoyi/ls6/software/miniconda/envs/dna/lib/python3.8/site-packages/transformers/trainer.py", line 3235, in evaluation_loop
preds_host = logits if preds_host is None else nested_concat(preds_host, logits, padding_index=-100)
File "/work/09059/xliaoyi/ls6/software/miniconda/envs/dna/lib/python3.8/site-packages/transformers/trainer_pt_utils.py", line 114, in nested_concat
return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors))
File "/work/09059/xliaoyi/ls6/software/miniconda/envs/dna/lib/python3.8/site-packages/transformers/trainer_pt_utils.py", line 114, in <genexpr>
return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors))
File "/work/09059/xliaoyi/ls6/software/miniconda/envs/dna/lib/python3.8/site-packages/transformers/trainer_pt_utils.py", line 116, in nested_concat
return torch_pad_and_concatenate(tensors, new_tensors, padding_index=padding_index)
File "/work/09059/xliaoyi/ls6/software/miniconda/envs/dna/lib/python3.8/site-packages/transformers/trainer_pt_utils.py", line 75, in torch_pad_and_concatenate
return torch.cat((tensor1, tensor2), dim=0)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 18.09 GiB. GPU
0%| | 200/758330 [06:32<413:24:47, 1.96s/it]
Is that because the dataset used for fine-tuning is too large (I have 3 million sequences for fine-tuning)?
from dnabert_2.
The size of the dataset should not impact memory usage. Can you try launching the experiment with distributed data parallel? You can do this by replacing python with torchrun --nproc_per_node=3 in your script.
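The suggested change to the launch command can be sketched as follows. The helper to_torchrun is hypothetical and only rewrites the command string; it does not launch anything. Note that the torchrun option is spelled --nproc_per_node.

```python
import shlex
from typing import List


def to_torchrun(command: str, nproc_per_node: int) -> List[str]:
    """Rewrite 'python script.py ...' into the equivalent torchrun command."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] not in ("python", "python3"):
        raise ValueError("expected a command that starts with python")
    # Replace the interpreter with torchrun and one worker process per GPU.
    return ["torchrun", f"--nproc_per_node={nproc_per_node}"] + tokens[1:]


# Shortened, illustrative version of the fine-tuning command.
cmd = to_torchrun("python train.py --per_device_train_batch_size 8 --fp16", 3)
print(shlex.join(cmd))
```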
from dnabert_2.
Thanks for your response. My fine-tuning task now runs correctly, but it seems to use only one CPU core, so it will take too long to finish (about 250 hours). Is it possible to fine-tune with multiple CPUs?
from dnabert_2.
Sorry, I have no experience with multi-CPU training. You may need to investigate this yourself. Good luck!
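One possible starting point for that investigation, untested with DNABERT-2: PyTorch's CPU operators generally run multi-threaded, and the thread count can be limited by environment variables such as OMP_NUM_THREADS, so it may be worth checking whether the job was started with a thread limit. The sketch below only builds an environment for a child process; the helper name cpu_thread_env and the thread count of 8 are assumptions. Inside a script, torch.set_num_threads is the in-process way to set the same thing.

```python
import os
from typing import Dict, Optional


def cpu_thread_env(num_threads: Optional[int] = None) -> Dict[str, str]:
    """Build thread-count environment variables for a CPU training process."""
    n = num_threads if num_threads is not None else (os.cpu_count() or 1)
    if n < 1:
        raise ValueError("num_threads must be at least 1")
    # OMP_NUM_THREADS is read by OpenMP, MKL_NUM_THREADS by Intel MKL.
    return {"OMP_NUM_THREADS": str(n), "MKL_NUM_THREADS": str(n)}


# Merge into the current environment; pass the result as env= to
# subprocess.run when launching the training script.
env = {**os.environ, **cpu_thread_env(8)}
print(env["OMP_NUM_THREADS"])
```

This affects threading within a single process only; whether it shortens the 250-hour estimate would need to be measured.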
from dnabert_2.