
Comments (8)

chenruipu commented on September 28, 2024

I tried to fine-tune on another platform without CUDA or a GPU. However, there is another error like this:

dnabert2-cpu/lib/python3.8/site-packages/torch/cuda/__init__.py:619: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
Traceback (most recent call last):
  File "/data5/chenruipu/software/DNABERT_2-main/finetune/train.py", line 314, in <module>
    train()
  File "/data5/chenruipu/software/DNABERT_2-main/finetune/train.py", line 227, in train
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/transformers/hf_argparser.py", line 346, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 117, in __init__
  File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/transformers/training_args.py", line 1337, in __post_init__
    raise ValueError(
ValueError: FP16 Mixed precision training with AMP or APEX (`--fp16`) and FP16 half precision evaluation (`--fp16_full_eval`) can only be used on CUDA devices.


Zhihan1996 commented on September 28, 2024

If you want to fine-tune the model on CPU, please remove the --fp16 flag; it only applies to GPUs. We have never tested model fine-tuning on CPUs, so please share more here if you run into other types of errors.
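
For reference, a minimal sketch of what a CPU launch could look like (the paths and hyperparameters here are placeholders; everything except dropping --fp16 follows the repo's usual example arguments):

export DATA_PATH=./sample_data   # placeholder; point this at your own data folder
export MAX_LENGTH=128
export LR=3e-5

# Same train.py invocation as the GPU example, but without --fp16, so no AMP is requested
python train.py \
    --model_name_or_path zhihan1996/DNABERT-2-117M \
    --data_path ${DATA_PATH} \
    --kmer -1 \
    --model_max_length ${MAX_LENGTH} \
    --per_device_train_batch_size 4 \
    --learning_rate ${LR} \
    --num_train_epochs 5 \
    --output_dir output/dnabert2_cpu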


chenruipu commented on September 28, 2024
  warnings.warn("Can't initialize NVML")
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
WARNING:root:Perform single sequence classification...
<__main__.SupervisedDataset object at 0x7fc2e0271d00>
WARNING:root:Perform single sequence classification...
WARNING:root:Perform single sequence classification...
Some weights of the model checkpoint at /data5/chenruipu/data/wangchao/model/DNABERT-2-117M_model were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /data5/chenruipu/data/wangchao/model/DNABERT-2-117M_model and are newly initialized: ['classifier.weight', 'classifier.bias', 'bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
***** Running training *****
  Num examples = 46,499
  Num Epochs = 5
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 58,125
  Number of trainable parameters = 117,069,313
  0%|                                                                                                                                              | 0/58125 [00:00<?, ?it/s]/data5/chenruipu/.cache/huggingface/modules/transformers_modules/DNABERT-2-117M_model/bert_layers.py:433: UserWarning: Increasing alibi size from 512 to 1501
  warnings.warn(
Traceback (most recent call last):
  File "/data5/chenruipu/software/DNABERT_2-main/finetune/train.py", line 314, in <module>
    train()
  File "/data5/chenruipu/software/DNABERT_2-main/finetune/train.py", line 296, in train
    trainer.train()
  File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/transformers/trainer.py", line 1664, in train
    return inner_training_loop(
  File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/transformers/trainer.py", line 1940, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/transformers/trainer.py", line 2735, in training_step
    loss = self.compute_loss(model, inputs)
  File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/transformers/trainer.py", line 2767, in compute_loss
    outputs = model(**inputs)
  File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data5/chenruipu/.cache/huggingface/modules/transformers_modules/DNABERT-2-117M_model/bert_layers.py", line 859, in forward
    outputs = self.bert(
  File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data5/chenruipu/.cache/huggingface/modules/transformers_modules/DNABERT-2-117M_model/bert_layers.py", line 609, in forward
    encoder_outputs = self.encoder(
  File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data5/chenruipu/.cache/huggingface/modules/transformers_modules/DNABERT-2-117M_model/bert_layers.py", line 447, in forward
    hidden_states = layer_module(hidden_states,
  File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data5/chenruipu/.cache/huggingface/modules/transformers_modules/DNABERT-2-117M_model/bert_layers.py", line 328, in forward
    attention_output = self.attention(hidden_states, cu_seqlens, seqlen,
  File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data5/chenruipu/.cache/huggingface/modules/transformers_modules/DNABERT-2-117M_model/bert_layers.py", line 241, in forward
    self_output = self.self(input_tensor, cu_seqlens, max_s, indices,
  File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data5/chenruipu/.cache/huggingface/modules/transformers_modules/DNABERT-2-117M_model/bert_layers.py", line 182, in forward
    attention = flash_attn_qkvpacked_func(qkv, bias)
  File "/data5/chenruipu/miniconda3/envs/dnabert2-cpu/lib/python3.8/site-packages/torch/autograd/function.py", line 598, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/data5/chenruipu/.cache/huggingface/modules/transformers_modules/DNABERT-2-117M_model/flash_attn_triton.py", line 1021, in forward
    o, lse, ctx.softmax_scale = _flash_attn_forward(
  File "/data5/chenruipu/.cache/huggingface/modules/transformers_modules/DNABERT-2-117M_model/flash_attn_triton.py", line 781, in _flash_attn_forward
    assert q.is_cuda and k.is_cuda and v.is_cuda
AssertionError
  0%|      

Then I got a more complex error, like the one shown above.


Zhihan1996 commented on September 28, 2024

Can you try pip uninstall triton?
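
A sketch of that step (the -y flag and the import check are just conveniences, not from the maintainers; as I understand it, the model code falls back to a non-Triton attention implementation when the triton package cannot be imported):

# Remove triton so the Triton flash-attention path is disabled
pip uninstall -y triton

# Sanity check: this should now fail with ModuleNotFoundError
python -c "import triton"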


xliaoyi commented on September 28, 2024

I also encountered a CUDA out of memory error. I am fine-tuning with 3x A100 GPUs. I first tried fine-tuning with sample_data and it works fine; then I switched to my own data, which raised the CUDA out of memory error (I also uninstalled triton).

Here is the code for fine-tuning:

cd finetune

export DATA_PATH=../data  # e.g., ./sample_data
export MAX_LENGTH=128 # Please set the number as 0.25 * your sequence length,
                      # e.g., set it to 250 if your DNA sequences have 1000 nucleotide bases.
                      # This is because tokenization reduces the sequence length by about 5 times.
export LR=3e-5

# Training use DataParallel
python train.py \
    --model_name_or_path zhihan1996/DNABERT-2-117M \
    --data_path  ${DATA_PATH} \
    --kmer -1 \
    --run_name DNABERT2_${DATA_PATH} \
    --model_max_length ${MAX_LENGTH} \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 16 \
    --gradient_accumulation_steps 1 \
    --learning_rate ${LR} \
    --num_train_epochs 5 \
    --fp16 \
    --save_steps 200 \
    --output_dir output/dnabert2 \
    --evaluation_strategy steps \
    --eval_steps 200 \
    --warmup_steps 50 \
    --logging_steps 100 \
    --overwrite_output_dir True \
    --log_level info \
    --find_unused_parameters False

Here is the error I got:


  File "train.py", line 303, in <module>
    train()
  File "train.py", line 285, in train
    trainer.train()
  File "/work/09059/xliaoyi/ls6/software/miniconda/envs/dna/lib/python3.8/site-packages/transformers/trainer.py", line 1664, in train
    return inner_training_loop(
  File "/work/09059/xliaoyi/ls6/software/miniconda/envs/dna/lib/python3.8/site-packages/transformers/trainer.py", line 2019, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/work/09059/xliaoyi/ls6/software/miniconda/envs/dna/lib/python3.8/site-packages/transformers/trainer.py", line 2300, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/work/09059/xliaoyi/ls6/software/miniconda/envs/dna/lib/python3.8/site-packages/transformers/trainer.py", line 3029, in evaluate
    output = eval_loop(
  File "/work/09059/xliaoyi/ls6/software/miniconda/envs/dna/lib/python3.8/site-packages/transformers/trainer.py", line 3235, in evaluation_loop
    preds_host = logits if preds_host is None else nested_concat(preds_host, logits, padding_index=-100)
  File "/work/09059/xliaoyi/ls6/software/miniconda/envs/dna/lib/python3.8/site-packages/transformers/trainer_pt_utils.py", line 114, in nested_concat
    return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors))
  File "/work/09059/xliaoyi/ls6/software/miniconda/envs/dna/lib/python3.8/site-packages/transformers/trainer_pt_utils.py", line 114, in <genexpr>
    return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors))
  File "/work/09059/xliaoyi/ls6/software/miniconda/envs/dna/lib/python3.8/site-packages/transformers/trainer_pt_utils.py", line 116, in nested_concat
    return torch_pad_and_concatenate(tensors, new_tensors, padding_index=padding_index)
  File "/work/09059/xliaoyi/ls6/software/miniconda/envs/dna/lib/python3.8/site-packages/transformers/trainer_pt_utils.py", line 75, in torch_pad_and_concatenate
    return torch.cat((tensor1, tensor2), dim=0)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 18.09 GiB. GPU
  0%|          | 200/758330 [06:32<413:24:47,  1.96s/it]

Is that because the dataset used for fine-tuning is too large (I have 3 million sequences for fine-tuning)?
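
For context, the traceback above shows the allocation failing inside the evaluation loop, where nested_concat concatenates all evaluation logits on the GPU. One possible mitigation, not something suggested in this thread but using standard transformers TrainingArguments options, is a smaller eval batch plus --eval_accumulation_steps, which moves accumulated predictions to the CPU every N prediction steps:

# Same script as above, with two eval-related changes
python train.py \
    --model_name_or_path zhihan1996/DNABERT-2-117M \
    --data_path ${DATA_PATH} \
    --kmer -1 \
    --run_name DNABERT2_${DATA_PATH} \
    --model_max_length ${MAX_LENGTH} \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --eval_accumulation_steps 50 \
    --gradient_accumulation_steps 1 \
    --learning_rate ${LR} \
    --num_train_epochs 5 \
    --fp16 \
    --save_steps 200 \
    --output_dir output/dnabert2 \
    --evaluation_strategy steps \
    --eval_steps 200 \
    --warmup_steps 50 \
    --logging_steps 100 \
    --overwrite_output_dir True \
    --log_level info \
    --find_unused_parameters False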


Zhihan1996 commented on September 28, 2024

The size of the dataset should not impact memory usage. Can you try to launch the experiment with distributed data parallel? Basically, you can achieve this by replacing python with torchrun --nproc_per_node=3 in your script.
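
A sketch of that launch on a single node with 3 GPUs (the arguments after train.py are unchanged from the original script above):

torchrun --nproc_per_node=3 train.py \
    --model_name_or_path zhihan1996/DNABERT-2-117M \
    --data_path ${DATA_PATH} \
    --kmer -1 \
    --run_name DNABERT2_${DATA_PATH} \
    --model_max_length ${MAX_LENGTH} \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 16 \
    --gradient_accumulation_steps 1 \
    --learning_rate ${LR} \
    --num_train_epochs 5 \
    --fp16 \
    --save_steps 200 \
    --output_dir output/dnabert2 \
    --evaluation_strategy steps \
    --eval_steps 200 \
    --warmup_steps 50 \
    --logging_steps 100 \
    --overwrite_output_dir True \
    --log_level info \
    --find_unused_parameters False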


chenruipu commented on September 28, 2024

Thanks for your response; my fine-tuning task now runs correctly. But it seems to use only one CPU core, which will take too much time (about 250 hours) to finish. I would like to know whether I can fine-tune with multiple CPU cores?


Zhihan1996 commented on September 28, 2024

Sorry, I have no idea and no experience with multi-CPU training. You may need to investigate this by yourself. Good luck!
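
One thing that may be worth trying, though it is only a sketch and was not tested in this thread: PyTorch's CPU intra-op parallelism is controlled by the OpenMP/MKL thread settings, so exporting these before launching the script lets it use more cores.

# Let PyTorch (via OpenMP/MKL) use more CPU threads for intra-op parallelism,
# then launch train.py with the same arguments as before (no --fp16 on CPU)
export OMP_NUM_THREADS=16
export MKL_NUM_THREADS=16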

