cublas runtime error about hmnet HOT 7 OPEN

microsoft commented on May 21, 2024

cublas runtime error

from hmnet.

Comments (7)

rgowtham commented on May 21, 2024

I am facing the same issue as above. I am trying to run this on a Mac with no GPU. My command and output are as follows:

root@c85979e176ac:~/HMNet# python PyLearn.py train ExampleConf/conf_hmnet_AMI --no_cuda
{'MODEL': 'MeetingNet_Transformer', 'TASK': 'HMNet', 'CRITERION': 'MLECriterion', 'SEED': 1033, 'MAX_NUM_EPOCHS': 20, 'SAVE_PER_UPDATE_NUM': 400, 'UPDATES_PER_EPOCH': 2000, 'OPTIMIZER': 'RAdam', 'NO_AUTO_LR_SCALING': True, 'START_LEARNING_RATE': 0.001, 'LR_SCHEDULER': 'LnrWrmpInvSqRtDcyScheduler', 'WARMUP_STEPS': 16000, 'WARMUP_INIT_LR': 0.0001, 'WARMUP_END_LR': 0.001, 'GRADIENT_ACCUMULATE_STEP': 20, 'GRAD_CLIPPING': 2, 'USE_REL_DATA_PATH': True, 'TRAIN_FILE': '../ExampleRawData/meeting_summarization/AMI_proprec/train_ami.json', 'DEV_FILE': '../ExampleRawData/meeting_summarization/AMI_proprec/valid_ami.json', 'TEST_FILE': '../ExampleRawData/meeting_summarization/AMI_proprec/test_ami.json', 'ROLE_DICT_FILE': '../ExampleRawData/meeting_summarization/role_dict_ext.json', 'MINI_BATCH': 1, 'MAX_PADDING_RATIO': 1, 'BATCH_READ_AHEAD': 10, 'DOC_SHUFFLE_BUF_SIZE': 10, 'SAMPLE_SHUFFLE_BUFFER_SIZE': 10, 'BATCH_SHUFFLE_BUFFER_SIZE': 10, 'MAX_TRANSCRIPT_WORD': 8300, 'MAX_SENT_LEN': 30, 'MAX_SENT_NUM': 300, 'DROPOUT': 0.1, 'VOCAB_DIM': 512, 'ROLE_SIZE': 32, 'ROLE_DIM': 16, 'POS_DIM': 16, 'ENT_DIM': 16, 'USE_ROLE': True, 'USE_POSENT': True, 'USE_BOS_TOKEN': True, 'USE_EOS_TOKEN': True, 'TRANSFORMER_EMBED_DROPOUT': 0.1, 'TRANSFORMER_RESIDUAL_DROPOUT': 0.1, 'TRANSFORMER_ATTENTION_DROPOUT': 0.1, 'TRANSFORMER_LAYER': 6, 'TRANSFORMER_HEAD': 8, 'TRANSFORMER_POS_DISCOUNT': 80, 'PRE_TOKENIZER': 'TransfoXLTokenizer', 'PRE_TOKENIZER_PATH': '../ExampleInitModel/transfo-xl-wt103', 'PYLEARN_MODEL': '../ExampleInitModel/HMNet-pretrained', 'EXTRA_IDS': 1000, 'BEAM_WIDTH': 6, 'MAX_GEN_LENGTH': 512, 'MIN_GEN_LENGTH': 320, 'EVAL_TOKENIZED': True, 'EVAL_LOWERCASE': True, 'NO_REPEAT_NGRAM_SIZE': 3, 'cuda': False, 'confFile': 'ExampleConf/conf_hmnet_AMI', 'datadir': 'ExampleConf', 'basename': 'conf_hmnet_AMI', 'command': 'train', 'conf_file': 'ExampleConf/conf_hmnet_AMI', 'cluster': 'local', 'dist_init_path': './tmp', 'fp16': False, 'fp16_opt_level': 'O1', 'no_cuda': True}
Using CPU

Saving logs, model, checkpoint, and evaluation in ExampleConf/conf_hmnet_AMI_conf~/run_12
 1.2.0  is high
Number of GPUs is  1 
Effective batch size is increased from  1  to  1 
Gradient accumulation steps =  20 
Effective batch size =  20 
[c85979e176ac:00029] pml_ucx.c:285  Error: UCP worker does not support MPI_THREAD_MULTIPLE
Select command: train
train on rank 0
-----------------------------------------------
Initializing model...
Loading Tokenizer from ExampleConf/../ExampleInitModel/transfo-xl-wt103...
Using pad_token, but it is not set yet.
Using bos_token, but it is not set yet.
Use POS and ENT
USE_ROLE
Total trainable parameters: 204488240
Loaded data on rank 0.
Using custom optimizer: RAdam
Optimizer parameters: {'lr': 0.001}
Using custom lr scheduler: LnrWrmpInvSqRtDcyScheduler
Lr scheduler parameters: {'warmup_steps': 16000, 'warmup_init_lr': 0.0001, 'warmup_end_lr': 0.001}
Epoch 0
Killed

I am specifying --no_cuda, but it still reports that the number of GPUs is 1. It also does not give a clear error message about where it is failing. Can someone help by looking into this?
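A bare `Killed` with no Python traceback is what a shell prints when a process receives SIGKILL, and one common cause is the kernel's out-of-memory killer. That diagnosis is an assumption here, not something confirmed in this thread. The sketch below only estimates a lower bound on training memory from the parameter count in the log. It assumes fp32 values (the config shows 'fp16': False), one gradient per parameter, and two optimizer buffers per parameter as in Adam-style optimizers; `estimate_training_bytes` is a hypothetical helper, not part of HMNet.

```python
# Rough lower bound on training memory; activations and the dataset are NOT counted.
# Assumptions: fp32 values, one gradient per parameter, and two optimizer
# buffers per parameter (Adam-style). None of this is confirmed by the thread.

BYTES_PER_FP32 = 4


def estimate_training_bytes(num_params, optimizer_buffers=2,
                            bytes_per_value=BYTES_PER_FP32):
    """Bytes needed for weights + gradients + optimizer state."""
    copies = 1 + 1 + optimizer_buffers  # weights, gradients, optimizer buffers
    return num_params * copies * bytes_per_value


total_params = 204488240  # "Total trainable parameters" from the log above
total_bytes = estimate_training_bytes(total_params)
print(f"about {total_bytes / 2**30:.1f} GiB before activations")
```

If the container's memory limit is below this figure, raising it would be a reasonable first thing to try before looking for further CUDA calls.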

irenebenedetto commented on May 21, 2024

When I tried to reproduce the results, I noticed that the Transformer class has some variables that call .cuda() without checking the opt['cuda'] option. Did you try to modify them?
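One way to put every device decision under the control of opt['cuda'] is to resolve a device name once and reuse it everywhere. This is a minimal sketch, not the repository's code: `resolve_device_name` is a hypothetical helper, and the `cuda_available` argument stands in for whatever `torch.cuda.is_available()` reports at runtime.

```python
def resolve_device_name(opt, cuda_available):
    """Return 'cuda:<local_rank>' only if the config asks for CUDA and it is usable."""
    if opt.get("cuda", False) and cuda_available:
        return f"cuda:{opt.get('local_rank', 0)}"
    return "cpu"


# The --no_cuda run above prints a config with 'cuda': False.
opt = {"cuda": False, "no_cuda": True}
print(resolve_device_name(opt, cuda_available=False))  # cpu
```

The returned string can be passed to `torch.device(...)`, to `tensor.to(...)`, or as `map_location` in `torch.load`, so that no call site needs a bare .cuda().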

rgowtham commented on May 21, 2024

Hi @irenebenedetto, yes. Before I made those CUDA-related changes, the error message looked like this:

Epoch 0
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=50 error=35 : CUDA driver version is insufficient for CUDA runtime version
Traceback (most recent call last):
  File "PyLearn.py", line 71, in <module>
    trainer.train()
  File "/root/HMNet/Models/Trainers/HMNetTrainer.py", line 273, in train
    self.update(batch)
  File "/root/HMNet/Models/Trainers/HMNetTrainer.py", line 358, in update
    loss = self.network(batch)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/HMNet/Models/Trainers/HMNetTrainer.py", line 38, in forward
    output = self.model(batch)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/HMNet/Models/Networks/MeetingNet_Transformer.py", line 101, in forward
    outputs = self._forward(**batch)
  File "/root/HMNet/Models/Networks/MeetingNet_Transformer.py", line 126, in _forward
    token_encoder_outputs, sent_encoder_outputs = self.encoder(encoder_input_ids, encoder_input_roles, encoder_input_pos, encoder_input_ent)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/HMNet/Models/Networks/MeetingNet_Transformer.py", line 1131, in forward
    embedded = self.embedder(vocab_x.view(batch_size, -1))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/HMNet/Models/Networks/Transformer.py", line 387, in forward
    x_pos = self.pos_emb(torch.arange(x_len).type(torch.cuda.FloatTensor)) # len x n_state
RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at /pytorch/aten/src/THC/THCGeneral.cpp:50

Once I remove the .cuda() calls from here, here and here, I get the output I posted above. I was expecting to see more CUDA-related errors if the code were still trying to access the GPU.

Were you able to get it running after you changed all the places where .cuda() was used?
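The line that fails in the traceback builds its position tensor with `.type(torch.cuda.FloatTensor)`, which always needs a GPU. A device-agnostic alternative is to create the tensor on the same device as the input. The sketch below is illustrative only: `positions_like` and `SinusoidalPositions` are hypothetical stand-ins, since the thread does not show what `self.pos_emb` actually is.

```python
import torch
import torch.nn as nn


def positions_like(x, x_len):
    """Float positions 0..x_len-1, created on whatever device x lives on."""
    return torch.arange(x_len, dtype=torch.float, device=x.device)


class SinusoidalPositions(nn.Module):
    """Hypothetical positional embedding that takes float positions."""

    def __init__(self, n_state):
        super().__init__()
        exponents = torch.arange(0, n_state, 2, dtype=torch.float) / n_state
        # A buffer moves with the module when .to(device) is called.
        self.register_buffer("inv_freq", 1.0 / (10000 ** exponents))

    def forward(self, positions):
        angles = positions[:, None] * self.inv_freq[None, :]  # len x n_state/2
        return torch.cat([angles.sin(), angles.cos()], dim=-1)  # len x n_state


x = torch.zeros(2, 5, dtype=torch.long)  # token ids on the CPU
x_pos = SinusoidalPositions(8)(positions_like(x, x.shape[1]))
print(x_pos.shape)
```

Because the device is taken from `x`, the same code runs unchanged on CPU and GPU.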

irenebenedetto commented on May 21, 2024

Looking at the error message, I see other variables placed on CUDA (line 387, in forward). Did you also convert every .type(torch.cuda.FloatTensor) to .type(torch.FloatTensor)?
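To check that no hard-coded CUDA call was missed, the source tree can be scanned for the patterns mentioned in this thread. This is a rough sketch using only the standard library; `find_hard_coded_cuda` is a hypothetical helper, and the regular expression covers only `.cuda(`, `torch.cuda.*Tensor`, and `torch.device('cuda'`, so other forms such as `.to('cuda')` would need to be added.

```python
import re
from pathlib import Path

# Patterns taken from the calls discussed in this thread; not exhaustive.
HARD_CODED_CUDA = re.compile(
    r"\.cuda\(|torch\.cuda\.\w*Tensor|torch\.device\(\s*['\"]cuda"
)


def find_hard_coded_cuda(root):
    """Yield (path, line_number, line) for each line that pins work to CUDA."""
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if HARD_CODED_CUDA.search(line):
                yield path, number, line.strip()
```

For example, `for hit in find_hard_coded_cuda("Models"): print(*hit)` would list the remaining call sites under the Models directory that appears in the traceback paths.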

rgowtham commented on May 21, 2024

Yes, the error message that I posted was from before the CUDA changes. After making the changes (wherever CUDA was used in the Transformer.py script), I am seeing the same error as posted in this msg.

irenebenedetto commented on May 21, 2024

Ah okay, sorry. Did you also check the MeetingNet_Transformer class, here?

HMNet/Models/Networks/MeetingNet_Transformer.py

Line 85 in 1f5a24d

 checkpoint = torch.load(os.path.join(load_dir, 'model.pt'), map_location=torch.device('cuda', self.opt['local_rank'])) 

(I used checkpoint = torch.load(os.path.join(load_dir, 'model.pt'), map_location=torch.device('cpu')).)

rgowtham commented on May 21, 2024

Yes, this is also changed to load on the CPU.
