
Comments (7)

rgowtham commented on May 21, 2024

I am facing the same issue as above. I am trying to run this on a Mac with no GPU. My command and output are as follows:

root@c85979e176ac:~/HMNet# python PyLearn.py train ExampleConf/conf_hmnet_AMI --no_cuda
{'MODEL': 'MeetingNet_Transformer', 'TASK': 'HMNet', 'CRITERION': 'MLECriterion', 'SEED': 1033, 'MAX_NUM_EPOCHS': 20, 'SAVE_PER_UPDATE_NUM': 400, 'UPDATES_PER_EPOCH': 2000, 'OPTIMIZER': 'RAdam', 'NO_AUTO_LR_SCALING': True, 'START_LEARNING_RATE': 0.001, 'LR_SCHEDULER': 'LnrWrmpInvSqRtDcyScheduler', 'WARMUP_STEPS': 16000, 'WARMUP_INIT_LR': 0.0001, 'WARMUP_END_LR': 0.001, 'GRADIENT_ACCUMULATE_STEP': 20, 'GRAD_CLIPPING': 2, 'USE_REL_DATA_PATH': True, 'TRAIN_FILE': '../ExampleRawData/meeting_summarization/AMI_proprec/train_ami.json', 'DEV_FILE': '../ExampleRawData/meeting_summarization/AMI_proprec/valid_ami.json', 'TEST_FILE': '../ExampleRawData/meeting_summarization/AMI_proprec/test_ami.json', 'ROLE_DICT_FILE': '../ExampleRawData/meeting_summarization/role_dict_ext.json', 'MINI_BATCH': 1, 'MAX_PADDING_RATIO': 1, 'BATCH_READ_AHEAD': 10, 'DOC_SHUFFLE_BUF_SIZE': 10, 'SAMPLE_SHUFFLE_BUFFER_SIZE': 10, 'BATCH_SHUFFLE_BUFFER_SIZE': 10, 'MAX_TRANSCRIPT_WORD': 8300, 'MAX_SENT_LEN': 30, 'MAX_SENT_NUM': 300, 'DROPOUT': 0.1, 'VOCAB_DIM': 512, 'ROLE_SIZE': 32, 'ROLE_DIM': 16, 'POS_DIM': 16, 'ENT_DIM': 16, 'USE_ROLE': True, 'USE_POSENT': True, 'USE_BOS_TOKEN': True, 'USE_EOS_TOKEN': True, 'TRANSFORMER_EMBED_DROPOUT': 0.1, 'TRANSFORMER_RESIDUAL_DROPOUT': 0.1, 'TRANSFORMER_ATTENTION_DROPOUT': 0.1, 'TRANSFORMER_LAYER': 6, 'TRANSFORMER_HEAD': 8, 'TRANSFORMER_POS_DISCOUNT': 80, 'PRE_TOKENIZER': 'TransfoXLTokenizer', 'PRE_TOKENIZER_PATH': '../ExampleInitModel/transfo-xl-wt103', 'PYLEARN_MODEL': '../ExampleInitModel/HMNet-pretrained', 'EXTRA_IDS': 1000, 'BEAM_WIDTH': 6, 'MAX_GEN_LENGTH': 512, 'MIN_GEN_LENGTH': 320, 'EVAL_TOKENIZED': True, 'EVAL_LOWERCASE': True, 'NO_REPEAT_NGRAM_SIZE': 3, 'cuda': False, 'confFile': 'ExampleConf/conf_hmnet_AMI', 'datadir': 'ExampleConf', 'basename': 'conf_hmnet_AMI', 'command': 'train', 'conf_file': 'ExampleConf/conf_hmnet_AMI', 'cluster': 'local', 'dist_init_path': './tmp', 'fp16': False, 'fp16_opt_level': 'O1', 'no_cuda': True}
Using CPU

Saving logs, model, checkpoint, and evaluation in ExampleConf/conf_hmnet_AMI_conf~/run_12
 1.2.0  is high
Number of GPUs is  1 
Effective batch size is increased from  1  to  1 
Gradient accumulation steps =  20 
Effective batch size =  20 
[c85979e176ac:00029] pml_ucx.c:285  Error: UCP worker does not support MPI_THREAD_MULTIPLE
Select command: train
train on rank 0
-----------------------------------------------
Initializing model...
Loading Tokenizer from ExampleConf/../ExampleInitModel/transfo-xl-wt103...
Using pad_token, but it is not set yet.
Using bos_token, but it is not set yet.
Use POS and ENT
USE_ROLE
Total trainable parameters: 204488240
Loaded data on rank 0.
Using custom optimizer: RAdam
Optimizer parameters: {'lr': 0.001}
Using custom lr scheduler: LnrWrmpInvSqRtDcyScheduler
Lr scheduler parameters: {'warmup_steps': 16000, 'warmup_init_lr': 0.0001, 'warmup_end_lr': 0.001}
Epoch 0
Killed

I am specifying --no_cuda, but it still says the number of GPUs is 1. It also does not give a clear error message about where it is failing. Can someone help by looking into this?
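For reference, a quick generic check of what PyTorch itself sees on this machine (a standard snippet, not HMNet-specific):

import torch
print(torch.cuda.is_available())   # expect False on a machine with no usable GPU
print(torch.cuda.device_count())   # expect 0 when no GPU is visible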


irenebenedetto commented on May 21, 2024

When I tried to reproduce the results, I noticed that in the Transformer class there are some variables with .cuda() calls that are not controlled by the option opt['cuda']. Did you try to modify them?
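For example, a minimal sketch of what gating those calls on the flag could look like (opt and x here are stand-ins for the config dict and a model tensor, not exact HMNet code):

import torch

opt = {'cuda': False}   # stand-in for the HMNet config dict
x = torch.zeros(2, 3)   # stand-in for a tensor in the model

# pick the device from the config instead of calling .cuda() unconditionally
device = torch.device('cuda') if opt['cuda'] and torch.cuda.is_available() else torch.device('cpu')
x = x.to(device)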


rgowtham commented on May 21, 2024

Hi @irenebenedetto, yes. Before making those CUDA-related changes, the error message was something like the one below:

Epoch 0
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=50 error=35 : CUDA driver version is insufficient for CUDA runtime version
Traceback (most recent call last):
  File "PyLearn.py", line 71, in <module>
    trainer.train()
  File "/root/HMNet/Models/Trainers/HMNetTrainer.py", line 273, in train
    self.update(batch)
  File "/root/HMNet/Models/Trainers/HMNetTrainer.py", line 358, in update
    loss = self.network(batch)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/HMNet/Models/Trainers/HMNetTrainer.py", line 38, in forward
    output = self.model(batch)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/HMNet/Models/Networks/MeetingNet_Transformer.py", line 101, in forward
    outputs = self._forward(**batch)
  File "/root/HMNet/Models/Networks/MeetingNet_Transformer.py", line 126, in _forward
    token_encoder_outputs, sent_encoder_outputs = self.encoder(encoder_input_ids, encoder_input_roles, encoder_input_pos, encoder_input_ent)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/HMNet/Models/Networks/MeetingNet_Transformer.py", line 1131, in forward
    embedded = self.embedder(vocab_x.view(batch_size, -1))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/HMNet/Models/Networks/Transformer.py", line 387, in forward
    x_pos = self.pos_emb(torch.arange(x_len).type(torch.cuda.FloatTensor)) # len x n_state
RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at /pytorch/aten/src/THC/THCGeneral.cpp:50

Once I remove the .cuda() parts from here, here, and here, I get the error message I posted above. I was expecting to see more CUDA-related errors if the code were still trying to access the GPU.

Were you able to get it running after you changed all the places where .cuda() was used?


irenebenedetto commented on May 21, 2024

>   File "/root/HMNet/Models/Networks/Transformer.py", line 387, in forward
>     x_pos = self.pos_emb(torch.arange(x_len).type(torch.cuda.FloatTensor)) # len x n_state

Looking at the error message, I see other variables on CUDA (line 387 in forward). Did you also convert all the variables using .type(torch.cuda.FloatTensor) to .type(torch.FloatTensor)?
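Concretely, for the line in the traceback, that change would look like this (a sketch based only on the traceback above):

# before: hard-codes a CUDA tensor type and fails when the driver is unusable
x_pos = self.pos_emb(torch.arange(x_len).type(torch.cuda.FloatTensor))
# after: plain CPU float tensor
x_pos = self.pos_emb(torch.arange(x_len).type(torch.FloatTensor))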


rgowtham commented on May 21, 2024

Yes, the error message I posted was from before the CUDA changes. After I make the changes (wherever CUDA was used in the Transformer.py script), I see the same error posted in this message.


irenebenedetto commented on May 21, 2024

> After I make the changes (wherever CUDA was used in the Transformer.py script), I see the same error posted in this message.

Ah okay, sorry. And did you also check the MeetingNet_Transformer class here?

checkpoint = torch.load(os.path.join(load_dir, 'model.pt'), map_location=torch.device('cuda', self.opt['local_rank']))

I used:

checkpoint = torch.load(os.path.join(load_dir, 'model.pt'), map_location=torch.device('cpu'))
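A sketch that handles both cases from the same config, assuming self.opt carries the 'cuda' and 'local_rank' entries used elsewhere in the trainer:

# choose map_location from the config so the same code runs on CPU and GPU
map_loc = (torch.device('cuda', self.opt['local_rank'])
           if self.opt.get('cuda')
           else torch.device('cpu'))
checkpoint = torch.load(os.path.join(load_dir, 'model.pt'), map_location=map_loc)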


rgowtham commented on May 21, 2024

Yes, this is changed too, to load from the CPU.

