Comments (7)
I am facing the same issue as above. I am trying to run this on a Mac with no GPU. My command and output are as follows:
root@c85979e176ac:~/HMNet# python PyLearn.py train ExampleConf/conf_hmnet_AMI --no_cuda
{'MODEL': 'MeetingNet_Transformer', 'TASK': 'HMNet', 'CRITERION': 'MLECriterion', 'SEED': 1033, 'MAX_NUM_EPOCHS': 20, 'SAVE_PER_UPDATE_NUM': 400, 'UPDATES_PER_EPOCH': 2000, 'OPTIMIZER': 'RAdam', 'NO_AUTO_LR_SCALING': True, 'START_LEARNING_RATE': 0.001, 'LR_SCHEDULER': 'LnrWrmpInvSqRtDcyScheduler', 'WARMUP_STEPS': 16000, 'WARMUP_INIT_LR': 0.0001, 'WARMUP_END_LR': 0.001, 'GRADIENT_ACCUMULATE_STEP': 20, 'GRAD_CLIPPING': 2, 'USE_REL_DATA_PATH': True, 'TRAIN_FILE': '../ExampleRawData/meeting_summarization/AMI_proprec/train_ami.json', 'DEV_FILE': '../ExampleRawData/meeting_summarization/AMI_proprec/valid_ami.json', 'TEST_FILE': '../ExampleRawData/meeting_summarization/AMI_proprec/test_ami.json', 'ROLE_DICT_FILE': '../ExampleRawData/meeting_summarization/role_dict_ext.json', 'MINI_BATCH': 1, 'MAX_PADDING_RATIO': 1, 'BATCH_READ_AHEAD': 10, 'DOC_SHUFFLE_BUF_SIZE': 10, 'SAMPLE_SHUFFLE_BUFFER_SIZE': 10, 'BATCH_SHUFFLE_BUFFER_SIZE': 10, 'MAX_TRANSCRIPT_WORD': 8300, 'MAX_SENT_LEN': 30, 'MAX_SENT_NUM': 300, 'DROPOUT': 0.1, 'VOCAB_DIM': 512, 'ROLE_SIZE': 32, 'ROLE_DIM': 16, 'POS_DIM': 16, 'ENT_DIM': 16, 'USE_ROLE': True, 'USE_POSENT': True, 'USE_BOS_TOKEN': True, 'USE_EOS_TOKEN': True, 'TRANSFORMER_EMBED_DROPOUT': 0.1, 'TRANSFORMER_RESIDUAL_DROPOUT': 0.1, 'TRANSFORMER_ATTENTION_DROPOUT': 0.1, 'TRANSFORMER_LAYER': 6, 'TRANSFORMER_HEAD': 8, 'TRANSFORMER_POS_DISCOUNT': 80, 'PRE_TOKENIZER': 'TransfoXLTokenizer', 'PRE_TOKENIZER_PATH': '../ExampleInitModel/transfo-xl-wt103', 'PYLEARN_MODEL': '../ExampleInitModel/HMNet-pretrained', 'EXTRA_IDS': 1000, 'BEAM_WIDTH': 6, 'MAX_GEN_LENGTH': 512, 'MIN_GEN_LENGTH': 320, 'EVAL_TOKENIZED': True, 'EVAL_LOWERCASE': True, 'NO_REPEAT_NGRAM_SIZE': 3, 'cuda': False, 'confFile': 'ExampleConf/conf_hmnet_AMI', 'datadir': 'ExampleConf', 'basename': 'conf_hmnet_AMI', 'command': 'train', 'conf_file': 'ExampleConf/conf_hmnet_AMI', 'cluster': 'local', 'dist_init_path': './tmp', 'fp16': False, 'fp16_opt_level': 'O1', 'no_cuda': True}
Using CPU
Saving logs, model, checkpoint, and evaluation in ExampleConf/conf_hmnet_AMI_conf~/run_12
1.2.0 is high
Number of GPUs is 1
Effective batch size is increased from 1 to 1
Gradient accumulation steps = 20
Effective batch size = 20
[c85979e176ac:00029] pml_ucx.c:285 Error: UCP worker does not support MPI_THREAD_MULTIPLE
Select command: train
train on rank 0
-----------------------------------------------
Initializing model...
Loading Tokenizer from ExampleConf/../ExampleInitModel/transfo-xl-wt103...
Using pad_token, but it is not set yet.
Using bos_token, but it is not set yet.
Use POS and ENT
USE_ROLE
Total trainable parameters: 204488240
Loaded data on rank 0.
Using custom optimizer: RAdam
Optimizer parameters: {'lr': 0.001}
Using custom lr scheduler: LnrWrmpInvSqRtDcyScheduler
Lr scheduler parameters: {'warmup_steps': 16000, 'warmup_init_lr': 0.0001, 'warmup_end_lr': 0.001}
Epoch 0
Killed
I am specifying --no_cuda, but the log still says the number of GPUs is 1. It also does not give a clear error message about where it is failing. Can someone help by looking into this?
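A note on the bare "Killed" line: when a process dies with no Python traceback, it has usually received SIGKILL, and a common cause is the kernel's out-of-memory killer (for example, when a container has a small memory limit). This is only a guess for this run. As a rough sanity check, the sketch below estimates training memory from the parameter count printed in the log. The factor for optimizer state assumes an Adam-style optimizer holding two buffers per parameter, and activations are not counted, so treat the result as a lower bound.

```python
# Rough training-memory estimate from the parameter count printed in the log.
NUM_PARAMS = 204_488_240   # "Total trainable parameters" from the log above
BYTES_PER_FLOAT32 = 4


def estimate_training_bytes(num_params, state_copies=2):
    """Weights + gradients + optimizer state, all in float32.

    state_copies=2 is an assumption: an Adam-style optimizer that keeps
    two moment buffers per parameter. Activations are NOT included.
    """
    copies = 1 + 1 + state_copies  # weights, gradients, optimizer buffers
    return num_params * BYTES_PER_FLOAT32 * copies


weights_gib = NUM_PARAMS * BYTES_PER_FLOAT32 / 2**30
total_gib = estimate_training_bytes(NUM_PARAMS) / 2**30
print(f"weights only: {weights_gib:.2f} GiB")
print(f"weights + grads + optimizer state: {total_gib:.2f} GiB")
```

If the estimate is close to or above the memory available to the container, raising that limit would be the first thing to try.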
from hmnet.
When I tried to reproduce the results, I noticed that the Transformer class has some variables that call .cuda() without being controlled by the option opt['cuda']. Did you try to modify them?
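For illustration, one way to keep such calls under the control of the config is to pick a device once and pass it around, instead of calling .cuda() directly. This is a minimal sketch, not the repository's code: pick_device and the toy layer are hypothetical, and only the 'cuda' key is taken from the printed config.

```python
import torch
import torch.nn as nn


def pick_device(opt):
    """Use the GPU only when the config asks for it AND one is available."""
    use_cuda = bool(opt.get("cuda", False)) and torch.cuda.is_available()
    return torch.device("cuda" if use_cuda else "cpu")


opt = {"cuda": False}                 # mirrors 'cuda': False in the printed config
device = pick_device(opt)

layer = nn.Linear(4, 2).to(device)    # instead of nn.Linear(4, 2).cuda()
x = torch.zeros(3, 4, device=device)  # instead of torch.zeros(3, 4).cuda()
y = layer(x)
print(device, tuple(y.shape))
```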
from hmnet.
Hi @irenebenedetto, yes. Before making those CUDA-related changes, the error message was something like the one below:
Epoch 0
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=50 error=35 : CUDA driver version is insufficient for CUDA runtime version
Traceback (most recent call last):
File "PyLearn.py", line 71, in <module>
trainer.train()
File "/root/HMNet/Models/Trainers/HMNetTrainer.py", line 273, in train
self.update(batch)
File "/root/HMNet/Models/Trainers/HMNetTrainer.py", line 358, in update
loss = self.network(batch)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/root/HMNet/Models/Trainers/HMNetTrainer.py", line 38, in forward
output = self.model(batch)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/root/HMNet/Models/Networks/MeetingNet_Transformer.py", line 101, in forward
outputs = self._forward(**batch)
File "/root/HMNet/Models/Networks/MeetingNet_Transformer.py", line 126, in _forward
token_encoder_outputs, sent_encoder_outputs = self.encoder(encoder_input_ids, encoder_input_roles, encoder_input_pos, encoder_input_ent)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/root/HMNet/Models/Networks/MeetingNet_Transformer.py", line 1131, in forward
embedded = self.embedder(vocab_x.view(batch_size, -1))
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/root/HMNet/Models/Networks/Transformer.py", line 387, in forward
x_pos = self.pos_emb(torch.arange(x_len).type(torch.cuda.FloatTensor)) # len x n_state
RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at /pytorch/aten/src/THC/THCGeneral.cpp:50
Once I remove the .cuda() parts from here, here and here, I get the error message that I posted above. I was expecting to see more CUDA-related errors if the code were still trying to access the GPU.
Were you able to get it running after you changed all the places where .cuda() was used?
from hmnet.
Looking at the error message, I see other variables on CUDA (line 387 in forward). Did you also convert all the variables that use .type(torch.cuda.FloatTensor) to .type(torch.FloatTensor)?
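As a sketch of a change that works on both CPU and GPU, the position tensor can be created on the same device as the input rather than cast to a CUDA type. The PositionEmbedder class below is a hypothetical stand-in, not the repository's embedder; only the torch.arange(x_len) pattern is taken from the traceback.

```python
import torch
import torch.nn as nn


class PositionEmbedder(nn.Module):
    """Hypothetical stand-in for a module that embeds positions 0..len-1."""

    def __init__(self, n_state):
        super().__init__()
        self.pos_emb = nn.Linear(1, n_state)

    def forward(self, x):
        x_len = x.shape[1]
        # Build positions on x's device instead of casting to torch.cuda.FloatTensor
        positions = torch.arange(x_len, dtype=torch.float, device=x.device)
        return self.pos_emb(positions.unsqueeze(-1))  # len x n_state


embedder = PositionEmbedder(n_state=8)
out = embedder(torch.zeros(2, 5, 8))  # batch x len x n_state input on the CPU
print(tuple(out.shape), out.device)
```

Because the device is read from the input, the same forward pass runs unchanged once the model and batch are moved to a GPU.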
from hmnet.
Yes, the error message that I posted was from before the CUDA changes. After I make the changes (wherever CUDA was used in the Transformer.py script), I am seeing the same error posted in this message.
from hmnet.
Ah okay, sorry. Did you also check the MeetingNet_Transformer class here? (I used checkpoint = torch.load(os.path.join(load_dir, 'model.pt'), map_location=torch.device('cpu')).)
from hmnet.
Yes, this has also been changed to load on the CPU.
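For reference, a small self-contained check of the map_location pattern mentioned above. The toy nn.Linear model and the temporary directory are stand-ins for the real checkpoint; only the torch.load call follows the line quoted in the thread.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # toy stand-in for the real network

with tempfile.TemporaryDirectory() as load_dir:
    path = os.path.join(load_dir, "model.pt")
    torch.save(model.state_dict(), path)
    # map_location remaps every stored tensor onto the CPU while loading
    state = torch.load(path, map_location=torch.device("cpu"))

restored = nn.Linear(4, 2)
restored.load_state_dict(state)
devices = {t.device.type for t in state.values()}
print(devices)
```

Printing the set of devices is a quick way to confirm that nothing in the loaded checkpoint still points at a GPU.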
from hmnet.
Related Issues (13)
- Docker building, Tensor Size issues, may be related to package versions. HOT 3
- Problems while building docker HOT 4
- Preprocessing my own data for inference
- The order of token_attn and sent_attn in decoder is different between the code and the paper, in MeetingNet_Transformer.py
- This repo is missing important files
- How to build a new data set with the same format HOT 2
- Modules Versions are not specified HOT 2
- Cuda out of memory HOT 6
- How to solve cuda out of memory error? HOT 1
- tokenizer.convert_ids_to_tokens not generating special tokens with predefined position offset
- Which version of spacy are you using? HOT 1
- How to train models with mine own data sets?