RuntimeError: CUDA error: no kernel image is available for execution on the device about fastseq HOT 7 CLOSED

microsoft commented on May 18, 2024

RuntimeError: CUDA error: no kernel image is available for execution on the device

from fastseq.

Comments (7)

NickNickGo commented on May 18, 2024

Hi @sshleifer,

Could you list the steps to reproduce this error? Please also provide your environment info.

Thanks,

sshleifer commented on May 18, 2024

Hard to explain the cluster setup, but we fixed it by running export TORCH_CUDA_ARCH_LIST="6.0;6.1;7.0" before building the extension.
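
The fix above can also be scripted. As a minimal sketch (not fastseq's own build script), the snippet below formats a TORCH_CUDA_ARCH_LIST value from a list of compute capabilities and sets it in the environment. The `arch_list` helper and the `targets` list are hypothetical names introduced here; the capability values are the ones quoted in the comment above.

```python
import os


def arch_list(capabilities):
    """Format (major, minor) compute capabilities as a TORCH_CUDA_ARCH_LIST value."""
    return ";".join(f"{major}.{minor}" for major, minor in sorted(set(capabilities)))


# Assumed targets, taken from the comment above: 6.0, 6.1 and 7.0.
# Duplicates and ordering do not matter; the helper sorts and de-duplicates.
targets = [(7, 0), (6, 0), (6, 1), (6, 0)]
value = arch_list(targets)
os.environ["TORCH_CUDA_ARCH_LIST"] = value
print(value)  # 6.0;6.1;7.0

# The extension then has to be built by a process that inherits this
# environment, for example:
#   subprocess.run(["python", "setup.py", "build_ext", "--inplace"], check=True)
```

On a machine with a GPU, torch.cuda.get_device_capability() returns the (major, minor) pair for the current device, which is one way to collect the values for each GPU type in a cluster.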

Another question: is there an advantage to NGramRepeatBlock inheriting from nn.Module?

yuyan2do commented on May 18, 2024

@sshleifer We are open to pulling some changes back into 'fairseq'.

I am trying to use your repeat ngram extension, but when I switch GPUs (without rebuilding the extension) it breaks with RuntimeError: CUDA error: no kernel image is available for execution on the device. If I rerun python setup.py build_ext --inplace, it works again. Any clues on how to build the extension so that it works on a different GPU (same CUDA version, same Python version, same torch) than the one it was built on?

Also, we're considering pulling some of these changes back into fairseq, if that's alright with you guys!
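
One way to reason about the question above: per the CUDA documentation, binary code compiled for compute capability X.y only runs on devices of capability X.z with z >= y, while embedded PTX (the "+PTX" suffix accepted in TORCH_CUDA_ARCH_LIST) can be JIT-compiled for the same or any newer capability. The sketch below checks whether an arch list covers a given device under those two rules. `covers` is a hypothetical helper, not part of PyTorch or fastseq, and it only handles numeric entries, not named architectures.

```python
def covers(arch_list, device):
    """Return True if an arch list such as "6.0;6.1+PTX" can run on `device`.

    `device` is a (major, minor) compute capability tuple.
    """
    # Entries may be separated by semicolons or spaces.
    for entry in arch_list.replace(" ", ";").split(";"):
        if not entry:
            continue
        has_ptx = entry.endswith("+PTX")
        number = entry[:-4] if has_ptx else entry
        major, minor = (int(part) for part in number.split("."))
        # Compiled binary code: same major version, equal or lower minor version.
        if major == device[0] and minor <= device[1]:
            return True
        # PTX can be JIT-compiled for any equal or newer capability.
        if has_ptx and (major, minor) <= tuple(device):
            return True
    return False


print(covers("6.0", (7, 0)))          # False: no 7.x binary and no PTX
print(covers("6.0;6.1;7.0", (7, 0)))  # True: a 7.0 binary is included
print(covers("6.0+PTX", (7, 0)))      # True: PTX can be JIT-compiled
```

Under these assumptions, an extension built for a single architecture without PTX fails on a GPU with a different major capability, which matches the error reported here.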

sshleifer commented on May 18, 2024

Awesome! If you guys tell me your Twitter handles or some other link, I will make sure to credit you when I tweet. The speedup for ngram blocking is really impressive; it will get merged into fairseq/master soon.

I'm also trying to prioritize including the other changes:

MultiheadAttention: einsum
SequenceGenerator: parallel post-processing
BeamSearch: ?
TransformerEncoder, TransformerModel: delete reorder_encoder_out

Are the last two changes important? Do you guys have a sense of why?
Is the MultiheadAttention just to save memory or also faster?

Thanks and sorry for all the questions.

yuyan2do commented on May 18, 2024

Thanks Sam, it would be great if you could mention our project https://github.com/microsoft/fastseq and our Twitter handle @fastseq.

MultiheadAttention: einsum combined with the reorder_incremental_state change is both faster and saves memory at the same batch size. Memory copies take a lot of time, especially when the input is long. reorder_encoder_out is removed because the encoder output no longer needs to be duplicated beam-size times. There is some analysis here and here.
SequenceGenerator: parallel post-processing.
BeamSearch: compatibility change for fairseq v0.9.0 only; it just replaces torch.div with torch.floor_divide.
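
To illustrate why the BeamSearch change matters, here is a plain-Python sketch with hypothetical sizes (no torch involved, and not the fairseq code itself). Beam search typically selects top-k candidates over scores flattened across beams and vocabulary, so each flat index has to be split into a beam index and a token index. That split needs integer division; true division produces a fractional value that cannot be used as an index.

```python
def split_flat_index(flat_index, vocab_size):
    """Split an index into flattened (beam, vocab) scores into (beam, token)."""
    beam = flat_index // vocab_size  # integer division, the role of torch.floor_divide
    token = flat_index % vocab_size
    return beam, token


vocab_size = 50_000   # hypothetical vocabulary size
flat_index = 123_457  # hypothetical top-k index into beam_size * vocab_size scores

print(split_flat_index(flat_index, vocab_size))  # (2, 23457)
print(flat_index / vocab_size)                   # 2.46914, not a valid beam index
```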

feihugis commented on May 18, 2024

@sshleifer Thanks for your interest! I think this issue has been resolved. I will close it, but feel free to reopen it if you have more questions.

yuyan2do commented on May 18, 2024

@sshleifer I saw that ngram blocking has been merged into fairseq/main. Have you had a chance to try the other changes? We now have papers (FastSeq and EL-Attention) that describe the changes. They may give a better sense of where the speedup comes from.
