ValueError: Specified device cuda:0 does not match device of data cuda:-2 #423
Could you insert a print of the tensor's device right before the failing line?

I tried that and it shows "cuda:0".
What is the output of …?
[EDITED]: There is a similar error reported at https://discuss.pytorch.org/t/allocation-of-tensor-on-cuda-fails/144204. In your case, training runs successfully but fails during the validation stage, which is very odd.
I did a …
I tried the following solutions, but the issue still persists. I am not sure whether it is related to the dataset I prepared or to the functions that line of code invokes.

Solution 1: As suggested by @csukuangfj in #247, I switched to PyTorch 1.10 and CUDA 10.2; the error changes (other parts of the stack are the same).

Solution 2: As suggested in https://discuss.pytorch.org/t/allocation-of-tensor-on-cuda-fails/144204, I added lines to convert the device to CPU and convert it back, but the error is still there.

Solution 3: As suggested in #247, I added back the function call to filter out input segments below 1 s and above 20 s (a sketch of such a filter follows below), but it still doesn't solve the issue (same error message).

Could you please guide me on how to debug this, or suggest a workaround? Any help is highly appreciated in advance. Thanks.
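For reference, a minimal sketch of the kind of duration filter described in Solution 3, assuming the data pipeline uses a lhotse CutSet as in the icefall recipes (the function name is illustrative):

```python
from lhotse import CutSet

def remove_short_and_long_utt(cuts: CutSet) -> CutSet:
    # Keep only cuts between 1 s and 20 s, matching the bounds
    # mentioned in Solution 3 above.
    return cuts.filter(lambda c: 1.0 <= c.duration <= 20.0)
```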
This could be an error from a previous kernel. Perhaps try doing the following.
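The exact snippet is missing from this transcript; a plausible reconstruction, assuming the suggestion was the usual pair of synchronization switches for debugging CUDA errors, is:

```python
# Assumption: the elided suggestion was to enable synchronous kernel
# launches so the stack trace points at the kernel that actually fails.
# Both variables must be set before CUDA is initialized.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # PyTorch/CUDA: launch kernels synchronously
os.environ["K2_SYNC_KERNELS"] = "1"       # k2: synchronize after each k2 kernel

import torch  # import only after the variables are set
import k2
```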
Did you install k2 from source? If so, is the machine you used to build k2 the same as the one you are using for training?
I installed k2 using conda. Do you suggest building from source? The `k2.version` output is below: …
I added those two options, but the stack trace doesn't seem to include more information. I pasted it below: …
You could run it inside pdb (`python3 -m pdb [program] [args]`) and try to print out the dimensions of `y`.
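A minimal sketch of that inspection, assuming `y` is a k2.RaggedTensor as in the icefall training code (to be pasted right before the failing line):

```python
# Sketch: break right before the failing call and inspect y.
import pdb

pdb.set_trace()
print(y.dim0)     # number of utterances in the batch
print(y.numel())  # total number of tokens across all utterances
print(y.device)   # should be a valid device such as cuda:0
```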
Yes, that was within the same shell. I haven't tried pdb yet. I built k2 from source with PyTorch 1.10.0, CUDA 10.2, and cuDNN 7.6.5; the exception disappears and training continues normally.
Update: the exception reappears during the second epoch (epoch 0 finished without error).
Update on this issue: this seems to be a bug in the underlying tensor operation. If we convert the tensor to CPU, it works, so I guess this is a bug related to PyTorch. But then I noticed that the training (pruned-rnn-t-stateless2) gives an inf loss for such inputs.
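The CPU round trip described above might look like this (a sketch; `cpu_round_trip` is a hypothetical helper, not part of the codebase):

```python
import torch

def cpu_round_trip(t: torch.Tensor, device: torch.device) -> torch.Tensor:
    # Moving the tensor through the CPU rebuilds it with valid device
    # metadata, sidestepping the bogus cuda:-2 device.
    return t.to("cpu").to(device)

# usage: y = cpu_round_trip(y, torch.device("cuda", 0))
```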
Shall we filter out such inputs from the training data?
Thanks for the advice; pruning should work, I think. But we also found from previous experiments on wav2vec models that those noisy inputs are very helpful. We often have noisy inputs in production as well, and by training on them the model learns to decode nothing instead of some random words. I am currently trying to retrain the BPE system with an additional special token added to those empty samples.
In that case, I suggest padding it with at least `s_range` blank (0) tokens.
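A minimal sketch of that padding, assuming transcripts are plain token-id lists and blank has id 0 (`pad_empty_transcript` is a hypothetical helper; `s_range = 5` matches the setup mentioned later in this thread):

```python
s_range = 5   # prune range used in this setup (from the thread)
blank_id = 0  # assumption: blank token has id 0

def pad_empty_transcript(tokens):
    # Give empty transcripts at least s_range blank tokens so the
    # pruned loss has a non-degenerate range to work with.
    return tokens if len(tokens) > 0 else [blank_id] * s_range
```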
Does the 0 token also serve as the termination token when computing the loss? I got that impression since normally we also have padding tokens at the end of each `y` tensor, and those tokens probably don't contribute to the computation of the loss.
See icefall/egs/librispeech/ASR/pruned_transducer_stateless2/model.py, lines 143 to 144 (commit 6c69c4e).
Only things within the given boundary contribute to the loss.
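For context, the boundary those lines construct looks roughly like this (an approximation based on the surrounding discussion, not a verbatim quote of model.py):

```python
# The boundary gives rnnt_loss the per-utterance
# (begin_symbol, begin_frame, end_symbol, end_frame), so padding
# tokens beyond y_lens never contribute to the loss.
import torch

def make_boundary(x_lens: torch.Tensor, y_lens: torch.Tensor) -> torch.Tensor:
    boundary = torch.zeros((x_lens.size(0), 4), dtype=torch.int64, device=x_lens.device)
    boundary[:, 2] = y_lens  # end symbol index per utterance
    boundary[:, 3] = x_lens  # end frame index per utterance
    return boundary
```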
Thank you for the explanation! I will try padding with `s_range` blank tokens. But now I get an error complaining about division by 0; does it mean that the `x_lens` (frames) are all 0?
I suggest either setting a breakpoint and running it step by step, or just dumping the data to disk and inspecting it.
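A minimal sketch of the dump-and-inspect approach (`dump_batch` is a hypothetical helper; the batch layout is an assumption):

```python
# Save the offending batch to disk when the loss misbehaves, then
# reload it in a separate session for inspection.
import torch

def dump_batch(batch, path="bad_batch.pt"):
    torch.save(batch, path)

# later, e.g. in a REPL:
#   batch = torch.load("bad_batch.pt")
#   print(batch["supervisions"])  # assumes an icefall-style batch dict
```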
Is it clear that this is a Torch bug? We are calling `pad` on a k2 ragged object.
I am not 100% sure that it is a torch bug. I see that k2 supports empty tensors very well, but the error shows up when I call `pad` on the ragged object.
By the way, I followed @csukuangfj's suggestion and padded empty inputs with `s_range` blank (0) tokens (`s_range` is 5 in my current setup), and I find the pruned loss doesn't return inf values anymore.
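A hypothetical minimal repro of that call, assuming the failure is in k2.RaggedTensor.pad on a batch containing an empty row:

```python
# Assumption: the failing call is pad on a ragged tensor with an
# empty row, as discussed above.
import torch
import k2

y = k2.RaggedTensor([[1, 2, 3], []]).to(torch.device("cuda", 0))
padded = y.pad(mode="constant", padding_value=0)
print(padded, padded.device)  # device should be cuda:0, not cuda:-2
```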
It may be a bug in k2; we should debug that separately. @pkufool?
Another issue is that I found the loss function sometimes gave inf. Thanks in advance.
From k2-fsa/fast_rnnt#10 (comment): can you try the latest master of k2?
Try this fix: k2-fsa/k2#1009 (it has not been merged into master yet). I think this fix can handle samples that have only one symbol, no matter what.
Thank you for the fix! I downloaded it and tested it against my bad input batch; no more inf.
I am trying to fine-tune the pre-trained model from the gigaspeech recipe but encountered the above error; the entire traceback log is below: …

I tried to look into the code but could not find any clue about this error. My guess is that some of the cuts from my dev set trigger it. Could you please point me to the relevant place to further debug this issue?

Thanks.