A bug when indexing num_den_graphs in LFMMILoss.forward #739
Comments
After I reduced the batch_size, there is no overflow at the point above.
Got:
After debugging, I found the reason:
in the upper frame:
Again, it's due to the offsets' type being int32_t @danpovey @csukuangfj |
My experiment is based on espnet2 rather than snowfall, but the situation is similar. |
Even if you were to fix the indexing error, it would likely be too large to fit in memory; I think the problem is that your den graph is too large.
I think the right way to fix this is either to construct a smaller den graph somehow, or to do the den decoding separately from the num forward-backward with intersect_dense_pruned(), which allows you to use a single FSA instead of multiple FSAs. That would also be more suitable since the den graph is so large.
Fangjun, at some point we should probably create an option in the training code (or simply an alternative code path) that is efficient when the den graph is inconveniently large.
…On Tue, May 18, 2021 at 5:53 PM Jarvan-Wang ***@***.***> wrote:
After I reduce the batch_size, no overflow at the point above.
Could you reduce your --max-duration to reduce the problem size further?
Is there any plan to fix this issue and #730
<#730> ?
I think @danpovey <https://github.com/danpovey> has something to say
about this.
My exp is based on espnet2 rather than snowfall, but it's similar
the batch_bins==2560, or 25.6 secs
I think it's small enough, regular value of this in espnet2 is 102400
even small value is meaningless
|
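For readers following along, below is a minimal sketch of the idea suggested above: compute the den part with intersect_dense_pruned() and a single shared FSA, instead of pairing every num graph with its own copy of the den graph. This is not the actual snowfall LFMMILoss code; the variable names (`num_graphs`, `den_graph`, `dense_fsa_vec`) and the beam values are illustrative assumptions.

```python
import k2

# Sketch only (not the actual snowfall LFMMILoss). Assumes:
#   num_graphs    - an FsaVec with one num graph per utterance
#   den_graph     - a single (large) den FSA shared by the whole batch
#   dense_fsa_vec - the k2.DenseFsaVec built from the network output

# Numerator: exact intersection with the small per-utterance num graphs.
num_lats = k2.intersect_dense(num_graphs, dense_fsa_vec, output_beam=10.0)
num_tot_scores = num_lats.get_tot_scores(log_semiring=True,
                                         use_double_scores=True)

# Denominator: pruned intersection with one shared FSA, so the large den
# graph is not replicated once per utterance. Beam values are made up.
den_lats = k2.intersect_dense_pruned(den_graph,
                                     dense_fsa_vec,
                                     search_beam=20.0,
                                     output_beam=8.0,
                                     min_active_states=30,
                                     max_active_states=10000)
den_tot_scores = den_lats.get_tot_scores(log_semiring=True,
                                         use_double_scores=True)

loss = -(num_tot_scores - den_tot_scores).sum()
```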
... another way to reduce the size of the den graph would be to limit it to only seen pairs of phone symbols. But that would require some changes to the code, as our P matrix would need to be ragged rather than square.
I'm curious why the den graph is so large in the first place, though, i.e. what the vocabulary size is.
|
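As an illustration of the "seen pairs only" idea above, the sketch below (hypothetical code, not part of k2 or snowfall) counts which phone bigrams actually occur in the training data; the resulting ragged mapping is usually far smaller than a dense vocab_size x vocab_size P matrix.

```python
from collections import Counter, defaultdict

def seen_phone_pairs(phone_seqs):
    """Count the phone bigrams that actually occur in phone_seqs
    (an iterable of phone-id sequences) and return them as a ragged
    mapping {left_phone: {right_phone: count}}."""
    counts = Counter()
    for seq in phone_seqs:
        for left, right in zip(seq, seq[1:]):
            counts[(left, right)] += 1
    ragged = defaultdict(dict)
    for (left, right), c in counts.items():
        ragged[left][right] = c
    return ragged

# With a 4k-phone vocabulary a square P matrix has 16M entries, but only
# the pairs that actually appear would need to be kept.
pairs = seen_phone_pairs([[1, 2, 3, 2], [2, 3, 4]])
print(dict(pairs))  # {1: {2: 1}, 2: {3: 2}, 3: {2: 1, 4: 1}}
```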
Will fix it in a day or two using intersect_dense_pruned and provide a commandline option to enable/disable it. |
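A hypothetical sketch of what such a command-line switch could look like; the flag name below is made up for illustration and is not taken from the actual snowfall change.

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical flag name, for illustration only.
parser.add_argument("--use-pruned-intersect", action="store_true",
                    help="If given, compute the den scores of the LF-MMI "
                         "loss with intersect_dense_pruned instead of "
                         "intersect_dense.")
args = parser.parse_args()

if args.use_pruned_intersect:
    print("Using the pruned-intersection path for the den graph.")
else:
    print("Using the default (unpruned) intersection path.")
```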
Thanks!
|
This csukuangfj/snowfall@550a6c5 fix works for our syllable-level recipe PARTLY (training starts successfully, but runs out of GPU memory after 1400 batches of training). Do you have any more optimization plans to deal with it? My character-level recipe (about 4k nodes) is still not working (time to give it up). |
Possibly reducing the pruning beams might help. A stack trace might show something.
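Illustrative only: if the pruned den intersection sketched earlier still runs out of memory, tightening the beams and the active-state cap is the kind of change being suggested here. These particular values are guesses, not recommendations.

```python
# Same call as in the earlier sketch, with tighter (made-up) pruning values.
den_lats = k2.intersect_dense_pruned(den_graph,
                                     dense_fsa_vec,
                                     search_beam=15.0,        # was 20.0
                                     output_beam=6.0,         # was 8.0
                                     min_active_states=30,
                                     max_active_states=5000)  # was 10000
```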
|
I mentioned this bug in #730 (comment), but that issue is about a decoding error.
After reading and digging into the code, I finally found the reason. Here it is:
Some values of new_offsets before the prefix sum are:
After the prefix sum, new_offsets is:
After new_offsets[2][103], it begins to overflow.
Now the problem is clear: when the den graph is large, pairing every num graph with the same big den graph causes the issue.
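To make the failure mode concrete, here is a small NumPy sketch (with made-up graph sizes, not numbers from this issue) showing how a 32-bit prefix sum over the interleaved num/den sizes wraps to negative values once the running total passes INT32_MAX, while a 64-bit sum does not.

```python
import numpy as np

INT32_MAX = 2**31 - 1  # 2147483647

# Hypothetical sizes: every utterance pairs a small num graph with the same
# very large den graph, so the sizes being prefix-summed look like
# [num, den, num, den, ...].
num_graph_arcs = 50_000
den_graph_arcs = 20_000_000
batch_size = 120

sizes = np.array([num_graph_arcs, den_graph_arcs] * batch_size)

bad_offsets = np.cumsum(sizes.astype(np.int32), dtype=np.int32)  # wraps silently
good_offsets = np.cumsum(sizes.astype(np.int64), dtype=np.int64)

first_bad = int(np.argmax(bad_offsets < 0))
print("int32 offsets first go negative at entry", first_bad)
print("int32:", bad_offsets[first_bad - 1:first_bad + 2])
print("int64:", good_offsets[first_bad - 1:first_bad + 2])
```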