Difference in reliability w/ SubtitleEdit+Tesseract? #14

Closed
scr4tchy opened this issue Mar 2, 2023 · 5 comments

Comments

@scr4tchy

scr4tchy commented Mar 2, 2023

Hi!

Really excited to see this tool - fits amazingly as a tdarr plugin too!

While parsing my first English PGS (1h15min), I noticed that pgsrip detected 530 strings, whereas SubtitleEdit + Tesseract 5.3.0 detected 783 strings (and had great accuracy on them). That surprised me: both use Tesseract 5, and in theory the results should be very good regardless of the dataset, since the input is just plain English, black on white, with straight text. I also noticed that when a subtitle is missed, the previous subtitle stays on screen for a long time.
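
One way to locate the "previous subtitle sticks around" symptom in a ripped SRT is to flag cues with unusually long durations. The sketch below is only an illustration: the function name `long_cues` and the 10-second default threshold are assumptions, not part of pgsrip or SubtitleEdit, and it relies only on the standard SRT timing line format.

```python
import re

# Matches an SRT timing line such as "00:01:03,500 --> 00:01:07,250".
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*"
    r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})"
)


def long_cues(srt_text, max_seconds=10.0):
    """Return (start, duration) in seconds for each cue longer than max_seconds.

    The threshold is an arbitrary assumption; tune it for the material.
    """
    flagged = []
    for match in TIMING.finditer(srt_text):
        h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        duration = end - start
        if duration > max_seconds:
            flagged.append((start, duration))
    return flagged
```

Cues flagged this way are candidates for places where a following subtitle was dropped and the earlier one absorbed its screen time.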

Do you have any ideas from your experience?

@viown

viown commented Mar 2, 2023

I have this issue as well where some dialogue is missing from the ripped SRT file. I'm not sure if it's related but it seems to happen with segments that are in a different position from the center.

@ratoaq2 Here is a sample containing the original SUP file, the ripped pgsrip SRT file, and a Subtitle Edit version (which does not have any dialogue missing) that may help diagnose the issue.

subtitle.zip

@viown

viown commented Mar 2, 2023

I tested out an older version of pgsrip (v0.1.4) and the issue does not happen there and all dialogue is extracted properly as expected. So it must be a recent change that causes this.

@scr4tchy
Author

scr4tchy commented Mar 2, 2023

Good observation.

On another media file:

  • pgsrip latest yields 530 strings (525 unique)
  • pgsrip v0.1.4 yields 1131 strings (1104 unique)
  • Subtitle Edit yields 1132 strings (1132 unique)
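
For anyone wanting to reproduce this kind of comparison, total and unique string counts can be computed from an SRT file along these lines. This is a minimal sketch, not necessarily how the numbers above were obtained: the helper names `srt_strings` and `count_strings` are hypothetical, and it assumes cues are separated by blank lines with a standard timing line in each cue.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:02,000".
TIMING = re.compile(
    r"\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def srt_strings(srt_text):
    """Return the text of each cue, in order, with its lines joined by spaces."""
    strings = []
    normalized = srt_text.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            if TIMING.search(line):
                # Everything after the timing line is the cue text.
                body = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
                if body:
                    strings.append(body)
                break
    return strings


def count_strings(srt_text):
    """Return (total cues, unique cue texts) for an SRT document."""
    strings = srt_strings(srt_text)
    return len(strings), len(set(strings))
```

Running `count_strings` on the contents of each tool's output for the same SUP file gives figures that can be compared directly.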

@ratoaq2 ratoaq2 closed this as completed in ea8e7f9 Mar 3, 2023
@ratoaq2
Owner

ratoaq2 commented Mar 3, 2023

Thanks for reporting it and providing the information needed to reproduce it. I'm doing a release with the fix.

@mrwunderbar666

Hi,

thanks for making this great tool!

I am running into similar issues where SubtitleEdit extracts more / better subtitles from PGS than pgsrip. I am not sure what causes the problem, but SubtitleEdit found 463 strings, whereas pgsrip found only 242.

I attached a sample sup file where I encountered the issue:

example_subs.zip

Help would be appreciated
