Comments (10)

ratoaq2 avatar ratoaq2 commented on June 27, 2024

pgsrip uses whichever tesseract is installed on the system. You should also set an environment variable that tells it where the tessdata to use is located.
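A quick way to see what would be picked up is sketched below. It assumes the variable in question is TESSDATA_PREFIX, which is the one tesseract itself reads; the helper name `describe_tesseract_env` is made up for this example and is not part of pgsrip.

```python
import os
import shutil


def describe_tesseract_env(env=None, which=shutil.which):
    """Report which tesseract binary and tessdata location would be used.

    `env` and `which` are injectable so the check can be tried without
    touching the real environment.
    """
    env = os.environ if env is None else env
    return {
        # None when no tesseract executable is found on PATH
        "binary": which("tesseract"),
        # None when TESSDATA_PREFIX is not set (assumed variable name)
        "tessdata_prefix": env.get("TESSDATA_PREFIX"),
    }


if __name__ == "__main__":
    print(describe_tesseract_env())
```

If `tessdata_prefix` comes back as `None`, tesseract falls back to its built-in default location, which may hold different trained data than expected.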

In case you're using the docker image, the image was built three weeks ago using latest Debian bullseye, installing tesseract-ocr amd64 4.1.1-2.1 and using the trained data from https://github.com/tesseract-ocr/tessdata_best

Could you share examples of the output differences?
If the difference comes from the tesseract version, I could look into installing the latest one in our docker image.
Otherwise, the only way to improve things is to look at the differences in detail.

from pgsrip.

imaadh avatar imaadh commented on June 27, 2024

Here's a zip with two files, one with the pgsrip output, and one that was generated by SubtitleEdit (tesseract 5.2)

srt_files.zip

Mostly it seems like the accuracy is off in the pgsrip version, there are missed spacings, incorrect punctuation, and also sometimes certain speaker names are not included (see subtitle #5).

I was not using docker, just running locally.

Appreciate the help.


ratoaq2 avatar ratoaq2 commented on June 27, 2024

Now I understand and I can see the differences. Since you're not using the docker image, I'm assuming pgsrip is using exactly the same tesseract installation that subtitleedit is using. And I'm also assuming it is using the very same trained data as subtitleedit.

Let's go through the points one by one:

  1. pgsrip uses cleanit to post-process the extracted subtitle. The default rules/tags used by cleanit are:
  • ocr: Fix common OCR errors
  • tidy: Fix common formatting issues (e.g.: extra/missing spaces after punctuation)
  • no-sdh: Remove SDH descriptions
  • no-lyrics: Remove lyrics
  • no-spam

For instance, the example below shows cleanit removing the SDH description (here, the speaker name) from the subtitle:

subtitleedit:
194
00:10:12,027 --> 00:10:13,195
ELLIOT:
Are you still there?

pgsrip:
177
00:10:12,028 --> 00:10:13,196
Are you still there?

You can control what tags you want to use with the option -t, --tag:

  -t, --tag TEXT                  Rule tags to be used, e.g. ocr, tidy, no-
                                  sdh, no-style, no-lyrics, no-spam (can be
                                  used multiple times).

And you can even specify your own cleanit rules if needed:

  -c, --config PATH               cleanit configuration path to be used
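
As a small sketch of how these two options combine, the snippet below builds a pgsrip command line from a list of tags. Only the `--tag` and `--config` flags come from the help text above; the function name and the file names are hypothetical, and I'm assuming that passing `--tag` explicitly replaces the default tag set.

```python
import shlex


def build_pgsrip_command(media_path, tags=(), config=None):
    """Build the argv list for a pgsrip run with explicit cleanit tags."""
    cmd = ["pgsrip"]
    for tag in tags:
        # --tag can be given multiple times, once per rule tag
        cmd += ["--tag", tag]
    if config is not None:
        # optional path to a custom cleanit configuration
        cmd += ["--config", str(config)]
    cmd.append(str(media_path))
    return cmd


if __name__ == "__main__":
    # Leave out no-sdh so that speaker names such as "ELLIOT:" are kept.
    # "movie.mkv" is a placeholder path.
    print(shlex.join(build_pgsrip_command("movie.mkv", tags=["ocr", "tidy"])))
```

The resulting list can be handed to `subprocess.run` as-is.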

  2. Spacing issues / OCR errors:

pgsrip:
388
00:25:04,420 --> 00:25:06,422
pointing toalistener
on Tyrell's machine.

For a given subtitle track, we create a single image (or two or more if the image gets too big) and then apply some image processing to prepare the input for tesseract, mainly to make it monochromatic with clear edges. This is not perfect, but it improves tesseract's accuracy. In the initial versions I was calling tesseract separately for each subtitle entry. Then I realized that putting all subtitles in a single image works better for tesseract's AI, since it can see multiple occurrences of a given character. But the results can vary a lot depending on the subtitle font/style and on the image processing. The only way to improve this is to fine-tune some parameters, and I need the PGS to do that.
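
To make the idea concrete, here is a rough pure-Python sketch of the two steps: thresholding each subtitle bitmap to black and white, then stacking all entries onto one canvas. It is not pgsrip's actual implementation; the function names, the threshold, and the `gap`/`border` padding values are all assumptions chosen for illustration.

```python
def binarize(gray, threshold=128):
    """Map a grayscale bitmap (rows of 0-255 ints) to pure black (0) or white (255)."""
    return [[255 if px >= threshold else 0 for px in row] for row in gray]


def stack_entries(entries, gap=2, border=1, background=255):
    """Stack bitmaps vertically on a single canvas.

    Entries are separated by `gap` blank rows and the whole canvas is
    surrounded by `border` background pixels.
    """
    width = max(len(row) for entry in entries for row in entry) + 2 * border
    blank = [background] * width
    canvas = [list(blank) for _ in range(border)]
    for index, entry in enumerate(entries):
        if index:
            canvas += [list(blank) for _ in range(gap)]
        for row in entry:
            # left-align each row and pad it out to the canvas width
            pad_right = width - border - len(row)
            canvas.append([background] * border + list(row) + [background] * pad_right)
    canvas += [list(blank) for _ in range(border)]
    return canvas
```

With every entry on one canvas, a single tesseract call sees all occurrences of each glyph, which is the effect described above.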

Another common OCR issue is when the wrong language is used/passed to tesseract. But I assume this is not happening, since pgsrip takes the language information from the track metadata.

I also see that subtitleedit does some post-processing on the extracted subtitle, since some entries that span 3 lines are changed to 2 lines. They probably also apply some common OCR fixes to the subtitle.
Some of the issues that I see in pgsrip could be solved by a new rule in cleanit. The following error is a good example:

182
00:10:29,671 --> 00:10:32,382
If I'm alive,
! must have been right.
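
A fix for this particular error could look like the regex below. This is only an illustration written in plain Python, not cleanit's rule syntax; the function name and the exact pattern are assumptions.

```python
import re

# A line that starts with "!" followed by a space and a lowercase letter
# is almost certainly a misread capital "I".
_LEADING_BANG = re.compile(r"^!(?= [a-z])", flags=re.MULTILINE)


def fix_leading_bang(text):
    """Replace a leading '!' that stands in for 'I' at the start of a line."""
    return _LEADING_BANG.sub("I", text)
```

Because the pattern is anchored to the start of a line, exclamation marks at the end of a sentence are left alone.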

I would like to improve pgsrip, but I would need the PGS used for that.

And one final thing, only if you're willing to try something else, you could check if running pgsrip from the docker image produces the same result or not.


ratoaq2 avatar ratoaq2 commented on June 27, 2024

I'm looking at the subtitleedit code and it seems they use their own trained data:
https://github.com/SubtitleEdit/support-files/tree/master/tessdata

Since tesseract is very powerful and generic, subtitleedit probably created their own trained data by feeding the AI only with subtitles, which seems like a good idea.


ratoaq2 avatar ratoaq2 commented on June 27, 2024

I'm wondering which trained data tesseract is using in your case when you run pgsrip...
Did you install tesseract yourself? Did you download any trained data? How are you executing pgsrip?

I suspect your results are not optimal because of the trained data.


imaadh avatar imaadh commented on June 27, 2024

I'm working on a Windows PC, and I installed pgsrip in a WSL Ubuntu VM following the instructions in the repo README.md, so it should be completely separate from SubtitleEdit, which is installed on Windows. Hence, I don't think pgsrip and SubtitleEdit are using the same tesseract data.


ratoaq2 avatar ratoaq2 commented on June 27, 2024

Thanks for the information.

I published a new release with a few things:

  • I updated the instructions in order to install the latest tesseract.
  • I also tested a pure windows installation and updated the instructions for it
  • I added some more options to the cli to keep the extracted PGS file and to dump the generated image (this can help with troubleshooting and with further optimizing the image before handing it over to tesseract)

If you're still willing to help, you could execute with the following options:

pgsrip --keep-temp-files --debug -vvv <your_media_path>

There should be some output pointing to a temporary folder where you can find the extracted PGS and 1 or more PNG files.
The PNG files could give us a hint of what the image that we're passing to tesseract looks like. Depending on the subtitle font, the image processing may need to be enhanced, and with that I could improve this tool.


ratoaq2 avatar ratoaq2 commented on June 27, 2024

I'm trying subtitleedit myself. I see there's plenty of options there...

What I found is that when using pure tesseract 5.3.0, without fallbacks and fixes, I get this error:
[image]

And if I set it to fall back to tesseract 3.02, it parses correctly.
It seems subtitleedit has a dictionary to validate what was parsed, and if it finds a strange word it tries fallbacks such as tesseract 3.02. This old version of tesseract seems to be more accurate when parsing some font types/styles.
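
The validate-then-fall-back behaviour I'm guessing at could be sketched like this. It is a hypothetical outline, not subtitleedit's code: the engines are plain callables standing in for different tesseract versions, and the dictionary is just a set of lowercase words.

```python
import re


def unknown_words(text, dictionary):
    """Return the words in `text` that are not in `dictionary` (a set of lowercase words)."""
    words = re.findall(r"[A-Za-z']+", text)
    return [word for word in words if word.lower() not in dictionary]


def ocr_with_fallback(image, engines, dictionary):
    """Try each OCR engine in order.

    Returns the first result whose words are all in the dictionary,
    otherwise the result with the fewest unknown words.
    """
    best = None
    for engine in engines:
        text = engine(image)
        misses = len(unknown_words(text, dictionary))
        if misses == 0:
            return text
        if best is None or misses < best[0]:
            best = (misses, text)
    return best[1] if best else ""
```

For example, an engine that returns "pointing toalistener" would fail the dictionary check, and the result of the next engine in the list would be used if it reads "pointing to a listener".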


imaadh avatar imaadh commented on June 27, 2024

Here's the output from the debug command

1.en.srt-13-0.zip


ratoaq2 avatar ratoaq2 commented on June 27, 2024

I tweaked pgsrip to increase accuracy:

  • Added a border around the PNG image
  • Increased the gaps between subtitle entries in the PNG image
  • Switched tesseract from PSM 11 to PSM 6, since the text and font are uniform

I'm releasing a new version with these changes.
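
For reference, PSM 11 is tesseract's "sparse text" mode and PSM 6 assumes a single uniform block of text. The sketch below shows how the page segmentation mode is passed on the tesseract command line. It does not show how pgsrip invokes tesseract internally; the function name and the file names are placeholders.

```python
def build_tesseract_command(image_path, output_base, lang="eng", psm=6, tessdata_dir=None):
    """Build the argv list for a tesseract CLI run with an explicit page segmentation mode."""
    cmd = ["tesseract", str(image_path), str(output_base), "-l", lang, "--psm", str(psm)]
    if tessdata_dir is not None:
        # point tesseract at a specific trained-data directory
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


if __name__ == "__main__":
    # "subtitles.png" and "out" are placeholder names
    print(" ".join(build_tesseract_command("subtitles.png", "out")))
```

Running the same image with `psm=11` and `psm=6` is an easy way to compare the two modes on a dumped PNG.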
