pgsrip uses whichever tesseract is installed on your system. You should also define where the tessdata to use is located with an environment variable.
In case you're using the docker image, it was built three weeks ago on the latest Debian bullseye, installing tesseract-ocr amd64 4.1.1-2.1
and using the trained data from https://github.com/tesseract-ocr/tessdata_best
Could you share examples of the output differences?
If the difference is caused by the version, I could check how to install the latest one on our docker image.
Otherwise, I can only look at the differences in detail to see how to improve.
from pgsrip.
Here's a zip with two files: one with the pgsrip output, and one generated by SubtitleEdit (tesseract 5.2).
Mostly it seems like the accuracy is off in the pgsrip version, there are missed spacings, incorrect punctuation, and also sometimes certain speaker names are not included (see subtitle #5).
I was not using docker, just running locally.
Appreciate the help.
Now I understand and I can see the differences. Since you're not using the docker image, I'm assuming pgsrip is using exactly the same tesseract installation that subtitleedit is using. And I'm also assuming it is using the very same trained data as subtitleedit.
Let's go through the points one by one:
pgsrip uses cleanit to post-process the extracted subtitle. The default rules/tags used by cleanit are:
- ocr: Fix common OCR errors
- tidy: Fix common formatting issues (e.g.: extra/missing spaces after punctuation)
- no-sdh: Remove SDH descriptions
- no-lyrics: Remove lyrics
- no-spam
For instance, the example below shows cleanit removing the SDH descriptions from the subtitle:
subtitleedit
194
00:10:12,027 --> 00:10:13,195
ELLIOT:
Are you still there?
pgsrip
177
00:10:12,028 --> 00:10:13,196
Are you still there?
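To illustrate the kind of rule behind no-sdh, here is a minimal sketch of stripping a speaker label such as "ELLIOT:" from an entry. The pattern and the function name are assumptions made up for this example; this is not cleanit's actual rule.

```python
import re

# Hypothetical pattern: a line that contains only an upper-case speaker name
# followed by a colon, e.g. "ELLIOT:". Not taken from cleanit.
SPEAKER_LINE = re.compile(r"^[A-Z][A-Z0-9 .'\-]*:\s*$")


def strip_speaker_lines(entry_text: str) -> str:
    """Drop lines that hold only a speaker label and keep the other lines."""
    kept = [line for line in entry_text.splitlines()
            if not SPEAKER_LINE.match(line)]
    return "\n".join(kept)


print(strip_speaker_lines("ELLIOT:\nAre you still there?"))
```

A line with mixed case before the colon (for example "Are you: there?") is left untouched by this pattern.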
You can control which tags you want to use with the option -t, --tag:
-t, --tag TEXT Rule tags to be used, e.g. ocr, tidy, no-sdh, no-style, no-lyrics, no-spam (can be used multiple times).
And you can even specify your own cleanit rules if needed:
-c, --config PATH cleanit configuration path to be used
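As a rough sketch of how these options combine, the helper below assembles a pgsrip command line from Python. Only the -t/--tag and -c/--config options quoted above are used; the helper name and the file name are hypothetical, and whether passing explicit tags replaces the default set is an assumption worth verifying on your installation.

```python
import shlex
from typing import List, Optional, Sequence


def build_pgsrip_command(media_path: str,
                         tags: Sequence[str] = (),
                         config_path: Optional[str] = None) -> List[str]:
    """Assemble a pgsrip invocation (hypothetical helper, not part of pgsrip)."""
    cmd = ["pgsrip"]
    for tag in tags:
        # --tag can be given multiple times, once per rule tag
        cmd += ["--tag", tag]
    if config_path is not None:
        cmd += ["--config", config_path]
    cmd.append(media_path)
    return cmd


# Keep the OCR and formatting fixes but leave out no-sdh,
# assuming explicit tags override the defaults.
command = build_pgsrip_command("movie.mkv", tags=["ocr", "tidy"])
print(shlex.join(command))
```

The resulting list can be handed to subprocess.run(command, check=True) once pgsrip is on the PATH.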
- Spacing issues / OCR errors:
pgsrip
388
00:25:04,420 --> 00:25:06,422
pointing toalistener
on Tyrell's machine.
For a given subtitle track, we create a single image (if the image gets too big, we may create 2 or more) and then apply some image processing to prepare the input data for tesseract, mainly to make it monochromatic with clear edges. This is not perfect, but it enhances tesseract's accuracy. In initial versions I was calling tesseract once for each subtitle entry. Then I realized that putting all subtitles in a single image works better for tesseract's AI, since it can see multiple occurrences of a given character. But the results can vary a lot depending on the subtitle font/style and on the image processing. The only way to improve this is to fine-tune some parameters, but I need the PGS to do that.
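The two steps described here (making each subtitle image monochromatic, then combining all of them into one sheet) could be sketched roughly as follows. This is a toy version on plain lists of grayscale pixels, not pgsrip's actual code; the threshold, the gap size, and the white background are assumptions chosen for the example.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale pixels, 0 (black) to 255 (white)


def binarize(image: Image, threshold: int = 128) -> Image:
    """Map every pixel to pure black (0) or pure white (255)."""
    return [[255 if px >= threshold else 0 for px in row] for row in image]


def stack_vertically(images: List[Image], gap: int = 2,
                     background: int = 255) -> Image:
    """Stack images top to bottom, padded to a common width,
    with `gap` blank rows between consecutive images."""
    width = max(len(row) for img in images for row in img)
    sheet: Image = []
    for index, img in enumerate(images):
        if index > 0:
            sheet.extend([[background] * width for _ in range(gap)])
        for row in img:
            # Right-pad narrower rows with background pixels
            sheet.append(row + [background] * (width - len(row)))
    return sheet


first = binarize([[10, 200, 30]])
second = binarize([[250, 5]])
sheet = stack_vertically([first, second], gap=1)
print(sheet)
```

With real subtitle bitmaps the same idea would normally be implemented with an imaging library, and the threshold would likely need tuning per font and style.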
Another common OCR issue is when the wrong language is used/passed to tesseract. But I assume this is not happening, since pgsrip takes the language information from the track metadata.
I also see that subtitleedit does some post-processing on the extracted subtitle, since some entries that have 3 lines are changed to 2 lines. They probably also apply some common OCR fixes to the subtitle.
For some of the issues that I see in pgsrip, a new rule in cleanit would solve them. The following error is a good example:
182
00:10:29,671 --> 00:10:32,382
If I'm alive,
! must have been right.
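A rule for this particular error could look something like the sketch below. It is written as a plain regular expression rather than in cleanit's rule format, and the heuristic (a "!" at the start of a line followed by a space and a lower-case word is a misread "I") is an assumption that would need checking against more subtitles.

```python
import re

# Hypothetical heuristic: "!" at the very start of a line, followed by a
# space and a lower-case letter, is most likely a misread capital "I".
LEADING_BANG = re.compile(r"^!(?= [a-z])", re.MULTILINE)


def fix_leading_bang(text: str) -> str:
    """Replace a line-initial "!" with "I" when a lower-case word follows."""
    return LEADING_BANG.sub("I", text)


print(fix_leading_bang("If I'm alive,\n! must have been right."))
```

Exclamation marks at the end of a sentence are not affected, because the pattern only matches at the start of a line.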
I would like to improve pgsrip, but I would need the PGS used for that.
And one final thing, only if you're willing to try something else: you could check whether running pgsrip from the docker image produces the same result or not.
I'm looking at the subtitleedit code and it seems they use their own trained data:
https://github.com/SubtitleEdit/support-files/tree/master/tessdata
Since tesseract is very powerful and generic, subtitleedit probably created their own trained data by feeding the AI only with subtitles, which seems like a good idea.
I'm wondering which trained data tesseract is using in your case when running pgsrip...
Did you install tesseract yourself? Did you download any trained data? How are you executing pgsrip?
I suspect your results are not optimal because of the trained data.
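A quick way to answer these questions is to print which tesseract binary is found on the PATH, its version, and the tessdata location from the environment. The sketch below assumes the trained data location is set through tesseract's TESSDATA_PREFIX variable; the function name is made up for this example.

```python
import os
import shutil
import subprocess
from typing import Dict, Optional


def describe_tesseract_setup() -> Dict[str, Optional[str]]:
    """Report the tesseract binary, its version banner and TESSDATA_PREFIX."""
    binary = shutil.which("tesseract")
    version = None
    if binary is not None:
        result = subprocess.run([binary, "--version"],
                                capture_output=True, text=True)
        # Depending on the release, the banner may land on stdout or stderr
        output = (result.stdout or result.stderr).strip()
        version = output.splitlines()[0] if output else None
    return {
        "binary": binary,
        "version": version,
        "tessdata_prefix": os.environ.get("TESSDATA_PREFIX"),
    }


for key, value in describe_tesseract_setup().items():
    print(f"{key}: {value}")
```

If tessdata_prefix comes back as None, tesseract falls back to whatever trained data shipped with the installation, which may not be tessdata_best.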
I'm working on a Windows PC, so I had installed pgsrip using the instructions from the repo README.md into a WSL Ubuntu VM, so it should be completely logically separated from SubtitleEdit, which is installed on Windows. Hence, I don't think pgsrip and SubtitleEdit are using the same tesseract data.
Thanks for the information.
I published a new release with a few things:
- I updated the instructions in order to install the latest tesseract.
- I also tested a pure windows installation and updated the instructions for it
- I added some more options to the cli to keep the extracted PGS file and to dump the generated image (that can help with troubleshooting and with further optimizing the image before handing it over to tesseract)
If you're still willing to help, you could execute with the following options:
pgsrip --keep-temp-files --debug -vvv <your_media_path>
There should be some output pointing to a temporary folder where you could find the extracted PGS and 1 or more PNG files.
The PNG files could give us a hint of what the image that we're passing to tesseract looks like. Depending on the subtitle font, the image processing may need to be enhanced, and with that I could improve this tool.
I'm trying subtitleedit myself. I see there are plenty of options there...
What I found is that when using pure tesseract 5.3.0, without fallbacks and fixes, I'm getting this error:
And if I set it to fall back to tesseract 3.02, it parses correctly.
It seems subtitleedit has a dictionary to validate what was parsed, and if there's a strange word, it tries fallbacks such as tesseract 3.02. This old version of tesseract seems to be more accurate when parsing some font types/styles.
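My understanding of that dictionary-plus-fallback idea can be sketched as below. This is only a guess at the general approach, not subtitleedit's code: the engines are stand-in callables, and the word list and function names are invented for the example.

```python
import re
from typing import Callable, Iterable, Optional, Set


def unknown_words(text: str, dictionary: Set[str]) -> Set[str]:
    """Return the words in `text` that are not in the dictionary."""
    words = re.findall(r"[a-z']+", text.lower())
    return {word for word in words if word not in dictionary}


def ocr_with_fallback(image: object,
                      engines: Iterable[Callable[[object], str]],
                      dictionary: Set[str]) -> str:
    """Try each engine in order and return the first result without unknown
    words, or else the result with the fewest unknown words."""
    best_text: Optional[str] = None
    best_unknown: Optional[int] = None
    for engine in engines:
        text = engine(image)
        unknown = len(unknown_words(text, dictionary))
        if unknown == 0:
            return text
        if best_unknown is None or unknown < best_unknown:
            best_text, best_unknown = text, unknown
    if best_text is None:
        raise ValueError("no OCR engine was provided")
    return best_text


# Stand-ins for a primary and a fallback OCR engine (hypothetical outputs)
primary = lambda image: "pointing toalistener"
fallback = lambda image: "pointing to a listener"
words = {"pointing", "to", "a", "listener"}
print(ocr_with_fallback(None, [primary, fallback], words))
```

The cost of this approach is a second OCR pass for every entry that contains a word missing from the dictionary, such as character names.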
Here's the output from the debug command
I tweaked pgsrip to increase accuracy:
- Added some border to the png image
- Increased gaps between subtitle entries in png image
- Switched tesseract from PSM 11 to PSM 6, since the text and font is uniform
Releasing a new version with it
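For anyone who wants to compare the two page segmentation modes on a dumped PNG, a tesseract call can be assembled as in the sketch below. In tesseract's documentation PSM 6 assumes a single uniform block of text, while PSM 11 looks for sparse text in no particular order. The helper and the file name are hypothetical, and this does not show how pgsrip itself invokes tesseract.

```python
from typing import List


def build_tesseract_command(image_path: str, language: str = "eng",
                            psm: int = 6) -> List[str]:
    """Assemble a tesseract CLI call that prints the recognized text.

    Hypothetical helper; the `--psm` spelling applies to tesseract 4 and 5.
    """
    # "stdout" as the output base makes tesseract write the text to stdout
    return ["tesseract", image_path, "stdout",
            "-l", language, "--psm", str(psm)]


print(" ".join(build_tesseract_command("sheet.png", psm=6)))
print(" ".join(build_tesseract_command("sheet.png", psm=11)))
```

Running both commands against the same PNG from the --keep-temp-files folder gives a direct view of how much the segmentation mode alone changes the output.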