ratoaq2 / pgsrip

Rip your PGS subtitles

License: MIT License

Dockerfile 4.39% Python 95.50% Shell 0.11%

pgsrip's Issues

Error during rip: "ValueError: invalid literal for int() with base 16"

I'm getting this error for episodes I've ripped from a few particular series. In most cases I've had no problem, but these appear to contain something unexpected. Maybe there is an edge case which needs to be handled? I'll see whether I can use MKVToolNix to rip the subtitle tracks.

Thank you for a great tool!

Traceback (most recent call last):
  File ".venv/lib/python3.10/site-packages/pgsrip/ripper.py", line 128, in __init__
    max_height = max([item.height for item in self.pgs.items]) // 2
  File ".venv/lib/python3.10/site-packages/pgsrip/media.py", line 161, in items
    self._items = self.decode(data, self.media_path)
  File ".venv/lib/python3.10/site-packages/pgsrip/media.py", line 184, in decode
    return PgsSubtitleItem.create_items(media_path, display_sets)
  File ".venv/lib/python3.10/site-packages/pgsrip/media.py", line 45, in create_items
    candidates.append(PgsSubtitleItem(index, media_path, current_sets))
  File ".venv/lib/python3.10/site-packages/pgsrip/media.py", line 33, in __init__
    self.x_offset = min([ds.wds.x_offset for ds in display_sets] or [None])
  File ".venv/lib/python3.10/site-packages/pgsrip/media.py", line 33, in <listcomp>
    self.x_offset = min([ds.wds.x_offset for ds in display_sets] or [None])
  File ".venv/lib/python3.10/site-packages/pgsrip/pgs.py", line 278, in x_offset
    return from_hex(self.data[2:4])
  File ".venv/lib/python3.10/site-packages/pgsrip/utils.py", line 7, in from_hex
    return int(b.hex(), base=16)
ValueError: invalid literal for int() with base 16: ''
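The last two frames of the traceback show the failure mode: `self.data[2:4]` yields an empty byte string when the segment payload is shorter than four bytes, and `int('', base=16)` raises. A minimal sketch reproducing this follows. `from_hex` is copied from the traceback; `from_hex_or_default` is a hypothetical helper that is not part of pgsrip, and whether a truncated segment should map to a default value or be skipped entirely is an open question.

```python
from typing import Optional


def from_hex(b: bytes) -> int:
    # Same expression as pgsrip/utils.py line 7 in the traceback above.
    return int(b.hex(), base=16)


def from_hex_or_default(b: bytes, default: Optional[int] = None) -> Optional[int]:
    # Hypothetical helper: return `default` for an empty slice instead of raising.
    if not b:
        return default
    return int.from_bytes(b, byteorder="big")


short_payload = b"\x01"  # too short: bytes 2..4 do not exist
error = None
try:
    from_hex(short_payload[2:4])  # slicing past the end gives b''
except ValueError as exc:
    error = str(exc)

print(error)
print(from_hex_or_default(short_payload[2:4]))
```

Slicing past the end of a `bytes` object never raises; it silently returns `b''`, which is why the error only surfaces inside `int()`.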

Zero byte output file and pure white png

Hi! When using pgsrip with a BD3D, Walk, the output files are 0 bytes. After a bit of digging, it appears the PgsImage class is creating an object in which every value is 255. The rle_data appears to contain data (I don't know how to interpret it). The PGS tracks in the MKV appear fine, since the subtitles work in a player and in SubtitleEdit.

I'm stuck on troubleshooting at this point. I would appreciate any help! I use pgsrip in my 3D Blu-ray to MV-HEVC (Apple Vision Pro) converter and would love to figure out why this one (and possibly others) does not work. This issue happens across different machines and with a clean macOS install. This is running on Python 3.12 on the latest macOS.

Thank you!

Python script usage

Hey,
Is there any intention (or current ability) to use pgsrip inline in a Python script?

I have a script that extracts sup files from an MKV ingest, and I now want to convert those sup files to srt, which it appears this tool can do.

Let me know :)
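A minimal sketch of calling pgsrip from a script, based on the usage pattern shown in the project README as I recall it (`Sup`, `Options`, and `pgsrip.rip`, with `Language` from babelfish). Treat the exact names and keyword arguments as assumptions to verify against the installed version. The wrapper function `rip_sup_to_srt` is a hypothetical name introduced here.

```python
def rip_sup_to_srt(sup_path: str, language: str = "eng") -> None:
    """Run pgsrip on a single .sup file (sketch; needs pgsrip, babelfish and tesseract)."""
    # Imported lazily so this module still loads when the dependencies are missing.
    from babelfish import Language
    from pgsrip import Options, Sup, pgsrip

    media = Sup(sup_path)
    # Assumed options, mirroring the README example; adjust to taste.
    options = Options(languages={Language(language)}, overwrite=True, one_per_lang=False)
    pgsrip.rip(media, options)


# Example call (commented out because it needs a real file and an OCR setup):
# rip_sup_to_srt("/subtitle/path/mymedia.en.sup")
```

For MKV input, the README pattern is the same with `Mkv` in place of `Sup`, again assuming the API has not changed.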

Tesseract uses a maximum of only 4 threads

No matter what I set max workers to, the tesseract process never uses more than 4 threads.
The help dialog, however, says that 1 to 50 threads should be supported.

Corrupted <PgsSubtitleItem>: Found 65508 bytes for image, but x were expected

I seem to be getting this warning on a lot of separate video files, and the affected subtitle items usually aren't OCR'd correctly at the bottom or sometimes at the right of the image, as if a portion of the image was cut off.

I used --keep-temp-files to check the images, and they're indeed not being displayed correctly.

Playing the video with subtitles on a player like VLC seems to display the subtitles correctly without issues though.

Specify output folder

Is it possible to specify the output folder of the srts from the API? I couldn't figure out the correct way to do it, if there is one.

Add black/whitelist support

Hello, thanks for creating this package!

A feature I'm missing is the ability to blacklist certain characters (seeing as tesseract keeps catching I as |). Scouring through the code, I could not find any options for this. Having the ability to pass a string of disallowed characters would be nice.

Similarly, whitelist support would also be welcome for similar reasons. However, if only one or the other can be supported, I think it's best to support blacklists instead (or at least to prioritise the blacklist if both are passed).

If there's currently no development time being put into this for a little while, I can also try to create a PR if you'd like me to.

Thanks again!
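A minimal sketch of how such an option could be turned into tesseract command-line arguments. It relies on tesseract's `-c VAR=VALUE` option and its `tessedit_char_blacklist` / `tessedit_char_whitelist` variables. The function name `char_filter_args` and the way it would be wired into pgsrip are hypothetical; this is not pgsrip's implementation.

```python
from typing import List, Optional


def char_filter_args(blacklist: Optional[str] = None,
                     whitelist: Optional[str] = None) -> List[str]:
    """Build extra tesseract CLI arguments for character filtering."""
    if blacklist:
        # The blacklist wins when both are given, as the request suggests.
        return ["-c", f"tessedit_char_blacklist={blacklist}"]
    if whitelist:
        return ["-c", f"tessedit_char_whitelist={whitelist}"]
    return []


# Illustrative command line; the image name is a placeholder.
command = ["tesseract", "subtitle.png", "stdout", "-l", "eng"]
command += char_filter_args(blacklist="|")
print(command)
```

Building the command as a list rather than a single string avoids shell-quoting problems with characters such as `|`.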

Support other subtitle formats like VobSub

Your program works exceptionally well, so I hope to be able to use it with VobSubs as well.

A flag could perhaps include VobSubs too and convert them to PGS with something like TheGreatMcPain/BDSup2SubPlusPlus.

Perhaps plain VobSub text recognition with Tesseract is not good enough and needs additional post-processing.

I'll try to automate the whole thing with your API and hope the results are reasonably okay.

Tesseract version?

Thanks for the great work on this. I was wondering what version of Tesseract this uses?

Reason is I tested pgsrip via CLI against SubtitleEdit which uses Tesseract 5.2, and the latter seems to be more accurate with fewer errors, but I suppose that could also just be due to the config of tesseract in pgsrip.

Appreciate the help. Thanks.

FileNotFoundError

Hello, here's my problem, I noticed that with some of my subtitle files, pgsrip (latest version) returned this error:

INFO:pgsrip:Tesseract version: 5.3.0.20221222
INFO:pgsrip:Tesseract data: None
DEBUG:pgsrip.media_path:The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fr.sup is using temporary folder C:\Users\Paul\AppData\Local\Temp\The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fr.supkj831ll7.pgsrip
1 PGS subtitle collected from 1 file
DEBUG:pgsrip.media:Removing temporary files in C:\Users\Paul\AppData\Local\Temp\The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fr.supkj831ll7.pgsrip
WARNING:pgsrip.core:Error while trying to rip The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fr.sup: <FileNotFoundError> [[Errno 2] No such file or directory: 'The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fr.sup']
Traceback (most recent call last):
  File "C:\Users\Paul\AppData\Local\Programs\Python\Python39\lib\site-packages\pgsrip\core.py", line 69, in rip_pgs
    srt = PgsToSrtRipper(p, options).rip(lambda t: rules.apply(t, '')[0])
  File "C:\Users\Paul\AppData\Local\Programs\Python\Python39\lib\site-packages\pgsrip\ripper.py", line 128, in __init__
    max_height = max([item.height for item in self.pgs.items]) // 2
  File "C:\Users\Paul\AppData\Local\Programs\Python\Python39\lib\site-packages\pgsrip\media.py", line 160, in items
    data = self.data_reader()
  File "C:\Users\Paul\AppData\Local\Programs\Python\Python39\lib\site-packages\pgsrip\media_path.py", line 44, in get_data
    with open(str(self), 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fr.sup'
0 PGS subtitle ripped from 1 file

I enclose the subtitles used for the example.
The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fre.zip

Filtering by track name

It is common to find videos that contain multiple subtitles for the same language (for instance, SDH and non-SDH subtitles).

In order to make this flexible, it would be useful to have something that could apply a regex pattern on the track_name property, which is being extracted from mkv tracks.
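A minimal sketch of the requested behaviour: keep or drop tracks according to a regex applied to `track_name`. The `Track` dataclass and `filter_tracks` function are hypothetical stand-ins for illustration, not pgsrip types.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Track:
    # Hypothetical stand-in for an MKV subtitle track.
    track_id: int
    language: str
    track_name: Optional[str] = None


def filter_tracks(tracks: Iterable[Track],
                  include: Optional[str] = None,
                  exclude: Optional[str] = None) -> List[Track]:
    """Keep tracks whose name matches `include` and does not match `exclude`."""
    include_re = re.compile(include, re.IGNORECASE) if include else None
    exclude_re = re.compile(exclude, re.IGNORECASE) if exclude else None
    kept = []
    for track in tracks:
        name = track.track_name or ""  # unnamed tracks are treated as an empty name
        if include_re and not include_re.search(name):
            continue
        if exclude_re and exclude_re.search(name):
            continue
        kept.append(track)
    return kept


tracks = [
    Track(2, "eng", "English"),
    Track(3, "eng", "English (SDH)"),
    Track(4, "fre", None),
]
non_sdh = filter_tracks(tracks, exclude=r"\bSDH\b")
print([t.track_id for t in non_sdh])
```

Treating a missing track name as an empty string means unnamed tracks pass an `exclude` filter but fail an `include` filter; a real implementation might make that configurable.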

Error: invalid literal for int()

Hi,

I have an mkv file with 2 PGS tracks (eng & fre). When trying to convert the subtitles, I get the following error (for both tracks):

Error while trying to rip <mkv / or sup file>: [invalid literal for int() with base 10: '95.936287']
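The message is the one Python produces when `int()` is handed a decimal string. The report does not say which field carried the value `'95.936287'`, so the sketch below only reproduces the error and shows one tolerant way to parse such a value. `to_int` is a hypothetical helper, not pgsrip code.

```python
def to_int(value: str) -> int:
    """Parse an integer, tolerating a decimal string by truncating toward zero."""
    try:
        return int(value)
    except ValueError:
        # int('95.936287') raises, but float() accepts it; int() then truncates.
        return int(float(value))


error = None
try:
    int("95.936287")
except ValueError as exc:
    error = str(exc)

print(error)
print(to_int("95.936287"))
```

Whether truncating or rounding is appropriate depends on what the field represents, which is not known from the report.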

Difference in reliability w/ SubtitleEdit+Tesseract?

Hi!

Really excited to see this tool - fits amazingly as a tdarr plugin too!

I noticed while parsing my first English PGS (1h15min) that it detected 530 strings, whereas SubtitleEdit + Tesseract 5.3.0 detected 783 strings (and had great accuracy on them). I felt a bit surprised considering that both use Tesseract 5 and the performance should theoretically be really good regardless of the dataset given it's just working with bare English, black on white, straight. I noted that when subtitles are missed, the previous subtitles would stick around for a long time.

Do you have any ideas from your experience?

Problems with stylized PGS subs on some files

I've had very good results with most of the files I've converted, but I have noticed the OCR seems to be particularly bad in some situations. I've narrowed this down to the specific way that the PGS subtitles are styled and how they are processed.

With a particularly styled file, the OCR is really inaccurate. I used the --keep-temp-files to see the files, and it looks like the text is inverted and placed on a black background, but the way these particular subtitles are formatted, they show up as a mostly black file.

Here is a normal file:

and here is an example of the issue:

The second example has a border around the font which seems to be the cause of the issues.

version for Python 3?

On a Debian machine with "python -V" and "python3 -V" both reporting Python 3.9.2, the command "pip install pgsrip" produces the error "bash: cd: too many arguments." I can install with "sudo pip install pgsrip," although this produces warnings that running pip as root user can result in broken permissions. After thus installing pgsrip, the command "pgsrip file.mks" gives the error:

  File "/usr/local/bin/pgsrip", line 5, in <module>
    from pgsrip.cli import pgsrip
ModuleNotFoundError: No module named 'pgsrip.cli'

The command "sudo pgsrip file.mks" does work, although it is necessary to run chown on the output file, and this isn't really what one would prefer.

Is there a simple way to install pgsrip from source with Python 3.9.2?
