pgsrip's Issues

Error during rip: "ValueError: invalid literal for int() with base 16"

I am getting this error for episodes I've ripped from a few particular series. In most cases I've had no problem, but these files appear to contain something unexpected. Maybe there is an edge case which needs to be handled? I'll see whether I can use MKVToolNix to rip the subtitle tracks.

Thank you for a great tool!

Traceback (most recent call last):
  File ".venv/lib/python3.10/site-packages/pgsrip/ripper.py", line 128, in __init__
    max_height = max([item.height for item in self.pgs.items]) // 2
  File ".venv/lib/python3.10/site-packages/pgsrip/media.py", line 161, in items
    self._items = self.decode(data, self.media_path)
  File ".venv/lib/python3.10/site-packages/pgsrip/media.py", line 184, in decode
    return PgsSubtitleItem.create_items(media_path, display_sets)
  File ".venv/lib/python3.10/site-packages/pgsrip/media.py", line 45, in create_items
    candidates.append(PgsSubtitleItem(index, media_path, current_sets))
  File ".venv/lib/python3.10/site-packages/pgsrip/media.py", line 33, in __init__
    self.x_offset = min([ds.wds.x_offset for ds in display_sets] or [None])
  File ".venv/lib/python3.10/site-packages/pgsrip/media.py", line 33, in <listcomp>
    self.x_offset = min([ds.wds.x_offset for ds in display_sets] or [None])
  File ".venv/lib/python3.10/site-packages/pgsrip/pgs.py", line 278, in x_offset
    return from_hex(self.data[2:4])
  File ".venv/lib/python3.10/site-packages/pgsrip/utils.py", line 7, in from_hex
    return int(b.hex(), base=16)
ValueError: invalid literal for int() with base 16: ''
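
The last frame of the traceback can be reproduced in isolation: `bytes.hex()` of an empty slice is an empty string, and `int('', base=16)` raises exactly this error. The sketch below assumes the segment data was shorter than expected, so that `self.data[2:4]` came back empty; that cause is a guess from the traceback, not something the report confirms. `from_hex_or_none` is a hypothetical helper and not part of pgsrip.

```python
from typing import Optional


def from_hex(b: bytes) -> int:
    # Same expression as the last frame of the traceback above.
    return int(b.hex(), base=16)


def from_hex_or_none(b: bytes) -> Optional[int]:
    # Hypothetical defensive variant: an empty slice yields None instead of raising.
    return int.from_bytes(b, byteorder="big") if b else None


short_segment = b"\x01"      # assumed: data too short to contain bytes 2..4
field = short_segment[2:4]   # slicing past the end silently gives b""

try:
    from_hex(field)
    raised = False
except ValueError:
    raised = True            # int('', base=16) is rejected
```

Returning `None` fits the `or [None]` fallback visible in the traceback only if callers also tolerate `None` inside the list passed to `min()`, so a real fix would probably need to skip such display sets instead.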

Zero byte output file and pure white png

Hi! When using pgsrip with a BD3D, Walk, the output files are 0 bytes. After a bit of digging, it appears the PgsImage class is creating an image in which every value is 255 (pure white). The rle_data appears to contain data, although I don't know how to interpret it. The PGS track in the MKV appears fine, since the subtitles work in a player and in SubtitleEdit.

I'm stuck on troubleshooting at this point. I would appreciate any help! I use pgsrip in my 3D Blu-ray to MV-HEVC (Apple Vision Pro) converter and would love to figure out why this one (and possibly others) do not work. This issue happens across different machines and with a clean macOS install. This is running on Python 3.12 on the latest macOS.

Thank you!

Python script usage

Hey,
Is there any intention (or current ability) to use pgsrip inline in a Python script?

I have a script that pulls and extracts out sup files from an MKV ingest, but now want to convert said sup file to srt which it appears this can do.

Let me know :)

Tesseract uses a maximum of only 4 threads

No matter what I set max workers to, the tesseract process never uses more than 4 threads.
The help text, however, says that 1 to 50 threads should be supported.

Corrupted <PgsSubtitleItem>: Found 65508 bytes for image, but x were expected

I seem to be getting this warning on a lot of separate video files, and the affected subtitle items usually aren't OCR'd correctly at the bottom or sometimes at the right of the image, as if a portion of the image was cut off.

I used --keep-temp-files to check the images, and they're indeed not being displayed correctly:
[image]

Playing the video with subtitles on a player like VLC seems to display the subtitles correctly without issues though.

Specify output folder

Is it possible to specify the output folder for the SRT files from the API? I couldn't figure out the correct way to do it, if it is possible at all.

Add black/whitelist support

Hello, thanks for creating this package!

A feature I'm missing is the ability to blacklist certain characters (seeing as tesseract keeps catching I as |). Scouring through the code, I could not find any options for this. Having the ability to pass a string of disallowed characters would be nice.

Similarly, whitelist support would also be welcome for similar reasons. However, if only one or the other can be supported, I think it's best to support blacklists instead (or at least to prioritise the blacklist if both are passed).
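
For reference, Tesseract itself exposes `tessedit_char_blacklist` and `tessedit_char_whitelist` as config variables that can be set with `-c`. The sketch below shows how such an option could be turned into command-line arguments, with the blacklist taking priority as suggested above. `build_tesseract_args` and the file names are hypothetical, and whether pgsrip could pass these arguments through is an assumption.

```python
from typing import List, Optional


def build_tesseract_args(image: str, output_base: str,
                         blacklist: Optional[str] = None,
                         whitelist: Optional[str] = None) -> List[str]:
    """Build a tesseract command line; the blacklist wins if both are given."""
    args = ["tesseract", image, output_base]
    if blacklist:
        args += ["-c", f"tessedit_char_blacklist={blacklist}"]
    elif whitelist:
        args += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return args


# Disallow the pipe character that keeps being read in place of "I".
args = build_tesseract_args("line.png", "line", blacklist="|")
```

The resulting list can be handed to `subprocess.run(args)` without a shell, so characters such as `|` need no extra quoting.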

If there's currently no development time being put into this for a little while, I can also try to create a PR if you'd like me to.

Thanks again!

Support other subtitle formats like VobSub

Your program works exceptionally well, so I hope to be able to use it with VobSubs as well.

A flag could perhaps include VobSubs too and convert them to PGS with something like TheGreatMcPain/BDSup2SubPlusPlus.

Perhaps plain VobSub text recognition with Tesseract is not good enough and needs additional post-processing.

I'll try to automate the whole thing with your API and hope the results are reasonably okay.

Tesseract version?

Thanks for the great work on this. I was wondering what version of Tesseract this uses?

Reason is I tested pgsrip via CLI against SubtitleEdit which uses Tesseract 5.2, and the latter seems to be more accurate with fewer errors, but I suppose that could also just be due to the config of tesseract in pgsrip.

Appreciate the help. Thanks.

FileNotFoundError

Hello, here's my problem: with some of my subtitle files, pgsrip (latest version) returns this error:

INFO:pgsrip:Tesseract version: 5.3.0.20221222
INFO:pgsrip:Tesseract data: None
DEBUG:pgsrip.media_path:The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fr.sup is using temporary folder C:\Users\Paul\AppData\Local\Temp\The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fr.supkj831ll7.pgsrip
1 PGS subtitle collected from 1 file
DEBUG:pgsrip.media:Removing temporary files in C:\Users\Paul\AppData\Local\Temp\The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fr.supkj831ll7.pgsrip
WARNING:pgsrip.core:Error while trying to rip The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fr.sup: <FileNotFoundError> [[Errno 2] No such file or directory: 'The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fr.sup']
Traceback (most recent call last):
  File "C:\Users\Paul\AppData\Local\Programs\Python\Python39\lib\site-packages\pgsrip\core.py", line 69, in rip_pgs
    srt = PgsToSrtRipper(p, options).rip(lambda t: rules.apply(t, '')[0])
  File "C:\Users\Paul\AppData\Local\Programs\Python\Python39\lib\site-packages\pgsrip\ripper.py", line 128, in __init__
    max_height = max([item.height for item in self.pgs.items]) // 2
  File "C:\Users\Paul\AppData\Local\Programs\Python\Python39\lib\site-packages\pgsrip\media.py", line 160, in items
    data = self.data_reader()
  File "C:\Users\Paul\AppData\Local\Programs\Python\Python39\lib\site-packages\pgsrip\media_path.py", line 44, in get_data
    with open(str(self), 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fr.sup'
0 PGS subtitle ripped from 1 file

I enclose the subtitles used for the example.
The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fre.zip

Filtering by track name

It is common to find videos that contain multiple subtitles for the same language (for instance, SDH and non-SDH subtitles).

In order to make this flexible, it would be useful to have something that could apply a regex pattern on the track_name property, which is being extracted from mkv tracks.
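
A rough sketch of what such a filter could look like, in plain Python. The track dictionaries, the `select_tracks` helper and its `exclude` parameter are hypothetical and only illustrate the idea; they are not pgsrip's API.

```python
import re
from typing import Dict, Iterable, List, Optional

Track = Dict[str, Optional[str]]


def select_tracks(tracks: Iterable[Track], pattern: str,
                  exclude: bool = False) -> List[Track]:
    """Keep tracks whose track_name matches pattern, or drop them if exclude=True."""
    regex = re.compile(pattern, re.IGNORECASE)
    selected = []
    for track in tracks:
        # A missing track name is treated as an empty string.
        matched = regex.search(track.get("track_name") or "") is not None
        if matched != exclude:
            selected.append(track)
    return selected


tracks: List[Track] = [
    {"language": "eng", "track_name": "English"},
    {"language": "eng", "track_name": "English (SDH)"},
    {"language": "fre", "track_name": None},
]

# Drop the SDH variant and keep everything else.
non_sdh = select_tracks(tracks, r"\bSDH\b", exclude=True)
```

Treating an unnamed track as an empty string means it never matches an include pattern and always survives an exclude pattern, which seems like the least surprising default.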

Error: invalid literal for int()

Hi,

I have an MKV file with 2 PGS tracks (eng & fre). When trying to convert the subtitles, I get the following error (for both tracks):

Error while trying to rip <mkv / or sup file>: [invalid literal for int() with base 10: '95.936287']
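
The message is what Python raises when `int()` receives a decimal string. Where the value `'95.936287'` comes from is not shown in the report, so the sketch below only illustrates the failure and one lenient way to parse such a field; `parse_int_lenient` is a hypothetical helper, not part of pgsrip.

```python
def parse_int_lenient(value: str) -> int:
    """Parse an integer, falling back to truncating a decimal string."""
    try:
        return int(value)
    except ValueError:
        # int('95.936287') fails, but float() accepts it; int() then truncates.
        return int(float(value))


try:
    int("95.936287")
    raised = False
except ValueError:
    raised = True

parsed = parse_int_lenient("95.936287")
```

Truncation is only one choice here; if the field is a duration or a frame rate, rounding or keeping the float may be more appropriate.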

Difference in reliability w/ SubtitleEdit+Tesseract?

Hi!

Really excited to see this tool - fits amazingly as a tdarr plugin too!

I noticed while parsing my first English PGS (1h15min) that it detected 530 strings, whereas SubtitleEdit + Tesseract 5.3.0 detected 783 strings (and had great accuracy on them). I felt a bit surprised considering that both use Tesseract 5 and the performance should theoretically be really good regardless of the dataset given it's just working with bare English, black on white, straight. I noted that when subtitles are missed, the previous subtitles would stick around for a long time.

Do you have any ideas from your experience?

Problems with stylized PGS subs on some files

I've had very good results with most of the files I've converted, but I have noticed the OCR seems to be particularly bad in some situations. I've narrowed this down to the specific way that the PGS subtitles are styled and how they are processed.

With one particularly styled file, the OCR is really inaccurate. I used --keep-temp-files to inspect the intermediate images, and it looks like the text is inverted and placed on a black background. Because of the way these particular subtitles are formatted, the result is a mostly black image.

Here is a normal file:

[image: english srt-1539-psm6-NEURAL-65]

and here is an example of the issue:

[image: Blue Collar example]

The second example has a border around the font which seems to be the cause of the issues.

version for Python 3?

On a Debian machine with "python -V" and "python3 -V" both reporting Python 3.9.2, the command "pip install pgsrip" produces the error "bash: cd: too many arguments." I can install with "sudo pip install pgsrip," although this produces warnings that running pip as root user can result in broken permissions. After thus installing pgsrip, the command "pgsrip file.mks" gives the error:

  File "/usr/local/bin/pgsrip", line 5, in <module>
    from pgsrip.cli import pgsrip
ModuleNotFoundError: No module named 'pgsrip.cli'

The command "sudo pgsrip file.mks" does work, although it is necessary to run chown on the output file, and this isn't really what one would prefer.

Is there a simple way to install pgsrip from source with Python 3.9.2?
