ratoaq2 / pgsrip
Rip your PGS subtitles
License: MIT License
I'm getting this error for episodes I've ripped from a few particular series. In most cases I've had no problem, but these appear to contain something unexpected. Maybe there is an edge case that needs to be handled? I'll see whether I can use MKVToolNix to rip the subtitle tracks.
Thank you for a great tool!
Traceback (most recent call last):
  File ".venv/lib/python3.10/site-packages/pgsrip/ripper.py", line 128, in __init__
    max_height = max([item.height for item in self.pgs.items]) // 2
  File ".venv/lib/python3.10/site-packages/pgsrip/media.py", line 161, in items
    self._items = self.decode(data, self.media_path)
  File ".venv/lib/python3.10/site-packages/pgsrip/media.py", line 184, in decode
    return PgsSubtitleItem.create_items(media_path, display_sets)
  File ".venv/lib/python3.10/site-packages/pgsrip/media.py", line 45, in create_items
    candidates.append(PgsSubtitleItem(index, media_path, current_sets))
  File ".venv/lib/python3.10/site-packages/pgsrip/media.py", line 33, in __init__
    self.x_offset = min([ds.wds.x_offset for ds in display_sets] or [None])
  File ".venv/lib/python3.10/site-packages/pgsrip/media.py", line 33, in <listcomp>
    self.x_offset = min([ds.wds.x_offset for ds in display_sets] or [None])
  File ".venv/lib/python3.10/site-packages/pgsrip/pgs.py", line 278, in x_offset
    return from_hex(self.data[2:4])
  File ".venv/lib/python3.10/site-packages/pgsrip/utils.py", line 7, in from_hex
    return int(b.hex(), base=16)
ValueError: invalid literal for int() with base 16: ''
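The last frame of the traceback shows from_hex evaluating int(b.hex(), base=16). That expression raises exactly this ValueError when the byte slice is empty, which would be the case if the segment data is shorter than four bytes. A minimal sketch reproducing the failure follows; safe_from_hex is a hypothetical guard added for illustration and is not part of pgsrip.

```python
from typing import Optional


def from_hex(b: bytes) -> int:
    # Same expression as the one shown in the traceback (pgsrip/utils.py).
    return int(b.hex(), base=16)


def safe_from_hex(b: bytes) -> Optional[int]:
    # Hypothetical guard: return None instead of raising on an empty slice.
    if not b:
        return None
    return from_hex(b)


data = b""  # stand-in for a segment that carries no payload
message = ""
try:
    from_hex(data[2:4])  # slicing empty data yields b"", so int("", 16) fails
except ValueError as exc:
    message = str(exc)

print(message)
print(safe_from_hex(data[2:4]))
print(safe_from_hex(b"\x01\x2c"))
```

Whether returning None is the right behaviour for such a segment is a design question for the library; the sketch only shows where the exception comes from.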
Hi! When using pgsrip with a BD3D, Walk, the output files are 0 bytes. After a bit of digging, it appears the PgsImage class is creating an object in which every value is 255. The rle_data appears to contain data (I don't know how to interpret it). The PGS tracks in the MKV appear fine, since the subtitles work in a player and in SubtitleEdit.
I'm stuck on troubleshooting at this point. I would appreciate any help! I use pgsrip in my 3D Blu-ray to MV-HEVC (Apple Vision Pro) converter and would love to figure out why this one (and possibly others) does not work. The issue happens across different machines and with a clean macOS install. This is running on Python 3.12 on the latest macOS.
Thank you!
Hey,
Is there any intention (or current ability) to use pgsrip inline in a Python script?
I have a script that extracts sup files from an MKV ingest, and I now want to convert those sup files to srt, which it appears this tool can do.
Let me know :)
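One way to call pgsrip from a script without relying on any particular Python API is to run the command-line tool as a subprocess. This is a minimal sketch under assumptions: the pgsrip executable is on PATH, it accepts a single file argument (the form "pgsrip file.mks" appears elsewhere on this page), and the srt is written next to the input with the extension swapped. The function names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def build_pgsrip_command(sup_path: Path) -> List[str]:
    # Mirrors the CLI form `pgsrip <file>`; no extra flags are assumed.
    return ["pgsrip", str(sup_path)]


def rip_to_srt(sup_path: Path) -> Path:
    # Runs the CLI and raises CalledProcessError on a non-zero exit code.
    subprocess.run(build_pgsrip_command(sup_path), check=True)
    # Assumption: the output lands beside the input as <name>.srt.
    return sup_path.with_suffix(".srt")


command = build_pgsrip_command(Path("movie.en.sup"))
expected_output = Path("movie.en.sup").with_suffix(".srt")
print(command)
print(expected_output)
```

Calling rip_to_srt(Path("movie.en.sup")) would then run the conversion; the returned path should be checked with exists() before use, since the output location is an assumption here.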
No matter what I set max workers to, more than 4 threads are not used by the tesseract process.
In the help dialog, however, it says that 1 to 50 threads should be supported.
I seem to be getting this warning on a lot of separate video files, and the affected subtitle items usually aren't OCR'd correctly at the bottom or sometimes at the right of the image, as if a portion of the image was cut off.
I used --keep-temp-files to check the images, and they're indeed not being displayed correctly:
Playing the video with subtitles on a player like VLC seems to display the subtitles correctly without issues though.
Is it possible to specify the output folder of the srts from the API? I couldn't figure out the correct way to do it if possible.
Hello, thanks for creating this package!
A feature I'm missing is the ability to blacklist certain characters (seeing as tesseract keeps catching I as |). Scouring through the code, I could not find any options for this. Having the ability to pass a string of disallowed characters would be nice.
Similarly, whitelist support would also be welcome for similar reasons. However, if only one or the other can be supported, I think it's best to support blacklists instead (or at least to prioritise the blacklist if both are passed).
If there's currently no development time being put into this for a little while, I can also try to create a PR if you'd like me to.
Thanks again!
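The blacklist request above maps onto Tesseract's own tessedit_char_blacklist and tessedit_char_whitelist configuration variables, which can be set with the -c option of the tesseract command. A minimal sketch of how such an option could be turned into command-line arguments follows. The function name and parameters are hypothetical, and how pgsrip actually invokes Tesseract is not shown on this page, so this is only an illustration of the idea, including the suggested rule that the blacklist wins when both are given.

```python
from typing import List, Optional


def build_tesseract_args(
    image: str,
    output_base: str,
    lang: str = "eng",
    blacklist: Optional[str] = None,
    whitelist: Optional[str] = None,
) -> List[str]:
    # Base invocation: tesseract <image> <outputbase> -l <lang>
    args = ["tesseract", image, output_base, "-l", lang]
    if blacklist:
        # Blacklist takes priority when both options are supplied.
        args += ["-c", f"tessedit_char_blacklist={blacklist}"]
    elif whitelist:
        args += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return args


only_blacklist = build_tesseract_args("line.png", "stdout", blacklist="|")
both_given = build_tesseract_args("line.png", "stdout", blacklist="|", whitelist="abc")
print(only_blacklist)
print(both_given)
```

Passing the list to subprocess.run avoids shell quoting problems with characters such as | in the blacklist.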
Your program works exceptionally well, so I hope to be able to use it with VobSubs as well.
A flag could perhaps include VobSubs too and convert them to PGS with something like TheGreatMcPain/BDSup2SubPlusPlus.
Perhaps plain VobSub text recognition with Tesseract is not good enough and needs additional post-processing.
I'll try to automate the whole thing with your API and hope the results are reasonably okay.
Thanks for the great work on this. I was wondering what version of Tesseract this uses?
The reason I ask is that I tested pgsrip via the CLI against SubtitleEdit, which uses Tesseract 5.2, and the latter seems to be more accurate with fewer errors, but I suppose that could also just be due to the Tesseract configuration in pgsrip.
Appreciate the help. Thanks.
Hello, here's my problem: I noticed that with some of my subtitle files, pgsrip (latest version) returns this error:
INFO:pgsrip:Tesseract version: 5.3.0.20221222
INFO:pgsrip:Tesseract data: None
DEBUG:pgsrip.media_path:The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fr.sup is using temporary folder C:\Users\Paul\AppData\Local\Temp\The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fr.supkj831ll7.pgsrip
1 PGS subtitle collected from 1 file
DEBUG:pgsrip.media:Removing temporary files in C:\Users\Paul\AppData\Local\Temp\The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fr.supkj831ll7.pgsrip
WARNING:pgsrip.core:Error while trying to rip The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fr.sup: <FileNotFoundError> [[Errno 2] No such file or directory: 'The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fr.sup']
Traceback (most recent call last):
  File "C:\Users\Paul\AppData\Local\Programs\Python\Python39\lib\site-packages\pgsrip\core.py", line 69, in rip_pgs
    srt = PgsToSrtRipper(p, options).rip(lambda t: rules.apply(t, '')[0])
  File "C:\Users\Paul\AppData\Local\Programs\Python\Python39\lib\site-packages\pgsrip\ripper.py", line 128, in __init__
    max_height = max([item.height for item in self.pgs.items]) // 2
  File "C:\Users\Paul\AppData\Local\Programs\Python\Python39\lib\site-packages\pgsrip\media.py", line 160, in items
    data = self.data_reader()
  File "C:\Users\Paul\AppData\Local\Programs\Python\Python39\lib\site-packages\pgsrip\media_path.py", line 44, in get_data
    with open(str(self), 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fr.sup'
0 PGS subtitle ripped from 1 file
I enclose the subtitles used for the example.
The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fre.zip
It is common to find videos that contain multiple subtitles for the same language (for instance, SDH and non-SDH subtitles).
To make this flexible, it would be useful to have something that could apply a regex pattern to the track_name property, which is extracted from the mkv tracks.
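The proposal above could look roughly like the following sketch. The Track class and select_tracks function are hypothetical stand-ins for whatever pgsrip uses internally; only the idea of matching a regex against track_name comes from the request.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Track:
    # Hypothetical stand-in for the metadata extracted from an mkv track.
    track_id: int
    language: str
    track_name: Optional[str] = None


def select_tracks(
    tracks: List[Track], language: str, name_pattern: Optional[str] = None
) -> List[Track]:
    # Keep tracks in the requested language whose name matches the pattern.
    regex = re.compile(name_pattern, re.IGNORECASE) if name_pattern else None
    selected = []
    for track in tracks:
        if track.language != language:
            continue
        # A missing track name is treated as an empty string.
        if regex and not regex.search(track.track_name or ""):
            continue
        selected.append(track)
    return selected


tracks = [
    Track(2, "eng", "English"),
    Track(3, "eng", "English (SDH)"),
    Track(4, "fre", "Français"),
]
sdh_only = select_tracks(tracks, "eng", r"\bSDH\b")
non_sdh = select_tracks(tracks, "eng", r"^(?!.*SDH)")
print([t.track_id for t in sdh_only])
print([t.track_id for t in non_sdh])
```

A negative lookahead such as ^(?!.*SDH) lets a single pattern option cover both "include" and "exclude" cases, so no separate exclusion flag is needed.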
Hi,
I have an mkv file with 2 PGS tracks (eng & fre). When trying to convert the subtitles, I get the following error (for both tracks):
Error while trying to rip <mkv / or sup file>: [invalid literal for int() with base 10: '95.936287']
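This message is what Python raises when a decimal string is passed directly to int(). Which field supplies the value 95.936287 is not visible from the error, so the following is only a sketch of the failure and of one tolerant way to parse such a value; parse_number is a hypothetical helper, not pgsrip code.

```python
def parse_number(text: str) -> int:
    # Hypothetical helper: accept "95" as well as "95.936287".
    try:
        return int(text)
    except ValueError:
        # Fall back to float parsing; int() then truncates toward zero.
        return int(float(text))


message = ""
try:
    int("95.936287")  # int() does not accept a decimal point
except ValueError as exc:
    message = str(exc)

print(message)
print(parse_number("95.936287"))
print(parse_number("95"))
```

Truncating may not be the right choice if the value is a timestamp or a frame rate; rounding, or keeping the float, could be more appropriate depending on what the field means.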
Hi!
Really excited to see this tool - fits amazingly as a tdarr plugin too!
I noticed while parsing my first English PGS (1h15min) that it detected 530 strings, whereas SubtitleEdit + Tesseract 5.3.0 detected 783 strings (and had great accuracy on them). I felt a bit surprised considering that both use Tesseract 5 and the performance should theoretically be really good regardless of the dataset given it's just working with bare English, black on white, straight. I noted that when subtitles are missed, the previous subtitles would stick around for a long time.
Do you have any ideas from your experience?
I've had very good results with most of the files I've converted, but I have noticed the OCR seems to be particularly bad in some situations. I've narrowed this down to the specific way that the PGS subtitles are styled and how they are processed.
With a particularly styled file, the OCR is really inaccurate. I used the --keep-temp-files option to see the files, and it looks like the text is inverted and placed on a black background, but the way these particular subtitles are formatted, they show up as a mostly black file.
Here is a normal file:
and here is an example of the issue:
The second example has a border around the font which seems to be the cause of the issues.
On a Debian machine with "python -V" and "python3 -V" both reporting Python 3.9.2, the command "pip install pgsrip" produces the error "bash: cd: too many arguments." I can install with "sudo pip install pgsrip," although this produces warnings that running pip as root user can result in broken permissions. After thus installing pgsrip, the command "pgsrip file.mks" gives the error:
  File "/usr/local/bin/pgsrip", line 5, in <module>
    from pgsrip.cli import pgsrip
ModuleNotFoundError: No module named 'pgsrip.cli'
The command "sudo pgsrip file.mks" does work, although it is necessary to run chown on the output file, and this isn't really what one would prefer.
Is there a simple way to install pgsrip from source with Python 3.9.2?