
pgsrip's Introduction

PGSRip

Rip your PGS subtitles.

PGSRip is a command line tool that extracts PGS subtitles and converts them to SRT format. It requires MKVToolNix, tesseract-ocr, and tessdata (https://github.com/tesseract-ocr/tessdata or https://github.com/tesseract-ocr/tessdata_best).

Installation

pgsrip:

$ pip install pgsrip

MKVToolNix:

[Linux/WSL - Ubuntu/Debian]
$ sudo apt-get install mkvtoolnix

[Windows/Chocolatey]
$ choco install mkvtoolnix

tesseract:

The PPA is used to install the latest tesseract 5.x. Skip the PPA repository step if you prefer to stay with the latest official Debian/Ubuntu package.

[Linux/WSL - Ubuntu/Debian]
$ sudo add-apt-repository ppa:alex-p/tesseract-ocr5
$ sudo apt update
$ sudo apt-get install tesseract-ocr

[Windows/Chocolatey]
$ choco install tesseract-ocr

tessdata:

$ git clone https://github.com/tesseract-ocr/tessdata_best.git
$ export TESSDATA_PREFIX=~/tessdata_best
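
To confirm that TESSDATA_PREFIX points at a folder that actually contains trained data, a small Python check can help. This is a minimal sketch and not part of pgsrip: the function name available_tessdata_languages is hypothetical, and it only assumes that language models are stored as <lang>.traineddata files directly under the folder, as they are in the tessdata_best repository.

```python
import os
from pathlib import Path


def available_tessdata_languages(prefix=None):
    """Return the sorted language codes that have a .traineddata file.

    `prefix` defaults to the TESSDATA_PREFIX environment variable.
    An unset variable or a missing folder yields an empty list.
    """
    prefix = prefix or os.environ.get("TESSDATA_PREFIX")
    if not prefix:
        return []
    folder = Path(prefix).expanduser()
    if not folder.is_dir():
        return []
    # "eng.traineddata" -> "eng"
    return sorted(p.stem for p in folder.glob("*.traineddata"))
```

Calling available_tessdata_languages() with no argument reads TESSDATA_PREFIX from the environment; an empty result suggests the variable is unset or points at the wrong folder.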

If you prefer to build the Docker image:

$ git clone https://github.com/ratoaq2/pgsrip.git
$ cd pgsrip
$ docker build . -t pgsrip

Usage

CLI

Rip from a .mkv:

$ pgsrip mymedia.mkv
3 PGS subtitles collected from 1 file
Ripping subtitles  [####################################]  100%  mymedia.mkv [5:de]
3 PGS subtitles ripped from 1 file

Rip from a .mks:

$ pgsrip mymedia.mks
3 PGS subtitles collected from 1 file
Ripping subtitles  [####################################]  100%  mymedia.mks [3:pt-BR]
3 PGS subtitles ripped from 1 file

Rip from a .sup:

$ pgsrip mymedia.en.sup
1 PGS subtitle collected from 1 file
Ripping subtitles  [####################################]  100%  mymedia.en.sup
1 PGS subtitle ripped from 1 file

Rip from a folder path:

$ pgsrip -l en -l pt-BR ~/medias/
11 PGS subtitles collected from 9 files / 2 files filtered out
Ripping subtitles  [####################################]  100%  ~/medias/mymedia.mkv [4:en]
11 PGS subtitles ripped from 9 files

Using docker:

$ docker run -it --rm -v /medias:/medias -u $(id -u username):$(id -g username) ratoaq2/pgsrip -l en -l de -l pt-BR -l pt /medias
11 PGS subtitles collected from 9 files / 2 files filtered out
Ripping subtitles  [####################################]  100%  /medias/mymedia.mkv [4:en]
11 PGS subtitles ripped from 9 files

API

from pgsrip import pgsrip, Mkv, Options
from babelfish import Language

media = Mkv('/subtitle/path/mymedia.mkv')
options = Options(languages={Language('eng')}, overwrite=True, one_per_lang=False)
pgsrip.rip(media, options)

pgsrip's People

Contributors

dependabot[bot], macro1, ratoaq2, twirx

pgsrip's Issues

Corrupted <PgsSubtitleItem>: Found 65508 bytes for image, but x were expected

I seem to be getting this warning on a lot of separate video files, and the affected subtitle items usually aren't OCR'd correctly at the bottom or sometimes at the right of the image, as if a portion of the image was cut off.

I used --keep-temp-files to check the images, and they are indeed not displayed correctly.

Playing the video with subtitles on a player like VLC seems to display the subtitles correctly without issues though.

Add black/whitelist support

Hello, thanks for creating this package!

A feature I'm missing is the ability to blacklist certain characters (seeing as tesseract keeps catching I as |). Scouring through the code, I could not find any options for this. Having the ability to pass a string of disallowed characters would be nice.

Similarly, whitelist support would also be welcome for the same reasons. However, if only one or the other can be supported, I think it's best to support blacklists (or at least to prioritise the blacklist if both are passed).

If there's currently no development time being put into this for a little while, I can also try to create a PR if you'd like me to.

Thanks again!
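
For illustration, the request above maps onto two Tesseract configuration variables, tessedit_char_blacklist and tessedit_char_whitelist, which the tesseract CLI accepts through its -c option. The sketch below only builds a standalone tesseract command line; it is not pgsrip's API, and the function name tesseract_command and its parameters are hypothetical. Following the suggestion in this issue, the blacklist takes priority when both are given.

```python
def tesseract_command(image_path, lang="eng", blacklist="", whitelist=""):
    """Build a tesseract CLI invocation that prints recognised text to stdout.

    Hypothetical helper: the blacklist wins if both lists are passed.
    """
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if blacklist:
        cmd += ["-c", f"tessedit_char_blacklist={blacklist}"]
    elif whitelist:
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd
```

The resulting list can be passed to subprocess.run on a machine where tesseract is installed, for example tesseract_command("item.png", blacklist="|") to stop the pipe character from being emitted.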

Support other subtitle formats like VobSub

Your program works exceptionally well, so I hope to be able to use it with VobSubs as well.

A flag could perhaps include VobSubs too and convert them to PGS with something like TheGreatMcPain/BDSup2SubPlusPlus.

Perhaps plain VobSub text recognition with Tesseract is not good enough and needs additional post-processing.

I'll try to automate the whole thing with your API and hope the results are reasonably okay.

Python script usage

Hey,
Is there any intention (or current ability) to use pgsrip inline in a Python script?

I have a script that extracts sup files from an MKV ingest, and I now want to convert those sup files to srt, which it appears this tool can do.

Let me know :)
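
Besides the API shown in the Usage section, one option is to call the command line tool from a script. The following is a minimal sketch under that assumption: it relies only on the pgsrip executable and the -l option shown in the CLI examples, and the helper names pgsrip_command and run_pgsrip are hypothetical.

```python
import shutil
import subprocess


def pgsrip_command(path, languages=()):
    """Build the argument list for a pgsrip CLI call on one file or folder."""
    cmd = ["pgsrip"]
    for lang in languages:
        cmd += ["-l", lang]  # one -l option per language, as in the CLI examples
    cmd.append(str(path))
    return cmd


def run_pgsrip(path, languages=()):
    """Run pgsrip on `path`; raises if the executable is missing or exits non-zero."""
    if shutil.which("pgsrip") is None:
        raise RuntimeError("pgsrip executable not found on PATH")
    return subprocess.run(pgsrip_command(path, languages), check=True)
```

For a sup file produced earlier in the same script, run_pgsrip("mymedia.en.sup") would invoke the same command as the ".sup" example in the Usage section.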

Tesseract uses a maximum of only 4 threads

No matter what I set max workers to, more than 4 threads are not used by the tesseract process.
In the help dialog, however, it says that 1 to 50 threads should be supported.

FileNotFoundError

Hello, here's my problem: with some of my subtitle files, pgsrip (latest version) returns this error:

INFO:pgsrip:Tesseract version: 5.3.0.20221222
INFO:pgsrip:Tesseract data: None
DEBUG:pgsrip.media_path:The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fr.sup is using temporary folder C:\Users\Paul\AppData\Local\Temp\The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fr.supkj831ll7.pgsrip
1 PGS subtitle collected from 1 file
DEBUG:pgsrip.media:Removing temporary files in C:\Users\Paul\AppData\Local\Temp\The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fr.supkj831ll7.pgsrip
WARNING:pgsrip.core:Error while trying to rip The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fr.sup: <FileNotFoundError> [[Errno 2] No such file or directory: 'The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fr.sup']
Traceback (most recent call last):
  File "C:\Users\Paul\AppData\Local\Programs\Python\Python39\lib\site-packages\pgsrip\core.py", line 69, in rip_pgs
    srt = PgsToSrtRipper(p, options).rip(lambda t: rules.apply(t, '')[0])
  File "C:\Users\Paul\AppData\Local\Programs\Python\Python39\lib\site-packages\pgsrip\ripper.py", line 128, in __init__
    max_height = max([item.height for item in self.pgs.items]) // 2
  File "C:\Users\Paul\AppData\Local\Programs\Python\Python39\lib\site-packages\pgsrip\media.py", line 160, in items
    data = self.data_reader()
  File "C:\Users\Paul\AppData\Local\Programs\Python\Python39\lib\site-packages\pgsrip\media_path.py", line 44, in get_data
    with open(str(self), 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fr.sup'
0 PGS subtitle ripped from 1 file

I enclose the subtitles used for the example.
The Godfather Part II (1974) [imdbid-tt0071562] - [Multi-French VFI][Bluray-1080p][PQ][AC3 5.1][FR+EN][10bit][x265]-Winks.fre.zip

Error: invalid literal for int()

Hi,

I have an mkv file with 2 PGS tracks (eng & fre). When trying to convert the subtitles, I get the following error (for both tracks):

Error while trying to rip <mkv / or sup file>: [invalid literal for int() with base 10: '95.936287']
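
The message is what Python raises when int() is given a string holding a decimal number. The issue does not say which value pgsrip is parsing, so the snippet below is only a sketch of the general failure and one tolerant way to parse such a value; the helper name parse_whole_number is hypothetical, and truncating the fraction is an assumption about what the caller wants.

```python
def parse_whole_number(text):
    """Parse '95' or '95.936287' into an int, dropping any fractional part."""
    try:
        return int(text)
    except ValueError:
        # int('95.936287') raises ValueError, so go through float first
        return int(float(text))
```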

Difference in reliability w/ SubtitleEdit+Tesseract?

Hi!

Really excited to see this tool - fits amazingly as a tdarr plugin too!

I noticed while parsing my first English PGS (1h15min) that it detected 530 strings, whereas SubtitleEdit + Tesseract 5.3.0 detected 783 strings (and had great accuracy on them). I felt a bit surprised considering that both use Tesseract 5 and the performance should theoretically be really good regardless of the dataset given it's just working with bare English, black on white, straight. I noted that when subtitles are missed, the previous subtitles would stick around for a long time.

Do you have any ideas from your experience?

version for Python 3?

On a Debian machine with "python -V" and "python3 -V" both reporting Python 3.9.2, the command "pip install pgsrip" produces the error "bash: cd: too many arguments." I can install with "sudo pip install pgsrip," although this produces warnings that running pip as root user can result in broken permissions. After thus installing pgsrip, the command "pgsrip file.mks" gives the error:

  File "/usr/local/bin/pgsrip", line 5, in <module>
    from pgsrip.cli import pgsrip
ModuleNotFoundError: No module named 'pgsrip.cli'

The command "sudo pgsrip file.mks" does work, although it is necessary to run chown on the output file, and this isn't really what one would prefer.

Is there a simple way to install pgsrip from source with Python 3.9.2?

Error during rip: "ValueError: invalid literal for int() with base 16"

I am getting this error for episodes I've ripped from a few particular series. In most cases I've had no problem, but these appear to contain something unexpected. Maybe there is an edge case that needs to be handled? I'll see whether I can use MKVToolNix to rip the subtitle tracks.

Thank you for a great tool!

Traceback (most recent call last):
  File ".venv/lib/python3.10/site-packages/pgsrip/ripper.py", line 128, in __init__
    max_height = max([item.height for item in self.pgs.items]) // 2
  File ".venv/lib/python3.10/site-packages/pgsrip/media.py", line 161, in items
    self._items = self.decode(data, self.media_path)
  File ".venv/lib/python3.10/site-packages/pgsrip/media.py", line 184, in decode
    return PgsSubtitleItem.create_items(media_path, display_sets)
  File ".venv/lib/python3.10/site-packages/pgsrip/media.py", line 45, in create_items
    candidates.append(PgsSubtitleItem(index, media_path, current_sets))
  File ".venv/lib/python3.10/site-packages/pgsrip/media.py", line 33, in __init__
    self.x_offset = min([ds.wds.x_offset for ds in display_sets] or [None])
  File ".venv/lib/python3.10/site-packages/pgsrip/media.py", line 33, in <listcomp>
    self.x_offset = min([ds.wds.x_offset for ds in display_sets] or [None])
  File ".venv/lib/python3.10/site-packages/pgsrip/pgs.py", line 278, in x_offset
    return from_hex(self.data[2:4])
  File ".venv/lib/python3.10/site-packages/pgsrip/utils.py", line 7, in from_hex
    return int(b.hex(), base=16)
ValueError: invalid literal for int() with base 16: ''
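
The last frame of the traceback can be reproduced in isolation: when the slice self.data[2:4] is empty, b.hex() returns an empty string and int('', base=16) raises this ValueError. The sketch below copies from_hex as it appears in the traceback and adds a hypothetical guarded variant, from_hex_or_none, that returns None for an empty slice. Whether None is the right value for a missing window offset in pgsrip is an assumption.

```python
def from_hex(b):
    # Conversion as shown in the traceback (pgsrip/utils.py)
    return int(b.hex(), base=16)


def from_hex_or_none(b):
    """Hypothetical guarded variant: None for empty input instead of raising."""
    return int.from_bytes(b, "big") if b else None


data = b""  # stands in for a truncated or empty segment payload
try:
    from_hex(data[2:4])
except ValueError as exc:
    print(exc)

print(from_hex_or_none(data[2:4]))
```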

Filtering by track name

It is common to find videos that contain multiple subtitles for the same language (for instance, SDH and non-SDH subtitles).

In order to make this flexible, it would be useful to have something that could apply a regex pattern on the track_name property, which is being extracted from mkv tracks.
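
A rough sketch of what such a filter could look like, written against plain dictionaries rather than pgsrip's own track objects. The function name filter_tracks, the exclude flag, and the dictionary layout are all hypothetical; only the track_name key mirrors the property mentioned above.

```python
import re


def filter_tracks(tracks, pattern, exclude=False):
    """Keep tracks whose track_name matches `pattern` (or does not, if exclude)."""
    rx = re.compile(pattern, re.IGNORECASE)
    selected = []
    for track in tracks:
        # A missing or None track_name is treated as an empty string
        matched = bool(rx.search(track.get("track_name") or ""))
        if matched != exclude:
            selected.append(track)
    return selected


tracks = [
    {"track_name": "English (SDH)", "language": "en"},
    {"track_name": "English", "language": "en"},
    {"track_name": None, "language": "pt-BR"},
]
non_sdh = filter_tracks(tracks, r"\bSDH\b", exclude=True)
```

With exclude=True the SDH track is dropped and the other two remain; with the default exclude=False only the SDH track would be kept.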

Tesseract version?

Thanks for the great work on this. I was wondering what version of Tesseract this uses?

Reason is I tested pgsrip via CLI against SubtitleEdit which uses Tesseract 5.2, and the latter seems to be more accurate with fewer errors, but I suppose that could also just be due to the config of tesseract in pgsrip.

Appreciate the help. Thanks.
