belval / pdf2image

A python module that wraps the pdftoppm utility to convert PDF to PIL Image object

License: MIT License

Python 100.00%

pdf pil pil-image convert poppler

pdf2image's Introduction

pdf2image

A python (3.7+) module that wraps pdftoppm and pdftocairo to convert PDF to a PIL Image object

How to install

pip install pdf2image

Windows

Windows users will have to build or download poppler for Windows. I recommend @oschwartz10612's version, which is the most up to date. You will then have to add the bin/ folder to PATH or pass poppler_path = r"C:\path\to\poppler-xx\bin" as an argument to convert_from_path.

Mac

Mac users will have to install poppler.

Installing using Brew:

brew install poppler

Linux

Most distros ship with pdftoppm and pdftocairo. If they are not installed, use your package manager to install poppler-utils.

Platform-independent (Using `conda`)

Install poppler: conda install -c conda-forge poppler
Install pdf2image: pip install pdf2image

How does it work?

from pdf2image import convert_from_path, convert_from_bytes
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError
)

Then simply do:

images = convert_from_path('/home/belval/example.pdf')

images = convert_from_bytes(open('/home/belval/example.pdf', 'rb').read())

OR better yet

import tempfile

with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path('/home/belval/example.pdf', output_folder=path)
    # Do something here

images will be a list of PIL Image objects, one for each page of the PDF document.

Here are the definitions:

convert_from_path(pdf_path, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False, timeout=600, hide_attributes=False)

convert_from_bytes(pdf_file, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False, timeout=600, hide_attributes=False)

What's new?

Allow users to hide attributes when using pdftoppm with hide_attributes (Thank you @StaticRocket)
Fix console opening on Windows (Thank you @OhMyAgnes!)
Add timeout parameter which raises PDFPopplerTimeoutError after the given number of seconds.
Add use_pdftocairo parameter which forces pdf2image to use pdftocairo. Should improve performance.
Fixed a bug where using pdf2image with multiple threads (but not multiple processes) would cause an exception
jpegopt parameter allows for tuning of the output JPEG when using fmt="jpeg" (-jpegopt in pdftoppm CLI) (Thank you @abieler)
pdfinfo_from_path and pdfinfo_from_bytes which expose the output of the pdfinfo CLI
paths_only parameter will return image paths instead of Image objects, to prevent OOM when converting a big PDF
size parameter allows you to define the shape of the resulting images (-scale-to in pdftoppm CLI)
- size=400 will fit the image to a 400x400 box, preserving aspect ratio
- size=(400, None) will make the image 400 pixels wide, preserving aspect ratio
- size=(500, 500) will resize the image to 500x500 pixels, not preserving aspect ratio
grayscale parameter allows you to convert images to grayscale (-gray in pdftoppm CLI)
single_file parameter allows you to convert the first PDF page only, without adding digits at the end of the output_file
Allow the user to specify poppler's installation path with poppler_path
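
The three size forms listed above can be summarized as a small pure function. This is only a sketch of the documented rules, not pdf2image's implementation: expected_size is a hypothetical helper, the rounding is an assumption, and the height-only case is assumed to mirror the width-only case. Poppler's real output may differ by a pixel.

```python
from typing import Optional, Tuple, Union

Size = Union[int, Tuple[Optional[int], Optional[int]]]


def expected_size(page_w: int, page_h: int, size: Size) -> Tuple[int, int]:
    """Approximate the output dimensions described for the `size` parameter."""
    if isinstance(size, int):
        # Fit inside a size x size box, preserving the aspect ratio.
        scale = size / max(page_w, page_h)
        return round(page_w * scale), round(page_h * scale)
    width, height = size
    if width is not None and height is not None:
        # Both given: exact dimensions, aspect ratio not preserved.
        return width, height
    if width is not None:
        # Only the width given: the height follows the aspect ratio.
        return width, round(page_h * width / page_w)
    if height is not None:
        # Only the height given (assumed symmetric to the width-only case).
        return round(page_w * height / page_h), height
    raise ValueError("size must set at least one dimension")
```

For a 1700x2200 pixel page, size=400 gives roughly 309x400 and size=(400, None) gives roughly 400x518.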

Performance tips

Using an output folder is significantly faster if you are using an SSD. Otherwise I/O usually becomes the bottleneck.
Using multiple threads can give you some gains, but avoid more than 4 as this will cause an I/O bottleneck (even on my NVMe SSD!).
If I/O is your bottleneck, using the JPEG format can lead to significant gains.
The PNG format is fairly slow because of its compression.
If you want to know the best settings (most settings will be fine anyway), you can clone the project and run python tests.py to get timings.

Limitations / known issues

A relatively big PDF will use up all your memory and cause the process to be killed (unless you use an output folder).
Sometimes fails to read PDFs signed using DocuSign; see the solution for the DocuSign issue.

pdf2image's Issues

How to embed image into html body?

Hi, I want to convert a PDF binary to a JPEG and then embed it into an HTML string to be pushed to our server.
This is what I have so far:

self.pdf_file = Binary_pdf
images = convert_from_bytes(self.pdf_file, 200, fmt="jpg", thread_count=1)
I then want to loop through the images and put them in my body string to be uploaded:

    images = images.reverse() 
    for image in images:
        body += <img src = <image binary?></img>

How to do this?
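
One possible approach, sketched under assumptions: serialize each PIL image to JPEG bytes in memory and inline it as a base64 data URI. The helper names below are hypothetical and not part of pdf2image. Note also that list.reverse() reverses in place and returns None, so images = images.reverse() would leave images set to None; reversed(images) avoids that.

```python
import base64
import io


def jpeg_bytes_to_img_tag(jpeg_bytes: bytes) -> str:
    """Return an <img> tag with the JPEG inlined as a base64 data URI."""
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    return f'<img src="data:image/jpeg;base64,{encoded}">'


def pil_image_to_jpeg_bytes(image) -> bytes:
    """Serialize a PIL Image to JPEG bytes without touching the disk."""
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG")
    return buffer.getvalue()


def pdf_bytes_to_html(pdf_bytes: bytes, dpi: int = 200) -> str:
    """Convert every page and concatenate the resulting <img> tags."""
    # Imported here so the helpers above work without poppler installed.
    from pdf2image import convert_from_bytes

    images = convert_from_bytes(pdf_bytes, dpi=dpi, fmt="jpeg")
    # Use reversed(images) here if the pages should appear last to first.
    return "".join(
        jpeg_bytes_to_img_tag(pil_image_to_jpeg_bytes(image)) for image in images
    )
```

Data URIs make the HTML large, so for many pages it may be preferable to upload the images separately and reference them by URL.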

Corrupt JPEG data: Premature end of data segment

I am trying to convert a set of PDF documents into JPEGs. While converting, I occasionally get a pop-up error message as shown in the screenshot below. I wonder what may have caused this and whether there is a way to bypass it (since the conversion process stalls while the error dialog window is shown, until I close it).

fmt='png' is slow

Using PNG as the output format makes conversion very slow, sometimes terribly slow.

Exception: Unable to get page count

Great library, @Belval. While experimenting with my PDFs, I found that several of them throw the exception below. Not all of them, only some. I don't know whether it is caused by my PDFs, and unfortunately I can't share them with you.

Can you provide hints about where it went wrong? Or let me know if you need any input from me to resolve this issue.

Traceback (most recent call last):
  File "/home/simon/.local/share/virtualenvs/floornet-K8h42tkL/lib/python3.7/site-packages/pdf2image/pdf2image.py", line 163, in __page_count
    return int(re.search(r'Pages:\s+(\d+)', out.decode("utf8", "ignore")).group(1))
AttributeError: 'NoneType' object has no attribute 'group'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/simon/.local/share/virtualenvs/floornet-K8h42tkL/bin/flask", line 11, in <module>
    sys.exit(main())
  File "/home/simon/.local/share/virtualenvs/floornet-K8h42tkL/lib/python3.7/site-packages/flask/cli.py", line 894, in main
    cli.main(args=args, prog_name=name)
  File "/home/simon/.local/share/virtualenvs/floornet-K8h42tkL/lib/python3.7/site-packages/flask/cli.py", line 557, in main
    return super(FlaskGroup, self).main(*args, **kwargs)
  File "/home/simon/.local/share/virtualenvs/floornet-K8h42tkL/lib/python3.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/simon/.local/share/virtualenvs/floornet-K8h42tkL/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/simon/.local/share/virtualenvs/floornet-K8h42tkL/lib/python3.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/simon/.local/share/virtualenvs/floornet-K8h42tkL/lib/python3.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/simon/.local/share/virtualenvs/floornet-K8h42tkL/lib/python3.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/simon/.local/share/virtualenvs/floornet-K8h42tkL/lib/python3.7/site-packages/flask/cli.py", line 412, in decorator
    return __ctx.invoke(f, *args, **kwargs)
  File "/home/simon/.local/share/virtualenvs/floornet-K8h42tkL/lib/python3.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/simon/Workspace/floornet/floornet/cli.py", line 43, in convert_command
    image = convert_from_path(floor4, output_folder=path, fmt='png')
  File "/home/simon/.local/share/virtualenvs/floornet-K8h42tkL/lib/python3.7/site-packages/pdf2image/pdf2image.py", line 29, in convert_from_path
    page_count = __page_count(pdf_path, userpw)
  File "/home/simon/.local/share/virtualenvs/floornet-K8h42tkL/lib/python3.7/site-packages/pdf2image/pdf2image.py", line 165, in __page_count
    raise Exception('Unable to get page count.')
Exception: Unable to get page count.
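
The AttributeError in the first traceback arises because re.search returned None, meaning the pdfinfo output contained no "Pages:" line (for example because pdfinfo printed an error instead). A defensive parse could look like the sketch below; parse_page_count is a hypothetical helper, not the library's code.

```python
import re
from typing import Optional


def parse_page_count(pdfinfo_output: str) -> Optional[int]:
    """Return the page count from pdfinfo's text output, or None if absent."""
    match = re.search(r"Pages:\s+(\d+)", pdfinfo_output)
    # re.search returns None when the pattern is missing, so check before .group().
    return int(match.group(1)) if match else None
```

Running pdfinfo directly on a failing file from the command line should show the underlying error message.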

unable to import pdf2image

I'm on a mac osx 10.12 using python 3.6.1 and used pip3 to install pdf2image. Prior to installation, I installed poppler using brew. After installation, it said that pdf2image-0.1.5 was successfully installed. But, when I try to import pdf2image, I get the following error:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pdf2image'

Is there another requirement I'm missing?

Ability to disable link/ref highlighting

Greetings!

Pdftoppm/poppler has this weird "feature" of highlighting hyperlinks with very bright rectangles, which are never rendered in mainstream PDF readers and are very annoying.

I was looking for an appropriate flag to disable it so I could fork/merge your repo, but in vain. I traced the issue to this conversation, but it doesn't seem to offer any solution in the end. I also started digging into the pdftoppm.cc source code, but nothing has caught my eye yet. I was wondering if you had any idea how to fix the issue before I continue digging into the source code.

This is a general issue and happens with every arxiv.org PDF I have tried. Attaching a sample below anyway.

Example file: page-15.png
PDF Source: https://arxiv.org/abs/1806.00451

Cannot delete *.ppm files after process

Describe the issue
I cannot seem to delete *.ppm files after running pdf2image. The code is taken from a test suite, where I intend to delete all *.ppm and *.jpg files after test execution.

Steps to reproduce
Test script:

...
    @classmethod
    def tearDownClass(cls):
        files_to_delete = [x for x in os.listdir(
            DATA_DIR) if os.path.splitext(x)[1] in ['.ppm', '.jpg']]
        for x in files_to_delete:
           os.remove(os.path.join(DATA_DIR, x))
...

Error log:

=================================== ERRORS ====================================
__________ ERROR at teardown of TestPDFComparer.test_is_both_invalid __________

cls = <class 'test.test_pdf.TestPDFComparer'>

    @classmethod
    def tearDownClass(cls):
        files_to_delete = [x for x in os.listdir(
            DATA_DIR) if os.path.splitext(x)[1] in ['.ppm', '.jpg']]
        for x in files_to_delete:
>           os.remove(os.path.join(DATA_DIR, x))
E           PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'I:\\pdf-compare\\test\\test_data\\2a9ae9eb-6edd-40da-b46f-b49fa1e0e459-1.ppm'

test\test_pdf.py:131: PermissionError
===================== 17 passed, 1 error in 2.15 seconds ======================

Desktop environment:

OS: Windows 7
Python: 3.6.4 (Anaconda)
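
A likely cause, stated as an assumption: PIL opens image files lazily and keeps the file handle open, and Windows refuses to delete a file that is still open. Closing the returned images before deleting usually releases the handles. close_all is a hypothetical helper.

```python
from typing import Iterable


def close_all(images: Iterable) -> None:
    """Close every PIL Image so that its underlying file handle is released."""
    for image in images:
        image.close()


# Hypothetical usage inside the test suite:
#     images = convert_from_path(pdf_path, output_folder=DATA_DIR)
#     ...run assertions on the images...
#     close_all(images)   # must happen before tearDownClass removes the files
```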

DecompressionBombWarning while converting scaled pdf

Describe the bug
While converting a PDF with a very large page size (36.40 x 48.44 in) to an image, I received DecompressionBombWarning: Image size exceeds limit, could be decompression bomb DOS attack.
The image conversion then fails.
To Reproduce
Convert a PDF that has a large page size.

Expected behavior
Should convert pdf to image with warning.

Screenshots

Desktop (please complete the following information):

OS: [e.g. Windows 10]
Python 3.6

Additional context
Solved the issue by adding the -scale-to argument, letting the developer set their own value, with a default of 1000.
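
To see when a page crosses Pillow's warning threshold, the pixel count can be estimated from the page size and DPI. This is a rough sketch: the constant below is Pillow's default Image.MAX_IMAGE_PIXELS as far as I know (check your installed version), the helper names are hypothetical, and poppler's exact rounding is an assumption.

```python
import math

# Assumed default of PIL.Image.MAX_IMAGE_PIXELS; verify against your Pillow version.
DEFAULT_MAX_PIXELS = 89_478_485


def pixel_count(width_in: float, height_in: float, dpi: int) -> int:
    """Estimate the number of pixels rendered for a page of the given size."""
    return math.ceil(width_in * dpi) * math.ceil(height_in * dpi)


def max_safe_dpi(width_in: float, height_in: float,
                 max_pixels: int = DEFAULT_MAX_PIXELS) -> int:
    """Largest whole DPI whose estimated pixel count stays under max_pixels."""
    return math.floor(math.sqrt(max_pixels / (width_in * height_in)))
```

For the 36.40 x 48.44 in page from this report, the estimate stays under the limit up to about 225 dpi and exceeds it at 300 dpi. Passing a lower dpi, or the size parameter, keeps the output below the threshold.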

Do you support one pdf file with multiple pages converted to one image?
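
pdf2image returns one image per page, and the parameter list above shows no option for merging them. One way to do it afterwards with PIL is to stack the pages on a single canvas. The helpers below are hypothetical, and the white background and left alignment are arbitrary choices.

```python
from typing import List, Sequence, Tuple


def vertical_layout(
    sizes: Sequence[Tuple[int, int]]
) -> Tuple[Tuple[int, int], List[Tuple[int, int]]]:
    """Return (canvas_size, offsets) for stacking pages from top to bottom."""
    width = max(w for w, _ in sizes)
    offsets, y = [], 0
    for _, height in sizes:
        offsets.append((0, y))
        y += height
    return (width, y), offsets


def stack_pages(pages):
    """Paste every page onto one tall white canvas and return it."""
    from PIL import Image  # imported lazily; only needed for the actual stitching

    canvas_size, offsets = vertical_layout([page.size for page in pages])
    canvas = Image.new("RGB", canvas_size, "white")
    for page, offset in zip(pages, offsets):
        canvas.paste(page, offset)
    return canvas
```

Usage would be along the lines of stack_pages(convert_from_path(pdf_path)).save("all_pages.png"). Keep in mind that a long document produces a very large image.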

Cannot move the temp file

Hi,
I am using convert_from_path:
pages = convert_from_path(filename, dpi=200, output_folder = inputPath + "\temp", fmt='jpg', last_page=1, first_page =0, thread_count=1)
After this is executed I am copying the file to a standard filename:
file2=os.listdir(inputPath+"\temp")[0]
JPGfilename = os.path.join(inputPath+"\temp", file2)
shutil.move(JPGfilename, inputPath +'\temp.jpg')

However SHUTIL cannot move the file because of permission error.

How can I close the process that is still using the temp file convert_from_path created?

Thanks!
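
Two things may be worth checking here, both offered as guesses rather than a confirmed diagnosis. First, in a normal Python string "\temp" starts with a TAB character, because "\t" is an escape sequence. Second, the returned PIL images may still hold the file open, which blocks the move on Windows. The sketch below uses a hypothetical folder and a hypothetical close_then_move helper.

```python
import shutil

input_path = r"C:\data"  # hypothetical folder, for illustration only

broken = input_path + "\temp"   # "\t" is parsed as a TAB character
fixed = input_path + r"\temp"   # a raw string keeps the backslash literally


def close_then_move(pages, src: str, dst: str) -> None:
    """Release the image file handles first, then move the file."""
    for page in pages:
        page.close()
    shutil.move(src, dst)
```

os.path.join(inputPath, "temp") is another way to build the path without writing backslashes by hand.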

page_count feature + slow threads

Hi, thanks, the script works great for me.
For large PDFs (more than 200 pages will exceed the 2 GB limit and raise a memory error), I process them in chunks of 100 pages and append them to the images list.
Therefore I need the number of PDF pages.
__page_count() is private, so I made a local copy with a public page_count().

However, the process is rather slow. If I use thread_count=20 (could be anything), 20 threads are created in the Task Manager, but only one of them has 10% CPU usage.
My SSD stays below 1%.

Any idea how to improve this?
My CPU is a Ryzen 5 1600, and I am running Visual Studio Code on Windows 10.
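
The chunked approach described above can be sketched with pdfinfo_from_path, which the README lists as exposing the pdfinfo output, so no private function needs to be copied. The helper names and the handle_page callback are hypothetical, and the "Pages" key is assumed to hold the page count.

```python
from typing import Callable, Iterator, Tuple


def page_chunks(page_count: int, chunk_size: int = 100) -> Iterator[Tuple[int, int]]:
    """Yield 1-based, inclusive (first_page, last_page) pairs."""
    for first in range(1, page_count + 1, chunk_size):
        yield first, min(first + chunk_size - 1, page_count)


def convert_in_chunks(pdf_path: str, handle_page: Callable, chunk_size: int = 100) -> None:
    """Convert a PDF chunk by chunk so only one chunk is in memory at a time."""
    # Imported lazily so page_chunks can be used without poppler installed.
    from pdf2image import convert_from_path, pdfinfo_from_path

    page_count = int(pdfinfo_from_path(pdf_path)["Pages"])
    for first, last in page_chunks(page_count, chunk_size):
        for page in convert_from_path(pdf_path, first_page=first, last_page=last):
            handle_page(page)
```

Processing each page inside the loop, rather than appending everything to one list, is what keeps memory use bounded.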

Changing the name of the output file

This issue was raised before here, but got closed prematurely.

I wanted to know if there is any way that we can have control over the name of the output image. While I understand there are reasons why the filename is the way it is from your explanation in the previous issue, it would save me a lot of time and resources if I could change the output name.

Thanks!

Crop before generating image

Hello there,

Is there any way I can export a small selection of a single page to image? I mean, providing coords to get an image 'crop' from a page, instead of getting full page as image.

It may be an undocumented feature that the lib already supports. I found 'use_cropbox' but couldn't find how to provide such coords.

The files I'm using always have some useless header and footer that I want to get rid of.
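
The signatures listed in the README show no parameter for crop coordinates, so one workaround is to convert the page and then crop the PIL image. The sketch below assumes the box is given in PDF points (72 per inch) measured from the top-left corner, as PIL expects. points_to_pixels and crop_page are hypothetical helpers.

```python
from typing import Sequence, Tuple

POINTS_PER_INCH = 72  # PDF user-space units per inch


def points_to_pixels(box_pt: Sequence[float], dpi: int) -> Tuple[int, ...]:
    """Convert a (left, upper, right, lower) box from PDF points to pixels."""
    scale = dpi / POINTS_PER_INCH
    return tuple(round(value * scale) for value in box_pt)


def crop_page(pdf_path: str, page_number: int, box_pt: Sequence[float], dpi: int = 200):
    """Render a single page and return only the requested region."""
    from pdf2image import convert_from_path  # lazy import; needs poppler

    page = convert_from_path(
        pdf_path, dpi=dpi, first_page=page_number, last_page=page_number
    )[0]
    return page.crop(points_to_pixels(box_pt, dpi))
```

The whole page is still rendered, so this removes the header and footer from the result but does not save rendering time.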

Add PDFium support

I have been toying with PDFium recently and I believe it could be a good improvement.

pdftoppm would not be phased out since it is still faster at any resolution higher than 200dpi.

Installation Problems

I followed the installation instructions, but executing the code gives the following error:

>>> from pdf2image import convert_from_path
>>> images = convert_from_path('c:\\Users\\cp\\Python\\QGIS\\Local Plan Policies Map 2018 - Map 3.pdf')
Traceback (most recent call last):
  File "C:\Anaconda3\lib\site-packages\pdf2image\pdf2image.py", line 168, in __page_count
    return int(re.search(r'Pages:\s+(\d+)', out.decode("utf8", "ignore")).group(1))
AttributeError: 'NoneType' object has no attribute 'group'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda3\lib\site-packages\pdf2image\pdf2image.py", line 29, in convert_from_path
    page_count = __page_count(pdf_path, userpw)
  File "C:\Anaconda3\lib\site-packages\pdf2image\pdf2image.py", line 170, in __page_count
    raise Exception('Unable to get page count. %s' % err.decode("utf8", "ignore"))
Exception: Unable to get page count.
>>>

I installed the packages for Windows. I also tried to install the poppler package from Python, but the environment doesn't want to make it stick.

(base) C:\Users\cp>conda install -c conda-forge poppler
Collecting package metadata: done
Solving environment: done

## Package Plan ##

  environment location: C:\Anaconda3

  added / updated specs:
    - poppler


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2019.3.9   |       hecc5488_0         184 KB  conda-forge
    certifi-2019.3.9           |           py36_0         149 KB  conda-forge
    cryptography-2.6.1         |   py36hb32ad35_0         557 KB  conda-forge
    curl-7.64.1                |       h4496350_0         110 KB  conda-forge
    freetype-2.10.0            |       h5db478b_0         478 KB  conda-forge
    geos-3.7.2                 |       habb2df7_0         1.5 MB  conda-forge
    gettext-0.19.8.1           |    hb01d8f6_1002         5.0 MB  conda-forge
    glib-2.58.3                |    hc0c2ac7_1001         3.4 MB  conda-forge
    jpeg-9c                    |    hfa6e2cd_1001         314 KB  conda-forge
    krb5-1.16.3                |    hdd46e55_1001         819 KB  conda-forge
    libcurl-7.64.1             |       h4496350_0         271 KB  conda-forge
    libffi-3.2.1               |    h6538335_1006          36 KB  conda-forge
    libpng-1.6.37              |       h7602738_0         1.3 MB  conda-forge
    libssh2-1.8.2              |       h642c060_2         186 KB  conda-forge
    matplotlib-3.1.0           |           py36_0           6 KB  conda-forge
    matplotlib-base-3.1.0      |   py36h3e3dc42_0         6.5 MB  conda-forge
    openjpeg-2.3.1             |       ha922770_0         225 KB  conda-forge
    openssl-1.1.1b             |       hfa6e2cd_2         4.8 MB  conda-forge
    pcre-8.41                  |    h6538335_1003         453 KB  conda-forge
    pillow-6.0.0               |   py36h9a613e6_0         674 KB  conda-forge
    poppler-0.77.0             |       h92819f6_0         1.9 MB  conda-forge
    poppler-data-0.4.9         |                1         3.4 MB  conda-forge
    pycurl-7.43.0.2            |   py36h636d3bd_0          66 KB  conda-forge
    qt-5.9.7                   |       hc6833c9_1        91.1 MB  conda-forge
    sqlite-3.28.0              |       hfa6e2cd_0         985 KB  conda-forge
    tk-8.6.9                   |    hfa6e2cd_1001         3.7 MB  conda-forge
    vc-14.1                    |       h0510ff6_4           6 KB
    vs2015_runtime-14.15.26706 |       h3a45250_4         2.4 MB
    ------------------------------------------------------------
                                           Total:       130.3 MB

The following NEW packages will be INSTALLED:

  gettext            conda-forge/win-64::gettext-0.19.8.1-hb01d8f6_1002
  glib               conda-forge/win-64::glib-2.58.3-hc0c2ac7_1001
  krb5               conda-forge/win-64::krb5-1.16.3-hdd46e55_1001
  libffi             conda-forge/win-64::libffi-3.2.1-h6538335_1006
  matplotlib-base    conda-forge/win-64::matplotlib-base-3.1.0-py36h3e3dc42_0
  openjpeg           conda-forge/win-64::openjpeg-2.3.1-ha922770_0
  pcre               conda-forge/win-64::pcre-8.41-h6538335_1003
  poppler            conda-forge/win-64::poppler-0.77.0-h92819f6_0
  poppler-data       conda-forge/noarch::poppler-data-0.4.9-1

The following packages will be REMOVED:

  anaconda-5.2.0-py36_3

The following packages will be UPDATED:

  ca-certificates    anaconda::ca-certificates-2018.03.07-0 --> conda-forge::ca-certificates-2019.3.9-hecc5488_0
  certifi                                  2018.4.16-py36_0 --> 2019.3.9-py36_0
  cryptography       pkgs/main::cryptography-2.2.2-py36hfa~ --> conda-forge::cryptography-2.6.1-py36hb32ad35_0
  curl                    pkgs/main::curl-7.60.0-h7602738_0 --> conda-forge::curl-7.64.1-h4496350_0
  freetype               pkgs/main::freetype-2.8-h51f8f2c_1 --> conda-forge::freetype-2.10.0-h5db478b_0
  geos                                         3.6.0-vc14_0 --> 3.7.2-habb2df7_0
  jpeg                        pkgs/main::jpeg-9b-hb83a4c4_2 --> conda-forge::jpeg-9c-hfa6e2cd_1001
  libcurl              pkgs/main::libcurl-7.60.0-hc4dcbb0_0 --> conda-forge::libcurl-7.64.1-h4496350_0
  libpng                pkgs/main::libpng-1.6.34-h79bbb47_0 --> conda-forge::libpng-1.6.37-h7602738_0
  libssh2               pkgs/main::libssh2-1.8.0-hd619d38_4 --> conda-forge::libssh2-1.8.2-h642c060_2
  matplotlib         pkgs/main::matplotlib-2.2.2-py36h153e~ --> conda-forge::matplotlib-3.1.0-py36_0
  openssl               anaconda::openssl-1.0.2o-h8ea7d77_0 --> conda-forge::openssl-1.1.1b-hfa6e2cd_2
  pillow             pkgs/main::pillow-5.1.0-py36h0738816_0 --> conda-forge::pillow-6.0.0-py36h9a613e6_0
  pycurl             pkgs/main::pycurl-7.43.0.1-py36h74b6d~ --> conda-forge::pycurl-7.43.0.2-py36h636d3bd_0
  qt                     pkgs/main::qt-5.9.5-vc14he4a7d60_0 --> conda-forge::qt-5.9.7-hc6833c9_1
  sqlite                pkgs/main::sqlite-3.23.1-h35aae40_0 --> conda-forge::sqlite-3.28.0-hfa6e2cd_0
  tk                         pkgs/main::tk-8.6.7-hcb92d03_3 --> conda-forge::tk-8.6.9-hfa6e2cd_1001
  vc                                          14-h0510ff6_3 --> 14.1-h0510ff6_4
  vs2015_runtime                               14.0.25123-3 --> 14.15.26706-h3a45250_4


Proceed ([y]/n)?

What else can I try?

Problem converting pdf to png

Describe the bug
I have a problem reading a specific one-page PDF file and converting it to PNG.
After investigation, the problem arises in __parse_buffer_to_png when it checks for the IEND chunk.
A data chunk unfortunately seems to contain the bytes IEND, misleading the parser into treating the output as two pages when there is really only one.
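
One way to avoid this kind of false positive, sketched here as an illustration and not as pdf2image's actual parser, is to walk the PNG chunk structure instead of searching for the bytes IEND. Each chunk consists of a 4-byte big-endian length, a 4-byte type, the data, and a 4-byte CRC, so data that happens to contain "IEND" is skipped over. The function name is hypothetical.

```python
import struct
from typing import List

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def split_png_stream(buffer: bytes) -> List[bytes]:
    """Split concatenated PNG files by following chunk lengths."""
    images = []
    pos = 0
    while pos < len(buffer):
        if buffer[pos:pos + 8] != PNG_SIGNATURE:
            raise ValueError(f"expected PNG signature at offset {pos}")
        start = pos
        pos += 8
        while True:
            if pos + 8 > len(buffer):
                raise ValueError("truncated chunk header")
            length, chunk_type = struct.unpack(">I4s", buffer[pos:pos + 8])
            pos += 8 + length + 4  # header + data + CRC
            if pos > len(buffer):
                raise ValueError("truncated chunk body")
            if chunk_type == b"IEND":
                break  # a real IEND chunk ends this image
        images.append(buffer[start:pos])
    return images
```

The CRC is skipped rather than verified here; a stricter version could check it with zlib.crc32.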

Problems with output_file

Hello,

I have a PDF file with 10 pages, and because of a requirement I have to split the PDF file
into 10 separate one-page PDF files. I solved that with PyPDF2.

Then, when I try to convert each PDF file to PNG, passing output_file=name_page, I get the following error:

Traceback (most recent call last):
  File "medicion.py", line 20, in <module>
    cc.convert_pdf_to_png2()
  File "/home/eamanu/dev/alados-repository/OasisSoftwareAlados/Oasis/PDFConverter.py", line 275, in convert_pdf_to_png2
    output_file=o)
  File "/home/eamanu/dev/alados-repository/OasisSoftwareAlados/venv/lib/python3.7/site-packages/pdf2image/pdf2image.py", line 118, in convert_from_path
    images += _load_from_output_folder(output_folder, uid, in_memory=auto_temp_dir)
  File "/home/eamanu/dev/alados-repository/OasisSoftwareAlados/venv/lib/python3.7/site-packages/pdf2image/pdf2image.py", line 241, in _load_from_output_folder
    images.append(Image.open(os.path.join(output_folder, f)))
  File "/home/eamanu/dev/alados-repository/OasisSoftwareAlados/venv/lib/python3.7/site-packages/PIL/Image.py", line 2705, in open
    % (filename if filename else fp))
OSError: cannot identify image file '/home/eamanu/dev/alados-repository/OasisSoftwareAlados/Oasis/temp/test8.pdf'

The algorithm is this:

def convert(self):
        self.split_pdf()  # via PyPDF2
        list_dir = os.listdir(self.working_dir)
        from pdf2image  import convert_from_path
        for ld in list_dir:
            o = ld.split('.')[0]
            print(o)
            convert_from_path(os.path.join(self.working_dir,
                                           ld),
                              output_folder=self.working_dir,
                              fmt='png',
                              thread_count=7,
                              transparent=True,
                              output_file=o)

pdf2image outputs blank image when the orientation of pages in PDF is alternating

Describe the bug
I get a blank output image when the pdf contains pages which have different orientations.
For instance:
If the pages are in portrait orientation till page 6 and the code encounters landscape orientation from page 7 onward, then the pages from 7 to 8 will be blank.

To Reproduce
Steps to reproduce the behavior:

Take a PDF which has a combination of pages in two different orientations, say 1-6 in portrait and 7-10 in landscape
Run the program with fmt parameter set as "jpg" and output_folder as "image"
You will see that pages 7 and 8 will be blank.

Expected behavior
Images to have content similar to pdf

Desktop (please complete the following information):

Windows
Python 3.5.5

In docker container to perform PDF image, Chinese is not recognized

Describe the bug
When converting a PDF to an image inside a Docker container, Chinese text is not recognized. I have tried Docker images based on CentOS 7.6.1810 and Ubuntu 16.04, but the Chinese text is not recognized in either.
This is the PDF to be identified：
031001700111-49924029.pdf

As a result：

The code is
import pdf2image
import time
import sys

if __name__ == "__main__":
    filename = sys.argv[1]
    images = pdf2image.convert_from_path(filename)
    imagename = str(time.time()) + ".jpg"
    images[0].save(imagename)
    print(imagename)

Dockerfile:

FROM centos:python3.6.5

RUN yum -y upgrade \
    && yum install -y poppler-cpp-devel \
    && yum install -y poppler \
    && yum install -y poppler-utils \
    && pip install pdf2image

Execute the command：
docker run -v /home:/home -w /home --rm -it centos:pdf2image python3 pdfToimage.py 031001700111-49924029.pdf

1 page missing/Blank image when converting from pdf to images

Describe the bug
I was trying to convert a PDF to images using the module, and one of the pages is missing: I get a blank image for it.

Screenshots
For all the pages the font style and size looks like this :

For the page which is being skipped it looks like this :

Essentially, the rest of the pages are typed and the failing page is scanned.

Please tell me if I am missing any settings or attributes which can fix this.

Too many open files error

Describe the bug
When a really big PDF is converted to images, I run into an error: OSError: [Errno 24] Too many open files ...

To Reproduce
Run pdf2image on a real big pdf (e.g. 7.000 pages)

Expected behavior
Since I specified the output folder, I expected the images to be written there without all of them being loaded, and with their file pointers closed.
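
The README's paths_only parameter returns image paths instead of Image objects, which should avoid holding one open file per page. A minimal sketch under that assumption, with hypothetical helper names, opens a single image at a time:

```python
from typing import Callable, Iterable


def for_each_image(paths: Iterable[str], handle_page: Callable, open_image: Callable) -> int:
    """Open one image at a time, so at most one file handle is held."""
    count = 0
    for path in paths:
        with open_image(path) as image:  # closed again when the block exits
            handle_page(image)
        count += 1
    return count


def process_large_pdf(pdf_path: str, output_folder: str, handle_page: Callable) -> int:
    """Write pages to output_folder, then visit them one by one."""
    # Lazy imports: these need pdf2image, Pillow and poppler to be installed.
    from pdf2image import convert_from_path
    from PIL import Image

    paths = convert_from_path(
        pdf_path, output_folder=output_folder, fmt="jpeg", paths_only=True
    )
    return for_each_image(paths, handle_page, Image.open)
```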

UnicodeEncodeError: 'ascii' codec can't encode characters in position 70-76: ordinal not in range(128)

There will be a problem when the file path contains Chinese characters.

PDF created with enscript + ghostscript are not processed

PDF created with enscript + ghostscript are not processed.

Unfortunately it seems to be caused by pdftoppm and not by the package itself.

pdf2image.convert_from_path :- [Error 2] The system cannot find the file specified

When I use the 'convert_from_path' method, the following error is returned:

import pdf2image
pdf2image.convert_from_path('C:/Users/ashish.singla/Desktop/Amazon_invoice.pdf')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\ashish.singla\AppData\Local\Continuum\anaconda2\lib\site-packages\pdf2image\pdf2image.py", line 29, in convert_from_path
page_count = __page_count(pdf_path, userpw)
File "C:\Users\ashish.singla\AppData\Local\Continuum\anaconda2\lib\site-packages\pdf2image\pdf2image.py", line 158, in __page_count
proc = Popen(["pdfinfo", pdf_path], stdout=PIPE, stderr=PIPE)
File "C:\Users\ashish.singla\AppData\Local\Continuum\anaconda2\lib\subprocess.py", line 390, in __init__
errread, errwrite)
File "C:\Users\ashish.singla\AppData\Local\Continuum\anaconda2\lib\subprocess.py", line 640, in _execute_child
startupinfo)
WindowsError: [Error 2] The system cannot find the file specified

Unable to understand.
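
The traceback shows Popen failing to start pdfinfo, which usually means the poppler executables are not on PATH. A quick check with the standard library (Python 3 only) might look like the sketch below. missing_poppler_tools is a hypothetical helper, and the tool list is taken from the README.

```python
import shutil
from typing import List, Optional

POPPLER_TOOLS = ("pdfinfo", "pdftoppm", "pdftocairo")


def missing_poppler_tools(poppler_path: Optional[str] = None) -> List[str]:
    """Return the poppler executables that cannot be found.

    With poppler_path=None the current PATH is searched; otherwise only
    the given directory (e.g. the poppler bin/ folder) is searched.
    """
    return [
        tool for tool in POPPLER_TOOLS
        if shutil.which(tool, path=poppler_path) is None
    ]
```

If the list is non-empty, add poppler's bin/ folder to PATH or pass it as poppler_path to convert_from_path, as described in the installation section.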

Version 0.1.9 doesn't include latest modifications

Hi @Belval ,

I was the one asking for a modification some days ago (#9).

Your latest modification was 3 days ago, but version 0.1.9 was released on 20/3/2018, so your latest modifications are not included in the latest version.

how to change the output image file name?

Hello! I am using your pdf2image and I find it a little bit uncomfortable that the file name of output image is kind of difficult to read and identify. Where can I find the place to modify the output file format?
Thank you very much!

Open PDF with password

Unable to open pdf with 2nd password

convert_from_path(sys.argv[1],userpw=sys.argv[2],dpi=300)

The library doesn't work for pdfs with large number of pages

I tried converting each page of a PDF to JPEG images, but the conversion does not happen. The PDF has 3248 pages. No error is thrown, but the execution goes idle.

pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

I use a Mac. I installed poppler according to the README and installed pdf2image with pip, but running the code fails with: pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

when first_page exceeds last page error is thrown

when running this code:

pages = convert_from_path("./" + a, dpi = 200, output_folder = dir_img, fmt = 'JPEG', 
		first_page = max(existem)+1)

I get the error

File "/xxx/.local/lib/python3.6/site-packages/pdf2image/pdf2image.py", line 81, in convert_from_path
    reminder = page_count % thread_count
ZeroDivisionError: integer division or modulo by zero

The library doesn't handle the possibility of first_page being greater than the number of pages in the PDF file. I was expecting it to do nothing and return an empty array.
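
Until the library handles this case, a caller-side guard is one option. This is a sketch with hypothetical helper names, using pdfinfo_from_path from the README for the page count (the "Pages" key is assumed to hold it).

```python
from typing import Optional, Tuple


def clamp_page_range(page_count: int, first_page: Optional[int] = None,
                     last_page: Optional[int] = None) -> Optional[Tuple[int, int]]:
    """Return a valid 1-based inclusive (first, last) pair, or None if empty."""
    first = 1 if first_page is None else max(first_page, 1)
    last = page_count if last_page is None else min(last_page, page_count)
    return (first, last) if first <= last else None


def convert_or_empty(pdf_path: str, first_page=None, last_page=None, **kwargs):
    """Like convert_from_path, but return [] when the page range is empty."""
    from pdf2image import convert_from_path, pdfinfo_from_path  # lazy import

    page_count = int(pdfinfo_from_path(pdf_path)["Pages"])
    page_range = clamp_page_range(page_count, first_page, last_page)
    if page_range is None:
        return []
    first, last = page_range
    return convert_from_path(pdf_path, first_page=first, last_page=last, **kwargs)
```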

Is there a way to skip disk r/w operations, and we can directly use the image object without saving it on disk.

I am currently working on an OCR project in which I use the pdf2image library to convert PDFs into images, and each converted image is fed into the Tesseract OCR engine, but the whole process takes a long time.
It could be optimized by avoiding disk I/O, i.e. by feeding the image object from pdf2image directly to the OCR engine without writing the image to disk.
Is it possible to do so, and if yes, how?
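
Since the conversion functions return PIL Image objects, those objects can be handed to the OCR engine directly without saving them yourself. A minimal sketch follows; ocr_pdf is a hypothetical helper, and the OCR function is injected so any engine that accepts a PIL image can be used.

```python
from typing import Callable, List, Optional


def ocr_pdf(pdf_bytes: bytes, ocr: Callable, convert: Optional[Callable] = None,
            dpi: int = 200) -> List[str]:
    """Convert a PDF held in memory and run `ocr` on each page image."""
    if convert is None:
        # Lazy import so the function can be exercised with a stand-in converter.
        from pdf2image import convert_from_bytes as convert
    return [ocr(page) for page in convert(pdf_bytes, dpi=dpi)]
```

With pytesseract installed, a call could look like ocr_pdf(data, pytesseract.image_to_string). Whether this is faster depends on where the time is actually spent, since the OCR step itself is often the slow part.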

__load_from_output_folder fails on MacOS because of .DS_Store file

On MacOS, whenever we use

convert_from_path
convert_from_bytes

functions and pass an output_folder, they end up calling the __load_from_output_folder function.

If we use an existing folder that has already been visited with the Finder app, the function fails, saying that .DS_Store is not a valid image.

The problem is that this file is created automatically by the macOS Finder app. My suggestion is that this can be fixed by passing a list of files to be ignored. If you agree, I can make the pull request.
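
As an illustration of the suggested fix, here is a sketch of filtering a directory listing down to image files. This is not the library's code: image_files_only and the extension set are assumptions.

```python
import os
from typing import Iterable, List

# Assumed set of extensions that poppler output files may use.
IMAGE_EXTENSIONS = {".ppm", ".pgm", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def image_files_only(filenames: Iterable[str]) -> List[str]:
    """Keep only visible files with a known image extension, sorted by name."""
    return sorted(
        name for name in filenames
        if not name.startswith(".")
        and os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS
    )
```

As a caller-side workaround, writing into a fresh tempfile.TemporaryDirectory(), as the README example does, also avoids stray files such as .DS_Store.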

Unable to convert pdf to image using pdf2image

def convert(files):
    pages = convert_from_path(files, 500)
    out_file = "ConvertedToImage.jpg"
    for page in pages:
        page.save(out_file, 'JPEG')
    return out_file

Above is the code snippet of the function. I am getting the following error while converting the PDF to an image:
Traceback (most recent call last):
File "C:\Users\DEll\AppData\Local\Programs\Python\Python36\lib\site-packages\pdf2image\pdf2image.py", line 190, in _page_count
return int(re.search(r'Pages:\s+(\d+)', out.decode("utf8", "ignore")).group(1))
AttributeError: 'NoneType' object has no attribute 'group'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "final.py", line 11, in
divide.send(in_file)
File "C:\Users\DEll\Downloads\Interns Py\divide.py", line 123, in send
toIm=convert(in_file)
File "C:\Users\DEll\Downloads\Interns Py\divide.py", line 86, in convert
pages = convert_from_path(files, 500)
File "C:\Users\DEll\AppData\Local\Programs\Python\Python36\lib\site-packages\pdf2image\pdf2image.py", line 46, in convert_from_path
page_count = _page_count(pdf_path, userpw)
File "C:\Users\DEll\AppData\Local\Programs\Python\Python36\lib\site-packages\pdf2image\pdf2image.py", line 192, in _page_count
raise PDFPageCountError('Unable to get page count. %s' % err.decode("utf8", "ignore"))
pdf2image.exceptions.PDFPageCountError: Unable to get page count. I/O Error: Couldn't open file 'pdfinfo': No error.

pdf2image outputs 1x1 blank image

Describe the bug
For some PDF files, convert_from_path and convert_from_bytes output a blank 1x1 PIL image. Interestingly, it works fine for very similar PDFs. The documents are mostly PDFs with one very long page. Any ideas?

To Reproduce
Steps to reproduce the behavior:

Unfortunately the pdfs I'm working on are confidential and I am not allowed to share

Expected behavior
I would expect to see a normal PIL image, as happened to other similar pdfs.

Screenshots
Output: [<PIL.PpmImagePlugin.PpmImageFile image mode=RGB size=1x1 at 0x7FCC6C3FA4A8>]. The number of pages is correct, the pdf is just one very long page.

Desktop (please complete the following information):

OS: Ubuntu 16.04

No such file or directory: 'pdftoppm'

Running through the example:

from pdf2image import convert_from_path
images = convert_from_path('PDFs/2625605.pdf')

Raises the error FileNotFoundError: [Errno 2] No such file or directory: 'pdftoppm'

I'm running on Mac, is there anything else I need to have installed?
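
The error means the pdftoppm executable could not be found on PATH; per the README, Mac users need poppler installed (for example with brew install poppler). A small sketch for checking this up front, using only the standard library. The function name missing_poppler_tools and its parameters are hypothetical and not part of pdf2image.

```python
import shutil

def missing_poppler_tools(tools=("pdftoppm", "pdfinfo"), search_path=None):
    """Return the names of tools that cannot be found on the search path.

    search_path=None means the current PATH environment variable is used.
    """
    return [t for t in tools if shutil.which(t, path=search_path) is None]
```

An empty list means both executables are reachable; otherwise, installing poppler or passing poppler_path to convert_from_path (as the README describes for Windows) are the options mentioned in the documentation.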

Not possible to catch fonts missing error.

Describe the bug
If fonts used within the PDF are not found, pdftoppm reports an error. Currently, there is no way to catch it, which can result in blank pages being created "successfully".

To Reproduce
Try to convert a pdf that contains a font that your system does not have installed.

Expected behavior
Suggesting you somehow return stderr to the user.
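
As a rough illustration of what returning stderr could look like, the sketch below runs a command and hands back its exit code together with the decoded stdout and stderr. It is a generic standard-library pattern, not pdf2image's code; run_and_capture is a hypothetical name, and a Python one-liner stands in for pdftoppm so the example runs anywhere.

```python
import subprocess
import sys

def run_and_capture(cmd):
    """Run cmd and return (returncode, stdout, stderr) decoded as text."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return (
        result.returncode,
        result.stdout.decode("utf8", "ignore"),
        result.stderr.decode("utf8", "ignore"),
    )

# Stand-in for a converter that succeeds but writes a warning to stderr.
code, out, err = run_and_capture(
    [sys.executable, "-c", "import sys; sys.stderr.write('font warning')"]
)
```

With the stderr text available, the caller can decide whether a zero exit code with warnings should be treated as a failure.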

RFC: Switching to pdftocairo as a backend instead of pdftoppm

It seems that there are two "competing" pdf conversion tools in the poppler projects. The first one being pdftoppm and the second pdftocairo.

Recently I came across pdftocairo and I realized that it has more features while using the exact same parameters, so the switch would be mostly painless. Furthermore, it is distributed as part of poppler-utils just like pdftoppm, so most, if not all, pdf2image users would already have it on their system.

Does anyone use pdf2image in a setup that would not allow for this change? I would like to know so I could put it in a "new" 2.0 version to avoid backward incompatible change.

Does not keep the order

The library does not keep the page order of the PDF file when I save the images by index.

with tempfile.TemporaryDirectory() as path:
     images = convert_from_path('/Users/mkal/Desktop/sample/sample.pdf', output_folder=path)
     for image in images:
         image.save("{}.jpg".format(images.index(image)),"JPEG")

The pages are not in order ... 🤔
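
One thing worth ruling out in the snippet above: list.index returns the position of the first element that compares equal, so two pages that compare equal would get the same index and overwrite each other's file. Whether this is the cause of the reported ordering is not established here. The sketch uses plain strings as stand-ins for page images, and page_filenames is a hypothetical helper.

```python
pages = ["cover", "blank", "blank", "end"]  # stand-ins for page images

# list.index() returns the first match, so equal pages collide.
by_index = [pages.index(p) for p in pages]

# enumerate() yields every position exactly once, in list order.
by_enumerate = [i for i, _ in enumerate(pages)]

def page_filenames(pages, pattern="{}.jpg"):
    """Build one file name per page from its position in the list."""
    return [pattern.format(i) for i, _ in enumerate(pages)]
```

In the original loop this would mean iterating with for i, image in enumerate(images) and saving to "{}.jpg".format(i).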

A directory name or file name that includes a number throws an error

Describe the bug
A directory name or file name that includes a number throws an error.

Expected behavior
If the PDF file is in a directory like the ones below, an error is thrown:
the path: digitalbook/pdf/temp/t/1/output.pdf or digitalbook/pdf/temp/t/1output.pdf
Unable to get page count. Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table

If the path is digitalbook/pdf/temp/t/output.pdf, it works.

Desktop (please complete the following information):

OS: Ubuntu 18.04.1
Version 1.4.0

Use Utf-8 to convert pdf to image

When I convert an invoice PDF to an image using the sample code, the output contains the invoice layout with numbers and characters but without any of the Chinese text.
I would like the PDF text to be handled as UTF-8, but I can't find a way to do that.
Is there a solution?
Thanks so much.

Error running in a docker with gunicorn: AttributeError: 'NoneType' object has no attribute 'fork_exec'

Hi,
The Docker container runs the application with gunicorn (19.8.1) and Flask, but while my code executes in the container, this error is raised when I try to open a PDF.
Anyone encountered the same problem?

File "/home/DUA/file_converter_manager.py", line 200, in build_img
img = pdf2image.convert_from_path(pdf_path, dpi=300)
File "/usr/local/lib/python3.5/dist-packages/pdf2image/pdf2image.py", line 29, in convert_from_path
page_count = __page_count(pdf_path, userpw)
File "/usr/local/lib/python3.5/dist-packages/pdf2image/pdf2image.py", line 158, in __page_count
proc = Popen(["pdfinfo", pdf_path], stdout=PIPE, stderr=PIPE)
File "/usr/lib/python3.5/subprocess.py", line 676, in __init__
restore_signals, start_new_session)
File "/usr/lib/python3.5/subprocess.py", line 1215, in _execute_child
self.pid = _posixsubprocess.fork_exec(
AttributeError: 'NoneType' object has no attribute 'fork_exec'

FileNotFoundError: [WinError 2] The system cannot find the file specified.

Hello, thank you for your script, but I get this error when I use pdf2image:

f_o = open(file_name_path, 'rb') # Debug shows that the file was read correctly
images = convert_from_bytes(f_o.read())

Complete error:

  File ".../test.py", line 21, in <module>
    images = convert_from_bytes(f_o.read())
  File "...\lib\site-packages\pdf2image\pdf2image.py", line 90, in convert_from_bytes
    return convert_from_path(f.name, dpi=dpi, output_folder=output_folder, first_page=first_page, last_page=last_page, fmt=fmt, thread_count=thread_count, userpw=userpw)
  File "...\lib\site-packages\pdf2image\pdf2image.py", line 29, in convert_from_path
    page_count = __page_count(pdf_path, userpw)
  File "...\lib\site-packages\pdf2image\pdf2image.py", line 158, in __page_count
    proc = Popen(["pdfinfo", pdf_path], stdout=PIPE, stderr=PIPE)
  File "...\lib\subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "...\lib\subprocess.py", line 997, in _execute_child
    startupinfo)

pdf2image is not trivially usable in AWS Lambda

Describe the bug
I'm trying to use this library in an AWS Lambda.

Exception: Unable to get page count. Is poppler installed and in PATH?

To Reproduce
Steps to reproduce the behavior:

Create a lambda in python that uses pdf2image
Run it

Expected behavior
Would be nice to have an official guide on how to make it work, as I'm still trying to figure out how to include it. I'm no python expert but I guess shipping the binary directly would be the best.
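
One way this is sometimes approached, sketched under assumptions: if the poppler binaries are bundled with the function (the directory /opt/poppler/bin below is purely hypothetical), their directory can be put on PATH before pdf2image is called, or passed through the poppler_path argument that the README describes. The helper name prepend_to_path is also hypothetical, and any shared libraries the binaries need would have to be bundled as well.

```python
import os

def prepend_to_path(directory, env=None):
    """Put directory at the front of PATH unless it is already listed."""
    env = os.environ if env is None else env
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.insert(0, directory)
    env["PATH"] = os.pathsep.join(parts)
    return env["PATH"]
```

Calling prepend_to_path("/opt/poppler/bin") once at import time would make pdfinfo and pdftoppm resolvable for later subprocess calls, provided the binaries actually exist at that location.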

OSError Thrown When Corrupt Image Is Generated

I have a script that uses pdf2image to process a large number of PDF files that are subsequently OCR'ed.

OSError: cannot identify image file seems to occur when pdf2image is trying to collect the outputted images:

File "/opt/conda/lib/python3.6/site-packages/pdf2image/pdf2image.py", line 196, in _load_from_output_folder
images.append(Image.open(os.path.join(output_folder, f)))

I was able to verify that the outputted image was unreadable on my local machine.

Here is the full Traceback:

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/opt/conda/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "<ipython-input-4-04311ac4d4d9>", line 9, in doOCR
    timings = file.ocr()
  File "<ipython-input-1-c1e57a508011>", line 186, in ocr
    images = convert_from_path(self.tempFilePath, output_folder=tempImageDirectory, fmt="tif")
  File "/opt/conda/lib/python3.6/site-packages/pdf2image/pdf2image.py", line 103, in convert_from_path
    images += _load_from_output_folder(output_folder, uid, in_memory=auto_temp_dir)
  File "/opt/conda/lib/python3.6/site-packages/pdf2image/pdf2image.py", line 196, in _load_from_output_folder
    images.append(Image.open(os.path.join(output_folder, f)))
  File "/opt/conda/lib/python3.6/site-packages/PIL/Image.py", line 2657, in open
    % (filename if filename else fp))
OSError: cannot identify image file '/storage/temp/74F13-0025_Radiometric Logs (S-3 to S-7).pdf/images/be3861fa-f942-4c95-9aa3-4ed16550fc48-4.tif'
"""

The above exception was the direct cause of the following exception:

OSError                                   Traceback (most recent call last)
<ipython-input-5-8c10a9d0814b> in <module>
      1 startTime_ = datetime.datetime.now()
      2 p = Pool(12)
----> 3 p.map(doOCR, filesToOCR)
      4 
      5 while not q.empty():

/opt/conda/lib/python3.6/multiprocessing/pool.py in map(self, func, iterable, chunksize)
    286         in a list that is returned.
    287         '''
--> 288         return self._map_async(func, iterable, mapstar, chunksize).get()
    289 
    290     def starmap(self, func, iterable, chunksize=None):

/opt/conda/lib/python3.6/multiprocessing/pool.py in get(self, timeout)
    668             return self._value
    669         else:
--> 670             raise self._value
    671 
    672     def _set(self, i, obj):

OSError: cannot identify image file '/storage/temp/someFile.tif'

Error 'pdfinfo'

Hello! I am using pdf2image to convert a PDF file to images, but I get the following error:
proc = Popen(["pdfinfo", pdf_path], stdout=PIPE, stderr=PIPE)
FileNotFoundError: [Errno 2] No such file or directory: 'pdfinfo'
According to the Popen() documentation, the first element of the argument list should be the path to an executable. Could you tell me where the 'pdfinfo' file is located?
Thank you very much!

FileNotFoundError: [Errno 2] No such file or directory: 'pdftoppm'

Hi I am getting this issue with the package

-- Python version 3.6.1

I am using it with flask micro framework

Here is the stack trace

Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
[2018-02-07 11:18:47,787] ERROR in app: Exception on / [GET]
Traceback (most recent call last):
File "/Applications/MAMP/htdocs/Projects/py-pdftoimage/venv/lib/python3.6/site-packages/flask/app.py", line 1982, in wsgi_app
response = self.full_dispatch_request()
File "/Applications/MAMP/htdocs/Projects/py-pdftoimage/venv/lib/python3.6/site-packages/flask/app.py", line 1614, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/Applications/MAMP/htdocs/Projects/py-pdftoimage/venv/lib/python3.6/site-packages/flask/app.py", line 1517, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/Applications/MAMP/htdocs/Projects/py-pdftoimage/venv/lib/python3.6/site-packages/flask/_compat.py", line 33, in reraise
raise value
File "/Applications/MAMP/htdocs/Projects/py-pdftoimage/venv/lib/python3.6/site-packages/flask/app.py", line 1612, in full_dispatch_request
rv = self.dispatch_request()
File "/Applications/MAMP/htdocs/Projects/py-pdftoimage/venv/lib/python3.6/site-packages/flask/app.py", line 1598, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/Applications/MAMP/htdocs/Projects/py-pdftoimage/py-pdftoimage.py", line 9, in main
images = convert_from_path('/static/pdf/p1000.pdf', output_folder='/static/pdf/')
File "/Applications/MAMP/htdocs/Projects/py-pdftoimage/venv/lib/python3.6/site-packages/pdf2image/pdf2image.py", line 23, in convert_from_path
proc = Popen(args, stdout=PIPE, stderr=PIPE)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 707, in __init__
restore_signals, start_new_session)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 1326, in _execute_child
raise child_exception_type(errno_num, err_msg)
FileNotFoundError: [Errno 2] No such file or directory: 'pdftoppm'
127.0.0.1 - - [07/Feb/2018 11:18:47] "GET / HTTP/1.1" 500 -

Temporary files are not closed properly

The temporary files are deleted, but not properly closed. This leads to a "too many open files" error when processing a large number of PDF files.

To Reproduce
Example Script:

from pathlib import Path
import tempfile
import pdf2image


pdf = Path('./test.pdf').read_bytes()

for i in range(5000):
    with tempfile.TemporaryFile() as tmp_file:
        page_images = pdf2image.convert_from_bytes(pdf, output_file=tmp_file)
        [img.close() for img in page_images]

See the leaked files during execution:
ls -l /proc/{{YOUR PID}}/fd

Expected behavior
The files should get closed properly.

System:

OS: Ubuntu 18.04
Environment: virtualenv with python 3.6.7

Possible fix:

def convert_from_bytes(...):
    fh, temp_filename = tempfile.mkstemp()
    try:
        with open(temp_filename, 'wb') as f:
            f.write(pdf_file)
            f.flush()
            return convert_from_path(...)
    finally:
        os.close(fh)
        os.remove(temp_filename)
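
A runnable variation on the idea above, using only the standard library. It differs from the proposed fix in one detail: the descriptor returned by mkstemp is wrapped with os.fdopen, so it is closed when the with-block exits and no second handle is opened. The name call_with_temp_file and the func callback are hypothetical; in pdf2image the callback's role would be played by the call to convert_from_path.

```python
import os
import tempfile

def call_with_temp_file(data, func, suffix=".pdf"):
    """Write data to a temporary file, call func(path), then clean up."""
    fd, temp_filename = tempfile.mkstemp(suffix=suffix)
    try:
        # os.fdopen takes ownership of fd; leaving the with-block closes it.
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        return func(temp_filename)
    finally:
        # The file is removed whether func succeeded or raised.
        os.remove(temp_filename)
```

Because the file is closed before func runs, the same path can be reopened by another process, which also matters on Windows.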
