pdftotext's Introduction

pdftotext

Simple PDF text extraction

import pdftotext

# Load your PDF
with open("lorem_ipsum.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# If it's password-protected
with open("secure.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, "secret")

# How many pages?
print(len(pdf))

# Iterate over all the pages
for page in pdf:
    print(page)

# Read some individual pages
print(pdf[0])
print(pdf[1])

# Read all the text into one string
print("\n\n".join(pdf))

OS Dependencies

These instructions assume you're using Python 3 on a recent OS. Package names may differ for Python 2 or for an older OS.

Debian, Ubuntu, and friends

sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev

Fedora, Red Hat, and friends

sudo yum install gcc-c++ pkgconfig poppler-cpp-devel python3-devel

macOS

brew install pkg-config poppler python

Windows

Currently tested only when using conda:

  • Install the Microsoft Visual C++ Build Tools
  • Install poppler through conda:
    conda install -c conda-forge poppler
    

Install

pip install pdftotext

pdftotext's People

Contributors

8w9ag, jalan, tirkarthi, woodsjs

pdftotext's Issues

error: command 'gcc' failed with exit status 1

Hi,

I'm having trouble installing pdftotext. I'm using Python 3.6 on Anaconda 5.2.0 and pip version 18.0. There seems to be a problem with gcc so I did conda install libgcc but that didn't make any difference. I also made sure python3-dev was installed.

john@john-Virtual-Machine:~/py3eg$ pip install pdftotext
Collecting pdftotext
  Using cached https://files.pythonhosted.org/packages/96/41/aa31f4a6809eb0574674d6c0cf6bc0e00aaf0ea53c62db8a2d9af50b7cc6/pdftotext-2.1.0.tar.gz
Building wheels for collected packages: pdftotext
  Running setup.py bdist_wheel for pdftotext ... error
  Complete output from command /home/john/anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-9uyu6ggf/pdftotext/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/pip-wheel-epbnqs4m --python-tag cp36:
  running bdist_wheel
  running build
  running build_ext
  building 'pdftotext' extension
  creating build
  creating build/temp.linux-x86_64-3.6
  gcc -pthread -B /home/john/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DPOPPLER_CPP_AT_LEAST_0_30_0=0 -I/home/john/anaconda3/include/python3.6m -c pdftotext.cpp -o build/temp.linux-x86_64-3.6/pdftotext.o -Wall
  cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
  pdftotext.cpp:3:10: fatal error: poppler/cpp/poppler-document.h: No such file or directory
   #include <poppler/cpp/poppler-document.h>
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  compilation terminated.
  error: command 'gcc' failed with exit status 1
  
  ----------------------------------------
  Failed building wheel for pdftotext
  Running setup.py clean for pdftotext
Failed to build pdftotext
Installing collected packages: pdftotext
  Running setup.py install for pdftotext ... error
    Complete output from command /home/john/anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-9uyu6ggf/pdftotext/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-sx0bea7r/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_ext
    building 'pdftotext' extension
    creating build
    creating build/temp.linux-x86_64-3.6
    gcc -pthread -B /home/john/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DPOPPLER_CPP_AT_LEAST_0_30_0=0 -I/home/john/anaconda3/include/python3.6m -c pdftotext.cpp -o build/temp.linux-x86_64-3.6/pdftotext.o -Wall
    cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
    pdftotext.cpp:3:10: fatal error: poppler/cpp/poppler-document.h: No such file or directory
     #include <poppler/cpp/poppler-document.h>
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    compilation terminated.
    error: command 'gcc' failed with exit status 1
    
    ----------------------------------------
Command "/home/john/anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-9uyu6ggf/pdftotext/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-sx0bea7r/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-install-9uyu6ggf/pdftotext/

Any help would be greatly appreciated.

Thanks!
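The fatal error in the log above says the compiler could not find poppler/cpp/poppler-document.h. A minimal sketch for checking, before running pip, whether pkg-config can see the Poppler C++ development files. It assumes the pkg-config module is named poppler-cpp; the function name poppler_cpp_visible is hypothetical, and a positive result does not guarantee that a conda-provided compiler will search the same paths.

```python
import shutil
import subprocess


def poppler_cpp_visible(package="poppler-cpp"):
    """Return True or False depending on whether pkg-config finds `package`.

    Returns None when pkg-config itself is not installed, so the caller
    can tell "not found" apart from "could not check".
    """
    if shutil.which("pkg-config") is None:
        return None
    # `pkg-config --exists NAME` exits with status 0 only if NAME is known.
    result = subprocess.run(["pkg-config", "--exists", package])
    return result.returncode == 0


if __name__ == "__main__":
    status = poppler_cpp_visible()
    if status is None:
        print("pkg-config is not installed; cannot check for poppler-cpp")
    elif status:
        print("poppler-cpp development files are visible to pkg-config")
    else:
        print("poppler-cpp not found; install the OS dependencies first")
```

If the check reports that poppler-cpp is missing, the OS Dependencies section lists the packages to install for each platform.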

Issue installing pdftotext

Hi, I am trying to pip install pdftotext on Mac (Mojave 10.14) but I keep getting the following error:

(base) C02RQ3W9G8WP:shull_analysis arnav.gulati$ pip install pdftotext

Collecting pdftotext
  Using cached https://files.pythonhosted.org/packages/21/35/60094dbadd9de2035873390b1cac25e01da605844eba6a07a53a82fa4adc/pdftotext-2.1.1.tar.gz
Building wheels for collected packages: pdftotext
  Building wheel for pdftotext (setup.py) ... error
  Complete output from command /anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/n2/cl8vfmpn54n9x4n21ltfd6b4m9kd3d/T/pip-install-tgpv1h8j/pdftotext/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /private/var/folders/n2/cl8vfmpn54n9x4n21ltfd6b4m9kd3d/T/pip-wheel-rrczn_b8 --python-tag cp37:
  running bdist_wheel
  running build
  running build_ext
  building 'pdftotext' extension
  creating build
  creating build/temp.macosx-10.9-x86_64-3.7
  x86_64-apple-darwin13.4.0-clang -DNDEBUG -fwrapv -O3 -Wall -Wstrict-prototypes -march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe -D_FORTIFY_SOURCE=2 -mmacosx-version-min=10.9 -DPOPPLER_CPP_AT_LEAST_0_30_0=1 -I/anaconda3/include/python3.7m -c pdftotext.cpp -o build/temp.macosx-10.9-x86_64-3.7/pdftotext.o -Wall -mmacosx-version-min=10.9
  pdftotext.cpp:3:10: fatal error: 'poppler/cpp/poppler-document.h' file not found
  #include <poppler/cpp/poppler-document.h>
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  1 error generated.
  error: command 'x86_64-apple-darwin13.4.0-clang' failed with exit status 1
  
  ----------------------------------------
  Failed building wheel for pdftotext
  Running setup.py clean for pdftotext
Failed to build pdftotext
Installing collected packages: pdftotext
  Running setup.py install for pdftotext ... error
    Complete output from command /anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/n2/cl8vfmpn54n9x4n21ltfd6b4m9kd3d/T/pip-install-tgpv1h8j/pdftotext/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/n2/cl8vfmpn54n9x4n21ltfd6b4m9kd3d/T/pip-record-ghe90p4m/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_ext
    building 'pdftotext' extension
    creating build
    creating build/temp.macosx-10.9-x86_64-3.7
    x86_64-apple-darwin13.4.0-clang -DNDEBUG -fwrapv -O3 -Wall -Wstrict-prototypes -march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe -D_FORTIFY_SOURCE=2 -mmacosx-version-min=10.9 -DPOPPLER_CPP_AT_LEAST_0_30_0=1 -I/anaconda3/include/python3.7m -c pdftotext.cpp -o build/temp.macosx-10.9-x86_64-3.7/pdftotext.o -Wall -mmacosx-version-min=10.9
    pdftotext.cpp:3:10: fatal error: 'poppler/cpp/poppler-document.h' file not found
    #include <poppler/cpp/poppler-document.h>
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    1 error generated.
    error: command 'x86_64-apple-darwin13.4.0-clang' failed with exit status 1
    
    ----------------------------------------
Command "/anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/n2/cl8vfmpn54n9x4n21ltfd6b4m9kd3d/T/pip-install-tgpv1h8j/pdftotext/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/n2/cl8vfmpn54n9x4n21ltfd6b4m9kd3d/T/pip-record-ghe90p4m/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/n2/cl8vfmpn54n9x4n21ltfd6b4m9kd3d/T/pip-install-tgpv1h8j/pdftotext/

I tried brew installing poppler as suggested in the readme to no avail.
Anyone have any suggestions?

Set C++ version based on Poppler version?

Poppler 0.69 and later claim to require C++14. The build works without any effort on common systems, but some systems need help. For example, anaconda on macOS might say

error: expected ‘,’ or ‘...’ before ‘&&’ token

fails on latin-1 encoded pdfs

If you try and access a latin-1 encoded pdf it gives the following error,

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

Poppler supports Latin-1 encodings in its API, but apparently this isn't implemented in pdftotext. I would appreciate it if support could be added.
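The error message itself can be reproduced outside pdftotext: a byte such as 0xe2 is a valid Latin-1 character but, when followed by an ordinary ASCII byte, is not valid UTF-8. The sketch below only illustrates that mismatch with plain Python bytes. It is not how pdftotext decodes text internally, and decode_with_fallback is a hypothetical helper name.

```python
def decode_with_fallback(data, encodings=("utf-8", "latin-1")):
    """Try each encoding in order and return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            # This encoding cannot represent the bytes; try the next one.
            continue
    raise ValueError("none of the encodings could decode the data")


if __name__ == "__main__":
    latin1_bytes = "château".encode("latin-1")  # contains the byte 0xe2
    try:
        latin1_bytes.decode("utf-8")
    except UnicodeDecodeError as exc:
        print("strict UTF-8 decoding fails:", exc)
    print(decode_with_fallback(latin1_bytes))
```

Because Latin-1 maps every byte to a character, the fallback never fails, which also means it cannot detect text that is in some other encoding.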

Raw parameter

I really need to pass the raw parameter to pdftotext, because diagonal text is ruining the extracted text.

Bounding boxes

The Poppler command-line tool pdftotext has a layout option allowing specific extraction based on bounding boxes. Can that feature be integrated into this?

Stop using `python setup.py test`

python setup.py test now results in a deprecation warning:

WARNING: Testing via this command is deprecated and will be removed in a future version. Users looking for a generic test entry point independent of test runner are encouraged to use tox.

Who knows when they will actually remove it 🤷‍♂️

Prebuilt binaries

(I'm aware that #16 already exists; I thought it would be nice to lay out a few reasons in an organized fashion.)

This PDF library is, in my experience, the best in the business. PDFMiner, with all due respect, is slow, inaccurate, and inconsistent, making it impossible in some cases to use reliably. Other XPDF/Poppler bindings are outdated and abandoned. Other workarounds (such as those mentioned in #16) are plagued with some of the same issues (mainly inaccuracy).

This is where pdftotext comes in handy. It's fast and gives accurate results. The only problem is that there's a pretty high barrier to using this package. Developers must install a few packages on a Linux system before this package can be built and installed. Windows users, on the other hand, are left with no clue how to install it. This could all be mitigated with prebuilt binaries for Windows, and for other platforms too.

Has the -layout argument been removed?

This function is about 40x faster than anything else I've tried but it interlaces columns. I've read that there is a 'layout' argument that fixes this but it doesn't figure in the help documentation. Is it available anywhere?

Thanks!

Can't install on MacOS via pip

Running pip, with or without su, on MacOS produces the following error:

Command "/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python -u -c "import setuptools, tokenize;file='/private/var/folders/cm/60_4h2mj23d_70fhqwvtjf7m0000gn/T/pip-build-bd88s9/pdftotext/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /var/folders/cm/60_4h2mj23d_70fhqwvtjf7m0000gn/T/pip-eNla3s-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/cm/60_4h2mj23d_70fhqwvtjf7m0000gn/T/pip-build-bd88s9/pdftotext/

This library is DOA until the dependency issue is resolved.

Can't pip install on Mac

Hey, when I run the pip command it gives me the following error:
ERROR: Could not find a version that satisfies the requirement pdftotext (from versions: none)
ERROR: No matching distribution found for pdftotext

Is there another way to install it, or a way to fix this?

CentOS Specific Install Instructions

To install on CentOS:

Following the instructions from this link

On CentOS

On CentOS the libpoppler-cpp library is not included with the system so we need to build from source. Note that recent versions of poppler require C++11 which is not available on CentOS, so we build a slightly older version of libpoppler.

# Build dependencies
yum install wget xz libjpeg-devel openjpeg2-devel

# Download and extract
wget https://poppler.freedesktop.org/poppler-0.47.0.tar.xz
tar -Jxvf poppler-0.47.0.tar.xz
cd poppler-0.47.0

# Build and install
./configure
make
sudo make install

By default libraries get installed in /usr/local/lib and /usr/local/include. On CentOS this is not a default search path, so we need to set PKG_CONFIG_PATH and LD_LIBRARY_PATH to point the build to the right directory:

export LD_LIBRARY_PATH="/usr/local/lib"
export PKG_CONFIG_PATH="/usr/local/lib/pkgconfig"

I am getting the following errors in High Sierra (MacOS) when I try to do `python setup.py install`. Could you please fix?

/usr/local/include/poppler/cpp/poppler-global.h:53:40: warning: deleted function
      definitions are a C++11 extension [-Wc++11-extensions]
    noncopyable(const noncopyable &) = delete;
                                       ^
/usr/local/include/poppler/cpp/poppler-global.h:54:57: warning: deleted function
      definitions are a C++11 extension [-Wc++11-extensions]
    const noncopyable& operator=(const noncopyable &) = delete;
                                                        ^
In file included from pdftotext.cpp:5:
/usr/local/include/poppler/cpp/poppler-page.h:39:22: warning: rvalue references
      are a C++11 extension [-Wc++11-extensions]
    text_box(text_box&&) noexcept;
                     ^
/usr/local/include/poppler/cpp/poppler-page.h:39:25: error: expected ';' at end
      of declaration list
    text_box(text_box&&) noexcept;
                        ^
                        ;
/usr/local/include/poppler/cpp/poppler-page.h:40:33: warning: rvalue references
      are a C++11 extension [-Wc++11-extensions]
    text_box& operator=(text_box&&) noexcept;
                                ^
/usr/local/include/poppler/cpp/poppler-page.h:40:36: error: expected ';' at end
      of declaration list
    text_box& operator=(text_box&&) noexcept;
                                   ^
                                   ;
4 warnings and 2 errors generated.

Unable to install poppler-cpp-devel and python-devel on RHEL 8

Hi Jalan,
As per the official documentation on the pdftotext webpage, the libraries below must be installed on Red Hat:
sudo yum install gcc-c++ pkgconfig poppler-cpp-devel python-devel redhat-rpm-config

Unfortunately, I was unable to install poppler-cpp-devel and python-devel and received the message below:
sudo yum install poppler-cpp-devel python-devel
Red Hat Update Infrastructure 3 Client Configur 7.4 kB/s | 2.1 kB 00:00
Red Hat Enterprise Linux 8 for x86_64 - AppStre 8.1 kB/s | 2.8 kB 00:00
Red Hat Enterprise Linux 8 for x86_64 - BaseOS 7.5 kB/s | 2.3 kB 00:00
No match for argument: poppler-cpp-devel
No match for argument: python-devel
Error: Unable to find a match

I tried to install poppler from its official Linux webpage, which is given below, and went recursively and installed almost 10 dependencies such as cmake, libarchive, fontconfig, and the list went on.
http://www.linuxfromscratch.org/blfs/view/svn/general/poppler.html

Finally, I have reached a stage where I get the error below while installing pdftotext through pip3:
pip3 install pdftotext
creating build/lib.linux-x86_64-3.6
g++ -pthread -shared -Wl,-z,relro -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -g build/temp.linux-x86_64-3.6/pdftotext.o -L/usr/lib64 -lpoppler-cpp -lpython3.6m -o build/lib.linux-x86_64-3.6/pdftotext.cpython-36m-x86_64-linux-gnu.so
/usr/bin/ld: cannot find -lpoppler-cpp
collect2: error: ld returned 1 exit status
error: command 'g++' failed with exit status 1

I am unable to understand the above error and don't know how many more I will face, as I have been working on this for 5 days.
It would be great if you could list the libraries that are required for pdftotext and can be installed on Red Hat directly, or without causing many errors.
We will be really grateful for your response.

pdftotext.Error: Poppler error creating document

This happens while using pdftotext with the multiprocessing module on EC2:

('read pdf file', '1004.5293.pdf')
Traceback (most recent call last):
  File "main.py", line 44, in <module>
    result = pool.map(pdf_extract, filenames)
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
pdftotext.Error: Poppler error creating document

My code:

import os
import time

import pdftotext

# `have` (names of already-extracted .txt files) and `txt_path` (output
# directory prefix) are globals that are not shown in this report.
def pdf_extract(dirs):
    paths, filename = dirs
    file = filename.replace(".pdf", ".txt")
    if file in have:
        print("file already extracted!!")
    else:
        print("read pdf file", filename)
        with open(os.path.join(paths, filename), "rb") as f:
            pdf = pdftotext.PDF(f)
            print(len(pdf))
        text = "\n\n".join(pdf)
        print("converted file")
        # The with block closes the file, so no explicit close() is needed.
        with open(txt_path + file, "w") as f:
            f.write(text)
        print("saved file")
        time.sleep(0.01)

Link : arxiv paper
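With pool.map, the first exception raised in any worker aborts the whole batch, as the traceback above shows. A minimal sketch of a wrapper that records the failure per file instead, so the run can finish and the failing PDFs can be listed. It does not diagnose why Poppler failed to create the document, and safe_call is a hypothetical helper name.

```python
def safe_call(func, item):
    """Run func(item) and return (item, result, error) instead of raising.

    `error` is None on success; on failure it holds the exception type
    and message as a string, which is safe to send back from a worker.
    """
    try:
        return item, func(item), None
    except Exception as exc:  # pdftotext.Error should be caught here too
        return item, None, "{}: {}".format(type(exc).__name__, exc)


def failures(results):
    """Keep only the (item, error) pairs for calls that failed."""
    return [(item, error) for item, _, error in results if error is not None]
```

In the reporter's script this could be wired in as pool.map(functools.partial(safe_call, pdf_extract), filenames), after which failures(...) on the returned list names the files that raised.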

Deploying on AWS Lambda

Hi Jalan,

Thanks for a wonderful library. I am trying to deploy a small Python 3.7 application which uses pdftotext on AWS Lambda. I was able to run it successfully on my local machine (Mac). I then followed the AWS documentation on creating the package with a virtual environment. However, I am still getting a module-not-found error. By any chance, is there a complete build package with all the dependent packages (poppler, et al.) that I can use in AWS Lambda? If so, can you please share the location? Many thanks in advance.

Regards,
Vaidya.

Can't import on MacOs

pkg-config and poppler are installed via brew. Then pdftotext is installed via pip.
When I tried to import it in a Jupyter Notebook (conda env, python3) :

ImportError                               Traceback (most recent call last)
<ipython-input-104-46fa7238b159> in <module>()
----> 1 import pdftotext

ImportError: dlopen(/Users/[username]/miniconda3/envs/[env-name]/lib/python3.6/site-packages/pdftotext.cpython-36m-darwin.so, 2): Symbol not found: __ZN7poppler24set_debug_error_functionEPFvRKSsPvES2_
  Referenced from: /Users/[username]/miniconda3/envs/[env-name]/lib/python3.6/site-packages/pdftotext.cpython-36m-darwin.so
  Expected in: flat namespace
 in /Users/[username]/miniconda3/envs/[env-name]/lib/python3.6/site-packages/pdftotext.cpython-36m-darwin.so

Installing and working with anaconda

Hi,
Please add the following command to the Anaconda part of the instructions; otherwise importing pdftotext in a Jupyter notebook results in errors:

conda install libgcc

Thanks for your library; it works very well and does the job perfectly.

Add sources for PDF test files

Thanks for providing this great module! I would like to package it for the Debian archive and there is a minor issue when running the tests during package build: Debian packages should have the source of all its files in their modifiable format. For PDF this is usually something like a TeX file or similar. Would you be able to provide those formats along with the PDFs for the tests in your repo? Otherwise I'll have to exclude the tests from the Debian package which is not the best solution I guess.

TypeError: 'pdftotext.PDF' object has no attribute '__getitem__'

This does not work:

import pdftotext

def get_text(filepath, page=None):
    """
    Extract text from a PDF

    Parameters
    ----------
    filepath : str
        Path to a PDF file
    page : int or None

    Returns
    -------
    text : str
    """
    with open(filepath) as f:
        pdf = pdftotext.PDF(f)
    if page is not None:
        text = pdf[page]
    else:
        text = pdf.read_all()
    return text

It returns:

TypeError: 'pdftotext.PDF' object has no attribute '__getitem__'
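For comparison, the usage shown in the introduction opens the file in binary mode, indexes pages directly, and joins the pages to get the whole text. A minimal sketch along those lines, assuming an installed pdftotext version that supports that interface (the version in this report apparently did not support indexing). The split into load_pdf and get_text is illustrative only.

```python
def load_pdf(filepath):
    """Open a PDF with pdftotext, reading the file in binary mode."""
    # Imported here so that get_text below can be used without pdftotext.
    import pdftotext

    with open(filepath, "rb") as f:
        return pdftotext.PDF(f)


def get_text(pdf, page=None):
    """Return the text of one page, or of all pages joined together.

    `pdf` can be any sequence of page strings, such as the object
    returned by load_pdf. `page` is a zero-based index or None.
    """
    if page is not None:
        return pdf[page]
    return "\n\n".join(pdf)
```

For example, get_text(load_pdf("lorem_ipsum.pdf"), page=0) would return the first page, and omitting page would return the whole document as one string.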

Unknown symbols found in scientific literature

Hi,
I'm trying to convert a PDF from a scientific journal to a text file, but many of the characters such as the "~" sign and Greek symbols do not get read/converted correctly. For example:

Original text:
[Image: Screen Shot 2019-04-17 at 12 40 19 PM]

Converted text:
[Image: Screen Shot 2019-04-17 at 12 39 50 PM]

Is there a straightforward fix for this?

Thanks!

Error installing pdftotext

Collecting pdftotext
Using cached https://files.pythonhosted.org/packages/21/35/60094dbadd9de2035873390b1cac25e01da605844eba6a07a53a82fa4adc/pdftotext-2.1.1.tar.gz
Building wheels for collected packages: pdftotext
Building wheel for pdftotext (setup.py): started
Building wheel for pdftotext (setup.py): finished with status 'error'
Complete output from command C:\Users\huang\AppData\Local\Continuum\anaconda3\python.exe -u -c "import setuptools, tokenize;file='C:\Users\huang\AppData\Local\Temp\pip-install-amj4lgms\pdftotext\setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" bdist_wheel -d C:\Users\huang\AppData\Local\Temp\pip-wheel-suicv9j_ --python-tag cp37:
WARNING: pkg-config not found--guessing at poppler version.
If the build fails, install pkg-config and try again.
running bdist_wheel
running build
running build_ext
building 'pdftotext' extension
creating build
creating build\temp.win-amd64-3.7
creating build\temp.win-amd64-3.7\Release
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -DPOPPLER_CPP_AT_LEAST_0_30_0=1 -IC:\Users\huang\AppData\Local\Continuum\anaconda3\include -IC:\Users\huang\AppData\Local\Continuum\anaconda3\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\8.1\include\shared" "-IC:\Program Files (x86)\Windows Kits\8.1\include\um" "-IC:\Program Files (x86)\Windows Kits\8.1\include\winrt" /EHsc /Tppdftotext.cpp /Fobuild\temp.win-amd64-3.7\Release\pdftotext.obj -Wall
pdftotext.cpp
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\corecrt_io.h(49): warning C4820: '_finddata32i64_t': '4' bytes padding added after data member '_finddata32i64_t::name'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\corecrt_io.h(54): warning C4820: '_finddata64i32_t': '4' bytes padding added after data member '_finddata64i32_t::attrib'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\corecrt_io.h(64): warning C4820: '__finddata64_t': '4' bytes padding added after data member '__finddata64_t::attrib'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\corecrt_io.h(69): warning C4820: '__finddata64_t': '4' bytes padding added after data member '__finddata64_t::name'
C:\Program Files (x86)\Windows Kits\8.1\include\shared\basetsd.h(418): warning C4668: '_WIN32_WINNT' is not defined as a preprocessor macro, replacing with '0' for '#if/#elif'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\time.h(35): warning C4820: '_timespec64': '4' bytes padding added after data member '_timespec64::tv_nsec'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\time.h(42): warning C4820: 'timespec': '4' bytes padding added after data member 'timespec::tv_nsec'
c:\users\huang\appdata\local\continuum\anaconda3\include\object.h(381): warning C4820: '_typeobject': '4' bytes padding added after data member '_typeobject::tp_flags'
c:\users\huang\appdata\local\continuum\anaconda3\include\object.h(425): warning C4820: '_typeobject': '4' bytes padding added after data member '_typeobject::tp_version_tag'
c:\users\huang\appdata\local\continuum\anaconda3\include\object.h(440): warning C4820: '': '4' bytes padding added after data member '::slot'
c:\users\huang\appdata\local\continuum\anaconda3\include\object.h(448): warning C4820: '': '4' bytes padding added after data member '::flags'
c:\users\huang\appdata\local\continuum\anaconda3\include\bytearrayobject.h(30): warning C4820: '': '4' bytes padding added after data member '::ob_exports'
c:\users\huang\appdata\local\continuum\anaconda3\include\bytesobject.h(41): warning C4820: '': '7' bytes padding added after data member '::ob_sval'
c:\users\huang\appdata\local\continuum\anaconda3\include\bytesobject.h(165): warning C4820: '': '4' bytes padding added after data member '::small_buffer'
c:\users\huang\appdata\local\continuum\anaconda3\include\unicodeobject.h(330): warning C4820: '': '4' bytes padding added after data member '::state'
c:\users\huang\appdata\local\continuum\anaconda3\include\unicodeobject.h(905): warning C4820: '': '2' bytes padding added after data member '::readonly'
c:\users\huang\appdata\local\continuum\anaconda3\include\longintrepr.h(88): warning C4820: '_longobject': '4' bytes padding added after data member '_longobject::ob_digit'
c:\users\huang\appdata\local\continuum\anaconda3\include\memoryobject.h(45): warning C4820: '': '4' bytes padding added after data member '::flags'
c:\users\huang\appdata\local\continuum\anaconda3\include\memoryobject.h(62): warning C4820: '': '4' bytes padding added after data member '::flags'
c:\users\huang\appdata\local\continuum\anaconda3\include\methodobject.h(61): warning C4820: 'PyMethodDef': '4' bytes padding added after data member 'PyMethodDef::ml_flags'
c:\users\huang\appdata\local\continuum\anaconda3\include\moduleobject.h(62): warning C4820: 'PyModuleDef_Slot': '4' bytes padding added after data member 'PyModuleDef_Slot::slot'
c:\users\huang\appdata\local\continuum\anaconda3\include\pystate.h(48): warning C4820: '': '4' bytes padding added after data member '::utf8_mode'
c:\users\huang\appdata\local\continuum\anaconda3\include\pystate.h(51): warning C4820: '': '4' bytes padding added after data member '::argc'
c:\users\huang\appdata\local\continuum\anaconda3\include\pystate.h(55): warning C4820: '': '4' bytes padding added after data member '::nxoption'
c:\users\huang\appdata\local\continuum\anaconda3\include\pystate.h(58): warning C4820: '': '4' bytes padding added after data member '::nwarnoption'
c:\users\huang\appdata\local\continuum\anaconda3\include\pystate.h(68): warning C4820: '': '4' bytes padding added after data member '::nmodule_search_path'
c:\users\huang\appdata\local\continuum\anaconda3\include\pystate.h(77): warning C4820: '': '4' bytes padding added after data member '::_disable_importlib'
c:\users\huang\appdata\local\continuum\anaconda3\include\pystate.h(98): warning C4820: '': '4' bytes padding added after data member '::install_signal_handlers'
c:\users\huang\appdata\local\continuum\anaconda3\include\pystate.h(225): warning C4820: '_ts': '2' bytes padding added after data member '_ts::recursion_critical'
c:\users\huang\appdata\local\continuum\anaconda3\include\pystate.h(233): warning C4820: '_ts': '4' bytes padding added after data member '_ts::use_tracing'
c:\users\huang\appdata\local\continuum\anaconda3\include\pystate.h(256): warning C4820: '_ts': '4' bytes padding added after data member '_ts::gilstate_counter'
c:\users\huang\appdata\local\continuum\anaconda3\include\pystate.h(290): warning C4820: '_ts': '4' bytes padding added after data member '_ts::coroutine_origin_tracking_depth'
c:\users\huang\appdata\local\continuum\anaconda3\include\pystate.h(293): warning C4820: '_ts': '4' bytes padding added after data member '_ts::in_coroutine_wrapper'
c:\users\huang\appdata\local\continuum\anaconda3\include\genobject.h(33): warning C4820: '': '7' bytes padding added after data member '::gi_running'
c:\users\huang\appdata\local\continuum\anaconda3\include\genobject.h(53): warning C4820: '': '7' bytes padding added after data member '::cr_running'
c:\users\huang\appdata\local\continuum\anaconda3\include\genobject.h(70): warning C4820: '': '7' bytes padding added after data member '::ag_running'
c:\users\huang\appdata\local\continuum\anaconda3\include\descrobject.h(29): warning C4820: 'wrapperbase': '4' bytes padding added after data member 'wrapperbase::offset'
c:\users\huang\appdata\local\continuum\anaconda3\include\descrobject.h(33): warning C4820: 'wrapperbase': '4' bytes padding added after data member 'wrapperbase::flags'
c:\users\huang\appdata\local\continuum\anaconda3\include\structseq.h(20): warning C4820: 'PyStructSequence_Desc': '4' bytes padding added after data member 'PyStructSequence_Desc::n_in_sequence'
c:\users\huang\appdata\local\continuum\anaconda3\include\pyerrors.h(18): warning C4820: '': '7' bytes padding added after data member '::suppress_context'
c:\users\huang\appdata\local\continuum\anaconda3\include\pyerrors.h(22): warning C4820: '': '7' bytes padding added after data member '::suppress_context'
c:\users\huang\appdata\local\continuum\anaconda3\include\pyerrors.h(32): warning C4820: '': '7' bytes padding added after data member '::suppress_context'
c:\users\huang\appdata\local\continuum\anaconda3\include\pyerrors.h(39): warning C4820: '': '7' bytes padding added after data member '::suppress_context'
c:\users\huang\appdata\local\continuum\anaconda3\include\pyerrors.h(48): warning C4820: '': '7' bytes padding added after data member '::suppress_context'
c:\users\huang\appdata\local\continuum\anaconda3\include\pyerrors.h(53): warning C4820: '': '7' bytes padding added after data member '::suppress_context'
c:\users\huang\appdata\local\continuum\anaconda3\include\pyerrors.h(65): warning C4820: '': '7' bytes padding added after data member '::suppress_context'
c:\users\huang\appdata\local\continuum\anaconda3\include\modsupport.h(90): warning C4820: '_PyArg_Parser': '4' bytes padding added after data member '_PyArg_Parser::max'
c:\users\huang\appdata\local\continuum\anaconda3\include\pylifecycle.h(15): warning C4820: '': '4' bytes padding added after data member '::user_err'
c:\users\huang\appdata\local\continuum\anaconda3\include\import.h(140): warning C4820: '_frozen': '4' bytes padding added after data member '_frozen::size'
c:\users\huang\appdata\local\continuum\anaconda3\include\fileutils.h(79): warning C4820: '_Py_stat_struct': '4' bytes padding added after data member '_Py_stat_struct::st_dev'
c:\users\huang\appdata\local\continuum\anaconda3\include\fileutils.h(81): warning C4820: '_Py_stat_struct': '2' bytes padding added after data member '_Py_stat_struct::st_mode'
c:\users\huang\appdata\local\continuum\anaconda3\include\fileutils.h(85): warning C4820: '_Py_stat_struct': '4' bytes padding added after data member '_Py_stat_struct::st_rdev'
c:\users\huang\appdata\local\continuum\anaconda3\include\fileutils.h(88): warning C4820: '_Py_stat_struct': '4' bytes padding added after data member '_Py_stat_struct::st_atime_nsec'
c:\users\huang\appdata\local\continuum\anaconda3\include\fileutils.h(90): warning C4820: '_Py_stat_struct': '4' bytes padding added after data member '_Py_stat_struct::st_mtime_nsec'
pdftotext.cpp(3): fatal error C1083: Cannot open include file: 'poppler/cpp/poppler-document.h': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe' failed with exit status 2


Running setup.py clean for pdftotext
Failed to build pdftotext
Installing collected packages: pdftotext
Running setup.py install for pdftotext: started
Running setup.py install for pdftotext: finished with status 'error'
Complete output from command C:\Users\huang\AppData\Local\Continuum\anaconda3\python.exe -u -c "import setuptools, tokenize;file='C:\Users\huang\AppData\Local\Temp\pip-install-amj4lgms\pdftotext\setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record C:\Users\huang\AppData\Local\Temp\pip-record-w5c0rhy9\install-record.txt --single-version-externally-managed --compile:
WARNING: pkg-config not found--guessing at poppler version.
If the build fails, install pkg-config and try again.
running install
running build
running build_ext
building 'pdftotext' extension
creating build
creating build\temp.win-amd64-3.7
creating build\temp.win-amd64-3.7\Release
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -DPOPPLER_CPP_AT_LEAST_0_30_0=1 -IC:\Users\huang\AppData\Local\Continuum\anaconda3\include -IC:\Users\huang\AppData\Local\Continuum\anaconda3\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\8.1\include\shared" "-IC:\Program Files (x86)\Windows Kits\8.1\include\um" "-IC:\Program Files (x86)\Windows Kits\8.1\include\winrt" /EHsc /Tppdftotext.cpp /Fobuild\temp.win-amd64-3.7\Release\pdftotext.obj -Wall
pdftotext.cpp
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\corecrt_io.h(49): warning C4820: '_finddata32i64_t': '4' bytes padding added after data member '_finddata32i64_t::name'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\corecrt_io.h(54): warning C4820: '_finddata64i32_t': '4' bytes padding added after data member '_finddata64i32_t::attrib'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\corecrt_io.h(64): warning C4820: '__finddata64_t': '4' bytes padding added after data member '__finddata64_t::attrib'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\corecrt_io.h(69): warning C4820: '__finddata64_t': '4' bytes padding added after data member '__finddata64_t::name'
C:\Program Files (x86)\Windows Kits\8.1\include\shared\basetsd.h(418): warning C4668: '_WIN32_WINNT' is not defined as a preprocessor macro, replacing with '0' for '#if/#elif'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\time.h(35): warning C4820: '_timespec64': '4' bytes padding added after data member '_timespec64::tv_nsec'
C:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt\time.h(42): warning C4820: 'timespec': '4' bytes padding added after data member 'timespec::tv_nsec'
c:\users\huang\appdata\local\continuum\anaconda3\include\object.h(381): warning C4820: '_typeobject': '4' bytes padding added after data member '_typeobject::tp_flags'
c:\users\huang\appdata\local\continuum\anaconda3\include\object.h(425): warning C4820: '_typeobject': '4' bytes padding added after data member '_typeobject::tp_version_tag'
c:\users\huang\appdata\local\continuum\anaconda3\include\object.h(440): warning C4820: '': '4' bytes padding added after data member '::slot'
c:\users\huang\appdata\local\continuum\anaconda3\include\object.h(448): warning C4820: '': '4' bytes padding added after data member '::flags'
c:\users\huang\appdata\local\continuum\anaconda3\include\bytearrayobject.h(30): warning C4820: '': '4' bytes padding added after data member '::ob_exports'
c:\users\huang\appdata\local\continuum\anaconda3\include\bytesobject.h(41): warning C4820: '': '7' bytes padding added after data member '::ob_sval'
c:\users\huang\appdata\local\continuum\anaconda3\include\bytesobject.h(165): warning C4820: '': '4' bytes padding added after data member '::small_buffer'
c:\users\huang\appdata\local\continuum\anaconda3\include\unicodeobject.h(330): warning C4820: '': '4' bytes padding added after data member '::state'
c:\users\huang\appdata\local\continuum\anaconda3\include\unicodeobject.h(905): warning C4820: '': '2' bytes padding added after data member '::readonly'
c:\users\huang\appdata\local\continuum\anaconda3\include\longintrepr.h(88): warning C4820: '_longobject': '4' bytes padding added after data member '_longobject::ob_digit'
c:\users\huang\appdata\local\continuum\anaconda3\include\memoryobject.h(45): warning C4820: '': '4' bytes padding added after data member '::flags'
c:\users\huang\appdata\local\continuum\anaconda3\include\memoryobject.h(62): warning C4820: '': '4' bytes padding added after data member '::flags'
c:\users\huang\appdata\local\continuum\anaconda3\include\methodobject.h(61): warning C4820: 'PyMethodDef': '4' bytes padding added after data member 'PyMethodDef::ml_flags'
c:\users\huang\appdata\local\continuum\anaconda3\include\moduleobject.h(62): warning C4820: 'PyModuleDef_Slot': '4' bytes padding added after data member 'PyModuleDef_Slot::slot'
c:\users\huang\appdata\local\continuum\anaconda3\include\pystate.h(48): warning C4820: '': '4' bytes padding added after data member '::utf8_mode'
c:\users\huang\appdata\local\continuum\anaconda3\include\pystate.h(51): warning C4820: '': '4' bytes padding added after data member '::argc'
c:\users\huang\appdata\local\continuum\anaconda3\include\pystate.h(55): warning C4820: '': '4' bytes padding added after data member '::nxoption'
c:\users\huang\appdata\local\continuum\anaconda3\include\pystate.h(58): warning C4820: '': '4' bytes padding added after data member '::nwarnoption'
c:\users\huang\appdata\local\continuum\anaconda3\include\pystate.h(68): warning C4820: '': '4' bytes padding added after data member '::nmodule_search_path'
c:\users\huang\appdata\local\continuum\anaconda3\include\pystate.h(77): warning C4820: '': '4' bytes padding added after data member '::_disable_importlib'
c:\users\huang\appdata\local\continuum\anaconda3\include\pystate.h(98): warning C4820: '': '4' bytes padding added after data member '::install_signal_handlers'
c:\users\huang\appdata\local\continuum\anaconda3\include\pystate.h(225): warning C4820: '_ts': '2' bytes padding added after data member '_ts::recursion_critical'
c:\users\huang\appdata\local\continuum\anaconda3\include\pystate.h(233): warning C4820: '_ts': '4' bytes padding added after data member '_ts::use_tracing'
c:\users\huang\appdata\local\continuum\anaconda3\include\pystate.h(256): warning C4820: '_ts': '4' bytes padding added after data member '_ts::gilstate_counter'
c:\users\huang\appdata\local\continuum\anaconda3\include\pystate.h(290): warning C4820: '_ts': '4' bytes padding added after data member '_ts::coroutine_origin_tracking_depth'
c:\users\huang\appdata\local\continuum\anaconda3\include\pystate.h(293): warning C4820: '_ts': '4' bytes padding added after data member '_ts::in_coroutine_wrapper'
c:\users\huang\appdata\local\continuum\anaconda3\include\genobject.h(33): warning C4820: '': '7' bytes padding added after data member '::gi_running'
c:\users\huang\appdata\local\continuum\anaconda3\include\genobject.h(53): warning C4820: '': '7' bytes padding added after data member '::cr_running'
c:\users\huang\appdata\local\continuum\anaconda3\include\genobject.h(70): warning C4820: '': '7' bytes padding added after data member '::ag_running'
c:\users\huang\appdata\local\continuum\anaconda3\include\descrobject.h(29): warning C4820: 'wrapperbase': '4' bytes padding added after data member 'wrapperbase::offset'
c:\users\huang\appdata\local\continuum\anaconda3\include\descrobject.h(33): warning C4820: 'wrapperbase': '4' bytes padding added after data member 'wrapperbase::flags'
c:\users\huang\appdata\local\continuum\anaconda3\include\structseq.h(20): warning C4820: 'PyStructSequence_Desc': '4' bytes padding added after data member 'PyStructSequence_Desc::n_in_sequence'
c:\users\huang\appdata\local\continuum\anaconda3\include\pyerrors.h(18): warning C4820: '': '7' bytes padding added after data member '::suppress_context'
c:\users\huang\appdata\local\continuum\anaconda3\include\pyerrors.h(22): warning C4820: '': '7' bytes padding added after data member '::suppress_context'
c:\users\huang\appdata\local\continuum\anaconda3\include\pyerrors.h(32): warning C4820: '': '7' bytes padding added after data member '::suppress_context'
c:\users\huang\appdata\local\continuum\anaconda3\include\pyerrors.h(39): warning C4820: '': '7' bytes padding added after data member '::suppress_context'
c:\users\huang\appdata\local\continuum\anaconda3\include\pyerrors.h(48): warning C4820: '': '7' bytes padding added after data member '::suppress_context'
c:\users\huang\appdata\local\continuum\anaconda3\include\pyerrors.h(53): warning C4820: '': '7' bytes padding added after data member '::suppress_context'
c:\users\huang\appdata\local\continuum\anaconda3\include\pyerrors.h(65): warning C4820: '': '7' bytes padding added after data member '::suppress_context'
c:\users\huang\appdata\local\continuum\anaconda3\include\modsupport.h(90): warning C4820: '_PyArg_Parser': '4' bytes padding added after data member '_PyArg_Parser::max'
c:\users\huang\appdata\local\continuum\anaconda3\include\pylifecycle.h(15): warning C4820: '': '4' bytes padding added after data member '::user_err'
c:\users\huang\appdata\local\continuum\anaconda3\include\import.h(140): warning C4820: '_frozen': '4' bytes padding added after data member '_frozen::size'
c:\users\huang\appdata\local\continuum\anaconda3\include\fileutils.h(79): warning C4820: '_Py_stat_struct': '4' bytes padding added after data member '_Py_stat_struct::st_dev'
c:\users\huang\appdata\local\continuum\anaconda3\include\fileutils.h(81): warning C4820: '_Py_stat_struct': '2' bytes padding added after data member '_Py_stat_struct::st_mode'
c:\users\huang\appdata\local\continuum\anaconda3\include\fileutils.h(85): warning C4820: '_Py_stat_struct': '4' bytes padding added after data member '_Py_stat_struct::st_rdev'
c:\users\huang\appdata\local\continuum\anaconda3\include\fileutils.h(88): warning C4820: '_Py_stat_struct': '4' bytes padding added after data member '_Py_stat_struct::st_atime_nsec'
c:\users\huang\appdata\local\continuum\anaconda3\include\fileutils.h(90): warning C4820: '_Py_stat_struct': '4' bytes padding added after data member '_Py_stat_struct::st_mtime_nsec'
pdftotext.cpp(3): fatal error C1083: Cannot open include file: 'poppler/cpp/poppler-document.h': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe' failed with exit status 2

----------------------------------------

Failed building wheel for pdftotext
Command "C:\Users\huang\AppData\Local\Continuum\anaconda3\python.exe -u -c "import setuptools, tokenize;file='C:\Users\huang\AppData\Local\Temp\pip-install-amj4lgms\pdftotext\setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record C:\Users\huang\AppData\Local\Temp\pip-record-w5c0rhy9\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\huang\AppData\Local\Temp\pip-install-amj4lgms\pdftotext\

Program Stuck with a pdf

While extracting text from this pdf:
http://ihassociation.org/wordpress/wp-content/uploads/2015/06/2014-ASHP-Handbook-web-edition.pdf

The program just gets stuck.

Reproduce:

import pdftotext

with open('./2014-ASHP-Handbook-web-edition.pdf', 'rb') as f:
    pdf = pdftotext.PDF(f)
text = "\n\n".join(pdf)

I tried iterating through the pages, and the program gets stuck on one specific page.
While stuck it uses 100% CPU, which suggests it keeps processing something.

If it's a dependency problem, adding a timeout for processing would be good.
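Until something like that exists in the library, one possible workaround is to run the extraction in a child process and give up after a time limit. The following is only a sketch under that assumption: `run_python_with_timeout` and `extract_text_with_timeout` are hypothetical helper names, not part of pdftotext, and the 60-second default is arbitrary.

```python
import subprocess
import sys

# Script executed in the child interpreter: extract every page of the PDF
# given as the first argument and write the joined text to stdout.
_EXTRACT_SCRIPT = """
import sys
import pdftotext

with open(sys.argv[1], "rb") as f:
    pdf = pdftotext.PDF(f)
sys.stdout.buffer.write("\\n\\n".join(pdf).encode("utf-8"))
"""


def run_python_with_timeout(script, args, timeout_seconds):
    """Run `script` in a fresh interpreter.

    Returns the child's stdout as bytes, or None if it did not finish
    within `timeout_seconds` (the child is killed in that case).
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", script, *args],
            stdout=subprocess.PIPE,
            timeout=timeout_seconds,
            check=True,  # a crashing child raises CalledProcessError
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout


def extract_text_with_timeout(path, timeout_seconds=60):
    """Hypothetical wrapper: text of the PDF at `path`, or None on timeout."""
    out = run_python_with_timeout(_EXTRACT_SCRIPT, [path], timeout_seconds)
    return None if out is None else out.decode("utf-8")
```

Because the work happens in a separate process, a page that never finishes cannot hang the calling program; the cost is starting a new interpreter for each document.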

Does not build on FreeBSD 11

Hi, I want to use pdftotext on FreeBSD.

I have both poppler and pkg-config installed and
the header that seems to be missing does in fact exist:

(env) env λ › ll /usr/local/include/poppler/cpp/poppler-document.h
-rw-r--r--  1 root  wheel   4.2K Feb  6 17:03 /usr/local/include/poppler/cpp/poppler-document.h

Here is the complete output of pip:

Installing collected packages: pdftotext
  Running setup.py install for pdftotext ... error
    Complete output from command /usr/home/kai/paperless/env/bin/python3.6 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-g6r42huk/pdftotext/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-oqrtuqb6-record/install-record.txt --single-version-externally-managed --compile --install-headers /usr/home/kai/paperless/env/include/site/python3.6/pdftotext:
    running install
    running build
    running build_ext
    building 'pdftotext' extension
    creating build
    creating build/temp.freebsd-11.1-RELEASE-p4-amd64-3.6
    creating build/temp.freebsd-11.1-RELEASE-p4-amd64-3.6/pdftotext
    cc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -O2 -pipe -fstack-protector -fno-strict-aliasing -fPIC -DPOPPLER_CPP_AT_LEAST_0_30_0=1 -I/usr/local/include/python3.6m -c pdftotext/pdftotext.cpp -o build/temp.freebsd-11.1-RELEASE-p4-amd64-3.6/pdftotext/pdftotext.o -Wall
    pdftotext/pdftotext.cpp:4:10: fatal error: 'poppler/cpp/poppler-document.h' file not found
    #include <poppler/cpp/poppler-document.h>
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    1 error generated.
    error: command 'cc' failed with exit status 1

    ----------------------------------------
Command "/usr/home/kai/paperless/env/bin/python3.6 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-g6r42huk/pdftotext/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-oqrtuqb6-record/install-record.txt --single-version-externally-managed --compile --install-headers /usr/home/kai/paperless/env/include/site/python3.6/pdftotext" failed with error code 1 in /tmp/pip-build-g6r42huk/pdftotext/

To make it work on FreeBSD, I think /usr/local/include should be added to setup.py.
I don't quite know how that works, but I can help if you need any more information or testing.

Thanks!
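As a rough illustration of the suggestion above, a setup script could choose extra search directories based on the platform. This is only a sketch: `guess_poppler_dirs` is a hypothetical helper, and `/usr/local/lib` is an assumed counterpart to the `/usr/local/include` directory mentioned in this report.

```python
import sys


def guess_poppler_dirs(platform=None):
    """Return extra include/library directories to try for poppler.

    Hypothetical helper: on FreeBSD, ports and packages install under
    /usr/local, which the compiler invocation in the log above did not search.
    """
    platform = sys.platform if platform is None else platform
    if platform.startswith("freebsd"):
        return {
            "include_dirs": ["/usr/local/include"],
            "library_dirs": ["/usr/local/lib"],  # assumed, not from the report
        }
    # Other platforms: add nothing and rely on the compiler defaults.
    return {"include_dirs": [], "library_dirs": []}
```

The two lists could then be passed to `setuptools.Extension` through its `include_dirs` and `library_dirs` keyword arguments.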

pip install fails due to missing headers if Homebrew is in a nonstandard location

I have Homebrew installed in a nonstandard location (~/homebrew). I've installed pkg-config and poppler, but when I run pipenv install pdftotext, the install fails with the error "pdftotext.cpp:3:10: fatal error: 'poppler/cpp/poppler-document.h' file not found".

I see that setup.py is not looking in the path where the headers for poppler are installed on my machine. Pkg-config is able to find them, however:

pkg-config poppler-cpp  --cflags-only-I
-I/Users/brianshacklett/Applications/homebrew/Cellar/poppler/0.79.0/include/poppler/cpp -I/Users/brianshacklett/Applications/homebrew/Cellar/poppler/0.79.0/include/poppler

Perhaps something like the following might be added to help locate the include paths?

import subprocess

def find_poppler_headers():
    """Return the poppler-cpp include directories reported by pkg-config."""
    try:
        result = subprocess.run(
            ["pkg-config", "--cflags-only-I", "poppler-cpp"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            check=True,
            timeout=1,
        )
    except subprocess.CalledProcessError:
        # pkg-config ran but could not resolve poppler-cpp.
        return []
    except OSError:
        print("WARNING: pkg-config not found--guessing at poppler include path.")
        print("         If the build fails, install pkg-config and try again.")
        return []

    # Each flag looks like "-I/path/to/include"; drop the "-I" prefix.
    flags = result.stdout.decode("utf-8").split()
    return [flag[2:] for flag in flags if flag.startswith("-I")]

include_dirs = find_poppler_headers()

Segmentation Fault with pdf

Hey,

I've encountered a PDF that causes a segmentation fault.

How to reproduce:
import pdftotext

with open("seg_fault.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

print("\n\n".join(pdf))

The file:
seg_fault.pdf

Using the newest versions of the dependencies on Ubuntu 16.04:

build-essential is already the newest version (12.1ubuntu2).
pkg-config is already the newest version (0.29.1-0ubuntu1).
libpoppler-cpp-dev is already the newest version (0.41.0-0ubuntu1.14).
python-dev is already the newest version (2.7.12-1~16.04).

And python 3.6.8 with:

pdftotext 2.1.2

Extract text line by line

Is there a way to extract text line by line instead of page by page? There aren't helpful \n's in the output for this. I guess I could always just start a new line every set number of characters, but I'm wondering whether this is a built-in feature.
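Since each page is returned as a plain string, one option is to split the pages with `str.splitlines()`. The sketch below assumes the page text contains newline characters wherever the layout has line breaks, which may not hold for every PDF; `iter_lines` is a hypothetical helper, not part of pdftotext.

```python
def iter_lines(pages):
    """Yield (page_number, line) pairs from an iterable of page strings.

    Page numbers start at 1. Works on any iterable of str, for example
    a pdftotext.PDF object or a plain list of strings.
    """
    for page_number, page in enumerate(pages, start=1):
        for line in page.splitlines():
            yield page_number, line
```

With an opened document this could be used as `for page_number, line in iter_lines(pdf): ...`.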

Can't install on MacOS

After running this command:
pip3 install pdftotext

I get this error:

Collecting pdftotext
Using cached https://files.pythonhosted.org/packages/a6/a7/c202adb0bcd3adc3030b0c5f7f0e21f62a721913e93296e6c4ddc305cbd3/pdftotext-2.1.2.tar.gz
Building wheels for collected packages: pdftotext
Building wheel for pdftotext (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: /usr/local/opt/python/bin/python3.7 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/7g/lc910ymn57xbjt_n5p2hdnk00000gn/T/pip-install-xbjs_4ab/pdftotext/setup.py'"'"'; file='"'"'/private/var/folders/7g/lc910ymn57xbjt_n5p2hdnk00000gn/T/pip-install-xbjs_4ab/pdftotext/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d /private/var/folders/7g/lc910ymn57xbjt_n5p2hdnk00000gn/T/pip-wheel-k5ped4oo --python-tag cp37
cwd: /private/var/folders/7g/lc910ymn57xbjt_n5p2hdnk00000gn/T/pip-install-xbjs_4ab/pdftotext/
Complete output (24 lines):
running bdist_wheel
running build
running build_ext
building 'pdftotext' extension
creating build
creating build/temp.macosx-10.14-x86_64-3.7
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk/usr/include -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DPOPPLER_CPP_AT_LEAST_0_30_0=1 -I/usr/local/include -I/usr/local/include -I/usr/local/opt/[email protected]/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/include/python3.7m -c pdftotext.cpp -o build/temp.macosx-10.14-x86_64-3.7/pdftotext.o -Wall -mmacosx-version-min=10.9
In file included from pdftotext.cpp:5:
/usr/local/include/poppler/cpp/poppler-page.h:39:22: warning: rvalue references are a C++11 extension [-Wc++11-extensions]
text_box(text_box&&) noexcept;
^
/usr/local/include/poppler/cpp/poppler-page.h:39:25: error: expected ';' at end of declaration list
text_box(text_box&&) noexcept;
^
;
/usr/local/include/poppler/cpp/poppler-page.h:40:33: warning: rvalue references are a C++11 extension [-Wc++11-extensions]
text_box& operator=(text_box&&) noexcept;
^
/usr/local/include/poppler/cpp/poppler-page.h:40:36: error: expected ';' at end of declaration list
text_box& operator=(text_box&&) noexcept;
^
;
2 warnings and 2 errors generated.
error: command 'clang' failed with exit status 1

ERROR: Failed building wheel for pdftotext
Running setup.py clean for pdftotext
Failed to build pdftotext
Installing collected packages: pdftotext
Running setup.py install for pdftotext ... error
ERROR: Command errored out with exit status 1:
command: /usr/local/opt/python/bin/python3.7 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/7g/lc910ymn57xbjt_n5p2hdnk00000gn/T/pip-install-xbjs_4ab/pdftotext/setup.py'"'"'; file='"'"'/private/var/folders/7g/lc910ymn57xbjt_n5p2hdnk00000gn/T/pip-install-xbjs_4ab/pdftotext/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /private/var/folders/7g/lc910ymn57xbjt_n5p2hdnk00000gn/T/pip-record-8wbar5v2/install-record.txt --single-version-externally-managed --compile
cwd: /private/var/folders/7g/lc910ymn57xbjt_n5p2hdnk00000gn/T/pip-install-xbjs_4ab/pdftotext/
Complete output (24 lines):
running install
running build
running build_ext
building 'pdftotext' extension
creating build
creating build/temp.macosx-10.14-x86_64-3.7
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk/usr/include -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DPOPPLER_CPP_AT_LEAST_0_30_0=1 -I/usr/local/include -I/usr/local/include -I/usr/local/opt/[email protected]/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/include/python3.7m -c pdftotext.cpp -o build/temp.macosx-10.14-x86_64-3.7/pdftotext.o -Wall -mmacosx-version-min=10.9
In file included from pdftotext.cpp:5:
/usr/local/include/poppler/cpp/poppler-page.h:39:22: warning: rvalue references are a C++11 extension [-Wc++11-extensions]
text_box(text_box&&) noexcept;
^
/usr/local/include/poppler/cpp/poppler-page.h:39:25: error: expected ';' at end of declaration list
text_box(text_box&&) noexcept;
^
;
/usr/local/include/poppler/cpp/poppler-page.h:40:33: warning: rvalue references are a C++11 extension [-Wc++11-extensions]
text_box& operator=(text_box&&) noexcept;
^
/usr/local/include/poppler/cpp/poppler-page.h:40:36: error: expected ';' at end of declaration list
text_box& operator=(text_box&&) noexcept;
^
;
2 warnings and 2 errors generated.
error: command 'clang' failed with exit status 1
----------------------------------------
ERROR: Command errored out with exit status 1: /usr/local/opt/python/bin/python3.7 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/7g/lc910ymn57xbjt_n5p2hdnk00000gn/T/pip-install-xbjs_4ab/pdftotext/setup.py'"'"'; file='"'"'/private/var/folders/7g/lc910ymn57xbjt_n5p2hdnk00000gn/T/pip-install-xbjs_4ab/pdftotext/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /private/var/folders/7g/lc910ymn57xbjt_n5p2hdnk00000gn/T/pip-record-8wbar5v2/install-record.txt --single-version-externally-managed --compile Check the logs for full command output.

Cannot install on Windows

I am running Windows 10 with the Anaconda distribution of Python 3.6 and have the MS build tools and compiler installed. When I pip install the pdftotext package, installation begins and then terminates with this message:

pdftotext.cpp(3): fatal error C1083: Cannot open include file: 'poppler/cpp/poppler-document.h': No such file or directory

Any ideas?

Importing Pandas messes up raw mode

After importing pandas, using raw=True fails silently: the result is an arbitrarily short string from the beginning of each PDF page. The length of the string is different on each run.

Steps to reproduce:

  1. In a new shell import pdftotext and run pdftotext.PDF(<filename>, raw=True) to test for successful result.
  2. Import pandas.
  3. Run the same test again.

Python 3.8.0; pdftotext 2.1.2; pandas 0.25.3

Issue with installing on python 3.6 on Windows

I have installed the VC++ redistributable and the Python compiler on Windows, yet I get the error below.

pdftotext.cpp(3) : fatal error C1083: Cannot open include file: 'poppler/cpp/poppler-document.h': No such file or directory

How do I fix this? Please help.

Can't install on Windows using pip

pip install pdftotext
Collecting pdftotext
Using cached pdftotext-2.0.1.tar.gz
Installing collected packages: pdftotext
Running setup.py install for pdftotext ... error
Complete output from command "c:\users\vinayak sharma\appdata\local\programs\python\python35\python.exe" -u -c "import setuptools, tokenize;__file__='C:\\Users\\Local\\Temp\\pip-build-6eh2vxu8\\pdftotext\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\VINAYA~1\AppData\Local\Temp\pip-kyy39x3a-record\install-record.txt --single-version-externally-managed --compile:
WARNING: pkg-config not found--guessing at poppler version.
If the build fails, install pkg-config and try again.
running install
running build
running build_ext
building 'pdftotext' extension
error: Unable to find vcvarsall.bat

----------------------------------------

Command ""c:\users\Local\\Temp\\pip-build-6eh2vxu8\\pdftotext\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\VINAYA~1\AppData\Local\Temp\pip-kyy39x3a-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\VINAYA~1\AppData\Local\Temp\pip-build-6eh2vxu8\pdftotext\

pip install fails on macOS

Hi,
I'm running macOS and trying to install pdftotext.
I tried
pip install pdftotext
and got this error:

Collecting pdftotext
Using cached https://files.pythonhosted.org/packages/a6/a7/c202adb0bcd3adc3030b0c5f7f0e21f62a721913e93296e6c4ddc305cbd3/pdftotext-2.1.2.tar.gz
Building wheels for collected packages: pdftotext
Building wheel for pdftotext (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: /Users/romainvandelouw/venv/oreilly/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/zg/8mfp262s1093qtv0klghbfnr0000gn/T/pip-install-oailros8/pdftotext/setup.py'"'"'; file='"'"'/private/var/folders/zg/8mfp262s1093qtv0klghbfnr0000gn/T/pip-install-oailros8/pdftotext/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d /private/var/folders/zg/8mfp262s1093qtv0klghbfnr0000gn/T/pip-wheel-yvlotdyb --python-tag cp36
cwd: /private/var/folders/zg/8mfp262s1093qtv0klghbfnr0000gn/T/pip-install-oailros8/pdftotext/
Complete output (27 lines):
running bdist_wheel
running build
running build_ext
building 'pdftotext' extension
creating build
creating build/temp.macosx-10.7-x86_64-3.6
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/romainvandelouw/anaconda/include -arch x86_64 -I/Users/romainvandelouw/anaconda/include -arch x86_64 -DPOPPLER_CPP_AT_LEAST_0_30_0=1 -I/usr/local/include -I/Users/romainvandelouw/anaconda/include/python3.6m -c pdftotext.cpp -o build/temp.macosx-10.7-x86_64-3.6/pdftotext.o -Wall -mmacosx-version-min=10.9
In file included from pdftotext.cpp:5:
/Users/romainvandelouw/anaconda/include/poppler/cpp/poppler-page.h:37:22: warning: rvalue references are a C++11 extension [-Wc++11-extensions]
text_box(text_box&&) = default;
^
/Users/romainvandelouw/anaconda/include/poppler/cpp/poppler-page.h:37:28: warning: defaulted function definitions are a C++11 extension [-Wc++11-extensions]
text_box(text_box&&) = default;
^
/Users/romainvandelouw/anaconda/include/poppler/cpp/poppler-page.h:38:33: warning: rvalue references are a C++11 extension [-Wc++11-extensions]
text_box& operator=(text_box&&) = default;
^
/Users/romainvandelouw/anaconda/include/poppler/cpp/poppler-page.h:38:39: warning: defaulted function definitions are a C++11 extension [-Wc++11-extensions]
text_box& operator=(text_box&&) = default;
^
4 warnings generated.
creating build/lib.macosx-10.7-x86_64-3.6
g++ -bundle -undefined dynamic_lookup -L/Users/romainvandelouw/anaconda/lib -arch x86_64 -L/Users/romainvandelouw/anaconda/lib -arch x86_64 -arch x86_64 build/temp.macosx-10.7-x86_64-3.6/pdftotext.o -L/usr/local/lib -lpoppler-cpp -o build/lib.macosx-10.7-x86_64-3.6/pdftotext.cpython-36m-darwin.so
clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
ld: library not found for -lstdc++
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'g++' failed with exit status 1

ERROR: Failed building wheel for pdftotext

I read in previous issues that it could be related to dependencies, but poppler is installed:

Warning: pkg-config 0.29.2 is already installed and up-to-date
To reinstall 0.29.2, run brew reinstall pkg-config
Warning: poppler 0.81.0 is already installed and up-to-date
To reinstall 0.81.0, run brew reinstall poppler

I read #26 but in my case it doesn't work outside the virtualenv either...

verbose_pdftotext.txt is the output of pip --verbose install pdftotext.

What am I missing?
Thanks for your help!
