
pdfx's Introduction

PDFx


Introduction

Extract references (pdf, url, doi, arxiv) and metadata from a PDF. Optionally download all referenced PDFs and check for broken links.

Features

  • Extract references and metadata from a given PDF
  • Detects pdf, url, arxiv and doi references
  • Fast, parallel download of all referenced PDFs
  • Find broken hyperlinks (using the -c flag)
  • Output as text or JSON (using the -j flag)
  • Extract the PDF text (using the --text flag)
  • Use as command-line tool or Python package
  • Compatible with Python 2 and 3
  • Works with local and online PDFs

Getting Started

Grab a copy of the code with easy_install or pip, and run it:

$ sudo easy_install -U pdfx
...
$ pdfx <pdf-file-or-url>

Run pdfx -h to see the help output:

$ pdfx -h
usage: pdfx [-h] [-d OUTPUT_DIRECTORY] [-c] [-j] [-v] [-t] [-o OUTPUT_FILE]
            [--version]
            pdf

Extract metadata and references from a PDF, and optionally download all
referenced PDFs. Visit https://www.metachris.com/pdfx for more information.

positional arguments:
  pdf                   Filename or URL of a PDF file

optional arguments:
  -h, --help            show this help message and exit
  -d OUTPUT_DIRECTORY, --download-pdfs OUTPUT_DIRECTORY
                        Download all referenced PDFs into specified directory
  -c, --check-links     Check for broken links
  -j, --json            Output infos as JSON (instead of plain text)
  -v, --verbose         Print all references (instead of only PDFs)
  -t, --text            Only extract text (no metadata or references)
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        Output to specified file instead of console
  --version             show program's version number and exit

Examples

Let's take a look at this paper: https://weakdh.org/imperfect-forward-secrecy.pdf:

$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf
Document infos:
- CreationDate = D:20150821110623-04'00'
- Creator = LaTeX with hyperref package
- ModDate = D:20150821110805-04'00'
- PTEX.Fullbanner = This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1
- Pages = 13
- Producer = pdfTeX-1.40.14
- Title = Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice
- Trapped = False
- dc = {'title': {'x-default': 'Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice'}, 'creator': [None], 'description': {'x-default': None}, 'format': 'application/pdf'}
- pdf = {'Keywords': None, 'Producer': 'pdfTeX-1.40.14', 'Trapped': 'False'}
- pdfx = {'PTEX.Fullbanner': 'This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1'}
- xap = {'CreateDate': '2015-08-21T11:06:23-04:00', 'ModifyDate': '2015-08-21T11:08:05-04:00', 'CreatorTool': 'LaTeX with hyperref package', 'MetadataDate': '2015-08-21T11:08:05-04:00'}
- xapmm = {'InstanceID': 'uuid:4e570f88-cd0f-4488-85ad-03f4435a4048', 'DocumentID': 'uuid:98988d37-b43d-4c1a-965b-988dfb2944b6'}

References: 36
- URL: 18
- PDF: 18

PDF References:
- http://www.spiegel.de/media/media-35533.pdf
- http://www.spiegel.de/media/media-35513.pdf
- http://www.spiegel.de/media/media-35509.pdf
- http://www.spiegel.de/media/media-35529.pdf
- http://www.spiegel.de/media/media-35527.pdf
- http://cr.yp.to/factorization/smoothparts-20040510.pdf
- http://www.spiegel.de/media/media-35517.pdf
- http://www.spiegel.de/media/media-35526.pdf
- http://www.spiegel.de/media/media-35519.pdf
- http://www.spiegel.de/media/media-35522.pdf
- http://cryptome.org/2013/08/spy-budget-fy13.pdf
- http://www.spiegel.de/media/media-35515.pdf
- http://www.spiegel.de/media/media-35514.pdf
- http://www.hyperelliptic.org/tanja/SHARCS/talks06/thorsten.pdf
- http://www.spiegel.de/media/media-35528.pdf
- http://www.spiegel.de/media/media-35671.pdf
- http://www.spiegel.de/media/media-35520.pdf
- http://www.spiegel.de/media/media-35551.pdf

You can use the -v flag to output all references instead of just the PDFs.

Download all referenced PDFs with -d (for --download-pdfs) into the specified directory (e.g. /tmp/):

$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -d /tmp/
...

To extract text, you can use the -t flag:

# Extract text to console
$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -t

# Extract text to file
$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -t -o pdf-text.txt

To check for broken links use the -c flag:

$ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -c

See this example (with video) of checking for broken links: https://www.metachris.com/2016/03/find-broken-hyperlinks-in-a-pdf-document-with-pdfx/

Usage as Python library

>>> import pdfx
>>> pdf = pdfx.PDFx("filename-or-url.pdf")
>>> metadata = pdf.get_metadata()
>>> references_list = pdf.get_references()
>>> references_dict = pdf.get_references_as_dict()
>>> pdf.download_pdfs("target-directory")
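
Building on the calls above, a slightly longer sketch that combines them in a script. It assumes get_metadata() returns a dict and that get_references() accepts a type filter such as "pdf" (as used in the project source quoted in an issue further down this page), so verify both against your installed version:

import pdfx

pdf = pdfx.PDFx("filename-or-url.pdf")

# Print all metadata key/value pairs
for key, value in pdf.get_metadata().items():
    print("%s = %s" % (key, value))

# List only the PDF references; the "pdf" filter argument is an assumption
for ref in pdf.get_references("pdf"):
    print(ref.ref)

# Download every referenced PDF into a local directory
pdf.download_pdfs("target-directory")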

Dev & Contributing

# Setup venv
python3 -m venv venv
. venv/bin/activate

# Install PDFx and dev deps
pip install -e .
pip install -r requirements_dev.txt

# Run tests and checks
make test
make lint
make check

# Format the code (with black)
make format

Releasing

  • Update version number in setup.py and pdfx/__init__.py
  • Create a git tag starting with v (e.g. git tag v1.5.9)
  • Push the tag to GitHub: git push --tags

GitHub Actions then publishes the release to PyPI.

Various

Feedback, ideas and pull requests are welcome!

Improvement Ideas

Possible improvements:

  • Add a timeout option (see #43)
  • Links spanning two lines are cut off (#40)
  • Include check-links results in the output (#39)


pdfx's Issues

Unable to install pdfx

I am trying to follow the instructions you provided, but pdfx does not install.

Both easy_install and pip give errors regarding requirements. Could you please add a requirements.txt to the repository for pip install, if required?

Running easy_install -U pdfx or setup.py gives "couldn't find a setup script".

AttributeError: 'NoneType' object has no attribute 'findall'

complete traceback is:

...\lib\site-packages\pdfx\libs\xmp.py", line 50, in meta  
    for desc in self.rdftree.findall(RDF_NS+'Description'):
AttributeError: 'NoneType' object has no attribute 'findall'

Has anyone seen this error on some PDFs?

This is a remote file, accessed via http://.

I cannot publish the location of this particular file here, but would appreciate a potential strategy for a solution to this problem!

JSON Output for Check Links Subcommand

Hello, thank you for developing such a useful tool. I would like to evaluate this tool as part of my work project (see reference GSA/fedramp-automation#130). To move forward, I would like the --check-links and --json parameters to be combinable, so that pdfx both checks all referenced links and reports the result for each detected hyperlink as JSON.

If this project is still maintained, I can submit a pull request. If not, I can evaluate the feasibility of that in a fork.
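
In the meantime, a rough workaround is to drive the Python API directly and assemble the JSON yourself. This sketch assumes the pdfx.downloader.get_status_code() helper (referenced in another issue on this page) is available in the installed version:

import json
import pdfx
from pdfx.downloader import get_status_code  # assumed helper, see the twitter.com issue below

pdf = pdfx.PDFx("document.pdf")

# Check every detected reference and record its status
results = []
for ref in pdf.get_references():
    results.append({"url": ref.ref, "status": get_status_code(ref.ref)})

print(json.dumps(results, indent=2))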

PDF references should not be treated as such based on extension

PDF files referenced by other PDF files do not necessarily have a .pdf extension, so they should not be identified as PDFs by extension alone. I had to apply the following patch to be able to download PDFs recursively (in my case, they had no extension):

diff --git a/pdfx/__init__.py b/pdfx/__init__.py
index 6042e26..8411235 100644
--- a/pdfx/__init__.py
+++ b/pdfx/__init__.py
@@ -194,7 +194,7 @@ class PDFx(object):
         logger.debug("- Saved metadata to '%s'" % fn_json)

         # Download references
-        urls = [ref.ref for ref in self.get_references("pdf")]
+        urls = [ref.ref for ref in self.get_references()]
         if not urls:
             return

Of course, this quick fix brings problems: pdfx will try (and fail) to download `mailto:` links, or will download random linked websites. The point is that pdfx should allow some kind of custom regex to identify the desired files among the references. Maybe it should also allow a posteriori file checking (download a file, check whether it is actually a PDF, and delete it if not), as sketched below.
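
A minimal sketch of that a posteriori check, independent of pdfx: keep a downloaded file only if it starts with the PDF magic bytes (the directory name is a placeholder):

import os

def is_pdf(path):
    """Return True if the file starts with the PDF magic bytes."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"

# Delete anything in the download directory that is not actually a PDF
download_dir = "downloads"  # placeholder
for name in os.listdir(download_dir):
    path = os.path.join(download_dir, name)
    if os.path.isfile(path) and not is_pdf(path):
        os.remove(path)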

getting a 400 for twitter.com

Seems like twitter.com does not like the request for checking references. Not sure if this is a user agent issue, or just a problem on the twitter.com end.

A small snippet below used for testing:

import pdfx

print(pdfx.downloader.get_status_code('google.com'))
print(pdfx.downloader.get_status_code('twitter.com'))
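
One way to test the user-agent hypothesis, independently of pdfx, is to repeat the request with a browser-like User-Agent header using the requests library (the header string below is just an example):

import requests

url = "https://twitter.com"
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0"}

# Compare the status code with and without a browser-like User-Agent
print(requests.get(url).status_code)
print(requests.get(url, headers=headers).status_code)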

Thanks a lot, and a question

Thanks a lot for this great tool, I love it.
Would you mind helping me with this:
I am translating a very big document (a PDF) and it includes a lot of hyperlinks, which I forgot to include in the docx of the translation. Now I have to go through the links in the PDF one by one, open the page, and attach the link to the translated text. I wonder if there is a way to list all the links together with their corresponding page. This worked with -c (the broken-links check) but not when I list the links using -v.
I can send you the PDF file if this helps....

thanks a lot.. very much appreciated.
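
One possible approach with the Python API, assuming each reference object exposes the page it was found on (Reference objects are created with a page argument elsewhere in the code base, but treat the attribute name as an assumption):

import pdfx

pdf = pdfx.PDFx("document.pdf")
for ref in pdf.get_references():
    # "page" is an assumed attribute name; check your pdfx version
    print("page %s: %s" % (getattr(ref, "page", "?"), ref.ref))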

Copious INFO logging

Running this from a Python script with logging.basicConfig(level=logging.INFO) produces an incredible amount of detailed logging, which would probably be best confined to level=logging.DEBUG and, ideally, possible to turn off entirely if you don't care about the library's internals.
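
Until the log levels are adjusted, one workaround is to raise the threshold for pdfx's own logger. This assumes the logger is named "pdfx", as is conventional for logging.getLogger(__name__):

import logging
import pdfx

logging.basicConfig(level=logging.INFO)

# Silence pdfx's internal chatter while keeping INFO for the rest of the app;
# the logger name "pdfx" is an assumption based on the usual __name__ convention
logging.getLogger("pdfx").setLevel(logging.WARNING)

pdf = pdfx.PDFx("document.pdf")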

PDFx stores previously parsed PDFs, causing incorrect references / annotations to be found

Doc1.pdf
Doc2.pdf

Parsing annotations with get_references() on multiple files causes annotations from all previously parsed PDFs to appear in the current one.

PDF 1: Correct

from pdfx import PDFx
pdf_1 = PDFx('Doc1.pdf')
print([url.ref for url in pdf_1.get_references()])
# >> ['http://www.google.com/', 'google.com']

PDF 2: Correct

from pdfx import PDFx
pdf_2 = PDFx('Doc2.pdf')
print([url.ref for url in pdf_2.get_references()])
# >> ['bing.com', 'http://www.bing.com/']

PDF1 and PDF2 Together: Bug - PDF2 has annotations from PDF1

# -*- coding: utf-8 -*-
from pdfx import PDFx
pdf_1 = PDFx('Doc1.pdf')
print([url.ref for url in pdf_1.get_references()])
# >> ['google.com', 'http://www.google.com/']
pdf_2 = PDFx('Doc2.pdf')
print([url.ref for url in pdf_2.get_references()])
# >> ['http://www.google.com/', 'bing.com', 'google.com', 'http://www.bing.com/']

Point pdf links to local files downloaded - feature request

Is there any possibility of modifying the original PDF file so that its links point to the locally downloaded files?
A second, more interesting option would be to combine all PDFs into a single one and change every link to point internally to the specified page.
That would be useful, for example, to store documents (PhD theses, Master's theses, etc.) in a single file that can be archived for a long time without losing content.

SSL Error?

I'm not sure if this is an issue with pdfx or my local machine or the server where the pdf's links are being loaded from. When using either:

$ pdfx testpdf.pdf -c

or

$ pdfx https://example.com/testpdf.pdf -c

to test for bad links in the pdf pointing to example.com, I get this error: [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:590) - example.org/bad-link

This is with Python 3 installed via brew, and OpenSSL 1.0.2n is being loaded by Python.

Thanks

URLs truncated at line endings

First of all: great tool! I did however come across a problem with URLs that span more than one line. I've attached a PDF that reproduces the problem here:

testpdfx.pdf

Command:

pdfx -v testpdfx.pdf -o testpdfx.txt

The URL in the footnote is extracted as:

http://jpylyzer.openpreservation.org//2016/01/06/Release-of-

Whereas this should be:

http://jpylyzer.openpreservation.org//2016/01/06/Release-of-jpylyzer-1-17-0

I used pdfx version 1.3.1 on Linux Mint.

How to get the HyperText (not the HyperLink)?

I have PDFs with many hyperlinks. I want to get the text label for the hyperlinks, not the hyperlink URLs.

import pdfx
pdf = pdfx.PDFx("filename-or-url.pdf")
references_list = pdf.get_references()
for LinkObj in references_list:
    Link = LinkObj.ref  # get url
    HyperText = LinkObj.text  # CANNOT GET the LABEL for the link!

how to get HYPERTEXT.pdf

TIA

PDF fails to open if special character in path

$ pdfx Prés/presentation.pdf
Traceback (most recent call last):
  File "/usr/local/bin/pdfx", line 9, in <module>
    load_entry_point('pdfx==1.3.0', 'console_scripts', 'pdfx')()
  File "build/bdist.linux-x86_64/egg/pdfx/cli.py", line 149, in main
  File "build/bdist.linux-x86_64/egg/pdfx/__init__.py", line 99, in __init__
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)

But the PDF opens fine with:
$ cd Prés
$ pdfx presentation.pdf

TypeError: '<' not supported between instances of 'tuple' and 'int'

Getting an error while passing a URL to the PDFx function.

Here is the traceback:

 File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfx/__init__.py", line 127, in __init__
    self.reader = PDFMinerBackend(self.stream)
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfx/backends.py", line 167, in __init__
    doc = PDFDocument(parser, password=password, caching=True)
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfminer/pdfdocument.py", line 558, in __init__
    self.read_xref_from(parser, pos, self.xrefs)
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfminer/pdfdocument.py", line 782, in read_xref_from
    xref.load(parser)
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfminer/pdfdocument.py", line 235, in load
    (_, stream) = parser.nextobject()
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfminer/psparser.py", line 582, in nextobject
    (pos, token) = self.nexttoken()
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfminer/psparser.py", line 508, in nexttoken
    self.fillbuf()
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfminer/psparser.py", line 232, in fillbuf
    if self.charpos < len(self.buf):
TypeError: '<' not supported between instances of 'tuple' and 'int'

Way to check only real hyperlinks

Hi there,
I've been using pdfx with the '-c' option to check links in PDFs.

I was wondering if there's a way to restrict the list of links that it checks to actual PDF hyperlinks - because it seems to also pull out any non-hyperlinked body text that contains a URL.

My PDFs sometimes contain example URLs that shouldn't validate ( e.g. http://your-subdomain.example.com ) as plain text, so I want to avoid checking these.

Thanks,

Graeme

Recursive URL extraction from PDFs - feature request

Hi

I use pdfx -v path_to_pdf_file to gather URLs from a PDF. This is great on its own.

I would love to see pdfx expand to allow for URL extraction across a directory tree - the ability to extract URLs recursively across a directory, skipping files that are not PDFs as it goes along.

Right now I use
find /path/to/folder/ -type f -name '*.pdf' -exec pdfx -v {} \; > foo.txt

This works well (someone more skilled than I helped me with the command above), but I wonder whether a recursive feature could be integrated directly into pdfx - or maybe that is redundant, since Unix itself has the features to accomplish the same, as the command above shows.
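
If a Python-level alternative is useful, here is a minimal sketch of the same recursive extraction with the pdfx library; os.walk handles the directory recursion, and any file pdfx cannot parse is simply skipped:

import os
import pdfx

def extract_urls(root):
    """Walk a directory tree and collect reference URLs from every .pdf file."""
    urls = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(".pdf"):
                continue
            path = os.path.join(dirpath, name)
            try:
                refs = pdfx.PDFx(path).get_references()
            except Exception:
                continue  # skip files pdfx cannot parse
            urls[path] = [ref.ref for ref in refs]
    return urls

for path, found in extract_urls("/path/to/folder").items():
    print(path, found)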

I really like this tool and am using it for a personal project of mine that I will share freely once it becomes voluminous enough. Basically it is a filetype miner/downloader that pulls specific file types from the Wayback Machine - a digital archaeology tool of sorts. I use old books and magazines from archive.org as sources for URLs. The URLs are then fed to the Wayback Machine downloader to fetch the files.

Thanks for this really easy to use and powerful tool!

Fails if output is piped

Trying to pipe the output of pdfx causes an error:

Traceback (most recent call last):
  File "/usr/bin/pdfx", line 11, in <module>
    sys.exit(main())
  File "/usr/lib64/python2.7/site-packages/pdfx/cli.py", line 189, in main
    print_to_console(text)
  File "/usr/lib64/python2.7/site-packages/pdfx/cli.py", line 130, in print_to_console
    bytes_string = text.encode(sys.stdout.encoding, 'backslashreplace')
TypeError: encode() argument 1 must be string, not None

Similar errors can be found in other projects, such as ansible/ansible@c8494cd

Detect metadata from Arxiv Documents

Arxiv documents don't have title / author etc metadata.

➜ pdfx https://arxiv.org/pdf/1911.02782.pdf
Document infos:
- CreationDate = D:20200708010812Z
- Creator = LaTeX with hyperref package
- ModDate = D:20200708010812Z
- PTEX.Fullbanner = This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2
- Pages = 15
- Producer = pdfTeX-1.40.17
- Trapped = False

References: 77
- URL: 71
- ARXIV: 4
- PDF: 2

PDF References:
- http://www.lrec-conf.org/proceedings/lrec2008/pdf/445_paper.pdf
- http://ceur-ws.org/Vol-2345/paper2.pdf

Perhaps we could use arxiv.py to query Arxiv and get that metadata?

"URI" in PDF attributes may be a string itself

The URI value in an attribute object may itself be a string instead of a PDFObjRef. Not handling this case causes many URIs to be ignored. The following patch fixed the issue for me, but a better solution may be desirable:

@@ -282,16 +279,22 @@ class PDFMinerBackend(ReaderBackend):
         if isinstance(obj_resolved, list):
             return [self.resolve_PDFObjRef(o) for o in obj_resolved]

+        print(obj_resolved)
         if "URI" in obj_resolved:
             if isinstance(obj_resolved["URI"], PDFObjRef):
                 return self.resolve_PDFObjRef(obj_resolved["URI"])
+            elif isinstance(obj_resolved["URI"], (str, unicode)):
+                if IS_PY2:
+                    ref = obj_resolved["URI"].decode("utf-8")
+                else:
+                    ref = obj_resolved["URI"]
+                return Reference(ref, self.curpage)

DOI traversal / CrossRef API

"The standard way for getting the actual PDF from a DOI, when it's a Crossref DOI (which it probably is) is to use the full-text link, available in the CrossRef API.
For DOI 10.1155/2010/963926
http://api.crossref.org/works/10.1155/2010/963926
From the returned JSON message -> link -> there's the PDF!"

[
  {
    "intended-application": "text-mining",
    "content-version": "vor",
    "content-type": "application/pdf",
    "URL": "http://downloads.hindawi.com/journals/jo/2010/963926.pdf"
  },
  {
    "intended-application": "text-mining",
    "content-version": "vor",
    "content-type": "application/xml",
    "URL": "http://downloads.hindawi.com/journals/jo/2010/963926.xml"
  }
]

via HN: https://news.ycombinator.com/item?id=10452048
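
A small sketch of that lookup with the requests library. The works endpoint is public; the "message" -> "link" path follows the description above, but not every DOI carries full-text links, so treat the field access as an assumption:

import requests

def pdf_url_for_doi(doi):
    """Resolve a Crossref DOI to its full-text PDF link, if any."""
    resp = requests.get("https://api.crossref.org/works/%s" % doi)
    resp.raise_for_status()
    # "message" -> "link" as described above; may be missing for some DOIs
    for link in resp.json().get("message", {}).get("link", []):
        if link.get("content-type") == "application/pdf":
            return link.get("URL")
    return None

print(pdf_url_for_doi("10.1155/2010/963926"))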

Internal links - enhancement request

Is there any sensible output (source page number, destination page number, anchor name, for example) that could be generated for internal links pointing somewhere in the input PDF file, such as those generated for theorem numbers, citations etc by hyperref+LaTeX?

Some PDFs don't work

TODO:

  • Collect PDFs that don't work

Example error message:

$ pdfx xhyve\ –\ Lightweight\ Virtualization\ on\ OS\ X\ Based\ on\ bhyve\ _\ pagetable.pdf
Traceback (most recent call last):
  File "/usr/local/bin/pdfx", line 9, in <module>
    load_entry_point('pdfx==1.0.1', 'console_scripts', 'pdfx')()
  File "build/bdist.macosx-10.10-x86_64/egg/pdfx/cli.py", line 66, in main
  File "build/bdist.macosx-10.10-x86_64/egg/pdfx/__init__.py", line 137, in __init__
AttributeError: 'NoneType' object has no attribute 'items'

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 62: invalid continuation byte

Running pdfx file.pdf -v > output.txt I get this issue:

  File "/home/helias/.local/bin/pdfx", line 8, in <module>
    sys.exit(main())
  File "/home/helias/.local/lib/python3.8/site-packages/pdfx/cli.py", line 158, in main
    pdf = pdfx.PDFx(args.pdf)
  File "/home/helias/.local/lib/python3.8/site-packages/pdfx/__init__.py", line 128, in __init__
    self.reader = PDFMinerBackend(self.stream)
  File "/home/helias/.local/lib/python3.8/site-packages/pdfx/backends.py", line 236, in __init__
    refs = self.resolve_PDFObjRef(page.annots)
  File "/home/helias/.local/lib/python3.8/site-packages/pdfx/backends.py", line 273, in resolve_PDFObjRef
    return [self.resolve_PDFObjRef(item) for item in obj_ref]
  File "/home/helias/.local/lib/python3.8/site-packages/pdfx/backends.py", line 273, in <listcomp>
    return [self.resolve_PDFObjRef(item) for item in obj_ref]
  File "/home/helias/.local/lib/python3.8/site-packages/pdfx/backends.py", line 305, in resolve_PDFObjRef
    return Reference(obj_resolved["A"]["URI"].decode("utf-8"), self.curpage)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 62: invalid continuation byte

I guess it is related to the UTF-8 codec; is there a way to solve it?

It should be related to this: https://github.com/metachris/pdfx/blob/master/pdfx/backends.py#L305
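
The byte 0xe9 is a Latin-1 "é" and is not valid UTF-8, which is why the decode on the line linked above fails. A tolerant decode is one possible local workaround; a standalone illustration of the idea (not an official fix):

# Bytes that are valid Latin-1 but not valid UTF-8, as in the traceback above
raw = b"http://example.org/caf\xe9"

# Tolerant decodes that never raise:
print(raw.decode("utf-8", errors="backslashreplace"))  # keeps the byte as an escape
print(raw.decode("latin-1"))                           # decodes it as "é"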

Embedded URLs not being picked up?

Take a look at this PDF: http://mountainview.gov/civicax/filebank/blobdload.aspx?BlobID=20591

There are 11 documents linked to on the second page, but pdfx doesn't seem to notice them:

pdfx -v http://mountainview.gov/civicax/filebank/blobdload.aspx?BlobID=20591
Document infos:

  • Creator = Crystal Reports
  • Pages = 4
  • Producer = Powered By Crystal
  • Title = Agenda and Notice

References: 1

  • URL: 1

URL References:

- www.mountainview.gov

Am I doing something wrong, or would pdfx need to be changed to detect links like these?

Unable to install

Dependency issue:
Reading http://pypi.python.org/simple/pdfminer2/
Best match: pdfminer2 20151206.macosx-10.10-x86-64
Downloading https://pypi.python.org/packages/e0/55/5e235321d7494772264b577a8569c102b9d9ef867f7239d14d562e89bed9/pdfminer2-20151206.macosx-10.10-x86_64.tar.gz#md5=fa3add6ee50de0132da0f851d12a180b
Processing pdfminer2-20151206.macosx-10.10-x86_64.tar.gz
error: Couldn't find a setup script in /tmp/easy_install-DYGZND/pdfminer2-20151206.macosx-10.10-x86_64.tar.gz

Why did it download the macosx package? I am on Linux.

timeout option

Hi,

pdfx is very helpful for us to analyze a few things. Thanks for creating pdfx.

But we have a small problem: when a PDF file contains a lot of text, pdfx / Python only fails after the "too many recursions" error is thrown.

It would be helpful to have a max-timeout option so that pdfx does not keep trying to parse a file for 45 minutes or more (as in our case).

And another small question: what is the best way to scan / check many files at once? So far we run single pdfx commands from a bash script and wait until every command has finished. Using the & trick causes issues with the OS job scheduler, to the point where the whole OS freezes.
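
Until a timeout option exists, here is a rough sketch of both a parallel scan and a per-file time limit using only the standard library plus pdfx. Note that future.result(timeout=...) only stops waiting; the worker process itself is not killed:

import concurrent.futures
import glob

import pdfx

def references_for(path):
    """Extract all reference URLs from a single PDF."""
    return [ref.ref for ref in pdfx.PDFx(path).get_references()]

files = glob.glob("/path/to/pdfs/*.pdf")  # placeholder path

with concurrent.futures.ProcessPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(references_for, f): f for f in files}
    for future, path in futures.items():
        try:
            # Give up waiting after 2 minutes per file
            print(path, future.result(timeout=120))
        except concurrent.futures.TimeoutError:
            print(path, "timed out")
        except Exception as exc:
            print(path, "failed:", exc)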

Check for Unicode chars in PDF files

C:\Users\User\Desktop\python\CouncilAgendaMapper>pdfx --debug agenda.pdf
DEBUG - init - Init with uri: agenda.pdf
Document infos:
- Author = blah
- CreationDate = D:20151106153359-06'00'
Traceback (most recent call last):
  File "C:\Python33\lib\runpy.py", line 160, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "C:\Python33\lib\runpy.py", line 73, in _run_code
    exec(code, run_globals)
  File "C:\Python33\Scripts\pdfx.exe\__main__.py", line 9, in <module>
  File "C:\Python33\lib\site-packages\pdfx\cli.py", line 90, in main
    print("- %s = %s" % (k, parse_str(v).strip("/")))
  File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xae' in position 21: character maps to <undefined>

Not sure what the correct way to handle encoding errors would be, skip the char?

I usually fix this kind of issue by changing the cmd window encoding, e.g.:
cmd> chcp 65001

Title detection heuristics

Currently, the PDF title is retrieved directly from the metadata, but most PDFs (like Arxiv ones) don't actually have that metadata. We could add custom logic if we detect that it is an "Arxiv" PDF (which is what #52 is about), or we could add heuristic-based "guessing" of the title (say, from the text with the largest font on the first page). This will obviously not work everywhere - but it doesn't have to!

I have past experience with KDE's KFileMetaData, which used a similar heuristic and gave good results. It was later removed though (commit), because KDE as a distro has to make a lot of people happy.

If you're okay with a heuristic-based approach, I could take a stab at implementing this!

Use case: I would really like to have a script that auto-renames my PDFs with proper titles. I actually had a script based on KFileMetaData, but I've since moved on to Windows. https://github.com/dufferzafar/.scripts/blob/master/pdf-titles
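
For reference, a minimal sketch of the largest-font heuristic on top of pdfminer.six (which pdfx already builds on). This is a rough illustration of the idea, not a proposed implementation, and it assumes the usual layout objects are present on page one:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar

def guess_title(path):
    """Guess the title as the text line with the largest font size on page 1."""
    best_size, best_text = 0, ""
    for page in extract_pages(path, maxpages=1):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                chars = [obj for obj in line if isinstance(obj, LTChar)]
                if not chars:
                    continue
                size = max(c.size for c in chars)
                if size > best_size:
                    best_size, best_text = size, line.get_text().strip()
    return best_text

print(guess_title("paper.pdf"))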

Error when running pdf x

When I attempt to run the program on a PDF I get the following output:

ImportError: cannot import name settings

Any suggestions as to how I should proceed?

Combine downloaded pdfs into one file / pdf portfolio

It would be very useful for my use cases to then combine the downloaded PDFs into one PDF portfolio, with the URLs in the initial file now linking to the respective PDFs in the combined file - would this be feasible?
