pmaupin / pdfrw

pdfrw is a pure Python library that reads and writes PDFs

License: Other

Python 71.73% Jupyter Notebook 28.27%

pdfrw's Issues

Move towards best practices for docs and tests and releases

Convert wiki from markdown to rst, and build it at readthedocs.

Start using wheels, and travisci.

Anything else?

Support /PageLabels

I wrote a little library based on pdfrw to manipulate pdf page labels:
https://github.com/lovasoa/pagelabels-py/tree/master/pagelabels

I thought it might interest you to integrate it directly to pdfrw, for easier page labels manipulation.

add to pypi

it would be good if pdfrw could be installed with easy_install or pip


the following simple setup.py works for me:

#!/usr/bin/env python

from setuptools import setup

setup(
    name = "pdfrw",
    version = "0.1",

    packages = ["pdfrw"]
)

Original issue reported on code.google.com by [email protected] on 17 Sep 2012 at 11:38

Wrong boolean keywords

Boolean values get converted to "True" and "False". According to the PDF reference it must be lower-case.
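A minimal sketch of the expected behaviour, not pdfrw's actual serialization code: the helper name format_pdf_bool is hypothetical. It shows why relying on Python's str() produces the wrong keyword.

```python
def format_pdf_bool(value):
    """Return the PDF keyword for a Python bool (hypothetical helper).

    PDF boolean objects are written as the lower-case keywords
    "true" and "false"; Python's str() gives "True" and "False".
    """
    if not isinstance(value, bool):
        raise TypeError("expected a bool, got %r" % (value,))
    return "true" if value else "false"


# str() is what produces the capitalised, invalid form.
print(str(True), "->", format_pdf_bool(True))
print(str(False), "->", format_pdf_bool(False))
```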

[Feature Request] Ability to add/remove bookmarks

If this feature already exists, I'd love to know about it. If it doesn't, I'd like to put in a formal request for it. I know where the information exists (Root.Outlines), but I don't know how to modify it.

decryption

I read the documentation, which says that decryption isn't supported. I am using PyPDF2, which does support it, but only for a few schemes; the newer encryption used by some government agencies, 128-bit AES, is not supported.

Any ideas or thoughts if this is something that you will be implementing in the future? If so do you have any timeframe in mind?

Thanks

barcode

Thanks for your work.
Can I use barcodes with pdfrw?
Do I need another library, or should I import the barcode into the document as an image?

Code review request

Purpose of code changes on this branch:

Allow reading imperfect or just plain broken PDFs:
1. no newline after %%EOF (allowed in PDF format)
2. support a single filter when specified in an array, i.e. /Filter[/FlateDecode] instead of /Filter /FlateDecode
3. when "endstream" is not found at the specified stream length, try to find it again using a simple string search from the start.
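The fallback in item 3 could look roughly like the sketch below. This is an illustration under assumptions, not the code on the branch: the function name find_stream_end and its parameters are hypothetical, and a plain search can be fooled if the binary stream data itself contains the word "endstream".

```python
def find_stream_end(data, start, declared_length):
    """Return the index where the stream body ends (hypothetical helper).

    data            -- the whole PDF file as bytes
    start           -- index of the first byte of the stream body
    declared_length -- the /Length value from the stream dictionary
    """
    expected = start + declared_length
    # Happy path: "endstream" follows the declared length, optionally
    # preceded by an end-of-line marker.
    window = data[expected:expected + 11]
    if window.lstrip(b"\r\n").startswith(b"endstream"):
        return expected
    # Fallback for broken files: simple search from the start of the body.
    # The returned index may include a trailing EOL in the body.
    found = data.find(b"endstream", start)
    if found < 0:
        raise ValueError("no endstream keyword found")
    return found


sample = b"stream\nABCDE\nendstream\n"
print(find_stream_end(sample, 7, 5))   # correct /Length
print(find_stream_end(sample, 7, 3))   # wrong /Length, fallback search
```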

When reviewing my code changes, please focus on:

Make sure it does not affect handling correct PDFs.

After the review, I'll merge this branch into:
/trunk

Original issue reported on code.google.com by [email protected] on 12 Mar 2010 at 6:25

Bug: Can't rewatermark file

Hi,

Just discovered a small bug:

code to reproduce:

from reportlab.pdfgen import canvas
from pdfrw import PdfReader, PdfWriter, PageMerge


# create some files
pdf_file = canvas.Canvas('page.pdf')
pdf_file.drawString(0, 0, 'hello')
pdf_file.save()

watermark_file = canvas.Canvas('water.pdf')
watermark_file.drawString(0, 0, 'water')
watermark_file.save()


# watermark 1
wmark = PageMerge().add(PdfReader('water.pdf').pages[0])[0]
trailer = PdfReader('page.pdf')

for page in trailer.pages:
    PageMerge(page).add(wmark).render()

PdfWriter().write('merged.pdf', trailer)


# watermark the watermarked file
trailer = PdfReader('merged.pdf')

for page in trailer.pages:
    PageMerge(page).add(wmark).render()

PdfWriter().write('merged2.pdf', trailer)

The problem is around https://github.com/pmaupin/pdfrw/blob/master/pdfrw/pagemerge.py#L202.

The number 6 is not len('/pdfrw_'), so the isdigit() check fails, and /pdfrw_0 is reused every time.
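A small sketch of the off-by-one described above, assuming the generated resource names look like /pdfrw_0, /pdfrw_1, and so on. The helper next_free_name is hypothetical and is not the code in pagemerge.py; it only shows how deriving the slice offset from the prefix avoids the hard-coded 6.

```python
PREFIX = "/pdfrw_"  # assumed naming scheme; 7 characters, not 6


def next_free_name(existing_names, prefix=PREFIX):
    """Return the first unused prefix+number name (hypothetical helper)."""
    used = []
    for name in existing_names:
        suffix = name[len(prefix):]  # slice by the real prefix length
        if name.startswith(prefix) and suffix.isdigit():
            used.append(int(suffix))
    return prefix + str(max(used) + 1 if used else 0)


# With a hard-coded 6 the suffix still contains the underscore,
# so isdigit() is False and existing names are never detected.
print(repr("/pdfrw_0"[6:]), "/pdfrw_0"[6:].isdigit())
print(next_free_name(["/pdfrw_0", "/Im0"]))
```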

PdfString encode() does not work properly with unicode strings

decode() was modified to fix one user's needs; need to fix encode() as well. Not sure what the right thing to do here is yet.

[Question] Page rescaling?

Is there any way to rescale a single page, like PyPDF2's Page.scaleTo(width, height)?

There seem to be some examples like this:

for page in output.pages:
    try:
        p = PageMerge().add(page)
        p[0].scale(0.1)
        p.render()
    except Exception as e:
        print e

These are a bit unclear and don't seem to work when iterating over the pages of an opened file; some pages also raise an error (TypeError: 'NoneType' object has no attribute '__getitem__').
The correct way would be to use the code above but add the pages to a new writer. That would mean losing any bookmarks from the original file, which is really bad.

One would expect a resize option like the one already available for rotation, like this (ignore the added watermark code):

for page in output.pages:
    page.Rotate = 90
    PageMerge(page).add(watermark, prepend=True).render()

'int' object is not callable

attempting to run the following line of code:
pages[0].MediaBox[2:]

would result in something like:

Traceback (most recent call last):
  File "/tp/new_backend/teampatent/test/test_pdfwrap.py", line 233, in test_pdfwrap_page_sizes
    eq_([420, 595], [int(n) for n in pages[0].MediaBox[2:]])
  File "build/bdist.linux-i686/egg/pdfrw/objects/pdfarray.py", line 39, in __getslice__
    return listget(self, index)
TypeError: 'int' object is not callable

this used to work before 0.1 release

Original issue reported on code.google.com by [email protected] on 5 Mar 2013 at 1:21

how to set unicode info?

I mean using non-English characters, for example:
writer.trailer.Info = IndirectPdfDict(
    Title=u'unicode string1',
    Author=u'unicode string2',
)

thanks

Question : how to remove some elements in a pdf file ?

I'm trying to make a script to remove some images from a pdf file based on their dimension.

Iterating over pages:

- If I use findobjs.find_objects(page, valid_subtypes=(PdfName.Image,)), it finds Image objects and I can check their width and height properties. But then the link with the parent (the page) is lost, so I'm not able to remove the element from the page.
- If I use find_objects on each content of the page (page.Contents), so that I keep the link with the page, it is not able to find any Image object.

I've tried to understand find_objects function to mock the behavior in a custom function. But there is some magic around obj.iteritems() that I don't get.

Do you have any idea on how to proceed ?

Add more unittests for string encoding

The code added for #30 does not break any current tests, but we don't have any unittests that will keep us from having a regression.

RE: PdfReader cannot read the io.BytesIO properly in Python 3.5

There seems to be a tab/space issue on line 499 of pdfreader.py: it currently calls convert_load only when reading from a file, not when reading from an in-memory object such as BytesIO. Ideally it should be done in both cases, so I believe this is a typo that misplaced the line "fdata = convert_load(fdata)" inside the file-reading branch only.

After I fixed the tab issue above (so that it applies in both cases), I can now use it with a BytesIO object.

By the way, this bug seems to occur only in Python 3.5; I don't have any issue with Python 3.4.

errors decoding pdf files

I'm developing a Python application using pdfrw and all seems OK, but I discovered that when I run my application with optimizations enabled (python -OO), pdfrw can't decode any PDF and raises exceptions.

On a quick inspection of pdfrw's source code, I found lines like these in pdfreader.py:
  assert source.next() == 'R'
  assert source.next() == '<<'
  assert source.next() == 'startxref' and source.floc > startloc

Because assert statements are stripped when optimizations are on, calling .next() inside an assert changes the program flow depending on whether optimizations are enabled.

Giuseppe

Original issue reported on code.google.com by [email protected] on 14 Sep 2012 at 7:48
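A minimal sketch of the usual way to avoid side effects inside assert, assuming a plain Python iterator stands in for the tokenizer. The helper name expect is hypothetical and is not part of pdfrw: the token is consumed unconditionally, and the check uses an explicit exception so it behaves the same with and without -O or -OO.

```python
def expect(source, expected):
    """Consume one token and verify it (hypothetical helper)."""
    token = next(source)  # always runs, even under python -OO
    if token != expected:
        raise ValueError("expected %r, got %r" % (expected, token))
    return token


tokens = iter(["R", "<<", "startxref"])
expect(tokens, "R")
expect(tokens, "<<")
print(next(tokens))  # the stream position is the same in every mode
```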

Additional support needed

- More compression types
- Linearized PDFs
- Maybe more PyPDF emulation (additional dict attributes, mainly)

Original issue reported on code.google.com by pmaupin on 4 Sep 2012 at 2:09

Tests are failing with AttributeError

Running the tests e.g. with nosetests results in a failure:

======================================================================
ERROR: test_doubleslash (tests.test_pdfstring.TestEncoding)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/valhalla/packaging/misc/pdfrw/pkg-pdfrw/tests/test_pdfstring.py", line 29, in test_doubleslash
    self.roundtrip('\\')
  File "/home/valhalla/packaging/misc/pdfrw/pkg-pdfrw/tests/test_pdfstring.py", line 26, in roundtrip
    self.assertEqual(value, self.encode_decode(value))
  File "/home/valhalla/packaging/misc/pdfrw/pkg-pdfrw/tests/test_pdfstring.py", line 23, in encode_decode
    return cls.decode(cls.encode(value))
  File "/home/valhalla/packaging/misc/pdfrw/pkg-pdfrw/tests/test_pdfstring.py", line 19, in encode
    return str(pdfrw.pdfobjects.PdfString.encode(value))
AttributeError: 'module' object has no attribute 'pdfobjects'

----------------------------------------------------------------------
Ran 1 test in 0.027s

FAILED (errors=1)

I've noticed that pdfrw/__init__.py includes a line

   from pdfrw.objects import PdfObject [...] PdfString

so I've tried to change:

  s/pdfrw.pdfobjects.PdfString/pdfrw.PdfString/g

everywhere in the file, which resulted in a passing test.

What is the expected behaviour? the one used in the tests or the one resulting 
from code?

Thanks in advance

Original issue reported on code.google.com by [email protected] on 30 Aug 2014 at 1:59

Why merge pdf is flipped with example..

I ran the fancy_watermark.py and watermark.py examples.

But the resulting watermarks are all flipped horizontally and vertically.

The watermark was generated with reportlab, for example:

c = canvas.Canvas('transafe.pdf')
c.drawString(0, 0, 'hello')
c.save()

Please tell me how to fix this.
Thanks in advance.

Problems using table of contents (rl) with pdfrw

I'm not sure this is actually a bug in pdfrw makerl or not, but when I try
to use table of contents together with a template with a pdfrw object in
it, it fails with:

  File "/usr/lib64/python2.6/site-packages/reportlab/pdfbase/pdfdoc.py", line 852, in format
    raise KeyError, "forward reference to %s not resolved upon final formatting" % repr(self.name)
KeyError: "forward reference to 'FormXob.pdfrw_3' not resolved upon final formatting"

I have attached a small test application that draws background.pdf before
anything else and outputs output.pdf.

Original issue reported on code.google.com by [email protected] on 13 Jan 2010 at 10:26

Attachments:

test_background.py

Another release?

I would like to use pdfrw for a project, but it involves in-memory PDFs and runs on Python 3.x, so I'm stuck using a git+ssh:// URL to install it, which is somewhat problematic. Any chance of a new release including #43 / 9e4aa55 getting pushed to pypi?

Updating field's default value doesn't update rendered text

I have form fields in my PDF (they make it interactive: you can fill them in and print with your data). I want to programmatically fill those fields based on their names (template.Root.Pages.Kids[x].Annots[y], with the name in 'T' and the default value in 'V'). The problem is that when I do so, the value is updated in the metadata, but the old value is displayed until I edit the PDF in a desktop editor (I can see the new default value there, and it starts being displayed once I make any change to the field). I'd love the rendered text to be updated as well.

Example:

template = pdfrw.PdfReader('template.pdf')
template.Root.Pages.Kids[0].Annots[3].update(pdfrw.PdfDict(V='(test)'))
pdfrw.PdfWriter().write('test.pdf', template)

crashes in makerl

What steps will reproduce the problem?
1. Go to example/rl1/
2. run subset.py test.pdf 1 1
3.

What is the expected output? What do you see instead?
I expect it to run, instead I get an error:

~/svn/pdfrw/examples/rl1$ python subset.py side1.pdf 1 1
Traceback (most recent call last):
  File "subset.py", line 43, in <module>
    go(inpfn, firstpage, lastpage)
  File "subset.py", line 36, in go
    canvas.doForm(makerl(canvas, page))
  File "~/svn/pdfrw/pdfrw/toreportlab.py", line 138, in makerl
    rlobj = makerl_recurse(doc, pdfobj)
  File "~/svn/pdfrw/pdfrw/toreportlab.py", line 131, in makerl_recurse
    return func(rldoc, pdfobj)
  File "~/svn/pdfrw/pdfrw/toreportlab.py", line 94, in _makestream
    rldict[key[1:]] = makerl_recurse(rldoc, value)
  File "~/svn/pdfrw/pdfrw/toreportlab.py", line 131, in makerl_recurse
    return func(rldoc, pdfobj)
  File "~/svn/pdfrw/pdfrw/toreportlab.py", line 72, in _makedict
    rldict[key[1:]] = makerl_recurse(rldoc, value)
  File "~/svn/pdfrw/pdfrw/toreportlab.py", line 131, in makerl_recurse
    return func(rldoc, pdfobj)
  File "~/svn/pdfrw/pdfrw/toreportlab.py", line 72, in _makedict
    rldict[key[1:]] = makerl_recurse(rldoc, value)
  File "~/svn/pdfrw/pdfrw/toreportlab.py", line 131, in makerl_recurse
    return func(rldoc, pdfobj)
  File "~/svn/pdfrw/pdfrw/toreportlab.py", line 108, in _makearray
    mylist = rlobj.sequence
AttributeError: PDFObjectReference instance has no attribute 'sequence'

What version of the product are you using? On what operating system?
Subversion revision 82
On Linux.

Please provide any additional information below.
Don't know if this helps, but if I place a try/except around the sequence
usage like so:
    try:
        mylist = rlobj.sequence
        for value in pdfobj:   
            mylist.append(makerl_recurse(rldoc, value))
        print dir(rlobj)
    except:
        print dir(rlobj)

    return rlobj

I get the following output:
['__PDFObject__', '__doc__', '__init__', '__module__', 'format', 'name']
['__PDFObject__', '__doc__', '__init__', '__module__', 'format', 'name']
['__PDFObject__', '__doc__', '__init__', '__module__', 'format', 'name']
['References', '__PDFObject__', '__doc__', '__init__', '__module__',
'format', 'multiline', 'sequence']
['References', '__PDFObject__', '__doc__', '__init__', '__module__',
'format', 'multiline', 'sequence']
['References', '__PDFObject__', '__doc__', '__init__', '__module__',
'format', 'multiline', 'sequence']
['References', '__PDFObject__', '__doc__', '__init__', '__module__',
'format', 'multiline', 'sequence']

Ah, and it works.. (output file seems correct anyways) :)

Original issue reported on code.google.com by [email protected] on 12 Jan 2010 at 11:20

Update or remove wiki

Maybe nice to have, but now out of date from readme.

Direct page objects in /Kids

In examples where XObjects are used, after adding new pages, somehow they are 
written in /Kids array as direct objects. According to specification, they must 
be indirect. Although pdf readers open such documents just fine, some tools are 
complaining about that. The solutions can be:

1) in examples (e.g. 4up.py function get4) change returning type from PdfDict 
to IndirectPdfDict.

2) changing type to indirect in writer. For example, in _get_trailer:

        # Make all the pages point back to the page dictionary
        pagedict = trailer.Root.Pages
        for page in pagedict.Kids:
            page.Parent = pagedict
            page.indirect = True  <-- add this line

I think the second approach is cleaner.

Original issue reported on code.google.com by [email protected] on 17 Nov 2012 at 3:56

Spurious brackets in URIs.


1. Get a PDF with a URI in an annotation.
2. Run this code on it:

#!/usr/bin/env python

import sys
import os

from pdfrw import PdfReader, PdfWriter

def convert(inpfn, outfn):

  pdf = PdfReader(inpfn)

  for K in pdf.Root.Pages.Kids:
    if K.Annots is not None:
      for An in K.Annots:
        if An.A is not None:
          if An.A.URI is not None:
            An.A.URI = An.A.URI

  outdata = PdfWriter()

  outdata.trailer = pdf

  outdata.write(outfn)

for inpfn in sys.argv[1:]:
    print inpfn, ':'
    outfn = 'out/' + inpfn
    convert(inpfn, outfn)


Expected output: the output PDF should be identical to the input.

Actual result: In the output PDF the URI will have extra brackets added around 
it, ie instead of

http://www.example.com

the URI now points to:

(http://www.example.com)

which fails to open correctly in any PDF reader.


Using version 0.1-1 on Ubuntu 14.04.

Original issue reported on code.google.com by [email protected] on 21 Oct 2014 at 4:16

Stopiteration exception

Hello there,

First thanks for your great work!

I've been using this library for a while and it has worked perfectly. My use case is the following:
1- I build a PDF document with the reportlab library.
2- I have other existing PDF documents that I append to the end of the document built in step 1.

Until now everything was fine, but I encountered an error with one PDF file: when I try to append it to the end of the document, the program exits with a StopIteration exception. The error stack follows. The PDF document in question opens fine in Adobe Acrobat Reader and has 3 pages. Is there something I can do to detect that a PDF file is not supported?

Thanks!

File "/usr/local/lib/python2.7/dist-packages/pdfrw/pdfwriter.py", line 316, in write
[Mon Jun 13 13:15:50.782466 2016] [:error] [pid 16822] self.killobj, user_fmt=user_fmt)
[Mon Jun 13 13:15:50.782471 2016] [:error] [pid 16822] File "/usr/local/lib/python2.7/dist-packages/pdfrw/pdfwriter.py", line 196, in FormatObjects
[Mon Jun 13 13:15:50.782476 2016] [:error] [pid 16822] format_deferred()
[Mon Jun 13 13:15:50.782481 2016] [:error] [pid 16822] File "/usr/local/lib/python2.7/dist-packages/pdfrw/pdfwriter.py", line 164, in format_deferred
[Mon Jun 13 13:15:50.782487 2016] [:error] [pid 16822] objlist[index] = format_obj(obj)
[Mon Jun 13 13:15:50.782492 2016] [:error] [pid 16822] File "/usr/local/lib/python2.7/dist-packages/pdfrw/pdfwriter.py", line 145, in format_obj
[Mon Jun 13 13:15:50.782512 2016] [:error] [pid 16822] myarray.append(add(value))
[Mon Jun 13 13:15:50.782517 2016] [:error] [pid 16822] File "/usr/local/lib/python2.7/dist-packages/pdfrw/pdfwriter.py", line 83, in add
[Mon Jun 13 13:15:50.782523 2016] [:error] [pid 16822] result = format_obj(obj)
[Mon Jun 13 13:15:50.782528 2016] [:error] [pid 16822] File "/usr/local/lib/python2.7/dist-packages/pdfrw/pdfwriter.py", line 135, in format_obj
[Mon Jun 13 13:15:50.782533 2016] [:error] [pid 16822] myarray = [add(x) for x in obj]
[Mon Jun 13 13:15:50.782538 2016] [:error] [pid 16822] File "/usr/local/lib/python2.7/dist-packages/pdfrw/objects/pdfarray.py", line 46, in __iter__
[Mon Jun 13 13:15:50.782543 2016] [:error] [pid 16822] self._resolve()
[Mon Jun 13 13:15:50.782548 2016] [:error] [pid 16822] File "/usr/local/lib/python2.7/dist-packages/pdfrw/objects/pdfarray.py", line 28, in _resolver
[Mon Jun 13 13:15:50.782554 2016] [:error] [pid 16822] value = value.real_value()
[Mon Jun 13 13:15:50.782559 2016] [:error] [pid 16822] File "/usr/local/lib/python2.7/dist-packages/pdfrw/objects/pdfindirect.py", line 21, in real_value
[Mon Jun 13 13:15:50.782564 2016] [:error] [pid 16822] value = self.value = self._loader(self)
[Mon Jun 13 13:15:50.782569 2016] [:error] [pid 16822] File "/usr/local/lib/python2.7/dist-packages/pdfrw/pdfreader.py", line 200, in loadindirect
[Mon Jun 13 13:15:50.782574 2016] [:error] [pid 16822] source.next()
[Mon Jun 13 13:15:50.782579 2016] [:error] [pid 16822] StopIteration

Document new functionality and examples

Need to finish up a more comprehensive readme, and document the new object extraction functionality, and changed/deleted examples.

How to remove object from pdf

I'm sorry for posting something which is not directly an issue (and I'll be glad to push a PR for the docs with the answer !).

I am trying to remove some objects (mostly images) from an existing PDF. I can find them with pdfrw.findobjs.find_objects(), but I have no clue how to remove them from "the tree". Can you provide any guidance, or point me to a place to look for more resources?

Many thanks,

Xobjects with compression parameters not supported

I was getting this when trying to merge this PDF:
http://demo.visualid.com/_mediafiles/demo/_tmp/3GoeKXdVeEk7.pdf
into this PDF:
https://s3-eu-west-1.amazonaws.com/visualid-mediafiles/demo/20170131/AC1E00B70333d13AF9qXE39A3E9A/9sd9DGiaS4X4.pdf

FlateDecode needs to decompress and recompress.
Here is the dict of parameters in the PDF:
[{'/Length': '3134', '/Filter': '/FlateDecode'}, {'/Length':
'3172',
'/Filter': '/FlateDecode'}, {'/Length': '3597', '/Filter':
'/FlateDecode'},
{'/Length': '3580', '/Filter': '/FlateDecode'}, {'/Length': '3044',
'/Filter': '/FlateDecode'}, {'/Length': '3393', '/Filter':
'/FlateDecode'},
{'/Length': '3347', '/Filter': '/FlateDecode'}, {'/Length': '3223',
'/Filter': '/FlateDecode'}]

Code fix by @pmaupin to come

Inline images not handled

Inline images, as described at section 4.8.6 of the PDF 1.7 reference, turn content streams on their head by putting raw image data in the middle of normal PDF objects. It seems the tokenizer doesn't handle inline images at present, so the image data gets parsed into nonsense operators/operands. If there are left and right angle brackets in the image data, one of the tokens will be an invalid hex-encoded string, which will raise an assertion when you try to decode() it.

I've started work on this in my fork, but I'm wondering if the image data should be returned as a different data type or object. I haven't looked at PdfWriter yet either.

I filed a PR over at pmaupin/static_pdfs#1 to add a file with inline images to the test files. If you run the script below on that file, you should get an AssertionError with a message of '<\x00\x00>'.

import sys

import pdfrw


with open(sys.argv[1], "rb") as f:
    doc = pdfrw.PdfReader(f)
    for page in doc.pages:
        if isinstance(page.Contents, pdfrw.PdfArray):
            contents = list(page.Contents)
        else:
            contents = [page.Contents]
        pdfrw.uncompress.uncompress(contents)
        for content in contents:
            if content is None:
                continue
            for token in pdfrw.PdfTokens(content.stream):
                if isinstance(token, pdfrw.PdfString):
                    token.decode()

find_pdfrw in subdirectories under examples does not work

Needs to search further up the directory hierarchy.

PDFString values containing 2 backslashes are incorrectly decoded

What steps will reproduce the problem?
1. Call pdfrw.pdfobjects.PdfString.encode on a string containing a double 
backslash.
2. Call .decode() on the pdfstring.

What is the expected output? What do you see instead?
Encoding and then decoding a string, should return the original.

What version of the product are you using? On what operating system?
Latest SVN (revision 136).

Please provide any additional information below.
Patch attached (including a unittest).

Original issue reported on code.google.com by beechhorn on 13 Sep 2011 at 4:28

Attachments:

pdfstring_decoding_fix.diff

Any chance at memory optimizations?

Hello there,

I'm wondering if there are any plans or perhaps some hints on reducing the amount of memory (and perhaps speeding things up) required.

Looking quickly through the code, there are far too many strings being loaded, split, concatenated, and so on. Much of this can probably be improved by using bytearray or memoryview to avoid excessive string copying.

Reading a pdf from file like object or data not working in python 3 with bytesIO

I have noticed that it is possible to make a PdfReader either by specifying a filename or file-like object, or by giving the data directly with fdata argument. This is great, however, it doesn't work if I give it a BytesIO object since the various functions in the following code only work with strings. For example, fdata.startswith('%PDF-') is called rather than fdata.startswith(b'%PDF-').

I can't immediately see an elegant way to solve this. Directly converting the data with str() produces assertion errors such as 'File "/usr/lib/python3.4/site-packages/pdfrw/pdfreader.py", line 319, in findxref assert tok == 'startxref' # (We just checked this...)' with the files I have tried.
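One possible workaround, sketched under assumptions: calling str() on bytes in Python 3 yields the repr ("b'...'"), which explains the assertion errors, whereas decoding as latin-1 maps every byte to exactly one character. The helper name to_text is hypothetical, and whether a given pdfrw version accepts the resulting string through the fdata argument is not confirmed here.

```python
import io


def to_text(data):
    """Return PDF data as a str with one character per byte.

    Hypothetical helper: accepts bytes, str, or a file-like object
    opened in binary mode. latin-1 is used because it round-trips
    all 256 byte values without loss.
    """
    if hasattr(data, "read"):
        data = data.read()
    if isinstance(data, bytes):
        data = data.decode("latin-1")
    return data


buffer = io.BytesIO(b"%PDF-1.4\n...")
text = to_text(buffer)
print(text.startswith("%PDF-"))
```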

Pdf2txt

Can a PDF-to-text conversion tool (à la pdf2txt in pdfminer) be added?

watermark.py example uses -o directory incorrectly (overwrite input files)

What steps will reproduce the problem?
1. using watermark.py with -d and -o where a path is specified with -d
Watermarked files overwrite the input files

What is the expected output? What do you see instead?
watermarked files in directory specified by -o

What version of the product are you using? On what operating system?
0.1 downloaded 18-Jan-2014 as .zip file; on Windows 7 (irrelevant)

Please provide any additional information below.
Line 67
  PdfWriter().write(path.join(outdir, fname), trailer)
should be changed to
  PdfWriter().write(path.join(outdir, os.path.basename(fname)), trailer)

Original issue reported on code.google.com by [email protected] on 18 Jan 2014 at 9:45
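The proposed change can be illustrated with the standard library alone. This is a sketch, with output_path as a hypothetical helper name: os.path.join discards the output directory when its second argument is an absolute path, which would explain why the input files get overwritten.

```python
import os.path


def output_path(outdir, input_name):
    """Place the output in outdir, keeping only the file name."""
    return os.path.join(outdir, os.path.basename(input_name))


absolute_input = os.path.abspath(os.path.join("in", "a.pdf"))

# Joining with an absolute path returns the absolute path unchanged,
# so the "output" file is the input file itself.
print(os.path.join("out", absolute_input) == absolute_input)

# Taking the basename first keeps the result inside the output directory.
print(output_path("out", absolute_input))
```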

code.google.com usage

The code.google.com service is closing. Do you have plans to migrate pdfrw somewhere else?

Original issue reported on code.google.com by [email protected] on 17 Mar 2015 at 10:31

Several failing round-trip test cases

Need to start doing triage on them.

Obviously some of them are because of the missing object stream support.

Potential Parsing Error for some PDFs

What steps will reproduce the problem?
1. Using the code from the watermark.py as a sample, I attempted to Overlay the 
attached PDF Document into another PDF Document.

Minimal code reproduction example:

Python 2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)] on 
win32
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> from pdfrw import PdfReader
>>> from pdfrw.buildxobj import pagexobj
>>>
>>> xobj = pagexobj(PdfReader('boverlay-new.pdf').getPage(0))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python27\lib\site-packages\pdfrw\buildxobj.py", line 193, in pagexobj
    assert int(contents.Length) == len(contents.stream)
AttributeError: 'PdfArray' object has no attribute 'Length'

The Overlay file will open in PDF Readers (Foxit, Adobe), but pdfrw is unable 
to create a page object from the first page of the PDF.  The Overlay PDF was 
created using Adobe Indesign, and is attached.


What is the expected output? What do you see instead?
No overlay is produced, and the exception above is generated instead.


What version of the product are you using? On what operating system?
I have the latest pdfrw as retrieved via SVN. Windows 7, 64-bit, using Python 2.7.3 32-bit.


Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 4 Nov 2013 at 5:23

Attachments:

boverlay-new.pdf

Opening a flattened PDF document with PdfReader fails

I have a PDF document with an Acroform in it. After filling in values and flattening the document, I tried to open the document again with the PdfReader, but this fails. It gives me the following stacktrace.

Error stacktrace:

../../.virtualenvs/pdftools35/lib/python3.5/site-packages/pdfrw/pdfreader.py:546: in __init__
    trailer, is_stream = self.parsexref(source)
../../.virtualenvs/pdftools35/lib/python3.5/site-packages/pdfrw/pdfreader.py:439: in parsexref
    return self.parse_xref_stream(source), True
../../.virtualenvs/pdftools35/lib/python3.5/site-packages/pdfrw/pdfreader.py:372: in parse_xref_stream
    xtype, p1, p2 = islice(get, 3)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

s = '\x00\x00\x00\x00ÿÿ\x01\x00\x00\x12\x00\x00\x01\x00\x02\x02\x00\x00\x01\x00\x12[\x00\x00\x01\x00\x13\x94\x00\x00\x01\x...12\x00\x00\x01\x01ãc\x00\x00\x01\x01ãÜ\x00\x00\x01\x01ä-\x00\x00\x01\x01ä¦\x00\x00\x01\x01ä÷\x00\x00\x01\x01åq\x00\x00'
lengths = <itertools.cycle object at 0x120fb4e08>

    def readint(s, lengths):
        lengths = itertools.cycle(lengths)
        offset = 0
        for length in itertools.cycle(lengths):
            next = offset + length
            # if isinstance(s, str):
            #    s = bytes(s, 'latin-1')
    
>           yield int(hexlify(s[offset:next]), 16) if length else None
E           TypeError: a bytes-like object is required, not 'str'

../../.virtualenvs/pdftools35/lib/python3.5/site-packages/pdfrw/pdfreader.py:342: TypeError

I am using Python 3.5.1.

My test code:

import pdfrw

with open('document_after_flattening.pdf', 'rb') as f:
    fdata = f.read().decode('latin-1')
    pdfrw.PdfReader(fdata=fdata)

I have traced the error back to :

        def readint(s, lengths):
            lengths = itertools.cycle(lengths)
            offset = 0
            for length in itertools.cycle(lengths):
                next = offset + length
                yield int(hexlify(s[offset:next]), 16) if length else None
                offset = next

hexlify(s[offset:next])

I replaced the code above with:

        def readint(s, lengths):
            lengths = itertools.cycle(lengths)
            offset = 0
            for length in itertools.cycle(lengths):
                next = offset + length


                if isinstance(s, str):
                   s = bytes(s, 'latin-1')


                yield int(hexlify(s[offset:next]), 16) if length else None
                offset = next

It appears that the stream is a string and the function didn't expect this and that's the reason why it crashes.
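A self-contained variant of the workaround above, written as a sketch rather than as the library's actual fix. It converts a str stream to bytes once, before the loop, and assumes the string holds one character per byte (latin-1), as in the test code earlier in this report.

```python
import itertools
from binascii import hexlify


def readint(s, lengths):
    """Yield big-endian integers of the given byte widths, cycling widths.

    A width of 0 yields None, as in the original function.
    """
    if isinstance(s, str):
        # Assumption: the str was produced by a latin-1 decode.
        s = s.encode("latin-1")
    offset = 0
    for length in itertools.cycle(lengths):
        end = offset + length
        yield int(hexlify(s[offset:end]), 16) if length else None
        offset = end


# Three fields of widths 1, 2 and 1 read from a str stream.
fields = itertools.islice(readint("\x01\x00\x12\x00", [1, 2, 1]), 3)
print(list(fields))
```

As in the original, the generator never stops on its own, so the caller is expected to take a fixed number of values (for example with islice).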

The document
document_before_flattening.pdf

document_after_flattening.pdf

Add tests to the release tarball

Could you please add the tests to your next release tarball? While the git repo is of course better for development, having tests in the tarball is useful for checking that the installation works, especially in the context of packaging (I'm in the process of adopting the packaging of pdfrw for Debian).

Thanks in advance.

Add FormXObject and reportlab unit tests

And maybe unit tests that exercise some of the examples.

Question: How to blacken text / Replace text / Remove text ?

Thanks for making this lib!
And sorry for the Question, this is not an issue.

I have an existing PDF with a specific word I want to hide.
I am happy with any way of hiding it (writing on top of it, cutting it out, replacing it, etc.).
I am not very familiar with the PDF format, so I'm not sure how to go about it. Any suggestions?

Happy to contribute back a working example when I get this to work :)

Importing * from pdfrw and pdfrw.objects doesn't work

The __all__ list in __init__.py must contain strings with the names of the modules, not the module objects themselves.

https://docs.python.org/2/tutorial/modules.html#importing-from-a-package
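To illustrate the difference, here is a small sketch with placeholder classes standing in for real exports. The names bad_all, good_all and non_string_entries are hypothetical; in a real package the list would be assigned to __all__ in __init__.py. In Python 3, a star import from a module whose __all__ contains non-string entries fails with a TypeError.

```python
class PdfReader:
    """Stand-in for a real exported class (placeholder only)."""


class PdfWriter:
    """Stand-in for a real exported class (placeholder only)."""


# Wrong: the objects themselves.
bad_all = [PdfReader, PdfWriter]

# Right: the names, as strings.
good_all = ["PdfReader", "PdfWriter"]


def non_string_entries(all_list):
    """Return the entries of an __all__-style list that are not strings."""
    return [entry for entry in all_list if not isinstance(entry, str)]


print(non_string_entries(bad_all))
print(non_string_entries(good_all))
```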

watermarking some PDFs doesn't work

I have some input PDFs and try to watermark them with a given single page watermark.pdf.
For some PDFs that works (watermark shows), for some not (output looks as input).

My code is like this, underneath=False:
https://github.com/pmaupin/pdfrw/blob/master/examples/watermark.py

Do you have an idea why? Is this a bug? How can it be debugged?

Problem with "/Contents" in some pdf's

In some PDFs, /Contents is an array containing one object instead of a direct reference to the object.

Eg. /Contents [ 5 0 R ] instead of /Contents 5 0 R.

To fix this problem, I changed the pagexobj function in buildxobj.py at line 190 to:

    if isinstance(page.Contents, PdfArray):
        contents = page.Contents[0]
    else:
        contents = page.Contents

Original issue reported on code.google.com by [email protected] on 18 Oct 2012 at 10:12
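A slightly more general sketch of the same idea, assuming the array type behaves like a Python list (plain lists stand in for it here). The helper name contents_as_list is hypothetical. Taking only element 0 works for single-entry arrays, but a /Contents array may hold several streams, so returning all of them avoids silently dropping the rest.

```python
def contents_as_list(contents):
    """Normalise a /Contents value to a list of stream objects."""
    if contents is None:
        return []  # page without content
    if isinstance(contents, (list, tuple)):
        return list(contents)  # array form: /Contents [ 5 0 R ... ]
    return [contents]  # direct form: /Contents 5 0 R


print(contents_as_list("stream-5"))
print(contents_as_list(["stream-5", "stream-6"]))
```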

Update examples to work with Python 3

The only ported example is watermark (it just required print fixes).

How to navigate through the tree to see the PDF text content.

I am trying to navigate through the tree created by PdfReader, looking for the variable that stores the text of the PDF.

>>> x.pages[1].Resources
{'/Font': {'/T1_2': (14, 0), '/T1_3': (16, 0), '/T1_0': (146, 0), '/T1_1': (148, 0)}, '/ProcSet': ['/PDF', '/Text']}
>>> x.pages[1].Resources.ProcSet
['/PDF', '/Text']

Not sure if this is the correct way of doing it.
