
pdfrw's People

Contributors

abrasive, aquavitae, b4stien, edwardbetts, jondel, jonls, jorjmckie, lambdafu, mazulo, metalshark, ndevenish, pmaupin, takluyver, tjwei, wuhaochen

pdfrw's Issues

add to pypi

it would be good if pdfrw could be installed with easy_install or pip


the following simple setup.py works for me:

#!/usr/bin/env python

from setuptools import setup

setup(
    name = "pdfrw",
    version = "0.1",

    packages = ["pdfrw"]
)

Original issue reported on code.google.com by [email protected] on 17 Sep 2012 at 11:38

Wrong boolean keywords

Boolean values get converted to "True" and "False". According to the PDF reference it must be lower-case.
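
A minimal sketch of the difference, assuming a hypothetical helper named `format_pdf_bool` (not part of pdfrw): Python's `str(True)` yields `True`, while the PDF keywords are spelled in lower case.

```python
def format_pdf_bool(value):
    """Return the PDF keyword for a Python bool (hypothetical helper)."""
    # str(True) gives "True", which is not a valid PDF keyword;
    # the PDF reference spells both keywords in lower case.
    return "true" if value else "false"


print(format_pdf_bool(True))   # true
print(format_pdf_bool(False))  # false
```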

[Feature Request] Ability to add/remove bookmarks

If this feature already exists, I'd love to know about it. If it doesn't, I'd like to put in a formal request for it. I know where the information exists (Root.Outlines), but I don't know how to modify it.

decryption

I have read the documentation, which indicates that decryption isn't supported. I am using PyPDF2, which does support it, but only for a few schemes; the newer encryption used by some government agencies, 128-bit AES, is not supported.

Any ideas or thoughts if this is something that you will be implementing in the future? If so do you have any timeframe in mind?

Thanks

barcode

Thanks for your work.
Can I use barcodes with pdfrw?
Do I need another library, or should I import a barcode image into the document?

Code review request

Purpose of code changes on this branch:

Allow reading imperfect or just plain broken PDFs:
1. no newline after %%EOF (allowed in PDF format)
2. support a single filter when specified in an array, i.e. /Filter[/FlateDecode] instead of /Filter /FlateDecode
3. when "endstream" is not found at the specified stream length, try to find it again using a simple string search from the start.
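
Item 3 could be sketched roughly as follows. This is an illustration under assumptions, not the code on the branch: `find_stream_end` is a hypothetical helper that trusts /Length first and falls back to a plain search. A naive search can be fooled by binary data that happens to contain the keyword, which is why it is only the fallback.

```python
def find_stream_end(data, start, declared_length):
    """Return the offset where the stream data ends (hypothetical helper).

    `start` is the offset of the first data byte after the "stream" keyword.
    """
    keyword = b"endstream"
    expected = start + declared_length
    # Allow optional end-of-line bytes between the data and the keyword.
    probe = data[expected:expected + len(keyword) + 2].lstrip(b"\r\n")
    if probe.startswith(keyword):
        return expected
    # /Length was wrong: search for the keyword from the start of the data.
    found = data.find(keyword, start)
    if found < 0:
        raise ValueError("no endstream keyword after offset %d" % start)
    # Drop one end-of-line marker that precedes the keyword, if present.
    if data[found - 2:found] == b"\r\n":
        return found - 2
    if data[found - 1:found] in (b"\n", b"\r"):
        return found - 1
    return found
```

For correct PDFs the first branch returns immediately, so the fallback should not change how well-formed files are handled.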

When reviewing my code changes, please focus on:

Make sure it does not affect handling correct PDFs.

After the review, I'll merge this branch into:
/trunk


Original issue reported on code.google.com by [email protected] on 12 Mar 2010 at 6:25

Bug: Can't rewatermark file

Hi,

Just discovered a small bug:

code to reproduce:

from reportlab.pdfgen import canvas
from pdfrw import PdfReader, PdfWriter, PageMerge


# create some files
pdf_file = canvas.Canvas('page.pdf')
pdf_file.drawString(0, 0, 'hello')
pdf_file.save()

watermark_file = canvas.Canvas('water.pdf')
watermark_file.drawString(0, 0, 'water')
watermark_file.save()


# watermark 1
wmark = PageMerge().add(PdfReader('water.pdf').pages[0])[0]
trailer = PdfReader('page.pdf')

for page in trailer.pages:
    PageMerge(page).add(wmark).render()

PdfWriter().write('merged.pdf', trailer)


# watermark the watermarked file
trailer = PdfReader('merged.pdf')

for page in trailer.pages:
    PageMerge(page).add(wmark).render()

PdfWriter().write('merged2.pdf', trailer)

The problem is around https://github.com/pmaupin/pdfrw/blob/master/pdfrw/pagemerge.py#L202.

The number 6 is not len('\pdfrw_'), so the isdigit() fails, and \pdfrw_0 is re-used every time.
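
One way to avoid this kind of off-by-one is to slice by the prefix's real length instead of a literal number. The sketch below assumes generated names look like `/pdfrw_N`; `PREFIX` and `next_free_name` are hypothetical and not taken from pagemerge.py.

```python
PREFIX = "/pdfrw_"  # assumed naming scheme for generated XObject names


def next_free_name(existing_names):
    """Return the first PREFIX + N name not already in use (hypothetical helper)."""
    used = set()
    for name in existing_names:
        suffix = name[len(PREFIX):]  # slice by the real prefix length
        if name.startswith(PREFIX) and suffix.isdigit():
            used.add(int(suffix))
    index = 0
    while index in used:
        index += 1
    return PREFIX + str(index)


print(next_free_name(["/pdfrw_0", "/Im1"]))  # /pdfrw_1
```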

[Question] Page rescaling?

Is there any way to rescale a single page, like Page.scaleTo(width, height) in the PyPDF2 library?

There seem to be some examples like this:

for page in output.pages:
    try:
        p = PageMerge().add(page)
        p[0].scale(0.1)
        p.render()
    except Exception as e:
        print e

These are a bit unclear, and they don't seem to work when iterating over the pages of an opened file; some pages also raise an error ("TypeError: 'NoneType' object has no attribute '__getitem__'").
The correct way would be to use the above code but add the pages to a new writer, but that would mean losing any bookmarks from the original file, which is really bad.

One would expect a resize option like the one already available for rotations, like this (don't mind the added watermark code):

for page in output.pages:
    page.Rotate = 90
    PageMerge(page).add(watermark, prepend=True).render()

'int' object is not callable

attempting to run the following line of code:
pages[0].MediaBox[2:]

would result in something like:

Traceback (most recent call last):
  File "/tp/new_backend/teampatent/test/test_pdfwrap.py", line 233, in test_pdfwrap_page_sizes
    eq_([420, 595], [int(n) for n in pages[0].MediaBox[2:]])
  File "build/bdist.linux-i686/egg/pdfrw/objects/pdfarray.py", line 39, in __getslice__
    return listget(self, index)
TypeError: 'int' object is not callable

this used to work before 0.1 release


Original issue reported on code.google.com by [email protected] on 5 Mar 2013 at 1:21

how to set unicode info?

I mean using non-English characters, for example:
writer.trailer.Info = IndirectPdfDict(
Title=u'unicode string1',
Author=u'unicode string2',
)

thanks

Question : how to remove some elements in a pdf file ?

I'm trying to make a script to remove some images from a pdf file based on their dimension.

Iterating over pages,

  • if I use findobjs.find_objects(page, valid_subtypes=(PdfName.Image,)), it finds Image objects and I can check width and height properties. But then the link with the parent (the page) is lost so I'm not able to remove this element from the page.
  • if I use find_objects on each content of the page (page.Contents), so I can keep the link with the page, it is not able to find any Image object.

I've tried to understand find_objects function to mock the behavior in a custom function. But there is some magic around obj.iteritems() that I don't get.

Do you have any idea on how to proceed ?

RE: PdfReader cannot read the io.BytesIO properly in Python 3.5

There seems to be a tab/space (indentation) issue on line 499 of pdfreader.py: it currently calls convert_load only when reading from a file, not when reading from an in-memory object such as BytesIO. Ideally it should be done in both cases, so I believe the line "fdata=convert_load(fdata)" was misplaced into the file-reading section only.

After fixing the indentation so that it applies in both cases, I can use it with a BytesIO object.

By the way, this bug seems to occur only in Python 3.5; I don't have any issue with Python 3.4.

errors decoding pdf files

I'm developing a Python application using pdfrw and all seemed OK, but I discovered that when I run my application with optimizations enabled (python -OO), pdfrw can't decode any PDF and raises exceptions.

After a quick inspection of pdfrw's source code I found lines like these in pdfreader.py:
  assert source.next() == 'R'
  assert source.next() == '<<'
  assert source.next() == 'startxref' and source.floc > startloc

Because assert statements are stripped when optimizations are on, calling .next() inside an assert makes the program flow differ depending on whether optimizations are on or off.
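
A minimal sketch of one way around this, assuming a hypothetical helper named `expect` (not pdfrw's actual fix): consume the token unconditionally, then check it with an ordinary `if`, so the side effect survives `python -O` and `-OO`.

```python
def expect(tokens, wanted):
    """Consume one token and raise if it is not `wanted` (hypothetical helper)."""
    token = next(tokens)  # always runs, even when assert statements are stripped
    if token != wanted:
        raise ValueError("expected %r, got %r" % (wanted, token))
    return token


source = iter(["R", "<<", "startxref"])
expect(source, "R")
expect(source, "<<")
print(next(source))  # startxref
```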

Giuseppe

Original issue reported on code.google.com by [email protected] on 14 Sep 2012 at 7:48

Additional support needed

- More compression types
- Linearized PDFs
- Maybe more PyPDF emulation (additional dict attributes, mainly)

Original issue reported on code.google.com by pmaupin on 4 Sep 2012 at 2:09

Tests are failing with AttributeError

Running the tests e.g. with nosetests results in a failure:

======================================================================
ERROR: test_doubleslash (tests.test_pdfstring.TestEncoding)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/valhalla/packaging/misc/pdfrw/pkg-pdfrw/tests/test_pdfstring.py", line 29, in test_doubleslash
    self.roundtrip('\\')
  File "/home/valhalla/packaging/misc/pdfrw/pkg-pdfrw/tests/test_pdfstring.py", line 26, in roundtrip
    self.assertEqual(value, self.encode_decode(value))
  File "/home/valhalla/packaging/misc/pdfrw/pkg-pdfrw/tests/test_pdfstring.py", line 23, in encode_decode
    return cls.decode(cls.encode(value))
  File "/home/valhalla/packaging/misc/pdfrw/pkg-pdfrw/tests/test_pdfstring.py", line 19, in encode
    return str(pdfrw.pdfobjects.PdfString.encode(value))
AttributeError: 'module' object has no attribute 'pdfobjects'

----------------------------------------------------------------------
Ran 1 test in 0.027s

FAILED (errors=1)

I've noticed that pdfrw/__init__.py includes a line

   from pdfrw.objects import PdfObject [...] PdfString

so I've tried to change:

  s/pdfrw.pdfobjects.PdfString/pdfrw.PdfString/g

everywhere in the file, which resulted in a passing test.

What is the expected behaviour: the one used in the tests or the one resulting from the code?

Thanks in advance

Original issue reported on code.google.com by [email protected] on 30 Aug 2014 at 1:59

Why is the merged PDF flipped when using the examples?

I ran the example fancy_watermark.py (or watermark.py).

The resulting watermark is flipped both horizontally and vertically.

The watermark was generated with reportlab, for example:

c = canvas.Canvas('transafe.pdf')
c.drawString(0,0,'hello') 
c.save()

Thanks in advance.
Please tell me how to change it.

Problems using table of contents (rl) with pdfrw

I'm not sure this is actually a bug in pdfrw makerl or not, but when I try
to use table of contents together with a template with a pdfrw object in
it, it fails with:

  File "/usr/lib64/python2.6/site-packages/reportlab/pdfbase/pdfdoc.py", line 852, in format
    raise KeyError, "forward reference to %s not resolved upon final formatting" % repr(self.name)
KeyError: "forward reference to 'FormXob.pdfrw_3' not resolved upon final formatting"

I have attached a small test application that draws background.pdf before
anything else and outputs output.pdf.

Original issue reported on code.google.com by [email protected] on 13 Jan 2010 at 10:26

Attachments:

Another release?

I would like to use pdfrw for a project, but it involves in-memory PDFs and runs on Python 3.x, so I'm stuck using a git+ssh:// URL to install it, which is somewhat problematic. Any chance of a new release including #43 / 9e4aa55 getting pushed to pypi?

Updating field's default value doesn't update rendered text

I have form fields in my PDF (they make it interactive: you can fill them in and print with your data). I want to programmatically fill those fields based on their names (template.Root.Pages.Kids[x].Annots[y], with the name in 'T' and the default value in 'V'). The problem is that when I do so, the value is updated in the metadata, but the old value is displayed until I edit the PDF in a desktop editor (there I can see the new default value, and it starts to be displayed once I make any change to the field). I'd love the rendered text to be updated as well.

Example:

template = pdfrw.PdfReader('template.pdf')
template.Root.Pages.Kids[0].Annots[3].update(pdfrw.PdfDict(V='(test)'))
pdfrw.PdfWriter().write('test.pdf', template)

crashes in makerl

What steps will reproduce the problem?
1. Go to example/rl1/
2. run subset.py test.pdf 1 1
3.

What is the expected output? What do you see instead?
I expect it to run, instead I get an error:

~/svn/pdfrw/examples/rl1$ python subset.py side1.pdf 1 1
Traceback (most recent call last):
  File "subset.py", line 43, in <module>
    go(inpfn, firstpage, lastpage)
  File "subset.py", line 36, in go
    canvas.doForm(makerl(canvas, page))
  File "~/svn/pdfrw/pdfrw/toreportlab.py", line 138, in makerl
    rlobj = makerl_recurse(doc, pdfobj)
  File "~/svn/pdfrw/pdfrw/toreportlab.py", line 131, in makerl_recurse
    return func(rldoc, pdfobj)
  File "~/svn/pdfrw/pdfrw/toreportlab.py", line 94, in _makestream
    rldict[key[1:]] = makerl_recurse(rldoc, value)
  File "~/svn/pdfrw/pdfrw/toreportlab.py", line 131, in makerl_recurse
    return func(rldoc, pdfobj)
  File "~/svn/pdfrw/pdfrw/toreportlab.py", line 72, in _makedict
    rldict[key[1:]] = makerl_recurse(rldoc, value)
  File "~/svn/pdfrw/pdfrw/toreportlab.py", line 131, in makerl_recurse
    return func(rldoc, pdfobj)
  File "~/svn/pdfrw/pdfrw/toreportlab.py", line 72, in _makedict
    rldict[key[1:]] = makerl_recurse(rldoc, value)
  File "~/svn/pdfrw/pdfrw/toreportlab.py", line 131, in makerl_recurse
    return func(rldoc, pdfobj)
  File "~/svn/pdfrw/pdfrw/toreportlab.py", line 108, in _makearray
    mylist = rlobj.sequence
AttributeError: PDFObjectReference instance has no attribute 'sequence'

What version of the product are you using? On what operating system?
Subversion revision 82
On Linux.

Please provide any additional information below.
Don't know if this helps, but if I place a try/except around the sequence
usage like so:
    try:
        mylist = rlobj.sequence
        for value in pdfobj:   
            mylist.append(makerl_recurse(rldoc, value))
        print dir(rlobj)
    except:
        print dir(rlobj)

    return rlobj

I get the following output:
['__PDFObject__', '__doc__', '__init__', '__module__', 'format', 'name']
['__PDFObject__', '__doc__', '__init__', '__module__', 'format', 'name']
['__PDFObject__', '__doc__', '__init__', '__module__', 'format', 'name']
['References', '__PDFObject__', '__doc__', '__init__', '__module__',
'format', 'multiline', 'sequence']
['References', '__PDFObject__', '__doc__', '__init__', '__module__',
'format', 'multiline', 'sequence']
['References', '__PDFObject__', '__doc__', '__init__', '__module__',
'format', 'multiline', 'sequence']
['References', '__PDFObject__', '__doc__', '__init__', '__module__',
'format', 'multiline', 'sequence']

Ah, and it works.. (output file seems correct anyways) :)

Original issue reported on code.google.com by [email protected] on 12 Jan 2010 at 11:20

Direct page objects in /Kids

In examples where XObjects are used, newly added pages somehow get written to the /Kids array as direct objects. According to the specification, they must be indirect. Although PDF readers open such documents just fine, some tools complain about it. Possible solutions:

1) in the examples (e.g. 4up.py, function get4), change the return type from PdfDict to IndirectPdfDict.

2) changing type to indirect in writer. For example, in _get_trailer:

        # Make all the pages point back to the page dictionary
        pagedict = trailer.Root.Pages
        for page in pagedict.Kids:
            page.Parent = pagedict
            page.indirect = True  <-- add this line

I think the second approach is cleaner.

Original issue reported on code.google.com by [email protected] on 17 Nov 2012 at 3:56

Spurious brackets in URIs.


1. Get a PDF with a URI in an annotation.
2. Run this code on it:

#!/usr/bin/env python

import sys
import os

from pdfrw import PdfReader, PdfWriter

def convert(inpfn, outfn):

  pdf = PdfReader(inpfn)

  for K in pdf.Root.Pages.Kids:
    if K.Annots is not None:
      for An in K.Annots:
        if An.A is not None:
          if An.A.URI is not None:
            An.A.URI = An.A.URI

  outdata = PdfWriter()

  outdata.trailer = pdf

  outdata.write(outfn)

for inpfn in sys.argv[1:]:
    print inpfn, ':'
    outfn = 'out/' + inpfn
    convert(inpfn, outfn)


Expected output: the output PDF should be identical to the input.

Actual result: In the output PDF the URI will have extra brackets added around 
it, ie instead of

http://www.example.com

the URI now points to:

(http://www.example.com)

which fails to open correctly in any PDF reader.


Using version 0.1-1 on Ubuntu 14.04.


Original issue reported on code.google.com by [email protected] on 21 Oct 2014 at 4:16

Stopiteration exception

Hello there,

First thanks for your great work!

I've been using this library for a while and it has worked perfectly. My use case is the following:
1- I build a PDF document with the reportlab library.
2- I have other existing PDF documents that I append to the end of the document built in step 1. Until now everything was fine, but I hit an error with one PDF file: when I try to append it, the program exits with a StopIteration exception. The stack trace follows. The PDF in question opens fine in Adobe Acrobat Reader and has 3 pages. Is there some way to know in advance that a PDF file is not supported?

Thanks!

  File "/usr/local/lib/python2.7/dist-packages/pdfrw/pdfwriter.py", line 316, in write
    self.killobj, user_fmt=user_fmt)
  File "/usr/local/lib/python2.7/dist-packages/pdfrw/pdfwriter.py", line 196, in FormatObjects
    format_deferred()
  File "/usr/local/lib/python2.7/dist-packages/pdfrw/pdfwriter.py", line 164, in format_deferred
    objlist[index] = format_obj(obj)
  File "/usr/local/lib/python2.7/dist-packages/pdfrw/pdfwriter.py", line 145, in format_obj
    myarray.append(add(value))
  File "/usr/local/lib/python2.7/dist-packages/pdfrw/pdfwriter.py", line 83, in add
    result = format_obj(obj)
  File "/usr/local/lib/python2.7/dist-packages/pdfrw/pdfwriter.py", line 135, in format_obj
    myarray = [add(x) for x in obj]
  File "/usr/local/lib/python2.7/dist-packages/pdfrw/objects/pdfarray.py", line 46, in __iter__
    self._resolve()
  File "/usr/local/lib/python2.7/dist-packages/pdfrw/objects/pdfarray.py", line 28, in _resolver
    value = value.real_value()
  File "/usr/local/lib/python2.7/dist-packages/pdfrw/objects/pdfindirect.py", line 21, in real_value
    value = self.value = self._loader(self)
  File "/usr/local/lib/python2.7/dist-packages/pdfrw/pdfreader.py", line 200, in loadindirect
    source.next()
StopIteration

How to remove object from pdf

I'm sorry for posting something which is not directly an issue (and I'll be glad to push a PR for the docs with the answer!).

I am trying to remove some objects (mostly images) from an existing PDF. I can find them with pdfrw.findobjs.find_objects(), but I have no clue how to remove them from "the tree". Can you provide any kind of guidance (or a place to look for more resources)?

Many thanks,

Xobjects with compression parameters not supported

I was getting this when trying to merge this PDF:
http://demo.visualid.com/_mediafiles/demo/_tmp/3GoeKXdVeEk7.pdf
into this PDF:
https://s3-eu-west-1.amazonaws.com/visualid-mediafiles/demo/20170131/AC1E00B70333d13AF9qXE39A3E9A/9sd9DGiaS4X4.pdf

FlateDecode needs to decompress and recompress.
Here is the dict of parameters in the PDF:
[{'/Length': '3134', '/Filter': '/FlateDecode'}, {'/Length':
'3172',
'/Filter': '/FlateDecode'}, {'/Length': '3597', '/Filter':
'/FlateDecode'},
{'/Length': '3580', '/Filter': '/FlateDecode'}, {'/Length': '3044',
'/Filter': '/FlateDecode'}, {'/Length': '3393', '/Filter':
'/FlateDecode'},
{'/Length': '3347', '/Filter': '/FlateDecode'}, {'/Length': '3223',
'/Filter': '/FlateDecode'}]

Code fix by @pmaupin to come

Inline images not handled

Inline images, as described at section 4.8.6 of the PDF 1.7 reference, turn content streams on their head by putting raw image data in the middle of normal PDF objects. It seems the tokenizer doesn't handle inline images at present, so the image data gets parsed into nonsense operators/operands. If there are left and right angle brackets in the image data, one of the tokens will be an invalid hex-encoded string, which will raise an assertion when you try to decode() it.

I've started work on this in my fork, but I'm wondering if the image data should be returned as a different data type or object. I haven't looked at PdfWriter yet either.

I filed a PR over at pmaupin/static_pdfs#1 to add a file with inline images to the test files. If you run the script below on that file, you should get an AssertionError with a message of '<\x00\x00>'.

import sys

import pdfrw


with open(sys.argv[1], "rb") as f:
    doc = pdfrw.PdfReader(f)
    for page in doc.pages:
        if isinstance(page.Contents, pdfrw.PdfArray):
            contents = list(page.Contents)
        else:
            contents = [page.Contents]
        pdfrw.uncompress.uncompress(contents)
        for content in contents:
            if content is None:
                continue
            for token in pdfrw.PdfTokens(content.stream):
                if isinstance(token, pdfrw.PdfString):
                    token.decode()

PDFString values containing 2 backslashes are incorrectly decoded

What steps will reproduce the problem?
1. Call pdfrw.pdfobjects.PdfString.encode on a string containing a double backslash.
2. Call .decode() on the pdfstring.

What is the expected output? What do you see instead?
Encoding and then decoding a string, should return the original.
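
The round-trip property can be illustrated with a small sketch. It is not pdfrw's PdfString code and not the attached patch; `encode_literal` and `decode_literal` are hypothetical helpers covering only backslash and parenthesis escapes, whereas real PDF literal strings also allow escapes such as \n and octal codes.

```python
import re


def encode_literal(text):
    """Wrap text as a PDF literal string, escaping backslash and parentheses."""
    escaped = re.sub(r"([\\()])", r"\\\1", text)
    return "(" + escaped + ")"


def decode_literal(literal):
    """Reverse encode_literal; handles only the escapes it produces."""
    body = literal[1:-1]
    # One left-to-right pass, so an escaped backslash is never re-read
    # as the start of another escape sequence.
    return re.sub(r"\\([\\()])", r"\1", body)


value = "\\\\"  # two backslash characters
print(decode_literal(encode_literal(value)) == value)  # True
```

Decoding in a single pass matters here: unescaping backslashes in one step and parentheses in another can misread the second backslash of a pair.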

What version of the product are you using? On what operating system?
Latest SVN (revision 136).

Please provide any additional information below.
Patch attached (including a unittest).

Original issue reported on code.google.com by beechhorn on 13 Sep 2011 at 4:28

Attachments:

Any chance at memory optimizations?

Hello there,

I'm wondering if there are any plans or perhaps some hints on reducing the amount of memory (and perhaps speeding things up) required.

Quickly looking through the code, there are way too many strings being loaded, split, concatenated, etc. Much of this could probably be improved with bytearray or memoryview to avoid excessive string copying.

Reading a PDF from a file-like object or data does not work in Python 3 with BytesIO

I have noticed that it is possible to make a PdfReader either by specifying a filename or file-like object, or by giving the data directly with fdata argument. This is great, however, it doesn't work if I give it a BytesIO object since the various functions in the following code only work with strings. For example, fdata.startswith('%PDF-') is called rather than fdata.startswith(b'%PDF-').

I can't immediately see an elegant way to solve this. Directly converting the data with str() produces assertion errors such as 'File "/usr/lib/python3.4/site-packages/pdfrw/pdfreader.py", line 319, in findxref assert tok == 'startxref' # (We just checked this...)' with the files I have tried.
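
One possible approach, offered as an assumption rather than the library's confirmed fix, is to decode the bytes as latin-1 before any string checks run. Latin-1 maps each byte to the code point with the same value, so the data survives unchanged and `startswith('%PDF-')` keeps working. `load_pdf_text` is a hypothetical helper.

```python
import io


def load_pdf_text(source):
    """Return PDF data as a str, one character per byte (hypothetical helper)."""
    data = source.read() if hasattr(source, "read") else source
    if isinstance(data, bytes):
        # latin-1 maps bytes 0-255 to code points 0-255, so nothing is lost.
        data = data.decode("latin-1")
    return data


text = load_pdf_text(io.BytesIO(b"%PDF-1.4\n\xff\xfe"))
print(text.startswith("%PDF-"))  # True
```

Unlike `str(data)`, which produces the `b'...'` representation of the bytes object, decoding keeps offsets into the data identical to offsets into the original file.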

Pdf2txt

Can a PDF-to-text conversion tool (à la pdf2txt in pdfminer) be added?

watermark.py example uses -o directory incorrectly (overwrite input files)

What steps will reproduce the problem?
1. using watermark.py with -d and -o where a path is specified with -d
Watermarked files overwrite the input files

What is the expected output? What do you see instead?
watermarked files in directory specified by -o

What version of the product are you using? On what operating system?
0.1 downloaded 18-Jan-2014 as .zip file; on Windows 7 (irrelevant)

Please provide any additional information below.
Line 67
  PdfWriter().write(path.join(outdir, fname), trailer)
should be changed to
  PdfWriter().write(path.join(outdir, os.path.basename(fname)), trailer)
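
The effect of the proposed change can be checked in isolation. In this sketch `output_path` is a hypothetical helper that mirrors the corrected expression.

```python
import os.path


def output_path(outdir, fname):
    """Build the destination path inside outdir (hypothetical helper)."""
    # os.path.join discards outdir when fname is absolute, and keeps any
    # directory part of a relative fname; basename strips both.
    return os.path.join(outdir, os.path.basename(fname))


print(output_path("out", os.path.join("in", "doc.pdf")))
```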

Original issue reported on code.google.com by [email protected] on 18 Jan 2014 at 9:45

code.google.com usage

The code.google.com service is closing. Do you have plans to migrate pdfrw somewhere else?

Original issue reported on code.google.com by [email protected] on 17 Mar 2015 at 10:31

Potential Parsing Error for some PDFs

What steps will reproduce the problem?
1. Using the code from watermark.py as a sample, I attempted to overlay the attached PDF document onto another PDF document.

Minimal code reproduction example:

Python 2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)] on 
win32
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> from pdfrw import PdfReader
>>> from pdfrw.buildxobj import pagexobj
>>>
>>> xobj = pagexobj(PdfReader('boverlay-new.pdf').getPage(0))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python27\lib\site-packages\pdfrw\buildxobj.py", line 193, in pagexobj
    assert int(contents.Length) == len(contents.stream)
AttributeError: 'PdfArray' object has no attribute 'Length'

The overlay file opens in PDF readers (Foxit, Adobe), but pdfrw is unable to create a page object from the first page of the PDF. The overlay PDF was created using Adobe InDesign, and is attached.


What is the expected output? What do you see instead?
No overlay is produced, and the exception above is generated instead.


What version of the product are you using? On what operating system?
I have the latest pdfrw as retrieved via SVN. Windows 7, 64-bit, using Python 2.7.3 32-bit.


Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 4 Nov 2013 at 5:23

Attachments:

Opening a flattened PDF document with PdfReader fails

I have a PDF document with an Acroform in it. After filling in values and flattening the document, I tried to open the document again with the PdfReader, but this fails. It gives me the following stacktrace.

Error stacktrace:

../../.virtualenvs/pdftools35/lib/python3.5/site-packages/pdfrw/pdfreader.py:546: in __init__
    trailer, is_stream = self.parsexref(source)
../../.virtualenvs/pdftools35/lib/python3.5/site-packages/pdfrw/pdfreader.py:439: in parsexref
    return self.parse_xref_stream(source), True
../../.virtualenvs/pdftools35/lib/python3.5/site-packages/pdfrw/pdfreader.py:372: in parse_xref_stream
    xtype, p1, p2 = islice(get, 3)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

s = '\x00\x00\x00\x00ÿÿ\x01\x00\x00\x12\x00\x00\x01\x00\x02\x02\x00\x00\x01\x00\x12[\x00\x00\x01\x00\x13\x94\x00\x00\x01\x...12\x00\x00\x01\x01ãc\x00\x00\x01\x01ãÜ\x00\x00\x01\x01ä-\x00\x00\x01\x01ä¦\x00\x00\x01\x01ä÷\x00\x00\x01\x01åq\x00\x00'
lengths = <itertools.cycle object at 0x120fb4e08>

    def readint(s, lengths):
        lengths = itertools.cycle(lengths)
        offset = 0
        for length in itertools.cycle(lengths):
            next = offset + length
            # if isinstance(s, str):
            #    s = bytes(s, 'latin-1')
    
>           yield int(hexlify(s[offset:next]), 16) if length else None
E           TypeError: a bytes-like object is required, not 'str'

../../.virtualenvs/pdftools35/lib/python3.5/site-packages/pdfrw/pdfreader.py:342: TypeError

I am using Python 3.5.1.

My test code:

import pdfrw

with open('document_after_flattening.pdf', 'rb') as f:
    fdata = f.read().decode('latin-1')
    pdfrw.PdfReader(fdata=fdata)

I have traced the error back to :

        def readint(s, lengths):
            lengths = itertools.cycle(lengths)
            offset = 0
            for length in itertools.cycle(lengths):
                next = offset + length
                yield int(hexlify(s[offset:next]), 16) if length else None
                offset = next
hexlify(s[offset:next])

I replaced the code above with:

        def readint(s, lengths):
            lengths = itertools.cycle(lengths)
            offset = 0
            for length in itertools.cycle(lengths):
                next = offset + length


                if isinstance(s, str):
                   s = bytes(s, 'latin-1')


                yield int(hexlify(s[offset:next]), 16) if length else None
                offset = next

It appears that the stream is a string and the function didn't expect this and that's the reason why it crashes.
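
A slightly tidier variant of the workaround, shown as a sketch rather than the maintainers' fix: convert the string to bytes once before the loop, and cycle the lengths only once. The sample input and field widths below are made up for illustration.

```python
import itertools
from binascii import hexlify


def readint(s, lengths):
    """Yield big-endian integers of the given field widths, cycling the widths."""
    if isinstance(s, str):
        # Convert once, up front; latin-1 restores the original byte values.
        s = s.encode("latin-1")
    offset = 0
    for length in itertools.cycle(lengths):
        end = offset + length
        yield int(hexlify(s[offset:end]), 16) if length else None
        offset = end


fields = list(itertools.islice(readint("\x01\x00\x10\x00", [1, 2, 1]), 3))
print(fields)  # [1, 16, 0]
```

As in the original, the generator never stops by itself, so the caller is expected to bound it, for example with `islice`.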

The document
document_before_flattening.pdf

document_after_flattening.pdf

Add tests to the release tarball

Could you please add the tests to your next release tarball? While the git repo is of course better for development, having tests in the tarball is useful for checking that the installation works, especially in the context of packaging (I'm in the process of adopting the packaging of pdfrw for Debian).

Thanks in advance.

Question: How to blacken text / Replace text / Remove text ?

Thanks for making this lib!
And sorry for the question; this is not an issue.

I have an existing PDF with a specific word I want to hide.
I am happy with any solution for hiding it (writing on top of it, cutting it out, replacing it, etc.).
I am not very familiar with the PDF format, so I'm not sure how to go about this. Any suggestions?

Happy to contribute back a working example when I get this to work :)

Problem with "/Contents" in some pdf's

In some PDFs, /Contents is an array containing one object instead of a direct object reference.

Eg. /Contents [ 5 0 R ] instead of /Contents 5 0 R.

To fix this problem, I changed buildxobj.py pagexobj method in line 190 to:

    if isinstance(page.Contents, PdfArray):
        contents = page.Contents[0]
    else:
        contents = page.Contents
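
A more general variant, sketched under the assumption that the array type behaves like a Python list: taking only element 0 covers single-element arrays, but an array with several streams would lose the rest. The hypothetical helper `content_streams` returns every stream so the caller can decide how to combine them.

```python
def content_streams(contents):
    """Return a page's content streams as a list (hypothetical helper)."""
    if contents is None:
        return []
    if isinstance(contents, list):  # assumes the array type subclasses list
        return list(contents)
    return [contents]


print(len(content_streams(["stream A", "stream B"])))  # 2
```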

Original issue reported on code.google.com by [email protected] on 18 Oct 2012 at 10:12

How to navigate through the tree to see the PDF text content.

I am trying to navigate through the tree created by PdfReader and looking for the var that stores the text of PDF.

>>> x.pages[1].Resources
{'/Font': {'/T1_2': (14, 0), '/T1_3': (16, 0), '/T1_0': (146, 0), '/T1_1': (148, 0)}, '/ProcSet': ['/PDF', '/Text']}
>>> x.pages[1].Resources.ProcSet
['/PDF', '/Text'] 

Not sure if this is the correct way of doing it.
