pmaupin / pdfrw
pdfrw is a pure Python library that reads and writes PDFs
License: Other
Convert wiki from markdown to rst, and build it at readthedocs.
Start using wheels, and travisci.
Anything else?
I wrote a little library based on pdfrw to manipulate pdf page labels:
https://github.com/lovasoa/pagelabels-py/tree/master/pagelabels
I thought you might be interested in integrating it directly into pdfrw, for easier page label manipulation.
It would be good if pdfrw could be installed with easy_install or pip.
The following simple setup.py works for me:
#!/usr/bin/env python
from setuptools import setup
setup(
    name = "pdfrw",
    version = "0.1",
    packages = ["pdfrw"]
)
Original issue reported on code.google.com by [email protected]
on 17 Sep 2012 at 11:38
Boolean values get converted to "True" and "False". According to the PDF reference it must be lower-case.
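The difference is easy to see in isolation: Python's str(True) gives "True", while the PDF syntax uses the lower-case keywords true and false. A minimal sketch of the conversion follows; format_pdf_bool is a hypothetical helper name used for illustration, not part of pdfrw.

```python
def format_pdf_bool(value):
    """Return the PDF keyword for a Python bool (hypothetical helper, not pdfrw API)."""
    if not isinstance(value, bool):
        raise TypeError("expected a bool, got %r" % (value,))
    # PDF booleans are the lower-case keywords `true` and `false`.
    return "true" if value else "false"


print(str(True))               # Python's default spelling: True
print(format_pdf_bool(True))   # PDF spelling: true
print(format_pdf_bool(False))  # PDF spelling: false
```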
If this feature already exists, I'd love to know about it. If it doesn't, I'd like to put in a formal request for it. I know where the information exists (Root.Outlines), but I don't know how to modify it.
I did read the documentation, which indicates that encryption isn't supported. I am using PyPDF2, which does support it, but only a few schemes; the newer encryption used by some government agencies, 128-bit AES, is not supported.
Any ideas or thoughts if this is something that you will be implementing in the future? If so do you have any timeframe in mind?
Thanks
Thanks for your work.
Can I use barcodes with pdfrw?
Do I need another library, or should I import a barcode image into the document?
Purpose of code changes on this branch:
Allow reading imperfect or just plain broken PDFs:
1. no newline after %%EOF (allowed in PDF format)
2. support a single filter when specified in an array, i.e. /Filter[/FlateDecode] instead of /Filter /FlateDecode
3. when "endstream" is not found at the specified stream length, try to find it again using a simple string search from the start.
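Items 2 and 3 could be approached roughly as follows. This is only a sketch under stated assumptions: normalize_filters and find_stream_end are hypothetical names, the file content is taken to be a bytes object, and none of this is the code on the branch.

```python
def normalize_filters(filter_value):
    """Return the filters as a list, whether given as a single name or an array."""
    if filter_value is None:
        return []
    if isinstance(filter_value, (list, tuple)):
        return list(filter_value)
    return [filter_value]


def find_stream_end(data, start, length):
    """Return the offset where the stream data ends.

    Trust the declared length if `endstream` follows it (allowing for
    optional end-of-line bytes); otherwise search for the keyword from `start`.
    Returns -1 if no `endstream` keyword is found at all.
    """
    end = start + length
    if data[end:].lstrip(b"\r\n").startswith(b"endstream"):
        return end
    return data.find(b"endstream", start)


data = b"stream\nABCD\nendstream"
start = len(b"stream\n")
print(find_stream_end(data, start, 4))  # declared length is correct
print(find_stream_end(data, start, 2))  # declared length is wrong, fall back to search
print(normalize_filters("/FlateDecode"), normalize_filters(["/FlateDecode"]))
```

The fallback search is deliberately naive: binary stream data may itself contain the bytes "endstream", so it should only be used when the declared length has already proven wrong.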
When reviewing my code changes, please focus on:
Make sure it does not affect handling correct PDFs.
After the review, I'll merge this branch into:
/trunk
Original issue reported on code.google.com by [email protected]
on 12 Mar 2010 at 6:25
Hi,
Just discovered a small bug:
code to reproduce:
from reportlab.pdfgen import canvas
from pdfrw import PdfReader, PdfWriter, PageMerge
# create some files
pdf_file = canvas.Canvas('page.pdf')
pdf_file.drawString(0, 0, 'hello')
pdf_file.save()
watermark_file = canvas.Canvas('water.pdf')
watermark_file.drawString(0, 0, 'water')
watermark_file.save()
# watermark 1
wmark = PageMerge().add(PdfReader('water.pdf').pages[0])[0]
trailer = PdfReader('page.pdf')
for page in trailer.pages:
    PageMerge(page).add(wmark).render()
PdfWriter().write('merged.pdf', trailer)
# watermark the watermarked file
trailer = PdfReader('merged.pdf')
for page in trailer.pages:
    PageMerge(page).add(wmark).render()
PdfWriter().write('merged2.pdf', trailer)
The problem is around https://github.com/pmaupin/pdfrw/blob/master/pdfrw/pagemerge.py#L202.
The number 6 is not len('\pdfrw_'), so the isdigit() fails, and \pdfrw_0 is re-used every time.
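The off-by-one can be seen with plain string slicing. This sketch assumes the resource names start with the usual PDF name slash, i.e. /pdfrw_N; next_index is a hypothetical helper and is not the code in pagemerge.py.

```python
PREFIX = "/pdfrw_"


def next_index(existing_names, prefix=PREFIX):
    """Return the first integer suffix not already used with `prefix`."""
    used = set()
    for name in existing_names:
        suffix = name[len(prefix):]  # slice by the real prefix length, not a literal
        if name.startswith(prefix) and suffix.isdigit():
            used.add(int(suffix))
    index = 0
    while index in used:
        index += 1
    return index


name = "/pdfrw_0"
print(len(PREFIX))          # 7
print(name[6:].isdigit())   # False: the slice is "_0"
print(name[7:].isdigit())   # True: the slice is "0"
print(next_index(["/pdfrw_0", "/pdfrw_1", "/Im0"]))  # 2
```

Deriving the slice offset from len(prefix) rather than a hard-coded number keeps the check correct if the prefix ever changes.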
decode() was modified to fix one user's needs; need to fix encode() as well. Not sure what the right thing to do here is yet.
Is there any way to rescale a single page, like PyPDF2's Page.scaleTo(width, height)?
There seem to be some examples like this:
for page in output.pages:
    try:
        p = PageMerge().add(page)
        p[0].scale(0.1)
        p.render()
    except Exception as e:
        print e
which are a bit unclear and don't seem to work when iterating over the pages of an opened file; in addition, some pages raise an error ("TypeError: 'NoneType' object has no attribute '__getitem__'").
The correct way would be to use the above code but add the pages to a new writer, but that would mean losing any bookmarks from the original file, which is really bad.
One would expect a resize option like the one already available for rotation, like this (ignore the added watermark code):
for page in output.pages:
    page.Rotate = 90
    PageMerge(page).add(watermark, prepend=True).render()
Attempting to run the following line of code:
pages[0].MediaBox[2:]
results in something like:
Traceback (most recent call last):
File "/tp/new_backend/teampatent/test/test_pdfwrap.py", line 233, in test_pdfwrap_page_sizes
eq_([420, 595], [int(n) for n in pages[0].MediaBox[2:]])
File "build/bdist.linux-i686/egg/pdfrw/objects/pdfarray.py", line 39, in __getslice__
return listget(self, index)
TypeError: 'int' object is not callable
This used to work before the 0.1 release.
Original issue reported on code.google.com by [email protected]
on 5 Mar 2013 at 1:21
I mean using non-English characters, for example:
writer.trailer.Info = IndirectPdfDict(
    Title=u'unicode string1',
    Author=u'unicode string2',
)
Thanks
I'm trying to write a script that removes some images from a PDF file based on their dimensions.
Iterating over pages with findobjs.find_objects(page, valid_subtypes=(PdfName.Image,)), it finds Image objects and I can check their width and height properties. But then the link with the parent (the page) is lost, so I'm not able to remove the element from the page.
When I instead run find_objects on each content of the page (page.Contents), so that I can keep the link with the page, it is not able to find any Image object.
I've tried to understand the find_objects function in order to mimic its behavior in a custom function, but there is some magic around obj.iteritems() that I don't get.
Do you have any idea how to proceed?
There seems to be a "tab/space" (indentation) issue on line 499 of pdfreader.py: it currently calls convert_load only when reading from a file, not when reading from an in-memory object such as BytesIO. Ideally it should be done in both cases, so I believe the line "fdata=convert_load(fdata)" was misplaced into the file-reading section by mistake.
After fixing the indentation so that it applies in both cases, I can use it with a BytesIO object.
By the way, this bug seems to occur only in Python 3.5; I don't have any issue with Python 3.4.
I'm developing a Python application using pdfrw and everything seems fine, but I discovered that when I run my application with optimizations enabled (python -OO), pdfrw can't decode any PDF and raises exceptions.
From a quick inspection of pdfrw's source code, I found lines like these in pdfreader.py:
assert source.next() == 'R'
assert source.next() == '<<'
assert source.next() == 'startxref' and source.floc > startloc
Calling .next() inside an assert changes the program's behavior depending on whether optimizations are on or off, because assert statements are stripped when optimizations are enabled.
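One way to avoid the problem is to move the side effect out of the assert and check the result explicitly. The sketch below uses a plain Python iterator in place of pdfrw's tokenizer; ParseError and expect are hypothetical names for illustration, not pdfrw API.

```python
class ParseError(Exception):
    """Raised when the token stream does not match what the parser expects."""


def expect(tokens, expected):
    """Consume one token and check it explicitly, independent of -O/-OO."""
    token = next(tokens)  # the side effect happens outside any assert
    if token != expected:
        raise ParseError("expected %r, got %r" % (expected, token))
    return token


tokens = iter(["R", "<<", "startxref"])
expect(tokens, "R")
expect(tokens, "<<")
print(next(tokens))  # startxref
```

Because the token is always consumed, the parser's position stays the same whether or not assertions are compiled out.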
Giuseppe
Original issue reported on code.google.com by [email protected]
on 14 Sep 2012 at 7:48
- More compression types
- Linearized PDFs
- Maybe more PyPDF emulation (additional dict attributes, mainly)
Original issue reported on code.google.com by pmaupin
on 4 Sep 2012 at 2:09
Running the tests e.g. with nosetests results in a failure:
======================================================================
ERROR: test_doubleslash (tests.test_pdfstring.TestEncoding)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/valhalla/packaging/misc/pdfrw/pkg-pdfrw/tests/test_pdfstring.py", line 29, in test_doubleslash
self.roundtrip('\\')
File "/home/valhalla/packaging/misc/pdfrw/pkg-pdfrw/tests/test_pdfstring.py", line 26, in roundtrip
self.assertEqual(value, self.encode_decode(value))
File "/home/valhalla/packaging/misc/pdfrw/pkg-pdfrw/tests/test_pdfstring.py", line 23, in encode_decode
return cls.decode(cls.encode(value))
File "/home/valhalla/packaging/misc/pdfrw/pkg-pdfrw/tests/test_pdfstring.py", line 19, in encode
return str(pdfrw.pdfobjects.PdfString.encode(value))
AttributeError: 'module' object has no attribute 'pdfobjects'
----------------------------------------------------------------------
Ran 1 test in 0.027s
FAILED (errors=1)
I've noticed that pdfrw/__init__.py includes a line
from pdfrw.objects import PdfObject [...] PdfString
so I've tried to change:
s/pdfrw.pdfobjects.PdfString/pdfrw.PdfString/g
everywhere in the file, which resulted in a passing test.
What is the expected behaviour: the one used in the tests, or the one resulting from the code?
Thanks in advance
Original issue reported on code.google.com by [email protected]
on 30 Aug 2014 at 1:59
I ran the fancy_watermark.py and watermark.py examples, but the resulting watermarks are all flipped horizontally and vertically.
The watermark was generated with reportlab, for example:
c = canvas.Canvas('transafe.pdf')
c.drawString(0,0,'hello')
c.save()
Please tell me how to fix this.
Thanks in advance.
I'm not sure this is actually a bug in pdfrw makerl or not, but when I try
to use table of contents together with a template with a pdfrw object in
it, it fails with:
File "/usr/lib64/python2.6/site-packages/reportlab/pdfbase/pdfdoc.py", line 852, in format
raise KeyError, "forward reference to %s not resolved upon final formatting" % repr(self.name)
KeyError: "forward reference to 'FormXob.pdfrw_3' not resolved upon final formatting"
I have attached a small test application that draws background.pdf before
anything else and outputs output.pdf.
Original issue reported on code.google.com by [email protected]
on 13 Jan 2010 at 10:26
Attachments:
I have form fields in my PDF (they make it interactive: you can fill them in and print with your data). I want to programmatically fill those fields based on their names (template.Root.Pages.Kids[x].Annots[y], with the name in 'T' and the default value in 'V'). The problem is that when I do so, the value is updated in the metadata, but the old value is displayed until I edit the PDF in some desktop editor (I can see the new default value, and it starts to be displayed when I make any change to the field). I'd love the displayed value to be updated as well.
Example:
template = pdfrw.PdfReader('template.pdf')
template.Root.Pages.Kids[0].Annots[3].update(pdfrw.PdfDict(V='(test)'))
pdfrw.PdfWriter().write('test.pdf', template)
What steps will reproduce the problem?
1. Go to example/rl1/
2. run subset.py test.pdf 1 1
3.
What is the expected output? What do you see instead?
I expect it to run, instead I get an error:
~/svn/pdfrw/examples/rl1$ python subset.py side1.pdf 1 1
Traceback (most recent call last):
File "subset.py", line 43, in <module>
go(inpfn, firstpage, lastpage)
File "subset.py", line 36, in go
canvas.doForm(makerl(canvas, page))
File "~/svn/pdfrw/pdfrw/toreportlab.py", line 138, in makerl
rlobj = makerl_recurse(doc, pdfobj)
File "~/svn/pdfrw/pdfrw/toreportlab.py", line 131, in makerl_recurse
return func(rldoc, pdfobj)
File "~/svn/pdfrw/pdfrw/toreportlab.py", line 94, in _makestream
rldict[key[1:]] = makerl_recurse(rldoc, value)
File "~/svn/pdfrw/pdfrw/toreportlab.py", line 131, in makerl_recurse
return func(rldoc, pdfobj)
File "~/svn/pdfrw/pdfrw/toreportlab.py", line 72, in _makedict
rldict[key[1:]] = makerl_recurse(rldoc, value)
File "~/svn/pdfrw/pdfrw/toreportlab.py", line 131, in makerl_recurse
return func(rldoc, pdfobj)
File "~/svn/pdfrw/pdfrw/toreportlab.py", line 72, in _makedict
rldict[key[1:]] = makerl_recurse(rldoc, value)
File "~/svn/pdfrw/pdfrw/toreportlab.py", line 131, in makerl_recurse
return func(rldoc, pdfobj)
File "~/svn/pdfrw/pdfrw/toreportlab.py", line 108, in _makearray
mylist = rlobj.sequence
AttributeError: PDFObjectReference instance has no attribute 'sequence'
What version of the product are you using? On what operating system?
Subversion revision 82
On Linux.
Please provide any additional information below.
Don't know if this helps, but if I place a try/except around the sequence
usage like so:
try:
    mylist = rlobj.sequence
    for value in pdfobj:
        mylist.append(makerl_recurse(rldoc, value))
    print dir(rlobj)
except:
    print dir(rlobj)
return rlobj
I get the following output:
['__PDFObject__', '__doc__', '__init__', '__module__', 'format', 'name']
['__PDFObject__', '__doc__', '__init__', '__module__', 'format', 'name']
['__PDFObject__', '__doc__', '__init__', '__module__', 'format', 'name']
['References', '__PDFObject__', '__doc__', '__init__', '__module__', 'format', 'multiline', 'sequence']
['References', '__PDFObject__', '__doc__', '__init__', '__module__', 'format', 'multiline', 'sequence']
['References', '__PDFObject__', '__doc__', '__init__', '__module__', 'format', 'multiline', 'sequence']
['References', '__PDFObject__', '__doc__', '__init__', '__module__', 'format', 'multiline', 'sequence']
Ah, and it works.. (output file seems correct anyways) :)
Original issue reported on code.google.com by [email protected]
on 12 Jan 2010 at 11:20
Maybe nice to have, but now out of date from readme.
In the examples where XObjects are used, new pages end up being written to the /Kids array as direct objects after they are added. According to the specification, they must be indirect. Although PDF readers open such documents just fine, some tools complain about it. Possible solutions:
1) in the examples (e.g. function get4 in 4up.py), change the return type from PdfDict to IndirectPdfDict.
2) change the type to indirect in the writer. For example, in _get_trailer:
# Make all the pages point back to the page dictionary
pagedict = trailer.Root.Pages
for page in pagedict.Kids:
    page.Parent = pagedict
    page.indirect = True <-- add this line
I think the second approach is cleaner.
Original issue reported on code.google.com by [email protected]
on 17 Nov 2012 at 3:56
1. Get a PDF with a URI in an annotation.
2. Run this code on it:
#!/usr/bin/env python
import sys
import os
from pdfrw import PdfReader, PdfWriter
def convert(inpfn, outfn):
    pdf = PdfReader(inpfn)
    for K in pdf.Root.Pages.Kids:
        if K.Annots is not None:
            for An in K.Annots:
                if An.A is not None:
                    if An.A.URI is not None:
                        An.A.URI = An.A.URI
    outdata = PdfWriter()
    outdata.trailer = pdf
    outdata.write(outfn)
for inpfn in sys.argv[1:]:
    print inpfn, ':'
    outfn = 'out/' + inpfn
    convert(inpfn, outfn)
Expected output: the output PDF should be identical to the input.
Actual result: In the output PDF the URI will have extra brackets added around
it, ie instead of
http://www.example.com
the URI now points to:
(http://www.example.com)
which fails to open correctly in any PDF reader.
Using version 0.1-1 on Ubuntu 14.04.
Original issue reported on code.google.com by [email protected]
on 21 Oct 2014 at 4:16
Hello there,
First thanks for your great work!
I've been using this library for a while and it has worked perfectly. My use case is the following:
1- I build a PDF document with the reportlab library.
2- I have other existing PDF documents that I append to the end of the document built in step 1. Until now everything was fine, but I have run into a PDF file that, when I try to append it, makes the program exit with a StopIteration exception. The error stack follows. The PDF in question opens fine in Adobe Acrobat Reader and has 3 pages. Is there something I can do to detect that a PDF file is not supported?
Thanks!
File "/usr/local/lib/python2.7/dist-packages/pdfrw/pdfwriter.py", line 316, in write
[Mon Jun 13 13:15:50.782466 2016] [:error] [pid 16822] self.killobj, user_fmt=user_fmt)
[Mon Jun 13 13:15:50.782471 2016] [:error] [pid 16822] File "/usr/local/lib/python2.7/dist-packages/pdfrw/pdfwriter.py", line 196, in FormatObjects
[Mon Jun 13 13:15:50.782476 2016] [:error] [pid 16822] format_deferred()
[Mon Jun 13 13:15:50.782481 2016] [:error] [pid 16822] File "/usr/local/lib/python2.7/dist-packages/pdfrw/pdfwriter.py", line 164, in format_deferred
[Mon Jun 13 13:15:50.782487 2016] [:error] [pid 16822] objlist[index] = format_obj(obj)
[Mon Jun 13 13:15:50.782492 2016] [:error] [pid 16822] File "/usr/local/lib/python2.7/dist-packages/pdfrw/pdfwriter.py", line 145, in format_obj
[Mon Jun 13 13:15:50.782512 2016] [:error] [pid 16822] myarray.append(add(value))
[Mon Jun 13 13:15:50.782517 2016] [:error] [pid 16822] File "/usr/local/lib/python2.7/dist-packages/pdfrw/pdfwriter.py", line 83, in add
[Mon Jun 13 13:15:50.782523 2016] [:error] [pid 16822] result = format_obj(obj)
[Mon Jun 13 13:15:50.782528 2016] [:error] [pid 16822] File "/usr/local/lib/python2.7/dist-packages/pdfrw/pdfwriter.py", line 135, in format_obj
[Mon Jun 13 13:15:50.782533 2016] [:error] [pid 16822] myarray = [add(x) for x in obj]
[Mon Jun 13 13:15:50.782538 2016] [:error] [pid 16822] File "/usr/local/lib/python2.7/dist-packages/pdfrw/objects/pdfarray.py", line 46, in iter
[Mon Jun 13 13:15:50.782543 2016] [:error] [pid 16822] self._resolve()
[Mon Jun 13 13:15:50.782548 2016] [:error] [pid 16822] File "/usr/local/lib/python2.7/dist-packages/pdfrw/objects/pdfarray.py", line 28, in _resolver
[Mon Jun 13 13:15:50.782554 2016] [:error] [pid 16822] value = value.real_value()
[Mon Jun 13 13:15:50.782559 2016] [:error] [pid 16822] File "/usr/local/lib/python2.7/dist-packages/pdfrw/objects/pdfindirect.py", line 21, in real_value
[Mon Jun 13 13:15:50.782564 2016] [:error] [pid 16822] value = self.value = self._loader(self)
[Mon Jun 13 13:15:50.782569 2016] [:error] [pid 16822] File "/usr/local/lib/python2.7/dist-packages/pdfrw/pdfreader.py", line 200, in loadindirect
[Mon Jun 13 13:15:50.782574 2016] [:error] [pid 16822] source.next()
[Mon Jun 13 13:15:50.782579 2016] [:error] [pid 16822] StopIteration
Need to finish up a more comprehensive readme, and document the new object extraction functionality, and changed/deleted examples.
I'm sorry for posting something that is not directly an issue (and I'll be glad to push a PR for the docs with the answer!).
I'm trying to remove some objects (mostly images) from an existing PDF. I can find them with pdfrw.findobjs.find_objects(), but I have no clue how to remove them from "the tree". Can you provide any kind of guidance (or a place to look for more resources)?
Many thanks,
I was getting this when trying to merge this PDF:
http://demo.visualid.com/_mediafiles/demo/_tmp/3GoeKXdVeEk7.pdf
into this PDF:
https://s3-eu-west-1.amazonaws.com/visualid-mediafiles/demo/20170131/AC1E00B70333d13AF9qXE39A3E9A/9sd9DGiaS4X4.pdf
FlateDecode needs to decompress and recompress
Here is the dict of parameters in the PDF:
[{'/Length': '3134', '/Filter': '/FlateDecode'},
 {'/Length': '3172', '/Filter': '/FlateDecode'},
 {'/Length': '3597', '/Filter': '/FlateDecode'},
 {'/Length': '3580', '/Filter': '/FlateDecode'},
 {'/Length': '3044', '/Filter': '/FlateDecode'},
 {'/Length': '3393', '/Filter': '/FlateDecode'},
 {'/Length': '3347', '/Filter': '/FlateDecode'},
 {'/Length': '3223', '/Filter': '/FlateDecode'}]
Code fix by @pmaupin to come
Inline images, as described at section 4.8.6 of the PDF 1.7 reference, turn content streams on their head by putting raw image data in the middle of normal PDF objects. It seems the tokenizer doesn't handle inline images at present, so the image data gets parsed into nonsense operators/operands. If there are left and right angle brackets in the image data, one of the tokens will be an invalid hex-encoded string, which will raise an assertion when you try to decode() it.
I've started work on this in my fork, but I'm wondering if the image data should be returned as a different data type or object. I haven't looked at PdfWriter yet either.
I filed a PR over at pmaupin/static_pdfs#1 to add a file with inline images to the test files. If you run the script below on that file, you should get an AssertionError with a message of '<\x00\x00>'.
import sys
import pdfrw
with open(sys.argv[1], "rb") as f:
    doc = pdfrw.PdfReader(f)
    for page in doc.pages:
        if isinstance(page.Contents, pdfrw.PdfArray):
            contents = list(page.Contents)
        else:
            contents = [page.Contents]
        pdfrw.uncompress.uncompress(contents)
        for content in contents:
            if content is None:
                continue
            for token in pdfrw.PdfTokens(content.stream):
                if isinstance(token, pdfrw.PdfString):
                    token.decode()
Needs to search further up the directory hierarchy.
What steps will reproduce the problem?
1. Call pdfrw.pdfobjects.PdfString.encode on a string containing a double backslash.
2. Call .decode() on the pdfstring.
What is the expected output? What do you see instead?
Encoding and then decoding a string should return the original.
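The round-trip property can be illustrated with a small stand-alone encoder and decoder for PDF literal strings. This is a sketch only: encode_literal and decode_literal are hypothetical names, they handle just the backslash and parenthesis escapes, and they are not pdfrw's PdfString implementation or the attached patch.

```python
import re

_ESCAPES = {"\\": "\\\\", "(": "\\(", ")": "\\)"}


def encode_literal(text):
    """Wrap `text` in parentheses, escaping backslashes and parentheses."""
    return "(" + "".join(_ESCAPES.get(ch, ch) for ch in text) + ")"


def decode_literal(literal):
    """Reverse encode_literal (handles only the three escapes above)."""
    if not (literal.startswith("(") and literal.endswith(")")):
        raise ValueError("not a literal string: %r" % (literal,))
    # Replace each backslash-escaped character with the character itself,
    # scanning left to right so that an escaped backslash decodes to one backslash.
    return re.sub(r"\\(.)", r"\1", literal[1:-1])


for value in ["\\", "a\\\\b", "(nested)", "plain"]:
    assert decode_literal(encode_literal(value)) == value
print(encode_literal("\\"))
```

Decoding in a single left-to-right pass matters here: replacing escape sequences one kind at a time can misread the second half of an escaped backslash as the start of another escape.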
What version of the product are you using? On what operating system?
Latest SVN (revision 136).
Please provide any additional information below.
Patch attached (including a unittest).
Original issue reported on code.google.com by beechhorn
on 13 Sep 2011 at 4:28
Attachments:
Hello there,
I'm wondering if there are any plans or perhaps some hints on reducing the amount of memory (and perhaps speeding things up) required.
From a quick look through the code, there are a great many strings being loaded, split, concatenated, etc. Much of this could probably be improved with bytearray or memoryview to avoid excessive string copying.
I have noticed that it is possible to make a PdfReader either by specifying a filename or file-like object, or by giving the data directly with the fdata argument. This is great; however, it doesn't work if I give it a BytesIO object, since the various functions in the following code only work with strings. For example, fdata.startswith('%PDF-') is called rather than fdata.startswith(b'%PDF-').
I can't immediately see an elegant way to solve this. Directly converting the data with str() produces assertion errors such as 'File "/usr/lib/python3.4/site-packages/pdfrw/pdfreader.py", line 319, in findxref assert tok == 'startxref' # (We just checked this...)' with the files I have tried.
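Part of the reason str() fails is that, in Python 3, calling str() on a bytes object returns its repr (the text "b'...'"), not the decoded content. A lossless alternative is to decode with latin-1, which maps each byte to the code point of the same value. The sketch below shows that idea only; read_pdf_text is a hypothetical helper, not pdfrw API.

```python
import io


def read_pdf_text(source):
    """Return the file content as str, one character per byte (latin-1)."""
    data = source.read() if hasattr(source, "read") else source
    if isinstance(data, bytes):
        # latin-1 maps bytes 0-255 to code points 0-255, so nothing is lost.
        data = data.decode("latin-1")
    return data


raw = b"%PDF-1.4\n\xe2\xe3\xcf\xd3\n"
text = read_pdf_text(io.BytesIO(raw))
print(text.startswith("%PDF-"))
print(text.encode("latin-1") == raw)
```

The resulting string can be handed to code that expects str, and encoding it back with latin-1 recovers the original bytes exactly.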
Can a PDF-to-text conversion tool (a la pdf2txt in pdfminer) be added?
What steps will reproduce the problem?
1. using watermark.py with -d and -o where a path is specified with -d
Watermarked files overwrite the input files
What is the expected output? What do you see instead?
watermarked files in directory specified by -o
What version of the product are you using? On what operating system?
0.1 downloaded 18-Jan-2014 as .zip file; on Windows 7 (irrelevant)
Please provide any additional information below.
Line 67
PdfWriter().write(path.join(outdir, fname), trailer)
should be changed to
PdfWriter().write(path.join(outdir, os.path.basename(fname)), trailer)
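One way the overwrite can come about is that join() discards earlier components when a later one is an absolute path. The sketch below assumes fname is absolute (the paths are made up for illustration) and uses posixpath so that the result is the same on every platform; os.path behaves analogously with the platform's own rules.

```python
import posixpath  # POSIX rules, so the result is the same on every platform

outdir = "out"
fname = "/data/in/report.pdf"  # hypothetical absolute input path

# join() drops "out" because the second component is absolute,
# so the "output" path is the input file itself.
print(posixpath.join(outdir, fname))
# Keeping only the file name puts the result inside outdir.
print(posixpath.join(outdir, posixpath.basename(fname)))
```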
Original issue reported on code.google.com by [email protected]
on 18 Jan 2014 at 9:45
The code.google.com service is closing. Do you have plans to migrate pdfrw somewhere else?
Original issue reported on code.google.com by [email protected]
on 17 Mar 2015 at 10:31
Need to start doing triage on them.
Obviously some of them are because of the missing object stream support.
What steps will reproduce the problem?
1. Using the code from watermark.py as a sample, I attempted to overlay the attached PDF document onto another PDF document.
Minimal code reproduction example:
Python 2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> from pdfrw import PdfReader
>>> from pdfrw.buildxobj import pagexobj
>>>
>>> xobj = pagexobj(PdfReader('boverlay-new.pdf').getPage(0))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "c:\python27\lib\site-packages\pdfrw\buildxobj.py", line 193, in pagexobj
assert int(contents.Length) == len(contents.stream)
AttributeError: 'PdfArray' object has no attribute 'Length'
The overlay file opens in PDF readers (Foxit, Adobe), but pdfrw is unable to create a page object from the first page of the PDF. The overlay PDF was created using Adobe InDesign, and is attached.
What is the expected output? What do you see instead?
No overlay is produced, and the exception above is generated instead.
What version of the product are you using? On what operating system?
I have the latest pdfrw as retrieved from SVN. Windows 7, 64-bit, using Python 2.7.3 32-bit.
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 4 Nov 2013 at 5:23
Attachments:
I have a PDF document with an Acroform in it. After filling in values and flattening the document, I tried to open the document again with the PdfReader, but this fails. It gives me the following stacktrace.
Error stacktrace:
../../.virtualenvs/pdftools35/lib/python3.5/site-packages/pdfrw/pdfreader.py:546: in __init__
trailer, is_stream = self.parsexref(source)
../../.virtualenvs/pdftools35/lib/python3.5/site-packages/pdfrw/pdfreader.py:439: in parsexref
return self.parse_xref_stream(source), True
../../.virtualenvs/pdftools35/lib/python3.5/site-packages/pdfrw/pdfreader.py:372: in parse_xref_stream
xtype, p1, p2 = islice(get, 3)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
s = '\x00\x00\x00\x00ÿÿ\x01\x00\x00\x12\x00\x00\x01\x00\x02\x02\x00\x00\x01\x00\x12[\x00\x00\x01\x00\x13\x94\x00\x00\x01\x...12\x00\x00\x01\x01ãc\x00\x00\x01\x01ãÜ\x00\x00\x01\x01ä-\x00\x00\x01\x01ä¦\x00\x00\x01\x01ä÷\x00\x00\x01\x01åq\x00\x00'
lengths = <itertools.cycle object at 0x120fb4e08>
def readint(s, lengths):
lengths = itertools.cycle(lengths)
offset = 0
for length in itertools.cycle(lengths):
next = offset + length
# if isinstance(s, str):
# s = bytes(s, 'latin-1')
> yield int(hexlify(s[offset:next]), 16) if length else None
E TypeError: a bytes-like object is required, not 'str'
../../.virtualenvs/pdftools35/lib/python3.5/site-packages/pdfrw/pdfreader.py:342: TypeError
I am using Python 3.5.1.
My test code:
import pdfrw
with open('document_after_flattening.pdf', 'rb') as f:
    fdata = f.read().decode('latin-1')
pdfrw.PdfReader(fdata=fdata)
I have traced the error back to :
def readint(s, lengths):
    lengths = itertools.cycle(lengths)
    offset = 0
    for length in itertools.cycle(lengths):
        next = offset + length
        yield int(hexlify(s[offset:next]), 16) if length else None
        offset = next
hexlify(s[offset:next])
I replaced the code above with:
def readint(s, lengths):
    lengths = itertools.cycle(lengths)
    offset = 0
    for length in itertools.cycle(lengths):
        next = offset + length
        if isinstance(s, str):
            s = bytes(s, 'latin-1')
        yield int(hexlify(s[offset:next]), 16) if length else None
        offset = next
It appears that the stream is a str, which the function doesn't expect (hexlify requires a bytes-like object in Python 3), and that is why it crashes.
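A variant of the same idea converts the input once, before the loop, rather than testing its type on every iteration. This is a sketch under assumptions, not the library's code: unlike the function quoted above, it stops when the data runs out, and it avoids shadowing the built-in next.

```python
import itertools
from binascii import hexlify


def readint(s, lengths):
    """Yield big-endian integers of the given byte widths, cycling over `lengths`.

    A width of 0 yields None. Generation stops when the data runs out
    (a choice made for this sketch so that the generator is finite).
    """
    if isinstance(s, str):
        s = s.encode("latin-1")  # convert once, before the loop
    lengths = list(lengths)
    if not any(lengths):
        return  # all widths are zero: nothing would ever be consumed
    offset = 0
    for length in itertools.cycle(lengths):
        end = offset + length
        if end > len(s):
            return
        yield int(hexlify(s[offset:end]), 16) if length else None
        offset = end


print(list(readint("\x01\x00\x10\x00", [1, 2, 1])))
print(list(readint(b"\x01\x00\x10\x00", [1, 2, 1])))
```

Both calls produce the same values, since a str holding only code points 0-255 encodes to the identical byte sequence under latin-1.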
The document
document_before_flattening.pdf
Could you please add the tests to your next release tarball? While the git repo is of course better for development, having tests in the tarball is useful for checking that the installation works, especially in the context of packaging (I'm in the process of adopting the packaging of pdfrw for Debian).
Thanks in advance.
And maybe unit tests that exercise some of the examples.
Thanks for making this lib!
Sorry for the question; this is not an issue.
I have an existing PDF with a specific word I want to hide.
I am happy with any solution for hiding it (write on top of it, cut it out, replace it, etc.).
I am not very familiar with the PDF format, so I'm not sure how to go about it. Any suggestions?
Happy to contribute back a working example when I get this to work :)
The __all__ list in __init__.py must contain strings with the names of the modules, not the modules themselves.
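The effect can be reproduced without touching pdfrw by building a throwaway module in memory. In this sketch, demo_pkg and star_import are made-up names for illustration; in Python 3, a non-string entry in __all__ makes the star import raise TypeError.

```python
import sys
import types


def star_import(all_value):
    """Build a throwaway module with the given __all__ and star-import it."""
    module = types.ModuleType("demo_pkg")  # hypothetical module name
    module.PdfReader = object()            # stand-in for a real export
    module.__all__ = all_value
    sys.modules["demo_pkg"] = module
    namespace = {}
    try:
        exec("from demo_pkg import *", namespace)
    finally:
        del sys.modules["demo_pkg"]
    return namespace


ok = star_import(["PdfReader"])  # names as strings: works
print("PdfReader" in ok)

try:
    star_import([object()])  # an object instead of its name
except TypeError as exc:
    print("TypeError:", exc)
```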
https://docs.python.org/2/tutorial/modules.html#importing-from-a-package
I have some input PDFs and try to watermark them with a given single page watermark.pdf.
For some PDFs that works (watermark shows), for some not (output looks as input).
My code is like this, underneath=False:
https://github.com/pmaupin/pdfrw/blob/master/examples/watermark.py
Do you have an idea why? Is this a bug? How can it be debugged?
In some PDFs, /Contents holds an array with one object instead of the object directly.
E.g. /Contents [ 5 0 R ] instead of /Contents 5 0 R.
To fix this problem, I changed the pagexobj method in buildxobj.py at line 190 to:
if isinstance(page.Contents, PdfArray):
    contents = page.Contents[0]
else:
    contents = page.Contents
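Taking only the first element works for one-element arrays, but would silently drop the remaining streams if the array held more than one. A slightly more general approach is to normalize /Contents to a list and process every entry. The sketch below uses plain Python lists and strings as stand-ins, on the assumption that pdfrw's PdfArray is a list subclass; content_streams is a hypothetical helper name.

```python
def content_streams(contents):
    """Return the page's content streams as a list (hypothetical helper)."""
    if contents is None:
        return []
    if isinstance(contents, list):  # assumption: PdfArray subclasses list
        return list(contents)
    return [contents]


print(content_streams("5 0 R"))
print(content_streams(["5 0 R"]))
print(content_streams(["5 0 R", "6 0 R"]))
```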
Original issue reported on code.google.com by [email protected]
on 18 Oct 2012 at 10:12
Only ported example is watermark (it just required print fixes).
I am trying to navigate through the tree created by PdfReader, looking for the attribute that stores the text of the PDF.
>>> x.pages[1].Resources
{'/Font': {'/T1_2': (14, 0), '/T1_3': (16, 0), '/T1_0': (146, 0), '/T1_1': (148, 0)}, '/ProcSet': ['/PDF', '/Text']}
>>> x.pages[1].Resources.ProcSet
['/PDF', '/Text']
Not sure if this is the correct way of doing it.