Comments (18)
Okay, okay, you made me look.
I'll give you points for trying, and I'll give you a lot of points for working code. A working starting point is precious.
But PyPDF2-style code usually pretty much misses the entire point of how pdfrw works, so I've taken your working code and munged it into something that feels better to me. I didn't spend a lot of time on it, so there might be some rough edges.
Also, as I mentioned earlier, I may not be really keen on adding too much more stuff to pdfwriter.py. For example, I could see adding a generic write-hook that a bookmark module could use to hook itself in, so then we could have a bookmarks.py that has a function you specifically invoke to use bookmarks.
I haven't really thought about it that much, though, so for now, I kept your basic structure of subclassing the writer -- except for how the Info dict is handled. One of the things that makes the PyPDF2 codebase unwieldy is the proliferation of java-style getters and setters. pdfrw is not big on that -- if you want to set up an Info dict, then just set up an Info dict. Otherwise, where does it stop?
newwriter.py:
from pdfrw.pdfwriter import PdfWriter, IndirectPdfDict, PdfName, \
    PdfOutputError, PdfDict, user_fmt


class NewPdfWriter(PdfWriter):

    _outline = None

    def addBookmark(self, title, pageNum, parent=None):
        '''
        Adds a new bookmark entry.
        pageNum must be a valid page number in the writer
        and parent can be a bookmark object returned by
        a previous addBookmark call
        '''
        try:
            page = self.pagearray[pageNum]
        except IndexError:
            # TODO: Improve error handling ?
            raise PdfOutputError("Invalid page number: %s" % pageNum)

        # Top-level bookmarks hang off the outline root, created on first use
        parent = parent or self._outline
        if parent is None:
            parent = self._outline = IndirectPdfDict()

        bookmark = IndirectPdfDict(
            Parent = parent,
            Title = title,
            A = PdfDict(
                D = [page, PdfName.Fit],
                S = PdfName.GoTo
            )
        )

        if parent.Count:
            # Append to the end of the parent's linked list of children
            parent.Count += 1
            prev = parent.Last
            bookmark.Prev = prev
            prev.Next = bookmark
            parent.Last = bookmark
        else:
            # First child of this parent
            parent.Count = 1
            parent.First = bookmark
            parent.Last = bookmark
        return bookmark

    def write(self, fname, trailer=None, user_fmt=user_fmt,
              disable_gc=True):
        trailer = trailer or self.trailer
        trailer.Root.Outlines = self._outline
        super(NewPdfWriter, self).write(fname, trailer, user_fmt, disable_gc)
test.py:
from pdfrw import PdfReader, IndirectPdfDict
import newwriter
from datetime import datetime

output = newwriter.NewPdfWriter()

# Append the same source file three times, bookmarking each copy.
# out_small.pdf needs at least three pages for the child bookmarks.
for i in range(3):
    totalPages = len(output.pagearray)
    output.addpages(PdfReader('out_small.pdf').pages)
    bmname = 'Bm (%s) - %s' % (i+1, 'Root')
    t1 = output.addBookmark(bmname, totalPages)
    t2 = output.addBookmark("Child 1", totalPages+1, t1)
    output.addBookmark("Child 1.1", totalPages+2, t2)

now = datetime.now()
date = 'D:%04d%02d%02d%02d%02d%02d' % (now.year, now.month,
    now.day, now.hour, now.minute, now.second)

# Set up the Info dict only after all pages have been added
info = output.trailer.Info = IndirectPdfDict()
info.Title = 'Test Merged PDF'
info.Author = 'asdasd'
info.Creator = 'random dude'
info.Producer = 'another random dude'
info.CreationDate = date
output.write('result.pdf')
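As a side note on the CreationDate string built by hand in test.py: the same 'D:YYYYMMDDHHmmSS' layout can probably be produced with strftime. This is only a sketch; pdf_date is a hypothetical helper name, not part of pdfrw, and it ignores the optional timezone suffix.

```python
from datetime import datetime


def pdf_date(moment):
    """Format a datetime as a PDF date string (no timezone suffix)."""
    return moment.strftime('D:%Y%m%d%H%M%S')


# A fixed timestamp keeps the example reproducible
print(pdf_date(datetime(2016, 5, 4, 13, 7, 9)))
```

In test.py this would replace the two lines that build date, for example date = pdf_date(datetime.now()).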
from pdfrw.
I would like to have that feature too. Could you share details about a possible implementation, or at least give some directions? I'm about to switch to PyPDF2 because of that missing feature.
from pdfrw.
Not sure how helpful I can be, but at least I know that the bookmarks are supposed to live in the "[PdfReader].Root.Outlines" object. The document "Outlines" dictionary object is defined in the PDF Reference, section 8.2.2.
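To make the Outlines structure a bit more concrete, here is a minimal sketch of walking it. It assumes only that each outline node exposes First, Next and Title, and that a missing entry reads as None (which is how pdfrw dictionaries behave). Node is a hypothetical stand-in so the example runs without a PDF file; with pdfrw you would presumably start from PdfReader(...).Root.Outlines instead, and the titles would be PDF string objects that may need decoding.

```python
class Node(object):
    """Stand-in for an outline dictionary; missing entries read as None."""

    def __init__(self, **entries):
        self.__dict__.update(entries)

    def __getattr__(self, name):
        # Only called when the attribute was never set
        return None


def walk_outline(parent, depth=0):
    """Yield (depth, title) for every item below parent, in display order."""
    item = parent.First
    while item is not None:
        yield depth, item.Title
        # Children are listed before the next sibling
        for nested in walk_outline(item, depth + 1):
            yield nested
        item = item.Next


# Tiny hand-built outline: two top-level items, the first with one child
child = Node(Title='Child 1')
first = Node(Title='Bm (1) - Root', First=child, Last=child)
second = Node(Title='Bm (2) - Root')
first.Next = second
root = Node(First=first, Last=second)

for depth, title in walk_outline(root):
    print('  ' * depth + title)
```

The sketch ignores destinations entirely; it only shows how the First/Next links chain the items together.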
from pdfrw.
@eburghar, @tisimst: If I may, I would recommend not switching to PyPDF2. pdfrw performs much better and is better constructed internally. Its owner is obviously very busy, but PyPDF2, in contrast, seems to have come to a halt.
If bookmark maintenance is the one feature that you badly miss, take PyMuPDF as an interim solution. It has supported this for a few days now, as well as metadata maintenance and incremental saves. A wxPython-based GUI example for TOC maintenance is included in the repo. PyMuPDF's drawbacks compared to pdfrw are that it is not pure Python and does not have as many PDF output features.
from pdfrw.
How difficult would this be? I end up doing all the processing with pdfrw since it's way faster than the rest, but at the end I always need to reopen the final file with PyPDF2 to add bookmarks and metadata =/
from pdfrw.
- PyMuPDF is based on MuPDF, so it's not pure Python. It runs on Windows, Linux and Mac, Python 2.7 and up, 32 and 64 bit. MuPDF must be compiled / generated first, before PyMuPDF can be set up. On Windows, pre-generated binaries are available, shortening all this to a few seconds.
- In PyMuPDF's example dir there are utilities for CSV export / import of bookmarks and metadata.
- In your Python code you can do this:
import fitz # PyMuPDF
doc = fitz.open(...) # open PDF either from file or memory
meta = doc.metadata # get existing metadata (a dictionary)
meta["author"] = "new author" # modify metadata
....
doc.setMetadata(meta) # store modified metadata in PDF
doc.save("xxx.pdf", ...) # store modified PDF to disk
Logic for bookmark maintenance is similar - a lot simpler than using PyPDF2!
(... not to mention speed)
from pdfrw.
Adding bookmarks is fairly trivial. Reading them is somewhat complicated by the fact that there are multiple ways to do it, added in the file format at different versions.
Fun fact: pdfrw's lack of direct support for reading and writing bookmarks does not preclude you from using pdfrw to read and write bookmarks. pdfrw operates at a very low level on files, and the page-handling capability is a very thin layer on top of that. If you learn enough about pdfrw to refactor pdfwriter, you will know enough to do your own bookmarks, and once you have iterated through that code a few times and cleaned it up, it might be suitable for inclusion in pdfrw proper :-)
from pdfrw.
@JorjMcKie sadly this is deployed on Amazon's services, which use Ubuntu, so having to compile stuff from source is not an option =/
@pmaupin that's good advice, but it seems like a lot of work.
from pdfrw.
Hello there,
I have come up with a first attempt at adding bookmarks (and took the chance to add a helper method for setting metadata, which is already possible today).
Since I don't know anything about the PDF spec, all I did was basically copy the format PyPDF2 uses for the outline objects and add it to the pdfrw writer class.
I added an addBookmark method similar to the one in PyPDF2, although it lacks additional options like zoom or XYZ values, which don't seem that important anyway.
Since I didn't want to modify the original writer, I extended the class, which means some additional code was needed. If this were added to the original class, some of that code could be removed.
Any improvements, or anything that looks odd? It seems to work really well; the small layer on top of the PDF spec is really useful!
Here's the sample code; the code and a test are attached.
from pdfrw.pdfwriter import PdfWriter, PdfOutputError
from pdfrw.objects import IndirectPdfDict, PdfArray, PdfName


class NewPdfWriter(PdfWriter):

    def __init__(self, version='1.3', compress=False):
        self._bookmarks = []
        self._bookmarksDict = {}
        self._info = None
        super(NewPdfWriter, self).__init__(version, compress)

    def addBookmark(self, title, pageNum, parent = None):
        '''
        Adds a new bookmark entry.
        pageNum must be a valid page number in the writer
        and parent can be a bookmark object returned by a previous addBookmark call
        '''
        try:
            page = self.pagearray[pageNum]
        except IndexError:
            # TODO: Improve error handling ?
            raise PdfOutputError("Invalid page number: %s" % pageNum)

        bookmark = {
            'title': title,
            'page': page,
            'childs': []
        }
        bid = id(bookmark)

        if not parent:
            self._bookmarks.append(bookmark)
        else:
            parentObj = self._bookmarksDict.get(id(parent), None)
            if not parentObj:
                raise PdfOutputError("Bookmark parent object not found: %s" % (parent,))
            parentObj['childs'].append(bookmark)

        self._bookmarksDict[bid] = bookmark
        return bookmark

    def setInfo(self, info):
        '''
        Sets pdf metadata, info must be a dict where each key is the metadata key
        standard/known keys are:
            Title
            Author
            Creator
            Producer
        '''
        self._info = info

    def write(self, fname, trailer=None, user_fmt=None, disable_gc=True):

        # Recursive function to build outlines tree
        def buildOutlines(parent, bookmarks):
            outline = None

            if bookmarks:
                outline = IndirectPdfDict()
                outline.Count = len(bookmarks)
                first = None
                last = None

                for b in bookmarks:
                    newb = IndirectPdfDict(
                        Parent = parent or outline,
                        Title = b['title'],
                        A = IndirectPdfDict(
                            D = PdfArray( (b['page'], PdfName('Fit')) ),
                            S = PdfName('GoTo')
                        )
                    )

                    # Link the new item to its previous sibling
                    if not first:
                        first = newb
                    else:
                        last.Next = newb
                        newb.Prev = last
                    last = newb

                    # Add children, if any.
                    if b['childs']:
                        childOutline = buildOutlines(newb, b['childs'])
                        newb.First = childOutline.First
                        newb.Last = childOutline.Last
                        newb.Count = childOutline.Count

                outline.First = first
                outline.Last = last

            return outline

        # Testing for now, only add root level bookmarks
        outlines = buildOutlines(None, self._bookmarks)

        # If no custom trailer is given and we have info to set,
        # set info on self trailer
        if not trailer:
            if self._info:
                self.trailer.Info = IndirectPdfDict(**self._info)
            if outlines:
                self.trailer.Root.Outlines = outlines

        # if user_fmt is given use it otherwise use default from pdfrw
        # this if is not necessary if this code is moved into the actual writer, did it this way
        # for now to avoid adding a reference to user_fmt
        if user_fmt:
            super(NewPdfWriter, self).write(fname, trailer, user_fmt, disable_gc)
        else:
            super(NewPdfWriter, self).write(fname, trailer, disable_gc=disable_gc)
from pdfrw.
Update:
I noticed that unicode strings would cause issues (Python 2): str strings work, but unicode strings do not. I figured I had to wrap every metadata string and bookmark title in a PdfString.encode() call, though I'm not sure that's the right way to handle it.
One more thing about wrapping the metadata dict: I couldn't find how the code handles the dict.iteritems difference on Python 3. If I'm not wrong, it is simply items there, so I'm guessing I should just use .items().
Basically, setInfo is changed to:
self._info = {k: PdfString.encode(v) for k,v in info.items()}
and in addBookmark, the bookmark title is wrapped like this:
'title': PdfString.encode(title),
from pdfrw.
I'm glad you got that working!
Once you have some more experience with it, it might be something we add into the core. I would need to think about how much we want to modify the writer class. Perhaps a separate bookmarks module for all the bookmarks functionality and a simple hook for using the writer with it.
I think I can see the bug that kept unicode from working for you: in the user_fmt definition, basestring is improperly aliased to str. I think to fix that, we need to define basestring or an alias for it in the py23_diffs module (as str in Python 3 or basestring in Python 2) and use that. I don't really use unicode; maybe you can investigate further and submit a pull request if that seems to be the problem and you come up with a good fix.
As far as items vs. iteritems, you can always use a PdfDict, where iteritems is always defined regardless of Python 2/3 usage.
Regards,
Pat
from pdfrw.
@pmaupin I noticed the str bug in the user_fmt definition as well, but that's really something I shouldn't change as part of bookmarks support, since it may simply be a bug in the existing code. For now, calling the encode method from PdfString should work, and it can be changed once that's fixed. I don't know if that also applies to metadata/info values; I think those won't go through user_fmt calls.
I would like some review of how the actual PDF data is added. I have no idea if I'm doing it right; I basically copied the format used by PyPDF2.
from pdfrw.
@pmaupin Those are really nice changes! It shows how little I dug into the existing code. I didn't realize that the outlines object can behave almost like a doubly linked list, where adding a new node is just a matter of switching a few node pointers. My recursion thing was totally useless!
One thing about setting the document info: I agree with the setters/getters issue you mention, but in this specific case it wasn't really obvious, at least not to me, that you could modify the output trailer object to add info (I had to look up the examples). The other issue is that the above test will fail if the string is unicode on Python 2.7. Fixing the bug you mentioned might help or not; I'm not sure if the info dict goes through the user_fmt code at some point.
One other side effect of changing the trailer object directly: if you happen to do it before adding pages, you pretty much lose all the values you set, since the trailer object is destroyed on every page add (correct me if I'm wrong). That was the initial idea behind keeping the user data in a private variable until the final write, so that adding new pages wouldn't affect it.
from pdfrw.
The info dict definitely goes through the user_fmt code, so the fix applied there should be what you need. The trick there (for the next release) is to do it so it's not ugly and works on both Python 2 and 3. That's a bona fide bug, so a pull request would be appreciated.
Yes, the trailer object should not be modified before you are done adding pages. The hook I envision for the writer would allow stuff to be dynamically added to the trailer object whenever the write function is called. This would happen in _get_trailer(), which would set up the trailer and then call all the registered functions, so you could have a registered function to set up your info, and another one for your bookmarks, for example.
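A rough sketch of how such a hook could behave, to make the idea concrete. Nothing here is existing pdfrw API: HookedWriter, trailer_hooks and register_trailer_hook are hypothetical names, and plain dicts stand in for the real trailer objects. The point is only that registered functions run each time the trailer is rebuilt, so what they add cannot be lost by adding pages later.

```python
class HookedWriter(object):
    """Toy writer: only shows registration and call order of trailer hooks."""

    def __init__(self):
        self.trailer_hooks = []
        self.pages = []

    def register_trailer_hook(self, func):
        """Remember func; it is called with (writer, trailer) on every rebuild."""
        self.trailer_hooks.append(func)
        return func

    def _get_trailer(self):
        # Build a fresh trailer, then let every hook decorate it
        trailer = {'Root': {'Pages': list(self.pages)}}
        for hook in self.trailer_hooks:
            hook(self, trailer)
        return trailer


def add_info(writer, trailer):
    trailer['Info'] = {'Title': 'Test Merged PDF'}


def add_outlines(writer, trailer):
    trailer['Root']['Outlines'] = {'Count': len(writer.pages)}


writer = HookedWriter()
writer.register_trailer_hook(add_info)
writer.register_trailer_hook(add_outlines)

# Pages added after registration are still seen by the hooks
writer.pages.extend(['page 1', 'page 2'])
trailer = writer._get_trailer()
print(trailer['Info'], trailer['Root']['Outlines'])
```

A bookmarks module could then register its own function instead of subclassing the writer.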
One other thing, the Adobe PDF definition of the /Count member of the Outline and the bookmarks is kind of squirrelly. I don't claim to understand it completely, but positive numbers allow you to have open bookmarks, and negative numbers allow you to have closed bookmarks. Anyway, I realized the code I posted isn't quite right. To have all the bookmarks open by default, you probably want to do something like this before the return bookmark line:
while True:
    parent = parent.Parent
    if parent is None:
        break
    parent.Count += 1
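For what it's worth, here is one reading of the /Count rule as a small sketch, using plain dicts rather than pdfrw objects. The assumption, which should be checked against the PDF reference, is that an open item stores the number of descendants currently visible beneath it, a closed item stores the negative of the number that would become visible if it were opened, and an item without children has no Count at all. The 'open' and 'kids' keys and both function names are made up for illustration.

```python
def visible_descendants(item):
    """Count the descendants shown beneath item when item itself is open."""
    total = 0
    for kid in item['kids']:
        total += 1
        # A closed kid hides its own subtree
        if kid['open']:
            total += visible_descendants(kid)
    return total


def count_entry(item):
    """Return the /Count value for item, or None if it has no children."""
    visible = visible_descendants(item)
    if visible == 0:
        return None
    return visible if item['open'] else -visible


def make_item(is_open, *kids):
    return {'open': is_open, 'kids': list(kids)}


leaf_a = make_item(True)
leaf_b = make_item(True)
closed_child = make_item(False, leaf_a, leaf_b)
top = make_item(True, closed_child, make_item(True))

print(count_entry(top), count_entry(closed_child), count_entry(leaf_a))
```

Under this reading, the loop above corresponds to the case where every item is open, so each new bookmark adds one to every ancestor.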
from pdfrw.
@pmaupin it seems to me that all bookmarks are opened by default already without that loop? At least on my reader. So the Count value is actually for the open/close state rather than keeping track of all the nodes count? Odd
About the user_fmt issue, shouldn't it be as simple as replacing str with either (str, unicode) if python2 or (bytes, str) if python3? Probably some variable added into the module with python2-3 compat stuff.
from pdfrw.
Yeah, like I say, I dunno. The spec is kinda vague so all the readers probably compensate for badly written PDFs :-) But in theory, you can have some bookmarks open and others closed on initial file open. IIUC, if the count is '2' , it should default to open, with 1 additional child marks open (traverse the list to figure out which one, I guess), and if it is -2, it should default to closed, but when you open it, there should be one child mark open...
I think the unicode fix is that in this code you want, instead of basestring=str, to have basestring=basestring.
That's probably how it originally was, but then someone (probably me if I wasn't thinking straight at the time) "fixed" it so that it would work on Python 3, without thinking about how it would break Python 2.
The correct fix will be to do that, and then to import basestring from the py23_diffs module, where it should be defined equal to str if and only if the Python version is 3. (They dropped basestring from Python 3.)
Since unicode and str are both subclasses of basestring in Python 2, that should be all you need, I think.
Thanks,
Pat
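A minimal sketch of the kind of alias described above, runnable on both Python 2 and 3. The names string_types and is_text are placeholders chosen for this example; the thread suggests the real alias would be called basestring and live in pdfrw's py23_diffs module.

```python
try:
    string_types = basestring  # Python 2: parent class of str and unicode
except NameError:
    string_types = str  # Python 3: basestring no longer exists


def is_text(value):
    """True for any native text string on either Python version."""
    return isinstance(value, string_types)


print(is_text('abc'), is_text(u'abc'), is_text(42))
```

On Python 2 this accepts unicode as well as str, which is the behavior the user_fmt check was missing.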
from pdfrw.
I added metadata to a PDF using the following code.
It edits the metadata successfully but removes the bookmarks from the PDF.
Does anyone know a solution for this?
import json
from PyPDF2 import PdfFileReader, PdfFileWriter

def metadata(doc, rgno):
    reader = PdfFileReader(doc)
    writer = PdfFileWriter()
    kw = rgno[:2]+" "+rgno[3:7]+" "+rgno[8:]
    # Copies the pages only
    writer.appendPagesFromReader(reader)
    # json_location is defined elsewhere in the script
    with open(json_location) as jf:
        data = json.load(jf)
    # Write your custom metadata here:
    writer.addMetadata({"/Producer": data["/Producer"] })
    writer.addMetadata({"/Title":rgno})
    writer.addMetadata({"/Author": data["Author"]})
    writer.addMetadata({"/Subject": data["Subject"]})
    writer.addMetadata({"/Keywords": kw})
    with open(doc, "wb") as fp:
        writer.write(fp)
from pdfrw.
@pmaupin you said:
Adding bookmarks is fairly trivial. Reading them is somewhat complicated by the fact that there are multiple ways to do it, added in the file format at different versions.
What if I just want to remove all previous bookmarks and annotations in a pdf file? (no need to read or get any contents: just delete them all). Would that be possible?
@cristianoccazinsp did you try that too?
Thanks!
from pdfrw.