Comments (18)

pmaupin avatar pmaupin commented on September 18, 2024 1

Okay, okay, you made me look.

I'll give you points for trying, and I'll give you a lot of points for working code. A working starting point is precious.

But PyPDF2-style code usually pretty much misses the entire point of how pdfrw works, so I've taken your working code and munged it into something that feels better to me. I didn't spend a lot of time on it, so there might be some rough edges.

Also, as I mentioned earlier, I may not be really keen on adding too much more stuff to pdfwriter.py. For example, I could see adding a generic write-hook that a bookmark module could use to hook itself in, so then we could have a bookmarks.py that has a function you specifically invoke to use bookmarks.

I haven't really thought about it that much, though, so for now, I kept your basic structure of subclassing the writer -- except for how the Info dict is handled. One of the things that makes the PyPDF2 codebase unwieldy is the proliferation of java-style getters and setters. pdfrw is not big on that -- if you want to set up an Info dict, then just set up an Info dict. Otherwise, where does it stop?

newwriter.py:

from pdfrw.pdfwriter import PdfWriter, IndirectPdfDict, PdfName, \
                            PdfOutputError, PdfDict, user_fmt


class NewPdfWriter(PdfWriter):

    _outline = None

    def addBookmark(self, title, pageNum, parent=None):
        '''
        Adds a new bookmark entry.
        pageNum must be a valid page number in the writer
        and parent can be a bookmark object returned by
        a previous addBookmark call
        '''

        try:
            page = self.pagearray[pageNum]
        except IndexError:
            # TODO: Improve error handling ?
            raise PdfOutputError("Invalid page number: %s" % pageNum)

        parent = parent or self._outline
        if parent is None:
            parent = self._outline = IndirectPdfDict()

        bookmark = IndirectPdfDict(
            Parent = parent,
            Title = title,
            A = PdfDict(
                D = [page, PdfName.Fit],
                S = PdfName.GoTo
            )
        )

        if parent.Count:
            parent.Count += 1
            prev = parent.Last
            bookmark.Prev = prev
            prev.Next = bookmark
            parent.Last = bookmark
        else:
            parent.Count = 1
            parent.First = bookmark
            parent.Last = bookmark

        return bookmark


    def write(self, fname, trailer=None, user_fmt=user_fmt,
              disable_gc=True):
        trailer = trailer or self.trailer
        trailer.Root.Outlines = self._outline
        super(NewPdfWriter, self).write(fname, trailer, user_fmt, disable_gc)

test.py:

from pdfrw import PdfReader, IndirectPdfDict
import newwriter
from datetime import datetime


output = newwriter.NewPdfWriter()

for i in range(3):
    totalPages = len(output.pagearray)
    output.addpages(PdfReader('out_small.pdf').pages)

    bmname = 'Bm (%s) - %s' % (i+1, 'Root')

    t1 = output.addBookmark(bmname, totalPages)
    t2 = output.addBookmark("Child 1", totalPages+1, t1)
    output.addBookmark("Child 1.1", totalPages+2, t2)


now = datetime.now()
date = 'D:%04d%02d%02d%02d%02d%02d' % (now.year, now.month,
           now.day, now.hour, now.minute, now.second)

info = output.trailer.Info = IndirectPdfDict()
info.Title = 'Test Merged PDF'
info.Author = 'asdasd'
info.Creator = 'random dude'
info.Producer = 'another random dude'
info.CreationDate = date

output.write('result.pdf')
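
The hand-built date string in test.py can probably be produced more compactly with strftime. This is only a minimal sketch: `pdf_date` is a hypothetical helper name, not part of pdfrw, and like the original it leaves out the optional time zone suffix.

```python
from datetime import datetime


def pdf_date(moment):
    """Format a datetime as a PDF date string (no time zone suffix)."""
    return moment.strftime('D:%Y%m%d%H%M%S')


# Same layout as the '%04d%02d...' formatting used in test.py.
print(pdf_date(datetime(2024, 9, 18, 7, 5, 3)))
```

With that helper, the assignment in test.py would become info.CreationDate = pdf_date(datetime.now()).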

from pdfrw.

eburghar avatar eburghar commented on September 18, 2024

I would like to have that feature as well. Could we share details about a possible implementation, or could you at least give some directions? I'm about to switch to PyPDF2 because of that missing feature.

tisimst avatar tisimst commented on September 18, 2024

Not sure how helpful I can be, but at least I know that the bookmarks are supposed to live in the "[PdfReader].Root.Outlines" object. The document "Outlines" dictionary object is defined in the PDF Reference, section 8.2.2.
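
To illustrate the structure: the outline items hang off the Outlines dictionary through First and Next links, so a walk over them can be sketched roughly as below. This is an assumption-laden sketch, not pdfrw API. `walk_outline` is a hypothetical helper, the nodes are plain stand-in objects rather than dictionaries read from a file, and it only covers the First/Next/Title keys, not destinations or actions.

```python
from types import SimpleNamespace


def walk_outline(node, depth=0):
    """Yield (depth, title) for every outline item below node."""
    child = getattr(node, 'First', None)
    while child is not None:
        yield depth, child.Title
        # Descend into this item's own children before moving to its sibling.
        for item in walk_outline(child, depth + 1):
            yield item
        child = getattr(child, 'Next', None)


# Stand-in objects; a real file would supply these dictionaries instead.
leaf = SimpleNamespace(Title='Child 1', First=None, Next=None)
second = SimpleNamespace(Title='Chapter 2', First=None, Next=None)
first = SimpleNamespace(Title='Chapter 1', First=leaf, Next=second)
outlines = SimpleNamespace(First=first)

toc = list(walk_outline(outlines))
```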

JorjMcKie avatar JorjMcKie commented on September 18, 2024

@eburghar, @tisimst: If I may, I would recommend not switching to PyPDF2. pdfrw performs much better and is better constructed internally. Its owner is obviously very busy, but PyPDF2, in contrast, seems to have come to a halt.
If bookmark maintenance is the one feature that you badly miss, take PyMuPDF as an interim solution. It has supported this for a few days now, as well as metadata maintenance and incremental saves. A wxPython-based GUI TOC maintenance example is contained in the repo. PyMuPDF's drawback compared to pdfrw is that it is not pure Python and does not have as many PDF output features as pdfrw.

cristianocca avatar cristianocca commented on September 18, 2024

How difficult would this be? I end up doing all the processing with pdfrw since it's way faster than the rest, but at the end I always need to reopen the final file with PyPDF2 to add bookmarks and metadata =/

JorjMcKie avatar JorjMcKie commented on September 18, 2024
  1. PyMuPDF is based on MuPDF, so it's not pure Python. Runs on Windows, Linux and Mac, Python 2.7 and up, 32 and 64 bit. To be usable, MuPDF must be compiled / generated first, before a setup of PyMuPDF is possible. On Windows pre-generated binaries are available, shortening all this to a few seconds.

  2. In PyMuPDF's example dir there exist utilities for CSV export / import of bookmarks and metadata.

  3. In your Python you can do this:

    import fitz                    # PyMuPDF
    doc = fitz.open(...)           # open PDF either from file or memory
    meta = doc.metadata            # get existing metadata (a dictionary)
    meta["author"] = "new author"  # modify metadata
    ....
    doc.setMetadata(meta)          # store modified metadata in PDF
    doc.save("xxx.pdf", ...)       # store modified PDF to disk

Logic for bookmark maintenance is similar - a lot simpler than using PyPDF2!
(... not to mention speed)

pmaupin avatar pmaupin commented on September 18, 2024

@cristianocca:

Adding bookmarks is fairly trivial. Reading them is somewhat complicated by the fact that there are multiple ways to do it, added in the file format at different versions.

Fun fact: pdfrw's lack of direct support for reading and writing bookmarks does not preclude you from using pdfrw to read and write bookmarks. pdfrw operates at a very low level on files, and the page-handling capability is a very thin layer on top of that. If you learn enough about pdfrw to refactor pdfwriter, you will know enough to do your own bookmarks, and once you have iterated through that code a few times and cleaned it up, it might be suitable for inclusion in pdfrw proper :-)

cristianocca avatar cristianocca commented on September 18, 2024

@JorjMcKie sadly this is deployed on Amazon's services, which use Ubuntu, so having to compile stuff from source is not an option =/

@pmaupin that's good advice; it seems like a lot of work though.

cristianocca avatar cristianocca commented on September 18, 2024

Hello there,

I have come up with a first attempt at adding bookmarks (and took the chance to add a helper method for setting metadata, which is already possible today).

Since I don't know anything about the PDF spec, all I did was copy the format PyPDF2 uses for the outline objects and add it to pdfrw's writer class.

I added an addBookmark method similar to the one from PyPDF2, although it lacks additional options like zoom or XYZ values, which don't seem important either.
Since I didn't want to modify the original writer, I extended the class, which required some additional code; if this were added to the original class, some of that code could be removed.

Does anything look odd, or could anything be improved? It seems to work really well; the small layer on top of the PDF spec is really useful!

pdfrw_bookmarks.zip

Here's the sample code; the full code and a test are attached.

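from pdfrw import PdfWriter, IndirectPdfDict, PdfArray, PdfName
from pdfrw.errors import PdfOutputError


class NewPdfWriter(PdfWriter):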

    def __init__(self, version='1.3', compress=False):
        self._bookmarks = []
        self._bookmarksDict = {}
        self._info = None

        super(NewPdfWriter, self).__init__(version, compress)
        
    def addBookmark(self, title, pageNum, parent = None):
        '''
        Adds a new bookmark entry.
        pageNum must be a valid page number in the writer
        and parent can be a bookmark object returned by a previous addBookmark call
        '''
                
        try:
            page = self.pagearray[pageNum]
        except IndexError:
            # TODO: Improve error handling ?
            raise PdfOutputError("Invalid page number: %s" % pageNum)
            
        bookmark = {
            'title': title,
            'page': page,
            'childs': []
        }
        bid = id(bookmark)        
        
        if not parent:            
            self._bookmarks.append(bookmark)        
            
        else:
            parentObj = self._bookmarksDict.get(id(parent), None)
            if not parentObj:
                raise PdfOutputError("Bookmark parent object not found: %s" % (parent,))
                        
            parentObj['childs'].append(bookmark)
                        
        self._bookmarksDict[bid] = bookmark
        return bookmark
        
    def setInfo(self, info):
        '''
        Sets pdf metadata, info must be a dict where each key is the metadata key
        standard/known keys are:
            Title
            Author
            Creator
            Producer
        '''
        self._info = info
        
    def write(self, fname, trailer=None, user_fmt=None, disable_gc=True):        
            
        # Recursive function to build outlines tree
        def buildOutlines(parent, bookmarks):
            
            outline = None
            
            if bookmarks:
                outline = IndirectPdfDict()
                outline.Count = len(bookmarks)
                
                first = None
                next = None
                last = None
                                       
                for b in bookmarks:
                    
                    newb = IndirectPdfDict(
                        Parent = parent or outline,
                        Title = b['title'],
                        A = IndirectPdfDict(
                            D = PdfArray( (b['page'], PdfName('Fit')) ),
                            S = PdfName('GoTo')
                        )
                    ) 
                    
                    if not first:
                        first = newb
                        
                    else:
                        last.Next = newb
                        newb.Prev = last                                              
                        
                    last = newb
                    
                    # Add children, if any.
                    if b['childs']:
                        childOutline = buildOutlines(newb, b['childs'])
                        newb.First = childOutline.First
                        newb.Last = childOutline.Last
                        newb.Count = childOutline.Count
                        
                        
                outline.First = first
                outline.Last = last
                     
            return outline
            
        # Testing for now, only add root level bookmarks
        outlines = buildOutlines(None, self._bookmarks)
                       
        # If not custom trailer is given and we have info to set.
        # set info on self trailer
        if not trailer:
            if self._info:
                self.trailer.Info = IndirectPdfDict(**self._info)            
                
            if outlines:
                self.trailer.Root.Outlines = outlines
        
        
        # if user_fmt is given use it otherwise use default from pdfrw
        # this if is not necessary if this code is moved into the actual writer, did it this way
        # for now to avoid adding a reference to user_fmt
        if user_fmt:
            super(NewPdfWriter, self).write(fname, trailer, user_fmt, disable_gc)
        else:
            super(NewPdfWriter, self).write(fname, trailer, disable_gc=disable_gc)

cristianocca avatar cristianocca commented on September 18, 2024

Update:

I noticed that unicode strings cause issues on Python 2: str strings work, but unicode ones don't. I figured I had to wrap every metadata string and bookmark title in a PdfString.encode() call, though I'm not sure this is the right way to handle it.

One more thing concerns wrapping the metadata dict. I couldn't find anything in the code that handles the dict.iteritems difference on Python 3, where, if I'm not mistaken, it is simply items, so I'm guessing it should just be .items().

Basically setInfo is changed to:
self._info = {k: PdfString.encode(v) for k,v in info.items()}

and in addBookmark, the bookmark title is wrapped as:
'title': PdfString.encode(title),

pmaupin avatar pmaupin commented on September 18, 2024

@cristianocca

I'm glad you got that working!

Once you have some more experience with it, it might be something we add into the core. I would need to think about how much we want to modify the writer class. Perhaps a separate bookmarks module for all the bookmarks functionality and a simple hook for using the writer with it.

I think I can see the bug that explains why unicode didn't work for you: in the user_fmt definition, basestring is improperly aliased to str. I think to fix that, we need to define basestring or an alias for it in the py23_diffs module (as str in Python 3 or basestring in Python 2) and use that. I don't really use unicode; maybe you can investigate further and submit a pull request if that seems to be the problem and you come up with a good fix.

As for items vs iteritems, you can always use a PdfDict, where iteritems is always defined regardless of Python 2/3 usage.

Regards,
Pat

cristianocca avatar cristianocca commented on September 18, 2024

@pmaupin I noticed the str bug in the user_fmt definition as well, but that's really not something I should change as part of bookmark support, since it may simply be a bug in the existing code. For now, calling the encode method from PdfString should work, and it can be changed once that's fixed. I don't know if that also applies to metadata/info values; I think those won't go through user_fmt calls.

I would like a review of how the actual PDF data is added. I have no idea whether I'm doing it right; I basically copied the format used by PyPDF2.

cristianocca avatar cristianocca commented on September 18, 2024

@pmaupin Those are really nice changes! It shows how little I dug into the existing code. I didn't realize that the outline objects behave almost like a doubly linked list, where adding a node only requires switching a few pointers. My recursion was totally useless!
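
For anyone following along, the pointer switching can be sketched on its own, without pdfrw. This is just an illustration under assumptions: `Node` and `append_child` are hypothetical names, and the attributes only mirror the First/Last/Prev/Next/Parent/Count keys used in the posted writer code.

```python
class Node(object):
    """Plain stand-in for an outline dictionary."""

    def __init__(self, title=None):
        self.Title = title
        self.Parent = self.First = self.Last = None
        self.Prev = self.Next = None
        self.Count = 0


def append_child(parent, child):
    """Link child in as the last child of parent, in constant time."""
    child.Parent = parent
    if parent.Last is None:
        # First child: the parent's list starts here.
        parent.First = child
    else:
        # Hook the new node behind the current last sibling.
        child.Prev = parent.Last
        parent.Last.Next = child
    parent.Last = child
    parent.Count += 1
    return child


root = Node()
a = append_child(root, Node('A'))
b = append_child(root, Node('B'))
```

No recursion or intermediate list is needed, because each append only touches the parent and the previous last sibling.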

There's one thing about setting the document info. I agree with the setters/getters issue you mention, but in this specific case it wasn't obvious, at least to me, that you could modify the output trailer object to add info (I had to look up the examples). The other issue is that the above test will fail if the string is unicode on Python 2.7. Fixing the bug you mentioned might help, or it might not; I'm not sure whether the info dict goes through the user_fmt code at some point.

One other side effect of changing the trailer object directly: if you happen to do it before adding pages, you lose all the values you set, since the trailer object is destroyed on every page add (correct me if I'm wrong). That was the original reason for keeping the user data in a private variable until the final write, so that adding new pages wouldn't affect it.

pmaupin avatar pmaupin commented on September 18, 2024

The info dict definitely goes through the user_fmt code, so the fix applied there should be what you need. The trick there (for the next release) is to do it so it's not ugly and works on both Python 2 and 3. That's a bona fide bug, so a pull request would be appreciated.

Yes, the trailer object should not be modified before you are done adding pages. The hook I envision for the writer would allow stuff to be dynamically added to the trailer object whenever the write function is called. This would happen in _get_trailer(), which would set up the trailer and then call all the registered functions, so you could have a registered function to set up your info, and another one for your bookmarks, for example.
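
A rough sketch of what such a hook could look like, purely as an illustration of the idea. Nothing here is existing pdfrw API: `HookedWriter`, `register_trailer_hook`, `add_info` and `add_outlines` are hypothetical names, and the trailer is modeled with plain dicts instead of PDF objects.

```python
class HookedWriter(object):
    """Toy writer: only models trailer construction, not PDF output."""

    def __init__(self):
        self.pages = []
        self._trailer_hooks = []

    def add_page(self, page):
        self.pages.append(page)

    def register_trailer_hook(self, func):
        """Register func(writer, trailer) to run whenever the trailer is built."""
        self._trailer_hooks.append(func)
        return func

    def _get_trailer(self):
        # The trailer is built fresh on each call, then every hook
        # gets a chance to add its own entries to it.
        trailer = {'Root': {'Pages': list(self.pages)}}
        for hook in self._trailer_hooks:
            hook(self, trailer)
        return trailer


def add_info(writer, trailer):
    trailer['Info'] = {'Title': 'Test Merged PDF'}


def add_outlines(writer, trailer):
    trailer['Root']['Outlines'] = {'Count': len(writer.pages)}


writer = HookedWriter()
writer.register_trailer_hook(add_info)      # registered before any page exists
writer.add_page('page 1')
writer.register_trailer_hook(add_outlines)
writer.add_page('page 2')
trailer = writer._get_trailer()
```

Because the hooks run at trailer-build time, registering them before or after adding pages makes no difference, which sidesteps the lost-values problem described above.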

One other thing, the Adobe PDF definition of the /Count member of the Outline and the bookmarks is kind of squirrelly. I don't claim to understand it completely, but positive numbers allow you to have open bookmarks, and negative numbers allow you to have closed bookmarks. Anyway, I realized the code I posted isn't quite right. To have all the bookmarks open by default, you probably want to do something like this before the return bookmark line:

        while True:
            parent = parent.Parent
            if parent is None:
                break
            parent.Count += 1


cristianocca avatar cristianocca commented on September 18, 2024

@pmaupin it seems to me that all bookmarks are opened by default already without that loop? At least on my reader. So the Count value is actually for the open/close state rather than keeping track of all the nodes count? Odd

About the user_fmt issue, shouldn't it be as simple as replacing str with either (str, unicode) if python2 or (bytes, str) if python3? Probably some variable added into the module with python2-3 compat stuff.

pmaupin avatar pmaupin commented on September 18, 2024

Yeah, like I say, I dunno. The spec is kinda vague so all the readers probably compensate for badly written PDFs :-) But in theory, you can have some bookmarks open and others closed on initial file open. IIUC, if the count is '2' , it should default to open, with 1 additional child marks open (traverse the list to figure out which one, I guess), and if it is -2, it should default to closed, but when you open it, there should be one child mark open...
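
For what it's worth, here is one reading of the /Count rule written out as code, using a plain tree instead of PDF objects. Treat it as an interpretation, not a reference implementation: `Item`, `visible_descendants` and `count_value` are hypothetical names, and the rule assumed is "count each immediate child, add the visible descendants of every open child, and negate the result when the item itself is closed".

```python
class Item(object):
    """Stand-in for an outline item with an open/closed state."""

    def __init__(self, title, is_open=True, children=()):
        self.title = title
        self.is_open = is_open
        self.children = list(children)


def visible_descendants(item):
    """Number of descendants shown when item itself is expanded."""
    total = 0
    for child in item.children:
        total += 1
        if child.is_open:
            # An open child also shows its own visible descendants.
            total += visible_descendants(child)
    return total


def count_value(item):
    """Return the Count for item, or None when the entry would be omitted."""
    visible = visible_descendants(item)
    if visible == 0:
        return None
    return visible if item.is_open else -visible


a1 = Item('A1')
a2 = Item('A2', is_open=False, children=[Item('A2a')])
a = Item('A', children=[a1, a2])
b = Item('B', is_open=False, children=[Item('B1')])
root = Item(None, children=[a, b])
```

Under this reading, an all-open tree needs every ancestor's Count bumped when a bookmark is added, which is what the loop over parent.Parent does.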

I think the unicode fix is that in this code you want, instead of basestring=str, to have basestring=basestring.

That's probably how it originally was, but then someone (probably me if I wasn't thinking straight at the time) "fixed" it so that it would work on Python 3, without thinking about how it would break Python 2.

The correct fix will be to do that, and then to import basestring from the py23_diffs module, where it should be defined equal to str IFF the Python version is 3. (They dropped basestring from Python 3.)

Since unicode and str are both subclasses of basestring in Python 2, that should be all you need, I think.

Thanks,
Pat
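
A minimal sketch of the kind of alias discussed above, assuming it would live in a compatibility module. The name `string_types` and the helper `is_text` are made up for illustration; the thread proposes exporting it as basestring from py23_diffs instead.

```python
try:
    string_types = basestring   # Python 2: matches both str and unicode
except NameError:
    string_types = str          # Python 3: basestring no longer exists


def is_text(value):
    """True for any text string on either Python version."""
    return isinstance(value, string_types)


print(is_text(u'bookmark title'), is_text(42))
```

Note that on Python 3 this does not match bytes; whether bytes should be accepted too, as suggested earlier in the thread, is a separate decision.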

Nijai avatar Nijai commented on September 18, 2024

I added metadata to a PDF using the following code.
It successfully edits the metadata but removes the bookmarks from the PDF.
Does anyone know a solution for this?

import json
from PyPDF2 import PdfFileReader, PdfFileWriter


def metadata(doc, rgno):
    reader = PdfFileReader(doc)
    writer = PdfFileWriter()
    kw = rgno[:2] + " " + rgno[3:7] + " " + rgno[8:]
    # Only the pages are copied to the new writer
    writer.appendPagesFromReader(reader)
    with open(json_location) as jf:  # json_location is not shown in the post
        data = json.load(jf)
    # Write your custom metadata here:
    writer.addMetadata({"/Producer": data["/Producer"]})
    writer.addMetadata({"/Title": rgno})
    writer.addMetadata({"/Author": data["Author"]})
    writer.addMetadata({"/Subject": data["Subject"]})
    writer.addMetadata({"/Keywords": kw})

    with open(doc, "wb") as fp:
        writer.write(fp)

abubelinha avatar abubelinha commented on September 18, 2024

@pmaupin you said:

Adding bookmarks is fairly trivial. Reading them is somewhat complicated by the fact that there are multiple ways to do it, added in the file format at different versions.

What if I just want to remove all previous bookmarks and annotations in a pdf file? (no need to read or get any contents: just delete them all). Would that be possible?

@cristianoccazinsp did you try that too?
Thanks!
