I want to combine two existing PDFs into a new document. When I try this, two of the p


Opening an existing PDF file for import ignores two of its pages about pdfsharp HOT 9 OPEN

empira commented on August 16, 2024

Opening an existing PDF file for import ignores two of its pages

from pdfsharp.

Comments (9)

kenlyon commented on August 16, 2024 1

@ThomasHoevel Ah, thanks for the tip. I must have missed that. I was looking for a way to attach a file while writing the bug report.

Here you go: Issue.zip


ThomasHoevel commented on August 16, 2024

Other folks simply attach the ZIP files to their issue posts on GitHub.


ThomasHoevel commented on August 16, 2024

I downloaded the file.
Nothing obviously wrong with the PDF.

It probably takes a few hours with the debugger to understand why the two pages are missing.


ThomasHoevel commented on August 16, 2024

The PDF specification reads:

Together, the combination of an object number and a generation number shall uniquely identify an indirect object.

The file with the issue has several duplicated object IDs. PDFsharp uses one of the objects and ignores the duplicates.
By making a different choice, PDFsharp probably could find four pages instead of two.
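
A quick way to check a file for repeated object identifiers is to scan the raw bytes for `N G obj` headers. The following is a minimal sketch, not part of PDFsharp: the function name `duplicated_object_ids` and the sample bytes are made up for illustration, and the scan only sees uncompressed objects (objects stored inside object streams stay hidden, and binary stream data could produce false matches).

```python
import re
from collections import Counter

# Matches "<object number> <generation number> obj" headers in raw PDF bytes.
OBJ_HEADER = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b")


def duplicated_object_ids(pdf_bytes: bytes) -> dict:
    """Return {(object number, generation): count} for IDs defined more than once."""
    counts = Counter(
        (int(m.group(1)), int(m.group(2))) for m in OBJ_HEADER.finditer(pdf_bytes)
    )
    return {obj_id: n for obj_id, n in counts.items() if n > 1}


# Hypothetical sample: object "2 0" is defined twice, as after an incremental update.
sample = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Count 2 >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Count 4 >>\nendobj\n"
)
print(duplicated_object_ids(sample))  # {(2, 0): 2}
```

For a real file, the bytes could be read with `open(path, "rb").read()` and passed to the same function.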

But after hours of debugging, I have no clue how to achieve that. So I'm afraid there will be no change in PDFsharp in the near future.


kenlyon commented on August 16, 2024

@ThomasHoevel Thanks for looking into this and providing this explanation. I will investigate how our customer is generating these files in the first place to see if we can address it there. If the file does not comply with the PDF specification, then I think it's fair enough that you handle it the way you do. I'm grateful that it fails more gracefully than the previous version of PDFsharp.


packdat commented on August 16, 2024

@kenlyon Thanks for providing the example documents.

As we regularly receive documents from our customers created by tools that do not take the PDF spec too seriously, I'm always on the hunt for "problematic" PDFs, so that I can fix issues before one of our customers complains.

In the case of the provided documents, however, I think PDFsharp is not behaving properly.
The spec says in chapter 7.5.6 (Incremental Updates):

...a file that has been updated several times contains several trailers. Because updates are appended to PDF files, 
a file may have several copies of an object with the same object identifier (object number and generation number).

And later:

When a conforming reader reads the file, it shall build its cross-reference information in such a way
that the most recent copy of each object shall be the one accessed from the file.
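
The rule quoted above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `XREF_SECTIONS`, `build_xref`, the offsets, and the location strings are hypothetical and stand in for real cross-reference sections linked by /Prev. It is not PDFsharp code.

```python
# Hypothetical model: byte offset of an xref section -> its contents.
# "entries" maps (object number, generation) to where that copy of the object lives.
XREF_SECTIONS = {
    100: {"entries": {(1, 0): "catalog@15", (2, 0): "pages@60"}, "prev": None},
    900: {"entries": {(2, 0): "pages@700", (5, 0): "page@800"}, "prev": 100},
}


def build_xref(sections, startxref):
    """Follow /Prev from the newest section back; the first copy seen wins."""
    table = {}
    seen = set()
    offset = startxref
    while offset is not None and offset not in seen:  # guard against /Prev loops
        seen.add(offset)
        section = sections[offset]
        for obj_id, location in section["entries"].items():
            table.setdefault(obj_id, location)  # keep the most recent copy
        offset = section["prev"]
    return table


xref = build_xref(XREF_SECTIONS, startxref=900)
print(xref[(2, 0)])  # pages@700
```

Because the walk starts at the newest section, keeping the first entry seen for each identifier is the same as keeping the most recent copy.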

When reading a PDF, the library reads all trailers from back to front; that is, it reads the last (most recent) trailer and, if it has a /Prev entry, reads the trailer found there and repeats, collecting all object references found on the way.
When reading the actual objects, it takes the found references, sorts them by their ObjectID, and then reads them.
There are some issues with this approach and the provided files (especially document1):

The document seems to have been incrementally updated twice (so we now have multiple objects with the same ObjectID)
The updated objects are stored in new ObjectStreams
When an object is parsed from an ObjectStream, the library keeps the first one that was found, ignoring all others (see here)
By sorting the objects by their ObjectID, the library actually gives the oldest object preference (added objects typically have larger ObjectIDs)
The last point is actually the inverse of what the spec says.
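
The interaction of the last two points can be reproduced with a toy model. Everything here is an assumption made for illustration: the container IDs 10 and 20, the object bodies, and the helper `load_keep_first` are invented and only mimic the "keep the first one found" behaviour described above.

```python
# Hypothetical model of document1: two object streams, both holding object 2 (/Pages).
# The newer stream was appended later and therefore has the larger container ID.
object_streams = [
    {"id": 20, "objects": {2: "/Pages /Count 4", 7: "/Page", 8: "/Page"}},  # newest
    {"id": 10, "objects": {2: "/Pages /Count 2", 3: "/Page", 4: "/Page"}},  # oldest
]


def load_keep_first(streams):
    """Parse streams in the given order; ignore an object that is already loaded."""
    loaded = {}
    for stream in streams:
        for obj_num, body in stream["objects"].items():
            if obj_num not in loaded:
                loaded[obj_num] = body
    return loaded


# Sorting by ObjectID puts the oldest stream first, so its stale /Pages copy wins.
by_object_id = sorted(object_streams, key=lambda s: s["id"])
print(load_keep_first(by_object_id)[2])  # /Pages /Count 2
```

With the streams left in newest-first order, the same function returns the updated `/Pages /Count 4` entry, which suggests that the processing order, not the keep-first rule itself, is what selects the wrong copy.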

With the mentioned pdf, the following happens:

read the oldest xref-stream
read the object-stream referenced from that xref-stream
read the /Pages dictionary
read the next-oldest xref-stream
read object-stream
do not read the newer version of the /Pages dictionary because that object already exists

I was able to fix this in a local branch by simply re-sorting the xref-streams (newest first) before handling them.
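
The idea of handling the newest xref-stream first can be sketched as follows. This is not the actual patch: the `XrefStream` class, its `offset` field, and `resolve` are hypothetical, and the sketch assumes that a larger byte offset means a more recent xref-stream because incremental updates are appended to the file.

```python
from dataclasses import dataclass, field


@dataclass
class XrefStream:
    offset: int  # byte position in the file (hypothetical field)
    objects: dict = field(default_factory=dict)  # object number -> parsed body


def resolve(xref_streams):
    """Handle xref-streams newest first so that keep-first yields the latest copy."""
    newest_first = sorted(xref_streams, key=lambda x: x.offset, reverse=True)
    resolved = {}
    for xref in newest_first:
        for obj_num, body in xref.objects.items():
            resolved.setdefault(obj_num, body)  # older copies are ignored
    return resolved


streams = [
    XrefStream(offset=120, objects={2: "/Pages /Count 2"}),  # original revision
    XrefStream(offset=950, objects={2: "/Pages /Count 4"}),  # appended update
]
print(resolve(streams)[2])  # /Pages /Count 4
```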

@ThomasHoevel I could provide a pull request if you like.


ThomasHoevel commented on August 16, 2024

@packdat I'll have a look when you provide a PR. I thought the parser was using the newest objects, as the tables were read from back to front, starting with the newest XREF table. But it is not my code and maybe I missed something.
Thanks for your efforts.


ThomasHoevel commented on August 16, 2024

The fix by packdat should be included in version 6.1.0 coming later this year or next year.
Thanks for the feedback.
Issue still exists with version 6.0.0.


ThomasHoevel commented on August 16, 2024

Check if this issue is related:
#62 (comment)
