
Comments (9)

kenlyon avatar kenlyon commented on August 16, 2024

@ThomasHoevel Ah, thanks for the tip. I must have missed that. I was looking for a way to attach a file while writing the bug report.

Here you go: Issue.zip

from pdfsharp.

ThomasHoevel avatar ThomasHoevel commented on August 16, 2024

Other folks simply attach the ZIP files to their issue posts on GitHub.


ThomasHoevel avatar ThomasHoevel commented on August 16, 2024

I downloaded the file.
Nothing obviously wrong with the PDF.

It probably takes a few hours with the debugger to understand why the two pages are missing.


ThomasHoevel avatar ThomasHoevel commented on August 16, 2024

The PDF specification reads:

Together, the combination of an object number and a generation number shall uniquely identify an indirect object.

The file with the issue has several duplicated object IDs. PDFsharp uses one of the objects and ignores the duplicates.
By making a different choice, PDFsharp probably could find four pages instead of two.

But after hours of debugging, I have no clue how to achieve that. So I'm afraid there will be no change in PDFsharp in the near future.


kenlyon avatar kenlyon commented on August 16, 2024

@ThomasHoevel Thanks for looking into this and providing this explanation. I will investigate how our customer is generating these files in the first place to see if we can address it there. If the file does not comply with the PDF specification, then I think it's fair enough that you handle it the way you do. I'm grateful that it fails more gracefully than the previous version of PDFsharp.


packdat avatar packdat commented on August 16, 2024

@kenlyon Thanks for providing the example documents.

As we regularly receive documents from our customers created by tools that do not take the PDF spec too seriously, I'm always on the hunt for "problematic" PDFs, so I can fix issues before one of our customers complains.

In the case of the provided documents, however, I think PDFsharp is not behaving properly.
The spec says in chapter 7.5.6 (Incremental Updates):

...a file that has been updated several times contains several trailers. Because updates are appended to PDF files, 
a file may have several copies of an object with the same object identifier (object number and generation number).

And later:

When a conforming reader reads the file, it shall build its cross-reference information in such a way
that the most recent copy of each object shall be the one accessed from the file.
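To make the quoted rule concrete, here is a minimal sketch of building cross-reference information so that the most recent copy of each object wins. It is not PDFsharp code: `XrefSection`, `build_xref`, and the byte offsets are hypothetical names and values chosen for illustration, and the model assumes each section simply maps an object identifier to an offset.

```python
from typing import Dict, Optional, Tuple

ObjectId = Tuple[int, int]  # (object number, generation number)


class XrefSection:
    """Hypothetical in-memory model of one cross-reference section."""

    def __init__(self, entries: Dict[ObjectId, int],
                 prev: Optional["XrefSection"] = None):
        self.entries = entries  # object identifier -> byte offset (made-up values)
        self.prev = prev        # the section the trailer's /Prev entry points to


def build_xref(newest: XrefSection) -> Dict[ObjectId, int]:
    """Walk from the newest section back through /Prev.

    Because newer sections are visited first, setdefault keeps the most
    recent entry for each identifier and ignores older duplicates.
    """
    table: Dict[ObjectId, int] = {}
    section: Optional[XrefSection] = newest
    while section is not None:
        for obj_id, offset in section.entries.items():
            table.setdefault(obj_id, offset)
        section = section.prev
    return table


# An original file plus one incremental update that rewrites object 2
# and adds object 3.
original = XrefSection({(1, 0): 15, (2, 0): 120})
update = XrefSection({(2, 0): 900, (3, 0): 1040}, prev=original)
xref = build_xref(update)
```

In this sketch, object 2 resolves to the offset from the update rather than the original, which is the behavior the spec asks of a conforming reader.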

When reading a PDF, the library reads all trailers from back to front; that is, it reads the last (most recent) trailer and if it has a /Prev entry, it reads the trailer found there and repeats, collecting all found object-references on the way.
When reading the actual objects, it takes the found references, sorts them by their ObjectID and then reads them.
There are some issues with this approach and the provided files (especially document1):

  • The document seems to have been incrementally updated twice (we now have multiple objects with the same ObjectID)
  • The updated objects are stored in new ObjectStreams
  • When an object is parsed from an ObjectStream, the library keeps the first that was found, ignoring all others
    (see here)
  • By sorting the objects by their ObjectID, the library actually gives the oldest object preference (added objects typically have larger ObjectIDs)

The last point is actually the inverse of what the spec says.

With the mentioned PDF, the following happens:

  • read the oldest xref-stream
  • read the object-stream referenced from that xref-stream
  • read the /Pages dictionary
  • read the next-oldest xref-stream
  • read object-stream
  • do not read the newer version of the /Pages dictionary because that object already exists

I was able to fix this in a local branch by simply re-sorting the xref-streams (newest first) before handling them.
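A small sketch of why the processing order matters when the loader keeps the first copy of an object it encounters. This is an illustration only, not the actual change in the local branch: `load_keep_first` and the stream contents are hypothetical, and the `/Count` values are made up to mirror the "two pages instead of four" symptom discussed earlier in this thread.

```python
from typing import Dict, Iterable, List

# Hypothetical model: each object stream maps an object number to its content.
# Streams are listed oldest first, since incremental updates are appended.
streams: List[Dict[int, str]] = [
    {1: "Pages /Count 2"},  # original document
    {1: "Pages /Count 4"},  # incremental update that rewrites object 1
]


def load_keep_first(ordered_streams: Iterable[Dict[int, str]]) -> Dict[int, str]:
    """Load objects, keeping the first copy seen for each object number."""
    objects: Dict[int, str] = {}
    for stream in ordered_streams:
        for number, content in stream.items():
            if number not in objects:  # later copies are ignored
                objects[number] = content
    return objects


# Oldest first: the stale /Pages dictionary is kept.
oldest_first = load_keep_first(streams)

# Newest first: the updated /Pages dictionary is kept.
newest_first = load_keep_first(reversed(streams))
```

With a keep-first policy, only the newest-first order yields the updated object, which is why re-sorting alone can be enough without touching the duplicate handling itself.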

@ThomasHoevel I could provide a pull request if you like.


ThomasHoevel avatar ThomasHoevel commented on August 16, 2024

@packdat I'll have a look when you provide a PR. I thought the parser was using the newest objects, as tables were read from rear to front, starting with the newest XREF table. But it is not my code and maybe I missed something.
Thanks for your efforts.


ThomasHoevel avatar ThomasHoevel commented on August 16, 2024

The fix by packdat should be included in version 6.1.0 coming later this year or next year.
Thanks for the feedback.
Issue still exists with version 6.0.0.


ThomasHoevel avatar ThomasHoevel commented on August 16, 2024

Check if this issue is related:
#62 (comment)

