Comments (9)
@ThomasHoevel Ah, thanks for the tip. I must have missed that. I was looking for a way to attach a file while writing the bug report.
Here you go: Issue.zip
from pdfsharp.
Other folks simply attach the ZIP files to their issue posts on GitHub.
I downloaded the file.
Nothing obviously wrong with the PDF.
It probably takes a few hours with the debugger to understand why the two pages are missing.
The PDF specification reads:
Together, the combination of an object number and a generation number shall uniquely identify an indirect object.
The file with the issue has several duplicated object IDs. PDFsharp uses one of the objects and ignores the duplicates.
By making a different choice, PDFsharp probably could find four pages instead of two.
But after hours of debugging, I have no clue how to achieve that. So I'm afraid there will be no change in PDFsharp in the near future.
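For anyone who wants to check their own files for duplicated object IDs, here is a minimal sketch, not part of PDFsharp. It assumes a plain scan of the raw bytes for `N G obj` headers is good enough for a first look. The function name `duplicate_object_ids` and the sample bytes are made up for illustration. Objects packed into object streams are compressed and will not show up in this scan, and binary stream data can produce false matches.

```python
import re
from collections import Counter

# Matches "<object number> <generation number> obj" headers in the raw bytes.
OBJ_HEADER = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b")


def duplicate_object_ids(data: bytes) -> dict[tuple[int, int], int]:
    """Return {(object number, generation): count} for IDs seen more than once."""
    counts = Counter(
        (int(m.group(1)), int(m.group(2))) for m in OBJ_HEADER.finditer(data)
    )
    return {object_id: n for object_id, n in counts.items() if n > 1}


# Hypothetical file body: object 1 0 is written twice.
sample = (
    b"1 0 obj\n<< /Type /Pages /Count 2 >>\nendobj\n"
    b"2 0 obj\n<< >>\nendobj\n"
    b"1 0 obj\n<< /Type /Pages /Count 4 >>\nendobj\n"
)
print(duplicate_object_ids(sample))
```

To try it on a real file, read it with `open(path, "rb").read()` and pass the bytes in.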
@ThomasHoevel Thanks for looking into this and providing this explanation. I will investigate how our customer is generating these files in the first place to see if we can address it there. If the file does not comply with the PDF specification, then I think it's fair enough that you handle it the way you do. I'm grateful that it fails more gracefully than the previous version of PDFsharp.
@kenlyon Thanks for providing the example documents.
As we regularly receive documents from our customers created by tools that take the PDF-spec not too seriously, I'm always on the hunt for "problematic" PDFs, to fix issues before one of our customers complains.
In the case of the provided documents, however, I think PDFsharp is not behaving properly.
The spec says in chapter 7.5.6 (Incremental Updates):
...a file that has been updated several times contains several trailers. Because updates are appended to PDF files,
a file may have several copies of an object with the same object identifier (object number and generation number).
And later:
When a conforming reader reads the file, it shall build its cross-reference information in such a way
that the most recent copy of each object shall be the one accessed from the file.
When reading a PDF, the library reads all trailers from back to front; that is, it reads the last (most recent) trailer and if it has a /Prev entry, it reads the trailer found there and repeats, collecting all found object-references on the way.
When reading the actual objects, it takes the found references, sorts them by their ObjectID and then reads them.
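The back-to-front walk and the "most recent copy wins" rule from the spec can be sketched roughly like this. It is a simplified model, not PDFsharp's code: each cross-reference section is a plain dictionary keyed by its file offset, and the names `walk_sections`, `build_xref`, `"Prev"` and `"entries"`, along with all offsets, are assumptions for illustration.

```python
def walk_sections(sections, startxref):
    """Return section offsets from newest to oldest by following /Prev."""
    order, seen = [], set()
    offset = startxref
    while offset is not None and offset not in seen:  # guard against /Prev loops
        seen.add(offset)
        order.append(offset)
        offset = sections[offset].get("Prev")
    return order


def build_xref(sections, startxref):
    """Merge entries so that the newest section's entry for an object wins."""
    xref = {}
    for offset in walk_sections(sections, startxref):
        for object_id, location in sections[offset]["entries"].items():
            # Sections are visited newest first, so an existing entry is
            # always more recent than the one being offered here.
            xref.setdefault(object_id, location)
    return xref


# Hypothetical file: an original section at offset 100 plus two updates.
sections = {
    100: {"Prev": None, "entries": {(1, 0): 15, (2, 0): 60}},
    900: {"Prev": 100, "entries": {(2, 0): 700}},
    1500: {"Prev": 900, "entries": {(2, 0): 1300, (3, 0): 1400}},
}

print(walk_sections(sections, 1500))
print(build_xref(sections, 1500))
```

In this model object `(2, 0)` resolves to offset 1300, the copy written by the latest update, which is what the quoted passage requires.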
There are some issues with this approach and the provided files (especially document1):
- The document seems to be incrementally updated 2 times (we now have multiple objects with the same ObjectID)
- The updated objects are stored in new ObjectStreams
- When an object is parsed from an ObjectStream, the library keeps the first that was found, ignoring all others (see here)
- By sorting the objects by their ObjectID, the library actually gives the oldest object preference (added objects typically have larger ObjectIDs)
The last point is actually the inverse of what the spec says.
With the mentioned pdf, the following happens:
- read the oldest xref-stream
- read the object-stream referenced from that xref-stream
- read the /Pages dictionary
- read the next-oldest xref-stream
- read object-stream
- do not read the newer version of the /Pages dictionary because that object already exists
I was able to fix this in a local branch by simply re-sorting the xref-streams (newest first) before handling them.
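To illustrate why the processing order matters when the first copy of an object is kept, here is a small model of the situation. It is only a sketch under assumptions, not the actual change in the local branch: the `ObjectStream` class, its `revision` field and the object contents are hypothetical, and the page counts mirror the two versus four pages mentioned in this thread.

```python
from dataclasses import dataclass


@dataclass
class ObjectStream:
    stream_id: int           # object number of the object stream itself
    revision: int            # 0 = original file, higher = later update (assumed field)
    objects: dict[int, str]  # object number -> simplified object content


def load_objects(streams: list[ObjectStream]) -> dict[int, str]:
    """Parse streams in the given order, keeping the first copy of each object."""
    loaded: dict[int, str] = {}
    for stream in streams:
        for number, content in stream.objects.items():
            loaded.setdefault(number, content)  # later duplicates are ignored
    return loaded


streams = [
    ObjectStream(stream_id=5, revision=0, objects={2: "/Pages /Count 2"}),
    ObjectStream(stream_id=12, revision=1, objects={2: "/Pages /Count 4", 9: "/Page"}),
]

# Sorting by ID puts the original stream first, so the stale /Pages wins.
by_id = sorted(streams, key=lambda s: s.stream_id)
# Sorting newest first lets the updated /Pages win instead.
newest_first = sorted(streams, key=lambda s: s.revision, reverse=True)

print(load_objects(by_id)[2])         # /Pages /Count 2
print(load_objects(newest_first)[2])  # /Pages /Count 4
```

The "keep the first copy" rule is unchanged between the two calls; only the order in which the streams are handled differs.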
@ThomasHoevel I could provide a pull request if you like.
@packdat I'll have a look when you provide a PR. I thought the parser was using the newest objects, as tables were read from rear to front, starting with the newest XREF table. But it is not my code and maybe I missed something.
Thanks for your efforts.
The fix by packdat should be included in version 6.1.0 coming later this year or next year.
Thanks for the feedback.
Issue still exists with version 6.0.0.
Check if this issue is related:
#62 (comment)