The scope of this feature is to add support to fpdf2 to produce linearized PDF

Produce linearized PDFs about fpdf2 HOT 20 OPEN

Lucas-C commented on August 16, 2024

Produce linearized PDFs

from fpdf2.

Comments (20)

chandan00761 commented on August 16, 2024 1

I would like to work on this issue.

chandan00761 commented on August 16, 2024 1

Thank you for replying so quickly. I have used Python mainly to develop some scripts (a scraper, goods transportation report generation) and web servers using Django. However, I am new to open source. I have used fpdf2 to generate PDF reports in the goods transportation report generation script.

I am currently reading about linearized PDFs and have set up a local development environment. However, when testing, I see that 3 test cases fail.

Here is the summary:

and here are the full test logs: https://pastebin.com/Z7pa2h2G

Lucas-C commented on August 16, 2024 1

Thank you for reporting this!
I fixed those tests in f0e2a40.
If you update your local repository copy (here is a guide to update your fork) the tests should now pass.
You may also want to install qpdf in order to get more helpful error messages when tests fail.

chandan00761 commented on August 16, 2024 1

@Lucas-C Sorry, I was busy with my semester exams. I am free now and looking into it. I have read the PDF spec and will start the implementation.

chandan00761 commented on August 16, 2024 1

I am still working on it. However, I haven't worked with PDFs at the byte level, so it is taking a lot of time to understand some concepts.

Lucas-C commented on August 16, 2024 1

Ok!
Feel free to ask any questions here, I'd be happy to help by answering them if I can.

Lucas-C commented on August 16, 2024

This could be checked using pikepdf or qpdf:

qpdf --check-linearization / --show-linearization
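
A minimal sketch of how this check could be scripted from Python, assuming the qpdf executable is available on the PATH. The helper names `linearization_command` and `run_linearization_check` are made up for this illustration; only the two flags come from qpdf itself. The sketch returns the raw exit status and output rather than interpreting them, since the printed report is what tells you whether the file is linearized.

```python
import shutil
import subprocess


def linearization_command(pdf_path, show=False):
    """Build the qpdf command line that checks (or dumps) linearization data."""
    flag = "--show-linearization" if show else "--check-linearization"
    return ["qpdf", flag, str(pdf_path)]


def run_linearization_check(pdf_path, show=False):
    """Run qpdf and return (exit_status, output), or None if qpdf is missing."""
    if shutil.which("qpdf") is None:
        return None  # qpdf is not installed on this machine
    result = subprocess.run(
        linearization_command(pdf_path, show=show),
        capture_output=True,
        text=True,
        check=False,  # let the caller decide what a non-zero status means
    )
    return result.returncode, result.stdout + result.stderr
```

For example, `run_linearization_check("output.pdf", show=True)` would return the same report that `qpdf --show-linearization output.pdf` prints in a terminal.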

Lucas-C commented on August 16, 2024

Great @chandan00761 !

How familiar are you with fpdf2 and Python development in general?

As a starting point, I would recommend that you take a look at the Development documentation page. You could start by getting the sources with git, installing them with pip install -e . and launching the unit tests with pytest.

If you have any questions (on the code, tests, how things work...), feel free to ping me! 😊

Lucas-C commented on August 16, 2024

Hi @chandan00761 !

Have you been able to move forward on this? 😊

chandan00761 commented on August 16, 2024

In the linearization parameter dictionary, there is an entry for the length of the entire file in bytes. Does this include the size of the dictionary itself?

Lucas-C commented on August 16, 2024

I don't know.
Maybe you could use PikePDF & qpdf to check this length value? cf. test_pdf.py

Lucas-C commented on August 16, 2024

Have you been able to find an answer there @chandan00761?
Are you still planning to work on this?
If not, no worries, I'd just like to make it clear to other contributors that this feature is "up-for-grabs" 😊

There is a general methodology I have frequently used while adding features to fpdf2, which I would recommend adopting here:

1. Find a reference linearized PDF, or craft one using other software
2. Use qpdf --qdf --compress-streams=n $in_file.pdf $out_file.pdf to produce a "pretty-formatted" PDF
3. Open the "pretty-formatted" PDF in a text editor or IDE in order to study its structure

chandan00761 commented on August 16, 2024

What is the use of _trace_size? Should I use it when placing my objects?
Also, are all the object identifiers of indirect objects sequential? (For example, starting from 2 and continuing with 3, 4, 5, ... without changing order?)

Lucas-C commented on August 16, 2024

What is the use of _trace_size ?

This internal method tracks the size of every section in the final PDF (images, fonts, pages...) when logging is configured.

Should I use it when placing my objects?

Only if you introduce a new top-level resource type.

are all the object identifiers of indirect objects sequential?

If I understood your question correctly, then yes.

Lucas-C commented on August 16, 2024

As it has been a few months now without any update, I guess this issue is up-for-grabs 😊

Anybody is welcome to give it a try!

Lucas-C commented on August 16, 2024

I had a look at this feature, and implementing it will require some major refactoring.

Here is a naive starting point, a new method that should be called just after _putheader() in _enddoc(), because this PDF object must be inserted first in the document:

    def _putlinearization(self):
        "Inserting the linearization parameter dictionary"
        self._newobj()
        self._out(pdf_dict({
            "/Linearized": 1.0,  # Version
            "/L": len(self.buffer),  # File length
            "/H": [...],  # TODO: primary hint stream offset and length (part 5), not known yet
            "/O": object_id_for_page(1),  # Object number of first page's page object (part 6)
            "/E": ...,  # TODO: offset of end of first page, not known yet
            "/N": self.pages_count,  # Number of pages
            "/T": self.offsets[1],  # Offset of first entry in main cross-reference table (part 11)
        }))
        self._out("endobj")

As indicated by the code comments, several numbers must be known:

the full file length (= value of len(self.buffer) after having inserted the %%EOF)
the offsets (= byte position in the buffer) of several PDF objects: hint streams, end of first page (= len(self.buffer) after inserting the first page in _putpages()), first entry in the main cross-reference table

Knowing those values before the call to _putlinearization() will require some code overhaul.

One potential strategy could be to first insert a placeholder (made of % characters?) in the buffer at this stage, and then later, after inserting the %%EOF in the buffer, replace this placeholder with the real linearization parameter dictionary.
This is the strategy currently used for document signing: https://github.com/PyFPDF/fpdf2/blob/master/fpdf/sign.py#L24
One specific point of the PDF spec would help if we adopt this approach:

The linearization parameter dictionary shall be entirely contained within the first 1024 bytes of the PDF file.
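
The placeholder strategy can be sketched as follows. This is only an illustration on a plain `bytearray`, not fpdf2 code: the names `PLACEHOLDER_SIZE`, `reserve_placeholder` and `fill_placeholder` are hypothetical, the reserved width of 200 bytes is an arbitrary choice that stays under the 1024-byte limit, and the dictionary written here contains only /L for brevity.

```python
PLACEHOLDER_SIZE = 200  # arbitrary reserved width (assumption), well under 1024 bytes


def reserve_placeholder(buffer: bytearray) -> int:
    """Append a run of '%' bytes and return the offset where it starts."""
    start = len(buffer)
    buffer += b"%" * PLACEHOLDER_SIZE
    return start


def fill_placeholder(buffer: bytearray, start: int, content: bytes) -> None:
    """Overwrite the reserved run in place, padding with spaces to keep offsets stable."""
    if len(content) > PLACEHOLDER_SIZE:
        raise ValueError("content does not fit in the reserved placeholder")
    buffer[start:start + PLACEHOLDER_SIZE] = content.ljust(PLACEHOLDER_SIZE)


buffer = bytearray(b"%PDF-1.7\n")
start = reserve_placeholder(buffer)
buffer += b"\n... rest of the document ...\n%%EOF\n"
total_length = len(buffer)  # only known once everything has been written

# Simplified dictionary: a real one also needs /H, /O, /E, /N and /T.
params = b"1 0 obj\n<< /Linearized 1 /L %d >>\nendobj" % total_length
fill_placeholder(buffer, start, params)
assert len(buffer) == total_length  # the substitution did not move any byte offset
```

Because the replacement has exactly the same length as the placeholder, every offset recorded after it remains valid.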

But the most challenging part will probably be to change the order in which the PDF objects are rendered by fpdf2 in _enddoc(), to conform to the order required for linearized PDF documents:

1. Header
2. Linearization parameter dictionary (new object)
3. First-page cross-reference table and trailer (new object)
4. Document catalogue and other required document-level objects (must be rendered earlier than currently)
5. Primary hint stream (may precede or follow part 6) (new object)
6. First-page section (may precede or follow part 5)
7. Remaining pages
8. Shared objects for all pages except the first
9. Objects not associated with pages, if any (XMP metadata? Info object? Embedded files not associated with a /FileAttachment annotation?)
10. Overflow hint stream (optional)
11. Main cross-reference table and trailer
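
As a rough illustration of the reordering involved, the parts above can be modelled as an enumeration and used as a sort key. This is a toy sketch, not how fpdf2 organizes its output: the `Part` enum, the `in_linearized_order` helper and the starting order in `current` are all made up for the example.

```python
from enum import IntEnum


class Part(IntEnum):
    """The eleven parts of a linearized PDF, numbered as in the list above."""
    HEADER = 1
    LINEARIZATION_PARAMS = 2
    FIRST_PAGE_XREF = 3
    CATALOG = 4
    PRIMARY_HINT_STREAM = 5
    FIRST_PAGE = 6
    REMAINING_PAGES = 7
    SHARED_OBJECTS = 8
    OTHER_OBJECTS = 9
    OVERFLOW_HINT_STREAM = 10
    MAIN_XREF = 11


def in_linearized_order(sections):
    """Sort (part, label) pairs by part number; sorted() is stable, so ties keep their order."""
    return sorted(sections, key=lambda section: section[0])


# A made-up starting order in which the catalog is written after the pages.
current = [
    (Part.HEADER, "header"),
    (Part.FIRST_PAGE, "page 1"),
    (Part.REMAINING_PAGES, "page 2"),
    (Part.REMAINING_PAGES, "page 3"),
    (Part.CATALOG, "catalog"),
    (Part.MAIN_XREF, "xref + trailer"),
]
reordered = [label for _, label in in_linearized_order(current)]
```

Here the catalog moves ahead of the pages, while pages 2 and 3 keep their relative order.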

Among other things, this will have some impact on util.object_id_for_page() and all the parts of the code that rely on this utility function.

commented on August 16, 2024

Part 7 is described as "Each successive page followed by its nonshared objects". If I understand this correctly, it means that if I embed a file and reference it from both page 1 and page 10,000 (for example, by linking to the same object number from a file attachment annotation on each page), the object is shared. If I only link to it once, on page 1, it is nonshared. A nonshared object should immediately follow its page in the file, whereas a shared object should go at the end (the assumption is probably that shared objects matter less than unique ones to a reader with a slow internet connection). If this is correct, it would be difficult to implement in a single pass.
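
The shared/nonshared distinction described above can be sketched like this. It is a simplified illustration under stated assumptions: `classify_objects` and the `page_refs` mapping are hypothetical, and the special handling that the spec applies to objects used by the first page is ignored. It also shows why a single pass is awkward: the full page-to-object mapping has to be known before anything can be placed.

```python
from collections import defaultdict


def classify_objects(page_refs):
    """Split object ids into per-page nonshared lists and a sorted shared list.

    page_refs maps a page number to the ids of the objects that page references.
    """
    pages_using = defaultdict(set)
    for page, object_ids in page_refs.items():
        for object_id in object_ids:
            pages_using[object_id].add(page)

    nonshared = {page: [] for page in page_refs}
    shared = []
    for object_id, pages in sorted(pages_using.items()):
        if len(pages) == 1:
            (only_page,) = pages  # referenced from exactly one page
            nonshared[only_page].append(object_id)
        else:
            shared.append(object_id)  # referenced from several pages
    return nonshared, shared


# Object 11 plays the role of the file linked from page 1 and page 10,000.
page_refs = {1: [10, 11], 2: [12], 10000: [11, 13]}
nonshared, shared = classify_objects(page_refs)
```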

Regarding the problem with the file size, I think the solution is to look at the xref table: it can only store 10 digits per offset (I think this was the number, I'm not sure anymore). That means the file size can also have at most 10 digits. The unneeded digits can just be spaces.
Using this, we can probably calculate len(self.buffer) + len(lin_header_with_fixed_size_10_digits) and write this number in the header without changing the final size.
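
A small sketch of that fixed-width idea, assuming the 10-digit width of a byte offset in a classic cross-reference entry. The function names and the one-entry dictionary are hypothetical simplifications. It also illustrates that the file length has to count the header itself.

```python
FIELD_WIDTH = 10  # same width as a byte offset in a classic cross-reference entry


def linearization_header(file_length: int) -> bytes:
    """Serialize a simplified dictionary whose size does not depend on file_length."""
    if file_length >= 10**FIELD_WIDTH:
        raise ValueError("file length does not fit in the fixed-width field")
    # Left-align the number and pad with trailing spaces, so that the header
    # always occupies the same number of bytes.
    return f"<< /Linearized 1 /L {file_length:<{FIELD_WIDTH}d} >>\n".encode("ascii")


def total_file_length(body: bytes) -> int:
    """Length of header + body, computable before the header is written."""
    header_size = len(linearization_header(0))  # any valid value gives the same size
    return header_size + len(body)


body = b"... every byte that follows the header ...\n%%EOF\n"
length = total_file_length(body)
document = linearization_header(length) + body
assert len(document) == length  # the value written in /L matches the final size
```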

I think the most difficult part is that the objects related to page 1, the catalog, etc. should have the highest object numbers of all objects, while the numbering must still form a continuous sequence.

Lucas-C commented on August 16, 2024

Just a quick note: I'm currently attempting to implement this, but it may take some weeks before completion, and it will require some significant code refactoring.

Lucas-C commented on August 16, 2024

I merged a first PR (#574) that introduces a fpdf/linearization.py module, with a LinearizedOutputProducer subclass that starts to implement the spec. I haven't implemented the hint tables & hint streams yet, but the PDF objects can now be serialized in the correct order in the output file.

Also, there is an example of linearized PDF file: AlertBoxExamples.pdf @ acrobatusers.com (28KB)
QPDF can be used on this file to display useful linearization info: qpdf --show-linearization AlertBoxExamples.pdf

This issue is up for grabs, as I currently do not have much time to dedicate to it.

Lucas-C commented on August 16, 2024

I also added a first unit test: test/test_linearization.py

Making this test pass will mean that this issue can be closed.
