Comments (21)

jacques-quidu commented on July 4, 2024

I shared a file at this URL: testpublic
It contains a PDF file with text in Segoe UI, part of it in Arabic. The Segoe UI font is embedded with a CIDSet (using your last fix commit). The file still raises the CIDSet requirement error in VeraPDF (there is no issue with Hebrew or Latin text).

I simply use this text to generate the PDF, with the Segoe UI font on Windows 11:
"Segoe UI: للمصممين نص"

If I use my workaround of removing the CIDSet, the PDF file passes VeraPDF validation.

from pdf-writer.

jacques-quidu commented on July 4, 2024

I also reproduced this CIDSet PDF/A issue with "Yu Gothic", for instance, on Windows.
The text still renders correctly if I remove the CIDSet, and the file then passes VeraPDF validation.

galkahana commented on July 4, 2024

How about this: please prep a PR and I'll try to understand where it might cause trouble.

galkahana commented on July 4, 2024

Also, if you have an example that recreates the issue, maybe I can reproduce it and figure out whether there's something to correct in the CIDSet.

jacques-quidu commented on July 4, 2024

Hi, you can reproduce it by generating a PDF file on Windows with PDF-Writer using text in the "Yu Gothic" font. If you then verify it with VeraPDF online (selecting PDF/A-2b conformance), it shows conformance errors, including the same CIDSet error as described above.

Only this CIDSet error remains for me: I fixed the other conformance issues in my client application on top of PDF-Writer.
Note that removing the CIDSet key and object in the PDF-Writer code fixes this last PDF/A-2 conformance error, and the text still renders correctly. According to the PDF specification CIDSet is optional, so it seems safe to just remove it, but I am not 100% sure. That is why I am asking: is it really safe to remove CIDSet from the font descriptor?
(cf. jacques-quidu@6cf1030 in my fork of your repo)

Tomorrow I will try to send you 2 sample PDFs generated by my customer application, one with the CIDSet and one without, so you can compare them if you need to. Rendering is exactly the same in my tests with or without CIDSet.

galkahana commented on July 4, 2024

I'd rather correct it.

jacques-quidu commented on July 4, 2024

Or, as CIDSet is optional according to the PDF specification, maybe just providing an option at PDF creation to not add /CIDSet would be enough.
The requirement that raises the conformance error is not present in PDF/A-1; it only appears starting with PDF/A-2 (Specification: ISO 19005-2:2011, Clause: 6.2.11.4), so the current CIDSet implementation is still legitimate for PDF 1.3-1.7 or PDF/A-1.

galkahana commented on July 4, 2024

the code intends to do what PDF/A-2 states. at least this -

"Specification: ISO 19005-2:2011, Clause: 6.2.11.4, Test number: 4
If the FontDescriptor dictionary of an embedded CID font contains a CIDSet stream, then it shall identify all CIDs which are present in the font program, regardless of whether a CID in the font is referenced or used by the PDF or not."

seems to be what the code does. whatever it is it seems like a bug then and i'd like to fix it.
you are free to do what you will in your fork.

galkahana commented on July 4, 2024

To answer your question about whether it's safe to remove it: I can't speak for all usages, but I'm guessing that if it's optional and renders well, it's fine to remove it. I'm not sure what side effects might happen, but on its face it seems rather safe. Here's the note in the PDF spec:

CIDSet (stream)
(Optional) A stream identifying which CIDs are present in the CIDFont file. If this entry is present, the CIDFont contains only a subset of the glyphs in the character collection defined by the CIDSystemInfo dictionary. If it is absent, the only indication of a CIDFont subset is the subset tag in the FontName entry (see Section 5.5.3, “Font Subsets”).
The stream’s data is organized as a table of bits indexed by CID. The bits should be stored in bytes with the high-order bit first. Each bit corresponds to a CID. The most significant bit of the first byte corresponds to CID 0, the next bit to CID 1, and so on.
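
The bit layout described in the quoted spec text can be sketched as follows. This is a minimal illustration, not PDF-Writer's actual code: `buildCidSetBits` and its `std::set` input are hypothetical names chosen for the example, and the result is the raw table before any stream filter is applied.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Build the raw (unfiltered) CIDSet bit table for the given CIDs.
// Layout per the quoted spec text: the most significant bit of byte 0
// is CID 0, the next bit is CID 1, and so on.
// Hypothetical helper for illustration only.
std::vector<std::uint8_t> buildCidSetBits(const std::set<std::uint16_t>& cids)
{
    if (cids.empty())
        return {};

    const std::size_t maxCid = *cids.rbegin();  // std::set is ordered
    std::vector<std::uint8_t> bits(maxCid / 8 + 1, 0);

    for (std::uint16_t cid : cids)
        bits[cid / 8] |= static_cast<std::uint8_t>(0x80u >> (cid % 8));

    assert(bits.size() * 8 > maxCid);  // table reaches the highest CID
    return bits;
}
```

For example, CIDs 0, 1 and 9 give the two bytes 0xC0 0x40: the top two bits of the first byte, then the second-highest bit of the second byte.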

jacques-quidu commented on July 4, 2024

Yes, I read this in the PDF spec too, which is why I thought it would be safe to just remove it ;)
I also checked with different fonts and got the same rendering with or without CIDSet, so it seems to be really safe.

Thanks again for this great C++ lib: I like its easy extensibility too.

galkahana commented on July 4, 2024

OK. I am able to recreate the problem, I understand it, and I have a working prototype.
I'll have to figure out how to combine it with the rest of the code, but I expect to be able to deliver a working solution with the CIDSet no later than the weekend.

galkahana commented on July 4, 2024

OK, figured I'll just go ahead and fix it.
Turns out my CIDSet implementation was nowhere near how it should be. Omg. Not for TrueType fonts (which Yu Gothic is) nor for OTF fonts. I figured out what the implementation is actually intended to be and corrected both cases, and it seems to make the https://demo.verapdf.org/ complaints about CIDSet go away.

If you want to test this, grab the code from the master branch (or just apply the changes in #217); I think it'll fix the problem on your end too.

jacques-quidu commented on July 4, 2024

Hi, thanks for this quick fix.
I will integrate it in my fork ASAP and try it with my test documents.

Best regards,

Jacques.

jacques-quidu commented on July 4, 2024

Unfortunately, the PDF/A-2b conformance error for the CIDSet is still present with Arabic text.
According to my tests, it is not reproducible with Latin or Hebrew text.

I decided to keep the current workaround (removing CIDSet) in my fork, as it is safe to remove it (I did not find any issue in my test documents without the CIDSet). The main need for my customer is to generate Factur-X or ZUGFeRD invoices, which are based on PDF/A-3.
And since removing CIDSet also reduces file size (even if very slightly), it is good for electronic invoices too (or for archiving with PDF/A).

galkahana commented on July 4, 2024

This is weird. I'd love to be able to reproduce it. I tried Arabic text and it didn't reproduce, and I'm fairly sure the solution is good. OK, I'll try a bit more, or wait for a recreation method from someone for whom the workaround isn't good enough. Thanks.

I suspect that Hebrew and Arabic in your example just don't create a CIDSet (you can open the PDF file and look for the string CIDSet) because they don't generate a CID font. When introducing something like Japanese (or maybe Arabic in the font you are using), the CID font is created, and with it a CIDSet. Anyway, good that the workaround works for you.
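
A quick way to do the "look for the string CIDSet" check is a plain byte search. This sketch is only a rough diagnostic: the helper names are made up for the example, and the search will miss a key that sits inside a compressed object stream.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>

// True if the raw bytes contain the /CIDSet key in plain form.
// Hypothetical helper: a naive search, not a PDF parser.
bool containsCidSetKey(const std::string& pdfBytes)
{
    return pdfBytes.find("/CIDSet") != std::string::npos;
}

// Reads the whole file in binary mode and runs the search above.
// Returns false if the file cannot be opened.
bool fileContainsCidSetKey(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in)
        return false;

    const std::string bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    return containsCidSetKey(bytes);
}
```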

galkahana commented on July 4, 2024

Thanks man!
and thanks for the workaround

jacques-quidu commented on July 4, 2024

I added the PDF file version without CIDSet to testpublic.
It renders the same as the other file, but this file, testArabicNoCIDSet.pdf, passes VeraPDF validation for PDF/A-2b.

galkahana commented on July 4, 2024

Oh my. I understand the problem: my solution does not account for something called dependent glyphs, which are used here. OK, I can add something for that. (At this point it's fine if you don't want to test the result, haha. I understand that you've got a good solution.)
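
One way to picture the fix, assuming "dependent glyphs" here means the component glyphs that a composite glyph is built from: the glyphs that end up in the embedded font program are the closure of the used glyphs over those components, and a CIDSet built from the used glyphs alone would miss them. The sketch below is not the code from the library; `ComponentMap` and `withDependentGlyphs` are hypothetical names, and a real implementation would read the components from the font file.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

using GlyphId = std::uint16_t;

// Hypothetical input: for each composite glyph, the glyphs it is built from.
using ComponentMap = std::map<GlyphId, std::vector<GlyphId>>;

// Returns the used glyphs plus every glyph they depend on, transitively.
std::set<GlyphId> withDependentGlyphs(const std::set<GlyphId>& used,
                                      const ComponentMap& components)
{
    std::set<GlyphId> result;
    std::vector<GlyphId> pending(used.begin(), used.end());

    while (!pending.empty()) {
        const GlyphId glyph = pending.back();
        pending.pop_back();

        // Skip glyphs already visited; this also guards against cycles.
        if (!result.insert(glyph).second)
            continue;

        const auto it = components.find(glyph);
        if (it != components.end())
            pending.insert(pending.end(), it->second.begin(), it->second.end());
    }

    assert(result.size() >= used.size());  // no requested glyph is dropped
    return result;
}
```

The resulting set, rather than the set of glyphs referenced by the text, is what a CIDSet would need to describe under the PDF/A-2 rule quoted earlier in the thread.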

galkahana commented on July 4, 2024

This MR (#218) should take care of this problem.
Again, it's up to you if you want to verify it.

jacques-quidu commented on July 4, 2024

Thanks for fixing this issue again,
but yes, I will stick with my workaround for now, which is also safe.

By the way, I found another issue, related to the copying context: when you append pages from another PDF using a PDF copying context, annotations are lost (such as URL links, or the bookmark links I implemented in the client application with code similar to the URL links, also using PDF annotations). If you use ModifyPDF instead (i.e. an incremental PDF update), annotations are not lost. So to convert a PDF to PDF/A, for instance, I use ModifyPDF when the 4-byte signature is correct in the source file or stream (which keeps annotations). When the 4-byte signature is not correct, I create a copying context from the source file or stream and append pages from the source (but annotations are lost in this case). For now this is fine, as I only need PDF to PDF/A conversion for PDF files generated by the client application (printed or exported directly), so the 4-byte signature in the original PDF is always correct and I can use ModifyPDF, which preserves annotations.
Just tell me if you intend to fix this issue too. You can take your time, as using ModifyPDF is fine for me for now ;)

I opened a separate issue for it: #219
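
The choice described above between ModifyPDF and a copying context can be sketched as a check of the first four bytes. This assumes the "4-byte signature" means the `%PDF` marker at the start of the file, which the comment does not spell out; the enum and function names are hypothetical, and nothing here calls PDF-Writer.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical labels for the two conversion strategies in the comment.
enum class ConversionPath { ModifyInPlace, CopyPages };

// True if the stream starts with the bytes "%PDF" (assumed signature).
// The stream is rewound afterwards so it can be handed to a parser.
bool startsWithPdfSignature(std::istream& in)
{
    char header[4] = {};
    in.read(header, 4);
    const bool ok = in.gcount() == 4 && std::string(header, 4) == "%PDF";
    in.clear();   // reset eof/fail bits set by a short read
    in.seekg(0);
    return ok;
}

// Prefer the incremental path (keeps annotations) when the signature is
// intact; otherwise fall back to copying pages.
ConversionPath choosePath(std::istream& source)
{
    return startsWithPdfSignature(source) ? ConversionPath::ModifyInPlace
                                          : ConversionPath::CopyPages;
}

// Convenience wrapper for in-memory data.
ConversionPath choosePathForBytes(const std::string& bytes)
{
    std::istringstream in(bytes);
    return choosePath(in);
}
```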

jacques-quidu commented on July 4, 2024

I guess the copying context would need to copy annotations from the source page to DocumentContext::mAnnotations in the append-pages code before writing the page, because the write-page code assumes that the annotations to write for a page are stored in DocumentContext::mAnnotations?
