galkahana / pdf-writer

High performance library for creating, modifying and parsing PDF files in C++

Home Page: http://www.pdfhummus.com

License: Apache License 2.0

Python 0.94% C 81.28% Makefile 0.44% Shell 0.01% C++ 16.96% CMake 0.28% Perl 0.09% Lua 0.01%

pdf-writer's Issues

Added FreeType projects for all other compilers/platforms

Added FreeType projects for all other compilers/platforms.
Untested, but generated from the same source as the one for VC2010.

JPEGImageHandler::GetImageDimensions() and pixel density

It's not all that clear what units are used for HummusImageInformation::imageWidth and HummusImageInformation::imageHeight... I assume pixels in the case of a bitmap?
I'm calling PDFWriter::GetImageDimensions("path.to.file") to get the image dimensions on a JPEG file that is 300ppi in the JFIF segment.
It looks like JPEGImageHandler::GetImageDimensions() imposes 72ppi on the returned measurement, and this doesn't match what's actually in the placed ImageXObject itself (SamplesWidth and SamplesHeight) when written by JPEGImageHandler::CreateAndWriteImageXObjectFromJPGInformation().
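
For reference, the relationship between a pixel count, a pixel density and PDF user-space units can be sketched as follows. This is only an illustration: it assumes the default PDF user space of 72 units per inch, and `pixelsToPoints` is a hypothetical helper, not part of the library.

```cpp
#include <stdexcept>

// Hypothetical helper (not a PDF-Writer API): converts a pixel count to PDF
// user-space units, assuming the default user space of 72 units per inch.
double pixelsToPoints(unsigned long pixels, double pixelsPerInch)
{
    if (pixelsPerInch <= 0.0)
        throw std::invalid_argument("pixelsPerInch must be positive");
    return static_cast<double>(pixels) * 72.0 / pixelsPerInch;
}
```

Under that assumption a 1200-pixel-wide JPEG tagged as 300ppi would occupy 288 units, while treating it as 72ppi would give 1200 units.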

Failure in JPGParser while parsing specific JPG

Hi Gal,

Running the function PDFWriter::CreateImageXObjectFromJPGFile with a specific JPG returns failure.
While debugging I realized that the problem is in the function JPEGImageParser::ReadPhotoshopData: it seems that when the resolution BIM is not found, the function reads more bytes than it should.
I added some validation tests that seem to solve the problem. Below is the function after my changes:
EStatusCode JPEGImageParser::ReadPhotoshopData(JPEGImageInformation& outImageInformation,bool outPhotoshopDataOK)
{
EStatusCode status;
unsigned int intSkip;
unsigned long toSkip;
unsigned int nameSkip;
unsigned long dataLength;
bool resolutionBimNotFound = true;

do {
    status = ReadIntValue(intSkip);
    if(status != PDFHummus::eSuccess)
        break;
    toSkip = intSkip-2;
    status = SkipTillChar(scEOS,toSkip);
    if(status != PDFHummus::eSuccess)
        break;
    while(toSkip > 0 && resolutionBimNotFound)
    {
        status = ReadStreamToBuffer(4);
        if(status !=PDFHummus::eSuccess)
            break;
        toSkip-=4;
        if(0 != memcmp(mReadBuffer,sc8Bim,4))
            break; // k. corrupt header. stop here and just skip the next
        status = ReadStreamToBuffer(3);
        if(status !=PDFHummus::eSuccess)
            break;
        toSkip-=3;
        nameSkip = (int)mReadBuffer[2];
        if(nameSkip % 2 == 0)
            ++nameSkip;
        SkipStream(nameSkip);
        toSkip-=nameSkip;
        resolutionBimNotFound = (0 != memcmp(mReadBuffer,scResolutionBIMID,2));
        status = ReadLongValue(dataLength);
        if(status != PDFHummus::eSuccess)
            break;
        toSkip-=4;
        if(resolutionBimNotFound)
        {
            if(dataLength % 2 == 1)
                ++dataLength;
            toSkip-=dataLength;
            SkipStream(dataLength);
        }
        else
        {
            status = ReadStreamToBuffer(16);
            if(status !=PDFHummus::eSuccess)
                break;
            toSkip-=16;
            outImageInformation.PhotoshopInformationExists = true;
            outImageInformation.PhotoshopXDensity = GetIntValue(mReadBuffer) + GetFractValue(mReadBuffer + 2);
            outImageInformation.PhotoshopYDensity = GetIntValue(mReadBuffer + 8) + GetFractValue(mReadBuffer + 10);
        }
    }
    if(PDFHummus::eSuccess == status)
        SkipStream(toSkip);
}while(false);
outPhotoshopDataOK = !resolutionBimNotFound;
return status;

}

What do you think?

Attached is the problematic JPG.

Thanks,
Hadas
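
One risk with an unsigned counter such as `toSkip` in the function above is that subtracting more than remains wraps it around to a very large value. A minimal sketch of a guarded decrement follows; `consumeBytes` is a hypothetical helper and is not part of the library.

```cpp
// Hypothetical helper: subtracts `count` from `remaining` only if enough
// bytes are left. Otherwise it returns false and leaves `remaining`
// untouched, so the caller can stop instead of wrapping the counter around.
bool consumeBytes(unsigned long& remaining, unsigned long count)
{
    if (count > remaining)
        return false;
    remaining -= count;
    return true;
}
```

A parser loop could call this before every read or skip and treat a `false` result as a corrupt segment.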

Linux build Error Fix patch

Hi galkahana, first of all congratulations on the superb job you all have been doing creating this tool.

I tried to build PDF-Writer on Linux (Fedora 18 x86_64) but got build errors.

The patch fixes typos in directory and file names (Linux file systems are case-sensitive)
and adds the headers needed since the GCC 4.3 header dependency cleanup. http://gcc.gnu.org/gcc-4.3/porting_to.html
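
Since that cleanup, standard headers no longer reliably pull each other in as a side effect, so a file that calls `memcmp` or `strlen` has to include `<string.h>` (or `<cstring>`) itself. A small illustration, using a hypothetical `hasPrefix` helper that does not exist in the library:

```cpp
#include <cstring>   // declares std::memcmp and std::strlen; <string> alone does not guarantee them
#include <string>

// Hypothetical helper used only for illustration.
bool hasPrefix(const std::string& text, const char* prefix)
{
    const std::size_t prefixLength = std::strlen(prefix);
    return text.size() >= prefixLength &&
           std::memcmp(text.data(), prefix, prefixLength) == 0;
}
```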

There is also no file named xobjectContentContext.h in PDFWriterTestPlayground, so I copied it:

cp ../PDFWriter/XObjectContentContext.h ../PDFWriterTestPlayground/

Please apply the patch below.

diff -ur PDF-Writer/CMakeLists.txt PDF-Writer_new/CMakeLists.txt
--- PDF-Writer/CMakeLists.txt   2013-04-09 01:33:23.976952997 +0900
+++ PDF-Writer_new/CMakeLists.txt   2013-04-08 23:54:35.643684419 +0900
@@ -4,7 +4,7 @@
 if(NOT PDFHUMMUS_NO_DCT)
    ADD_SUBDIRECTORY(LibJpeg)
 endif(NOT PDFHUMMUS_NO_DCT)
-ADD_SUBDIRECTORY(Zlib)
+   ADD_SUBDIRECTORY(ZLib)
 if(NOT PDFHUMMUS_NO_TIFF)
    ADD_SUBDIRECTORY(LibTiff)
 endif(NOT PDFHUMMUS_NO_TIFF)
diff -ur PDF-Writer/FreeType/CMakeLists.txt PDF-Writer_new/FreeType/CMakeLists.txt
--- PDF-Writer/FreeType/CMakeLists.txt  2013-04-09 01:33:23.977952981 +0900
+++ PDF-Writer_new/FreeType/CMakeLists.txt  2013-04-08 23:54:35.618684812 +0900
@@ -49,7 +49,7 @@
 src/base/ftglyph.c
 src/gzip/ftgzip.c
 src/base/ftinit.c
-src/lzW/ftlzw.c
+src/lzw/ftlzw.c
 src/base/ftstroke.c
 src/base/ftsystem.c
 src/smooth/smooth.c
@@ -61,4 +61,4 @@
 include/freetype/config/ftoption.h
 include/freetype/config/ftstdlib.h
 include/ft2build.h
-)
\ No newline at end of file
+)
diff -ur PDF-Writer/PDFWriter/AbstractWrittenFont.cpp PDF-Writer_new/PDFWriter/AbstractWrittenFont.cpp
--- PDF-Writer/PDFWriter/AbstractWrittenFont.cpp    2013-04-09 01:33:24.053951779 +0900
+++ PDF-Writer_new/PDFWriter/AbstractWrittenFont.cpp    2013-04-08 23:54:35.610684938 +0900
@@ -20,7 +20,7 @@
 */
 #include "AbstractWrittenFont.h"
 #include "ObjectsContext.h"
-#include "InDirectObjectsReferenceRegistry.h"
+#include "IndirectObjectsReferenceRegistry.h"
 #include "Trace.h"
 #include "DictionaryContext.h"
 #include "PDFParser.h"
@@ -485,4 +485,4 @@
        item = it.GetItem();
        inGlyphEncodingInfo.mUnicodeCharacters.push_back((unsigned long)item->GetValue());
    }
-}
\ No newline at end of file
+}
diff -ur PDF-Writer/PDFWriter/CFFFileInput.h PDF-Writer_new/PDFWriter/CFFFileInput.h
--- PDF-Writer/PDFWriter/CFFFileInput.h 2013-04-09 01:33:24.055951748 +0900
+++ PDF-Writer_new/PDFWriter/CFFFileInput.h 2013-04-08 23:54:35.608684971 +0900
@@ -25,6 +25,8 @@
 #include "CFFPrimitiveReader.h"
 #include "IType2InterpreterImplementation.h"

+#include <string.h>
+
 #include <string>
 #include <list>
 #include <map>
diff -ur PDF-Writer/PDFWriter/InputAscii85DecodeStream.cpp PDF-Writer_new/PDFWriter/InputAscii85DecodeStream.cpp
--- PDF-Writer/PDFWriter/InputAscii85DecodeStream.cpp   2013-04-09 01:33:24.060951668 +0900
+++ PDF-Writer_new/PDFWriter/InputAscii85DecodeStream.cpp   2013-04-08 23:54:35.577685458 +0900
@@ -20,6 +20,8 @@
 */
 #include "InputAscii85DecodeStream.h"

+#include <string.h>
+
 #include <algorithm>

 using namespace IOBasicTypes;
@@ -145,4 +147,4 @@
        }

    }
-}
\ No newline at end of file
+}
diff -ur PDF-Writer/PDFWriter/InputDCTDecodeStream.cpp PDF-Writer_new/PDFWriter/InputDCTDecodeStream.cpp
--- PDF-Writer/PDFWriter/InputDCTDecodeStream.cpp   2013-04-09 01:33:24.060951668 +0900
+++ PDF-Writer_new/PDFWriter/InputDCTDecodeStream.cpp   2013-04-08 23:54:35.604685032 +0900
@@ -21,6 +21,8 @@
 #include "InputDCTDecodeStream.h"
 #include "Trace.h"

+#include <string.h>
+
 #ifndef PDFHUMMUS_NO_DCT

 using namespace IOBasicTypes;
diff -ur PDF-Writer/PDFWriter/MD5Generator.cpp PDF-Writer_new/PDFWriter/MD5Generator.cpp
--- PDF-Writer/PDFWriter/MD5Generator.cpp   2013-04-09 01:33:24.062951637 +0900
+++ PDF-Writer_new/PDFWriter/MD5Generator.cpp   2013-04-08 23:54:35.607684987 +0900
@@ -67,6 +67,8 @@
 #include "OutputStringBufferStream.h"
 #include "SafeBufferMacrosDefs.h"

+#include <string.h>
+
 using namespace IOBasicTypes;
 using namespace PDFHummus;

diff -ur PDF-Writer/PDFWriter/PDFWriter.h PDF-Writer_new/PDFWriter/PDFWriter.h
--- PDF-Writer/PDFWriter/PDFWriter.h    2013-04-09 01:33:24.068951542 +0900
+++ PDF-Writer_new/PDFWriter/PDFWriter.h    2013-04-08 23:54:35.612684906 +0900
@@ -30,7 +30,7 @@
 #include "DocumentContext.h"
 #include "ObjectsContext.h"
 #include "PDFRectangle.h"
-#include "TIFFUsageParameters.h"
+#include "TiffUsageParameters.h"
 #include "PDFEmbedParameterTypes.h"

 #include <string>
diff -ur PDF-Writer/PDFWriter/PrimitiveObjectsWriter.h PDF-Writer_new/PDFWriter/PrimitiveObjectsWriter.h
--- PDF-Writer/PDFWriter/PrimitiveObjectsWriter.h   2013-04-09 01:33:24.069951526 +0900
+++ PDF-Writer_new/PDFWriter/PrimitiveObjectsWriter.h   2013-04-08 23:54:35.606685003 +0900
@@ -21,6 +21,7 @@
 #pragma once

 #include "ETokenSeparator.h"
+#include <string.h>
 #include <string>


diff -ur PDF-Writer/PDFWriter/Trace.h PDF-Writer_new/PDFWriter/Trace.h
--- PDF-Writer/PDFWriter/Trace.h    2013-04-09 01:33:24.070951510 +0900
+++ PDF-Writer_new/PDFWriter/Trace.h    2013-04-08 23:54:35.605685017 +0900
@@ -21,6 +21,10 @@
 #pragma once
 #include "Singleton.h"

+#include <stdarg.h>
+
+#include <string.h>
+
 #include <string>


diff -ur PDF-Writer/PDFWriterTestPlayground/AppendingAndReading.h PDF-Writer_new/PDFWriterTestPlayground/AppendingAndReading.h
--- PDF-Writer/PDFWriterTestPlayground/AppendingAndReading.h    2013-04-09 01:33:24.074951447 +0900
+++ PDF-Writer_new/PDFWriterTestPlayground/AppendingAndReading.h    2013-04-08 23:54:35.640684466 +0900
@@ -22,6 +22,8 @@
 #pragma once
 #include "ITestUnit.h"

+#include <string.h>
+
 class AppendingAndReading : public ITestUnit
 {
 public:
diff -ur PDF-Writer/PDFWriterTestPlayground/FlateEncryptionTest.h PDF-Writer_new/PDFWriterTestPlayground/FlateEncryptionTest.h
--- PDF-Writer/PDFWriterTestPlayground/FlateEncryptionTest.h    2013-04-09 01:33:24.075951431 +0900
+++ PDF-Writer_new/PDFWriterTestPlayground/FlateEncryptionTest.h    2013-04-08 23:54:35.641684450 +0900
@@ -20,6 +20,7 @@
 */
 #pragma once

+#include <string.h>
 #include "TestsRunner.h"

 class FlateEncryptionTest : public ITestUnit
diff -ur PDF-Writer/PDFWriterTestPlayground/ImagesAndFormsForwardReferenceTest.cpp PDF-Writer_new/PDFWriterTestPlayground/ImagesAndFormsForwardReferenceTest.cpp
--- PDF-Writer/PDFWriterTestPlayground/ImagesAndFormsForwardReferenceTest.cpp   2013-04-09 01:33:24.075951431 +0900
+++ PDF-Writer_new/PDFWriterTestPlayground/ImagesAndFormsForwardReferenceTest.cpp   2013-04-08 23:54:35.642684434 +0900
@@ -28,7 +28,7 @@
 #include "ProcsetResourcesConstants.h"
 #include "ObjectsContext.h"
 #include "IndirectObjectsReferenceRegistry.h"
-#include "xobjectContentContext.h"
+#include "XObjectContentContext.h"

 #include <iostream>

diff -ur PDF-Writer/PDFWriterTestPlayground/ImagesAndFormsForwardReferenceTest.h PDF-Writer_new/PDFWriterTestPlayground/ImagesAndFormsForwardReferenceTest.h
--- PDF-Writer/PDFWriterTestPlayground/ImagesAndFormsForwardReferenceTest.h 2013-04-09 01:33:24.075951431 +0900
+++ PDF-Writer_new/PDFWriterTestPlayground/ImagesAndFormsForwardReferenceTest.h 2013-04-08 23:54:35.641684450 +0900
@@ -20,6 +20,7 @@
 */
 #pragma once

+#include <string.h>
 #include "ITestUnit.h"

 class ImagesAndFormsForwardReferenceTest: public ITestUnit
diff -ur PDF-Writer/PDFWriterTestPlayground/TestsRunner.h PDF-Writer_new/PDFWriterTestPlayground/TestsRunner.h
--- PDF-Writer/PDFWriterTestPlayground/TestsRunner.h    2013-04-09 01:33:24.079951368 +0900
+++ PDF-Writer_new/PDFWriterTestPlayground/TestsRunner.h    2013-04-08 23:54:35.640684466 +0900
@@ -25,6 +25,8 @@
 #include "Singleton.h"
 #include "FileURL.h"

+#include <string.h>
+
 #include <string>
 #include <list>
 #include <utility>
Only in PDF-Writer_new/PDFWriterTestPlayground: XObjectContentContext.h

my environment gcc version

gcc -v
Using built-in specs.
COLLECT_GCC=/bin/gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.7.2/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --disable-build-with-cxx --disable-build-poststage1-with-cxx --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --enable-java-awt=gtk --disable-dssi --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile --enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.7.2 20121109 (Red Hat 4.7.2-8) (GCC)

[Question] How to re-order pages in a PDF ?

Thanks Gal for the great SDK!

Does anyone know how to use this SDK for re-ordering pages in a PDF?
For example, moving the 2nd page to the 6th position.

I already checked the PDF modification sample codes, but they are all about modifying (or adding / deleting) the contents.
There were no sample codes to re-order pages.
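
One possible approach, offered as a sketch rather than a confirmed recipe: compute the new page order first, then copy the pages into a new document in that order. The helper below only computes the order and is hypothetical. Feeding each index to the copying context (for example through `PDFDocumentCopyingContext::AppendPDFPageFromPDF`) is an assumption about how the library would be used and should be checked against its headers.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: returns the zero-based source page indexes in the
// order they should be written after moving page `from` to position `to`.
std::vector<std::size_t> movePage(std::size_t pageCount, std::size_t from, std::size_t to)
{
    if (from >= pageCount || to >= pageCount)
        throw std::out_of_range("page index out of range");

    // Collect every page except the one being moved, keeping their order.
    std::vector<std::size_t> order;
    order.reserve(pageCount);
    for (std::size_t i = 0; i < pageCount; ++i)
        if (i != from)
            order.push_back(i);

    // Re-insert the moved page at its target position.
    order.insert(order.begin() + static_cast<std::ptrdiff_t>(to), from);
    return order;
}
```

For a 6-page document, `movePage(6, 1, 5)` yields the order 0, 2, 3, 4, 5, 1, i.e. the 2nd page becomes the 6th.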

MinGW32 compilation

I am trying to build Hummus PDF Writer on MinGW32/Windows 7 with gcc version 4.8.1.

The compiler complained about an undefined sprintf_s, so I changed

#ifdef  WIN32

to

#if defined(WIN32) && defined(_MSC_VER)

in SafeBufferMacrosDefs.h.

And also, as there are no fseeko and ftello available on my MinGW32 system, I inserted

    #ifdef __MINGW32__
    #define fseeko fseeko64
    #define ftello ftello64
    #endif

in SafeBufferMacrosDefs.h.

With the above changes, I could successfully build Hummus PDF Writer and some sample
programs I made are working but I haven't tested extensively yet.

I am not sure if I am doing the correct thing, so I'd appreciate it if you could comment on this.
Thanks.

U3D support?

Hi Gal,
Your work is very good. I looked for a C++ library to create and manage PDF files and found several, but in my opinion only yours and libharu are good to use. I prefer your library, but libharu supports the U3D standard, which can show a CAD-like canvas with JavaScript support.
Do you plan to support this feature?

Thank you

Image Extraction giving invalid files

Hi, I tried to use the following code to extract all the streams in the PDF file. Some of them were supposed to be images, since the source PDF is a scanned document of 3 pages, each ~30KB in size (and all of them different).
However, the only 3 extracted streams which weren't ASCII text with PDF commands had exactly 460290 bytes each, and most of the content was just a lot of "ÿ" characters (viewed through vim). These 3 streams are different, though.
Am I doing something wrong, or did I stumble upon a bug?

More importantly, how can I achieve what I want, i.e. extract images from PDFs, preferably detecting whether there's an image per page and its corresponding layout?

Code follows:

    PDFWriter pdfWriter;
    EStatusCode status;
    PDFRectangle a4size(0,0,595,842);

    status = pdfWriter.StartPDF("mark_test_out.pdf", ePDFVersion13);
    if(status != eSuccess)
        return status;

    PDFDocumentCopyingContext *cp_ctx = pdfWriter.CreatePDFCopyingContext("mark_test.pdf");
    PDFParser *parser = cp_ctx->GetSourceDocumentParser();

    for(unsigned long i = 0; i < parser->GetObjectsCount(); ++i)
    {
        PDFObject *obj = parser->ParseNewObject(i);
        if(!obj) continue;
        PDFObject::EPDFObjectType type = obj->GetType();
        fprintf(stderr, "Object %lu is of type %d\n", i, type);
        if(type == PDFObject::ePDFObjectIndirectObjectReference)
        {
            obj = parser->ParseNewObject(static_cast<PDFIndirectObjectReference *>(obj)->mObjectID);
            type = obj->GetType();
            fprintf(stderr, "\tIndirect object is of type %d\n", type);
        }
        if(type == PDFObject::ePDFObjectStream)
        {
            fprintf(stderr, "\tFound a stream\n");
            std::string fname(std::to_string(i));
            std::ofstream outfile(fname.c_str(), std::ofstream::binary);

            IByteReader *stream = parser->StartReadingFromStream(static_cast<PDFStreamInput*>(obj));
            while(stream->NotEnded())
            {
                IOBasicTypes::Byte buf[4096];
                outfile.write((const char *)buf, stream->Read(buf, sizeof(buf)));
            }

            delete stream;
        }
        fprintf(stderr, "\n");
    }

Benchmarks?

The description mentions "high performance" but I don't see any benchmarks or benchmark results anywhere.

Which other node pdf modules/addons did you compare this addon with and how? What were the concrete results?

Extracting Text from PDF

First of all many thanks for writing such a fantastic pdf library in C++.
I saw an old discussion about this for JS but I am not sure if there is any C++ API to just extract basic text from a PDF, something like apache pdfbox textstripper.
I know this is naive, but having a straightforward API would help in many scenarios where we just want to read specific elements.

Problematic sanity check in OpenTypeFileInput.cpp OpenTypeFileInput::ReadOpenTypeSFNTFromDfont() leads to problems embedding .dfont fonts on El Capitan

There is a sanity check in OpenTypeFileInput::ReadOpenTypeSFNTFromDfont() that results in failures embedding some fonts that are packaged as .dfont files on Mac OS X 10.11 El Capitan.

I believe there has been a misreading of the Inside Mac documentation for the resource fork format, where the space reserved in the resource map header on disk for a copy of the resource file header when in memory is assumed to actually be a bitwise copy of the resource file header on disk. The diagram of the format of the resource map in Inside Mac labels this field as "Reserved for a copy of the resource header", and the text actually says, "After reading the resource map into memory, the Resource Manager stores the indicated information in the reserved areas at the beginning of the map."

{
    // check that the two headers match

    int allzeros = 1, allmatch = 1;
    for (int i = 0; i < 16; ++i )
    {
        if ( head[i] != 0 ) allzeros = 0;
        if ( head2[i] != head[i] ) allmatch = 0;
    }
    if ( !allzeros && !allmatch ) return PDFHummus::eFailure;
}

For example, the data-fork-based resource file /System/Library/Fonts/Geneva.dfont does happen to pass this sanity check on Yosemite, but the version that comes with El Capitan does not. This means that an attempt to embed the Geneva font will fail on a stock El Capitan system, because the reader bails out when the resource file header does not match whatever bytes happen to be in the space reserved for the in-memory copy.

Removing this check lets us successfully embed the Geneva font (and some other fonts housed in .dfont files) on Mac OS X 10.11 El Capitan.

There is also another sanity check which may be problematic...

        if ( rdata_pos + rdata_len != map_pos || map_pos == 0 ) {
            return PDFHummus::eFailure;
        }

The documentation in Inside Mac does not guarantee that the resource map is positioned immediately after the resource data (i.e. the documentation does not guarantee that rdata_pos + rdata_len == map_pos). Having two separate offset fields allows the data structures to appear in any order on disk, and with any number of (possibly non-zero) bytes padding the gap between them. For example, a quick way for the Resource Manager to update the resource map for a resource file on disk might be to simply append a brand new resource map to the end of the file, and then update the header to point to the new one, leaving the old one in place.
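
A more permissive check could look like the sketch below. It rests on the assumption that the only hard requirements are that both regions lie inside the file and do not overlap. The function and parameter names are hypothetical, and this is not the library's code.

```cpp
#include <cstdint>

// Hypothetical validation: accepts any layout where the resource data and
// the resource map both fit inside the file and do not overlap, regardless
// of their order or of padding between them.
bool plausibleResourceLayout(std::uint64_t dataPos, std::uint64_t dataLen,
                             std::uint64_t mapPos, std::uint64_t mapLen,
                             std::uint64_t fileSize)
{
    if (mapPos == 0 || mapLen == 0)
        return false;
    // Written as subtractions so the range checks cannot overflow.
    if (dataPos > fileSize || dataLen > fileSize - dataPos)
        return false;
    if (mapPos > fileSize || mapLen > fileSize - mapPos)
        return false;
    const bool dataBeforeMap = dataPos + dataLen <= mapPos;
    const bool mapBeforeData = mapPos + mapLen <= dataPos;
    return dataBeforeMap || mapBeforeData;
}
```

Compared with the strict adjacency test, this accepts a gap between the data and the map, and a map that precedes the data, while still rejecting offsets that point outside the file.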

[Question] How to modify all pages' "Parent" to the new Page Tree node?

How to modify all pages' "Parent" to the new page tree node?
Calling WritePagesTree() to create a new page tree node, but how to modify existing pages to re-parent them to the new page tree node?

Making use of Base 14 fonts

How can I create a PDFUsedFont that references one of the Base 14 fonts in pdf?
The idea is to not embed the fonts, but still have a "deterministic" choice of fonts (quite obviously? :P)

Thanks!

PS: I tried using TfLow("Helvetica", 14); but then mupdf complains that it couldn't find the font dictionary. I'm not knowledgeable in the PDF specification at all, apart from what I read here.

PDFModifiedPage class issue

Hi Gal,

What's up ?

I have to add some content to existing pdf page.

Should the PDFModifiedPage class (from
https://github.com/galkahana/PDF-Writer/blob/master/PDFWriterTestPlayground/PageModifierTest.cpp) do the work (add something to the current content)?

If not (meaning it is supposed to replace the page's current content with the new content added between StartContentContext and EndContentContext, as it does now), how can I keep the current content?

Thanks and cheers.
Lidia.

Build tools cleanup and python bindings

First off I want to say that I'm really liking the library you've got here @galkahana

I have two requirements for your library (and from an initial look it seems your library can meet them).

Create a new PDF, embed a page from an existing PDF, write text in arbitrary locations on the page
Create a new PDF, embed an image, write text in arbitrary locations on the page

The thing is that I'd like to do this in python so I'm going to write some bindings for this library in python. I'd like to confirm you'd be okay with that (and that you have no immediate plans yourself).

The second thing is after reading through some of the code and understanding how things are working here... would you be opposed to me submitting a pull request to clean up the build process for this library?

The primary improvement being relying on development headers available in the environment rather than having libtiff, libjpeg, etc. included in this repository. As I'd like to use the latest libtiff and some API changes have happened since ~2 years ago.

Embedded LinuxLibertine OpenType fonts come out totally messed up

I was trying to embed some OTF fonts, namely the Linux Libertine OTF fonts (I got them from http://www.linuxlibertine.org/).
Since the font looks OK elsewhere, I'm assuming something goes wrong with the font embedding.
Here is the actual Hello World output I get with the LinLibertine_R.otf font:

It looks like curve control points got treated as actual points, or cubic and quadratic curves got mixed up while converting the path, or something like that.

Type1Input::ParseSubrs() makes assumptions about format of /Subrs dictionary entries that can lead to crashes with older Type 1 fonts

The code in Type1Input::ParseSubrs() assumes that each entry in the /Subrs dictionary uses the NP and ND shortcuts, and this can lead to a crash.

The assumption is that each entry will look like this in all Type 1 fonts:

dup index numBytes RD [numBytes of binary data] NP

However, NP (and ND) are procedures that are locally defined in the Type 1 font as abbreviations to save space in the font file...

/ND { noaccess def } executeonly def
/NP { noaccess put } executeonly def

Most fonts do use them, but some older fonts do not, using the full commands noaccess, put, and def in place. For example, ...

So, instead of...

/Subrs 115 array
dup 0 15 RD 15bytes~ NP
dup 1 9 RD 9bytes~ NP
:
:
ND

an older font might have...

/Subrs 115 array
dup 0 15 RD 15bytes~ noaccess put
dup 1 9 RD 9bytes~ noaccess put
:
:
noaccess def

In the code for Type1Input::ParseSubrs(), after reading the binary bytes of the first entry, it makes exactly two further calls to mPFBDecoder.GetNextToken(), expecting to consume the NP token and then either the dup token or the ND token, so that it is ready to read the key (subrIndex) of the next dictionary entry or is finished with the dictionary.

When presented with a font that does not use NP or ND, the next token after reading the first entry is now the dup in the second entry rather than the key (subrIndex) for that entry. Further entries in the mSubrs array are now garbage, and at some point a CodeLength of 0 may be used to create an empty Byte array on the heap.

mSubrs[subrIndex].CodeLength = Int(token.second);
mSubrs[subrIndex].Code = new Byte[mSubrs[subrIndex].CodeLength];

This can lead to a crash later in Type1Input::FreeTables(), when the mSubrs array is cleaned up.

for(long i=0;i<mSubrsCount;++i)
    delete[] mSubrs[i].Code;

We don't have code for a general solution (since essentially, the problem is that a Type 1 font is actually a PostScript program), but what we have come up with replaces the two calls to mPFBDecoder.GetNextToken() in Type1Input::ParseSubrs():

So that...

// skip NP token
mPFBDecoder.GetNextToken();

// skip dup or end array definition
mPFBDecoder.GetNextToken();

... is replaced with ...

while ( token.first )
{
    token = mPFBDecoder.GetNextToken();
    if ( 0 == token.second.compare("dup") )
        break;
    if ( 0 == token.second.compare("ND") )
        break;
    if ( 0 == token.second.compare("def") )
        break;
}

That handles the cases where the font uses the NP and ND shortcuts, or if they use noaccess, put, and def directly.

Section 2.4 of the Type 1 specification also notes that some fonts can also use other names defined in userdict, or -|, |-, and | defined in the Private dictionary.
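
The loop above could be extended along those lines. The sketch below works on a plain list of already-split tokens rather than on the library's decoder, and the set of terminator names is an assumption drawn from the alternatives mentioned here (`|-` being the Private-dictionary counterpart of ND). A font that defines its own names in userdict would still need extra handling.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the tokenizer loop: scans forward from `pos` and
// returns the index of the next token that either starts a new entry ("dup")
// or ends the array ("ND", "|-" or "def"). Returns tokens.size() if none.
std::size_t findNextEntryOrEnd(const std::vector<std::string>& tokens, std::size_t pos)
{
    for (; pos < tokens.size(); ++pos)
    {
        const std::string& t = tokens[pos];
        if (t == "dup" || t == "ND" || t == "|-" || t == "def")
            break;
    }
    return pos;
}
```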

Add new object to PDF file.

Hello,
Can I add a new PDFDictionary object to my PDF document with your library?

[Question] How to set the Font in a FreeText annotation ?

I tried to use your library to create a FreeText annotation (Subtype = FreeText).
So far everything went well, but to set the font I followed the PDF spec and set
the "DA" field as FontName FontSize "Tf" Red Green Blue "rg".
The spec says FontName should be the key (or name) of an entry in the Font dictionary.
Do you know how to implement this using your library?
I checked the PDFUsedFont class and I can load any font file, but how can I write the selected font to the Font dictionary for a "FreeText" annotation?
Thank you very much for creating this great library and for helping to answer the question!

[Question] How to get the UTF8 encoded string from a PDFLiteralString that has escaped characters?

I found a PDFLiteralString whose value is the C string "3a\xb2bx + 15cxy + 25ady".
The "\xb2" is an escaped character with the hex value 0xb2.
How can I convert the literal string to a UTF-8 encoded string?
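
A minimal sketch of one way to do the conversion, assuming the bytes can be read as ISO 8859-1 (Latin-1). That is an assumption: PDFDocEncoding differs from Latin-1 for some code points, and strings that start with a UTF-16 byte order mark need different handling. `latin1ToUtf8` is a hypothetical helper, not a library function.

```cpp
#include <string>

// Hypothetical helper: converts bytes interpreted as ISO 8859-1 (Latin-1)
// to UTF-8. Bytes below 0x80 are copied; the rest become two-byte sequences.
std::string latin1ToUtf8(const std::string& input)
{
    std::string output;
    output.reserve(input.size() * 2);
    for (unsigned char byte : input)
    {
        if (byte < 0x80)
        {
            output.push_back(static_cast<char>(byte));
        }
        else
        {
            // Two-byte UTF-8 sequence: 110000xx 10xxxxxx
            output.push_back(static_cast<char>(0xC0 | (byte >> 6)));
            output.push_back(static_cast<char>(0x80 | (byte & 0x3F)));
        }
    }
    return output;
}
```

With this mapping the byte 0xb2 becomes the UTF-8 sequence 0xC2 0xB2, which is U+00B2 (superscript two).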

LC_NUMERIC may use a decimal comma, thousands separator, etc.

Locale settings affect sprintf, although PDF numbers should always use the C locale.

I noticed it in the MediaBox:
/MediaBox [ 0 0 595,2 841,68 ] (comma as decimal point, as in most European locales)

After writing this to a PDF, parsing it back fails.

A fix could be made in PrimitiveObjectsWriter.cpp:

#include <sstream>
#include <locale>
....
void PrimitiveObjectsWriter::WriteInteger(long long inIntegerToken,ETokenSeparator inSeparate)
{
	
	std::stringstream formatter;
	formatter.imbue(std::locale("C"));
	formatter << inIntegerToken;
	std::string formatter_buf = formatter.str();

	mStreamForWriting->Write((const IOBasicTypes::Byte *)formatter_buf.data(), formatter_buf.size());

	WriteTokenSeparator(inSeparate);
}
...

void PrimitiveObjectsWriter::WriteDouble(double inDoubleToken,ETokenSeparator inSeparate)
{
	std::stringstream formatter;
	formatter.imbue(std::locale("C"));
	formatter << inDoubleToken;
	std::string formatter_buf = formatter.str();

	mStreamForWriting->Write((const IOBasicTypes::Byte *)formatter_buf.data(),formatter_buf.size());
	WriteTokenSeparator(inSeparate);
}

I tried this on AIX, Linux and MSVC, and it fixed the issue.
Sorry, I can't create a pull request for it now.
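
A self-contained variant of the same idea, for illustration only. It uses `std::locale::classic()`, the standard library's handle to the "C" locale, and a hypothetical `formatNumber` helper rather than the library's writer class.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: formats a number with the classic ("C") locale so the
// decimal separator is always '.', whatever the global locale is.
std::string formatNumber(double value)
{
    std::ostringstream formatter;
    formatter.imbue(std::locale::classic());
    formatter << value;
    return formatter.str();
}
```

Note that the stream's default precision of six significant digits still applies, so values needing more digits would require an explicit precision setting.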

FontDescriptorWriter::WriteFontDescriptor() does not write the required entry for the "Type" key in the Font Descriptor dictionary

The PDF specification requires that a Font Descriptor dictionary have a "Type" entry with a name value that is "FontDescriptor". The code in FontDescriptorWriter.cpp, in FontDescriptorWriter::WriteFontDescriptor() omits this required key and name value, and Adobe Acrobat's preflight checker flags this as an issue in the output.

[bug][fix] DetermineDoubleTrimmedLength

Annotation

Hi,

You've done amazing work. Are you planning to add support for annotations?

Thanks and regards,

Enhance: Web capture

Hey guys, I was wondering if you are implementing this feature, because I would be really happy to help.

Page hierarchies/ToC support

Is there any way to write the table of contents for a document? I looked all over the docs and I couldn't find any specs.

Stack buffer overflow in PDFParser::ParseXrefFromXrefTable()

In PDFParser.cpp, in PDFParser::ParseXrefFromXrefTable(), there is a possibility of an attempt to read past the bounds of the 20-byte array for holding the xref entry.

There are four lines where a pointer to a part of this array (on the stack) is cast to (const char*) and implicitly converted into the std::string passed to the BoxingBaseWithRW<> constructor. The implicit construction of the std::string uses the constructor that only takes a single const char* parameter, and is intended to convert from NULL-terminated character strings; a different std::string constructor for constructing from byte buffers is probably more appropriate here.

I tried some changes to explicitly choose the byte buffer version of the std::string constructor here...

            if(currentObject < inXrefSize)
            {
                inXrefTable[currentObject].mObjectPosition = LongFilePositionTypeBox( std::string( (const char*)entry, 10 ) );
                inXrefTable[currentObject].mRivision = ULong( std::string( (const char*)(entry+11), 5 ) );
                inXrefTable[currentObject].mType = entry[17] == 'n' ? eXrefEntryExisting:eXrefEntryDelete;
            }
            ++currentObject;



            // now parse the section. 
            while(currentObject < firstNonSectionObject)
            {
                if(mStream->Read(entry,20) != 20)
                {
                    TRACE_LOG("PDFParser::ParseXref, failed to read xref entry");
                    status = PDFHummus::eFailure;
                    break;
                }
                if(currentObject < inXrefSize)
                {
                    inXrefTable[currentObject].mObjectPosition = LongFilePositionTypeBox( std::string( (const char*)entry, 10 ) );
                    inXrefTable[currentObject].mRivision = ULong( std::string( (const char*)(entry+11), 5 ) );
                    inXrefTable[currentObject].mType = entry[17] == 'n' ? eXrefEntryExisting:eXrefEntryDelete;
                }
                ++currentObject;
            }
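
The same technique in isolation, as a sketch that does not depend on the library's boxing classes. `XrefEntry` and `parseXrefEntry` are hypothetical names, and the entry layout assumed is the usual 20-byte form "nnnnnnnnnn ggggg n" followed by a two-character end-of-line.

```cpp
#include <cstddef>
#include <string>

struct XrefEntry
{
    unsigned long long offset;
    unsigned long generation;
    bool inUse;
};

// Hypothetical parser for one 20-byte xref entry. Each std::string is built
// from a pointer and an explicit length, so nothing is read past the field
// even though the buffer is not NUL-terminated. std::stoull and std::stoul
// throw if a field contains no digits.
XrefEntry parseXrefEntry(const unsigned char (&entry)[20])
{
    const char* bytes = reinterpret_cast<const char*>(entry);
    XrefEntry result;
    result.offset = std::stoull(std::string(bytes, 10));
    result.generation = std::stoul(std::string(bytes + 11, 5));
    result.inUse = (bytes[17] == 'n');
    return result;
}
```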

ANSIFontWriter::WriteWidths crashes

ANSIFontWriter::WriteWidths crashes when called from CFFANSIFontWriter and no glyphs in the font have actually been used. Specifically, the result from mCharactersVector.begin() can't be dereferenced because it's empty in this case, so an exception is thrown.

I'm not sure if this issue is confined to ANSIFontWriter or if other types of fonts have an analogous issue.

Ideally, the font wouldn't be emitted at all since it's not actually used.
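
A minimal sketch of the kind of guard that would avoid the dereference. It uses a plain `std::vector` and a hypothetical `firstAndLastChar` helper, since the exact member types in ANSIFontWriter are not shown here.

```cpp
#include <utility>
#include <vector>

// Hypothetical helper: writes the first and last used character codes to
// `outRange` and returns true, or returns false without touching `outRange`
// when the list is empty, instead of dereferencing begin() on an empty
// container.
bool firstAndLastChar(const std::vector<unsigned int>& usedCharacters,
                      std::pair<unsigned int, unsigned int>& outRange)
{
    if (usedCharacters.empty())
        return false;
    outRange.first = usedCharacters.front();
    outRange.second = usedCharacters.back();
    return true;
}
```

A caller could use the `false` result as the signal to skip writing the widths, or the font, altogether.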

Linking error

Ideas?

It seems to find some symbols in libPDFWriter.a, but unable to find others. Stumped.

Ld build/Debug/PDFWriterTestPlayground normal x86_64
cd /Users/jbierling/Downloads/Code/PDF-Writer-master/PDFWriterTestPlayground/PDFWriterTestPlayground
setenv MACOSX_DEPLOYMENT_TARGET 10.8
/Applications/Xcode5-DP5.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang++ -arch x86_64 -isysroot /Applications/Xcode5-DP5.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk -L/Users/jbierling/Downloads/Code/PDF-Writer-master/PDFWriterTestPlayground/PDFWriterTestPlayground/build/Debug -L/Users/jbierling/Downloads/Code/PDF-Writer-master/XCode/build/Debug -F/Users/jbierling/Downloads/Code/PDF-Writer-master/PDFWriterTestPlayground/PDFWriterTestPlayground/build/Debug -filelist /Users/jbierling/Downloads/Code/PDF-Writer-master/PDFWriterTestPlayground/PDFWriterTestPlayground/build/PDFWriterTestPlayground.build/Debug/PDFWriterTestPlayground.build/Objects-normal/x86_64/PDFWriterTestPlayground.LinkFileList -mmacosx-version-min=10.8 -stdlib=libc++ -lLibJpeg -lPDFWriter -lLibTiff -lz.1.2.5 -lstdc++.6.0.9 -lFreetype -Xlinker -dependency_info -Xlinker /Users/jbierling/Downloads/Code/PDF-Writer-master/PDFWriterTestPlayground/PDFWriterTestPlayground/build/PDFWriterTestPlayground.build/Debug/PDFWriterTestPlayground.build/Objects-normal/x86_64/PDFWriterTestPlayground_dependency_info.dat -o /Users/jbierling/Downloads/Code/PDF-Writer-master/PDFWriterTestPlayground/PDFWriterTestPlayground/build/Debug/PDFWriterTestPlayground

Undefined symbols for architecture x86_64:
"OutputFile::OpenFile(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool)", referenced from:
OpenTypeTest::SaveCharstringCode(TestConfiguration const&, unsigned short, unsigned short, CFFFileInput*) in OpenTypeTest.o

Digital Signatures

Hi,

I'm looking at using digital signatures in an application, specifically Node, so I'd be using HummusJS.

Is this a feature you have considered adding?

Is it maybe possible to do with the existing API? Maybe by adding it to the stream manually?

Thank you!
Andrew

Type1ToCFFEmbeddedFontWriter::AddComponentGlyphs() uses the private encoding for dependent glyphs even for glyphs defined using the 'seac' operator

The 'seac' operator ("Standard Encoding Accented Character") is defined to specify dependent glyphs in the Adobe Standard Encoding rather than the Type 1 font's private encoding, but the implementation of Type1ToCFFEmbeddedFontWriter::AddComponentGlyphs() was only getting the glyph names according to the font's private encoding. In our case, we were getting glyph names defined in the font's private encoding dictionary, but the font did not actually have any charstrings for those glyph names. This caused the recursive call to AddComponentGlyphs() to fail on calling Type1Input::CalculateDependenciesForCharIndex(), and that led to the failure to embed the font, and ultimately the failure to complete the PDF. Or, no glyph would be shown at all in the output PDF.
For example, the charstring for "Aring" might use 'seac' with "A" and "ring" as dependent glyphs, specified in the Adobe Standard Encoding (where "ring" has the code point 0xCA). The font's private encoding might have a different name for the code point 0xCA, "eth"... if the font did not have a charstring for "eth", that would lead to the failure described above; and if the font did have a charstring for that character, the wrong glyph would be drawn.
Our solution was to get the encoded glyph name explicitly from the StandardEncoding object instead.
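
The difference between the two lookups can be sketched as follows. This is only an illustration under stated assumptions: the two tables are tiny hypothetical excerpts (a real font's private encoding will differ), and the names glyphNameForCode and seacComponentName are made up for this sketch and are not part of PDF-Writer.

```cpp
#include <map>
#include <string>

// Hypothetical excerpt of the Adobe Standard Encoding (code -> glyph name).
// Only the two codes used by the "Aring" example are listed.
static const std::map<unsigned char, std::string> kStandardEncodingExcerpt = {
    {0x41, "A"},
    {0xCA, "ring"},
};

// Hypothetical private encoding of a font that reuses code 0xCA for "eth".
static const std::map<unsigned char, std::string> kPrivateEncodingExample = {
    {0x41, "A"},
    {0xCA, "eth"},
};

// Look up the glyph name for a code in the given table.
// Returns ".notdef" when the table has no entry for the code.
std::string glyphNameForCode(const std::map<unsigned char, std::string>& encoding,
                             unsigned char code)
{
    auto it = encoding.find(code);
    return it != encoding.end() ? it->second : ".notdef";
}

// 'seac' component codes refer to the Standard Encoding, so the font's
// private encoding is deliberately not consulted here.
std::string seacComponentName(unsigned char code)
{
    return glyphNameForCode(kStandardEncodingExcerpt, code);
}
```

With these tables, seacComponentName(0xCA) yields "ring", while a lookup of the same code in the private encoding yields "eth", which is the mismatch described in this issue.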

FreeTypeType1Wrapper::GetGlyphForUnicodeChar() does not actually get a glyph for a Unicode character

We convert to strings of UCS2 codepoints and call FreeTypeFaceWrapper::GetGlyphsForUnicodeText() regardless of the font, and the implementation of FreeTypeType1Wrapper::GetGlyphForUnicodeChar() wasn't actually getting the desired glyph for Type 1 fonts, nor was it reporting if the font did not have a glyph for the given character.
Our solution was to convert the UCS2 codepoint to a Postscript glyph name, check that the type 1 font provided a charstring for that glyph name, and then look up the glyph number for the glyph name in the font's private encoding. This let us use the same input to display text as we would for any TrueType or OpenType font, and also let us determine if we needed to switch to a different font (if the original font didn't provide the needed glyph).
Sorry, I don't have code to share at the moment. We created our map of UCS2 to Postscript glyph names using data from here: https://github.com/adobe-type-tools/agl-aglfn/
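
Since the original code is not available, here is a rough sketch of the three steps described above (codepoint to glyph name, charstring check, private-encoding lookup). Everything in it is an assumption made for illustration: Type1FontSketch, kGlyphNameExcerpt and glyphCodeForUnicode are hypothetical names, and the glyph name table is a two-entry excerpt rather than the full Adobe list.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for the parts of a Type 1 font used by the lookup.
struct Type1FontSketch {
    std::set<std::string> charStringNames;                 // glyph names that have a charstring
    std::map<std::string, unsigned char> privateEncoding;  // glyph name -> code in the font's encoding
};

// Tiny hypothetical excerpt of a UCS-2 -> PostScript glyph name map.
static const std::map<unsigned short, std::string> kGlyphNameExcerpt = {
    {0x0041, "A"},
    {0x00C5, "Aring"},
};

// Returns true and sets outCode only if all three steps succeed:
// 1. the codepoint has a known glyph name,
// 2. the font has a charstring for that name,
// 3. the font's private encoding assigns a code to that name.
// A false result tells the caller to fall back to another font.
bool glyphCodeForUnicode(const Type1FontSketch& font,
                         unsigned short ucs2,
                         unsigned char& outCode)
{
    auto nameIt = kGlyphNameExcerpt.find(ucs2);
    if (nameIt == kGlyphNameExcerpt.end())
        return false;

    if (font.charStringNames.count(nameIt->second) == 0)
        return false;

    auto codeIt = font.privateEncoding.find(nameIt->second);
    if (codeIt == font.privateEncoding.end())
        return false;

    outCode = codeIt->second;
    return true;
}
```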

Resolution for Copying Context

Hi galkahana,

I'm very impressed by your PDF-Writer project! Love it! 👍

I started working with it, and one question arose about embedding PDF pages into a new PDF file.

The problem is that my source PDF file has a specific format and was made at a specific resolution (300 dpi, for example). This resolution should be kept.

In Acrobat Reader there's an option to set the resolution for snapshot tool images (Page Display Preferences -> General). The default is 72 dpi, I guess.

To copy and paste the whole first page of my source pdf the relevant code looks like this:

firstContext = pdfWriter.CreatePDFCopyingContext(frame_path);

EStatusCodeAndObjectIDType resultFirst = firstContext->CreateFormXObjectFromPDFPage(0, ePDFPageBoxMediaBox);

// placing the pages in a result page
contentContext->q();
contentContext->cm(1,0,0,1,0,0);
contentContext->Do(page->GetResourcesDictionary().AddFormXObjectMapping(resultFirst.second));
contentContext->Q();

So my question: Is it possible to cut out a specific area (or a whole page) from the source pdf file with a defined resolution?

Thanks a lot!
Best regards
Maurice

Improper "using namespace std" in header files

I'm trying to evaluate PDF-Writer as a backend for the TeXmacs (www.texmacs.org) scientific editor. However, the PDF-Writer header files contain "using namespace std", which prevents me from including them in the TeXmacs sources: TeXmacs does not rely on the C++ standard library and has its own definition of the string class.

In general, it is a good idea to avoid that using directive in header files:

http://www.cplusplus.com/forum/beginner/25538/

Btw, nice library! I look forward to being able to use it to write PDFs inside TeXmacs; we really need a good PDF writing library like that.

Typo in test in CFFFileInput::ReadEncoding()

On line 847 of CFFFileInput.cpp, in the function CFFFileInput::ReadEncoding(), there is a logic error. The code is intended to check if the high bit of an 8-bit byte is set, but the condition actually tested ((encodingFormat & 0x80) == 1) will always be false. That should probably read ((encodingFormat & 0x80) != 0), instead (or any number of ways to correctly test only that bit).
I'm sorry, but for a number of reasons, I can't fork the project and submit a pull request with a simple fix; right now, the best I can do is alert that there is an issue.
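
To make the problem concrete, here is a small self-contained check of the two conditions. The function names are made up for this sketch; only the expressions themselves come from the issue text.

```cpp
// Condition as reported in the issue: (value & 0x80) is either 0 or 0x80 (128),
// so comparing the result against 1 can never succeed.
constexpr bool highBitSetAsReported(unsigned char encodingFormat)
{
    return (encodingFormat & 0x80) == 1;
}

// Suggested condition: true exactly when bit 7 is set.
constexpr bool highBitSet(unsigned char encodingFormat)
{
    return (encodingFormat & 0x80) != 0;
}

// Compile-time checks of the behaviour described above.
static_assert(!highBitSetAsReported(0x80), "reported condition misses the high bit");
static_assert(highBitSet(0x80), "suggested condition detects the high bit");
static_assert(!highBitSet(0x7F), "the low seven bits alone do not set the flag");
```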

[Question] How to correctly skip a failed object copying?

I need to copy objects from one PDF to another, but sometimes some objects are not "valid" (a problematic PDF file, maybe generated by a buggy PDF app), e.g. a missing parent node or a missing indirect object. In such cases I want to skip copying the object, but by the time PDFDocumentCopyingContext::CopyObject returns eFailure it has already allocated some object IDs, so the later pdfWriter.EndPDF() always returns eFailure due to unwritten objects in the xref table (it fails in ObjectsContext::WriteXrefTable, line 204).
Is there any method in the library that I can call to roll back the state to before the failed CopyObject call?

Flate decode issue

Decoding the Flate encoded image stream in the attached PDF file doesn't seem to work. See sample code below. Am I doing something wrong?

#include <iostream>
#include <fstream>
#include <string>
#include "PDFHummus/PDFParser.h"
#include "PDFHummus/InputFile.h"
#include "PDFHummus/PDFStreamInput.h"
#include "PDFHummus/IByteReader.h"
#include "PDFHummus/EStatusCode.h"

using namespace std;
using namespace PDFHummus;

void decodeStream(char *path);

int main(int count, char* args[]) {
    if (count < 2) {
        cerr << "PDF file required" << endl;
        return 1;
    }

    if (count == 2) {
        decodeStream(args[1]);
    }

    return 0;
}

void decodeStream(char *path) {
    PDFParser parser;
    InputFile pdfFile;
    EStatusCode status = pdfFile.OpenFile(path);
    if(status == eSuccess) {
        status = parser.StartPDFParsing(pdfFile.GetInputStream());
        if(status == eSuccess) {
            // Parse image object
            PDFObject* streamObj = parser.ParseNewObject(7);
            if (streamObj != NULL
                && streamObj->GetType() == PDFObject::ePDFObjectStream) {
                PDFStreamInput* stream = ((PDFStreamInput*)streamObj);
                IByteReader* reader = parser.StartReadingFromStream(stream);
                if (!reader) {
                    cout << "Couldn't create reader\n";
                    return; // avoid dereferencing a null reader below
                }

                Byte buffer[1000];
                LongBufferSizeType total = 0;
                while(reader->NotEnded()) {
                    LongBufferSizeType readAmount = reader->Read(buffer,1000);
                    total += readAmount;
                    cout << "Total read: " << total << "\n";
                }
            }
        }
    }
}

test1.pdf

How to get baseline to baseline distance (new line advance)?

Currently, I'm using freetype directly like so:

float font_size = 14.;
PDFUsedFont *font = pdfWriter.GetFontForFile("font.ttf");
FT_Face face = *font->GetFreeTypeFont();
float newline_height = font_size * face->height / face->units_per_EM;

But I'd rather use PDFUsedFont, and never mess with freetype (what if some other non-freetype format comes along, or what if this has some quirks which doesn't work with all fonts?). Is it possible to use only PDF-Writer objects to achieve this?

Also, PDFUsedFont::CalculateTextDimensions receives a long as font size, but Tf operator receives a double. Shouldn't these be consistent?
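
For reference, the arithmetic in the FreeType snippet above can be isolated from any font library. This is only a sketch: lineAdvance is a hypothetical helper, not a PDF-Writer function, and its two integer parameters stand for the face's height and units_per_EM fields.

```cpp
#include <cassert>

// Baseline-to-baseline distance, in the same units as fontSize.
// heightInFontUnits and unitsPerEM correspond to the FreeType face fields
// `height` and `units_per_EM` used in the snippet above.
double lineAdvance(double fontSize, int heightInFontUnits, int unitsPerEM)
{
    assert(unitsPerEM > 0);  // guard against division by zero
    return fontSize * heightInFontUnits / unitsPerEM;
}
```

Taking the font size as a double here matches the Tf operator and avoids the truncation that a long parameter would introduce.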

Type1Input::ParseEncoding() in Type1Input.cpp interprets /NUL as a character name rather than ".notdef"

We came across an old Type 1 font with an /Encoding array populated explicitly (no /StandardEncoding, etc.), and the first entry was /NUL (for code 0). The code in Type1Input::ParseEncoding() simply adds "NUL" as the character name, and that leads to issues if the glyph for code 0 is actually used: Type1Input::CalculateDependenciesForCharIndex() and Type1Input::GetGlyphCharString() will probably not be able to find an entry for it in mCharStrings, leading to an error.

Generate encrypted pdfs

Any hints on how to write encrypted pdf?

MergePDFPagesToPage the other way around?

L.S.,

I see pdfPage is written out first and then test.pdf.
This results in a pdf file that hides my page with the content in the file (a word export).

MergePDFPagesToPage(pdfPage, "c:\\devel\\test.pdf", singePageRange);

Could it also work the other way around?

Adding 'print' javascript to an existing PDF file

Hi Gal,
Very nice project, thanks!

Can you please tell me the best and easiest way to make an existing PDF file 'auto-print' when it's downloaded to a browser?

Thanks a lot!

Amit

Question: Resize PDF with many different sizes, is this possible?

Is it possible to read a PDF file containing pages of different sizes and set them all to A4 size without losing markups etc.?

About your library

Hi, I would like to know whether it can run on the iOS platform. Are there any compatibility issues? Can it composite content? For example, can a picture be merged into a PDF file to form a new PDF file? Please reply, thank you!

Including SVG graphics in PDF

Is there a way to incorporate vector graphics such as an .svg file using the hummus library (similar to the way JPEG or TIFF images are included)?

[Question] How to add LZWDecode support?

A call to copyingContext->AppendPDFPageFromPDF(pageIndex) failed. It calls PDFParser::CreateFilterForStream to parse a PDF page, and the log showed "PDFParser::CreateFilterForStream, supporting only flate decode and ascii 85 decode, failing".
I then found that the PDF uses LZWDecode.
It seems this SDK doesn't support LZWDecode, am I right?
If so, how can I add LZWDecode support myself?
PS: if it's not easy to add LZWDecode filter support, how can I copy pages from one PDF to another without involving the unsupported filter?

Can signing and protection functions for PDF documents be added?

Could you add signing and protection functions for PDF documents? For example, signing/sealing a PDF as the iText library does.

[Question] How to check and add a font into DR dictionary of the interactive form dictionary (AcroForm) in the catalog dictionary?

I need to create an interactive form (AcroForm) if it does not exist in a PDF.
Then, check and add a Font entry in the DR dictionary of the AcroForm dictionary.
How to achieve that in this library?

Arabic text is not mapped to correct glyphs

I am trying to write simple Arabic text consisting of 3 consecutive characters of same Unicode code point (letter Ain U+0639, which can be represented by 4 glyphs depending on its position in the word).

pageContentContext->WriteText(50, 200, u8"ععع", textOptions);

I also tried hard-coding the Unicode text as UTF-8 (U+0639 -> \xD8\xB9):

pageContentContext->WriteText(50, 100, "\xD8\xB9\xD8\xB9\xD8\xB9", textOptions);

But, the output in PDF is shown as: ﻉﻉﻉ
The correct output should be: ععع

Is the Unicode-to-glyph mapping not working correctly, or am I missing something here?

Failing playgroundtest on Linux (Ubuntu) due to inconsistency in filename/directory cases

Linux filesystems are case-sensitive, but in playgroundtest the case of some file and directory names does not match the actual files/directories. It's not really a serious issue; I just had to check the .log and manually fix some of the names. I'm reporting it to save time for anyone else wondering why some of the tests failed.
