While adding new functionality to the library (in my case: form flattening), I discovered several issues with content-related classes.
As some of the problems are with classes internal to the library, the issue submission template is not well suited for this, but I will do my best to explain the issues in as much detail as possible.
What I'm trying to do:
AcroFields (more precisely, the WidgetAnnotations of the fields) in PDF documents have appearance streams that contain drawing operators used to render the fields on a page.
In PdfSharp, these drawing operators (and their operands) are represented by CObject and its subclasses.
To flatten the AcroFields, I extract these drawing operators, add some additional ones and render the result.
Somewhat simplified, it looks like this:
    protected virtual void RenderContentStream(PdfPage page, PdfDictionary streamDict, PdfRectangle rect)
    {
        var stream = streamDict.Stream;
        var content = ContentReader.ReadContent(stream.UnfilteredValue);
        // start drawing at the position specified in rect
        var matrix = new XMatrix();
        matrix.TranslateAppend(rect.X1, rect.Y1);
        var matElements = matrix.GetElements();
        var matrixOp = OpCodes.OperatorFromName("cm");
        foreach (var el in matElements)
            matrixOp.Operands.Add(new CReal { Value = el });
        content.Insert(0, matrixOp);
        // save and restore the graphics state
        content.Insert(0, OpCodes.OperatorFromName("q"));
        content.Add(OpCodes.OperatorFromName("Q"));
        // create new content
        var appendedContent = page.Contents.AppendContent();
        using (var ms = new System.IO.MemoryStream())
        {
            var cw = new ContentWriter(ms);
            foreach (var obj in content)
                obj.WriteObject(cw);
            appendedContent.CreateStream(ms.ToArray());
        }
    }
The problems:
When reading the content (with ContentReader.ReadContent), the CParser does not set the CStringType for CStrings.
This results in an exception when writing the objects back to a stream with WriteObject.
The method CString.ToString does a switch (CStringType), and the getter of CStringType throws because _cStringType was never set:
    public CStringType CStringType
    {
        get => _cStringType ?? NRT.ThrowOnNull<CStringType>();
        set => _cStringType = value;
    }
    CStringType? _cStringType;
Fix in CParser.ParseObject:
    case CSymbol.String:
    case CSymbol.HexString:
    case CSymbol.UnicodeString:
    case CSymbol.UnicodeHexString:
        s = new CString();
        s.Value = _lexer.Token;
        // CString.ToString() only supports CStringType.String // added
        s.CStringType = CStringType.String; // added
        _operands.Add(s);
        break;
Wondering why the flattened fields were not rendered as intended (sometimes not visible at all, sometimes at odd positions), I discovered that the q and Q operators I had added to the content were not present in the output document.
The reason was found in COperator.WriteObject, which looks like this:
    internal override void WriteObject(ContentWriter writer)
    {
        if (_sequence != null)
        {
            int count = _sequence.Count;
            for (int idx = 0; idx < count; idx++)
            {
                // ReSharper disable once PossibleNullReferenceException because the loop is not entered if _sequence is null
                _sequence[idx].WriteObject(writer);
            }
            writer.WriteLineRaw(ToString());
        }
    }
This writes out the operator, but only if it has operands (that is, only if _sequence is not null).
q and Q don't have operands, so they were left out.
Moving the line writer.WriteLineRaw(ToString()); out of the if block fixed the issue.
CLexer.ScanHexadecimalString does not handle hex strings with an odd number of digits.
The standard Lexer handles this, by the way, so it should be easy to fix.
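For reference, here is a minimal standalone sketch of the rule I would expect here: as far as I understand the PDF specification, a hex string with an odd number of digits is read as if a 0 followed the last digit. The names HexStringSketch, DecodeHexString and HexValue are hypothetical and only for illustration; they are not PdfSharp API, and this is not how CLexer or Lexer is actually implemented.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only; not part of PdfSharp.
public static class HexStringSketch
{
    // Returns the numeric value of a hex digit, or -1 if ch is not one.
    static int HexValue(char ch)
    {
        if (ch >= '0' && ch <= '9') return ch - '0';
        if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
        if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
        return -1;
    }

    // Decodes the digits found between '<' and '>' of a hex string.
    public static byte[] DecodeHexString(string hex)
    {
        var digits = new List<int>();
        foreach (char ch in hex)
        {
            // Simplification: skips any .NET whitespace character.
            if (char.IsWhiteSpace(ch))
                continue;
            int value = HexValue(ch);
            if (value < 0)
                throw new FormatException($"Invalid hex digit '{ch}'.");
            digits.Add(value);
        }

        // Odd number of digits: assume a trailing 0.
        if (digits.Count % 2 != 0)
            digits.Add(0);

        var bytes = new byte[digits.Count / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = (byte)((digits[2 * i] << 4) | digits[2 * i + 1]);
        return bytes;
    }
}
```

With this sketch, Encoding.ASCII.GetString(HexStringSketch.DecodeHexString("746578742")) yields "text " (with a trailing space), the same as for the even-length input "7465787420".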
CParser loses the last token.
Given the content stream q (text) Tj Q, parsing it and re-writing the objects should reconstruct the input, but the last operator (Q in this case) is missing. If the stream ends with whitespace after the last operator, it works.
I created test cases for these issues in the PdfSharp.Tests project, so you should be able to reproduce them.
I needed to add the following lines to the PdfSharp.csproj file in order to access internal classes (like ContentWriter):
    <ItemGroup>
        <InternalsVisibleTo Include="$(AssemblyName).Tests" />
    </ItemGroup>
Test cases (I added them to BasicTests.cs):
    [Theory]
    [InlineData("q (text) Tj Q ")] // this works
    [InlineData("q (text) Tj Q")] // this doesn't
    public void Content_Can_Be_Parsed_And_Reconstructed(string contentString)
    {
        var contentBytes = Encoding.UTF8.GetBytes(contentString);
        var sequence = ContentReader.ReadContent(contentBytes);
        using var ms = new MemoryStream();
        var cw = new ContentWriter(ms);
        foreach (var obj in sequence)
        {
            obj.WriteObject(cw);
        }
        var newContent = new PdfContent(new PdfDictionary());
        newContent.CreateStream(ms.ToArray());
        // ContentWriter adds a newline after each operator
        newContent.Stream.ToString().Should().Be("q\n(text)Tj\nQ\n");
        // Is this intended? ToString() writes only the operator names but not the operands...
        var s = sequence.ToString(); // result: "qTjQ"
    }
    [Fact]
    public void Content_Can_Be_Manually_Constructed()
    {
        var sequence = new CSequence();
        var op = OpCodes.OperatorFromName("q");
        sequence.Add(op);
        op = OpCodes.OperatorFromName("Tj");
        op.Operands.Add(new CString() { CStringType = CStringType.String, Value = "text" });
        sequence.Add(op);
        op = OpCodes.OperatorFromName("Q");
        sequence.Add(op);
        using var ms = new MemoryStream();
        var cw = new ContentWriter(ms);
        foreach (var obj in sequence)
        {
            obj.WriteObject(cw);
        }
        var newContent = new PdfContent(new PdfDictionary());
        newContent.CreateStream(ms.ToArray());
        // ContentWriter adds a newline after each operator
        newContent.Stream.ToString().Should().Be("q\n(text)Tj\nQ\n");
    }
    [Theory]
    [InlineData("<7465787420> Tj")] // this works
    [InlineData("<746578742> Tj")] // this doesn't
    public void Can_Parse_Hex_String_With_Odd_Length(string contentString)
    {
        var contentBytes = Encoding.UTF8.GetBytes(contentString);
        var sequence = ContentReader.ReadContent(contentBytes);
        using var ms = new MemoryStream();
        var cw = new ContentWriter(ms);
        foreach (var obj in sequence)
        {
            obj.WriteObject(cw);
        }
        var newContent = new PdfContent(new PdfDictionary());
        newContent.CreateStream(ms.ToArray());
        // ContentWriter adds a newline after each operator
        newContent.Stream.ToString().Should().Be("(text )Tj\n");
    }
One last thing, not content-related:
I am also working on an API to enable the creation of AcroForms from scratch.
In doing so, I encountered a possible font-related issue.
It seems PdfSharp always creates font subsets when rendering text.
While this is great for saving space for text rendered on a page, it is an issue for AcroFields, where users may change the text.
For example, when I create a PdfTextField, set the value to Bob and create an appearance that renders the text "Bob", PdfSharp creates a font subset that contains only the glyphs for B, o and b.
When opening the PDF, I'm unable to change the value to, say, Peter, because the required glyphs are missing from the font.
Is there an option in PdfSharp that allows embedding a font in full rather than as a subset?
Am I missing something here?
(My workaround is to render all glyphs from the font to an XForm positioned outside the page, but this is obviously not optimal.)
Anyway, thanks for a great library!