torakiki / sambox
A PDFBox fork intended to be used as PDF processor for Sejda and PDFsam
License: Apache License 2.0
Make sure everything is fine with the current specs
'PDDocumentCatalog.getPageMode()' throws an exception if the PageMode defined in the document is unrecognized, for example: None
It could handle this leniently: instead of throwing an exception, return the default value (as if the page mode was not defined in the document).
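A minimal sketch of the lenient lookup described above. The enum PageModeSketch and its method fromStringLenient are hypothetical stand-ins, not SAMBox types; the value names and the UseNone default come from the PDF specification.

```java
// Hypothetical stand-in for the real page mode enum; the string values are the ones defined by the PDF spec.
enum PageModeSketch {
    USE_NONE("UseNone"),
    USE_OUTLINES("UseOutlines"),
    USE_THUMBS("UseThumbs"),
    FULL_SCREEN("FullScreen"),
    USE_OPTIONAL_CONTENT("UseOC"),
    USE_ATTACHMENTS("UseAttachments");

    private final String value;

    PageModeSketch(String value) {
        this.value = value;
    }

    // Returns the matching mode, or the spec default (UseNone) when the
    // value is missing or unrecognized, instead of throwing.
    static PageModeSketch fromStringLenient(String raw) {
        if (raw != null) {
            for (PageModeSketch mode : values()) {
                if (mode.value.equals(raw)) {
                    return mode;
                }
            }
        }
        return USE_NONE;
    }
}
```

With this approach a document declaring the invalid value None is treated as if no page mode was defined.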
The return statement is missing in the PDDocumentCatalog
See if there's room for improvement
... because the provider already has it and it already has the logic to release it. Having it in the ExistingIndirectCOSObject can lead to unexpected behaviour in the case like one of the split we have where:
Current SAMBox behaviour during the body write is to store the pair COSObjectKey, IndirectCOSObjectReference for every ExistingIndirectCOSObject; this allows reusing the same IndirectCOSObjectReference when the same COSObjectKey is found. This approach doesn't cover the case where the document is modified and the COSBase wrapped by the ExistingIndirectCOSObject is reused. An example is the creation of a document from an existing one: we add the page dictionary and we create an outline with a page destination where the page is the existing ExistingIndirectCOSObject. In this scenario SAMBox fails to match the page dictionary with the one wrapped by the ExistingIndirectCOSObject, and writes the page dictionary twice.
We need a mechanism where, while writing the body, SAMBox can retrieve the IndirectCOSObjectReference associated with a COSBase, no matter if it's found as a COSBase or wrapped by an ExistingIndirectCOSObject.
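One possible shape for such a mechanism, sketched with plain Java types. Everything here is an assumption for illustration: ReferenceRegistrySketch, the Wrapper interface and referenceFor are hypothetical names, references are represented as strings, and the lookup relies on object identity, which only holds as long as the wrapper keeps returning the same wrapped instance.

```java
import java.util.IdentityHashMap;
import java.util.Map;

class ReferenceRegistrySketch {

    // Hypothetical stand-in for a wrapper that exposes the object it wraps.
    interface Wrapper {
        Object unwrap();
    }

    // Keyed by identity of the underlying object, not by the wrapper.
    private final Map<Object, String> references = new IdentityHashMap<>();
    private int nextObjectNumber = 1;

    // Returns the same reference whether the object is passed directly or wrapped.
    String referenceFor(Object item) {
        Object key = (item instanceof Wrapper) ? ((Wrapper) item).unwrap() : item;
        return references.computeIfAbsent(key, k -> (nextObjectNumber++) + " 0 R");
    }
}
```

In the outline scenario above, the page dictionary and the wrapper pointing at it would resolve to one reference, so the page would be written once.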
The component is the fallback mechanism in case there's something wrong with the xref offset: it performs a full scan of the document searching for any xref table or stream. If it fails to find any, it should signal that.
The COSStream::addCompression currently decodes the stream and adds the filters to later re-encode. This doesn't work with the RunLengthDecode filter for which encode is not implemented.
They are currently scattered around and difficult to find.
There are currently a lot of leftovers in the project which are not used or useful. Clean up leaving what is needed.
We came across arrays of this form
[1 1 1 1 10 0 obj
SAMBox should be smart enough to recognise that the array is [1 1 1 1] and that 10 0 obj is an object definition.
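A rough sketch of the recovery idea, working on already tokenized input. LenientArraySketch and readArray are hypothetical names and this is not the SAMBox parser: when the obj keyword shows up before the closing bracket, the two preceding tokens are assumed to be the object number and generation and are dropped from the array.

```java
import java.util.ArrayList;
import java.util.List;

class LenientArraySketch {

    // Reads array elements from the tokens that follow the opening '['.
    static List<String> readArray(List<String> tokens) {
        List<String> elements = new ArrayList<>();
        for (String token : tokens) {
            if ("]".equals(token)) {
                break; // well-formed end of array
            }
            if ("obj".equals(token)) {
                // Unterminated array: the last two tokens read were the
                // object number and generation of the next object definition.
                int size = elements.size();
                if (size >= 2) {
                    elements.subList(size - 2, size).clear();
                }
                break;
            }
            elements.add(token);
        }
        return elements;
    }
}
```

A real parser would also need to rewind its position to the start of the object definition so that the object itself is parsed.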
Investigate if it makes sense to make the write method calls async so while one thread reads lazy indirect objects from one side, the other streams down the completed ones.
It's currently a float, which makes it possible to assign any number to it
Investigate COSStream::addCompression to see if a smarter algorithm can be applied. See if we can avoid decoding the existing stream and instead just encode whatever is there with the flate filter.
Objects like
12 0 obj
endobj
result in a NullPointerException. We should handle those, probably returning a COSNull.
Investigate the implications of this. The idea is that loaded objects are cached with reclaimable references to avoid OOM and to facilitate tasks that work on a page basis (e.g. page rendering) as well as the writing of the document (once an object is written it can be discarded, only its reference is needed). This would work particularly well when we modify existing documents (e.g. merge, rotate, etc.) because the current implementation reuses object references based on COSObjectKey, so what would happen when we write down the doc is:
This needs to be investigated, in particular what happens when we modify a value of a lazy object and the object is garbage collected?
Just to be sure that SAMBox doesn't leave anything leaking
While taking care of #13 we should investigate also how to set source length in the SourceReader when we are reading from a non File source
To solve #27 I introduced an id for every COSBase; the id is created using UUID.randomUUID(). This approach slows down SAMBox because UUID uses a cryptographically strong pseudo random number generator to ensure randomness/uniqueness, which can be quite slow. I think we don't need this level of randomness and we can solve the issue without incurring the UUID.randomUUID() overhead.
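One cheap alternative, assuming ids only need to be unique within a running JVM, is a shared atomic counter. IdGeneratorSketch and nextId are hypothetical names; this is a sketch of the idea, not what SAMBox ended up doing.

```java
import java.util.concurrent.atomic.AtomicLong;

class IdGeneratorSketch {

    // Process-wide counter: unique within the JVM, no secure random involved.
    private static final AtomicLong COUNTER = new AtomicLong();

    // Returns a new id; safe to call from multiple threads.
    static String nextId() {
        return Long.toString(COUNTER.incrementAndGet());
    }
}
```

The trade-off is that such ids are predictable and not unique across processes, which should not matter if they are only used to match objects while writing a document.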
Resulting in the existing length instead of the new one when we write an existing document
'TwoColumnRight' is missing from PageLayout enum
When the BaseCOSParser parses a COSStream with a wrong Length, it applies a fallback strategy and tries to find the stream length by reading until it finds the endstream or endobj keywords. The current algorithm fails to find the correct length if there is a CR+LF after the endstream or endobj keywords.
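For illustration, a keyword scan that does not depend on what follows the keyword could look like the sketch below. StreamLengthSketch and findLength are hypothetical names and this does not reproduce the BaseCOSParser algorithm; it assumes the whole stream is available as a byte array and that a single end-of-line marker before the keyword is not part of the data.

```java
import java.nio.charset.StandardCharsets;

class StreamLengthSketch {

    // Returns the length of the stream data starting at 'start',
    // or -1 if neither keyword is found.
    static int findLength(byte[] data, int start) {
        int end = firstIndexOf(data, "endstream", start);
        if (end < 0) {
            end = firstIndexOf(data, "endobj", start);
        }
        if (end < 0) {
            return -1;
        }
        // Drop one end-of-line marker (LF, CR or CR+LF) preceding the keyword.
        if (end > start && data[end - 1] == '\n') {
            end--;
        }
        if (end > start && data[end - 1] == '\r') {
            end--;
        }
        return end - start;
    }

    // Naive forward search for an ASCII keyword.
    private static int firstIndexOf(byte[] data, String keyword, int from) {
        byte[] k = keyword.getBytes(StandardCharsets.US_ASCII);
        for (int i = from; i <= data.length - k.length; i++) {
            int j = 0;
            while (j < k.length && data[i + j] == k[j]) {
                j++;
            }
            if (j == k.length) {
                return i;
            }
        }
        return -1;
    }
}
```

Since the scan stops at the start of the keyword, a CR+LF after endstream or endobj has no effect on the computed length. Binary data that legitimately ends with CR or LF would be shortened, so this remains a heuristic.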
The current implementation uses PushBackInputStream, RandomAccessFile and some others. All of this needs to be validated and investigated, since, up to JDK 1.8, there might be something new and better performing.
As a result of #24 we now return a boolean to inform that the xref full scan has failed and we should perform a full objects scan. If an exception is thrown during the xref full scan, the object scan is not triggered. In case of an exception the XrefFullScanner should log it and return false so that the fallback full objects scan can kick in.
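The intended behaviour can be sketched as a wrapper that turns any failure into a false result. ScanFallbackSketch and scanSafely are hypothetical names, and the scan itself is passed in as a Callable because the real XrefFullScanner API is not shown here.

```java
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

class ScanFallbackSketch {

    private static final Logger LOG = Logger.getLogger(ScanFallbackSketch.class.getName());

    // Runs the xref scan and reports failure as 'false' instead of
    // propagating the exception, so the caller can fall back.
    static boolean scanSafely(Callable<Boolean> xrefScan) {
        try {
            return Boolean.TRUE.equals(xrefScan.call());
        } catch (Exception e) {
            LOG.log(Level.WARNING, "Xref full scan failed, falling back to objects scan", e);
            return false;
        }
    }
}
```

A caller would then trigger the full objects scan whenever scanSafely returns false, regardless of whether the scan found nothing or threw.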
Filters currently handle the case where a DecodeParms item is an invalid type (i.e. it's not a Dictionary or an Array): they log the issue and return an empty dictionary. The same should happen if DecodeParms is an array and one of its values has an invalid type.
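A sketch of the requested behaviour using plain Java collections as stand-ins for the COS types (Map for a dictionary, List for an array). DecodeParmsSketch and paramsAt are hypothetical names.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

class DecodeParmsSketch {

    private static final Logger LOG = Logger.getLogger(DecodeParmsSketch.class.getName());

    // Returns the parameters for the filter at 'index', or an empty map when
    // the entry is missing or of an invalid type, whether DecodeParms is a
    // single dictionary or an array of dictionaries.
    static Map<String, Object> paramsAt(Object decodeParms, int index) {
        Object candidate = decodeParms;
        if (decodeParms instanceof List) {
            List<?> list = (List<?>) decodeParms;
            candidate = (index >= 0 && index < list.size()) ? list.get(index) : null;
        }
        if (candidate instanceof Map) {
            @SuppressWarnings("unchecked")
            Map<String, Object> params = (Map<String, Object>) candidate;
            return params;
        }
        if (candidate != null) {
            LOG.warning("Ignoring DecodeParms entry of invalid type " + candidate.getClass().getName());
        }
        return Collections.emptyMap();
    }
}
```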
Relax the constraint on the required type entry in the page dictionary and consider it valid even if the type is missing
Currently the ObjectsFullScanner searches for object definitions inside the document and is used as a fallback when there's something broken in the document. We might want to enhance it to parse object streams when found, so that even objects defined inside the streams are picked up
PDF spec:
Threads (Optional; PDF 1.1; shall be an indirect reference)
Add a write option that the user can select to tell SAMBox to add a flate filter to all the uncompressed streams
It throws an UnsupportedOperationException when trying to encode, while, given that the length is 0 and there is no data, it should just return 0
Currently the AbstractPdfBodyWriter visits the document graph and replaces COSDictionary and ExistingIndirectCOSObject instances with newly created instances of IndirectCOSObjectReference. This turned out to be a fragile approach; we should be able to write the document without changing the original one.
java.lang.NullPointerException
at org.sejda.sambox.pdmodel.common.PDNameTreeNode.lambda$getValue$30(PDNameTreeNode.java:229)
at java.util.Optional.orElseGet(Optional.java:267)
at org.sejda.sambox.pdmodel.common.PDNameTreeNode.getValue(PDNameTreeNode.java:221)
at org.sejda.sambox.pdmodel.PDDocumentCatalog.lambda$findNamedDestinationPage$25(PDDocumentCatalog.java:591)
at java.util.Optional.map(Optional.java:215)
at org.sejda.sambox.pdmodel.PDDocumentCatalog.findNamedDestinationPage(PDDocumentCatalog.java:591)
See if it makes sense to have the document loaded concurrently by multiple threads. This brings up quite a few issues to take care of (say you have two threads moving the current offset around) but it might be worth it to speed up the loading of big docs.
A null color for an Outline item breaks the parsing. Handle it leniently
java.lang.ClassCastException: org.sejda.sambox.cos.COSNull cannot be cast to org.sejda.sambox.cos.COSNumber
org.sejda.sambox.pdmodel.graphics.color.PDColor.<init>(PDColor.java:66)
org.sejda.sambox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem.getTextColor(PDOutlineItem.java:334)
org.sejda.impl.sambox.component.OutlineUtils.copyOutlineDictionary(OutlineUtils.java:113)
org.sejda.impl.sambox.component.OutlineDistiller.lambda$cloneLeafIfNeeded$3(OutlineDistiller.java:102)
org.sejda.impl.sambox.component.OutlineDistiller$$Lambda$60/96349924.apply(Unknown Source)
java.util.Optional.flatMap(Optional.java:241)
org.sejda.impl.sambox.component.OutlineDistiller.cloneLeafIfNeeded(OutlineDistiller.java:98)
org.sejda.impl.sambox.component.OutlineDistiller.cloneNode(OutlineDistiller.java:90)
org.sejda.impl.sambox.component.OutlineDistiller.appendRelevantOutlineTo(OutlineDistiller.java:63)
org.sejda.impl.sambox.component.PagesExtractor.createOutline(PagesExtractor.java:96)
org.sejda.impl.sambox.component.PagesExtractor.save(PagesExtractor.java:88)
org.sejda.impl.sambox.component.split.AbstractPdfSplitter.split(AbstractPdfSplitter.java:93)
org.sejda.impl.sambox.SplitByPageNumbersTask.execute(SplitByPageNumbersTask.java:61)
org.sejda.impl.sambox.SplitByPageNumbersTask.execute(SplitByPageNumbersTask.java:41)
org.sejda.core.service.DefaultTaskExecutionService.actualExecution(DefaultTaskExecutionService.java:133)
org.sejda.core.service.DefaultTaskExecutionService.execute(DefaultTaskExecutionService.java:64)
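The trace above shows a null entry being cast to a number while building the color. A lenient conversion could look like the sketch below, written with plain Java types as stand-ins for the COS ones. OutlineColorSketch and toRgbOrNull are hypothetical names, and treating a malformed color as "no color" is an assumption about the desired behaviour.

```java
import java.util.List;

class OutlineColorSketch {

    // Converts three components to an RGB triple, or returns null (meaning
    // "no color defined") if the array is missing, has the wrong size, or
    // holds anything that is not a number.
    static float[] toRgbOrNull(List<?> components) {
        if (components == null || components.size() != 3) {
            return null;
        }
        float[] rgb = new float[3];
        for (int i = 0; i < 3; i++) {
            Object component = components.get(i);
            if (!(component instanceof Number)) {
                return null;
            }
            rgb[i] = ((Number) component).floatValue();
        }
        return rgb;
    }
}
```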
We had a document where the startxref was wrong and the xref keyword was missing. SAMBox cannot handle those docs.
Currently the close() is not implemented and the workflow
Create a fallback strategy to fully read a document with a malformed xref
Review it and make sure it doesn't unnecessarily load objects upfront when getKids is called
To avoid excessive GC, create a pool of reusable StringBuilder to be used by the SourceReader
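A minimal pool could be sketched as follows. StringBuilderPoolSketch, borrow and giveBack are hypothetical names, and the cap on idle builders is an assumption added to keep the pool from growing without bound.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class StringBuilderPoolSketch {

    private final Deque<StringBuilder> idle = new ArrayDeque<>();
    private final int maxIdle;

    StringBuilderPoolSketch(int maxIdle) {
        this.maxIdle = maxIdle;
    }

    // Hands out an idle builder if there is one, otherwise a new one.
    synchronized StringBuilder borrow() {
        StringBuilder builder = idle.pollFirst();
        return builder != null ? builder : new StringBuilder();
    }

    // Clears the builder and keeps it for reuse unless the pool is full.
    synchronized void giveBack(StringBuilder builder) {
        builder.setLength(0);
        if (idle.size() < maxIdle) {
            idle.addFirst(builder);
        }
    }
}
```

Note that setLength(0) keeps the internal buffer, so a builder that once held a very large string keeps that capacity while pooled.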
The spec says "If the first element is zero, the type field shall not be present, and shall default to type 1." but the current implementation assumes a type 0, considering the object as a free one
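The rule quoted above can be sketched as follows for a single cross-reference stream entry. XrefStreamEntrySketch and entryType are hypothetical names; typeFieldWidth stands for the first element of the W array and row for the raw bytes of one entry.

```java
class XrefStreamEntrySketch {

    // Returns the entry type: 0 = free, 1 = in use, 2 = compressed.
    // When the type field has zero width it is absent and defaults to 1.
    static int entryType(byte[] row, int typeFieldWidth) {
        if (typeFieldWidth == 0) {
            return 1;
        }
        int type = 0;
        // Fields are stored as big-endian unsigned integers.
        for (int i = 0; i < typeFieldWidth; i++) {
            type = (type << 8) | (row[i] & 0xFF);
        }
        return type;
    }
}
```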
Caused by: java.io.IOException: Unknown filter type:org.sejda.sambox.input.ExistingIndirectCOSObject@2a4b0ab1
at org.sejda.sambox.cos.COSStream.doDecode(COSStream.java:275) ~[org.sejda.sambox-1.0.0-SNAPSHOT.jar:na]
at org.sejda.sambox.cos.COSStream.decodeIfRequired(COSStream.java:201) ~[org.sejda.sambox-1.0.0-SNAPSHOT.jar:na]