sambox's People

Contributors

dependabot[bot], dthadi3, ediweissmann, jahewson, jmaerki, jukka, lehmi, thausherr, torakiki

sambox's Issues

Exception while getting PageMode when unrecognized value is used

'PDDocumentCatalog.getPageMode()' throws an exception if the PageMode defined in the document is unrecognized, for example: None

It could handle this leniently: instead of throwing an exception, it could return the default value (as if the page mode was not defined in the document).
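
A minimal sketch of the lenient lookup, not SAMBox's actual code. The enum PageModeSketch and the method fromNameOrDefault are hypothetical names used for illustration; the value names and the UseNone default are the ones defined by the PDF specification.

```java
/** Hypothetical stand-in for a page mode type; not the library's real class. */
enum PageModeSketch {
    USE_NONE("UseNone"),
    USE_OUTLINES("UseOutlines"),
    USE_THUMBS("UseThumbs"),
    FULL_SCREEN("FullScreen"),
    USE_OPTIONAL_CONTENT("UseOC"),
    USE_ATTACHMENTS("UseAttachments");

    private final String pdfName;

    PageModeSketch(String pdfName) {
        this.pdfName = pdfName;
    }

    /**
     * Returns the matching mode, or USE_NONE (the default) when the value
     * is missing or unrecognized, instead of throwing.
     */
    static PageModeSketch fromNameOrDefault(String value) {
        if (value != null) {
            for (PageModeSketch mode : values()) {
                if (mode.pdfName.equals(value)) {
                    return mode;
                }
            }
        }
        return USE_NONE;
    }
}
```

With this approach an unrecognized value such as None is treated the same way as a missing PageMode entry.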

ExistingIndirectCOSObject shouldn't keep the ref to the COSBase...

... because the provider already has it and already has the logic to release it. Keeping it in the ExistingIndirectCOSObject can lead to unexpected behaviour in cases such as one of our split scenarios, where:

  • say page1 has an ExistingIndirectCOSObject for 10 0 R (COSStream) and the process loads it
  • say page2 also has an ExistingIndirectCOSObject for 10 0 R (COSStream) and the process loads it to verify if it's time to split
  • it's time to split, so the page is written down: 10 0 R (COSStream) is written and closed. Now the ExistingIndirectCOSObject in page2 has a reference to a lazily loaded COSStream that has already been closed and released in the provider.

SAMBox fails to reuse indirect reference

Current SAMBox behaviour during the body write is to store the pair COSObjectKey, IndirectCOSObjectReference for every ExistingIndirectCOSObject. This allows reusing the same IndirectCOSObjectReference when the same COSObjectKey is found. This approach doesn't cover the case where the document is modified and the COSBase wrapped by the ExistingIndirectCOSObject is reused. An example is the creation of a document from an existing one: we add the page dictionary and we create an outline with a page destination where the page is the existing ExistingIndirectCOSObject. In this scenario SAMBox fails to match the page dictionary with the one wrapped by the ExistingIndirectCOSObject, writing the page dictionary twice.
We need a mechanism where, while writing the body, SAMBox can retrieve the IndirectCOSObjectReference associated with a COSBase, no matter if it's found as a COSBase or wrapped by an ExistingIndirectCOSObject.
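
One possible shape for such a mechanism is a lookup keyed on the identity of the underlying object, sketched below. This is an illustration under assumptions, not the SAMBox implementation: ReferenceRegistrySketch, Wrapper and referenceFor are hypothetical names, plain Object stands in for COSBase, and a String stands in for IndirectCOSObjectReference. Note that unwrapping forces the wrapped object to be loaded.

```java
import java.util.IdentityHashMap;
import java.util.Map;

/** Hypothetical sketch: one reference per underlying object, wrapped or not. */
class ReferenceRegistrySketch {

    /** Hypothetical stand-in for an ExistingIndirectCOSObject-like wrapper. */
    static final class Wrapper {
        private final Object wrapped;

        Wrapper(Object wrapped) {
            this.wrapped = wrapped;
        }

        Object unwrap() {
            return wrapped;
        }
    }

    // Identity-based map: two objects match only if they are the same instance.
    private final Map<Object, String> references = new IdentityHashMap<>();
    private int nextObjectNumber = 1;

    /** Returns the reference for the item, creating it only the first time the underlying object is seen. */
    String referenceFor(Object item) {
        Object target = (item instanceof Wrapper) ? ((Wrapper) item).unwrap() : item;
        return references.computeIfAbsent(target, k -> (nextObjectNumber++) + " 0 R");
    }
}
```

In the outline example, the page dictionary added directly and the same dictionary reached through the wrapper would resolve to the same reference, so it would be written once.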

Async write of the file

Investigate if it makes sense to make the write method calls async, so that while one thread reads lazy indirect objects on one side, another streams down the completed ones.

COSStream::addCompression could be smarter

Investigate COSStream::addCompression to see if a smarter algorithm can be applied. See if we can avoid decoding the existing stream and maybe just encode whatever is there with the flate filter.

Weak or Soft references for the loaded objects

Investigate the implications of this. The idea is that loaded objects are cached with reclaimable references to avoid OOM and to facilitate tasks that work on a per-page basis (e.g. page rendering) as well as the writing of the document (once an object is written it can be discarded, only its reference is needed). This would work particularly well when we modify existing documents (e.g. merge, rotate, etc.) because the current implementation reuses object references based on COSObjectKey, so what would happen when we write down the doc is:

  • an object is lazy loaded through its reference (say 5 0)
  • a new reference is assigned (say 11 0)
  • the object is written as object 11 0
  • the mapping 5 0 -> 11 0 is stored
  • the object is garbage collected
  • another dictionary referencing the lazy 5 0 is found, it looks up the new reference 11 0 and writes the value as 11 0 R, nothing is loaded into memory

This needs to be investigated. In particular, what happens when we modify a value of a lazy object and the object is garbage collected?
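
The caching part of the idea can be sketched with java.lang.ref.SoftReference. This is only an illustration under assumptions: SoftObjectCacheSketch, its loader function and loadCount are hypothetical names, and the cache is not thread safe.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of a cache whose values the garbage collector may reclaim. */
class SoftObjectCacheSketch<K, V> {
    private final Map<K, SoftReference<V>> cache = new HashMap<>();
    private final Function<K, V> loader;
    private int loads = 0;

    SoftObjectCacheSketch(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, reloading it if it was never loaded or has been reclaimed. */
    V get(K key) {
        SoftReference<V> ref = cache.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value == null) {
            value = loader.apply(key);
            loads++;
            cache.put(key, new SoftReference<>(value));
        }
        return value;
    }

    /** Number of times the loader was invoked. */
    int loadCount() {
        return loads;
    }
}
```

The sketch also makes the open question concrete: if a value is modified in memory and then reclaimed, the next get reloads it through the loader and the modification is lost, unless modified objects are kept strongly reachable in some other way.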

Remove UUID.randomUUID() in COSBase

To solve #27 I introduced an id for every COSBase; the id is created using UUID.randomUUID(). This approach slows down SAMBox because UUID uses a cryptographically strong pseudo-random number generator to ensure randomness/uniqueness, which can be quite slow. I think we don't need this level of randomness and we can solve the issue without incurring the UUID.randomUUID() overhead.
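
One cheap alternative, assuming ids only need to be unique within a single JVM run, is a shared atomic counter. IdGeneratorSketch and nextId are hypothetical names; this is a sketch, not the fix that was actually adopted.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: process-wide unique ids without a secure random generator. */
class IdGeneratorSketch {
    private static final AtomicLong COUNTER = new AtomicLong();

    /** Returns a new id, unique within this JVM and safe to call from multiple threads. */
    static long nextId() {
        return COUNTER.incrementAndGet();
    }
}
```

Unlike a UUID, such an id is not unique across processes or restarts, which is fine only if the id is never persisted or shared.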

SAMBox fails to find stream length in some cases

When the BaseCOSParser parses a COSStream with a wrong Length, it applies a fallback strategy and tries to find the stream length by reading until it finds the endstream or endobj keyword. The current algorithm fails to find the correct length if the endstream or endobj keyword is followed by a CR+LF.
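
A sketch of a keyword-based fallback, written so that the result does not depend on what follows the keyword. It is not the BaseCOSParser algorithm: StreamLengthSketch and findLength are hypothetical names, the whole input is assumed to be in a byte array, and one end-of-line marker before the keyword is assumed not to belong to the data.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a fallback that finds the stream length by keyword search. */
class StreamLengthSketch {
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ENDOBJ = "endobj".getBytes(StandardCharsets.US_ASCII);

    /** Returns the length of the data beginning at start, or -1 if no keyword is found. */
    static int findLength(byte[] data, int start) {
        for (int i = start; i < data.length; i++) {
            if (matches(data, i, ENDSTREAM) || matches(data, i, ENDOBJ)) {
                int end = i;
                // Drop one end-of-line marker (CR LF, LF or CR) preceding the keyword.
                if (end > start && data[end - 1] == '\n') {
                    end--;
                }
                if (end > start && data[end - 1] == '\r') {
                    end--;
                }
                return end - start;
            }
        }
        return -1;
    }

    private static boolean matches(byte[] data, int offset, byte[] keyword) {
        if (offset + keyword.length > data.length) {
            return false;
        }
        for (int j = 0; j < keyword.length; j++) {
            if (data[offset + j] != keyword[j]) {
                return false;
            }
        }
        return true;
    }
}
```

A plain keyword search can still be fooled by stream data that contains the keyword bytes or that legitimately ends with a CR, so it only makes sense as a fallback for broken documents.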

XrefFullScanner should not throw an exception

As a result of #24 we now return a boolean to inform that the xref full scan has failed and we should perform a full objects scan. If an exception is thrown during the xref full scan, the object scan is not triggered. In case of an exception the XrefFullScanner should log it and return false so that the fallback full objects scan can kick in.
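
The intended control flow can be sketched as follows. ScanFallbackSketch, scanXref and load are hypothetical names, a Callable stands in for the real scanner, and java.util.logging is used only to keep the example self-contained.

```java
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch: an exception during the xref scan is reported as a failed scan. */
class ScanFallbackSketch {
    private static final Logger LOG = Logger.getLogger(ScanFallbackSketch.class.getName());

    /** Runs the xref scan; any exception is logged and turned into a false result. */
    static boolean scanXref(Callable<Boolean> xrefScan) {
        try {
            return xrefScan.call();
        } catch (Exception e) {
            LOG.log(Level.WARNING, "Xref full scan failed, falling back to a full objects scan", e);
            return false;
        }
    }

    /** Returns which strategy ends up being used: "xref" or the "objects" fallback. */
    static String load(Callable<Boolean> xrefScan) {
        return scanXref(xrefScan) ? "xref" : "objects";
    }
}
```

Because the exception never leaves scanXref, the caller only has to check the boolean to decide whether the fallback is needed.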

Handle unsupported types as values in the decode params array

Filters currently handle the case where DecodeParms is of an invalid type (i.e. it's not a Dictionary or an Array): they log the issue and return an empty dictionary. The same should happen if DecodeParms is an array and one of its values is of an invalid type.
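
A sketch of the requested behaviour using plain Java collections as stand-ins: a Map plays the role of a dictionary and a List the role of an array. DecodeParmsSketch and normalize are hypothetical names, and the logging step is left as a comment.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: invalid DecodeParms values become empty dictionaries. */
class DecodeParmsSketch {

    /** Normalizes a DecodeParms value into a list of dictionaries, one per filter. */
    static List<Map<String, Object>> normalize(Object decodeParms) {
        List<Map<String, Object>> result = new ArrayList<>();
        if (decodeParms instanceof List) {
            // Array case: each item is checked on its own.
            for (Object item : (List<?>) decodeParms) {
                result.add(asDictionary(item));
            }
        } else {
            result.add(asDictionary(decodeParms));
        }
        return result;
    }

    @SuppressWarnings("unchecked")
    private static Map<String, Object> asDictionary(Object value) {
        if (value instanceof Map) {
            return (Map<String, Object>) value;
        }
        // A real implementation would log the unexpected type here.
        return Collections.emptyMap();
    }
}
```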

Allow missing page type

Relax the constraint on the required type entry in the page dictionary and consider it valid even if the type is missing

ObjectsFullScanner to look inside ObjStm

Currently the ObjectsFullScanner searches for object definitions inside the document and is used as a fallback when there's something broken in the document. We might want to enhance it to parse object streams when found, so that even objects defined inside a stream are picked up.

Make the body writer more solid

Currently the AbstractPdfBodyWriter visits the document graph and replaces COSDictionary and ExistingIndirectCOSObject instances with newly created instances of IndirectCOSObjectReference. This turned out to be a fragile approach: we should be able to write the document without changing the original one.

NPE when name tree node doesn't have the required Limits

java.lang.NullPointerException
at org.sejda.sambox.pdmodel.common.PDNameTreeNode.lambda$getValue$30(PDNameTreeNode.java:229)
at java.util.Optional.orElseGet(Optional.java:267)
at org.sejda.sambox.pdmodel.common.PDNameTreeNode.getValue(PDNameTreeNode.java:221)
at org.sejda.sambox.pdmodel.PDDocumentCatalog.lambda$findNamedDestinationPage$25(PDDocumentCatalog.java:591)
at java.util.Optional.map(Optional.java:215)
at org.sejda.sambox.pdmodel.PDDocumentCatalog.findNamedDestinationPage(PDDocumentCatalog.java:591)

Multithreaded load of a doc

See if it makes sense to have the document loaded concurrently by multiple threads. This brings up quite a few issues to take care of (say you have two threads moving the current offset around) but it might be worth it to speed up the loading of big docs.

Annotations can be handled more leniently

A null color for an Outline item breaks the parsing. Handle leniently.

java.lang.ClassCastException: org.sejda.sambox.cos.COSNull cannot be cast to org.sejda.sambox.cos.COSNumber
org.sejda.sambox.pdmodel.graphics.color.PDColor.<init>(PDColor.java:66)
org.sejda.sambox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem.getTextColor(PDOutlineItem.java:334)
org.sejda.impl.sambox.component.OutlineUtils.copyOutlineDictionary(OutlineUtils.java:113)
org.sejda.impl.sambox.component.OutlineDistiller.lambda$cloneLeafIfNeeded$3(OutlineDistiller.java:102)
org.sejda.impl.sambox.component.OutlineDistiller$$Lambda$60/96349924.apply(Unknown Source)
java.util.Optional.flatMap(Optional.java:241)
org.sejda.impl.sambox.component.OutlineDistiller.cloneLeafIfNeeded(OutlineDistiller.java:98)
org.sejda.impl.sambox.component.OutlineDistiller.cloneNode(OutlineDistiller.java:90)
org.sejda.impl.sambox.component.OutlineDistiller.appendRelevantOutlineTo(OutlineDistiller.java:63)
org.sejda.impl.sambox.component.PagesExtractor.createOutline(PagesExtractor.java:96)
org.sejda.impl.sambox.component.PagesExtractor.save(PagesExtractor.java:88)
org.sejda.impl.sambox.component.split.AbstractPdfSplitter.split(AbstractPdfSplitter.java:93)
org.sejda.impl.sambox.SplitByPageNumbersTask.execute(SplitByPageNumbersTask.java:61)
org.sejda.impl.sambox.SplitByPageNumbersTask.execute(SplitByPageNumbersTask.java:41)
org.sejda.core.service.DefaultTaskExecutionService.actualExecution(DefaultTaskExecutionService.java:133)
org.sejda.core.service.DefaultTaskExecutionService.execute(DefaultTaskExecutionService.java:64)

Review PDDocument and expected workflow

Currently close() is not implemented and the workflow

  • open the doc
  • modify the doc
  • save the doc

creates a PDDocument that cannot be saved twice. We could warn the user if they try to save it again but, in any case, this needs some thought.

Review the PDPageTree

Review it and make sure it doesn't unnecessarily load objects upfront when getKids is called.

Wrong default for the type field in Xref stream W array

The spec says "If the first element is zero, the type field shall not be present, and shall default to type 1." but the current implementation assumes type 0, treating the object as a free one.
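
The rule quoted from the spec can be sketched like this. XrefStreamEntrySketch, readField and entryType are hypothetical names, and the row is assumed to hold the raw bytes of a single xref stream entry.

```java
/** Hypothetical sketch of reading the type field of an xref stream entry. */
class XrefStreamEntrySketch {

    /** Reads a big-endian field of the given width, or returns defaultValue when the width is zero. */
    static long readField(byte[] row, int offset, int width, long defaultValue) {
        if (width == 0) {
            return defaultValue;
        }
        long value = 0;
        for (int i = 0; i < width; i++) {
            value = (value << 8) | (row[offset + i] & 0xFF);
        }
        return value;
    }

    /** Returns the entry type; when W[0] is zero the field is absent and the type defaults to 1. */
    static long entryType(byte[] row, int[] w) {
        return readField(row, 0, w[0], 1);
    }
}
```

For a W array of [0 2 1] every entry is therefore read as type 1 (an object in use) rather than type 0 (a free object).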

COSStream::addCompression fails when the filter array is an indirect object

Caused by: java.io.IOException: Unknown filter type:org.sejda.sambox.input.ExistingIndirectCOSObject@2a4b0ab1
at org.sejda.sambox.cos.COSStream.doDecode(COSStream.java:275) ~[org.sejda.sambox-1.0.0-SNAPSHOT.jar:na]
at org.sejda.sambox.cos.COSStream.decodeIfRequired(COSStream.java:201) ~[org.sejda.sambox-1.0.0-SNAPSHOT.jar:na]
