
Comments (16)

pmaupin commented on September 18, 2024

For example?


cristianocca commented on September 18, 2024

https://github.com/pmaupin/pdfrw/blob/master/pdfrw/pdfreader.py#L507
lines = fdata.lstrip().splitlines()

That causes the whole PDF (which has been loaded into memory as a single string) to be copied once for lstrip and then once again for splitlines, effectively loading the whole PDF into memory two additional times.
I also don't really know the objective of that code; if the purpose is simply to check whether there are lines, there must be better options than that.
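
For illustration, something like this could find the real start of the data without copying the whole buffer (a rough sketch, not pdfrw's actual code; the header check is just an example):

    import re

    # Scan only the leading whitespace run instead of copying the whole
    # buffer with lstrip(); fdata itself stays untouched.
    start = re.match(r'\s*', fdata).end()
    if not fdata.startswith('%PDF-', start):
        # junk before the header; can be handled via offsets,
        # without slicing the buffer
        pass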

https://github.com/pmaupin/pdfrw/blob/master/pdfrw/pdfreader.py#L520
junk = fdata[endloc:]
fdata = fdata[:endloc]

a few more copies right there

I think a bytearray might help with all the strip, split, slice, etc., avoiding multiple copies of the main string. There is the overhead of the initial string-to-bytearray conversion, but the buffer could instead be built up from an empty bytearray() by combining chunked reads of the file with bytearray.extend.
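
Something like this chunked load is what I mean (a rough sketch; the chunk size is an arbitrary choice):

    CHUNK = 1 << 20  # 1 MiB per read; arbitrary

    buf = bytearray()
    with open(path, 'rb') as f:  # path is the PDF file name
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            buf.extend(chunk)  # amortized growth, no giant intermediate string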

The above is for the reader, but the writer might have similar issues.

Building a ~20 MB PDF can take up to 300 MB of RAM, at least on Windows. This is a dense PDF with multiple pages, and the memory used is only from the writer object, since the readers are loaded lazily one by one and disposed of one after another. If you need more info I can build a small test case or something.


pmaupin commented on September 18, 2024

Your first example shows code that attempts to fix invalid PDF files that have junk before the start of the file. Your second example shows code that attempts to fix invalid PDF files that have junk after the end of the file. If endloc is at the end of the file (as it should be), junk = fdata[endloc:] returns an empty string, and fdata = fdata[:endloc] doesn't actually make a copy of the string -- it returns a reference to the original string. If endloc is not at the end of the file, then a copy will be made, and a warning will be emitted if there is anything besides nulls after the EOF. But even files with only nulls are non-standard; it's just that there are enough of them floating around that it is worth the small amount of code and memory space it takes to process them.
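
You can check the no-copy slice behavior yourself; note that it is a CPython implementation detail, not a language guarantee:

    s = '%PDF-1.4 example data'
    assert s[:len(s)] is s          # full-range slice: same object, no copy
    assert s[:len(s) - 1] is not s  # anything shorter: a real copy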

Honestly, I don't care at all about the performance in either of those cases -- I barely care about even doing minimal processing to make a few broken files work, and would certainly not support any ugly, bug-prone code to help out those cases.

I am concerned about speed on correct PDF files, but I am not at all concerned about hand-wavey assertions that doing this or that might make things go faster. There are a lot of functional tests with a lot of real-world PDFs. Write some code and show the tests go faster.

If memory usage is a problem, the first thing to do is to switch from Python 3 to Python 2.7. That will drop your memory usage considerably, and make pdfrw run faster, as well, because Python 2 strings are only one byte per character.

If you write real patches where bytearray drops memory usage, does not negatively impact performance, and does not look too ugly or error-prone, that would be interesting. But you do realize that if bytearray.extend() doesn't make a copy of the entire bytearray, it's usually because a previous bytearray.extend() greedily preallocated a bunch of memory, right?
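
If you're curious, you can watch that preallocation happen with sys.getsizeof; the exact growth steps are a CPython implementation detail:

    import sys

    buf = bytearray()
    last = sys.getsizeof(buf)
    for _ in range(1000):
        buf.extend(b'x' * 100)
        size = sys.getsizeof(buf)
        if size != last:  # capacity only jumps occasionally
            print(len(buf), size)
            last = size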


takluyver commented on September 18, 2024

Python 2 strings are only one byte per character.

As of Python 3.3, strings using only characters in the Latin-1 set can be stored with 1 byte per character (PEP 393). Looking at the way pdfrw reads files, I think this should apply to all the data it reads. So there might not be much of a memory saving from going back to Python 2.
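
A quick check on 64-bit CPython 3.x (the exact sizes include per-object overhead and vary by version):

    import sys

    narrow = 'a' * 1000     # Latin-1 range: stored at 1 byte per character
    wide = '\u20ac' * 1000  # outside Latin-1: 2 bytes per character here
    print(sys.getsizeof(narrow))  # ~1049
    print(sys.getsizeof(wide))    # ~2074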


cristianocca commented on September 18, 2024

Sorry, I forgot to mention that I'm working with Python 2.7.

I agree that the code shouldn't be changed if it gets too ugly or error-prone. What I wrote above is for the reader class, and perhaps the improvements there would be too small. I'm more concerned about the writer class, whose RAM usage grows like crazy when merging a bunch of 1 MB PDF files (up to 30 files); a few concurrent requests to the web server easily raise the RAM usage to 1 GB.

The readers are always loaded one by one with the disable_gc flag set to False, and the actual files are loaded lazily (from a network stream) and discarded immediately, so technically there's only one reader alive in memory at a time. The writer, however, grows incredibly large. Compare that with PyPDF2, which I dislike since it requires the streams to be seekable and is extremely slow, yet its RAM usage for a merge is something like 10% of what I'm seeing here.

I'm wondering if there's any known piece of code in the writer class that I could take a look at that might be memory hungry, or where memory usage could be reduced by simple changes like switching a string to a memoryview or bytearray. I gave examples from the reader class since that code is quite simple; the writer class, on the other hand, requires more time to understand :)


pmaupin commented on September 18, 2024

@takluyver -- That's interesting. Shows how badly I've been keeping up. 3.6 seems to have some nice stuff, though, so maybe I'll start using 3.x :-)

@cristianocca -- The current writer does have the ability to build up huge strings before dumping them to disk. An incremental version of the writer could certainly be written to avoid this and to make more, smaller disk writes. Were I to write such a thing, I would probably also maintain explicit stacks in data structures to avoid recursion. In a full rewrite, I would probably also contemplate optional support for compressed object streams, but dynamic use of that support would require more, not less, memory, so it's probably not applicable to your use case.
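
The stacks-instead-of-recursion part would look something like this generic sketch; it is not tied to pdfrw's actual object model:

    def walk(root, visit):
        # Iterative depth-first traversal with an explicit stack, so deeply
        # nested PDF dicts/arrays cannot blow Python's recursion limit.
        stack = [root]
        while stack:
            obj = stack.pop()
            visit(obj)
            if isinstance(obj, dict):
                stack.extend(obj.values())
            elif isinstance(obj, list):
                stack.extend(obj)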

There are no current plans (or anybody working on it that I know of) for such a modification, but if somebody presented me with such code, I would be more than happy to add it, at least as an alternative writer, and then perhaps as the default writer once it has been proven.

A word of warning, for anybody attempting this rewrite: understanding of the current code may be somewhat hampered by my initial stupidity in writing it. The current writer has code dating back to when my knowledge of PDFs was even more limited than it is today, and some of its design sprang from a misguided attempt to de-duplicate PDF file objects on the way out to disk. That's a hard problem that I didn't devote nearly enough time to do properly, and I don't think the code actually does it very well, if at all.

Note 1: pdfrw creates and destroys many small objects, so if enabling the garbage collector actually makes a memory difference in your usage, you could probably get most of that benefit by leaving disable_gc set to True and forcing an explicit collection between the reads of your individual files (see the sketch after these notes). That would reduce the CPU time spent on garbage collection.

Note 2: Even though you are handling the input files separately, the writer doesn't do any writing until it sees them all. To achieve significant memory gains in the writer, you will need to incrementally write pages to disk and then discard all the data for those pages from memory (except for the file offsets of the objects written and the index to each page object from the /Pages/Kids array).

Note 3: The original design of the PDF file format happened when systems didn't have much memory, and specifically contemplated working with WORM (write once/read many) media, such as non-rewritable optical drives. You can append to a PDF file and reference original objects in it. So if you really wanted to reduce the memory footprint, it might be possible to fully create the PDF file for the first page, and then to append to the file for each subsequent page.
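
To make Note 1 concrete, here is roughly the merge loop I mean (a sketch only: addpages keeps the page objects alive, so the gain is in garbage-collection overhead rather than total residency; input_paths and output_path are placeholders):

    import gc
    from pdfrw import PdfReader, PdfWriter

    writer = PdfWriter()
    for path in input_paths:
        reader = PdfReader(path)       # disable_gc defaults to True
        writer.addpages(reader.pages)
        del reader                     # drop everything except the kept pages
        gc.collect()                   # one explicit pass per file, instead of
                                       # per-allocation cycle detection
    writer.write(output_path)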

Good luck!
Pat


pmaupin commented on September 18, 2024

Since nobody else has stepped up, I might take a look at the pdfwriter. I want to refactor it anyway for a couple of reasons, including issues #52 and #37


troyhoffman commented on September 18, 2024

I would take a stab at helping with the refactoring, but my PDF knowledge isn't strong enough and I'd probably hinder more than help. However, I'd be more than happy to help test changes and possibly help with code reviews. I also have some PDFs that have caused problems with other libraries.


pmaupin commented on September 18, 2024

I actually have some useful code written, I think. Need to do a bit more work on it, and then I'll update the issue.


troyhoffman commented on September 18, 2024

That's great news! I'll take a look at it once it's ready.

By the way, for those who are having memory errors when merging (like I was having), there might be a relatively easy fix. I was trying to merge 1,491 PDFs that were just under 1 MiB each into a single file. I would get up to the call to write(). It would churn for a bit and then give me a memory error. This was with Python 3.6.1.

That was when I realized I was using the 32-bit build of Python. In Windows, 32-bit processes are limited to 2 GiB of RAM. I switched to 64-bit and it worked. It wound up using about 4.5 GiB of RAM, which explains the memory error with 32-bit Python.

I will re-run the merge test with the new code to give you some before-and-after numbers. I have 16 GiB of RAM. Merging those 1,491 files took about 2 1/2 minutes total. Considering it had to read 1.22 GB worth of data and write 1.16 GB, that's actually not too bad. I can't wait to see any improvements.


cristianocca commented on September 18, 2024

Those are really good times by the way, even with the massive memory footprint.


pmaupin commented on September 18, 2024

Actually, I think the memory usage can be brought down in the writer, but I don't think the CPU performance will be much better otherwise (unless, of course, it keeps you from paging). Might even take a slight hit. Not sure yet. If it does get better, it will be because of better overlap of file write output with computation.


troyhoffman commented on September 18, 2024

@cristianocca A large memory footprint isn't always a problem, as long as you have the RAM. Writing and reading large amounts of memory is actually quite fast. It's only when you start paging that it's a problem. I have 16 GiB. Even though that's probably more than average, so is a 42,348 page PDF. No matter what tool you're using to create that monstrosity, you'll need a lot of RAM.


soferio commented on September 18, 2024

Any news on this front? We are using it on Lambda, where there is a 1.5 GB memory limit, so any memory efficiencies would be much appreciated. Thanks for the library!


pmaupin commented on September 18, 2024

The plan is to get the 0.4 release out this week, then merge all the preliminary work I have done, and then add the final code to do this. The refactored pdfwriter should be the first thing added to 0.5.


AlunYou commented on September 18, 2024

@pmaupin Any plans for improving the memory footprint now?
