Comments (6)

Belval commented on September 3, 2024

From your use case, I think you could get better performance and easier usage if you passed an output_folder to the function call.

import tempfile
from pdf2image import convert_from_path

with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path('/home/Belval/example.pdf', thread_count=8, output_folder=path)

This way you won't need to process the PDF in small batches, since the memory footprint will be small.

I'd recommend not going over 12 threads (even 6 would probably work better) to avoid processes fighting for CPU time.
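
As a rough sketch of that advice, a small helper can cap the value passed as thread_count. The name pick_thread_count and the default limit of 6 are illustrative assumptions, not part of pdf2image.

```python
import os


def pick_thread_count(page_count, limit=6):
    """Return a thread count capped by the CPU count, the page count, and limit."""
    cpus = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, min(cpus, page_count, limit))
```

The result could then be passed as thread_count=pick_thread_count(number_of_pages) in the convert_from_path call, where number_of_pages is whatever page count you already know for the document.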

If that does not solve your problem, please let me know!

from pdf2image.

Belval commented on September 3, 2024

I tested it, and indeed you need to use an output_folder for the multithreading to be useful at all. This is because the process that reads from the buffer is single-threaded.

I'll explore solutions to this limitation, but until then you really want to use a temporary directory.

Belval commented on September 3, 2024

Closing for lack of response.

Feel free to reopen should my answers be unsatisfactory.

sirius0503 commented on September 3, 2024

I tested it, and indeed you need to use an output_folder for the multithreading to be useful at all. This is because the process that reads from the buffer is single-threaded.

I'll explore solutions to this limitation, but until then you really want to use a temporary directory.

@Belval: I am taking the output from convert_from_path and feeding it directly to Tesseract. Will setting thread_count to the number of CPU cores help me here? I am using multiprocessing for Tesseract too.

Belval commented on September 3, 2024

A few things here:

  • pdf2image will be multithreaded only if you use an output_folder; otherwise the output is parsed in memory sequentially and you will get no gains.
  • pdf2image returns a list and not a generator, so while the conversion is multithreaded, the call to convert_from_path is still blocking and will wait until all pages are converted.

That being said, you probably want something like this:

import tempfile
from pdf2image import convert_from_path

with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path(
        '/home/belval/example.pdf',
        output_folder=path,
        thread_count=8
    )

    # Do your processing with Tesseract here, inside the with block:
    # the temporary directory and its files are removed when the block exits
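
To make the Tesseract step concrete, here is a minimal sketch under several assumptions: the installed pdf2image supports the paths_only parameter, pytesseract is installed, and the helper names ocr_paths and pdf_to_text are made up for illustration. It uses a thread pool rather than multiprocessing, on the assumption that pytesseract launches the tesseract binary as a separate process for each call, so threads are enough to keep several OCR jobs running.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor


def ocr_paths(image_paths, ocr_fn, workers=4):
    """Apply ocr_fn to every image path and return results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_fn, image_paths))


def pdf_to_text(pdf_path, thread_count=4, workers=4):
    """Convert a PDF to page images on disk, then OCR each page."""
    # Imported lazily so ocr_paths can be used without these packages.
    from pdf2image import convert_from_path
    import pytesseract

    with tempfile.TemporaryDirectory() as path:
        # Assumption: paths_only=True makes convert_from_path return
        # file paths instead of PIL images.
        image_paths = convert_from_path(
            pdf_path,
            output_folder=path,
            thread_count=thread_count,
            paths_only=True,
        )
        # OCR has to finish before the with block exits and deletes the files.
        return ocr_paths(image_paths, pytesseract.image_to_string, workers)
```

Because convert_from_path still blocks until every page is written, the OCR stage only starts once conversion is complete; the sketch parallelises each stage but does not overlap them.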

Belval commented on September 3, 2024

Closing for lack of activity.
