tamil_digibooks's Introduction

The tamil_digibooks project converts images of Tamil book pages into text files and/or a searchable PDF using Tesseract.

#Setup

Docker

Install Docker on Windows or Linux.

After installing Docker, pull the tamil_img2pdf image from Docker Hub by running the following command in a terminal (Linux) or Command Prompt (Windows):

docker pull docker.io/sksenthil1/tamil_img2pdf:latest

Input

The input folder should contain one folder per book, each holding the JPEG images of that book's pages.

Input_folder
    |
    | -- Tamilbook_1
            |--page_1.jpg
            |--page_2.jpg
            | .
            | .
            |--page_n.jpg
    | -- Tamilbook_2
    | .
    | . 
    | -- Tamilbook_n

NOTE

  • The input folder should contain at least one book folder.

  • The names of the books and pages should be written in English.

  • The books can also be provided as zipped folders:

    Input_folder
        |
        | -- Tamilbook_1.zip
        | -- Tamilbook_2.zip
        | .
        | . 
        | -- Tamilbook_n.zip
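
The layout rules above can be checked before starting a conversion. The following is a minimal sketch, not part of the project: the helper names `find_books`, `check_input_folder`, and `is_ascii` are hypothetical, and it assumes that pages use a `.jpg` or `.jpeg` extension and that "written in English" can be approximated as "ASCII characters only".

```python
from pathlib import Path

# Assumption: only JPEG pages are accepted, as the README describes.
IMAGE_SUFFIXES = {".jpg", ".jpeg"}


def is_ascii(name):
    """Approximate 'written in English' as 'contains only ASCII characters'."""
    return all(ord(ch) < 128 for ch in name)


def find_books(input_folder, zipped=False):
    """Return sorted names of the book entries directly under input_folder."""
    root = Path(input_folder)
    books = []
    for entry in sorted(root.iterdir()):
        if zipped:
            is_book = entry.is_file() and entry.suffix.lower() == ".zip"
        else:
            # A book folder is a directory holding at least one JPEG page.
            is_book = entry.is_dir() and any(
                p.suffix.lower() in IMAGE_SUFFIXES for p in entry.iterdir()
            )
        if is_book:
            books.append(entry.name)
    return books


def check_input_folder(input_folder, zipped=False):
    """Return a list of problems; an empty list means the layout looks fine."""
    root = Path(input_folder)
    books = find_books(root, zipped)
    problems = []
    if not books:
        problems.append("no book folders found")
    for book in books:
        names = [book]
        if not zipped:
            # Page names are only visible without unpacking for plain folders.
            names += sorted(p.name for p in (root / book).iterdir())
        problems += [f"non-ASCII name: {n}" for n in names if not is_ascii(n)]
    return problems
```

Calling `check_input_folder("<path/to/input/image/folder>")` on the host before running the container gives a quick indication of whether the folder matches the structure shown above.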
    

#Running the script

If the input folder contains multiple tamil_book folders:

docker run -it --rm -v <path/to/input/image/folder>:/input_folder -v <path/to/output/empty/folder>:/output_folder --entrypoint "python" docker.io/sksenthil1/tamil_img2pdf:latest create_pdf_from_multiple_folders.py

If the input folder contains multiple zipped tamil_book folders, add --zipped to the end of the above command:

docker run -it --rm -v <path/to/input/image/folder>:/input_folder -v <path/to/output/empty/folder>:/output_folder --entrypoint "python" docker.io/sksenthil1/tamil_img2pdf:latest create_pdf_from_multiple_folders.py --zipped

Running either of the above two commands generates an output folder with the following structure:

Output_folder
    |
    | -- Tamilbook_1
    |       |- pdfs
    |       |   |--page_1.pdf
    |       |   |--page_2.pdf
    |       |   | .
    |       |   | .
    |       |   |--page_n.pdf
    |       |- txts
    |       |   |--page_1.txt
    |       |   |--page_2.txt
    |       |   | .
    |       |   | .
    |       |   |--page_n.txt
    |       |-Tamilbook.pdf
    | -- Tamilbook_2
    | .
    | . 
    | -- Tamilbook_n
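
After a run, the output tree above can be compared with the input to spot pages that were not converted. This is a rough sketch with a hypothetical helper `missing_outputs`. It assumes that each `page_k.jpg` produces `pdfs/page_k.pdf` and `txts/page_k.txt` with the same file stem, which is what the trees above suggest but is not confirmed by the project's code.

```python
from pathlib import Path


def missing_outputs(input_book, output_book):
    """List expected per-page output files that are absent for one book."""
    input_book, output_book = Path(input_book), Path(output_book)
    missing = []
    for page in sorted(input_book.glob("*.jpg")):
        # Assumption: outputs reuse the input page's stem (page_1.jpg -> page_1.pdf).
        for subdir, suffix in (("pdfs", ".pdf"), ("txts", ".txt")):
            expected = output_book / subdir / (page.stem + suffix)
            if not expected.is_file():
                missing.append(expected.relative_to(output_book).as_posix())
    return missing
```

An empty list means that every input page has both a PDF and a text file in the corresponding output book folder.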

If only one book needs to be converted, run:

docker run -it --rm -v <path/to/input/image/folder>:/input_folder -v <path/to/output/empty/folder>:/output_folder --entrypoint "python" docker.io/sksenthil1/tamil_img2pdf:latest create_pdf_from_folder.py

Running the above command generates:

Output_folder
    |- pdfs
    |   |--page_1.pdf
    |   |--page_2.pdf
    |   | .
    |   | .
    |   |--page_n.pdf
    |- txts
    |   |--page_1.txt
    |   |--page_2.txt
    |   | .
    |   | .
    |   |--page_n.txt
    |-Output_folder.pdf
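
The three docker invocations above differ only in the script name and the optional --zipped flag, so they can be assembled by a small wrapper. The sketch below is illustrative only: `build_command` and `run_conversion` are hypothetical names, and the `-it` flags are dropped on the assumption that the wrapper may run without an interactive terminal. Because the README documents --zipped only for the multi-book script, the sketch conservatively rejects it for the single-book script.

```python
import subprocess

# Image name taken from the pull command in the Setup section.
IMAGE = "docker.io/sksenthil1/tamil_img2pdf:latest"


def build_command(input_dir, output_dir, multiple=True, zipped=False):
    """Build the docker argument list for one conversion run."""
    if zipped and not multiple:
        # Assumption: --zipped is only documented for the multi-book script.
        raise ValueError("--zipped is only documented for multiple books")
    script = (
        "create_pdf_from_multiple_folders.py"
        if multiple
        else "create_pdf_from_folder.py"
    )
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{input_dir}:/input_folder",
        "-v", f"{output_dir}:/output_folder",
        "--entrypoint", "python",
        IMAGE, script,
    ]
    if zipped:
        cmd.append("--zipped")
    return cmd


def run_conversion(input_dir, output_dir, multiple=True, zipped=False):
    """Run the container and raise CalledProcessError on a non-zero exit."""
    subprocess.run(build_command(input_dir, output_dir, multiple, zipped), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the input or output path contains spaces.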

tamil_digibooks's People

Contributors

sksenthilkumar

tamil_digibooks's Issues

Tesseract not found error

Traceback (most recent call last):
  File "/volumes1/anaconda3/envs/tamil_digibooks/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/volumes1/anaconda3/envs/tamil_digibooks/lib/python3.6/site-packages/pytesseract/pytesseract.py", line 191, in save
    yield f.name, realpath(normpath(normcase(image)))
  File "/volumes1/anaconda3/envs/tamil_digibooks/lib/python3.6/site-packages/pytesseract/pytesseract.py", line 282, in run_and_get_output
    run_tesseract(**kwargs)
  File "/volumes1/anaconda3/envs/tamil_digibooks/lib/python3.6/site-packages/pytesseract/pytesseract.py", line 254, in run_tesseract
    raise TesseractNotFoundError()
pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it's not in your PATH. See README file for more information.
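
The paths in this traceback point to a local conda environment, so the error appears to come from running the scripts outside the Docker image, where the tesseract binary may not be installed. A minimal sketch for checking this on the host is shown below. `find_tesseract` is a hypothetical helper that uses only the standard library. If the binary exists somewhere outside PATH, pytesseract also allows setting `pytesseract.pytesseract.tesseract_cmd` to its full path.

```python
import shutil


def find_tesseract(candidates=("tesseract",)):
    """Return the full path of the first executable found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH; install it or use the Docker image")
    else:
        print(f"tesseract found at {path}")
```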