OCR-search

Searches for text in scanned files using OCR.
If a document contains a table, the program determines the table structure and recognizes its contents.
The original document (image or PDF) may be rotated by any angle.
Files in other formats are also searched if they are present in the target directory.
Supported formats: pdf, png, jpeg, jpg, tif, doc, docx, odt, xls, xlsx, ods, txt
Works on Windows and Linux.
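
As a rough illustration of how a target directory could be split into files that need OCR and files that can be read directly, here is a minimal sketch. The grouping of extensions, the recursive walk, and the names classify and collect_supported are assumptions made for this example; the program itself may detect file types differently (the prerequisites list the filetype package, which suggests content-based detection).

```python
from pathlib import Path

# Extensions taken from the "Supported formats" list above.
# Splitting them into "ocr" and "text" groups is an assumption for this sketch.
OCR_EXTENSIONS = {".pdf", ".png", ".jpeg", ".jpg", ".tif"}
TEXT_EXTENSIONS = {".doc", ".docx", ".odt", ".xls", ".xlsx", ".ods", ".txt"}


def classify(path):
    """Return "ocr", "text", or None based only on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix in OCR_EXTENSIONS:
        return "ocr"
    if suffix in TEXT_EXTENSIONS:
        return "text"
    return None


def collect_supported(directory):
    """Walk the directory recursively and group supported files by kind."""
    found = {"ocr": [], "text": []}
    for path in sorted(Path(directory).rglob("*")):
        kind = classify(path) if path.is_file() else None
        if kind:
            found[kind].append(path)
    return found
```

Calling collect_supported(".") would list the supported files under the current directory; files with any other extension are ignored.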

Use

The program runs without command-line parameters and reads its settings from a config file.
You must set at least one parameter in search.conf: 'search', which specifies the text to search for.
If you do not specify a search directory, the current directory is used.
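
The rules above (a required 'search' parameter, with the current directory as the fallback search location) can be sketched as follows. The actual syntax of search.conf is not described here, so the "key = value" layout, the key name 'directory', and the function names are hypothetical; only the 'search' parameter name comes from this README.

```python
import os


def parse_settings(text):
    """Parse "key = value" lines; this file layout is an assumption."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and lines without a key/value pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip().strip("'\"")
    return settings


def resolve_settings(text):
    """Apply the documented rules: 'search' is required, directory defaults to cwd."""
    settings = parse_settings(text)
    if not settings.get("search"):
        raise ValueError("the 'search' parameter is required")
    # 'directory' is a hypothetical key name for the search directory.
    if not settings.get("directory"):
        settings["directory"] = os.getcwd()
    return settings
```

With this sketch, a file containing only "search = invoice" would resolve to a search for "invoice" in the current working directory.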

Prerequisites

For Windows 10 (Linux alternatives are noted where they differ):
  1. python 3.9 (or later)
  2. Tesseract OCR (installer: tesseract-ocr-w64-setup-v5.2.0.20220712.exe)
    2.1. Add the directory selected during installation to the PATH environment variable, e.g. "D:\Tesseract"
    2.2. Add the trained-data file for your preferred language (e.g. ukr.traineddata) to "D:\Tesseract\tessdata"
    (for Linux: apt-get install tesseract-ocr)
  3. pip install pytesseract
  4. Poppler and pdf2image
    4.1. Install Poppler (release 22.12.0-0 or later). For Windows: https://github.com/oschwartz10612/poppler-windows/releases
    Extract Poppler to "D:\Poppler\poppler-22.12.0" (or any other folder) and add its bin directory to PATH:
    "D:\Poppler\poppler-22.12.0\Library\bin"
    4.2. pip install pdf2image
  5. pip install filetype
  6. for 'XLS/XLSX' file format:
    pip install pandas
    pip install xlrd
  7. for 'DOCX' file format:
    pip install docx2python
  8. pip install opencv-python
  9. for 'DOC' file format:
    Download and unpack antiword to c:\antiword.
    Add c:\antiword to PATH and set the environment variable:
    ANTIWORDHOME = c:\antiword
    (for Linux: apt-get install antiword)
  10. pip install progress
  11. for 'ods/odt' file format:
    pip install odfpy
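
Before the first run, it can help to confirm that the prerequisites above are actually visible to Python. The following is a minimal sketch using only the standard library; the function name missing_prerequisites is hypothetical, and the executable names assume the default binaries (tesseract, Poppler's pdftoppm, antiword) are on PATH under those names.

```python
import importlib.util
import shutil

# Command-line tools expected on PATH (pdftoppm ships with Poppler).
EXECUTABLES = ["tesseract", "pdftoppm", "antiword"]

# Import names of the pip packages listed above
# (opencv is imported as cv2, odfpy as odf).
MODULES = [
    "pytesseract", "pdf2image", "filetype", "pandas", "xlrd",
    "docx2python", "cv2", "progress", "odf",
]


def missing_prerequisites(executables=EXECUTABLES, modules=MODULES):
    """Return (missing executables, missing modules) without importing anything."""
    missing_exe = [name for name in executables if shutil.which(name) is None]
    missing_mod = [
        name for name in modules if importlib.util.find_spec(name) is None
    ]
    return missing_exe, missing_mod


if __name__ == "__main__":
    exe, mod = missing_prerequisites()
    print("Missing executables:", ", ".join(exe) or "none")
    print("Missing Python modules:", ", ".join(mod) or "none")
```

An empty result for both lists only shows that the tools and packages can be found; it does not verify versions or the installed Tesseract language data.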

Run

On Windows, to build a standalone exe file and run it:

pyinstaller.exe --onefile search.py
search.exe

or run the script directly:

python search.py

Versioning

v3.3

Author

  • Svitlana Viblaia

License

This project is licensed under the MIT License.
