khmer-ocr-benchmark-dataset's Introduction

This open-source project aims to provide a standardized benchmark dataset for Khmer Optical Character Recognition (OCR) engines. It consists of several difficulty levels, and good results at each level give insight into a specific capability of the OCR engine.

Background

Khmer OCR is an area of active AI development in Cambodia. However, it is currently difficult to determine the capabilities of the Khmer OCR engines on the market, for two reasons. First, these engines tend to report performance on internal test datasets, which often biases results in favor of the creator. Second, some engines are tested on computer-generated open-source datasets provided by the Tesseract project, which do not reflect performance on real-world images. This project aims to solve the problem by providing a standardized benchmark dataset for text recognition in the Khmer language.

Dataset Description

Each level tests specific capabilities of the OCR engine. The levels are arranged in order of increasing difficulty, and each level consists of several tasks. Each task consists of images and their corresponding labels in JSON format. The labels were created with LabelMe, an open-source image polygonal annotation tool.
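As a rough illustration of working with the labels, the sketch below reads one LabelMe JSON file and lists its annotated regions. It assumes the files follow the standard LabelMe layout (a top-level `shapes` list whose entries carry `label` and `points`, plus an `imagePath` field) and that the Khmer transcription of each region is stored in `label`. The README does not confirm either point, and the function names are hypothetical.

```python
import json
from pathlib import Path


def load_labelme_annotations(json_path):
    """Return (image_path, [(label, points), ...]) from one LabelMe JSON file.

    Assumption: the transcription of each region is stored in shape["label"].
    """
    data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    regions = [
        (shape["label"], shape["points"])
        for shape in data.get("shapes", [])
    ]
    return data.get("imagePath"), regions


def bounding_box(points):
    """Axis-aligned box (x_min, y_min, x_max, y_max) around polygon points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)
```

The bounding box helper is only a convenience for engines that expect rectangular crops instead of polygons.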

Level 1: Clean Digital Images

This level provides test samples of clean digital images in the form of 15 official government documents. There are four tasks at this level, each providing a variation of the same 15 documents. The table below describes each task.

Task 1 Clean printed text exported from PDFs.
Task 2 Clean printed text sent through compression algorithms (Facebook Messenger).
Task 3 Printed text that is printed out and scanned back into digital format through a physical scanner.
Task 4 Printed text that is printed out and scanned back into digital format through scanner apps.

Text images at this level include:

  1. Khmer printed text in straight lines.
  2. Khmer printed text in bold letters.
  3. Khmer printed text with a clean background.

What can this level tell you about your OCR engine?

An OCR engine that performs well at this level can read Khmer text in official government posts and documents.

Level 2: Scene Text Images

This level provides test samples of scene text images, that is, text appearing in real-world scenes. The images were taken with mobile devices. There are two tasks at this level:

Task 1 Printed text images in scenes with well-lit conditions.
Task 2 Printed text images in scenes with low-light conditions.

Text images at this level include:

  1. Khmer printed text that is curved.
  2. Khmer printed text with occlusions/noise.
  3. Khmer printed text under different lighting conditions.
  4. Khmer printed text with reflection.

What can this level tell you about your OCR engine?

An OCR engine that performs well at this level can read Khmer printed text in the real world, where the environment can't be controlled.

Level 3: Hand-Written Images

This level includes test samples of handwritten text images. There are two tasks at this level:

Task 1 Hand-written text images that are scanned with a physical scanner.
Task 2 Hand-written text images that are scanned through scanner apps.

This is the most demanding level: the OCR engine must recognize and understand human handwriting of the kind that appears in many real-world documents. This requires the OCR engine to handle:

  • Inconsistency in handwriting size, stroke, style, alignment, and padding
  • Human errors in writing
  • Unpredictable text locations

What can this level tell you about your OCR engine?

An OCR engine that performs well at this level can read Khmer hand-written text in the real world, where the environment can't be controlled.

How to Evaluate with the Benchmark Dataset

  1. Download the dataset for one of the levels here.

  2. Run your own OCR engine on the dataset and write the output to a single text file in which each line contains a prediction and its label, separated by a tab.

ការដើរលេង	ការដើរលេង

  3. Install the necessary Python packages from requirements.txt.

pip install -r requirements.txt

  4. Start the evaluation using scripts/evaluate.py.

python scripts/evaluate.py --input PATH_TO_OCR_OUTPUT_FILE
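
The output file described in the steps above can be produced and sanity-checked with a few lines of Python. The sketch below writes tab-separated prediction and label pairs, reads them back, and computes a character error rate (CER) from the Levenshtein edit distance. The README does not state which metric scripts/evaluate.py reports, so the CER here is an assumption chosen for illustration and is not a copy of the project's script. All function names are hypothetical.

```python
def write_pairs(pairs, output_path):
    """Write (prediction, label) pairs, one tab-separated pair per line."""
    with open(output_path, "w", encoding="utf-8") as f:
        for prediction, label in pairs:
            f.write(f"{prediction}\t{label}\n")


def read_pairs(input_path):
    """Read the tab-separated file back into a list of (prediction, label)."""
    pairs = []
    with open(input_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\r\n")
            if not line:
                continue  # skip blank lines
            prediction, label = line.split("\t", 1)
            pairs.append((prediction, label))
    return pairs


def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(pairs):
    """Total edit distance divided by total label length (0.0 if no labels)."""
    total_edits = sum(edit_distance(pred, label) for pred, label in pairs)
    total_chars = sum(len(label) for _, label in pairs)
    return total_edits / total_chars if total_chars else 0.0
```

Python strings are sequences of Unicode code points, so a Khmer consonant and its dependent vowel or subscript signs count as separate characters here. A metric based on grapheme clusters would give different numbers.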

Road Map

  • Level 1: Clean Digital Images
  • Level 2: Scene Text Images
  • Level 3: Hand-Written Images

Acknowledgments

This open-source project was initiated in collaboration with institutions from different industries in Cambodia. We are grateful for their continuing support of the project and of our mission to help improve the state of Khmer OCR engines.

EKYC Solutions, Prudential Life Assurance PLC, Paragon International University

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Email: [email protected]

Contributors

gthell, mighty-potato, phoo5, vitouphy
