Giter Club home page Giter Club logo

documentlab's Introduction

Optical character recognition, image processing and intelligent data extraction

NuGet version (DocumentLab-x64) License) Platform

Comming soon: Realtime script editor and visualization tool


This is a solution for data extraction from images of documents. You send in a bitmap, a set of queries and get extracted data in structured json.

  • DocumentLab takes care of image analysis and processing, optical character recognition, text classification
  • It provides a query language created specifically for document data extraction
    • Defining documents by queries, patterns and tables allows DocumentLab to understand documents intelligently
    • Intelligently meaning that the solution has the capacity to extract data from documents with layouts it has never seen before
    • A C# interface to the query language is also supported
  • Extensive configuration for all parts of the process are available to facilitate your requirements


Getting started

  • Important: DocumentLab has been optimized to expect a certain size range when it comes to analysing text from images. Which means that if the picture you pass in has text that is too big or too small (pixel wise) then you will not get optimal results. You can use the fakeinvoice.png from the example above for a scale reference if you need to resize your input data.

  • Important: A drawback of using the prebuilt binary application is that on each execution DocumentLab needs to load configuration files and initialize a tesseract engine for each thread. This initialization overhead might take 3x the amount of time DocumentLab would otherwise need to scan a single page. It's in the plans to provide a better prebuilt binary application for integration.

You can download the prebuilt binary and get started immediately. Parameter usage: (Picture file path) (Query file path) ((Optional) Output file path)

The (Query file path) parameter needs to be a text file containing a set of queries that DocumentLab can perform on the picture provided alongside as an argument. This still requires investigating the DocumentLab query language. The script used in the example above was the same as was used for the topmost picture with the fake invoice.

... Or if you want to integrate with code,

Quickly

// To use the scripting interface
var documentLab = new DocumentInterpreter();
var interpretedJsonResult = documentLab.InterpretToJson(script, (Bitmap)Image.FromFile(imagePath));
// To use the FluentQuery API
using (var document = new Document((Bitmap)Image.FromFile(imagePath))) 
{
 ...

You can write scripts in the query language or use the C# API. The C# fluent interface is easier to get started quickly. The raw text scripting interface allows more versatility and configurability in a production context.

Definitions

  • Pattern: A description of how information is presented in a document as well as which data to capture
    • e.g: Text(Total amount) Right [Amount]
  • Table: A description of which table column labels to match in a document and which text types are represented in each column
    • e.g: Table 'ItemNo': [Number(Item number)] 'Description': [Text(Description)] 'Price': [Amount(...
  • Query: A named set of patterns prioritized first to last
    • e.g: IncoiceNumber: *pattern 1*; *pattern 2*; ... *pattern n*;
  • Script: A collection of queries to execute in one go. Output properties will have the query name

From here...

Quick examples

Below are a few select examples with comments on how DocumentLab can be used.

C# Fluent Query Example

using (var dl = new Document((Bitmap)Image.FromFile("pathToSomeImage.png")))
{
  // Here we ask DocumentLab to specifically find a date value for the specified possible labels
  string dueDate = dl.Query().FindValueForLabel(TextType.Date, "Due date", "Payment date");

  // Here we ask DocumentLab to specifically find a date value for the specified label in a specific direction 
  string customerNumber = dl.Query().GetValueForLabel(Direction.Right, "Customer number");

  // We can build patterns using predicates, directions and capture operations that return the value matched in the document
  // Patterns allow us to recognize and capture data by contextual information, i.e., how we'd read for example receiver information from an invoice
  string receiverName = dl
    .Query()
    .Match("PostCode") // Text classification using contextual data files can be referenced by string
    .Up()
    .Match("Town")
    .Up()
    .Match("City")
    .Up()
    .Capture(TextType.Text); // All text type operations can also use the statically defined text type enum
} 

Script example

  • Your document has a label "Customer number:" and a value to the right of it
    • Query: CustomerNumber: Text(Customer number) Right [Text];
    • Match text labels with implicit starts with and Levensthein distance 2 comparison
  • Your document has a label "Invoice date" and a date below it
    • Query: InvoiceDate: Text(Invoice date) Down [Date];
    • You can capture a variety of text types. Even if the document contains additional text at the capture you'll only get back a standardized ISO date.
  • Want to capture invoice receiver info in one query?
    • Query: Receiver: 'Name': [Text] Down 'Address': [StreetAddress] Down 'City': [Town] Down 'PostalCode': [PostalCode];
    • Json output will name properties according to the query predicate naming parameters
  • You want to capture all amounts in a document?
    • Query: AllAmounts: Any [Amount];
    • When we use any, results are returned in a json array

Documentation

The case with any OCR process is that the quality of the output depends entirely on the quality of source image you pass into it. If the original image quality is very low then you can expect very low quality OCR results.

Depending on your image source, you may want to upsample low dpi images to a range between 220 - 300. This often also becomes a question of quality vs. time in terms of execution time and should be adjusted to your requirements.

documentlab's People

Contributors

karisigurd4 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

documentlab's Issues

Ocr wont work in a simple image

First of all let me say that i love this concept of OCR.

But I'm having a hard time making it work with a simple image to extract a number. Here's is what im doing and the output:

Untitled3

I'm trying to extract the number that is after the text "FACTURA NO." but the result is always empty.
Here is the script:

InvoiceNum:
Text(FACTURA NO.) Right [Number];

Here is the original image:
e

Can you point out what i'm doing wrong?

Unable to use in .net core 6.0

The compiler returns an error indicating to set the warning level from 0 to 4.
Setting it to 4 the IDE returns it to 6 and the compilation process still doesn't work.
But checking in the project files, it is set to 4.
is there a version for .net core?

Get matching values as array

I have an image that repeats the same sentence multiple times like for example:

Matricula 40339 text text text text text text

Matricula 4033 text text text text text text

I want to get all values as array. But when i do the script

Identification:
Text(Matricula) Right [Number];

I get the first one only but Im expecting 2.
How can I do this?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.