
cutie's Introduction

CUTIE

TensorFlow implementation of the paper "CUTIE: Learning to Understand Documents with Convolutional Universal Text Information Extractor" by Xiaohui Zhao (Paper Link).


CUTIE is a 2D key information extraction / named entity recognition (NER) / slot filling algorithm for receipt documents. Before training or inference with CUTIE, run an OCR algorithm of your choice over the scanned document images to detect and recognize the text, then feed the resulting structured text into the CUTIE network. Refer to the CUTIE paper for details of the procedure.
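At a high level, CUTIE places every OCR-recognised token onto a 2D grid according to its position on the page and then applies convolutions over that grid. The snippet below is only a minimal sketch of this positional-mapping idea, assuming OCR output as a list of tokens with bounding boxes; the function name, token format, and grid size are illustrative, not the repository's actual API (see main_data_tokenizer.py and the paper for the real procedure).

import numpy as np

def map_tokens_to_grid(tokens, page_w, page_h, rows=64, cols=64):
    # Sketch of grid positional mapping. tokens is a list of dicts like
    # {"text": "TOTAL", "bbox": [x1, y1, x2, y2]} in page pixel coordinates.
    grid = np.full((rows, cols), "", dtype=object)
    for tok in tokens:
        x1, y1, x2, y2 = tok["bbox"]
        cx = (x1 + x2) / 2.0 / page_w   # normalised box centre, x
        cy = (y1 + y2) / 2.0 / page_h   # normalised box centre, y
        r = min(int(cy * rows), rows - 1)
        c = min(int(cx * cols), cols - 1)
        grid[r, c] = tok["text"]        # a real implementation must resolve collisions
    return grid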

Results

Results evaluated on 4,484 receipt documents, including taxi receipts, meals & entertainment receipts, and hotel receipts, with 9 different key information classes. Metrics are reported as AP / softAP.

Method      #Params   Taxi          Hotel
CloudScan   -         82.0 / -      60.0 / -
BERT        110M      88.1 / -      71.7 / -
CUTIE       14M       94.0 / 97.3   74.6 / 87.0

(Example result visualizations for taxi and hotel receipts.)

Installation & Usage

pip install -r requirements.txt
  1. Generate your own dictionary with main_build_dict.py / main_data_tokenizer.py
  2. Train your model with main_train_json.py

CUTIE achieves its best performance when the grid rows/cols are well configured. For more insight, refer to the statistics in others/TrainingStatistic.xlsx.


Others

For information about the input example format, refer to the issue discussion.

  • Apply any OCR tool that helps you detect and recognize words in the scanned document image.
  • Label the OCR results with key information classes, following the .json files in the invoice_data folder (thanks to @4kssoft); a hedged sketch is given below.
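The authoritative schema is whatever the sample .json files in invoice_data contain (see @4kssoft's example linked in the issues below). The snippet here is only a hedged illustration of the kinds of keys mentioned on this page (text_boxes with ids and bounding boxes, fields with value_id / value_text lists); every concrete key name and the 4-value bbox layout are assumptions to be checked against the real sample files.

import json

# Hypothetical shape of a labelled input file; not the repository's exact schema.
example_label = {
    "text_boxes": [
        {"id": 1, "bbox": [112, 240, 365, 278], "text": "PROSPER"},
        {"id": 2, "bbox": [370, 240, 520, 278], "text": "NIAGA"},
    ],
    "fields": [
        {"field_name": "company", "value_id": [1, 2], "value_text": ["PROSPER", "NIAGA"]},
    ],
}

with open("invoice_data/example.json", "w", encoding="utf-8") as f:
    json.dump(example_label, f, indent=2)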

cutie's Issues

Predict and Output

What is the input format for prediction? I used main_evaluate_json.py to get output using the validation code. The output converts numbers to "0"s, and substrings are split with "##" added as a prefix. Is it possible to get the actual text output from the text_boxes key in the input json?
I changed update_dict=True for evaluation, commented out the tokenization code where numbers are converted to 0s, and replaced the final string " ##" with "", which more or less gives me the output. But why are we masking the data?
As suggested, I updated the dict using main_dict_json.py with the training data, and also updated vocab.txt with new words.
If I set update_dict=False for evaluation, I only get [UNK] for all words for both GT/INF.
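The " ##" continuation marker is WordPiece-style sub-word tokenization, so one rough way to turn predicted tokens back into readable strings is to join them and strip the marker. This is only a hedged sketch, not the repository's own post-processing; masked digits (numbers replaced by 0) cannot be recovered this way, so for the original strings map predictions back to the text_boxes entries of the input json instead.

def detokenize(tokens):
    # Merge '##'-prefixed sub-words back into whole words.
    return " ".join(tokens).replace(" ##", "")

print(detokenize(["PROS", "##PER", "NIAGA"]))  # -> "PROSPER NIAGA"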

How to predict each line item in an invoice?

If a receipt has multiple line items, is there a way to predict each line item and then correlate the different fields (such as item name and amount) for each line item? Can anyone provide sample training data for doing so?

grid_label negative values?

I prepared data for 21 classes.
When I check the unique values in grid_label with np.unique(grid_label), I get:

array([-124, -116, -114, -103,  -93,  -82,  -61,    0,    1,    3,    4,
          5,    6,    8,   10,   11,   12,   13,   14,   15,   16,   17,
         19,   20,   22,   24,   25,   27,   33,   35,   36,   37,   38,
         43,   45,   46,   48,   54,   56,   57,   58,   64,   66,   69,
         75,   77,   78,   79,   90,   98,  100,  111,  119,  121],
      dtype=int8)

I get negative values and I don't know why.
Training then crashes in the loss function.
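One plausible cause, offered only as an assumption and not confirmed by the repository, is int8 overflow: the array above has dtype=int8, which can only hold values from -128 to 127, so any encoded label value above 127 wraps around to a negative number (and would also upset the loss). A minimal demonstration:

import numpy as np

# int8 holds -128..127; larger values wrap around when cast, e.g. 132 -> -124
# and 140 -> -116, matching the negative entries reported above.
labels = np.array([100, 120, 132, 140], dtype=np.int64)
print(labels.astype(np.int8))  # -> [ 100  120 -124 -116]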

Regarding Saved model

Hi,

Where do I get the saved checkpoint files? I tried the default path ('--ckpt_path', type=str, default='../graph/CUTIE/graph/') and I have also given my own path, but I couldn't find the checkpoint files.

Thanks in advance!

Unable to run the main_evaluate_json file

I am unable to evaluate the model with the main_evaluate_json file.
I keep getting the following error whenever I run the file.
PS: I have given the correct arguments for the checkpoint path and file, but I am unsure how to modify the save_prefix argument.
(screenshot of the error)

Input format of data

Hello,
Could you please specify the input data format or provide an example so that I can train the model on my custom dataset?

Tensorflow 2

Hi,
Was anyone successful in migrating the code to TensorFlow 2? If so, can you share the code?
Thanks!

Creating input json file for SROIE dataset

Hi,

I'm trying to create input json files from images to run some tests on the SROIE dataset. I have downloaded the revised datasets from their website. From what I gather, there are two text files (.txt format) along with the images (.jpg). The task1train files have the bounding boxes with texts, and the task2train files have the annotation texts with classes.

Looking at the sample provided here by @4kssoft :
https://github.com/4kssoft/CUTIE/blob/master/invoice_data/Faktura1.pdf_0.json

It needs multiple value_id entries to be populated correctly if the annotated text spans multiple words.

For example: Image X51008099073.jpg (from SROIE train dataset) -

The bounding box file has :

112,240,365,240,365,278,112,278,PROSPER NIAGA
114,281,574,281,574,315,114,315,COMPANY NO : SA0099552-P
112,319,534,319,534,357,112,357,LOT PT 1138 ,PT 33122,
114,357,518,357,518,395,114,395,BANDAR MAHKOTA CHERAS
115,395,539,395,539,433,115,433,43200 CHERAS, SELANGOR
114,431,325,431,325,469,114,469,SITE : 2365
...

Annotations file has:

{
    "company": "PROSPER NIAGA",
    "date": "26/06/18",
    "address": "LOT PT 1138 ,PT 33122, BANDAR MAHKOTA CHERAS 43200 CHERAS, SELANGOR",
    "total": "100.00"
}

The "address" annotation value matches lines 3, 4 and 5 from bounding box file. The same is true for "total" annotation (not shown here). If you could share some insights on how do I automatically parse and link these into multiple value_id and value_text fields. Not sure If I'm missing something. I see the same cases for many other texts inside other images in this dataset as well.

Thanks !

Training and predicting using model

Hello, I've seen that a lot of people have gotten this to work. I need some help with the general workflow for this project. Could someone please post a short description of the steps needed? I have prepared my data in the required format and need to know how to proceed in order to train and then make predictions with this model.

So far, for training, I'm doing this:

python main_build_dict.py --doc_path 'invoice_data/'
python main_data_tokenizer.py --doc_path 'invoice_data/'

python main_train_json.py --doc_path 'invoice_data/' --save_prefix 'INVOICE' \
    --embedding_file '' --ckpt_path 'graph/' --tokenize True --update_dict True --dict_path 'dict/' \
    --rows_segment 72 --cols_segment 72 --augment_strategy 1 --positional_mapping_strategy 1 \
    --rows_target 64 --cols_target 64 --rows_ulimit 80 --fill_bbox False \
    --data_augmentation_extra True --data_augmentation_dropout 1 \
    --data_augmentation_extra_rows 16 --data_augmentation_extra_cols 16 \
    --batch_size 32 --iterations 40000 --lr_decay_step 13000 --learning_rate 0.0001 --lr_decay_factor 0.1 \
    --hard_negative_ratio 3 --use_ghm 0 --ghm_bins 30 --ghm_momentum 0 \
    --log_path 'log/' --log_disp_step 100 --log_save_step 100 --validation_step 100 --test_step 400 \
    --ckpt_save_step 50 --embedding_size 128 --weight_decay 0.0005 --eps 1e-6

invoice_data is the folder that contains the images and their respective json files.

I copied these from another post here in the issues section. The model does start to train but I'm not sure if I'm doing it correctly. Any help would be appreciated.

Unable to obtain Training Statistic

After running main_train_json.py to train the model, how do we obtain the graph shown in the TrainingStatistic.xlsx file in the others folder?

Input data

Hello,
Could you provide your input data for the model so that I can reproduce the results, or at least the input data format so that I can try the model on my custom dataset?

Apply model on SROIE Dataset

To generate the input data, you have to know which bounding box belongs to each field. For his own dataset the author says: "Each text and their bounding box is manually labelled as one of the 9 different classes". But how can we do this for SROIE? We don't have the bounding box ground truth for each field.

Dictionary and vocab generation

Hello,

Can you please explain how the dictionary and vocab are generated, and the steps to replicate this for a different dataset?
Is the vocab.txt file pre-existing or is it generated through the program?
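As a rough orientation only (this is my assumption about what a dictionary-building step generally does, not a description of main_build_dict.py itself): such a step typically scans the text of all training json files, counts token frequencies, and assigns each token an integer id for the embedding layer. A hedged sketch:

import glob
import json
from collections import Counter

# Hypothetical sketch: count tokens over the text_boxes entries of the training
# json files and assign integer ids; main_build_dict.py remains authoritative.
counter = Counter()
for path in glob.glob("invoice_data/*.json"):
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    for box in data.get("text_boxes", []):
        counter.update(box["text"].lower().split())

dictionary = {"[UNK]": 0}
for word, _ in counter.most_common():
    dictionary[word] = len(dictionary)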
