edd's Introduction

Image-based table recognition: data, model, evaluation

Task

Converting table images into HTML code

Dataset

PubTabNet contains over 500k table images annotated with the corresponding HTML representation.

Model

Encoder-Dual-Decoder (EDD)

Evaluation

Tree-Edit-Distance-based Similarity (TEDS)

TEDS(T_1, T_2) = 1 - EditDistance(T_1, T_2) / max(|T_1|, |T_2|), where EditDistance(T_1, T_2) is the tree edit distance between T_1 and T_2, and |T| is the number of nodes in T.
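
As a small worked example of the formula above, the sketch below turns an already-computed tree edit distance into a TEDS score. The function name `teds` and the sample numbers are illustrative assumptions; computing the tree edit distance itself is out of scope here and is assumed to be done elsewhere.

```python
def teds(edit_distance: float, size_t1: int, size_t2: int) -> float:
    """TEDS = 1 - EditDistance(T_1, T_2) / max(|T_1|, |T_2|)."""
    largest = max(size_t1, size_t2)
    if largest == 0:
        raise ValueError("at least one tree must contain a node")
    return 1.0 - edit_distance / largest


# Hypothetical numbers: trees with 10 and 8 nodes that are 2 edits apart.
score = teds(edit_distance=2, size_t1=10, size_t2=8)
print(f"{score:.2f}")  # prints 0.80
```

Identical trees have an edit distance of 0 and therefore score 1.0; the score decreases as more edits are needed relative to the size of the larger tree.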

Installation

Please use a Python 3 (>=3.6) environment.

pip install -r requirements

Training and testing on PubTabNet

Prepare data

Download PubTabNet and extract the files into the following file structure:

{DATA_DIR}
|
-- train
   |
   -- PMCXXXXXXX.png
   -- ...
-- val
   |
   -- PMCXXXXXXX.png
   -- ...
-- test
   |
   -- PMCXXXXXXX.png
   -- ...
-- PubTabNet_2.0.0.jsonl
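
Before running the preparation script, it may be worth confirming that the extracted files match the layout above. The following is a minimal sketch, not part of this repository; the helper name `missing_entries` is hypothetical, and it checks only that the three split folders and the annotation file exist.

```python
from pathlib import Path

# Names taken from the file structure shown above.
EXPECTED_SPLITS = ("train", "val", "test")
ANNOTATION_FILE = "PubTabNet_2.0.0.jsonl"


def missing_entries(data_dir):
    """Return the expected paths that are absent under data_dir."""
    root = Path(data_dir)
    # Each split must be a directory directly under the data root.
    missing = [root / split for split in EXPECTED_SPLITS
               if not (root / split).is_dir()]
    # The annotation file sits next to the split folders.
    if not (root / ANNOTATION_FILE).is_file():
        missing.append(root / ANNOTATION_FILE)
    return missing
```

An empty list means the layout looks complete; otherwise the returned paths show what still needs to be extracted.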

Prepare data for training

python prepare_data.py \
       --annotation {DATA_DIR}/PubTabNet_2.0.0.jsonl  \
       --image_dir {DATA_DIR} \
       --out_dir {TRAIN_DATA_DIR}

The following files will be generated in {TRAIN_DATA_DIR}:

- TRAIN_IMAGES_{POSTFIX}.h5          # Training images
- TRAIN_TAGS_{POSTFIX}.json          # Training structural tokens
- TRAIN_TAGLENS_{POSTFIX}.json       # Length of training structural tokens
- TRAIN_CELLS_{POSTFIX}.json         # Training cell tokens
- TRAIN_CELLLENS_{POSTFIX}.json      # Length of training cell tokens
- TRAIN_CELLBBOXES_{POSTFIX}.json    # Training cell bboxes
- VAL.json                           # Validation ground truth
- WORDMAP_{POSTFIX}.json             # Vocab

where {POSTFIX} is PubTabNet_False_keep_AR_300_max_tag_len_100_max_cell_len_512_max_image_size
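
The naming scheme above can be expanded programmatically, for example to check that data preparation finished. This is a sketch under the assumption that the generated names follow the list above exactly; `expected_outputs` and `missing_outputs` are hypothetical helper names.

```python
from pathlib import Path

POSTFIX = ("PubTabNet_False_keep_AR_300_max_tag_len_100"
           "_max_cell_len_512_max_image_size")


def expected_outputs(postfix=POSTFIX):
    """List the file names documented for {TRAIN_DATA_DIR}."""
    names = [f"TRAIN_IMAGES_{postfix}.h5"]
    names += [f"TRAIN_{kind}_{postfix}.json"
              for kind in ("TAGS", "TAGLENS", "CELLS",
                           "CELLLENS", "CELLBBOXES")]
    names += ["VAL.json", f"WORDMAP_{postfix}.json"]
    return names


def missing_outputs(train_data_dir, postfix=POSTFIX):
    """Return the documented file names not found in train_data_dir."""
    root = Path(train_data_dir)
    return [n for n in expected_outputs(postfix) if not (root / n).is_file()]
```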

Train tag decoder

Use a larger learning rate (0.001) for the first 10 epochs

python train_dual_decoder.py \
       --out_dir {CHECKPOINT_DIR} \
       --data_folder {TRAIN_DATA_DIR} \
       --data_name PubTabNet_False_keep_AR_300_max_tag_len_100_max_cell_len_512_max_image_size \
       --epochs 10 \
       --batch_size 10 \
       --fine_tune_encoder \
       --encoder_lr 0.001 \
       --fine_tune_tag_decoder \
       --tag_decoder_lr 0.001 \
       --tag_loss_weight 1.0 \
       --cell_decoder_lr 0.001 \
       --cell_loss_weight 0.0 \
       --tag_embed_dim 16 \
       --cell_embed_dim 80 \
       --encoded_image_size 28 \
       --decoder_cell LSTM \
       --tag_attention_dim 256 \
       --cell_attention_dim 256 \
       --tag_decoder_dim 256 \
       --cell_decoder_dim 512 \
       --cell_decoder_type 1 \
       --cnn_stride '{"tag":1, "cell":1}' \
       --resume

Use a smaller learning rate (0.0001) for another 3 epochs

python train_dual_decoder.py \
       --out_dir {CHECKPOINT_DIR} \
       --data_folder {TRAIN_DATA_DIR} \
       --data_name PubTabNet_False_keep_AR_300_max_tag_len_100_max_cell_len_512_max_image_size \
       --epochs 13 \
       --batch_size 10 \
       --fine_tune_encoder \
       --encoder_lr 0.0001 \
       --fine_tune_tag_decoder \
       --tag_decoder_lr 0.0001 \
       --tag_loss_weight 1.0 \
       --cell_decoder_lr 0.001 \
       --cell_loss_weight 0.0 \
       --tag_embed_dim 16 \
       --cell_embed_dim 80 \
       --encoded_image_size 28 \
       --decoder_cell LSTM \
       --tag_attention_dim 256 \
       --cell_attention_dim 256 \
       --tag_decoder_dim 256 \
       --cell_decoder_dim 512 \
       --cell_decoder_type 1 \
       --cnn_stride '{"tag":1, "cell":1}' \
       --resume

Train dual decoders

NOTE:

  • Occasionally a random batch is too large and exceeds the GPU memory. When this happens, re-execute the training command; it will resume from the latest checkpoint.
  • Training dual decoders requires 2 V100 GPUs.
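
The advice in the first note can be automated with a small wrapper that re-runs a command until it exits successfully. This is a sketch only: `run_until_success` is a hypothetical helper, it retries on any non-zero exit code (not only out-of-memory failures), and it assumes the command being wrapped resumes from the latest checkpoint on its own.

```python
import subprocess


def run_until_success(cmd, max_attempts=5, runner=subprocess.run):
    """Re-run cmd until it exits with code 0; return the attempt count.

    cmd is a list of arguments, as accepted by subprocess.run.
    runner is injectable so the retry logic can be tested without a GPU.
    """
    for attempt in range(1, max_attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt} exited with code {result.returncode}, retrying")
    raise RuntimeError(f"command still failing after {max_attempts} attempts")
```

Pass the full training command as a list of arguments. Capping the number of attempts avoids looping forever when the failure is not memory-related.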

Use a larger learning rate (0.001) for the first 10 epochs

python train_dual_decoder.py \
       --checkpoint {CHECKPOINT_DIR}/PubTabNet_False_keep_AR_300_max_tag_len_100_max_cell_len_512_max_image_size/checkpoint_12.pth.tar \
       --out_dir {CHECKPOINT_DIR}/cell_decoder \
       --data_folder {TRAIN_DATA_DIR} \
       --data_name PubTabNet_False_keep_AR_300_max_tag_len_100_max_cell_len_512_max_image_size \
       --epochs 23 \
       --batch_size 8 \
       --fine_tune_encoder \
       --encoder_lr 0.001 \
       --fine_tune_tag_decoder \
       --tag_decoder_lr 0.001 \
       --tag_loss_weight 0.5 \
       --cell_decoder_lr 0.001 \
       --cell_loss_weight 0.5 \
       --tag_embed_dim 16 \
       --cell_embed_dim 80 \
       --encoded_image_size 28 \
       --decoder_cell LSTM \
       --tag_attention_dim 256 \
       --cell_attention_dim 256 \
       --tag_decoder_dim 256 \
       --cell_decoder_dim 512 \
       --cell_decoder_type 1 \
       --cnn_stride '{"tag":1, "cell":1}' \
       --resume \
       --predict_content

Use a smaller learning rate (0.0001) for another 2 epochs

python train_dual_decoder.py \
       --out_dir {CHECKPOINT_DIR}/cell_decoder \
       --data_folder {TRAIN_DATA_DIR} \
       --data_name PubTabNet_False_keep_AR_300_max_tag_len_100_max_cell_len_512_max_image_size \
       --epochs 25 \
       --batch_size 8 \
       --fine_tune_encoder \
       --encoder_lr 0.0001 \
       --fine_tune_tag_decoder \
       --tag_decoder_lr 0.0001 \
       --tag_loss_weight 0.5 \
       --cell_decoder_lr 0.0001 \
       --cell_loss_weight 0.5 \
       --tag_embed_dim 16 \
       --cell_embed_dim 80 \
       --encoded_image_size 28 \
       --decoder_cell LSTM \
       --tag_attention_dim 256 \
       --cell_attention_dim 256 \
       --tag_decoder_dim 256 \
       --cell_decoder_dim 512 \
       --cell_decoder_type 1 \
       --cnn_stride '{"tag":1, "cell":1}' \
       --resume \
       --predict_content
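
Read together, the four training commands suggest that `--epochs` is a cumulative epoch count (10, 13, 23, 25) rather than a per-run count. The sketch below summarizes the resulting schedule. It assumes epochs are zero-indexed, which is inferred from the checkpoint names (`checkpoint_12` after 13 epochs, `checkpoint_24` after 25) and not confirmed by the text; the learning rate listed is the value passed to `--encoder_lr` and `--tag_decoder_lr`.

```python
# (first_epoch, end_epoch_exclusive, learning_rate, stage),
# read off the --epochs and --*_lr values in the commands above.
SCHEDULE = [
    (0, 10, 1e-3, "tag decoder"),
    (10, 13, 1e-4, "tag decoder"),
    (13, 23, 1e-3, "dual decoder"),
    (23, 25, 1e-4, "dual decoder"),
]


def phase_for_epoch(epoch):
    """Return (stage, learning_rate) for a zero-indexed epoch."""
    for start, stop, lr, stage in SCHEDULE:
        if start <= epoch < stop:
            return stage, lr
    raise ValueError(f"epoch {epoch} is outside the 25-epoch schedule")


for epoch in (0, 12, 13, 24):
    print(epoch, *phase_for_epoch(epoch))
```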

Inference

Get validation performance

python eval.py \
       --image_folder {DATA_DIR}/val \
       --result_json {RESULT_DIR}/RESULT_FILE.json \
       --gt {TRAIN_DATA_DIR}/VAL.json \
       --model {CHECKPOINT_DIR}/cell_decoder/PubTabNet_False_keep_AR_300_max_tag_len_100_max_cell_len_512_max_image_size/checkpoint_24.pth.tar \
       --word_map {TRAIN_DATA_DIR}/WORDMAP_PubTabNet_False_keep_AR_300_max_tag_len_100_max_cell_len_512_max_image_size.json \
       --image_size 448 \
       --dual_decoder \
       --beam_size '{"tag":3, "cell":3}' \
       --max_steps '{"tag":1800, "cell":600}'

This will save the TEDS score of every validation sample in {RESULT_DIR}/RESULT_FILE.json in the following format:

{
  'PMCXXXXXXX.png': float,
}
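
To turn the per-sample scores into a single validation number, the result file can be averaged. This is a minimal sketch that assumes the file is valid JSON mapping image names to floats, as in the format above; `summarize_teds` is a hypothetical helper name.

```python
import json
from statistics import mean


def summarize_teds(result_json_path):
    """Load per-sample TEDS scores and return simple summary statistics."""
    with open(result_json_path, encoding="utf-8") as f:
        scores = json.load(f)  # {"PMCXXXXXXX.png": float, ...}
    values = list(scores.values())
    # statistics.mean raises StatisticsError if the file holds no scores.
    return {
        "samples": len(values),
        "mean_teds": mean(values),
        "min_teds": min(values),
    }
```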

Get testing performance

python eval.py \
       --image_folder {DATA_DIR}/test \
       --result_json {RESULT_DIR}/RESULT_FILE.json \
       --model {CHECKPOINT_DIR}/cell_decoder/PubTabNet_False_keep_AR_300_max_tag_len_100_max_cell_len_512_max_image_size/checkpoint_24.pth.tar \
       --word_map {TRAIN_DATA_DIR}/WORDMAP_PubTabNet_False_keep_AR_300_max_tag_len_100_max_cell_len_512_max_image_size.json \
       --image_size 448 \
       --dual_decoder \
       --beam_size '{"tag":3, "cell":3}' \
       --max_steps '{"tag":1800, "cell":600}'

This will save the inference result (HTML code) of every testing sample in {RESULT_DIR}/RESULT_FILE.json in the following format:

{
  'PMCXXXXXXX.png': str,
}

The JSON file can be compared against the ground truth using the code here. The ground truth of the test set has been kept secret.
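
Since the test ground truth is not available, a quick way to sanity-check predictions is to write them out as individual HTML files and open a few in a browser. The sketch below assumes the result file is valid JSON mapping image names to HTML strings, as in the format above; `export_html` and its `limit` parameter are hypothetical.

```python
import json
from pathlib import Path


def export_html(result_json_path, out_dir, limit=None):
    """Write each predicted HTML string to <out_dir>/<image stem>.html."""
    with open(result_json_path, encoding="utf-8") as f:
        predictions = json.load(f)  # {"PMCXXXXXXX.png": "<html>...", ...}
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # limit=None exports everything; an integer exports only the first few.
    for name, html in list(predictions.items())[:limit]:
        target = out / (Path(name).stem + ".html")
        target.write_text(html, encoding="utf-8")
        written.append(target)
    return written
```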

Cite us

@article{zhong2019image,
  title={Image-based table recognition: data, model, and evaluation},
  author={Zhong, Xu and ShafieiBavani, Elaheh and Yepes, Antonio Jimeno},
  journal={arXiv preprint arXiv:1911.10683},
  year={2019}
}
