doc-analysis / tablebank


TableBank: A Benchmark Dataset for Table Detection and Recognition

License: Apache License 2.0


tablebank's Issues

How can I generate my own dataset with your method? Can you release the code?

Thanks for your great work! Could the method used to generate TableBank be released?

Can you release the names of the files in the test set so that the results reported in your paper can be reproduced?

As the title says. This would help reproduce your experiments.

KeyError: 'Non-existent config key: _BASE_'

I am running the code on Google Colab with the following command:

!python detectron/tools/infer_simple.py --cfg /content/All_X101.yaml --output-dir /tmp/detectron-tablebank --image-ext jpg \
    --wts /content/model_final.pth /content/drive/MyDrive/TableBank/Image

The error is:

Found Detectron ops lib: /usr/local/lib/python3.7/dist-packages/torch/lib/libcaffe2_detectron_ops_gpu.so
[E init_intrinsics_check.cc:44] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[E init_intrinsics_check.cc:44] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
[E init_intrinsics_check.cc:44] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
Traceback (most recent call last):
  File "detectron/tools/infer_simple.py", line 185, in <module>
    main(args)
  File "detectron/tools/infer_simple.py", line 125, in main
    merge_cfg_from_file(args.cfg)
  File "/content/detectron/detectron/core/config.py", line 1152, in merge_cfg_from_file
    _merge_a_into_b(yaml_cfg, __C)
  File "/content/detectron/detectron/core/config.py", line 1202, in _merge_a_into_b
    raise KeyError('Non-existent config key: {}'.format(full_key))
KeyError: 'Non-existent config key: _BASE_'
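
A possible explanation, offered as a guess from the traceback rather than a confirmed diagnosis: `_BASE_` is the key that Detectron2 configs use to inherit from a base file, while the original Detectron loader in detectron/core/config.py rejects any key it does not know. The .pth weights file also points to a Detectron2 checkpoint. The sketch below only checks whether a config looks like a Detectron2-style config; the function name and the sample snippets are made up for illustration.

```python
import re


def uses_base_key(config_text):
    """Return True if the YAML text has a top-level `_BASE_` key."""
    return re.search(r"^_BASE_\s*:", config_text, flags=re.MULTILINE) is not None


# Hypothetical config snippets, not the real All_X101.yaml.
detectron2_style = '_BASE_: "Base-RCNN-FPN.yaml"\nMODEL:\n  WEIGHTS: "model_final.pth"\n'
detectron1_style = "MODEL:\n  TYPE: generalized_rcnn\n"

print(uses_base_key(detectron2_style))  # True
print(uses_base_key(detectron1_style))  # False
```

If the check returns True, the config is probably meant for the Detectron2 tooling rather than detectron/tools/infer_simple.py.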

RuntimeError: CUDA out of memory. Tried to allocate 1.32 GiB (GPU 0; 7.92 GiB total capacity; 5.68 GiB already allocated; 1.22 GiB free; 192.30 MiB cached)

My command:
python train.py -model_type img -data data/demo -save_model demo-model -gpu_ranks 0 -batch_size 4 -learning_rate 0.1 -word_vec_size 80 -encoder_type brnn -image_channel_size 3

The GPU runs out of memory at around step 10,000 of 100,000.

I set batch_size to 4 because the run could not start with the value of 20.

Question on the quality of table annotations

Hi,

The Problem

I'd like to thank you for releasing this nice dataset. However, I found that the quality of the annotations is not very high. There are two main issues:

1. Missing labels: no annotation exists for a table that is present in the image
2. Inaccurate annotations: some bounding boxes do not cover the whole table region

Issue 1 was already raised in #9, where the author answered:

some error may cause a little table unlabeled

However, I plotted the first 100 image ids and their annotations in /Detection_data/Word and found that 21 of the 100 images have missing annotations (between 1 and 3 tables missing per image). Unless I happened to hit an unusually bad batch in the first 100 plots, this problem affects far more than 'a little table'.

To be specific, I post the imgIds for those 21 images:

3, 9, 10, 27, 32, 33, 39, 47, 51, 56, 57, 58, 59, 60, 61, 62, 73, 76, 77, 87, 95

As for issue 2, I found 3 images (out of 100 tested images) with incorrect annotations:

18, 62, 83

I understand from the paper that these annotations are generated by parsing the PDF/Word documents, and that the document-parsing code could not catch all the tables. I am posting this only to give researchers information they might care about.

Possible Fix

Issue 1 is actually not hard to fix. I have trained a table detection model (on other datasets) with decent performance. I'd like to run this model once over all the data provided here, which should surface a large number of missing annotations, and then fix those manually. I'd be happy to share and discuss more.
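
The proposed pass could be sketched roughly as follows: compare each model detection against the existing annotations of the same image and flag detections that overlap none of them. This is only an illustration under stated assumptions: boxes are in COCO [x, y, width, height] form, the 0.5 IoU threshold is arbitrary, and the function names and the second detection box are made up.

```python
def iou(box_a, box_b):
    """Intersection over union of two [x, y, width, height] boxes."""
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def unmatched_detections(detections, annotations, iou_threshold=0.5):
    """Detections that overlap no existing annotation: candidates for missing labels."""
    return [
        det for det in detections
        if all(iou(det, ann) < iou_threshold for ann in annotations)
    ]


# One annotated table and two model detections on the same image (illustrative values).
annotations = [[71, 176, 445, 104]]
detections = [[70, 175, 447, 106], [60, 400, 300, 120]]
print(unmatched_detections(detections, annotations))  # [[60, 400, 300, 120]]
```

Flagged boxes would still need a manual check, since a detection without a matching annotation can also be a false positive from the model.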

FYI

I loaded the data with pycocotools and got the annotations for each image using:

img_ann = coco.loadAnns(coco.getAnnIds(imgIds = image_id))

and plotted the annotations on a matplotlib figure using

coco.showAnns(img_ann)

The missing/incorrect annotations were then spotted by eye.

I'd be happy to discuss this further and to provide the test notebook (.ipynb) if wanted.

Best,
Julian

Bounding box coordinates

How do I get the coordinates of the boxes?
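
A minimal sketch, assuming the detection annotations follow the COCO convention in which `bbox` is [x, y, width, height] with the origin at the top-left corner. The annotation below is one of the Word.json entries quoted in another issue on this page; the helper name is made up.

```python
def bbox_to_corners(bbox):
    """Convert a COCO-style [x, y, width, height] box to (x_min, y_min, x_max, y_max)."""
    x, y, width, height = bbox
    return (x, y, x + width, y + height)


annotation = {
    "category_id": 1,
    "image_id": 53565,
    "bbox": [71, 176, 445, 104],
    "segmentation": [[71, 176, 71, 280, 516, 280, 516, 176]],
}

print(bbox_to_corners(annotation["bbox"]))  # (71, 176, 516, 280)
```

The result agrees with the corner points listed in the `segmentation` field of the same annotation.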

Is there any documentation on fine-tuning the model on a custom dataset?

I think I saw it somewhere, I can't find it now.

How to extract the dataset?

I downloaded the dataset parts but cannot extract the files correctly.

I tried the different commands listed here: https://unix.stackexchange.com/questions/40480/how-to-unzip-a-multipart-spanned-zip-on-linux

The only method that worked was this one:

cat test.zip.* >test.zip
zip -FF test.zip --out test-full.zip
unzip test-full.zip

However, after extraction one of the annotation JSON files is broken; it was not extracted correctly.

Can someone please share how they extracted the dataset?

How to use TableBank to recognize table structure

First, thank you very much for your code. However, I could not find a way to use TableBank to recognize table structure. Could you provide one? Thank you.

When will the dataset be released?

Compatibility Detectron2

Hi, is there any upcoming update to the models to work with Detectron2 ?

Thanks in advance,

Error when running the recognition model

I get an error when running this command with the recognition model saved locally as model.pt:

python translate.py -model model.pt --src_dir recognition.jpg -output pred.txt

usage: translate.py [-h] [-config CONFIG] [-save_config SAVE_CONFIG] --model
                    MODEL [MODEL ...] [--fp32] [--avg_raw_probs]
                    [--data_type DATA_TYPE] --src SRC [--src_dir SRC_DIR]
                    [--tgt TGT] [--shard_size SHARD_SIZE] [--output OUTPUT]
                    [--report_bleu] [--report_rouge] [--report_time]
                    [--dynamic_dict] [--share_vocab]
                    [--random_sampling_topk RANDOM_SAMPLING_TOPK]
                    [--random_sampling_temp RANDOM_SAMPLING_TEMP]
                    [--seed SEED] [--beam_size BEAM_SIZE]
                    [--min_length MIN_LENGTH] [--max_length MAX_LENGTH]
                    [--max_sent_length] [--stepwise_penalty]
                    [--length_penalty {none,wu,avg}] [--ratio RATIO]
                    [--coverage_penalty {none,wu,summary}] [--alpha ALPHA]
                    [--beta BETA] [--block_ngram_repeat BLOCK_NGRAM_REPEAT]
                    [--ignore_when_blocking IGNORE_WHEN_BLOCKING [IGNORE_WHEN_BLOCKING ...]]
                    [--replace_unk] [--phrase_table PHRASE_TABLE] [--verbose]
                    [--log_file LOG_FILE]
                    [--log_file_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET,50,40,30,20,10,0}]
                    [--attn_debug] [--dump_beam DUMP_BEAM] [--n_best N_BEST]
                    [--batch_size BATCH_SIZE] [--gpu GPU]
                    [--sample_rate SAMPLE_RATE] [--window_size WINDOW_SIZE]
                    [--window_stride WINDOW_STRIDE] [--window WINDOW]
                    [--image_channel_size {3,1}]

translate.py: error: the following arguments are required: --src/-src

pytorch 0.4.1
py36_cuda9.2.148_cudnn7.1.4_1

Can someone tell me what the error could be?
I don't know what to pass as the argument to -src.
I see here that -src means "Source sequence to decode (one line per sequence)", which is the part I don't understand.

this doesn't help either, mentioned in this issue

How to annotate cell points

How are cell points annotated in the table recognition datasets?

Run on CPU

I want to run this model in Detectron on a CPU (not a GPU).
I'm using the table detection model and the Detectron code.
Can anybody help me with how to do that?

How to download?

I tried to download the TableBank dataset from the official website https://doc-analysis.github.io/tablebank-page/index.html, but it seems that I do not have the permission to download it. How can I obtain this dataset?

How to fine-tune the model on custom classes (e.g. bordered and borderless)

I would like to know the steps for fine-tuning the model on custom classes.

Table Detection data mismatch in Word subset

I have downloaded and checked the TableBank dataset from your dataset homepage.

I found a mismatch in the annotations. The README gives the number of tables in the Table Detection task as follows:

Task	Word	Latex	Word+Latex
Table detection	163,417	253,817	417,234

But when I ran my script to check the annotations, it found only 101,889 tables in the Word subset.
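
The script itself is not shown, so the following is only a sketch of how such a count could be done, assuming the annotation file is COCO-style JSON with top-level `images` and `annotations` lists (which is what loading it with pycocotools implies). The function name and the tiny sample dictionary are made up; a real run would load the JSON file with `json.load` first.

```python
from collections import Counter


def summarize(coco):
    """Count tables (annotations) and list images that have no annotation at all."""
    per_image = Counter(ann["image_id"] for ann in coco["annotations"])
    images_without = [img["id"] for img in coco["images"] if per_image[img["id"]] == 0]
    return {
        "tables": len(coco["annotations"]),
        "images": len(coco["images"]),
        "images_without_tables": images_without,
    }


# Tiny stand-in for a loaded annotation file (illustrative values only).
sample = {
    "images": [{"id": 1}, {"id": 2}, {"id": 3}],
    "annotations": [
        {"image_id": 1, "bbox": [71, 176, 445, 104]},
        {"image_id": 1, "bbox": [66, 300, 400, 90]},
        {"image_id": 3, "bbox": [66, 72, 729, 197]},
    ],
}

print(summarize(sample))
```

Comparing `tables` against the README figure, and checking whether the number of images also differs, would help tell a partial download apart from genuinely missing annotations.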

Topic breakdown of TableBank data

Hi,

What is the breakdown of topics or document types (news articles, scientific papers, etc.) in the TableBank dataset?

Prediction Using Table Recognition

I used the following command to predict the structure of the table:

python translate.py -model model.pt --src_dir './tables/' --src './src_txt.txt' -output pred.txt

and I get the following error:
AssertionError: Cannot use _dir with TextDataReader.

From your previous replies to issues https://github.com/doc-analysis/TableBank/issues/12 and https://github.com/doc-analysis/TableBank/issues/10, it looks like I can test the model by using -tgt (providing a ground truth file).

Is there no way to simply run prediction on a sample?

Why is -src src-test.txt needed for image-to-text in OpenNMT?

I am a bit confused. Could anyone please explain this to me? :)

I tried it this way:

python drive/My\ Drive/OpenNMT-py/translate.py -data_type img -model drive/My\ Drive/Pretrained_Word_Embeddings/detectron_table_detection/model.pt -src_dir drive/My\ Drive/datasets/table_dataset_sample/8.jpg \
  -output pred.txt -max_length 150 -beam_size 5 -gpu 0 -verbose

I am getting the same issue. I don't know what -src_dir and -src are.

usage: translate.py [-h] [-config CONFIG] [-save_config SAVE_CONFIG] --model
                    MODEL [MODEL ...] [--fp32] [--avg_raw_probs]
                    [--data_type DATA_TYPE] --src SRC [--src_dir SRC_DIR]
                    [--tgt TGT] [--shard_size SHARD_SIZE] [--output OUTPUT]
                    [--report_bleu] [--report_rouge] [--report_time]
                    [--dynamic_dict] [--share_vocab]
                    [--random_sampling_topk RANDOM_SAMPLING_TOPK]
                    [--random_sampling_temp RANDOM_SAMPLING_TEMP]
                    [--seed SEED] [--beam_size BEAM_SIZE]
                    [--min_length MIN_LENGTH] [--max_length MAX_LENGTH]
                    [--max_sent_length] [--stepwise_penalty]
                    [--length_penalty {none,wu,avg}] [--ratio RATIO]
                    [--coverage_penalty {none,wu,summary}] [--alpha ALPHA]
                    [--beta BETA] [--block_ngram_repeat BLOCK_NGRAM_REPEAT]
                    [--ignore_when_blocking IGNORE_WHEN_BLOCKING [IGNORE_WHEN_BLOCKING ...]]
                    [--replace_unk] [--phrase_table PHRASE_TABLE] [--verbose]
                    [--log_file LOG_FILE]
                    [--log_file_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET,50,40,30,20,10,0}]
                    [--attn_debug] [--dump_beam DUMP_BEAM] [--n_best N_BEST]
                    [--batch_size BATCH_SIZE] [--gpu GPU]
                    [--sample_rate SAMPLE_RATE] [--window_size WINDOW_SIZE]
                    [--window_stride WINDOW_STRIDE] [--window WINDOW]
                    [--image_channel_size {3,1}]
translate.py: error: the following arguments are required: --src/-src

The documentation (Image to Text) gives this command:

python translate.py -data_type img -model demo-model_acc_x_ppl_x_e13.pt -src_dir data/im2text/images \
					-src data/im2text/src-test.txt -output pred.txt -max_length 150 -beam_size 5 -gpu 0 -verbose

-src_dir: The directory containing the images.

Then why do I need -src data/im2text/src-test.txt?

We want image to text, so why is the source a .txt file? What is it? Can anyone clarify this for me?

Thank you all
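
My understanding of OpenNMT-py's image-to-text mode, stated as an assumption rather than something confirmed in this thread, is that -src is a plain-text file listing one image file name per line, each relative to -src_dir; no text is translated from it. Under that assumption, a source list for a folder of table images could be generated like this (the function name and default extensions are made up):

```python
from pathlib import Path


def write_src_list(image_dir, src_path, extensions=(".jpg", ".jpeg", ".png")):
    """Write one image file name per line (relative to image_dir) and return the names."""
    names = sorted(
        p.name
        for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )
    Path(src_path).write_text("".join(name + "\n" for name in names), encoding="utf-8")
    return names


# Example call (paths are placeholders):
# write_src_list("data/im2text/images", "data/im2text/src-test.txt")
```

The generated file would then be passed as -src, the folder as -src_dir, together with -data_type img as in the documented command.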

Please explain how to run the table structure recognition model

Password for decompression

Decompressing TableBank.zip.001 requires a password. Where can I get it?

Some tables not labeled

I found a problem in the data: some tables are not labeled. Here are two examples from Word.json:

{'category_id': 1, 'area': 46280, 'iscrowd': 0, 'segmentation': [[71, 176, 71, 280, 516, 280, 516, 176]], 'id': 69303, 'image_id': 53565, 'bbox': [71, 176, 445, 104]}

{'category_id': 1, 'area': 143613, 'iscrowd': 0, 'segmentation': [[66, 72, 66, 269, 795, 269, 795, 72]], 'id': 67935, 'image_id': 52492, 'bbox': [66, 72, 729, 197]}

No email reply

I have submitted the form but haven't received the reply email. Could you please send me the download link? My Gmail address is [email protected]. Thanks a lot.

Document about how to use the trained model

Why are the results bad for table structure recognition?

I tried to extract text from a table image using the TableBank pretrained model (table structure recognition).

Here are the command and the results:

python drive/My\ Drive/OpenNMT-py/translate.py -data_type img -model drive/My\ Drive/Pretrained_Word_Embeddings/detectron_table_detection/model.pt -src_dir drive/My\ Drive/datasets/table_dataset_sample/ \
 -src src-test.txt -output pred.txt -max_length 150 -beam_size 5 -gpu 0 -verbose

[2019-08-21 05:40:53,223 INFO] Translating shard 0.
/usr/local/lib/python3.6/dist-packages/torchtext/data/field.py:359: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  var = torch.tensor(arr, dtype=self.dtype, device=device)

SENT 1: None
PRED 1: <tabular> <tbody> <tr> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> </tr> <tr> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> </tr> <tr> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdn> </tr> <tr> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdn> </tr> <tr> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdn> </tr> <tr> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdn> </tr> <tr> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdn> </tr> <tr> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdy> <tdn> </tr> </tbody> </tabular>
PRED SCORE: -1.2848
PRED AVG SCORE: -0.0111, PRED PPL: 1.0111

Please correct me if I am doing something wrong. :)

Unable to download TableBank.zip[1]~[5]

Unable to download TableBank.zip[1]~[5] from
https://doc-analysis.github.io/tablebank-page/index.html

This XML file does not appear to have any style information associated with it. The document tree is shown below.

AuthenticationFailed
Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature. ...

Getting this error : yaml.reader.ReaderError

Traceback (most recent call last):
  File "tools/infer_simple.py", line 185, in <module>
    main(args)
  File "tools/infer_simple.py", line 125, in main
    merge_cfg_from_file(args.cfg)
  File "/home/anshuman/detectron/detectron/core/config.py", line 1148, in merge_cfg_from_file
    yaml_cfg = AttrDict(load_cfg(f))
  File "/home/anshuman/detectron/detectron/core/config.py", line 1142, in load_cfg
    return envu.yaml_load(cfg_to_load)
  File "/home/anshuman/Downloads/envs/myenv/lib/python3.7/site-packages/yaml/__init__.py", line 70, in load
    loader = Loader(stream)
  File "/home/anshuman/Downloads/envs/myenv/lib/python3.7/site-packages/yaml/loader.py", line 34, in __init__
    Reader.__init__(self, stream)
  File "/home/anshuman/Downloads/envs/myenv/lib/python3.7/site-packages/yaml/reader.py", line 74, in __init__
    self.check_printable(stream)
  File "/home/anshuman/Downloads/envs/myenv/lib/python3.7/site-packages/yaml/reader.py", line 144, in check_printable
    'unicode', "special characters are not allowed")
yaml.reader.ReaderError: unacceptable character #x0002: special characters are not allowed
  in "", position 0
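
The message says that the very first character of the file passed to --cfg is the control character 0x02, which suggests the file is not plain-text YAML (for example a binary file or a corrupted download). That is a guess from the error text only. A quick way to check a file for such characters, with a made-up helper name:

```python
def first_control_char(data):
    """Return (position, byte value) of the first control byte other than tab/LF/CR, or None."""
    allowed = {0x09, 0x0A, 0x0D}
    for position, byte in enumerate(data):
        if byte < 0x20 and byte not in allowed:
            return position, byte
    return None


print(first_control_char(b"\x02MODEL:\n"))            # (0, 2)
print(first_control_char(b"MODEL:\n  TYPE: rcnn\n"))  # None

# For a real file (path is a placeholder):
# with open("config.yaml", "rb") as handle:
#     print(first_control_char(handle.read()))
```

A non-None result for the config file would mean it needs to be downloaded again or replaced with the plain-text YAML version.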

Reproduce Precision, Recall and F1 score results from Detectron2 checkpoints

Is it possible to share the code used to calculate the Precision, Recall and F1 scores reported in the table here on GitHub? I'm referring to the results obtained with the latest released Detectron2 checkpoints.

The instructions in the paper are not very clear.

Unable to download

https://layoutlm.blob.core.windows.net/tablebank/model_zoo/detection/Latex_X152/model_final.pth

<Error>
<Code>PublicAccessNotPermitted</Code>
<Message>
Public access is not permitted on this storage account. RequestId:f2d499af-401e-0029-4a9c-a9d626000000 Time:2023-06-28T08:40:37.7148634Z
</Message>
</Error>

How can I download the following model?
X152(Latex) | Model/Config | 1.03GB

AttributeError: 'NoneType' object has no attribute 'astype'

Getting this error while trying to infer

INFO net.py: 96: rpn_bbox_pred_fpn2_w [+ momentum] loaded from weights file into gpu_0/rpn_bbox_pred_fpn2_w: (12, 256, 1, 1)
INFO net.py: 96: rpn_bbox_pred_fpn2_b [+ momentum] loaded from weights file into gpu_0/rpn_bbox_pred_fpn2_b: (12,)
INFO net.py: 96: fc6_w [+ momentum] loaded from weights file into gpu_0/fc6_w: (1024, 12544)
INFO net.py: 96: fc6_b [+ momentum] loaded from weights file into gpu_0/fc6_b: (1024,)
INFO net.py: 96: fc7_w [+ momentum] loaded from weights file into gpu_0/fc7_w: (1024, 1024)
INFO net.py: 96: fc7_b [+ momentum] loaded from weights file into gpu_0/fc7_b: (1024,)
INFO net.py: 96: cls_score_w [+ momentum] loaded from weights file into gpu_0/cls_score_w: (2, 1024)
INFO net.py: 96: cls_score_b [+ momentum] loaded from weights file into gpu_0/cls_score_b: (2,)
INFO net.py: 96: bbox_pred_w [+ momentum] loaded from weights file into gpu_0/bbox_pred_w: (8, 1024)
INFO net.py: 96: bbox_pred_b [+ momentum] loaded from weights file into gpu_0/bbox_pred_b: (8,)
INFO net.py: 133: pred_b preserved in workspace (unused)
INFO net.py: 133: pred_w preserved in workspace (unused)
[I net_dag_utils.cc:102] Operator graph pruning prior to chain compute took: 7.9674e-05 secs
[I net_dag_utils.cc:102] Operator graph pruning prior to chain compute took: 5.9175e-05 secs
INFO infer_simple.py: 147: Processing /images/ -> /tmp/detectron-tablebank/.pdf
Traceback (most recent call last):
  File "detectron/tools/infer_simple.py", line 185, in <module>
    main(args)
  File "detectron/tools/infer_simple.py", line 153, in main
    model, im, None, timers=timers
  File "/content/detectron/detectron/core/test.py", line 66, in im_detect_all
    model, im, cfg.TEST.SCALE, cfg.TEST.MAX_SIZE, boxes=box_proposals
  File "/content/detectron/detectron/core/test.py", line 137, in im_detect_bbox
    inputs, im_scale = _get_blobs(im, boxes, target_scale, target_max_size)
  File "/content/detectron/detectron/core/test.py", line 946, in _get_blobs
    blob_utils.get_image_blob(im, target_scale, target_max_size)
  File "/content/detectron/detectron/utils/blob.py", line 52, in get_image_blob
    im, cfg.PIXEL_MEANS, target_scale, target_max_size
  File "/content/detectron/detectron/utils/blob.py", line 108, in prep_im_for_blob
    im = im.astype(np.float32, copy=False)
AttributeError: 'NoneType' object has no attribute 'astype'
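
One likely cause, inferred from the log rather than confirmed: the line "Processing /images/ -> ..." shows the input path being treated as a single image, and OpenCV's `cv2.imread` returns None instead of raising when it cannot read a file, so `im` is None by the time `astype` is called. Checking the input before inference could look like the sketch below; the function name is made up and only the standard library is used.

```python
import glob
import os


def collect_images(im_or_folder, image_ext="jpg"):
    """Return image paths that exist on disk; an empty list means there is nothing to process."""
    if os.path.isdir(im_or_folder):
        return sorted(glob.glob(os.path.join(im_or_folder, "*." + image_ext)))
    return [im_or_folder] if os.path.isfile(im_or_folder) else []


# Example call (path is a placeholder):
# print(collect_images("/images/", image_ext="jpg"))
```

An empty list means that the folder does not exist in the current environment, or that it contains no files with the extension given by --image-ext.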

How to use pretrained weights in custom model

Hello Team,

I have built a model using RetinaNet and would like to use your weights to train it. Could you please explain how to do that?
If that is not possible with RetinaNet, how can I use the pretrained model and weights for inference on new images?

Weights downloaded from: https://dl.fbaipublicfiles.com/detectron/ImageNetPretrained/25093814/X-152-32x8d-IN5k.pkl

I am using a CPU-only system.

How to get the same validation/testing set as reported in the paper? And will the evaluation tools be publicly available?

With the same validation/testing set and evaluation tools, we can make a fair comparison between other methods and the baselines. Thank you!