wenhuchen / table-fact-checking
Data and Code for ICLR2020 Paper "TabFact: A Large-scale Dataset for Table-based Fact Verification"
License: MIT License
Hi wenhu, is the "concatenation" method of Table-BERT available in the repo? Thanks a lot!
Hey Wenhu, I was wondering if you could shed some light on the batch size you used in training. Was it the default of 6? I'm trying to replicate your paper results, but using your saved model I can't quite match the numbers in the paper. I know you use 16 as the batch size in evaluation, so I was wondering whether the same was true for training. I'm trying to replicate your fact-first, template Table-BERT results.
I actually just emailed you as well :)
Thank you for sharing with us your interesting dataset.
I'm curious about the differences between collected_data and the tokenized data.
How did you process the collected_data to generate the tokenized_data?
Originally, I tried to split the collected_data into train/val/test splits using the train_id.json/val_id.json/test_id.json files in the data folder.
However, the number of examples in each split differs from the train/val/test split in your paper, as shown below.
[In my case]
train: 92,585
val: 12,851
test: 12,839
[In your paper]
train: 92,283
val: 12,792
test: 12,779
However, I found that the number of train/val/test examples in the tokenized_data folder matches your paper.
Did you apply any filtering process to the collected_data?
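For what it's worth, counting statements directly from collected_data with the id files can be sketched as below; the schema assumed here (table_id mapping to [statements, labels, caption]) is a guess and may not match the repo exactly:

```python
import json
import tempfile

# Hedged sketch: select the table ids listed in a *_id.json file and count
# the statements that fall into that split. The entry layout
# (entry[0] == list of statements) is an assumption, not taken from the repo.
def count_statements(collected, id_file):
    with open(id_file) as f:
        split_ids = set(json.load(f))
    return sum(len(entry[0]) for tid, entry in collected.items()
               if tid in split_ids)
```

If counting this way reproduces the larger numbers, the gap would indeed point to a filtering step between collected_data and tokenized_data.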
I'm running python model.py --do_train --do_val --batch_size 2
with torch.device('cpu').
Here is the error message:
Traceback (most recent call last):
File "model.py", line 257, in <module>
precision, recall, accuracy = evaluate(val_dataloader, encoder_stat, encoder_prog)
File "model.py", line 142, in evaluate
for i, s, p, t, inp_id, prog_id in zip(index, similarity, pred_lab, true_lab, input_ids, prog_ids):
TypeError: iteration over a 0-d array
My configuration is the following:
ubuntu 20.04
python 3.8.1 (installed with pyenv)
python packages:
boto3==1.17.55
botocore==1.20.55
certifi==2020.12.5
chardet==4.0.0
click==7.1.2
idna==2.10
jmespath==0.10.0
joblib==1.0.1
nltk==3.6.2
numpy==1.20.2
pandas==1.2.4
protobuf==3.15.8
python-dateutil==2.8.1
pytorch-pretrained-bert==0.6.2
pytz==2021.1
regex==2021.4.4
requests==2.25.1
s3transfer==0.4.1
six==1.15.0
tdqm==0.0.1
tensorboardX==2.2
torch==1.8.1+cpu
tqdm==4.60.0
typing-extensions==3.7.4.3
ujson==1.35
Unidecode==1.2.0
urllib3==1.26.4
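The TypeError above typically means zip() was handed a 0-d NumPy array, e.g. the result of squeezing a batch of size 1, which can happen with a small --batch_size. A minimal sketch of a defensive fix, assuming the loop variables come from .numpy() calls on batch tensors:

```python
import numpy as np

def as_iterable(x):
    # A 0-d array cannot be iterated ("iteration over a 0-d array");
    # np.atleast_1d promotes it to shape (1,) so zip() still works.
    return np.atleast_1d(np.asarray(x))

# 0-d inputs like these reproduce the error when passed to zip() directly:
index, similarity = np.asarray(3), np.asarray(0.9)
pairs = list(zip(as_iterable(index), as_iterable(similarity)))
```

Wrapping each of the six loop variables this way keeps the final short batch from crashing evaluation.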
I am trying to run LPA on my custom dataset. My aim is to load the checkpoint and fine-tune on my dataset, but I am getting this error while loading the checkpoint:
File "model.py", line 205, in <module>
encoder_prog.load_state_dict(torch.load(args.output_dir + "encoder_prog_{}.pt".format(args.id)))
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1052, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Decoder:
size mismatch for tgt_word_emb.weight: copying a param with shape torch.Size([148791, 128]) from checkpoint, the shape in current model is torch.Size([68717, 128]).
Analyzing the code, the problem appears to be due to a different vocab size. Is there any way to fine-tune on my dataset starting from the provided checkpoint?
Thanks!
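One generic workaround (a sketch, not the repo's supported path) is to load only the parameters whose shapes match, leaving size-mismatched ones such as tgt_word_emb.weight at their fresh initialization:

```python
import tempfile

import torch
import torch.nn as nn

def load_compatible(model, ckpt_path):
    # Keep only checkpoint entries whose name and shape match the current
    # model; mismatched ones (e.g. an embedding with a different vocab size)
    # are skipped and retain their fresh initialization.
    ckpt = torch.load(ckpt_path, map_location="cpu")
    own = model.state_dict()
    compatible = {k: v for k, v in ckpt.items()
                  if k in own and v.shape == own[k].shape}
    own.update(compatible)
    model.load_state_dict(own)
    return sorted(set(ckpt) - set(compatible))  # names of skipped params
```

Note that a randomly re-initialized target-word embedding forfeits most of what the pretrained decoder learned about those tokens, so further training on the merged vocabulary would still be needed.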
Hi Wenhu, thanks for making all of these open-source! I was wondering if the positive sentences that were rewritten into negative ones are available in the repo as well, or if you would consider doing so, since I think it would be an interesting (albeit different) task as well.
Hi, I tried to reproduce your results with the Table-BERT checkpoints using HF transformers.
The code runs without errors, but evaluating the checkpoint model gives very low accuracy.
Thanks!
Edit: I had an error in migrating the code from pytorch-pretrained-bert to transformers; I was able to fix it myself!
While using the pre_process data code, I get an error saying the number of tags does not equal the number of words. Any suggestions on this?
Hello, I noticed that there is an "all_positive_negative_pairs.json" file in the "pairwise_data" folder. I'm curious about:
Thanks!
Hi Wenhu, when evaluating checkpoints trained with the provided code, I can't get the same results as during training. Is my model not being loaded correctly? Do you know how to deal with this problem? Thank you very much!
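One frequent cause of exactly this symptom (an assumption here, not a confirmed diagnosis) is evaluating without switching to eval mode, so dropout stays active and outputs become stochastic:

```python
import torch
import torch.nn as nn

# With dropout active (train mode), two forward passes on the same input
# generally differ; model.eval() freezes dropout/batch-norm so a reloaded
# checkpoint evaluates deterministically.
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(1, 4)

model.eval()
with torch.no_grad():  # also skips autograd bookkeeping during evaluation
    a = model(x)
    b = model(x)
```

It is also worth checking that load_state_dict reports no missing or unexpected keys before evaluating.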
Hi Wenhu, thanks for sharing the data!
I wonder what the ground-truth labels are for the statements in the bootstrap folder? They don't seem to match any statements in the tokenized_data.