wenhuchen / table-fact-checking
Data and Code for ICLR2020 Paper "TabFact: A Large-scale Dataset for Table-based Fact Verification"
License: MIT License
Hi wenhu, is the "concatenation" method of Table-BERT available in the repo? Thanks a lot!
Hey Wenhu, I was wondering if you could shed some light on the batch size you used in training. Was it the default of 6? I'm trying to replicate your paper results, but using your saved model I can't quite match the numbers in the paper. I know you use 16 as the batch size in evaluation, so I was wondering whether the same was true for training. I'm trying to replicate your fact-first, template Table-BERT results.
I actually just emailed you as well :)
Thank you for sharing with us your interesting dataset.
I'm curious about the differences between collected_data and the tokenized data.
How did you process the collected_data to generate the tokenized_data?
Originally, I tried to split the collected_data into train/val/test splits using the train_id.json/val_id.json/test_id.json files in the data folder.
However, the number of examples in each split differs from the train/val/test split in your paper, as shown below.
[In my case]
train: 92,585
val: 12,851
test: 12,839
[In your paper]
train: 92,283
val: 12,792
test: 12,779
However, I found that the number of train/val/test examples in the tokenized_data folder matches your paper.
Did you apply any filtering process to the collected_data?
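For what it's worth, counting statements directly from collected_data with the id files can be sketched as below; the schema assumed here (table_id mapping to [statements, labels, caption]) is a guess and may not match the repo exactly:

```python
import json
import tempfile

# Hedged sketch: select the table ids listed in a *_id.json file and count
# the statements that fall into that split. The entry layout
# (entry[0] == list of statements) is an assumption, not taken from the repo.
def count_statements(collected, id_file):
    with open(id_file) as f:
        split_ids = set(json.load(f))
    return sum(len(entry[0]) for tid, entry in collected.items()
               if tid in split_ids)
```

If counting this way reproduces the larger numbers, the gap would indeed point to a filtering step between collected_data and tokenized_data.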
I'm running python model.py --do_train --do_val --batch_size 2
with torch.device('cpu').
Here is the error message:
Traceback (most recent call last):
File "model.py", line 257, in <module>
precision, recall, accuracy = evaluate(val_dataloader, encoder_stat, encoder_prog)
File "model.py", line 142, in evaluate
for i, s, p, t, inp_id, prog_id in zip(index, similarity, pred_lab, true_lab, input_ids, prog_ids):
TypeError: iteration over a 0-d array
My configuration is the following:
ubuntu 20.04
python 3.8.1 (installed with pyenv)
python packages:
boto3==1.17.55
botocore==1.20.55
certifi==2020.12.5
chardet==4.0.0
click==7.1.2
idna==2.10
jmespath==0.10.0
joblib==1.0.1
nltk==3.6.2
numpy==1.20.2
pandas==1.2.4
protobuf==3.15.8
python-dateutil==2.8.1
pytorch-pretrained-bert==0.6.2
pytz==2021.1
regex==2021.4.4
requests==2.25.1
s3transfer==0.4.1
six==1.15.0
tdqm==0.0.1
tensorboardX==2.2
torch==1.8.1+cpu
tqdm==4.60.0
typing-extensions==3.7.4.3
ujson==1.35
Unidecode==1.2.0
urllib3==1.26.4
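The TypeError above typically means zip() was handed a 0-d NumPy array, e.g. the result of squeezing a batch of size 1, which can happen with a small --batch_size. A minimal sketch of a defensive fix, assuming the loop variables come from .numpy() calls on batch tensors:

```python
import numpy as np

def as_iterable(x):
    # A 0-d array cannot be iterated ("iteration over a 0-d array");
    # np.atleast_1d promotes it to shape (1,) so zip() still works.
    return np.atleast_1d(np.asarray(x))

# 0-d inputs like these reproduce the error when passed to zip() directly:
index, similarity = np.asarray(3), np.asarray(0.9)
pairs = list(zip(as_iterable(index), as_iterable(similarity)))
```

Wrapping each of the six loop variables this way keeps the final short batch from crashing evaluation.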
I am trying to run LPA on my custom dataset. My aim is to load the checkpoint and fine-tune on my dataset, but I am getting this error while loading the checkpoint:
File "model.py", line 205, in <module>
encoder_prog.load_state_dict(torch.load(args.output_dir + "encoder_prog_{}.pt".format(args.id)))
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1052, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Decoder:
size mismatch for tgt_word_emb.weight: copying a param with shape torch.Size([148791, 128]) from checkpoint, the shape in current model is torch.Size([68717, 128]).
Analyzing the code, the problem appears to be due to a different vocab size. Is there any way to fine-tune on my dataset starting from the provided checkpoint?
Thanks!
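One generic workaround (a sketch, not the repo's supported path) is to load only the parameters whose shapes match, leaving size-mismatched ones such as tgt_word_emb.weight at their fresh initialization:

```python
import tempfile

import torch
import torch.nn as nn

def load_compatible(model, ckpt_path):
    # Keep only checkpoint entries whose name and shape match the current
    # model; mismatched ones (e.g. an embedding with a different vocab size)
    # are skipped and retain their fresh initialization.
    ckpt = torch.load(ckpt_path, map_location="cpu")
    own = model.state_dict()
    compatible = {k: v for k, v in ckpt.items()
                  if k in own and v.shape == own[k].shape}
    own.update(compatible)
    model.load_state_dict(own)
    return sorted(set(ckpt) - set(compatible))  # names of skipped params
```

Note that a randomly re-initialized target-word embedding forfeits most of what the pretrained decoder learned about those tokens, so further training on the merged vocabulary would still be needed.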
Hi Wenhu, thanks for making all of these open-source! I was wondering if the positive sentences that were rewritten into negative ones are available in the repo as well, or if you would consider doing so, since I think it would be an interesting (albeit different) task as well.
Hi, I tried to reproduce your results with the Table-BERT checkpoints using HF transformers.
The code runs without errors, but evaluating the checkpoint model gives very low accuracy.
Thanks!
Edit: I had an error in migrating the code from pytorch-pretrained-bert to transformers; I was able to fix it myself!
While using the pre_process data code, I get an error saying the number of tags does not equal the number of words. Any suggestions on this?
Hello, I noticed that there is an "all_positive_negative_pairs.json" file in the "pairwise_data" folder. I'm curious about:
Thanks!
Hi Wenhu, when evaluating checkpoints trained with the provided code, I can't get the same results as during training. Is my model not being loaded correctly? Do you know how to deal with this problem? Thank you very much!
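One frequent cause of exactly this symptom (an assumption here, not a confirmed diagnosis) is evaluating without switching to eval mode, so dropout stays active and outputs become stochastic:

```python
import torch
import torch.nn as nn

# With dropout active (train mode), two forward passes on the same input
# generally differ; model.eval() freezes dropout/batch-norm so a reloaded
# checkpoint evaluates deterministically.
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(1, 4)

model.eval()
with torch.no_grad():  # also skips autograd bookkeeping during evaluation
    a = model(x)
    b = model(x)
```

It is also worth checking that load_state_dict reports no missing or unexpected keys before evaluating.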
Hi Wenhu, thanks for sharing the data!
I wonder what the ground-truth labels are for the statements in the bootstrap folder? They don't seem to match any statements in the tokenized_data.