megagonlabs / doduo

28.0 28.0 12.0 50 KB

Annotating Columns with Pre-trained Language Models

Home Page: https://arxiv.org/abs/2104.01785

License: Apache License 2.0

Python 99.26% Shell 0.74%

doduo's People

xuangestallone tingyaohsu ij007 penfever tabbydoc xuemduan ziqingyuan kirilltobola prcnsi agentcap wesmadrigal

doduo's Issues

Training fails when changing batch size

I tried running the training script with a smaller batch size because my machine does not have enough memory for the default batch size of 32. When I try a batch size of 16 instead, I get the error below.

$ python doduo/train_multi.py --batch_size=16
args={"shortcut_name": "bert-base-uncased", "max_length": 128, "batch_size": 16, "epoch": 30, "random_seed": 4649, "num_classes": 78, "multi_gpu": false, "fp16": false, "warmup": 0.0, "lr": 5e-05, "tasks": ["sato0"], "colpair": false, "train_ratios": [], "from_scratch": false, "single_col": false}
model/sato0_mosato_bert_bert-base-uncased-bs16-ml-128__sato0-1.00
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMultiOutputClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForMultiOutputClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMultiOutputClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMultiOutputClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Traceback (most recent call last):
  File "doduo/train_multi.py", line 436, in <module>
    logits, = model(batch["data"].T)  # (row, col) is opposite?
  File "/home/mmior/.local/share/virtualenvs/doduo-ztkaJOAZ/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/mmior/apps/doduo/doduo/model.py", line 372, in forward
    outputs = self.bert(
  File "/home/mmior/.local/share/virtualenvs/doduo-ztkaJOAZ/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/mmior/apps/doduo/doduo/model.py", line 286, in forward
    embedding_output = self.embeddings(input_ids=input_ids,
  File "/home/mmior/.local/share/virtualenvs/doduo-ztkaJOAZ/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/mmior/.local/share/virtualenvs/doduo-ztkaJOAZ/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 207, in forward
    embeddings += position_embeddings
RuntimeError: The size of tensor a (650) must match the size of tensor b (512) at non-singleton dimension 1
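
The last line of the traceback says that a sequence of 650 tokens reached the position embeddings, while bert-base-uncased provides only 512 positions. Below is a minimal sketch of a length guard, assuming token IDs are held as plain Python lists. `MAX_POSITIONS` and `truncate_to_max_positions` are hypothetical names for illustration and are not part of doduo.

```python
MAX_POSITIONS = 512  # position-embedding limit of bert-base-uncased


def truncate_to_max_positions(token_ids, max_positions=MAX_POSITIONS):
    """Return a copy of token_ids cut to at most max_positions entries."""
    if len(token_ids) <= max_positions:
        return list(token_ids)
    return list(token_ids[:max_positions])


# Stand-in for a serialized table that is 650 tokens long (hypothetical data).
table_ids = list(range(650))
clipped = truncate_to_max_positions(table_ids)
print(len(table_ids), "->", len(clipped))
```

Whether plain truncation is acceptable depends on how the columns are serialized into one sequence, so this only illustrates the length check itself.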

how to load metadata

I want to load table metadata along with the table data and then train DOsolo on it, but I can't find where metadata is loaded in the data-loading code. Could you explain in detail how to do this?

Release the source code

Hi, when will you release the source code?

RuntimeWarning: invalid value encountered in true_divide

/root/doduo/doduo/util.py:15: RuntimeWarning: invalid value encountered in long_scalars
r = agg_conf_mat[1, 1] / agg_conf_mat[:, 1].sum()
/root/doduo/doduo/util.py:19: RuntimeWarning: invalid value encountered in true_divide
class_r = conf_mat[:, 1, 1] / conf_mat[:, :, 1].sum(axis=1)
/root/doduo/doduo/util.py:15: RuntimeWarning: invalid value encountered in long_scalars
r = agg_conf_mat[1, 1] / agg_conf_mat[:, 1].sum()
/root/doduo/doduo/util.py:18: RuntimeWarning: invalid value encountered in true_divide
class_p = conf_mat[:, 1, 1] / conf_mat[:, 1, :].sum(axis=1)
/root/doduo/doduo/util.py:19: RuntimeWarning: invalid value encountered in true_divide
class_r = conf_mat[:, 1, 1] / conf_mat[:, :, 1].sum(axis=1)

What causes this problem, and how can I solve it?
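
NumPy emits these warnings when a division evaluates 0 / 0, which yields nan. That can happen in the lines above if a class has a zero denominator in the sliced confusion matrix, for example when it never occurs in the evaluated data. Below is a minimal sketch of a guarded division, assuming `conf_mat` has shape (n_classes, 2, 2) as the indexing in the warnings suggests. `safe_divide` and the sample matrix are hypothetical and not part of doduo.

```python
import numpy as np


def safe_divide(numerator, denominator):
    """Element-wise division that yields 0.0 wherever the denominator is 0."""
    numerator = np.asarray(numerator, dtype=float)
    denominator = np.asarray(denominator, dtype=float)
    out = np.zeros(np.broadcast(numerator, denominator).shape, dtype=float)
    # Only positions with a non-zero denominator are computed; the rest keep 0.0.
    np.divide(numerator, denominator, out=out, where=denominator != 0)
    return out


# Hypothetical per-class 2x2 matrices; the first class has all-zero
# entries in the cells used below, which would otherwise give 0 / 0.
conf_mat = np.array([
    [[5, 0], [0, 0]],
    [[2, 1], [1, 3]],
])

class_p = safe_divide(conf_mat[:, 1, 1], conf_mat[:, 1, :].sum(axis=1))
class_r = safe_divide(conf_mat[:, 1, 1], conf_mat[:, :, 1].sum(axis=1))
print(class_p, class_r)
```

Whether an undefined score should count as 0.0 or be left out of an average is a design choice. The sketch simply picks 0.0.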

Getting invalid version error

When I run "$ python doduo/train_multi.py --tasks turl turl_re-colpair --max_length 32 --batch_size 16", I get the following error. Could you please help?
Traceback (most recent call last):
  File "doduo/train_multi.py", line 14, in <module>
    from transformers import BertTokenizer, BertForSequenceClassification, BertConfig
  File "/data/conceptdrift/anaconda3/envs/doduo/bin/transformers/__init__.py", line 43, in <module>
    from . import dependency_versions_check
  File "/data/conceptdrift/anaconda3/envs/doduo/bin/transformers/dependency_versions_check.py", line 41, in <module>
    require_version_core(deps[pkg])
  File "/data/conceptdrift/anaconda3/envs/doduo/bin/transformers/utils/versions.py", line 101, in require_version_core
    return require_version(requirement, hint)
  File "/data/conceptdrift/anaconda3/envs/doduo/bin/transformers/utils/versions.py", line 92, in require_version
    if want_ver is not None and not ops[op](version.parse(got_ver), version.parse(want_ver)):
  File "/data/conceptdrift/anaconda3/envs/doduo/bin/packaging/version.py", line 52, in parse
    return Version(version)
  File "/data/conceptdrift/anaconda3/envs/doduo/bin/packaging/version.py", line 198, in __init__
    raise InvalidVersion(f"Invalid version: '{version}'")
packaging.version.InvalidVersion: Invalid version: '0.10.1,<0.11'
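
The rejected string '0.10.1,<0.11' looks like the tail of a two-clause requirement (a lower bound and an upper bound) that reached the version parser as if it were a single version. The traceback does not show which package the requirement belongs to. Below is a simplified, dependency-free sketch of checking such a requirement clause by clause. `parse_version` and `satisfies` are hypothetical helpers, not the transformers or packaging implementation, and they handle only plain dotted numeric versions.

```python
import re


def parse_version(text):
    """Turn a simple dotted version such as '0.10.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def satisfies(installed, requirement):
    """Check a version against comma-separated clauses like '>=0.10.1,<0.11'."""
    ops = {
        ">=": lambda a, b: a >= b,
        "<=": lambda a, b: a <= b,
        "==": lambda a, b: a == b,
        ">": lambda a, b: a > b,
        "<": lambda a, b: a < b,
    }
    got = parse_version(installed)
    # Each clause is compared separately instead of parsing the whole string.
    for clause in requirement.split(","):
        match = re.fullmatch(r"\s*(>=|<=|==|>|<)\s*([0-9.]+)\s*", clause)
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        op, wanted = match.groups()
        if not ops[op](got, parse_version(wanted)):
            return False
    return True


print(satisfies("0.10.3", ">=0.10.1,<0.11"))
```

Passing the unsplit text '0.10.1,<0.11' to `parse_version` raises a ValueError, which mirrors the failure in the traceback: the string contains two bounds, not one version.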

The pre-trained DODUO model

Hi, thank you for open-sourcing this wonderful work. Could you release your model pre-trained on the VizNet dataset so that we can quickly try DODUO and evaluate its performance? Thanks!

Details for fine-tuning part of the model

Hi! Can you provide more details about the fine-tuning part of your model? Does fine-tuning happen before the models are trained, or is it the training process itself? The paper does not clearly describe the fine-tuning process (for example, the number of fine-tuning epochs), and this repository does not seem to document it either.
I have been trying to reproduce the results of the paper, but I cannot reach the 96.3% micro F1 score it reports.

FileNotFoundError: [Errno 2] No such file or directory: './data/turl_coltype_mlb.pickle'

Where can I find this file?

out of memory

Hello, your paper says training was done on a 16G T100, but when I follow your steps I keep getting out-of-memory errors.

Description of tasks

The Doduo paper defines two tasks: "column type prediction and column relation annotation." However, the training script for Doduo in this repository defines 12 different tasks. How do these tasks map to the two tasks defined in the paper? It seems as though turl-re is the column relation annotation task.

Are all the other tasks then different forms of type prediction? If so, why is a slightly different model needed for each of the datasets used?
