megagonlabs / doduo
Annotating Columns with Pre-trained Language Models
Home Page: https://arxiv.org/abs/2104.01785
License: Apache License 2.0
I tried running the training script with a smaller batch size, since my machine does not have enough memory for the default batch size of 32. When I use a batch size of 16 instead, I get the error below.
$ python doduo/train_multi.py --batch_size=16
args={"shortcut_name": "bert-base-uncased", "max_length": 128, "batch_size": 16, "epoch": 30, "random_seed": 4649, "num_classes": 78, "multi_gpu": false, "fp16": false, "warmup": 0.0, "lr": 5e-05, "tasks": ["sato0"], "colpair": false, "train_ratios": [], "from_scratch": false, "single_col": false}
model/sato0_mosato_bert_bert-base-uncased-bs16-ml-128__sato0-1.00
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMultiOutputClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForMultiOutputClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMultiOutputClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMultiOutputClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Traceback (most recent call last):
File "doduo/train_multi.py", line 436, in <module>
logits, = model(batch["data"].T) # (row, col) is opposite?
File "/home/mmior/.local/share/virtualenvs/doduo-ztkaJOAZ/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/mmior/apps/doduo/doduo/model.py", line 372, in forward
outputs = self.bert(
File "/home/mmior/.local/share/virtualenvs/doduo-ztkaJOAZ/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/mmior/apps/doduo/doduo/model.py", line 286, in forward
embedding_output = self.embeddings(input_ids=input_ids,
File "/home/mmior/.local/share/virtualenvs/doduo-ztkaJOAZ/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/mmior/.local/share/virtualenvs/doduo-ztkaJOAZ/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 207, in forward
embeddings += position_embeddings
RuntimeError: The size of tensor a (650) must match the size of tensor b (512) at non-singleton dimension 1
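The mismatch in the last line (650 vs. 512) suggests that the serialized table produced 650 tokens, while the position embedding table of bert-base-uncased holds only 512 positions. The following is a minimal sketch of one way to keep the concatenated input within that limit. It assumes the input is a list of per-column token-id lists; `cap_columns` is a hypothetical helper, not part of the Doduo code base, and the real data loader may be organized differently.

```python
MAX_POSITIONS = 512  # size of bert-base-uncased's position embedding table


def cap_columns(columns, max_positions=MAX_POSITIONS):
    """Trim each column's token ids so the concatenation fits in max_positions.

    `columns` is a list of token-id lists, one per table column (assumed
    layout for illustration only).
    """
    if not columns:
        return []
    budget = max_positions // len(columns)  # equal share of positions per column
    return [col[:budget] for col in columns]


# Five columns of 130 tokens each would concatenate to 650 tokens.
columns = [list(range(130)) for _ in range(5)]
capped = cap_columns(columns)
total = sum(len(col) for col in capped)
print(total)  # 510
```

Lowering `--max_length` has a similar effect if that flag bounds the tokens per column, which is an assumption worth checking against the data-loading code.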
I want to load metadata along with the table-specific data and then use DOsolo for training, but I cannot find where metadata is loaded in the data-loading code. Could you explain in detail how to do this?
Hi, when will you release the source code?
/root/doduo/doduo/util.py:15: RuntimeWarning: invalid value encountered in long_scalars
r = agg_conf_mat[1, 1] / agg_conf_mat[:, 1].sum()
/root/doduo/doduo/util.py:19: RuntimeWarning: invalid value encountered in true_divide
class_r = conf_mat[:, 1, 1] / conf_mat[:, :, 1].sum(axis=1)
/root/doduo/doduo/util.py:15: RuntimeWarning: invalid value encountered in long_scalars
r = agg_conf_mat[1, 1] / agg_conf_mat[:, 1].sum()
/root/doduo/doduo/util.py:18: RuntimeWarning: invalid value encountered in true_divide
class_p = conf_mat[:, 1, 1] / conf_mat[:, 1, :].sum(axis=1)
/root/doduo/doduo/util.py:19: RuntimeWarning: invalid value encountered in true_divide
class_r = conf_mat[:, 1, 1] / conf_mat[:, :, 1].sum(axis=1)
What causes this problem, and how can I solve it?
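NumPy emits "invalid value encountered" when a division evaluates 0/0, which here would happen for any class whose denominator sum is zero (for example, a class that never appears in the evaluated batch). The following is a minimal sketch of a guarded division that returns 0 for such classes. `safe_divide` is a hypothetical helper, not part of `doduo/util.py`, and the toy confusion matrices are made up for illustration; the indexing simply mirrors the lines quoted in the warnings.

```python
import numpy as np


def safe_divide(numerator, denominator):
    """Element-wise division that returns 0.0 where the denominator is 0."""
    numerator = np.asarray(numerator, dtype=float)
    denominator = np.asarray(denominator, dtype=float)
    out = np.zeros_like(numerator)
    # Only divide where the denominator is non-zero; other entries stay 0.0.
    np.divide(numerator, denominator, out=out, where=denominator != 0)
    return out


# Toy per-class 2x2 confusion matrices; the second class has no positives at all.
conf_mat = np.array([[[5, 1], [2, 3]],
                     [[11, 0], [0, 0]]])
true_pos = conf_mat[:, 1, 1]
class_p = safe_divide(true_pos, conf_mat[:, 1, :].sum(axis=1))
class_r = safe_divide(true_pos, conf_mat[:, :, 1].sum(axis=1))
```

Whether an undefined score should count as 0 or be excluded from the average is a design choice that changes macro-averaged results, so it is worth deciding explicitly.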
When executing "$ python doduo/train_multi.py --tasks turl turl_re-colpair --max_length 32 --batch_size 16", I get the following error. Could you please help?
Traceback (most recent call last):
File "doduo/train_multi.py", line 14, in <module>
from transformers import BertTokenizer, BertForSequenceClassification, BertConfig
File "/data/conceptdrift/anaconda3/envs/doduo/bin/transformers/__init__.py", line 43, in <module>
from . import dependency_versions_check
File "/data/conceptdrift/anaconda3/envs/doduo/bin/transformers/dependency_versions_check.py", line 41, in <module>
require_version_core(deps[pkg])
File "/data/conceptdrift/anaconda3/envs/doduo/bin/transformers/utils/versions.py", line 101, in require_version_core
return require_version(requirement, hint)
File "/data/conceptdrift/anaconda3/envs/doduo/bin/transformers/utils/versions.py", line 92, in require_version
if want_ver is not None and not ops[op](version.parse(got_ver), version.parse(want_ver)):
File "/data/conceptdrift/anaconda3/envs/doduo/bin/packaging/version.py", line 52, in parse
return Version(version)
File "/data/conceptdrift/anaconda3/envs/doduo/bin/packaging/version.py", line 198, in __init__
raise InvalidVersion(f"Invalid version: '{version}'")
packaging.version.InvalidVersion: Invalid version: '0.10.1,<0.11'
Hi, thank you for open-sourcing this wonderful work. Could you release your model pre-trained on the VizNet dataset so that we can quickly try the DODUO model and evaluate its performance? Thanks!
Hi! Can you provide more details about the fine-tuning part of your model? Does fine-tuning happen before the models are trained, or is it the training process itself? The paper does not clearly describe the fine-tuning process (for example, the number of epochs used), and this repository does not appear to document it either.
I have been trying to reproduce the results of the paper, but I cannot reach the 96.3% micro F1 score it reports.
FileNotFoundError: [Errno 2] No such file or directory: './data/turl_coltype_mlb.pickle'
Where can I find this file?
Hello, your paper says training was done on a 16G T100, but when I follow your steps I keep getting out-of-memory errors.
The Doduo paper defines two tasks: "column type prediction and column relation annotation." However, the training script for Doduo in this repository defines 12 different tasks. How do these tasks map to the two tasks defined in the paper? It seems as though turl-re is the column relation annotation task.
Are all the other tasks then different forms of type prediction? If so, why is a slightly different model needed for the different datasets used?