n-waves / multifit

The code to reproduce results from paper "MultiFiT: Efficient Multi-lingual Language Model Fine-tuning" https://arxiv.org/abs/1909.04761

License: MIT License

Python 30.30% Shell 1.59% Jupyter Notebook 68.11%

fastai ulmfit multiple-languages nlp

multifit's Issues

Blogpost Consent and Preferred Handle

As we approach the time to publish a blogpost to share the results, we are looking for consent from the contributors to be named in it. Also please add your personal page or twitter handle or the preferred URL where clicking on your name should link to. Thank you all!

Training custom classifier

Hello, I've seen your code on the front page for training a language model:

`from fastai.text import *
import multifit

exp = multifit.from_pretrained("name of the model")
fa_config =  exp.pretrain_lm.tokenizer.get_fastai_config(add_open_file_processor=True)
data_lm = (TextList.from_folder(imdb_path, **fa_config)
            .filter_by_folder(include=['train', 'test', 'unsup']) 
            .split_by_rand_pct(0.1)
            .label_for_lm()           
            .databunch(bs=bs))
learn = exp.finetune_lm.get_learner(data_lm)  
# learn is a preconfigured fastai learner with a pretrained model loaded
learn.fit_one_cycle(10)
learn.save_encoder("enc")
...`

I would like to ask how I can then train my own classifier on top of this model, since all the guidelines described at https://docs.fast.ai/text.html assume the AWD-LSTM architecture, so they will not work with a MultiFiT language model as the encoder.

Thanks

Problems with reproducing zero-shot learning results

I tried replicating the zero-shot learning results on CLS, but my results don't match those from the paper. Since the script for predicting labels with LASER does not seem to be part of the MultiFiT repository, I trained LASER on the CLS dataset (only en and de books for now) by adapting the MLDoc script from the LASER repo to CLS. My fork of LASER with these adjustments is [here](https://github.com/blazejdolicki/LASER). For the time being I have only tested on books in German. After some hyperparameter tuning on the English training set, my best setup obtains 82.25% accuracy, compared to 84.15% in the MultiFiT paper. My hyperparameters are:

n_epochs=200
lr=0.001
wd=0.0
nhid="10 8"
drop=0.2
seed=1
bsize=12

and I'm using the last 10% of the test set as validation.
When I tried to make them more similar to MultiFiT (n_epochs=8, wd=0.001, bsize=18), the accuracy dropped to around 60%.

Afterwards, I used the best (82.25% accuracy) LASER classifier (trained on the English training set) to predict labels for German books. Then I copied the test, training and unsupervised sets in the MultiFiT repo from the folder de-books into de-books-laser and replaced the ground-truth labels in the training set with pseudolabels. I then trained the MultiFiT classifier on those pseudolabels. My validation accuracy isn't great but is at least similar, whereas my test set accuracy is as low as 70% (compared to 89.60 from the paper and here), as you can see in the attached logs.
Multifit CLS zero shot terrible results 15.04.2020.txt

I did expect some drop due to the issue explained in #63, but such a big difference shows that the unsupervised set size can't be the only factor degrading the results. Other possible reasons for the drop in performance that come to mind are:

Did I use different hyperparameters for training and predicting LASER pseudolabels?
Did I use a different train-dev split for training and predicting LASER pseudolabels?
Did your script load the LASER model with the fastai library and train the classifier with it instead of PyTorch?

My fork of multifit is here; I'm using the ulmfit-original-scripts branch.

I would really appreciate a reply :)

Cannot run examples / pytest tests:

I cannot get MultiFiT to work in my environment :-(

What I did was...

I checked out the repo and ran any "prepare..." script available.
I had to "pip install" the modules "fire" and "sacremoses", since they were provided neither by the code nor by the "fastai" package (I installed the most recent version, 1.0.59).
I started pytest . or training according to the example python -m ulmfit lm --dataset-path data/wiki/${LANG}-100 --tokenizer='f' --nl 3 --name 'orig' --max-vocab 60000 \ --lang ${LANG} --qrnn=False - train 10 --bs=50 --drop_mult=0 --label-smoothing-eps=0.0

RESULT: I always get a UnicodeEncodeError or UnicodeDecodeError

e.g. with the training command:

Max vocab: 60000
Cache dir: data/wiki/en-100/models/f60k
Model dir: data/wiki/en-100/models/f60k/lstm_orig.m
Wiki text was split to 28476 articles
Wiki text was split to 60 articles
Running tokenization lm...
Traceback (most recent call last):
  File "/home/user/miniconda/envs/py36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/user/miniconda/envs/py36/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/app/work/ulmfit/__main__.py", line 188, in <module>
    fire.Fire(ULMFiT())
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fire/core.py", line 675, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/app/work/ulmfit/pretrain_lm.py", line 164, in train_lm
    data_lm = self.load_wiki_data(bs=bs) if data_lm is None else data_lm
  File "/app/work/ulmfit/pretrain_lm.py", line 246, in load_wiki_data
    **args)
  File "/app/work/ulmfit/pretrain_lm.py", line 254, in lm_databunch
    return self.databunch(name, bunch_class=TextLMDataBunch, *args, **kwargs)
  File "/app/work/ulmfit/pretrain_lm.py", line 279, in databunch
    **args)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastai/text/data.py", line 202, in from_df
    if cls==TextLMDataBunch: src = src.label_for_lm()
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastai/data_block.py", line 480, in _inner
    self.process()
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastai/data_block.py", line 534, in process
    for ds,n in zip(self.lists, ['train','valid','test']): ds.process(xp, yp, name=n)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastai/data_block.py", line 714, in process
    self.x.process(xp)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastai/data_block.py", line 84, in process
    for p in self.processor: p.process(self)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastai/text/data.py", line 296, in process
    for i in progress_bar(range(0,len(ds),self.chunksize), leave=False):
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastprogress/fastprogress.py", line 75, in __iter__
    if self.auto_update: self.update(i+1)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastprogress/fastprogress.py", line 92, in update
    self.update_bar(val)
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastprogress/fastprogress.py", line 104, in update_bar
    else: self.on_update(val, f'{100 * val/self.total:.2f}% [{val}/{self.total} {elapsed_t}<{remaining_t}{end}]')
  File "/home/user/miniconda/envs/py36/lib/python3.6/site-packages/fastprogress/fastprogress.py", line 274, in on_update
    if printing(): WRITER_FN(to_write, end = '\r')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-35: ordinal not in range(128)

or with the tests (any!):

self = <encodings.ascii.IncrementalDecoder object at 0x7fdcdf958e10>
input = b' \n = Valkyria Chronicles III = \n \n Senj\xc5\x8d no Valkyria 3 : <unk> Chronicles ( Japanese : \xe6\x88\xa6\xe5\xa...n force invading the Empire just following the two nations \' cease @-@ fire would certainly wreck their newfound peac'
final = False

def decode(self, input, final=False):
> return codecs.ascii_decode(input, self.errors)[0]
E UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 39: ordinal not in range(128)

/home/user/miniconda/envs/py36/lib/python3.6/encodings/ascii.py:26: UnicodeDecodeError

Does anyone have a clue here? Thanks a lot in advance!
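
Both tracebacks mention the 'ascii' codec, which is what Python 3.6 typically falls back to when the process locale is unset or set to C/POSIX (common in minimal containers; whether that applies here is an assumption, not something the logs confirm). The sketch below only shows how one might inspect the default encodings and read a file as UTF-8 explicitly. The helper names `report_encodings` and `read_text_utf8` are made up for illustration and are not part of ulmfit or fastai.

```python
import locale
import sys
import tempfile
from pathlib import Path


def report_encodings():
    """Return the encodings Python uses by default in this process."""
    return {
        "preferred": locale.getpreferredencoding(False),  # default for open()
        "stdout": sys.stdout.encoding,                    # used when printing
        "filesystem": sys.getfilesystemencoding(),
    }


def read_text_utf8(path):
    """Read a file as UTF-8 regardless of the locale's default encoding."""
    with open(path, encoding="utf-8") as f:
        return f.read()


# Round-trip a line containing a non-ASCII character (U+014D, bytes 0xc5 0x8d).
sample = " = Valkyria Chronicles III = \n Senj\u014d no Valkyria 3 \n"
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_text(sample, encoding="utf-8")
    roundtrip = read_text_utf8(sample_path)

encodings = report_encodings()
```

If `encodings["preferred"]` or `encodings["stdout"]` turns out to be an ASCII variant, a commonly used workaround is to export a UTF-8 locale (for example LANG=C.UTF-8) or PYTHONIOENCODING=utf-8 before starting Python. Whether that resolves this particular report has not been verified.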

Expected input batch_size to match target batch_size error occurs while bidir=True

I am trying to train an LSTM ULMFiT language model on CPU with bidir=True, and the error occurs during the first round of fit_one_cycle.

Concretely, the error is raised at torch.nn.functional.py#Ln1788 in nll_loss, on the line ending in .format(input.size(0), target.size(0))).

The frame above it shows that the program is computing cross_entropy in nn.modules.loss.forward.

I'm using pytorch-nightly-cpu with fastai installed by conda.

Could anyone provide a hint on how to dig into this? Thanks.

The argument bs is not known

I tried to run
python -m ulmfit lm --dataset-path data/wiki/wikitext-103 --bidir=False --qrnn=False --tokenizer=vf --name 'bs40' --bs=40 --cuda-id=0 - train 20 --drop-mult=0.9 ... as mentioned in the README, but I get this error message:
TypeError: __init__() got an unexpected keyword argument 'bs'
It seems the class LMHyperParams doesn't have a batch size option.

Missing File in CLS-DE.ipynb

Hi,

first of all thanks for sharing your code, this is great work!
I am interested in using your models and am playing around with the CLS-DE.ipynb notebook file that you provide to fine-tune on a dataset.

Everything works smoothly until I run into a missing file error:

> cls_dataset.load_clas_databunch(bs=exp.finetune_lm.bs).show_batch()


Running tokenization: 'lm-notst' ...

---------------------------------------------------------------------------

FileNotFoundError                         Traceback (most recent call last)

<ipython-input-13-51e69f217e1f> in <module>()
----> 1 cls_dataset.load_clas_databunch(bs=exp.finetune_lm.bs).show_batch()

11 frames

/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1889         kwds["usecols"] = self.usecols
   1890 
-> 1891         self._reader = parsers.TextReader(src, **kwds)
   1892         self.unnamed_cols = self._reader.unnamed_cols
   1893 

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()

FileNotFoundError: [Errno 2] File data/cls/de-music/de.train.csv does not exist: 'data/cls/de-music/de.train.csv'

Can you offer any help on where one can get that file, or, other steps that I may be missing?

Thank you!

Vocabulary size for sentencepiece

Before commit 5ffbe8b (which was required to make sure the vocab for testing would be small), the sentencepiece model's vocabulary size defaulted to 30,000, whereas now it is the same as pretrain_lm()'s default of 60,000.

In my experience in training the sentencepiece, I've found that very large vocabularies degrade performance and in most cases lead to memory failure, at least on small/medium machines (it did on mine). Even the official documentation suggests vocabularies of up to 32,000 tokens.

Any thoughts on how we may resolve this discrepancy?

Error loading pretrained DE model

Hi Piotr,

I'm facing some issues using the master branch for training a classifier from a pretrained DE model provided in the utility scripts.

I get the following error

Traceback (most recent call last):                                                                     
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main                     
    "__main__", mod_spec)                                                                 
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code                  
    exec(code, run_globals)                                                                        
  File "/home/vlaurmaa/coding/python/ulmfit-multilingual/ulmfit/__main__.py", line 188, in <module>
    fire.Fire(ULMFiT())                                                                                                      
  File "/home/vlaurmaa/coding/python/ulmfit-multilingual/new_env/lib/python3.7/site-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)                                                 
  File "/home/vlaurmaa/coding/python/ulmfit-multilingual/new_env/lib/python3.7/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)                                                                                                              
  File "/home/vlaurmaa/coding/python/ulmfit-multilingual/new_env/lib/python3.7/site-packages/fire/core.py", line 675, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)                                                               
  File "/home/vlaurmaa/coding/python/ulmfit-multilingual/ulmfit/train_clas.py", line 84, in train_cls
    learn.load('cls_best')                                                                                                            
  File "/home/vlaurmaa/coding/python/ulmfit-multilingual/new_env/lib/python3.7/site-packages/fastai/basic_train.py", line 254, in load
    get_model(self.model).load_state_dict(state, strict=strict)                                                                                       
  File "/home/vlaurmaa/coding/python/ulmfit-multilingual/new_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))                                                                     
RuntimeError: Error(s) in loading state_dict for SequentialRNN:                                                                                                                              
        Missing key(s) in state_dict: "0.module.rnns.0.linear.weight_raw", "0.module.rnns.0.linear.module.weight", "0.module.rnns.0.linear.module.bias", "0.module.rnns.1.linear.weight_raw",
 "0.module.rnns.1.linear.module.weight", "0.module.rnns.1.linear.module.bias", "0.module.rnns.2.linear.weight_raw", "0.module.rnns.2.linear.module.weight", "0.module.rnns.2.linear.module.bi
as", "0.module.rnns.3.linear.weight_raw", "0.module.rnns.3.linear.module.weight", "0.module.rnns.3.linear.module.bias".                                                                      
        Unexpected key(s) in state_dict: "0.module.rnns.0.layers.0.linear.weight_raw", "0.module.rnns.0.layers.0.linear.module.weight", "0.module.rnns.0.layers.0.linear.module.bias", "0.mod
ule.rnns.1.layers.0.linear.weight_raw", "0.module.rnns.1.layers.0.linear.module.weight", "0.module.rnns.1.layers.0.linear.module.bias", "0.module.rnns.2.layers.0.linear.weight_raw", "0.modu
le.rnns.2.layers.0.linear.module.weight", "0.module.rnns.2.layers.0.linear.module.bias", "0.module.rnns.3.layers.0.linear.weight_raw", "0.module.rnns.3.layers.0.linear.module.weight", "0.mo
dule.rnns.3.layers.0.linear.module.bias".

I tried several versions of fastai but if I go too far back some other requirements break.

The current requirements file I use is the following :

fire>=0.1.3
cupy>=5.0.0
scikit-learn>=0.20
sacremoses>=0.0.5
sentencepiece
ninja==1.9.0.post1
fastai==1.0.46
torch==1.2.0

Are these the correct requirements? Any other ideas to fix the issue?

I can overcome this issue by setting strict=False in the loading but it's a bit hacky and I'm not sure everything will work as expected. On IMDB I get 93.5% accuracy.

EDIT : Even with the hacky fix, evaluation fails with the error

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/vlaurmaa/coding/python/ulmfit-multilingual/ulmfit/__main__.py", line 188, in <module>
    fire.Fire(ULMFiT())
  File "/home/vlaurmaa/coding/python/ulmfit-multilingual/env/lib/python3.7/site-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/vlaurmaa/coding/python/ulmfit-multilingual/env/lib/python3.7/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "/home/vlaurmaa/coding/python/ulmfit-multilingual/env/lib/python3.7/site-packages/fire/core.py", line 675, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/vlaurmaa/coding/python/ulmfit-multilingual/ulmfit/train_clas.py", line 102, in train_cls
    return self.validate_cls('cls_best', bs=bs, data_cls=data_clas, data_tst=data_tst, learn=None)
  File "/home/vlaurmaa/coding/python/ulmfit-multilingual/ulmfit/train_clas.py", line 110, in validate_cls
    learn.load(save_name)
  File "/home/vlaurmaa/coding/python/ulmfit-multilingual/env/lib/python3.7/site-packages/fastai/basic_train.py", line 279, in load
    get_model(self.model).load_state_dict(state, strict=False)
  File "/home/vlaurmaa/coding/python/ulmfit-multilingual/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 845, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for SequentialRNN:
        size mismatch for 1.layers.6.weight: copying a param with shape torch.Size([24, 50]) from checkpoint, the shape in current model is torch.Size([23, 50]).
        size mismatch for 1.layers.6.bias: copying a param with shape torch.Size([24]) from checkpoint, the shape in current model is torch.Size([23]).
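
The first error above shows checkpoint keys of the form "0.module.rnns.N.layers.0.linear..." while the current model expects "0.module.rnns.N.linear...". As a sketch only, and assuming the extra ".layers.0" segment is the sole difference (which the error message suggests but does not prove), the keys could be renamed before loading instead of relying on strict=False. The function `remap_qrnn_keys` is hypothetical, and the string values below merely stand in for tensors.

```python
import re

# Matches the extra ".layers.0" segment that follows ".rnns.<index>".
_LAYERS_SEGMENT = re.compile(r"(\.rnns\.\d+)\.layers\.0\.")


def remap_qrnn_keys(state):
    """Return a copy of a state dict with '.layers.0' removed after '.rnns.N'."""
    return {_LAYERS_SEGMENT.sub(r"\1.", key): value for key, value in state.items()}


# Placeholder values stand in for the real tensors.
checkpoint = {
    "0.module.rnns.0.layers.0.linear.weight_raw": "w0",
    "0.module.rnns.0.layers.0.linear.module.bias": "b0",
    "0.module.rnns.3.layers.0.linear.module.weight": "w3",
    "1.layers.6.weight": "head",
}
remapped = remap_qrnn_keys(checkpoint)
```

Keys outside the rnns modules, such as "1.layers.6.weight", are left untouched. This renaming does not address the second error: the size mismatch (24 vs 23 rows) appears to concern the output size of the classifier head and would need to be resolved separately.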

Postprocess_wikitext does not separate wikipedia articles.

The LM needs to be trained on long articles to be useful for downstream tasks. We solve it in wikitext-103 by splitting articles as follows:

import re

def istitle(line):
    return len(re.findall(r'^ = [^=]* = $', line)) != 0

postprocess_wikitext needs to add similar separators to the articles so that the BOS and EOS tokens are trained correctly. Until this is fixed, training an LM on a custom Wikipedia dump will break.
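
To illustrate how the title check can drive the splitting, here is a minimal sketch. It assumes the wikitext-103 line format, in which top-level titles look like " = Title = ". The function `split_articles` is a hypothetical helper, not the repository's implementation.

```python
import re


def istitle(line):
    # Same check as quoted in the issue: exactly one "=" on each side.
    return len(re.findall(r'^ = [^=]* = $', line)) != 0


def split_articles(lines):
    """Group lines into articles, starting a new article at each top-level title."""
    articles, current = [], []
    for line in lines:
        if not current and not line.strip():
            continue  # skip blank lines that precede an article
        if istitle(line) and current:
            articles.append(current)
            current = []
        current.append(line)
    if current:
        articles.append(current)
    return articles


lines = [
    " \n",
    " = First Article = \n",
    " Some text . \n",
    " = = Section = = \n",
    " More text . \n",
    " = Second Article = \n",
    " Other text . \n",
]
articles = split_articles(lines)
```

Section headings such as " = = Section = = " do not match the pattern, so they stay inside the current article. If the input contains no matching title line at all, everything ends up in a single article.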

Why is ulmfit/postprocess_wikitext.py necessary?

As building the vocab is now part of the Data Block APIs pre-processing, what is the need for this step in preparing the wikipedia data for training?

In particular, my questions are:

Why build the vocab here and convert OOV tokens to <unk> when it can be (and is) done as a pre-processing step via the DataBlock API?
What is the purpose of the "-unk" folder? It doesn't seem like it's being used anywhere else (though I may be mistaken), so I'm just wondering why it exists.
What is the reasoning behind the "replace_numbers" function?

I'm looking at some of the approaches the fastai folks are using to build pre-trained LMs in various languages, and I don't see them implementing the same post-processing as here (in fact, the approaches all seem to vary a little). Anyhow, I'm just trying to get an understanding of how things should or need to be processed in a fastai v1 world.

Thanks

Basic support for testing ULMFiT against XNLI

Improve the way we run experiments - saving, hyper params selection

I want to address the following issues:

There are quite a few collisions in the way models are currently saved; for example, different tokenization strategies may reuse the same tokens, which is obviously incorrect.
There is no easy way to select the tokenization strategy; currently it is selected by boolean parameters, which isn't the best way to handle it.
All the hyper-parameters must be passed externally via command-line arguments from pretrain_lm to train_clas, which is very error-prone.

Here is how I intend to fix this:

re. 1

Each LM model will get its own folder under which all files are saved (itos, weights, sp.model, etc.). Each classification model will be stored under the LM model's folder, in its own named folder. This will give us a quick way to remove the experiments we don't want; moreover, we will know exactly how the models were created. Here is an example structure:

-  data/wiki/wikitext-103/models/
    - sp_nh4_30epoch.m
          - itos.pkl
          - sp.model
          - sp.vocab
          - info.json # all hyper params, and the performance info 
          - train.log
          - weights.pth
          - imdb/large-head.m
               - itos.pkl
               - sp.model
               - sp.vocab
               - info.json
               - train.log
               - weights-e1.pth
               - weights-e10.pth
               - weights-e100.pth
               - weights-best.pth

re. 2

Instead of booleans we will simply select the strategy by name.
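
Selecting a strategy by name could look roughly like the sketch below. The registry, the strategy names "words" and "chars", and `get_tokenizer` are all made up for illustration and do not reflect the repository's actual tokenizers.

```python
# Hypothetical registry mapping a strategy name to a tokenizer function.
TOKENIZATION_STRATEGIES = {
    "words": lambda text: text.split(),
    "chars": lambda text: list(text),
}


def get_tokenizer(name):
    """Look up a tokenization strategy by name, failing loudly on typos."""
    try:
        return TOKENIZATION_STRATEGIES[name]
    except KeyError:
        known = ", ".join(sorted(TOKENIZATION_STRATEGIES))
        raise ValueError(
            f"Unknown tokenization strategy {name!r}; expected one of: {known}"
        ) from None


tokens = get_tokenizer("words")("the quick brown fox")
```

A single name avoids contradictory combinations of flags, and an unknown name produces an error that lists the valid choices.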

re. 3

I will make it easy to get the required hyper-params from the selected pretrained LM model (info.json), plus we will make it possible to define some experiments directly in Python and refer to them by name.
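
As a rough sketch of the info.json idea, the classifier step could read the hyper-parameters stored next to the LM weights rather than receiving them on the command line. The JSON layout, the helper names and the example values below are assumptions made for illustration; only the folder names come from the structure above.

```python
import json
import tempfile
from pathlib import Path


def save_info(model_dir, hyper_params, metrics):
    """Write hyper-parameters and performance info to <model_dir>/info.json."""
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    info = {"hyper_params": hyper_params, "metrics": metrics}
    (model_dir / "info.json").write_text(json.dumps(info, indent=2), encoding="utf-8")


def load_hyper_params(model_dir):
    """Read back the hyper-parameters saved alongside a model."""
    info = json.loads((Path(model_dir) / "info.json").read_text(encoding="utf-8"))
    return info["hyper_params"]


def classifier_dir(lm_dir, dataset, name):
    """Classifier experiments live in their own folder under the LM folder."""
    return Path(lm_dir) / dataset / name


with tempfile.TemporaryDirectory() as tmp:
    lm_dir = Path(tmp) / "data/wiki/wikitext-103/models/sp_nh4_30epoch.m"
    # Illustrative values only; metrics are left empty rather than invented.
    save_info(lm_dir, {"tokenizer": "sp", "max_vocab": 30000, "qrnn": True},
              {"valid_loss": None})
    inherited = load_hyper_params(lm_dir)
    clas_relative = classifier_dir(lm_dir, "imdb", "large-head.m").relative_to(lm_dir).as_posix()
```

Because the classifier folder is nested under the LM folder, deleting an LM experiment also removes everything derived from it.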

Speed of QRNN vs LSTM re size of batch

I've just got access to a V100 with 32 GB of memory and I'm testing the speed of our LM training.
It seems that QRNN isn't faster when a larger batch size is used, which is very odd.
What is even more worrying is that it is as fast on the V100 as on the 1080 Ti :/
Here are some results:

QRNN on v100

(fastaiv1) n-waves@GV100:~/workspace/ulmfit-multilingual$ time python -m ulmfit.pretrain_lm data/wiki/wikitext-2 --qrnn=True --cuda-id=0 --name=wt-103 --num_epochs=1  --bs=300
Batch size: 300
Max vocab: 60000
Using QRNNs...
Size of vocabulary: 33279
First 10 words in vocab: the, <pad>, ,, ., of, <unk>, and, in, to, <eos>
Saving vocabulary as data/wiki/wikitext-2/models
epoch  train_loss  valid_loss  accuracy
1      7.082489    6.515145    0.120889                                                            
Saving models at data/wiki/wikitext-2/models
Saving optimiser state at data/wiki/wikitext-2/models/qrnn3_wt-103_state.pth

real	0m52,769s
user	1m15,645s
sys	0m17,174s
(fastaiv1) n-waves@GV100:~/workspace/ulmfit-multilingual$ time python -m ulmfit.pretrain_lm data/wiki/wikitext-2 --qrnn=True --cuda-id=0 --name=wt-103 --num_epochs=1  --bs=64
Batch size: 64
Max vocab: 60000
Using QRNNs...
Size of vocabulary: 33279
First 10 words in vocab: the, <pad>, ,, ., of, <unk>, and, in, to, <eos>
Saving vocabulary as data/wiki/wikitext-2/models
epoch  train_loss  valid_loss  accuracy
1      5.429696    5.269462    0.228376                                                              
Saving models at data/wiki/wikitext-2/models
Saving optimiser state at data/wiki/wikitext-2/models/qrnn3_wt-103_state.pth

real	0m51,677s
user	0m37,639s
sys	0m13,337s
(fastaiv1) n-waves@GV100:~/workspace/ulmfit-multilingual$ time python -m ulmfit.pretrain_lm data/wiki/wikitext-2 --qrnn=True --cuda-id=0 --name=wt-103 --num_epochs=1  --bs=32
Batch size: 32
Max vocab: 60000
Using QRNNs...
Size of vocabulary: 33279
First 10 words in vocab: the, <pad>, ,, ., of, <unk>, and, in, to, <eos>
Saving vocabulary as data/wiki/wikitext-2/models
epoch  train_loss  valid_loss  accuracy
1      5.200118    5.109034    0.238820                                                              
Saving models at data/wiki/wikitext-2/models
Saving optimiser state at data/wiki/wikitext-2/models/qrnn3_wt-103_state.pth

real	0m55,291s
user	0m40,788s
sys	0m13,070s

QRNN on 1080 TI

time python -m ulmfit.pretrain_lm data/wiki/wikitext-2 --qrnn=True --cuda-id=1 --name=wt-103 --num_epochs=1  --bs=64
Batch size: 64
Max vocab: 60000
Using QRNNs...
Size of vocabulary: 33279
First 10 words in vocab: the, <pad>, ,, ., of, <unk>, and, in, to, <eos>
Saving vocabulary as data/wiki/wikitext-2/models
epoch  train_loss  valid_loss  accuracy
1      5.429705    5.272144    0.227702
Saving models at data/wiki/wikitext-2/models
Saving optimiser state at data/wiki/wikitext-2/models/qrnn3_wt-103_state.pth
itos_fname: data/wiki/wikitext-2/models/itos_wt-103.pkl
accuracy:   tensor(0.2278)
python -m ulmfit.pretrain_lm data/wiki/wikitext-2 --qrnn=True --cuda-id=1   

51.68s user 17.25s system 99% cpu 1:09.19 total

@sebastianruder do you remember how fast the previous version of QRNN was in old fastai?

error in LM pretraining

What did I do?

Checked out the pretrain-lm branch because it has clear instructions on how to pretrain an LM (#57).
Installed required packages.
Executed bash prepare_wiki.sh de
Executed python -W ignore -m multifit new multifit_paper_version replace_ --name my_lm - train_ --pretrain-dataset data/wiki/de-100
Received the following traceback:
python -W ignore -m multifit new multifit_paper_version replace_ --name my_lm - train_ --pretrain-dataset data/wiki/de-100
Setting LM weights seed seed to 0
Running tokenization: 'lm-notst' ...
Wiki text was split to 1 articles
Wiki text was split to 1 articles
Wiki text was split to 1 articles
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/ubuntu/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ubuntu/multifit/multifit/__main__.py", line 16, in <module>
fire.Fire(Experiment())
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/fire/core.py", line 138, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/fire/core.py", line 468, in _Fire
target=component.__name__)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/ubuntu/multifit/multifit/training.py", line 587, in train_
self.pretrain_lm.train_(pretrain_dataset)
File "/home/ubuntu/multifit/multifit/training.py", line 275, in train_
learn = self.get_learner(data_lm=dataset.load_lm_databunch(bs=self.bs, bptt=self.bptt, limit=self.limit))
File "/home/ubuntu/multifit/multifit/datasets/dataset.py", line 208, in load_lm_databunch
limit=limit)
File "/home/ubuntu/multifit/multifit/datasets/dataset.py", line 258, in load_n_cache_databunch
databunch = self.databunch_from_df(bunch_class, train_df, valid_df, **args)
File "/home/ubuntu/multifit/multifit/datasets/dataset.py", line 271, in databunch_from_df
**args)
File "/home/ubuntu/multifit/fastai_contrib/text_data.py", line 147, in make_data_bunch_from_df
TextList.from_df(valid_df, path, cols=text_cols, processor=processor))
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/fastai/data_block.py", line 434, in __init__
if not self.train.ignore_empty and len(self.train.items) == 0:
TypeError: len() of unsized object

From initial debugging, train.items is an ndarray with shape (). When I print it, it returns articles in German. I suspect the log line "Wiki text was split to 1 articles" points to the problem: I reckon the wiki text should be split into more than one article. So maybe something goes wrong in read_wiki_articles() in dataset.py. This is my educated guess, but I don't know where to go from here.

About the data and train on my own datasets

Hi,
I have two questions. First, how can I get the CLS datasets? Do I need to run prepare_cls.py? If so, what is the 'url_prefix'? I cannot open 'https://user:passwd@server/path/'.
Second, how can I train on my own datasets?

Compare QRNN Performance Metrics

We need to create benchmarks to demonstrate the performance gains of QRNN. @sebastianruder, would you mind helping out with guidelines on which measurements are best and which variables to control for?

I am thinking of fixing the vocabulary size to 15k or 30k and comparing the speed of QRNN and LSTM on:

Language-Model (1K training sentences of length bptt)
Unfrozen Classifier (1K training examples)

Is that what you had in mind?
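
One way such a comparison could be timed is sketched below. The `benchmark` helper and `dummy_step` are placeholders. In a real run, `step` would execute one training step of the QRNN or LSTM model, and on a GPU it should call torch.cuda.synchronize() before returning, since CUDA kernels run asynchronously and wall-clock timings are otherwise misleading.

```python
import statistics
import time


def benchmark(step, n_iters=20, warmup=3):
    """Time repeated calls to `step`, discarding a few warm-up iterations."""
    for _ in range(warmup):
        step()
    timings = []
    for _ in range(n_iters):
        start = time.perf_counter()
        step()
        timings.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(timings),
        "min_s": min(timings),
        "n": len(timings),
    }


def dummy_step():
    # Stand-in workload; replace with one forward/backward pass of the model.
    sum(i * i for i in range(10_000))


result = benchmark(dummy_step, n_iters=5, warmup=1)
```

Reporting the median rather than the mean makes the comparison less sensitive to occasional slow iterations. Keeping vocabulary size, batch size and bptt identical across both architectures isolates the effect of the recurrent layer.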

Train on my own data

Hi,
I want to ask a question: how can I train on my own data?
Thanks.

Saliency maps

Hi everyone,
I am working on the creation of saliency maps for multifit classification models, with the ultimate goal of highlighting over the text those parts that are decisive in making a prediction.

In order to make the saliency maps, I need, on the one hand, to obtain the activations of a designated layer of the model using a hook and, on the other hand, the tokenized text in the form of a tensor. In this way I can relate the activations to the input text that caused them.

My question is how to get the input tensor and the tokenized text that the MultiFiT model receives when making a prediction, because those operations are done internally by MultiFiT.

Thanks

Test ULMFit on CLS (Cross Lingual Sentiment)

Dataset lives at http://www.uni-weimar.de/medien/webis/corpora/corpus-webis-cls-10/cls-acl10-unprocessed.tar.gz

http://www.aclweb.org/anthology/P10-1114

Is it possible to load several "learn" models at a time?

I don't understand very well how the model works, but I was wondering whether each classification model loads the language model independently and then does the evaluation. I want to use it in production with different classification datasets, but each of them needs a lot of CPU memory.

Different size of CLS unsupervised data between .csv and original .xml files

Here the de-books data used for fine-tuning the LM has size 152523 + 16947 = 169470, which matches the original data in the XML file, whose total size is also 169470. However, when I run python prepare_cls.py https://storage.googleapis.com/ulmfit/cls, the downloaded de.unsup.csv file has 29999 items. I checked, and the sizes of the train and test sets correspond to the logs in the link. So for some reason not all the data is currently available in the .csv files, and thus the achieved results are worse than the ones from the link (which correspond to the results in the paper). Is there any explanation for that?
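
The size check described above can be reproduced with a few lines. This is a sketch that assumes the CSV has no header row and uses standard quoting (neither is confirmed here). `count_rows` and `coverage` are hypothetical helpers, and the expected total is simply the figure quoted in the issue.

```python
import csv
import tempfile
from pathlib import Path

EXPECTED_UNSUP = 152523 + 16947  # train + valid sizes quoted in the issue


def count_rows(csv_path):
    """Count CSV records (not physical lines), so quoted multi-line text counts once."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return sum(1 for _ in csv.reader(f))


def coverage(found, expected=EXPECTED_UNSUP):
    """Fraction of the expected unsupervised examples that are actually present."""
    return found / expected


# Tiny self-contained demonstration with a multi-line quoted field.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.unsup.csv"
    with open(demo, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerows([[0, "first review"], [0, "second\nreview"], [0, "third review"]])
    demo_rows = count_rows(demo)

downloaded_share = coverage(29999)
```

With the numbers from the issue, the downloaded file covers about 17.7% of the 169470 unsupervised examples.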

Add test to check the whole pipeline

Add a test to check the whole pipeline and show how to train a classification task for EN (I'm working on it)

pytest error

Hi Piotr,

I tried running "pytest ." on your repo, but I got the following error message; maybe the current language_model_learner() in fastai is no longer compatible with your code:

platform linux -- Python 3.7.2, pytest-4.3.1, py-1.8.0, pluggy-0.9.0
rootdir: /home/cahya/Work/Machine Learning/FastAI/ulmfit-multilingual, inifile:
plugins: xdist-1.27.0, forked-1.0.2
collected 13 items

tests/test_end_to_end.py F...FFF                                         [ 53%]
tests/test_text_data.py FF                                               [ 69%]
tests/test_text_train.py EFFF                                            [100%]

==================================== ERRORS ====================================
_______________________ ERROR at setup of test_val_loss ________________________

    @pytest.fixture(scope="module")
    def learn():
        path, df_trn, df_val = prep_human_numbers()
        data = TextLMDataBunch.from_df(path, df_trn, df_val, tokenizer=Tokenizer(BaseTokenizer))
>       learn = language_model_learner(data, emb_sz=100, nl=1, drop_mult=0.1)
E       TypeError: language_model_learner() missing 1 required positional argument: 'arch'

/home/cahya/Work/Machine Learning/FastAI/ulmfit-multilingual/tests/test_text_train.py:39: TypeError
...

Where can I find the dataset de.train.csv?

I am trying to run the notebook CLS-DE.ipynb, but I don't know where I can find the dataset. When I run the following line in the notebook:

cls_dataset.load_clas_databunch(bs=exp.finetune_lm.bs).show_batch()

I got this error message:

FileNotFoundError: [Errno 2] File data/cls/de-music/de.train.csv does not exist: 'data/cls/de-music/de.train.csv'

how to setup the format of data as input

Hi, I am reproducing your nice script, but I don't know how to set up the format of the input data, namely the final one. To make it clear, could you give me an example? For example, the dataset cls-acl10-unprocessed is actually an .xml file, so will it be processed into a .csv file? What will the .csv file finally look like? A simple snippet would be enough. Thank you in advance.

Add subword tokenization and support training with subword units

Test ULMFiT on MLDoc and create models for 9 languages

Suggested models to test MLDoc against. Feel free to add other combinations and mark the one you would like to work on:

LSTM

QRNN

Test Bert Multilingual on MLDoc

LICENSE

Can you please add a license to the code and pretrained models?

Specifying a validation set

I'm training a language model similar to what has been shown here https://github.com/n-waves/multifit/blob/master/notebooks/CLS-JA.ipynb

While running cls_dataset.load_clas_databunch(bs=exp.finetune_lm.bs).show_batch()
I'm getting this output

Running tokenization: 'lm-notst' ...
Validation set not found using 10% of trn
Data lm-notst, trn: 26925, val: 2991
Size of vocabulary: 15000
First 20 words in vocab: ['xxunk', 'xxpad', 'xxbos', 'xxfld', 'xxmaj', 'xxup', 'xxrep', 'xxwrep', '', '▁', '▁,', '▁.', '▁в', 'а', 'и', 'е', '▁и', 'й', '▁на', 'х']
Running tokenization: 'cls' ...
Data cls, trn: 26925, val: 2991
Running tokenization: 'tst' ...
/home/explorer/miniconda3/envs/fast/lib/python3.6/site-packages/fastai/data_block.py:537: UserWarning: You are labelling your items with CategoryList.
Your valid set contained the following unknown labels, the corresponding items have been discarded.
201, 119, 192, 162, 168...
if getattr(ds, 'warn', False): warn(ds.warn)
Data tst, trn: 2991, val: 7448

I assume this to be a problem with misrepresentation of labels in a validation set that was inferred automatically. Is there a way to explicitly pass a validation set?
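
One way to avoid labels that appear only in the validation set is to build the split per label instead of taking a random 10%. The following is a minimal sketch using only the standard library; `stratified_split` is a hypothetical helper and not part of multifit or fastai, and the label values are made up for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, valid_pct=0.1, seed=42):
    """Return (train_idx, valid_idx) so every label keeps at least one training item."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        # Never move every item of a label into the validation set.
        n_valid = min(int(len(idxs) * valid_pct), len(idxs) - 1)
        valid_idx.extend(idxs[:n_valid])
        train_idx.extend(idxs[n_valid:])
    return sorted(train_idx), sorted(valid_idx)

# Made-up labels: one rare class and two larger ones.
labels = ["201"] * 1 + ["119"] * 20 + ["192"] * 30
train_idx, valid_idx = stratified_split(labels)
print(len(train_idx), len(valid_idx))
```

In fastai v1 the two index lists can be passed to `split_by_idxs` in the data block API; how to wire that into `load_clas_databunch` is left open here.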

use_test_for_validation - no longer needed?

In ulmfit/train_clas.py there is a somewhat confusing setting use_test_for_validation.

Am I correct to assume that it's just a work-around for this bug?
fastai/fastai#1292

If so, the proper fix has been merged a few days ago: fastai/fastai#1293

In such a case it could be useful to get rid of this work-around. I can give it a try, but only after a confirmation that it is not needed for other reasons.

Add training and fine-tuning of bidirectional LM - following ELMO

BiLM implementation
Saving the weights and loading them later
Handle AR & TAR
Using biLM for classification tasks to see how much we can improve over SOTA. The fastest and most meaningful comparison will be if we manage to improve sentiment classification over the original ULMFiT paper. @sebastianruder can you propose a dataset and do you have the best parameters?

KeyError: ‘1.decoder.bias’ in convert_weights_with_prefix

While running the train_clas.py script, I get a KeyError in
convert_weights_with_prefix. I also tried the refactored version of the same function.

    dec_bias, enc_wgts = wgts[prefix+'1.decoder.bias'], wgts[prefix+'0.encoder.weight']
KeyError: '1.decoder.bias'

If run with fine_tune off, there is another error:
Error(s) in loading state_dict for SequentialRNN:
Missing key(s) in state_dict: "0.fwd_lm.encoder.weight", "0.fwd_lm.encoder_dp.emb.weight", ...
Unexpected key(s) in state_dict: "fwd_lm.0.encoder.weight", "fwd_lm.0.encoder_dp.emb.weight", ...

ValueError invalid literal for int in xnli labels in train_clas.py

Got ValueError: invalid literal for int() with base 10: 'neutral' while running train_clas.py
Params:
!python train_clas.py --data_dir 'data' --lang 'ar' --cuda_id 0 --pretrain_name 'arabic2' --model_dir 'data/models' --qrnn False --num_lm_epochs 5 --fine_tune True --max_vocab 60000 --bs 32 --bptt 70 --name='arabic2' --dataset='xnli' --bidir True --ds_pct 1.0 --train True
Output:

Dataset: xnli. Language: ar.
BiLM
Reading the data...
Reading data/xnli/XNLI-MT-1.0/multinli/multinli.train.ar.tsv...
Reading data/xnli/XNLI-1.0/xnli.test.tsv...
Reading data/xnli/XNLI-1.0/xnli.dev.tsv...
Train size: 392702. Valid size: 2490. Test size: 5010.
Traceback (most recent call last):
  File "train_clas.py", line 194, in <module>
    fire.Fire(new_train_clas)
  File "/usr/local/lib/python3.6/dist-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/usr/local/lib/python3.6/dist-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/usr/local/lib/python3.6/dist-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "train_clas.py", line 81, in new_train_clas
    data_clas, data_lm = get_datasets(dataset, dataset_dir, bptt, bs, lang, max_vocab, ds_pct, lm_type=lm_type)
  File "train_clas.py", line 181, in get_datasets
    lbls[split] = np.array([np.array(e, dtype=np.int) for e in lbls[split]])
  File "train_clas.py", line 181, in <listcomp>
    lbls[split] = np.array([np.array(e, dtype=np.int) for e in lbls[split]])
ValueError: invalid literal for int() with base 10: 'neutral'

Is there something I can do to the xnli dataset to avoid this error?
Update: I already tried cleaning the 3 tsv files (removed nulls, error lines) and coded all label fields as 0, 1, 2 ( 'label' in train which seems to be the source of the error). Same error. Could that be a caching issue? I also reset kernel (colab instance is read only file system).
Update 2[solved]: issue caused by quotes in file. Rather than cleaning the tsv files, I converted labels to ints before the xnli reader returned.
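
The workaround from Update 2 could look roughly like the sketch below: strip stray quotes from each label and map it to an integer before the reader returns. `LABEL_TO_INT` and `label_to_int` are hypothetical names, and the integer order is an assumption rather than what train_clas.py expects.

```python
# Hypothetical mapping; the integer order is an assumption.
LABEL_TO_INT = {
    "entailment": 0,
    "neutral": 1,
    "contradiction": 2,
    "contradictory": 2,  # assumed alternative spelling, harmless if unused
}

def label_to_int(raw):
    """Strip whitespace and stray quotes, then map a label to an int."""
    cleaned = raw.strip().strip("\"'").lower()
    if cleaned.isdigit():         # already numeric, e.g. '2' or '"2"'
        return int(cleaned)
    return LABEL_TO_INT[cleaned]  # raises KeyError on an unexpected label

labels = ['neutral', '"entailment"', ' contradiction ', '"1"']
converted = [label_to_int(label) for label in labels]
print(converted)  # [1, 0, 2, 1]
```

Raising a KeyError on an unknown label is deliberate: it surfaces malformed rows at read time instead of later inside np.array.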

Fix QRNN performance issue

QRNN training is still slow/defective since some recent update. Need to look into that.

Fix suggested as PR to fastai repo:
fastai/fastai#1138

todo test

Tokenizer

Hi everyone,

Does anyone know which tokenizer MultiFiT uses (especially for Spanish texts), as well as the method used to vectorize them? I'd like to be able to tokenize and vectorize texts in the same way that MultiFiT does internally.

Multifit inference problem

Hi everyone.
Maybe you can help me with something.
Once the multifit model is trained with my own dataset, I export it to a .pkl file, in order to use it later to make predictions. The problem comes when I load the model from a different machine than the one that trained the multifit. The model is loaded with load_learner(), but when I try to make a prediction an error related to SentencePiece appears, followed by this error message :

OSError: Not found: "/home/.fastai/data/.../tmp/spm.model": No such file or directory

Maybe I need to save the model in another way in order to load it and make predictions correctly?

Multi language model

It seems that MultiFiT prefers fine-tuning over cross-lingual pre-trained models.

Calculation of Perplexity isn't working (refactoring and master branch)

I find that when computing the test log loss and perplexity here: https://github.com/n-waves/ulmfit-multilingual/blob/eea9be09db8520ad87d28551a88a728e170c455e/ulmfit/pretrain_lm.py#L167
the data that trn_path points at is used, not the data that tst_path points at.

Is this a typo, or is it intentional?

Since the test data is never used in the pretrain_lm section, I am inclined to think this is a typo.
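
For reference, perplexity is the exponential of the mean per-token cross-entropy (in nats), so the number only describes the split whose losses go into it. A minimal sketch, with made-up loss values and a hypothetical `perplexity` helper that is not taken from pretrain_lm.py:

```python
import math

def perplexity(token_losses):
    """Perplexity = exp(mean per-token cross-entropy, in nats)."""
    if not token_losses:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_losses) / len(token_losses))

# Made-up per-token losses for two different splits.
trn_losses = [3.2, 3.4, 3.3]
tst_losses = [3.8, 3.9, 4.0]
print(perplexity(trn_losses), perplexity(tst_losses))
```

If the training split is evaluated in place of the test split, the reported value is the first number rather than the second.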

Repeated title in pretrain_lm (in read_wiki_articles)

When running ulmfit.train_lm, the wiki articles are read (separated by = title =) and the resulting dataframe has the title repeated as
= title =
and
title
Do we need both?

Data split of CLS dataset

Could you please provide the data split in your experiments? I couldn't find it in your paper or the script.
I can only find train, test and unlabeled set. Where is the dev/valid set?
Thanks.

Polyglot Language Model

Pretrain a language model for two Wikipedias (such as en and de) with a single vocabulary from SentencePiece
Prepare RCV data
Prepare CLS data
Test zero shot classification accuracy on XNLI/CLS/RCV. The hope is that with none or only a few training samples from a target language and full training set from a source language we can learn a good classifier
Test alignment loss term from Europarl parallel corpus (such as what they do in XNLI paper)

test model on other languages

Hi,
I want to ask a question: if I want to take a model trained on English and test it on other languages, how do I run the code?

Kernel restarted

I have some other problems running the notebook CLS-DE.ipynb. If I use conda and install the default pytorch (1.3.1), after the command

exp.finetune_lm.train_(cls_dataset, num_epochs=20)

I get the following error message:

ImportError: /tmp/torch_extensions/forget_mult_cuda/forget_mult_cuda.so: undefined symbol: _ZN3c106Symbol14fromQualStringERKSs

Then I installed pytorch from the pytorch channel as follows:

conda install pytorch=1.3.1 torchvision cudatoolkit=10.0 -c pytorch

The issue with "undefined symbol" is gone, but the kernel was restarted during the first epoch of exp.finetune_lm.train_(cls_dataset, num_epochs=20)

Is this a known problem? The following may be the relevant Python modules:

$ conda list| egrep 'torch|^fastai|cuda|nvid'
_pytorch_select           0.2                       gpu_0  
cudatoolkit               10.0.130                      0  
cudnn                     7.6.5                cuda10.0_0  
fastai                    1.0.61                        1    fastai
nvidia-ml-py3             7.352.0                    py_0    fastai
pytorch                   1.3.1           cuda100py37h53c1284_0  
torchvision               0.4.2           cuda100py37hecfc37a_0

Thanks.

Get activations of a specific layer of the multifit model

Hello everybody,
Does anyone know how to get the activations of an intermediate layer when I make a prediction? Everything I have tried so far gives me errors. To give you a little context:

The first thing I do is extract the model layer (for example the first embeddings layer):
self.specific_layer = list(self.classifier.model.modules())[0][0].module.encoder

The second thing I do is put a hook on the layer I want to get the activations from:
def hook_function(module, grad_in, grad_out): self.gradients = grad_out[0]
self.specific_layer.register_backward_hook(hook_function)

Then I vectorize the text input using SentencePiece and give it as input to the model like this, similar to how I've seen it done in other fastai models:
model_output = self.classifier.model(self.inputs)

But when I try to run the above line, it gives me this error:
File "/home/francis/.virtualenvs/my_project/lib/python3.7/site-packages/fastai/text/learner.py", line 261, in forward bs,sl = input.size() AttributeError: 'list' object has no attribute 'size'
Does anyone know what the problem might be, or whether there is a better way to get the activations of an intermediate layer in a MultiFiT model?

Thanks
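
Two details in the snippets above may matter. A backward hook receives gradients, whereas activations come from a forward hook, and the traceback (`bs,sl = input.size()`) suggests the model expects a tensor of shape (batch, seq_len) rather than a Python list. The sketch below shows both points on a toy PyTorch model; the model, the token ids and the `save_activation` name are placeholders, not the MultiFiT classifier.

```python
import torch
import torch.nn as nn

# Toy stand-in for a text model: an embedding layer followed by a linear layer.
model = nn.Sequential(nn.Embedding(10, 4), nn.Linear(4, 2))
activations = {}

def save_activation(module, inputs, output):
    # Forward hooks receive the layer's output, i.e. its activations.
    activations["emb"] = output.detach()

handle = model[0].register_forward_hook(save_activation)

# Token ids as a LongTensor of shape (batch, seq_len), not a Python list.
tokens = torch.tensor([[1, 2, 3]])
logits = model(tokens)
handle.remove()  # remove the hook once the activations are captured

print(activations["emb"].shape)
```

With a real learner, the same hook would be registered on the extracted layer, and the SentencePiece ids would be wrapped in `torch.tensor([...])` on the model's device before the forward call.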

Tweak hyper params so that lstm on IMDB trains well

The hyperparameters selected in train_clas don't work that well :/
The pretrained LM was a toy example trained on wikitext-2, but I guess we will have similar issues with regular models.

python -m ulmfit.train_clas --data_dir data --model_dir data/en/wikitext-2/models  --pretrain_name=wt-2-q
Dataset: imdb. Language: en.
Loading the pickled data...
Train size: 22500. Valid size: 2500. Test size: 25000.
Fine-tuning the language model...
epoch  train_loss  valid_loss  accuracy
1      4.716092    4.573514    0.258413
2      4.632831    4.475942    0.267385
Starting classifier training
epoch  train_loss  valid_loss  accuracy
1      0.553358    0.460404    0.789200
epoch  train_loss  valid_loss  accuracy
1      0.450415    0.323716    0.862000
epoch  train_loss  valid_loss  accuracy
1      0.353984    0.256489    0.895200
2      0.498414    0.364469    0.855200  <--------- either too large LR or some kind of exploding gradients?
3      0.692660    0.687355    0.514800  <----------

@NirantK, have these hyperparameters worked in your experiments?

FileNotFoundError: [Errno 2] No such file or directory: 'data/wiki/es-100/models/f60k/data_save.pkl'

Hi, I ran bash prepare_wiki.sh with $LANG=es, and then when I try to run python -m ulmfit lm --dataset-path data/wiki/es-100 --tokenizer='f' --nl 3 --name 'orig' --max-vocab 60000 --lang es --qrnn=False - train 10 --bs=50 --drop_mult=0 --label-smoothing-eps=0.0 I get this exception:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tertiary/imanol/imanol/ulmfit-multilingual/ulmfit/__main__.py", line 188, in <module>
    fire.Fire(ULMFiT())
  File "/tertiary/imanol/imanol/env_fastai/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/tertiary/imanol/imanol/env_fastai/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/tertiary/imanol/imanol/env_fastai/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "/tertiary/imanol/imanol/ulmfit-multilingual/ulmfit/pretrain_lm.py", line 163, in train_lm
    data_lm = self.load_wiki_data(bs=bs) if data_lm is None else data_lm
  File "/tertiary/imanol/imanol/ulmfit-multilingual/ulmfit/pretrain_lm.py", line 245, in load_wiki_data
    **args)
  File "/tertiary/imanol/imanol/ulmfit-multilingual/ulmfit/pretrain_lm.py", line 253, in lm_databunch
    return self.databunch(name, bunch_class=TextLMDataBunch, *args, **kwargs)
  File "/tertiary/imanol/imanol/ulmfit-multilingual/ulmfit/pretrain_lm.py", line 275, in databunch
    data = load_data(self.cache_dir, fname='lm', bs=bs)
  File "/tertiary/imanol/imanol/env_fastai/lib/python3.6/site-packages/fastai/basic_data.py", line 277, in load_data
    ll = torch.load(source, map_location='cpu') if defaults.device == torch.device('cpu') else torch.load(source)
  File "/tertiary/imanol/imanol/env_fastai/lib/python3.6/site-packages/torch/serialization.py", line 385, in load
    f = f.open('rb')
  File "/usr/lib/python3.6/pathlib.py", line 1183, in open
    opener=self._opener)
  File "/usr/lib/python3.6/pathlib.py", line 1037, in _opener
    return self._accessor.open(self, flags, mode)
  File "/usr/lib/python3.6/pathlib.py", line 387, in wrapped
    return strfunc(str(pathobj), *args)
FileNotFoundError: [Errno 2] No such file or directory: 'data/wiki/es-100/models/f60k/data_save.pkl'

In the directory data/wiki/es-100/models/f60k/ I have lm and itos.pkl, plus an empty lstm_orig.m directory.

I'm using torch==1.1.0 and fastai==1.0.55.

Should I put fastai_contrib subdir under ulmfit dir?

By default, I get an error saying "the fastai_contrib module is not found" when I execute postprocess_wikitext.py.

So I had to move it under the ulmfit dir to make it work properly.

I wonder whether this is a necessary step, or whether you have run into this small glitch too?

Validation performance of LM

How do you validate the performance of your LM in your respective languages?

Assuming that we are training on wikis of varying size, I tried 20% and 1% holdouts. The 20% holdout has a perplexity of about 46 (a little higher than reported for English) and the 1% holdout has 30. The Thai wiki is 1.6G.
