
johngiorgi / declutr


The corresponding code from our paper "DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations". Do not hesitate to open an issue if you run into any trouble!

Home Page: https://aclanthology.org/2021.acl-long.72/

License: Apache License 2.0

Jsonnet 9.22% Python 60.33% Jupyter Notebook 30.46%
contrastive-learning natural-language-processing allennlp pytorch transformers representation-learning sentence-embeddings sentence-similarity semantic-search semantic-text-similarity

declutr's Introduction

Hi there 👋


I am a 4th-year computer science Ph.D. student at the University of Toronto and a graduate student researcher at the Donnelly Centre for Cellular and Biomolecular Research and the Vector Institute for Artificial Intelligence. I previously interned at Ai2 (on the Semantic Scholar team) and at Semantic Health. My work centres on natural language processing (NLP) and natural language understanding (NLU) of scientific text, particularly biomedical literature, but I am broadly interested in all things ML/AI/NLP. I completed an undergraduate degree at the University of Ottawa, graduating with a B.Sc. in biochemistry. I completed an M.Sc. in computer science at the University of Toronto. When I am not 💻 I like to 🚵‍♂️, 🏂 and 🏋️‍♂️.

declutr's People

Contributors

dependabot[bot], johngiorgi, osvaldn, subercui


declutr's Issues

Data loading process is way too slow

Overview

The data loading process that occurs before training in order to build a vocabulary in AllenNLP is prohibitively slow. Based on some quick testing, it looks like sample_spans is at fault:

https://github.com/JohnGiorgi/t2t/blob/7f78ff9030e4759af3a39e68fcdca1fc62dec959/t2t/data/dataset_readers/dataset_utils/contrastive_utils.py#L7-L22

I can't see anything that is obviously causing the slowdown, but I suspect it is the pair of str.split() and " ".join(list) calls that occur twice in every call to sample_spans. This leads to 4 * num_instances calls to this pair of methods, which is probably a dramatic bottleneck; a sketch of one possible fix follows the list below.

It would be great to figure this out so that we can:

  • Still generate data stochastically, online (currently I have to do this offline to scale to 500K training instances)
  • Remove the bottleneck of almost 1 hour (!) of pre-processing that is currently happening before training even begins.
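For reference, a minimal sketch of one way to avoid the repeated split/join work. The signature is hypothetical (the real sample_spans in contrastive_utils.py may differ); the idea is simply to tokenize each document once and sample every span from the cached token list:

import random
from typing import List


def sample_spans(text: str, num_spans: int, min_length: int, max_length: int) -> List[str]:
    # One str.split() per document, instead of one (or more) per sampled span.
    tokens = text.split()
    max_length = min(max_length, len(tokens))
    min_length = min(min_length, max_length)
    spans = []
    for _ in range(num_spans):
        length = random.randint(min_length, max_length)
        start = random.randint(0, len(tokens) - length)
        # One " ".join() per sampled span; nothing else touches the full document.
        spans.append(" ".join(tokens[start : start + length]))
    return spans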

Write notebooks for training and evaluating

Write a Jupyter notebook (included in this repo and available through Colab) that walks a user through:

  1. Installing the repo
  2. Collecting data
  3. Training a model
  4. Evaluating the model on SentEval
  5. Embedding text with a trained model
  6. (Optionally) Uploading the model's weights to HF Transformers.

This would provide runnable, GPU-backed documentation that complements the README.

Different learning rates for encoder/decoder

Because our encoder is pre-trained, we select a much lower learning rate than usual (e.g. 2e-5), a high weight decay (e.g. 0.1), and train only for a few epochs.

Because all other layers are trained from scratch, they will likely benefit from a higher learning rate and a smaller weight decay.

In general, we should set the optimizer up to use different learning rates / weight decays for the pre-trained weights and the weights trained from scratch.
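A minimal sketch of what this could look like with a plain PyTorch optimizer (the module names are placeholders standing in for the pre-trained encoder and the layers trained from scratch; in AllenNLP the same effect should be achievable through the optimizer's parameter_groups option):

import torch
from torch import nn

# Placeholders: `encoder` stands in for the pre-trained transformer,
# `projection_head` for the layers trained from scratch.
encoder = nn.Linear(768, 768)
projection_head = nn.Linear(768, 128)

optimizer = torch.optim.AdamW(
    [
        # Pre-trained weights: small learning rate, high weight decay.
        {"params": encoder.parameters(), "lr": 2e-5, "weight_decay": 0.1},
        # From-scratch weights: larger learning rate, smaller weight decay.
        {"params": projection_head.parameters(), "lr": 1e-3, "weight_decay": 0.01},
    ]
)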

Come up with a good way of saving document embeddings to disk

Currently, the document embeddings we generate are used as input to the decoder, but never saved to disk. There does not appear to be a "natural" way to do this with AllenNLP.

We will have to come up with a solution. Ideally, you would be able to use a trained model to produce document embeddings for a given dataloader and save these to disk along with the source text.
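A minimal sketch of one possible solution (not an existing AllenNLP or DeCLUTR feature): pair each embedding with its source text and write both to a JSON Lines file.

import json

import numpy as np


def save_embeddings(texts, embeddings, path="embeddings.jsonl"):
    # Each line stores the source text alongside its document embedding,
    # so the two can never get out of sync on disk.
    with open(path, "w", encoding="utf-8") as f:
        for text, embedding in zip(texts, embeddings):
            record = {"text": text, "embedding": np.asarray(embedding).tolist()}
            f.write(json.dumps(record) + "\n")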

Experiment with RoBERTa/ALBERT

Overview

Both RoBERTa and ALBERT outperform BERT on almost every downstream task. ALBERT has the additional benefit of having many times fewer parameters. The authors also demonstrated that it scales better (i.e. they don't observe the same degradation in model performance when they increase the model size and training set size as they do with BERT).

For these reasons, ALBERT is likely our best choice for the encoder. To confirm this, train the model using BERT, RoBERTa and ALBERT and compare the quality of the produced embeddings.

TODO

  • Train the model with BERT, inspect the embeddings in TensorBoard (qualitative) and assess their performance on the document retrieval task (quantitative)
  • Repeat for RoBERTa
  • Repeat for ALBERT
  • Make the winner the default encoder

Some run_senteval subcommands capture logging output

Some of the models we support in run_senteval.py appear to be messing with the logger we use to log SentEval output to the console. So far, this happens with the sentence-transformers and google-use subcommands. It would be nice to figure this out, as right now, --verbose has no effect for either of these subcommands, which could be confusing to a user.


Update: this is a TensorFlow thing. Just having it installed causes all sorts of weird issues with the logger.

Try out the "scalar mix" option in AllenNLP

AllenNLP recently added an option to use a "scalar" mix of the embeddings from each layer of a Transformer to produce a final embedding (as opposed to using the last layer exclusively). This reminds me a lot of the SBERT-WK approach.

Try running a few experiments with this option enabled to see if it improves the quality of the embeddings.
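For reference, a hedged sketch of how this might be toggled from Python, assuming the flag is named last_layer_only as in recent AllenNLP releases:

from allennlp.modules.token_embedders import PretrainedTransformerEmbedder

# Assumption: last_layer_only=False makes the embedder learn a scalar mix over
# all hidden layers rather than using only the final layer.
embedder = PretrainedTransformerEmbedder("distilroberta-base", last_layer_only=False)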

Write unit tests for Encoder

The Encoder is currently not unit tested. I should write some really simple sanity checks (see the sketch after this list) that make sure:

  1. It can be instantiated and used to embed text.
  2. The relative ranking of embeddings makes sense (use easy examples).
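A hedged sketch of what those checks might look like with pytest, assuming the Encoder API shown in the README (Encoder("declutr-small") returning one embedding per input string):

import pytest
from scipy.spatial.distance import cosine

from declutr import Encoder


@pytest.fixture(scope="module")
def encoder():
    # Assumes the pretrained model name accepted by Encoder in the README.
    return Encoder("declutr-small")


def test_can_embed_text(encoder):
    embeddings = encoder(["A man is playing a guitar.", "A woman is slicing an onion."])
    assert len(embeddings) == 2


def test_relative_ranking(encoder):
    anchor, positive, negative = encoder(
        [
            "A man is playing a guitar.",
            "Someone is strumming an acoustic guitar.",
            "The stock market fell sharply today.",
        ]
    )
    # The paraphrase should be closer to the anchor than the unrelated sentence.
    assert cosine(anchor, positive) < cosine(anchor, negative)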

Running into an error while training the model through Colab

Hi,

I was trying to test the Colab notebook for training the model on wikitext_103, but I am running into the error below. I thought maybe it is because of recent updates to the allennlp package. I checked the version you use in setup.py and it is "allennlp>=1.0.0".

Is there a specific allennlp version that you use for training? Or is this error because of something else?

Thanks!

2020-08-21 19:26:28,599 - CRITICAL - root - Uncaught exception
Traceback (most recent call last):
  File "/usr/local/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/usr/local/lib/python3.6/dist-packages/allennlp/__main__.py", line 34, in run
    main(prog="allennlp")
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/__init__.py", line 92, in main
    args.func(args)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/train.py", line 118, in train_model_from_args
    file_friendly_logging=args.file_friendly_logging,
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/train.py", line 177, in train_model_from_file
    file_friendly_logging=file_friendly_logging,
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/train.py", line 238, in train_model
    file_friendly_logging=file_friendly_logging,
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/train.py", line 429, in _train_worker
    params=params, serialization_dir=serialization_dir, local_rank=process_rank,
  File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 581, in from_params
    **extras,
  File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 612, in from_params
    return constructor_to_call(**kwargs)  # type: ignore
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/train.py", line 683, in from_partial_objects
    model=model_, data_loader=data_loader_, validation_data_loader=validation_data_loader_,
  File "/usr/local/lib/python3.6/dist-packages/allennlp/common/lazy.py", line 46, in construct
    return self._constructor(**kwargs)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 447, in constructor
    return value_cls.from_params(params=deepcopy(popped_params), **constructor_extras)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 581, in from_params
    **extras,
  File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 610, in from_params
    kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 193, in create_kwargs
    params.assert_empty(cls.__name__)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/common/params.py", line 429, in assert_empty
    "Extra parameters passed to {}: {}".format(class_name, self.params)
allennlp.common.checks.ConfigurationError: Extra parameters passed to GradientDescentTrainer: {'opt_level': None}

Upload pre-trained model weights

There are two good options for uploading pre-trained model weights:

  1. Upload the serialized AllenNLP model to a GitHub release, modifying Encoder to download the weights if the user provides the model name.
  2. Upload the weights of the model to https://huggingface.co/models.

The first option requires slightly more effort, but it will work even if someone uses another pooler, and has the benefit of working with our Encoder class directly. The second option should be straightforward (although I am not sure how, so I asked for help here), but would require a user to write the data-loading code and re-write the pooler.

Ultimately, I think it would be best to do both, so that users of HF Transformers (who may not be users of AllenNLP) can use the model without actually installing this repo.
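A hedged sketch of the second option, assuming the underlying Transformers model can be reached through the attribute path that appears in the training logs (the path and archive location are illustrative and may differ between versions):

from allennlp.models.archival import load_archive

# Load the trained AllenNLP archive and pull out the wrapped Hugging Face model.
archive = load_archive("output/model.tar.gz")
transformer = archive.model._text_field_embedder.token_embedder_tokens.transformer_model

# Save it in the standard Transformers format; the resulting directory is what
# gets uploaded to https://huggingface.co/models.
transformer.save_pretrained("declutr-base")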

Source and target vocabularies should be shared

Overview

The source and target texts are identical, and therefore their vocabularies should be shared. It is unclear whether this will lead to any model improvements, but it will simplify the setup as we will not need to maintain separate tokenizers/indexers/vocabularies for the source and target texts.

TODO

  • Drop the target_tokenizer and target_indexer from the config. This will lead to the source_tokenizer and source_indexer being used by default.
  • Find a way around the START_SYMBOL / END_SYMBOL problem.

Can't load weights for 'johngiorgi/declutr-sci-base'

When I try to load the model with Transformers, I run into the following issue. Could you help? Thanks.

OSError: Can't load weights for 'johngiorgi/declutr-sci-base'. Make sure that:

- 'johngiorgi/declutr-sci-base' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'johngiorgi/declutr-sci-base' is the correct path to a directory containing a file named one of pytorch_model.bin, tf_model.h5, model.ckpt.


Colab runs out of memory while generating bulk embeddings

Hello,

I am trying to use DeCLUTR for my thesis, where I wish to generate bulk embeddings (around 23k sentences). Once the model has generated about 3.5k embeddings, Google Colab simply crashes and reloads the runtime. Is there any workaround for this?

Thanks,
Deven
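One possible workaround (a hedged sketch, assuming the Encoder API from the README and that the final embedding matrix fits in CPU memory) is to embed in small batches under torch.no_grad() and accumulate the results on the CPU:

import numpy as np
import torch

from declutr import Encoder

encoder = Encoder("declutr-small")
sentences = ["..."]  # the ~23k sentences to embed
batch_size = 256

all_embeddings = []
with torch.no_grad():  # don't retain activations for backprop
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start : start + batch_size]
        # Assumes the encoder returns NumPy arrays; move tensors to the CPU first otherwise.
        all_embeddings.append(np.asarray(encoder(batch)))

embeddings = np.concatenate(all_embeddings, axis=0)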

Validation iterator should not shuffle

Currently, when performing inference, the object_dict saved to disk does not include the source text (it is unclear how to get this from the decode hook). This means that while we get the predictions made by the decoder, these are not paired to their original input text. This is a problem for evaluating the decoded outputs, and for assigning document embeddings to input documents.

The simplest solution would be to use a validation iterator at train time that does not shuffle. In this way, the predictions saved to disk will be in the same order as the input text (in the validation set).

Please release a requirements.txt for dependencies

Hi! I wanted to reproduce the result on SentEval, but failed because of the following error:

Traceback (most recent call last):
  File "scripts/run_senteval.py", line 732, in <module>
    app()
  File "/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "scripts/run_senteval.py", line 714, in allennlp
    overrides="{'trainer.use_amp': true}",
  File "/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/allennlp/models/archival.py", line 191, in load_archive
    cuda_device=cuda_device,
  File "/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/allennlp/models/model.py", line 354, in load
    config["model"] if isinstance(config["model"], str) else config["model"]["type"]
  File "/home/hadoop-aipnlp/anaconda3/lib/python3.7/site-packages/allennlp/common/params.py", line 436, in __getitem__
    raise KeyError
KeyError

I think this is caused by incompatible package versions? I noticed that allennlp 1.1.0 was released in September, but your paper was released in June.
So could you release a requirements.txt, please?
Thanks!

Hugging Face hosted inference API: rogue character in mask, also perhaps wrong task?

The DeCLUTR work is very exciting, thank you for sharing.

Issue part one: the Hugging Face hosting page for the model defaults to the mask-filling task; perhaps add a note to the model card that this is not a relevant task for the model?

Additionally, the code seems to have an error, where the results include a rogue character at the start of the decoded mask token.
[Screenshot, 2020-08-25: hosted inference API output showing the rogue character.]

Running the raw pipeline code does not seem to reproduce this error:

from transformers import AutoModelWithLMHead, AutoTokenizer
import torch
import unicodedata

lmtokenizer = AutoTokenizer.from_pretrained("johngiorgi/declutr-small")
lmmodel = AutoModelWithLMHead.from_pretrained("johngiorgi/declutr-small")

sequence = f"The capital of France is {lmtokenizer.mask_token}."

# Locate the mask token in the encoded sequence.
input = lmtokenizer.encode(sequence, return_tensors="pt")
mask_token_index = torch.where(input == lmtokenizer.mask_token_id)[1]

token_logits = lmmodel(input)[0]
mask_token_logits = token_logits[0, mask_token_index, :]

# Inspect the five most likely fill-ins and the exact characters they decode to.
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(lmtokenizer.decode([token]).encode('utf-8'))
    print([unicodedata.name(c) for c in lmtokenizer.decode([token])])
    print(sequence.replace(lmtokenizer.mask_token, lmtokenizer.decode([token])))

A similar error seems to appear here: https://huggingface.co/xlm-roberta-large?text=The+goal+of+life+is+%3Cmask%3E. Perhaps this is an issue for Hugging Face?

Multi-gpu training

In SimCLR, the number of negatives for a given positive pair is taken to be 2 * (N - 1) (all examples in a minibatch of size N that don't belong to that positive pair), and they find (as other works before them) that the bigger the batch size, the larger the number of negatives and the better the learned representations.

DataParallel and DistributedDataParallel divide up a mini-batch, send each partition of examples to a GPU, compute gradients, and then average these gradients across GPUs before the optimizer step. But this means that each GPU computes a loss over N/n_gpus examples and, therefore, only 2 * (N/n_gpus - 1) negatives per positive pair.

We need to figure out how to use multi-GPU training, preferably with DistributedDataParallel, in a way that avoids this "issue", i.e. one that allows us to use multiple GPUs while maintaining 2 * (N - 1) negatives per positive pair.

This is high priority as we will need multiple GPUs to scale the number of negatives in order to maximize performance (and make training on millions of documents feasible).
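A common fix in SimCLR-style implementations (a hedged sketch, not something this repo currently does) is to all-gather the embeddings from every process before computing the contrastive loss, substituting the local shard back in so that gradients still flow through this process's examples:

import torch
import torch.distributed as dist


def gather_with_grad(embeddings: torch.Tensor) -> torch.Tensor:
    """Gather embeddings from all processes so the loss sees 2 * (N - 1)
    negatives per positive pair, not just the negatives on the local GPU.
    Assumes torch.distributed has been initialized and every process
    contributes a tensor of the same shape."""
    world_size = dist.get_world_size()
    if world_size == 1:
        return embeddings
    # all_gather does not propagate gradients, so put the local (grad-tracking)
    # tensor back into its slot before concatenating.
    gathered = [torch.zeros_like(embeddings) for _ in range(world_size)]
    dist.all_gather(gathered, embeddings)
    gathered[dist.get_rank()] = embeddings
    return torch.cat(gathered, dim=0)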

Does shuffle need to be set to false when evaluating?

It is unclear whether or not shuffle needs to be set to false when using a trained model to embed text, or when evaluating with SentEval.

My worry is that if it is not False, the order of the output embeddings won't match the order of the input text. Maybe this is not the case. I should check to be sure.

Does the model borrow the initial weights from other models?

Hi,

Thanks for the well written paper! I have two questions:

  1. In the paper it is mentioned that "To make the computational requirements feasible, we do not train from scratch, but rather we continue training a model that has been pre-trained with the MLM objective. Specifically, we use both RoBERTa-base [16] and DistilRoBERTa [49] (a distilled version of RoBERTa-base) in our experiments." Does this mean that the Training your own model notebook uses the weights from pre-trained models and then fine-tunes them on my dataset?

  2. If the answer to the first question is yes, then I assume the embeddings for vocabulary items that do not exist in the pre-trained models are trained from scratch. Correct?

Thanks!

Mine positive examples online

Currently, we mine positive examples "offline" (before training begins). This means they are static, i.e. the positive example for a given anchor is the same across all batches. I suspect this is a problem (especially for small datasets), as the anchors are never trained against different positives.

A solution would be to mine positives "online". For example, the data-loading process could sentence-segment the incoming text. During training, we could then select one sentence (at random) from each anchor (a document), which would be that anchor's positive example.
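A minimal sketch of the idea (the naive split on periods is only for illustration; a real reader would use a proper sentence segmenter such as spaCy):

import random


def mine_positive(document: str) -> str:
    # Segment the anchor document into sentences at load time, then pick a
    # fresh positive at random every time the document is batched.
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    return random.choice(sentences)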

Don't shuffle the dataset when num_epochs=1

Currently, the dataset reader will shuffle the dataset during every epoch. In order to do this, it reads the entire dataset into memory, shuffles it, then yields instances one-by-one. This was the only way I could figure out how to shuffle a lazy AllenNLP dataset reader.

Unfortunately, for large datasets this means we need a lot of memory. Fortunately, for large datasets, really good performance can be achieved in only 1 epoch (as we found in the paper). Therefore, I think the DatasetReader should be updated such that shuffling only happens when num_epochs > 1. I am not sure how the DatasetReader could get access to num_epochs, so the user may just have to provide a shuffle argument.

Smarter truncation scheme

Overview

This paper recently performed a series of empirical experiments on fine-tuning BERT. Among other things, they devised various truncation schemes to handle BERT/ALBERT/RoBERTa's limit of 512 wordpiece tokens.

The best-performing scheme selects the first 128 and the last 382 tokens of a document (if the length after wordpiece tokenization exceeds 512), and performs better on text classification tasks. The motivation is that these tokens typically contain the most information within a document.
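A minimal sketch of that scheme (a hypothetical helper operating on an already wordpiece-tokenized document):

from typing import List


def head_tail_truncate(wordpieces: List[str], max_length: int = 512,
                       head: int = 128, tail: int = 382) -> List[str]:
    # Keep the first 128 and last 382 wordpieces when the document is too long;
    # 128 + 382 = 510 leaves room for the two special tokens.
    if len(wordpieces) <= max_length:
        return wordpieces
    return wordpieces[:head] + wordpieces[-tail:]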

TODO

  • Subclass the Seq2Seq dataset reader and add a class attribute that controls the truncation strategy.
  • Add the above truncation scheme as an option.
  • If it outperforms our current truncation scheme, make it the default.

Mine this paper for insights

This paper recently performed a bunch of empirical experiments on fine-tuning BERT. There are some highly relevant insights for us:

  • Rather than truncate a document with > 512 tokens, they select the first 128 and the last 382 tokens. This performs better in text classification tasks. The motivation is that these tokens typically contain the most information.
  • Layer-wise learning rate: they select a single learning rate and multiply it by a "decay factor", such that the bottom layers of BERT are fine-tuned with smaller effective learning rates than the top layers. This is inspired by transfer learning in computer vision.

Low priority for now, but we should investigate how these tricks affect the quality of document embeddings.
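For the second point, a hedged sketch of layer-wise learning rates with plain PyTorch parameter groups (the name matching assumes parameter names containing "layer.<i>." as in the RoBERTa checkpoints and is only illustrative):

import torch
from torch import nn


def layerwise_lr_groups(model: nn.Module, base_lr: float = 2e-5,
                        decay: float = 0.95, num_layers: int = 12):
    # Lower layers get base_lr * decay ** depth_from_top, so they are fine-tuned
    # more gently than the top layers. Parameters without a layer index
    # (e.g. embeddings, heads) simply keep the base learning rate here.
    groups = []
    for name, param in model.named_parameters():
        layer = next((i for i in range(num_layers) if f"layer.{i}." in name), None)
        depth_from_top = 0 if layer is None else num_layers - layer
        groups.append({"params": [param], "lr": base_lr * decay ** depth_from_top})
    return groups


# Usage sketch with a placeholder model standing in for the transformer.
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
optimizer = torch.optim.AdamW(layerwise_lr_groups(model))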

Getting package dependency errors while using the library

I followed the instructions here to install the package. Then I tried to run this code, but I am getting this error:

/opt/anaconda3/envs/allennlp/lib/python3.7/site-packages/declutr/modules/token_embedders/pretrained_transformer_embedder_mlm.py in <module>
      7 from allennlp.modules.token_embedders.token_embedder import TokenEmbedder
      8 from overrides import overrides
----> 9 from transformers import AutoConfig, AutoModelForMaskedLM
     10
     11
ImportError: cannot import name 'AutoModelForMaskedLM' from 'transformers' (/opt/anaconda3/envs/allennlp/lib/python3.7/site-packages/transformers/__init__.py)

To solve the issue I ran pip install -U transformers, which caused this:

allennlp 1.0.0 requires torch<1.6.0,>=1.5.0, but you'll have torch 1.6.0 which is incompatible.
allennlp 1.0.0 requires transformers<2.12,>=2.9, but you'll have transformers 3.0.2 which is incompatible.

Upgrading the transformers library changed the first error message to this:

/opt/anaconda3/envs/allennlp/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py in _batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_pretokenized, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
    311
    312         if kwargs:
--> 313             raise ValueError(f"Keyword arguments {kwargs} not recognized.")
    314
    315         # Set the truncation and padding strategy and restore the initial configuration

ValueError: Keyword arguments {'add_prefix_space': True} not recognized.

Thanks!

Representing full documents using DeCLUTR

Hi!

Great paper and project, thanks for sharing!

Just a quick question: a lot of this work is around representing sentences, but quite often the use case dictates that we need to represent a whole document. How do you suggest that is achieved using your library?

Thanks in advance!
D

Setup Apex

Figure out how to use Apex with the PretrainedTransformer models of AllenNLP. This will speed up training and reduce memory usage (we are currently having trouble training on a batch size > 1).

This is not mission-critical, so we can wait for it to become an official part of the AllenNLP library. Follow the roadmap for updates.

Revert to previous optimizer hyperparameters

It looks like the default parameters of the huggingface_adamw optimizer in the AllenNLP library have recently been changed. This may explain why, in my latest experiments, the performance of DeCLUTR-small was slightly worse than the score reported in our paper.

TODO: Try running some experiments with the old hyperparameter values. If they are better, revert.

Try out sparse weight matrices

Sparsifying weight matrices can lead to sizable reductions in the memory footprint of a model, while having little effect on performance (see here for more info).

Try to sparsify the weight matrix of the projection head using this library. As this is not high-priority, it might be wise to wait until this implementation matures a little (or PyTorch supports it natively).

Spin SentEval Runner into its own library

We have a script for evaluating a model against SentEval, but I think this would actually be useful outside of this project. We currently support three popular libraries (AllenNLP, Transformers and Sentence Transformers), and I plan to add word vectors, Google's Universal Sentence Encoder and more.

I think before I do any more work on this, I should spin this off into its own repo. I have begun that process here.

getting error ValueError: Using AMP requires a cuda device

While running the following code from the sample notebook, I am getting the error ValueError: Using AMP requires a cuda device, even though I am running it on a GPU-enabled Colab notebook.

!allennlp train "declutr_small.jsonnet" \
  --serialization-dir "output" \
  --overrides "$overrides" \
  --include-package "declutr" \
  -f

 from cache at /root/.cache/torch/transformers/5aab0d7dfa1db7d97ead13a37479db888b133a51a05ae4ab62ff5c8d1fcabb65.52b6ec356fb91985b3940e086d1b2ebf8cd40f8df0ba1cabf4cac27769dee241
2021-01-21 07:00:21,397 - INFO - transformers.modeling_utils - All model checkpoint weights were used when initializing RobertaForMaskedLM.

2021-01-21 07:00:21,398 - WARNING - transformers.modeling_utils - Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2021-01-21 07:00:21,398 - INFO - allennlp.common.params - model.seq2vec_encoder = None
2021-01-21 07:00:21,399 - INFO - allennlp.common.params - model.feedforward = None
2021-01-21 07:00:21,399 - INFO - allennlp.common.params - model.miner = None
2021-01-21 07:00:21,399 - INFO - allennlp.common.params - model.loss.type = nt_xent
2021-01-21 07:00:21,399 - INFO - allennlp.common.params - model.loss.temperature = 0.05
2021-01-21 07:00:21,400 - INFO - allennlp.common.params - model.initializer = <allennlp.nn.initializers.InitializerApplicator object at 0x7f545585b2e8>
2021-01-21 07:00:21,400 - INFO - allennlp.nn.initializers - Initializing parameters
2021-01-21 07:00:21,400 - INFO - allennlp.nn.initializers - Done initializing parameters; the following parameters are using their default initialization from their code
2021-01-21 07:00:21,400 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.lm_head.bias
2021-01-21 07:00:21,400 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.lm_head.dense.bias
2021-01-21 07:00:21,401 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.lm_head.dense.weight
2021-01-21 07:00:21,401 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.lm_head.layer_norm.bias
2021-01-21 07:00:21,401 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.lm_head.layer_norm.weight
2021-01-21 07:00:21,401 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.embeddings.LayerNorm.bias
2021-01-21 07:00:21,401 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.embeddings.LayerNorm.weight
2021-01-21 07:00:21,401 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.embeddings.position_embeddings.weight
2021-01-21 07:00:21,401 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.embeddings.token_type_embeddings.weight
2021-01-21 07:00:21,401 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.embeddings.word_embeddings.weight
2021-01-21 07:00:21,401 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.output.LayerNorm.bias
2021-01-21 07:00:21,401 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.output.LayerNorm.weight
2021-01-21 07:00:21,401 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.output.dense.bias
2021-01-21 07:00:21,401 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.output.dense.weight
2021-01-21 07:00:21,401 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.self.key.bias
2021-01-21 07:00:21,401 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.self.key.weight
2021-01-21 07:00:21,401 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.self.query.bias
2021-01-21 07:00:21,401 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.self.query.weight
2021-01-21 07:00:21,401 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.self.value.bias
2021-01-21 07:00:21,402 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.self.value.weight
2021-01-21 07:00:21,402 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.intermediate.dense.bias
2021-01-21 07:00:21,402 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.intermediate.dense.weight
2021-01-21 07:00:21,402 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.output.LayerNorm.bias
2021-01-21 07:00:21,402 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.output.LayerNorm.weight
2021-01-21 07:00:21,402 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.output.dense.bias
2021-01-21 07:00:21,402 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.output.dense.weight
2021-01-21 07:00:21,402 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.output.LayerNorm.bias
2021-01-21 07:00:21,402 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.output.LayerNorm.weight
2021-01-21 07:00:21,402 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.output.dense.bias
2021-01-21 07:00:21,402 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.output.dense.weight
2021-01-21 07:00:21,402 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.self.key.bias
2021-01-21 07:00:21,402 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.self.key.weight
2021-01-21 07:00:21,402 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.self.query.bias
2021-01-21 07:00:21,402 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.self.query.weight
2021-01-21 07:00:21,402 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.self.value.bias
2021-01-21 07:00:21,402 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.self.value.weight
2021-01-21 07:00:21,403 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.intermediate.dense.bias
2021-01-21 07:00:21,403 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.intermediate.dense.weight
2021-01-21 07:00:21,403 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.output.LayerNorm.bias
2021-01-21 07:00:21,403 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.output.LayerNorm.weight
2021-01-21 07:00:21,403 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.output.dense.bias
2021-01-21 07:00:21,403 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.output.dense.weight
2021-01-21 07:00:21,403 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.output.LayerNorm.bias
2021-01-21 07:00:21,403 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.output.LayerNorm.weight
2021-01-21 07:00:21,403 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.output.dense.bias
2021-01-21 07:00:21,403 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.output.dense.weight
2021-01-21 07:00:21,403 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.self.key.bias
2021-01-21 07:00:21,403 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.self.key.weight
2021-01-21 07:00:21,403 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.self.query.bias
2021-01-21 07:00:21,403 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.self.query.weight
2021-01-21 07:00:21,403 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.self.value.bias
2021-01-21 07:00:21,403 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.self.value.weight
2021-01-21 07:00:21,404 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.intermediate.dense.bias
2021-01-21 07:00:21,404 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.intermediate.dense.weight
2021-01-21 07:00:21,404 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.output.LayerNorm.bias
2021-01-21 07:00:21,404 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.output.LayerNorm.weight
2021-01-21 07:00:21,404 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.output.dense.bias
2021-01-21 07:00:21,404 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.output.dense.weight
2021-01-21 07:00:21,404 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.output.LayerNorm.bias
2021-01-21 07:00:21,404 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.output.LayerNorm.weight
2021-01-21 07:00:21,404 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.output.dense.bias
2021-01-21 07:00:21,404 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.output.dense.weight
2021-01-21 07:00:21,404 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.self.key.bias
2021-01-21 07:00:21,404 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.self.key.weight
2021-01-21 07:00:21,404 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.self.query.bias
2021-01-21 07:00:21,404 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.self.query.weight
2021-01-21 07:00:21,404 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.self.value.bias
2021-01-21 07:00:21,405 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.self.value.weight
2021-01-21 07:00:21,405 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.intermediate.dense.bias
2021-01-21 07:00:21,405 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.intermediate.dense.weight
2021-01-21 07:00:21,405 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.output.LayerNorm.bias
2021-01-21 07:00:21,405 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.output.LayerNorm.weight
2021-01-21 07:00:21,405 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.output.dense.bias
2021-01-21 07:00:21,405 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.output.dense.weight
2021-01-21 07:00:21,405 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.output.LayerNorm.bias
2021-01-21 07:00:21,405 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.output.LayerNorm.weight
2021-01-21 07:00:21,405 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.output.dense.bias
2021-01-21 07:00:21,405 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.output.dense.weight
2021-01-21 07:00:21,405 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.self.key.bias
2021-01-21 07:00:21,405 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.self.key.weight
2021-01-21 07:00:21,405 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.self.query.bias
2021-01-21 07:00:21,405 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.self.query.weight
2021-01-21 07:00:21,405 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.self.value.bias
2021-01-21 07:00:21,405 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.self.value.weight
2021-01-21 07:00:21,483 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.intermediate.dense.bias
2021-01-21 07:00:21,484 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.intermediate.dense.weight
2021-01-21 07:00:21,484 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.output.LayerNorm.bias
2021-01-21 07:00:21,484 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.output.LayerNorm.weight
2021-01-21 07:00:21,484 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.output.dense.bias
2021-01-21 07:00:21,484 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.output.dense.weight
2021-01-21 07:00:21,484 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.output.LayerNorm.bias
2021-01-21 07:00:21,484 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.output.LayerNorm.weight
2021-01-21 07:00:21,484 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.output.dense.bias
2021-01-21 07:00:21,484 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.output.dense.weight
2021-01-21 07:00:21,484 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.self.key.bias
2021-01-21 07:00:21,484 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.self.key.weight
2021-01-21 07:00:21,484 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.self.query.bias
2021-01-21 07:00:21,484 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.self.query.weight
2021-01-21 07:00:21,485 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.self.value.bias
2021-01-21 07:00:21,485 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.self.value.weight
2021-01-21 07:00:21,485 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.intermediate.dense.bias
2021-01-21 07:00:21,485 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.intermediate.dense.weight
2021-01-21 07:00:21,485 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.output.LayerNorm.bias
2021-01-21 07:00:21,485 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.output.LayerNorm.weight
2021-01-21 07:00:21,485 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.output.dense.bias
2021-01-21 07:00:21,485 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.output.dense.weight
2021-01-21 07:00:21,485 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.pooler.dense.bias
2021-01-21 07:00:21,485 - INFO - allennlp.nn.initializers -    _text_field_embedder.token_embedder_tokens.transformer_model.roberta.pooler.dense.weight
2021-01-21 07:00:21,486 - INFO - allennlp.common.params - data_loader.type = pytorch_dataloader
2021-01-21 07:00:21,486 - INFO - allennlp.common.params - data_loader.batch_size = 2
2021-01-21 07:00:21,486 - INFO - allennlp.common.params - data_loader.shuffle = False
2021-01-21 07:00:21,487 - INFO - allennlp.common.params - data_loader.sampler = None
2021-01-21 07:00:21,487 - INFO - allennlp.common.params - data_loader.batch_sampler = None
2021-01-21 07:00:21,487 - INFO - allennlp.common.params - data_loader.num_workers = 1
2021-01-21 07:00:21,487 - INFO - allennlp.common.params - data_loader.pin_memory = False
2021-01-21 07:00:21,487 - INFO - allennlp.common.params - data_loader.drop_last = True
2021-01-21 07:00:21,487 - INFO - allennlp.common.params - data_loader.timeout = 0
2021-01-21 07:00:21,487 - INFO - allennlp.common.params - data_loader.worker_init_fn = None
2021-01-21 07:00:21,487 - INFO - allennlp.common.params - data_loader.multiprocessing_context = None
2021-01-21 07:00:21,487 - INFO - allennlp.common.params - data_loader.batches_per_epoch = 8912
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
2021-01-21 07:00:21,505 - INFO - allennlp.common.params - trainer.type = gradient_descent
2021-01-21 07:00:21,507 - INFO - allennlp.common.params - trainer.patience = None
2021-01-21 07:00:21,507 - INFO - allennlp.common.params - trainer.validation_metric = -loss
2021-01-21 07:00:21,508 - INFO - allennlp.common.params - trainer.num_epochs = 1
2021-01-21 07:00:21,508 - INFO - allennlp.common.params - trainer.cuda_device = None
2021-01-21 07:00:21,508 - INFO - allennlp.common.params - trainer.grad_norm = 1
2021-01-21 07:00:21,508 - INFO - allennlp.common.params - trainer.grad_clipping = None
2021-01-21 07:00:21,508 - INFO - allennlp.common.params - trainer.distributed = None
2021-01-21 07:00:21,508 - INFO - allennlp.common.params - trainer.world_size = 1
2021-01-21 07:00:21,508 - INFO - allennlp.common.params - trainer.num_gradient_accumulation_steps = 1
2021-01-21 07:00:21,508 - INFO - allennlp.common.params - trainer.use_amp = True
2021-01-21 07:00:21,509 - INFO - allennlp.common.params - trainer.no_grad = None
/usr/local/lib/python3.6/dist-packages/allennlp/data/dataset_readers/dataset_reader.py:371: UserWarning: Using multi-process data loading without setting DatasetReader.manual_multi_process_sharding to True.
Did you forget to set this?
If you're not handling the multi-process sharding logic within your _read() method, there is probably no benefit to using more than one worker.
  UserWarning,
2021-01-21 07:00:21,511 - INFO - allennlp.common.params - trainer.momentum_scheduler = None
2021-01-21 07:00:21,512 - INFO - allennlp.common.params - trainer.tensorboard_writer = None
2021-01-21 07:00:21,512 - INFO - allennlp.common.params - trainer.moving_average = None
reading instances: 0it [00:00, ?it/s]2021-01-21 07:00:21,512 - INFO - allennlp.common.params - trainer.batch_callbacks = None
2021-01-21 07:00:21,512 - INFO - allennlp.common.params - trainer.epoch_callbacks = None
2021-01-21 07:00:21,513 - INFO - declutr.dataset_reader - Reading instances from lines in file at: wikitext_103/train.txt
2021-01-21 07:00:21,514 - INFO - allennlp.common.params - trainer.optimizer.type = huggingface_adamw
2021-01-21 07:00:21,515 - INFO - allennlp.common.params - trainer.optimizer.lr = 5e-05
2021-01-21 07:00:21,515 - INFO - allennlp.common.params - trainer.optimizer.betas = (0.9, 0.999)
2021-01-21 07:00:21,515 - INFO - allennlp.common.params - trainer.optimizer.eps = 1e-06
2021-01-21 07:00:21,515 - INFO - allennlp.common.params - trainer.optimizer.weight_decay = 0
2021-01-21 07:00:21,515 - INFO - allennlp.common.params - trainer.optimizer.correct_bias = False
2021-01-21 07:00:21,520 - INFO - allennlp.training.optimizers - Done constructing parameter groups.
2021-01-21 07:00:21,520 - INFO - allennlp.training.optimizers - Group 0: ['_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.intermediate.dense.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.self.query.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.self.value.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.self.value.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.output.dense.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.self.key.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.output.dense.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.self.key.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.output.dense.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.self.value.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.intermediate.dense.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.self.value.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.output.dense.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.embeddings.position_embeddings.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.embeddings.token_type_embeddings.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.self.value.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.self.query.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.self.key.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.output.dense.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.self.query.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.intermediate.dense.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.self.value.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.intermediate.dense.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.self.query.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.self.key.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.embeddings.word_embeddings.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.output.dense.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.pooler.dense.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.output.dense.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.output.dense.weight', 
'_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.intermediate.dense.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.output.dense.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.self.query.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.output.dense.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.intermediate.dense.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.self.key.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.lm_head.layer_norm.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.lm_head.dense.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.output.dense.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.self.query.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.output.dense.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.self.key.weight'], {'weight_decay': 0.1}
2021-01-21 07:00:21,521 - INFO - allennlp.training.optimizers - Group 1: ['_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.output.LayerNorm.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.self.key.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.output.dense.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.output.LayerNorm.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.self.query.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.intermediate.dense.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.self.key.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.self.key.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.output.dense.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.output.LayerNorm.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.output.LayerNorm.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.self.key.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.lm_head.layer_norm.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.output.LayerNorm.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.embeddings.LayerNorm.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.self.value.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.self.query.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.embeddings.LayerNorm.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.output.LayerNorm.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.output.LayerNorm.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.self.value.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.intermediate.dense.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.intermediate.dense.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.self.key.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.output.dense.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.output.LayerNorm.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.output.dense.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.output.dense.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.self.query.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.lm_head.dense.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.self.value.bias', 
'_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.self.query.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.output.LayerNorm.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.output.LayerNorm.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.output.LayerNorm.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.self.query.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.pooler.dense.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.self.key.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.output.LayerNorm.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.intermediate.dense.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.self.value.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.output.LayerNorm.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.self.query.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.output.LayerNorm.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.output.dense.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.intermediate.dense.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.output.LayerNorm.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.output.LayerNorm.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.output.dense.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.self.value.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.output.dense.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.output.LayerNorm.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.lm_head.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.intermediate.dense.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.output.LayerNorm.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.output.LayerNorm.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.output.LayerNorm.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.output.dense.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.output.LayerNorm.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.output.dense.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.output.LayerNorm.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.output.LayerNorm.weight', 
'_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.self.value.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.output.LayerNorm.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.output.dense.bias', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.output.dense.bias'], {}
2021-01-21 07:00:21,597 - INFO - allennlp.training.optimizers - Number of trainable parameters: 82760793
2021-01-21 07:00:21,602 - INFO - allennlp.common.util - The following parameters are Frozen (without gradient):
2021-01-21 07:00:21,603 - INFO - allennlp.common.util - The following parameters are Tunable (with gradient):
2021-01-21 07:00:21,603 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.embeddings.word_embeddings.weight
2021-01-21 07:00:21,603 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.embeddings.position_embeddings.weight
2021-01-21 07:00:21,603 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.embeddings.token_type_embeddings.weight
2021-01-21 07:00:21,603 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.embeddings.LayerNorm.weight
2021-01-21 07:00:21,603 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.embeddings.LayerNorm.bias
2021-01-21 07:00:21,603 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.self.query.weight
2021-01-21 07:00:21,604 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.self.query.bias
2021-01-21 07:00:21,604 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.self.key.weight
2021-01-21 07:00:21,604 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.self.key.bias
2021-01-21 07:00:21,604 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.self.value.weight
2021-01-21 07:00:21,604 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.self.value.bias
2021-01-21 07:00:21,604 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.output.dense.weight
2021-01-21 07:00:21,604 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.output.dense.bias
2021-01-21 07:00:21,604 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.output.LayerNorm.weight
2021-01-21 07:00:21,604 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.attention.output.LayerNorm.bias
2021-01-21 07:00:21,604 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.intermediate.dense.weight
2021-01-21 07:00:21,604 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.intermediate.dense.bias
2021-01-21 07:00:21,604 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.output.dense.weight
2021-01-21 07:00:21,604 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.output.dense.bias
2021-01-21 07:00:21,604 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.output.LayerNorm.weight
2021-01-21 07:00:21,605 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.0.output.LayerNorm.bias
2021-01-21 07:00:21,605 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.self.query.weight
2021-01-21 07:00:21,605 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.self.query.bias
2021-01-21 07:00:21,605 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.self.key.weight
2021-01-21 07:00:21,605 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.self.key.bias
2021-01-21 07:00:21,605 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.self.value.weight
2021-01-21 07:00:21,605 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.self.value.bias
2021-01-21 07:00:21,605 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.output.dense.weight
2021-01-21 07:00:21,605 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.output.dense.bias
2021-01-21 07:00:21,605 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.output.LayerNorm.weight
2021-01-21 07:00:21,605 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.attention.output.LayerNorm.bias
2021-01-21 07:00:21,605 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.intermediate.dense.weight
2021-01-21 07:00:21,605 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.intermediate.dense.bias
2021-01-21 07:00:21,605 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.output.dense.weight
2021-01-21 07:00:21,606 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.output.dense.bias
2021-01-21 07:00:21,606 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.output.LayerNorm.weight
2021-01-21 07:00:21,606 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.1.output.LayerNorm.bias
2021-01-21 07:00:21,606 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.self.query.weight
2021-01-21 07:00:21,606 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.self.query.bias
2021-01-21 07:00:21,606 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.self.key.weight
2021-01-21 07:00:21,606 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.self.key.bias
2021-01-21 07:00:21,606 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.self.value.weight
2021-01-21 07:00:21,606 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.self.value.bias
2021-01-21 07:00:21,606 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.output.dense.weight
2021-01-21 07:00:21,606 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.output.dense.bias
2021-01-21 07:00:21,606 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.output.LayerNorm.weight
2021-01-21 07:00:21,606 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.attention.output.LayerNorm.bias
2021-01-21 07:00:21,607 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.intermediate.dense.weight
2021-01-21 07:00:21,607 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.intermediate.dense.bias
2021-01-21 07:00:21,607 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.output.dense.weight
2021-01-21 07:00:21,607 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.output.dense.bias
2021-01-21 07:00:21,607 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.output.LayerNorm.weight
2021-01-21 07:00:21,607 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.2.output.LayerNorm.bias
2021-01-21 07:00:21,607 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.self.query.weight
2021-01-21 07:00:21,607 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.self.query.bias
2021-01-21 07:00:21,607 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.self.key.weight
2021-01-21 07:00:21,607 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.self.key.bias
2021-01-21 07:00:21,607 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.self.value.weight
2021-01-21 07:00:21,607 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.self.value.bias
2021-01-21 07:00:21,607 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.output.dense.weight
2021-01-21 07:00:21,607 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.output.dense.bias
2021-01-21 07:00:21,608 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.output.LayerNorm.weight
2021-01-21 07:00:21,608 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.attention.output.LayerNorm.bias
2021-01-21 07:00:21,608 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.intermediate.dense.weight
2021-01-21 07:00:21,608 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.intermediate.dense.bias
2021-01-21 07:00:21,608 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.output.dense.weight
2021-01-21 07:00:21,608 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.output.dense.bias
2021-01-21 07:00:21,608 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.output.LayerNorm.weight
2021-01-21 07:00:21,608 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.3.output.LayerNorm.bias
2021-01-21 07:00:21,608 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.self.query.weight
2021-01-21 07:00:21,608 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.self.query.bias
2021-01-21 07:00:21,608 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.self.key.weight
2021-01-21 07:00:21,608 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.self.key.bias
2021-01-21 07:00:21,609 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.self.value.weight
2021-01-21 07:00:21,609 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.self.value.bias
2021-01-21 07:00:21,609 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.output.dense.weight
2021-01-21 07:00:21,609 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.output.dense.bias
2021-01-21 07:00:21,609 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.output.LayerNorm.weight
2021-01-21 07:00:21,609 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.attention.output.LayerNorm.bias
2021-01-21 07:00:21,609 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.intermediate.dense.weight
2021-01-21 07:00:21,609 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.intermediate.dense.bias
2021-01-21 07:00:21,609 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.output.dense.weight
2021-01-21 07:00:21,609 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.output.dense.bias
2021-01-21 07:00:21,609 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.output.LayerNorm.weight
2021-01-21 07:00:21,609 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.4.output.LayerNorm.bias
2021-01-21 07:00:21,609 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.self.query.weight
2021-01-21 07:00:21,609 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.self.query.bias
2021-01-21 07:00:21,610 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.self.key.weight
2021-01-21 07:00:21,610 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.self.key.bias
2021-01-21 07:00:21,610 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.self.value.weight
2021-01-21 07:00:21,610 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.self.value.bias
2021-01-21 07:00:21,610 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.output.dense.weight
2021-01-21 07:00:21,610 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.output.dense.bias
2021-01-21 07:00:21,610 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.output.LayerNorm.weight
2021-01-21 07:00:21,610 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.attention.output.LayerNorm.bias
2021-01-21 07:00:21,610 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.intermediate.dense.weight
2021-01-21 07:00:21,610 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.intermediate.dense.bias
2021-01-21 07:00:21,704 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.output.dense.weight
2021-01-21 07:00:21,705 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.output.dense.bias
2021-01-21 07:00:21,705 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.output.LayerNorm.weight
2021-01-21 07:00:21,705 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.encoder.layer.5.output.LayerNorm.bias
2021-01-21 07:00:21,705 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.pooler.dense.weight
2021-01-21 07:00:21,705 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.roberta.pooler.dense.bias
2021-01-21 07:00:21,705 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.lm_head.bias
2021-01-21 07:00:21,706 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.lm_head.dense.weight
2021-01-21 07:00:21,706 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.lm_head.dense.bias
2021-01-21 07:00:21,706 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.lm_head.layer_norm.weight
2021-01-21 07:00:21,706 - INFO - allennlp.common.util - _text_field_embedder.token_embedder_tokens.transformer_model.lm_head.layer_norm.bias
2021-01-21 07:00:21,706 - INFO - allennlp.common.params - trainer.learning_rate_scheduler.type = slanted_triangular
2021-01-21 07:00:21,707 - INFO - allennlp.common.params - trainer.learning_rate_scheduler.cut_frac = 0.1
2021-01-21 07:00:21,708 - INFO - allennlp.common.params - trainer.learning_rate_scheduler.ratio = 32
2021-01-21 07:00:21,709 - INFO - allennlp.common.params - trainer.learning_rate_scheduler.last_epoch = -1
2021-01-21 07:00:21,709 - INFO - allennlp.common.params - trainer.learning_rate_scheduler.gradual_unfreezing = False
2021-01-21 07:00:21,709 - INFO - allennlp.common.params - trainer.learning_rate_scheduler.discriminative_fine_tuning = False
2021-01-21 07:00:21,710 - INFO - allennlp.common.params - trainer.learning_rate_scheduler.decay_factor = 0.38
2021-01-21 07:00:21,710 - INFO - allennlp.common.params - trainer.checkpointer.type = default
2021-01-21 07:00:21,712 - INFO - allennlp.common.params - trainer.checkpointer.keep_serialized_model_every_num_seconds = None
2021-01-21 07:00:21,713 - INFO - allennlp.common.params - trainer.checkpointer.num_serialized_models_to_keep = -1
2021-01-21 07:00:21,718 - INFO - allennlp.common.params - trainer.checkpointer.model_save_interval = None
2021-01-21 07:00:21,723 - CRITICAL - root - Uncaught exception
Traceback (most recent call last):
  File "/usr/local/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/usr/local/lib/python3.6/dist-packages/allennlp/__main__.py", line 34, in run
    main(prog="allennlp")
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/__init__.py", line 92, in main
    args.func(args)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/train.py", line 118, in train_model_from_args
    file_friendly_logging=args.file_friendly_logging,
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/train.py", line 177, in train_model_from_file
    file_friendly_logging=file_friendly_logging,
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/train.py", line 238, in train_model
    file_friendly_logging=file_friendly_logging,
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/train.py", line 433, in _train_worker
    local_rank=process_rank,
  File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 595, in from_params
    **extras,
  File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 624, in from_params
    return constructor_to_call(**kwargs)  # type: ignore
  File "/usr/local/lib/python3.6/dist-packages/allennlp/commands/train.py", line 689, in from_partial_objects
    validation_data_loader=validation_data_loader_,
  File "/usr/local/lib/python3.6/dist-packages/allennlp/common/lazy.py", line 46, in construct
    return self._constructor(**kwargs)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 461, in constructor
    return value_cls.from_params(params=deepcopy(popped_params), **constructor_extras)
  File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 595, in from_params
    **extras,
  File "/usr/local/lib/python3.6/dist-packages/allennlp/common/from_params.py", line 624, in from_params
    return constructor_to_call(**kwargs)  # type: ignore
  File "/usr/local/lib/python3.6/dist-packages/allennlp/training/trainer.py", line 1174, in from_partial_objects
    use_amp=use_amp,
  File "/usr/local/lib/python3.6/dist-packages/allennlp/training/trainer.py", line 437, in __init__
    raise ValueError("Using AMP requires a cuda device")
ValueError: Using AMP requires a cuda device
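The traceback points at the use_amp trainer flag being enabled while no GPU is visible to the process. As a hedged sketch (not the repo's actual fix), the flag can be gated on CUDA availability before building the command-line overrides; use_amp and cuda_device are real AllenNLP trainer parameters, but the override structure below is only an illustration:

import json

import torch

# Request mixed precision only when a GPU is actually available; otherwise fall
# back to full precision on the CPU (cuda_device = -1 in AllenNLP).
use_amp = torch.cuda.is_available()
overrides = {"trainer": {"use_amp": use_amp, "cuda_device": 0 if use_amp else -1}}

# The resulting JSON string can be passed to `allennlp train` via its -o flag.
print(json.dumps(overrides))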

Dramatically increase number of negative examples using MoCo

Currently, the number of negative examples is coupled to the batch size. This is a problem, as our maximum batch size is quite small (~16 on a 16 GB GPU). I suspect this is part of the reason for the model's poor embedding quality in its current state.

Try to implement something similar to MoCo, where a dictionary of negatives is maintained during training, decoupling the number of negative samples from the batch size.
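A hedged sketch of such a queue (not code from this repo; it assumes L2-normalized embeddings and a queue size that is a multiple of the batch size):

import torch
import torch.nn.functional as F

class NegativeQueue:
    """A MoCo-style FIFO buffer of embeddings from previous batches."""

    def __init__(self, embedding_dim: int, queue_size: int = 65536) -> None:
        self.queue = F.normalize(torch.randn(queue_size, embedding_dim), dim=1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, keys: torch.Tensor) -> None:
        # Overwrite the oldest entries with the newest batch of embeddings.
        batch_size = keys.size(0)
        self.queue[self.ptr : self.ptr + batch_size] = keys
        self.ptr = (self.ptr + batch_size) % self.queue.size(0)

    def negatives(self) -> torch.Tensor:
        # Detached clone so no gradients flow into stale entries.
        return self.queue.clone().detach()

At each training step, the queue's contents would be added to the in-batch negatives when computing the contrastive loss, and the current batch's embeddings would then be enqueued.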

Set up a prediction pipeline

Using the predict tool provided by AllenNLP, we need to set up a prediction interface so that we can manually inspect the quality of the decoded sequences.
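A minimal sketch of what such a predictor might look like, assuming the dataset reader exposes a text_to_instance method that accepts raw text (the registered name "contrastive" and the "text" field are illustrative assumptions, not the repo's actual code):

from allennlp.common.util import JsonDict
from allennlp.data import Instance
from allennlp.predictors.predictor import Predictor

@Predictor.register("contrastive")
class ContrastivePredictor(Predictor):
    """Embeds raw text with a trained model via `allennlp predict`."""

    def _json_to_instance(self, json_dict: JsonDict) -> Instance:
        # Delegate to the dataset reader so tokenization matches training.
        return self._dataset_reader.text_to_instance(json_dict["text"])

It could then be invoked with something like allennlp predict path/to/model.tar.gz path/to/inputs.jsonl --predictor contrastive --include-package t2t.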

Update HuggingFace Sci-base example

Hi all,
It seems like there has been a breaking change in HF that causes the sci-base example to fail. The code in the sci-base model card is as follows:

import torch
from scipy.spatial.distance import cosine

from transformers import AutoModel, AutoTokenizer

# Load the model
tokenizer = AutoTokenizer.from_pretrained("johngiorgi/declutr-sci-base")
model = AutoModel.from_pretrained("johngiorgi/declutr-sci-base")

# Prepare some text to embed
text = [
    "Oncogenic KRAS mutations are common in cancer.",
    "Notably, c-Raf has recently been found essential for development of K-Ras-driven NSCLCs.",
]
inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")

# Embed the text
with torch.no_grad():
    sequence_output, _ = model(**inputs, output_hidden_states=False)

# Mean pool the token-level embeddings to get sentence-level embeddings
embeddings = torch.sum(
    sequence_output * inputs["attention_mask"].unsqueeze(-1), dim=1
) / torch.clamp(torch.sum(inputs["attention_mask"], dim=1, keepdims=True), min=1e-9)

# Compute a semantic similarity via the cosine distance
semantic_sim = 1 - cosine(embeddings[0], embeddings[1])

Here is a version that works with the current HF release (transformers now returns a ModelOutput object rather than a plain tuple by default, so the token embeddings are read from its last_hidden_state attribute):

import torch
from scipy.spatial.distance import cosine

from transformers import AutoModel, AutoTokenizer

# Load the model
tokenizer = AutoTokenizer.from_pretrained("johngiorgi/declutr-sci-base")
model = AutoModel.from_pretrained("johngiorgi/declutr-sci-base")
# Prepare some text to embed
text = [
    "Oncogenic KRAS mutations are common in cancer.",
    "Notably, c-Raf has recently been found essential for development of K-Ras-driven NSCLCs.",
]
inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
# Put the tensors on the GPU, if available
for name, tensor in inputs.items():
    inputs[name] = tensor.to(model.device)

# Embed the text
with torch.no_grad():
    sequence_output = model(**inputs, output_hidden_states=False)

# Mean pool the token-level embeddings to get sentence-level embeddings
embeddings = torch.sum(
    sequence_output.last_hidden_state * inputs["attention_mask"].unsqueeze(-1), dim=1
) / torch.clamp(torch.sum(inputs["attention_mask"], dim=1, keepdims=True), min=1e-9)

# Compute a semantic similarity via the cosine distance
semantic_sim = 1 - cosine(embeddings[0], embeddings[1])
print(semantic_sim)

Please let me know if I am mistaken.

Many thanks,
Chris

Encounter RuntimeError while running with Apex

Running with Apex via allennlp train configs/contrastive.jsonnet -s tmp --include-package t2t -o "{"trainer": {"opt_level": 'O1'}}" raises the following exception:

Traceback (most recent call last):
  File "/h/haotian/.conda/envs/t2tCLR/bin/allennlp", line 11, in <module>
    load_entry_point('allennlp', 'console_scripts', 'allennlp')()
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/__main__.py", line 18, in run
    main(prog="allennlp")
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/commands/__init__.py", line 93, in main
    args.func(args)
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/commands/train.py", line 143, in train_model_from_args
    dry_run=args.dry_run,
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/commands/train.py", line 202, in train_model_from_file
    dry_run=dry_run,
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/commands/train.py", line 265, in train_model
    dry_run=dry_run,
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/commands/train.py", line 462, in _train_worker
    metrics = train_loop.run()
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/commands/train.py", line 521, in run
    return self.trainer.train()
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/training/trainer.py", line 687, in train
    train_metrics = self._train_epoch(epoch)
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/training/trainer.py", line 465, in _train_epoch
    batch_outputs = self.batch_outputs(batch, for_training=True)
  File "/scratch/ssd001/home/haotian/Code/allennlp/allennlp/training/trainer.py", line 380, in batch_outputs
    output_dict = self._pytorch_model(**batch)
  File "/h/haotian/.conda/envs/t2tCLR/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/scratch/ssd001/home/haotian/Code/t2t/t2t/models/contrastive_text_encoder.py", line 122, in forward
    contrastive_loss = self._loss(embeddings, labels)
  File "/h/haotian/.conda/envs/t2tCLR/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/h/haotian/.conda/envs/t2tCLR/lib/python3.7/site-packages/pytorch_metric_learning/losses/base_metric_loss_function.py", line 53, in forward
    loss = self.compute_loss(embeddings, labels, indices_tuple)
  File "/h/haotian/.conda/envs/t2tCLR/lib/python3.7/site-packages/pytorch_metric_learning/losses/generic_pair_loss.py", line 40, in compute_loss
    return self.loss_method(mat, labels, indices_tuple)
  File "/h/haotian/.conda/envs/t2tCLR/lib/python3.7/site-packages/pytorch_metric_learning/losses/generic_pair_loss.py", line 59, in pair_based_loss
    return self._compute_loss(pos_pair, neg_pair, indices_tuple)
  File "/h/haotian/.conda/envs/t2tCLR/lib/python3.7/site-packages/pytorch_metric_learning/losses/ntxent_loss.py", line 20, in _compute_loss
    max_val = torch.max(pos_pairs, torch.max(neg_pairs, dim=1, keepdim=True)[0])
RuntimeError: Expected object of scalar type Half but got scalar type Float for argument #2 'other' in call to _th_max
  0%|                                                                                                   | 0/1 [00:01<?, ?it/s]
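For context, a minimal reproduction of the dtype mismatch, assuming Apex O1 leaves the positive-pair similarities in half precision while the negative-pair similarities remain float32 (the tensors below are random stand-ins, not values from the model):

import torch

pos_pairs = torch.randn(4, 1).half()   # produced under O1 autocasting (assumption)
neg_pairs = torch.randn(4, 8).float()  # left in full precision (assumption)

# On the older PyTorch in the traceback above, torch.max does not promote mixed
# dtypes and raises "Expected object of scalar type Half but got scalar type Float".
# Casting both operands to a common dtype sidesteps the error:
max_val = torch.max(pos_pairs.float(), torch.max(neg_pairs, dim=1, keepdim=True)[0])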

Reproducing the same training dataset

Hi,
Thank you for the great paper and code!

I want to reproduce the results, but I am not sure about the exact steps to recreate the training dataset from OpenWebText. I downloaded the entire OpenWebText corpus and removed documents with fewer than 2048 tokens after RoBERTa tokenization. This leaves me with 812,436 documents.

In your paper, you reported obtaining 495,243 documents from a subset, and I am wondering how to reproduce that. I see that there is a max_documents argument for the preprocessing script. If you used it, could you tell me the exact arguments you passed?
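For reference, a hypothetical sketch of the filtering described above (the function name and arguments are illustrative; this is not the repo's preprocessing script):

from typing import Iterable, List, Optional

from transformers import AutoTokenizer

def filter_openwebtext(
    documents: Iterable[str], min_length: int = 2048, max_documents: Optional[int] = None
) -> List[str]:
    """Keep documents with at least `min_length` RoBERTa tokens, up to `max_documents`."""
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    kept: List[str] = []
    for document in documents:
        if len(tokenizer.tokenize(document)) >= min_length:
            kept.append(document)
        if max_documents is not None and len(kept) >= max_documents:
            break
    return kept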

Make PyTorchMetricLearningLoss self-contained

Currently, when using a PyTorchMetricLearningLoss, you still have to call its class method get_embeddings_and_labels before calling its forward hook. Find a way to trigger get_embeddings_and_labels automatically when the forward hook of the underlying loss (a subclass of BaseMetricLossFunction) is invoked.
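One possible shape for this, as a hedged sketch (everything except the get_embeddings_and_labels name is an assumption, not the repo's code): wrap the underlying pytorch-metric-learning loss in a module whose own forward calls get_embeddings_and_labels before delegating.

import torch

class SelfContainedMetricLoss(torch.nn.Module):
    """Wraps a metric-learning loss so callers only pass anchors and positives."""

    def __init__(self, metric_loss: torch.nn.Module) -> None:
        super().__init__()
        self._metric_loss = metric_loss

    @staticmethod
    def get_embeddings_and_labels(anchors: torch.Tensor, positives: torch.Tensor):
        # Concatenate anchors and positives and give each pair a shared label so
        # the underlying loss treats them as positives of one another.
        embeddings = torch.cat([anchors, positives])
        indices = torch.arange(anchors.size(0), device=anchors.device)
        labels = torch.cat([indices, indices])
        return embeddings, labels

    def forward(self, anchors: torch.Tensor, positives: torch.Tensor) -> torch.Tensor:
        embeddings, labels = self.get_embeddings_and_labels(anchors, positives)
        return self._metric_loss(embeddings, labels)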

Models don't load with allennlp>=1.2.0

The pretrained models do not load properly with allennlp>=1.2.0. The error reported is:

RuntimeError: Error loading state dict for DeCLUTR
    Missing keys: []
    Unexpected keys: ['_text_field_embedder.token_embedder_tokens.transformer_model.roberta.pooler.dense.weight', '_text_field_embedder.token_embedder_tokens.transformer_model.roberta.pooler.dense.bias']

For now, I will constrain the dependency to be "allennlp>=1.1.0, <1.2.0", but it would be great to find another solution (short of re-training the model).
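A hedged workaround sketch (not the adopted solution): strip the two unexpected pooler keys from the archived state dict before loading it, since they are the only mismatched entries; the weights path below is a placeholder.

import torch

state_dict = torch.load("weights.th", map_location="cpu")  # placeholder path
for key in [
    "_text_field_embedder.token_embedder_tokens.transformer_model.roberta.pooler.dense.weight",
    "_text_field_embedder.token_embedder_tokens.transformer_model.roberta.pooler.dense.bias",
]:
    state_dict.pop(key, None)
# model.load_state_dict(state_dict)  # `model` being the instantiated DeCLUTR model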
