electra_pytorch's Issues

multi-GPU training

Hi,

First of all, the code in this repo does not work out of the box; it requires two edits (setting my_model to False and fixing the OWT dataset name).
That aside, the model does not seem to take advantage of multi-GPU setups. I tried to modify it (using DataParallel), but although memory gets allocated on the designated GPUs, the model does not train (maybe because my fastai knowledge is almost none).
Could you give at least some hints on how to update the code to perform multi-GPU training?
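
One possible direction, a hedged sketch only (not verified with this repo): fastai's built-in DistributedDataParallel support, which wraps the existing learn.fit call and is launched with python -m fastai.launch.

from fastai.distributed import *

# minimal sketch, assuming `learn` is the Learner built in pretrain.py;
# run with: python -m fastai.launch pretrain.py
with learn.distrib_ctx():
    learn.fit(9999, cbs=[lr_shedule])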

back-propagate the discriminator loss through the generator

Hi, thanks for sharing your code. I have a quick question: the paper mentions that "we don't back-propagate the discriminator loss through the generator", and maybe I have missed it, but where in your code is this taken care of? Can you point me to it? Thanks
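
For context, a minimal sketch of how this is usually handled (not this repo's exact code): sampling discrete token ids from the generator logits is itself non-differentiable, and an explicit detach makes the cut visible.

import torch

def build_disc_input(gen_logits, input_ids, mask_positions):
    # gen_logits: (batch, seq, vocab) generator predictions
    # mask_positions: boolean (batch, seq) marking the masked-out tokens
    sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    disc_input = input_ids.clone()
    # discrete ids carry no gradient anyway; .detach() makes the cut explicit,
    # so the discriminator loss never back-propagates into the generator
    disc_input[mask_positions] = sampled[mask_positions].detach()
    return disc_input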

Sequence length too long for `ELECTRADataProcessor`.

I am using ELECTRADataProcessor to tokenize my corpus for pretraining (as your example shows).

I am getting the following message:

Token indices sequence length is longer than the specified maximum sequence length for this model (642 > 512). Running this sequence through the model will result in indexing errors

My question: can this be ignored because the tokenizer cuts off the text, or will it cause a crash during training? How can it be avoided?

Thanks again
Philip
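
For what it's worth, a hedged way to check this yourself (not from the repo; e_corpus is a placeholder for the dataset returned by ELECTRADataProcessor(...).map(...), and the input_ids column name is an assumption about its schema):

max_len = max(len(ex["input_ids"]) for ex in e_corpus)
print(max_len)  # should not exceed the max_length passed to the processor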

RAM usage during training

hi @richarddwang

Thank you for putting up this repo. This is truly great work.

I want to ask about the environment you used to run this code. How much CPU RAM did you use?

I tried to run the training with 50GB RAM but got OOM after 10K steps.

Is this expected?

GLUE score during pre-training

Great job AGAIN!

I have a question: did you test the GLUE score at different pre-training steps? What was the behavior, i.e. what happened during pretraining?

And how do you choose the checkpoint, or do you just train to a certain step and use that one?

Thanks!

Description of how to use "just text files".

Hey @richarddwang
would it be possible to provide a description of how to use "just text files" for pretraining?
I have a large sentence-split file with blank lines between documents and would like to domain-adapt
my ELECTRA model to my domain-specific corpus.

Your examples use these hugdatafast Arrow datasets. How do I inject my own texts?

Many thanks
Philip
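
A hedged sketch of one way to do this (not taken from the repo's examples): load the text files with the `datasets` "text" loader, which yields one example per line, then process the resulting split the same way pretrain.py processes OpenWebText. ELECTRAProcessor and c.max_length are assumed to be the same objects defined in pretrain.py, and the file name is a placeholder.

import datasets

my_corpus = datasets.load_dataset("text", data_files=["my_corpus.txt"])["train"]
e_my_corpus = ELECTRAProcessor(my_corpus).map(
    cache_file_name=f"electra_my_corpus_{c.max_length}.arrow", num_proc=1
)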

How do I pretrain on multiple GPUs?

Hi,

could you please provide information on how to pretrain on multiple GPUs?
I tried sending ELECTRAModel to CUDA and wrapping it with DataParallel,
without success. See this screenshot with my comments.

Sorry, I cannot copy the text out of my environment.

[screenshot]

I do not know why tensors also end up on cuda:1

PS: c.device is still 'cuda:0'

Could you please help me?

Thanks
Philip
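
For reference, a hedged sketch of the DataParallel route (untested with this repo; whether fastai's training loop cooperates with the gathered outputs is exactly the open question here). electra_model is assumed to be the ELECTRAModel instance that pretrain.py builds from the generator and discriminator.

import torch
import torch.nn as nn

# wrap the model before it is handed to the fastai Learner
if torch.cuda.device_count() > 1:
    electra_model = nn.DataParallel(electra_model)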

Is it right that your input data is not sentence-split?

As far as I can see, you do not sentence-split your input data for pretraining. Is that correct?

You have one document per "row" and just cut it when the sequence length of the model is reached.
But how do you continue after that with the next "sentence"? With the rest of the cut sentence?

Thanks
Philip

Use different tokenizer (and specify special tokens)

Thank you for this great repository. It really is a huge help.
There is one thing, however, that I cannot figure out on my own:
I would like to train an ELECTRA model for a different language and therefore use another tokenizer.
Unfortunately, I cannot find where I can change the IDs of the special tokens. I trained a BPE-tokenizer with "<s>":0,"<pad>":1,"</s>":2,"<unk>":3,"<mask>":4,..., but the model seems to assume that these special tokens have the ids 100, 101, 102 and 103. Could you please tell me where I can overwrite this assumption?
I'm really sorry for the stupid question, but I really could not find it.
Thank you very much in advance.
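
A hedged sketch, not a confirmed location in this repo's code: the ids 100-103 are the default BERT/ELECTRA ids for [UNK]/[CLS]/[SEP]/[MASK]. Rather than hard-coding them, you can wrap your BPE tokenizer in a PreTrainedTokenizerFast and read the ids from it wherever they are needed; the tokenizer file path below is a placeholder.

from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my_bpe_tokenizer.json",
    bos_token="<s>", eos_token="</s>", unk_token="<unk>",
    pad_token="<pad>", mask_token="<mask>", cls_token="<s>", sep_token="</s>",
)
# use these instead of the hard-coded 100/101/102/103
print(hf_tokenizer.cls_token_id, hf_tokenizer.sep_token_id,
      hf_tokenizer.pad_token_id, hf_tokenizer.mask_token_id)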

How do I continue language model training?

Hi,
I have a pretrained ELECTRA generator and discriminator stored on disk,
both trained on a large corpus. Now I want to continue training them on a domain-specific corpus.

To do that I am loading them from disk by adding .from_pretrained() here:

electra_pytorch/pretrain.py

Lines 364 to 365 in ab29d03

generator = ElectraForMaskedLM(gen_config)
discriminator = ElectraForPreTraining(disc_config)

My question is: why exactly do you do this:

electra_pytorch/pretrain.py

Lines 366 to 367 in ab29d03

discriminator.electra.embeddings = generator.electra.embeddings
generator.generator_lm_head.weight = generator.electra.embeddings.word_embeddings.weight

and do I still need that in my case, or does it "destroy" my pretrained generator and discriminator?

Many thanks
Philip
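
For what it's worth, a hedged sketch of the change described above; the checkpoint paths are placeholders, and whether the re-tying matters depends on how your checkpoints were saved.

from transformers import ElectraForMaskedLM, ElectraForPreTraining

generator = ElectraForMaskedLM.from_pretrained("path/to/my_generator")
discriminator = ElectraForPreTraining.from_pretrained("path/to/my_discriminator")

# These two lines from pretrain.py tie the generator's and discriminator's token
# embeddings (and the generator's LM head) so they share one set of weights, as
# in the ELECTRA paper. Re-applying them after loading overwrites the
# discriminator's loaded embeddings with the generator's, which is only a no-op
# if the two checkpoints already shared embeddings when they were saved.
discriminator.electra.embeddings = generator.electra.embeddings
generator.generator_lm_head.weight = generator.electra.embeddings.word_embeddings.weight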

Restarting from previous checkpoint

Hi, do you know what the best way to resume training from a previous checkpoint would be? Let's assume I am training for 100k steps but have a 24-hour time limit, and I have only the following checkpoints available:

ls checkpoints/pretrain
vanilla_11081_12.0%.pth  vanilla_11081_25.0%.pth  vanilla_11081_50.0%.pth

Given that the generator and discriminator are instantiated as separate models, do we point them both at the same .pth file? Also, I believe the .from_pretrained() method requires a single config.json, so how do we merge the two configs if that is necessary?

Thanks
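
A hedged guess rather than a confirmed recipe: the .pth files look like fastai Learner checkpoints written by the RunSteps callback via learn.save(), storing the combined ELECTRAModel (generator and discriminator together), so they would be loaded back into the Learner rather than through .from_pretrained().

learn.load("vanilla_11081_50.0%")  # resolved against path/model_dir, i.e. ./checkpoints/pretrain
learn.fit(9999, cbs=[lr_shedule])  # continue training; callback state such as the RunSteps counter may not be restored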

How can I pretrain ELECTRA starting from weights from Google?

This issue is to answer the question from the Hugging Face forum.

Although I haven't tried it, it should be possible.

  1. Make sure my_model is set to False to use huggingface model

    'my_model': False, # only for my personal research

  2. Change model(config) -> model.from_pretrained(model_name) (see the sketch after these steps)

    electra_pytorch/pretrain.py

    Lines 364 to 365 in ab29d03

    generator = ElectraForMaskedLM(gen_config)
    discriminator = ElectraForPreTraining(disc_config)

  3. Be careful about size, max_length, and other configs

    'size': 'small',

    i = ['small', 'base', 'large'].index(c.size)
    c.mask_prob = [0.15, 0.15, 0.25][i]
    c.lr = [5e-4, 2e-4, 2e-4][i]
    c.bs = [128, 256, 2048][i]
    c.steps = [10**6, 766*1000, 400*1000][i]
    c.max_length = [128, 512, 512][i]
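
As a concrete example of step 2, a minimal sketch; the checkpoint names are assumptions about which published ELECTRA weights you want to start from:

from transformers import ElectraForMaskedLM, ElectraForPreTraining

generator = ElectraForMaskedLM.from_pretrained("google/electra-small-generator")
discriminator = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")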

Note: the published ELECTRA models are actually the ++ models described in appendix D, and the max sequence length of ELECTRA-Small / Small++ is 128 / 512.

Feel free to tag me if you have other questions.

Training from scratch with other datasets (other languages)

Hi! Thanks @richarddwang for the reimplementation. For some time I have been getting less than desired results with the official Hugging Face ELECTRA implementation. Would you consider adding support for pretraining on other datasets (meaning other languages)? Right now it's just the wiki and books datasets from /nlp.

Thanks!

Pretrain does not save the model.

Hi,

I am starting a pretraining run with just 100 iterations, like so: learn.fit(9999, cbs=[lr_shedule])

But the model is not saved to './checkpoints/pretrain'.

I did specify (as you did) the following for Learner:

                path='./checkpoints',
                model_dir='pretrain',

Do I have to manually save the model?
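
A hedged pointer rather than a confirmed answer: the checkpoints are written by the RunSteps callback at fixed fractions of c.steps (6.25%, 12.5%, 25%, 50%, 100%), so a run of only 100 steps never reaches a save point. For a short debug run you can save by hand; the checkpoint name below is a placeholder.

learn.fit(9999, cbs=[lr_shedule])
learn.save("debug_run")  # writes ./checkpoints/pretrain/debug_run.pth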

Pyarrow dataloading issue

Hi Richard,

I get the following pyarrow issue when trying to load the openwebtext corpus dataset:

Traceback (most recent call last):
  File "pretrain.py", line 150, in <module>
    e_owt = ELECTRAProcessor(owt, apply_cleaning=False).map(cache_file_name=f"electra_owt_{c.max_length}.arrow", num_proc=1)
  File "/root/_utils/utils.py", line 120, in map
    return self.hf_dset.my_map(
  File "/usr/local/lib/python3.8/dist-packages/hugdatafast/transform.py", line 23, in my_map
    return self.map(*args, cache_file_name=cache_file_name, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 2102, in map
    return self._map_single(
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 518, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 485, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/datasets/fingerprint.py", line 413, in wrapper
    out = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 2498, in _map_single
    writer.write_batch(batch)
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_writer.py", line 499, in write_batch
    self.write_table(pa_table, writer_batch_size)
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_writer.py", line 516, in write_table
    self.pa_writer.write_batch(batch)
  File "pyarrow/ipc.pxi", line 384, in pyarrow.lib._CRecordBatchWriter.write_batch
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Tried to write record batch with different schema

Any ideas?

Best,
Shiv

providing datasets

Hi, thanks for the great code!

I am trying to train ELECTRA from scratch using your code here.
Could you upload the datasets you used for training (drives, links, etc.)?
I just want to make sure that I get the same results before trying other experiments.

Thanks :)

Is multi_task.py in a working state and if so how should one use it?

I am looking to train a transformer for a multi-task classification problem, and I happened to see multi_task.py within your _utils folder and was curious whether it could be useful for my purposes.

I have the texts of novels and I want to train a classifier to predict 1) their genre and 2) their success, in the hope that doing both together would help each individual task. So my question is whether your script is applicable to this scenario, and if so, how can I use it?

Question about ELECTRADataProcessor or ExampleBuilder

First, thanks for sharing this repo! It's very helpful for me in understanding ELECTRA pretraining.

I got a question about ELECTRADataProcessor.

class ELECTRADataProcessor(object):

I read this code and found it corresponds to this file.
https://github.com/google-research/electra/blob/master/build_pretraining_dataset.py#L34

I can understand what this part does: it's preprocessing that randomly splits sentences into two segments, merges them into one example, and so on.
But I can't understand why it does this.
I skimmed the ELECTRA paper, but I couldn't find the reason.
To my understanding, ELECTRA just needs many sentences, like BERT. Why are two segments needed, and why are they split randomly at preprocessing time?

I already asked this here, but there has been no response.
google-research/electra#114

I would be happy if you could reply when you know something and have time.

duration of training

Hi, I was wondering if we have to train for 10,000 epochs, as in the default setting of your code, to get the result. The official ELECTRA implementation trains for 1,000,000 steps, so 10,000 epochs seems too long (from the printed value, there appear to be 273,193 steps per epoch). Also, when training ELECTRA-Small, the save points seem to be 0.0625*(10**6). Is this number related to the step count?
Thanks :)
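
For what it's worth, a hedged reading of the defaults based only on the values visible in this repo (not an authoritative answer): training is bounded in steps by the RunSteps callback rather than in epochs, and the save points are fractions of c.steps.

c_steps = 10**6                              # 'steps' for the small config
save_fracs = [0.0625, 0.125, 0.25, 0.5, 1.0]
print([int(f * c_steps) for f in save_fracs])
# [62500, 125000, 250000, 500000, 1000000] -> 0.0625*(10**6) is the 6.25% save point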

How to train with multiple GPUs in parallel?

Hi, I am trying to train my own pretrained model with the ELECTRA method. I read elsewhere that your code implements ELECTRA's multi-GPU parallelism, but when I tried to run it, I found that only one GPU was being used. My num_workers was set to 4 and num_proc was set to 4, and the data size used for the attempted run is about 0.5k. What else needs to change to get multi-GPU parallelism working?

Error in store_attr()

I tried running the pretrain.py script and got this error:

process id: 76202
{'device': 'cuda:0', 'base_run_name': 'vanilla', 'seed': 11081, 'adam_bias_correction': False, 'schedule': 'original_linear', 'sampling': 'fp32_gumbel', 'electra_mask_style': True, 'tie_gen_in_out_embedding': False, 'gen_smooth_label': False, 'disc_smooth_label': False, 'size': 'small', 'my_model': False, 'run_name': 'vanilla_11081', 'mask_prob': 0.15, 'lr': 0.0005, 'bs': 128, 'steps': 1000000, 'max_length': 128}
{}
loading the electra data (wiki)
loading the electra data (BookCorpus)
electra_pytorch/venv/lib/python3.6/site-packages/nlp/utils/py_utils.py:191: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
  return function(data_struct)
Traceback (most recent call last):
  File "pretrain.py", line 390, in <module>
    RunSteps(c.steps, [0.0625, 0.125, 0.25, 0.5, 1.0], c.run_name+"_{percent}"),
  File "electra_pytorch/electra_pytorch/_utils/would_like_to_pr.py", line 51, in __init__
    store_attr(self, 'n_steps,save_points,base_name,no_val')
  File "electra_pytorch/venv/lib/python3.6/site-packages/fastcore/utils.py", line 97, in store_attr
    if not hasattr(self, '__stored_args__'): self.__stored_args__ = {}
AttributeError: 'str' object has no attribute '__stored_args__'

Updating fastai and fastcore does not solve the issue. I even went through the documentation and found that if we don't provide arguments to store_attr(), it automatically stores all parameters passed to the calling function. I tried that and the model did start training, but it got a nan loss right from epoch 1 (log below). Could you please help me resolve this? @richarddwang

process id: 76772
{'device': 'cuda:0', 'base_run_name': 'vanilla', 'seed': 11081, 'adam_bias_correction': False, 'schedule': 'original_linear', 'sampling': 'fp32_gumbel', 'electra_mask_style': True, 'tie_gen_in_out_embedding': False, 'gen_smooth_label': False, 'disc_smooth_label': False, 'size': 'small', 'my_model': False, 'run_name': 'vanilla_11081', 'mask_prob': 0.15, 'lr': 0.0005, 'bs': 128, 'steps': 1000000, 'max_length': 128}
{}
loading the electra data (wiki)
loading the electra data (BookCorpus)
electra_pytorch/venv/lib/python3.6/site-packages/nlp/utils/py_utils.py:191: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
  return function(data_struct)
vanilla_11081 , starts at 2020-09-07 09:02:02.063122
epoch     train_loss  valid_loss  time
0         nan         00:13
1         nan         00:18
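
A hedged observation, not a confirmed fix: newer fastcore versions changed the signature to store_attr(names=None, self=None, ...), so the old-style call store_attr(self, 'n_steps,...') binds the string to self and raises exactly the AttributeError above. A minimal demo of the new-style call, which infers self from the calling frame:

from fastcore.basics import store_attr

class Demo:
    def __init__(self, n_steps, save_points):
        store_attr('n_steps,save_points')  # `self` is picked up from the caller

d = Demo(10**6, [0.0625, 0.125, 0.25, 0.5, 1.0])
print(d.n_steps, d.save_points)
# the equivalent change in _utils/would_like_to_pr.py would drop `self` from the call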

Cannot run

[screenshot]
Cannot find the modeling file in the project.

Relative importance of different "tricks" in README

Thank you for the brilliant repository! You list some important tricks which are necessary to reproduce performance. In your experience, which of those tricks are critical for matching the released GLUE scores?

At a cursory glance, reordering the sentences to augment the dataset for the STS task seems to be a critical detail. I was wondering if that also aligns with your experience in running these models?

Small typo in the README.md

Hi, Thank you for the awesome repository.

There is a small typo in the second table. Row (ELECTRA-Small++) and Col (RTE) should be 63.6, not 6.36.

Thank you again.


Training time and ++ version

First of all, thanks for the great repo, this is an absolute lifesaver for me. I have two questions though:

  1. How long does it take to pretrain on a single GPU? I thought I read it somewhere but I can't find it anymore (maybe I'm remembering something that isn't there).

  2. You mention that the checkpoints in Huggingface are all the ++ version. Is the default configuration of pretrain.py the correct one for the ++ version, or what needs to change? I want to use your repo to train a Dutch ELECTRA-small model, but I want it to be as comparable as possible to the English ELECTRA-small checkpoint from Huggingface.

Custom Dataset

I am trying to train on a custom dataset, but I cannot process it. Mapping gives this error: "Column to remove ['validation'] not in the dataset. Current columns in the dataset: ['text']". I am using the code below, similar to the other datasets. Could you give a working example with a custom dataset like the one I am using?
babylm = datasets.load_dataset("asparius/babylm-10m","all.txt")
e_babylm = ELECTRAProcessor(babylm).map(num_proc=1)
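
A hedged guess at the cause, not a verified fix: load_dataset returns a DatasetDict of splits (and its second positional argument is a configuration name, not a file), while ELECTRAProcessor expects a single split, so selecting one before mapping may be all that is needed. The "train" split name is an assumption about this dataset.

import datasets

babylm = datasets.load_dataset("asparius/babylm-10m")["train"]
e_babylm = ELECTRAProcessor(babylm).map(num_proc=1)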

SST-2 accuracy is 50% after finetuning

Hi Richard,

Thanks for providing the implementation to pretrain ELECTRA in pytorch!

I tried pre-training an ELECTRA-small model with Wikipedia data and selected the 25% trained model (updated for 250k steps) to fine-tune on SST-2. At the end of each epoch, the validation accuracy on SST-2 stays around 50%. It seems to be an optimization issue if the accuracy stays the same throughout the whole training process. Do you have any idea why this happens? Thank you!!
