electra_pytorch's Issues

multi-GPU training

Hi,

First of all, the code in this repo does not work out of the box; it requires two edits (setting my_model to False and fixing the OWT dataset name).
That aside, the model does not seem to take advantage of multi-GPU setups. I tried to modify it (using DataParallel), but although memory gets allocated on the designated GPUs, the model does not train (maybe because my fastai knowledge is almost none).
Could you give at least some hints on how to update the code to perform multi-GPU training?
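
One possible direction, a hedged sketch only (not verified with this repo): fastai's built-in DistributedDataParallel support, which wraps the existing learn.fit call and is launched with python -m fastai.launch.

from fastai.distributed import *

# minimal sketch, assuming `learn` is the Learner built in pretrain.py;
# run with: python -m fastai.launch pretrain.py
with learn.distrib_ctx():
    learn.fit(9999, cbs=[lr_shedule])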

back-propagate the discriminator loss through the generator

Hi, thanks for sharing your code. I have a quick question: the paper mentions that "we don't back-propagate the discriminator loss through the generator", and maybe I have missed it, but where in your code is this taken care of? Can you point me to it? Thanks
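
For context, a minimal sketch of how this is usually handled (not this repo's exact code): sampling discrete token ids from the generator logits is itself non-differentiable, and an explicit detach makes the cut visible.

import torch

def build_disc_input(gen_logits, input_ids, mask_positions):
    # gen_logits: (batch, seq, vocab) generator predictions
    # mask_positions: boolean (batch, seq) marking the masked-out tokens
    sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    disc_input = input_ids.clone()
    # discrete ids carry no gradient anyway; .detach() makes the cut explicit,
    # so the discriminator loss never back-propagates into the generator
    disc_input[mask_positions] = sampled[mask_positions].detach()
    return disc_input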

Sequence length too long for `ELECTRADataProcessor`.

I am using ELECTRADataProcessor to tokenize my corpus for pretraining (as your example shows).

I am getting the following message:

Token indices sequence length is longer than the specified maximum sequence length for this model (642 > 512). Running this sequence through the model will result in indexing errors

My question: can this be ignored because the tokenizer cuts off the text, or will it cause a crash during training? How can it be avoided?

Thanks again
Philip
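
For what it's worth, a hedged way to check this yourself (not from the repo; e_corpus is a placeholder for the dataset returned by ELECTRADataProcessor(...).map(...), and the input_ids column name is an assumption about its schema):

max_len = max(len(ex["input_ids"]) for ex in e_corpus)
print(max_len)  # should not exceed the max_length passed to the processor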

RAM usage during training

hi @richarddwang

Thank you for putting up this repo. This is truly great work.

I want to ask about the environment you used to run this code. How much CPU RAM did you use?

I tried to run the training with 50GB RAM but got OOM after 10K steps.

Is this expected?

GLUE score during pre-training

Great job AGAIN!

I have a question: did you test the GLUE score at different pre-training steps? What was the behavior, i.e. what happened during pretraining?

And how do you choose the checkpoint, or do you just train to a certain step and use that one?

Thanks!

Description of how to use "just text files".

Hey @richarddwang
would it be possible to provide a description of how to use "just text files" for pretraining?
I have a large sentence-split file with blank lines between documents and would like to domain-adapt
my ELECTRA model to my domain-specific corpus.

Your examples use these hugdatafast Arrow datasets. How do I inject my own texts?

Many thanks
Philip
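
A hedged sketch of one way to do this (not taken from the repo's examples): load the text files with the `datasets` "text" loader, which yields one example per line, then process the resulting split the same way pretrain.py processes OpenWebText. ELECTRAProcessor and c.max_length are assumed to be the same objects defined in pretrain.py, and the file name is a placeholder.

import datasets

my_corpus = datasets.load_dataset("text", data_files=["my_corpus.txt"])["train"]
e_my_corpus = ELECTRAProcessor(my_corpus).map(
    cache_file_name=f"electra_my_corpus_{c.max_length}.arrow", num_proc=1
)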

How do I pretrain on multiple GPUs?

Hi,

could you please provide information on how to pretrain on multiple GPUs?
I tried sending ELECTRAModel to CUDA and wrapping it with DataParallel,
without success. See this screenshot with my comments.

Sorry, I cannot copy the text out of my environment.

[screenshot]

I do not know why tensors also end up on cuda:1

PS: c.device is still 'cuda:0'

Could you please help me?

Thanks
Philip
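
For reference, a hedged sketch of the DataParallel route (untested with this repo; whether fastai's training loop cooperates with the gathered outputs is exactly the open question here). electra_model is assumed to be the ELECTRAModel instance that pretrain.py builds from the generator and discriminator.

import torch
import torch.nn as nn

# wrap the model before it is handed to the fastai Learner
if torch.cuda.device_count() > 1:
    electra_model = nn.DataParallel(electra_model)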

Is it right that your input data is not sentence-split?

As far as I can see, you do not sentence-split your input data for pretraining. Is that correct?

You have one document per "row" and just cut it when the sequence length of the model is reached.
But how do you continue after that with the next "sentence"? With the rest of the cut sentence?

Thanks
Philip

Use different tokenizer (and specify special tokens)

Thank you for this great repository. It really is a huge help.
There is one thing, however, that I cannot figure out on my own:
I would like to train an ELECTRA model for a different language and therefore use another tokenizer.
Unfortunately, I cannot find where I can change the IDs of the special tokens. I trained a BPE-tokenizer with "<s>":0,"<pad>":1,"</s>":2,"<unk>":3,"<mask>":4,..., but the model seems to assume that these special tokens have the ids 100, 101, 102 and 103. Could you please tell me where I can overwrite this assumption?
I'm really sorry for the stupid question, but I really could not find it.
Thank you very much in advance.
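
A hedged sketch, not a confirmed location in this repo's code: the ids 100-103 are the default BERT/ELECTRA ids for [UNK]/[CLS]/[SEP]/[MASK]. Rather than hard-coding them, you can wrap your BPE tokenizer in a PreTrainedTokenizerFast and read the ids from it wherever they are needed; the tokenizer file path below is a placeholder.

from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my_bpe_tokenizer.json",
    bos_token="<s>", eos_token="</s>", unk_token="<unk>",
    pad_token="<pad>", mask_token="<mask>", cls_token="<s>", sep_token="</s>",
)
# use these instead of the hard-coded 100/101/102/103
print(hf_tokenizer.cls_token_id, hf_tokenizer.sep_token_id,
      hf_tokenizer.pad_token_id, hf_tokenizer.mask_token_id)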

How do I continue language model training?

Hi,
I have a pretrained ELECTRA generator and discriminator stored on disk,
both trained on a large corpus. Now I want to continue training them on a domain-specific corpus.

To do that I am loading them from disk by adding .from_pretrained() here:

electra_pytorch/pretrain.py

Lines 364 to 365 in ab29d03

generator = ElectraForMaskedLM(gen_config)
discriminator = ElectraForPreTraining(disc_config)

My question is: why exactly do you do this:

electra_pytorch/pretrain.py

Lines 366 to 367 in ab29d03

discriminator.electra.embeddings = generator.electra.embeddings
generator.generator_lm_head.weight = generator.electra.embeddings.word_embeddings.weight

and do I still need that in my case, or does it "destroy" my pretrained generator and discriminator?

Many thanks
Philip
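
For what it's worth, a hedged sketch of the change described above; the checkpoint paths are placeholders, and whether the re-tying matters depends on how your checkpoints were saved.

from transformers import ElectraForMaskedLM, ElectraForPreTraining

generator = ElectraForMaskedLM.from_pretrained("path/to/my_generator")
discriminator = ElectraForPreTraining.from_pretrained("path/to/my_discriminator")

# These two lines from pretrain.py tie the generator's and discriminator's token
# embeddings (and the generator's LM head) so they share one set of weights, as
# in the ELECTRA paper. Re-applying them after loading overwrites the
# discriminator's loaded embeddings with the generator's, which is only a no-op
# if the two checkpoints already shared embeddings when they were saved.
discriminator.electra.embeddings = generator.electra.embeddings
generator.generator_lm_head.weight = generator.electra.embeddings.word_embeddings.weight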

Restarting from previous checkpoint

Hi, do you know what the best way to resume training from a previous checkpoint would be? Let's assume I am training for 100k steps but have a 24-hour time limit, and I have only the following checkpoints available:

ls checkpoints/pretrain
vanilla_11081_12.0%.pth  vanilla_11081_25.0%.pth  vanilla_11081_50.0%.pth

Given that the generator and discriminator are instantiated as separate models, do we point them both at the same .pth file? Also, I believe the .from_pretrained() method requires a single config.json, so how do we merge the two configs if that is necessary?

Thanks
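
A hedged guess rather than a confirmed recipe: the .pth files look like fastai Learner checkpoints written by the RunSteps callback via learn.save(), storing the combined ELECTRAModel (generator and discriminator together), so they would be loaded back into the Learner rather than through .from_pretrained().

learn.load("vanilla_11081_50.0%")  # resolved against path/model_dir, i.e. ./checkpoints/pretrain
learn.fit(9999, cbs=[lr_shedule])  # continue training; callback state such as the RunSteps counter may not be restored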

How can I pretrain ELECTRA starting from weights from Google?

This issue is to answer the question from the Hugging Face forum.

Although I haven't tried it, it should be possible.

  1. Make sure my_model is set to False to use huggingface model

    'my_model': False, # only for my personal research

  2. Change model(config) -> model.from_pretrained(model_name) (see the sketch after these steps)

    electra_pytorch/pretrain.py

    Lines 364 to 365 in ab29d03

    generator = ElectraForMaskedLM(gen_config)
    discriminator = ElectraForPreTraining(disc_config)

  3. Be careful about size, max_length, and other configs

    'size': 'small',

    i = ['small', 'base', 'large'].index(c.size)
    c.mask_prob = [0.15, 0.15, 0.25][i]
    c.lr = [5e-4, 2e-4, 2e-4][i]
    c.bs = [128, 256, 2048][i]
    c.steps = [10**6, 766*1000, 400*1000][i]
    c.max_length = [128, 512, 512][i]
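
As a concrete example of step 2, a minimal sketch; the checkpoint names are assumptions about which published ELECTRA weights you want to start from:

from transformers import ElectraForMaskedLM, ElectraForPreTraining

generator = ElectraForMaskedLM.from_pretrained("google/electra-small-generator")
discriminator = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")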

Note: the published ELECTRA models are actually the ++ models described in appendix D, and the max sequence length of ELECTRA-Small / Small++ is 128 / 512.

Feel free to tag me if you have other questions.

Training from scratch with other datasets (other languages)

Hi! Thanks @richarddwang for the reimplementation. For some time I have been getting less than desired results with the official Hugging Face ELECTRA implementation. Would you consider adding support for pretraining on other datasets (meaning other languages)? Right now it's just the wiki and books datasets from /nlp.

Thanks!

Pretrain does not save the model.

Hi,

I am starting a pretraining run with just 100 iterations, like so: learn.fit(9999, cbs=[lr_shedule])

But the model is not saved to './checkpoints/pretrain'.

I did specify (as you did) the following for Learner:

                path='./checkpoints',
                model_dir='pretrain',

Do I have to manually save the model?
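
A hedged pointer rather than a confirmed answer: the checkpoints are written by the RunSteps callback at fixed fractions of c.steps (6.25%, 12.5%, 25%, 50%, 100%), so a run of only 100 steps never reaches a save point. For a short debug run you can save by hand; the checkpoint name below is a placeholder.

learn.fit(9999, cbs=[lr_shedule])
learn.save("debug_run")  # writes ./checkpoints/pretrain/debug_run.pth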

Pyarrow dataloading issue

Hi Richard,

I get the following pyarrow issue when trying to load the openwebtext corpus dataset:

Traceback (most recent call last):
  File "pretrain.py", line 150, in <module>
    e_owt = ELECTRAProcessor(owt, apply_cleaning=False).map(cache_file_name=f"electra_owt_{c.max_length}.arrow", num_proc=1)
  File "/root/_utils/utils.py", line 120, in map
    return self.hf_dset.my_map(
  File "/usr/local/lib/python3.8/dist-packages/hugdatafast/transform.py", line 23, in my_map
    return self.map(*args, cache_file_name=cache_file_name, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 2102, in map
    return self._map_single(
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 518, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 485, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/datasets/fingerprint.py", line 413, in wrapper
    out = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 2498, in _map_single
    writer.write_batch(batch)
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_writer.py", line 499, in write_batch
    self.write_table(pa_table, writer_batch_size)
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_writer.py", line 516, in write_table
    self.pa_writer.write_batch(batch)
  File "pyarrow/ipc.pxi", line 384, in pyarrow.lib._CRecordBatchWriter.write_batch
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Tried to write record batch with different schema

Any ideas?

Best,
Shiv

providing datasets

Hi, thanks for the great code!

I am trying to train ELECTRA from scratch using your code here.
Could you upload the datasets you used for training (drives, links, etc.)?
I just want to make sure that I get the same results before trying other experiments.

Thanks :)

Is multi_task.py in a working state and if so how should one use it?

I am looking to train a transformer for a multi-task classification problem, and I happened to see multi_task.py within your _utils folder and was curious whether it could be useful for my purposes.

I have the texts of novels and I want to train a classifier to predict 1) their genre and 2) their success, in the hope that doing both together would help each individual task. So my question is whether your script is applicable to this scenario, and if so, how can I use it?

Question about ELECTRADataProcessor or ExampleBuilder

First, thanks for sharing this repo! It's very helpful for me in understanding ELECTRA pretraining.

I got a question about ELECTRADataProcessor.

class ELECTRADataProcessor(object):

I read this code and found it corresponds to this file.
https://github.com/google-research/electra/blob/master/build_pretraining_dataset.py#L34

I can understand what this part does: it's preprocessing that randomly splits sentences into two segments, merges them into one example, and so on.
But I can't understand why it does this.
I skimmed the ELECTRA paper, but I couldn't find the reason.
To my understanding, ELECTRA just needs many sentences, like BERT. Why are two segments needed, and why are they split randomly at preprocessing time?

I already asked this here, but there has been no response.
google-research/electra#114

I would be happy if you could reply when you know something and have time.

duration of training

Hi, I was wondering if we have to train for 10,000 epochs, as in the default setting of your code, to get the result. The official ELECTRA implementation trains for 1,000,000 steps, so 10,000 epochs seems too long (from the printed value, there appear to be 273,193 steps per epoch). Also, when training ELECTRA-Small, the save points seem to be 0.0625*(10**6). Is this number related to the step count?
Thanks :)
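
For what it's worth, a hedged reading of the defaults based only on the values visible in this repo (not an authoritative answer): training is bounded in steps by the RunSteps callback rather than in epochs, and the save points are fractions of c.steps.

c_steps = 10**6                              # 'steps' for the small config
save_fracs = [0.0625, 0.125, 0.25, 0.5, 1.0]
print([int(f * c_steps) for f in save_fracs])
# [62500, 125000, 250000, 500000, 1000000] -> 0.0625*(10**6) is the 6.25% save point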

How to train with multiple GPUs in parallel?

Hi, I am trying to train my own pretrained model with the ELECTRA method. I read elsewhere that your code implements ELECTRA's multi-GPU parallelism, but when I tried to run it, I found that only one GPU was being used. My num_workers was set to 4 and num_proc was set to 4, and the data size used for the attempted run is about 0.5k. What else needs to change to get multi-GPU parallelism working?

Error in store_attr()

I tried running the pretrain.py script and got this error:

process id: 76202
{'device': 'cuda:0', 'base_run_name': 'vanilla', 'seed': 11081, 'adam_bias_correction': False, 'schedule': 'original_linear', 'sampling': 'fp32_gumbel', 'electra_mask_style': True, 'tie_gen_in_out_embedding': False, 'gen_smooth_label': False, 'disc_smooth_label': False, 'size': 'small', 'my_model': False, 'run_name': 'vanilla_11081', 'mask_prob': 0.15, 'lr': 0.0005, 'bs': 128, 'steps': 1000000, 'max_length': 128}
{}
loading the electra data (wiki)
loading the electra data (BookCorpus)
electra_pytorch/venv/lib/python3.6/site-packages/nlp/utils/py_utils.py:191: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
  return function(data_struct)
Traceback (most recent call last):
  File "pretrain.py", line 390, in <module>
    RunSteps(c.steps, [0.0625, 0.125, 0.25, 0.5, 1.0], c.run_name+"_{percent}"),
  File "electra_pytorch/electra_pytorch/_utils/would_like_to_pr.py", line 51, in __init__
    store_attr(self, 'n_steps,save_points,base_name,no_val')
  File "electra_pytorch/venv/lib/python3.6/site-packages/fastcore/utils.py", line 97, in store_attr
    if not hasattr(self, '__stored_args__'): self.__stored_args__ = {}
AttributeError: 'str' object has no attribute '__stored_args__'

Updating fastai and fastcore does not solve the issue. I even went through the documentation and found that if we don't provide arguments to store_attr(), it automatically stores all parameters passed to the calling function. I tried that and the model did start training, but it got a nan loss right from epoch 1 (log below). Could you please help me resolve this? @richarddwang

process id: 76772
{'device': 'cuda:0', 'base_run_name': 'vanilla', 'seed': 11081, 'adam_bias_correction': False, 'schedule': 'original_linear', 'sampling': 'fp32_gumbel', 'electra_mask_style': True, 'tie_gen_in_out_embedding': False, 'gen_smooth_label': False, 'disc_smooth_label': False, 'size': 'small', 'my_model': False, 'run_name': 'vanilla_11081', 'mask_prob': 0.15, 'lr': 0.0005, 'bs': 128, 'steps': 1000000, 'max_length': 128}
{}
loading the electra data (wiki)
loading the electra data (BookCorpus)
electra_pytorch/venv/lib/python3.6/site-packages/nlp/utils/py_utils.py:191: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
  return function(data_struct)
vanilla_11081 , starts at 2020-09-07 09:02:02.063122
epoch     train_loss  valid_loss  time
0         nan         00:13
1         nan         00:18
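
A hedged observation, not a confirmed fix: newer fastcore versions changed the signature to store_attr(names=None, self=None, ...), so the old-style call store_attr(self, 'n_steps,...') binds the string to self and raises exactly the AttributeError above. A minimal demo of the new-style call, which infers self from the calling frame:

from fastcore.basics import store_attr

class Demo:
    def __init__(self, n_steps, save_points):
        store_attr('n_steps,save_points')  # `self` is picked up from the caller

d = Demo(10**6, [0.0625, 0.125, 0.25, 0.5, 1.0])
print(d.n_steps, d.save_points)
# the equivalent change in _utils/would_like_to_pr.py would drop `self` from the call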

Cannot run

[screenshot]
Cannot find the modeling file in the project.

Relative importance of different "tricks" in README

Thank you for the brilliant repository! You list some important tricks which are necessary to reproduce performance. In your experience, which of those tricks are critical for matching the released GLUE scores?

At a cursory glance, reordering the sentences to augment the dataset for the STS task seems to be a critical detail. I was wondering if that also aligns with your experience in running these models?

Small typo in the README.md

Hi, Thank you for the awesome repository.

There is a small typo in the second table. Row (ELECTRA-Small++) and Col (RTE) should be 63.6, not 6.36.

Thank you again.


Training time and ++ version

First of all, thanks for the great repo, this is an absolute lifesaver for me. I have two questions though:

  1. How long does it take to pretrain on a single GPU? I thought I read it somewhere but I can't find it anymore (maybe I'm remembering something that isn't there).

  2. You mention that the checkpoints in Huggingface are all the ++ version. Is the default configuration of pretrain.py the correct one for the ++ version, or what needs to change? I want to use your repo to train a Dutch ELECTRA-small model, but I want it to be as comparable as possible to the English ELECTRA-small checkpoint from Huggingface.

Custom Dataset

I am trying to train on a custom dataset, but I cannot process it. Mapping gives this error: "Column to remove ['validation'] not in the dataset. Current columns in the dataset: ['text']". I am using the code below, similar to the other datasets. Could you give a working example with a custom dataset like the one I am using?
babylm = datasets.load_dataset("asparius/babylm-10m","all.txt")
e_babylm = ELECTRAProcessor(babylm).map(num_proc=1)
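
A hedged guess at the cause, not a verified fix: load_dataset returns a DatasetDict of splits (and its second positional argument is a configuration name, not a file), while ELECTRAProcessor expects a single split, so selecting one before mapping may be all that is needed. The "train" split name is an assumption about this dataset.

import datasets

babylm = datasets.load_dataset("asparius/babylm-10m")["train"]
e_babylm = ELECTRAProcessor(babylm).map(num_proc=1)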

SST-2 accuracy is 50% after finetuning

Hi Richard,

Thanks for providing the implementation to pretrain ELECTRA in pytorch!

I tried pre-training an ELECTRA-small model with Wikipedia data and selected the 25% trained model (updated for 250k steps) to fine-tune on SST-2. At the end of each epoch, the validation accuracy on SST-2 stays around 50%. It seems to be an optimization issue if the accuracy stays the same throughout the whole training process. Do you have any idea why this happens? Thank you!!
