
transformers-trainers's Introduction

lmtuners

This repo contains trainers for language model pre-training tasks. Currently, there are two kinds:

  1. LMTrainer (normal/causal LM as well as masked LM)
  2. DiscLMTrainer (discriminative language modelling task from ELECTRA paper)

We've only built small models with this library (they fit on one GPU), but the code should generalize to bigger models. We don't have the resources to experiment at that scale, but it should be relatively easy to adapt the lightning modules to other needs.

Dependencies

This package is built on top of:

  • huggingface/transformers
    • model implementations (*ForMaskedLM, *ForTokenClassification) and optimizers
  • huggingface/tokenizers
    • their Rust-backed fast tokenizers
  • pytorch-lightning
    • Abstracts training loops, checkpointing, multi-GPU/distributed training, and other training features.
    • TPU support is theoretically available but still a work in progress.
  • pytorch-lamb
    • LAMB optimizer implementation.

transformers-trainers's People

Contributors

shoarora


Forkers

thak123, bcmi220

transformers-trainers's Issues

Unable to load the trained model

Thanks for your contribution and previous bug fixes.
I was able to train an Albert model from scratch.

But I am facing issues using the PyTorch-Lightning-saved .ckpt file in the HuggingFace codebase that I have.
I cannot use the .ckpt file directly because there is no corresponding index file present (the from_pretrained method throws an error).
I am not entirely sure why PyTorch-Lightning saves the model in the .ckpt format, which is a TensorFlow convention, rather than the usual PyTorch .pt or .bin.
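A workaround sketch, under one assumption about this repo's modules (not confirmed): the LightningModule stores the transformers model under an attribute named model, so checkpoint keys carry a "model." prefix. A .ckpt file is an ordinary torch pickle whose weights sit under the "state_dict" key; stripping the prefix yields a state dict a transformers model can load. The helper below is hypothetical, not part of lmtuners:

```python
def to_hf_state_dict(pl_checkpoint, prefix="model."):
    """Extract transformers-compatible weights from a Lightning checkpoint dict.

    pl_checkpoint: the dict produced by torch.load() on a .ckpt file.
    prefix: the LightningModule attribute name under which the model was stored
            (assumed "model." here).
    """
    state_dict = pl_checkpoint["state_dict"]
    return {k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)}

# Usage (requires torch):
#   ckpt = torch.load("some_checkpoint.ckpt", map_location="cpu")
#   torch.save(to_hf_state_dict(ckpt), "pytorch_model.bin")
#   # from_pretrained() can then pick up pytorch_model.bin alongside config.json
```

Saving the result as pytorch_model.bin next to the model's config.json sidesteps the missing-index-file error, since that is the layout from_pretrained expects for PyTorch weights.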

Not able to pre-tokenize the input

After downloading the OpenWebText corpus, I extract it with the tar xvf openwebtext.tar.gz command. When I then run python -m lmtuners.utils.tokenize_and_cache_data data/ data_tokenized_128/ --tokenizer_path bert-base-uncased-vocab.txt --max_length=64, I get an error saying

skipping urlsf_subset16-730_data.xz
0 tokens, 0 examples: 12% 2423/20610 [00:01<00:09, 1893.57it/s]'utf-8' codec can't decode byte 0xfd in position 0: invalid start byte

for every file. Could you please help me overcome this issue? @shoarora
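One possible workaround, based on a guess about the cause (the extracted urlsf_subset*.xz shards are still xz-compressed, and any stray non-UTF-8 byte crashes a strict decoder): decompress .xz shards with the stdlib lzma module and decode leniently so undecodable bytes are dropped instead of raising. This decode_shard helper is a sketch, not part of lmtuners:

```python
import lzma

def decode_shard(path):
    """Read a corpus shard, decompressing .xz files and dropping undecodable bytes."""
    opener = lzma.open if str(path).endswith(".xz") else open
    with opener(path, "rb") as f:
        return f.read().decode("utf-8", errors="ignore")
```

Note that errors="ignore" silently discards bytes, which can corrupt a handful of examples; inspecting one failing file first (e.g. with xxd) is the safer way to confirm what those 0xfd bytes actually are.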

Token Ids are negative when preprocessed.

Hi,

I am trying to run the code in this repo with the bert-base-multilingual-cased-vocab.txt vocab,
but the token ids produced during preprocessing are negative,
and passing them to the network throws the error below.

I tried varying the max length from 64 to 256.

RuntimeError: index out of range: Tried to access index -28619 out of table with 119546 rows. at /tmp/pip-req-build-ufslq_a9/aten/src/TH/generic/THTensorEvenMoreMath.cpp:418
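The reported index is consistent with 16-bit integer wraparound: 36917 stored in a signed 16-bit slot reads back as 36917 - 65536 = -28619. One plausible (unconfirmed) explanation is that the preprocessing cache stores token ids in a 16-bit array, which cannot hold the ~119k-entry multilingual vocab, while smaller vocabs such as bert-base-uncased (~30k ids) happen to fit. A minimal sketch of the wraparound:

```python
import struct

def as_int16(token_id):
    """Reinterpret a token id the way a signed 16-bit array would store it."""
    return struct.unpack("<h", struct.pack("<H", token_id & 0xFFFF))[0]

print(as_int16(36917))   # -> -28619, the index from the traceback
print(as_int16(12345))   # -> 12345; ids below 32768 are unaffected
```

If this is the cause, the fix would be to store cached ids in a wider dtype (e.g. int32) whenever the vocab size exceeds 32767.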
