
transformers-trainers's Introduction

lmtuners

This repo contains trainers for language model pre-training tasks. Currently, there are two kinds:

  1. LMTrainer (normal/causal LM as well as masked LM)
  2. DiscLMTrainer (discriminative language modelling task from ELECTRA paper)

We've only built small models with this library (they fit on one GPU), but the code should generalize to bigger models. We don't have the resources to experiment at that scale, but it should be relatively easy to adapt the lightning modules to other needs.

Dependencies

This package is built on top of:

  • huggingface/transformers
    • model implementations (*ForMaskedLM, *ForTokenClassification) and optimizers
  • huggingface/tokenizers
    • their Rust-backed fast tokenizers
  • pytorch-lightning
    • Abstracts training loops, checkpointing, multi-GPU/distributed training, and other training features.
    • TPU support is theoretically available but still a work in progress.
  • pytorch-lamb
    • LAMB optimizer implementation.

transformers-trainers's People

Contributors

shoarora


Forkers

thak123, bcmi220

transformers-trainers's Issues

Unable to load the trained model

Thanks for your contribution and previous bug fixes.
I was able to train an Albert model from scratch.

But I am facing issues using the PyTorch-Lightning-saved .ckpt file in the HuggingFace codebase that I have.
I cannot use the .ckpt file directly because there is no corresponding index file present (the from_pretrained method throws an error).
I am not entirely sure why PyTorch-Lightning saves the model in the .ckpt format, which is a TensorFlow convention, rather than the usual PyTorch .pt or .bin.
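A workaround sketch, under one assumption about this repo's modules (not confirmed): the LightningModule stores the transformers model under an attribute named model, so checkpoint keys carry a "model." prefix. A .ckpt file is an ordinary torch pickle whose weights sit under the "state_dict" key; stripping the prefix yields a state dict a transformers model can load. The helper below is hypothetical, not part of lmtuners:

```python
def to_hf_state_dict(pl_checkpoint, prefix="model."):
    """Extract transformers-compatible weights from a Lightning checkpoint dict.

    pl_checkpoint: the dict produced by torch.load() on a .ckpt file.
    prefix: the LightningModule attribute name under which the model was stored
            (assumed "model." here).
    """
    state_dict = pl_checkpoint["state_dict"]
    return {k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)}

# Usage (requires torch):
#   ckpt = torch.load("some_checkpoint.ckpt", map_location="cpu")
#   torch.save(to_hf_state_dict(ckpt), "pytorch_model.bin")
#   # from_pretrained() can then pick up pytorch_model.bin alongside config.json
```

Saving the result as pytorch_model.bin next to the model's config.json sidesteps the missing-index-file error, since that is the layout from_pretrained expects for PyTorch weights.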

Not able to pre-tokenize the input

After downloading the OpenWebText corpus, I extract it with the tar xvf openwebtext.tar.gz command. When I then run python -m lmtuners.utils.tokenize_and_cache_data data/ data_tokenized_128/ --tokenizer_path bert-base-uncased-vocab.txt --max_length=64, I get an error saying

skipping urlsf_subset16-730_data.xz
0 tokens, 0 examples: 12% 2423/20610 [00:01<00:09, 1893.57it/s]'utf-8' codec can't decode byte 0xfd in position 0: invalid start byte

for every file. Could you please help me overcome this issue? @shoarora
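One possible workaround, based on a guess about the cause (the extracted urlsf_subset*.xz shards are still xz-compressed, and any stray non-UTF-8 byte crashes a strict decoder): decompress .xz shards with the stdlib lzma module and decode leniently so undecodable bytes are dropped instead of raising. This decode_shard helper is a sketch, not part of lmtuners:

```python
import lzma

def decode_shard(path):
    """Read a corpus shard, decompressing .xz files and dropping undecodable bytes."""
    opener = lzma.open if str(path).endswith(".xz") else open
    with opener(path, "rb") as f:
        return f.read().decode("utf-8", errors="ignore")
```

Note that errors="ignore" silently discards bytes, which can corrupt a handful of examples; inspecting one failing file first (e.g. with xxd) is the safer way to confirm what those 0xfd bytes actually are.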

Token Ids are negative when preprocessed.

Hi,

I am trying to run the code in this repo with the bert-base-multilingual-cased-vocab.txt vocab,
but the token ids produced during preprocessing are negative,
and passing them to the network throws the error below.

I tried varying the max length from 64 to 256.

RuntimeError: index out of range: Tried to access index -28619 out of table with 119546 rows. at /tmp/pip-req-build-ufslq_a9/aten/src/TH/generic/THTensorEvenMoreMath.cpp:418
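The reported index is consistent with 16-bit integer wraparound: 36917 stored in a signed 16-bit slot reads back as 36917 - 65536 = -28619. One plausible (unconfirmed) explanation is that the preprocessing cache stores token ids in a 16-bit array, which cannot hold the ~119k-entry multilingual vocab, while smaller vocabs such as bert-base-uncased (~30k ids) happen to fit. A minimal sketch of the wraparound:

```python
import struct

def as_int16(token_id):
    """Reinterpret a token id the way a signed 16-bit array would store it."""
    return struct.unpack("<h", struct.pack("<H", token_id & 0xFFFF))[0]

print(as_int16(36917))   # -> -28619, the index from the traceback
print(as_int16(12345))   # -> 12345; ids below 32768 are unaffected
```

If this is the cause, the fix would be to store cached ids in a wider dtype (e.g. int32) whenever the vocab size exceeds 32767.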
