Giter Club home page Giter Club logo

allenai-olmo's Introduction

OLMo Logo

OLMo: Open Language Model

GitHub License GitHub release

OLMo is a repository for training and using AI2's state-of-the-art open language models. It is built by scientists, for scientists.

Installation

First install PyTorch according to the instructions specific to your operating system.

To install from source (recommended for training/fine-tuning) run:

git clone https://github.com/allenai/OLMo.git
cd OLMo
pip install -e .[all]

Otherwise you can install the model code by itself directly from PyPI with:

pip install ai2-olmo

Models overview

The core models in the OLMo family released so far are (all trained on the Dolma dataset):

Model Training Tokens Context Length
OLMo 1B 3 Trillion 2048
OLMo 7B 2.5 Trillion 2048
OLMo 7B Twin 2T 2 Trillion 2048

Fine-tuning

To fine-tune an OLMo model using our trainer you'll first need to prepare your dataset by tokenizing it and saving the tokens IDs to a flat numpy memory-mapped array. See scripts/prepare_tulu_data.py for an example with the Tulu V2 dataset, which can be easily modified for other datasets.

Next, prepare your training config. There are many examples in the configs/ directory that you can use as a starting point. The most important thing is to make sure the model parameters (the model field in the config) match up with the checkpoint you're starting from. To be safe you can always start from the config that comes with the model checkpoint. At a minimum you'll need to make the following changes to the config or provide the corresponding overrides from the command line:

  • Update load_path to point to the checkpoint you want to start from.
  • Set reset_trainer_state to true.
  • Update data.paths to point to the token_ids.npy file you generated.
  • Optionally update data.label_mask_paths to point to the label_mask.npy file you generated, unless you don't need special masking for the loss.
  • Update evaluators to add/remove in-loop evaluations.

Once you're satisfied with your training config, you can launch the training job via torchrun. For example:

torchrun --nproc_per_node=8 scripts/train.py {path_to_train_config} \
    --data.paths=[{path_to_data}/input_ids.npy] \
    --data.label_mask_paths=[{path_to_data}/label_mask.npy] \
    --load_path={path_to_checkpoint} \
    --reset_trainer_state

Note: passing CLI overrides like --reset_trainer_state is only necessary if you didn't update those fields in your config.

Inference

You can utilize our HuggingFace integration to run inference on the olmo checkpoints:

from hf_olmo import * # registers the Auto* classes

from transformers import AutoModelForCausalLM, AutoTokenizer

olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-7B")
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B")

message = ["Language modeling is "]
inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False)
response = olmo.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])

Alternatively, with the huggingface pipeline abstraction:

from transformers import pipeline
olmo_pipe = pipeline("text-generation", model="allenai/OLMo-7B")
print(olmo_pipe("Language modeling is"))

Inference on finetuned checkpoints

If you finetune the model using the code above, you can use the conversion script to convert a native OLMo checkpoint to a HuggingFace-compatible checkpoint

python hf_olmo/convert_olmo_to_hf.py --checkpoint-dir /path/to/checkpoint

Quantization

olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-7B", torch_dtype=torch.float16, load_in_8bit=True)  # requires bitsandbytes

The quantized model is more sensitive to typing / cuda, so it is recommended to pass the inputs as inputs.input_ids.to('cuda') to avoid potential issues.

Evaluation

Additional tools for evaluating OLMo models are available at the OLMo Eval repo.

allenai-olmo's People

Contributors

epwalsh avatar dirkgr avatar soldni avatar akshitab avatar 2015aroras avatar rodneykinney avatar kyleclo avatar ianmagnusson avatar muennighoff avatar aakanksha19 avatar rauthur avatar lolipopshock avatar oyvindtafjord avatar mewil avatar abhilasharavichander avatar ibeltagy avatar natolambert avatar nishantsubramani avatar dependabot[bot] avatar hamishivi avatar pranjali97 avatar saurabh111233212 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.