
mistral's People

Contributors

anarayan, dlwh, j38, krandiash, lorr1, siddk, skylion007, teetone, tiiiger, yifanmai

mistral's Issues

Better Batching/Pre-Processing Implementation

As discussed in #5, there are two options for avoiding the brute-force handling of the data truncation issue:

  1. Use Padding (Propagating change down to the model level).
  2. After we do the normalization/tokenization, we extract all the token ids into a contiguous array (potentially memory-mapped using PyArrow). Basically going back to what we did for Tempest; a minimal sketch follows.
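A minimal sketch of option 2, assuming NumPy memory-mapping rather than PyArrow (function and file names are illustrative):

import numpy as np

def pack_token_ids(tokenized_examples, out_path="train_tokens.npy", dtype=np.uint16):
    # Concatenate per-example token id lists into one contiguous, memory-mapped array.
    total = sum(len(ids) for ids in tokenized_examples)
    arr = np.lib.format.open_memmap(out_path, mode="w+", dtype=dtype, shape=(total,))
    offset = 0
    for ids in tokenized_examples:
        arr[offset:offset + len(ids)] = ids
        offset += len(ids)
    arr.flush()
    return out_path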

Fix: Have Log Files for Each Resumed Model/Add Loss and Metrics to Log

Currently we have one log file that collects output up until we call HF.train(). This log file gets overwritten when a model is resumed.

Ideal behavior:

  • New log per resumed model so we can track errors and progress (adding a timestamp to the filename would be an easy fix).
  • If possible, add a callback that writes loss values etc. to the log file, too (see the sketch below). This is very helpful when debugging a job on Kubernetes, where we don't have easy access to the process running the model.
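A minimal sketch of the second point, using the standard transformers TrainerCallback hook; the class name and log path are illustrative:

import json
import time
from transformers import TrainerCallback

class FileLoggingCallback(TrainerCallback):
    def __init__(self, log_dir="logs"):
        # One timestamped file per (resumed) run, so earlier logs are not overwritten.
        self.path = f"{log_dir}/metrics-{int(time.time())}.jsonl"

    def on_log(self, args, state, control, logs=None, **kwargs):
        # `logs` contains loss, learning rate, etc. as reported by the Trainer.
        if logs is not None:
            with open(self.path, "a") as f:
                f.write(json.dumps({"step": state.global_step, **logs}) + "\n")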

Quinfig Bugs

There are a series of minor issues with Quinfig currently that are halting our workflow - here they are ranked from most critical to least critical:

  • CRITICAL: Add support for passing command line arguments in addition to the config (e.g. --local_rank, which is necessary for all calls to torch.distributed.launch). We should test this with a minimal example (a rough sketch follows this list) - @lorr1 any ideas? These should also be able to override arguments in the original Quinfig as per Line 73 of train.py.
  • MID: Create Quinfig Schema - my understanding is this is probably necessary for the above.
  • MID: Handle nested inheritance (create recursive namespaces) so that > depth-2 args don't require "strings" as per Line 56 of train.py.
  • LOW: Remove extra print line in parse_quinfig() as per Line 49 of train.py
  • QUESTION: Is there a cleaner way to handle argument-injection based on runtime-defined variables? Specifically, what's the right way to write Lines 167-173 of train.py?
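A rough sketch of the CRITICAL point, assuming argparse pass-through on top of a parsed Quinfig; the Quinine constructor usage here is an assumption, not a tested example:

import argparse
from quinine import Quinfig  # assumed import; check the actual Quinine API

def parse_args_and_config():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True)
    # --local_rank is injected automatically by torch.distributed.launch
    parser.add_argument("--local_rank", type=int, default=-1)
    args = parser.parse_args()

    quinfig = Quinfig(config_path=args.config)  # assumed constructor signature
    quinfig.local_rank = args.local_rank        # inject/override the runtime-defined value
    return quinfig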

Let me know if I should create/link to separate issues in the Quinine repo.

Sharing Models through the Hugging Face Hub

Hi CRFM team!

Mistral is very exciting! I see you currently share your model checkpoints through links to a hosted server. Would you be interested in sharing the pretrained models in the Hugging Face Hub? We already have a similar collaboration with the Stanford NLP group (see org).

The Hub offers free hosting of over 20K models, and it would make your work more accessible and visible to the rest of the community. Some of the benefits of sharing your models would be:

  • forget about the pain of managing the hosting
  • built-in versioning
  • commit history and diffs
  • repos provide useful metadata about their tasks, languages, metrics, etc., which helps with discoverability and also with understanding the model

Creating the repos and adding new models should be a relatively straightforward process if you've used Git before. This is a step-by-step guide explaining the process in case you're interested. Please let us know if you would be interested and if you have any questions.
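For what it's worth, pushing an existing local checkpoint only takes a couple of calls; a minimal sketch (the local path and repo id below are placeholders, and running `huggingface-cli login` beforehand is assumed):

from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("path/to/local/checkpoint")        # placeholder path
tokenizer = GPT2TokenizerFast.from_pretrained("path/to/local/checkpoint")

model.push_to_hub("my-org/my-gpt2-checkpoint")       # placeholder repo id
tokenizer.push_to_hub("my-org/my-gpt2-checkpoint")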

In the future we could also integrate this with our Inference API so users can play with the models directly in the browser with our widgets.

Happy to hear your thoughts,
Omar and the Hugging Face team

cc @lewtun @anton-l @LysandreJik

Allow finetuning of mistral models using the HuggingFace Flax LM classes

It would be amazing if we could load and finetune the models on TPUs using the Flax LM classes in HF. In my experience, this makes training and generation very straightforward on TPUs, along, of course, with taking advantage of their compute.

I have tried to load a mistral checkpoint with the following code:
model = FlaxAutoModelForCausalLM.from_pretrained("alias/arwen-x21-checkpoint-400000", from_pt=True, pad_token_id=50256)
This seems to work. The model loads, I can access its properties, and can even generate text.

However, once I try to fine-tune it using (more or less) the code here: https://github.com/huggingface/transformers/blob/master/examples/flax/language-modeling/run_clm_flax.py, it takes about 10 minutes to compile and then about 5 minutes for each step (for reference, these should be about 2 minutes and a few seconds, respectively, for gpt2-medium).

Finally, it would be nice if the changes in the Mistral models were somehow included when loading the model in HF (I am actually not 100% sure that this doesn't already happen). Specifically, I'm thinking of this line here:

scale_factor = 1 / ((float(v.size(-1)) ** 0.5) * self.layer_num)

Hope this makes sense. Thank you in advance!

Best,
Theodore.

DeepSpeed Learning Rate and Loss Discrepancies?

Using the default parameters (even inheriting from HF) results in different Learning Rate scheduling behavior (and train/eval loss) compared to DDP or FairScale. Unclear why this is happening, but if we want to use DeepSpeed, we should sort this out.

Reproduce Battlestar Crash (from GCP) on Sphinxes

To facilitate debugging numerical instability, we need to reproduce the Battlestar Crash on GCP on the Sphinxes; unfortunately, our random seeding isn't perfect across different hardware configurations (see #70 for full issue). We seem to be seeing the same batches, but initialization seems off...

For this issue, implement a hot-fix that allows for loading checkpoint-0 from the initial Battlestar run on GCP (immediately after initialization) so that we (presumably) start with the same initialization and can replicate the crash...

Partial/Bespoke Gradient Checkpointing for GPT-2 Models

Currently, HuggingFace models implement Gradient Checkpointing for every block even if it's not necessary. With bespoke gradient checkpointing, we can control how many blocks get wrapped with checkpointing, and save memory/time dynamically.
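A minimal sketch of the idea with toy modules (not the actual HF GPT-2 classes): only the first n_checkpointed blocks get wrapped, the rest run normally.

import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class PartiallyCheckpointedStack(nn.Module):
    def __init__(self, blocks, n_checkpointed):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.n_checkpointed = n_checkpointed

    def forward(self, hidden_states):
        for i, block in enumerate(self.blocks):
            if self.training and i < self.n_checkpointed:
                # Recompute this block's activations during backward to save memory.
                hidden_states = checkpoint(block, hidden_states)
            else:
                hidden_states = block(hidden_states)
        return hidden_states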

DeepSpeed Dynamic Loss Scaling Panic?

On starting a DeepSpeed run, there's this massive floating point panic where DeepSpeed dynamically rescales the loss and throws a ton of warnings. It's unclear why this is happening.

[2021-03-18 04:21:58,423] [INFO] [stage1.py:633:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
[2021-03-18 04:21:58,423] [INFO] [stage1.py:633:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
[2021-03-18 04:21:58,424] [INFO] [stage1.py:633:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0

Enable pre-commit CI

Is your feature request related to a problem? Please describe.

This pre-commit config is not enforced by GitHub Actions or pre-commit CI. Please consider enabling them.

Hot-Fix HF Caching Permissions Issue

Currently, when using HF caching with Datasets + Tokenizers, file permissions are slightly problematic and locked to a single user. Ideally we escalate this to the HF Datasets repo and fix it upstream.

In the meantime, a temporary, pythonic fix would be good to have.
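One possible temporary fix, assuming we simply relax group permissions on the cache directory after it is built (the exact permission bits are a judgment call):

import os
import stat

def open_cache_permissions(cache_dir):
    # Grant group read/write (plus execute on directories) so other users can reuse the cache.
    for root, dirs, files in os.walk(cache_dir):
        for name in dirs + files:
            path = os.path.join(root, name)
            extra = stat.S_IRGRP | stat.S_IWGRP
            if os.path.isdir(path):
                extra |= stat.S_IXGRP
            os.chmod(path, os.stat(path).st_mode | extra)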

Factor out and Fix Dataset Preprocessing Logic

Currently, most of the dataset fetching and preprocessing code is embedded directly in Lines 98 - 157 of train.py.

We should:

  1. Extract/factor out this code and put it in src/corpora/auto.py (general class for any HF dataset, following a "generic" API - similar to run_clm.py).
  2. Clean up the truncation and pre-processing correctness (this may or may not matter too much). Specifically, I'd like to get around the fact that:
    - We pre-compute splits only once rather than every epoch.
    - We drop a lot of text. Switching back to the Tempest dataset is probably not the best option because multiprocessing is nice, so we should find some other workaround.

Resolving this issue should both clean up the existing dataset code and ensure more principled/correct pre-processing. A rough sketch of the factored-out API follows.
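One possible shape for the src/corpora/auto.py entry point, loosely following run_clm.py's tokenize-then-group pattern (function and argument names are illustrative, and the remainder-dropping behavior below is exactly the part we may want to improve):

from itertools import chain
from datasets import load_dataset

def get_auto_dataset(tokenizer, dataset_id, dataset_name=None, seq_len=1024, num_proc=4):
    raw = load_dataset(dataset_id, dataset_name)

    def tokenize(examples):
        return tokenizer(examples["text"])

    tokenized = raw.map(tokenize, batched=True, num_proc=num_proc,
                        remove_columns=raw["train"].column_names)

    def group_texts(examples):
        # Concatenate everything, then chunk into fixed-length blocks (drops the tail).
        concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
        total_len = (len(concatenated["input_ids"]) // seq_len) * seq_len
        result = {k: [v[i:i + seq_len] for i in range(0, total_len, seq_len)]
                  for k, v in concatenated.items()}
        result["labels"] = result["input_ids"].copy()
        return result

    return tokenized.map(group_texts, batched=True, num_proc=num_proc)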

[Installation] Resolving dependency chain due to the latest Transformer version

The current *-gpu.yaml fails due to issues in the dependency chain introduced by the latest Transformers (4.12) version. For example, there are dependency conflicts between Transformers and certain versions of the huggingface_hub, datasets, etc. libraries. Which Transformers version should we use for a smooth installation?

Set Gradient Accumulation Steps Dynamically

Currently, we set effective batch size AND gradient accumulation steps (along with nodes, total GPUs) all statically, in one of our Quinfig files.

It might be nicer to compute gradient accumulation steps dynamically, based on the hardware + the desired effective batch size (generally just cleaner).
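A minimal sketch of the computation, assuming we know the desired effective batch size, the per-device batch size, and the world size at launch time:

def compute_grad_accumulation(effective_bsz, per_device_bsz, world_size):
    # effective_bsz = per_device_bsz * world_size * gradient_accumulation_steps
    assert effective_bsz % (per_device_bsz * world_size) == 0, \
        "effective batch size must be divisible by per_device_bsz * world_size"
    return effective_bsz // (per_device_bsz * world_size)

# e.g. effective 512, 8 per GPU, 16 GPUs -> 4 accumulation steps
assert compute_grad_accumulation(512, 8, 16) == 4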

Better Console/File Logging

Currently our logger (dubbed overwatch) is initialized via a function, with lots of manual specification. Furthermore, when calling external library code (e.g. calls to HF.datasets, HF.transformers), the corresponding prints and external log calls are not captured in our logger.

To fix this, take a look at the following TODOs:

  • Line 19 of overwatch.py describes how to initialize a Logger from a YAML definition file. This would be nice because we could auto-generate the logger in parallel with parsing the Quinfig.
  • Line 20 of overwatch.py details how to wrap external calls with context managers to redirect loggers... not sure if this is the right thing to do here, so maybe let's open a discussion?

Resolving this issue will probably entail at least the first point above, plus a semi-clean solution to the second point.
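For the first point, a minimal sketch using the standard library's logging.config.dictConfig plus PyYAML (file and logger names are illustrative):

import logging
import logging.config
import yaml

def get_overwatch(config_path="conf/overwatch.yaml", name="mistral"):
    # Build handlers/formatters from a YAML definition instead of manual specification.
    with open(config_path) as f:
        logging.config.dictConfig(yaml.safe_load(f))
    return logging.getLogger(name)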

Fairscale ZeRO-Offload Bug

There seems to be a weird mixed precision assertion error/bug in FairScale's ZeRO-Offload. Follow up with either Stas at Hugging Face, or with the FairScale folks directly.

Fix: DeepSpeed Resume Behavior

Currently, resuming a DeepSpeed distributed run (multi-node) requires replicating and syncing a bunch of files for a given checkpoint directory. Specifically:

  • Copy all optimizer states (in checkpoint-XXX/global_stepXXX/) from each node (e.g. when saved, Node 0 has optimizer states checkpoints 0 - 7, Node 1 has 8 - 15, and so on) to all nodes.
  • Copy mpu (model) state in checkpoint-XXX/global_stepXXX to all nodes.
  • Copy top-level Transformers checkpoints (pytorch.bin, json files, txt files) to checkpoint directories on all nodes.

Create Ground-Truth, Verified GPT-2 Small Configuration

Create and verify GPT-2 Small Configuration. Should include the following:

  • Closest possible translation from GPT-2 original Cosine Schedule to Linear Scheduler (for DeepSpeed).
  • Custom GPT-2 Weight Initialization
  • LR and Scheduling Parameters from Neo-X/Megatron-LM

Enable static typing with mypy

Is your feature request related to a problem? Please describe.
This repo currently has type hints, but they are not enforced and many are incorrect or incomplete.

Torch.Distributed and Vanilla DistributedDataParallel (Single & Multi-Node)

Prior to getting DeepSpeed and FairScale integrated, we need to have code in place for running/launching DDP jobs with the HF Trainer.

With Data already processed, this should just be calls to torch.distributed.launch with the right arguments, but for future-proofing and cleanliness, we should also:

  • Add torch.distributed.barrier to the appropriate places in the preprocessing code (see the sketch after this list).
  • Write wrapper scripts (shell scripts) that auto-call torch.distributed.launch with the valid arguments.
  • Write a cleanup function (probably wrap in try/finally somewhere) that greps ps aux for all running processes, and runs pkill / kill -9, since torch.distributed.launch doesn't clean-up by itself.
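A minimal sketch of the barrier placement from the first bullet, assuming rank 0 builds the cache while the other ranks wait and then read from it:

import torch.distributed as dist

def build_dataset_with_barrier(local_rank, build_fn):
    if dist.is_initialized() and local_rank not in (-1, 0):
        dist.barrier()        # non-main ranks wait for rank 0 to finish preprocessing
    dataset = build_fn()      # rank 0 does the work; other ranks then hit the cache
    if dist.is_initialized() and local_rank == 0:
        dist.barrier()        # release the waiting ranks
    return dataset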

Add Arbitrary Save/Evaluation Schedules for HuggingFace Trainer

Currently, we can only save checkpoints/evaluate every K steps for some fixed value of K. Would be nice to pass in a full list schedule (of arbitrary length, with schedule[-1] == max_steps) to maintain periodicity of checkpoints/evaluation.

Might make sense to have two separate arguments -- one for the evaluation schedule, one for the saving schedule. Probably would require either sub-classing the DefaultFlowCallback() in Transformers, or writing a custom Callback.

Depending, can be contributed back to HuggingFace.
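A minimal sketch of the custom-callback route, assuming two explicit lists of global steps (not tested against the Trainer's DefaultFlowCallback, which may also need to be disabled):

from transformers import TrainerCallback

class ScheduledFlowCallback(TrainerCallback):
    def __init__(self, save_schedule, eval_schedule):
        # e.g. save_schedule = [100, 500, 1000, ..., max_steps]
        self.save_steps = set(save_schedule)
        self.eval_steps = set(eval_schedule)

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step in self.save_steps:
            control.should_save = True
        if state.global_step in self.eval_steps:
            control.should_evaluate = True
        return control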

Upscale K, Q and Scaled Dot-Product Attention

Problem: It seems like most existing repositories (mesh-TF, Megatron, possibly FairSeq) upscale K, Q, V and perform the dot-product in FP32 rather than in FP16 (which is where we have been noticing the overflow). We have a few ways to fix this, which we'll work through step-by-step (with thorough testing) before landing on a final solution.


Deliverable: Constrain Scaled-Dot Product Attention to be in FP32 (manually) -- look at Mesh-TF implementation to identify which ops need to be in FP32 (and where to cast back down).
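A minimal sketch of the manual FP32 constraint (illustrative shapes and names, not the Mesh-TF or HF code):

import torch

def fp32_scaled_dot_product(q, k, v, scale):
    # Upcast before the matmul/softmax, then cast the output back to the original dtype.
    orig_dtype = q.dtype
    q, k, v = q.float(), k.float(), v.float()
    scores = torch.matmul(q, k.transpose(-1, -2)) * scale
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v).to(orig_dtype)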

Question: How does this play with autocast and DeepSpeed? We need to look into this.

Testing: Resume the battlestar run a few thousand steps prior to crash (we should be able to, because weights aren't changing). If all goes well, we should not see an overflow at the same point.

Additionally stack with #66 to ensure "more stability."

Push Stability Fixes / Create Issues on HF Transformers

We have a series of stability fixes (GPT-2 initialization, stability tweaks, etc.) in our codebase that aren't present in the default GPT-2 model on HF Transformers.

We should push the fixes that don't change semantics (EDIT: it's not clear to me any of these changes keep semantics entirely the same) as PRs, and open issues to figure out with the HF team directly how to incorporate these new fixes.

Might make sense to create a new GPT-2 model... but also feels like adding overhead.

Verify DeepSpeed Checkpoints

Some HF users have noticed that using Checkpoints dumped by DeepSpeed leads to issues with the common HF pipeline. This is relatively important, so we should verify that we can recover "regular" Hugging Face functionality with our checkpoints.

Conda installation fails

Conda fails when I run

conda env create -f environments/environment-cpu.yaml

with the error

Pip subprocess error:
ERROR: Could not find a version that satisfies the requirement transformers==4.4.0.dev0
ERROR: No matching distribution found for transformers==4.4.0.dev0

I'm assuming this is because conda isn't freezing the pip requirements correctly for the libraries that are installed from source (i.e. pip install git+https://github.com/huggingface/transformers)? This may not be supported by conda directly (https://stackoverflow.com/a/19071214).

More logging to debug instability

In order to debug why our runs are crashing, I want to log the first and second moments in the Adam optimizer, as well as the actual updates.

Once this feature is implemented, we can resume training of dark matter and battlestar right before the crash and checkpoint more frequently (every 100 steps, for example).

For battlestar, we can do this starting at 165K steps and for dark matter, we can do this starting at 46K steps.
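A minimal sketch of pulling the moments out of a PyTorch Adam/AdamW optimizer state; how and where this gets wired into the Trainer loop is left open:

def log_adam_moments(optimizer, logger, step):
    # Adam/AdamW keep per-parameter "exp_avg" (first moment) and "exp_avg_sq" (second moment).
    for group in optimizer.param_groups:
        for p in group["params"]:
            state = optimizer.state.get(p, {})
            if "exp_avg" not in state:
                continue
            logger.info(
                f"step={step} m1_norm={state['exp_avg'].norm().item():.4e} "
                f"m2_norm={state['exp_avg_sq'].norm().item():.4e}"
            )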

Bug - Current Evaluation Loads Incorrect GPT-2 Model

With the current evaluation code (anything not called directly via train.py), we end up loading the default Hugging Face GPT-2 model (e.g., in all our evaluation scripts). This is problematic because the model we use while training implements the FP16 stability heuristics -- including further scaling the scaled dot-product attention by 1 / layer_idx.

This is a subtle bug in that the default Hugging Face model can still load the correct weights (we're only adding algorithmic logic, no extra parameters), but this significantly hurts evaluation performance.

Fixing this would require changing the default model that gets loaded to the MistralGPT2 model with the layer-wise scaling enabled.

Fix Scaled Dot-Product Attention Order of Operations

Problem: It seems like most existing repositories (mesh-TF, Megatron, possibly FairSeq) upscale K, Q, V and perform the dot-product in FP32 rather than in FP16 (which is where we have been noticing the overflow). We have a few ways to fix this, which we'll work through step-by-step (with thorough testing) before landing on a final solution.


Deliverable: In this issue, switch the order of operations of the scaled-dot product attention as follows:

  • Normal: (1 / root(dk)) [K @ Q]
  • New: ((1 / root(dk)) K) @ Q

This is done in Megatron by using torch.baddbmm with beta=0.0, alpha = 1/root(dk). Rewrite the GPT-2 forward pass to use this.
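A minimal sketch of the reordered computation (shapes are illustrative, flattened to (batch * heads, seq, d_k)); with beta=0.0 the contents of the input tensor are ignored:

import torch

def scaled_attention_scores(q, k, d_k):
    # baddbmm computes beta * input + alpha * (q @ k^T); the scale is folded into the matmul.
    scores = torch.empty(q.size(0), q.size(1), k.size(1), dtype=q.dtype, device=q.device)
    return torch.baddbmm(scores, q, k.transpose(-1, -2), beta=0.0, alpha=1.0 / (d_k ** 0.5))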

Testing: Resume the battlestar run a few thousand steps prior to crash. If all goes well, we should not see an overflow at the same point.

Mistral doesn't join docs with the <|endoftext|> separator

Unlike GPT-2 and other GPT-style LMs, the Mistral codebase and pretrained models do not make use of the special <|endoftext|> token.

Evidence that this is true:

  1. When prompted with this token, the pretrained models usually begin in the middle of a sentence.
  2. If I understand correctly, this line in get_auto_dataset concatenates tokenized documents without inserting anything in between them.
  • If this code was used to prepare data for the pretrained models, that would explain the behavior noted in point 1.

Was this a deliberate choice? Mistral follows GPT-2 carefully in other respects, so I'm surprised by this difference.

Also, concatenating documents without inserting such a token seems sub-optimal from a language modeling perspective. At the boundaries between documents, it produces sudden discontinuities in style/content. The resulting dataset makes it look to the LM as if such discontinuities were a feature of natural text, which they aren't.
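For reference, a minimal sketch of what inserting the separator during tokenization could look like (illustrative only; whether and where this belongs in get_auto_dataset is exactly the open question):

def tokenize_with_separator(examples, tokenizer):
    # Append <|endoftext|> (id 50256 for the GPT-2 tokenizer) to each document so that
    # document boundaries survive the later concatenation/chunking step.
    eos = tokenizer.eos_token_id
    output = tokenizer(examples["text"])
    output["input_ids"] = [ids + [eos] for ids in output["input_ids"]]
    output["attention_mask"] = [mask + [1] for mask in output["attention_mask"]]
    return output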

Intensive Benchmarking

Integrate Online Evaluation Code, spend some initial time tuning Batch Size and other parameters, then run "Intensive Benchmarking" (1000 updates, evaluate every 100 updates, log every 50 steps) for the following 8 runs (Multi-Node = 16 GPUs):

  • Vanilla DDP - FP 16 - Per Device BSZ = 8, Accumulation = 4
  • Vanilla DDP - FP 16 - Gradient Checkpointing - Per Device BSZ = 32, Accumulation = None
  • FairScale ZeRO Stage 2 - FP 16 - Per Device BSZ = 8, Accumulation = 4
  • FairScale ZeRO Stage 3 - FP 16 - Per Device BSZ = 8, Accumulation = 4
  • DeepSpeed ZeRO Stage 1 - FP 16 - Per Device BSZ = 8, Accumulation = 4
  • DeepSpeed ZeRO Stage 1 - FP 16 - Per Device BSZ = 16, Accumulation = 2
  • DeepSpeed ZeRO Stage 2 - FP 16 - Per Device BSZ = 8, Accumulation = 4
  • DeepSpeed ZeRO Stage 2 - FP 16 - Per Device BSZ = 16, Accumulation = 2

Error in Tokenization Cache Name

I tried to run python train.py --config conf/gpt2-sphinx-debug-config.yaml on main and discovered an error.

The cache name generated by this line of code (https://github.com/stanford-mercury/mistral/blob/main/src/corpora/auto.py#L51) is /scr-ssd/mercury/mistral/artifacts/gpt2-processed/preprocessing/tokenization/train-tokenized.hf but in fact we want this to be /scr-ssd/mercury/mistral/artifacts/gpt2-processed/openwebtext/preprocessing/tokenization/train-tokenized.hf (i.e. record the dataset name).

The fix is to add dataset_id as we did for the other cache in https://github.com/stanford-mercury/mistral/blob/main/src/corpora/auto.py#L32
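A minimal sketch of the intended path construction with the dataset id included (names are illustrative, not the actual auto.py code):

import os

def tokenization_cache_file(artifacts_dir, run_id, dataset_id, split):
    return os.path.join(
        artifacts_dir,
        f"{run_id}-processed",
        dataset_id,              # previously missing component, e.g. "openwebtext"
        "preprocessing",
        "tokenization",
        f"{split}-tokenized.hf",
    )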
