
mistral's People

Contributors

anarayan, dlwh, j38, krandiash, lorr1, siddk, skylion007, teetone, tiiiger, yifanmai

mistral's Issues

Better Batching/Pre-Processing Implementation

As discussed in #5, there are two options for avoiding the brute-force handling of the data truncation issue:

  1. Use Padding (Propagating change down to the model level).
  2. After we do the normalization/tokenization, we extract all the token ids into a contiguous array (potentially memory-mapped using PyArrow). Basically going back to what we did for Tempest; a minimal sketch follows.
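A minimal sketch of option 2, assuming NumPy memory-mapping rather than PyArrow (function and file names are illustrative):

import numpy as np

def pack_token_ids(tokenized_examples, out_path="train_tokens.npy", dtype=np.uint16):
    # Concatenate per-example token id lists into one contiguous, memory-mapped array.
    total = sum(len(ids) for ids in tokenized_examples)
    arr = np.lib.format.open_memmap(out_path, mode="w+", dtype=dtype, shape=(total,))
    offset = 0
    for ids in tokenized_examples:
        arr[offset:offset + len(ids)] = ids
        offset += len(ids)
    arr.flush()
    return out_path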

Fix: Have Log Files for Each Resumed Model/Add Loss and Metrics to Log

Currently we have one log file that collects output up until we call HF.train(). This log file gets overwritten when a model is resumed.

Ideal behavior:

  • New log per resumed model so we can track errors and progress (adding a timestamp to the filename would be an easy fix).
  • If possible, add a callback that writes loss values etc. to the log file, too (see the sketch below). This is very helpful when debugging a job on Kubernetes, where we don't have easy access to the process running the model.
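A minimal sketch of the second point, using the standard transformers TrainerCallback hook; the class name and log path are illustrative:

import json
import time
from transformers import TrainerCallback

class FileLoggingCallback(TrainerCallback):
    def __init__(self, log_dir="logs"):
        # One timestamped file per (resumed) run, so earlier logs are not overwritten.
        self.path = f"{log_dir}/metrics-{int(time.time())}.jsonl"

    def on_log(self, args, state, control, logs=None, **kwargs):
        # `logs` contains loss, learning rate, etc. as reported by the Trainer.
        if logs is not None:
            with open(self.path, "a") as f:
                f.write(json.dumps({"step": state.global_step, **logs}) + "\n")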

Quinfig Bugs

There are a series of minor issues with Quinfig currently that are halting our workflow - here they are ranked from most critical to least critical:

  • CRITICAL: Add support for passing command line arguments in addition to the config (e.g. --local_rank, which is necessary for all calls to torch.distributed.launch). We should test this with a minimal example (a rough sketch follows this list) - @lorr1 any ideas? These should also be able to override arguments in the original Quinfig as per Line 73 of train.py.
  • MID: Create Quinfig Schema - my understanding is this is probably necessary for the above.
  • MID: Handle nested inheritance (create recursive namespaces) so that > depth-2 args don't require "strings" as per Line 56 of train.py.
  • LOW: Remove extra print line in parse_quinfig() as per Line 49 of train.py
  • QUESTION: Is there a cleaner way to handle argument-injection based on runtime-defined variables? Specifically, what's the right way to write Lines 167-173 of train.py?
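A rough sketch of the CRITICAL point, assuming argparse pass-through on top of a parsed Quinfig; the Quinine constructor usage here is an assumption, not a tested example:

import argparse
from quinine import Quinfig  # assumed import; check the actual Quinine API

def parse_args_and_config():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True)
    # --local_rank is injected automatically by torch.distributed.launch
    parser.add_argument("--local_rank", type=int, default=-1)
    args = parser.parse_args()

    quinfig = Quinfig(config_path=args.config)  # assumed constructor signature
    quinfig.local_rank = args.local_rank        # inject/override the runtime-defined value
    return quinfig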

Let me know if I should create/link to separate issues in the Quinine repo.

Sharing Models through the Hugging Face Hub

Hi CRFM team!

Mistral is very exciting! I see you currently share your model checkpoints through links to a hosted server. Would you be interested in sharing the pretrained models in the Hugging Face Hub? We already have a similar collaboration with the Stanford NLP group (see org).

The Hub offers free hosting of over 20K models, and it would make your work more accessible and visible to the rest of the community. Some of the benefits of sharing your models would be:

  • forget about the pain of managing the hosting
  • built-in versioning
  • commit history and diffs
  • repos provide useful metadata about their tasks, languages, metrics, etc., which helps with discoverability and also with understanding the model

Creating the repos and adding new models should be a relatively straightforward process if you've used Git before. This is a step-by-step guide explaining the process in case you're interested. Please let us know if you would be interested and if you have any questions.
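For what it's worth, pushing an existing local checkpoint only takes a couple of calls; a minimal sketch (the local path and repo id below are placeholders, and running `huggingface-cli login` beforehand is assumed):

from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("path/to/local/checkpoint")        # placeholder path
tokenizer = GPT2TokenizerFast.from_pretrained("path/to/local/checkpoint")

model.push_to_hub("my-org/my-gpt2-checkpoint")       # placeholder repo id
tokenizer.push_to_hub("my-org/my-gpt2-checkpoint")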

In the future we could also integrate this with our Inference API so users can play with the models directly in the browser with our widgets.

Happy to hear your thoughts,
Omar and the Hugging Face team

cc @lewtun @anton-l @LysandreJik

Allow finetuning of mistral models using the HuggingFace Flax LM classes

It would be amazing if we could load and finetune the models on TPUs using the Flax LM classes in HF. In my experience, this makes training and generation very straightforward on TPUs, along, of course, with taking advantage of their compute.

I have tried to load a mistral checkpoint with the following code:
model = FlaxAutoModelForCausalLM.from_pretrained("alias/arwen-x21-checkpoint-400000", from_pt=True, pad_token_id=50256)
This seems to work. The model loads, I can access its properties, and can even generate text.

However, once I try to fine-tune it using (more or less) the code here: https://github.com/huggingface/transformers/blob/master/examples/flax/language-modeling/run_clm_flax.py, it takes about 10 minutes to compile and then about 5 minutes for each step (for reference, these should be about 2 minutes and a few seconds, respectively, for gpt2-medium).

Finally, it would be nice if the changes in the Mistral models were somehow included when loading the model in HF (I am actually not 100% sure that this doesn't already happen). Specifically, I'm thinking of this line here:

scale_factor = 1 / ((float(v.size(-1)) ** 0.5) * self.layer_num)

Hope this makes sense. Thank you in advance!

Best,
Theodore.

DeepSpeed Learning Rate and Loss Discrepancies?

Using the default parameters (even inheriting from HF) results in different Learning Rate scheduling behavior (and train/eval loss) compared to DDP or FairScale. Unclear why this is happening, but if we want to use DeepSpeed, we should sort this out.

Reproduce Battlestar Crash (from GCP) on Sphinxes

To facilitate debugging numerical instability, we need to reproduce the Battlestar Crash on GCP on the Sphinxes; unfortunately, our random seeding isn't perfect across different hardware configurations (see #70 for full issue). We seem to be seeing the same batches, but initialization seems off...

For this issue, implement a hot-fix that allows for loading checkpoint-0 from the initial Battlestar run on GCP (immediately after initialization) so that we (presumably) start with the same initialization and can replicate the crash...

Partial/Bespoke Gradient Checkpointing for GPT-2 Models

Currently, HuggingFace models implement Gradient Checkpointing for every block even if it's not necessary. With bespoke gradient checkpointing, we can control how many blocks get wrapped with checkpointing, and save memory/time dynamically.
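A minimal sketch of the idea with toy modules (not the actual HF GPT-2 classes): only the first n_checkpointed blocks get wrapped, the rest run normally.

import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class PartiallyCheckpointedStack(nn.Module):
    def __init__(self, blocks, n_checkpointed):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.n_checkpointed = n_checkpointed

    def forward(self, hidden_states):
        for i, block in enumerate(self.blocks):
            if self.training and i < self.n_checkpointed:
                # Recompute this block's activations during backward to save memory.
                hidden_states = checkpoint(block, hidden_states)
            else:
                hidden_states = block(hidden_states)
        return hidden_states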

DeepSpeed Dynamic Loss Scaling Panic?

On starting a DeepSpeed run, there's this massive floating point panic where DeepSpeed dynamically rescales the loss and throws a ton of warnings. It's unclear why this is happening.

[2021-03-18 04:21:58,423] [INFO] [stage1.py:633:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
[2021-03-18 04:21:58,423] [INFO] [stage1.py:633:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
[2021-03-18 04:21:58,424] [INFO] [stage1.py:633:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0

Enable pre-commit CI

Is your feature request related to a problem? Please describe.

This pre-commit config is not enforced by GitHub Actions or pre-commit CI. Please consider enabling them.

Hot-Fix HF Caching Permissions Issue

Currently, when using HF caching with Datasets + Tokenizers, file permissions are slightly problematic and locked to a single user. Ideally we escalate this to the HF Datasets repo and fix it upstream.

In the meantime, a temporary, pythonic fix would be good to have.
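One possible temporary fix, assuming we simply relax group permissions on the cache directory after it is built (the exact permission bits are a judgment call):

import os
import stat

def open_cache_permissions(cache_dir):
    # Grant group read/write (plus execute on directories) so other users can reuse the cache.
    for root, dirs, files in os.walk(cache_dir):
        for name in dirs + files:
            path = os.path.join(root, name)
            extra = stat.S_IRGRP | stat.S_IWGRP
            if os.path.isdir(path):
                extra |= stat.S_IXGRP
            os.chmod(path, os.stat(path).st_mode | extra)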

Factor out and Fix Dataset Preprocessing Logic

Currently, most of the dataset fetching and preprocessing code is embedded directly in Lines 98 - 157 of train.py.

We should:

  1. Extract/factor out this code and put it in src/corpora/auto.py (general class for any HF dataset, following a "generic" API - similar to run_clm.py).
  2. Clean up the truncation and pre-processing correctness (this may or may not matter too much). Specifically, I'd like to get around the fact that:
    - We pre-compute splits only once rather than every epoch.
    - We drop a lot of text. Switching back to the Tempest dataset is probably not the best option because multiprocessing is nice, so we should find some other workaround.

Resolving this issue should both clean up the existing dataset code and ensure more principled/correct pre-processing. A rough sketch of the factored-out API follows.
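One possible shape for the src/corpora/auto.py entry point, loosely following run_clm.py's tokenize-then-group pattern (function and argument names are illustrative, and the remainder-dropping behavior below is exactly the part we may want to improve):

from itertools import chain
from datasets import load_dataset

def get_auto_dataset(tokenizer, dataset_id, dataset_name=None, seq_len=1024, num_proc=4):
    raw = load_dataset(dataset_id, dataset_name)

    def tokenize(examples):
        return tokenizer(examples["text"])

    tokenized = raw.map(tokenize, batched=True, num_proc=num_proc,
                        remove_columns=raw["train"].column_names)

    def group_texts(examples):
        # Concatenate everything, then chunk into fixed-length blocks (drops the tail).
        concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
        total_len = (len(concatenated["input_ids"]) // seq_len) * seq_len
        result = {k: [v[i:i + seq_len] for i in range(0, total_len, seq_len)]
                  for k, v in concatenated.items()}
        result["labels"] = result["input_ids"].copy()
        return result

    return tokenized.map(group_texts, batched=True, num_proc=num_proc)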

[Installation] Resolving dependency chain due to the latest Transformer version

The current *-gpu.yaml fails due to issues in the dependency chain introduced by the latest Transformers (4.12) version. For example, there are dependency conflicts between Transformers and certain versions of the huggingface_hub, datasets, etc. libraries. Which Transformers version should we use for a smooth installation?

Set Gradient Accumulation Steps Dynamically

Currently, we set effective batch size AND gradient accumulation steps (along with nodes, total GPUs) all statically, in one of our Quinfig files.

It might be nicer to compute gradient accumulation steps dynamically, based on the hardware + the desired effective batch size (generally just cleaner).
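A minimal sketch of the computation, assuming we know the desired effective batch size, the per-device batch size, and the world size at launch time:

def compute_grad_accumulation(effective_bsz, per_device_bsz, world_size):
    # effective_bsz = per_device_bsz * world_size * gradient_accumulation_steps
    assert effective_bsz % (per_device_bsz * world_size) == 0, \
        "effective batch size must be divisible by per_device_bsz * world_size"
    return effective_bsz // (per_device_bsz * world_size)

# e.g. effective 512, 8 per GPU, 16 GPUs -> 4 accumulation steps
assert compute_grad_accumulation(512, 8, 16) == 4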

Better Console/File Logging

Currently our logger (dubbed overwatch) is initialized via a function, with lots of manual specification. Furthermore, when calling external library code (e.g. calls to HF.datasets, HF.transformers), the corresponding prints and external log calls are not captured in our logger.

To fix this, take a look at the following TODOs:

  • Line 19 of overwatch.py describes how to initialize a Logger from a YAML definition file. This would be nice because we could auto-generate the logger in parallel with parsing the Quinfig.
  • Line 20 of overwatch.py details how to wrap external calls with context managers to redirect loggers... not sure if this is the right thing to do here, so maybe let's open a discussion?

Resolving this issue will probably entail at least the first point above, plus a semi-clean solution to the second point.
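For the first point, a minimal sketch using the standard library's logging.config.dictConfig plus PyYAML (file and logger names are illustrative):

import logging
import logging.config
import yaml

def get_overwatch(config_path="conf/overwatch.yaml", name="mistral"):
    # Build handlers/formatters from a YAML definition instead of manual specification.
    with open(config_path) as f:
        logging.config.dictConfig(yaml.safe_load(f))
    return logging.getLogger(name)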

Fairscale ZeRO-Offload Bug

There seems to be a weird mixed precision assertion error/bug in FairScale's ZeRO-Offload. Follow up with either Stas at Hugging Face, or with the FairScale folks directly.

Fix: DeepSpeed Resume Behavior

Currently, resuming a DeepSpeed distributed run (multi-node) requires replicating and syncing a bunch of files for a given checkpoint directory. Specifically:

  • Copy all optimizer states (in checkpoint-XXX/global_stepXXX/) from each node (e.g. when saved, Node 0 has optimizer states checkpoints 0 - 7, Node 1 has 8 - 15, and so on) to all nodes.
  • Copy mpu (model) state in checkpoint-XXX/global_stepXXX to all nodes.
  • Copy top-level Transformers checkpoints (pytorch.bin, json files, txt files) to checkpoint directories on all nodes.

Create Ground-Truth, Verified GPT-2 Small Configuration

Create and verify GPT-2 Small Configuration. Should include the following:

  • Closest possible translation from GPT-2 original Cosine Schedule to Linear Scheduler (for DeepSpeed).
  • Custom GPT-2 Weight Initialization
  • LR and Scheduling Parameters from Neo-X/Megatron-LM

Enable static typing with mypy

Is your feature request related to a problem? Please describe.
This repo currently has type hints, but they are not enforced and many are incorrect or incomplete.

Torch.Distributed and Vanilla DistributedDataParallel (Single & Multi-Node)

Prior to getting DeepSpeed and FairScale integrated, we need to have code in place for running/launching DDP jobs with the HF Trainer.

With Data already processed, this should just be calls to torch.distributed.launch with the right arguments, but for future-proofing and cleanliness, we should also:

  • Add torch.distributed.barrier to the appropriate places in the preprocessing code (see the sketch after this list).
  • Write wrapper scripts (shell scripts) that auto-call torch.distributed.launch with the valid arguments.
  • Write a cleanup function (probably wrap in try/finally somewhere) that greps ps aux for all running processes, and runs pkill / kill -9, since torch.distributed.launch doesn't clean-up by itself.
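A minimal sketch of the barrier placement from the first bullet, assuming rank 0 builds the cache while the other ranks wait and then read from it:

import torch.distributed as dist

def build_dataset_with_barrier(local_rank, build_fn):
    if dist.is_initialized() and local_rank not in (-1, 0):
        dist.barrier()        # non-main ranks wait for rank 0 to finish preprocessing
    dataset = build_fn()      # rank 0 does the work; other ranks then hit the cache
    if dist.is_initialized() and local_rank == 0:
        dist.barrier()        # release the waiting ranks
    return dataset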

Add Arbitrary Save/Evaluation Schedules for HuggingFace Trainer

Currently, we can only save checkpoints/evaluate every K steps for some fixed value of K. Would be nice to pass in a full list schedule (of arbitrary length, with schedule[-1] == max_steps) to maintain periodicity of checkpoints/evaluation.

Might make sense to have two separate arguments -- one for the evaluation schedule, one for the saving schedule. Probably would require either sub-classing the DefaultFlowCallback() in Transformers, or writing a custom Callback.

Depending, can be contributed back to HuggingFace.
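A minimal sketch of the custom-callback route, assuming two explicit lists of global steps (not tested against the Trainer's DefaultFlowCallback, which may also need to be disabled):

from transformers import TrainerCallback

class ScheduledFlowCallback(TrainerCallback):
    def __init__(self, save_schedule, eval_schedule):
        # e.g. save_schedule = [100, 500, 1000, ..., max_steps]
        self.save_steps = set(save_schedule)
        self.eval_steps = set(eval_schedule)

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step in self.save_steps:
            control.should_save = True
        if state.global_step in self.eval_steps:
            control.should_evaluate = True
        return control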

Upscale K, Q and Scaled Dot-Product Attention

Problem: It seems like most existing repositories (mesh-TF, Megatron, possibly FairSeq) upscale K, Q, V and perform the dot-product in FP32 rather than in FP16 (which is where we have been noticing the overflow). We have a few ways to fix this, which we'll work through step-by-step (with thorough testing) before landing on a final solution.


Deliverable: Constrain Scaled-Dot Product Attention to be in FP32 (manually) -- look at Mesh-TF implementation to identify which ops need to be in FP32 (and where to cast back down).
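A minimal sketch of the manual FP32 constraint (illustrative shapes and names, not the Mesh-TF or HF code):

import torch

def fp32_scaled_dot_product(q, k, v, scale):
    # Upcast before the matmul/softmax, then cast the output back to the original dtype.
    orig_dtype = q.dtype
    q, k, v = q.float(), k.float(), v.float()
    scores = torch.matmul(q, k.transpose(-1, -2)) * scale
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v).to(orig_dtype)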

Question: How does this play with autocast and DeepSpeed? We need to look into this.

Testing: Resume the battlestar run a few thousand steps prior to crash (we should be able to, because weights aren't changing). If all goes well, we should not see an overflow at the same point.

Additionally stack with #66 to ensure "more stability."

Push Stability Fixes / Create Issues on HF Transformers

We have a series of stability fixes (GPT-2 initialization, stability tweaks, etc.) in our codebase that aren't present in the default GPT-2 model on HF Transformers.

We should push the fixes that don't change semantics (EDIT: it's not clear to me any of these changes keep semantics entirely the same) as PRs, and open issues to figure out with the HF team directly how to incorporate these new fixes.

Might make sense to create a new GPT-2 model... but also feels like adding overhead.

Verify DeepSpeed Checkpoints

Some HF users have noticed that using Checkpoints dumped by DeepSpeed leads to issues with the common HF pipeline. This is relatively important, so we should verify that we can recover "regular" Hugging Face functionality with our checkpoints.

Conda installation fails

Conda fails when I run

conda env create -f environments/environment-cpu.yaml

with the error

Pip subprocess error:
ERROR: Could not find a version that satisfies the requirement transformers==4.4.0.dev0
ERROR: No matching distribution found for transformers==4.4.0.dev0

I'm assuming this is because conda isn't freezing the pip requirements correctly for the libraries that are installed from source (i.e. pip install git+https://github.com/huggingface/transformers)? This may not be supported by conda directly (https://stackoverflow.com/a/19071214).

More logging to debug instability

In order to debug why our runs are crashing, I want to log the first and second moments in the Adam optimizer, as well as the actual updates.

Once this feature is implemented, we can resume training of dark matter and battlestar right before the crash and checkpoint more frequently (every 100 steps, for example).

For battlestar, we can do this starting at 165K steps and for dark matter, we can do this starting at 46K steps.
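A minimal sketch of pulling the moments out of a PyTorch Adam/AdamW optimizer state; how and where this gets wired into the Trainer loop is left open:

def log_adam_moments(optimizer, logger, step):
    # Adam/AdamW keep per-parameter "exp_avg" (first moment) and "exp_avg_sq" (second moment).
    for group in optimizer.param_groups:
        for p in group["params"]:
            state = optimizer.state.get(p, {})
            if "exp_avg" not in state:
                continue
            logger.info(
                f"step={step} m1_norm={state['exp_avg'].norm().item():.4e} "
                f"m2_norm={state['exp_avg_sq'].norm().item():.4e}"
            )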

Bug - Current Evaluation Loads Incorrect GPT-2 Model

With the current evaluation code (anything not called directly via train.py), we end up loading the default Hugging Face GPT-2 model (e.g., in all our evaluation scripts). This is problematic because the model we use while training implements the FP16 stability heuristics -- including further scaling the scaled dot-product attention by 1 / layer_idx.

This is a subtle bug in that the default Hugging Face model can still load the correct weights (we're only adding algorithmic logic, no extra parameters), but this significantly hurts evaluation performance.

Fixing this would require changing the default model that gets loaded to the MistralGPT2 model with the layer-wise scaling enabled.

Fix Scaled Dot-Product Attention Order of Operations

Problem: It seems like most existing repositories (mesh-TF, Megatron, possibly FairSeq) upscale K, Q, V and perform the dot-product in FP32 rather than in FP16 (which is where we have been noticing the overflow). We have a few ways to fix this, which we'll work through step-by-step (with thorough testing) before landing on a final solution.


Deliverable: In this issue, switch the order of operations of the scaled-dot product attention as follows:

  • Normal: (1 / root(dk)) [K @ Q]
  • New: ((1 / root(dk)) K) @ Q

This is done in Megatron by using torch.baddbmm with beta=0.0, alpha = 1/root(dk). Rewrite the GPT-2 forward pass to use this.
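A minimal sketch of the reordered computation (shapes are illustrative, flattened to (batch * heads, seq, d_k)); with beta=0.0 the contents of the input tensor are ignored:

import torch

def scaled_attention_scores(q, k, d_k):
    # baddbmm computes beta * input + alpha * (q @ k^T); the scale is folded into the matmul.
    scores = torch.empty(q.size(0), q.size(1), k.size(1), dtype=q.dtype, device=q.device)
    return torch.baddbmm(scores, q, k.transpose(-1, -2), beta=0.0, alpha=1.0 / (d_k ** 0.5))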

Testing: Resume the battlestar run a few thousand steps prior to crash. If all goes well, we should not see an overflow at the same point.

Mistral doesn't join docs with the <|endoftext|> separator

Unlike GPT-2 and other GPT-style LMs, the Mistral codebase and pretrained models do not make use of the special <|endoftext|> token.

Evidence that this is true:

  1. When prompted with this token, the pretrained models usually begin in the middle of a sentence.
  2. If I understand correctly, this line in get_auto_dataset concatenates tokenized documents without inserting anything in between them.
  • If this code was used to prepare data for the pretrained models, that would explain the behavior noted in point 1.

Was this a deliberate choice? Mistral follows GPT-2 carefully in other respects, so I'm surprised by this difference.

Also, concatenating documents without inserting such a token seems sub-optimal from a language modeling perspective. At the boundaries between documents, it produces sudden discontinuities in style/content. The resulting dataset makes it look to the LM as if such discontinuities were a feature of natural text, which they aren't.
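For reference, a minimal sketch of what inserting the separator during tokenization could look like (illustrative only; whether and where this belongs in get_auto_dataset is exactly the open question):

def tokenize_with_separator(examples, tokenizer):
    # Append <|endoftext|> (id 50256 for the GPT-2 tokenizer) to each document so that
    # document boundaries survive the later concatenation/chunking step.
    eos = tokenizer.eos_token_id
    output = tokenizer(examples["text"])
    output["input_ids"] = [ids + [eos] for ids in output["input_ids"]]
    output["attention_mask"] = [mask + [1] for mask in output["attention_mask"]]
    return output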

Intensive Benchmarking

Integrate Online Evaluation Code, spend some initial time tuning Batch Size and other parameters, then run "Intensive Benchmarking" (1000 updates, evaluate every 100 updates, log every 50 steps) for the following 8 runs (Multi-Node = 16 GPUs):

  • Vanilla DDP - FP 16 - Per Device BSZ = 8, Accumulation = 4
  • Vanilla DDP - FP 16 - Gradient Checkpointing - Per Device BSZ = 32, Accumulation = None
  • FairScale ZeRO Stage 2 - FP 16 - Per Device BSZ = 8, Accumulation = 4
  • FairScale ZeRO Stage 3 - FP 16 - Per Device BSZ = 8, Accumulation = 4
  • DeepSpeed ZeRO Stage 1 - FP 16 - Per Device BSZ = 8, Accumulation = 4
  • DeepSpeed ZeRO Stage 1 - FP 16 - Per Device BSZ = 16, Accumulation = 2
  • DeepSpeed ZeRO Stage 2 - FP 16 - Per Device BSZ = 8, Accumulation = 4
  • DeepSpeed ZeRO Stage 2 - FP 16 - Per Device BSZ = 16, Accumulation = 2

Error in Tokenization Cache Name

I tried to run python train.py --config conf/gpt2-sphinx-debug-config.yaml on main and discovered an error.

The cache name generated by this line of code (https://github.com/stanford-mercury/mistral/blob/main/src/corpora/auto.py#L51) is /scr-ssd/mercury/mistral/artifacts/gpt2-processed/preprocessing/tokenization/train-tokenized.hf but in fact we want this to be /scr-ssd/mercury/mistral/artifacts/gpt2-processed/openwebtext/preprocessing/tokenization/train-tokenized.hf (i.e. record the dataset name).

The fix is to add dataset_id as we did for the other cache in https://github.com/stanford-mercury/mistral/blob/main/src/corpora/auto.py#L32
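A minimal sketch of the intended path construction with the dataset id included (names are illustrative, not the actual auto.py code):

import os

def tokenization_cache_file(artifacts_dir, run_id, dataset_id, split):
    return os.path.join(
        artifacts_dir,
        f"{run_id}-processed",
        dataset_id,              # previously missing component, e.g. "openwebtext"
        "preprocessing",
        "tokenization",
        f"{split}-tokenized.hf",
    )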
