young-geng / easylm
Large language models (LLMs) made easy: EasyLM is a one-stop solution for pre-training, finetuning, evaluating and serving LLMs in JAX/Flax.
License: Apache License 2.0
https://api.wandb.ai/links/matrix-zxw/qmcnboxa
Is it normal for the learning rate to reach the peak_value 1e-4 and not decrease, but instead slowly rise?
python -m EasyLM.models.llama.llama_train
--optimizer.type=adamw
--optimizer.adamw_optimizer.lr=1e-4
...
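One way to read such a curve, assuming the adamw optimizer follows a linear-warmup, cosine-decay schedule (an assumption; the function and parameter names below are illustrative and not taken from EasyLM): the rate climbs during warmup and only starts falling afterwards, and if the decay horizon is much longer than the run, the decline just after the peak is barely visible. A minimal sketch:

```python
import math

def warmup_cosine_lr(step, init_lr, peak_lr, end_lr, warmup_steps, decay_steps):
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    if step < warmup_steps:
        # Linear ramp from init_lr up to peak_lr.
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    span = max(1, decay_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return end_lr + (peak_lr - end_lr) * cosine

# Hypothetical settings: peak 1e-4, 2,000 warmup steps, 500,000 decay steps.
schedule = [
    warmup_cosine_lr(s, 0.0, 1e-4, 1e-5, 2_000, 500_000)
    for s in (0, 1_000, 2_000, 10_000, 500_000)
]
```

With these made-up numbers the rate is still within 1% of the peak 8,000 steps after warmup ends, so a short run can look flat. Comparing the logged curve against the configured warmup and decay step counts is a reasonable first check.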
When I configured batch size 4 on v3-8, it was normal. But when I configured batch size 128 on v3-256, it reported OOM.
What is the reason?
...
--mp_mesh_dim='-1, 1'
--train_dataset.json_dataset.batch_size=128
...
Hi,
I find that in the following code, EasyLM processes the dataset by taking fixed-size chunks. From my understanding, this might place different documents in the same chunk. For example, the first document might take 512 tokens while the second document takes 128 tokens in a chunk of 640 tokens. In this case, I think the generation for the second document should not see the first document, so we might need an attention mask to hide it. But I don't see any related code in EasyLM, so I am wondering how you handle the attention mask for this problem.
Lines 158 to 183 in 18375bd
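The masking described in this question can be sketched as a block-diagonal causal mask built from per-token document ids. This is only an illustration of the idea; `segment_ids` and `packed_causal_mask` are hypothetical names, not EasyLM code, and the thread does not confirm whether EasyLM applies anything like it.

```python
def packed_causal_mask(segment_ids):
    """mask[i][j] is True when token i may attend to token j."""
    n = len(segment_ids)
    return [
        # Causal (j <= i) and restricted to the same document.
        [j <= i and segment_ids[i] == segment_ids[j] for j in range(n)]
        for i in range(n)
    ]

# Toy chunk: a 3-token document followed by a 2-token document.
segment_ids = [0, 0, 0, 1, 1]
mask = packed_causal_mask(segment_ids)
```

Here the first token of the second document can attend only to itself, so nothing from the first document leaks into its context.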
2023-04-06 07:42:51.278258: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2153] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Failed to allocate request for 625.00MiB (655360000B) on device ordinal 7
2023-04-06 07:42:51.278359: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2153] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Failed to allocate request for 625.00MiB (655360000B) on device ordinal 1
2023-04-06 07:42:51.278403: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2153] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Failed to allocate request for 625.00MiB (655360000B) on device ordinal 3
2023-04-06 07:42:51.278439: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2153] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Failed to allocate request for 625.00MiB (655360000B) on device ordinal 2
2023-04-06 07:42:51.278486: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2153] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Failed to allocate request for 625.00MiB (655360000B) on device ordinal 0
2023-04-06 07:42:51.278526: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2153] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Failed to allocate request for 625.00MiB (655360000B) on device ordinal 4
2023-04-06 07:42:51.278553: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2153] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Failed to allocate request for 625.00MiB (655360000B) on device ordinal 5
2023-04-06 07:42:51.278591: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2153] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Failed to allocate request for 625.00MiB (655360000B) on device ordinal 6
nohup python -m EasyLM.models.llama.llama_train \
--mp_mesh_dim='-1,1' \
--load_llama_config='13b' \
--load_checkpoint="params::${EASYLM_CHECKPOINT_DIR}/checkpoint" \
--tokenizer.vocab_file=${TOKENIZER_FILE} \
--seed=42 \
--initialize_jax_distributed=False \
--total_steps=1000 \
--log_freq=10 \
--save_model_freq=100 \
--save_milestone_freq=500 \
--eval_steps=100 \
--train_dataset.text_processor.fields='[input],output' \
--train_dataset.text_processor.add_eos_token=True \
--train_dataset.type='json' \
--train_dataset.json_dataset.path=${TRAIN_DATA_FILE} \
--train_dataset.json_dataset.seq_length=1024 \
--train_dataset.json_dataset.batch_size=2 \
--eval_dataset.text_processor.fields='[input],output' \
--eval_dataset.text_processor.add_eos_token=True \
--eval_dataset.type='json' \
--eval_dataset.json_dataset.path=${EVAL_DATA_FILE} \
--eval_dataset.json_dataset.seq_length=1024 \
--eval_dataset.json_dataset.batch_size=2
Command:
python -m EasyLM.models.llama.convert_easylm_to_hf \
--load_checkpoint='params::/home/nap/Downloads/githubs/EasyLM/easylm_checkpoint/koala_13b.diff.weights' \
--tokenizer_path='/home/nap/Documents/text-generation-webui/models/llama_original/13B/tokenizer.model' \
--model_size='13b' \
--output_dir='/home/nap/Downloads/githubs/EasyLM/easylm_checkpoint/koala-13B-HF'
Output:
Traceback (most recent call last):
File "/home/nap/miniconda3/envs/EasyLM/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/nap/miniconda3/envs/EasyLM/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/nap/Downloads/githubs/EasyLM/EasyLM/models/llama/convert_easylm_to_hf.py", line 32, in <module>
from transformers import LlamaConfig, LlamaForCausalLM
ImportError: cannot import name 'LlamaConfig' from 'transformers' (/home/nap/miniconda3/envs/EasyLM/lib/python3.8/site-packages/transformers/__init__.py)
Do I not have the right version of Transformers? I used these commands from the docs to set up conda:
conda env create -f scripts/gpu_environment.yml
conda activate EasyLM
(EasyLM) nap@wintermute:~/Downloads/githubs/EasyLM$ pip freeze | grep transform
transformers==4.27.2
(Ubuntu 22)
If this configuration is normal:
...
--train_dataset.json_dataset.batch_size=4
--optimizer.bf16_accumulate_gradient=True
--optimizer.accumulate_gradient_steps=1
...
then is the following configuration also correct?
...
--train_dataset.json_dataset.batch_size=8
--optimizer.bf16_accumulate_gradient=True
--optimizer.accumulate_gradient_steps=2
...
Hi!
I have recently come across this repository and have been conducting tests using TPUv4 pods.
As part of my experimentation, I have explored several approaches for feeding datasets into the model,
including utilizing Hugging Face datasets or employing JSON files (with lines) either locally or through a GCS bucket.
During my analysis,
I have noticed that the JSON data loader appears to download the entire JSON file from the gs:// directory and subsequently tokenize and yield the line-by-line data.
This approach presents a challenge when dealing with corpus files exceeding 1TB in size, as it is not practical to store such extensive data in a single JSON file.
I am curious to learn how you handle this issue,
and I would appreciate any insights!
Thanks for your great work!
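One common workaround, sketched under the assumption that the corpus can be split into many smaller JSON-lines shards, is to iterate over the shards lazily so that only one line is held in memory at a time. The shard naming and the `iter_jsonl_shards` helper are hypothetical, and the demo uses local files; reading from gs:// would need a GCS-capable file opener in place of `open()`.

```python
import json
import os
import tempfile

def iter_jsonl_shards(paths):
    """Yield one parsed example at a time, reading shards lazily in order."""
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    yield json.loads(line)

# Demo with two tiny local shards standing in for a large corpus.
tmp_dir = tempfile.mkdtemp()
shard_paths = []
for shard_idx, rows in enumerate([[{"text": "a"}, {"text": "b"}], [{"text": "c"}]]):
    path = os.path.join(tmp_dir, f"shard-{shard_idx:05d}.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    shard_paths.append(path)

texts = [example["text"] for example in iter_jsonl_shards(shard_paths)]
```

Because the function is a generator, tokenization can consume it example by example without ever materializing a whole shard.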
I am trying to run LLaMA using EasyLM, following the README for LLaMA. The first step is to convert the raw LLaMA parameters.
python -m EasyLM.models.llama.convert_torch_to_easylm.py \
--checkpoint_dir='path/to/torch/llama/checkpoint' \
--output_dir='path/to/output/easylm/checkpoint' \
--streaming=True
The argument output_dir does not appear in convert_torch_to_easylm.py; according to the code, it should now be output_file.
Is the documentation outdated?
LoRA fine-tuning is much faster and uses less memory than normal fine-tuning.
Firstly, thanks very much for releasing Koala and the code. I'm really looking forward to trying it.
I downloaded the 7B delta from https://drive.google.com/drive/folders/10f7wrlAFoPIy-TECHsx9DKIvbQYunCfl and would like to convert the delta to HF format so that I can use it with other tools.
Here are the commands I have run:
$ PYTHON_PATH="${PWD}:$PYTHONPATH" ~/anaconda3/envs/torch21/bin/python \
-m EasyLM.models.llama.convert_torch_to_easylm \
--checkpoint_dir=/Users/tomj/Downloads/Torrents/Done/LLaMA/7B \
--output_file=/Users/tomj/src/llama.cpp/models/koala/7B/llama-7b-LM \
--streaming=True
$ PYTHON_PATH="${PWD}:$PYTHONPATH" ~/anaconda3/envs/torch21/bin/python \
-m EasyLM.scripts.diff_checkpoint --recover_diff=True \
--load_base_checkpoint='params::/Users/tomj/src/llama.cpp/models/koala/7B/llama-7b-LM' \
--load_target_checkpoint='params::/Users/tomj/src/llama.cpp/models/koala/7B/koala_7b_diff_v2' \
--output_file=/Users/tomj/src/llama.cpp/models/koala/7B/koala_7b_diff.diff \
--streaming=True
$ PYTHON_PATH="${PWD}:$PYTHONPATH" ~/anaconda3/envs/torch21/bin/python \
-m EasyLM.models.llama.convert_easylm_to_hf --model_size=7b \
--output_dir=/Users/tomj/src/llama.cpp/models/koala/7B/HF \
--load_checkpoint='params::/Users/tomj/src/llama.cpp/models/koala/7B/koala_7b_diff.diff' \
--tokenizer_path=/Users/tomj/src/llama.cpp/models/tokenizer.model
TypeError: can't convert np.ndarray of type bfloat16. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
Am I misunderstanding how these scripts are supposed to work? How can I get an HF version of the Koala deltas?
Or, how can I apply the Koala deltas to the original Llama 7B, and then convert that to HF?
Here's the full output from running the convert script:
Fetching the tokenizer from /Users/tomj/src/llama.cpp/models/tokenizer.model.
/Users/tomj/src/EasyLM/EasyLM/models/llama/convert_easylm_to_hf.py:94: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /Users/runner/work/_temp/anaconda/conda-bld/pytorch_1680419296502/work/torch/csrc/utils/tensor_numpy.cpp:212.)
torch_params[key] = torch.from_numpy(tensor)
Traceback (most recent call last):
File "/Users/tomj/anaconda3/envs/torch21/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Users/tomj/anaconda3/envs/torch21/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/Users/tomj/src/EasyLM/EasyLM/models/llama/convert_easylm_to_hf.py", line 233, in <module>
mlxu.run(main)
File "/Users/tomj/anaconda3/envs/torch21/lib/python3.10/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/Users/tomj/anaconda3/envs/torch21/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/Users/tomj/src/EasyLM/EasyLM/models/llama/convert_easylm_to_hf.py", line 226, in main
load_and_convert_checkpoint(FLAGS.load_checkpoint),
File "/Users/tomj/src/EasyLM/EasyLM/models/llama/convert_easylm_to_hf.py", line 94, in load_and_convert_checkpoint
torch_params[key] = torch.from_numpy(tensor)
TypeError: can't convert np.ndarray of type bfloat16. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
I am running on an Intel macOS system on Ventura 13.3. Here's some env info:
transformers version: 4.28.0.dev0
Any help would be much appreciated.
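The error message says that torch.from_numpy does not accept NumPy arrays of type bfloat16. A likely workaround (an assumption, not a fix confirmed by the maintainers in this thread) is to cast each array to float32 before the conversion. The sketch below uses only the standard library to show why that upcast loses nothing: a bfloat16 value is simply the top 16 bits of a float32. The helper names are made up for illustration.

```python
import struct

def bfloat16_bits_to_float32(bits16):
    """Interpret a 16-bit bfloat16 pattern as a Python float (via float32)."""
    # Place the 16 bits in the top half of a 32-bit word, zero the rest.
    return struct.unpack(">f", struct.pack(">I", (bits16 & 0xFFFF) << 16))[0]

def float32_to_bfloat16_bits(value):
    """Truncate a float32 to bfloat16 by keeping its top 16 bits."""
    return struct.unpack(">I", struct.pack(">f", value))[0] >> 16

one = bfloat16_bits_to_float32(0x3F80)      # 0x3F800000 is 1.0 in float32
neg_two = bfloat16_bits_to_float32(0xC000)  # 0xC0000000 is -2.0 in float32
# Upcasting and truncating again returns the original bit pattern.
round_trip = float32_to_bfloat16_bits(bfloat16_bits_to_float32(0x4049))
```

The cost of the upcast is memory: the float32 copy takes twice the space of the bfloat16 array.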
Is there a plan to support Falcon?
Considering Falcon's better performance on the Open LLM Leaderboard, would you consider supporting it?
Thank you~
When I use 13b, batch_size=1, an error occurs on v3-8 as follows:
When using fsdp=true, the error message is the same. Does this not have any effect?
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Failed to allocate request for 12.50MiB (13107200B) on device ordinal 0: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).
...
--fsdp=True
...
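A back-of-envelope estimate suggests why a 13B model can run out of memory on a v3-8 even at batch size 1. The sketch assumes fp32 parameters, gradients and two Adam moment buffers (16 bytes per parameter), ignores activations entirely, and assumes roughly 16 GiB of HBM on each of 8 devices; all of these are assumptions, not figures from the maintainers.

```python
GIB = 1024 ** 3

def training_state_gib(num_params, bytes_per_value=4, copies=4):
    """Memory for params, grads and two Adam moments (copies=4), in GiB."""
    return num_params * bytes_per_value * copies / GIB

# Assumptions: 13e9 parameters, fp32 everywhere, 8 devices with 16 GiB each.
total_gib = training_state_gib(13e9)
per_device_gib = total_gib / 8  # best case: state sharded evenly
fits = per_device_gib <= 16
```

Under these assumptions the training state alone needs about 24 GiB per device even when perfectly sharded, so enabling FSDP would not be enough by itself; a lower-precision state or more devices would be needed.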
If it is pre-training, can we just omit the [] directly?
for example
...
--train_dataset.text_processor.fields='input,output' \
...
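Assuming square brackets mark fields whose tokens are excluded from the loss (an interpretation, not confirmed in this thread), dropping the brackets would make every field count toward the loss, which is what pre-training usually wants. The sketch below illustrates that reading; it is not EasyLM's implementation, and `parse_fields`, `build_loss_mask` and the token counts are invented for illustration.

```python
def parse_fields(fields):
    """Split a fields spec into (name, counts_toward_loss) pairs."""
    parsed = []
    for raw in fields.split(","):
        name = raw.strip()
        if name.startswith("[") and name.endswith("]"):
            parsed.append((name[1:-1], False))  # bracketed: masked out
        else:
            parsed.append((name, True))
    return parsed

def build_loss_mask(example_token_counts, fields):
    """One 0/1 entry per token, concatenated in field order."""
    mask = []
    for name, in_loss in parse_fields(fields):
        mask.extend([1 if in_loss else 0] * example_token_counts[name])
    return mask

counts = {"input": 3, "output": 2}  # hypothetical token counts per field
finetune_mask = build_loss_mask(counts, "[input],output")
pretrain_mask = build_loss_mask(counts, "input,output")
```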
It appears there is an issue running this on Windows. I am getting the following error:
Collecting package metadata (repodata.json): done
Solving environment: failed
ResolvePackageNotFound:
I will likely wrap this up in a docker container in the meantime.
The transformers version pinned in scripts/gpu_environment.yml (transformers==4.27.2) leads to an import error when I run EasyLM.models.llama.convert_easylm_to_hf:
File "/scratch/users/ruiqi-zhong/conda/envs/EasyLM/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/scratch/users/ruiqi-zhong/conda/envs/EasyLM/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/scratch/users/ruiqi-zhong/EasyLM/EasyLM/models/llama/convert_easylm_to_hf.py", line 33, in <module>
from transformers import LlamaConfig, LlamaForCausalLM
ImportError: cannot import name 'LlamaConfig' from 'transformers'
It can be easily fixed after pip install-ing the latest transformers library, though.
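A quick way to check an environment before running the converter, assuming the LLaMA classes first shipped in transformers 4.28.0 (this matches the reports here, where 4.27.2 fails at the import and 4.28.0.dev0 gets past it). The helper names are hypothetical.

```python
def version_tuple(version):
    """Parse the leading numeric components of a version string."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as "dev0"
        parts.append(int(piece))
    return tuple(parts)

def has_llama_classes(transformers_version, minimum=(4, 28, 0)):
    """True when the version is at least the assumed minimum release."""
    return version_tuple(transformers_version) >= minimum

pinned_ok = has_llama_classes("4.27.2")    # version pinned in the environment file
dev_ok = has_llama_classes("4.28.0.dev0")  # development build from another report
```

In a real environment the string would come from `transformers.__version__`.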
Is there a Discord channel or group?
conda env create -f scripts/gpu_environment.yml
The install process is very slow:
Executing transaction: / By downloading and using the CUDA Toolkit conda packages, you accept the terms and conditions of the CUDA End User License Agreement (EULA): https://docs.nvidia.com/cuda/eula/index.html
/ By downloading and using the cuDNN conda packages, you accept the terms and conditions of the NVIDIA cuDNN EULA -
https://docs.nvidia.com/deeplearning/cudnn/sla/index.html
Hi,
What would it take to support Falcon-40b-Instruct for fine-tuning?
Hi!
First of all: thanks for your amazing repo. It's pretty easy to use, even for someone that doesn't have an ML/Python background like me.
I am, however, struggling to use it in training LLaMA on a TPU.
Can you maybe give a short step by step guide?
Like for example: Which weights should I load as checkpoint? The original one from Meta, the Huggingface or the JAX weights, and if so, how do I convert the original from Meta to JAX?
What is the format of the model weight diffs, and how are they combined into the base weights?
Is there a script to merge them?
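Another post in this list runs EasyLM.scripts.diff_checkpoint with --recover_diff=True for this purpose. Conceptually, and assuming the diff stores the element-wise difference between the fine-tuned and base weights (an assumption about the format, not something stated here), recovery is one addition per parameter. A toy sketch with hypothetical parameter names:

```python
def recover_from_diff(base_params, diff_params):
    """Return base + diff for every parameter name."""
    if base_params.keys() != diff_params.keys():
        raise ValueError("base and diff must contain the same parameter names")
    return {
        name: [b + d for b, d in zip(base_params[name], diff_params[name])]
        for name in base_params
    }

# Toy flattened tensors; real checkpoints hold large arrays per parameter.
base = {"wq": [0.5, -1.0], "wk": [2.0, 0.25]}
diff = {"wq": [0.25, 0.5], "wk": [-1.0, 0.0]}
recovered = recover_from_diff(base, diff)
```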
Great project and documentation.
Can you further elucidate the difference between FSDP and Model Parallelism? Isn't FSDP already a form of model parallelism? Trying to understand the nuanced differences between 3-stage DeepSpeed ZeRO, FSDP, and "model parallelism".
Thanks!
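As a rough, general illustration (not a description of EasyLM's implementation): FSDP shards where the parameters are stored, but each device still runs the whole layer on its own slice of the batch after gathering the weights, whereas tensor (model) parallelism splits the computation of the layer itself. The toy below simulates two "devices" with plain lists; all names are illustrative.

```python
def matmul(x, w):
    """Plain matrix product of nested lists: (n, k) @ (k, m) -> (n, m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

x = [[1, 2], [3, 4], [5, 6], [7, 8]]  # batch of 4 examples
w = [[1, 0, 2, 1], [0, 1, 1, 3]]      # weight matrix, shape (2, 4)
reference = matmul(x, w)

# FSDP-style: each of 2 devices stores half of w's rows and half the batch.
w_shards = [w[:1], w[1:]]
x_shards = [x[:2], x[2:]]
w_gathered = w_shards[0] + w_shards[1]  # all-gather before the layer runs
fsdp_out = matmul(x_shards[0], w_gathered) + matmul(x_shards[1], w_gathered)

# Tensor (model) parallel: each device keeps half of w's columns permanently
# and computes its slice of the output for the full batch.
w_cols = [[row[:2] for row in w], [row[2:] for row in w]]
partials = [matmul(x, w_cols[0]), matmul(x, w_cols[1])]
tp_out = [left + right for left, right in zip(partials[0], partials[1])]
```

Both variants reproduce the single-device result; they differ in what is communicated (weights for FSDP, activations for tensor parallelism) and in whether a device ever holds a full layer.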
https://github.com/young-geng/EasyLM/blob/main/EasyLM/checkpoint.py#L91
Can save_checkpoint support writing to a GCS path?
Hello and thank you for setting up an excellent repo!
I was wondering if you could provide checksums (say, md5sum) for models that were recovered from the original LLaMA weights and the diff file?
(I am especially interested in Koala)
This way, people can be confident that they have managed to recover a sane model.
Thanks!
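Until official checksums are published, users who have recovered a model can at least compare digests with each other. A minimal sketch for hashing a large checkpoint file in chunks; `file_md5` is a hypothetical helper, and the temporary file merely stands in for a checkpoint.

```python
import hashlib
import os
import tempfile

def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on a small temporary file standing in for a checkpoint.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"hello")
demo_digest = file_md5(demo_path)
os.remove(demo_path)
```

Note that digests only match if the recovery produces byte-identical files, which may depend on dtype and serialization settings.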
Does wandb_dir support GCP paths?
gcp_path=gs://path/
...
--logger.wandb_dir=gcp_path
...
Thanks so much for implementing the JAX/Flax versions of these foundation language models! This is really helpful for researchers working on TPUs.
I am still a beginner with JAX/Flax, and I have a detailed question about the LLaMA training script. When the train_state is created at EasyLM/EasyLM/models/llama/llama_train.py (line 232 in e3e2657), why is the sharded function used rather than create_trainstate_from_params? It seems that in EasyLM/EasyLM/models/llama/llama_train.py (line 224 in e3e2657), sharded_fn is already passed into the checkpointer, and the output restored_params is already sharded across all TPU devices. Will there be any problems if I use create_trainstate_from_params instead of sharded_create_trainstate_from_params at line 232 (assuming that I am not using distributed training)?
Thanks!
Hi, thank you for making such nice work public.
I have two issues I want to raise.
No. 1: in the code for processing all the datasets, https://github.com/young-geng/koala_data_pipeline ,
I'm afraid some datasets are missing.
For example, line 14 of process_chat_data.py reads
input_file='/nfs/vault/data/language/chat_data_v3.json'
and this file must exist for the script to run without an error.
Where can I get all the input datasets that are listed in the processing Python files?
No. 2: I've tried to look for documentation on using the EasyLM library to fine-tune
the OPT model with the Koala dataset, but there was only documentation for fine-tuning
the LLaMA model.
Can I get any documentation on fine-tuning, for example, OPT-6.7B with the Koala dataset?
Again, thank you so much for the amazing work!
I followed the steps to convert the model into HF format, but loading the converted tokenizer with tokenizer = AutoTokenizer.from_pretrained(model_path) takes around 300 seconds. Any ideas why?
Firstly, thanks very much for releasing Koala and the code. I'm really looking forward to trying it.
I downloaded the 7B delta from https://drive.google.com/drive/folders/10f7wrlAFoPIy-TECHsx9DKIvbQYunCfl and am trying to use convert_easylm_to_hf to put the model into a format I can use with other tools.
I am running this on the command line:
tomj@Eddie ~/src/EasyLM (main)$ PYTHON_PATH="${PWD}:$PYTHONPATH" ~/anaconda3/envs/torch21/bin/python \
-m EasyLM.models.llama.convert_easylm_to_hf --model_size=7b \
--output_dir=/Users/tomj/src/llama.cpp/models/koala/7B \
--load_checkpoint='params::/Users/tomj/src/llama.cpp/models/koala/koala_7b_diff_v2' \
--tokenizer_path=/Users/tomj/src/llama.cpp/models/tokenizer.model
And getting this error:
TypeError: can't convert np.ndarray of type bfloat16. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
I am running on an Intel macOS system on Ventura 13.3. Here's some env info:
transformers version: 4.28.0.dev0
Any help would be much appreciated.
During training, I saw that the code adds a start token such as <s> in the first position. Should we also add this token during inference to stay consistent with training?
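Keeping inference consistent with training usually means prepending the same start token to the prompt. A minimal sketch, assuming the tokenizer's <s> id is 1 (common for LLaMA's SentencePiece vocabulary, but check your own tokenizer); `with_bos` and the prompt ids are hypothetical.

```python
def with_bos(token_ids, bos_id=1):
    """Prepend bos_id unless the sequence already starts with it."""
    if token_ids and token_ids[0] == bos_id:
        return list(token_ids)
    return [bos_id] + list(token_ids)

prompt_ids = [15043, 3186]  # hypothetical token ids for a prompt
inference_ids = with_bos(prompt_ids)
```

The guard against a leading bos_id avoids a double start token when the tokenizer has already added one.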
First of all, thanks for the great work!
Could you point me to the SFT script? It might sound silly, but I failed to find it anywhere.
Hi,
Thanks for the amazing repo. I am trying to serve local models then do the evaluation. But I can only find related classes implemented in serving.py. Can you give some examples about how to initialize the classes, call the functions and serve the model?
Thanks a lot!
Hello!
I'm looking into fine-tuning LLaMA-7b with EasyLM on a TPU v3-8. From my initial runs, I've found that I can get around 975 token/sec. I've tested all the flag combinations I can think of, but am unable to increase the batch size or gradient accumulation steps beyond 1 without OOMing.
I saw that you achieved a high throughput of 2,200 tokens/sec/TPU-v4 chip on OpenLLaMA-7b, and mesh-transformer-jax gets 5k/T/sec on a v3-8 for GPT-J, so I was curious if there was an issue in my config.
Here's how I'm running it:
# Removed "jax_enable_async_all_gather", as it causes a crash on a v3-8. Without these flags, the throughput is 590 tokens/sec.
export LIBTPU_INIT_ARGS='--xla_jf_spmd_threshold_for_windowed_einsum_mib=0 --xla_tpu_spmd_threshold_for_allgather_cse=10000 --xla_tpu_spmd_rewrite_einsum_with_reshape=true --jax_enable_async_collective_offload=true --xla_tpu_enable_latency_hiding_scheduler=true TPU_MEGACORE=MEGACORE_DENSE'
# dtype is fp32 because bf16 causes errors - is bf16 only intended for serving?
# Other flags, which shouldn't affect throughput, are omitted.
python -m EasyLM.models.llama.llama_train \
--dtype='fp32' \
--mesh_dim='1,-1,1' \
--load_llama_config='7b' \
--optimizer.type='adamw' \
--train_dataset.json_dataset.seq_length=2048 \
--train_dataset.json_dataset.batch_size=1
Do you have any tips? Or is higher throughput only expected on larger TPU pods?
Thanks!
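To compare these figures on an equal footing, it helps to convert them to seconds per step and tokens per second per chip. The sketch uses the numbers from the post; the chip count of 4 for a v3-8 (8 cores, 2 per chip) is an assumption, and the helper names are illustrative.

```python
def step_seconds(tokens_per_second, batch_size, seq_length):
    """Wall-clock seconds for one training step at the given throughput."""
    return batch_size * seq_length / tokens_per_second

def per_chip(tokens_per_second, num_chips):
    """Throughput normalized by the number of chips."""
    return tokens_per_second / num_chips

# Numbers from the post; the 4-chip count for a v3-8 is an assumption.
seconds_per_step = step_seconds(975, batch_size=1, seq_length=2048)
v3_tokens_per_chip = per_chip(975, num_chips=4)
```

That works out to roughly 2.1 seconds per step and about 244 tokens/sec per v3 chip, against the quoted 2,200 tokens/sec per v4 chip; hardware generation and dtype (fp32 here) both differ, so the gap is not purely a configuration issue.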
Can the 'load_checkpoint' parameter support reading files from GCP?
Are there any plans to support these models?
https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models/
Hey, awesome work! Thanks for sharing such amazing work with us all.
Can you provide a fine-tuning script, or guidance on one, that we can use with our own data in a multi-GPU setup, preferably with HF if possible?
What vocabulary size is used here?
May I ask about the configs of pre-training? For example, did you use dropout?
Hey,
If I understand correctly, you trained using a v4-512.
Can you share how you configured your node and what sizes?
Thanks!
Ohad
https://huggingface.co/decapoda-research/llama-13b-hf
How to convert the weights on HF into the format of EasyLM?
For the 30B LLaMA model, can serving be supported on a TPU v3-8 (128 GB) by configuring mesh_dims? I tried 8,1 and 4,1, but neither seems to work.
Hi!
First, thank you for this great repository. I noticed that conda takes a long time to examine conflicts and solve the environment. I was wondering if it would be better to include a setup.py and use pip install -e . instead. This would be much faster and wouldn't require modifying the system PYTHONPATH (export PYTHONPATH="${PWD}:$PYTHONPATH").
Hi, Thanks for this great clean codebase!
I was a bit curious about 2 things:
For context, I've been trying to replicate the Stanford Alpaca model using this codebase and TPU pods, and so far I have found that the trained model isn't as good (~36% on MMLU vs ~41% for a 7B model trained on formatted Alpaca data). Any pointers or advice regarding potential gotchas would be greatly appreciated!
When I train a 30-billion-parameter Llama model using V3-256, what configuration would be appropriate? I've tried '1, 64, 4', '1, 128, 2', and '1, 32, 8', but none of them worked.
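A first sanity check is whether a mesh shape matches the device count. The sketch below assumes a v3-256 exposes 256 devices and that a -1 entry means "infer this axis from the remaining devices"; `resolve_mesh_dims` is a hypothetical helper, not an EasyLM function.

```python
import math

def resolve_mesh_dims(spec, num_devices):
    """Parse a spec like '1,-1,4' and fill in a single -1 from num_devices."""
    dims = [int(part) for part in spec.split(",")]
    if dims.count(-1) > 1:
        raise ValueError("at most one dimension may be -1")
    known = math.prod(d for d in dims if d != -1)
    if -1 in dims:
        if num_devices % known != 0:
            raise ValueError(f"{num_devices} devices not divisible by {known}")
        dims[dims.index(-1)] = num_devices // known
    elif known != num_devices:
        raise ValueError(f"mesh uses {known} devices, expected {num_devices}")
    return tuple(dims)

tried = [resolve_mesh_dims(s, 256) for s in ("1, 64, 4", "1, 128, 2", "1, 32, 8")]
inferred = resolve_mesh_dims("1,-1,4", 256)
```

All three shapes from the question multiply to 256, so under these assumptions the device count is not the problem; the failure is more likely memory or another setting, and the actual error message would be needed to say more.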
For LLaMA 7B on a TPU v3-8, training runs normally with --optimizer.accumulate_gradient_steps=1, but with --optimizer.accumulate_gradient_steps=2 it runs out of memory.
Does increasing optimizer.accumulate_gradient_steps increase device memory usage?
Do you have any good solutions?
python3 -m EasyLM.models.llama.llama_train
--mp_mesh_dim='4,1'
--optimizer.accumulate_gradient_steps=1
--fsdp=True
...
jax._src.traceback_util.UnfilteredStackTrace: jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: XLA:TPU compile permanent error. Ran out of memory in memory space hbm. Used 23.43G of 15.48G hbm. Exceeded hbm capacity by 7.95G.
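A plausible explanation (not confirmed by the maintainers in this thread) is that accumulating gradients requires a running gradient buffer as large as the parameters, kept alive between micro-steps, which adds memory even though the per-step batch is unchanged. The toy below shows the mechanism on a scalar model: micro-batch gradients are summed into a buffer and averaged, matching the full-batch gradient. All names are illustrative.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the scalar model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

full_batch_grad = grad_mse(w, xs, ys)

# Two micro-batches of size 2: gradients are summed into an extra buffer
# (same size as the parameters) and averaged before the optimizer update.
accum_steps = 2
grad_buffer = 0.0
for i in range(accum_steps):
    micro_x, micro_y = xs[2 * i: 2 * i + 2], ys[2 * i: 2 * i + 2]
    grad_buffer += grad_mse(w, micro_x, micro_y)
accumulated_grad = grad_buffer / accum_steps
```

For a 7B model that extra buffer is on the order of the parameter memory itself, which could be enough to exceed HBM on a v3-8 that was already close to the limit.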
First of all, I would like to sincerely thank you for providing the model weight diffs.
This is just in time for a tool I'm planning to start building tomorrow, which is intended to increase the accessibility of technical material for people with disabilities.
Would you mind sharing some information about the differences between version 1 and version 2 of the model weight diffs?
Also, are you able to provide any details on memory and GPU requirements for running each model for inference? Here is a spreadsheet (related thread) someone made for LLaMA, if this helps.
Does this support multi-host GPU training? I see the README says it supports GPU/TPU on a single host and multi-host training for TPU, but does not mention multi-host GPU training.
Very nice library and I can't wait to get everything up and running. Thank you for sharing!!!!
I have installed the conda env and run the initial conversion as outlined in Koala.md.
Once that was done, I wanted to recover the model using the diff and that is where I ran out of memory.
I watched the system monitor steadily climb until my virtual memory and physical memory were 95% full and that is when the process was killed by the system.
I have a Dell Precision workstation with 32GB of RAM and Quadro P6000 with 24GB VRAM.
Hello, I just learned about your Koala model through the media. I understand that it is possible to try Koala locally (with 128GB RAM, according to one of the discussion topics!), but I was wondering whether you offer an API, like ChatGPT does, for example?
On Linux, there is an application called Bavarder, in Flatpak format, which lets you query several AI models of this type, and I thought it would be really nice to be able to use a truly open source and unrestricted AI.
Thanks a lot for your feedback.