Comments (17)
There are two ways to use a fixed data loading proportion!
The first way:
- dynamic: false
- split: wikipedia (make sure the files in this directory are mds files)
This setup lets you load data from a single data folder of mds files.
The second way:
- dynamic: true
- update_type: constant
- set_names: specify the set names
- proportion: specify the loading proportions
This setup lets you load data from multiple data folders of mds files with a constant mixing proportion.
You can refer to the callback function of dynamic loading here: https://github.com/princeton-nlp/LLM-Shearing/blob/main/llmshearing/callbacks/dynamic_loading_callback.py#L32
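Put together, the two setups described above might look roughly like the following YAML sketch (the field names follow the bullet lists above; the exact keys and values are assumptions, so check them against the example configs in the repo):

```yaml
# Way 1: static loading from a single folder of mds files
dynamic: false
split: wikipedia        # directory that contains the mds shards

# Way 2: constant-proportion loading across several domain folders
dynamic: true
update_type: constant
set_names: [cc, github, book, wiki, arxiv, c4-rp]     # one folder per domain
proportion: [0.67, 0.045, 0.045, 0.045, 0.045, 0.15]  # must sum to 1.0
```

The proportions shown here are the initial weights that appear in the training logs later in this thread; they are only an illustration.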
from llm-shearing.
Hi! How large is your dataset? We currently only support using each data point once, and exceeding one epoch of data will cause errors. Supporting multiple epochs would require modifying the StreamingDataset logic.
@xiamengzhou I processed the entire redpajama-1T according to your README, including tokenizing and sampling. This error occurred at batch=[7/3200]: it seems the computation reached epoch 1 and then errored out.
Is the run supposed to stop after only 7 of 3200 batches? The details are as follows:
[batch=6/3200]:
Train time/batch: 5
Train time/sample: 160
Train time/batch_in_epoch: 5
Train time/sample_in_epoch: 160
Train time/token: 655360
Train time/token_in_epoch: 655360
Train metrics/train/cc_weight: 0.6700
Train metrics/train/github_weight: 0.0450
Train metrics/train/book_weight: 0.0450
Train metrics/train/wiki_weight: 0.0450
Train metrics/train/arxiv_weight: 0.0450
Train metrics/train/c4-rp_weight: 0.1500
Train memory/current_allocated_mem: 14.6140
Train memory/current_active_mem: 14.6140
Train memory/current_inactive_mem: 1.9267
Train memory/current_reserved_mem: 39.3450
Train memory/peak_allocated_mem: 28.0700
Train memory/peak_active_mem: 28.0700
Train memory/peak_inactive_mem: 11.7290
Train memory/peak_reserved_mem: 39.3450
Train memory/alloc_retries: 0
Train metrics/train/expected_head_sparsity: 0.0039
Train metrics/train/target_head_sparsity: 0.0029
Train metrics/train/expected_intermediate_sparsity: 0.0039
Train metrics/train/target_intermediate_sparsity: 0.0029
Train metrics/train/expected_layer_sparsity: 0.0039
Train metrics/train/target_layer_sparsity: 0.0000
Train metrics/train/expected_hidden_sparsity: 0.0039
Train metrics/train/target_hidden_sparsity: 0.0029
Train metrics/train/expected_sparsity: 0.0117
Train metrics/train/target_sparsity: 0.0048
Train trainer/device_train_microbatch_size: 4
Train loss/train/total: 1.8510
Train loss/train/ce_loss: 1.8509
Train loss/train/lag_loss: 0.0001
Train metrics/train/LanguageCrossEntropy: 1.8509
Train metrics/train/Perplexity: 6.3655
Train metrics/train/cc_LanguageCrossEntropy: 1.9415
Train metrics/train/cc_count: 121
Train metrics/train/github_LanguageCrossEntropy: 0.8384
Train metrics/train/github_count: 11
Train metrics/train/book_LanguageCrossEntropy: nan
Train metrics/train/book_count: 7
Train metrics/train/wiki_LanguageCrossEntropy: 1.6548
Train metrics/train/wiki_count: 8
Train metrics/train/arxiv_LanguageCrossEntropy: nan
Train metrics/train/arxiv_count: 5
Train metrics/train/c4-rp_LanguageCrossEntropy: 1.9918
Train metrics/train/c4-rp_count: 40
Train time/train: 0.0152
Train time/val: 0.0000
Train time/total: 0.0152
[batch=7/3200]:
Train time/batch: 6
Train time/sample: 192
Train time/batch_in_epoch: 6
Train time/sample_in_epoch: 192
Train time/token: 786432
Train time/token_in_epoch: 786432
Train metrics/train/cc_weight: 0.6700
Train metrics/train/github_weight: 0.0450
Train metrics/train/book_weight: 0.0450
Train metrics/train/wiki_weight: 0.0450
Train metrics/train/arxiv_weight: 0.0450
Train metrics/train/c4-rp_weight: 0.1500
Train memory/current_allocated_mem: 14.6140
Train memory/current_active_mem: 14.6140
Train memory/current_inactive_mem: 1.9267
Train memory/current_reserved_mem: 39.3450
Train memory/peak_allocated_mem: 28.0700
Train memory/peak_active_mem: 28.0700
Train memory/peak_inactive_mem: 11.7290
Train memory/peak_reserved_mem: 39.3450
Train memory/alloc_retries: 0
Train metrics/train/expected_head_sparsity: 0.0039
Train metrics/train/target_head_sparsity: 0.0035
Train metrics/train/expected_intermediate_sparsity: 0.0039
Train metrics/train/target_intermediate_sparsity: 0.0035
Train metrics/train/expected_layer_sparsity: 0.0039
Train metrics/train/target_layer_sparsity: 0.0000
Train metrics/train/expected_hidden_sparsity: 0.0039
Train metrics/train/target_hidden_sparsity: 0.0035
Train metrics/train/expected_sparsity: 0.0117
Train metrics/train/target_sparsity: 0.0057
Train trainer/device_train_microbatch_size: 4
Train loss/train/total: 1.8914
Train loss/train/ce_loss: 1.8913
Train loss/train/lag_loss: 0.0001
Train metrics/train/LanguageCrossEntropy: 1.8913
Train metrics/train/Perplexity: 6.6280
Train metrics/train/cc_LanguageCrossEntropy: 1.8021
Train metrics/train/cc_count: 140
Train metrics/train/github_LanguageCrossEntropy: nan
Train metrics/train/github_count: 11
Train metrics/train/book_LanguageCrossEntropy: 1.9494
Train metrics/train/book_count: 8
Train metrics/train/wiki_LanguageCrossEntropy: 1.7889
Train metrics/train/wiki_count: 9
Train metrics/train/arxiv_LanguageCrossEntropy: nan
Train metrics/train/arxiv_count: 5
Train metrics/train/c4-rp_LanguageCrossEntropy: 2.0495
Train metrics/train/c4-rp_count: 51
Train time/train: 0.0172
Train time/val: 0.0000
Train time/total: 0.0172
Traceback (most recent call last):
File "/root/paddlejob/workspace/LLM/baidu/personal-code/LLM-Shearing/llmshearing/train.py", line 319, in <module>
main(cfg)
File "/root/paddlejob/workspace/LLM/baidu/personal-code/LLM-Shearing/llmshearing/train.py", line 299, in main
trainer.fit()
File "/root/miniconda3/envs/shearing/lib/python3.10/site-packages/composer/trainer/trainer.py", line 1876, in fit
self._train_loop()
File "/root/miniconda3/envs/shearing/lib/python3.10/site-packages/composer/trainer/trainer.py", line 2018, in _train_loop
for batch_idx, self.state.batch in enumerate(self._iter_dataloader(TrainerMode.TRAIN)):
File "/root/miniconda3/envs/shearing/lib/python3.10/site-packages/composer/trainer/trainer.py", line 3024, in _iter_dataloader
batch = next(dataloader_iter)
File "/root/miniconda3/envs/shearing/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
data = self._next_data()
File "/root/miniconda3/envs/shearing/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 677, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/root/miniconda3/envs/shearing/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
data.append(next(self.dataset_iter))
File "/root/paddlejob/workspace/LLM/baidu/personal-code/LLM-Shearing/llmshearing/datasets/streaming_dataset.py", line 401, in __iter__
sample_ids_per_stream = self._get_work(world, epoch, used_sample_ids)
File "/root/paddlejob/workspace/LLM/baidu/personal-code/LLM-Shearing/llmshearing/datasets/streaming_dataset.py", line 355, in _get_work
sample_ids_per_stream = generate_work(self, world, epoch, used_domain_ids)
File "/root/paddlejob/workspace/LLM/baidu/personal-code/LLM-Shearing/llmshearing/datasets/streaming_dataset.py", line 46, in generate_work
assert epoch == 0, "Currently only supports dynamic loading from each domain for once."
AssertionError: Currently only supports dynamic loading from each domain for once.
Hiiii, I am not sure why it is happening here -- I will need to take a closer look at it and will get back to you later. Could you share the configuration you are using, and the number of data points in each domain?
Hi, thanks for your help. I think you are right; there may be something wrong with the data. Although I cannot directly read the mds files to view the number of data points, I found that the size of the data files produced by sampling is smaller than usual. I'll check carefully what's wrong with the sampled files.
You can use the TextStreamingDataset to load the data and count the number of data points simply with the len() function. You can also check the index.json file for the number of samples.
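The index.json check suggested above can be done without loading the dataset at all: MDS folders written by the streaming library carry an index.json whose shard entries each record a sample count. A minimal sketch (the directory layout in the example is hypothetical):

```python
import json
from pathlib import Path


def count_samples(domain_dir: str) -> int:
    """Count the data points in one MDS domain folder via its index.json."""
    index = json.loads(Path(domain_dir, "index.json").read_text())
    # each shard entry records how many samples that shard holds
    return sum(shard["samples"] for shard in index["shards"])


# Example usage (hypothetical paths, one folder per domain):
# for domain in ["cc", "github", "book", "wiki", "arxiv", "c4-rp"]:
#     print(domain, count_samples(f"mds_data/for_prune/{domain}"))
```

If the counts here disagree with len() on the loaded TextStreamingDataset, the sampling step likely produced a truncated folder.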
Thank you for your patient reply. Previously I used the default script you provided without setting the number of sampled tokens, which led to these problems. I'm resampling the data now and waiting to see if that fixes it; it may take a while.
There is also a related question:
If I want to disable DoReMi, what settings do I need to modify?
I ask because I found several places that seem related to the DoReMi configuration, including
In addition, the yaml file also configures the data path. When is this used, and will it be overwritten?
Thanks for your help, the code runs smoothly. But sometimes the loss will be nan. Is this normal?
[batch=3194/3200]:
Train time/batch: 3193
Train time/sample: 102176
Train time/batch_in_epoch: 3193
Train time/sample_in_epoch: 102176
Train time/token: 418512896
Train time/token_in_epoch: 418512896
Train metrics/train/cc_weight: 0.0450
Train metrics/train/github_weight: 0.0017
Train metrics/train/book_weight: 0.0007
Train metrics/train/stackexchange_weight: 0.0023
Train metrics/train/wiki_weight: 0.0121
Train metrics/train/arxiv_weight: 0.0011
Train metrics/train/c4-rp_weight: 0.9370
Train memory/current_allocated_mem: 14.6140
Train memory/current_active_mem: 14.6140
Train memory/current_inactive_mem: 1.9286
Train memory/current_reserved_mem: 43.5430
Train memory/peak_allocated_mem: 28.0710
Train memory/peak_active_mem: 28.0710
Train memory/peak_inactive_mem: 11.7290
Train memory/peak_reserved_mem: 43.5430
Train memory/alloc_retries: 0
Train metrics/train/expected_head_sparsity: 0.3750
Train metrics/train/target_head_sparsity: 0.3750
Train metrics/train/expected_intermediate_sparsity: 0.3714
Train metrics/train/target_intermediate_sparsity: 0.3721
Train metrics/train/expected_layer_sparsity: 0.0039
Train metrics/train/target_layer_sparsity: 0.0000
Train metrics/train/expected_hidden_sparsity: 0.3734
Train metrics/train/target_hidden_sparsity: 0.3750
Train metrics/train/expected_sparsity: 0.6085
Train metrics/train/target_sparsity: 0.6082
Train trainer/device_train_microbatch_size: 4
Train loss/train/total: 9.1105
Train loss/train/ce_loss: 2.4873
Train loss/train/lag_loss: 6.6233
Train metrics/train/LanguageCrossEntropy: 2.4873
Train metrics/train/Perplexity: 12.0283
Train metrics/train/cc_LanguageCrossEntropy: nan
Train metrics/train/cc_count: 24363
Train metrics/train/github_LanguageCrossEntropy: nan
Train metrics/train/github_count: 1389
Train metrics/train/book_LanguageCrossEntropy: nan
Train metrics/train/book_count: 1033
Train metrics/train/stackexchange_LanguageCrossEntropy: nan
Train metrics/train/stackexchange_count: 741
Train metrics/train/wiki_LanguageCrossEntropy: 1.9528
Train metrics/train/wiki_count: 14344
Train metrics/train/arxiv_LanguageCrossEntropy: nan
Train metrics/train/arxiv_count: 783
Train metrics/train/c4-rp_LanguageCrossEntropy: 2.5045
Train metrics/train/c4-rp_count: 59555
Train throughput/batches_per_sec: 0.1416
Train throughput/samples_per_sec: 4.5306
Train throughput/device/batches_per_sec: 0.0177
Train throughput/device/samples_per_sec: 0.5663
Train throughput/tokens_per_sec: 18557.1898
Train throughput/device/tokens_per_sec: 2319.6487
Train throughput/flops_per_sec: 869869884376190.8750
Train throughput/device/flops_per_sec: 108733735547023.8594
Train throughput/device/mfu: 0.3485
Train time/train: 6.3065
Train time/val: 1.3523
Train time/total: 7.6588
When the batch does not contain data from a specific domain, the loss for that domain becomes nan. So it should be normal! As a sanity check, you can print the amount of data from each domain in each batch to verify.
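The explanation above can be reproduced in a few lines: the per-domain cross-entropy is an average over that domain's samples in the batch, so an empty domain gives a 0/0 average. This is only a minimal sketch of the effect, not the repo's actual metric code:

```python
import math


def domain_ce(losses, domains, d):
    """Average CE over the samples of domain `d` in one batch (sketch)."""
    vals = [loss for loss, dom in zip(losses, domains) if dom == d]
    # averaging over zero samples is 0/0, reported as nan
    return sum(vals) / len(vals) if vals else math.nan


losses = [2.1, 1.9, 2.3, 2.0]
domains = ["cc", "cc", "wiki", "wiki"]  # no "arxiv" sample in this batch

print(domain_ce(losses, domains, "cc"))     # mean over the cc samples
print(domain_ce(losses, domains, "arxiv"))  # nan: no arxiv samples present
```

This matches the logs above: the domains with nan cross-entropy are exactly the low-weight domains whose per-batch counts stopped growing.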
@Longyichen Would you like to share your pruning script?
@Longyichen Have you noticed that c4-rp_weight is 0.9370, which is inconsistent with the proportions reported in the paper?
@lippman1125 Yes, I have a similar problem, but I don't know what causes it. The model's performance doesn't seem to suffer much compared to the paper, though. For details, we can ask @xiamengzhou for help.
@Longyichen Because the eval CE Loss determines the proportion, but the new proportion only affects train CE Loss. I guess, if there is some gap between the training samples and eval samples, it could lead to this problem.
@lippman1125 Have you tried continued pre-training? You can try using the pre-training dataset and evaluation set to see if the same problem occurs.
@Longyichen Could you share your scripts? I ran into the same problem and haven't solved it.
I tried all the methods above and it still doesn't work. The dataset I'm using is the sample dataset. Could you help explain where the problem might be?
@coderchem Have you been using the data shared on the google drive?