Comments (17)

xiamengzhou commented on May 18, 2024

There are two ways to use a fixed data loading proportion!

The first way:

  • dynamic: false
  • split: wikipedia (make sure this directory contains MDS files)
    This setup lets you load data from a single data folder of MDS files.

The second way:

  • dynamic: true
  • update_type: constant
  • set_names: specify the set names
  • proportion: specify the loading proportion
    This setup allows you to load data from multiple data folders of MDS files with a constant mixing proportion.

You can refer to the callback function of dynamic loading here: https://github.com/princeton-nlp/LLM-Shearing/blob/main/llmshearing/callbacks/dynamic_loading_callback.py#L32
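
For concreteness, here is a minimal sketch of what the two variants might look like, written out with OmegaConf rather than the repo's YAML files; the exact key nesting in the provided configs may differ, so treat the key names as illustrative (the proportions below mirror the domain weights that appear in the training log later in this thread):

```python
# Illustrative only: key names follow the options described above; the exact
# nesting in the repo's YAML configs may differ.
from omegaconf import OmegaConf

# First way: static loading from a single folder of MDS shards.
static_cfg = OmegaConf.create({
    "dynamic": False,
    "split": "wikipedia",   # this directory must contain the .mds shard files
})

# Second way: dynamic loader with a constant (never-updated) mixture.
constant_cfg = OmegaConf.create({
    "dynamic": True,
    "update_type": "constant",
    "set_names": ["cc", "github", "book", "wiki", "arxiv", "c4-rp"],
    "proportion": [0.67, 0.045, 0.045, 0.045, 0.045, 0.15],  # should sum to 1
})

print(OmegaConf.to_yaml(constant_cfg))
```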

xiamengzhou commented on May 18, 2024

Hi! How large is your dataset? We currently only support using each data point once; going beyond one epoch of data will cause errors. Supporting multiple epochs would require modifying the StreamingDataset logic.

Longyichen commented on May 18, 2024

@xiamengzhou I processed the entire RedPajama-1T dataset following your README, including tokenizing and sampling. This error occurred at batch [7/3200]; it seems the loader reached epoch 1, and that raises the error.
Is the algorithm only supposed to run 7 of the 3200 batches? The details are as follows:

[batch=6/3200]:
        Train time/batch: 5
        Train time/sample: 160
        Train time/batch_in_epoch: 5
        Train time/sample_in_epoch: 160
        Train time/token: 655360
        Train time/token_in_epoch: 655360
        Train metrics/train/cc_weight: 0.6700
        Train metrics/train/github_weight: 0.0450
        Train metrics/train/book_weight: 0.0450
        Train metrics/train/wiki_weight: 0.0450
        Train metrics/train/arxiv_weight: 0.0450
        Train metrics/train/c4-rp_weight: 0.1500
        Train memory/current_allocated_mem: 14.6140
        Train memory/current_active_mem: 14.6140
        Train memory/current_inactive_mem: 1.9267
        Train memory/current_reserved_mem: 39.3450
        Train memory/peak_allocated_mem: 28.0700
        Train memory/peak_active_mem: 28.0700
        Train memory/peak_inactive_mem: 11.7290
        Train memory/peak_reserved_mem: 39.3450
        Train memory/alloc_retries: 0
        Train metrics/train/expected_head_sparsity: 0.0039
        Train metrics/train/target_head_sparsity: 0.0029
        Train metrics/train/expected_intermediate_sparsity: 0.0039
        Train metrics/train/target_intermediate_sparsity: 0.0029
        Train metrics/train/expected_layer_sparsity: 0.0039
        Train metrics/train/target_layer_sparsity: 0.0000
        Train metrics/train/expected_hidden_sparsity: 0.0039
        Train metrics/train/target_hidden_sparsity: 0.0029
        Train metrics/train/expected_sparsity: 0.0117
        Train metrics/train/target_sparsity: 0.0048
        Train trainer/device_train_microbatch_size: 4
        Train loss/train/total: 1.8510
        Train loss/train/ce_loss: 1.8509
        Train loss/train/lag_loss: 0.0001
        Train metrics/train/LanguageCrossEntropy: 1.8509
        Train metrics/train/Perplexity: 6.3655
        Train metrics/train/cc_LanguageCrossEntropy: 1.9415
        Train metrics/train/cc_count: 121
        Train metrics/train/github_LanguageCrossEntropy: 0.8384
        Train metrics/train/github_count: 11
        Train metrics/train/book_LanguageCrossEntropy: nan
        Train metrics/train/book_count: 7
        Train metrics/train/wiki_LanguageCrossEntropy: 1.6548
        Train metrics/train/wiki_count: 8
        Train metrics/train/arxiv_LanguageCrossEntropy: nan
        Train metrics/train/arxiv_count: 5
        Train metrics/train/c4-rp_LanguageCrossEntropy: 1.9918
        Train metrics/train/c4-rp_count: 40
        Train time/train: 0.0152
        Train time/val: 0.0000
        Train time/total: 0.0152
[batch=7/3200]:
        Train time/batch: 6
        Train time/sample: 192
        Train time/batch_in_epoch: 6
        Train time/sample_in_epoch: 192
        Train time/token: 786432
        Train time/token_in_epoch: 786432
        Train metrics/train/cc_weight: 0.6700
        Train metrics/train/github_weight: 0.0450
        Train metrics/train/book_weight: 0.0450
        Train metrics/train/wiki_weight: 0.0450
        Train metrics/train/arxiv_weight: 0.0450
        Train metrics/train/c4-rp_weight: 0.1500
        Train memory/current_allocated_mem: 14.6140
        Train memory/current_active_mem: 14.6140
        Train memory/current_inactive_mem: 1.9267
        Train memory/current_reserved_mem: 39.3450
        Train memory/peak_allocated_mem: 28.0700
        Train memory/peak_active_mem: 28.0700
        Train memory/peak_inactive_mem: 11.7290
        Train memory/peak_reserved_mem: 39.3450
        Train memory/alloc_retries: 0
        Train metrics/train/expected_head_sparsity: 0.0039
        Train metrics/train/target_head_sparsity: 0.0035
        Train metrics/train/expected_intermediate_sparsity: 0.0039
        Train metrics/train/target_intermediate_sparsity: 0.0035
        Train metrics/train/expected_layer_sparsity: 0.0039
        Train metrics/train/target_layer_sparsity: 0.0000
        Train metrics/train/expected_hidden_sparsity: 0.0039
        Train metrics/train/target_hidden_sparsity: 0.0035
        Train metrics/train/expected_sparsity: 0.0117
        Train metrics/train/target_sparsity: 0.0057
        Train trainer/device_train_microbatch_size: 4
        Train loss/train/total: 1.8914
        Train loss/train/ce_loss: 1.8913
        Train loss/train/lag_loss: 0.0001
        Train metrics/train/LanguageCrossEntropy: 1.8913
        Train metrics/train/Perplexity: 6.6280
        Train metrics/train/cc_LanguageCrossEntropy: 1.8021
        Train metrics/train/cc_count: 140
        Train metrics/train/github_LanguageCrossEntropy: nan
        Train metrics/train/github_count: 11
        Train metrics/train/book_LanguageCrossEntropy: 1.9494
        Train metrics/train/book_count: 8
        Train metrics/train/wiki_LanguageCrossEntropy: 1.7889
        Train metrics/train/wiki_count: 9
        Train metrics/train/arxiv_LanguageCrossEntropy: nan
        Train metrics/train/arxiv_count: 5
        Train metrics/train/c4-rp_LanguageCrossEntropy: 2.0495
        Train metrics/train/c4-rp_count: 51
        Train time/train: 0.0172
        Train time/val: 0.0000
        Train time/total: 0.0172
Traceback (most recent call last):
 File "/root/paddlejob/workspace/LLM/baidu/personal-code/LLM-Shearing/llmshearing/train.py", line 319, in <module>
   main(cfg)
 File "/root/paddlejob/workspace/LLM/baidu/personal-code/LLM-Shearing/llmshearing/train.py", line 299, in main
   trainer.fit()
 File "/root/miniconda3/envs/shearing/lib/python3.10/site-packages/composer/trainer/trainer.py", line 1876, in fit
   self._train_loop()
 File "/root/miniconda3/envs/shearing/lib/python3.10/site-packages/composer/trainer/trainer.py", line 2018, in _train_loop
   for batch_idx, self.state.batch in enumerate(self._iter_dataloader(TrainerMode.TRAIN)):
 File "/root/miniconda3/envs/shearing/lib/python3.10/site-packages/composer/trainer/trainer.py", line 3024, in _iter_dataloader
   batch = next(dataloader_iter)
 File "/root/miniconda3/envs/shearing/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
   data = self._next_data()
 File "/root/miniconda3/envs/shearing/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 677, in _next_data
   data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
 File "/root/miniconda3/envs/shearing/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
   data.append(next(self.dataset_iter))
 File "/root/paddlejob/workspace/LLM/baidu/personal-code/LLM-Shearing/llmshearing/datasets/streaming_dataset.py", line 401, in __iter__
   sample_ids_per_stream = self._get_work(world, epoch, used_sample_ids)
 File "/root/paddlejob/workspace/LLM/baidu/personal-code/LLM-Shearing/llmshearing/datasets/streaming_dataset.py", line 355, in _get_work
   sample_ids_per_stream = generate_work(self, world, epoch, used_domain_ids)
 File "/root/paddlejob/workspace/LLM/baidu/personal-code/LLM-Shearing/llmshearing/datasets/streaming_dataset.py", line 46, in generate_work
   assert epoch == 0, "Currently only supports dynamic loading from each domain for once."
AssertionError: Currently only supports dynamic loading from each domain for once.

xiamengzhou commented on May 18, 2024

Hiiii, I am not sure why it is happening here -- I will need to take a closer look at it and will get back to you later. Could you share the configuration you are using, and the number of data points in each domain?
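
In the meantime, a back-of-the-envelope check may help: the log above implies 131072 tokens per batch (655360 tokens after 5 batches, i.e. 32 sequences of 4096 tokens), so each domain must supply roughly its weight times the total token budget to survive a single pass through the 3200-batch run. A rough sketch (the numbers are inferred from the log, not read from any config):

```python
# Rough data-size check based on the numbers in the log above; nothing here is
# read from the actual config, so adjust the constants to your own run.
tokens_per_batch = 655_360 // 5          # 131072 tokens = 32 sequences x 4096
total_batches = 3200
total_tokens = tokens_per_batch * total_batches   # ~0.42B tokens overall

weights = {"cc": 0.67, "github": 0.045, "book": 0.045,
           "wiki": 0.045, "arxiv": 0.045, "c4-rp": 0.15}

for domain, w in weights.items():
    need_tokens = int(w * total_tokens)
    print(f"{domain:>6}: ~{need_tokens / 1e6:.1f}M tokens "
          f"(~{need_tokens // 4096} sequences of length 4096)")

# If a domain's prepared MDS split holds fewer sequences than this (or dynamic
# loading pushes its proportion higher), its stream wraps into epoch 1 and the
# assertion in generate_work fires.
```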

Longyichen commented on May 18, 2024

Hi, thanks for your help. I think you are right, there may be something wrong with the data. Although I cannot directly read the MDS files to check the number of data points, I did notice that the sampled data files are smaller than expected. I'll check carefully what's wrong with the sampling.

xiamengzhou commented on May 18, 2024

You can use the TextStreamingDataset to load the data and count the number of data points with the len() function. You can also look at the index.json file to verify the number of samples.
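
A minimal sketch of both checks is below. The TextStreamingDataset constructor arguments are assumptions modeled on the upstream streaming.StreamingDataset API, so adjust them to the actual signature in llmshearing/datasets/streaming_dataset.py; the index.json layout is the standard MDS format with a per-shard sample count.

```python
import json
from pathlib import Path

# Option 1: count samples through the dataset object. The keyword arguments are
# assumptions (check the TextStreamingDataset signature before running):
# from llmshearing.datasets.streaming_dataset import TextStreamingDataset
# ds = TextStreamingDataset(local="/path/to/mds_root", split="wiki", max_seq_len=4096)
# print("samples:", len(ds))

# Option 2: read each split's index.json and sum the per-shard sample counts.
def count_samples(split_dir: Path) -> int:
    index = json.loads((split_dir / "index.json").read_text())
    return sum(shard["samples"] for shard in index["shards"])

root = Path("/path/to/mds_root")  # hypothetical path to your prepared data
for split in ["cc", "github", "book", "wiki", "arxiv", "c4-rp"]:
    if (root / split / "index.json").exists():
        print(split, count_samples(root / split))
```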

Longyichen commented on May 18, 2024

Thank you for your patient reply. Previously I used the default script you provided without setting the number of sampling tokens, which led to the problems above. I'm resampling the data now and waiting to see if it helps; it may take a while.
There is also a related question:
If I need to disable DoReMi, how should I modify the settings? I found multiple places that seem related to the DoReMi configuration, including:

In addition, the yaml file also configures the data path. When will this be used, or will it be overwritten?

Longyichen commented on May 18, 2024

Thanks for your help, the code runs smoothly. But sometimes the loss will be nan. Is this normal?

[batch=3194/3200]:
         Train time/batch: 3193
         Train time/sample: 102176
         Train time/batch_in_epoch: 3193
         Train time/sample_in_epoch: 102176
         Train time/token: 418512896
         Train time/token_in_epoch: 418512896
         Train metrics/train/cc_weight: 0.0450
         Train metrics/train/github_weight: 0.0017
         Train metrics/train/book_weight: 0.0007
         Train metrics/train/stackexchange_weight: 0.0023
         Train metrics/train/wiki_weight: 0.0121
         Train metrics/train/arxiv_weight: 0.0011
         Train metrics/train/c4-rp_weight: 0.9370
         Train memory/current_allocated_mem: 14.6140
         Train memory/current_active_mem: 14.6140
         Train memory/current_inactive_mem: 1.9286
         Train memory/current_reserved_mem: 43.5430
         Train memory/peak_allocated_mem: 28.0710
         Train memory/peak_active_mem: 28.0710
         Train memory/peak_inactive_mem: 11.7290
         Train memory/peak_reserved_mem: 43.5430
         Train memory/alloc_retries: 0
         Train metrics/train/expected_head_sparsity: 0.3750
         Train metrics/train/target_head_sparsity: 0.3750
         Train metrics/train/expected_intermediate_sparsity: 0.3714
         Train metrics/train/target_intermediate_sparsity: 0.3721
         Train metrics/train/expected_layer_sparsity: 0.0039
         Train metrics/train/target_layer_sparsity: 0.0000
         Train metrics/train/expected_hidden_sparsity: 0.3734
         Train metrics/train/target_hidden_sparsity: 0.3750
         Train metrics/train/expected_sparsity: 0.6085
         Train metrics/train/target_sparsity: 0.6082
         Train trainer/device_train_microbatch_size: 4
         Train loss/train/total: 9.1105
         Train loss/train/ce_loss: 2.4873
         Train loss/train/lag_loss: 6.6233
         Train metrics/train/LanguageCrossEntropy: 2.4873
         Train metrics/train/Perplexity: 12.0283
         Train metrics/train/cc_LanguageCrossEntropy: nan
         Train metrics/train/cc_count: 24363
         Train metrics/train/github_LanguageCrossEntropy: nan
         Train metrics/train/github_count: 1389
         Train metrics/train/book_LanguageCrossEntropy: nan
         Train metrics/train/book_count: 1033
         Train metrics/train/stackexchange_LanguageCrossEntropy: nan
         Train metrics/train/stackexchange_count: 741
         Train metrics/train/wiki_LanguageCrossEntropy: 1.9528
         Train metrics/train/wiki_count: 14344
         Train metrics/train/arxiv_LanguageCrossEntropy: nan
         Train metrics/train/arxiv_count: 783
         Train metrics/train/c4-rp_LanguageCrossEntropy: 2.5045
         Train metrics/train/c4-rp_count: 59555
         Train throughput/batches_per_sec: 0.1416
         Train throughput/samples_per_sec: 4.5306
         Train throughput/device/batches_per_sec: 0.0177
         Train throughput/device/samples_per_sec: 0.5663
         Train throughput/tokens_per_sec: 18557.1898
         Train throughput/device/tokens_per_sec: 2319.6487
         Train throughput/flops_per_sec: 869869884376190.8750
         Train throughput/device/flops_per_sec: 108733735547023.8594
         Train throughput/device/mfu: 0.3485
         Train time/train: 6.3065
         Train time/val: 1.3523
         Train time/total: 7.6588

xiamengzhou commented on May 18, 2024

When a batch does not contain data from a specific domain, that domain's loss becomes nan, so it should be normal! As a sanity check, you can print how much data each batch draws from each domain to verify.
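
To illustrate why the nan is expected rather than a bug: the per-domain metric averages the cross-entropy over that domain's samples in the batch, and averaging over zero samples is 0/0, which evaluates to nan, while the overall loss is unaffected. The sketch below is illustrative, not the repo's actual metric code, and the batch key name used for the sanity check is an assumption:

```python
import torch

# Per-domain CE turns into nan when a batch has no samples from that domain,
# because the mean over an empty selection is 0/0.
per_sample_ce = torch.tensor([2.1, 1.8, 2.4, 1.9])  # CE of each sample in a batch
domain_ids = torch.tensor([0, 0, 2, 2])             # which domain each sample came from
set_names = ["cc", "github", "wiki"]

for d, name in enumerate(set_names):
    mask = domain_ids == d
    domain_ce = per_sample_ce[mask].sum() / mask.sum()  # 0/0 -> nan for "github"
    print(f"{name}: count={int(mask.sum())}, ce={domain_ce.item():.4f}")

# Sanity check suggested above (inside the training loop; the batch key "set"
# is an assumption -- use whatever field stores the domain id in your batch):
# counts = torch.bincount(batch["set"].flatten(), minlength=len(set_names))
# print({n: int(c) for n, c in zip(set_names, counts)})
```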

lippman1125 commented on May 18, 2024

@Longyichen Would you like to share your pruning script?

lippman1125 commented on May 18, 2024

(Quoting Longyichen's comment and training log above.)

@Longyichen Have you noticed that c4-rp_weight is 0.9370, which is not consistent with the values reported in the paper?

Longyichen commented on May 18, 2024

@lippman1125 Yes, I have a similar problem, but I don't know what causes it. It seems the model's performance does not suffer much compared to the paper, though. For details, we can ask @xiamengzhou for help.

lippman1125 commented on May 18, 2024

@Longyichen The eval CE loss determines the proportion, but the new proportion only affects the train CE loss. My guess is that if there is a gap between the training samples and the eval samples, it could lead to this problem.
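
For reference, a hedged paraphrase of the dynamic batch loading rule from the Sheared LLaMA paper (the exact implementation lives in dynamic_loading_callback.py): any domain whose evaluation loss stays above its reference loss keeps getting exponentially upweighted and renormalized, so a persistent eval gap on a single domain such as c4-rp will compound its weight toward 1 even if the training loss looks fine.

```python
import numpy as np

def update_proportion(prev_w, eval_loss, ref_loss):
    """Paraphrase of dynamic batch loading (Sheared LLaMA): domains whose eval
    loss exceeds the reference loss are upweighted exponentially, then the
    weights are renormalized. This is a sketch, not the repo's exact code."""
    diff = np.maximum(np.asarray(eval_loss) - np.asarray(ref_loss), 0.0)
    w = np.asarray(prev_w) * np.exp(diff)
    return w / w.sum()

# Toy example: if c4-rp keeps evaluating 0.4 nats above its reference while the
# other domains are at or below theirs, its weight grows toward 1 over updates.
w = np.array([0.67, 0.045, 0.045, 0.045, 0.045, 0.15])  # cc, github, book, wiki, arxiv, c4-rp
for _ in range(20):
    w = update_proportion(w,
                          eval_loss=[2.0, 1.0, 1.1, 1.2, 1.0, 2.6],
                          ref_loss=[2.0, 1.2, 1.3, 1.4, 1.2, 2.2])
print(w.round(4))  # almost all of the mass ends up on the last (c4-rp) entry
```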

Longyichen commented on May 18, 2024

@lippman1125 Have you tried continued pre-training? You could try using the continued pre-training dataset and evaluation set to see whether the same problem occurs.

coderchem commented on May 18, 2024

@Longyichen Could you share your scripts? I ran into the same problem and have not solved it.

coderchem commented on May 18, 2024

I tried all the methods above, but it still does not work. The dataset I am using is the sample dataset. Could you help explain where the problem might be?

xiamengzhou commented on May 18, 2024

@coderchem Have you been using the data shared on Google Drive?
