Hi, and thanks for the great work! I have finished all the preliminary steps and ran python -m vall_e.train yaml=config/test/ar.yml
to train. It outputs something like this:
{'data_dirs': ['data/test'], 'model': 'ar-quarter', 'batch_size': 1, 'eval_batch_size': 1, 'save_ckpt_every': 500, 'eval_every': 500, 'max_iter': 1000, 'cfg_name': PosixPath('test/ar')} {}
2it [00:00, 1906.94it/s]
2023-02-28 00:43:47 - vall_e.data - INFO - GR=0;LR=0 -
{'</s>': 1, '<s>': 2, 'AH0': 3, 'D': 4, 'ER1': 5, 'HH': 6, 'L': 7, 'OW1': 8, 'W': 9, '_': 10}
2023-02-28 00:43:47 - vall_e.data - INFO - GR=0;LR=0 -
{'test': 0}
2023-02-28 00:43:47 - vall_e.data - INFO - GR=0;LR=0 -
#samples (train): 2.
2023-02-28 00:43:47 - vall_e.data - INFO - GR=0;LR=0 -
#samples (val): 0.
[2023-02-28 00:43:47,269] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
2023-02-28 00:43:47 - torch.distributed.distributed_c10d - INFO - GR=0;LR=0 -
Added key: store_based_barrier_key:1 to store for rank: 0
2023-02-28 00:43:47 - torch.distributed.distributed_c10d - INFO - GR=0;LR=0 -
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
2023-02-28 00:43:51 - torch.distributed.distributed_c10d - INFO - GR=0;LR=0 -
Added key: store_based_barrier_key:2 to store for rank: 0
2023-02-28 00:43:51 - torch.distributed.distributed_c10d - INFO - GR=0;LR=0 -
Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 1 nodes.
[2023-02-28 00:43:51,787] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /mnt/lustre/sjtu/home/ywg12/.cache/torch_extensions/py310_cu102 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /mnt/lustre/sjtu/home/ywg12/.cache/torch_extensions/py310_cu102/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.10433101654052734 seconds
[2023-02-28 00:43:52,152] [INFO] [logging.py:75:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adam as basic optimizer
[2023-02-28 00:43:52,155] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2023-02-28 00:43:52,155] [INFO] [logging.py:75:log_dist] [Rank 0] Creating fp16 optimizer with dynamic loss scale
[2023-02-28 00:43:52,165] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Final Optimizer = adam
[2023-02-28 00:43:52,166] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = WarmupDecayLR
[2023-02-28 00:43:52,166] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupDecayLR object at 0x7fa2e7319ed0>
[2023-02-28 00:43:52,166] [INFO] [logging.py:75:log_dist] [Rank 0] step=0, skipped=0, lr=[0.001], mom=[(0.9, 0.999)]
[2023-02-28 00:43:52,166] [INFO] [config.py:1009:print] DeepSpeedEngine configuration:
[2023-02-28 00:43:52,166] [INFO] [config.py:1013:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-02-28 00:43:52,166] [INFO] [config.py:1013:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-02-28 00:43:52,167] [INFO] [config.py:1013:print] amp_enabled .................. False
[2023-02-28 00:43:52,167] [INFO] [config.py:1013:print] amp_params ................... False
[2023-02-28 00:43:52,167] [INFO] [config.py:1013:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2023-02-28 00:43:52,167] [INFO] [config.py:1013:print] bfloat16_enabled ............. False
[2023-02-28 00:43:52,167] [INFO] [config.py:1013:print] checkpoint_parallel_write_pipeline False
[2023-02-28 00:43:52,167] [INFO] [config.py:1013:print] checkpoint_tag_validation_enabled True
[2023-02-28 00:43:52,167] [INFO] [config.py:1013:print] checkpoint_tag_validation_fail False
[2023-02-28 00:43:52,167] [INFO] [config.py:1013:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fa2e7319ae0>
[2023-02-28 00:43:52,167] [INFO] [config.py:1013:print] communication_data_type ...... None
[2023-02-28 00:43:52,167] [INFO] [config.py:1013:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] curriculum_enabled_legacy .... False
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] curriculum_params_legacy ..... False
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] data_efficiency_enabled ...... False
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] dataloader_drop_last ......... False
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] disable_allgather ............ False
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] dump_state ................... False
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] dynamic_loss_scale_args ...... None
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] eigenvalue_enabled ........... False
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] eigenvalue_gas_boundary_resolution 1
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] eigenvalue_layer_num ......... 0
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] eigenvalue_max_iter .......... 100
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] eigenvalue_stability ......... 1e-06
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] eigenvalue_tol ............... 0.01
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] eigenvalue_verbose ........... False
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] elasticity_enabled ........... False
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] flops_profiler_config ........ {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] fp16_auto_cast ............... False
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] fp16_enabled ................. True
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] fp16_master_weights_and_gradients False
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] global_rank .................. 0
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] grad_accum_dtype ............. None
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] gradient_accumulation_steps .. 1
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] gradient_clipping ............ 100.0
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] gradient_predivide_factor .... 1.0
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] initial_dynamic_scale ........ 65536
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] load_universal_checkpoint .... False
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] loss_scale ................... 0
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] memory_breakdown ............. False
[2023-02-28 00:43:52,168] [INFO] [config.py:1013:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print] optimizer_legacy_fusion ...... False
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print] optimizer_name ............... adam
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print] optimizer_params ............. None
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print] pld_enabled .................. False
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print] pld_params ................... False
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print] prescale_gradients ........... False
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print] scheduler_name ............... WarmupDecayLR
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print] scheduler_params ............. {'warmup_min_lr': 1e-06, 'warmup_max_lr': 0.0002, 'warmup_num_steps': 1000, 'total_num_steps': 1000, 'warmup_type': 'linear'}
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print] sparse_attention ............. None
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print] sparse_gradients_enabled ..... False
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print] steps_per_print .............. 10
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print] train_batch_size ............. 1
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print] train_micro_batch_size_per_gpu 1
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print] use_node_local_storage ....... False
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print] wall_clock_breakdown ......... False
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print] world_size ................... 1
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print] zero_allow_untested_optimizer False
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print] zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print] zero_enabled ................. False
[2023-02-28 00:43:52,169] [INFO] [config.py:1013:print] zero_optimization_stage ...... 0
[2023-02-28 00:43:52,169] [INFO] [config.py:998:print_user_config] json = {
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 1,
"optimizer": {
"type": "Adam",
"lr": 1e-06
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": 1e-06,
"warmup_max_lr": 0.0002,
"warmup_num_steps": 1000,
"total_num_steps": 1000,
"warmup_type": "linear"
}
},
"gradient_clipping": 100.0,
"fp16": {
"enabled": true
}
}
Using /mnt/lustre/sjtu/home/ywg12/.cache/torch_extensions/py310_cu102 as PyTorch extensions root...
Emitting ninja build file /mnt/lustre/sjtu/home/ywg12/.cache/torch_extensions/py310_cu102/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.12616920471191406 seconds
[2023-02-28 00:43:52,296] [INFO] [torch_checkpoint_engine.py:21:load] [Torch] Loading checkpoint from ckpts/test/ar/model/default/mp_rank_00_model_states.pt...
[2023-02-28 00:43:52,350] [INFO] [torch_checkpoint_engine.py:23:load] [Torch] Loaded checkpoint from ckpts/test/ar/model/default/mp_rank_00_model_states.pt.
[2023-02-28 00:43:52,351] [INFO] [torch_checkpoint_engine.py:21:load] [Torch] Loading checkpoint from ckpts/test/ar/model/default/mp_rank_00_model_states.pt...
[2023-02-28 00:43:52,400] [INFO] [torch_checkpoint_engine.py:23:load] [Torch] Loaded checkpoint from ckpts/test/ar/model/default/mp_rank_00_model_states.pt.
fatal: Not a git repository (or any parent up to mount point /mnt/lustre)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: Not a git repository (or any parent up to mount point /mnt/lustre)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
2023-02-28 00:43:52 - vall_e.utils.trainer - INFO - GR=0;LR=0 -
{
"batch_size": 1,
"cache_dataloader": false,
"cache_dir": ".cache/test/ar",
"cfg_name": "test/ar",
"cfg_relpath": null,
"ckpt_dir": "ckpts/test/ar",
"ckpt_root": "ckpts",
"data_dirs": "[PosixPath('data/test')]",
"data_root": "data",
"device": "cuda",
"dis_warmup_max_lr": 0.0004,
"ds_cfg": {
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 1,
"optimizer": {
"type": "Adam",
"lr": 1e-06
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": 1e-06,
"warmup_max_lr": 0.0002,
"warmup_num_steps": 1000,
"total_num_steps": 1000,
"warmup_type": "linear"
}
},
"gradient_clipping": 100.0,
"fp16": {
"enabled": true
}
},
"eval_batch_size": 1,
"eval_every": 500,
"fp16_cfg": {
"enabled": true
},
"git_commit": "",
"git_status": "",
"gradient_accumulation_steps": 1,
"gradient_clipping": 100.0,
"log_dir": "logs/test/ar/1677516227",
"log_root": "logs",
"max_grad_norm": null,
"max_iter": 1000,
"max_num_val": 20,
"max_phones": 50,
"max_prompts": 3,
"max_val_ar_steps": 300,
"min_phones": 10,
"model": "ar-quarter",
"nj": 8,
"num_tokens": 1024,
"p_additional_prompt": 0.8,
"relpath": "test/ar",
"sample_rate": 24000,
"sampling_temperature": 1.0,
"save_artifacts_every": 100,
"save_ckpt_every": 500,
"save_on_oom": true,
"save_on_quit": true,
"spkr_name_getter": "lambda p: p.parts[-2]",
"start_time": 1677516227,
"token_dim": 256,
"use_fp16": true,
"warmup_max_lr": 0.0002,
"warmup_min_lr": 1e-06,
"warmup_num_steps": 1000
}
2023-02-28 00:43:52 - vall_e.utils.trainer - INFO - GR=0;LR=0 -
New epoch starts.