On starting a DeepSpeed run, the fp16 dynamic loss scaler immediately overflows: DeepSpeed repeatedly skips optimizer steps, halves the loss scale, and floods the log with warnings. Why is this happening?
[2021-03-18 04:21:58,423] [INFO] [stage1.py:633:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
[2021-03-18 04:21:58,423] [INFO] [stage1.py:633:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
[2021-03-18 04:21:58,424] [INFO] [stage1.py:633:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
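For context, the behavior in these warnings is controlled by the `fp16` block of the DeepSpeed config. A minimal sketch of that block (key names taken from DeepSpeed's documented fp16 options; the specific values here are illustrative, not a recommendation):

```json
{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  }
}
```

With `"loss_scale": 0`, scaling is dynamic: DeepSpeed starts at 2^`initial_scale_power`, halves the scale whenever gradients overflow (skipping that step, as in the log above), and raises it again after `loss_scale_window` overflow-free steps. A burst of these warnings at startup is therefore the scaler searching for a workable scale; persistent overflows suggest a numerical problem in the model or data rather than the scaler itself.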