Comments (6)
This is very strange, as TorchScript explicitly forbids the `Sequence` type
annotation: https://pytorch.org/docs/stable/jit_language_reference.html#supported-type
I also ran it inside the NGC Docker container with torch 2.1.0; it didn't work either.
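A minimal, torch-free sketch of what the fix would presumably look like (hypothetical; based only on the function body shown in the traceback in this thread): replace the `Sequence[int]` return annotation with `Tuple[int, int]`, which is on TorchScript's supported-type list.

```python
from typing import Tuple

# Sketch of the helper from megatron/core/tensor_parallel/utils.py with a
# TorchScript-compatible return annotation: Tuple[int, int] instead of
# Sequence[int] (TorchScript does not recognize typing.Sequence).
def vocab_range_from_per_partition_vocab_size(
    per_partition_vocab_size: int, rank: int, world_size: int
) -> Tuple[int, int]:
    index_f = rank * per_partition_vocab_size   # first vocab index owned by this rank
    index_l = index_f + per_partition_vocab_size  # one past the last owned index
    return index_f, index_l

print(vocab_range_from_per_partition_vocab_size(1000, 2, 8))  # (2000, 3000)
```

The body is unchanged; only the annotation differs, so scripted and eager behavior stay identical.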
from megatron-lm.
Can confirm I get the same error using
- PyTorch 2.3.1
- Megatron-LM e33c8f7
- CUDA 12.1
from megatron-lm.
> Can confirm I get the same error using
> - PyTorch 2.3.1
> - Megatron-LM e33c8f7
> - CUDA 12.1

While it wasn't mentioned there, using the newest NVIDIA PyTorch container (torch 2.4) seems to work.
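For anyone wanting to try that route, a sketch of launching the NGC PyTorch container; the `24.07-py3` tag is my assumption for a release carrying a torch 2.4 build, so check the NGC catalog for the exact tag:

```shell
# Tag is an assumption; verify the version inside the container with
#   python -c "import torch; print(torch.__version__)"
docker run --gpus all -it --rm \
    -v "$PWD:/workspace/Megatron-LM" \
    nvcr.io/nvidia/pytorch:24.07-py3
```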
from megatron-lm.
Thanks for pointing this out. We will have a fix shortly.
from megatron-lm.
When processing the GPT training dataset with the tools/preprocess_data.py
script, I ran into this issue:
'vocab_range_from_per_partition_vocab_size' is being compiled since it was called from 'calculate_predicted_logits'
from megatron-lm.
```shell
python tools/preprocess_data.py \
    --input data/oscar-1GB.jsonl \
    --output-prefix my-gpt3 \
    --vocab-file gpt2-vocab.json \
    --tokenizer-type GPT2BPETokenizer \
    --merge-file gpt2-merges.txt \
    --append-eod
```
```
Traceback (most recent call last):
  File "/mnt/users/lihai/gpt3/code/Megatron-LM/tools/preprocess_data.py", line 23, in <module>
    from megatron.training.tokenizer import build_tokenizer
  File "/mnt/users/lihai/gpt3/code/Megatron-LM/megatron/training/__init__.py", line 16, in <module>
    from .initialize import initialize_megatron
  File "/mnt/users/lihai/gpt3/code/Megatron-LM/megatron/training/initialize.py", line 18, in <module>
    from megatron.training.arguments import parse_args, validate_args
  File "/mnt/users/lihai/gpt3/code/Megatron-LM/megatron/training/arguments.py", line 14, in <module>
    from megatron.core.models.retro.utils import (
  File "/mnt/users/lihai/gpt3/code/Megatron-LM/megatron/core/models/retro/__init__.py", line 12, in <module>
    from .decoder_spec import get_retro_decoder_block_spec
  File "/mnt/users/lihai/gpt3/code/Megatron-LM/megatron/core/models/retro/decoder_spec.py", line 9, in <module>
    from megatron.core.models.gpt.gpt_layer_specs import (
  File "/mnt/users/lihai/gpt3/code/Megatron-LM/megatron/core/models/gpt/__init__.py", line 1, in <module>
    from .gpt_model import GPTModel
  File "/mnt/users/lihai/gpt3/code/Megatron-LM/megatron/core/models/gpt/gpt_model.py", line 13, in <module>
    from megatron.core.models.common.language_module.language_module import LanguageModule
  File "/mnt/users/lihai/gpt3/code/Megatron-LM/megatron/core/models/common/language_module/language_module.py", line 9, in <module>
    from megatron.core.fusions.fused_cross_entropy import fused_vocab_parallel_cross_entropy
  File "/mnt/users/lihai/gpt3/code/Megatron-LM/megatron/core/fusions/fused_cross_entropy.py", line 27, in <module>
    def calculate_predicted_logits(
  File "/mnt/users/lihai/miniconda3/envs/gpt3/lib/python3.9/site-packages/torch/jit/_script.py", line 1381, in script
    fn = torch._C._jit_script_compile(
  File "/mnt/users/lihai/miniconda3/envs/gpt3/lib/python3.9/site-packages/torch/jit/_recursive.py", line 1010, in try_compile_fn
    return torch.jit.script(fn, _rcb=rcb)
  File "/mnt/users/lihai/miniconda3/envs/gpt3/lib/python3.9/site-packages/torch/jit/_script.py", line 1381, in script
    fn = torch._C._jit_script_compile(
  File "/mnt/users/lihai/miniconda3/envs/gpt3/lib/python3.9/site-packages/torch/jit/_recursive.py", line 1010, in try_compile_fn
    return torch.jit.script(fn, _rcb=rcb)
  File "/mnt/users/lihai/miniconda3/envs/gpt3/lib/python3.9/site-packages/torch/jit/_script.py", line 1381, in script
    fn = torch._C._jit_script_compile(
RuntimeError:
Unknown type constructor Sequence:
  File "/mnt/users/lihai/gpt3/code/Megatron-LM/megatron/core/tensor_parallel/utils.py", line 106
    def vocab_range_from_per_partition_vocab_size(
        per_partition_vocab_size: int, rank, world_size: int
    ) -> Sequence[int]:
         ~~~~~~~~~~~~~ <--- HERE
        index_f = rank * per_partition_vocab_size
        index_l = index_f + per_partition_vocab_size
'vocab_range_from_per_partition_vocab_size' is being compiled since it was called from 'calculate_predicted_logits'
  File "/mnt/users/lihai/gpt3/code/Megatron-LM/megatron/core/tensor_parallel/cross_entropy.py", line 41
        # Get the partition's vocab indices
        get_vocab_range = VocabUtility.vocab_range_from_per_partition_vocab_size
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        partition_vocab_size = vocab_parallel_logits.size()[-1]
        rank = get_tensor_model_parallel_rank()
'calculate_predicted_logits' is being compiled since it was called from 'calculate_predicted_logits'
  File "/mnt/users/lihai/gpt3/code/Megatron-LM/megatron/core/fusions/fused_cross_entropy.py", line 31
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
        (
        ~
            target_mask,
            ~~~~~~~~~~~~
            masked_target_1d,
            ~~~~~~~~~~~~~~~~~
            predicted_logits,
            ~~~~~~~~~~~~~~~~~
            sum_exp_logits,
            ~~~~~~~~~~~~~~~
            exp_logits,
            ~~~~~~~~~~~
        ) = VocabParallelCrossEntropy.calculate_predicted_logits(
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            vocab_parallel_logits, target, logits_max
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        )
```
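Until a fix lands, one workaround (assuming you only need the preprocessing path, which doesn't depend on the fused kernels actually being scripted) is to disable TorchScript entirely via PyTorch's documented `PYTORCH_JIT=0` switch, which makes `torch.jit.script` a pass-through, so the offending `Sequence` annotation is never compiled:

```shell
# PYTORCH_JIT=0 disables TorchScript compilation for the whole process.
PYTORCH_JIT=0 python tools/preprocess_data.py \
    --input data/oscar-1GB.jsonl \
    --output-prefix my-gpt3 \
    --vocab-file gpt2-vocab.json \
    --tokenizer-type GPT2BPETokenizer \
    --merge-file gpt2-merges.txt \
    --append-eod
```

Note this disables scripting globally, so it is a stopgap for preprocessing rather than something to use during training.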
from megatron-lm.
Related Issues (20)
- [QUESTION] Question about Mixtral compatibility with Megatron-LM core 0.7.0
- [BUG] megatron.training not found
- [QUESTION] How to time the code
- [BUG] pipeline_paralle is not available when pp_size > 2
- [BUG] RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
- [QUESTION] When pretraining BERT, meet bug: cuBLAS Error: the requested functionality is not supported
- [QUESTION] Gloo connectFullMesh failed when the number of nodes setting "export GLOO_SOCKET_IFNAME=bond4" exceeds 60
- [QUESTION] OSError: [Errno 28] No space left on device
- [QUESTION] --overlap-grad-allreduce failing as gradients coming through as None in param hook
- [BUGS] Pipeline Parallelism fails/hangs with Megatron Core example
- [QUESTION] What's the internal difference for training when setting only "fp8-format" or setting "fp8-format"+"bf16"
- [QUESTION] Why is TELayerNormColumnParallelLinear used instead of TEColumnParallelLinear in gpt_layer_specs
- [QUESTION] Why does the tokenizer of mamba-2-hybrid have two ids for the token 'Yes'? id 24639 and id 7298
- [QUESTION] Has standalone_embedding_stage been supported yet in core?
- [QUESTION] Sample idx, bin files in public domain for trying out pretrain_gpt.py?
- [QUESTION] Getting tools/preprocess_data.py to work is painful
- [REGRESSION] MoEs are obtaining higher loss than they should during training
- [BUG] Question about helpers.cpp in version core_v0.7.0
- Batch_input and elapsed time per iteration slow down during model training