da-transformer's Issues

Running process stopped at “compiling cuda operations”

Hello! The code starts successfully. However, when it reaches the step shown at the end of the log below, it hangs and does not continue, without reporting any error. Do you have any advice on this problem?

2022-10-18 17:05:27 | INFO | fairseq.utils | ***********************CUDA enviroments for all 4 workers***********************
2022-10-18 17:05:27 | INFO | fairseq_cli.train | training on 4 devices (GPUs/TPUs)
2022-10-18 17:05:27 | INFO | fairseq_cli.train | max tokens per device = 2048 and max sentences per device = None
2022-10-18 17:05:27 | INFO | fairseq.trainer | Preparing to load checkpoint ./model/checkpoint_last.pt
2022-10-18 17:05:27 | INFO | fairseq.trainer | No existing checkpoint found ./model/checkpoint_last.pt
2022-10-18 17:05:27 | INFO | fairseq.trainer | loading train data for epoch 1
2022-10-18 17:05:28 | INFO | fairseq.data.data_utils | loaded 4,500,966 examples from: ./bin_data/WMT16/train.en-de.en
2022-10-18 17:05:28 | INFO | fairseq.data.data_utils | loaded 4,500,966 examples from: ./bin_data/WMT16/train.en-de.de
2022-10-18 17:05:28 | INFO | fairseq.tasks.translation | ./bin_data/WMT16 train en-de 4500966 examples
2022-10-18 17:05:34 | WARNING | fairseq.tasks.fairseq_task | 1,391 samples have invalid sizes and will be skipped, max_positions=(128, 1024), first few sample ids=[3749843, 2629309, 3912533, 2428533, 3659653, 4231852, 3663212, 2382171, 3373663, 4175821]
2022-10-18 17:05:34 | WARNING | fairseq.tasks.fairseq_task | 1,391 samples have invalid sizes and will be skipped, max_positions=(128, 1024), first few sample ids=[3749843, 2629309, 3912533, 2428533, 3659653, 4231852, 3663212, 2382171, 3373663, 4175821]
2022-10-18 17:05:34 | WARNING | fairseq.tasks.fairseq_task | 1,391 samples have invalid sizes and will be skipped, max_positions=(128, 1024), first few sample ids=[3749843, 2629309, 3912533, 2428533, 3659653, 4231852, 3663212, 2382171, 3373663, 4175821]
2022-10-18 17:05:34 | WARNING | fairseq.tasks.fairseq_task | 1,391 samples have invalid sizes and will be skipped, max_positions=(128, 1024), first few sample ids=[3749843, 2629309, 3912533, 2428533, 3659653, 4231852, 3663212, 2382171, 3373663, 4175821]
2022-10-18 17:05:35 | INFO | fairseq.data.iterators | grouped total_num_itrs = 1278
2022-10-18 17:05:35 | INFO | fairseq.trainer | begin training epoch 1
2022-10-18 17:05:35 | INFO | fairseq_cli.train | Start iterating over samples
Start compiling cuda operations for DA-Transformer...(It usually takes a few minutes for the first time running.)
Start compiling cuda operations for DA-Transformer...(It usually takes a few minutes for the first time running.)
Start compiling cuda operations for DA-Transformer...(It usually takes a few minutes for the first time running.)
Start compiling cuda operations for DA-Transformer...(It usually takes a few minutes for the first time running.)
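
One possible cause, offered as an assumption rather than a diagnosis of this report: PyTorch's JIT extension builder uses a lock file in its build directory, and a lock left behind by an interrupted earlier build can make later runs wait indefinitely. The sketch below only lists files named lock under the build root so they can be inspected by hand. The helper names are hypothetical, and the default location (~/.cache/torch_extensions, overridable through TORCH_EXTENSIONS_DIR) is the usual Linux default and may differ on your system.

```python
import os
from pathlib import Path


def default_extensions_root():
    """Return the build root PyTorch usually uses for JIT-compiled extensions."""
    # TORCH_EXTENSIONS_DIR overrides the default cache location when it is set.
    return os.environ.get(
        "TORCH_EXTENSIONS_DIR",
        os.path.join(os.path.expanduser("~"), ".cache", "torch_extensions"),
    )


def find_build_locks(extensions_root):
    """Return the paths of all files named 'lock' below extensions_root."""
    root = Path(extensions_root)
    if not root.is_dir():
        return []
    return sorted(str(p) for p in root.rglob("lock") if p.is_file())


# List candidate lock files; nothing is deleted here.
print(find_build_locks(default_extensions_root()))
```

If a lock file shows up while no build is actually running, removing the corresponding extension directory and restarting training forces a clean rebuild.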

dag_best_alignment: graph size is too small

Hello, this is great work on NAT and I like it. I tried changing src-upsample-scale to make lambda smaller, such as 2 or 4, but it raises an error: "dag_best_alignment.cu:68: calculate_maxalpha_kernel: block: [0,77,0], thread: [0,244,0] Assertion output_len >= target_len && "dag_best_alignment: graph size is too small (smaller than target length)" failed."
Do you know how to fix this error? Thank you.
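
The assertion says the graph (output_len) must be at least as long as the target (target_len). A minimal sketch of that condition follows, assuming the graph size is the source length multiplied by the upsample scale. That formula and the function names are assumptions for illustration, not the repository's code.

```python
def graph_size(source_len, upsample_scale):
    """Assumed graph size: source length times the upsample scale."""
    return int(source_len * upsample_scale)


def too_long_targets(pairs, upsample_scale):
    """Return (source_len, target_len) pairs whose target does not fit in the graph."""
    return [(s, t) for s, t in pairs if graph_size(s, upsample_scale) < t]


pairs = [(10, 12), (5, 25), (8, 16)]
print(too_long_targets(pairs, 2))  # [(5, 25)]
print(too_long_targets(pairs, 8))  # []
```

Under this assumption, a smaller scale makes more training pairs violate the condition, so pairs with a high target-to-source length ratio would need to be filtered out before training.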

Can not reproduce the result when factor=4

Hello, I tried to reproduce the factor=4 setting with Lookahead decoding, but I get 25.64 BLEU, which is lower than the 26.14 BLEU reported in the paper on the WMT'14 EN-DE raw data. I used the same environment, training script, decoding script, and dataset, but still cannot match the result. Can you help me? Alternatively, could you share the checkpoints for the WMT14 EN-DE raw and distilled data?

My Training Script

fairseq-train ${data_dir}  \
    \
    `# loading DA-Transformer plugins` \
    --user-dir fs_plugins \
    \
    `# DA-Transformer Task Configs` \
    --task translation_dat_task \
    --upsample-base source --upsample-scale 4 \
    --filter-max-length 128:1024 --filter-ratio 2 \
    --skip-invalid-size-inputs-valid-test \
    \
    `# DA-Transformer Architecture Configs` \
    --arch glat_decomposed_link_base \
    --links-feature feature:position \
    --max-source-positions 128 --max-target-positions 1024 \
    --encoder-learned-pos --decoder-learned-pos \
    --share-all-embeddings --activation-fn gelu --apply-bert-init \
    \
    `# DA-Transformer Decoding Configs (See more in the decoding section)` \
    --decode-strategy lookahead --decode-upsample-scale 4.0 \
    \
    `# DA-Transformer Criterion Configs` \
    --criterion nat_dag_loss \
    --length-loss-factor 0 --max-transition-length 99999 \
    --glat-p 0.5:0.1@200k --glance-strategy number-random \
    --no-force-emit \
    \
    `# Optimizer & Regularizer Configs` \
    --optimizer adam --adam-betas '(0.9,0.999)' --fp16 \
    --label-smoothing 0.0 --weight-decay 0.01 --dropout 0.1 \
    --lr-scheduler inverse_sqrt  --warmup-updates 10000   \
    --clip-norm 0.1 --lr 0.0005 --warmup-init-lr '1e-07' --stop-min-lr '1e-09' \
    \
    `# Training Configs` \
    --max-tokens 32392  --max-tokens-valid 4096 --update-freq 1 \
    --max-update 300000  --grouped-shuffling \
    --max-encoder-batch-tokens 8000 --max-decoder-batch-tokens 34000 \
    --seed 0 --ddp-backend c10d --required-batch-size-multiple 1 \
    \
    `# Validation Configs` \
    --valid-subset valid \
    --validate-interval 1       --validate-interval-updates 10000 \
    --eval-bleu --eval-bleu-detok space --eval-bleu-remove-bpe --eval-bleu-print-samples --eval-tokenized-bleu \
    --fixed-validation-seed 7 \
    \
    `# Checkpoint Configs` \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
    --save-interval 1  --save-interval-updates 10000 \
    --keep-best-checkpoints 5 --save-dir ${checkpoint_dir} \
    \
    `# Logging Configs` \
    --log-format 'simple' --log-interval 100

My Decoding Script

average_checkpoint_path=${checkpoint_dir}/average.pt

python3 ./fs_plugins/scripts/average_checkpoints.py \
  --inputs ${checkpoint_dir} \
  --max-metric \
  --best-checkpoints-metric bleu \
  --num-best-checkpoints-metric 5 \
  --output ${average_checkpoint_path}

fairseq-generate ${data_dir} \
    --gen-subset test --user-dir fs_plugins --task translation_dat_task \
    --remove-bpe --max-tokens 4096 --seed 0 \
    --decode-strategy lookahead --decode-upsample-scale 4 --decode-beta 1  \
    --path ${average_checkpoint_path}
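
For readers unfamiliar with the averaging step in the decoding script above, here is a sketch of element-wise checkpoint averaging on plain Python data. It only illustrates the idea. The function name and the dict-of-lists checkpoint format are hypothetical, and the real average_checkpoints.py works on saved model files and may differ in detail.

```python
def average_checkpoints(checkpoints):
    """Average parameters element-wise across checkpoints.

    Each checkpoint is a dict mapping a parameter name to a list of floats.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter from every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / len(checkpoints) for col in columns]
    return averaged


ckpts = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
print(average_checkpoints(ckpts))  # {'w': [2.0, 3.0]}
```

Because the averaged model depends on which five checkpoints rank best by validation BLEU, differences in the validation results between runs can change the final test score.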

Compiled Failed

  • python 3.7.12
  • pytorch 1.11.0+cu102
  • gcc 5.4

I have modified the cloneable.h file according to the FAQs section, but I still encounter the following error when the program is running. Could you tell me how I can fix it?

 
Traceback (most recent call last):
  File /home/env/nat/lib/python3.7/site-packages/torch/utils/cpp_extension.py, line 1746, in _run_ninja_build
    env=env)
  File /home/env/nat/lib/python3.7/subprocess.py, line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command [ninja, -v] returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

RuntimeError: Error building extension 'dag_loss_fn': [1/2] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=dag_loss_fn -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/env/nat/lib/python3.7/site-packages/torch/include -isystem /home/env/nat/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /home/env/nat/lib/python3.7/site-packages/torch/include/TH -isystem /home/env/nat/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/env/nat/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -DOF_SOFTMAX_USE_FAST_MATH -std=c++14 -c /home/DA-Transformer/fs_plugins/custom_ops/logsoftmax_gather.cu -o logsoftmax_gather.cuda.o 
FAILED: logsoftmax_gather.cuda.o 

/usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=dag_loss_fn -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/env/nat/lib/python3.7/site-packages/torch/include -isystem /home/env/nat/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /home/env/nat/lib/python3.7/site-packages/torch/include/TH -isystem /home/env/nat/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/env/nat/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -DOF_SOFTMAX_USE_FAST_MATH -std=c++14 -c /home/DA-Transformer/fs_plugins/custom_ops/logsoftmax_gather.cu -o logsoftmax_gather.cuda.o 
/home/DA-Transformer/fs_plugins/custom_ops/logsoftmax_gather.cu:31:23: fatal error: cub/cub.cuh: No such file or directory
compilation terminated.
ninja: build stopped: subcommand failed
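
The fatal error means nvcc did not find cub/cub.cuh in any of the include directories on its command line. To my knowledge CUB is bundled with the CUDA Toolkit from version 11 onward, while older toolkits need it installed separately, so the CUDA version under /usr/local/cuda is worth checking. The sketch below reports which directories actually contain the header. The function name is hypothetical and the directory list is copied from the -isystem flags in the log.

```python
from pathlib import Path


def find_header(include_dirs, relative_path="cub/cub.cuh"):
    """Return the full path of relative_path in every include dir that has it."""
    found = []
    for directory in include_dirs:
        candidate = Path(directory) / relative_path
        if candidate.is_file():
            found.append(str(candidate))
    return found


# Include directories taken from the failing nvcc command in the log above.
include_dirs = [
    "/usr/local/cuda/include",
    "/home/env/nat/lib/python3.7/site-packages/torch/include",
]
print(find_header(include_dirs))
```

An empty result confirms the header is missing from those directories. Adding a directory that contains a cub/ folder to the include path is one way to proceed.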

model miniaturization

Hi, I tried to train a miniaturized model with a 6-layer encoder, a 3-layer decoder, and 256 hidden dimensions, but found that the model's accuracy drops sharply. Are there any suggestions for model miniaturization? Thanks.

training config

To replicate and build upon your results, I need a complete understanding of the training configuration used in the experiments. Is examples/DA-Transformer/wmt14_ende.sh the config used to get the results in your paper? I found it impossible to finish 300,000 updates within 16 hours on 8 A100 GPUs using that config.
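
For reference, the throughput implied by those two figures (300,000 updates, 16 hours) can be worked out directly. This is plain arithmetic on the numbers quoted above, and the function name is only illustrative.

```python
def required_seconds_per_update(total_updates, hours):
    """Average wall-clock seconds available per update to finish in time."""
    return hours * 3600 / total_updates


seconds = required_seconds_per_update(300_000, 16)
print(round(seconds, 3))      # 0.192
print(round(1 / seconds, 2))  # 5.21
```

In other words, the run would have to sustain roughly 5.2 updates per second, including validation and checkpointing time.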

Divide by zero error

Hello, great work! Since there is no nvcc on the server shared by our laboratory, I chose to use torch to calculate the DAG loss. When running, I find that logging_outputs and ntokens are 0, and the error is as follows:
image
There is a divide-by-zero error, which I suspect is caused by a version mismatch. My experimental environment is as follows:

  • pytorch 1.10.1+cu102
  • python 3.7.11
  • gcc 7.5.0
  • fairseq 1.0.0a0+2d06841

Could you suggest a solution to the above problem? Thank you very much! @hzhwcmhf
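
As an illustration of where such a division can fail, the sketch below aggregates per-batch logging outputs and guards the case where the token count sums to zero. It is a hypothetical stand-in, not the fairseq or DA-Transformer reduction code: the function name and the "loss" key are assumptions, and only ntokens comes from the report.

```python
def safe_mean_loss(logging_outputs):
    """Return the summed loss divided by ntokens, or None if there are no tokens."""
    loss_sum = sum(log.get("loss", 0.0) for log in logging_outputs)
    ntokens = sum(log.get("ntokens", 0) for log in logging_outputs)
    if ntokens == 0:
        # An empty list or all-zero counts would otherwise raise ZeroDivisionError.
        return None
    return loss_sum / ntokens


print(safe_mean_loss([]))                              # None
print(safe_mean_loss([{"loss": 6.0, "ntokens": 3}]))   # 2.0
```

A guard like this only hides the symptom. If logging_outputs is empty, the more useful question is why no batch produced any tokens in the first place.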

errors when executing script for generating the binarized data

Steps to reproduce the error:
1. Running git clone --recurse-submodules https://github.com/thu-coai/DA-Transformer.git && pip install -e . did not work. Running git clone --recurse-submodules https://github.com/thu-coai/DA-Transformer.git on its own, then cd DA-Transformer, and then pip install -e . works fine.
2. I tried to use the script in the README to generate the binarized data:

input_dir=path/to/raw_data        # directory of pre-processed text data
data_dir=path/to/binarized_data   # directory of the generated binarized data
src=src                           # source suffix
tgt=tgt                           # target suffix
fairseq-datpreprocess --source-lang ${src} --target-lang ${tgt} \
    --trainpref ${input_dir}/train --validpref ${input_dir}/valid --testpref ${input_dir}/test \
    --src-dict ${input_dir}/dict.${src}.txt --tgt-dict ${input_dir}/dict.${tgt}.txt \
    --destdir ${data_dir} --workers 32 \
    --user-dir fs_plugins --task translation_dat_task [--seg-tokens 32]

# seg-tokens should be set to 32 when you use pre-trained models.

image
I don't know what is going wrong. Please help me.

runtimeerror

python 3.7
pytorch 1.10.1+cu111
gcc 5.4.0

I have modified the cloneable.h file according to the FAQs section, but I still encounter the following error when the program is running. I also tried running the code with gcc 7.5.0, and the same error appears. Could you tell me how I can fix it?
[Screenshot upload did not complete: 截屏2022-07-31 下午3.09.29.png]

Runtime error when using live demo

Hello,

I encountered a runtime error when using the live demo on HuggingFace Space. The error message is as follows:

error message

Runtime error

failed to create containerd task: failed to create shim task: context canceled: unknown

Container logs:

===== Application Startup at 2023-09-03 01:29:23 =====

2023-09-03 01:40:53 | INFO | __main__ | args: Namespace(host='0.0.0.0', port=None, concurrency_count=1, share=False)
/home/user/app/app.py:421: GradioDeprecationWarning: The `style` method is deprecated. Please set these arguments in the constructor instead.
  model_selector = gr.Dropdown(
/home/user/app/app.py:327: GradioDeprecationWarning: The `style` method is deprecated. Please set these arguments in the constructor instead.
  model_selector = gr.Dropdown(

Link

https://huggingface.co/spaces/thu-coai/DA-Transformer

screenshot

image
