
Comments (12)

leademeule commented on August 24, 2024

I have yet another version of the code that confirms the issue occurs when DataLoader workers are started after Triton compilation. Setting enable_preload=True and enable_persistance=True launches the workers early and seems to prevent all crashes:

import torch

# Assumes the DatasetDummy class from the other snippet in this thread and the Triton `add` kernel from the original report.

if __name__ == "__main__":
    enable_multiprocessing = True
    enable_persistance = True
    enable_triton = True
    enable_preload = True

    dataset_train = DatasetDummy(
        True
    )
    dataset_test = DatasetDummy(
        False
    )

    loader_train = torch.utils.data.DataLoader(
        dataset_train,
        batch_size=16,
        num_workers=1 if enable_multiprocessing else 0,
        persistent_workers=enable_persistance if enable_multiprocessing else None,
    )
    loader_test = torch.utils.data.DataLoader(
        dataset_test,
        batch_size=16,
        num_workers=1 if enable_multiprocessing else 0,
        persistent_workers=enable_persistance if enable_multiprocessing else None,
    )

    if enable_preload:
        for batch in loader_train:
            break
        for batch in loader_test:
            break

    for epoch in range(16):
        print(f"Epoch: {epoch:>3}")

        print(f"Train...")
        for batch_index, batch in enumerate(loader_train):
            print(f"Batch: {batch_index:>3}")
            batch_source, batch_target = batch

            if enable_triton:
                batch_add = add(
                    batch_source.cuda(),
                    batch_target.cuda(),
                )
            else:
                batch_add = batch_source.cuda() + batch_target.cuda()

        print(f"Test...")
        for batch_index, batch in enumerate(loader_test):
            print(f"Batch: {batch_index:>3}")
            batch_source, batch_target = batch

            if enable_triton:
                batch_add = add(
                    batch_source.cuda(),
                    batch_target.cuda(),
                )
            else:
                batch_add = batch_source.cuda() + batch_target.cuda()

leademeule commented on August 24, 2024

@flishwang thank you for sharing. On my side, I have continued using the persistent data loader trick to avoid the crash. Thankfully, it has worked consistently over the last few weeks. A proper fix would still be much appreciated, however.

leademeule commented on August 24, 2024

It seems making the data loaders persistent is not sufficient to prevent segmentation faults in more complex setups. I will keep trying to isolate the issue.

leademeule commented on August 24, 2024

The slightly modified code below mimics a typical setup where two DataLoaders are used, one for a training dataset and one for a testing dataset. Even with enable_persistance=True, the code crashes when the testing DataLoader is reached.

import torch

# `add` below is the Triton kernel from the original report.

class DatasetDummy(torch.utils.data.Dataset):
    def __init__(
        self,
        dataset_partition,
    ):
        super().__init__()
        self.dataset_dimensionality = 8192
        self.dataset_size = 128
        self.dataset_generator = torch.Generator()
        self.dataset_partition = dataset_partition

    def __getitem__(self, index):
        if self.dataset_partition:
            self.dataset_generator.manual_seed(index)
        else:
            self.dataset_generator.manual_seed(-1 - index)

        sample_source = torch.randn(
            (self.dataset_dimensionality,), generator=self.dataset_generator
        )
        sample_target = torch.randn(
            (self.dataset_dimensionality,), generator=self.dataset_generator
        )

        return sample_source, sample_target

    def __len__(self):
        return self.dataset_size

if __name__ == "__main__":
    enable_multiprocessing = True
    enable_persistance = True
    enable_triton = True

    dataset_train = DatasetDummy(
        True
    )
    dataset_test = DatasetDummy(
        False
    )

    loader_train = torch.utils.data.DataLoader(
        dataset_train,
        batch_size=16,
        num_workers=1 if enable_multiprocessing else 0,
        persistent_workers=enable_persistance if enable_multiprocessing else None,
    )
    loader_test = torch.utils.data.DataLoader(
        dataset_test,
        batch_size=16,
        num_workers=1 if enable_multiprocessing else 0,
        persistent_workers=enable_persistance if enable_multiprocessing else None,
    )

    for epoch in range(16):
        print(f"Epoch: {epoch:>3}")

        print(f"Train...")
        for batch_index, batch in enumerate(loader_train):
            print(f"Batch: {batch_index:>3}")
            batch_source, batch_target = batch

            if enable_triton:
                batch_add = add(
                    batch_source.cuda(),
                    batch_target.cuda(),
                )
            else:
                batch_add = batch_source.cuda() + batch_target.cuda()

        print(f"Test...")
        for batch_index, batch in enumerate(loader_test):
            print(f"Batch: {batch_index:>3}")
            batch_source, batch_target = batch

            if enable_triton:
                batch_add = add(
                    batch_source.cuda(),
                    batch_target.cuda(),
                )
            else:
                batch_add = batch_source.cuda() + batch_target.cuda()

It seems the problem really kicks in when DataLoader workers are created.

I would greatly appreciate help on this, as the overhead of disabling multiprocessing makes Triton unusable for my application, yet Triton greatly improves the performance of important computational bottlenecks.

TidalPaladin commented on August 24, 2024

I think I'm running into the same problem. I'm getting ERROR: Unexpected segmentation fault encountered in worker on multiple workers while training a model that uses Triton kernels. Persistent workers don't fix the issue. Setting num_workers=0 prevents the segfault, but training then becomes CPU-bottlenecked. Crashes always happen at the end of an epoch (presumably when workers are relaunched), though not always on the first epoch.

I pulled a core dump from one of the segfaulted workers.

#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=11, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x000071f4d42ab393 in __pthread_kill_internal (signo=11, threadid=<optimized out>) at pthread_kill.c:78
#2  0x000071f4d425a6c8 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#3  0x000071f4d1562e13 in handler_SIGSEGV(int, siginfo_t*, void*) ()
   from /home/tidal/.local/share/pdm/venvs/mit-ub-7pzcQwz--mit_ub/lib/python3.11/site-packages/torch/lib/libtorch_python.so
#4  <signal handler called>
#5  __pthread_clockjoin_ex (threadid=125287851361984, thread_return=0x0, clockid=0, abstime=0x0, block=true) at pthread_join_common.c:43
#6  0x000071f392131228 in llvm::llvm_thread_join_impl(unsigned long) ()
   from /home/tidal/.local/share/pdm/venvs/mit-ub-7pzcQwz--mit_ub/lib/python3.11/site-packages/triton/_C/libtriton.so
#7  0x000071f3942b0408 in llvm::ThreadPool::~ThreadPool() ()
   from /home/tidal/.local/share/pdm/venvs/mit-ub-7pzcQwz--mit_ub/lib/python3.11/site-packages/triton/_C/libtriton.so
#8  0x000071f392e590b9 in mlir::MLIRContextImpl::~MLIRContextImpl() ()
   from /home/tidal/.local/share/pdm/venvs/mit-ub-7pzcQwz--mit_ub/lib/python3.11/site-packages/triton/_C/libtriton.so
#9  0x000071f392e52d27 in mlir::MLIRContext::~MLIRContext() ()
   from /home/tidal/.local/share/pdm/venvs/mit-ub-7pzcQwz--mit_ub/lib/python3.11/site-packages/triton/_C/libtriton.so
#10 0x000071f390d228da in std::default_delete<mlir::MLIRContext>::operator() (this=<optimized out>, __ptr=0x71ecb11e9db0)
    at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/unique_ptr.h:79
#11 std::default_delete<mlir::MLIRContext>::operator() (__ptr=0x71ecb11e9db0, this=<optimized out>)
    at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/unique_ptr.h:79
#12 std::unique_ptr<mlir::MLIRContext, std::default_delete<mlir::MLIRContext> >::~unique_ptr (this=<optimized out>, 
    __in_chrg=<optimized out>) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/unique_ptr.h:361
#13 pybind11::class_<mlir::MLIRContext>::dealloc (v_h=...) at /root/.triton/pybind11/pybind11-2.11.1/include/pybind11/pybind11.h:1880
#14 0x000071f390cd6840 in pybind11::detail::clear_instance (self=0x71f4134bec30)
    at /root/.triton/pybind11/pybind11-2.11.1/include/pybind11/detail/class.h:424
#15 0x000071f390cd7431 in pybind11::detail::pybind11_object_dealloc (self=0x71f4134bec30)
    at /root/.triton/pybind11/pybind11-2.11.1/include/pybind11/detail/class.h:457
#16 0x000071f4d48a9ea3 in _Py_Dealloc (op=0x71f2d9e006c0) at Objects/object.c:2390
#17 Py_DECREF (op=0x71f2d9e006c0) at ./Include/object.h:538
#18 _PyObject_ClearInstanceAttributes (self=0x71f418cfeb10) at Objects/dictobject.c:5566
#19 subtype_clear (self=0x71f418cfeb10) at Objects/typeobject.c:1279
#20 0x000071f4d481e8b8 in delete_garbage (tstate=0x71f4d5b48c58 <_PyRuntime+166328>, gcstate=0x71f4d5b2eb60 <_PyRuntime+59584>, 
    collectable=0x7fff5ac9ea10, old=0x71f4d5b2eba8 <_PyRuntime+59656>) at Modules/gcmodule.c:1013

And version info:

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Artix Linux (x86_64)
GCC version: (GCC) 14.1.1 20240507
Clang version: 17.0.6
CMake version: version 3.29.3
Libc version: glibc-2.39

Python version: 3.11.8 (main, Feb 25 2024, 04:18:18) [Clang 17.0.6 ] (64-bit runtime)
Python platform: Linux-6.8.9-artix1-2-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 550.78
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        43 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               48
On-line CPU(s) list:                  0-47
Vendor ID:                            AuthenticAMD
Model name:                           AMD Ryzen Threadripper 3960X 24-Core Processor
CPU family:                           23
Model:                                49
Thread(s) per core:                   2
Core(s) per socket:                   24
Socket(s):                            1
Stepping:                             0
Frequency boost:                      enabled
CPU(s) scaling MHz:                   73%
CPU max MHz:                          4568.1641
CPU min MHz:                          2200.0000
BogoMIPS:                             7603.36
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es
L1d cache:                            768 KiB (24 instances)
L1i cache:                            768 KiB (24 instances)
L2 cache:                             12 MiB (24 instances)
L3 cache:                             128 MiB (8 instances)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-47
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow:   Mitigation; Safe RET
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] flake8==7.0.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] pytorch-lightning==2.2.4
[pip3] torch==2.3.0
[pip3] torch-dicom==0.1.dev68+g40a15aa
[pip3] torchmetrics==1.4.0
[pip3] torchvision==0.18.0
[pip3] triton==2.3.0
[pip3] triton-helpers==0.1.dev16+g179af43
[conda] Could not collect

flishwang commented on August 24, 2024

I also hit this bug and created an issue on the PyTorch side.

flishwang commented on August 24, 2024

Some workarounds that may help:

  • Decorate at least one forward function of the model with torch.compile before the Triton kernel is called. The more compiled functions there are, the lower the probability that the workers crash.
  • Manually break out of the data loader loop so the multiprocessing iterators never reach their end (see the sketch below).
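
A minimal sketch of the second workaround, reusing loader_train and the Triton add kernel from the snippets above:

# Hedged sketch of the "break early" workaround applied to the reproduction
# script above: stop one batch before the end of the loader so its worker
# iterator is never exhausted, which is where the crashes seem to happen.
last_batch_index = len(loader_train) - 1
for batch_index, (batch_source, batch_target) in enumerate(loader_train):
    if batch_index == last_batch_index:
        break  # never let the multiprocessing iterator reach its end
    batch_add = add(batch_source.cuda(), batch_target.cuda())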

I'm not sure exactly when and where they work.
@TidalPaladin @leademeule

TidalPaladin commented on August 24, 2024

@flishwang I have adopted the manual early-break strategy. Since I'm using PyTorch Lightning I don't have easy access to the data loaders directly, but setting limit_train_batches=0.95, limit_val_batches=0.95 in pl.Trainer does the trick.
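
For reference, that configuration looks roughly like this (a sketch only; the remaining Trainer arguments stand in for whatever the original setup uses):

import pytorch_lightning as pl

# Hedged sketch: capping train/val batches at 95% stops each epoch short of the
# final batch, so the DataLoader worker iterators are never exhausted. This
# mimics the "break early" workaround without direct access to the loaders.
trainer = pl.Trainer(
    limit_train_batches=0.95,
    limit_val_batches=0.95,
    # remaining arguments unchanged from the original setup
)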

TidalPaladin commented on August 24, 2024

It seems that this issue is not fully mitigated by the early-break workaround. When running for a large number of epochs (200+), the error reappears. This is still much longer than I would be able to run without breaking early. For now I have disabled the Triton components of my model and have had no issues since.

TidalPaladin commented on August 24, 2024

I think this has been resolved with the 3.0 update. I'm no longer seeing segmentation faults

silingtong123 commented on August 24, 2024

I think this has been resolved with the 3.0 update. I'm no longer seeing segmentation faults

Which commit fixed it?

23Uday commented on August 24, 2024

This hasn't been resolved for me even after the Triton 3.0 update, although with Triton 2.3.1 it used to happen every time.

ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File ".../script_train.py", line 73, in <module>
    main_wrapper()
  File ".../script_train.py", line 34, in main_wrapper
    main(data_dir, exp, method, optim,
  File ".../script_train.py", line 69, in main
    train(config, dataloader, model, model_path, device)
  File ".../train.py", line 113, in train
    loss_train.append(loss_train_.item())
  File ".../pytorch-2.3.1_cu121_py310_triton/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 787) is killed by signal: Segmentation fault.
