
rocm_lab's Introduction


ROCm LAB

Experiments to explore the potential of the RX 7000 series.

IMPORTANT!!!

You can now install the official torch nightly build for ROCm 5.5+ with:

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.5

# or these more performant builds if you are using ROCm 5.6
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.6
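
Either way, a quick check that the ROCm build actually sees the GPU (a minimal sketch; assumes the nightly wheel installed above):

import torch

# ROCm builds of torch report a HIP version and expose the GPU
# through the usual CUDA API.
print('HIP :', torch.version.hip)
print('GPU :', torch.cuda.is_available())
if torch.cuda.is_available():
    print('Name:', torch.cuda.get_device_name(0))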

For the official tensorflow nightly build, see here.

Motivations

As we know, ROCm 5.5.0 was released on May 2nd, 2023. After waiting for several days, we discovered that none of the official Docker images included support for the RX 7000 series (a.k.a. gfx1100), which are currently AMD's best-performing and most suitable consumer-grade GPUs for the AI field. As we aim to tap into the potential of RX 7000 series GPUs as soon as possible:

ROCm LAB will focus on conducting proof of concept and delivering prebuilt wheels specifically for the RX 7000 series until official support is provided. The goal of these efforts is to demonstrate the viability and effectiveness of using these GPUs for AI applications and to provide a basis for further development.

Additionally, the website Are we gfx1100 yet? will serve as a platform for showcasing the latest proof of concept developments in the AI field using the RX 7000 series.

Prebuilt wheels

Prebuilt wheels are built by GitHub Actions and can now be found in GitHub Releases.

It's worth noting that these wheels are built on GitHub's ubuntu-latest runner, which is currently Ubuntu 22.04, so there may be dynamic linking issues on other systems. If that happens, consider building a Docker image instead.

How to use

Simply download the wheel you want and install it with:

# recommended: activate venv
source venv/bin/activate

# download the wheel
curl -L -O https://github.com/evshiron/rocm_lab/releases/download/v1.14.514/torch-2.0.1+gite19229c-cp310-cp310-linux_x86_64.whl

# install the wheel
pip install torch-2.0.1+gite19229c-cp310-cp310-linux_x86_64.whl
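
Note the cp310 tag in the wheel name: these wheels only install on CPython 3.10; on other versions pip fails with "not a supported wheel on this platform" (as reported in one of the issues below). A quick hedged pre-check:

import sys

# The cp310-cp310 wheels above require CPython 3.10.
assert sys.version_info[:2] == (3, 10), sys.version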

Prebuilt Docker images

These Docker images are mainly proofs of concept and will not be updated frequently.

It's recommended to use the wheels above directly, or build your own Docker images with these wheels if you like.

How to use

# add environment variables or volumes as needed
docker run -ti --net=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G -e HSA_OVERRIDE_GFX_VERSION=11.0.0 --name rocm5.5-automatic ghcr.io/evshiron/rocm_lab:rocm5.5-automatic
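
Here, HSA_OVERRIDE_GFX_VERSION=11.0.0 forces the ROCm runtime to report the GPU as gfx1100; on an actual RX 7000 series card this should be harmless, and it makes the intended target explicit.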

Are we gfx1100 yet?

Credits

A large portion of the content in this repository comes from the internet. My main work is to collect, experiment with, and organize this information.

I would like to express my gratitude to the other developers for their contributions, as well as to those who have discussed these topics with me and provided assistance.

If you find my work helpful, please consider giving this repository a star. Your recognition will be my motivation.

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


rocm_lab's Issues

Support for torchaudio wheels

I was able to run TensorFlow, Torch, and TorchVision with the latest release. However, I'm currently trying to run a project that uses a fork of neonbjb/tortoise-tts, which requires torchaudio. I tried installing the provided .whl files for torch and torchvision and then installing torchaudio afterwards, but I'm getting an error while importing torchaudio:
OSError: libtorch_cuda.so: cannot open shared object file: No such file or directory

I assume this is because torchaudio was not built against the custom torch/torchvision.
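
One way to test this hypothesis (a hypothetical diagnostic sketch; the extension path may differ between torchaudio builds) is to list what torchaudio's native extension links against, without importing it, since the import itself fails:

import importlib.util
import os
import subprocess

# Locate torchaudio on disk without executing its __init__.py.
spec = importlib.util.find_spec('torchaudio')
ext = os.path.join(os.path.dirname(spec.origin), 'lib', 'libtorchaudio.so')

# A CUDA-built torchaudio references libtorch_cuda.so, which the ROCm
# torch wheels do not ship (they provide libtorch_hip.so instead).
print(subprocess.run(['ldd', ext], capture_output=True, text=True).stdout)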

Do you have any plans to make a guide for using torchaudio?

Also, thank you so much for the work on the TF/Torch changes!

Error running ghcr.io/evshiron/rocm_lab:rocm5.5-text-gen-webui 7dea7110f293

starlette.websockets.WebSocketDisconnect: 1001
INFO:Loading TheBloke_Llama-2-13B-chat-GGML...
INFO:llama.cpp weights detected: models/TheBloke_Llama-2-13B-chat-GGML/llama-2-13b-chat.ggmlv3.q6_K.bin

INFO:Cache capacity is 0 bytes
llama.cpp: loading model from models/TheBloke_Llama-2-13B-chat-GGML/llama-2-13b-chat.ggmlv3.q6_K.bin
error loading model: unrecognized tensor type 14

llama_init_from_file: failed to load model
Exception ignored in: <function LlamaCppModel.__del__ at 0x7fdeac07f910>
Traceback (most recent call last):
File "/root/text-generation-webui/modules/llamacpp_model.py", line 23, in __del__
self.model.__del__()
AttributeError: 'LlamaCppModel' object has no attribute 'model'

What about AIT?

I have only run AITemplate in navi3_rel_ver_1.0, and it is quite old.

CUDA Setup failed despite GPU being available.

I'm trying to get text-generation-webui to work on Arch. I've installed ROCm 5.6 and followed the installation steps to install text-gen-webui on RDNA3. But when launching, I get the following error:

CUDA SETUP: Loading binary /home/zhenyapav/Projects/text-generation-webui/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_hip_nohipblaslt.so...
libhipblas.so.0: cannot open shared object file: No such file or directory
CUDA SETUP: Loading binary /home/zhenyapav/Projects/text-generation-webui/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_hip_nohipblaslt.so...
libhipblas.so.0: cannot open shared object file: No such file or directory
CUDA SETUP: Problem: The main issue seems to be that the main CUDA library was not detected.
CUDA SETUP: Solution 1): Your paths are probably not up-to-date. You can update them via: sudo ldconfig.
CUDA SETUP: Solution 2): If you do not have sudo rights, you can do the following:
CUDA SETUP: Solution 2a): Find the cuda library via: find / -name libcuda.so 2>/dev/null
CUDA SETUP: Solution 2b): Once the library is found add it to the LD_LIBRARY_PATH: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:FOUND_PATH_FROM_2a
CUDA SETUP: Solution 2c): For a permanent solution add the export from 2b into your .bashrc file, located at ~/.bashrc
Traceback (most recent call last):
  File "/home/zhenyapav/Projects/text-generation-webui/server.py", line 28, in <module>
    from modules import (
  File "/home/zhenyapav/Projects/text-generation-webui/modules/chat.py", line 16, in <module>
    from modules.text_generation import (
  File "/home/zhenyapav/Projects/text-generation-webui/modules/text_generation.py", line 22, in <module>
    from modules.models import clear_torch_cache, local_rank
  File "/home/zhenyapav/Projects/text-generation-webui/modules/models.py", line 10, in <module>
    from accelerate import infer_auto_device_map, init_empty_weights
  File "/home/zhenyapav/Projects/text-generation-webui/venv/lib/python3.10/site-packages/accelerate/__init__.py", line 3, in <module>
    from .accelerator import Accelerator
  File "/home/zhenyapav/Projects/text-generation-webui/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 35, in <module>
    from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
  File "/home/zhenyapav/Projects/text-generation-webui/venv/lib/python3.10/site-packages/accelerate/checkpointing.py", line 24, in <module>
    from .utils import (
  File "/home/zhenyapav/Projects/text-generation-webui/venv/lib/python3.10/site-packages/accelerate/utils/__init__.py", line 131, in <module>
    from .bnb import has_4bit_bnb_layers, load_and_quantize_model
  File "/home/zhenyapav/Projects/text-generation-webui/venv/lib/python3.10/site-packages/accelerate/utils/bnb.py", line 42, in <module>
    import bitsandbytes as bnb
  File "/home/zhenyapav/Projects/text-generation-webui/venv/lib/python3.10/site-packages/bitsandbytes/__init__.py", line 7, in <module>
    from .autograd._functions import (
  File "/home/zhenyapav/Projects/text-generation-webui/venv/lib/python3.10/site-packages/bitsandbytes/autograd/__init__.py", line 1, in <module>
    from ._functions import undo_layout, get_inverse_transform_indices
  File "/home/zhenyapav/Projects/text-generation-webui/venv/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 9, in <module>
    import bitsandbytes.functional as F
  File "/home/zhenyapav/Projects/text-generation-webui/venv/lib/python3.10/site-packages/bitsandbytes/functional.py", line 17, in <module>
    from .cextension import COMPILED_WITH_CUDA, lib
  File "/home/zhenyapav/Projects/text-generation-webui/venv/lib/python3.10/site-packages/bitsandbytes/cextension.py", line 22, in <module>

I have looked at my ROCm installation: there are libraries libhipblas.so, libhipblas.so.1, and libhipblas.so.1.0, but no libhipblas.so.0. But it isn't present in the Docker image I used before either, so that's probably not the cause of the issue.

Stable Diffusion segmentation faults

This is the output I get when I click Generate:

08:39:26-515543 INFO     Startup time: 87.7s (torch=2.0s gradio=0.8s libraries=1.0s models=46.6s codeformer=0.9s scripts=26.7s onchange=0.1s
                         ui=8.5s launch=0.2s scripts app_started_callback=0.2s checkpoint=0.7s)
Initializing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:-- 0:00:00MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [Handle] stream: 0, device_id: 0
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [GetFindModeValueImpl] MIOPEN_FIND_MODE = DYNAMIC_HYBRID(5)
MIOpen(HIP): Info [ForwardGetWorkSpaceSize]
MIOpen(HIP): Info [AmdRocmMetadataVersionDetect] ROCm MD version AMDHSA_COv3, HIP version 5.4.0, MIOpen version 2.19.0.
MIOpen(HIP): Info [GetForwardSolutions]
MIOpen(HIP): Info [GetInstalledPathFile] Checking find db file: gfx1030_36.HIP.fdb
MIOpen(HIP): Info [GetInstalledPathFile] Checking find db file: gfx803_36.HIP.fdb
MIOpen(HIP): Info [GetInstalledPathFile] Checking find db file: gfx803_36.OpenCL.fdb
MIOpen(HIP): Info [GetInstalledPathFile] Checking find db file: gfx803_64.HIP.fdb
MIOpen(HIP): Info [GetInstalledPathFile] Checking find db file: gfx803_64.OpenCL.fdb
MIOpen(HIP): Info [GetInstalledPathFile] Checking find db file: gfx900_56.HIP.fdb
MIOpen(HIP): Info [GetInstalledPathFile] Checking find db file: gfx900_56.OpenCL.fdb
MIOpen(HIP): Info [GetInstalledPathFile] Checking find db file: gfx900_64.HIP.fdb
MIOpen(HIP): Info [GetInstalledPathFile] Checking find db file: gfx900_64.OpenCL.fdb
MIOpen(HIP): Info [GetInstalledPathFile] Checking find db file: gfx906_60.HIP.fdb
MIOpen(HIP): Info [GetInstalledPathFile] Checking find db file: gfx906_60.OpenCL.fdb
MIOpen(HIP): Info [GetInstalledPathFile] Checking find db file: gfx906_64.HIP.fdb
MIOpen(HIP): Info [GetInstalledPathFile] Checking find db file: gfx906_64.OpenCL.fdb
MIOpen(HIP): Info [GetInstalledPathFile] Checking find db file: gfx90878.HIP.fdb
MIOpen(HIP): Info [GetInstalledPathFile] Checking find db file: gfx90878.OpenCL.fdb
MIOpen(HIP): Info [GetInstalledPathFile] Checking find db file: gfx90a68.HIP.fdb
MIOpen(HIP): Info [GetInstalledPathFile] Checking find db file: gfx90a6e.HIP.fdb
MIOpen(HIP): Info [Measure] ReadonlyRamDb::Prefetch time: 5e-05 ms
MIOpen(HIP): Info [Measure] RamDb::Prefetch time: 0.614723 ms
MIOpen(HIP): Info [FindConvFwdAlgorithm] requestAlgoCount = 1, workspace = 294912
MIOpen(HIP): Info [GetForwardSolutions]
MIOpen(HIP): Info [CompileForwardSolution] solver_id = GemmFwdRest
MIOpen(HIP): Info [SQLitePerfDb] database not present
MIOpen(HIP): Info [FindSolutionImpl] GemmFwdRest (not searchable)
MIOpen(HIP): Info [FindConvFwdAlgorithm] miopenConvolutionFwdAlgoGEMM   0.101198        294912
MIOpen(HIP): Info [FindConvFwdAlgorithm] FW Chosen Algorithm: GemmFwdRest , 294912, 0.101198
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 294912
MIOpen(HIP): Info [KernDb] database not present
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [ForwardGetWorkSpaceSize]
MIOpen(HIP): Info [GetForwardSolutions]
MIOpen(HIP): Info [FindConvFwdAlgorithm] requestAlgoCount = 1, workspace = 23592960
MIOpen(HIP): Info [GetForwardSolutions]
MIOpen(HIP): Info [CompileForwardSolution] solver_id = GemmFwdRest
MIOpen(HIP): Info [FindSolutionImpl] GemmFwdRest (not searchable)
MIOpen(HIP): Info [FindConvFwdAlgorithm] miopenConvolutionFwdAlgoGEMM   2.33155 23592960
MIOpen(HIP): Info [FindConvFwdAlgorithm] FW Chosen Algorithm: GemmFwdRest , 23592960, 2.33155
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 23592960
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 23592960
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [ForwardGetWorkSpaceSize]
MIOpen(HIP): Info [GetForwardSolutions]
MIOpen(HIP): Info [FindConvFwdAlgorithm] requestAlgoCount = 1, workspace = 0
MIOpen(HIP): Info [GetForwardSolutions]
MIOpen(HIP): Info [CompileForwardSolution] solver_id = GemmFwd1x1_0_1
MIOpen(HIP): Info [FindSolutionImpl] GemmFwd1x1_0_1 (not searchable)
MIOpen(HIP): Info [FindConvFwdAlgorithm] miopenConvolutionFwdAlgoGEMM   0.109958        0
MIOpen(HIP): Info [FindConvFwdAlgorithm] FW Chosen Algorithm: GemmFwd1x1_0_1 , 0, 0.109958
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 0
Initializing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:-- 0:00:00MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 0
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 23592960
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 23592960
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 0
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 0
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [ForwardGetWorkSpaceSize]
MIOpen(HIP): Info [GetForwardSolutions]
MIOpen(HIP): Info [FindConvFwdAlgorithm] requestAlgoCount = 1, workspace = 5898240
MIOpen(HIP): Info [GetForwardSolutions]
MIOpen(HIP): Info [CompileForwardSolution] solver_id = GemmFwdRest
MIOpen(HIP): Info [FindSolutionImpl] GemmFwdRest (not searchable)
MIOpen(HIP): Info [FindConvFwdAlgorithm] miopenConvolutionFwdAlgoGEMM   0.82742 5898240
MIOpen(HIP): Info [FindConvFwdAlgorithm] FW Chosen Algorithm: GemmFwdRest , 5898240, 0.82742
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 5898240
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [ForwardGetWorkSpaceSize]
MIOpen(HIP): Info [GetForwardSolutions]
MIOpen(HIP): Info [FindConvFwdAlgorithm] requestAlgoCount = 1, workspace = 5898240
MIOpen(HIP): Info [GetForwardSolutions]
MIOpen(HIP): Info [CompileForwardSolution] solver_id = GemmFwdRest
MIOpen(HIP): Info [FindSolutionImpl] GemmFwdRest (not searchable)
MIOpen(HIP): Info [FindConvFwdAlgorithm] miopenConvolutionFwdAlgoGEMM   1.22405 5898240
MIOpen(HIP): Info [FindConvFwdAlgorithm] FW Chosen Algorithm: GemmFwdRest , 5898240, 1.22405
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 5898240
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [ForwardGetWorkSpaceSize]
MIOpen(HIP): Info [GetForwardSolutions]
MIOpen(HIP): Info [FindConvFwdAlgorithm] requestAlgoCount = 1, workspace = 11796480
MIOpen(HIP): Info [GetForwardSolutions]
MIOpen(HIP): Info [CompileForwardSolution] solver_id = GemmFwdRest
MIOpen(HIP): Info [FindSolutionImpl] GemmFwdRest (not searchable)
MIOpen(HIP): Info [FindConvFwdAlgorithm] miopenConvolutionFwdAlgoGEMM   1.0463  11796480
MIOpen(HIP): Info [FindConvFwdAlgorithm] FW Chosen Algorithm: GemmFwdRest , 11796480, 1.0463
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 11796480
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [ForwardGetWorkSpaceSize]
MIOpen(HIP): Info [GetForwardSolutions]
MIOpen(HIP): Info [FindConvFwdAlgorithm] requestAlgoCount = 1, workspace = 0
MIOpen(HIP): Info [GetForwardSolutions]
MIOpen(HIP): Info [CompileForwardSolution] solver_id = GemmFwd1x1_0_1
MIOpen(HIP): Info [FindSolutionImpl] GemmFwd1x1_0_1 (not searchable)
MIOpen(HIP): Info [FindConvFwdAlgorithm] miopenConvolutionFwdAlgoGEMM   0.060599        0
MIOpen(HIP): Info [FindConvFwdAlgorithm] FW Chosen Algorithm: GemmFwd1x1_0_1 , 0, 0.060599
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 0
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [ForwardGetWorkSpaceSize]
MIOpen(HIP): Info [GetForwardSolutions]
MIOpen(HIP): Info [FindConvFwdAlgorithm] requestAlgoCount = 1, workspace = 0
MIOpen(HIP): Info [GetForwardSolutions]
MIOpen(HIP): Info [CompileForwardSolution] solver_id = GemmFwd1x1_0_1
MIOpen(HIP): Info [FindSolutionImpl] GemmFwd1x1_0_1 (not searchable)
MIOpen(HIP): Info [FindConvFwdAlgorithm] miopenConvolutionFwdAlgoGEMM   0.106277        0
MIOpen(HIP): Info [FindConvFwdAlgorithm] FW Chosen Algorithm: GemmFwd1x1_0_1 , 0, 0.106277
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 0
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 0
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 11796480
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 11796480
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 0
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 11796480
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 11796480
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 0
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 0
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [ForwardGetWorkSpaceSize]
MIOpen(HIP): Info [GetForwardSolutions]
MIOpen(HIP): Info [FindConvFwdAlgorithm] requestAlgoCount = 1, workspace = 2949120
MIOpen(HIP): Info [GetForwardSolutions]
MIOpen(HIP): Info [CompileForwardSolution] solver_id = GemmFwdRest
MIOpen(HIP): Info [FindSolutionImpl] GemmFwdRest (not searchable)
MIOpen(HIP): Info [FindConvFwdAlgorithm] miopenConvolutionFwdAlgoGEMM   1.08749 2949120
MIOpen(HIP): Info [FindConvFwdAlgorithm] FW Chosen Algorithm: GemmFwdRest , 2949120, 1.08749
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 2949120
Initializing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:-- 0:00:00MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [ForwardGetWorkSpaceSize]
MIOpen(HIP): Info [GetForwardSolutions]
MIOpen(HIP): Info [FindConvFwdAlgorithm] requestAlgoCount = 1, workspace = 2949120
MIOpen(HIP): Info [GetForwardSolutions]
MIOpen(HIP): Info [CompileForwardSolution] solver_id = GemmFwdRest
MIOpen(HIP): Info [FindSolutionImpl] GemmFwdRest (not searchable)
MIOpen(HIP): Info [FindConvFwdAlgorithm] miopenConvolutionFwdAlgoGEMM   1.4954  2949120
MIOpen(HIP): Info [FindConvFwdAlgorithm] FW Chosen Algorithm: GemmFwdRest , 2949120, 1.4954
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 2949120
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [ForwardGetWorkSpaceSize]
MIOpen(HIP): Info [GetForwardSolutions]
MIOpen(HIP): Info [FindConvFwdAlgorithm] requestAlgoCount = 1, workspace = 5898240
MIOpen(HIP): Info [GetForwardSolutions]
MIOpen(HIP): Info [CompileForwardSolution] solver_id = GemmFwdRest
MIOpen(HIP): Info [FindSolutionImpl] GemmFwdRest (not searchable)
MIOpen(HIP): Info [FindConvFwdAlgorithm] miopenConvolutionFwdAlgoGEMM   1.43853 5898240
MIOpen(HIP): Info [FindConvFwdAlgorithm] FW Chosen Algorithm: GemmFwdRest , 5898240, 1.43853
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 5898240
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [ForwardGetWorkSpaceSize]
MIOpen(HIP): Info [GetForwardSolutions]
MIOpen(HIP): Info [FindConvFwdAlgorithm] requestAlgoCount = 1, workspace = 0
MIOpen(HIP): Info [GetForwardSolutions]
MIOpen(HIP): Info [CompileForwardSolution] solver_id = GemmFwd1x1_0_1
MIOpen(HIP): Info [FindSolutionImpl] GemmFwd1x1_0_1 (not searchable)
MIOpen(HIP): Info [FindConvFwdAlgorithm] miopenConvolutionFwdAlgoGEMM   0.063279        0
MIOpen(HIP): Info [FindConvFwdAlgorithm] FW Chosen Algorithm: GemmFwd1x1_0_1 , 0, 0.063279
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 0
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1100
MIOpen(HIP): Info [SetStream] stream: 0, device_id: 0
MIOpen(HIP): Info [ForwardGetWorkSpaceSize]
MIOpen(HIP): Info [GetForwardSolutions]
MIOpen(HIP): Info [FindConvFwdAlgorithm] requestAlgoCount = 1, workspace = 0
MIOpen(HIP): Info [GetForwardSolutions]
MIOpen(HIP): Info [CompileForwardSolution] solver_id = GemmFwd1x1_0_1
MIOpen(HIP): Info [FindSolutionImpl] GemmFwd1x1_0_1 (not searchable)
MIOpen(HIP): Info [FindConvFwdAlgorithm] miopenConvolutionFwdAlgoGEMM   0.116197        0
MIOpen(HIP): Info [FindConvFwdAlgorithm] FW Chosen Algorithm: GemmFwd1x1_0_1 , 0, 0.116197
MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 0
Segmentation fault

I'm using the wheels from the latest release on Artix; I installed amdgpu with the packages mesa, rocm-hip-sdk, and roctracer.

Roadmap

BitsAndBytes

GPTQ for LLaMA

https://github.com/WapaMario63/GPTQ-for-LLaMa-ROCm

AutoGPTQ

https://github.com/are-we-gfx1100-yet/AutoGPTQ-rocm

Good performance. 43it/s for 7B, 25it/s for 13B, 15it/s for 30B, 0.25it/s for 40B 3bit, 1 beam.

Triton

Navi 3x support is currently a work in progress. Stay tuned.

Around 13% of rocBLAS performance when running 03-matrix-multiplication with this branch, which was recently merged back.

There is still a lot of room for improvement.

AITemplate

Navi 3x support is currently a work in progress. Stay tuned.

Reaches 25 it/s when generating a 512x512 image with Stable Diffusion, using this branch.

Somewhat disappointing. Is this really the limit of the RX 7900 XTX?

Flash Attention

To be ported to Navi 3x.

ROCm

ROCm 5.6.0 is available now, but we can't find Windows support anywhere.

I think it might be more appropriate to call it ROCm 5.5.2.

ROCm 5.7 pytorch build failed

I used the ROCmSoftwarePlatform pytorch repository to build the latest pytorch-rocm, and it failed.
I used the build script from rocm_lab.
The error log is:

pytorch/torch/csrc/jit/ir/ir.cpp:1191:16: error: ‘set_stream’ is not a member of ‘torch::jit::cuda’; did you mean ‘c10::cuda::set_stream’?
 1191 |     case cuda::set_stream:
      |                ^~~~~~~~~~

Copy to VRAM hanging

I first tried using Stable Diffusion with pytorch/rocm5.4.2. That didn't work (it hangs indefinitely when copying data to VRAM), since my RX 7900 XT is not officially supported by ROCm 5.4. Then I tried compiling pytorch with ROCm 5.5 myself in a Docker container; two hours later, I got the same problem. Then I tried the prebuilt wheels from this repo (with automatic and deepfloyd) and the Docker containers (a1111 and automatic), still no success. Even a simple script like this hangs:

import torch

print('Ok  :', torch.cuda.is_available())
print('CUDA:', torch.version.cuda)
print('HIP :', torch.version.hip)

print('Num :', torch.cuda.device_count())
device = torch.device(0)
print('Dev :', device)

print('Creating tensor')
tensor = torch.Tensor([1., 2., 3., 4.])
print('Copy tensor')
tensor.to(device) # hangs
print('Tensor copied')


dmesg doesn't show anything; rocminfo and rocm-smi aren't helpful either.
radeontop doesn't show a difference from normal usage. With htop I can see that torch maxes out a single core. Is it perhaps stuck in an infinite loop?
GDB backtrace:

(gdb) bt
#0  0x00007f1d28a5aed9 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
#1  0x00007f1d28a5ad5e in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
#2  0x00007f1d28a4f8a1 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
#3  0x00007f1d28a29e01 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
#4  0x00007f1d28a43f70 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
#5  0x00007f1d28a7d1c2 in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
#6  0x00007f1d28a7c8ab in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
#7  0x00007f1d28a521fc in ?? () from /opt/rocm/lib/libhsa-runtime64.so.1
#8  0x00007f1d383b2d03 in ?? () from /opt/rocm/lib/libroctracer64.so.4
#9  0x00007f1d383bbb83 in ?? () from /opt/rocm/lib/libroctracer64.so.4
#10 0x00007f1d290bebe3 in ?? () from /opt/rocm/hip/lib/libamdhip64.so.5
#11 0x00007f1d2910875d in ?? () from /opt/rocm/hip/lib/libamdhip64.so.5
#12 0x00007f1d290f6adb in ?? () from /opt/rocm/hip/lib/libamdhip64.so.5
#13 0x00007f1d290b42f1 in ?? () from /opt/rocm/hip/lib/libamdhip64.so.5
#14 0x00007f1d29108d70 in ?? () from /opt/rocm/hip/lib/libamdhip64.so.5
#15 0x00007f1d290107e7 in ?? () from /opt/rocm/hip/lib/libamdhip64.so.5
#16 0x00007f1d28ea8b69 in ?? () from /opt/rocm/hip/lib/libamdhip64.so.5
#17 0x00007f1d28f54e14 in hipMemcpyWithStream () from /opt/rocm/hip/lib/libamdhip64.so.5
#18 0x00007f1d0a83402a in at::native::copy_kernel_cuda(at::TensorIterator&, bool) () from /home/mattisb/Programming/AI/deepflyd-if-rocm5.5/.venv/lib/python3.10/site-packages/torch/lib/libtorch_hip.so
#19 0x00007f1d1629edbe in at::native::copy_impl(at::Tensor&, at::Tensor const&, bool) [clone .isra.0] () from /home/mattisb/Programming/AI/deepflyd-if-rocm5.5/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#20 0x00007f1d1629ffc0 in at::native::copy_(at::Tensor&, at::Tensor const&, bool) () from /home/mattisb/Programming/AI/deepflyd-if-rocm5.5/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#21 0x00007f1d16f6250c in at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool) () from /home/mattisb/Programming/AI/deepflyd-if-rocm5.5/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
#22 0x00007f1d1657289b in at::native::_to_copy(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) ()

Is this a common bug?
Or might it actually be a broken amdgpu/ROCm installation?

Use wheels on Artix?

ERROR: torch-2.0.1+gite19229c-cp310-cp310-linux_x86_64.whl is not a supported wheel on this platform.

Generation not starting locally

When using the prebuilt wheels on Arch Linux (installed directly into the automatic1111 venv), it doesn't crash with a segfault like usual, but it also doesn't generate anything (not even the progress bar appears). Do you know of anything with these symptoms?

7900 XTX gives confusing SDPA results

Benchmarking F.scaled_dot_product_attention with torch.utils.benchmark reports it as slow, but timing the same calls with the system clock appears faster.

The code is like this:

import time 
import torch
import torch.nn as nn
import torch.nn.functional as F
device = "cuda" if torch.cuda.is_available() else "cpu"

# Let's define a helpful benchmarking function:
import torch.utils.benchmark as benchmark
def benchmark_torch_function_in_microseconds(f, *args, **kwargs):
    t0 = benchmark.Timer(
        stmt="f(*args, **kwargs)", globals={"args": args, "kwargs": kwargs, "f": f}
    )
    return t0.blocked_autorange().mean * 1e6

# Let's define the hyper-parameters of our input
batch_size = 32
max_sequence_len = 1024
num_heads = 40
embed_dimension = 128

dtype = torch.float16

query = torch.rand(batch_size, num_heads, max_sequence_len, embed_dimension, device=device, dtype=dtype)
key = torch.rand(batch_size, num_heads, max_sequence_len, embed_dimension, device=device, dtype=dtype)
value = torch.rand(batch_size, num_heads, max_sequence_len, embed_dimension, device=device, dtype=dtype)


print(f"The default implementation runs in {benchmark_torch_function_in_microseconds(F.scaled_dot_product_attention, query, key, value):.3f} microseconds")

x = time.time()
for i in range(10):
    y = F.scaled_dot_product_attention(query, key, value)
    y = y.detach().cpu()
delta = time.time() - x
print('real exec time is',delta*1000, 'ms')

Result on a 3090:

The default implementation runs in 19356.447 microseconds
real exec time is 804.9180507659912 ms

Result on an RX 7900 XTX:

The default implementation runs in 24952.195 microseconds
real exec time is 614.5007610321045 ms
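
A plausible explanation (an assumption, not confirmed in this thread): GPU kernels launch asynchronously, so the two measurements do not cover the same work. torch.utils.benchmark synchronizes around the timed calls, while the wall-clock loop also pays for a large device-to-host copy in y.detach().cpu() on every iteration. A sketch that times just the kernels, with explicit synchronization:

import time
import torch
import torch.nn.functional as F

# Same shapes as the script above; assumes a GPU is present.
dtype = torch.float16
q = torch.rand(32, 40, 1024, 128, device='cuda', dtype=dtype)
k = torch.rand(32, 40, 1024, 128, device='cuda', dtype=dtype)
v = torch.rand(32, 40, 1024, 128, device='cuda', dtype=dtype)

torch.cuda.synchronize()                  # finish any pending setup work
start = time.time()
for _ in range(10):
    y = F.scaled_dot_product_attention(q, k, v)
torch.cuda.synchronize()                  # wait for the kernels to finish
print('synchronized exec time is', (time.time() - start) * 1000, 'ms')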
