Giter Club home page Giter Club logo

radeon_gpu_detective's Introduction

Radeon™ GPU Detective (RGD)

RGD is a tool for post-mortem analysis of GPU crashes.

The tool performs offline processing of AMD GPU crash dump files and generates crash analysis reports in text and JSON formats.

To generate AMD GPU crash dumps for your application, use Radeon Developer Panel (RDP) and follow the Crash Analysis help manual.

Build Instructions

It is recommended to build the tool using the "pre_build.py" script which can be found under the "build" subdirectory.

Steps:

cd build

python pre_build.py

The script supports different options such as using different MSVC toolsets versions. For the list of options run the script with -h.

By default, a solution is generated for VS 2019. To generate a solution for a different VS version or to use a different MSVC toolchain use the --vs argument. For example, to generate the solution for VS 2022 with the VS 2022 toolchain (MSVC 17), run:

python pre_build.py --vs 2022

Running

Basic usage (text output):

rgd --parse <input .rgd crash dump file> -o <output text file>

Basic usage (JSON output):

rgd --parse <input .rgd crash dump file> --json <output JSON file>

For more options, run rgd -h to print the help manual.

Usage

The rgd command line tool accepts AMD driver GPU crash dump files as an input (.rgd files) and generates crash analysis report files with summarized information that can assist in debugging GPU crashes.

The basic usage is:

rgd --parse <full path to input .rgd file> -o <full path to text output file>

The rgd command line tool's crash analysis output files include the following information by default:

  • System information (CPUs, GPUs, driver, OS etc.)
  • Execution marker tree for each command buffer which was in flight during the crash.
  • Summary of the markers that were in progress during the crash (similar to the marker tree, just without the hierarchy and only including markers that were in progress).
  • Page fault summary (for crashes that were determined to be caused by a page fault)

Both text and JSON output files include the same information, in different representation. For simplicity, we will refer here to the human-readable textual output. Here are some more details about the crash analysis file's contents:

Crash Analysis File Information

  • Input crash dump file name: the full path to the .rgd file that was used to generate this file.
  • Input crash dump file creation time
  • RGD CLI version used

System Information

This section is titled SYSTEM INFO and includes information about:

  • Driver
  • OS
  • CPUs
  • GPUs

Markers in Progress

This section is titled MARKERS IN PROGRESS and contains information only about the execution markers that were in progress during the crash for each command buffer which was determined to be in flight during the crash. Here is the matching output for the tree below (see EXECUTION MARKER TREE):

Command Buffer ID: 0x2e
=======================
Frame 268 CL0/DownSamplePS/CmdDraw
Frame 268 CL0/DownSamplePS/CmdDraw
Frame 268 CL0/DownSamplePS/CmdDraw
Frame 268 CL0/DownSamplePS/CmdDraw
Frame 268 CL0/DownSamplePS/CmdDraw
Frame 268 CL0/Bloom/BlurPS/CmdDraw

Note that marker hierarchy is denoted by "/".

Execution Marker Tree

This section is titled EXECUTION MARKER TREE and contains a tree describing the marker status for each command buffer that was determined to be in flight during the crash.

User-provided marker strings will be wrapped in "double quotes". Here is an example marker tree:

Command Buffer ID: 0x2e
=======================
[>] "Frame 268 CL0"
 ├─[X] "Depth + Normal + Motion Vector PrePass"
 ├─[X] "Shadow Cascade Pass"
 ├─[X] "TLAS Build"
 ├─[X] "Classify tiles"
 ├─[X] "Trace shadows"
 ├─[X] "Denoise shadows"
 ├─[X] CmdDispatch
 ├─[X] CmdDispatch
 ├─[X] "GltfPbrPass::DrawBatchList"
 ├─[X] "Skydome Proc"
 ├─[X] "GltfPbrPass::DrawBatchList"
 ├─[>] "DownSamplePS"
 └─[>] "Bloom"
    ├─[>] "BlurPS"
    │  ├─[>] CmdDraw
    │  └─[ ] CmdDraw
    ├─[ ] CmdDraw
    ├─[ ] "BlurPS"
    ├─[ ] CmdDraw
    ├─[ ] "BlurPS"
    ├─[ ] CmdDraw
    ├─[ ] "BlurPS"
    ├─[ ] CmdDraw
    ├─[ ] "BlurPS"
    └─[ ] CmdDraw

Configuring the Execution Marker Output with RGD CLI

RGD CLI exposes a few options that impact how the marker tree is generated:

  • --marker-src include a suffix tag for each node in the tree indicating its origin (the component that submitted the marker). The supported components are:

    • [APP] for application marker.
    • [Driver-PAL] for markers originating from PAL.
    • [Driver-DX12] for markers originating from DXCP non-PAL code.
  • --expand-markers: expand all parent nodes in the tree (RGD will collapse all nodes which do not have any sub-nodes in progress as these would generally be considered "noise" when trying to find the culprit marker).

Configuring the Page Fault Summary with RGD CLI

  • --va-timeline: print a table with all events that impacted the offending VA, sorted chronologically (note that this is only applicable for crashes that are caused by a page fault). Since this table can be extremely verbose, and since in most cases this table is not required for analyzing the crash, it is not included by default in the output file.

  • --all-resources: If specified, the tool's output will include all the resources regardless of their virtual address from the input crash dump file.

Page Fault Summary

If the crash was determined to be caused by a page fault, a section titled PAGE FAULT SUMMARY will include useful details about the page fault such as:

  • Offending VA: the virtual address which triggered the page fault.
  • Resource timeline: a timeline of the associated resources (all resources that resided in the offending VA) with relevant events such as Create, Bind and Destroy. These events will be sorted chronologically. Each line includes:
    • Time of event
    • Event type
    • Type of resource
    • Resource ID
    • Resource size
    • Resource name (if named, otherwise NULL)
  • Associated resources: a list of all associated resources with their details, including:
    • Resource ID
    • Name (if named by the user)
    • Type and creation flags
    • Size
    • Virtual address and parent allocation base address
    • Commit type
    • Resource timeline that shows all events that are relevant to this specific resource in chronological order

A note about time tracking:

The general time format used by RGD is <hh:mm:ss:clks> which stands for <hours, minutes, seconds, CPU clks>. Beginning of time (00:00:00.00) is when the crash analysis session started (note that there is an expected lag between the start of the crashing process and the beginning of the crash analysis session, due to the time that takes to initialize crash analysis in the driver).

Capturing AMD GPU Crash Dump Files

radeon_gpu_detective's People

Contributors

amitbm avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Forkers

gmh5225 tho-lewis

radeon_gpu_detective's Issues

Linux support

Not every programmer uses windows. As such, this should support Linux.

I was going to buy an AMD gpu since I am tired of Nvidia's terrible Linux drivers, but now it seems that y'all have serious gaps too.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.