
ROCm Profiling Tools

DISCLAIMER

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. THIS INFORMATION IS PROVIDED "AS IS." AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

© 2022 Advanced Micro Devices, Inc. All Rights Reserved.

Introduction

ROCProfiler is AMD's tooling infrastructure that provides a hardware-specific, low-level performance analysis interface for profiling and tracing GPU compute applications.

ROCProfiler V1

ROCProfiler V1 provides profiling with metrics based on hardware performance counters (PMC) and traces (SPM). The implementation is based on the AqlProfile HSA extension. The last API library version for ROCProfiler V1 is 8.0.0.

The library source tree:

  • doc - Documentation
  • include/rocprofiler/rocprofiler.h - Library public API
  • include/rocprofiler/v2/rocprofiler.h - V2 Beta Library public API
  • include/rocprofiler/v2/rocprofiler_plugins.h - V2 Beta Tool's Plugins Library public API
  • src - Library sources
    • core - Library API sources
    • util - Library utils sources
    • xml - XML parser
  • test - Library test suite
    • ctrl - Test control
    • util - Test utils
    • simple_convolution - Simple convolution test kernel

Build environment

ROCTracer and ROCProfiler need to be installed in the same directory.

export CMAKE_PREFIX_PATH=<path_to_hsa-runtime_includes>:<path_to_hsa-runtime_library>
export CMAKE_BUILD_TYPE=<debug|release> # release by default
export CMAKE_DEBUG_TRACE=1 # 1 to enable debug tracing

To build with the currently installed ROCm:

cd .../rocprofiler
./build.sh ## (for clean build use `-cb`)

To run the test:

cd .../rocprofiler/build
export LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH # paths to ROC profiler and other libraries
export HSA_TOOLS_LIB=librocprofiler64.so.1 # ROC profiler library loaded by HSA runtime
export ROCP_TOOL_LIB=test/librocprof-tool.so # tool library loaded by ROC profiler
export ROCP_METRICS=metrics.xml # ROC profiler metrics config file
export ROCP_INPUT=input.xml # input file for the tool library
export ROCP_OUTPUT_DIR=./ # output directory for the tool library, for metrics results file 'results.txt' and trace files
./<your_test>
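The exports above can also be composed programmatically before launching a test; a minimal Python sketch, where the `profiler_env` helper is hypothetical and only the variable names and values shown in the exports above are assumed:

```python
import os

def profiler_env(build_dir="."):
    """Compose the ROC profiler environment described by the exports above.
    Hypothetical convenience helper, not part of rocprofiler itself."""
    env = dict(os.environ)
    # Paths to the ROC profiler and other libraries
    env["LD_LIBRARY_PATH"] = build_dir + ":" + env.get("LD_LIBRARY_PATH", "")
    env["HSA_TOOLS_LIB"] = "librocprofiler64.so.1"    # loaded by the HSA runtime
    env["ROCP_TOOL_LIB"] = "test/librocprof-tool.so"  # loaded by the ROC profiler
    env["ROCP_METRICS"] = "metrics.xml"               # metrics config file
    env["ROCP_INPUT"] = "input.xml"                   # input file for the tool library
    env["ROCP_OUTPUT_DIR"] = "./"                     # results.txt and trace files
    return env

# The test would then be launched with this environment, e.g.:
# subprocess.run(["./<your_test>"], env=profiler_env(), cwd="build")
```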

Internal 'simple_convolution' test run script:

cd .../rocprofiler/build
./run.sh
  • To enable error message logging to '/tmp/rocprofiler_log.txt':
export ROCPROFILER_LOG=1
  • To enable verbose tracing:
export ROCPROFILER_TRACE=1

Supported AMD GPU Architectures (V1)

The following AMD GPU architectures are supported with ROCprofiler V1:

  • gfx8 (Fiji/Ellesmere)
  • gfx900 (AMD Vega 10)
  • gfx906 (AMD Vega 7nm also referred to as AMD Vega 20)
  • gfx908 (AMD Instinct™ MI100 accelerator)
  • gfx90a (AMD Instinct™ MI200)
  • gfx94x (AMD Instinct™ MI300)

Note: ROCProfiler V1 tool usage documentation is available at Click Here


ROCProfiler V2

The first API library version for ROCProfiler v2 is 9.0.0


Note: ROCProfilerV2 is currently considered a beta version and is subject to change in future releases


ROCProfilerV2 Modules

  • Counters
  • Hardware
  • Generic Buffer
  • Session
  • Filter
  • Tools
  • Plugins
  • Samples
  • Tests

Getting started

Requirements

  • rocm-llvm-dev

  • makecache

  • Gtest Development Package (Ubuntu: libgtest-dev)

  • libsystemd-dev, libelf-dev, libnuma-dev, libpciaccess-dev on Ubuntu, or their corresponding packages on any other OS

  • Cppheaderparser, websockets, matplotlib, lxml, barectf Python3 Packages

  • Python packages can be installed using:

    pip3 install -r requirements.txt

Build

The user has two options for building:

  • Option 1 (installs to the path in the ROCM_PATH environment variable, or to /opt/rocm if ROCM_PATH is empty):

    • Run

      Normal Build

      ./build.sh --build OR ./build.sh -b

      Clean Build

      ./build.sh --clean-build OR ./build.sh -cb
  • Option 2 (the ROCM_PATH environment variable must be set to the current ROCm installation directory), run the following:

    • Creating the build directory

      mkdir build && cd build
    • Configuring the rocprofv2 build

      cmake -DCMAKE_PREFIX_PATH=$ROCM_PATH -DCMAKE_MODULE_PATH=$ROCM_PATH/hip/cmake -DROCPROFILER_BUILD_TESTS=1 -DROCPROFILER_BUILD_SAMPLES=1 <CMAKE_OPTIONS> ..
    • Building the main runtime of the rocprofv2 project

      cmake --build . -- -j
    • Optionally, for building API documentation

      cmake --build . -- -j doc
    • Optionally, for building packages (DEB, RPM, TGZ). Note: requires the rpm package on Ubuntu

      cmake --build . -- -j package

Install

  • Optionally, run the following to install

    cd build
    cmake --build . -- -j install

Features & Usage

rocsys

A command-line utility to control a session (launch/start/stop/exit) for an application to be traced or profiled in a rocprofv2 context. Usage:

  • Launch the application with the required profiling and tracing options, giving a session identifier to be used later

    rocsys --session session_name launch mpiexec -n 2 rocprofv2 -i samples/input.txt Histogram
  • Start a session with a given identifier created at launch

    rocsys --session session_name start
  • Stop a session with a given identifier created at launch

    rocsys --session session_name stop
  • Exit a session with a given identifier created at launch

    rocsys --session session_name exit

ROCProf Versioning Support

Currently, rocprof supports both versions, rocprof (V1) and rocprofv2; the version can be selected using --tool-version

rocprof --tool-version <VERSION_REQUIRED> <rocprof/v2_options> <app_relative_path>
  • --tool-version 1 means it will just use rocprof V1.
  • --tool-version 2 means it will just use rocprofv2.

To check which version you are currently using, along with more information about the ROCm version, use the following:

rocprof --version

Counters and Metric Collection

HW counters and derived metrics can be collected using the following option:

rocprofv2 -i samples/input.txt <app_relative_path>

Example input.txt contents (the details required inside input.txt are described with each feature):

pmc: SQ_WAVES GRBM_COUNT GRBM_GUI_ACTIVE SQ_INSTS_VALU
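To illustrate how such raw counters relate to derived metrics, here is a minimal sketch. The GPUBusy formula mirrors the kind of derived metric conventionally defined in metrics.xml, and the counter values below are invented for demonstration; neither is taken from an actual run:

```python
# Made-up sample of raw counter values, as if collected via the pmc: line above.
raw = {
    "SQ_WAVES": 4096,              # wavefronts dispatched
    "GRBM_COUNT": 1_000_000,       # free-running GPU clock counter
    "GRBM_GUI_ACTIVE": 850_000,    # cycles the graphics pipe was busy
    "SQ_INSTS_VALU": 262_144,      # VALU instructions issued
}

# A derived metric is a formula over raw counters, e.g. GPU busy percentage
# and average VALU instructions per wavefront.
gpu_busy_pct = 100.0 * raw["GRBM_GUI_ACTIVE"] / raw["GRBM_COUNT"]
valu_insts_per_wave = raw["SQ_INSTS_VALU"] / raw["SQ_WAVES"]

print(f"GPUBusy: {gpu_busy_pct:.1f}%")                    # GPUBusy: 85.0%
print(f"VALU insts per wave: {valu_insts_per_wave:.1f}")  # VALU insts per wave: 64.0
```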

Application Trace Support

Different trace options are available while profiling an app:

  • HIP API & asynchronous activity tracing

    rocprofv2 --hip-api <app_relative_path> ## For synchronous HIP API Activity tracing
    rocprofv2 --hip-activity <app_relative_path> ## For both Synchronous & ASynchronous HIP API Activity tracing
    rocprofv2 --hip-trace <app_relative_path> ## Same as --hip-activity, added for backward compatibility
  • HSA API & asynchronous activity tracing

    rocprofv2 --hsa-api <app_relative_path> ## For synchronous HSA API Activity tracing
    rocprofv2 --hsa-activity <app_relative_path> ## For both Synchronous & ASynchronous HSA API Activity tracing
    rocprofv2 --hsa-trace <app_relative_path> ## Same as --hsa-activity, added for backward compatibility
  • Kernel dispatches tracing

    rocprofv2 --kernel-trace <app_relative_path> ## Kernel Dispatch Tracing
  • HIP & HSA API and asynchronous activity and kernel dispatches tracing

    rocprofv2 --sys-trace <app_relative_path> ## Same as combining --hip-trace & --hsa-trace & --kernel-trace
  • For complete usage options, please run rocprofv2 with --help

    rocprofv2 --help

Plugin Support

We have a template for adding new plugins. New plugins can be written on top of rocprofv2 to support the desired output format using the include/rocprofiler/v2/rocprofiler_plugins.h header file. These plugins are modular in nature and can easily be decoupled from the code as needed. Installation files:

rocprofiler-plugins_2.0.0-local_amd64.deb
rocprofiler-plugins-2.0.0-local.x86_64.rpm

Plugins may have multiple versions; the user can specify which version of the plugin to use by running the following command:

rocprofv2 --plugin <plugin_name> --plugin-version <plugin_version_required> <rocprofv2_options> <app_relative_path>
  • File plugin: outputs the data in text files. The file plugin has two versions; version 2 is the default. Usage:

    rocprofv2 --plugin file -i samples/input.txt -d output_dir <app_relative_path> # -d is optional; it defines the output directory for the results

    File plugin version 1 output header will be similar to the legacy rocprof v1 output:

    Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,wgr,lds,scr,arch_vgpr,accum_vgpr,sgpr,wave_size,sig,obj,DispatchNs,BeginNs,EndNs,CompleteNs,Counters
    

    File plugin version 2 output header:

    Dispatch_ID,GPU_ID,Queue_ID,PID,TID,Grid_Size,Workgroup_Size,LDS_Per_Workgroup,Scratch_Per_Workitem,Arch_VGPR,Accum_VGPR,SGPR,Wave_Size,Kernel_Name,Start_Timestamp,End_Timestamp,Correlation_ID,Counters
    
  • Perfetto plugin: outputs the data in protobuf format. Protobuf files can be viewed using ui.perfetto.dev or using trace_processor. Usage:

    rocprofv2 --plugin perfetto --hsa-trace -d output_dir <app_relative_path> # -d is optional; it defines the output directory for the results
  • CTF plugin: outputs the data in CTF format (a binary trace format). CTF binary output can be viewed using TraceCompass or babeltrace. Usage:

    rocprofv2 --plugin ctf --hip-trace -d output_dir <app_relative_path> # -d is optional; it defines the output directory for the results
  • ATT (advanced thread tracer) plugin: collects advanced hardware traces in binary format; please refer to the ATT section. A tool used to collect fine-grained hardware metrics, providing ISA-level instruction hotspot analysis via hardware tracing.

    • Install plugin package. See Plugin Support section for installation

    • Run the following to view the trace. ATT-specific options must come right after the assembly file.

    • On ROCm 6.0, ATT enables automatic capture of the ISA during kernel execution and does not require recompiling. It is recommended to leave this at "auto".

      rocprofv2 -i input.txt --plugin att auto --mode csv <app_relative_path>
      # Or using a user-supplied ISA:
      # rocprofv2 -i input.txt --plugin att <app_assembly_file> --mode csv <app_relative_path>
    • app_relative_path Path for the running application

    • ATT plugin optional parameters

      • --att_kernel "filename": Kernel filename(s) (glob) to use. A CSV file (or UI folder) will be generated for each kernel.txt file. Default: all in current folder.
      • --mode [csv, file, off (default)]
        • off Runs trace collection but not analysis, so it can be analyzed at a later time. Run rocprofv2 ATT with the same parameters (+ --mode csv), removing the application binary, to analyze previously generated traces.
        • csv Dumps the analyzed assembly into a CSV format, with the hitcount and total cycles cost. Recommended mode for most users.
        • file (deprecated) Dumps the analyzed JSON files to disk for viewing at a later time. Run python3 httpserver.py from within the generated name_ui/ folder to view the trace. The folder can be copied to another machine and will run without ROCm.
        • file,csv Both options can be used at the same time, generating a UI folder and a .csv.
        • network [removed] Network mode was removed, since its functionality is included in file mode via the httpserver.py script generated inside the UI folder.
    • input.txt Required. Used to select specific compute units and other trace parameters. For first-time users, use the following input file:

      # vectoradd
      att: TARGET_CU=1
      SE_MASK=0x1
      SIMD_SELECT=0x3
      # histogram
      att: TARGET_CU=0
      SE_MASK=0xFF
      SIMD_SELECT=0xF // 0xF for GFX9, SIMD_SELECT=0 for Navi

      Possible contents:

      • att: TARGET_CU=1 // or some other CU [0,15] - WGP for Navi [0,8]
      • SE_MASK=0x1 // bitmask of shader engines. The fewer, the easier on the hardware. Default enables 1 out of 4 shader engines.
      • SIMD_SELECT=0xF // GFX9: bitmask of SIMDs. Navi: SIMD Index [0-3]. Recommended 0xF for GFX9 and 0x0 for Navi.
      • DISPATCH=ID // collect trace only for the given dispatch_ID. Multiple lines can be added.
      • DISPATCH=ID,RN // collect trace only for the given dispatch_ID and MPI rank RN. Multiple lines with varying combinations of RN and ID can be added.
      • KERNEL=kernname // Profile only kernels containing the string kernname (c++ mangled name). Multiple lines can be added.
      • PERFCOUNTERS_CTRL=0x3 // Multiplier period for counter collection [0~31]. 0=fastest. GFX9 only.
      • PERFCOUNTER_MASK=0xFFF // Bitmask for perfcounter collection. GFX9 only.
      • PERFCOUNTER=counter_name // Add a SQ counter to be collected with ATT; period defined by PERFCOUNTERS_CTRL. GFX9 only.
      • BUFFER_SIZE=[size] // Sets size of the ATT buffer collection, per dispatch, in megabytes (shared among all shader engines).
      • ISA_CAPTURE_MODE=[0,1,2] // Set codeobj capture mode during kernel dispatch.
        • 0 = capture symbols only.
        • 1 = capture symbols for file:// and make a copy of memory://, dump captured copy as .out file.
        • 2 = Copy file:// and memory://, dump copied codeobj as .out files.
      • DISPATCH_RANGE=[begin],[end] // Continuously collect ATT data starting at "begin" and stop at "end". Alternative to DISPATCH= and KERNEL=.
      • By default, kernel names are truncated for ATT. To disable, please see the kernel name truncation section below.
    • Example for vectoradd.

      # -g adds debugging symbols to the binary. Required only for tracking disassembly back to c++.
      hipcc -g vectoradd_hip.cpp -o vectoradd_hip.exe
      # "auto" means to use the automatically captured ISA, e.g. vectoradd_float_v0_isa.s dumped along with .att files.
      # "--mode csv" dumps the result to "att_output_vectoradd_float_v0.csv".
      rocprofv2 -i input.txt --plugin att auto --mode csv ./vectoradd_hip.exe
      # Alternatively, using --save-temps to generate the ISA
      hipcc -g --save-temps vectoradd_hip.cpp -o vectoradd_hip.exe
      # Replace "auto" with <generated_gpu_isa.s> for user-supplied ISA. Typically they match the wildcards *amdgcn-amd-amdhsa*.s.
      # Special attention to the correct architecture for the ISA, such as "gfx1100" (navi31).
      rocprofv2 -i input.txt --plugin att vectoradd_hip-hip-amdgcn-amd-amdhsa-gfx1100.s --mode csv ./vectoradd_hip.exe

      Instruction latencies will be in att_output_vectoradd_float_v0.csv

      # Use the -d option to specify the generated data directory, and -o to specify the directory and filename of the CSV:
      rocprofv2 -d mydir -o test/mycsv -i input.txt --plugin att auto --mode csv ./vectoradd_hip.exe
      # Generates raw files inside mydir/ and the parsed data in test/mycsv_vectoradd_float_v0.csv

    Note: For MPI or long-running applications, we recommend running collection first and later running the parser on the already collected data. Run only collection: the assembly file is not used. Use mpirun [...] rocprofv2 [...] if needed.

    # Run only data collection, not the parser
    rocprofv2 -i input.txt --plugin att auto --mode off ./vectoradd_hip.exe

    Remove the binary/application from the command line.

    # Only runs the parser on previously collected data.
    rocprofv2 -i input.txt --plugin att auto --mode csv

    Note 2: By default, ATT only collects a SINGLE kernel dispatch for the whole application, which is the first dispatch matching the given filters (DISPATCH= or KERNEL=). To collect multiple dispatches in a single application run, use:

    export ROCPROFILER_MAX_ATT_PROFILES=<max_collections>

    Or, alternatively, use the continuous ATT mode (DISPATCH_RANGE parameter).
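As a worked example of consuming plugin output, the file plugin version 2 CSV header shown earlier can be parsed to aggregate per-kernel durations. A minimal sketch; the sample rows and their timestamp values (nanoseconds) are made up for demonstration:

```python
import csv
import io

# Two hypothetical dispatches in the file plugin version 2 format.
sample = """Dispatch_ID,GPU_ID,Queue_ID,PID,TID,Grid_Size,Workgroup_Size,LDS_Per_Workgroup,Scratch_Per_Workitem,Arch_VGPR,Accum_VGPR,SGPR,Wave_Size,Kernel_Name,Start_Timestamp,End_Timestamp,Correlation_ID,Counters
1,0,0,100,100,1024,256,0,0,8,0,16,64,vectoradd,1000,4000,1,
2,0,0,100,100,1024,256,0,0,8,0,16,64,vectoradd,5000,6500,2,
"""

# Sum kernel execution time (End_Timestamp - Start_Timestamp) per kernel name.
durations = {}
for row in csv.DictReader(io.StringIO(sample)):
    ns = int(row["End_Timestamp"]) - int(row["Start_Timestamp"])
    durations[row["Kernel_Name"]] = durations.get(row["Kernel_Name"], 0) + ns

print(durations)  # {'vectoradd': 4500}
```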


Flush Interval

The flush interval can be used to control the time, in milliseconds, between buffer flushes for the tool. However, if the buffers are full, a flush is triggered automatically. Example:

rocprofv2 --flush-interval <TIME_INTERVAL_IN_MILLISECONDS> <rest_of_rocprofv2_arguments> <app_relative_path>

Trace Period

The trace period controls when profiling or tracing is enabled. The delay time is the time spent idle, without tracing or profiling; the active time is the period during which profiling or tracing runs. The session therefore follows this timeline:

<DELAY_TIME> => <PROFILING_OR_TRACING_SESSION_START> => <ACTIVE_PROFILING_OR_TRACING_TIME> => <PROFILING_OR_TRACING_SESSION_STOP>

This feature can be used using the following command:

rocprofv2 --trace-period <delay>:<active_time>:<interval> <rest_of_rocprofv2_arguments> <app_relative_path>
  • delay: Time delay to start profiling (ms).
  • active_time: How long to profile for (ms).
  • interval: If set, profiling sessions will start (loop) every "interval", and run for "active_time", until the application ends. Must be higher than "active_time".
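The timeline above can be sketched as a small helper that computes the profiling windows implied by the three arguments. The `trace_windows` function is hypothetical, for illustration only; rocprofv2 performs this scheduling internally:

```python
def trace_windows(delay_ms, active_ms, interval_ms=None, app_runtime_ms=None):
    """Return the (start, stop) profiling windows, in milliseconds, implied by
    --trace-period <delay>:<active_time>:<interval>."""
    if interval_ms is None or app_runtime_ms is None:
        # No interval (or unknown application runtime): a single window.
        return [(delay_ms, delay_ms + active_ms)]
    windows = []
    start = delay_ms
    while start < app_runtime_ms:  # sessions loop until the application ends
        windows.append((start, start + active_ms))
        start += interval_ms
    return windows

# e.g. --trace-period 100:50:200 on an app running ~700 ms profiles
# three windows: 100-150, 300-350, and 500-550 ms.
```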

Device Profiling

A device profiling session allows the user to profile GPU device counters irrespective of the applications running on the GPU. Unlike application profiling, a device profiling session is independent of the host's running processes and threads; it directly provides low-level profiling information.

Session Support

A session is a unique identifier for a profiling/tracing/PC-sampling task. A ROCProfilerV2 session carries enough information about what needs to be collected or traced, and it allows the user to start/stop profiling/tracing whenever required. More details on the API can be found in the API specification documentation, which can be installed using the rocprofiler-doc package. Samples showing how to use the API can also be found in the samples directory.

Tests

We make use of the GoogleTest (Gtest) framework to automatically find and add test cases to the CMake testing environment. ROCProfilerV2 testing is categorized as follows:

  • unittests (Gtest based): These include tests for core classes. Any newly added functionality should have a unit test written for it.

  • featuretests (standalone and Gtest based): These include both API tests and tool tests. The tool is tested against different applications to make sure we have the right output in every run.

  • memorytests (standalone): This includes running the address sanitizer to detect memory leaks and corruption.

installation:

rocprofiler-tests_9.0.0-local_amd64.deb
rocprofiler-tests-9.0.0-local.x86_64.rpm

List and Run tests

Run unit tests on the commandline

./build/tests/unittests/runUnitTests

Run profilerfeaturetests on the commandline

./build/tests/featuretests/profiler/runFeatureTests

Run tracer featuretests on the commandline

./build/tests/featuretests/tracer/runTracerFeatureTests

Run all tests

rocprofv2 -t

OR

ctest

Guidelines for adding new tests

  • Prefer to enhance an existing test as opposed to writing a new one. Tests have overhead to start and many small tests spend precious test time on startup and initialization issues.
  • Make the test run standalone without requiring command-line arguments. This makes it easier to debug, since the name of the test is shown in the test report, and if you know the name of the test you can run the test.

Logging

To enable error messages logging to '/tmp/rocprofiler_log.txt':

export ROCPROFILER_LOG=1

Kernel Name Truncation

By default kernel names are not truncated. To enable truncation for readability:

export ROCPROFILER_TRUNCATE_KERNEL_PATH=1

Documentation

We make use of doxygen to automatically generate API documentation. The generated documentation can be found in the following path:

<ROCM_PATH>/share/doc/rocprofv2

ROCM_PATH is /opt/rocm by default and can be set by the user to a different location if needed.

installation:

rocprofiler-docs_9.0.0-local_amd64.deb
rocprofiler-docs-9.0.0-local.x86_64.rpm

Samples

  • Profiling: Profiling Samples depending on replay mode
  • Tracing: Tracing Samples

installation:

rocprofiler-samples_9.0.0-local_amd64.deb
rocprofiler-samples-9.0.0-local.x86_64.rpm

usage:

Samples can be run as independent executables once installed.

Project Structure

  • bin: ROCProf scripts along with V1 post processing scripts
  • doc: Documentation settings for doxygen, V1 API Specifications pdf document.
  • include:
    • rocprofiler.h: V1 API Header File
    • v2:
      • rocprofiler.h: V2 API Header File
      • rocprofiler_plugin.h: V2 Tool Plugins API
  • plugin
    • file: File Plugin
    • perfetto: Perfetto Plugin
    • att: Advanced thread tracer Plugin
    • ctf: CTF Plugin
  • samples: Samples of how to use the API, and also input.txt input file samples for counter collection and ATT.
  • script: Scripts needed for tracing
  • src: Source files of the project
    • api: API implementation for rocprofv2
    • core: Core source files needed for the V1/V2 API
      • counters: Basic and Derived Counters
      • hardware: Hardware support
      • hsa: Provides support for profiler and tracer to communicate with HSA
        • queues: Intercepting HSA Queues
        • packets: Packets Preparation for profiling
      • memory: Memory Pool used in buffers that saves the output data
      • session: Session Logic
        • filter: Type of profiling or tracing and its properties
        • tracer: Tracing support of the session
        • profiler: Profiling support of the session
        • spm: SPM support of the session
        • att: ATT support of the session
    • tools: Tools needed to run profiling and tracing
      • rocsys: Controlling Session from another CLI
    • utils: Utilities needed by the project
  • tests: Tests folder
  • CMakeLists.txt: Handles cmake list for the whole project
  • build.sh: To easily build and compile rocprofiler
  • CHANGELOG.md: Changes that are happening per release

Support

Please report issues via GitHub Issues.

Limitations

  • Navi3x requires a stable power state for counter collection. Currently, this state needs to be set by the user. To do so, make "power_dpm_force_performance_level" writable for non-root users, then set the performance level to profile_standard:

    sudo chmod 777 /sys/class/drm/card0/device/power_dpm_force_performance_level
    echo profile_standard >> /sys/class/drm/card0/device/power_dpm_force_performance_level

    Recommended: "profile_standard" for counter collection and "auto" for all other profiling. Use rocm-smi to verify the current power state. For multi-GPU systems (including integrated graphics), replace "card0" with the desired card.

  • Timestamps may be incorrect with HIP_OPS when the system has been in sleep state.

  • HIP_OPS are mutually exclusive with HSA_OPS.
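For the power-state limitation above, the current level can also be checked programmatically before collection. A minimal sketch; `power_level` is a hypothetical helper, and the only assumption beyond the section above is the standard sysfs layout of the file it reads:

```python
from pathlib import Path

def power_level(card="card0", sysfs_root="/sys/class/drm"):
    """Read the current DPM performance level for a GPU, e.g. 'auto' or
    'profile_standard', from the sysfs node shown in the Limitations section."""
    node = Path(sysfs_root) / card / "device" / "power_dpm_force_performance_level"
    return node.read_text().strip()

# Before counter collection on Navi3x, one could verify, e.g.:
# assert power_level() == "profile_standard"
```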

Supported AMD GPU Architectures (V2)

The following AMD GPU architectures are supported with ROCprofiler V2:

  • gfx900 (AMD Vega 10)
  • gfx906 (AMD Vega 7nm also referred to as AMD Vega 20)
  • gfx908 (AMD Instinct™ MI100 accelerator)
  • gfx90a (AMD Instinct™ MI200)
  • gfx94x (AMD Instinct™ MI300)
  • gfx10xx ([Navi2x] AMD Radeon(TM) Graphics)
  • gfx11xx ([Navi3x] AMD Radeon(TM) Graphics)

      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    66033596(0x3ef97bc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
    N/A                      
*******                  
Agent 3                  
*******                  
  Name:                    gfx803                             
  Uuid:                    GPU-XX                             
  Marketing Name:          Fiji [Radeon R9 FURY / NANO Series]
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          4096(0x1000)                       
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 29440(0x7300)                      
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1000                               
  BDFID:                   33280                              
  Internal Node ID:        2                                  
  Compute Unit:            64                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    4194304(0x400000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx803          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 4                  
*******                  
  Name:                    gfx803                             
  Uuid:                    GPU-XX                             
  Marketing Name:          Fiji [Radeon R9 FURY / NANO Series]
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          4096(0x1000)                       
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    3                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 29440(0x7300)                      
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1000                               
  BDFID:                   33536                              
  Internal Node ID:        3                                  
  Compute Unit:            64                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    4194304(0x400000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx803          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

rocprof error hipRegisterApiCallback(0)

I am working on porting PIConGPU to ROCm and tried to trace my application with rocprof.

I am working with ROCm 2.10 and tried rocprof with a mini app without issues. When I try to trace PIConGPU, I get an error that I do not know how to solve.

Can someone please explain what this error means and how I can solve it?

mpiexec -n 1  rocprof --hip-trace  --timestamp on  ./bin/picongpu -d 1 1 1 -g  128 128 128 -s 100 --periodic 1 0 0 
RPL: on '191213_070919' from '/opt/rocm/rocprofiler' in 'workspace/buildPIC/khi'
RPL: profiling '"./bin/picongpu" "-d" "1" "1" "1" "-g" "128" "128" "128" "-s" "100" "--periodic" "1" "0" "0"'
RPL: input file ''
RPL: output dir '/tmp/rpl_data_191213_070919_42785'
RPL: result dir '/tmp/rpl_data_191213_070919_42785/input_results_191213_070919'
HIP_DB=0x1 [api]
ROCTracer (pid=42804): 
    HIP-trace()
roctracer_enable_domain_callback(), roctracer_enable_callback_fun(), HIP error: hipRegisterApiCallback(0) error(1011)



RPL: 'results.csv' is generated

ROCm is installed in /shared/opt/rocm/2.10/, and an older version of ROCm is in /opt/rocm. The error message shows RPL: on '191213_070919' from '/opt/rocm/rocprofiler' in 'workspace/buildPIC/khi'.
Could it be that I need to set some environment variables so that rocprof does not look in the default install path?

rocprof -i input.txt with several pmc lines crashes

When running

./rocprofiler/bin/rocprof -i input.txt --hsa-trace -d traces ./roctracer/test/MatrixTranspose/MatrixTranspose

with several pmc lines in the input.txt file, the program crashes with a segmentation fault after the second run of the program:

./rocprofiler/bin/rocprof: line 271: 861977 Segmentation fault      (core dumped) "./roctracer/test/MatrixTranspose/MatrixTranspose"

I use rocm-4.0.0, and I cloned and built rocprofiler and roctracer from the GitHub repositories. I used both the amd-master and rocm-4.0.x branches, and the error occurs in both cases.

input_error.txt
input.txt

SQTT trace in rocprofiler 3.3

I used to dump SQTT trace data with rocprofiler 2.1 and ROCm 2.x, but I failed to get an SQTT trace with the latest build. How can I configure SQTT tracing in rocprofiler 3.3? It errors out, not recognizing the trace token SQTT.

my old xml configuration:

<trace name="SQTT">
</trace>

Could you please help with this?

non-default builds seem to be really hard

I'm trying to hack my way around it.
I just want to build a local copy of the library etc.
All the use of CMake environment variables makes it difficult.
e.g.:
CMAKE_INSTALL_PREFIX=~/tools/rocprofiler/install CMAKE_PREFIX_PATH=/opt/rocm/include/hsa:/opt/rocm cmake ..
ends up with
CMake-install-prefix: /usr/local
CPack-install-prefix: /usr/local

Also, it doesn't find libhsakmt.so, which is apparently not in 4.5.2:
/opt/rocm-4.5.2/lib64/cmake/hsakmt
/opt/rocm-4.5.2/lib64/cmake/hsakmt/hsakmt-config.cmake
/opt/rocm-4.5.2/lib64/cmake/hsakmt/hsakmtTargets-release.cmake
/opt/rocm-4.5.2/lib64/cmake/hsakmt/hsakmtTargets.cmake
/opt/rocm-4.5.2/lib64/cmake/hsakmt/hsakmt-config-version.cmake
/opt/rocm-4.5.2/lib64/libhsakmt.a
/opt/rocm-4.5.2/include/hsakmttypes.h
/opt/rocm-4.5.2/include/hsakmt.h
/opt/rocm-4.5.2/share/doc/hsakmt
/opt/rocm-4.5.2/share/pkgconfig/libhsakmt.pc

Is it on GitHub somewhere?
Thanks.

Hacked around it:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 7e1df75..0551493 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -23,7 +23,7 @@
 cmake_minimum_required ( VERSION 2.8.12 )
 
 # Install prefix
-set(CMAKE_INSTALL_PREFIX "/opt/rocm" CACHE PATH "Install path prefix default")
+#set(CMAKE_INSTALL_PREFIX "/opt/rocm" CACHE PATH "Install path prefix default")
 
 ## Verbose output.
 set ( CMAKE_VERBOSE_MAKEFILE TRUE CACHE BOOL "Verbose Output" FORCE )
diff --git a/cmake_modules/env.cmake b/cmake_modules/env.cmake
index 30e86c1..c8ca290 100644
--- a/cmake_modules/env.cmake
+++ b/cmake_modules/env.cmake
@@ -108,12 +108,15 @@ elseif ( ${CMAKE_SYSTEM_PROCESSOR} STREQUAL "x86" )
 endif ()
 
 ## Find hsa-runtime headers/lib
-find_file ( HSA_RUNTIME_INC "hsa/hsa.h" )
-find_library ( HSA_RUNTIME_LIB "libhsa-runtime${NBIT}.so" )
+#find_file ( HSA_RUNTIME_INC "hsa/hsa.h" )
+#find_library ( HSA_RUNTIME_LIB "libhsa-runtime${NBIT}.so" )
+set ( HSA_RUNTIME_INC "/opt/rocm/include/hsa/hsa.h" )
+set ( HSA_RUNTIME_LIB "/opt/rocm/lib/libhsa-runtime64.so" )
 get_filename_component ( HSA_RUNTIME_INC_PATH "${HSA_RUNTIME_INC}" DIRECTORY )
 get_filename_component ( HSA_RUNTIME_LIB_PATH "${HSA_RUNTIME_LIB}" DIRECTORY )
 
-find_library ( HSA_KMT_LIB "libhsakmt.so" )
+#find_library ( HSA_KMT_LIB "libhsakmt.so" )
+set ( HSA_KMT_LIB "/opt/rocm-4.5.2/lib64/libhsakmt.a" )
 get_filename_component ( HSA_KMT_LIB_PATH "${HSA_KMT_LIB}" DIRECTORY )
 get_filename_component ( ROCM_ROOT_DIR "${HSA_KMT_LIB_PATH}" DIRECTORY )

Error when statfile path contains 'stats'

If rocprofiler is used where the current working directory contains the string 'stats', then rocprofiler errors out because it does a regular expression match and replace on 'stats':

When run with --hip-trace:

Traceback (most recent call last):
File "/home/WORKSPACE/statsang/rocprofiler_pkg/bin/tblextr.py", line 499, in
dform.gen_table_bins(db, 'HIP', statfile, 'Name', 'DurationNs')
File "/home/WORKSPACE/statsang/rocprofiler_pkg/bin/dform.py", line 24, in gen_table_bins
gen_data_bins(db, outfile)
File "/home/WORKSPACE/statsang/rocprofiler_pkg/bin/dform.py", line 19, in gen_data_bins
db.dump_csv('C', outfile)
File "/home/WORKSPACE/statsang/rocprofiler_pkg/bin/sqlitedb.py", line 95, in dump_csv
with open(file_name, mode='w') as fd:
IOError: [Errno 2] No such file or directory: '/home/hip_statsang/WORKSPACE/rccl-tests/output.hip_stats.csv'
Data extracting error: /tmp/rpl_data_191204_153135_49168/input_results_191204_153135/'

Please strengthen the regular expression match/replace to account for the case where the target path contains 'stats' (or when the user has a cursed username like mine). Thank you.
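The failure mode can be sketched in a few lines of Python (the helper names and paths below are hypothetical; the real substitution lives in rocprof's post-processing scripts): a first-occurrence replace on the full path mangles any directory or username component containing 'stats', whereas rewriting only the basename is safe.

```python
import os

def naive_statfile(results_path):
    # Buggy approach: substitute the first 'stats' occurrence anywhere
    # in the path, which can hit a directory or username component.
    return results_path.replace("stats", "hip_stats", 1)

def safe_statfile(results_path):
    # Safer approach: rewrite only the file name, never the directories.
    head, tail = os.path.split(results_path)
    return os.path.join(head, tail.replace("stats", "hip_stats", 1))

path = "/home/statsang/work/output.stats.csv"
print(naive_statfile(path))  # /home/hip_statsang/work/output.stats.csv (wrong)
print(safe_statfile(path))   # /home/statsang/work/output.hip_stats.csv (right)
```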

Rocprofiler does not allow changing metrics when using intercept mode

Currently, rocprofiler does not allow changing metrics at runtime for intercepted kernels, so the following example won't work:

  rocprofiler_feature_t features[4];
  features[0].kind = ROCPROFILER_FEATURE_KIND_METRIC;
  features[0].name = "SQ_WAVES";
  unsigned feature_count = 1;

  init_intercept(features, feature_count);
  start_intercept();

  hipLaunchKernelGGL(vectoradd_float,
                  dim3(WIDTH/THREADS_PER_BLOCK_X, HEIGHT/THREADS_PER_BLOCK_Y),
                  dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y),
                  0, 0,
                  deviceA ,deviceB ,deviceC ,WIDTH ,HEIGHT);

  hipDeviceSynchronize();
  stop_intercept();
  shutdown_intercept();

  features[1].kind = ROCPROFILER_FEATURE_KIND_METRIC;
  features[1].name = "SQ_INSTS_VALU";
  feature_count += 1;

  init_intercept(features, feature_count);
  start_intercept();

  hipLaunchKernelGGL(vectoradd_float,
                  dim3(WIDTH/THREADS_PER_BLOCK_X, HEIGHT/THREADS_PER_BLOCK_Y),
                  dim3(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y),
                  0, 0,
                  deviceA ,deviceB ,deviceC ,WIDTH ,HEIGHT);

  hipDeviceSynchronize();
  stop_intercept();
  shutdown_intercept();

Above, init_intercept() initializes the queue callbacks for intercept mode and calls rocprofiler_set_queue_callbacks(). start_intercept() and stop_intercept() call rocprofiler_start_queue_callbacks() and rocprofiler_stop_queue_callbacks(), respectively, and shutdown_intercept() calls rocprofiler_remove_queue_callbacks().

Rocprofiler does not allow users to call rocprofiler_set_queue_callbacks() if it has already been called. Thus, the second call to init_intercept() in the example code above causes the following error message:

> error(4096) "SetCallbacks(), reassigning queue callbacks - not supported"

The ability to change metrics at runtime (while using intercept mode) is highly desirable for tools like PAPI. With the current implementation of rocprofiler, PAPI users would have to define metrics once and have them applied to all the kernels being intercepted.

txt2params.py not installed by "make install"

After a clean install, running rocprof results in:

Traceback (most recent call last):
  File "/opt/rocm-3.5.1/rocprofiler/bin/tblextr.py", line 26, in <module>
    from sqlitedb import SQLiteDB
  File "/opt/rocm-3.5.1/rocprofiler/bin/sqlitedb.py", line 3, in <module>
    from txt2params import gen_params
ImportError: No module named txt2params

Looking at what's installed:

-- Installing: /opt/rocm-3.5.1/rocprofiler/bin/rpl_run.sh
-- Installing: /opt/rocm-3.5.1/rocprofiler/bin/txt2xml.sh
-- Installing: /opt/rocm-3.5.1/rocprofiler/bin/tblextr.py
-- Installing: /opt/rocm-3.5.1/rocprofiler/bin/dform.py
-- Installing: /opt/rocm-3.5.1/rocprofiler/bin/sqlitedb.py
> ls /opt/rocm-3.5.1/rocprofiler/bin
dform.py  rpl_run.sh  sqlitedb.py  tblextr.py  txt2xml.sh

Kernel table is not filled after tblextr.py is run

In tblextr.py, the kernel table ("A") is supposed to be filled with the kernel events, but it is empty. The script inserts the entries, but the queries are never committed. The previous tables have their entries because the creation of the kernel table commits the previous queries. A simple call to "db.connection.commit()" at the end of the fill_kernel_db(table_name, db) function would fix the issue.
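A minimal sqlite3 sketch of why the commit matters (the table and value names are illustrative, not from tblextr.py): rows inserted on one connection are invisible to a second connection until commit() is called.

```python
import os
import sqlite3
import tempfile

# Use an on-disk database so a second connection can observe it.
db_path = os.path.join(tempfile.mkdtemp(), "results.db")

writer = sqlite3.connect(db_path)
writer.execute("CREATE TABLE A (name TEXT)")
writer.execute("INSERT INTO A VALUES ('matrixTranspose')")
writer.commit()  # without this, the INSERT stays in an open transaction

reader = sqlite3.connect(db_path)
rows = reader.execute("SELECT name FROM A").fetchall()
print(rows)  # [('matrixTranspose',)]
```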

Explanation of the expression for metric L2CacheHit

In lib/metrics.xml, the derived metric L2CacheHit has the following definition:

# L2CacheHit      The percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache. Value range: 0% (no hit) to 100% (optimal).
  <metric
    name="L2CacheHit"
    descr="The percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache. Value range: 0% (no hit) to 100% (optimal)."
    expr=100*sum(TCC_HIT,16)/(sum(TCC_HIT,16)+sum(TCC_MISS,16))
  ></metric>

Could anyone explain the meaning of the magic number "16" in the expr attribute above?
I am wondering whether the profiler only collects metric data for a single SE. If so, could we specify which of the SE_NUM shader engines to collect?
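For what it's worth, the second argument of sum() in these expressions appears to be the number of hardware instances of the counter block being aggregated; for TCC (the L2 cache) that would be 16 channels rather than shader engines. Under that assumption, the metric is just an aggregated hit ratio (the per-channel counter values below are made up):

```python
def l2_cache_hit(tcc_hit, tcc_miss):
    # Mirrors 100*sum(TCC_HIT,16)/(sum(TCC_HIT,16)+sum(TCC_MISS,16)):
    # sum each counter over all channel instances, then take the ratio.
    hits, misses = sum(tcc_hit), sum(tcc_miss)
    return 100.0 * hits / (hits + misses)

# Hypothetical per-channel samples for 16 L2 channels.
hit = [30] * 16
miss = [10] * 16
print(l2_cache_hit(hit, miss))  # 75.0
```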

rocm-3.10 typos

RPL: on '210105_080245' from '/opt/rocm-3.10.0/rocprofiler'

rocprof -h

--timestamp <on|off> - to turn on/off the kernel disoatches timestamps, dispatch/begin/end/complete [off]

change "disoatches" to "dispatches"

--parallel-kernels - to enable cnocurrent kernels

change "cnocurrent" to "concurrent"

Bytes transferred with CopyHostToDevice and CopyDeviceToHost

--hip-trace gives the COPY calls but the number of bytes transferred is glaringly missing.
Is this in the HSA trace layer?
There's a mention of being able to trace specific HSA APIs but I can't find a list of those APIs anywhere.
Thanks.

hsa_ven_amd_aqlprofile_1_00_pfn_s has no member hsa_ven_amd_aqlprofile_read

I am running into an error when trying to compile this under Ubuntu 18.04.1:

$ export CMAKE_PREFIX_PATH=/opt/rocm/hsa/include/:/opt/rocm/hsa/lib/libhsa-runtime64.so.1
$ cmake -DCMAKE_PREFIX_PATH=/opt/rocm/lib:/opt/rocm/include/hsa -DCMAKE_INSTALL_PREFIX=/opt/rocm ..
-- The C compiler identification is GNU 7.3.0
-- The CXX compiler identification is GNU 7.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
----------------NBit: 64
----------Build-Type: release
------------Compiler: /usr/bin/c++
----Compiler-Version: 7.3.0
-----HSA-Runtime-Inc: /opt/rocm/hsa/include/hsa
-----HSA-Runtime-Lib: /opt/rocm/lib
-----------CXX-Flags: -std=c++11 -Wall -Werror -Werror=return-type -fexceptions -fvisibility=hidden -fno-math-errno -fno-threadsafe-statics -fmerge-all-constants -fms-extensions -fmerge-all-constants -Werror=unused-result -fPIC -m64  -msse -msse2
---CMAKE_PREFIX_PATH: /opt/rocm/lib:/opt/rocm/include/hsa
---------Install-Dir: /opt/rocm
-- LIB-VERSION: 1.0.0
+ cp /home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/test/simple_convolution/gfx8_SimpleConvolution.hsaco /home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/test/simple_convolution/gfx9_SimpleConvolution.hsaco /home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/build
+ cp /home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/test/run.sh /home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/build
+ cp /home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/test/tool/gfx_metrics.xml /home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/test/tool/input.xml /home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/test/tool/input1.xml /home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/test/tool/metrics.xml /home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/build
+ mkdir -p /home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/build/RESULTS
-- Configuring done
-- Generating done
-- Build files have been written to: /home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/build
bengt@Bengt-TR4:~/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/build$ make
/usr/bin/cmake -H/home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler -B/home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/build --check-build-system CMakeFiles/Makefile.cmake 0
/usr/bin/cmake -E cmake_progress_start /home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/build/CMakeFiles /home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/build/CMakeFiles/progress.marks
make -f CMakeFiles/Makefile2 all
make[1]: Entering directory '/home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/build'
make -f CMakeFiles/rocprofiler64.dir/build.make CMakeFiles/rocprofiler64.dir/depend
make[2]: Entering directory '/home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/build'
cd /home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/build && /usr/bin/cmake -E cmake_depends "Unix Makefiles" /home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler /home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler /home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/build /home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/build /home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/build/CMakeFiles/rocprofiler64.dir/DependInfo.cmake --color=
Scanning dependencies of target rocprofiler64
make[2]: Leaving directory '/home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/build'
make -f CMakeFiles/rocprofiler64.dir/build.make CMakeFiles/rocprofiler64.dir/build
make[2]: Entering directory '/home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/build'
[  5%] Building CXX object CMakeFiles/rocprofiler64.dir/src/core/rocprofiler.cpp.o
/usr/bin/c++  -DAMD_INTERNAL_BUILD -DAQLPROF_NEW_API=1 -DHSA_DEPRECATED="" -DHSA_LARGE_MODEL="" -DLINUX -DLITTLEENDIAN_CPU=1 -DUNIX_OS -D__AMD64__ -D__linux__ -D__x86_64__ -Drocprofiler64_EXPORTS -I/home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/src -I/home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler -I/opt/rocm/hsa/include/hsa  -std=c++11 -Wall -Werror -Werror=return-type -fexceptions -fvisibility=hidden -fno-math-errno -fno-threadsafe-statics -fmerge-all-constants -fms-extensions -fmerge-all-constants -Werror=unused-result -fPIC -m64  -msse -msse2 -O3 -DNDEBUG -fPIC   -o CMakeFiles/rocprofiler64.dir/src/core/rocprofiler.cpp.o -c /home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/src/core/rocprofiler.cpp
In file included from /home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/src/core/context.h:39:0,
                 from /home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/src/core/rocprofiler.cpp:30:
/home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/src/core/profile.h: In member function ‘hsa_status_t rocprofiler::Profile::Finalize(rocprofiler::pkt_vector_t&, rocprofiler::pkt_vector_t&, rocprofiler::pkt_vector_t&)’:
/home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/src/core/profile.h:144:37: error: ‘const pfn_t {aka const struct hsa_ven_amd_aqlprofile_1_00_pfn_s}’ has no member named ‘hsa_ven_amd_aqlprofile_read’; did you mean ‘hsa_ven_amd_aqlprofile_start’?
       hsa_status_t rd_status = api->hsa_ven_amd_aqlprofile_read(&profile_, &read);
                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~
                                     hsa_ven_amd_aqlprofile_start
CMakeFiles/rocprofiler64.dir/build.make:65: recipe for target 'CMakeFiles/rocprofiler64.dir/src/core/rocprofiler.cpp.o' failed
make[2]: *** [CMakeFiles/rocprofiler64.dir/src/core/rocprofiler.cpp.o] Error 1
make[2]: Leaving directory '/home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/build'
CMakeFiles/Makefile2:70: recipe for target 'CMakeFiles/rocprofiler64.dir/all' failed
make[1]: *** [CMakeFiles/rocprofiler64.dir/all] Error 2
make[1]: Leaving directory '/home/bengt/Projekte/github.com/ROCmSoftwarePlatform/rocprofiler/build'
Makefile:154: recipe for target 'all' failed
make: *** [all] Error 2

tblextr.py: missing import of commands

running rocprofiler:

Traceback (most recent call last):
  File "/opt/rocm-3.5.1/rocprofiler/bin/tblextr.py", line 466, in <module>
    metadata_gen(sysinfo_file, '/opt/rocm/bin/rocminfo')
  File "/opt/rocm-3.5.1/rocprofiler/bin/tblextr.py", line 69, in metadata_gen
    status, output = commands.getstatusoutput(sysinfo_cmd + direct_str + sysinfo_file)
NameError: global name 'commands' is not defined

import commands needs to be added to the metadata_gen function in tblextr.py
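Note that the commands module only exists on Python 2; if the script is run under Python 3, the equivalent lives in subprocess. A minimal sketch of the Python 3 replacement (the command here is just a placeholder):

```python
# Python 3 replacement for the removed Python 2 'commands' module:
# subprocess.getstatusoutput runs a shell command and returns
# (exit status, captured output) just like commands.getstatusoutput did.
import subprocess

status, output = subprocess.getstatusoutput("echo hello")
print(status, output)  # 0 hello
```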

question about Profiling data corrupted

"For workloads where the hip application might make more than 10 million HIP API calls, the application might crash with the error - "Profiling data corrupted""
Can you please explain the root cause? I hope the issue can be solved without the workaround.

Thanks

Is there any update on the documentation?

Hi, is there any update on the documentation?
It seems that something has changed.
For example, in the document, it says

const char* rocprofiler_error_string();

but in fact, the function is

hsa_status_t rocprofiler_error_string(
    const char** str); 

Thank you.

free(): corrupted unsorted chunks

When running the following command

./rocprofiler/bin/rocprof --hsa-trace --hip-trace --kfd-trace -d traces python3 ./test.py

I sometimes get the following error

free(): corrupted unsorted chunks
./rocprofiler/bin/rocprof: line 271: 861483 Aborted                 (core dumped) "python3" "./test.py"

I use rocm-4.0.0, and I cloned and built rocprofiler and roctracer from the GitHub repositories. I used both the amd-master and rocm-4.0.x branches, and the error occurs in both cases.

unsorted_chunks_error.txt
test.py.zip

Issue

Issue removed; it was a mistake on my part. Can be closed.

ROCPRofiler: 0 contexts collected

Hello, I am new to rocprof and trying to profile the MatrixTranspose example (https://github.com/ROCm-Developer-Tools/HIP/blob/develop/samples/2_Cookbook/0_MatrixTranspose/MatrixTranspose.cpp). My input file is the same as the one from the help menu, with the exception of the kernel name; however, it seems that rocprof is not collecting any metrics for some reason. I tried different examples but faced the same issue. Does this mean that I don't have access to performance counters or something? Thanks!

 ROCProfiler: input from "/tmp/rpl_data_211123_102928_104567/input0.xml"
gpu_index = 0,1,2,3
kernel = matrixTranspose
range = 1:4
10 metrics
Wavefronts, VALUInsts, SALUInsts, SFetchInsts, FlatVMemInsts, LDSInsts, FlatLDSInsts, GDSInsts, VALUUtilization, FetchSize
Device name
PASSED!

ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_211123_102928_104567/input0_results_211123_102928
RPL: result dir '/tmp/rpl_data_211123_102928_104567/input1_results_211123_102928'
ROCProfiler: input from "/tmp/rpl_data_211123_102928_104567/input1.xml"
gpu_index = 0,1,2,3
kernel = matrixTranspose
range = 1:4
2 metrics
WriteSize, L2CacheHit
Device name
PASSED!

ROCPRofiler: 0 contexts collected, output directory /tmp/rpl_data_211123_102928_104567/input1_results_211123_102928

tblextr.py : bad kfd record / IndexError

When running the following command

./rocprofiler/bin/rocprof --hsa-trace --hip-trace --kfd-trace -d traces python3 ./test.py

I sometimes get one of the following errors

  File "/home/yoann/rocprofiler/bin/tblextr.py", line 710, in <module>
    hip_trace_found = fill_api_db('HIP', db, indir, 'hip', HIP_PID, OPS_PID, [], {}, 1)
  File "/home/yoann/rocprofiler/bin/tblextr.py", line 441, in fill_api_db
    copy_data = list(copy_raws[copy_index])
IndexError: list index out of range
Profiling data corrupted: ' traces/rpl_data_210319_075021_860663/input_results_210319_075021/results.txt'

or

scan kfd API data 2803664:2803665                                                                                                    /home/yoann/rocprofiler/bin/tblextr.py: kfd bad record: ''
Profiling data corrupted: ' traces/rpl_data_210319_074742_860200/input_results_210319_074742/results.txt'

I use rocm-4.0.0 and I cloned and built rocprofiler and roctracer from the github repositories. I used both the amd-master and rocm-4.0.x branches. The IndexError occurred in both cases. The kfd bad record only occurred with the rocm-4.0.x branch (but the errors didn't occur on every run, so it might also occur with the amd-master branch).

index_error.txt
bad_record_error.txt
test.py.zip

why are you linking to numa?

  1. The tests always link to numa, without even running find_library (on any tagged version, though confusingly not on the amd-master branch).
  2. You can't turn off building the tests.

Can you please fix this?

libnuma is not a direct dependency of rocprofiler as far as I can see; it's a transitive dependency: librocprofiler.so depends on libhsa-runtime64.so, which depends on libhsakmt.so, which depends on libnuma.so.

Also, the HSA runtime ships a CMake config file, so why don't you use it? My guess is that someone added linking to libnuma because they had a static build of the ROCm libs. If you used find_package(...), the flags would likely be correct: with -lnuma if some libraries are static, and without -lnuma if they are shared.
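A minimal sketch of what the suggested find_package-based linkage could look like; the option name and the install path here are illustrative, though hsa-runtime64 does ship a CMake config package exporting the hsa-runtime64::hsa-runtime64 target in recent ROCm releases:

```cmake
# Hypothetical sketch: resolve the HSA runtime through its CMake config
# package instead of hard-coding -lnuma. Paths and option names are
# illustrative, not taken from the rocprofiler build.
cmake_minimum_required(VERSION 3.16)
project(rocprofiler_tests CXX)

# Let users skip building the tests entirely.
option(BUILD_TESTING "Build the rocprofiler tests" ON)

if(BUILD_TESTING)
  # The config file carries any transitive link requirements
  # (libnuma included, if the static libraries need it).
  find_package(hsa-runtime64 REQUIRED CONFIG
               PATHS /opt/rocm/lib/cmake/hsa-runtime64)
  add_executable(c_test app/c_test.c)
  target_link_libraries(c_test PRIVATE hsa-runtime64::hsa-runtime64)
endif()
```

With this approach, a static build would pull in -lnuma automatically through the imported target's usage requirements, and a shared build would not.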

[HSA_STATUS_ERROR] A generic error has occurred

Hi,

I am trying to profile some CEED benchmarks. I'm using a gfx906 card with rocm 2.10.

I'm using hipcc and the compilation/run seem fine, but I can't get any output from the profiler.

I tried different options, but I keep getting this generic error:

rcprof -C  ./bp3 -o 2 -l 8 -d hip
Options used:
   --mesh-dimension 3
   --refinement-level 8
   --order 2
   --device hip
Radeon Compute Profiler V5.6.7262 is enabled
No counter file specified. Only counters that will fit into a single pass will be enabled.
Device configuration: hip,cpu
Processor partitioning: 1 1 1
Mesh dimensions: 8 8 4
Total number of elements: 256
Number of finite element unknowns: 2601
aqlprofile API table load failed: HSA_STATUS_ERROR: A generic error has occurred.
[corona90:mpi_rank_0][error_sighandler] Caught error: Aborted (signal 6)
Failed to generate profile result /g/g91/camier1/Session1.csv.

Have you seen this kind of error?

Thank you for your help,

Jean-Sylvain

question about profiling HIP and OpenMP programs

Could you please advise which commands should be used to profile HIP and OpenMP implementations of a program? Users may put timers in the wrong places in the HIP and OpenMP source files, so I would like to rely on the profiler for a fair comparison.

Thanks for your instructions!

Issues when using rocprofiler kernel interception mode

With ROCm 4.0.1, I am seeing a SIGILL or SIGSEGV that appears to be at the point where a rocprofiler kernel dispatch callback should be invoked. This appears to be consistent behavior since at least 3.8.

Backtrace (unfortunately not terribly useful without debug symbols for libamdhip or librocprofiler):

>>> bt
#0  0x00007ffffffeab2a in ?? ()
#1  0x00007ffffffeac6d in ?? ()
#2  0x00002aaaacb50f9b in ?? () from /opt/rocm-4.0.1/rocprofiler/lib/librocprofiler64.so.1
#3  0x00002aaaacb5619d in ?? () from /opt/rocm-4.0.1/rocprofiler/lib/librocprofiler64.so.1
#4  0x00002aaaab6cf8be in ?? () from /opt/rocm-4.0.1/lib/libamdhip64.so.4
#5  0x00002aaaab6dc435 in ?? () from /opt/rocm-4.0.1/lib/libamdhip64.so.4
#6  0x00002aaaab6ccecc in ?? () from /opt/rocm-4.0.1/lib/libamdhip64.so.4
#7  0x00002aaaab6b22d9 in ?? () from /opt/rocm-4.0.1/lib/libamdhip64.so.4
#8  0x00002aaaab56ce96 in ?? () from /opt/rocm-4.0.1/lib/libamdhip64.so.4
#9  0x00002aaaab6b3cbf in ?? () from /opt/rocm-4.0.1/lib/libamdhip64.so.4
#10 0x00002aaaaacd6e65 in start_thread () from /lib64/libpthread.so.0
#11 0x00002aaaad07688d in clone () from /lib64/libc.so.6

The sequence of roctracer/rocprofiler calls is as follows:

<library init time>
    roctracer_set_properties( ACTIVITY_DOMAIN_HIP_API, NULL );
    roctracer_properties_t properties = { 0 };
    properties.buffer_size         = 0x1000;
    properties.buffer_callback_fun = scorep_hip_activity_callback;
    ROCTRACER_CALL( roctracer_open_pool( &properties ) );
    // note: roctracer callbacks are not registered in this build
<OnLoadTool and OnLoadToolProp>
    void* callback_data = NULL;
    rocprofiler_queue_callbacks_t cbs;
    cbs.dispatch = &dispatch_cb;
    ROCPROFILER_CALL( rocprofiler_set_queue_callbacks(cbs, callback_data) );

Environment variables for rocprofiler:

HSA_TOOLS_LIB=/opt/rocm-4.0.1/rocprofiler/lib/librocprofiler64.so.1
ROCP_METRICS=/opt/rocm-4.0.1/rocprofiler/lib/metrics.xml
ROCP_HSA_INTERCEPT=2
ROCP_TOOL_LIB=<path to library containing roctracer and rocprofiler code>

This matches the interception library test case in rocprofiler to the best of my knowledge.

With this environment, OnLoadTool is called; with any other environment I have tried, it is not. With the dispatch callback set at library load rather than via OnLoadTool, ROCP_TOOL_LIB is not necessary, but with the other three variables set, the same crash will occur. OnLoadToolProp is not called, and OnLoadTool is called once and only once. The crash occurs consistently at the point of the first kernel launch, and with a consistently similar-looking stack.

Equivalent rocprofiler code compiled and linked directly into the application worked fine for me, although I have not yet tested that case with a dummy roctracer set of calls (and roctracer link dependency) in addition.

How do host process id, host thread id, GPU id and GPU stream id mapped to pid and tid in chrome://tracing?

A heterogeneous computing application usually has computing units at multiple levels: for example, a host process controls a GPU and may use several streams on it, or a host process spawns several host threads, each of which controls a GPU and may use several streams. However, chrome://tracing seems to have only two levels: process and thread. In my experience, rocprof --hip-trace doesn't have a good solution to this problem. Another tool, rpt, which ships with hcc, seems to always map the queue number to tid while the pid is always 1, as rpt has this code:

def printJSON(self, file, timeOffset=0):
    tid = self.queue
    file.write('{ "pid":1, "tid":%d, "ts":%d, "dur":%d, "ph":"X", "name":"%s", "args":{"dev.queue.op":"%d.%d.%d", "stop":%d } }' %\
        (tid, self.startTime/1000, (self.stopTime - self.startTime)/1000, self.name, \
        self.device, self.queue, self.cmdNum, self.stopTime/1000) )
    file.write(',\n')

Maybe self.device is really the GPU id, but I don't know what self.queue really is. Is it the GPU stream id? In rpt it is mapped to the host thread id. This mapping sometimes makes the visualization in chrome://tracing confusing.

rocm profiler creates a trace for only 1 GPU when kernels are launched onto two separate GPUs.

#include <stdio.h>
#include "hip/hip_runtime.h"

// 1. If N is set to up to 1024, then the sum is OK.
// 2. Set N past 1024, which is past the number of threads per block, and then the sum fails for all iterations,
//    even the ones within the block.

// 3. To circumvent the problem described in 2. above: if N goes past the number of threads per block, we need a
//    multiple-block launch. The trick described on p65 is to use the formula (N+127)/128 for the number of blocks,
//    so that when the block number starts from 1, it is (1+127)/128.

#define N 2048
#define N 536870912
#define MAX_THREAD_PER_BLOCK 1024

__global__ void add( int * a, int * b, int * c ) {
    int tid = hipThreadIdx_x + hipBlockIdx_x * hipBlockDim_x ;
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

int main (void) {
    int *a, *b, *c;
    int *dev_a, *dev_b, *dev_c;
    int stepSize;

    int count = 0;

    hipGetDeviceCount(&count);

    printf("\nDevice count: %d.", count);

    if (count < 2) {
        printf("No. of devices must be at least 2.");
        return 1;
    }

    // allocate dev memory for N size for pointers declared earlier.
    // allocate dev memory for N size for pointers declared earlier.

    printf("\nAllocating memory...(size %u array size of INT).\n", N );

    hipMalloc( (void**)&dev_a, N * sizeof(int));
    hipMalloc( (void**)&dev_b, N * sizeof(int));
    hipMalloc( (void**)&dev_c, N * sizeof(int));

    const unsigned blocks = 512;
    const unsigned threadsPerBlock = 256;

    // invoke the kernel:
    // block count: (N+127)/128
    // thread count: 128

    hipSetDevice(0);
    hipLaunchKernelGGL(add, blocks, threadsPerBlock, 0, 0, dev_a, dev_b, dev_c);
    hipSetDevice(1);
    hipLaunchKernelGGL(add, blocks, threadsPerBlock, 0, 0, dev_a, dev_b, dev_c);
    hipDeviceSynchronize();

    hipFree(dev_a);
    hipFree(dev_b);
    hipFree(dev_c);
}


Use the following to compile and create the trace:


FILE1=p61
for FILE in $FILE1 ; do
    hipcc $FILE.cpp -o $FILE.out
    rocprof --hip-trace  -d ./$FILE ./$FILE.out
done

A results.json is created, and when it is opened in the Chrome tracer, only GPU0 is seen.

00:07.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 [Radeon Instinct MI25 MxGPU] (rev 06)
00:08.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 [Radeon Instinct MI25 MxGPU] (rev 06)
root@sriov-guest:~/dev-learn/gpu/hip/hip-stream-aql/p61-2gpus# egrep -irn gpu results.json
4:,{"args":{"name":"GPU0"},"ph":"M","pid":6,"name":"process_name","sort_index":2}
268:    "Agent2.Uuid": "GPU-0215141e35aa2184",
269:    "Agent2.MarketingName": "Vega10[RadeonInstinctMI25MxGPU]",
279:    "Agent2.DeviceType": "GPU",
335:    "Agent3.Uuid": "GPU-0215141e35aa2904",
336:    "Agent3.MarketingName": "Vega10[RadeonInstinctMI25MxGPU]",
346:    "Agent3.DeviceType": "GPU",

Assumption about directory layout breaks Spack support.

https://github.com/ROCm-Developer-Tools/rocprofiler/blob/93778bdc4fa5403fedede3afcc1155338c62a6f3/bin/rpl_run.sh#L29

When rocprofiler is installed with Spack (https://github.com/spack/spack/blob/29d344e4c72aadb1672a2c8f36f9ff773b636ac4/var/spack/repos/builtin/packages/rocprofiler-dev/package.py), the assumption that all of the ROCm packages are installed next to each other under a common root like

/opt/rocm
  rocprofiler
  roctracer

does not hold, since the directory names are versioned.

Infinite recursion in librocprofiler.so

Hi,

I am a developer from the HPCToolkit project at Rice University. I am developing AMD GPU counter support in HPCToolkit directly using rocprofiler API. I am currently running into an infinite recursion in librocprofiler.so with the following stack trace:

#0  0x00007f38ba752e76 in d_print_comp_inner () from /lib64/libstdc++.so.6
#1  0x00007f38ba7572b8 in d_print_comp () from /lib64/libstdc++.so.6
#2  0x00007f38ba753c57 in d_print_comp_inner () from /lib64/libstdc++.so.6
#3  0x00007f38ba7572b8 in d_print_comp () from /lib64/libstdc++.so.6
#4  0x00007f38ba75584d in d_print_comp_inner () from /lib64/libstdc++.so.6
#5  0x00007f38ba7572b8 in d_print_comp () from /lib64/libstdc++.so.6
#6  0x00007f38ba75584d in d_print_comp_inner () from /lib64/libstdc++.so.6
#7  0x00007f38ba7572b8 in d_print_comp () from /lib64/libstdc++.so.6
#8  0x00007f38ba75584d in d_print_comp_inner () from /lib64/libstdc++.so.6
#9  0x00007f38ba7572b8 in d_print_comp () from /lib64/libstdc++.so.6
#10 0x00007f38ba757d02 in d_print_function_type.isra () from /lib64/libstdc++.so.6
#11 0x00007f38ba75474a in d_print_comp_inner () from /lib64/libstdc++.so.6
#12 0x00007f38ba7572b8 in d_print_comp () from /lib64/libstdc++.so.6
#13 0x00007f38ba754a42 in d_print_comp_inner () from /lib64/libstdc++.so.6
#14 0x00007f38ba7572b8 in d_print_comp () from /lib64/libstdc++.so.6
#15 0x00007f38ba75387d in d_print_comp_inner () from /lib64/libstdc++.so.6
#16 0x00007f38ba7572b8 in d_print_comp () from /lib64/libstdc++.so.6
#17 0x00007f38ba75907f in d_demangle_callback.constprop () from /lib64/libstdc++.so.6
#18 0x00007f38ba759361 in __cxa_demangle () from /lib64/libstdc++.so.6
#19 0x00007f38b0b51dd5 in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#20 0x00007f38b8cbfc6d in rocr::amd::hsa::loader::ExecutableImpl::IterateSymbols(hsa_status_t (*)(hsa_executable_s, hsa_executable_symbol_s, void*), void*) ()
   from /opt/rocm-4.3.1/lib/libhsa-runtime64.so.1
#21 0x00007f38b8c9b853 in rocr::HSA::hsa_executable_iterate_symbols(hsa_executable_s, hsa_status_t (*)(hsa_executable_s, hsa_executable_symbol_s, void*), void*) ()
   from /opt/rocm-4.3.1/lib/libhsa-runtime64.so.1
#22 0x00007f38b0b4fa4a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#23 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#24 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#25 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#26 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#27 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#28 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#29 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#30 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#31 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#32 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#33 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#34 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#35 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#36 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#37 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#38 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so
#39 0x00007f38b0b4fa5a in ?? () from /opt/rocm-4.3.1/lib/librocprofiler64.so

You can see this is with rocm-4.3.1. With rocm-4.3.1, I was able to work around this issue by setting settings->code_obj_tracking = 0; inside OnLoadToolProp:

https://github.com/HPCToolkit/hpctoolkit/blob/rocprofiler_support/src/tool/hpcrun/gpu/amd/rocprofiler-api.c#L436

This workaround does not seem to work with rocm-4.5.0, and I am seeing a similar infinite recursion in librocprofiler.

Are there any recommendations or insights on resolving this problem?

rocminfo dependency not specified in the debian package

rocprof depends on rocminfo to run properly, but this dependency is not specified in the Debian package.

Source code where the dependency to rocminfo takes place:
https://github.com/ROCm-Developer-Tools/rocprofiler/blob/amd-master/bin/rpl_run.sh#L205

Debian package dependency is missing:

$ apt show rocprofiler-dev
Package: rocprofiler-dev
Version: 1.0.0
Priority: optional
Section: devel
Maintainer: Advanced Micro Devices Inc.
Installed-Size: 1541 kB
Depends: hsa-rocr-dev
Download-Size: 254 kB
APT-Sources: http://repo.radeon.com/rocm/apt/debian xenial/main amd64 Packages
Description: ROCPROFILER library for AMD HSA runtime API extension support

N: There is 1 additional record. Please use the '-a' switch to see it

GUI

Are there plans to make a GUI similar to the NVIDIA Visual Profiler?

rocprof not running the executable

I noticed that rocprof completely bypasses running the executable.

$ OMP_NUM_THREADS=8 rocprof --hsa-trace --obj-tracking on ./bin/check_spo_batched 
RPL: on '200317_135650' from '/opt/rocm-3.1.0/rocprofiler' in '/home/yeluo/opt/miniqmc/build_ryzen_aomp_MP'
RPL: profiling '"./bin/check_spo_batched"'
RPL: input file ''
RPL: output dir '/tmp/rpl_data_200317_135650_31629'
RPL: result dir '/tmp/rpl_data_200317_135650_31629/input_results_200317_135650'
ROCProfiler: input from "/tmp/rpl_data_200317_135650_31629/input.xml"
  0 metrics
  0 traces
ROCTracer (pid=31651): 
    HSA-trace()
    HSA-activity-trace()
File '/home/yeluo/opt/miniqmc/build_ryzen_aomp_MP/results.hsa_stats.csv' is generating

File '/home/yeluo/opt/miniqmc/build_ryzen_aomp_MP/results.json' is generating

File '/home/yeluo/opt/miniqmc/build_ryzen_aomp_MP/results.json' is generating

Any idea what is wrong?

[Build error] Missing string header

When I compile the 3.5.0 release of rocprofiler I get the following build errors, which I could fix by adding

#include <string>

to test/ctrl/test_kernel.h (per the compiler notes below, #include <cstdlib> is also needed for malloc):

[ 36%] Linking C executable c_test
cd /tmp/rocprofiler-rocm-3.5.0/build/test && /usr/bin/cmake -E cmake_link_script CMakeFiles/c_test.dir/link.txt --verbose=1
/usr/bin/cc -Wall -g -march=native -O2 -O3 -DNDEBUG  -rdynamic CMakeFiles/c_test.dir/app/c_test.c.o  -o c_test 
make[2]: Leaving directory '/tmp/rocprofiler-rocm-3.5.0/build'
[ 36%] Built target c_test
In file included from /tmp/rocprofiler-rocm-3.5.0/test/simple_convolution/simple_convolution.h:29,
                 from /tmp/rocprofiler-rocm-3.5.0/test/simple_convolution/simple_convolution.cpp:23:
/tmp/rocprofiler-rocm-3.5.0/test/ctrl/test_kernel.h:81:16: error: ‘string’ in namespace ‘std’ does not name a type
   81 |   virtual std::string Name() const = 0;
      |                ^~~~~~
/tmp/rocprofiler-rocm-3.5.0/test/ctrl/test_kernel.h:28:1: note: ‘std::string’ is defined in header ‘<string>’; did you forget to ‘#include <string>’?
   27 | #include <stdint.h>
  +++ |+#include <string>
   28 | #include <map>
/tmp/rocprofiler-rocm-3.5.0/test/ctrl/test_kernel.h: In member function ‘bool TestKernel::SetHostDescr(const uint32_t&, const TestKernel::des_id_t&, const uint32_t&)’:
/tmp/rocprofiler-rocm-3.5.0/test/ctrl/test_kernel.h:105:19: error: ‘malloc’ was not declared in this scope
  105 |       descr.ptr = malloc(size);
      |                   ^~~~~~
/tmp/rocprofiler-rocm-3.5.0/test/ctrl/test_kernel.h:28:1: note: ‘malloc’ is defined in header ‘<cstdlib>’; did you forget to ‘#include <cstdlib>’?
   27 | #include <stdint.h>
  +++ |+#include <cstdlib>
   28 | #include <map>
In file included from /tmp/rocprofiler-rocm-3.5.0/test/simple_convolution/simple_convolution.cpp:23:
/tmp/rocprofiler-rocm-3.5.0/test/simple_convolution/simple_convolution.h: At global scope:
/tmp/rocprofiler-rocm-3.5.0/test/simple_convolution/simple_convolution.h:50:8: error: ‘string’ in namespace ‘std’ does not name a type
   50 |   std::string Name() const { return std::string("SimpleConvolution"); }
      |        ^~~~~~
/tmp/rocprofiler-rocm-3.5.0/test/simple_convolution/simple_convolution.h:30:1: note: ‘std::string’ is defined in header ‘<string>’; did you forget to ‘#include <string>’?
   29 | #include "ctrl/test_kernel.h"
  +++ |+#include <string>
   30 | 

Rocprofiler issue with AMD OpenMP (aomp)

Hi,
I tried getting tracing data for an OpenMP example program from AOMP (vmulsum) and got a segmentation fault when tracing with either hsa, kfd, or hip.
I recorded a debug log using the following command:
$ LD_DEBUG=libs rocprof --hsa-trace vmulsum 2&>1 > log.txt
Here is the log.txt file:
log.txt

I am using a gfx900, Ubuntu 20.04 with ROCm 4.2

Missing symlinks in `<rocm-path>/lib`

Why is there only the .so symlink in <rocm-path>/lib for rocprofiler? roctracer has .so, .so.1, and .so.1.0.40500 symlinks, as do all the other libs. This is ROCm 4.5.0.

[3.3] failed to dump v3 code object

When I use ROCm 3.3 to dump TensorFlow data, it reports the error below.
Does that mean it does not support V3 code objects for now?

error(4096) "QueryKernelName(), Error: V3 code object detected - code objects tracking should be enabled

OpenCL support question

When profiling an OpenCL program, the error message is:

ROCTracer (pid=1529322):
HSA-trace()
HIP-trace()
roctracer: Loading 'libamdhip64.so' failed, (null)

Am I doing this right? Thank you for your answer.

Tool lib failed to load.

Using rocprof with ROCm v4.5 on Ubuntu gives the following problem:

rocprof --hsa-trace ./test
RPL: on '220201_151632' from '/opt/rocm-4.5.0/rocprofiler' in '/home'
RPL: profiling '"./test"'
RPL: input file ''
RPL: output dir '/tmp/rpl_data_220201_151632_971494'
RPL: result dir '/tmp/rpl_data_220201_151632_971494/input_results_220201_151632'
Tool lib "/opt/rocm-4.5.0/roctracer/tool/libtracer_tool.so" failed to load.

Is this a problem with the driver? Which driver do I need to use?

runtime error: aqlprofile API table load failed

After updating ROCm from 3.3.0 to 3.5.1 and rebuilding rocprofiler and roctracer, I get the following error when profiling an executable (which uses an AMD Vega 56 GPU):

> rocprof --stats -o rocpf_stat.csv the_prog
RPL: on '200626_131451' from '/opt/rocm-3.5.1/rocprofiler/rocprofiler' in '/home/leggett/work/fcs/bk_hip'
RPL: profiling '"runTFCSSimulation"'
RPL: input file ''
RPL: output dir '/tmp/rpl_data_200626_131451_50543'
RPL: result dir '/tmp/rpl_data_200626_131451_50543/input_results_200626_131451'
ROCProfiler: input from "/tmp/rpl_data_200626_131451_50543/input.xml"
  0 metrics
aqlprofile API table load failed: HSA_STATUS_ERROR: A generic error has occurred.
( program exits )

I see a similar error when doing --hsa-trace

rocprof --hsa-trace -o rocpf_hsa.csv the_prog
RPL: on '200626_131810' from '/opt/rocm-3.5.1/rocprofiler/rocprofiler' in '/home/leggett/work/fcs/bk_hip'
RPL: profiling '"runTFCSSimulation"'
RPL: input file ''
RPL: output dir '/tmp/rpl_data_200626_131810_50607'
RPL: result dir '/tmp/rpl_data_200626_131810_50607/input_results_200626_131810'
ROCProfiler: input from "/tmp/rpl_data_200626_131810_50607/input.xml"
  0 metrics
ROCTracer (pid=50626): 
    HSA-trace()
    HSA-activity-trace()
aqlprofile API table load failed: HSA_STATUS_ERROR: A generic error has occurred.
File 'rocpf_hsa.hsa_stats.csv' is generating

File 'rocpf_hsa.json' is generating

File 'rocpf_hsa.json' is generating

This is on a CentOS 7 host.

rocprof behavior with both -d and -i options

When running rocprof command with -d option

rocprof -d traces --hsa-trace my_script

The traces will be put in a directory tree with the following structure:

-traces
    -rpl_<date>_<time>_<id>
        -input_results_<date>_<time>
            -all the .txt traces

However when running the same command but with -i option

rocprof -i input.xml -d traces --hsa-trace my_script

Then the traces will be put in a directory tree with the following structure:

-traces
    -all the .txt traces

In the first case, we can rerun rocprof several times without the risk of overwriting or colliding with previous traces, since subdirectories with unique ids (based on the time) are created in the given directory and the traces are put in them.
In the second case, we risk overwriting previous traces in the ./traces directory. Is this expected behavior?

PAPI rocm_r component segfaults in intercept mode

Testing PAPI rocm_r component (https://bitbucket.org/congiu/papi/branch/2022.01.11_rocm-rewrite) with the code at this link: https://bitbucket.org/congiu/papi/src/b9533e4c207f20d0477174d097bec2df73867f02/src/components/rocm_r/tests/hip_matmul_single_gpu.cpp

on MI100 GPUs with rocm-4.5.0 and rocm-5.0.0 generates the behaviour reported below.

The following shows the kernel running with the PAPI rocm_r component in sample mode:

$ ./hip_matmul_single_gpu
./hip_matmul_single_gpu : Multiply two square matrices of size 8192 x 8192
First kernel run...
rocm:::SQ_INSTS_VALU:device=0 : 77329334272
rocm:::SQ_INSTS_SALU:device=0 : 17188257792
rocm:::SQ_WAVES:device=0 : 1048576
Second kernel run...
rocm:::SQ_INSTS_VMEM_RD:device=0 : 17179869184
rocm:::SQ_INSTS_VMEM_WR:device=0 : 1048576

And with PAPI rocm_r component in intercept mode

$ ROCP_HSA_INTERCEPT=1 ./hip_matmul_single_gpu
./hip_matmul_single_gpu : Multiply two square matrices of size 8192 x 8192
Segmentation fault (core dumped)

Rerunning the above with gdb:

$ ROCP_HSA_INTERCEPT=1 gdb ./hip_matmul_single_gpu
...Starting program: /home/gcongiu/papi/src/components/rocm_r/tests/./hip_matmul_single_gpu
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
/home/gcongiu/papi/src/components/rocm_r/tests/./hip_matmul_single_gpu : Multiply two square matrices of size 8192 x 8192
[New Thread 0x7fffd303d700 (LWP 68100)]

Program received signal SIGSEGV, Segmentation fault.
0x00000000000127e0 in ?? ()
Missing separate debuginfos, use: debuginfo-install elfutils-libelf-0.176-5.el7.x86_64 glibc-2.17-325.el7_9.x86_64 libdrm-2.4.97-2.el7.x86_64 libgcc-4.8.5-44.el7.x86_64 libstdc++-4.8.5-44.el7.x86_64 ncurses-libs-5.9-14.20130511.el7_4.x86_64 numactl-libs-2.0.12-5.el7.x86_64 zlib-1.2.7-19.el7_9.x86_64
(gdb) bt
#0  0x00000000000127e0 in ?? ()
#1  0x00007fffd30718b0 in ?? () from /opt/rocm-4.5.0/rocprofiler/lib/librocprofiler64.so
#2  0x00007fffd3077d6d in ?? () from /opt/rocm-4.5.0/rocprofiler/lib/librocprofiler64.so
#3  0x00007ffff6885947 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#4  0x00007ffff6899ce5 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#5  0x00007ffff68812ca in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#6  0x00007ffff686faa8 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#7  0x00007ffff68143d1 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#8  0x00007ffff6814af8 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#9  0x00007ffff68155da in hipStreamCreate () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#10 0x0000000000206bce in main ()

Interestingly, if I use MALLOC_CHECK_=1:

$ MALLOC_CHECK_=1 ROCP_HSA_INTERCEPT=1 ./hip_matmul_single_gpu
./hip_matmul_single_gpu : Multiply two square matrices of size 8192 x 8192
First kernel run...
rocm:::SQ_INSTS_VALU:device=0 : 77329334272
rocm:::SQ_INSTS_SALU:device=0 : 17188257792
rocm:::SQ_WAVES:device=0 : 1048576
Error! Failed starting eventset, error=-8 -> 'Event exists, but cannot be counted due to hardware resource limits’

The segmentation fault disappears. This seems to indicate a memory error in librocprofiler.

Ignore the “Error! …” line. It is generated by PAPI and is due to the fact that the EventSet that initially contained the VALU, SALU, and WAVES events has been cleaned up and reused with different events (i.e., VMEM). Since rocprofiler does not allow changing the dispatch callbacks after they have been set, PAPI throws an error.
