
omniperf's Introduction


Omniperf

General

Omniperf is a system performance profiling tool for machine learning/HPC workloads running on AMD MI GPUs. The tool presently targets usage on MI100 and MI200 accelerators.

  • For more information on available features, installation steps, and workload profiling and analysis, please refer to the online documentation.

  • Omniperf is an AMD open source research project and is not supported as part of the ROCm software stack. We welcome contributions and feedback from the community. Please see the CONTRIBUTING.md file for additional details on our contribution process.

  • Licensing information can be found in the LICENSE file.

Development

Omniperf follows a main-dev branching model. As a result, our latest stable release is shipped from the main branch, while new features are developed in our dev branch.

Users may check out dev to preview upcoming features.

How to Cite

This software can be cited using a Zenodo DOI reference. A BibTeX-style reference is provided below for convenience:

@software{xiaomin_lu_2022_7314631,
  author       = {Xiaomin Lu and
                  Cole Ramos and
                  Fei Zheng and
                  Karl W. Schulz and
                  Jose Santos and
                  Keith Lowery and
                  Nicholas Curtis and
                  Cristian Di Pietrantonio},
  title        = {AMDResearch/omniperf: v1.1.0-PR1 (13 Oct 2023)},
  month        = oct,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {v1.1.0-PR1},
  doi          = {10.5281/zenodo.7314631},
  url          = {https://doi.org/10.5281/zenodo.7314631}
}


omniperf's Issues

Output data format

omniperf -v
omniperf (1.0.3)

omniperf analyze -p workloads/kernel/mi200/ -o doesnotexist

--------
Analyze
--------

Saved Analysis folder exists

Why does omniperf analyze say: Saved Analysis folder exists ? The folder does not exist. If I create the directory and rerun, I get:

IsADirectoryError: [Errno 21] Is a directory: 'doesnotexist'

I guess that omniperf says Saved analysis folder exists because I pointed it to the existing workload workloads/kernel/mi200.

Is the output I get via -o just the stdout? It would be helpful if the data were available as, say, SQLite, JSON, or even a Python pickle file. Such an option would facilitate further analysis.
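A hedged sketch of what such an export could look like with pandas (the column names here are hypothetical, not omniperf's actual schema):

```python
import sqlite3

import pandas as pd

# Hypothetical analysis table; omniperf's real column names may differ.
df = pd.DataFrame({"Kernel_Name": ["vecCopy"], "Calls": [1]})

# JSON: one record per row, easy to post-process with jq or Python.
json_text = df.to_json(orient="records")

# SQLite: queryable afterwards with any SQL tool.
conn = sqlite3.connect(":memory:")
df.to_sql("top_stats", conn, index=False)
rows = conn.execute("SELECT Kernel_Name, Calls FROM top_stats").fetchall()
conn.close()
```

Either format would let downstream scripts consume analysis results without scraping the terminal output.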

Issue using Matplotlib with X-server

This seems to be an issue that only occurs when you are connected via ssh to a system that has an X-server running.

https://github.com/AMDResearch/omniperf/blob/796c495d0b41bb63f62220278bd4fdca323a463b/src/utils/plot_roofline.py#L31

The reason is that when an X-server is running, matplotlib by default tries to connect to it, can't, and throws fatal exceptions.

Suggested Fix

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

The use('Agg') call makes matplotlib use a non-interactive backend that can only write to files. This is apparently required if you happen to run omniperf on a system with X running.
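As an alternative to patching the source, matplotlib also honors the MPLBACKEND environment variable, which is read at import time; a small sketch (assuming a reasonably recent matplotlib):

```python
import os

# Force the file-only Agg backend without editing plot_roofline.py.
# MPLBACKEND must be set before the first matplotlib import.
os.environ["MPLBACKEND"] = "Agg"

import matplotlib.pyplot as plt  # must come after the variable is set

# File output works with no X connection at all under Agg.
fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])
fig.savefig("roofline.png")
```

Exporting MPLBACKEND=Agg in the shell before running omniperf would have the same effect.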

CC: @keithloweryamd

Extending OmniXXX to profile/trace EPYC CPUs

Hello,

The functionality provided by the omniXXX packages could be extended to provide similar performance information on AMD EPYC CPUs. We are lacking tools to produce Roofline curves for EPYC DRAM/L3/L2/L1. Although floating-point (SP/DP) limits can be obtained with synthetic benchmarks, cache-hierarchy bandwidths and latencies require specialized low-level instructions.

I am suggesting that the omni tools be extended to provide such information for the various AMD EPYC CPUs as well.

You could incorporate this functionality into AMDuProf, or at least add L3 bandwidth (and latency) profiles.

thank you
Michael Thomadakis

AAC Requirements

  1. Need instructions to install docker-compose and volumes (added in v1.0.3)
  2. Need to install CMake 3.19 separately
  3. Modify the Python dependency instructions to python3 -m pip install --system -t ${INSTALL_DIR}/python-libs -r requirements.txt
  4. We see a matplotlib conflict with the default installation, resulting in omniperf failure
  5. AAC defaults to Ubuntu 18.04 / Python 3.6.9, while we require Ubuntu 20.04 / Python 3.7+. It might be helpful, after testing, to relax these constraints
  6. Add sudo apt install libjpeg-dev zlib1g-dev to the install instructions. This is required by Pillow

cc: Xiaomin Lu

Error in build system when installing from release tarball

The v1.0.4 release introduced inclusion of the release tag sha in a VERSION.sha file. There is presently an issue in the build system where this value is overwritten with an empty value when doing a cmake build/install starting from the release tarball.

This needs to be fixed so that the correct git sha is displayed when running omniperf with the --version option.

Comparison of two workloads using CLI fails with v1.0.6

I created two workloads by profiling a different kernel each time and tried to compare the performance counters between the two kernels using the following command:

omniperf analyze -p workloads/vcopy_vecCopy/mi200/ -p workloads/vcopy_vecCopy_nocheck/mi200/

This fails with the following error after printing the "System Info" panel (I have intentionally changed the full path to my omniperf install, but this does not change the stack trace otherwise):

Traceback (most recent call last):
  File "/path/to/omniperf/1.0.6/bin/omniperf", line 663, in <module>
    main()
  File "/path/to/omniperf/1.0.6/bin/omniperf", line 643, in main
    analyze(args)
  File "/path/to/omniperf/1.0.6/bin/omniperf_analyze/omniperf_analyze.py", line 250, in analyze
    run_cli(args, runs)
  File "/path/to/omniperf/1.0.6/bin/omniperf_analyze/omniperf_analyze.py", line 199, in run_cli
    tty.show_all(
  File "/path/to/omniperf/1.0.6/bin/omniperf_analyze/utils/tty.py", line 108, in show_all
    base_df[header].astype("double"),
  File "/path/to/omniperf/python-libs/pandas/core/generic.py", line 6240, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  File "/path/to/omniperf/python-libs/pandas/core/internals/managers.py", line 450, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "/path/to/omniperf/python-libs/pandas/core/internals/managers.py", line 352, in apply
    applied = getattr(b, f)(**kwargs)
  File "/path/to/omniperf/python-libs/pandas/core/internals/blocks.py", line 526, in astype
    new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
  File "/path/to/omniperf/python-libs/pandas/core/dtypes/astype.py", line 299, in astype_array_safe
    new_values = astype_array(values, dtype, copy=copy)
  File "/path/to/omniperf/python-libs/pandas/core/dtypes/astype.py", line 230, in astype_array
    values = astype_nansafe(values, dtype, copy=copy)
  File "/path/to/omniperf/python-libs/pandas/core/dtypes/astype.py", line 170, in astype_nansafe
    return arr.astype(dtype, copy=True)
ValueError: could not convert string to float: ''

omniperf v1.0.6 was installed from source using the instructions in this repo's documentation without any trouble.
To reproduce, I took the vcopy.cpp example from this repo and added a new kernel called vecCopy_nocheck where I just commented out the array-bounds check. I also added a call to launch this kernel. My updates can be seen in the following diff:

$ git diff
diff --git a/sample/vcopy.cpp b/sample/vcopy.cpp
index 0eed487..565d8c0 100644
--- a/sample/vcopy.cpp
+++ b/sample/vcopy.cpp
@@ -18,6 +18,12 @@ __global__ void vecCopy(double *a, double *b, double *c, int n,int stride)
         c[id] = a[id];     
     }      
 }
+__global__ void vecCopy_nocheck(double *a, double *b, double *c, int n,int stride)
+{
+    // Get our global thread ID
+    int id = blockIdx.x*blockDim.x+threadIdx.x;
+    c[id] = a[id];
+}
 
 void usage()
 {
@@ -114,6 +120,7 @@ int main( int argc, char* argv[] )
     printf("Launching the  kernel on the GPU\n");
     // Execute the kernel
     hipLaunchKernelGGL(vecCopy, dim3(gridSize), dim3(blockSize), 0, 0, d_a, d_b, d_c, n,stride);
+    hipLaunchKernelGGL(vecCopy_nocheck, dim3(gridSize), dim3(blockSize), 0, 0, d_a, d_b, d_c, n,stride);
     hipDeviceSynchronize( );
     printf("Finished executing kernel\n");
     // Copy array back to host

Now, compile, profile and analyze this workload using the following commands:

hipcc -O3 -o vcopy vcopy.cpp
omniperf profile --device 0 -k vecCopy -n vcopy_vecCopy -- ./vcopy 102400 256 0
omniperf profile --device 0 -k vecCopy_nocheck -n vcopy_vecCopy_nocheck -- ./vcopy 102400 256 0
omniperf analyze -p workloads/vcopy_vecCopy/mi200/ -p workloads/vcopy_vecCopy_nocheck/mi200/

Reduce default content in GUI

In the standalone GUI, when no filters are applied

omniperf analyze -p workloads/sample/mi200/ --gui

the HTML page will load data for every single metric and chart. To reduce loading time and compute, only high-level sections should be displayed:

  • Top Kernels
  • Speed-of-Light
  • Memory Chart

The rest of the information can be displayed once kernel or dispatch filters are applied, which will significantly decrease the compute required to generate results.

Docker setup throws grafana warnings

These warnings suggest to me that this tool may be impossible to use outside a container with an obsolete version of Grafana. While Docker protects you for now...

main/9bc41f3a85b4bea7fa7febdec104983da41b9e51

cd omniperf
sudo docker-compose build
...
[2/5] Resolving packages...
warning @grafana/runtime > @grafana/[email protected]: Package no longer supported. Contact Support at https://www.npmjs.com/support for more info.
warning @grafana/runtime > @grafana/agent-web > @grafana/[email protected]: Package no longer supported. Contact Support at https://www.npmjs.com/support for more info.
warning @grafana/runtime > @grafana/ui > @grafana/slate-react > [email protected]: New custom equality api does not play well with all equality helpers. Please use v5.x
warning @grafana/runtime > @grafana/ui > react-highlight-words > [email protected]: New custom equality api does not play well with all equality helpers. Please use v5.x
warning @grafana/runtime > @grafana/agent-web > @grafana/agent-core > @opentelemetry/[email protected]: Please use @opentelemetry/api >= 1.3.0
warning @grafana/runtime > @grafana/agent-web > @grafana/agent-core > @opentelemetry/otlp-transformer > @opentelemetry/[email protected]: Please use @opentelemetry/api >= 1.3.0
warning @grafana/runtime > @grafana/agent-web > @grafana/agent-core > @opentelemetry/otlp-transformer > @opentelemetry/[email protected]: Please use @opentelemetry/sdk-metrics
warning @grafana/runtime > @grafana/agent-web > @grafana/agent-core > @opentelemetry/otlp-transformer > @opentelemetry/sdk-metrics-base > @opentelemetry/[email protected]: Please use @opentelemetry/api >= 1.3.0
warning @grafana/runtime > @grafana/ui > rc-time-picker > rc-trigger > babel-runtime > [email protected]: core-js@<3.23.3 is no longer maintained and not recommended for usage due to the number of issues. Because of the V8 engine whims, feature detection in old core-js versions could cause a slowdown up to 100x even if nothing is polyfilled. Some versions have web compatibility issues. Please, upgrade your dependencies to the actual version of core-js.
warning @grafana/runtime > @grafana/ui > react-use > nano-css > [email protected]: Please use @jridgewell/sourcemap-codec instead
warning @grafana/toolkit > @grafana/ui > slate-react > [email protected]: New custom equality api does not play well with all equality helpers. Please use v5.x
warning @grafana/toolkit > @jest/core > jest-config > jest-environment-jsdom > jsdom > [email protected]: Use your platform's native performance.now() and performance.timeOrigin.
warning @grafana/toolkit > css-minimizer-webpack-plugin > cssnano > cssnano-preset-default > postcss-svgo > svgo > [email protected]: Modern JS already guarantees Array#sort() is a stable sort, so this library is deprecated. See the compatibility table on MDN: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/sort#browser_compatibility
[3/5] Fetching packages...

Update minimum version requirements for rocm

With the switch to leveraging counter files supplied directly with ROCm, it appears the minimum version check may need to be updated. Running on Crusher, I get an unknown hardware counter in profile mode using rocm/5.1.0. However, rocm/5.2.0 runs without incident.

In the meantime, I have updated the minimum version requirement for the omniperf/1.6.0 module on Crusher to require rocm 5.2.0 or newer.

Add testing for Ubuntu 18.04

There needs to be testing added for Ubuntu 18.04. This is what will be used in an upcoming demo of roofline analysis capabilities and presenters would like to know if there are any missing or incompatible dependencies.

Note: AAC cloud runs on u18.04

Have CI build docs in lieu of current update-docs.sh script

Would be nice to clean up the docs build a bit and have a companion GitHub Action land HTML from markdown. Also, I'd be in favor of cleaning up the branches so that all docs collateral resides only in the gh-pages branch; in that case we would remove it from main and dev.

Filtering by block doesn't consider cross-block dependencies for metrics

Specifically, we noticed this while trying to collect coalescing (which lives in the TCP section):

https://github.com/AMDResearch/omniperf/blob/62d130b458a21a2c964da234cf7a24420e01efe1/src/omniperf_cli/configs/gfx90a/1600_L1_cache.yaml#L20

but uses values from the TA (i.e., TA_TOTAL_WAVEFRONTS_sum).

So, if a user does:

omniperf profile -b TCP -n bar -- <foo>
omniperf analyze -p workloads/bar/mi200

the resulting Buffer Coalescing value in the L1 section will be empty.

Add "per-kernel" normalization mode to standalone GUI

The 'per-kernel' normalization mode present in the Grafana dashboard appears to be missing from the standalone GUI.

This is useful because for some metrics (e.g., requests, bytes moved, etc.) it's often of interest how many there were in total.
For instance, a user might want to see the total number of bytes read from HBM.
Right now with the standalone GUI, the only real option is to choose (e.g.) per-wave and then multiply the reported value by the number of waves.

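The workaround described above amounts to a one-line calculation (the numbers here are illustrative only, not real profile data):

```python
# Recovering a per-kernel total from the per-wave normalization the
# standalone GUI does expose. Both inputs are hypothetical values a
# user would read off the GUI / Top Stats table.
hbm_bytes_per_wave = 256.0   # per-wave "HBM bytes read" metric
num_waves = 400              # wave count for the dispatch

hbm_bytes_per_kernel = hbm_bytes_per_wave * num_waves
assert hbm_bytes_per_kernel == 102400.0
```

Having the GUI perform this multiplication itself, as the Grafana dashboard does, would remove the manual step.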
Dockerfile for ROCm + Omniperf (and more)

Hi again,

Sorry, first of all, if this is the wrong place to post this.

I genuinely wonder whether AMDResearch would be willing to maintain a Dockerfile that ships the following components:

  • ROCm
  • ROCm-aware MPI
  • Omnitrace
  • Omniperf

As a developer, this would significantly ease my (our, at @devitocodes/devito) life. At the same time, I think this would greatly benefit your users. Ultimately ROCm-aware MPI, Omnitrace, and Omniperf will be part of the ROCm suite, I'm sure, but it feels like there's still a long way to go. Interested in your thoughts.

Here's our Dockerfile:

https://github.com/devitocodes/devito/blob/d4e9dc36ff92299644aada824f0ec3786d2f9fef/docker/Dockerfile.amd

The link above is from a PR, but you get the idea. We test it on CI so we know it does work (aside from MPI which still needs to be refreshed).

Apologies again, I know this might not be the best place to have this discussion, but I'm happy to delete and move it if you have a better venue (or to drop it if you're not interested -- not a problem!)

EDIT: Just to clarify: basically, I'm wondering whether it would make sense to lift that Dockerfile from our codebase into one of yours.

Unable to compare 2 kernels from same workload

It would be nice to easily compare 2 kernels from the same workload where counters were collected for all kernels. I would like to use a command such as:

omniperf analyze -p workloads/vcopy_all/mi200 -k 0 -p workloads/vcopy_all/mi200 -k 1

This results in an error though:

Traceback (most recent call last):
  File "/path/to/omniperf/dev/bin/omniperf", line 682, in <module>
    main()
  File "/path/to/omniperf/dev/bin/omniperf", line 662, in main
    analyze(args)
  File "/path/to/omniperf/dev/bin/omniperf_analyze/omniperf_analyze.py", line 253, in analyze
    run_cli(args, runs)
  File "/path/to/omniperf/dev/bin/omniperf_analyze/omniperf_analyze.py", line 195, in run_cli
    parser.load_table_data(
  File "/path/to/omniperf/dev/bin/omniperf_analyze/utils/parser.py", line 706, in load_table_data
    eval_metric(
  File "/path/to/omniperf/dev/bin/omniperf_analyze/utils/parser.py", line 570, in eval_metric
    out = eval(compile(row[expr], "<string>", "eval"))
  File "<string>", line 1
    ������@
        ^
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0x9a in position 0: invalid start byte

A workaround is to make a copy of this workload and use each copy in the analyze command as shown below.

cp -r workloads/vcopy_all workloads/vcopy_all_2
omniperf analyze -p workloads/vcopy_all/mi200 -k 0 -p workloads/vcopy_all_2/mi200 -k 1

A fix would be nice to have. It is not urgent though.

Questions about server side installation

Hello,

just a quick comment about the installation of MongoDB and Grafana via Dockerfile

One of the cool things about docker is that generally you don't need sudo. However, all the commands here prepend it to docker. Is there a particular reason or is it just an oversight?

And, related question, why aren't the MongoDB utils part of the Dockerfile?
Ignore me, I just found out the utils are necessary to import the databases, hence they're needed locally

Thanks a lot!

investigate encoding failure

Ran into this error during an analyze example running on an older Ubuntu 18.04 system that had LANG=en_US by default.

--------
Analyze
--------

Created a Saved Analysis folder

--------------------------------------------------------------------------------
0. Top Stat
Traceback (most recent call last):
  File "/global/scratch/sw/omniperf/1.0.3/bin/omniperf", line 624, in <module>
    main()
  File "/global/scratch/sw/omniperf/1.0.3/bin/omniperf", line 604, in main
    omniperf_cli(args)
  File "/global/scratch/sw/omniperf/1.0.3/bin/omniperf_cli/omniperf_cli.py", line 225, in omniperf_cli
    tty.show_all(
  File "/global/scratch/sw/omniperf/1.0.3/bin/omniperf_cli/utils/tty.py", line 172, in show_all
    print(ss, file=output)
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-107: ordinal not in range(256)

Updating to LANG=en_US.UTF-8 fixed the issue.

We presumably always want to use UTF-8 encoding...
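A possible belt-and-braces fix on the tool side, sketched under the assumption of Python 3.7+ (where TextIOWrapper.reconfigure exists); PYTHONIOENCODING=utf-8 or `python -X utf8` are environment-level alternatives:

```python
import sys

# Force UTF-8 output regardless of LANG/LC_ALL, so box-drawing
# characters in the analysis tables encode cleanly under latin-1
# locales. Guarded so it is a no-op on redirected/wrapped streams.
enc = getattr(sys.stdout, "encoding", None)
if enc and enc.lower() != "utf-8" and hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")

print("│ 0. Top Stat │")  # previously raised UnicodeEncodeError
```

This keeps the tool working even when the user forgets to export a UTF-8 locale.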

Rocprof: Profiling data is corrupt

Description: Attempting to profile RESNET50 workload results in "profiling data is corrupt" message

System details
git checkout python-logging
OS/distro: Ubuntu 5.15.0-52-generic #58~20.04.1-Ubuntu
ROCm Version: 5.2.3
Omniperf Version: 1.0.4dev
Logs of crash output:

Steps to reproduce:

Within docker container of resnet50 (https://confluence.amd.com/display/MLSE/MLPerf-1.1-ResNet50v1.5):
copy this command into run.sh:

#!/bin/bash
python3 -u -m mlperf_utils.bind_launch --nproc_per_node 1 --auto_binding ./main.py --amp --dynamic-loss-scale --lr-schedule polynomial --num-gpus 4 --mom 0.9 --wd 0.0002 --lr 9.1 --prof 100 --warmup 2 --epochs 1 --nhwc --use-lars -b 256 --eval-offset 1 --get-logs --submission-platform MI200system --num-nodes 1 --no-checkpoints --raport-file raport.json -j32 -p 100 --arch resnet50 --data /data/imagenet_pytorch 2>&1 | tee -a run.log.txt

Execute profiling command:
omniperf profile --name resnet50 --path /data/imagenet_pytorch/RN50FP16_DATA2 -- run.sh

At the end of the data-collection run, observe the Profiling data is corrupt message.

Write statistics does not match understanding

Hi, We are running an all-reduce kernel (with remote memory stores) on 4 MI210s and are trying to understand the memory traffic using MIPERF (snapshot for one is attached). We are unclear about what each of the Writes are counting and had the following questions we were hoping you could help with:

i) We find that the ‘Write (64B)’ is the sum of ‘Write (Uncached 32B)’ and ‘HBM Write’ (minus the ‘Write (32B)’, which is small anyway).
- Why are 32B writes (‘Write (Uncached 32B)’) being counted as 64B writes (‘Write (64B)’)? Are the ‘Write (Uncached 32B)’ actually 64B?
- Are ‘HBM Write’ also 64B writes?
ii) Should we use ‘HBM Write’ and ‘Write (Uncached 32B)’ separately as 64B and 32B writes, instead of considering the combined ‘Write (64B)’?

Thank you!
(screenshot: write_stat)

Add L1<->L2 bandwidth calculation

Omniperf currently does not report the achieved L2 bandwidth from the L1s, despite collecting the counters required to do so.
Following the convention for the L1 bandwidth calculations, this is essentially the total amount of data moved between the L1s and the L2, which can be calculated from the L1<->L2 requests, e.g.:

https://github.com/AMDResearch/omniperf/blob/main/src/omniperf_cli/configs/gfx90a/1600_L1_cache.yaml#L173

The L2 bandwidth calculation would be:

L2 BW = 64B * (TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) / $denom
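Plugging made-up counter values into the formula above, with $denom taken as kernel duration in seconds, gives a bytes-per-second figure (all numbers here are invented for illustration):

```python
# Illustrative counter values; the names follow the formula above.
TCP_TCC_READ_REQ_sum = 1_000_000
TCP_TCC_WRITE_REQ_sum = 500_000
TCP_TCC_ATOMIC_WITH_RET_REQ_sum = 0
TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum = 0
duration_s = 1.0e-3  # $denom chosen as kernel duration, in seconds

total_reqs = (
    TCP_TCC_READ_REQ_sum
    + TCP_TCC_WRITE_REQ_sum
    + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
    + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum
)
# Each L1<->L2 request moves a 64B cache line.
l2_bw = 64 * total_reqs / duration_s  # ~96 GB/s for these numbers
```

Choosing $denom as cycles or waves instead would yield bytes/cycle or bytes/wave, matching the existing normalization modes.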

Provide Binary with Tags/Releases

Some users will want to test the software without a full install. Release a binary for simple testing and installs with tags and major releases.

omniperf fails to perf a python based command line

I know that this has been reported internally but I thought it would be useful to leave a trace here on GitHub.

Reproducer:

omniperf profile -n devito_iso -- /global/home/ymmu/projects/devito-venv/bin/python devitopro/demos/iso_acoustic/run.py -d 512 512 512 -so 8 --nt 10 -opt "('advanced', {'par-tile': (32, 4, 4)})"

fails with:

Kernel Selection:  None
Dispatch Selection:  None
IP Blocks: All
RPL: on '221122_144013' from '/opt/rocm-5.1.3/rocprofiler' in '/app'
RPL: profiling '""/venv/bin/python devitopro/demos/iso_acoustic/run.py -d 512 512 512 -nt 400 -so 8 -opt ('advanced', {'par-tile': (32, 4, 8)})""'
RPL: input file '/app/workloads/omniperf-iso-acoustic/mi200/perfmon/SQ_INST_LEVEL_LDS.txt'
RPL: output dir '/tmp/rpl_data_221122_144013_1247'
RPL: result dir '/tmp/rpl_data_221122_144013_1247/input0_results_221122_144013'
/usr/bin/rocprof: eval: line 286: syntax error near unexpected token `('

<error trace continues>

If I remove the -opt "('advanced', {'par-tile': (32, 4, 4)})" part, then it works.

Switching "Normalization" doesn't seem to work

Hi,

Quick question. I import my dataset, I can navigate it, all fine...
Then I want to switch Normalization (top-left) from "per Wave" to "per Kernel", because I'm comparing two different versions of the same algorithm, but one of them generates many more waves than the other. However, after switching, nothing happens. I tried refreshing the page and other things, but nothing changed. I'm not sure how to create a reproducer for this aside from letting you access the Grafana instance on our remote server. But first of all -- am I the only one experiencing this?

Thanks again

Add better error detection when ROCm install is incomplete

Omniperf presently relies on the .info directory included with normal ROCm install to determine versioning information. If this directory is missing (say, due to incomplete ROCm install), the user will encounter runtime errors.

Improve the error message in this case to indicate the ROCm installation is incomplete.
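A minimal sketch of such a guard, assuming the conventional /opt/rocm/.info/version layout (the helper name and error message are hypothetical):

```python
from pathlib import Path

def rocm_version(rocm_path="/opt/rocm"):
    """Read the ROCm version, failing with a clear message when the
    .info directory a packaged install ships is missing, instead of
    letting a downstream lookup crash with an opaque error."""
    version_file = Path(rocm_path) / ".info" / "version"
    if not version_file.is_file():
        raise RuntimeError(
            f"{version_file} not found; the ROCm installation at "
            f"{rocm_path} appears incomplete or non-standard."
        )
    return version_file.read_text().strip()
```

The same check could run once at startup so the user sees one actionable message rather than a traceback.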

[Feature Request] Kernel Replay

Use cases:

  • often there is significant run-to-run variation of an application due to inherent randomness, e.g., for Monte-Carlo simulations.
  • rocprof doesn't play well with MPI, which makes it difficult to collect the multiple sets of counters required for omniperf. This is because rocprof's replay mode (application replay) requires that rocprof launch the MPI command (e.g., rocprof <...> mpirun <...> application <...>), which is generally unsupported, as re-launching an MPI command is poorly defined.

Some possible short-term solutions:

  1. Allow the user to query the number of application runs that will be required, and add a "--pass <XYZ>" argument to let them manually script up a way to repeatedly run the application, collecting a different set of passes each time. This can potentially alleviate the "rocprof / mpirun" issue, but doesn't do much for applications with significant non-deterministic behavior.
  2. 'Stochastic mode' -- implement a tool wrapper around the rocprofiler library that randomly selects a subset of counters that can give 'complete' metrics (that is, it should select both the level counters and the values being counted, etc.) This can likely help both cases, but doesn't do much if a user wants all possible information for a very specific dispatch

omniperf analyze statistics does not match understanding

I have been using omniperf to analyze some applications. I ran a simple 8x8x8 GEMM in BF16 data format using the following command line:
omniperf profile -n gemm_m8_k8_n8 -d 1 --device 0 -- ./rocBLAS/build/release/clients/staging/rocblas-bench -m 8 -k 8 -n 8 -f gemm_ex -r bf16_r --compute_type f32_r -i 1 -j 1  --device 0

After running omniperf analyze -p gemm_m8_k8_n8 I get the following output:
(screenshot attached)

The highlighted metric MFMA Flops (BF16) does not make sense. I expect 8x8x8x2 = 1024 flops.

The kernel takes 14.8 us; see below:
(screenshot attached)

So I expect 1024/(14.8 * 1e-6) = 69.2 million FLOPS ~ 0.069 GFLOPS.

But I see 4.4 Gflops. How is this calculated?
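For reference, the poster's expectation worked through in full (numbers taken from the issue itself; this does not explain the 4.4 GFLOPS reading, it only checks the hand calculation):

```python
# A dense MxNxK GEMM performs 2*M*N*K floating-point operations
# (one multiply plus one add per multiply-accumulate).
M = N = K = 8
flops = 2 * M * N * K            # = 1024 FLOPs
duration_s = 14.8e-6             # kernel time from the profile

gflops = flops / duration_s / 1e9
assert abs(gflops - 0.0692) < 0.0005   # ~0.07 GFLOPS expected
```

Any answer to the question would need to explain the roughly 60x gap between this figure and the reported 4.4 GFLOPS.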

Suggestion: workload names should be checked before profiling to prevent "'-' and '.' are not permited in workload name" errors during import

I created a few workloads with - in their names but failed to import them into the database due to:

Traceback (most recent call last):
  File "/home/.../omniperf/install/1.0.5/bin/omniperf", line 663, in <module>
    main()
  File "/home/.../omniperf/install/1.0.5/bin/omniperf", line 609, in main
    mongo_import(args, False)
  File "/home/.../omniperf/install/1.0.5/bin/omniperf", line 205, in mongo_import
    connectionInfo, Extractionlvl = csv_converter.parse(args, profileAndImport)
  File "/home/.../omniperf/install/1.0.5/bin/utils/csv_converter.py", line 165, in parse
    raise ValueError("'-' and '.' are not permited in workload name", db)

It would be great to have such a check before profiling.
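A hypothetical pre-flight check that mirrors the importer's rule; the helper name is made up, and the real restriction lives in csv_converter.parse as quoted in the trace above:

```python
import re

def check_workload_name(name):
    """Reject '-' and '.' up front, before any profiling time is
    spent, matching the restriction the MongoDB importer enforces."""
    if re.search(r"[-.]", name):
        raise ValueError(
            f"'-' and '.' are not permitted in workload name: {name!r}"
        )
    return name

check_workload_name("vcopy_all")  # passes
```

Calling this when parsing the `-n/--name` argument would turn a late import failure into an immediate, fixable error.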

Update host detection for thera

Need to update host detection for the Thera system in order to enable modulefile customization.

Additional hostnames where admins install from are:

  • TheraS01
  • thera-hn

Pull images for CI from Docker Hub

As pointed out by @koomie -

Our CI framework spends ~10 minutes each run installing ROCm in the testing container. We can speed things up by pulling an image from Docker Hub that already has ROCm installed.

[Feature Request] Progress Bar/Indicator

There is no obvious indication to the user when the standalone GUI is loading data (i.e. on page refresh or data filtering).

Adding a progress bar with percentage completion would resolve this confusion and make progress clearer to the user. There's a known number of tasks to be completed on each request from the front-end; leverage this to build a progress bar.

(See dash-bootstrap-components)

Requesting update to Readme with a "How to cite" section

It would be great to add a How to cite section to the README, as we expect a lot of our Instinct customers will be keen on using the tool and presenting their results in research papers and at conferences.

This tool will add a tremendous value to our application developers.

Merge roofline modules

At the moment, there are two areas where Omniperf computes Empirical Roofline data

  1. src/utils/plot_roofline.py
  2. src/omniperf_analyze/utils/roofline_calc.py

A lot of this code is duplicated, so I propose we reorganize it into one module.

(1) is used in the standalone roofline capability (i.e. --roof-only) and generates a .pdf roofline file using matplotlib.
(2) returns the critical data points for the roofline to our Dash interface, where the plot is sent to the HTML webpage.

Grafana GUI documentation unclear

Documentation available at:

https://amdresearch.github.io/omniperf/grafana_analyzer.html#grafana-gui-import

Issue: to upload a database to our MongoDB server I was running

omniperf database --import -H <host_ip> -u admin -t asw -w workloads/devito_iso/mi200/

--------
Import Profiling Results
--------

Pulling data from  /app/workloads/devito_iso/mi200
The directory exists
Found sysinfo file
KernelName shortening enabled
Kernel name verbose level: 2
-- Conversion & Upload in Progress --
ERROR: Unable to connect to the server

After a short while (but still, on the order of minutes) I realised that -u admin was wrong -- because admin is what was used for the Grafana service, not MongoDB. For someone like me who didn't (doesn't) know anything about Grafana and MongoDB, the confusion is perhaps a bit more justified...

So then I started following the docs strictly, that is, I started using -u temp. However, I was then prompted for a password. (Much) later on, I realised that the MongoDB password was hardcoded in the Dockerfile. This could be improved; I think two or three more lines in the docs would be enough.

This was co-debugged with @ggorman -- just to be sure it wasn't me having an unlucky day

Enable multi-normalization

At the moment the only normalization supported in the standalone GUI is "per Wave". Enable normalizations for

  • "per Cycle"
  • "per Kernel"
  • "per Sec"

Unable to profile DLM: KeyError: 'BeginNs'

Description: Some workloads fail on timestamp generation; the ShibuyaStream and DLM accuracy tests fail.

OS/distro: Ubuntu 5.15.0-52-generic #58~20.04.1-Ubuntu
ROCm Version: 5.2.0
Omniperf Version: 1.0.4dev
Logs of crash output:


[433 rows x 17 columns]
File 'dml_profile_DEEPSPEED_ROBERTA_data/dml_profile_DEEPSPEED_ROBERTA/mi200/timestamps.csv' is generating
Traceback (most recent call last):
  File "/home/svt/clement/omni/python-libs/pandas/core/indexes/base.py", line 3803, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'BeginNs'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/svt/clement/omni/1.0.4-dev/bin/omniperf", line 630, in <module>
    main()
  File "/home/svt/clement/omni/1.0.4-dev/bin/omniperf", line 525, in main
    omniperf_profile(args,VER)
  File "/home/svt/clement/omni/1.0.4-dev/bin/omniperf", line 376, in omniperf_profile
    replace_timestamps(workload_dir)
  File "/home/svt/clement/omni/1.0.4-dev/bin/omniperf", line 113, in replace_timestamps
    df_pmc_perf["BeginNs"] = df_stamps["BeginNs"]
  File "/home/svt/clement/omni/python-libs/pandas/core/frame.py", line 3804, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/svt/clement/omni/python-libs/pandas/core/indexes/base.py", line 3805, in get_loc
    raise KeyError(key) from err
KeyError: 'BeginNs'

Steps to reproduce:

  1. Install ROCm and omniperf.
  2. Set export variables as per the installation instructions.
  3. Clone the models and modify tags.json:

git clone https://github.com/ROCmSoftwarePlatform/DeepLearningModels
cd DeepLearningModels
#modify the tags.json to the following:
{
        "tags": [
                "pyt_train_huggingface_distilbert"
        ]
}

  4. Run:

omniperf profile --name dml_profile --path dml_profile_data echo val | sudo -S ./tools/run_models.py --timeout 0

  5. Observe the failure after 23 loops.

Expected: timestamps.csv is generated and profiling succeeds.
Actual: KeyError; timestamps.csv is EMPTY; profiling fails.
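A defensive check around the timestamp merge would turn the opaque KeyError into an actionable message. The sketch below mirrors the replace_timestamps step from the traceback, but it is an illustrative suggestion, not the project's actual code:

```python
import pandas as pd

def replace_timestamps(pmc_perf: pd.DataFrame, stamps: pd.DataFrame) -> pd.DataFrame:
    """Copy BeginNs/EndNs into the counter table, failing with a clear
    message when timestamps.csv came back empty or malformed (as happens
    when the underlying rocprof pass fails on a workload)."""
    required = {"BeginNs", "EndNs"}
    missing = required - set(stamps.columns)
    if stamps.empty or missing:
        raise RuntimeError(
            "timestamps.csv is empty or missing columns %s; "
            "rocprof likely failed on this workload" % sorted(missing)
        )
    out = pmc_perf.copy()
    out["BeginNs"] = stamps["BeginNs"].values
    out["EndNs"] = stamps["EndNs"].values
    return out
```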

fix versioning info for submodes

Noticed this in the current release: the version info has a funky string for the submodes:

ok

$ ./omniperf -v
omniperf (1.0.3)

funky

$ ./omniperf analyze -v
%(PROG)s (1.0.3)
$ ./omniperf database -v
%(PROG)s (1.0.3)
$ ./omniperf database -v
%(PROG)s (1.0.3)

Improve documentation for usage with multi-process runs

Could some guidance be added to the documentation on using omniperf with MPI jobs? Should we collect profiles for one rank only, using a wrapper script (see the example below) invoked as mpirun <...> wrapper_omniperf.sh <...> <exe>? Or should we run omniperf <...> mpirun <...> <exe>?
A sample wrapper script that I tried using is:

#! /usr/bin/env bash
if [[ -n ${OMPI_COMM_WORLD_RANK+z} ]]; then
  # Open MPI
  export MPI_RANK=${OMPI_COMM_WORLD_RANK}
elif [[ -n ${MV2_COMM_WORLD_RANK+z} ]]; then
  # MVAPICH2
  export MPI_RANK=${MV2_COMM_WORLD_RANK}
elif [[ -n ${SLURM_PROCID+z} ]]; then
  # MPI launched via srun
  export MPI_RANK=${SLURM_PROCID}
fi
if [[ ${MPI_RANK} == "0" ]]; then
  # Profile only rank 0
  eval "omniperf profile -n <workload_name> -k <kernel_name> -b <ip_block> -- $*"
else
  # Run the application unmodified on all other ranks
  "$@"
fi

It crashes when rocprof (invoked internally by omniperf) tries to collect counters that are split into multiple groups.

Is this a fundamental limitation on MI50, or is the tool completely unusable there?

Hi, I would like to use your tool to develop an analytical modeling tool for AMD GPUs. I only have an MI50, but this GPU is marked as unsupported in your documentation. I want to check whether this is a fundamental hardware limitation or whether the tool is simply unusable on it. Thanks for your help.

Store application parameters in profiling output

As a 3rd party reviewing workloads in Grafana, it would be nice to track, and get better insight into, how the app was invoked. App parameters can make a huge difference in the profiling results.

For example, comparing BabelStream data I see that a different ROCm stack / CPU was present in each SUT, and some of the kernels are the same across the two data sets. I'd like to know the BabelStream parameters: what was on the command line when it was launched?

cc: James Dezelle
