
omniperf's Introduction


Omniperf

General

Omniperf is a system performance profiling tool for machine learning/HPC workloads running on AMD MI GPUs. The tool presently targets usage on MI100 and MI200 accelerators.

  • For more information on available features, installation steps, and workload profiling and analysis, please refer to the online documentation.

  • Omniperf is an AMD open source research project and is not supported as part of the ROCm software stack. We welcome contributions and feedback from the community. Please see the CONTRIBUTING.md file for additional details on our contribution process.

  • Licensing information can be found in the LICENSE file.

Development

Omniperf follows a main-dev branching model. As a result, our latest stable release is shipped from the main branch, while new features are developed in our dev branch.

Users may check out dev to preview upcoming features.

How to Cite

This software can be cited using a Zenodo DOI reference. A BibTeX-style reference is provided below for convenience:

@software{xiaomin_lu_2022_7314631,
  author       = {Xiaomin Lu and
                  Cole Ramos and
                  Fei Zheng and
                  Karl W. Schulz and
                  Jose Santos and
                  Keith Lowery and
                  Nicholas Curtis and
                  Cristian Di Pietrantonio},
  title        = {AMDResearch/omniperf: v1.1.0-PR1 (13 Oct 2023)},
  month        = oct,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {v1.1.0-PR1},
  doi          = {10.5281/zenodo.7314631},
  url          = {https://doi.org/10.5281/zenodo.7314631}
}


omniperf's Issues

Output data format

omniperf -v
omniperf (1.0.3)

omniperf analyze -p workloads/kernel/mi200/ -o doesnotexist

--------
Analyze
--------

Saved Analysis folder exists

Why does omniperf analyze say: Saved Analysis folder exists ? The folder does not exist. If I create the directory and rerun, I get:

IsADirectoryError: [Errno 21] Is a directory: 'doesnotexist'

I guess that omniperf says Saved analysis folder exists because I pointed it to the existing workload workloads/kernel/mi200.

Is the output I get via -o just the stdout? It would be helpful if the data were available as, say, SQLite, JSON, or even a Python pickle file. Such an option would facilitate further analysis.
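A hedged sketch of what such an export could look like with pandas (the column names here are hypothetical, not omniperf's actual schema):

```python
import sqlite3

import pandas as pd

# Hypothetical analysis table; omniperf's real column names may differ.
df = pd.DataFrame({"Kernel_Name": ["vecCopy"], "Calls": [1]})

# JSON: one record per row, easy to post-process with jq or Python.
json_text = df.to_json(orient="records")

# SQLite: queryable afterwards with any SQL tool.
conn = sqlite3.connect(":memory:")
df.to_sql("top_stats", conn, index=False)
rows = conn.execute("SELECT Kernel_Name, Calls FROM top_stats").fetchall()
conn.close()
```

Either format would let downstream scripts consume analysis results without scraping the terminal output.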

Issue using Matplotlib with X-server

This seems to be an issue that only occurs when you are connected via ssh to a system that has an X-server running.

https://github.com/AMDResearch/omniperf/blob/796c495d0b41bb63f62220278bd4fdca323a463b/src/utils/plot_roofline.py#L31

The reason is that when an X-server is running, matplotlib by default tries to connect to it, can't, and throws fatal exceptions.

Suggested Fix

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

The use('Agg') call makes matplotlib use a non-interactive backend that can only write to files. This is apparently required if you happen to run omniperf on a system with X running.
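As an alternative to patching the source, matplotlib also honors the MPLBACKEND environment variable, which is read at import time; a small sketch (assuming a reasonably recent matplotlib):

```python
import os

# Force the file-only Agg backend without editing plot_roofline.py.
# MPLBACKEND must be set before the first matplotlib import.
os.environ["MPLBACKEND"] = "Agg"

import matplotlib.pyplot as plt  # must come after the variable is set

# File output works with no X connection at all under Agg.
fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])
fig.savefig("roofline.png")
```

Exporting MPLBACKEND=Agg in the shell before running omniperf would have the same effect.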

CC: @keithloweryamd

Extending OmniXXX to profile/trace EPYC CPUs

Hello,

The functionality provided by the omniXXX packages could be extended to provide similar performance information on AMD EPYC CPUs. We are lacking tools to produce Roofline curves for EPYC DRAM/L3/L2/L1. Although floating-point (SP/DP) limits can be obtained with synthetic benchmarks, cache-hierarchy bandwidths and latencies require specialized low-level instructions.

I am suggesting that the omni tools be extended to provide such information for the various AMD EPYC CPUs as well.

You could incorporate this functionality into AMDuProf, or at least add L3 bandwidth (and latency) profiles.

thank you
Michael Thomadakis

AAC Requirements

  1. Need instructions to install docker-compose and volumes (added in v1.0.3)
  2. Need to install CMake 3.19 separately
  3. Modify the Python dependency instructions to python3 -m pip install --system -t ${INSTALL_DIR}/python-libs -r requirements.txt
  4. We see a matplotlib conflict with the default installation, resulting in omniperf failure
  5. AAC defaults to Ubuntu 18.04 / Python 3.6.9, while we require Ubuntu 20.04 / Python 3.7+. It might be helpful, after testing, to relax these constraints
  6. Add sudo apt install libjpeg-dev zlib1g-dev to the install instructions. This is required by Pillow

cc: Xiaomin Lu

Error in build system when installing from release tarball

The v1.0.4 release introduced inclusion of the release tag sha in a VERSION.sha file. There is presently an issue in the build system where this value is overwritten with an empty value when doing a cmake build/install starting from the release tarball.

This needs to be fixed so that the correct git sha is displayed when running omniperf with the --version option.

Comparison of two workloads using CLI fails with v1.0.6

I created two workloads by profiling a different kernel each time and tried to compare the performance counters between the two kernels using the following command:

omniperf analyze -p workloads/vcopy_vecCopy/mi200/ -p workloads/vcopy_vecCopy_nocheck/mi200/

This fails with the following error after printing the "System Info" panel (I have intentionally changed the full path to my omniperf install, but this does not change the stack trace otherwise):

Traceback (most recent call last):
  File "/path/to/omniperf/1.0.6/bin/omniperf", line 663, in <module>
    main()
  File "/path/to/omniperf/1.0.6/bin/omniperf", line 643, in main
    analyze(args)
  File "/path/to/omniperf/1.0.6/bin/omniperf_analyze/omniperf_analyze.py", line 250, in analyze
    run_cli(args, runs)
  File "/path/to/omniperf/1.0.6/bin/omniperf_analyze/omniperf_analyze.py", line 199, in run_cli
    tty.show_all(
  File "/path/to/omniperf/1.0.6/bin/omniperf_analyze/utils/tty.py", line 108, in show_all
    base_df[header].astype("double"),
  File "/path/to/omniperf/python-libs/pandas/core/generic.py", line 6240, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  File "/path/to/omniperf/python-libs/pandas/core/internals/managers.py", line 450, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "/path/to/omniperf/python-libs/pandas/core/internals/managers.py", line 352, in apply
    applied = getattr(b, f)(**kwargs)
  File "/path/to/omniperf/python-libs/pandas/core/internals/blocks.py", line 526, in astype
    new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
  File "/path/to/omniperf/python-libs/pandas/core/dtypes/astype.py", line 299, in astype_array_safe
    new_values = astype_array(values, dtype, copy=copy)
  File "/path/to/omniperf/python-libs/pandas/core/dtypes/astype.py", line 230, in astype_array
    values = astype_nansafe(values, dtype, copy=copy)
  File "/path/to/omniperf/python-libs/pandas/core/dtypes/astype.py", line 170, in astype_nansafe
    return arr.astype(dtype, copy=True)
ValueError: could not convert string to float: ''

omniperf v1.0.6 was installed from source using the instructions in this repo's documentation without any trouble.
To reproduce, I took the vcopy.cpp example from this repo and added a new kernel called vecCopy_nocheck where I just commented out the array-bounds check. I also added a call to launch this kernel. My updates can be seen in the following diff:

$ git diff
diff --git a/sample/vcopy.cpp b/sample/vcopy.cpp
index 0eed487..565d8c0 100644
--- a/sample/vcopy.cpp
+++ b/sample/vcopy.cpp
@@ -18,6 +18,12 @@ __global__ void vecCopy(double *a, double *b, double *c, int n,int stride)
         c[id] = a[id];     
     }      
 }
+__global__ void vecCopy_nocheck(double *a, double *b, double *c, int n,int stride)
+{
+    // Get our global thread ID
+    int id = blockIdx.x*blockDim.x+threadIdx.x;
+    c[id] = a[id];
+}
 
 void usage()
 {
@@ -114,6 +120,7 @@ int main( int argc, char* argv[] )
     printf("Launching the  kernel on the GPU\n");
     // Execute the kernel
     hipLaunchKernelGGL(vecCopy, dim3(gridSize), dim3(blockSize), 0, 0, d_a, d_b, d_c, n,stride);
+    hipLaunchKernelGGL(vecCopy_nocheck, dim3(gridSize), dim3(blockSize), 0, 0, d_a, d_b, d_c, n,stride);
     hipDeviceSynchronize( );
     printf("Finished executing kernel\n");
     // Copy array back to host

Now, compile, profile and analyze this workload using the following commands:

hipcc -O3 -o vcopy vcopy.cpp
omniperf profile --device 0 -k vecCopy -n vcopy_vecCopy -- ./vcopy 102400 256 0
omniperf profile --device 0 -k vecCopy_nocheck -n vcopy_vecCopy_nocheck -- ./vcopy 102400 256 0
omniperf analyze -p workloads/vcopy_vecCopy/mi200/ -p workloads/vcopy_vecCopy_nocheck/mi200/

Reduce default content in GUI

In the standalone GUI, when no filters are applied

omniperf analyze -p workloads/sample/mi200/ --gui

the HTML page will load data for every single metric and chart. To reduce loading time and compute, only high-level sections should be displayed:

  • Top Kernels
  • Speed-of-Light
  • Memory Chart

The rest of the information can be displayed once kernel or dispatch filters are applied, which will significantly decrease the compute required to generate results.

Docker setup throws grafana warnings

These warnings suggest to me that this tool may be impossible to use outside a container with an obsolete version of Grafana. While Docker protects you for now...

main/9bc41f3a85b4bea7fa7febdec104983da41b9e51

cd omniperf
sudo docker-compose build
...
[2/5] Resolving packages...
warning @grafana/runtime > @grafana/[email protected]: Package no longer supported. Contact Support at https://www.npmjs.com/support for more info.
warning @grafana/runtime > @grafana/agent-web > @grafana/[email protected]: Package no longer supported. Contact Support at https://www.npmjs.com/support for more info.
warning @grafana/runtime > @grafana/ui > @grafana/slate-react > [email protected]: New custom equality api does not play well with all equality helpers. Please use v5.x
warning @grafana/runtime > @grafana/ui > react-highlight-words > [email protected]: New custom equality api does not play well with all equality helpers. Please use v5.x
warning @grafana/runtime > @grafana/agent-web > @grafana/agent-core > @opentelemetry/[email protected]: Please use @opentelemetry/api >= 1.3.0
warning @grafana/runtime > @grafana/agent-web > @grafana/agent-core > @opentelemetry/otlp-transformer > @opentelemetry/[email protected]: Please use @opentelemetry/api >= 1.3.0
warning @grafana/runtime > @grafana/agent-web > @grafana/agent-core > @opentelemetry/otlp-transformer > @opentelemetry/[email protected]: Please use @opentelemetry/sdk-metrics
warning @grafana/runtime > @grafana/agent-web > @grafana/agent-core > @opentelemetry/otlp-transformer > @opentelemetry/sdk-metrics-base > @opentelemetry/[email protected]: Please use @opentelemetry/api >= 1.3.0
warning @grafana/runtime > @grafana/ui > rc-time-picker > rc-trigger > babel-runtime > [email protected]: core-js@<3.23.3 is no longer maintained and not recommended for usage due to the number of issues. Because of the V8 engine whims, feature detection in old core-js versions could cause a slowdown up to 100x even if nothing is polyfilled. Some versions have web compatibility issues. Please, upgrade your dependencies to the actual version of core-js.
warning @grafana/runtime > @grafana/ui > react-use > nano-css > [email protected]: Please use @jridgewell/sourcemap-codec instead
warning @grafana/toolkit > @grafana/ui > slate-react > [email protected]: New custom equality api does not play well with all equality helpers. Please use v5.x
warning @grafana/toolkit > @jest/core > jest-config > jest-environment-jsdom > jsdom > [email protected]: Use your platform's native performance.now() and performance.timeOrigin.
warning @grafana/toolkit > css-minimizer-webpack-plugin > cssnano > cssnano-preset-default > postcss-svgo > svgo > [email protected]: Modern JS already guarantees Array#sort() is a stable sort, so this library is deprecated. See the compatibility table on MDN: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/sort#browser_compatibility
[3/5] Fetching packages...

Update minimum version requirements for rocm

With the switch to leveraging counter files supplied directly with ROCm, it appears the minimum version check may need to be updated. Running on Crusher, I get an unknown hardware counter in profile mode using rocm/5.1.0. However, rocm/5.2.0 runs without incident.

In the meantime, I have updated the minimum version requirement for the omniperf/1.6.0 module on Crusher to require rocm 5.2.0 or newer.

Add testing for Ubuntu 18.04

There needs to be testing added for Ubuntu 18.04. This is what will be used in an upcoming demo of roofline analysis capabilities and presenters would like to know if there are any missing or incompatible dependencies.

Note: AAC cloud runs on u18.04

Have CI build docs in lieu of current update-docs.sh script

Would be nice to clean up the docs build a bit and have a companion GitHub Action land HTML from markdown. Also, I'd be in favor of cleaning up the branches so that all docs collateral resides only in the gh-pages branch; in that case we would remove it from main and dev.

Filtering by block doesn't consider cross-block dependencies for metrics

Specifically, we noticed this while trying to collect coalescing (which lives in the TCP section):

https://github.com/AMDResearch/omniperf/blob/62d130b458a21a2c964da234cf7a24420e01efe1/src/omniperf_cli/configs/gfx90a/1600_L1_cache.yaml#L20

but uses values from the TA (i.e., TA_TOTAL_WAVEFRONTS_sum).

So, if a user does:

omniperf profile -b TCP -n bar -- <foo>
omniperf analyze -p workloads/bar/mi200

the resulting Buffer Coalescing value in the L1 section will be empty.

Add "per-kernel" normalization mode to standalone GUI

The 'per-kernel' normalization mode present in the Grafana dashboard appears to be missing from the standalone GUI.

This is useful because for some metrics (e.g., requests, bytes moved, etc.) it's often of interest how many there were in total.
For instance, a user might want to see the total number of bytes read from HBM.
Right now with the standalone GUI, the only real option is to choose (e.g.) per-wave and then multiply the reported value by the number of waves.

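The workaround described above amounts to a one-line calculation (the numbers here are illustrative only, not real profile data):

```python
# Recovering a per-kernel total from the per-wave normalization the
# standalone GUI does expose. Both inputs are hypothetical values a
# user would read off the GUI / Top Stats table.
hbm_bytes_per_wave = 256.0   # per-wave "HBM bytes read" metric
num_waves = 400              # wave count for the dispatch

hbm_bytes_per_kernel = hbm_bytes_per_wave * num_waves
assert hbm_bytes_per_kernel == 102400.0
```

Having the GUI perform this multiplication itself, as the Grafana dashboard does, would remove the manual step.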
Dockerfile for ROCm + Omniperf (and more)

Hi again,

Sorry, first of all, if this is the wrong place to post this.

I genuinely wonder whether AMDResearch would be willing to maintain a Dockerfile that ships the following components:

  • ROCm
  • ROCm-aware MPI
  • Omnitrace
  • Omniperf

As a developer, this would significantly ease my (our, at @devitocodes/devito) life. At the same time, I think this would greatly benefit your users. Ultimately ROCm-aware MPI, Omnitrace, and Omniperf will be part of the ROCm suite, I'm sure, but it feels like there's still a long way to go. Interested in your thoughts.

Here's our Dockerfile:

https://github.com/devitocodes/devito/blob/d4e9dc36ff92299644aada824f0ec3786d2f9fef/docker/Dockerfile.amd

The link above is from a PR, but you get the idea. We test it on CI so we know it does work (aside from MPI which still needs to be refreshed).

Apologies again, I know this might not be the best place to have this discussion, but I'm happy to delete and move it if you have a better venue (or to drop it if you're not interested -- not a problem!)

EDIT: Just to clarify: basically, I'm wondering whether it would make sense to lift that Dockerfile from our codebase into one of yours.

Unable to compare 2 kernels from same workload

It would be nice to easily compare 2 kernels from the same workload where counters were collected for all kernels. I would like to use a command such as:

omniperf analyze -p workloads/vcopy_all/mi200 -k 0 -p workloads/vcopy_all/mi200 -k 1

This results in an error though:

Traceback (most recent call last):
  File "/path/to/omniperf/dev/bin/omniperf", line 682, in <module>
    main()
  File "/path/to/omniperf/dev/bin/omniperf", line 662, in main
    analyze(args)
  File "/path/to/omniperf/dev/bin/omniperf_analyze/omniperf_analyze.py", line 253, in analyze
    run_cli(args, runs)
  File "/path/to/omniperf/dev/bin/omniperf_analyze/omniperf_analyze.py", line 195, in run_cli
    parser.load_table_data(
  File "/path/to/omniperf/dev/bin/omniperf_analyze/utils/parser.py", line 706, in load_table_data
    eval_metric(
  File "/path/to/omniperf/dev/bin/omniperf_analyze/utils/parser.py", line 570, in eval_metric
    out = eval(compile(row[expr], "<string>", "eval"))
  File "<string>", line 1
    ������@
        ^
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0x9a in position 0: invalid start byte

A workaround is to make a copy of this workload and use each copy in the analyze command as shown below.

cp -r workloads/vcopy_all workloads/vcopy_all_2
omniperf analyze -p workloads/vcopy_all/mi200 -k 0 -p workloads/vcopy_all_2/mi200 -k 1

A fix would be nice to have. It is not urgent though.

Questions about server side installation

Hello,

just a quick comment about the installation of MongoDB and Grafana via Dockerfile

One of the cool things about docker is that generally you don't need sudo. However, all the commands here prepend it to docker. Is there a particular reason or is it just an oversight?

And, related question, why aren't the MongoDB utils part of the Dockerfile?
Ignore me, I just found out the utils are necessary to import the databases, hence they're needed locally

Thanks a lot!

investigate encoding failure

Ran into this error during an analyze example running on an older Ubuntu 18.04 system that had LANG=en_US by default.

--------
Analyze
--------

Created a Saved Analysis folder

--------------------------------------------------------------------------------
0. Top Stat
Traceback (most recent call last):
  File "/global/scratch/sw/omniperf/1.0.3/bin/omniperf", line 624, in <module>
    main()
  File "/global/scratch/sw/omniperf/1.0.3/bin/omniperf", line 604, in main
    omniperf_cli(args)
  File "/global/scratch/sw/omniperf/1.0.3/bin/omniperf_cli/omniperf_cli.py", line 225, in omniperf_cli
    tty.show_all(
  File "/global/scratch/sw/omniperf/1.0.3/bin/omniperf_cli/utils/tty.py", line 172, in show_all
    print(ss, file=output)
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-107: ordinal not in range(256)

Updating to LANG=en_US.UTF-8 fixed the issue.

We presumably always want to use UTF-8 encoding...
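A possible belt-and-braces fix on the tool side, sketched under the assumption of Python 3.7+ (where TextIOWrapper.reconfigure exists); PYTHONIOENCODING=utf-8 or `python -X utf8` are environment-level alternatives:

```python
import sys

# Force UTF-8 output regardless of LANG/LC_ALL, so box-drawing
# characters in the analysis tables encode cleanly under latin-1
# locales. Guarded so it is a no-op on redirected/wrapped streams.
enc = getattr(sys.stdout, "encoding", None)
if enc and enc.lower() != "utf-8" and hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")

print("│ 0. Top Stat │")  # previously raised UnicodeEncodeError
```

This keeps the tool working even when the user forgets to export a UTF-8 locale.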

Rocprof: Profiling data is corrupt

Description: Attempting to profile RESNET50 workload results in "profiling data is corrupt" message

System details
git checkout python-logging
OS/distro: Ubuntu 5.15.0-52-generic #58~20.04.1-Ubuntu
ROCm Version: 5.2.3
Omniperf Version: 1.0.4dev
Logs of crash output:

Steps to reproduce:

Within docker container of resnet50 (https://confluence.amd.com/display/MLSE/MLPerf-1.1-ResNet50v1.5):
copy this command into run.sh:

#!/bin/bash
python3 -u -m mlperf_utils.bind_launch --nproc_per_node 1 --auto_binding ./main.py --amp --dynamic-loss-scale --lr-schedule polynomial --num-gpus 4 --mom 0.9 --wd 0.0002 --lr 9.1 --prof 100 --warmup 2 --epochs 1 --nhwc --use-lars -b 256 --eval-offset 1 --get-logs --submission-platform MI200system --num-nodes 1 --no-checkpoints --raport-file raport.json -j32 -p 100 --arch resnet50 --data /data/imagenet_pytorch 2>&1 | tee -a run.log.txt

Execute profiling command:
omniperf profile --name resnet50 --path /data/imagenet_pytorch/RN50FP16_DATA2 -- run.sh

At the end of the data-collection run, observe the Profiling data is corrupt message.

Write statistics does not match understanding

Hi, We are running an all-reduce kernel (with remote memory stores) on 4 MI210s and are trying to understand the memory traffic using MIPERF (snapshot for one is attached). We are unclear about what each of the Writes are counting and had the following questions we were hoping you could help with:

i) We find that the ‘Write (64B)’ is the sum of ‘Write (Uncached 32B)’ and ‘HBM Write’ (minus the ‘Write (32B)’, which is small anyway).
- Why are 32B writes (‘Write (Uncached 32B)’) being counted as 64B writes (‘Write (64B)’)? Are the ‘Write (Uncached 32B)’ actually 64B?
- Are ‘HBM Write’ also 64B writes?
ii) Should we use ‘HBM Write’ and ‘Write (Uncached 32B)’ separately as 64B and 32B writes, instead of considering the combined ‘Write (64B)’?

Thank you!
(screenshot: write_stat)

Add L1<->L2 bandwidth calculation

Omniperf currently does not report the achieved L2 bandwidth from the L1s, despite collecting the counters required to do so.
Following the convention for the L1 bandwidth calculations, this is essentially the total amount of data moved between the L1s and the L2, which can be calculated from the L1<->L2 requests, e.g.:

https://github.com/AMDResearch/omniperf/blob/main/src/omniperf_cli/configs/gfx90a/1600_L1_cache.yaml#L173

The L2 bandwidth calculation would be:

L2 BW = 64B * (TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) / $denom
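Plugging made-up counter values into the formula above, with $denom taken as kernel duration in seconds, gives a bytes-per-second figure (all numbers here are invented for illustration):

```python
# Illustrative counter values; the names follow the formula above.
TCP_TCC_READ_REQ_sum = 1_000_000
TCP_TCC_WRITE_REQ_sum = 500_000
TCP_TCC_ATOMIC_WITH_RET_REQ_sum = 0
TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum = 0
duration_s = 1.0e-3  # $denom chosen as kernel duration, in seconds

total_reqs = (
    TCP_TCC_READ_REQ_sum
    + TCP_TCC_WRITE_REQ_sum
    + TCP_TCC_ATOMIC_WITH_RET_REQ_sum
    + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum
)
# Each L1<->L2 request moves a 64B cache line.
l2_bw = 64 * total_reqs / duration_s  # ~96 GB/s for these numbers
```

Choosing $denom as cycles or waves instead would yield bytes/cycle or bytes/wave, matching the existing normalization modes.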

Provide Binary with Tags/Releases

Some users will want to test the software without a full install. Release a binary for simple testing and installs with tags and major releases.

omniperf fails to perf a python based command line

I know that this has been reported internally but I thought it would be useful to leave a trace here on GitHub.

Reproducer:

omniperf profile -n devito_iso -- /global/home/ymmu/projects/devito-venv/bin/python devitopro/demos/iso_acoustic/run.py -d 512 512 512 -so 8 --nt 10 -opt "('advanced', {'par-tile': (32, 4, 4)})"

fails with:

Kernel Selection:  None
Dispatch Selection:  None
IP Blocks: All
RPL: on '221122_144013' from '/opt/rocm-5.1.3/rocprofiler' in '/app'
RPL: profiling '""/venv/bin/python devitopro/demos/iso_acoustic/run.py -d 512 512 512 -nt 400 -so 8 -opt ('advanced', {'par-tile': (32, 4, 8)})""'
RPL: input file '/app/workloads/omniperf-iso-acoustic/mi200/perfmon/SQ_INST_LEVEL_LDS.txt'
RPL: output dir '/tmp/rpl_data_221122_144013_1247'
RPL: result dir '/tmp/rpl_data_221122_144013_1247/input0_results_221122_144013'
/usr/bin/rocprof: eval: line 286: syntax error near unexpected token `('

<error trace continues>

If I remove the -opt "('advanced', {'par-tile': (32, 4, 4)})" part, then it works.

Switching "Normalization" doesn't seem to work

Hi,

Quick question. I import my dataset, I can navigate it, all fine...
Then I want to switch Normalization (top-left) from "per Wave" to "per Kernel", because I'm comparing two different versions of the same algorithm, but one of them generates many more waves than the other. However, after switching, nothing happens. I tried refreshing the page and other things, but nothing changed. I'm not sure how to create a reproducer for this aside from letting you access the Grafana instance on our remote server. But first of all -- am I the only one experiencing this?

Thanks again

Add better error detection when ROCm install is incomplete

Omniperf presently relies on the .info directory included with normal ROCm install to determine versioning information. If this directory is missing (say, due to incomplete ROCm install), the user will encounter runtime errors.

Improve the error message in this case to indicate the ROCm installation is incomplete.
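A minimal sketch of such a guard, assuming the conventional /opt/rocm/.info/version layout (the helper name and error message are hypothetical):

```python
from pathlib import Path

def rocm_version(rocm_path="/opt/rocm"):
    """Read the ROCm version, failing with a clear message when the
    .info directory a packaged install ships is missing, instead of
    letting a downstream lookup crash with an opaque error."""
    version_file = Path(rocm_path) / ".info" / "version"
    if not version_file.is_file():
        raise RuntimeError(
            f"{version_file} not found; the ROCm installation at "
            f"{rocm_path} appears incomplete or non-standard."
        )
    return version_file.read_text().strip()
```

The same check could run once at startup so the user sees one actionable message rather than a traceback.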

[Feature Request] Kernel Replay

Use cases:

  • often there is significant run-to-run variation of an application due to inherent randomness, e.g., for Monte-Carlo simulations.
  • rocprof doesn't play well with MPI, which makes it difficult to collect the multiple sets of counters required for omniperf. This is because rocprof's replay mode (application replay) requires that rocprof launch the MPI command (e.g., rocprof <...> mpirun <...> application <...>), which is generally unsupported, as re-launching an MPI command is poorly defined.

Some possible short-term solutions:

  1. Allow the user to query the number of application runs that will be required, and add a "--pass <XYZ>" argument to let them manually script up a way to repeatedly run the application, collecting a different set of passes each time. This can potentially alleviate the "rocprof / mpirun" issue, but doesn't do much for applications with significant non-deterministic behavior.
  2. 'Stochastic mode' -- implement a tool wrapper around the rocprofiler library that randomly selects a subset of counters that can give 'complete' metrics (that is, it should select both the level counters and the values being counted, etc.) This can likely help both cases, but doesn't do much if a user wants all possible information for a very specific dispatch

omniperf analyze statistics does not match understanding

I have been using omniperf to analyze some applications. I ran a simple 8x8x8 GEMM in BF16 data format using the following command line:
omniperf profile -n gemm_m8_k8_n8 -d 1 --device 0 -- ./rocBLAS/build/release/clients/staging/rocblas-bench -m 8 -k 8 -n 8 -f gemm_ex -r bf16_r --compute_type f32_r -i 1 -j 1  --device 0

After running omniperf analyze -p gemm_m8_k8_n8 I get the following output:
(screenshot attached)

The highlighted metric MFMA Flops (BF16) does not make sense. I expect 8x8x8x2 = 1024 flops.

The kernel takes 14.8 us; see below:
(screenshot attached)

So I expect 1024/(14.8 * 1e-6) = 69.2 million FLOPS ~ 0.069 GFLOPS.

But I see 4.4 Gflops. How is this calculated?
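For reference, the poster's expectation worked through in full (numbers taken from the issue itself; this does not explain the 4.4 GFLOPS reading, it only checks the hand calculation):

```python
# A dense MxNxK GEMM performs 2*M*N*K floating-point operations
# (one multiply plus one add per multiply-accumulate).
M = N = K = 8
flops = 2 * M * N * K            # = 1024 FLOPs
duration_s = 14.8e-6             # kernel time from the profile

gflops = flops / duration_s / 1e9
assert abs(gflops - 0.0692) < 0.0005   # ~0.07 GFLOPS expected
```

Any answer to the question would need to explain the roughly 60x gap between this figure and the reported 4.4 GFLOPS.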

Suggestion: workload names should be checked before profiling to prevent "'-' and '.' are not permited in workload name" errors during import

I created a few workloads with - in their names but failed to import them into the database due to:

Traceback (most recent call last):
  File "/home/.../omniperf/install/1.0.5/bin/omniperf", line 663, in <module>
    main()
  File "/home/.../omniperf/install/1.0.5/bin/omniperf", line 609, in main
    mongo_import(args, False)
  File "/home/.../omniperf/install/1.0.5/bin/omniperf", line 205, in mongo_import
    connectionInfo, Extractionlvl = csv_converter.parse(args, profileAndImport)
  File "/home/.../omniperf/install/1.0.5/bin/utils/csv_converter.py", line 165, in parse
    raise ValueError("'-' and '.' are not permited in workload name", db)

It would be great to have such a check before profiling.
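A hypothetical pre-flight check that mirrors the importer's rule; the helper name is made up, and the real restriction lives in csv_converter.parse as quoted in the trace above:

```python
import re

def check_workload_name(name):
    """Reject '-' and '.' up front, before any profiling time is
    spent, matching the restriction the MongoDB importer enforces."""
    if re.search(r"[-.]", name):
        raise ValueError(
            f"'-' and '.' are not permitted in workload name: {name!r}"
        )
    return name

check_workload_name("vcopy_all")  # passes
```

Calling this when parsing the `-n/--name` argument would turn a late import failure into an immediate, fixable error.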

Update host detection for thera

Need to update host detection for the Thera system in order to enable modulefile customization.

Additional hostnames where admins install from are:

  • TheraS01
  • thera-hn

Pull images for CI from Docker Hub

As pointed out by @koomie -

Our CI framework spends ~10 minutes each run installing ROCm in the testing container. We can speed things up by pulling an image from Docker Hub that already has ROCm installed.

[Feature Request] Progress Bar/Indicator

There is no obvious indication to the user when the standalone GUI is loading data (i.e. on page refresh or data filtering).

Adding a progress bar with percentage completion would resolve this confusion and make progress clearer to the user. There's a known number of tasks to be completed on each request from the front-end; leverage this to build a progress bar.

(See dash-bootstrap-components)

Requesting update to Readme with a "How to cite" section

It would be great to add a How to cite section to the README, as we expect a lot of our Instinct customers will be keen on using the tool and presenting their results in research papers and at conferences.

This tool will add a tremendous value to our application developers.

Merge roofline modules

At the moment, there are two areas where Omniperf computes Empirical Roofline data

  1. src/utils/plot_roofline.py
  2. src/omniperf_analyze/utils/roofline_calc.py

A lot of this code is duplicated, so I propose we reorganize it into one module.

(1) is used in the standalone roofline capability (i.e. --roof-only) and generates a .pdf roofline file using matplotlib.
(2) returns the critical data points for the roofline to our Dash interface, where the plot is sent to the HTML webpage.

Grafana GUI documentation unclear

Documentation available at:

https://amdresearch.github.io/omniperf/grafana_analyzer.html#grafana-gui-import

Issue: to upload a database to our MongoDB server I was running

omniperf database --import -H <host_ip> -u admin -t asw -w workloads/devito_iso/mi200/

--------
Import Profiling Results
--------

Pulling data from  /app/workloads/devito_iso/mi200
The directory exists
Found sysinfo file
KernelName shortening enabled
Kernel name verbose level: 2
-- Conversion & Upload in Progress --
ERROR: Unable to connect to the server

After a short while (but still, on the order of minutes) I realised that -u admin was wrong -- because admin is what was used for the Grafana service, not MongoDB. For someone like me who didn't (doesn't) know anything about Grafana and MongoDB, the confusion is perhaps a bit more justified...

So then I started following the docs strictly, that is, I started using -u temp. However, I was then prompted for a password. (Much) later on, I realised that the MongoDB password was hardcoded in the Dockerfile. This could be improved; I think two or three more lines in the docs would be enough.

This was co-debugged with @ggorman -- just to be sure it wasn't me having an unlucky day

Enable multi-normalization

At the moment the only normalization supported in the standalone GUI is "per Wave". Enable normalizations for

  • "per Cycle"
  • "per Kernel"
  • "per Sec"

Unable to profile DLM: KeyError: 'BeginNs'

Description: Some workloads fail on timestamp generation; the ShibuyaStream and DLM accuracy tests fail.

OS/distro: Ubuntu 5.15.0-52-generic #58~20.04.1-Ubuntu
ROCm Version: 5.2.0
Omniperf Version: 1.0.4dev
Logs of crash output:


[433 rows x 17 columns]
File 'dml_profile_DEEPSPEED_ROBERTA_data/dml_profile_DEEPSPEED_ROBERTA/mi200/timestamps.csv' is generating
Traceback (most recent call last):
  File "/home/svt/clement/omni/python-libs/pandas/core/indexes/base.py", line 3803, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'BeginNs'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/svt/clement/omni/1.0.4-dev/bin/omniperf", line 630, in <module>
    main()
  File "/home/svt/clement/omni/1.0.4-dev/bin/omniperf", line 525, in main
    omniperf_profile(args,VER)
  File "/home/svt/clement/omni/1.0.4-dev/bin/omniperf", line 376, in omniperf_profile
    replace_timestamps(workload_dir)
  File "/home/svt/clement/omni/1.0.4-dev/bin/omniperf", line 113, in replace_timestamps
    df_pmc_perf["BeginNs"] = df_stamps["BeginNs"]
  File "/home/svt/clement/omni/python-libs/pandas/core/frame.py", line 3804, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/svt/clement/omni/python-libs/pandas/core/indexes/base.py", line 3805, in get_loc
    raise KeyError(key) from err
KeyError: 'BeginNs'

Steps to reproduce:

  1. Install ROCm and omniperf.
  2. Set export variables as per the installation instructions.
  3. Clone the models and modify tags.json:

git clone https://github.com/ROCmSoftwarePlatform/DeepLearningModels
cd DeepLearningModels
#modify the tags.json to the following:
{
        "tags": [
                "pyt_train_huggingface_distilbert"
        ]
}

  4. Run:

omniperf profile --name dml_profile --path dml_profile_data echo val | sudo -S ./tools/run_models.py --timeout 0

  5. Observe the failure after 23 loops.

Expected: timestamps.csv is generated and profiling succeeds.
Actual: KeyError; timestamps.csv is EMPTY; profiling fails.
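A defensive check around the timestamp merge would turn the opaque KeyError into an actionable message. The sketch below mirrors the replace_timestamps step from the traceback, but it is an illustrative suggestion, not the project's actual code:

```python
import pandas as pd

def replace_timestamps(pmc_perf: pd.DataFrame, stamps: pd.DataFrame) -> pd.DataFrame:
    """Copy BeginNs/EndNs into the counter table, failing with a clear
    message when timestamps.csv came back empty or malformed (as happens
    when the underlying rocprof pass fails on a workload)."""
    required = {"BeginNs", "EndNs"}
    missing = required - set(stamps.columns)
    if stamps.empty or missing:
        raise RuntimeError(
            "timestamps.csv is empty or missing columns %s; "
            "rocprof likely failed on this workload" % sorted(missing)
        )
    out = pmc_perf.copy()
    out["BeginNs"] = stamps["BeginNs"].values
    out["EndNs"] = stamps["EndNs"].values
    return out
```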

fix versioning info for submodes

Noticed this in the current release: the version info has a funky string for the submodes:

ok

$ ./omniperf -v
omniperf (1.0.3)

funky

$ ./omniperf analyze -v
%(PROG)s (1.0.3)
$ ./omniperf database -v
%(PROG)s (1.0.3)
$ ./omniperf database -v
%(PROG)s (1.0.3)

Improve documentation for usage with multi-process runs

Could some guidance be added to the documentation on using omniperf with MPI jobs? Should we collect profiles for one rank only, using a wrapper script (see the example below) invoked as mpirun <...> wrapper_omniperf.sh <...> <exe>? Or should we run omniperf <...> mpirun <...> <exe>?
A sample wrapper script that I tried using is:

#! /usr/bin/env bash
if [[ -n ${OMPI_COMM_WORLD_RANK+z} ]]; then
  # Open MPI
  export MPI_RANK=${OMPI_COMM_WORLD_RANK}
elif [[ -n ${MV2_COMM_WORLD_RANK+z} ]]; then
  # MVAPICH2
  export MPI_RANK=${MV2_COMM_WORLD_RANK}
elif [[ -n ${SLURM_PROCID+z} ]]; then
  # MPI launched via srun
  export MPI_RANK=${SLURM_PROCID}
fi
if [[ ${MPI_RANK} == "0" ]]; then
  # Profile only rank 0
  eval "omniperf profile -n <workload_name> -k <kernel_name> -b <ip_block> -- $*"
else
  # Run the application unmodified on all other ranks
  "$@"
fi

It crashes when rocprof (invoked internally by omniperf) tries to collect counters that are split into multiple groups.

Is this a fundamental limitation on MI50, or is the tool completely unusable there?

Hi, I would like to use your tool to develop an analytical modeling tool for AMD GPUs. I only have an MI50, but this GPU is marked as unsupported in your documentation. I want to check whether this is a fundamental hardware limitation or whether the tool is simply unusable on it. Thanks for your help.

Store application parameters in profiling output

As a 3rd party reviewing workloads in Grafana, it would be nice to track, and get better insight into, how the app was invoked. App parameters can make a huge difference in the profiling results.

For example, comparing BabelStream data I see that a different ROCm stack / CPU was present in each SUT, and some of the kernels are the same across the two data sets. I'd like to know the BabelStream parameters: what was on the command line when it was launched?

cc: James Dezelle
