DocumentSASS

The instruction sets for NVIDIA GPUs have very sparse official documentation.

Other projects have examined the instructions, mainly through reverse engineering. These include MaxAs, AsFermi, CuAssembler, TuringAs, KeplerAs, Decuda, and the paper Dissecting the NVidia Turing T4 GPU.

Since the instructions and the architecture change from generation to generation, it is an uphill battle.
What if a description of the instruction encoding could be found within the tools provided by NVIDIA?
What if the instruction latencies could be found inside these as well?

The answer is that of course they can; otherwise the compiler would do a poor job of scheduling instructions. Furthermore, for SASS it turns out that fixed-latency instructions have the number of stall cycles hard-coded into them [src]. It is just a question of finding where this data is hidden.

It turns out that an extensive description of SASS instructions, as well as their latencies, was contained in two specific strings in nvdisasm. Instead of writing micro-benchmarks to find latencies, or reverse-engineering the encoding to build an assembler, one could in theory just consult these files. Instruction scheduling information is given in the latencies file, with the minimum time for fixed-latency operations essentially being the latency. See NOTES.

For some additional, unrelated observations, see OTHER.

How to run

The easy way is to run this notebook in Google Colab, which has no requirements.

Requirements to run locally: Linux, Python 3, CUDA Toolkit. Run make to generate the raw files describing instructions and latencies. Be sure to change the paths at the beginning of the Makefile if they are different on your system. Tested with CUDA 11.6.

How it works

nvcc is used to compile example.cu to .cubin binaries for a list of architectures.
cc is used to compile intercept.c to a .so library that serves as a man-in-the-middle for data from memcpy calls.
We run nvdisasm on each binary file and intercept it using intercept.so.
The result is filtered with strings to only get text, and then the script funnel.py gathers the relevant portions and writes them to files.
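
The filtering step above can be illustrated with a minimal Python sketch. It is not the project's funnel.py, whose actual logic is not shown here; the function names printable_strings and gather, the 4-character minimum, and the marker keywords are all hypothetical. It only shows the general idea: pull printable ASCII runs out of raw bytes (roughly what strings does) and keep the ones that mention a keyword of interest.

```python
import re

def printable_strings(data, min_len=4):
    """Return runs of printable ASCII of at least min_len bytes,
    roughly what the `strings` utility reports."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

def gather(strings, marker):
    """Keep only the strings containing `marker` (a hypothetical keyword;
    the real selection criteria in funnel.py may differ)."""
    return [s for s in strings if marker in s]

# Made-up bytes standing in for data captured from memcpy calls.
captured = b"\x00\x01OPCODE FADD\x00ab\x00\xffLATENCY 6\x00"
texts = printable_strings(captured)
latency_lines = gather(texts, "LATENCY")
```

Here the two-character run "ab" is dropped because it is shorter than the minimum length, which is how short accidental matches in binary data are filtered out.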

An initial approach was to simply run strings nvdisasm to get text embedded in the executable, but it turned out the relevant strings were dynamically generated (and only for the input architecture), which is why this solution is needed.
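
A common way to place a shared library between a program and its libc calls on Linux is the LD_PRELOAD environment variable. The sketch below assumes that intercept.so is loaded this way, which the text above does not state explicitly; the helper name preload_invocation and the default paths are hypothetical. It only builds the command and environment, so it can be inspected without the CUDA Toolkit installed.

```python
import os

def preload_invocation(cubin_path, lib_path="./intercept.so", tool="nvdisasm"):
    """Build (argv, env) for running `tool` on `cubin_path` with `lib_path`
    preloaded. Assumption: the interposer is injected through LD_PRELOAD."""
    env = dict(os.environ)                       # copy, leave the caller's env alone
    env["LD_PRELOAD"] = os.path.abspath(lib_path)
    return [tool, cubin_path], env

argv, env = preload_invocation("example.cubin")
```

The pair can then be handed to subprocess.run(argv, env=env, capture_output=True). Where the intercepted data ends up (standard output, a file, or elsewhere) depends on intercept.c and is not assumed here.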

TODO

The instruction string currently appears to be slightly corrupted for compute capability 3.5.

0xd0gf00d / documentsass
