Comments (4)
Thanks for sharing this, @rohitdwivedula. We had a chat with the PyTorch profiler team, and they advised us to use the correlation ID to link GPU operators with the launcher operators.
Previously, we used the external ID for linking CPU operators in a Chakra host trace and a Chakra device trace. It turned out that the external ID field is not stable, so we are currently using the rf_id field.
from chakra.
To solve issue 1: in Nvidia Kineto traces, each entry in the JSON file contains two fields, correlation and External id, and they always appear to be the same thing, e.g.:
{
  "ph": "X", "cat": "cuda_runtime", "name": "cudaStreamWaitEvent", "pid": 2012624, "tid": 1142494784,
  "ts": 1720537333825191, "dur": 1,
  "args": {
    "External id": 350,
    "cbid": 147, "correlation": 350
  }
}
AMD traces look like this:
{
  "ph": "X", "cat": "gpu_memcpy", "name": "CopyHostToDevice", "pid": 2, "tid": 0,
  "ts": 1720537542569197, "dur": 32,
  "args": {
    "External id": 131
  }
}
It is unclear if External id == correlation always, but in all of the Nvidia traces I've seen so far they have never been different. If they are, indeed, always the same, we could modify the trace_link script to use the External id as a fallback in case correlation is not found as a field.
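One way to test that assumption on a given trace is to scan it for events where both fields are present but differ. The following is a minimal sketch, not part of chakra; the function names find_id_mismatches and check_kineto_file are hypothetical, and it assumes the Kineto JSON keeps its events under a top-level traceEvents list, as in the examples above.

```python
import json


def find_id_mismatches(trace_events):
    """Return the events whose args carry both IDs with different values."""
    mismatches = []
    for event in trace_events:
        args = event.get("args")
        if not isinstance(args, dict):
            continue  # events without an args object cannot be compared
        if "External id" in args and "correlation" in args:
            if args["External id"] != args["correlation"]:
                mismatches.append(event)
    return mismatches


def check_kineto_file(path):
    """Load a Kineto JSON file and return its mismatching events (hypothetical helper)."""
    with open(path, "r") as f:
        data = json.load(f)
    return find_id_mismatches(data.get("traceEvents", []))
```

An empty result on a large set of Nvidia traces would support using External id as a fallback, although it would not prove that the two fields are always equal.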
Hi @TaekyungHeo, I am hoping to open a PR to try to fix this issue and had a quick question. Currently, PyTorch's Kineto traces on AMD hardware do not contain correlation IDs at all; we opened an issue on the PyTorch repo for this. In the interim, we have been manually postprocessing the Kineto JSON produced by torch.profiler.profile, adding a new correlation field equal to the External id field. Essentially, we modify each entry in the Kineto JSON from this:
{
  "ph": "X", "cat": "gpu_memcpy", "name": "CopyHostToDevice", "pid": 2, "tid": 0,
  "ts": 1720537542569197, "dur": 32,
  "args": {
    "External id": 131
  }
}
to this:
{
  "ph": "X", "cat": "gpu_memcpy", "name": "CopyHostToDevice", "pid": 2, "tid": 0,
  "ts": 1720537542569197, "dur": 32,
  "args": {
    "External id": 131, "correlation": 131
  }
}
After making this one change to the JSON, we ran chakra_trace_link from our fork of chakra on a bunch of models, and no warnings were generated at all.
Question: would it be possible for us to upstream the change in our fork (essentially adding all hipLaunch operators to the codebase) using either option 1 or option 2 (described below) while we wait on PyTorch to fix the lack of a correlation field in AMD Kineto traces?
Option 1
We add a section to the documentation with the hacky fix mentioned above for AMD hardware. Before passing the Kineto JSON to chakra_trace_link, just pass it through a function like this:
import json

def process_kineto_file(infile, outfile):
    # Load the Kineto trace produced by the PyTorch profiler.
    with open(infile, 'r') as f:
        data = json.load(f)
    for event in data['traceEvents']:
        args = event.get('args')
        # Copy External id into correlation only when correlation is missing.
        if isinstance(args, dict) and 'External id' in args and 'correlation' not in args:
            args['correlation'] = args['External id']
    with open(outfile, 'w') as f:
        json.dump(data, f, indent=2)
Option 2
Inside the chakra_trace_link function, we add an extra codepath to use External id instead of correlation if (1) the trace is an AMD trace, and (2) no correlation IDs are found in the entire file.
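The two conditions of option 2 could be sketched roughly as follows. This is not chakra's actual code: choose_link_field, looks_like_amd_trace, and link_id are hypothetical names, and detecting an AMD trace by looking for event names that start with "hip" is only an assumption made for illustration.

```python
def _args(event):
    """Return the event's args dict, or an empty dict if it has none."""
    args = event.get("args")
    return args if isinstance(args, dict) else {}


def looks_like_amd_trace(trace_events):
    # Assumption: an AMD trace contains HIP runtime calls, whose names
    # start with "hip" (for example hipLaunchKernel).
    return any(str(e.get("name", "")).startswith("hip") for e in trace_events)


def choose_link_field(trace_events):
    """Pick the args key used to link GPU operators to their launchers."""
    # Condition (2): fall back only if no event in the file has a correlation ID.
    if any("correlation" in _args(e) for e in trace_events):
        return "correlation"
    # Condition (1): fall back only for traces that look like AMD traces.
    if looks_like_amd_trace(trace_events):
        return "External id"
    return "correlation"


def link_id(event, field):
    """Read the chosen linking ID from an event, or None if it is absent."""
    return _args(event).get(field)
```

Choosing the field once per file, rather than per event, keeps Nvidia traces on the existing correlation path and avoids mixing the two ID spaces within one trace.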
Thanks for raising this issue, @rohitdwivedula. I prefer option 1, mainly because this issue needs to be fixed in PyTorch. We faced a ton of issues around this problem in the past and needed to make sure Kineto was doing the right thing for consistent behavior. Simple traces would work, but more complex ones would fail.