
Comments (4)

TaekyungHeo commented on August 15, 2024

Thanks for sharing this, @rohitdwivedula. We had a chat with the PyTorch profiler team, and they advised us to use the correlation ID to link GPU operators with the launcher operators.
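As a rough illustration of linking by correlation ID, the sketch below groups Kineto trace events that share a `correlation` value into one launcher event and its GPU events. The function name `link_by_correlation` and the category sets are assumptions made for this example (only `cuda_runtime` and `gpu_memcpy` appear in the traces quoted in this thread), and this is not the logic Chakra actually ships.

```python
from collections import defaultdict

# Hypothetical category sets, chosen for illustration only.
# Real traces may use other category names for launchers and GPU work.
LAUNCHER_CATS = {"cuda_runtime"}
GPU_CATS = {"kernel", "gpu_memcpy", "gpu_memset"}


def link_by_correlation(trace_events):
    """Group events by args['correlation'] into a launcher and its GPU events."""
    links = defaultdict(lambda: {"launcher": None, "gpu": []})
    for event in trace_events:
        corr = event.get("args", {}).get("correlation")
        if corr is None:
            # Events without a correlation ID cannot be linked this way.
            continue
        cat = event.get("cat")
        if cat in LAUNCHER_CATS:
            links[corr]["launcher"] = event
        elif cat in GPU_CATS:
            links[corr]["gpu"].append(event)
    return dict(links)
```

Events with no `correlation` field are skipped, which is exactly why traces lacking that field cannot be linked with this approach.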

Previously, we used the external ID for linking CPU operators in a Chakra host trace and a Chakra device trace. It turned out that the external ID field is not stable, so we are currently using the rf_id field.


rohitdwivedula commented on August 15, 2024

To solve issue 1: in Nvidia Kineto traces, each entry in the JSON file contains two fields, correlation and External id, and they always appear to hold the same value, e.g.:

  {
    "ph": "X", "cat": "cuda_runtime", "name": "cudaStreamWaitEvent", "pid": 2012624, "tid": 1142494784,
    "ts": 1720537333825191, "dur": 1,
    "args": {
      "External id": 350,
      "cbid": 147, "correlation": 350
    }
  }

AMD traces look like this:

  {
    "ph": "X", "cat": "gpu_memcpy", "name": "CopyHostToDevice", "pid": 2, "tid": 0,
    "ts": 1720537542569197, "dur": 32,
    "args": {
      "External id": 131
    }
  }

It is unclear whether External id == correlation always holds, but in all of the Nvidia traces I've seen so far they have never differed. If they are indeed always the same, we could modify the trace_link script to use External id as a fallback when the correlation field is missing.
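One way to test that assumption on a given trace is to scan it for events where both fields are present but differ. This is a minimal sketch; `find_id_mismatches` is a hypothetical helper name, and it assumes the events come from the `traceEvents` list of a Kineto JSON file.

```python
def find_id_mismatches(trace_events):
    """Return events whose 'External id' and 'correlation' values differ."""
    mismatches = []
    for event in trace_events:
        args = event.get("args", {})
        # Only events that carry both fields can be compared.
        if "External id" in args and "correlation" in args:
            if args["External id"] != args["correlation"]:
                mismatches.append(event)
    return mismatches
```

To run it on a real trace, load the file with `json.load` and pass in `data["traceEvents"]`. An empty result only shows that the two fields agree in that particular trace, not that they always will.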


rohitdwivedula commented on August 15, 2024

Hi @TaekyungHeo - I am hoping to open a PR to fix this issue and had a quick question. Currently, the Kineto traces PyTorch produces on AMD hardware do not contain correlation IDs at all; we opened an issue on the PyTorch repo for this. In the interim, we have been manually postprocessing the Kineto JSON produced by torch.profiler.profile, adding a new correlation field equal to the External id field. Essentially, we modify each entry in the Kineto JSON from this:

  {
    "ph": "X", "cat": "gpu_memcpy", "name": "CopyHostToDevice", "pid": 2, "tid": 0,
    "ts": 1720537542569197, "dur": 32,
    "args": {
      "External id": 131
    }
  }

to this:

  {
    "ph": "X", "cat": "gpu_memcpy", "name": "CopyHostToDevice", "pid": 2, "tid": 0,
    "ts": 1720537542569197, "dur": 32,
    "args": {
      "External id": 131, "correlation": 131
    }
  }

After making this one change to the JSON, we ran chakra_trace_link on our fork of chakra on a bunch of models and no warnings are being generated at all.

Question: would it be possible for us to upstream the change in our fork (essentially adding all hipLaunch operators to the codebase) using either option 1 or option 2 (described below) while we wait on PyTorch to fix the lack of correlation field in AMD Kineto traces?

Option 1

We add a section to the documentation describing the hacky fix mentioned above for AMD hardware. Before passing the Kineto trace to chakra_trace_link, just pass it through a function like this:

import json

def process_kineto_file(infile, outfile):
    """Copy 'External id' into a missing 'correlation' field of each trace event."""
    with open(infile, 'r') as f:
        data = json.load(f)

    for event in data['traceEvents']:
        args = event.get('args', {})
        # Only fill in correlation when it is absent; never overwrite an existing value.
        if 'External id' in args and 'correlation' not in args:
            args['correlation'] = args['External id']

    with open(outfile, 'w') as f:
        json.dump(data, f, indent=2)

Option 2

Inside the chakra_trace_link function, we add an extra codepath to use External ID instead of correlation if (1) the trace is an AMD trace, and (2) no correlation IDs are found in the entire file.
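A minimal sketch of what that fallback decision could look like is shown below. The name `choose_link_field` and the `is_amd_trace` flag are hypothetical; how an AMD trace would actually be detected is left open here, and none of this reflects the existing chakra_trace_link code.

```python
def choose_link_field(trace_events, is_amd_trace):
    """Pick the args field used for linking: 'correlation' or 'External id'."""
    # Condition (2): no event in the entire file carries a correlation ID.
    has_correlation = any(
        "correlation" in event.get("args", {}) for event in trace_events
    )
    # Fall back only when both conditions from Option 2 hold.
    if is_amd_trace and not has_correlation:
        return "External id"
    return "correlation"
```

The linker would then read `event["args"][field]` with the returned field name wherever it currently reads the correlation ID, so Nvidia traces keep their current behavior.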


srinivas212 commented on August 15, 2024

Thanks for raising this issue, @rohitdwivedula. I prefer option 1, mainly because this issue needs to be fixed in PyTorch. We faced a ton of issues around this problem in the past and needed to make sure Kineto was doing the right thing for consistent behavior. Simple traces would work, but more complex ones would fail.

