mlcommons / chakra Goto Github PK

Repository for MLCommons Chakra schema and tools

Home Page: https://mlcommons.org/working-groups/research/chakra/

License: Apache License 2.0

Python 92.17% C++ 7.83%

chakra's People

changhai0109 frankucas anageswa joongunpark stayyule yfeng-44 jinsun-yoo ajbalogh alexandruantonescukeysight sanrise dageita danmih-ixia cpa872

chakra's Issues

May I ask how to obtain a PyTorch execution graph?

I can't find any information about the PyTorch Execution Graph, although I program in PyTorch. Could you give me some advice on how to obtain one, so that I can feed it to Chakra?

Information redundancy

The PyTorch ET plus contains too much information, including function call relationships in great detail, while in fact we only care about computation, memory access, and communication.
Does such a large amount of information run counter to the design philosophy of Chakra?

Improving node time duration resolution

Problem Related to the Feature

As defined in the et_def.proto, the attribute that stores duration (duration_micros) uses microsecond precision.
There are some cases where we encounter lots of sub-microsecond-runtime COMP_NODE nodes, which cannot be aggregated into larger compute COMP_NODE. These times can add up, and turn into a significant amount of time.

Proposed Solution

I think it makes sense to have nanosecond precision. A double type would probably be the way to go. The simulator can then cap the value to whatever precision best suits its use cases.
The above seems to be the more straightforward solution, but alternatively, a per-node or per-trace "timescale" field could also do the trick.
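A small sketch of the effect described above, assuming that each node's duration is stored as a whole number of microseconds (the node count and the 400 ns figure are made up for illustration):

```python
# Illustration only: how whole-microsecond storage loses sub-microsecond durations.
# Assumption: each duration is truncated to an integer number of microseconds.

def total_as_integer_micros(durations_ns):
    """Sum durations after truncating each one to whole microseconds."""
    return sum(d // 1000 for d in durations_ns)

def total_as_nanos(durations_ns):
    """Sum durations kept at nanosecond resolution."""
    return sum(durations_ns)

# 10,000 hypothetical compute nodes, each taking 400 ns.
durations_ns = [400] * 10_000

lost_total_us = total_as_integer_micros(durations_ns)
true_total_us = total_as_nanos(durations_ns) / 1000

print(lost_total_us)  # 0
print(true_total_us)  # 4000.0
```

In this made-up case every node rounds down to zero, so 4 ms of compute time disappears from the trace.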

[Tutorial] Many nodes have a common parent node, but the node doesn't exist in PyTorch ET.

Describe the Bug

I was following the Chakra trace collection tutorial. I was able to collect both PyTorch ET and Kineto trace, but I couldn't link them using trace_link.py. trace_link.py emitted the following error:

$ python3 trace_link.py --et-file matmul_et.json --kineto-file kineto_trace_matmul.json --exact-match

[2024-04-30 13:46:33,291] execution_trace.py:455 [INFO]: Iteration node ids list = [1]
[2024-04-30 13:46:33,291] trace_link.py:306 [INFO]: Number of original ops in execution trace: 2
[2024-04-30 13:46:33,291] trace_link.py:225 [INFO]: Kineto trace has 0 segments
[2024-04-30 13:46:33,291] trace_link.py:338 [WARNING]: Could not find annotation DataLoader in kineto file using the whole file, processing could be very slow!!
[2024-04-30 13:46:33,291] trace_link.py:343 [INFO]: Number of original cpu ops in kineto trace: 46
[2024-04-30 13:46:33,291] trace_link.py:344 [INFO]: Number of original gpu ops in kineto trace: 6
[2024-04-30 13:46:33,291] trace_link.py:350 [INFO]: Average iteration latency: 4282.0
Traceback (most recent call last):
  File "/home/jmoon/workspace/transport/collect_et/trace_link.py", line 891, in <module>
    main()  # pragma: no cover
    ^^^^^^
  File "/home/jmoon/workspace/transport/collect_et/trace_link.py", line 880, in main
    dump_et_file(
  File "/home/jmoon/workspace/transport/collect_et/trace_link.py", line 818, in dump_et_file
    node["parent"] = assigned_ids[node["parent"]]
                     ~~~~~~~~~~~~^^^^^^^^^^^^^^^^
KeyError: 3

I looked into the collected PyTorch ET to further debug the issue. I found that many nodes have a parent attribute with a value of 3, but there was no node with id 3 (Please refer to the screenshot). I believe this caused the above error. Is my trace collection procedure wrong or is it a known bug? If it's a known bug, is there any way to resolve this error? Any pointers or answers would be appreciated.
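The check described above can be scripted. This is a minimal sketch, assuming the ET file is a JSON object with a top-level "nodes" list whose entries carry "id" and "parent" keys; the helper names are hypothetical and not part of PARAM or Chakra:

```python
import json

def find_dangling_parents(nodes):
    """Return {missing_parent_id: [child ids]} for parents absent from the trace."""
    known_ids = {node["id"] for node in nodes}
    dangling = {}
    for node in nodes:
        parent = node.get("parent")
        if parent is not None and parent not in known_ids:
            dangling.setdefault(parent, []).append(node["id"])
    return dangling

def load_nodes(path):
    # Assumption: the ET file is a JSON object with a top-level "nodes" list.
    with open(path) as f:
        return json.load(f)["nodes"]

# Tiny made-up example mirroring the report: children point at id 3, which is absent.
example_nodes = [
    {"id": 1},
    {"id": 2, "parent": 1},
    {"id": 4, "parent": 3},
    {"id": 5, "parent": 3},
]
print(find_dangling_parents(example_nodes))  # {3: [4, 5]}
```

Running `find_dangling_parents(load_nodes("matmul_et.json"))` on the collected trace would list every parent id that has no matching node.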

Steps to Reproduce

Below is the PyTorch code that I used for the ET and Kineto trace collection:

import torch
import numpy as np
from torch.profiler import ExecutionTraceObserver, profile

def trace_handler(prof):
    prof.export_chrome_trace("kineto_trace_matmul.json")

def gpu_matrix_multiplication(matrix1: np.ndarray, matrix2: np.ndarray) -> torch.Tensor:
    """
    Perform matrix multiplication on the GPU using PyTorch.

    Args:
        matrix1 (np.ndarray): The first input matrix as a NumPy array.
        matrix2 (np.ndarray): The second input matrix as a NumPy array.

    Returns:
        torch.Tensor: The result of the matrix multiplication, as a PyTorch tensor.

    Raises:
        ValueError: If matrices have incompatible shapes for multiplication.
    """
    if matrix1.shape[1] != matrix2.shape[0]:
        raise ValueError("Matrices have incompatible shapes for multiplication.")

    # Convert numpy arrays to PyTorch tensors and set dtype to float
    matrix1_torch = torch.tensor(matrix1, dtype=torch.float)
    matrix2_torch = torch.tensor(matrix2, dtype=torch.float)

    # Transfer tensors to GPU if available
    if torch.cuda.is_available():
        matrix1_torch = matrix1_torch.to('cuda')
        matrix2_torch = matrix2_torch.to('cuda')

    # Perform matrix multiplication using GPU
    result_gpu = torch.matmul(matrix1_torch, matrix2_torch)

    return result_gpu

if __name__ == "__main__":
    
    # for ET
    et = ExecutionTraceObserver()
    et_filename = "matmul_et.json"
    et.register_callback(et_filename)



    # for Kineto traces
    with profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        # skip first 10 iterations
        # record 1 iteration after the first 10.
        schedule=torch.profiler.schedule(wait=0, warmup=10, active=1),
        on_trace_ready=trace_handler,
    ) as prof:
        # Define larger matrices (1024x1024) using NumPy
        matrix_a = np.random.rand(1024, 1024)
        matrix_b = np.random.rand(1024, 1024)
        for epoch in range(20):
            # training function goes here
            result_on_gpu = gpu_matrix_multiplication(matrix_a, matrix_b)
            result2_on_gpu = gpu_matrix_multiplication(matrix_a, result_on_gpu)
            if epoch == 11:
                et.stop()
            if epoch == 10:
                et.start()
            prof.step()

    et.unregister_callback()

trace_link.py is from the PARAM GitHub repository, and I executed it with the command below.
$ python3 trace_link.py --et-file matmul_et.json --kineto-file kineto_trace_matmul.json --exact-match

The PyTorch version is 2.1.2, because higher versions have some issues (related to #40).

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.1.2+cu121
[pip3] torchaudio==2.1.2+cu121
[pip3] torchvision==0.16.2+cu121
[pip3] triton==2.1.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] torch                     2.1.2+cu121              pypi_0    pypi
[conda] torchaudio                2.1.2+cu121              pypi_0    pypi
[conda] torchvision               0.16.2+cu121             pypi_0    pypi
[conda] triton                    2.1.0                    pypi_0    pypi

Expected Behavior

I expected that PyTorch ET would be collected without missing dependencies so that the link procedure would succeed without an error.

Screenshots

How to distinguish communication domains between different communication (ET) nodes?

Regarding the ASTRA-sim issue astra-sim/astra-sim#195, I think different parallel groups (tensor parallel, data parallel, etc.) should be defined in comm_group_configuration. But this requires support in Chakra. Currently, Chakra cannot distinguish the communication domains to which the communication nodes in an ET belong.
So is there any way to map the communication nodes in a Chakra ET to different communication domains?

May I ask whether there is any tutorial or example for this project?

It is too tough to get started. Is there any tutorial or example for this project?

For example, how can I get the PyTorch ET from a cluster, and how do I convert it to a Chakra ET?
How do I visualize the Chakra ET?
How do I install tools such as Mystique or the execution graph observer?

It is a good tool, but it is hard for beginners.

Missing Module for Execution Trace Converter

Describe the Bug

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/content/chakra/et_converter/et_converter.py", line 9, in <module>
    from .text2chakra_converter import Text2ChakraConverter
  File "/content/chakra/et_converter/text2chakra_converter.py", line 8, in <module>
    from chakra.et_def.et_def_pb2 import (
ModuleNotFoundError: No module named 'chakra.et_def.et_def_pb2'

Steps to Reproduce

git clone https://github.com/mlcommons/chakra
cd chakra
pip install -e .
python -m chakra.et_converter.et_converter --input_type PyTorch --input_filename traces/traces/cdd55a1099e8_561.1714517342978231506.pt.trace.json --output_filename traces/Chakra

What is the expected way to collect a PyTorch execution trace?

Could you show how to get the PyTorch execution trace output that Chakra takes and converts?

I tried to collect the trace using the default trace handler, torch.profiler.tensorboard_trace_handler, and torch.jit.trace(). The outputs from both attempts are very different from what pytorch2chakra_converter expects.

Thanks.

Unable to run et_generator to generate traces

Hi,
The following error appears when trying to run et_generator, because et_def_pb2.py no longer seems to include the following data types as enums:

    BOOL,
    FLOAT,
    INT,
    STRING,
    BOOLS,
    FLOATS,
    INTS,
    STRINGS,

python3 -m utils.et_generator.et_generator --num_npus 5 --num_dims 4
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/leekevin/chakra/utils/et_generator/et_generator.py", line 6, in <module>
    from et_def.et_def_pb2 import (
ImportError: cannot import name 'BOOL' from 'et_def.et_def_pb2' (/home/leekevin/chakra/et_def/et_def_pb2.py)

Is it possible to provide a baseline code example for getting the Chakra ET of a DNN?

For example:
Given a DNN object and random data, get the Chakra ET of DNN(data).

Steps include:
- Get the ET with the Graph Observer (including the output file)
- Get execution timestamps with Kineto (including the output file)
- Merge the two files above with PARAM into an ET with timestamps (including the output file)
- Convert it into a Chakra ET with the Chakra converter
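The last two steps can be sketched as command lines. This is a rough sketch that only uses flags appearing in other issues on this page (the trace_link.py invocation and the et_converter invocation). The file names are hypothetical, and in particular the name of the linked file that trace_link.py writes is an assumption:

```python
import shlex

def link_command(et_file, kineto_file):
    """Merge step: PARAM's trace_link.py with the flags shown in another issue."""
    return ["python3", "trace_link.py",
            "--et-file", et_file,
            "--kineto-file", kineto_file,
            "--exact-match"]

def convert_command(linked_et_file, chakra_et_file):
    """Convert step: Chakra's et_converter with the PyTorch input type."""
    return ["python3", "-m", "chakra.et_converter.et_converter",
            "--input_type", "PyTorch",
            "--input_filename", linked_et_file,
            "--output_filename", chakra_et_file]

steps = [
    link_command("matmul_et.json", "kineto_trace_matmul.json"),
    # "matmul_et_plus.json" is a placeholder for whatever file the merge step produces.
    convert_command("matmul_et_plus.json", "matmul_chakra.0.et"),
]
for argv in steps:
    print(shlex.join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` once the first two steps have produced the ET and Kineto files.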

nccl:send not found

Describe the Bug

When I run the PyTorch converter, it reports that the nccl:send comm_type is not supported. Is there any plan to support it, or is this comm_type not expected in the trace?

admin@admin: ~/llm/chakra(main)$ python3 -m chakra.et_converter.et_converter --input_type PyTorch --input_filename et_plus/profile_et_rank_0_plus.json --output_filename et_plus/profile_chakra.0.et 
Traceback (most recent call last):
  File "/home/admin/miniconda3/lib/python3.12/site-packages/chakra/et_converter/et_converter.py", line 89, in main
    converter.convert()
  File "/home/admin/miniconda3/lib/python3.12/site-packages/chakra/et_converter/pytorch2chakra_converter.py", line 169, in convert
    collective_comm_type = self.get_collective_comm_type(pytorch_node.name)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/admin/miniconda3/lib/python3.12/site-packages/chakra/et_converter/pytorch2chakra_converter.py", line 395, in get_collective_comm_type
    raise ValueError(f"'{name}' not found in collective communication mapping. "
ValueError: 'nccl:send' not found in collective communication mapping. Please add this collective communication name to the mapping.
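To illustrate why a name such as nccl:send would trigger this error, here is a hedged sketch of a name-to-type lookup. The mapping table and matching rule are hypothetical and are not copied from pytorch2chakra_converter.py; only the error message is taken from the traceback above. Since send is a point-to-point operation rather than a collective, a table that only lists collectives would have no entry for it:

```python
# Hypothetical mapping; the real table in pytorch2chakra_converter.py may differ.
COLLECTIVE_COMM_MAPPING = {
    "allreduce": "ALL_REDUCE",
    "alltoall": "ALL_TO_ALL",
    "allgather": "ALL_GATHER",
    "reducescatter": "REDUCE_SCATTER",
    "broadcast": "BROADCAST",
}

def get_collective_comm_type(name):
    """Map a node name to a collective type, or raise if no entry matches."""
    normalized = name.replace("_", "").replace("-", "").lower()
    for key, comm_type in COLLECTIVE_COMM_MAPPING.items():
        if key in normalized:
            return comm_type
    raise ValueError(
        f"'{name}' not found in collective communication mapping. "
        "Please add this collective communication name to the mapping."
    )

print(get_collective_comm_type("nccl:all_reduce"))  # ALL_REDUCE
try:
    get_collective_comm_type("nccl:send")
except ValueError as err:
    print(err)
```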

Do you have a plan for integrating Chakra into SST?

If you have such a plan, I would like to know about it.
Or, if you want to integrate Chakra into SST but it is not yet prioritized, I am willing to contribute and discuss it.

more traces?

Hi,

Could you please share more ET traces, such as the LLaMA traces you mentioned in previous issues?

Currently, I only have the converted traces from Astra-sim 1.0 and the Megatron trace mentioned in issue #176.

It would be really helpful if you could share more traces.

Thanks!

Would it be possible to provide an example of input files?

Dear Authors,

Thank you for the tool. I am new to Chakra and want to use the performance model for my current project. I have just started installing the tool and generating the execution trace. I see the following instruction for converting an ET from PyTorch, but I have no idea what the input file, output file, default simulated run time, and num dims look like. Would it be possible to provide an example of these files or configurations? Thank you.
$ python -m et_converter.et_converter \
    --input_type PyTorch \
    --input_filename <input_filename> \
    --output_filename <output_filename> \
    --default_simulated_run_time <default_simulated_run_time> \
    --num_dims <num_dims>

record_param_comms

I have encountered numerous 'record_param_comms' nodes in Chakra ET, which serve as child nodes to collective communication nodes. I presume that these functions are intended to log communication information, such as the communication domain for collective communications, the counterpart in point-to-point communications, the size of the communication volume, and other parameters. However, this is just my speculation, as I have not been able to find specific invocations of these functions within PyTorch. How is this information utilized within Chakra?

import error in pytorch2chakra_converter.py

Running the pytorch2chakra_converter.py file causes this error: BOOL, FLOAT, ..., STRINGS can't be imported from chakra/et_def/et_def_pb2.py.

lack of attribute 'parent'

I was using ASTRA-sim 2.0 to generate Chakra traces from ASTRA-sim 1.0 workloads, with the following command:
python3 -m chakra.et_converter.et_converter \
    --input_type Text \
    --input_filename ../../../inputs/workload/ASTRA-sim-1.0/Resnet50_DataParallel.txt \
    --output_filename ../../../inputs/workload/ASTRA-sim-2.0/Resnet50_DataParallel \
    --num_npus 64 \
    --num_dims 1 \
    --num_passes 1
I hit this error:
DEBUG [04/17/2024 12:04:48 PM] Traceback (most recent call last):
  File "/home/esar/.local/lib/python3.10/site-packages/chakra/et_converter/et_converter.py", line 106, in main
    converter.convert()
  File "/home/esar/.local/lib/python3.10/site-packages/chakra/et_converter/text2chakra_converter.py", line 147, in convert
    self.convert_data_parallel(f, num_layers)
  File "/home/esar/.local/lib/python3.10/site-packages/chakra/et_converter/text2chakra_converter.py", line 202, in convert_data_parallel
    self.add_parent(fwd_comp_node, layers[idx-1].fwd_comp_node)
  File "/home/esar/.local/lib/python3.10/site-packages/chakra/et_converter/text2chakra_converter.py", line 136, in add_parent
    child_node.parent.append(parent_node.id)
AttributeError: parent
So I checked the Node message in ./et_def/et_def.proto and found that the node does not have a 'parent' attribute. I added the following at the bottom of the message:
//parent
repeated uint64 parent=11;
After that, the conversion passes.

Segmentation fault when running ns3 simulation

During the ns3 simulation, an error indicating a segmentation fault occurred. After debugging, it was discovered that when checking for communication nodes, a function related to the 'dim' parameter in the et feeder was called, such as 'involved_dim_size'. Since there was no 'involved_dim' entry in the node's attributes, 'involved_dim' was null. To address this, I added a default return value to allow the simulation to continue.

ETFeederNode::ETFeederNode(std::shared_ptr<ChakraProtoMsg::Node> node) {
  this->node_= node;
  this->id_ = node->id();
  this->name_ = node->name();
  this->runtime_ = node->duration_micros();
  this->is_cpu_op_ = true;
  for (int i = 0; i < node->attr_size(); i++) {
    string attr_name = node->attr(i).name();
    if (attr_name == "is_cpu_op") {
      assign_attr_val(node, i, (void *)(&is_cpu_op_));
    } else if (attr_name == "num_ops") {
      assign_attr_val(node, i, (void *)(&num_ops_));
    } else if (attr_name == "tensor_size") {
      assign_attr_val(node, i, (void *)(&tensor_size_));
    } else if (attr_name == "comm_type") {
      assign_attr_val(node, i, (void *)(&comm_type_));
    } else if (attr_name == "involved_dim") {
      assign_attr_val(node, i, (void *)(&involved_dim_));
      involved_dim_size_ = node->attr(i).bool_list().values_size();
    } else if (attr_name == "comm_priority") {
      assign_attr_val(node, i, (void *)(&comm_priority_));
    } else if (attr_name == "comm_size") {
      assign_attr_val(node, i, (void *)(&comm_size_));
    } else if (attr_name == "comm_src") {
      assign_attr_val(node, i, (void *)(&comm_src_));
    } else if (attr_name == "comm_dst") {
      assign_attr_val(node, i, (void *)(&comm_dst_));
    } else if (attr_name == "comm_tag") {
      assign_attr_val(node, i, (void *)(&comm_tag_));
    }
  }
}

uint32_t ETFeederNode::involved_dim_size() {
  return involved_dim_size_;
}

bool ETFeederNode::involved_dim(int i) {
  return involved_dim_[i];
}

In addition, I found that these 'dim' related functions have been removed in the main branch of Chakra. How should we proceed with this in the future?
