mlcommons / chakra
Repository for MLCommons Chakra schema and tools
Home Page: https://mlcommons.org/working-groups/research/chakra/
License: Apache License 2.0
I can't find any information about the PyTorch Execution Graph, even though I program in PyTorch. Could you give me some advice on how to obtain it so that I can feed it to Chakra?
There is too much information in the PyTorch ET plus, including function call relationships in great detail, but in fact we only care about computation, memory access, and communication.
Does such a large amount of information run counter to the design philosophy of chakra?
As defined in the et_def.proto, the attribute that stores duration (duration_micros) uses microsecond precision.
In some cases we encounter many COMP_NODE nodes with sub-microsecond runtimes that cannot be aggregated into larger COMP_NODE nodes. These times can add up to a significant amount.
I think it makes sense to have nanosecond precision. A double type would probably be the way to go. The simulator can then cap to whatever precision is best for its use case.
The above seems to be the more straightforward solution, but alternatively, a per-node or per-trace "timescale" field could also do the trick.
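To make the accumulation effect concrete, here is a small sketch (not Chakra code) that compares the total runtime of many short compute nodes when each duration is stored as whole microseconds versus nanoseconds. The node count and durations are made-up values, and truncation toward zero is an assumption about how a sub-microsecond duration ends up in an integer microsecond field.

```python
# Hypothetical workload: 10,000 compute nodes that each run for 700 ns.
durations_ns = [700] * 10_000

# Assumed integer microsecond storage: every 700 ns node truncates to 0 us.
total_us_truncated = sum(d // 1000 for d in durations_ns)

# Nanosecond storage keeps the full duration of every node.
total_ns = sum(durations_ns)

print(f"microsecond field: {total_us_truncated} us")
print(f"nanosecond field:  {total_ns / 1000} us")
```

With these illustrative numbers, the microsecond field reports no compute time at all, while the nanosecond total corresponds to 7 ms.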
I was following the Chakra trace collection tutorial. I was able to collect both the PyTorch ET and the Kineto trace, but I couldn't link them using trace_link.py. trace_link.py emitted the following error:
$ python3 trace_link.py --et-file matmul_et.json --kineto-file kineto_trace_matmul.json --exact-match
[2024-04-30 13:46:33,291] execution_trace.py:455 [INFO]: Iteration node ids list = [1]
[2024-04-30 13:46:33,291] trace_link.py:306 [INFO]: Number of original ops in execution trace: 2
[2024-04-30 13:46:33,291] trace_link.py:225 [INFO]: Kineto trace has 0 segments
[2024-04-30 13:46:33,291] trace_link.py:338 [WARNING]: Could not find annotation DataLoader in kineto file using the whole file, processing could be very slow!!
[2024-04-30 13:46:33,291] trace_link.py:343 [INFO]: Number of original cpu ops in kineto trace: 46
[2024-04-30 13:46:33,291] trace_link.py:344 [INFO]: Number of original gpu ops in kineto trace: 6
[2024-04-30 13:46:33,291] trace_link.py:350 [INFO]: Average iteration latency: 4282.0
Traceback (most recent call last):
File "/home/jmoon/workspace/transport/collect_et/trace_link.py", line 891, in <module>
main() # pragma: no cover
^^^^^^
File "/home/jmoon/workspace/transport/collect_et/trace_link.py", line 880, in main
dump_et_file(
File "/home/jmoon/workspace/transport/collect_et/trace_link.py", line 818, in dump_et_file
node["parent"] = assigned_ids[node["parent"]]
~~~~~~~~~~~~^^^^^^^^^^^^^^^^
KeyError: 3
I looked into the collected PyTorch ET to further debug the issue. I found that many nodes have a parent attribute with a value of 3, but there was no node with id 3 (Please refer to the screenshot). I believe this caused the above error. Is my trace collection procedure wrong or is it a known bug? If it's a known bug, is there any way to resolve this error? Any pointers or answers would be appreciated.
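A quick way to confirm this kind of dangling reference is to scan the trace for parent ids that no node defines. The sketch below is not part of PARAM or Chakra; it assumes the PyTorch ET JSON has a top-level "nodes" list whose entries carry "id" and "parent" fields, as described above, and the helper name find_dangling_parents is hypothetical.

```python
def find_dangling_parents(trace: dict) -> dict:
    """Map each missing parent id to the ids of the nodes that reference it."""
    nodes = trace.get("nodes", [])
    known_ids = {node["id"] for node in nodes}
    dangling = {}
    for node in nodes:
        parent = node.get("parent")
        if parent is not None and parent not in known_ids:
            dangling.setdefault(parent, []).append(node["id"])
    return dangling


# Tiny hand-written trace that mimics the situation: nothing has id 3.
example = {
    "nodes": [
        {"id": 1, "name": "root", "parent": 1},
        {"id": 2, "name": "thread", "parent": 1},
        {"id": 4, "name": "aten::matmul", "parent": 3},
        {"id": 5, "name": "aten::mm", "parent": 3},
    ]
}
print(find_dangling_parents(example))
```

To check a real file, load it with json.load and pass the resulting dict to the helper. Any key in the result is a parent id that would trigger the same KeyError during linking.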
Below is the PyTorch code that I used for the ET and Kineto trace collection:
import torch
import numpy as np
from torch.profiler import ExecutionTraceObserver, profile


def trace_handler(prof):
    prof.export_chrome_trace("kineto_trace_matmul.json")


def gpu_matrix_multiplication(matrix1: np.ndarray, matrix2: np.ndarray) -> torch.Tensor:
    """
    Perform matrix multiplication on the GPU using PyTorch.

    Args:
        matrix1 (np.ndarray): The first input matrix as a NumPy array.
        matrix2 (np.ndarray): The second input matrix as a NumPy array.

    Returns:
        torch.Tensor: The result of the matrix multiplication, as a PyTorch tensor.

    Raises:
        ValueError: If matrices have incompatible shapes for multiplication.
    """
    if matrix1.shape[1] != matrix2.shape[0]:
        raise ValueError("Matrices have incompatible shapes for multiplication.")

    # Convert numpy arrays to PyTorch tensors and set dtype to float
    matrix1_torch = torch.tensor(matrix1, dtype=torch.float)
    matrix2_torch = torch.tensor(matrix2, dtype=torch.float)

    # Transfer tensors to GPU if available
    if torch.cuda.is_available():
        matrix1_torch = matrix1_torch.to('cuda')
        matrix2_torch = matrix2_torch.to('cuda')

    # Perform matrix multiplication using GPU
    result_gpu = torch.matmul(matrix1_torch, matrix2_torch)
    return result_gpu


if __name__ == "__main__":
    # for ET
    et = ExecutionTraceObserver()
    et_filename = "matmul_et.json"
    et.register_callback(et_filename)

    # for Kineto traces
    with profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        # skip first 10 iterations
        # record 1 iteration after the first 10.
        schedule=torch.profiler.schedule(wait=0, warmup=10, active=1),
        on_trace_ready=trace_handler,
    ) as prof:
        # Define larger matrices (1024x1024) using NumPy
        matrix_a = np.random.rand(1024, 1024)
        matrix_b = np.random.rand(1024, 1024)

        for epoch in range(20):
            # training function goes here
            result_on_gpu = gpu_matrix_multiplication(matrix_a, matrix_b)
            result2_on_gpu = gpu_matrix_multiplication(matrix_a, result_on_gpu)
            if epoch == 11:
                et.stop()
            if epoch == 10:
                et.start()
            prof.step()

    et.unregister_callback()
trace_link.py is from the PARAM GitHub repository, and I executed it with the command below.
$ python3 trace_link.py --et-file matmul_et.json --kineto-file kineto_trace_matmul.json --exact-match
The PyTorch version is 2.1.2, as higher versions have some issues (related to #40).
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.1.2+cu121
[pip3] torchaudio==2.1.2+cu121
[pip3] torchvision==0.16.2+cu121
[pip3] triton==2.1.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] torch 2.1.2+cu121 pypi_0 pypi
[conda] torchaudio 2.1.2+cu121 pypi_0 pypi
[conda] torchvision 0.16.2+cu121 pypi_0 pypi
[conda] triton 2.1.0 pypi_0 pypi
I expected that PyTorch ET would be collected without missing dependencies so that the link procedure would succeed without an error.
Regarding the ASTRA-sim issue astra-sim/astra-sim#195: I think different parallel groups (tensor parallel, data parallel, etc.) should be defined in comm_group_configuration, but this requires Chakra support. Currently, Chakra cannot distinguish which communication domain a communication node in the ET belongs to.
So is there any way to map the communication nodes in a Chakra ET to different communication domains?
It is too tough to get started.
May I ask whether there is any tutorial or example for this project?
For example, how can I get the PyTorch ET from a cluster, and how do I convert it to a Chakra ET?
How do I visualize the Chakra ET?
How do I install tools such as Mystique or the execution graph observer?
It is a good tool, but it is hard for a beginner.
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/content/chakra/et_converter/et_converter.py", line 9, in <module>
from .text2chakra_converter import Text2ChakraConverter
File "/content/chakra/et_converter/text2chakra_converter.py", line 8, in <module>
from chakra.et_def.et_def_pb2 import (
ModuleNotFoundError: No module named 'chakra.et_def.et_def_pb2'
git clone https://github.com/mlcommons/chakra
cd chakra
pip install -e .
python -m chakra.et_converter.et_converter --input_type PyTorch --input_filename traces/traces/cdd55a1099e8_561.1714517342978231506.pt.trace.json --output_filename traces/Chakra
I wonder whether you could show how to get the PyTorch execution trace output that Chakra takes and converts?
I tried to collect the trace using the default trace handler, torch.profiler.tensorboard_trace_handler, and using torch.jit.trace(). The outputs from both attempts are very different from what pytorch2chakra_converter expects.
Thanks.
Hi,
The following error appears when trying to run et_generator, because et_def_pb2.py no longer seems to include the following data types as enums:
BOOL,
FLOAT,
INT,
STRING,
BOOLS,
FLOATS,
INTS,
STRINGS,
python3 -m utils.et_generator.et_generator --num_npus 5 --num_dims 4
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/leekevin/chakra/utils/et_generator/et_generator.py", line 6, in <module>
from et_def.et_def_pb2 import (
ImportError: cannot import name 'BOOL' from 'et_def.et_def_pb2' (/home/leekevin/chakra/et_def/et_def_pb2.py)
For example:
Given a DNN object and random data, get the Chakra ET for DNN(data).
Steps include:
- Get the ET with the Graph Observer (include output file)
- Get execution timestamps with Kineto (include output file)
- Merge the above two files with PARAM into an ET with timestamps (include output file)
- Convert into a Chakra ET with the Chakra converter
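The last two steps can be sketched as a pair of command lines, using only the trace_link.py and et_converter invocations that appear in the reports on this page. This is a rough sketch rather than an official workflow: the helper name build_pipeline_commands and the linked and Chakra file names are hypothetical, and the path where trace_link.py writes its linked trace is not stated here, so it is passed in by the caller.

```python
import shlex


def build_pipeline_commands(et_file, kineto_file, linked_file, chakra_file):
    """Return the link and convert commands as argv lists (nothing is executed)."""
    # Step 3: merge the PyTorch ET with the Kineto timestamps (PARAM's trace_link.py).
    link_cmd = [
        "python3", "trace_link.py",
        "--et-file", et_file,
        "--kineto-file", kineto_file,
        "--exact-match",
    ]
    # Step 4: convert the linked trace into a Chakra ET.
    convert_cmd = [
        "python3", "-m", "chakra.et_converter.et_converter",
        "--input_type", "PyTorch",
        "--input_filename", linked_file,
        "--output_filename", chakra_file,
    ]
    return [link_cmd, convert_cmd]


# Hypothetical file names for the linked trace and the Chakra output.
commands = build_pipeline_commands(
    "matmul_et.json", "kineto_trace_matmul.json",
    "matmul_et_plus.json", "matmul_chakra.et",
)
for cmd in commands:
    print(shlex.join(cmd))
```

Each argv list can be handed to subprocess.run once the first two steps have produced the ET and Kineto files.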
When I run the PyTorch converter, it reports that the nccl:send comm_type is not supported. Is there any plan to support it, or is this comm_type not expected in the trace?
admin@admin: ~/llm/chakra(main)$ python3 -m chakra.et_converter.et_converter --input_type PyTorch --input_filename et_plus/profile_et_rank_0_plus.json --output_filename et_plus/profile_chakra.0.et
Traceback (most recent call last):
File "/home/admin/miniconda3/lib/python3.12/site-packages/chakra/et_converter/et_converter.py", line 89, in main
converter.convert()
File "/home/admin/miniconda3/lib/python3.12/site-packages/chakra/et_converter/pytorch2chakra_converter.py", line 169, in convert
collective_comm_type = self.get_collective_comm_type(pytorch_node.name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/admin/miniconda3/lib/python3.12/site-packages/chakra/et_converter/pytorch2chakra_converter.py", line 395, in get_collective_comm_type
raise ValueError(f"'{name}' not found in collective communication mapping. "
ValueError: 'nccl:send' not found in collective communication mapping. Please add this collective communication name to the mapping.
If you have the plan, I would like to know.
Or, if you want to integrate Chakra with SST but it is not yet prioritized, I am willing to contribute this and discuss it.
Hi,
Could you please share more ET traces, such as the LLaMA traces you mentioned in previous issues?
Currently, I only have the converted traces from Astra-sim 1.0 and the Megatron trace mentioned in issue #176.
It would be really helpful if you could share more traces.
Thanks!
Dear Authors,
Thank you for the tool. I am new to Chakra and want to use the performance model for my current project. I have just started installing the tool and generating the execution trace. I see the following instruction for converting an ET from PyTorch, but I have no idea what the input file, output file, default simulated run time, and num dims should look like. Would it be possible to provide an example of these files or configurations? Thank you.
$ python -m et_converter.et_converter \
    --input_type PyTorch \
    --input_filename <input_filename> \
    --output_filename <output_filename> \
    --default_simulated_run_time <default_simulated_run_time> \
    --num_dims <num_dims>
I have encountered numerous 'record_param_comms' nodes in Chakra ET, which serve as child nodes to collective communication nodes. I presume that these functions are intended to log communication information, such as the communication domain for collective communications, the counterpart in point-to-point communications, the size of the communication volume, and other parameters. However, this is just my speculation, as I have not been able to find specific invocations of these functions within PyTorch. How is this information utilized within Chakra?
I was using ASTRA-sim 2.0 to generate Chakra traces from ASTRA-sim 1.0 workloads with the following command:
python3 -m chakra.et_converter.et_converter \
    --input_type Text \
    --input_filename ../../../inputs/workload/ASTRA-sim-1.0/Resnet50_DataParallel.txt \
    --output_filename ../../../inputs/workload/ASTRA-sim-2.0/Resnet50_DataParallel \
    --num_npus 64 \
    --num_dims 1 \
    --num_passes 1
I hit the following error:
DEBUG [04/17/2024 12:04:48 PM] Traceback (most recent call last):
File "/home/esar/.local/lib/python3.10/site-packages/chakra/et_converter/et_converter.py", line 106, in main
converter.convert()
File "/home/esar/.local/lib/python3.10/site-packages/chakra/et_converter/text2chakra_converter.py", line 147, in convert
self.convert_data_parallel(f, num_layers)
File "/home/esar/.local/lib/python3.10/site-packages/chakra/et_converter/text2chakra_converter.py", line 202, in convert_data_parallel
self.add_parent(fwd_comp_node, layers[idx-1].fwd_comp_node)
File "/home/esar/.local/lib/python3.10/site-packages/chakra/et_converter/text2chakra_converter.py", line 136, in add_parent
child_node.parent.append(parent_node.id)
AttributeError: parent
So I checked the Node definition in ./et_def/et_def.proto and found that the node doesn't have a 'parent' attribute. I added one at the bottom, like this:
//parent
repeated uint64 parent=11;
After that, the conversion passes.
ETFeederNode::ETFeederNode(std::shared_ptr<ChakraProtoMsg::Node> node) {
  this->node_ = node;
  this->id_ = node->id();
  this->name_ = node->name();
  this->runtime_ = node->duration_micros();
  this->is_cpu_op_ = true;

  for (int i = 0; i < node->attr_size(); i++) {
    string attr_name = node->attr(i).name();
    if (attr_name == "is_cpu_op") {
      assign_attr_val(node, i, (void *)(&is_cpu_op_));
    } else if (attr_name == "num_ops") {
      assign_attr_val(node, i, (void *)(&num_ops_));
    } else if (attr_name == "tensor_size") {
      assign_attr_val(node, i, (void *)(&tensor_size_));
    } else if (attr_name == "comm_type") {
      assign_attr_val(node, i, (void *)(&comm_type_));
    } else if (attr_name == "involved_dim") {
      assign_attr_val(node, i, (void *)(&involved_dim_));
      involved_dim_size_ = node->attr(i).bool_list().values_size();
    } else if (attr_name == "comm_priority") {
      assign_attr_val(node, i, (void *)(&comm_priority_));
    } else if (attr_name == "comm_size") {
      assign_attr_val(node, i, (void *)(&comm_size_));
    } else if (attr_name == "comm_src") {
      assign_attr_val(node, i, (void *)(&comm_src_));
    } else if (attr_name == "comm_dst") {
      assign_attr_val(node, i, (void *)(&comm_dst_));
    } else if (attr_name == "comm_tag") {
      assign_attr_val(node, i, (void *)(&comm_tag_));
    }
  }
}

uint32_t ETFeederNode::involved_dim_size() {
  return involved_dim_size_;
}

bool ETFeederNode::involved_dim(int i) {
  return involved_dim_[i];
}