
timeloop-accelergy-exercises: Introduction

Fibertree/Timeloop/Accelergy Tutorial Exercises

This repository contains a set of exercises and baseline designs for exploring Fibertree, Timeloop, and Accelergy. Please find the respective directories and more detailed descriptions under the workspace folder.

Using Docker

We provide a Docker image with the tools pre-installed so you can get started quickly:

  • Make a copy of the provided template docker compose file: cp docker-compose.yaml.template docker-compose.yaml
  • Examine the instructions in docker-compose.yaml to set up the container correctly, e.g., set the correct UID and GID (a rough sketch of a filled-in file is shown after the notes below).
  • Pull the newest Docker image: docker-compose pull
  • Run the container: docker-compose up. You should see the container being set up.
  • This container runs Jupyter notebooks, and you will see a URL once it is up. Please copy and paste the URL into a web browser of your choice to access the workspace.

Notes (if the notebook URL does not work)
  • Option 1: in your docker-compose.yaml file, uncomment the last line under environment to disable the token and try again.
  • Option 2: try the 192.168.X.X host with the same token as shown in the output (X.X can be obtained with hostname -I).
  • Option 3: if you have access to a Docker GUI app (e.g., Kitematic for the Docker terminal), try opening the web page there with the token.
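
For orientation only, a filled-in docker-compose.yaml often ends up looking roughly like the sketch below. The service name, port mapping, volume path, and environment-variable names are illustrative guesses rather than the template's actual contents (only the image name is taken from this repository's Docker setup), so always follow the comments inside the provided template itself.

# Hedged sketch of a filled-in docker-compose.yaml; service name, port, volume
# path, and environment-variable names are assumptions, not the template's contents.
version: "3"
services:
  exercises:
    image: timeloopaccelergy/timeloop-accelergy-pytorch:latest-amd64
    ports:
      - "8888:8888"                    # Jupyter's default port (assumption)
    volumes:
      - ./workspace:/home/workspace    # mount the exercises into the container (assumption)
    environment:
      - USER_UID=1000                  # set to the output of `id -u` (variable name is a guess)
      - USER_GID=1000                  # set to the output of `id -g` (variable name is a guess)
      # - JUPYTER_TOKEN=               # the template's last environment line disables the token when uncommented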

Native Installation

Please find the instructions for native installation of the required tools here.

Related reading

Citation

Please cite the following:

  • A. Parashar, P. Raina, Y. S. Shao, Y.-H. Chen, V. A. Ying, A. Mukkara, R. Venkatesan, B. Khailany, S. W. Keckler, and J. Emer, “Timeloop: A systematic approach to DNN accelerator evaluation,” in 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019, pp. 304–315.
  • M. Horeni, P. Taheri, P. Tsai, A. Parashar, J. Emer, and S. Joshi, “Ruby: Improving hardware efficiency for tensor algebra accelerators through imperfect factorization,” in 2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2022, pp. 254–266.
  • Y. N. Wu, P.-A. Tsai, A. Parashar, V. Sze, and J. S. Emer, “Sparseloop: An analytical, energy-focused design space exploration methodology for sparse tensor accelerators,” in 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2021, pp. 232–234.
  • Y. N. Wu, J. S. Emer, and V. Sze, “Accelergy: An architecture-level energy estimation methodology for accelerator designs,” in 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2019, pp. 1–8.
  • T. Andrulis, J. S. Emer, and V. Sze, “CiMLoop: A flexible, accurate, and fast compute-in-memory modeling tool,” in 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2024.

Or use the following BibTeX:

@inproceedings{timeloop,
  author      = {Parashar, Angshuman and Raina, Priyanka and Shao, Yakun Sophia and Chen, Yu-Hsin and Ying, Victor A. and Mukkara, Anurag and Venkatesan, Rangharajan and Khailany, Brucek and Keckler, Stephen W. and Emer, Joel},
  booktitle   = {2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)},
  title       = {Timeloop: A Systematic Approach to {DNN} Accelerator Evaluation},
  pages       = {304--315},
  year        = {2019},
}
@inproceedings{ruby,
  author      = {M. Horeni and P. Taheri and P. Tsai and A. Parashar and J. Emer and S. Joshi},
  booktitle   = {2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)},
  title       = {Ruby: Improving Hardware Efficiency for Tensor Algebra Accelerators Through Imperfect Factorization},
  year        = {2022},
}
@inproceedings{sparseloop,
  author      = {Wu, Yannan N. and Tsai, Po-An and Parashar, Angshuman and Sze, Vivienne and Emer, Joel S.},
  booktitle   = {ACM/IEEE International Symposium on Microarchitecture (MICRO)},
  title       = {Sparseloop: An Analytical Approach To Sparse Tensor Accelerator Modeling},
  year        = {2022},
}
@inproceedings{accelergy,
  author      = {Wu, Yannan Nellie and Emer, Joel S and Sze, Vivienne},
  booktitle   = {2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)},
  title       = {Accelergy: An architecture-level energy estimation methodology for accelerator designs},
  year        = {2019},
}
@inproceedings{cimloop,
  author      = {Andrulis, Tanner and Emer, Joel S. and Sze, Vivienne},
  booktitle   = {2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)}, 
  title       = {{CiMLoop}: A Flexible, Accurate, and Fast Compute-In-Memory Modeling Tool}, 
  year        = {2024},
}

timeloop-accelergy-exercises: People

Contributors

05tushar, angshuman-parashar, jsemer, nellie-wu, poant, prshr, tanner-andrulis, victorbutoi


timeloop-accelergy-exercises: Issues

The total energy is not related to the number of arithmetic instances.

Hello,

In the timeloop+accelergy exercise, I changed the number of PEs from 168 to 42. Looking at the timeloop-mapper.stats.txt file, I noticed that the Energy (total) of the arithmetic part (level 0) does not change between the 168-PE and 42-PE versions, unlike the number of cycles and the area.

- architecture with 168 PE -> Energy (total)       : 2034656870.40 pJ
- architecture with 42 PE  -> Energy (total)       : 2034656870.40 pJ

Can you tell me why this energy value does not change?

Thank you in advance.

GPU model as input to timeloop

Hi,

There is a sparse-tensor-core-like configuration given in the workspace. I understand it does not represent a GPU, but rather an accelerator emulation of the sparse tensor core. I have the following questions regarding the input configuration file for a GPU:

  1. Is it reasonable to use Timeloop to analytically evaluate the performance trend of different workloads on a GPU?

  2. Does the example of MM based on ResNet take care of the im2col transformation in on-chip memory when calculating energy?

  3. Are the constraints (bypass and spatial division of the problem space) based on how MM is performed on a GPU, or just a plausible mapping option?

  4. Is it possible to use the sparse tensor core example to run dense MM computations with some modification to the constraints? It fails to find any valid mapping if it is simply run with timeloop-mapper by providing all other yaml files except sparse-opt.yaml.

Thanks!

Timeloop Help

Hello Timeloop Team:

I am trying to create a model for my design and am using "compound_components" to create my building blocks. When I instantiate the compound component in my architecture.yaml file, the component is not recognized. I see the message below, which says it recognizes only the standard classes.

Unknown element class xyz. Accepted classes: {('DRAM', 'SRAM', 'regfile', 'smartbuffer', 'storage'): <class 'timeloopfe.v4.arch.Storage'>, ('mac', 'intmac', 'fpmac', 'compute'): <class 'timeloopfe.v4.arch.Compute'>, ('XY_NoC', 'Legacy', 'ReductionTree', 'SimpleMulticast'): <class 'timeloopfe.v4.arch.Network'>, ('nothing',): <class 'timeloopfe.v4.arch.Nothing'>}.

Any ideas what I am doing wrong here?

thanks
Sreekanth

ERROR: key not found: ERT

Hello,

When I run the timeloop exercise_06, an error popped up (see the attached screenshot, omitted here, for details). Could you please help me fix it? Thank you so much!

Example designs' output for simple_output_stationary and simple_weight_stationary does not seem to match the described dataflows

Thank you so much for the great examples.
I am not quite sure if I am correct, but looking at the example designs for simple_output_stationary and
simple_weight_stationary and the mapping assigned by Timeloop, the result does not seem to match the described dataflows.
I believe the reason is that no specific data type is pinned to the shared global buffer, so it changes during the optimization.

Area is always 0 in the stats file

Hello,

When working with the Timeloop exercises, the area is always 0.00 um^2 in the .stats.txt file.
I assume this was done for tutorial purposes, to keep things simple?
What would be the way to have Timeloop & Accelergy account for the components' area? I am currently using my own estimation .csv tables, and while the energy is correctly accounted for, the area is still stuck at 0.

Thanks.

ERROR: key not found: data-spaces, at line: 0

How to solve this?

(base) root@f79f1fdce31f:~/accelergy-timeloop-infrastructure/src/timeloop-accelergy-exercises/workspace/example_designs# python3 run_example_designs.py --architecture simple_weight_stationary


Running mapper for target simple_weight_stationary in /root/accelergy-timeloop-infrastructure/src/timeloop-accelergy-exercises/workspace/example_designs/example_designs/simple_weight_stationary/outputs/default_problem...
input file: /root/accelergy-timeloop-infrastructure/src/timeloop-accelergy-exercises/workspace/example_designs/example_designs/simple_weight_stationary/outputs/default_problem/parsed-processed-input.yaml
  _______                __                
 /_  __(_)___ ___  ___  / /___  ____  ____ 
  / / / / __ `__ \/ _ \/ / __ \/ __ \/ __ \
 / / / / / / / / /  __/ / /_/ / /_/ / /_/ /
/_/ /_/_/ /_/ /_/\___/_/\____/\____/ .___/ 
                                  /_/      

ERROR: key not found: data-spaces, at line: 0
joblib.externals.loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.10/site-packages/joblib/externals/loky/process_executor.py", line 463, in _process_worker
    r = call_item()
  File "/root/miniconda3/lib/python3.10/site-packages/joblib/externals/loky/process_executor.py", line 291, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/joblib/parallel.py", line 589, in __call__
    return [func(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/joblib/parallel.py", line 589, in <listcomp>
    return [func(*args, **kwargs)
  File "/root/accelergy-timeloop-infrastructure/src/timeloop-accelergy-exercises/workspace/example_designs/run_example_designs.py", line 66, in run_mapper
    assert os.path.exists(f"{output_dir}/timeloop-mapper.stats.txt"), (
AssertionError: Mapper did not generate expected output for simple_weight_stationary. Please check the logs for more details.
"""

Core Dump

Hello:

I am seeing the core dump below when I enable sparsity at the DRAM level.

timeloop-mapper: src/model/arithmetic.cpp:387: virtual double model::ArithmeticUnits::Energy(problem::Shape::DataSpaceID) const: Assertion `is_evaluated_' failed.
Aborted (core dumped)

My sparsity spec, with DRAM as the target, is in the attached screenshot (not reproduced here).

Any help in root-causing the core dump is appreciated.

thanks
Sreekanth

Quick clarification on the SIMBA datawidth

Hello,

I just wanted to double check something on the SIMBA architecture file (https://github.com/Accelergy-Project/timeloop-accelergy-exercises/blob/master/baseline_designs/example_designs/simba_like/arch/simba_like.yaml):

1- You define the DRAM word to be 16 bits (word-bits: 16)
2- The GlobalBuffer/PEInputBuffer/PEWeightBuffer/PEWeightRegs are set to 8 bits (word-bits: 8)
3- The LMAC is set to 16 bits (datawidth: 16)

In the SIMBA paper, they use a datawidth of 8b for both the weights and ifmaps. Shouldn't you use 8 bits for the DRAM and the LMAC here? Is that a mistake, or am I missing something (or is the datawidth field of the intmac used to describe the datawidth of the output rather than the inputs)?

Thanks!

Check here! https://github.com/nellie-wu/timeloop/blob/micro22-artifact/src/workload/density-models/real-data.cpp . It's not merged with the main branch (and it uses the old v3 syntax), but you may adapt the artifact to your needs.


Originally posted by @tanner-andrulis in #37 (comment)

Total area in `timeloop-mapper.stats.txt` is always zero

This seems to be a very old bug (issue 4), but I still find it unsolved in the latest Timeloop (see the attached screenshot, omitted here).
Can I get the total area through the API provided by Timeloop, for example via some function in timeloopfe?
Or do I have to parse timeloop-mapper.ART.yaml and sum the areas of all components?

Difference in "factors" in paper and exercise 6 for eyeriss

According to the figure from Sze's survey and the Timeloop paper (screenshot omitted here):
S and C are distributed in the y-direction, and Q and K are distributed in the x-direction.
The "factors" in the paper, "S0 P1 R1 N1", along with the permutation "SC.QK", suggest that parallelism is allowed in the S, C, Q, and K dimensions, with SC mapped to the x-direction and QK mapped to the y-direction.
However, the .yaml file in exercise 6 for Eyeriss is different:
here the factors are "N1 C1 P1 R1 S1" and the permutation is "NCPRS QK", which means that parallelism is allowed only in the Q and K dimensions, with QK unrolled along the y-direction while the other loops are unrolled in the x-direction.

Please help me clear these doubts.
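
For concreteness, the exercise-6 spatial constraint being discussed looks roughly like the sketch below in the v0.3 constraint syntax. The target name and the split position are my reading of the exercise rather than a verified copy, so treat them as assumptions:

# hedged reconstruction of the spatial constraint under discussion;
# the target name and the split value are assumptions
- target: PE
  type: spatial
  factors: N=1 C=1 P=1 R=1 S=1   # Q and K are left unconstrained, so only they can be unrolled spatially
  permutation: NCPRS QK
  split: 5                       # dimensions before the split index map to X, the rest (QK) to Y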

How does Accelergy determine an area from the plug-ins?

Hello,

I'm looking at the different exercises in the tutorial to understand the Timeloop and Accelergy frameworks.

In example 05-mapper-conv1d+oc-3level of Timeloop, the architecture is composed of a single PE containing 1 integer MAC.
The command timeloop-mapper arch/3level.arch.yaml prob/conv1d+oc.prob.yaml mapper/exhaustive.mapper.yaml constraints/null.constraints.yaml -o output generates the file timeloop-mapper.ART_summary.yaml, which contains the following MAC area data:

ART_summary:
  version: 0.3
  table_summary:
  ...
  - name: System.Chip.PE.MACC
    area: 332.25
    primitive_estimations: Aladdin_table
  ...

I have seen that the script aladdin_table.py collects the area data of a multiplier and an adder from the corresponding .csv files. By default, the data corresponding to a latency of 5ns is used. The areas of the multiplier and the adder are then added to estimate the area of the MAC. In this case, I think I can estimate the area manually: adder area (1.79E+02) + multiplier area (4.60E+03) = MAC area (4779 um^2).

But the manually calculated MAC area doesn't correspond to the MAC area reported in the file timeloop-mapper.ART_summary.yaml:

  • 332.25 um^2 (MAC area estimated by Accelergy)
  • 4779 um^2 (MAC area estimated manually)

Can you explain to me why I have this difference, please?

Thank you in advance.

load_yaml() got an unexpected keyword argument 'data'

When I run "python3 run_example.py 00" in "timeloop-accelergy-exercises/workspace/exercises/01_accelergy_timeloop_2020_ispass/timeloop", this error happens:
File "/home/mrpp/software/anaconda3/lib/python3.11/site-packages/timeloopfe/common/nodes.py", line 1348, in from_yaml_files
loaded = yaml.load_yaml(f, data=jinja_parse_data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: load_yaml() got an unexpected keyword argument 'data'

Then I found the load_yaml function:
def load_yaml(
    path: str = None, string: str = None
) -> Union[Dict[str, Any], None]:
    """
    Load YAML content from a file or string
    :param path: string that specifies the path of the YAML file to be loaded
    :param string: string that contains the YAML content to be loaded
    :return: parsed YAML content
    """
    assert (string is None) != (
        path is None
    ), "Must specify either path or string, but not both."
    # Recursively parse through x, replacing any <<< with a recursive merge
    # print(f'Calling recursive merge check on {x}')
    return merge_check(yaml.load(load_file_and_includes(path, string)))

How can I fix it?

Representation of Depth-wise convolution

Hello,

Thanks for providing such a powerful infrastructure and informative tutorials.
The exercises use standard convolution as the example, but compact convolutions such as depth-wise convolution are becoming more popular in lightweight mobile models. How can I represent a depth-wise convolution in Timeloop? Thanks!
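
For what it's worth, one way to express a depth-wise convolution in the v0.3 problem-shape syntax (the same syntax used in the 3d-CNN-Layer issue further down) is to drop the separate output-channel dimension so that Weights, Inputs, and Outputs all project the same channel dimension C. The sketch below is an untested illustration under that assumption, not an official shape shipped with the exercises:

problem:
  shape:
    name: depthwise-CNN-Layer       # illustrative name, not from the repository
    dimensions: [ C, R, S, N, P, Q ]
    coefficients:
      - name: Wstride
        default: 1
      - name: Hstride
        default: 1
    data-spaces:
      - name: Weights
        projection:
          - [ [C] ]
          - [ [R] ]
          - [ [S] ]
      - name: Inputs
        projection:
          - [ [N] ]
          - [ [C] ]
          - [ [R], [P, Wstride] ]   # R + P*Wstride
          - [ [S], [Q, Hstride] ]   # S + Q*Hstride
      - name: Outputs
        projection:
          - [ [N] ]
          - [ [C] ]                 # outputs reuse the channel dimension: one filter per channel
          - [ [P] ]
          - [ [Q] ]
        read-write: True
  instance:                         # illustrative sizes
    C: 64
    R: 3
    S: 3
    N: 1
    P: 56
    Q: 56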

Questions about Problem.yaml

Hello,
Thank you for publishing this useful tool.
Here I have some questions about Problem.yaml.

  1. Does Timeloop support other network layers (such as maxpooling, avgpooling, and batch normalization) besides conv? Are there any methods and examples of implementation?
  2. Does Timeloop support measuring the energy consumption of forward, backward and weight update calculation processes? Are there any methods and examples of implementation?
    Thank you!

Error on bringing up docker image

Thanks for the quick response in the "tutorial" repo. I am now seeing the error below in this repository.

unable to get image 'timeloopaccelergy/timeloop-accelergy-pytorch:latest-amd64': request returned Internal Server Error for API route and version http://%2F%2F.%2Fpipe%2Fdocker_engine/v1.24/images/timeloopaccelergy/timeloop-accelergy-pytorch:latest-amd64/json, check if the server supports the requested API version

Timeloop Paper Figure 10.b

I am attempting to replicate Figure 10.b, but I've encountered an issue. According to my calculations, AlexNet conv4 has 224,280,576 MACCs, while conv3 has 149,520,384 MACCs. When estimating the total energy consumption for both layers with the Timeloop mapper, conv4 consumes more energy than conv3, which is logical.
However, in Figure 10.b of the paper, the total energy for conv4 is depicted as less than conv3. How can this be explained?

timeloop exercises reporting incorrect large energy value

Hello,

I am excited to get started with Timeloop and Accelergy. I plan to use it for my research.

I did a native installation and tried the first example. I see a very high energy number:

$TIMELOOP_BASE_PATH/build/timeloop-model arch/* prob/* map/*
input file: arch/1level.arch.yaml
input file: prob/conv1d.prob.yaml
input file: map/conv1d-1level.map.yaml
execute:/home/shoumikm/.local/bin/accelergy arch/1level.arch.yaml prob/conv1d.prob.yaml map/conv1d-1level.map.yaml --oprefix timeloop-model. -o ./ > timeloop-model.accelergy.log 2>&1
Utilization = 1.00 | pJ/Compute = 161478185935234976.000

I did a detailed comparison of the results with the reference outputs, and I find these mismatches in the Outputs section of timeloop-model.stats.txt:

Outputs:
Partition size : 16
Utilized capacity : 16
Utilized instances (max) : 1
Utilized clusters (max) : 1
Scalar reads (per-instance) : 32
Scalar updates (per-instance) : 48
Scalar fills (per-instance) : 18446744073709551600
Temporal reductions (per-instance) : 32
Address generations (per-cluster) : 32
Energy (per-scalar-access) : 121108639451426240.00 pJ
Energy (per-instance) : 7750952924891279360.00 pJ
Energy (total) : 7750952924891279360.00 pJ

Temporal Reduction Energy (per-instance) : 0.00 pJ
Temporal Reduction Energy (total) : 0.00 pJ
Address Generation Energy (per-cluster) : 0.00 pJ
Address Generation Energy (total) : 0.00 pJ
Shared Bandwidth (per-instance) : 1.33 words/cycle
Shared Bandwidth (total) : 1.33 words/cycle
Read Bandwidth (per-instance) : 0.67 words/cycle
Read Bandwidth (total) : 0.67 words/cycle
Write Bandwidth (per-instance) : 0.67 words/cycle
Write Bandwidth (total) : 0.67 words/cycle

Can you please let me know what is going wrong? I appreciate your help.

Best,

Shoumik

No ART or ERT files generated

Hi,

I installed Accelergy, then compiled Timeloop with Accelergy and ran the following command:

timeloop-mapper arch/eyeriss_like-int16.yaml \
                arch/components/*.yaml \
                prob/prob.yaml \
                mapper/mapper.yaml \
                constraints/*.yaml

I do not see any ART or ERT files. Both Timeloop and Accelergy are the latest versions, cloned directly from their respective repos.
Moreover, if I run the same command without the component files, I get the same result in terms of utilization and energy per MACC. I am doubtful whether Accelergy is being used at all.

timeloop-mapper arch/eyeriss_like-int16.yaml \
                prob/prob.yaml \
                mapper/mapper.yaml \
                constraints/*.yaml

Can you please guide me through?

permission error

Hello,

When trying the timeloop exercise 00-model-conv1d-1level, it works and completes.
However, when trying 01-model-conv1d-2level, it does not work and gives the following error message.
I tried to find the "timeloop-model.SRAM.cfg" file, but the cfg file is not in the cacti directory.
I think something went wrong somewhere in the installation process.
Could you help me figure out how to fix it?

Thanks.

error message

Info: generating outputs according to the following specified output flags...
 Please use the -f flag to update the preference (default to all output files)
{'ART': 1, 'energy_estimation': 1, 'flattened_arch': 1, 'ERT_summary': 1, 'ART_summary': 1, 'ERT': 1}
Info: Parsing file arch/2level.arch.yaml for architecture info
Warn: Cannot recognize the top key "problem" in file prob/conv1d.prob.yaml
Warn: Cannot recognize the top key "mapping" in file map/conv1d-2level-ws.map.yaml
Info: config file located: /home/aim/.config/accelergy/accelergy_config.yaml
config file content:
 {'version': 0.3, 'table_plug_ins': {'roots': ['/usr/local/lib/share/accelergy/estimation_plug_ins/accelergy-table-based-plug-ins/set_of_table_templates']}, 'primitive_components': ['/usr/local/lib/share/accelergy/primitive_component_libs'], 'estimator_plug_ins': ['/usr/local/lib/share/accelergy/estimation_plug_ins']}
Info: primitive component file parsed:  /usr/local/lib/share/accelergy/primitive_component_libs/primitive_component.lib.yaml
Info: primitive component file parsed:  /usr/local/lib/share/accelergy/primitive_component_libs/gem5_primitives.lib.yaml
Info: primitive component file parsed:  /usr/local/lib/share/accelergy/primitive_component_libs/pim_primitive_component.lib.yaml
Warn: No compound component classes specified, architecture can only contain primitive components
Info: estimator plug-in identified by:  /usr/local/lib/share/accelergy/estimation_plug_ins/accelergy-table-based-plug-ins/table.estimator.yaml
table-based-plug-ins Identifies a set of tables named:  test_tables
Info: estimator plug-in identified by:  /usr/local/lib/share/accelergy/estimation_plug_ins/accelergy-cacti-plug-in/cacti.estimator.yaml
Info: estimator plug-in identified by:  /usr/local/lib/share/accelergy/estimation_plug_ins/estimation_plug_ins/aladdin.estimator.yaml
Info: estimator plug-in identified by:  /usr/local/lib/share/accelergy/estimation_plug_ins/dummy_tables/dummy.estimator.yaml
Info: CACTI plug-in... Querying CACTI for request:
 {'arguments': None, 'class_name': 'SRAM', 'attributes': {'n_rdwr_ports': 1, 'word-bits': 8, 'technology': '40nm', 'width': 64, 'latency': '5ns', 'block-size': 8, 'n_rd_ports': 0, 'depth': 32768, 'n_banks': 1, 'n_wr_ports': 0}, 'action_name': 'idle'}
Traceback (most recent call last):
  File "/usr/local/bin/accelergy", line 9, in <module>
    load_entry_point('accelergy==0.3', 'console_scripts', 'accelergy')()
  File "/usr/local/lib/python3.5/dist-packages/accelergy-0.3-py3.5.egg/accelergy/accelergy_console.py", line 136, in main
    'precision': precision})
  File "/usr/local/lib/python3.5/dist-packages/accelergy-0.3-py3.5.egg/accelergy/ERT_generator.py", line 56, in __init__
    self.generate_pc_ERT(pc)
  File "/usr/local/lib/python3.5/dist-packages/accelergy-0.3-py3.5.egg/accelergy/ERT_generator.py", line 72, in generate_pc_ERT
    estimated_energy, estimator_name = self.eval_primitive_action_energy(estimation_plug_in_interface)
  File "/usr/local/lib/python3.5/dist-packages/accelergy-0.3-py3.5.egg/accelergy/ERT_generator.py", line 154, in eval_primitive_action_energy
    energy = round(best_estimator.estimate_energy(estimator_plug_in_interface), self.precision)
  File "/usr/local/lib/share/accelergy/estimation_plug_ins/accelergy-cacti-plug-in/cacti_wrapper.py", line 68, in estimate_energy
    energy = getattr(self, query_function_name)(interface)
  File "/usr/local/lib/share/accelergy/estimation_plug_ins/accelergy-cacti-plug-in/cacti_wrapper.py", line 258, in SRAM_estimate_energy
    self.SRAM_populate_data(interface)
  File "/usr/local/lib/share/accelergy/estimation_plug_ins/accelergy-cacti-plug-in/cacti_wrapper.py", line 199, in SRAM_populate_data
    n_banks, cfg_file_path)
  File "/usr/local/lib/share/accelergy/estimation_plug_ins/accelergy-cacti-plug-in/cacti_wrapper.py", line 312, in cacti_wrapper_for_SRAM
    shutil.copyfile(default_cfg_file_path, cacti_exec_dir + '/' + cfg_file_name)
  File "/usr/lib/python3.5/shutil.py", line 115, in copyfile
    with open(dst, 'wb') as fdst:
PermissionError: [Errno 13] Permission denied: '/usr/local/lib/share/accelergy/estimation_plug_ins/accelergy-cacti-plug-in/cacti/timeloop-model.SR

ReductionTree design

Hello,

I am trying to model a reduction tree in my architecture, but I am not sure how to fully proceed:

I'd like to have 1 PE containing 4 MACs, and each MAC output would go to an adder tree. The final output (of the 4 MACs) would then be stored in a local shared accumulation register.

I believe I should use the network_drain attribute on a dummy buffer and the network_update on the accumulation register, but I am not sure if there are other attributes I should specify, like for the SimpleMulticast.
Do you have an example design using a ReductionTree network by any chance?

Thanks.
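
For reference, the Simba-like v0.3 example later on this page wires explicit networks between levels using network_read/network_update and network_fill/network_drain attributes; a reduction tree feeding a shared accumulation register might be sketched along the same lines. Everything below (names, attribute values, and whether ReductionTree accepts these attributes) is an assumption for illustration, not a verified design:

# hedged sketch: 4 MACs reduced into one shared accumulation register,
# following the network_* wiring style of the Simba-like v0.3 example;
# all names and attribute values are assumptions
- name: AccumulationReg
  class: storage
  subclass: reg_storage
  attributes:
    memory_depth: 1
    network_fill: MACs <==> AccumulationReg    # partial sums arrive through the tree
    network_drain: MACs <==> AccumulationReg

- name: MACs <==> AccumulationReg
  class: ReductionTree
  attributes:
    datawidth: 16

- name: MAC[0..3]
  class: intmac
  attributes:
    datawidth: 16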

Generate action_counts.yaml file

In the exercises for Accelergy, there was an action_counts.yaml file in the input directory. You can find it here. Is there any built-in function or flag to get the action counts for the best mapping obtained by running the Timeloop mapper?

docker terminal possibility

Hello,
This repo provides the exercises and the baseline designs. Based on the description, running the yaml files in the baseline-design repo needs to happen in the Docker terminal, not the Jupyter environment.
Is it possible to run the baseline designs directly via the docker-compose.yaml setup?
Or must I install the accelergy lib for that?

Typo in Readme

Hi,
In the commands for running the timeloop-mapper, the command:
...mapper/apper.yaml should be changed to
...mapper/mapper.yaml \

Core dumped when running mapper

Hello:

I am seeing the error below when running the Timeloop mapper. I reduced my architecture to a bare minimum with no constraints or compound components. I am using timeloopfe. Any ideas how to debug this failure?

thanks
Sreekanth

Using threads = 8
Mapper configuration complete.
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check: _n (which is 13222272) >= this->size() (which is 1)
Aborted (core dumped)

Query on using Timeloop and Accelergy

Hi,

Just wanted to check whether there's a way for us to load the weights of pre-trained models into one of the yaml files and use them to perform the necessary analysis on the architecture and mapping.

If yes, could you please redirect me to the relevant documents?

Thanks!

simba PE number config

Hello,
Is it possible to change the number of PEs inside the Simba arch, or the number of MACs inside a PE?

Best Regards

No mapping for Eyeriss

I am trying to find a mapping for Eyeriss, but when I use the 'linear-pruned' algorithm, no valid mapping is found. What could be the issue?
I have only changed the algorithm to "linear-pruned" in the mapper file.

Compound component usage

Hello:

I am trying to build a model for an accelerator. I defined my accelerator as a compound component and instantiated it in my arch.yaml. I was able to constrain the mapspace correctly down to my components (DRAM -> GlobalBuffer -> Accelerator Array) in arch.yaml. Once it hits the compound component, all the fanout converges to 1 (to the compound component) and a fanout error is thrown, as shown below.

Fail reason: mapped fanoutX 65536 exceeds hardware fanoutX 1

My compound component will also slice the mapspace further. How do I achieve this?

thanks
Sreekanth
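
For context, my understanding is that in the v0.3 architecture specs used elsewhere in this repository, the spatial fanout that the mapper can exploit comes from the [0..N-1] instance ranges on the levels of the architecture hierarchy itself (as in the Simba-like spec quoted further down), while a compound component describes the internals of a single level for Accelergy's energy/area estimation. A hedged structural sketch of a hierarchy that exposes the fanout to the mapper (names, sizes, and the missing attributes are illustrative assumptions):

# hedged sketch, v0.3-style hierarchy; names, sizes, and attributes are illustrative
architecture:
  version: 0.3
  subtree:
    - name: system
      local:
        - name: DRAM
          class: DRAM
      subtree:
        - name: chip
          local:
            - name: GlobalBuffer
              class: storage
              subclass: smartbuffer
          subtree:
            - name: PE[0..255]          # 256-way spatial fanout visible to the mapper
              local:
                - name: PEBuffer
                  class: storage
                  subclass: regfile
                - name: MAC
                  class: intmac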

What does factor = 0 mean in the mapping?

Hi, great work! I see that factors can be translated to loop bounds. In that case, shouldn't the smallest factor be 1? What does factor = 0 mean? For instance, below, if R=0, should the loop terminate? Thanks!

row stationary -> 1 row at a time

  • target: weights_spad
    type: temporal
    permutation: NMPQS CR
    factors: N=1 M=1 P=1 Q=1 S=1 R=0

RuntimeError for Simba_like

Thanks for providing such a useful tool! I've encountered an error while testing it with the simba_like architecture, and I am already using the latest version.

My input is: python3 run_example_designs.py --architecture simba_like --problem gpt2, and here is the output (screenshot omitted). Just wondering if this is still a problem with the Ruby mapper?

Designing the connection between VMACs and Weight and Input Buffers in a Simba-like Architecture

First off, Thank you for great examples you provided.
I am trying to reproduce a Simba-like architecture. I am accounting for the router and the wires energies in the NoCs. Now I know if there is a need for scatter or multicast operations Timeloop instantiates a legacy network by default. Looking at the architecture diagram provided in the original Simba’s micro paper, input and distributed weight buffers are basically wired to the VMACs.
Screenshot 2024-05-30 at 12 34 06 PM

In order to have a more accurate evaluation, I made a bus component with wires connecting these buffers to the VMACs. Here is how I modified the architecture specification (I am aware that there is a newer version, but I am still using v3):

                - name: PEAccuBuffer[0..7]
                  class: storage
                  subclass: smartbuffer
                  attributes:
                    data_storage_depth: 16
                    data_storage_width: 24
                    read_bandwidth: 8
                    write_bandwidth: 8
                    block-size: 24
                    n_banks: 1
                    meshX: 16         
                - name: PEWeightBuffer[0..7]
                  class: storage
                  subclass: smartbuffer_metadata
                  attributes:
                    data_storage_depth: 4096
                    data_storage_width: 64
                    metadata_storage_depth: 4096
                    metadata_storage_width: 64
                    metadata_block_size: 8
                    metadata_datawidth: 8
                    metadata_counter_width: 4
                    meshX: 16
                    n_banks: 1
                    decompression_supported: true
                    compression_supported: true
                    network_read: PEWeightBuffer <==> PEWeightRegs
                    network_update: PEWeightBuffer <==> PEWeightRegs

                - name: PEWeightBuffer <==> PEWeightRegs
                  class: SimpleMulticast
                  subclass: bus
                  attributes:
                    datawidth: 8
                    Y_X_wire_avg_length: 0.1
                    action_name: transfer
                    multicast_factor_argument: n_cols_per_row
                    per_datatype_ERT: True                    

                - name: PEWeightRegs[0..63]
                  class: storage
                  subclass: reg_storage
                  attributes:
                    memory_depth: 1
                    read_bandwidth: 1
                    write_bandwidth: 1
                    block-size: 8
                    num-ports: 2
                    meshX: 16
                    decompression_supported: true
                    compression_supported: true 
                    network_fill: PEWeightBuffer <==> PEWeightRegs
                    network_drain: PEWeightBuffer <==> PEWeightRegs

And here is my bus component:

compound_components:
  version: 0.3
  classes:
  - name: bus
    attributes:
      technology: 40nm
      n_PE_rows: 1
      n_PE_cols: 8
      total_PEs: n_PE_cols * n_PE_rows
      datawidth: 8
      Y_X_wire_avg_length: 2mm

    subcomponents:
      - name: Y_X_wire[1..datawidth]
        class: wire
        attributes:
          technology: 32nm
          datawidth: 1
          length: Y_X_wire_avg_length
          latency: 1ns
      - name: Y_X_wire-input[1..2]
        class: wire
        attributes:
          technology: 32nm
          datawidth: 1
          length: Y_X_wire_avg_length
          latency: 1ns
      - name: Y_X_wire-output[1..24]
        class: wire
        attributes:
          technology: 32nm
          datawidth: 1
          length: Y_X_wire_avg_length
          latency: 1ns             
    actions:
      - name: transfer_Weights
        arguments:
          n_rows: 1..n_PE_rows
          n_cols_per_row: 1..n_PE_cols    
        subcomponents:
          - name: Y_X_wire[1..datawidth]
            actions:
            - name: transfer_random
      - name: transfer_Inputs
        arguments:
          n_rows: 1..n_PE_rows
          n_cols_per_row: 1..n_PE_cols    
        subcomponents:
          - name: Y_X_wire-input[1..2]
            actions:
            - name: transfer_random            
      - name: transfer_Outputs
        arguments:
          n_rows: 1..n_PE_rows
          n_cols_per_row: 1..n_PE_cols    
        subcomponents:
          - name: Y_X_wire-output[1..24]
            actions:
            - name: transfer_random           

I also had to change the hierarchy between the distributed weight buffer and the accumulation buffer so I could have the PEWeightBuffer <==> PEWeightRegs network.

Would incorporating a bus be a more accurate choice than the legacy NoC for reproducing a Simba-like architecture?
Also, I have modified my legacy network to account for operand precision when calculating the hop energy (router + wire). I assume the router energy is constant, but the wire energy should change with the packet size. Do you think it is reasonable to assume variable-length packets for different operands, or, since they all use the same NoC, should the packet size stay constant?

Thank you

How to calculate latency or delay?

I am trying to model non-volatile memories for hardware accelerators. I am using the Eyeriss architecture as the base architecture and the table-based plug-in. Although the area and energy values change when I change them in the table, the number of clock cycles does not change when I change the latency value. Any guesses as to what could be the reason?

I added this to the primitive component file.

    name: STTRAM
    attributes:
      technology: sttram
      width: 64
      depth: 512
      n_rdwr_ports: 1
      n_banks: 1
      latency: 100ns
    actions:
      - name: read
      - name: write
      - name: idle

And this as the table-based plugin

technology,width,depth,n_rd_ports,n_wr_ports,n_rd_wr_ports,n_banks,latency,action,data_delta,address_delta,energy,area
45nm,64,16384,0,0,2,32,100ns,read,0,0,5,30000
45nm,64,16384,0,0,2,32,100ns,write,0,0,5,30000
45nm,64,16384,0,0,2,32,100ns,idle,0,0,5,30000

(These are just dummy values)

Support for a simple PIM?

Great work! I want to ask whether Accelergy v0.3 supports the exercise on a simple PIM? The attached log below shows that the v0.3 support is out of date, and many performance parameters are not shown.
timeloop-mapper.accelergy.log
What's more, may I ask whether Sparseloop's density model "Actual data" is supported in this repository?

Define sparsity aware accelerator configuration

Hi,

Is there a way to define the configurations for sparsity aware accelerators such as Cambricon-X [1] or Stripes [2] in Timeloop?

[1] Zhang, Shijin, et al. "Cambricon-x: An accelerator for sparse neural networks." 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016.
[2] Judd, Patrick, et al. "Stripes: Bit-serial deep neural network computing." 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016.

Regarding 3d-CNN-layer

Hi, Thanks for providing the infrastructure.

When I try to simulate 3D CNN layers, the Timeloop simulator reports a core-dumped error (screenshot omitted).

The shape of the Conv3d is given in a screenshot (omitted), and I wrote it in yaml as:
problem:
  shape:
    name: "3d-CNN-Layer"
    dimensions: [ C, K, R, S, T, N, Q, P, F ]
    coefficients:
      - name: Wstride
        default: 1
      - name: Hstride
        default: 1
      - name: Dstride
        default: 1
      - name: Wdilation
        default: 1
      - name: Hdilation
        default: 1
      - name: Ddilation
        default: 1
    data-spaces:
      - name: Weights
        projection:
          - [ [C] ]
          - [ [K] ]
          - [ [R] ]
          - [ [S] ]
          - [ [T] ]
      - name: Inputs
        projection:
          - [ [N] ]
          - [ [C] ]
          - [ [R, Wdilation], [P, Wstride] ]   # SOP form: R*Wdilation + P*Wstride
          - [ [S, Hdilation], [Q, Hstride] ]   # SOP form: S*Hdilation + Q*Hstride
          - [ [T, Ddilation], [F, Dstride] ]   # SOP form: T*Ddilation + F*Dstride
      - name: Outputs
        projection:
          - [ [N] ]
          - [ [K] ]
          - [ [Q] ]
          - [ [P] ]
          - [ [F] ]
        read-write: True
  instance:
    C: 64
    K: 64
    R: 1
    S: 1
    T: 1
    N: 1
    P: 256
    Q: 256
    F: 30
    Wdilation: 1
    Wstride: 1
    Hdilation: 1
    Hstride: 1
    Ddilation: 1
    Dstride: 1

Could you help figure out what's wrong with my yaml definition? Thanks a lot!

MatMul Timeloop V1

Hi!

I saw the matmul examples here for Timeloop V2. I was wondering if there were any examples of matmul for Timeloop V1. Thanks!

Issue with running MM

I want to run the MM workload in layer_shape using run_example_designs.py, but an error occurs because the instance in the arch.yaml under example_designs/ is different.
So, when I corrected the instance and ran the code, an error occurred again in top.yaml.jinja2.
Is there code to run MM that I could not find?

latency info possibility

Hello,
I want to map VGG16 onto the Simba arch; will it be possible to get the layer-wise latency info from the generated output?
Judging from the example with AlexNet, does the level-based info in the file show the layer-wise info?

Thanks a lot

Runtime error in timeloop+accelergy exercises

Hi,

Thanks for providing such useful exercises.

I can run the examples in timeloop/accelergy successfully, but I encounter one runtime error in the timeloop+accelergy exercise, as shown below:

Problem configuration complete.
ERROR: key not found: arithmetic, at line: 0

Could you help solve this issue? Thanks in advance for your time and effort!

Best regards,
Haoran You

Metadata Access Energy for DRAM

I have been reviewing the examples for sparse workloads, specifically focusing on weights, and I noticed that there is no metadata access energy associated with accesses to compressed data in DRAM. I am wondering if this is intentionally ignored for the sake of simplicity.
I made a component called DRAM_metadata, defined as described below.

compound_components:
  version: 0.3
  classes:
  - name: dram_metadata
    attributes:
      technology: must_specify
      n_rdwr_ports: 2
      datawidth: datawidth
      metadata_storage_width: must_specify
      metadata_storage_depth: must_specify
      metadata_datawidth: must_specify
      metadata_counter_width: must_specify
      metadata_block_size: 1

    subcomponents:
      - name: storage
        class: DRAM
        attributes:
          technology: technology
          width: width
          depth: data_storage_depth
          datawidth: datawidth
          n_rdwr_ports: n_rdwr_ports

      - name: metadata_counters[0..1] # one for read, one for write
        class: intadder
        attributes:
          technology: technology
          datawidth: metadata_counter_width
      - name: metadata_storage
        class: DRAM
        attributes:
          technology: technology
          width: metadata_storage_width
          depth: metadata_storage_depth
          datawidth: metadata_datawidth
    actions:
      - name: write
        subcomponents:
          - name: storage
            actions:
              - name: write
      - name: read
        subcomponents:
          - name: storage
            actions:
              - name: read
      - name: gated_write
        subcomponents:
          - name: storage
            actions:
            - name: idle
      - name: gated_read
        subcomponents:
          - name: storage
            actions:
            - name: idle
      - name: metadata_read
        subcomponents:
          - name: metadata_storage
            actions:
              - name: read
      - name: metadata_write
        subcomponents:
          - name: metadata_storage
            actions:
              - name: write
      - name: gated_metadata_read
        subcomponents:
          - name: metadata_storage
            actions:
              - name: idle
      - name: gated_metadata_write
        subcomponents:
          - name: metadata_storage
            actions:
              - name: idle
      - name: decompression_count
        subcomponents:
          - name: metadata_counters[1]
            actions:
              - name: add
      - name: compression_count
        subcomponents:
          - name: metadata_counters[0]
            actions:
              - name: add
      - name: idle
        subcomponents:
          - name: storage
            actions:
              - name: idle
          - name: storage
            actions:
              - name: idle

Would this be an accurate simulation?
Thank you very much
