
ChampSim

ChampSim is a trace-based simulator for microarchitecture studies. If you have questions about how to use ChampSim, we encourage you to search the threads in the Discussions tab or start your own thread. If you are aware of a bug or have a feature request, open a new Issue.

Using ChampSim

ChampSim is the result of academic research. To support its continued growth, please cite our work when you publish results that use ChampSim by clicking "Cite this Repository" in the sidebar.

Download dependencies

ChampSim uses vcpkg to manage its dependencies. In this repository, vcpkg is included as a submodule. You can download the dependencies with

git submodule update --init
vcpkg/bootstrap-vcpkg.sh
vcpkg/vcpkg install

Compile

ChampSim takes a JSON configuration script. Examine champsim_config.json for a fully-specified example. All options described in this file are optional and will be replaced with defaults if not specified. The configuration script can also be run without input, in which case an empty file is assumed.

$ ./config.sh <configuration file>
$ make

Download DPC-3 trace

Traces used for the 3rd Data Prefetching Championship (DPC-3) can be found at https://dpc3.compas.cs.stonybrook.edu/champsim-traces/speccpu/. A set of traces used for the 2nd Cache Replacement Championship (CRC-2) can be found at http://bit.ly/2t2nkUj.

Storage for these traces is kindly provided by Daniel Jimenez (Texas A&M University) and Mike Ferdman (Stony Brook University). If you find yourself frequently using ChampSim, it is highly encouraged that you maintain your own repository of traces, in case the links ever break.

Run simulation

Execute the binary directly.

$ bin/champsim --warmup_instructions 200000000 --simulation_instructions 500000000 ~/path/to/traces/600.perlbench_s-210B.champsimtrace.xz

The number of warmup and simulation instructions given will be the number of instructions retired. Note that the statistics printed at the end of the simulation include only the simulation phase.

Add your own branch predictor, data prefetchers, and replacement policy

Copy an empty template

$ mkdir prefetcher/mypref
$ cp prefetcher/no_l2c/no.cc prefetcher/mypref/mypref.cc

Work on your algorithms with your favorite text editor

$ vim prefetcher/mypref/mypref.cc

Compile and test

Add your prefetcher to the configuration file.

{
    "L2C": {
        "prefetcher": "mypref"
    }
}

Note that the example prefetcher is an L2 prefetcher. You might design a prefetcher for a different level.

$ ./config.sh <configuration file>
$ make
$ bin/champsim --warmup_instructions 200000000 --simulation_instructions 500000000 600.perlbench_s-210B.champsimtrace.xz

How to create traces

Program traces are available in a variety of locations; however, many ChampSim users wish to trace their own programs for research purposes. Example tracing utilities are provided in the tracer/ directory.

Evaluate Simulation

ChampSim reports IPC (Instructions Per Cycle) as its primary performance metric.
Several other useful metrics are printed at the end of the simulation.

Good luck and be a champion!


Issues

RAW dependencies are broken

I think following code (in the develop branch) breaks RAW dependencies.

if((count_source_regs == 1) && (ROB.entry[rob_index].source_registers[0] == ROB.entry[rob_index].destination_registers[0]))

I tested ./bin/bimodal-no-no-no-no-lru-1core -warmup_instructions 1 -simulation_instructions 1000 -traces ipc1_public/client_001.champsimtrace.xz, and got the result of following:

(omit)
[ROB] reg_dependency instr_id: 200 is_memory: 1 load  reg_index: 1
[ROB] reg_dependency instr_id: 200 is_memory: 1 store reg_index: 67
(omit)
[ROB] reg_dependency instr_id: 202 is_memory: 1 load  reg_index: 67
[ROB] reg_dependency instr_id: 202 is_memory: 1 store reg_index: 67
(omit)
[RTL0] operate_lsq instr_id: 200 rob_index: 200 is popped to RTL0 head: 26 tail: 28
[RTL0] operate_lsq instr_id: 202 rob_index: 202 is popped to RTL0 head: 27 tail: 28
(omit)

The instruction with id200 is a load instruction and it reads from reg1 and writes to reg67.
The instruction with id202 is a load instruction and it reads from reg67 and writes to reg67.
The instruction with id202 is woken up at the same time as the instruction with id200, but this violates the RAW dependency (writing to reg67 will only complete after a few cycles).

I don't know what the above code is meant to do, so I don't know how to fix it.
(I suspect it's intended to handle idioms like xor eax, eax. The problem comes from not examining memory operands. Even if that were solved, it is impossible to distinguish add eax, 1 from xor eax, eax in the current trace format.)

Traces Clarification

Hi,
Are the traces generated by the provided pintool for physical or virtual addresses?

Running debug mode

Hi, how do I print the cout statements from the cache.cc and ooo_cpu.cc files? Is there a debug mode I can enable in this simulator?

Thanks

Questions

Hello,
I have been getting familiar with ChampSim and I have a few questions.
Does prefetching happen on every instruction, or on demand access or demand miss? Is there a place to set my preference for when to prefetch?
What is the line size implemented in the cache?
Are all instructions simulated, or are only memory instructions simulated from the trace?

Any answers are appreciated, any info on where I could find the answers to these questions in the code would also be appreciated!
Thanks,
Nick

RAW detection bug

I believe the RAW dependency check (reproduced below from O3_CPU::reg_RAW_dependecy) misses some dependencies. In particular, PIN uses different register IDs for equivalent registers of different sizes (e.g. RAX uses ID 10 and EAX uses ID 56 in PIN 3.13), so the check will not indicate there is any dependency if a binary uses mixed register sizes.

if (ROB.entry[prior].destination_registers[i] == ROB.entry[current].source_registers[source_index])

Putting efforts in a proper documentation?

For this issue, I'll just pretend to be 6 months back in time, when I started my Ph.D.

I had to get my hands on ChampSim, and god it was hard! 🤕
In a nutshell, this project is quite good. It's simple and efficient; it does what you expect of it. But the main issue is getting started with it. At first, working with cache replacement policies was a nightmare: I did not understand the chronology of the execution, and so on.

I would suggest putting some effort into writing proper documentation for this project. In my humble opinion, reading documentation could save some time compared to diving into the depths of the simulator's code.

I'm waiting for your opinions. 👍 🇫🇷

Cache coherency and inclusion

Hey,
1- ChampSim implements a non-inclusive cache hierarchy, right?
2- How does ChampSim maintain coherency? I could not find anything regarding coherency.

Packet Queue removal is non-contiguous

When a packet is removed from a PACKET_QUEUE, it is done in a very unsafe way.

The removal occurs on L111 of src/block.cc.

*packet = empty_packet;

In it, the passed pointed-to value is replaced with an empty packet. Then, the head is incremented a few lines later. However, there is no guarantee that the passed pointer is at the head of the queue or even is a member of the current queue. In theory, you could do something particularly stupid like

RQ.remove_queue(WQ.entry[PQ.SIZE])

This, while syntactically valid and compilable, would be a huge bug. Now, the only place where a non-head element is removed from the queue is in the DRAM model.

queue->remove_queue(&queue->entry[request_index]);

The request_index here is the next_process_index, which is set on L589 of the same file, where it is set to a scheduled packet with the earliest event cycle (with ties broken by index closest to zero? Is this an unrelated bug?).

Are we getting lucky that there are no holes appearing in the queues and that packets are not getting popped from the queue in an untimely way?

I ran a quick experiment to verify what was going on. I added an assertion to PACKET_QUEUE::remove_queue() that checked the passed pointer to determine if it is inside the bounds of the queue, and nothing immediately jumped out as wrong. I also added the following code following L600 of src/main.cc, after the call to DRAM.operate(). It examines the read and write queues for each channel in the DRAM. If the element falls between the head and the tail, it should have a nonzero address, that is, it shouldn't be empty. Otherwise, it should have a zero address.

    bool in_bounds = true;
    bool elem_present = true;
    std::size_t queue_idx;
    for (std::size_t chan = 0; chan < DRAM_CHANNELS; ++chan)
    {
        queue_idx = DRAM.WQ[chan].head;
        do
        {
            if (queue_idx == DRAM.WQ[chan].tail)
                in_bounds = false;
            elem_present = (DRAM.WQ[chan].entry[queue_idx].address != 0);
            assert((elem_present && in_bounds) || (!elem_present && !in_bounds));

            ++queue_idx;
            if (queue_idx == DRAM.WQ[chan].SIZE)
                queue_idx = 0;
        } while (queue_idx != DRAM.WQ[chan].head);

        queue_idx = DRAM.RQ[chan].head;
        do
        {
            if (queue_idx == DRAM.RQ[chan].tail)
                in_bounds = false;
            elem_present = (DRAM.RQ[chan].entry[queue_idx].address != 0);
            assert((elem_present && in_bounds) || (!elem_present && !in_bounds));

            ++queue_idx;
            if (queue_idx == DRAM.RQ[chan].SIZE)
                queue_idx = 0;
        } while (queue_idx != DRAM.RQ[chan].head);
    }

This test caused the program to fail its assertions not long after finishing warmup. This makes sense, since the DRAM returns all packets immediately during warmup. So, I'm pretty convinced something is wrong with the DRAM queues, but I'm not certain what.

Support for C++17

C++17 introduces a number of features that are useful for ChampSim.

  • std::size, an implementation-agnostic way to examine the sizes of array-like structures.
  • std::byte, a type representing raw data, like uint8_t, but without any implications of being an ASCII character.
  • std::optional, a representation of the valid/not-valid state that is ubiquitous in caches.

I think it is beneficial for ChampSim to adopt C++17. GCC and Clang have supported it for 3 years now, and all of the existing ChampSim sources compile cleanly on C++17. The biggest issue that would stop this change is that it would break support for older systems (Ubuntu 16.04, in particular, where the packaged GCC is version 5).

MacroOp or MicroOp

Hello
Does ChampSim read macro-ops from the trace (CRC-2 traces) and then convert them into micro-ops? If so, at which stage does that happen?

Issue with generating traces using ChampSimTracer on Pin 3.2

Hi,

I'm having trouble generating traces with Pin version 3.2, which is the version documented in the README. Has anyone encountered this before? Does Pin 3.2 work with recent Linux kernels?

I'm not sure if it's a PIN issue or if I'm missing something with the champsim_tracer.

Command:
pin-3.2-81205-gcc-linux/pin -ifeellucky -t obj-intel64/champsim_tracer.so -- ls

OS:
Ubuntu 20.04.1 LTS
uname -r: 5.4.0-54-generic

Output:

A: Source/pin/elfio/img_elf.cpp: ProcessSectionHeaders: 601: assertion failed: SEC_vaddr_i(sec) >= IMG_seg_text_vaddr_i(img) && SEC_vaddr_i(sec) < IMG_seg_data_vaddr_i(img)

NO STACK TRACE AVAILABLE
Pin 3.2
Copyright (c) 2003-2016, Intel Corporation. All rights reserved.
@CHARM-VERSION: $Rev: 81201 $
Aborted (core dumped)

Thanks!

Inconsistent Prefetcher Parameters

Why is the sum of useful and useless prefetches not equal to the total prefetches issued? Moreover, what is the difference between PREFETCH ACCESS and PREFETCH REQUESTED? I have attached a log file for reference. In the log file, the sum of useful and useless prefetches is not equal to the prefetches issued for L1D and L2C. For the LLC, prefetches issued are zero, but useful and useless prefetches are nonzero. How is this possible?
astar_23B.trace.xz-bimodal-no-bip-bip-no-lru-1core.txt

Fetcher issues (develop branch)

I changed the source code a little, compiled with the default configuration, and ran ChampSim with the following command: bin/champsim -warmup_instructions 0 -simulation_instructions 2500 -traces client_001.champsimtrace.xz
Then I found some fetcher issues.

--- a/inc/champsim.h
+++ b/inc/champsim.h
@@ -6,7 +6,7 @@
 #include "champsim_constants.h"

 // USEFUL MACROS
-//#define DEBUG_PRINT
+#define DEBUG_PRINT
 #define SANITY_CHECK
 #define LLC_BYPASS
 #define DRC_BYPASS

--- a/src/ooo_cpu.cc
+++ b/src/ooo_cpu.cc
@@ -298,7 +298,7 @@ uint32_t O3_CPU::check_rob(uint64_t instr_id)

     for (uint32_t i=ROB.head, count=0; count<ROB.occupancy; i=(i+1==ROB.SIZE) ? 0 : i+1, count++) {
         if (ROB.entry[i].instr_id == instr_id) {
-            DP ( if (warmup_complete[ROB.cpu]) {
+            DP ( if (warmup_complete[cpu]) {
             cout << "[ROB] " << __func__ << " same instr_id: " << ROB.entry[i].instr_id;
             cout << " rob_index: " << i << endl; });
             return i;
@@ -401,7 +401,7 @@ void O3_CPU::fetch_instruction()
          else
               trace_packet.address = ifb_entry.ip >> LOG2_PAGE_SIZE;
           trace_packet.full_addr = ifb_entry.ip;
-         trace_packet.instr_id = 0;
+         trace_packet.instr_id = ifb_entry.instr_id;
          trace_packet.rob_index = i;
           trace_packet.ip = ifb_entry.ip;
          trace_packet.type = LOAD;
@@ -450,7 +450,7 @@ void O3_CPU::fetch_instruction()
           fetch_packet.full_addr = ifb_entry.instruction_pa;
           fetch_packet.v_address = ifb_entry.ip >> LOG2_PAGE_SIZE;
           fetch_packet.full_v_addr = ifb_entry.ip;
-         fetch_packet.instr_id = 0;
+         fetch_packet.instr_id = ifb_entry.instr_id;
          fetch_packet.rob_index = 0;
           fetch_packet.ip = ifb_entry.ip;
          fetch_packet.type = LOAD;

The ITLB is accessed multiple times when it does not need to be.

[ITLB_RQ] add_rq instr_id: 2385 address: ffffa844d full_addr: ffffa844dacc type: 0 head: 9 tail: 10 occupancy: 1 event: 24176 current: 24175
[ITLB_RQ] add_rq instr_id: 2391 address: ffffa844d full_addr: ffffa844dae4 type: 0 head: 10 tail: 11 occupancy: 1 event: 24177 current: 24176
[ITLB_RQ] add_rq instr_id: 2391 address: ffffa844d full_addr: ffffa844dae4 type: 0 head: 11 tail: 12 occupancy: 1 event: 24178 current: 24177
[ITLB_RQ] add_rq instr_id: 2397 address: ffffa844d full_addr: ffffa844dafc type: 0 head: 12 tail: 13 occupancy: 1 event: 24179 current: 24178

In cycle 24175, an ITLB request is issued by instr2385.
In cycle 24176, another ITLB request is issued by instr2391.
In cycle 24177, the first ITLB request is returned, and the following code resets instr2391.translated from INFLIGHT back to 0; instr2391 then issues yet another ITLB request since instr2391.translated == 0.
In cycle 24178, the second ITLB request is returned, and it sets instr2391.translated == COMPLETED.
In cycle 24179, the third ITLB request is returned, but no one uses it.

ChampSim/src/ooo_cpu.cc

Lines 1698 to 1699 in 37fc622

// not enough fetch bandwidth to translate this instruction this time, so try reading the ITLB again
IFETCH_BUFFER.entry[index].translated = 0;

The root of this problem is that the code below assumes "all instructions on the same line can be translated in one cycle", while the code above assumes "only 6 instructions can be translated in one cycle".

ChampSim/src/ooo_cpu.cc

Lines 417 to 437 in 37fc622

// successfully sent to the ITLB, so mark all instructions in the IFETCH_BUFFER that match this ip as translated INFLIGHT
uint32_t translate_inflight_index = index;
for (uint32_t j = 0; j < IFETCH_BUFFER.SIZE; j++)
{
    if (IFETCH_BUFFER.entry[translate_inflight_index].ip == 0)
    {
        break;
    }
    if (((IFETCH_BUFFER.entry[translate_inflight_index].ip >> LOG2_PAGE_SIZE) == (ifb_entry.ip >> LOG2_PAGE_SIZE))
        && (IFETCH_BUFFER.entry[translate_inflight_index].translated == 0))
    {
        IFETCH_BUFFER.entry[translate_inflight_index].translated = INFLIGHT;
        IFETCH_BUFFER.entry[translate_inflight_index].fetched = 0;
    }
    translate_inflight_index++;
    if (translate_inflight_index >= IFETCH_BUFFER.SIZE)
    {
        translate_inflight_index = 0;
    }
}

Instruction fetch is done out of order.

[L1I_RQ] add_rq instr_id: 2391 address: 2f0b6b full_addr: bc2dae4 type: 0 head: 3 tail: 4 occupancy: 1 event: 24182 current: 24178
[L1I_RQ] add_rq instr_id: 2398 address: 2f0b6c full_addr: bc2db00 type: 0 head: 3 tail: 5 occupancy: 2 event: 24183 current: 24179
[L1I_RQ] add_rq instr_id: 2408 address: 2f0b6f full_addr: bc2dbe4 type: 0 head: 3 tail: 6 occupancy: 3 event: 24184 current: 24180
[L1I_RQ] add_rq instr_id: 2415 address: 2f0b70 full_addr: bc2dc00 type: 0 head: 4 tail: 7 occupancy: 3 event: 24186 current: 24182
[L1I_RQ] add_rq instr_id: 2389 address: 2f0b6b full_addr: bc2dadc type: 0 head: 5 tail: 8 occupancy: 3 event: 24187 current: 24183
[L1I_RQ] add_rq instr_id: 2404 address: 2f0b6c full_addr: bc2db18 type: 0 head: 6 tail: 9 occupancy: 3 event: 24188 current: 24184
[L1I_RQ] add_rq instr_id: 2431 address: 2f0b71 full_addr: bc2dc40 type: 0 head: 6 tail: 10 occupancy: 4 event: 24188 current: 24184
[L1I_RQ] add_rq instr_id: 2414 address: 2f0b6f full_addr: bc2dbfc type: 0 head: 6 tail: 11 occupancy: 5 event: 24189 current: 24185
[L1I_RQ] add_rq instr_id: 2438 address: 19e74ee full_addr: 679d3b90 type: 0 head: 7 tail: 12 occupancy: 5 event: 24190 current: 24186
[L1I_RQ] add_rq instr_id: 2421 address: 2f0b70 full_addr: bc2dc18 type: 0 head: 8 tail: 13 occupancy: 5 event: 24191 current: 24187
[L1I_RQ] add_rq instr_id: 2450 address: 19e74ef full_addr: 679d3bc0 type: 0 head: 10 tail: 14 occupancy: 4 event: 24192 current: 24188
[L1I_RQ] add_rq instr_id: 2437 address: 2f0b71 full_addr: bc2dc58 type: 0 head: 11 tail: 15 occupancy: 4 event: 24193 current: 24189
[L1I_RQ] add_rq instr_id: 2466 address: 19e74f0 full_addr: 679d3c00 type: 0 head: 12 tail: 16 occupancy: 4 event: 24194 current: 24190
[L1I_RQ] add_rq instr_id: 2444 address: 19e74ee full_addr: 679d3ba8 type: 0 head: 13 tail: 17 occupancy: 4 event: 24195 current: 24191
[L1I_RQ] add_rq instr_id: 2427 address: 2f0b70 full_addr: bc2dc30 type: 0 head: 14 tail: 18 occupancy: 4 event: 24196 current: 24192

The same thing is happening.

ChampSim/src/ooo_cpu.cc

Lines 1744 to 1745 in 37fc622

// not enough fetch bandwidth to get this instruction this time, so try reading the L1I again
IFETCH_BUFFER.entry[index].fetched = 0;

Prefetcher

What inputs go into the prefetcher, exactly?
Where do they come from?

Handling of event_cycle is inconsistent

The event_cycle member (of ooo_model_instr, PACKET, etc.) is handled inconsistently throughout ChampSim. It is supposed to be used to model delays. An object is not ready until its event_cycle has passed. Here are two examples of how it is handled.

if (ROB.entry[rob_index].event_cycle < current_core_cycle[cpu])

if ((WQ.entry[WQ.head].event_cycle <= current_core_cycle[writeback_cpu]) && (WQ.occupancy > 0)) {

In one, it is compared to the current cycle with <, in the other, with <=. I think this would ideally be handled with a function like

template <typename T>
bool is_ready(T item, uint64_t cpu)
{
    return item.event_cycle < current_core_cycle[cpu];
}

which we can then use consistently throughout. Such a function can, because of templating, be applied to any struct or class that has an event_cycle member.

This change is likely not going to have significant performance impact, but it serves to clean up the codebase.

Deadlock on STLB Merge

Suggest inserting an additional check into PACKET_QUEUE::check_queue (lines 20, 42, and 62) to verify that entry[i].instruction == packet->instruction.

This is a workaround for an issue where requests simultaneously reach STLB from the ITLB and DTLB for the same address. They get merged, and then when the request is filled, only the first parent is notified, starving the other and causing deadlock.

Address space randomization

After generating my trace file with the pintool, when I run it on ChampSim, the addresses of my memory operations are not reflected in ChampSim's output. Why does such address randomization happen, and how can I overcome it? Is there a feature that disables address space randomization?

Register Renaming

Hello
I'm pretty new to ChampSim and just started going through the source code, but I was wondering: does ChampSim do any register renaming to eliminate false dependencies? And what is the range of numbers assigned to logical registers for register renaming?

Thanks

Counting physical pages properly?

Hello folks!

As I was going through the code of ChampSim, and particularly the virtual-to-physical address translation routine, I encountered this macro, which, by my reckoning, counts the number of physical pages in DRAM.

#define DRAM_PAGES ((DRAM_SIZE<<10)>>2)

Basically, if I understood it properly, this line counts the number of DRAM pages as the size of the DRAM multiplied by 256. As pages are 4 KiB by default in ChampSim, I was expecting something like #define DRAM_PAGES (DRAM_SIZE >> LOG2_PAGE_SIZE), which divides the DRAM into chunks of 4 KiB.

So the questions are:

  • Is that intentional? And if yes why?
  • If it's a bug, we might need to fix it 😄

I'm also providing outputs with and without the value that I found meaningful here.

N.B.: The output files were obtained using traces of the GAP workloads, which are known for their huge memory footprint; hence the number of major page faults.

sim_miss is flushed in the middle of the run

I just added a cout after sim_miss in handle_fill() to print the value of the counter, like this:

sim_miss[fill_cpu][MSHR.entry[mshr_index].type]++;
cout << current_core_cycle[fill_cpu] << " "
     << NAME << " "
     << fill_cpu << " "
     << unsigned(MSHR.entry[mshr_index].type) << " "
     << sim_miss[fill_cpu][MSHR.entry[mshr_index].type]
     << "\n";

In the output I see something like this:

388327 L1D 0 0 718
391859 L1D 0 0 719
393049 L1D 0 0 720
393158 L1D 0 0 721
398761 L1D 0 0 722
404647 L1D 0 0 1
413331 L1D 0 0 2
413545 L1D 0 0 3
462999 L1D 0 0 4
471030 L1D 0 0 5
479273 L1D 0 0 6
485725 L1D 0 0 7

Sounds weird...
As you can see, the cycle number is increasing, which shows progress in time. The NAME, fill_cpu, and MSHR.entry[mshr_index].type values are also constant. However, somewhere in the middle of the run the statistic resets to zero.
I would like to know if you agree that this must not happen in the middle of a simulation, or whether there is some logic for flushing/resetting sim_miss mid-run.

Getting the right number of L1D misses

Hi
As I look at the code, I see that read miss stat in cache.cc is incremented via

MISS[RQ.entry[index].type]++;

By simply filtering the stat for L1D as below

if (NAME == "L1D")
    cout << "M\n";
MISS[RQ.entry[index].type]++;

at the end of the simulation, I count the number of 'M' symbols and get 100.

Then I expect the TOTAL stats in main.cc to be the same as the number of 'M' symbols. However, the value is very small compared to 100.

Tracking the code, I see that in the final stat phase, cache->roi_miss[cpu][0] is printed (main.cc). That gets the value from cache->roi_miss[cpu][i] = cache->sim_miss[cpu][i]; (main.cc) and this value is incremented by sim_miss[fill_cpu][MSHR.entry[mshr_index].type]++; (cache.cc).

So, it seems that what is printed in the output of the run as L1D misses is a stat for the MSHR, not the L1D.
Any comment on that?

Possible bug in MSHR match

Hi
The part of handle_read() that checks for a missed block in the MSHR seems to be buggy. The code is

// check mshr
uint8_t miss_handled = 1;
int mshr_index = check_mshr(&RQ.entry[index]);

if (mshr_index == -2)
{
    // this is a data/instruction collision in the MSHR, so we have to wait before we can allocate this miss
    miss_handled = 0;
}

So, if we find a collision, mshr_index must be -2. That means, upon such a match, check_mshr() must return -2. Now look at part of the code:

for (uint32_t index = 0; index < MSHR_SIZE; index++)
{
    if (MSHR.entry[index].address == packet->address)
    {
        return index;
    }
}

So, that function returns the index upon a match, and not -2.
In fact, return -2 is commented out in that function.

This can be verified easily by debugging. I see that there is a match in the MSHR and the function returns 0 (the first entry in the MSHR), but that is considered a collision.

Any comment on that? If you agree, I will create a pull request.

A new approach to the virtual memory system

ChampSim's approach to a virtual memory system is currently very basic. It provides the basic functionality of separating the address spaces of multiple processes, but it doesn't correctly model the performance impact of offering this support.

ChampSim's current virtual memory system consists of 4 parts: an L1 ITLB, an L1 DTLB, an STLB, and a va_to_pa() function. The L1 TLBs plug into the STLB, and when there's an STLB miss, the va_to_pa() function is invoked to get the correct translation, which is then cached in the appropriate TLBs.

The va_to_pa() function has always been implemented in a "magic" sort of way. It used to impose a magic, draconian penalty whenever it was invoked that would freeze all functions of the core, even those that had nothing to do with virtual memory. I recently bypassed that to eliminate the penalty altogether, because I believe an unrealistically small penalty was still more realistic than the extreme penalty that was in place.

So what's missing in ChampSim's virtual memory system? Page walkers. There was a recent pull request to introduce page walkers, but these page walkers also worked in a very magical way. They would instantaneously update caches, and worked with hard-coded fixed latencies.

Real page walkers should generate memory transactions that traverse the page table, naturally caching everything they touch in the existing cache hierarchy. They should be state machines that work over many cycles to return an answer that's added to the TLBs.

Here is a very rough outline of what I propose:

*** a virtual memory class ***

At initialization time, you specify the size of the physical address space, the number of processes that will be sharing the physical memory, and the number of page table layers. There will of course be a function similar to the existing va_to_pa() function that can magically give you a translation, but the page tables for all of the processes will also exist within the physical address space, along with this kind of capability:

get_pte_physical_address(process_number, virtual_address, layer_number)

This is what the page walker will use to traverse the page table. First it will start at get_pte_physical_address(PID, VA, 0), which will return the root of the page table, and then when that returns it will move on to fetch get_pte_physical_address(PID, VA, 1), and so on. When the leaf node of the page table returns to the page walker, then the va_to_pa() function can be used to complete the translation.

*** TLBs ***

TLBs are currently the same type as caches, and work in exactly the same way. I propose making them a sub class that inherits from caches instead, with these changes:

  1. TLBs can be backed up by another TLB, or by a page walker. You can't have a TLB that just hangs out by itself.
  2. TLBs can be attached to any cache. The project I'm working on now wants to add virtual address prefetching at all levels of the hierarchy, so I want to add TLBs at the L1, L2, and L3 caches. Each of these TLBs would be backed up by its own page walker in this model. Or maybe it would be possible to have the L2 and the L3 plug into the L1's STLB. Who knows? The point is, I want TLBs and address translation capability to exist at more than a single point in the hierarchy.

*** page walkers ***

Page walkers would be finite state machines that traverse the page table of their core's process. The page walker would need to be forwarded the addresses of all cache fills and cache hits for the level of the hierarchy where they live, and as soon as one page table entry shows up at the cache, then it can generate a new memory transaction to read the next level page table entry.

Page walks with a lot of cache misses need to take longer than page walks with cache hits, and page walks need to generate real cache and DRAM traffic that consumes bandwidth and impacts the latency of other memory requests.


I'm sure I've left out millions of necessary details, but I just wanted to get a conversation started. In summary, I want to add 3 more classes to ChampSim to support virtual memory: a virtual memory class, TLBs (which would inherit almost all of their functionality from caches), and page walkers.

Does this plan sound reasonable? What's missing?

Collected traces cannot run on ChampSim

Hi,
I collected traces using the following command:
pin-3.2-81205-gcc-linux/pin -t /home/zhangqianlong/MyPHD/ChampSim/tracer/obj-intel64/champsim_tracer.so -o ls .champsim -- ls
then :
tar cfJ ls.champsim.tar.xz ls.champsim

and run:
./bin/perceptron-large-no-bingo_16k-lru-4core -warmup_instructions 10 -simulation_instructions 10000 -traces ls.champsim.tar.xz ls.champsim.tar.xz ls.champsim.tar.xz ls.champsim.tar.xz

The errors look like this:
CPU 0 runs ls.champsim.tar.xz
CPU 1 runs ls.champsim.tar.xz
CPU 2 runs ls.champsim.tar.xz
CPU 3 runs ls.champsim.tar.xz
[ROB_ERROR] add_to_rob ip is zero index: 1 instr_id: 1 ip: 0
perceptron-large-no-bingo_16k-lru-4core: src/ooo_cpu.cc:321: uint32_t O3_CPU::add_to_rob(ooo_model_instr*): Assertion `0' failed.
Aborted (core dumped)

What's the problem? Thanks!

Instruction pointer in champsim

I would like to know whether MSHR.entry[mshr_index].ip means the instruction address (PC) or not. Based on my understanding from the code, it should be the same as the PC. For libquantum_964B, I traced 100M instructions, and for memory accesses I see that there are only 36 PCs for memory blocks. While this is related to the code base of libquantum, I want to be sure about it.

E: Pin does not support instrumenting this conditional branch

Hello again. I was able to get the code to compile by changing a couple of things (string -> std::string, cerr -> std::cerr, etc.) and to generate champsim_tracer.so, but now I am getting a new error when I try to run it. I'm not really sure what this error means or what I can do to fix it. Can anyone help?

My command line input:
../../pin2/pin2/pin -t obj-intel64/champsim_tracer.so -- ls

The error:
E: Pin does not support instrumenting this conditional branch: [ xbegin 0x7f4b2d69d8d8 ] with IARG_BRANCH_TAKEN

ChampSim Prefetcher Address Space

Does the prefetcher code in ChampSim operate in the virtual address space or the physical address space? I mean, for the function:
void CACHE::l1d_prefetcher_operate(uint64_t addr, uint64_t ip, uint8_t cache_hit, uint8_t type)
is the address addr virtual or physical?

Not enough traces for the configured number of cores

Hi,
What does the following error mean exactly?

$ ./bin/bimodal-no-no-no-lru-1core -warmup_instructions 1000000 -simulation_instructions 400000000 -trace dpc3_traces/600.perlbench_s-210B.champsimtrace

*** ChampSim Multicore Out-of-Order Simulator ***

Warmup Instructions: 1000000
Simulation Instructions: 400000000
Number of CPUs: 1
LLC sets: 2048
LLC ways: 16
Off-chip DRAM Size: 4096 MB Channels: 1 Width: 64-bit Data Rate: 1600 MT/s


*** Not enough traces for the configured number of cores ***

bimodal-no-no-no-lru-1core: src/main.cc:616: int main(int, char**): Assertion `0' failed.
Aborted (core dumped)

But

$ ls dpc3_traces/ -lh
total 120G
-rw-rw-r-- 1 mahmood mahmood 120G ژوئیه 10  2017 600.perlbench_s-210B.champsimtrace

I have downloaded only one trace. For a one-core simulation, that should be fine.

make_tracer.sh will not correctly build tracer

I have been trying to build tracer so I can generate some traces for research I am doing. I changed the pin tool path in make_tracer.sh to be the correct location of my pin tool, but it still does not run properly. It fails and gives the following errors:
https://pastebin.com/unxiqrqB
I'm using pin version 3.11, and g++ 5.4.0

Is there anything I can do to get tracer working?

IP stride prefetching code for L2 cache

As you can see in the following example, taken from the output of the ChampSim build script, it looks like the ip_stride prefetcher is not working because of some errors in the code. I also point to the faulty lines in the code (cf. the very end of this issue).

alexandre@alexandre-Latitude-7490:~/Projets/ChampSim_modified$ ./build_champsim.sh hashed_perceptron next_line ip_stride no lru 1
Building single-core ChampSim...

rm -f -r obj
Generating dependencies for src/uncore.cc...
Compiling src/uncore.cc...
Generating dependencies for src/dram_controller.cc...
Compiling src/dram_controller.cc...
Generating dependencies for src/ooo_cpu.cc...
Compiling src/ooo_cpu.cc...
src/ooo_cpu.cc: In member function ‘void O3_CPU::add_load_queue(uint32_t, uint32_t)’:
src/ooo_cpu.cc:895:17: warning: this ‘if’ clause does not guard... [-Wmisleading-indentation]
                 if (LQ.entry[lq_index].producer_id != UINT64_MAX)
                 ^~
src/ooo_cpu.cc:898:21: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the ‘if’
                     mem_RAW_dependency(i, rob_index, data_index, lq_index);
                     ^~~~~~~~~~~~~~~~~~
src/ooo_cpu.cc:903:17: warning: this ‘if’ clause does not guard... [-Wmisleading-indentation]
                 if (LQ.entry[lq_index].producer_id != UINT64_MAX)
                 ^~
src/ooo_cpu.cc:906:21: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the ‘if’
                     mem_RAW_dependency(i, rob_index, data_index, lq_index);
                     ^~~~~~~~~~~~~~~~~~
src/ooo_cpu.cc:909:17: warning: this ‘if’ clause does not guard... [-Wmisleading-indentation]
                 if (LQ.entry[lq_index].producer_id != UINT64_MAX)
                 ^~
src/ooo_cpu.cc:912:21: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the ‘if’
                     mem_RAW_dependency(i, rob_index, data_index, lq_index);
                     ^~~~~~~~~~~~~~~~~~
Generating dependencies for src/cache.cc...
Compiling src/cache.cc...
Generating dependencies for src/main.cc...
Compiling src/main.cc...
In file included from /usr/include/getopt.h:24:0,
                 from src/main.cc:3:
/usr/include/features.h:184:3: warning: #warning "_BSD_SOURCE and _SVID_SOURCE are deprecated, use _DEFAULT_SOURCE" [-Wcpp]
 # warning "_BSD_SOURCE and _SVID_SOURCE are deprecated, use _DEFAULT_SOURCE"
   ^~~~~~~
src/main.cc: In function ‘uint64_t va_to_pa(uint32_t, uint64_t, uint64_t, uint64_t)’:
src/main.cc:312:48: warning: variable ‘pr2’ set but not used [-Wunused-but-set-variable]
             map <uint64_t, uint64_t>::iterator pr2 = recent_page.begin();
                                                ^~~
Generating dependencies for src/block.cc...
Compiling src/block.cc...
Generating dependencies for branch/branch_predictor.cc...
Compiling branch/branch_predictor.cc...
Generating dependencies for replacement/base_replacement.cc...
Compiling replacement/base_replacement.cc...
Generating dependencies for replacement/llc_replacement.cc...
Compiling replacement/llc_replacement.cc...
Generating dependencies for prefetcher/l2c_prefetcher.cc...
Compiling prefetcher/l2c_prefetcher.cc...
prefetcher/l2c_prefetcher.cc: In member function ‘uint32_t CACHE::l2c_prefetcher_operate(uint64_t, uint64_t, uint8_t, uint8_t, uint32_t)’:
prefetcher/l2c_prefetcher.cc:83:9: error: return-statement with no value, in function returning ‘uint32_t {aka unsigned int}’ [-fpermissive]
         return;
         ^~~~~~
prefetcher/l2c_prefetcher.cc:106:9: error: return-statement with no value, in function returning ‘uint32_t {aka unsigned int}’ [-fpermissive]
         return;
         ^~~~~~
Makefile:49: recipe for target 'obj/prefetcher/l2c_prefetcher.o' failed
make: *** [obj/prefetcher/l2c_prefetcher.o] Error 1

ChampSim build FAILED!

Results question

[screenshot of ChampSim's end-of-simulation statistics]

Hi all,

I have some questions regarding the results, specifically for the prefetch data of L2:

  1. What is the difference between L2C prefetch hit, and L2C prefetch useful? Are they different values because a prefetched block can be hit more than once, but only counted as a useful prefetch once?
  2. Since the L2C prefetcher is prefetch on demand, why aren’t L2C load access and L2C prefetch requested the same value?
  3. Is coverage of L2C prefetch calculated by L2C prefetch Hit / L2C Load access? Or L2C prefetch useful / L2C load access?
  4. Is accuracy of L2C prefetch calculated by L2C Prefetch Hit / L2C prefetch access?
    Also, is there a document that explains the results metrics?

Thank you,
Nick

Choosing a build system

I think I'm not alone in thinking that the current method of building ChampSim is clunky. Over the past year, there have been three pull requests opened to address issues (#10, #11, #27). The three of them are incompatible, but they all attempt to address the same issues: The current build system

  • Does not allow for models that span multiple files. This leads to monolithic, non-reusable code.
  • Does not permit compilation flags or parameters (such as CXXFLAGS) to be passed to the compiler.
  • Requires that models be named *.l2c_pref, or similar. This causes problems with tools that would expect *.cc or *.cpp.
  • Does not scale well with the number of options available. With the recent addition of L1I prefetchers, build_champsim.sh now takes 7 parameters, while it is likely that only one is important on any build.
  • Rebuilds the entire project on every build. Why do we need to do this when we have tools like Make available to us?

I think that continuing to use Make is a smart idea. It is a standard and it is well-understood by developers. There are multiple ways to prepare the project for Make:

  • Autoconf is the GNU standard for building projects, but it is targeted more at enabling portability than at selecting between source files
  • CMake is a possible alternative, but introduces a new dependency (#11)
  • SCons is used by GEM5, and seems to be a powerful tool.
  • A home-cooked solution may also be good, since it varies only marginally from what we have now. (#10 uses a Python script to parse a configuration file, #27 uses environment variables)

What features do we need out of our build system? What can we do to make it better?

Two questions about the OoO code

Hi,

I have two questions regarding the OoO core code.

The first one is why in main.cc, the schedule and execute calls for normal instructions and memory instructions are alternated:

      ooo_cpu[i].schedule_instruction();
      // execute
      ooo_cpu[i].execute_instruction();

      ooo_cpu[i].update_rob();

      // memory operation
      ooo_cpu[i].schedule_memory_instruction();
      ooo_cpu[i].execute_memory_instruction();

In the other stages, the calls go from later to earlier stage, which makes more sense to me.

The second question is why the LQ and SQ are not filled along with the ROB at dispatch, but at schedule. It seems to me that these queues can be filled out of order even if, as commented in the code, the schedule stage is in-order. My understanding was that the ROB, LQ, and SQ are filled at the same stage, usually the last in-order stage of the front-end.

Thanks a lot,
Alberto

[Help Wanted / Question] Running ChampSim Tracer

I'm trying to create a trace, but I'm not exactly sure what steps I should be following.

I ran ./make_tracer.sh, but I keep getting the following error:
makefile:15: /Config/makefile.default.rules: No such file or directory make: *** No rule to make target '/Config/makefile.default.rules'. Stop.

Can someone help me out?

Reshaping ChampSim

As I see that there have been quite a lot of issues about the structure of ChampSim, I wanted to add to #31, #11, and #10.

Here, I'm going to share with you the way I believe ChampSim should be structured.

As a matter of fact, ChampSim, to my knowledge, is mostly used to try out replacement policies, prefetching mechanisms, and so on. It seems quite unlikely that you will change ChampSim internals during your research process. This part of the simulation infrastructure would therefore be better off fixed, while components that do change, such as replacement policies and prefetching mechanisms, would benefit from living in distinct compilation units.

A third part that I would like to consider is the knobs, i.e., the many macros in the code that configure the caches, the CPU, and so on. Exploring the design space with them can rapidly turn into a compilation nightmare.

To wrap up, here are three key concepts that I would like to push forward to you guys:

  1. The main simulator infrastructure: this part would be the actual ChampSim executable and it would be in charge of the initialization of the memory infrastructure, reading traces and more;
  2. Modules: these would be the replacement policies, prefetching mechanisms, or any crazy thing you want. I reckon they should be libraries loaded at runtime, containing the minimum code required to have their mechanism implemented and working;
  3. Configuration file(s): this part would be an input telling the simulator which knobs to use when allocating structures, how the CPU(s) should be configured, which policies to use in which cache, etc.

N.B.: I understand that it might sound like a lot of work (it is actually), but I feel like it addresses some of the issues pointed out by @ngober in #10. However, to me, that sounds like a very good and ambitious milestone for ChampSim, making it more user-friendly and approachable.

Code Formatting

There has been considerable talk on #38 about what code format style to adopt and how to implement it. I'm creating this issue to collect all of the discussion in a more topical place.

First, why ChampSim needs to care about code formatting: its purpose is to be an accessible tool for rapid innovations in computer architecture research, particularly in memory prefetching and replacement. In order to be accessible, it has to be readable and comprehensible. Therefore, any style that ChampSim adopts must prioritize comprehensibility.

Any style tool that ChampSim uses must also not be a burden to a novice programmer. We see many people using ChampSim who have never worked on a C++ project before. ChampSim is designed for modularity; that is, you can add your own custom prefetchers, branch predictors, and replacement policies. These modules, being primarily used in-house, need not respect any ChampSim format guidelines and should not be required to. Yet contributors to the codebase should be required to maintain proper style (for self-evident reasons). Therefore, any style tool that ChampSim uses must be as optional as possible for researchers while being as required as possible for maintainers.

In the other thread, two toolchains have been suggested:

Let's use this thread to discuss two things: What style to adopt and what toolchains to use to implement it.

Core/Cache stalled on an STLB miss

On an STLB miss, currently everything is stalled, including the core, caches, and DRAM. Why is it simulated like that? Wouldn't it be better if only that particular request were stalled while other requests could proceed?

ChampSim does not compile under Cygwin

Alternate title: Windows users are people, too.

ChampSim does not compile under Cygwin due to the lack of support for POSIX threads. Currently, the only place pthreads are used is in src/main.cc, where they spin up gzip in parallel to read the traces. If it does not take too much effort, it could be worth providing a Windows-compatible solution in this one place. If we do, then ChampSim should run on Windows, too.

Tracing the whole program

How can I specify tracing the whole duration of the program while creating a trace? For example,

pin -t obj/champsim_tracer.so -o traces/ls_trace.champsim -s 100000 -t 200000 -- ls

says to trace 200K instructions. Let's assume the ls program finishes before 200K instructions. What will happen then? Will the trace have fewer than 200K instructions?

Or assume that ls has more than 200K instructions. How can we handle that then?

Instructions executed during branch misprediction and ROB flush

I just want to confirm whether the trace files include the mispredicted branch paths or not.
From the code, it looks like the mispredicted branch path is not included in the trace file; on a branch misprediction, only the branch miss penalty is added, and the mispredicted instructions are not executed, since the ROB is not flushed in any way.
Can someone please confirm?
