jbush001 / nyuziprocessor

GPGPU microprocessor architecture

License: Apache License 2.0

SystemVerilog 15.95% C++ 27.14% C 48.90% Assembly 3.52% Python 3.35% Shell 0.10% Perl 0.08% Tcl 0.01% CMake 0.82% Java 0.11% Dockerfile 0.03%

fpga gpu-computing gpu verilog hardware microprocessor graphics processor-architecture

nyuziprocessor's Introduction

Nyuzi Processor

Nyuzi is an experimental GPGPU processor focused on compute-intensive tasks. It includes a synthesizable hardware design written in SystemVerilog, an instruction set emulator, an LLVM-based C/C++ compiler, software libraries, and tests. It can be used to experiment with microarchitectural and instruction set design tradeoffs.

Documentation: https://github.com/jbush001/NyuziProcessor/wiki

The following instructions explain how to set up the Nyuzi development environment. This includes an emulator and cycle-accurate hardware simulator, which allow hardware and software development without an FPGA, as well as scripts and components to run on FPGA.

Install Prerequisites

Linux (Ubuntu)

This requires Ubuntu 16 (Xenial Xerus) or later to get the proper package versions. It should work for other distributions, but you will probably need to change some package names. From a terminal, execute the following:

sudo apt-get -y install autoconf cmake make ninja gcc g++ bison flex python \
    python3 perl emacs openjdk-8-jdk swig zlib1g-dev python-dev \
    libxml2-dev libedit-dev libncurses5-dev libsdl2-dev gtkwave python3-pip
pip3 install pillow

Note: Recent versions of cmake break building the LLVM toolchain. This can be worked around by switching back to an older version of cmake: #204

Emacs is used for verilog-mode AUTO macros. The makefile executes this operation in batch mode.

MacOS

These instructions assume OS X Mavericks or later.

Install Xcode from the Mac App Store. Then install the command line compiler tools by opening Terminal and typing the following:

xcode-select --install

Install python libraries:

pip3 install pillow

Install Homebrew from https://brew.sh/, then use it to install the remaining packages from the terminal (do not use sudo):

brew install cmake bison swig sdl2 emacs ninja

Alternatively, you could use MacPorts if that is installed on your system, but you will need to change some of the package names.

You may optionally install GTKWave for analyzing waveform files.

Windows

I have not tested this on Windows. Many of the libraries are cross platform, so it should be possible to port it. But the easiest route is probably to run Linux under a virtual machine like VirtualBox.

Build (Linux & MacOS)

The following script will download and install the Nyuzi toolchain and the Verilator Verilog simulator. Although some Linux package managers have Verilator, they have old versions. The script will ask for your root password a few times during installation.

./scripts/setup_tools.sh

Build everything else:

cmake .
make

Run tests:

make tests

If you are on a Linux distribution that defaults to python3, you may run into build problems with the compiler. In tools/NyuziToolchain/tools/CMakeLists.txt, comment out the following line:

add_llvm_external_project(lldb)

Occasionally a change will require a new version of the compiler. To rebuild:

git submodule update
cd tools/NyuziToolchain/build
make
sudo make install

Next Steps

Sample applications are available in software/apps. You can run these in the emulator by switching to the build directory and using the run_emulator script (some need 3rd party data files, details are in the READMEs in those directories).

For example, this will render a 3D model in the emulator:

cd software/apps/sceneview
./run_emulator

To run on FPGA, see instructions in hardware/fpga/de2-115/README.md

nyuziprocessor's Issues

Cache stress test fails with specific random seed

This passes most of the time, but if the random seed is hardcoded to 1419094753, it flags a memory mismatch:

Building cache_stress.s
Random seed is 1419094753
400036 total instructions
Binary files WORK/vmem.bin and WORK/mmem.bin differ
13480c13480
< 0036110 00 00 00 00 a2 d3 00 00 00 00 00 00 00 00 00 00

---
> 0036110 00 00 00 00 a2 d3 00 00 00 00 00 00 ca fb 01 00
FAIL: final memory contents do not match

Texture misaligned where mip level changes in sceneview

When rendering Sponza in software/sceneview, the texture is misaligned at the mip level transition. This may be a problem with the resource compiler or in the texture sampler.

emulator doesn't clean up shared memory file

When shared memory is specified in the emulator using the '-s' flag, it creates a file, but it does not delete it when finished or if there is an error.

Cache performance degrades in some simulators due to the implementation of module "cache_lru"

I recently ported Nyuzi to the Xilinx ZC706 evaluation board and ran doom and quakeview successfully. During this process, I noticed a cache performance degradation when debugging with Modelsim SE 10.5c. Since the degradation appears only in some simulators, and not in Verilator or after synthesis, this is perhaps more of a portability suggestion than an issue.

When running the core on some simulators such as Modelsim, the x values coming from uninitialized SRAM cause the casez statement in module cache_lru to fail to match any pattern, with the result that only way 0 is available in every cache set. To illustrate, here is a code snippet from cache_lru.sv for the case where parameter NUM_WAYS is 4.

...

            4:
            begin
                always_comb
                begin
                    casez (lru_flags)
                        3'b00?: fill_way = 0;
                        3'b10?: fill_way = 1;
                        3'b?10: fill_way = 2;
                        3'b?11: fill_way = 3;
                        default: fill_way = '0;
                    endcase
                end

                always_comb
                begin
                    case (new_mru)
                        2'd0: update_flags = {2'b11, lru_flags[0]};
                        2'd1: update_flags = {2'b01, lru_flags[0]};
                        2'd2: update_flags = {lru_flags[2], 2'b01};
                        2'd3: update_flags = {lru_flags[2], 2'b00};
                        default: update_flags = '0;
                    endcase
                end
            end

...

Consider the case where, after reset, two or more cache fills occur in the same cache set. When the first fill request arrives, lru_flags is 3'bxxx (the value of uninitialized SRAM), so the default branch is taken, way 0 is filled, and update_flags becomes 3'b11x. For every subsequent request, the casez statement again fails to match any pattern against lru_flags (3'b11x), so the default branch is always taken, way 0 is always filled, and update_flags is always 3'b11x. A similar issue exists when NUM_WAYS is 8. The area marked in red in the attached Modelsim waveform shows this situation.

Although it sounds like a bad idea, one solution is to change every casez in cache_lru.sv to casex. I ran an experiment to estimate the performance impact. First, I inserted logic into writeback_stage to trace function activity. Then I ran the same program for a fixed amount of time on two designs: the original, and one with the modified cache_lru. Since the program is the same, the amount of log output generated during simulation is related to the number of instructions executed, so comparing the number of function trace records gives an approximate performance comparison. The result (shown in the attached screenshot): the design with the modified cache_lru produced 2.8x more log output than the original, which roughly corresponds to running 2.8x faster.

The version I use is ce221ff, but this issue still exists in the latest version.

Finally, thanks for creating such an interesting project!

tests/kernel/initidata times out intermittently when run under verilator on TravisCI

On TravisCI, this test times out once in a while. The timeout is set to 120 seconds. It takes about 86 seconds to run on my laptop. It's possible that the timeout simply doesn't have enough margin when Travis's build server is running slowly.

Support all IEEE754 rounding modes

Currently hardcoded to round-to-nearest, ties to even. Add bits to flags control register to configure and logic in fp_execute_stage3 to compute do_round based on the mode and GRS bits.

https://en.wikipedia.org/wiki/Floating-point_arithmetic#Rounding_modes

Round to nearest, ties to even (current)
Round to nearest, ties round up (away from zero)
Round up (towards positive infinity)
Round down (towards negative infinity)
Round towards zero (truncation, basically no rounding)

Probably need to fix issue #58 first.

Clean up thread halt/resume model

Get rid of thread enable register. Instead resume a thread by sending it an inter-thread interrupt (requires feature #61)
Add HALT instruction or control register that suspends current thread (instead of memory mapped global halt/resume registers). Threads could be woken by external interrupts, depending on state of interrupt mask.
Don't stop simulation when all threads halt. Instead add dedicated memory mapped I/O register exposed by simulator that ends simulation.

Advantages:

Enables implementation of power management (halt in idle loop #91 )
Scales more cleanly to larger thread counts (currently limited to 32 by width of enable register)
Removes hacky processor_halt signal, which is only used by simulator

MMU large page support

Design doc: https://gist.github.com/jbush001/9149c14df733b25bea73b66fb7b213bc

Generate bus error interrupt for AXI error

The AXI bus has signals to indicate an error, but the processor ignores them. Because bus writes are deferred, these are necessarily imprecise.

[V1 uArch] __sync_add_and_fetch behaves incorrectly on FPGA

This manifests in a number of programs, but the test in tests/fpga/atomic_bug/ reproduces it consistently. While this program is running, it outputs the result of __sync_add_and_fetch in binary on the 7 segment LEDs, with each segment corresponding to a bit and each digit corresponding to a hardware thread. If the program is behaving correctly, the digits should be updated continuously. However, after the first iteration, the program hangs and two of the digits generally display the same value (which violates the semantics of __sync_add_and_fetch).

This problem does not occur when running the same program in RTL simulation under Modelsim (invoked from Quartus, which should ostensibly be using almost the same configuration).

application ideas

Hello, sorry for asking this question, but I was wondering if people could list what this GPGPU processor is good for once synthesized. I don't want a general answer about what GPGPUs are good for, but rather what this project will help speed up or facilitate.

Assertion when interrupt occurs during synchronized store

This occurred in the io_interrupt test:

io_interrupt
Process returned error: Random seed is 1460932741
cores 1|threads per core 4|l1i$ 16k 4 ways|l1d$ 16k 4 ways|l2$ 128k 8 ways|itlb 64 entries|dtlb 64 entries
>ABC*DE*FGHI*JKLM*NOPQ*RSTU*VWXYZ*abcd*efgh*ijklm*nopq*rstu*vwxy*z0123*4567*89
[32128] %Error: l1_store_queue.sv:208: Assertion failed in TOP.v.nyuzi.core_gen[0].core.l1_l2_interface.l1_store_queue.thread_store_buf_gen[0]
%Error: core/l1_store_queue.sv:208: Verilog $stop

The code is here:

                    if (store_requested_this_entry)
                    begin
                        if (is_restarted_sync_request)
                        begin
                            ...
                            assert(!rollback[thread_idx]);    // <----- Here

The problem appears to be as follows:

Core initiates synchronized store. The first instruction issue is rolled back and the state is recorded in the synchronized flag in the store buffer entry.
Interrupt occurs and thread is rolled back.
The interrupt handler attempts to do a store (of any type) before the response comes back for the synchronized store request.
In l1_store_queue, is_restarted_sync_request will be true because there is already a store with the synchronized flag set. The logic assumes this is the restarted synchronized store request. However, rollback will be set to 1 for this thread because the entry is still pending and can_write_combine is 0.

l1_store_queue assumes a core will not attempt to issue another store after it's been rolled back until the l1_store_queue wakes it:

     else if (pending_stores[thread_idx].valid && !can_write_combine
         && !got_response_this_entry)
         rollback[thread_idx] = 1;

The reason it occurred in this test was because there is a spinlock in crt0.s when the program terminates (for calling destructors). A compiler update changed the timing enough to hit this race condition. A better test would be to issue synchronized stores in a loop with interrupts enabled.

The fix is probably to disable interrupts while a synchronized store is pending. Something similar is already done for I/O operations: the ic_interrupt_pending bitmask goes into the instruction_decode_state to detect this stage. An additional signal could go from the store buffer to indicate this.

mulhs_i returns incorrect value in hardware

In tests/compiler/multiply64.c, should return 'FixedMul af07b15d', instead returns 'FixedMul 8fcdb15d'. The top 16 bits were computed by mulhs_i.

Floating point comparison instructions do not properly detect equality when both operands are -inf

Executing a floating point compare instruction that checks for equality when both operands are negative infinity will incorrectly not treat them as equal. That's because the equality checks that the result of a subtraction is zero and it is not in this case. There needs to be logic to check for infinity explicitly in these operations.

Optimize hardware multiplier

Currently integer and floating point multiplication occur in one stage (fp_execute_stage2) using the '*' Verilog operator. This is the critical path when synthesizing for silicon. However, three stages are reserved in the pipeline for it. Create a proper multi-stage multiplier that uses modified Booth encoding and a Wallace tree to accumulate partial products.

https://web.stanford.edu/class/archive/ee/ee371/ee371.1066/lectures/lect_05.2up.pdf

Or, eliminate Booth encoding and use a single row of 4:2 compressors:

http://www.acsel-lab.com/Publications/Papers/38-booth-para-multi-EL93.pdf

Build error on OSX 10.9

%Error: ../fpga_common/axi_internal_ram.v:204: Internal: Extra arguments for $display-like format
%Error: core/writeback_stage.v:150: Internal: Extra arguments for $display-like format
%Error: core/instruction_fetch_stage.v:202: Internal: Extra arguments for $display-like format
%Error: Exiting due to 3 error(s)
%Error: Command Failed /usr/local/Cellar/verilator/3.860/bin/verilator_bin --assert -DENABLE_PERFORMANCE_COUNTERS -DSIMULATION -Icore -y fpga -y testbench -y ../fpga_common -Wno-fatal -Werror-implicit --cc testbench/verilator_tb.v --exe testbench/verilator_main.cpp
make[1]: *** [verilator]

Add JTAG debugging support to hardware pipeline

Single step, read/write registers, read/write memory.

NaN propagation

From IEEE754-2008, 6.2.3:

"If two or more inputs are NaN, then the payload of the resulting NaN should be identical to the payload of one of the input NaNs if representable in the destination format. This standard does not specify which of the input NaNs will provide the payload."

The convention seems to be to take the first operand. For example:

union uval {
   float fval;
   int32_t ival;
};

volatile union uval a, b, c;

int main()
{
    a.ival = 0xff8abcde;
    b.ival = 0xff812345;
    c.fval = a.fval + b.fval;
    printf("%08x\n", c.ival);
}

On a desktop machine, this returns 0xffcabcde (the most significant bit of the significand is set). On Nyuzi, it returns the hardcoded NaN value 0x7fffffff.

emulator crashes when resuming stopped process

Seems to be a regression. Happens on MacOS. I haven't tested on Linux.

Launch doom in debugger:

doom> make debug
Start the process, stop it, then restart it

(lldb) c
Process 1 resuming
(lldb) process interrupt
Process 1 stopped
(lldb) c
Process 1 resuming

The emulator will restart, but the first time the window is clicked on (or presumably receives any event), the emulator will crash:

* thread #1: tid = 0x1ed6dd, 0x00007fff92da3c43 libsystem_c.dylib`__findenv + 90, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x00007fff92da3c43 libsystem_c.dylib`__findenv + 90
libsystem_c.dylib`__findenv + 90:
-> 0x7fff92da3c43:  movb   (%r11), %cl
   0x7fff92da3c46:  testb  %cl, %cl
   0x7fff92da3c48:  je     0x7fff92da3c6a            ; __findenv + 129
   0x7fff92da3c4a:  movzbl (%rbx), %r15d

* thread #1: tid = 0x1ed6dd, 0x00007fff92da3c43 libsystem_c.dylib`__findenv + 90, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
  * frame #0: 0x00007fff92da3c43 libsystem_c.dylib`__findenv + 90
    frame #1: 0x00007fff92da3cc7 libsystem_c.dylib`getenv + 29
    frame #2: 0x00007fff894f3cc6 CarbonCore`GetDYLDEntryPointWithImage + 56
    frame #3: 0x00007fff8eef5eea HIToolbox`HLTBRegisterLazyHIObjectClass + 140
    frame #4: 0x00007fff8ee95bbb HIToolbox`HIObjectClass::Lookup(__CFString const*, unsigned char) + 47
    frame #5: 0x00007fff8ee95a59 HIToolbox`HIObject::Create(__CFString const*, OpaqueEventRef*, HIObject**) + 41
    frame #6: 0x00007fff8ee95a0b HIToolbox`HIObjectCreate + 90
    frame #7: 0x00007fff8ef8b929 HIToolbox`HIMenuBarView::GetDrawingDelegate() + 37
    frame #8: 0x00007fff8eea8854 HIToolbox`HIMenuBarView::HIMenuBarView(OpaqueHIObjectRef*) + 222
    frame #9: 0x00007fff8eea875a HIToolbox`HIMenuBarView::Construct(OpaqueHIObjectRef*) + 34
    frame #10: 0x00007fff8eea6156 HIToolbox`HIView::EventHandler(OpaqueEventHandlerCallRef*, OpaqueEventRef*, void*) + 890
    frame #11: 0x00007fff8ee95f31 HIToolbox`HIObject::Construct(HIObjectClass*, HIObject**, OpaqueHIObjectRef*) + 181
    frame #12: 0x00007fff8ee95b0e HIToolbox`HIObject::Create(__CFString const*, OpaqueEventRef*, HIObject**) + 222
    frame #13: 0x00007fff8ee95a0b HIToolbox`HIObjectCreate + 90
    frame #14: 0x00007fff8eea8510 HIToolbox`HIMenuBarView::Create(CGRect const*, OpaqueMenuRef*, OpaqueControlRef**) + 86
    frame #15: 0x00007fff8eea7e6c HIToolbox`HIMenuBarFrameView::Initialize(OpaqueEventRef*) + 38
    frame #16: 0x00007fff8ee9c450 HIToolbox`HIObject::HandleClassHIObjectEvent(OpaqueEventHandlerCallRef*, OpaqueEventRef*, void*) + 490
    frame #17: 0x00007fff8ee9c24a HIToolbox`HIObject::EventHook(OpaqueEventHandlerCallRef*, OpaqueEventRef*, void*) + 128
    frame #18: 0x00007fff8ee9b98c HIToolbox`DispatchEventToHandlers(EventTargetRec*, OpaqueEventRef*, HandlerCallRec*) + 1260
    frame #19: 0x00007fff8ee9adce HIToolbox`SendEventToEventTargetInternal(OpaqueEventRef*, OpaqueEventTargetRef*, HandlerCallRec*) + 386
    frame #20: 0x00007fff8ee9ac42 HIToolbox`SendEventToEventTargetWithOptions + 43
    frame #21: 0x00007fff8ee95b31 HIToolbox`HIObject::Create(__CFString const*, OpaqueEventRef*, HIObject**) + 257
    frame #22: 0x00007fff8ee95a0b HIToolbox`HIObjectCreate + 90
    frame #23: 0x00007fff8eea5768 HIToolbox`NewWindowCommon(WindowData**, unsigned int, unsigned long long, WindowDefSpec const*, Rect const*, unsigned char const*, unsigned char, OpaqueWindowPtr*, void*, unsigned int, unsigned short*, bool) + 945
    frame #24: 0x00007fff8eea4f73 HIToolbox`_HIWindowCreateWithCGWindow + 335
    frame #25: 0x00007fff8eea4d49 HIToolbox`MBWindows::CreateWindow(CGRect, unsigned int) + 245
    frame #26: 0x00007fff8eea49f6 HIToolbox`MBWindows::GetWindowIDOnDisplay(unsigned int, unsigned char) + 174
    frame #27: 0x00007fff8eea47ee HIToolbox`MenuBarInstance::ForEachWindowDo(void (unsigned int, unsigned int) block_pointer) + 162
    frame #28: 0x00007fff8eea454d HIToolbox`MenuBarInstance::UpdateWindowBoundsAndResolution() + 155
    frame #29: 0x00007fff8eea4117 HIToolbox`MenuBarInstance::Show(MenuBarAnimationStyle, unsigned char, unsigned char, unsigned char) + 229
    frame #30: 0x00007fff8eecfc95 HIToolbox`SetMenuBarObscured + 232
    frame #31: 0x00007fff8eecf8f2 HIToolbox`HIApplication::HandleActivated(OpaqueEventRef*, unsigned char, OpaqueWindowPtr*) + 184
    frame #32: 0x00007fff8eece4f0 HIToolbox`HIApplication::EventObserver(unsigned int, OpaqueEventRef*, void*) + 238
    frame #33: 0x00007fff8ee9b12c HIToolbox`_NotifyEventLoopObservers + 155
    frame #34: 0x00007fff8f8ea216 AppKit`-[NSWindow _reallySendEvent:] + 10671
    frame #35: 0x00007fff8f37116e AppKit`-[NSWindow sendEvent:] + 446
    frame #36: 0x000000010009836a libSDL2-2.0.0.dylib`-[SDLWindow sendEvent:] + 48
    frame #37: 0x00007fff8f323451 AppKit`-[NSApplication sendEvent:] + 4183
    frame #38: 0x0000000100094d1f libSDL2-2.0.0.dylib`Cocoa_PumpEvents + 171
    frame #39: 0x000000010003bc97 libSDL2-2.0.0.dylib`SDL_PumpEvents_REAL + 23
    frame #40: 0x000000010003bd05 libSDL2-2.0.0.dylib`SDL_WaitEventTimeout_REAL + 55
    frame #41: 0x0000000100006f46 emulator`poll_fb_window_event + 38
    frame #42: 0x00000001000069d8 emulator`run_until_interrupt + 104
    frame #43: 0x000000010000628d emulator`remote_gdb_main_loop + 1293

Multiple store queue entries per thread

This could improve performance, as writes often happen back-to-back.

https://jbush001.github.io/2016/11/30/measure-twice-cut-once.html

FPGA build fails with newer versions of Quartus

Versions 14.1 and later do not work.

Floating point addition overflow handled incorrectly

The following code:

float a = 3.40282347E+38F;
float b = 3.40282347E+38F;

int main()
{
    union {
        float fval;
        int ival;
    } u;
    
    u.fval = a + b;
    printf("%08x\n", u.ival);
}

Should print '7f800000', which corresponds to 'inf'. However, when running in Verilog simulation, it prints 7fffffff (NaN), because hardware does not explicitly detect overflow.

More efficient inter-thread synchronization primitive

Spinlocks burn a lot of cycles. Can threads be suspended at barriers without consuming issue cycles?

https://gist.github.com/jbush001/22a3c336f0b59b095547025cdb7cee5d

Allow overlapping AXI address/data phases in L2 bus interface

The current implementation in l2_axi_bus_interface.sv uses a single state machine that:

Sends the address of the burst
Waits for an acknowledgement
Transfers the burst data
Waits for the data transfer to finish before sending the next address.

However, AXI is designed to allow the next address to be sent before the first data transfer finishes. This improves performance on high latency links (for example, ones with multiple clock-domain-crossings). This design could be modified to use pipelines for write and read transactions, where the first stage issues the address and the second performs the transfer.

The current testbench has a relatively low latency link, with SDRAM coupled directly to the processor. For testing, it might be interesting to add a parameterized FIFO to simulate different latencies and measure the performance difference with various workloads.

It's also possible to separate the write and read state machines and allow them to operate independently, but this adds more edge cases (e.g. if a line is evicted and then there is a cache miss on it before writeback, the write must be sequenced before the read) and is of questionable performance benefit.

Performance counter interrupts

Make performance counter have the ability to generate an interrupt when a specific event has occurred some number of times. This allows statistically sampling which routines cause more of that type of event.

There would be a register for the event type, and perhaps one for the interrupt threshold. It would also need a way to reset the counter when the interrupt handler runs, which could be automatic or driven by software.

KERNEL PANIC: ASSERT FAILED: rwlock.c:75: m->active_read_count > 0

In tests/kernel/multiprocess, modify makefile to set randseed:

--- a/tests/kernel/multiprocess/Makefile
+++ b/tests/kernel/multiprocess/Makefile
@@ -30,7 +30,7 @@ run: program.elf fsimage.bin
    $(EMULATOR) -b fsimage.bin $(TOPDIR)/software/kernel/kernel.hex

 verirun: program.elf fsimage.bin
-   $(VERILATOR) +bin=$(TOPDIR)/software/kernel/kernel.hex +block=fsimage.bin
+   $(VERILATOR) +randseed=1469249757 +bin=$(TOPDIR)/software/kernel/kernel.hex +block=fsimage.bin

The program will run for a while, then crash:

Loading segment 0 offset 00000000 vaddr 00001000 file size 000001d8 mem size 000001d8 flags 5
Loading segment 0 offset 00000000 vaddr 00001000 file size 00000280 mem size 00000280 flags 5
Loading segment 0 offset 00000000 vaddr 00001000 file size 00000280 mem size 00000280 flags 5
Loading segment 0 offset 00000000 vaddr 00001000 file size 00000280 mem size 00000280 flags 5
Loading segment 0 offset 00000000 vaddr 00001000 file size 00000280 mem size 00000280 flags 5
Loading segment 0 offset 00000000 vaddr 00001000 file size 00000280 mem size 00000280 flags 5
KERNEL PANIC: ASSERT FAILED: rwlock.c:75: m->active_read_count > 0

Does not happen when running in emulator.

Inter-thread interrupts

Add the ability to send an interrupt from one thread to another thread or to all threads, across cores. The easiest way to do this is probably to send it as a message in the L2 cache, as it already has the ability to broadcast messages to all cores and to serialize requests. The l1_l2_interface would decode the message and have a signal directly to the control register unit to flag an interrupt. The control register would probably treat this interrupt specially, as it is routed directly to threads unlike other interrupts.

librender: antialiasing

Repo split

I'd say this is more of a suggestion than an issue.

As you already use submodules for the toolchain and verilator, I would suggest you split out the software and RTL code into different repos as well. For the RTL code I would even suggest moving out the board-specific stuff (hardware/fpga/de2_115) and the common peripherals (hardware/fpga/common) so that it will be easier to create new board ports and reuse your peripheral controllers.

Optimize constant pool/global memory accesses

With the new implementation, a constant pool load takes 3 instructions:

	movehi s0, hi(.LCPI1_0)
	or s0, s0, lo(.LCPI1_0)
	load_32 s0, (s0)

This can be done in two instructions by using an immediate offset with the memory instruction:

	movehi s0, hi(.LCPI1_0)
	load_32 s0, lo(.LCPI1_0)(s0)

This will require a new type of relocation that can patch the memory instruction. Global loads and stores can be optimized similarly.

Hi, jbush001, I just want to contact you

Hi,
There is no public email address listed. I would like to consult with you.
My email is [email protected]. Thank you very much! 👍
--Jian Liu

Saturating integer arithmetic operations

Investigate adding two new instructions that perform signed addition and subtraction, but check for overflow and handle as follows:

if a + b > 2^31-1
    sum = 2^31-1
else if a + b < -(2^31)
    sum = -(2^31)
else
    sum = a + b;

Useful for fixed point operations. Need a benchmark that could exercise use case.

Results requiring normalization shift rounded incorrectly

Add floating point values represented by bit patterns 0x41000001 and 0xBF800004. The result should be 40E00001, but this implementation will return 40E00002. The shifted GRS (guard/round/sticky) bits are 100, which looks like it should round to even (because it's half way), but the normalizing shift that happens after should shift the guard bit into the LSB. This page describes in more detail:
http://pages.cs.wisc.edu/~david/courses/cs552/S12/handouts/guardbits.pdf

Increase physical address width

Add more bits to physical page index in TLB entry to access more than 4GB of physical memory.

But need to leave control bits for features like large pages, for example.

MMU stress test

Probably will be multiple tests. Ideally exercises:

Multiple hardware threads
- Collided TLB misses (two threads in same address space)
- Insert/access same virtual address from threads in different address spaces (ASIDs)
Enough code to stress iTLB misses

tests/stress/mmu/

Use MOVEHI/OR in frame pointer elimination and prologue insertion

In NyuziRegisterInfo::eliminateFrameIndex and NyuziInstrInfo::loadConstant, a combination of shifts and moves is used to load constants. Switch this to use MOVEHI/OR

Single step trap

When a bit is set in the flags control register, it should execute exactly one instruction and then raise a trap.

Because the pipelines are different lengths, it's not sufficient to simply trap in the writeback stage, need to set a bit on the instruction in the instruction decode stage.
The flag bit is presumably set when returning from privileged execution mode using the eret instruction.
After a valid instruction is issued, the next instruction should have the flag set.

However if the issued instruction causes another trap, what should happen then?

tests/compiler/longlong.c fails in verilator

USE_VERILATOR=1 ./runtest.sh longlong.c

FAIL: line 12 expected string ffee1580 was not found
searching here:
e0b4157f

Works correctly when run in simulator

dflush test fails intermittently

tests/misc/dflush will fail sometimes (depending on random seed). The test passes consistently if +autoflushl2=1 is specified, so it appears the flushes are not completing consistently.

[V1 uArch] Occasionally, thread ID is incorrect in mandelbrot test

While running tests/fpga/mandelbrot on FPGA, occasionally one of the threads (each of which draws a 64 pixel wide vertical stripe) will have the wrong parameters. These are computed based on the running thread ID (control register 0).

[V1 uArch] setlt.f does not handle inf properly.

Doing a setlt.f 1.13418960571, inf results in false. There needs to be special case logic for doing comparisons to infinity.

librender: Proper homogeneous clipping

Currently the perspective transform does not divide z by w. The view volume is z (-1.0 -> -inf), which is nonstandard and will break non-floating point depth buffers.

http://fabiensanglard.net/polygon_codec/
http://research.microsoft.com/pubs/73937/p245-blinn.pdf
http://www.cs.unc.edu/~olano/papers/2dh-tri/
http://www.songho.ca/opengl/gl_projectionmatrix.html

Coprocessor memory interface

Currently, the only way to access external coprocessors is to use the I/O bus. Writes to addresses at high physical memory addresses, instead of going through the normal cache hierarchy, are redirected to the I/O bus. While this is useful for relatively low-speed peripherals, it has performance limitations when used with coprocessors such as texture fetch units:

It can only write a single, 32-bit word at a time. For vectorized compute code, this requires 32 instructions to copy a vector value into or out of it (getlane/write).
It takes a rollback for every read or write, since the I/O bus is shared by all cores and transactions need to globally arbitrate.
There is no way for peripherals to make a thread wait for them to be ready, so threads must poll them, which wastes processor cycles and clogs up the bus
The bus does not have a concept of thread IDs, which requires some form of arbitration at the software or coprocessor level.
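To illustrate the first limitation, here is a host-side C sketch of the lane-by-lane copy. It assumes 16 lanes per vector (which is where the 32-instruction figure comes from: one getlane plus one write per lane). The function name and the device_reg pointer are hypothetical; on real hardware the pointer would be a memory-mapped I/O address and the array read would be a getlane instruction.

```c
#include <stdint.h>

#define NUM_LANES 16

/* Copy one vector to a single 32-bit device register, one lane at a time.
   Returns the number of modeled instructions (a getlane and a store per lane). */
static int write_vector_to_device(volatile uint32_t *device_reg,
                                  const uint32_t lanes[NUM_LANES])
{
    int instruction_count = 0;

    for (int lane = 0; lane < NUM_LANES; lane++) {
        uint32_t value = lanes[lane]; /* stands in for getlane */
        instruction_count++;

        *device_reg = value;          /* 32-bit write over the I/O bus */
        instruction_count++;
    }

    return instruction_count;
}
```

The count is 32 for every vector transferred, before accounting for the rollback each I/O access incurs.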

The proposal is to create a new bus for use in high-speed peripherals. This would not replace the low speed bus, but would address different use cases and constraints.

Design here:
https://gist.github.com/jbush001/09f51178a366c0f6b8f07363c30f414f

Randomly generated cosimulation tests fail with interrupts enabled

In tests/cosimulation:

$ ./generate_random.py -i
$ ./runtest.sh random.s 
...
COSIM MISMATCH, thread 0 instruction d4d414e7

Floating point multiplication exponent underflow handled incorrectly

The following code:

float a = 3.40282347E-24F;
float b = 3.40282347E-24F;

int main()
{
    union {
        float fval;
        int ival;
    } u;
    
    u.fval = a * b;
    printf("%08x\n", u.ival);
}

Should print 00000000, but prints 7187625d when run in Verilog simulation. Hardware needs to explicitly detect if the sum of the exponents underflows and set the result to zero.
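A minimal C sketch of the check, as a software model rather than the actual RTL. The function names are made up for illustration; it assumes normalized, non-zero inputs, ignores rounding, and assumes results below the smallest normal value are flushed to zero. For the values above, both biased exponents are 49, so the result exponent is 49 + 49 - 127 = -29. Truncated to 8 bits that is 0xe3, which is consistent with the exponent field of the observed 7187625d.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define EXPONENT_BIAS 127

/* Reinterpret a float as its raw IEEE 754 bit pattern. */
static uint32_t float_to_bits(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof(bits));
    return bits;
}

static int biased_exponent(uint32_t bits)
{
    return (int)((bits >> 23) & 0xff);
}

/* 24-bit significand with the implicit leading one (normal inputs only). */
static uint64_t significand(uint32_t bits)
{
    return (bits & 0x007fffffu) | 0x00800000u;
}

/* Biased exponent of a * b after normalization, kept in a wide signed
   integer so that a negative sum is visible instead of wrapping. */
static int product_exponent(float a, float b)
{
    uint32_t a_bits = float_to_bits(a);
    uint32_t b_bits = float_to_bits(b);
    int exponent = biased_exponent(a_bits) + biased_exponent(b_bits)
        - EXPONENT_BIAS;

    /* The significand product is in [2^46, 2^48); at or above 2^47 it
       needs a one-bit normalization shift, which increments the exponent. */
    if (significand(a_bits) * significand(b_bits) >= ((uint64_t)1 << 47))
        exponent++;

    return exponent;
}

/* True if the result is below the smallest normal value and should be zero. */
static bool product_underflows(float a, float b)
{
    return product_exponent(a, b) < 1;
}
```

The key point is that the sum is computed in more than 8 bits, so the sign of the result can be tested before the exponent field is packed.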

Make cosimulation log every instruction

./generate_random.py -i
./runtest.sh random.s

Has about an 8% failure rate. In all cases, hardware executes an instruction from the interrupt handler, while the emulator continues executing code. In each example below, hardware jumps to 1dc (beginning of the interrupt handler). The value the interrupt handler loads into s11 is the PC that was interrupted. In each case, the address of the instruction executed by the reference emulator is equal to the PC loaded by the hardware implementation.

interrupt_handler:
     1dc:   62 01 00 ac     getcr s11, 2

COSIM MISMATCH, thread 0
Reference: 000755e0 s4 <= 00000005
Hardware: 000001dc s11 <= 000755e0

COSIM MISMATCH, thread 2
Reference: 00000260 v4{ffff} <= 18f878a5 040630c3 bc0614e0 253f20a5 bc060cc0 c0738063 12900022 bc0810c1 41fbf904 d0a400e6 028f0022 d0e400e0 257e9485 128d0022 c47380a4 c14380e4
Hardware: 000001dc s11 <= 00000260

This would suggest an issue where the reference implementation is not getting the cosimulation interrupt message, or has interrupts masked for some reason.

Works correctly in simulator.

Clang crash with undefined variables

The following code, which references undeclared identifiers:

static void foo()
{
    bar(a, (b + 1));
}

Crashes:

Assertion failed: (Entry != DelayedTypos.end() && "Failed to get the state for a TypoExpr!"), function getTypoExprState, file /Users/jeffbush/src/NyuziToolchain/tools/clang/lib/Sema/SemaLookup.cpp, line 5032.
0  clang-3.9                0x000000010a6207ae llvm::sys::PrintStackTrace(llvm::raw_ostream&) + 46
1  clang-3.9                0x000000010a620c09 PrintStackTraceSignalHandler(void*) + 25
2  clang-3.9                0x000000010a61d719 llvm::sys::RunSignalHandlers() + 425
3  clang-3.9                0x000000010a620f94 SignalHandler(int) + 372
4  libsystem_platform.dylib 0x00007fff937aef1a _sigtramp + 26
5  clang-3.9                0x000000010df4d8ac guard variable for shouldAddRequirement(clang::Module*, llvm::StringRef, bool&)::IOKitAVC + 81916
6  clang-3.9                0x000000010a620c2b raise + 27
7  clang-3.9                0x000000010a620ce2 abort + 18
8  clang-3.9                0x000000010a620cc1 __assert_rtn + 129
9  clang-3.9                0x000000010cb57c81 clang::Sema::getTypoExprState(clang::TypoExpr*) const + 225
10 clang-3.9                0x000000010ca9881c (anonymous namespace)::TransformTypos::TransformTypoExpr(clang::TypoExpr*) + 188
11 clang-3.9                0x000000010ca89c5a clang::TreeTransform<(anonymous namespace)::TransformTypos>::TransformExpr(clang::Expr*) + 5114
12 clang-3.9                0x000000010ca42ae7 (anonymous namespace)::TransformTypos::TryTransform(clang::Expr*) + 71
13 clang-3.9                0x000000010ca418c1 (anonymous namespace)::TransformTypos::Transform(clang::Expr*) + 65
14 clang-3.9                0x000000010ca4169d clang::Sema::CorrectDelayedTyposInExpr(clang::Expr*, clang::VarDecl*, llvm::function_ref<clang::ActionResult<clang::Expr*, true> (clang::Expr*)>) + 493
15 clang-3.9                0x000000010c9f3eca clang::Sema::CorrectDelayedTyposInExpr(clang::Expr*, llvm::function_ref<clang::ActionResult<clang::Expr*, true> (clang::Expr*)>) + 74
16 clang-3.9                0x000000010c9f6947 clang::Sema::CorrectDelayedTyposInExpr(clang::ActionResult<clang::Expr*, true>, clang::VarDecl*, llvm::function_ref<clang::ActionResult<clang::Expr*, true> (clang::Expr*)>) + 119
17 clang-3.9                0x000000010c1fe4cb clang::Parser::ParseRHSOfBinaryExpression(clang::ActionResult<clang::Expr*, true>, clang::prec::Level) + 6107
18 clang-3.9                0x000000010c1fccd8 clang::Parser::ParseAssignmentExpression(clang::Parser::TypeCastState) + 280
19 clang-3.9                0x000000010c1fcb8f clang::Parser::ParseExpression(clang::Parser::TypeCastState) + 31
20 clang-3.9                0x000000010c206f41 clang::Parser::ParseParenExpression(clang::Parser::ParenParseOption&, bool, bool, clang::OpaquePtr<clang::QualType>&, clang::SourceLocation&) + 5729
21 clang-3.9                0x000000010c200d71 clang::Parser::ParseCastExpression(bool, bool, bool&, clang::Parser::TypeCastState) + 401
22 clang-3.9                0x000000010c1fe773 clang::Parser::ParseCastExpression(bool, bool, clang::Parser::TypeCastState) + 83
23 clang-3.9                0x000000010c1fccba clang::Parser::ParseAssignmentExpression(clang::Parser::TypeCastState) + 250
24 clang-3.9                0x000000010c20a933 clang::Parser::ParseExpressionList(llvm::SmallVectorImpl<clang::Expr*>&, llvm::SmallVectorImpl<clang::SourceLocation>&, std::__1::function<void ()>) + 371
25 clang-3.9                0x000000010c1ff917 clang::Parser::ParsePostfixExpressionSuffix(clang::ActionResult<clang::Expr*, true>) + 4263
26 clang-3.9                0x000000010c205278 clang::Parser::ParseCastExpression(bool, bool, bool&, clang::Parser::TypeCastState) + 18072
27 clang-3.9                0x000000010c1fe773 clang::Parser::ParseCastExpression(bool, bool, clang::Parser::TypeCastState) + 83
28 clang-3.9                0x000000010c1fccba clang::Parser::ParseAssignmentExpression(clang::Parser::TypeCastState) + 250
29 clang-3.9                0x000000010c1fcb8f clang::Parser::ParseExpression(clang::Parser::TypeCastState) + 31
30 clang-3.9                0x000000010c25ec7c clang::Parser::ParseExprStatement() + 60
31 clang-3.9                0x000000010c25dbff clang::Parser::ParseStatementOrDeclarationAfterAttributes(llvm::SmallVector<clang::Stmt*, 32u>&, clang::Parser::AllowedContsructsKind, clang::SourceLocation*, clang::Parser::ParsedAttributesWithRange&) + 2751
32 clang-3.9                0x000000010c25cff8 clang::Parser::ParseStatementOrDeclaration(llvm::SmallVector<clang::Stmt*, 32u>&, clang::Parser::AllowedContsructsKind, clang::SourceLocation*) + 168
33 clang-3.9                0x000000010c264b1a clang::Parser::ParseCompoundStatementBody(bool) + 1322
34 clang-3.9                0x000000010c265740 clang::Parser::ParseFunctionStatementBody(clang::Decl*, clang::Parser::ParseScope&) + 496
35 clang-3.9                0x000000010c286462 clang::Parser::ParseFunctionDefinition(clang::ParsingDeclarator&, clang::Parser::ParsedTemplateInfo const&, clang::Parser::LateParsedAttrList*) + 3922
36 clang-3.9                0x000000010c1bbdf5 clang::Parser::ParseDeclGroup(clang::ParsingDeclSpec&, unsigned int, clang::SourceLocation*, clang::Parser::ForRangeInit*) + 1061
37 clang-3.9                0x000000010c2854d5 clang::Parser::ParseDeclOrFunctionDefInternal(clang::Parser::ParsedAttributesWithRange&, clang::ParsingDeclSpec&, clang::AccessSpecifier) + 1397
38 clang-3.9                0x000000010c284b65 clang::Parser::ParseDeclarationOrFunctionDefinition(clang::Parser::ParsedAttributesWithRange&, clang::ParsingDeclSpec*, clang::AccessSpecifier) + 197
39 clang-3.9                0x000000010c2842ed clang::Parser::ParseExternalDeclaration(clang::Parser::ParsedAttributesWithRange&, clang::ParsingDeclSpec*) + 3981
40 clang-3.9                0x000000010c283315 clang::Parser::ParseTopLevelDecl(clang::OpaquePtr<clang::DeclGroupRef>&) + 1061
41 clang-3.9                0x000000010c1a24ee clang::ParseAST(clang::Sema&, bool, bool) + 766
42 clang-3.9                0x000000010b1d548f clang::ASTFrontendAction::ExecuteAction() + 511
43 clang-3.9                0x000000010ac34bcb clang::CodeGenAction::ExecuteAction() + 6043
44 clang-3.9                0x000000010b1d49b8 clang::FrontendAction::Execute() + 120
45 clang-3.9                0x000000010b117bc7 clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) + 1847
46 clang-3.9                0x000000010b268466 clang::ExecuteCompilerInvocation(clang::CompilerInstance*) + 4870
47 clang-3.9                0x000000010944164a cc1_main(llvm::ArrayRef<char const*>, char const*, void*) + 4986
48 clang-3.9                0x0000000109431011 ExecuteCC1Tool(llvm::ArrayRef<char const*>, llvm::StringRef) + 481
49 clang-3.9                0x000000010942eb23 main + 3283
50 libdyld.dylib            0x00007fff86f785c9 start + 1
51 libdyld.dylib            0x0000000000000024 start + 2030598748
Stack dump:
0.  Program arguments: /usr/local/llvm-nyuzi/bin/clang-3.9 -cc1 -triple nyuzi-none-none -emit-obj -disable-free -main-file-name thread.c -mrelocation-model static -mthread-model posix -fmath-errno -masm-verbose -mconstructor-aliases -munwind-tables -nostdsysteminc -fuse-init-array -target-cpu nyuzi -target-linker-version 241.9 -momit-leaf-frame-pointer -dwarf-column-info -debugger-tuning=gdb -O3 -ferror-limit 19 -fmessage-length 80 -fobjc-runtime=gcc -fdiagnostics-show-option -fcolor-diagnostics -x c thread-8755e8.c 
1.  thread-8755e8.c:6:18: current parser token ')'
2.  thread-8755e8.c:5:1: parsing function body 'foo'
3.  thread-8755e8.c:5:1: in compound statement ('{}')
./thread-8755e8.sh: line 4: 88650 Illegal instruction: 4  "/usr/local/llvm-nyuzi/bin/clang-3.9" "-cc1" "-triple" "nyuzi-none-none" "-emit-obj" "-disable-free" "-main-file-name" "thread.c" "-mrelocation-model" "static" "-mthread-model" "posix" "-fmath-errno" "-masm-verbose" "-mconstructor-aliases" "-munwind-tables" "-nostdsysteminc" "-fuse-init-array" "-target-cpu" "nyuzi" "-target-linker-version" "241.9" "-momit-leaf-frame-pointer" "-dwarf-column-info" "-debugger-tuning=gdb" "-O3" "-ferror-limit" "19" "-fmessage-length" "80" "-fobjc-runtime=gcc" "-fdiagnostics-show-option" "-fcolor-diagnostics" "-x" "c" "thread-8755e8.c"

This may be a bug that was pulled in from the last upstream integration. Retest after the next one.