lensplaysgames / lensorcompilercollection

A compiler we made just for fun :^)

License: MIT License

CMake 1.19% C 2.07% Emacs Lisp 1.57% Vim Script 0.27% C++ 94.90%

first-class-functions programming-language static-typed compiler compiler-design compiler-optimization

lensorcompilercollection's Introduction

Lensor Compiler Collection

Now written in C++, LCC started as a compiler written in C for just one hobby language, Intercept. Over the course of a few years, it has grown into a compiler collection, with a whole host of frontends created and maintained by the growing and thriving community.

Implemented Languages

Intercept (WIP)
Laye (WIP)

In the future, we hope to support

some of C
YOUR language :)

Usage

For convenience purposes, there is a single executable, lcc, that can delegate between all of the different compilers in the collection. This is called the compiler driver, often shortened to just driver.

Running the driver executable with no arguments will display a usage message that contains compiler flags and options as well as command layout.

The driver uses the file extension to determine which compiler to pass the source code to. For example, lcc ./tst/intercept/byte.int will invoke the Intercept compiler, while lcc ./tst/laye/exit_code.laye will invoke the Laye compiler.
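As a rough illustration of extension-based delegation, here is a minimal C++ sketch. The `Frontend` enum and the `frontend_for` function are hypothetical names made up for this example; the real driver may be structured quite differently. Only the `.int` and `.laye` extensions are taken from the paths shown above.

```cpp
#include <filesystem>
#include <optional>
#include <string>

// Hypothetical frontend identifiers; not LCC's actual types.
enum class Frontend { Intercept, Laye };

// Pick a frontend from the source file's extension.
// Returns std::nullopt when the extension is not recognised.
std::optional<Frontend> frontend_for(const std::filesystem::path& source) {
    const std::string ext = source.extension().string();
    if (ext == ".int") return Frontend::Intercept;
    if (ext == ".laye") return Frontend::Laye;
    return std::nullopt;
}
```

A driver built this way would print the usage message (or an error) when `frontend_for` returns no value.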

Building

Dependencies:

CMake >= 3.20
Any C++ Compiler (We like GCC)

NOTE: If on Windows and using Visual Studio, see this document instead.

First, generate a build tree using CMake.

cmake -B bld

Finally, build an executable from the build tree.

cmake --build bld

To build generated `.s` x86_64 ASM

To use external calls, link with appropriate libraries!

GNU Binutils:

as code.S -o code.o
ld code.o -o code

GNU Compiler Collection

gcc code.S -o code

LLVM/Clang

clang code.S -o code --target=x86_64

To build generated `.ll` LLVM

Use LLVM’s clang to compile the generated LLVM output into a library or executable.

clang code.ll -o code


lensorcompilercollection's Issues

SysV aggregates aren't handled properly

When using the SysV calling convention, a lot of the tests regarding aggregate parameters or returning aggregate types are failing.

SysV:

         33 - ASM_/mnt/d/Programming/strema/2023/Intercept/tst/tests/call_return_value_as_lvalue.int (Failed)
         34 - OBJ_/mnt/d/Programming/strema/2023/Intercept/tst/tests/call_return_value_as_lvalue.int (Failed)
        113 - ASM_/mnt/d/Programming/strema/2023/Intercept/tst/tests/function_arguments_many.int (Failed)
        114 - OBJ_/mnt/d/Programming/strema/2023/Intercept/tst/tests/function_arguments_many.int (Failed)
        179 - ASM_/mnt/d/Programming/strema/2023/Intercept/tst/tests/overload-block.int (Failed)
        180 - OBJ_/mnt/d/Programming/strema/2023/Intercept/tst/tests/overload-block.int (Failed)
        195 - ASM_/mnt/d/Programming/strema/2023/Intercept/tst/tests/passing_weird_bytearray.int (Failed)
        196 - OBJ_/mnt/d/Programming/strema/2023/Intercept/tst/tests/passing_weird_bytearray.int (Failed)
        203 - ASM_/mnt/d/Programming/strema/2023/Intercept/tst/tests/return_array_literal.int (Failed)
        204 - OBJ_/mnt/d/Programming/strema/2023/Intercept/tst/tests/return_array_literal.int (Failed)
        205 - ASM_/mnt/d/Programming/strema/2023/Intercept/tst/tests/return_arrays.int (Failed)
        206 - OBJ_/mnt/d/Programming/strema/2023/Intercept/tst/tests/return_arrays.int (Failed)

MS x64:

         33 - ASM_D:/Programming/strema/2023/Intercept/tst/tests/call_return_value_as_lvalue.int (Failed)
         34 - OBJ_D:/Programming/strema/2023/Intercept/tst/tests/call_return_value_as_lvalue.int (Failed)
        179 - ASM_D:/Programming/strema/2023/Intercept/tst/tests/overload-block.int (Failed)
        180 - OBJ_D:/Programming/strema/2023/Intercept/tst/tests/overload-block.int (Failed)

As you can see, SysV has a lot more failing tests. This is mostly because I am currently developing on Windows, which makes the MS x64 calling convention easier for me to test and debug.

[FEATURE] Add a warning for the common error of `a = b` assignment

state : integer = 1
while rowcount {
  display_state(state)
  state = calculate_state(state)
  rowcount := rowcount - 1
}

If you can spot what's wrong with the above code right away, good for you. For most of us, the difference between = and := is so small that we often tend to overlook it, and it can be very confusing when it comes time to run the code and things are broken for seemingly no reason. So, while the above is a properly formed program, it doesn't do what the programmer intends. This seems like a perfect use case for emitting a warning.

Btw, the code should be this

state : integer = 1
while rowcount {
  display_state(state)
  state := calculate_state(state)
  rowcount := rowcount - 1
}

Note the very relevant state := ... instead of state = ..., which properly assigns to the variable (as intended) vs just leaving the result of a comparison unused.

At first, we could just emit a warning at any comparison of which the result is unused, and it'd be good enough.
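A minimal sketch of that first version of the check, in C++. The `Kind` and `Expr` types are a heavily simplified, hypothetical AST invented for this example, and the sketch assumes that only the last expression of a block has its value used.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified AST node; the compiler's real node types differ.
enum class Kind { Assign, Compare, Call, Other };

struct Expr {
    Kind kind;
    std::string text;  // Source text, used only to build the message.
};

// Return one warning per comparison whose result is discarded.
// Assumption: every expression in a block except the last is unused.
std::vector<std::string> warn_unused_comparisons(const std::vector<Expr>& block) {
    std::vector<std::string> warnings;
    for (std::size_t i = 0; i + 1 < block.size(); ++i) {
        if (block[i].kind == Kind::Compare) {
            warnings.push_back("Warning: result of comparison '" + block[i].text +
                               "' is unused; did you mean ':='?");
        }
    }
    return warnings;
}
```

Run over the body of the first loop above, this would flag `state = calculate_state(state)` and nothing else.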

Diag doesn't print location if token is EOF

Check is in location.cc:8

Optimiser yeets referenced global variables

https://github.com/LensPlaysGames/FUNCompiler/blob/c9bb7543562bd04e747525fd122198f42e43ca0e/src/codegen/x86_64/arch_x86_64.c#L1598

This should also check for global store/load.

[FEATURE] Better Overloaded Function Name Mangling

The current syntax is exactly the same as the return value of ast_typename(), except all invalid characters in assembly are updated to _ ... Obviously, this is not ideal. In fact, it kind of breaks one of the main points of name mangling, in that the parameter types should be inferrable from the mangled identifier. However, by replacing all of @, (, ), and with just _, it means that a lot of data is lost. I think a better route would be to have a new function in the API that all backends can use to get a simplified ascii representation of a function type or IRFunction (mangle it).

<mangled_name>      ::= <name_length> <name> { <param_type_length> <param_type> }

<name_length>       ::= NUMBER
<param_type_length> ::= NUMBER

<name>              ::= IDENTIFIER
<param_type>        ::= IDENTIFIER

I think the following syntax would make sense, and just implementing it would restore the "two way street" feature that is necessary if we ever want to actually be able to call things from external symbols compiled from the same language.

// NOTE: <type-length> is a length that extends until the first <> marker (its only meaning)

<mangled-function>       ::= "_X" <func-id-length> <func-id> <mangled-name>;

<mangled-name>           ::= <type-length> <type>;

<type>                   ::= <base-type> | <named-type> | <derived-type>;
<base-type>              ::= "integer" | "byte" | "void";
<named-type>             ::= IDENTIFIER;
<derived-type>           ::= <pointer-type> | <array-type> | <function-type>;
<pointer-type>           ::= <type> <> "P";
<array-type>             ::= <type> <> "A"  <array-length> <array-base-type>;
<function-type>          ::= "_F" ( <function-return-type> { <function-param-type> } );

<type-length>            ::= <length>;
<func-id-length>         ::= <length>;
<array-length>           ::= <length>;
<length>                 ::= NUMBER;

<func-id>                ::= IDENTIFIER;

<array-base-type>        ::= <mangled-name>;

<function-return-type>   ::= <mangled-name>;
<function-param-type>    ::= <mangled-name>;

Where:

P :: Pointer to. One more level of pointer indirection on the type just parsed.
A length elem-type :: Array, with length and element type

The grammar is a bit odd, but it should be okay because everything gives its length first. So "P" and "A..." come after the length.

It does inhibit the user from creating functions called _F*, but I don't think that's much of an issue. We can always use __F or something more obscure if we really need.
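To show why length prefixes give the "two way street" property, here is a small C++ sketch of only the simpler `<name_length> <name> { <param_type_length> <param_type> }` form from the first grammar above. The `mangle` and `demangle` functions are hypothetical names for this example and are not part of the compiler's API. Pointer, array, and function types are deliberately left out.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// <mangled_name> ::= <name_length> <name> { <param_type_length> <param_type> }
std::string mangle(const std::string& name, const std::vector<std::string>& param_types) {
    std::string out = std::to_string(name.size()) + name;
    for (const std::string& type : param_types) {
        out += std::to_string(type.size()) + type;
    }
    return out;
}

// Inverse of mangle(): returns {name, param types...}, or an empty vector on
// malformed input. This relies on identifiers never starting with a digit.
std::vector<std::string> demangle(const std::string& mangled) {
    std::vector<std::string> parts;
    std::size_t i = 0;
    while (i < mangled.size()) {
        std::size_t length = 0;
        std::size_t digits = 0;
        while (i < mangled.size() && std::isdigit(static_cast<unsigned char>(mangled[i]))) {
            length = length * 10 + static_cast<std::size_t>(mangled[i] - '0');
            ++i;
            ++digits;
        }
        if (digits == 0 || length == 0 || length > mangled.size() - i) return {};
        parts.push_back(mangled.substr(i, length));
        i += length;
    }
    return parts;
}
```

Because every component announces its own length, the parameter types can be read back out of the symbol without any separator characters.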

[KNOWN] IR parser and IR backend are severely outdated

Development has happened really fast, and both of these have fallen out of date with the current changes; they've been minimally patched to allow everything to compile, but it's mostly code that has been neglected. It would be great to see it back.

The major change would be to update the IR grammar to support types

Incorrect reported error location for initialiser type errors

update_state : integer(state : @integer) {
  newstate : integer = state
  ...
}

the above code produces the following error

examples\rule110.un:28:30: Error: Type "@integer" is not convertible to "integer"
 28 | update_state : integer(state : @integer) {
    |                               ~~~~~~~~~

whereas we'd expect

examples\rule110.un:29: Error: Type "@integer" is not convertible to "integer"
 29 | newstate : integer = state
    |                      ~~~~~

[Intercept] Reference parameters aren't working

;; 42

a : int = 69;

foo : void(x : &int) {
  x := 42;
};

foo a;
a; ;; should return 42

This currently returns 69 both through the assembly backend and the LLVM one. Likely an issue during IR generation.

ICE on, in my humble opinion, valid code

This

Foo:> type {
    a: u32
    c: @byte
}

make_foo : Foo() {
    foo : Foo
    foo
}

throws

codegen/x86_64/arch_x86_64_common.c:125: Internal Compiler Error: Register size can not be converted into name on x86_64: 16
  in function ??? at offset 0x637b5
  in function femit_mem_to_reg():?
  in function ??? at offset 0x6846d
  in function ??? at offset 0x632f7
  in function ??? at offset 0xbce4
  in function ??? at offset 0xca03
  in function ??? at offset 0x10386
  in function dom_dfs():?

[FEATURE] Types in the IR

It seems like all we have to do is "pass through" expr->type, during codegen, into expr->ir->type. The issue of this happening everywhere is the hardest part, though, because it's currently designed in a way where expr->ir is set and then return happens immediately, but not in every case. So we could set it where codegen_expr returns, but then we may miss some instructions that were generated in between.

The other interesting thing: if an IR instruction's type can be understood from only its arguments/children (like a call from the callee), then we can simply do the assignment within the creation function of the IR instruction, much like we mark uses.

So, really, we just have to figure out the best way to actually assign expr->type to expr->ir->type within codegen_expr().
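The "assign the type inside the creation function" idea could look roughly like the following C++ sketch. `Type`, `IRFunction`, `IRInstruction`, and `ir_call` are hypothetical stand-ins invented for this example, not the compiler's real data structures.

```cpp
#include <memory>
#include <string>

// Hypothetical minimal IR; names are illustrative only.
struct Type {
    std::string name;
};

struct IRFunction {
    std::string name;
    const Type* return_type;
};

struct IRInstruction {
    std::string opcode;
    const Type* type = nullptr;
};

// Creation function for a call. The instruction's type is fully determined
// by the callee, so it is set here, in the same place uses would be marked,
// rather than in every codegen path that creates a call.
std::unique_ptr<IRInstruction> ir_call(const IRFunction& callee) {
    auto call = std::make_unique<IRInstruction>();
    call->opcode = "call";
    call->type = callee.return_type;
    return call;
}
```

Instructions whose type cannot be derived from their operands would still need the type passed through from the expression.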

IR doesn't generate implicit return for void functions

Compiling this

foo : void() {

};

0;

generates this

The generated IR is missing a return for the void function. As a result, no return is emitted when generating LLVM IR, and LLVM IR without the return is ill-formed.
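One possible shape for a fix, sketched in C++ over a made-up IR: `Block`, `Function`, `is_terminator`, and `add_implicit_return` are all hypothetical names, and instructions are plain strings to keep the example short. The sketch only looks at the last block, which is a simplification.

```cpp
#include <string>
#include <vector>

// Hypothetical IR shapes for illustration only.
struct Block {
    std::vector<std::string> instructions;
};

struct Function {
    bool returns_void;
    std::vector<Block> blocks;
};

bool is_terminator(const std::string& instruction) {
    return instruction == "ret" || instruction == "br" || instruction == "unreachable";
}

// Append a return to a void function whose last block does not already end
// in a terminator. Returns true if a return was inserted.
bool add_implicit_return(Function& function) {
    if (!function.returns_void) return false;
    if (function.blocks.empty()) function.blocks.emplace_back();
    Block& last = function.blocks.back();
    if (!last.instructions.empty() && is_terminator(last.instructions.back())) return false;
    last.instructions.push_back("ret");
    return true;
}
```

For the empty `foo` above, this would leave a single block containing just a return.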

We should check for functions without a body

I don't know if Intercept allows for predeclared functions (it wouldn't make much sense, as they are order-independent anyway), but if you try to declare one:
foo : u32();
it runs into an assert in the parseBlock function.

Arrays

Arrays compile, but they really do not work, for some reason.

17c4d1b added a test for them that now fails.

Redefinition of variable is not caught by compiler

; ERROR

a : integer
a : integer

The test above (included in the repo at tst/tests/redef.un) fails (as in, it doesn't error even though it should). The only error comes from gcc/as complaining about multiple definition.

This bug was introduced when function overloading was, as symbols are no longer duplicate checked when being added to a scope, IIRC.
EDIT: 95a65ab

To fix this, we will have to do a pass on the AST during semantic analysis ensuring that each variable is only defined once, OR we could handle function symbols specially and add back in the functionality of checking duplicates, with the caveat of functions... It's a bit complex, but shouldn't be too hard to clear up.
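The second option (check duplicates again, but treat functions specially) might be sketched like this in C++. `Scope`, `SymbolKind`, and `declare` are hypothetical names for this example. The sketch does not compare overload signatures, which a real check would also need to do.

```cpp
#include <map>
#include <string>
#include <vector>

enum class SymbolKind { Variable, Function };

// Hypothetical scope that rejects duplicate symbols but permits several
// functions to share one name, so that overloading keeps working.
class Scope {
public:
    // Returns false if declaring `name` would be a redefinition.
    bool declare(const std::string& name, SymbolKind kind) {
        std::vector<SymbolKind>& existing = symbols_[name];
        for (SymbolKind previous : existing) {
            const bool overload =
                previous == SymbolKind::Function && kind == SymbolKind::Function;
            if (!overload) return false;
        }
        existing.push_back(kind);
        return true;
    }

private:
    std::map<std::string, std::vector<SymbolKind>> symbols_;
};
```

With this, the second `a : integer` in the test would be rejected, while two functions named `foo` would still be accepted.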

Scopes seem to be broken

foo: struct {
    bar: u32
};

bar: void() {
    test : foo;
};

0;

throws an error

while

foo: struct {
    bar: u32
};

test : foo;

0;

doesn't

MrMugame said this causes an ICE

Code:

Foo:> type {
    a: @byte
    b: u8
}

bar : Foo
bar.a := "Hello World!"[0]

bar.b

Output:

codegen/x86_64/arch_x86_64_tgt_assembly.c:789: Internal Compiler Error: [x86_64/CodeEmission]: Unhandled instruction, sorry
  in function ??? at offset 0x27b4a
  in function __libc_start_main() at offset 0x8b

Scopes permit redefinition of built-in types

{
    integer :: 69
    a : integer
}

In the example above line №2 is completely valid and the error only occurs on line №3:

i.int:3:8: Error: 'integer' is not a type!
 3 |     a : integer
   |         ~~~~~~~

Incorrect error message regarding function return type

In a test soon to be committed to the IR types branch, the following error occurs.

tst\tests\return_arrays.un:3:0: Error: Type '@integer[2]' of function body is not convertible to return type 'integer'.
 3 | foo : @integer[2]() {
   | ~~~

; 69

foo : integer[2]() {
  out : integer[2]
  @out[0] = 42
  @out[1] = 69
}

@(foo()[1])

Given this test, it would seem that the compiler is mixing up the return type and the type of the function body. Note that the test itself is missing the trailing out expression at the end of the function body.

Member access is not properly working

This code

foo: struct {
    bar: u32
};

i : foo;
i.bar := 2;

0;

results in this error

which it probably shouldn't.

IRGen is not handling ifExpr with no else

Title

ASSERT macro issue?

The source line in question:

    ASSERT(iterator, "Invalid or mishapen AST.");

Looks like the assert macro is off by one in the output message, somehow.

Variables with the same name as registers break Intel assembly

rax :: 69
putchar : ext integer(ch : integer)
putchar(rax)

When compiling with --dialect intel no warnings/errors are produced and the following code is generated:

.section .data
rax: .byte 69,0,0,0,0,0,0,0
.intel_syntax noprefix
.section .text

.global main
.global putchar

main:
    push rbp
    mov rbp, rsp
    sub rsp, 0
.L0:
    lea rax, [rip + rax]
    mov rdi, [rip + rax]
    sub rsp, 8
    push rcx
    push rsi
    push rdi
    call putchar
    pop rdi
    pop rsi
    pop rcx
    add rsp, 8
    mov rcx, rax
    mov rax, rcx
    mov rsp, rbp
    pop rbp
    ret

Same as in #46, the issue only occurs during assembly:

code.S: Assembler messages:
code.S:14: Error: `[rip+rax]' is not a valid base/index expression
code.S:15: Error: `[rip+rax]' is not a valid base/index expression

Function type parameter breaks overloaded function resolution?

The test located at tst/tests/downwards_funarg.un doesn't pass; it errors with the following output:

tests/downwards_funarg.un:7:0: Error: Could not resolve overloaded function.
 7 | foo(sixty_nine)
   | ~~~

Most likely, there is a bug related to unresolved functions as arguments to an unresolved function (a case that requires special handling).

If a compiler stage throws an error, the compiler just continues, causing a segfault.

The compiler should just exit. The Laye frontend already does this; I wonder why the Intercept one doesn't...

IR_PARAMETER causes crash with `-f ir` command line flags

https://github.com/LensPlaysGames/FUNCompiler/blob/103aee3d2d27aeee5594e2e9a93475fbf4d7dda7/src/codegen/intermediate_representation.c#L257-L259

Is this still correct? Do parameters still reference in this way? I'm not thinking too hard about it right now, but this seems like a result of changing PARAMETER_REFERENCE into PARAMETER.

Scuffed error messages when forgetting a semicolon

foo : u8[10]

@foo[3] := 1;

If you leave out the semicolon on the first line, the parser tries to parse the second line as an argument to the first, which just doesn't make sense.
Maybe the compiler should have some flag disallowing the parameter parsing for declarations.

"Silent" error when using variable before its definition

I'm not sure what's going wrong here, but when using a variable as if it's been defined, even though it hasn't, there are no user-facing errors other than a return code of 2...

foo : integer(state : integer) {
  idx := 0
  while idx < colcount {
    ;;putchar(
    ;;  if state & (1 << idx) {
    ;;    42
    ;;  } else {
    ;;    32
    ;;  }
    ;;)
    if state & (1 << idx) {
      putchar(42)
    } else {
      putchar(32)
    }
    idx := idx + 1
  }
  putchar(10)
}

idx : integer = 0
foo(1)

NOTE: No output file is ever written. It looks like we are exiting from https://github.com/LensPlaysGames/FUNCompiler/blob/17c4d1bc8f45f093b10638423044ebd0748b5256/src/main.c#L304 silently, which is very not ideal.

Sema fails to evaluate cast properly

42 as u8;

throws

Internal Compiler Error: Assertion failed: "ok()" in /home/dani/Workdir/Intercept/lib/intercept/eval.cc at line 5.
Message: Cannot evaluate ill-formed or unchecked expression

Return value of call can not be treated as lvalue

examples/vec2.int fails

vec2 :> type {
  x : integer
  y : integer
}

vec2_add : vec2(a : vec2, b : vec2) {
  out : vec2
  out.x := a.x + b.x
  out.y := a.y + b.y
  out
}

a : vec2
a.x := 34
b : vec2
b.y := 420
b.x := 35

vec2_add(a, b).y

D:/Programming/strema/2023/Intercept/src/codegen.c: In function ‘codegen_lvalue’
D:/Programming/strema/2023/Intercept/src/codegen.c:137: Internal Compiler Error: Unhandled node kind 6

[MINOR] Terminal color is not reset to default after printing AST

When passing --print-ast to the compiler, it prints the AST in color. This is great; however, it does not reset the color back to default, which means the subsequent prompt is of a random color (usually all blue because of types). To fix this, we literally just have to print \033[m at the end of wherever we print the AST in color.
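A small C++ sketch of the idea: wrap colored output in a helper that always appends the reset sequence. The `colored` helper and the constant names are hypothetical and made up for this example; the escape sequences themselves are standard ANSI SGR codes, where `\033[m` is equivalent to `\033[0m`.

```cpp
#include <string>

// ANSI SGR escape sequences have the form ESC [ <code> m.
constexpr const char* kBlue = "\033[34m";
constexpr const char* kReset = "\033[m";  // Resets all attributes.

// Hypothetical helper: color a piece of text and always reset afterwards,
// so the terminal never stays in the last color that was used.
std::string colored(const std::string& text, const char* color) {
    return std::string(color) + text + kReset;
}
```

Printing every colored fragment through one such helper means the reset cannot be forgotten at the end of the AST dump.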

Inverted check in Sema

Check should probably be if it isn't a struct
in Intercept/sema.cc:745

[KNOWN] Garbage output for stackframe in compiler errors and such

D:/Programming/strema/2022/FUNCompiler/src/codegen/x86_64/arch_x86_64.c: In function ‘regsize_from_bytes’
D:/Programming/strema/2022/FUNCompiler/src/codegen/x86_64/arch_x86_64.c:207: Internal Compiler Error: Byte size can not be converted into register size on x86_64: 32
  in function `�9 R��() at offset 7f007e007d007c
  in function `�9 R��() at offset 7f007e007d007c
  in function `�9 R��() at offset 7f007e007d007c
  in function `�9 R��() at offset 7f007e007d007c
  in function `�9 R��() at offset 7f007e007d007c
  in function `�9 R��() at offset 7f007e007d007c
  in function `�9 R��() at offset 7f007e007d007c
  in function `�9 R��() at offset 7f007e007d007c
  in function `�9 R��() at offset 7f007e007d007c
  in function `�9 R��() at offset 7f007e007d007c
  in function BaseThreadInitThunk() at offset 7fff05cd74a0
  in function RtlUserThreadStart() at offset 7fff05e82680

I think it speaks for itself.

[WIP] [FEATURE] References

Operator overloading has been planned from the start. However, I didn't realise that operator overloading kind of inherently requires references, at least of some kind. If we're going to implement references of some kind, might as well just implement references, right? A reference is a non-null pointer, effectively. It seems like having to deal with less null checking is a good thing. So if we can write most of our code with references, and have pointers/by value be the explicit way, maybe that'd be good?

Open for discussion, suggestions, and fixes

Potentially incorrect codegen of variable reassignment

This is in regard to the following check
https://github.com/LensPlaysGames/FUNCompiler/blob/061ede20453730e12b935584d52713ebbb34b622/src/codegen.c#L127-L129

We typically don’t want to emit the same expr twice what with SSA and all that. However, there is an exception: If a variable declaration or the same variable reference is used twice, we need to emit the load twice because the variable’s value may have changed in between.

I’m not sure if this really is an issue, but we should probably add a test for that (e.g. make a test that reassigns a variable a couple of times and see if it’s correct). This may not be a problem in our implementation (because we create a new varref node every time the variable name is used iirc), but it’s just something that came to mind and that I wanted to mention just in case because I feel like it might warrant at least some investigation.

Variables that start with '$' break generated assembly

$ : integer = 36
putchar : ext integer(ch : integer)
putchar($)

The code example above produces no compiler errors/warnings and after codegen looks like this:

.section .data
$: .byte 36,0,0,0,0,0,0,0
.section .text

.global main
.global putchar

main:
    push %rbp
    mov %rsp, %rbp
    sub $0, %rsp
.L0:
    lea $(%rip), %rax
    mov $(%rip), %rdi
    sub $8, %rsp
    push %rcx
    push %rsi
    push %rdi
    call putchar
    pop %rdi
    pop %rsi
    pop %rcx
    add $8, %rsp
    mov %rax, %rcx
    mov %rcx, %rax
    mov %rbp, %rsp
    pop %rbp
    ret

After running gcc code.S the following error is shown:

code.S: Assembler messages:
code.S:13: Error: illegal immediate register operand (%rip)
code.S:14: Error: illegal immediate register operand (%rip)

EDIT: updated the output to be in sync with that of the latest commit

MSVC (confusingly) defines `_MSC_VER`, not `_MSVC_VER`

https://github.com/LensPlaysGames/FUNCompiler/blob/c9bb7543562bd04e747525fd122198f42e43ca0e/src/opt.c#L9

README: dependencies issue with linux

When trying to run cmake -B bld it will complain about ERROR: MISSING PROGRAM! Could not find a68g Algol 68 Genie Interpreter, test target has not been generated. See README in tst subdirectory.

Found information on how to build it myself since it is not available in Ubuntu 22.04.2 repos.

Parser can't parse cast of a function call

foo : u32() {
    69;
};

bar : @u32 = (foo() as! @u32);

0;

So this code throws an error while parsing, because it can't parse the cast properly after a function call

This happens because, in the parser, postfix operators like "as" are parsed (l.617) before empty parens are parsed into a call (l.665).
(Without the parens around the cast, the vardecl itself is gonna be cast, so it's gonna be even more useless.)

Clang doesn’t accept AT&T assembly without proper suffixes

This is what happens if you try to compile the ASM generated for the SDL example with both GCC and Clang:

GCC only issues a warning about the missing suffix, whereas Clang outright rejects it. That’s because whereas GCC quite literally just invokes GNU as, Clang seems to have its own assembler backend (cc1as). Renaming the file from code.S to code.s doesn’t change anything either, which means that Clang always invokes its own assembler and never GNU as, even if no preprocessing is required.

It’s also worth noting that, seeing as it only starts complaining on line 530, it seems to have no problem with e.g. push %rax instead of pushq %rax, which occurs several times much earlier in the file. In other words, it seems to me that a suffix may only be required when there is a memory operand. At the same time, if we’re already emitting suffixes in some cases, we might as well just do that in any case; otherwise, we’ll probably end up missing some cases that we’ll then have to fix later on...
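Always emitting a suffix could be as simple as deriving it from the operand size. The following C++ sketch uses hypothetical helper names (`att_suffix`, `with_suffix`) that are not taken from the backend; the size-to-suffix mapping itself is the standard AT&T one for integer instructions. It is only meant for mnemonics that take an operand-size suffix, so instructions such as ret are out of scope.

```cpp
#include <optional>
#include <string>

// AT&T operand-size suffix for an integer operand of the given byte size.
std::optional<char> att_suffix(unsigned size_in_bytes) {
    switch (size_in_bytes) {
        case 1: return 'b';  // byte
        case 2: return 'w';  // word
        case 4: return 'l';  // long (doubleword)
        case 8: return 'q';  // quadword
        default: return std::nullopt;
    }
}

// Append the suffix to a mnemonic; leave it unchanged for unknown sizes.
std::string with_suffix(const std::string& mnemonic, unsigned size_in_bytes) {
    const std::optional<char> suffix = att_suffix(size_in_bytes);
    return suffix ? mnemonic + *suffix : mnemonic;
}
```

With this, `push %rax` would be emitted as `pushq %rax`, and a memory-operand `mov` would carry an explicit size regardless of which assembler consumes it.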

[OPT] `examples/rule110.un` breaks with `-O`

$ cmake --build bld --clean-first ; bld\func examples\rule110.un --debug-ir -v ; gcc code.S ; a.exe ; echo ; echo $env.LAST_EXIT_CODE
...
Generated code at output filepath "code.S"
*
***
* **
*****
*   **
**  ***
*** * **
* *******
***     **
* **    ***
*****   * **
*   **  *****
**  *** *   **
*** * ****  ***
* *****  ** * **
***   ** ********
* **  ****      **
***** *  **     ***
*   **** ***    * **
**  *  *** **   *****
*** ** * *****  *   **
* ********   ** **  ***
***      **  ****** * **
* **     *** *    *******
*****    * ****   *     **
*   **   ***  **  **    ***
**  ***  * ** *** ***   * **
*** * ** ****** *** **  *****
* ********    *** ***** *   **
***      **   * ***   ****  ***
* **     ***  *** **  *  ** * *
*****    * ** * ***** ** ******
*   **   ********   ******    *
**  ***  *      **  *    **   *
*** * ** **     *** **   ***  *
* **********    * *****  * ** *
***        **   ***   ** ******
* **       ***  * **  ****    *
*****      * ** ***** *  **   *
*   **     ******   **** ***  *
**  ***    *    **  *  *** ** *

0

This is the correct output. However, when adding -O, here is the output...

$ cmake --build bld --clean-first ; bld\func examples\rule110.un --debug-ir -v -O ; gcc code.S ; a.exe ; echo ; echo $env.LAST_EXIT_CODE
...
Generated code at output filepath "code.S"
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *
     *

0

Incorrect error when using a variable that hasn't been defined

I get the following error when a variable is used but never defined.

examples\rule110.un:6:14: Error: Could not resolve overloaded function.
 6 |   while idx < colcount {
   |               ~~~~~~~~

Obviously, this isn't ideal, as we probably shouldn't assume this is a function or should be one 🤔

ISel table currently does not support two-address instructions properly.

If I’m not mistaken, there is currently a fundamental issue with the ISel table patterns for two-address instructions (e.g. add). I first ran into this problem when examining the program below, which should return 69, but returns 70.

In this IR, with a initialised to 34, the value of %9 should be 69 (note that %9 = %6 + %8, where %6 is the initial value of a (i.e. 34), and %8 is %6 + 1 (i.e. 35), so %9 = 34 + 35 = 69).

defun main (%0, %0, %0) global leaf nomangle {
bb1:
      %4 │         │ .ref b | @integer
      %5 │         │ .ref a | @integer
      %6 │         │ load %5 | integer
      %7 │         │ imm 1 | <integer_literal>
      %8 │         │ add %6, %7 | integer
         │         │ store into %4, %8 | void
      %9 │         │ add %6, %8 | integer
         │         │ ret %9 | void
}

However, the result of the load is clobbered by the first add in the backend:

r1 | mov "a" integer, r1 8 DEF ; <-- This value, loaded into r1,
v8 | add 1, r1 8 ; <-- is clobbered by the add here,
r2 | mov r1 8, r2 8 DEF
v5 | mov r2 8, "b" integer
v9 | add r2 8, r1 8 ; <-- but intended to be used here.
r1 | ret

The reason for this seems to be the following rule in the ISel table:

match
MIR_ADD i1(Register lhs, Register rhs)
emit {
  MX64_ADD(rhs, lhs)
  MX64_MOV(lhs, i1)
}

The fundamental problem here is that the add clobbers one of the operands before the result is moved into a new register. Presumably, this rule assumes that neither operand will be used after this add. This is often not true in optimised codegen. This ISel rule is correct, iff this add instruction is the last use of both lhs and rhs.

A while ago, I mentioned this case and that, if the operand to an instruction in a pattern happens to be used somewhere outside the pattern, the pattern can generally not be applied. I don’t know if we’re checking for that at the moment, but we should.

The most general way of emitting a two-register add correctly if the operands need to be preserved would be to move either operand into a new register and add the other operand into that new register. This means that, in this particular case, the pattern should work if we swap the order of the mov and add and add into i1, but I haven’t thought about this too much.

Irrespective of that, there are cases (such as that of adding an immediate and a register) where the ‘ideal’ codegen (add the immediate directly into the register, clobbering it, which is 1 add instruction) and the ‘operands-preserving’ codegen (move either operand into a new register and add the other one, preserving the register operand, which is 2 instructions, 1 mov + 1 add) are not the same.

This means that, for those cases, we need both ‘ideal’ patterns that are applied iff the operands of all instructions in the pattern are not used outside the pattern (however, it’s ok if an operand of one instruction in the pattern is used by another instruction in the same pattern), as well as ‘fallback’ patterns that must not clobber any of their operands.

Side note: Yes, for add in particular, we can also use lea instead to get around this, but this problem persists in all the other two-address instructions that clobber one of their operands.
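To make the difference concrete, here is a tiny C++ simulation of the two lowerings on the example IR above. Everything in it (the `Registers` map, `mov`, `add`, and the two `lower_add_*` functions) is a hypothetical model written for this illustration, not backend code. It assumes the result register is fresh, i.e. distinct from both operands.

```cpp
#include <cstdint>
#include <map>
#include <string>

using Registers = std::map<std::string, std::int64_t>;

// Two-address operations: the destination of add is also a source operand.
void mov(Registers& r, const std::string& src, const std::string& dst) { r[dst] = r[src]; }
void add(Registers& r, const std::string& src, const std::string& dst) { r[dst] += r[src]; }

// The current rule: MX64_ADD(rhs, lhs) then MX64_MOV(lhs, i1). Clobbers lhs.
void lower_add_clobbering(Registers& r, const std::string& lhs, const std::string& rhs,
                          const std::string& result) {
    add(r, rhs, lhs);
    mov(r, lhs, result);
}

// Fallback: copy lhs into the result first, then add rhs into the copy.
// Assumes `result` is a fresh register, distinct from lhs and rhs.
void lower_add_preserving(Registers& r, const std::string& lhs, const std::string& rhs,
                          const std::string& result) {
    mov(r, lhs, result);
    add(r, rhs, result);
}

using Lowering = void (*)(Registers&, const std::string&, const std::string&,
                          const std::string&);

// %6 = 34 (load a), %7 = 1, %8 = add %6, %7, %9 = add %6, %8; returns %9.
std::int64_t run_example(Lowering lower_add) {
    Registers r{{"%6", 34}, {"%7", 1}};
    lower_add(r, "%6", "%7", "%8");
    lower_add(r, "%6", "%8", "%9");
    return r["%9"];
}
```

In this model, `run_example(lower_add_clobbering)` yields 70, reproducing the bug, while `run_example(lower_add_preserving)` yields the expected 69.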