dino's Introduction

Hi! I'm Sunjay. 👋


I'm a Rust programmer, compiler developer, and systems programmer who loves making games in my spare time. I teach people programming and game development at tech conferences and love creating resources to make hard technical topics accessible to a wider audience.

Follow me on Twitter: @sunjay03 🐦
Pronouns: he/him


dino's Issues

Add never type (!)

We should support empty/never types (see the Rust RFC and the Rust docs on the never type).

The implementation of the codegen for this will be interesting because you can have code like:

fn main() {
    let x: ! = return;
    let y: i32 = x;
}
  • return expressions should have type ! instead of ()
  • an infinite loop construct should be added that evaluates to ! unless there is a break inside of it

Recursive types

Given that we have a managed runtime, recursive types are actually fine. Everything is stored as a pointer. We should make sure to explicitly support recursive types.

struct Foo {
    foo: Self,
}

Question: even if we can represent this, is there actually any way to construct it? Maybe recursive enums are fine but structs still aren't? Do transitively recursive types matter?
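
For comparison, a recursive enum with a non-recursive variant can be constructed because the recursion has a base case. A sketch in planned dino syntax (enums here are assumed to work as in the coroutine notes below):

// This is dino syntax
enum List {
    // The non-recursive variant is what makes construction possible
    Nil,
    Cons(int, List),
}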

Other numeric literals

We should consider having some syntax for other numeric literals like hexadecimal, octal, binary, etc.
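
A sketch of what Rust-style prefixes could look like (the exact syntax is undecided; these literals are assumptions):

// This is hypothetical dino syntax
let mask = 0xFF;    // hexadecimal
let mode = 0o755;   // octal
let flags = 0b1010; // binary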

Dynamically-sized int

There is only one int type in this language. It would be neat to experiment with making it dynamically-sized so it can store any range of values. We can start with a pointer-sized integer and then use a bit flag to branch into using a dynamically sized array that represents a number. This would give us near optimal performance when using small numbers and still allow us to have arbitrary precision.

There is lots of prior work on this in Ruby's Fixnum. See also the various BigInt and big-num crates in the Rust ecosystem.

From my conversation with Alex:

Ruby's Fixnum is "if the integer is odd, it's an integer retrieved by performing signed shift right by 1" and "if the integer is even, it's a pointer"

  • bounds checking in numeric operations
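
A minimal Rust sketch of the Fixnum-style tagging described in the quote above, assuming 64-bit words and that heap pointers are always even due to alignment (Decoded and BigInt are made-up names):

// Decodes a tagged word: odd means an inline small integer, even means a
// pointer to a heap-allocated big integer.
enum Decoded {
    Small(isize),
    Big(*const BigInt),
}

// Placeholder for the arbitrary-precision representation
struct BigInt {
    digits: Vec<u64>,
}

fn decode(word: usize) -> Decoded {
    if word & 1 == 1 {
        // Odd: recover the integer with an arithmetic shift right by 1
        Decoded::Small((word as isize) >> 1)
    } else {
        // Even: the word is a pointer to a BigInt
        Decoded::Big(word as *const BigInt)
    }
}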

New Algorithm for Applying Equality Constraints During Unification

We should implement #11 before trying this.


Preprocessing step: Take all the A = B constraints and make lists of variables that are equal (essentially transitive closure)

  • Start with var_groups = HashMap<TyVar, usize>
  • For each A = B:
    • if A in var_groups and B in var_groups: merge the two groups
    • else if A in var_groups: set var_groups[B] to A's group
    • else if B in var_groups: set var_groups[A] to B's group
    • else: create a new group and set both A's group and B's group to that group
  • To merge two groups, set every item in B's group to be in A's group

Algorithm: For each list of variables, take the intersection of the set of types of each variable in the list. Then copy that set into each of those variables.
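
A minimal Rust sketch of the preprocessing step, assuming a simple TyVar alias and a group-id map (the real compiler types will differ):

use std::collections::HashMap;

type TyVar = usize;

// Groups type variables related by A = B constraints. Each variable maps to a
// group id; merging moves every member of B's group into A's group.
fn group_equal_vars(equals: &[(TyVar, TyVar)]) -> HashMap<TyVar, usize> {
    let mut var_groups: HashMap<TyVar, usize> = HashMap::new();
    let mut next_group = 0;

    for &(a, b) in equals {
        let group_a = var_groups.get(&a).copied();
        let group_b = var_groups.get(&b).copied();

        match (group_a, group_b) {
            // Both already grouped: merge B's group into A's group
            (Some(ga), Some(gb)) if ga != gb => {
                for group in var_groups.values_mut() {
                    if *group == gb {
                        *group = ga;
                    }
                }
            }
            (Some(_), Some(_)) => {} // already in the same group
            (Some(ga), None) => { var_groups.insert(b, ga); }
            (None, Some(gb)) => { var_groups.insert(a, gb); }
            (None, None) => {
                var_groups.insert(a, next_group);
                var_groups.insert(b, next_group);
                next_group += 1;
            }
        }
    }

    var_groups
}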

Allow if conditions as the last statement of a block

There is a slight ambiguity caused by if conditions being both statements and expressions: if you use an if condition as the last line in a block, it will be interpreted as a statement, even though it should actually be considered the final expression in the block. We should special case this and pull the if condition out of the statements list and into the final expression.
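
For example (dino syntax; the function is made up), the trailing if here should be the value of the function body rather than a statement:

fn choose(flag: bool) -> int {
    // Currently parsed as a statement, leaving the block with no final
    // expression; it should instead be pulled out as the block's value.
    if flag {
        1
    } else {
        2
    }
}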

Exponentiation operator

We should have an operator ^ that raises the left-hand side to the power of the right-hand side. We use the ^ symbol because it will come in handy later when we add units to the language.

We can use ~ for both bitwise OR and bitwise complement.

Ord trait

Right now we're using lt, gt, and eq methods. When structs and a trait system are added, we should switch to a proper Ord trait and an Ordering enum.
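
A sketch of what that could look like in planned dino syntax (the exact shape of the trait is an assumption):

// This is dino syntax
enum Ordering {
    Less,
    Equal,
    Greater,
}

trait Ord {
    // `self` by value because there are no references in this language
    fn cmp(self, other: Self) -> Ordering;
}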

Generate function tables

The runtime will contain functions that map to impls like the following:

impl Add<int> for int {
    type Output = int;

    // `self` because no references in this language
    fn add(self, other: Self) -> Self::Output {
        ...
    }
}

This is (future) disco syntax, not necessarily valid Rust.

The C function in the runtime (as defined in Rust) for the add method in this particular impl could look like the following:

#[no_mangle]
extern fn __disco__Add__add(left: *const DInt, right: *const DInt) -> DInt {
    ...
}

We need to figure out a way to map the fully qualified name derived from the impl, <int as Add<int, Output=int>>::add, to the C function name. We also need to be able to generate valid invocations of that C function.

My initial idea for this is to annotate each function in the runtime with a comment that represents the impl method it corresponds to. We can then parse those out and build the proper code data structures.

In order for this to work we need a proper name mangling scheme that can be applied in the runtime too.
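
A minimal sketch of what the annotated runtime function might look like; the comment format and the DInt layout are assumptions, not the final design:

#[repr(C)]
pub struct DInt {
    value: i64,
}

/// impl: <int as Add<int, Output=int>>::add
#[no_mangle]
pub extern "C" fn __disco__Add__add(left: *const DInt, right: *const DInt) -> DInt {
    // The runtime dereferences the managed pointers and produces a new value
    unsafe {
        DInt {
            value: (*left).value + (*right).value,
        }
    }
}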

Standard Library + Runtime Interface Files

We want to get rid of the giant list of function/method insertions in src/lib.rs.

dino/src/lib.rs

Lines 65 to 348 in 84adda6

fn insert_prelude(decls: &mut resolve::ProgramDecls) {
    //TODO: Figure out how to do this properly without hard coding things
    use crate::ast::*;

    let prims = &decls.prims;
    let decls = &mut decls.top_level_decls;

    decls.insert_func(Function::new_extern("unit_eq", FuncSig {
        return_type: Ty::Named("bool"),
        params: vec![
            FuncParam {name: "left", ty: Ty::Unit},
            FuncParam {name: "right", ty: Ty::Unit},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("print_unit", FuncSig {
        return_type: Ty::Unit,
        params: vec![
            FuncParam {name: "value", ty: Ty::Unit},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("bool_eq", FuncSig {
        return_type: Ty::Named("bool"),
        params: vec![
            FuncParam {name: "left", ty: Ty::Named("bool")},
            FuncParam {name: "right", ty: Ty::Named("bool")},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("bool_and", FuncSig {
        return_type: Ty::Named("bool"),
        params: vec![
            FuncParam {name: "left", ty: Ty::Named("bool")},
            FuncParam {name: "right", ty: Ty::Named("bool")},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("bool_or", FuncSig {
        return_type: Ty::Named("bool"),
        params: vec![
            FuncParam {name: "left", ty: Ty::Named("bool")},
            FuncParam {name: "right", ty: Ty::Named("bool")},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("bool_not", FuncSig {
        return_type: Ty::Named("bool"),
        params: vec![
            FuncParam {name: "value", ty: Ty::Named("bool")},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("print_bool", FuncSig {
        return_type: Ty::Unit,
        params: vec![
            FuncParam {name: "value", ty: Ty::Named("bool")},
        ],
    })).unwrap();
    decls.insert_method(prims.int(), "eq", Function::new_extern("int__eq", FuncSig {
        return_type: Ty::Named("bool"),
        params: vec![
            FuncParam {name: "self", ty: Ty::Named("int")},
            FuncParam {name: "right", ty: Ty::Named("int")},
        ],
    })).unwrap();
    decls.insert_method(prims.int(), "gt", Function::new_extern("int__gt", FuncSig {
        return_type: Ty::Named("bool"),
        params: vec![
            FuncParam {name: "self", ty: Ty::Named("int")},
            FuncParam {name: "right", ty: Ty::Named("int")},
        ],
    })).unwrap();
    decls.insert_method(prims.int(), "gte", Function::new_extern("int__gte", FuncSig {
        return_type: Ty::Named("bool"),
        params: vec![
            FuncParam {name: "self", ty: Ty::Named("int")},
            FuncParam {name: "right", ty: Ty::Named("int")},
        ],
    })).unwrap();
    decls.insert_method(prims.int(), "lt", Function::new_extern("int__lt", FuncSig {
        return_type: Ty::Named("bool"),
        params: vec![
            FuncParam {name: "self", ty: Ty::Named("int")},
            FuncParam {name: "right", ty: Ty::Named("int")},
        ],
    })).unwrap();
    decls.insert_method(prims.int(), "lte", Function::new_extern("int__lte", FuncSig {
        return_type: Ty::Named("bool"),
        params: vec![
            FuncParam {name: "self", ty: Ty::Named("int")},
            FuncParam {name: "right", ty: Ty::Named("int")},
        ],
    })).unwrap();
    decls.insert_method(prims.int(), "add", Function::new_extern("int__add", FuncSig {
        return_type: Ty::Named("int"),
        params: vec![
            FuncParam {name: "self", ty: Ty::Named("int")},
            FuncParam {name: "other", ty: Ty::Named("int")},
        ],
    })).unwrap();
    decls.insert_method(prims.int(), "sub", Function::new_extern("int__sub", FuncSig {
        return_type: Ty::Named("int"),
        params: vec![
            FuncParam {name: "self", ty: Ty::Named("int")},
            FuncParam {name: "right", ty: Ty::Named("int")},
        ],
    })).unwrap();
    decls.insert_method(prims.int(), "mul", Function::new_extern("int__mul", FuncSig {
        return_type: Ty::Named("int"),
        params: vec![
            FuncParam {name: "self", ty: Ty::Named("int")},
            FuncParam {name: "right", ty: Ty::Named("int")},
        ],
    })).unwrap();
    decls.insert_method(prims.int(), "div", Function::new_extern("int__div", FuncSig {
        return_type: Ty::Named("int"),
        params: vec![
            FuncParam {name: "self", ty: Ty::Named("int")},
            FuncParam {name: "right", ty: Ty::Named("int")},
        ],
    })).unwrap();
    decls.insert_method(prims.int(), "rem", Function::new_extern("int__rem", FuncSig {
        return_type: Ty::Named("int"),
        params: vec![
            FuncParam {name: "self", ty: Ty::Named("int")},
            FuncParam {name: "right", ty: Ty::Named("int")},
        ],
    })).unwrap();
    decls.insert_method(prims.int(), "neg", Function::new_extern("int__neg", FuncSig {
        return_type: Ty::Named("int"),
        params: vec![
            FuncParam {name: "self", ty: Ty::Named("int")},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("print_int", FuncSig {
        return_type: Ty::Unit,
        params: vec![
            FuncParam {name: "value", ty: Ty::Named("int")},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("add_real", FuncSig {
        return_type: Ty::Named("real"),
        params: vec![
            FuncParam {name: "left", ty: Ty::Named("real")},
            FuncParam {name: "right", ty: Ty::Named("real")},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("sub_real", FuncSig {
        return_type: Ty::Named("real"),
        params: vec![
            FuncParam {name: "left", ty: Ty::Named("real")},
            FuncParam {name: "right", ty: Ty::Named("real")},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("print_real", FuncSig {
        return_type: Ty::Unit,
        params: vec![
            FuncParam {name: "value", ty: Ty::Named("real")},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("add_complex", FuncSig {
        return_type: Ty::Named("complex"),
        params: vec![
            FuncParam {name: "left", ty: Ty::Named("complex")},
            FuncParam {name: "right", ty: Ty::Named("complex")},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("add_real_complex", FuncSig {
        return_type: Ty::Named("complex"),
        params: vec![
            FuncParam {name: "left", ty: Ty::Named("real")},
            FuncParam {name: "right", ty: Ty::Named("complex")},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("add_complex_real", FuncSig {
        return_type: Ty::Named("complex"),
        params: vec![
            FuncParam {name: "left", ty: Ty::Named("complex")},
            FuncParam {name: "right", ty: Ty::Named("real")},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("sub_complex", FuncSig {
        return_type: Ty::Named("complex"),
        params: vec![
            FuncParam {name: "left", ty: Ty::Named("complex")},
            FuncParam {name: "right", ty: Ty::Named("complex")},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("sub_real_complex", FuncSig {
        return_type: Ty::Named("complex"),
        params: vec![
            FuncParam {name: "left", ty: Ty::Named("real")},
            FuncParam {name: "right", ty: Ty::Named("complex")},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("sub_complex_real", FuncSig {
        return_type: Ty::Named("complex"),
        params: vec![
            FuncParam {name: "left", ty: Ty::Named("complex")},
            FuncParam {name: "right", ty: Ty::Named("real")},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("print_complex", FuncSig {
        return_type: Ty::Unit,
        params: vec![
            FuncParam {name: "value", ty: Ty::Named("complex")},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("bstr_len", FuncSig {
        return_type: Ty::Named("int"),
        params: vec![
            FuncParam {name: "value", ty: Ty::Named("bstr")},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("bstr_eq", FuncSig {
        return_type: Ty::Named("bool"),
        params: vec![
            FuncParam {name: "left", ty: Ty::Named("bstr")},
            FuncParam {name: "right", ty: Ty::Named("bstr")},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("bstr_gt", FuncSig {
        return_type: Ty::Named("bool"),
        params: vec![
            FuncParam {name: "left", ty: Ty::Named("bstr")},
            FuncParam {name: "right", ty: Ty::Named("bstr")},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("bstr_gte", FuncSig {
        return_type: Ty::Named("bool"),
        params: vec![
            FuncParam {name: "left", ty: Ty::Named("bstr")},
            FuncParam {name: "right", ty: Ty::Named("bstr")},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("bstr_lt", FuncSig {
        return_type: Ty::Named("bool"),
        params: vec![
            FuncParam {name: "left", ty: Ty::Named("bstr")},
            FuncParam {name: "right", ty: Ty::Named("bstr")},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("bstr_lte", FuncSig {
        return_type: Ty::Named("bool"),
        params: vec![
            FuncParam {name: "left", ty: Ty::Named("bstr")},
            FuncParam {name: "right", ty: Ty::Named("bstr")},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("bstr_concat", FuncSig {
        return_type: Ty::Named("bstr"),
        params: vec![
            FuncParam {name: "left", ty: Ty::Named("bstr")},
            FuncParam {name: "right", ty: Ty::Named("bstr")},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("bstr_slice", FuncSig {
        return_type: Ty::Named("bstr"),
        params: vec![
            FuncParam {name: "string", ty: Ty::Named("bstr")},
            FuncParam {name: "start", ty: Ty::Named("int")},
            FuncParam {name: "end", ty: Ty::Named("int")},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("bstr_get", FuncSig {
        return_type: Ty::Named("bstr"),
        params: vec![
            FuncParam {name: "string", ty: Ty::Named("bstr")},
            FuncParam {name: "index", ty: Ty::Named("int")},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("print_bstr", FuncSig {
        return_type: Ty::Unit,
        params: vec![
            FuncParam {name: "value", ty: Ty::Named("bstr")},
        ],
    })).unwrap();
    decls.insert_func(Function::new_extern("read_line_bstr", FuncSig {
        return_type: Ty::Named("bstr"),
        params: Vec::new(),
    })).unwrap();
}

The best way to do this is to enable the compiler to compile a "dino library" file instead of just binaries.

After this work is done, the dino-std and dino-runtime crates should have three parts each:

  1. A header file of all the external symbols so we can use them in C
  2. A static library that will be linked into the final program
  3. A discolib file that contains the compiled modules that represent the library

This third part is the most complex. Basically the idea is that we define the standard library like this:

extern {
    fn __disco__DInt__operator_add(a: int, b: int) -> int;
}

// This is disco syntax
impl Add<int> for int {
    type Output = int;

    // `self` because no references in this language
    fn add(self, other: Self) -> Self::Output {
        __disco__DInt__operator_add(self, other)
    }
}

This disco file is then compiled as a library (a discolib file) and linked into the final program.

Support non-trivial merge sort program

This is a sort of "minimally non-trivial" program. It consumes input, processes it, and then prints output. Supporting it would require quite a few additional features to be added (e.g. loops, conditionals, recursion, byte strings, bytes, functions, etc.).

//! merge_sort.dino
//!
//! A dino program that reads a line of text as input, sorts it, and then outputs the
//! sorted string.

fn main() {
    // Continue forever until the user quits the program with Ctrl-C
    while true {
        print_bstr(b"Enter some text: ");
        let input = read_line();
        let output = merge_sort(input);
        print_bstr(output);
    }
}

fn merge_sort(input: bstr) -> bstr {
    let input_len = bstr_len(input);

    // Base case: return a single character (or nothing)
    if int_lte(input_len, 1) {
        return input;
    }

    // Split the string and sort each half
    let left = bstr_slice(input, 0, int_div(input_len, 2));
    let right = bstr_slice(input, int_div(input_len, 2), input_len);

    let sorted_left = merge_sort(left);
    let sorted_right = merge_sort(right);

    // Merge the two halves together to get a single sorted string
    merge(sorted_left, sorted_right)
}

/// Merges two sorted strings into a single sorted string
fn merge(left: bstr, right: bstr) -> bstr {
    let left_len = bstr_len(left);
    let right_len = bstr_len(right);

    let left_i = 0;
    let right_i = 0;

    let output = b"";
    // Continue until one of the strings runs out of characters
    while bool_and(int_lt(left_i, left_len), int_lt(right_i, right_len)) {
        let left_byte = bstr_get(left, left_i);
        let right_byte = bstr_get(right, right_i);

        if byte_lt(left_byte, right_byte) {
            append_string_byte(output, left_byte);
            left_i = add_int(left_i, 1);
        } else { // byte_gte(left_byte, right_byte)
            append_string_byte(output, right_byte);
            right_i = add_int(right_i, 1);
        }
    }

    // Append any remaining characters on to the string
    // Only one of these loops will run
    while int_lt(left_i, left_len) {
        append_string_byte(output, bstr_get(left, left_i));
        left_i = add_int(left_i, 1);
    }
    while int_lt(right_i, right_len) {
        append_string_byte(output, bstr_get(right, right_i));
        right_i = add_int(right_i, 1);
    }

    output
}

This program intentionally doesn't use any binary operators, so there is no need to add any sort of trait dispatching.

Optimize Out Zero-Sized-Types

Perform an optimization pass on the IR to optimize out any zero-sized types. Add a guarantee in the docs of the codegen IR indicating that zero-sized types are assumed to have already been removed.

  • Zero-sized function parameters (error if an extern function has one)
  • Variable declarations that are zero-sized (the right-hand expression may still need to be kept for its side effects)
  • Referring to a variable that is zero-sized in an expression
  • Returning a zero-sized literal - return () (the return must still be present, we just need to optimize its argument)
  • Returning an expression that would have otherwise produced a zero sized value - if ... { () }

Note: it's very important that we don't actually remove any function calls themselves as they may have side effects even if the input/output ends up being zero sized.

Make a valid test program that shows zero sized types being used in all of these ways and make sure it compiles.
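
A sketch of such a test program (dino syntax; take_unit is made up for illustration):

fn take_unit(value: ()) -> () {
    // zero-sized parameter and zero-sized return
    value
}

fn main() {
    // zero-sized variable declaration
    let u: () = take_unit(());
    // referring to a zero-sized variable in an expression
    print_unit(u);
    // returning a zero-sized literal
    return ();
}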


This could involve creating a new IR:

// elab.rs
//! The "elaborated" IR (for lack of a better name) represents a very low-level view of the
//! program. Execution order is made explicit and type information is retained. This is designed
//! to enable optimizations, for example the removal of zero-sized types.

Parallelize Compiler

Certain things in the compiler can be trivially parallelized.

  • Type checking of individual functions
  • Code generation of individual functions

Assignment Semantics

Some of the questions to be answered: What does assignment do? Is it a copy? Is it a ref-copy? How does that affect mutation and information sharing? Is there a way to return e.g. a string or int from a struct method without necessarily allowing it to be mutated? What are the desired properties of assignment?

Notes

  • Every value is a mutable shared pointer
  • Names are bound and rebound to different pointers
  • Whether something is mutable is given by the semantics of the type itself
  • All primitive types are copy-on-write or immutable
    • This enables sharing and mutability without forcing the programmer to reason about it even for primitives
    • Most primitives are immutable
    • Strings are copy-on-write and will reassign their data pointer to a copy when mutable operations (e.g. push) are performed
  • Field accesses always implicitly return lvalues
    • This enables fields to be mutated when they contain copy on write types
    • e.g. this is what allows strings held in fields to be mutated
  • User generated structs are mutable and shared, but not implicitly copy-on-write
struct Foo {
    data: str,
}

impl Foo {
    // self is implemented as a mutable shared pointer
    fn bar(self) {
         // data is implemented as an lvalue pointer, not a temporary
         self.data.push('a');

         // This needs to be implemented as follows:
         // str__push(&self->data, 'a');
         // This will not work:
         // DBStr *tmp222 = self->data;
         // str__push(tmp222, 'a');
         // If data is shared, we need to copy it and put a new value into the field
         // This somewhat implies that self is always an out pointer
    }
}

Questions:

  • What does self = Foo { ... } generate?
    • Idea: copy (the pointer of) all the fields into self, this being a shortcut for doing assignment on all fields
    • Does that idea work even if the RHS is a variable, not a temporary?

Type Check Fields and Methods

We need to type check the fields of each declared type to ensure that each field's type exists. We also need to type check methods to ensure that their signatures and bodies are correct (and to generate code for them).

Type Variables can be both int vars and real vars

Right now we assume that no types can be both int vars and real vars:

/// Records this type variable as an int var so it can be special-cased in the later stages of
/// type checking. No variable should be both an int var and a real var.
pub fn ty_var_is_int(&mut self, ty_var: TyVar) {
    self.int_vars.insert(ty_var);
}

/// Records this type variable as a real var so it can be special-cased in the later stages of
/// type checking. No variable should be both an int var and a real var.
pub fn ty_var_is_real(&mut self, ty_var: TyVar) {
    self.real_vars.insert(ty_var);
}

No variable should be both an int var and a real var.

This is not valid though because 12 is both a valid integer literal and a valid real literal. We should go through the code and remove this assumption.

A variable can be both an int var and a real var. In that case, the variable must be in the intersection of the set of valid types for both int and real vars.

Complex Numbers

We should support a complex type for complex numbers, along with a complex number literal. A complex number literal is any valid real number literal with an i or j suffix. Complex numbers are represented by two values: a real part and an imaginary part. The literal with the suffix is the imaginary part. Adding a real number to an imaginary number fills in the real part of the resulting complex number.

Real number literals can be assigned to variables of type complex in the same way that integer literals can be assigned to variables of type real.
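
A sketch of the proposed literals (dino syntax; the variable names are made up):

let z: complex = 3.0 + 2.5i;  // real part 3.0, imaginary part 2.5
let w: complex = 1.5j;        // the j suffix works the same way
let r: complex = 2.0;         // a real literal assigned to a complex variable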

Guard on zero sized structs

Until #19 is fully implemented, we should either disallow zero sized structs entirely or otherwise build in a hack to address them.

One potential solution is for the runtime to return a dangling pointer if the requested allocation is zero sized.

Currying

It would be neat if functions were curried. That is, every function compiles down to a closure that returns another closure for each argument. The key here is that curried functions must be closures, or else some of the context can be lost during currying.

Lots of interesting concerns to explore here.
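
A Rust sketch of the idea (not dino syntax): a two-argument function becomes a closure over the first argument that waits for the second.

fn add(a: i32, b: i32) -> i32 {
    a + b
}

// The curried form must be a closure so the first argument is captured
// rather than lost
fn add_curried(a: i32) -> impl Fn(i32) -> i32 {
    move |b| a + b
}

fn main() {
    let add_two = add_curried(2); // partially applied
    assert_eq!(add_two(3), add(2, 3));
}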

Coroutines

Question: Stackful or stackless coroutines?


Here's an example that shows the additional type and value syntax for coroutines in context:

// This is dino syntax

fn cool_stuff<T>(x: T) -> bstr
    yields <T as Add<i32>>::Output
        given Option<i32>
    where T: Into<i32> + Add<i32, Output=T>
{
    for _ in 0..10 {
        match (yield x + 1) {
            Some(val) => x = val,
            None => {},
        }
    }

    b"cool"
}

This is a contrived example, but the point is to show how the syntax looks with some generics and where clauses.

The yields A given B is additional type syntax that specifies the type yielded and the input type that can be passed to resume. The return type is independent of A and B. The given B part is optional and B defaults to () (no input). The yields ... given ... must be placed before the where clauses.

This function returns an anonymous type that implements the Coroutine trait.

// This is dino syntax

enum CoResult<T, R> {
    Yield(T),
    Return(R),
}

trait Coroutine<I> {
    type Yield;
    type Return;

    fn resume(self, input: I) -> CoResult<Self::Yield, Self::Return>;
}

For example, the above function may be desugared to:

// This is dino syntax

enum CoCoolStuff<T> {
    Start {x: T},
    ...stack state...
}

impl<T> Coroutine<Option<i32>> for CoCoolStuff<T>
    where T: Into<i32> + Add<i32, Output=T>,
{
    type Yield = <T as Add<i32>>::Output;
    type Return = bstr;

    fn resume(self, input: Option<i32>) -> CoResult<Self::Yield, Self::Return> {
        ...state machine that returns CoResult::Yield(x + 1) and CoResult::Return(b"cool")...
    }
}

If given is omitted, or given () is specified, and the coroutine returns (), then the coroutine implements Iterator.

// This is dino syntax

impl<T, C> Iterator for C
    where C: Coroutine<(), Yield=T, Return=()>
{
    type Item = C::Yield;
    
    fn next(self) -> Option<Self::Item> {
        match self.resume(()) {
            CoResult::Yield(val) => Some(val),
            CoResult::Return(()) => None,
        }
    }
}

Interesting idea: we could enable using coroutines in loops by allowing continue to take a value. E.g. continue input_var;

Another simpler example (with side effects):

// This is dino syntax

fn print_up_to(end: i32) yields () {
    for i in 0..=end {
        // Could also just put `yield;` or `yield ();` on the next line
        yield println!("{}", i);
    }
}

This is a pretty minimal coroutine that also doubles as an iterator. Nothing is printed until you resume the iterator at least once. Each resume (or call to next in the iterator case) will cause one number to be printed.

Scope Design Notes

Given the (imaginary) syntax:

package calculator {
    struct Calc;

    mod operators {
        use super::Calc;
        
        fn add(a, b) {
            fn add_inner(a, b) {
                let d = a - b - a + b;
            }

            let c = a + b;

            use std::ops::Add;
        }

        fn sub(a, b) {
            let e = {
                let q = a - b;
                q
            };
        }
    }
}

We can derive the following tree:

ModuleScope - package calculator
    Decl - struct Calc
    ModuleScope - mod operators
        ImportedDecl - use super::Calc
        IsolatedScope - fn add
            Variable - param a
            Variable - param b
            ImportedDecl - use std::ops::Add
            IsolatedScope - fn add_inner
                Variable - param a
                Variable - param b
                Variable - let d
            Variable - let c
        IsolatedScope - fn sub
            Variable - param a
            Variable - param b
            InnerScope
                Variable - let q
            Variable - let e

Notes

  • ImportedDecl and IsolatedScope are always lifted to the top of their scope (variables stay in their declared order)
  • IsolatedScope cannot access upper isolated scopes, but can access the upper module scope
  • InnerScope may access upper level items that are previous siblings of InnerScope, and of course the upper module scope too

Short-circuit evaluation

This applies to the boolean && and || operators. We need to make sure all of our code generation adequately short-circuits, only executing the right-hand side if the left is true for && or false for ||.

This has to work even for weird code like:

if true || (if true { foo(); false } else { false }) { ... }

Here, foo should never execute because the left operand of || is true; this matters because foo may have side effects.

Arbitrarily nested boolean expressions should work correctly.

Explicitly coerce DBool to bool

Add a coerce_bool to ExternType that must exist on any type that type checks as bool (i.e. on DBool). This is used to produce a bool from *mut DBool.

We can then have the code generator generate the following whenever a bool is needed in the C code.

if (coerce_bool(...)) { ... }

while (coerce_bool(...)) { ... }
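
A sketch of the runtime side (the DBool layout here is an assumption, not the real runtime type):

#[repr(C)]
pub struct DBool {
    value: bool,
}

#[no_mangle]
pub extern "C" fn coerce_bool(value: *mut DBool) -> bool {
    // Produces a plain bool for use in generated C conditions
    unsafe { (*value).value }
}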

Better Type Inference for Operators

The Rust compiler has special cases that make it so you don't have to explicitly annotate the type of items on either side of most operators. We should consider similar special casing so we don't need 5int + 6 all the time.

For more info, read this whole file:

https://github.com/rust-lang/rust/blob/f54911c6f2971b98ede2d21ab8cabd7daa1110ba/src/librustc_typeck/check/op.rs#L109-L145

  • Once this is implemented, we can remove this HACK:

dino/src/ast/parser.rs

Lines 290 to 301 in bd8803d

// HACK: we can make type inference a bit easier for ourselves if we allow
// numeric literals to just include their negative sign directly
'-' => match lhs {
    Expr::IntegerLiteral(lit) => {
        return Expr::IntegerLiteral(IntegerLiteral {
            value: -lit.value,
            ..lit
        });
    },
    Expr::RealLiteral(value) => return Expr::RealLiteral(-value),
    Expr::ComplexLiteral(value) => return Expr::ComplexLiteral(-value),
    _ => "neg",

Byte Strings, Bytes, and Corresponding Literals

We can wrap the bstr crate to provide byte strings and methods on byte strings. This is an excellent step down from full Unicode strings that will simplify the initial implementation.

String literals will only be valid with the b prefix which creates a value of type bstr. Much like the int type and its backing implementation DInt, there will be a DByteStr type in the standard library implementation that backs the bstr type. This is the type that will wrap bstr::BStr.

String syntax: b"cool stuff"
Character syntax: b'c'

Rename string type to text

You know, as much as we're all used to it by now, "string" really isn't a great name for a type that represents text. We should change our Unicode string type from str to text.

Guarantee Evaluation Order

Suppose you had the following expression:

a1() + if a2() { a3() } else { a4() }

Right now, a2 and a3/a4 will execute before a1. This is not ideal. It's actually just an implementation detail that has to do with how we generate code. It would be much better if call expressions generated a temporary variable and a statement so we could ensure that everything happens in the order that a user would expect.
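
A sketch of the desired lowering (dino-like pseudocode; the temporary names are made up):

// Each call is hoisted into a temporary so evaluation happens in source
// order: a1 first, then a2, then whichever of a3/a4 is selected.
let tmp0 = a1();
let tmp1 = if a2() { a3() } else { a4() };
tmp0 + tmp1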

List/Array

We should support a dynamically-sized list type. Since we have a managed runtime, every item in every list is going to be a pointer. That makes the generic implementation easy; the element type becomes purely a type-system detail.

Extended Type Inference

We will need to extend our type inference algorithm to include function/method dispatch in order to support methods. In doing this extension, it's probably a good idea to consider generics and traits as well since those features will also eventually be added.

We're probably going to move from the current set-satisfiability approach to something more based on the classic Hindley-Milner type inference algorithm. The reason we can do this is because the main reason for using sets in the first place was so we could support polymorphic literals (e.g. 123 can be either int or real or the real part of complex). It turns out that we can actually do this in Hindley-Milner as a post-processing step. That saves us from having to build the entire algorithm around sets. (sets are hard because a lot of the inference would quickly become exponential as we iterate over all possible combinations of different set elements in more complex constraints.)

The post-processing step is simple: First, we store the type variables that correspond to any integer or real number literals. Then, after the types have been inferred/checked, we go through each one. We check that integer literals were typed as int, real, or complex. We check that real number literals were typed as real or complex. If an integer literal type variable is ambiguous (did not resolve to a concrete type but also didn't encounter any errors), we resolve it as int. Similarly we resolve ambiguous real number literals to real.
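
A Rust sketch of the defaulting step for integer literals; the Ty enum, TyVar alias, and map layout are assumptions, not the real compiler API. Real number literals would get the same treatment with {real, complex} and a real default.

use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Ty { Int, Real, Complex, Bool }

type TyVar = usize;

fn default_int_literals(
    int_lit_vars: &[TyVar],
    solution: &mut HashMap<TyVar, Ty>,
) -> Result<(), String> {
    for &var in int_lit_vars {
        let resolved = solution.get(&var).copied();
        match resolved {
            // An integer literal may be typed as int, real, or complex
            Some(Ty::Int) | Some(Ty::Real) | Some(Ty::Complex) => {}
            Some(other) => return Err(format!("integer literal resolved to {:?}", other)),
            // Ambiguous (no errors, but no concrete type): default to int
            None => { solution.insert(var, Ty::Int); }
        }
    }
    Ok(())
}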

Once we get to the point of adding generics/traits, we can start to bound the "ambiguous" type variables with traits. That means that before resolving to int or real, we'll need to check that those types implement the necessary traits.

Syntax for Statically-checked Units

Building on the work from lion, it would be great if this language had a statically-checked units system. The lion language was always just interpreted, so it will be interesting to see what we can do with a compiled language.

Unit Representation, Desugaring, & Operators

We can define a UnitValue struct that is generic over both the value type and the unit. All types containing units are desugared to UnitValue (e.g. real 'km desugars to UnitValue<real, 'km>).

// This is dino code, not Rust code

/// All units implement this marker trait
///
/// The trait is sealed (using an internal-only Sealed trait)
/// to prevent outside implementations
trait Unit: private::Sealed {}

/// Represents a value with a statically-checked unit
struct UnitValue<T, U: Unit> {
    // private fields
    value: T,
    unit: U, // All units are zero-sized () values
}

A UnitValue can be created from a "unitless" value.

/// All unitless values can be turned into a value with a unit at no cost
impl<T, U: Unit> From<T> for UnitValue<T, U> {
    fn from(value: T) -> Self {
        Self {
            value,
            unit: /*...compiler magic: unit literal for U...*/,
        }
    }
}

You can get the raw value from a unitless ('_) value using the From trait. Any other value that has a unit can be retrieved using the value() method.

impl<T, U> UnitValue<T, U> {
    /// Explicitly get the value (without the protection of the unit)
    /// out of the united quantity
    fn value(self) -> T {
        self.value
    }
}

/// A unitless value can be converted directly into the value type
impl<T> From<UnitValue<T, '_>> for T {
    fn from(x: UnitValue<T, '_>) -> T {
        x.value()
    }
}

We can dynamically-generate impls for the From trait that convert a value with a unit into a value with a different unit.

// Dynamically-generated impl, created as-needed by the compiler
impl<T1, T2> From<UnitValue<T1, 'm>> for UnitValue<T2, 'km> {
    fn from(other: UnitValue<T1, 'm>) -> Self
        where T1: Div<T1, Output=T2> + From<real>
    {
        Self {
            value: other.value / T1::from(1000.0),
            unit: 'km, // Every unit has a constructor (similar to the () type)
        }
    }
}

Operators can be implemented like any other numeric type (over compatible units).

// impl of Add that results in the LHS unit being carried forward
impl<V1, U1, V2, U2, O> Add<UnitValue<V2, U2>> for UnitValue<V1, U1>
          // Allow the addition to result in any type
    where V1: Add<V2, Output=O>,
          // Check that the RHS unit can be converted into the LHS unit
          // This constraint signals the compiler to generate an impl
          // if needed
          UnitValue<V2, U1>: From<UnitValue<V2, U2>> {

    type Output = UnitValue<O, U1>;

    fn add(self, other: UnitValue<V2, U2>) -> Self::Output {
        let other = other.into();
        Self::Output {
            value: self.value + other.value,
            unit: self.unit,
        }
    }
}

// similar implementation of the Sub trait

Both addition and subtraction require the LHS and RHS to be the same unit.

The Mul and Div traits are a little more complicated because they can result in compound units. Take the following functions as an example:

// 'km and 'h have no defined conversion between them, so
// we keep both units and preserve that they have been divided
fn foo1(x: real 'km, y: real 'h) -> real 'km/'h {
    x / y
}

// 'km and 'm have a conversion defined between them. Dividing
// them results in a unitless (denoted '_) value.
fn foo2(x: real 'km, y: real 'm) -> real '_ {
    x / y
}

// A unitless ('_) value can be converted into any ratio of two
// units that both have a conversion defined between them.
fn foo3(x: real 'km, y: real 'm) -> real 'km/'m {
    x / y
}

// 'km and 'm^2 do not have a conversion defined between them
// because they are not the same dimension of unit. That means
// that we can only divide the values and units, not do any
// conversion.
fn foo4(x: real 'km, y: real 'm^2) -> real 'km / 'm^2 {
    x / y
}

// 'm^3 and 'L are not the same dimension (based on exponent),
// but there is still a conversion defined between them
// (1 'm^3 == 1000 L). That allows us to divide them and get
// a simple (non-compound) unit.
fn foo5(x: real 'm^3, y: real 'L) -> real 'm^3 {
    x / y
}

// 'm and 'm^2 have no defined conversion between them.
// Dividing the units does not directly get you to 'km/'m^2.
// Thus, we need to convert either 'm or 'm^2 and find a path
// to the returned unit. In this case, one solution is
// to convert 'm to 'km and then produce a result.
fn foo6(x: real 'm, y: real 'm^2) -> real 'km / 'm^2 {
    x / y
}

// 'm and 'm^2 have no defined conversion between them.
// Dividing the units can give you 1/m or m^-1 which can
// then be converted into 'km^-1.
fn foo7(x: real 'm, y: real 'm^2) -> real 'km^-1 {
    x / y
}

// and so on...

One option to implement all of these cases (and the many cases that were missed) is to dynamically generate impls of Div and Mul for compatible units as we need them. This might get messy fast, so maybe there is something more sophisticated we can do?

Observation: Multiplying and dividing unit-ed quantities almost feels like defining a very specialized dependent type system. Maybe we can borrow lessons from dependent types and use them here? (Perhaps reading The Little Typer will help with this.)

Declaring Units & Conversions

See: https://github.com/sunjay/lion#declaration-files

Randomized Testing for Type Unification (Reconstruction) Algorithm

Before making changes, we should try to generate random tests and make sure we are fairly robust.

Each mutation below should have a different probability of occurring. E.g. unconstrained should happen very little (1% of the time) whereas split should happen more often.

Try with both big constraint sets and small (or empty) constraint sets.


Dino Constraint Solving Test Case Generator

  • Generate a solution mapping from variables to a type ID
  • Create a valid types constraint that maps that variable to that type ID
  • Go through each constraint and apply a mutation to that constraint, replacing
    that constraint with the result of the mutation
  • Repeat mutations until satisfied
  • Shuffle the constraints and ensure that the solution is still the same

Mutation 0: Do nothing

Mutation 1: Split

  • Split into two separate constraints on the same type variable where the
    intersection of their type IDs gives you the original type ID

Mutation 2: Split Superset

  • Split into two separate constraints on the same type variable where one of the
    sets of types is strictly a superset of the other set (intersection results in
    the smaller set)

Mutation 3: Split Equals

  • Split into two separate constraints using two different type variables
    bound together by an equality constraint
  • Can use any splitting mutation to do the splitting

Mutation 4: Reflexivity

  • Take an equality constraint A = B and replace it with A = B and A = A

Mutation 5: Transitivity

  • Take an equality constraint A = B and replace it with A = C and C = B
    where C is a fresh type variable

Mutation 6: Commutativity

  • Take an equality constraint A = B and replace it with B = A

Mutation 7: Unconstrained

  • Without removing anything else, add a new constraint that equates two fresh type variables A = B

Trait Derive Syntax

While the #[derive(Trait)] Rust syntax works well for simple cases, I've found a couple of issues with it as the Rust language grows. The syntax works very well for simple impls that require no information from the programmer (e.g. for the Copy trait), but once you start to get into things that require customization, people generally tend to want more.

One example of this is the Serialize and Deserialize traits from Serde. Serde is a great library and it provides a very convenient syntax for declaring how you want your fields to be serialized:

#[derive(Deserialize, Debug)]
struct Request {
    // Use the result of a function as the default if "resource" is
    // not included in the input.
    #[serde(default = "default_resource")]
    resource: String,

    // Use the type's implementation of std::default::Default if
    // "timeout" is not included in the input.
    #[serde(default)]
    timeout: Timeout,

    // Use a method from the type as the default if "priority" is not
    // included in the input. This may also be a trait method.
    #[serde(default = "Priority::lowest")]
    priority: Priority,
}

This syntax works well, again for simple cases like default where you often just need to provide one thing. There are a lot of cases however where the number of different attributes can get a little insane.

For example, here's an example that uses structopt for command line parsing:

#[derive(Debug, StructOpt)]
#[structopt(author = "The ProtoArt Team <https://protoart.me>")]
#[structopt(raw(setting = "ColoredHelp", setting = "DontCollapseArgsInUsage",
    setting = "ArgRequiredElseHelp"))]
pub struct AppArgs {
    /// Path to the configuration file to execute tasks from
    #[structopt(name = "config-file", default_value = "spritec.toml", parse(from_os_str))]
    config_path: PathBuf,
}

There are several attributes here, and many of them contain multiple options. Another example is the snafu crate which allows you to describe how each enum variant is formatted when printed with Display:

#[derive(Debug, Snafu)]
enum Error {
    #[snafu_display("Could not open config from {}: {}", "filename.display()", "source")]
    OpenConfig { filename: PathBuf, source: std::io::Error },
    #[snafu_display("Could not save config to {}: {}", "filename.display()", "source")]
    SaveConfig { filename: PathBuf, source: std::io::Error },
    #[snafu_display("The user id {} is invalid", "user_id")]
    UserIdInvalid { user_id: i32, backtrace: Backtrace },
}

You can see how there is actually quite a bit of information in each of these.

It's nice that for the sake of terseness you can generate code just by annotating the fields, but one starts to wonder what would happen if you wanted to combine a couple of these crates:

#[derive(Debug, Snafu, Serialize, Deserialize)]
#[serde(deny_unknown_fields)]
enum Error {
    #[snafu_display("Could not open config from {}: {}", "filename.display()", "source")]
    #[serde(rename(serialize = "open_conf"))]
    OpenConfig { filename: PathBuf, source: std::io::Error },

    #[snafu_display("Could not save config to {}: {}", "filename.display()", "source")]
    #[serde(skip)]
    SaveConfig { filename: PathBuf, source: std::io::Error },

    #[snafu_display("The user id {} is invalid", "user_id")]
    #[serde(other)]
    UserIdInvalid { user_id: i32, backtrace: Backtrace },
}

Things can start to get a little congested.

I believe that there are several design issues at play here:

  1. Multiple derives needing the programmer to provide information about the struct and its fields
  2. Each derive takes the struct tokens as input, and so it's natural to try and add the annotations to those tokens so you can easily parse them out
  3. Different crates have different needs, so they all tend to invent their own ad-hoc syntax for things (e.g. default in the attributes above, or snafu_display)
  4. Users (programmers) not wanting to repeat field names and types every time they want to derive a trait
    • You could argue that a solution to this is to provide some other macro, e.g. impl_serialize! { ... } that has you rewrite the struct definition and put the annotations in there. The big downside to this is number (2) from above. Repeating the fields and their types is cumbersome.

It would be nice if there was a syntax that encouraged less ad-hoc/macro-specific syntaxes (while still allowing them when necessary), while also allowing users to separate the derive input for each macro.

Proposal

In the dino language, I am proposing that we adopt this new syntax:

struct Foo {
    x: int,
    name: str,
    values: [int],
}

// This impl is derived automatically
derive impl Eq for Foo;

// This impl is manually implemented
impl PartialEq for Foo {
    fn eq(&self, other: &Self) -> bool {
        self.name == other.name
    }
}

derive impl Serialize for Foo with {
    // ....
}

derive impl Snafu for Foo with {
    // ....
}

Here, the derive keyword signifies that the impl will be automatically derived rather than implemented manually. We still use the full impl ... for ... syntax to clearly show which impl is being generated. We can then use the body of the with (a new keyword) to define all the custom information required by that particular impl.

This changes a number of things:

  1. Derive macros now get the tokens of the struct as an input but also get the body of the with (if any) as an input
  2. There is no #[derive(...)], even trivial traits need a derive impl ... line
    • This requires a bit more typing in the short-term, but I believe it's better long term since it clearly spells out that an impl is being generated
  3. Users need to repeat each field name in the with clause only if that field is actually going to be customized in the derive (fields and their types are provided as part of (1))
    • Pro: very clear which fields are customized and which ones are not
    • Pro: much more room to do custom syntax things and specify as much
      information as needed
    • Con: need to repeat each field name to be configured
      • Counterpoint: it is very easy to have exhaustiveness checks where necessary to help prevent bugs

The tokens generated for the derive impl replace the tokens starting with derive impl and up to the end of the block. That means that any attributes on the derive impl are also passed into the macro that generates the code. The macro is in full control of which attributes end up in the generated code. A well-behaved macro will usually pass along any attributes that it does not otherwise consume.

// The custom derive for Bar will likely leave this
// attribute on the generated impl.
#[allow(unused_fields)]
derive impl Bar for Foo with {
    // ....
}

Unresolved Design Issues

  • Multiple traits needing the same info (e.g. Serialize and Deserialize) reuse the same attribute values in some cases
    • The solution to this might just be to get people to copy/paste (not the end of the world)
  • What syntax is allowed within a with block? How do we replicate the info previously provided with attributes?
    • How do we replace the top-level attributes on the structs themselves (e.g. description in structopt)

Unsigned integer type

It's useful to have an unsigned integer type for cases where we know that the number cannot be negative. We can call it uint.

Notes on Compiler Passes

Rough notes on a new design for the passes.


parsing: ast / parse tree

  • input is &str

desugaring: hir - high-level IR

name resolution: nir - nameless IR

  • all Idents/paths replaced by DefIds (methods??)
  • ast decorated with push/pop DefId
  • push can push either a type or a variable
  • types, functions and variables have different namespaces
  • each scope points to its parent scope
  • name resolution searches up the stack until it finds something
  • prevent duplicate type names and function names in the same scope
  • allow shadowing for variables in any scope
  • some variable shadows are temporary if they happen within a subscope
  • disallow shadowing in parameter names
  • each struct and function is given a DefId at this stage (in a separate pass before any name resolution happens)

type checking: tyir - Type IR

  • type inference
  • method/field resolution

mir - middle-intermediate representation

  • fully decorated with types
  • monomorphization

high-level code generation: cgenir - codegen IR

  • generate constructors for each struct for the struct literal
  • name mangling
  • high level concepts like "allocate", "initialize field", etc.
  • single-static assignment
  • all types are decorated with things like their mangled name, constructor name, etc.
  • all variables may only be a value or a single function call where all the arguments are other variables

low-level code generation: c

  • C language codegen - subset of C
  • implementation of things like "allocate"
  • implements the calling convention
  • depends on the language runtime
  • runtime contains functions to do allocation, GC, etc.

Final result is a "package"

  • mapping from type name to codegen info (mangled name, constructor, etc.)
  • mapping from function name to codegen info (mangled name, etc.)
  • list of functions and their generated c code

To generate the executable we lay out forward declarations for each package: types first in dependency order, then functions in any order. Functions and structs marked extern are not emitted. Then, we look up a function named "main" in the root package and get its mangled name. We generate a C function named "main" that calls that mangled name.

The std package can be a hard-coded "package" where we set the mangled name fields to their C function names.


Parser library:

  • input - token stream - peekable

  • matcher (ident, int, float, expr)

  • combinator (and_then, or)

  • matchers and combinators must be self-describing (for errors), self-referential (for recursion), and forgiving (for optional error recovery)

  • error mode enum - strict (hard errors), recover (error recovery)

  • fn match(input, error mode, error list) -> Result<Self::Output, Error>

  • error can be added to error list if we successfully recover, otherwise we return Err for a hard error

  • parser is defined entirely at the type level

  • type Add = (Expr, Plus, Expr);

  • Plus is a struct that implements the matcher trait and matches the plus token

  • Tuples are a shorthand for nested AndThen
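
A minimal Rust sketch of the matcher trait described above; TokenStream, ErrorMode, and ParseError are placeholders, and the method is named match_parse because match is a keyword in Rust.

pub struct TokenStream {
    // peekable stream of tokens
}

#[derive(Clone, Copy)]
pub enum ErrorMode {
    // Hard errors: fail on the first problem
    Strict,
    // Error recovery: record the error and keep going where possible
    Recover,
}

pub struct ParseError {
    pub message: String,
}

pub trait Matcher {
    type Output;

    // Recoverable errors are pushed onto `errors`; unrecoverable ones are
    // returned as `Err`.
    fn match_parse(
        input: &mut TokenStream,
        mode: ErrorMode,
        errors: &mut Vec<ParseError>,
    ) -> Result<Self::Output, ParseError>;
}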

Cheapen Cloning

A lot of Strings get cloned right now that could really be Arc<String>. We can start returning that type from methods that currently return a string slice.
