jeff65

jeff65 is a compiler for the Commodore 64 (and perhaps in the future, other 6502-based computers). It is implemented in Python 3 and produces .prg files which can be loaded directly in VICE, or combined into .d64 disk images, written to a floppy disk, and run on real hardware.

Note: this project is currently in its early stages. Features discussed below are largely the product of wishful thinking and may change at any time.

jeff65 compiles programs written in gold syntax, producing blum files as its intermediate object format. gold syntax provides an imperative systems language for 6502-series processors.

Primary invocation:

usage: jeff65 compile [-h] [-v] file

positional arguments:
  file           the file to compile

optional arguments:
  -h, --help     show this help message and exit
  -v, --verbose  show the output of each pass

Licensing

The jeff65 compiler itself is provided under the GPLv3 license; if you distribute a modified version of the compiler, you must also make the source code for your modified version available, as described in the license terms. A copy of the GPLv3 license is included in LICENSE.txt in the source distribution.

The standard library units and runtime library, whenever they get written, will probably be provided either using the GPL with a linking exception, or under a non-copyleft license.

Gold-syntax

Gold-syntax provides an imperative systems programming language for 6502-series processors. Features of the processor are exposed in a friendly-but-powerful way; it should be possible to understand what code will be generated by looking directly at the source file.

Gold-syntax is not associated with the Gold parser framework or the gold LLVM linker.

Here is an example file which puts a light red heart in the top-left corner of the screen, and which actually works with the compiler in its current, very unfinished state:

use mem

constant screen-corner: &u8 = mem.as-pointer(0x0400)
constant screen-corner-color: &u8 = mem.as-pointer(0xd800)

fun main()
  @screen-corner = 0x53      /* screencode for <3 */
  @screen-corner-color = 10  /* light red */
endfun

Issues

naming issue: "ast" should be "ir"

So, technically, our tree structure should be called an "intermediate representation" instead of an AST, which really only refers to the first IR. So the AstNode class should be renamed to IrNode (or just Node) and be moved to an ir module.

This is a really big change, and kind of awkward to do with #30, #33, and #34 outstanding. If we're going to do it, I'd like to do it ASAP, since it's a job that'll only get bigger over time. I see four main approaches:

  1. Wait until after the outstanding PRs are merged.

    • Pros: this is (almost) the easiest approach. Comparisons between outstanding branches and master remain readable.
    • Cons: it kind of stalls development until the PRs are merged. Admittedly, #33 not being merged is kind of stalling development anyway, but in theory I can base new branches off of it.
  2. Make the change in master (or in a branch which is immediately merged into master). Merge master back into all outstanding branches.

    • Pros: gets the change out of the way ASAP. Comparisons between outstanding branches and master remain readable.
    • Cons: requires me to commit to every PR branch. (This isn't a big deal right now since I kind of own all of them, but becomes awkward if I hand off #34)
  3. Make the change in a new rename-ast-to-ir branch. Create a PR, but also merge the branch into #30 and #33 (possibly also #34).

    • Pros: keeps the change out of master until it's been reviewed, still gets it done quickly. New branches can be based off that branch or #33 easily.
    • Cons: merge graph from hell, comparisons between current outstanding branches and master will be totally unreadable until the new branch is merged.
  4. Decide that we're okay with calling our IRs ASTs, close this issue and never speak of it again.

    • Pros: this is the real easiest approach. Status remains quo.
    • Cons: confusing nomenclature, broken windows theory.

@woodrowbarlow thoughts on this?

Run VICE-based tests under Windows

PR #59 introduced a harness for testing gold-syntax programs by running them under VICE. As written, this uses the pty module, meaning that it doesn't work under Windows. It would be nice if these tests could also be run under Windows.

There may be other issues, besides the pty issue. In particular, I don't know if WinVICE connects its monitor to stdout/stdin. Of course, the remote monitor might actually work under WinVICE, in which case we can just talk to it over a socket.
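
If the remote monitor does work there, the harness side is simple; a sketch in Python, assuming VICE was started with x64 -remotemonitor (which listens on TCP port 6510 by default):

# Hypothetical sketch: driving the VICE remote monitor over TCP instead of
# a pty, so the same harness could work on Windows.
import socket

def monitor_command(cmd: str, host: str = "localhost", port: int = 6510) -> str:
    """Send one monitor command and return the raw response text."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(cmd.encode("ascii") + b"\n")
        sock.settimeout(1)
        chunks = []
        try:
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
        except socket.timeout:
            pass  # the monitor doesn't mark end-of-response, so stop on quiet
    return b"".join(chunks).decode("ascii", errors="replace")

# e.g. monitor_command("m 0400 0410")  # dump the start of screen memory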

Versioning plan?

Might be a bit premature, but better early than late. How do we plan to version this?

It's not a library, but it does provide an interface (in the form of gold-syntax), so SemVer might be an option. But to be frank, once we produce a stable release, I'd like to keep a strong commitment to backward-compatibility, i.e. any program which compiles correctly with no warnings on a previous version of the software should compile correctly on a later version (albeit possibly with warnings). In this case, we'd just be at 1.x.y forever, making the major version useless.

CalVer appeals. I'm inclined to suggest Twisted's scheme, i.e. YY.MM.MICRO. So if we were to cut a release at time of writing, it would be 18.8.0. (Actually, it'd be 18.8.0.dev0.)

In order to accomplish the backwards-compatibility thing, I'd suggest grabbing a page from Perl5's book, though with a variant. A breaking change to the language gets an identifier, and a directive can be placed at the top of the file to indicate which version of the language the file should be compiled with. Without a directive, Perl5 will just use an old version of the language. Personally, I'd prefer that this situation use the latest version, but issue a warning explaining this and suggesting adding an explicit directive at the top of the file.

Language versions could be as simple as use gold1, use gold2, etc., or be related to the version number, e.g. use 18.8.0. The latter has the disadvantage that not every compiler version will correspond to a language version. Maybe use some kind of semver-adjacent system for the language, e.g. use 1.0, use 1.2, and later, use 2.0.

Language directives should also select the appropriate stdlib version.
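
As a sketch of how the directive rule could look (names and version numbers here are purely illustrative):

import warnings

LATEST_LANGUAGE = (1, 2)   # hypothetical "use 1.2"

def language_version(directive):
    if directive is None:
        # Unlike Perl5, default to the latest language version, but warn
        # and suggest adding an explicit directive at the top of the file.
        warnings.warn("no language directive; assuming latest -- "
                      "consider adding e.g. 'use 1.2' at the top of the file")
        return LATEST_LANGUAGE
    major, minor = directive.split(".")
    return (int(major), int(minor))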

Builtin types

Unsigned integers

Types u8, u16, u24, and u32 represent 1-byte, 2-byte, 3-byte, and 4-byte unsigned integers, respectively. (3-byte makes sense because the C64 is memory-constrained and the 6502 doesn't really do alignment.)

Unsigned integer arithmetic is well-defined as wrapping on over/underflow.

Shorter types can be coerced to longer types, but not the other way around.

  • add unsigned types to the spec
  • add unsigned types to the type system
  • write tests for unsigned arithmetic behaviour

Signed integers

Types i8, i16, i24, and i32 represent 1-byte, 2-byte, 3-byte, and 4-byte signed two's-complement integers, respectively.

Signed integer arithmetic is well-defined as wrapping on over/underflow.

Shorter types can be coerced to longer types, but not the other way around.

  • add signed types to the spec
  • add signed types to the type system
  • write tests for signed arithmetic behaviour
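
Both this section and the unsigned one above define wrapping on over/underflow; as a Python reference model (a sketch, not compiler code):

# Reference model for wrapping arithmetic at a given bit width.
def wrap_unsigned(value: int, bits: int) -> int:
    return value % (1 << bits)

def wrap_signed(value: int, bits: int) -> int:
    half = 1 << (bits - 1)
    return (value + half) % (1 << bits) - half

assert wrap_unsigned(255 + 1, 8) == 0      # u8 overflow wraps to 0
assert wrap_signed(127 + 1, 8) == -128     # i8 overflow wraps to -128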

Booleans

Type bool represents a boolean value (stored in machine code as a 1-byte value). Values are true and false (or yes and no?).

Booleans cannot be coerced to anything.

  • decide on true/false vs. yes/no
  • add booleans to the spec
  • add booleans to the type system

Ranges

Types range and irange are 2-byte types indicating a range of u8s and a range of i8s, respectively. The first byte is the inclusive lower bound, and the second byte is the exclusive upper bound. These are written using ... to ... notation.

  • decide if ranges should be actual types/values or not
  • add ranges to the spec
  • add ranges to the type system

Pointers

Pointer types are created by prefixing any other type with & or &mut. The only valid operation for a pointer is to dereference it. Dereferencing a pointer always succeeds, due to the C64 having a completely flat memory model. Because of this, there is no such thing as a null pointer.

(Alternatively, we could use 0x0000 for nullable pointers should we need them, and provide an intrinsic for accessing that memory offset.)

  • add pointers to the spec
  • add pointers to the type system

Arrays

Array types are of the form [T; n to m] or [T; n to m, p to q, ...], where T is any type and (m - n) * (q - p) * ... is in the range 0 through 255. The index range of the array is part of the type. As a shorthand, 0 to n may simply be written n in an array type; these are considered strictly identical.

Arrays can be initialized using [ ... ] syntax or "..." syntax; the latter always results in an array of u8s.

Internally, arrays are always passed by reference. In addition, an array type can be coerced to another array type with the same element type and number of dimensions, but whose index ranges are subranges of the original's.

The only valid operations on an array are indexing, creating a slice by taking a reference, and creating a slice by indexing with a range.
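
A sketch of the size rule and the coercion rule above (reading "subrange" as the target's ranges being contained within the source's):

# dims is a list of (lower, upper) index-range pairs, e.g. [(0, 10), (0, 25)].
def element_count(dims):
    total = 1
    for lo, hi in dims:
        total *= hi - lo
    return total

def array_coercible(src_dims, dst_dims):
    # Same dimension count, and each target range contained in the source's.
    return len(src_dims) == len(dst_dims) and all(
        s_lo <= d_lo and d_hi <= s_hi
        for (s_lo, s_hi), (d_lo, d_hi) in zip(src_dims, dst_dims))

assert element_count([(0, 10), (0, 25)]) == 250   # within the 0..255 limit
assert array_coercible([(0, 10)], [(2, 5)])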

  • add arrays to the spec
  • add arrays to the type system

Slices

A slice of an array is written &[T] or &mut [T] where T is any type. Slices know their range but it isn't part of the type. They are 4 bytes long (2 for the address, 2 for the range).

The only valid operations on a slice are indexing and creating another slice by indexing with a range.

(I'm not 100% sure about slices--they're useful, but you either have to make them 4 bytes long to store the lower and upper bounds, or 3 bytes long (pointer plus length), in which case you have to do pointer arithmetic when they're created. In either case, they really abstract over the "pass in a pointer and a length" pattern one sees in C.)
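
For concreteness, a sketch of the 4-byte layout, assuming the address is stored little-endian as is natural on the 6502:

import struct

# 2-byte little-endian address, then the 1-byte lower and upper bounds.
def pack_slice(address: int, lower: int, upper: int) -> bytes:
    return struct.pack("<HBB", address, lower, upper)

assert pack_slice(0x0400, 0, 10) == b"\x00\x04\x00\x0a"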

  • add slices to the spec
  • add slices to the type system

ISR type

Written &isr. The type of names bound using an isr ... endisr block; pass it to library functions for registering ISRs. 2 bytes long.

  • add ISRs to the spec
  • add ISRs to the type system

Function types

Written &fun(...) and &fun(...) -> .... The type of names bound using a fun ... endfun block. 2 bytes long. They can be called like functions.

  • add functions to the spec
  • add functions to the type system
  • add function types to the parser
  • add calling conventions to the function type

Mystery pointer, mystery array, and mystery function

Unnamed 2-byte types which can be coerced to and from any pointer type, any array type, and any function type (!), respectively. Used in the "type signatures" of certain intrinsic/library "functions" provided for purposes of poking holes in the type system when necessary.

Bindings cannot be given these types, even using type inference (see #18); a real type must be given.

  • add mystery pointer to the type system
  • add mystery array to the type system
  • add mystery function to the type system
  • add mystery types to the spec

Apple IIe language design considerations

Kyle mentioned wanting to be able to target the Apple IIe in the future, which has a 6502 CPU similar to the Commodore 64. However, it has different hardware.

This issue tracks how potential IIe support affects language features. The only one I can think of off the top of my head is isr blocks—these will likely need to generate different code on the IIe, though they probably have similar restrictions.

case-sensitivity for syntax

So, this is a little bit of a weird suggestion. What if we make our language case-insensitive outside of string literals?

I do have a pipe dream of porting to the Commodore, and you know the Commodore is weird about case. Plus the shift key is physically difficult to press in combination with other keys when you're trying to type at any kind of speed.

test_archive.test_roundtrip_archive is flaky

Specifically, it's flaky because sometimes examples generate too slowly, causing Hypothesis to fail it due to not passing health checks. Rerunning the build usually fixes it.

Tried to fix it in 154dde7 by decreasing the maximum example size, but that hasn't worked. At this point I suspect that it's due to using hypothesis' example-generation facilities in a pathological way, which is going to require digging into the docs to see if there's a better way to do it.

To make things worse, re-using the seed that hypothesis gives for reproduction in the error message doesn't always reproduce the error. However, it may result in higher error rates than random seeds, so it's probably a good place to start.
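
One possible mitigation (a sketch, untried) is to relax the offending health check for this one test rather than shrinking examples further:

from hypothesis import HealthCheck, given, settings
from hypothesis import strategies as st

@settings(suppress_health_check=[HealthCheck.too_slow], deadline=None)
@given(st.binary(max_size=256))  # placeholder strategy, not the real archive one
def test_roundtrip_archive(data):
    ...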

AST notation/serialization/deserialization

Add a human-friendly notation for AST serialization/deserialization, to be used for inline AST stuff. This helps with ad-hoc testing of later passes.

(Another useful bit extracted from #37)

Comment syntax

  • agree on a comment syntax
  • realize that it has problems, switch to an entirely different comment syntax
  • change our minds again eight months later, reopen discussion
  • re-establish consensus on comment syntax and semantics
  • add comment syntax and semantics to the spec
  • update lexer/parser to handle it

Comments are indicated using --[[ and ]] as delimiters. Nesting is respected; ]] won't end a comment early if the comment contains a matching --[[.

An example single-line comment:

--[[ TODO: better examples ]]

An example multi-line comment:

--[[ hackneyed meta-commentary should probably
--   be avoided, but serves as a handy source of both
--   short and long text fragments. ]]

Note that the leading -- characters are purely aesthetic and entirely unnecessary. The following would work just as well:

--[[
examples can be used to both illustrate features
and set examples for their use, without going into
overbearing detail.
]]
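
The nesting rule itself is easy to implement; a sketch of the scanner logic in Python:

# Scan past a comment opened by `--[[`, tracking depth so that `]]` only
# terminates the comment once every nested `--[[` has been closed.
def skip_comment(src: str, pos: int) -> int:
    """pos points just past the opening `--[[`; returns the index past `]]`."""
    depth = 1
    while pos < len(src):
        if src.startswith("--[[", pos):
            depth += 1
            pos += 4
        elif src.startswith("]]", pos):
            depth -= 1
            pos += 2
            if depth == 0:
                return pos
        else:
            pos += 1
    raise SyntaxError("unterminated comment")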

(Original) C-style comments are ugly. Also, in C, they’re allowed anywhere; however, we only allow comments in statement-positions, so it might be better to switch to lua-style comments:

-- this is a comment

At the parser level, this means that they are terminated by a whitespace token that contains one or more '\n' characters.

Storage class: `read-once`

  • decide if we want this or not
  • add it to the spec
  • implement it

Names may need work. (Maybe also an argument for allowing multiple storage classes to be combined, though some of the combinations don't make sense. Perhaps make this a storage-adjective?)

Description

read-once is a storage class for function-local variables which behaves the same way as stash, except it can only be present in an rvalue exactly once in the function body.

Rationale

If a variable is used in only one spot in the function body, the value can be stored directly in the instruction stream. For pointers, this means that they can be dereferenced using the efficient Absolute and Absolute-Indexed addressing modes; for 8-bit values, they can be used as Immediate arguments to instructions.

Alternatives

Depending on whether or not one considers generating self-modifying code in order to use a faster addressing mode an "optimization", this may be a better candidate for annotations than a storage class, particularly for 8-bit values. That said, because it changes the semantics of the variable and causes the compiler to enforce those semantics, it seems to fit better as a storage class.

Pointer/array syntaxes

  • decide on syntax
  • add pointer syntaxes to the spec
  • add array indexing syntax to the spec
  • get array syntax working in parser/simplifier (again; previously in #20, now in #34)

Jonathoughts (tm):

  • At the machine level, pointers are just numbers. So we could say that we have no pointer type, and allow operations on u16 values directly. I think that this might be too low-level, though we definitely need a way to convert between numbers and pointers to support normal c64 programming.
  • I'm sort of inclined to suggest that we should keep arrays separate from pointers; I like the idea of having the size of the array be a part of the type. This makes it easy to support looping over the array using a for statement as well.
  • The C-alike "read type declarations clockwise" rule, or however it goes, is a headache. I will be very happy if we do not do that.
  • Programmers should be able to know which addressing mode they're using without getting bogged down in details.
  • Slices are kind of cool.

Hardware considerations

  • The fastest (from both a cycle and register pressure perspective) addressing modes are Absolute and Absolute Indexed, which require the pointer to be a compile-time constant. See issue #5 for ideas on how to expose this desirable addressing mode to dynamic and mutable variables.
  • For dynamic and mutable pointers and arrays, our bread-and-butter will be Indirect Indexed.
  • For compile-time constant addresses, a dereference will be more efficient than indexing by zero.
  • For dynamic addresses, there's no indirect addressing mode, so indexing by zero is about as efficient as indexing by a dynamic value.

Note that the Indexed addressing modes are indexed by a u8. Indexing with something bigger is probably a standard library concern.

Syntax ideas

Not in love with all of these. In particular, I don't know if I like @ better than *, though I do like it not being the same character as used for multiplication. I'm exercising it below to see if it grows on me at all. & is probably fine for address-of; using it in the type instead of @/* provides symmetry with [] and means that it can always be read as "address of" and @/* can be read as "target of", i.e. &u8 reads as "address of u8" and &foo reads as "address of foo". I stole this from Rust.

We should probably distinguish between mutable pointers and read-only pointers. So C's const int* becomes &i16 and C's int* becomes &mut i16. I have no desire to steal Rust's borrow checker because that is an undertaking and even Rust is having trouble getting it right. I'm fine with providing just basic support for making it clear what a function mutates and what it doesn't.

I actually really like the idea of providing compiler intrinsics using function syntax, which can later be promoted to "real" syntax if it proves useful. Rust uses name!(args) syntax for this (and macros), but I'm not really in love with that.

use mem    

let foo: u8 = 7
let bar: &u8 = &foo     --[[ places the address of foo into bar ]]
let baz: u8 = @bar      --[[ dereferences bar, placing the result into baz ]]

--[[ mem.as-address is a compiler intrinsic that allows
--   a pointer to be interpreted as an address. 
--   mem.as-pointer is also a compiler intrinsic which is
--   goofily generic on its output type. I don't like it. ]]
let qux: u16 = mem.as-address(bar)
let quux: &u8 = mem.as-pointer(qux)

--[[ uninitialized arrays seem like a bad thing to allow by default
--   maybe provide another goofy mem intrinsic though? ]]
--[[ language arrays are limited to 256 bytes in length; bigarrays sounds
--   like a nice library feature. ]]
let mut spam: [u8 of 0 to 10] = [ 0 ]      --[[ declares a 10-element zeroed array ]]
let mut eggs: [u8 of 32 to 57] = [ 7 ]     --[[ declares a 25-element array filled with sevens ]]

let wibble: u8 = spam[foo]         --[[ note that array indices are u8 ]]

--[[ another goofy-generic mem intrinsic which evaluates to a compile-time constant ]]
let wobble: &u8 = mem.well-known(0x0400)

Note that pointer types don't have arithmetic defined on them--the mem.as-address and mem.as-pointer functions have to be used to convert them. Pointer arithmetic on the C64 is generally pretty slow compared to direct indexing, so I'm inclined to steer people away from it.

IL Checker

In A Nanopass Framework for Compiler Education (subsequently NFCE), sections 2.1 and 2.2 discuss a system for defining intermediate languages. These provide a way of checking whether a pass behaved as expected, i.e. all of the relevant nodes were transformed.

They also provide a way of establishing the appropriate order of transforms. Currently this is hard-coded based on knowledge of what each transform does, but this doesn't scale. In particular, it isn't obvious where new transforms can/should be inserted. Finally, they provide a way of establishing a stable interface between passes.

NFCE defines five types of compiler pass:

  1. Simplification "reduces the complexity of subsequent passes by translating its input into a simpler intermediate language," i.e. desugaring,
  2. Verification "checks compiler invariants that are not easily expressed within the grammar" for debugging purposes,
  3. Conversion "makes explicit an abstraction that is not directly supported by the low-level target language," i.e. lowering,
  4. Analysis "collects information from the input program and records that information with annotations in the output program," and
  5. Improvement "attempts to decrease the run time or resource utilization of the program," i.e. optimization.

Simplification and conversion passes each transform from one IL to a new IL, analysis and improvement transform from one IL to the same IL, and verification traverses but does not transform.

During debugging, verification passes should be turned on, and conformance against the IL definition should be checked after each pass. During normal operation, some or all of them may be turned off, or selectively enabled using compiler flags.
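
A minimal sketch of what an IL definition plus verification pass could look like (the node shape and names are illustrative, not jeff65's actual classes):

from dataclasses import dataclass, field

@dataclass
class Node:                    # stand-in shape; the real nodes differ
    t: str
    children: list = field(default_factory=list)

# An IL is a table from node type to the child node types it may contain.
IL_DESUGARED = {
    "unit": {"fun"},
    "fun": {"let", "call"},
    "let": {"literal"},
    "call": {"identifier", "literal"},
    "literal": set(),
    "identifier": set(),
}

def verify_il(node, il):
    """Verification pass: traverse without transforming, raising on nodes
    that don't conform to the IL definition."""
    allowed = il.get(node.t)
    if allowed is None:
        raise TypeError(f"node type {node.t!r} is not part of this IL")
    for child in node.children:
        if child.t not in allowed:
            raise TypeError(f"{node.t!r} may not contain {child.t!r}")
        verify_il(child, il)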

Unparenthesized calls

Are we interested in supporting some variety of non-parenthesized calls, a la Ruby and Lua? There are a few ways to go with this.

The first is to say no, all function calls are always parenthesized. This is okay.

The second is to say that function calls don't have to be parenthesized ever except where it would be ambiguous. In my experience, it's not always obvious to a programmer where the ambiguity would pop up, and usually involves some confluence of nested comma-bearing expressions, so I don't like this one.

The third is to say that function calls may only omit their parentheses if they have exactly one argument. This is what Lua does. It is naturally unambiguous, even in chains of functions (i.e. g f x parses as g(f(x)), though I wouldn't necessarily advocate that kind of usage in practice).

An extension of this is to say that function calls may also omit their parentheses if they have no arguments. This is problematic for a couple of reasons. The first is that, while it allows you to have really cute getters on your objects which look just like properties, it's not super obvious how much computation is going to happen on a given line. The second is that it makes taking function pointers slightly more confusing.

Ironically, I find it a little confusing that in C, if foo is a function, &foo and foo evaluate to the same thing, so I'm in favour of having &foo mean "a pointer to foo", while foo means "the value of foo", which could mean either the literal bytes of the function, or the result of evaluating the function with no arguments. However, that introduces a bit of sneakiness--now, in order to take a pointer to the result of a function, you have to write &(foo). (The other precedence order doesn't really make sense, since it becomes impossible to write out function pointers.)

(Incidentally, taking pointers to results of functions isn't necessarily a terrible idea in gold-syntax-- unless we implement copy elision or some kind of optimization, our stackless calling convention means that the pointer will be valid until the next call to the function!)

Compiled unit format should not use pickle

As of commit 0c95371, the compiled unit file format is simply a jeff65.blum.symbol.Archive object which has been dumped to disk using the pickle module. This is probably a terrible idea, as according to the pickle module docs,

The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.

This means that sharing compiled unit files is a massive security hole. Therefore, we should consider replacing the compiled unit file format with either

  1. a custom binary format (fun to write, but possibly a source of bugs),
  2. a msgpack serialization (easy to use, adds a dependency),
  3. a JSON serialization (eew, but easy and included in the stdlib),
  4. some horrendous perversion of the ELF format (...), or
  5. something else?
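
For example, option 2 could look something like the following sketch (the field names are placeholders, not the real Archive attributes):

import msgpack

MAGIC = b"blum"

def dump_archive(symbols: dict, path: str) -> None:
    with open(path, "wb") as f:
        f.write(MAGIC)
        f.write(msgpack.packb({"version": 1, "symbols": symbols}))

def load_archive(path: str) -> dict:
    with open(path, "rb") as f:
        if f.read(4) != MAGIC:
            raise ValueError("not a blum archive")
        data = msgpack.unpackb(f.read(), raw=False)
    return data["symbols"]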

Hello world program

  • Decide on the goal program
  • Create issues and project cards for the required steps
  • Compile and run the program

Just to make the goal clear for the "Hello world!" milestone, here's the test program:

use kernal

fun print(text: &[u8])
  for c: u8 in text do
    kernal.chrout(c)
  end
endfun

fun main()
  print("hello, world!\n")
endfun

Original commentary:

So, what's interesting here? Well, we have some basic control flow, a function call, all good. We have a slice. Additionally, we have a KERNAL call. KERNAL calls have "weird" (in jeff65 terms) calling conventions, so this could (provisionally) be done in a few ways:

  1. kernal.chrout is a real, honest-to-god linker symbol with type fun(u8), value 0xffd2, and an associated calling convention.
  2. kernal.chrout is a compiler intrinsic which emits the appropriate assembly directly.
  3. kernal.chrout is a real function implemented in assembly which puts its argument in the A register and then does jsr $ffd2.

Each of these is interesting in a different way. Doing 1 right out of the gate means that we ensure that our code generation for function calls isn't hard-coded to a particular convention, but means we have to support two of them. Doing 2 might just be a bad idea--my gut feeling is that that's not really what compiler intrinsics are for. Doing 3 means that we implement the ability to (a) load the symbol table from referenced units and use it, and (b) link multiple units together. (I think technically we can already do that last thing, but there's no support for it in the command-line driver.)

Obviously 1 is the "correct" way to do it long-term, but 3 is good enough for now I think.

For bonus points, implement an "improvement" pass (NFCE terminology) which notices that the range variable c is only used as a function argument, and loads it directly into the argument slot for the kernal.chrout function.

Annotation/attribute syntax

We... should probably figure out our annotation/attribute syntax at some point. Specifically, #31 has me concerned, because once you have a vmfun, you need vmrefun, and also for vmtfun you need vmtrefun and now you have exponential blowup of keywords and you look at your code with your vmtxz7qfun and you're all like, this is no fun at all.

Additionally, if we target other 6502 machines, we'll want to be able to mark functions as C64-only or Apple-II-only or Atari-only or whatever.

And then if we targeted any 16-bit machines (@kpmgeek is making m68k noises, which I think I pooh-poohed at the time, but hey), all of our existing *fun keywords would suddenly mean exactly the same thing, with the possible exception of refun.

I remember we earmarked {} as a possibility for annotations/attributes, though a small part of me wants to use them for comments. Here are some existing syntaxes. Common to all of them is that they go on the line above the thing they're annotating, e.g. a function.

#[foo(a, b=c), bar, spam=eggs]     /* Rust-style */

[<foo(a, b=c); bar>]    /* F#-style */

[foo(a, b=c), bar]      /* C#-style */

@foo(a, b=c)            /* Java/Python-style */
@bar

Rust also has a syntax to indicate that the attribute applies to the current module #![...], which is handy--we definitely want to mark the vic-ii module as C64-only.

None of these really call to me, though the order I've put them in is roughly the order I'd rank them for gold-syntax suitability. I think putting the attribute before the function is probably the best option. Putting it inside the definition makes it harder to bump them to another line, which is awkward for long attributes, and putting it inside the body makes it hard to apply them to other kinds of toplevel definition which don't have bodies.

The function-call-esque syntax is pretty common. This makes sense though, since all of them are function/constructor/macro calls (though in Rust this wasn't the case until recently, so). I'm not sure it makes sense for us, but I'm also not sure it doesn't make sense for us.

Yes, the F# syntax is super ugly, but you can be sure that it's not going to get confused for anything else.

Put storage class after colon

Right now, a let declaration looks like this:

let mut x: u16 = 0

We've agreed that it's easier to read if the storage class modifies the type rather than the name:

let x: mut u16 = 0

Iteration protocol

It'd be neat to allow for ... in ... do ... end loops to work on things other than the built-in ranges and arrays. Since we're not object-oriented (and afaik have no aspirations to be so), I'd like to suggest the following protocol for iterable types. For an iterable type foo, define three methods in the same unit (where T and U are arbitrary types):

  • a foo#iter function to create the iterator, one of:

      fun foo#iter(x: &foo) -> T ... endfun
      fun foo#iter(x: foo) -> T ... endfun
    
  • a foo#iter-next function to advance the iterator, one of:

      fun foo#iter-next(x: &foo, i: &mut T) -> bool ... endfun
      fun foo#iter-next(x: foo, i: &mut T) -> bool ... endfun
      fun foo#iter-next(i: &mut T) -> bool ... endfun
    
  • a foo#iter-value function to get the current value of the iterator, one of:

      fun foo#iter-value(x: &foo, i: &T) -> U ... endfun
      fun foo#iter-value(x: &foo, i: T) -> U ... endfun
      fun foo#iter-value(x: foo, i: &T) -> U ... endfun
      fun foo#iter-value(x: foo, i: T) -> U ... endfun
      fun foo#iter-value(i: &T) -> U ... endfun
      fun foo#iter-value(i: T) -> U ... endfun
    

Yes, the # is part of the function name, so I'm suggesting that name for "protocols". Note that we allow some wobbliness in the protocol as to by-value/by-address passing, and whether or not the iterated-upon object is passed in; as long as the compiler can resolve a function from each of the three groups, the object is considered iterable.

An (example, unnecessary) implementation for a struct wrapping an array of u8s (using unreviewed struct syntax):

struct my-u8s
    values: [u8; 10]
endstruct

fun my-u8s#iter(x: &my-u8s) -> u8
    return 255
endfun

fun my-u8s#iter-next(i: &mut u8) -> bool
    @i += 1 --[[ this overflows to 0 on the first call ]]
    return @i < 10
endfun

fun my-u8s#iter-value(x: &my-u8s, i: u8) -> u8
    return x.values[i]
endfun

Given the above, we can write a for ... in ... do ... end loop:

let z: my-u8s = ...
for x: u8 in z do
    ...
endfor

... which would desugar to something like:

let z: my-u8s = ...
let mut _z#iter = my-u8s#iter(&z)
while my-u8s#iter-next(&mut _z#iter) do
    let x: u8 = my-u8s#iter-value(&z, _z#iter)
    ...
end

Additional functions could be defined for mutating iterators (for mut ... in ... do ... end), e.g. a foo#iter-mut-value which returns a mutable pointer.

Nulls/Optionals

In #6 and #19, we discussed pointers, and in #53, error handling. This neatly leads on to the question of nullable pointers.

In my mind, our existing pointers aren't nullable; we don't even have a way to create a null pointer at the moment (there's no null keyword or equivalent, and no concept of default initialization). However, it's generally useful to be able to indicate that a function didn't return a useful value.

It's actually pretty hard to accidentally invent a null pointer on the C64, since the entire address space is addressable. The best candidate for a null value is probably 0xffff, which points into the middle of the hardware ISR vector, and is at the end of memory, so a pointer with value 0xffff is totally useless. But it also has a chance to bone your hardware good if you try to write to it, since it'll clobber the high byte of your ISR vector, and possibly also your processor port direction register at 0x0000 if it wraps around.

In general, I'm a fan of systems which (a) distinguish between nullable and non-nullable pointers, and (b) force you to check a nullable pointer before using it. We could just extend the error-code system I suggested in #53, but that means flags for everything ever forever. That's not necessarily the worst approach.

The other option would be some way to make types nullable, which either adds a flag or chooses a sentinel value as appropriate for the type. (This is what Rust does -- Option<u16> is three bytes wide, and carries a flag; Option<&u16> is two bytes wide, and uses a sentinel value.)
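
A sketch of the sentinel representation for a hypothetical nullable &u16, which stays 2 bytes wide:

# 0xffff is reserved as the "no value" encoding, per the discussion above.
NULL_SENTINEL = 0xFFFF

def encode_optional_ptr(address):   # address: int or None
    return NULL_SENTINEL if address is None else address

def decode_optional_ptr(raw: int):
    return None if raw == NULL_SENTINEL else raw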

Inline string literals / string constants

  • make decisions about the spiel below
  • change the spec if necessary

I want to be able to write

console.print("Hello, world!")

This is currently semantically problematic, and fixing that is simple enough, but I'm concerned that we're losing something if we do (specifically, an easy way of auditing program image size).

A little background explanation. In the spec, the constant statement's behaviour is defined as follows:

Binds a name to a value known at compile time which does not allocate memory in the program image. The value will be inlined at usage sites. Top-level constant bindings are exported from the unit as symbols, and may be referenced in other units.

The restriction to values which do not allocate memory means that arrays and strings cannot be declared as constant-bindings.

This wording stems from the fact that, when writing the spec, I struggled to come up with the difference between non-mut toplevel let-bindings and constant-bindings. In retrospect, I suspect that a simpler approach is better:

constant takes a "known" expression on the RHS, and binds the identifier given on the LHS so that it is also considered "known". (The rules for "known" expressions haven't been formally stated yet, which is probably a doc bug, but they go as you'd expect.) Crucially, "known" expressions are always inlined at usage sites. This implies that pointers cannot be taken to "known" expressions.

Toplevel non-mut let takes a "known" expression on the RHS, and binds the identifier given on the LHS so that it is not considered "known". This implies that it cannot be inlined at a usage site, and therefore must allocate memory in the program image to store the value. This, in turn, implies that pointers may be taken to such expressions.

Now, when writing an inline string literal (as above), the expected behaviour is that the string will be allocated in the program image, and it will be passed in some way (almost certainly as a slice) to the function, and that the actual numbers for that slice will be inlined.

This means that an inline string literal is a "known" expression. Therefore, it makes sense that we can define constants with string (and by extension, array) values, the result of which is that the string is allocated in the program image, and the slice addressing the string is bound to the LHS identifier as a "known" value.

This means that one can no longer rely on constant statements to not take up program image space, which makes auditing image size slightly harder, which bothers me a little. On the other hand, the fact that I plan to have the linker automatically eliminate unreferenced functions kind of makes it hard to tell where the space in your program image is coming from anyway, so maybe it's not that big of a deal. Maybe we could provide a tool which produces a breakdown of what's going into the program image, and where it comes from?

Type inference

I think it could be helpful to provide simple type inference for let and for statements, to make the language a little cleaner. So the following would be allowed:

let x: u8 = 7
let y = x  --[[ y is u8 ]]

let z: [u16; 4] = [1, 2, 3, 4]
for w in z do
    --[[ w is u16 ]]
end

Since the type-checker has to derive what y and w should be anyway, it should be simple to allow it to just assume the type. (One might want y or w to be a longer type, but we'd still have to derive u8 and u16 respectively to determine which types are longer.)
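
The inference step itself is then nearly free; a sketch (all names illustrative):

# The binding's type is whatever the checker already derived for the RHS.
def check_let(node, derive_type, coercible):
    rhs_type = derive_type(node.rhs)        # needed for checking regardless
    if node.declared_type is None:
        node.declared_type = rhs_type       # inference is just this line
    elif not coercible(rhs_type, node.declared_type):
        raise TypeError(f"cannot coerce {rhs_type} to {node.declared_type}")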

Product type (struct) declaration syntax

Discuss design of struct declaration syntax. So far we have the struct and endstruct keywords reserved.

  • choose struct declaration syntax
  • choose direct member access syntax (the . operator from C)
  • choose indirect member access syntax, if any (the -> operator from C)

Sum type (union) syntax

Do we want unions? Would we prefer discriminated unions (i.e. you can tell at-runtime which branch of the union it is)? I am a huge fan of discriminated unions but they might be a feature to put off until later.

  • decide if we want regular unions
  • decide if we want discriminated unions
  • decide on access syntax

Feature proposal: VM-based functions (vmfun)

Apparently one time-honored way of generating smaller 6502 code is to add a bytecode VM and implement some functions on the VM. This produces much smaller code which runs slower. One well-known example of this is Wozniak's SWEET16 for the Apple II. People have even written JVMs for the 6502.

I propose that we provide a way of indicating that some functions should be compiled to a VM which we ship as part of the runtime. The VM would only be included if the final image includes functions compiled for it. Based on previous work, a syntactic option would be vmfun, to go with fun and refun, but I could see that getting out of hand.

An additional option that could be included with this is to allow compilation to threaded form, i.e. instead of a bytecode stream, the compiler generates the jumps directly to the subroutines that implement the opcodes. This provides an intermediate size-speed tradeoff between machine code and bytecode. (This doesn't even require an extra target--just a different bytecode assembler.)
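
For illustration only (this is not jeff65 code), the difference between the two dispatch styles:

# Bytecode pays for a fetch/decode loop on every step; threaded code is just
# a sequence of handler addresses executed in order.
def run_bytecode(code, handlers, state):
    pc = 0
    while pc < len(code):
        op = code[pc]                       # fetch + decode
        pc = handlers[op](state, code, pc + 1)  # handler returns the next pc

def run_threaded(thread, state):
    for handler in thread:                  # on a 6502: a chain of jmp/jsr
        handler(state)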

Pros:

  • Allows an avenue for programmers to optimize aggressively for size over speed, if that's what they need.
  • Removes some pressure on making every single language element maximally 6502-friendly at the cost of usability. (We already do this to some extent, since we support multiplication and division as infix operators, even though they'd actually be a call into the runtime.)
  • Forces us to make sure that the compiler is retargetable, since we'd effectively have two targets now.

Cons:

  • Compiler now has to support two targets properly, meaning increased testing surface area.
  • Understanding compiler output would require additional work.

Questions:

  • Are there any features that should be restricted to vmfuns, or should the language be the same regardless of function type?
  • Should we provide a way of compiling blocks into bytecode without creating a whole new function? If we do this, do we need a separate function type?
  • Do bytecode functions require a different calling convention?

Prior art:

  • SWEET16 -- A bytecode interpreter by the Woz which was included in the Apple II ROM. Sadly, it's Copyright Apple 1977, and would have been out of copyright by now except for RETROACTIVE COPYRIGHT EXTENSIONS. THANKS DISNEY.
  • VM02 -- A JVM for the 6502. No, I don't know how.
  • PLASMA -- A language for 6502 with similar goals to ours, actually. Actively developed, and by the same guy who did VM02 (above).
  • Not really prior art, but David A. Wheeler's 6502 Language Implementation Approaches is full of interesting and useful information. I'd already been considering some of the approaches described, so it's nice to have prior art to build on.

Error handling

We should probably think about our story for this at some point.

We could add generics/templates and discriminated unions and go for a Result type, but that's... a lot.

We could do multiple return. Maybe say that you return your error first, then your value, and provide a syntax...

fun foo(a: u8) -> err, u8
  -- do something
endfun

fun bar() -> err, u8
  let x = try foo(7)    -- causes early-return on error
  return x + 7          -- returns ok as the error
endfun

-- bar desugars to this, where e is some kind of gensym
fun bar() -> err, u8
  let e, x = foo(7)
  if e != ok do  return e  end
  return ok, x + 7
endfun

EDIT: Of course, err is a new type... it's one byte wide, has the special value literal ok, and allows 1-255 as literals. You can't do arithmetic with it or assign u8s to it. If a conditional expression has type err, ok is considered truthy.

Specify rules for known-expressions

As mentioned in #49, the rules for what constitutes a "known-expression" in the spec haven't been formally specified.

On the one hand, it's kind of obvious (literals are known, constant bindings are known, expressions whose subexpressions are all known are themselves known), but it wouldn't hurt to spell it out explicitly in case there are some non-obvious corner cases.

In particular, some expressions which look like function calls are known, and some are not, while function identifiers themselves should probably be considered known.
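
Spelled out as a sketch over a made-up node shape:

def is_known(node, constants):
    if node.kind == "literal":
        return True
    if node.kind == "identifier":
        # constant bindings (and probably function identifiers) are known
        return node.name in constants
    # an expression is known when all of its subexpressions are known
    return all(is_known(sub, constants) for sub in node.subexpressions)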

Get Sphinx/RTD set up properly

We're set up on ReadTheDocs (https://jeff65.readthedocs.io/en/latest/) but we're not actually generating real documentation. We should at least be including the specs for now.

Also, the commit hook still needs to be set up. I have the keys for that, so once Sphinx is ready, this needs to be assigned to me to sort that out.

  • Set up Sphinx
  • Enable RTD webhook

Run VICE-based tests in CI

PR #59 introduced a harness for testing gold-syntax programs by running them under VICE. Currently, these cannot be run under either of our CI setups; it currently hangs under Travis' Ubuntu Trusty, which ships an old version of VICE (without ROMs, though that's easy to fix), and AppVeyor uses Windows, which is unsupported (see #60).

It doesn't really matter which CI system the tests get run under. If fixing #60 gets us AppVeyor support, that's cool, though we won't get coverage. I'm not really sure what the best way to get it working under Travis is, though the two options that stand out are getting VICE 3.2 binaries into the build environment, or modifying the driver to work with VICE 2.4.

Logging system

Issue to track adding a proper system for logging. This includes debug messages, compiler warnings and errors, etc.

(I did this work in an unpushed branch, mixed with other stuff... again. 🤦‍♂️ )

Linker should include type assertions for relocations

At the moment, relocations are untyped. Therefore, if I recompile a unit, changing the types of some of the toplevel declarations, and relink it with a previously-compiled unit referring to those toplevels, an incorrect link is possible.

The linker should check whether the type assertions agree with the ABI of the referenced unit, and if they do not match, attempt to recompile the referencing unit before linking. If this is not possible (either because the changed ABI causes the source to no longer compile correctly, or because the source is unavailable), an error should be raised.
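
A sketch of the check, with made-up shapes for relocations and the symbol table:

from dataclasses import dataclass

@dataclass
class Relocation:
    symbol: str      # e.g. "kernal.chrout"
    site: int        # offset of the fixup in the referencing unit
    type_sig: str    # type assertion, e.g. "fun(u8)"

def check_relocation(reloc: Relocation, exports: dict) -> None:
    actual = exports.get(reloc.symbol)
    if actual is None:
        raise LookupError(f"undefined symbol {reloc.symbol}")
    if actual != reloc.type_sig:
        raise TypeError(
            f"{reloc.symbol}: compiled against {reloc.type_sig}, "
            f"unit now exports {actual}; recompile the referencing unit")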

Identifier characters

Currently, pretty much any character is allowed in identifiers except for ()[]{}:.,"\, as long as the identifier begins with an alphabetical character. This is probably a little too liberal, though I would definitely like to preserve the following characters:

  • - -- I rather like kebab-case since you don't have to wildly mash the shift key to type multi-word identifiers. Lisps typically name things like this, though it does require that if you want to infix-subtract from a constant you have to put a space between them, which seems reasonable anyway.
  • ? -- useful for predicates. Ruby uses this.
  • ! -- useful for marking functions which change their arguments. Ruby again.
  • ' -- useful for marking helper/variation functions. Haskell this time.

I'm inclined to be pretty liberal about what goes in identifiers at first, especially since this isn't an especially punctuation-heavy language, though allowing ; and back-ticks inside identifiers rubs me the wrong way somehow.
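
For reference, the current rule as a sketch regex:

import re

# An alphabetic start, then anything that isn't whitespace or one of the
# reserved punctuation characters listed above.
IDENTIFIER = re.compile(r'[A-Za-z][^\s()\[\]{}:.,"\\]*')

for ident in ("screen-corner", "empty?", "sort!", "iter'"):
    assert IDENTIFIER.fullmatch(ident)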
