rust-hosted-langs / book

Writing Interpreters in Rust: a Guide
Home Page: https://rust-hosted-langs.github.io/book/
License: Creative Commons Attribution 4.0 International
The CloseUpvalues opcode can take three operands, i.e. up to three upvalues can be closed in one operation. The present implementation lazily closes only one.

See TODO: interpreter/src/compiler.rs:247

Refactor the loop to add up to three upvalues per CloseUpvalues opcode.
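The refactored emission loop might batch upvalues in groups of up to three. A minimal sketch of that shape, assuming a hypothetical Opcode type and operand layout (not the compiler's actual API):

```rust
// Hypothetical sketch: group upvalues into batches of at most three and
// emit one CloseUpvalues opcode per batch. Opcode and the Vec<u8> operand
// layout are stand-ins, not the interpreter's real types.
#[derive(Debug, PartialEq)]
enum Opcode {
    CloseUpvalues(Vec<u8>), // up to three register operands per instruction
}

fn emit_close_upvalues(upvalue_regs: &[u8]) -> Vec<Opcode> {
    upvalue_regs
        .chunks(3) // at most three upvalues closed per opcode
        .map(|group| Opcode::CloseUpvalues(group.to_vec()))
        .collect()
}
```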
I originally wrote the parser to generate a nested Pair tree data structure, just as Lisp uses cons cells. I did this to simplify bootstrapping into a compiler when I didn't have more complex data structures (such as the Array types) available yet.

Since this Lisp-style cons cell data structure is archaic, is it worth refactoring the parser and compiler around the List type? The List type is an Array<TaggedCellPtr>, so, like most dynamic language arrays, it is a heterogeneous array. Code would then be built from the List type rather than Pairs.

This seems wrong. Since these types are just storing word-sized pointers, and the behavior we want is to be able to copy them optimally, there should be no other reason why they can't be Copy.

There must be a crate out there that provides Cell mechanics while also being Copy; otherwise, maybe we need to define our own.
Hi, thanks for the book. It is very helpful.

I don't get the rationale for having a custom allocator, though; shouldn't it be explained at the start? Wouldn't the default Rust allocator be good enough for this?
I was reading over the section about dereferencing safely and noticed that there was not any way to root objects. I don't know if this is already planned to be covered in the section on garbage collection, but I think it is a really interesting problem that doesn't have an easy solution. It seems like the two approaches used in Rust thus far are either mutation sessions, where you can only access heap objects in a closure, or having some macro with drop guards. Each has its own trade-offs, and I am curious what rust-hosted-langs recommends.
Currently, the allocator only allows objects up to 32k (the Immix block size); large objects are not supported.

The code path in stickyimmix/heap.rs:114 that raises an error should implement some other mechanism for tracking large objects.

I haven't thought this through yet, but my initial half-baked thought has been to keep a list (linked list? Vec?) of large objects, where the memory for each large object might be backed by a Vec<u8> and the object header stored in the initial word(s).
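As a half-baked illustration of that idea (all names here are hypothetical, not the allocator's real types), each large object could own its backing Vec<u8> with the header word at the front:

```rust
use std::mem::size_of;

// Hypothetical sketch: a large object backed by its own Vec<u8>, with a
// one-word header stored in the initial bytes and the payload after it.
struct LargeObject {
    mem: Vec<u8>,
}

impl LargeObject {
    fn new(header: usize, payload_size: usize) -> LargeObject {
        let mut mem = vec![0u8; size_of::<usize>() + payload_size];
        // store the header word at the start of the allocation
        mem[..size_of::<usize>()].copy_from_slice(&header.to_ne_bytes());
        LargeObject { mem }
    }

    fn header(&self) -> usize {
        let mut word = [0u8; size_of::<usize>()];
        word.copy_from_slice(&self.mem[..size_of::<usize>()]);
        usize::from_ne_bytes(word)
    }

    fn payload(&mut self) -> &mut [u8] {
        &mut self.mem[size_of::<usize>()..]
    }
}

// The heap would then keep a simple list of these, tracked and swept
// separately from the 32k Immix blocks.
struct LargeObjectList {
    objects: Vec<LargeObject>,
}
```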
See the discussion on rust-hosted-langs/runtimes-WG#11 regarding the nanopass concept.

In my opinion this is a very relevant idea for this book, given that half the purpose is to provide a baseline for others to build on (in addition, of course, to making it easier to implement LSP or analytics.) Could the interpreter code be refactored to move toward the nanopass concept? Separate crates for parsing and compiling?

We need a chapter on allocating memory. I'm specifically thinking of bump allocation.
We need a chapter that outlines the object model the VM will use, what the trade-offs are, etc, etc.
Instead of the enum TypeList in interpreter/src/headers.rs, we should explore a better means of type identification.

For example, can we generate a compile-time const for each object type that can be used to map to a trait object, where the trait object provides core runtime functions for that object type? This would be less verbose than the enums and the mapping between tagged pointer types, possibly at some cost. Perhaps some combination of the two approaches would make sense?
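A sketch of the const-per-type idea, with hypothetical names standing in for the interpreter's actual types: each type's const indexes a static table of trait objects supplying per-type runtime behaviour.

```rust
// Hypothetical sketch: compile-time const identifiers index a static table
// of trait objects. None of these names are the interpreter's real types.
trait RuntimeType: Sync {
    fn name(&self) -> &'static str;
}

struct PairType;
struct SymbolType;

impl RuntimeType for PairType {
    fn name(&self) -> &'static str { "Pair" }
}

impl RuntimeType for SymbolType {
    fn name(&self) -> &'static str { "Symbol" }
}

// Consts that an object header could carry in place of an enum variant.
const TYPE_PAIR: usize = 0;
const TYPE_SYMBOL: usize = 1;

// A single table lookup replaces the enum match.
static TYPES: [&'static dyn RuntimeType; 2] = [&PairType, &SymbolType];
```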
The TaggedPtr type allows for size_of::<isize>() * 8 - 2 bits of integer to be stored in a pointer.

Support for numbers needs to be implemented: the (+, -, integer div, *) opcodes. This initial implementation should return RuntimeError for values and results that fall outside of the supported TaggedPtr range.
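The range check can follow from the 2 tag bits directly: a value fits if it survives a round trip through them. A minimal sketch, with RuntimeError as a stand-in for the interpreter's actual error type:

```rust
// Sketch: a value fits in a TaggedPtr if shifting the top 2 bits out and
// sign-extending back is lossless. Names here are illustrative stand-ins.
const TAG_BITS: u32 = 2;

#[derive(Debug, PartialEq)]
struct RuntimeError; // stand-in for the interpreter's error type

fn fits_tagged(n: isize) -> bool {
    // round trip through the tag bits must preserve the value
    (n << TAG_BITS) >> TAG_BITS == n
}

fn tagged_add(a: isize, b: isize) -> Result<isize, RuntimeError> {
    match a.checked_add(b) {
        Some(sum) if fits_tagged(sum) => Ok(sum),
        _ => Err(RuntimeError),
    }
}
```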
A short chapter of the book might be written to describe this.
Consider splitting this issue into multiple issues to break it up, perhaps an issue for each area of the code and one for documentation.
It should be fairly straightforward to adjust the allocator to bump-allocate downward from the end of a block rather than upward from the beginning. See https://fitzgeraldnick.com/2019/11/01/always-bump-downwards.html for an explanation.
Bump allocation happens inside an available set of Immix lines. The "find-next-line" operation can still look upward for the next set of free lines, but within lines, allocation should bump downward from the end of the lines.
Things to take into account:
stickyimmix/src/bumpblock.rs
booksrc/chapter-simple-bump.md
I'm not thrilled with the level and quality of unit testing yet, so consider what new tests can improve coverage and corner cases as part of this.
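A minimal sketch of downward bumping, following the linked post; the real BumpBlock fields in stickyimmix/src/bumpblock.rs differ, and the names here are hypothetical:

```rust
// Sketch: the cursor starts at the high end of the free lines and moves
// down toward `limit`. Rounding down to the alignment is a single mask,
// which is the efficiency win described in the linked post.
const ALIGN: usize = 8; // word alignment, must be a power of two

struct BumpDown {
    cursor: usize, // offset of the previous allocation (starts at block end)
    limit: usize,  // lowest offset available in the current free lines
}

impl BumpDown {
    fn alloc(&mut self, size: usize) -> Option<usize> {
        // subtract, then round down to the alignment in one mask
        let next = self.cursor.checked_sub(size)? & !(ALIGN - 1);
        if next < self.limit {
            None // out of space in this set of free lines
        } else {
            self.cursor = next;
            Some(next) // offset of the new object within the block
        }
    }
}
```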
Is FatPtr (interpreter/src/taggedptr.rs) necessary? Can it be eliminated?

Value (interpreter/src/taggedptr.rs) and TaggedScopedPtr (interpreter/src/safeptr.rs) appear to be near-identical; can we merge them?

Edit book chapter booksrc/interp-tagged-symbols.md to reflect changes.
The book should include a chapter (or series of multiple chapters) discussing the basics of garbage collection and how to tackle this using Rust. The end goal should be a set of resources to implement a naïve, non-moving mark-and-sweep garbage collector. While such a GC won't be very useful for most applications it's easy enough to understand/explain that it should serve as a good introduction.
I chose MPLv2 as a middle way between copyleft and the Rust community MIT/Apache default. Should a different licensing scheme be used? Why?
Bob Nystrom continues to inspire
See interpreter/src/bytecode.rs:293 - multiple dereferences are required to fetch a single opcode.

Ideal state: an instruction pointer that is a pointer to the next opcode.

Realistically, that might not be sensible, given that we probably want to retain bounds checking. What about extending Array with a RefCell-style ArrayOpcode slice borrow? That's as close to a bare pointer in safe Rust as we can get.
Files:
interpreter/src/bytecode.rs:293
booksrc/chapter-interp-bytecode.md
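A related pattern worth considering: treating a borrowed opcode slice itself as the instruction pointer. Advancing is a split_first, which is about as close to a bare pointer as safe Rust gets while retaining bounds checks. Opcode here is a hypothetical stand-in for the real bytecode type:

```rust
// Sketch: the slice IS the instruction pointer; split_first advances it.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Opcode {
    Nop,
    Halt,
}

fn run(mut ip: &[Opcode]) -> usize {
    let mut executed = 0;
    while let Some((op, rest)) = ip.split_first() {
        ip = rest; // advance the "instruction pointer"
        executed += 1;
        if *op == Opcode::Halt {
            break;
        }
    }
    executed
}
```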
There are options in Rust, depending on circumstances and requirements.

Nightly only:
alloc features on nightly Rust

Stable:
libc::posix_memalign() (and equivalents on other platforms that I'm not familiar with)
Using a Vec as a *mut u8
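The "Vec as *mut u8" option works by over-allocating by the alignment and rounding the start address up. A sketch for illustration (the Vec must be kept alive as long as the pointer is in use; this is not the allocator's real code):

```rust
// Sketch: allocate size + align - 1 bytes and round the base address up
// to the next multiple of `align`. The returned Vec owns the memory.
fn alloc_aligned(size: usize, align: usize) -> (Vec<u8>, *mut u8) {
    assert!(align.is_power_of_two());
    let mut backing = vec![0u8; size + align - 1];
    let addr = backing.as_mut_ptr() as usize;
    let aligned = (addr + align - 1) & !(align - 1); // round up to alignment
    (backing, aligned as *mut u8)
}
```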
Implement simple stop-the-world tracing for all objects in all blocks, including large objects.

This code should live on the interpreter side, as the entry point to tracing an object is the object header, which is implemented on the interpreter side.

Implement a Trace trait with a trace() method. Objects such as symbols and integers that do not get traced into may have a default no-op implementation of this method. The trace() method should call ObjectHeader::mark() on the object. The trace() method may append passed-in objects to a Vec argument to maintain a gray list. There are probably other ways to do this.

I chose CC-BY-4.0 as a default to begin with. What other licenses should be considered and why?
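The tracing scheme described in the issue above (mark bit, trace() appending children to a Vec gray list) can be sketched minimally, with an index-based heap standing in for real object headers:

```rust
// Simplified mark phase: roots seed a gray list held in a Vec; tracing an
// object marks it (the ObjectHeader::mark() step) and appends its children.
// `Obj` is an illustrative stand-in, not the interpreter's object model.
struct Obj {
    marked: bool,
    children: Vec<usize>, // indices of referenced objects
}

fn mark_from_roots(heap: &mut [Obj], roots: &[usize]) {
    let mut gray: Vec<usize> = roots.to_vec();
    while let Some(id) = gray.pop() {
        if heap[id].marked {
            continue; // already traced into
        }
        heap[id].marked = true; // ObjectHeader::mark()
        let children = heap[id].children.clone();
        gray.extend(children); // trace() appends to the gray list
    }
}
```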
Implement dead object sweeping across all blocks.
Depends on:
There needs to be a chapter discussing a simple bytecode format and a parser to use in the VM. I would recommend not using heavy-weight crates such as serde and instead focus on writing this by hand. This may require a bit more boilerplate but I believe it will make it easier to understand the code/concepts involved.
The book needs a chapter dedicated to the interpreter loop, how it consumes bytecode, etc. rust-hosted-langs/runtimes-WG#3 lists some techniques, but for the book I think it's best if we focus on the usual loop + match approach, since it can be done using regular Rust code.
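The loop + match shape in plain Rust looks like this; the opcode set is a toy stand-in for whatever bytecode the book settles on:

```rust
// A toy stack machine dispatched with the usual loop + match.
#[derive(Clone, Copy)]
enum Op {
    Push(i64),
    Add,
    Halt,
}

fn run(code: &[Op]) -> Option<i64> {
    let mut stack: Vec<i64> = Vec::new();
    let mut pc = 0;
    loop {
        match code[pc] {
            Op::Push(n) => stack.push(n),
            Op::Add => {
                let b = stack.pop()?;
                let a = stack.pop()?;
                stack.push(a + b);
            }
            Op::Halt => return stack.pop(),
        }
        pc += 1; // advance to the next opcode
    }
}
```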
Originally, Array didn't have runtime-borrow capability, and I made the (hasty) decision to base the Array struct on Cell<RawArray<T>>.

It is clear that it makes far more sense to use RefCell now, both on a semantic and a performance basis. The explicit borrow flag can also be removed.
Source: interpreter/src/array.rs:42
Book: booksrc/chapter-interp-arrays.md
My understanding from reading the other issues is: the idea of the book is that different concepts are explored and demonstrated with a simple toy language.

I see where this is coming from, but in my opinion it leaves out the most important step in the process: setting real requirements. If choices between bytecode formats or garbage collection strategies are arbitrary for the provided example, it's not really an example of a meaningful choice. It also removes the accountability that these choices were actually the right ones.

I get that we don't want to be buried in too much detail, but I think that the semantic complexity of an embeddable dynamic language built for real-world usage is not that much greater than that of a toy language that actually shows anything interesting.

The one thing that would be vastly more complex is the "chapter" about language design itself and all the parsing machinery, at least in comparison to the proposed S-expression-based language. I propose to just leave that out and replace it with a textual intermediate representation that is effectively an assembler for the bytecode. A chapter about parsing a super-simplified language that we don't actually want to write, and that is not the focus of the book, probably doesn't have much value anyway.
We need a chapter that outlines the syntax and language semantics that we will use for the rest of the book. Using an S-expression-based language would be best since it makes parsing super easy. Semantics-wise, I wouldn't lock down on either OO or purely functional; instead it would probably be a hybrid between the two.
First of all, great work on the book, I've been really enjoying reading through it!

I noticed that the implementation of TaggedPtr represents nil as a pointer with a tag of 0. I've also noticed that integers are represented with a tag of 0.

Would this not make a literal 0 be equal to nil, and mean that a zero integer would always be converted to a nil FatPtr (due to this code here)? Not sure if this is an intentional part of the design or not :)
The book text switches inconsistently between a future and present tense. Pick one style and apply it consistently.
Is there any special consideration behind why this is, or is it just to make BumpBlock more cache-friendly?
Numbers that overflow (size_of::<isize>() * 8) - 2 bits should be stored in NumberObject objects.

I have no experience implementing number objects, and I understand that there may be commonly used or standard schemes and/or libraries. Needs an expert.

Depends on #22, NumberObject (we can rename this).

As an extension of #2, we need a chapter discussing more advanced garbage collection techniques.
For this chapter we probably want to focus on a specific algorithm such as Immix. I certainly wouldn't call Immix trivial, but it's not that hard either (the paper is just really really dense).