ducklogic / zerogc
Zero-overhead tracing garbage collection for Rust (WIP)
License: MIT License
This should work cleanly for `CollectorId`s of different types. Mixing multiple collector ids of the same type is unlikely to be supported right now.
I already have a doctest demonstrating the expected behavior.
All that remains is to get the procedural derive to support it, then test mixing the 'simple' collector with the 'epsilon' collector.
The main difficulty is breaking the derive macro's current assumption that there is a single `'gc` lifetime. With multiple collector ids, there are likely to be multiple GC lifetimes.
These are both single-threaded collectors that use a `u64` as their ID. They have the same shadow-stack implementation.
I'm thinking about adding a new `simple` module to the new crate. This will share common code for both collectors. We'll move `GcAllocContext` there since it's really only applicable to those implementations.
Complex, external collectors (like DuckLogic) will have the option to use some other interface, just as they did before.
This is needed for collectors that use concurrent marking or forwarding pointers, like Java's ZGC and Shenandoah.
I used to think this was impossible, but now I think it is not.
I think we could insert them in the `Deref` implementation of `Gc`. We could guarantee the pointers are updated at least once after each safepoint. Is this good enough?
TODO: Consult The GC Handbook
Some allocators may require the stdlib, as an implementation detail. However, the core API should not in order to remain lightweight.
Some allocators may also require `liballoc` or `malloc` as a backing allocator.
Users should be able to explicitly ignore lifetimes (that are not `'gc`) when procedurally deriving. This should be verified.
The attribute should go on the lifetime itself. We should also probably change `#[zerogc(ignore_params(A, B, ...))]` to match this behavior.
For example:

```rust
struct Foo<'gc, #[zerogc(ignore)] 'a> {
    foo: Gc<'gc, Foo<'gc, 'a>>,
    s: String
}
```

Right now the `'a` lifetime causes an error.
ZeroGC should support dynamically tracking roots without tying them to a specific `GcContext`. The collector would treat the `ObjectHandle` as a root. See Mono's documentation on their (lock-free) GC handles.
In order to access the object, the collector must be paused at a safepoint (to support moving collectors). We could implement this either by temporarily associating the handle with an existing context, or by tying usage to some sort of temporary closure.
We need this to support PyObject -> DuckLogic references. Allocation and access need to be extremely efficient in order for our CAPI support to be even reasonably performant.
It'd also be nice if we could use (and maybe allocate) these in the JIT threads. Currently, working with constant objects is kind of awkward.
Right now the simple collector is "thread-safe" in that everything is `!Send`. It's not possible to send garbage collected pointers across threads, so no synchronization is ever needed.
As it stands, the API is pretty agnostic to whether or not the underlying implementation is thread-safe. It expects thread safety to be marked by the appropriate `Send` and `Sync` implementations.
The main issue with allowing multiple threads to use the collector is that a collection would have to pause all threads. Once we decide it's time to collect, we'd have to block until all other collectors reach a `safepoint!`. This is done cooperatively (like all of zerogc's safepoints), meaning that a single thread could hold up collection by waiting too long to invoke a safepoint.
Other VMs work the same way internally; they just don't have an issue with uncooperative threads, since the VM automatically inserts safepoints. The JVM's JIT inserts a safepoint on every single loop iteration (implemented as a test against a guard page).
This is almost analogous to the tradeoff between cooperative and preemptive multitasking. ZeroGC is designed as a low-cost abstraction. We already rely on the user to insert regular safepoints in the single-threaded case, so it's reasonable to expect them to be responsible in the multithreaded case as well.
Currently, ZeroGC safepoints are expected to return very quickly if there's not a collection (usually just checking memory usage).
The ZeroGC API should be updated to allow a user to "freeze" a `GcContext` at a safepoint. This would prevent future allocations (until unfrozen), ensuring that the context's set of roots is known. This is essentially a long-term safepoint that would allow other threads to collect while the first thread is still working.
The simple collector will probably need a little bit of internal synchronization for this (we'll add a feature-flag to control threading support). We'll need to support thread-local caches for the small-object-arenas. A good experiment would be attempting to reimplement the binary-trees example using multiple threads.
This is basically a requirement for DuckLogic. Although Python is essentially single-threaded (due to the GIL), it's still technically possible for python code to be invoked from different threads.
Other systems (like DuckLogic) may want to use a custom type for their garbage-collected pointers. We should support this.
This would also allow us to remove `CollectorId`, since that would be an implementation detail of the pointer. We could move the old "simple" `Gc` pointers to the simple interface described in #4.
This is actually not uncommon among garbage collected programs (Eclipse),
but our collector should strive to do better.
Potentially we could allocate a bunch of objects into an arena, requiring `bumpalo` to create a new chunk of memory.
Hypothetically a program could stop using the old objects, then never allocate any again (or at least not as many).
Right now, the sweep phase will simply add free objects to the free lists and never attempt to return memory to the OS. Even if a chunk of memory is completely empty, we'll still never return it.
In a way, this is related to our dependency on `bumpalo`, since it doesn't provide an interface to free a single chunk at a time.
Fragmentation of the arenas could also leave a ton of unused memory, resulting in a similar situation. That situation would require a compaction phase, which is out of scope for the simple allocator (just disable the arenas if this is a problem).
This is related to #15 because `[T]` is unsized. Ideally, we would be able to have `Gc<'gc, [i32]>` without any indirection through a `Box<[T]>`.
The big problem here is that the simple collector expects `Gc` to have a statically-known size.
This is very important for supporting any real-world use. In theory, arrays could just wrap `Vec` right now, but that's not very elegant.
Implement a heavy-duty generational garbage collector. This should significantly improve performance for applications where allocations are frequent. It is intended primarily for use in language VMs and JITs (#12).
I have a simple runtime, Waffle, with a baseline JIT. Currently I use a simple mark & sweep collector and reference counting for roots, which causes runtime overhead. My question: is it possible to use zerogc with a JIT? Will safepoints work in JITed code? I do not use LLVM or Cranelift; instead I have a macro assembler, and if safepoints need some kind of additional code to work, I believe I can emit that code just fine.
The current `Trace` trait is not object-safe, so it can't be used with trait objects:
```rust
trait Foo: Trace {
    fn foo(&self);
}

#[derive(Trace)]
struct AnyFoo(dyn Foo);
```
Could it be made object-safe by turning the associated constants into functions and moving the `V: GcVisitor` type parameter from `Trace::visit<V>` up to `Trace<V>::visit`?
The simple collector is thread safe, so handles should be too. GcHandle is effectively a shared reference, so T must be Sync in order for GcHandle to be Send (just like the requirements in Arc and Gc).
This is already true in practice, we just haven't added the unsafe impl declarations.
All writes `value -> dest` need to inform the GC. By default, the GC returns immutable references, and you need `GcCell` to mutate fields.
Write barriers are essential for generational and incremental garbage collectors. Although it's possible to use hardware barriers (by marking memory read-only), most modern GC implementations opt for software barriers in practice.
Unfortunately, `GcCell` doesn't have a reference to the owning object, so the current implementation can only inform the collector about `value` (and not `dest`).
I plan to fix this by adding an unsafe `write_barrier` method that takes both `value` and `dest` as arguments and then triggers a write barrier. We can automatically generate safe wrappers for these, alongside our new procedural derive in #2.
The readme alone has several typos.
I think it might be nice to write a small booklet using MdBook in order to describe the usage and design of the API.
I have some serious API cleanup I want to do, so this documentation will probably come after the 0.2.0 release.
I want to use zerogc to implement the main GC algorithm in starlight, but I have a question: is it safe to panic and then catch the panic while the zerogc shadow stack is in use? As I understand it, when a panic happens zerogc makes no attempt to remove shadow stack entries, which leads to UB.
I'm trying to use this library, but there are a lot of things I don't understand.
for version 0.2.0-alpha.1
I want to implement the following types, are there any design problems?
```rust
type GcCollector = zerogc::dummy_impl::DummyCollectorId;
type GcValue<'gc> = Gc<'gc, Value<'gc>, GcCollector>;
type GcString<'gc> = Gc<'gc, String, GcCollector>;

#[derive(Debug, Clone, Trace)]
pub enum Value<'gc> {
    Null,
    Boolean(bool),
    String(GcString<'gc>),
    Array(Vec<GcValue<'gc>>),
    Object(HashMap<GcString<'gc>, GcValue<'gc>>),
}
```
How to deal with missing `TraceImmutable`?

```
error[E0277]: the trait bound `Gc<'gc, String, DummyCollectorId>: TraceImmutable` is not satisfied
  --> projects\nyar-interpreter\src\value\mod.rs:13:24
   |
13 | #[derive(Debug, Clone, Trace)]
   |                        ^^^^^ the trait `TraceImmutable` is not implemented for `Gc<'gc, String, DummyCollectorId>`
   |
   = note: required because of the requirements on the impl of `Trace` for `HashMap<Gc<'gc, String, DummyCollectorId>, Gc<'gc, value::Value<'gc>, DummyCollectorId>>`
   = note: required by `NEEDS_TRACE`
   = note: this error originates in a derive macro (in Nightly builds, run with -Z macro-backtrace for more info)
```
How to deal with missing `GcRebrand`?

```rust
#[derive(Debug, Clone, Trace)]
pub enum Value<'gc> {
    Null,
    Boolean(bool),
    String(GcString<'gc>),
    Array(Vec<GcValue<'gc>>),
    Object(HashMap<String, GcValue<'gc>>),
}
```

```
error[E0277]: the trait bound `Gc<'gc, String, DummyCollectorId>: GcRebrand<'new_gc, Id>` is not satisfied
  --> projects\nyar-interpreter\src\value\mod.rs:17:12
   |
17 |     String(GcString<'gc>),
   |            ^^^^^^^^^^^^^ the trait `GcRebrand<'new_gc, Id>` is not implemented for `Gc<'gc, String, DummyCollectorId>`
   |
   = note: required by `assert_rebrand`
help: consider introducing a `where` bound, but there might be an alternative better way to express this requirement
   |
14 | pub enum Value<'gc> where Gc<'gc, String, DummyCollectorId>: GcRebrand<'new_gc, Id> {
   |                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```
How to deal with missing types such as `VecDeque`? Which trait bounds need to be implemented manually?

```rust
#[derive(Debug, Clone, Trace)]
pub enum Value<'gc> {
    Null,
    Boolean(bool),
    String(GcString<'gc>),
    Array(VecDeque<GcValue<'gc>>),
    Object(HashMap<String, GcValue<'gc>>),
}
```

```
error[E0277]: the trait bound `VecDeque<Gc<'gc, value::Value<'gc>, DummyCollectorId>>: Trace` is not satisfied
  --> projects\nyar-interpreter\src\value\mod.rs:13:24
   |
13 | #[derive(Debug, Clone, Trace)]
   |                        ^^^^^ the trait `Trace` is not implemented for `VecDeque<Gc<'gc, value::Value<'gc>, DummyCollectorId>>`
   |
   = note: required by `NEEDS_TRACE`
   = note: this error originates in a derive macro (in Nightly builds, run with -Z macro-backtrace for more info)
```
This would be super useful for testing.
See the details of Java's Epsilon GC for how it might be used in production.
The nice thing about using this with Rust's ownership model is that collectors could safely be disposed of.
Once all the references are gone, you can clear the arena and reuse it for other tasks.
Essentially this would make it an arena allocator that conforms to the `zerogc` interface. That would be super cool!
Let's say one day you decide the overhead of GC isn't worth it. If your tasks are short running enough (1-2 ms) and only allocate a couple of MB of memory, it'd be silly to collect them. Simply throw away the collector and reuse the memory for the next task.
I have a temporary implementation (for unit tests) in 772ffbd
However, when we finally get the real one, I think this should be a separate crate even though it would complicate testing.
Right now allocation requires atomic operations. We should use a thread-local buffer so this isn't required in the common case.
This would be somewhat difficult to do since we can have multiple running instances. Would we have to use `thread_local`? How does the performance overhead of that compare to using atomics?
Maybe we could make `SmallObjectCache` a static variable shared between instances. However, if we do that, we'd have to differentiate between allocations from different collectors (in `for_each` and the linked list). This could also result in worse performance if all the different collectors end up messing with each other's stuff.
This was much easier when it was just a comment in a config file.
This would remove the sweep phase of the collection and implicitly mark the arenas as dead instead of adding them to the free list. Instead of searching the free list, allocations would search for a free slot in the dead memory. This would amortize the sweeping cost across future allocations.
As of 9a9634d, the binary_trees benchmark spent 30% of total time compacting (I accidentally said tracing). This would actually make the mark/sweep collector noticeably faster than the old copying collector.
This actually reduces the theoretical complexity of the algorithm. Normally, collection time is proportional to the total number of objects (live + dead), while now it's only proportional to the live objects.
This is discussed in Section 2.5 of The Garbage Collection Handbook (Jones, 2012). In addition to the obvious reduction in collection time, it improves the locality of sweeping. They state that it offers good throughput, especially when there are many dead objects (the same workloads copying collectors are best at).
This threatens to increase complexity, so I'm not quite sure about whether it's a great idea... It should probably be optional, although I'd put it on by default (eager sweeping could return memory to the operating system: #6).
If issues ever arise with the simple collector's pause times, this is definitely the most obvious optimization opportunity...
However, this is currently not a very high priority for me.
The pointer to a frozen collector's stack is only guaranteed to be valid for the lifetime of the FrozenContext. We store it in a hashtable.
Ideally, the user would always `unfreeze!` through the standard API, which causes it to be removed.
However, if the user forgets to unfreeze the frozen context (or runs `mem::forget`), the old pointer will remain in the set of `PersistentRoots` even if the underlying memory is freed (or moved).
Right now I actually even have this reflected in the API. If a `FrozenContext` is ever dropped without unfreezing first, it will panic with a "TODO message".
The code is undergoing temporary insanity.
See also c9d70d8 for the unsafe implementation of freezing simple collectors that we currently use.
When we finally fix this issue, we need to remember to remove all the TODOs in the code (and the FrozenContext destructor that panics).
I suspect this is due to the fact we reallocate the entire heap every collection, then copy everything into the new memory.
Normally implementations of Cheney's copying algorithm divide the heap into two equal halves. While the mutator is running only one of these heaps is active. During collection, the live objects are copied from the old heap to the new heap (implicitly compacting them). The old heap is then considered discarded (but not freed) and the mutator uses the new heap.
Both implementations need 2x the used memory during any given collection cycle (which is a major disadvantage of this algorithm). However, the standard implementation effectively caches the old heap, while we reallocate it every time. In theory, this could result in less memory use during mutation (because we free the old memory instead of caching it); however, allocating large chunks of memory generally requires an `mmap` syscall to the operating system. Accessing the new memory then seems to trigger (minor) page faults because it's not in the TLB or any of the caches.
We're wasting most of our time page-faulting, losing any performance gains we would otherwise win from bump-pointer allocation. We should switch to caching the old heap, like the standard implementation does.
Of course, right now the copying collector still has some SEGFAULTs, which are obviously higher-priority.
This is tedious and unsafe to perform by hand. This is essential if I'm going to be using this in production.
This basically boils down to iterator invalidation. If we have two references to a vector, a `GcVec::push` on one may mutate the vector's length while the user has coerced the other reference into a slice (or iterator).
To put it another way, it's impossible for a garbage-collected vector/list to implement all of the following properties:
1. Direct access as a slice (or raw pointer)
2. Multiple references to the same vector (i.e. being `Copy`)
3. Memory in `[len, capacity)` left uninitialized (as opposed to initialized with `null`)

This is a problem common to all garbage-collected languages, not just zerogc (as I describe below).
Most languages sidestep the problem by forbidding option 1 and refusing to give direct pointer access to garbage-collected arrays.
Usually, forbidding option 1 is not a problem, since arrays in Java and lists in Python are built into the language and replace most uses of pointers.
There are some exceptions though. Java's JNI allows raw access to an array's memory with `GetPrimitiveArrayCritical`. However, there is no need to worry about mutating length, because Java arrays have a fixed size (although it does temporarily prevent garbage collection). Python's native CAPI exposes access to a list's raw memory with `PyListObject->ob_items`, although native code must be careful of any user code changing the list's length.
This problem can also come up in unexpected places. For example, the C implementation of Python's `list.sort` coerces the list into a raw pointer, then sorts the memory directly through the pointers. The original code continued to use the original pointer even if a user comparison mutated the list, leading to bugs. The current implementation temporarily sets the length to zero and carefully guards against any mutations.
If we took the Java/Python approach of refusing to give out raw pointers, it would also prevent access as a slice.
In idiomatic rust, this is unrealistic, since slices are so pervasive.
As a temporary solution, I ended up forbidding option 2.
However, this doesn't work well if you're trying to use `GcVec` as the underlying implementation of Java arrays or Python lists. These languages allow users to create multiple references to an array or a list. In our current system, this would require double-indirection: `Gc<GcVec<T>>`.
SIDENOTE: Right now GcVec
Instead, I propose a new system of four types and two traits (which will incidentally solve the write-barrier problem):

- `GcRawVec` - Allows multiple references, but has unsafe slice or pointer access.
- `GcVec` - Effectively a `RefCell<GcRawVec<T>>`. Includes safe slice access.
- `GcVecBuffer` - A `GcRawVec` that is `!Copy`, but gains safe slice access.
- `trait GcVecAccess` - Abstracts over different forms of `GcVec`. Calls to `push` require the user to pass a `GcContext`.
- `trait GcArrayAccess` - Allows read/write access to arrays, without mutating their length. Implemented by `GcArray` in addition to `GcVec`.

TODO: Better naming for `GcVec` and friends?