higherorderco / hvm-core
License: Apache License 2.0
With the removal of constructors and the use of the GPU, how can HVM be used for general-purpose programming?
In HVM1, a constructor could be implemented by the runtime as a built-in function, which could then perform side effects; without constructors, how are things like syscalls invoked?
In fact, by reducing the network on the GPU you might not be able to do side effects at all, since you don't have access to the CPU OS from the GPU.
Hello! It seems like the discord invitation link expired. Is this intentional?
With the cargo +nightly bench command I get the following error: https://gist.github.com/zamazan4ik/3d5de124d7e2e92dd1c86f04afb3ba7d
This is on the main branch, commit 57570ec6e9a2282c2cf51d7ebd90ce8b20a9a72f.
This is on hvm-core commit 7d8053d.
Note this seems to be related to an output file, because if I use a non-.hvmc file, it shows a different error:
I got this:
error: reached the recursion limit while instantiating `<(for<'a, 'b> fn(&'a mut run::Net<'b, Strict>, ...) {call_Table_at_S83____ce760551ccb52d84::<...>}, ...) as AsDef>::call::<...>`
|
note: `<(F, G) as AsDef>::call` defined here
--> src/run.rs:631:3
|
631 | unsafe fn call<M: Mode>(slf: *const Def<Self>, net: &mut Net<M>, port: Port) {
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
= note: the full type name has been written to '/Users/martinjaniczek/Localhost/scratch/hvm-crc32/.hvm/target/release/deps/hvmc-946b8138f3fad536.long-type.txt'
error: could not compile `hvm-core` (lib) due to 1 previous error
when compiling my CRC32 file (https://gist.github.com/Janiczek/fa91fadd53d29ef0e1bbcf8e034a519b) with this string:
testString = "123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456
78901234567890123456789012345678901234567890123456789012345678"
(It's 2048 characters long.)
Later I tried to make this smaller program:
str = "1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123
4567890123456789012345678901234567890123456789012345678"
main = (String.len str)
String.len list = (String.len.go list 0)
String.len.go (String.nil) count = count
String.len.go (String.cons x xs) count = (String.len.go xs (+ count 1))
which hvm-lang runs and compiles fine, and hvm-core runs fine, but the compilation gets stuck like this (it has been compiling like that for a good few minutes now):
The .hvmc file (note that -O all was used) seems to contain a lot of duplication:
Depends on #109
I'm making this a separate issue because I think there are more open questions in the design of this.
IMO, floats should have a separate tag at runtime so that they can be read back correctly. Blurring the edges between the numeric types slightly is fine, because converting between the representations is easy, but reading back floats as integers is unacceptably opaque.
Luckily, there is room for another tag in the tag enum, so I'd propose we add an F32 tag.
The floating point ops should also be represented as separate ops.
I'm flexible on what the behavior is for using ints in float ops and vice versa; whether that's reinterpreting the bits or casting the value between the types.
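The two candidate behaviors differ meaningfully; here is a minimal Rust sketch of the distinction (which of the two hvm-core should adopt is exactly the open question here):

```rust
fn main() {
    let x: u32 = 1;
    // Option A: reinterpret the bits as an f32.
    let reinterpreted = f32::from_bits(x);
    // Option B: numerically cast the value.
    let casted = x as f32;
    // Bit pattern 1 reinterpreted is the smallest positive subnormal, not 1.0:
    assert_eq!(reinterpreted.to_bits(), 1);
    assert!(reinterpreted < 1e-40);
    assert_eq!(casted, 1.0);
}
```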
Add support for u8/i8/u16/i16/u32/i32 numbers and operations (currently, there's just u60, which should be kept). The relevant code will mostly be in ops.rs and a little in ast.rs.
Number nodes should stay largely as-is; the type of a number will not be explicitly stored in the node; instead, operations will be typed. The one change I would suggest, though, is switching the representation of a number value from a u64 to an i64, to make sign-extension easy. I don't have any strong opinions on how the Op type is represented, but it needs to be convertible to/from a 16-bit integer. Currently, it's just a simple enum; I'd probably suggest making it a #[repr(transparent)] struct(Type, OpKind), with two separate enums that each fit in a byte, but I'm flexible.
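To illustrate, here is a hypothetical sketch of such a packed Op. The Ty/OpKind names and variants are illustrative, not hvm-core's actual definitions, and a plain struct is used since #[repr(transparent)] only permits one non-zero-sized field:

```rust
// Hypothetical sketch: an Op packed into 16 bits as (type byte, op-kind byte).
#[repr(u8)]
#[derive(Clone, Copy, PartialEq, Debug)]
enum Ty { U8, I8, U16, I16, U32, I32, U60 }

#[repr(u8)]
#[derive(Clone, Copy, PartialEq, Debug)]
enum OpKind { Add, Sub, Mul }

#[derive(Clone, Copy, PartialEq, Debug)]
struct Op { ty: Ty, kind: OpKind }

impl Op {
    fn to_u16(self) -> u16 {
        ((self.ty as u16) << 8) | self.kind as u16
    }
    fn from_u16(bits: u16) -> Option<Op> {
        let ty = match bits >> 8 {
            0 => Ty::U8, 1 => Ty::I8, 2 => Ty::U16, 3 => Ty::I16,
            4 => Ty::U32, 5 => Ty::I32, 6 => Ty::U60, _ => return None,
        };
        let kind = match bits & 0xFF {
            0 => OpKind::Add, 1 => OpKind::Sub, 2 => OpKind::Mul, _ => return None,
        };
        Some(Op { ty, kind })
    }
}

fn main() {
    // Round-trip through the 16-bit form preserves the op.
    let op = Op { ty: Ty::I16, kind: OpKind::Sub };
    assert_eq!(Op::from_u16(op.to_u16()), Some(op));
}
```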
The numbers should be represented such that they can sensibly be implicitly cast between types; i.e. 1 should have the same representation in all the integer types, and -1 should have the same representation in all signed integer types (and be uN::MAX in all the unsigned types).
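If values are stored sign-extended into an i64, as suggested above, these invariants fall out naturally; a small sketch:

```rust
// Sketch: storing every number sign-extended into an i64 (per the u64 -> i64
// suggestion above) gives the desired shared representations for free.
fn store_i8(v: i8) -> i64 { v as i64 }
fn store_i32(v: i32) -> i64 { v as i64 }
fn store_u8(v: u8) -> i64 { v as i64 }

fn main() {
    // 1 has the same stored form at every width:
    assert_eq!(store_i8(1), store_i32(1));
    assert_eq!(store_u8(1), store_i32(1));
    // -1 has the same stored form in all signed types:
    assert_eq!(store_i8(-1), store_i32(-1));
    // ...and reads back as uN::MAX in the unsigned types:
    assert_eq!(store_i8(-1) as u8, u8::MAX);
    assert_eq!(store_i8(-1) as u32, u32::MAX);
}
```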
For the syntax, I'd suggest type.op (e.g. u8.-), with op being shorthand for u60.op (for back-compat). Also, IMO support for negative numbers would be good in the syntax, so #-1, and I think numbers should be printed as signed (so #-1 is printed as that instead of #1152921504606846975; this makes the u60 readback weird, but that's fine IMO).
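For the signed u60 readback specifically, sign-extension from bit 59 is a two-shift operation; a sketch, assuming the value sits in the low 60 bits of a u64:

```rust
// Sketch: interpreting a 60-bit payload as signed for printing.
fn u60_as_signed(v: u64) -> i64 {
    // Move the 60-bit sign bit into the i64 sign position, then arithmetic-shift back.
    ((v << 4) as i64) >> 4
}

fn main() {
    let neg_one = (1u64 << 60) - 1; // all 60 bits set
    assert_eq!(neg_one, 1152921504606846975); // what gets printed today
    assert_eq!(u60_as_signed(neg_one), -1); // what would be printed as #-1
    assert_eq!(u60_as_signed(42), 42);
}
```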
Great project! I want to help with ROCm/HIP if I have some time. Are PRs welcome?
From Discord.
I would think this is a tail-recursive program that can run with any N, growing CPU time but not memory usage:
Main n = (Sum (PowOfTwo n))
PowOfTwo n = (PowOfTwoHelp 1 n)
PowOfTwoHelp acc 0 = acc
PowOfTwoHelp acc e = (* 2 (PowOfTwoHelp acc (- e 1)))
Sum n = (SumHelp 0 n)
SumHelp acc 0 = acc
SumHelp acc n = (SumHelp (+ acc 1) (- n 1))
But it fails around 2^28:
$ hvml run -m 32G -O all -s count-sequential.hvml 28
with an OOM:
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: core::option::expect_failed
3: hvmc::run::allocator::Allocator::alloc
4: <hvmc::run::def::InterpretedDef as hvmc::run::def::AsDef>::call
Consider the following programs:
// sum_ctr.hvm1
(Sum (Leaf x)) = x
(Sum (Both a b)) = (+ (Sum a) (Sum b))
(Gen 0 x) = (Leaf x)
(Gen n x) = (Both (Gen (- n 1) (* x 2)) (Gen (- n 1) (+ 1 (* x 2))))
Main = (% (Sum (Gen 22 0)) (<< 1 24))
// sum_lam.hvm1
(Match 0 case_zero case_succ) = case_zero
(Match n case_zero case_succ) = (case_succ (- n 1))
Leaf = λx λleaf λnode (leaf x)
Node = λa λb λleaf λnode (node a b)
Sum = λtree
let case_leaf = λx x
let case_node = λa λb (+ ((Sum) a) ((Sum) b))
(tree case_leaf case_node)
Gen = λn
let case_zero = λx ((Leaf) x)
let case_succ = λm λx ((Node) ((Gen) m (* x 2)) ((Gen) m (+ (* x 2) 1)))
(Match n case_zero case_succ)
Main = (% ((Sum) ((Gen) 22 0)) (<< 1 24))
// sum_lam.hvm2
leaf = λx λleaf λnode (leaf x)
node = λa λb λleaf λnode (node a b)
sum = λtree
let case_leaf = λx x
let case_node = λa λb (+ (sum a) (sum b))
(tree case_leaf case_node)
gen = λn
let case_zero = λx (leaf x)
let case_succ = λm λx (node (gen m (* x 2)) (gen m (+ (* x 2) 1)))
match n { 0: case_zero; 1+m: (case_succ m) }
Main = (sum (gen 22 0))
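For reference, here is a plain Rust rendition of the same tree-sum benchmark (a sketch for comparison, not part of the original measurements):

```rust
// A direct Rust rendition of the sum_ctr benchmark.
enum Tree {
    Leaf(u64),
    Both(Box<Tree>, Box<Tree>),
}

fn gen(n: u64, x: u64) -> Tree {
    if n == 0 {
        Tree::Leaf(x)
    } else {
        Tree::Both(Box::new(gen(n - 1, x * 2)), Box::new(gen(n - 1, x * 2 + 1)))
    }
}

fn sum(t: &Tree) -> u64 {
    match t {
        Tree::Leaf(x) => *x,
        Tree::Both(a, b) => sum(a) + sum(b),
    }
}

fn main() {
    // Prints 14680064, matching the outputs in the runs below.
    println!("{}", sum(&gen(22, 0)) % (1 << 24));
}
```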
We implement a micro-benchmark that allocates and sums all elements of a perfect binary tree, in 3 different ways:
Benchmarking these cases on an Apple M1, we have:
v@v ~/vic/dev/hvm-test$ time hvm run -f sum_lam.hvm -t 1 "Main"
14680064
hvm run -f sum_lam.hvm -t 1 "Main" 3.69s user 0.61s system 99% cpu 4.309 total
v@v ~/vic/dev/hvm-test/sum_lam$ time ./target/release/sum_lam run -t 1 "Main"
14680064
./target/release/sum_lam run -t 1 "Main" 2.43s user 0.57s system 99% cpu 3.007 total
v@v ~/vic/dev/hvm-test$ time hvm run -f sum_ctr.hvm -t 1 "Main"
14680064
hvm run -f sum_ctr.hvm -t 1 "Main" 1.22s user 0.37s system 99% cpu 1.596 total
v@v ~/vic/dev/hvm-test/sum_ctr$ time ./target/release/sum_ctr run -t 1 "Main"
14680064
./target/release/sum_ctr run -t 1 "Main" 0.25s user 0.19s system 98% cpu 0.447 total
v@v ~/vic/dev/hvm-test$ hvm-lang compile sum_lam.hvm2 >> sum_lam.hvmc; time hvmc run sum_lam.hvmc
#14680064
hvmc run sum_lam.hvmc 1.23s user 0.29s system 99% cpu 1.524 total
So, we can make the following ranking:
- HVM1-COMP, sum-ctr: 0.447 s
- HVM2-INTR, sum-lam: 1.524 s
- HVM1-INTR, sum-ctr: 1.596 s
- HVM1-COMP, sum-lam: 3.007 s
- HVM1-INTR, sum-lam: 4.309 s
So, as we can see, on the identical sum-lam program, HVM2 (interpreted) outperforms HVM1 (interpreted and compiled) by 2x-3x, all in single-core execution. Since the current CUDA implementation is ~35x faster than the single-core implementation, we can expect HVM2 to complete this task in about 0.043s, i.e., almost 100x faster than HVM1. That is really solid and promising.
Yet, thanks to its equational notation, HVM1 allowed us to dispatch calls by pattern-matching on constructors, which, in turn, allowed us to compile routines. This wasn't explored to its full potential, yet it already made a huge difference, as can be seen in the compiled sum-ctr case, which outperforms all others because the numeric operations are compiled to actual machine code, never allocating nodes (like OP1, OP2, etc.). In its current design, HVM2 features neither constructors nor pattern-matching, which means this kind of compilation isn't possible. While this won't make much difference for highly symbolic computations, for most real-world applications the impact would be huge.
The question is: when should we introduce some form of compiled routines to HVM2? For one, it would likely result in massive (2x to 100x?) speedups, depending on the workload. In fact, without compiled routines, HVM2 is kind of sabotaging itself, as it would be restricted to being a massively parallel interpreter. Yet, how do we deal with the huge complexity bump that such a thing would entail? Implementing HVM2 as an interpreter with just interaction nodes and numbers in all the target environments (Rust, CUDA, Metal, Vulkan...), in sync and correctly, is already a colossal task. As such, sadly, I do not think I or the current team have the scale to implement a compiled version of HVM2 across all these targets.
When we release HVM2, we want people to be able to use it to run their programs in a massively parallel fashion and get a feeling of what is to come. Yet, if it turns out HVM needs 10 cores to do what GHC does in one, that will not paint a good picture. So, what do we do? There are 3 options that make logical sense:
1. We implement HVM2 as an interpreter on CPU+GPU, release it, and make clear that the first version is interpreter-only and that, even though it is fast, massive speedups can be expected when we compile it in the future. Due to the lack of compilation, the numbers wouldn't be nearly as great as they could be, which greatly impacts the public perception and value of the project.
2. We implement a compiler and restrict ourselves to a single implementation in Rust. That would make the complexity manageable, but would mean we will not be shipping with GPU support, and GPUs are clearly vastly superior to CPUs for interaction net reduction. Due to the lack of GPU support, the numbers wouldn't be nearly as great as they could be.
3. We somehow raise significant funds (on the order of 100m or so?) and hire enough people to build a fully featured compiler and target the GPU. I don't think that's a likely option.
So, I'm leaving this thread as a testament to the current situation, and to the reality that we must pick one of two for the initial release: a compiled version or a GPU version; and for my own acceptance that we won't have both for a while, sadly.
tl;dr: on identical programs, HVM2 is 2-3x faster than HVM1 on a single core, and up to 100x faster on GPUs, but the fact that it doesn't support constructors means we can't implement a compiler, which may leave us interpreter-only, which may completely overshadow that difference in many cases. Sadly, we don't have the budget for both compilation and GPU support in HVM2's first release, so we must pick one.
Given:
Running the example program from the README using latest master (8a9a009):
@add = (<+ a b> (a b))
@sum = (?<(#1 @sumS) a> a)
@sumS = ({2 a b} c)
& @add ~ (e (d c))
& @sum ~ (a d)
& @sum ~ (b e)
@main = a
& @sum ~ (#24 a)
I get the following error:
error[E0382]: use of moved value: `def_sum`
--> src/gen.rs:16:79
|
11 | let def_sum = Port::new_ref(&host.defs[r#"sum"#]);
| ------- move occurs because `def_sum` has type `port::Port`, which does not implement the `Copy` trait
...
14 | host.get_mut::<HostedDef<Def_main>>(r#"main"#).data.0 = Def_main { def_sum };
| ------- value moved here
15 | host.get_mut::<HostedDef<Def_sum>>(r#"sum"#).data.0 = Def_sum { def_sumS };
16 | host.get_mut::<HostedDef<Def_sumS>>(r#"sumS"#).data.0 = Def_sumS { def_add, def_sum };
| ^^^^^^^ value used here after move
|
help: consider cloning the value if the performance cost is acceptable
|
14 | host.get_mut::<HostedDef<Def_main>>(r#"main"#).data.0 = Def_main { def_sum: def_sum.clone() };
|
Applying the suggested fix resolves the issue and the program works correctly.
[I'm deleting old branches; saving links here for posterity]
26c48fb
chore/sc-484/optimize-pre-reduce-pass
d755c94
direct_deref_experiment
a7bf4c1
feature/sc-173/add-n-ary-ctrs-to-hvm-core
c03e4a3
feature/sc-177/implement-parallel-rust-cpu-runtime
8c8dbc5
feature/sc-213/implement-numeric-operations-in-the-hvm-core
2bd8d8e
feature/sc-84/describe-the-semantics-of-the-runtime-in
a870726
flat-heap
f3135d7
main
54b4cf1
negative_cost_abstractions
65a6b5a
parallel-locking
d322f5a
parallel-olivenke
3e82b9b
parallel-work-stealing-fast
d3e23f4
parallel-work-stealing-queue
473298e
parallel-work-stealing-randomized-order
a61bd56
parallel-work-stealing-randomized-order-add-redex-count-diff-to-total
5fd6ef3
parallel-work-stealing-randomized-order-atomic-total_redex_count
24b5bfd
parallel_cpu_wip
9a6ba4c
parallel_fast_incomplete
3110de1
parallel_slow_complete
74ae424
ptr-refactor-alt-link
ea9385f
redirect_free
335e40a
rework-expand
50c7188
rust_transmute_not_zero_cost
17d2f3d
sequential
495ac80
tmp_debug
#111 proposes a way to use IO functions in compiled mode, but it is limited to using hvm-core-defined IO, hvm-core readback, etc. Currently, if we wanted to support compiling hvm-lang programs to binaries, hvm-lang would need to have a comparable copy-the-whole-source-of-the-library-into-a-rust-crate-and-build-it script, which isn't really tenable.
Instead, it would be much more desirable if the hvm-core compilation could be modular – i.e. one could compile an hvm-lang book to a native form and then use it in the hvm-lang CLI. Only the definitions in the compiled book would need to be included in the compilation output, and the hvm-lang CLI would supply extraneous things like IO, lambda calc readback, etc.
Thus, I would propose that we support compiling hvm-core code to dynamic libraries (i.e. #![crate_type = "dylib"] in Rust terms), and loading dynamic libraries to extend Hosts.
I don't think this would require too many changes. Building off of #111, it would roughly just replace the main.rs file with:
#![crate_type = "dylib"]
#[no_mangle]
pub fn hvmc_dylib_v0__insert_host(host: &mut Host) {
gen::insert_into_host(host)
}
(it could also disable the cli feature)
Then loading one of these compiled dylibs would look roughly like:
let lib = libloading::Library::new("/path/to/dylib")?;
let insert_host = unsafe { lib.get::<fn(&mut Host)>(b"hvmc_dylib_v0__insert_host\0")? };
insert_host(&mut host);
mem::forget(lib); // not sure if this is necessary, but I think the library may need to be leaked
Currently, compiled binaries are generated in a way that assumes all definitions are knowable at compile-time, but this is not the case for IO defs.
Right now the compiled output looks roughly like:
@foo = *
@bar = @foo
// gen.rs
pub fn host() -> Host {
let mut host = Host::default();
host.insert_def(r#"foo"#, DefRef::Static(unsafe { &*DEF_foo }));
host.insert_def(r#"bar"#, DefRef::Static(unsafe { &*DEF_bar }));
host
}
pub const DEF_foo: *const Def = const { &Def::new(LabSet::NONE, (call_foo, call_foo)) }.upcast();
pub const DEF_bar: *const Def = const { &Def::new(LabSet::NONE, (call_bar, call_bar)) }.upcast();
pub fn call_foo<M: Mode>(net: &mut Net<M>, to: Port) {
let t0 = Trg::port(to);
net.link_trg(t0, Trg::port(Port::ERA));
}
pub fn call_bar<M: Mode>(net: &mut Net<M>, to: Port) {
let t0 = Trg::port(to);
net.link_trg(t0, Trg::port(Port::new_ref(unsafe { &*DEF_foo })));
}
Instead, it should look like:
// gen.rs
pub fn insert_into_host(host: &mut Host) {
host.insert_def(r#"foo"#, unsafe { HostedDef::new_hosted(LabSet::NONE, Def_foo::default()) });
host.insert_def(r#"bar"#, unsafe { HostedDef::new_hosted(LabSet::NONE, Def_bar::default()) });
let def_foo = Port::new_ref(&host.defs[r#"foo"#]);
host.get_mut::<HostedDef<Def_bar>>(r#"bar"#).data.0 = Def_bar { def_foo };
}
#[derive(Default)]
struct Def_foo {}
impl AsHostedDef for Def_foo {
fn call<M: Mode>(slf: &Def<Self>, net: &mut Net<M>, port: Port) {
let t0 = Trg::port(port);
net.link_trg(t0, Trg::port(Port::ERA));
}
}
#[derive(Default)]
struct Def_bar {
def_foo: Port,
}
impl AsHostedDef for Def_bar {
fn call<M: Mode>(slf: &Def<Self>, net: &mut Net<M>, port: Port) {
let t0 = Trg::port(port);
net.link_trg(t0, Trg::port(slf.data.def_foo.clone()));
}
}
(using this Host::get_mut API)
The insert_into_host function is somewhat peculiar because it needs to be able to handle circular references; thus, all of the definitions are first inserted with meaningless default state, and then all of them are initialized with their dependencies.
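The same two-phase pattern, reduced to a minimal standalone sketch (Rc/RefCell stand in for the actual Host machinery, which this sketch does not depend on):

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Minimal sketch of two-phase initialization for mutually recursive definitions.
#[derive(Default)]
struct Def {
    name: String,
    deps: Vec<Rc<RefCell<Def>>>,
}

fn main() {
    // Phase 1: insert placeholders with meaningless default state.
    let foo = Rc::new(RefCell::new(Def { name: "foo".into(), ..Default::default() }));
    let bar = Rc::new(RefCell::new(Def { name: "bar".into(), ..Default::default() }));
    // Phase 2: wire up the (possibly circular) dependencies.
    foo.borrow_mut().deps.push(bar.clone());
    bar.borrow_mut().deps.push(foo.clone());
    assert_eq!(foo.borrow().deps[0].borrow().name, "bar");
    assert_eq!(bar.borrow().deps[0].borrow().name, "foo");
}
```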
This naturally supports using pre-defined IO defs with minimal modifications.
Then, in the compiled binary, one would simply write:
let mut host = create_host(&Book::default());
gen::insert_into_host(&mut host);
This would, of course, only support the IO operations hvm-core supports; I'll be opening another issue shortly for how that could be addressed.
data Foo.T = Foo
main = (^ 28416 Foo)
⬇️
01800004000000d8 01800005000000d8
thread '<unnamed>' panicked at 'internal error: entered unreachable code', /Users/martinjaniczek/.cargo/git/checkouts/hvm-core-31580e46fc731f4f/7d8053d/src/run.rs:729:67
stack backtrace:
0: 0x100342e74 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h6d1ef757ca6aa08c
1: 0x1002731f4 - core::fmt::write::h24aade456e61cbc1
2: 0x100323a9c - std::io::Write::write_fmt::h5df313f4d7305ca8
3: 0x1003466a8 - std::sys_common::backtrace::print::h509ed65c6d882fc0
4: 0x1003462e0 - std::panicking::default_hook::{{closure}}::h9d25327d11b3fd1b
5: 0x1003471cc - std::panicking::rust_panic_with_hook::h1d6008cc2f3fe794
6: 0x100346d24 - std::panicking::begin_panic_handler::{{closure}}::hd15752c0e75f10f5
7: 0x100346cb8 - std::sys_common::backtrace::__rust_end_short_backtrace::h6334164110c20096
8: 0x100346cac - _rust_begin_unwind
9: 0x100348d60 - core::panicking::panic_fmt::h09e6dc0b209e1ff9
10: 0x100348e88 - core::panicking::panic::hd967f53605ad7120
11: 0x100284f6c - std::sys_common::backtrace::__rust_begin_short_backtrace::haf3e1d75ee7a05a4
12: 0x100285e5c - core::ops::function::FnOnce::call_once{{vtable.shim}}::hb62ee96d3c082907
13: 0x100347aec - std::sys::unix::thread::Thread::new::thread_start::h4b97637403528bf6
14: 0x18def6034 - __pthread_joiner_wake
(and the hvml run process hangs, unless it's run with -1)