valida-xyz / valida

A STARK-based VM focused on code reuse, performance, and modularity

License: Apache License 2.0
The challenger for the Fiat-Shamir transformation needs to observe all of the pieces which are in the transcript. Otherwise, soundness bugs can result. See `derive/src/lib.rs`, where there are some relevant TODO comments. Review the use of the challenger to make sure that the Fiat-Shamir transformation is being performed properly, adding any observations to it as needed.
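As a toy illustration of the discipline being asked for (this is not the Plonky3 challenger API, and the mixing function is a stand-in for a real hash), every prover message must be absorbed before a challenge is sampled; otherwise the challenge is independent of the message:

```rust
// Toy transcript challenger. `observe` absorbs prover messages; `sample`
// derives a challenge from everything absorbed so far.
pub struct Challenger {
    state: u64,
}

impl Challenger {
    pub fn new() -> Self {
        Challenger { state: 0 }
    }

    // Stand-in for a real sponge absorb step (FNV-style mixing).
    pub fn observe(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.state = self.state.wrapping_mul(0x0100_0000_01b3).wrapping_add(b as u64);
        }
    }

    // Stand-in for squeezing a challenge out of the transcript state.
    pub fn sample(&mut self) -> u64 {
        self.state = self
            .state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.state
    }
}
```

If a commitment is never observed, two different transcripts yield the same challenge, which is exactly the kind of soundness bug described above.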
Find the most interesting and impactful opportunities for parallelizing the Valida prover, and implement them in the code.
A comment in `memory/src/stark.rs` says `FIXME: Degree constraint 4, need to remove`. It's not clear to me why a degree 4 constraint needs to be removed. Is this an optimization or a requirement for functional completeness and correctness?
Design and implement an extension of the Valida ISA which supports getting the current value of `fp`, or more generally the current address of an object on the stack in source programs.
Looking up AND input/output in a preprocessed table will cut Bitwise chip column count from 79 to 15.
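A minimal sketch of the preprocessed-table idea (the 79-to-15 column figure is from the issue; the exact table shape below is an assumption): fix a byte-wise AND table with one row per input pair, so a 32-bit AND reduces to four table lookups instead of per-bit witness columns.

```rust
// Preprocessed table of all 8-bit AND triples (x, y, x & y): 2^16 rows.
// A 32-bit AND then decomposes into four byte-wise lookups into this table.
fn and_table() -> Vec<(u8, u8, u8)> {
    let mut rows = Vec::with_capacity(1 << 16);
    for x in 0..=255u8 {
        for y in 0..=255u8 {
            rows.push((x, y, x & y));
        }
    }
    rows
}
```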
`Load32` provides the ability to load from an arbitrary RAM address. `Store32` should provide the ability to write to an arbitrary RAM address. As specified and implemented, it does not. Instead of acting as the write-side analogue of `Load32`, `Store32` acts as an analogue of `LW`; it moves a value from one stack location to another stack location, with the stack locations determined as fixed offsets from `fp`. What's needed is an instruction to write to a RAM address specified at a location on the stack.
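To make the distinction concrete, here is a toy RAM model (illustrative only; the operand roles and the real implementation may differ) contrasting the current move-between-stack-slots behavior with the intended pointer-following write:

```rust
use std::collections::BTreeMap;

type Ram = BTreeMap<u32, u32>;

// Current behavior: moves the word at fp+c to fp+b. Both addresses are
// fixed offsets from fp, so no arbitrary RAM address can ever be written.
fn store32_current(ram: &mut Ram, fp: u32, b: u32, c: u32) {
    let v = *ram.get(&(fp + c)).unwrap_or(&0);
    ram.insert(fp + b, v);
}

// Intended behavior: follow the pointer stored at fp+c and write the word
// at fp+b to that (arbitrary) RAM address.
fn store32_intended(ram: &mut Ram, fp: u32, b: u32, c: u32) {
    let ptr = *ram.get(&(fp + c)).unwrap_or(&0);
    let v = *ram.get(&(fp + b)).unwrap_or(&0);
    ram.insert(ptr, v);
}
```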
Question. How do compilers targeting Valida deal with varargs in languages supporting them?
Context. PR #65. The Valida calling convention as described in the Valida spec requires the size of each function's stack frame to be known at compile time. As a result, potentially variable-sized local variables should be allocated on the heap. This is because the ISA only allows `fp` to be updated by adding to it a constant or a value loaded from memory at a constant offset from `fp`. Since function arguments are also passed on the stack, this means that the length of the arguments also must be known at compile time.
An alternative semantics for `jalv` would not impose this constraint. In the contemplated alternative semantics, `jalv` would load a value from memory into `fp`. Perhaps a complementary jump instruction would store `fp` into a location in memory contemporaneously with jumping. This would allow the local variables and the arguments to functions to have sizes not known at compile time.
The idea we're currently going with is to let stack objects exist on the heap if they need to be variable-sized. Each stack frame would then contain a constant number of scalar local variables and pointers into the heap. Each function call generates a stack frame full of arguments, and so to extend this idea to those frames, each of these argument stack frames would contain a constant number of scalar arguments and pointers into the heap. I want to confirm my understanding that this is how we would support varargs in compilers targeting Valida.
In the equality test in the CPU chip, one of the constraints states that `not_equal = diff * diff_inv`. Since `not_equal` is 0 or 1, this is satisfied only if `diff_inv` is the multiplicative inverse of `diff`, or `diff` is zero, or `diff_inv` is zero. If `not_equal` is 1, then `diff_inv` is the multiplicative inverse of `diff`, implying that `diff` is not zero, implying that the input values are not equal. What's not clear is that if `not_equal` is 0, then `diff` is 0. It seems like it could be that `diff` is non-zero, and `diff_inv` is 0, and `not_equal` is 0. So, the argument is not sound. We can fix this by adding another constraint: `(1 - not_equal) * diff = 0`. This should complete the constraints for the zero test on `diff`. This issue may be replicated elsewhere where zero tests occur.
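The gap can be checked concretely. The sketch below (over a toy prime field, not the actual chip code) exhibits a witness with non-zero `diff` that satisfies the product constraint alone while claiming `not_equal = 0`, and shows that the proposed extra constraint rejects it:

```rust
// Toy prime field F_p with p = 101, for demonstrating the constraints.
const P: u64 = 101;

fn mul(a: u64, b: u64) -> u64 { a * b % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }

// Existing constraint: not_equal = diff * diff_inv.
fn product_constraint(diff: u64, diff_inv: u64, not_equal: u64) -> bool {
    not_equal == mul(diff, diff_inv)
}

// Proposed fix: (1 - not_equal) * diff = 0.
fn fix_constraint(diff: u64, not_equal: u64) -> bool {
    mul(sub(1, not_equal), diff) == 0
}
```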
Currently the Shift32 chip STARK does not put any constraint on its output. The output should be constrained to be equal to `input_1 * power_of_two`, truncated to a 32-bit word.
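A sketch of the intended output relation (names follow the issue; the actual column layout lives in the chip): the output is the low 32 bits of the 64-bit product.

```rust
// Expected Shift32 output: the full product input_1 * power_of_two,
// computed in 64 bits and truncated to its low 32 bits.
fn shift32_output(input_1: u32, power_of_two: u32) -> u32 {
    ((input_1 as u64) * (power_of_two as u64)) as u32
}
```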
./build/bin/llc fib.ll --march=valida --global-isel
Pass 'IRTranslator' is not initialized.
Verify if there is a pass dependency cycle.
Required Passes:
Insert stack protectors
Target Pass Configuration
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0. Program arguments: ./build/bin/llc fib.ll --march=valida --global-isel
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var LLVM_SYMBOLIZER_PATH to point to it):
0 llc 0x00000001035ac644 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) + 56
1 llc 0x00000001035aa9f0 llvm::sys::RunSignalHandlers() + 112
2 llc 0x00000001035ace24 SignalHandler(int) + 352
3 libsystem_platform.dylib 0x0000000183771a24 _sigtramp + 56
4 llc 0x00000001030036b8 llvm::PMTopLevelManager::schedulePass(llvm::Pass*) + 2876
5 llc 0x0000000102e2ff24 llvm::TargetPassConfig::addPass(llvm::Pass*) + 660
6 llc 0x00000001027b8594 (anonymous namespace)::ValidaPassConfig::addIRTranslator() + 48
7 llc 0x0000000102e30b44 llvm::TargetPassConfig::addCoreISelPasses() + 460
8 llc 0x0000000102e30d6c llvm::TargetPassConfig::addISelPasses() + 272
9 llc 0x0000000102bbac88 llvm::LLVMTargetMachine::addPassesToEmitFile(llvm::legacy::PassManagerBase&, llvm::raw_pwrite_stream&, llvm::raw_pwrite_stream*, llvm::CodeGenFileType, bool, llvm::MachineModuleInfoWrapperPass*) + 180
10 llc 0x00000001027a1f24 main + 7456
11 dyld 0x00000001833c9058 start + 2224
zsh: segmentation fault ./build/bin/llc fib.ll --march=valida --global-isel
When attempting to generate assembly using the llc tool for the valida target on a MacBook with an M2 chip, llc crashes with the output above.
Currently the program ROM bus channel sends are commented out (in `CpuChip::global_sends`, in `cpu/src/lib.rs`). As a result, the consistency of the execution trace with the program ROM contents is not being checked. Re-enable these sends, causing the consistency of the execution trace with the program ROM contents to be enforced. Make sure that the prover still runs and the verifier still accepts the proofs output by the prover after this change.
The verifier needs to check that the prover is sending correct commitments to the preprocessed traces. The prover can send openings, proven with the batch opening proof, which let the verifier check that the commitments to the preprocessed traces are for the correct preprocessed traces. Or maybe the verifier can check it by reproducing the same commitments to the preprocessed traces.
Specify a commit hash for Plonky3 in the build files which lets Valida be built without having a local clone of Plonky3.
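For reference, a git dependency pinned to a revision in `Cargo.toml` looks like the following (the crate name and `<commit-hash>` are placeholders, not a recommendation of a specific commit):

```toml
[dependencies]
# Pinning a git dependency to a specific commit lets `cargo build` fetch
# Plonky3 itself, with no local clone or path override required.
p3-field = { git = "https://github.com/Plonky3/Plonky3.git", rev = "<commit-hash>" }
```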
Valida is using a common memory argument, where the prover commits to a log of operations ordered by `(address, timestamp)`. Once we have this ordered log, memory consistency is easily enforced with constraints.
To simplify this discussion, say we're ordering by `address` only. We verify this ordering by range checking each address delta, asserting that it is within some "small" range like 16 bits.
The circular nature of field elements complicates this; e.g. an address step from `p - 1` to 0 would be seen as a small delta of 1. So if a memory log had a sequence of addresses like `0, 2^16, 2 * 2^16, ...`, wrapping around beyond `p`, it could revisit some address a second time, breaking the soundness of our memory argument.
Our initial thought was to avoid this problem by limiting the address space to ~30 bits, creating a ~30 bit "dead zone". Attempting to cross this "dead zone" would result in a failed range check.
However, we currently allow the prover to insert "dummy reads", to break down larger address deltas into smaller ones. It seems like these could also be used to break down a ~30 bit "dead zone" gap into small deltas.
Instead of adding dummy rows in the memory table to break down large deltas, we could just support larger range checks. Say we limit the address space to 29 bits; then any legitimate delta will be < 2^29. We could do a 29-bit range check by breaking it into, say, 14- and 15-bit range checks.
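A sketch of that limb decomposition (assuming the 14 + 15 split mentioned above): a delta passes the 29-bit check iff it splits into a 14-bit low limb and a 15-bit high limb that recompose to it, and each limb can be checked against a much smaller range-check table.

```rust
// Decompose a claimed 29-bit delta into a 14-bit limb and a 15-bit limb.
// In-circuit, each limb would be range checked against a small table and a
// constraint would enforce delta = low + high * 2^14.
fn decompose_29(delta: u32) -> Option<(u32, u32)> {
    if delta >= 1 << 29 {
        return None; // out of range: no valid decomposition exists
    }
    let low = delta & ((1 << 14) - 1); // 14-bit limb
    let high = delta >> 14;            // 15-bit limb, since delta < 2^29
    Some((low, high))
}
```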
It might be worthwhile to also evaluate other memory techniques though. Triton has a unique method of enforcing that the log is grouped (not necessarily ordered) by address. BEGKN92 (section 4.1) offers a different approach, which doesn't involve address grouping at all. I'm not up to speed on Lasso yet, but I think it has another method inspired by BEGKN92.
Here is an LLVM backend for the TinyRAM architecture. Have you considered reusing it?
The constraints for immediate values are missing a range check on `read_value_2`, in the implementation of `Air<AB>::eval` in `cpu/src/stark.rs`. The comment says:
`Immediate value constraints (TODO: we'd need to range check read_value_2 in this case)`
Our ISA spec is getting out of date as we actively develop. We should update it. E.g.:

- Add `READ_ADVICE`:
  - Operands: `a`
  - Description: Read the content of the advice tape and write the next 4 byte values to those beginning at offset `a`.
- Add `WRITE`:
  - Operands: `b(fp)`
  - Description: Follow the pointer stored at offset `b` from the current frame pointer and write the next 4 byte values to the output chip.
- Remove `ISH{L,R}`
- Add `DIVI`
- Add immediate versions of the bitwise operations `and`, `or`, `xor`
- Add `NE`:
  - Operands: `a, b(fp), c(fp)`
  - Description: If `[b] != [c]` then set `a` to 1. Otherwise, set `a` to 0.
- Add `MULHU`:
  - Operands: `a, b(fp), c(fp)`
  - Description: Multiply `[b]` by `[c]` then set `a` to the top part of the result.
- Add `LOADFP` (to be implemented):
  - Operands: `a, b(fp)`
  - Description: Stores the current value of `fp` plus the immediate operand `b` into the register numbered `a`. The operand `c` is always set to 0 (it is unused), while the operands `d` and `e` are always set to 1 (indicating that `b` and `c` are immediate operands).
- Add `ILT`
Various Valida functions might exhibit partiality; on some inputs, they might fail to terminate with a result of the output type. Instead of panicking or non-terminating, functions should return a result indicating an error. The public functions which are currently partial should be modified to return a `Result` type.
It is okay to use functions which may panic in unreachable code sections and in cases where it is impossible for the functions to get inputs which result in a panic. However, attainable conditions should never result in a panic.
The reason for this issue is that Valida is going to be used in services, and those services can't panic. They have to stay running even when errors occur.
The following code sections (as far as I know) may potentially result in partiality. Line numbers are relative to commit eddd2b031a13278bc4855dea802fbc045f1378d8.
There are three instances of `panic!()`, excluding macro code which is only executed at compile time, and code which is only used for debugging:

- `run` in the implementation of the `run_method` macro in `derive/src/lib.rs`.
- `AirBuilder::is_transition_window` for `ConstraintFolder<'a, F, EF, M>` in `machine/src/__internal/folding_builder.rs`.
- `MemoryChip::read`, in `memory/src/lib.rs`.

There are no instances of `.expect()` in non-test code, except for some instances in macro code which runs exclusively at compile time. It is okay for code which runs exclusively at compile time to be partial, as long as it terminates.
There are 16 instances of `.unwrap()` in non-test code, excluding macro code which runs exclusively at compile time:

- `main` in `basic/src/bin/valida.rs`, on the call to `load_program_rom`.
- `main` in `basic/src/bin/valida.rs`, on the call to `std::io::stdin().read_to_end`.
- `main` in `basic/src/bin/valida.rs`, on the call to `std::io::stdout().write_all`.
- `check_constraints` in `machine/src/__internal/check_constraints.rs`, on line 30.
- `check_constraints` in `machine/src/__internal/check_constraints.rs`, on line 39.
- `check_constraints` in `machine/src/__internal/check_constraints.rs`, on line 44.
- `check_cumulative_sums` in `machine/src/__internal/check_constraints.rs`, on line 90.
- `generate_permutation_trace` in `machine/src/chip.rs`, on line 40.
- `generate_permutation_trace` in `machine/src/chip.rs`, on line 169.
- `generate_permutation_trace` in `machine/src/chip.rs`, on line 189.
- `eval_permutation_constraints` in `machine/src/chip.rs`, on line 268.
- `eval_permutation_constraints` in `machine/src/chip.rs`, on line 270.
- `MemoryChip::insert_dummy_reads` in `memory/src/lib.rs`, on line 257.
- `MemoryChip::insert_dummy_reads` in `memory/src/lib.rs`, on line 258.
- `MemoryChip::insert_dummy_reads` in `memory/src/lib.rs`, on line 259.
- `Instruction<M>::execute` for `WriteInstruction` in `output/src/lib.rs`, on line 154.

There are 4 instances of `assert_eq!()` in non-test code which is run after compile time:

- `check_constraints` in `machine/src/__internal/check_constraints.rs`.
- `check_cumulative_sums` in `machine/src/__internal/check_constraints.rs`.
- `Instruction<M>::execute` for `WriteInstruction` in `output/src/lib.rs`, on lines 162 and 163.

There is one instance of `assert!()` in non-test code which runs after compile time:

- `read_word` in the `MachineWithProgramChip` trait definition, in `program/src/lib.rs`.

The following files contain `unsafe` blocks, which should be assumed, pending further review, to potentially result in partiality:
alu_u32/src/add/columns.rs
alu_u32/src/add/mod.rs
alu_u32/src/bitwise/columns.rs
alu_u32/src/bitwise/mod.rs
alu_u32/src/div/columns.rs
alu_u32/src/lt/columns.rs
alu_u32/src/lt/mod.rs
alu_u32/src/mul/columns.rs
alu_u32/src/shift/columns.rs
alu_u32/src/shift/mod.rs
alu_u32/src/sub/columns.rs
alu_u32/src/sub/mod.rs
cpu/src/columns.rs
cpu/src/lib.rs
derive/src/lib.rs
memory/src/columns.rs
memory/src/lib.rs
native_field/src/columns.rs
native_field/src/lib.rs
output/src/columns.rs
output/src/lib.rs
program/src/columns.rs
range/src/columns.rs
range/src/lib.rs
The following arithmetic operations may result in division by zero and a panic:

- `Div::div` for `Word<u8>`, in `machine/src/core.rs`, line 87.
- `MemoryChip::insert_dummy_reads` in `memory/src/lib.rs`, line 222.
- `Instruction<M>::execute` for `Div32Instruction`, in `alu_u32/src/div/mod.rs`, line 89.

The following unchecked array indexes may result in an out-of-bounds error and a panic:

- `generate_permutation_trace` in `machine/src/chip.rs`, lines 120, 166, 179, 182, and 189.
- `generate_rlc_elements` in `machine/src/chip.rs`, lines 285 and 300.
- `CpuChip::set_instruction_values` in `cpu/src/lib.rs`, line 248.
- `CpuChip::pad_to_power_of_two` in `cpu/src/lib.rs`, lines 328, 329, 330, 346, 347, 348, 351, 352, 355, 356, and 357.

The zk-VM `run` method is capable of non-termination; see `derive/src/lib.rs`, lines 158-176. This is to be expected, because Valida is Turing complete. Perhaps, however, we should code in a maximum number of steps, in order to avoid partiality. We can even pass in the maximum number of steps as an input to the method.
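A hedged sketch of that step-limited variant (hypothetical signatures, not the actual `run` macro): the loop consumes a fuel budget and returns an error instead of looping forever.

```rust
#[derive(Debug, PartialEq)]
pub enum RunError {
    StepLimitExceeded,
}

// Run the step function until it halts (returns None) or the fuel budget
// is exhausted, returning an error instead of spinning indefinitely.
pub fn run_with_fuel(
    mut pc: u32,
    max_steps: u64,
    step: impl Fn(u32) -> Option<u32>,
) -> Result<(), RunError> {
    for _ in 0..max_steps {
        match step(pc) {
            Some(next_pc) => pc = next_pc,
            None => return Ok(()), // program halted normally
        }
    }
    Err(RunError::StepLimitExceeded)
}
```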
In `machine/src/__internal/folding_builder.rs`, the implementation of `AirBuilder::assert_zero` for `ConstraintFolder<'a, F, EF, M>` is stubbed out. The stub should be replaced with a proper implementation.
The implementation of `Chip<M>::generate_trace` for `RangeCheckerChip<MAX>`, in `range/src/lib.rs`, is `O(MAX * log(|self.count|))` but can be done in `O(|self.count| * log(|self.count|))`. In this context, `self.count` is a `BTreeMap<u32, u32>` whose key space is `0..MAX`. This means that `|self.count|`, the cardinality or number of keys of `self.count`, is less than or equal to `MAX`, and thus the algorithm is asymptotically slow, as noted by the code comment.
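A sketch of the asymptotic fix (illustrative, not the chip's actual trace layout): iterate only the populated entries of the map instead of probing every value in `0..MAX`. The dense column allocation is still `O(MAX)`, but the per-row map lookups disappear.

```rust
use std::collections::BTreeMap;

// Build the dense counts column by scattering the map's entries, doing
// O(|count|) map work instead of O(MAX) lookups of O(log |count|) each.
fn counts_column(max: u32, count: &BTreeMap<u32, u32>) -> Vec<u32> {
    let mut col = vec![0u32; max as usize];
    for (&value, &n) in count {
        col[value as usize] = n; // keys are in 0..MAX by construction
    }
    col
}
```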
In `machine/src/__internal/folding_builder.rs`, the implementation of `PermutationAirBuilder::permutation_randomness` for `ConstraintFolder<'a, F, EF, M>` is stubbed out. The stub should be replaced with a proper implementation.
The following arithmetic operations may result in unchecked overflow, resulting in different behavior in debug vs release compilation mode. For each of these, the desired resolution is likely (a) the overflow should wrap around, and this should be made explicit and made to happen in both debug and release mode, or (b) overflow is deemed to be impossible for all inputs, and should result in a panic with an informative error message. However, other resolutions are possible and these must be considered on a case by case basis.
- `machine/src/core.rs`:
  - `Add::add` for `Word<u8>`, on line 57
  - `Sub::sub` for `Word<u8>`, on line 67
- `alu_u32/src/add/mod.rs`:
  - `Add32Chip::op_to_row`
  - `Instruction<M>::execute` for `Add32Instruction` (`read_addr_1`, `write_addr_1`, `read_addr_2`, `a`)
- `Instruction<M>::execute` for `ReadAdviceInstruction`
- `Instruction<M>::execute` for `WriteAdviceInstruction` (`fp + mem_addr`, `(fp + mem_addr) + mem_buf_len`)
- `Instruction<M>::execute` for `Load32Instruction` (`read_addr_1`, `write_addr`)
- `Instruction<M>::execute` for `Store32Instruction` (`read_addr`, `write_addr`)
- `Instruction<M>::execute` for `JalInstruction` (`write_addr`, `next_pc`)
- `Instruction<M>::execute` for `JalvInstruction` (`write_addr`, `next_pc`, `read_addr`, `read_addr`)
- `Instruction<M>::execute` for `BeqInstruction` (`read_addr_1`, `read_addr_2`)
- `Instruction<M>::execute` for `BneInstruction` (`read_addr_1`, `read_addr_2`)
- `Instruction<M>::execute` for `Imm32Instruction` (`write_addr`)
- `state.cpu_mut().pc += 1;`
- `impl CpuChip` (`self.pc += 1;`, `self.clock += 1;`)
- `alu_u32/src/bitwise/mod.rs`:
  - `Instruction<M>::execute` for `Xor32Instruction` (`read_addr_1`, `write_addr`, `read_addr_2`)
  - `Instruction<M>::execute` for `And32Instruction` (`read_addr_1`, `write_addr`, `read_addr_2`)
  - `Instruction<M>::execute` for `Or32Instruction` (`read_addr_1`, `write_addr`, `read_addr_2`)
- `alu_u32/src/div/mod.rs`, in `Instruction<M>::execute` for `Div32Instruction` (`read_addr_1`, `write_addr`, `read_addr_2`)
- `alu_u32/src/lt/mod.rs`, in `Instruction<M>::execute` for `Lt32Instruction` (`read_addr_1`, `write_addr`, `read_addr_2`)
- `alu_u32/src/mul/mod.rs`, in `Instruction::execute` for `Mul32Instruction` (`read_addr_1`, `write_addr`, `read_addr_2`, `a`)
- `alu_u32/src/shift/mod.rs`:
  - `Instruction<M>::execute` for `Shl32Instruction` (`read_addr_1`, `write_addr`, `read_addr_2`)
  - `Instruction<M>::execute` for `Shr32Instruction` (`read_addr_1`, `write_addr`, `read_addr_2`)
- `alu_u32/src/sub/mod.rs`, in `Instruction<M>::execute` for `Sub32Instruction` (`read_addr_1`, `write_addr`, `read_addr_2`, `a`)
- `native_field/src/lib.rs`:
  - `Instruction<M>::execute` for `AddInstruction` (`read_addr_1`, `write_addr`, `read_addr_2`)
  - `Instruction<M>::execute` for `SubInstruction` (`read_addr_1`, `write_addr`, `read_addr_2`)
  - `Instruction<M>::execute` for `MulInstruction` (`read_addr_1`, `write_addr`, `read_addr_2`)
- `output/src/lib.rs`:
  - `Chip<M>::generate_trace` for `OutputChip` (`cols.diff`)
  - `Instruction<M>::execute` for `WriteInstruction` (`read_addr_1`)
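The two resolutions described above can be made explicit in Rust; a minimal sketch (hypothetical function names, `u8` chosen to match `Word<u8>`):

```rust
// Option (a): explicit wrap-around, identical behavior in debug and
// release builds.
fn add_wrapping(a: u8, b: u8) -> u8 {
    a.wrapping_add(b)
}

// Option (b): treat overflow as impossible-by-construction, and fail
// loudly with an informative message if the assumption is ever violated.
fn add_checked(a: u8, b: u8) -> u8 {
    a.checked_add(b)
        .expect("overflow in add_checked: inputs were assumed to be in range")
}
```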
Completed chips (that aren't known to be missing any constraints or witness generation code) are checked off below:
I tried the instructions from the README and was a bit surprised by the results.
When running the interpreter, fib(25) takes 104 seconds:
$ echo -n '\x19' | time ./target/release/valida basic/tests/data/fibonacci.bin
Command terminated by signal 9
104.76user 16.74system 2:01.90elapsed 99%CPU (0avgtext+0avgdata 62092988maxresident)k
38656inputs+0outputs (194major+16075827minor)pagefaults 0swaps
The README suggested piping to hexdump, but doing so produces no output, so I elided that command.
I don't know how many cycles of execution this takes, but looking at the assembly, seems like 10-ish instructions per Fib iteration? So this lands somewhere around 250 instructions / 104 seconds = just a few Hz. That doesn't seem right; what's going on?
The unit tests, however, complete much quicker:
$ cargo test --release
...
Running `/home/tcarstens/valida/valida/target/debug/valida tests/data/fibonacci.bin`
test run_fibonacci ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.10s
Running tests/test_prover.rs (target/release/deps/test_prover-6787c50d144a7a27)
running 1 test
test prove_fibonacci ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
Using the rough time for the interpreter unit test (0.1 seconds), this comes in around 250 instructions / 0.1 seconds = 2.5 kHz, which also doesn't seem right. (The prover finished in a reported 0.00 seconds, so I can't do a similar comparison. Why would proving be faster, though?)
Why is the interpreter so much slower than the unit test? And what's the right way to calculate the effective clockspeed?