valida-xyz / valida

A STARK-based VM focused on code reuse, performance, and modularity

License: Apache License 2.0
The challenger for the Fiat-Shamir transformation needs to observe all of the pieces which are in the transcript. Otherwise, soundness bugs can result. See `derive/src/lib.rs`, where there are some relevant TODO comments. Review the use of the challenger to make sure that the Fiat-Shamir transformation is being performed properly, adding any observations to it as needed.
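As a toy illustration of the discipline being asked for (this is not the Plonky3 challenger API, and the mixing function is a stand-in for a real hash), every prover message must be absorbed before a challenge is sampled; otherwise the challenge is independent of the message:

```rust
// Toy transcript challenger. `observe` absorbs prover messages; `sample`
// derives a challenge from everything absorbed so far.
pub struct Challenger {
    state: u64,
}

impl Challenger {
    pub fn new() -> Self {
        Challenger { state: 0 }
    }

    // Stand-in for a real sponge absorb step (FNV-style mixing).
    pub fn observe(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.state = self.state.wrapping_mul(0x0100_0000_01b3).wrapping_add(b as u64);
        }
    }

    // Stand-in for squeezing a challenge out of the transcript state.
    pub fn sample(&mut self) -> u64 {
        self.state = self
            .state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.state
    }
}
```

If a commitment is never observed, two different transcripts yield the same challenge, which is exactly the kind of soundness bug described above.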
Find the most interesting and impactful opportunities for parallelizing the Valida prover, and implement them in the code.
A comment in `memory/src/stark.rs` says `FIXME: Degree constraint 4, need to remove`. It's not clear to me why a degree 4 constraint needs to be removed. Is this an optimization or a requirement for functional completeness and correctness?
Design and implement an extension of the Valida ISA which supports getting the current value of `fp`, or more generally the current address of an object on the stack in source programs.
Looking up AND input/output in a preprocessed table will cut Bitwise chip column count from 79 to 15.
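A minimal sketch of the preprocessed-table idea (the 79-to-15 column figure is from the issue; the exact table shape below is an assumption): fix a byte-wise AND table with one row per input pair, so a 32-bit AND reduces to four table lookups instead of per-bit witness columns.

```rust
// Preprocessed table of all 8-bit AND triples (x, y, x & y): 2^16 rows.
// A 32-bit AND then decomposes into four byte-wise lookups into this table.
fn and_table() -> Vec<(u8, u8, u8)> {
    let mut rows = Vec::with_capacity(1 << 16);
    for x in 0..=255u8 {
        for y in 0..=255u8 {
            rows.push((x, y, x & y));
        }
    }
    rows
}
```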
`Load32` provides the ability to load from an arbitrary RAM address. `Store32` should provide the ability to write to an arbitrary RAM address. As specified and implemented, it does not. Instead of acting as the write-side analogue of `Load32`, `Store32` acts as an analogue of `LW`; it moves a value from one stack location to another stack location, with the stack locations determined as fixed offsets from `fp`. What's needed is an instruction to write to a RAM address specified at a location on the stack.
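To make the distinction concrete, here is a toy RAM model (illustrative only; the operand roles and the real implementation may differ) contrasting the current move-between-stack-slots behavior with the intended pointer-following write:

```rust
use std::collections::BTreeMap;

type Ram = BTreeMap<u32, u32>;

// Current behavior: moves the word at fp+c to fp+b. Both addresses are
// fixed offsets from fp, so no arbitrary RAM address can ever be written.
fn store32_current(ram: &mut Ram, fp: u32, b: u32, c: u32) {
    let v = *ram.get(&(fp + c)).unwrap_or(&0);
    ram.insert(fp + b, v);
}

// Intended behavior: follow the pointer stored at fp+c and write the word
// at fp+b to that (arbitrary) RAM address.
fn store32_intended(ram: &mut Ram, fp: u32, b: u32, c: u32) {
    let ptr = *ram.get(&(fp + c)).unwrap_or(&0);
    let v = *ram.get(&(fp + b)).unwrap_or(&0);
    ram.insert(ptr, v);
}
```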
Question. How do compilers targeting Valida deal with varargs in languages supporting them?
Context. PR #65. The Valida calling convention as described in the Valida spec requires the size of each function's stack frame to be known at compile time. As a result, potentially variable-sized local variables should be allocated on the heap. This is because the ISA only allows `fp` to be updated by adding to it a constant or a value loaded from memory at a constant offset from `fp`. Since function arguments are also passed on the stack, this means that the length of the arguments also must be known at compile time.
An alternative semantics for `jalv` would not impose this constraint. In the contemplated alternative semantics, `jalv` would load a value from memory into `fp`. Perhaps a complementary jump instruction would store `fp` into a location in memory contemporaneously with jumping. This would allow the local variables and the arguments to functions to have sizes not known at compile time.
The idea we're currently going with is to let stack objects exist on the heap if they need to be variable-sized. Each stack frame would then contain a constant number of scalar local variables and pointers into the heap. Each function call generates a stack frame full of arguments, and so to extend this idea to those frames, each of these argument stack frames would contain a constant number of scalar arguments and pointers into the heap. I want to confirm my understanding that this is how we would support varargs in compilers targeting Valida.
In the equality test in the CPU chip, one of the constraints states that `not_equal = diff * diff_inv`. Since `not_equal` is 0 or 1, this is satisfied only if `diff_inv` is the multiplicative inverse of `diff`, or `diff` is zero, or `diff_inv` is zero. If `not_equal` is 1, then `diff_inv` is the multiplicative inverse of `diff`, implying that `diff` is not zero, implying that the input values are not equal. What's not clear is that if `not_equal` is 0, then `diff` is 0. It seems like it could be that `diff` is non-zero, and `diff_inv` is 0, and `not_equal` is 0. So, the argument is not sound. We can fix this by adding another constraint: `(1 - not_equal) * diff = 0`. This should complete the constraints for the zero test on `diff`. This issue may be replicated elsewhere where zero tests occur.
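The gap can be checked concretely. The sketch below (over a toy prime field, not the actual chip code) exhibits a witness with non-zero `diff` that satisfies the product constraint alone while claiming `not_equal = 0`, and shows that the proposed extra constraint rejects it:

```rust
// Toy prime field F_p with p = 101, for demonstrating the constraints.
const P: u64 = 101;

fn mul(a: u64, b: u64) -> u64 { a * b % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }

// Existing constraint: not_equal = diff * diff_inv.
fn product_constraint(diff: u64, diff_inv: u64, not_equal: u64) -> bool {
    not_equal == mul(diff, diff_inv)
}

// Proposed fix: (1 - not_equal) * diff = 0.
fn fix_constraint(diff: u64, not_equal: u64) -> bool {
    mul(sub(1, not_equal), diff) == 0
}
```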
Currently the Shift32 chip STARK does not put any constraint on its output. The output should be constrained to be equal to `input_1 * power_of_two`, truncated to a 32-bit word.
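A sketch of the intended output relation (names follow the issue; the actual column layout lives in the chip): the output is the low 32 bits of the 64-bit product.

```rust
// Expected Shift32 output: the full product input_1 * power_of_two,
// computed in 64 bits and truncated to its low 32 bits.
fn shift32_output(input_1: u32, power_of_two: u32) -> u32 {
    ((input_1 as u64) * (power_of_two as u64)) as u32
}
```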
./build/bin/llc fib.ll --march=valida --global-isel
Pass 'IRTranslator' is not initialized.
Verify if there is a pass dependency cycle.
Required Passes:
Insert stack protectors
Target Pass Configuration
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0. Program arguments: ./build/bin/llc fib.ll --march=valida --global-isel
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var LLVM_SYMBOLIZER_PATH to point to it):
0 llc 0x00000001035ac644 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) + 56
1 llc 0x00000001035aa9f0 llvm::sys::RunSignalHandlers() + 112
2 llc 0x00000001035ace24 SignalHandler(int) + 352
3 libsystem_platform.dylib 0x0000000183771a24 _sigtramp + 56
4 llc 0x00000001030036b8 llvm::PMTopLevelManager::schedulePass(llvm::Pass*) + 2876
5 llc 0x0000000102e2ff24 llvm::TargetPassConfig::addPass(llvm::Pass*) + 660
6 llc 0x00000001027b8594 (anonymous namespace)::ValidaPassConfig::addIRTranslator() + 48
7 llc 0x0000000102e30b44 llvm::TargetPassConfig::addCoreISelPasses() + 460
8 llc 0x0000000102e30d6c llvm::TargetPassConfig::addISelPasses() + 272
9 llc 0x0000000102bbac88 llvm::LLVMTargetMachine::addPassesToEmitFile(llvm::legacy::PassManagerBase&, llvm::raw_pwrite_stream&, llvm::raw_pwrite_stream*, llvm::CodeGenFileType, bool, llvm::MachineModuleInfoWrapperPass*) + 180
10 llc 0x00000001027a1f24 main + 7456
11 dyld 0x00000001833c9058 start + 2224
zsh: segmentation fault ./build/bin/llc fib.ll --march=valida --global-isel
When attempting to generate assembly using the llc tool for the valida target on a MacBook with an M2 chip, llc crashes with the output above.
Currently the program ROM bus channel sends are commented out (in `CpuChip::global_sends`, in `cpu/src/lib.rs`). As a result, the consistency of the execution trace with the program ROM contents is not being checked. Re-enable these sends, causing the consistency of the execution trace with the program ROM contents to be enforced. Make sure that the prover still runs and the verifier still accepts the proofs output by the prover after this change.
The verifier needs to check that the prover is sending correct commitments to the preprocessed traces. The prover can send openings, proven with the batch opening proof, which let the verifier check that the commitments to the preprocessed traces are for the correct preprocessed traces. Or maybe the verifier can check it by reproducing the same commitments to the preprocessed traces.
Specify a commit hash for Plonky3 in the build files which lets Valida be built without having a local clone of Plonky3.
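For reference, a git dependency pinned to a revision in `Cargo.toml` looks like the following (the crate name and `<commit-hash>` are placeholders, not a recommendation of a specific commit):

```toml
[dependencies]
# Pinning a git dependency to a specific commit lets `cargo build` fetch
# Plonky3 itself, with no local clone or path override required.
p3-field = { git = "https://github.com/Plonky3/Plonky3.git", rev = "<commit-hash>" }
```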
Valida is using a common memory argument, where the prover commits to a log of operations ordered by `(address, timestamp)`. Once we have this ordered log, memory consistency is easily enforced with constraints.
To simplify this discussion, say we're ordering by `address` only. We verify this ordering by range checking each address delta, asserting that it is within some "small" range like 16 bits.
The circular nature of field elements complicates this; e.g. an address step from `p - 1` to 0 would be seen as a small delta of 1. So if a memory log had a sequence of addresses like `0, 2^16, 2 * 2^16, ...`, wrapping around beyond `p`, it could revisit some address a second time, breaking the soundness of our memory argument.
Our initial thought was to avoid this problem by limiting the address space to ~30 bits, creating a ~30 bit "dead zone". Attempting to cross this "dead zone" would result in a failed range check.
However, we currently allow the prover to insert "dummy reads", to break down larger address deltas into smaller ones. It seems like these could also be used to break down a ~30 bit "dead zone" gap into small deltas.
Instead of adding dummy rows in the memory table to break down large deltas, we could just support larger range checks. Say we limit the address space to 29 bits; then any legitimate delta will be < 2^29. We could do a 29-bit range check by breaking it into, say, 14- and 15-bit range checks.
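A sketch of that limb decomposition (assuming the 14 + 15 split mentioned above): a delta passes the 29-bit check iff it splits into a 14-bit low limb and a 15-bit high limb that recompose to it, and each limb can be checked against a much smaller range-check table.

```rust
// Decompose a claimed 29-bit delta into a 14-bit limb and a 15-bit limb.
// In-circuit, each limb would be range checked against a small table and a
// constraint would enforce delta = low + high * 2^14.
fn decompose_29(delta: u32) -> Option<(u32, u32)> {
    if delta >= 1 << 29 {
        return None; // out of range: no valid decomposition exists
    }
    let low = delta & ((1 << 14) - 1); // 14-bit limb
    let high = delta >> 14;            // 15-bit limb, since delta < 2^29
    Some((low, high))
}
```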
It might be worthwhile to also evaluate other memory techniques though. Triton has a unique method of enforcing that the log is grouped (not necessarily ordered) by address. BEGKN92 (section 4.1) offers a different approach, which doesn't involve address grouping at all. I'm not up to speed on Lasso yet, but I think it has another method inspired by BEGKN92.
Here is an LLVM backend for the TinyRAM architecture. Have you considered reusing it?
The constraints for immediate values are missing a range check on `read_value_2`, in the implementation of `Air<AB>::eval` in `cpu/src/stark.rs`. The comment says:
`Immediate value constraints (TODO: we'd need to range check read_value_2 in this case)`
Our ISA spec is getting out of date as we actively develop. We should update it. E.g.:

- Add `READ_ADVICE`:
  - Operands: `a`
  - Description: Read the content of the advice tape and write the next 4 byte values to those beginning at offset `a`.
- Add `WRITE`:
  - Operands: `b(fp)`
  - Description: Follow the pointer stored at offset `b` from the current frame pointer and write the next 4 byte values to the output chip.
- Remove `ISH{L,R}`
- Add `DIVI`
- Add immediate versions of the bitwise operations `and`, `or`, `xor`
- Add `NE`:
  - Operands: `a, b(fp), c(fp)`
  - Description: If `[b] != [c]` then set `a` to 1. Otherwise, set `a` to 0.
- Add `MULHU`:
  - Operands: `a, b(fp), c(fp)`
  - Description: Multiply `[b]` by `[c]` then set `a` to the top part of the result.
- Add `LOADFP` (to be implemented):
  - Operands: `a, b(fp)`
  - Description: Stores the current value of `fp` plus the immediate operand `b` into the register numbered `a`. The operand `c` is always set to 0 (it is unused), while the operands `d` and `e` are always set to 1 (indicating that `b` and `c` are immediate operands).
- Add `ILT`
Various Valida functions might exhibit partiality; on some inputs, they might fail to terminate with a result of the output type. Instead of panicking or non-terminating, functions should return a result indicating an error. The public functions which are currently partial should be modified to return a `Result` type.
It is okay to use functions which may panic in unreachable code sections and in cases where it is impossible for the functions to get inputs which result in a panic. However, attainable conditions should never result in a panic.
The reason for this issue is that Valida is going to be used in services, and those services can't panic. They have to stay running even when errors occur.
The following code sections (as far as I know) may potentially result in partiality. Line numbers are relative to commit eddd2b031a13278bc4855dea802fbc045f1378d8.
There are three instances of `panic!()`, excluding macro code which is only executed at compile time, and code which is only used for debugging:

- `run` in the implementation of the `run_method` macro in `derive/src/lib.rs`.
- `AirBuilder::is_transition_window` for `ConstraintFolder<'a, F, EF, M>` in `machine/src/__internal/folding_builder.rs`.
- `MemoryChip::read`, in `memory/src/lib.rs`.

There are no instances of `.expect()` in non-test code, except for some instances in macro code which runs exclusively at compile time. It is okay for code which runs exclusively at compile time to be partial, as long as it terminates.
There are 16 instances of `.unwrap()` in non-test code, excluding macro code which runs exclusively at compile time:

- `main` in `basic/src/bin/valida.rs`, on the call to `load_program_rom`.
- `main` in `basic/src/bin/valida.rs`, on the call to `std::io::stdin().read_to_end`.
- `main` in `basic/src/bin/valida.rs`, on the call to `std::io::stdout().write_all`.
- `check_constraints` in `machine/src/__internal/check_constraints.rs`, on line 30.
- `check_constraints` in `machine/src/__internal/check_constraints.rs`, on line 39.
- `check_constraints` in `machine/src/__internal/check_constraints.rs`, on line 44.
- `check_cumulative_sums` in `machine/src/__internal/check_constraints.rs`, on line 90.
- `generate_permutation_trace` in `machine/src/chip.rs`, on line 40.
- `generate_permutation_trace` in `machine/src/chip.rs`, on line 169.
- `generate_permutation_trace` in `machine/src/chip.rs`, on line 189.
- `eval_permutation_constraints` in `machine/src/chip.rs`, on line 268.
- `eval_permutation_constraints` in `machine/src/chip.rs`, on line 270.
- `MemoryChip::insert_dummy_reads` in `memory/src/lib.rs`, on line 257.
- `MemoryChip::insert_dummy_reads` in `memory/src/lib.rs`, on line 258.
- `MemoryChip::insert_dummy_reads` in `memory/src/lib.rs`, on line 259.
- `Instruction<M>::execute` for `WriteInstruction` in `output/src/lib.rs`, on line 154.

There are 4 instances of `assert_eq!()` in non-test code which is run after compile time:

- `check_constraints` in `machine/src/__internal/check_constraints.rs`.
- `check_cumulative_sums` in `machine/src/__internal/check_constraints.rs`.
- `Instruction<M>::execute` for `WriteInstruction` in `output/src/lib.rs`, on lines 162 and 163.

There is one instance of `assert!()` in non-test code which runs after compile time:

- `read_word` in the `MachineWithProgramChip` trait definition, in `program/src/lib.rs`.

The following files contain `unsafe` blocks, which should be assumed, pending further review, to potentially result in partiality:
alu_u32/src/add/columns.rs
alu_u32/src/add/mod.rs
alu_u32/src/bitwise/columns.rs
alu_u32/src/bitwise/mod.rs
alu_u32/src/div/columns.rs
alu_u32/src/lt/columns.rs
alu_u32/src/lt/mod.rs
alu_u32/src/mul/columns.rs
alu_u32/src/shift/columns.rs
alu_u32/src/shift/mod.rs
alu_u32/src/sub/columns.rs
alu_u32/src/sub/mod.rs
cpu/src/columns.rs
cpu/src/lib.rs
derive/src/lib.rs
memory/src/columns.rs
memory/src/lib.rs
native_field/src/columns.rs
native_field/src/lib.rs
output/src/columns.rs
output/src/lib.rs
program/src/columns.rs
range/src/columns.rs
range/src/lib.rs
The following arithmetic operations may result in division by zero and a panic:

- `Div::div` for `Word<u8>`, in `machine/src/core.rs`, line 87.
- `MemoryChip::insert_dummy_reads` in `memory/src/lib.rs`, line 222.
- `Instruction<M>::execute` for `Div32Instruction`, in `alu_u32/src/div/mod.rs`, line 89.

The following unchecked array indexes may result in an out-of-bounds error and a panic:

- `generate_permutation_trace` in `machine/src/chip.rs`, lines 120, 166, 179, 182, and 189.
- `generate_rlc_elements` in `machine/src/chip.rs`, lines 285 and 300.
- `CpuChip::set_instruction_values` in `cpu/src/lib.rs`, line 248.
- `CpuChip::pad_to_power_of_two` in `cpu/src/lib.rs`, lines 328, 329, 330, 346, 347, 348, 351, 352, 355, 356, and 357.

The zk-VM `run` method is capable of non-termination; see `derive/src/lib.rs`, lines 158-176. This is to be expected, because Valida is Turing complete. Perhaps, however, we should code in a maximum number of steps, in order to avoid partiality. We can even pass in the maximum number of steps as an input to the method.
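A hedged sketch of that step-limited variant (hypothetical signatures, not the actual `run` macro): the loop consumes a fuel budget and returns an error instead of looping forever.

```rust
#[derive(Debug, PartialEq)]
pub enum RunError {
    StepLimitExceeded,
}

// Run the step function until it halts (returns None) or the fuel budget
// is exhausted, returning an error instead of spinning indefinitely.
pub fn run_with_fuel(
    mut pc: u32,
    max_steps: u64,
    step: impl Fn(u32) -> Option<u32>,
) -> Result<(), RunError> {
    for _ in 0..max_steps {
        match step(pc) {
            Some(next_pc) => pc = next_pc,
            None => return Ok(()), // program halted normally
        }
    }
    Err(RunError::StepLimitExceeded)
}
```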
In `machine/src/__internal/folding_builder.rs`, the implementation of `AirBuilder::assert_zero` for `ConstraintFolder<'a, F, EF, M>` is stubbed out. The stub should be replaced with a proper implementation.
The implementation of `Chip<M>::generate_trace` for `RangeCheckerChip<MAX>`, in `range/src/lib.rs`, is `O(MAX * log(|self.count|))` but can be done in `O(|self.count| * log(|self.count|))`. In this context, `self.count` is a `BTreeMap<u32, u32>` whose key space is `0..MAX`. This means that `|self.count|`, the cardinality or number of keys of `self.count`, is less than or equal to `MAX`, and thus the algorithm is asymptotically slow, as noted by the code comment.
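A sketch of the asymptotic fix (illustrative, not the chip's actual trace layout): iterate only the populated entries of the map instead of probing every value in `0..MAX`. The dense column allocation is still `O(MAX)`, but the per-row map lookups disappear.

```rust
use std::collections::BTreeMap;

// Build the dense counts column by scattering the map's entries, doing
// O(|count|) map work instead of O(MAX) lookups of O(log |count|) each.
fn counts_column(max: u32, count: &BTreeMap<u32, u32>) -> Vec<u32> {
    let mut col = vec![0u32; max as usize];
    for (&value, &n) in count {
        col[value as usize] = n; // keys are in 0..MAX by construction
    }
    col
}
```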
In `machine/src/__internal/folding_builder.rs`, the implementation of `PermutationAirBuilder::permutation_randomness` for `ConstraintFolder<'a, F, EF, M>` is stubbed out. The stub should be replaced with a proper implementation.
The following arithmetic operations may result in unchecked overflow, resulting in different behavior in debug vs release compilation mode. For each of these, the desired resolution is likely (a) the overflow should wrap around, and this should be made explicit and made to happen in both debug and release mode, or (b) overflow is deemed to be impossible for all inputs, and should result in a panic with an informative error message. However, other resolutions are possible and these must be considered on a case by case basis.
- `machine/src/core.rs`:
  - `Add::add` for `Word<u8>`, on line 57
  - `Sub::sub` for `Word<u8>`, on line 67
- `alu_u32/src/add/mod.rs`:
  - `Add32Chip::op_to_row`
  - `Instruction<M>::execute` for `Add32Instruction` (`read_addr_1`, `write_addr_1`, `read_addr_2`, `a`)
- `Instruction<M>::execute` for `ReadAdviceInstruction`
- `Instruction<M>::execute` for `WriteAdviceInstruction` (`fp + mem_addr`, `(fp + mem_addr) + mem_buf_len`)
- `Instruction<M>::execute` for `Load32Instruction` (`read_addr_1`, `write_addr`)
- `Instruction<M>::execute` for `Store32Instruction` (`read_addr`, `write_addr`)
- `Instruction<M>::execute` for `JalInstruction` (`write_addr`, `next_pc`)
- `Instruction<M>::execute` for `JalvInstruction` (`write_addr`, `next_pc`, `read_addr`, `read_addr`)
- `Instruction<M>::execute` for `BeqInstruction` (`read_addr_1`, `read_addr_2`)
- `Instruction<M>::execute` for `BneInstruction` (`read_addr_1`, `read_addr_2`)
- `Instruction<M>::execute` for `Imm32Instruction` (`write_addr`)
- `state.cpu_mut().pc += 1;`
- `impl CpuChip` (`self.pc += 1;`, `self.clock += 1;`)
- `alu_u32/src/bitwise/mod.rs`:
  - `Instruction<M>::execute` for `Xor32Instruction` (`read_addr_1`, `write_addr`, `read_addr_2`)
  - `Instruction<M>::execute` for `And32Instruction` (`read_addr_1`, `write_addr`, `read_addr_2`)
  - `Instruction<M>::execute` for `Or32Instruction` (`read_addr_1`, `write_addr`, `read_addr_2`)
- `alu_u32/src/div/mod.rs`, in `Instruction<M>::execute` for `Div32Instruction` (`read_addr_1`, `write_addr`, `read_addr_2`)
- `alu_u32/src/lt/mod.rs`, in `Instruction<M>::execute` for `Lt32Instruction` (`read_addr_1`, `write_addr`, `read_addr_2`)
- `alu_u32/src/mul/mod.rs`, in `Instruction::execute` for `Mul32Instruction` (`read_addr_1`, `write_addr`, `read_addr_2`, `a`)
- `alu_u32/src/shift/mod.rs`:
  - `Instruction<M>::execute` for `Shl32Instruction` (`read_addr_1`, `write_addr`, `read_addr_2`)
  - `Instruction<M>::execute` for `Shr32Instruction` (`read_addr_1`, `write_addr`, `read_addr_2`)
- `alu_u32/src/sub/mod.rs`, in `Instruction<M>::execute` for `Sub32Instruction` (`read_addr_1`, `write_addr`, `read_addr_2`, `a`)
- `native_field/src/lib.rs`:
  - `Instruction<M>::execute` for `AddInstruction` (`read_addr_1`, `write_addr`, `read_addr_2`)
  - `Instruction<M>::execute` for `SubInstruction` (`read_addr_1`, `write_addr`, `read_addr_2`)
  - `Instruction<M>::execute` for `MulInstruction` (`read_addr_1`, `write_addr`, `read_addr_2`)
- `output/src/lib.rs`:
  - `Chip<M>::generate_trace` for `OutputChip` (`cols.diff`)
  - `Instruction<M>::execute` for `WriteInstruction` (`read_addr_1`)
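The two resolutions described above can be made explicit in Rust; a minimal sketch (hypothetical function names, `u8` chosen to match `Word<u8>`):

```rust
// Option (a): explicit wrap-around, identical behavior in debug and
// release builds.
fn add_wrapping(a: u8, b: u8) -> u8 {
    a.wrapping_add(b)
}

// Option (b): treat overflow as impossible-by-construction, and fail
// loudly with an informative message if the assumption is ever violated.
fn add_checked(a: u8, b: u8) -> u8 {
    a.checked_add(b)
        .expect("overflow in add_checked: inputs were assumed to be in range")
}
```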
Completed chips (that aren't known to be missing any constraints or witness generation code) are checked off below:
I tried the instructions from the README and was a bit surprised by the results.
When running the interpreter, fib(25) takes 104 seconds:
$ echo -n '\x19' | time ./target/release/valida basic/tests/data/fibonacci.bin
Command terminated by signal 9
104.76user 16.74system 2:01.90elapsed 99%CPU (0avgtext+0avgdata 62092988maxresident)k
38656inputs+0outputs (194major+16075827minor)pagefaults 0swaps
The README suggested piping to hexdump, but doing so produces no output, so I elided that command.
I don't know how many cycles of execution this takes, but looking at the assembly, seems like 10-ish instructions per Fib iteration? So this lands somewhere around 250 instructions / 104 seconds = just a few Hz. That doesn't seem right; what's going on?
The unit tests, however, complete much quicker:
$ cargo test --release
...
Running `/home/tcarstens/valida/valida/target/debug/valida tests/data/fibonacci.bin`
test run_fibonacci ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.10s
Running tests/test_prover.rs (target/release/deps/test_prover-6787c50d144a7a27)
running 1 test
test prove_fibonacci ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
Using the rough time for the interpreter unit test (0.1 seconds), this comes in around 250 instructions / 0.1 seconds = 2.5 kHz, which also doesn't seem right. (The prover finished in a reported 0.00 seconds, so I can't do a similar comparison. Why would proving be faster, though?)
Why is the interpreter so much slower than the unit test? And what's the right way to calculate the effective clockspeed?