A (WIP) bytecode multi-stack awk interpreter. The goal of rawk is to be the fastest awk for all programs.
rawk uses type inference to determine the type of each variable (string, string-numeric, number, or array) at compile time. This allows rawk to emit non-dynamic bytecode in many scenarios. For code like
```awk
{ a = 1; b = 2; print a + b; }
```
rawk emits the following bytecode for `print a + b` (`Gscl` means global scalar):

```
6 GsclNum(0) args: [[]] push: [[Num]]
7 GsclNum(1) args: [[]] push: [[Num]]
8 Add args: [[Num, Num]] push: [[Num]]
```
The `Add` instruction knows its operands are numbers, so it does not need to check types at runtime.
```rust
pub fn add(vm: &mut VirtualMachine, ip: usize, _imm: Immed) -> usize {
    let rhs: f64 = vm.pop_num(); // pop from the numeric stack
    let lhs: f64 = vm.pop_num(); // again
    vm.push_num(lhs + rhs);      // add them and push the result
    ip + 1                       // advance to the next instruction
}
```
If the types of `a` and `b` cannot be determined at compile time (as below),

```awk
{ if ($1) { a = 1; b = 2; } else { a = "1"; b = "2" } print (a + b); }
```
rawk has no significant advantage here and must add two more bytecode ops to convert string -> number. For `print (a + b)` rawk emits:

```
16 GsclVar(0) args: [[]] push: [[Var]]
17 VarToNum args: [[Var]] push: [[Num]]
18 GsclVar(1) args: [[]] push: [[Var]]
19 VarToNum args: [[Var]] push: [[Num]]
20 Add args: [[Num, Num]] push: [[Num]]
```
`Var` means variable: the stack of values whose type could be string, strnum, or number, and whose types must be checked at runtime.
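For contrast, here is a hypothetical sketch (`Value`, `to_num`, and `dynamic_add` are illustrative names, not rawk's actual types) of what an add has to do when operand types are only known at runtime:

```rust
// Hypothetical sketch, not rawk's API: a tagged value as a fully
// dynamic interpreter might represent it.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Num(f64),
    Str(String),
}

impl Value {
    // awk coerces a string operand to a number before arithmetic.
    // Simplified here: real awk parses a numeric prefix (strtod-style),
    // while this version treats any non-numeric string as 0.
    fn to_num(&self) -> f64 {
        match self {
            Value::Num(n) => *n,
            Value::Str(s) => s.trim().parse::<f64>().unwrap_or(0.0),
        }
    }
}

// A dynamic add must branch on both tags every time it executes.
fn dynamic_add(lhs: &Value, rhs: &Value) -> Value {
    Value::Num(lhs.to_num() + rhs.to_num())
}

fn main() {
    let a = Value::Str("1".to_string());
    let b = Value::Num(2.0);
    println!("{:?}", dynamic_add(&a, &b)); // Num(3.0)
}
```

Branching on the tag of every operand on every execution is exactly the overhead the typed `GsclNum`/`Add` sequence avoids.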
rawk uses a ring buffer to read from files, avoiding copies unless the data is actually needed. rawk's file reading is faster than that of any other awk I am aware of. I have not yet optimized output, so I have no idea how it compares. Here's a comparison of various awks reading every line in a file, storing it, and then printing the final value.
(onetrueawk is far to the right of this chart so I've omitted it)
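The zero-copy idea can be sketched as follows. This is a simplified illustration, not rawk's implementation: `LineReader` is a made-up type backed by a plain `Vec<u8>` rather than a true ring buffer, but the borrowing idea is the same.

```rust
// Simplified sketch of zero-copy line reading: lines are handed out as
// slices borrowed from the read buffer, so nothing is copied unless the
// caller decides to keep a line.
struct LineReader {
    buf: Vec<u8>,
    pos: usize,
}

impl LineReader {
    fn new(data: Vec<u8>) -> Self {
        LineReader { buf: data, pos: 0 }
    }

    // Return the next line as a borrowed slice into the buffer.
    // The caller only pays for a copy if it needs to own the line
    // beyond the next read.
    fn next_line(&mut self) -> Option<&[u8]> {
        if self.pos >= self.buf.len() {
            return None;
        }
        let start = self.pos;
        let end = self.buf[start..]
            .iter()
            .position(|&b| b == b'\n')
            .map(|i| start + i)
            .unwrap_or(self.buf.len());
        self.pos = end + 1; // skip past the newline
        Some(&self.buf[start..end])
    }
}

fn main() {
    let mut reader = LineReader::new(b"one\ntwo\nthree".to_vec());
    while let Some(line) = reader.next_line() {
        println!("{}", String::from_utf8_lossy(line));
    }
}
```

A real ring buffer additionally lets the reader refill from the file without shifting the remaining bytes, which is what makes the approach practical for streaming input.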
- Reading from stdin
- Native string functions
- index
- match
- split
- sprintf
- Redirect output to file
- close() function
- Pattern Ranges
- The columns runtime should not duplicate work when the same field is looked up multiple times
- The columns runtime should support assignment
- Divide by 0 needs to print an error
- All the builtin variables that are read-only:
- ARGC (float)
- FILENAME (str)
- FNR (float)
- NF (float)
- NR (float)
- RLENGTH (float)
- RSTART (float)
- Builtins that are read/write:
- CONVFMT (str)
- FS (str)
- OFMT (str)
- OFS (str)
- ORS (str)
- RS (str)
- SUBSEP (str)
- Builtins that are arrays (read-only in this implementation):
- ARGV
- ENVIRON
Mawk is GPLv2 (./mawk-regex-sys/LICENSE). Quick Drop Deque is MIT (./quick-drop-deque/LICENSE). The combined project is GPLv2.
Install other awks to test against (they should be on your PATH with these exact names):
- gawk (on linux/mac you already have it)
- mawk - build from source
- goawk - needs the go toolchain, then `go get`
- onetrueawk - a quick and easy build from source
Tests by default just check correctness against other awks and an oracle result:

```shell
cargo test
```

If you want to run perf tests, set the env var `jperf` to `true`, then run `cargo build --release` followed by `cargo test -- --test-threads=1`. This will test the speed of the release binary against the other awks.