skx / simple.vm Goto Github PK

View Code? Open in Web Editor NEW

455.0 24.0 50.0 176 KB

Simple virtual machine which interprets bytecode.

License: GNU General Public License v2.0

C 62.66% Perl 34.96% Emacs Lisp 0.97% Makefile 1.42%

virtual-machine perl c bytecode opcodes register binary-opcodes

simple.vm's Introduction

simple.vm
Compilation
Implementation Notes
Embedding
Instructions
- Simple Example
Fuzz Testing
See Also

simple.vm

This repository contains the implementation for a simple virtual-machine, along with a driver which will read a program from a file and execute it via that virtual machine.

In addition to the virtual machine interpreter you'll also find:

A simple compiler, written in perl.
- This will translate from assembly-source into binary-opcodes.
A simple decompiler, written in perl.
- This will translate in the other direction.
Several example programs written in our custom assembly-language.
An example of embedding the virtual machine in a C host program.
- Along with the definition of a custom-opcode handler.

This particular virtual machine is intentionally simple, but despite that it is hopefully implemented in a readable fashion. ("Simplicity" here means that we support only a small number of instructions, and the registers the virtual CPU possesses can store strings and integers, but not floating-point values.) This particular virtual machine is register-based, having ten registers which can be used to store strings or integer values.

Compilation

Because the compiler and decompiler are written in Perl they need no special treatment.

The interpretter, written in C, can be built like so:

$ make

This will generate simple-vm and embedded from the contents of src/.

Implementation Notes

Implementing a basic virtual machine, like this one, is a pretty well-understood problem:

Load the bytecode that is to be interpreted into a buffer.
Fetch an instruction from from the buffer.
- Decode the instruction, usually via a long case statement.
- Advance the index such that the next instruction is ready to be interpreted.
- Continue until you hit a halt, or exit instruction.
Cleanup.

The main complications of writing a virtual machine are:

Writing a compiler to convert readable source into the binary opcodes which will be actually executed.
- Users want to write "goto repeat" not "0x10 0x03 0x00".
Writing a disassembler for debugging purposes.
Implementing a useful range of operations/opcodes.
- For example "add", "dec", "inc", etc.

This particular virtual machine contains only a few primitives, but it does include the support for labels, looping, conditional jumps, calls and returns. There is also a stack which can be used for storing temporary values, and used for call/ret handling.

The handling of labels in this implementation is perhaps worthy of note, because many simple/demonstration virtual machines don't handle them at all.

In order to support jumping to labels which haven't necessarily been defined yet our compiler keeps a running list of all labels (i.e. possible jump destinations) and when it encounters a jump instruction, or something else that refers to a label, it outputs a placeholder-address, such as:

JUMP 0x000
- In our bytecode that is the three-byte sequence 0x10 0x00 0x00 as the JUMP instruction is defined as 0x10.

After compilation is complete all the targets should have been discovered and the compiler is free to update the generated bytecodes to fill in the appropriate destinations.

NOTE: In our virtual machines all jumps are absolute, so they might look like "JUMP 0x0123" rather than "JUMP -4" or "JUMP +30".

NOTE: The same thing applies for other instructions which handle labels, such as storing the address of a label, making a call, etc.

Embedding

This virtual machine is designed primarily as a learning experience, but it is built with the idea of embedding in mind.

The standard simple-vm binary, which will read opcodes from a file and interpret them, is less than 40k in size.

Because the processing of binary opcodes is handled via a dispatch-table it is trivially possible for you to add your own application-specific opcodes to the system which would allow you to execute tiny compiled, and efficient, programs which can call back into your application when they wish.

There is an example of defining a custom opcode in the file src/embedded.c. This example defines a custom opcode 0xCD, and executes a small program which uses that opcode for demonstration purposes:

 $ ./embedded
 [stdout] Register R01 => 16962 [Hex:4242]
 Custom Handling Here
     Our bytecode is 8 bytes long

Instructions

There are several instruction-types implemented:

Storing string/int values into a given register.
Mathematical operations:
- add, and, sub, multiply, divide, incr, dec, or, & xor.
Output the contents of a given register. (string/int).
Jumping instructions.
- Conditional and unconditional
Comparison of register contents.
- Against registers, or string/int constants.
String to integer conversion, and vice-versa.
Stack operations
- PUSH/POP/CALL/RETURN

The instructions are pretty basic, as this is just a toy, but adding new ones isn't difficult and the available primitives are reasonably useful as-is.

The following are examples of all instructions:

:test
:label
goto  0x44ff      # Jump to the given address
goto  label       # Jump to the given label
jmpnz label       # Jump to label if Zero-Flag is not set
jmpz  label       # Jump to label if Zero-Flag is set

store #1, 33      # store 33 in register 1
store #2, "Steve" # Store the string "Steve" in register 1.
store #1, #3      # register1 is set to the contents of register #3.

exit              # Stop processing.
nop               # Do nothing

print_int #3      # Print the (integer) contents of register 3
print_str #3      # Print the (string) contents of register 3

system #3         # Call the (string) command stored in register 3

add #1, #2, #3    # Add register 2 + register 3 contents, store in reg 1
sub #1, #2, #3    # sub register 2 + register 3 contents, store in reg 1
mul #1, #2, #3    # multiply register 2 + register 3 contents, store in reg 1
concat #1, #2,#3  # store concatenated strings from reg2 + reg3 in reg1.

dec #2            # Decrement the integer in register 2
inc #2            # Increment the integer in register 2

string2int #3     # Change register 3 to have a string from an int
is_integer #3     # Does the given register have an integer content?
int2string #3     # Change register 3 to have an int from a string
is_string  #3     # Does the given register have a string-content?

cmp #3, #4        # Compare contents of reg 3 & 4, set the Z-flag.
cmp #3, 42        # Compare contents of reg 3 with the constant 42.  sets z.
cmp #3, "Moi"     # Compare contents of reg 3 with the constant string "Moi".  sets z.

peek #1, #4       # Load register 1 with the contents of the address in #4.
poke #1, #4       # Set the address stored in register4 with the contents of reg1.
random #2         # Store a random integer in register #2.

push #1           # Store the contents of register #1 in the stack
pop  #1           # Load register #1 with the contents of the stack.
call 0xFFEE       # Call the given address.
call my_label     # Call the defined label
ret               # Return from a called-routine.

Simple Example

The following program will just endlessly output a string:

 :start
      store #1, "I like loops"
      print_str #1
      goto start

This program first stores the string "I like loops" in register 1, then prints that register, before jumping back to the start of the program.

To actually compile this program into bytecode run:

  $ ./compiler ./examples/simple.in

This will produce an output file full of binary-opcodes in the file simple.raw:

  $ od -t x1 -An ./examples/simple.raw
  01 01 0c 49 20 6c 69 6b 65 20 6c 6f 6f 70 73 03
  01 06 00 00

Now we can execute that series of opcodes:

  ./simple-vm ./examples/simple.raw

If you wish to debug the execution then run:

  DEBUG=1 ./simple-vm ./examples/simple.raw

There are more examples stored beneath the examples/ subdirectory in this repository. The file examples/quine.in provides a good example of various features - it outputs its own opcodes.

Fuzz Testing

If you wish to fuzz-test with afl you should find that pretty straight-forward:

Install afl itself, as per the instructions.
Compile this project, setting CC=afl-gcc in the Makefile.
Generate some sample-programs, as starting point, then run the fuzzer.

You can compile each of our bundled samples like so:

 cd examples/
 for i in *.in; do ../compiler $i; done
 cd ..

This will compile examples/*.in into examples/*.raw. Place those in a directory, and start your fuzzer:

 mkdir samples/
 cp examples/*.raw samples/

Now you have ./samples/ containing only compiled programs. You can then mutate/permute those examples under the control of the fuzzer with a command-line like this:

 export FUZZ=1
 afl-fuzz -i ./samples/ -o results/ ./simple-vm @@ 16384

NOTE: Here we cap each run to executing no more than 16384 instructions. This will ensure that programs don't have infinite loops in them.

We set the environmental variable FUZZ to contain 1 solely to disable the use of the system() function. Which might accidentally remove your home-directory, format your drive, or send me a donation!

In addition to that you can find a real virtual-machine inside the embedded scripting engine I wrote, also for golang. In that case a scripting language is parsed and converted into a series of bytecode instructions, which are executed by a virtual machine. Similar to this project, but the bytecode operations are more complex and high-level:

https://github.com/skx/evalfilter

If you're interested in compilers, and interpreters, generally you might enjoy these other projects too:

https://github.com/skx/gobasic
- A simple BASIC interpreter
https://github.com/skx/foth
- An interpreter for a FORTH-like language.
https://github.com/skx/critical
- An interpreter for a TCL-like language.
https://github.com/skx/bfcc
- A brainfuck compiler
https://github.com/skx/math-compiler
- A simple compiler for mathematical expressions.

Steve

simple.vm's People

Contributors

Stargazers

Watchers

simple.vm's Issues

Our code size and our jump size are potentially different..

The destination of jumps is an integer in the range 0x0000-0xFFFF, because we've got a 64k address-space. However when we execute a program we don't allocate 64k for code, instead we load and allocate precisely enough space for the program.

This means we could load a piece of code, a binary program, which is 50 bytes long, and that might contain "jump 16384".

Either 👍

We allocate 64k for programs, filled with NOPs, before loading the program.
We limit jump targets to the size of the code.

The former is wasteful, but not hugely. The latter is sane.

I'm actually leaning towards the former solution because the user might want to write self-modifying code, via a smple XOR loop, or similar, and then execute it.

ObRelated: Relocatable code pretty much requires user-supplied fixups or relative jumps.

Add string equally opcode

Once the mess and mass-churn of #2 is resolved I'd like to get back to adding missing primitives.

One that immediately springs to mind is something for comparing registers with fixed strings. We have two CMP instructions at the moment:

cmp reg1, reg2
cmp reg1, 0xFF

The obvious missing case is:

cmp reg1, "Steve".

Simple and easy to understand

Your byte code design is pretty elegant, I love this

        store #1, 1982
        store #2, "Good byte code design!"

And love the implementation behind the sence.

Thanks for showing this kind of VM, forked.

LDIR missing.

We can't copy the semantics of LDIR exactly, but we should have a similar RAM-copying instruction.

Store label offsets in registers.

At the moment we have:

  store #1, 1234
  store #1, "this is a string"

We cannot store the offset of a label, which would be useful for things. Such as:

     :start 
      ...
     :end

This would allow us to work out the length of a program, for example. This is immediately useful in the examples/quine.in sample where the length is hardcoded.

Segmentation Faults 2017-05-14

Found a few more bad inputs that cause segmentation faults. Though not all of them seg fault under valgrind, which I'm still trying to figure out. They can be found https://github.com/rwhitworth/simple.vm-fuzz/tree/master/2017-05-14 if you're interested. Thanks for the quick fixes this weekend.

The compiler should allow inline data.

For example:

  DATA 0x12
  DATA 0x13

This would allow embedding data for lookup-tables, self-modifying-code, or similar.

Compiler warning: simple-vm-opcodes.c ignoring return from system()

simple-vm-opcodes.c lines 583 thru 597:

void op_string_system(struct svm *svm)
{
    /* get the reg */
    unsigned int reg = next_byte(svm);
    BOUNDS_TEST_REGISTER(reg);

    if (getenv("DEBUG") != NULL)
        printf("STRING_SYSTEM(Register %d)\n", reg);

    char *str = get_string_reg(svm, reg);
    system(str);

    /* handle the next instruction */
    svm->ip += 1;
}

clang compiler is issuing this warning during compilation:

simple-vm-opcodes.c:593:5: warning: ignoring return value of function declared with warn_unused_result attribute [-Wunused-result]
    system(str);
    ^~~~~~ ~~~

Making the return result from the system opcode may not be worth doing or may not be valuable enough, which is fine. Make whatever design decision makes sense for this project.

Compiling multiple files is broken.

If you compile two scripts things break if you use labels:

  ./compiler ./examples/jump.in ./examples/quine.in
  ./simple-vm examples/quine.raw 
  ERROR running script - The register doesn't contain an integer

Trivial problem caused by the @UPDATES and %LABELS being outside the compilation function.

Add random opcode.

Once the mess and mass-churn of #2 is resolved I'd like to get back to adding missing primitives.

One that immediately springs to mind is something for generating a random integer.

e.g. INT_RAND().

IP wrap-around isn't consistent.

Within the main-intepretter we handle the IP exceeding the 64k boundary - but the opcode handlers don't.

We should handle something like:

   0xfffe: store #1, 0102

That would be encoded as:

   ram[0xFFFE] = 01
   ram[0xFFFF] = 01
   ram[0x0000 ] = 02

i.e. The opcode handler would read the wrong address for the second octet of the immediate value 0x0102.

Missing opcodes: is_string/is_integer

Set the Z-flag to true if the given register has the correct type.

Compiler warning: main.c ignoring return from fread()

main.c line 68:
fread(code, 1, size, fp);

clang compiler is issuing this warning during compilation:

main.c:68:5: warning: ignoring return value of function declared with warn_unused_result attribute [-Wunused-result]
    fread(code, 1, size, fp);
    ^~~~~ ~~~~~~~~~~~~~~~~~

Segmentation Faults 2017-05-12

Hello,
I was using American Fuzzy Lop (afl-fuzz) to fuzz input the simple-vm program on Linux. Is fixing the crashes from these input files something you're interested in? The input files can be found here: https://github.com/rwhitworth/simple.vm-fuzz/tree/master/2017-05-12.

The files can be executed as ./simple-vm id_filename to cause seg faults.

Let me know if I can provide any more information to help narrow down this issue.

Print integers with a "0x" prefix

Hello @skx, thanks a lot for this great project!

I modified the print_int formatting string to include a "0x" prefix, and changed the padding to be four positions:

diff --git a/simple-vm-opcodes.c b/simple-vm-opcodes.c
index b284dc9..9edf3fb 100644
--- a/simple-vm-opcodes.c
+++ b/simple-vm-opcodes.c
@@ -343,7 +343,7 @@ void op_int_print(struct svm *svm)
     if (getenv("DEBUG") != NULL)
         printf("[STDOUT] Register R%02d => %d [Hex:%04x]\n", reg, val, val);
     else
-        printf("%02X", val);
+        printf("0x%04X", val);
 
 
     /* handle the next instruction */

The reason for the prefix is because it makes the base explicit, and the padding hints at the max value accepted for integers. I decided to use "0x%04X" instead of "%#04X" because I like the numbers in uppercase and the prefix in lowercase.

Do you think it could be a good change for simple.vm?

Our IP-manipulation is .. subpar.

Currently there are a bunch of opcode implementations defined, some of which modify the IP and some that don't. The way this works is that the main loop looks up the handler and invokes them.

The handlers are defined as:

bool do_stuff( svm_t *cpu );

If the functions return true then it is assume they've updated the IP, otherwise they haven't so it is incremented to point to the next instruction.

Unfortunately this isn't great. Consider the situation where the following program is executed:

  store #1, 0x1234
  exit

This is a four-byte program with two instructions:

  store #1, 0x1234   ->  0x01 0x12 0x34
  exit 0> 0x00

At the start of execution IP=0, the byte there is 0x01 for int_store. The byte is incremented to read the first number, then again the second. Meaning ip=0x02.

At this point the main loop exits, the IP is incremented to 0x03 where the exit code is waiting to be executed. We've, internally in in the int_store method, bumped IP by two bytes but not kept track of it!

The thing to do is let the opcode handlers do what they want. Don't bump the IP in the handler, but add an extra cpu->ip++ at the end of each handler. Or reset it in the case of a jump.

Move to using function-pointers.

Rather than having a giant switch statement, with a case for each opcode, we should break out the implementations of the opcodes into their own functions.

If we define an array of 255-opcodes in the svm_t function then we can use function pointers to invoke the correct function.

Pros
- The code becomes more readable.
- We could have a default handler which dumps "unknown instruction" for debugging purposes.
- This would allow host-applications to define host-specific opcodes.
Cons
- Significant churn.
- Wastes 255 * (function-pointer size) bytes per svm_t instance.
- We have some repetition setting up the opcode handlers - both defining the function and then allocating it.

Strings should not be limited to 255 bytes.

Strings longer than 255 bytes cannot be loaded via "load #x, 'value'". That's not ideal, but it could be tolerable.

However more seriously a string cannot be concatenated past that length, or bad things happen.

Suggest both issues are resolved at the same time.

Use a macro for reading from the CPU-RAM

As per this comment on reddit we should have a READ_BYTE macro for reading a single byte from the current instruction-pointer, and incrementing it.

We could combine that with the existing BYTES_TO_ADDR macro to create a READ_ADDR macro too - although perhaps that might not be so clear.

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.