emustudio / edigen

Emulator Disassembler Generator

License: GNU General Public License v3.0

Java 100.00%

edigen's Introduction

Welcome to emuStudio

emuStudio is a desktop application used for computer emulation and writing programs for emulated computers. It is extensible; it encourages developers to write their own computer emulators.

The main goal of emuStudio is to support the "compile-load-emulate" workflow, helping students and anyone else learn about older but important computers, or even abstract machines.

emuStudio is well suited for use at schools, e.g. when students are taking their first steps in assembler, or when they are taught about computer history. For example, emuStudio has been used at the Technical University of Košice since 2007.

Available emulators

BIG THANKS

emuStudio was written based on existing emulators, sites and existing documentation of real hardware. For example:

Projects:

simh project, which was the main inspiration for the Altair8800 computer
MAME project, which helped resolve many bugs in the implementation of some 8080 and Z80 CPU instructions

Sites:

David Sharp's SSEM site, main inspiration for SSEM implementation
Esolang's BrainFuck site, main inspiration for Brainfuck implementation
DeRamp Altair, more inspiration for Altair8800
Altair Clone, more inspiration for Altair8800
Study of techniques for emulation programming, a classic on emulation techniques
8080 instruction table

Discord:

Discord Emulation Development

Getting started

First, either compile or download emuStudio. The prerequisite is an installed Java, version 11 or later (download here).

Then unpack the tar/zip file (emuStudio-xxx.zip) and run it using the following command:

On Linux / Mac:

> ./emuStudio

On Windows:

> emuStudio.bat

NOTE: Currently supported are Linux and Windows. Mac is NOT supported, but it might work to some extent.

For more information, please read user documentation.

Contributing

Anyone can contribute. Before you start, please read the developer documentation, which includes information like:

Which tools to use and how to set up the environment
How to compile emuStudio and prepare local releases
Which git branch to use
Code architecture, naming conventions, best practices

Related projects

emuStudio uses several additional projects that are useful for contributors to know:

emuLib - a shared runtime library
Edigen - instruction decoder and disassembler generator
Edigen Gradle plugin - Edigen Gradle plugin
CPU testing suite - a JUnit-based test suite for comfortable testing of CPU plugins
emuStudio website - emuStudio website

edigen's Issues

Deploy to Maven Central

Edigen should be in the Maven Central Repository.

Move away from Maven to Gradle

emuStudio and related projects have already been moved to Gradle. The Gradle build code is simpler and easier to read.

Allow to specify loops

This should be possible:

instruction = prefixed | 0xEF;
optionalPrefix = 0xFF instruction;

GenerateMaxInstructionBytes doesn't properly follow nested subrules

Example:

root instruction;

instruction = "a": 0x00 other | "b": 0x10;
other = 0x11 ref16;
ref16 = ref16: ref16(16);

%%

"%s" = instruction;

generates maxInstructionBytes = 3 instead of 4.
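
The expected value of 4 can be checked by hand with a small recursive computation. The following is only a sketch under stated assumptions: the Part class, the rule map and all method names are hypothetical and are not Edigen's visitor API. It also assumes that no rule refers to itself without an explicit length.

```java
import java.util.List;
import java.util.Map;

final class MaxInstructionBytesSketch {
    // Hypothetical model of one element of a variant: either a fixed number
    // of bits, or a subrule without an explicit length (bits == -1).
    static final class Part {
        final int bits;
        final String subrule;
        Part(int bits, String subrule) { this.bits = bits; this.subrule = subrule; }
    }

    // Longest variant of a rule in bits, following nested subrules recursively.
    static int maxBits(String rule, Map<String, List<List<Part>>> rules) {
        int max = 0;
        for (List<Part> variant : rules.get(rule)) {
            int sum = 0;
            for (Part part : variant) {
                sum += part.bits >= 0 ? part.bits : maxBits(part.subrule, rules);
            }
            max = Math.max(max, sum);
        }
        return max;
    }

    // Rounds the bit count up to whole bytes.
    static int maxBytes(String root, Map<String, List<List<Part>>> rules) {
        return (maxBits(root, rules) + 7) / 8;
    }

    // The example from this issue: 0x00 other | 0x10; other = 0x11 ref16; ref16(16).
    static Map<String, List<List<Part>>> exampleRules() {
        return Map.of(
            "instruction", List.of(
                List.of(new Part(8, null), new Part(-1, "other")),
                List.of(new Part(8, null))),
            "other", List.of(List.of(new Part(8, null), new Part(-1, "ref16"))),
            "ref16", List.of(List.of(new Part(16, "ref16"))));
    }

    public static void main(String[] args) {
        System.out.println(maxBytes("instruction", exampleRules())); // 4
    }
}
```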

Add ambiguous path/variant detection

Currently it is possible to write an input file in which more than one variant of a rule can match. This can cause unintended behavior of the generated instruction decoder.

It would be useful to detect such cases and stop the generator, printing an error message.
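
One possible building block for such a check is a pairwise test on mask/pattern pairs. This is a minimal sketch, not Edigen code: it assumes each variant is tested as (unit & mask) == pattern, that pattern bits lie within the mask, and the class and method names are made up for illustration. Two variants can match the same input exactly when their patterns agree on every bit that both masks test.

```java
final class VariantOverlapSketch {
    // Returns true if some unit satisfies both (unit & mask) == pattern tests.
    static boolean canBothMatch(int mask1, int pattern1, int mask2, int pattern2) {
        int sharedBits = mask1 & mask2;
        return ((pattern1 ^ pattern2) & sharedBits) == 0;
    }

    public static void main(String[] args) {
        // The patterns differ on a bit tested by both masks: no ambiguity.
        System.out.println(canBothMatch(0x1f, 0x10, 0x1c, 0x08)); // false
        // The masks test disjoint bits, so e.g. 0x11 matches both variants.
        System.out.println(canBothMatch(0xf0, 0x10, 0x0f, 0x01)); // true
    }
}
```

A real detector would also have to follow nested masks and subrules; the sketch covers only two variants compared at the same position.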

After fix from #18, 16 bit values are disassembled as signed

For example, instruction

lxi SP, 0E3ABh

is disassembled as

lxi SP, -01C55

Unfortunately, this is probably caused by BigInteger.toString() inside the format() method of Disassembler.edt.

I think it will be possible to use RadixUtils.convertToRadix() from emuLib here.
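
The sign problem can be reproduced with plain java.math.BigInteger. The sketch below is illustrative only (the class and method names are hypothetical, and it does not use RadixUtils): the single-argument constructor treats the bytes as two's complement, while the (signum, magnitude) constructor treats them as unsigned.

```java
import java.math.BigInteger;

final class UnsignedFormatSketch {
    // Interprets big-endian bytes as a signed two's-complement number.
    static String signedHex(byte[] bigEndian) {
        return new BigInteger(bigEndian).toString(16);
    }

    // Interprets the same bytes as an unsigned magnitude (signum = 1).
    static String unsignedHex(byte[] bigEndian) {
        return new BigInteger(1, bigEndian).toString(16);
    }

    public static void main(String[] args) {
        byte[] operand = {(byte) 0xE3, (byte) 0xAB};
        System.out.println(signedHex(operand));   // -1c55
        System.out.println(unsignedHex(operand)); // e3ab
    }
}
```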

Add an option to change unit size

Instructions are read one unit at a time. Currently, this unit is hard-coded to one byte. There should be an option to change it to short (2 bytes) or int (4 bytes).

Support multiline comments

Multiline comments should be supported. Multiple one-line comment formats could be supported as well:

multiline: /* ... */
oneline: # comment (already supported)
oneline: // comment
oneline: ; comment

edigen is broken: computing instruction max size + storing instruction image

GenerateMaxInstructionBytes is not working properly.
The decoder doesn't store only the bytes it actually read in the instruction image (it stores the full maximum instruction bytes), so the next instruction position is always advanced by the maximum instruction size instead of the actual size.

Detect unused rule definitions

After merging #35, there is a problem that typos may go unnoticed. For example, in

instr = add(8);
addd = ...;

the addd = ...; rule definition should ideally throw an error if the addd rule is not used anywhere.
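
Conceptually, the check is a set difference. A minimal sketch follows, assuming the generator can collect the set of defined rule names and the set of referenced rule names (including references from the disassembler formats); the class and method names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

final class UnusedRulesSketch {
    // Defined rules that are neither referenced anywhere nor the root rule.
    static Set<String> unusedRules(Set<String> defined, Set<String> referenced, String root) {
        Set<String> unused = new TreeSet<>(defined); // sorted for stable error messages
        unused.removeAll(referenced);
        unused.remove(root);
        return unused;
    }

    public static void main(String[] args) {
        // instr = add(8); addd = ...;  ("add" is only referenced, "addd" only defined)
        System.out.println(unusedRules(Set.of("instr", "addd"), Set.of("add"), "instr")); // [addd]
    }
}
```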

DecodedInstruction is very slow; it's using hashtables

See emustudio/emuStudio#141

Remove RuleNameSet - allow only one name of a rule

Currently, a rule can have multiple names, e.g.:

instruction, data = line(5) 111;

However, I don't really see the point. I think a rule should have one unique name, which should serve as the identifier of the rule. The current situation complicates identifying a rule in code, e.g.:

    /**
     * Returns a human-readable label of this rule - a name or a list of names
     * separated by commas.
     * @return the label
     */
    public String getLabel() {
        Iterator<String> nameIterator = names.iterator();
        StringBuilder result = new StringBuilder();
        
        while (nameIterator.hasNext()) {
            result.append(nameIterator.next());
            
            if (nameIterator.hasNext())
                result.append(", ");
        }
        
        return result.toString();
    }

In my opinion this is unnecessary. @sulir do you have some use cases why this feature might be useful? Thanks!

Detection of unreachable disassembler formats

For example:

root instruction;

instruction =
  "nop": 0000 0000 |
  "arg %X": 10000 0000 arg ;

arg = arg: arg(8);
%%

"%s" = instruction arg;
"%X" = arg;  // unreachable

The plain arg format is unreachable: every instruction variant returns a value, so instruction is always present whenever arg is.

Decoder incorrectly assumes Short data type of memory context

When working on SSEM CPU, I ran into the following exception:

Exception in thread "AWT-EventQueue-0" java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Short
    at net.sf.emustudio.ssem.DecoderImpl.read(DecoderImpl.java:75)
    at net.sf.emustudio.ssem.DecoderImpl.instruction(DecoderImpl.java:119)
    at net.sf.emustudio.ssem.DecoderImpl.decode(DecoderImpl.java:59)
    at net.sf.emustudio.ssem.DisassemblerImpl.getNextInstructionPosition(DisassemblerImpl.java:125)
    at emustudio.gui.debugTable.InteractiveDisassembler.updateCache(InteractiveDisassembler.java:97)
    at emustudio.gui.debugTable.InteractiveDisassembler.rowToLocation(InteractiveDisassembler.java:187)
    at emustudio.gui.debugTable.DebugTableModel.getValueAt(DebugTableModel.java:113)
    at javax.swing.JTable.getValueAt(JTable.java:2717)
    at javax.swing.JTable.prepareRenderer(JTable.java:5706)
    ...

The stack trace points at the Decoder.edt template, where I found the following snippet:

private byte read(int start, int length) {
        ...
        if (start + length > 8 * bytesRead) {
            instructionBytes[bytesRead++] = ((Short) memory.read(memoryPosition++)).byteValue();
        }
       ...

MemoryContext.read() returns a generic type, so it can return anything. To use the decoder, we could require that the memory cell type extends Number, which provides the needed byteValue() method. Number is the superclass of the standard numeric wrappers (Integer, Long, Short, Double, Float, Byte), so this should be fine for the future as well.
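
The idea can be sketched without emuLib. In the example below, the Memory interface is a hypothetical stand-in for a memory context with a generic cell type, not the real MemoryContext; the point is only that a bound of Number removes the need for a cast to Short.

```java
final class NumberReadSketch {
    // Hypothetical stand-in for a memory context with a generic cell type.
    interface Memory<T> {
        T read(int position);
    }

    // Works for Integer, Short, Byte, ... without any cast to a concrete type.
    static <T extends Number> byte readByte(Memory<T> memory, int position) {
        return memory.read(position).byteValue();
    }

    public static void main(String[] args) {
        Memory<Integer> intMemory = position -> 0x155;      // Integer cells, as in SSEM
        Memory<Short> shortMemory = position -> (short) 0x55;
        System.out.println(readByte(intMemory, 0));   // 85 (low-order byte of 0x155)
        System.out.println(readByte(shortMemory, 0)); // 85
    }
}
```

Note that byteValue() keeps only the low-order 8 bits, which matches what the original cast-based code did for Short.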

Do not specify endianness as command line argument

Specifying big or little endianness for the disassembler in the command line arguments seemed like a good idea at first. But endianness is an integral property of a processor, not something that should change accidentally because someone forgot a command line switch during compilation.

Add decoder (and disassembler) cache

The cache should work like this: Every time an instruction is decoded, it is placed into the cache of some capacity. Next time, if the supplied bytes are exactly the same, the decoded instruction is returned directly from the cache, without performing the decoding process.

There is one problem - the size of the instruction is unknown before decoding. So the cache should be a combination of a tree and hash-maps.

Something similar could be also done for the disassembler.

If the insertion and selection were fast enough, this would significantly improve the emulator speed.
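
One way to combine a tree with hash maps is a trie keyed by instruction bytes, where every node holds a HashMap of children. The following is a rough sketch under assumptions, not a proposal for the final design: the class name and methods are hypothetical, there is no capacity limit or eviction, memory is modelled as a plain byte array, and the instruction encoding is assumed to be prefix-free (no cached instruction is a prefix of another).

```java
import java.util.HashMap;
import java.util.Map;

final class DecoderCacheSketch<V> {
    private static final class Node<V> {
        final Map<Byte, Node<V>> children = new HashMap<>();
        V value; // non-null only where a cached instruction ends
    }

    private final Node<V> root = new Node<>();

    // Called after decoding, when the instruction length is finally known.
    void put(byte[] memory, int position, int length, V decoded) {
        Node<V> node = root;
        for (int i = 0; i < length; i++) {
            node = node.children.computeIfAbsent(memory[position + i], key -> new Node<>());
        }
        node.value = decoded;
    }

    // Walks the trie byte by byte; no instruction length is needed up front.
    V get(byte[] memory, int position) {
        Node<V> node = root;
        for (int i = position; i < memory.length; i++) {
            node = node.children.get(memory[i]);
            if (node == null) {
                return null; // cache miss: fall back to the real decoder
            }
            if (node.value != null) {
                return node.value;
            }
        }
        return null;
    }
}
```

A lookup costs one hash-map access per instruction byte, so whether it beats decoding depends on how expensive the decoder is; that would have to be measured.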

Change Travis CI badge from PNG to SVG in README

Travis CI now supports SVG badges.

Add logging support

Logging information about the generation progress could be useful. There exist libraries like Logback / SLF4J for this.

Apply several optimizations in decoder

Unnecessary read of unit

In GenerateMethodsVisitor, once the unit has been read in a method while processing a Mask with a given start and length, it is not necessary to read it again in the nested switch.

unit = read(start + 0, 5);

switch (unit & 0x1f) {
        case 0x10:
            ...

        default:
            unit = read(start + 0, 5);  // <-- unnecessary
            switch (unit & 0x1c) {
                case 0x08:
               ...
...

Unnecessary read of unit no.2

In GenerateMethodsVisitor, when the mask is zero, there won't be any pattern as a child of the mask (at any depth). If there were, the current code would be broken anyway, because Pattern generates case and default switch branches, while the switch itself is defined in the parent mask only if the mask is not zero. Therefore it is not necessary to read the unit. Variants returning subrules will use readBytes anyway.

    private void line(int start) throws InvalidInstructionException {
        unit = read(start + 0, 5);  // <-- unnecessary
        instruction.add(LINE, readBytes(start + 0, 5));
    }

Do not use `break` under default branches

It is unnecessary to call break; in the default: branch, since it's always the last branch.

I have doubts about `readBytes`

As I traverse the code from decoder to disassembler, I'm starting to think that the byte[] array of instruction "bits" is not necessary. I'm talking about:

public class DecodedInstruction {
    public void add(int key, byte[] bits) {   // <-- this one
...

Connected to this is the readBytes method in Decoder, which is very complicated and hard to understand. Considering that this method is often called for only a few bits, it can be a pretty expensive call.

In the disassembler, the "bits" are eventually presented as integers, floats, doubles, or strings. If a value has more than 4 bytes, we cannot present it as an integer (but we allow it). For doubles, I agree that 8 bytes should be supported if we want to emulate some modern CPUs, but I doubt we will. I think emuStudio will always emulate older or esoteric machines. Therefore I suggest sticking with 4 bytes as the maximum (which will also be a hard limit for instruction size, because the root rule mask usually has the size of the whole instruction). Also, the unit size is 4 bytes.

Maximum instruction size on modern processors is only a few bytes

It is unnecessary to store 1024 bytes per instruction (field instructionBytes). Even modern x86 CPUs have a maximum instruction size of 120 bits, which is 15 bytes. Therefore it would be better to read instructionBytes ahead. To preserve generality, we should first find the maximum instruction size using a new visitor, which will store it in a constant. Then, using MemoryContext.read(memoryPosition, count), we can try to read that many bytes. If fewer bytes are available, the method will return only those and will not throw.
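
The read-ahead behaviour described here can be sketched as follows. The sketch models memory as a plain byte array rather than a MemoryContext, and the class and method names are hypothetical; it only illustrates "return what is available, never throw at the end of memory". Negative positions are not handled.

```java
import java.util.Arrays;

final class ReadAheadSketch {
    // Returns up to maxBytes bytes starting at position; fewer near the end.
    static byte[] readAhead(byte[] memory, int position, int maxBytes) {
        int start = Math.min(position, memory.length);
        int end = Math.min(memory.length, start + maxBytes);
        return Arrays.copyOfRange(memory, start, end);
    }

    public static void main(String[] args) {
        byte[] memory = {1, 2, 3, 4, 5, 6};
        System.out.println(readAhead(memory, 0, 4).length); // 4
        System.out.println(readAhead(memory, 4, 4).length); // 2
        System.out.println(readAhead(memory, 9, 4).length); // 0
    }
}
```

The actual instruction length is then known only after decoding, and the next instruction position should advance by that length rather than by the number of bytes read ahead.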

Remove "+ 0"

It doesn't look good when the code has + 0 - we know that it's a zero so we don't need to add it. E.g.:

private void instruction(int start) throws InvalidInstructionException {
        unit = readBits(start + 0, 32); // <-- here
        
        switch (unit & 0x00070000) {
        case 0x00000000:
            ...
            line(start + 0);  // <-- here
            ...
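
On the generator side, this amounts to special-casing a zero offset when emitting the expression. A tiny sketch, with a hypothetical helper name that is not taken from GenerateMethodsVisitor:

```java
final class OffsetExpressionSketch {
    // Emits "start" for a zero offset and "start + N" otherwise.
    static String startPlus(int offset) {
        return offset == 0 ? "start" : "start + " + offset;
    }

    public static void main(String[] args) {
        System.out.println("line(" + startPlus(0) + ");"); // line(start);
        System.out.println("line(" + startPlus(5) + ");"); // line(start + 5);
    }
}
```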

Returning subrules associated with nonexistent rules

Subrules are associated with nonexistent main rules after the second syntax tree pass. This can cause NullPointerException.

Add more semantic error checks

There are many situations where the input file contains syntactically correct, but semantically incorrect information. Possible reactions are (from the best to the worst):

print a meaningful error message containing details about the error
generate a syntactically incorrect Java file
print a typical message about an unhandled exception ("Exception in thread...")
generate a syntactically correct Java file which produces unexpected results upon execution

The first two are acceptable, the others are not.

Semantic errors in a decoder include:

multiple rules with the same name [1]
subrule refers to a nonexistent rule [2]
variant returns a subrule not present in the variant [3]

In a disassembler:

value refers to a nonexistent rule [4]
multiple formats contain the same values [5]
value refers to a rule which can never return a value [6]

Of course, the lists are not complete.

Allow formatting directly in variants which return strings

Edigen should allow formatting in rule names, e.g.:

  "ld %s, %s":   01 0 r_bcde(2) r(3)              |  # x=1, y=0 r_bcde, z=r

The specification file can then be much shorter and cleaner!

Migrate to GPL v3

emuStudio also uses GPL v3, and since edigen doesn't contain any parts that lack source code, it can be migrated easily.

Add line numbers to semantic error messages

Some of the semantic error messages would be more useful if they contained line numbers.

Add more unit tests

At least the Visitors should be tested.

Support either multiple root rules or "else" construct if root rule fails

Sometimes it is handy to recover from InvalidInstructionException while decoding by trying another rule, e.g. if the bytes do not represent an instruction, they represent data (and we want to show it in the disassembler).

Proposal 1

Currently, the first rule in the file is the only root rule. If this rule fails, there is no possibility to try another path. One solution would be to have multiple root rules (either explicitly specified, or taken as the first rules appearing in all lines of the disassembler part) and try to apply them from the first to the last. For example:

...
%%

"%s %d" = instruction line(shift_left, shift_left, shift_left, bit_reverse, absolute) ignore8 ignore16;
"%s" = instruction ignore8 ignore16;
"%s %d" = number num(bit_reverse);

In this code, the rules instruction and number are root rules and will be tried in this order.

Proposal 2

Another solution would be to capture the unmatched input in the single root rule, for example:

instruction =  "JMP": line(5)     ignore8(8) 000 ignore16(16) |
               "JPR": line(5)     ignore8(8) 100 ignore16(16) |
               ... |
               <otherwise> "DATA": data(32)

We can debate how the <otherwise> part should be named. This part would be optional.

The reason the <otherwise> part is needed is possible ambiguity with instructions (e.g. the pattern 000...0 is ambiguous). I don't want to explicitly specify all possible values except the ambiguous ones; this is what I expect the <otherwise> part to do.

@sulir what do you think about it?

Unicode input support

It is not possible to use Unicode characters in comments (e.g. characters with diacritics, like č).

Low-order bytes are not read thus not shown in mnemonics

For all instructions that take numeric operands, whether 8-bit or 16-bit, the low-order byte is not disassembled. For example:

cpi 55h

is disassembled as

cpi 00

However, the opcode is correct (FE 55).

Add documentation for users

Documentation about Edigen usage should be added to README or Wiki pages.

Please update disassembler template to reflect current situation in emuLib

The disassembler changed from SimpleDisassembler to AbstractDisassembler, and Decoder is already included.

Also, please consider whether it would be a good step to generalize the disassembler and move the template code into emuLib directly. In fact, only the static tables of disassembler formats and values really need to be generated; everything else can be generalized. These tables can be passed to the disassembler either in the constructor or through setters (if we want to allow changing the format at runtime).

Add implicit rule support

Rules like src_mem = mem: mem(16); are a little bit redundant. If there is a definition similar to

arith_operands =
    001 dst_reg(3) 01 src_mem(16);

and the rule src_mem does not exist, then instead of printing an error, the rule should be implicitly constructed according to the former example.