edigen's Introduction

Welcome to emuStudio

License: GPL v3

emuStudio is a desktop application used for computer emulation and writing programs for emulated computers. It is extensible; it encourages developers to write their own computer emulators.

The main goal of emuStudio is to support the "compile-load-emulate" workflow. It is aimed at students and anyone else who wants to learn about older but important computers, or even abstract machines.

emuStudio is well suited for use at schools, e.g. when students are taking their first steps in assembler, or when they are taught about computer history. For example, emuStudio has been used at the Technical University of Košice since 2007.

Available emulators

BIG THANKS

emuStudio was written based on existing emulators, sites and existing documentation of real hardware. For example:

Projects:

  • simh project, which was the main inspiration for the Altair8800 computer
  • MAME project, which helped resolve many bugs and arrive at a correct implementation of some 8080 and Z80 CPU instructions

Sites:

Discord:

Getting started

First, either compile or download emuStudio. Java must be installed, at least version 11 (download here).

Then unzip the tar/zip file (emuStudio-xxx.zip) and run it using the following command:

  • On Linux / Mac:
> ./emuStudio
  • On Windows:
> emuStudio.bat

NOTE: Linux and Windows are currently supported. Mac is NOT supported, but it might work to some extent.

For more information, please read the user documentation.

Contributing

Anyone can contribute. Before you start, please read the developer documentation, which includes information like:

  • Which tools to use and how to set up the environment
  • How to compile emuStudio and prepare local releases
  • Which git branch to use
  • Code architecture, naming conventions, best practices

Related projects

Some additional projects used by emuStudio may be useful for contributors:

edigen's People

Contributors

dependabot[bot], sulir, vbmacher

edigen's Issues

Allow to specify loops

This should be possible:

instruction = prefixed | 0xEF;
optionalPrefix = 0xFF instruction; 

Add ambiguous path/variant detection

Currently it is possible to write an input file in which more than one variant of a rule can match. This can cause unintended behavior of the generated instruction decoder.

It would be useful to detect such cases and stop the generator, printing an error message.
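One way such a check could work, as a minimal sketch: assume each variant at a given decoding level can be reduced to a (mask, pattern) pair, where a unit matches when (unit & mask) == pattern. The Variant class and the method names below are hypothetical and do not reflect edigen's actual internal model. Under that assumption, two variants can match the same input exactly when their patterns agree on every bit that both masks test.

```java
import java.util.ArrayList;
import java.util.List;

final class AmbiguityCheck {

    /** Hypothetical model of a variant: matches when (unit & mask) == pattern. */
    static final class Variant {
        final String name;
        final int mask;
        final int pattern; // assumed to have no bits set outside the mask

        Variant(String name, int mask, int pattern) {
            this.name = name;
            this.mask = mask;
            this.pattern = pattern;
        }
    }

    /** Returns true if at least one unit value matches both variants. */
    static boolean overlaps(Variant a, Variant b) {
        int commonBits = a.mask & b.mask;
        // The variants conflict only if they differ on a bit that both of them test.
        return ((a.pattern ^ b.pattern) & commonBits) == 0;
    }

    /** Lists every pair of variants that can match the same input. */
    static List<String> findAmbiguities(List<Variant> variants) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < variants.size(); i++) {
            for (int j = i + 1; j < variants.size(); j++) {
                if (overlaps(variants.get(i), variants.get(j))) {
                    result.add(variants.get(i).name + " / " + variants.get(j).name);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Variant> variants = List.of(
            new Variant("nop", 0xFF, 0x00),
            new Variant("mov", 0xC0, 0x40),
            new Variant("lowZero", 0x0F, 0x00));
        // "nop" and "mov" differ on a commonly tested bit, so they never overlap;
        // "lowZero" overlaps with both of them.
        findAmbiguities(variants).forEach(System.out::println);
    }
}
```

The pairwise comparison is quadratic in the number of variants, which is likely acceptable for a generator that runs at build time.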

After fix from #18, 16 bit values are disassembled as signed

For example, instruction

lxi SP, 0E3ABh

is disassembled as

lxi SP, -01C55

Unfortunately, this is probably caused by BigInteger.toString() inside the format() method of Disassembler.edt.

I think it will be possible to use RadixUtils.convertToRadix() from emuLib here.
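The suspected cause can be reproduced with the JDK alone. The sketch below does not use RadixUtils, since its exact signature is not given here; instead it relies on the BigInteger(int signum, byte[] magnitude) constructor as one possible way to force an unsigned interpretation. The class and method names are illustrative only.

```java
import java.math.BigInteger;

final class UnsignedFormat {

    /** Interprets big-endian bytes as two's complement, as new BigInteger(byte[]) does. */
    static String toSignedHex(byte[] bigEndianBytes) {
        return new BigInteger(bigEndianBytes).toString(16).toUpperCase();
    }

    /** Interprets big-endian bytes as an unsigned magnitude. */
    static String toUnsignedHex(byte[] bigEndianBytes) {
        // Signum 1 tells BigInteger that the bytes are a non-negative magnitude.
        return new BigInteger(1, bigEndianBytes).toString(16).toUpperCase();
    }

    public static void main(String[] args) {
        byte[] operand = {(byte) 0xE3, (byte) 0xAB};
        System.out.println(toSignedHex(operand));   // prints -1C55
        System.out.println(toUnsignedHex(operand)); // prints E3AB
    }
}
```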

Add an option to change unit size

Instructions are read one unit at a time. Currently, this unit is hard-coded to one byte. There should be an option to change it to short (2 bytes) or int (4 bytes).

Support multiline comments

Multiline comments should be supported, and multiple one-line comment formats could be supported as well:

  • multiline: /* ... */
  • oneline: # comment (already supported)
  • oneline: // comment
  • oneline: ; comment

Detect unused rule definitions

After merging #35, there is a problem that typos may go unnoticed. For example, in

instr = add(8);
addd = ...;

the addd = ...; rule definition should ideally throw an error if the addd rule is not used anywhere.
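A possible shape for this check, sketched under the assumption that the parsed input can be viewed as a map from each defined rule name to the rule names it references (the map layout and the findUnused name are hypothetical): walk the references from the root rule and report every defined rule that is never reached.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class UnusedRules {

    /**
     * Returns the defined rules that cannot be reached from the root rule.
     * Keys of the map are the defined rule names; values are the rules each one references.
     */
    static Set<String> findUnused(String root, Map<String, List<String>> references) {
        Set<String> reachable = new HashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(root);

        while (!pending.isEmpty()) {
            String rule = pending.pop();
            if (reachable.add(rule)) { // false when the rule was already visited
                for (String referenced : references.getOrDefault(rule, List.of())) {
                    pending.push(referenced);
                }
            }
        }

        Set<String> unused = new TreeSet<>(references.keySet());
        unused.removeAll(reachable);
        return unused;
    }

    public static void main(String[] args) {
        // Models the example above: "instr" references "add", while "addd" is a typo.
        Map<String, List<String>> references = Map.of(
            "instr", List.of("add"),
            "add", List.<String>of(),
            "addd", List.<String>of());
        System.out.println(findUnused("instr", references)); // prints [addd]
    }
}
```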

Remove RuleNameSet - allow only one name of a rule

Currently, a rule can have multiple names, e.g.:

instruction, data = line(5) 111;

However, I don't really see the point. I think a rule should have one unique name, which should serve as the identifier of the rule. The current situation complicates identifying a rule in code, e.g.:

    /**
     * Returns a human-readable label of this rule - a name or a list of names
     * separated by commas.
     * @return the label
     */
    public String getLabel() {
        Iterator<String> nameIterator = names.iterator();
        StringBuilder result = new StringBuilder();
        
        while (nameIterator.hasNext()) {
            result.append(nameIterator.next());
            
            if (nameIterator.hasNext())
                result.append(", ");
        }
        
        return result.toString();
    }

In my opinion this is unnecessary. @sulir do you have any use cases where this feature might be useful? Thanks!

Detection of unreachable disassembler formats

For example:

root instruction;

instruction =
  "nop": 0000 0000 |
  "arg %X": 10000 0000 arg ;

arg = arg: arg(8);
%%

"%s" = instruction arg;
"%X" = arg;  // unreachable

The plain arg format is unreachable, because all instruction variants return a value, so instruction is always present whenever arg is.

Decoder incorrectly assumes Short data type of memory context

When working on SSEM CPU, I ran into the following exception:

Exception in thread "AWT-EventQueue-0" java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Short
    at net.sf.emustudio.ssem.DecoderImpl.read(DecoderImpl.java:75)
    at net.sf.emustudio.ssem.DecoderImpl.instruction(DecoderImpl.java:119)
    at net.sf.emustudio.ssem.DecoderImpl.decode(DecoderImpl.java:59)
    at net.sf.emustudio.ssem.DisassemblerImpl.getNextInstructionPosition(DisassemblerImpl.java:125)
    at emustudio.gui.debugTable.InteractiveDisassembler.updateCache(InteractiveDisassembler.java:97)
    at emustudio.gui.debugTable.InteractiveDisassembler.rowToLocation(InteractiveDisassembler.java:187)
    at emustudio.gui.debugTable.DebugTableModel.getValueAt(DebugTableModel.java:113)
    at javax.swing.JTable.getValueAt(JTable.java:2717)
    at javax.swing.JTable.prepareRenderer(JTable.java:5706)
    ...

The stack trace points at the Decoder.edt template, where I found the following snippet:

private byte read(int start, int length) {
    ...
    if (start + length > 8 * bytesRead) {
        instructionBytes[bytesRead++] = ((Short) memory.read(memoryPosition++)).byteValue();
    }
    ...

MemoryContext.read() returns a generic type, so it can be anything. We can agree that, in order to use the decoder, the context type should be required to at least extend Number, which has the desired byteValue method. It is a superclass of all numeric types (Integer, Long, Short, Double, Float, Byte), so it should be fine for the future as well.
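A minimal sketch of the proposed requirement follows. The Memory interface is a hypothetical stand-in that models only a single-argument read method; it is not emuLib's actual MemoryContext. Bounding the cell type by Number lets the decoder call byteValue() without casting to any concrete class.

```java
final class NumberRead {

    /** Hypothetical stand-in for a memory context; only read(int) is modelled. */
    interface Memory<T extends Number> {
        T read(int position);
    }

    /** Reads one cell as a byte, whatever numeric type the memory stores. */
    static byte readByte(Memory<? extends Number> memory, int position) {
        // byteValue() is declared on Number, so no cast to Short or Integer is needed.
        return memory.read(position).byteValue();
    }

    public static void main(String[] args) {
        Memory<Integer> intMemory = position -> 0xE3;
        Memory<Short> shortMemory = position -> (short) 0x12;

        System.out.println(readByte(intMemory, 0));   // prints -29, i.e. (byte) 0xE3
        System.out.println(readByte(shortMemory, 0)); // prints 18
    }
}
```

Note that byteValue() silently truncates values wider than 8 bits, so this only addresses the ClassCastException, not larger unit sizes.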

Do not specify endianness as command line argument

Specifying big / little endian for the disassembler in the command line arguments seemed like a good idea at first. But endianness is an integral property of a processor, not something which can be accidentally changed by forgetting to use a command line switch during compilation.

Add decoder (and disassembler) cache

The cache should work like this: Every time an instruction is decoded, it is placed into the cache of some capacity. Next time, if the supplied bytes are exactly the same, the decoded instruction is returned directly from the cache, without performing the decoding process.

There is one problem - the size of the instruction is unknown before decoding. So the cache should be a combination of a tree and hash-maps.

Something similar could be also done for the disassembler.

If the insertion and selection were fast enough, this would significantly improve the emulator speed.
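The "tree and hash-maps" combination could look roughly like a byte-keyed trie in which every node holds a hash map of its children. The sketch below is an illustration under stated assumptions: the DecoderCache class and its methods are hypothetical, capacity limits and eviction are left out, and it assumes that no valid instruction encoding is a prefix of another one, so the first hit along the path is the right one.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical cache of decoded instructions, keyed by their bytes. */
final class DecoderCache<T> {

    private static final class Node<T> {
        final Map<Byte, Node<T>> children = new HashMap<>();
        T instruction; // non-null only where a complete instruction ends
    }

    private final Node<T> root = new Node<>();

    /** Stores a decoded instruction under the bytes it was decoded from. */
    void put(byte[] instructionBytes, T instruction) {
        Node<T> node = root;
        for (byte b : instructionBytes) {
            node = node.children.computeIfAbsent(b, key -> new Node<>());
        }
        node.instruction = instruction;
    }

    /**
     * Follows the memory bytes starting at the given position and returns the
     * first cached instruction found on the way, or null on a cache miss.
     * The instruction size does not need to be known in advance.
     */
    T lookup(byte[] memory, int position) {
        Node<T> node = root;
        for (int i = position; i < memory.length; i++) {
            node = node.children.get(memory[i]);
            if (node == null) {
                return null;
            }
            if (node.instruction != null) {
                return node.instruction;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        DecoderCache<String> cache = new DecoderCache<>();
        cache.put(new byte[] {0x31, (byte) 0xAB, (byte) 0xE3}, "lxi SP, 0E3ABh");

        byte[] memory = {0x00, 0x31, (byte) 0xAB, (byte) 0xE3};
        System.out.println(cache.lookup(memory, 1)); // prints lxi SP, 0E3ABh
        System.out.println(cache.lookup(memory, 0)); // prints null
    }
}
```

Whether this is faster than decoding depends on the cost of boxing and hashing each byte, so it would need to be measured against the generated switch code.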

Add logging support

Logging information about the generation progress could be useful. There exist libraries like Logback / SLF4J for this.

Apply several optimizations in decoder

Unnecessary read of unit

In GenerateMethodsVisitor, once the unit has been read in a method while processing a Mask with a given start and length, it is not necessary to read it again in a nested switch.

unit = read(start + 0, 5);

switch (unit & 0x1f) {
        case 0x10:
            ...

        default:
            unit = read(start + 0, 5);  // <-- unnecessary
            switch (unit & 0x1c) {
                case 0x08:
               ...
...        

Unnecessary read of unit no.2

In GenerateMethodsVisitor, when the mask is zero, there won't be any pattern as a child of the mask (at any depth). If there were, the current code would be broken anyway, because Pattern will generate case and default switch branches, while the switch itself is defined in the parent mask only if the mask is not zero. Therefore it is not necessary to read the unit. Variant-returning subrules will use readBytes anyway.

    private void line(int start) throws InvalidInstructionException {
        unit = read(start + 0, 5);  // <-- unnecessary
        instruction.add(LINE, readBytes(start + 0, 5));
    }

Do not use break under default branches

It is unnecessary to call break; in the default: branch, since it's always the last branch.

                                    }
                                    break;
                                }
                                break;
                            }
                            break;
                        }
                        break;
                    }
                    break;
                }
                break;
            }
            break;

I have doubts about readBytes

As I traverse the code from decoder to disassembler, I'm starting to think that the byte[] array of instruction "bits" is not necessary. I'm talking about:

public class DecodedInstruction {
    public void add(int key, byte[] bits) {   // <-- this one
...

Connected with this is the readBytes method in Decoder, which is very complicated and hard to understand. Given that this method is often called for only a few bits, it can be a pretty expensive call.

Anyway, in the disassembler the "bits" are presented in some form: as integers, floats, doubles, or strings. If the bits span more than 4 bytes, we cannot present them as an integer (but we allow it). For doubles, I agree that 8 bytes should be supported if we want to emulate some modern CPUs. But I actually doubt we will. I think emuStudio will always be emulating older machines or some esoteric machines. Therefore I suggest sticking with 4 bytes as the maximum (which will also be a hard limit for instruction size, because the root rule mask usually has the size of the whole instruction). Also, unit size is 4 bytes.

Maximum instruction size on modern processors is only a few bytes

It is unnecessary to store 1024 bytes per instruction (field instructionBytes). Even modern x86 CPUs have a maximum instruction size of 120 bits, which is 15 bytes. It would therefore be better to read instructionBytes ahead. To preserve "generality", we should first find out the maximum instruction size with a new visitor, which will store it in a constant. Then, using MemoryContext.read(memoryPosition, count), we can try to read that many bytes. If fewer bytes are available, the method will return only those which are available and will not throw.

Remove "+ 0"

It doesn't look good when the code has + 0 - we know that it's a zero so we don't need to add it. E.g.:

private void instruction(int start) throws InvalidInstructionException {
        unit = readBits(start + 0, 32); // <-- here
        
        switch (unit & 0x00070000) {
        case 0x00000000:
            ...
            line(start + 0);  // <-- here
            ...

Add more semantic error checks

There are many situations where the input file contains syntactically correct, but semantically incorrect information. Possible reactions are (from the best to the worst):

  • print a meaningful error message containing details about the error
  • generate syntactically incorrect Java file
  • print a typical message about an unhandled exception ("Exception in thread...")
  • generate syntactically correct Java file which will produce unexpected results upon execution

The first two are acceptable, the others are not.

Semantic errors in a decoder include:

  • multiple rules with the same name [1]
  • subrule refers to a nonexistent rule [2]
  • variant returns a subrule not present in the variant [3]

In a disassembler:

  • value refers to a nonexistent rule [4]
  • multiple formats contain the same values [5]
  • value refers to a rule which can never return a value [6]

Of course, the lists are not complete.

Migrate to GPL v3

emuStudio also uses GPL v3, and since edigen does not contain any parts which do not include source code, it can be easily migrated.

Support either multiple root rules or "else" construct if root rule fails

Sometimes it is handy to recover from InvalidInstructionException while decoding by trying another rule, e.g. if the bytes do not represent an instruction, they represent data (and we want to show it in the disassembler).

Proposal 1

Currently, the first rule in the file is the only root rule. If this rule fails, there is no way to try another path. One solution would be to have multiple root rules (either explicitly specified, or taken as the first rules appearing in all lines of the disassembler part) and to try applying them from the first to the last. For example:

...
%%

"%s %d" = instruction line(shift_left, shift_left, shift_left, bit_reverse, absolute) ignore8 ignore16;
"%s" = instruction ignore8 ignore16;
"%s %d" = number num(bit_reverse);

In this code, the rules instruction and number are root rules and will be tried in this order.

Proposal 2

Another solution would be to capture the unmatched input in the only root rule, for example as:

instruction =  "JMP": line(5)     ignore8(8) 000 ignore16(16) |
               "JPR": line(5)     ignore8(8) 100 ignore16(16) |
               ... |
               <otherwise> "DATA": data(32)

We can debate how the <otherwise> part should be named. This part would be optional.

The <otherwise> part is needed because of possible ambiguities with instructions (e.g. pattern 000...0 is ambiguous). I don't really want to explicitly specify all possible values except the ambiguous ones; this is what I expect from the <otherwise> part.

@sulir what do you think about it?

Unicode input support

It is not possible to use Unicode characters in comments (e.g. characters with diacritics, like č).

Please update disassembler template to reflect current situation in emuLib

The disassembler changed from SimpleDisassembler to AbstractDisassembler and Decoder is already included.

Also, please consider whether it would be a good step to generalize the disassembler and move the template code into emuLib directly. In fact, only the static tables of disassembler formats and values really need to be generated; everything else can be generalized. These tables can be passed to the disassembler either in the constructor or through setters (if we want to allow changing the format at runtime).

Add implicit rule support

Rules like src_mem = mem: mem(16); are a little bit redundant. If there is a definition similar to

arith_operands =
    001 dst_reg(3) 01 src_mem(16);

and the rule src_mem does not exist, then instead of printing an error, the rule should be implicitly constructed according to the former example.

Generate return-style code instead of default branches

There is another possible style of generated decoder code, which uses the return statement after a variant is successfully recognized and throws an exception at the end of each rule. The default branches are not used at all.

This style should allow less strict ambiguity detection, and performance could improve. However, it would bring some new issues which must be addressed, e.g. instruction length recognition would be more difficult.
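To make the idea concrete, here is a hand-written sketch of what a return-style rule method might look like. It is not output of the generator: the masks, patterns and rule keys are invented for illustration, and InvalidInstructionException is declared locally as a stand-in for the exception the generated decoder throws.

```java
final class ReturnStyleSketch {

    /** Local stand-in for the exception thrown on an undecodable unit. */
    static final class InvalidInstructionException extends Exception {
        InvalidInstructionException(String message) {
            super(message);
        }
    }

    // Hypothetical rule keys, not taken from any real instruction set.
    static final int NOP = 1;
    static final int MOV = 2;

    /** Tries each variant in order and returns as soon as one is recognized. */
    static int instruction(int unit) throws InvalidInstructionException {
        if ((unit & 0xFF) == 0x00) {
            return NOP;
        }
        if ((unit & 0xC0) == 0x40) {
            return MOV;
        }
        // No default branch: falling through every variant means the rule failed.
        throw new InvalidInstructionException(
            "No variant matches unit 0x" + Integer.toHexString(unit));
    }

    public static void main(String[] args) throws InvalidInstructionException {
        System.out.println(instruction(0x00)); // prints 1
        System.out.println(instruction(0x45)); // prints 2
    }
}
```

Because the variants are tested in order, an overlapping pair is resolved by position rather than rejected, which is why this style tolerates less strict ambiguity detection.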

Support other constant disassembling strategies

In addition to currently supported big-endian and little-endian, custom strategies like "bit reverse" should be supported.

One option is to use formatting character modifiers: e.g., %+d for big-endian, %*d for bit-reverse.
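For reference, a "bit reverse" strategy on a field of a given width can be sketched with the JDK's Integer.reverse. The reverseBits helper and its width parameter are assumptions made for illustration; how such a strategy would be wired into the disassembler formats is left open.

```java
final class BitReverse {

    /**
     * Reverses the order of the lowest {@code width} bits of {@code value}.
     * Bits above the field are discarded. Requires 1 <= width <= 32.
     */
    static int reverseBits(int value, int width) {
        // Integer.reverse mirrors all 32 bits; the shift moves the field back down.
        return Integer.reverse(value) >>> (32 - width);
    }

    public static void main(String[] args) {
        System.out.println(reverseBits(0b00001, 5)); // prints 16 (binary 10000)
        System.out.println(reverseBits(0b10110, 5)); // prints 13 (binary 01101)
    }
}
```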
