decomp's Introduction

The decomp project

The aim of this project is to implement a decompilation pipeline composed of independent components interacting through well-defined interfaces, as further described in the design documents of the project.

Installation

git clone https://github.com/decomp/decomp
cd decomp
go install -v ./...

Usage

See example usage at examples/demo, and this comment for further details.

Decompilation pipeline

From a high-level perspective, the components of the decompilation pipeline are conceptually grouped into three modules. Firstly, the front-end translates a source language (e.g. x86 assembly) into LLVM IR, a platform-independent low-level intermediate representation. Secondly, the middle-end structures the LLVM IR by identifying high-level control flow primitives (e.g. pre-test loops, 2-way conditionals). Lastly, the back-end translates the structured LLVM IR into a high-level target programming language (e.g. Go).
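
As a rough illustration of the three-module structure, the sketch below chains independent stages so that each one consumes the output of the previous one. The `Stage` type, the `Compose` helper and the placeholder stages are hypothetical names chosen for this example; they are not part of the decomp code base, where the components are separate command-line tools.

```go
package main

import "fmt"

// Stage is a hypothetical pipeline component that maps one textual
// representation to another (e.g. LLVM IR in, structured IR out).
type Stage func(input string) (string, error)

// Compose chains stages so that the output of each stage is the input of
// the next one. The first failing stage aborts the pipeline.
func Compose(stages ...Stage) Stage {
	return func(input string) (string, error) {
		cur := input
		for i, stage := range stages {
			out, err := stage(cur)
			if err != nil {
				return "", fmt.Errorf("stage %d: %w", i, err)
			}
			cur = out
		}
		return cur, nil
	}
}

func main() {
	// Placeholder stages that only label their input; real components would
	// lift machine code, recover control flow and emit source code.
	frontEnd := func(s string) (string, error) { return "ir(" + s + ")", nil }
	middleEnd := func(s string) (string, error) { return "structured(" + s + ")", nil }
	backEnd := func(s string) (string, error) { return "go(" + s + ")", nil }

	pipeline := Compose(frontEnd, middleEnd, backEnd)
	out, err := pipeline("asm")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

Because every stage shares one signature, a stage can be swapped out or tested in isolation without touching the others.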

The following poster summarizes the current capabilities of the decompilation pipeline, using a composition of independent components to translate LLVM IR to Go.

Poster: Compositional Decompilation

Front-end

Translate machine code (e.g. x86 assembly) to LLVM IR.

Third-party front-end components.

Middle-end

Perform control flow analysis on the LLVM IR to identify high-level control flow primitives (e.g. pre-test loops).

ll2dot

https://godoc.org/github.com/decomp/decomp/cmd/ll2dot

Control flow graph generation tool.

Generate control flow graphs from LLVM IR assembly (*.ll -> *.dot).
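
To give a feel for the kind of output such a tool produces, here is a minimal sketch that renders a control flow graph in Graphviz DOT syntax. It assumes the graph is already available as a map from basic block names to successor names; the `CFG` type and the `DOT` function are hypothetical and do not reflect the actual implementation or output format of ll2dot.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// CFG is a hypothetical control flow graph: it maps the name of a basic
// block to the names of its successor blocks.
type CFG map[string][]string

// DOT renders the graph in Graphviz DOT syntax. Blocks are emitted in sorted
// order so that the output is deterministic.
func DOT(name string, g CFG) string {
	blocks := make([]string, 0, len(g))
	for b := range g {
		blocks = append(blocks, b)
	}
	sort.Strings(blocks)

	var sb strings.Builder
	fmt.Fprintf(&sb, "digraph %q {\n", name)
	for _, b := range blocks {
		fmt.Fprintf(&sb, "\t%q\n", b)
		for _, succ := range g[b] {
			fmt.Fprintf(&sb, "\t%q -> %q\n", b, succ)
		}
	}
	sb.WriteString("}\n")
	return sb.String()
}

func main() {
	// A function with a single 2-way conditional.
	g := CFG{
		"entry": {"then", "else"},
		"then":  {"exit"},
		"else":  {"exit"},
		"exit":  nil,
	}
	fmt.Print(DOT("f", g))
}
```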

restructure

https://godoc.org/github.com/decomp/decomp/cmd/restructure

Control flow recovery tool.

Recover control flow primitives from control flow graphs (*.dot -> *.json).
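
The idea of recovering a primitive from a graph can be sketched with a hand-written check for one pattern, the 2-way conditional (if-else). This is an illustration only: the `Graph` type and the `IfElse` function are hypothetical, and the check below is a simple structural test rather than the general graph matching a real tool would need.

```go
package main

import "fmt"

// Graph is a hypothetical control flow graph: node name -> successor names.
type Graph map[string][]string

// predCount returns the number of incoming edges of every node.
func predCount(g Graph) map[string]int {
	n := make(map[string]int)
	for _, succs := range g {
		for _, s := range succs {
			n[s]++
		}
	}
	return n
}

// IfElse reports whether cond heads a 2-way conditional: cond has two
// distinct branch nodes, each branch is entered only from cond, and both
// branches flow to the same exit node. It returns that exit node.
func IfElse(g Graph, cond string) (exit string, ok bool) {
	succs := g[cond]
	if len(succs) != 2 || succs[0] == succs[1] {
		return "", false
	}
	a, b := succs[0], succs[1]
	preds := predCount(g)
	if preds[a] != 1 || preds[b] != 1 {
		return "", false
	}
	if len(g[a]) != 1 || len(g[b]) != 1 || g[a][0] != g[b][0] {
		return "", false
	}
	exit = g[a][0]
	if exit == cond {
		// Both branches jump back to the condition; that is a loop shape,
		// not an if-else.
		return "", false
	}
	return exit, true
}

func main() {
	g := Graph{
		"cond": {"then", "else"},
		"then": {"join"},
		"else": {"join"},
		"join": nil,
	}
	exit, ok := IfElse(g, "cond")
	fmt.Println(exit, ok)
}
```

A matched primitive could then be collapsed into a single node and the search repeated, which is one common way to structure nested control flow.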

Back-end

Translate structured LLVM IR to a high-level target language (e.g. Go).

ll2go

https://godoc.org/github.com/decomp/decomp/cmd/ll2go

Go code generation tool.

Decompile LLVM IR assembly to Go source code (*.ll -> *.go).

go-post

https://godoc.org/github.com/decomp/decomp/cmd/go-post

Go post-processing tool.

Post-process Go source code to make it more idiomatic (*.go -> *.go).

Release history

Version 0.2 (2018-01-30)

Primary focus of version 0.2: project-wide compilation speed.

Developing decompilation components should be fun.

There seems to be an inverse correlation between depending on a huge C++ library and having fun developing decompilation components.

Version 0.2 of the decompilation pipeline strives to resolve this issue by leveraging an LLVM IR library written in pure Go. Prior to this release, project-wide compilation could take several hours to complete. Now it completes in less than one minute, the established hard limit for all future releases.

Version 0.1 (2015-04-21)

Initial release.

Primary focus of version 0.1: compositional decompilation.

Decompilers should be composable and open source.

A decompilation pipeline should be composed of individual components, each with a single purpose and well-defined input and output.

Version 0.1 of the decomp project explores the feasibility of composing a decompilation pipeline from independent components, and the potential of exposing those components to the end-user.

For further background, refer to the Compositional Decompilation using LLVM IR design document.

Roadmap

Version 0.3 (to be released)

Primary focus of version 0.3: type-aware binary lifting.

Decompilers rely on high-quality binary lifting.

The quality of the output IR of the binary lifting front-end fundamentally determines the quality of the output of the entire decompilation pipeline.

Version 0.3 aims to improve the quality of the output LLVM IR by implementing a type-aware binary lifting front-end.

Version 0.4 (to be released)

Primary focus of version 0.4: control flow analysis.

Decompilers should recover high-level control flow primitives.

One of the primary differences between low-level assembly and high-level source code is the use of high-level control flow primitives, e.g. 1-way, 2-way and n-way conditionals (if, if-else and switch), and pre- and post-test loops (while and do-while).

Version 0.4 seeks to recover high-level control flow primitives using robust control flow analysis algorithms.

Version 0.5 (to be released)

Primary focus of version 0.5: fault tolerance.

Decompilers should be robust.

Decompilation components should respond well to unexpected states and incomplete analysis.

Version 0.5 focuses on stability, and seeks to stress test the decompilation pipeline using semi-real-world software (see the challenge issue series).

Version 0.6 (to be released)

Primary focus of version 0.6: data flow analysis.

Version 0.7 (to be released)

Primary focus of version 0.7: type analysis.

decomp's People

Contributors

dependabot[bot], golint-fixer, mcaldwelva, mewmew, sangisos

decomp's Issues

cmd/restructure: Develop a component for control flow analysis. [100h]

The component will identify high-level control structures using control flow analysis of LLVM IR. The properties of high-level control structures would be identified by pattern matching against a control flow graph of the LLVM IR.

TODO: Split the task into sub-tasks and allocate time to them.

design: Design of decompilation system [30h]

Design a decompilation system composed of individual components and based on the principle of separation of concerns. The system must be language-agnostic so that decompilation passes can be reused from other programming language environments.

Attempt to find flaws in the design by stress testing it through proof of concept implementations.

Engage in discussions with the open source community during the design process of any library intended for third party use; specifically the LLVM IR libraries.

llvm: Verify the implementation of the LLVM IR components. [5h]

  • Add test cases to ensure the reliability of the LLVM IR bitcode parser as it has to be accurate.
  • Create round-trip test cases that read an LLVM IR bitcode file, store it, and read it back again. The IR of the two reads should be identical.
  • Add similar test cases for the human readable assembly language representation of LLVM IR.
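
The round-trip idea from the list above can be sketched as follows. The example uses a toy `name = value` text format as a stand-in for LLVM IR, so `Module`, `parse`, `format` and `roundTrip` are all hypothetical names for illustration; a real test would call the project's own reader and writer instead.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
	"strings"
)

// Module is a toy stand-in for a parsed IR module: it maps a name to a value.
type Module map[string]string

// parse reads "name = value" lines and ignores blank lines.
func parse(src string) (Module, error) {
	m := make(Module)
	for _, line := range strings.Split(src, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		name, value, ok := strings.Cut(line, "=")
		if !ok {
			return nil, fmt.Errorf("malformed line %q", line)
		}
		m[strings.TrimSpace(name)] = strings.TrimSpace(value)
	}
	return m, nil
}

// format writes the module back out in a canonical form, sorted by name.
func format(m Module) string {
	names := make([]string, 0, len(m))
	for name := range m {
		names = append(names, name)
	}
	sort.Strings(names)
	var sb strings.Builder
	for _, name := range names {
		fmt.Fprintf(&sb, "%s = %s\n", name, m[name])
	}
	return sb.String()
}

// roundTrip reads src, stores it, reads the stored form back, and reports an
// error if the two reads differ.
func roundTrip(src string) error {
	first, err := parse(src)
	if err != nil {
		return fmt.Errorf("first read: %w", err)
	}
	second, err := parse(format(first))
	if err != nil {
		return fmt.Errorf("second read: %w", err)
	}
	if !reflect.DeepEqual(first, second) {
		return fmt.Errorf("round-trip mismatch: %v != %v", first, second)
	}
	return nil
}

func main() {
	if err := roundTrip("f = ret\ng = add\n"); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("round-trip ok")
}
```

Comparing the two parsed values, rather than the two text forms, keeps the test insensitive to whitespace and ordering differences that do not change the meaning of the input.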

meta: Project Plan

  • Issue #17 - Introduction
  • Issue #22 - Literature Review
  • Issue #35 - Related Work
  • Issue #49 - Methodology
  • Issue #48 - Requirements (MUST)
  • Issue #55 - Design (MUST)
  • Issue #60 - Implementation (MUST)
  • Issue #116 - Verification
  • Issue #64 - Evaluation (MUST)
  • Issue #69 - Conclusion

meta: Brainstorm about decompilation techniques. [30h]

Brainstorm about additional decompilation steps. Identify structural patterns in the low-level IR that convey information about the high-level semantics.

  • Patterns of increment instructions, conditional jumps, and unconditional jumps may be represented as for loops using initialization, condition, and post statements.
  • Jumps to offsets specified by branch tables may represent switch statements.
  • Decrements and increments of the stack pointer register may indicate function prologues and epilogues, respectively (on architectures where the stack grows downward, such as x86), which convey information about local variables.
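
The first pattern in the list above can be made concrete with a small Go example. The two functions below compute the same sum: one uses a structured for loop, the other spells out the same logic with labels and jumps, which is roughly the shape a decompiler sees in low-level code. The function names are illustrative only.

```go
package main

import "fmt"

// sumLoop uses a structured for loop with initialization, condition and
// post statements.
func sumLoop(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i
	}
	return sum
}

// sumJumps performs the same computation in a low-level shape built from a
// conditional jump, an increment and an unconditional jump.
func sumJumps(n int) int {
	sum := 0
	// Initialization statement.
	i := 0
cond:
	// Conditional jump out of the loop when the condition no longer holds.
	if i >= n {
		goto done
	}
	// Loop body.
	sum += i
	// Post statement (increment).
	i++
	// Unconditional jump back to the condition.
	goto cond
done:
	return sum
}

func main() {
	fmt.Println(sumLoop(5), sumJumps(5))
}
```

Recognizing that the second form matches the first is the kind of structural pattern the brainstorm refers to.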

meta: Report, Compositional Decompilation using LLVM IR

  • Issue #115 - Abstract
  • Issue #17 - Introduction
    • Issue #19 - Project Aim and Objectives
    • Issue #20 - Deliverables
    • Issue #21 - Disposition
  • Issue #22 - Literature Review
    • Issue #87 - The Anatomy of an Executable
    • Issue #28 - Decompilation Phases
      • Issue #86 - Binary Analysis
      • Issue #29 - Disassembly
      • Issue #34 - Control Flow Analysis
    • Issue #23 - Evaluation of Intermediate Representations
      • Issue #25 - REIL
      • Issue #26 - LLVM IR
  • Issue #35 - Related Work
    • Issue #36 - Native Code to LLVM IR
      • Issue #46 - Dagger
      • Issue #90 - MC-Semantics
    • Issue #45 - Hex-Rays Decompiler
  • Issue #49 - Methodology
    • Issue #125 - Operational Prototyping
      • Issue #50 - Throwaway Prototyping
      • Issue #124 - Evolutionary Prototyping
    • Issue #53 - Continuous Integration
  • Issue #48 - Requirements
    • Issue #121 - LLVM IR Library
    • Issue #122 - Control Flow Analysis Library
    • Issue #123 - Control Flow Analysis Tool
  • Issue #55 - Design
    • Issue #59 - System Architecture
    • Issue #58 - Front-end Components
      • Issue #128 - Native Code to LLVM IR
      • Issue #129 - Compilers
    • Issue #57 - Middle-end Components
      • Issue #130 - Control Flow Graph Generation
      • Issue #131 - Control Flow Analysis
    • Issue #56 - Back-end Components
      • Issue #133 - Post-processing
  • Issue #60 - Implementation
    • Issue #143 - Language Considerations
    • Issue #144 - LLVM IR Library
    • Issue #152 - Go Bindings for LLVM
    • Issue #146 - Subgraph Isomorphism Search Library
    • Issue #61 - Documentation
  • Issue #116 - Verification
    • Issue #134 - Test Cases
      • Issue #135 - Code Coverage
    • Issue #118 - Performance
      • Issue #136 - Profiling
      • Issue #67 - Benchmarks
    • Issue #119 - Security Assessment
    • Issue #120 - Continuous Integration
      • Issue #137 - Source Code Formatting
      • Issue #138 - Coding Style
      • Issue #139 - Code Correctness
      • Issue #140 - Build Status
      • Issue #141 - Test Cases
      • Issue #142 - Code Coverage
  • Issue #64 - Evaluation
    • Issue #153 - LLVM IR Library
      • Issue #159 - Essential Requirements
      • Issue #160 - Desirable Requirements
    • Issue #154 - Control Flow Analysis Library
      • Issue #161 - Essential Requirements
      • Issue #162 - Important Requirements
      • Issue #163 - Desirable Requirements
    • Issue #155 - Control Flow Analysis Tool
      • Issue #164 - Essential Requirements
  • Issue #69 - Conclusion
    • Issue #70 - Project Summary
    • Issue #68 - Future Work
      • Issue #156 - Design Validation
      • Issue #157 - Reliability Improvements
      • Issue #158 - Extended Capabilities
    • Issue #72 - Personal Development
    • Issue #73 - Final Thoughts

llvm: Develop a library for interacting with LLVM IR. [100h]

Develop libraries for interacting with LLVM IR in each of its three forms (in-memory, bitcode, and human-readable assembly). These components will be fundamental for the project, as all decompilation phases build upon this intermediate representation. Therefore, the data structure of the LLVM IR has to be chosen with careful consideration. Research idiomatic data structures and experiment until it feels just right.

report: Literature Review [2h]

  • Issue #87 - The Anatomy of an Executable
  • Issue #28 - Decompilation Phases
    • Issue #86 - Binary Analysis
    • Issue #29 - Disassembly
    • Issue #34 - Control Flow Analysis
  • Issue #23 - Evaluation of Intermediate Representations
    • Issue #25 - REIL
    • Issue #26 - LLVM IR

meta: Literature review

The following theses, papers, and online references will be included in the literature review:

  • Issue #76 - C. Cifuentes, Reverse Compilation Techniques. PhD thesis, Queensland University of Technology, 1994.
  • Issue #80 - I. Guilfanov, Decompilers and beyond. Black Hat USA, 2008.
  • Issue #16 - S. Moll, Decompilation of LLVM IR. BSc thesis, Saarland University, 2011.
  • Issue #81 - L. Ďurfina, J. Křoustek, P. Zemek, D. Kolář, T. Hruška, K. Masařík, and A. Meduna, Design of a retargetable decompiler for a static platform-independent malware analysis, in Information Security and Assurance (pp. 72-86), Springer, 2011.
  • Issue #82 - G. Chen, Z. Qi, S. Huang, K. Ni, Y. Zheng, W. Binder, and H. Guan, A refined decompiler to generate C code with high readability, Software: Practice and Experience, vol. 43, no. 11, pp. 1337-1358, 2013.
  • Issue #83 - K. Yakdan, S. Eschweiler, and E. Gerhards-Padilla, REcompile: A Decompilation Framework for Static Analysis of Binaries, in MALWARE'13, pp. 95-102, IEEE, 2013.
  • Issue #75 - LLVM Language Reference Manual.
  • Issue #13 - LLVM Bitcode File Format.

The following theses, papers, and online references have been marked as future ambitions:

  • Issue #77 - A. Mycroft, Type-Based Decompilation, in 8th European Symposium on Programming, ESOP'99, pp. 208–223, Springer-Verlag, 1999.
  • Issue #78 - M. J. Van Emmerik, Static Single Assignment for Decompilation. PhD thesis, The University of Queensland, 2007.
  • Issue #79 - T. Durden, Automated vulnerability auditing in machine code, Phrack Magazine, vol. 64, 2007.

review: Literature search

Search for relevant literature related to decompilation, and its key concepts and algorithms. Add located resources to issue #2.
