MalwareLogAnalysis

Malware log analysis based on branch data.

The related presentation is here.

Branch data refers to the data recorded at branch instructions such as jmp and call. It is useful for showing the structure of a binary regardless of polymorphism.

BranchLogPreprocess.ipynb preprocesses the raw branch logs into regularized data. MalwareLogAnalysis.ipynb then classifies the branch data as malware or normal software.

  • Analysis : Processing utilities for log analysis.
  • Cuckoo : Customization of the Cuckoo sandbox to automate branch logging of malware.
  • Data : API list and ML dataset.
  • Log : Raw logs of the branch data.
  • BranchTracer : Branch tracer based on VEH (Vectored Exception Handler) for logging branch data.

Analysis

Processing utilities for log analysis.

  • maldb.py : Creates the database for the branch logs.
  • preproc.py : Log regularizer that prepares the input for the ML model.

Malware database

Column  Comment
id      ID
name    Malware name based on VirusTotal

Branch database

Column      Comment
id          ID
malware_id  Malware ID
order       Order of the branch data generated from the same malware
src_addr    Source address
dst_addr    Destination address
dll         Destination DLL space (nullable)
symbol      Destination symbol data (nullable)
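
The two tables above could be created roughly as follows. This is only a sketch: the README does not say which database engine maldb.py uses, so SQLite, the table names malware and branch, the column types, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema mirroring the two tables described above.
# Column types are assumptions; "order" is quoted because it is an SQL keyword.
SCHEMA = """
CREATE TABLE malware (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE branch (
    id         INTEGER PRIMARY KEY,
    malware_id INTEGER NOT NULL REFERENCES malware(id),
    "order"    INTEGER NOT NULL,
    src_addr   INTEGER NOT NULL,
    dst_addr   INTEGER NOT NULL,
    dll        TEXT,
    symbol     TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up sample data: one binary with two branches, the second without a symbol.
conn.execute("INSERT INTO malware (id, name) VALUES (1, 'Example.Sample')")
conn.executemany(
    'INSERT INTO branch (malware_id, "order", src_addr, dst_addr, dll, symbol) '
    "VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 0, 0x401000, 0x76A41000, "kernel32.dll", "CreateFileW"),
        (1, 1, 0x401020, 0x401200, None, None),
    ],
)

# Read the branch symbols of one binary back in execution order.
rows = conn.execute(
    'SELECT symbol FROM branch WHERE malware_id = ? ORDER BY "order"', (1,)
).fetchall()
print(rows)
```

Keeping order as an explicit column makes it possible to rebuild the branch sequence of a binary with a single ORDER BY, which is what the preprocessing step needs.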

Preprocessing

Extract the API symbol data and map each symbol to its index in function_list.txt.

For example, the preprocessed branch data of calc is [0,186,0,143,187,0,292]. The first column is the is_malware flag: 1 for malware, 0 for normal software. The following columns are the one-based indices of the APIs in function_list.txt. An index of 0 means that function_list.txt does not contain that API symbol.
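
The encoding described above can be sketched in a few lines. This is not the code of preproc.py: the function names, the three-entry API list, and the choice to skip branches that have no symbol at all are assumptions made for illustration.

```python
def load_function_index(names):
    """Map each API name to its one-based position in the list."""
    return {name: i for i, name in enumerate(names, start=1)}


def encode_sample(symbols, index, is_malware):
    """Return [is_malware, idx, idx, ...] for one binary.

    Branches without a symbol (None) are skipped, which is an assumption.
    Symbols that are missing from the list are encoded as 0.
    """
    row = [1 if is_malware else 0]
    row.extend(index.get(sym, 0) for sym in symbols if sym is not None)
    return row


# Hypothetical stand-in for the contents of function_list.txt.
api_names = ["CreateFileW", "ReadFile", "CloseHandle"]
index = load_function_index(api_names)

row = encode_sample(
    ["ReadFile", None, "NtUnknownCall", "CloseHandle"], index, is_malware=False
)
print(row)  # [0, 2, 0, 3]
```

Reserving 0 for unknown symbols keeps every row purely numeric, so a sample can be fed to the model even when it calls APIs outside the list.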

Cuckoo

Customization of the Cuckoo sandbox to automate branch logging of malware.

This is a schematic representation of the structure of the Cuckoo Sandbox. When a malware sample is submitted to the Cuckoo sandbox, the scheduler receives it and either sends it to a VM or puts it into the queue. The agent.py on the VM receives the sample and starts analyzer.py to analyze it. The analyzer produces a report and sends it to the scheduler.

I customized analyzer.py to run the branch tracer. Helper injects the Brancher DLL into the target process, and Brancher logs the branch data.

if is32bit:
    self.target = 'C:\\dbg\\Helper32.exe'
else:
    self.target = 'C:\\dbg\\Helper64.exe'

try:
    proc = Popen(self.target)
    pids = proc.pid
except Exception as e:
    log.error('custom : fail to open process %s : %s', self.target, e)

After running the software, analyzer.py preprocesses the log and writes it to the debug log of the Cuckoo sandbox.

time.sleep(3)
with open('C:\\dbg\\log.txt') as f_log:
    raw = f_log.read()
    data = ''.join(raw.split('\x00'))
    log.debug('logged : \n%s', data)

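./cuckoo/filter.py parses the branch data and writes a new log file containing the software's branch log.

# Read the Cuckoo analysis log of one sample.
with open(analysis_path % filt) as f:
    log = f.read()

if '+' not in log:
    # No branch records were logged, so discard this sample's directory.
    shutil.rmtree('./' + filt)
else:
    # Keep only the lines that contain '+', which are treated as branch records.
    liner = log.replace('\r', '').split('\n')
    branch = filter(lambda x: '+' in x, liner)
    data = '\n'.join(branch)

    with open(log_path % filt, 'w') as branch_log:
        branch_log.write(data + '\n')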

Data

Top 1000 Windows APIs called by malware samples.

I collected branch data from 470 malware samples and 40 normal programs. Because this is a highly unbalanced classification problem, I split the dataset into mal_trainset.csv (450) and norm_trainset.csv (20), with testset.csv (20+20) held out. Each training batch then combines 20 samples from the malware training set with the whole normal training set.
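
The batching scheme above might look like the following sketch. The README does not say how the 20 malware samples of each batch are chosen, so random sampling, the function name balanced_batches, and the placeholder rows are assumptions.

```python
import random


def balanced_batches(malware_rows, normal_rows, n_malware=20, steps=5, seed=0):
    """Yield batches made of n_malware sampled malware rows plus all normal rows.

    Random sampling without replacement inside each batch is an assumption;
    the notebook may iterate over the malware set in fixed chunks instead.
    """
    rng = random.Random(seed)
    for _ in range(steps):
        batch = rng.sample(malware_rows, n_malware) + list(normal_rows)
        rng.shuffle(batch)  # mix the two classes inside the batch
        yield batch


# Placeholder rows in the [is_malware, index, ...] layout described earlier.
malware = [[1, i] for i in range(450)]
normal = [[0, i] for i in range(20)]

batches = list(balanced_batches(malware, normal, steps=3))
print(len(batches), len(batches[0]))  # 3 40
```

With 20 malware rows and 20 normal rows, every batch is balanced 1:1 even though the training set as a whole is 450:20.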

Branch Tracer

Branch tracer based on the Vectored Exception Handler (VEH). Here is my repo.

MalwareLogAnalysis

A classification problem between malware and normal software. The dataset is the preprocessed branch data.

First, project each symbol index to a 1024-dimensional vector. Then pass the sequence to an LSTM recurrent unit with 1024 hidden units. Connect its output to fully-connected layers (512, 256, 2) and apply softmax to the result.
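
To make the data flow concrete, here is a rough pure-Python sketch of the forward pass described above: embedding lookup, LSTM, fully-connected layers, softmax. It is not the notebook's code, which presumably relies on a deep learning framework. Several details are assumptions: biases are omitted, ReLU is used between the fully-connected layers, the vocabulary size is 1001 (1000 APIs plus index 0), and the weights are random and untrained. The demo uses tiny layer sizes, since the sizes given in the README (1024, 1024, 512, 256, 2) would be far too slow in pure Python.

```python
import math
import random


def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


class TinyBranchClassifier:
    """Embedding -> LSTM -> fully-connected layers -> softmax (untrained)."""

    def __init__(self, vocab, embed, hidden, fc, seed=0):
        rng = random.Random(seed)
        self.hidden = hidden
        self.embedding = rand_matrix(vocab, embed, rng)
        # One weight matrix per LSTM gate (input, forget, candidate, output),
        # each applied to the concatenation of the input vector and the state.
        self.gates = [rand_matrix(hidden, embed + hidden, rng) for _ in range(4)]
        sizes = (hidden,) + tuple(fc)
        self.fc = [rand_matrix(o, i, rng) for i, o in zip(sizes, sizes[1:])]

    def forward(self, indices):
        h = [0.0] * self.hidden  # hidden state
        c = [0.0] * self.hidden  # cell state
        for idx in indices:
            xh = self.embedding[idx] + h  # list concatenation: [x, h]
            i, f, g, o = (matvec(w, xh) for w in self.gates)
            c = [sigmoid(fv) * cv + sigmoid(iv) * math.tanh(gv)
                 for iv, fv, gv, cv in zip(i, f, g, c)]
            h = [sigmoid(ov) * math.tanh(cv) for ov, cv in zip(o, c)]
        out = h  # last hidden state summarizes the whole sequence
        for k, w in enumerate(self.fc):
            out = matvec(w, out)
            if k < len(self.fc) - 1:
                out = [max(0.0, v) for v in out]  # ReLU (assumed activation)
        return softmax(out)


# Tiny sizes for the demo; the README uses embed=1024, hidden=1024, fc=(512, 256, 2).
model = TinyBranchClassifier(vocab=1001, embed=8, hidden=8, fc=(6, 4, 2))
probs = model.forward([186, 0, 143, 187, 0, 292])  # calc's indices without the flag
print(probs)
```

The two output values are the class probabilities for normal software and malware; which position corresponds to which class depends on how the labels are encoded during training.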

I was able to get a classifier with 92% accuracy.

Benefit

Unlike existing detection methods, this approach uses branch data to detect malware. Branch data represents the structure of a binary in the small view and its behavior in the larger view, so it can capture the defining property of malware: that it performs malicious acts.

Theoretically, as long as the branch data can be extracted, this approach should be able to detect most malware.
