
llm4decompile's People

Contributors

albertan017, rocky-lq

llm4decompile's Issues

Empty output with a slightly more complex prompt

Hello, many thanks for the brilliant work!

I've tried several of the models, including llm4decompile-6.7b, llm4decompile-33b, and the uo version (llm4decompile-6.7b-uo). However, when I add more information to the prompt (e.g., some more detailed requirements), llm4decompile outputs nothing at all.

I tried the same prompt on DeepSeek-Coder-33b, and it outputs decompiled code normally.

Could you give a hint as to why this happens? Could it be some form of overfitting? Should llm4decompile only be used with the fixed prompt?
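
For reference, below is a minimal sketch of the fixed prompt template the released models appear to be fine-tuned on; the exact wording is an assumption taken from the project's usage examples, and deviating from it is one plausible cause of the empty outputs.

# Minimal sketch; the template strings below are assumptions, not the authors' verified format.
asm = open("func0.s").read()                  # disassembled function text (placeholder path)
before = "# This is the assembly code:\n"     # prefix assumed to match the fine-tuning prompt
after = "\n# What is the source code?\n"      # suffix assumed to match the fine-tuning prompt
prompt = before + asm.strip() + after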

Converting to GGUF

I've been trying to convert llm4decompile-6.7b-uo to GGUF with the llama.cpp conversion script, but I am running into issues because no tokenizer.model file is provided. This is perhaps outside the scope of the project, but there are no community-provided GGUF files at the moment, and since the usual process does not work, would you consider documenting the conversion? It would make the models much more accessible. Many thanks!
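
One possible workaround, sketched below and not verified: llama.cpp's HF-aware converter reads tokenizer.json/tokenizer_config.json directly, so the missing tokenizer.model may not matter. The script name, flags, and paths are assumptions; check them against your llama.cpp checkout.

# Sketch only: converter script name, flags, and paths are assumptions.
import subprocess

subprocess.run(
    [
        "python", "llama.cpp/convert-hf-to-gguf.py",    # HF-aware converter shipped with llama.cpp (assumed name)
        "./llm4decompile-6.7b-uo",                      # local Hugging Face snapshot of the model
        "--outfile", "llm4decompile-6.7b-uo.f16.gguf",  # GGUF output path
        "--outtype", "f16",                             # keep fp16 weights; quantize afterwards if desired
    ],
    check=True,
)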

Concern Regarding Dataset Integrity

After a thorough examination, I believe the dataset's integrity may be compromised by the methodology used to generate the assembly representations. Specifically, using object files instead of fully linked binaries introduces inaccuracies, particularly for external function calls and the handling of immediate values.

Because the linking step is skipped, the disassemblies leave the immediate operands of external function calls blank, producing misleading representations: every call to an external function is disassembled as a call to the next instruction, which can severely impair the model's ability to distinguish between different external function calls.

For example, in your decompile-eval.json, line 294, task 10, O1, the function that uses strlen, malloc, and strncpy ends up with the following disassembly as input; the callq instructions do not point to the correct locations. Even state-of-the-art decompilers cannot decompile such assembly (when the object files are stripped and the correct values are not filled into those calls).

endbr64
push   %r15
push   %r14
push   %r13
push   %r12
push   %rbp
push   %rbx
sub    $0x18,%rsp
mov    %rdi,%rbp
mov    $0xffffffffffffffff,%rcx
mov    $0x0,%eax
repnz scas %es:(%rdi),%al
mov    %rcx,%rax
not    %rax
lea    -0x1(%rax),%r12
lea    (%r12,%r12,1),%r15d
lea    0x1(%r15),%edi
movslq %edi,%rdi
callq  3d <func0+0x3d>
mov    %rax,%r14
test   %rax,%rax
je     ca <func0+0xca>
mov    %r12d,%r13d
test   %r12d,%r12d
jle    7a <func0+0x7a>
mov    %r12d,%r9d
lea    -0x1(%r12),%eax
mov    %eax,0xc(%rsp)
mov    %eax,%r8d
mov    %rbp,%rsi
mov    $0x0,%ebx
movslq %r12d,%rdi
sub    $0x1,%rdi
jmp    e8 <func0+0xe8>
mov    0xc(%rsp),%ebx
jmpq   11f <func0+0x11f>
movslq %r12d,%rdx
mov    %rbp,%rsi
mov    %rax,%rdi
callq  88 <func0+0x88>
jmp    c2 <func0+0xc2>
movslq %ebx,%rbx
mov    %rbx,%rdx
mov    %rbp,%rsi
mov    %r14,%rdi
callq  9b <func0+0x9b>
lea    -0x1(%rbp,%rbx,1),%rax
lea    (%r14,%rbx,1),%rdx
lea    -0x2(%rbp,%rbx,1),%rsi
mov    0xc(%rsp),%ecx
sub    %rcx,%rsi
movzbl (%rax),%ecx
mov    %cl,(%rdx)
sub    $0x1,%rax
add    $0x1,%rdx
cmp    %rsi,%rax
jne    b0 <func0+0xb0>
movslq %r15d,%r15
movb   $0x0,(%r14,%r15,1)
mov    %r14,%rax
add    $0x18,%rsp
pop    %rbx
pop    %rbp
pop    %r12
pop    %r13
pop    %r14
pop    %r15
retq
add    $0x1,%ebx
add    $0x1,%rsi
cmp    %ebx,%r13d
je     8a <func0+0x8a>
mov    %r9d,%eax
sub    %ebx,%eax
mov    %eax,%ecx
shr    $0x1f,%ecx
add    %eax,%ecx
sar    %ecx
cmp    %r8d,%ebx
je     71 <func0+0x71>
lea    0x0(%rbp,%rdi,1),%rdx
mov    $0x0,%eax
movzbl (%rdx),%r10d
cmp    %r10b,(%rsi,%rax,1)
jne    dc <func0+0xdc>
add    $0x1,%rax
sub    $0x1,%rdx
cmp    %eax,%ecx
jg     109 <func0+0x109>
movslq %r12d,%r13
mov    %r13,%rdx
mov    %rbp,%rsi
mov    %r14,%rdi
callq  130 <func0+0x130>
test   %ebx,%ebx
jle    15d <func0+0x15d>
movslq %ebx,%rcx
lea    -0x1(%rbp,%rcx,1),%rax
lea    (%r14,%r13,1),%rdx
lea    -0x2(%rbp,%rcx,1),%rsi
lea    -0x1(%rbx),%ecx
sub    %rcx,%rsi
movzbl (%rax),%ecx
mov    %cl,(%rdx)
sub    $0x1,%rax
add    $0x1,%rdx
cmp    %rsi,%rax
jne    14b <func0+0x14b>
lea    (%rbx,%r12,1),%eax
cltq
movb   $0x0,(%r14,%rax,1)
jmpq   ca <func0+0xca>

This discrepancy raises concerns about the reliability and effectiveness of language models trained on such data. Inaccurate representations could undermine the model's ability to generalize and to produce meaningful decompiled C functions.
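
A small sketch of the difference in question, assuming gcc and objdump are on PATH and that func0.c/main.c form a linkable program: disassembling the unlinked object file leaves external call targets as zero-filled placeholders (decoded as a call to the next instruction), while disassembling the fully linked binary resolves them to PLT stubs such as <strlen@plt>.

# Illustrative sketch; file names are placeholders.
import subprocess

# Object file only: calls to strlen/malloc/strncpy stay unresolved.
subprocess.run(["gcc", "-O1", "-c", "func0.c", "-o", "func0.o"], check=True)
print(subprocess.run(["objdump", "-d", "func0.o"], capture_output=True, text=True).stdout)

# Fully linked binary: the same calls now point at their PLT entries.
subprocess.run(["gcc", "-O1", "func0.c", "main.c", "-o", "func0.bin"], check=True)
print(subprocess.run(["objdump", "-d", "func0.bin"], capture_output=True, text=True).stdout)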

Add support for PDB files

This is amazing. It would be great if the PDB information (when available) could also be utilized.

Python Environment Setup

Hi, thank you all for the hard work. I am trying to replicate the results, but the provided requirements.txt doesn't specify version numbers. Would it be possible to publish the package and Python versions, or the environment YAML file if you were using conda?

How to train?

Hi, I would like to train on a larger base set (such as Gentoo) and across multiple architectures (RISC-V, ARM, MIPS v2, x86, ...).
How would I go about that?
Does your code support other architectures?
I would also like to train with old compilers; that would help analyze old, unmaintained code on the MIPS architecture.

Marry this with a reverse engineering framework like Rizin

Thanks for the very cool project!
We folks from Rizin were wondering whether the results and usability wouldn't be much better if the decompiler were built on top of a proper reverse engineering framework.

The most immediate advantage would be that the model could be trained on an IL instead of assembly. This would allow it to decompile any function built for the OS it was trained on, independent of the architecture, since the model would reason only over the IL and not over machine-specific assembly.

Additionally, within the framework we can implement several algorithms that reduce noise in the IL fed to the model. For example, if obfuscation patterns are present, the framework can first resolve them and then pass the result to the model.

The same goes for type inference: algorithms better suited to the job could first determine as many types as possible and pass them to the model as additional input (assuming the model is also trained on inferred types).
The same applies to flow graphs and whatever else such a model can be trained on. Additionally, you wouldn't need to implement parsing and loading of binary formats (see #1); you would simply get the functions and their details via the API, which in turn gives you more time to enhance the model.

If you are interested feel free to take a look at Rizin!
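
A rough sketch of what such a pipeline could look like, assuming rizin is installed, the rzpipe Python bindings are available, and a.out contains a function func0; the exact commands are assumptions based on rizin's radare2 heritage rather than a vetted integration.

# Sketch only; commands and bindings are assumptions.
import rzpipe

rz = rzpipe.open("a.out")
rz.cmd("aaa")                           # analyze functions, xrefs, and types
rz.cmd("e asm.esil=true")               # print the lifted IL instead of native assembly
il_listing = rz.cmd("pdf @ sym.func0")  # architecture-neutral text a model could be trained on
print(il_listing)
rz.quit()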

Inquiry on Using Pre-trained Model for Sequence-to-Sequence Tasks

Hi,

Thank you for your excellent work. We've successfully used compily.py to generate dataset.json, which contains the source code as input and the assembly code as output. We are considering using a pre-trained model for sequence-to-sequence tasks. Should we modify the JSON file so that the assembly code becomes the input and the source code becomes the output?

Thanks!
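
If swapping the fields is all that is needed, a minimal sketch follows; it assumes dataset.json is a list of records with "input" holding the C source and "output" holding the assembly, which is an assumption about the file layout.

# Sketch; field names and file layout are assumptions.
import json

with open("dataset.json") as f:
    records = json.load(f)

# Flip each record so the assembly becomes the input and the C source the output.
seq2seq = [{"input": r["output"], "output": r["input"]} for r in records]

with open("dataset_asm_to_c.json", "w") as f:
    json.dump(seq2seq, f, indent=2)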

Question about max token size for training.

Hi,
I have a question regarding the maximum token size. Currently, the maximum token size for model inference is set to 500. Would it be possible to share the training parameters, including the maximum token size used during training?

Thanks a lot.
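
For context, here is a sketch of the inference-side setting being asked about (the 500-token cap); the checkpoint path and prompt file are placeholders rather than the authors' exact script.

# Sketch; paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./llm4decompile-6.7b"                     # local checkpoint (placeholder)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="auto")

prompt = open("func0_prompt.txt").read()                # pre-built prompt text (placeholder)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500, do_sample=False)  # the 500-token inference cap
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))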

Context / sequence length

Hi there!
I couldn't find details about this in the paper or the README:
what is the maximum context length of your inputs? Disassembled files are pretty long, around ~300K tokens; how did you handle that? What was the token-length distribution of your inputs?

Thank you :)
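
One quick way to inspect that distribution before picking a context limit is sketched below; the tokenizer id and input files are assumptions.

# Sketch; tokenizer id and file list are assumptions.
from statistics import median
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base")  # assumed base tokenizer
assembly_samples = [open(p).read() for p in ["func0.s", "func1.s"]]                # placeholder inputs

lengths = [len(tokenizer(asm)["input_ids"]) for asm in assembly_samples]
print("median:", median(lengths), "max:", max(lengths))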

Training budget estimation

We trained our model using AnghaBench compilation results across four optimization levels (O0~O3), selecting samples under 1024 tokens. That gave us a total of 534,564 samples per level, and we trained for 2 epochs on a cluster of 8 Nvidia A100 GPUs.

As for the training times, they were 10 hours for the 1.3B model, 85 hours for the 6.7B model, and 440 hours for the 33B model.

Let me know if you need more info!

Originally posted by @rocky-lq in #3 (comment)

Hi @rocky-lq @albertan017 ,

We are estimating the training budget for reproducing LLM4Decompile. In your previous issue response, you stated that, given 534,564 samples per level and a cluster of 8 Nvidia A100 GPUs, training took 10 hours for the 1.3B model, 85 hours for the 6.7B model, and 440 hours for the 33B model.

In the paper updated on 19 June, fine-tuning the 1.3B and 6.7B LLM4Decompile-End models takes 12 and 61 days respectively on 8×A100, given 7.2 million compilable samples and 1.6 million executable samples. This leaves some confusion about the training budget.

Could you please provide more information about the training budget, and was all of the training fully supervised fine-tuning?
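
To make the gap concrete, a back-of-the-envelope GPU-hour comparison of the two sets of figures quoted in this thread (illustrative arithmetic only):

# Both sets of numbers come from this thread; the comparison is only arithmetic.
gpus = 8

issue_hours = {"1.3B": 10, "6.7B": 85, "33B": 440}   # earlier issue reply (2 epochs, ~534k samples/level x 4 levels)
paper_days = {"1.3B": 12, "6.7B": 61}                # 19 June paper (7.2M compilable + 1.6M executable samples)

issue_gpu_hours = {k: h * gpus for k, h in issue_hours.items()}      # e.g. 6.7B -> 680 A100-hours
paper_gpu_hours = {k: d * 24 * gpus for k, d in paper_days.items()}  # e.g. 6.7B -> 11,712 A100-hours
print(issue_gpu_hours)
print(paper_gpu_hours)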

[Bug Fix Request] Request for Correction of Incorrect Folder Path

Hello,

I've encountered an execution issue in the run_evaluation_llm4decompile_singleGPU.py file due to an incorrect folder path. Specifically, the folder path referenced on line 12 is incorrect.

  • Before correction: "../decompile-eval/decompile-eval.json"
  • After correction: "../decompiler-eval/decompile-eval.json"

Here is the code link:

parser.add_argument('--data_path',type=str,default='../decompile-eval/decompile-eval.json',required=False)

Due to the incorrect path, an error occurs when executing the script because it cannot find the specified folder. I would greatly appreciate it if you could correct the folder path.

Thank you.

Dataset

Hi,

The paper mentions that the dataset is released, but unless I'm being really stupid I can't see it anywhere. Are you planning to release the training dataset anytime soon?

Cannot reproduce the results; is there anything wrong?

#!/bin/bash

CUDA_VISIBLE_DEVICES=0,1 python ../evaluation/run_evaluation_llm4decompile_vllm.py \
    --model_path ../../LLM/llm4decompile-6.7b-v1.5 \
    --testset_path ../decompile-eval/decompile-eval.json \
    --gpus 2 \
    --max_total_tokens 2048 \
    --max_new_tokens 2000 \
    --repeat 1 \
    --num_workers 32 \
    --gpu_memory_utilization 0.82 \
    --temperature 0

Optimization O0: Compile Rate: 0.9268, Run Rate: 0.5488
Optimization O1: Compile Rate: 0.9268, Run Rate: 0.3598
Optimization O2: Compile Rate: 0.8902, Run Rate: 0.3537
Optimization O3: Compile Rate: 0.8902, Run Rate: 0.3171
