
llm4decompile's People

Contributors

albertan017, rocky-lq

llm4decompile's Issues

Empty output with a slightly more complex prompt

Hello, many thanks for the brilliant work!

I've tried several of the models, including llm4decompile-6.7b, llm4decompile-33b, and the uo version (llm4decompile-6.7b-uo). However, when I add more information to the prompt (e.g., some more detailed requirements), llm4decompile outputs nothing at all.

I tried the same prompt on DeepSeek-Coder-33b, and it outputs decompiled code normally.

Could you give a hint as to why this happens? Could it be some form of overfitting? Should llm4decompile only be used with the fixed prompt?
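
For reference, below is a minimal sketch of the fixed prompt template the released models appear to be fine-tuned on; the exact wording is an assumption taken from the project's usage examples, and deviating from it is one plausible cause of the empty outputs.

# Minimal sketch; the template strings below are assumptions, not the authors' verified format.
asm = open("func0.s").read()                  # disassembled function text (placeholder path)
before = "# This is the assembly code:\n"     # prefix assumed to match the fine-tuning prompt
after = "\n# What is the source code?\n"      # suffix assumed to match the fine-tuning prompt
prompt = before + asm.strip() + after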

Converting to GGUF

I've been trying to convert llm4decompile-6.7b-uo to GGUF with the llama.cpp conversion script, but I am running into issues because no tokenizer.model file is provided. This is perhaps outside the scope of the project, but there are no community-provided GGUF files at the moment, and since the usual process does not work, would you consider documenting the conversion? It would make the models much more accessible. Many thanks!
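
One possible workaround, sketched below and not verified: llama.cpp's HF-aware converter reads tokenizer.json/tokenizer_config.json directly, so the missing tokenizer.model may not matter. The script name, flags, and paths are assumptions; check them against your llama.cpp checkout.

# Sketch only: converter script name, flags, and paths are assumptions.
import subprocess

subprocess.run(
    [
        "python", "llama.cpp/convert-hf-to-gguf.py",    # HF-aware converter shipped with llama.cpp (assumed name)
        "./llm4decompile-6.7b-uo",                      # local Hugging Face snapshot of the model
        "--outfile", "llm4decompile-6.7b-uo.f16.gguf",  # GGUF output path
        "--outtype", "f16",                             # keep fp16 weights; quantize afterwards if desired
    ],
    check=True,
)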

Concern Regarding Dataset Integrity

After a thorough examination, I believe the dataset's integrity may be compromised by the methodology used to generate the assembly representations. Specifically, using object files instead of fully linked binaries introduces inaccuracies, particularly for external function calls and the handling of immediate values.

Because the linking step is skipped, the disassemblies leave the immediate operands of external function calls blank, producing misleading representations: every call to an external function is disassembled as a call to the next instruction, which can severely impair the model's ability to distinguish between different external function calls.

For example, in your decompile-eval.json, line 294, task 10, O1, the function that uses strlen, malloc, and strncpy ends up with the following disassembly as input; the callq instructions do not point to the correct locations. Even state-of-the-art decompilers cannot decompile such assembly (when the object files are stripped and the correct values are not filled into those calls).

endbr64
push   %r15
push   %r14
push   %r13
push   %r12
push   %rbp
push   %rbx
sub    $0x18,%rsp
mov    %rdi,%rbp
mov    $0xffffffffffffffff,%rcx
mov    $0x0,%eax
repnz scas %es:(%rdi),%al
mov    %rcx,%rax
not    %rax
lea    -0x1(%rax),%r12
lea    (%r12,%r12,1),%r15d
lea    0x1(%r15),%edi
movslq %edi,%rdi
callq  3d <func0+0x3d>
mov    %rax,%r14
test   %rax,%rax
je     ca <func0+0xca>
mov    %r12d,%r13d
test   %r12d,%r12d
jle    7a <func0+0x7a>
mov    %r12d,%r9d
lea    -0x1(%r12),%eax
mov    %eax,0xc(%rsp)
mov    %eax,%r8d
mov    %rbp,%rsi
mov    $0x0,%ebx
movslq %r12d,%rdi
sub    $0x1,%rdi
jmp    e8 <func0+0xe8>
mov    0xc(%rsp),%ebx
jmpq   11f <func0+0x11f>
movslq %r12d,%rdx
mov    %rbp,%rsi
mov    %rax,%rdi
callq  88 <func0+0x88>
jmp    c2 <func0+0xc2>
movslq %ebx,%rbx
mov    %rbx,%rdx
mov    %rbp,%rsi
mov    %r14,%rdi
callq  9b <func0+0x9b>
lea    -0x1(%rbp,%rbx,1),%rax
lea    (%r14,%rbx,1),%rdx
lea    -0x2(%rbp,%rbx,1),%rsi
mov    0xc(%rsp),%ecx
sub    %rcx,%rsi
movzbl (%rax),%ecx
mov    %cl,(%rdx)
sub    $0x1,%rax
add    $0x1,%rdx
cmp    %rsi,%rax
jne    b0 <func0+0xb0>
movslq %r15d,%r15
movb   $0x0,(%r14,%r15,1)
mov    %r14,%rax
add    $0x18,%rsp
pop    %rbx
pop    %rbp
pop    %r12
pop    %r13
pop    %r14
pop    %r15
retq
add    $0x1,%ebx
add    $0x1,%rsi
cmp    %ebx,%r13d
je     8a <func0+0x8a>
mov    %r9d,%eax
sub    %ebx,%eax
mov    %eax,%ecx
shr    $0x1f,%ecx
add    %eax,%ecx
sar    %ecx
cmp    %r8d,%ebx
je     71 <func0+0x71>
lea    0x0(%rbp,%rdi,1),%rdx
mov    $0x0,%eax
movzbl (%rdx),%r10d
cmp    %r10b,(%rsi,%rax,1)
jne    dc <func0+0xdc>
add    $0x1,%rax
sub    $0x1,%rdx
cmp    %eax,%ecx
jg     109 <func0+0x109>
movslq %r12d,%r13
mov    %r13,%rdx
mov    %rbp,%rsi
mov    %r14,%rdi
callq  130 <func0+0x130>
test   %ebx,%ebx
jle    15d <func0+0x15d>
movslq %ebx,%rcx
lea    -0x1(%rbp,%rcx,1),%rax
lea    (%r14,%r13,1),%rdx
lea    -0x2(%rbp,%rcx,1),%rsi
lea    -0x1(%rbx),%ecx
sub    %rcx,%rsi
movzbl (%rax),%ecx
mov    %cl,(%rdx)
sub    $0x1,%rax
add    $0x1,%rdx
cmp    %rsi,%rax
jne    14b <func0+0x14b>
lea    (%rbx,%r12,1),%eax
cltq
movb   $0x0,(%r14,%rax,1)
jmpq   ca <func0+0xca>

This discrepancy raises concerns about the reliability and effectiveness of language models trained on such data. Inaccurate representations could undermine the model's ability to generalize and to produce meaningful decompiled C functions.
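
A small sketch of the difference in question, assuming gcc and objdump are on PATH and that func0.c/main.c form a linkable program: disassembling the unlinked object file leaves external call targets as zero-filled placeholders (decoded as a call to the next instruction), while disassembling the fully linked binary resolves them to PLT stubs such as <strlen@plt>.

# Illustrative sketch; file names are placeholders.
import subprocess

# Object file only: calls to strlen/malloc/strncpy stay unresolved.
subprocess.run(["gcc", "-O1", "-c", "func0.c", "-o", "func0.o"], check=True)
print(subprocess.run(["objdump", "-d", "func0.o"], capture_output=True, text=True).stdout)

# Fully linked binary: the same calls now point at their PLT entries.
subprocess.run(["gcc", "-O1", "func0.c", "main.c", "-o", "func0.bin"], check=True)
print(subprocess.run(["objdump", "-d", "func0.bin"], capture_output=True, text=True).stdout)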

Add support for PDB files

This is amazing. It would be great if the PDB information (when available) could also be utilized.

Python Environment Setup

Hi, thank you all for the hard work. I am trying to replicate the results, but the provided requirements.txt doesn't specify version numbers. Would it be possible to publish the package and Python versions, or the environment YAML file if you were using conda?

How to train?

Hi, I would like to train on a larger base set (such as Gentoo) and across multiple architectures (RISC-V, ARM, MIPS v2, x86, ...).
How would I go about that?
Does your code support other architectures?
I would also like to train with old compilers; that would help analyze old, unmaintained code on the MIPS architecture.

Marry this with a reverse engineering framework like Rizin

Thanks for the very cool project!
We folks from Rizin were wondering whether the results and usability wouldn't be much better if the decompiler were built on top of a proper reverse engineering framework.

The most immediate advantage would be that the model could be trained on an IL instead of assembly. This would allow it to decompile any function built for the OS it was trained on, independent of the architecture, since the model would reason only over the IL and not over machine-specific assembly.

Additionally, within the framework we can implement several algorithms that reduce noise in the IL fed to the model. For example, if obfuscation patterns are present, the framework can first resolve them and then pass the result to the model.

The same goes for type inference: algorithms better suited to the job could first determine as many types as possible and pass them to the model as additional input (assuming the model is also trained on inferred types).
The same applies to flow graphs and whatever else such a model can be trained on. Additionally, you wouldn't need to implement parsing and loading of binary formats (see #1); you would simply get the functions and their details via the API, which in turn gives you more time to enhance the model.

If you are interested feel free to take a look at Rizin!
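
A rough sketch of what such a pipeline could look like, assuming rizin is installed, the rzpipe Python bindings are available, and a.out contains a function func0; the exact commands are assumptions based on rizin's radare2 heritage rather than a vetted integration.

# Sketch only; commands and bindings are assumptions.
import rzpipe

rz = rzpipe.open("a.out")
rz.cmd("aaa")                           # analyze functions, xrefs, and types
rz.cmd("e asm.esil=true")               # print the lifted IL instead of native assembly
il_listing = rz.cmd("pdf @ sym.func0")  # architecture-neutral text a model could be trained on
print(il_listing)
rz.quit()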

Inquiry on Using Pre-trained Model for Sequence-to-Sequence Tasks

Hi,

Thank you for your excellent work. We've successfully used compily.py to generate dataset.json, which contains the source code as input and the assembly code as output. We are considering using a pre-trained model for sequence-to-sequence tasks. Should we modify the JSON file so that the assembly code becomes the input and the source code becomes the output?

Thanks!
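
If swapping the fields is all that is needed, a minimal sketch follows; it assumes dataset.json is a list of records with "input" holding the C source and "output" holding the assembly, which is an assumption about the file layout.

# Sketch; field names and file layout are assumptions.
import json

with open("dataset.json") as f:
    records = json.load(f)

# Flip each record so the assembly becomes the input and the C source the output.
seq2seq = [{"input": r["output"], "output": r["input"]} for r in records]

with open("dataset_asm_to_c.json", "w") as f:
    json.dump(seq2seq, f, indent=2)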

Question about max token size for training.

Hi,
I have a question regarding the maximum token size. Currently, the maximum token size for model inference is set to 500. Would it be possible to share the training parameters, including the maximum token size used during training?

Thanks a lot.
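
For context, here is a sketch of the inference-side setting being asked about (the 500-token cap); the checkpoint path and prompt file are placeholders rather than the authors' exact script.

# Sketch; paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./llm4decompile-6.7b"                     # local checkpoint (placeholder)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="auto")

prompt = open("func0_prompt.txt").read()                # pre-built prompt text (placeholder)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500, do_sample=False)  # the 500-token inference cap
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))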

Context / sequence length

Hi there!
I couldn't find details about this in the paper or the README:
what is the maximum context length of your inputs? Disassembled files are pretty long, around ~300K tokens; how did you handle that? What was the token-length distribution of your inputs?

Thank you :)
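
One quick way to inspect that distribution before picking a context limit is sketched below; the tokenizer id and input files are assumptions.

# Sketch; tokenizer id and file list are assumptions.
from statistics import median
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base")  # assumed base tokenizer
assembly_samples = [open(p).read() for p in ["func0.s", "func1.s"]]                # placeholder inputs

lengths = [len(tokenizer(asm)["input_ids"]) for asm in assembly_samples]
print("median:", median(lengths), "max:", max(lengths))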

Training budget estimation

We trained our model using AnghaBench compilation results across four optimization levels (O0~O3), selecting samples under 1024 tokens. That gave us a total of 534,564 samples per level, and we trained for 2 epochs on a cluster of 8 Nvidia A100 GPUs.

As for the training times, they were 10 hours for the 1.3B model, 85 hours for the 6.7B model, and 440 hours for the 33B model.

Let me know if you need more info!

Originally posted by @rocky-lq in #3 (comment)

Hi @rocky-lq @albertan017 ,

We are estimating the training budget for reproducing LLM4Decompile. In your previous issue response, you stated that, given 534,564 samples per level and a cluster of 8 Nvidia A100 GPUs, training took 10 hours for the 1.3B model, 85 hours for the 6.7B model, and 440 hours for the 33B model.

In the paper updated on 19 June, fine-tuning the 1.3B and 6.7B LLM4Decompile-End models takes 12 and 61 days respectively on 8×A100, given 7.2 million compilable samples and 1.6 million executable samples. This leaves some confusion about the training budget.

Could you please provide more information about the training budget, and was all of the training fully supervised fine-tuning?
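
To make the gap concrete, a back-of-the-envelope GPU-hour comparison of the two sets of figures quoted in this thread (illustrative arithmetic only):

# Both sets of numbers come from this thread; the comparison is only arithmetic.
gpus = 8

issue_hours = {"1.3B": 10, "6.7B": 85, "33B": 440}   # earlier issue reply (2 epochs, ~534k samples/level x 4 levels)
paper_days = {"1.3B": 12, "6.7B": 61}                # 19 June paper (7.2M compilable + 1.6M executable samples)

issue_gpu_hours = {k: h * gpus for k, h in issue_hours.items()}      # e.g. 6.7B -> 680 A100-hours
paper_gpu_hours = {k: d * 24 * gpus for k, d in paper_days.items()}  # e.g. 6.7B -> 11,712 A100-hours
print(issue_gpu_hours)
print(paper_gpu_hours)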

[Bug Fix Request] Request for Correction of Incorrect Folder Path

Hello,

I've encountered an execution issue in the run_evaluation_llm4decompile_singleGPU.py file due to an incorrect folder path. Specifically, the folder path referenced on line 12 is incorrect.

  • Before correction: "../decompile-eval/decompile-eval.json"
  • After correction: "../decompiler-eval/decompile-eval.json"

Here is the code link:

parser.add_argument('--data_path',type=str,default='../decompile-eval/decompile-eval.json',required=False)

Due to the incorrect path, an error occurs when executing the script because it cannot find the specified folder. I would greatly appreciate it if you could correct the folder path.

Thank you.

Dataset

Hi,

The paper mentions that the dataset is released, but unless I'm being really stupid I can't see it anywhere. Are you planning to release the training dataset anytime soon?

Cannot reproduce the results; is there anything wrong?

#!/bin/bash

CUDA_VISIBLE_DEVICES=0,1 python ../evaluation/run_evaluation_llm4decompile_vllm.py \
    --model_path ../../LLM/llm4decompile-6.7b-v1.5 \
    --testset_path ../decompile-eval/decompile-eval.json \
    --gpus 2 \
    --max_total_tokens 2048 \
    --max_new_tokens 2000 \
    --repeat 1 \
    --num_workers 32 \
    --gpu_memory_utilization 0.82 \
    --temperature 0

Optimization O0: Compile Rate: 0.9268, Run Rate: 0.5488
Optimization O1: Compile Rate: 0.9268, Run Rate: 0.3598
Optimization O2: Compile Rate: 0.8902, Run Rate: 0.3537
Optimization O3: Compile Rate: 0.8902, Run Rate: 0.3171
