This script processes binary and decompiled code to parse target variables and yield Access Expressions for model inference. The pipeline prepares decompiled files, parses comments, generates commands for the various analyses, and produces the final JSONL files for LLM input. The pipeline is constructed in `process_data/process_data.sh`. Here is a step-by-step breakdown:
- **Check and Create Directories:**
  - Ensures the source directory (`source_dir`) exists.
  - Creates the necessary output directories if they do not already exist: `decompiled`, `decompiled_vars`, `beyond_access`, `callsites`, `dataflow`, `commands`, and `logs`.
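The directory setup above can be sketched in shell. This is a minimal sketch, not the actual code from `process_data.sh`; only the directory names come from the description:

```shell
# Sketch of the directory-setup step; directory names follow the list above.
setup_dirs() {
    local source_dir="$1"
    # Abort early if the source directory is missing.
    if [ ! -d "$source_dir" ]; then
        echo "Error: source directory '$source_dir' does not exist." >&2
        return 1
    fi
    # mkdir -p creates a directory only when it does not already exist.
    local sub
    for sub in decompiled decompiled_vars beyond_access callsites dataflow commands logs; do
        mkdir -p "$source_dir/$sub"
    done
}
```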
- **Process Each Binary File:** for each file in the `bin` directory, the script performs the following:
  - **Prepare Decompiled Files:** prepares the decompiled files using `prep_decompiled.py`.
  - **Parse Comments:** parses comments from the decompiled code using `parse_decompiled.py`.
  - **Generate Commands:** generates and saves commands for the beyond-access, callsites, and data-flow analyses using `gen_beyond_access_command.py`. The commands are saved in the `commands` directory as `<filename>_beyond_access_command.sh`, `<filename>_callsites_command.sh`, and `<filename>_dataflow_command.sh`.
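The per-binary loop can be sketched as follows. Only the helper-script names and the `commands/` naming scheme come from the pipeline description; the helper arguments (and the `touch` placeholders) are illustrative assumptions:

```shell
# Sketch of the per-binary loop; helper-script arguments are assumed.
process_binaries() {
    local source_dir="$1"
    local bin name
    for bin in "$source_dir"/bin/*; do
        [ -e "$bin" ] || continue              # skip when bin/ is empty
        name=$(basename "$bin")
        # Step 1: prepare the decompiled file (arguments are assumed):
        # python prep_decompiled.py "$source_dir/decompiled/$name"
        # Step 2: parse comments from the decompiled code (arguments are assumed):
        # python parse_decompiled.py "$source_dir/decompiled/$name"
        # Step 3: gen_beyond_access_command.py writes three command scripts per
        # binary; touch stands in for it here to show the naming scheme.
        touch "$source_dir/commands/${name}_beyond_access_command.sh" \
              "$source_dir/commands/${name}_callsites_command.sh" \
              "$source_dir/commands/${name}_dataflow_command.sh"
    done
}
```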
- **Generate Combined Command Scripts:** combines the per-binary command scripts into comprehensive scripts for the beyond-access, callsites, and data-flow analyses using `gen_command_script.py`. The combined scripts are saved as `beyond_access_command.sh`, `callsites_command.sh`, and `dataflow_command.sh`.
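The combining step can be approximated as below. Treating `gen_command_script.py` as a concatenation of the matching per-binary scripts is an assumption for illustration, not the tool's documented behaviour:

```shell
# Sketch: merge all per-binary scripts of one kind into a single script.
combine_commands() {
    local source_dir="$1" kind="$2"   # kind: beyond_access | callsites | dataflow
    cat "$source_dir/commands/"*_"${kind}"_command.sh > "$source_dir/${kind}_command.sh"
}
```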
- **Execute Command Scripts:** executes the combined command scripts to perform the analyses:

  ```shell
  bash $source_dir/beyond_access_command.sh
  bash $source_dir/callsites_command.sh
  bash $source_dir/dataflow_command.sh
  ```
- **Generate Final JSONL Files:**
  - Generates the `stack.jsonl` file by combining data from decompiled files and variables using `gen_stack_data.py`.
  - Generates the `heap.jsonl` file by combining data from decompiled files and beyond-access analyses using `gen_heap_data.py`.
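Each output file holds one JSON object per line. A quick sanity check for the generated files can look like this; the `check_jsonl` helper is not part of the pipeline and assumes `python3` is available:

```shell
# Exit non-zero on the first line of the given file that is not valid JSON.
check_jsonl() {
    python3 -c 'import json, sys
for line in sys.stdin:
    json.loads(line)' < "$1"
}
```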
To run the pipeline in the provided Docker image:

```shell
docker pull dnxie/resym_demo:latest
docker run -it --memory=100G --name resym dnxie/resym_demo:latest
conda activate binary
cd /home/resym/process_data
bash process_osprey.sh /home/data
```

Exit the Docker container when the run finishes. The output from a previous experiment is stored at `./data`.
If you would like to run the model for inference, make sure CUDA is installed. The remaining requirements are listed in `requirements.txt`; you can install them with `pip install -r requirements.txt`.