spcl / ncc

Neural Code Comprehension: A Learnable Representation of Code Semantics

License: BSD 3-Clause "New" or "Revised" License

Python 99.95% C++ 0.02% Shell 0.03%

machine-learning llvm-ir embeddings embedding-models embedding-based neural-networks code-analysis

ncc's Introduction

Neural Code Comprehension: A Learnable Representation of Code Semantics

ncc (Neural Code Comprehension) is a general Machine Learning technique to learn semantics from raw code in virtually any programming language. It relies on inst2vec, an embedding space and graph representation of LLVM IR statements and their context.

This repository contains the code used in [paper]:

Neural Code Comprehension: A Learnable Representation of Code Semantics, Tal Ben-Nun, Alice Shoshana Jakobovits, Torsten Hoefler

Please cite as:

@incollection{ncc,
title = {Neural Code Comprehension: A Learnable Representation of Code Semantics},
author = {Ben-Nun, Tal and Jakobovits, Alice Shoshana and Hoefler, Torsten},
booktitle = {Advances in Neural Information Processing Systems 31},
editor = {S. Bengio and H. Wallach and H. Larochelle and K. Grauman and N. Cesa-Bianchi and R. Garnett},
pages = {3588--3600},
year = {2018},
publisher = {Curran Associates, Inc.},
url = {http://papers.nips.cc/paper/7617-neural-code-comprehension-a-learnable-representation-of-code-semantics.pdf}
}

Code

Requirements

For training inst2vec embeddings:

GNU / Linux or Mac OS
Python (3.6.5)
- tensorflow (1.7.0) or preferably: tensorflow-gpu (1.7.0)
- networkx (2.1)
- scipy (1.1.0)
- absl-py (0.2.2)
- jinja2 (2.10)
- bokeh (0.12.16)
- umap (0.1.1)
- sklearn (0.0)
- wget (3.2)

Additionally, for training ncc models:

GNU / Linux or Mac OS
Python (3.6.5)
- labm8 (0.1.2)
- keras (2.2.0)

Running the code

1. Training `inst2vec` embeddings

By default, inst2vec will be trained on publicly available code. Some additional datasets are available on demand and you may add them manually to the training data. For more information on how to do this as well as on the datasets in general, see datasets.

$ python train_inst2vec.py --helpfull # to see the full list of options
$ python train_inst2vec.py \
>  # --context_width ... (default: 2)
>  # --data ... (default: data/, automatically generated one. You may provide your own)

Alternatively, you may skip this step and use pre-trained embeddings.

2. Evaluating `inst2vec` embeddings

$ python train_inst2vec.py \
> --embeddings_file ... (path to the embeddings p-file to evaluate)
> --vocabulary_folder ... (path to the associated vocabulary folder)

3. Training on tasks with `ncc`

We provide the code for training three downstream tasks using the same neural architecture (ncc) and inst2vec embeddings.

Algorithm classification

Task: Classify applications into 104 classes given their raw code.
Code and classes provided by https://sites.google.com/site/treebasedcnn/ (see Convolutional neural networks over tree structures for programming language processing)

Train:

$ python train_task_classifyapp.py --helpfull # to see the full list of options
$ python train_task_classifyapp.py

Alternatively, display results from a pre-trained model.

Optimal device mapping prediction

Task: Predict the best-performing compute device (e.g., CPU, GPU)
Code and classes provided by https://github.com/ChrisCummins/paper-end2end-dl (see End-to-end Deep Learning of Optimization Heuristics)

Train:

$ python train_task_devmap.py --helpfull # to see the full list of options
$ python train_task_devmap.py

Alternatively, display results from a pre-trained model.

Optimal thread coarsening factor prediction

Code and classes provided by https://github.com/ChrisCummins/paper-end2end-dl (see End-to-end Deep Learning of Optimization Heuristics)

Train:

$ python train_task_threadcoarsening.py --helpfull # to see the full list of options
$ python train_task_threadcoarsening.py

Alternatively, display results from a pre-trained model.

Contact

We would be thrilled if you used and built upon this work. Contributions, comments, and issues are welcome!

License

NCC is published under the New BSD license, see LICENSE.

ncc's Issues

Asm inline call handling

I have a question about the way you handle assembly calls. When preprocessing .ll files, you discard asm calls that return void in the keep() function:

if re.search('call void asm', line):
        return False

However, you don't filter out inline asm calls that return something else during preprocessing (maybe this is very specific to your case), and you seem to handle those other cases explicitly while parsing the preprocessed code:

           # function call
           elif re.match(r'(' + rgx.local_id + r' = )?(tail )?(call|invoke) ', line):

               # Get function name
               if ' asm ' in line:
                   if line == '%13 = tail call { %struct.rw_semaphore*, i64 } asm sideeffect "':
                       line = '%13 = tail call { %struct.rw_semaphore*, i64 } asm sideeffect "# beginning down_read\0A\09.pushsection .smp_locks,\22a\22\0A.balign 4\0A.long 671f - .\0A.popsection\0A671:\0A\09lock;  incq ($3)\0A\09  jns        1f\0A  call call_rwsem_down_read_failed\0A1:\0A\09# ending down_read\0A\09", "=*m,={ax},={rsp},{ax},*m,2,~{memory},~{cc},~{dirflag},~{fpsr},~{flags}"(%struct.atomic64_t* %11, %struct.rw_semaphore* %10, %struct.atomic64_t* %11, i64 %12) #4, !srcloc !9'
                   if line == '%16 = tail call i64 asm sideeffect "':
                       line = '%16 = tail call i64 asm sideeffect "# beginning __up_read\0A\09.pushsection .smp_locks,\22a\22\0A.balign 4\0A.long 671f - .\0A.popsection\0A671:\0A\09lock;   xadd      $1,($2)\0A\09  jns        1f\0A\09  call call_rwsem_wake\0A1:\0A# ending __up_read\0A", "=*m,={dx},{ax},1,*m,~{memory},~{cc},~{dirflag},~{fpsr},~{flags}"(%struct.atomic64_t* %11, %struct.rw_semaphore* %10, i64 -1, %struct.atomic64_t* %11) #4, !srcloc !11'
                   func_name_ = re.search(r' asm (?:sideeffect )?(\".*\")\(', line)

My question is: what is the difference between those two cases? Does it really matter, or could inline asm calls be ignored whatever type they return?
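For illustration only, a filter that treats inline asm calls uniformly, whatever their return type, could look like the sketch below. The names `is_inline_asm_call` and `keep_line` are hypothetical, and whether dropping such calls is appropriate for inst2vec is exactly the open question above; this is not the repository's `keep()` function.

```python
import re

# Matches a call/invoke instruction whose callee is an inline asm expression,
# independent of the return type that sits between "call" and "asm".
ASM_CALL = re.compile(r'\b(?:call|invoke)\b.*?\basm\b')


def is_inline_asm_call(line):
    """Return True if the LLVM IR line looks like an inline asm call."""
    return ASM_CALL.search(line) is not None


def keep_line(line):
    """Hypothetical filter: drop every inline asm call, keep everything else."""
    return not is_inline_asm_call(line)


if __name__ == "__main__":
    print(keep_line('call void asm sideeffect "", "~{memory}"()'))        # False
    print(keep_line('%16 = tail call i64 asm sideeffect "nop", "=r"()'))  # False
    print(keep_line('%3 = call i32 @asm_helper(i32 %1)'))                 # True
```

The word-boundary checks keep ordinary functions whose names merely contain "asm" (such as the made-up `@asm_helper`) from being filtered; a string argument that happens to contain the word `asm` would still produce a false positive.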

How to get the embedding result of inst2vec ?

I am not familiar with TensorFlow and want to use the trained model to embed some new LLVM code, but the instructions only cover 'training' and 'evaluation'. Could you give me some advice? I would really appreciate it!
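A minimal sketch of the lookup step only, under stated assumptions: the embeddings p-file is assumed to unpickle to a matrix with one row per vocabulary entry, and the vocabulary dictionary is assumed to unpickle to a dict that maps preprocessed statement strings to row indices. Neither layout is confirmed here. The helper names `load_pickle` and `embed_statements` and the `unknown_index` fallback are hypothetical. New LLVM IR would first have to go through the same preprocessing as the training data, which this sketch does not cover.

```python
import pickle


def load_pickle(path):
    """Load any pickled object (e.g. an embeddings p-file or a vocabulary dict)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def embed_statements(statements, dictionary, embedding_matrix, unknown_index=None):
    """Map preprocessed statements to embedding rows.

    dictionary: assumed to map statement string -> row index.
    embedding_matrix: anything indexable by row (list of lists or a NumPy array).
    unknown_index: optional row to use for statements missing from the vocabulary.
    """
    vectors = []
    for stmt in statements:
        idx = dictionary.get(stmt, unknown_index)
        if idx is None:
            raise KeyError("Statement not in vocabulary: " + stmt)
        vectors.append(embedding_matrix[idx])
    return vectors
```

No TensorFlow session is needed for this step, since a trained embedding is just a matrix indexed by vocabulary position.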

train_inst2vec.py fails on specific file during vocabulary building

Workaround: delete ncc/data/shoc/sho/ProgressBar.ll from the dataset before preprocessing.

The file is ncc/data/shoc/sho/ProgressBar.ll and the problem seems to be that this file has only a single LLVM IR instruction that goes into the XFG and later we end up with an empty graph and a call to
build_H_dictionary(D, skip_window, folder, filename, dictionary, stmts_cut_off)
that fails at nx.adjacency_matrix(D).

The xfg has 2 nodes: a root_node and the one mentioned above, so IMHO the dual-xfg should still contain a single node, which it doesn't and that is causing the problem.

'MultiDiGraph' object has no attribute 'node'

Hi, I ran into a problem when using the code.

Traceback (most recent call last):
  File "/root/userfolder/code/ncc/inst2vec/inst2vec_preprocess.py", line 2828, in construct_xfg_single_raw_folder
    G, multi_edges = build_graph(preprocessed_file, functions_declared_in_files[i], file_names[i])
  File "/root/userfolder/code/ncc/inst2vec/inst2vec_preprocess.py", line 2198, in build_graph
    G = add_stmts_to_graph(G, file, functions_defined_in_file, functions_declared_in_file)
  File "/root/userfolder/code/ncc/inst2vec/inst2vec_preprocess.py", line 1503, in add_stmts_to_graph
    if basic_block_leaf(G, n, ids_in_basic_block):
  File "/root/userfolder/code/ncc/inst2vec/inst2vec_preprocess.py", line 1098, in basic_block_leaf
    if G.node[n]['id'] != 'ad_hoc':
AttributeError: 'MultiDiGraph' object has no attribute 'node'

ncc/inst2vec/inst2vec_preprocess.py

Line 1097 in 6bba4f3

if G.node[n]['id'] != 'ad_hoc':

After reading the code, I think this may be a typo: node -> nodes.

My PR:

#26
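The traceback is consistent with the `G.node` attribute having been removed in newer networkx releases, where `G.nodes` is the supported accessor. A small compatibility helper, sketched here under the hypothetical name `node_attributes`, would let the same code run on both old and new versions:

```python
def node_attributes(G, n):
    """Return the attribute dict of node n on old and new networkx graphs.

    Older releases expose the per-node dict as G.node; newer ones only
    provide G.nodes. Prefer G.node when it exists, otherwise use G.nodes.
    """
    view = getattr(G, "node", None)
    if view is None:
        view = G.nodes
    return view[n]
```

With such a helper, the failing check would read `if node_attributes(G, n)['id'] != 'ad_hoc':`. Replacing `G.node` with `G.nodes` directly, as proposed in the PR, is the simpler fix when only recent networkx versions need to be supported.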

The link to the dataset is not working

Dataset download failed

I have been getting connection timed out errors each time I downloaded the dataset via the link for the last two days. I tried downloading the dataset with different devices and IPs, but all failed. Is it due to my connection problem or the server issue?

the links to all the datasets did not work.

The links in ncc/data/ and task/readme.md for classifyapp no longer work. The page says the file was not found. Is it possible to update these dataset links? Thanks

Expected combineable dataset

I downloaded the tensorflow dataset from https://polybox.ethz.ch/index.php/s/ojd0RPFOtUTPPRr and put tensorflow/ir_0 in data/.
When I run python train_inst2vec.py, I get an error:
"Expected combineable dataset"

I know the exception is raised at https://github.com/spcl/ncc/blob/master/inst2vec/inst2vec_embedding.py#L56. However, I don't understand why a single file is not allowed. I would just like to run the code quickly and get some results, so I put a single file in data/.
Please tell me why it is wrong to put a single file in data/.

error info

python-BaseException
Traceback (most recent call last):
File "3\envs\ncc\lib\site-packages\absl\app.py", line 274, in run
_run_main(main, args)
File "\Anaconda3\envs\ncc\lib\site-packages\absl\app.py", line 238, in _run_main
sys.exit(main(argv))
File "/ncc/train_inst2vec.py", line 60, in main
embedding_matrix, embeddings_file = i2v_emb.train_embeddings(data_folder, data_folders)
File "\ncc\inst2vec\inst2vec_embedding.py", line 453, in train_embeddings
data_pair_files = get_data_pair_files(data_folders, context_width)
File "\ncc\inst2vec\inst2vec_embedding.py", line 56, in get_data_pair_files
assert len(folders) > 1, "Expected combineable dataset"
AssertionError: Expected combineable dataset

Process finished with exit code 1

dictionary_pickle not available

I was trying to work with the pre-trained embeddings but was unable to do that since the dictionary_pickle file is not provided in the repository. I don't see a way to generate the dictionary without having to train the embeddings as well.

Train task classifyapp on same data as for training the embedding

Hello everyone,
we want to train a Keras model with the train_task_classifyapp.py script to make a simple binary classification:

class: Applications which perform a stencil operation
class: Applications which do not perform a stencil operation

For this purpose we created a dataset based on your synthetic datasets

Eigen-synthetic https://polybox.ethz.ch/index.php/s/52wWiK5fjRGHLJR
GEMM-synthetic https://polybox.ethz.ch/index.php/s/Bm6cwAY3eVkR6v3
Stencil-synthetic https://polybox.ethz.ch/index.php/s/OOmylxGcBxQM1D3

The dataset has the following directory structure so the python script can handle it:

.
├── ncc
│   ├── train
│   │   ├── classifyapp
│   │   │   ├── ir_train
│   │   │   │   ├── 1
│   │   │   │   ├── 2
│   │   │   ├── ir_val
│   │   │   │   ├── 1
│   │   │   │   ├── 2
│   │   │   ├── ir_test
│   │   │   │   ├── 1
│   │   │   │   ├── 2

Folder 2 is a mixture of applications from the Eigen- and GEMM-synthetic dataset, folder 1 has only applications from the Stencil-synthetic dataset.

My questions are the following:

Since the Eigen-, GEMM- and Stencil-synthetic datasets were used for training the inst2vec embedding, will this affect the training of the classifyapp Keras model (positively or negatively)?
What was your setup for training, and how long did it take? In our current setup, each folder for class 1 and 2 has 80 randomly picked applications, the batch size is 4, the number of epochs is 20, and the number of training samples per class is 20. We are running the training on an Nvidia 1080 Ti. Only with this setup could we train the network in an affordable time (45 minutes per epoch). We are aware that this can yield poor accuracy.
In another setup, we had 2000 sample applications for each class in each set (train, val and test). The batch size was 4, the number of training samples per class was 30, and we trained for 20 epochs. With this setup, the training time for each epoch went up to 55 hours (Keras ETA)! Using larger batch sizes leads to a CUDA error because it cannot allocate enough memory.
Do you have any experience with these parameters for training? What could be the reason for such a long training time? In your script, you use a batch size of 64 and 1500 training samples per class. Did training on 104 classes also take this long?

question about predict value p at train_task_classifyapp.py line 417

p = model.predict_gen(generator=gen_test)[0]
In line 417, p is identified as the first element of the model.predict_gen () return value.

In my understanding, model.predict_gen () should return the list P_1 of prediction results, and the program does the same.
def predict_gen(self, generator: EmbeddingSequence) -> np.array:
...
return [i + 1 for i in indices]
Then, can we use P_1 and y_test to calculate the accuracy?
Why use p instead of P_1 in this procedure?
accuracy = p == y_test
return accuracy
classifyapp_accuracy = evaluate(NCC_classifyapp(), embeddings, folder_data, train_samples, folder_results, dense_layer_size, print_summary, num_epochs, batch_size)
print('\nTest accuracy:', sum(classifyapp_accuracy)*100/len(classifyapp_accuracy), '%')

ValueError: GraphDef cannot be larger than 2GB.

Hello, I ran train_task_classifyapp.py to train the model myself. My version differs only in the last dense layer, where I removed the sigmoid activation function. As far as I can tell, training failed on the last batch, but I can't understand why.
Log attached:

Tensor("l2_normalize:0", shape=(8565, 200), dtype=float32)

--- Initializing model...
built Keras model

--- Training model...
Epoch 1/50
2436/2437 [============================>.] - ETA: 8s - loss: 8.2014 - acc: 0.0107 Traceback (most recent call last):
File "/home/vtelepov/tmp/env/lib/python3.6/site-packages/keras/utils/data_utils.py", line 578, in get
inputs = self.queue.get(block=True).get()
File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/home/vtelepov/tmp/env/lib/python3.6/site-packages/keras/utils/data_utils.py", line 401, in get_index
return _SHARED_SEQUENCES[uid][i]
File "train_task_classifyapp.py", line 152, in getitem
emb_x = tf.nn.embedding_lookup(self.emb, x).eval(session=self.sess)
File "/home/vtelepov/tmp/env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 656, in eval
return _eval_using_default_session(self, feed_dict, self.graph, session)
File "/home/vtelepov/tmp/env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 5016, in _eval_using_default_session
return session.run(tensors, feed_dict)
File "/home/vtelepov/tmp/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/home/vtelepov/tmp/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1140, in _run
feed_dict_tensor, options, run_metadata)
File "/home/vtelepov/tmp/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
run_metadata)
File "/home/vtelepov/tmp/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
return fn(*args)
File "/home/vtelepov/tmp/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1310, in _run_fn
self._extend_graph()
File "/home/vtelepov/tmp/env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1353, in _extend_graph
from_version=self._current_version, add_shapes=self._add_shapes)
File "/home/vtelepov/tmp/env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3094, in _as_graph_def
raise ValueError("GraphDef cannot be larger than 2GB.")
ValueError: GraphDef cannot be larger than 2GB.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "train_task_classifyapp.py", line 572, in
app.run(switch)
File "/home/vtelepov/tmp/env/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/vtelepov/tmp/env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "train_task_classifyapp.py", line 569, in switch
main(argv)
File "train_task_classifyapp.py", line 558, in main
dense_layer_size, print_summary, num_epochs, batch_size)
File "train_task_classifyapp.py", line 429, in evaluate
epochs=num_epochs)
File "train_task_classifyapp.py", line 243, in train_gen
shuffle=True, callbacks=[checkpoint])
File "/home/vtelepov/tmp/env/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/vtelepov/tmp/env/lib/python3.6/site-packages/keras/engine/training.py", line 1426, in fit_generator
initial_epoch=initial_epoch)
File "/home/vtelepov/tmp/env/lib/python3.6/site-packages/keras/engine/training_generator.py", line 211, in fit_generator
max_queue_size=max_queue_size)
File "/home/vtelepov/tmp/env/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/vtelepov/tmp/env/lib/python3.6/site-packages/keras/engine/training.py", line 1480, in evaluate_generator
verbose=verbose)
File "/home/vtelepov/tmp/env/lib/python3.6/site-packages/keras/engine/training_generator.py", line 309, in evaluate_generator
generator_output = next(output_generator)
File "/home/vtelepov/tmp/env/lib/python3.6/site-packages/keras/utils/data_utils.py", line 584, in get
six.raise_from(StopIteration(e), e)
File "", line 3, in raise_from
StopIteration: GraphDef cannot be larger than 2GB.

The difference between XFG and PDG?

The corresponding paper claims that a novel representation of IR, called the conteXtual Flow Graph (XFG), is proposed. However, this representation seems similar to the traditional Program Dependence Graph (PDG).

train_task_classify is searching for .rec in wrong folder

I get an error when running train_task_classifyapp.py:

ncc/train_task_classifyapp.py

Lines 285 to 291 in 4e9eaeb

    seq_files = [os.path.join(folder, f) for f in listing if f[-4:] == '.rec']
    # training: Randomly pick programs
    assert len(seq_files) >= samples_per_class, "Cannot sample " + str(samples_per_class) + " from " + str(
        len(seq_files)) + " files found in " + folder
    X_train += resample(seq_files, replace=False, n_samples=samples_per_class, random_state=seed)
    y_train = np.concatenate([y_train, np.array([int(i)] * samples_per_class, dtype=np.int32)])

This assertion is failing (seq_files list is empty).
This is because the .rec files are generated in separate folders prefixed by seq.

ncc/task_utils.py

Lines 259 to 269 in 4e9eaeb

    def llvm_ir_to_trainable(folder_ir):
        ####################################################################################################################
        # Setup
        assert len(folder_ir) > 0, "Please specify a folder containing the raw LLVM IR"
        assert os.path.exists(folder_ir), "Folder not found: " + folder_ir
        folder_seq = re.sub('ir', 'seq', folder_ir)
        if len(folder_seq) > 0:
            print('Preparing to write LLVM IR index sequences to', folder_seq)
            if not os.path.exists(folder_seq):
                os.makedirs(folder_seq)

But train_task_classifyapp.py searches for the .rec files in the original IR folder, ir_train.
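The folder mapping quoted above can be reproduced in isolation. The sketch below shows what `re.sub('ir', 'seq', folder_ir)` yields for the default task folder, and also that it rewrites every occurrence of "ir" in the path, not just the folder prefix. The second function is a hypothetical alternative, not code from the repository.

```python
import re


def ir_to_seq_folder(folder_ir):
    """Same substitution as in llvm_ir_to_trainable: every 'ir' becomes 'seq'."""
    return re.sub('ir', 'seq', folder_ir)


def ir_to_seq_folder_by_component(folder_ir):
    """Hypothetical variant: only rename path components that are 'ir' or start with 'ir_'."""
    parts = folder_ir.split('/')
    return '/'.join(re.sub(r'^ir(?=_|$)', 'seq', part) for part in parts)


if __name__ == "__main__":
    print(ir_to_seq_folder('task/classifyapp/ir_train'))       # task/classifyapp/seq_train
    print(ir_to_seq_folder('my_dir/ir_train'))                 # my_dseq/seq_train
    print(ir_to_seq_folder_by_component('my_dir/ir_train/1'))  # my_dir/seq_train/1
```

So with the default layout the .rec files land in `task/classifyapp/seq_train`, while a dataset path that contains "ir" elsewhere (for example a parent folder ending in "dir") is rewritten in more places than intended.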

[inst2vec_evaluate.py] IndexError : list index out of range in analogies

Hi, I'm working on a project that compiles Go code into LLVM IR using gollvm (https://go.googlesource.com/gollvm).
After getting the IR from the Go code, I created a test folder containing the Go IR files to run train_inst2vec.py.
But there was an error because the analogies list does not contain any values.

Could you explain what the analogies do?
If possible, I would also like to know how to solve the error attached to this issue.

Thanks.

Bug report: regular expression matching when preprocessing LLVM IR

construct_struct_types_dictionary_for_file in inst2vec_preprocess.py cannot handle the following value in to_process: { i64, { i32, { x86_fp80 } }*, [7 x { i32, { x86_fp80 } }] }.
I tried to fix it, but the matching seems too complex. Could you help me?
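Arbitrarily nested braces are hard to capture with a single regular expression, which may be why the pattern is difficult to fix. As a sketch of one alternative (not the repository's approach), the fields of a struct type can be split with a simple depth counter. The function name `split_struct_fields` is hypothetical.

```python
def split_struct_fields(struct_type):
    """Split '{ a, { b, c }*, [7 x { d }] }' into its top-level fields."""
    s = struct_type.strip()
    if not (s.startswith('{') and s.endswith('}')):
        raise ValueError("Expected a literal struct type: " + struct_type)
    body = s[1:-1]

    fields, current, depth = [], [], 0
    for ch in body:
        if ch in '{[(<':
            depth += 1
        elif ch in '}])>':
            depth -= 1
        if ch == ',' and depth == 0:
            # A comma outside any nested bracket separates two fields
            fields.append(''.join(current).strip())
            current = []
        else:
            current.append(ch)

    tail = ''.join(current).strip()
    if tail:
        fields.append(tail)
    return fields


if __name__ == "__main__":
    t = '{ i64, { i32, { x86_fp80 } }*, [7 x { i32, { x86_fp80 } }] }'
    print(split_struct_fields(t))
    # ['i64', '{ i32, { x86_fp80 } }*', '[7 x { i32, { x86_fp80 } }]']
```

Nested fields can be processed by calling the function again on each field that is itself a struct. Packed structs written as `<{ ... }>` are rejected by this sketch and would need their outer angle brackets stripped first.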

how to get llvm ir from tensorflow dataset?

Hello, thanks for your impressive paper and remarkable code! To understand your algorithm better, I'm starting by compiling the TensorFlow source code, but I'm stuck on compiling the C++ code to LLVM IR. Could you tell me how to compile the TensorFlow source code to LLVM IR?

For now, I've downloaded tensorflow source code and put it in path: "/data/sjd/project/sometest/llvmtest/tensorflow-c-source/tensorflow/"

Then, I entered the path:"/data/sjd/project/sometest/llvmtest/tensorflow-c-source/tensorflow/tensorflow/cc/gradients"
I invoke command clang -I/data/sjd/project/sometest/llvmtest/tensorflow-c-source/tensorflow -I/data/sjd/project/sometest/llvmtest/tensorflow-c-source/tensorflow/third_party/eigen3 -S -emit-llvm image_grad.cc to get llvm ir, but I got error below:

In file included from /data/sjd/project/sometest/llvmtest/tensorflow-c-source/tensorflow/third_party/eigen3/unsupported/Eigen/CXX11/Tensor:1:
In file included from /data/sjd/project/sometest/llvmtest/tensorflow-c-source/tensorflow/third_party/eigen3/unsupported/Eigen/CXX11/Tensor:1:
/data/sjd/project/sometest/llvmtest/tensorflow-c-source/tensorflow/third_party/eigen3/unsupported/Eigen/CXX11/Tensor:1:10: error: #include nested too deeply
#include "unsupported/Eigen/CXX11/Tensor"

In file included from image_grad.cc:17:
In file included from /data/sjd/project/sometest/llvmtest/tensorflow-c-source/tensorflow/tensorflow/cc/framework/grad_op_registry.h:21:
In file included from /data/sjd/project/sometest/llvmtest/tensorflow-c-source/tensorflow/tensorflow/cc/framework/ops.h:21:
In file included from /data/sjd/project/sometest/llvmtest/tensorflow-c-source/tensorflow/tensorflow/core/framework/tensor.h:22:
In file included from /data/sjd/project/sometest/llvmtest/tensorflow-c-source/tensorflow/tensorflow/core/framework/allocator.h:23:
In file included from /data/sjd/project/sometest/llvmtest/tensorflow-c-source/tensorflow/tensorflow/core/framework/numeric_types.h:19:
In file included from /usr/lib/gcc/x86_64-linux-gnu/5.5.0/../../../../include/c++/5.5.0/complex:44:
/usr/lib/gcc/x86_64-linux-gnu/5.5.0/../../../../include/c++/5.5.0/cmath:1096:11: error: no member named 'acoshf' in the global namespace; did you mean 'acosh'?
using ::acoshf;
~~^
/usr/include/x86_64-linux-gnu/bits/mathcalls.h:88:13: note: 'acosh' declared here
__MATHCALL (acosh,, (Mdouble __x));
^

[train_task_classifyapp.py] GPU memory grows unlimitedly

When I try to train the model on another dataset, I find that the programme will take up more and more GPU memory and finally trigger Out of Memory Error. Then I add

gp = tf.get_default_graph()
gp.finalize()

before model.train_gen is called (line 402) to test whether new tensorflow ops are added to the graph while training. The result is

RuntimeError: Graph is finalized and cannot be modified.

which is located at self.model.fit_generator (line 237).
I don't know whether it is a bug or not. Since all interfaces are written in Keras, it is difficult for me to find out the exact problem in tensorflow backend.
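One plausible cause, judging from the `tf.nn.embedding_lookup(self.emb, x).eval(session=self.sess)` call visible in the "GraphDef cannot be larger than 2GB" traceback on this page: in TensorFlow 1.x graph mode, each such call adds new ops to the default graph, so building the lookup inside the batch generator makes the graph grow with every batch. A sketch of a lookup that adds nothing to the graph is shown below. It assumes the embedding matrix is available as a NumPy array, and the helper names are hypothetical.

```python
import numpy as np


def l2_normalize_rows(matrix, eps=1e-12):
    """Normalize each embedding row to unit length, once, outside the generator.

    Approximates tf.nn.l2_normalize along axis 1; the epsilon handling is
    slightly different and only matters for all-zero rows.
    """
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.maximum(norms, eps)


def lookup_embeddings(embedding_matrix, index_batch):
    """Gather embedding rows for a batch of index sequences with plain indexing."""
    return embedding_matrix[np.asarray(index_batch, dtype=np.int64)]


if __name__ == "__main__":
    emb = l2_normalize_rows(np.random.rand(5, 4).astype(np.float32))
    batch = [[0, 1, 2], [2, 2, 0]]              # 2 sequences of 3 statement indices
    print(lookup_embeddings(emb, batch).shape)  # (2, 3, 4)
```

An alternative that stays inside TensorFlow would be to build the lookup op once with a placeholder for the indices and feed each batch through it. Either way, calling `finalize()` on the graph before training, as done above, is a useful check that no ops are being added per batch.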

[train_task_classifyapp] doesn't have an opportunity to use pre-trained model weights

I noticed that the NCC_classifyapp class is missing a method:

def load_weights(self, file_with_weights):
        self.model.load_weights(file_with_weights)

So, to just load the weights and test the model in the evaluate method, you can do something like this:

model.load_weights('published_results/classifyapp/CLASSIFYAPP-94.83.h5')

# Test model
print('\n--- Testing model...')
p = model.predict_gen(generator=gen_test)[0]

llvm ir of linux kernel

Hello, first of all, thanks for your interesting paper and for releasing its code.
This issue is similar to #1. I'm currently working on ways to generate the LLVM IR files of the Linux kernel. I compiled the kernel using Clang, and then used a Python script I found on GitHub (https://github.com/ClangBuiltLinux/linux/blob/master/scripts/gen_compile_commands.py) that parses the .cmd files generated during compilation and produces a JSON file of the Clang commands, with the correct flags, that were run to compile the kernel. Here is an example of these commands:

/usr/bin/clang-9 -Wp,-MD,fs/.pnode.o.d -nostdinc -isystem /usr/lib/llvm-9/lib/clang/9.0.0/include -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h -D__KERNEL__ -Qunused-arguments -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -fshort-wchar -Werror-implicit-function-declaration -Wno-format-security -std=gnu89 -Wno-unused-variable -Wno-format-invalid-specifier -Wno-gnu -Wno-address-of-packed-member -Wno-tautological-compare -mno-global-merge -no-integrated-as -fno-PIE -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -mno-80387 -mstack-alignment=8 -mtune=generic -mno-red-zone -mcmodel=kernel -funit-at-a-time -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -DCONFIG_AS_CFI_SECTIONS=1 -DCONFIG_AS_FXSAVEQ=1 -DCONFIG_AS_SSSE3=1 -DCONFIG_AS_CRC32=1 -DCONFIG_AS_AVX=1 -DCONFIG_AS_AVX2=1 -DCONFIG_AS_AVX512=1 -DCONFIG_AS_SHA1_NI=1 -DCONFIG_AS_SHA256_NI=1 -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -fno-delete-null-pointer-checks -O2 --param=allow-store-data-races=0 -DCC_HAVE_ASM_GOTO -Wframe-larger-than=2048 -fno-stack-protector -fomit-frame-pointer -Wdeclaration-after-statement -Wno-pointer-sign -fno-strict-overflow -Werror=implicit-int -Werror=strict-prototypes -Werror=date-time -Werror=incompatible-pointer-types -Wno-initializer-overrides -Wno-unused-value -Wno-format -Wno-sign-compare -Wno-format-zero-length -Wno-uninitialized -DKBUILD_BASENAME='\"pnode\"' -DKBUILD_MODNAME='\"pnode\"' -c -o fs/pnode.o fs/pnode.c

To build the llvm-ir files (.ll), I replaced the end of the command
-c -o fs/pnode.o fs/pnode.c
with
-S -emit-llvm fs/pnode.c -o llvm-ir/fs_pnode.ll
This way I managed to create 2334 LLVM IR files from version 4.15.1 of the Linux kernel.
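The command rewrite described above can be automated. The following sketch assumes each compile command ends with the source file and contains a single `-c` and a single `-o <object>` pair, as in the example. The function name `to_emit_llvm` and the output naming scheme (path separators replaced by underscores) are illustrative choices that mirror the `llvm-ir/fs_pnode.ll` example.

```python
import posixpath
import shlex


def to_emit_llvm(command, out_dir="llvm-ir"):
    """Rewrite 'clang ... -c -o obj src' into an argument list that emits LLVM IR."""
    args = shlex.split(command)
    source = args[-1]  # assumption: the source file is the last argument

    kept = []
    skip_next = False
    for arg in args[:-1]:
        if skip_next:        # drop the object file that followed -o
            skip_next = False
        elif arg == "-o":
            skip_next = True
        elif arg != "-c":    # drop -c, keep every other flag untouched
            kept.append(arg)

    stem = posixpath.splitext(posixpath.normpath(source))[0].replace("/", "_")
    output = posixpath.join(out_dir, stem + ".ll")
    return kept + ["-S", "-emit-llvm", source, "-o", output]


if __name__ == "__main__":
    print(to_emit_llvm("clang -O2 -c -o fs/pnode.o fs/pnode.c"))
    # ['clang', '-O2', '-S', '-emit-llvm', 'fs/pnode.c', '-o', 'llvm-ir/fs_pnode.ll']
```

The returned list can be passed to `subprocess.run(..., cwd=kernel_root)` for every entry of the generated JSON file, which avoids re-quoting flags such as the `-DKBUILD_BASENAME` definitions. The output directory has to exist beforehand.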

My questions are:

How did you create the LLVM IR files for the Linux kernel project and, more generally, for each of the projects you present in the paper? Is there a general method, or is it project-specific?
Was I able to generate a larger number of LLVM IR files for the Linux kernel because I use a more recent version of Clang (9) than you did?

[train_inst2vec.py] need to be more than one folder with raw data in data/

To be honest, it is not clear to me why train_inst2vec.py contains the following condition (lines 45-47):

if FLAGS.data == "data" and len(os.listdir(data_folder)) <= 1:
            # Generate the data set
            print('Folder', data_folder, 'is empty - preparing to download training data')

For example, I downloaded the BLAS dataset, so my data folder looks like data/blas/*.ll.
After running python train_inst2vec.py, I got something like this:

Folder data is empty - preparing to download training data
Downloading AMD data set...

Why does there need to be more than one folder with raw data?
On what basis should the data be split across several folders?
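The quoted condition can be checked before launching a long run. The sketch below mirrors only the `len(os.listdir(data_folder)) <= 1` part of the check (the `FLAGS.data == "data"` part is left out), and the function names are hypothetical.

```python
import os


def dataset_folders(data_folder):
    """List the sub-folders of data_folder, i.e. the candidate datasets."""
    return sorted(
        entry for entry in os.listdir(data_folder)
        if os.path.isdir(os.path.join(data_folder, entry))
    )


def would_trigger_download(data_folder):
    """Mirror the quoted check: at most one entry counts as 'empty'."""
    return len(os.listdir(data_folder)) <= 1


if __name__ == "__main__":
    folder = "data"  # default data folder name used in the condition above
    if os.path.isdir(folder):
        print("datasets found:", dataset_folders(folder))
        print("default run would start downloading:", would_trigger_download(folder))
```

With only `data/blas` present the check reports that the download would start, which matches the output shown in this issue. Adding a second dataset folder changes the result.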

test

loss and acc differ greatly even though the training and validation sets are identical

Dear author,
I downloaded the code to train on the original data, but I found that accuracy and loss differ greatly between training and validation. I then used the training dataset as the validation set, so that the same data is used for both, but I got the same result, as shown below:
8/8 [=========] - 52s 7s/step - loss: 1.2012 - acc: 0.5000 - val_loss: 6.4256 - val_acc: 0.3926
Epoch 2/50
8/8 [=========] - 46s 6s/step - loss: 0.7563 - acc: 0.7617 - val_loss: 1.4548 - val_acc: 0.5596
Epoch 3/50
8/8 [=========] - 45s 6s/step - loss: 0.5647 - acc: 0.7969 - val_loss: 3.5613 - val_acc: 0.5557
Epoch 4/50
8/8 [=========] - 47s 6s/step - loss: 0.4402 - acc: 0.8496 - val_loss: 4.9303 - val_acc: 0.2559
Epoch 5/50
8/8 [=========] - 46s 6s/step - loss: 0.3777 - acc: 0.8672 - val_loss: 1.0182 - val_acc: 0.6807
Epoch 6/50
8/8 [=========] - 45s 6s/step - loss: 0.3009 - acc: 0.8945 - val_loss: 3.2592 - val_acc: 0.3340
Epoch 7/50
8/8 [=========] - 46s 6s/step - loss: 0.2769 - acc: 0.9053 - val_loss: 2.2627 - val_acc: 0.4609
Epoch 8/50
8/8 [=========] - 47s 6s/step - loss: 0.2585 - acc: 0.9150 - val_loss: 1.1746 - val_acc: 0.6348
Epoch 9/50
8/8 [=========] - 47s 6s/step - loss: 0.2096 - acc: 0.9316 - val_loss: 3.2337 - val_acc: 0.5039
Epoch 10/50
8/8 [=========] - 47s 6s/step - loss: 0.2602 - acc: 0.9131 - val_loss: 2.9752 - val_acc: 0.3994

published_results/vocabulary/dic_pickle does not exist

I ran python train_task_classifyapp.py with default parameters. The required file published_results/emb.p was there, but there was no published_results/vocabulary folder.

That is why I got this error:

FileNotFoundError: [Errno 2] No such file or directory: 'published_results/vocabulary/dic_pickle'

Confusion in inst2vec_preprocess.py when reading code

While reading the code in inst2vec_preprocess.py, I found that in line 865
assert check is not None, "Could not match argument list in:\n" + line + "\nFunction:\n" + func_name
may have to be changed to
assert check is None.
But I'm confused and not sure whether I should change it.


def get_num_args_func(line, func_name=None):
    """
    Get the number of arguments in a line containing a function
    :param line: LLVM IR line
    :param func_name: function name
    :return num_args: number of arguments
            arg_list: list of arguments
    """
    modif_line = re.sub(r'<[^<>]+>', '', line)  # commas in vectors/arrays should not be counted as argument-separators
    arg_list_ = find_outer_most_last_parenthesis(modif_line)  # get last parenthesis
    if arg_list_ is None:
        # Make sure that this is the case because the function has no arguments
        # and not because there was in error in regex matching
        check = re.match(rgx.func_call_pattern + r'\(\)', modif_line)
        assert check is not None, "Could not match argument list in:\n" + line + "\nFunction:\n" + func_name
        num_args = 0
        arg_list = ''
    elif arg_list_ == '()':
        # Make sure that this is the case because the function has no arguments
        # and not because there was in error in regex matching
        check = re.match(rgx.func_call_pattern + r'\(\)', modif_line)
        if check is None:
            check = re.search(r' asm (?:sideeffect )?(\".*\")\(\)', modif_line)
        if check is None:
            check = re.search(rgx.local_id + r'\(\)', modif_line)
        if check is None:
            okay = line[-2:] == '()'
            if not okay:
                check = None
            else:
                check = True
        assert check is not None, "Could not match argument list in:\n" + line + "\nFunction:\n" + func_name
        num_args = 0
        arg_list = ''
    else:
        arg_list = arg_list_[1:-1]
        arg_list = re.sub(r'<[^<>]+>', '', arg_list)
        arg_list_modif = re.sub(r'\([^\(\)]+\)', '', arg_list)
        arg_list_modif = re.sub(r'\([^\(\)]+\)', '', arg_list_modif)
        arg_list_modif = re.sub(r'\([^\(\)]+\)', '', arg_list_modif)
        arg_list_modif = re.sub(r'\([^\(\)]+\)', '', arg_list_modif)
        arg_list_modif = re.sub(r'\"[^\"]*\"', '', arg_list_modif)
        arg_list_modif = re.sub(r'{.*}', '', arg_list_modif)
        num_args = len(re.findall(',', arg_list_modif)) + 1

    return num_args, arg_list

The original source code of the datasets

Dear authors,

Thank you for kindly sharing this repo. The shared datasets are all LLVM IR files (https://polybox.ethz.ch/index.php/s/..). Could you please share the corresponding original source code of the datasets? If you could also provide the scripts used to convert the source code into LLVM IR, that would be highly appreciated.

Kind regards,

No way to convert .class files to .ll files

It seems that there is no good way to convert Java bytecode to LLVM bitcode.

All the projects dedicated to this topic that I have found are either outdated or abandoned.

Could you please suggest a way to do this conversion?

For classifyapp, the vocabulary dictionary is not present

For classifyapp, the vocabulary dictionary is not present.
Could you please upload it?
I am getting the following error:
FileNotFoundError: [Errno 2] No such file or directory: 'published_results/vocabulary/dic_pickle'

[train_task_classifyapp.py] Cannot sample 1500 from 0 files found in task/classifyapp/ir_train/1

I called python train_task_classifyapp.py and got this output:

Evaluating ClassifyappInst2Vec ...
Getting file names for 104 classes from folders:
task/classifyapp/ir_train
task/classifyapp/ir_val
task/classifyapp/ir_test
	training  : Read file names from folder  task/classifyapp/ir_train/1
 Traceback (most recent call last):
  File "train_task_classifyapp.py", line 478, in <module>
    app.run(main)
  File "/home/selp/.local/lib/python3.6/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/home/selp/.local/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "train_task_classifyapp.py", line 470, in main
    dense_layer_size, print_summary, num_epochs, batch_size)
  File "train_task_classifyapp.py", line 289, in evaluate
    len(seq_files)) + " files found in " + folder
AssertionError: Cannot sample 1500 from 0 files found in task/classifyapp/ir_train/1

Implementation of inst2vec-imm is not available for Device Mapping task

It seems that the current release doesn't provide an implementation of the inst2vec immediate version (inst2vec-imm). As this version performs much better than the version without immediates, do you plan to open-source the code for future comparison?