
astnn's People

Contributors

luizvbo, xu-zhiwei, zhangj111


astnn's Issues

EOFError: Ran out of input

When I try to run pipeline.py for source code classification, I get the following error:

line 182, in read_pickle
    return pickle.load(f)
EOFError: Ran out of input

Could you suggest how to solve this issue? Thank you very much.
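For anyone else hitting this: EOFError from pickle.load almost always means the .pkl file is empty or truncated, typically because an earlier pipeline step failed or was interrupted. A quick hedged check (plain Python, not the repo's code; the path is an assumption):

import os
import pickle

path = 'data/programs.pkl'  # assumption: the file read_pickle is loading

# A 0-byte file means the pickle was never written; regenerate it by
# re-running the earlier pipeline steps rather than patching the load.
print(os.path.getsize(path))

with open(path, 'rb') as f:
    data = pickle.load(f)  # raises EOFError on an empty/truncated file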

Hello 👋

Hello, could you send me a copy of the slides (PPT) for this paper? Thanks!

Using ASTNN on our own dataset

Hello,

I am trying to use ASTNN for our research, but we are running into problems training the model on our own dataset. We are feeding it an entire repository's worth of code, and we get errors because the data first has to be preprocessed to meet pycparser's requirements, as described on the pycparser page.

We cannot do that preprocessing by hand, file by file, for a whole repository.

Could you explain step by step how to run the astnn tool on our own dataset (especially on a repository, with its many header files)?

Looking forward to your reply.
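For reference, pycparser can invoke the C preprocessor itself and ships a set of fake libc headers, which automates the per-file cleanup its documentation describes. A minimal sketch, assuming gcc is on PATH and that -I points at your local copy of pycparser's fake headers:

from pycparser import parse_file

# use_cpp runs the preprocessor first, resolving #include and macros;
# the fake_libc_include stubs let standard headers parse without a
# real system include path.
ast = parse_file(
    'example.c',
    use_cpp=True,
    cpp_path='gcc',  # assumption: adjust to your compiler
    cpp_args=['-E', r'-Ipath/to/pycparser/utils/fake_libc_include'],
)
ast.show()

Looping this over every .c file in the repository removes the manual, one-by-one step.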

Exception during command execution

Dear colleagues,

When I run the command python pipeline.py --lang java, I get a KeyError exception in the dictionary_and_embedding(...) method:

[screenshots of the stack trace]

Are the dataset files correct?

Thanks in advance.
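As a general debugging aid (not the authors' fix): a KeyError in this area usually means a token is looked up in the trained word2vec vocabulary without a fallback. A hedged sketch of a defensive lookup, assuming the gensim < 4.0 vocab API and the max_token sentinel used elsewhere in pipeline.py:

def token_to_index(token, w2v, max_token):
    # Fall back to the out-of-vocabulary sentinel instead of raising KeyError.
    vocab = w2v.wv.vocab
    return vocab[token].index if token in vocab else max_token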

traverse_mul explanation needed

Hello @zhangj111, you did incredible work with this model. I ran it on a free GPU and the results were really satisfactory. Having read the paper, I would like to understand in depth how the model works. I am having a hard time figuring out how the encoder module works, especially the traverse_mul method.

Could you help me with an explanation of what this piece of code does exactly?

astnn/clone/model.py

Lines 39 to 54 in edd14c9

index, children_index = [], []
current_node, children = [], []
for i in range(size):
    # if node[i][0] is not -1:
    index.append(i)
    current_node.append(node[i][0])
    temp = node[i][1:]
    c_num = len(temp)
    for j in range(c_num):
        if temp[j][0] is not -1:
            if len(children_index) <= j:
                children_index.append([i])
                children.append([temp[j]])
            else:
                children_index[j].append(i)
                children[j].append(temp[j])

Thanks.
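For readers with the same question, my reading (an illustration, not an authoritative answer) is that this loop regroups a batch of trees by child position, so that all j-th children can be encoded together in one batched call. A self-contained toy version, where each node is encoded as [token_index, child_1, child_2, ...] in the nested-list form that tree_to_index produces:

# Toy batch of two statement trees in the nested-list encoding.
node = [
    [5, [7], [9, [3]]],  # batch item 0: token 5, two children
    [6, [8]],            # batch item 1: token 6, one child
]
index, children_index = [], []
current_node, children = [], []
for i, n in enumerate(node):
    index.append(i)
    current_node.append(n[0])
    for j, child in enumerate(n[1:]):
        if len(children_index) <= j:
            children_index.append([i])
            children.append([child])
        else:
            children_index[j].append(i)
            children[j].append(child)

print(current_node)    # [5, 6]
print(children_index)  # [[0, 1], [0]] -> which items have a j-th child
print(children)        # [[[7], [8]], [[9, [3]]]] -> those children's subtrees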

A revision to solve the 'nn.utils.rnn.pack_padded_sequence' issue in model.py

In model.py, I noticed that the 'pack_padded_sequence' method requires its input to be sorted by length in decreasing order ("The sequences should be sorted by length in a decreasing order"); otherwise the code raises RuntimeError: 'lengths' array has to be sorted in decreasing order.

I revised the forward function to solve this issue, as shown below:

def forward(self, x):
    lens = [len(item) for item in x]
    max_len = max(lens)

    # Flatten all statement trees in the batch and encode them in one pass.
    encodes = []
    for i in range(self.batch_size):
        for j in range(lens[i]):
            encodes.append(x[i][j])
    encodes = self.encoder(encodes, sum(lens))

    # Re-split per sample and zero-pad each sequence to max_len.
    seq, start, end = [], 0, 0
    for i in range(self.batch_size):
        end += lens[i]
        seq.append(encodes[start:end])
        if max_len - lens[i]:
            seq.append(self.get_zeros(max_len - lens[i]))
        start = end
    encodes = torch.cat(seq)
    encodes = encodes.view(self.batch_size, max_len, -1)

    # pack_padded_sequence needs lengths sorted in decreasing order:
    # sort, remember the permutation, and undo it after the GRU.
    lens = torch.LongTensor(lens)
    lens, perm_idx = lens.sort(0, descending=True)
    encodes = encodes[perm_idx]
    _, unperm_idx = perm_idx.sort(0, descending=False)

    encodes = nn.utils.rnn.pack_padded_sequence(encodes, lens, True)

    # gru
    gru_out, _ = self.bigru(encodes, self.hidden)
    gru_out, _ = nn.utils.rnn.pad_packed_sequence(gru_out, batch_first=True, padding_value=-1e9)

    # Restore the original batch order.
    gru_out = gru_out[unperm_idx]

    gru_out = torch.transpose(gru_out, 1, 2)
    # pooling
    gru_out = F.max_pool1d(gru_out, gru_out.size(2)).squeeze(2)

    # linear
    y = self.hidden2label(gru_out)
    return y

CUDA out of memory using a custom dataset

Hi, I am using a custom dataset whose ASTs are significantly larger, and I get the following message (this is on a V100 with 32 GB of memory!):

RuntimeError: CUDA out of memory. Tried to allocate 11.69 GiB (GPU 0; 31.75 GiB total capacity; 11.88 GiB already allocated; 7.13 GiB free; 11.71 GiB cached)

Is there anything I can do after already having set BATCH_SIZE = 1?
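Not an official answer, but a common workaround: memory grows with the largest sample in a batch, so capping the number of statement trees per sample before batching can help. A hedged sketch; the threshold and the DataFrame layout (columns as in pipeline.py) are assumptions:

MAX_STATEMENTS = 500  # assumption: tune to your GPU

def truncate_sample(block_seqs):
    # Keep only the first MAX_STATEMENTS statement trees of one sample.
    return block_seqs[:MAX_STATEMENTS]

data['code'] = data['code'].apply(truncate_sample)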

Shallow copy when finding best_model

I could be wrong, but it looks like you are not performing a deep copy when you copy model to best_model and vice versa (train.py: lines 58, 117 and 127), meaning they both point at the same memory location and both get updated during optimization. That would mean you never return to an earlier model state that might have had higher accuracy, only ever the latest. Is this what you intended?
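If that reading is right, the fix is a one-liner; a sketch using copy.deepcopy so best_model snapshots the weights instead of aliasing the live model:

import copy

# Without deepcopy, best_model and model share parameters, so
# best_model silently tracks every later optimization step.
best_model = copy.deepcopy(model)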

A bug in code clone detection, file astnn/clone/pipeline.py

Dear @zhangj111,
I tried to run the code clone detection task using python pipeline.py --lang c,
but it failed with: FileNotFoundError: [WinError 3] The system cannot find the path specified: 'datac/train/'
The correct path should be data/c/train/ instead of datac/train/, which is caused by line 86 of astnn/clone/pipeline.py:

data_path = self.root+self.language+'/'

To fix the bug, we could change this line to:

data_path = self.root+'/'+self.language+'/'
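A slightly more portable variant of the same fix (a suggestion, not the repo's code) lets the standard library build the separator:

import os

data_path = os.path.join(self.root, self.language) + '/'  # e.g. 'data/c/'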
Finally, thank you truly for your innovative paper and model.

Question about splitting the AST into statement subtrees

Hello, owing to my limited ability I have not been able to find the code that splits the AST into statement subtrees; could you point me to it? Also, the paper reports comparison experiments with AST-Full, AST-Block, and AST-Node; if it is convenient, could you share the code for that part as well? Looking forward to your answer, thank you!
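For other readers: the splitting appears to happen in utils.py (get_blocks / get_blocks_v1, called from trans2seq in pipeline.py). A hedged conceptual sketch of the idea only, with an assumed set of statement types, not the repo's exact code:

# Walk a pycparser AST and emit a subtree at every statement-type node;
# other nodes stay inside their parent's subtree.
STATEMENT_TYPES = {'FuncDef', 'If', 'For', 'While', 'DoWhile'}  # assumption

def split_statements(node, blocks):
    if type(node).__name__ in STATEMENT_TYPES:
        blocks.append(node)  # this node roots a statement subtree
    for _, child in node.children():  # pycparser's (name, node) pairs
        split_statements(child, blocks)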

Request to modify astnn for wider applications

ASTNN did not work when I used it on Java code that contains errors.

I request that you modify ASTNN so it also works on code with errors. Would you also add functionality that takes a code file and creates a mapping from each code statement to its corresponding vector? That would be a great help, and more people could use your library for their research work, instead of only the small group aiming at source code classification and code clone detection.

Code clone categories

Dear author,
I am trying to use my own dataset with your clone code, and I see from your bcb_pair_ids.pkl that the labels are not only 0 and 1 but also 2, 3, 4 and 5. Could you please explain why there are more than two categories? From my perspective, the label should stand only for "cloned" and "not cloned". Your help would be highly appreciated.
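For context, my understanding (not an official answer): BigCloneBench annotates pairs by clone type, so the values above 1 likely distinguish the Type-1 to Type-4 clone categories rather than different meanings of "cloned". For a binary setup they can be collapsed; a sketch, assuming a pandas DataFrame with a 'label' column:

import pandas as pd

pairs = pd.read_pickle('bcb_pair_ids.pkl')
# Collapse the clone-type labels (1..5) into a single "cloned" class.
pairs['label'] = (pairs['label'] > 0).astype(int)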

EMBEDDING_DIM is redundant, as the embedding dim must match the encoding dim

I have noticed that the EMBEDDING_DIM variable is passed through BatchProgramClassifier() and BatchTreeEncoder() but is not used when defining the dimensions of the matrix (in this case batch_current) that is passed as input to self.W_c (line 59 of model.py). Instead, the ENCODE_DIM value is used to define the shape of batch_current, and this happens to work because in your configuration both are 128. However, were you to change EMBEDDING_DIM, it would throw an error.

A potential solution would be to use a separate matrix in place of batch_current in the first part of traverse_mul() up to line 59, sized with self.embedding_dim (you would need to create this variable too) instead of self.encode_dim, and then initialise batch_current at line 59.
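A hedged sketch of that suggestion in isolation (names like batch_embedded are hypothetical; this is an illustration, not a patch):

import torch
import torch.nn as nn

size, embedding_dim, encode_dim = 4, 100, 128  # deliberately different dims
W_c = nn.Linear(embedding_dim, encode_dim)

# Accumulate node embeddings in a matrix sized by embedding_dim ...
batch_embedded = torch.zeros(size, embedding_dim)
# ... (the index_copy_ of the looked-up embeddings would go here) ...
# ... then project once through W_c to obtain the encode_dim matrix:
batch_current = W_c(batch_embedded)
print(batch_current.shape)  # torch.Size([4, 128])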

From a code snippet to its embedded vector

Hi,

As you explained, I have installed your model and trained it with the "ast.pkl" file.
Could you help me understand how to pass a code snippet to your trained model and get its embedded vector?
(The snippet is stored in a string variable.)

Thank you so much in advance for your help.
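Not an official answer, but piecing together steps that appear elsewhere in this thread (javalang parsing plus trans2seq), a hedged sketch for Java; trans2seq being in scope, the trained model, and the encode method are all assumptions:

import javalang

code = "public int f(int a){ return a + 1; }"

# 1. Parse the snippet string into an AST (same calls as in the repo).
tokens = javalang.tokenizer.tokenize(code)
tree = javalang.parser.Parser(tokens).parse_member_declaration()

# 2. Turn the AST into the nested index sequences the model expects.
seq = trans2seq(tree)  # assumption: trans2seq from pipeline.py is in scope

# 3. Run a batch of one through the network up to the pooled GRU output
#    (the tensor just before hidden2label), and use that as the vector.
vec = model.encode([seq])  # hypothetical method exposing the pooled vector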

Question about preprocessing the poj104 dataset into programs.pkl

Hi,

Thanks for your great paper and for open-sourcing this repo. The paper is really inspiring, and I've learned a lot about how to handle an RvNN-style structure in torch from your nice implementation.

I am currently playing with the 'poj104' dataset, the one you use for the code classification task. In your workflow it is loaded directly from a pickle file, and each code sample is parsed into an AST by parser.parse (lines 20-25 in pipeline.py):

from pycparser import c_parser
parser = c_parser.CParser()
source = pd.read_pickle(self.root+'programs.pkl')

source.columns = ['id', 'code', 'label']
source['code'] = source['code'].apply(parser.parse)
However, I find that pycparser CANNOT directly parse the code text read from poj104. Could you please tell me how you preprocess the code text into the programs.pkl file? (Yes, I'm a newbie to both DL and pycparser...)

Thanks in advance!
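A hedged guess at the kind of cleanup involved (not the authors' actual script): pycparser cannot handle preprocessor directives, so a typical first step is stripping #include and other # lines (or running cpp) before parser.parse:

import re
from pycparser import c_parser

def clean_source(code):
    # Drop preprocessor directives, which pycparser cannot parse.
    return re.sub(r'^\s*#.*$', '', code, flags=re.MULTILINE)

parser = c_parser.CParser()
ast = parser.parse(clean_source('#include <stdio.h>\nint main(){return 0;}'))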

Bug: get_blocks_v1 drops a nested if-statement

import javalang
from clone.utils import get_blocks_v1


func2 = """
public int test2(int a){
    if(a>3){
        try{
            if(a>10){
                a = 9;
            } 
        }catch(Exception e){
            a = 10;
        }
        
    }
    return a;
}
"""


def tree_to_index(node):
    token = node.token
    # result = [vocab[token].index if token in vocab else max_token]
    result = [token]
    children = node.children
    for child in children:
        result.append(tree_to_index(child))
    return result


def trans2seq(r):
    blocks = []
    get_blocks_v1(r, blocks)
    tree = []
    for b in blocks:
        btree = tree_to_index(b)
        tree.append(btree)
    return tree


tokens = javalang.tokenizer.tokenize(func2)
parser = javalang.parser.Parser(tokens)
tree = parser.parse_member_declaration()

seq = trans2seq(tree)
print(seq)

the output is [['MethodDeclaration', ['Modifier', ['public']], ['BasicType', ['int']], ['test2'], ['FormalParameter', ['BasicType', ['int']], ['a']]], ['IfStatement', ['BinaryOperation', ['>'], ['MemberReference', ['a']], ['Literal', ['3']]]], ['BlockStatement'], ['TryStatement', ['CatchClause', ['CatchClauseParameter', ['Exception'], ['e']], ['StatementExpression', ['Assignment', ['MemberReference', ['a']], ['Literal', ['10']], ['=']]]]], ['End'], ['ReturnStatement', ['MemberReference', ['a']]]]

The statement if(a>10){ a = 9; } has disappeared from the output.
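To make the regression explicit, a quick check appended to the script above fails on the current output (the literal 9 from the inner branch never appears):

# The inner if(a>10){ a = 9; } should contribute its own block;
# this assertion currently fails, demonstrating the dropped statement.
assert "'9'" in str(seq), 'inner if-statement was dropped by get_blocks_v1'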

Question about the torch version

I previously had pytorch 0.3.1 installed, and after cloning the code it ran fine. Do I now need to upgrade pytorch to 1.0.0?

Is that right?

Number of statements does not equal the number of vectors generated per file with ASTNN

Using your astnn library I am simply trying to generate vectors for the statements of each code file, but the number of statements in a code file does not equal the number of vectors generated.

[screenshot]

I used your astnn code in Colab, doing exactly what the file does up to the vector generation part (up to encodes = encodes.view(self.batch_size, max_len, -1) only).

[screenshot]

https://colab.research.google.com/drive/15FC9I4D0MRTjhV4hlDpgrZrNGC_WyzeM?usp=sharing
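One likely cause, visible in the forward function quoted earlier in this thread: every sample is zero-padded to the batch's max_len before the final view, so you get max_len vectors per file rather than one per statement. A sketch of keeping only the real statement vectors, reusing the lens list computed there:

# encodes has shape (batch_size, max_len, encode_dim); rows beyond a
# sample's true statement count are zero padding from get_zeros.
statement_vectors = [encodes[i, :lens[i]] for i in range(len(lens))]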

About data processing

We are trying to apply this model to our own dataset, but working out how to process our data to fit the model has been a big problem for us.
I would like to know how to process our data into the correct format.
We would greatly appreciate it if you are willing to offer some help.

Best wishes.
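From the classification pipeline quoted earlier in this thread, the expected input appears to be a pickled pandas DataFrame with columns ['id', 'code', 'label']. A hedged sketch of producing one from your own files (the file list, paths, and label scheme are yours to fill in):

import pandas as pd

rows = []
for i, (path, label) in enumerate(my_files):  # assumption: [(path, label), ...]
    with open(path, encoding='utf-8') as f:
        rows.append((i, f.read(), label))

source = pd.DataFrame(rows, columns=['id', 'code', 'label'])
source.to_pickle('data/programs.pkl')  # path assumed from pipeline.py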

Character-level embedding or token-based embedding?

Hello,
I wonder whether this work uses character-level or token-level word embeddings from Word2Vec.
I may have found an error on line 76 of 'pipeline.py': this operation makes Word2Vec learn character-level embeddings instead of token-based ones.
That in turn may cause an error on line 95, in the function 'tree_to_index'. If I understand correctly, this should turn the token of an AST node into its embedding index; however, the tokens are word-level and are not in w2v.wv.vocab at all, so all tokens on AST nodes, except some leaf tokens, fall into the 'max_token' else branch.
I may have misunderstood this file, or the problem may be caused by the gensim version.
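To illustrate the pitfall being described (a general gensim note, not the repo's exact line): if Word2Vec receives raw strings instead of token lists, it iterates over characters. Assuming the gensim < 4.0 API used elsewhere in this thread:

from gensim.models.word2vec import Word2Vec

corpus_str = ['int main ( )']             # one raw string ...
corpus_tok = [['int', 'main', '(', ')']]  # ... vs. one list of tokens

# Each "sentence" must already be a list of tokens; a bare string is
# treated as a sequence of one-character words.
w2v_chars = Word2Vec(corpus_str, min_count=1)   # vocab: 'i', 'n', 't', ...
w2v_tokens = Word2Vec(corpus_tok, min_count=1)  # vocab: 'int', 'main', '(', ')'
print(sorted(w2v_tokens.wv.vocab))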

ASTNN as a pure encoder

Hi,

I would like to decouple the encoder part of the code (BatchTreeEncoder and the GRU) from the classifier. Is this possible? I tried, but I could not figure out what the output should be so that the decoder part of the network decodes properly; the loss function might also need to change. I flattened the input to a list of integers, created a hash, and converted the input to a fixed-length list of integers to serve as a comparison between output and input. I also changed the network architecture to reflect this instead of the labels, and used MSE as the loss function, but I am not getting a good score. Any ideas? I would then like to use the encoded output for clustering.

Thanks and regards,
Jyothi
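One lightweight alternative (a suggestion, not the authors' design): skip the decoder entirely and use the pooled bi-GRU output, the tensor right before self.hidden2label in forward, as the fixed-size code vector for clustering. A hedged sketch, assuming forward() is refactored so the steps up to pooling live in a helper:

def encode(self, x):
    # Hypothetical: self.features(x) runs everything forward() does up to
    # and including F.max_pool1d(...), skipping only hidden2label.
    return self.features(x)  # shape: (batch_size, 2 * hidden_dim)

# Clustering usage (sketch):
# vecs = model.encode(batch).detach().cpu().numpy()
# labels = KMeans(n_clusters=10).fit_predict(vecs)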

After upgrading torch to version 1.2.0, training does not run correctly

Start training...
[Epoch: 0/ 15] Training Loss: 2.3469, Validation Loss: 0.8280, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.420 s
[Epoch: 1/ 15] Training Loss: 0.4703, Validation Loss: 0.3033, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.467 s
[Epoch: 2/ 15] Training Loss: 0.1991, Validation Loss: 0.1727, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.485 s
[Epoch: 3/ 15] Training Loss: 0.1193, Validation Loss: 0.1204, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.439 s
[Epoch: 4/ 15] Training Loss: 0.0837, Validation Loss: 0.1022, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.594 s
[Epoch: 5/ 15] Training Loss: 0.0635, Validation Loss: 0.0876, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.626 s
[Epoch: 6/ 15] Training Loss: 0.0508, Validation Loss: 0.0778, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 108.131 s
[Epoch: 7/ 15] Training Loss: 0.0422, Validation Loss: 0.0720, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 108.008 s
[Epoch: 8/ 15] Training Loss: 0.0356, Validation Loss: 0.0659, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.627 s
[Epoch: 9/ 15] Training Loss: 0.0304, Validation Loss: 0.0636, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.660 s
[Epoch: 10/ 15] Training Loss: 0.0262, Validation Loss: 0.0610, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.398 s
[Epoch: 11/ 15] Training Loss: 0.0228, Validation Loss: 0.0577, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 106.979 s
[Epoch: 12/ 15] Training Loss: 0.0200, Validation Loss: 0.0586, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 106.910 s
[Epoch: 13/ 15] Training Loss: 0.0178, Validation Loss: 0.0554, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.089 s
[Epoch: 14/ 15] Training Loss: 0.0158, Validation Loss: 0.0549, Training Acc: 0.000, Validation Acc: 0.000, Time Cost: 107.778 s

best_model WASN'T defined after all!

Testing results(Acc): tensor(0, device='cuda:0')
What is the cause? Does it have to be torch 1.0.0?
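A plausible cause (a guess based on common torch migration issues, not a confirmed diagnosis): from torch 1.x onward the accuracy accumulator can remain an integer tensor, and integer tensor division truncates to 0, which would explain the losses decreasing normally while every accuracy prints 0.000. Converting to a Python number before dividing usually fixes it:

# In the train/validation loop (variable names assumed from train.py):
_, predicted = torch.max(output.data, 1)
total_acc += (predicted == labels).sum().item()  # .item() yields a plain int

# Later, when reporting:
print(total_acc / total)  # ordinary float division instead of tensor(0)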
