GrammarT5

A PyTorch implementation of "GrammarT5: Grammar-Integrated Pre-trained Encoder-Decoder Neural Model for Code".

Introduction

Pre-trained models for code have exhibited promising performance across various code-related tasks, such as code summarization, code completion, code translation, and bug detection. These accomplishments have substantially contributed to the advancement of AI-assisted programming and developer tools. However, despite their success, the majority of current models still represent code as a token sequence in the fine-tuning phase, which may not adequately capture the essence of the underlying code structure.

In this work, we propose GrammarT5, a grammar-integrated encoder-decoder pre-trained model for code. GrammarT5 employs a novel grammar-integrated representation, Tokenized Grammar Rule List (TGRL), for code. TGRL is constructed based on the grammar rule list utilized in syntax-guided code generation and integrates syntax information with code tokens within an appropriate input length. Furthermore, we suggest attaching language flags to help GrammarT5 differentiate between grammar rules of various programming languages. Finally, we introduce three novel pre-training objectives—Edge Prediction (EP), Identifier Prediction (IP), and Sub-Tree Prediction (STP)—for GrammarT5 to learn syntax from TGRL.
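
For intuition, the sketch below (not part of this repository) uses Python's built-in ast module to print a pre-order list of "parent -> children" rules for a small snippet. It only conveys the flavour of a grammar-rule-list representation; the actual TGRL used by GrammarT5 is built from the formal grammar, interleaves rules with code tokens, and carries language flags.

# Illustration only: a generic pre-order "grammar rule list" over a Python AST.
# GrammarT5's actual TGRL is derived from the language grammar and mixes rules
# with tokenized identifiers/literals; this sketch just conveys the flavour.
import ast

def rule_list(source: str):
    rules = []

    def visit(node):
        children = list(ast.iter_child_nodes(node))
        if children:
            rules.append(type(node).__name__ + " -> "
                         + " ".join(type(c).__name__ for c in children))
            for child in children:
                visit(child)

    visit(ast.parse(source))
    return rules

print(rule_list("x = a + 1"))
# ['Module -> Assign', 'Assign -> Name BinOp', 'Name -> Store',
#  'BinOp -> Name Add Constant', 'Name -> Load']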

Experiments were conducted on five code-related tasks using ten datasets, demonstrating that GrammarT5 achieves state-of-the-art performance on all tasks in comparison to models of the same scale. Additionally, the paper illustrates that the proposed pre-training objectives and language flags can enhance GrammarT5's ability to better capture code syntax and semantics.

Dataset

Train set

The raw data from CodeSearchNet (https://zenodo.org/record/7857872).
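
As a reference point, the public CodeSearchNet release stores functions as gzipped JSON Lines; a minimal inspection sketch is given below. The file name is a placeholder, and the layout of the Zenodo archive linked above may differ.

# Minimal sketch for peeking at CodeSearchNet-style data.  The file name is a
# placeholder, and the field names follow the public CodeSearchNet release;
# the Zenodo archive above may be organised differently.
import gzip
import json

with gzip.open("python_train_0.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        print(sample["language"], sample["func_name"])
        print(sample["code"][:200])
        break  # only look at the first record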

Test set

Usage

Pre-trained Model

We will publish our pre-trained models after the paper is accepted.
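
Once the checkpoints are available, and assuming they follow the standard Hugging Face T5 layout (an assumption, not something the repository confirms yet), loading would look roughly like the sketch below; the checkpoint path is a placeholder.

# Hypothetical sketch: assumes the released checkpoints use the standard
# Hugging Face T5 format.  "path/to/grammart5" is a placeholder, and the real
# input formatting (TGRL, language flags) will follow the released code,
# not this plain-text example.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("path/to/grammart5")
model = T5ForConditionalGeneration.from_pretrained("path/to/grammart5")

inputs = tokenizer("def add(a, b): return a + b", return_tensors="pt")
outputs = model.generate(**inputs, max_length=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))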

Fine-tuning Model

The task can be one of the following: django, concode, codetrans, repair, assert, conala, test, repairme, transj2c, transc2j, commentjava, commentpython, mbpp, searchadv, searchcos.

sh run.sh

The fine-tuned model is saved as checkModel[task], where [task] is the task name.

Testing Model

sh eval.sh

Dependencies

  • Python 3.7
  • PyTorch 1.12
  • transformers 4.26
  • Java 8
  • docker
  • nvidia-docker
