
GNN pre-training & fine-tuning

Pre-training and fine-tuning GNN model on source code

For correct usage, install all requirements from requirements.txt. Due to issues with the installation order, it is recommended to use the prepared shell script; it also contains more information about these issues.

./install_dependencies.sh

Data preparation

The src.data.preprocess module provides all necessary functionality to prepare data for further training.

Option 1: Source code should be provided in a single file with the following format (see the parsing sketch after the options):

  • Examples are separated from each other by a special symbol.
  • Inside each example, the source code and the filename (or other label) are separated by a special symbol.

Option 2: Alternatively, you can use raw data obtained from GitHub using this repository. Unfortunately, the already scraped repositories are still unavailable due to privacy policy, but you can download such data yourself.
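The exact special symbols are not shown above, so the following is only a hypothetical parsing sketch for the Option 1 format, with placeholder separators EXAMPLE_SEPARATOR and LABEL_SEPARATOR standing in for the real ones:

# Hypothetical sketch: split a single-file dataset (Option 1) into (label, code) pairs.
# EXAMPLE_SEPARATOR and LABEL_SEPARATOR are placeholders, not the repository's real symbols.
from typing import Iterator, Tuple

EXAMPLE_SEPARATOR = "<EXAMPLE_SEP>"  # placeholder
LABEL_SEPARATOR = "<LABEL_SEP>"      # placeholder

def iter_examples(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (label, source_code) pairs from an Option 1 style file."""
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()
    for example in content.split(EXAMPLE_SEPARATOR):
        example = example.strip()
        if not example:
            continue
        label, code = example.split(LABEL_SEPARATOR, maxsplit=1)
        yield label.strip(), code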

Preprocessed data

All source and preprocessed data can be obtained from this table:

name  | source           | preprocessed     | holdout sizes (train/val/test)       | # tokens
dev   | s3 link (3.6 Mb) | s3 link (15 Mb)  | 552 / 185 / 192                      | 12 269
small | s3 link (287 Mb) | s3 link (1.2 Gb) | 44 683 / 14 892 / 14 934             | 213 875
full  | Unavailable      | Unavailable      | 56 666 194 / 19 892 270 / 18 464 490 | 1 306 153

Code2graph

To represent code as graphs, we use the approach presented in Typilus. The implementation is taken from a fork of the original implementation with essential bug fixes.

Use preprocess.py to convert your data into graphs:

PYTHONPATH="." python src/data/preprocess/preprocess.py
    -d <path to file with data>
    -t <path to destination folder>
    --vocabulary

The --vocabulary flag is used to collect information about token occurrences in the code.

The output of preprocessing is 3 gzipped JSONL files. Each file corresponds to a separate holdout (train, val, test). Each line is a standalone JSON object that describes one graph.
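For illustration, a graph can be inspected as in the sketch below. The holdout file name is an assumption, and the field names inside each JSON graph follow the Typilus-style format, so check the actual output of preprocess.py:

# Sketch: read the first graph from one of the gzipped JSONL holdout files
# and print its top-level keys. The file name is an assumption.
import gzip
import json

holdout_path = "<path to destination folder>/train.jsonl.gz"  # assumed name

with gzip.open(holdout_path, "rt", encoding="utf-8") as f:
    graph = json.loads(next(f))

print(list(graph.keys()))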

Model pre-training

We use PyTorch Lightning to implement all necessary modules for training, so they can be easily reused in other research works. All pre-training uses the GINEConv operator from the Strategies for Pre-training Graph Neural Networks paper. src.models.modules.gine_conv_encoder contains the described encoder model.
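For intuition only, a stripped-down GINEConv encoder might look like the sketch below; the embedding scheme, layer sizes, and number of layers are assumptions for the example, not the exact model from src.models.modules.gine_conv_encoder:

# Illustrative sketch of a GINEConv-based graph encoder (PyTorch Geometric).
# Hyperparameters and the embedding scheme are assumptions.
import torch
from torch import nn
from torch_geometric.nn import GINEConv

class TinyGINEEncoder(nn.Module):
    def __init__(self, node_vocab: int, edge_vocab: int, hidden: int = 128, n_layers: int = 3):
        super().__init__()
        self.node_emb = nn.Embedding(node_vocab, hidden)
        self.edge_emb = nn.Embedding(edge_vocab, hidden)
        self.convs = nn.ModuleList(
            GINEConv(nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden)))
            for _ in range(n_layers)
        )

    def forward(self, node_types, edge_index, edge_types):
        x = self.node_emb(node_types)
        edge_attr = self.edge_emb(edge_types)
        for conv in self.convs:
            x = torch.relu(conv(x, edge_index, edge_attr))
        return x  # one hidden state per node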

Currently, we support the following pre-training schemes:

  • Predicting node and edge types. For each graph, we randomly mask node and edge types with a special token and train the model to restore them (a minimal sketch of the masking step follows this list). src.models.gine_conv_masking_pretraining contains the complete Lightning module for this pre-training.
  • Predicting the sequence of subtokens in a node. For each graph, we randomly mask node tokens with a special token and train the model to restore them. src.models.gine_conv_token_prediction contains the complete Lightning module for this pre-training.
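As referenced above, the masking step itself can be sketched as follows; the mask token id and masking rate are assumptions for illustration:

# Sketch: randomly replace a fraction of node types with a special mask token;
# the original types become the reconstruction targets. Values are illustrative.
import torch

MASK_TOKEN_ID = 0        # assumed id of the special mask token
MASK_PROBABILITY = 0.15  # assumed masking rate

def mask_node_types(node_types: torch.Tensor):
    """Return (masked input, targets, mask) for a masking pre-training step."""
    mask = torch.rand(node_types.shape) < MASK_PROBABILITY
    masked_input = node_types.clone()
    masked_input[mask] = MASK_TOKEN_ID
    return masked_input, node_types, mask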

To run pre-training with the chosen model, use the following command:

PYTHONPATH="." python src/pretraining.py -c <path to YAML config file> 

Model fine-tuning

Currently, we support fine-tuning for the code-to-text task, i.e., generating documentation for code. We use BPE to tokenize the documentation of the train holdout. The decoder of the model is an LSTM with attention to node states. src.models.gine_conv_sequence_generating contains the complete Lightning module for this fine-tuning.
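For illustration only (not the repository's actual decoder), one decoding step of an LSTM with dot-product attention over encoder node states could look like this; dimensions and the attention variant are assumptions:

# Illustrative sketch: a single decoding step that attends to node states.
# Assumes batch size 1; node_states is the encoder output for one graph.
import torch
from torch import nn

class AttnLSTMDecoderStep(nn.Module):
    def __init__(self, vocab_size: int, hidden: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)
        self.cell = nn.LSTMCell(hidden * 2, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, prev_token, h, c, node_states):
        # prev_token: [1], h and c: [1, hidden], node_states: [num_nodes, hidden]
        scores = node_states @ h.squeeze(0)      # [num_nodes]
        weights = torch.softmax(scores, dim=0)
        context = weights @ node_states          # [hidden]
        step_input = torch.cat([self.emb(prev_token), context.unsqueeze(0)], dim=-1)
        h, c = self.cell(step_input, (h, c))
        return self.out(h), h, c                 # logits over the BPE vocabulary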

To run fine-tuning, use the following command:

PYTHONPATH="." python src/finetuning.py -c <path to YAML config file> 

Configuration

The complete model configuration is defined by a YAML config. Example configs are stored in the config folder.
