Giter Club home page Giter Club logo

tokompiler's Introduction

Tokompiler: Code-Oriented Tokenizer for HPC Tasks

Tokompiler is a specialized tokenizer designed for code preprocessing and tokenization. It provides a pipeline for converting source code into a format that enhances code representation and understanding. Its primary purpose is to support the pretraining of language models for high-performance computing.

Getting Started

To use Tokompiler, follow these steps:

Install tree-sitter:

pip install tree-sitter

Build the Tree-sitter parser by executing the following commands from the root directory:

cd src/tokompiler/parsers
python build_ts.py
cd -

Finally, run setup.py to build pip package as:

python setup.py bdist_wheel && python -m pip install dist/tokompiler-0.1-py3-none-any.whl

Pipeline Overview

The pipeline of using Tokompiler consists of the following steps:

  1. Convert Representation The ConvertRepresentation class contains a function called generate_replaced that takes input code and the programming language. It returns a replaced format of the code. For example:

Original Code:

int main() {
    int r[2800 + 1];
    // ... (code continues)
}

Replaced Format:

int func_252() {
    int arr_88[num_34 + num_842];
    // ... (replaced code)
}
  1. Lexicalization The Lexicalization class contains a function called lexicalize that splits the code into its semantical tokens given the code and the programming language. For example:

Original Code:

int func_252() {
    int arr_88[num_34 + num_842];
    // ... (code continues)
}

Lexicalized Tokens:

["int", "func", "252", "(", ")", "{", "int", "arr", "88", // ... (tokens continue)
  1. Tokenization The Tokenizer class defines an interface that Tokompiler implements. It includes the following functions:

tokenize: Converts code into a list of tokens.

encode: Given code, returns a list of IDs representing each token.

decode: Given a list of token IDs, returns the original code.

Tokompiler Usage

A complete example of using Tokompiler can be found in this example script.

Code Perplexity

Perplexity calculation in accordance to PolyCoder (https://arxiv.org/abs/2202.13169) methods using lmppl (https://github.com/asahi417/lmppl)

you will need your model in the huggingface format. preferably pytorch only.

Environment

The environment.yml file functions seamlessly.

However, if you already have PyTorch installed, utilizing the partial_requirements.txt with pip is a suitable approach as well.

LMPPL Usage

After setting up your environment and ensuring a compatible model, you can calculate perplexity by running the perplexity.py script with your chosen model and text inputs.

Here is an example of how to use the script:

# Get lexical count for the texts using 'c' language lexer (replace 'c' with the appropriate lexer for your texts)
count = get_lex_count(texts,'c')

tokenizer=AutoTokenizer.from_pretrained('gpt2')
model=AutoModelForCausalLM.from_pretrained('gpt2')
scorer = lmppl_code.LM(tokenizer=tokenizer, model_obj=model)
print(scorer.get_perplexity(texts,count))

tokompiler's People

Contributors

nevakrien avatar talkad avatar scientific-computing-user avatar galoren avatar nhasabni avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.