- Use your API token by modifying `headers` in `gh_crawler.py`.
- Run `collect_data.sh`.
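The token step above can be sketched as follows — a minimal illustration assuming the crawler authenticates via standard GitHub API headers; the actual `headers` variable and request logic in `gh_crawler.py` may differ:

```python
# Illustrative sketch of authenticated GitHub API headers (the real
# `headers` in gh_crawler.py may be structured differently).
import json
import urllib.request

GITHUB_TOKEN = "ghp_your_token_here"  # hypothetical placeholder token

headers = {
    "Authorization": f"token {GITHUB_TOKEN}",
    "Accept": "application/vnd.github.v3+json",
}

def search_repos(query: str = "language:python", per_page: int = 5) -> dict:
    """Query the GitHub search API using the authenticated headers."""
    url = (
        "https://api.github.com/search/repositories"
        f"?q={query}&per_page={per_page}"
    )
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Without a valid token, unauthenticated requests are subject to much stricter GitHub rate limits, which matters for a large crawl.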
GitHub (version of October 9-10, 2021): 249GB of multi-lingual code.
Stack-Repo: Stack-Repo is a dataset of 200 Java repositories from GitHub with permissive licenses and near-deduplicated files.
CodeAlpaca-20k: 20k instruction tuning code dataset.
CoderEval: CoderEval is a pragmatic code generation benchmark to evaluate the performance of generative pre-trained models. CoderEval supports Python and Java, with 230 functions from 43 Python projects and 230 methods from 10 Java projects.
HumanEval: HumanEval contains 164 hand-written Python programming problems. Each problem provides a prompt with descriptions of the function to be generated, function signature, and example test cases in the form of assertions.
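As an illustration of the prompt-plus-assertions format described above (this is a made-up example, not an actual HumanEval problem):

```python
# Illustrative problem in the HumanEval style: the prompt supplies a
# signature, docstring, and example cases; the model must complete the body.
def add_elements(xs: list, k: int) -> int:
    """Return the sum of the first k elements of xs.
    >>> add_elements([1, 2, 3, 4], 2)
    3
    """
    # A correct completion:
    return sum(xs[:k])

# Test cases in the form of assertions, as in HumanEval:
assert add_elements([1, 2, 3, 4], 2) == 3
assert add_elements([5, 5, 5], 3) == 15
```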
HumanEval-X: Previous works evaluate multilingual program synthesis using semantic-similarity metrics (e.g., CodeBLEU), which are often misleading; HumanEval-X instead evaluates the functional correctness of the generated programs. HumanEval-X consists of 820 high-quality human-crafted data samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for various tasks.
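Functional correctness is typically checked by executing each generated candidate against the problem's test cases. A minimal sketch (a hypothetical helper, not HumanEval-X's actual harness, which also sandboxes execution and aggregates pass@k):

```python
# Hypothetical functional-correctness check: run a candidate completion
# against a problem's test code and record pass/fail.
def check_correctness(candidate_code: str, test_code: str) -> bool:
    """Execute candidate + tests in a fresh namespace; True if all pass."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # define the generated function
        exec(test_code, namespace)        # run the problem's assertions
        return True
    except Exception:
        return False

candidate = "def inc(x):\n    return x + 1\n"
tests = "assert inc(1) == 2\nassert inc(-1) == 0\n"
```

A real harness runs untrusted model output in an isolated process with timeouts; bare `exec` is shown only to make the pass/fail logic concrete.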
Frank F. Xu, Uri Alon, Graham Neubig, Vincent J. Hellendoorn. A Systematic Evaluation of Large Language Models of Code. PAPER CODE