
bigcode-dataset's People

Contributors

cceyda, chenghaomou, lingjzhu, loubnabnl, lvwerra, mponty, paulovn, rainrat, raymondli0, terryyz


bigcode-dataset's Issues

Define filters for cleaning GitHub issues

We should come up with good filters to get rid of low-quality issues. This could be done by going through a sample of issues (e.g. 100) and working out patterns of low-quality data.

Dataset filter based on code/docs ratio

The ratio of code to docstrings in a document can be used as a proxy for code quality. Code with no comments, or comments without any code at all, may not be very useful for training, and we want to test that hypothesis with some experiments.
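As a rough illustration of the idea, here is a minimal Python-only sketch (counting all string literals as docstrings is a crude assumption made for brevity, not a project choice):

import io
import tokenize

def comment_ratio(source: str) -> float:
    # Fraction of tokens that are comments or string literals
    # (a crude stand-in for docstrings) in a Python source file.
    comments = total = 0
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            total += 1
            if tok.type in (tokenize.COMMENT, tokenize.STRING):
                comments += 1
    except (tokenize.TokenError, SyntaxError):
        return 0.0
    return comments / total if total else 0.0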

Most CMake files missed when categorizing by extension

The cmake subdirectory contains files with the .cmake extension.
Most of the CMake logic to build projects is contained in files named CMakeLists.txt, which get placed in the text directory in this dataset.

In the dedup dataset, there are 186,517 files in cmake, while there are 469,291 files named CMakeLists.txt in the text directory.
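A possible fix, sketched below, is to route a few well-known filenames before falling back to the extension. The helper name and routing table are illustrative, not the pipeline's actual API; only the CMakeLists.txt entry comes from this issue.

import os

SPECIAL_FILENAMES = {
    "CMakeLists.txt": "cmake",   # from this issue
    "Makefile": "makefile",      # illustrative
    "Dockerfile": "dockerfile",  # illustrative
}

def language_bucket(path: str, ext_to_lang: dict) -> str:
    # Check the filename first, then fall back to the extension mapping.
    name = os.path.basename(path)
    if name in SPECIAL_FILENAMES:
        return SPECIAL_FILENAMES[name]
    ext = os.path.splitext(name)[1].lstrip(".")
    return ext_to_lang.get(ext, "text")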

Suggest datasets for Code Dataset Catalogue

While we are preparing a large code dataset from GitHub, we also want to include other datasets that might be useful for pretraining or fine-tuning.

Note: For evaluation there is a separate list here.

Existing list

There are already a few datasets gathered here:
https://docs.google.com/spreadsheets/d/1otSgs2-iDp1KTbasEDlFhTHiaKsYbIRvWrgH_TiubFQ/edit#gid=0

Contribute

We would like to gather an extensive list of resources that could be useful for training language models for code. You can contribute by commenting on this issue with a missing dataset, using the template below. A resource could be a specific code dataset (e.g. CodeXGLUE), a related dataset (e.g. the CS part of Wikipedia), a website (e.g. Stack Overflow), or something similar.

Template

The following is the template for reporting an interesting data source:

| Name | Link | License | Number of files/samples | Size in GB | Languages | Available on the HF Hub |
|:-|:-|:-|:-|:-|:-|:-|
| Transformers repo | https://github.com/huggingface/transformers | Apache 2.0 | 100 | 0.1 | Python | No |

Here's the Markdown snippet that you can copy/paste:

|Name|Link|License|Number of files/samples| Size in GB| Languages |Available on the HF Hub|
|:-|:-|:-|:-|:-|:-|:-|
| | | | | | | |

Create dataset with git commits

Git commits are an interesting data source, as they could be used as a natural source of prompts for code models, with a format like:

CODE BEFORE
COMMIT MESSAGE
CODE AFTER

The dataset requires some cleaning and filtering since:

  • commits can span multiple files
  • commit messages can be non-descriptive ("fix", "bla", etc.)
  • quality is often poor in general
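As a sketch of how one sample could be extracted, assuming a local clone and a single-file commit (plain git via subprocess, not the project's actual tooling):

import subprocess

def commit_sample(repo: str, sha: str, path: str) -> dict:
    # Pull the before/after versions of one file touched by a commit,
    # plus the commit message, matching the format above.
    def git(*args: str) -> str:
        return subprocess.check_output(["git", "-C", repo, *args], text=True)

    return {
        "code_before": git("show", f"{sha}^:{path}"),
        "commit_message": git("log", "-1", "--format=%B", sha).strip(),
        "code_after": git("show", f"{sha}:{path}"),
    }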

Decontaminate pretraining dataset from evaluation benchmarks

In order for the evaluation results to reflect the true performance of the model, it is important to make sure that the evaluation benchmarks are not part of the training data. For that purpose we want to use #15 to search for and remove evaluation benchmarks from the code dataset.
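The simplest version of this is exact-substring matching, roughly as sketched below. benchmark_samples is a placeholder for e.g. benchmark prompts and solutions as plain strings; the real approach is meant to go through the index from #15.

def is_contaminated(document: str, benchmark_samples: list[str]) -> bool:
    # Flag a training document that contains any benchmark sample verbatim.
    return any(sample in document for sample in benchmark_samples)

def decontaminate(corpus: list[str], benchmark_samples: list[str]) -> list[str]:
    return [doc for doc in corpus if not is_contaminated(doc, benchmark_samples)]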

Build StackOverflow datasets

StackOverflow contains a lot of interesting data that can be used for classification, language modeling or retrieval. This is an issue to gather all ideas and sub-issues to add datasets to the Hub.

EDA on full dataset

It would be great if we could do some exploratory data analysis on the full dataset:

  • Language distribution
  • File length
  • License distribution

This would help with understanding the dataset and the trade-offs in filtering.
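As a starting point, a rough sketch on a small subset (the-stack-smol stands in for the full dataset here, and the column names "lang" and "content" are assumptions about the schema):

from collections import Counter
from datasets import load_dataset

ds = load_dataset("bigcode/the-stack-smol", split="train")

lang_dist = Counter(ds["lang"])                 # language distribution
file_lengths = [len(c) for c in ds["content"]]  # file length in characters
print(lang_dist.most_common(10))
print(sum(file_lengths) / len(file_lengths))    # mean file length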

Dataset filters based on tokenizer or perplexity

To filter out low-quality code, one can use the char/token ratio of a tokenizer to e.g. identify files full of hashes, or train a small LM (e.g. a KenLM model) on a clean dataset and use its perplexity to remove noisy data.
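For the first idea, a minimal sketch (the gpt2 checkpoint and the threshold are placeholders, not values from this project):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def chars_per_token(text: str) -> float:
    return len(text) / max(len(tokenizer(text)["input_ids"]), 1)

def looks_like_noise(text: str, threshold: float = 2.5) -> bool:
    # Hashes and base64 blobs tokenize poorly (many tokens per character),
    # which shows up as a low chars-per-token ratio.
    return chars_per_token(text) < threshold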

Question: File Counts and Dataset Size

I recently downloaded The Stack (the-stack-dedup) from Hugging Face via Git LFS. I have two questions that I need help with:

  1. The size on disk of the dedup dataset is only around 900 GB, much smaller than the 1.5 TB indicated on the data card (https://huggingface.co/datasets/bigcode/admin/resolve/main/the-stack-infographic-v11.png).

  2. Is there somewhere where the file counts are listed in full for each dataset by language (dedup and full)?

Essentially I am looking to make sure that I have accessed the entirety of the dataset, so I either need to understand the dataset size difference, or know how many files there should be for each language so I can validate my download. Ideally both.

Thanks in advance!

NER models for PII

For the next iteration of PII, we want to train an NER model to detect PII, using the benchmark we will annotate. Here are some tasks:

1. Train an encoder (BERT) model on the languages of The Stack (same languages as the target model) => @loubnabnl
2. Prepare a pipeline for fine-tuning the encoder model on the NER PII task => @mrm8488

Help wanted:

3. Estimate the cost of using a BERT-sized model to detect PII on The Stack (e.g. estimate inference with CodeBert-NER on the-stack-smol (300k samples) on 1 GPU)
4. Implement PII redaction using special tokens in the tokenizer
5. Analyze PII in a dataset of GitHub issues, and investigate using our regexes + standard English NER models to remove it

For any intermediate code/analysis, please open PRs in the bigcode-analysis repo.
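For the inference side, the rough shape would be a token-classification pipeline, sketched here with a placeholder checkpoint (not an actual BigCode model name):

from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="some-org/pii-ner-model",  # placeholder checkpoint
    aggregation_strategy="simple",
)
entities = ner('email = "jane.doe@example.com"')  # -> list of detected spans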

Define filters for git commits

We should come up with good filters to get rid of low-quality commits. This could be done by going through a sample of commits (e.g. 100) and working out patterns of low-quality data.

Create dataset with GitHub metadata

For data filtering or model conditioning we would like to have GH metadata such as stars, forks, contributors etc. as part of the dataset.

Build dataset index

Building a dataset search index is necessary for two reasons:

  1. removing near duplicates from training data
  2. retrieving training data during generation to infer attribution

The current idea is to use pyserini to build the index of the near-deduplicated dataset.
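The retrieval side could then look roughly like this, assuming an index has already been built with pyserini's standard JsonCollection workflow (the index path here is a placeholder):

from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher("indexes/the-stack-dedup")  # placeholder path
hits = searcher.search("def quicksort(arr):", k=10)
for hit in hits:
    print(hit.docid, hit.score)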

When I do pii_inference, I cannot load bigcode/bigcode-encoder-pii-ner-v2

Hi,
I need to do PII detection and hope to use the already-trained NER model.
In infer.slurm, I found that the project attempts to load the model from bigcode/bigcode-encoder-pii-ner-v2.
However, when I try to do it myself, it shows

OSError: bigcode/bigcode-encoder-pii-ner-v2 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with huggingface-cli login or by passing token=<your_token>

I cannot find bigcode/bigcode-encoder-pii-ner-v2 on the Hugging Face Hub.
So, may I ask how I can get a trained PII detection model?

Include code review data.

Based on a recent tweet, it sounds like issues and commits will be added to the dataset soon (which is great news!).

Any chance of adding code reviews as well? Or is this limited by the capabilities of the GitHub API?

Remove opt-out accounts

Before training the first model we can filter out any opted-out repositories from the dataset.


Which languages to include?

Which programming languages should be included in the pretraining dataset?

In contrast to natural language, multilinguality does not increase the vocabulary size significantly for code, so I think it could be interesting to be on the liberal side here to open the door for interesting downstream use cases.

Deduplication also removes data < ngram_size

I noticed the ngrams function returns an empty list when the number of tokens is smaller than the ngram_size value.
This has the side effect of clustering all sentences with fewer tokens than ngram_size together and removing them as dupes. Not sure if this is intentional?
So for the default of ngram_size=5, "a sentence like this" gets removed by default.

One could maybe filter those sentences out and rerun the deduping with a smaller ngram_size value for them.
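To make the failure mode concrete, a minimal reproduction (this helper mirrors the usual sliding-window n-gram definition, not necessarily the project's exact function):

def ngrams(tokens: list[str], n: int) -> list[tuple]:
    # Returns [] whenever len(tokens) < n.
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("a sentence like this".split(), 5))  # []
# Every such short document gets an identical (empty) signature, so they all
# cluster together and get dropped as duplicates.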

Dataset filtering based on content

Most code models following Codex used some content-based filtering (e.g. max/mean line length of a file, fraction of alphanumeric characters). These work well for Python but can fail for other languages such as HTML. We need to gather good filters for each language, especially the high-volume languages such as HTML, Java, JavaScript, Python, C, C++, etc.
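A sketch of these filters (the thresholds below are illustrative, not values used in this project, and would need per-language tuning):

def passes_content_filters(text: str) -> bool:
    lines = text.splitlines() or [""]
    max_line = max(len(line) for line in lines)
    mean_line = sum(len(line) for line in lines) / len(lines)
    alnum_frac = sum(c.isalnum() for c in text) / max(len(text), 1)
    # Thresholds in the spirit of Codex-style filtering.
    return max_line <= 1000 and mean_line <= 100 and alnum_frac >= 0.25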

Some file extensions excluded from the published dataset (Racket)

programming-languages-to-file-extensions.json correctly lists 'rkt' as the most common file extension for Racket, but the data subset for Racket at https://huggingface.co/datasets/bigcode/the-stack/tree/main/data/racket has zero files with this extension, and 'rkt' is specifically mentioned as an excluded extension in the paper at https://arxiv.org/abs/2305.06161. This likely excludes the majority of actual Racket files found on GitHub.
