
bigcode-dataset's People

Contributors

cceyda, chenghaomou, lingjzhu, loubnabnl, lvwerra, mponty, paulovn, rainrat, raymondli0, terryyz


bigcode-dataset's Issues

Define filters for cleaning GitHub issues

We should come up with good filters to get rid of low-quality issues. This could be done by going through a sample of issues (e.g. 100) and working out patterns of low-quality data.

Dataset filter based on code/docs ratio

The ratio of code to docstrings in a document can be used as a proxy for code quality. Code with no comments, or comments without any code at all, may not be very useful for training, and we want to test that hypothesis with some experiments.
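As a rough illustration of the idea, here is a minimal Python-only sketch (counting all string literals as docstrings is a crude assumption made for brevity, not a project choice):

import io
import tokenize

def comment_ratio(source: str) -> float:
    # Fraction of tokens that are comments or string literals
    # (a crude stand-in for docstrings) in a Python source file.
    comments = total = 0
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            total += 1
            if tok.type in (tokenize.COMMENT, tokenize.STRING):
                comments += 1
    except (tokenize.TokenError, SyntaxError):
        return 0.0
    return comments / total if total else 0.0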

Most CMake files missed when categorizing by extension

The cmake subdirectory contains files with the .cmake extension.
Most of the CMake logic to build projects is contained in files named CMakeLists.txt, which get placed in the text directory in this dataset.

In the dedup dataset, there are 186,517 files in cmake, while there are 469,291 files named CMakeLists.txt in the text directory.
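A possible fix, sketched below, is to route a few well-known filenames before falling back to the extension. The helper name and routing table are illustrative, not the pipeline's actual API; only the CMakeLists.txt entry comes from this issue.

import os

SPECIAL_FILENAMES = {
    "CMakeLists.txt": "cmake",   # from this issue
    "Makefile": "makefile",      # illustrative
    "Dockerfile": "dockerfile",  # illustrative
}

def language_bucket(path: str, ext_to_lang: dict) -> str:
    # Check the filename first, then fall back to the extension mapping.
    name = os.path.basename(path)
    if name in SPECIAL_FILENAMES:
        return SPECIAL_FILENAMES[name]
    ext = os.path.splitext(name)[1].lstrip(".")
    return ext_to_lang.get(ext, "text")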

Suggest datasets for Code Dataset Catalogue

While we are preparing a large code dataset from GitHub, we also want to include other datasets that might be useful for pretraining or fine-tuning.

Note: For evaluation there is a separate list here.

Existing list

There are already a few datasets gathered here:
https://docs.google.com/spreadsheets/d/1otSgs2-iDp1KTbasEDlFhTHiaKsYbIRvWrgH_TiubFQ/edit#gid=0

Contribute

We would like to gather an extensive list of resources that could be useful for training language models for code. You can contribute by commenting on this issue with a missing dataset, using the template below. A resource could be a specific code dataset (e.g. CodeXGLUE), a related dataset (e.g. the CS part of Wikipedia), a website (e.g. Stack Overflow), or something similar.

Template

The following is the template for reporting an interesting data source:

| Name | Link | License | Number of files/samples | Size in GB | Languages | Available on the HF Hub |
|:-|:-|:-|:-|:-|:-|:-|
| Transformers repo | https://github.com/huggingface/transformers | Apache 2.0 | 100 | 0.1 | Python | No |

Here's the Markdown snippet that you can copy/paste:

|Name|Link|License|Number of files/samples| Size in GB| Languages |Available on the HF Hub|
|:-|:-|:-|:-|:-|:-|:-|
| | | | | | | |

Create dataset with git commits

Git commits are an interesting data source, as they could be used as a natural source of prompts for code models, with a format like:

CODE BEFORE
COMMIT MESSAGE
CODE AFTER

The dataset requires some cleaning and filtering since:

  • commits can span multiple files
  • commit messages can be non-descriptive ("fix", "bla", etc.)
  • quality is often poor in general
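As a sketch of how one sample could be extracted, assuming a local clone and a single-file commit (plain git via subprocess, not the project's actual tooling):

import subprocess

def commit_sample(repo: str, sha: str, path: str) -> dict:
    # Pull the before/after versions of one file touched by a commit,
    # plus the commit message, matching the format above.
    def git(*args: str) -> str:
        return subprocess.check_output(["git", "-C", repo, *args], text=True)

    return {
        "code_before": git("show", f"{sha}^:{path}"),
        "commit_message": git("log", "-1", "--format=%B", sha).strip(),
        "code_after": git("show", f"{sha}:{path}"),
    }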

Decontaminate pretraining dataset from evaluation benchmarks

In order for the evaluation results to reflect the true performance of the model, it is important to make sure that the evaluation benchmarks are not part of the training data. For that purpose we want to use #15 to search for and remove evaluation benchmarks from the code dataset.
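The simplest version of this is exact-substring matching, roughly as sketched below. benchmark_samples is a placeholder for e.g. benchmark prompts and solutions as plain strings; the real approach is meant to go through the index from #15.

def is_contaminated(document: str, benchmark_samples: list[str]) -> bool:
    # Flag a training document that contains any benchmark sample verbatim.
    return any(sample in document for sample in benchmark_samples)

def decontaminate(corpus: list[str], benchmark_samples: list[str]) -> list[str]:
    return [doc for doc in corpus if not is_contaminated(doc, benchmark_samples)]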

Build StackOverflow datasets

StackOverflow contains a lot of interesting data that can be used for classification, language modeling or retrieval. This is an issue to gather all ideas and sub-issues to add datasets to the Hub.

EDA on full dataset

It would be great if we could do some exploratory data analysis on the full dataset:

  • Language distribution
  • File length
  • License distribution

This would help with understanding the dataset and the trade-offs in filtering.
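As a starting point, a rough sketch on a small subset (the-stack-smol stands in for the full dataset here, and the column names "lang" and "content" are assumptions about the schema):

from collections import Counter
from datasets import load_dataset

ds = load_dataset("bigcode/the-stack-smol", split="train")

lang_dist = Counter(ds["lang"])                 # language distribution
file_lengths = [len(c) for c in ds["content"]]  # file length in characters
print(lang_dist.most_common(10))
print(sum(file_lengths) / len(file_lengths))    # mean file length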

Dataset filters based on tokenizer or perplexity

To filter out low-quality code, one can use the char/token ratio of a tokenizer to e.g. identify files full of hashes, or train a small LM (e.g. a KenLM model) on a clean dataset and use its perplexity to remove noisy data.
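For the first idea, a minimal sketch (the gpt2 checkpoint and the threshold are placeholders, not values from this project):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def chars_per_token(text: str) -> float:
    return len(text) / max(len(tokenizer(text)["input_ids"]), 1)

def looks_like_noise(text: str, threshold: float = 2.5) -> bool:
    # Hashes and base64 blobs tokenize poorly (many tokens per character),
    # which shows up as a low chars-per-token ratio.
    return chars_per_token(text) < threshold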

Question: File Counts and Dataset Size

I recently downloaded The Stack (the-stack-dedup) from Hugging Face via Git LFS. I have two questions that I need help with:

  1. The size on disk of the dedup dataset is only around 900 GB, much smaller than the 1.5 TB indicated on the data card (https://huggingface.co/datasets/bigcode/admin/resolve/main/the-stack-infographic-v11.png).

  2. Is there somewhere where the file counts are listed in full for each dataset by language (dedup and full)?

Essentially I am looking to make sure that I have accessed the entirety of the dataset, so I either need to understand the dataset size difference, or know how many files there should be for each language so I can validate my download. Ideally both.

Thanks in advance!

NER models for PII

For the next iteration of PII, we want to train an NER model to detect PII, using the benchmark we will annotate. Here are some tasks:

1. Train an encoder (BERT) model on the languages of The Stack (same languages as the target model) => @loubnabnl
2. Prepare a pipeline for fine-tuning the encoder model on the NER PII task => @mrm8488

Help wanted:

3. Estimate the cost of using a BERT-sized model to detect PII on The Stack (e.g. estimate inference with CodeBert-NER on the-stack-smol (300k samples) on 1 GPU)
4. Implement PII redaction using special tokens in the tokenizer
5. Analyze PII in a dataset of GitHub issues, and investigate using our regexes + standard English NER models to remove it

For any intermediate code/analysis, please open PRs in the bigcode-analysis repo.
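For the inference side, the rough shape would be a token-classification pipeline, sketched here with a placeholder checkpoint (not an actual BigCode model name):

from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="some-org/pii-ner-model",  # placeholder checkpoint
    aggregation_strategy="simple",
)
entities = ner('email = "jane.doe@example.com"')  # -> list of detected spans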

Define filters for git commits

We should come up with good filters to get rid of low-quality commits. This could be done by going through a sample of commits (e.g. 100) and working out patterns of low-quality data.

Create dataset with GitHub metadata

For data filtering or model conditioning we would like to have GH metadata such as stars, forks, contributors etc. as part of the dataset.

Build dataset index

Building a dataset search index is necessary for two reasons:

  1. removing near duplicates from training data
  2. retrieving training data during generation to infer attribution

The current idea is to use pyserini to build the index of the near-deduplicated dataset.
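The retrieval side could then look roughly like this, assuming an index has already been built with pyserini's standard JsonCollection workflow (the index path here is a placeholder):

from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher("indexes/the-stack-dedup")  # placeholder path
hits = searcher.search("def quicksort(arr):", k=10)
for hit in hits:
    print(hit.docid, hit.score)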

When I do pii_inference, I cannot load bigcode/bigcode-encoder-pii-ner-v2

Hi,
I need to do PII detection and hope to use the already-trained NER model.
In infer.slurm, I found that the project attempts to load the model from bigcode/bigcode-encoder-pii-ner-v2.
However, when I try to do it myself, it shows

OSError: bigcode/bigcode-encoder-pii-ner-v2 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with huggingface-cli login or by passing token=<your_token>

I cannot find bigcode/bigcode-encoder-pii-ner-v2 on the Hugging Face Hub.
So, may I ask how I can get a trained PII detection model?

Include code review data.

Based on a recent tweet, it sounds like issues and commits will be added to the dataset soon (which is great news!).

Any chance of adding code reviews as well? Or is this limited by the capabilities of the GitHub API?

Remove opt-out accounts

Before training the first model we can filter out any opted-out repositories from the dataset.


Which languages to include?

Which programming languages should be included in the pretraining dataset?

In contrast to natural language, multilinguality does not increase the vocabulary size significantly for code, so I think it could be interesting to be on the liberal side here to open the door for interesting downstream use cases.

Deduplication also removes data < ngram_size

I noticed the ngrams function returns an empty list when the number of tokens is smaller than the ngram_size value.
This has the side effect of clustering all sentences with fewer tokens than ngram_size together and removing them as dupes. Not sure if this is intentional?
So for the default of ngram_size=5, "a sentence like this" gets removed by default.

One could maybe filter those sentences out and rerun the deduping with a smaller ngram_size value for them.
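To make the failure mode concrete, a minimal reproduction (this helper mirrors the usual sliding-window n-gram definition, not necessarily the project's exact function):

def ngrams(tokens: list[str], n: int) -> list[tuple]:
    # Returns [] whenever len(tokens) < n.
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("a sentence like this".split(), 5))  # []
# Every such short document gets an identical (empty) signature, so they all
# cluster together and get dropped as duplicates.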

Dataset filtering based on content

Most code models following Codex used some content-based filtering (e.g. max/mean line length of a file, fraction of alphanumeric characters). These work well for Python but can fail for other languages such as HTML. We need to gather good filters for each language, especially the high-volume languages such as HTML, Java, JavaScript, Python, C, C++, etc.
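A sketch of these filters (the thresholds below are illustrative, not values used in this project, and would need per-language tuning):

def passes_content_filters(text: str) -> bool:
    lines = text.splitlines() or [""]
    max_line = max(len(line) for line in lines)
    mean_line = sum(len(line) for line in lines) / len(lines)
    alnum_frac = sum(c.isalnum() for c in text) / max(len(text), 1)
    # Thresholds in the spirit of Codex-style filtering.
    return max_line <= 1000 and mean_line <= 100 and alnum_frac >= 0.25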

Some file extensions excluded from the published dataset (Racket)

programming-languages-to-file-extensions.json correctly lists 'rkt' as the most common file extension for Racket, but the data subset for Racket at https://huggingface.co/datasets/bigcode/the-stack/tree/main/data/racket has zero files with this extension, and 'rkt' is specifically mentioned as an excluded extension in the paper at https://arxiv.org/abs/2305.06161. This likely excludes the majority of actual Racket files found on GitHub.
