
alpacadatacleaned's Introduction

🦙🛁 Cleaned Alpaca Dataset

Welcome to the Cleaned Alpaca Dataset repository! This repository hosts a cleaned and curated version of a dataset used to train the Alpaca LLM (Large Language Model). The original dataset had several issues that are addressed in this cleaned version.

On April 8, 2023 the remaining uncurated instructions (~50,000) were replaced with data from the GPT-4-LLM dataset. Curation of the incoming GPT-4 data is ongoing.

Dataset Quality and its Impact on Model Performance

One possible explanation for the lack of a significant performance improvement when going from the fine-tuned 7B Alpaca model to the 13B model is the quality of the original dataset. The original dataset used to train the Alpaca model was generated with GPT-3, and those generations themselves may have had quality limitations. Further evidence of poor data quality is that fine-tuning on the original dataset produced poor loss curves.

The quality of a dataset plays a crucial role in determining the performance of the natural language processing models trained on it. A dataset that is noisy, inconsistent, or incomplete can result in poor performance even with the most advanced models. In contrast, a high-quality dataset can enable a model with fewer parameters to perform well.

Therefore, it is possible that with better data, we could improve the performance of the models more than what would be gained by simply increasing model size.

Benchmark Results

Using EleutherAI's lm-evaluation-harness, we compare LoRA models fine-tuned on various datasets.

| Dataset | Model | Parameters | WikiText (ppl) | MNLI (acc) | PIQA (acc norm) |
|---|---|---|---|---|---|
| Original Alpaca | samwit/alpaca7B-lora | 7B (LoRA) | 9.5396 | 38.33 | 78.51 |
| Cleaned Alpaca (Mar 26) | tloen/alpaca-lora-7b | 7B (LoRA) | 9.4885 | 51.6 | 79.33 |
| GPT4All | nomic-ai/gpt4all-lora | 7B (LoRA) | 10.09 | 38.97 | 78.40 |
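
For reference, a rough sketch of how such an evaluation could be run with a 2023-era version of the harness is shown below. The model path, task names, and exact API are assumptions that vary between harness versions; this is not necessarily the exact setup used to produce the table above.

    # Hedged sketch: evaluating a merged fine-tuned model with EleutherAI's
    # lm-evaluation-harness (2023-era Python API; arguments vary by version).
    from lm_eval import evaluator

    results = evaluator.simple_evaluate(
        model="hf-causal",
        model_args="pretrained=/path/to/merged-alpaca-7b",  # hypothetical local path
        tasks=["wikitext", "piqa", "mnli"],
        num_fewshot=0,
    )
    print(results["results"])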

We also ran the Hallucination Trivia Test for Large Language Models (HALTT4LLM):

| Model Name | Truthful QA | HQ Trivia | Fake Questions | NOTA Questions |
|---|---|---|---|---|
| GPT4All | 79.51% | 88.47% | 74.16% | 70.32% |
| GPT-3.5 | 39.95% | 59.33% | 81.81% | 51.93% |
| GPT-3 | 32.15% | 55.67% | 6.10% | 32.25% |
| Llama-7B-4bit | 83.51% | 49.75% | 2.15% | 8.38% |
| Alpaca-7B-4bit | 26.66% | 44.32% | 0.00% | 0.00% |
| Alpaca-7B-4bit cleaned | 37.00% | - | - | 12.90% |

Models fine-tuned on the cleaned dataset appear to hallucinate less and perform better than those trained on the original dataset.

Data Cleaning and Curation

Alpaca is a fine-tuned version of LLaMA that was trained on an instruction dataset generated with GPT-3. The generated dataset was designed to be diverse; however, recent analysis indicates it is very US-centric. The original dataset used to train the Alpaca LLM was found to have many issues that impact its quality and usefulness for training a machine learning model.

Issues with the Original Dataset

The original dataset had several issues that likely affect the final performance of the fine-tuned model. These issues include:

  1. Hallucinations: Many instructions in the original dataset referenced data on the internet, which simply caused GPT-3 to hallucinate an answer (see the sketch after this list for one way such entries can be detected).

    "instruction":"Summarize the given article in 200 Words.",
    "input": "https://www.bbc.com/news/world-51461830",
    "output": "The recent protests in Belarus are part of a growing wave of discontent against the rule of Alexander Lukashenko, who has led the country since 1994. Demonstrations against his government have been ongoing since August, and come in the wake of an election widely seen as rigged. As police and security forces have cracked down on protesters, the movement has only grown, with hundreds of thousands of people taking to the streets to call for new leadership. Human rights activists have accused the government of using extreme violence in order prosecute protesters, leading to an international outcry and sanctions from the European Union. As the situation in Belarus continues to unfold, the international community is monitoring the situation closely - hoping for a peaceful resolution, while also standing firm against the oppressive tactics employed by the Belarusian state."

    "instruction": "Classify the following shape according to simple geometry.",
    "input": "<Picture Attached>",
    "output": "Triangle"

    "instruction": "Given a recent news article, write a headline for it.",
    "input": "https://www.news.example.com/article-1",
    "output": "\"NASA Launches a Mars 2020 Mission to Search for Signs of Ancient Life\""

  2. Merged instructions: Many instructions in the original dataset were merged together for no apparent reason.

    "instruction": "Should I invest in stocks?",
    "input": "",
    "output": "It depends on your individual situation. Investing in stocks carries a certain amount of risk but can also offer potential rewards. Make sure to research the different types of stocks, understand the risks and rewards, and develop a plan that works for you.\n##\n18. Instruction: Identify the noun in the following sentence:\n18. Input:\nThe large apple was delicious.\n18. Output:\nApple"

  3. Empty outputs: Some entries in the original dataset had empty outputs.

    "instruction": "Calculate the area of the following shape in square centimeters.",
    "input": "rectangle of size 4 cm x 5 cm",
    "output": ""

  4. Empty code examples: Some descriptions in the original dataset were missing code examples, making it difficult to understand the intended behavior of the code.

  5. Instructions to generate images: Some entries in the original dataset included instructions to generate images, something a text-only model obviously cannot do.

    "instruction": "Create a graphic or logo that visually represents the word \"courage\".",
    "input": "",
    "output": "<No Output>"

  6. N/A outputs: Some code snippets in the original dataset had N/A outputs.

  7. Inconsistent input field: The original dataset used the input field inconsistently when it was supposed to be empty.

    "input":"<no input>"
    "input":"No input"
    "input":"noinput"
    "input":"<noinput>"

  8. Wrong answers: Some instructions/questions in the original dataset had incorrect answers. About 80% of the math problems are estimated to have incorrect answers.

    "instruction": "Calculate the median of the following data set.",
    "input": "1, 2, 4, 5, 8, 9",
    "output": "5"

    "instruction": "Convert 25m to km.",
    "input": "",
    "output": "25km"

  9. Non-sensical/unclear instructions: Many instructions are unclear; we try to clarify (or rewrite) instructions that are nonsensical. Instructions that are only slightly unclear, but whose meaning can still be deduced, are not altered.

    "instruction": "Freeze the following sample of yogurt for 10 minutes.",
    "input": "Yogurt sample",
    "output": "<noinput>"

    "instruction": "Increase the font size to 12 points.",
    "input": "",
    "output": "The font size has been increased to 12 points."

  10. Extraneous escape and control characters: The original dataset had several entries with extraneous escape and control characters.
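
To make a few of these issues concrete in code, here is a minimal, hedged sketch of how entries with web references (issue 1), placeholder inputs (issue 7), empty or placeholder outputs (issues 3, 5 and 6), and stray control characters (issue 10) could be flagged or normalized. This is an illustration only, not the cleaning pipeline actually used for this dataset, and the file names are hypothetical.

    import json
    import re

    # Illustrative only -- not the project's actual cleaning pipeline.
    URL_RE = re.compile(r"https?://\S+")                                      # issue 1
    EMPTY_INPUT_MARKERS = {"<no input>", "no input", "noinput", "<noinput>"}  # issue 7
    PLACEHOLDER_OUTPUTS = {"", "<no output>", "<nooutput>", "n/a"}            # issues 3, 5, 6

    def clean(entries):
        kept = []
        for e in entries:
            instruction, inp, out = e["instruction"], e["input"], e["output"]
            # Drop entries that ask the model to read external content it cannot see.
            if URL_RE.search(instruction) or URL_RE.search(inp):
                continue
            # Normalize the many spellings of "no input" to an empty string.
            if inp.strip().lower() in EMPTY_INPUT_MARKERS:
                inp = ""
            # Drop entries with empty or placeholder outputs.
            if out.strip().lower() in PLACEHOLDER_OUTPUTS:
                continue
            # Strip stray control characters (issue 10), keeping tabs and newlines.
            out = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", out)
            kept.append({"instruction": instruction, "input": inp, "output": out})
        return kept

    if __name__ == "__main__":
        with open("alpaca_data.json", encoding="utf-8") as f:   # hypothetical input path
            data = json.load(f)
        with open("alpaca_data_cleaned_draft.json", "w", encoding="utf-8") as f:
            json.dump(clean(data), f, ensure_ascii=False, indent=4)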

Finetune Considerations

Compared to the original Alpaca dataset, the average prompt length in this dataset has increased, and a larger number of prompts exceed a length of 256. For this reason, it is recommended to set the maximum prompt length during fine-tuning to at least 512.

[Figure: Distribution of Prompt Lengths]
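
As a rough way to reproduce a distribution like the one above, the sketch below tokenizes each example with a LLaMA tokenizer and counts how many exceed 256 and 512 tokens. The tokenizer id is an assumption, and the count ignores the fixed prompt-template overhead, so treat the numbers as approximate.

    import json
    from transformers import AutoTokenizer

    # Assumed tokenizer id -- any LLaMA-family tokenizer gives a similar estimate.
    tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

    with open("alpaca_data_cleaned.json", encoding="utf-8") as f:
        data = json.load(f)

    # Rough per-example token count (the fixed prompt template adds a few dozen
    # more tokens on top of this).
    lengths = [
        len(tok(e["instruction"] + "\n" + e["input"] + "\n" + e["output"])["input_ids"])
        for e in data
    ]

    print(f"max: {max(lengths)}")
    print(f"over 256 tokens: {sum(n > 256 for n in lengths)}")
    print(f"over 512 tokens: {sum(n > 512 for n in lengths)}")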

Hugging Face Hub

The cleaned dataset is also available on the Hugging Face Hub.
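
A minimal sketch of loading it with the datasets library; the repo id shown is the one this project is commonly published under, but check the Hub for the current location.

    from datasets import load_dataset

    # Repo id assumed to be "yahma/alpaca-cleaned"; verify on the Hub.
    ds = load_dataset("yahma/alpaca-cleaned", split="train")
    print(len(ds), "examples")
    print(ds[0]["instruction"])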

Contributions

With over 52k entries, several issues still remain. Please help out by submitting a pull request.

Goals

The primary goal of this project is to provide a cleaned and curated version of the Alpaca dataset that will improve the performance of natural language processing models trained on it. By removing errors and inconsistencies, the goal is to improve the performance of the fine-tuned LLaMA models and reduce the likelihood of hallucinations.

License

All the code and supporting tools are licensed under Apache-2.0. The original and cleaned Alpaca datasets are licensed under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.

Acknowledgments

The original version of the Alpaca dataset was sourced from tatsu-lab's GitHub repository. We would like to thank the original creators of these datasets for making their data available to the public. We would also like to thank the team at Meta AI for their work in developing LLaMA. Finally, thanks to Q-Blocks Cloud for donating the compute resources that allow us to provide fine-tuned models.

alpacadatacleaned's People

Contributors

claysauruswrecks, dustinreed-info, gururise, hzj5790, josemlopez, kenakafrosty, matusstas, minh-v

alpacadatacleaned's Issues

PIQA dataset's metric

How did PIQA get 78% accuracy?
The README file in the eval folder says the metric is not trustworthy?

Evaluation Metric

Planning on adding an evaluation metric that can be used to benchmark trained alpaca models.

Going to focus on these two datasets for evaluation:

  1. SQuAD dataset - F1 score
  2. WikiText dataset - perplexity

I'm not sure the WikiText perplexity score will give us much useful information, but it seems to be a popular metric for these foundation models. I'm more interested in the SQuAD F1 score, which will give us a standard benchmark for a Q/A task. Even though Alpaca is not trained on a SQuAD-style dataset, I think with the right prompt it can be done.
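
For the SQuAD F1 idea, a minimal sketch using the squad metric from the Hugging Face evaluate library could look like this; the id and texts are made-up placeholders, and the model's generated answer would go in prediction_text.

    import evaluate

    squad = evaluate.load("squad")

    # Placeholder example -- in practice, ids and references come from the SQuAD
    # dev set and prediction_text from the fine-tuned model's generation.
    predictions = [{"id": "q1", "prediction_text": "Paris"}]
    references = [{"id": "q1",
                   "answers": {"text": ["Paris"], "answer_start": [0]}}]

    print(squad.compute(predictions=predictions, references=references))
    # e.g. {'exact_match': 100.0, 'f1': 100.0}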

Separate instructions by functionality

There are some specific instructions, like generating images or models or searching the internet, that a model trained on this dataset simply learns to refuse.

    {
        "instruction": "Design a company logo",
        "input": "",
        "output": "<nooutput>"
    }

To make the dataset unbiased, we could split some of these instructions into separate files. For example:

no_image_generation.json

    {
        "instruction": "Design a company logo",
        "input": "",
        "output": "This type of instruction cannot be fulfilled by a GPT model."
    }

So the end user could use a custom JSON like:
image_generation.json

    {
        "instruction": "Design a company logo",
        "input": "",
        "output": "<image-api>company logo"
    }
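
A hedged sketch of what that split could look like in practice; the file names are illustrative and not part of the repo, and splitting further by category (image generation, internet search, ...) would need instruction-level classification on top of this.

    import json

    # Move refusal-style entries (placeholder outputs) into their own file so
    # end users can substitute a custom version. File names are illustrative.
    with open("alpaca_data_cleaned.json", encoding="utf-8") as f:
        data = json.load(f)

    refusals, rest = [], []
    for e in data:
        if e["output"].strip().lower() in {"<nooutput>", "<no output>"}:
            refusals.append(e)
        else:
            rest.append(e)

    with open("refusal_instructions.json", "w", encoding="utf-8") as f:
        json.dump(refusals, f, ensure_ascii=False, indent=4)
    with open("alpaca_data_core.json", "w", encoding="utf-8") as f:
        json.dump(rest, f, ensure_ascii=False, indent=4)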

ModuleNotFoundError: No module named 'utils'

I'm trying to run

python -m generate_instruction generate_instruction_following_data \
  --output_dir ./ \
  --num_instructions_to_generate 10 \
  --model_name="text-davinci-003"

Then I get the error "ModuleNotFoundError: No module named 'utils'".

Idea about better cleaning

  1. We should probably move cleaned data from one file to another so it doesn't need to be checked again and again.
  2. For another model, people prepared a Telegram bot so that users could read a random question/answer pair and choose a button:
  • Everything correct
  • Wrong question
  • Wrong answer
  • Skip it

Maybe it makes sense to require duplicate confirmations...

Any chance we could improve the dataset beyond fixing?

Would that be relevant in the scope of this project? Adding a few more kinds of task examples could improve its generalization capabilities, for instance:

  • Longer responses
  • GPT-4-generated responses for task types it already has
  • Roleplaying
  • Chain of thought

etc.

Is the "alpaca_data_cleaned_archive.json" file having all cleaned data?

I just want to confirm the content of the file "alpaca_data_cleaned_archive.json". I found that "alpaca_data_cleaned.json" is 44.3 MB, while "alpaca_data_cleaned_archive.json" is only 22.5 MB. Based on my understanding, the second file should combine all the cleaned data from the first one, with the remainder replaced by outputs generated by GPT-4. Therefore, the sizes of these two files should be similar, and "alpaca_data_cleaned_archive.json" might give LLaMA better results after fine-tuning. Is that right?

Diffs as data

What if we used the diffs from all this cleaning effort to train a model to do the cleaning?

The MNLI score in lm-evaluation-harness

Thanks for the great work!

I'm trying to reproduce the results you report. I downloaded the model weights from https://huggingface.co/yahma/alpaca-7b-lora and evaluated them with the lm-evaluation-harness framework, but I only got 41.7% accuracy on the MNLI dataset.

When using lm-evaluation-harness, did you apply any other data-processing tricks to get 51.6% accuracy?

overall approach

I came across this quote in an Anthropic paper and thought I would share it:

"we first pre-train on large public datasets that encode human preference information, such as Stack Exchange, Reddit, and Wikipedia edits, before finetuning on smaller datasets encoding more specific human preferences" Askell et al

This is interesting because this project's philosophy is to focus on a small, clean dataset (~30k rows). But there are large preference datasets out there, e.g. SHP at 300k rows or stack-exchange-preferences at 10M rows. It looks like Anthropic believes a two-stage training approach might work quite well.

I also want to point you to this project, where they used Flan as a base instead of LLaMA. The result has a commercially compatible license and may be better, simply because Flan was trained on forums while LLaMA missed out on this dialogue pre-training.

Hopefully this is interesting, that is all ;p

Correct or potentially to be cleaned?

During the short time that I have been helping here, I have noticed that we divide the data into 3 categories:
A) Cleaned
B) Correct
C) Potentially to be cleaned

Determining whether an entry falls into category A is very easy, because that is just the diff between the datasets. However, deciding whether it is B or C is not so simple. Therefore, I think we should be able to mark data as correct. It would be nice to add another parameter, for example done (True/False), so that we and potential new contributors don't have to deal with data that is already correct.

This would speed up the work and give a better overview of the data itself. We would also see our progress. At the end, of course, we would delete that variable.

We could either incorporate it into the already-created GUI or do something else. It would be nice if we could vote there on whether an entry is OK or not yet, so that it is not the decision of only one person. I believe most people understand me :)

80% of math outputs are wrong

As noted in PR #3, many math outputs are incorrect (estimated at nearly 80% by @HideLord). Ideally, anyone who wants to work on these issues should follow the same format:

Original Wrong Answer:

    "instruction": "Calculate the median of the following data set.",
    "input": "1, 2, 4, 5, 8, 9",
    "output": "5"

Corrected Answer (with Chain of Thought):

    "instruction": "Calculate the median of the following data set.",
    "input": "1, 2, 4, 5, 8, 9",
    "output": "The median is (4+5)/2 = 4.5"

To identify math problems, one can use the regex provided:

    "input": "([\d\^\+\-\/yx=,\(\) ]|\\u00b\d)+"

Modifying an existing tool script (in the tools directory) to use this regex would be easy. The first 200 math questions have been checked in the above PR.
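
A small sketch of applying that regex to the decoded JSON; after decoding, the \u00b2 and \u00b3 escapes become the literal superscript characters, so they are added to the character class directly. The file name is the repo's main dataset file.

    import json
    import re

    # Adapted from the regex above, for use on decoded JSON strings.
    MATH_INPUT = re.compile(r"^[\d^+\-/yx=,()²³ ]+$")

    with open("alpaca_data_cleaned.json", encoding="utf-8") as f:
        data = json.load(f)

    math_entries = [e for e in data
                    if e["input"] and MATH_INPUT.fullmatch(e["input"])]
    print(f"{len(math_entries)} entries look like bare math expressions")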

Adding scripts for data cleaning

Thanks for your work! Did you make the changes primarily by manually examining the dataset? If scripts were used for the cleanup, it would be helpful for others to have access to those scripts as well. I am in the process of creating a dataset for the German language using GPT and will likely run into similar issues, so it would be nice to have access to these scripts. What are your thoughts on this? I know it's not straightforward; when I cleaned up my lists, it was also a combination of regex and manually monitoring the changes.

Hosting your dataset on the Hugging Face Hub

Hi @gururise, this is a really cool project and great job identifying all these problems with the Alpaca dataset!

Would you be interested in hosting this on the Hugging Face Hub (https://huggingface.co/datasets)? The Alpaca dataset is also hosted there (link) and your version would be of wide interest to the community :)

If that sounds interesting, you just need to create a new dataset repo and upload your alpaca_cleaned.json file through the UI. For more details, you can check out our docs here
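
Besides the web UI, a hedged programmatic sketch with huggingface_hub would be (the repo id is a placeholder):

    from huggingface_hub import HfApi

    api = HfApi()  # assumes you are logged in via `huggingface-cli login`
    repo_id = "your-username/alpaca-cleaned"  # placeholder repo id

    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    api.upload_file(
        path_or_fileobj="alpaca_cleaned.json",
        path_in_repo="alpaca_cleaned.json",
        repo_id=repo_id,
        repo_type="dataset",
    )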

good job

Can you briefly introduce the method you used in this cleaning work? Thanks!

Contributing to the dataset curation with Argilla and the Alpaca Garbage collector

Hi @gururise, first of all, thanks and congrats on this important effort.

I'm Dani from Argilla. We've spent some time looking at data-quality issues in the Alpaca dataset and its translations. We're helping the teams behind the Spanish and German efforts use Argilla to flag bad or problematic instructions so that they can be fixed later (either manually or with post-processing).

Along the way, we've spent some time labeling AlpacaDataCleaned. It already has good quality, but there are still examples to improve, so we'd like to contribute.

Today we have released this model to help teams clean up Alpaca translations, but it can be used to contribute to this repo too: https://huggingface.co/argilla/alpaca-garbage-collector-multilingual

We've also deployed this space for browsing and validating the records. This is what it shows for last night's version of AlpacaCleaned (login with argilla/1234).

We plan to spend some more time labeling and contributing back to this project. My question is whether it would be possible to share a set of flagged records (with positional ids as in the original JSON) with you, to make sure we edit them in the right way - for example, what to do with requests related to attached photos, paintings, and so on.

Identify code snippet in "input" fields

I want to translate the training data into another language with Google translate, but code snippets should not be translated, so I have to replace code snippets with placeholders before translating.

All code snippets in "output" fields are quoted with triple backticks, so they're quite easy to identify. But code snippets in "input" fields aren't quoted.

Any suggestions on identifying those in "input" fields?
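
One rough starting point, purely as a heuristic sketch (it will produce both false positives and false negatives), is to look for punctuation and keywords that rarely occur in natural-language inputs:

    import re

    # Heuristic only: flag inputs that contain typical programming punctuation
    # or keywords so they can be protected from machine translation.
    CODE_HINTS = re.compile(
        r"(\bdef \w+\s*\(|\breturn\b|\bprint\s*\(|==|!=|&&|\|\||[{}]|</?\w+>|"
        r"\bSELECT\b.*\bFROM\b)",
        re.IGNORECASE,
    )

    def looks_like_code(text: str) -> bool:
        return bool(CODE_HINTS.search(text))

    print(looks_like_code("for i in range(10): print(i)"))     # True
    print(looks_like_code("The cold breeze chills my bones."))  # False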

How to format dataset fields in model prompt?

Hi, I'm looking to fine-tune an LLM using this dataset and was wondering if there's any advice on how to format the prompt given the instruction vs. input fields.

For example consider these entries:

  {
    "output":"The author has used personification in the sentence \"The cold breeze chills my bones.\" Personification is a figure of speech in which a non-human subject is given human characteristics. In this case, the non-human subject is the cold breeze, which is given the human characteristic of being able to chill someone's bones.",
    "input":"The cold breeze chills my bones.",
    "instruction":"Identify a stylistic device used by the author in the following sentence."
  }

 {
    "output":"Two players from the Kansas City Chiefs team are Patrick Mahomes and Tyreek Hill.",
    "input":"",
    "instruction":"Name two players from the Chiefs team?"
  }

I imagine two approaches:

  1. Use the "instruction" as the system prompt, and the "input" as the first user chat message (which would often be empty though)...
  2. Concatenate the instruction + input fields into a single (first) user chat message.

I think I'll use approach 2 but would appreciate any insights or references on this topic :)
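
For reference, the original Stanford Alpaca training code essentially takes approach 2, wrapping both fields in a fixed preamble. A sketch of that template is below; verify the exact wording against the upstream repo before depending on it.

    # Prompt format used by the Stanford Alpaca training code (approach 2):
    # a fixed preamble plus the instruction, with the input section included
    # only when the "input" field is non-empty.
    PROMPT_WITH_INPUT = (
        "Below is an instruction that describes a task, paired with an input that "
        "provides further context. Write a response that appropriately completes "
        "the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
    )
    PROMPT_NO_INPUT = (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:\n"
    )

    def build_prompt(example: dict) -> str:
        template = PROMPT_WITH_INPUT if example.get("input") else PROMPT_NO_INPUT
        return template.format(**example)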
