Comments (7)
Update: The problem is fixed by using the latest version of datasets (v2.17.1), so my above fix is not needed.
So all you need to do is change the requirements in setup.py to the latest version of datasets. I kind of suspect that a lot of the weird issues people have (like in #45) might be because of this.
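A quick way to see whether an environment is affected is to compare the installed datasets version against the one reported here as working. The following is a minimal sketch using only the standard library; the helper names (`parse_version`, `is_recent_enough`, `installed_datasets_version`) are made up for illustration, and the simple numeric parsing is an assumption that ignores pre-release suffixes.

```python
from importlib import metadata

# Version reported in this thread as fixing the problem.
MIN_DATASETS = (2, 17, 1)


def parse_version(text):
    """Turn '2.14.6' into (2, 14, 6), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_recent_enough(installed, minimum=MIN_DATASETS):
    """Return True if the version string is at least the minimum tuple."""
    return parse_version(installed) >= minimum


def installed_datasets_version():
    """Return the installed datasets version string, or None if it is missing."""
    try:
        return metadata.version("datasets")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_datasets_version()
    if found is None:
        print("datasets is not installed")
    elif is_recent_enough(found):
        print(f"datasets {found} is at least 2.17.1")
    else:
        print(f"datasets {found} is older than 2.17.1; consider upgrading")
```

If the check reports an older version, upgrading datasets (and raising the pinned version in the requirements) is the change suggested above.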
When using the current code with the versions of software suggested in
Line 47 in 87cc800
from alignment-handbook.
See alignment-handbook/scripts/run_dpo.py (line 91 in 87cc800) and alignment-handbook/scripts/run_sft.py (line 98 in 87cc800).
Problem is fixed by replacing:
raw_datasets = raw_datasets.map(
    apply_chat_template,
    fn_kwargs={"tokenizer": tokenizer, "task": "dpo"},
    num_proc=data_args.preprocessing_num_workers,
    remove_columns=column_names,
    desc="Formatting comparisons with prompt template",
)
with
raw_datasets = raw_datasets.map(
    apply_chat_template,
    fn_kwargs={"tokenizer": tokenizer, "task": "dpo"},
    num_proc=data_args.preprocessing_num_workers,
    desc="Formatting comparisons with prompt template",
)
raw_datasets = raw_datasets.map(lambda x: x, remove_columns=column_names)
I don't know why, but the datasets library (v2.14.6) is not properly removing the columns after processing them, so the columns need to be removed in a second map call after the first one.
Are you sure about this? As far as I know the "remove columns" is only called after updating the dataset. So the chat template is applied, new columns are added and all the previous (old) columns are removed. I think that's how it has always worked. Cc @lhoestq
This is how it should be, but it must be bugged in this old version of datasets.
Yes, I tested it with prints inside the apply_chat_template. It doesn't print with datasets==2.14.6.
As far as I know the "remove columns" is only called after updating the dataset. So the chat template is applied, new columns are added and all the previous (old) columns are removed. I think that's how it has always worked. Cc @lhoestq
Actually, remove_columns is taken into account while the dataset itself is being updated, so that unnecessary data does not have to be written.
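To make that description concrete, here is a toy model in plain Python of a map that drops the listed columns in the same pass that writes each row. It is only a sketch of the behavior described in this reply, not the datasets library's implementation; `map_rows` and `add_text_prompt` are hypothetical names, and `add_text_prompt` is a stand-in for apply_chat_template that returns just the new column.

```python
def map_rows(rows, fn, remove_columns=()):
    """Toy model: apply fn to each row and drop remove_columns in the same pass."""
    drop = set(remove_columns)
    out = []
    for row in rows:
        # fn receives a copy of the full original row, including columns to be removed.
        new_columns = fn(dict(row))
        # Original columns listed for removal are never written to the output row.
        kept = {k: v for k, v in row.items() if k not in drop}
        # Columns returned by fn are always written in this model.
        kept.update(new_columns)
        out.append(kept)
    return out


def add_text_prompt(row):
    """Hypothetical stand-in for apply_chat_template: returns only the new column."""
    return {"text_prompt": "<|user|>\n" + row["prompt"]}


rows = [{"prompt": "Hi", "chosen": "a", "rejected": "b"}]
result = map_rows(rows, add_text_prompt, remove_columns=["prompt", "chosen", "rejected"])
print(result)
```

In this model the function still sees the original columns, but only the newly produced column survives in the output, with no second pass over the data.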
Hi @AlexiaJM
Thanks for opening the thread. I was unable to replicate the error you reported using datasets==2.14.6. I inserted a breakpoint after the map call and inspected the processed dataset. The prompt is in the correct format.
Here's what the first item of the processed UltraFeedback Binarized dataset for Zephyr training looks like:
'prompt': '<|system|>\n</s>\n<|user|>\nDo you know something about crystallography and structure factor?</s>\n'
Seems like the effect of map should be checked after the whole map call is completed.
Regards,
Eric.