
Comments (7)

AlexiaJM commented on June 8, 2024

Update: The problem is fixed by upgrading datasets to v2.17.1 (the latest version at the time of writing), so my earlier workaround is not needed.

So all you need to do is change the requirements in setup.py to the latest version of datasets. I kind of suspect that a lot of the weird issues people have (like in #45) might be because of this.

When using the current code with the software versions suggested in setup.py ("datasets==2.14.6"), the chat template is not applied!
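A quick way to see which side of the reported fix an environment is on is to compare the installed datasets version against 2.17.1. The following is only a sketch: the helper names `version_tuple` and `datasets_is_recent_enough` are made up for illustration, and the threshold is simply the version reported as working in this thread, not an officially documented minimum.

```python
from importlib import metadata

# Version reported in this thread as fixing the issue (an assumption, not an official minimum).
MIN_DATASETS = (2, 17, 1)


def version_tuple(version: str) -> tuple:
    """Turn '2.14.6' into (2, 14, 6), ignoring any non-numeric suffix in each part."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:  # pad short versions such as '2.17' to three parts
        parts.append(0)
    return tuple(parts)


def datasets_is_recent_enough(installed=None) -> bool:
    """Return True if the installed (or given) datasets version is >= MIN_DATASETS."""
    if installed is None:
        try:
            installed = metadata.version("datasets")
        except metadata.PackageNotFoundError:
            return False  # datasets is not installed at all
    return version_tuple(installed) >= MIN_DATASETS
```

Calling `datasets_is_recent_enough()` with no argument checks the current environment; passing a version string checks a pinned requirement instead.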

from alignment-handbook.

AlexiaJM commented on June 8, 2024

See ##################### and #####################.


AlexiaJM commented on June 8, 2024

Problem is fixed by replacing:

    raw_datasets = raw_datasets.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer, "task": "dpo"},
        num_proc=data_args.preprocessing_num_workers,
        remove_columns=column_names,
        desc="Formatting comparisons with prompt template",
    )

with

    raw_datasets = raw_datasets.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer, "task": "dpo"},
        num_proc=data_args.preprocessing_num_workers,
        desc="Formatting comparisons with prompt template",
    )
    raw_datasets = raw_datasets.map(lambda x: x, remove_columns=column_names)

I don't know why, but the datasets library (v2.14.6) does not properly remove the columns after processing them. So the old columns need to be removed in a separate map call after the first one.


BramVanroy commented on June 8, 2024

Are you sure about this? As far as I know the "remove columns" is only called after updating the dataset. So the chat template is applied, new columns are added and all the previous (old) columns are removed. I think that's how it has always worked. Cc @lhoestq


AlexiaJM commented on June 8, 2024

This is how it should be, but it must be bugged in this old version of datasets.

Yes, I tested it with prints inside the apply_chat_template. It doesn't print with datasets==2.14.6.


lhoestq commented on June 8, 2024

> As far as I know the "remove columns" is only called after updating the dataset. So the chat template is applied, new columns are added and all the previous (old) columns are removed. I think that's how it has always worked. Cc @lhoestq

Actually, remove_columns is taken into account while the dataset itself is being updated, so that unnecessary data does not have to be written.
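To make the behaviour described above concrete, here is a toy model in plain Python. It is not the datasets implementation: `toy_map` and `add_text` are hypothetical names, and the real library's internals may differ. The sketch only shows the idea that the function still receives every original column, while the columns listed in remove_columns are dropped as each output row is built rather than in a later pass.

```python
def toy_map(rows, function, remove_columns=()):
    """Toy stand-in for a map call with remove_columns (NOT the datasets implementation)."""
    remove = set(remove_columns)
    result = []
    for row in rows:
        # The function receives a copy of the row with all original columns present.
        new_columns = function(dict(row))
        # Removed columns are skipped while the output row is assembled,
        # so they are never written to the result.
        kept = {k: v for k, v in row.items() if k not in remove}
        result.append({**kept, **new_columns})
    return result


def add_text(row):
    """Hypothetical formatting function that derives new columns from old ones."""
    return {
        "text_chosen": "<t>" + row["chosen"],
        "text_rejected": "<t>" + row["rejected"],
    }


rows = [{"chosen": "a", "rejected": "b"}]
out = toy_map(rows, add_text, remove_columns=["chosen", "rejected"])
```

In this model the output rows contain only the new text columns, which matches the outcome both descriptions in the thread agree on; they differ only in when the removal happens.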


EriChen0615 commented on June 8, 2024

Hi @AlexiaJM

Thanks for opening the thread. I was unable to replicate the error you reported using datasets==2.14.6. I inserted a breakpoint after the map call and inspected the processed dataset. The prompt is in the correct format.

Here's what the first item of the processed UltraFeedback Binarized dataset for Zephyr training looks like:

'prompt': '<|system|>\n</s>\n<|user|>\nDo you know something about crystallography and structure factor?</s>\n'

Seems like the effect of map should be checked after the whole map call is completed.

Regards,
Eric.
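The check suggested in this comment (inspecting the result once the whole map call has finished) could be automated along these lines. This is a minimal sketch: `missing_chat_markers` is a hypothetical helper, and the `<|user|>` marker is an assumption taken from the Zephyr-style sample prompt quoted above. Other chat templates use different tokens.

```python
def missing_chat_markers(prompts, markers=("<|user|>",)):
    """Return the indices of prompts that contain none of the expected template markers."""
    return [
        i for i, prompt in enumerate(prompts)
        if not any(marker in prompt for marker in markers)
    ]


# First prompt: the templated sample quoted in this thread.
# Second prompt: the same question without any template applied.
templated = (
    "<|system|>\n</s>\n<|user|>\n"
    "Do you know something about crystallography and structure factor?</s>\n"
)
raw = "Do you know something about crystallography and structure factor?"

untemplated_indices = missing_chat_markers([templated, raw])
```

An empty result for the processed prompt column would suggest the template was applied. A non-empty result lists the rows worth inspecting by hand.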
