
continuallm's Introduction

ContinualLM


Imagine an LM that not only acquires new knowledge effortlessly but also retains previously learned skills, all while transferring knowledge across tasks. Is that even possible?

News

🔥 We have added checkpoints on Hugging Face for easier reproduction!
🔥 We have added continual_pretrain.ipynb as a self-contained example of the soft-masking scenario. It runs well without GPUs!
🔥 Soft-masking also works in conventional continual fine-tuning. Check out our latest EMNLP 2023 paper!
🔥 Wondering whether you can adapt a black-box LLM without worrying about updating its parameters? Check out our latest paper on retrieval-augmented generation (RAG) here!


Introduction

In 2021, we introduced Pycontinual, a straightforward and flexible framework for continual learning. Our research has benefited significantly from this framework. Today, we are excited to share the ContinualLM, an extensible continual learning framework focused on language models (LMs), designed to sustain the benefits of continual learning (CL) in this field.

Continual learning for LMs is distinct from traditional CL because:

  • Each task is a domain-specific corpus (at present, our primary focus is domain-adaptive pre-training, also known as pre-finetuning or post-training).
  • Evaluation is performed by fine-tuning on the corresponding end task (see the sketch after this list).
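A minimal sketch of this two-stage protocol is shown below. The helper functions, the domain names used as corpus stand-ins, and the RoBERTa backbone are illustrative assumptions, not the repository's actual training code:

from transformers import AutoModelForMaskedLM, AutoTokenizer

# Hypothetical helpers: they only illustrate the control flow, not this repository's code.
def domain_adaptive_pretrain(model, tokenizer, domain_corpus):
    # Continue masked-language-model pre-training on one domain corpus (stub).
    return model

def finetune_end_task(model, tokenizer, end_task):
    # Fine-tune a copy of the model on the domain's end task and return a score (stub).
    return 0.0

domains = ["restaurant", "phone", "camera", "acl", "ai", "pubmed"]   # one corpus per task

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

for t, domain in enumerate(domains):
    model = domain_adaptive_pretrain(model, tokenizer, domain)       # continual pre-training on domain t
    for seen in domains[: t + 1]:                                    # evaluate every task seen so far
        score = finetune_end_task(model, tokenizer, end_task=seen)   # separate fine-tuning per end task

The nested loop mirrors the scripts below: pre-train on domain t, then fine-tune separately on every end task seen so far.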

Our repository includes a PyTorch implementation of a collection of state-of-the-art (SoTA) methods, using the same training and evaluation pipeline, and is committed to advancing the field of continual learning for LMs. The available methods are listed as --baseline choices in config.py.

Simple Example

We have added continual_pretrain.ipynb as a self-contained example of the soft-masking scenario. It runs well without GPUs!

Dataset

When it comes to the continual learning of language models (LMs), finding appropriate datasets is crucial. The datasets we provide adhere to the following principles:

  • Domain-specific: The domain corpus must be specific enough to enhance end-task performance.
  • End-task available: We favor assessing the trained language models through the end-task rather than relying on perplexity, since the former represents a more dependable evaluation approach.

We release our dataset comprising 6 distinct domains, each accompanied by its corresponding end-task. The dataset can be found here. Below are some statistics for each domain:

Domain          | Corpus Size | End-task   | Task                                    | #Training | #Testing | #Classes
Yelp Restaurant | 758MB       | Restaurant | Aspect Sentiment Classification (ASC)   | 3,452     | 1,120    | 3
Amazon Phone    | 724MB       | Phone      | Aspect Sentiment Classification (ASC)   | 239       | 553      | 2
Amazon Camera   | 319MB       | Camera     | Aspect Sentiment Classification (ASC)   | 230       | 626      | 2
ACL Papers      | 867MB       | ACL        | Citation Intent Classification          | 1,520     | 421      | 6
AI Papers       | 507MB       | AI         | Relation Classification                 | 2,260     | 2,388    | 7
PubMed Papers   | 989MB       | PubMed     | Chemical-protein Interaction Prediction | 2,667     | 7,398    | 13

Architecture

The architecture of ContinualLM largely follows that of Pycontinual, CPT and DGA.

Installation

conda create --name continuallm --file requirements.txt

⚠️ Our model is based on transformers==4.17.0 and adapter-transformers==3.0.1. We recommend using these specific versions, as using other versions may result in unexpected bugs.
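To double-check that the pinned versions are actually installed, you can use a small sketch like the following (not part of the repository):

from importlib.metadata import PackageNotFoundError, version

# Verify that the versions pinned above are the ones installed in the environment.
for pkg, expected in [("transformers", "4.17.0"), ("adapter-transformers", "3.0.1")]:
    try:
        installed = version(pkg)
    except PackageNotFoundError:
        installed = "not installed"
    status = "OK" if installed == expected else "MISMATCH"
    print(f"{pkg}: installed={installed}, expected={expected} -> {status}")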

Domain-adaptive Pre-training

This is where continual learning happens: the model learns a sequence of domains, one after another.

max_samples=640000
for idrandom in 0
do
  for pt_task in 0 1 2 3 4 5
  do
    python -m torch.distributed.launch --nproc_per_node 4 --use_env posttrain.py \
      --per_device_train_batch_size 62 \
      --fp16 \
      --max_seq_length 164 \
      --max_samples ${max_samples} \
      --idrandom ${idrandom} \
      --ntasks 6 \
      --pt_task ${pt_task} \
      --baseline 'das'
  done
done
  • --idrandom: choose the task sequence. See ./sequences for more details.
  • --baseline: the continual learning method to use (see the available choices in config.py).

End-task Fine-tuning

After continual learning of the LM, we evaluate its performance by running end-task fine-tuning individually. For each pre-training checkpoint pt_task, we fine-tune on every end task seen so far (ft_task ranges from 0 to pt_task).

max_samples=640000
seed=(2021 111 222 333 444 555 666 777 888 999)
for round in 0
do
  for idrandom in 0
  do
    for pt_task in 0 1 2 3 4 5
    do
      for ft_task in $(seq 0 ${pt_task})
      do
        python finetune.py \
          --max_seq_length 164 \
          --pt_task ${pt_task} \
          --ft_task ${ft_task} \
          --idrandom ${idrandom} \
          --ntasks 6 \
          --max_samples ${max_samples} \
          --seed ${seed[$round]} \
          --baseline 'das'
      done
    done
  done
done

Checkpoints in Hugging Face

For those who are interested solely in the resulting model, or who want to continue pre-training it with their own data, we have good news: we offer checkpoints through Hugging Face.

You can easily import our continually post-trained model with HuggingFace's transformers!

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Import our model. The package will take care of downloading the models automatically
tokenizer = AutoTokenizer.from_pretrained("UIC-Liu-Lab/DAS-Rest2Cam")
model = AutoModelForSequenceClassification.from_pretrained("UIC-Liu-Lab/DAS-Rest2Cam", trust_remote_code=True)

# Tokenize input texts
texts = [
    "There's a kid on a skateboard.",
    "A kid is skateboarding.",
    "A kid is inside the house."
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Get the model output!
res = model(**inputs)

If you encounter any problems when loading the models directly through Hugging Face's API, you can also download them manually from the repo and use model = AutoModel.from_pretrained({PATH TO THE DOWNLOADED MODEL}).
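For example, assuming the files were downloaded into a local directory named ./DAS-Rest2Cam (the directory name here is hypothetical):

from transformers import AutoModel, AutoTokenizer

local_path = "./DAS-Rest2Cam"   # hypothetical local directory containing the downloaded files
tokenizer = AutoTokenizer.from_pretrained(local_path)
model = AutoModel.from_pretrained(local_path, trust_remote_code=True)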

⚠ The continual pre-training sequence is the first sequence in ./sequences/posttrain (from Restaurant to Camera); you can use the downloaded weights to fine-tune the corresponding end tasks.

⚠ If you are interested in the importance files, please refer to before_distill0 and after_mlm{domain_id}. before refers to the importance computed once before pre-training on the first domain, capturing general pre-trained knowledge; after refers to the importance computed after pre-training on domain_id.
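For intuition, here is a sketch of how such importance scores are typically applied as a soft mask on gradients. The shapes and values below are invented, and the real file format and masking logic are defined by the repository code (e.g. posttrain.py):

import torch

# Illustration only: a fake importance vector for one layer stands in for a loaded file.
num_units = 768                          # e.g. hidden units of one Transformer layer
importance = torch.rand(num_units)       # stand-in for an importance vector in [0, 1]

grad = torch.randn(num_units)            # stand-in for that layer's gradient during pre-training
masked_grad = grad * (1.0 - importance)  # soft-masking: important units receive smaller updates,
                                         # protecting previously learned knowledge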

Reference

We would greatly appreciate it if you star this repository and cite our work; your recognition is highly valued.

  
@inproceedings{ke2022dgs,
  title={Continual Learning of Language Models},
  author={Ke, Zixuan and Shao, Yijia and Lin, Haowei and Konishi, Tatsuya and Kim, Gyuhak and Liu, Bing},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2023}
}

@inproceedings{ke2022dga,
  title={Adapting a Language Model While Preserving its General Knowledge},
  author={Ke, Zixuan and Shao, Yijia and Lin, Haowei and Xu, Hu and Shu, Lei and Liu, Bing},
  booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
  year={2022}
}

@inproceedings{ke2022continual,
  title={Continual Training of Language Models for Few-Shot Learning},
  author={Ke, Zixuan and Lin, Haowei and Shao, Yijia and Xu, Hu and Shu, Lei and Liu, Bing},
  booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
  year={2022}
}

Contact

If you have any questions regarding the code, please feel free to send an email to Zixuan Ke, Yijia Shao, or Haowei Lin. Alternatively, you may open an issue. We would like to express our gratitude to Bing Liu, Hu Xu, and Lei Shu for their valuable comments and opinions.


continuallm's Issues

low accuracy

Why does choosing the restaurant dataset in sequence 1 for incremental training and then fine-tuning on the downstream restaurant task give such low accuracy (last epoch: macro_f1 = 0.2626, acc = 0.6500)?

question about virtual parameter

I have a question about the virtual parameter. In your paper,

[screenshots of the relevant equations from the paper]

By the chain rule, ∂L/∂g_l = (∂L/∂ô_l) × (∂ô_l/∂g_l) = (∂L/∂ô_l) × o_l. So we can just view the gradient of g_l as ∂L/∂ô_l times o_l. Am I right?
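A quick autograd check of this identity (a minimal sketch; the tensors and the loss below are invented purely for illustration):

import torch

o = torch.randn(5)                        # layer output o_l
g = torch.ones(5, requires_grad=True)     # virtual (mask) parameter g_l
o_hat = g * o                             # masked output ô_l = g_l ⊙ o_l

loss = (o_hat ** 2).sum()                 # any differentiable loss
loss.backward()

grad_wrt_o_hat = 2 * o_hat                # ∂L/∂ô_l for this particular loss
print(torch.allclose(g.grad, grad_wrt_o_hat * o))   # True: ∂L/∂g_l = (∂L/∂ô_l) ⊙ o_l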

A bug in the adapter-based methods

Hello! I am running the classic adapter baseline in your framework, and it seems a function is missing from the code.
When execution reaches line 24 of approaches/compute_loss.py, it reports 'Appr' object has no attribute 'mask'.

I would greatly appreciate it if you could fix the classic adapter method.

Can ContinualLM learn new words from the real world?

This is a good idea for continual learning, but I doubt that this mechanism can learn genuinely new knowledge.

Because the weights and nodes are shaped by the LLM's training data, the soft masks may not find appropriate nodes and weights when a new concept appears.

So I hope for your explanation, or please tell me if my inference is wrong.

TypeError: __init__() got an unexpected keyword argument 'location_key'

    self.self = RobertaSelfAttention(
TypeError: __init__() got an unexpected keyword argument 'location_key'

Hi, thank you for sharing your great work. However, I encountered this error when trying to run your code. I have also checked transformers version 4.17, and RobertaSelfAttention does not accept this argument there either.

Can't reproduce the results in DAS

Hello, we have recently been following your impressive work on DAS. We have attempted to reproduce the main results presented in the paper as well as the results of swapping sequence order mentioned in the appendix (Table 5). However, we have found that our results fall short of those reported in the paper, as shown in the following diagram.

[screenshot comparing the reproduced results with those reported in the paper]

In addition, we haven't modified any related code except to fix necessary bugs, and our training script is provided below. Do you have any insights regarding this inconsistency in results? Looking forward to your response.

max_samples=640000 
for idrandom in 0 1 2 3 4
do    
  for pt_task in 0 1 2 3 4 5   
  do    
    python -m torch.distributed.launch --nproc_per_node 8 --use_env posttrain.py \
    --per_device_train_batch_size 32 \
    --fp16 \
    --max_seq_length 164 \
    --max_samples ${max_samples} \
    --idrandom ${idrandom} \
    --ntasks 6 \
    --pt_task ${pt_task} \
    --baseline 'das'
  done
done  
max_samples=640000
seed=(2021 111 222 333 444 555 666 777 888 999)

for round in 0;
do
  for idrandom in 0 1 2 3 4;
  do
  for pt_task in 0 1 2 3 4 5
    do
      for ft_task in $(seq 0 ${pt_task});
        do
          CUDA_VISIBLE_DEVICES=0 python finetune.py \
          --max_seq_length 164 \
          --pt_task ${pt_task} \
          --ft_task ${ft_task} \
          --idrandom ${idrandom} \
          --ntasks 6 \
          --max_samples ${max_samples} \
          --seed ${seed[$round]} \
        --baseline 'das'
      done
    done
  done
done

Trouble achieving results in paper

Hello, I attempted to replicate the results of the DAS baseline using your code, but encountered discrepancies with the results presented in the paper. I post-trained the model with a batch size of 16 due to resource limitations, while the README suggests a batch size of 62. Below are the F1 scores and accuracies obtained:

Acc:
0.8679 0.0000 0.0000 0.0000 0.0000 0.0000
0.8616 0.7387 0.0000 0.0000 0.0000 0.0000
0.8616 0.7387 0.6889 0.0000 0.0000 0.0000
0.8616 0.7387 0.6889 0.8246 0.0000 0.0000
0.8616 0.7387 0.6889 0.8246 0.6880 0.0000
0.8616 0.7387 0.6889 0.8246 0.6880 0.8051

F1:
0.7925 0.0000 0.0000 0.0000 0.0000 0.0000
0.7858 0.7002 0.0000 0.0000 0.0000 0.0000
0.7858 0.7002 0.5529 0.0000 0.0000 0.0000
0.7858 0.7002 0.5529 0.7909 0.0000 0.0000
0.7858 0.7002 0.5529 0.7909 0.6880 0.0000
0.7858 0.7002 0.5529 0.7909 0.6880 0.6751

Could the discrepancy be related to the difference in batch size, or am I missing something else? Your insights would be appreciated. Thank you.
