
PUNR: Pre-training with User Behavior Modeling for News Recommendation

Codebase for [EMNLP 2023 Findings] PUNR: Pre-training with User Behavior Modeling for News Recommendation

Overview

In this work, we propose an unsupervised pre-training paradigm with two tasks for effective user behavior modeling.

  1. User Behavior Masking: This task aims to recover the masked user behaviors based on their contextual behaviors.
  2. User Behavior Generation: We further incorporate a novel auxiliary user behavior generation pre-training task to enhance the user representation vector derived from the user encoder.
[Figure: nr_model — overview of the PUNR model architecture]
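
To make the two objectives concrete, here is a toy sketch of how a masked-behavior recovery loss and an auxiliary generation loss can be combined into a single pre-training loss. All tensor shapes and names below are illustrative placeholders, not the repository's actual implementation; the weighting mirrors the --dec_head_coef option used in the pre-training commands later in this README.

import torch
import torch.nn.functional as F

# Toy illustration (hypothetical shapes/names, not the repository's code):
# a batch of 2 users, a vocabulary of 10 tokens, 6 behavior tokens per user.
vocab_size, seq_len = 10, 6

# Encoder predictions over the masked behavior sequence; -100 marks unmasked positions.
mlm_logits = torch.randn(2, seq_len, vocab_size)
mlm_labels = torch.tensor([[3, -100, 5, -100, -100, 7],
                           [-100, 2, -100, 9, -100, -100]])

# Auxiliary decoder predictions, conditioned on the pooled user representation,
# that try to regenerate the full behavior sequence.
dec_logits = torch.randn(2, seq_len, vocab_size)
dec_labels = torch.randint(0, vocab_size, (2, seq_len))

# 1) User Behavior Masking: recover masked behavior tokens from their context.
masking_loss = F.cross_entropy(mlm_logits.view(-1, vocab_size),
                               mlm_labels.view(-1), ignore_index=-100)

# 2) User Behavior Generation: reconstruct behaviors from the user vector.
generation_loss = F.cross_entropy(dec_logits.view(-1, vocab_size),
                                  dec_labels.view(-1))

dec_head_coef = 1.0  # mirrors --dec_head_coef in the pre-training commands
total_loss = masking_loss + dec_head_coef * generation_loss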

Data processing

[Optional] Auxiliary Decoder Initialization

Wiki + BookCorpus, the same pre-training corpus as BERT, is used to initialize the auxiliary decoder before the pre-training stage. Use the following command to download and tokenize the corpus.

python pretrain/mkdata/download_make_wikibook.py \
    --save_to data/pretrain/wikibook.mlen512.s15.json \
    --n_proc 30

Set the number of worker processes n_proc according to your environment. The download step requires internet access to Hugging Face.

You can also skip this step and use our initialized decoder layer provided in assets/model.pt. See the Pre-training section for detailed instructions.

Pre-training Preprocess

The MIND corpus is used for pre-training. Download the MIND collections from here and unzip them into the data/ folder.

Then use the following command to extract and tokenize the pre-training corpus.

python pretrain/mkdata/make_pretrain_data_multi_spans.py \
    --files data/MINDlarge_train/news.tsv \
    --behavior data/MINDlarge_train/behaviors.tsv \
    --save_to data/pretrain/mindlarge_user_title.multispans.json
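
If you want to sanity-check the output, a small snippet like the one below prints the keys of the first few records. This assumes the file is stored as one JSON object per line; the exact field names depend on the preprocessing script and are not documented here.

import json

# Peek at the first few records of the tokenized pre-training corpus.
path = "data/pretrain/mindlarge_user_title.multispans.json"
with open(path) as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(f"record {i}: keys = {sorted(record.keys())}")
        if i >= 2:
            break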

Fine-tuning Preprocess

Use the following commands to tokenize the fine-tuning corpus.

python mindft/mkdata/split_train_file.py data
python mindft/mkdata/split_dev_file.py data

The scripts process MIND-large by default. Change folder_name in both scripts to MINDsmall to process MIND-small.

Pre-training

[Optional] Auxiliary Decoder Initialization

Use the following commands to initialize the auxiliary decoder.

torchrun pretrain/run_pretraining.py \
    --model_name_or_path bert-base-uncased \
    --output_dir $OUTPUT_DIR/model \
    --overwrite_output_dir \
    --dataloader_drop_last \
    --dataloader_num_workers 16 \
    --do_train \
    --logging_steps 200 \
    --save_strategy steps \
    --save_steps 20000 \
    --save_total_limit 5 \
    --fp16 \
    --logging_dir $TENSORBOARD_DIR \
    --report_to tensorboard \
    --optim adamw_torch_fused \
    --warmup_ratio 0.1 \
    --data_type anchor \
    --learning_rate 3e-4 \
    --max_steps 150000 \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 4 \
    --max_seq_length 512 \
    --weight_decay 0.01 \
    --freeze_bert \
    --disable_bert_mlm_loss \
    --bert_mask_ratio 0.30 \
    --use_dec_head \
    --n_dec_head_layers 1 \
    --enable_dec_head_loss \
    --dec_head_coef 1.0

We keep the global batch size at 512 for this initialization step (e.g., a per-device batch size of 16 with 4 gradient-accumulation steps across 8 GPUs). Only the decoder's parameters are updated during initialization, since the encoder is frozen via --freeze_bert.

We also provide the initialized decoder at assets/model.pt. You can skip the steps above and use it directly with the following commands.

# Clone BERT-base to local folder
git lfs install
git clone https://huggingface.co/bert-base-uncased

# Copy `assets/model.pt` to BERT folder
cp assets/model.pt bert-base-uncased
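
To confirm the checkpoint was copied correctly, you can inspect it directly. The snippet below assumes assets/model.pt is a plain PyTorch state dict; if it is a pickled module instead, the loop will not apply.

import torch

# List the parameter tensors stored in the provided decoder checkpoint.
state = torch.load("bert-base-uncased/model.pt", map_location="cpu")
for name, tensor in state.items():
    print(name, tuple(tensor.shape))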

User Behavior Pre-training

To pre-train the encoder with User Behavior Masking and User Behavior Generation, use the following command.

torchrun pretrain/run_pretraining.py \
    --model_name_or_path $PATH/TO/BERT_w_decoder_head \
    --output_dir $OUTPUT_DIR/model \
    --overwrite_output_dir \
    --dataloader_drop_last \
    --dataloader_num_workers 8 \
    --do_train \
    --logging_steps 50 \
    --save_strategy steps \
    --save_steps 2000 \
    --save_total_limit 2 \
    --fp16 \
    --logging_dir $TENSORBOARD_DIR \
    --report_to tensorboard \
    --optim adamw_torch_fused \
    --warmup_ratio 0.1 \
    --data_type title \
    --sample_from_spans \
    --learning_rate 1e-5 \
    --max_steps 10000 \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 2 \
    --max_seq_length 512 \
    --train_path data/pretrain/mindlarge_user_title.multispans.json \
    --weight_decay 0.01 \
    --bert_mask_ratio 0.30 \
    --use_dec_head \
    --n_dec_head_layers 1 \
    --enable_dec_head_loss \
    --dec_head_coef 1.0 \
    --mask_whole_spans \
    --mask_whole_spans_token_ratio 0.3

We keep the total batch size at 256 for this pre-training stage (e.g., a per-device batch size of 16 with 2 gradient-accumulation steps across 8 GPUs).

Fine-tuning and Testing

Use the following commands to fine-tune and test the pre-trained encoder.

MODEL_PATH=$OUTPUT_DIR/model
python mindft/run.py \
    --world_size 8 --n_gpu 8 --batch_size 16 \
    --model_name $MODEL_PATH \
    --config_name $MODEL_PATH \
    --tokenizer_name $MODEL_PATH \
    --model_dir $FINE_TUNED_DIR/model \
    --train_data_dir data/MINDlarge_train \
    --test_data_dir data/MINDlarge_dev \
    --use_linear_lr_scheduler True \
    --random_seed 42 \
    --epochs 3 \
    --lr 3e-5 \
    --mode train_test

To evaluate on MIND-small instead of MIND-large, change train_data_dir and test_data_dir to the corresponding MINDsmall folders.

Performances

Pre-training with user behavior modeling equips the encoder with effective user behavior representations. Performance on the MIND benchmark is as follows.

MIND-Small

Methods     AUC     MRR     nDCG@5   nDCG@10
NRMS-BERT   65.52   31      33.87    40.38
PUNR        68.89   33.33   36.94    43.1

MIND-Large

Methods     AUC     MRR     nDCG@5   nDCG@10
NRMS-BERT   69.5    34.75   37.99    43.72
PUNR        71.03   35.17   39.04    45.4
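
For reference, the reported metrics are computed per impression (one candidate list per user) and averaged over impressions. The snippet below is a minimal sketch of AUC, MRR, and nDCG@k for a single impression, assuming scikit-learn is available; it is not the evaluation code shipped with this repository.

import numpy as np
from sklearn.metrics import roc_auc_score

def dcg_at_k(rels, k):
    rels = np.asarray(rels, dtype=float)[:k]
    return np.sum(rels / np.log2(np.arange(2, rels.size + 2)))

def ndcg_at_k(y_true, y_score, k):
    order = np.argsort(y_score)[::-1]
    ideal = np.sort(np.asarray(y_true, dtype=float))[::-1]
    return dcg_at_k(np.asarray(y_true, dtype=float)[order], k) / dcg_at_k(ideal, k)

def mrr(y_true, y_score):
    order = np.argsort(y_score)[::-1]
    ranked = np.asarray(y_true, dtype=float)[order]
    return np.sum(ranked / (np.arange(ranked.size) + 1)) / np.sum(ranked)

# One toy impression: 1 = clicked candidate news, 0 = skipped.
y_true = [1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.7, 0.1]
print("AUC    ", roc_auc_score(y_true, y_score))
print("MRR    ", mrr(y_true, y_score))
print("nDCG@5 ", ndcg_at_k(y_true, y_score, 5))
print("nDCG@10", ndcg_at_k(y_true, y_score, 10))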
