mauve-experiments

This repository contains the code and the scripts to reproduce the experiments in this paper published at NeurIPS 2021 and awarded an Outstanding Paper Award. The paper introduces MAUVE, a comparison measure for open-ended text generation.

MAUVE directly compares the distribution of machine-generated text to that of human-written text via the area under the divergence curve between the two distributions. MAUVE summarizes the trade-off between two types of errors: those arising when the model places mass on text that is unlikely under the human distribution, and those arising from parts of the human distribution that the model fails to cover.

Standalone package: For a self-contained package to compute MAUVE, installable via pip install mauve-text, please see this repository.
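
For reference, here is a minimal sketch of using the standalone package. The argument names follow the mauve-text API; the tiny toy lists only illustrate the call, and real comparisons need thousands of samples per side.

import mauve  # pip install mauve-text

# p_text: human-written references, q_text: model generations (lists of strings)
p_text = ["The cat sat on the mat.", "A quick brown fox jumps over the lazy dog."]
q_text = ["The cat sat on a mat.", "A fast brown fox leaps over a sleepy dog."]

out = mauve.compute_mauve(p_text=p_text, q_text=q_text,
                          device_id=0,          # GPU id; use -1 for CPU
                          max_text_length=256,
                          verbose=False)
print(out.mauve)  # scalar in (0, 1]; higher means the two distributions are closer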

Summary:

  • The rest of the README describes how to create the generations and reproduce the experiments described in the paper.
  • To download the generations used in the paper, see here. This will still require you to featurize the generations and compute the metrics.
  • The data we collected in the human evaluations can be found here. The code to compute the corresponding Bradley-Terry coefficients can be found here.

Dependencies

The code is written in Python and the dependencies are:

  • Python >= 3.6
  • PyTorch >= 1.1
  • Huggingface Transformers >= 4.2.0
  • NLTK >= 3.4.5
  • scikit-learn >= 0.22.1
  • faiss-gpu >= 1.7.0
  • tqdm >= 4.40.0

Conda Environment: We recommend using a conda environment with Python 3.8. To set up the environment, run

conda env create --file environment.yml
# activate the environment
conda activate mauve-experiments

In addition, you will have to install some dependencies manually; see the pip instructions below.

The code is compatible with PyTorch >= 1.1.0 and transformers >= 3.2.0, but we have not thoroughly tested it in this configuration.

Install Dependencies via Pip: Install PyTorch, version 1.7 (instructions here) and then run

pip install -r requirements.txt

Datasets

We use the webtext data from the GPT-2 output dataset repository. For the purpose of reproducing these experiments, it suffices to simply download the test set of webtext. To this end, run:

python local_scripts/download_data.py

The data is downloaded to the folder ./data; pass --data_dir ./data to all scripts below.
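
If you want to inspect the downloaded data, here is a minimal sketch, assuming the test split lands at ./data/webtext.test.jsonl in the GPT-2 output dataset format (one JSON object per line with a "text" field).

import json

texts = []
with open("./data/webtext.test.jsonl") as f:   # path assumes the default --data_dir
    for line in f:
        record = json.loads(line)
        texts.append(record["text"])           # other fields (e.g., "ended") may also be present
print(len(texts), texts[0][:100])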

Experimental Pipeline

For each dataset, once we have the pretrained models, the experimental pipeline is as follows:

  1. generate samples and featurize samples (GPU needed)
  2. compute MAUVE (CPU suffices, highly parallelizable)
  3. compute LM metrics such as perplexity, sparsemax score, Jensen-Shannon score, etc. (GPU needed)
  4. compute self-BLEU (CPU only, embarrassingly parallelizable across multiple cores)
  5. compute all other metrics (CPU only)
  6. compute steps 4 and 5 on the human data

The generation of samples (Step 1) must be run first. Other steps can proceed in any order.

Here is how to find the scripts, step by step, for webtext. The variables that need to be set are detailed at the top of each script.

Step 0. Prepare directory: Run bash local_scripts/make_output_dirs.sh to create the necessary output directories.

Step 1. Generate the samples: Run slurm_scripts/webtext/arr_generate_basic.sh ${model_size} to generate samples from the basic decoding methods (pure sampling, top-K sampling, temperature sampling, nucleus sampling, and greedy decoding). ${model_size} is one of ['small', 'medium', 'large', 'xl'].

It is written as a slurm array job. For each configuration and model size, we generate five sets of 5000 samples each using prompts from the dataset. This script internally calls the file generate_basic.py. The outputs are stored in ./outputs/{dataset_name}_{model_name}/generations/basic. The running time for each run varies from around 1 hour (GPT-2 small/medium) to around 3-4 hours (GPT-2 large) and 12 hours (GPT-2 XL) on an NVIDIA Quadro GPU with 24 GB of memory. On a GPU with 12 GB of memory, it will likely take around twice as long.
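
Conceptually, the basic decoding methods reduce to different arguments of Hugging Face's generate(). Here is a hedged sketch of the idea; generate_basic.py handles prompts, batching, and bookkeeping omitted here, and the model name and lengths below are illustrative.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = GPT2Tokenizer.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large").to(device).eval()

prompt_ids = tok("The meaning of life is", return_tensors="pt").input_ids.to(device)

# Nucleus (top-p) sampling; set top_k for top-K sampling, temperature for
# temperature sampling, top_p=1.0/top_k=0 for pure sampling, do_sample=False for greedy.
out = model.generate(prompt_ids, do_sample=True, top_p=0.95, top_k=0,
                     temperature=1.0, max_length=256,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))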

This creates the following in ./outputs/{dataset_name}_{model_name}/generations/basic.

  • sentences_test_p${topp}_k${topk}_t${temp}_seed${seed}.p (e.g. sentences_test_p0.99_k0_t1.0_seed0.p): contains the raw samples in string form. If you load this using pickle, you will find two lists: (1) a list of strings, which are the actual samples generated, and (2) a list of booleans denoting termination, i.e., whether an <|endoftext|> (EOS) token was generated (see the loading sketch after this list).
  • sample_test_p${topp}_k${topk}_t${temp}_seed${seed}.p (e.g. sample_test_p0.99_k0_t1.0_seed0.p): contains the samples after tokenization. If you load this using pickle, you will find a list of 5 entries: (1) a list of lists of ints, each of which is the BPE-tokenized representation of a sample generated above, (2) a list of booleans denoting termination (same as above), (3) the unique n-gram fraction for n in 1 to 6, (4) the perplexity of the generated text under the model, and (5) the parsed arguments of the script generate_basic.py.
  • featsL${max_length}_test_p${topp}_k${topk}_t${temp}_seed${seed}.pt (e.g. featsL1024_test_p0.99_k0_t1.0_seed0.pt): feature representations (i.e., the terminal hidden state) under the GPT-2 large model. Each such file is 25M in size. For each configuration, we create 4 files with max_length in {128, 256, 512, 1024}.
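
Here is a minimal sketch for loading these pickles; the directory and file names below are illustrative, so substitute your own run's outputs.

import pickle

base = "./outputs/webtext_gpt2-large/generations/basic"  # hypothetical {dataset_name}_{model_name}

# Raw string samples and termination flags
with open(f"{base}/sentences_test_p0.99_k0_t1.0_seed0.p", "rb") as f:
    samples, terminated = pickle.load(f)
print(len(samples), samples[0][:100], terminated[0])

# Tokenized samples and bookkeeping (the 5 entries described above)
with open(f"{base}/sample_test_p0.99_k0_t1.0_seed0.p", "rb") as f:
    token_ids, terminated, ngram_frac, ppl, args = pickle.load(f)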

Next, run slurm_scripts/webtext/generate_ref.sh to featurize the human-written text (i.e., webtext test set). The output is created in ./outputs/{dataset_name}_{model_name}/generations/ref.
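
Conceptually, featurization maps each text to the last-token hidden state of GPT-2 large. A hedged sketch of this step follows; the repo's scripts do this in batches and save .pt files, which is omitted here.

import torch
from transformers import GPT2Model, GPT2Tokenizer

@torch.no_grad()
def featurize(texts, model_name="gpt2-large", max_length=1024, device="cuda"):
    # One vector per text: the hidden state of the final token under GPT-2 large.
    tok = GPT2Tokenizer.from_pretrained(model_name)
    model = GPT2Model.from_pretrained(model_name).to(device).eval()
    feats = []
    for text in texts:
        ids = tok(text, return_tensors="pt", truncation=True, max_length=max_length).input_ids.to(device)
        hidden = model(ids).last_hidden_state       # shape (1, seq_len, hidden_dim)
        feats.append(hidden[0, -1].cpu())
    return torch.stack(feats)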

Step 2. Compute MAUVE: Run local_scripts/webtext/mauve_metrics_*.sh.

  • local_scripts/webtext/mauve_metrics_kmeans.sh: use k-means for quantization. Runs on CPU within a few minutes per run and is massively parallelizable (a minimal sketch of this pipeline appears after this list).
  • local_scripts/webtext/mauve_metrics_drmm.sh: use deep residual mixture models (DRMM) for quantization (Hämäläinen et al. 2020). It is copied with minor edits from the original repo (note: this requires TensorFlow 1.12 to be installed; a CPU-only install suffices). A CPU-only run takes around 2 hours. It is also massively parallelizable.
  • local_scripts/webtext/mauve_metrics_spv.sh: use spreading vectors for quantization (Sablayrolles et al. 2018). It is copied with minor edits from the original repo. It runs in under 10 minutes on a GPU.
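
For intuition, here is a minimal sketch of the k-means variant: jointly quantize the two feature sets, histogram the cluster assignments, and integrate the divergence curve. The num_clusters, scaling constant, and PCA settings below are illustrative defaults, not necessarily what the scripts use.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import auc

def kl(p, q):
    # KL divergence between discrete distributions; q > 0 wherever p > 0 here.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def mauve_from_features(p_feats, q_feats, num_clusters=500, scaling=5.0, num_lambdas=25):
    # Jointly quantize both feature sets with k-means, histogram the cluster
    # assignments, and integrate the exponentiated-KL divergence curve.
    both = np.concatenate([p_feats, q_feats], axis=0)
    both = PCA(n_components=0.9, whiten=True).fit_transform(both)   # keep ~90% of the variance
    labels = KMeans(n_clusters=num_clusters, n_init=5).fit_predict(both)
    p_hist = np.bincount(labels[:len(p_feats)], minlength=num_clusters) / len(p_feats)
    q_hist = np.bincount(labels[len(p_feats):], minlength=num_clusters) / len(q_feats)
    xs, ys = [], []
    for lam in np.linspace(1e-6, 1 - 1e-6, num_lambdas):
        r = lam * p_hist + (1 - lam) * q_hist
        xs.append(np.exp(-scaling * kl(q_hist, r)))
        ys.append(np.exp(-scaling * kl(p_hist, r)))
    order = np.argsort(xs)
    return auc(np.array(xs)[order], np.array(ys)[order])   # area under the divergence curve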

The outputs are written in ./outputs/{dataset_name}_{model_name}/metrics/basic. The filenames are:

  • k-means: mauve_L${max_len}_test_p${topp}_k${topk}_t${temp}_seed${seed}_kmeans_l2_${num_clusters}_${lower_dim_explained_variance}.p (e.g., mauve_L1024_test_p1.0_k50_t1.0_seed2_drmm_3_10.p): arguments are num_clusters (number of clusters) and lower_dim_explained_variance (lower dimensionality after PCA is chosen with at least this much explained variance).
  • DRMM: mauve_L${max_len}_test_p${topp}_k${topk}_t${temp}_seed${seed}_drmm_${num_layers}_${num_components_per_layer}.p (e.g., mauve_L1024_test_p1.0_k50_t1.0_seed2_drmm_3_10.p): arguments are num_layers (number of layers in the DRMM) and num_components_per_layer (number of components in each layer). The equivalent number of k-means clusters would be ${num_components_per_layer}^${num_layers}
  • SPV: mauve_L${max_len}_test_p${topp}_k${topk}_t${temp}_seed${seed}_spv.p (e.g., mauve_L1024_test_p1.0_k50_t1.0_seed2_drmm_3_10.p)

Each of these outputs is a pickle file containing [p_hist, q_hist, mauve], where p_hist and q_hist are, respectively, the multinomial distributions of the two feature sets obtained after quantization.
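
To read off a score, load any of these pickles; the file name below is hypothetical, so pick one of your own outputs.

import pickle

path = "./outputs/webtext_gpt2-large/metrics/basic/mauve_L1024_test_p0.99_k0_t1.0_seed0_kmeans_l2_500_0.9.p"
with open(path, "rb") as f:
    p_hist, q_hist, mauve_score = pickle.load(f)
print(f"MAUVE = {mauve_score:.4f} over {len(p_hist)} quantization bins")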

Step 3. Compute LM metrics: Run local_scripts/webtext/run_lm_metrics.sh, which in turn invokes compute_lm_metrics_basic.sh. Output files are written in ./outputs/{dataset_name}_{model_name}/metrics/basic with name lm_test_p${topp}_k${topk}_t${temp}.p (e.g., lm_test_p1.0_k5_t1.0.p). Only one job is run per configuration, irrespective of the seed, since these metrics depend on the model but not on the actual generations.
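
As an illustration of the simplest LM metric, here is a hedged sketch of corpus-level perplexity under a GPT-2 model; compute_lm_metrics_basic computes additional metrics (e.g., sparsemax and Jensen-Shannon scores) that are not shown here.

import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

@torch.no_grad()
def perplexity(texts, model_name="gpt2-large", device="cuda"):
    # Token-level perplexity of a list of texts under a GPT-2 language model.
    tok = GPT2TokenizerFast.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name).to(device).eval()
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        ids = tok(text, return_tensors="pt", truncation=True, max_length=1024).input_ids.to(device)
        loss = model(ids, labels=ids).loss          # mean NLL over ids.shape[1] - 1 predicted tokens
        total_nll += loss.item() * (ids.shape[1] - 1)
        total_tokens += ids.shape[1] - 1
    return math.exp(total_nll / total_tokens)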

Step 4. Compute Self-BLEU: Run local_scripts/webtext/run_self_bleu.sh, which in turn calls compute_self_bleu_metric.sh. It takes around 7 hours on a single processor core, but is embarrassingly parallel. The current script uses one process per job but launches multiple jobs at once. Output files are written in ./outputs/{dataset_name}_{model_name}/metrics/basic with name bleu_test_p${topp}_k${topk}_t${temp}_seed${seed}.p (e.g., bleu_test_p1.0_k500_t1.0_seed4.p).
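
Self-BLEU treats each sample as a hypothesis and the remaining samples as references. Here is a hedged sketch using NLTK; the repo's script differs in tokenization and exact settings, and the subsample size is illustrative.

import random
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(samples, n_sample=1000, seed=0):
    # Mean BLEU-4 of randomly chosen samples against all other samples used as
    # references. Tokenization here is plain whitespace splitting.
    smooth = SmoothingFunction().method1
    tokenized = [s.split() for s in samples]
    rng = random.Random(seed)
    idxs = rng.sample(range(len(tokenized)), min(n_sample, len(tokenized)))
    scores = []
    for i in idxs:
        refs = tokenized[:i] + tokenized[i + 1:]
        scores.append(sentence_bleu(refs, tokenized[i],
                                    weights=(0.25, 0.25, 0.25, 0.25),
                                    smoothing_function=smooth))
    return sum(scores) / len(scores)

Each iteration is independent of the others, which is why the computation parallelizes so easily across processes.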

Step 5. Compute all other metrics: Run local_scripts/webtext/run_all_L_metrics.sh. It calls compute_all_L_metrics.py under the hood and computes other metrics such as the Zipf coefficient and repetition ratio. Runs in a few seconds.

Output files are written in ./outputs/{dataset_name}_{model_name}/metrics/basic with name all_L_test_p${topp}_k${topk}_t${temp}_seed${seed}.p (e.g., all_L_test_p0.92_k0_t1.0_seed3.p).
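
For reference, the Zipf coefficient is the negative slope of a log-log fit of token frequency against rank. A minimal sketch follows; scipy is assumed to be available, and the max_rank cutoff is illustrative.

import numpy as np
from collections import Counter
from scipy import stats

def zipf_coefficient(token_ids, max_rank=5000):
    # Negative slope of the log-frequency vs. log-rank fit over the most frequent tokens.
    counts = np.array(sorted(Counter(token_ids).values(), reverse=True))[:max_rank]
    ranks = np.arange(1, len(counts) + 1)
    slope = stats.linregress(np.log(ranks), np.log(counts)).slope
    return -slope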

Step 6. Compute metrics on human data: To perform steps 4 and 5 on the human-written text, run

python compute_ref_metrics.py --datasplit test --device 0 --parallel_bleu --n_proc_bleu 24

The self-BLEU computation is the most time-consuming (~7 hours with one process) and its running time depends on how many processes are allowed (--n_proc_bleu). Outputs are written to ./outputs/{dataset_name}_{model_name}/metrics/ref.

Citation

If you find this repository useful, or you use it in your research, please cite the following papers:



@article{pillutla-etal:mauve:jmlr2023,
  title={{MAUVE Scores for Generative Models: Theory and Practice}},
  author={Pillutla, Krishna and Liu, Lang and Thickstun, John and Welleck, Sean and Swayamdipta, Swabha and Zellers, Rowan and Oh, Sewoong and Choi, Yejin and Harchaoui, Zaid},
  journal={JMLR},
  year={2023}
}

@inproceedings{pillutla-etal:mauve:neurips2021,
  title={MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers},
  author={Pillutla, Krishna and Swayamdipta, Swabha and Zellers, Rowan and Thickstun, John and Welleck, Sean and Choi, Yejin and Harchaoui, Zaid},
  booktitle = {NeurIPS},
  year      = {2021}
}

@inproceedings{liu-etal:mauve-theory:neurips2021,
  title={{Divergence Frontiers for Generative Models: Sample Complexity, Quantization Effects, and Frontier Integrals}},
  author={Liu, Lang and Pillutla, Krishna and Welleck, Sean and Oh, Sewoong and Choi, Yejin and Harchaoui, Zaid},
  booktitle={NeurIPS},
  year={2021}
}

Acknowledgements

This work was supported by NSF CCF-2019844, the DARPA MCS program through NIWC Pacific (N66001-19-2-4031), the CIFAR program "Learning in Machines and Brains", a Qualcomm Innovation Fellowship, and faculty research awards.
