This is the codebase for Learning Structured Representations by Embedding Class Hierarchy.
Label hierarchies, which encode the semantic relationships between objects, are widely available in modern ML datasets. To leverage such information in feature representations, we introduce CPCC as a regularizer that converts a traditional permutation-invariant feature representation into a structured one. CPCC is defined as the correlation coefficient between two pairwise metrics: distances between classes in the label hierarchy tree and distances between class representations in the feature space.
Following the label hierarchy, CPCC groups fine classes sharing the same coarse class together and pushes fine classes under different coarse classes apart. This interpretable representation leads to better performance on several downstream tasks.
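As a toy illustration of this definition, the sketch below computes a CPCC-style score as the Pearson correlation between pairwise tree distances and pairwise l2 distances of class-mean features. This is a simplified stand-alone sketch, not the repository's implementation in `loss.py`; the function names and toy distances are illustrative:

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cpcc(class_means, tree_dist):
    """CPCC-style score: correlation between tree distances and l2
    feature distances over all pairs of fine classes."""
    classes = sorted(class_means)
    t, d = [], []
    for i, j in combinations(classes, 2):
        t.append(tree_dist[(i, j)])
        # Euclidean (l2) distance between class-mean features
        d.append(sqrt(sum((a - b) ** 2
                          for a, b in zip(class_means[i], class_means[j]))))
    return pearson(t, d)
```

On a toy example where fine classes under the same coarse class have nearby means (and pairs under the same coarse class have smaller tree distance), the score approaches 1.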
CPCC is implemented in `loss.py`. Please use the following instructions to run the experiments in our paper.
- To download the repository and install all necessary requirements, run:

  ```shell
  git clone https://github.com/hanzhaoml/HierarchyCPCC
  cd HierarchyCPCC
  conda env create -n hierarchy --file environment.yml
  ```
- To run experiments on BREEDS, go to the parent directory of `HierarchyCPCC` and run:

  ```shell
  git clone https://github.com/MadryLab/BREEDS-Benchmarks.git
  ```
We use wandb to log metrics to the cloud. All intermediate evaluation metrics and code will be uploaded automatically once you have set up your wandb account. For local debugging, you can also run `wandb disabled` before starting experiments, so that your checkpoints are automatically saved in your local results folder.
Most training snippets are included in `main.py`. Here we explain some important configuration arguments.

- `--root`: directory where you want to save your experiment results. Once an experiment starts, all outputs will be stored in `<your specified root>/hierarchy_results/<DATASET>/<EXPERIMENT ID>`, or, if you run experiments on BREEDS, in `<your specified root>/hierarchy_results/<DATASET>/<EXPERIMENT ID>/<BREEDS SETTING>`.
- `--timestamp`: a unique id to identify your experiment. You can use `datetime.now().strftime("%m%d%Y%H%M%S")` if you want to use the timestamp as the identifier.
- `--dataset`: any of `MNIST`, `CIFAR`, or `BREEDS`. Out-of-distribution evaluation is only available for CIFAR. The four BREEDS settings will run sequentially in the order living17, entity13, entity30, nonliving27.
- `--exp_name`: baseline loss function; please see Appendix C of our paper for details. Any of:
  - `ERM`: empirical risk minimization
  - `MTL`: multi-task learning
  - `Curriculum`: curriculum learning
  - `sumloss`: SumLoss
  - `HXE`: Hierarchical Cross Entropy
  - `soft`: Soft Label
  - `quad`: Quadruplet Loss
- `--split`: if the value is `full`, use all training data for training. If the value is `split`, use a partial dataset for training.
- `--task`: if `split` is `full`, leave it empty to run the training and evaluation snippets in one step. Otherwise, if `split` is `split`, set `task` to either `in` or `sub` to create the source-target data splits. When `task == in`, the original training dataset is split into halves, and the test set is half of the original test set. When `task == sub`, the training and test datasets are split into corresponding subpopulations such that no fine class overlaps between the source and target sets.
- `--cpcc`: 0 to use the loss function without the CPCC regularizer, 1 to use it with the CPCC regularizer.
- `--cpcc_metric`: metric on the feature space. Available choices include `l2`, `l1`, or `poincare`.
- `--lamb`: regularization factor of the CPCC regularizer. Default value is 1.
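Putting the arguments together, a hypothetical invocation might look like the following. The flag values are purely illustrative; remaining hyperparameters come from the dataset's JSON files:

```shell
# Illustrative example only: train CIFAR with the ERM baseline plus the
# CPCC regularizer (l2 metric) on the full training set.
TIMESTAMP=$(date +%m%d%Y%H%M%S)   # unique experiment id
python main.py --root ./results --timestamp "$TIMESTAMP" \
    --dataset CIFAR --exp_name ERM --split full \
    --cpcc 1 --cpcc_metric l2 --lamb 1
```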
All default training hyperparameters are saved in JSON files under the corresponding dataset folders: `pre.json` contains pretraining parameters, and `down.json` contains parameters for downstream transfer tasks.
To run all experiments, please see `train-table1.sh` and `train-table2.sh` for reference.
To use any custom dataset, you can create a new dataset folder and include:

- A backbone model in `model.py` to run the classification task.
- A `data.py` which includes:
  - a dataset that inherits from both `Hierarchy` and `Dataset`. You need to define the mapping from the finest-level labels to coarser labels.
  - a subset of the hierarchical dataset that resets each level's label indices to range from 0 to `len(label) - 1`.
  - a `make_dataloader` function that creates train and test loaders for each split-task combination.
- Note: in our experiments, when calculating CPCC within a batch, all pairwise calculations are derived from pairs of labels at the same height. Therefore, instead of using any shortest-distance algorithm, we simply hard-code the distances in `loss.py` given precalculated layer mappings.
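The label bookkeeping above can be sketched as follows. Note that `coarse_map`, `reindex`, and `tree_dist` are hypothetical names for illustration only, assuming a two-level hierarchy where each fine label maps to exactly one coarse label:

```python
# Map each fine label to its coarse label (finest level -> coarser level).
coarse_map = {0: 0, 1: 0, 2: 1, 3: 1}  # 4 fine classes under 2 coarse classes

def reindex(labels):
    """Reset one level's label indices to a contiguous 0..n-1 range,
    as needed when taking a subset of a hierarchical dataset."""
    remap = {old: new for new, old in enumerate(sorted(set(labels)))}
    return [remap[l] for l in labels]

def tree_dist(i, j):
    """Hard-coded tree distance between two fine labels at the same
    height in a two-level hierarchy (no shortest-path algorithm needed)."""
    if i == j:
        return 0
    return 2 if coarse_map[i] == coarse_map[j] else 4

# e.g. a subpopulation split that keeps only fine classes {1, 3}
print(reindex([1, 3, 3, 1]))             # -> [0, 1, 1, 0]
print(tree_dist(0, 1), tree_dist(0, 2))  # -> 2 4
```

The hard-coded distances mirror the note above: because all pairs compared within a batch sit at the same height, a lookup through the layer mapping suffices.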
Please read the code in `/cifar`, `/mnist`, and `/breeds` for reference.
Our repository is built upon making-better-mistakes and breeds.
If you find our work helpful, please consider citing our paper:
```bibtex
@inproceedings{zeng2023learning,
  title={Learning Structured Representations by Embedding Class Hierarchy},
  author={Siqi Zeng and Remi Tachet des Combes and Han Zhao},
  booktitle={International Conference on Learning Representations},
  year={2023},
  url={https://openreview.net/forum?id=7J-30ilaUZM}
}
```
Please contact [email protected] for any questions or comments.