isl-org / multiobjectiveoptimization Goto Github PK

Source code for Neural Information Processing Systems (NeurIPS) 2018 paper "Multi-Task Learning as Multi-Objective Optimization"

License: MIT License

Shell 0.10% Python 99.90%

multiobjectiveoptimization's People

Contributors

Stargazers

Watchers

Forkers

ml-lab rongchangzhao yyht leiup jario-jin zeng-hello-world jiabinxue wenzishou famousgrouse sean0719 miquelmarti houxy12 liu3xing3long hyzcn eycab songgc limaosen0 vivien000 bsaint duynht aiwebops georkap cdawei drfengze pyun-ram mikael10j silverwavegl dbmptr mengzaiqiao mengze-96 joeytang3377 hewg2008 rambobean njuhaozhang databill86 fishcf qianrenjian williamlipro mata62n kiminh fanyin3639 liuguoyou chen-dixi sunupo young917 haolun-wu yi-xin608 guohaoxiang jovialio shaanrockz alexyahi evgeniyakorneva zhegeliang2 orthodex youngadvance sehwanmoon vishalbelsare zhenglinli gusuperstar changhwapark zhenshen-mla meinternational biaohe illidanlab zscwind zshy1205 moneyqyh kay1794 jianzhu helmuthb cock-puncher wq20000101 stlslm zhenpingli lioutikov mbabby leizy1008 xiaoye77 xubin1994 ppp1992 jiangyuewu madonokouki byvalentino zhouhaolin1997 tianhaofu jzengust geofferyzhong chengyinghao tinyloop lijing271 dora-cmon wzn123 iamkarthikbk westwing84 mgsong tomorrow1weak2up heleifz bibek-bhandari ankurhcu xqjin

multiobjectiveoptimization's Issues

inconsistent implementation of update for task specific weights with the description in paper

Hi, thanks for releasing this awesome code! Currently, i am working on reproducing the result on cityscapes in paper. I found that in paper the description of mtl update equation say the weights of task specific subnetwork should be updated with original learning rate, then the shared weights of network is updated with the MGDA algorithm. But i didnt find the corresponding implementation in code where both the shared weights and task specific weights are updated consistently by timing loss of different task with a weight factor determined by MGDA. Am i missing something here, or is this a implemention trick?

Is there any difference between soft sharing and hard sharing network when using MGDA?

In this paper, it seems only to conduct experiments on a hard sharing network for MTL, if there exist some differences when using MGDA on a soft sharing network? And if it's yes, Why?

`iter_count` is never incremented in `min_norm_solvers`?

Both the find_min_norm_element (https://github.com/isl-org/MultiObjectiveOptimization/blob/master/multi_task/min_norm_solvers.py#L120) and find_min_norm_element_FW (https://github.com/isl-org/MultiObjectiveOptimization/blob/master/multi_task/min_norm_solvers.py#L166) functions define a while loop based on an iter_count variable.

Line 13 of Algorithm 2 in the paper (https://arxiv.org/pdf/1810.04650.pdf) notes that optimization is done until a number of iterations or convergence threshold is reached. However, the code never actually increments iter_count, and so optimization only terminates based on the convergence threshold.

Citation for 'Projected Descent' method in Readme

Hi,
thanks for the open-source code!
I'm looking to implement something like https://openreview.net/forum?id=HJewiCVFPB (Gradient Surgery for Multi-Task Learning) using/on top of this repo, but i'm a little confused about whether the projected descent method that is mentioned in the Readme is something similar already present.
Could you add citations and sources for any additional algorithms and tricks that this repo implements apart from the main paper?
That would be very helpful!
(or alternatively it would be helpful to just get some pointers on what/where to modify to get what i want working)

PS: ^ the paper i mentioned just involves projecting the gradients of all tasks onto the planes of the gradients of the other tasks before doing updates, to avoid gradient conflicts, (for some quick context)

Thanks

size of the segmentation output

I find the input size of CityScapes is 256x512, but in the transform, the target is resize into 32x64. what's more, the measure metrix is based on 32x64.

So, why the target size is not 256x512, just same as the input size?

run error, Please help me, Thanks

I use "
"algorithm": "mgda",
"use_approximation": true,", That's mean the MGDA_UB is used。
When I trained the task，At the begin time, It can run true. But when it run some steps, It will get error: "RuntimeError: CUDA out of memory." ，I use torch_version==v1.1.0， Is it torch_version question?

`_min_norm_2d` func in `min_norm_solver.py` is slow

In my test, I found change below code in _min_norm_2d func may spped up train procedure.

    def _min_norm_2d(self, vecs, dps):
        """
        Find the minimum norm solution as combination of two points
        This is correct only in 2D
        ie. min_c |\sum c_i x_i|_2^2 st. \sum c_i = 1 , 1 >= c_1 >= 0 for all i, c_i + c_j = 1.0 for some i, j
        """
        dmin = 1e8
        for i in range(len(vecs)):
            for j in range(i + 1, len(vecs)):
                if (i, j) not in dps:
                    dps[(i, j)] = 0.0
                    # for k in range(len(vecs[i])):
                    #     dps[(i, j)] += torch.dot(vecs[i][k], vecs[j][k]).data[0]
                    dps[(i, j)] = torch.mm(vecs[i].transpose(1,0), vecs[j])
                    dps[(j, i)] = dps[(i, j)]
                if (i, i) not in dps:
                    dps[(i, i)] = 0.0
                    # for k in range(len(vecs[i])):
                    #     dps[(i, i)] += torch.dot(vecs[i][k], vecs[i][k]).data[0]
                    dps[(i, i)] = torch.mm(vecs[i].transpose(1, 0), vecs[i])
                if (j, j) not in dps:
                    dps[(j, j)] = 0.0
                    # for k in range(len(vecs[i])):
                    #     dps[(j, j)] += torch.dot(vecs[j][k], vecs[j][k]).data[0]
                    dps[(j, j)] = torch.mm(vecs[j].transpose(1, 0), vecs[j])
                c, d = self._min_norm_element_from2(dps[(i, i)], dps[(i, j)], dps[(j, j)])
                if d < dmin:
                    dmin = d
                    sol = [(i, j), c, d]
        return sol, dps

cityscape dataset

Hi，Thanks you code.

Could you provide the cityscape dataset depth ground truth?

thanks!

How to speed up the calculation of func '_min_norm_2d'?

I found that if the vecs input into func _min_norm_2d is too large, the whole procedure becomes very slow. In my experiment, the vecs.shape = [39321600,1], and one single training iteration costs 40s, while the _min_norm_2d spends 35s+.
The experiment in your paper assume that the share layers shape is much smaller mine, so I guess maybe this is the point.

local variable 'sol' referenced before assignment

I encounter an error UnboundLocalError: local variable 'sol' referenced before assignment

In min_norm_solvers.py,
def _min_norm_2d(vecs, dps):
"""
Find the minimum norm solution as combination of two points
This is correct only in 2D
ie. min_c |\sum c_i x_i|_2^2 st. \sum c_i = 1 , 1 >= c_1 >= 0 for all i, c_i + c_j = 1.0 for some i, j
"""
dmin = 1e8
for i in range(len(vecs)):
for j in range(i+1,len(vecs)):
if (i,j) not in dps:
dps[(i, j)] = 0.0
for k in range(len(vecs[i])):
dps[(i,j)] += torch.mul(vecs[i][k], vecs[j][k]).sum().data.cpu()
dps[(j, i)] = dps[(i, j)]
if (i,i) not in dps:
dps[(i, i)] = 0.0
for k in range(len(vecs[i])):
dps[(i,i)] += torch.mul(vecs[i][k], vecs[i][k]).sum().data.cpu()
if (j,j) not in dps:
dps[(j, j)] = 0.0
for k in range(len(vecs[i])):
dps[(j, j)] += torch.mul(vecs[j][k], vecs[j][k]).sum().data.cpu()
c,d = MinNormSolver._min_norm_element_from2(dps[(i,i)], dps[(i,j)], dps[(j,j)])
if d < dmin:
dmin = d
sol = [(i,j),c,d]
return sol, dps

The sol variable is only assigned if d < dmin, what if d >= dmin always, what should the sol variable be?
I assume that it will be sol = [(i,j),c,dmin], is that right?

is the code corresponding to the algorithm in paper ?

Hi, thanks for providing code for your paper

I was wondering, when you do the actual update there:

MultiObjectiveOptimization/multi_task/train_multi_task.py

Line 184 in d45eb26

optimizer.step()

It looks to me that you are not exactly performing the same thing as what's stated in the paper
https://proceedings.neurips.cc/paper/2018/file/432aca3a1e345e339f35a30c8f65edce-Paper.pdf

On line 2, the "heads" parameters are updated through theta_t = theta_t - eta * grad, but here you do theta_t = theta_t * scale_t * eta * grad
is that correct ?

best

antoine

Is MultiMNIST data avaliable publicly

Thanks for you code.
I want to know where can I find the MultiMNIST dataset used in your paper if it's avaliable now.

best config for MultiMNIST?

Could you please publish the best config for MultiMNIST experiments? I tried with the following but only got ~70% testing accuracy for both tasks, which is much lower than what's reported in your paper.

{
    "optimizer": "SGD",
    "batch_size": 256,
    "lr": 0.0005,
    "dataset": "mnist",
    "tasks": ["L", "R"],
    "normalization_type": "none",
    "algorithm": "mgda",
    "use_approximation": true,
    "scales": {"L":0.5, "R":0.5}
}

Thanks!

depth loss and metric

Hi, @ozansener, I find that there may be some mismatch among the depth loss, metric and reported results in the paper. In the depth loss, those pixels with <0 loss are excluded, but in the depth metric, pixels with <250 target depth (normalized by mean and std) are all included, So I think they include all the pixels, maybe this is a bug because 250 should be used for segmentation but not depth regression. Also, in Tab. 4 of your paper w.r.t disparity error, are they the direct output of the depth branch of the model in your code? Thanks.

PS: For evaluating the depth regression performance, why should the gt and pred be cast as int?https://github.com/intel-isl/MultiObjectiveOptimization/blob/5d8e8343c56cf3184081f71cc4a661af64e7cf7e/multi_task/metrics.py#L46-L54

params.json

how can i get the file named params.json？

If I want to train the 3D image data using MTL by MGDA do I have to change the code min_norm_solver.py?

About Cityscapes Instance Segmentation GroundTruth Generation.

Hi Ozen,
I notice that your function encoding_instancemap in cityscapes_loaders.py is strange.
The ignore pixel in variable ins is set to self.ignore_index, as default 250. But those pixel is not instance and 250 are in the range of x_map (I sample a image and calculate it. x_map max is 1038, min is -800, 250 is in the range)
Is it right? or it will be set to (0, 0) for y_map and x_map respectively?

IndexError: list index out of range (when reading base_path in "celeba_loader.py" line 52 )

Hi Ozan,

May I ask how to solve this issue? When I read the path in "celeba_loader.py", line 52, it shows an index error. I followed some suggestions from the website but they all do not work. May you help me figure out how to fix this? BTW, I am using a Mac and Google Colab. Both face this error.

Thanks in advance.

Best,
Haolun

How to get the mean of error in the CelebA experiment

Seems like the code only provide classification accuracy and negative log-likelihood loss instead of the mean of error shown in Table 1 in the paper. Is there any way that we can get that number? Thanks!

Algorithm 2 line 10

MultiObjectiveOptimization/multi_task/min_norm_solvers.py

Lines 121 to 122 in d45eb26

 grad_dir = -1.0*np.dot(grad_mat, sol_vec) 

 new_point = MinNormSolver._next_point(sol_vec, grad_dir, n)

Are these lines corresponding to algorithm 2 line 10? For e_{\hat_{t}} ?

Comment mentioned a missing notebook

https://github.com/intel-isl/MultiObjectiveOptimization/blob/1c6d0d503ccf33cc83d5b6c356ca2fc2bf255606/multi_task/train_multi_task.py#L158

This comment mentions a notebook which cannot be found in the repository. Is it available somewhere else?

Redundancy of optimiser.zero_grad()?

There are two calls of optimizer.zero_grad()

https://github.com/intel-isl/MultiObjectiveOptimization/blob/5d8e8343c56cf3184081f71cc4a661af64e7cf7e/multi_task/train_multi_task.py#L114

https://github.com/intel-isl/MultiObjectiveOptimization/blob/5d8e8343c56cf3184081f71cc4a661af64e7cf7e/multi_task/train_multi_task.py#L146

before optimizer.zero_grad() is called finally in line:

https://github.com/intel-isl/MultiObjectiveOptimization/blob/5d8e8343c56cf3184081f71cc4a661af64e7cf7e/multi_task/train_multi_task.py#L173

and optimizer.step() is applied.

Are the first two calls of optimizer.zero_grad() necessary or redundant?

Thx

A question about the output of rep

Hello, my backbone has two output. For example, q0, q4 are both reps that will be used in one task head. I don't know how to adjust the code. Could you give me some advices? thanks!

ModuleNotFoundError: No module named 'models.gradient_scaler'

After executing the python multi_task/train_multi_task.py --param_file=./sample.json,
i got error below:
from models.gradient_scaler import MinNormElement
ModuleNotFoundError: No module named 'models.gradient_scaler'

Cityscapes experiment

Hi,
Thanks for open sourcing the code, this is great!
Could you share your json parameter file for cityscapes?
Also, I think it is is missing the file depth_mean.npy to be able to run it.

Thanks.

KeyError with the tensorboardX

Hello, can anybody tell me how to fix this issue? Is this a version inconsistency with the tensorboardX?
Cuz I am exactly run the code offered on the github without changing anything.
Appreciate in advance.

How to convert .pkl file to .onnx model. I mean there are two parts of the whole model.

For example:
model_task0 = ResNet() + FaceAttributeNet()
model_task1 = ResNet() + FaceAttributeNet()
How can i convert the .pkl data to onnx model

Classification Task on different datasets.

Hi, if I use different datasets to each task. Does this algorithm still works? e.g. Three classification tasks on datasets MNIST, ImageNet, Celebra, using multi objective optimization method proposed in your paper. Can I get better performances both on these three classification tasks?

Question about the MinNormSolver

Thanks for your code!

Here is my question.

According to the following code:

# Compute gradients of each loss function wrt z
for t in tasks:
    optimizer.zero_grad()
    out_t, masks[t] = model[t](rep_variable, None)
    loss = loss_fn[t](out_t, labels[t])
    loss_data[t] = loss.item()
    loss.backward()
    grads[t] = []
    if list_rep:
        grads[t].append(Variable(rep_variable[0].grad.data.clone(), requires_grad=False))
        rep_variable[0].grad.data.zero_()
    else:
        grads[t].append(Variable(rep_variable.grad.data.clone(), requires_grad=False))
        rep_variable.grad.data.zero_()

The grads is in the following format:

grads = {'Task_1': [grad_1], 'Task_2': [grad_2]}

where grad_1 and grad_2 are tensors of shape [batch_size, N]

Then the argument vecs of MinNormSolver.find_min_norm_element is in the following format:

vecs = [[grad_1], [grad_2]]

In this way, the vecs[i][k] is not a 1D vector and will lead to error.

    def _min_norm_2d(vecs, dps):
        """
        Find the minimum norm solution as combination of two points
        This is correct only in 2D
        ie. min_c |\sum c_i x_i|_2^2 st. \sum c_i = 1 , 1 >= c_1 >= 0 for all i, c_i + c_j = 1.0 for some i, j
        """
        dmin = 1e8
        for i in range(len(vecs)):
            for j in range(i+1,len(vecs)):
                if (i,j) not in dps:
                    dps[(i, j)] = 0.0
                    for k in range(len(vecs[i])):
                        dps[(i,j)] += torch.dot(vecs[i][k], vecs[j][k]).item()
                    dps[(j, i)] = dps[(i, j)]
                if (i,i) not in dps:
                    dps[(i, i)] = 0.0
                    for k in range(len(vecs[i])):
                        dps[(i,i)] += torch.dot(vecs[i][k], vecs[i][k]).item()
                if (j,j) not in dps:
                    dps[(j, j)] = 0.0   
                    for k in range(len(vecs[i])):
                        dps[(j, j)] += torch.dot(vecs[j][k], vecs[j][k]).item()
                c,d = MinNormSolver._min_norm_element_from2(dps[(i,i)], dps[(i,j)], dps[(j,j)])
                if d < dmin:
                    dmin = d
                    sol = [(i,j),c,d]
        return sol, dps

Unstable GPU usage while Training on Cityscapes

Hi,

Thanks for open-sourcing the awesome code.

I am using PyTorch 3.6 to run the equal-weights setup(without using mgda) on Cityscapes with a single GPU, but the GPU usage is extremely unstable and keeps shifting between 0 - 99. Do you know what could cause that?

Thanks

instance segmentation

During the inference, i predict instance vectors for x/y coordinates,
How can i get the final instance output?

The cityscapes training time is so long

Thanks for your amazing code. I want to reproduce the cityscapes result.

I use one P40 card and run the training code. However, it doesn't finish one epoch in 4 hours.

I want to know if there are some problems and how long did you train the net in cityscapes dataset.

Thank you very much.

Connection between fmp and task branches

May I ask what should be the correct dimension of grads?
Suppose the output feature map size is 100, and the batch_size is 512, should the rep and grads be of size [512, 100]? And how do you deal with the results of batch_size in the following calculation? Do you average the batchsize results into single result?
In my experiments, when I compute the _min_norm_2d, the input vec is like [[tensor1], [tensor2], [tensor3]], and each tensor is of size 512x100. Therefore, the issue occurs with dps[(i,j)] += torch.dot(vecs[i][k], vecs[j][k]).data[0], RuntimeError: 1D tensors expected, got 2D

Unable to obtain the same average error as Table 1 in the paper

The average error on celeba is 8.25 in the paper, but my result is 10.14, which is obtained by directly using the codes in this repo. I list the configurations I used below, is there any problem?

in configs.json

    "celeba": {
        "path": "data/celeba",
        "all_tasks": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20",
                      "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39"],
        "img_rows": 64,
        "img_cols": 64
    }

in sample.json

{
    "optimizer": "Adam", 
    "batch_size": 128, 
    "lr": 0.0005,
    "dataset": "celeba", 
    "tasks": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20",
                  "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39"],
    "normalization_type": "loss+",
    "algorithm": "mgda",
    "use_approximation": true,
    "scales": {"0":0.025, "1":0.025, "2":0.025, "3":0.025, "4":0.025, "5":0.025, "6":0.025, "7":0.025, "8":0.025, "9":0.025, "10":0.025, 
               "11":0.025, "12":0.025, "13":0.025, "14":0.025, "15":0.025, "16":0.025, "17":0.025, "18":0.025, "19":0.025, "20":0.025, 
               "21":0.025, "22":0.025, "23":0.025, "24":0.025, "25":0.025, "26":0.025, "27":0.025, "28":0.025, "29":0.025, "30":0.025, 
               "31":0.025, "32":0.025, "33":0.025, "34":0.025, "35":0.025, "36":0.025, "37":0.025, "38":0.025, "39":0.025}
}

and the learning rate is multiplied by 0.85 every 10 epochs:

    if (epoch+1) % 10 == 0:
        # Every 50 epoch, half the LR
        for param_group in optimizer.param_groups:
            param_group['lr'] *= 0.85
        logger.info('Multiply the learning rate by {} [{} steps]'.format(0.85, n_iter))

the time of "loss_data[t] = loss.data[0]" is so long

I find that "loss_data[t] = loss.data[0]" will cost 0.5s.
Is there any way to decrease the cost?

two times if 'D' in params['tasks'], first one always ignored

https://github.com/intel-isl/MultiObjectiveOptimization/blob/1c6d0d503ccf33cc83d5b6c356ca2fc2bf255606/multi_task/losses.py#L79

Unable to obtain the same average error as Table 1 in the paper

The average error on celeba is 8.25 in the paper, but my result is 10.14, which is obtained by directly using the codes in this repo. I list the configurations I used below, is there any problem?

in configs.json

    "celeba": {
        "path": "data/celeba",
        "all_tasks": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20",
                      "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39"],
        "img_rows": 64,
        "img_cols": 64
    }

in sample.json

{
    "optimizer": "Adam", 
    "batch_size": 256, 
    "lr": 0.0005,
    "dataset": "celeba", 
    "tasks": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20",
                  "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39"],
    "normalization_type": "loss+",
    "algorithm": "mgda",
    "use_approximation": true,
    "scales": {"0":0.025, "1":0.025, "2":0.025, "3":0.025, "4":0.025, "5":0.025, "6":0.025, "7":0.025, "8":0.025, "9":0.025, "10":0.025, 
               "11":0.025, "12":0.025, "13":0.025, "14":0.025, "15":0.025, "16":0.025, "17":0.025, "18":0.025, "19":0.025, "20":0.025, 
               "21":0.025, "22":0.025, "23":0.025, "24":0.025, "25":0.025, "26":0.025, "27":0.025, "28":0.025, "29":0.025, "30":0.025, 
               "31":0.025, "32":0.025, "33":0.025, "34":0.025, "35":0.025, "36":0.025, "37":0.025, "38":0.025, "39":0.025}
}

and the learning rate is multiplied by 0.85 every 10 epochs:

    if (epoch+1) % 10 == 0:
        # Every 50 epoch, half the LR
        for param_group in optimizer.param_groups:
            param_group['lr'] *= 0.85
        logger.info('Multiply the learning rate by {} [{} steps]'.format(0.85, n_iter))


and I train the model on two GPUs.

sample.json for Multi Mnist?

Hi. thanks for sharing.
what is the structure for sample.json and config.json when I want to use code for multiMnist dataset?

Get test accuracy for MultiMNIST

Hi,

I'm wondering how I can get the test accuracy for MultiMNIST. Right now, it seems to me that the metric results logged at each epoch are the test accuracies. Can you clarify how you obtain the test accuracy numbers in the paper? Thanks!

How is the derivative wrt. representations computed in a single backward pass?

First, thank you for the nice work. If you don't mind, I have a question:

You mention right after defining the upper bound (Eq. 6) that the derivative wrt. representations, i.e. \nabla_Z L^t, can be computed in a single backward pass for all tasks. But all tasks are sharing the same representation nodes, so how is this possible? Wouldn't this also apply then to the original derivative wrt. to the shared parameters?

Depth loss

There is a bug in the depth loss.
All entries smaller than zero are masked. This causes most of the entries to be excluded in the loss, as the depth map has been normalized. The correct way would be to mask only the zero entries in the loss.

different sizes of gradient

Hi, thanks for sharing your code!

I have one question about the size of gradient. I found that in the current code, only 2D gradient is supported. But if the parameter has 4 dimensional, such as (batch-size, N, width, height), how should I modify the min-norm solver ?

	grad_dir = -1.0*np.dot(grad_mat, sol_vec)
	new_point = MinNormSolver._next_point(sol_vec, grad_dir, n)