pFedMe's Introduction

  • πŸ‘‹ Hi, I’m Dr. Canh T. Dinh (@CharlieDinh).
  • πŸ‘€ I’m interested in Federated Machine Learning, NLP, Computer Vision, and related areas.
  • 🌱 I obtained a PhD in Privacy-Preserving Machine Learning at The University of Sydney in 2023.
  • 🌱 I’m a Machine Learning Engineer at Canva.
  • πŸ’žοΈ I’m looking to collaborate on Federated Machine Learning research or any ML research.
  • πŸ“« How to reach me: [email protected].

pFedMe's People

Contributors

charliedinh, darlinghang, joshnguyen99, sshpark


pFedMe's Issues

Client's train method

Hi, I have a question about the train method in the clients. Normally, in each epoch the complete dataset is processed in batches, but I see in your code that in each epoch only a single batch is trained on. Is this intended? Best regards.
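
The contrast the question raises can be sketched as follows (a pure-Python skeleton; `step` stands in for one optimizer update and is an illustrative name, not the repository's API):

```python
def epoch_full(batches, step):
    # Conventional epoch: every batch in the dataset is used once.
    for X, y in batches:
        step(X, y)

def epoch_single_batch(batch_iter, step):
    # Pattern described in the question: one batch per "epoch".
    X, y = next(batch_iter)
    step(X, y)
```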

pFedMe Optimizer Problem

Hello,
I want to ask: why does the code differ from the algorithm?
That is, fedoptimizer.py line 64 reads:
p.data = p.data - group['lr'] * (p.grad.data + group['lamda'] * (p.data - localweight.data) + group['mu']*p.data)
while Algorithm 1 line 8 in the paper looks different.
[screenshot of Algorithm 1 omitted]

About the Hessian Approximation

Dear authors,
I have read the two implementations, pFedMe and Per-FedAvg.
One issue for me is that both implementations seem to be missing the Hessian approximation that the Per-FedAvg paper uses in the meta-update phase.
Is this critical in the Per-FedAvg and pFedMe settings?
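
To make the question concrete: for a scalar parameter, the exact MAML meta-gradient carries a Hessian factor that the first-order variant drops. A minimal sketch (names are illustrative, not from the repository):

```python
def maml_meta_grad(theta, grad_fn, hess_fn, alpha):
    # Exact MAML meta-gradient for a scalar parameter:
    #   (1 - alpha * f''(theta)) * f'(theta - alpha * f'(theta))
    # The first-order approximation keeps only the second factor,
    # i.e. it drops the (1 - alpha * f'') Hessian term.
    theta_prime = theta - alpha * grad_fn(theta)
    return (1 - alpha * hess_fn(theta)) * grad_fn(theta_prime)
```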

Is Per-FedAvg implemented properly?

In the code, when training Per-FedAvg, there are two steps, and each step samples a batch of data and performs a parameter update. But in the MAML framework, I think the first step should obtain a fast weight, and the second step should update the parameters based on the fast weight from the first step. So why do you update the parameters twice? Does the fundamental difference between Per-FedAvg and FedAvg lie in the former performing a two-step update while the latter performs a one-step update? Is this fair to FedAvg?
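
The two-step scheme the question describes can be sketched for a scalar parameter as follows (first-order MAML; `grad_fn1`/`grad_fn2` are the loss gradients on the two sampled batches and are illustrative names, not the repository's):

```python
def per_fedavg_two_steps(theta, grad_fn1, grad_fn2, alpha, beta):
    # Step 1: inner step on batch 1 yields the fast weight theta'.
    theta_fast = theta - alpha * grad_fn1(theta)
    # Step 2: meta update of the ORIGINAL theta using the gradient
    # evaluated at the fast weight (not a second plain SGD step).
    return theta - beta * grad_fn2(theta_fast)
```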

Some mistakes in generating niid mnist data

Thanks to the author for fixing some old errors in the file two months ago, but there are still some errors that need attention.

[screenshot omitted]

  1. Line 39: "l = (user * NUM_USERS + j) % 10" should be changed to "l = (user * NUM_LABELS + j) % 10". The former causes a data-allocation error: all users are assigned data with the same labels.
  2. Line 81: the code that computes "l" should match line 39.
  3. Line 86: in "if idx[l] + num_samples < len(mnist_data[l]):", the "<" should be "<=". Otherwise the last part of each label's data is never assigned to a user. (This problem appeared because the author fixed an older error on line 87, changing "mnist_data[l][idx[l]:num_samples]" to "mnist_data[l][idx[l]:idx[l]+num_samples]".)
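
The fix in item 1 can be checked with a small sketch (constants assumed to match the generator: 20 users and 2 labels per user):

```python
NUM_USERS = 20
NUM_LABELS = 2  # labels per user in the niid generator (assumed)

def labels_for_user(user, stride):
    # Labels assigned to a user when the index advances by `stride`.
    # Buggy stride NUM_USERS: every user gets the same labels (20 % 10 == 0).
    # Corrected stride NUM_LABELS: users cycle through distinct label pairs.
    return [(user * stride + j) % 10 for j in range(NUM_LABELS)]
```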

Some questions about your paper and your code

Hi, I am very interested in your work. I have a few questions.

  1. Does the model's expressive power have a great influence on pFedMe?
    You propose a personalized FL method so that clients with different data statistics can train personalized models. pFedMe sends the same parameters to each selected client at the beginning of each glob_iteration, and each client then trains its local model starting from this w. If the model has enough expressive power to fit most training data across many clients, will these clients still train genuinely personalized models? Will pFedMe still outperform FedAvg?
  2. The code
    You create the variables local_model, persionalized_model, and persionalized_model_bar for personalized FL, but it seems that persionalized_model is never used and persionalized_model_bar is just a copy of local_model. Is there anything I missed?

A question in PerAvg algorithm

Thank you for your code. I have a question about the code of the PerAvg algorithm.
When using the evaluate_one_step function (in serverperavg.py) to evaluate the performance of PerAvg, the function first executes
for c in self.users: c.train_one_step()
to train the personalized models for one step. However, train_one_step appears to use testing data to update the personalized model. Is that right?
Source code:
```
def train_one_step(self):
    self.model.train()
    # step 1
    X, y = self.get_next_test_batch()
    self.optimizer.zero_grad()
    output = self.model(X)
    loss = self.loss(output, y)
    loss.backward()
    self.optimizer.step()
    # step 2
    X, y = self.get_next_test_batch()
    self.optimizer.zero_grad()
    output = self.model(X)
    loss = self.loss(output, y)
    loss.backward()
    self.optimizer.step(beta=self.beta)
```

Looking forward to your reply! Thank you!

Something maybe wrong in data/mnist/generate_niid_20users.py

I think the code "l = (user * NUM_USERS + j) % 10" in line 39 should be modified to "l = (user * NUM_LABELS + j) % 10".
Using "user * NUM_USERS" causes all users to share data with the same labels, because user * NUM_USERS % 10 == 0 when NUM_USERS is 20.

Cifar-10 running error

When I use Cifar-10 and the NetCifar model, I get the following error:

RuntimeError: Expected 4-dimensional input for 4-dimensional weight [16, 1, 2, 2], but got 3-dimensional input of size [1, 28, 28] instead.
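
The message points to two possible mismatches: the weight [16, 1, 2, 2] expects a single input channel (an MNIST-style model), and the input [1, 28, 28] is a single MNIST-shaped image without the batch dimension a Conv2d needs. A small diagnostic sketch (the helper is hypothetical, not part of the repository):

```python
def diagnose_conv2d_input(input_shape, in_channels):
    # Explain why a Conv2d forward pass would reject `input_shape`.
    if len(input_shape) != 4:
        # Conv2d wants batches (N, C, H, W); a lone image (C, H, W)
        # needs x.unsqueeze(0) or must come through a DataLoader.
        return "missing batch dimension"
    if input_shape[1] != in_channels:
        # e.g. CIFAR-10 images have 3 channels; a weight of shape
        # [out, 1, kH, kW] was built for 1-channel (MNIST) input.
        return "channel mismatch"
    return "ok"
```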

Unable to generate non-iid MNIST Data

Describe the bug
While generating the non-iid MNIST data, generate_niid_20users.py runs into an error.

To Reproduce
Steps to reproduce the behavior:

  1. Go to 'data/Mnist'
  2. Run python generate_niid_20users.py

Trace

100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 10/10 [00:00<00:00, 38.74it/s]

Numb samples of each label:
 [6903, 7877, 6990, 7141, 6824, 6313, 6876, 7293, 6825, 6958]
idx 0        False
1        False
2        False
3        False
4         True
         ...  
69995    False
69996    False
69997    False
69998    False
69999    False
Name: class, Length: 70000, dtype: bool
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 20/20 [00:00<00:00, 135300.13it/s]
--------------
[0 1 2 3 4 5 6 7 8 9] [4 4 4 4 4 4 4 4 4 4]
6903
[2441, 1127, 1575, 1760]
7877
[2671, 1946, 1367, 1893]
6990
[2358, 1070, 841, 2721]
7141
[2630, 2202, 715, 1594]
6824
[1721, 1934, 1101, 2068]
6313
[2169, 1080, 1102, 1962]
6876
[2043, 1364, 1255, 2214]
7293
[2211, 2518, 598, 1966]
6825
[1506, 2480, 574, 2265]
6958
[1710, 1878, 1208, 2162]
--------------
[[2441, 1127, 1575, 1760], [2671, 1946, 1367, 1893], [2358, 1070, 841, 2721], [2630, 2202, 715, 1594], [1721, 1934, 1101, 2068], [2169, 1080, 1102, 1962], [2043, 1364, 1255, 2214], [2211, 2518, 598, 1966], [1506, 2480, 574, 2265], [1710, 1878, 1208, 2162]]
[2441, 1127, 1575, 1760]
[2671, 1946, 1367, 1893]
[2358, 1070, 841, 2721]
[2630, 2202, 715, 1594]
[1721, 1934, 1101, 2068]
[2169, 1080, 1102, 1962]
[2043, 1364, 1255, 2214]
[2211, 2518, 598, 1966]
[1506, 2480, 574, 2265]
[1710, 1878, 1208, 2162]
[2441, 1127, 1575, 1760]
[2671, 1946, 1367, 1893]
[2358, 1070, 841, 2721]
[2630, 2202, 715, 1594]
[1721, 1934, 1101, 2068]
[2169, 1080, 1102, 1962]
[2043, 1364, 1255, 2214]
[2211, 2518, 598, 1966]
[1506, 2480, 574, 2265]
[1710, 1878, 1208, 2162]
[2441, 1127, 1575, 1760]
[2671, 1946, 1367, 1893]
[2358, 1070, 841, 2721]
[2630, 2202, 715, 1594]
[1721, 1934, 1101, 2068]
[2169, 1080, 1102, 1962]
[2043, 1364, 1255, 2214]
[2211, 2518, 598, 1966]
[1506, 2480, 574, 2265]
[1710, 1878, 1208, 2162]
[2441, 1127, 1575, 1760]
[2671, 1946, 1367, 1893]
[2358, 1070, 841, 2721]
[2630, 2202, 715, 1594]
[1721, 1934, 1101, 2068]
[2169, 1080, 1102, 1962]
[2043, 1364, 1255, 2214]
[2211, 2518, 598, 1966]
[1506, 2480, 574, 2265]
[1710, 1878, 1208, 2162]
--------------
[2441, 2671, 2358, 2630, 1721, 2169, 2043, 2211, 1506, 1710, 1127, 1946, 1070, 2202, 1934, 1080, 1364, 2518, 2480, 1878, 1575, 1367, 841, 715, 1101, 1102, 1255, 598, 574, 1208, 1760, 1893, 2721, 1594, 2068, 1962, 2214, 1966, 2265, 2162]
  0%|                                                                                                                                           | 0/20 [00:00<?, ?it/s]value of L 0
value of count 0
  0%|                                                                                                                                           | 0/20 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "generate_niid_20users.py", line 86, in <module>
    X[user] += mnist_data[l][idx[l]:num_samples].tolist()
  File "/Users/sharadchitlangia/miniconda3/envs/FL/lib/python3.6/site-packages/pandas/core/frame.py", line 2881, in __getitem__
    indexer = convert_to_index_sliceable(self, key)
  File "/Users/sharadchitlangia/miniconda3/envs/FL/lib/python3.6/site-packages/pandas/core/indexing.py", line 2132, in convert_to_index_sliceable
    return idx._convert_slice_indexer(key, kind="getitem")
  File "/Users/sharadchitlangia/miniconda3/envs/FL/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3159, in _convert_slice_indexer
    self._validate_indexer("slice", key.start, "getitem")
  File "/Users/sharadchitlangia/miniconda3/envs/FL/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 5000, in _validate_indexer
    self._invalid_indexer(form, key)
  File "/Users/sharadchitlangia/miniconda3/envs/FL/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3271, in _invalid_indexer
    f"cannot do {form} indexing on {type(self).__name__} with these "
TypeError: cannot do slice indexing on Int64Index with these indexers [False] of type bool_
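
From the output above, `idx` appears to be a boolean mask over all 70,000 rows, so slicing with `idx[l]` fails; the generator (in its fixed form) wants a per-label integer cursor instead. A sketch of that intended pattern (pure Python; names assumed, not the script's exact code):

```python
idx = {l: 0 for l in range(10)}  # next unread offset for each label

def take(data_for_label, l, num_samples):
    # Return the next num_samples items of label l and advance the cursor,
    # matching the fixed slice data[idx[l] : idx[l] + num_samples].
    start = idx[l]
    chunk = data_for_label[start:start + num_samples]
    idx[l] = start + num_samples
    return chunk
```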

some questions about the results

Thanks for your code! Recently I tried to run your code with the non-IID MNIST dataset, but my final results did not match those presented in your paper, and I'm wondering why: it seems that FedAvg did better than pFedAvg. I tried different models, but apparently that did not help. Maybe I need some suggestions. Sorry for bothering you!
[plots: FedAvg_pFedMe_Com_accuracy_3_18, FedAvg_pFedMe_Com_loss_3_18]

UserpFedMe class

Hi, thanks for sharing your code. I have a few questions. The first is about the model update inside the UserpFedMe class. Specifically, I don't quite understand this part:

self.update_parameters(self.local_model)

What this line does is update the personalized parameters (self.model.parameters()) to the final updated parameters of self.local_model. Can you explain this further? Maybe I have missed something, but self.model.parameters() is already updated inside the inner optimization, as done here:

p.data = p.data - group['lr'] * (p.grad.data + group['lamda'] * (p.data - localweight.data) + group['mu']*p.data)

and I don't understand why we need to set it back to the final local weights. My second question: I don't understand this line:

old_param.data = new_param.data.clone()

It makes sense to update local_param, but I don't quite get why we need to update old_param. Please correct me if I am wrong, but the only reason I can think of is that we are evaluating on the final aggregated model. My third question is here:

for user in self.users:

where we train on all users, not just the selected users. Why is that? Thanks much.
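
For context, the interaction between the personalized parameters and local_model that these questions concern can be sketched for a scalar parameter (the structure follows the pFedMe updates discussed above; the function and hyperparameter names are illustrative, not the repository's API):

```python
def pfedme_client_round(w, grad_fn, lr, lamda, eta, K):
    # Inner loop: K proximal steps give the personalized parameter theta,
    # approximately minimizing f(theta) + (lamda/2) * (theta - w)^2.
    theta = w
    for _ in range(K):
        theta = theta - lr * (grad_fn(theta) + lamda * (theta - w))
    # Local-model update: pull the local model w toward the personalized theta.
    w = w - eta * lamda * (w - theta)
    return theta, w
```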
