shaoxiongji / federated-learning

A PyTorch Implementation of Federated Learning http://doi.org/10.5281/zenodo.4321561

Home Page: http://doi.org/10.5281/zenodo.4321561

License: MIT License

Python 100.00%

federated-learning deep-learning pytorch

federated-learning's Issues

About non-iid sampling

How and why did you choose num_shards, num_imgs = 200, 300?

Why does sampling.py not provide non-IID sampling for CIFAR-10?

Is it impossible to implement, or is there another reason?
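
For context, the values 200 and 300 match the pathological non-IID split described by McMahan et al. for MNIST: sort the 60,000 training images by label, cut them into 200 shards of 300 images, and give each of 100 clients 2 shards. CIFAR-10 has 50,000 training images, so the same scheme would need different numbers (for example 200 shards of 250). Below is a minimal sketch of that scheme, assuming only a list of integer labels. The function name shard_noniid and its parameters are hypothetical and not the repository's API.

```python
import random


def shard_noniid(labels, num_users, shards_per_user, seed=0):
    """Split sample indices into label-sorted shards and deal them to users.

    Hypothetical helper: `labels` is a sequence of integer class labels.
    Returns {user_id: list of sample indices}.
    """
    num_shards = num_users * shards_per_user
    if len(labels) % num_shards != 0:
        raise ValueError("dataset size must be divisible by the number of shards")
    shard_size = len(labels) // num_shards

    # Sort indices by label so that each shard holds (mostly) one class.
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    shards = [order[s * shard_size:(s + 1) * shard_size] for s in range(num_shards)]

    # Shuffle the shard ids, then hand each user `shards_per_user` shards.
    rng = random.Random(seed)
    shard_ids = list(range(num_shards))
    rng.shuffle(shard_ids)
    dict_users = {}
    for user in range(num_users):
        picked = shard_ids[user * shards_per_user:(user + 1) * shards_per_user]
        dict_users[user] = [i for s in picked for i in shards[s]]
    return dict_users


# MNIST-sized toy labels: 60,000 samples, 10 balanced classes.
labels = [i % 10 for i in range(60000)]
dict_users = shard_noniid(labels, num_users=100, shards_per_user=2)
```

With 100 users and 2 shards each, this gives 200 shards of 300 samples, so every client sees at most two classes.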

Can multiprocessing speed up the training?

First of all, thank you for your contribution.

I don't understand the statement "Note: The scripts will be slow without the implementation of parallel computing."
What does "parallel computing" mean here?
As I understand the code below, each client's local training runs sequentially.

federated-learning/main_fed.py

Lines 83 to 90 in 5a9da1a

for idx in idxs_users:
    local = LocalUpdate(args=args, dataset=dataset_train, idxs=dict_users[idx])
    w, loss = local.train(net=copy.deepcopy(net_glob).to(args.device))
    if args.all_clients:
        w_locals[idx] = copy.deepcopy(w)
    else:
        w_locals.append(copy.deepcopy(w))
    loss_locals.append(copy.deepcopy(loss))

What do you think about multiprocessing with each process corresponding to each client?
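
One way to picture the idea is a worker pool that runs each selected client's update concurrently and collects the results in order. The sketch below is only an illustration: local_train is a hypothetical placeholder, not the repository's LocalUpdate.train, and the executor_cls parameter is an assumption added so the pool type can be swapped.

```python
from concurrent.futures import ProcessPoolExecutor


def local_train(client_id, global_weights):
    """Placeholder for one client's local update (hypothetical).

    A real version would build the model, load `global_weights`, train on the
    client's data, and return the new weights and the loss.
    """
    new_weights = {k: v + client_id for k, v in global_weights.items()}
    loss = 1.0 / (client_id + 1)
    return new_weights, loss


def train_round(client_ids, global_weights, max_workers=4,
                executor_cls=ProcessPoolExecutor):
    """Run `local_train` for every selected client in a worker pool."""
    with executor_cls(max_workers=max_workers) as pool:
        futures = [pool.submit(local_train, cid, global_weights)
                   for cid in client_ids]
        # Collect results in submission order so they line up with client_ids.
        results = [f.result() for f in futures]
    w_locals = [w for w, _ in results]
    loss_locals = [loss for _, loss in results]
    return w_locals, loss_locals
```

On platforms that start workers by spawning (Windows, macOS), the call to train_round belongs under an `if __name__ == "__main__":` guard. Whether this actually speeds things up depends on the hardware: processes that share a single GPU may gain little.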

About the implementation of Fed.py

I think the averaging is wrong when the data distribution is non-IID. It should be changed to:
import copy

def FedAvg(w, dict_len):
    # w: list of client state dicts; dict_len: number of samples on each client
    w_avg = copy.deepcopy(w[0])
    for k in w_avg.keys():
        w_avg[k] = w_avg[k] * dict_len[0]
        for i in range(1, len(w)):
            w_avg[k] += w[i][k] * dict_len[i]
        w_avg[k] = w_avg[k] / sum(dict_len)
    return w_avg
Here dict_len is a list containing the number of samples on each client.
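
A quick way to sanity-check the proposed weighting is to run it on scalar "parameters", where the expected result can be worked out by hand. This is a sketch of the suggestion above, not the repository's Fed.py. The name fed_avg_weighted is hypothetical, and the same arithmetic is assumed to carry over to tensors.

```python
import copy


def fed_avg_weighted(w, dict_len):
    """Average client state dicts, weighting client i by dict_len[i] samples."""
    if not w or len(w) != len(dict_len):
        raise ValueError("need one sample count per client")
    total = sum(dict_len)
    w_avg = copy.deepcopy(w[0])
    for k in w_avg.keys():
        w_avg[k] = w_avg[k] * dict_len[0]
        for i in range(1, len(w)):
            w_avg[k] = w_avg[k] + w[i][k] * dict_len[i]
        w_avg[k] = w_avg[k] / total
    return w_avg


# Toy check: client 0 has 100 samples, client 1 has 300.
clients = [{"weight": 1.0, "bias": 0.0}, {"weight": 5.0, "bias": 4.0}]
avg = fed_avg_weighted(clients, [100, 300])
print(avg)  # {'weight': 4.0, 'bias': 3.0}
```

The larger client pulls the average toward its values: (1*100 + 5*300) / 400 = 4.0, whereas an unweighted mean would give 3.0.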

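Results of mlp-noniid on the MNIST dataset

May I ask why the test accuracy of mlp-noniid-mnist is 75% on the first run, but 78% or even 83%+ on the second? What causes such a large variation?

First run: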

Round   0, Average loss 0.133
Round   1, Average loss 0.097
Round   2, Average loss 0.084
Round   3, Average loss 0.063
Round   4, Average loss 0.075
Round   5, Average loss 0.057
Round   6, Average loss 0.041
Round   7, Average loss 0.049
Round   8, Average loss 0.076
Round   9, Average loss 0.056
Training accuracy: 74.83
Testing accuracy: 75.21

Second run:

Round   0, Average loss 0.128
Round   1, Average loss 0.068
Round   2, Average loss 0.099
Round   3, Average loss 0.060
Round   4, Average loss 0.057
Round   5, Average loss 0.070
Round   6, Average loss 0.069
Round   7, Average loss 0.057
Round   8, Average loss 0.066
Round   9, Average loss 0.049
Training accuracy: 78.18
Testing accuracy: 78.39

experiment on other tasks

Can I experiment on other tasks? For example, some tasks in NLP.

How to obtain the intermediate gradients of each client in FL

How can I obtain the intermediate gradients of each client in FL using PyTorch? I tried hooks but couldn't figure it out.

The dataset seems to be in trouble

Hi, when I ran your code locally, the program reported an error while downloading the test dataset. The dataset website cannot be accessed.

Testing accuracy is very low

Dear author,
First, thank you for your code.
I ran your code, but the result is not satisfactory.
Result:
Training accuracy: 43.00
Testing accuracy: 43.00

My command:

python main_fed.py --dataset cifar --num_channels 1 --model cnn --epochs 10 --gpu 0 --iid

Looking forward to your reply.
Best wishes

about the implementation of FedAvg

Why does the FedAvg use a simple average without weight?
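
For reference, FedAvg aggregation is usually written as a sample-weighted average over the participating clients. When every client holds the same number of samples, it reduces to the plain mean. Here $n_k$ is the number of samples on client $k$, $S_t$ is the set of clients selected in round $t$, and $w^{k}_{t+1}$ are client $k$'s weights after local training. Whether the repository's partitions are always equal-sized is an assumption to check against sampling.py.

```latex
w_{t+1} = \sum_{k \in S_t} \frac{n_k}{\sum_{j \in S_t} n_j}\, w^{k}_{t+1}
\qquad \xrightarrow{\; n_k = n_0 \ \text{for all } k \;} \qquad
w_{t+1} = \frac{1}{|S_t|} \sum_{k \in S_t} w^{k}_{t+1}
```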

Max number of clients

What is the max number of clients that can be selected in each round of training using this code?

Multi-machine experiments

Have you ever tried training across multiple machines?

num_workers

Why does a larger num_workers in DataLoader make no difference? How can I increase GPU utilization?

fixture 'net_g' not found

When I run the "main_nn.py", an error appears:
`============================= test session starts ==============================
platform linux -- Python 3.6.9, pytest-5.3.1, py-1.8.0, pluggy-0.13.1 -- /home/anaconda3/envs/pytorch/bin/python3.6
cachedir: .pytest_cache
rootdir: /home/federated-learning-master
collecting ... collected 1 item

main_nn.py::test ERROR [100%]
test setup failed
file /home/federated-learning-master/main_nn.py, line 19
def test(net_g, data_loader):
E fixture 'net_g' not found

  available fixtures: cache, capfd, capfdbinary, caplog, capsys, capsysbinary, doctest_namespace, monkeypatch, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory
  use 'pytest --fixtures [testpath]' for help on them.

/home/federated-learning-master/main_nn.py:19`

How can I solve it?

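Parallelism

Hello, does the code implement parallel training on CPU? Or does it only simulate federated learning by training each worker in turn every round and then collecting the parameters?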

What are "the parameters C=0.1, B=10, E=5"?

Please tell me what "the parameters C=0.1, B=10, E=5" means, and which options in options.py correspond to C, B, and E. Thanks.
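
In the FedAvg paper's notation, C is the fraction of clients selected in each round, B is the local minibatch size, and E is the number of local epochs each client runs per round. The sketch below shows how such parameters might be exposed on the command line. The flag names --local_bs and --local_ep are assumptions about options.py; only args.frac and args.num_users appear in the main_fed.py snippets quoted on this page.

```python
import argparse


def build_parser():
    """Hypothetical parser; flag names are assumptions, not copied from options.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_users", type=int, default=100, help="number of clients K")
    parser.add_argument("--frac", type=float, default=0.1, help="C: fraction of clients per round")
    parser.add_argument("--local_bs", type=int, default=10, help="B: local minibatch size")
    parser.add_argument("--local_ep", type=int, default=5, help="E: local epochs per round")
    return parser


args = build_parser().parse_args([])  # defaults correspond to C=0.1, B=10, E=5
# Number of clients trained per round, computed the same way as in main_fed.py.
m = max(int(args.frac * args.num_users), 1)
print(m)  # 10
```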

Runtime error on cuda

`bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
cuda:0
CNNMnist(
  (conv1): Conv2d(1, 10, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(10, 20, kernel_size=(5, 5), stride=(1, 1))
  (conv2_drop): Dropout2d(p=0.5)
  (fc1): Linear(in_features=320, out_features=50, bias=True)
  (fc2): Linear(in_features=50, out_features=10, bias=True)
)
/opt/conda/lib/python3.6/site-packages/torchvision/datasets/mnist.py:43: UserWarning: train_labels has been renamed targets
  warnings.warn("train_labels has been renamed targets")
Traceback (most recent call last):
  File "main_fed.py", line 113, in
    w, loss = local.train(net=copy.deepcopy(net_glob).to(args.device))
  File "/code/models/Update.py", line 48, in train
    loss = self.loss_func(log_probs, labels)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 904, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/functional.py", line 1970, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/functional.py", line 1790, in nll_loss
    ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: Expected object of backend CUDA but got backend CPU for argument 'weight'`

I get the above error, only when trying to run it on CUDA.

PyTorch CrossEntropy function already includes softmax

Hi, thanks for your nice code.

However, I think your code has a bug: it applies the CrossEntropy loss after a softmax activation, but PyTorch's CrossEntropy loss expects raw logits as input and applies log-softmax itself.

After removing the softmax activation, I was able to improve the MLP accuracy from 90% to 95%.
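
The effect can be reproduced without any framework. The sketch below is a pure-Python illustration, assuming the loss is the usual negative log of the softmax probability of the target class. It compares feeding raw logits with feeding already-softmaxed outputs into that loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Cross-entropy computed from raw logits: -log(softmax(logits)[target])."""
    return -math.log(softmax(logits)[target])


logits = [8.0, 0.0, 0.0]  # a confident, correct prediction for class 0
target = 0

loss_correct = cross_entropy(logits, target)          # logits go in directly
loss_double = cross_entropy(softmax(logits), target)  # softmax applied twice

print(f"loss on logits:          {loss_correct:.4f}")
print(f"loss on softmax outputs: {loss_double:.4f}")
```

Because softmax outputs lie in [0, 1], the second softmax can never produce a sharp distribution. With C classes the loss cannot fall below log(1 + (C - 1) / e), so gradients stay small even for confident, correct predictions.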

About the results of the code

python main_fed.py --dataset mnist --iid --num_channels 1 --model cnn --epochs 50 --gpu 0
In addition, regarding main_fed.py: how can I run the program to get results for non-IID data?

cifar transform

Hello. Thanks for your nice code. I think the accuracy can be improved with a new 'transform' for CIFAR:

        trans_train = transforms.Compose([
            transforms.RandomCrop(32, padding=4),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
        ])
        trans_test = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
        ])
        dataset_train = datasets.CIFAR10('../data/cifar', train=True, download=True, transform=trans_train)
        dataset_test = datasets.CIFAR10('../data/cifar', train=False, download=True, transform=trans_test)

Run time error for main_fed.py (without gpu)

When I was running this code, using the command as you suggested,

python main_fed.py --dataset mnist --model cnn --epochs 50 --gpu -1 --num_channels 1

It raised the following error:

CNNMnist(
  (conv1): Conv2d(1, 10, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(10, 20, kernel_size=(5, 5), stride=(1, 1))
  (conv2_drop): Dropout2d(p=0.5)
  (fc1): Linear(in_features=320, out_features=50, bias=True)
  (fc2): Linear(in_features=50, out_features=10, bias=True)
)
0%| | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "main_fed.py", line 122, in
    w, loss = local.update_weights(net=copy.deepcopy(net_glob))
  File "C:\Users\lliubb\PycharmProjects\DistributedLearning_LLM\FedAvg\Update.py", line 50, in update_weights
    for batch_idx, (images, labels) in enumerate(self.ldr_train):
  File "C:\Users\lliubb\PycharmProjects\Federated-Learning\venv\lib\site-packages\torch\utils\data\dataloader.py", line 314, in __next__
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "C:\Users\lliubb\PycharmProjects\Federated-Learning\venv\lib\site-packages\torch\utils\data\dataloader.py", line 314, in
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "C:\Users\lliubb\PycharmProjects\DistributedLearning_LLM\FedAvg\Update.py", line 21, in __getitem__
    image, label = self.dataset[self.idxs[item]]
  File "C:\Users\lliubb\PycharmProjects\Federated-Learning\venv\lib\site-packages\torchvision\datasets\mnist.py", line 68, in __getitem__
    img, target = self.train_data[index], self.train_labels[index]
IndexError: only integers, slices (:), ellipsis (...), None and long or byte Variables are valid indices (got numpy.float64)

Can you give me some hints on how to solve this?
I do not have a GPU and I am using Python 3.6 on Windows.
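
The traceback shows the dataset being indexed with a numpy.float64, which suggests the per-client indices are stored as floats. One possible workaround is to cast them to int when the split is built. The sketch below is an assumption, not a confirmed fix: the class is a hypothetical stand-in modeled on the __getitem__ call shown at Update.py line 21.

```python
class DatasetSplit:
    """Hypothetical wrapper exposing the subset of `dataset` chosen by `idxs`.

    Indices are cast to int once, so float-valued indices (for example
    numpy.float64 values taken from a float array) never reach the dataset.
    """

    def __init__(self, dataset, idxs):
        self.dataset = dataset
        self.idxs = [int(i) for i in idxs]

    def __len__(self):
        return len(self.idxs)

    def __getitem__(self, item):
        image, label = self.dataset[self.idxs[item]]
        return image, label


# Toy dataset of (image, label) pairs and float-valued indices.
toy = [("img0", 0), ("img1", 1), ("img2", 2), ("img3", 3)]
split = DatasetSplit(toy, idxs=[3.0, 1.0])
print(len(split), split[0], split[1])  # 2 ('img3', 3) ('img1', 1)
```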

Getting Runtime Error

Hi,
When I try to run the code with the following command:
python main_fed.py --dataset mnist --model cnn --epochs 50 --gpu -1
(since I have no gpu)
I get the following error message:

CNNMnist(
(conv1): Conv2d(3, 10, kernel_size=(5, 5), stride=(1, 1))
(conv2): Conv2d(10, 20, kernel_size=(5, 5), stride=(1, 1))
(conv2_drop): Dropout2d(p=0.5)
(fc1): Linear(in_features=320, out_features=50, bias=True)
(fc2): Linear(in_features=50, out_features=10, bias=True)
)
0%| | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "main_fed.py", line 122, in
    w, loss = local.update_weights(net=copy.deepcopy(net_glob))
  File "/federated-learning-master/FedAvg/Update.py", line 55, in update_weights
    log_probs = net(images)
  File "/miniconda/envs/fedlearn/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/federated-learning-master/FedAvg/FedNets.py", line 38, in forward
    x = F.relu(F.max_pool2d(self.conv1(x), 2))
  File "/home/santanu/miniconda/envs/fedlearn/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/miniconda/envs/fedlearn/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 282, in forward
    self.padding, self.dilation, self.groups)
  File "/miniconda/envs/fedlearn/lib/python3.6/site-packages/torch/nn/functional.py", line 90, in conv2d
    return f(input, weight, bias)
RuntimeError: Given groups=1, weight[10, 3, 5, 5], so expected input[10, 1, 28, 28] to have 3 channels, but got 1 channels instead

Any suggestion how to fix it?

Testing accuracy equals training accuracy?

Your code is excellent and has helped me a lot. However, I wonder why the testing accuracy always equals the training accuracy. I would appreciate an explanation. Thanks a lot.

Issue running python main_fed.py --dataset mnist --num_channels 1 --model cnn --epochs 50 --gpu 0

When I try to run python main_fed.py --dataset mnist --num_channels 1 --model cnn --epochs 50 --gpu 0, I get the following error.

Jians-Air:FedAvg jiansun$ python main_fed.py --dataset mnist --num_channels 1 --model cnn --epochs 50 --gpu 0
Traceback (most recent call last):
  File "main_fed.py", line 11, in
    from torchvision import datasets, transforms
  File "/Library/Python/2.7/site-packages/torchvision/__init__.py", line 1, in
    from torchvision import models

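The model aggregation step seems different from what the FedAvg paper describes

I have recently been studying federated learning with your code as a reference, and I came across something that puzzles me. In the paper, each global epoch randomly selects a fraction of all clients to update (not all clients take part in the update). For aggregation, the paper describes aggregating the models of all clients, meaning that the models of clients that did not take part in the update are also averaged. In the code, however, the aggregation step only averages the models of the clients that took part in the update. Is there a problem with the code, or is my understanding wrong?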

for iter in range(args.epochs):
    w_locals, loss_locals = [], []
    m = max(int(args.frac * args.num_users), 1)
    idxs_users = np.random.choice(range(args.num_users), m, replace=False)
    for idx in idxs_users:
        local = LocalUpdate(args=args, dataset=dataset_train, idxs=dict_users[idx])
        w, loss = local.train(net=copy.deepcopy(net_glob).to(args.device))
        w_locals.append(copy.deepcopy(w))
        loss_locals.append(copy.deepcopy(loss))
    # update global weights
    w_glob = FedAvg(w_locals)

    # copy weight to net_glob
    net_glob.load_state_dict(w_glob)

split dataset

How did you partition the dataset between clients? Is it done automatically (if so, by which script?) or manually?
Thanks

Why the MLP architecture is different from the paper?

Hi, thanks for your nice work!

I wonder why your MLP differs from the one used by the paper's authors.
In your code, it is 784->64->10, while the paper by McMahan uses a net with 784 -> 200 -> 200 -> 10.
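
To put the difference in numbers, the sketch below counts the weights and biases of both fully connected architectures. The helper name mlp_param_count is hypothetical, and it only does arithmetic on the layer widths quoted above.

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases of a fully connected net with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


repo_mlp = mlp_param_count([784, 64, 10])         # MLP described in this issue
paper_2nn = mlp_param_count([784, 200, 200, 10])  # network from McMahan et al.
print(repo_mlp, paper_2nn)  # 50890 199210
```

The paper's network has roughly four times as many parameters, so accuracy and convergence figures from the two setups are not directly comparable.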

for idx in idxs_users:
    local = LocalUpdate(args=args, dataset=dataset_train, idxs=dict_users[idx])
    w, loss = local.train(net=copy.deepcopy(net_glob).to(args.device))
    if args.all_clients:
        w_locals[idx] = copy.deepcopy(w)
    else:
        w_locals.append(copy.deepcopy(w))
    loss_locals.append(copy.deepcopy(loss))