shaoxiongji / federated-learning
A PyTorch Implementation of Federated Learning http://doi.org/10.5281/zenodo.4321561
Home Page: http://doi.org/10.5281/zenodo.4321561
License: MIT License
How and why did you choose num_shards, num_imgs = 200, 300?
Are other values impossible to use, or is there another reason?
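For context, 200 × 300 = 60,000, which matches the size of the MNIST training set, so these two values tile the dataset exactly. Below is a minimal sketch of a shard-based non-IID split. The function name `shard_partition` and the rule that each client receives `num_shards // num_users` shards are illustrative assumptions, not confirmed by the repository:

```python
import random

def shard_partition(labels, num_users, num_shards, num_imgs, seed=0):
    """Split sample indices into label-sorted shards and deal them to users.

    Hypothetical helper: every user gets num_shards // num_users shards.
    """
    if num_shards * num_imgs != len(labels):
        raise ValueError("num_shards * num_imgs must equal the dataset size")
    if num_shards % num_users != 0:
        raise ValueError("num_shards must be divisible by num_users")
    # Sort indices by label so that each shard covers very few classes.
    sorted_idxs = sorted(range(len(labels)), key=lambda i: labels[i])
    shards = [sorted_idxs[s * num_imgs:(s + 1) * num_imgs] for s in range(num_shards)]
    # Shuffle the shards so that users receive random label combinations.
    rng = random.Random(seed)
    rng.shuffle(shards)
    per_user = num_shards // num_users
    return {
        u: [i for shard in shards[u * per_user:(u + 1) * per_user] for i in shard]
        for u in range(num_users)
    }

# Toy data: 60 samples, 10 classes, 6 samples per class.
toy_labels = [i % 10 for i in range(60)]
toy_split = shard_partition(toy_labels, num_users=10, num_shards=20, num_imgs=3)
```

With the real values (100 users, 200 shards, 300 images per shard), each user would hold 600 images drawn from at most a handful of classes, which is what makes the split non-IID.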
First of all, thank you for your contribution.
I don't understand the statement "Note: The scripts will be slow without the implementation of parallel computing."
What does "parallel computing" mean?
As I understand the code below, each client's local training runs sequentially.
federated-learning/main_fed.py
Lines 83 to 90 in 5a9da1a
What do you think about using multiprocessing, with one process per client?
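A rough sketch of the one-process-per-client idea, using the standard library's `concurrent.futures`. Everything here is hypothetical: `local_train` is a stand-in for a real local update and works on plain lists of floats, and `run_round` is an illustrative name, not a function from the repository:

```python
from concurrent.futures import ProcessPoolExecutor

def local_train(args):
    """Stand-in for one client's local update (hypothetical, no real model).

    Takes (global_weights, client_data) and returns new weights and a loss.
    """
    global_weights, client_data = args
    mean = sum(client_data) / len(client_data)
    # Move every weight halfway towards the client's data mean.
    new_weights = [w + 0.5 * (mean - w) for w in global_weights]
    loss = sum((mean - w) ** 2 for w in global_weights) / len(global_weights)
    return new_weights, loss

def run_round(global_weights, clients_data, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Train all selected clients concurrently, then average their weights."""
    jobs = [(global_weights, data) for data in clients_data]
    with executor_cls(max_workers=max_workers) as pool:
        results = list(pool.map(local_train, jobs))
    w_locals = [w for w, _ in results]
    losses = [loss for _, loss in results]
    # Unweighted average of each weight position across clients.
    w_avg = [sum(ws) / len(ws) for ws in zip(*w_locals)]
    return w_avg, sum(losses) / len(losses)
```

On platforms that spawn worker processes (Windows, macOS), `run_round` should be called from under `if __name__ == "__main__":`. With GPU training, each process would also create its own CUDA context, so a speedup is not guaranteed.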
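I think the averaging is wrong when the data distribution is non-IID; it should be changed to:
import copy

def FedAvg(w, dict_len):
    # w: list of client state_dicts; dict_len: number of samples per client
    w_avg = copy.deepcopy(w[0])
    for k in w_avg.keys():
        # weight each client's parameters by its sample count
        w_avg[k] = w_avg[k] * dict_len[0]
        for i in range(1, len(w)):
            w_avg[k] += w[i][k] * dict_len[i]
        # normalize by the total number of samples
        w_avg[k] = w_avg[k] / sum(dict_len)
    return w_avg
Here dict_len is a list containing the number of samples on each client.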
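A small numeric sketch of the difference, using plain floats in place of tensors (the helper names are illustrative). When every client holds the same number of samples, the weighted and unweighted averages coincide; they differ only when the counts differ:

```python
def weighted_average(values, counts):
    """Average per-client values, weighting each by its sample count."""
    total = sum(counts)
    return sum(v * n for v, n in zip(values, counts)) / total

def simple_average(values):
    """Unweighted mean: every client counts equally."""
    return sum(values) / len(values)

# One scalar "weight" per client, standing in for a tensor entry.
client_values = [1.0, 3.0]

equal = weighted_average(client_values, [300, 300])    # same as the simple mean
unequal = weighted_average(client_values, [100, 300])  # pulled towards 3.0
```

So whether the unweighted version matters in practice depends on whether the partition gives clients unequal amounts of data.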
May I ask why the test accuracy of mlp-noniid-mnist is 75% on the first run, but 78% or even 83%+ on the second run? What causes such a large variation?
First run:
Round 0, Average loss 0.133
Round 1, Average loss 0.097
Round 2, Average loss 0.084
Round 3, Average loss 0.063
Round 4, Average loss 0.075
Round 5, Average loss 0.057
Round 6, Average loss 0.041
Round 7, Average loss 0.049
Round 8, Average loss 0.076
Round 9, Average loss 0.056
Training accuracy: 74.83
Testing accuracy: 75.21
Second run:
Round 0, Average loss 0.128
Round 1, Average loss 0.068
Round 2, Average loss 0.099
Round 3, Average loss 0.060
Round 4, Average loss 0.057
Round 5, Average loss 0.070
Round 6, Average loss 0.069
Round 7, Average loss 0.057
Round 8, Average loss 0.066
Round 9, Average loss 0.049
Training accuracy: 78.18
Testing accuracy: 78.39
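One possible source of this run-to-run variation is unseeded randomness: which clients are sampled each round, how the data is partitioned, and how the weights are initialised. With only 10 rounds on non-IID data, that may be enough to move accuracy by several points. A minimal sketch of fixing the seeds, assuming the script does not already do so; `seed_everything` and `sample_clients` are illustrative names:

```python
import random

def seed_everything(seed):
    """Seed the random number generators a typical training script uses."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed.
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed.

def sample_clients(num_users, frac):
    """Pick the clients for one round (stand-in for np.random.choice)."""
    m = max(int(frac * num_users), 1)
    return sorted(random.sample(range(num_users), m))

seed_everything(1)
first = sample_clients(100, 0.1)
seed_everything(1)
second = sample_clients(100, 0.1)  # identical to `first`
```

Even with fixed seeds, some GPU operations are not deterministic, so small differences between runs can remain.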
Can I experiment on other tasks? For example, some tasks in NLP.
How can I obtain the intermediate gradients of each client in FL using PyTorch? I tried hooks but couldn't figure it out.
Dear,
First thank you for your code.
I ran your code; however, the result is not satisfactory.
Result:
Training accuracy: 43.00
Testing accuracy: 43.00
python main_fed.py --dataset cifar --num_channels 1 --model cnn --epochs 10 --gpu 0 --iid
Looking forward to your reply.
Best wishes
Why does the FedAvg use a simple average without weight?
What is the max number of clients that can be selected in each round of training using this code?
Have you ever tried training across multiple machines?
Why does a larger num_workers in DataLoader make no difference? How can I increase GPU utilization?
When I run the "main_nn.py", an error appears:
`============================= test session starts ==============================
platform linux -- Python 3.6.9, pytest-5.3.1, py-1.8.0, pluggy-0.13.1 -- /home/anaconda3/envs/pytorch/bin/python3.6
cachedir: .pytest_cache
rootdir: /home/federated-learning-master
collecting ... collected 1 item
main_nn.py::test ERROR [100%]
test setup failed
file /home/federated-learning-master/main_nn.py, line 19
def test(net_g, data_loader):
E fixture 'net_g' not found
available fixtures: cache, capfd, capfdbinary, caplog, capsys, capsysbinary, doctest_namespace, monkeypatch, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory use 'pytest --fixtures [testpath]' for help on them.
/home/federated-learning-master/main_nn.py:19`
How can I solve it?
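Hello, does the code implement parallel training on CPU? Or can it only simulate federated learning by training each worker in turn every round and then collecting the parameters?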
What do the parameters C=0.1, B=10, E=5 mean, and which options in options.py do they correspond to? Thanks.
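In the notation of the FedAvg paper, C is the fraction of clients selected per round, B is the local minibatch size, and E is the number of local epochs. The sketch below shows how these might map onto command-line options. `frac` and `num_users` appear as `args.frac` and `args.num_users` in the training loop quoted elsewhere on this page; the flag names `--local_bs` and `--local_ep` and the default of 100 users are assumptions that should be checked against options.py:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser()
    # Default of 100 users is an assumption for this example.
    parser.add_argument("--num_users", type=int, default=100, help="number of clients")
    parser.add_argument("--frac", type=float, default=0.1, help="fraction of clients per round: C")
    # Assumed flag names for B and E.
    parser.add_argument("--local_bs", type=int, default=10, help="local batch size: B")
    parser.add_argument("--local_ep", type=int, default=5, help="number of local epochs: E")
    return parser

args = build_parser().parse_args([])  # empty list: use the defaults above
# Number of clients trained in each round, as in the main_fed.py loop.
m = max(int(args.frac * args.num_users), 1)
```

With these defaults, `m` is 10, so C=0.1 means 10 of the 100 clients train in each round.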
`bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
cuda:0
CNNMnist(
(conv1): Conv2d(1, 10, kernel_size=(5, 5), stride=(1, 1))
(conv2): Conv2d(10, 20, kernel_size=(5, 5), stride=(1, 1))
(conv2_drop): Dropout2d(p=0.5)
(fc1): Linear(in_features=320, out_features=50, bias=True)
(fc2): Linear(in_features=50, out_features=10, bias=True)
)
/opt/conda/lib/python3.6/site-packages/torchvision/datasets/mnist.py:43: UserWarning: train_labels has been renamed targets
warnings.warn("train_labels has been renamed targets")
Traceback (most recent call last):
File "main_fed.py", line 113, in
w, loss = local.train(net=copy.deepcopy(net_glob).to(args.device))
File "/code/models/Update.py", line 48, in train
loss = self.loss_func(log_probs, labels)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 904, in forward
ignore_index=self.ignore_index, reduction=self.reduction)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/functional.py", line 1970, in cross_entropy
return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/functional.py", line 1790, in nll_loss
ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: Expected object of backend CUDA but got backend CPU for argument 'weight'`
I get the above error, only when trying to run it on CUDA.
Hi, thanks for your nice code.
However, I found a bug in your code: you apply the CrossEntropy loss after a softmax activation, but PyTorch's CrossEntropy loss expects raw logits as its input.
After removing the softmax activation, I was able to improve the MLP accuracy from 90% to 95%.
python main_fed.py --dataset mnist --iid --num_channels 1 --model cnn --epochs 50 --gpu 0
In addition
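The softmax point can be illustrated without PyTorch. The sketch below reimplements cross-entropy on plain lists (the function names are illustrative) and compares the loss on raw logits with the loss when softmax has already been applied, which amounts to applying softmax twice:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(inputs, target):
    """Cross-entropy on raw scores: -log(softmax(inputs)[target])."""
    return -math.log(softmax(inputs)[target])

# A confident, correct prediction for class 0 out of 10.
logits = [20.0] + [0.0] * 9

loss_on_logits = cross_entropy(logits, 0)          # close to 0
loss_on_probs = cross_entropy(softmax(logits), 0)  # softmax applied twice

# With inputs confined to [0, 1], the loss cannot drop below this value.
floor = math.log(1 + 9 * math.exp(-1))
```

Because probabilities lie in [0, 1], the second loss stays at roughly 1.46 or above for 10 classes even when the prediction is perfect, which weakens the training signal.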
Hi, regarding main_fed.py, how do I run the program on non-IID data?
Hello. Thanks for your nice code. I think the accuracy can be improved with these new transforms for CIFAR:
from torchvision import datasets, transforms

# training: random crop and horizontal flip for augmentation, then normalize
trans_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
# testing: no augmentation, same normalization
trans_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
dataset_train = datasets.CIFAR10('../data/cifar', train=True, download=True, transform=trans_train)
dataset_test = datasets.CIFAR10('../data/cifar', train=False, download=True, transform=trans_test)
When I was running this code, using the command as you suggested,
python main_fed.py --dataset mnist --model cnn --epochs 50 --gpu -1 --num_channels 1
It raised the following error:
CNNMnist(
(conv1): Conv2d(1, 10, kernel_size=(5, 5), stride=(1, 1))
(conv2): Conv2d(10, 20, kernel_size=(5, 5), stride=(1, 1))
(conv2_drop): Dropout2d(p=0.5)
(fc1): Linear(in_features=320, out_features=50, bias=True)
(fc2): Linear(in_features=50, out_features=10, bias=True)
)
0%| | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
File "main_fed.py", line 122, in
w, loss = local.update_weights(net=copy.deepcopy(net_glob))
File "C:\Users\lliubb\PycharmProjects\DistributedLearning_LLM\FedAvg\Update.py", line 50, in update_weights
for batch_idx, (images, labels) in enumerate(self.ldr_train):
File "C:\Users\lliubb\PycharmProjects\Federated-Learning\venv\lib\site-packages\torch\utils\data\dataloader.py", line 314, in __next__
batch = self.collate_fn([self.dataset[i] for i in indices])
File "C:\Users\lliubb\PycharmProjects\Federated-Learning\venv\lib\site-packages\torch\utils\data\dataloader.py", line 314, in
batch = self.collate_fn([self.dataset[i] for i in indices])
File "C:\Users\lliubb\PycharmProjects\DistributedLearning_LLM\FedAvg\Update.py", line 21, in getitem
image, label = self.dataset[self.idxs[item]]
File "C:\Users\lliubb\PycharmProjects\Federated-Learning\venv\lib\site-packages\torchvision\datasets\mnist.py", line 68, in getitem
img, target = self.train_data[index], self.train_labels[index]
IndexError: only integers, slices (:), ellipsis (...), None and long or byte Variables are valid indices (got numpy.float64)
Can you give me some hints on how to solve this?
I do not have a gpu and I am using python 3.6 on a windows system.
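The traceback suggests that the per-client index set contains `numpy.float64` values, which cannot be used to index the dataset. One possible fix, sketched here with a plain list standing in for the MNIST dataset, is to cast the indices to `int` where the client subset is built. The class below is modelled on the wrapper visible in the traceback (`self.dataset[self.idxs[item]]`), but its exact form is an assumption:

```python
class DatasetSplit:
    """Wrap a dataset so that only the samples listed in idxs are visible."""

    def __init__(self, dataset, idxs):
        self.dataset = dataset
        # Cast once here: indices drawn with NumPy may arrive as floats.
        self.idxs = [int(i) for i in idxs]

    def __len__(self):
        return len(self.idxs)

    def __getitem__(self, item):
        return self.dataset[self.idxs[item]]

# Toy dataset of (image, label) pairs; the float indices mimic numpy.float64.
toy_dataset = [("img%d" % i, i % 10) for i in range(20)]
split = DatasetSplit(toy_dataset, idxs=[3.0, 7.0, 12.0])
image, label = split[1]
```

Casting at the point where the partition is created would work equally well.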
Hi,
When I try to run the code with the following command:
python main_fed.py --dataset mnist --model cnn --epochs 50 --gpu -1
(since I have no gpu)
I get the following error message:
CNNMnist(
(conv1): Conv2d(3, 10, kernel_size=(5, 5), stride=(1, 1))
(conv2): Conv2d(10, 20, kernel_size=(5, 5), stride=(1, 1))
(conv2_drop): Dropout2d(p=0.5)
(fc1): Linear(in_features=320, out_features=50, bias=True)
(fc2): Linear(in_features=50, out_features=10, bias=True)
)
0%| | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
File "main_fed.py", line 122, in
w, loss = local.update_weights(net=copy.deepcopy(net_glob))
File "/federated-learning-master/FedAvg/Update.py", line 55, in update_weights
log_probs = net(images)
File "/miniconda/envs/fedlearn/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in call
result = self.forward(*input, **kwargs)
File "/federated-learning-master/FedAvg/FedNets.py", line 38, in forward
x = F.relu(F.max_pool2d(self.conv1(x), 2))
File "/home/santanu/miniconda/envs/fedlearn/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in call
result = self.forward(*input, **kwargs)
File "/miniconda/envs/fedlearn/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 282, in forward
self.padding, self.dilation, self.groups)
File "/miniconda/envs/fedlearn/lib/python3.6/site-packages/torch/nn/functional.py", line 90, in conv2d
return f(input, weight, bias)
RuntimeError: Given groups=1, weight[10, 3, 5, 5], so expected input[10, 1, 28, 28] to have 3 channels, but got 1 channels instead
Any suggestions on how to fix it?
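The error says that conv1 was built for 3 input channels while the MNIST batch has 1, and the printed model (`Conv2d(3, 10, ...)`) agrees. The other MNIST commands on this page pass `--num_channels 1`, which is missing here. A hypothetical guard that would catch the mismatch before training starts (the names `DATASET_CHANNELS` and `check_channels` are illustrative, not part of the repository):

```python
# Channel counts are properties of the datasets: MNIST is grayscale, CIFAR-10 is RGB.
DATASET_CHANNELS = {"mnist": 1, "cifar": 3}

def check_channels(dataset, num_channels):
    """Raise early if --num_channels does not match the dataset's images."""
    expected = DATASET_CHANNELS[dataset]
    if num_channels != expected:
        raise ValueError(
            "--dataset %s has %d channel(s), but --num_channels is %d"
            % (dataset, expected, num_channels)
        )
    return num_channels

check_channels("mnist", 1)  # passes; check_channels("mnist", 3) would raise
```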
Your code is excellent and has helped me a lot. However, I wonder why the testing accuracy always equals the training accuracy. I would appreciate an explanation. Thanks a lot.
When I tried to run python main_fed.py --dataset mnist --num_channels 1 --model cnn --epochs 50 --gpu 0, I ran into a problem.
Jians-Air:FedAvg jiansun$ python main_fed.py --dataset mnist --num_channels 1 --model cnn --epochs 50 --gpu 0
Traceback (most recent call last):
File "main_fed.py", line 11, in
from torchvision import datasets, transforms
File "/Library/Python/2.7/site-packages/torchvision/init.py", line 1, in
from torchvision import models

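I have recently been studying federated learning with reference to your code, and I came across something that puzzles me. In the original paper, each global epoch randomly selects a fraction of all clients to update (not all clients take part). As I read it, the paper describes aggregation over the models of all clients, so the models of clients that did not take part in the update are also averaged. In the code, however, the aggregation step only averages the models of the clients that took part. Is there a problem with the code, or am I misunderstanding?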
for iter in range(args.epochs):
    w_locals, loss_locals = [], []
    m = max(int(args.frac * args.num_users), 1)
    idxs_users = np.random.choice(range(args.num_users), m, replace=False)
    for idx in idxs_users:
        local = LocalUpdate(args=args, dataset=dataset_train, idxs=dict_users[idx])
        w, loss = local.train(net=copy.deepcopy(net_glob).to(args.device))
        w_locals.append(copy.deepcopy(w))
        loss_locals.append(copy.deepcopy(loss))
    # update global weights
    w_glob = FedAvg(w_locals)
    # copy weight to net_glob
    net_glob.load_state_dict(w_glob)
How did you partition the dataset among the clients? Is it done automatically (if so, by which script?) or manually?
Thanks
Hi, thanks for your nice work!
I wonder why you implemented a different MLP from the one in the paper.
In your code, it is 784->64->10, while the paper by McMahan uses a net with 784 -> 200 -> 200 -> 10.
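To put the two architectures side by side, the parameter counts can be worked out directly from the layer widths. A short sketch (the helper name `mlp_param_count` is illustrative):

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases of a fully connected net with these layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

small = mlp_param_count([784, 64, 10])        # 50,890
paper = mlp_param_count([784, 200, 200, 10])  # 199,210
```

The 784 -> 64 -> 10 network therefore has roughly a quarter of the parameters of the 784 -> 200 -> 200 -> 10 network.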