flperf's Introduction

This repository contains an ongoing benchmarking project for Federated Learning (FL). We plan to release a standardized benchmark with an efficient simulator and real-world datasets soon. This benchmark is used in the OSDI '21 paper "Efficient Federated Learning via Guided Participant Selection" (preprint here).

Getting Started

To download the datasets, run the following commands (or download them manually from here):

git clone https://github.com/SymbioticLab/FLPerf.git
cd FLPerf
# Download all datasets from Dropbox. Run `bash download.sh -h` for more details.
bash download.sh -A  # If this fails, run download_google.sh to fetch the data from Google Drive

Note that no details were kept about any participant's age, gender, or location, and a random ID was assigned to each individual.

Dataset of Samples

We provide real-world datasets for the federated learning community and plan to release many more soon. Each dataset comes with a training set and a testing set. The table below summarizes the statistics of the training sets; refer to each dataset's folder for more details.

Dataset          # of Clients    # of Samples
Google Speech    2,618           105,829
OpenImage        14,477          1,672,231
StackOverflow    342,477         135,818,730
Reddit           1,660,820       351,523,459
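
To get a feel for how much data each client holds, the table can be turned into a mean samples-per-client figure. The snippet below is only a rough sketch: the counts are copied from the table, while the function name `samples_per_client` is made up for illustration. A mean also hides the skew that per-client data usually exhibits, so treat the result as a coarse summary.

```python
# Client and sample counts copied from the table above (training sets).
DATASETS = {
    "Google Speech": (2_618, 105_829),
    "OpenImage": (14_477, 1_672_231),
    "StackOverflow": (342_477, 135_818_730),
    "Reddit": (1_660_820, 351_523_459),
}


def samples_per_client(stats):
    """Return the mean number of samples per client for each dataset.

    `stats` maps a dataset name to a (num_clients, num_samples) tuple.
    This helper is hypothetical and not part of the repository.
    """
    return {name: samples / clients for name, (clients, samples) in stats.items()}


if __name__ == "__main__":
    for name, mean in samples_per_client(DATASETS).items():
        print(f"{name}: {mean:.1f} samples per client on average")
```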

Google Speech Commands

A speech recognition dataset with over ten thousand one-second clips. Each clip contains one of 35 common words (e.g., the digits zero to nine, "Yes", "No", "Up", "Down") spoken by thousands of different people.

OpenImage

OpenImage is a vision dataset collected from Flickr, an image and video hosting service. It contains a total of 16M bounding boxes for 600 object classes (e.g., Microwave oven). We clean up the dataset according to the provided indices of clients.

Reddit and StackOverflow

Reddit (StackOverflow) consists of comments from the Reddit (StackOverflow) website. Both have been widely used for language modeling tasks, and we treat each user as a client. In our benchmark, we restrict the vocabulary to the 30k most frequently used words and represent each sentence as a sequence of indices into this vocabulary. We use Transformers to tokenize these sequences with a block size of 64.
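
The preprocessing described above can be illustrated with a small, library-free sketch. It is not the tokenizer used in the benchmark: whitespace splitting stands in for the Transformers tokenizer, the `UNK` index for out-of-vocabulary words is an assumption, and dropping the trailing partial block is one possible choice among several. All function names are hypothetical.

```python
from collections import Counter

UNK = 0  # assumed index for words outside the vocabulary (not from the source)


def build_vocab(sentences, max_words=30_000):
    """Map the `max_words` most frequent words to indices 1..max_words."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common(max_words))}


def encode(sentence, vocab):
    """Represent a sentence as a sequence of vocabulary indices."""
    return [vocab.get(word, UNK) for word in sentence.split()]


def to_blocks(indices, block_size=64):
    """Split an index sequence into consecutive full blocks.

    The trailing remainder shorter than `block_size` is dropped here;
    padding it instead would be an equally reasonable assumption.
    """
    n_full = len(indices) // block_size
    return [indices[i * block_size:(i + 1) * block_size] for i in range(n_full)]


if __name__ == "__main__":
    corpus = ["how do I sort a list", "sort a list in place"]
    vocab = build_vocab(corpus)
    stream = [idx for sentence in corpus for idx in encode(sentence, vocab)]
    print(to_blocks(stream, block_size=4))
```

In this sketch a client's comments are concatenated into one index stream before being cut into blocks, which is a common arrangement for language modeling but is assumed here rather than taken from the repository.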

Dataset of System Performance and Availability

Heterogeneous System Performance

We use the AIBench and MobiPerf datasets. The AIBench dataset provides the computation capacity of a wide range of devices across different models. Following real FL deployments, this benchmark focuses on mobile devices with more than 2 GB of RAM. To capture the network capacity of these devices, we clean up the MobiPerf dataset and provide the bandwidth available when a device is connected to WiFi, which is also the preferred setting in FL.
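
The RAM cutoff can be sketched as a simple filter. The record layout below (`DeviceProfile` with `name` and `ram_gb` fields) and the sample device names are hypothetical and do not reflect the actual AIBench or MobiPerf schema; only the "more than 2 GB" threshold comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceProfile:
    """Hypothetical device record; the field names are assumptions."""
    name: str
    ram_gb: float


def eligible_devices(profiles, min_ram_gb=2.0):
    """Keep only devices with strictly more than `min_ram_gb` of RAM."""
    return [p for p in profiles if p.ram_gb > min_ram_gb]


if __name__ == "__main__":
    profiles = [
        DeviceProfile("device-a", 1.5),
        DeviceProfile("device-b", 2.0),
        DeviceProfile("device-c", 4.0),
    ]
    print([p.name for p in eligible_devices(profiles)])
```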

Availability of Clients

We use a large-scale, real-world user behavior dataset from FLASH. It comes from a popular input method app (IMA) available on Google Play, covers 136k users, and spans one week, from January 31 to February 6, 2020. The dataset includes 180 million trace items (e.g., battery charge or screen lock). As in real FL deployments, we consider a device available while it is charging.
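
One way to turn such a trace into availability windows is to pair charging start and stop events, as in the sketch below. The event names (`charge_on`, `charge_off`), the `(timestamp, event)` tuple layout, and the function names are assumptions made for illustration; the actual trace format may differ.

```python
def charging_intervals(events, trace_end):
    """Return (start, end) intervals during which the device was charging.

    `events` is a time-sorted list of (timestamp_seconds, event_name) tuples.
    Events other than the assumed "charge_on"/"charge_off" are ignored.
    A charging session still open at the end of the trace is closed at
    `trace_end`.
    """
    intervals = []
    start = None
    for timestamp, name in events:
        if name == "charge_on" and start is None:
            start = timestamp
        elif name == "charge_off" and start is not None:
            intervals.append((start, timestamp))
            start = None
    if start is not None:
        intervals.append((start, trace_end))
    return intervals


def is_available(intervals, t):
    """A device counts as available at time `t` if it is charging then."""
    return any(start <= t < end for start, end in intervals)


if __name__ == "__main__":
    trace = [(10, "charge_on"), (20, "screen_lock"), (50, "charge_off"), (80, "charge_on")]
    windows = charging_intervals(trace, trace_end=100)
    print(windows, is_available(windows, 30), is_available(windows, 60))
```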

Efficient Simulator for Federated Learning

We plan to soon release a simulator for standardized benchmarking that can support the simulation of millions of clients.

Notes

Please consider citing our paper if you use the code or data in your research project.

@inproceedings{kuiper-osdi21,
  title={Kuiper: Efficient Federated Learning via Guided Participant Selection},
  author={Fan Lai and Xiangfeng Zhu and Harsha V. Madhyastha and Mosharaf Chowdhury},
  booktitle={USENIX Symposium on Operating Systems Design and Implementation (OSDI)},
  year={2021}
}

Contact

Fan Lai ([email protected]), Yinwei Dai ([email protected]) and Xiangfeng Zhu ([email protected]).
