Pandaral·lel
An easy-to-use library that speeds up pandas computations by parallelizing them across multiple CPUs.
Installation
$ pip install pandarallel [--user]
Requirements
Warnings
- Version 1.0 of this library has not been released yet. The API may change at any time.
- Parallelization has a cost (instantiating new processes, transmitting data via shared memory, etc.), so it is efficient only if the amount of computation to parallelize is high enough. For very small amounts of data, parallelization is not always worth it.
- Functions applied should NOT be lambda functions.
from pandarallel import pandarallel
from math import sin
pandarallel.initialize()
# FORBIDDEN
df.parallel_apply(lambda x: sin(x**2), axis=1)
# ALLOWED
def func(x):
    return sin(x**2)
df.parallel_apply(func, axis=1)
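Because of the parallelization cost mentioned in the warnings, it can be convenient to fall back to the standard apply for small inputs. The following is only a minimal sketch: the helper name `apply_maybe_parallel` and the `min_rows` threshold are hypothetical and not part of pandarallel, and a sensible threshold depends on your data and function.

```python
def apply_maybe_parallel(df, func, min_rows=10_000, **kwargs):
    """Use parallel_apply only when the input is large enough.

    Hypothetical helper, not part of pandarallel. `min_rows` is an
    arbitrary example threshold; measure on your own workload.
    """
    # parallel_apply only exists once pandarallel.initialize() has been called.
    if len(df) >= min_rows and hasattr(df, "parallel_apply"):
        return df.parallel_apply(func, **kwargs)
    # Small input (or pandarallel not initialized): use the standard apply.
    return df.apply(func, **kwargs)
```

With the `func` defined above, a call could look like `apply_maybe_parallel(df, func, axis=1)`.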
Examples
An example of each API is available here.
Benchmark
For the DataFrame.apply example here, here is the comparative benchmark with "standard" apply and with parallel_apply (error bars are too small to be displayed).
Computer used for this benchmark:
- OS: Linux Ubuntu 16.04
- Hardware: Intel Core i7 @ 3.40 GHz (4 cores)
- Number of workers (parallel processes) used: 4
For this example, parallel_apply runs approximately 3.7 times faster than the "standard" apply.
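To check whether a similar speed-up holds on your own machine and data, a comparison along these lines may help. This is a rough sketch, not the script used for the benchmark above; the helper names `time_call` and `speedup` are hypothetical.

```python
import time


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline, candidate, repeats=3):
    """Return baseline time divided by candidate time (above 1: candidate is faster)."""
    # Guard against a zero duration on clocks with coarse resolution.
    candidate_time = max(time_call(candidate, repeats), 1e-12)
    return time_call(baseline, repeats) / candidate_time
```

For example, `speedup(lambda: df.apply(func, axis=1), lambda: df.parallel_apply(func, axis=1))`. The lambdas here only wrap the calls for timing; the function passed to parallel_apply is still the named `func`.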
API
First, import pandarallel:
from pandarallel import pandarallel
Then, you have to initialize it.
pandarallel.initialize()
This method takes 3 optional parameters:
- shm_size_mo: The size of the Pandarallel shared memory in MB. If the default is too small, a larger size can be set. By default, it is set to 2 GB. (int)
- nb_workers: The number of workers. By default, it is set to the number of cores your operating system sees. (int)
- progress_bar: Set it to True to display a progress bar. WARNING: The progress bar is an experimental feature and can cause a noticeable performance loss. It is available only for DataFrame.parallel_apply.
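As an illustration, the optional parameters could be set as in the sketch below. The parameter names are those listed above and may differ in other releases; the helper names `choose_init_kwargs` and `initialize_pandarallel`, and the idea of keeping one core free, are assumptions made for this example only.

```python
import os


def choose_init_kwargs(reserve_cores=1, shm_size_mo=None, progress_bar=False):
    """Build keyword arguments for pandarallel.initialize() (hypothetical helper)."""
    cores = os.cpu_count() or 1
    kwargs = {
        # Leave `reserve_cores` cores free, but always use at least one worker.
        "nb_workers": max(1, cores - reserve_cores),
        "progress_bar": progress_bar,
    }
    if shm_size_mo is not None:
        # Only override the shared memory size when explicitly requested.
        kwargs["shm_size_mo"] = int(shm_size_mo)
    return kwargs


def initialize_pandarallel(**options):
    """Initialize pandarallel with the settings chosen above."""
    # Imported lazily so that choose_init_kwargs works without pandarallel installed.
    from pandarallel import pandarallel

    pandarallel.initialize(**choose_init_kwargs(**options))
```

Calling `initialize_pandarallel(shm_size_mo=4000)` would then request about 4 GB of shared memory and leave the progress bar disabled.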
With df a pandas DataFrame, series a pandas Series, col_name the name of a pandas DataFrame column, and func a function to apply/map:
Without parallelisation | With parallelisation |
---|---|
df.apply(func) | df.parallel_apply(func) |
series.map(func) | series.parallel_map(func) |
df.groupby(col_name).apply(func) | df.groupby(col_name).parallel_apply(func) |
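The pairs in the table can be tried side by side with a small script such as the following sketch. The column names `key` and `a`, the sample values, and the function names are made up for illustration; all applied functions are named functions rather than lambdas, as required by the warning above.

```python
from math import sin


def row_func(row):
    # Receives one DataFrame row (axis=1); "a" is an example column name.
    return sin(row["a"] ** 2)


def value_func(x):
    # Receives one element of a Series.
    return sin(x ** 2)


def group_func(group):
    # Receives the sub-DataFrame of one group.
    return group["a"].sum()


def run_pairs():
    """Run each standard call next to its parallel counterpart."""
    # Imported here so the functions above can be used without these libraries.
    import pandas as pd
    from pandarallel import pandarallel

    pandarallel.initialize()

    df = pd.DataFrame({"key": ["x", "y", "x", "y"], "a": [0.0, 1.0, 2.0, 3.0]})
    series = df["a"]

    # Each entry holds (standard result, parallel result) for comparison.
    return {
        "apply": (df.apply(row_func, axis=1), df.parallel_apply(row_func, axis=1)),
        "map": (series.map(value_func), series.parallel_map(value_func)),
        "groupby": (
            df.groupby("key").apply(group_func),
            df.groupby("key").parallel_apply(group_func),
        ),
    }
```

Calling `run_pairs()` returns both results for each row of the table, so they can be compared directly. On such a tiny DataFrame the parallel versions are not expected to be faster.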