dubovikmaster / parallel-pandas
Parallel processing on pandas with progress bars
License: MIT License
I did not find a parallel version of DataFrame.pivot_table; I was wondering if it's possible to parallelize that function too.
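Not a parallel-pandas API, but one way to parallelize a pivot by hand is to split the frame on the pivot index so each piece holds complete groups, pivot the pieces independently, and concatenate the partial results. A minimal single-process sketch of that split/pivot/concat step (data and column names are illustrative, and it only works when aggfunc decomposes per index group, e.g. 'sum'):

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "col": ["x", "y", "x", "y"],
    "val": [1, 2, 3, 4],
})

# Split on the pivot index so every chunk holds complete groups; this is
# the part a process pool would farm out to workers.
parts = [grp for _, grp in df.groupby("key")]

# Pivot each chunk independently, then stack the partial results.
pivoted = pd.concat(
    part.pivot_table(index="key", columns="col", values="val", aggfunc="sum")
    for part in parts
)
print(pivoted)  # same result as pivoting the whole frame at once
```

Because each chunk contains all rows for its index values, no cross-chunk merging of aggregates is needed; with an aggfunc like 'mean' you would instead have to combine sums and counts across chunks.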
I have 256 GB of RAM and a dataset with 13M+ rows.
I am applying a complex function to all columns using p_applymap.
Because I was having memory issues, I split the DataFrame into 117 chunks of roughly 100k rows each.
The first 115 chunks run without issues, but chunk 116 (which is virtually identical to the rest) uses all the available memory until the process is killed by the OS.
Changing the number of processes does not help.
Naturally, I assumed there was an issue with some of the rows in that chunk, but I could not find anything.
Furthermore, if I apply the function row by row with p_apply and a lambda to that chunk, there are no issues.
Plain pandas map also works just fine.
Could this be the result of a memory leak?
Any thoughts?
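For reference, the chunking described above reduces to a tiny helper that yields row bounds; a sketch with made-up sizes (not part of the library), where each pair would be fed to df.iloc[start:stop]:

```python
def chunk_bounds(n_rows, chunk_size):
    """Yield (start, stop) row bounds covering n_rows in order."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)

# e.g. process a frame chunk by chunk via df.iloc[start:stop]
bounds = list(chunk_bounds(1_000_000, 100_000))
print(len(bounds))  # → 10
```

If one particular chunk blows up while its neighbors don't, it can help to log per-chunk memory (e.g. with psutil) right before and after the p_applymap call to see whether memory grows monotonically across chunks (a leak) or spikes only inside the bad one (pathological data).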
I used the example from the README; I just wanted to parallelize calculating correlations on a large DataFrame. Unfortunately, it did not work.
Windows: 10
Python: 3.9.7
Pandas: 1.5.2
File "C:\Anaconda3\lib\site-packages\parallel_pandas\core\parallel_window.py", line 116, in p_min
return self.parallelize_method(data, name, executor, *args, engine=engine, engine_kwargs=engine_kwargs,
File "C:\Anaconda3\lib\site-packages\parallel_pandas\core\parallel_window.py", line 80, in parallelize_method
result = progress_imap(
File "C:\Anaconda3\lib\site-packages\parallel_pandas\core\progress_imap.py", line 127, in progress_imap
result = _do_parallel(func, tasks, initializer, initargs, n_cpu, total, disable, show_vmem, q, executor, desc)
File "C:\Anaconda3\lib\site-packages\parallel_pandas\core\progress_imap.py", line 111, in _do_parallel
result.append(next(iter_result))
File "C:\Anaconda3\lib\multiprocessing\pool.py", line 870, in next
raise value
File "C:\Anaconda3\lib\multiprocessing\pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "C:\Anaconda3\lib\site-packages\parallel_pandas\core\parallel_window.py", line 36, in do_method
return progress_udf_wrapper(foo, workers_queue, 1)()
File "C:\Anaconda3\lib\site-packages\parallel_pandas\core\progress_imap.py", line 60, in wrapped_udf
state.next_update += max(int((delta_i / delta_t) * .25), 1)
ZeroDivisionError: float division by zero
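The failing line divides delta_i by delta_t, the time elapsed between two progress updates; if the two timestamps are identical (easy to hit with a coarse timer resolution on Windows), delta_t is 0.0. A guarded variant of that expression (a sketch of the fix, not a patch from the library; the function name is made up, the argument names mirror the traceback):

```python
def next_update_increment(delta_i, delta_t):
    """Progress-step increment, guarded against a zero time delta."""
    if delta_t <= 0.0:
        return 1  # no measurable time elapsed; fall back to the minimum step
    return max(int((delta_i / delta_t) * 0.25), 1)

print(next_update_increment(8, 0.0))   # → 1, instead of ZeroDivisionError
print(next_update_increment(100, 1.0)) # → 25
```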
ri.iloc[:10000].groupby('Stkcd').p_apply(get_beta)
File c:\Users\guduh\anaconda3\Lib\site-packages\parallel_pandas\core\parallel_groupby.py:57, in parallelize_groupby_apply.<locals>.p_apply(data, func, executor, *args, **kwargs)
44 result, mutated = _prepare_result(result)
46 # due to a bug in the get_iterator method of the Basegrouper class
47 # that was only fixed in pandas 1.4.0, earlier versions are not yet supported
48
(...)
55 # return data._wrap_applied_output(data.grouper._get_group_keys(), data.grouper._get_group_keys(), result,
56 # not_indexed_same=mutated or data.mutated)
---> 57 return data._wrap_applied_output(data._selected_obj, result, not_indexed_same=mutated or data.mutated)
File c:\Users\guduh\anaconda3\Lib\site-packages\pandas\core\groupby\groupby.py:1312, in GroupBy.__getattr__(self, attr)
1309 if attr in self.obj:
1310 return self[attr]
-> 1312 raise AttributeError(
1313 f"'{type(self).__name__}' object has no attribute '{attr}'"
1314 )
AttributeError: 'DataFrameGroupBy' object has no attribute 'mutated'
With the example code below, I get the error shown. Maybe it's due to the DatetimeIndex with a frequency?
minimal reproducible example:
import pandas as pd
from parallel_pandas import ParallelPandas
ParallelPandas.initialize()
test = pd.DataFrame(index=pd.date_range(start='2024-01-01', end='2024-01-02', freq='30min'))
test["test"] = list(range(len(test)))
test.p_describe()
output:
"name": "TypeError",
"message": "NDFrame.describe() got an unexpected keyword argument 'datetime_is_numeric'",
"stack": "---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[36], line 4
1 test = pd.DataFrame(index=pd.date_range(start='2024-01-01', end='2024-01-02', freq='30min'))
2 test[\"test\"] = list(range(len(test)))
----> 4 test.p_describe()
File ~/venv/lib/python3.11/site-packages/parallel_pandas/core/parallel_dataframe.py:336, in parallelize_describe.<locals>.p_describe(data, percentiles, include, exclude, datetime_is_numeric)
334 split_size = get_split_size(n_cpu, split_factor)
335 tasks = get_split_data(data, 0, split_size)
--> 336 result = progress_imap(
337 partial(do_describe, workers_queue=workers_queue, percentiles=percentiles, include=include, exclude=exclude,
338 datetime_is_numeric=datetime_is_numeric), tasks, workers_queue, n_cpu=n_cpu, disable=disable_pr_bar,
339 show_vmem=show_vmem, total=min(split_size, data.shape[1]), desc='DESCRIBE')
340 return pd.concat(result, axis=1)
File ~/venv/lib/python3.11/site-packages/parallel_pandas/core/progress_imap.py:127, in progress_imap(func, tasks, q, executor, initializer, initargs, n_cpu, total, disable, process_timeout, show_vmem, desc)
125 if process_timeout:
126 func = partial(_wrapped_func, func, process_timeout, True)
--> 127 result = _do_parallel(func, tasks, initializer, initargs, n_cpu, total, disable, show_vmem, q, executor, desc)
128 except (KeyboardInterrupt, Exception):
129 q.put((None, None))
File ~/venv/lib/python3.11/site-packages/parallel_pandas/core/progress_imap.py:111, in _do_parallel(func, tasks, initializer, initargs, n_cpu, total, disable, show_vmem, q, executor, desc)
109 while 1:
110 try:
--> 111 result.append(next(iter_result))
112 except StopIteration:
113 break
File /usr/lib/python3.11/multiprocessing/pool.py:873, in IMapIterator.next(self, timeout)
871 if success:
872 return value
--> 873 raise value
File /usr/lib/python3.11/multiprocessing/pool.py:125, in worker(inqueue, outqueue, initializer, initargs, maxtasks, wrap_exception)
123 job, i, func, args, kwds = task
124 try:
--> 125 result = (True, func(*args, **kwds))
126 except Exception as e:
127 if wrap_exception and func is not _helper_reraises_exception:
File ~/venv/lib/python3.11/site-packages/parallel_pandas/core/parallel_dataframe.py:326, in do_describe(df, workers_queue, percentiles, include, exclude, datetime_is_numeric)
322 def foo():
323 return df.describe(percentiles=percentiles, include=include, exclude=exclude,
324 datetime_is_numeric=datetime_is_numeric)
--> 326 return progress_udf_wrapper(foo, workers_queue, 1)()
File ~/venv/lib/python3.11/site-packages/parallel_pandas/core/progress_imap.py:52, in progress_udf_wrapper.<locals>.wrapped_udf(*args, **kwargs)
51 def wrapped_udf(*args, **kwargs):
---> 52 result = func(*args, **kwargs)
53 updated = next(cnt)
54 if updated == state.next_update:
File ~/venv/lib/python3.11/site-packages/parallel_pandas/core/parallel_dataframe.py:323, in do_describe.<locals>.foo()
322 def foo():
--> 323 return df.describe(percentiles=percentiles, include=include, exclude=exclude,
324 datetime_is_numeric=datetime_is_numeric)
TypeError: NDFrame.describe() got an unexpected keyword argument 'datetime_is_numeric'
environment:
(btw, I love parallel-pandas! It so easy to use, intuitive and the speedup is amazing. Thanks a lot!)
parallel_series.py:24: FutureWarning: the convert_dtype parameter is deprecated and will be removed in a future version. Do ser.astype(object).apply() instead if you want convert_dtype=False.
Hi, if I use p_map on a DataFrame it raises the error:
AttributeError: 'DataFrame' object has no attribute 'p_map'.
I guess that's because map used to be for Series, not DataFrames.
However, if I use p_applymap on the DataFrame, I get a warning:
FutureWarning: DataFrame.applymap has been deprecated. Use DataFrame.map instead.
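The FutureWarning comes from pandas itself: pandas 2.1 renamed DataFrame.applymap to DataFrame.map. Until the library tracks the rename, a small version-aware wrapper avoids the warning on new pandas while staying compatible with old releases. A sketch only; `elementwise` is a hypothetical helper, not part of parallel-pandas:

```python
import pandas as pd

def elementwise(df, func):
    """Apply func to every element, using whichever name this pandas has."""
    if hasattr(pd.DataFrame, "map"):   # pandas >= 2.1
        return df.map(func)
    return df.applymap(func)           # older pandas

out = elementwise(pd.DataFrame({"a": [1, 2]}), lambda x: x + 1)
print(out["a"].tolist())  # → [2, 3]
```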