dubovikmaster / parallel-pandas
Parallel processing on pandas with progress bars
License: MIT License
I did not find a parallel version of DataFrame.pivot_table; I was wondering if it's possible to parallelize that function too.
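Not a parallel-pandas API, but one way to parallelize a pivot by hand is to split the frame on the pivot index so each piece holds complete groups, pivot the pieces independently, and concatenate the partial results. A minimal single-process sketch of that split/pivot/concat step (data and column names are illustrative, and it only works when aggfunc decomposes per index group, e.g. 'sum'):

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "col": ["x", "y", "x", "y"],
    "val": [1, 2, 3, 4],
})

# Split on the pivot index so every chunk holds complete groups; this is
# the part a process pool would farm out to workers.
parts = [grp for _, grp in df.groupby("key")]

# Pivot each chunk independently, then stack the partial results.
pivoted = pd.concat(
    part.pivot_table(index="key", columns="col", values="val", aggfunc="sum")
    for part in parts
)
print(pivoted)  # same result as pivoting the whole frame at once
```

Because each chunk contains all rows for its index values, no cross-chunk merging of aggregates is needed; with an aggfunc like 'mean' you would instead have to combine sums and counts across chunks.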
I have 256 GB of RAM and a dataset with 13M+ rows.
I am applying a complex function to all columns using p_applymap.
Because I was having memory issues, I split the DataFrame into 117 chunks of roughly 100k rows each.
The first 115 chunks run without issues, but chunk 116 (which is virtually identical to the rest) uses all the available memory until the process is killed by the OS.
Changing the number of processes does not help.
Naturally, I assumed there was an issue with some of the rows in that chunk, but I could not find anything.
Furthermore, if I apply the function row by row with p_apply and a lambda to that chunk, there are no issues.
Plain pandas map also works just fine.
Could this be the result of a memory leak?
Any thoughts?
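For reference, the chunking described above reduces to a tiny helper that yields row bounds; a sketch with made-up sizes (not part of the library), where each pair would be fed to df.iloc[start:stop]:

```python
def chunk_bounds(n_rows, chunk_size):
    """Yield (start, stop) row bounds covering n_rows in order."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)

# e.g. process a frame chunk by chunk via df.iloc[start:stop]
bounds = list(chunk_bounds(1_000_000, 100_000))
print(len(bounds))  # → 10
```

If one particular chunk blows up while its neighbors don't, it can help to log per-chunk memory (e.g. with psutil) right before and after the p_applymap call to see whether memory grows monotonically across chunks (a leak) or spikes only inside the bad one (pathological data).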
I used the example from the README; I just wanted to parallelize calculating correlations on a large DataFrame. Unfortunately, it did not work.
Windows: 10
Python: 3.9.7
Pandas: 1.5.2
File "C:\Anaconda3\lib\site-packages\parallel_pandas\core\parallel_window.py", line 116, in p_min
return self.parallelize_method(data, name, executor, *args, engine=engine, engine_kwargs=engine_kwargs,
File "C:\Anaconda3\lib\site-packages\parallel_pandas\core\parallel_window.py", line 80, in parallelize_method
result = progress_imap(
File "C:\Anaconda3\lib\site-packages\parallel_pandas\core\progress_imap.py", line 127, in progress_imap
result = _do_parallel(func, tasks, initializer, initargs, n_cpu, total, disable, show_vmem, q, executor, desc)
File "C:\Anaconda3\lib\site-packages\parallel_pandas\core\progress_imap.py", line 111, in _do_parallel
result.append(next(iter_result))
File "C:\Anaconda3\lib\multiprocessing\pool.py", line 870, in next
raise value
File "C:\Anaconda3\lib\multiprocessing\pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "C:\Anaconda3\lib\site-packages\parallel_pandas\core\parallel_window.py", line 36, in do_method
return progress_udf_wrapper(foo, workers_queue, 1)()
File "C:\Anaconda3\lib\site-packages\parallel_pandas\core\progress_imap.py", line 60, in wrapped_udf
state.next_update += max(int((delta_i / delta_t) * .25), 1)
ZeroDivisionError: float division by zero
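The failing line divides delta_i by delta_t, the time elapsed between two progress updates; if the two timestamps are identical (easy to hit with a coarse timer resolution on Windows), delta_t is 0.0. A guarded variant of that expression (a sketch of the fix, not a patch from the library; the function name is made up, the argument names mirror the traceback):

```python
def next_update_increment(delta_i, delta_t):
    """Progress-step increment, guarded against a zero time delta."""
    if delta_t <= 0.0:
        return 1  # no measurable time elapsed; fall back to the minimum step
    return max(int((delta_i / delta_t) * 0.25), 1)

print(next_update_increment(8, 0.0))   # → 1, instead of ZeroDivisionError
print(next_update_increment(100, 1.0)) # → 25
```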
ri.iloc[:10000].groupby('Stkcd').p_apply(get_beta)
File c:\Users\guduh\anaconda3\Lib\site-packages\parallel_pandas\core\parallel_groupby.py:57, in parallelize_groupby_apply.<locals>.p_apply(data, func, executor, *args, **kwargs)
44 result, mutated = _prepare_result(result)
46 # due to a bug in the get_iterator method of the Basegrouper class
47 # that was only fixed in pandas 1.4.0, earlier versions are not yet supported
48
(...)
55 # return data._wrap_applied_output(data.grouper._get_group_keys(), data.grouper._get_group_keys(), result,
56 # not_indexed_same=mutated or data.mutated)
---> 57 return data._wrap_applied_output(data._selected_obj, result, not_indexed_same=mutated or data.mutated)
File c:\Users\guduh\anaconda3\Lib\site-packages\pandas\core\groupby\groupby.py:1312, in GroupBy.__getattr__(self, attr)
1309 if attr in self.obj:
1310 return self[attr]
-> 1312 raise AttributeError(
1313 f"'{type(self).__name__}' object has no attribute '{attr}'"
1314 )
AttributeError: 'DataFrameGroupBy' object has no attribute 'mutated'
With the example code below, I get the error shown. Maybe it's due to the DatetimeIndex with a frequency?
minimal reproducible example:
import pandas as pd
from parallel_pandas import ParallelPandas
ParallelPandas.initialize()
test = pd.DataFrame(index=pd.date_range(start='2024-01-01', end='2024-01-02', freq='30min'))
test["test"] = list(range(len(test)))
test.p_describe()
output:
"name": "TypeError",
"message": "NDFrame.describe() got an unexpected keyword argument 'datetime_is_numeric'",
"stack": "---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[36], line 4
1 test = pd.DataFrame(index=pd.date_range(start='2024-01-01', end='2024-01-02', freq='30min'))
2 test[\"test\"] = list(range(len(test)))
----> 4 test.p_describe()
File ~/venv/lib/python3.11/site-packages/parallel_pandas/core/parallel_dataframe.py:336, in parallelize_describe.<locals>.p_describe(data, percentiles, include, exclude, datetime_is_numeric)
334 split_size = get_split_size(n_cpu, split_factor)
335 tasks = get_split_data(data, 0, split_size)
--> 336 result = progress_imap(
337 partial(do_describe, workers_queue=workers_queue, percentiles=percentiles, include=include, exclude=exclude,
338 datetime_is_numeric=datetime_is_numeric), tasks, workers_queue, n_cpu=n_cpu, disable=disable_pr_bar,
339 show_vmem=show_vmem, total=min(split_size, data.shape[1]), desc='DESCRIBE')
340 return pd.concat(result, axis=1)
File ~/venv/lib/python3.11/site-packages/parallel_pandas/core/progress_imap.py:127, in progress_imap(func, tasks, q, executor, initializer, initargs, n_cpu, total, disable, process_timeout, show_vmem, desc)
125 if process_timeout:
126 func = partial(_wrapped_func, func, process_timeout, True)
--> 127 result = _do_parallel(func, tasks, initializer, initargs, n_cpu, total, disable, show_vmem, q, executor, desc)
128 except (KeyboardInterrupt, Exception):
129 q.put((None, None))
File ~/venv/lib/python3.11/site-packages/parallel_pandas/core/progress_imap.py:111, in _do_parallel(func, tasks, initializer, initargs, n_cpu, total, disable, show_vmem, q, executor, desc)
109 while 1:
110 try:
--> 111 result.append(next(iter_result))
112 except StopIteration:
113 break
File /usr/lib/python3.11/multiprocessing/pool.py:873, in IMapIterator.next(self, timeout)
871 if success:
872 return value
--> 873 raise value
File /usr/lib/python3.11/multiprocessing/pool.py:125, in worker(inqueue, outqueue, initializer, initargs, maxtasks, wrap_exception)
123 job, i, func, args, kwds = task
124 try:
--> 125 result = (True, func(*args, **kwds))
126 except Exception as e:
127 if wrap_exception and func is not _helper_reraises_exception:
File ~/venv/lib/python3.11/site-packages/parallel_pandas/core/parallel_dataframe.py:326, in do_describe(df, workers_queue, percentiles, include, exclude, datetime_is_numeric)
322 def foo():
323 return df.describe(percentiles=percentiles, include=include, exclude=exclude,
324 datetime_is_numeric=datetime_is_numeric)
--> 326 return progress_udf_wrapper(foo, workers_queue, 1)()
File ~/venv/lib/python3.11/site-packages/parallel_pandas/core/progress_imap.py:52, in progress_udf_wrapper.<locals>.wrapped_udf(*args, **kwargs)
51 def wrapped_udf(*args, **kwargs):
---> 52 result = func(*args, **kwargs)
53 updated = next(cnt)
54 if updated == state.next_update:
File ~/venv/lib/python3.11/site-packages/parallel_pandas/core/parallel_dataframe.py:323, in do_describe.<locals>.foo()
322 def foo():
--> 323 return df.describe(percentiles=percentiles, include=include, exclude=exclude,
324 datetime_is_numeric=datetime_is_numeric)
TypeError: NDFrame.describe() got an unexpected keyword argument 'datetime_is_numeric'
environment:
(btw, I love parallel-pandas! It so easy to use, intuitive and the speedup is amazing. Thanks a lot!)
parallel_series.py:24: FutureWarning: the convert_dtype parameter is deprecated and will be removed in a future version. Do ser.astype(object).apply() instead if you want convert_dtype=False.
Hi, if I use p_map on a DataFrame it raises the error:
AttributeError: 'DataFrame' object has no attribute 'p_map'.
I guess that's because map used to be for Series, not DataFrames.
However, if I use p_applymap on the DataFrame, I get a warning:
FutureWarning: DataFrame.applymap has been deprecated. Use DataFrame.map instead.
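The FutureWarning comes from pandas itself: pandas 2.1 renamed DataFrame.applymap to DataFrame.map. Until the library tracks the rename, a small version-aware wrapper avoids the warning on new pandas while staying compatible with old releases. A sketch only; `elementwise` is a hypothetical helper, not part of parallel-pandas:

```python
import pandas as pd

def elementwise(df, func):
    """Apply func to every element, using whichever name this pandas has."""
    if hasattr(pd.DataFrame, "map"):   # pandas >= 2.1
        return df.map(func)
    return df.applymap(func)           # older pandas

out = elementwise(pd.DataFrame({"a": [1, 2]}), lambda x: x + 1)
print(out["a"].tolist())  # → [2, 3]
```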