
parallel-pandas's Introduction

Hi there 👋


parallel-pandas's People

Contributors

dubovikmaster


parallel-pandas's Issues

Strange memory issue

I have 256 GB of RAM and a dataset with 13m+ rows.

I am applying a complex function to all columns using p_applymap.

As I was having memory issues, I split the df into 117 chunks of roughly 100k rows each.

The first 115 chunks run without issues, but chunk 116 (which is virtually identical to the rest), uses all the available memory until it is killed by the OS.

Changing the number of processes does not help.

Naturally, I assumed that there was an issue with some of the rows in that chunk, but I could not find anything.

Furthermore, if I apply the function row by row with p_apply and a lambda on that chunk, there are no issues.

Plain pandas map also works just fine.

Could this be the result of a memory leak?

Any thoughts?
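To narrow down whether the leak sits in the parallel machinery or in the function itself, a minimal single-process sketch of the chunked workflow described above may help (plain pandas, no parallel_pandas, so each chunk's intermediates can be released before the next starts; the helper name and chunk size are illustrative, not from the library):

```python
import gc
import pandas as pd

def process_in_chunks(df, func, chunk_size=100_000):
    """Apply an elementwise function chunk by chunk, dropping each
    chunk's intermediates before the next chunk starts."""
    results = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        # DataFrame.map exists on pandas >= 2.1; older versions use applymap
        apply = chunk.map if hasattr(chunk, "map") else chunk.applymap
        results.append(apply(func))
        del chunk
        gc.collect()  # encourage reclaiming the chunk's memory right away
    return pd.concat(results)

df = pd.DataFrame({"a": range(10), "b": range(10)})
out = process_in_chunks(df, lambda x: x * 2, chunk_size=3)
```

If this sequential version also blows up on chunk 116, the problem is in the data or the function rather than in parallel-pandas.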

ZeroDivisionError: float division by zero

Windows: 10
Python: 3.9.7
Pandas: 1.5.2

File "C:\Anaconda3\lib\site-packages\parallel_pandas\core\parallel_window.py", line 116, in p_min
return self.parallelize_method(data, name, executor, *args, engine=engine, engine_kwargs=engine_kwargs,

File "C:\Anaconda3\lib\site-packages\parallel_pandas\core\parallel_window.py", line 80, in parallelize_method
result = progress_imap(

File "C:\Anaconda3\lib\site-packages\parallel_pandas\core\progress_imap.py", line 127, in progress_imap
result = _do_parallel(func, tasks, initializer, initargs, n_cpu, total, disable, show_vmem, q, executor, desc)

File "C:\Anaconda3\lib\site-packages\parallel_pandas\core\progress_imap.py", line 111, in _do_parallel
result.append(next(iter_result))

File "C:\Anaconda3\lib\multiprocessing\pool.py", line 870, in next
raise value

File "C:\Anaconda3\lib\multiprocessing\pool.py", line 125, in worker
result = (True, func(*args, **kwds))

File "C:\Anaconda3\lib\site-packages\parallel_pandas\core\parallel_window.py", line 36, in do_method
return progress_udf_wrapper(foo, workers_queue, 1)()

File "C:\Anaconda3\lib\site-packages\parallel_pandas\core\progress_imap.py", line 60, in wrapped_udf
state.next_update += max(int((delta_i / delta_t) * .25), 1)

ZeroDivisionError: float division by zero
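The failing line in progress_imap.py divides by delta_t, which can be zero when two progress updates land inside a single clock tick (Windows' timer is coarse). A hypothetical guard for that heuristic, shown standalone here and not the library's actual fix:

```python
def next_update_step(delta_i, delta_t):
    """Hypothetical guard for the progress-update heuristic in the
    traceback: when delta_t is zero (two updates within one timer
    tick), fall back to the minimum step instead of dividing."""
    if delta_t <= 0:
        return 1
    return max(int((delta_i / delta_t) * 0.25), 1)
```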

'DataFrameGroupBy' object has no attribute 'mutated'

ri.iloc[:10000].groupby('Stkcd').p_apply(get_beta)

File c:\Users\guduh\anaconda3\Lib\site-packages\parallel_pandas\core\parallel_groupby.py:57, in parallelize_groupby_apply..p_apply(data, func, executor, args, **kwargs)
44 result, mutated = _prepare_result(result)
46 # due to a bug in the get_iterator method of the Basegrouper class
47 # that was only fixed in pandas 1.4.0, earlier versions are not yet supported
48
(...)
55 # return data._wrap_applied_output(data.grouper._get_group_keys(), data.grouper._get_group_keys(), result,
56 # not_indexed_same=mutated or data.mutated)
---> 57 return data._wrap_applied_output(data._selected_obj, result, not_indexed_same=mutated or data.mutated)

File c:\Users\guduh\anaconda3\Lib\site-packages\pandas\core\groupby\groupby.py:1312, in GroupBy.getattr(self, attr)
1309 if attr in self.obj:
1310 return self[attr]
-> 1312 raise AttributeError(
1313 f"'{type(self).name}' object has no attribute '{attr}'"
1314 )

AttributeError: 'DataFrameGroupBy' object has no attribute 'mutated'
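pandas 2.x removed the internal `mutated` attribute that parallel_groupby.py still reads. A hypothetical compatibility shim (not a full fix, since `_wrap_applied_output`'s signature also changed between pandas versions) would simply guard the attribute access:

```python
import pandas as pd

def groupby_mutated(grouped):
    """Read the internal `mutated` flag where it still exists
    (pandas 1.x); fall back to False on pandas >= 2.0."""
    return getattr(grouped, "mutated", False)

df = pd.DataFrame({"g": [1, 1, 2], "x": [1, 2, 3]})
flag = groupby_mutated(df.groupby("g"))
```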

[BUG] `p_describe()`: TypeError: NDFrame.describe() got an unexpected keyword argument 'datetime_is_numeric'

With this example code, I get the error below. Maybe due to the DatetimeIndex with a frequency?

minimal reproducible example:

import pandas as pd
from parallel_pandas import ParallelPandas

ParallelPandas.initialize()

test = pd.DataFrame(index=pd.date_range(start='2024-01-01', end='2024-01-02', freq='30min'))
test["test"] = list(range(len(test)))
test.p_describe()

output:

	"name": "TypeError",
	"message": "NDFrame.describe() got an unexpected keyword argument 'datetime_is_numeric'",
	"stack": "---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[36], line 4
      1 test = pd.DataFrame(index=pd.date_range(start='2024-01-01', end='2024-01-02', freq='30min'))
      2 test[\"test\"] = list(range(len(test)))
----> 4 test.p_describe()

File ~/venv/lib/python3.11/site-packages/parallel_pandas/core/parallel_dataframe.py:336, in parallelize_describe.<locals>.p_describe(data, percentiles, include, exclude, datetime_is_numeric)
    334 split_size = get_split_size(n_cpu, split_factor)
    335 tasks = get_split_data(data, 0, split_size)
--> 336 result = progress_imap(
    337     partial(do_describe, workers_queue=workers_queue, percentiles=percentiles, include=include, exclude=exclude,
    338             datetime_is_numeric=datetime_is_numeric), tasks, workers_queue, n_cpu=n_cpu, disable=disable_pr_bar,
    339     show_vmem=show_vmem, total=min(split_size, data.shape[1]), desc='DESCRIBE')
    340 return pd.concat(result, axis=1)

File ~/venv/lib/python3.11/site-packages/parallel_pandas/core/progress_imap.py:127, in progress_imap(func, tasks, q, executor, initializer, initargs, n_cpu, total, disable, process_timeout, show_vmem, desc)
    125     if process_timeout:
    126         func = partial(_wrapped_func, func, process_timeout, True)
--> 127     result = _do_parallel(func, tasks, initializer, initargs, n_cpu, total, disable, show_vmem, q, executor, desc)
    128 except (KeyboardInterrupt, Exception):
    129     q.put((None, None))

File ~/venv/lib/python3.11/site-packages/parallel_pandas/core/progress_imap.py:111, in _do_parallel(func, tasks, initializer, initargs, n_cpu, total, disable, show_vmem, q, executor, desc)
    109 while 1:
    110     try:
--> 111         result.append(next(iter_result))
    112     except StopIteration:
    113         break

File /usr/lib/python3.11/multiprocessing/pool.py:873, in IMapIterator.next(self, timeout)
    871 if success:
    872     return value
--> 873 raise value

File /usr/lib/python3.11/multiprocessing/pool.py:125, in worker(inqueue, outqueue, initializer, initargs, maxtasks, wrap_exception)
    123 job, i, func, args, kwds = task
    124 try:
--> 125     result = (True, func(*args, **kwds))
    126 except Exception as e:
    127     if wrap_exception and func is not _helper_reraises_exception:

File ~/venv/lib/python3.11/site-packages/parallel_pandas/core/parallel_dataframe.py:326, in do_describe(df, workers_queue, percentiles, include, exclude, datetime_is_numeric)
    322 def foo():
    323     return df.describe(percentiles=percentiles, include=include, exclude=exclude,
    324                        datetime_is_numeric=datetime_is_numeric)
--> 326 return progress_udf_wrapper(foo, workers_queue, 1)()

File ~/venv/lib/python3.11/site-packages/parallel_pandas/core/progress_imap.py:52, in progress_udf_wrapper.<locals>.wrapped_udf(*args, **kwargs)
     51 def wrapped_udf(*args, **kwargs):
---> 52     result = func(*args, **kwargs)
     53     updated = next(cnt)
     54     if updated == state.next_update:

File ~/venv/lib/python3.11/site-packages/parallel_pandas/core/parallel_dataframe.py:323, in do_describe.<locals>.foo()
    322 def foo():
--> 323     return df.describe(percentiles=percentiles, include=include, exclude=exclude,
    324                        datetime_is_numeric=datetime_is_numeric)

TypeError: NDFrame.describe() got an unexpected keyword argument 'datetime_is_numeric'"

environment:

  • pandas==2.2.1
  • parallel-pandas==0.6.2
  • Python 3.11.6

(btw, I love parallel-pandas! It's so easy to use, intuitive, and the speedup is amazing. Thanks a lot!)
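The root cause is that pandas 2.0 removed the `datetime_is_numeric` keyword from `describe()` (datetimes are always summarized numerically there), while parallel_pandas 0.6.2 still forwards it. A version-aware wrapper, sketched as a hypothetical workaround rather than the library's fix:

```python
import pandas as pd

def describe_compat(df, **kwargs):
    """Call DataFrame.describe, passing datetime_is_numeric only on
    pandas < 2.0, where the keyword still exists."""
    if int(pd.__version__.split(".")[0]) < 2:
        kwargs.setdefault("datetime_is_numeric", True)
    else:
        kwargs.pop("datetime_is_numeric", None)
    return df.describe(**kwargs)

test = pd.DataFrame(index=pd.date_range("2024-01-01", "2024-01-02", freq="30min"))
test["test"] = list(range(len(test)))
summary = describe_compat(test)
```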

FutureWarning: DataFrame.applymap has been deprecated. Use DataFrame.map instead.

Hi, if I use p_map on a DataFrame it raises the error:

AttributeError: 'DataFrame' object has no attribute 'p_map'.

I guess that's because map used to be for Series, not DataFrames.

However, if I use p_applymap on the DataFrame, I get a warning:

FutureWarning: DataFrame.applymap has been deprecated. Use DataFrame.map instead.
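Until parallel-pandas adds a p_map for DataFrames, a version-dispatch sketch using only the public map/applymap methods (plain pandas, shown as an assumption about the environment rather than the library's API) sidesteps the warning without pinning pandas:

```python
import pandas as pd

def elementwise(df, func):
    """Apply func to every element. DataFrame.map arrived in pandas
    2.1 as the replacement for the deprecated applymap; pick
    whichever method this pandas version provides."""
    if hasattr(pd.DataFrame, "map"):
        return df.map(func)
    return df.applymap(func)

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
out = elementwise(df, lambda x: x + 1)
```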
