Comments (7)
Great question! It's honestly kind of unintuitive. I mentioned this briefly in https://github.com/alex-sherman/deco#limitations
This effect exists, to some degree, for all forms of parallelism. Yes, we can execute many things at once, but each individual work item carries some extra overhead: queueing the work item, serializing it into a separate process, waiting on the response, and so on. Some of that overhead is still executed serially, in the thread calling the @concurrent function, so if the serial portion of the overhead is greater than the time it would take to execute the operation itself, the total execution time ends up longer.
TL;DR: yes, for operations this cheap (in time), parallelizing them one at a time will take longer. The general solution is to batch items together, e.g. summing 1000 squares from A to B rather than a single square. I suggest aiming for @concurrent functions taking >1ms.
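A minimal sketch of that batching advice, using plain `multiprocessing` rather than the deco API (all names here are illustrative): each task sums a whole range of squares, so the per-task queueing and serialization overhead is amortized over thousands of cheap operations.

```python
from multiprocessing import Pool

def sum_of_squares(bounds):
    """One batched work item: sum i*i for i in [a, b)."""
    a, b = bounds
    return sum(i * i for i in range(a, b))

def parallel_sum_of_squares(n, batch_size, workers=4):
    # Split [0, n) into batches and submit one task per batch,
    # not one task per number.
    batches = [(i, min(i + batch_size, n)) for i in range(0, n, batch_size)]
    with Pool(workers) as pool:
        return sum(pool.map(sum_of_squares, batches))

if __name__ == "__main__":
    print(parallel_sum_of_squares(1_000_000, 100_000))
```

With `batch_size=100_000`, each task takes well over 1ms of real work, so the fixed per-task cost becomes negligible.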
---
Understood. Thank you for the detailed explanation. I tried my hand at modifying the code into batches. Here are the results:
```
# 100 000 000 numbers; 100 000 Batch Size, 100 threads
# Average foo Function Execution Time: 7.3452558999999855 ms
# Parallelized, Execution Time: 8276.4746 ms
# Unparallelized, Execution Time: 8187.649199999999 ms
#
# 100 000 000 numbers; 1 000 000 Batch Size, 100 threads
# Average foo Function Execution Time: 79.287038 ms
# Parallelized, Execution Time: 9230.890500000001 ms
# Unparallelized, Execution Time: 9097.0797 ms
#
# 100 000 000 numbers; 10 000 000 Batch Size, 10 threads
# foo Function Execution Time: 757.0656 ms
# foo Function Execution Time: 786.8340000000001 ms
# foo Function Execution Time: 765.7913000000001 ms
# foo Function Execution Time: 765.7311 ms
# foo Function Execution Time: 770.1053999999998 ms
# foo Function Execution Time: 791.5252000000006 ms
# foo Function Execution Time: 782.5573999999999 ms
# foo Function Execution Time: 791.6626000000005 ms
# foo Function Execution Time: 777.1696999999999 ms
# foo Function Execution Time: 785.0100999999992 ms
# Average foo Function Execution Time: 777.34524 ms
# Parallelized, Execution Time: 8758.3183 ms
# Unparallelized, Execution Time: 8601.8984 ms
```
The modified code:
```python
import timeit

from deco import concurrent, synchronized

@concurrent.threaded(processes=10)
def foo(x):
    start_time = timeit.default_timer()
    list_of_value = []
    for value in x:
        list_of_value.append(value + value)
    end_time = timeit.default_timer()
    print(f"foo Function Execution Time: {(end_time - start_time) * 1000} ms")
    return list_of_value

@synchronized
def foo_sync(list_of_x, batch_size):
    list_x = list()
    for i in range(0, len(list_of_x), batch_size):
        list_x.append(foo(list_of_x[i:i+batch_size]))
    return list_x

def main():
    start_time = timeit.default_timer()
    foo_sync(range(100_000_000), 10_000_000)
    end_time = timeit.default_timer()
    print(f"Execution Time: {(end_time - start_time) * 1000} ms")

if __name__ == '__main__':
    main()
```
For some reason, I can't get the >1ms concurrent functions to execute quickly in parallel. Is putting a for loop inside a concurrent function to process the batch the wrong move here?
---
Looking again, there are some more important things going on:
- Using @concurrent.threaded for work that isn't blocking on IO won't allow any speedup; in general you should use plain @concurrent.
- Setting the number of processes for CPU-bound work is also generally a bad idea, unless you're sure it's less than the number of cores on your machine. It defaults to the number of cores on your machine.
- I think your changes cover the batching, but they still incur most of the serialization overhead. After switching your example to plain @concurrent, most of the execution time comes from serializing the resulting list back to the main process. So the full criteria for the batching I would suggest: at least 1ms of execution, and small (in terms of serialization) inputs/outputs.
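The "small inputs/outputs" point can be made concrete with `pickle`, which is what `multiprocessing` uses to move results between processes (a sketch independent of deco): returning an aggregate per batch ships a handful of bytes back to the main process, while returning the transformed list ships the whole batch again.

```python
import pickle

batch = list(range(1_000_000))

# What a list-returning batch ships back: the entire doubled list.
full_result = pickle.dumps([v + v for v in batch])

# What an aggregating batch (e.g. a per-batch sum) ships back: one int.
small_result = pickle.dumps(sum(v + v for v in batch))

print(len(full_result), len(small_result))  # megabytes vs. a few dozen bytes
```

If the final result genuinely has to be the full list, this serialization cost is unavoidable with pickle-based process pools, which motivates the shared-memory approaches discussed later in the thread.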
---
I'm getting this error if I don't specify the number of processes.
```
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "...\Python37\lib\multiprocessing\pool.py", line 121, in worker
  File "...\.venv\lib\site-packages\deco\conc.py", line 10, in concWrapper
    result = concurrent.functions[f](*args, **kwargs)
  File "...\main1.py", line 39, in foo
    list_of_value.append(value + value)
MemoryError
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:/Users/user/Desktop/New Folder/poetry/main1.py", line 62, in <module>
    main()
  File "C:/Users/user/Desktop/New Folder/poetry/main1.py", line 57, in main
    foo_sync(range(100_000_000), 10_000_000)
  File "...\.venv\lib\site-packages\deco\conc.py", line 62, in __call__
    return self.f(*args, **kwargs)
  File "<string>", line 1, in foo_sync
  File "...\.venv\lib\site-packages\deco\conc.py", line 139, in wait
    result, operations = self.results.pop().get()
  File "...\.venv\lib\site-packages\deco\conc.py", line 159, in get
    return self.async_result.get(3e+6)
  File "...\Python37\lib\multiprocessing\pool.py", line 657, in get
    raise self._value
MemoryError
```
How much memory do I need for concurrency to work?
`Average foo Function Execution Time:` in the last post highlighted the execution time for each group. Would you say 7.3452558999999855, 79.287038, or 777.34524 milliseconds is adequate, or should each group be closer to 1 ms while still being greater than 1 ms?
Am I correct in assuming larger groups or batches would reduce the thread-locking and serialization overhead at the cost of concurrent execution speed up?
---
Yeah, 1ms is just a minimum; anything longer and the overhead won't be noticeable.
Higher level: is this a useful program to continue debugging? It seems a bit like a toy example, which is certainly useful for understanding how deco works, but maybe not worth spending a whole lot of time on.
Making a list with 100 million numbers (something like 42 bytes/entry according to this method, so ~4.2 GB), then copying it around between processes a few times, yeah, it seems fair to run out of memory. Again, aim for small inputs/outputs; a list with millions of entries will make that difficult.
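That per-entry cost can be sanity-checked with `sys.getsizeof` (a rough, CPython-on-64-bit back-of-envelope sketch; exact sizes vary by interpreter version, and this is not the same method cited above):

```python
import sys

n = 100_000_000
# Each list slot holds an 8-byte pointer on 64-bit CPython, and each
# distinct int above the small-int cache is its own heap object.
per_entry = 8 + sys.getsizeof(1_000_000)
print(f"~{per_entry} bytes/entry, ~{n * per_entry / 1e9:.1f} GB for one copy")
```

Either estimate lands in the multi-gigabyte range per copy, and a pickle round-trip to a worker process temporarily holds additional copies.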
---
I cooked this up to model a real project: I'm processing videos at the bit level using NumPy. A second of 30 fps FHD sRGB video has 186,624,000 bytes of data in need of processing. Suffice it to say, that work is really slow for a single core to chew through, and with 60 fps 8K in the mix as well, concurrency with deco looks like my only option.
I am not sure how to reduce the inputs and outputs in such a case, as you suggest.
I tried manually specifying the number of processes to various degrees and have observed a performance regression, even with 100 batches where each batch takes 82.312661 ms on average to execute.
```
# 100 000 000 numbers; 1 000 000 Batch Size, 100 batches, No Concurrency
# Execution Time: 9000.9163 ms
# 100 000 000 numbers; 1 000 000 Batch Size, 100 batches, 1 process
# Execution Time: 12856.1549 ms
# 100 000 000 numbers; 1 000 000 Batch Size, 100 batches, 2 processes
# Execution Time: 9000.9163 ms
# 100 000 000 numbers; 1 000 000 Batch Size, 100 batches, 5 processes
# Execution Time: 5425.6894999999995 ms
# 100 000 000 numbers; 1 000 000 Batch Size, 100 batches, 10 processes
# Execution Time: 5606.244200000001 ms
# 100 000 000 numbers; 1 000 000 Batch Size, 100 batches, 15 processes
# Execution Time: 5964.0328 ms
# 100 000 000 numbers; 1 000 000 Batch Size, 100 batches, 20 processes
# Execution Time: 6098.0767 ms
# 100 000 000 numbers; 1 000 000 Batch Size, 100 batches, 25 processes
# Execution Time: 6288.636200000001 ms
# 100 000 000 numbers; 1 000 000 Batch Size, 100 batches, 30 processes
# Execution Time: 6587.3597 ms
```
The program scales really well if I replace that part with something like sleep(0.08). Is this the expected behaviour, or am I missing something? Is passing the list around through the function what's hurting me? If so, since deco does not work with global variables, should I code the multiprocessing by hand? (I really don't want to do that.)
---
The crux of your problem is serializing ~100MB between processes; it's not really an issue with deco. I would honestly encourage you to write the same example using multiprocessing.Pool by hand; if there is some difference between the results, then there is an issue with deco.
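A hand-rolled version of the batched example might look like this (a sketch with illustrative names, same batching strategy, no deco), useful as a baseline to compare against:

```python
from multiprocessing import Pool

def foo(batch):
    # Same work as the deco version: double every value in the batch.
    return [v + v for v in batch]

def foo_pool(values, batch_size, workers=None):
    # workers=None lets Pool default to the number of cores.
    batches = [values[i:i + batch_size]
               for i in range(0, len(values), batch_size)]
    with Pool(workers) as pool:
        return pool.map(foo, batches)

if __name__ == "__main__":
    results = foo_pool(range(1_000_000), 100_000)
```

This still pays the same serialization cost for inputs and outputs that the deco version does, which is the point of the comparison.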
To avoid serializing such large arrays, you might consider something like this. These datatypes should work with deco, but I haven't tried to use them.
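One concrete shared-memory approach, separate from deco and requiring Python 3.8+ (note the traceback above shows Python 3.7): put the big NumPy array in `multiprocessing.shared_memory` so workers receive only the buffer's name and a slice description, never a pickled copy of the data. A hedged sketch with illustrative names:

```python
import numpy as np
from multiprocessing import Pool, shared_memory

def double_slice(args):
    # Attach to the existing buffer by name and work on one slice in place.
    name, shape, dtype, start, stop = args
    shm = shared_memory.SharedMemory(name=name)
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    arr[start:stop] += arr[start:stop]  # in place: x -> 2x
    shm.close()

def double_in_parallel(data, workers=4):
    # Copy the input once into a shared buffer.
    shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
    arr = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
    arr[:] = data
    step = -(-len(data) // workers)  # ceiling division
    tasks = [(shm.name, data.shape, data.dtype, i, min(i + step, len(data)))
             for i in range(0, len(data), step)]
    with Pool(workers) as pool:
        pool.map(double_slice, tasks)
    out = arr.copy()
    shm.close()
    shm.unlink()
    return out
```

Each task pickles only a name, a shape, a dtype, and two indices, so the per-task serialization cost stays tiny no matter how large the array is.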