Comments (7)
Great question! It's honestly kind of unintuitive. I mentioned this briefly in https://github.com/alex-sherman/deco#limitations
This effect exists, to some degree, for all forms of parallelism. Yes, we can execute many things at once, but each individual work item carries some extra overhead: queueing the work item, serializing it into a separate process, waiting on the response, and so on. Some of that overhead is still executed serially, in the thread calling the @concurrent function, so if the serial portion of the overhead is greater than the time it would take to execute the operation itself, the total execution time ends up longer.
TL;DR: yes, for operations this cheap (in time), parallelizing them one at a time will take longer. The general solution is to batch items together, e.g. summing 1000 squares from A to B rather than a single square. I suggest aiming for @concurrent functions taking >1ms.
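A minimal sketch of that batching advice, using plain `multiprocessing` rather than the deco API (all names here are illustrative): each task sums a whole range of squares, so the per-task queueing and serialization overhead is amortized over thousands of cheap operations.

```python
from multiprocessing import Pool

def sum_of_squares(bounds):
    """One batched work item: sum i*i for i in [a, b)."""
    a, b = bounds
    return sum(i * i for i in range(a, b))

def parallel_sum_of_squares(n, batch_size, workers=4):
    # Split [0, n) into batches and submit one task per batch,
    # not one task per number.
    batches = [(i, min(i + batch_size, n)) for i in range(0, n, batch_size)]
    with Pool(workers) as pool:
        return sum(pool.map(sum_of_squares, batches))

if __name__ == "__main__":
    print(parallel_sum_of_squares(1_000_000, 100_000))
```

With `batch_size=100_000`, each task takes well over 1ms of real work, so the fixed per-task cost becomes negligible.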
---
Understood. Thank you for the detailed explanation. I tried my hand at modifying the code into batches. Here are the results:
```
# 100 000 000 numbers; 100 000 Batch Size, 100 threads
# Average foo Function Execution Time: 7.3452558999999855 ms
# Parallelized, Execution Time: 8276.4746 ms
# Unparallelized, Execution Time: 8187.649199999999 ms
#
# 100 000 000 numbers; 1 000 000 Batch Size, 100 threads
# Average foo Function Execution Time: 79.287038 ms
# Parallelized, Execution Time: 9230.890500000001 ms
# Unparallelized, Execution Time: 9097.0797 ms
#
# 100 000 000 numbers; 10 000 000 Batch Size, 10 threads
# foo Function Execution Time: 757.0656 ms
# foo Function Execution Time: 786.8340000000001 ms
# foo Function Execution Time: 765.7913000000001 ms
# foo Function Execution Time: 765.7311 ms
# foo Function Execution Time: 770.1053999999998 ms
# foo Function Execution Time: 791.5252000000006 ms
# foo Function Execution Time: 782.5573999999999 ms
# foo Function Execution Time: 791.6626000000005 ms
# foo Function Execution Time: 777.1696999999999 ms
# foo Function Execution Time: 785.0100999999992 ms
# Average foo Function Execution Time: 777.34524 ms
# Parallelized, Execution Time: 8758.3183 ms
# Unparallelized, Execution Time: 8601.8984 ms
```
The modified code:
```python
import timeit

from deco import concurrent, synchronized

@concurrent.threaded(processes=10)
def foo(x):
    start_time = timeit.default_timer()
    list_of_value = []
    for value in x:
        list_of_value.append(value + value)
    end_time = timeit.default_timer()
    print(f"foo Function Execution Time: {(end_time - start_time) * 1000} ms")
    return list_of_value

@synchronized
def foo_sync(list_of_x, batch_size):
    list_x = list()
    for i in range(0, len(list_of_x), batch_size):
        list_x.append(foo(list_of_x[i:i+batch_size]))
    return list_x

def main():
    start_time = timeit.default_timer()
    foo_sync(range(100_000_000), 10_000_000)
    end_time = timeit.default_timer()
    print(f"Execution Time: {(end_time - start_time) * 1000} ms")

if __name__ == '__main__':
    main()
```
For some reason, I can't get the >1ms concurrent functions to execute quickly in parallel. Is putting a for loop inside a concurrent function to process the batch the wrong move here?
---
Looking again, there are some more important things going on:
- Using @concurrent.threaded for work that isn't blocking on IO won't allow any speedup; in general you should use plain @concurrent.
- Setting the number of processes for CPU-bound work is also generally a bad idea, unless you're sure it's less than the number of cores on your machine. It defaults to the number of cores on your machine.
- I think your changes cover the batching, but they still incur most of the serialization overhead. After switching your example to plain @concurrent, most of the execution time comes from serializing the resulting list back to the main process. So the full criteria for the batching I would suggest: at least 1ms of execution, and small (in terms of serialization) inputs/outputs.
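The "small inputs/outputs" point can be made concrete with `pickle`, which is what `multiprocessing` uses to move results between processes (a sketch independent of deco): returning an aggregate per batch ships a handful of bytes back to the main process, while returning the transformed list ships the whole batch again.

```python
import pickle

batch = list(range(1_000_000))

# What a list-returning batch ships back: the entire doubled list.
full_result = pickle.dumps([v + v for v in batch])

# What an aggregating batch (e.g. a per-batch sum) ships back: one int.
small_result = pickle.dumps(sum(v + v for v in batch))

print(len(full_result), len(small_result))  # megabytes vs. a few dozen bytes
```

If the final result genuinely has to be the full list, this serialization cost is unavoidable with pickle-based process pools, which motivates the shared-memory approaches discussed later in the thread.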
---
I'm getting this error if I don't specify the number of processes.
```
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "...\Python37\lib\multiprocessing\pool.py", line 121, in worker
  File "...\.venv\lib\site-packages\deco\conc.py", line 10, in concWrapper
    result = concurrent.functions[f](*args, **kwargs)
  File "...\main1.py", line 39, in foo
    list_of_value.append(value + value)
MemoryError
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:/Users/user/Desktop/New Folder/poetry/main1.py", line 62, in <module>
    main()
  File "C:/Users/user/Desktop/New Folder/poetry/main1.py", line 57, in main
    foo_sync(range(100_000_000), 10_000_000)
  File "...\.venv\lib\site-packages\deco\conc.py", line 62, in __call__
    return self.f(*args, **kwargs)
  File "<string>", line 1, in foo_sync
  File "...\.venv\lib\site-packages\deco\conc.py", line 139, in wait
    result, operations = self.results.pop().get()
  File "...\.venv\lib\site-packages\deco\conc.py", line 159, in get
    return self.async_result.get(3e+6)
  File "...\Python37\lib\multiprocessing\pool.py", line 657, in get
    raise self._value
MemoryError
```
How much memory do I need for concurrency to work?
`Average foo Function Execution Time:` in the last post highlighted the execution time for each group. Would you say 7.3452558999999855, 79.287038, or 777.34524 milliseconds is adequate, or should each group be closer to 1 ms while still being greater than 1 ms?
Am I correct in assuming larger groups or batches would reduce the thread-locking and serialization overhead at the cost of concurrent execution speed up?
---
Yeah, 1ms is just a minimum; anything longer and the overhead won't be noticeable.
Higher level: is this a useful program to continue debugging? It seems a bit like a toy example, which is certainly useful for understanding how deco works, but maybe not worth spending a whole lot of time on.
Making a list with 100 million numbers (something like 42 bytes/entry according to this method, so ~4.2 GB), then copying it around between processes a few times, yeah, it seems fair to run out of memory. Again, aim for small inputs/outputs; a list with millions of entries will make that difficult.
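That per-entry cost can be sanity-checked with `sys.getsizeof` (a rough, CPython-on-64-bit back-of-envelope sketch; exact sizes vary by interpreter version, and this is not the same method cited above):

```python
import sys

n = 100_000_000
# Each list slot holds an 8-byte pointer on 64-bit CPython, and each
# distinct int above the small-int cache is its own heap object.
per_entry = 8 + sys.getsizeof(1_000_000)
print(f"~{per_entry} bytes/entry, ~{n * per_entry / 1e9:.1f} GB for one copy")
```

Either estimate lands in the multi-gigabyte range per copy, and a pickle round-trip to a worker process temporarily holds additional copies.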
---
I cooked this up to model a real project: I'm processing videos at the bit level using NumPy. A second of 30 fps FHD sRGB video has 186,624,000 bytes of data in need of processing. Suffice it to say, that work is really slow for a single core to chew through, and with 60 fps 8K in the mix as well, concurrency with deco looks like my only option.
I am not sure how to reduce the inputs and outputs in such a case, as you suggest.
I tried manually specifying the number of processes to various degrees and have observed a performance regression, even with 100 batches where each batch takes 82.312661 ms on average to execute.
```
# 100 000 000 numbers; 1 000 000 Batch Size, 100 batches, No Concurrency
# Execution Time: 9000.9163 ms
# 100 000 000 numbers; 1 000 000 Batch Size, 100 batches, 1 process
# Execution Time: 12856.1549 ms
# 100 000 000 numbers; 1 000 000 Batch Size, 100 batches, 2 processes
# Execution Time: 9000.9163 ms
# 100 000 000 numbers; 1 000 000 Batch Size, 100 batches, 5 processes
# Execution Time: 5425.6894999999995 ms
# 100 000 000 numbers; 1 000 000 Batch Size, 100 batches, 10 processes
# Execution Time: 5606.244200000001 ms
# 100 000 000 numbers; 1 000 000 Batch Size, 100 batches, 15 processes
# Execution Time: 5964.0328 ms
# 100 000 000 numbers; 1 000 000 Batch Size, 100 batches, 20 processes
# Execution Time: 6098.0767 ms
# 100 000 000 numbers; 1 000 000 Batch Size, 100 batches, 25 processes
# Execution Time: 6288.636200000001 ms
# 100 000 000 numbers; 1 000 000 Batch Size, 100 batches, 30 processes
# Execution Time: 6587.3597 ms
```
The program scales really well if I replace that part with something like sleep(0.08). Is this the expected behaviour, or am I missing something? Is passing the list around through the function what's hurting me? If so, since deco does not work with global variables, should I code the multiprocessing by hand? (I really don't want to do that.)
---
The crux of your problem is serializing ~100MB between processes; it's not really an issue with deco. I would honestly encourage you to write the same example using multiprocessing.Pool by hand; if there is some difference between the results, then there is an issue with deco.
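A hand-rolled version of the batched example might look like this (a sketch with illustrative names, same batching strategy, no deco), useful as a baseline to compare against:

```python
from multiprocessing import Pool

def foo(batch):
    # Same work as the deco version: double every value in the batch.
    return [v + v for v in batch]

def foo_pool(values, batch_size, workers=None):
    # workers=None lets Pool default to the number of cores.
    batches = [values[i:i + batch_size]
               for i in range(0, len(values), batch_size)]
    with Pool(workers) as pool:
        return pool.map(foo, batches)

if __name__ == "__main__":
    results = foo_pool(range(1_000_000), 100_000)
```

This still pays the same serialization cost for inputs and outputs that the deco version does, which is the point of the comparison.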
To avoid serializing such large arrays, you might consider something like this. These datatypes should work with deco, but I haven't tried to use them.
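One concrete shared-memory approach, separate from deco and requiring Python 3.8+ (note the traceback above shows Python 3.7): put the big NumPy array in `multiprocessing.shared_memory` so workers receive only the buffer's name and a slice description, never a pickled copy of the data. A hedged sketch with illustrative names:

```python
import numpy as np
from multiprocessing import Pool, shared_memory

def double_slice(args):
    # Attach to the existing buffer by name and work on one slice in place.
    name, shape, dtype, start, stop = args
    shm = shared_memory.SharedMemory(name=name)
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    arr[start:stop] += arr[start:stop]  # in place: x -> 2x
    shm.close()

def double_in_parallel(data, workers=4):
    # Copy the input once into a shared buffer.
    shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
    arr = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
    arr[:] = data
    step = -(-len(data) // workers)  # ceiling division
    tasks = [(shm.name, data.shape, data.dtype, i, min(i + step, len(data)))
             for i in range(0, len(data), step)]
    with Pool(workers) as pool:
        pool.map(double_slice, tasks)
    out = arr.copy()
    shm.close()
    shm.unlink()
    return out
```

Each task pickles only a name, a shape, a dtype, and two indices, so the per-task serialization cost stays tiny no matter how large the array is.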