Comments (5)
Agreed. I'm unsure whether this should be part of 1.0 or 2.0, though.
I'm somewhat tempted to release the current RC as stable and to ship the changes I've introduced, together with these, as 2.0.
from console-parallelization.
However, we also want to use the ParallelizationTrait to process queue-like structures.
I'm not sure that's the appropriate usage. If you have a queue, then you are probably better off having:
- a command that dispatches your message to the queue (does not need to be parallelized)
- a command that consumes your queue messages
- a supervisor which takes care of spawning the workers, restarting them and stopping them
This is something that works really well and the ecosystem for it is quite mature. Is there anything specific for which you think this parallelization would help out more?
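That dispatcher/consumer/supervisor split can be sketched in a few lines. This is a language-agnostic illustration in Python, with an in-memory queue standing in for a real broker and all names invented for the example:

```python
import queue
import threading

def dispatch(q, messages):
    # Dispatcher: pushes messages onto the queue; this part needs no parallelism.
    for msg in messages:
        q.put(msg)

def consume(q, results):
    # Consumer worker: drains messages until the queue is empty.
    while True:
        try:
            msg = q.get_nowait()
        except queue.Empty:
            return
        results.append(msg * 2)  # stand-in for real message handling

q = queue.Queue()
dispatch(q, range(10))

# "Supervisor": spawns a fixed pool of workers and waits for them to finish.
results = []
workers = [threading.Thread(target=consume, args=(q, results)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(sorted(results))
```

In a real setup the supervisor would be something like supervisord or systemd restarting long-running consumer processes (e.g. Symfony Messenger's messenger:consume command), rather than a join loop.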
First of all: I'm actually not sure either whether this feature should be part of the bundle. If you decide not to include it because it would make the bundle overly complex, that's fine with me.
I had two use cases in mind:
a) The command processes a large number of items (let's say 500,000). Some of them fail in runSingleCommand() (10% → 50,000), for example because an external API was offline for several minutes, or for some other reason. In the end, there should be a "retry" for the failed items. This could be done in the runAfterLastCommand method (without parallelization), but it would be better to process these items in parallel as well.
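A hedged sketch of that retry idea (Python for illustration; process(), run_parallel() and the set of flaky items are invented stand-ins for runSingleCommand() and the failing external API):

```python
from concurrent.futures import ThreadPoolExecutor

def process(item, fail_on):
    # Stand-in for runSingleCommand(): fails for items in `fail_on`.
    if item in fail_on:
        raise RuntimeError(f"API offline for item {item}")
    return item

def run_parallel(items, fail_on):
    # One parallel pass over `items`; returns (succeeded, failed).
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [(i, pool.submit(process, i, fail_on)) for i in items]
        for item, future in futures:
            try:
                succeeded.append(future.result())
            except RuntimeError:
                failed.append(item)
    return succeeded, failed

items = list(range(20))
flaky = {3, 7, 11}                        # items that fail on the first pass
done, failed = run_parallel(items, flaky)
# Retry pass: reuse the same parallel machinery for just the failed items.
retried, still_failed = run_parallel(failed, set())
print(len(done) + len(retried), still_failed)
```

The point is that the retry pass goes through the same parallel path as the first pass, instead of a sequential cleanup in runAfterLastCommand.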
b) When processing large datasets (e.g. an XML file containing 5,000,000 records/nodes), there could be limitations on extracting all the data in a single fetchItems() call. An alternative would be to extract only a subset of the XML's records (e.g. 500,000) at a time and to call fetchItems() until no records are left.
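That chunked approach could look roughly like this (Python for illustration; fetch_items() is an invented stand-in with an offset/limit signature that the real fetchItems() does not necessarily have):

```python
def fetch_items(offset, limit):
    # Stand-in for fetchItems(): returns at most `limit` records from `offset`.
    TOTAL = 23  # pretend the XML file holds 23 records
    return list(range(offset, min(offset + limit, TOTAL)))

def all_items(chunk_size):
    # Call fetch_items() repeatedly until no records are left.
    offset = 0
    while True:
        chunk = fetch_items(offset, chunk_size)
        if not chunk:
            return
        yield from chunk
        offset += len(chunk)

records = list(all_items(chunk_size=5))
print(len(records))  # 23
```

Because all_items() is a generator, no more than one chunk of records needs to be materialized at a time.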
- a command that dispatches your message to the queue (does not need to be parallelized)
- a command that consumes your queue messages
- a supervisor which takes care of spawning the workers, restarting them and stopping them
This is something that works really well and the ecosystem for it is quite mature.
Just out of curiosity: is there a specific technology or bundle that you would suggest?
a) The command processes a large number of items (let's say 500,000). Some of them fail in runSingleCommand() (10% → 50,000), for example because an external API was offline for several minutes, or for some other reason. In the end, there should be a "retry" for the failed items. This could be done in the runAfterLastCommand method (without parallelization), but it would be better to process these items in parallel as well.
This is rather asking for a queue infrastructure with retry support. That is, if an API call fails, you want to move the message back to the queue and retry it either immediately or after some delay. Depending on which strategy you choose, there are different implementation options; however, you don't really want to parallelize this.
Quite the contrary: I would keep calls to external APIs within a single process, so that you can control how many calls you make to the API and throttle your requests in order not to be blocked for exceeding rate limits.
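A minimal sketch of such single-process throttling (Python for illustration; the Throttle class and call_api() are invented for the example):

```python
import time

class Throttle:
    # Enforces a minimum interval between calls (at most `rate` calls/second).
    def __init__(self, rate):
        self.interval = 1.0 / rate
        self.last = float("-inf")

    def wait(self):
        delay = self.last + self.interval - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        self.last = time.monotonic()

def call_api(item):
    return item  # stand-in for the external API call

throttle = Throttle(rate=200)  # cap at 200 requests per second
start = time.monotonic()
results = []
for i in range(10):
    throttle.wait()              # single process: trivial to meter the rate
    results.append(call_api(i))
elapsed = time.monotonic() - start
print(len(results))
```

With all API calls funneled through one process, the rate cap is a single counter; spread across parallel workers, the same guarantee would need shared coordination.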
b) When processing large datasets (e.g. an XML file containing 5,000,000 records/nodes), there could be limitations on extracting all the data in a single fetchItems() call. An alternative would be to extract only a subset of the XML's records (e.g. 500,000) at a time and to call fetchItems() until no records are left.
Again, there are different strategies, some of which could be supported by this trait and others that couldn't:
- You could parse the file and move all parsed entries to a queue, from where workers process them. You don't need this trait for that: just fire up as many workers in parallel as you want, and the queuing system will make sure each message is passed to only one of them.
- You could parse the file and stream the entries to parallel processes directly, without a queue in between. For that, Parallelization would need support for "uncountable" items, i.e. item streams whose total number is unknown because we never load them all into memory at once. That is technically possible (and might even make a lot of sense), except that we wouldn't know the number of batches and segments in advance and hence couldn't display a progress bar.
I agree with your reasoning. I vote for implementing strategy number 2.
Even with that strategy, the items are not always "uncountable": the total number might be known or estimable. For instance, when dealing with a very large XML file, counting the root's child nodes (= the total number of entries) for the progress bar might be cheap, while extracting the nodes' data will be expensive. So there are cases where the progress-bar limit can initially be set to a reasonable value, and probably other cases where the limit is unknown and will grow over time.
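That "cheap count, expensive extraction" split might look like this (Python for illustration; the two-pass structure is the point, not the API):

```python
import io
import xml.etree.ElementTree as ET

XML = (
    b"<root>"
    + b"".join(f"<record><v>{i}</v></record>".encode() for i in range(500))
    + b"</root>"
)

def count_records(data):
    # Cheap pass: count the root's <record> children without extracting data.
    return sum(
        1
        for event, elem in ET.iterparse(io.BytesIO(data), events=("start",))
        if elem.tag == "record"
    )

total = count_records(XML)  # sets the progress-bar limit up front

done = 0
values = []
for event, elem in ET.iterparse(io.BytesIO(XML), events=("end",)):
    if elem.tag == "record":
        values.append(int(elem.find("v").text))  # the expensive extraction
        done += 1  # advance the progress bar toward `total`
        elem.clear()

print(total, done)  # 500 500
```

The counting pass touches only tags, so it stays cheap even when per-record extraction is expensive; when counting is not cheap, the limit would have to be an estimate that grows over time, as described above.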
Related Issues (20)
- Auto-detect the number of processes
- Always use a sub-process; add no-parallel option
- Add options to change the segment size or batch size on the fly
- Add stop-on-failure option
- Configuration calculation may result in an exception if there is no items
- [2.x] Some resources are missing
- Limit the exit code to 255
- Incorrect error message
- Make it easier to override getParallelExecutableFactory()
- Add a utility decorator logger
- Add a logger to log the item handling to an arbitrary logger
- Error when $_SERVER['PWD'] is not set
- Remove or change input option alias for the number of processes
- Error when command is executed outside bin/ directory.
- Cannot install symfony/console dependency in Pimcore
- ItemBatchIterator | Implement as Iterable?
- Roadmap?
- Better solution for the "item" - argument?
- Pipe breaks when quotes are used in input options
- Plans for v2