
Comments (5)

theofidry commented on May 26, 2024

Agreed. Unsure if this should be part of 1.0 or 2.0 though.

I'm kinda tempted to release the current RC as stable, and ship the changes I've introduced, plus these, as 2.0.

from console-parallelization.

theofidry commented on May 26, 2024

However, we also want to use the ParallelizationTrait to process queue-like structures

I'm not sure it's the appropriate usage. If you have a queue, then you are probably better off having:

  • a command that dispatches your message to the queue (does not need to be parallelized)
  • a command that consumes your queue messages
  • a supervisor which takes care of spawning the workers, restarting them and stopping them

This is something that works really well and the ecosystem for it is quite mature. Is there anything specific for which you think this parallelization would help out more?
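To make the supervisor part concrete: it is often nothing more than a process manager configuration. A minimal sketch with supervisord (the program name, console command, and worker count here are made up):

```ini
; Hypothetical supervisord config: keep 4 queue consumers running,
; restart them if they exit, and give each a numbered process name.
[program:queue-worker]
command=php bin/console app:consume-queue
numprocs=4
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=true
stopwaitsecs=30
```

With this in place, parallelism is just "run N copies of the consumer"; no coordination inside the command itself is needed.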


andreas-gruenwald commented on May 26, 2024

First of all: I am actually also not sure whether the feature should be part of the bundle. If you decide not to include it because it would make the bundle overly complex, that is fine with me.


I had two use-cases in mind:

a) The command processes a large number of items (say 500,000). Some of them (10%, i.e. 50,000) fail in runSingleCommand(), for instance because an external API was offline for several minutes, or for some other reason. In the end, there should be a "retry" for the failed items. This could be done in the runAfterLastCommand() method (without parallelization), but it would be better to process these items in parallel as well.

b) When processing large datasets (e.g. an XML file containing 5,000,000 records/nodes), there could be limitations on extracting all the data in one single fetchItems() call. Another solution might be to extract only a subset of the XML's records (e.g. 500,000) at a time, and call fetchItems() until there are no records left.


  • a command that dispatches your message to the queue (does not need to be parallelized)
  • a command that consumes your queue messages
  • a supervisor which takes care of spawning the workers, restarting them and stopping them

This is something that works really well and the ecosystem for it is quite mature.

I just got curious: is there a specific technology / bundle that you suggest?


webmozart commented on May 26, 2024

a) The command processes a large number of items (say 500,000). Some of them (10%, i.e. 50,000) fail in runSingleCommand(), for instance because an external API was offline for several minutes, or for some other reason. In the end, there should be a "retry" for the failed items. This could be done in the runAfterLastCommand() method (without parallelization), but it would be better to process these items in parallel as well.

This is rather asking for a queue infrastructure with retry support. I.e., if an API call fails, you want to move the message back to the queue and retry it either immediately or after some delay. Depending on the strategy you choose, there are different implementation options. However, you don't really want to parallelize this.
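A self-contained sketch of that idea, with an in-memory array standing in for the real queue (all names here are illustrative, not part of the library): a failed item goes back onto the queue with an attempt counter, instead of being collected for a separate parallel retry pass.

```php
<?php
// "Retry via the queue, not via parallelization": each queue entry is
// [item, attempt]. On failure the item is re-queued until $maxAttempts
// is exhausted, then dead-lettered into the returned array.
function drainWithRetries(array $queue, callable $process, int $maxAttempts = 3): array
{
    $deadLettered = [];
    while ($queue) {
        [$item, $attempt] = array_shift($queue);
        try {
            $process($item);
        } catch (\RuntimeException $e) {
            if ($attempt + 1 < $maxAttempts) {
                $queue[] = [$item, $attempt + 1]; // re-queue for another try
            } else {
                $deadLettered[] = $item; // keep for manual inspection
            }
        }
    }
    return $deadLettered;
}
```

A real queue backend would add persistence and a delay between attempts, but the control flow is the same.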

Quite the contrary: I would keep calls to external APIs within a single process, so that you can control how many calls you make to the API and potentially throttle your requests so as not to be blocked for exceeding rate limits.
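A minimal sketch of that throttling, assuming a simple fixed-rate limit (`$callApi` is a stand-in for the real API call, not anything from this library):

```php
<?php
// Process items sequentially, sleeping between calls so that no more than
// $maxPerSecond requests are issued. This only works because everything
// runs in a single process.
function throttledMap(iterable $items, callable $callApi, float $maxPerSecond): array
{
    $minInterval = 1.0 / $maxPerSecond; // minimum seconds between calls
    $results = [];
    $lastCall = 0.0;
    foreach ($items as $item) {
        $wait = $minInterval - (microtime(true) - $lastCall);
        if ($wait > 0) {
            usleep((int) ($wait * 1_000_000));
        }
        $lastCall = microtime(true);
        $results[] = $callApi($item);
    }
    return $results;
}
```

Spread the same items across parallel worker processes and this guarantee disappears: each worker would need its own share of the rate limit, or a shared counter.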

b) When processing large datasets (e.g. an XML file containing 5,000,000 records/nodes), there could be limitations on extracting all the data in one single fetchItems() call. Another solution might be to extract only a subset of the XML's records (e.g. 500,000) at a time, and call fetchItems() until there are no records left.

Again there are different strategies, some of which could be supported by this trait and others which can't:

  1. You could parse the file and move all parsed entries to a queue, from where workers process them. There you don't need this trait: just fire up as many workers in parallel as you want. The queuing system will make sure each message is passed to only one of them.
  2. You could parse the file and stream the entries to parallel processes directly, without the queue in between. For that, Parallelization would need support for "uncountable" items, i.e. item streams whose total number is unknown because we never load them all into memory at once. That is technically possible (and might even make a lot of sense), except that we wouldn't know the number of batches and segments in advance and hence couldn't display a progress bar.
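To illustrate what such an "uncountable" item stream could look like: a generator that lazily pulls one record at a time out of a huge XML file with XMLReader, so a hypothetical generator-returning fetchItems() would never hold the whole file in memory. The `<record>` element name is an assumption.

```php
<?php
// Lazily yield the raw XML of each <record> element, one at a time.
// Nothing is materialized up front, so the total count is unknown.
function streamRecords(string $path): \Generator
{
    $reader = new \XMLReader();
    $reader->open($path);
    while ($reader->read()) {
        if ($reader->nodeType === \XMLReader::ELEMENT && $reader->name === 'record') {
            yield $reader->readOuterXml(); // hand the raw node to a worker
        }
    }
    $reader->close();
}
```

Each yielded string could then be written to a child process's stdin, which is essentially what streaming to parallel processes without a queue would mean.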


andreas-gruenwald commented on May 26, 2024

I agree with your reasoning. I am voting to implement strategy number 2.
Even with that strategy, the items are not always "uncountable": the total number might be known, or at least estimable. For instance, when dealing with a very large XML file, counting the root's child nodes (= the total number of entries) for the progress bar might be cheap, while extracting the nodes' data will be expensive. So there are cases where the progress bar limit can initially be set to a reasonable value, and probably other cases where the limit is unknown and will grow over time.
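The "cheap count" can be sketched with XMLReader as well: jump from sibling to sibling with next(), which skips over each record's subtree instead of parsing it. As before, the `<record>` element name is an assumption.

```php
<?php
// Count the root's <record> children without expanding their contents,
// to seed a progress bar before the expensive extraction starts.
function countRecords(string $path): int
{
    $reader = new \XMLReader();
    $reader->open($path);
    $count = 0;
    // Advance to the first <record> element (if any).
    while ($reader->read() && $reader->name !== 'record');
    while ($reader->name === 'record') {
        $count++;
        if (!$reader->next('record')) { // skip the subtree, go to next sibling
            break;
        }
    }
    $reader->close();
    return $count;
}
```

This still streams the file once, but it touches only element boundaries, which is why it can be much cheaper than the actual data extraction.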

