Comments (5)
Agreed. I'm unsure whether this should be part of 1.0 or 2.0, though.
I'm somewhat tempted to release the current RC as stable and to ship the changes I've introduced, together with these, as 2.0.
from console-parallelization.
However, we also want to use the ParallelizationTrait to process queue-like structures.
I'm not sure that's the appropriate usage. If you have a queue, then you are probably better off having:
- a command that dispatches your message to the queue (does not need to be parallelized)
- a command that consumes your queue messages
- a supervisor which takes care of spawning the workers, restarting them and stopping them
This is something that works really well and the ecosystem for it is quite mature. Is there anything specific for which you think this parallelization would help out more?
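That dispatcher/consumer/supervisor split can be sketched in a few lines. This is a language-agnostic illustration in Python, with an in-memory queue standing in for a real broker and all names invented for the example:

```python
import queue
import threading

def dispatch(q, messages):
    # Dispatcher: pushes messages onto the queue; this part needs no parallelism.
    for msg in messages:
        q.put(msg)

def consume(q, results):
    # Consumer worker: drains messages until the queue is empty.
    while True:
        try:
            msg = q.get_nowait()
        except queue.Empty:
            return
        results.append(msg * 2)  # stand-in for real message handling

q = queue.Queue()
dispatch(q, range(10))

# "Supervisor": spawns a fixed pool of workers and waits for them to finish.
results = []
workers = [threading.Thread(target=consume, args=(q, results)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(sorted(results))
```

In a real setup the supervisor would be something like supervisord or systemd restarting long-running consumer processes (e.g. Symfony Messenger's messenger:consume command), rather than a join loop.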
First of all: I'm actually not sure either whether this feature should be part of the bundle. If you decide not to include it because it would make the bundle overly complex, that's fine with me.
I had two use cases in mind:
a) The command processes a large number of items (let's say 500,000). Some of them fail in runSingleCommand() (10% → 50,000), for example because an external API was offline for several minutes, or for some other reason. In the end, there should be a "retry" for the failed items. This could be done in the runAfterLastCommand method (without parallelization), but it would be better to process these items in parallel as well.
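A hedged sketch of that retry idea (Python for illustration; process(), run_parallel() and the set of flaky items are invented stand-ins for runSingleCommand() and the failing external API):

```python
from concurrent.futures import ThreadPoolExecutor

def process(item, fail_on):
    # Stand-in for runSingleCommand(): fails for items in `fail_on`.
    if item in fail_on:
        raise RuntimeError(f"API offline for item {item}")
    return item

def run_parallel(items, fail_on):
    # One parallel pass over `items`; returns (succeeded, failed).
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [(i, pool.submit(process, i, fail_on)) for i in items]
        for item, future in futures:
            try:
                succeeded.append(future.result())
            except RuntimeError:
                failed.append(item)
    return succeeded, failed

items = list(range(20))
flaky = {3, 7, 11}                        # items that fail on the first pass
done, failed = run_parallel(items, flaky)
# Retry pass: reuse the same parallel machinery for just the failed items.
retried, still_failed = run_parallel(failed, set())
print(len(done) + len(retried), still_failed)
```

The point is that the retry pass goes through the same parallel path as the first pass, instead of a sequential cleanup in runAfterLastCommand.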
b) When processing large datasets (e.g. an XML file containing 5,000,000 records/nodes), there could be limitations on extracting all the data in a single fetchItems() call. An alternative would be to extract only a subset of the XML's records (e.g. 500,000) at a time and to call fetchItems() until no records are left.
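That chunked approach could look roughly like this (Python for illustration; fetch_items() is an invented stand-in with an offset/limit signature that the real fetchItems() does not necessarily have):

```python
def fetch_items(offset, limit):
    # Stand-in for fetchItems(): returns at most `limit` records from `offset`.
    TOTAL = 23  # pretend the XML file holds 23 records
    return list(range(offset, min(offset + limit, TOTAL)))

def all_items(chunk_size):
    # Call fetch_items() repeatedly until no records are left.
    offset = 0
    while True:
        chunk = fetch_items(offset, chunk_size)
        if not chunk:
            return
        yield from chunk
        offset += len(chunk)

records = list(all_items(chunk_size=5))
print(len(records))  # 23
```

Because all_items() is a generator, no more than one chunk of records needs to be materialized at a time.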
- a command that dispatches your message to the queue (does not need to be parallelized)
- a command that consumes your queue messages
- a supervisor which takes care of spawning the workers, restarting them and stopping them
This is something that works really well and the ecosystem for it is quite mature.
Just out of curiosity: is there a specific technology or bundle that you would suggest?
a) The command processes a large number of items (let's say 500,000). Some of them fail in runSingleCommand() (10% → 50,000), for example because an external API was offline for several minutes, or for some other reason. In the end, there should be a "retry" for the failed items. This could be done in the runAfterLastCommand method (without parallelization), but it would be better to process these items in parallel as well.
This is rather asking for a queue infrastructure with retry support. That is, if an API call fails, you want to move the message back to the queue and retry it either immediately or after some delay. Depending on which strategy you choose, there are different implementation options; however, you don't really want to parallelize this.
Quite the contrary: I would keep calls to external APIs within a single process, so that you can control how many calls you make to the API and throttle your requests in order not to be blocked for exceeding rate limits.
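A minimal sketch of such single-process throttling (Python for illustration; the Throttle class and call_api() are invented for the example):

```python
import time

class Throttle:
    # Enforces a minimum interval between calls (at most `rate` calls/second).
    def __init__(self, rate):
        self.interval = 1.0 / rate
        self.last = float("-inf")

    def wait(self):
        delay = self.last + self.interval - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        self.last = time.monotonic()

def call_api(item):
    return item  # stand-in for the external API call

throttle = Throttle(rate=200)  # cap at 200 requests per second
start = time.monotonic()
results = []
for i in range(10):
    throttle.wait()              # single process: trivial to meter the rate
    results.append(call_api(i))
elapsed = time.monotonic() - start
print(len(results))
```

With all API calls funneled through one process, the rate cap is a single counter; spread across parallel workers, the same guarantee would need shared coordination.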
b) When processing large datasets (e.g. an XML file containing 5,000,000 records/nodes), there could be limitations on extracting all the data in a single fetchItems() call. An alternative would be to extract only a subset of the XML's records (e.g. 500,000) at a time and to call fetchItems() until no records are left.
Again, there are different strategies, some of which could be supported by this trait and others that couldn't:
- You could parse the file and move all parsed entries to a queue, from where workers process them. You don't need this trait for that: just fire up as many workers in parallel as you want, and the queuing system will make sure each message is passed to only one of them.
- You could parse the file and stream the entries to parallel processes directly, without a queue in between. For that, Parallelization would need support for "uncountable" items, i.e. item streams whose total number is unknown because we never load them all into memory at once. That is technically possible (and might even make a lot of sense), except that we wouldn't know the number of batches and segments in advance and hence couldn't display a progress bar.
I agree with your reasoning. I vote for implementing strategy number 2.
Even with that strategy, the items are not always "uncountable": the total number might be known or estimable. For instance, when dealing with a very large XML file, counting the root's child nodes (= the total number of entries) for the progress bar might be cheap, while extracting the nodes' data will be expensive. So there are cases where the progress-bar limit can initially be set to a reasonable value, and probably other cases where the limit is unknown and will grow over time.
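That "cheap count, expensive extraction" split might look like this (Python for illustration; the two-pass structure is the point, not the API):

```python
import io
import xml.etree.ElementTree as ET

XML = (
    b"<root>"
    + b"".join(f"<record><v>{i}</v></record>".encode() for i in range(500))
    + b"</root>"
)

def count_records(data):
    # Cheap pass: count the root's <record> children without extracting data.
    return sum(
        1
        for event, elem in ET.iterparse(io.BytesIO(data), events=("start",))
        if elem.tag == "record"
    )

total = count_records(XML)  # sets the progress-bar limit up front

done = 0
values = []
for event, elem in ET.iterparse(io.BytesIO(XML), events=("end",)):
    if elem.tag == "record":
        values.append(int(elem.find("v").text))  # the expensive extraction
        done += 1  # advance the progress bar toward `total`
        elem.clear()

print(total, done)  # 500 500
```

The counting pass touches only tags, so it stays cheap even when per-record extraction is expensive; when counting is not cheap, the limit would have to be an estimate that grows over time, as described above.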
Related Issues (20)
- Auto-detect the number of processes
- Always use a sub-process; add no-parallel option
- Add options to change the segment size or batch size on the fly
- Add stop-on-failure option
- Configuration calculation may result in an exception if there is no items
- [2.x] Some resources are missing
- Limit the exit code to 255
- Incorrect error message
- Make it easier to override getParallelExecutableFactory()
- Add a utility decorator logger
- Add a logger to log the item handling to an arbitrary logger
- Error when $_SERVER['PWD'] is not set
- Remove or change input option alias for the number of processes
- Error when command is executed outside bin/ directory.
- Cannot install symfony/console dependency in Pimcore
- ItemBatchIterator | Implement as Iterable?
- Roadmap?
- Better solution for the "item" - argument?
- Pipe breaks when quotes are used in input options
- Plans for v2