flipkart-incubator / databuilderframework

A data driven execution engine

Java 100.00%

databuilderframework's Issues

DataFlowExecutor's databuilderFactory is not used

I am creating a SimpleDataFlowExecutor with a custom DataBuilderFactory.
DataFlowExecutor executor = new SimpleDataFlowExecutor(myDataBuilderFactory);
However, when I run a dataflow with this executor the factory set in the constructor of DataFlowExecutor is not used.
The databuilderFactory of dataFlow takes precedence over the executor's builderFactory.
https://github.com/flipkart-incubator/databuilderframework/blob/master/src/main/java/com/flipkart/databuilderframework/engine/DataFlowExecutor.java#L57

    public DataExecutionResponse run(DataFlow dataFlow, DataDelta dataDelta) throws DataBuilderFrameworkException, DataValidationException {
        Preconditions.checkNotNull(dataFlow);
        Preconditions.checkArgument(null != dataFlow.getDataBuilderFactory() || null != this.dataBuilderFactory);
        return this.run(new DataBuilderContext(), new DataFlowInstance(), dataDelta, dataFlow, dataFlow.getDataBuilderFactory());
    }

Suggestion: If the dataflow's builderFactory is null, the executor's factory can be used.
As it stands, whenever I have to run the dataflow, I have to explicitly set the databuilderFactory and then call run.

            dataFlow.setDataBuilderFactory(myDatabuilderFactory);
            result = executor.run(dataFlow, data);

Also, the default factory set in DataFlowBuilder is MixedDataBuilderFactory, so I can't really use DataFlowBuilder to create a DataFlow.
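
The suggested fallback could look roughly like the sketch below. `BuilderFactory` and `FactoryFallback` are hypothetical stand-ins, not framework classes; only the selection logic is the point.

```java
import java.util.Objects;

// Hypothetical stand-in for the framework's factory type.
interface BuilderFactory {}

class FactoryFallback {
    /** Prefers the data flow's factory and falls back to the executor's when it is null. */
    static BuilderFactory resolve(BuilderFactory flowFactory, BuilderFactory executorFactory) {
        BuilderFactory chosen = (flowFactory != null) ? flowFactory : executorFactory;
        return Objects.requireNonNull(chosen, "no DataBuilderFactory on data flow or executor");
    }
}
```

With this ordering the executor's factory is used only when the flow has none, so a flow built with a non-null default factory would still bypass it.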

Readme.md sample image

Null returned from data builder is not nullifying the previously generated data.

Let's assume that in flow#1 a builder produced data1. In a second execution the same builder runs again and this time returns null. We are still able to access data1, which is now stale. Ideally data1 should not be present in the available data after the second run.
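
One way to express the expected behaviour, as a sketch only: treat a null result as a removal. The map-based store and the `record` helper are assumptions made for illustration, not the framework's internals.

```java
import java.util.Map;

class StaleDataSketch {
    /**
     * Records a builder's output for this run. A null result removes whatever
     * an earlier run stored under the same key instead of leaving it behind.
     */
    static void record(Map<String, Object> availableData, String producedKey, Object result) {
        if (result == null) {
            availableData.remove(producedKey);
        } else {
            availableData.put(producedKey, result);
        }
    }
}
```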

ProcessedBuilders check removal

Since the introduction of accesses, no builder needs to add the data it produces to its consumes list just to gain access to it. Hence this check should be removed: a databuilder flow should be capable of being a cyclic graph if needed, and this check prevents that.

Making accesses mandatory for running a builder defeats the purpose of statefulness of a builder

Suppose a builder is stateful: it accesses the same data that it produces, because its action depends on the previous state of that data.

Builder whose data is not consumed is left out of the execution graph (because of bottom-up building of the graph)

ConcurrentModificationExceptions when using Optimised Multithread Executors

Encountered a ConcurrentModificationException in a multi-threaded environment when attempting to get DataSet object by filtering the available data based on the accessibility criteria defined for a DataBuilder. The application uses a DataSet class to manage data and it is stored in availableData which is a Map<String, Data>. If the builderLevel is more than 1, we are experiencing this error.

To provide filtered views of this data, we use Guava's Maps.filterKeys alongside Predicates.in based on criteria from DataBuilderMeta. The exception suggests a concurrent modification issue, likely due to the iteration over a collection that is subject to concurrent changes. Given the Maps.filterKeys creates a live view of the original map, any concurrent modifications to the underlying map (such as additions or removals) may lead to this exception if the view is iterated over concurrently.

Let us know if this issue was encountered earlier (and fixed in some other branch), or if we are using these executors and data sets wrongly.
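
For illustration, one possible way to avoid iterating a shared map through a live view is to copy the accessible entries by point lookup. This is a sketch, not the framework's fix: it assumes the available data could be held in a `ConcurrentMap`, and `FilteredSnapshot` is a hypothetical helper.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentMap;

class FilteredSnapshot {
    /**
     * Copies only the accessible entries into a new map, so the caller never
     * iterates the shared map and later writes to it cannot affect the result.
     */
    static Map<String, Object> accessibleCopy(ConcurrentMap<String, Object> shared,
                                              Set<String> accessible) {
        Map<String, Object> copy = new HashMap<>();
        for (String key : accessible) {
            Object value = shared.get(key); // point lookup, safe on a ConcurrentMap
            if (value != null) {
                copy.put(key, value);
            }
        }
        return copy;
    }
}
```

The trade-off is that the builder sees a snapshot taken at call time rather than data added by sibling builders afterwards.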

Executor code:
OptimizedMultiThreadedDataFlowExecutor executor = new OptimizedMultiThreadedDataFlowExecutor(Executors.newFixedThreadPool(10));

Source Code which triggered exception:

    public DataSet getDataSet(DataBuilder builder) {
        Preconditions.checkNotNull(builder.getDataBuilderMeta(), "No metadata present in this builder");
        return new DataSet(
                Maps.filterKeys(
                        Utils.sanitize(dataSet.getAvailableData()),
                        Predicates.in(Utils.sanitize(builder.getDataBuilderMeta().getAccessibleDataSet()))));
    }

Complete stack trace

    java.base/java.util.HashMap$HashIterator.nextNode(HashMap.java:1510)
    java.base/java.util.HashMap$EntryIterator.next(HashMap.java:1543)
    java.base/java.util.HashMap$EntryIterator.next(HashMap.java:1541)
    com.google.common.collect.Iterators.indexOf(Iterators.java:806)
    com.google.common.collect.Iterators.any(Iterators.java:698)
    com.google.common.collect.Iterables.any(Iterables.java:634)
    com.google.common.collect.Collections2$FilteredCollection.isEmpty(Collections2.java:176)
    com.google.common.collect.Maps$AbstractFilteredMap.isEmpty(Maps.java:2957)
    com.flipkart.databuilderframework.engine.Utils.isEmpty(Utils.java:38)
    com.flipkart.databuilderframework.engine.Utils.sanitize(Utils.java:54)
    com.flipkart.databuilderframework.engine.DataBuilderContext.getDataSet(DataBuilderContext.java:42)
    com.flipkart.databuilderframework.engine.OptimizedMultiThreadedDataFlowExecutor$BuilderRunner.call(OptimizedMultiThreadedDataFlowExecutor.java:249)
    com.flipkart.databuilderframework.engine.OptimizedMultiThreadedDataFlowExecutor$BuilderRunner.call(OptimizedMultiThreadedDataFlowExecutor.java:206)
    java.base/java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:264)
    java.base/java.util.concurrent.FutureTask.run(FutureTask.java)
    java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    java.base/java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:264)
    java.base/java.util.concurrent.FutureTask.run(FutureTask.java)
    java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    java.base/java.lang.Thread.run(Thread.java:834)

Accessible datas

Should be able to access data in a builder in addition to those declared in the "consumes" list. For the builder all data mentioned in the "accesses" list will be access-only, non-mandatory and nullable.

Proper way to access dataSet from DataBuilderContext

At the moment there are two methods for getting the data set from DataBuilderContext:

getDataSet() (marked as @Deprecated)
getDataSet(DataBuilder builder)

While implementing DataBuilder#process(DataBuilderContext context) we need to get the data set. Isn't context.getDataSet() the correct way to access it? (If yes, why is it marked as deprecated?)
I don't see a clean way to use getDataSet(DataBuilder builder) inside process: if I call getDataSet(this) there will be an NPE, as no dataBuilderMeta is set on the current instance.

getDataSet(DataBuilder builder) restricts access to the data classes mentioned in consumes, optional and access. In any case, this enforcement already happens in the executors when they call process.
Ref:

Either dataBuilderMeta has to be set in withDataBuilder, with something like dataBuilder.setDataBuilderMeta(dataBuilderMeta), so that it is set on the dataBuilder instance and can be accessed during processing, or the other method should be made non-deprecated, or I must be missing something 😅

DataSet Passed to builder should be immutable

One of the primary use cases of a builder is state management of an entity that is represented as data. For this, a builder accesses the same data that it produces. If the builder exits prematurely for some reason (an exception, for example), a partial commit happens, which is not correct. IMO an immutable copy should ideally be passed to handle such scenarios.
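
A minimal sketch of the idea, assuming a plain map stands in for the data set and a `Function` stands in for the builder (both hypothetical): the builder receives a read-only snapshot, and its output is merged only if it returns normally.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class CopyOnRunSketch {
    /**
     * Runs a builder against a read-only snapshot and merges its output into
     * the live state only if it returns normally. The snapshot is shallow, so
     * the values themselves must be immutable (Strings here).
     */
    static void runBuilder(Map<String, String> state,
                           Function<Map<String, String>, Map<String, String>> builder) {
        Map<String, String> snapshot = Collections.unmodifiableMap(new HashMap<>(state));
        Map<String, String> produced = builder.apply(snapshot); // may throw; 'state' is untouched
        state.putAll(produced);
    }
}
```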

Builders of the same rank terminate when first one does not run

Since builders are topologically sorted, there is a check in the builder executor that breaks out of the loop if the first builder in a given rank does not run.
But this is a bug: builders in the same rank can consume independent data, so the first builder may expect Data A, produced by a builder above it, while the second builder does not depend on Data A at all.
These two builders end up in the same rank because the execution graph is built from the bottom up.
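
The difference between stopping the rank and skipping a single builder can be sketched as follows. `Builder` and `runnable` are hypothetical and much simpler than the executor's real loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class RankLoopSketch {
    /** Hypothetical builder description: a name plus the data it must consume. */
    static final class Builder {
        final String name;
        final Set<String> consumes;
        Builder(String name, Set<String> consumes) {
            this.name = name;
            this.consumes = consumes;
        }
    }

    /** Returns the names of the builders in one rank whose inputs are all available. */
    static List<String> runnable(List<Builder> rank, Set<String> available) {
        List<String> ran = new ArrayList<>();
        for (Builder b : rank) {
            if (!available.containsAll(b.consumes)) {
                continue; // skip only this builder; a 'break' here would starve its siblings
            }
            ran.add(b.name);
        }
        return ran;
    }
}
```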

DataDelta content is getting changed if builder is producing the same data that is present in dataDelta.

DataDelta is not cloned before passing the data to builders. If the builder produces the same data (that was present in DataDelta) after modification, it will result in changing the dataDelta content as well (as both are having the same reference).
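
As a sketch of the cloning being asked for, assuming each data item can offer a copy constructor (the `Item` type below is hypothetical): the executor would hand builders a deep copy, so changes made by a builder never reach the caller's delta.

```java
import java.util.ArrayList;
import java.util.List;

class DeltaCopySketch {
    /** Hypothetical mutable data item with a copy constructor. */
    static final class Item {
        final String name;
        int value;
        Item(String name, int value) {
            this.name = name;
            this.value = value;
        }
        Item(Item other) {
            this(other.name, other.value);
        }
    }

    /** Deep-copies the delta so builders never hold references to the caller's objects. */
    static List<Item> copyOf(List<Item> delta) {
        List<Item> copy = new ArrayList<>(delta.size());
        for (Item item : delta) {
            copy.add(new Item(item));
        }
        return copy;
    }
}
```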

DataBuilderContext not accessible in DataBuilderExecutionListener

The context variable is not passed to the execution listener. It could be passed, since certain clients could make good use of it.

DataBuilderClassInfo - customize Builder Name

The DataBuilderClassInfo annotation lets us specify classes rather than String names, but by default the canonical name of the class is used. We need a way to customize this and to expose it to the client configuring DatabuilderMetaManager, so that they can apply their own logic, such as using simpleName rather than canonicalName.
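
One shape such a hook could take is a pluggable naming function. `BuilderNaming` and its members are hypothetical and not part of the framework.

```java
import java.util.function.Function;

class BuilderNaming {
    /** Mirrors the default described above: the canonical class name. */
    static final Function<Class<?>, String> CANONICAL = Class::getCanonicalName;
    /** An alternative a client might plug in. */
    static final Function<Class<?>, String> SIMPLE = Class::getSimpleName;

    /** Resolves a builder name with the supplied strategy. */
    static String nameOf(Class<?> builderClass, Function<Class<?>, String> strategy) {
        return strategy.apply(builderClass);
    }
}
```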

Builder whose data is not consumed is left out of the execution graph (because of bottom-up building of the graph)

Builder A produces some dataA.
No other builder has a dependency on dataA, neither in consumes nor in optionals. Because of the bottom-up graph construction, builder A doesn't end up in the execution graph.
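
The effect can be reproduced with a small sketch of bottom-up construction. The maps and names below are assumptions made for illustration, not the framework's data structures.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class BottomUpGraphSketch {
    /**
     * Walks from the target data back through producers. 'producerOf' maps a data
     * name to the builder producing it; 'consumes' maps a builder to its inputs.
     * Only builders reachable from the target are returned.
     */
    static Set<String> buildersFor(String target,
                                   Map<String, String> producerOf,
                                   Map<String, Set<String>> consumes) {
        Set<String> included = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(target);
        while (!pending.isEmpty()) {
            String builder = producerOf.get(pending.pop());
            if (builder == null || !included.add(builder)) {
                continue; // externally supplied data, or builder already visited
            }
            for (String input : consumes.getOrDefault(builder, Collections.emptySet())) {
                pending.push(input);
            }
        }
        return included;
    }
}
```

A builder whose output nobody consumes is never reached by this walk, which matches the behaviour reported here.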

Guice support

I'm using this with dropwizard. How do I inject some of the dependent objects into the builder classes?
Can you suggest if the framework needs any changes to support this?

Thanks

EnhancementRequest - ReactiveExecutor

Our application uses Databuilder primarily to orchestrate downstream service calls. While using the multithreaded executor we noticed that a lot of threads had to be created for Databuilder: the downstream service clients were blocking in their own thread pools, and the builder was running out of threads, which were stuck in timed_wait. The situation worsens when the builder starts blocking controller threads.

The idea here is to make databuilder threads recyclable and reusable, so that they can hand IO off to the respective httpClient pools.

DataBuilderExecutor would need to implement an Observable kind of interface, returning data when possible. Also, inside a builder each invocation of the process method is blocking. This should be reactive as well.
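
As a rough illustration of the non-blocking style, the sketch below uses the JDK's `CompletableFuture` in place of an Observable interface. `AsyncBuilderSketch` is hypothetical, and the downstream call is assumed to already return a future.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

class AsyncBuilderSketch {
    /**
     * Chains a builder step onto an in-flight downstream call. No thread waits:
     * the step runs when the call's future completes.
     */
    static <I, O> CompletableFuture<O> whenAvailable(CompletableFuture<I> downstreamCall,
                                                     Function<I, O> builderStep) {
        return downstreamCall.thenApply(builderStep);
    }
}
```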

MultiThreadedExecutor uses the executor even when there is only one builder in the respective rank.

Parallelization, with its context-switch overhead, is beneficial only if there is more than one builder in the same rank.
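
A sketch of the proposed shortcut, with hypothetical names and `Callable` standing in for a builder invocation: a rank with a single task runs on the calling thread, and only larger ranks go to the pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

class RankDispatchSketch {
    /** Runs a rank's tasks; a rank with a single task runs on the calling thread. */
    static <T> List<T> runRank(List<Callable<T>> rank, ExecutorService pool) throws Exception {
        List<T> results = new ArrayList<>();
        if (rank.size() == 1) {
            results.add(rank.get(0).call()); // no hand-off to the pool, no context switch
            return results;
        }
        for (Future<T> future : pool.invokeAll(rank)) {
            results.add(future.get());
        }
        return results;
    }
}
```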
