flipkart-incubator / databuilderframework
A data driven execution engine
I am creating a SimpleDataFlowExecutor with a custom DataBuilderFactory.
DataFlowExecutor executor = new SimpleDataFlowExecutor(myDataBuilderFactory);
However, when I run a dataflow with this executor the factory set in the constructor of DataFlowExecutor is not used.
The databuilderFactory of dataFlow takes precedence over the executor's builderFactory.
https://github.com/flipkart-incubator/databuilderframework/blob/master/src/main/java/com/flipkart/databuilderframework/engine/DataFlowExecutor.java#L57
public DataExecutionResponse run(DataFlow dataFlow, DataDelta dataDelta) throws DataBuilderFrameworkException, DataValidationException {
Preconditions.checkNotNull(dataFlow);
Preconditions.checkArgument(null != dataFlow.getDataBuilderFactory() || null != this.dataBuilderFactory);
return this.run(new DataBuilderContext(), new DataFlowInstance(), dataDelta, dataFlow, dataFlow.getDataBuilderFactory());
}
Suggestion: if the dataflow's builderFactory is null, the executor's factory can be used.
As things stand, whenever I have to run the dataflow, I have to explicitly set the databuilderFactory and then call run.
dataFlow.setDataBuilderFactory(myDatabuilderFactory);
result = executor.run(dataFlow, data);
Also, the default factory set in DataFlowBuilder is MixedDataBuilderFactory, so I can't really use DataFlowBuilder to create a DataFlow.
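The suggested fallback could look roughly like the sketch below. It is not the framework's code: BuilderFactory, Flow and Executor are hypothetical stand-ins for DataBuilderFactory, DataFlow and DataFlowExecutor, reduced to the one decision under discussion (use the flow's factory if set, otherwise the executor's).

```java
import java.util.Objects;

public class FactoryFallbackSketch {
    // Hypothetical stand-in for the framework's DataBuilderFactory.
    interface BuilderFactory {
        String name();
    }

    // Hypothetical stand-in for DataFlow; only the factory field matters here.
    static class Flow {
        private BuilderFactory factory;

        BuilderFactory getFactory() { return factory; }

        void setFactory(BuilderFactory factory) { this.factory = factory; }
    }

    // Hypothetical stand-in for the executor.
    static class Executor {
        private final BuilderFactory executorFactory;

        Executor(BuilderFactory executorFactory) {
            this.executorFactory = executorFactory;
        }

        // Prefer the flow's own factory; fall back to the executor's factory.
        BuilderFactory resolveFactory(Flow flow) {
            Objects.requireNonNull(flow, "flow");
            BuilderFactory resolved =
                    flow.getFactory() != null ? flow.getFactory() : executorFactory;
            if (resolved == null) {
                throw new IllegalArgumentException("No builder factory on flow or executor");
            }
            return resolved;
        }
    }

    public static void main(String[] args) {
        Executor executor = new Executor(() -> "executor-factory");
        Flow flow = new Flow(); // no factory set on the flow
        System.out.println(executor.resolveFactory(flow).name()); // prints executor-factory
    }
}
```

With a resolution step like this, the caller would not need to set the factory on the flow before every run.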
Let's assume that in flow #1 a builder produced data1, and that in a second execution the same builder runs and returns null. We are still able to access data1, which is now stale. Ideally, data1 should no longer be present in the available data after the second execution.
After the introduction of access, no builder needs to add the data it produces to its consumes list just for the sake of access. Hence this check should be removed: a databuilder flow should be capable of being a cyclic graph if needed, and this check prevents it.
Suppose a builder is stateful: it accesses and produces the same data, because its action depends on the previous state of that data.
Encountered a ConcurrentModificationException in a multi-threaded environment when attempting to get DataSet object by filtering the available data based on the accessibility criteria defined for a DataBuilder. The application uses a DataSet class to manage data and it is stored in availableData which is a Map<String, Data>. If the builderLevel is more than 1, we are experiencing this error.
To provide filtered views of this data, we use Guava's Maps.filterKeys alongside Predicates.in based on criteria from DataBuilderMeta. The exception suggests a concurrent modification issue, likely due to the iteration over a collection that is subject to concurrent changes. Given the Maps.filterKeys creates a live view of the original map, any concurrent modifications to the underlying map (such as additions or removals) may lead to this exception if the view is iterated over concurrently.
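One possible direction, sketched below under assumptions, is to hand each builder a filtered copy instead of a live view. The sketch assumes the available data can be held in a ConcurrentHashMap (whose reads never throw ConcurrentModificationException). The class and method names are hypothetical, and plain Object values stand in for Data.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class SnapshotFilterSketch {
    // Copies only the accessible entries into a new, independent map.
    // Lookups on a ConcurrentHashMap are safe while other threads write to it.
    static Map<String, Object> snapshot(ConcurrentHashMap<String, Object> available,
                                        Set<String> accessible) {
        Map<String, Object> copy = new HashMap<>();
        for (String key : accessible) {
            Object value = available.get(key);
            if (value != null) {
                copy.put(key, value);
            }
        }
        return copy;
    }

    public static void main(String[] args) {
        ConcurrentHashMap<String, Object> available = new ConcurrentHashMap<>();
        available.put("A", 1);
        available.put("B", 2);

        Map<String, Object> view = snapshot(available, Set.of("A", "C"));
        available.put("C", 3); // a later write does not affect the snapshot

        System.out.println(view); // prints {A=1}
    }
}
```

The trade-off is that a snapshot does not reflect data produced after it was taken, which may or may not match what a builder expects.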
Let us know whether this issue has been encountered before (and fixed in some other branch), or whether we are using these executors and DataSets incorrectly.
Executor code:
OptimizedMultiThreadedDataFlowExecutor executor = new OptimizedMultiThreadedDataFlowExecutor(Executors.newFixedThreadPool(10));
Source Code which triggered exception:
public DataSet getDataSet(DataBuilder builder) {
    Preconditions.checkNotNull(builder.getDataBuilderMeta(), "No metadata present in this builder");
    return new DataSet(
            Maps.filterKeys(
                    Utils.sanitize(dataSet.getAvailableData()),
                    Predicates.in(Utils.sanitize(builder.getDataBuilderMeta().getAccessibleDataSet()))));
}
Complete stack trace
java.base/java.util.HashMap$HashIterator.nextNode(HashMap.java:1510)
java.base/java.util.HashMap$EntryIterator.next(HashMap.java:1543)
java.base/java.util.HashMap$EntryIterator.next(HashMap.java:1541)
com.google.common.collect.Iterators.indexOf(Iterators.java:806)
com.google.common.collect.Iterators.any(Iterators.java:698)
com.google.common.collect.Iterables.any(Iterables.java:634)
com.google.common.collect.Collections2$FilteredCollection.isEmpty(Collections2.java:176)
com.google.common.collect.Maps$AbstractFilteredMap.isEmpty(Maps.java:2957)
com.flipkart.databuilderframework.engine.Utils.isEmpty(Utils.java:38)
com.flipkart.databuilderframework.engine.Utils.sanitize(Utils.java:54)
com.flipkart.databuilderframework.engine.DataBuilderContext.getDataSet(DataBuilderContext.java:42)
com.flipkart.databuilderframework.engine.OptimizedMultiThreadedDataFlowExecutor$BuilderRunner.call(OptimizedMultiThreadedDataFlowExecutor.java:249)
com.flipkart.databuilderframework.engine.OptimizedMultiThreadedDataFlowExecutor$BuilderRunner.call(OptimizedMultiThreadedDataFlowExecutor.java:206)
java.base/java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:264)
java.base/java.util.concurrent.FutureTask.run(FutureTask.java)
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
java.base/java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:264)
java.base/java.util.concurrent.FutureTask.run(FutureTask.java)
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
java.base/java.lang.Thread.run(Thread.java:834)
Should be able to access data in a builder in addition to those declared in the "consumes" list. For the builder all data mentioned in the "accesses" list will be access-only, non-mandatory and nullable.
As of now there are two methods by which we can get a dataset from DataBuilderContext:
getDataSet() (marked as @Deprecated)
getDataSet(DataBuilder builder)
While implementing DataBuilder#process(DataBuilderContext context) we need to get the dataset. Is context.getDataSet() the correct way to access it? (If yes, why is it marked as deprecated?)
I don't see a clean way to use getDataSet(DataBuilder builder) inside process: if I call getDataSet(this) there will be an NPE, as no dataBuilderMeta is set on the current instance.
getDataSet(DataBuilder builder) restricts access to the data classes mentioned in consumes, optional and access; in any case, this enforcement already happens in the executors when they call process.
Ref:
Either we have to set dataBuilderMeta when the builder is processed in withDataBuilder, with something like dataBuilder.setDataBuilderMeta(dataBuilderMeta), which sets it on the dataBuilder instance so that it can be accessed when processing, or the other method should be made non-deprecated. Or I may be missing something.
One of the primary use cases of a builder is state management of an entity that is represented as data. For this, a builder accesses the same data that it produces. If the builder exits prematurely for some reason (for example, an exception), a partial commit happens, which is not correct. IMO an immutable copy should ideally be passed to handle these scenarios.
Since builders are topologically sorted, the executor has a check that breaks out if the first builder in a given rank does not run.
But this is a bug: builders in the same rank can consume independent data. The first builder may expect Data A, produced by a builder above it, while the second builder does not depend on Data A at all.
These two builders end up in the same rank because the execution graph is built from the bottom up.
DataDelta is not cloned before passing the data to builders. If the builder produces the same data (that was present in DataDelta) after modification, it will result in changing the dataDelta content as well (as both are having the same reference).
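The aliasing described above can be avoided by copying the delta's contents before handing them to builders. The following is a minimal sketch, assuming each data type can provide a copy constructor. Item and deepCopy are hypothetical names and not part of the framework.

```java
import java.util.ArrayList;
import java.util.List;

public class DeltaCopySketch {
    // Hypothetical mutable data type with a copy constructor.
    static class Item {
        final String name;
        int value;

        Item(String name, int value) {
            this.name = name;
            this.value = value;
        }

        Item(Item other) {
            this(other.name, other.value);
        }
    }

    // Copies every element so that builders never share references with the caller's delta.
    static List<Item> deepCopy(List<Item> delta) {
        List<Item> copy = new ArrayList<>(delta.size());
        for (Item item : delta) {
            copy.add(new Item(item));
        }
        return copy;
    }

    public static void main(String[] args) {
        List<Item> delta = new ArrayList<>();
        delta.add(new Item("data1", 1));

        List<Item> working = deepCopy(delta);
        working.get(0).value = 42; // a builder modifies its own copy

        System.out.println(delta.get(0).value); // prints 1
    }
}
```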
Context variable is not sent in the execution listener. This could be sent as certain clients could have better use of this variable.
With the annotation DataBuilderClassInfo we get to specify classes rather than String names. But this uses the canonical name of the class by default. We need to provide the ability to customize this, and expose it to the client configuring DatabuilderMetaManager, so that they can choose their own logic, such as using simpleName rather than canonicalName.
Builder A produces some dataA.
No other builder has a dependency on dataA, neither in consumes nor in optionals. Because of the bottom-up graph construction, builder A doesn't appear in the execution graph.
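One way such a hook could be shaped is a pluggable function from class to name, defaulting to the canonical name. This is only a sketch: DataNameResolverSketch is a hypothetical class and does not reflect how DatabuilderMetaManager is actually configured.

```java
import java.util.Map;
import java.util.function.Function;

public class DataNameResolverSketch {
    // Strategy that turns a data class into the name used in builder metadata.
    private final Function<Class<?>, String> naming;

    DataNameResolverSketch(Function<Class<?>, String> naming) {
        this.naming = naming;
    }

    // Default behaviour described in the issue: use the canonical class name.
    static DataNameResolverSketch canonical() {
        return new DataNameResolverSketch(Class::getCanonicalName);
    }

    String nameOf(Class<?> dataClass) {
        return naming.apply(dataClass);
    }

    public static void main(String[] args) {
        DataNameResolverSketch byDefault = canonical();
        DataNameResolverSketch bySimpleName = new DataNameResolverSketch(Class::getSimpleName);

        System.out.println(byDefault.nameOf(Map.Entry.class));    // prints java.util.Map.Entry
        System.out.println(bySimpleName.nameOf(Map.Entry.class)); // prints Entry
    }
}
```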
I'm using this with dropwizard. How do I inject some of the dependent objects into the builder classes?
Can you suggest if the framework needs any changes to support this?
Thanks
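One approach that may work without framework changes is to construct builders through a custom factory that already holds the dependencies. The sketch below shows only the pattern, with constructor injection via registered suppliers. Builder, ServiceClient and InjectingFactory are hypothetical types, and the real DataBuilderFactory interface may have a different signature.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

public class InjectingFactorySketch {
    // Hypothetical stand-in for a data builder.
    interface Builder {
        String process();
    }

    // Hypothetical dependency, e.g. an object managed by the application.
    static class ServiceClient {
        String call() { return "response"; }
    }

    // A builder that receives its dependency through the constructor.
    static class ServiceCallBuilder implements Builder {
        private final ServiceClient client;

        ServiceCallBuilder(ServiceClient client) {
            this.client = client;
        }

        @Override
        public String process() {
            return client.call();
        }
    }

    // Factory that creates builders by name from registered suppliers.
    static class InjectingFactory {
        private final Map<String, Supplier<Builder>> suppliers = new HashMap<>();

        void register(String name, Supplier<Builder> supplier) {
            suppliers.put(name, supplier);
        }

        Builder create(String name) {
            Supplier<Builder> supplier = suppliers.get(name);
            if (supplier == null) {
                throw new IllegalArgumentException("Unknown builder: " + name);
            }
            return supplier.get();
        }
    }

    public static void main(String[] args) {
        ServiceClient client = new ServiceClient(); // created once by the application
        InjectingFactory factory = new InjectingFactory();
        factory.register("serviceCall", () -> new ServiceCallBuilder(client));

        System.out.println(factory.create("serviceCall").process()); // prints response
    }
}
```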
Our application primarily uses Databuilder to orchestrate downstream service calls. While using the multithreaded executor, we noticed that a lot of threads had to be created for Databuilder: the downstream service calls, running in their own thread pools, were blocking, and the builder was running out of threads, which sat in timed_wait. The situation worsens when the builder starts blocking controller threads.
The idea here is to have databuilder threads recyclable and reusable, such that they could leverage IO hand off to respective httpClient pools.
DataBuilderExecutor would need to implement an Observable-like interface that returns data when possible. Also, inside the executor each process method invocation is blocking. This should be reactive as well.
Parallelization, with its context-switch overhead, is beneficial when there is more than one builder in the same rank.
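As a rough illustration of the non-blocking shape described above, the sketch below uses the JDK's CompletableFuture rather than an Observable. AsyncBuilder and runRank are hypothetical and not part of the framework. Two builders in the same rank each return a future, and the results are combined without the calling thread waiting on either builder.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncBuilderSketch {
    // Hypothetical non-blocking builder: returns a future instead of blocking on I/O.
    interface AsyncBuilder {
        CompletableFuture<String> process(String input);
    }

    // Starts both builders of a rank and combines their results when both complete.
    static CompletableFuture<String> runRank(AsyncBuilder first, AsyncBuilder second, String input) {
        return first.process(input)
                .thenCombine(second.process(input), (a, b) -> a + "," + b);
    }

    public static void main(String[] args) {
        // Stands in for the I/O pool of an HTTP client.
        ExecutorService ioPool = Executors.newFixedThreadPool(2);
        try {
            AsyncBuilder a = in -> CompletableFuture.supplyAsync(() -> in + "-A", ioPool);
            AsyncBuilder b = in -> CompletableFuture.supplyAsync(() -> in + "-B", ioPool);

            // join() is used only to print the result in this demo.
            System.out.println(runRank(a, b, "req").join()); // prints req-A,req-B
        } finally {
            ioPool.shutdown();
        }
    }
}
```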