nirmata / workflow

A ZooKeeper and Curator based distributed workflow management library that enables distributed task workflows.

Home Page: http://nirmata.github.io/workflow

License: Apache License 2.0

Java 99.75% CSS 0.25%

workflow curator zookeeper java

workflow's Introduction

Nirmata Workflow

Full details here: http://nirmata.github.io/workflow

Changes

Change log here: https://github.com/Nirmata/workflow/blob/master/CHANGELOG.md

workflow's Issues

uneven workflow distribution

I am using 3 Workflow Managers distributed on 3 EC2 instances. The Workflow Managers belong to the same namespace and they execute the same type of task. The workflow managers have the same configuration (task executor, task type, quantity, ...).
I observed that the number of tasks executed on each instance is significantly and consistently different. For instance:
Tasks executed by workflowManager 1: 7684
Tasks executed by workflowManager 2: 12120
Tasks executed by workflowManager 3: 15296

I reproduced these results using 3 Workflow Managers running in the same JVM and using the test server (see attached file). I used nirmata-workflow-0.5.1 & org.apache.curator:curator-test:2.8.0 for this test.
WorkflowTester.java.txt
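
To make the imbalance easier to compare against an even split (about 33.3% per manager), the counts above can be turned into percentage shares. This is a minimal sketch; the class and method names are illustrative and not part of the library. For the figures reported here it gives roughly 22%, 35% and 44%.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TaskShareReport {
    /** Returns each manager's share of the total task count, in percent. */
    static Map<String, Double> shares(Map<String, Long> counts) {
        long total = counts.values().stream().mapToLong(Long::longValue).sum();
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            // Guard against an empty report to avoid dividing by zero.
            result.put(e.getKey(), total == 0 ? 0.0 : 100.0 * e.getValue() / total);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("workflowManager 1", 7684L);
        counts.put("workflowManager 2", 12120L);
        counts.put("workflowManager 3", 15296L);
        shares(counts).forEach((name, pct) -> System.out.printf("%s: %.1f%%%n", name, pct));
    }
}
```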

SchedulerSelector throws IllegalStateException on exiting the runLoop

I'm very new to ZK/Curator and was just executing your example.
It seems that your SchedulerSelector tries to take leadership after curator is closed.
Maybe a proper closing of the SchedulerSelector should be added to the usage example, or are these 39 errors intended?

Getting the following 39 IllegalStateExceptions while executing your example:

2023-09-12 09:40:57,206 INFO  com.nirmata.workflow.queue.zookeeper.SimpleQueue [Curator-SimpleQueue-0] - Exiting runLoop
2023-09-12 09:40:57,206 INFO  com.nirmata.workflow.queue.zookeeper.SimpleQueue [Curator-SimpleQueue-0] - Exiting runLoop
2023-09-12 09:40:57,206 INFO  com.nirmata.workflow.queue.zookeeper.SimpleQueue [Curator-SimpleQueue-0] - Exiting runLoop
2023-09-12 09:40:57,206 INFO  com.nirmata.workflow.queue.zookeeper.SimpleQueue [Curator-SimpleQueue-0] - Exiting runLoop
2023-09-12 09:40:57,206 INFO  com.nirmata.workflow.queue.zookeeper.SimpleQueue [Curator-SimpleQueue-0] - Exiting runLoop
2023-09-12 09:40:57,206 INFO  com.nirmata.workflow.queue.zookeeper.SimpleQueue [Curator-SimpleQueue-0] - Exiting runLoop
2023-09-12 09:40:57,207 INFO  com.nirmata.workflow.queue.zookeeper.SimpleQueue [Curator-SimpleQueue-0] - Exiting runLoop
2023-09-12 09:40:57,207 INFO  com.nirmata.workflow.queue.zookeeper.SimpleQueue [Curator-SimpleQueue-0] - Exiting runLoop
2023-09-12 09:40:57,207 INFO  com.nirmata.workflow.queue.zookeeper.SimpleQueue [Curator-SimpleQueue-0] - Exiting runLoop
2023-09-12 09:40:57,206 INFO  com.nirmata.workflow.queue.zookeeper.SimpleQueue [Curator-SimpleQueue-0] - Exiting runLoop
2023-09-12 09:40:57,207 INFO  org.apache.curator.framework.imps.CuratorFrameworkImpl [Curator-Framework-0] - backgroundOperationsLoop exiting

// the following exception is logged 38 times
2023-09-12 09:40:57,208 ERROR org.apache.curator.framework.imps.CuratorFrameworkImpl [Curator-LeaderSelector-0] - Background exception was not retry-able or retry gave up
java.lang.IllegalStateException: Client is not started
                at org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:502) ~[curator-client-5.5.0.jar:5.5.0]
                at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:139) ~[curator-client-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:649) ~[curator-framework-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.imps.NamespaceFacade.getZooKeeper(NamespaceFacade.java:115) ~[curator-framework-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.imps.RemoveWatchesBuilderImpl.performBackgroundOperation(RemoveWatchesBuilderImpl.java:337) ~[curator-framework-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.imps.OperationAndData.callPerformBackgroundOperation(OperationAndData.java:84) ~[curator-framework-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:1008) ~[curator-framework-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:667) ~[curator-framework-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.imps.NamespaceFacade.processBackgroundOperation(NamespaceFacade.java:121) ~[curator-framework-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.imps.RemoveWatchesBuilderImpl.pathInBackground(RemoveWatchesBuilderImpl.java:229) ~[curator-framework-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.imps.RemoveWatchesBuilderImpl.internalRemoval(RemoveWatchesBuilderImpl.java:84) ~[curator-framework-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.imps.WatcherRemovalManager.removeWatchers(WatcherRemovalManager.java:67) ~[curator-framework-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.imps.WatcherRemovalFacade.removeWatchers(WatcherRemovalFacade.java:68) ~[curator-framework-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.recipes.cache.PathChildrenCache.close(PathChildrenCache.java:384) ~[curator-recipes-5.5.0.jar:5.5.0]
                at org.apache.curator.shaded.com.google.common.io.Closeables.close(Closeables.java:79) ~[curator-client-5.5.0.jar:5.5.0]
                at org.apache.curator.utils.CloseableUtils.closeQuietly(CloseableUtils.java:59) ~[curator-client-5.5.0.jar:5.5.0]
                at com.nirmata.workflow.details.Scheduler.run(Scheduler.java:173) ~[nirmata-workflow-0.9.7.jar:?]
                at com.nirmata.workflow.details.SchedulerSelector.takeLeadership(SchedulerSelector.java:97) ~[nirmata-workflow-0.9.7.jar:?]
                at com.nirmata.workflow.details.SchedulerSelector.access$000(SchedulerSelector.java:33) ~[nirmata-workflow-0.9.7.jar:?]
                at com.nirmata.workflow.details.SchedulerSelector$1.takeLeadership(SchedulerSelector.java:55) ~[nirmata-workflow-0.9.7.jar:?]
                at org.apache.curator.framework.recipes.leader.LeaderSelector$WrappedListener.takeLeadership(LeaderSelector.java:594) ~[curator-recipes-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.recipes.leader.LeaderSelector.doWork(LeaderSelector.java:452) ~[curator-recipes-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.recipes.leader.LeaderSelector.doWorkLoop(LeaderSelector.java:506) ~[curator-recipes-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.recipes.leader.LeaderSelector.access$200(LeaderSelector.java:66) ~[curator-recipes-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:251) ~[curator-recipes-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:244) ~[curator-recipes-5.5.0.jar:5.5.0]
                at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
                at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
                at java.lang.Thread.run(Thread.java:833) ~[?:?]
…
2023-09-12 09:40:57,223 INFO  com.nirmata.workflow.details.SchedulerSelector [Curator-LeaderSelector-0] - cab-wsv-demaawo is no longer the scheduler
2023-09-12 09:40:57,224 ERROR org.apache.curator.framework.recipes.leader.LeaderSelector [Curator-LeaderSelector-0] - The leader threw an exception
java.lang.IllegalStateException: Expected state [STARTED] was [STOPPED]
                at org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:821) ~[curator-client-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkState(CuratorFrameworkImpl.java:457) ~[curator-framework-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.imps.CuratorFrameworkImpl.delete(CuratorFrameworkImpl.java:477) ~[curator-framework-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.recipes.locks.LockInternals.deleteOurPath(LockInternals.java:348) ~[curator-recipes-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.recipes.locks.LockInternals.releaseLock(LockInternals.java:125) ~[curator-recipes-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.recipes.locks.InterProcessMutex.release(InterProcessMutex.java:154) ~[curator-recipes-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.recipes.leader.LeaderSelector.doWork(LeaderSelector.java:477) ~[curator-recipes-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.recipes.leader.LeaderSelector.doWorkLoop(LeaderSelector.java:506) ~[curator-recipes-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.recipes.leader.LeaderSelector.access$200(LeaderSelector.java:66) ~[curator-recipes-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:251) ~[curator-recipes-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:244) ~[curator-recipes-5.5.0.jar:5.5.0]
                at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
                at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
                at java.lang.Thread.run(Thread.java:833) ~[?:?]
2023-09-12 09:40:57,313 INFO  org.apache.zookeeper.ZooKeeper [main] - Session: 0x1008ba19a880016 closed
2023-09-12 09:40:57,313 INFO  org.apache.zookeeper.ClientCnxn [main-EventThread] - EventThread shut down for session: 0x1008ba19a880016

Process finished with exit code 0

ZK server 3.6.4
ZK clients tested 3.6.4, 3.7.1, 3.8.2, 3.9.0
curator-recipes 5.2.0, 5.5.0

Providing more debugging for the scheduler

Add additional debugging and possible status for the scheduler.

Potential thread leak

I noticed that one of my processes is leaking threads. It could be due to not always closing the WorkflowManager. Is there any other possibility? Here is an example from a thread dump:

"SimpleQueue-0" #27573 daemon prio=5 os_prio=0 tid=0x00007f1068a92800 nid=0x6bc8 waiting on condition [0x00007f0f61d5c000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000ca6ff380> (a java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
at com.nirmata.workflow.queue.zookeeper.SimpleQueue.runLoop(SimpleQueue.java:197)
at com.nirmata.workflow.queue.zookeeper.SimpleQueue$$Lambda$17/1746711015.run(Unknown Source)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Locked ownable synchronizers:
- <0x00000000bb5b1ab8> (a java.util.concurrent.ThreadPoolExecutor$Worker)

mvn clean install command failing to execute

While trying to set up a development environment, the Maven build fails. The error says "gpg: no default secret key: secret key not available". After checking, I found that the maven-gpg-plugin's passphrase attribute uses the ${gpg.passphrase} property placeholder, which is not defined in the properties section of the pom.xml file. Is there something I am missing?

Upgrade to Curator 3.2.0

support ZK 3.5.x for container nodes

Pluggable serialization

Right now, serialization of state is via json, which is easy to read. It would be nice to control the serialization as this affects the size of state in zookeeper. For example, if there was an interface to implement, which could alternatively use a thrift codec to marshal task and run data.
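
A minimal sketch of what such an interface could look like. The names StateSerializer and Utf8StringSerializer are hypothetical and do not exist in the library; the string codec merely stands in for a JSON or Thrift implementation.

```java
import java.nio.charset.StandardCharsets;

class SerializationSketch {
    /** Hypothetical plug-in point; not part of the library's API. */
    interface StateSerializer<T> {
        byte[] serialize(T value);

        T deserialize(byte[] bytes);
    }

    /** Trivial codec for strings, standing in for a JSON or Thrift codec. */
    static class Utf8StringSerializer implements StateSerializer<String> {
        @Override
        public byte[] serialize(String value) {
            return value.getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public String deserialize(byte[] bytes) {
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        StateSerializer<String> serializer = new Utf8StringSerializer();
        byte[] stored = serializer.serialize("{\"taskId\":\"t1\"}");
        // The byte length is what would count against the znode size.
        System.out.println(stored.length + " bytes: " + serializer.deserialize(stored));
    }
}
```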

New feature request - Allow custom workflow task executors

Currently workflow task executor threads are launched from the thread that schedules the workflow.

This created a situation, where when a workflow thread was created from a thread handling a user operation, it inherited security context data (we are using Apache Shiro which uses an InheritableThreadLocal) from the user operation thread. Since the workflow task executors are subsequently reused, this can lead to problems due to cached thread local data.

A solution would be to allow passing in a custom ExecutorService, so users can implement their own logic in ThreadPoolExecutor.beforeExecute and / or ThreadPoolExecutor.afterExecute.
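
A minimal sketch of such an executor, using only java.util.concurrent. CONTEXT is a stand-in for the Shiro security context (an assumption for illustration), and the class names are hypothetical. beforeExecute runs on the worker thread, so it can drop the value the worker inherited when it was created; afterExecute drops anything the task itself set.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class ContextClearingExecutorSketch {
    /** Stand-in for a security context held in an InheritableThreadLocal. */
    static final InheritableThreadLocal<String> CONTEXT = new InheritableThreadLocal<>();

    static class ContextClearingExecutor extends ThreadPoolExecutor {
        ContextClearingExecutor(int threads) {
            super(threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        }

        @Override
        protected void beforeExecute(Thread t, Runnable r) {
            CONTEXT.remove(); // drop the value inherited from the creating thread
            super.beforeExecute(t, r);
        }

        @Override
        protected void afterExecute(Runnable r, Throwable t) {
            super.afterExecute(r, t);
            CONTEXT.remove(); // drop anything the task set, so it is not cached
        }
    }

    /** Returns the context a task observes when submitted from a thread that has one set. */
    static String contextSeenByTask() {
        CONTEXT.set("user-operation-context");
        ContextClearingExecutor executor = new ContextClearingExecutor(1);
        try {
            Callable<String> readContext = CONTEXT::get;
            return executor.submit(readContext).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
            CONTEXT.remove();
        }
    }

    public static void main(String[] args) {
        System.out.println("context seen by task: " + contextSeenByTask());
    }
}
```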

workflow client and/or curator client lose connection to zookeeper

The exceptions seen on the ZooKeeper side are:

error 5880416
2014-12-17 01:49:00,688 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.10.64.50:38118 which had sessionid 0x34a4c0530500002
2014-12-17 01:49:02,424 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.10.64.50:38123
2014-12-17 01:49:02,431 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client attempting to renew session 0x34a4c0530500002 at /10.10.64.50:38123
2014-12-17 01:49:02,431 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@108] - Revalidating client: 0x34a4c0530500002
2014-12-17 01:49:02,431 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@617] - Established session 0x34a4c0530500002 with negotiated timeout 40000 for client /10.10.64.50:38123
2014-12-17 01:49:02,509 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x34a4c0530500002 due to java.io.IOException: Len error 5880416
2014-12-17 01:49:02,509 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.10.64.50:38123 which had sessionid 0x34a4c0530500002
2014-12-17 01:49:04,619 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.10.64.50:38133
2014-12-17 01:49:04,626 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client attempting to renew session 0x34a4c0530500002 at /10.10.64.50:38133
2014-12-17 01:49:04,626 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@108] - Revalidating client: 0x34a4c0530500002
2014-12-17 01:49:04,627 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@617] - Established session 0x34a4c0530500002 with negotiated timeout 40000 for client /10.10.64.50:38133
2014-12-17 01:49:04,697 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x34a4c0530500002 due to java.io.IOException: Len error 5880416
2014-12-17 01:49:04,697 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.10.64.50:38133 which had sessionid 0x34a4c0530500002
2014-12-17 01:49:06,963 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.10.64.50:38139
2014-12-17 01:49:06,971 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client attempting to renew session 0x34a4c0530500002 at /10.10.64.50:38139
2014-12-17 01:49:06,971 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@108] - Revalidating client: 0x34a4c0530500002
2014-12-17 01:49:06,972 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@617] - Established session 0x34a4c0530500002 with negotiated timeout 40000 for client /10.10.64.50:38139
2014-12-17 01:49:07,040 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x34a4c0530500002 due to java.io.IOException: Len error 5880416
2014-12-17 01:49:07,040 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.10.64.50:38139 which had sessionid 0x34a4c0530500002
2014-12-17 01:49:08,762 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.10.64.50:38143
2014-12-17 01:49:08,770 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client attempting to renew session 0x34a4c0530500002 at /10.10.64.50:38143
2014-12-17 01:49:08,770 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@108] - Revalidating client: 0x34a4c0530500002
2014-12-17 01:49:08,771 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@617] - Established session 0x34a4c0530500002 with negotiated timeout 40000 for client /10.10.64.50:38143
2014-12-17 01:49:08,841 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x34a4c0530500002 due to java.io.IOException: Len error 5880416
2014-12-17 01:49:08,842 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.10.64.50:38143 which had sessionid 0x34a4c0530500002
2014-12-17 01:49:11,530 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.10.64.50:38156
2014-12-17 01:49:11,538 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client attempting to renew session 0x34a4c0530500002 at /10.10.64.50:38156
2014-12-17 01:49:11,538 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@108] - Revalidating client: 0x34a4c0530500002
2014-12-17 01:49:11,538 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@617] - Established session 0x34a4c0530500002 with negotiated timeout 40000 for client /10.10.64.50:38156
2014-12-17 01:49:11,608 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x34a4c0530500002 due to java.io.IOException: Len error 5880416
2014-12-17 01:49:11,608 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.10.64.50:38156 which had sessionid 0x34a4c0530500002

NPE in WorkflowListenerManagerImpl at line 48 on event type CONNECTION_RECONNECTED with data = null

Getting the following NPE while executing your example:

2023-09-12 09:40:32,831 ERROR org.apache.curator.framework.recipes.cache.PathChildrenCache [Curator-PathChildrenCache-0] -
java.lang.NullPointerException: Cannot invoke "org.apache.curator.framework.recipes.cache.ChildData.getPath()" because the return value of "org.apache.curator.framework.recipes.cache.PathChildrenCacheEvent.getData()" is null
                at com.nirmata.workflow.details.WorkflowListenerManagerImpl.lambda$start$0(WorkflowListenerManagerImpl.java:48) ~[nirmata-workflow-0.9.7.jar:?]
                at org.apache.curator.framework.recipes.cache.PathChildrenCache.lambda$callListeners$1(PathChildrenCache.java:535) ~[curator-recipes-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.listen.MappingListenerManager.lambda$forEach$0(MappingListenerManager.java:93) ~[curator-framework-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.listen.MappingListenerManager.forEach(MappingListenerManager.java:90) ~[curator-framework-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.listen.StandardListenerManager.forEach(StandardListenerManager.java:90) ~[curator-framework-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:532) ~[curator-recipes-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:36) ~[curator-recipes-5.5.0.jar:5.5.0]
                at org.apache.curator.framework.recipes.cache.PathChildrenCache$8.run(PathChildrenCache.java:802) ~[curator-recipes-5.5.0.jar:5.5.0]
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
                at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
                at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
                at java.lang.Thread.run(Thread.java:833) ~[?:?]

At WorkflowListenerManagerImpl:48 on event = PathChildrenCacheEvent{type=CONNECTION_RECONNECTED, data=null}

    @Override
    public void start() {
        try {
            runsCache.getListenable().addListener((client, event) -> {
                RunId runId = new RunId(ZooKeeperConstants.getRunIdFromRunPath(event.getData().getPath())); //<-- NPE here
                if (event.getType() == PathChildrenCacheEvent.Type.CHILD_ADDED) {
                    postEvent(new WorkflowEvent(WorkflowEvent.EventType.RUN_STARTED, runId));
                } else if (event.getType() == PathChildrenCacheEvent.Type.CHILD_UPDATED) {
                    postEvent(new WorkflowEvent(WorkflowEvent.EventType.RUN_UPDATED, runId));
                }
            });

ZK server 3.6.4
ZK clients tested 3.6.4, 3.7.1, 3.8.2, 3.9.0
curator-recipes 5.2.0, 5.5.0
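
One way to avoid the NPE would be to check for missing data before reading the path, since connection-state events such as CONNECTION_RECONNECTED carry no child data. The sketch below uses hypothetical stand-in types (Event, Type, and string results) in place of the Curator and workflow classes, so it only illustrates the guard and is not a patch for the library.

```java
import java.util.Optional;

class NullSafeListenerSketch {
    enum Type { CHILD_ADDED, CHILD_UPDATED, CONNECTION_RECONNECTED }

    /** Hypothetical stand-in for PathChildrenCacheEvent; path is null for connection events. */
    static class Event {
        final Type type;
        final String path;

        Event(Type type, String path) {
            this.type = type;
            this.path = path;
        }
    }

    /** Maps a cache event to a workflow event name, ignoring events without data. */
    static Optional<String> toWorkflowEvent(Event event) {
        if (event.path == null) {
            return Optional.empty(); // nothing to resolve a run id from
        }
        switch (event.type) {
            case CHILD_ADDED:
                return Optional.of("RUN_STARTED:" + event.path);
            case CHILD_UPDATED:
                return Optional.of("RUN_UPDATED:" + event.path);
            default:
                return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(toWorkflowEvent(new Event(Type.CONNECTION_RECONNECTED, null)));
        System.out.println(toWorkflowEvent(new Event(Type.CHILD_ADDED, "/runs/abc")));
    }
}
```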

Use ZK container nodes wherever possible

ZooKeeper container nodes (Curator 3.x, ZK 3.5.x) clean up after themselves and make management easier. We should use them wherever possible.

FAILED_STOP re-executes the task

The task is re-queued after each failure, even when the TaskExecutor returns:

new TaskExecutionResult(TaskExecutionStatus.FAILED_STOP, "Failed to execute config stats task");

The task is always re-queued by this code in the org.apache.curator.framework.recipes.queue.DistributedQueue class:

    try
    {
        consumer.consumeMessage(item);
    }
    catch ( Throwable e )
    {
        log.error("Exception processing queue item: " + itemNode, e);
        if ( errorMode.get() == ErrorMode.REQUEUE )
        {
            resultCode = ProcessMessageBytesCode.REQUEUE;
            break;
        }
    }

scheduling of workflow stopped working

I have 3 instances of the same application that can periodically generate the workflow. The same 2 apps can execute these workflows. The 3 instances run on 3 EC2 instances. Three times in the last 10 days, I observed that the scheduling of the workflow stopped working. When it happens, I see the following exception being generated continuously on all 3 instances:

15:35:57.050 [LeaderSelector-1] DEBUG c.nirmata.workflow.details.Scheduler - initLatch completed
15:35:57.050 [LeaderSelector-1] INFO c.nirmata.workflow.details.Scheduler - Updating run: Id{id='a14f3b93-bc5e-4b63-9b90-410d331229b6'}
15:35:57.050 [LeaderSelector-1] ERROR c.nirmata.workflow.details.Scheduler - Could not find run for RunId: Id{id='a14f3b93-bc5e-4b63-9b90-410d331229b6'}
15:35:57.051 [LeaderSelector-1] ERROR c.nirmata.workflow.details.Scheduler - Error while running scheduler
java.lang.RuntimeException: Could not find run for RunId: Id{id='a14f3b93-bc5e-4b63-9b90-410d331229b6'}
at com.nirmata.workflow.details.Scheduler.updateTasks(Scheduler.java:218) ~[nirmata-workflow-0.6.1-SNAPSHOT.jar:na]
at com.nirmata.workflow.details.Scheduler.run(Scheduler.java:142) ~[nirmata-workflow-0.6.1-SNAPSHOT.jar:na]
at com.nirmata.workflow.details.SchedulerSelector.takeLeadership(SchedulerSelector.java:81) [nirmata-workflow-0.6.1-SNAPSHOT.jar:na]
at com.nirmata.workflow.details.SchedulerSelector.access$000(SchedulerSelector.java:31) [nirmata-workflow-0.6.1-SNAPSHOT.jar:na]
at com.nirmata.workflow.details.SchedulerSelector$1.takeLeadership(SchedulerSelector.java:52) [nirmata-workflow-0.6.1-SNAPSHOT.jar:na]
at org.apache.curator.framework.recipes.leader.LeaderSelector$WrappedListener.takeLeadership(LeaderSelector.java:536) [curator-recipes-2.6.0.jar:na]
at org.apache.curator.framework.recipes.leader.LeaderSelector.doWork(LeaderSelector.java:398) [curator-recipes-2.6.0.jar:na]
at org.apache.curator.framework.recipes.leader.LeaderSelector.doWorkLoop(LeaderSelector.java:443) [curator-recipes-2.6.0.jar:na]
at org.apache.curator.framework.recipes.leader.LeaderSelector.access$100(LeaderSelector.java:63) [curator-recipes-2.6.0.jar:na]
at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:244) [curator-recipes-2.6.0.jar:na]
at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:238) [curator-recipes-2.6.0.jar:na]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_40-internal]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_40-internal]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_40-internal]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_40-internal]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_40-internal]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_40-internal]
15:35:57.051 [LeaderSelector-1] INFO c.n.w.details.SchedulerSelector - 10.10.66.50 is no longer the scheduler
15:35:57.051 [pool-36-thread-56] DEBUG HostGroupController - host lock /Orchestrator/HostGroup/1e55787a-e59a-444d-8787-c93b99cd0b3f acquired
15:35:57.051 [pool-36-thread-56] DEBUG c.n.client.service.DefaultCache - CACHE - get config{type=HostGroup, id=1e55787a-e59a-444d-8787-c93b99cd0b3f} - size=2975 hit=4044506 miss=177934
15:35:57.051 [pool-36-thread-56] DEBUG c.n.client.service.DefaultCache - CACHE - get config{type=Host, id=f6a4a486-2356-4b4e-8b40-897c79b48bf7} - size=2975 hit=4044507 miss=177934
15:35:57.051 [pool-36-thread-56] DEBUG c.n.client.service.DefaultCache - CACHE - get config{type=Host, id=9b6bb115-0333-493

To recover from this situation, I stop the 3 instances, remove the workflow nodes in zookeeper and then restart the 3 instances.

New feature request - workflow diagnostics

We are running into a situation where task execution occasionally stops. When this happens, it would be great to run a diagnostic check and report information on a workflow, including the number of registered executors and their state, to help isolate the root cause.

Add failure policy (or similar) to limit retries

When an exception is thrown from a task, the task is immediately re-queued.

Although the root cause is the task code, would be good to have some protection in the workflow framework to prevent resource exhaustion.

A possible solution would be to have a retry policy { numRetries: 10, retryIntervalInMilli: 1000 } and a sensible default.
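
A minimal sketch of such a policy, independent of the library. The field names follow the proposal above; the class and method names are hypothetical, and a real implementation would have to apply the policy where tasks are re-queued.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

class RetryPolicySketch {
    /** Hypothetical policy object mirroring { numRetries, retryIntervalInMilli }. */
    static final class RetryPolicy {
        final int numRetries;
        final long retryIntervalInMilli;

        RetryPolicy(int numRetries, long retryIntervalInMilli) {
            if (numRetries < 0 || retryIntervalInMilli < 0) {
                throw new IllegalArgumentException("values must be non-negative");
            }
            this.numRetries = numRetries;
            this.retryIntervalInMilli = retryIntervalInMilli;
        }
    }

    /** Calls the task once, then retries up to numRetries times, sleeping between attempts. */
    static <T> T callWithRetries(Callable<T> task, RetryPolicy policy) throws Exception {
        int retriesUsed = 0;
        while (true) {
            try {
                return task.call();
            } catch (Exception e) {
                if (retriesUsed++ >= policy.numRetries) {
                    throw e; // retries exhausted: surface the last failure
                }
                Thread.sleep(policy.retryIntervalInMilli);
            }
        }
    }

    /** Demo helper: counts how often an always-failing task is invoked. */
    static int attemptsWhenAlwaysFailing(RetryPolicy policy) {
        AtomicInteger attempts = new AtomicInteger();
        Callable<Void> failing = () -> {
            attempts.incrementAndGet();
            throw new IllegalStateException("task failed");
        };
        try {
            callWithRetries(failing, policy);
        } catch (Exception expected) {
            // the task never succeeds, so the final failure is expected here
        }
        return attempts.get();
    }

    public static void main(String[] args) {
        System.out.println(attemptsWhenAlwaysFailing(new RetryPolicy(3, 10)));
    }
}
```

Bounding the number of attempts and spacing them out is what protects against the resource exhaustion described above; the default values are a separate design decision.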

Scheduler stops receiving events from PathChildrenCache when associated ZK path was deleted and later recreated

We sometimes have task execution issues, which could be summarized as the following two types.

Type 1
Description: A task is submitted but never executed, and is left in the queue.
Impact: After that, no task with the same task key can be submitted. Tasks of other types using the same workflow path have the same issue.
Workaround: Delete the whole path in ZK and restart the service.

Type 2
Description: A task is executed successfully but is not automatically cleaned up.
Impact: After that, no task with the same task key can be submitted. Tasks with a different task key, and tasks of other types using the same workflow path, are not impacted.
Workaround: Restart the service instance on which the scheduler is currently running. For a single-instance service, just restart the service.

More details are described here https://docs.google.com/document/d/1uf4RARRjwfy8V_AfOGvoNj6BWKMr32HtSrcNw86irwY/edit?usp=sharing

The root cause of these issues seems to be that the ZK path is auto-deleted; when it is later recreated, the Scheduler no longer receives events and consequently cannot handle changes to task nodes under that path.

I found a similar issue in the Curator issue tracker: https://issues.apache.org/jira/browse/CURATOR-388.

I will provide test code later.

Closing/shutting down workflow instance one by one gracefully

Use case: Suppose we have a fleet of workflow task executor instances picking up tasks that are continuously submitted by the leader node. The tasks can be time-consuming. Now I need to shut down the instances one by one. I need some way to tell the workflow manager not to pick up any further tasks but to finish the ones that are in progress.

Problem: I don't want to block all the instances at the same time. The other active instances should keep picking up tasks. This avoids full downtime, and it avoids having to stop the submission of new tasks and check how many tasks remain before closing all the instances.

Probable solution: We can instruct the WorkflowManager to shut down gracefully, which in turn shuts down the executor service in the SimpleQueue class and waits for its threads to finish.

Let me know if I have to fork and submit a change if the use case is valid and solution is workable.
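
The proposed behaviour matches the standard ExecutorService shutdown idiom: shutdown() rejects new submissions while already submitted tasks continue, and awaitTermination() waits for them. A minimal sketch, using a plain thread pool as a stand-in for the executor inside SimpleQueue (class and method names are illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class GracefulShutdownSketch {
    /** Stops accepting work, waits for submitted tasks, and reports whether all finished in time. */
    static boolean shutdownGracefully(ExecutorService executor, long timeout, TimeUnit unit) {
        executor.shutdown(); // new submissions are rejected from here on
        try {
            if (executor.awaitTermination(timeout, unit)) {
                return true;
            }
            executor.shutdownNow(); // timed out: interrupt whatever is still running
            return false;
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Demo helper: submits three short tasks, shuts down, and returns how many completed. */
    static int completedAfterShutdown() {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < 3; i++) {
            executor.submit(() -> {
                try {
                    Thread.sleep(50); // simulate a task that takes some time
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                completed.incrementAndGet();
            });
        }
        shutdownGracefully(executor, 5, TimeUnit.SECONDS);
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("completed tasks: " + completedAfterShutdown());
    }
}
```

Whether the instance must also stop taking items from the ZooKeeper queue before this point depends on the library internals and is not covered by this sketch.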

Graceful shutdown of one workflow instance.

I have a use case where I need to tell the workflow manager to stop listening for new tasks and to complete the tasks that are already running.

I have checked all the available APIs but could not find a solution. Any help with this use case would be much appreciated.

Long running Idempotent task keeps creating watches and leads to Out Of Memory issues

SimpleQueue lists the queued tasks and, if the queue is empty, sets a watcher to wait for new events. However, if a task is present and has been picked up by a worker, it is not deleted when it is idempotent, so all the other workers go into a loop of watching again and again. If this single task takes a long time to process, the other workers (or threads, if multithreaded) keep creating watcher instances.

A child task can submit its ancestor after a delay

I'd like to create a DAG that repeats after a delay. The period would be fixed, but the delay would be relative to the completion time of its ancestor.

Here is an example that roughly describes the behavior (the exact calculation may deviate slightly). Suppose the period of the DAG is 3 minutes and a terminal task completes one minute after the root node completed. The root node would then be resubmitted after a delay of 2 minutes.

It is fine to use job metadata to look up the period, etc. The main thing I'm looking for is the best flow to resubmit on an interval, and also how to cancel this loop.
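
The delay calculation in the example can be sketched as the period minus the time elapsed since the root completed, clamped at zero. The class and method names are hypothetical; how the resubmission itself is triggered is left open.

```java
import java.time.Duration;

class ResubmitDelaySketch {
    /** Returns how long to wait before resubmitting the root node. */
    static Duration nextDelay(Duration period, Duration elapsedSinceRootCompleted) {
        Duration remaining = period.minus(elapsedSinceRootCompleted);
        // If the DAG took longer than one period, resubmit immediately.
        return remaining.isNegative() ? Duration.ZERO : remaining;
    }

    public static void main(String[] args) {
        Duration delay = nextDelay(Duration.ofMinutes(3), Duration.ofMinutes(1));
        System.out.println("resubmit after " + delay.toMinutes() + " minute(s)");
    }
}
```

Outside the workflow library, such a delay could be passed to ScheduledExecutorService.schedule, and the returned ScheduledFuture cancelled to stop the loop.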

JsonSerializer.getTask() is not visible

"You can create task objects programmatically or via a file/stream. To load via file/stream, you can use the utility JsonSerializer.getTask()."

However, JsonSerializer.getTask() is not visible.

Need a way to retrieve the metadata of a task submitted but not executed

workflowManager.getRunInfo() takes a long time to complete

I have edited the original issue: the call taking a long time to complete is getRunInfo(), not getAdmin() ...

Background information:
Before creating a workflow, I check whether a similar workflow is already pending. I do that to avoid having tasks pile up in ZooKeeper. When a task is created, I add a key to the metadata of the task. This key is the combination of a task type and the id of the resource for which the task is going to be executed. Example: if a task is in charge of pinging a host, then the key would be something like "PingHostTask-". Before creating a ping task for a host, I first call getRunInfo() to see whether a task is already pending for this host. Here is the code:

public boolean isSimilarTaskPending(TaskType taskType, List<RunInfo> runInfos, String key) {
    for (RunInfo runInfo : runInfos) {
        // Only runs that are still in progress can contain a pending task.
        if (runInfo.isComplete()) {
            continue;
        }
        RunId runId = runInfo.getRunId();
        // Fetch the details of every task in this run.
        Map<TaskId, TaskDetails> taskDetails = _admin.getTaskDetails(runId);
        for (TaskDetails details : taskDetails.values()) {
            // Compare by value; == would only match the very same object instance.
            if (taskType.equals(details.getTaskType())) {
                Map<String, String> metadata = details.getMetaData();
                String taskKey = metadata.get("taskKey");
                if ((taskKey != null) && taskKey.equals(key)) {
                    return true;
                }
            }
        }
    }
    return false;
}

Problem:
In our environment, the call to workflowManager.getRunInfo() to get the list of runInfos takes 15 to 20 seconds to complete.

Our ZooKeeper cluster has around 100K nodes, but only a few hundred are related to workflows, and even fewer are related to a particular type of workflow (about 1500 nodes in ../runs for this workflow). Why is this call taking so long? Do you have an alternative approach to recommend for checking whether a task is already pending?

Fail to execute task after killing and restarting workflowManagers

I had several workflow managers running in the same process. Tasks were generated and executed correctly. I killed the process. After restarting my application, some of the previously generated tasks could not be executed. I got the following exception:

0:55:54.883 [QueueBuilder-10] DEBUG c.n.o.controllers.HostController - GET_HOST_REQUEST sent
00:55:54.965 [pool-4-thread-1] DEBUG c.n.n.impl.DefaultConsumer - consuming notification from topic env.alpha_HostGatewayMessageOut:
00:55:54.965 [pool-4-thread-1] DEBUG c.n.n.impl.DefaultConsumer - {"@Class":"com.nirmata.host.messages.ResponseMessage","requestId":"a547c7d1-8cb6-480f-8069-82da9fcd20cc","partial":true,"errorReason":null,"state":null,"hostId":null,"timestamp":null}
00:55:54.970 [QueueBuilder-10] ERROR c.n.w.details.WorkflowManagerImpl - Could not set completed data for executable task: ExecutableTask{runId=Id{id='df748916-388f-4b89-9273-7ae74de7bdfb'}, taskId=Id{id='4bd529f1-1a97-4dc6-86fd-164673cdcfc9'}, taskType=TaskType{type='PingHostTask', version='V1', isIdempotent=true}, metaData={hostId=1aae571e-55d9-48f0-ad42-d91115ecb3b3}, isExecutable=true}
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /env.alpha-V1/tasks-completed/df748916-388f-4b89-9273-7ae74de7bdfb|4bd529f1-1a97-4dc6-86fd-164673cdcfc9
at org.apache.zookeeper.KeeperException.create(KeeperException.java:119) ~[KeeperException.class:3.4.6-1569965]
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[KeeperException.class:3.4.6-1569965]
at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783) ~[ZooKeeper.class:3.4.6-1569965]
at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:688) ~[CreateBuilderImpl$11.class:na]
at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:672) ~[CreateBuilderImpl$11.class:na]
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[RetryLoop.class:na]
at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:668) ~[CreateBuilderImpl.class:na]
at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:453) ~[CreateBuilderImpl.class:na]
at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:443) ~[CreateBuilderImpl.class:na]
at org.apache.curator.framework.imps.CreateBuilderImpl$4.forPath(CreateBuilderImpl.java:325) ~[CreateBuilderImpl$4.class:na]
at org.apache.curator.framework.imps.CreateBuilderImpl$4.forPath(CreateBuilderImpl.java:267) ~[CreateBuilderImpl$4.class:na]
at com.nirmata.workflow.details.WorkflowManagerImpl.executeTask(WorkflowManagerImpl.java:402) [WorkflowManagerImpl.class:na]
at com.nirmata.workflow.details.WorkflowManagerImpl.lambda$null$8(WorkflowManagerImpl.java:415) [WorkflowManagerImpl.class:na]
at com.nirmata.workflow.details.WorkflowManagerImpl$$Lambda$13/2090508016.executeTask(Unknown Source) [WorkflowManagerImpl.class:na]
at com.nirmata.workflow.queue.zookeeper.ZooKeeperQueueConsumer.consumeMessage(ZooKeeperQueueConsumer.java:66) [ZooKeeperQueueConsumer.class:na]
at com.nirmata.workflow.queue.zookeeper.ZooKeeperQueueConsumer.consumeMessage(ZooKeeperQueueConsumer.java:19) [ZooKeeperQueueConsumer.class:na]
at org.apache.curator.framework.recipes.queue.DistributedQueue.processMessageBytes(DistributedQueue.java:678) [DistributedQueue.class:na]
at org.apache.curator.framework.recipes.queue.DistributedQueue.processWithLockSafety(DistributedQueue.java:749) [DistributedQueue.class:na]
at org.apache.curator.framework.recipes.queue.DistributedQueue$5.run(DistributedQueue.java:625) [DistributedQueue$5.class:na]
at com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:299) [MoreExecutors$DirectExecutorService.class:na]
at org.apache.curator.framework.recipes.queue.DistributedQueue.processChildren(DistributedQueue.java:614) [DistributedQueue.class:na]
at org.apache.curator.framework.recipes.queue.DistributedQueue.runLoop(DistributedQueue.java:566) [DistributedQueue.class:na]
at org.apache.curator.framework.recipes.queue.DistributedQueue.access$000(DistributedQueue.java:65) [DistributedQueue.class:na]
at org.apache.curator.framework.recipes.queue.DistributedQueue$1.call(DistributedQueue.java:196) [DistributedQueue$1.class:na]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_11]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_11]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_11]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_11]
00:55:54.972 [QueueBuilder-10] ERROR o.a.c.f.r.queue.DistributedQueue - Exception processing queue item: queue-0000001388
java.lang.RuntimeException: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /env.alpha-V1/tasks-completed/df748916-388f-4b89-9273-7ae74de7bdfb|4bd529f1-1a97-4dc6-86fd-164673cdcfc9
at com.nirmata.workflow.details.WorkflowManagerImpl.executeTask(WorkflowManagerImpl.java:407) ~[WorkflowManagerImpl.class:na]
at com.nirmata.workflow.details.WorkflowManagerImpl.lambda$null$8(WorkflowManagerImpl.java:415) ~[WorkflowManagerImpl.class:na]
at com.nirmata.workflow.details.WorkflowManagerImpl$$Lambda$13/2090508016.executeTask(Unknown Source) ~[na:na]
at com.nirmata.workflow.queue.zookeeper.ZooKeeperQueueConsumer.consumeMessage(ZooKeeperQueueConsumer.java:66) ~[ZooKeeperQueueConsumer.class:na]
at com.nirmata.workflow.queue.zookeeper.ZooKeeperQueueConsumer.consumeMessage(ZooKeeperQueueConsumer.java:19) ~[ZooKeeperQueueConsumer.class:na]
at org.apache.curator.framework.recipes.queue.DistributedQueue.processMessageBytes(DistributedQueue.java:678) [DistributedQueue.class:na]
at org.apache.curator.framework.recipes.queue.DistributedQueue.processWithLockSafety(DistributedQueue.java:749) [DistributedQueue.class:na]
at org.apache.curator.framework.recipes.queue.DistributedQueue$5.run(DistributedQueue.java:625) [DistributedQueue$5.class:na]
at com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:299) [MoreExecutors$DirectExecutorService.class:na]
at org.apache.curator.framework.recipes.queue.DistributedQueue.processChildren(DistributedQueue.java:614) [DistributedQueue.class:na]
at org.apache.curator.framework.recipes.queue.DistributedQueue.runLoop(DistributedQueue.java:566) [DistributedQueue.class:na]
at org.apache.curator.framework.recipes.queue.DistributedQueue.access$000(DistributedQueue.java:65) [DistributedQueue.class:na]
at org.apache.curator.framework.recipes.queue.DistributedQueue$1.call(DistributedQueue.java:196) [DistributedQueue$1.class:na]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_11]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_11]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_11]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_11]
Caused by: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /env.alpha-V1/tasks-completed/df748916-388f-4b89-9273-7ae74de7bdfb|4bd529f1-1a97-4dc6-86fd-164673cdcfc9
at org.apache.zookeeper.KeeperException.create(KeeperException.java:119) ~[KeeperException.class:3.4.6-1569965]
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[KeeperException.class:3.4.6-1569965]
at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783) ~[ZooKeeper.class:3.4.6-1569965]
at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:688) ~[CreateBuilderImpl$11.class:na]
at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:672) ~[CreateBuilderImpl$11.class:na]
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[RetryLoop.class:na]
at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:668) ~[CreateBuilderImpl.class:na]
at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:453) ~[CreateBuilderImpl.class:na]
at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:443) ~[CreateBuilderImpl.class:na]
at org.apache.curator.framework.imps.CreateBuilderImpl$4.forPath(CreateBuilderImpl.java:325) ~[CreateBuilderImpl$4.class:na]
at org.apache.curator.framework.imps.CreateBuilderImpl$4.forPath(CreateBuilderImpl.java:267) ~[CreateBuilderImpl$4.class:na]
at com.nirmata.workflow.details.WorkflowManagerImpl.executeTask(WorkflowManagerImpl.java:402) ~[WorkflowManagerImpl.class:na]
... 16 common frames omitted
00:55:55.057 [pool-4-thread-1] DEBUG com.nirmata.notification.rx.RxNotify - Received message instanceId=null
00:55:55.057 [pool-4-thread-1] DEBUG com.nirmata.messages.RxMessages - Dispatching to subscriber for com.nirmata.host.messages.ResponseMessage a547c7d1-8cb6-480f-8069-82da9fcd20cc
00:55:55.059 [pool-4-thread-1] DEBUG com.nirmata.notification.rx.RxNotify - Got message instanceId=null
00:55:55.059 [pool-4-thread-1] DEBUG com.nirmata.notification.rx.RxNotify - Setting timeout for subsequent responses
00:55:55.061 [pool-4-thread-1] DEBUG c.n.n.impl.DefaultConsumer - notification has been processed by callback

Soon after that I started seeing more exceptions:

"containers" : [ ]
}
00:56:54.144 [LeaderSelector-5] DEBUG c.n.client.service.DefaultCache - CACHE - get config{type=Host, id=e4159caf-4010-4b4f-8277-08af3007868a} - size=23 hit=17 miss=44
00:56:54.207 [localhost-startStop-1-SendThread(10.10.64.138:2181)] WARN org.apache.zookeeper.ClientCnxn - Session 0x14a55958b020014 for server 10.10.64.138/10.10.64.138:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Xid out of order. Got Xid 585 with err 0 expected Xid 584 for a packet with details: clientPath:null serverPath:null finished:false header:: 584,14 replyHeader:: 0,0,-4 request:: org.apache.zookeeper.MultiTransactionRecord@513f6986 response:: org.apache.zookeeper.MultiResponse@0
at org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:798) ~[ClientCnxn$SendThread.class:3.4.6-1569965]
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:94) ~[ClientCnxnSocketNIO.class:3.4.6-1569965]
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366) ~[ClientCnxnSocketNIO.class:3.4.6-1569965]
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) ~[ClientCnxn$SendThread.class:3.4.6-1569965]
00:56:54.208 [QueueBuilder-10] DEBUG org.apache.curator.RetryLoop - Retry-able exception received
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[KeeperException.class:3.4.6-1569965]
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:935) ~[ZooKeeper.class:3.4.6-1569965]
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915) ~[ZooKeeper.class:3.4.6-1569965]
at org.apache.curator.framework.imps.CuratorTransactionImpl.doOperation(CuratorTransactionImpl.java:159) [CuratorTransactionImpl.class:na]
at org.apache.curator.framework.imps.CuratorTransactionImpl.access$200(CuratorTransactionImpl.java:44) [CuratorTransactionImpl.class:na]
at org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:129) ~[CuratorTransactionImpl$2.class:na]
at org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:125) ~[CuratorTransactionImpl$2.class:na]
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[RetryLoop.class:na]
at org.apache.curator.framework.imps.CuratorTransactionImpl.commit(CuratorTransactionImpl.java:121) [CuratorTransactionImpl.class:na]
at org.apache.curator.framework.recipes.queue.DistributedQueue.processWithLockSafety(DistributedQueue.java:754) [DistributedQueue.class:na]
at org.apache.curator.framework.recipes.queue.DistributedQueue$5.run(DistributedQueue.java:625) [DistributedQueue$5.class:na]
at com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:299) [MoreExecutors$DirectExecutorService.class:na]
at org.apache.curator.framework.recipes.queue.DistributedQueue.processChildren(DistributedQueue.java:614) [DistributedQueue.class:na]
at org.apache.curator.framework.recipes.queue.DistributedQueue.runLoop(DistributedQueue.java:566) [DistributedQueue.class:na]
at org.apache.curator.framework.recipes.queue.DistributedQueue.access$000(DistributedQueue.java:65) [DistributedQueue.class:na]
at org.apache.curator.framework.recipes.queue.DistributedQueue$1.call(DistributedQueue.java:196) [DistributedQueue$1.class:na]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_11]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_11]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_11]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_11]
00:56:54.311 [LeaderSelector-5] DEBUG org.apache.curator.RetryLoop - Retry-able exception received
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /env.alpha-V1/runs/f9581c2e-be20-48a0-b096-47a0f12fa106
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[KeeperException.class:3.4.6-1569965]
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[KeeperException.class:3.4.6-1569965]
at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783) ~[ZooKeeper.class:3.4.6-1569965]
at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:688) ~[CreateBuilderImpl$11.class:na]
at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:672) ~[CreateBuilderImpl$11.class:na]
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[RetryLoop.class:na]
at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:668) [CreateBuilderImpl.class:na]
at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:453) [CreateBuilderImpl.class:na]
at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:443) [CreateBuilderImpl.class:na]
at org.apache.curator.framework.imps.CreateBuilderImpl$4.forPath(CreateBuilderImpl.java:325) [CreateBuilderImpl$4.class:na]
at org.apache.curator.framework.imps.CreateBuilderImpl$4.forPath(CreateBuilderImpl.java:267) [CreateBuilderImpl$4.class:na]
at com.nirmata.workflow.details.WorkflowManagerImpl.submitSubTask(WorkflowManagerImpl.java:115) [WorkflowManagerImpl.class:na]
at com.nirmata.workflow.details.WorkflowManagerImpl.submitTask(WorkflowManagerImpl.java:94) [WorkflowManagerImpl.class:na]
at com.nirmata.orchestrator.monitors.HostMonitor.lambda$scheduleWorkflowExecution$10(HostMonitor.java:141) [HostMonitor.class:na]
at com.nirmata.orchestrator.monitors.HostMonitor$$Lambda$19/55116442.run(Unknown Source) [HostMonitor.class:na]
at com.nirmata.scheduler.LeaderExecutor.takeLeadership(LeaderExecutor.java:48) [LeaderExecutor.class:na]
at org.apache.curator.framework.recipes.leader.LeaderSelector$WrappedListener.takeLeadership(LeaderSelector.java:536) [LeaderSelector$WrappedListener.class:na]
at org.apache.curator.framework.recipes.leader.LeaderSelector.doWork(LeaderSelector.java:398) [LeaderSelector.class:na]
at org.apache.curator.framework.recipes.leader.LeaderSelector.doWorkLoop(LeaderSelector.java:443) [LeaderSelector.class:na]
at org.apache.curator.framework.recipes.leader.LeaderSelector.access$100(LeaderSelector.java:63) [LeaderSelector.class:na]
at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:244) [LeaderSelector$2.class:na]
at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:238) [LeaderSelector$2.class:na]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_11]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_11]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_11]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_11]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_11]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_11]
00:56:54.312 [localhost-startStop-1-EventThread] INFO o.a.c.f.state.ConnectionStateManager - State change: SUSPENDED
00:56:54.316 [LeaderSelector-3] DEBUG com.nirmata.scheduler.LeaderExecutor - Leader thread CloudProviderMonitor for Thread[LeaderSelector-3,5,main] was interrupted
00:56:54.323 [LeaderSelector-9] DEBUG com.nirmata.scheduler.LeaderExecutor - Leader thread GarbageCollector for Thread[LeaderSelector-9,5,main] was interrupted
00:56:54.324 [LeaderSelector-7] DEBUG com.nirmata.scheduler.LeaderExecutor - Leader thread ServiceScalingTask for Thread[LeaderSelector-7,5,main] was interrupted
00:56:54.325 [LeaderSelector-5] DEBUG org.apache.curator.RetryLoop - Retry policy not allowing retry
00:56:54.329 [LeaderSelector-5] ERROR c.n.o.monitors.HostMonitor - failed to monitor cloud provider hosts:
java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /env.alpha-V1/runs/f9581c2e-be20-48a0-b096-47a0f12fa106
at com.nirmata.workflow.details.WorkflowManagerImpl.submitSubTask(WorkflowManagerImpl.java:119) ~[WorkflowManagerImpl.class:na]
at com.nirmata.workflow.details.WorkflowManagerImpl.submitTask(WorkflowManagerImpl.java:94) ~[WorkflowManagerImpl.class:na]
at com.nirmata.orchestrator.monitors.HostMonitor.lambda$scheduleWorkflowExecution$10(HostMonitor.java:141) ~[HostMonitor.class:na]
at com.nirmata.orchestrator.monitors.HostMonitor$$Lambda$19/55116442.run(Unknown Source) [HostMonitor.class:na]
at com.nirmata.scheduler.LeaderExecutor.takeLeadership(LeaderExecutor.java:48) [LeaderExecutor.class:na]
at org.apache.curator.framework.recipes.leader.LeaderSelector$WrappedListener.takeLeadership(LeaderSelector.java:536) [LeaderSelector$WrappedListener.class:na]
at org.apache.curator.framework.recipes.leader.LeaderSelector.doWork(LeaderSelector.java:398) [LeaderSelector.class:na]
at org.apache.curator.framework.recipes.leader.LeaderSelector.doWorkLoop(LeaderSelector.java:443) [LeaderSelector.class:na]
at org.apache.curator.framework.recipes.leader.LeaderSelector.access$100(LeaderSelector.java:63) [LeaderSelector.class:na]
at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:244) [LeaderSelector$2.class:na]
at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:238) [LeaderSelector$2.class:na]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_11]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_11]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_11]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_11]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_11]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_11]
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /env.alpha-V1/runs/f9581c2e-be20-48a0-b096-47a0f12fa106
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[KeeperException.class:3.4.6-1569965]
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[KeeperException.class:3.4.6-1569965]
at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783) ~[ZooKeeper.class:3.4.6-1569965]
at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:688) ~[CreateBuilderImpl$11.class:na]
at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:672) ~[CreateBuilderImpl$11.class:na]
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[RetryLoop.class:na]
at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:668) ~[CreateBuilderImpl.class:na]
at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:453) ~[CreateBuilderImpl.class:na]
at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:443) ~[CreateBuilderImpl.class:na]
at org.apache.curator.framework.imps.CreateBuilderImpl$4.forPath(CreateBuilderImpl.java:325) ~[CreateBuilderImpl$4.class:na]
at org.apache.curator.framework.imps.CreateBuilderImpl$4.forPath(CreateBuilderImpl.java:267) ~[CreateBuilderImpl$4.class:na]
at com.nirmata.workflow.details.WorkflowManagerImpl.submitSubTask(WorkflowManagerImpl.java:115) ~[WorkflowManagerImpl.class:na]
... 16 common frames omitted
00:56:54.329 [LeaderSelector-5] DEBUG com.nirmata.scheduler.LeaderExecutor - Completed PingHostTask in leader thread Thread[LeaderSelector-5,5,main]
00:56:54.329 [LeaderSelector-5] DEBUG com.nirmata.scheduler.LeaderExecutor - Leader thread PingHostTask for Thread[LeaderSelector-5,5,main] was interrupted

Automatically clean up runIds for completed tasks

Provide an option to let the framework clean up completed tasks and all related metadata on the user's behalf after a configurable time period.

We could also add an option for deleting tasks that have not completed after a given time.
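
The selection logic for such an option could look roughly like this. It is only a sketch of the idea under stated assumptions: RunRetention, RunRecord, and expired are hypothetical names, not library types, and the two time-to-live values stand in for the configurable periods requested above.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical retention policy; none of these names come from the workflow library. */
public final class RunRetention {
    private RunRetention() {}

    /** Minimal stand-in for whatever run information the caller has available. */
    public static final class RunRecord {
        final String runId;
        final Instant startedAt;
        final Instant completedAt; // null while the run is still incomplete

        public RunRecord(String runId, Instant startedAt, Instant completedAt) {
            this.runId = runId;
            this.startedAt = startedAt;
            this.completedAt = completedAt;
        }
    }

    /**
     * Returns the ids of runs that are due for cleanup: completed runs older than
     * completedTtl, and incomplete runs that started more than incompleteTtl ago.
     */
    public static List<String> expired(List<RunRecord> runs, Instant now,
                                       Duration completedTtl, Duration incompleteTtl) {
        List<String> due = new ArrayList<>();
        for (RunRecord run : runs) {
            Instant deadline = (run.completedAt != null)
                    ? run.completedAt.plus(completedTtl)
                    : run.startedAt.plus(incompleteTtl);
            if (deadline.isBefore(now)) {
                due.add(run.runId);
            }
        }
        return due;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        List<RunRecord> runs = Arrays.asList(
                new RunRecord("old-completed", now.minus(Duration.ofHours(30)), now.minus(Duration.ofHours(29))),
                new RunRecord("fresh-completed", now.minus(Duration.ofHours(2)), now.minus(Duration.ofHours(1))),
                new RunRecord("stuck", now.minus(Duration.ofDays(3)), null));
        // A real implementation would then delete each returned run and its metadata.
        System.out.println(expired(runs, now, Duration.ofHours(12), Duration.ofDays(1)));
    }
}
```

Keeping separate periods for completed and incomplete runs lets the second option be enabled independently of the first.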
