orkes-io / orkes-conductor-community

Orkes Conductor is a microservices orchestration engine.

License: Other

Java 99.64% Shell 0.36%

java microservice orchestration workflow workflow-engine

orkes-conductor-community's Issues

Swagger UI displays no operation - "No operations defined in spec! " error message

Describe the bug
When running the standalone Docker image, http://localhost:8080/swagger-ui/index.html does not display any operations. Instead it shows the error "No operations defined in spec!"

Steps To Reproduce
Run the self-contained, standalone Docker image as per instructions. Navigate to http://localhost:8080/, then click on "Swagger Documentation".

Expected behavior
The usual Swagger page to be displayed with operations.

Device/browser

OS: Centos 7
Browser: Firefox
Version 1.0.3 (of orkes-conductor-community)

Additional context
The same issue occurs when downloading and building the server and running it locally, outside Docker, against a local Redis and Postgres.

Conductor:UI: Only workflows with status failed are visible

Describe the bug
Since version 1.0.7, only workflows with status failed are visible in the Conductor UI.

Steps To Reproduce
Steps to reproduce the behavior:

Run orkesio/orkes-conductor-community-standalone:1.0.7 or higher in Docker as described in readme:

docker volume create postgres
docker volume create redis
docker run --init -p 8080:8080 -p 1234:5000 --mount source=redis,target=/redis \
    --mount source=postgres,target=/pgdata orkesio/orkes-conductor-community-standalone:latest

Navigate to the Conductor UI (http://localhost:1234)
Go to Workbench and start workflow "load_test" -> which is now in status running
Start Workflow "http" -> which is now in status failed
Go to the Executions Window: Only failed workflows are displayed. Even if you select "running" in the status dropdown, only failed tasks are shown.

Expected behavior
I expect that all workflows are shown.

Device/browser

OS: Docker on Windows (WSL2), Docker on Ubuntu 22.04
Browser: Chrome
Version: 117

Additional context
In version 1.0.6 it worked fine. The bug has been present since 1.0.7.

missing curl in standalone docker image => health check always down

Describe the bug
We run the Docker image orkesio/orkes-conductor-community-standalone:latest and the container is always unhealthy because curl seems to be missing:

"Health": {
    "Status": "unhealthy",
    "FailingStreak": 386,
    "Log": [
        {
            "Start": "2024-07-08T16:51:14.249784013Z",
            "End": "2024-07-08T16:51:14.344971847Z",
            "ExitCode": 1,
            "Output": "/bin/sh: curl: not found\n"
        },
        {
            "Start": "2024-07-08T16:52:14.346606472Z",
            "End": "2024-07-08T16:52:14.433134513Z",
            "ExitCode": 1,
            "Output": "/bin/sh: curl: not found\n"
        },
        {
            "Start": "2024-07-08T16:53:14.440914138Z",
            "End": "2024-07-08T16:53:14.549711097Z",
            "ExitCode": 1,
            "Output": "/bin/sh: curl: not found\n"
        },
        {
            "Start": "2024-07-08T16:54:14.552011055Z",
            "End": "2024-07-08T16:54:14.644104638Z",
            "ExitCode": 1,
            "Output": "/bin/sh: curl: not found\n"
        },
        {
            "Start": "2024-07-08T16:55:14.646305222Z",
            "End": "2024-07-08T16:55:14.75006993Z",
            "ExitCode": 1,
            "Output": "/bin/sh: curl: not found\n"
        }
    ]
}

MySQL backend

Helm charts for the deployments

redis:Could not get a resource from the pool,Pool not open

I encountered an error. During the startup process, Redis shut down abnormally, causing the service to fail to start. First of all, I'm sure that Redis itself is OK.

Race condition when indexTask with ES

Describe the bug
There is a race condition in indexTask with Elasticsearch (ES): the number of index requests sent by the Conductor server does not match the number received on the ES side.

Steps To Reproduce
Steps to reproduce the behavior:

Run multiple tasks in parallel.
Set the ES index log level to debug.
Check the requests logged in ES (using ES7; ES6 may behave the same).
Some tasks end up with status IN_PROGRESS rather than COMPLETED in ES after the workflow is COMPLETED.
The status in the persistence layer, on the other hand, is correct (Postgres is used for persistence).
indexBatchSize is left at its default of 1 and asyncIndexingEnabled at its default of false.

Expected behavior
All tasks should be in a terminal status in ES, such as COMPLETED or FAILED, rather than IN_PROGRESS.

Device/browser

OS: Ubuntu
Browser N/A
Version 3.14

Additional context

With debug logging enabled, the following message is logged correctly in our environment. We see 3 index requests per task, logged in ElasticSearchRestDAOV7.java -> indexTask, and the average time taken is less than 30 ms:

Time taken {} for indexing task:{} in workflow: {}

On the ES side, the number of records received is sometimes less than 3.
There seems to be a race condition in the functions indexObject and indexBulkRequest:

private void indexObject(
        final String index, final String docType, final String docId, final Object doc) {

    byte[] docBytes;
    try {
        docBytes = objectMapper.writeValueAsBytes(doc);
    } catch (JsonProcessingException e) {
        logger.error("Failed to convert {} '{}' to byte string", docType, docId);
        return;
    }
    IndexRequest request = new IndexRequest(index);
    request.id(docId).source(docBytes, XContentType.JSON);

    if (bulkRequests.get(docType) == null) {
        bulkRequests.put(
                docType, new BulkRequests(System.currentTimeMillis(), new BulkRequest()));
    }

    bulkRequests.get(docType).getBulkRequest().add(request);
    if (bulkRequests.get(docType).getBulkRequest().numberOfActions() >= this.indexBatchSize) {
        indexBulkRequest(docType);
    }
}

private synchronized void indexBulkRequest(String docType) {
    if (bulkRequests.get(docType).getBulkRequest() != null
            && bulkRequests.get(docType).getBulkRequest().numberOfActions() > 0) {
        synchronized (bulkRequests.get(docType).getBulkRequest()) {
            indexWithRetry(
                    bulkRequests.get(docType).getBulkRequest().get(),
                    "Bulk Indexing " + docType,
                    docType);
            bulkRequests.put(
                    docType, new BulkRequests(System.currentTimeMillis(), new BulkRequest()));
        }
    }
}

No lock is held when a request is added to the bulkRequest in indexObject.
A lock is held when the bulkRequest is sent and the local bulkRequest is replaced in indexBulkRequest.
Consider two threads executing in this order: T1 sends the bulkRequest -> T2 adds a request to the same bulkRequest -> T2 may block on the synchronized indexBulkRequest -> T1 replaces the local bulkRequest -> T2 enters indexBulkRequest and fails the check, so nothing is sent (or T3 may have added a new request to the empty bulkRequest so that the check passes ...)
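
One possible way to close this kind of window is to make "add to the batch" and "swap in a fresh batch" a single atomic step under one lock, and to send the detached batch afterwards. The sketch below illustrates that idea only. `SafeBatcher`, `index`, `send` and `runDemo` are hypothetical names, the batch is a plain list of strings standing in for `BulkRequest`, and `send` only counts documents instead of calling Elasticsearch. It is not the Conductor implementation and not a proposed patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the bulk buffer in the issue; not the Conductor or Elasticsearch API. */
public class SafeBatcher {
    private final Object lock = new Object();
    private final int batchSize;
    private List<String> pending = new ArrayList<>();
    private final AtomicInteger sent = new AtomicInteger();

    public SafeBatcher(int batchSize) {
        this.batchSize = batchSize;
    }

    /** Adds a document; the add and the batch swap happen under the same lock. */
    public void index(String doc) {
        List<String> toSend = null;
        synchronized (lock) {
            pending.add(doc);
            if (pending.size() >= batchSize) {
                toSend = pending;            // detach the full batch
                pending = new ArrayList<>(); // later adds go to a fresh batch
            }
        }
        if (toSend != null) {
            send(toSend); // no other thread can still add to the detached batch
        }
    }

    /** Placeholder for the real bulk call; here it only counts documents. */
    private void send(List<String> batch) {
        sent.addAndGet(batch.size());
    }

    public int sentCount() {
        return sent.get();
    }

    /** Runs several writer threads and returns how many documents were sent. */
    public static int runDemo(int threads, int perThread, int batchSize) {
        SafeBatcher batcher = new SafeBatcher(batchSize);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int id = t;
            pool.submit(() -> {
                for (int i = 0; i < perThread; i++) {
                    batcher.index("task-" + id + "-" + i);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return batcher.sentCount();
    }

    public static void main(String[] args) {
        // With batchSize 1 (the default mentioned above) every document is flushed.
        System.out.println(runDemo(8, 1000, 1));
    }
}
```

Because a batch is detached while the lock is held, a request can only ever land in a batch that has not been sent yet. With a batch size larger than 1, documents left in a partially filled batch would still need a separate time-based flush, which this sketch leaves out.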

Thanks

Conductor.db.type of redis_sentinel is being ignored

I'm trying to run the orkes server, pointing it to a redis sentinel cluster. I am mounting the following in /app/config/config.properties

    spring.datasource.url=jdbc:postgresql://postgres:5432/postgres
    spring.datasource.username=postgres
    spring.datasource.password=postgres
    conductor.db.type=redis_sentinel
    conductor.redis-lock.serverAddress=redis://redis:26379
    conductor.redis.hosts=redis:26379:this-one

Below is the output from orkes, grepping for the wording redis:

10:32:07.812 [main] INFO io.orkes.conductor.OrkesConductorApplication - System Env Props - Key: REDIS_PORT_6379_TCP_PROTO, Value: tcp

10:32:07.812 [main] INFO io.orkes.conductor.OrkesConductorApplication - System Env Props - Key: REDIS_PORT_6379_TCP_ADDR, Value: 10.43.51.71

10:32:07.812 [main] INFO io.orkes.conductor.OrkesConductorApplication - System Env Props - Key: REDIS_PORT, Value: tcp://10.43.51.71:6379

10:32:07.813 [main] INFO io.orkes.conductor.OrkesConductorApplication - System Env Props - Key: REDIS_SERVICE_PORT_TCP_SENTINEL, Value: 26379

10:32:07.813 [main] INFO io.orkes.conductor.OrkesConductorApplication - System Env Props - Key: REDIS_PORT_26379_TCP, Value: tcp://10.43.51.71:26379

10:32:07.814 [main] INFO io.orkes.conductor.OrkesConductorApplication - System Env Props - Key: REDIS_PORT_26379_TCP_ADDR, Value: 10.43.51.71

10:32:07.814 [main] INFO io.orkes.conductor.OrkesConductorApplication - System Env Props - Key: REDIS_PORT_26379_TCP_PORT, Value: 26379

10:32:07.814 [main] INFO io.orkes.conductor.OrkesConductorApplication - System Env Props - Key: REDIS_SERVICE_HOST, Value: 10.43.51.71

10:32:07.815 [main] INFO io.orkes.conductor.OrkesConductorApplication - System Env Props - Key: REDIS_SERVICE_PORT_TCP_REDIS, Value: 6379

10:32:07.815 [main] INFO io.orkes.conductor.OrkesConductorApplication - System Env Props - Key: REDIS_PORT_26379_TCP_PROTO, Value: tcp

10:32:07.815 [main] INFO io.orkes.conductor.OrkesConductorApplication - System Env Props - Key: REDIS_PORT_6379_TCP, Value: tcp://10.43.51.71:6379

10:32:07.815 [main] INFO io.orkes.conductor.OrkesConductorApplication - System Env Props - Key: REDIS_PORT_6379_TCP_PORT, Value: 6379

10:32:07.816 [main] INFO io.orkes.conductor.OrkesConductorApplication - System Env Props - Key: REDIS_SERVICE_PORT, Value: 6379

10:32:07.832 [main] INFO io.orkes.conductor.OrkesConductorApplication - Setting conductor.redis-lock.serverAddress - redis://redis:26379

10:32:07.832 [main] INFO io.orkes.conductor.OrkesConductorApplication - Setting conductor.db.type - redis_sentinel

10:32:07.832 [main] INFO io.orkes.conductor.OrkesConductorApplication - Setting conductor.redis.hosts - redis:26379:this-one

2023-02-21 10:32:18,624 INFO [main] io.orkes.conductor.queue.config.RedisQueueConfiguration: Starting conductor server using redis_standalone - use SSL? false

2023-02-21 10:32:19,055 ERROR [main] com.netflix.conductor.redis.dao.RedisMetadataDAO: refresh TaskDefs failed

redis.clients.jedis.exceptions.JedisDataException: ERR unknown command `HSCAN`, with args beginning with: `conductor.test.TASK_DEFS`, `0`,

        at redis.clients.jedis.Protocol.processError(Protocol.java:135)

        at redis.clients.jedis.Protocol.process(Protocol.java:169)

        at redis.clients.jedis.Protocol.read(Protocol.java:223)

        at redis.clients.jedis.Connection.readProtocolWithCheckingBroken(Connection.java:352)

        at redis.clients.jedis.Connection.getUnflushedObjectMultiBulkReply(Connection.java:314)

        at redis.clients.jedis.Connection.getObjectMultiBulkReply(Connection.java:319)

        at redis.clients.jedis.Jedis.hscan(Jedis.java:3727)

        at redis.clients.jedis.Jedis.hscan(Jedis.java:3719)

        at com.netflix.conductor.redis.jedis.JedisStandalone.lambda$hscan$127(JedisStandalone.java:706)

        at com.netflix.conductor.redis.jedis.JedisStandalone.executeInJedis(JedisStandalone.java:59)

        at com.netflix.conductor.redis.jedis.JedisStandalone.hscan(JedisStandalone.java:706)

        at com.netflix.conductor.redis.jedis.OrkesJedisProxy.hgetAll(OrkesJedisProxy.java:148)

        at com.netflix.conductor.redis.dao.RedisMetadataDAO.getAllTaskDefs(RedisMetadataDAO.java:125)

        at com.netflix.conductor.redis.dao.RedisMetadataDAO.refreshTaskDefs(RedisMetadataDAO.java:92)

        at com.netflix.conductor.redis.dao.RedisMetadataDAO.<init>(RedisMetadataDAO.java:57)

        at com.netflix.conductor.redis.dao.OrkesMetadataDAO.<init>(OrkesMetadataDAO.java:55)

2023-02-21 10:32:19,059 INFO [main] com.netflix.conductor.redis.dao.OrkesMetadataDAO: taskDefCacheTTL set to 1000

So I believe I am loading the config correctly, but the server is still running the standalone Redis configuration. It looks like the default config still takes precedence, even though I am overriding it and the log messages suggest the override was applied.

Any ideas please?

Wrong row count in executions UI landing page

Describe the bug
In the UI, even though there are many executions, the row count and pagination count are not shown correctly.

Steps To Reproduce
Steps to reproduce the behavior:

Make sure you already have many executions, preferably more than 15.
Go to the Executions tab and keep the default view count of 15 executions: the page navigation icons are disabled and the number of executions is displayed incorrectly.
Click the drop-down to increase the row count and you will be able to see the additional records.

Expected behavior
The row count should be correct and the navigation icons should allow paging through the results.

Additional context
These workflows were spawned from the backend using the localhost:1234 endpoints, not via the UI Workbench. Thanks

Conductor client does not work with spring cloud config

Describe the bug
We are working on an ad hoc task workflow. For that we have created a new Spring Boot service with the Conductor client. We are running the Conductor server in Docker [orkesio/orkes-conductor-community-standalone:latest].
My service works fine without Spring Cloud Config: it is able to poll for tasks and execute them. However, when I add the Spring Cloud Config dependencies, it is not able to poll for tasks and hence does not execute them.

Steps To Reproduce
Steps to reproduce the behavior:
I have created a demo project in my github.

Go to https://github.com/arpitrathore/conductor-cloud-config-test
Follow the steps in the README to spin up two docker containers. One for spring cloud config and one for conductor server.
Start the service by running the main method in src/main/java/com/arpitrathore/test/Application.java
Run the following curl command to submit a task

curl -H 'Content-Type: application/json' http://localhost:8080/submit/ -d '{"someId": 123}'

Notice that the service is NOT able to poll the task and execute it.

Now switch the branch to without-cloud-config. This branch does not have spring cloud dependency. Run the main method in src/main/java/com/arpitrathore/test/Application.java again.
Run the following curl command to submit a task

curl -H 'Content-Type: application/json' http://localhost:8080/submit/ -d '{"someId": 123}'

Notice the service is able to poll the task and execute it.

Expected behavior
Service should poll and execute the task with or without spring cloud config dependency

Device/browser

OS: Mac OS M1
Browser NA

High redis usage caused by OrkesWorkflowSweeper

In our production use-case, we often have long running workflows that wait on human tasks.
Because we want to be able to track human tasks in our own backoffice systems, we created a subworkflow that creates and tracks human tasks for us and ends with a HUMAN task in conductor.
We noticed an absurd load on Redis, even when every non-completed workflow is idling on a subworkflow that is itself idling on a HUMAN task. Looking into it further, we noticed that our logs are being spammed with:
INFO [sweeper-thread-1] io.orkes.conductor.server.service.OrkesWorkflowSweeper: Running sweeper for workflow ***
This constantly fetches the workflow and its tasks, and there currently seems to be no way to slow this process down.

Looking at the contradiction between this code and its comment: https://github.com/orkes-io/orkes-conductor-community/blob/60325ef7b196a96d1062ddfecf924c4be7866309/server/src/main/java/io/orkes/conductor/server/service/OrkesWorkflowSweeper.java#L152C4-L152C4 (the comment says 60 seconds, the code uses 60 milliseconds), I'm worried that a mistake was made in the implementation of the sweeper service and that workflows are being checked far more often than they should be.

I believe this to be a root cause of our production systems failing under relatively light load. Is there any way to slow down the sweeper without disabling it completely, or does a bug need to be fixed?
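
To put the suspected unit mismatch in numbers, here is a small arithmetic sketch. It assumes, as the report suggests but does not confirm, that the value 60 is the delay between two sweeps of the same workflow. `SweepInterval` and `sweepsPerHour` are hypothetical names for illustration and are not part of Conductor.

```java
import java.time.Duration;

/** Hypothetical illustration of a seconds-vs-milliseconds mix-up; not Conductor code. */
public class SweepInterval {

    /** Number of sweeps of one workflow per hour for a given delay between sweeps. */
    public static long sweepsPerHour(Duration delay) {
        return Duration.ofHours(1).toMillis() / delay.toMillis();
    }

    public static void main(String[] args) {
        Duration asCommented = Duration.ofSeconds(60); // what the comment describes
        Duration asReported = Duration.ofMillis(60);   // what the code is reported to use

        System.out.println("60 s delay:  " + sweepsPerHour(asCommented) + " sweeps/hour");
        System.out.println("60 ms delay: " + sweepsPerHour(asReported) + " sweeps/hour");
    }
}
```

Under that assumption the two readings differ by a factor of 1000 per idle workflow, which would be consistent with the Redis load described above. Whether the sweeper really reschedules at that rate depends on the code at the linked line.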

RetryDelay do not work

Describe the bug
The retryDelaySeconds parameter in the task definition does not add a delay when the task is retried. The issue is observed with all tasks.

This can be reproduced with the sample workflow in this repo:
Path: orkes-conductor-community-build/persistence/src/test/resources/wf2.json

Expected behavior
Failed Tasks should retry after delay of N seconds

startDelay do not work

Describe the bug
I created an HTTP task and an INLINE task, intending to start them with a delay of 5 seconds. There is a predefined property suggested for this feature: startDelay (in seconds). I set a delay of 5 (and also tried 5000), but it did not work: workflows start instantaneously.

Reproducible with all workflows. It can be tried on the sample workflow in this repo by modifying the value of startDelay:
Path: orkes-conductor-community-build/server/src/main/resources/workflows.json

Expected behavior
Workflow should start after N seconds

Device/browser
Across all browsers
