
ClusterODM's Issues

AWS S3 bucket creation + autoscaling

Hello,
I have been testing ClusterODM on local machines plus some AWS instances. Since ClusterODM supports autoscaling (creating and destroying nodes on demand), I would like to know a bit more about the process of setting it up. The description provided in https://github.com/OpenDroneMap/ClusterODM/blob/master/docs/aws.md is not very clear for an AWS newbie like me.
So, from that, what I understand is:

  1. Given that I have an AWS account, I need to create an S3 bucket with unblocked access.
  2. Then, I guess, I need to deploy ClusterODM on an instance in AWS?
  3. Also, I did not quite understand what this means:

Select an AMI (machine image) to run - Ubuntu has a handy AMI finder.
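For what it's worth, the pieces fit together like this: ClusterODM itself can run anywhere (locally or on an EC2 instance), the S3 bucket only stores results, and the AMI is simply the base machine image the autoscaled instances boot from (its ID goes into the config). Below is a trimmed sketch of the aws.json file passed via --asr, with field names following docs/aws.md and every value a placeholder to change:

{
  "provider": "aws",
  "accessKey": "CHANGEME",
  "secretKey": "CHANGEME",
  "s3": { "endpoint": "s3.us-west-2.amazonaws.com", "bucket": "my-clusterodm-results" },
  "securityGroup": "clusterodm",
  "region": "us-west-2",
  "ami": "ami-xxxxxxxxxxxx",
  "maxRuntime": -1,
  "maxUploadTime": -1,
  "imageSizeMapping": [
    { "maxImages": 40, "slug": "t3a.small", "storage": 60 },
    { "maxImages": 250, "slug": "m5.large", "storage": 100 }
  ],
  "addSwap": 1,
  "dockerImage": "opendronemap/nodeodm"
}

ClusterODM is then started with: node index.js --asr aws.json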

Split-merge Integration

At some point we should add the ability to distribute large datasets over multiple nodes for parallel processing.

Cleanup does not run after the task finishes when using AWS auto-scaling

hello, geniuses,

Cleanup does not run after the task finishes when I use AWS auto-scaling.

When I used it last year, the instances were automatically shut down after the work finished. I'm using it for the first time this year, and the instances aren't shutting down.

I run ClusterODM via docker-compose, and the NodeODM version is 2.4.10 for compatibility with my WebODM 1.8.1.
(There also looks to be a problem between WebODM 1.8.1 and NodeODM 2.5.0: tasks are not reported as successful even though all processing appears to be done.)

[Screen capture 2021-05-26 16:18:00]

**All tasks were finished.**

[Screen capture 2021-05-26 16:16:39]

**Still online even after a long time...**

My docker-compose.yml:
version: '2.1'
services:
  nodeodm-1:
    image: opendronemap/nodeodm:2.4.10
    container_name: nodeodm-1
    ports:
      - "3000"
    restart: on-failure:10
    oom_score_adj: 500
    entrypoint: /usr/bin/node /var/www/index.js --max_images 1 --max_concurrency 1 --max_runtime 0 -q 0
  clusterodm:
    image: opendronemap/clusterodm
    container_name: clusterodm
    ports:
      - "80:3000"
      - "8082:8080"
      - "10000:10000"
    volumes:
      - ./docker/data:/var/www/data
      - ./config.json:/var/www/config-default.json  
      - ./asr-configuration.json:/var/www/configuration.json
    restart: on-failure:10
    depends_on:
      - nodeodm-1
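In case it helps while debugging: with this compose setup, the autoscaled machines are driven by docker-machine inside the clusterodm container, so a workaround sketch for removing leftover instances by hand (container name taken from the file above; machine name is a placeholder) is:

docker exec -it clusterodm docker-machine ls
docker exec -it clusterodm docker-machine rm -y <machine-name>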

ClusterODM is dropping a high number of uploads

What is the problem?

Since yesterday, my WebODM is constantly failing all tasks after I restarted it. I noticed it pulled a newer image from DockerHub, and there are no previous versions available on DockerHub.
After investigating, I noticed that ClusterODM is closing a lot of POST HTTP requests on the routes /task/new/upload/<task_id>.
The error message displayed in WebODM is sometimes Connection error: HTTPSConnectionPool(host='example.com', port=443): Read timed out. (read timeout=30) and other times just a 502.

Even the smallest jobs are failing; I hit this issue with a dataset containing only 5 images.

On the web interface of ClusterODM, I can still launch a task, but during the uploads I get a lot of messages saying Upload of IMG_NAME.jpg failed, retrying...

After seeing this, I made a clean install of my entire stack (WebODM webapp & worker, ClusterODM, and one locked NodeODM for the autoscaler) on totally different infrastructure and had the exact same problem.

What should be the expected behavior?

Uploading the files on WebODM or ClusterODM UI should work

How can we reproduce this? (What steps did you do to trigger the problem? If applicable, please include multiple screenshots of the problem! Be detailed)

Install WebODM and ClusterODM and try to upload files to launch a task.
My current installation is on a Kubernetes cluster hosted on Scaleway. I can provide the manifests I'm using if needed.
WebODM version: 1.9.11
ClusterODM version: latest on Dockerhub

[Feature Request]: `docker-machine` is deprecated, we need to add support for another tool for autoscaling

docker-machine is being deprecated and is starting to disappear from repositories. It's also not compatible with the latest Docker version. See docker/machine#4537 and docker/roadmap#245.

One alternative would be to move towards Terraform. There is a provider for Scaleway, so we wouldn't lose functionality. Is there any interest in moving towards this solution?

What is the problem?

docker-machine being deprecated.

What should be the expected behavior?

Not losing autoscaling!

How can we reproduce this? (What steps did you do to trigger the problem? If applicable, please include multiple screenshots of the problem! Be detailed)

Try installing docker-machine on a recent OS.

Setting up ClusterODM with WebODM

Hey Everyone,
I am struggling to make ClusterODM work with WebODM. I have set up 3 nodes running NodeODM, and set up ClusterODM on a server-spec machine (no particular reason why I chose a server machine for this task). As the guide says, I connected all the nodes to ClusterODM using telnet, and I can see the nodes are online. Now how do I drive this cluster setup from WebODM? As per the documentation in https://docs.opendronemap.org/large.html#distributed-split-merge, in CLI ODM you need to add --sm-cluster http://cluster_odm_ip:3001.

My question, again, is how do I trigger this action in WebODM? In WebODM's custom settings there is a field to enter the URL of ClusterODM, but I haven't had any luck with that. What I need to know is at what stage of processing ODM creates submodels and distributes them to the different nodes.
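For reference, a sketch of the wiring that generally works (ports shown are the defaults; adjust to your install): nodes are registered on ClusterODM's admin CLI, and WebODM then talks to ClusterODM's proxy port as if it were one big NodeODM.

telnet cluster_odm_ip 8080
> NODE ADD node_1_ip 3000
> NODE LIST

In WebODM, add a Processing Node pointing at cluster_odm_ip:3000 (the proxy port, not 8080), then set --split and --split-overlap in the task options. Submodels are created at the very start of the pipeline (the split stage, before reconstruction) and distributed from there.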

Setup:

WebODM (System 1) --> ClusterODM (System 3) +-------- NodeODM
                                            +-------- NodeODM
                                            +-------- NodeODM

System specifications:

Machine running WebODM: (1)
CPU: Ryzen 7 3600
RAM: 16 GB DDR4-2133
Storage: 500 GB SSD and 1 TB HDD

Node machine specs: (2)
CPU: Ryzen 7 3600
RAM: 16 GB DDR4-2133
Storage: 500 GB SSD and 1 TB HDD

System running ClusterODM (a server machine): (3)
CPU: Intel Xeon Silver 4208 @ 2.10 GHz
RAM: 32 GB
HDD: 198 GB

Add support for autoscaling queue

Optional support for an autoscaling queue could allow users to queue tasks even when they hit account limit restrictions, providing a better flow.

Admin commands to handle tasks

It would be nice to have a way to send manual API commands for certain tasks (for example, to query task status, restart tasks, or cancel tasks).
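In the meantime, a workaround sketch is to hit the proxied NodeODM task API directly; these routes come from the NodeODM API, and host, port, and uuid are placeholders:

curl http://clusterodm_host:3000/task/<uuid>/info
curl -X POST -F uuid=<uuid> http://clusterodm_host:3000/task/cancel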

Add Azure support for auto-scaling

What is the problem?

No option to integrate Azure components for storage & compute.

What should be the expected behavior?

One should be able to use Blob Storage / File Share for storing the results, and add a configuration file to auto-scale by spinning up new Azure Container Instances or removing them.

How can we reproduce this? (What steps did you do to trigger the problem? If applicable, please include multiple screenshots of the problem! Be detailed)

It's a feature enhancement request.

warn: Cannot forward task (ID) to processing node (IP):3000: Failed sending data to the peer

I'm experimenting with ClusterODM. I have a two-node test cluster set up and a VM running ClusterODM. Everything seems to be configured correctly; both devices show up correctly on :10000 and in NODE LIST.

When submitting a job with split, it throws the error warn: Cannot forward task (ID) to processing node (IP):3000: Failed sending data to the peer. The error appears for both IPs alternately, with several test datasets of 14 to 986 images. No GCPs on these. I tried the following splits on a 986-image dataset:

50 - error above
100 - error above
400 - error above
500 - the job splits into FIVE parts and only makes it to one node. I'm raising the issue from my mobile, but hopefully the table below comes out OK:

# Node                 Status  Queue  Version  Flags
1 192.168.1.172:3000   Online  5/4    1.5.2
2 192.168.1.173:3000   Online  0/4    1.5.2

Add ASR config CLI commands

This would allow a person to change the configuration of the ASR at runtime without a restart.

  • Add/set keys
  • Delete keys
  • Write config to file (make changes permanent)
  • Restore original config (?)
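A hypothetical syntax sketch for these, mirroring the existing NODE commands (none of these exist yet; names invented here):

> ASR SET <key> <value>
> ASR DEL <key>
> ASR SAVE
> ASR RESET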

Windows Native ClusterODM

It should be relatively easy to create an executable bundle for running ClusterODM natively, NodeODM-style. We just need to copy the relevant GitHub workflow files and adjust them.

Shared Task / Routes / Nodes Table

Currently, task tables and routes tables are stored as in-process objects. This means a single point of failure, and no ability for the network to have multiple proxies running concurrently on multiple machines.

We should have a shared database of this information (Redis?) so that multiple proxies can stay in sync and spread large numbers of incoming connections.
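A rough sketch of the idea with Redis (key and field names here are invented for illustration):

redis-cli HSET route:<taskId> node 192.168.1.172:3000
redis-cli HGET route:<taskId> node
redis-cli SADD nodes 192.168.1.172:3000 192.168.1.173:3000

Any proxy instance could then resolve a task's route or enumerate nodes without holding local state.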

Add better web interface

It would be nice to have a web UI to show node information (memory, CPU usage, tasks) as well as the ability to do the same tasks as the CLI.

Why do the nodes switch so many times during processing?

122 photos
Options: split: 50, split-overlap: 50
Why do the nodes switch so many times during processing?


Description:
122 pictures are orthomosaicked with a split of 50, and there are four nodes. The submodels are not assigned to a fixed set of three nodes: first nodes 1-2-3 process, then 1-2-4, then 1-4. I want to ask why the node assignment changes like this instead of staying fixed on three nodes.

Like this:

[Four screenshots of the node assignments at successive stages.]

[ERROR] Cluster node offline

[ERROR] Cluster node seems to be offline: HTTPConnectionPool(host='topaz318hn', port=3000): Max retries exceeded with url: /info (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f91f14f1910>: Failed to establish a new connection: [Errno -2] Name or service not known',))

Hi,

I deployed ClusterODM to Docker Swarm.

I hit this issue when I start a task with the split option set. It works if I do not set split and let it run on a single node.

Is there any solution?

Thanks so much,
Tianyang

Auto scaling

We should implement the ability to automatically spin up nodes when none are currently available to process a task immediately.

A provider independent abstract layer should be implemented, as we don't want to be tied to a single cloud solution.

ClusterODM distributed split/merge results vary from single instance Docker split/merge

I'm getting mixed results between normal Docker (single-instance) split/merge output and ClusterODM distributed split/merge.

Single-instance split/merge gives me a full orthophoto from the dataset, no problems, other than taking 3.5 days to run (701 images).

ClusterODM distributed split/merge shows zero data for about half of the imaged area, but it completes the task in about 4 hours. Once you set the transparency band to None in QGIS, you can see that it should cover the entire image, but doesn't.

[Four screenshots from 2019-08-11 comparing the two orthophoto outputs in QGIS.]

Fix - AWS Autoscaling w/ Docker-Machine

Problem

Docker-Machine fails to create node instances from an AWS ASR request under some configurations of the AWS platform. Specifically, it fails for configurations without a default VPC or subnet, and/or for those that use region zones, e.g. us-east-1c, for the VPC/subnet/security group resources, yielding the errors below.

No default VPC/subnet:

(debug-machine) Couldn't determine your account Default VPC ID : "default-vpc is 'none'"
Error setting machine configuration from flags provided: amazonec2 driver requires either the --amazonec2-subnet-id or --amazonec2-vpc-id option or an AWS Account with a default vpc-id

Availability zone mismatch:

Error creating machine: Error in driver during machine creation: Error launching instance: InvalidParameterValue: Value (us-east-1a) for parameter availabilityZone is invalid. Subnet 'subnet-****' is in the availability zone us-east-1c
status code: 400,

Defining a default VPC or subnet requires interaction with AWS support. Defining the zone is straightforward given the parameter in the JSON.

Expectation

What is expected: when running ClusterODM locally via 'node index.js --asr aws.json' (assuming a properly formed aws.json file), the docker-machine invocation should create a new machine, load Docker, and invoke a containerized node.

Reproduction

To reproduce this error, in an AWS environment with the configuration above, launch local ClusterODM with the --asr aws.json config flag reflecting that environment. Telnet to 8080 and run 'asr viewcmd <# images>'. This prints the docker-machine command built from the information in the ASR config. Copy that docker-machine command and attempt to execute it on the command line. If a default VPC and subnet are not set in the AWS environment, the machine will not be created. If the AWS resources are contained in a zone of a region, e.g. a, b, c, the machine will not be created.

The Docker-Machine command line described above provides a more descriptive error than simply using the WebODM interface to launch a processing job; that approach fails with just the output below:

node index.js --asr aws.json
info: ClusterODM 1.5.3 started with PID 14656
info: Starting admin CLI on 8080
warn: No admin CLI password specified, make sure port 8080 is secured
info: Starting admin web interface on 10000
warn: No admin password specified, make sure port 10000 is secured
info: Cloud: LocalCloudProvider
info: ASR: AWSAsrProvider
info: Can write to S3
info: Found docker-machine executable
info: Loaded 1 nodes
info: Loaded 0 routes
info: Starting http proxy on 3000
info: Trying to create machine... (1)
warn: Cannot create machine: Error: docker-machine exited with code 1
info: Trying to create machine... (2)
warn: Cannot create machine: Error: docker-machine exited with code 1

Resolution

The resolution is limited in scope to the definition of the AWSAsrProvider extension of the AbstractASRProvider class in ./ClusterODM/libs/asr-providers/aws.js. Documentation must be updated in ~/ClusterODM/docs/aws.md.

The following page documents the Docker-Machine variable names needed to translate between the JSON naming and the Docker-Machine naming: https://gdevillele.github.io/machine/drivers/aws/
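As a sketch, these are the amazonec2 driver flags (documented at the link above) that the JSON additions would need to feed through; all values are placeholders:

docker-machine create --driver amazonec2 \
  --amazonec2-region us-east-1 \
  --amazonec2-zone c \
  --amazonec2-vpc-id vpc-xxxxxxxx \
  --amazonec2-subnet-id subnet-xxxxxxxx \
  --amazonec2-security-group clusterodm \
  test-node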

Better Docs

The README should have some more information on the various flags.

No Splitting of Jobs

There is no splitting of projects in my cluster.
These are my options:
Options: auto-boundary: true, fast-orthophoto: true, split: 40, split-overlap: 15, rerun-from: dataset

# Node              Status  Queue  Engine     API    Cores  RAM available  Flags
1 20.113.*.3:3000   Online  0/1    odm 2.8.1  2.2.0  4      97.38%
2 20.54.*.0:3000    Online  1/1    odm 2.8.1  2.2.0  4      51.87%
3 20.113.*.*3:3000  Online  0/1    odm 2.8.1  2.2.0  4      97.93%

Only one node is used. Any idea how to change this behavior?
I have a photo set of 760 pics.

Cluster node seems to be offline: HTTPConnectionPool(host='127.0.0.1', port=3001)

There's a weird intermittent connectivity issue for the cluster software:

[ERROR] Cluster node seems to be offline: HTTPConnectionPool(host='127.0.0.1', port=3001): Max retries exceeded with url: /info (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9861261810>: Failed to establish a new connection: [Errno 111] Connection refused',))

It works about 50% of the time, but times out for the rest. ClusterODM is started via systemd and remains running fine:

ExecStart=/usr/bin/node /opt/ClusterODM/index.js -p 3001

When submitting a job, it almost seems like WebODM is trying to brute force a job into ClusterODM, as the number of connections is huge:

root@survey-dev:~# netstat -tapn | grep 3001 | wc -l
269
tcp        0      0 127.0.0.1:43016         127.0.0.1:3001          TIME_WAIT   -               
tcp        0      0 127.0.0.1:42818         127.0.0.1:3001          TIME_WAIT   -               
tcp        0      0 127.0.0.1:42802         127.0.0.1:3001          TIME_WAIT   -               
tcp6       0      0 :::3001                 :::*                    LISTEN      13085/node      
tcp6       0      0 127.0.0.1:3001          127.0.0.1:43050         TIME_WAIT   - 

Telnetting to the port works fine. Server load is very low. Packet loss is zero (it's localhost).

Cluster split fails and raises an exception

700 photos to split

CMD:
docker run -ti -v "/ftpfile/1400/1400:/code/images" opendronemap/odm --split 400 --split-overlap 100 --sm-cluster http://192.168.3.86:3100

cluster: http://192.168.3.86:3100
Config:

#> node list
1) 192.168.3.86:3001 [online] [0/2] <version 1.5.3>
2) 192.168.3.86:3002 [online] [0/2] <version 1.5.3>
3) 192.168.3.86:3003 [online] [0/2] <version 1.5.3>
4) 192.168.3.86:3005 [online] [0/2] <version 1.5.3>
5) 192.168.3.86:3006 [online] [0/2] <version 1.5.3>
6) 192.168.3.86:3009 [online] [0/2] <version 1.5.3>

EXCEPTION:

OpenCV Error: Assertion failed (data0.dims <= 2 && type == CV_32F && K > 0) in kmeans, file /code/SuperBuild/src/opencv/modules/core/src/matrix.cpp, line 2701
Traceback (most recent call last):
  File "/code/SuperBuild/src/opensfm/bin/opensfm", line 34, in <module>
    command.run(args)
  File "/code/SuperBuild/src/opensfm/opensfm/commands/create_submodels.py", line 30, in run
    self._cluster_images(meta_data, data.config['submodel_size'])
  File "/code/SuperBuild/src/opensfm/opensfm/commands/create_submodels.py", line 100, in _cluster_images
    labels, centers = tools.kmeans(positions, K)[1:]
  File "/code/SuperBuild/src/opensfm/opensfm/large/tools.py", line 34, in kmeans
    return cv2.kmeans(samples, nclusters, criteria, attempts, flags)
cv2.error: /code/SuperBuild/src/opencv/modules/core/src/matrix.cpp:2701: error: (-215) data0.dims <= 2 && type == CV_32F && K > 0 in function kmeans

Traceback (most recent call last):
  File "/code/run.py", line 56, in <module>
    app.execute()
  File "/code/stages/odm_app.py", line 93, in execute
    self.first_stage.run()
  File "/code/opendm/types.py", line 376, in run
    self.next_stage.run(outputs)
  File "/code/opendm/types.py", line 357, in run
    self.process(self.args, outputs)
  File "/code/stages/splitmerge.py", line 65, in process
    octx.run("create_submodels")
  File "/code/opendm/osfm.py", line 21, in run
    (context.opensfm_path, command, self.opensfm_project_path))
  File "/code/opendm/system.py", line 76, in run
    raise Exception("Child returned {}".format(retcode))
Exception: Child returned 1

Process exited with code 1

288 pictures

Processing Node: ClusterODM (auto)
Options: split: 80, split-overlap: 80
1) 192.168.3.86:3001 [offline] [0/2] <version 1.5.3>
2) 192.168.3.155:3001 [online] [0/2] <version 1.5.3>
3) 192.168.3.24:3001 [online] [0/2] <version 1.5.3>
[WARNING] LRE: submodel_0001 failed with: (ac5bb764-14e0-43ef-a8de-58861b7d0f52) failed with task output: self.process(self.args, outputs)
File "/code/stages/odm_meshing.py", line 72, in process
smooth_dsm=not args.fast_orthophoto)
File "/code/opendm/mesh.py", line 35, in create_25dmesh
apply_smoothing=smooth_dsm
File "/code/opendm/dem/commands.py", line 236, in create_dem
'{merged_vrt} {geotiff}'.format(**kwargs))
File "/code/opendm/system.py", line 76, in run
raise Exception("Child returned {}".format(retcode))
Exception: Child returned 1
Full log saved at /var/www/data/cd504d84-75e8-44c9-9159-6e6f01563722/submodels/submodel_0001/error.log
[INFO]    LRE: Cleaning up remote task (ac5bb764-14e0-43ef-a8de-58861b7d0f52)... OK
[INFO]    LRE: submodel_0002 (1171854e-aaf5-4537-aa83-912e1095a3bd) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0002 (1171854e-aaf5-4537-aa83-912e1095a3bd) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0002 (1171854e-aaf5-4537-aa83-912e1095a3bd) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0002 (1171854e-aaf5-4537-aa83-912e1095a3bd) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0002 (1171854e-aaf5-4537-aa83-912e1095a3bd) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[WARNING] LRE: submodel_0002 failed with: (1171854e-aaf5-4537-aa83-912e1095a3bd) failed with task output: File "/code/opendm/types.py", line 376, in run
self.next_stage.run(outputs)
File "/code/opendm/types.py", line 376, in run
self.next_stage.run(outputs)
File "/code/opendm/types.py", line 357, in run
self.process(self.args, outputs)
File "/code/stages/mve.py", line 129, in process
raise e
Exception: Child returned 137
Full log saved at /var/www/data/cd504d84-75e8-44c9-9159-6e6f01563722/submodels/submodel_0002/error.log
[INFO]    LRE: Cleaning up remote task (1171854e-aaf5-4537-aa83-912e1095a3bd)... OK
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[WARNING] LRE: submodel_0003 failed with: (08cba982-e75f-4b0c-a5c6-a642d0129278) failed with task output: File "/code/opendm/types.py", line 376, in run
self.next_stage.run(outputs)
File "/code/opendm/types.py", line 357, in run
self.process(self.args, outputs)
File "/code/stages/mvstex.py", line 97, in process
'-n {nadirWeight}'.format(**kwargs))
File "/code/opendm/system.py", line 76, in run
raise Exception("Child returned {}".format(retcode))
Exception: Child returned 134
Full log saved at /var/www/data/cd504d84-75e8-44c9-9159-6e6f01563722/submodels/submodel_0003/error.log
[INFO]    LRE: Cleaning up remote task (08cba982-e75f-4b0c-a5c6-a642d0129278)... OK
[INFO]    LRE: No remote tasks left to cleanup
Traceback (most recent call last):
File "/code/run.py", line 56, in <module>
app.execute()
File "/code/stages/odm_app.py", line 93, in execute
self.first_stage.run()
File "/code/opendm/types.py", line 376, in run
self.next_stage.run(outputs)
File "/code/opendm/types.py", line 357, in run
self.process(self.args, outputs)
File "/code/stages/splitmerge.py", line 153, in process
lre.run_toolchain()
File "/code/opendm/remote.py", line 57, in run_toolchain
self.run(ToolchainTask)
File "/code/opendm/remote.py", line 251, in run
raise nonloc.error
pyodm.exceptions.TaskFailedError: (08cba982-e75f-4b0c-a5c6-a642d0129278) failed with task output: File "/code/opendm/types.py", line 376, in run
self.next_stage.run(outputs)
File "/code/opendm/types.py", line 357, in run
self.process(self.args, outputs)
File "/code/stages/mvstex.py", line 97, in process
'-n {nadirWeight}'.format(**kwargs))
File "/code/opendm/system.py", line 76, in run
raise Exception("Child returned {}".format(retcode))
Exception: Child returned 134
Full log saved at /var/www/data/cd504d84-75e8-44c9-9159-6e6f01563722/submodels/submodel_0003/error.log

Description:
288 pictures are split into submodels of 80. At 01:34:45 into the run, it fails with "Process exited with code 1", yet 174 of the pictures can be successfully stitched.
The environment is three machines with three nodes.
I'm asking for help finding the cause from the exception.

Thanks.

Autoscaling sometimes fails

pyodm.exceptions.TaskFailedError: (7e4b9fc9-902d-4865-ae2d-9d20a78b4fa2) failed with task output: Failed sending data to the peer

Running OpenDroneMap on HPC

Purposes:
1a. Implement rootless containers to run ODM on HPC, using Singularity (for ODM) and Podman (for NodeODM).
1b. Use binary files if we cannot proceed with 1a.
2. Connect ClusterODM to NodeODM in an HPC environment, probably through a proxy.
3. Use SLURM to dynamically assign tasks between different NodeODM nodes (see the sketch after this list).
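A minimal sketch of purposes 1a and 3 combined, assuming Singularity can run the NodeODM Docker image and pass arguments through to it (paths and resource numbers are placeholders):

#!/bin/bash
#SBATCH --job-name=nodeodm
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G

# One-time: singularity pull docker://opendronemap/nodeodm
# Start a rootless NodeODM on this allocation; ClusterODM would then NODE ADD it.
singularity run --bind "$HOME/nodeodm-data:/var/www/data" nodeodm_latest.sif --port 3000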

Return response faster when processing /task/new

After the upload has finished, it takes some time to forward the input images to one of the nodes.

This causes the upload call to "hang" for a little while. It would be better if we returned a status of running and assigned a task ID right away.

This will require modifying NodeODM to support ID suggestions.
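A hypothetical sketch of the proposed flow (not current behavior; today the ID is only known after forwarding):

# Proxy would reply with a pre-assigned ID immediately:
curl -F images=@IMG_0001.JPG http://clusterodm_host:3000/task/new
# -> {"uuid":"<pre-assigned>"}; images are then forwarded in the background,
#    with NodeODM asked to reuse the suggested ID.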

How to load-balance the queue?

I want processing to happen on every node, but currently I see:

| # | Node | Status | Queue | Engine | API | Flags |
| -- | -- | -- | -- | -- | -- | -- |
| 1 | 192.168.0.247:3000 | Online | 1/4 | odm 2.4.3 | 2.1.4 | |
| 2 | 192.168.0.248:3000 | Online | 0/4 | odm 2.4.3 | 2.1.4 | |
| 3 | 192.168.0.246:3000 | Online | 0/4 | odm 2.4.3 | 2.1.4 | |

One task takes up only one queue slot, but I would like it to queue on all nodes so processing is very fast. Is there any scheduling provision we can use to schedule it across all the queues on multiple nodes?

I do see split_merge options, but I don't see any documentation for them: https://github.com/OpenDroneMap/ClusterODM/blob/master/config.js#L40
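For context, a single task only ever occupies one slot on one node; the way to fan one dataset out across nodes is split-merge. A sketch, mirroring the command used elsewhere in these issues (split size and hosts are placeholders to tune):

docker run -ti -v "/path/to/images:/code/images" opendronemap/odm \
  --split 200 --split-overlap 50 --sm-cluster http://clusterodm_host:3000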

Thank You.

Add --node-priority flag

Add a --node-priority flag for optimizing node selection according to various policies.

Possible values:

  • least cost
  • round robin
  • ...
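Hypothetical usage once implemented (the value name is invented here):

node index.js --node-priority round-robin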

docker-machine is not present in the image from the last build (07/02/2022)

What is the problem?

In the last build, there is an XML file in place of the docker-machine binary:

<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>9X64YXJWB3NTJXDY</RequestId><HostId>gcYPM2y5ZRki30XSL44791CDe8Nz0kMvQu0PYlmKNUXVMFzebFBYkjlAB8pfiG3WqQ6uygHt1wc=</HostId></Error>

So the autoscaler is not working anymore.

How can we reproduce this?

Just pull the latest image and try to run a docker-machine command.
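A quick check sketch, assuming docker-machine sits on the image's PATH:

docker run --rm --entrypoint docker-machine opendronemap/clusterodm version
# On the broken build this fails, since the "binary" is the AccessDenied XML above.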
