
clusterodm's Introduction

ClusterODM

A reverse proxy, load balancer and task tracker with optional cloud autoscaling capabilities for NodeODM API compatible nodes. In a nutshell, it's a program that links multiple NodeODM API compatible nodes together under a single network address. It distributes tasks across multiple nodes while taking into consideration factors such as the maximum number of images, queue size and slot availability. It can also automatically spin nodes up and down based on demand using cloud computing providers (currently DigitalOcean, Hetzner, Scaleway or Amazon Web Services).


The program has been battle tested on the WebODM Lightning Network for quite some time and has proven reliable in processing thousands of datasets. However, if you find bugs, please report them.

Installation

The only requirement is a working installation of NodeJS 14 or earlier (ClusterODM has compatibility issues with NodeJS 16 and later).

git clone https://github.com/OpenDroneMap/ClusterODM
cd ClusterODM
npm install
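Given the version constraint above, a quick sanity check before installing can save time. A small sketch (it only assumes node is on your PATH):

```shell
# Print the NodeJS major version and warn if it is newer than 14
major=$(node -p 'process.versions.node.split(".")[0]')
if [ "$major" -gt 14 ]; then
  echo "Warning: NodeJS $major detected; ClusterODM is known to work with 14 or earlier"
fi
echo "NodeJS major version: $major"
```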

There's also a docker image available at opendronemap/clusterodm and a native Windows bundle.

Usage

First, start the program:

node index.js [parameters]

Or with docker:

docker run --rm -ti -p 3000:3000 -p 8080:8080 opendronemap/clusterodm [parameters]

Or with apptainer, after cd'ing into the ClusterODM directory:

apptainer run docker://opendronemap/clusterodm [parameters]

Then connect to the CLI and add new NodeODM instances:

telnet localhost 8080
> HELP
> NODE ADD nodeodm-host 3001
> NODE LIST

Finally, use a web browser to connect to http://localhost:3000. A normal NodeODM UI should appear. This means the application is working, as web requests are being properly forwarded to nodes.

You can also check the status of nodes via a web interface available at http://localhost:10000.

See node index.js --help for all parameter options.

Autoscale Setup

ClusterODM can spin up/down nodes based on demand. This allows users to reduce the costs associated with always-on instances and to scale processing based on demand.

To set up autoscaling you must:

  • Make sure docker-machine is installed and available in your PATH.
  • Have access to an S3-compatible bucket for storing results.
  • Create a configuration file for your provider (see the provider-specific guides in the docs folder, e.g. docs/aws.md).

You can then launch ClusterODM with:

node index.js --asr configuration.json

You should see the following messages in the console:

info: ASR: DigitalOceanAsrProvider
info: Can write to S3
info: Found docker-machine executable

You should always have at least one static NodeODM node attached to ClusterODM, even if you plan to use the autoscaler for all processing. In other words, you can't have zero nodes and rely 100% on the autoscaler: ClusterODM needs a "reference node" to handle certain requests (forwarding the UI, validating options prior to spinning up an instance, etc.). For this purpose, you should add a "dummy" NodeODM node and lock it:

telnet localhost 8080
> NODE ADD localhost 3001
> NODE LOCK 1
> NODE LIST
1) localhost:3001 [online] [0/2] <version 1.5.1> [L]

This way all tasks will be automatically forwarded to the autoscaler.

A docker-compose file is available to automatically set up both ClusterODM and NodeODM on the same machine by issuing:

docker-compose up
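The repository ships its own docker-compose.yml; as a rough sketch of what such a file wires together (the service names and host-port mappings below are illustrative, not the shipped file):

```yaml
version: '2.1'
services:
  nodeodm:
    image: opendronemap/nodeodm
    ports:
      - "3001:3000"      # NodeODM API
  clusterodm:
    image: opendronemap/clusterodm
    ports:
      - "3000:3000"      # proxy
      - "8080:8080"      # admin CLI (telnet)
      - "10000:10000"    # admin web interface
    depends_on:
      - nodeodm
```

After the services are up, attach the NodeODM instance through the admin CLI as shown earlier (unless the compose file shipped with the repository already does so).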

Windows Bundle

ClusterODM can run as a self-contained executable on Windows without the need for additional dependencies. You can download the latest clusterodm-windows-x64.zip bundle from the releases page. Extract the contents in a folder and run:

clusterodm.exe

HPC set up with SLURM

If you are on an HPC system, you can write a SLURM script that schedules and starts NodeODM on the available nodes so that ClusterODM can be wired to them. Using SLURM decreases the time and the number of manual steps needed to set up the nodes each time, and gives users an easier way to run ODM on HPC.

To follow this setup, first make sure SLURM is installed on your cluster.

The SLURM script will differ from cluster to cluster, depending on which nodes your cluster has. The main idea is to run NodeODM once on each node; by default, each NodeODM instance listens on port 3000. Apptainer takes the first available port starting from 3000, so if port 3000 is open on a node, NodeODM will run there by default. After that, run ClusterODM on the head node and connect the running NodeODM instances to it. With that, you will have a functional ClusterODM running on the HPC.

Here is an example of SLURM script assigning nodes 48, 50, 51 to run NodeODM. You can freely change and use it depending on your system:

(image: example SLURM script)
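As a rough sketch only (the node names, job parameters and apptainer invocation below are assumptions; adapt them to your cluster), such a script might look like:

```bash
#!/bin/bash
#SBATCH --job-name=nodeodm
#SBATCH --nodes=3
#SBATCH --nodelist=node48,node50,node51
#SBATCH --ntasks-per-node=1

# Launch one NodeODM instance per allocated node (listening on port 3000 by default)
srun apptainer run docker://opendronemap/nodeodm
```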

You can check for available nodes using sinfo:

sinfo

Run the following command to schedule using the SLURM script:

sbatch sample.slurm

You can also check for currently running jobs using squeue:

squeue -u $USER

Unfortunately, SLURM does not handle assigning jobs to the head node. Hence, if you want to run ClusterODM on the head node, you have to run it there yourself. After that, connect to the CLI and wire the NodeODM instances to ClusterODM. Here is an example following the sample SLURM script:

telnet localhost 8080
> NODE ADD node48 3000
> NODE ADD node50 3000
> NODE ADD node51 3000
> NODE LIST

If ClusterODM does not wire up correctly, always double-check which ports the NodeODM instances are actually running on.

It is also possible to pre-populate nodes using JSON. If starting ClusterODM from apptainer or docker, the relevant JSON is available at docker/data/nodes.json. Contents might look similar to the following:

[
        {"hostname":"node48","port":"3000","token":""},
        {"hostname":"node50","port":"3000","token":""},
        {"hostname":"node51","port":"3000","token":""}
]
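For a longer host list, this file can also be generated rather than written by hand. A small sketch (the host names match the example above and are assumptions):

```shell
#!/bin/bash
# Emit a docker/data/nodes.json entry for each NodeODM host, all on port 3000
hosts=(node48 node50 node51)
{
  echo "["
  last=$((${#hosts[@]} - 1))
  for i in "${!hosts[@]}"; do
    comma=","
    [ "$i" -eq "$last" ] && comma=""
    printf '        {"hostname":"%s","port":"3000","token":""}%s\n' "${hosts[$i]}" "$comma"
  done
  echo "]"
} > nodes.json
cat nodes.json
```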

Once ClusterODM is hosted on the head node and wired to the NodeODM instances, you can use SSH tunneling to check that it works as expected. Open another shell on your local machine and tunnel to the HPC using the following command:

ssh -L localhost:10000:localhost:10000 user@hostname

Replace user and hostname with your username and the HPC address. This command tunnels port 10000 of the HPC to the same port on your local machine. After this, open a browser locally and connect to http://localhost:10000, which is where ClusterODM's administrative web interface is hosted. This is what it looks like:

(image: ClusterODM administrative web interface)

Here you can check the status of the NodeODM instances and even add or delete working nodes.

After that, tunnel port 3000 of the HPC to your local machine:

ssh -L localhost:3000:localhost:3000 user@hostname

Port 3000 is ClusterODM's proxy; this is where tasks are assigned. Once again, connect to http://localhost:3000 with your browser after tunneling. Here you can assign tasks and observe their progress.


After adding images in this browser, you can press Start Task and see ClusterODM assigning tasks to the nodes you have wired to. Go for a walk and check the progress.

Roadmap

We have plenty of goals. If you want to help, or need help getting started contributing, get in touch on the OpenDroneMap community forum.

clusterodm's People

Contributors

abrbhat, danbjoseph, fjeannot, gromain, jeongyong-park, mateo3d, mojodna, nghi01, okaluza, pierotofy, rrowlands, russss, saijin-naib, smathermather, trick-1


clusterodm's Issues

Why do nodes switch many times during processing?

122 photos
Options: split: 50, split-overlap: 50
Why do the assigned nodes switch so many times during processing?


Description:
122 pictures are orthomosaicked with split: 50. There are four nodes. The submodels are not allocated to a fixed set of three nodes: they go to nodes 1-2-3, then 1-2-4, then 1-4 for processing. I want to ask why the nodes change this way instead of staying fixed on three nodes.

(four screenshots of the changing node assignments were attached)

Running OpenDroneMap on HPC

Purposes:
1a. Implement rootless containers to run ODM on HPC, using Singularity (for ODM) and Podman (for NodeODM).
1b. Use binary files if 1a is not feasible.
2. Connect ClusterODM to NodeODM in the HPC environment, probably with a proxy.
3. Use SLURM to dynamically assign tasks between the different NodeODM nodes.

Better Docs

The README should have some more information on the various flags.

Setting up ClusterODM setup using WebODM

Hey everyone,
I am struggling to make ClusterODM work with WebODM. I have set up 3 nodes running NodeODM, and set up ClusterODM on a server-spec machine (no particular reason for choosing a server machine for this task). As the guide says, I connected all the nodes to ClusterODM using telnet, and I can see the nodes are online. Now how do I use this cluster setup from WebODM? As per the documentation at https://docs.opendronemap.org/large.html#distributed-split-merge, in CLI ODM you need to add --sm-cluster http://cluster_odm_ip:3001.

My question again is: how do I trigger this action in WebODM? In the custom settings option in WebODM there is a field to enter the URL of ClusterODM, but I haven't had any luck with that. What I need to know is: at what processing stage does ODM create sub-models and distribute them to different nodes?

Setup:
WebODM (SYSTEM (1)) --> ClusterODM (SYSTEM (3)) +-------- NodeODM
                                                +-------- NodeODM
                                                +-------- NodeODM

System specifications:

Machine running WebODM: (1)
CPU: Ryzen 7 3600
RAM: 16GB DDR4 2133
500GB SSD and 1TB HDD

Node machine specs: (2)
CPU: Ryzen 7 3600
RAM: 16GB DDR4 2133
500GB SSD and 1TB HDD

System running ClusterODM (a server machine): (3)
CPU: Intel Xeon Silver 4208 2.10 GHz
RAM: 32GB
HDD: 198GB

ClusterODM distributed split/merge results vary from single instance Docker split/merge

I'm getting mixed results between the normal Docker (single instance) split/merge output and the ClusterODM distributed split/merge.

Single-instance split/merge gives me a full orthophoto from the dataset, no problems (other than taking 3.5 days to run on 701 images).

ClusterODM distributed split/merge shows zero data for about half of the imaged area, but it completes the task in about 4 hours. Once you set the transparency band to None in QGIS, you can see that it should have the entire image, but doesn't.

(four QGIS screenshots were attached)

Process exited with code 1

288 pictures

Processing Node: ClusterODM (auto)
Options: split: 80, split-overlap: 80
1) 192.168.3.86:3001 [offline] [0/2] <version 1.5.3>
2) 192.168.3.155:3001 [online] [0/2] <version 1.5.3>
3) 192.168.3.24:3001 [online] [0/2] <version 1.5.3>
[WARNING] LRE: submodel_0001 failed with: (ac5bb764-14e0-43ef-a8de-58861b7d0f52) failed with task output: self.process(self.args, outputs)
File "/code/stages/odm_meshing.py", line 72, in process
smooth_dsm=not args.fast_orthophoto)
File "/code/opendm/mesh.py", line 35, in create_25dmesh
apply_smoothing=smooth_dsm
File "/code/opendm/dem/commands.py", line 236, in create_dem
'{merged_vrt} {geotiff}'.format(**kwargs))
File "/code/opendm/system.py", line 76, in run
raise Exception("Child returned {}".format(retcode))
Exception: Child returned 1
Full log saved at /var/www/data/cd504d84-75e8-44c9-9159-6e6f01563722/submodels/submodel_0001/error.log
[INFO]    LRE: Cleaning up remote task (ac5bb764-14e0-43ef-a8de-58861b7d0f52)... OK
[INFO]    LRE: submodel_0002 (1171854e-aaf5-4537-aa83-912e1095a3bd) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0002 (1171854e-aaf5-4537-aa83-912e1095a3bd) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0002 (1171854e-aaf5-4537-aa83-912e1095a3bd) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0002 (1171854e-aaf5-4537-aa83-912e1095a3bd) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0002 (1171854e-aaf5-4537-aa83-912e1095a3bd) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[WARNING] LRE: submodel_0002 failed with: (1171854e-aaf5-4537-aa83-912e1095a3bd) failed with task output: File "/code/opendm/types.py", line 376, in run
self.next_stage.run(outputs)
File "/code/opendm/types.py", line 376, in run
self.next_stage.run(outputs)
File "/code/opendm/types.py", line 357, in run
self.process(self.args, outputs)
File "/code/stages/mve.py", line 129, in process
raise e
Exception: Child returned 137
Full log saved at /var/www/data/cd504d84-75e8-44c9-9159-6e6f01563722/submodels/submodel_0002/error.log
[INFO]    LRE: Cleaning up remote task (1171854e-aaf5-4537-aa83-912e1095a3bd)... OK
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[INFO]    LRE: submodel_0003 (08cba982-e75f-4b0c-a5c6-a642d0129278) is still running
[WARNING] LRE: submodel_0003 failed with: (08cba982-e75f-4b0c-a5c6-a642d0129278) failed with task output: File "/code/opendm/types.py", line 376, in run
self.next_stage.run(outputs)
File "/code/opendm/types.py", line 357, in run
self.process(self.args, outputs)
File "/code/stages/mvstex.py", line 97, in process
'-n {nadirWeight}'.format(**kwargs))
File "/code/opendm/system.py", line 76, in run
raise Exception("Child returned {}".format(retcode))
Exception: Child returned 134
Full log saved at /var/www/data/cd504d84-75e8-44c9-9159-6e6f01563722/submodels/submodel_0003/error.log
[INFO]    LRE: Cleaning up remote task (08cba982-e75f-4b0c-a5c6-a642d0129278)... OK
[INFO]    LRE: No remote tasks left to cleanup
Traceback (most recent call last):
File "/code/run.py", line 56, in <module>
app.execute()
File "/code/stages/odm_app.py", line 93, in execute
self.first_stage.run()
File "/code/opendm/types.py", line 376, in run
self.next_stage.run(outputs)
File "/code/opendm/types.py", line 357, in run
self.process(self.args, outputs)
File "/code/stages/splitmerge.py", line 153, in process
lre.run_toolchain()
File "/code/opendm/remote.py", line 57, in run_toolchain
self.run(ToolchainTask)
File "/code/opendm/remote.py", line 251, in run
raise nonloc.error
pyodm.exceptions.TaskFailedError: (08cba982-e75f-4b0c-a5c6-a642d0129278) failed with task output: File "/code/opendm/types.py", line 376, in run
self.next_stage.run(outputs)
File "/code/opendm/types.py", line 357, in run
self.process(self.args, outputs)
File "/code/stages/mvstex.py", line 97, in process
'-n {nadirWeight}'.format(**kwargs))
File "/code/opendm/system.py", line 76, in run
raise Exception("Child returned {}".format(retcode))
Exception: Child returned 134
Full log saved at /var/www/data/cd504d84-75e8-44c9-9159-6e6f01563722/submodels/submodel_0003/error.log

Description:
288 pictures are split into submodels of 80. When the run reaches 01:34:45, the error "Process exited with code 1" appears, but 174 of the images can be stitched successfully.
The environment is three machines with three nodes.
Asking for help to find the reason for the exception.

Thanks.

Cleanup did not run after the task finished when using AWS auto-scaling

hello, geniuses,

Cleanup did not run even after the task finished when using AWS auto-scaling.

When I used it last year, the instances were automatically shut down after the work was finished. I'm using it for the first time this year, and the instances aren't shutting down.

I run ClusterODM via docker-compose, and the NodeODM version is "2.4.10" for compatibility with my WebODM 1.8.1.
(It looks like there is also some problem between WebODM 1.8.1 and NodeODM 2.5.0: tasks do not succeed even though all processing looks done.)

(screenshot)

**All tasks were finished.**

(screenshot)

**Instances still online even after a long time...**

My docker-compose file:
version: '2.1'
services:
  nodeodm-1:
    image: opendronemap/nodeodm:2.4.10
    container_name: nodeodm-1
    ports:
      - "3000"
    restart: on-failure:10
    oom_score_adj: 500
    entrypoint: /usr/bin/node /var/www/index.js --max_images 1 --max_concurrency 1 --max_runtime 0 -q 0
  clusterodm:
    image: opendronemap/clusterodm
    container_name: clusterodm
    ports:
      - "80:3000"
      - "8082:8080"
      - "10000:10000"
    volumes:
      - ./docker/data:/var/www/data
      - ./config.json:/var/www/config-default.json  
      - ./asr-configuration.json:/var/www/configuration.json
    restart: on-failure:10
    depends_on:
      - nodeodm-1

Autoscaling sometimes fails

pyodm.exceptions.TaskFailedError: (7e4b9fc9-902d-4865-ae2d-9d20a78b4fa2) failed with task output: Failed sending data to the peer

No Splitting of Jobs

There is no splitting of projects in my cluster.
These are my options:
Options: auto-boundary: true, fast-orthophoto: true, split: 40, split-overlap: 15, rerun-from: dataset

# | Node | Status | Queue | Engine | API | CPU Cores | RAM available | Flags
1 | 20.113.*.3:3000 | Online | 0/1 | odm 2.8.1 | 2.2.0 | 4 | 97.38% |
2 | 20.54.*.0:3000 | Online | 1/1 | odm 2.8.1 | 2.2.0 | 4 | 51.87% |
3 | 20.113.*.*3:3000 | Online | 0/1 | odm 2.8.1 | 2.2.0 | 4 | 97.93% |

Only one node is used. Any idea how to change this behavior?
I have a photo set of 760 pics.

Admin commands to handle tasks

Would be nice to have a way to send manual API commands for certain tasks (for example, to query task status, restart tasks, cancel tasks, etc.)

warn: Cannot forward task (ID) to processing node (IP):3000: Failed sending data to the peer

Experimenting with ClusterODM. I have a two-node test cluster and a VM with ClusterODM running on it. Everything seems to be set up correctly; both devices show correctly in :10000 and with NODE LIST.

When submitting a job with split, it throws the error warn: Cannot forward task (ID) to processing node (IP):3000: Failed sending data to the peer. The error appears with both IPs alternately, with several test datasets from 14 to 986 images. No GCPs on these. I tried the following splits on a 986-image dataset:

50 - error above
100 - error above
400 - error above
500 - job splits into FIVE parts and they only make it to one node. Raising the issue via my mobile but hopefully the below table comes out OK:

Node 	Status 	Queue 	Version 	Flags

1 192.168.1.172:3000 Online 5/4 1.5.2
2 192.168.1.173:3000 Online 0/4 1.5.2

Windows Native ClusterODM

It should be relatively easy to create an executable bundle for running ClusterODM natively, NodeODM-style. We just need to copy the relevant GitHub workflow files and adjust them.

Add --node-priority flag

Add --node-priority flag for optimizing according to various logics.

Possible values:

  • least cost
  • round robin
  • ...

ClusterODM is dropping a high number of uploads

What is the problem?

Since yesterday my WebODM has been constantly failing all tasks after I restarted it. I noticed it pulled a newer image from DockerHub, and there are no previous versions available on DockerHub.
After investigating, I noticed that ClusterODM is closing a lot of POST HTTP requests on the routes /task/new/upload/<task_id>.
The error message displayed in WebODM is sometimes Connection error: HTTPSConnectionPool(host='example.com', port=443): Read timed out. (read timeout=30) and other times just a 502.

Even the smallest jobs are failing; I had this issue with a dataset containing only 5 images.

On the web interface of ClusterODM, I can still launch a task, but during the uploads I get a lot of messages saying Upload of IMG_NAME.jpg failed, retrying...

After seeing this, I made a clean install of my entire stack (WebODM webapp & worker, ClusterODM and one locked NodeODM for the autoscaler) on totally different infrastructure and had the exact same problem.

What should be the expected behavior?

Uploading files through the WebODM or ClusterODM UI should work.

How can we reproduce this? (What steps did you do to trigger the problem? If applicable, please include multiple screenshots of the problem! Be detailed)

Install WebODM and ClusterODM and try to upload files to launch a task.
My current installation is on a Kubernetes cluster hosted on Scaleway. I can provide the manifests I'm using if needed.
WebODM version: 1.9.11
ClusterODM version: latest on DockerHub

AWS S3 bucket creation + autoscaling

Hello,
I have been testing ClusterODM on local machines plus some AWS instances. Since ClusterODM supports autoscaling (creating and destroying nodes on demand), I would like to know a bit more about the process of setting it up. The description provided in https://github.com/OpenDroneMap/ClusterODM/blob/master/docs/aws.md is not very clear for an AWS newbie like me.
So from there, what I understand is:

  1. Given that I have an AWS account, I need to create an S3 bucket with unblocked access.
  2. Then I guess I need to deploy ClusterODM on an instance in AWS?
  3. Also, I did not quite understand what this means:

Select an AMI (machine image) to run - Ubuntu has a handy AMI finder.

[Feature Request]: `docker-machine` is deprecated, we need to add support for another tool for autoscaling

docker-machine is being deprecated and is starting to disappear from repositories. It's also not compatible with the latest Docker version. See docker/machine#4537 and docker/roadmap#245.

One alternative would be to move towards Terraform. There is a provider for Scaleway, so we wouldn't lose functionality. Is there any interest in moving towards this solution?

What is the problem?

docker-machine being deprecated.

What should be the expected behavior?

Not losing the autoscale!

How can we reproduce this? (What steps did you do to trigger the problem? If applicable, please include multiple screenshots of the problem! Be detailed)

Try installing docker-machine on a recent OS.

[ERROR] Cluster node offline

[ERROR] Cluster node seems to be offline: HTTPConnectionPool(host='topaz318hn', port=3000): Max retries exceeded with url: /info (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f91f14f1910>: Failed to establish a new connection: [Errno -2] Name or service not known',))

Hi,

I deployed the clusterODM to Docker Swarm.

I found this issue when I start a task with split set. It works if I do not set split and let the task run on one node.

Is there any solution?

Thanks so much,
Tianyang

How to load balance queue?

I want processing to happen on every node, but currently I see:

# | Node | Status | Queue | Engine | API | Flags
-- | -- | -- | -- | -- | -- | --

1 | 192.168.0.247:3000 | Online | 1/4 | odm 2.4.3 | 2.1.4 |  
2 | 192.168.0.248:3000 | Online | 0/4 | odm 2.4.3 | 2.1.4 |  
3 | 192.168.0.246:3000 | Online | 0/4 | odm 2.4.3 | 2.1.4 |  

One task takes up only 1 queue slot, but I would like it to queue on all nodes so processing is very fast. Is there any scheduling provision we can use to distribute work across the queues on multiple nodes?

I do see the split_merge option, but I don't see any documentation for it: https://github.com/OpenDroneMap/ClusterODM/blob/master/config.js#L40

Thank You.

Return response faster when processing /task/new

After upload has finished, it takes some time to forward the input images to one of the nodes.

This causes the upload call to "hang" for a little while. It would be better if we returned a status of running and assigned a task ID right away.

This will require modifying nodeodm to support ID suggestions.

Add better web interface

It would be nice to have a web UI to show node information (memory, CPU usage, tasks) as well as the ability to do the same tasks as the CLI.

Cluster split fails and raises an exception!

700 photos to split

CMD:
docker run -ti -v "/ftpfile/1400/1400:/code/images" opendronemap/odm --split 400 --split-overlap 100 --sm-cluster http://192.168.3.86:3100

cluster: http://192.168.3.86:3100
config :

#> node list
1) 192.168.3.86:3001 [online] [0/2] <version 1.5.3>
2) 192.168.3.86:3002 [online] [0/2] <version 1.5.3>
3) 192.168.3.86:3003 [online] [0/2] <version 1.5.3>
4) 192.168.3.86:3005 [online] [0/2] <version 1.5.3>
5) 192.168.3.86:3006 [online] [0/2] <version 1.5.3>
6) 192.168.3.86:3009 [online] [0/2] <version 1.5.3>

EXCEPTION:

OpenCV Error: Assertion failed (data0.dims <= 2 && type == CV_32F && K > 0) in kmeans, file /code/SuperBuild/src/opencv/modules/core/src/matrix.cpp, line 2701
Traceback (most recent call last):
  File "/code/SuperBuild/src/opensfm/bin/opensfm", line 34, in <module>
    command.run(args)
  File "/code/SuperBuild/src/opensfm/opensfm/commands/create_submodels.py", line 30, in run
    self._cluster_images(meta_data, data.config['submodel_size'])
  File "/code/SuperBuild/src/opensfm/opensfm/commands/create_submodels.py", line 100, in _cluster_images
    labels, centers = tools.kmeans(positions, K)[1:]
  File "/code/SuperBuild/src/opensfm/opensfm/large/tools.py", line 34, in kmeans
    return cv2.kmeans(samples, nclusters, criteria, attempts, flags)
cv2.error: /code/SuperBuild/src/opencv/modules/core/src/matrix.cpp:2701: error: (-215) data0.dims <= 2 && type == CV_32F && K > 0 in function kmeans

Traceback (most recent call last):
  File "/code/run.py", line 56, in <module>
    app.execute()
  File "/code/stages/odm_app.py", line 93, in execute
    self.first_stage.run()
  File "/code/opendm/types.py", line 376, in run
    self.next_stage.run(outputs)
  File "/code/opendm/types.py", line 357, in run
    self.process(self.args, outputs)
  File "/code/stages/splitmerge.py", line 65, in process
    octx.run("create_submodels")
  File "/code/opendm/osfm.py", line 21, in run
    (context.opensfm_path, command, self.opensfm_project_path))
  File "/code/opendm/system.py", line 76, in run
    raise Exception("Child returned {}".format(retcode))
Exception: Child returned 1

Add ASR config CLI commands

This would allow a person to change the configuration of the ASR at runtime without a restart.

  • Add/set keys
  • Delete keys
  • Write config to file (make changes permanent)
  • Restore original config (?)

Auto scaling

We should implement the ability to automatically spin-up nodes if no nodes are currently available to process a task immediately.

A provider independent abstract layer should be implemented, as we don't want to be tied to a single cloud solution.

Fix - AWS Autoscaling w/ Docker-Machine

Problem

Docker-Machine fails to create node instances from an AWS ASR request in some configurations of the AWS platform. Specifically, it fails for configurations without a default VPC or subnet and/or configurations that use region zones, e.g. us-east-1c, for the VPC/subnet/security-group resources, yielding the errors below.

No default VPC/subnet:

(debug-machine) Couldn't determine your account Default VPC ID : "default-vpc is 'none'"
Error setting machine configuration from flags provided: amazonec2 driver requires either the --amazonec2-subnet-id or --amazonec2-vpc-id option or an AWS Account with a default vpc-id

No availability zone:

Error creating machine: Error in driver during machine creation: Error launching instance: InvalidParameterValue: Value (us-east-1a) for parameter availabilityZone is invalid. Subnet 'subnet-****' is in the availability zone us-east-1c
status code: 400,

Defining a default VPC or subnet requires interaction with AWS support. Defining the zone is straightforward given the parameter in the JSON.

Expectation

What is expected is that, when running ClusterODM locally via node index.js --asr aws.json (assuming a properly formed aws.json file), the docker-machine invocation should create a new machine, load Docker, and invoke a containerized node.

Reproduction

To reproduce this error, in an AWS environment configured as above, launch local ClusterODM with the --asr aws.json config flag reflecting that environment. Telnet to 8080 and run 'asr viewcmd <# images>'. This prints the docker-machine command built from the information in the ASR config. Copy that docker-machine command and attempt to execute it on the command line. If a VPC and subnet are not defaulted in the AWS environment, the machine will not be created. If the AWS resources are contained in a zone of a region (e.g. a, b, c), the machine will not be created.

The docker-machine command line described above provides a more descriptive error than simply using the WebODM interface to launch a processing job, which fails with only the output below:

node index.js --asr aws.json
info: ClusterODM 1.5.3 started with PID 14656
info: Starting admin CLI on 8080
warn: No admin CLI password specified, make sure port 8080 is secured
info: Starting admin web interface on 10000
warn: No admin password specified, make sure port 10000 is secured
info: Cloud: LocalCloudProvider
info: ASR: AWSAsrProvider
info: Can write to S3
info: Found docker-machine executable
info: Loaded 1 nodes
info: Loaded 0 routes
info: Starting http proxy on 3000
info: Trying to create machine... (1)
warn: Cannot create machine: Error: docker-machine exited with code 1
info: Trying to create machine... (2)
warn: Cannot create machine: Error: docker-machine exited with code 1

Resolution

The problem resolution is limited in scope to the definition of the AWSAsrProvider extension of the AbstractASRProvider class in ./ClusterODM/libs/asr-providers/aws.js. Documentation must be updated in ~/ClusterODM/docs/aws.md.

The following reference provides the Docker-Machine variable names needed to translate between the JSON naming and the Docker-Machine naming: https://gdevillele.github.io/machine/drivers/aws/

Cluster node seems to be offline: HTTPConnectionPool(host='127.0.0.1', port=3001)

There's a weird intermittent connectivity issue for the cluster software:

[ERROR] Cluster node seems to be offline: HTTPConnectionPool(host='127.0.0.1', port=3001): Max retries exceeded with url: /info (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9861261810>: Failed to establish a new connection: [Errno 111] Connection refused',))

It works about 50% of the time, but times out for the rest. ClusterODM is started via systemd and remains running fine:

ExecStart=/usr/bin/node /opt/ClusterODM/index.js -p 3001

When submitting a job, it almost seems like WebODM is trying to brute force a job into ClusterODM, as the number of connections is huge:

root@survey-dev:~# netstat -tapn | grep 3001 | wc -l
269
tcp        0      0 127.0.0.1:43016         127.0.0.1:3001          TIME_WAIT   -               
tcp        0      0 127.0.0.1:42818         127.0.0.1:3001          TIME_WAIT   -               
tcp        0      0 127.0.0.1:42802         127.0.0.1:3001          TIME_WAIT   -               
tcp6       0      0 :::3001                 :::*                    LISTEN      13085/node      
tcp6       0      0 127.0.0.1:3001          127.0.0.1:43050         TIME_WAIT   - 

Telnetting to the port works fine. Server load is very low, and packet loss is zero (it's localhost).

Add Azure support for auto-scaling

What is the problem?

No option to integrate Azure components for storage & compute.

What should be the expected behavior?

One should be able to use Blob Storage/File Share for storing the results and add a configuration file to auto-scale by spinning Azure Container Instances up or down.

How can we reproduce this? (What steps did you do to trigger the problem? If applicable, please include multiple screenshots of the problem! Be detailed)

It's a feature enhancement request.
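Purely as a hypothetical illustration of the request (no such provider exists yet), an azure.json could mirror the shape of the existing ASR configs, pairing Blob Storage credentials with Container Instance sizing. Every field name below is invented for illustration:

```json
{
    "provider": "azure",
    "blobStorage": {
        "accountName": "...",
        "accountKey": "...",
        "container": "results"
    },
    "containerInstances": {
        "resourceGroup": "...",
        "region": "westeurope"
    }
}
```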

Split-merge Integration

At some point we should add the ability to distribute large datasets over multiple nodes for parallel processing.

Add support for autoscaling queue

Optional support for an autoscaling queue could allow users to queue tasks even when they hit account limit restrictions, providing a better flow.

Docker machine is not present on the image for the last build (07/02/2022)

What is the problem?

On the last build there is an XML file instead of the docker-machine binary:

<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>9X64YXJWB3NTJXDY</RequestId><HostId>gcYPM2y5ZRki30XSL44791CDe8Nz0kMvQu0PYlmKNUXVMFzebFBYkjlAB8pfiG3WqQ6uygHt1wc=</HostId></Error>

So the autoscaler is not working anymore.

How can we reproduce this?

Just pull the latest image and try to run a docker-machine command.

Shared Task / Routes / Nodes Table

Currently, task and route tables are stored as in-memory objects. This creates a single point of failure and prevents the network from running multiple proxies concurrently on multiple machines.

We should have a shared database of such information (Redis?) so that multiple proxies can stay synced and distribute large numbers of incoming connections.
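One way to sketch the direction (class and method names are illustrative, not ClusterODM APIs): define a small async table interface that the current in-memory objects already satisfy, so that a Redis-backed implementation of the same methods (e.g. backed by a shared hash) could be swapped in later without touching the callers.

```javascript
// Illustrative async routes-table interface. The Map-backed version mirrors
// today's in-memory behavior; a Redis-backed class with the same set/get/
// delete methods would let multiple proxies share one table.
class MemoryRoutesTable {
  constructor() { this.routes = new Map(); }
  async set(token, nodeAddress) { this.routes.set(token, nodeAddress); }
  async get(token) { return this.routes.get(token) ?? null; }
  async delete(token) { this.routes.delete(token); }
}

(async () => {
  const table = new MemoryRoutesTable();
  await table.set("task-123", "nodeodm-host:3001");
  console.log(await table.get("task-123")); // nodeodm-host:3001
  await table.delete("task-123");
  console.log(await table.get("task-123")); // null
})();
```

Keeping every access async from the start is the key design choice: it makes the eventual move to a networked store a drop-in replacement.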
