meldig / pdal-parallelizer

A python app (cli/api) to parallelize your PDAL pipelines

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

pdal-parallelizer's Introduction

PDAL-PARALLELIZER

Some point cloud processing can be very time-consuming. This problem can be solved by using several processes on one machine to run calculations in parallel. The pdal-parallelizer tool lets you put the full power of your machine to work on your processing with very little effort.

pdal-parallelizer is a tool that allows you to process your point clouds through pipelines that will be executed on several cores of your machine. This tool uses the flexible open-source Python library Dask for the multiprocess side and allows you to use the power of the Point Data Abstraction Library (PDAL) to write your pipelines.

It also protects you from problems during execution. Point cloud processing can take a long time, so if something goes wrong you don't want to restart from the beginning. pdal-parallelizer therefore serializes each pipeline, so that work already done is not lost.

Read the documentation for more details: https://pdal-parallelizer.readthedocs.io/

Installation

Using Pip

pip install pdal-parallelizer

Using Conda

conda install -c clementalba pdal-parallelizer

GitHub

The repository of pdal-parallelizer is available at https://github.com/meldig/pdal-parallelizer

Usage

Config file

Your configuration file must look like this:

{
    "input": "The folder that contains your input files (or a file path)",
    "output": "The folder that will receive your output files",
    "temp": "The folder that will contain your temporary files",
    "pipeline": "Your pipeline path"
}
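
Since a malformed configuration file only surfaces as a JSON error at run time, it can help to check the file before launching a run. The following is a minimal sketch, not part of pdal-parallelizer: the function name load_config and the REQUIRED_KEYS tuple are assumptions made for illustration, and only the four keys shown above are checked.

```python
import json
from pathlib import Path

# The four keys shown in the configuration example above.
REQUIRED_KEYS = ("input", "output", "temp", "pipeline")


def load_config(path):
    """Parse a config file and check that the expected keys are present.

    Hypothetical helper: it is not provided by pdal-parallelizer.
    """
    text = Path(path).read_text(encoding="utf-8")
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        # Report where parsing failed, e.g. a missing comma between entries.
        raise ValueError(
            f"{path}: invalid JSON at line {err.lineno}, "
            f"column {err.colno}: {err.msg}"
        ) from err
    if not isinstance(config, dict):
        raise ValueError(f"{path}: the top-level value must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"{path}: missing keys: {', '.join(missing)}")
    return config
```

Calling load_config("./config.json") before process(...) turns a parse error into a message that points at the offending line.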

Processing pipelines with API

from pdal_parallelizer import process_pipelines as process

process(config="./config.json", input_type="single", timeout=500, n_workers=5, diagnostic=True)

Processing pipelines with CLI

pdal-parallelizer process-pipelines -c <config file> -it dir -nw <n_workers> -tpw <threads_per_worker> -dr <number of files> -d
pdal-parallelizer process-pipelines -c <config file> -it single -nw <n_workers> -tpw <threads_per_worker> -ts <tiles size> -d -dr <number of tiles> -b <buffer size>

Requirements (only for pip installs)

Python 3.9+ (e.g. conda install -c anaconda python)

PDAL 2.4+ (e.g. conda install -c conda-forge pdal)

pdal-parallelizer's People

Contributors

clementalba iain-s djes

pdal-parallelizer's Issues

Simplify the tile.split function

Separate it into multiple functions

extra dim withdrawn from merged output

Extra dimensions are dropped from the merged output, even when "extra_dims":"all" is set in the writers.las stage.

Tile size impact on workers number

I ran into memory saturation when over-tiling a single input file.

A 500 m x 500 m input, divided into 35 tiles processed by 15 workers, uses more memory than is available.
pdal-parallelizer process-pipelines -c E:\0000_test\pdal_parallelizer\config.json -it single -nw 15 -tpw 1 -ts 100 100 -mt

When the number of workers and the tile size are chosen so that each worker processes only one tile during the run, the unmanaged memory doesn't blow up.
pdal-parallelizer process-pipelines -c E:\0000_test\pdal_parallelizer\config.json -it single -nw 9 -tpw 1 -ts 180 180 -mt

The extent of the input file could be used to calculate the optimal tile size for a given number of workers, or conversely the number of workers for a given tile size.
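
The relation between extent, tile size and worker count can be sketched as follows. This is only an illustration under a simplifying assumption: tiles form a regular grid aligned on the bounding box, which may not match how pdal-parallelizer actually tiles a cloud. Both function names are hypothetical.

```python
import math


def tile_count(width, height, tile_x, tile_y):
    """Number of tiles in a regular grid covering a width x height extent."""
    return math.ceil(width / tile_x) * math.ceil(height / tile_y)


def square_tile_size_for(width, height, n_workers):
    """Smallest whole-metre square tile size giving at most n_workers tiles.

    Assumes width, height > 0 and n_workers >= 1.
    """
    size = 1
    while tile_count(width, height, size, size) > n_workers:
        size += 1
    return size
```

Under that assumption, a 500 m x 500 m extent with 180 m tiles gives a 3 x 3 grid, which matches the 9 workers of the second command, so each worker handles exactly one tile.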

Manage the error related to statistics.quantiles

Trigger a warning if the default value of the tile_size option is used
If there is only one file in the output directory, do not create quantiles
Write tests

Fixing the bugs of the new version

How to execute deserialized pipelines in single mode?
Properly open the cloud just once using client.scatter in single mode

Fails with json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I installed the package with conda and tried to run process(config = path_to_config.json, input_type="single"...
But it fails with the error below.

raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

My config.json contains:
{
"input": "F:/",
"output": "F:/",
"temp": "F:/",
"pipeline": "F:/pipeline.json"
}
Please tell me how to solve it.

Thank you.

-buffer Int

For the documentation: the buffer value has to be an integer.

Simplify the tile.pipeline function

Separate it into multiple functions

How to execute the deserialized pipelines in single mode

How to execute the pipelines after a crash in single mode? With execute_stages_standard or execute_stages_streaming?

CPU exploitation limit

Working with the 2.0.3 release with the dir option on, I observed CPU underutilization when more than 6 workers are assigned.

Basically, from 3 to 6 workers, each one uses around 4 to 5% of the overall CPU capacity, but above 6 workers it falls to less than 2%.

Tile size warning is displayed even with the -it dir option

With this command:

pdal-parallelizer process-pipelines -c E:\0000_test\pdal_parallelizer\config.json -it dir -nw 15 -tpw 1

I get this message:

WARNING - You are using the default value of the tile_size option (256 by 256 meters). Please check if your points cloud's dimensions are greater than this value.
Do you want to continue ?  [y/N]

Command-line temp directory

When using pdal-parallelizer from the command line with miniforge on Windows, it seems that the temp directory has to exist already, otherwise its path is not found.
FileNotFoundError: [WinError 3] Le chemin d’accès spécifié est introuvable: 'E:/0_en_cours/2020_classification/temp_parallelizer' (in English: the specified path was not found)
It would be nice to create this directory automatically.
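
Creating the directory up front is a small change. A minimal sketch using only the standard library; the helper name ensure_directory is hypothetical and where it would be called inside pdal-parallelizer is left open.

```python
from pathlib import Path


def ensure_directory(path):
    """Create the directory (and any missing parents) if it does not exist yet."""
    directory = Path(path)
    # exist_ok=True makes the call a no-op when the directory is already there.
    directory.mkdir(parents=True, exist_ok=True)
    return directory
```

Calling it on the "temp" value of the config file before the run starts would avoid the FileNotFoundError above.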

Stop requiring a specific version of Python

The Python version is hard-coded. The package should use whatever version is present in the conda environment, as long as it meets a minimum version requirement.
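
A minimum-version criterion, as opposed to a pinned version, could look like the sketch below. The 3.9 floor is taken from the Requirements section; the function name python_is_supported is an assumption. In packaging metadata the same idea is usually expressed with python_requires=">=3.9" in setuptools.

```python
import sys

# Minimum version taken from the Requirements section (Python 3.9+).
MINIMUM_PYTHON = (3, 9)


def python_is_supported(version_info=None, minimum=MINIMUM_PYTHON):
    """Return True if the (major, minor) version is at least the minimum."""
    current = tuple((version_info or sys.version_info)[:2])
    return current >= minimum
```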

Properly open the cloud just once using client.scatter in single mode

When processing a single cloud, the cloud is opened before the parallelization starts. But as we have to create iterators in parallel, the image_array of the cloud ends up in the task graph.

Using client.scatter and client.submit might solve the problem.

-tile_size option needs -buffer

pdal-parallelizer process-pipelines -c E:\0000_test\pdal_parallelizer\config.json -it single -nw 9 -tpw 1 -ts 80 80 -mt -rb
Beginning of the execution

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "D:\applications\miniforge3\envs\prod_env\Scripts\pdal-parallelizer.exe\__main__.py", line 7, in <module>
  File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\pdal_parallelizer\pdal_parallelizer_cli\__main__.py", line 44, in process_pipelines
    process(
  File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\pdal_parallelizer\__init__.py", line 140, in process_pipelines
    delayed = do.process_pipelines(output_dir=output, json_pipeline=pipeline, temp_dir=temp, iterator=iterator,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\pdal_parallelizer\do.py", line 58, in process_pipelines
    p = t.pipeline(is_single)
        ^^^^^^^^^^^^^^^^^^^^^
  File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\pdal_parallelizer\tile.py", line 69, in pipeline
    p.insert(len(p) - 1, bounds.removeBuffer(self.bounds_without_buffer))
                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Tile' object has no attribute 'bounds_without_buffer'

What if we don't want any buffer to be used? A null value isn't valid.

Command-line tiling and renaming single input

When the single input option is activated, the output is tiled based on the Dask parameters and the tiles are named according to the xmins/ymins of the bounding boxes.

Would it be possible to

merge the outputs,
exclude the buffers from the output when used, and
rename the merged output based on the input name?

So
input_name.laz -> xmins/ymins.laz & xmins/ymins.laz & xmins/ymins.laz & xmins/ymins.laz
would be
input_name.laz -> input_name.laz
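
The naming part of this request can be sketched with pathlib. The helper name merged_output_path is hypothetical; it only derives the path and does not perform the merge.

```python
from pathlib import Path


def merged_output_path(input_file, output_dir):
    """Return the merged output path: the output folder plus the input file name."""
    return Path(output_dir) / Path(input_file).name
```

With an input named input_name.laz, the merged file would be written to the output folder as input_name.laz.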

Computation performance decreases with a large number of files

Working with the Conda version 2.0.3, I observed underutilized computing capacity while processing a large dataset of about 5000 laz tiles. Each worker used around 0.5% of the CPU when the expected behaviour is about 5%.
With a smaller subset of the exact same tiles, it works fine.

My command is
pdal-parallelizer process-pipelines -c F:\nuages\2020_lidar\2022_06_pdal_mneh_config.json -it dir -nw 20 -tpw 1 -mt

Advise the user on the number of workers to choose

Launch a first execution that will take a chunk in the center of the dataset, process it and then advise the user on the number of workers to avoid memory issues.
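
Once the trial run has measured the peak memory of one task, the advice itself is simple arithmetic. The sketch below is one possible rule, not the project's method: the function name, the 0.8 safety margin and the cap on the core count are all assumptions.

```python
def advise_n_workers(peak_task_bytes, available_bytes, cpu_count, safety_margin=0.8):
    """Suggest a worker count that fits in memory, capped by the number of cores.

    peak_task_bytes: peak memory measured while processing the sample chunk.
    available_bytes: memory available on the machine.
    safety_margin: fraction of the available memory we allow workers to use.
    """
    if peak_task_bytes <= 0:
        raise ValueError("peak_task_bytes must be positive")
    usable = available_bytes * safety_margin
    by_memory = int(usable // peak_task_bytes)
    # Always suggest at least one worker, and never more than the cores available.
    return max(1, min(cpu_count, by_memory))
```

For example, a task peaking at 2 GiB on a machine with 16 GiB available and 20 cores yields a suggestion of 6 workers with the default margin.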

Crash due to multiple expressions on filters.assign

With pdal-parallelizer 2.0.3 on Windows, with both the single and dir options, a filters.assign stage that holds multiple expressions makes pdal-parallelizer crash. For example,

    {
        "type":"filters.assign",
        "value":
        [
            "Dimension = ValueExpression [WHERE ConditionalExpression]",
            "Dimension = ValueExpression [WHERE ConditionalExpression]",
            "Dimension = ValueExpression [WHERE ConditionalExpression]"
        ]
    },

has to be replaced by,

    {
        "type":"filters.assign",
        "value":
        [
            "Dimension = ValueExpression [WHERE ConditionalExpression]"
        ]
    },
    {
        "type":"filters.assign",
        "value":
        [
            "Dimension = ValueExpression [WHERE ConditionalExpression]"
        ]
    },
    {
        "type":"filters.assign",
        "value":
        [
            "Dimension = ValueExpression [WHERE ConditionalExpression]"
        ]
    },
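
Until the crash is fixed, the workaround above can be applied automatically to a pipeline loaded as a list of stages. A minimal sketch, assuming stages are plain dictionaries as in the JSON above; the function name split_assign_stages is hypothetical.

```python
import copy


def split_assign_stages(pipeline):
    """Expand each filters.assign stage holding several expressions
    into one filters.assign stage per expression, preserving order."""
    result = []
    for stage in pipeline:
        is_assign = isinstance(stage, dict) and stage.get("type") == "filters.assign"
        if is_assign and isinstance(stage.get("value"), list):
            for expression in stage["value"]:
                # Copy the stage so other options (tag, where, ...) are kept.
                single = copy.deepcopy(stage)
                single["value"] = [expression]
                result.append(single)
        else:
            # Other stages, including bare filename strings, pass through untouched.
            result.append(stage)
    return result
```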

Add some checks before computation

Check if the input, output and temp values of the config file are different
Check if the output folder is empty
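
Both checks can be sketched with the standard library. The function name check_config_paths is hypothetical, and the config is assumed to be the dictionary parsed from the configuration file.

```python
from pathlib import Path


def check_config_paths(config):
    """Return a list of problems found in the input, output and temp settings."""
    problems = []
    # Resolve the paths so that "./out" and "out" count as the same location.
    resolved = {key: Path(config[key]).resolve() for key in ("input", "output", "temp")}
    if len(set(resolved.values())) < 3:
        problems.append("input, output and temp must be three different paths")
    output = resolved["output"]
    if output.is_dir() and any(output.iterdir()):
        problems.append(f"output folder is not empty: {output}")
    return problems
```

An empty list means the run can start; otherwise the messages can be shown to the user before any computation.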

-merge_tiles option writer issue

When using the merge_tiles option I face a writer issue.

pdal-parallelizer process-pipelines -c E:\0000_test\pdal_parallelizer\config.json -it single -nw 9 -tpw 1 -ts 60 60 -b 1 -mt

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "D:\applications\miniforge3\envs\prod_env\Scripts\pdal-parallelizer.exe\__main__.py", line 7, in <module>
  File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\pdal_parallelizer\pdal_parallelizer_cli\__main__.py", line 44, in process_pipelines
    process(
  File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\pdal_parallelizer\__init__.py", line 186, in process_pipelines
    merge_ppln.execute()
RuntimeError: Couldn't create writer stage of type 'writers.laz'.
You probably have a version of PDAL that didn't come with a plugin
you're trying to load.

It seems that the merge step asks for a writers.laz stage when it should use writers.las.
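
For illustration, a merge pipeline that sticks to writers.las could be built as below. This is a sketch and not the code pdal-parallelizer runs: the function name is hypothetical, and the stage and option names (readers.las, filters.merge, writers.las with compression and extra_dims) are taken from the PDAL documentation as I understand it and should be checked against the installed PDAL version.

```python
def build_merge_pipeline(tile_paths, output_path, extra_dims=None):
    """Build a PDAL pipeline (as a dict) that merges tiles into one LAS/LAZ file."""
    stages = [{"type": "readers.las", "filename": str(path)} for path in tile_paths]
    stages.append({"type": "filters.merge"})
    writer = {
        # LAZ output is requested through the LAS writer's compression option,
        # not through a separate writer type.
        "type": "writers.las",
        "filename": str(output_path),
        "compression": str(output_path).lower().endswith(".laz"),
    }
    if extra_dims is not None:
        writer["extra_dims"] = extra_dims
    stages.append(writer)
    return {"pipeline": stages}
```

The returned dictionary can be serialized with json.dumps and handed to PDAL.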

Add support for Virtual Point Cloud

PDAL Wrench has created a new format called Virtual Point Cloud (.vpc), which now has a specification. It is based on STAC and similar to GDAL's VRT.

It links several local or remote sources in one JSON file, with additional metadata (GeoJSON extent, statistics, etc.). It could be a new input type for the CLI.

Wrench uses it to parallelize the processing of multiple files but on a file by file basis, not by generating new chunks.

StatisticsError('must have at least two data points')

Processing never starts, as it appears to be expecting that there are already files in the output directory.

CLI Command:
pdal-parallelizer process-pipelines -c /work/test/data_1/pdal-parallelizer.json -it single

pdal-parallelizer.json:

{
  "input": "/work/test/data_1/input/sample_points.laz",
  "output": "/work/test/data_1/output",
  "temp": "/work/test/data_1/temp",
  "pipeline": "/work/test/data_1/pipeline.json"
}

pipeline.json:

{
  "pipeline": [
    {
      "type": "readers.las",
      "filename": "/work/test/data_1/input/sample_points.laz"
    },
    {
      "type": "writers.laz",
      "filename": "/work/test/data_1/output/output.laz"
    }
  ]
}

Trace:

Parallelization started.

Traceback (most recent call last):
  File "/root/miniconda3/bin/pdal-parallelizer", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/root/miniconda3/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/root/miniconda3/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/miniconda3/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/pdal_parallelizer/pdal_parallelizer_cli/__main__.py", line 42, in process_pipelines
    process(
  File "/root/miniconda3/lib/python3.10/site-packages/pdal_parallelizer/__init__.py", line 130, in process_pipelines
    file_manager.getEmptyWeight(output_directory=output)
  File "/root/miniconda3/lib/python3.10/site-packages/pdal_parallelizer/file_manager.py", line 60, in getEmptyWeight
    deciles = [round(q, 2) for q in statistics.quantiles(weights_ko, n=10)]
  File "/root/miniconda3/lib/python3.10/statistics.py", line 662, in quantiles
    raise StatisticsError('must have at least two data points')
statistics.StatisticsError: must have at least two data points
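
The trace shows statistics.quantiles being called on a list that may hold fewer than two values. A guard such as the following avoids the exception; the function name safe_deciles is hypothetical and the rest of getEmptyWeight is not reproduced here.

```python
import statistics


def safe_deciles(weights):
    """Return the decile cut points, or an empty list when there are
    fewer than two values to compute them from."""
    if len(weights) < 2:
        return []
    return [round(q, 2) for q in statistics.quantiles(weights, n=10)]
```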

Create a Python API

Create a new Python package for the CLI version
Port the main.py file from the CLI version in this new Python package
Create a new main.py file which will be the entry point of the API
Adapt the process_pipelines function from the CLI version for use via the API
Make the process_pipelines function in the CLI version now use the process_pipelines function in the API version
Check that all tests pass

Create a package for Conda

Having pdal-parallelizer available through conda-forge would remove the need to use pip. It would be simpler, as conda is already needed to install PDAL as a dependency.

Reorganise the project

Use a "PipelineWrapper" or a "PipelineManager" class to simplify actions related to PDAL pipelines
Simplify the tile.pipeline function by separating it in multiple functions
Simplify the tile.split function by separating it in multiple functions

Use a "PipelineWrapper" or a "PipelineManager" class to simplify actions related to PDAL pipelines

This class will contain functions like:

get_readers
get_writers
set_readers_filename
set_writers_filename
load_pipeline
...
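
One possible shape for such a class is sketched below, working on the pipeline as plain JSON data. Only the method names come from the list above; their signatures and behaviour are assumptions, and the real class might wrap pdal.Pipeline objects instead.

```python
import json


class PipelineWrapper:
    """Thin helper around the list of stages of a PDAL pipeline."""

    def __init__(self, stages):
        self.stages = stages

    @classmethod
    def load_pipeline(cls, path):
        """Read a pipeline file and wrap its stages."""
        with open(path, encoding="utf-8") as handle:
            content = json.load(handle)
        # Accept either {"pipeline": [...]} or a bare list of stages.
        stages = content["pipeline"] if isinstance(content, dict) else content
        return cls(stages)

    def _stages_of(self, prefix):
        return [
            stage for stage in self.stages
            if isinstance(stage, dict) and stage.get("type", "").startswith(prefix)
        ]

    def get_readers(self):
        return self._stages_of("readers.")

    def get_writers(self):
        return self._stages_of("writers.")

    def set_readers_filename(self, filename):
        for stage in self.get_readers():
            stage["filename"] = filename

    def set_writers_filename(self, filename):
        for stage in self.get_writers():
            stage["filename"] = filename
```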

Default tiles format las

The tiles are written as LAS by default, ignoring the compression option specified in the pipeline (writers.las).
The merge step, called with -mt, does take it into account.

Removing tiles when merge_tiles on

It's useful to keep the tiles resulting from the -ts option, but it would also be useful to have an option to remove them when the result you're looking for is the merged output.
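
The clean-up itself could be as small as the sketch below. The helper name remove_tiles is hypothetical, and no such option exists in the CLI as far as this page shows; the function refuses to delete anything unless the merged file is present.

```python
from pathlib import Path


def remove_tiles(tile_paths, merged_path):
    """Delete tile files once the merged output exists.

    The merged file itself is never deleted, even if it is listed among the tiles.
    """
    merged = Path(merged_path)
    if not merged.is_file():
        raise FileNotFoundError(f"merged output not found: {merged}")
    removed = []
    for tile in map(Path, tile_paths):
        if tile.is_file() and tile.resolve() != merged.resolve():
            tile.unlink()
            removed.append(tile)
    return removed
```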