meldig / pdal-parallelizer
A python app (cli/api) to parallelize your PDAL pipelines
License: BSD 3-Clause "New" or "Revised" License
The Python version that gets called is hard-coded; it should use whatever version is present in the conda environment, as long as it meets a minimum version requirement.
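One way to avoid a hard-coded interpreter is to reuse the one that is already running and check it against a minimum version. This is a minimal sketch, not the project's code: `MIN_PYTHON`, `meets_minimum` and `interpreter_command` are hypothetical names, and the minimum of 3.9 is an assumed value.

```python
import sys

# Hypothetical minimum version; the project's real requirement is not stated here.
MIN_PYTHON = (3, 9)


def meets_minimum(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


def interpreter_command():
    """Return the path of the interpreter running this code.

    Inside an activated conda environment this is that environment's
    Python, so no version number has to be hard-coded.
    """
    if not meets_minimum():
        raise RuntimeError(
            "Python %d.%d or newer is required" % MIN_PYTHON
        )
    return sys.executable
```

Any subprocess would then be launched with `interpreter_command()` rather than a fixed name such as `python3.9`.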
Launch a first execution that will take a chunk in the center of the dataset, process it and then advise the user on the number of workers to avoid memory issues.
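The advice step could be as simple as dividing the usable memory by the peak memory measured on the test chunk. The sketch below assumes that measurement already exists; `suggest_workers`, its parameters and the 0.8 safety factor are hypothetical and not part of pdal-parallelizer.

```python
import math


def suggest_workers(peak_bytes_per_tile, available_bytes, cpu_count, safety=0.8):
    """Suggest a worker count that should keep memory use under control.

    peak_bytes_per_tile: peak memory observed while processing the test chunk
    available_bytes:     memory available on the machine
    cpu_count:           upper bound on the number of workers
    safety:              fraction of the available memory we allow ourselves
    """
    if peak_bytes_per_tile <= 0:
        raise ValueError("peak_bytes_per_tile must be positive")
    usable = available_bytes * safety
    by_memory = math.floor(usable / peak_bytes_per_tile)
    # Always allow at least one worker, never more than the CPU count.
    return max(1, min(by_memory, cpu_count))
```

For instance, with 16 GiB available and a test chunk that peaked at 2 GiB, this caps the run at 6 workers even on an 8-core machine.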
When using pdal-parallelizer from the command line with miniforge on Windows, it seems that the temp directory has to exist already, otherwise its path is not found.
FileNotFoundError: [WinError 3] The system cannot find the path specified: 'E:/0_en_cours/2020_classification/temp_parallelizer'
It would be nice to create this directory automatically.
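A minimal sketch of that behaviour, using only the standard library. `ensure_temp_dir` is a hypothetical helper name, not something that exists in pdal-parallelizer.

```python
import os


def ensure_temp_dir(temp_dir):
    """Create the temp directory (and any missing parents) if needed.

    exist_ok=True makes the call a no-op when the directory is already
    there, so it is safe to run before every execution.
    """
    os.makedirs(temp_dir, exist_ok=True)
    return temp_dir
```

Calling it on the `temp` path from the config before any tile is written would avoid the FileNotFoundError above.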
When using the merge_tiles option I run into a writer error.
pdal-parallelizer process-pipelines -c E:\0000_test\pdal_parallelizer\config.json -it single -nw 9 -tpw 1 -ts 60 60 -b 1 -mt
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "D:\applications\miniforge3\envs\prod_env\Scripts\pdal-parallelizer.exe\__main__.py", line 7, in <module>
File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\click\core.py", line 1130, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\click\core.py", line 1055, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\click\core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\click\core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\click\core.py", line 760, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\pdal_parallelizer\pdal_parallelizer_cli\__main__.py", line 44, in process_pipelines
process(
File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\pdal_parallelizer\__init__.py", line 186, in process_pipelines
merge_ppln.execute()
RuntimeError: Couldn't create writer stage of type 'writers.laz'.
You probably have a version of PDAL that didn't come with a plugin
you're trying to load.
It seems that writers.laz is requested during the merge step, when it should be writers.las.
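As far as I know, PDAL writes LAZ through the writers.las stage with compression enabled rather than through a separate writers.laz stage. A possible workaround, sketched under that assumption, is to rewrite the stage type before the merge pipeline is built. `use_las_writer` is a hypothetical helper, and the `compression` option is taken from my reading of the writers.las documentation.

```python
import copy


def use_las_writer(pipeline_stages):
    """Return a copy of the stages where 'writers.laz' becomes 'writers.las'.

    Compression is switched on so the output is still a LAZ file.
    The input list is left untouched.
    """
    fixed = copy.deepcopy(pipeline_stages)
    for stage in fixed:
        if isinstance(stage, dict) and stage.get("type") == "writers.laz":
            stage["type"] = "writers.las"
            # Assumed option name, see the lead-in above.
            stage.setdefault("compression", True)
    return fixed
```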
With this command:
pdal-parallelizer process-pipelines -c E:\0000_test\pdal_parallelizer\config.json -it dir -nw 15 -tpw 1
I get this message:
WARNING - You are using the default value of the tile_size option (256 by 256 meters). Please check if your points cloud's dimensions are greater than this value.
Do you want to continue ? [y/N]
I ran into memory saturation when over-tiling a single input file.
A 500 m x 500 m input, divided into 35 tiles processed by 15 workers, exceeds the available memory.
pdal-parallelizer process-pipelines -c E:\0000_test\pdal_parallelizer\config.json -it single -nw 15 -tpw 1 -ts 100 100 -mt
When the number of workers and the tile size are chosen so that each worker processes only one tile during the run, the unmanaged memory doesn't blow up.
pdal-parallelizer process-pipelines -c E:\0000_test\pdal_parallelizer\config.json -it single -nw 9 -tpw 1 -ts 180 180 -mt
Computing the input file's extent could be used to derive the optimal tile size, or conversely.
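The idea can be sketched as follows: pick a grid with at most one tile per worker, as square as possible, and derive the tile size from the extent. `tile_size_for_workers` is a hypothetical helper, and the extent (width and height in meters) is assumed to be known already.

```python
import math


def tile_size_for_workers(width, height, n_workers):
    """Return a (tile_width, tile_height) giving at most one tile per worker.

    Among all cols x rows grids with cols * rows <= n_workers, keep the
    one using the most workers, then the one with the squarest tiles.
    """
    if width <= 0 or height <= 0 or n_workers < 1:
        raise ValueError("width, height and n_workers must be positive")
    best = None
    for cols in range(1, n_workers + 1):
        rows = n_workers // cols  # always >= 1 because cols <= n_workers
        tile_w, tile_h = width / cols, height / rows
        key = (cols * rows, -abs(tile_w - tile_h))
        if best is None or key > best[0]:
            best = (key, cols, rows)
    _, cols, rows = best
    # Round up so the grid always covers the whole extent.
    return math.ceil(width / cols), math.ceil(height / rows)
```

For the 500 m x 500 m input with 9 workers this gives a 3 x 3 grid of 167 m tiles, which is in line with the `-ts 180 180` run that behaved well.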
Extra dims are dropped from the output, even when "extra_dims":"all" is set in the writers.las stage.
pdal-parallelizer process-pipelines -c E:\0000_test\pdal_parallelizer\config.json -it single -nw 9 -tpw 1 -ts 80 80 -mt -rb
Beginning of the execution
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "D:\applications\miniforge3\envs\prod_env\Scripts\pdal-parallelizer.exe\__main__.py", line 7, in <module>
File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\click\core.py", line 1130, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\click\core.py", line 1055, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\click\core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\click\core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\click\core.py", line 760, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\pdal_parallelizer\pdal_parallelizer_cli\__main__.py", line 44, in process_pipelines
process(
File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\pdal_parallelizer\__init__.py", line 140, in process_pipelines
delayed = do.process_pipelines(output_dir=output, json_pipeline=pipeline, temp_dir=temp, iterator=iterator,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\pdal_parallelizer\do.py", line 58, in process_pipelines
p = t.pipeline(is_single)
^^^^^^^^^^^^^^^^^^^^^
File "D:\applications\miniforge3\envs\prod_env\Lib\site-packages\pdal_parallelizer\tile.py", line 69, in pipeline
p.insert(len(p) - 1, bounds.removeBuffer(self.bounds_without_buffer))
^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Tile' object has no attribute 'bounds_without_buffer'
What if we don't want any buffer to be used? A null value isn't valid.
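One possible way to handle this is to treat a missing buffer as 0 and to strip the buffer only when one was actually applied. The sketch below is an assumption about how this could be done, not the project's code; `normalize_buffer` and `remove_buffer_enabled` are hypothetical names. It also assumes the AttributeError above comes from passing -rb without -b, which the command line suggests but the trace does not prove.

```python
def normalize_buffer(buffer):
    """Turn the user-supplied buffer into a non-negative integer.

    None (no buffer requested) becomes 0; anything that is not an
    integer is rejected.
    """
    if buffer is None:
        return 0
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(buffer, bool) or not isinstance(buffer, int):
        raise TypeError("buffer must be an integer")
    if buffer < 0:
        raise ValueError("buffer must not be negative")
    return buffer


def remove_buffer_enabled(buffer, remove_buffer):
    """Only strip the buffer when the flag is set and a buffer exists."""
    return bool(remove_buffer) and normalize_buffer(buffer) > 0
```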
Separating it into multiple functions
PDAL Wrench has created a new format called Virtual Point Cloud (.vpc), which now has a specification. It is based on STAC and similar to GDAL's VRT.
It links several local or remote sources in one JSON file, with additional metadata (GeoJSON extent, statistics, etc.). It could be a new input type for the CLI.
Wrench uses it to parallelize the processing of multiple files but on a file by file basis, not by generating new chunks.
For the documentation: the buffer value has to be an integer.
Working with the conda version 2.0.3, I ran into underused compute capacity while processing a large dataset of about 5000 LAZ tiles. Each worker used around 0.5% of the CPU, when the expected behaviour is about 5%.
With a smaller subset of the exact same tiles, it works fine.
My process file is
pdal-parallelizer process-pipelines -c F:\nuages\2020_lidar\2022_06_pdal_mneh_config.json -it dir -nw 20 -tpw 1 -mt
How should the pipelines be executed after a crash in single mode? With execute_stages_standard or execute_stages_streaming?
I installed via conda and tried to run process(config = path_to_config.json, input_type="single"...
But it fails with the error below.
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
My config.json is written like this:
{
    "input": "F:/",
    "output": "F:/",
    "temp": "F:/",
    "pipeline": "F:/pipeline.json"
}
Please tell me how to solve it.
Thank you.
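"Expecting value: line 1 column 1 (char 0)" is what the json module reports when the text it receives does not start with a JSON value, for example when the file is empty. Since the config shown above is valid JSON, it may be worth checking each JSON file separately, including the one referenced by "pipeline". The loader below is a small diagnostic sketch; `load_json_config` is a hypothetical helper, not part of pdal-parallelizer.

```python
import json


def load_json_config(path):
    """Load a JSON file and say which file is at fault when it fails."""
    # utf-8-sig drops a UTF-8 byte-order mark, which some Windows
    # editors add at the start of the file.
    with open(path, "r", encoding="utf-8-sig") as handle:
        text = handle.read()
    if not text.strip():
        raise ValueError(f"{path} is empty; expected a JSON document")
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(f"{path} is not valid JSON: {err}") from err
```

Running it on both config.json and pipeline.json should show which of the two files triggers the error.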
Separating it into multiple functions
It's useful to keep the tiles resulting from the -ts option, but it would also be useful to have an option to remove them when the result you're looking for is the merged file.
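A cleanup step along these lines could run once the merge has succeeded. This is only a sketch: `remove_tiles` and its parameters are hypothetical, and it assumes the tiles and the merged file sit in the same output directory.

```python
from pathlib import Path


def remove_tiles(output_dir, merged_name, suffixes=(".las", ".laz")):
    """Delete every point cloud in output_dir except the merged file.

    Returns the sorted names of the files that were removed.
    """
    removed = []
    for path in Path(output_dir).iterdir():
        is_cloud = path.is_file() and path.suffix.lower() in suffixes
        if is_cloud and path.name != merged_name:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```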
The default format of the tiles is LAS. It ignores the compression option specified in the pipeline (writers.las).
However, when using the merge step, called with -mt, it does.
When processing a single cloud, the cloud is opened before the parallelization starts. But as we have to create the iterators in parallel, the image_array of the cloud ends up in the task graph.
Using client.scatter and client.submit might solve the problem.
This class will contain functions like:
Working with the 2.0.3 release with the -dir option on, I saw CPU underutilization once more than 6 workers were assigned.
Basically, from 3 to 6 workers, each worker uses around 4 to 5% of the overall CPU capacity, but above 6 workers it falls to less than 2%.
Processing never starts, as it appears to be expecting that there are already files in the output directory.
CLI Command:
pdal-parallelizer process-pipelines -c /work/test/data_1/pdal-parallelizer.json -it single
pdal-parallelizer.json:
{
    "input": "/work/test/data_1/input/sample_points.laz",
    "output": "/work/test/data_1/output",
    "temp": "/work/test/data_1/temp",
    "pipeline": "/work/test/data_1/pipeline.json"
}
pipeline.json:
{
    "pipeline": [
        {
            "type": "readers.las",
            "filename": "/work/test/data_1/input/sample_points.laz"
        },
        {
            "type": "writers.laz",
            "filename": "/work/test/data_1/output/output.laz"
        }
    ]
}
Trace:
Parallelization started.
Traceback (most recent call last):
File "/root/miniconda3/bin/pdal-parallelizer", line 8, in <module>
sys.exit(main())
File "/root/miniconda3/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/root/miniconda3/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/root/miniconda3/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/root/miniconda3/lib/python3.10/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/pdal_parallelizer/pdal_parallelizer_cli/__main__.py", line 42, in process_pipelines
process(
File "/root/miniconda3/lib/python3.10/site-packages/pdal_parallelizer/__init__.py", line 130, in process_pipelines
file_manager.getEmptyWeight(output_directory=output)
File "/root/miniconda3/lib/python3.10/site-packages/pdal_parallelizer/file_manager.py", line 60, in getEmptyWeight
deciles = [round(q, 2) for q in statistics.quantiles(weights_ko, n=10)]
File "/root/miniconda3/lib/python3.10/statistics.py", line 662, in quantiles
raise StatisticsError('must have at least two data points')
statistics.StatisticsError: must have at least two data points
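On the Python 3.10 shown in the trace, statistics.quantiles raises this error whenever it receives fewer than two data points, which is what happens when the output directory is still empty. A guard such as the one below would let the first run proceed. `safe_deciles` is a hypothetical wrapper around the call seen in the trace, not the project's code.

```python
import statistics


def safe_deciles(weights):
    """Return the rounded deciles, or an empty list when there is not
    enough data to compute them (fewer than two file weights)."""
    if len(weights) < 2:
        return []
    return [round(q, 2) for q in statistics.quantiles(weights, n=10)]
```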
Having pdal-parallelizer available through conda-forge would remove the need for pip. It would be simpler, since conda is already needed to install PDAL as a dependency.
With pdal-parallelizer 2.0.3 on Windows, with both the single and dir options, multiple expressions in a single filters.assign stage make pdal-parallelizer crash. For example,
{
    "type": "filters.assign",
    "value": [
        "Dimension = ValueExpression [WHERE ConditionalExpression)]",
        "Dimension = ValueExpression [WHERE ConditionalExpression)]",
        "Dimension = ValueExpression [WHERE ConditionalExpression)]"
    ]
},
has to be replaced by,
{
    "type": "filters.assign",
    "value": [
        "Dimension = ValueExpression [WHERE ConditionalExpression)]"
    ]
}, {
    "type": "filters.assign",
    "value": [
        "Dimension = ValueExpression [WHERE ConditionalExpression)]"
    ]
}, {
    "type": "filters.assign",
    "value": [
        "Dimension = ValueExpression [WHERE ConditionalExpression)]"
    ]
},
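Until the crash is fixed, the workaround above can be automated by expanding each multi-expression filters.assign stage into one stage per expression before the pipeline is handed over. `split_assign_stages` is a hypothetical helper written for illustration. It assumes the pipeline is available as a Python list of stages.

```python
def split_assign_stages(stages):
    """Expand filters.assign stages holding several expressions into
    one filters.assign stage per expression, keeping the order."""
    result = []
    for stage in stages:
        values = stage.get("value") if isinstance(stage, dict) else None
        is_multi_assign = (
            isinstance(stage, dict)
            and stage.get("type") == "filters.assign"
            and isinstance(values, list)
            and len(values) > 1
        )
        if is_multi_assign:
            for expression in values:
                single = dict(stage)  # shallow copy keeps other options
                single["value"] = [expression]
                result.append(single)
        else:
            result.append(stage)
    return result
```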
When the single input option is activated, the output is tiled based on the dask parameters and named according to the xmins/ymins of the bounding boxes.
Would it be possible to
So
input_name.laz -> xmins/ymins.laz & xmins/ymins.laz & xmins/ymins.laz & xmins/ymins.laz
would be
input_name.laz -> input_name.laz