Since the architecture allows split and fingerprints to be run separately, and we plan to add many more steps in the future, we could already split the argparse definitions into separate files and join them all in the main file.
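
A minimal sketch of how that could look, assuming each step becomes an argparse subcommand. The function names, the `--chunks` option and the module names in the comments are hypothetical, not the project's current layout; in practice each `add_*_parser` function would live in its own file and the main file would only import and join them.

```python
import argparse


# Would live in e.g. split_args.py (hypothetical module name).
def add_split_parser(subparsers):
    """Register the 'split' subcommand and its options."""
    parser = subparsers.add_parser("split", help="split the input file")
    parser.add_argument("--chunks", type=int, default=10)  # hypothetical option
    return parser


# Would live in e.g. fingerprints_args.py (hypothetical module name).
def add_fingerprints_parser(subparsers):
    """Register the 'fingerprints' subcommand and its options."""
    parser = subparsers.add_parser("fingerprints", help="compute fingerprints")
    parser.add_argument("--parquet", action="store_true")
    return parser


# Main file: join all the per-step parsers into one command-line interface.
def build_parser():
    parser = argparse.ArgumentParser(prog="main")
    subparsers = parser.add_subparsers(dest="command", required=True)
    add_split_parser(subparsers)
    add_fingerprints_parser(subparsers)
    return parser


# Parse an explicit argument list instead of sys.argv, for illustration.
args = build_parser().parse_args(["split", "--chunks", "4"])
print(args.command, args.chunks)
```

Adding a new step would then mean adding one file with its own `add_*_parser` function and one extra call in `build_parser`.
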
Multiprocessing does not share memory between nodes, so it only works within a single node. @cescgina proposed Ray as a high-level library for running MPI-like processes so that several nodes can be used at full capacity. First benchmark: 10M in 14 h (Mordred descriptors).
The calculation of Mordred descriptors is parallelized internally via a multiprocessing pool, which crashes if it is called from a script that already uses another pool. To avoid this issue, set nproc=1 in the call to calc.pandas in fingerprints.py. However, I don't know whether it is faster to use an external pool or to process the file serially and use Mordred's internal pool.
You can also pass quiet=True to calc.pandas to suppress the progress bar, which looks bad when the output is redirected to a file.
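
As a rough sketch, the call with both settings could be wrapped as below. The wrapper name `compute_descriptors` is hypothetical; `nproc` and `quiet` are keyword arguments of Mordred's `Calculator.pandas`, and in that API it is `quiet=True` that hides the progress bar.

```python
def compute_descriptors(calc, mols):
    """Compute descriptors serially and without a progress bar.

    `calc` is expected to be a mordred.Calculator and `mols` an iterable
    of RDKit molecules. The wrapper itself is a hypothetical helper.
    """
    # nproc=1 keeps Mordred from opening its own multiprocessing pool,
    # so the call can run inside a worker of an external pool.
    # quiet=True hides the progress bar, which clutters redirected output.
    return calc.pandas(mols, nproc=1, quiet=True)


# Usage, assuming mordred and rdkit are installed:
#
#   from mordred import Calculator, descriptors
#   from rdkit import Chem
#
#   calc = Calculator(descriptors, ignore_3D=True)
#   df = compute_descriptors(calc, [Chem.MolFromSmiles("CCO")])
```
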
Check whether it is faster to use the Mordred pool or an external one.
The default value for the --parquet option is True, but I found no way to turn it off, since its argparse action is store_true. I think the default should be False. Then, if the user has not specified any output format, either raise an error or fall back to parquet as the output format.
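
One possible shape for this, as a sketch only: the `--csv` flag and the `resolve_output_formats` helper are hypothetical and stand in for whatever other output formats the script supports.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # With store_true and default=False the flag is off unless it is given.
    parser.add_argument("--parquet", action="store_true", default=False)
    # Hypothetical second output format, used only for illustration.
    parser.add_argument("--csv", action="store_true", default=False)
    return parser


def resolve_output_formats(args):
    """Return the requested output formats, falling back to parquet."""
    formats = [name for name in ("parquet", "csv") if getattr(args, name)]
    # No format requested: fall back to parquet. The stricter alternative
    # would be to call parser.error(...) here instead.
    return formats or ["parquet"]


parser = build_parser()
print(resolve_output_formats(parser.parse_args([])))
print(resolve_output_formats(parser.parse_args(["--csv"])))
```

If keeping parquet on by default is preferred, `argparse.BooleanOptionalAction` (Python 3.9+) is another option: it generates a matching `--no-parquet` flag so the default can be switched off.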