spectre's People

Contributors

actions-user, bitbacchus, csim063, marcosci, mhesselbarth, mspngnbrg, nldoc

spectre's Issues

Improvements

Todo list for what needs to be polished:

  • Documentation
    • Get Started vignette
    • Detailed introduction to the method
    • Prerequisites
  • Code
    • No warnings, no notes, no errors
    • Get rid of load_raster and other functions to load data
    • Improve naming? Not sure, haven't checked in detail, just a thought

make spectre stoppable

Stopping spectre may take more than 40 minutes for large landscapes with many species. It would be convenient if the package checked for "stops" (user interrupts) from time to time.

Rcpp

There is now an Rcpp function for calculate_solution_commonness_site.

It is faster than the previous implementation, but still very bare:

 expr      min       lq      mean    median        uq      max neval cld
    r 484.7872 762.7204 1357.2797 1404.3644 1746.8956 2440.874    10   a
 rcpp 325.4761 464.5862  909.2614  713.4591  963.5737 3100.043    10   a

... if everyone is in favor, I would vote to include RcppParallel and try to fiddle with this function.

Add continuous benchmarking

It would be helpful to automatically track the speed of spectre via GH Actions to avoid unexpected performance regressions.

There are ways to do this for the C++ part:

https://thomaspoignant.medium.com/ci-build-performance-testing-with-github-action-e6b227097c83
https://github.com/catchorg/Catch2/blob/devel/docs/benchmarks.md
https://github.com/marketplace/actions/r-benchmark

I think, however, that benchmarking on the R side would be more beneficial. We need to do some research here; a rough sketch of what an R-side check could look like follows.
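
As a starting point, here is a minimal, hypothetical sketch of an R-side benchmark script that a GH Actions step could run and fail on regressions. run_small_test_case() stands in for whatever spectre call we decide to track, and the 30-second threshold is made up:

    # Hypothetical CI benchmark: fail the workflow if the tracked call gets
    # slower than an agreed threshold.
    library(bench)

    run_small_test_case <- function() {
      Sys.sleep(0.1)   # placeholder for the actual spectre call to track
    }

    timing <- bench::mark(run_small_test_case(), iterations = 5, check = FALSE)

    if (as.numeric(timing$median) > 30) {   # threshold in seconds (made up)
      stop("Performance regression: median runtime above threshold")
    }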

Scale difference/energy

We should scale the difference value checked as a stop condition with the size of the landscape: for the same amount of error, the value gets smaller as landscapes become smaller, which makes it difficult for users to know what stop condition to set.
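
A toy illustration of one possible scaling, assuming we simply divide the raw difference by the number of unique siteXsite pairs (all numbers are made up):

    # Divide the raw commonness difference by the number of unique site pairs
    # so the stop condition is comparable across landscape sizes.
    n_sites   <- 50
    raw_error <- 420                          # made-up raw difference
    n_pairs   <- (n_sites^2 - n_sites) / 2    # unique siteXsite pairs
    raw_error / n_pairs                       # ~0.34, comparable across sizes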

Code cleanup

I will clean up the code next week. The idea is to have a clean version in the master branch that could be published with a paper. It will contain only the min_conf_0 algorithm; other algorithms will go to separate branches.

To avoid interference, please let me know if you work on the code within the next two weeks.

New measures

I'd like to try two things now:

  • implement a normalized "energy" (as Kerstin suggested, we probably shouldn't call it energy anymore)
  • try different alternative measures

What we do is take the difference between two matrices (commonness and target) and use the norm of the resulting matrix as a summary measure.

Currently, we use the absolute-value norm (i.e. the sum of the absolute values of all matrix entries). The issue is that the norms of (3, 7, 5), (1, 9, 5) and (0, 15, 0) are all the same: 15. However, from a landscape perspective, they are different.

A different norm would be the Euclidean norm, i.e. the square root of the sum of squares. For the examples above, the norms would be 9.11, 10.34, and 15, respectively.

An extreme would be the max norm, i.e. the maximum absolute value of the matrix. In the examples, the norms would be 7, 9, and 15.

This is all relatively simple to implement and might have an impact because min_conf directly relies on this measure.
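
For illustration, the three candidate norms applied to the example values above (treating each triple as the entries of a difference matrix):

    # Compare the three norms on the example entries (3,7,5), (1,9,5), (0,15,0).
    diffs <- list(c(3, 7, 5), c(1, 9, 5), c(0, 15, 0))

    sapply(diffs, function(x) sum(abs(x)))     # absolute-value norm: 15 15 15
    sapply(diffs, function(x) sqrt(sum(x^2)))  # Euclidean norm: 9.11 10.34 15.00
    sapply(diffs, function(x) max(abs(x)))     # max norm: 7 9 15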

unknown macro(s) '\alpha' and '\beta'

In the description of generate_commonness_matrix_from_gdm.R, as well as in run_optimization_min_conf.R, both "\alpha" and "\beta" cause the following warnings:

/man/generate_commonness_matrix_from_gdm.Rd:41: unknown macro '\alpha'
/man/generate_commonness_matrix_from_gdm.Rd:41: unknown macro '\beta'
/man/run_optimization_min_conf.Rd:71: unknown macro '\alpha'
/man/run_optimization_min_conf.Rd:71: unknown macro '\beta'
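
One way to silence these warnings, assuming the Greek letters appear in the roxygen text of those functions, is to wrap them in Rd's \eqn{} macro instead of writing the LaTeX commands directly (the sentence below is only illustrative):

    #' @description Builds the commonness matrix from the estimated
    #'   \eqn{\alpha}{alpha}- and \eqn{\beta}{beta}-diversity.

The second argument of \eqn{} is the plain-text version used in non-LaTeX help output.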

Annealing

Keep new matrices with a probability of e.g. annealing = 0.001 even if the difference did not decrease, to avoid getting stuck in local minima.

Local minima: accept imperfect solutions (epsilon-replacement)

So far, we accept solutions that are slightly worse than the current solution (delta(|current_solution - last_solution|) < epsilon), but like energy, the value of delta depends on n_sites, n_species, etc. I'd suggest starting with an acceptance_probability of e.g. 0.25 for accepting the current solution, independent of whether it is better than the last solution, and decreasing the acceptance_probability with an increasing number of iterations. The acceptance_probability could itself be optimized as a "hyper-parameter". This is a fairly common procedure in machine learning to avoid local minima, and it would avoid the issues with the energy measure.
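
A self-contained sketch of the suggested schedule: start with an acceptance_probability of 0.25 and let it decay with the iteration count (the linear decay used here is just one possible choice, and all numbers are illustrative):

    # Decaying acceptance probability: worse solutions are accepted often early
    # on and almost never towards the end of the run.
    max_iterations <- 10000
    p_start        <- 0.25

    acceptance_probability <- function(iter) {
      p_start * (1 - iter / max_iterations)
    }

    # e.g. inside the optimisation loop:
    iter <- 5000
    accept_worse_solution <- runif(1) < acceptance_probability(iter)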

Re-enable gh actions with macOS

I deactivated the GH Actions runs on macOS because we have limited computation time for a private repo, and macOS time "costs" 10x that of Linux time.

We just need to uncomment a line in the action.

Kanban?

I'm toying around with the Kanban feature ("Projects"). What do you think about using this? Too much overhead?

Gotten myself lost

So this is probably just me being stupid, but does anyone know the name of the function that generates the target matrix from the alpha and beta diversity measures? I have found calculate_solution_commonness.cpp, which calculates the commonness from the solution presence-absence matrix, but I can't see the function that generates the target.

Weighted species selection for initial solution

We should implement a weighted random selection procedure to improve our initial solution before it goes into the main optimisation algorithm (as done in the Mokany et al. 2011 supplementary materials). Their approach does require a couple of sites' worth of known species data, but then basically selects a couple of the known sites predicted to match the rest of the sites in the landscape. The species in these selected sites are given a weighting that increases their probability of occurring in the other sites, hopefully creating a better initial solution and speeding up the creation of an estimate. I like this approach as it is quite simple to envisage, but I would be happy if anyone has a better idea of how to do this.
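
A minimal sketch of how such a weighted selection for the initial solution could look (all objects and weights below are made up for illustration; this is not Mokany et al.'s exact procedure):

    # Species seen at the known sites get a higher sampling weight when filling
    # an unknown site of the initial solution.
    set.seed(42)
    n_species      <- 10
    known_presence <- rbinom(n_species, 1, 0.3)   # toy: species observed at known sites

    weights          <- ifelse(known_presence == 1, 3, 1)   # arbitrary up-weighting
    richness_at_site <- 4
    sample(seq_len(n_species), size = richness_at_site, prob = weights / sum(weights))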

naming1: discrepancy D or energy E?

I still think that E reads well, but Kerstin asked whether we should change the name. We could also use discrepancy D. Anyway, the following naming & subscripts could also work for E...

If we go for D, I'd suggest D_{CO} [ _ = subscript ] for discrepancy in commonness & D_{BC} for "Bray-Curtis dissimilarity" (in the ms, Bray-Curtis is used rather than Sørensen).

An example sentence would read as: absolute D_{CO} (for the sum over all siteXsite pairs) was 42, mean D_{CO} (per siteXsite pair) was 0.084, and standardized D_{CO} (relative to the average of 1000 random solutions) was 0.12.

Use R's RNG in Rcpp code

Either use it once to seed the C++ RNG, or use it throughout (needs a performance comparison).

Benefit: no seed parameter needed, auto-sync between R and C++ seeds

"new" min_conf is very slow

Compared to min_conf_0, the new min_conf is > 10 times slower.
BCI data, 50 sites, 100 species, 1000 iterations:
min_conf_0: 5 seconds, min_conf: 52 seconds;

BCI data, 100 sites, 100 species, 10 known sites, 200 iterations:
40 seconds vs 6 minutes

I'll send you my script, Sebastian...

Name change

Just a thought: if we make this a package, I think it may be in competition with Karel's idea, so to be safe maybe we should not call it DynamicFOAM.

First bit of code

So, using this as our message board for now: I have built the simplified version of the dynamicFOAM algorithm and pushed the code to this repository. Everything is rather basic, especially the documentation. If you want to have a look at how it works, I have created an R Notebook (example_run) which goes through everything. The example is basically just the first iteration of running our EFFoRTS-ABM approach, on a shrunken landscape.

dplyr

  1. Does the package need the function simulate_environmental_data?
  2. In any case, we should get rid of every dplyr dependency.

TODO

Not sure where exactly to put this (guess that's why they have the Kanban but hey).

This is a summary of what still needs to be done from my point of view.

  • Improve the optimisation algorithm (I have listed the ideas I have thought of, but if you have any, please add and implement them)

    • Select the columns with the lowest relatedness to target preferentially
    • Weight the species selection for the creation of the initial solution matrix
    • Allow users to include actual site data
  • Finalise the package

    • Implement testing @mhesselbarth
    • Write documentation
    • Submit to CRAN
  • Run the example analyses @csim063

    • Parse out the presence/absence data from ebirds (hopefully with related environmental measures)
    • Get environmental predictor rasters
    • Calculate alpha and beta models
    • Run optimisation
    • Write vignette
  • Write and submit a paper @csim063

Allow approximate solutions

E.g. allow that +/- 1 species or +/- 10 % of species in the siteXsite commonness counts as a "perfect" solution.

Example:
target for site pair 5X7 = 10
commonness values of 9, 10, or 11 for site pair 5X7 would then all yield a difference/energy/discrepancy between target and commonness of 0.

A good idea would be to start very coarse (like +/-25% allowed) and increase precision over time.
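
A small sketch of the tolerance idea, using the example above (treating everything within the tolerance as a difference of 0 and, as one possible choice, counting only the excess beyond the tolerance otherwise):

    # Target commonness of 10; solutions 9, 10, 11 all count as perfect with tol = 1.
    target    <- 10
    solutions <- c(9, 10, 11, 13)

    tolerant_difference <- function(solution, target, tol = 1) {
      d <- abs(solution - target)
      ifelse(d <= tol, 0, d - tol)
    }

    tolerant_difference(solutions, target)                        # 0 0 0 2
    tolerant_difference(solutions, target, tol = 0.1 * target)    # +/- 10 %: same here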

Feature: optimize further functionality

I think it would be nice to be able to further optimize a result, i.e. give the solution back to the algorithm and let it try for some more iterations.

For this, we'd probably want to have some sort of S4/R class where the result is stored within a class object and the algorithm can re-use it.
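
A rough sketch of what such a result object could look like (class name, slots, and the restart function are all hypothetical):

    # Store everything the optimizer would need to resume in one S4 object.
    setClass("spectre_result",
             slots = c(solution   = "matrix",
                       energy     = "numeric",
                       iterations = "integer"))

    # A restart could then look like (hypothetical interface):
    # continue_optimization(result, additional_iterations = 1000)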

What do you think about it?

Data

Can we use the material in inst/examples, starting from the point where we have the beta and alpha predictions (e.g. the pairwise similarity matrix and the species richness list)?

Or is this data private?

@nldoc @csim063

add feature "fixed_species" to min_conf algorithm?

So far "fixed species" can only be used with min_conf_0, but not with min_conf. We could either keep both algorithms, or use only min_conf (Jan used min_conf for the virtual_species, I used min_conf_0 for BCI). If I had to chose one, I'd pick min_conf now.

Implement improved optimization algorithm(s)

The current optimization algorithm is very simple. We could implement one or more options to improve the efficiency of the optimization (genetic algorithms, simulated annealing, Bayesian optimization, ...).

Inaccurate solutions

Following on from the last issue about scaled difference measures: I implemented that, so we now essentially get the proportion of cells which do not match the target estimate (or simply the proportion of error). This has highlighted a problem: we have a huge error. Running the optimisation on the test data for 10,000 iterations with a patience of 10,000, and repeating that 10 times, I never got below a 0.7 or 70% error rate. 😱

Kickoff

Guys, I am hyped 😆

How do we start, where do we start, what do I have to read ...
Give me the details!

desired returns of the algorithm

A: No fixed species:

A1. best speciesXsite solution (already implemented)
A2. absolute commonness error between solution matrix and objective matrix for each iteration (already implemented). Primarily used for plotting, so users can estimate how many iterations they need to run.
A3. RCE [%] between best solution matrix and objective matrix.
Calculated as:

  • "absolute commonness error between solution matrix and objective matrix"
  • divided by
  • "number of siteXsite pairs [= (nsites^2 - nsites) / 2]"
  • divided by
  • "mean commonness of the objective"
  • times 100.
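
A quick sketch of that calculation with made-up numbers:

    # RCE [%]: absolute commonness error, per site pair, relative to the mean
    # commonness of the objective (all values below are illustrative).
    n_sites                   <- 50
    abs_commonness_error      <- 420
    n_pairs                   <- (n_sites^2 - n_sites) / 2
    mean_objective_commonness <- 8

    abs_commonness_error / n_pairs / mean_objective_commonness * 100   # ~4.3 %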

UPDATE 2021-05-06
A3: calc_commonness_error() now calculates both the mean absolute error in commonness and the RCE [%].

Thus spectre produces all desired outputs for A (no fixed species).

B: Fixed species, assuming that at "known sites" species lists are "complete", i.e. all absent & present species are known, as opposed to having incomplete species lists at known sites. (we decided to leave B to the user)

I split information from "known sites" into "input" & "testing" data. Species lists from "input sites" were fixed and a solution was calculated by spectre.

B1. Evaluation:
Using only the "testing sites", correctly predicted species [%] were calculated as:

  • Species that were present both in "testing data" && "solution" at the same site (=correctly predicted)
  • divided by
  • "number of present species in the testing data"
  • times 100.
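
A toy sketch of that evaluation (the vectors below stand in for the site X species data restricted to the testing sites):

    # Correctly predicted species [%] on the testing sites.
    testing  <- c(1, 1, 0, 1, 0, 1)   # observed presences in the testing data
    solution <- c(1, 0, 0, 1, 1, 1)   # spectre's solution at the same sites

    sum(testing == 1 & solution == 1) / sum(testing == 1) * 100   # 75 %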

Not too sure whether B should be in the package, since:

  • it is already in the code for the "BCI known species" analysis (but quite messy)
  • evaluation makes sense only if species lists of testing sites are complete.

Swap species issue in while loop

Is it true that the row selection currently does not check that a 1 is only swapped with a 0? I.e., can it currently happen that two zeros or two ones are swapped with one another?

    random_col <- sample(seq_len(ncol(new_solution)), size = 1)
    random_row1 <- sample(seq_len(nrow(new_solution)), size = 1)
    random_row2 <- sample(seq_len(nrow(new_solution)), size = 1)

    new_solution[random_row1, random_col] <- current_solution[random_row2, random_col]
    new_solution[random_row2, random_col] <- current_solution[random_row1, random_col]
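
If that is indeed the case, a possible fix would be to sample one row that holds a 1 and one that holds a 0 in the chosen column; a self-contained sketch with toy matrices (the real objects would come from the optimiser):

    # Swap that always exchanges a 1 with a 0; columns that are all 0s or all 1s are skipped.
    set.seed(1)
    current_solution <- matrix(rbinom(5 * 3, 1, 0.5), nrow = 5, ncol = 3)
    new_solution     <- current_solution

    random_col <- sample(seq_len(ncol(new_solution)), size = 1)
    ones  <- which(current_solution[, random_col] == 1)
    zeros <- which(current_solution[, random_col] == 0)

    if (length(ones) > 0 && length(zeros) > 0) {
      # sample() on a length-1 vector draws from 1:x, so guard against that case
      random_row1 <- if (length(ones)  == 1) ones  else sample(ones,  size = 1)
      random_row2 <- if (length(zeros) == 1) zeros else sample(zeros, size = 1)

      new_solution[random_row1, random_col] <- 0
      new_solution[random_row2, random_col] <- 1
    }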

return "best" solution, instead of last solution

At the moment, we return the solution from the last iteration. (1) Since in most cases energy increases when random site(s) are deleted temporarily (parameter patience), the last solution is not necessarily the best solution, especially if a deletion took place shortly before the last iteration. (2) Another observation I made is that, in the beginning (all sites empty), energy can be lower than when all sites are filled (admittedly only the case if overall commonness is very low). Thus, if we constrain the solutions to have no empty sites and constantly update the "temporarily best solution" whenever a new solution (with all sites filled) has a lower energy, we could overcome issues 1 and 2.
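
A small, self-contained sketch of the bookkeeping (the proposal step and energy function below are dummies; the point is only keeping the best no-empty-site solution):

    set.seed(1)
    n_species      <- 20
    n_sites        <- 5
    max_iterations <- 100

    solution  <- matrix(rbinom(n_species * n_sites, 1, 0.5), nrow = n_species)
    energy_of <- function(x) sum(abs(colSums(x) - 10))   # dummy energy

    best_energy   <- Inf
    best_solution <- NULL

    for (iter in seq_len(max_iterations)) {
      # dummy proposal: flip one random cell
      idx           <- sample(length(solution), size = 1)
      solution[idx] <- 1L - solution[idx]

      energy <- energy_of(solution)
      if (energy < best_energy && all(colSums(solution) > 0)) {   # no empty sites
        best_energy   <- energy
        best_solution <- solution
      }
    }
    # return best_solution instead of the last solution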

Profile function speeds

Need to identify any sticking points in the code that are worth trying to streamline before we get into C++. @marcosci has kindly offered to do this.

Improved 'fixed_species' parameter

I am thinking about changing the behavior of the 'fixed_species' parameter a bit. At the moment 'fixed_species' is simply a special kind of 'partial_solution', i.e. species that are present are not changed.

I would like to change it so that 'fixed_species' becomes a kind of mask for 'partial_solution'. 'fixed_species' would then simply be a site X species matrix with values of either 0 or 1. Parts of 'partial_solution' that are marked with 1 by 'fixed_species' are then ignored by the optimizer. This way we can also make sure that species are not allowed to occur in certain sites.
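
A toy sketch of the proposed semantics (all objects below are illustrative 3-site X 4-species matrices):

    set.seed(1)
    n_sites   <- 3
    n_species <- 4

    partial_solution <- matrix(rbinom(n_sites * n_species, 1, 0.5),
                               nrow = n_sites, ncol = n_species)

    # fixed_species as a mask: 1 = cell is fixed (kept as in partial_solution),
    # 0 = the optimizer may change it.
    fixed_species      <- matrix(0, nrow = n_sites, ncol = n_species)
    fixed_species[, 1] <- 1   # species 1 is fixed at all sites, present or absent

    free_cells <- which(fixed_species == 0)   # cells the optimizer may still swap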

What do you think? Does this break any of our scripts?

@mspngnbrg @nldoc @csim063 @mhesselbarth

if fixed species > richness at a site, min_conf_0 gets stuck in a loop

When using estimated richness, for example from an alpha-diversity model, and the number of fixed (="known") species at a site is larger than the predicted richness, spectre does not find "another species to add" and then gets stuck in a loop. For now, I set predicted richness to the number of fixed species where necessary in my own R code to control for this, but I think a fix/warning/check in the package would be nice.
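
A one-line sketch of that workaround (the vectors are illustrative, one entry per site):

    # Ensure predicted richness is never below the number of fixed species per site.
    predicted_richness <- c(5, 2, 8)
    n_fixed_species    <- c(3, 4, 1)

    pmax(predicted_richness, n_fixed_species)   # 5 4 8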
