r-spatialecology / spectre
Home Page: https://r-spatialecology.github.io/spectre/
License: Other
Todo list for what needs to be polished:
Stopping spectre may take more than 40 minutes for large landscapes with many species. It would be convenient if the package checked for "stops" from time to time.
Get rid of ggplot2 dependency
There is now an Rcpp function for calculate_solution_commonness_site.
It's faster than the previous implementation, but still very bare:
expr      min       lq      mean    median        uq      max neval cld
   r 484.7872 762.7204 1357.2797 1404.3644 1746.8956 2440.874    10   a
rcpp 325.4761 464.5862  909.2614  713.4591  963.5737 3100.043    10   a
... if everyone is in favor, I would vote to include RcppParallel and try to fiddle with this function.
@csim063 ...
What are we missing that you can do there? I think we need the same stuff in the package as well.
The idea is to have a keyword in the commit message (e.g. [skip-ci]) to avoid triggering a CI run. This is useful for minor changes like typos or edits to the readme. It doesn't seem to be very complicated to implement.
Quick Google search:
https://github.community/t/github-actions-does-not-respect-skip-ci/17325/8
https://timheuer.com/blog/skipping-ci-github-actions-workflows/
https://github.com/marketplace/actions/skip-based-on-commit-message
It would be helpful to automatically track the speed of spectre via GH Actions to avoid unexpected performance regressions.
There are ways to do this for the C++ part:
https://thomaspoignant.medium.com/ci-build-performance-testing-with-github-action-e6b227097c83
https://github.com/catchorg/Catch2/blob/devel/docs/benchmarks.md
https://github.com/marketplace/actions/r-benchmark
I think, however, benchmarking on the R side would be more beneficial. We need some research here.
We should scale the difference value checked as a stop condition with the size of the landscape: for the same amount of error, the absolute difference gets smaller as landscapes become smaller, making it difficult for users to know what stop condition to set.
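One way to make the stop condition comparable across landscape sizes would be to normalize the summed error by the number of site pairs. This is only a sketch of the idea; the function name and signature are illustrative, not the package's API:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: normalize the absolute commonness error by the
// number of unique site pairs, so the same per-pair error yields the
// same stop-condition value regardless of landscape size.
double normalized_error(const std::vector<double>& abs_diffs, std::size_t n_sites) {
    double sum = 0.0;
    for (double d : abs_diffs) sum += std::fabs(d);
    const double n_pairs = n_sites * (n_sites - 1) / 2.0;
    return sum / n_pairs;  // mean error per site pair
}
```

With this normalization, a 5-site landscape (10 pairs) and a 50-site landscape (1225 pairs) with the same per-pair error produce the same stop-condition value.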
I will clean up the code next week. The idea is to have a clean version in the master branch that could be published with a paper. It will contain the min_conf_0 algorithm only. Other algorithms will go to separate branches.
To avoid interference, please let me know if you work on the code within the next two weeks.
I'd like to try two things now:
What we do is to take the difference between two matrices (commonness and target) and take the norm of the resulting matrix as a summarized measure.
Currently, we use the Absolute-value norm (i.e. the sum of the absolute values of all matrix entries). The issue is that the norms of (3, 7, 5), (1, 9, 5) and (0, 15, 0) are all the same: 15. However, from a landscape perspective, they are different.
A different norm would be the Euclidean norm, i.e. the square root of the sum of squares. For the examples above, the norms would be 9.11, 10.344, and 15, respectively.
An extreme would be the Max norm, i.e. the maximum value of each matrix. In the example, the norms would be 7, 9, 15.
This is all relatively simple to implement and might have an impact because min-conf directly relies on this measure.
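The three candidate norms above can be sketched as follows (a minimal illustration; in the package the input would be the flattened difference matrix between commonness and target):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// The three norms discussed above, applied to the entries of the
// difference matrix (flattened into a vector for simplicity).
double norm_abs(const std::vector<double>& v) {       // absolute-value norm
    double s = 0.0;
    for (double x : v) s += std::fabs(x);
    return s;                                         // e.g. {3, 7, 5} -> 15
}

double norm_euclidean(const std::vector<double>& v) { // Euclidean norm
    double s = 0.0;
    for (double x : v) s += x * x;
    return std::sqrt(s);                              // e.g. {3, 7, 5} -> ~9.11
}

double norm_max(const std::vector<double>& v) {       // max norm
    double m = 0.0;
    for (double x : v) m = std::max(m, std::fabs(x));
    return m;                                         // e.g. {3, 7, 5} -> 7
}
```

The Euclidean and Max norms penalize concentrated error more strongly, which is why (0, 15, 0) scores worst under both.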
In the generate_commonness_matrix_from_gdm.R description, as well as in run_optimization_min_conf.R, both "\alpha" and "\beta" cause the following warnings:
/man/generate_commonness_matrix_from_gdm.Rd:41: unknown macro '\alpha'
/man/generate_commonness_matrix_from_gdm.Rd:41: unknown macro '\beta'
/man/run_optimization_min_conf.Rd:71: unknown macro '\alpha'
/man/run_optimization_min_conf.Rd:71: unknown macro '\beta'
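A common fix for these warnings (a sketch; I'm assuming the macros come from roxygen comments, which aren't shown here) is to use Rd's \eqn{} macro instead of bare TeX commands, since the Rd parser does not know raw TeX macros:

```
% Rd understands \eqn{latex}{ascii}, but not bare TeX macros:
\eqn{\alpha}{alpha} and \eqn{\beta}{beta}
```

In roxygen, that would mean writing `#' \eqn{\alpha}{alpha}` in the R source instead of `#' \alpha`.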
We should turn this into a function so that users can generate their own species matrices.
Keep new matrices with a probability of e.g. annealing = 0.001, even if the difference did not decrease, to avoid getting stuck in local minima.
So far, we accept solutions that are slightly worse than the actual solution (delta( |current_solution - last_solution| ) < epsilon), but like energy, the value of delta depends on n_sites, n_species, etc. I'd suggest starting with an acceptance_probability of e.g. 0.25 of accepting the current solution, regardless of whether it is better than the last solution, and decreasing the acceptance_probability with an increasing number of iterations. The acceptance_probability could itself be optimized as a "hyper-parameter". This is a quite common procedure in machine learning to avoid local minima, and it would avoid the energy measure issues.
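The suggested acceptance rule could look like this (a minimal sketch; function and parameter names are illustrative, not the package API, and the linear decay schedule is one assumption among several possible):

```cpp
#include <cassert>
#include <random>

// Probability of accepting a WORSE solution: starts at p0 (e.g. 0.25)
// and decays linearly to 0 over the run.
double acceptance_probability(double p0, long iter, long max_iter) {
    return p0 * (1.0 - static_cast<double>(iter) / max_iter);
}

// Improvements are always accepted; worse solutions are accepted with
// the (decaying) acceptance probability.
bool accept(double new_error, double old_error,
            double p0, long iter, long max_iter, std::mt19937& rng) {
    if (new_error <= old_error) return true;  // improvement: always accept
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    return unif(rng) < acceptance_probability(p0, iter, max_iter);
}
```

As suggested above, p0 could then be tuned as a hyper-parameter; an exponential decay schedule would be an obvious alternative to the linear one used here.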
Need to redo the for loops to only run over a single triangle of the matrix. @nldoc has said he could do this, thanks.
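Since the commonness matrix is symmetric, the loops only need to visit each unordered site pair once. A sketch of the triangular loop structure (illustrative, not the actual package code):

```cpp
#include <cassert>
#include <cstddef>

// Visit each unordered site pair exactly once by looping over the
// upper triangle only (j > i), instead of the full n x n matrix.
// For a symmetric commonness matrix this halves the work.
std::size_t count_visited_pairs(std::size_t n_sites) {
    std::size_t visited = 0;
    for (std::size_t i = 0; i < n_sites; ++i) {
        for (std::size_t j = i + 1; j < n_sites; ++j) {
            ++visited;  // process pair (i, j) here
        }
    }
    return visited;  // equals n_sites * (n_sites - 1) / 2
}
```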
Think about plotting functions
I deactivated GH Actions with macOS because we have limited computation time with a private repo and GH Actions. macOS time "costs" 10x of Linux time.
We just need to un-comment a line in the action.
I'm toying around with the Kanban feature ("Projects"). What do you think about using this? Too much overhead?
So this is probably just me being stupid, but does anyone know the name of the function that generates the target matrix from the alpha and beta diversity measures? I have found calculate_solution_commonness.cpp, which calculates the commonness from the solution presence-absence matrix, but I can't see the function that generates the target.
We should implement a weighted random selection procedure to improve our initial solution before it goes into the main optimisation algorithm (as done in Mokany et al. 2011, supplementary materials). Their approach does require a couple of sites' worth of known species data, but then basically selects a couple of the known sites that are predicted to be representative of the rest of the sites in the landscape. The species in these selected sites are then given a weighting that increases their probability of occurring in the other sites, hopefully creating a better initial solution and speeding up the creation of an estimate. I like this approach as it is quite simple to envisage, but I would be happy if anyone has a better idea of how to do this.
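The weighted draw itself is straightforward. A sketch of the core mechanism (my reading of the idea above, not the Mokany et al. procedure verbatim, and not package code): species observed in the selected known sites get larger weights and are therefore more likely to be drawn into the initial solution of the other sites.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Draw a species index with probability proportional to its weight.
// Weights could e.g. be occurrence counts in the selected known sites.
int draw_species(const std::vector<double>& weights, std::mt19937& rng) {
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return dist(rng);
}
```

Repeating the draw until a site reaches its predicted richness would fill the initial presence-absence matrix with a bias toward locally plausible species.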
There are quite a few places in the code where energy has to be replaced by (absolute commonness) error.
I still think that E reads well, but Kerstin asked whether we should change the name. We could also use discrepancy D. Anyway, the following naming & subscripts could also work for E...
If we go for D: I'd suggest D_{CO} [ _ = subscript ] for discrepancy in commonness & D_{BC} for Bray-Curtis dissimilarity (in the ms, Bray-Curtis is used rather than Sorensen).
An example sentence would read as: absolute D_{CO} (for the sum over all siteXsite pairs) was 42, mean D_{CO} (per siteXsite pair) was 0.084, and standardized D_{CO} (relative to the average of 1000 random solutions) was 0.12.
When commonness is very low, the energy can be lower at the beginning (all sites empty) than when all sites are filled.
Should we allow for some inaccuracy with regard to #43?
Either once to seed the C++ RNG or completely (need a performance comparison).
Benefit: no seed parameter needed, auto-sync between R and C++ seeds
which, as @mspngnbrg found out, somehow does not hold :-(
Compared to min_conf_0, the new min_conf is > 10 times slower.
BCI data, 50 sites, 100 species, 1000 iterations:
min_conf_0: 5 seconds, min_conf: 52 seconds;
BCI data, 100 sites, 100 species, 10 known sites, 200 iterations:
40 seconds vs 6 minutes
I'll send you my script, Sebastian...
Just a thought. If we make this a package, I think it may be in competition with Karel's idea, so to be safe maybe we should not call it DynamicFOAM.
So, using this as our message board for now. I have built the simplified version of the dynamicFOAM algorithm and pushed the code to this repository. Everything is rather basic, especially the documentation. If you want to have a look at how it works, I have created an R Notebook (example_run) that goes through everything. The example is basically just the first iteration of running our EFFoRTS-ABM approach, on a shrunken landscape.
Not sure where exactly to put this (guess that's why they have the Kanban but hey).
This is a summary of what still needs to be done from my point of view.
Improve the optimisation algorithm (I have listed the ideas I have thought of but if you have any please add and implement)
Finalise the package
Run the example analyses @csim063
Write and submit a paper @csim063
E.g. allow that +/- 1 species or +/- 10 % of species in the siteXsite commonness counts as a "perfect" solution.
Example: if the target commonness for site pair 5X7 is 10, then commonness values of 9, 10, or 11 would all yield a difference/energy/discrepancy of 0 between target and commonness.
A good idea would be to start very coarse (like +/-25% allowed) and increase precision over time.
see Jan's experiments
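The coarse-to-fine idea could be sketched like this (names and the linear tightening schedule are illustrative assumptions, not the package's implementation):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Tolerance starts wide (e.g. 25%) and tightens linearly to 0 over the run.
double tolerance(double tol_start, long iter, long max_iter) {
    const double frac = static_cast<double>(iter) / max_iter;
    return std::max(0.0, tol_start * (1.0 - frac));
}

// A site pair counts as "perfect" if its commonness lies within the
// current relative tolerance of the target.
bool within_tolerance(double commonness, double target, double tol) {
    return std::fabs(commonness - target) <= tol * target;
}
```

Early iterations would then accept rough matches (target 10, commonness 8-12 at 25% tolerance), while late iterations require near-exact agreement.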
I think it would be nice to be able to further optimize a result, i.e. give the solution back to the algorithm and let it try for some more iterations.
For this, we'd probably want to have some sort of S4/R class where the result is stored within a class object and the algorithm can re-use it.
What do you think about it?
So far, "fixed species" can only be used with min_conf_0, but not with min_conf. We could either keep both algorithms, or use only min_conf (Jan used min_conf for the virtual_species, I used min_conf_0 for BCI). If I had to choose one, I'd pick min_conf now.
The current optimization algorithm is very simple. We could implement one or more options to improve the efficiency of the optimization (genetic algorithms, simulated annealing, Bayesian optimization, ...).
So, following from the last issue, on making scaled difference measures: I implemented that, so now we roughly get the proportion of cells that do not match the target estimate (or simply the proportion of error). This has highlighted a problem: we have a huge error. Running the optimisation on the test data for 10,000 iterations with a patience of 10,000, and repeating that 10 times, I never got below a 0.7, or 70%, error rate.
Guys, I am hyped!
How do we start, where do we start, what do I have to read ...
Give me the details!
I think it is only a configuration issue with GitHub Actions. @mhesselbarth, can you confirm that it builds on macOS in general?
A1. best speciesXsite solution (already implemented)
A2. absolute commonness error between solution matrix and objective matrix for each iteration (already implemented). Primarily used for plotting, so users can estimate how many iterations they need to run.
A3. RCE [%] between best solution matrix and objective matrix.
Calculated as:
UPDATE 2021-05-06
A3: now calc_commonness_error() calculates both the mean absolute error in commonness, as well as the RCE [%].
Thus spectre produces all desired outputs for A (no fixed species).
I split information from "known sites" into "input" & "testing" data. Species lists from "input sites" were fixed and a solution was calculated by spectre.
B1. Evaluation:
Using only the "testing sites", correctly predicted species [%] were calculated as:
Not too sure whether B should be in the package, since:
Is it true that the current row selection does not check that a 1 is swapped with a 0? I.e. can it currently happen that two zeros or two ones are swapped with one another?
random_col  <- sample(seq_len(ncol(new_solution)), size = 1)
random_row1 <- sample(seq_len(nrow(new_solution)), size = 1)
random_row2 <- sample(seq_len(nrow(new_solution)), size = 1)
# The two rows are sampled independently, so nothing guarantees the two
# cells differ (or even that random_row1 != random_row2):
new_solution[random_row1, random_col] <- current_solution[random_row2, random_col]
new_solution[random_row2, random_col] <- current_solution[random_row1, random_col]
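One way to rule out no-op swaps entirely is to pair an occupied cell with an empty one in the chosen column. A sketch in C++ (illustrative, not the package's code; the package's R/Rcpp data structures will differ):

```cpp
#include <cassert>
#include <random>
#include <utility>
#include <vector>

// In the chosen column, swap one randomly picked 1 with one randomly
// picked 0. Two equal values can never be exchanged, and the column
// sum (per-species occupancy) stays constant by construction.
// solution is row-major: solution[row][col] is 0 or 1.
bool swap_one_zero(std::vector<std::vector<int>>& solution,
                   int col, std::mt19937& rng) {
    std::vector<std::size_t> ones, zeros;
    for (std::size_t r = 0; r < solution.size(); ++r) {
        (solution[r][col] == 1 ? ones : zeros).push_back(r);
    }
    if (ones.empty() || zeros.empty()) return false;  // nothing to swap
    std::uniform_int_distribution<std::size_t> pick_one(0, ones.size() - 1);
    std::uniform_int_distribution<std::size_t> pick_zero(0, zeros.size() - 1);
    std::swap(solution[ones[pick_one(rng)]][col],
              solution[zeros[pick_zero(rng)]][col]);
    return true;
}
```

Returning false for all-0 or all-1 columns also makes the degenerate case explicit instead of silently doing nothing.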
Currently, we return the solution derived from the last iteration. (1) Since in most cases we see an increase in energy when random site(s) are deleted temporarily (parameter patience), the last solution is not necessarily the best solution, especially if a deletion took place shortly before the last iteration. (2) Another observation I made is that in the beginning (all sites empty), energy can be lower than when all sites are filled (admittedly only the case if overall commonness is very low). Thus, if we constrain the solutions to have no empty sites, and constantly update the "temporarily best solution" whenever a new solution (with all sites filled) has a lower energy, we could overcome issues 1 and 2.
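The best-so-far bookkeeping described above is cheap to add. A sketch (names are illustrative; in practice the tracker would also keep a copy of the best solution matrix, omitted here):

```cpp
#include <cassert>
#include <limits>

// Track the best valid solution seen so far: only solutions with all
// sites filled are eligible, and only a strictly lower energy replaces
// the current best. This addresses both issues above.
struct BestTracker {
    double best_energy = std::numeric_limits<double>::infinity();
    bool has_best = false;

    // Returns true if this solution became the new best.
    bool offer(double energy, bool all_sites_filled) {
        if (!all_sites_filled || energy >= best_energy) return false;
        best_energy = energy;
        has_best = true;
        return true;
    }
};
```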
Need to identify any sticking points in the code that are worth trying to streamline before we get into Cpp. @marcosci has kindly offered to do this
I am thinking about changing the behavior of the 'fixed_species' parameter a bit. At the moment 'fixed_species' is simply a special kind of 'partial_solution', i.e. species that are present are not changed.
I would like to change it so that 'fixed_species' becomes a kind of mask for 'partial_solution'. 'fixed_species' would then simply be a site X species matrix with values of either 0 or 1. Parts of 'partial_solution' that are marked with 1 by 'fixed_species' are then ignored by the optimizer. This way we can also make sure that species are not allowed to occur in certain sites.
What do you think? Does this break any of our scripts?
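The proposed mask semantics could be sketched as follows (an assumed reading of the proposal, not the current API): the optimizer would consult the mask before proposing any change to a cell.

```cpp
#include <cassert>
#include <vector>

// 'fixed_species' as a site x species mask of 0/1: cells marked 1 are
// locked (the corresponding 'partial_solution' value is kept) and must
// be skipped when proposing changes.
bool cell_is_free(const std::vector<std::vector<int>>& fixed_species,
                  int site, int species) {
    return fixed_species[site][species] == 0;  // 1 means "do not touch"
}
```

Since a 0 in partial_solution under a mask value of 1 means "this species may NOT occur here", this also covers the exclusion case mentioned above.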
When using estimated richness, for example from an alpha-diversity model, and the number of fixed (= "known") species at a site is larger than the predicted richness, spectre does not find "another species to add" and then gets stuck in a loop. For now, I change the predicted richness to equal the number of fixed species where necessary in my R code to control for this, but a fix/warning/check in the package would be nice.