spectre's People

Contributors

actions-user, bitbacchus, csim063, marcosci, mhesselbarth, mspngnbrg, nldoc

spectre's Issues

Improvements

Todo list for what needs to be polished:

  • Documentation
    • Get Started vignette
    • Detailed introduction to the method
    • Prerequisites
  • Code
    • No warnings, no notes, no errors
    • Get rid of load_raster and other functions to load data
    • Improve naming? Not sure, haven't checked in detail, just a thought

make spectre stoppable

Stopping spectre may take more than 40 minutes for large landscapes with many species. It would be convenient if the package checked for "stops" (user interrupts) from time to time.

Rcpp

There is now an Rcpp function for calculate_solution_commonness_site.

It is faster than the previous implementation, but still very bare:

 expr      min       lq      mean    median        uq      max neval cld
    r 484.7872 762.7204 1357.2797 1404.3644 1746.8956 2440.874    10   a
 rcpp 325.4761 464.5862  909.2614  713.4591  963.5737 3100.043    10   a

... if everyone is in favor, I would vote to include RcppParallel and try to fiddle with this function.

Add continuous benchmarking

It would be helpful to automatically track the speed of spectre via GH Actions to avoid unexpected performance regressions.

There are ways to do this for the C++ part:

https://thomaspoignant.medium.com/ci-build-performance-testing-with-github-action-e6b227097c83
https://github.com/catchorg/Catch2/blob/devel/docs/benchmarks.md
https://github.com/marketplace/actions/r-benchmark

I think, however, that benchmarking on the R side would be more beneficial. We need to do some research here; a rough sketch of what an R-side check could look like follows.
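
As a starting point, here is a minimal, hypothetical sketch of an R-side benchmark script that a GH Actions step could run and fail on regressions. run_small_test_case() stands in for whatever spectre call we decide to track, and the 30-second threshold is made up:

    # Hypothetical CI benchmark: fail the workflow if the tracked call gets
    # slower than an agreed threshold.
    library(bench)

    run_small_test_case <- function() {
      Sys.sleep(0.1)   # placeholder for the actual spectre call to track
    }

    timing <- bench::mark(run_small_test_case(), iterations = 5, check = FALSE)

    if (as.numeric(timing$median) > 30) {   # threshold in seconds (made up)
      stop("Performance regression: median runtime above threshold")
    }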

Scale difference/energy

We should scale the difference value checked as a stop condition with the size of the landscape: for the same amount of error, the value gets smaller as landscapes become smaller, which makes it difficult for users to know what stop condition to set.
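
A toy illustration of one possible scaling, assuming we simply divide the raw difference by the number of unique siteXsite pairs (all numbers are made up):

    # Divide the raw commonness difference by the number of unique site pairs
    # so the stop condition is comparable across landscape sizes.
    n_sites   <- 50
    raw_error <- 420                          # made-up raw difference
    n_pairs   <- (n_sites^2 - n_sites) / 2    # unique siteXsite pairs
    raw_error / n_pairs                       # ~0.34, comparable across sizes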

Code cleanup

I will clean up the code next week. The idea is to have a clean version in the master branch that could be published with a paper. It will contain only the min_conf_0 algorithm; other algorithms will go to separate branches.

To avoid interference, please let me know if you work on the code within the next two weeks.

New measures

I'd like to try two things now:

  • implement a normalized "energy" (as Kerstin suggested, we probably shouldn't call it energy anymore)
  • try different alternative measures

What we do is take the difference between two matrices (commonness and target) and use the norm of the resulting matrix as a summary measure.

Currently, we use the absolute-value norm (i.e. the sum of the absolute values of all matrix entries). The issue is that the norms of (3, 7, 5), (1, 9, 5) and (0, 15, 0) are all the same: 15. However, from a landscape perspective, they are different.

A different norm would be the Euclidean norm, i.e. the square root of the sum of squares. For the examples above, the norms would be 9.11, 10.34, and 15, respectively.

An extreme would be the max norm, i.e. the maximum absolute value of the matrix. In the examples, the norms would be 7, 9, and 15.

This is all relatively simple to implement and might have an impact because min_conf directly relies on this measure.
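
For illustration, the three candidate norms applied to the example values above (treating each triple as the entries of a difference matrix):

    # Compare the three norms on the example entries (3,7,5), (1,9,5), (0,15,0).
    diffs <- list(c(3, 7, 5), c(1, 9, 5), c(0, 15, 0))

    sapply(diffs, function(x) sum(abs(x)))     # absolute-value norm: 15 15 15
    sapply(diffs, function(x) sqrt(sum(x^2)))  # Euclidean norm: 9.11 10.34 15.00
    sapply(diffs, function(x) max(abs(x)))     # max norm: 7 9 15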

unknown macro(s) '\alpha' and '\beta'

In the description of generate_commonness_matrix_from_gdm.R, as well as in run_optimization_min_conf.R, both "\alpha" and "\beta" cause the following warnings:

/man/generate_commonness_matrix_from_gdm.Rd:41: unknown macro '\alpha'
/man/generate_commonness_matrix_from_gdm.Rd:41: unknown macro '\beta'
/man/run_optimization_min_conf.Rd:71: unknown macro '\alpha'
/man/run_optimization_min_conf.Rd:71: unknown macro '\beta'
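
One way to silence these warnings, assuming the Greek letters appear in the roxygen text of those functions, is to wrap them in Rd's \eqn{} macro instead of writing the LaTeX commands directly (the sentence below is only illustrative):

    #' @description Builds the commonness matrix from the estimated
    #'   \eqn{\alpha}{alpha}- and \eqn{\beta}{beta}-diversity.

The second argument of \eqn{} is the plain-text version used in non-LaTeX help output.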

Annealing

Keep new matrices with a probability of e.g. annealing = 0.001 even if the difference did not decrease, to avoid getting stuck in local minima.

Local minima: accept imperfect solutions (epsilon-replacement)

So far, we accept solutions that are slightly worse than the current solution (delta(|current_solution - last_solution|) < epsilon), but like energy, the value of delta depends on n_sites, n_species, etc. I'd suggest starting with an acceptance_probability of e.g. 0.25 for accepting the current solution, independent of whether it is better than the last solution, and decreasing the acceptance_probability with an increasing number of iterations. The acceptance_probability could itself be optimized as a "hyper-parameter". This is a fairly common procedure in machine learning to avoid local minima, and it would avoid the issues with the energy measure.
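
A self-contained sketch of the suggested schedule: start with an acceptance_probability of 0.25 and let it decay with the iteration count (the linear decay used here is just one possible choice, and all numbers are illustrative):

    # Decaying acceptance probability: worse solutions are accepted often early
    # on and almost never towards the end of the run.
    max_iterations <- 10000
    p_start        <- 0.25

    acceptance_probability <- function(iter) {
      p_start * (1 - iter / max_iterations)
    }

    # e.g. inside the optimisation loop:
    iter <- 5000
    accept_worse_solution <- runif(1) < acceptance_probability(iter)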

Re-enable gh actions with macOS

I deactivated the GH Actions runs on macOS because we have limited computation time for a private repo, and macOS time "costs" 10x that of Linux time.

We just need to uncomment a line in the action.

Kanban?

I'm toying around with the Kanban feature ("Projects"). What do you think about using this? Too much overhead?

Gotten myself lost

So this is probably just me being stupid, but does anyone know the name of the function that generates the target matrix from the alpha and beta diversity measures? I have found calculate_solution_commonness.cpp, which calculates the commonness from the solution presence-absence matrix, but I can't see the function that generates the target.

Weighted species selection for initial solution

We should implement a weighted random selection procedure to improve our initial solution before it goes into the main optimisation algorithm (as done in the Mokany et al. 2011 supplementary materials). Their approach does require a couple of sites' worth of known species data, but then basically selects a couple of the known sites predicted to match the rest of the sites in the landscape. The species in these selected sites are given a weighting that increases their probability of occurring in the other sites, hopefully creating a better initial solution and speeding up the creation of an estimate. I like this approach as it is quite simple to envisage, but I would be happy if anyone has a better idea of how to do this.
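
A minimal sketch of how such a weighted selection for the initial solution could look (all objects and weights below are made up for illustration; this is not Mokany et al.'s exact procedure):

    # Species seen at the known sites get a higher sampling weight when filling
    # an unknown site of the initial solution.
    set.seed(42)
    n_species      <- 10
    known_presence <- rbinom(n_species, 1, 0.3)   # toy: species observed at known sites

    weights          <- ifelse(known_presence == 1, 3, 1)   # arbitrary up-weighting
    richness_at_site <- 4
    sample(seq_len(n_species), size = richness_at_site, prob = weights / sum(weights))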

naming1: discrepancy D or energy E?

I still think that E reads well, but Kerstin asked whether we should change the name. We could also use discrepancy D. Anyway, the following naming & subscripts could also work for E...

If we go for D, I'd suggest D_{CO} [ _ = subscript ] for discrepancy in commonness & D_{BC} for "Bray-Curtis dissimilarity" (in the ms, Bray-Curtis is used rather than Sørensen).

An example sentence would read as: absolute D_{CO} (for the sum over all siteXsite pairs) was 42, mean D_{CO} (per siteXsite pair) was 0.084, and standardized D_{CO} (relative to the average of 1000 random solutions) was 0.12.

Use R's RNG in Rcpp code

Either use it once to seed the C++ RNG, or use it throughout (needs a performance comparison).

Benefit: no seed parameter needed, auto-sync between R and C++ seeds

"new" min_conf is very slow

Compared to min_conf_0, the new min_conf is > 10 times slower.
BCI data, 50 sites, 100 species, 1000 iterations:
min_conf_0: 5 seconds, min_conf: 52 seconds;

BCI data, 100 sites, 100 species, 10 known sites, 200 iterations:
40 seconds vs 6 minutes

I'll send you my script, Sebastian...

Name change

Just a thought: if we make this a package, I think it may be in competition with Karel's idea, so to be safe maybe we should not call it DynamicFOAM.

First bit of code

So, using this as our message board for now: I have built the simplified version of the dynamicFOAM algorithm and pushed the code to this repository. Everything is rather basic, especially the documentation. If you want to have a look at how it works, I have created an R Notebook (example_run) which goes through everything. The example is basically just the first iteration of running our EFFoRTS-ABM approach, on a shrunken landscape.

dplyr

  1. Does the package need the function simulate_environmental_data?
  2. In any case, we should get rid of every dplyr dependency.

TODO

Not sure where exactly to put this (guess that's why they have the Kanban but hey).

This is a summary of what still needs to be done from my point of view.

  • Improve the optimisation algorithm (I have listed the ideas I have thought of, but if you have any, please add and implement them)

    • Select the columns with the lowest relatedness to target preferentially
    • Weight the species selection for the creation of the initial solution matrix
    • Allow users to include actual site data
  • Finalise the package

    • Implement testing @mhesselbarth
    • Write documentation
    • Submit to CRAN
  • Run the example analyses @csim063

    • Parse out the presence/absence data from ebirds (hopefully with related environmental measures)
    • Get environmental predictor rasters
    • Calculate alpha and beta models
    • Run optimisation
    • Write vignette
  • Write and submit a paper @csim063

Allow approximate solutions

E.g. allow that +/- 1 species or +/- 10 % of species in the siteXsite commonness counts as a "perfect" solution.

Example:
target for site pair 5X7 = 10
commonness values of 9, 10, or 11 for site pair 5X7 would then all yield a difference/energy/discrepancy between target and commonness of 0.

A good idea would be to start very coarse (like +/-25% allowed) and increase precision over time.
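
A small sketch of the tolerance idea, using the example above (treating everything within the tolerance as a difference of 0 and, as one possible choice, counting only the excess beyond the tolerance otherwise):

    # Target commonness of 10; solutions 9, 10, 11 all count as perfect with tol = 1.
    target    <- 10
    solutions <- c(9, 10, 11, 13)

    tolerant_difference <- function(solution, target, tol = 1) {
      d <- abs(solution - target)
      ifelse(d <= tol, 0, d - tol)
    }

    tolerant_difference(solutions, target)                        # 0 0 0 2
    tolerant_difference(solutions, target, tol = 0.1 * target)    # +/- 10 %: same here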

Feature: optimize further functionality

I think it would be nice to be able to further optimize a result, i.e. give the solution back to the algorithm and let it try for some more iterations.

For this, we'd probably want to have some sort of S4/R class where the result is stored within a class object and the algorithm can re-use it.
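
A rough sketch of what such a result object could look like (class name, slots, and the restart function are all hypothetical):

    # Store everything the optimizer would need to resume in one S4 object.
    setClass("spectre_result",
             slots = c(solution   = "matrix",
                       energy     = "numeric",
                       iterations = "integer"))

    # A restart could then look like (hypothetical interface):
    # continue_optimization(result, additional_iterations = 1000)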

What do you think about it?

Data

Can we use the material in inst/examples, starting from the point where we have the beta and alpha predictions (e.g. the pairwise similarity matrix and the species richness list)?

Or is this data private?

@nldoc @csim063

add feature "fixed_species" to min_conf algorithm?

So far "fixed species" can only be used with min_conf_0, but not with min_conf. We could either keep both algorithms, or use only min_conf (Jan used min_conf for the virtual_species, I used min_conf_0 for BCI). If I had to chose one, I'd pick min_conf now.

Implement improved optimization algorithm(s)

The current optimization algorithm is very simple. We could implement one or more options to improve the efficiency of the optimization (genetic algorithms, simulated annealing, Bayesian optimization, ...).

Inaccurate solutions

Following on from the last issue about scaled difference measures: I implemented that, so we now essentially get the proportion of cells which do not match the target estimate (or simply the proportion of error). This has highlighted a problem: we have a huge error. Running the optimisation on the test data for 10,000 iterations with a patience of 10,000, and repeating that 10 times, I never got below a 0.7 or 70% error rate. 😱

Kickoff

Guys, I am hyped 😆

How do we start, where do we start, what do I have to read ...
Give me the details!

desired returns of the algorithm

A: No fixed species:

A1. best speciesXsite solution (already implemented)
A2. absolute commonness error between solution matrix and objective matrix for each iteration (already implemented). Primarily used for plotting, so users can estimate how many iterations they need to run.
A3. RCE [%] between best solution matrix and objective matrix.
Calculated as:

  • "absolute commonness error between solution matrix and objective matrix"
  • divided by
  • "number of siteXsite pairs [= (nsites^2 - nsites) / 2]"
  • divided by
  • "mean commonness of the objective"
  • times 100.
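
A quick sketch of that calculation with made-up numbers:

    # RCE [%]: absolute commonness error, per site pair, relative to the mean
    # commonness of the objective (all values below are illustrative).
    n_sites                   <- 50
    abs_commonness_error      <- 420
    n_pairs                   <- (n_sites^2 - n_sites) / 2
    mean_objective_commonness <- 8

    abs_commonness_error / n_pairs / mean_objective_commonness * 100   # ~4.3 %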

UPDATE 2021-05-06
A3: calc_commonness_error() now calculates both the mean absolute error in commonness and the RCE [%].

Thus spectre produces all desired outputs for A (no fixed species).

B: Fixed species, assuming that at "known sites" species lists are "complete", i.e. all absent & present species are known, as opposed to having incomplete species lists at known sites. (we decided to leave B to the user)

I split information from "known sites" into "input" & "testing" data. Species lists from "input sites" were fixed and a solution was calculated by spectre.

B1. Evaluation:
Using only the "testing sites", correctly predicted species [%] were calculated as:

  • Species that were present both in "testing data" && "solution" at the same site (=correctly predicted)
  • divided by
  • "number of present species in the testing data"
  • times 100.
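
A toy sketch of that evaluation (the vectors below stand in for the site X species data restricted to the testing sites):

    # Correctly predicted species [%] on the testing sites.
    testing  <- c(1, 1, 0, 1, 0, 1)   # observed presences in the testing data
    solution <- c(1, 0, 0, 1, 1, 1)   # spectre's solution at the same sites

    sum(testing == 1 & solution == 1) / sum(testing == 1) * 100   # 75 %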

Not too sure whether B should be in the package, since:

  • it is already in the code for the "BCI known species" analysis (but quite messy)
  • evaluation makes sense only if species lists of testing sites are complete.

Swap species issue in while loop

Is it true that the row selection currently does not check that a 1 is only swapped with a 0? I.e., can it currently happen that two zeros or two ones are swapped with one another?

    random_col <- sample(seq_len(ncol(new_solution)), size = 1)
    random_row1 <- sample(seq_len(nrow(new_solution)), size = 1)
    random_row2 <- sample(seq_len(nrow(new_solution)), size = 1)

    new_solution[random_row1, random_col] <- current_solution[random_row2, random_col]
    new_solution[random_row2, random_col] <- current_solution[random_row1, random_col]
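
If that is indeed the case, a possible fix would be to sample one row that holds a 1 and one that holds a 0 in the chosen column; a self-contained sketch with toy matrices (the real objects would come from the optimiser):

    # Swap that always exchanges a 1 with a 0; columns that are all 0s or all 1s are skipped.
    set.seed(1)
    current_solution <- matrix(rbinom(5 * 3, 1, 0.5), nrow = 5, ncol = 3)
    new_solution     <- current_solution

    random_col <- sample(seq_len(ncol(new_solution)), size = 1)
    ones  <- which(current_solution[, random_col] == 1)
    zeros <- which(current_solution[, random_col] == 0)

    if (length(ones) > 0 && length(zeros) > 0) {
      # sample() on a length-1 vector draws from 1:x, so guard against that case
      random_row1 <- if (length(ones)  == 1) ones  else sample(ones,  size = 1)
      random_row2 <- if (length(zeros) == 1) zeros else sample(zeros, size = 1)

      new_solution[random_row1, random_col] <- 0
      new_solution[random_row2, random_col] <- 1
    }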

return "best" solution, instead of last solution

At the moment, we return the solution from the last iteration. (1) Since in most cases energy increases when random site(s) are deleted temporarily (parameter patience), the last solution is not necessarily the best solution, especially if a deletion took place shortly before the last iteration. (2) Another observation I made is that, in the beginning (all sites empty), energy can be lower than when all sites are filled (admittedly only the case if overall commonness is very low). Thus, if we constrain the solutions to have no empty sites and constantly update the "temporarily best solution" whenever a new solution (with all sites filled) has a lower energy, we could overcome issues 1 and 2.
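
A small, self-contained sketch of the bookkeeping (the proposal step and energy function below are dummies; the point is only keeping the best no-empty-site solution):

    set.seed(1)
    n_species      <- 20
    n_sites        <- 5
    max_iterations <- 100

    solution  <- matrix(rbinom(n_species * n_sites, 1, 0.5), nrow = n_species)
    energy_of <- function(x) sum(abs(colSums(x) - 10))   # dummy energy

    best_energy   <- Inf
    best_solution <- NULL

    for (iter in seq_len(max_iterations)) {
      # dummy proposal: flip one random cell
      idx           <- sample(length(solution), size = 1)
      solution[idx] <- 1L - solution[idx]

      energy <- energy_of(solution)
      if (energy < best_energy && all(colSums(solution) > 0)) {   # no empty sites
        best_energy   <- energy
        best_solution <- solution
      }
    }
    # return best_solution instead of the last solution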

Profile function speeds

Need to identify any sticking points in the code that are worth trying to streamline before we get into C++. @marcosci has kindly offered to do this.

Improved 'fixed_species' parameter

I am thinking about changing the behavior of the 'fixed_species' parameter a bit. At the moment 'fixed_species' is simply a special kind of 'partial_solution', i.e. species that are present are not changed.

I would like to change it so that 'fixed_species' becomes a kind of mask for 'partial_solution'. 'fixed_species' would then simply be a site X species matrix with values of either 0 or 1. Parts of 'partial_solution' that are marked with 1 by 'fixed_species' are then ignored by the optimizer. This way we can also make sure that species are not allowed to occur in certain sites.
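
A toy sketch of the proposed semantics (all objects below are illustrative 3-site X 4-species matrices):

    set.seed(1)
    n_sites   <- 3
    n_species <- 4

    partial_solution <- matrix(rbinom(n_sites * n_species, 1, 0.5),
                               nrow = n_sites, ncol = n_species)

    # fixed_species as a mask: 1 = cell is fixed (kept as in partial_solution),
    # 0 = the optimizer may change it.
    fixed_species      <- matrix(0, nrow = n_sites, ncol = n_species)
    fixed_species[, 1] <- 1   # species 1 is fixed at all sites, present or absent

    free_cells <- which(fixed_species == 0)   # cells the optimizer may still swap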

What do you think? Does this break any of our scripts?

@mspngnbrg @nldoc @csim063 @mhesselbarth

if fixed species > richness at a site, min_conf_0 gets stuck in a loop

When using estimated richness, for example from an alpha-diversity model, and the number of fixed (="known") species at a site is larger than the predicted richness, spectre does not find "another species to add" and then gets stuck in a loop. For now, I set predicted richness to the number of fixed species where necessary in my own R code to control for this, but I think a fix/warning/check in the package would be nice.
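
A one-line sketch of that workaround (the vectors are illustrative, one entry per site):

    # Ensure predicted richness is never below the number of fixed species per site.
    predicted_richness <- c(5, 2, 8)
    n_fixed_species    <- c(3, 4, 1)

    pmax(predicted_richness, n_fixed_species)   # 5 4 8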
