Analysis for InsectMobile diversity and biomass across Denmark and Germany in the summer of 2018 and 2019

R 100.00%

insects biomass diversity dna-metabarcoding citizen-science landcover landuse

insectmobile_diversity_and_biomass's Introduction

InsectMobile: landscape-level drivers of flying insect diversity and biomass

WORK IN PROGRESS

This repository (R project) contains all scripts necessary to run the ecological/statistical analyses from the study "Landscape-level drivers of flying insect diversity and biomass" (Svenningsen et al. XXXX, DOI: [add link]).

The data were collected in June/July 2018 and June 2019 as part of the citizen science project InsectMobile ("Insektmobilen") at the Natural History Museum of Denmark and iDiv, Germany.

Data: the data used in the analyses (e.g. the proportional land cover and land use data, flying insect biomass and diversity estimates for each buffer zone) are deposited in XX: [add link]

NOTE ON BIOINFORMATICS

Two sequencing platforms were used to generate the data: HiSeq 4000 and NovaSeq 6000. NovaSeq assigns quality scores differently from HiSeq: it simplifies the error rates by binning the 40 possible quality scores into just 4 categories, which vastly reduces the amount of information dada2 can use to infer errors in the data. This discrepancy between platforms was not dealt with during the dada2 processing, as we were unaware of the problem when we ran the bioinformatics. It seems to affect how many rare species/sequences are detected and retained: fewer rare species are detected from NovaSeq data if the error rate step in the dada2 pipeline is not updated to accommodate the reduced set of quality scores. If anyone wishes to use the data for analysis, we encourage them to find a way to deal with the NovaSeq processing in dada2 (for example by testing the four options mentioned here) and to apply the best-suited fix for these data by evaluating the quality score plots.
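A quick way to see whether a set of reads carries binned quality scores is to count the distinct quality values they contain. The sketch below is base R only and assumes the standard Phred+33 FASTQ encoding; the quality strings are invented for illustration and the function names are not part of this repository.

```r
# Decode FASTQ quality strings (Phred+33 encoding) into integer scores.
phred_scores <- function(qual_strings) {
  lapply(qual_strings, function(q) utf8ToInt(q) - 33L)
}

# Sorted vector of the distinct quality values found across all reads.
distinct_quality_values <- function(qual_strings) {
  sort(unique(unlist(phred_scores(qual_strings))))
}

# Hypothetical quality lines, invented for illustration only.
binned_like   <- c("FFFF:FFF,FFF", "FF:FFFFF#FFF")
unbinned_like <- c("IIHGFEDCBA@?", "IIIIHHGGFFEE")

distinct_quality_values(binned_like)            # 2 11 25 37
length(distinct_quality_values(unbinned_like))  # 11
```

If the reads from a run only ever show a handful of distinct values, the error rate step in dada2 is probably working from binned scores and deserves a closer look.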

Description of the sub directories for data processing

reports: a step-wise list of scripts (01_, 02_ etc.) used for processing and analysing the data

Statistical analyses and modelling

Landscape-level effects on flying insect diversity and biomass were modelled with linear mixed-effects models at the buffer size with the most pronounced effect size.

insectmobile_diversity_and_biomass's Issues

environmental data

I have used the script from the biomass paper to process the environmental data for 2018.
For now it is just saved in the main directory:
write.table(outputCast,file="environData_2018_DK.txt",sep="\t")

Last time we had a cleaned-data folder and put it in there. Shall we do that again?

I will work on the 2019 data now, but I am having some weird issues reading it in. Will keep trying.

Land cluster effects

Cluster analysis - look for natural land use compositions and relate them to biomass
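One possible starting point, sketched here with simulated data, is k-means clustering on the land cover proportions followed by a per-cluster summary of biomass. The class names, the number of clusters and the simulated values are all assumptions, and Euclidean k-means on proportions is a simplification that ignores the compositional nature of the data.

```r
set.seed(42)
n_routes <- 60

# Simulated proportions for five land cover classes (hypothetical names).
raw <- matrix(rexp(n_routes * 5), ncol = 5,
              dimnames = list(NULL, c("urban", "farmland", "grassland",
                                      "wetland", "forest")))
land_use <- raw / rowSums(raw)   # each row now sums to 1

# k-means with 3 clusters; the choice of 3 is arbitrary here.
clusters <- kmeans(land_use, centers = 3, nstart = 25)

# Simulated biomass, summarised per land use cluster.
routes <- data.frame(log_biomass = log(rlnorm(n_routes)),
                     cluster = factor(clusters$cluster))
cluster_means <- aggregate(log_biomass ~ cluster, data = routes, FUN = mean)
cluster_means
```

The cluster membership could then enter the mixed models as a categorical predictor instead of the continuous land covers.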

Sub land use type

We have a column with land use and one with sub land use type, which was originally derived from the route. I previously made a clunky attempt at standardizing the different variations (from Danish): https://github.com/CecSve/InsectMobile_Diversity_and_Biomass/blob/main/reports/02_Sampling_data.R - could we use this in any way for heterogeneity (count the number of land use + sub land use types for each route)? I think it would probably make more sense to take this from the environmental variables we have with proportions. But just throwing the idea out there - otherwise I think we should disregard the rough land use categories altogether and not spend time standardizing them.

Script_02 make sure all sampling data is included

double-check that metadata for imaged bulk samples are included
check that 2017 samples are included (will be excluded) and how many samples they account for
#18
check the routes from AU match the routes we have in CPH

Land heterogeneity

Lmer regression - land cover and heterogeneity prediction - test for biomass, richness and evenness

We need to think a bit more about what our measure of heterogeneity is.
At its simplest, it could be the Shannon diversity of the land cover proportions (but whether that is based on just the 5 dominant classes, or on a finer habitat categorisation, is still to be decided).
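As a minimal sketch of that simplest option, Shannon diversity of the land cover proportions p_i within a buffer is H = -sum(p_i * log(p_i)). The function name is hypothetical and the example proportions are invented.

```r
# Shannon diversity of land cover proportions within one buffer.
shannon_landcover <- function(p) {
  p <- p[p > 0]      # classes that are absent contribute nothing
  p <- p / sum(p)    # renormalise in case the proportions do not sum to 1
  -sum(p * log(p))
}

shannon_landcover(c(0.2, 0.2, 0.2, 0.2, 0.2))  # equals log(5), the maximum for 5 classes
shannon_landcover(c(1, 0, 0, 0, 0))            # 0, a single land cover
```

With 5 classes the index ranges from 0 to log(5), so a finer habitat categorisation would change the scale as well as the values.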

relevant papers

https://www.pnas.org/doi/10.1073/pnas.2002554117

German data to-dos

@PtrsBrt here's a list of to-dos for you:

make sure all lab data is digitized
prepare a 02 script for German sampling data and align to the Danish output
prepare a 03 script for German lab data and align to the Danish output
in script 03, make an ID column that incorporates size fraction
in script 03, rename PCRIDs to S18_** and S19_**
retrieve all environmental data from Germany

Metrics

all analysis could involve:

biomass
richness
rarefied richness
evenness
diversity
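For reference, the community metrics in this list could be computed per sample roughly as below. This is a sketch under assumptions: "diversity" is taken to mean Shannon diversity, evenness is Pielou's J, and rarefied richness uses Hurlbert's expected richness on abundance (read count) data. None of these choices is settled in this thread, and the function name is hypothetical.

```r
# Richness, Shannon diversity, Pielou's evenness and (optionally)
# rarefied richness for one sample, given a vector of per-taxon counts.
community_metrics <- function(counts, rarefy_to = NULL) {
  counts <- counts[counts > 0]
  n <- sum(counts)
  richness <- length(counts)
  p <- counts / n
  shannon <- -sum(p * log(p))
  evenness <- if (richness > 1) shannon / log(richness) else NA_real_
  out <- c(richness = richness, shannon = shannon, evenness = evenness)
  if (!is.null(rarefy_to)) {
    stopifnot(rarefy_to <= n)
    # Probability that each taxon is absent from a random subsample
    # of size rarefy_to drawn without replacement.
    p_absent <- exp(lchoose(n - counts, rarefy_to) - lchoose(n, rarefy_to))
    out["rarefied_richness"] <- sum(1 - p_absent)
  }
  out
}

# Hypothetical counts for one sample.
community_metrics(c(10, 10, 10, 10, 0), rarefy_to = 20)
```

Biomass is measured directly, so it needs no calculation here.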

Analysis of 2 years?

Previous analysis was just of 2018, but now we have 2 years. I propose we analyse both together:

We can do this by:
using the same model as before, except with year as an additional fixed effect. We should not complicate it with any year x land cover interaction (i.e., we don't want to test whether land cover effects differ between the years). Instead, we should just focus on the mean effects of land cover across the years.
We still need the transect as a random effect. It doesn't matter if some were only sampled in one year.
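The proposed model structure could look roughly like this in lme4. The data are simulated and the column names (route, urban, log_biomass) are placeholders rather than the names used in the scripts.

```r
library(lme4)

set.seed(1)
routes <- paste0("R", 1:40)   # hypothetical route (transect) IDs

# Every route sampled in 2018; only routes 11-40 sampled again in 2019.
dat <- expand.grid(route = routes, year = c(2018, 2019),
                   stringsAsFactors = FALSE)
dat <- dat[!(dat$year == 2019 & dat$route %in% routes[1:10]), ]

# Simulated covariate and response, for illustration only.
urban_by_route  <- setNames(runif(40), routes)
effect_by_route <- setNames(rnorm(40, sd = 0.5), routes)
dat$urban <- unname(urban_by_route[dat$route])
dat$log_biomass <- 1 - 0.8 * dat$urban + 0.3 * (dat$year == 2019) +
  unname(effect_by_route[dat$route]) + rnorm(nrow(dat), sd = 0.3)
dat$year <- factor(dat$year)

# Land cover and year as additive fixed effects (no interaction),
# route as a random intercept.
fit <- lmer(log_biomass ~ urban + year + (1 | route), data = dat)
fixef(fit)
```

Routes sampled in only one year still contribute to the fixed effects; they simply carry less information about the route-level variance.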

Effects of continuous land covers

Same as original analysis
Lmer regression - continuous land cover (original analysis) - test for biomass, richness, evenness, diversity (already written; do we do rarefaction? iNEXT? incidence-based rarefaction?)

04_compare sequenced data to data from sampling events and make sure as many samples are accounted for as possible

General check-up when we merge everything (should perhaps be calculated by year):

how many collected samples from DK?
how many collected samples from DE?
~~How many of the collected samples were processed in the lab (which were, which weren't)? DK~~
How many of the collected samples were processed in the lab (which were, which weren't)? DE
How many of the processed samples were sequenced DK?
How many of the processed samples were sequenced DE?
~~How many samples made it through the bioinformatic processing of the sequenced samples DK?~~
How many samples made it through the bioinformatic processing of the sequenced samples DE?

All this should be documented in the supplementary file. In the main manuscript, we should write something along the lines of:

Of the 771 insect samples collected in Denmark during June 2018, 561 were retained for analysis after quality checks in the laboratory and the bioinformatics pipeline; of the 798 samples collected in June 2019, 678 were retained (Fig XX (map of sampled areas in DK with transparent dots coloured by year collected), SXX).
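Taking the numbers in the draft sentence at face value, the retention rates per year can be worked out as follows. The same pattern would apply to each country and processing step once the counts are documented.

```r
# Collected and retained samples per year, from the draft sentence above (DK).
collected <- c("2018" = 771, "2019" = 798)
retained  <- c("2018" = 561, "2019" = 678)

# Percentage retained per year and overall.
retention_pct <- round(100 * retained / collected, 1)
overall_pct   <- round(100 * sum(retained) / sum(collected), 1)

retention_pct  # 2018: 72.8, 2019: 85.0
overall_pct    # 79.0
```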

Document the output for each country in a comment below.

check species names

Bottom section of script 01 checks the names of the species.

Most names are accepted (95%); we have 240 synonyms and 4 with doubtful status.

The function at the bottom gets the accepted name of these species according to GBIF.

Just guide me on what I can do next.
e.g., check genus too?
replace the species names with their accepted names in taxonomy_cleaned_sub ?
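If we go for replacing the names, one way is a lookup join that swaps synonyms for their accepted names and keeps the original name in a separate column. This is only a sketch: the column names, the lookup table layout and the species names are made up, and the real lookup would come from the GBIF check in script 01.

```r
# Hypothetical stand-in for the real taxonomy table.
taxonomy_cleaned_sub <- data.frame(
  species = c("Aus bus", "Cus dus", "Eus fus"),
  stringsAsFactors = FALSE
)

# Hypothetical result of the name check: one synonym, one doubtful name.
name_lookup <- data.frame(
  species       = c("Cus dus", "Eus fus"),
  status        = c("SYNONYM", "DOUBTFUL"),
  accepted_name = c("Cus novus", NA),
  stringsAsFactors = FALSE
)

# Keep the original names so the change can be traced back.
taxonomy_cleaned_sub$species_original <- taxonomy_cleaned_sub$species

# Replace only names flagged as synonyms; leave doubtful names untouched.
idx <- match(taxonomy_cleaned_sub$species, name_lookup$species)
is_syn <- !is.na(idx) & name_lookup$status[idx] == "SYNONYM"
taxonomy_cleaned_sub$species[is_syn] <- name_lookup$accepted_name[idx[is_syn]]

taxonomy_cleaned_sub
```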

Metric correlations

Relationships among the metrics - richness vs biomass

Script 01_Taxonomic scope?

Should we carry out the analysis for Class = Insecta only, or should we include Arachnids and some other arthropod classes as well, or just stick to flying invertebrates (which may be difficult to pinpoint)? We should at least remove the weird ones:

Phylum = Cnidaria (e.g. jellyfish)
Phylum = Nemertea (worms, unsure whether some species could be associated with insects?)
Class = Pycnogonida (sea spiders)
Class = Polychaeta (we have one marine species in the taxonomy list)
Class = Turbellaria (again a marine species)
Kingdom = Protista
Phylum = Onychophora (velvet worm, would be super cool if we found one, but they all live in the southern hemisphere and are extinct in the north)
Phylum = Rotifera
Phylum = Chordata (we have wild boar, cows, human and wolf/dog)
~~Order = Isopoda~~
~~Order = Decapoda~~
Class = Malacostraca (would take care of Decapoda and Isopoda)
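The removal list above could be applied with a simple filter along these lines. The taxonomy table and its column names (kingdom, phylum, class) are hypothetical; whether Arachnida and other arthropod classes stay in is still the open question.

```r
# Hypothetical taxonomy table with one row per taxon.
taxonomy <- data.frame(
  kingdom = c("Animalia", "Animalia", "Animalia", "Animalia", "Animalia"),
  phylum  = c("Arthropoda", "Arthropoda", "Arthropoda", "Chordata", "Cnidaria"),
  class   = c("Insecta", "Arachnida", "Malacostraca", "Mammalia", "Hydrozoa"),
  stringsAsFactors = FALSE
)

# Groups to remove, taken from the list above.
drop_kingdoms <- c("Protista")
drop_phyla    <- c("Cnidaria", "Nemertea", "Onychophora", "Rotifera", "Chordata")
drop_classes  <- c("Pycnogonida", "Polychaeta", "Turbellaria", "Malacostraca")

# %in% treats missing ranks as "not in the list", so rows with NA are kept.
remove <- taxonomy$kingdom %in% drop_kingdoms |
  taxonomy$phylum %in% drop_phyla |
  taxonomy$class %in% drop_classes

taxonomy_filtered <- taxonomy[!remove, ]
taxonomy_filtered$class  # "Insecta" "Arachnida"
```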

Community composition

NMDS - Community composition dissimilarity (mapping the clusters onto land cover)

fix date error

this route has an incorrect eventDate

     PIDRouteID SampleID  PID DOFAtlasQuadrantID    subLandUseType  eventDate StartTime  EndTime
1099     P346.2  P346.2A P346               5684 bog, meadow, lake 23-06-2010  14:29:00 14:56:00
                      Wind Temperature Notes PilotNotes RouteID_JB    utm_x   utm_y decimalLongitude
1099 gentle breeze 3.4-5.5         25+  <NA>       <NA>         20 641080.4 6173608         11.24422
     decimalLatitude                               eventTime Time_band Route_length Distance_driven
1099        55.68773 2019-06-23 14:29:00/2019-06-23 14:56:00    midday         5000           10000
           Date yDay Time_driven Velocity Year
1099 2010-06-23  174          27 370.3704 2010

Looking at eventTime, it looks like eventDate year 2010 should be changed to 2019

I leave that for you to do!

Once done, please send me the updated sampling_data_cleaned.txt file

check environmental data for all sampled

I checked this at the start of 04

Land cover:

we have all data for 2019
we have almost all data for 2018 (based on the covariate-data folder files of the biomass paper).
-Exception: route "P115.2" - but maybe there is a reason
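A check of this kind can be done with setdiff() on the route IDs. The vectors below are illustrative only: P115.2 and P346.2 appear in this repository's issues, the other ID is made up, and in practice both vectors would be read from the sampling data and the land cover files.

```r
# Route IDs with sampling data and route IDs with land cover data (illustrative).
sampled_routes   <- c("P115.2", "P346.2", "P12.1")
landcover_routes <- c("P346.2", "P12.1")

# Routes that were sampled but have no land cover data.
missing_landcover <- setdiff(sampled_routes, landcover_routes)
missing_landcover  # "P115.2"
```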

Traffic light counts:
We only have one file for this. I guess for 2018.
Not sure how to check for 2019??

fix date

     PIDRouteID SampleID  PID DOFAtlasQuadrantID    subLandUseType  eventDate StartTime  EndTime
1099     P346.2  P346.2A P346               5684 bog, meadow, lake 23-06-2010  14:29:00 14:56:00
                      Wind Temperature Notes PilotNotes RouteID_JB    utm_x   utm_y decimalLongitude
1099 gentle breeze 3.4-5.5         25+  <NA>       <NA>         20 641080.4 6173608         11.24422
     decimalLatitude                               eventTime Time_band Route_length Distance_driven
1099        55.68773 2019-06-23 14:29:00/2019-06-23 14:56:00    midday         5000           10000
           Date yDay Time_driven Velocity Year
1099 2010-06-23  174          27 370.3704 2010

from eventTime, looks like eventDate should be 2019 not 2010

update it within sampling_data_cleaned
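A sketch of the correction, using a small stand-in for sampling_data_cleaned with only the affected columns. The second row and the output file path are assumptions. yDay does not need changing, since 23 June is day 174 in both 2010 and 2019.

```r
# Stand-in for the real table; only the columns affected by the fix.
sampling_data_cleaned <- data.frame(
  PIDRouteID = c("P346.2", "P100.1"),          # second row is hypothetical
  eventDate  = c("23-06-2010", "12-06-2019"),
  Date       = c("2010-06-23", "2019-06-12"),
  Year       = c(2010L, 2019L),
  stringsAsFactors = FALSE
)

# Target only the record with the wrong year.
bad <- sampling_data_cleaned$PIDRouteID == "P346.2" &
  sampling_data_cleaned$eventDate == "23-06-2010"

sampling_data_cleaned$eventDate[bad] <- "23-06-2019"
sampling_data_cleaned$Date[bad]      <- "2019-06-23"
sampling_data_cleaned$Year[bad]      <- 2019L

# Then save the corrected table again, e.g. (path is an assumption):
# write.table(sampling_data_cleaned, file = "sampling_data_cleaned.txt", sep = "\t")
```

Velocity and Time_driven are derived from the start and end times, so they are unaffected by the year.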

Richness and diversity as response variables

Used for the carnet + metabarcoding manuscript - check and reuse code for this calculation.
