
dki's Introduction

DKI (Data-driven Keystone species Identification)

This is a PyTorch implementation of DKI, as described in our paper:

Wang, X.W., Sun, Z., Jia, H., Michel-Mata, S., Angulo, M.T., Dai, L., He, X., Weiss, S.T. and Liu, Y.Y. [Identifying keystone species in microbial communities using deep learning]. bioRxiv, pp.2023-03 (2023).


We have tested this code with Python 3.8.13 and R 4.1.2.

Overview

Previous studies suggested that microbial communities harbor keystone species whose removal can cause a dramatic shift in microbiome structure and functioning. Yet, an efficient method to systematically identify keystone species in microbial communities is still lacking. This is mainly due to our limited knowledge of microbial dynamics and the experimental and ethical difficulties of manipulating microbial communities. Here, we propose a Data-driven Keystone species Identification (DKI) framework based on deep learning to resolve this challenge. Our key idea is to implicitly learn the assembly rules of microbial communities from a particular habitat by training a deep learning model using microbiome samples collected from this habitat. The well-trained deep learning model enables us to quantify the community-specific keystoneness of each species in any microbiome sample from this habitat by conducting a thought experiment on species removal. We systematically validated this DKI framework using synthetic data generated from a classical population dynamics model in community ecology. We then applied DKI to analyze human gut, oral microbiome, soil, and coral microbiome data. We found that those taxa with high median keystoneness across different communities display strong community specificity, and many of them have been reported as keystone taxa in literature. The presented DKI framework demonstrates the power of machine learning in tackling a fundamental problem in community ecology, paving the way for the data-driven management of complex microbial communities.

Repo Contents

(1) A synthetic dataset to test the Data-driven Keystone species Identification (DKI) framework.

(2) Python code to predict the species composition using species assemblage (cNODE2) and R code to compute keystoneness.

(3) Predicted species composition after removing each present species in each sample.

Data type for DKI

(1) Ptrain.csv: a matrix of taxonomic profiles of size N*M, where N is the number of taxa and M is the sample size (without header).

sample 1 sample 2 sample 3 sample 4
species 1 0.45 0.35 0.86 0.77
species 2 0.51 0 0 0
species 3 0 0.25 0 0
species 4 0 0 0.07 0
species 5 0 0 0 0.17
species 6 0.04 0.4 0.07 0.06
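As a quick sanity check, the example Ptrain matrix above can be held as a NumPy array and verified against the relative-abundance convention (each sample column sums to 1). This is an illustrative sketch; in practice the real file would be loaded with np.loadtxt("Ptrain.csv", delimiter=","), since Ptrain.csv has no header row.

```python
import numpy as np

# The example Ptrain matrix above: rows are species, columns are samples.
ptrain = np.array([
    [0.45, 0.35, 0.86, 0.77],
    [0.51, 0.00, 0.00, 0.00],
    [0.00, 0.25, 0.00, 0.00],
    [0.00, 0.00, 0.07, 0.00],
    [0.00, 0.00, 0.00, 0.17],
    [0.04, 0.40, 0.07, 0.06],
])

# Each column is a relative-abundance profile and should sum to 1.
assert np.allclose(ptrain.sum(axis=0), 1.0)
print(ptrain.shape)  # (N, M) = (6, 4)
```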

(2) Thought experiment: the thought experiment is realized by removing each present species from each sample. This generates three data types.

  • Ztest.csv: matrix of perturbed species collections of size N*C, where N is the number of taxa and C is the total number of perturbed samples (without header).
sample 1 sample 2 sample 3 sample 4 sample 5 sample 6 sample 7 sample 8 sample 9 sample 10 sample 11 sample 12
species 1 0 1 1 0 1 1 0 1 1 0 1 1
species 2 1 0 1 0 0 0 0 0 0 0 0 0
species 3 0 0 0 1 0 1 0 0 0 0 0 0
species 4 0 0 0 0 0 0 1 0 1 0 0 0
species 5 0 0 0 0 0 0 0 0 0 1 0 1
species 6 1 1 0 1 1 0 1 1 0 1 1 0
  • Species_id: a list indicating which species was removed in each perturbed sample.
species
1
2
6
1
3
6
1
4
6
1
5
6
  • Sample_id: a list indicating the sample from which the species was removed.
sample
1
1
1
2
2
2
3
3
3
4
4
4
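All three files above can be derived from Ptrain.csv in a single pass: for each sample, knock out each present species and record which species and which sample. The sketch below is an illustration, not the repository's own script, and it emits the binary 0/1 assemblage shown in this section (the repository's example Ztest.csv contains decimals rather than 0/1, as discussed in the issues below, so the real preprocessing may differ).

```python
import numpy as np

ptrain = np.array([  # example Ptrain matrix from above
    [0.45, 0.35, 0.86, 0.77],
    [0.51, 0.00, 0.00, 0.00],
    [0.00, 0.25, 0.00, 0.00],
    [0.00, 0.00, 0.07, 0.00],
    [0.00, 0.00, 0.00, 0.17],
    [0.04, 0.40, 0.07, 0.06],
])
n_species, n_samples = ptrain.shape

z_cols, species_id, sample_id = [], [], []
for j in range(n_samples):              # for each sample ...
    for i in range(n_species):          # ... remove each *present* species
        if ptrain[i, j] > 0:
            z = (ptrain[:, j] > 0).astype(int)
            z[i] = 0                    # knock out species i
            z_cols.append(z)
            species_id.append(i + 1)    # 1-based, as in Species_id
            sample_id.append(j + 1)     # 1-based, as in Sample_id

ztest = np.column_stack(z_cols)         # N x C perturbed assemblage matrix
print(species_id)  # [1, 2, 6, 1, 3, 6, 1, 4, 6, 1, 5, 6]
print(sample_id)   # [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
```

The resulting columns reproduce the Ztest table above (e.g., the first column is sample 1 with species 1 removed: 0, 1, 0, 0, 0, 1).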

How to use the DKI framework

Step 1: Predict species composition using the perturbed species assemblage

Running the Python script DKI.py with Ptrain.csv and Ztest.csv as input outputs the predicted microbiome composition for the perturbed species collection matrix Ztest.csv. The output file is qtst.csv:

sample 1 sample 2 sample 3 sample 4 sample 5 sample 6 sample 7 sample 8 sample 9 sample 10 sample 11 sample 12
species 1 0.0000000 0.000000 0.0000000 0.92458308 0.92458308 0.92458308 0.9245831 0.4725695 0.4729691 0.91488211 0.8053058 0.8053058
species 2 0.8315174 0.0000000 0.000000 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.5274305 0.0000000 0.00000000 0.0000000
species 3 0.0000000 0.8287832 0.000000 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000 0.5270309 0.00000000 0.0000000
species 4 0.0000000 0.0000000 0.212941 0.0000000 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000 0.08511789 0.0000000
species 5 0.0000000 0.0000000 0.000000 0.4444696 0.00000000 0.00000000 0.00000000 0.0000000 0.0000000 0.0000000 0.00000000 0.1946942
species 6 0.1684826 0.1712168 0.787059 0.5555304 0.07541692 0.07541692 0.07541692 0.0754169 0.0000000 0.0000000 0.00000000 0.0000000
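A minimal sanity check on qtst.csv (an assumption about the format, consistent with the table above): each column should be a relative-abundance profile summing to 1, and the species recorded in Species_id should be absent from the corresponding prediction. Using the first column above as an example:

```python
import numpy as np

# First column of the qtst.csv example: sample 1 with species 1 removed.
q = np.array([0.0, 0.8315174, 0.0, 0.0, 0.0, 0.1684826])
removed_species = 1  # from the first entry of Species_id

assert np.isclose(q.sum(), 1.0)       # predictions are relative abundances
assert q[removed_species - 1] == 0.0  # the removed species stays absent
```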

Step 2: Compute the keystoneness

Run the R script Keystoneness_computing.R to compute the keystoneness of each present species in each sample. The output file:

keystoneness sample species
5.576585e-02 1 1
5.680769e-02 2 1
4.133107e-02 3 1
6.768209e-02 4 1
3.948267e-05 1 2
4.027457e-05 2 3
7.398025e-05 3 4
5.262661e-05 4 5
4.576021e-03 1 6
3.072820e-03 2 6
7.672017e-03 3 6
1.067806e-02 4 6

Each row represents the keystoneness of one species in a particular sample.
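The exact keystoneness formula lives in Keystoneness_computing.R. As a hypothetical illustration only, the sketch below scores the structural impact of a removal as the Bray-Curtis dissimilarity between the renormalized pre-removal composition and the predicted post-removal composition; this particular dissimilarity choice is an assumption, not necessarily the definition used by DKI.

```python
import numpy as np

def keystoneness_sketch(p_before, q_after, removed):
    """Hypothetical keystoneness: Bray-Curtis dissimilarity between the
    pre-removal composition (removed taxon zeroed out, then renormalized)
    and the predicted post-removal composition q_after. This is an
    illustrative assumption, not the formula in Keystoneness_computing.R."""
    p = p_before.copy()
    p[removed - 1] = 0.0
    p /= p.sum()                         # renormalize without the removed taxon
    return 0.5 * np.abs(p - q_after).sum()

p1 = np.array([0.45, 0.51, 0.0, 0.0, 0.0, 0.04])           # sample 1 of Ptrain
q1 = np.array([0.0, 0.8315174, 0.0, 0.0, 0.0, 0.1684826])  # predicted, species 1 removed
print(round(keystoneness_sketch(p1, q1, removed=1), 4))    # 0.0958
```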

dki's People

Contributors
spxuw


dki's Issues

Licensing

Hi,

Thanks for including your code, and your paper was very interesting!

Would you be able to include a license on this repo? I'm looking to play around with your model in my own code. From my understanding, according to GitHub's terms of service, if there's no license, external users can't modify or build on the code.

If this license exists elsewhere, I apologize!

Thanks,
Ryan

Is the function below to create Ztest.csv correct?

createZtest <- function(ptrain) {
  # One perturbed column per (species, sample) pair: species i removed from sample j
  species.ids <- data.frame(rep(1:nrow(ptrain), each = ncol(ptrain)))
  sample.ids <- data.frame(rep(1:ncol(ptrain), nrow(ptrain)))
  ztest <- data.frame(X = rep(1, nrow(ptrain)))  # placeholder column, dropped below
  for (i in 1:nrow(ptrain)) {
    t.mat <- ptrain
    t.mat[i, ] <- 0  # remove species i from every sample
    ztest <- cbind(ztest, t.mat)
  }
  ztest <- apply(ztest[, -1], 2, function(x) x / sum(x))  # renormalize each column
  return(list(species_ids = species.ids, sample_ids = sample.ids, Ztest = ztest, Ptrain = ptrain))
}

DKI.py script problems

Dear Professor,
Thank you very much for the really inspiring project. I would really like to apply this method to my dataset, but I get stuck in the Python script. I have updated all the folders, inserted the dataset name, and updated all the packages, but the script always stalls here:

loss_train,qtst,qtrn = train_reptile(max_epochs,mb,LR,ztrn,ptrn,ztst,ptst,zval,pval,zall,pall)

It prints the preceding line and then just keeps running, with no errors and no results. Do you have any idea what kind of problem my job might have? Meanwhile, I am still trying to work with your uploaded data to try the scripts.
Thanks for any help.

Confusion about template data

Hi,

Thank you for developing a good tool; it is very useful for keystone species identification. But I am confused about the template data in the data directory.

  1. What is the difference between Ztest.csv and Ptest.csv?
  2. Why are all the numbers in Ztest.csv "0.0204081632653061"?
  3. How do I get the keystoneness values for each species (the output of Step 2)?

This is very strange to me; could you kindly provide some explanation? You can contact me at [email protected].

How to get Ptrain(steady_state_relative) by absolute abundance?

Hello, dear associate professor. I learned about your development of cNODE2 through CHINAGut in Beijing last weekend, from which I benefited a lot.
I would like your advice on generating the Ptrain.csv file when using human stool samples for keystoneness calculation. I have absolute abundance data for fecal samples from qPCR experiments, which I think can be directly regarded as steady_state_absolute. However, I am confused about how to generate Ptrain (steady_state_relative) by removing each species in each sample based on the real data.
In the Simulated_data_generation.R you provided, glv is built twice by setting the abundance of species j1 to 0 to get Ptrain (steady_state_relative), but how can I get ① the interaction matrix A and ② the growth rate b for each species in the real data? Could you kindly provide the code for this part, if it's not too much trouble? Thanks again.

Or can you point me toward modifying the code "x = glv(N = N, A, b = b, y = y_0, tstart = 0, tend = 100, tstep = 0.1, perturb = NULL)"? For example, I think I can use a lasso regression to construct A, but I don't know how to get b.
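For reference, the glv() call discussed here integrates the generalized Lotka-Volterra equations dx_i/dt = x_i (b_i + Σ_j A_ij x_j). A minimal NumPy sketch with made-up A and b (illustrative values only; inferring A and b from real data is exactly the open problem raised in this issue):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6
A = -np.eye(N) + 0.1 * rng.standard_normal((N, N))  # self-limitation + weak interactions
b = rng.uniform(0.5, 1.0, N)                        # intrinsic growth rates
x = rng.uniform(0.1, 1.0, N)                        # initial abundances

dt, t_end = 0.01, 100.0
for _ in range(int(t_end / dt)):                    # forward-Euler GLV integration
    x = x + dt * x * (b + A @ x)

steady_rel = x / x.sum()   # steady_state_relative: one column of a Ptrain-style matrix
print(np.round(steady_rel, 3))
```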


How do I get the output file for step 2 of this framework

Hello, dear Professor!
I ran the framework with the data you provided, but after running the R script I only got a picture and a table. How do I get the keystoneness values for each species?
If it is convenient for you, we can communicate by email. My address is [email protected]
Looking forward to your reply, thank you very much!

How do I generate the thought experiment data?

Dear Professor,

Hello!

Thank you for developing this tool.

While using this tool, I was confused about the input data (Ztest.csv, Species_id.csv, Sample_id.csv) used to compute the keystoneness.
I have the following questions:
(1) What do the values in the Ztest table represent? In the introduction of the tool, these values are all integers (0 or 1), but in the Ztest table of the example data set, most of the values are decimals (e.g., 0.0204081632653061).

(2) How do I generate the thought experiment data (Ztest.csv, Species_id.csv, Sample_id.csv ) from Ptrain.csv? Could you describe more about the specific process and tools used to generate these data?

(3) There are Ptest.csv and Ztest.csv in the example data set. What is the difference between the two tables? When using this tool to compute our own real data, should we use Ptest.csv or Ztest.csv? If I should use the Ptest table as input data, how do I generate this table?

I really appreciate your response.
If it is convenient for you, we can communicate by email. My email address is [email protected].

The understanding of the keystoneness.

Dear Professor! Thanks for your nice tool.

But I have a few questions:

  1. You have given a brief introduction to this tool; could you give more detailed information, e.g., the command line and parameters for using 'DKI.py' and 'Keystoneness_computing.R'?
  2. The keystoneness describes the importance of a species in a specific sample, but how can we judge its importance from the value?
    E.g., species 6 has higher keystoneness in sample 4, but is it really important in sample 4? Or can I just choose the top-ranked species with the highest keystoneness as the keystone species in each sample?

Looking forward to your reply!
