Giter Club home page Giter Club logo

Comments (12)

Arkadiy-Garber avatar Arkadiy-Garber commented on June 8, 2024 1

Hey Rachel - no worries at all! Hope the cruise went well, and congrats on defending your dissertation! I wish you the best of luck with your research, and please don't hesitate to reach out if you have any questions :)

from magiclamp.

Arkadiy-Garber avatar Arkadiy-Garber commented on June 8, 2024

Hey Rachel,

Good question! I typically use the --norm flag when comparing gene distributions between metagenomes. This way, the normalization gives you a little bit of context with regard to the total functional (and perhaps taxonomic) diversity within your dataset. If running magiclamp on individual genomes or bins, I usually do not include the --norm flag, since multiple copies of a gene are more rare, and, if present, likely indicate functional importance, rather than multiple species or strains.

One other thing to keep in mind when analyzing metagenomes is that regardless of whether or not you include the --norm flag, the number of hits that you get to a particular gene/domain family represents essentially the diversity of that gene/domain within your metagenomic dataset. So even though the number might be high for a particular gene, that does not necessarily indicate that that gene is present in high abundance, or is encoded by dominant organisms. In my experience, I've found that there is somewhat of a correlation between the total abundance of a gene and the number of raw or normalized gene counts. But to get a more direct estimate of abundance (and perhaps ecological relevance) of a gene, I would suggest including coverage information (via the -bam or -bams flags) in your magiclamp run. Does this make sesne? Happy to provide more details on this if you're interested.

Hope this helps! Let me know if you have any other questions :)

Arkadiy

from magiclamp.

rachelmugge avatar rachelmugge commented on June 8, 2024

Thanks for explaining!

I understand the first part, and figured that using the --norm flag was the way to go for this dataset. I typically work with 16S data, which we always report in relative abundance to compare taxa across samples, which seems analogous to the normalization method here.

I'm interested to know more about gene counts vs read coverage- I have a tool to generate bam files (Bowtie2), and I understand that this method takes your reads (in my case, paired end) and maps them to your assembly to understand how well your assembled contigs or scaffolds cover your reads (correct me if I'm wrong), but I don't completely understand why- if you provide coverage info to magiclamp, are gene abundances more accurate because they are based on only the mapped reads? For context, my metagenomes are marine biofilm samples from steel and wood substrates near historic shipwrecks, and my goal is to understand differences (or similarities) in functions/metabolisms within these biofilms both between substrates and across distance on the seabed, so I am very interested to get the "most correct" estimate of gene abundance in order to infer the ecological relevance.

from magiclamp.

rachelmugge avatar rachelmugge commented on June 8, 2024

Alright, now I've hit an error when test running one sample and including a bam file (seems similar to last time?):

Command:

MagicLamp.py LithoGenie -bin_dir /Volumes/FireFly_Promise_Pegasus/RMugge/DISSERTATION/Ch3/Metagenome_analysis/pipeline_4_assembly_based_func_annot/06_LithoGenie_tests/06_LithoGenie_contigs500_bam/B001_contigs -bin_ext fasta -out /Volumes/FireFly_Promise_Pegasus/RMugge/DISSERTATION/Ch3/Metagenome_analysis/pipeline_4_assembly_based_func_annot/06_LithoGenie_tests/06_LithoGenie_contigs500_bam/out_norm --meta -t 12 -bam /Volumes/FireFly_Promise_Pegasus/RMugge/DISSERTATION/Ch3/Metagenome_analysis/pipeline_4_assembly_based_func_annot/06_LithoGenie_tests/06_LithoGenie_contigs500_bam/B001_bam/B001_to_B001contigs500.sorted.bam --norm

Error:

Traceback (most recent call last):
File "/Users/firefly-hl/MagicLamp/MagicLamp.py", line 30, in
LithoGenie.main()
File "/Users/firefly-hl/MagicLamp/genies/LithoGenie.py", line 2438, in main
Dict[cell][process].append(float(depthDict[contig]))
TypeError: float() argument must be a string or a number, not 'collections.defaultdict'

Here is the created bam.depth file, if that is helpful:
B001_to_B001contigs500.sorted.bam.depth.zip

from magiclamp.

Arkadiy-Garber avatar Arkadiy-Garber commented on June 8, 2024

Thanks, Rachel! I will look into this error. It does indeed look similar, so perhaps a similar bug.

Regarding your question about gene counts vs read coverage, these two metrics are telling you somewhat different stories. But neither is necessarily more accurate than the other. Mapping reads back to your assembled contigs is a good way to confirm the quality of the assembly - it is also a good way to estimate the relative abundance of DNA sequences in your sample. For example, a contig that recruits a lot of reads and has high coverage (e.g. 1000x), likely represents a microbes that is highly abundant and perhaps dominating the biofilm. Thus, any gene that is encoded on that contig (by that dominant organism), is likely of high importance to the microbial community/biofilm. This importance might not come through if you are just looking at gene counts.

In this context, I am using 'counts' synonymously with 'variants' or 'homologs'. I know this might be confusing, so I may change the language in the software, but the basic idea is that when an assembler (e.g. SPAdes, megahit) constructs contigs, sequences that are significantly different are assembled into separate contigs, so any genes on them are then considered variants or homologs that are evolutionarily diverged. The exact amount of sequence divergence that's necessary to break a contig or gene into variants/homologs is assembler-dependent. But the important part is that if you see two genes that have the same annotation in your dataset, they represent homologs and may be assumed as performing the same or similar functions (a big assumption, I know, but necessary one to make if we want to maintain our sanity as bioinformaticians :) So when MagicLamp is using gene counts to estimate abundance, it is essentially counting how many different versions or homologs of each gene or gene category there are. If you have a highly complex or strain-varied community, you might see these gene counts are high, but the read coverage-based abundance could be low. However, if you have a very simple microbial community, then the number of gene variants/homologs might be low, but the abundance could be high. In my experience, however, the more homologs/counts there are of a certain gene, the higher it is in read coverage-based abundance. Does this make sense?

Another caveat to consider is that even if a contig (and its encoded genes) recruit many reads and are high in coverage, that does not necessarily mean that the gene is very active transcriptionally or translationally (or even if the encoded protein is active enzymatically). I've worked with datasets where some of the most highly expressed genes are encoded on contigs that are low in read coverage, and represent what may seem like insignificant members of the microbial community.

Hope this helps! Let me know if you have other questions or need additional clarification. Happy to help. Sounds like you've got some really cool study sites and interesting datasets :)

Cheers,
Arkadiy

from magiclamp.

rachelmugge avatar rachelmugge commented on June 8, 2024

Yes, this makes more sense- I think I needed to understand the difference in the most basic sense (gene homologs resulting from assembly). In my work, I'm a big proponent of explaining why I did what I did, so I think if I were to choose the read coverage route for the functional annotation, I would also report the mapping rates/coverage from the assemblies, which I am planning to report anyways...I will have to think on this a bit more.

Yes- I always have to remember when interpreting metagenome results that those functions are only present within the samples, and not necessarily active.

Thanks again for taking time to explain all this in depth! I have learned quite a bit.

from magiclamp.

Arkadiy-Garber avatar Arkadiy-Garber commented on June 8, 2024

Hi Rachel,

Glad to hear that. Let me know if you have other questions, happy to help!

For the error above, could you please send the output file that is generated prior to the crash?

Thanks!
Arkadiy

from magiclamp.

rachelmugge avatar rachelmugge commented on June 8, 2024

Hi Arkadiy,
Unfortunately there were no output files generated by LithoGenie before the crash, except for the "sorted.bam.depth' file I sent earlier...

Also- I've encountered an error with FeGenie as well- sorry I am finding all these errors! This one might be a bit simpler, as there's actually output. The command I ran is:

MagicLamp.py FeGenie -bin_dir /Volumes/FireFly_Promise_Pegasus/RMugge/DISSERTATION/Ch3/Metagenome_analysis/pipeline_4_assembly_based_func_annot/03_pullseq_contigs_500bp -bin_ext fasta -out /Volumes/FireFly_Promise_Pegasus/RMugge/DISSERTATION/Ch3/Metagenome_analysis/pipeline_4_assembly_based_func_annot/08_FeGenie/out_contigs500bp_norm --meta --norm -t 6

And the error it gave me is:

"Traceback (most recent call last):
File "/Users/firefly-hl/MagicLamp/MagicLamp.py", line 28, in
FeGenie.main()
File "/Users/firefly-hl/MagicLamp/genies/FeGenie.py", line 843, in main
time.sleep(5)
NameError: name 'time' is not defined"

It seemed to crash right before making the heatmap csv file (made ORF calls and ran HMM for all samples)- here is the output summary file:
iron_storage-summary.csv

Thanks for your help, and apologies again for creating more work for you!

from magiclamp.

Arkadiy-Garber avatar Arkadiy-Garber commented on June 8, 2024

Hey Rachel,

Sorry for the delayed response - been moving for the past 2 weeks, so a hectic end to my summer.

Thanks for letting me know about the FeGenie issue! I am actually in the process of removing FeGenie from MagicLamp, so I recommend that you use the standalone FeGenie for your iron gene identification: https://github.com/Arkadiy-Garber/FeGenie.

I will get back to the LithoGenie issue soon!

Thanks,
Arkadiy

PS - don't ever apologize to a programmer for finding bugs in their software. You are helping me to improve the tool, making it more accessible to other users :)

from magiclamp.

rachelmugge avatar rachelmugge commented on June 8, 2024

Hi Arkadiy,

No problem- I understand!

It did not even cross my mind to use the standalone version, I will try that route.

Thank you!

from magiclamp.

Arkadiy-Garber avatar Arkadiy-Garber commented on June 8, 2024

Hi Rachel - I am not able to find the source of the error that you are getting with LithoGenie. One possibility is that there is contig in one of the FASTA files that does not appear to be in the depth file. I updated the script to skip over this error, and provide some info regarding which contig is causing the error.

Could you please try downloading the updated version and trying again - and send me the output that you get from the run :)

Thanks!
Arkadiy

from magiclamp.

rachelmugge avatar rachelmugge commented on June 8, 2024

Hi Arkadiy,
I'm so sorry for the way delayed response- I went offshore for a month and then defended my dissertation! I think I have all the data that I need for now, but let me know if you'd still like me to test the updated version on my end.

Thanks,
Rachel

from magiclamp.

Related Issues (13)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.