lczech / grenedalf
Toolkit for Population Genetic Statistics from Pool-Sequenced Samples, e.g., in Evolve and Resequence experiments
License: GNU General Public License v3.0
I am currently using grenedalf v0.5.0 with some pooled sequencing data, and noticed that in my outputs the total.missing column is always 0, even when information is missing from my mpileup files. I have attached a screenshot of my mpileup file and of the output from grenedalf diversity to illustrate this.
Hello,
I am trying to run grenedalf frequency on my vcf file generated by PoolSNP from an mpileup file using
GRENEDALF=/gxfs_home/geomar/smomw573/software/grenedalf/bin/grenedalf
VCF=/gxfs_home/geomar/smomw573/work/seasonal_adaptation/analysis/PoolSNP/finalfile.vcf.gz
GENOME=/gxfs_work1/geomar/smomw573/seasonal_adaptation/genome/Eaffinis.Baltic.PseudoRef.Mar22.fasta
OUTPUT=/gxfs_home/geomar/smomw573/work/seasonal_adaptation/analysis/grenedalf-freq
$GRENEDALF frequency --vcf-path $VCF --allow-file-overwriting --reference-genome-fasta-file $GENOME --write-sample-alt-freq > $OUTPUT/grenedalftrial.log
and got the following error:
Cannot iterate over VCF file /gxfs_home/geomar/smomw573/work/seasonal_adaptation/analysis/PoolSNP/finalfile.vcf using the "AD" FORMAT field to count allelic depths, as that field is not part of the VCF file.
PoolSNP output vcf head
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT EA_2009_T1 EA_2009_T2 EA_2009_T3 EA_2009_T4 EA_2011_T1 EA_2011_T2 EA_2015_T1 EA_2015_T2 EA_2015_T3 EA_2015_T4 EA_2022_T1 EA_2022_T2 EA_2022_T3 EA_2022_T4
TRINITY_DN17_c1_g1_i1 185 . G A . . ADP=82.42857142857143;NC=0 GT:RD:AD:DP:FREQ ./.:.:.:.:. 0/1:51:1:52:0.02 0/1:71:3:74:0.04 0/1:94:3:97:0.03 0/1:64:2:66:0.03 0/1:107:2:109:0.02 0/1:109:2:111:0.02 0/0:91:0:91:0.0 0/1:48:2:50:0.04 0/0:96:0:96:0.0 0/1:99:2:101:0.02 0/1:92:1:93:0.01 0/1:79:3:82:0.04 0/1:100:2:102:0.02
TRINITY_DN17_c1_g1_i1 190 . C G . . ADP=83.92857142857143;NC=0 GT:RD:AD:DP:FREQ ./.:.:.:.:. 0/1:48:3:51:0.06 0/1:73:4:77:0.05 0/1:96:4:100:0.04 0/1:67:1:68:0.01 0/1:105:5:110:0.05 0/1:108:2:110:0.02 0/1:92:3:95:0.03 0/1:48:2:50:0.04 0/1:94:4:98:0.04 0/1:104:2:106:0.02 0/1:87:4:91:0.04 0/1:82:3:85:0.04 0/0:102:0:102:0.0
The VCF file was then reformatted with the following Python script:
import gzip

# vcf_file_path and output_file_path are defined elsewhere.
# Open the VCF file in read mode
with gzip.open(vcf_file_path, 'rt', encoding='utf-8') as vcf_file:
    with open(output_file_path, 'a', encoding='utf-8') as output_file:
        line_number = 0
        # Iterate through each line in the file
        for line in vcf_file:
            # Skip header lines
            if line.startswith('#'):
                continue
            # Split the line into fields based on tabs
            fields = line.strip().split('\t')
            # Rewrite the FORMAT column: GT:RD:AD:DP:FREQ -> GT:AD:DP:FREQ
            fsplit = fields[8].split(':')
            fields[8] = ':'.join([fsplit[0], fsplit[2], fsplit[3], fsplit[4]])
            # Rewrite each sample column (9th column forward): merge RD and AD
            # into a single comma-separated AD field holding ref,alt depths
            for idx, fcolumn in enumerate(fields[9:], start=9):
                tmpsplit = fcolumn.split(':')
                fields[idx] = ':'.join([
                    tmpsplit[0],
                    ','.join([tmpsplit[1], tmpsplit[2]]),
                    ':'.join([tmpsplit[3], tmpsplit[4]]),
                ])
            # Write the modified line to the output file
            output_file.write('\t'.join(fields) + '\n')
            line_number += 1
            # Print progress every 50,000 lines
            if line_number % 50000 == 0:
                print(f"Processed {line_number} lines.")
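The per-sample rewrite in the script can also be checked in isolation; a minimal sketch, assuming the PoolSNP GT:RD:AD:DP:FREQ sample layout shown above (merge_rd_ad is a hypothetical helper, not part of the original script):

```python
def merge_rd_ad(sample_field):
    """Convert a PoolSNP sample field GT:RD:AD:DP:FREQ into GT:AD:DP:FREQ,
    where the new AD holds "ref_depth,alt_depth", matching a FORMAT AD
    definition with Number=R (one depth per allele, ref allele first)."""
    gt, rd, ad, dp, freq = sample_field.split(':')
    return f"{gt}:{rd},{ad}:{dp}:{freq}"
```

For example, merge_rd_ad("0/1:51:1:52:0.02") yields "0/1:51,1:52:0.02". Note that fully missing samples like "./.:.:.:.:." become "./.:.,.:.:.", mirroring what the script above produces.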
and we corrected the AD field definition in the VCF header:
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
and now grenedalf frequency is running.
Many thanks,
Jennifer
hi,
thanks for developing this -- it looks really useful.
I had more of a conceptual question. The PoPoolation manuals always suggest subsampling pileups to create equal coverage. Is that something that is still necessary/useful here? And if so, how does one decide either on the subsampling value, or on min/max coverage?
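For reference, PoPoolation's subsampling reduces every position to the same target coverage by drawing reads without replacement. A minimal sketch of that idea applied to per-allele base counts (subsample_counts is a hypothetical helper, not part of grenedalf or PoPoolation):

```python
import random

def subsample_counts(counts, target, seed=None):
    """Subsample per-allele read counts (e.g. A,C,G,T) down to a fixed
    target coverage, drawing reads without replacement. Positions at or
    below the target are returned unchanged (one could also discard them,
    as PoPoolation does with --min-coverage)."""
    total = sum(counts)
    if total <= target:
        return list(counts)
    # Expand counts into a pool of allele indices, draw `target` of them.
    pool = [i for i, c in enumerate(counts) for _ in range(c)]
    drawn = random.Random(seed).sample(pool, target)
    out = [0] * len(counts)
    for i in drawn:
        out[i] += 1
    return out
```

This makes the trade-off visible: subsampling discards data to equalize coverage, which is why grenedalf's estimators, which account for coverage directly, may make the step unnecessary.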
cheers and many thanks in advance.
Hello,
I'm trying to create a Nix package of your tool (which is used in my computing center), but I'm facing an issue during compilation that I can't explain. I don't know if you have any Nix knowledge, but here are the sources of these work-in-progress packages (genesis and grenedalf):
bzizou/nixpkgs@9a30cd2
bzizou/nixpkgs@1d366e5
Note that Nix provides a way to pin dependencies to specific versions, such as the commit hash of CLI11 for example. This replaces the "manual" build you made in tools/cmake/IncludeHtslib.cmake, which is why it has been disabled by a patch.
The compiler used by default here is gcc 13.2, but I tried with Clang, gcc 11, and gcc 8, and I always get the same error.
Here is the full trace of the error:
[bzizou@bart:~/git/nixpkgs]$ nix-build -A grenedalf
this derivation will be built:
/nix/store/alvqp910rn1ajb0avjsvg59aihp27qi8-grenedalf-0.3.0.drv
building '/nix/store/alvqp910rn1ajb0avjsvg59aihp27qi8-grenedalf-0.3.0.drv'...
Running phase: unpackPhase
unpacking source archive /nix/store/rzv22r10gywmw7d6fjvyv1r1s7z4hdbg-source
source root is source
Running phase: patchPhase
Running phase: updateAutotoolsGnuConfigScriptsPhase
Running phase: configurePhase
fixing cmake files...
cmake flags: -DCMAKE_FIND_USE_SYSTEM_PACKAGE_REGISTRY=OFF -DCMAKE_FIND_USE_PACKAGE_REGISTRY=OFF -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=OFF -DCMAKE_INSTALL_LOCALEDIR=/nix/store/6iz6bqk2kriqrhry0ha8vjh2wvj55z9x-grenedalf-0.3.0/share/locale -DCMAKE_INSTALL_LIBEXECDIR=/nix/store/6iz6bqk2kriqrhry0ha8vjh2wvj55z9x-grenedalf-0.3.0/libexec -DCMAKE_INSTALL_LIBDIR=/nix/store/6iz6bqk2kriqrhry0ha8vjh2wvj55z9x-grenedalf-0.3.0/lib -DCMAKE_INSTALL_DOCDIR=/nix/store/6iz6bqk2kriqrhry0ha8vjh2wvj55z9x-grenedalf-0.3.0/share/doc/grenedalf -DCMAKE_INSTALL_INFODIR=/nix/store/6iz6bqk2kriqrhry0ha8vjh2wvj55z9x-grenedalf-0.3.0/share/info -DCMAKE_INSTALL_MANDIR=/nix/store/6iz6bqk2kriqrhry0ha8vjh2wvj55z9x-grenedalf-0.3.0/share/man -DCMAKE_INSTALL_OLDINCLUDEDIR=/nix/store/6iz6bqk2kriqrhry0ha8vjh2wvj55z9x-grenedalf-0.3.0/include -DCMAKE_INSTALL_INCLUDEDIR=/nix/store/6iz6bqk2kriqrhry0ha8vjh2wvj55z9x-grenedalf-0.3.0/include -DCMAKE_INSTALL_SBINDIR=/nix/store/6iz6bqk2kriqrhry0ha8vjh2wvj55z9x-grenedalf-0.3.0/sbin -DCMAKE_INSTALL_BINDIR=/nix/store/6iz6bqk2kriqrhry0ha8vjh2wvj55z9x-grenedalf-0.3.0/bin -DCMAKE_INSTALL_NAME_DIR=/nix/store/6iz6bqk2kriqrhry0ha8vjh2wvj55z9x-grenedalf-0.3.0/lib -DCMAKE_POLICY_DEFAULT_CMP0025=NEW -DCMAKE_OSX_SYSROOT= -DCMAKE_FIND_FRAMEWORK=LAST -DCMAKE_STRIP=/nix/store/ln6zld1ia7rxddmxgbpfhrmb42rbxdw8-gcc-wrapper-13.2.0/bin/strip -DCMAKE_RANLIB=/nix/store/ln6zld1ia7rxddmxgbpfhrmb42rbxdw8-gcc-wrapper-13.2.0/bin/ranlib -DCMAKE_AR=/nix/store/ln6zld1ia7rxddmxgbpfhrmb42rbxdw8-gcc-wrapper-13.2.0/bin/ar -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_INSTALL_PREFIX=/nix/store/6iz6bqk2kriqrhry0ha8vjh2wvj55z9x-grenedalf-0.3.0 -DGENESIS_INCLUDE_DIR=/nix/store/jyjd5a9gh4336kb7l40yz5sfpyhzvxr8-genesis-0.30.0/include
CMake Deprecation Warning at CMakeLists.txt:26 (cmake_minimum_required):
Compatibility with CMake < 3.5 will be removed from a future version of
CMake.
Update the VERSION argument <min> value or use a ...<max> suffix to tell
CMake that the project does not need compatibility with older versions.
-- The CXX compiler identification is GNU 13.2.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /nix/store/ln6zld1ia7rxddmxgbpfhrmb42rbxdw8-gcc-wrapper-13.2.0/bin/g++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- grenedalf build type: Release
-- Static linking of system libraries: OFF
-- Building without LTO/IPO support
-- CMAKE_EXE_LINKER_FLAGS
-- GENESIS_LINK_LIBRARIES
-- Configuring done (0.3s)
-- Generating done (0.0s)
CMake Warning:
Manually-specified variables were not used by the project:
BUILD_TESTING
CMAKE_C_COMPILER
CMAKE_EXPORT_NO_PACKAGE_REGISTRY
CMAKE_FIND_USE_PACKAGE_REGISTRY
CMAKE_FIND_USE_SYSTEM_PACKAGE_REGISTRY
CMAKE_INSTALL_BINDIR
CMAKE_INSTALL_DOCDIR
CMAKE_INSTALL_INCLUDEDIR
CMAKE_INSTALL_INFODIR
CMAKE_INSTALL_LIBDIR
CMAKE_INSTALL_LIBEXECDIR
CMAKE_INSTALL_LOCALEDIR
CMAKE_INSTALL_MANDIR
CMAKE_INSTALL_OLDINCLUDEDIR
CMAKE_INSTALL_SBINDIR
-- Build files have been written to: /build/source/build
cmake: enabled parallel building
cmake: enabled parallel installing
Running phase: buildPhase
build flags: -j8 SHELL=/nix/store/087167dfxal194pm54cmcbbxsfy3cjgn-bash-5.2p26/bin/bash
[ 12%] Building CXX object CMakeFiles/grenedalf.dir/src/commands/convert/frequency.cpp.o
[ 12%] Building CXX object CMakeFiles/grenedalf.dir/src/commands/analyze/fst.cpp.o
[ 12%] Building CXX object CMakeFiles/grenedalf.dir/src/commands/convert/sync.cpp.o
[ 12%] Building CXX object CMakeFiles/grenedalf.dir/src/commands/analyze/diversity.cpp.o
[ 19%] Building CXX object CMakeFiles/grenedalf.dir/src/commands/tools/license.cpp.o
[ 19%] Building CXX object CMakeFiles/grenedalf.dir/src/commands/tools/citation.cpp.o
[ 22%] Building CXX object CMakeFiles/grenedalf.dir/src/commands/simulate/simulate.cpp.o
[ 25%] Building CXX object CMakeFiles/grenedalf.dir/src/commands/tools/version.cpp.o
[ 29%] Building CXX object CMakeFiles/grenedalf.dir/src/commands/tools/wiki.cpp.o
[ 32%] Building CXX object CMakeFiles/grenedalf.dir/src/commands/visualize/afs_heatmap.cpp.o
[ 35%] Building CXX object CMakeFiles/grenedalf.dir/src/main.cpp.o
[ 38%] Building CXX object CMakeFiles/grenedalf.dir/src/options/file_input.cpp.o
[ 41%] Building CXX object CMakeFiles/grenedalf.dir/src/options/file_output.cpp.o
[ 45%] Building CXX object CMakeFiles/grenedalf.dir/src/options/global.cpp.o
[ 48%] Building CXX object CMakeFiles/grenedalf.dir/src/options/poolsizes.cpp.o
[ 51%] Building CXX object CMakeFiles/grenedalf.dir/src/options/table_output.cpp.o
[ 54%] Building CXX object CMakeFiles/grenedalf.dir/src/options/variant_file_frequency_table.cpp.o
[ 58%] Building CXX object CMakeFiles/grenedalf.dir/src/options/variant_file_pileup.cpp.o
[ 61%] Building CXX object CMakeFiles/grenedalf.dir/src/options/variant_file_sam.cpp.o
[ 64%] Building CXX object CMakeFiles/grenedalf.dir/src/options/variant_file_sync.cpp.o
[ 67%] Building CXX object CMakeFiles/grenedalf.dir/src/options/variant_file_vcf.cpp.o
/build/source/src/options/variant_file_sam.cpp: In member function 'virtual VariantFileOptions::VariantInputIterator VariantFileSamOptions::get_iterator_(const std::string&) const':
/build/source/src/options/variant_file_sam.cpp:158:5: error: 'SamVariantInputIterator' was not declared in this scope; did you mean 'VariantInputIterator'?
158 | SamVariantInputIterator reader;
| ^~~~~~~~~~~~~~~~~~~~~~~
| VariantInputIterator
/build/source/src/options/variant_file_sam.cpp:159:5: error: 'reader' was not declared in this scope
159 | reader.min_map_qual( sam_min_map_qual_.value );
| ^~~~~~
/build/source/src/options/variant_file_sam.cpp:162:31: error: 'string_to_sam_flag' was not declared in this scope
162 | reader.flags_include_all( string_to_sam_flag( sam_flags_include_all_.value ));
| ^~~~~~~~~~~~~~~~~~
/build/source/src/options/variant_file_sam.cpp:168:12: error: 'make_variant_input_iterator_from_sam_file' was not declared in this scope
168 | return make_variant_input_iterator_from_sam_file( filename, reader );
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[ 70%] Building CXX object CMakeFiles/grenedalf.dir/src/options/variant_filter_numerical.cpp.o
/build/source/src/options/variant_file_vcf.cpp: In member function 'virtual VariantFileOptions::VariantInputIterator VariantFileVcfOptions::get_iterator_(const std::string&) const':
/build/source/src/options/variant_file_vcf.cpp:74:12: error: 'make_variant_input_iterator_from_pool_vcf_file' was not declared in this scope
74 | return make_variant_input_iterator_from_pool_vcf_file( filename );
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[ 74%] Building CXX object CMakeFiles/grenedalf.dir/src/options/variant_filter_region.cpp.o
make[2]: *** [CMakeFiles/grenedalf.dir/build.make:328: CMakeFiles/grenedalf.dir/src/options/variant_file_sam.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: *** [CMakeFiles/grenedalf.dir/build.make:356: CMakeFiles/grenedalf.dir/src/options/variant_file_vcf.cpp.o] Error 1
/build/source/src/options/variant_filter_region.cpp: In member function 'void VariantFilterRegionOptions::prepare_region_filters() const':
/build/source/src/options/variant_filter_region.cpp:225:22: error: 'genome_locus_set_from_vcf_file' was not declared in this scope
225 | add_filter_( genome_locus_set_from_vcf_file( file ));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
make[2]: *** [CMakeFiles/grenedalf.dir/build.make:384: CMakeFiles/grenedalf.dir/src/options/variant_filter_region.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:83: CMakeFiles/grenedalf.dir/all] Error 2
make: *** [Makefile:91: all] Error 2
error: builder for '/nix/store/alvqp910rn1ajb0avjsvg59aihp27qi8-grenedalf-0.3.0.drv' failed with exit code 2;
last 10 log lines:
> make[2]: *** [CMakeFiles/grenedalf.dir/build.make:328: CMakeFiles/grenedalf.dir/src/options/variant_file_sam.cpp.o] Error 1
> make[2]: *** Waiting for unfinished jobs....
> make[2]: *** [CMakeFiles/grenedalf.dir/build.make:356: CMakeFiles/grenedalf.dir/src/options/variant_file_vcf.cpp.o] Error 1
> /build/source/src/options/variant_filter_region.cpp: In member function 'void VariantFilterRegionOptions::prepare_region_filters() const':
> /build/source/src/options/variant_filter_region.cpp:225:22: error: 'genome_locus_set_from_vcf_file' was not declared in this scope
> 225 | add_filter_( genome_locus_set_from_vcf_file( file ));
> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> make[2]: *** [CMakeFiles/grenedalf.dir/build.make:384: CMakeFiles/grenedalf.dir/src/options/variant_filter_region.cpp.o] Error 1
> make[1]: *** [CMakeFiles/Makefile2:83: CMakeFiles/grenedalf.dir/all] Error 2
> make: *** [Makefile:91: all] Error 2
For full logs, run 'nix log /nix/store/alvqp910rn1ajb0avjsvg59aihp27qi8-grenedalf-0.3.0.drv'.
If you have any idea...
Dear Lucas,
great to see a revision of PoolSeq software for the next round of high throughput analyses. Exciting times ahead.
I know that you have already done a lot of work, but one thing that would help advance the community is the ability to generate Dxy per window, as a measure of divergence between populations.
why?
There is a lot of existing empirical and theoretical work on when Fst vs. Dxy are informative, so it would be nice to have a framework where both Fst and Dxy can be estimated from PoolSeq. This was a limitation of the PoPoolation2 package that, I think, held back PoolSeq analyses from more commonly combining Fst and Dxy.
Currently, in the individual-based VCF world, this is perhaps best done via pixy, which improves on most existing VCF-based approaches for pi and Fst by properly accounting for missing data.
https://pixy.readthedocs.io/en/latest/
Hopefully you can do something similar?
Hi @lczech
I am using grenedalf v0.5.0 to calculate FST between a set of samples. My samples come from 3 different VCF files, so I am using the --sample-group-merge-table-file option to create the different groups (it works pretty well).
Moreover, my window-type is a gff3 file containing genes only.
Here is my command:
grenedalf fst --vcf-path set1.bcf --vcf-path set2.bcf --vcf-path set3.bcf \
--reference-genome-fasta-file genome.fasta \
--sample-group-merge-table-file grouping.list \
--filter-total-only-biallelic-snps \
--window-region-gff gff_genes.gff3 \
--window-type regions \
--pool-sizes pools.list \
--filter-total-snp-min-count 2 \
--method unbiased-hudson \
--window-average-policy available-loci \
--write-pi-tables \
--comparand-list comparand.list \
--file-prefix fst_genes_hudson. --threads 4 --log-file fst_genes_hudson.log
I am running into the following error:
Computing FST between 398 pairs of samples.
Streaming over 25351 total regions across 63 chromosomes.
Error: Cannot convert VcfRecord to Variant, as one of the VcfRecord REF or ALT sequences/alleles is not a single nucleotide (it is not a SNP), at chr1:18018. At the time being, we are not supporting indels and other such variants.
The position mentioned is an indel and it is only present on set1.bcf
CHROM POS ID REF ALT QUAL
chr1 18018 Repeat TA T 4259.94
I also tried the following:
- the --filter-region-bed flag, giving a BED file that does not contain that position - did not work
- the --filter-total-only-biallelic-snps flag - did not work
I had a look at the previous version that I was using (v0.3.0), and there was an implementation to handle multiallelic positions. Has that been removed in the new version? In any case, it should skip non-SNP positions by default, right?
Any idea how I could solve this problem?
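One possible workaround (a sketch, not grenedalf functionality): pre-filter the VCFs to biallelic SNPs before passing them in, e.g. with bcftools view -m2 -M2 -v snps, or with a small script along these lines, where keep_biallelic_snps is a hypothetical helper:

```python
import gzip

def keep_biallelic_snps(in_path, out_path):
    """Copy a VCF, keeping all header lines but only those records whose
    REF and ALT are each a single nucleotide, i.e. biallelic SNPs.
    Indels (like the TA>T at chr1:18018) and multiallelic records
    (comma-separated ALT) are dropped."""
    opener = gzip.open if in_path.endswith(".gz") else open
    bases = {"A", "C", "G", "T"}
    with opener(in_path, "rt") as fin, open(out_path, "w") as fout:
        for line in fin:
            if line.startswith("#"):
                fout.write(line)
                continue
            fields = line.split("\t", 5)
            ref, alt = fields[3], fields[4]
            if ref in bases and alt in bases:
                fout.write(line)
```

The bcftools route is likely faster for large files; the script mainly illustrates what "only biallelic SNPs" means at the record level.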
Cheers,
Vinicius
Hi there,
Excited to use grenedalf, thanks for your work on this! I'm providing multiple VCFs to a grenedalf frequency call to start, and am getting the following error
what(): Malformed VCF file /DATA/home/juliak//pathogen_evo/vcfs/poolseq/Sample_100.Q20.ANNO.vcf: unordered chromosomes and positions going from NODE_9_length_182528_cov_101.096338:62956 to NODE_10_length_150116_cov_77.788002:93243 Aborted (core dumped)
I've taken a look at the vcfs in this region and they look correctly sorted (see below). Is there something about how the chromosome name is parsed that might be throwing this error?
Thanks,
Julia
Dear Lucas,
I have been using your tool ‘Grenedalf’ for analysing pooled data, and I would like to first thank you for putting such a great tool together!
I work with pooled data (coming from shotgun metagenomics sequencing of faecal samples infected with parasitic eggs or pools of eggs directly) and I am interested in analysing both nuclear and mitochondrial data to understand genetic variation within those parasites. I work directly on stool extracts mostly, so any downstream analysis is based on allele frequencies not genotypes.
I have used both VCF and BAM files to calculate Fst and theta pi nucleotide diversity from a couple of samples, and I am getting different results depending on whether I use BAM or VCF files, for the same number of positions to be considered (n=229 mitochondrial variants, after filtering for missingness, quality, etc.). You have explained previously how you recommend using BAM files as opposed to VCFs, and that makes sense, so I will switch to using BAM files moving forward.
First Q: For calculating Fst, when I use the ‘Karlsson’ method (as I am unsure of my pool size = I have hundreds of eggs in each environmental sample) with a VCF, then the number of SNPs that are considered for the analysis is n=93 but it is 95 if I am using the corresponding BAM files. However, the total number of SNPs in the VCF is 229 and they are all biallelic from looking at the VCF.
When I switch to the ‘Kofler’ method, the number of SNPs considered for the BAM files is 156 (!), but for the VCF it is still 93. I am unsure why ~73 positions are considered ‘not SNPs’ by grenedalf (with the Kofler/BAM method), though they appear to be SNPs in my VCF, and why, when I use the ‘Karlsson’ method, the tool throws out some ‘non-biallelic SNPs’ (n=61) as well as the ‘not SNPs’ (n=73).
Are there any assumptions made by grenedalf? Do the BAM files hold information about the reads (like indel positions, even though they have been filtered out) that makes grenedalf ignore them?
The VCFs I have used to test these have been filtered for min/max alleles (2 for both, bcftools), and then the positions have been filtered for high quality (based on custom QUAL distributions and for 0.7 max-missingness).
Does it have to do with the fact that I am using mitogenome data (ploidy of 1)?
Command line for pi: grenedalf diversity --sam-path $i --filter-sample-min-count 2 --filter-sample-min-coverage 1 --filter-region-list $k --window-type chromosomes --pool-sizes 1000 --file-prefix ${nickname}_${name}_pi_
Command line for Fst: grenedalf fst --sam-path $i $j --method karlsson --filter-region-list 0.7_pools.recode.kept.sites --window-type chromosomes --file-prefix ${i}_${j}_
Second Q: Is there a way to calculate within sample pi diversity when you don't know the pool size?
Third Q: Is it possible to generate a collated table from when running Fst with BAMs? I have 88 bams and all the possible combos generate over 3,000 CSVs that have to be merged later...
Thank you for all the fantastic work and responsiveness!
Marina
Hi guys,
This is a tiny issue (and may be deliberate) but thought I would flag it in case not.
The output diversity file (calculating all three measures and putting them in one file), when run per chromosome, has the start and end of each chromosome as zero. Not sure if you had wanted the last base of each chromosome as the end instead. Just thinking it would make the number of SNPs and coverage more immediately interpretable.
All the best,
Jenni
Hi Lucas,
me again. Another important feature which would be very useful is sample-specific masking conditions. Global masking may be useful to mask TEs etc. However, individual libraries may differ in read depth, which may require more specific masking for individual samples.
In our case, we have individual BED/MASK files (in FASTA format) for each sample and currently, I need to break the input sample-wise and run grenedalf for each sample separately, which is quite an effort for >700 samples.
Is there a more elegant way to do that, e.g. by reading the BED files for each sample first and creating a matrix of window-wise masks per sample, which can then be used to calculate averages?
Cheers, Martin
Hi,
I've been plotting files and realised that the SNP windows have the same coordinates for all my pairwise comparisons, but that some have 'nan'. Does this mean that windows are determined as those which hold 100 SNPs across all samples (I asked for 100 SNPs), but when two samples being compared have fewer than 100 SNPs in a given window, the statistic for that window is not calculated for that comparison?
Thanks,
Jenni
Hiya,
I am trying to install grenedalf to test the software on my pool-seq data, but I am having a hard time installing it. I've encountered an issue related to OpenMP. Sorry, I am a bit of a novice when it comes to compiling. Below is my error output:
#######################
Running CMake...
CMake Deprecation Warning at CMakeLists.txt:26 (cmake_minimum_required):
Compatibility with CMake < 2.8.12 will be removed from a future version of
CMake.
Update the VERSION argument value or use a ... suffix to tell
CMake that the project does not need compatibility with older versions.
-- The CXX compiler identification is AppleClang 12.0.5.12050022
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /Library/Developer/CommandLineTools/usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- grenedalf build type: RELEASE
-- Static linking of system libraries: OFF
CMake Deprecation Warning at libs/genesis/CMakeLists.txt:28 (cmake_minimum_required):
Compatibility with CMake < 2.8.12 will be removed from a future version of
CMake.
Update the VERSION argument value or use a ... suffix to tell
CMake that the project does not need compatibility with older versions.
-- Configuring Genesis
-- CMake version 3.21.0-rc2
-- The C compiler identification is AppleClang 12.0.5.12050022
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /Library/Developer/CommandLineTools/usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- OS: Darwin 20.5.0
-- Genesis version: v0.25.0
-- Building Genesis as a dependency
-- Build type: RELEASE
-- Unity build: FULL
-- C++ compiler: AppleClang 12.0.5.12050022 at /Library/Developer/CommandLineTools/usr/bin/c++
-- C compiler : AppleClang 12.0.5.12050022 at /Library/Developer/CommandLineTools/usr/bin/cc
-- Looking for zlib
-- Found ZLIB: /Library/Developer/CommandLineTools/SDKs/MacOSX11.3.sdk/usr/lib/libz.tbd (found version "1.2.11")
-- Found zlib: /Library/Developer/CommandLineTools/SDKs/MacOSX11.3.sdk/usr/include 1.2.11
-- Using zlib
-- Looking for Threads
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found Threads:
-- Using Threads
-- Looking for OpenMP
-- Using find_package( OpenMP ) patch
-- Try OpenMP C flag = [ ]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP C flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP C flag = [-fopenmp=libiomp5]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP C flag = [-fopenmp=libomp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP C flag = [-Xpreprocessor -fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP C flag = [-Xclang -fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP C flag = [/openmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP C flag = [-Qopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP C flag = [-openmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP C flag = [-xopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP C flag = [+Oopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP C flag = [-qsmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP C flag = [-mp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP CXX flag = [ ]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP CXX flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP CXX flag = [-fopenmp=libiomp5]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP CXX flag = [-fopenmp=libomp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP CXX flag = [-Xpreprocessor -fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP CXX flag = [-Xclang -fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP CXX flag = [/openmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP CXX flag = [-Qopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP CXX flag = [-openmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP CXX flag = [-xopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP CXX flag = [+Oopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP CXX flag = [-qsmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
-- Try OpenMP CXX flag = [-mp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Failed
CMake Warning (dev) at /Applications/CMake.app/Contents/share/cmake-3.21/Modules/FindPackageHandleStandardArgs.cmake:438 (message):
The package name passed to find_package_handle_standard_args
(OpenMP)
does not match the name of the calling package (OpenMP_patch). This can
lead to problems in calling code that expects find_package
result
variables (e.g., _FOUND
) to follow a certain pattern.
Call Stack (most recent call first):
libs/genesis/tools/cmake/FindOpenMP_patch.cmake:384 (find_package_handle_standard_args)
libs/genesis/CMakeLists.txt:336 (find_package)
This warning is for project developers. Use -Wno-dev to suppress it.
-- Could NOT find OpenMP (missing: OpenMP_C_FLAGS OpenMP_CXX_FLAGS)
-- Could NOT find OpenMP_C (missing: OpenMP_C_FLAGS OpenMP_C_LIB_NAMES)
-- Could NOT find OpenMP_CXX (missing: OpenMP_CXX_FLAGS OpenMP_CXX_LIB_NAMES)
-- Could NOT find OpenMP (missing: OpenMP_C_FOUND OpenMP_CXX_FOUND)
-- OpenMP not found
-- You are trying to compile with Clang and OpenMp. This does not seem to work right now. Try installing libiomp-dev
first, or the equivalent for your system.
-- To build without OpenMP support, call CMake with -DGENESIS_USE_OPENMP=OFF
CMake Error at libs/genesis/CMakeLists.txt:389 (message):
Required package OpenMP not found.
####################
Best,
Lenny
Hello,
I have problems compiling the software. I have tried with gcc 7, gcc 11, and gcc 12. I attach the compilation log.
best regards,
David
make.log
Hi Lucas,
thanks for the great new features - particularly the averaging options are very useful. However, I am wondering if you could also implement the possibility to mask sites when averaging by window length. As far as I understand, this is not possible yet.
In our case, we have already called SNPs and want to average diversity stats in windows (by window-length) based on a SYNC file with SNPs only. In addition, we have BED files, which contain the regions that should be masked in the genomes.
When I am using --window-average-policy valid-loci
in combination with --filter-mask-fasta ${PWD}/data/BED/${ID}.mask.gz
I get a lot of missing positions, but (I guess) the correct number of masked sites per window.
When I am using --window-average-policy window-length
in combination with --filter-mask-bed ${PWD}/data/BED/${ID}.bed.gz
I get no (or a few??) missing positions, but also no masked sites per window.
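To sanity-check the denominators, the number of masked bases falling into each window can be pre-computed directly from the BED intervals; a sketch (hypothetical helper, assuming 0-based half-open BED coordinates and non-overlapping intervals on one chromosome):

```python
def masked_bases_per_window(bed_intervals, chrom_len, window_size):
    """Given 0-based half-open (start, end) BED intervals on a single
    chromosome, return the number of masked bases in each fixed-size
    window. Intervals spanning a window boundary are split accordingly."""
    n_windows = (chrom_len + window_size - 1) // window_size
    masked = [0] * n_windows
    for start, end in bed_intervals:
        end = min(end, chrom_len)
        pos = start
        while pos < end:
            w = pos // window_size
            w_end = min((w + 1) * window_size, end)
            masked[w] += w_end - pos
            pos = w_end
    return masked
```

Comparing these per-window totals against the masked-site counts grenedalf reports under each averaging policy should show whether the BED mask is being applied in the window-length denominator at all.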
Thanks, Martin
Hi Lucas,
I'm trying to generate a sync file with the following command:
grenedalf sync --threads 8 --sam-path /shared/projects/invalbo/tiphaine/results/04_Polishing/Reunion/O_rg_marked_realigned.bam
--sam-path /shared/projects/invalbo/tiphaine/results/04_Polishing/Reunion/P_rg_marked_realigned.bam
--reference-genome-dict-file /shared/projects/invalbo/tiphaine/resources/genomes/AalbF5_filtered.dict
--out-dir /shared/projects/invalbo/tiphaine/results/06_Grenedalf
--multi-file-locus-set intersection --with-header --sam-min-base-qual 30 --sam-min-map-qual 20 --sam-flags-include-any /^0x[1-2]+/
--file-suffix Reunion --filter-region NC_085136.1:1-1000000 --compress --allow-file-overwriting
I have tried this with and without the "--compress" option.
I don't really understand some parts of the error message:
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
Failed to read BGZF block data at offset 3493214333 expected 20418 bytes; hread returned 0
Read block operation failed with error 4 after 50 of 379 bytes
Do you have a suggestion for the origin of the problem?
If it comes from my BAM files, I don't understand in what way they could be truncated.
Thank you in advance for your help,
Cheers,
Tiphaine
Hi Lucas,
I've noticed my theta pi numbers were far, far higher than I would anticipate. I'm calculating based on sliding windows of 10000 positions. I found one high pi area to look at what was going on. That region looks as follows:
2L 5776 C 0:0:32:192:0:0
2L 6353 C 0:183:30:0:0:0
2L 6921 G 0:192:0:27:0:0
2L 7542 G 0:169:0:45:0:0
2L 7801 C 0:161:36:0:0:0
2L 8667 G 0:0:29:185:0:0
2L 8774 T 0:27:172:0:0:0
So in a 10000 bp window I have 7 segregating sites with reasonably good and consistent coverage. The per-site nucleotide diversity (based on the naive calculation of per-site expected heterozygosity) would be about 0.00016. However, I get 0.269, which is reasonably close to the value expected from only the 7 segregating sites (~0.23). So it seems the windows are not behaving as I would expect: the statistic is calculated based on just the bi-allelic sites, rather than the entire window. Is there a way to make it acknowledge the rest of the window that I missed?
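For concreteness, here is the toy calculation I did as a Python sketch, using the naive expected heterozygosity per site (this is not grenedalf's actual estimator; sync count columns are A:T:C:G:N:del):

```python
# The 7 segregating sites from the window above, in sync column order A:T:C:G:N:del.
sites = [
    "0:0:32:192:0:0", "0:183:30:0:0:0", "0:192:0:27:0:0",
    "0:169:0:45:0:0", "0:161:36:0:0:0", "0:0:29:185:0:0",
    "0:27:172:0:0:0",
]

def heterozygosity(counts):
    # Naive expected heterozygosity: 1 minus the sum of squared allele
    # frequencies over the four nucleotide counts (A, T, C, G).
    nucs = counts[:4]
    total = sum(nucs)
    return 1.0 - sum((c / total) ** 2 for c in nucs)

hets = [heterozygosity([int(c) for c in s.split(":")]) for s in sites]
pi_per_window = sum(hets) / 10000   # ~0.00018, close to my expected 0.00016
pi_per_snp = sum(hets) / len(hets)  # ~0.26, close to the 0.269 I actually get
```

The two normalizations differ only in the denominator, which is exactly the behavior in question.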
Thank you,
Tyler
Hi,
I am working with a dataset of pooled, naturally occurring populations that have been processed through variant calling (GATK). I have already calculated the different diversity measures per sample, but I was wondering if it would be possible to create groups of samples (e.g. 10 samples for group A and 10 samples for group B) and then calculate per-group diversity measures?
It would of course be possible to merge on fastq or bam level, but that would then lead to lots of extra preprocessing.
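To make the idea concrete, here is a hypothetical sketch (not, as far as I know, an existing grenedalf feature) of pooling sync-style count columns within a group; the effective pool size of such a merged group would then be the sum of the individual pool sizes:

```python
# Pool sync-style count columns (A:T:C:G:N:del) across the samples of one group,
# by summing the six count fields position by position.
def pool_counts(sample_fields):
    """Sum the six count fields across a group of sample columns."""
    parsed = [[int(c) for c in field.split(":")] for field in sample_fields]
    return ":".join(str(sum(col)) for col in zip(*parsed))

# Two hypothetical samples of group A at one position:
group_a = ["0:10:5:0:0:0", "0:8:7:0:0:0"]
pooled = pool_counts(group_a)  # "0:18:12:0:0:0"
```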
Thank you.
It may be an overly cautious gcc 12.3.0 compiler, but during an otherwise successful link I see:
[100%] Linking CXX executable /sw/bioinfo/grenedalf/0.3.0/src/grenedalf/bin/grenedalf
In member function 'update_',
inlined from 'update' at /sw/bioinfo/grenedalf/0.3.0/src/grenedalf/build/genesis_unity_sources/lib/all.cpp:67744:12,
inlined from 'update' at /sw/bioinfo/grenedalf/0.3.0/src/grenedalf/build/genesis_unity_sources/lib/all.cpp:67731:15:
/sw/bioinfo/grenedalf/0.3.0/src/grenedalf/build/genesis_unity_sources/lib/all.cpp:67831:11: warning: '__builtin_memcpy' reading between 0 and 4294967295 bytes from a region of size 0 [-Wstringop-overread]
67831 | memcpy( &buffer_[index], &input[i], length-i );
| ^
/sw/bioinfo/grenedalf/0.3.0/src/grenedalf/build/genesis_unity_sources/lib/all.cpp: In member function 'update':
/sw/bioinfo/grenedalf/0.3.0/src/grenedalf/build/genesis_unity_sources/lib/all.cpp:67725:10: note: at offset [65, 4294967295] into source object 'sbuf' of size 64
67725 | char sbuf[MD5::BlockSize];
| ^
[100%] Built target grenedalf
Note that I build with cmake separately, as I'd prefer to use the non-system modules for liblzma, libbz2, libdeflate, and libz that we already have.
Not very important, but the outputs of fst and frequency both have .csv file extensions, even though the files are tab-delimited.
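In the meantime, anyone loading these files just needs to set the delimiter explicitly; for example, with the Python standard library (read_grenedalf_table is a hypothetical helper name):

```python
import csv

def read_grenedalf_table(path):
    # The grenedalf ".csv" outputs are tab-separated, so set the
    # delimiter explicitly instead of relying on the file extension.
    with open(path, newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))
```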
Hi,
Thank you for the amazing tool. I have a question on how the --sam-min-map-qual filter works. According to the wiki:
--sam-min-map-qual
UINT:UINT in [0 - 90]=0 Needs: --sam-path
Minimum phred-scaled mapping quality score [0-90] for a read in sam/bam/cram files to be considered. Any read that is below the given value of mapping quality will be completely discarded, and its bases not taken into account. Default is 0, meaning no filtering by base quality.
I have aligned the reads to the genome using BWA MEM and processed the bam files with the following command.
/grenedalf/bin/grenedalf frequency --sam-path Eng_ORG.bam Eng_MID.bam Eng_CUR.bam
--reference-genome-fasta-file genome.fa
--sam-min-base-qual 25 --sam-min-map-qual 30 --write-sample-read-depth --write-sample-counts --write-total-read-depth --write-sample-ref-freq --write-total-counts --out-dir ./ --file-prefix p2
My understanding is that only the reads with mapq > 30 will be used to estimate the frequencies, so the read depth of a sample in the output should match the number of reads with a mapq score above 30 in the bam file. Am I correct in assuming this?
I summed up the DEPTH column of a sample in R, and this value does not match the number of reads with mapq > 30 in the bam file.
> sum(fsnp$Eng_MID.DEPTH)
[1] 488310302
samtools view Eng_MID.bam | awk '$5 > 30' | wc -l
27079399
The library size of Eng_MID.fq.gz was ~43 million PE reads
The same is the case with the other samples. Is this because of multimapping? If yes, how can this affect the allele frequency, diversity estimates, and fst?
Thanks in advance
Dear @lczech,
I have four pool sequencing libraries. I have finished SNP calling using bwa + GATK, and a total of 9.8 M SNPs (after filtering for depth and missing rate) have been identified. I found that grenedalf can work directly with BAM and VCF formats, so I first calculated FST using BAM as the input. At the end of the program, the following statistics were reported. How should each statistic be understood, and which number best represents the number of SNPs? If 28568428 represents the number of SNPs, it is significantly different from 9.8 M.
Sample filter summary (summed up across all samples):
Passed: 4036079496
Empty (after counts): 32041114
Above max coverage: 263818
Total filter summary (after applying all sample filters):
Passed: 28568428
Below min coverage: 51188538
Not SNP: 937243195
Finished 2023-10-02 23:48:24
When using VCF as the input, the statistics look much more normal. It seems that 9795814 represents the number of SNPs after filtering by minor allele count > 0.
Processed 39 chromosomes with 9795814 (non-filtered) positions in 47245 windows.
Total filter summary (after applying all sample filters):
Passed: 9795814
Not SNP: 120324
Finished 2023-10-03 10:29:20
Sincerely,
Zhuqing Zheng
Hi @lczech
Very nice tool. When installing it, I get the following errors. If possible, could you provide a binary program, or any advice on installing under an anaconda environment?
[ 21%] Building CXX object libs/genesis/lib/genesis/CMakeFiles/genesis_lib_obj.dir/__/__/__/__/genesis_unity_sources/lib/all.cpp.o
/public/home/zhengzhuqing/01.software/install/grenedalf/build/genesis_unity_sources/lib/all.cpp: In member function ‘void genesis::population::BedReader::read_(std::shared_ptr<genesis::utils::BaseInputSource>, std::function<void(genesis::population::BedReader::Feature&&)>) const’:
/public/home/zhengzhuqing/01.software/install/grenedalf/build/genesis_unity_sources/lib/all.cpp:7696:24 warning: missing initializer for member ‘genesis::population::BedReader::Feature::chrom’ [-Wmissing-field-initializers]
feat = Feature{};
make[3]: *** [libs/genesis/lib/genesis/CMakeFiles/genesis_lib_obj.dir/__/__/__/__/genesis_unity_sources/lib/all.cpp.o] Error 1
make[2]: *** [libs/genesis/lib/genesis/CMakeFiles/genesis_lib_obj.dir/all] Error 2
make[1]: *** [all] Error 2
make: *** [build] Error 2
Sincerely,
Zheng zhuqing
I'm trying to make a cathedral plot for a couple of different chromosomes. A subset of them produce the following error:
Plotting cathedral-plot-scaffold_6.csv
Error: Svg axis label position out of [ 0.0, 1.0 ]
terminate called after throwing an instance of 'std::runtime_error'
what(): Svg axis label position out of [ 0.0, 1.0 ]
Aborted (core dumped)
I have tried using the --clip, --min, and --max flags, but am still getting the same error message. Please let me know if any additional information would be helpful.
Dear Lucas,
I have a suggestion for how to handle missing information. In our recent papers (e.g. here and here), we extended the original SYNC file code by a missing data code. That is, instead of using 0:0:0:0:0:0, which may indicate either missing data (e.g. by not passing SNP calling criteria) or zero coverage, we introduced the .:.:.:.:.:. code, which unambiguously indicates that the position is masked.
It would be very useful if your software could handle this as well.
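On the parsing side the distinction is simple; here is a minimal sketch (hypothetical, not grenedalf's actual parser) of how a reader could treat the two codes:

```python
def parse_sync_sample(field):
    """Parse one sync sample column into counts, or None if masked.

    "0:0:0:0:0:0" is ambiguous (zero coverage, or filtered out), while
    ".:.:.:.:.:." unambiguously marks the position as missing/masked.
    """
    if field == ".:.:.:.:.:.":
        return None  # explicitly masked position
    return [int(c) for c in field.split(":")]
```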
Thanks a lot,
Martin
Hi Lucas,
thanks for this new implementation. This will be very useful!
However, I am running into a problem when I want to exclude samples from a sync file. I provide a list of 737 names with --sample-name-list, and then a list of 13 samples to be excluded with --filter-samples-exclude.
Now, I get the following error which indicates that the sample name list is not correctly updated when the samples are excluded. Is there a quick fix for that?
--sample-name-list(names.txt): Invalid sample names list that contains 737 name entries. This is incongruent with the input file, which contains 724 samples (after filtering, if a sample name filter was given).
Thanks, Martin
Hi Lucas,
me again! I have another suggestion that may be useful as you are restructuring your code right now. One common problem with PoPoolation is that it does not take into account that not only SNPs but also non-polymorphic positions may fail heuristic SNP-calling parameters or a posteriori filtering. This commonly leads to an underestimation of per-site averages for any popgen statistic when the sum of absolute values is simply divided by the window size. To yield correct values, the window sizes need to be adjusted to account for all sites that do not pass the same criteria (heuristic parameters, removal of low-complexity regions, etc.) as the actual SNPs. We describe this issue in the Methods section of our paper (https://academic.oup.com/mbe/article/37/9/2661/5837682).
One way to pass such information is to read either BED or MASK files. The latter are FASTA-type files in which all positions with a 0 are retained and all positions > 0 are ignored; see also the documentation of vcftools:
>1
0000011111222...
>2
2222211111000...
Based on this information, the window-size could be adjusted to calculate correct per-site averages. Hope this is helpful and apologies for swamping you with stuff ;-)
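As a sketch of the idea (a hypothetical helper, assuming a vcftools-style mask string as above, where 0 means retained):

```python
def valid_window_sizes(mask, window_size):
    """Count retained sites (mask digit 0) per non-overlapping window.

    A correct per-site average for a window is then the sum of the
    statistic divided by this count, not by the nominal window size.
    """
    return [
        sum(1 for ch in mask[start:start + window_size] if ch == "0")
        for start in range(0, len(mask), window_size)
    ]

# Toy mask: 5 filtered sites, 10 retained sites, 4 filtered sites.
mask = "1111100000000001111"
sizes = valid_window_sizes(mask, 10)  # [5, 5]
```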
Cheers, Martin
Hi @lczech,
I ran grenedalf and got this error:
Started 2024-06-14 14:57:37
Processing 4 samples
At chromosome chromosome_1
At chromosome chromosome_2
At chromosome chromosome_3
At chromosome chromosome_4
At chromosome chromosome_5
At chromosome chromosome_6
At chromosome chromosome_7
At chromosome chromosome_8
At chromosome chromosome_9
Error: Cannot iterate multiple input sources in parallel, as (at least) one of them is not in the correct sorting order. By default, we expect lexicographical sorting of chromosomes, and then sorting by position within chromosomes. Alternatively, when a sequence dictionary is specified (such as from a .dict or .fai file, or from a reference genome .fasta file), we expect the order of chromosomes as specified there. Offending input source: PS2, going from chromosome_9:58459040 to chromosome_10:6
terminate called after throwing an instance of 'std::runtime_error'
what(): Cannot iterate multiple input sources in parallel, as (at least) one of them is not in the correct sorting order. By default, we expect lexicographical sorting of chromosomes, and then sorting by position within chromosomes. Alternatively, when a sequence dictionary is specified (such as from a .dict or .fai file, or from a reference genome .fasta file), we expect the order of chromosomes as specified there. Offending input source: PS2, going from chromosome_9:58459040 to chromosome_10:6
Aborted (core dumped)
The four bam files come from grenepipe/dedup.
Is this related to the length of the chromosome label? The problem occurs when we change from chromosome 9 to 10.
This was my first try with grenedalf, and I didn't use a reference genome file.
/home/dau1/software/grenedalf-0.5.1/bin/grenedalf diversity --sam-path PN1.bam PN2.bam PS1.bam PS2.bam --pool-sizes pool-sizes.csv --window-type genome --window-average-policy valid-loci --filter-sample-min-count 2 --filter-sample-min-read-depth 4 --no-extra-columns
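For what it's worth, plain lexicographic sorting does place chromosome_10 before chromosome_9, which matches the error message (a quick standalone check, independent of grenedalf):

```python
# Lexicographically, "chromosome_10" sorts before "chromosome_9", because
# "1" < "9" at the first differing character. A BAM sorted in natural order
# (1, 2, ..., 9, 10) therefore violates the default lexicographic
# expectation unless a sequence dictionary (.dict/.fai/.fasta) is given.
names = ["chromosome_9", "chromosome_10"]
print(sorted(names))  # ['chromosome_10', 'chromosome_9']
```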
thank you
osp