kchen-lab / monopogen Goto Github PK

View Code? Open in Web Editor NEW

68.0 3.0 16.0 99.22 MB

SNV calling from single cell sequencing

License: GNU General Public License v3.0

Python 10.37% Shell 0.13% R 2.53% HTML 86.97%

germline-variants single-cell snvs somatic-variants

monopogen's People

Contributors

Stargazers

Watchers

Forkers

bharatm26 shulp2211 yuchenino drneavin chouellaine wook2014 monoplasty cfusterot seb951 ningshuang-yao swvanderlaan envest antonjmlarsson xtmgah sultanghazala ruiluo111

monopogen's Issues

region-file

I want to process the whole genome. How does Monopogen that the X, Y, XY/PAR1/PAR2, MT chromosome (regions) are noted? I now (naively) have this:

chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr12
chr22
chrX
chrXY
chrY
chrMT

file chromosome comparisons

Thank you for providing this useful tool! I had a quick question about some of the file checks in the Monopogen.py script.

I'm running this with a bam file that doesn't include "chr" in its chromosome names. I initially tried using a reference fasta and vcfs that include "chr" as input to the tool.

I received errors due to chromosome mismatches, prompting me to remove "chr" from the reference fasta and vcfs I was using as input.

These files no longer have "chr" in them, but I still received a chromosome mismatch error. Looking into the code, it looks like on line 81 of Monopogen.py the chromosome is compared with the value 0.

I changed the "0" to "args.chr" as done elsewhere in the script and everything worked fine. Is this check for "0" intended for some reason I missed?

Thank you,

grennfp

Missing genotypes？

We are particularly interested in calculating SNP (Single Nucleotide Polymorphism) expression levels using single-cell ATAC data. We have utilized the Monopogen tool to compute genotypes associated with germline mutations. However, we would like to determine whether the SNPs that were not computed are due to low read coverage or if they are consistent with the genomic genotype, making it impossible for us to make a determination.

Issues in the test file chr20.maester_scRNA.bam for somatic mutation calling

Hi,
Thanks so much for developing this tool! I've been trying to get the test dataset running to call somatic mutations, but have been having some trouble. I downloaded the file chr20.maester_scRNA.bam from https://drive.google.com/file/d/1nS2rjrab-QSiq-FhpTWOtJesCE9iS_0k/view?usp=share_link, but it appears to be broken.

I ran the code in this directory:

$ ls 
bam.lst
CB_7K.maester_scRNA.csv
CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz
chr20_2Mb.hg38.fa
chr20_2Mb.hg38.fa.fai
chr20.maester_scRNA.bam
chr20.maester_scRNA.bam.bai
region.lst

The downloaded chr20.maester_scRNA.bam file looks like this:

$ samtools view chr20.maester_scRNA.bam | head -2

SRR15598776.294805248_TCTTGCGCAGGCACAC_CTCTCCTCGCCA	2064	20	60218	0	10H20M61H	*	0	0	GCAATCCTTTCCTCTCCATT	,,:,:,,,,,,,,,,,,::F	NM:i:0	MD:Z:20	AS:i:20	XS:i:19	SA:Z:5,113545723,-,22S37M32S,0,0;5,156763440,+,22S19M50S,0,0;	XA:Z:11,+88315939,60S19M12S,0;
SRR15598776.80450257_AATGAAGTCGCGGACT_AAGTATACGCTC	16	20	61569	0	43S19M29S	*	0	0	TATACATATGTCCCCTCCCTCTTGAATCTCACCTTCTACCCCCGACTCCATTCCACTCCTCTAGGTTGTCACAGAGCACCGCATTTGGCTC	FF:FFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:19	AS:i:19	XS:i:19	SA:Z:5,64380457,-,19S19M53S,0,0;15,76070764,-,5S19M67S,0,0;X,10097594,-,67S19M5S,0,0;	XA:Z:Y,-10959620,44S19M28S,0;KI270736.1,+62338,28S19M44S,0;

$ samtools view chr20.maester_scRNA.bam | tail -2

SRR15598776.418134198_TTTCATGAGCAACTTC_TCACGTTTCCTA	2064	20	39999757	0	25H24M42H	*	0	0	ATATTTAAAATTTTATTTTTTAAT	,,:FF,:,FF,,::FF,,,FF,:,	NM:i:0	MD:Z:24	AS:i:24	XS:i:23	SA:Z:X,46457624,-,3M1I25M62S,0,1;6,124139067,-,53S23M15S,0,0;7,129403981,-,66S22M3S,0,0;8,102417610,+,30S20M41S,0,0;
SRR15598776.180926021_GACTCAAAGTTCACTG_ACCATATAAAAT	2048	20	39999869	0	70H21M	*	0	0	ATTTACTATAAAAATAGTTAC	F,,:,,,,,:::,,,,,,,,,	NM:i:1	MD:Z:20T0	AS:i:20	XS:i:20	SA:Z:1,82576961,-,21S39M31S,0,0;4,180292349,+,20S19M52S,0,0;	XA:Z:5,-14292491,7S25M59S,1;11,+95173927,67S19M5S,0;14,-21926828,8S19M64S,0;

$ samtools view -H chr20.maester_scRNA.bam

@HD	VN:1.3	SO:coordinate
@SQ	SN:1	LN:248956422
@SQ	SN:10	LN:133797422
@SQ	SN:11	LN:135086622
@SQ	SN:12	LN:133275309
@SQ	SN:13	LN:114364328
@SQ	SN:14	LN:107043718
@SQ	SN:15	LN:101991189
@SQ	SN:16	LN:90338345
@SQ	SN:17	LN:83257441
@SQ	SN:18	LN:80373285
@SQ	SN:19	LN:58617616
@SQ	SN:2	LN:242193529
@SQ	SN:20	LN:64444167
@SQ	SN:21	LN:46709983
@SQ	SN:22	LN:50818468
@SQ	SN:3	LN:198295559
@SQ	SN:4	LN:190214555
@SQ	SN:5	LN:181538259
@SQ	SN:6	LN:170805979
@SQ	SN:7	LN:159345973
@SQ	SN:8	LN:145138636
@SQ	SN:9	LN:138394717
@SQ	SN:MT	LN:16569
@SQ	SN:X	LN:156040895
@SQ	SN:Y	LN:57227415
@SQ	SN:KI270728.1	LN:1872759
@SQ	SN:KI270727.1	LN:448248
@SQ	SN:KI270442.1	LN:392061
@SQ	SN:KI270729.1	LN:280839
@SQ	SN:GL000225.1	LN:211173
@SQ	SN:KI270743.1	LN:210658
@SQ	SN:GL000008.2	LN:209709
@SQ	SN:GL000009.2	LN:201709
@SQ	SN:KI270747.1	LN:198735
@SQ	SN:KI270722.1	LN:194050
@SQ	SN:GL000194.1	LN:191469
@SQ	SN:KI270742.1	LN:186739
@SQ	SN:GL000205.2	LN:185591
@SQ	SN:GL000195.1	LN:182896
@SQ	SN:KI270736.1	LN:181920
@SQ	SN:KI270733.1	LN:179772
@SQ	SN:GL000224.1	LN:179693
@SQ	SN:GL000219.1	LN:179198
@SQ	SN:KI270719.1	LN:176845
@SQ	SN:GL000216.2	LN:176608
@SQ	SN:KI270712.1	LN:176043
@SQ	SN:KI270706.1	LN:175055
@SQ	SN:KI270725.1	LN:172810
@SQ	SN:KI270744.1	LN:168472
@SQ	SN:KI270734.1	LN:165050
@SQ	SN:GL000213.1	LN:164239
@SQ	SN:GL000220.1	LN:161802
@SQ	SN:KI270715.1	LN:161471
@SQ	SN:GL000218.1	LN:161147
@SQ	SN:KI270749.1	LN:158759
@SQ	SN:KI270741.1	LN:157432
@SQ	SN:GL000221.1	LN:155397
@SQ	SN:KI270716.1	LN:153799
@SQ	SN:KI270731.1	LN:150754
@SQ	SN:KI270751.1	LN:150742
@SQ	SN:KI270750.1	LN:148850
@SQ	SN:KI270519.1	LN:138126
@SQ	SN:GL000214.1	LN:137718
@SQ	SN:KI270708.1	LN:127682
@SQ	SN:KI270730.1	LN:112551
@SQ	SN:KI270438.1	LN:112505
@SQ	SN:KI270737.1	LN:103838
@SQ	SN:KI270721.1	LN:100316
@SQ	SN:KI270738.1	LN:99375
@SQ	SN:KI270748.1	LN:93321
@SQ	SN:KI270435.1	LN:92983
@SQ	SN:GL000208.1	LN:92689
@SQ	SN:KI270538.1	LN:91309
@SQ	SN:KI270756.1	LN:79590
@SQ	SN:KI270739.1	LN:73985
@SQ	SN:KI270757.1	LN:71251
@SQ	SN:KI270709.1	LN:66860
@SQ	SN:KI270746.1	LN:66486
@SQ	SN:KI270753.1	LN:62944
@SQ	SN:KI270589.1	LN:44474
@SQ	SN:KI270726.1	LN:43739
@SQ	SN:KI270735.1	LN:42811
@SQ	SN:KI270711.1	LN:42210
@SQ	SN:KI270745.1	LN:41891
@SQ	SN:KI270714.1	LN:41717
@SQ	SN:KI270732.1	LN:41543
@SQ	SN:KI270713.1	LN:40745
@SQ	SN:KI270754.1	LN:40191
@SQ	SN:KI270710.1	LN:40176
@SQ	SN:KI270717.1	LN:40062
@SQ	SN:KI270724.1	LN:39555
@SQ	SN:KI270720.1	LN:39050
@SQ	SN:KI270723.1	LN:38115
@SQ	SN:KI270718.1	LN:38054
@SQ	SN:KI270317.1	LN:37690
@SQ	SN:KI270740.1	LN:37240
@SQ	SN:KI270755.1	LN:36723
@SQ	SN:KI270707.1	LN:32032
@SQ	SN:KI270579.1	LN:31033
@SQ	SN:KI270752.1	LN:27745
@SQ	SN:KI270512.1	LN:22689
@SQ	SN:KI270322.1	LN:21476
@SQ	SN:GL000226.1	LN:15008
@SQ	SN:KI270311.1	LN:12399
@SQ	SN:KI270366.1	LN:8320
@SQ	SN:KI270511.1	LN:8127
@SQ	SN:KI270448.1	LN:7992
@SQ	SN:KI270521.1	LN:7642
@SQ	SN:KI270581.1	LN:7046
@SQ	SN:KI270582.1	LN:6504
@SQ	SN:KI270515.1	LN:6361
@SQ	SN:KI270588.1	LN:6158
@SQ	SN:KI270591.1	LN:5796
@SQ	SN:KI270522.1	LN:5674
@SQ	SN:KI270507.1	LN:5353
@SQ	SN:KI270590.1	LN:4685
@SQ	SN:KI270584.1	LN:4513
@SQ	SN:KI270320.1	LN:4416
@SQ	SN:KI270382.1	LN:4215
@SQ	SN:KI270468.1	LN:4055
@SQ	SN:KI270467.1	LN:3920
@SQ	SN:KI270362.1	LN:3530
@SQ	SN:KI270517.1	LN:3253
@SQ	SN:KI270593.1	LN:3041
@SQ	SN:KI270528.1	LN:2983
@SQ	SN:KI270587.1	LN:2969
@SQ	SN:KI270364.1	LN:2855
@SQ	SN:KI270371.1	LN:2805
@SQ	SN:KI270333.1	LN:2699
@SQ	SN:KI270374.1	LN:2656
@SQ	SN:KI270411.1	LN:2646
@SQ	SN:KI270414.1	LN:2489
@SQ	SN:KI270510.1	LN:2415
@SQ	SN:KI270390.1	LN:2387
@SQ	SN:KI270375.1	LN:2378
@SQ	SN:KI270420.1	LN:2321
@SQ	SN:KI270509.1	LN:2318
@SQ	SN:KI270315.1	LN:2276
@SQ	SN:KI270302.1	LN:2274
@SQ	SN:KI270518.1	LN:2186
@SQ	SN:KI270530.1	LN:2168
@SQ	SN:KI270304.1	LN:2165
@SQ	SN:KI270418.1	LN:2145
@SQ	SN:KI270424.1	LN:2140
@SQ	SN:KI270417.1	LN:2043
@SQ	SN:KI270508.1	LN:1951
@SQ	SN:KI270303.1	LN:1942
@SQ	SN:KI270381.1	LN:1930
@SQ	SN:KI270529.1	LN:1899
@SQ	SN:KI270425.1	LN:1884
@SQ	SN:KI270396.1	LN:1880
@SQ	SN:KI270363.1	LN:1803
@SQ	SN:KI270386.1	LN:1788
@SQ	SN:KI270465.1	LN:1774
@SQ	SN:KI270383.1	LN:1750
@SQ	SN:KI270384.1	LN:1658
@SQ	SN:KI270330.1	LN:1652
@SQ	SN:KI270372.1	LN:1650
@SQ	SN:KI270548.1	LN:1599
@SQ	SN:KI270580.1	LN:1553
@SQ	SN:KI270387.1	LN:1537
@SQ	SN:KI270391.1	LN:1484
@SQ	SN:KI270305.1	LN:1472
@SQ	SN:KI270373.1	LN:1451
@SQ	SN:KI270422.1	LN:1445
@SQ	SN:KI270316.1	LN:1444
@SQ	SN:KI270340.1	LN:1428
@SQ	SN:KI270338.1	LN:1428
@SQ	SN:KI270583.1	LN:1400
@SQ	SN:KI270334.1	LN:1368
@SQ	SN:KI270429.1	LN:1361
@SQ	SN:KI270393.1	LN:1308
@SQ	SN:KI270516.1	LN:1300
@SQ	SN:KI270389.1	LN:1298
@SQ	SN:KI270466.1	LN:1233
@SQ	SN:KI270388.1	LN:1216
@SQ	SN:KI270544.1	LN:1202
@SQ	SN:KI270310.1	LN:1201
@SQ	SN:KI270412.1	LN:1179
@SQ	SN:KI270395.1	LN:1143
@SQ	SN:KI270376.1	LN:1136
@SQ	SN:KI270337.1	LN:1121
@SQ	SN:KI270335.1	LN:1048
@SQ	SN:KI270378.1	LN:1048
@SQ	SN:KI270379.1	LN:1045
@SQ	SN:KI270329.1	LN:1040
@SQ	SN:KI270419.1	LN:1029
@SQ	SN:KI270336.1	LN:1026
@SQ	SN:KI270312.1	LN:998
@SQ	SN:KI270539.1	LN:993
@SQ	SN:KI270385.1	LN:990
@SQ	SN:KI270423.1	LN:981
@SQ	SN:KI270392.1	LN:971
@SQ	SN:KI270394.1	LN:970
@RG	ID:out_sorted	SM:out_sorted	LB:0.1	PU:out_sorted	PL:ILLUMINA
@PG	PN:bwa	ID:bwa	VN:0.7.17-r1198-dirty	CL:/risapps/rhel7/bwa/0.7.17/bwa mem -t24 -T0 /rsrch3/scratch/bcb/jdou1/PM1645_CRISPR/hg38/genome test.CB.fastq
@PG	ID:samtools	PN:samtools	PP:bwa	VN:1.18	CL:samtools view -H chr20.maester_scRNA.bam

The preprocessing step appears to run fine:

$ python ${path}/src/Monopogen.py  preProcess \
>    -b bam.lst \
>    -o bm \
>    -a ${path}/apps \
>    -t 8

[2023-12-12 12:22:01,044] INFO     Monopogen.py Performing data preprocess before variant calling...
[2023-12-12 12:22:01,044] INFO     germline.py Parameters in effect:
[2023-12-12 12:22:01,044] INFO     germline.py --subcommand = [preProcess]
[2023-12-12 12:22:01,044] INFO     germline.py --bamFile = [bam.lst]
[2023-12-12 12:22:01,044] INFO     germline.py --out = [bm]
[2023-12-12 12:22:01,044] INFO     germline.py --app_path = [/nfs/team205/at31/software/Monopogen/apps]
[2023-12-12 12:22:01,044] INFO     germline.py --max_mismatch = [3]
[2023-12-12 12:22:01,044] INFO     germline.py --nthreads = [8]
[2023-12-12 12:22:01,055] DEBUG    Monopogen.py PreProcessing sample bm
[2023-12-12 12:22:01,089] INFO     germline.py The contig chr4 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,090] INFO     germline.py The contig chr5 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,090] INFO     germline.py The contig chr6 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,091] INFO     germline.py The contig chr1 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,092] INFO     germline.py The contig chr2 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,094] INFO     germline.py The contig chr7 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,094] INFO     germline.py The contig chr3 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,096] INFO     germline.py The contig chr8 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,173] INFO     germline.py The contig chr9 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,176] INFO     germline.py The contig chr10 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,178] INFO     germline.py The contig chr11 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,179] INFO     germline.py The contig chr12 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,180] INFO     germline.py The contig chr14 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,183] INFO     germline.py The contig chr13 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,188] INFO     germline.py The contig chr15 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,191] INFO     germline.py The contig chr16 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,243] INFO     germline.py The contig chr18 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,249] INFO     germline.py The contig chr17 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,250] INFO     germline.py The contig chr19 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,254] INFO     germline.py The contig chr20 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,257] INFO     germline.py The contig chr21 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,264] INFO     germline.py The contig chr22 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:24:17,990] INFO     Monopogen.py Success! See instructions above.

The germline step emits lots of errors / warnings but still reports that it ran successfully:

$ python  ${path}/src/Monopogen.py  germline  \
>    -a   ${path}/apps \
>    -r  region.lst \
>    -p  ./ \
>    -g  chr20_2Mb.hg38.fa \
>    -m 3 \
>    -t 8 \
>    -s all \
>    -o bm

[2023-12-12 12:31:53,984] INFO     Monopogen.py Performing germline variant calling...
[2023-12-12 12:31:53,985] INFO     germline.py Parameters in effect:
[2023-12-12 12:31:53,985] INFO     germline.py --subcommand = [germline]
[2023-12-12 12:31:53,985] INFO     germline.py --region = [region.lst]
[2023-12-12 12:31:53,985] INFO     germline.py --step = [all]
[2023-12-12 12:31:53,985] INFO     germline.py --out = [bm]
[2023-12-12 12:31:53,985] INFO     germline.py --reference = [chr20_2Mb.hg38.fa]
[2023-12-12 12:31:53,985] INFO     germline.py --imputation_panel = [./]
[2023-12-12 12:31:53,985] INFO     germline.py --max_softClipped = [3]
[2023-12-12 12:31:53,985] INFO     germline.py --app_path = [/nfs/team205/at31/software/Monopogen/apps]
[2023-12-12 12:31:53,985] INFO     germline.py --nthreads = [8]
[2023-12-12 12:31:53,985] INFO     germline.py --norun = [FALSE]
[2023-12-12 12:31:53,985] INFO     Monopogen.py Checking existence of essenstial resource files...
[2023-12-12 12:31:54,050] INFO     Monopogen.py Checking dependencies...
['bash bm/Script/runGermline_chr20.sh']
[mpileup] 1 samples in 1 input files
(mpileup) Max depth is above 1M. Potential memory hog!
Reference allele mismatch at chr20:3000001 .. REF_SEQ:'A' vs VCF:'N'
[W::vcf_parse] Contig 'i' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::bcf_write] Broken VCF record, the number of columns at chr20:3810050 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at i:0 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at i:0 does not match the number of samples (0 vs 1)
[W::vcf_parse] FILTER 'A' is not defined in the header
[W::vcf_parse] INFO '0' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'INDEL;IDV=2;IMF=0.0210526;DP=95;I16=0,0,0,1,0,0,22,484,0,0,32,1024,0,0,24,576;QS=0,1;SGB=-0.379885;MQ0F=0' is not defined in the header, assuming Type=String
[E::bcf_hdr_parse_line] Could not parse the header line: "##FORMAT=<ID=INDEL;IDV=2;IMF=0.0210526;DP=95;I16=0,0,0,1,0,0,22,484,0,0,32,1024,0,0,24,576;QS=0,1;SGB=-0.379885;MQ0F=0,Number=1,Type=String,Description="Dummy">"
[E::vcf_parse_format] Could not add dummy header for FORMAT 'INDEL;IDV=2;IMF=0.0210526;DP=95;I16=0,0,0,1,0,0,22,484,0,0,32,1024,0,0,24,576;QS=0,1;SGB=-0.379885;MQ0F=0'
[W::vcf_parse_format] FORMAT 'INDEL;IDV=18;IMF=0.382979;DP=47;I16=0,0,0,2,0,0,66,2178,0,0,110,6050,0,0,18,162;QS=0,1;VDB=0.02;SGB=-0.453602;MQ0F=0' is not defined in the header, assuming Type=String
[E::bcf_hdr_parse_line] Could not parse the header line: "##FORMAT=<ID=INDEL;IDV=18;IMF=0.382979;DP=47;I16=0,0,0,2,0,0,66,2178,0,0,110,6050,0,0,18,162;QS=0,1;VDB=0.02;SGB=-0.453602;MQ0F=0,Number=1,Type=String,Description="Dummy">"
[E::vcf_parse_format] Could not add dummy header for FORMAT 'INDEL;IDV=18;IMF=0.382979;DP=47;I16=0,0,0,2,0,0,66,2178,0,0,110,6050,0,0,18,162;QS=0,1;VDB=0.02;SGB=-0.453602;MQ0F=0'
[W::vcf_parse] Contig '|' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] FILTER '50,6,0:2' is not defined in the header
[E::bcf_hdr_parse_line] Could not parse the header line: "##FILTER=<ID=50,6,0:2,Description="Dummy">"
[E::vcf_parse] Could not add dummy header for FILTER '50,6,0:2'
[E::bcf_write] Broken VCF record, the number of columns at chr20:5046788 does not match the number of samples (0 vs 1)
[W::vcf_parse] Contig '' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] FILTER '213,96,0:32' is not defined in the header
[E::bcf_hdr_parse_line] Could not parse the header line: "##FILTER=<ID=213,96,0:32,Description="Dummy">"
[E::vcf_parse] Could not add dummy header for FILTER '213,96,0:32'
[E::bcf_write] Broken VCF record, the number of columns at chr20:5847394 does not match the number of samples (0 vs 1)
[W::vcf_parse] FILTER '48,6,0:2' is not defined in the header
[E::bcf_hdr_parse_line] Could not parse the header line: "##FILTER=<ID=48,6,0:2,Description="Dummy">"
[E::vcf_parse] Could not add dummy header for FILTER '48,6,0:2'
[E::bcf_write] Broken VCF record, the number of columns at chr20:8188931 does not match the number of samples (0 vs 1)
[W::vcf_parse] FILTER '' is not defined in the header
[W::vcf_parse_format] FORMAT 'INDEL;IDV=1;IMF=0.00952381;DP=105;I16=0,0,77,3,0,0,2815,104141,0,0,4800,288000,0,0,1743,41265;QS=0,1;VDB=0.000484566;SGB=-0.693147;MQSB=1;MQ0F=0' is not defined in the header, assuming Type=String
[E::bcf_hdr_parse_line] Could not parse the header line: "##FORMAT=<ID=INDEL;IDV=1;IMF=0.00952381;DP=105;I16=0,0,77,3,0,0,2815,104141,0,0,4800,288000,0,0,1743,41265;QS=0,1;VDB=0.000484566;SGB=-0.693147;MQSB=1;MQ0F=0,Number=1,Type=String,Description="Dummy">"
[E::vcf_parse_format] Could not add dummy header for FORMAT 'INDEL;IDV=1;IMF=0.00952381;DP=105;I16=0,0,77,3,0,0,2815,104141,0,0,4800,288000,0,0,1743,41265;QS=0,1;VDB=0.000484566;SGB=-0.693147;MQSB=1;MQ0F=0'
[W::vcf_parse_format] FORMAT 'INDEL;IDV=3;IMF=0.0625;DP=48;I16=0,0,0,14,0,0,355,9005,0,0,840,50400,0,0,28,98;QS=0,1;VDB=2.00699e-07;SGB=-0.686358;MQ0F=0' is not defined in the header, assuming Type=String
[E::bcf_hdr_parse_line] Could not parse the header line: "##FORMAT=<ID=INDEL;IDV=3;IMF=0.0625;DP=48;I16=0,0,0,14,0,0,355,9005,0,0,840,50400,0,0,28,98;QS=0,1;VDB=2.00699e-07;SGB=-0.686358;MQ0F=0,Number=1,Type=String,Description="Dummy">"
[E::vcf_parse_format] Could not add dummy header for FORMAT 'INDEL;IDV=3;IMF=0.0625;DP=48;I16=0,0,0,14,0,0,355,9005,0,0,840,50400,0,0,28,98;QS=0,1;VDB=2.00699e-07;SGB=-0.686358;MQ0F=0'
[W::vcf_parse] Contig '?' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::bcf_write] Broken VCF record, the number of columns at chr20:17944845 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at ?:0 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at :0 does not match the number of samples (0 vs 1)
[W::vcf_parse] FILTER 'T' is not defined in the header
[W::vcf_parse] Contig '?=' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::bcf_write] Broken VCF record, the number of columns at chr20:25478115 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at ?=:0 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at :0 does not match the number of samples (0 vs 1)
[W::vcf_parse] Contig '?' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::bcf_write] Broken VCF record, the number of columns at chr20:33389937 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at ?:0 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at :0 does not match the number of samples (0 vs 1)
[W::vcf_parse] Contig '?' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] FILTER '187,27,0:9' is not defined in the header
[E::bcf_hdr_parse_line] Could not parse the header line: "##FILTER=<ID=187,27,0:9,Description="Dummy">"
[E::vcf_parse] Could not add dummy header for FILTER '187,27,0:9'
[E::bcf_write] Broken VCF record, the number of columns at chr20:34805043 does not match the number of samples (0 vs 1)
[W::vcf_parse] FILTER '?' is not defined in the header
[W::vcf_parse_format] FORMAT 'INDEL;IDV=1;IMF=0.2;DP=5;I16=0,0,0,1,0,0,25,625,0,0,60,3600,0,0,11,121;QS=0,1;SGB=-0.379885;MQ0F=0' is not defined in the header, assuming Type=String
[E::bcf_hdr_parse_line] Could not parse the header line: "##FORMAT=<ID=INDEL;IDV=1;IMF=0.2;DP=5;I16=0,0,0,1,0,0,25,625,0,0,60,3600,0,0,11,121;QS=0,1;SGB=-0.379885;MQ0F=0,Number=1,Type=String,Description="Dummy">"
[E::vcf_parse_format] Could not add dummy header for FORMAT 'INDEL;IDV=1;IMF=0.2;DP=5;I16=0,0,0,1,0,0,25,625,0,0,60,3600,0,0,11,121;QS=0,1;SGB=-0.379885;MQ0F=0'
[W::bgzf_read_block] EOF marker is absent. The input is probably truncated
[E::bcf_write] Broken VCF record, the number of columns at chr20:36572282 does not match the number of samples (0 vs 1)
Error: VCF parse error
beagle.27Jul16.86a.jar (version 4.1)
Copyright (C) 2014-2015 Brian L. Browning
Enter "java -jar beagle.27Jul16.86a.jar" for a summary of command line arguments.
Start time: 12:36 PM GMT on 12 Dec 2023

Command line: java -Xmx20480m -jar beagle.jar
  gl=bm/germline/chr20.gl.vcf.gz
  ref=./CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz
  chrom=chr20
  out=bm/germline/chr20.gp
  impute=false
  modelscale=2
  nthreads=24
  gprobs=true
  niterations=0

No genetic map is specified: using 1 cM = 1 Mb
Exception in thread "main" java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: VCF header line has 10 fields, but data line has 4 fields
File source:File source: bm/germline/chr20.gl.vcf.gz
[chr20, 2998989, ., A]
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
	at java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:600)
	at java.base/java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:678)
	at java.base/java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:737)
	at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:919)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
	at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
	at vcf.VcfIt.fillEmissionBuffer(VcfIt.java:307)
	at vcf.VcfIt.next(VcfIt.java:363)
	at vcf.VcfIt.next(VcfIt.java:52)
	at vcf.IntervalVcfIt.readNextRecord(IntervalVcfIt.java:110)
	at vcf.IntervalVcfIt.next(IntervalVcfIt.java:92)
	at vcf.IntervalVcfIt.next(IntervalVcfIt.java:36)
	at main.Main.restrictToVcfMarkers(Main.java:343)
	at main.Main.allData(Main.java:313)
	at main.Main.main(Main.java:111)
Caused by: java.lang.IllegalArgumentException: VCF header line has 10 fields, but data line has 4 fields
File source:File source: bm/germline/chr20.gl.vcf.gz
[chr20, 2998989, ., A]
	at vcf.VcfRecord.fieldCountError(VcfRecord.java:221)
	at vcf.VcfRecord.delimiters(VcfRecord.java:203)
	at vcf.VcfRecord.<init>(VcfRecord.java:87)
	at vcf.VcfRecord.fromGTGL(VcfRecord.java:193)
	at vcf.VcfIt.lambda$static$5(VcfIt.java:76)
	at vcf.VcfIt.lambda$new$8(VcfIt.java:192)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
	at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
	at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:952)
	at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:926)
	at java.base/java.util.stream.AbstractTask.compute(AbstractTask.java:327)
	at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:746)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
gzip: bm/germline/chr20.gp.vcf.gz: No such file or directory
bm/germline/chr20.gp.vcf.gz: No such file or directory
beagle.27Jul16.86a.jar (version 4.1)
Copyright (C) 2014-2015 Brian L. Browning
Enter "java -jar beagle.27Jul16.86a.jar" for a summary of command line arguments.
Start time: 12:36 PM GMT on 12 Dec 2023

Command line: java -Xmx20480m -jar beagle.jar
  gt=bm/germline/chr20.germline.vcf
  ref=./CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz
  chrom=chr20
  out=bm/germline/chr20.phased
  impute=false
  modelscale=2
  nthreads=24
  gprobs=true
  niterations=0

No genetic map is specified: using 1 cM = 1 Mb
Exception in thread "main" java.lang.IllegalArgumentException: Missing line (#CHROM ...) after meta-information lines
File source: bm/germline/chr20.germline.vcf
null
	at vcf.VcfHeader.checkHeaderLine(VcfHeader.java:135)
	at vcf.VcfHeader.<init>(VcfHeader.java:119)
	at vcf.VcfIt.<init>(VcfIt.java:190)
	at vcf.VcfIt.create(VcfIt.java:175)
	at vcf.VcfIt.create(VcfIt.java:150)
	at main.Main.allData(Main.java:297)
	at main.Main.main(Main.java:111)
[2023-12-12 12:36:05,744] INFO     Monopogen.py Success! See instructions above.

Then when I run the somatic featureInfo step, I get an error:

$ python ${path}/src/Monopogen.py  somatic  \
>     -a ${path}/apps \
>     -r region.lst \
>     -t 50 \
>     -i bm \
>     -l CB_7K.maester_scRNA.csv \
>     -s featureInfo \
>     -g chr20_2Mb.hg38.fa 

[2023-12-12 12:38:06,217] INFO     Monopogen.py Get feature information from sequencing data...
[E::hts_open_format] Failed to open file "bm/germline/chr20.phased.vcf.gz" : No such file or directory
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/nfs/users/nfs_a/at31/miniforge3/envs/monopogen/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/nfs/users/nfs_a/at31/miniforge3/envs/monopogen/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/nfs/team205/at31/software/Monopogen/src/somatic.py", line 88, in featureInfo
    vcf_in = VariantFile(out + "/germline/" +  region + ".phased.vcf.gz")
  File "pysam/libcbcf.pyx", line 4117, in pysam.libcbcf.VariantFile.__init__
  File "pysam/libcbcf.pyx", line 4342, in pysam.libcbcf.VariantFile.open
FileNotFoundError: [Errno 2] could not open variant file `b'bm/germline/chr20.phased.vcf.gz'`: No such file or directory
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nfs/team205/at31/software/Monopogen/src/Monopogen.py", line 436, in <module>
    main()
  File "/nfs/team205/at31/software/Monopogen/src/Monopogen.py", line 429, in main
    args.func(args)
  File "/nfs/team205/at31/software/Monopogen/src/Monopogen.py", line 150, in somatic
    result = pool.map(featureInfo, joblst)
  File "/nfs/users/nfs_a/at31/miniforge3/envs/monopogen/lib/python3.9/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/nfs/users/nfs_a/at31/miniforge3/envs/monopogen/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
FileNotFoundError: [Errno 2] could not open variant file `b'bm/germline/chr20.phased.vcf.gz'`: No such file or directory

Are you familiar with this issue? I re-downloaded chr20.maester_scRNA.bam multiple times to make sure it wasn't somehow truncated during the download, but the issue persists. Do you have any idea what I'm doing wrong?

Thanks so much in advance!

About Error for the cellScan step during Somatic Variant Calling

Dear Monopogen developers,
I'd like to express my gratitude for developing Monopogen. It has been invaluable for my research. I've encountered an issue and was hoping to seek your guidance on it.
When I doing the cellScan step in Somatic Variant Calling, I got an error: with Monopogen.py Get single cell level information from sequencing data...['chr17']['chr17.filter.targeted.bam'] merge: invalid option -- 'o'
I checked the code in Monopogen.py file, it uses pysam.merge("-f","-o",output_bam,*bamlst), this error seems due to pysam version difference. I changed the code into pysam.merge("-f", output_bam, *bamlst), and it works. However, I created the environment exactly as yours(pysam=0.16.0.1), I don't know If it will change the output. If not, I hope this could help others who encounter this issue.

Error in Germline

Hi, I am trying to run Germline on the example provided in the repository. After successfully running preProcess, I run the following command:

python src/Monopogen.py germline -a apps/ -t 1 -r test/region.lst -p example/ -g example/chr20_2Mb.hg38.fa -s all -o out

But I get errors and the following is the output:

[2024-01-09 11:42:35,227] INFO Monopogen.py Performing germline variant calling...
[2024-01-09 11:42:35,227] INFO germline.py Parameters in effect:
[2024-01-09 11:42:35,227] INFO germline.py --subcommand = [germline]
[2024-01-09 11:42:35,227] INFO germline.py --region = [test/region.lst]
[2024-01-09 11:42:35,227] INFO germline.py --step = [all]
[2024-01-09 11:42:35,227] INFO germline.py --out = [out]
[2024-01-09 11:42:35,227] INFO germline.py --reference = [example/chr20_2Mb.hg38.fa]
[2024-01-09 11:42:35,228] INFO germline.py --imputation_panel = [example/]
[2024-01-09 11:42:35,228] INFO germline.py --max_softClipped = [1]
[2024-01-09 11:42:35,228] INFO germline.py --app_path = [apps/]
[2024-01-09 11:42:35,228] INFO germline.py --nthreads = [1]
[2024-01-09 11:42:35,228] INFO germline.py --norun = [FALSE]
[2024-01-09 11:42:35,228] INFO Monopogen.py Checking existence of essenstial resource files...
[2024-01-09 11:42:35,236] INFO Monopogen.py Checking dependencies...
['bash out/Script/runGermline_chr20.sh']
/rsrch5/home/tdccct/ppshah/shared/pipelines/Monopogen/apps/bcftools: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory
/rsrch5/home/tdccct/ppshah/shared/pipelines/Monopogen/apps/bcftools: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory
[mpileup] 2 samples in 2 input files
(mpileup) Max depth is above 1M. Potential memory hog!
beagle.27Jul16.86a.jar (version 4.1)
Copyright (C) 2014-2015 Brian L. Browning
Enter "java -jar beagle.27Jul16.86a.jar" for a summary of command line arguments.
Start time: 11:44 AM CST on 09 Jan 2024

Command line: java -Xmx20480m -jar beagle.jar
gl=out/germline/chr20.gl.vcf.gz
ref=example/CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz
chrom=chr20
out=out/germline/chr20.gp
impute=false
modelscale=2
nthreads=24
gprobs=true
niterations=0

No genetic map is specified: using 1 cM = 1 Mb
Exception in thread "main" java.lang.IllegalArgumentException: Missing line (#CHROM ...) after meta-information lines
File source: out/germline/chr20.gl.vcf.gz
null
at vcf.VcfHeader.checkHeaderLine(VcfHeader.java:135)
at vcf.VcfHeader.(VcfHeader.java:119)
at vcf.VcfIt.(VcfIt.java:190)
at vcf.VcfIt.create(VcfIt.java:175)
at vcf.VcfIt.create(VcfIt.java:150)
at main.Main.allData(Main.java:303)
at main.Main.main(Main.java:111)
gzip: out/germline/chr20.gp.vcf.gz: No such file or directory
out/germline/chr20.gp.vcf.gz: No such file or directory
beagle.27Jul16.86a.jar (version 4.1)
Copyright (C) 2014-2015 Brian L. Browning
Enter "java -jar beagle.27Jul16.86a.jar" for a summary of command line arguments.
Start time: 11:44 AM CST on 09 Jan 2024

Command line: java -Xmx20480m -jar beagle.jar
gt=out/germline/chr20.germline.vcf
ref=example/CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz
chrom=chr20
out=out/germline/chr20.phased
impute=false
modelscale=2
nthreads=24
gprobs=true
niterations=0

Any guidance to resolve this would be helpful. I am working on HPC system.

Thank you!

About run data preprocess

When I run prepocess.py with a sorted bam file, the following error occurs after some time: ValueError: fetch called on a bam file without index. How do I fix it?

Whether SmartSeq data can be used with the Somatic function？

Whether SmartSeq data can be used with the software.SmartSeq sequencing is a single cell, does not provide barcode.csv. Somatic function available?

Issues for germline variant calling part from scRNA-seq

Hi,
Thank you for developing such a wonderful tool.
I have faced a issue when I running the germline variant calling part of your tutorial. Actually I have succeeded running chromosome 1 and 20 from my data with the guidance of your tutorial. So I try to run 5 chromosomes (chr12-16) at the same time. But there are some errors occurred.

Could you please provide some guidance on how to resolve this issue? Thank you in advance for your assistance.

My code

source activate gva
path="/data/ouyangjfc/home/gmslijiw/Monopogen"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${path}/apps
python ${path}/src/Monopogen.py germline
-a ${path}/apps -t 5 -r /data/ouyangjfc/home/gmslijiw/Monopogen/CD34+_past/region_12_16.lst
-p /data/ouyangjfc/home/gmslijiw/Monopogen/reference/1KG3/
-g /data/ouyangjfc/home/gmslijiw/Monopogen/reference/hg38.analysisSet.fa -m 3 -s all -o /data/ouyangjfc/home/gmslijiw/Monopogen/CD34+_past

Error Message

[2024-01-25 17:04:08,041] INFO Monopogen.py Performing germline variant calling...
[2024-01-25 17:04:08,041] INFO germline.py Parameters in effect:
[2024-01-25 17:04:08,041] INFO germline.py --subcommand = [germline]
[2024-01-25 17:04:08,042] INFO germline.py --region = [/data/ouyangjfc/home/gmslijiw/Monopogen/CD34+/region_12_16.lst]
[2024-01-25 17:04:08,042] INFO germline.py --step = [all]
[2024-01-25 17:04:08,042] INFO germline.py --out = [/data/ouyangjfc/home/gmslijiw/Monopogen/CD34+]
[2024-01-25 17:04:08,042] INFO germline.py --reference = [/data/ouyangjfc/home/gmslijiw/Monopogen/reference/hg38.analysisSet.fa]
[2024-01-25 17:04:08,042] INFO germline.py --imputation_panel = [/data/ouyangjfc/home/gmslijiw/Monopogen/reference/1KG3/]
[2024-01-25 17:04:08,042] INFO germline.py --max_softClipped = [3]
[2024-01-25 17:04:08,042] INFO germline.py --app_path = [/data/ouyangjfc/home/gmslijiw/Monopogen/apps]
[2024-01-25 17:04:08,042] INFO germline.py --nthreads = [8]
[2024-01-25 17:04:08,042] INFO germline.py --norun = [FALSE]
[2024-01-25 17:04:08,042] INFO Monopogen.py Checking existence of essenstial resource files...
[2024-01-25 17:04:08,068] INFO Monopogen.py Checking dependencies...
/data/ouyangjfc/home/gmslijiw/Monopogen/CD34+/Bam/.filter.bam.lst: No such file or directory
[mpileup] 1 samples in 1 input files
[mpileup] 1 samples in 1 input files
[mpileup] 1 samples in 1 input files
[mpileup] 1 samples in 1 input files
[mpileup] 1 samples in 1 input files
Failed to open -: unknown file type
Failed to open -: unknown file type
Exception in thread "main" java.lang.IllegalArgumentException: missing value in key-value pair: chrom=
at blbutil.Validate.argsToMap(Validate.java:75)
at main.Par.(Par.java:100)
at main.Main.parameters(Main.java:388)
at main.Main.main(Main.java:104)
gzip: /data/ouyangjfc/home/gmslijiw/Monopogen/CD34+/germline/.gp.vcf.gz: No such file or directory
/data/ouyangjfc/home/gmslijiw/Monopogen/CD34+/germline/.gp.vcf.gz: No such file or directory
Exception in thread "main" java.lang.IllegalArgumentException: missing value in key-value pair: chrom=
at blbutil.Validate.argsToMap(Validate.java:75)
at main.Par.(Par.java:100)
at main.Main.parameters(Main.java:388)
at main.Main.main(Main.java:104)
(mpileup) Max depth is above 1M. Potential memory hog!
(mpileup) Max depth is above 1M. Potential memory hog!
(mpileup) Max depth is above 1M. Potential memory hog!
(mpileup) Max depth is above 1M. Potential memory hog!
(mpileup) Max depth is above 1M. Potential memory hog!
Lines total/split/realigned/skipped: 40866257/38219/3819/0
Lines total/split/realigned/skipped: 52806980/61472/4782/0
Lines total/split/realigned/skipped: 52742649/56641/5151/0
Lines total/split/realigned/skipped: 53210593/63155/5514/0
Lines total/split/realigned/skipped: 77572219/99785/7434/0
[2024-01-25 18:09:36,625] INFO Monopogen.py Success! See instructions above.

unable to run executables in /apps

Hello! What architecture are folks running the code on? I'm on an Intel Core i9 MacOS machine and get an executable format error for all the executables in the apps directory.

About phased.vcf.gz generation during germline calling

When I conducted the germline calling process, I noticed that only a limited number of chromosomes were successfully processed into 'phased.vcf.gz,' whereas most of the chromosomes remained unprocessed. I included all standard chromosomes (chromosomes 1-22) as listed in 'region.lst' and utilized the GRCh38 human reference FASTA file for this analysis. Could this variation in success be linked to the inherent low read depth typically associated with 10x scRNA-seq data? Additionally, I'm interested in knowing if there are any potential solutions to address this issue.

Many thanks.

Lineage tracing

Hello!

I would like to perform lineage tracing with Monopogen on my single-cell RNA-Seq data.
I followed the tutorial to the end and got following files in the somatic folder:
chr17.cell_snv.cellID.csv
chr17.germlineTwoLoci_model.csv
chr17.cell_snv.cellID.filter.csv
chr17.cell_snv.snvID.csv
chr17.putativeSNVs.csv
chr17.germlineTrioLoci_model.csv
chr17.gl.filter.hc.cell.mat.gz
chr17.cell_snv.mat.gz
chr17.gl.vcf.filter.hc.pos
chr17.gl.vcf.filter.hc.bed
chr17.SNV_mat.RDS
chr17.gl.vcf.DP4
chr17.gl.vcf.filter.DP4
LDrefinement_germline.chr17.pdf
svm_feature.chr17.pdf

Could you direct me where I can find information about these output files?

Filtering putativeSNVs.csv file with suggested parameters ((df.SVM_pos_score>0.5) & (df.LDrefine_merged_score>0.25) & (df.BAF_alt>0.1) & (df.BAF_alt<0.5) & (df.Depth_ref>2) & (df.Depth_alt>2)) gave me 111 variants.
How can I translate these variants on single-cell level to identify groups of cells with same origin?

preprocess error

hey I got the error while running preprocess always. could you help me out?

[2023-10-17 19:20:15,607] INFO     Monopogen.py Performing data preprocess before variant calling...
[2023-10-17 19:20:15,607] INFO     germline.py Parameters in effect:
[2023-10-17 19:20:15,607] INFO     germline.py --subcommand = [preProcess]
[2023-10-17 19:20:15,607] INFO     germline.py --bamFile = [bam.lst]
[2023-10-17 19:20:15,607] INFO     germline.py --out = [s1_out]
[2023-10-17 19:20:15,607] INFO     germline.py --app_path = [/home/big/zheng/Monopogen/apps]
[2023-10-17 19:20:15,607] INFO     germline.py --max_mismatch = [3]
[2023-10-17 19:20:15,607] INFO     germline.py --nthreads = [8]
[2023-10-17 19:20:15,614] DEBUG    Monopogen.py PreProcessing sample all_cells
[2023-10-17 19:20:15,809] INFO     germline.py The contig chr5 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-10-17 19:20:15,814] INFO     germline.py The contig chr1 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-10-17 19:20:15,818] INFO     germline.py The contig chr2 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-10-17 19:20:15,820] INFO     germline.py The contig chr4 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-10-17 19:20:15,821] INFO     germline.py The contig chr6 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-10-17 19:20:15,821] INFO     germline.py The contig chr3 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-10-17 19:20:15,823] INFO     germline.py The contig chr8 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-10-17 19:20:15,823] INFO     germline.py The contig chr7 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-10-17 19:20:15,921] INFO     germline.py The contig chr9 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-10-17 19:20:15,929] INFO     germline.py The contig chr10 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-10-17 19:20:15,949] INFO     germline.py The contig chr11 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-10-17 19:20:15,954] INFO     germline.py The contig chr12 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-10-17 19:20:15,954] INFO     germline.py The contig chr13 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-10-17 19:20:15,956] INFO     germline.py The contig chr15 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-10-17 19:20:15,958] INFO     germline.py The contig chr14 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-10-17 19:20:15,959] INFO     germline.py The contig chr16 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-10-17 19:20:16,026] INFO     germline.py The contig chr17 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-10-17 19:20:16,033] INFO     germline.py The contig chr18 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-10-17 19:20:16,055] INFO     germline.py The contig chr19 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-10-17 19:20:16,061] INFO     germline.py The contig chr20 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-10-17 19:20:16,063] INFO     germline.py The contig chr21 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-10-17 19:20:16,065] INFO     germline.py The contig chr22 does not contain the prefix 'chr' and we will add 'chr' on it 
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/zheng/anaconda3/envs/monopogen/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/zheng/anaconda3/envs/monopogen/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/big/zheng/Monopogen/src/germline.py", line 200, in BamFilter
    for s in infile.fetch(search_chr):
  File "pysam/libcalignmentfile.pyx", line 1089, in pysam.libcalignmentfile.AlignmentFile.fetch
  File "pysam/libchtslib.pyx", line 683, in pysam.libchtslib.HTSFile.parse_region
ValueError: invalid contig `5`
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/big/zheng/Monopogen/src/Monopogen.py", line 435, in <module>
    main()
  File "/home/big/zheng/Monopogen/src/Monopogen.py", line 428, in main
    args.func(args)
  File "/home/big/zheng/Monopogen/src/Monopogen.py", line 313, in preProcess
    result = pool.map(BamFilter, para_lst)
  File "/home/zheng/anaconda3/envs/monopogen/lib/python3.8/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/zheng/anaconda3/envs/monopogen/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
ValueError: invalid contig `5`

the output of samtools view -h sublibrary1_chr_sorted.bam | head -n 25 is

@HD     VN:1.4  SO:coordinate
@SQ     SN:chr1 LN:248956422
@SQ     SN:chr10        LN:133797422
@SQ     SN:chr11        LN:135086622
@SQ     SN:chr12        LN:133275309
@SQ     SN:chr13        LN:114364328
@SQ     SN:chr14        LN:107043718
@SQ     SN:chr15        LN:101991189
@SQ     SN:chr16        LN:90338345
@SQ     SN:chr17        LN:83257441
@SQ     SN:chr18        LN:80373285
@SQ     SN:chr19        LN:58617616
@SQ     SN:chr2 LN:242193529
@SQ     SN:chr20        LN:64444167
@SQ     SN:chr21        LN:46709983
@SQ     SN:chr22        LN:50818468
@SQ     SN:chr3 LN:198295559
@SQ     SN:chr4 LN:190214555
@SQ     SN:chr5 LN:181538259
@SQ     SN:chr6 LN:170805979
@SQ     SN:chr7 LN:159345973
@SQ     SN:chr8 LN:145138636
@SQ     SN:chr9 LN:138394717
@SQ     SN:chrMT        LN:16569
@SQ     SN:chrX LN:156040895

and the output of samtools view sublibrary1_chr_sorted.bam | head -n 10 is

63_76_14__R__159_76_14__ACGGACTC_AGATGTAC_AACCGAGA__TCCGGCTAAA__230914Xm_CAGATC 0       chr1    10002   255     108M42S                                                                                                                     *0       0       AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCACTAGATTCCGTCCACAGTCTCAAGCACGTGGATGTACAGCTA                                                                      FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFF::::F,:,FFFFFFF,:,,,F,FFFFF,,,,,,:,F,F,F,F                                                                                       NH:i:1   HI:i:1  AS:i:106        nM:i:0  GX:Z:   GN:Z:   pN:Z:TCCGGCTAAA CR:Z:ACGGACTC_AGATGTAC_AACCGAGA CB:Z:63_76_14__s1                                                                                                                   pB:Z:159_76_14   pS:Z:MRD016_D30 RE:A:N
30_91_44__T__30_91_44__ACTTTACC_CTAAGGTC_CTGAGCCA__ATCCAGAATG__230914Xm_CAGATC  16      chr1    10005   1       10S97M77N43M                                                                                                                *0       0       TAAGCCTATTCCTAACAGTATCAATATCACTAACCCGTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC                                                                      ::F,FFFF,,F,,F,,,,F,,F,,,,,,,F,F,FF,,,F,FF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFF                                                                                       NH:i:3   HI:i:1  AS:i:120        nM:i:9  GX:Z:   GN:Z:   pN:Z:ATCCAGAATG CR:Z:ACTTTACC_CTAAGGTC_CTGAGCCA CB:Z:30_91_44__s1                                                                                                                   pB:Z:30_91_44    pS:Z:MRD007_Transplant  RE:A:N
04_26_30__R__100_26_30__GCTTATAG_AGCAGGAA_CAACCACA__TATGAAGATT__230914Xm_CAGATC 16      chr1    10534   3       96M2D27M1S                                                                                                                  *0       0       AGTACCACCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGGCGCAGAGAGC                                                                                                FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF                                                                                                                 NH:i:2   HI:i:1  AS:i:111        nM:i:2  GX:Z:   GN:Z:   pN:Z:TATGAAGATT CR:Z:GCTTATAG_AGCAGGAA_CAACCACA CB:Z:04_26_30__s1                                                                                                                   pB:Z:100_26_30   pS:Z:MRD002_D30 RE:A:N
04_26_30__R__100_26_30__GCTTATAG_AGCAGGAA_CAACCACA__TATGAAGATT__230914Xm_CAGATC 16      chr1    10534   3       96M2D27M1S                                                                                                                  *0       0       AGTACCACCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGGCGCAGAGAGC                                                                                                FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF                                                                                                                 NH:i:2   HI:i:1  AS:i:111        nM:i:2  GX:Z:   GN:Z:   pN:Z:TATGAAGATT CR:Z:GCTTATAG_AGCAGGAA_CAACCACA CB:Z:04_26_30__s1                                                                                                                   pB:Z:100_26_30   pS:Z:MRD002_D30 RE:A:N
04_26_30__R__100_26_30__GCTTATAG_AGCAGGAA_CAACCACA__TATGAAGATT__230914Xm_CAGATC 16      chr1    10534   3       96M2D27M1S                                                                                                                  *0       0       AGTACCACCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGGCGCAGAGAGC                                                                                                FF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF                                                                                                                 NH:i:2   HI:i:1  AS:i:111        nM:i:2  GX:Z:   GN:Z:   pN:Z:TATGAAGATT CR:Z:GCTTATAG_AGCAGGAA_CAACCACA CB:Z:04_26_30__s1                                                                                                                   pB:Z:100_26_30   pS:Z:MRD002_D30 RE:A:N
13_81_76__R__109_81_76__TATGTGTC_ATCATTCC_AGATGTAC__GCTTCATTTT__230914Xm_CAGATC 16      chr1    10535   3       95M2D25M                                                                                                                    *0       0       GTACCACCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGGCGCAGAG                                                                                                    FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF                                                                                                                     NH:i:2   HI:i:1  AS:i:108        nM:i:2  GX:Z:   GN:Z:   pN:Z:GCTTCATTTT CR:Z:TATGTGTC_ATCATTCC_AGATGTAC CB:Z:13_81_76__s1                                                                                                                   pB:Z:109_81_76   pS:Z:MRD004_Transplant  RE:A:N
04_26_30__R__100_26_30__GCTTATAG_AGCAGGAA_CAACCACA__TATGAAGATT__230914Xm_CAGATC 16      chr1    10538   3       92M2D27M1S                                                                                                                  *0       0       CCACCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGGCGCAGAGAGC                                                                                                    FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF                                                                                                                     NH:i:2   HI:i:1  AS:i:107        nM:i:2  GX:Z:   GN:Z:   pN:Z:TATGAAGATT CR:Z:GCTTATAG_AGCAGGAA_CAACCACA CB:Z:04_26_30__s1                                                                                                                   pB:Z:100_26_30   pS:Z:MRD002_D30 RE:A:N
14_32_14__R__110_32_14__CAATTCTC_CAATGGAA_AACCGAGA__GAGGGGCGCG__230914Xm_CAGATC 16      chr1    10538   3       92M2D30M                                                                                                                    *0       0       CCACCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGGCGCAGAGAGGCG                                                                                                  FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF                                                                                                                   NH:i:2   HI:i:1  AS:i:110        nM:i:2  GX:Z:   GN:Z:   pN:Z:GAGGGGCGCG CR:Z:CAATTCTC_CAATGGAA_AACCGAGA CB:Z:14_32_14__s1                                                                                                                   pB:Z:110_32_14   pS:Z:MRD004_Transplant  RE:A:N
14_32_14__R__110_32_14__CAATTCTC_CAATGGAA_AACCGAGA__GAGGGGCGCG__230914Xm_CAGATC 16      chr1    10538   3       92M2D30M                                                                                                                    *0       0       CCACCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGGCGCAGAGAGGCG                                                                                                  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF                                                                                                                   NH:i:2   HI:i:1  AS:i:110        nM:i:2  GX:Z:   GN:Z:   pN:Z:GAGGGGCGCG CR:Z:CAATTCTC_CAATGGAA_AACCGAGA CB:Z:14_32_14__s1                                                                                                                   pB:Z:110_32_14   pS:Z:MRD004_Transplant  RE:A:N
14_32_14__R__110_32_14__CAATTCTC_CAATGGAA_AACCGAGA__GAGGGGCGCG__230914Xm_CAGATC 16      chr1    10540   3       90M2D30M                                                                                                                    *0       0       ACCGAAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTCGCAAAGGCGCCGCGCCGGCGCAGGCGCAGAGAGGCG                                                                                                    FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF                                                                                                                     NH:i:2   HI:i:1  AS:i:108        nM:i:2  GX:Z:   GN:Z:   pN:Z:GAGGGGCGCG CR:Z:CAATTCTC_CAATGGAA_AACCGAGA CB:Z:14_32_14__s1                                                                                                                   pB:Z:110_32_14   pS:Z:MRD004_Transplant  RE:A:N

germline calling from multiple chromosomes

Hello all,

Thank you for making this great tool. In the example, the germline caller is used for chromosome 20 only. If we wish to perform germline calling for chromosomes 1-22 for a single sample, do we need to run the germline caller 22 times? Or is there a way to input multiple fasta files that correspond to the different chrs?

Requires unsupported openssl version

(monopogen-env) anannapaneni@esb:~/Monopogen$ python ${path}/src/Monopogen.py
Traceback (most recent call last):
  File "/home/anannapaneni/Monopogen/src/Monopogen.py", line 13, in <module>
    import pysam
  File "/share/quonlab/workspaces/anannapaneni/anaconda3/envs/monopogen-env/lib/python3.12/site-packages/pysam/__init__.py", line 4, in <module>
    from pysam.libchtslib import *
ImportError: /home/anannapaneni/Monopogen/apps/libcrypto.so.1.0.0: version `OPENSSL_1.0.0' not found (required by /usr/lib/x86_64-linux-gnu/libcurl.so.4)

This is on a brand new conda environment with only pip instal -e .
It requires openssl 1.0.0 which is no longer supported

chr20.gp.vcf.gz: No such file or directory

Hi, I am running Germline, after successfully running preProcess.
I met a few problems.
First, when I ran germline, the output showed that I didn't have Java. So, I installed Java using conda install bioconda::java-jdk . After that, I rerun it. the output showed that no bam.bai file. So, I used the samtools index to add the bai file. Then re-run Germine with the following command.

path="/home/mingchao/Linux_file/Monopogen/Monopogen"  # where Monopogen is downloaded
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${path}/apps
python  ${path}/src/Monopogen.py  germline  \
    -a   ${path}/apps -r  /data/Mingchao/dowload_data/Monopogen/Monopogen_testfile/region.lst \
    -p  /data/Mingchao/dowload_data/Monopogen/Monopogen_testfile/  \
    -g  /data/Mingchao/dowload_data/Monopogen/Monopogen_testfile/chr20_2Mb.hg38.fa   -s all  -o /data/Mingchao/dowload_data/Monopogen/Monopogen_testout

The job still cannot run correctly. The output file only has chr20.gl.vcf.gz and the size is 0. But at the end of the report shows "Monopogen.py Success! See instructions above."

The Monopogen_testout files:

drwxrwxr-x+ 2 mingchao mingchao 4096 Apr  7 16:48 ./
drwxrwxr-x+ 5 mingchao mingchao 4096 Apr  7 13:17 ../
-rw-rw-r--+ 1 mingchao mingchao    0 Apr  7 16:48 chr20.gl.vcf.gz
-rw-rw-r--+ 1 mingchao mingchao  692 Apr  7 16:48 chr20.gp.log
-rw-rw-r--+ 1 mingchao mingchao  699 Apr  7 16:48 chr20.phased.log

The output report:

[2024-04-07 16:48:08,516] INFO     Monopogen.py Performing germline variant calling...
[2024-04-07 16:48:08,517] INFO     germline.py Parameters in effect:
[2024-04-07 16:48:08,517] INFO     germline.py --subcommand = [germline]
[2024-04-07 16:48:08,517] INFO     germline.py --region = [/data/Mingchao/dowload_data/Monopogen/Monopogen_testfile/region.lst]
[2024-04-07 16:48:08,517] INFO     germline.py --step = [all]
[2024-04-07 16:48:08,517] INFO     germline.py --out = [/data/Mingchao/dowload_data/Monopogen/Monopogen_testout]
[2024-04-07 16:48:08,517] INFO     germline.py --reference = [/data/Mingchao/dowload_data/Monopogen/Monopogen_testfile/chr20_2Mb.hg38.fa]
[2024-04-07 16:48:08,517] INFO     germline.py --imputation_panel = [/data/Mingchao/dowload_data/Monopogen/Monopogen_testfile/]
[2024-04-07 16:48:08,517] INFO     germline.py --max_softClipped = [1]
[2024-04-07 16:48:08,517] INFO     germline.py --app_path = [/home/mingchao/Linux_file/Monopogen/Monopogen/apps]
[2024-04-07 16:48:08,517] INFO     germline.py --nthreads = [1]
[2024-04-07 16:48:08,517] INFO     germline.py --norun = [FALSE]
[2024-04-07 16:48:08,517] INFO     Monopogen.py Checking existence of essenstial resource files...
[2024-04-07 16:48:08,517] INFO     Monopogen.py Checking dependencies...
/data/Mingchao/dowload_data/Monopogen/Monopogen_testout/Script/runGermline_chr20.sh: line 1: /home/mingchao/Linux_file/Monopogen/Monopogen/apps/samtools: Permission denied
/data/Mingchao/dowload_data/Monopogen/Monopogen_testout/Script/runGermline_chr20.sh: line 1: /home/mingchao/Linux_file/Monopogen/Monopogen/apps/bcftools: Permission denied
/data/Mingchao/dowload_data/Monopogen/Monopogen_testout/Script/runGermline_chr20.sh: line 1: /home/mingchao/Linux_file/Monopogen/Monopogen/apps/bcftools: Permission denied
/data/Mingchao/dowload_data/Monopogen/Monopogen_testout/Script/runGermline_chr20.sh: line 1: /home/mingchao/Linux_file/Monopogen/Monopogen/apps/bgzip: Permission denied
beagle.27Jul16.86a.jar (version 4.1)
Copyright (C) 2014-2015 Brian L. Browning
Enter "java -jar beagle.27Jul16.86a.jar" for a summary of command line arguments.
Start time: 04:48 PM EDT on 07 Apr 2024

Command line: java -Xmx18204m -jar beagle.jar
  gl=/data/Mingchao/dowload_data/Monopogen/Monopogen_testout/germline/chr20.gl.vcf.gz
  ref=/data/Mingchao/dowload_data/Monopogen/Monopogen_testfile/CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz
  chrom=chr20
  out=/data/Mingchao/dowload_data/Monopogen/Monopogen_testout/germline/chr20.gp
  impute=false
  modelscale=2
  nthreads=24
  gprobs=true
  niterations=0

No genetic map is specified: using 1 cM = 1 Mb
java.io.EOFException
        at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:268)
        at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:258)
        at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164)
        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79)
        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
        at blbutil.InputIt.fromGzipFile(InputIt.java:214)
        at main.Main.allData(Main.java:302)
        at main.Main.main(Main.java:111)
java.io.EOFException
Error reading /data/Mingchao/dowload_data/Monopogen/Monopogen_testout/germline/chr20.gl.vcf.gz
terminating program.
gzip: /data/Mingchao/dowload_data/Monopogen/Monopogen_testout/germline/chr20.gp.vcf.gz: No such file or directory
beagle.27Jul16.86a.jar (version 4.1)
Copyright (C) 2014-2015 Brian L. Browning
Enter "java -jar beagle.27Jul16.86a.jar" for a summary of command line arguments.
Start time: 04:48 PM EDT on 07 Apr 2024

Command line: java -Xmx18204m -jar beagle.jar
  gt=/data/Mingchao/dowload_data/Monopogen/Monopogen_testout/germline/chr20.germline.vcf
  ref=/data/Mingchao/dowload_data/Monopogen/Monopogen_testfile/CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz
  chrom=chr20
  out=/data/Mingchao/dowload_data/Monopogen/Monopogen_testout/germline/chr20.phased
  impute=false
  modelscale=2
  nthreads=24
  gprobs=true
  niterations=0

No genetic map is specified: using 1 cM = 1 Mb
Exception in thread "main" java.lang.IllegalArgumentException: Missing line (#CHROM ...) after meta-information lines
File source: /data/Mingchao/dowload_data/Monopogen/Monopogen_testout/germline/chr20.germline.vcf
/data/Mingchao/dowload_data/Monopogen/Monopogen_testout/germline/chr20.gp.vcf.gz: No such file or directory
        at vcf.VcfHeader.checkHeaderLine(VcfHeader.java:135)
        at vcf.VcfHeader.<init>(VcfHeader.java:119)
        at vcf.VcfIt.<init>(VcfIt.java:190)
        at vcf.VcfIt.create(VcfIt.java:175)
        at vcf.VcfIt.create(VcfIt.java:150)
        at main.Main.allData(Main.java:297)
        at main.Main.main(Main.java:111)
bash /data/Mingchao/dowload_data/Monopogen/Monopogen_testout/Script/runGermline_chr20.sh
[2024-04-07 16:48:08,799] INFO     Monopogen.py Success! See instructions above.
['bash /data/Mingchao/dowload_data/Monopogen/Monopogen_testout/Script/runGermline_chr20.sh']

Any guidance to resolve this would be helpful.

Thank you so much!

typo in some of the code

FileNotFoundError: [Errno 2] No such file or directory: '/home/rzhang/software/Monopogen/apps/../resource/GRCh38.region.50MB.lst'
because of missspelling: resourse and resource

Error on somatic calling

Hello, thank you for a great tool!

I have successfully(maybe?) finished the preprocessing and germline calling step thanks to your kind tutorials.
However, I need some help with somatic SNV calling step.

With multiple snRNA-seq samples, I got one filter.bam.lst for each chromosomes in the preprocessing step.
Then I ran the germline calling step with GRCh38.region.lst, provided file in Monopogen/resource, getting one or more vcf/phased.vcf.gz/gl.vcf.gz files for each chromosomes. (e.g. chr1:1-50000001.germline.vcf, chr1:50000001-100000001.germline.vcf ...)

This is where I face a problem. When I use the region.lst that matches the germline outputs(e.g. chr1:1-50000001) for somatic SNV calling step(featureInfo), it says:
AssertionError: The bam list file {path}/preprocess/Bam/chr1:1-50000001.filter.bam.lst cannot be found!

So when I change my region.lst to only contain chromosome numbers(e.g. chr1 \ chr2 \ ...), it says:
AssertionError: The *.gl.vcf.gz file {path}/preprocess/germline/chr1.gl.vcf.gz cannot be found! Please run germline module

So my question is,
Is the germline SNV calling step supposed to return one gl.vcf.gz files even if I use the provided region.lst file?
If not, am I supposed to run germline SNV calling step with region list only with chromosome numbers(e.g. chr1 \ chr2 ...)?

Thank you in advance for your consideration.

Error in ld refinement on putative somatic SNVs

When I ran to step 2 with the demo data, I had the following problem. I have checked that the result of step 1 is correct. The code and error report of step 2 are as follows. Do you know the solution?
Code:

path="/share/home/zhangd/tools/scRNA/Monopogen/"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${path}/apps
python  ${path}/src/Monopogen.py  somatic  \
    -a   ${path}/apps  -r  region.lst  -t 22 \
    -i out  -l CB_7K.maester_scRNA.csv  -s featureInfo    \
    -g  ../example/chr20_2Mb.hg38.fa
# step 2: cellScan
python  ${path}/src/Monopogen.py  somatic  \
    -a   ${path}/apps  -r  region.lst  -t 22 \
    -i   out  -l CB_7K.maester_scRNA.csv  -s cellScan    \
    -g  ../example/chr20_2Mb.hg38.fa

Output:

[2024-05-04 00:02:24,777] INFO     Monopogen.py Collect single cell level information from sequencing data...
0:0:0:0:0:0
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/share/app/conda/conda3/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/share/app/conda/conda3/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/share/home/zhangd/tools/scRNA/Monopogen/src/somatic.py", line 291, in bam2mat
    mat = pd.read_csv(mat_infile, sep="\t", header=None)
  File "/share/home/zhangd/.local/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/share/home/zhangd/.local/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/share/home/zhangd/.local/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 482, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/share/home/zhangd/.local/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 811, in __init__
    self._engine = self._make_engine(self.engine)
  File "/share/home/zhangd/.local/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1040, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "/share/home/zhangd/.local/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 69, in __init__
    self._reader = parsers.TextReader(self.handles.handle, **kwds)
  File "pandas/_libs/parsers.pyx", line 549, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/share/home/zhangd/tools/scRNA/Monopogen//src/Monopogen.py", line 338, in <module>
    main()
  File "/share/home/zhangd/tools/scRNA/Monopogen//src/Monopogen.py", line 331, in main
    args.func(args)
  File "/share/home/zhangd/tools/scRNA/Monopogen//src/Monopogen.py", line 170, in somatic
    result = pool.map(bam2mat, joblst)
  File "/share/app/conda/conda3/lib/python3.8/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/share/app/conda/conda3/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
pandas.errors.EmptyDataError: No columns to parse from file

About ld refinement on putative somatic SNVs, spend a lot of time

Hello,
I use single cell sequencing data to run somatic SNV calling from scRNA-seq. It takes a lot of time when I run the second step (cellScan), about 30+ hours. Is there any way to improve the running speed?

The bam file used is about 8G, and the cpu has 32 cores.

Could you please provide some guidance on how to resolve this issue?
Thank you in advance for your assistance.

can't decode byte 0x8b in position 1: invalid start byte

I wanted to try Monopogen on my GEX-cellranger output, first started with preprocess.py, however, it directly gave me the following error;

Traceback (most recent call last):
File "/path/Monopogen/src/Monopogen.py", line 435, in
main()
File "/path/Monopogen/src/Monopogen.py", line 428, in main
args.func(args)
File "/path/Monopogen/src/Monopogen.py", line 291, in preProcess
for line in f_in:
File "/miniconda3/lib/python3.9/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

My BAM file is not gzipped, I even tested with different BAM files but the result was the same, I looked into the source code and saw that it tries to open a BAM file with "with open" and seems like it causes the error, can you please help regarding this one?

missing a space between flag and file name

Hi there,
Thanks for the great tool!
I followed the manual for calling the somatic mutations

path="/SNV/Monopogen"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${path}/apps
#python  ${path}/src/Monopogen.py  preProcess -b bam.list3 -o bm  -a ${path}/apps -t 2
python ${path}/src/Monopogen.py  germline  -a ${path}/apps  -r region.lst \
 -p ./  -t 22 \
 -g  genome.fa  -m 3 -s all  -o bm

with bam.list3 like this
$cat bam.list3
bm,/SNV/Monopogen/test/chr20.maester_scRNA.bam

I successfully have the bm_chr20.filter.bam from the first step, then I got errors

Exception in thread "main" java.lang.IllegalArgumentException: Missing line (#CHROM ...) after meta-information lines
File source: bm/germline/chr20.germline.vcf

when I double-check the script file that it automatically generated,

$cat bm/Script/runGermline_chr20.sh
/SNV/Monopogen/apps/samtools mpileup -bout/Bam/chr20.filter.bam.lst -f chr20_2Mb.hg38.fa -r chr20 -q 20 -Q 20 -t DP -d 10000000 -v | .....

I think the correct code should be -b out/Bam/chr20.filter.bam.lst .... but somehow there is no space between the -b and the file name, maybe a space in line 77 of Monopogen.py is needed.

Thanks!

Can Monopogen use for calling variants in Malaria parasite single cell Data ?

Dear Monopogen developers ,
Is it possible to call variants on single cell data from Malaria parasite ? If yes How LD works . DO we need to prepare some files ?

your valuable suggestions is greatly appreciated

Bulk RNA seq

Can Monopogen work on bulk RNA seq data?

Installation error

I have a conda-environment with python 3.8.18 and tried to install your tool as per instruction.

pip install -e .
Obtaining file:///Users/slaan3/git/Monopogen
  Preparing metadata (setup.py) ... done
Collecting pandas>=1.2.3 (from Monopogen==1.0.0)
  Using cached pandas-2.0.3-cp38-cp38-macosx_10_9_x86_64.whl.metadata (18 kB)
Requirement already satisfied: pysam>=0.16.0.1 in /Users/slaan3/miniconda3/envs/scqtl/lib/python3.8/site-packages (from Monopogen==1.0.0) (0.22.0)
Requirement already satisfied: NumPy>=1.19.5 in /Users/slaan3/miniconda3/envs/scqtl/lib/python3.8/site-packages (from Monopogen==1.0.0) (1.24.4)
Requirement already satisfied: sciPy>=1.6.3 in /Users/slaan3/miniconda3/envs/scqtl/lib/python3.8/site-packages (from Monopogen==1.0.0) (1.10.1)
Requirement already satisfied: pillow>=8.2.0 in /Users/slaan3/miniconda3/envs/scqtl/lib/python3.8/site-packages (from Monopogen==1.0.0) (10.1.0)
Requirement already satisfied: python-dateutil>=2.8.2 in /Users/slaan3/miniconda3/envs/scqtl/lib/python3.8/site-packages (from pandas>=1.2.3->Monopogen==1.0.0) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /Users/slaan3/miniconda3/envs/scqtl/lib/python3.8/site-packages (from pandas>=1.2.3->Monopogen==1.0.0) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.1 in /Users/slaan3/miniconda3/envs/scqtl/lib/python3.8/site-packages (from pandas>=1.2.3->Monopogen==1.0.0) (2023.3)
Requirement already satisfied: six>=1.5 in /Users/slaan3/miniconda3/envs/scqtl/lib/python3.8/site-packages (from python-dateutil>=2.8.2->pandas>=1.2.3->Monopogen==1.0.0) (1.16.0)
Using cached pandas-2.0.3-cp38-cp38-macosx_10_9_x86_64.whl (11.7 MB)
Installing collected packages: pandas, Monopogen
  Attempting uninstall: pandas
    Found existing installation: pandas 0.25.3
    Uninstalling pandas-0.25.3:
      Successfully uninstalled pandas-0.25.3
  Attempting uninstall: Monopogen
    Found existing installation: Monopogen 1.0.0
    Uninstalling Monopogen-1.0.0:
      Successfully uninstalled Monopogen-1.0.0
  Running setup.py develop for Monopogen
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
genometools 0.4.1 requires pandas<1,>=0.20.1, but you have pandas 2.0.3 which is incompatible.
Successfully installed Monopogen-1.0.0 pandas-2.0.3

How can I fix this error? Or is it inconsequential?

about cell counts in somatic SNV output(chr*.putativeSNVs.csv)

Hi, first, thank you for interesting tool to use in single cell

I have a question about SNV output(chr*.putativeSNVs.csv).

When I put input bam file, bam file have 8923 unique cell barcodes,
but When I got a chr*.putativeSNVs.csv, they just have 5704 unique cell barcodes

So I'm curiouts about how to call up cells in chr*.putativeSNVs.csv?
Is it have just cells with at least one variant, so 3219(8923- 5704) cells are not in the output file(chr*.putativeSNVs.csv) because the variant is not called?
or bugs, or problem of my technique?

All output file(chr.putativeSNVs.csv) have 5704 unique cell barcodes

I'll be waiting for your answer🥺
Thank you for reading!

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Hello, thank you so much for developing Monopogen.

I get this error when I run the preProcess module. I tried my bam files and also the bam files you provided as example and unfortunately the issue do exist. I also tried it on python 3.8 and python 3.11 however, the issue is not resolved.

for line in f_in:

File "", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I would be very grateful if you could help me with that
Thanks

Beagle multithreading

Hello, thank you for developing great software.
I tried to run germline genotyping using 16 threads, but it did not work multithreading on Beagle phasing step.
I think it forgets to set "args.nthreads" on Beagle parameter "nthreads" in Monopgen.py.
If you have other aims, I'm sorry for saying things that aren't necessary.

original

cmd3 = java + " -Xmx20g -jar " + beagle 
 +  " gl=" +  out + "/germline/" +  jobid + ".gl.vcf.gz"  
 +  " ref=" +  imputation_vcf   
 + "  chrom=" + record[0] 
 + " out="   +  out + "/germline/" + jobid 
 + ".gp " + "impute=false  modelscale=2  nthreads=1  gprobs=true  niterations=0"

cmd5 = java + " -Xmx20g -jar " + beagle 
 +  " gt=" +  out + "/germline/" +  jobid + ".germline.vcf"  
 +  " ref=" +  imputation_vcf    
 +  "  chrom=" + record[0]  
 + " out="   +  out + "/germline/" + jobid+ ".phased " 
 + "impute=false  modelscale=2  nthreads=1  gprobs=true  niterations=0"

palliative solution

cmd3 = java + " -Xmx20g -jar " + beagle 
 +  " gl=" +  out + "/germline/" +  jobid + ".gl.vcf.gz"  
 +  " ref=" +  imputation_vcf   
 + "  chrom=" + record[0] 
 + " out="   +  out + "/germline/" + jobid 
 + ".gp " + "impute=false  modelscale=2  nthreads=" + str(args.nthreads) + "  gprobs=true  niterations=0"

cmd5 = java + " -Xmx20g -jar " + beagle 
 +  " gt=" +  out + "/germline/" +  jobid + ".germline.vcf"  
 +  " ref=" +  imputation_vcf    
 +  "  chrom=" + record[0]  
 + " out="   +  out + "/germline/" + jobid+ ".phased " 
 + "impute=false  modelscale=2  nthreads=" + str(args.nthreads) + "  gprobs=true  niterations=0"

Best,
Tomoki

Run monopogen for more than 1 sample?

Hi,

I have run preprocess and germline calling steps for 9 scRNA-seq samples (10x genomics data).

Now, I hope to call somatic mutation, and I have a question about the cell barcode files.

How should I prepare the cell barcode file? Put all cells of 9 samples in one file?

However, in 10x genomics data, if we do not add index to cell barcode, there will be several duplication barcodes, can monopogen recognize the index part (such as AAACCCAAGCATTTGC-1_1_1_1_1_1 or sample1_AAACCCAAGCATTTGC-1) ?

Or, I should run it one by one from the initial (preprocess) step?

Thank you, so much!

Tutorial Somatic Variant Calling Error for the cellScan step. Different output for different input to -t and -w? Potentially different reference file?

Dear Monopogen developers,

I am currently running the somatic variant calling part of your tutorial.

(monopogen) [hk Monopogen]$ hk/anaconda3/bin/python ${mpath}/src/Monopogen.py somatic -a ${mpath}/apps -t 5 -w 50MB -r ${mpath}/t_conditions/region.lst -i hk/tools/Monopogen/bm -l ${mpath}/t_conditions/CB_7K.maester_scRNA.csv -s cellScan -g hk/tools/Monopogen/t_conditions/chr20.hg38.fa 
...
hk/tools/samtools-1.2/samtools mpileup -b hk/tools/Monopogen/bm/Bam/split_bam/cell.bam.lst -f hk/tools/Monopogen/t_conditions/chr20.hg38.fa -r chr20 -q 20 -Q 20 -t DP4 -d 10000 -v  | hk/tools/bcftools-1.8/bcftools view  | hk/tools/bcftools-1.8/bcftools norm -m-both -f hk/tools/Monopogen/t_conditions/chr20.hg38.fa | grep -v "<X>" | grep -v INDEL |hk/tools/htslib-1.9/bgzip -c > hk/tools/Monopogen/bm/somatic/chr20.cell.gl.vcf.gz
[mpileup] fail to load index for hk/tools/Monopogen/bm/Bam/split_bam/AGAGAGCTCGCCACTT.bam
Failed to open -: unknown file type
Failed to open -: unknown file type
hk/tools/bcftools-1.8/bcftools concat -o hk/tools/Monopogen/bm/somatic/chr20.cell.gl.vcf.gz  -O z
...
ValueError: invalid file `b'hk/tools/Monopogen/bm/somatic/chr20.cell.gl.vcf.gz'` (mode=`b'r'`) - is it VCF/BCF format?

The only things that are different are:
(1) the input to -t and -w, because I could not manipulate ulimit in my hpc system.
(2) reference file https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/

Should changing the input to -t and -w change the output as well? If so, what is the difference anticipated?
Also, is this the format/source of reference file that is expected to be used?

Best,

Hannah Kim

Imprinted genes identification

Thanks for developing a nice tool and wanted to leverage its capabilities in my research. A few questions:

Using a reciprocal matings, I have generated two stem cell lines by mating two distinct mouse strains with characterized SNPs. I have performed scRNA-seq on these stem cell lines from both reciprocal matings/crosses. I am specifically interested in identifying imprinted genes within this dataset using your program. Can you please advise on the best approach to integrate my SNP and scRNA-seq data into your tool for this purpose?

Due to their sequencing bias, DropSeq methods like 10x can miss SNPs outside the captured 5' or 3' transcript ends. Do you have any suggestion?

Thank you!

Germline job never finish

Hi, I am running PreProcess and then Germline in a process in nextflow. But for some reason, the job never finish even though all the output files are written to the output directory. I get beagle finished but no "Monopogen.py Success! See instructions above." printed for germline (only preProcess)

The process:

"""
echo "$sample_name,$input_bam" > input_bam_${sample_name}.lst

python $mono_bin preProcess \
-b input_bam_${sample_name}.lst \
-o output_monopogen_${sample_name}/ \
-a $app_dir \
-t 8

python $mono_bin  germline  \
-a  $app_dir \
-t 8 \
-g  $reference \
-p  $phased_panel/ \
-r  $region \
-s all  \
-o output_monopogen_${sample_name}
"""

The input files:
GRCh38.primary_assembly.genome.fa (indexed in same directory)
Directory with imputation panel files from ftp link you have provided.
Region list Monopogen/resource/GRCh38.region.lst

Htop indicates that beagle is still running (more than 12h after last 'beagle finished' in log).

End time: 04:13 PM PDT on 26 Mar 2024
beagle.27Jul16.86a.jar (version 4.1) finished

Can you maybe help me to why the job wont finish?

Best regards,
Mette

No sush file: chr20.gp.vcf.gz

Script:

path=/XXXXXX/Monopogen
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${path}/apps

python  ../src/Monopogen.py    SCvarCall \
      -b  ../example/chr20_2Mb.rh.filter.sort.bam  \
      -j all \
      -y single  \
      -a  ../apps  \
      -c chr20  \
      -o out \
      -i  ../example/cell_cluster.csv   \
      -p  ../example/CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz  \
      -r  ../example/chr20_2Mb.hg38.fa  -d 200  -t 0.1  -m 3 -s 5

Error:

FileNotFoundError: [Errno 2] could not open variant file `b'/****/out/SCvarCall/chr20.gp.vcf.gz'`: No such file or directory

when following the tutorial in Readme, it occurs an error as above. the output files:

$ ll out
total 12
drwxr-xr-x 2 jeffery A 4096 Dec 26 14:57 Bam
drwxr-xr-x 2 jeffery A 4096 Dec 26 14:57 Scriptchr20
drwxr-xr-x 2 jeffery A 4096 Dec 26 14:57 SCvarCall
$ ll out/Bam/
total 22926
-rw-r--r-- 1 jeffery A 23469221 Dec 26 14:57 chr20.filter.bam
-rw-r--r-- 1 jeffery A     6936 Dec 26 14:57 chr20.filter.bam.bai
$ ll out/Scriptchr20/
total 0
-rw-r--r-- 1 jeffery A 0 Dec 26 14:57 runBeagle.sh
$ ll out/SCvarCall/
total 0

How can i fix this problem?

Monopogen’s test object

Is Monopogen’s test limited to individual human cells? Or is it also applicable to single bacterial or archaeal cells?

Error about data preprocess

Hello!Thank you for developing monopogen!
I run the code:python ${path}/src/Monopogen.py preProcess -b test.lst -o 01.test -a ${path}/apps,but got error like this:

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/public/home/user/program/miniconda3/envs/monopogen/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/public/home/user/program/miniconda3/envs/monopogen/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/public/home/user/program/Monopogen/src/germline.py", line 206, in BamFilter
if val < max_mismatch:
UnboundLocalError: local variable 'val' referenced before assignment
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/public/home/user/program/Monopogen/src/Monopogen.py", line 435, in
main()
File "/public/home/user/program/Monopogen/src/Monopogen.py", line 428, in main
args.func(args)
File "/public/home/user/program/Monopogen/src/Monopogen.py", line 313, in preProcess
result = pool.map(BamFilter, para_lst)
File "/public/home/user/program/miniconda3/envs/monopogen/lib/python3.7/multiprocessing/pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/public/home/user/program/miniconda3/envs/monopogen/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
UnboundLocalError: local variable 'val' referenced before assignment

I would appreciate if you can help me.
Thank you very much!

fixed

Hi,
sorry I fixed the issue
Thanks in advance.
Br, Mette

About Somatic SNV calling from scRNA-seq

Hello,
Thank you for developing such a wonderful method.
Recently, I attempted to execute somatic variant calling from scRNA-seq data, following the step-by-step instructions provided on your GitHub repository. However, I encountered an issue during the refinement of putative somatic SNVs. The first step in this section, which involves the “-s featureInfo” command, executed successfully. However, I encountered an error while running “cellScan”.

Here is the code I used and the accompanying error message.

My running code

path="~/Monopogen"
output_path=${path}/test/outs
reference_fasta_path=${path}/example/chr20_2Mb.hg38.fa

$ python ${path}/src/Monopogen.py somatic -a ${path}/apps
-r ${path}/test/region.lst -t ${num_thread} -w 10MB -i ${output_path}
-l ${path}/example/CB_7K.maester_scRNA.csv -s cellScan -g ${reference_fasta_path}

Error Message

[mpileup] fail to load index for ~/Monopogen/test/outs/Bam/split_bam/AGAGAGCTCGCCACTT-1.bam
[mpileup] fail to load index for ~/Monopogen/test/outs/Bam/split_bam/AGAGAGCTCGCCACTT-1.bam
[mpileup] fail to load index for ~/Monopogen/test/outs/Bam/split_bam/AGAGAGCTCGCCACTT-1.bam
[mpileup] fail to load index for ~/Monopogen/test/outs/Bam/split_bam/AGAGAGCTCGCCACTT-1.bam
[mpileup] fail to load index for ~/Monopogen/test/outs/Bam/split_bam/AGAGAGCTCGCCACTT-1.bam
[mpileup] fail to load index for ~/Monopogen/test/outs/Bam/split_bam/AGAGAGCTCGCCACTT-1.bam
[mpileup] fail to load index for ~/Monopogen/test/outs/Bam/split_bam/AGAGAGCTCGCCACTT-1.bam
Failed to open -: unknown file type
Failed to open -: unknown file type
Failed to open -: unknown file type
Failed to open -: unknown file type
Failed to open -: unknown file type
Failed to open -: unknown file type
Failed to open -: unknown file type
Failed to open -: unknown file type
Failed to open -: unknown file type
Failed to open -: unknown file type
Failed to open -: unknown file type
Failed to open -: unknown file type
Failed to open -: unknown file type
Failed to open -: unknown file type

[E::bcf_hdr_read] Input is not detected as bcf or vcf format
Failed to parse header: /Monopogen/test/outs/somatic/chr20:2-10000001.cell.gl.vcf.gz
[E::hts_open_format] Failed to open file "/Monopogen/test/outs/somatic/chr20.cell.gl.vcf.gz" : No such file or directory
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/.conda/envs/Monopogen_py37/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/.conda/envs/Monopogen_py37/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "~/Monopogen/src/somatic.py", line 248, in vcf2mat
vcf_in = pysam.VariantFile(out + "/somatic/" + region + ".cell.gl.vcf.gz")
File "pysam/libcbcf.pyx", line 4054, in pysam.libcbcf.VariantFile.init
File "pysam/libcbcf.pyx", line 4279, in pysam.libcbcf.VariantFile.open
FileNotFoundError: [Errno 2] could not open variant file b'~/Monopogen/test/outs/somatic/chr20.cell.gl.vcf.gz': No such file or directory
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Monopogen/src/Monopogen.py", line 441, in
main()
File "/Monopogen/src/Monopogen.py", line 434, in main
args.func(args)
File "/Monopogen/src/Monopogen.py", line 256, in somatic
result = pool.map(vcf2mat, joblst)
File "/.conda/envs/Monopogen_py37/lib/python3.7/multiprocessing/pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "~/.conda/envs/Monopogen_py37/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
FileNotFoundError: [Errno 2] could not open variant file b'~/Monopogen/test/outs/somatic/chr20.cell.gl.vcf.gz': No such file or directory

I'm currently using the latest version of Monopogen, which I installed directly from your GitHub repository. My Python version is 3.7.12, and I have already checked that all the prerequisites, including pysam, pandas, and numpy, are installed as per your guidelines. However, I am still encountering this error.
Could you please provide some guidance on how to resolve this issue?
Thank you in advance for your assistance.

nothing in the germline.vcf file

Case: using our bam file, the result of char20.germine.vcf contains nothing
Description: To test your code, we used the same region.lst under the ./example folder, and used our own bam file in the bam.lst file. but the result is not expected. Please let us know if there is anything that we can improve to work it out.

Here are the additional information you might want to check out.

I appreciate it a lot if you would like to take a look

Monopogen for mouse single cell sequence

Can Monopogen be applied to Mus musculus or any other species?
If it is able to do it, can you show me how to do it? Like where I can prepare the WGS.vcf file.
If it is unable to do it, would you like to add this expansion on your to-do-list?

And which column Monopogen use to identify the single cell barcode?(My data do not have CR or CB column. )

Thank you.

GermCalling Target Markers

For germcalling, the example output had 10,000 markers but when I ran it I only got 635, and I ran it with the example data. Is there a reason for the difference?

Plus when I ran ld refinement on putative somatic SNVs, my graph also had a lot less data point.

error in chr20.gl.vcf, broken VCF record, incorrect # of columns

Hello, I've tried running Monopogen with the below parameters and run into an error where there is a
Incorrect number of FORMAT fields at chr20:2998993 .. DP, 0 vs 1 of the chr20.gl.vcf file that gets generated.

bcftools view chr20.gl.vcf.gz -r chr20:2998993 shows me that that position does not have the correct # of columns (see below)

##FILTER=<ID=PASS,Description="All filters passed">
##samtoolsVersion=1.8+htslib-1.8
##samtoolsCommand=samtools mpileup -f /tmp/Monopogen/example/chr20_2Mb.hg38.fa -r chr20 -q 20 -Q 20 -t DP -d 10000000 -v /dbfs/FileStore/2023-h1-ancestry/data/monopogen_output/80_test/Bam/chr20.filter.bam
##reference=file:///tmp/Monopogen/example/chr20_2Mb.hg38.fa
##contig=<ID=chr1,length=248956422>
##contig=<ID=chr10,length=133797422>
##contig=<ID=chr11,length=135086622>
##contig=<ID=chr12,length=133275309>
##contig=<ID=chr13,length=114364328>
##contig=<ID=chr14,length=107043718>
##contig=<ID=chr15,length=101991189>
##contig=<ID=chr16,length=90338345>
##contig=<ID=chr17,length=83257441>
##contig=<ID=chr18,length=80373285>
##contig=<ID=chr19,length=58617616>
##contig=<ID=chr2,length=242193529>
##contig=<ID=chr20,length=64444167>
##contig=<ID=chr21,length=46709983>
##contig=<ID=chr22,length=50818468>
##contig=<ID=chr3,length=198295559>
##contig=<ID=chr4,length=190214555>
##contig=<ID=chr5,length=181538259>
##contig=<ID=chr6,length=170805979>
##contig=<ID=chr7,length=159345973>
##contig=<ID=chr8,length=145138636>
##contig=<ID=chr9,length=138394717>
##contig=<ID=chrM,length=16569>
##contig=<ID=chrX,length=156040895>
##contig=<ID=chrY,length=57227415>
##contig=<ID=KI270728.1,length=1872759>
##contig=<ID=KI270727.1,length=448248>
##contig=<ID=KI270442.1,length=392061>
##contig=<ID=KI270729.1,length=280839>
##contig=<ID=GL000225.1,length=211173>
##contig=<ID=KI270743.1,length=210658>
##contig=<ID=GL000008.2,length=209709>
##contig=<ID=GL000009.2,length=201709>
##contig=<ID=KI270747.1,length=198735>
##contig=<ID=KI270722.1,length=194050>
##contig=<ID=GL000194.1,length=191469>
##contig=<ID=KI270742.1,length=186739>
##contig=<ID=GL000205.2,length=185591>
##contig=<ID=GL000195.1,length=182896>
##contig=<ID=KI270736.1,length=181920>
##contig=<ID=KI270733.1,length=179772>
##contig=<ID=GL000224.1,length=179693>
##contig=<ID=GL000219.1,length=179198>
##contig=<ID=KI270719.1,length=176845>
##contig=<ID=GL000216.2,length=176608>
##contig=<ID=KI270712.1,length=176043>
##contig=<ID=KI270706.1,length=175055>
##contig=<ID=KI270725.1,length=172810>
##contig=<ID=KI270744.1,length=168472>
##contig=<ID=KI270734.1,length=165050>
##contig=<ID=GL000213.1,length=164239>
##contig=<ID=GL000220.1,length=161802>
##contig=<ID=KI270715.1,length=161471>
##contig=<ID=GL000218.1,length=161147>
##contig=<ID=KI270749.1,length=158759>
##contig=<ID=KI270741.1,length=157432>
##contig=<ID=GL000221.1,length=155397>
##contig=<ID=KI270716.1,length=153799>
##contig=<ID=KI270731.1,length=150754>
##contig=<ID=KI270751.1,length=150742>
##contig=<ID=KI270750.1,length=148850>
##contig=<ID=KI270519.1,length=138126>
##contig=<ID=GL000214.1,length=137718>
##contig=<ID=KI270708.1,length=127682>
##contig=<ID=KI270730.1,length=112551>
##contig=<ID=KI270438.1,length=112505>
##contig=<ID=KI270737.1,length=103838>
##contig=<ID=KI270721.1,length=100316>
##contig=<ID=KI270738.1,length=99375>
##contig=<ID=KI270748.1,length=93321>
##contig=<ID=KI270435.1,length=92983>
##contig=<ID=GL000208.1,length=92689>
##contig=<ID=KI270538.1,length=91309>
##contig=<ID=KI270756.1,length=79590>
##contig=<ID=KI270739.1,length=73985>
##contig=<ID=KI270757.1,length=71251>
##contig=<ID=KI270709.1,length=66860>
##contig=<ID=KI270746.1,length=66486>
##contig=<ID=KI270753.1,length=62944>
##contig=<ID=KI270589.1,length=44474>
##contig=<ID=KI270726.1,length=43739>
##contig=<ID=KI270735.1,length=42811>
##contig=<ID=KI270711.1,length=42210>
##contig=<ID=KI270745.1,length=41891>
##contig=<ID=KI270714.1,length=41717>
##contig=<ID=KI270732.1,length=41543>
##contig=<ID=KI270713.1,length=40745>
##contig=<ID=KI270754.1,length=40191>
##contig=<ID=KI270710.1,length=40176>
##contig=<ID=KI270717.1,length=40062>
##contig=<ID=KI270724.1,length=39555>
##contig=<ID=KI270720.1,length=39050>
##contig=<ID=KI270723.1,length=38115>
##contig=<ID=KI270718.1,length=38054>
##contig=<ID=KI270317.1,length=37690>
##contig=<ID=KI270740.1,length=37240>
##contig=<ID=KI270755.1,length=36723>
##contig=<ID=KI270707.1,length=32032>
##contig=<ID=KI270579.1,length=31033>
##contig=<ID=KI270752.1,length=27745>
##contig=<ID=KI270512.1,length=22689>
##contig=<ID=KI270322.1,length=21476>
##contig=<ID=GL000226.1,length=15008>
##contig=<ID=KI270311.1,length=12399>
##contig=<ID=KI270366.1,length=8320>
##contig=<ID=KI270511.1,length=8127>
##contig=<ID=KI270448.1,length=7992>
##contig=<ID=KI270521.1,length=7642>
##contig=<ID=KI270581.1,length=7046>
##contig=<ID=KI270582.1,length=6504>
##contig=<ID=KI270515.1,length=6361>
##contig=<ID=KI270588.1,length=6158>
##contig=<ID=KI270591.1,length=5796>
##contig=<ID=KI270522.1,length=5674>
##contig=<ID=KI270507.1,length=5353>
##contig=<ID=KI270590.1,length=4685>
##contig=<ID=KI270584.1,length=4513>
##contig=<ID=KI270320.1,length=4416>
##contig=<ID=KI270382.1,length=4215>
##contig=<ID=KI270468.1,length=4055>
##contig=<ID=KI270467.1,length=3920>
##contig=<ID=KI270362.1,length=3530>
##contig=<ID=KI270517.1,length=3253>
##contig=<ID=KI270593.1,length=3041>
##contig=<ID=KI270528.1,length=2983>
##contig=<ID=KI270587.1,length=2969>
##contig=<ID=KI270364.1,length=2855>
##contig=<ID=KI270371.1,length=2805>
##contig=<ID=KI270333.1,length=2699>
##contig=<ID=KI270374.1,length=2656>
##contig=<ID=KI270411.1,length=2646>
##contig=<ID=KI270414.1,length=2489>
##contig=<ID=KI270510.1,length=2415>
##contig=<ID=KI270390.1,length=2387>
##contig=<ID=KI270375.1,length=2378>
##contig=<ID=KI270420.1,length=2321>
##contig=<ID=KI270509.1,length=2318>
##contig=<ID=KI270315.1,length=2276>
##contig=<ID=KI270302.1,length=2274>
##contig=<ID=KI270518.1,length=2186>
##contig=<ID=KI270530.1,length=2168>
##contig=<ID=KI270304.1,length=2165>
##contig=<ID=KI270418.1,length=2145>
##contig=<ID=KI270424.1,length=2140>
##contig=<ID=KI270417.1,length=2043>
##contig=<ID=KI270508.1,length=1951>
##contig=<ID=KI270303.1,length=1942>
##contig=<ID=KI270381.1,length=1930>
##contig=<ID=KI270529.1,length=1899>
##contig=<ID=KI270425.1,length=1884>
##contig=<ID=KI270396.1,length=1880>
##contig=<ID=KI270363.1,length=1803>
##contig=<ID=KI270386.1,length=1788>
##contig=<ID=KI270465.1,length=1774>
##contig=<ID=KI270383.1,length=1750>
##contig=<ID=KI270384.1,length=1658>
##contig=<ID=KI270330.1,length=1652>
##contig=<ID=KI270372.1,length=1650>
##contig=<ID=KI270548.1,length=1599>
##contig=<ID=KI270580.1,length=1553>
##contig=<ID=KI270387.1,length=1537>
##contig=<ID=KI270391.1,length=1484>
##contig=<ID=KI270305.1,length=1472>
##contig=<ID=KI270373.1,length=1451>
##contig=<ID=KI270422.1,length=1445>
##contig=<ID=KI270316.1,length=1444>
##contig=<ID=KI270340.1,length=1428>
##contig=<ID=KI270338.1,length=1428>
##contig=<ID=KI270583.1,length=1400>
##contig=<ID=KI270334.1,length=1368>
##contig=<ID=KI270429.1,length=1361>
##contig=<ID=KI270393.1,length=1308>
##contig=<ID=KI270516.1,length=1300>
##contig=<ID=KI270389.1,length=1298>
##contig=<ID=KI270466.1,length=1233>
##contig=<ID=KI270388.1,length=1216>
##contig=<ID=KI270544.1,length=1202>
##contig=<ID=KI270310.1,length=1201>
##contig=<ID=KI270412.1,length=1179>
##contig=<ID=KI270395.1,length=1143>
##contig=<ID=KI270376.1,length=1136>
##contig=<ID=KI270337.1,length=1121>
##contig=<ID=KI270335.1,length=1048>
##contig=<ID=KI270378.1,length=1048>
##contig=<ID=KI270379.1,length=1045>
##contig=<ID=KI270329.1,length=1040>
##contig=<ID=KI270419.1,length=1029>
##contig=<ID=KI270336.1,length=1026>
##contig=<ID=KI270312.1,length=998>
##contig=<ID=KI270539.1,length=993>
##contig=<ID=KI270385.1,length=990>
##contig=<ID=KI270423.1,length=981>
##contig=<ID=KI270392.1,length=971>
##contig=<ID=KI270394.1,length=970>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=RPB,Number=1,Type=Float,Description="Mann-Whitney U test of Read Position Bias (bigger is better)">
##INFO=<ID=MQB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality Bias (bigger is better)">
##INFO=<ID=BQB,Number=1,Type=Float,Description="Mann-Whitney U test of Base Quality Bias (bigger is better)">
##INFO=<ID=MQSB,Number=1,Type=Float,Description="Mann-Whitney U test of Mapping Quality vs Strand Bias (bigger is better)">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##INFO=<ID=I16,Number=16,Type=Float,Description="Auxiliary tag used for calling, see description of bcf_callret1_t in bam2bcf.h">
##INFO=<ID=QS,Number=R,Type=Float,Description="Auxiliary tag used for calling">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##bcftools_viewVersion=1.8+htslib-1.8
##bcftools_viewCommand=view; Date=Wed Mar 15 02:47:19 2023
##bcftools_normVersion=1.8+htslib-1.8
##bcftools_normCommand=norm -m-both -f /tmp/Monopogen/example/chr20_2Mb.hg38.fa; Date=Wed Mar 15 02:47:19 2023
##bcftools_viewCommand=view -r chr20:2998993 /dbfs/FileStore/2023-h1-ancestry/data/monopogen_output/80_test/SCvarCall/chr20.gl.vcf.gz; Date=Wed Mar 15 03:18:53 2023
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	10x3_Lobe_19_D006_NeuN_4
[E::bcf_write] Broken VCF record, the number of columns at chr20:2998993 does not match the number of samples (0 vs 1)```



**Here's the output log**

[2023-03-15 20:01:59,456] INFO Monopogen.py Preparing varint calling pipeline...
[2023-03-15 20:01:59,457] INFO Monopogen.py Parameters in effect:
[2023-03-15 20:01:59,457] INFO Monopogen.py --subcommand = [SCvarCall]
[2023-03-15 20:01:59,457] INFO Monopogen.py --bamFile = [/dbfs/FileStore/2023-h1-ancestry/data/lattice/80.bam]
[2023-03-15 20:01:59,457] INFO Monopogen.py --step = [germline]
[2023-03-15 20:01:59,457] INFO Monopogen.py --mode = [single]
[2023-03-15 20:01:59,457] INFO Monopogen.py --chr = [chr20]
[2023-03-15 20:01:59,458] INFO Monopogen.py --out = [/dbfs/FileStore/2023-h1-ancestry/data/monopogen_output/80_test]
[2023-03-15 20:01:59,458] INFO Monopogen.py --reference = [/tmp/Monopogen/example/chr20_2Mb.hg38.fa]
[2023-03-15 20:01:59,458] INFO Monopogen.py --imputation_panel = [/tmp/Monopogen/example/CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz]
[2023-03-15 20:01:59,458] INFO Monopogen.py --depth_filter = [200]
[2023-03-15 20:01:59,458] INFO Monopogen.py --depth_filter_monovar = [50]
[2023-03-15 20:01:59,458] INFO Monopogen.py --alt_ratio = [0.1]
[2023-03-15 20:01:59,458] INFO Monopogen.py --wildCluster = [2]
[2023-03-15 20:01:59,458] INFO Monopogen.py --mutationCluster = [1]
[2023-03-15 20:01:59,458] INFO Monopogen.py --max_mismatch = [3]
[2023-03-15 20:01:59,458] INFO Monopogen.py --max_softClipped = [5]
[2023-03-15 20:01:59,459] INFO Monopogen.py --app_path = [/tmp/Monopogen/apps]
[2023-03-15 20:01:59,459] INFO Monopogen.py --cell_cluster = [/tmp/Monopogen/example/cell_cluster.csv]
[2023-03-15 20:01:59,459] INFO Monopogen.py --logfile = [/dbfs/FileStore/2023-h1-ancestry/data/monopogen_output/80_test_chr20.log]
[2023-03-15 20:01:59,459] INFO Monopogen.py --samtools = [/tmp/Monopogen/apps/samtools]
[2023-03-15 20:01:59,459] INFO Monopogen.py --bcftools = [/tmp/Monopogen/apps/bcftools]
[2023-03-15 20:01:59,459] INFO Monopogen.py --bgzip = [/tmp/Monopogen/apps/bgzip]
[2023-03-15 20:01:59,459] INFO Monopogen.py --java = [java]
[2023-03-15 20:01:59,459] INFO Monopogen.py --beagle = [/tmp/Monopogen/apps/beagle.27Jul16.86a.jar]
[2023-03-15 20:01:59,459] INFO Monopogen.py Checking existence of essenstial resource files...
[2023-03-15 20:01:59,460] INFO Monopogen.py Checking dependencies...
[2023-03-15 20:01:59,566] INFO Monopogen.py Filtering bam files...
[2023-03-15 20:03:06,093] INFO Monopogen.py Performing Variant Calling...
[W::hts_idx_load2] The index file is older than the data file: /dbfs/FileStore/2023-h1-ancestry/data/monopogen_output/80_test/Bam/chr20.filter.bam.bai
[mpileup] 1 samples in 1 input files
(mpileup) Max depth is above 1M. Potential memory hog!
Reference allele mismatch at chr20:3000001 .. REF_SEQ:'A' vs VCF:'N'
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
Incorrect number of FORMAT fields at chr20:2998993 .. DP, 0 vs 1
Exception in thread "main" java.lang.IllegalArgumentException: VCF header line has 10 fields, but data line has 8 fields
File source:File source: /dbfs/FileStore/2023-h1-ancestry/data/monopogen_output/80_test/SCvarCall/chr20.gl.vcf.gz
[chr20, 2998993, ., T, <*>, 0, ., DP=9;I16=6,3,0,0,333]
at vcf.VcfRecord.fieldCountError(VcfRecord.java:221)
at vcf.VcfRecord.delimiters(VcfRecord.java:203)
at vcf.VcfRecord.(VcfRecord.java:87)
at vcf.VcfRecord.fromGTGL(VcfRecord.java:193)
at vcf.VcfIt.lambda$static$5(VcfIt.java:76)
at vcf.VcfIt.lambda$new$8(VcfIt.java:192)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:747)
at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:721)
at java.util.stream.AbstractTask.compute(AbstractTask.java:327)
at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.pollAndExecCC(ForkJoinPool.java:1190)
at java.util.concurrent.ForkJoinPool.helpComplete(ForkJoinPool.java:1879)
at java.util.concurrent.ForkJoinPool.externalHelpComplete(ForkJoinPool.java:2467)
at java.util.concurrent.ForkJoinTask.externalAwaitDone(ForkJoinTask.java:324)
at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:405)
at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
at java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:714)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
at vcf.VcfIt.fillEmissionBuffer(VcfIt.java:307)
at vcf.VcfIt.next(VcfIt.java:363)
at vcf.VcfIt.next(VcfIt.java:52)
at vcf.IntervalVcfIt.readNextRecord(IntervalVcfIt.java:110)
at vcf.IntervalVcfIt.next(IntervalVcfIt.java:92)
at vcf.IntervalVcfIt.next(IntervalVcfIt.java:36)
at main.Main.restrictToVcfMarkers(Main.java:343)
at main.Main.allData(Main.java:313)
at main.Main.main(Main.java:111)
gzip: /dbfs/FileStore/2023-h1-ancestry/data/monopogen_output/80_test/SCvarCall/chr20.gp.vcf.gz: No such file or directory
/dbfs/FileStore/2023-h1-ancestry/data/monopogen_output/80_test/SCvarCall/chr20.gp.vcf.gz: No such file or directory
[2023-03-15 20:26:35,885] INFO Monopogen.py Generating variant statistical information ...
[E::hts_open_format] Failed to open file "/dbfs/FileStore/2023-h1-ancestry/data/monopogen_output/80_test/SCvarCall/chr20.gp.vcf.gz" : No such file or directory
Traceback (most recent call last):
File "/tmp/Monopogen/src/Monopogen.py", line 519, in
main()
File "/tmp/Monopogen/src/Monopogen.py", line 514, in main
args.func(args)
File "/tmp/Monopogen/src/Monopogen.py", line 398, in SCvarCall
getDPinfo(args)
File "/tmp/Monopogen/src/Monopogen.py", line 152, in getDPinfo
gp_vcf_in = VariantFile(out + "/SCvarCall/" + args.chr + ".gp.vcf.gz")
File "pysam/libcbcf.pyx", line 4119, in pysam.libcbcf.VariantFile.init
File "pysam/libcbcf.pyx", line 4344, in pysam.libcbcf.VariantFile.open
FileNotFoundError: [Errno 2] could not open variant file b'/dbfs/FileStore/2023-h1-ancestry/data/monopogen_output/80_test/SCvarCall/chr20.gp.vcf.gz': No such file or directory

Explanation .gl, .gp, .phased files

Hello! While running Monopogen, I noticed that it outputs quite a number of different files. I have read in your tutorial that the final output should be in the .phased.vcf.gz file, however that file only provides the genotype. I wanted to also obtain information about the read depth and allele frequency for those variants. I find that the .gl.vcf.gz file contains information about the depth, while the .gp.vcf.gz contains the genotype and allele frequency. I have also noticed that the .gl.vcf file contains unfiltered variants, while the .gp.vcf seems to contain only filtered variants that are the same as in .phased.vcf.

Could you help me understand what all these files are and how I could go about extracting as much information as possible for all variants (even unfiltered) from them?

Thank you!

About run Somatic SNV calling from scRNA-seq

Hello,
Thank you for developing such a wonderful method.
I use single-cell sequencing data to run Somatic SNV calling from scRNA-seq, but I don't know how to prepare the cell barcode file. Because the barcode.tsv file produced by cellranger has only one column, and the csv file given in the case has 2 columns, I don’t know what the id in the second column means? How to prepare this file? Can a randomly generated non-repeating serial number be used?

Could you please provide some guidance on how to resolve this issue?
Thank you in advance for your assistance.

region "-o" specifies an unknown reference name

HI! I'm having a problem with the cellScann module, in particular I can't understand the reason for this error:
INFO Monopogen.py Get single cell level information from sequencing data...
INFO:main:Get single cell level information from sequencing data...
[main_samview] region "-o" specifies an unknown reference name. Continue anyway.
[main_samview] region "/Users/Arianna/Monopogen/out_D545/Bam/chr1.filter.targeted.bam" specifies an unknown reference name. Continue anyway.

the samtools version should be the correct one (samtools 1.2, htslib1.2.1)

chrX

Hello:
I tried to use monopogen for germline SNP analysis of chrX and encountered two problems .
One is that the "monopogen.py" script only specifies chr1-chr22. I managed to get "chrX.gl.vcf.gz" by modifying "monopogen.py" .But another problem is that I can only on
"http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_phased" found the "CCDG_14151_B01_GRM_WGS_2020-08-05_chrX.filtered.eagle2-phased.v2.vcf.gz file" .If you use this file for subsequent analysis, the error message "CCDG_14151_B01_GRM_WGS_2020-08-05_chrX.filtered.shapeit2-duohmm-phased.vcf.gz" is displayed. I found that "CCDG_14151_B01_GRM_WGS_2020-08-05_chrX.filtered.eagle2-phased.v2.vcf.gz" and "CCDG_14151_B01_GRM_WGS_2020-08-05_chr1.filtered.shapeit2-duohmm-phased.vcf.gz" have different content formats .
Is there any way to get the appropriate chrX reference file for the next step？

"somatic mutation module is still in the development stage"

Hi,
I wanted to know about timeline for the final version release of the somatic mutation detection module?
Thanks.

samtools mpileup

Thanks for the tool. It's really nice. One thing I noted is that samtools in your code use old version and mpileup -t and -v is deprecated in the latest version (v 1.18). I am currently using 1.13 binary on my M1 mac, which seems working expectedly but the run with 1.18 is not working.

Are you updating this with bcftools mpileup in future?

[warning] samtools mpileup option `t` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[warning] samtools mpileup option `t` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[warning] samtools mpileup option `v` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[warning] samtools mpileup option `v` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[warning] samtools mpileup option `t` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[warning] samtools mpileup option `v` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[warning] samtools mpileup option `t` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[warning] samtools mpileup option `v` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[warning] samtools mpileup option `t` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[warning] samtools mpileup option `v` is functional, but deprecated. Please switch to using bcftools mpileup in future.

outputs of cellScan

Hello! I am writing to ask you if you could list me what the outputs of each of the three somatic modules should be.
In fact, after the cellscan module I get the following warning:
INFO Monopogen.py Get single cell level information from sequencing data...
INFO:main:Get single cell level information from sequencing data...
[main_samview] region "-o" specifies an unknown reference name. Continue anyway.
[main_samview] region "/Users/Arianna/Monopogen/out_D545/Bam/chr1.filter.targeted.bam" specifies an unknown reference name. Continue anyway.
after that in the out/bam folder I find a merged.bam and merged.bam.bai and a split_bam folder with .bam files related to each cell.
In the out/somatic folder I find the following files: ch1.bed, ch1.gl.vcf.DP4, ch1.gl.vcf.filter.DP4, chr1.gl.vcf.filter.hc.bed, chr1.gl.vcf .filter.hc.pos. When I try to launch ldrefinement I get the following error:
File 'out_D545/somatic/chr1.gl.filter.hc.cell.mat.gz' does not exist or is non-readable. getwd()=='/Users/Arianna/Monopogen' this file was in fact not produced in the previous module.
I would like to ask if you have any ideas as to why the cellScan module may not have given the full output. Thank you and I apologize if in the previous issues I was not clear enough in describing the problem.

number of snps before and after first out of imputation prior to phasing

I have a question regarding the first imputation step using Beagle. My experience with imputation is strictly from a GWAS perspective where imputation from a large DNA array panel will result in the imputation of many, many more SNPs. However, in the case with scRNAseq, I actually get much much fewer snps post imputation than from the input germline file. For example, the number of snps in the germline calls after samtools mpileup commands results in 8,039,600 SNPs for chromosome 10. However after imputing against TGP, I get 275,752 SNPs (not yet phased).

Is this because we are calling all variants from mpileup on RNA so the variant calls may be inaccurate or too low of a minor allele frequency that after TGP imputation or strand discrepencies that exist, they are getting filtered out? Or is this an anomaly and there is something wrong?

The exact command I used for chr 10 is the following (in case it helps I have ~320 samples in my vcf files):

samtools mpileup -b chr10.filter.bam.lst -f genome.fa -r chr10 -q 20 -Q 20 -t DP,DPR,DV,DP4,INFO/DPR,SP -d 10000000 -v  | }bcftools view  | bcftools norm -m-both -f genome.fa| grep -v '<X>' | grep -v INDEL | $bgzip -c -@ 1 > chr10.gl.vcf.gz

java -Xmx20g -jar beagle.27Jul16.86a.jar gl=chr10.gl.vcf.gz ref=CCDG_14151_B01_GRM_WGS_2020-08-05_chr10.filtered.shapeit2-duohmm-phased.vcf.gz chrom=chr10 out=chr10.gp impute=false modelscale=2 nthreads=24 gprobs=true niterations=0

I added a few extra tag calculations to the samtools command but really that is it...

Error on gemline calling

Hello,
Thank you for developing such a wonderful tool.
Recently, I attempted to execute somatic variant calling from scRNA-seq data.
The preprocess and germline calling sections executed successfully. However, I encountered an issue during the first step of somatic calling.

My code:
python {path}/Monopogen/src/Monopogen.py somatic
-a {path}/Monopogen/apps
-r region.lst -t 16
-i {path}/${sample}
-l ./cell_barcode_annotations.tsv
-s featureInfo
-g {path}/GRCh38.fa

Error Message:
[2023-10-26 10:05:38,384] INFO Monopogen.py Get feature information from sequencing data...
[E::hts_open_format] Failed to open file "{output_path}/germline/chr7.phased.vcf.gz" : No such file or directory
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr1.phased.vcf.gz'
[W::vcf_parse] Contig 'chr1' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::idx_find_and_load] Could not retrieve index file for '/{output_path}/germline/chr2.phased.vcf.gz'
[W::vcf_parse] Contig 'chr2' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr4.phased.vcf.gz'
[W::vcf_parse] Contig 'chr4' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr3.phased.vcf.gz'
[W::vcf_parse] Contig 'chr3' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::idx_find_and_load] Could not retrieve index file for '{path}/germline/chr5.phased.vcf.gz'
[W::vcf_parse] Contig 'chr5' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::idx_find_and_load] Could not retrieve index file for '/{output_path}/germline/chr6.phased.vcf.gz'
[W::vcf_parse] Contig 'chr6' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr8.phased.vcf.gz'
[W::vcf_parse] Contig 'chr8' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr9.phased.vcf.gz'
[W::vcf_parse] Contig 'chr9' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr10.phased.vcf.gz'
[W::vcf_parse] Contig 'chr10' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr11.phased.vcf.gz'
[W::vcf_parse] Contig 'chr11' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr12.phased.vcf.gz'
[W::vcf_parse] Contig 'chr12' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr13.phased.vcf.gz'
[W::vcf_parse] Contig 'chr13' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr14.phased.vcf.gz'
[W::vcf_parse] Contig 'chr14' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr16.phased.vcf.gz'
[W::vcf_parse] Contig 'chr16' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr15.phased.vcf.gz'
[W::vcf_parse] Contig 'chr15' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr2.gl.vcf.gz'
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr4.gl.vcf.gz'
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr1.gl.vcf.gz'
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr5.gl.vcf.gz'
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr3.gl.vcf.gz'
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr8.gl.vcf.gz'
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr6.gl.vcf.gz'
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr12.gl.vcf.gz'
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr9.gl.vcf.gz'
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr10.gl.vcf.gz'
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr11.gl.vcf.gz'
[E::idx_find_and_load] Could not retrieve index file for '{output_path}germline/chr13.gl.vcf.gz'
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr14.gl.vcf.gz'
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr17.phased.vcf.gz'
[W::vcf_parse] Contig 'chr17' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr16.gl.vcf.gz'
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr15.gl.vcf.gz'
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr17.gl.vcf.gz'
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr18.phased.vcf.gz'
[W::vcf_parse] Contig 'chr18' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr18.gl.vcf.gz'
[E::hts_open_format] Failed to open file "{output_path}/germline/chr19.phased.vcf.gz" : No such file or directory
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr20.phased.vcf.gz'
[W::vcf_parse] Contig 'chr20' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr20.gl.vcf.gz'
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr21.phased.vcf.gz'
[W::vcf_parse] Contig 'chr21' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr21.gl.vcf.gz'
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr22.phased.vcf.gz'
[W::vcf_parse] Contig 'chr22' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::idx_find_and_load] Could not retrieve index file for '{output_path}/germline/chr22.gl.vcf.gz'
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/miniconda2/envs/Monopogen/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/miniconda2/envs/Monopogen/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/Monopogen/src/somatic.py", line 88, in featureInfo
vcf_in = VariantFile(out + "/germline/" + region + ".phased.vcf.gz")
File "pysam/libcbcf.pyx", line 4117, in pysam.libcbcf.VariantFile.init
File "pysam/libcbcf.pyx", line 4342, in pysam.libcbcf.VariantFile.open
FileNotFoundError: [Errno 2] could not open variant file b'{output_path}/germline/chr7.phased.vcf.gz': No such file or directory
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Monopogen/src/Monopogen.py", line 435, in
main()
File "/Monopogen/src/Monopogen.py", line 428, in main
args.func(args)
File "/Monopogen/src/Monopogen.py", line 150, in somatic
result = pool.map(featureInfo, joblst)
File "/miniconda2/envs/Monopogen/lib/python3.8/multiprocessing/pool.py", line 364, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/miniconda2/envs/Monopogen/lib/python3.8/multiprocessing/pool.py", line 771, in get
raise self._value
FileNotFoundError: [Errno 2] could not open variant file `b'{output_path}/germline/chr7.phased.vcf.gz'`: No such file or directory

I would be very grateful if you could provide some guidance on how to resolve this issue?

kchen-lab / monopogen Goto Github PK

monopogen's People

Contributors

Stargazers

Watchers

Forkers

monopogen's Issues

My running code

Error Message

Hello, Thank you for developing such a wonderful tool. Recently, I attempted to execute somatic variant calling from scRNA-seq data. The preprocess and germline calling sections executed successfully. However, I encountered an issue during the first step of somatic calling.

My code: python {path}/Monopogen/src/Monopogen.py somatic -a {path}/Monopogen/apps -r region.lst -t 16 -i {path}/${sample} -l ./cell_barcode_annotations.tsv -s featureInfo -g {path}/GRCh38.fa

Recommend Projects

Recommend Topics

Recommend Org

Hello,
Thank you for developing such a wonderful tool.
Recently, I attempted to execute somatic variant calling from scRNA-seq data.
The preprocess and germline calling sections executed successfully. However, I encountered an issue during the first step of somatic calling.

My code:
python {path}/Monopogen/src/Monopogen.py somatic
-a {path}/Monopogen/apps
-r region.lst -t 16
-i {path}/${sample}
-l ./cell_barcode_annotations.tsv
-s featureInfo
-g {path}/GRCh38.fa