hringbauer / ancIBD
Detecting IBD within low-coverage ancient DNA data. Development repository for the software package that contains the code for the manuscript.
License: GNU General Public License v3.0
Can this software be used to identify the kinship of non-human primates?
Hello!
In the bioRxiv preprint you write that you use data from the AADR, but there the data are in EIGENSTRAT format. Could you explain how to convert them into VCF with a GL field? And how is the difference between capture and whole-genome data handled? Or do you have BAM files for the 1240k panel?
Hi, I am loading VCF files from GLIMPSE into ancIBD, but this problem occurs. When I try the tutorials using the test data you provided, the same problem appears.
It seems to be a Python decoding problem; however, writing
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
into sitecustomize.py under the Python lib/site-packages/ path does not solve the error. Do you have any suggestions? Thanks a lot!
Dear Harald and Yilei,
I have been recently running some tests with ancIBD (v0.7) and from what I can see the mask (--mask) option does not appear to have any impact on the results.
I’m using the mask file provided in the vignette (./map/mask.track). From what I can see from the code, adding a header to the mask file is required (ch start_bp end_bp start_cm end_cm) and from the log file it appears to be working, but the output is identical whether the command is included or not.
I’ve put the commands and a subset of the results below for a test I ran on chromosome 18 for two individuals. The full log files and results are attached.
There is an example of a segment (6416465 BP to 8096608 BP) that I believe should be masked out but has not been.
Hopefully I haven’t missed something obvious on my end!
Best
Hugh
ancIBD command with mask
ancIBD-run \
--vcf 18.RISE109_EAS004.vcf.gz \
--ch 18 \
--min 1 \
--marker_path ./ancibd_data/filters/snps_bcftools_ch18.csv \
--map_path ./ancibd_data/map/v51.1_1240k.snp \
--af_path ./ancibd_data/afs/v51.1_1240k_AF_ch18.tsv \
--prefix RISE109_EAS004.mask \
--mask ./ancibd_data/map/mask.track_whead \
--out ./mask | tee -a "log.mask.txt"
ancIBD command without mask
ancIBD-run \
--vcf 18.RISE109_EAS004.vcf.gz \
--ch 18 \
--min 1 \
--marker_path ./ancibd_data/filters/snps_bcftools_ch18.csv \
--map_path ./ancibd_data/map/v51.1_1240k.snp \
--af_path ./ancibd_data/afs/v51.1_1240k_AF_ch18.tsv \
--prefix RISE109_EAS004.no_mask \
--out ./no_mask | tee -a "log.no_mask.txt"
Mask file
ch start_bp end_bp start_cm end_cm
1 117262791 152752796 147.421 163.079
2 28328 10223894 0.014 26.182
4 71566 5690711 0.341 11.535
8 164984 14405249 0.0004 31.511
10 43410489 54652146 62.682 77.491
14 101952406 107283009 111.711 120.2
15 20071673 34014753 0.005 43.484
17 32762040 39154710 54.307 63.702
18 69836 8490599 0.1604 25.465
19 266034 10030630 0.002 29.543
19 47112648 59087479 74.533 107.7316
21 14601415 24499572 0.861 21.041
22 17054720 25615059 1.723 23.842
Section of log file
Applying mask to IBD segments...
Start End ... iid1 iid2
0 1395 1487 ... RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
1 2767 3629 ... RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
2 4231 4775 ... RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
3 5224 5411 ... RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
4 6278 6657 ... RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
5 7539 7933 ... RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
6 9410 10448 ... RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
7 12803 13437 ... RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
8 14306 15154 ... RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
9 15622 15879 ... RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
10 19670 19898 ... RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
11 21006 21444 ... RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
12 22409 22741 ... RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
13 24215 24718 ... RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
14 24890 25111 ... RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
15 25824 26136 ... RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
16 26646 27343 ... RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
17 27723 28089 ... RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
Results from run with mask:
Start End StartM EndM length lengthM ch StartBP EndBP iid1 iid2
3110 4082 0.187756 0.240680992603302 972 0.0529249906539917 18 6416465 8096608 RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
4739 5354 0.27425798773765564 0.31047698855400085 615 0.036219000816345215 18 9017786 10296623 RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
5872 6090 0.33414000272750854 0.34862199425697327 218 0.014481991529464722 18 11078646 11573400 RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
7048 7497 0.3891560137271881 0.4127730131149292 449 0.02361699938774109 18 14000961 20106913 RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
8477 8929 0.43661099672317505 0.45054900646209717 452 0.01393800973892212 18 22545495 23529365 RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
10582 11853 0.4954560101032257 0.5268459916114807 1271 0.031389981508255005 18 28087698 31420010 RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
14468 15122 0.5745900273323059 0.5875930190086365 654 0.013002991676330566 18 38026603 39712385 RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
16092 17042 0.6033920049667358 0.6263909935951233 950 0.02299898862838745 18 42434615 44150558 RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
17575 17853 0.6380820274353027 0.6519550085067749 278 0.013872981071472168 18 45358666 45874561 RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
21995 22245 0.743274986743927 0.7536090016365051 250 0.010334014892578125 18 55127493 55535280 RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
25036 25454 0.8473230004310608 0.8599590063095093 418 0.012636005878448486 18 60957339 61802304 RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
27042 28042 0.9102759957313538 0.9489629864692688 1000 0.03868699073791504 18 65887287 67938899 RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
28832 29192 0.977187991142273 0.9897369742393494 360 0.012548983097076416 18 69700748 70489780 RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
29745 30530 1.0176960229873657 1.0555000305175781 785 0.0378040075302124 18 71478627 72915939 RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
30970 31374 1.0719029903411865 1.0926170349121094 404 0.02071404457092285 18 73615868 74318960 RISE109.allentoft_2015_nature EAS004.gretzinger_2022_nature
RISE109_EAS004.mask.ch18.txt
RISE109_EAS004.no_mask.ch18.txt
log.mask.txt
log.no_mask.txt
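For reference, the masking behaviour described here can be sketched as a plain interval-overlap filter. This is a minimal illustration using the column names from the mask file and output tables above; the function name and the 50% overlap threshold are my own assumptions, not ancIBD's actual implementation. Under this logic the chr18 segment at 6416465-8096608 bp falls entirely inside the masked interval 69836-8490599 bp and would be dropped:

```python
import pandas as pd

def filter_masked_segments(ibd_df, mask_df, max_overlap=0.5):
    """Drop IBD segments whose physical span overlaps masked regions
    by more than `max_overlap` of the segment length.
    Columns follow the tables above: ibd_df has ch/StartBP/EndBP,
    mask_df has ch/start_bp/end_bp."""
    keep = []
    for _, seg in ibd_df.iterrows():
        # mask intervals on the same chromosome
        m = mask_df[mask_df["ch"] == seg["ch"]]
        # bp overlap of the segment with each mask interval, summed
        ov = (m["end_bp"].clip(upper=seg["EndBP"])
              - m["start_bp"].clip(lower=seg["StartBP"])).clip(lower=0).sum()
        frac = ov / (seg["EndBP"] - seg["StartBP"])
        keep.append(frac <= max_overlap)
    return ibd_df[keep]
```

Whatever the real cutoff is, a segment fully inside a masked interval (overlap fraction 1.0) should never survive the filter, which is why the segment above looks like a genuine discrepancy.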
Hi,
First of all, thank you for your detailed and easy-to-follow instructions in the notebook.
Regarding the vcf to HDF5 conversion function:
To process the data obtained from Dropbox, it is necessary to unzip the vcf.gz files prior to executing the code mentioned in the notebook.
Second, I get this error after unzipping using gunzip *.gz:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
This is the full error print:
Print downsampling to 1240K...
Finished BCF tools filtering.
Converting to HDF5...
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<timed exec> in <module>
~/anaconda3/lib/python3.9/site-packages/ancIBD/IO/prepare_h5.py in vcf_to_1240K_hdf(in_vcf_path, path_vcf, path_h5, marker_path, map_path, af_path, col_sample_af, chunk_length, chunk_width, buffer_size, ch)
114
115 print("Converting to HDF5...")
--> 116 allel.vcf_to_hdf5(input=path_vcf, output=path_h5,
117 fields = ['variants/*', 'calldata/*', "samples"],
118 types = {"samples":"S60", "calldata/GT":np.int8,
~/anaconda3/lib/python3.9/site-packages/allel/io/vcf_read.py in vcf_to_hdf5(input, output, group, compression, compression_opts, shuffle, overwrite, vlen, fields, exclude_fields, rename_fields, types, numbers, alt_number, fills, region, tabix, samples, transformers, buffer_size, chunk_length, chunk_width, log)
691
692 # setup chunk iterator
--> 693 fields, samples, headers, it = iter_vcf_chunks(
694 input, fields=fields, exclude_fields=exclude_fields, types=types,
695 numbers=numbers, alt_number=alt_number, buffer_size=buffer_size,
~/anaconda3/lib/python3.9/site-packages/allel/io/vcf_read.py in iter_vcf_chunks(input, fields, exclude_fields, types, numbers, alt_number, fills, region, tabix, samples, transformers, buffer_size, chunk_length)
1136
1137 # setup iterator
-> 1138 fields, samples, headers, it = _iter_vcf_stream(stream, **kwds)
1139
1140 # setup transformers
~/anaconda3/lib/python3.9/site-packages/allel/io/vcf_read.py in _iter_vcf_stream(stream, fields, exclude_fields, types, numbers, alt_number, chunk_length, fills, region, samples)
1634
1635 # read VCF headers
-> 1636 headers = _read_vcf_headers(stream)
1637
1638 # setup samples
~/anaconda3/lib/python3.9/site-packages/allel/io/vcf_read.py in _read_vcf_headers(stream)
1709 # read first header line
1710 header = stream.readline()
-> 1711 header = str(header, 'utf8')
1712
1713 while header and header[0] == '#':
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
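The byte `0x8b` in position 1 is telling: a gzip stream starts with the magic bytes `0x1f 0x8b`, so a UnicodeDecodeError on byte `0x8b` at position 1 almost always means a still-compressed file is being read as text. A two-line check (the helper name is mine) before handing a path to the converter:

```python
def looks_gzipped(path):
    """Return True if the file starts with the gzip magic bytes 0x1f 0x8b."""
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"
```

If this returns True for a file the converter treats as plain text (for example a compressed file whose contents no longer match its extension), that mismatch would produce exactly the traceback above.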
Hi, recently I was running ancIBD-0.5 using the example data (downloaded from https://www.dropbox.com/sh/q18yyrffbdj1yv1/AAC1apifYB_oKB8SNrmQQ-26a?dl=0), with Python 3.8.3 and the same parameters as in the tutorial. All commands run smoothly without errors, but the results I get are different from those shown in the tutorial, and I still have not figured out what caused this difference. I would really appreciate it if you could help!
I checked the output when preparing the hdf5 files, and it is the same as in the tutorial. But the results start to differ when calling the IBD, and it seems that I get much less IBD.
Chromosome 1; Loaded 18 IBD
Chromosome 2; Loaded 13 IBD
Chromosome 3; Loaded 11 IBD
Chromosome 4; Loaded 9 IBD
Chromosome 5; Loaded 9 IBD
Chromosome 6; Loaded 11 IBD
Chromosome 7; Loaded 12 IBD
Chromosome 8; Loaded 8 IBD
Chromosome 9; Loaded 10 IBD
Chromosome 10; Loaded 10 IBD
Chromosome 11; Loaded 10 IBD
Chromosome 12; Loaded 6 IBD
Chromosome 13; Loaded 12 IBD
Chromosome 14; Loaded 6 IBD
Chromosome 15; Loaded 6 IBD
Chromosome 16; Loaded 8 IBD
Chromosome 17; Loaded 9 IBD
Chromosome 18; Loaded 6 IBD
Chromosome 19; Loaded 11 IBD
Chromosome 20; Loaded 10 IBD
Chromosome 21; Loaded 9 IBD
Chromosome 22; Loaded 10 IBD
Saved 214 IBD to /home/han_shi/script/ancibd/output/ibd_hazelton2/ch_all.tsv.
When summarizing the IBD between pairs, some of the results yield IBD but differ in sum and number, while some pairs (I12438 & I12896, I12440 & I12896, ...), which yield some IBD in the tutorial, did not yield any IBD here.
iid1 | iid2 | max_IBD | sum_IBD>8 | n_IBD>8 | sum_IBD>12 | n_IBD>12 | sum_IBD>16 | n_IBD>16 | sum_IBD>20 | n_IBD>20 |
---|---|---|---|---|---|---|---|---|---|---|
I12440 | I30300 | 268.7866928288713 | 3316.770885130245 | 22.0 | 3316.770885130245 | 22.0 | 3316.770885130245 | 22.0 | 3316.770885130245 | 22.0 |
I12438 | I30300 | 283.65220259875065 | 3307.30799218436 | 21.0 | 3307.30799218436 | 21.0 | 3307.30799218436 | 21.0 | 3307.30799218436 | 21.0 |
I12440 | I12438 | 176.73970285686664 | 1657.2198443172965 | 22.0 | 1647.3969392536674 | 21.0 | 1647.3969392536674 | 21.0 | 1647.3969392536674 | 21.0 |
I12439 | I12440 | 153.30770015716553 | 1626.9729871419258 | 21.0 | 1604.7398943570442 | 19.0 | 1604.7398943570442 | 19.0 | 1604.7398943570442 | 19.0 |
I12439 | I30300 | 98.25259447097778 | 919.9655823060311 | 13.0 | 919.9655823060311 | 13.0 | 919.9655823060311 | 13.0 | 919.9655823060311 | 13.0 |
I12439 | I12438 | 93.37389469146729 | 475.9611673653126 | 10.0 | 459.80747416615486 | 8.0 | 459.80747416615486 | 8.0 | 459.80747416615486 | 8.0 |
I21390 | I30300 | 13.226699829101562 | 13.226699829101562 | 1.0 | 13.226699829101562 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
I12896 | I30300 | 12.573707103729248 | 12.573707103729248 | 1.0 | 12.573707103729248 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
I12439 | I21390 | 11.11750602722168 | 11.11750602722168 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
I12439 | I12896 | 8.414804935455322 | 8.414804935455322 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
I12438 | I12896 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
I12438 | I21390 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
I12440 | I12896 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
I12440 | I21390 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
I12896 | I21390 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Could you please give me some advice about what might have caused this problem? Many thanks!!
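As a point of comparison, the per-pair summary columns in the table above can be reproduced from a per-segment table (the ch_all.tsv layout, with a lengthM column in Morgans as in the earlier chromosome-18 results) by a short pandas aggregation. The function below is my own sketch of that bookkeeping, not ancIBD's actual summary step:

```python
import pandas as pd

def summarize_pairs(seg_df, cutoffs=(8, 12, 16, 20)):
    """Aggregate per-segment IBD lengths into per-pair summary
    statistics, mirroring the table layout above.
    seg_df needs columns iid1, iid2, lengthM (length in Morgans)."""
    rows = []
    for (i1, i2), g in seg_df.groupby(["iid1", "iid2"]):
        cm = g["lengthM"] * 100  # Morgans -> centimorgans
        row = {"iid1": i1, "iid2": i2, "max_IBD": cm.max()}
        for c in cutoffs:
            sel = cm > c
            row[f"sum_IBD>{c}"] = cm[sel].sum()
            row[f"n_IBD>{c}"] = int(sel.sum())
        rows.append(row)
    return pd.DataFrame(rows)
```

Running such an aggregation on both your segment table and the tutorial's makes it easy to see whether the discrepancy arises at the segment-calling stage or only in the summary.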
Hello,
I am attempting to convert the example VCF file from Dropbox to HDF5, but I am facing an error during the interpolation step. Do you have any insights on what might be causing this issue?
Thank you!
Print downsampling to 1240K...
Running bash command:
bcftools view -Ov -o test/example_hazelton.ch20.1240k.vcf -T ./data/filters/snps_bcftools_ch20.csv -M2 -v snps ./data/vcf.raw/example_hazelton_chr20.vcf
Finished BCF tools filtering to target markers.
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 28940 variants.
Loaded 6 individuals.
Loaded 0 Chr.20 1240K SNPs.
Intersection 0 out of 28940 HDF5 SNPs
Interpolating 28940 variants.
Traceback (most recent call last):
File "/data/users/dorozco/.conda/envs/gnomix/bin/ancIBD-run", line 8, in <module>
sys.exit(main())
File "/data/users/dorozco/.local/lib/python3.8/site-packages/ancIBD/run_ancIBD.py", line 63, in main
vcf_to_1240K_hdf(in_vcf_path = args.vcf,
File "/data/users/dorozco/.local/lib/python3.8/site-packages/ancIBD/IO/prepare_h5.py", line 135, in vcf_to_1240K_hdf
merge_in_ld_map(path_h5=path_h5,
File "/data/users/dorozco/.local/lib/python3.8/site-packages/ancIBD/IO/h5_modify.py", line 154, in merge_in_ld_map
rec_ch = np.interp(x1, x, y)
File "<__array_function__ internals>", line 180, in interp
File "/data/users/dorozco/.conda/envs/gnomix/lib/python3.8/site-packages/numpy/lib/function_base.py", line 1594, in interp
return interp_func(x, xp, fp, left, right)
ValueError: array of sample points is empty
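The key lines in the log are "Loaded 0 Chr.20 1240K SNPs" and "Intersection 0 out of 28940 HDF5 SNPs": no chromosome-20 positions were read from the map file, so np.interp is handed an empty array of known sample points, which reproduces exactly this ValueError:

```python
import numpy as np

# np.interp(x, xp, fp) needs at least one known (xp, fp) point;
# with an empty genetic map it raises the error in the traceback above.
try:
    np.interp([10519276], np.array([]), np.array([]))
except ValueError as err:
    print(err)  # array of sample points is empty
```

A first thing to check is therefore whether the file passed as map_path actually contains chromosome-20 rows in the expected .snp format.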
Greetings Harald, Yilei and ancIBD devs,
I'm Sanjay C Nagi, a postdoc working on the Anopheles 1000 genomes project and its successors. I am using ancIBD on whole-genome sequence data from mosquitoes - I had been working on my own heuristic method using opposite homozygotes to detect IBD; however, I liked the look of the HMM in ancIBD. We work with our genomic data in the cloud, accessible via a Python package, malariagen_data. The package allows one to load and perform complex analyses with single lines of code, and by using a Python-based IBD method, we can compute IBD segments 'on the fly'.
I wanted to use ancIBD, but needed it to take input directly as numpy arrays, rather than converting a VCF to HDF5 and so on. This was pretty simple in the end; it just required a new loading model, which takes a Python dictionary as input rather than a path to an HDF5 file. The good thing is that this doesn't modify any of the rest of the code. Would you have any interest in me adding this functionality to ancIBD itself? That way, malariagen_data could depend directly on ancIBD, rather than on a fork of ancIBD. Here is the PR adding the feature, and here is an example of a notebook using ancIBD with dict input. If you approve, I can make a pull request from my fork of ancIBD.
Also, if you have some test data for ancIBD, I'd be happy to add some testing/CI to the git repo at the same time, so that when you make changes to the package in the future, you can ensure that ancIBD is still working as intended.
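To make the proposal concrete, a dictionary input of this kind would presumably carry the same arrays that ancIBD otherwise reads from HDF5, keyed by the HDF5-style paths. The exact keys, shapes, and dtypes below are my illustration of the idea, not the merged API:

```python
import numpy as np

# Hypothetical shape of a dict-based input (keys mirror the HDF5 layout):
n_sites, n_samples = 1000, 2
data = {
    "calldata/GT": np.zeros((n_sites, n_samples, 2), dtype=np.int8),  # phased genotypes
    "calldata/GP": np.full((n_sites, n_samples, 3), 1 / 3,
                           dtype=np.float32),                         # genotype posteriors
    "variants/POS": np.arange(1, n_sites + 1),                        # bp positions
    "variants/MAP": np.linspace(0.0, 0.5, n_sites),                   # genetic map (Morgans)
    "samples": np.array(["sample_A", "sample_B"]),
}
```

Because only the loading layer changes, upstream packages could build such a dict from any array store (zarr, cloud object storage) without touching the HMM itself.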
Hi again.
This issue can be merged with #5.
I've used GLIMPSE to impute two samples, which resulted in the following VCF file:
(base) batel:~/glimpse/Levant_BA$ bcftools view merged.vcf.gz | head -20
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=01/08/2023 - 11:43:29
##source=GLIMPSE_phase v2.0.0
##contig=<ID=chr22>
##INFO=<ID=RAF,Number=A,Type=Float,Description="ALT allele frequency in the reference panel">
##INFO=<ID=AF,Number=A,Type=Float,Description="ALT allele frequency computed from DS/GP field across target samples">
##INFO=<ID=INFO,Number=A,Type=Float,Description="IMPUTE info quality score for diploid samples">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Phased genotypes">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Genotype dosage">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Genotype posteriors">
##NMAIN=15
##FPLOIDY=2
##bcftools_mergeVersion=1.10.2+htslib-1.10.2-3ubuntu0.1
##bcftools_mergeCommand=merge -l GLIMPSE_ligate/samples_merge.txt -o merged.vcf.gz; Date=Tue Aug 1 13:03:00 2023
##bcftools_viewVersion=1.10.2+htslib-1.10.2-3ubuntu0.1
##bcftools_viewCommand=view merged.vcf.gz; Date=Tue Aug 1 13:04:29 2023
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S10769.SN S10770.SN
chr22 10519276 22:10519276:G:C G C . . RAF=0.0377889;AF=0.0572749;INFO=1 GT:DS:GP 0|0:0.106:0.896,0.101,0.003 0|0:0.115:0.888,0.108,0.004
chr22 10519389 22:10519389:T:C T C . . RAF=0.0376327;AF=0.0567751;INFO=1 GT:DS:GP 0|0:0.104:0.898,0.098,0.004 0|0:0.114:0.889,0.107,0.004
When trying to convert to HDF5 using
from ancIBD.IO.prepare_h5 import vcf_to_1240K_hdf

ch = 22
#base_path = f"/n/groups/reich/hringbauer/git/hapBLOCK"
vcf_to_1240K_hdf(#in_vcf_path = f"./data/vcf.raw/example_hazelton_chr{ch}.vcf.gz",
                 in_vcf_path = f"./data/vcf.raw/merged_chr22_GT_GP.vcf.gz",
                 path_vcf = f"./data/vcf.1240k/example_hazelton_chr{ch}.vcf.gz",
                 path_h5 = f"./data/hdf5/example_hazelton_chr{ch}.h5",
                 marker_path = f"./data/filters/snps_bcftools_ch{ch}.csv",
                 map_path = f"./data/v51.1_1240k.snp",
                 af_path = f"./data/afs/v51.1_1240k_AF_ch{ch}.tsv",
                 col_sample_af = "",
                 buffer_size=20000, chunk_width=8, chunk_length=20000,
                 ch=ch)
I get the following error:
Print downsampling to 1240K...
Finished BCF tools filtering.
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
/tmp/ipykernel_7983/2495295663.py in <module>
2 ch = 22
3 #base_path = f"/n/groups/reich/hringbauer/git/hapBLOCK"
----> 4 vcf_to_1240K_hdf(#in_vcf_path = f"./data/vcf.raw/example_hazelton_chr{ch}.vcf.gz",
5 in_vcf_path = f"./data/vcf.raw/merged_chr22_GT_GP.vcf.gz",
6 path_vcf = f"./data/vcf.1240k/example_hazelton_chr{ch}.vcf.gz",
~/anaconda3/lib/python3.9/site-packages/ancIBD/IO/prepare_h5.py in vcf_to_1240K_hdf(in_vcf_path, path_vcf, path_h5, marker_path, map_path, af_path, col_sample_af, chunk_length, chunk_width, buffer_size, ch)
125 print("Merging in LD Map..")
126 if len(map_path)>0:
--> 127 merge_in_ld_map(path_h5=path_h5,
128 path_snp1240k=map_path,
129 chs=[ch])
~/anaconda3/lib/python3.9/site-packages/ancIBD/IO/h5_modify.py in merge_in_ld_map(path_h5, path_snp1240k, chs, write_mode)
115 chs: Which Chromosomes to merge in HDF5 [list].
116 write_mode: Which mode to use on hdf5. a: New field. r+: Change Field"""
--> 117 with h5py.File(path_h5, "r") as f:
118 print("Lifting LD Map from eigenstrat to HDF5...")
119 print("Loaded %i variants." % np.shape(f["calldata/GT"])[0])
~/anaconda3/lib/python3.9/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, fs_page_size, page_buf_size, min_meta_keep, min_raw_keep, locking, alignment_threshold, alignment_interval, **kwds)
531 fs_persist=fs_persist, fs_threshold=fs_threshold,
532 fs_page_size=fs_page_size)
--> 533 fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
534
535 if isinstance(libver, tuple):
~/anaconda3/lib/python3.9/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
224 if swmr and swmr_support:
225 flags |= h5f.ACC_SWMR_READ
--> 226 fid = h5f.open(name, flags, fapl=fapl)
227 elif mode == 'r+':
228 fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5f.pyx in h5py.h5f.open()
FileNotFoundError: [Errno 2] Unable to open file (unable to open file: name = './data/hdf5/example_hazelton_chr22.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
The directory does exist, as I get no error when running:
os.listdir("./data/hdf5/")
and I've also tried writing the absolute path.
Moreover, when running the example you provided -- just switching the in_vcf back -- it runs perfectly fine.
So it must be something in the VCF file itself.
Any ideas?
Batel
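Since the traceback shows merge_in_ld_map failing to open the HDF5 file immediately after "Finished conversion to hdf5!" is printed, a quick first check is whether the intermediate files were actually written. A small helper (my own, not part of ancIBD), run on the path_vcf and path_h5 arguments from the call above:

```python
import os

def check_outputs(paths):
    """Return a dict mapping each path to its file size in bytes,
    or None if the file does not exist."""
    return {p: (os.path.getsize(p) if os.path.exists(p) else None)
            for p in paths}
```

For example, check_outputs(["./data/vcf.1240k/example_hazelton_chr22.vcf.gz", "./data/hdf5/example_hazelton_chr22.h5"]) distinguishes a conversion that silently produced nothing from a path or working-directory mix-up.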
Hi,
Is it on purpose that not all SNPs from the 1240K marker set are included in the filter files used to restrict imputed SNPs to the 1240K markers? At the HDF5 import the imputed GTs are filtered by these coordinates, so you always "lose" the non-included markers even though they are present in the imputed GTs.
I am aware that some SNPs on the 1240K chip have lower concordance with true shotgun WGS data. However, not all data come from the chip, so removing these markers from a purely WGS dataset should not be required. On the other hand, we are talking about imputed GTs anyway, where the other markers in LD were already used to infer the diploid phased haplotypes over a large genomic chunk from gold-standard WGS reference data (given random positions of the non-concordant SNPs, imputation should already fix their errors). Accordingly, if these files contain fewer markers in order to avoid "bad markers", that removal should happen prior to the imputation step for chip data, not at the IBD identification step. That way imputation should improve for chip data while true shotgun WGS is unaffected. Furthermore, that approach would not thin the markers at the IBD detection step for either WGS or chip data, while you could still co-analyze mixed datasets.
But again, I reserve the right to be wrong, and it may well be that I am unaware of some other valid reason to remove these markers. Could you please elaborate on why ~50k (~4.4%) of the autosomal 1240K markers are excluded by the filtering?
Regards,
Zoltan
Hi,
It would be nice if you could add the mask-file option as a parameter either to hapBLOCK_chroms (to not emit IBD from masked areas, since all relevant genome-coordinate information is available there), or to filter_ibd_df plus its callers create_ind_ibd_df and ind_all_ibd_df (to filter IBD instead of, or in addition to, the SNP-density parameter), since this could be handled naturally in the base package.
(The individual IBD data in the output of hapBLOCK_chroms does not (yet) contain the genomic coordinates, and the mapping data is not on the same scale (M vs. cM) as the mask data, so simple "shell magic" would be complex here.)
While for a few samples, at the level of individual pairwise IBD sharing, this is not an issue, when you work with several hundred individuals the number of combinations (N*(N-1)/2) gets large, and at these genomic locations almost everyone will share IBD with all other samples. This results in a needlessly large portion of these false-positive IBD segments compared to the randomly distributed true IBD in the outputs.
Thanks!
I'm trying to install ancIBD on my institute's cluster, but it's failing when trying to build the wheel.
I've tried running pip install wheel, and also upgraded it to the latest version.
(ancIBD) batelziv@moriah-gw-02:~/softwares/ancIBD$ python3 -m pip install ancIBD
Collecting ancIBD
Using cached ancIBD-0.5.tar.gz (201 kB)
Preparing metadata (setup.py) ... done
Requirement already satisfied: numpy in ./lib/python3.7/site-packages (from ancIBD) (1.21.6)
Requirement already satisfied: pandas in ./lib/python3.7/site-packages (from ancIBD) (1.3.5)
Collecting scipy (from ancIBD)
Using cached scipy-1.7.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (38.1 MB)
Requirement already satisfied: h5py in ./lib/python3.7/site-packages (from ancIBD) (3.8.0)
Requirement already satisfied: psutil in ./lib/python3.7/site-packages (from ancIBD) (5.9.5)
Requirement already satisfied: cython in ./lib/python3.7/site-packages (from ancIBD) (3.0.0)
Requirement already satisfied: matplotlib in ./lib/python3.7/site-packages (from ancIBD) (3.5.3)
Collecting pysam (from ancIBD)
Using cached pysam-0.21.0-cp37-cp37m-manylinux_2_24_x86_64.whl (18.1 MB)
Collecting scikit-allel (from ancIBD)
Using cached scikit-allel-1.3.6.tar.gz (9.7 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [36 lines of output]
[scikit-allel] setup extensions with cython
Compiling allel/opt/model.pyx because it changed.
Compiling allel/opt/stats.pyx because it changed.
Compiling allel/opt/io_vcf_read.pyx because it changed.
[1/3] Cythonizing allel/opt/io_vcf_read.pyx
[2/3] Cythonizing allel/opt/model.pyx
[3/3] Cythonizing allel/opt/stats.pyx
Traceback (most recent call last):
File "/sci/home/batelziv/softwares/ancIBD/lib/python3.7/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
main()
File "/sci/home/batelziv/softwares/ancIBD/lib/python3.7/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/sci/home/batelziv/softwares/ancIBD/lib/python3.7/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
return hook(config_settings)
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/build_meta.py", line 341, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/build_meta.py", line 323, in _get_build_requires
self.run_setup()
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/build_meta.py", line 338, in run_setup
exec(code, locals())
File "<string>", line 100, in <module>
File "<string>", line 96, in setup_package
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/__init__.py", line 107, in setup
return distutils.core.setup(**attrs)
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 159, in setup
dist.parse_config_files()
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/dist.py", line 898, in parse_config_files
pyprojecttoml.apply_configuration(self, filename, ignore_option_errors)
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/config/pyprojecttoml.py", line 66, in apply_configuration
config = read_configuration(filepath, True, ignore_option_errors, dist)
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/config/pyprojecttoml.py", line 128, in read_configuration
validate(subset, filepath)
File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/config/pyprojecttoml.py", line 55, in validate
raise ValueError(f"{error}\n{summary}") from None
ValueError: invalid pyproject.toml config: `project`.
configuration error: `project` must contain ['name'] properties
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
Any ideas?
For ARM users, please use this workaround to emulate x86:
conda create -n py37
conda activate py37
conda config --env --set subdir osx-64
conda install python=3.7
Then, in /docs, run:
pip install -r requirements.txt
pip install Cython
pip install ancIBD
Hi, I've been trying to run ancIBD on data imputed with GLIMPSE and the 1000G panel. I have the full VCF per chromosome, and I could run the transformation and filtering to the 1240k SNPs smoothly, calculating the AF directly with the command:
for ch in range(1,23):
    vcf_to_1240K_hdf(in_vcf_path = f"./data.chr{ch}.vcf.gz",
                     path_vcf = f"./ancIB_data/data.1240.chr{ch}.vcf.gz",
                     path_h5 = f"./ancIB_data/data.1240.chr{ch}.h5",
                     marker_path = f"./ancIBA_filters/snps_bcftools_ch{ch}.csv",
                     map_path = f"./ancIB_data/v51.1_1240k.snp",
                     #af_path = f"./ancIB_afs/v51.1_1240k_AF_ch{ch}.tsv",
                     col_sample_af = "AF_ALL",
                     buffer_size=20000, chunk_width=8, chunk_length=20000,
                     ch=ch)
But when I run hapBLOCK_chroms, it fails to detect any IBD in any pair among my 90 samples...
for ch in range(1,23):
    df_ibd = hapBLOCK_chroms(folder_in='./ancIB_data/data.1240k.chr',
                             iids=iids, run_iids=[],
                             ch=ch, folder_out='./ancIB_data/ibd/',
                             output=False, prefix_out='', logfile=True,
                             l_model='hdf5', e_model='haploid_gl', h_model='FiveStateScaled', t_model='standard',
                             ibd_in=1, ibd_out=10, ibd_jump=400,
                             min_cm=6, cutoff_post=0.99, max_gap=0.0075,
                             processes=1)
When I run the code with the example data I obtain the exact same result as in the tutorial.
When I use plink or KING on my imputed dataset I detect IBD (and I know for sure that some individuals are 1st-degree related).
Can you see where the problem is?
Here is an example of the filtered VCF obtained with the first command:
21 10205629 rs140777501 A G 100 PASS NS=2504;AA=A|||;VT=SNP;DP=21926;RAF=0.166534;AF=0.166534;INFO=1;EAS_AF=0.2173;AMR_AF=0.1988;AFR_AF=0.1135;EUR_AF=0.2177;SAS_AF=0.1104;AN=54;AC=2 GT:DS:GP 0|0:0:1,0,0 ./.:.:. ./.:.:. 0|0:0:1,0,0 ./.:.:. 0|0:0:1,0,0 0|0:0.001:0.999,0.001,0 ./.:.:. ./.:.:. ./.:.:. ./.:
21 14601415 rs2775537 G A 100 PASS NS=2504;AA=g|||;VT=SNP;DP=17928;RAF=0.550519;AF=0.550519;INFO=1;EAS_AF=0.376;AMR_AF=0.4251;AFR_AF=0.8169;EUR_AF=0.4722;SAS_AF=0.5399;AN=30;AC=3 GT:DS:GP ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. 1|1:1.998:0,0.001,0.999 ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:
21 14652908 rs3869758 T C 100 PASS NS=2504;AA=.|||;VT=SNP;DP=23683;RAF=0.0896565;AF=0.0896565;INFO=1;EAS_AF=0;AMR_AF=0.0648;AFR_AF=0.2814;EUR_AF=0.0239;SAS_AF=0.0082;AN=176;AC=0 GT:DS:GP 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0 0|0:0:1,0,0
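One thing visible in this excerpt is heavy genotype missingness after merging: many calls are `./.:.:.`, and if all 90 samples are in this file, AN=30 at the second site would mean only 15 of them have non-missing calls there. A quick per-site missingness scan over the filtered VCF (plain-text parsing, no extra dependencies; the path in the usage example is illustrative) can quantify this before suspecting the IBD step itself:

```python
import gzip

def site_missingness(vcf_path, max_sites=1000):
    """Fraction of missing genotype calls ('./.') per site for the
    first `max_sites` data lines of a (possibly gzipped) VCF."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    fracs = []
    with opener(vcf_path, "rt") as f:
        for line in f:
            if line.startswith("#"):
                continue  # skip header lines
            samples = line.rstrip("\n").split("\t")[9:]  # per-sample columns
            if not samples:
                continue
            missing = sum(fld.split(":")[0] in ("./.", ".|.", ".")
                          for fld in samples)
            fracs.append(missing / len(samples))
            if len(fracs) >= max_sites:
                break
    return fracs
```

For example, site_missingness("./ancIB_data/data.1240.chr21.vcf.gz") returning values close to 1.0 for most sites would point at the merging/imputation output rather than at ancIBD.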