
ancIBD's Issues

Input for GLIMPSE

Hello!

In the bioRxiv preprint you write that you use data from the AADR, but there the data are in EIGENSTRAT format. Could you explain how to convert them to a VCF with a GL field? And how do you then handle the difference between capture and whole-genome data? Or do you have BAM files for the 1240k panel?

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Hi, I am loading VCF files from GLIMPSE into ancIBD, but I run into this problem. When I try the tutorials with the test_data you provided, I get the same error.

It looks like a Python decoding problem; however, writing

# encoding: utf-8
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

into sitecustomize.py under Python's lib/site-packages/ path did not solve the error. (Note that reload(sys) and sys.setdefaultencoding only exist in Python 2.) Do you have any suggestions? Thanks a lot!
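
For reference: byte 0x8b at position 1 is the second byte of the gzip magic number (0x1f 0x8b), so this error typically means a gzip-compressed file is being read as plain text; the encoding workaround above does not address that cause. A minimal check, with a placeholder path:

# Check whether the input is still gzip-compressed (magic bytes 0x1f 0x8b).
# "example.vcf" is a placeholder for the file passed to ancIBD.
with open("example.vcf", "rb") as f:
    is_gzipped = f.read(2) == b"\x1f\x8b"
print("gzipped" if is_gzipped else "plain text")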

--mask option bug

Dear Harald and Yilei,

I have recently been running some tests with ancIBD (v0.7), and from what I can see the mask (--mask) option does not appear to have any impact on the results.

I’m using the mask file provided in the vignette (./map/mask.track). From what I can see in the code, adding a header to the mask file is required (ch start_bp end_bp start_cm end_cm), and from the log file the masking appears to run, but the output is identical whether or not the option is included.

I’ve put the commands and a subset of the results below for a test I ran on chromosome 18 for two individuals. The full log files and results are attached.

There is an example of a segment (6416465 bp to 8096608 bp) that I believe should be masked out but has not been (see the sketch after the mask file below).

Hopefully I haven’t missed something obvious on my end!

Best
Hugh

ancIBD command with mask

ancIBD-run \
--vcf 18.RISE109_EAS004.vcf.gz \
--ch 18 \
--min 1 \
--marker_path ./ancibd_data/filters/snps_bcftools_ch18.csv \
--map_path ./ancibd_data/map/v51.1_1240k.snp \
--af_path ./ancibd_data/afs/v51.1_1240k_AF_ch18.tsv \
--prefix RISE109_EAS004.mask \
--mask ./ancibd_data/map/mask.track_whead \
--out ./mask | tee -a "log.mask.txt"

ancIBD command without mask

ancIBD-run  \
--vcf 18.RISE109_EAS004.vcf.gz \
--ch 18 \
--min 1 \
--marker_path ./ancibd_data/filters/snps_bcftools_ch18.csv  \
--map_path ./ancibd_data/map/v51.1_1240k.snp \
--af_path ./ancibd_data/afs/v51.1_1240k_AF_ch18.tsv \
--prefix RISE109_EAS004.no_mask \
--out ./no_mask | tee -a "log.no_mask.txt"

Mask file

ch  start_bp   end_bp     start_cm  end_cm
1   117262791  152752796  147.421   163.079
2   28328      10223894   0.014     26.182
4   71566      5690711    0.341     11.535
8   164984     14405249   0.0004    31.511
10  43410489   54652146   62.682    77.491
14  101952406  107283009  111.711   120.2
15  20071673   34014753   0.005     43.484
17  32762040   39154710   54.307    63.702
18  69836      8490599    0.1604    25.465
19  266034     10030630   0.002     29.543
19  47112648   59087479   74.533    107.7316
21  14601415   24499572   0.861     21.041
22  17054720   25615059   1.723     23.842
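
For reference, a minimal sketch of the overlap test one would expect (my own illustration, not ancIBD's actual implementation), using the chromosome 18 mask row and the first two segments from the results below:

import pandas as pd

# Chromosome 18 mask row and the first two reported segments.
mask = pd.DataFrame({"ch": [18], "start_bp": [69836], "end_bp": [8490599]})
ibd = pd.DataFrame({"ch": [18, 18],
                    "StartBP": [6416465, 9017786],
                    "EndBP": [8096608, 10296623]})

def overlaps_mask(row, mask):
    # True if the segment overlaps any masked region on its chromosome.
    m = mask[mask.ch == row.ch]
    return ((row.StartBP < m.end_bp) & (row.EndBP > m.start_bp)).any()

kept = ibd[~ibd.apply(overlaps_mask, axis=1, mask=mask)]
print(kept)  # only the 9017786-10296623 segment survives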

Section of log file

Applying mask to IBD segments...
    Start    End  ...                           iid1                           iid2
0    1395   1487  ...  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
1    2767   3629  ...  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
2    4231   4775  ...  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
3    5224   5411  ...  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
4    6278   6657  ...  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
5    7539   7933  ...  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
6    9410  10448  ...  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
7   12803  13437  ...  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
8   14306  15154  ...  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
9   15622  15879  ...  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
10  19670  19898  ...  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
11  21006  21444  ...  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
12  22409  22741  ...  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
13  24215  24718  ...  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
14  24890  25111  ...  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
15  25824  26136  ...  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
16  26646  27343  ...  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
17  27723  28089  ...  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature

Results from run with mask:

Start  End    StartM               EndM                 length  lengthM               ch  StartBP   EndBP     iid1                           iid2
3110   4082   0.187756             0.240680992603302    972     0.0529249906539917    18  6416465   8096608   RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
4739   5354   0.27425798773765564  0.31047698855400085  615     0.036219000816345215  18  9017786   10296623  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
5872   6090   0.33414000272750854  0.34862199425697327  218     0.014481991529464722  18  11078646  11573400  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
7048   7497   0.3891560137271881   0.4127730131149292   449     0.02361699938774109   18  14000961  20106913  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
8477   8929   0.43661099672317505  0.45054900646209717  452     0.01393800973892212   18  22545495  23529365  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
10582  11853  0.4954560101032257   0.5268459916114807   1271    0.031389981508255005  18  28087698  31420010  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
14468  15122  0.5745900273323059   0.5875930190086365   654     0.013002991676330566  18  38026603  39712385  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
16092  17042  0.6033920049667358   0.6263909935951233   950     0.02299898862838745   18  42434615  44150558  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
17575  17853  0.6380820274353027   0.6519550085067749   278     0.013872981071472168  18  45358666  45874561  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
21995  22245  0.743274986743927    0.7536090016365051   250     0.010334014892578125  18  55127493  55535280  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
25036  25454  0.8473230004310608   0.8599590063095093   418     0.012636005878448486  18  60957339  61802304  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
27042  28042  0.9102759957313538   0.9489629864692688   1000    0.03868699073791504   18  65887287  67938899  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
28832  29192  0.977187991142273    0.9897369742393494   360     0.012548983097076416  18  69700748  70489780  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
29745  30530  1.0176960229873657   1.0555000305175781   785     0.0378040075302124    18  71478627  72915939  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature
30970  31374  1.0719029903411865   1.0926170349121094   404     0.02071404457092285   18  73615868  74318960  RISE109.allentoft_2015_nature  EAS004.gretzinger_2022_nature

RISE109_EAS004.mask.ch18.txt
RISE109_EAS004.no_mask.ch18.txt
log.mask.txt
log.no_mask.txt

Converting example vcf to HDF5

Hi,
First of all, thank you for the detailed and easy-to-follow instructions in the notebook.

Regarding the vcf-to-HDF5 conversion function: to process the data obtained from Dropbox, it is necessary to unzip the vcf.gz files before executing the code in the notebook.

Second, I get this error after unzipping with gunzip *.gz:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

This is the full error output:

Print downsampling to 1240K...
Finished BCF tools filtering.
Converting to HDF5...
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<timed exec> in <module>

~/anaconda3/lib/python3.9/site-packages/ancIBD/IO/prepare_h5.py in vcf_to_1240K_hdf(in_vcf_path, path_vcf, path_h5, marker_path, map_path, af_path, col_sample_af, chunk_length, chunk_width, buffer_size, ch)
    114 
    115     print("Converting to HDF5...")
--> 116     allel.vcf_to_hdf5(input=path_vcf, output=path_h5, 
    117                       fields = ['variants/*', 'calldata/*', "samples"],
    118                       types = {"samples":"S60", "calldata/GT":np.int8,

~/anaconda3/lib/python3.9/site-packages/allel/io/vcf_read.py in vcf_to_hdf5(input, output, group, compression, compression_opts, shuffle, overwrite, vlen, fields, exclude_fields, rename_fields, types, numbers, alt_number, fills, region, tabix, samples, transformers, buffer_size, chunk_length, chunk_width, log)
    691 
    692     # setup chunk iterator
--> 693     fields, samples, headers, it = iter_vcf_chunks(
    694         input, fields=fields, exclude_fields=exclude_fields, types=types,
    695         numbers=numbers, alt_number=alt_number, buffer_size=buffer_size,

~/anaconda3/lib/python3.9/site-packages/allel/io/vcf_read.py in iter_vcf_chunks(input, fields, exclude_fields, types, numbers, alt_number, fills, region, tabix, samples, transformers, buffer_size, chunk_length)
   1136 
   1137     # setup iterator
-> 1138     fields, samples, headers, it = _iter_vcf_stream(stream, **kwds)
   1139 
   1140     # setup transformers

~/anaconda3/lib/python3.9/site-packages/allel/io/vcf_read.py in _iter_vcf_stream(stream, fields, exclude_fields, types, numbers, alt_number, chunk_length, fills, region, samples)
   1634 
   1635     # read VCF headers
-> 1636     headers = _read_vcf_headers(stream)
   1637 
   1638     # setup samples

~/anaconda3/lib/python3.9/site-packages/allel/io/vcf_read.py in _read_vcf_headers(stream)
   1709     # read first header line
   1710     header = stream.readline()
-> 1711     header = str(header, 'utf8')
   1712 
   1713     while header and header[0] == '#':

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Problems when running example data

Hi, recently I was running ancIBD-0.5 with the example data (downloaded from https://www.dropbox.com/sh/q18yyrffbdj1yv1/AAC1apifYB_oKB8SNrmQQ-26a?dl=0), using Python 3.8.3 and the same parameters as in the tutorial. All commands run smoothly without errors, but the results I get differ from those shown in the tutorial, and I have not yet figured out what causes the difference. I would really appreciate your help!

I checked the output when preparing the hdf5 files, and it is the same as in the tutorial. But the results start to differ when calling the IBD, and it seems that I get far fewer IBD segments.

Chromosome 1; Loaded 18 IBD
Chromosome 2; Loaded 13 IBD
Chromosome 3; Loaded 11 IBD
Chromosome 4; Loaded 9 IBD
Chromosome 5; Loaded 9 IBD
Chromosome 6; Loaded 11 IBD
Chromosome 7; Loaded 12 IBD
Chromosome 8; Loaded 8 IBD
Chromosome 9; Loaded 10 IBD
Chromosome 10; Loaded 10 IBD
Chromosome 11; Loaded 10 IBD
Chromosome 12; Loaded 6 IBD
Chromosome 13; Loaded 12 IBD
Chromosome 14; Loaded 6 IBD
Chromosome 15; Loaded 6 IBD
Chromosome 16; Loaded 8 IBD
Chromosome 17; Loaded 9 IBD
Chromosome 18; Loaded 6 IBD
Chromosome 19; Loaded 11 IBD
Chromosome 20; Loaded 10 IBD
Chromosome 21; Loaded 9 IBD
Chromosome 22; Loaded 10 IBD
Saved 214 IBD to /home/han_shi/script/ancibd/output/ibd_hazelton2/ch_all.tsv.

When summarizing the IBD between pairs, some pairs yield IBD but differ in sum and number, and some pairs (I12438 & I12896, I12440 & I12896, ...), which yield IBD in the tutorial, yield no IBD at all here.

iid1 iid2 max_IBD sum_IBD>8 n_IBD>8 sum_IBD>12 n_IBD>12 sum_IBD>16 n_IBD>16 sum_IBD>20 n_IBD>20
I12440 I30300 268.7866928288713 3316.770885130245 22.0 3316.770885130245 22.0 3316.770885130245 22.0 3316.770885130245 22.0
I12438 I30300 283.65220259875065 3307.30799218436 21.0 3307.30799218436 21.0 3307.30799218436 21.0 3307.30799218436 21.0
I12440 I12438 176.73970285686664 1657.2198443172965 22.0 1647.3969392536674 21.0 1647.3969392536674 21.0 1647.3969392536674 21.0
I12439 I12440 153.30770015716553 1626.9729871419258 21.0 1604.7398943570442 19.0 1604.7398943570442 19.0 1604.7398943570442 19.0
I12439 I30300 98.25259447097778 919.9655823060311 13.0 919.9655823060311 13.0 919.9655823060311 13.0 919.9655823060311 13.0
I12439 I12438 93.37389469146729 475.9611673653126 10.0 459.80747416615486 8.0 459.80747416615486 8.0 459.80747416615486 8.0
I21390 I30300 13.226699829101562 13.226699829101562 1.0 13.226699829101562 1.0 0.0 0.0 0.0 0.0
I12896 I30300 12.573707103729248 12.573707103729248 1.0 12.573707103729248 1.0 0.0 0.0 0.0 0.0
I12439 I21390 11.11750602722168 11.11750602722168 1.0 0.0 0.0 0.0 0.0 0.0 0.0
I12439 I12896 8.414804935455322 8.414804935455322 1.0 0.0 0.0 0.0 0.0 0.0 0.0
I12438 I12896 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I12438 I21390 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I12440 I12896 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I12440 I21390 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I12896 I21390 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Could you please give me some advice about what might have caused this problem? Many thanks!

Converting example vcf to HDF5: Error at interpolation step

Hello,

I am attempting to convert the example VCF file from Dropbox to HDF5, but I am facing an error during the interpolation step. Do you have any insights on what might be causing this issue?

Thank you!

Print downsampling to 1240K...
Running bash command: 
bcftools view -Ov -o test/example_hazelton.ch20.1240k.vcf -T ./data/filters/snps_bcftools_ch20.csv -M2 -v snps ./data/vcf.raw/example_hazelton_chr20.vcf
Finished BCF tools filtering to target markers.
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
Lifting LD Map from eigenstrat to HDF5...
Loaded 28940 variants.
Loaded 6 individuals.
Loaded 0 Chr.20 1240K SNPs.
Intersection 0 out of 28940 HDF5 SNPs
Interpolating 28940 variants.
Traceback (most recent call last):
  File "/data/users/dorozco/.conda/envs/gnomix/bin/ancIBD-run", line 8, in <module>
    sys.exit(main())
  File "/data/users/dorozco/.local/lib/python3.8/site-packages/ancIBD/run_ancIBD.py", line 63, in main
    vcf_to_1240K_hdf(in_vcf_path = args.vcf,
  File "/data/users/dorozco/.local/lib/python3.8/site-packages/ancIBD/IO/prepare_h5.py", line 135, in vcf_to_1240K_hdf
    merge_in_ld_map(path_h5=path_h5, 
  File "/data/users/dorozco/.local/lib/python3.8/site-packages/ancIBD/IO/h5_modify.py", line 154, in merge_in_ld_map
    rec_ch = np.interp(x1, x, y) 
  File "<__array_function__ internals>", line 180, in interp
  File "/data/users/dorozco/.conda/envs/gnomix/lib/python3.8/site-packages/numpy/lib/function_base.py", line 1594, in interp
    return interp_func(x, xp, fp, left, right)
ValueError: array of sample points is empty
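
The "Loaded 0 Chr.20 1240K SNPs" and "Intersection 0 out of 28940 HDF5 SNPs" lines in the log mean no chromosome-20 entries were matched from the map file, so the genetic-map interpolation is handed an empty set of sample points, which is exactly what np.interp rejects. A minimal reproduction of the final error:

import numpy as np

# With an empty array of sample points, np.interp raises the same ValueError
# as in the traceback above.
np.interp([1.0, 2.0], [], [])  # ValueError: array of sample points is empty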

A new loading model - python dictionary input

Greetings Harald, Yilei and ancIBD devs,

I'm Sanjay C Nagi, a postdoc working on the Anopheles 1000 Genomes project and its successors. I am using ancIBD on whole-genome sequence data from mosquitoes. I had been working on my own heuristic method using opposite homozygotes to detect IBD, but I liked the look of the HMM in ancIBD. We work with our genomic data in the cloud, accessible via a Python package, malariagen_data. The package lets one load and perform complex analyses with single lines of code, and with a Python-based IBD method we can compute IBD segments on the fly.

I wanted to use ancIBD, but needed it to take input directly as NumPy arrays rather than converting a VCF to HDF5 and so on. This turned out to be simple in the end: it just required a new loading model that takes a Python dictionary as input rather than a path to an HDF5 file. The good thing is that this does not modify any of the rest of the code. Would you be interested in me adding this functionality to ancIBD itself? That way malariagen_data could depend directly on ancIBD rather than on a fork. Here is the PR adding the feature, and here is an example notebook using ancIBD with dict input. If you approve, I can open a pull request from my fork of ancIBD.

Also, if you have some test data for ancIBD, I'd be happy to add some testing/CI to the git repo at the same time, so that when you make changes to the package in the future, you can ensure ancIBD still works as intended.
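
For readers wondering what such a dictionary might look like: a hypothetical sketch, assuming it mirrors the fields the HDF5 path stores (the key names follow the vcf_to_hdf5 call quoted earlier on this page; the exact keys of the proposed loader are defined in the PR, not here):

import numpy as np

# Hypothetical dict mimicking the HDF5 layout ancIBD reads; all names and
# shapes here are illustrative, and "variants/MAP" is an assumed field name.
data = {
    "samples": np.array(["ind1", "ind2"], dtype="S60"),
    "variants/POS": np.array([10519276, 10519389]),
    "variants/MAP": np.array([0.0001, 0.0002]),                # genetic map
    "calldata/GT": np.zeros((2, 2, 2), dtype=np.int8),         # phased genotypes
    "calldata/GP": np.full((2, 2, 3), 1/3, dtype=np.float32),  # genotype posteriors
}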

Merging in LD Map.. results in error

Hi again.
This issue can be merged with #5.

I've used GLIMPSE to impute two samples, which resulted in the following VCF file:

(base) batel:~/glimpse/Levant_BA$ bcftools view merged.vcf.gz | head -20
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=01/08/2023 - 11:43:29
##source=GLIMPSE_phase v2.0.0
##contig=<ID=chr22>
##INFO=<ID=RAF,Number=A,Type=Float,Description="ALT allele frequency in the reference panel">
##INFO=<ID=AF,Number=A,Type=Float,Description="ALT allele frequency computed from DS/GP field across target samples">
##INFO=<ID=INFO,Number=A,Type=Float,Description="IMPUTE info quality score for diploid samples">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Phased genotypes">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Genotype dosage">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Genotype posteriors">
##NMAIN=15
##FPLOIDY=2
##bcftools_mergeVersion=1.10.2+htslib-1.10.2-3ubuntu0.1
##bcftools_mergeCommand=merge -l GLIMPSE_ligate/samples_merge.txt -o merged.vcf.gz; Date=Tue Aug  1 13:03:00 2023
##bcftools_viewVersion=1.10.2+htslib-1.10.2-3ubuntu0.1
##bcftools_viewCommand=view merged.vcf.gz; Date=Tue Aug  1 13:04:29 2023
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S10769.SN	S10770.SN
chr22	10519276	22:10519276:G:C	G	C	.	.	RAF=0.0377889;AF=0.0572749;INFO=1	GT:DS:GP	0|0:0.106:0.896,0.101,0.003	0|0:0.115:0.888,0.108,0.004
chr22	10519389	22:10519389:T:C	T	C	.	.	RAF=0.0376327;AF=0.0567751;INFO=1	GT:DS:GP	0|0:0.104:0.898,0.098,0.004	0|0:0.114:0.889,0.107,0.004

When trying to convert to HDF5 using

from ancIBD.IO.prepare_h5 import vcf_to_1240K_hdf
ch = 22
#base_path = f"/n/groups/reich/hringbauer/git/hapBLOCK"
vcf_to_1240K_hdf(#in_vcf_path = f"./data/vcf.raw/example_hazelton_chr{ch}.vcf.gz",
                 in_vcf_path = f"./data/vcf.raw/merged_chr22_GT_GP.vcf.gz",
                 path_vcf = f"./data/vcf.1240k/example_hazelton_chr{ch}.vcf.gz",
                 path_h5 = f"./data/hdf5/example_hazelton_chr{ch}.h5",
                 marker_path = f"./data/filters/snps_bcftools_ch{ch}.csv",
                 map_path = f"./data/v51.1_1240k.snp",
                 af_path = f"./data/afs/v51.1_1240k_AF_ch{ch}.tsv",
                 col_sample_af = "",
                 buffer_size=20000, chunk_width=8, chunk_length=20000,
                 ch=ch)

I get the following error:

Print downsampling to 1240K...
Finished BCF tools filtering.
Converting to HDF5...
Finished conversion to hdf5!
Merging in LD Map..
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/tmp/ipykernel_7983/2495295663.py in <module>
      2 ch = 22
      3 #base_path = f"/n/groups/reich/hringbauer/git/hapBLOCK"
----> 4 vcf_to_1240K_hdf(#in_vcf_path = f"./data/vcf.raw/example_hazelton_chr{ch}.vcf.gz",
      5     in_vcf_path = f"./data/vcf.raw/merged_chr22_GT_GP.vcf.gz",
      6     path_vcf = f"./data/vcf.1240k/example_hazelton_chr{ch}.vcf.gz",

~/anaconda3/lib/python3.9/site-packages/ancIBD/IO/prepare_h5.py in vcf_to_1240K_hdf(in_vcf_path, path_vcf, path_h5, marker_path, map_path, af_path, col_sample_af, chunk_length, chunk_width, buffer_size, ch)
    125     print("Merging in LD Map..")
    126     if len(map_path)>0:
--> 127         merge_in_ld_map(path_h5=path_h5, 
    128                     path_snp1240k=map_path,
    129                     chs=[ch])

~/anaconda3/lib/python3.9/site-packages/ancIBD/IO/h5_modify.py in merge_in_ld_map(path_h5, path_snp1240k, chs, write_mode)
    115     chs: Which Chromosomes to merge in HDF5 [list].
    116     write_mode: Which mode to use on hdf5. a: New field. r+: Change Field"""
--> 117     with h5py.File(path_h5, "r") as f:
    118         print("Lifting LD Map from eigenstrat to HDF5...")
    119         print("Loaded %i variants." % np.shape(f["calldata/GT"])[0])

~/anaconda3/lib/python3.9/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, fs_page_size, page_buf_size, min_meta_keep, min_raw_keep, locking, alignment_threshold, alignment_interval, **kwds)
    531                                  fs_persist=fs_persist, fs_threshold=fs_threshold,
    532                                  fs_page_size=fs_page_size)
--> 533                 fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
    534 
    535             if isinstance(libver, tuple):

~/anaconda3/lib/python3.9/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
    224         if swmr and swmr_support:
    225             flags |= h5f.ACC_SWMR_READ
--> 226         fid = h5f.open(name, flags, fapl=fapl)
    227     elif mode == 'r+':
    228         fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5f.pyx in h5py.h5f.open()

FileNotFoundError: [Errno 2] Unable to open file (unable to open file: name = './data/hdf5/example_hazelton_chr22.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

The directory does exist; I get no error from os.listdir("./data/hdf5/"), and I have also tried writing the absolute path.

Moreover, the example you provided runs perfectly fine; only switching in_vcf to my file breaks it. So it must be something in the VCF file itself.

Any ideas?

Batel
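
For reference, one plausible cause worth ruling out (an assumption, not a confirmed diagnosis): the GLIMPSE output uses "chr22" contig names, and if the 1240K marker file lists plain "22", the bcftools -T filter matches nothing, leaving an empty VCF so that the HDF5 file is never written. A quick comparison:

import gzip

# Compare contig naming in the VCF with the marker file passed to bcftools -T.
with gzip.open("./data/vcf.raw/merged_chr22_GT_GP.vcf.gz", "rt") as f:
    contig = next(line.split("\t")[0] for line in f if not line.startswith("#"))
print("VCF contig:", contig)  # "chr22" per the header excerpt above
with open("./data/filters/snps_bcftools_ch22.csv") as f:
    print("marker file first line:", f.readline().strip())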

filters/snps_bcftools_chX.csv

Hi,

Is it on purpose that not all SNPs from the 1240K marker set are included in the filter files used to restrict imputed SNPs to the 1240K markers? At the hdf5 import the imputed GTs are filtered by these coordinates, so you always "lose" the non-included markers even though they are present in the imputed GTs.

I am aware that some SNPs on the 1240K chip have lower concordance with true shotgun WGS data. However, not all data come from the chip, so removing these markers from a purely WGS dataset should not be required. On the other hand, we are talking about imputed GTs anyway, where the other markers in LD were already used to infer the diploid phased haplotypes over a large genomic chunk from gold-standard WGS reference data (given the random positions of the non-concordant SNPs, imputation should already fix their errors). Accordingly, if these files contain fewer markers in order to avoid "bad markers", that removal should have happened before the imputation step for chip data, not at the IBD identification step. That way imputation would improve for chip data while true shotgun WGS is unaffected. Furthermore, that approach would not thin the markers at the IBD detection step for either WGS or chip data, while you could still co-analyze mixed datasets.

But again, I reserve the right to be dumb/ignorant, and it may very well be that I am unaware of some other valid reason to remove these markers. Could you please elaborate on why ~50k (~4.4%) of the autosomal 1240K markers are excluded by the filters?

Regards,
Zoltan

add option to use maskfile

Hi,

It would be nice if you could add a maskfile option, as a function parameter, to either

hapBLOCK_chroms (to not emit IBD from masked areas, since all the relevant genome-coordinate information is available there)
or
filter_ibd_df plus its callers create_ind_ibd_df / ind_all_ibd_df (to filter IBD instead of, or in addition to, the SNP-density parameter),

since this could be handled naturally in the base package.

(The individual IBD data in the output of hapBLOCK_chroms does not (yet) contain the genomic coordinates, and the map data is not on the same scale (M vs. cM) as the mask data, so doing this with simple "shell magic" would be complex.)

While this is not an issue with a few samples, or for an individual pairwise IBD share, when you work with several hundred individuals the number of combinations (N*(N-1)/2, e.g. 124,750 pairs for N = 500) gets large, and at these genomic locations almost every sample will share IBD with all the others. This results in a needlessly large fraction of these false-positive IBD segments relative to the randomly distributed true IBD in the output.

Thanks!

Installing ancIBD fails when building wheel

I'm trying to install ancIBD on my institute's cluster, but it fails when building the wheel.
I've tried running pip install wheel, and also upgraded it to the latest version.


(ancIBD) batelziv@moriah-gw-02:~/softwares/ancIBD$ python3 -m pip install ancIBD
Collecting ancIBD
  Using cached ancIBD-0.5.tar.gz (201 kB)
  Preparing metadata (setup.py) ... done
Requirement already satisfied: numpy in ./lib/python3.7/site-packages (from ancIBD) (1.21.6)
Requirement already satisfied: pandas in ./lib/python3.7/site-packages (from ancIBD) (1.3.5)
Collecting scipy (from ancIBD)
  Using cached scipy-1.7.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (38.1 MB)
Requirement already satisfied: h5py in ./lib/python3.7/site-packages (from ancIBD) (3.8.0)
Requirement already satisfied: psutil in ./lib/python3.7/site-packages (from ancIBD) (5.9.5)
Requirement already satisfied: cython in ./lib/python3.7/site-packages (from ancIBD) (3.0.0)
Requirement already satisfied: matplotlib in ./lib/python3.7/site-packages (from ancIBD) (3.5.3)
Collecting pysam (from ancIBD)
  Using cached pysam-0.21.0-cp37-cp37m-manylinux_2_24_x86_64.whl (18.1 MB)
Collecting scikit-allel (from ancIBD)
  Using cached scikit-allel-1.3.6.tar.gz (9.7 MB)
  Installing build dependencies ... done


  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [36 lines of output]
      [scikit-allel] setup extensions with cython
      Compiling allel/opt/model.pyx because it changed.
      Compiling allel/opt/stats.pyx because it changed.
      Compiling allel/opt/io_vcf_read.pyx because it changed.
      [1/3] Cythonizing allel/opt/io_vcf_read.pyx
      [2/3] Cythonizing allel/opt/model.pyx
      [3/3] Cythonizing allel/opt/stats.pyx
      Traceback (most recent call last):
        File "/sci/home/batelziv/softwares/ancIBD/lib/python3.7/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
          main()
        File "/sci/home/batelziv/softwares/ancIBD/lib/python3.7/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "/sci/home/batelziv/softwares/ancIBD/lib/python3.7/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
        File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/build_meta.py", line 341, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=['wheel'])
        File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/build_meta.py", line 323, in _get_build_requires
          self.run_setup()
        File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/build_meta.py", line 338, in run_setup
          exec(code, locals())
        File "<string>", line 100, in <module>
        File "<string>", line 96, in setup_package
        File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/__init__.py", line 107, in setup
          return distutils.core.setup(**attrs)
        File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 159, in setup
          dist.parse_config_files()
        File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/dist.py", line 898, in parse_config_files
          pyprojecttoml.apply_configuration(self, filename, ignore_option_errors)
        File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/config/pyprojecttoml.py", line 66, in apply_configuration
          config = read_configuration(filepath, True, ignore_option_errors, dist)
        File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/config/pyprojecttoml.py", line 128, in read_configuration
          validate(subset, filepath)
        File "/tmp/pip-build-env-v6neapz8/overlay/lib/python3.7/site-packages/setuptools/config/pyprojecttoml.py", line 55, in validate
          raise ValueError(f"{error}\n{summary}") from None
      ValueError: invalid pyproject.toml config: `project`.
      configuration error: `project` must contain ['name'] properties
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

Any ideas?

Mac M1 / M2 install tips

For ARM users, please use this workaround to emulate x86:

conda create -n py37
conda activate py37
conda config --env --set subdir osx-64
conda install python=3.7 

Then, in the /docs directory, run:

pip install -r requirements.txt
pip install Cython
pip install ancIBD

Fail to detect any IBD in data

Hi, I've been trying to run ancIBD on data imputed with GLIMPSE and the 1000G panel. I have a full VCF per chromosome, and the conversion and filtering to 1240k SNPs ran smoothly, calculating the AF directly from the samples with this command:

for ch in range(1,23):
    vcf_to_1240K_hdf(in_vcf_path = f"./data.chr{ch}.vcf.gz",
                     path_vcf = f"./ancIB_data/data.1240.chr{ch}.vcf.gz",
                     path_h5 = f"./ancIB_data/data.1240.chr{ch}.h5",
                     marker_path = f"./ancIBA_filters/snps_bcftools_ch{ch}.csv",
                     map_path = f"./ancIB_data/v51.1_1240k.snp",
                     #af_path = f"./ancIB_afs/v51.1_1240k_AF_ch{ch}.tsv",
                     col_sample_af = "AF_ALL",
                     buffer_size=20000, chunk_width=8, chunk_length=20000,
                     ch=ch)

But when I run hapBLOCK_chroms, it fails to detect any IBD in any pair among my 90 samples:

for ch in range(1,23):
    df_ibd = hapBLOCK_chroms(folder_in='./ancIB_data/data.1240k.chr',
                             iids=iids, run_iids=[],
                             ch=ch, folder_out='./ancIB_data/ibd/',
                             output=False, prefix_out='', logfile=True,
                             l_model='hdf5', e_model='haploid_gl', h_model='FiveStateScaled', t_model='standard',
                             ibd_in=1, ibd_out=10, ibd_jump=400,
                             min_cm=6, cutoff_post=0.99, max_gap=0.0075,
                             processes=1)

When I run the code with the example data, I obtain exactly the same result as in the tutorial.

When I use plink or KING on my imputed dataset I do detect IBD (and I know for sure that some individuals are 1st-degree relatives).
Can you see where the problem is?

Here is an example of the filtered VCF produced by the first command:

21      10205629        rs140777501     A       G       100     PASS    NS=2504;AA=A|||;VT=SNP;DP=21926;RAF=0.166534;AF=0.166534;INFO=1;EAS_AF=0.2173;AMR_AF=0.1988;AFR_AF=0.1135;EUR_AF=0.2177;SAS_AF=0.1104;AN=54;AC=2        GT:DS:GP        0|0:0:1,0,0     ./.:.:. ./.:.:. 0|0:0:1,0,0     ./.:.:. 0|0:0:1,0,0     0|0:0.001:0.999,0.001,0 ./.:.:. ./.:.:. ./.:.:. ./.:
21      14601415        rs2775537       G       A       100     PASS    NS=2504;AA=g|||;VT=SNP;DP=17928;RAF=0.550519;AF=0.550519;INFO=1;EAS_AF=0.376;AMR_AF=0.4251;AFR_AF=0.8169;EUR_AF=0.4722;SAS_AF=0.5399;AN=30;AC=3 GT:DS:GP        ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. 1|1:1.998:0,0.001,0.999 ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:.:. ./.:
21      14652908        rs3869758       T       C       100     PASS    NS=2504;AA=.|||;VT=SNP;DP=23683;RAF=0.0896565;AF=0.0896565;INFO=1;EAS_AF=0;AMR_AF=0.0648;AFR_AF=0.2814;EUR_AF=0.0239;SAS_AF=0.0082;AN=176;AC=0  GT:DS:GP        0|0:0:1,0,0     0|0:0:1,0,0     0|0:0:1,0,0     0|0:0:1,0,0     0|0:0:1,0,0     0|0:0:1,0,0     0|0:0:1,0,0     0|0:0:1,0,0
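
Two things may be worth checking here (suggestions on my part, not from the tutorial). First, path_h5 writes files with the prefix ./ancIB_data/data.1240.chr, while hapBLOCK_chroms reads folder_in='./ancIB_data/data.1240k.chr'; if that "1240" vs. "1240k" mismatch is not just a transcription slip, the IBD step may not be reading the newly created files at all. Second, the VCF excerpt above contains many "./.:.:." missing genotypes, so it is worth confirming what actually reached the HDF5:

import h5py

# Inspect one per-chromosome HDF5: confirm GP is present and estimate how
# much of the genotype data is missing (scikit-allel codes missing GT as -1).
with h5py.File("./ancIB_data/data.1240.chr21.h5", "r") as f:
    print(list(f["calldata"]))      # expect GT and GP among the fields
    gt = f["calldata/GT"][:10000]   # a slice is enough for a quick look
    print("fraction missing:", (gt < 0).mean())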
