Comments (5)
Print downsampling to 1240K...
Finished BCF tools filtering.
Converting to HDF5...
UnicodeDecodeError Traceback (most recent call last)
Cell In[28], line 6
3 chs = range(1,23)
5 for ch in chs:
----> 6 vcf_to_1240K_hdf(in_vcf_path = f"/mnt/gpfs/Users/wangjincheng/aDNA/All_data/Analysis4/19.genotype_imputation/GLIMPSE_test/GLIMPSE_ligated/merged_chr{ch}.vcf.gz",
7 path_vcf = f"/mnt/gpfs/Users/wangjincheng/aDNA/All_data/Analysis4/19.genotype_imputation/GLIMPSE_test/ancIBD/chr{ch}.vcf",
8 path_h5 = f"/mnt/gpfs/Users/wangjincheng/aDNA/All_data/Analysis4/19.genotype_imputation/GLIMPSE_test/ancIBD/chr{ch}.h5",
9 marker_path = f"/mnt/gpfs/Users/wangjincheng/aDNA/reference/data/filters/snps_bcftools_ch{ch}.csv",
10 map_path = f"/mnt/gpfs/Users/wangjincheng/aDNA/reference/data/afs/v51.1_1240k.snp",
11 af_path = f"/mnt/gpfs/Users/wangjincheng/aDNA/reference/data/afs/v51.1_1240k_AF_ch{ch}.tsv",
12 col_sample_af = "",
13 buffer_size=20000, chunk_width=8, chunk_length=20000,
14 ch=ch)
File /mnt/gpfs/Users/wangjincheng/software/miniconda3_new/lib/python3.9/site-packages/ancIBD/IO/prepare_h5.py:116, in vcf_to_1240K_hdf(in_vcf_path, path_vcf, path_h5, marker_path, map_path, af_path, col_sample_af, chunk_length, chunk_width, buffer_size, ch)
113 os.remove(path_h5)
115 print("Converting to HDF5...")
--> 116 allel.vcf_to_hdf5(input=path_vcf, output=path_h5,
117 fields = ['variants/', 'calldata/', "samples"],
118 types = {"samples":"S60", "calldata/GT":np.int8,
119 "calldata/GP":np.float32, "calldata/PL":np.float32},
120 buffer_size=buffer_size,
121 chunk_length = chunk_length, chunk_width=chunk_width,
122 compression="gzip") # Do the conversion to hdf5. Takes hours
123 print("Finished conversion to hdf5!")
125 print("Merging in LD Map..")
File /mnt/gpfs/Users/wangjincheng/software/miniconda3_new/lib/python3.9/site-packages/allel/io/vcf_read.py:693, in vcf_to_hdf5(input, output, group, compression, compression_opts, shuffle, overwrite, vlen, fields, exclude_fields, rename_fields, types, numbers, alt_number, fills, region, tabix, samples, transformers, buffer_size, chunk_length, chunk_width, log)
690 store_samples, fields = _prep_fields_param(fields)
692 # setup chunk iterator
--> 693 fields, samples, headers, it = iter_vcf_chunks(
694 input, fields=fields, exclude_fields=exclude_fields, types=types,
695 numbers=numbers, alt_number=alt_number, buffer_size=buffer_size,
696 chunk_length=chunk_length, fills=fills, region=region, tabix=tabix,
697 samples=samples, transformers=transformers
698 )
700 # handle field renaming
701 if rename_fields:
File /mnt/gpfs/Users/wangjincheng/software/miniconda3_new/lib/python3.9/site-packages/allel/io/vcf_read.py:1138, in iter_vcf_chunks(input, fields, exclude_fields, types, numbers, alt_number, fills, region, tabix, samples, transformers, buffer_size, chunk_length)
1134 stream = _setup_input_stream(input=input, region=region, tabix=tabix,
1135 buffer_size=buffer_size)
1137 # setup iterator
-> 1138 fields, samples, headers, it = _iter_vcf_stream(stream, **kwds)
1140 # setup transformers
1141 if transformers is not None:
1142 # API flexibility
File /mnt/gpfs/Users/wangjincheng/software/miniconda3_new/lib/python3.9/site-packages/allel/io/vcf_read.py:1636, in _iter_vcf_stream(stream, fields, exclude_fields, types, numbers, alt_number, chunk_length, fills, region, samples)
1632 def _iter_vcf_stream(stream, fields, exclude_fields, types, numbers, alt_number,
1633 chunk_length, fills, region, samples):
1634
1635 # read VCF headers
-> 1636 headers = _read_vcf_headers(stream)
1638 # setup samples
1639 samples, loc_samples = _normalize_samples(samples=samples, headers=headers,
1640 types=types)
File /mnt/gpfs/Users/wangjincheng/software/miniconda3_new/lib/python3.9/site-packages/allel/io/vcf_read.py:1711, in _read_vcf_headers(stream)
1709 # read first header line
1710 header = stream.readline()
-> 1711 header = str(header, 'utf8')
1713 while header and header[0] == '#':
1715 headers.append(header)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
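A note on the error itself: byte `0x8b` at position 1 is the second byte of the gzip magic number (`1f 8b`), which suggests the file handed to allel's header reader was gzip-compressed but read as plain text (scikit-allel typically infers compression from the file extension). A minimal stdlib-only check, with a placeholder path:

```python
GZIP_MAGIC = b"\x1f\x8b"  # every gzip file starts with these two bytes

def looks_gzipped(path):
    """Return True if the file begins with the gzip magic number (0x1f 0x8b)."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC

# e.g. looks_gzipped("chr1.vcf") -> True would mean the file is gzipped
# despite its plain .vcf name, exactly the mismatch this traceback shows.
```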
from ancibd.
Hi @ymeili, I was on vacation, sorry for the belated reply!
That seems to be an issue with the VCF-to-HDF5 conversion. Did you manually test the underlying function vcf_to_hdf5
of scikit-allel?
Did you solve the problem already?

Thanks for your reply!
I have solved the problem: I gave "path_vcf" a .vcf.gz suffix, and the "in_vcf_path" file should also be in .vcf.gz format (which differs slightly from your tutorial). That resolved the error.
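For concreteness, a sketch of the two arguments that changed (the paths here are shortened placeholders, not the ones from the traceback):

```python
ch = 1  # one autosome; in practice loop over range(1, 23)

# Both files carry a .vcf.gz suffix so that downstream readers
# treat them as gzip-compressed rather than plain text.
conversion_paths = dict(
    in_vcf_path=f"GLIMPSE_ligated/merged_chr{ch}.vcf.gz",  # gzipped GLIMPSE output
    path_vcf=f"ancIBD/chr{ch}.vcf.gz",                     # gzipped intermediate, not plain .vcf
)
# These would then be passed on, e.g.:
# vcf_to_1240K_hdf(**conversion_paths, path_h5=..., ch=ch, ...)
```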
Sorry to disturb your vacation. I am confused about how to determine remote kinship from ancIBD results. The function "plot_pde_individual_from_ibd_df" from "ancIBD.plot.plot_karyotype" requires three parameters: "comm_ancs", "ms", and "labels".
I used my own data to visualize the results: two individuals have a first-degree relationship, yet their IBD pattern did not fit the curve. Could you please add a more detailed description of how to determine remote kinship? The figure in your paper is quite impressive, but I don't know how to classify my data in a precise way.
Thanks a lot for your reply!
The paper does not contain a classifier "IBD sharing->degree of relatedness" as it focuses on a method to screen for IBD.
But as long as you have multiple long IBD segments (>20 cM), there is definitely a close biological relationship (up to ~6th degree). There is biological variation in IBD sharing, so relatedness classes start to overlap from the 3rd degree onwards. I do the classification manually, comparing IBD patterns >12 cM to PEDSIM simulations.
The three parameters comm_ancs, ms, and labels describe the number of shared ancestral haplotypes (4 if the relationship is via full siblings, 2 otherwise), the number of meioses separating the two individuals, and the label of the respective relationship. You can adjust them to the relationships you want to plot. Keep in mind that there is ample biological variation around the expected values (each histogram bar is approximately Poisson distributed around its mean).
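As a rough aid for this kind of manual classification, the classical Poisson/exponential approximation (as used in e.g. Huff et al.'s ERSA, not ancIBD's own machinery) gives the expected number of IBD segments above a length threshold for m meioses and a shared ancestors. The map length r ≈ 35.3 Morgans and c = 22 autosomes below are conventional assumptions, not values from this thread:

```python
import math

def expected_ibd_segments(m, a, min_cm=12.0, r_morgans=35.3, n_chrom=22):
    """Expected number of IBD segments longer than min_cm shared by two
    relatives separated by m meioses with a common ancestors, under the
    classical approximation: segment count ~ Poisson with mean
    a*(r*m + c)/2^(m-1), segment lengths ~ Exp(mean = 100/m cM)."""
    n_total = a * (r_morgans * m + n_chrom) / 2 ** (m - 1)
    # Survival function of the exponential length distribution at min_cm:
    return n_total * math.exp(-m * min_cm / 100.0)

# Fewer and shorter segments as relationships grow more distant, e.g.
# full siblings (m=2, a=2) vs. first cousins (m=4, a=2):
sibs = expected_ibd_segments(2, 2)
cousins = expected_ibd_segments(4, 2)
```

This only sketches why classes overlap from the 3rd degree on: the expected counts shrink quickly with m, while the Poisson noise around each mean stays proportionally large.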
Thanks for your reply! One more question: we have identified a first-degree relationship between two samples, with 0.9x coverage on 1240k sites. However, ancIBD infers a 3rd- or 4th-degree relationship. This is an outlier in our results and we are quite confused. Have you encountered a similar situation before?