Comments (5)
Print downsampling to 1240K...
Finished BCF tools filtering.
Converting to HDF5...
UnicodeDecodeError Traceback (most recent call last)
Cell In[28], line 6
3 chs = range(1,23)
5 for ch in chs:
----> 6 vcf_to_1240K_hdf(in_vcf_path = f"/mnt/gpfs/Users/wangjincheng/aDNA/All_data/Analysis4/19.genotype_imputation/GLIMPSE_test/GLIMPSE_ligated/merged_chr{ch}.vcf.gz",
7 path_vcf = f"/mnt/gpfs/Users/wangjincheng/aDNA/All_data/Analysis4/19.genotype_imputation/GLIMPSE_test/ancIBD/chr{ch}.vcf",
8 path_h5 = f"/mnt/gpfs/Users/wangjincheng/aDNA/All_data/Analysis4/19.genotype_imputation/GLIMPSE_test/ancIBD/chr{ch}.h5",
9 marker_path = f"/mnt/gpfs/Users/wangjincheng/aDNA/reference/data/filters/snps_bcftools_ch{ch}.csv",
10 map_path = f"/mnt/gpfs/Users/wangjincheng/aDNA/reference/data/afs/v51.1_1240k.snp",
11 af_path = f"/mnt/gpfs/Users/wangjincheng/aDNA/reference/data/afs/v51.1_1240k_AF_ch{ch}.tsv",
12 col_sample_af = "",
13 buffer_size=20000, chunk_width=8, chunk_length=20000,
14 ch=ch)
File /mnt/gpfs/Users/wangjincheng/software/miniconda3_new/lib/python3.9/site-packages/ancIBD/IO/prepare_h5.py:116, in vcf_to_1240K_hdf(in_vcf_path, path_vcf, path_h5, marker_path, map_path, af_path, col_sample_af, chunk_length, chunk_width, buffer_size, ch)
113 os.remove(path_h5)
115 print("Converting to HDF5...")
--> 116 allel.vcf_to_hdf5(input=path_vcf, output=path_h5,
117 fields = ['variants/', 'calldata/', "samples"],
118 types = {"samples":"S60", "calldata/GT":np.int8,
119 "calldata/GP":np.float32, "calldata/PL":np.float32},
120 buffer_size=buffer_size,
121 chunk_length = chunk_length, chunk_width=chunk_width,
122 compression="gzip") # Do the conversion to hdf5. Takes hours
123 print("Finished conversion to hdf5!")
125 print("Merging in LD Map..")
File /mnt/gpfs/Users/wangjincheng/software/miniconda3_new/lib/python3.9/site-packages/allel/io/vcf_read.py:693, in vcf_to_hdf5(input, output, group, compression, compression_opts, shuffle, overwrite, vlen, fields, exclude_fields, rename_fields, types, numbers, alt_number, fills, region, tabix, samples, transformers, buffer_size, chunk_length, chunk_width, log)
690 store_samples, fields = _prep_fields_param(fields)
692 # setup chunk iterator
--> 693 fields, samples, headers, it = iter_vcf_chunks(
694 input, fields=fields, exclude_fields=exclude_fields, types=types,
695 numbers=numbers, alt_number=alt_number, buffer_size=buffer_size,
696 chunk_length=chunk_length, fills=fills, region=region, tabix=tabix,
697 samples=samples, transformers=transformers
698 )
700 # handle field renaming
701 if rename_fields:
File /mnt/gpfs/Users/wangjincheng/software/miniconda3_new/lib/python3.9/site-packages/allel/io/vcf_read.py:1138, in iter_vcf_chunks(input, fields, exclude_fields, types, numbers, alt_number, fills, region, tabix, samples, transformers, buffer_size, chunk_length)
1134 stream = _setup_input_stream(input=input, region=region, tabix=tabix,
1135 buffer_size=buffer_size)
1137 # setup iterator
-> 1138 fields, samples, headers, it = _iter_vcf_stream(stream, **kwds)
1140 # setup transformers
1141 if transformers is not None:
1142 # API flexibility
File /mnt/gpfs/Users/wangjincheng/software/miniconda3_new/lib/python3.9/site-packages/allel/io/vcf_read.py:1636, in _iter_vcf_stream(stream, fields, exclude_fields, types, numbers, alt_number, chunk_length, fills, region, samples)
1632 def _iter_vcf_stream(stream, fields, exclude_fields, types, numbers, alt_number,
1633 chunk_length, fills, region, samples):
1634
1635 # read VCF headers
-> 1636 headers = _read_vcf_headers(stream)
1638 # setup samples
1639 samples, loc_samples = _normalize_samples(samples=samples, headers=headers,
1640 types=types)
File /mnt/gpfs/Users/wangjincheng/software/miniconda3_new/lib/python3.9/site-packages/allel/io/vcf_read.py:1711, in _read_vcf_headers(stream)
1709 # read first header line
1710 header = stream.readline()
-> 1711 header = str(header, 'utf8')
1713 while header and header[0] == '#':
1715 headers.append(header)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
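A note on the error itself: byte `0x8b` at position 1 is the second byte of the gzip magic number (`1f 8b`), which suggests the file handed to allel's header reader was gzip-compressed but read as plain text (scikit-allel typically infers compression from the file extension). A minimal stdlib-only check, with a placeholder path:

```python
GZIP_MAGIC = b"\x1f\x8b"  # every gzip file starts with these two bytes

def looks_gzipped(path):
    """Return True if the file begins with the gzip magic number (0x1f 0x8b)."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC

# e.g. looks_gzipped("chr1.vcf") -> True would mean the file is gzipped
# despite its plain .vcf name, exactly the mismatch this traceback shows.
```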
from ancibd.
Hi @ymeili, I was on vacation, sorry for the belated reply!
That seems to be an issue with the VCF-to-HDF5 conversion. Did you manually test the underlying function vcf_to_hdf5
of scikit-allel?
Did you solve the problem already?

Thanks for your reply!
I have solved the problem: I gave "path_vcf" a .vcf.gz suffix, and the "in_vcf_path" file should also be in .vcf.gz format (which differs slightly from your tutorial). That resolved the error.
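For concreteness, a sketch of the two arguments that changed (the paths here are shortened placeholders, not the ones from the traceback):

```python
ch = 1  # one autosome; in practice loop over range(1, 23)

# Both files carry a .vcf.gz suffix so that downstream readers
# treat them as gzip-compressed rather than plain text.
conversion_paths = dict(
    in_vcf_path=f"GLIMPSE_ligated/merged_chr{ch}.vcf.gz",  # gzipped GLIMPSE output
    path_vcf=f"ancIBD/chr{ch}.vcf.gz",                     # gzipped intermediate, not plain .vcf
)
# These would then be passed on, e.g.:
# vcf_to_1240K_hdf(**conversion_paths, path_h5=..., ch=ch, ...)
```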
Sorry to disturb your vacation. I am confused about how to determine remote kinship from ancIBD results. The function "plot_pde_individual_from_ibd_df" from "ancIBD.plot.plot_karyotype" requires three parameters: "comm_ancs", "ms", and "labels".
I used my own data to visualize the results: two individuals have a first-degree relationship, yet their IBD pattern did not fit the curve. Could you please add a more detailed description of how to determine remote kinship? The figure in your paper is quite impressive, but I don't know how to classify my data in a precise way.
Thanks a lot for your reply!
The paper does not contain a classifier "IBD sharing->degree of relatedness" as it focuses on a method to screen for IBD.
But as long as you have multiple long IBD segments (>20 cM), there is definitely a close biological relationship (up to ~6th degree). There is biological variation in IBD sharing, so relatedness classes start to overlap from the 3rd degree onwards. I do the classification manually, comparing IBD patterns >12 cM to PEDSIM simulations.
The three parameters comm_ancs, ms, and labels describe the number of shared ancestral haplotypes (4 if the relationship is via full siblings, 2 otherwise), the number of meioses separating the two individuals, and the label of the respective relationship. You can adjust them to the relationships you want to plot. Keep in mind that there is ample biological variation around the expected values (each histogram bar is approximately Poisson distributed around its mean).
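As a rough aid for this kind of manual classification, the classical Poisson/exponential approximation (as used in e.g. Huff et al.'s ERSA, not ancIBD's own machinery) gives the expected number of IBD segments above a length threshold for m meioses and a shared ancestors. The map length r ≈ 35.3 Morgans and c = 22 autosomes below are conventional assumptions, not values from this thread:

```python
import math

def expected_ibd_segments(m, a, min_cm=12.0, r_morgans=35.3, n_chrom=22):
    """Expected number of IBD segments longer than min_cm shared by two
    relatives separated by m meioses with a common ancestors, under the
    classical approximation: segment count ~ Poisson with mean
    a*(r*m + c)/2^(m-1), segment lengths ~ Exp(mean = 100/m cM)."""
    n_total = a * (r_morgans * m + n_chrom) / 2 ** (m - 1)
    # Survival function of the exponential length distribution at min_cm:
    return n_total * math.exp(-m * min_cm / 100.0)

# Fewer and shorter segments as relationships grow more distant, e.g.
# full siblings (m=2, a=2) vs. first cousins (m=4, a=2):
sibs = expected_ibd_segments(2, 2)
cousins = expected_ibd_segments(4, 2)
```

This only sketches why classes overlap from the 3rd degree on: the expected counts shrink quickly with m, while the Poisson noise around each mean stays proportionally large.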
Thanks for your reply! One more question: we have identified a first-degree relationship between two samples, with 0.9x coverage on 1240k sites. However, ancIBD infers a 3rd- or 4th-degree relationship. This is an outlier in our results and we are quite confused. Have you encountered a similar situation before?