Comments (5)

ymeili commented on June 17, 2024

Print downsampling to 1240K...
Finished BCF tools filtering.
Converting to HDF5...

UnicodeDecodeError Traceback (most recent call last)
Cell In[28], line 6
3 chs = range(1,23)
5 for ch in chs:
----> 6 vcf_to_1240K_hdf(in_vcf_path = f"/mnt/gpfs/Users/wangjincheng/aDNA/All_data/Analysis4/19.genotype_imputation/GLIMPSE_test/GLIMPSE_ligated/merged_chr{ch}.vcf.gz",
7 path_vcf = f"/mnt/gpfs/Users/wangjincheng/aDNA/All_data/Analysis4/19.genotype_imputation/GLIMPSE_test/ancIBD/chr{ch}.vcf",
8 path_h5 = f"/mnt/gpfs/Users/wangjincheng/aDNA/All_data/Analysis4/19.genotype_imputation/GLIMPSE_test/ancIBD/chr{ch}.h5",
9 marker_path = f"/mnt/gpfs/Users/wangjincheng/aDNA/reference/data/filters/snps_bcftools_ch{ch}.csv",
10 map_path = f"/mnt/gpfs/Users/wangjincheng/aDNA/reference/data/afs/v51.1_1240k.snp",
11 af_path = f"/mnt/gpfs/Users/wangjincheng/aDNA/reference/data/afs/v51.1_1240k_AF_ch{ch}.tsv",
12 col_sample_af = "",
13 buffer_size=20000, chunk_width=8, chunk_length=20000,
14 ch=ch)

File /mnt/gpfs/Users/wangjincheng/software/miniconda3_new/lib/python3.9/site-packages/ancIBD/IO/prepare_h5.py:116, in vcf_to_1240K_hdf(in_vcf_path, path_vcf, path_h5, marker_path, map_path, af_path, col_sample_af, chunk_length, chunk_width, buffer_size, ch)
113 os.remove(path_h5)
115 print("Converting to HDF5...")
--> 116 allel.vcf_to_hdf5(input=path_vcf, output=path_h5,
117 fields = ['variants/*', 'calldata/*', "samples"],
118 types = {"samples":"S60", "calldata/GT":np.int8,
119 "calldata/GP":np.float32, "calldata/PL":np.float32},
120 buffer_size=buffer_size,
121 chunk_length = chunk_length, chunk_width=chunk_width,
122 compression="gzip") # Do the conversion to hdf5. Takes hours
123 print("Finished conversion to hdf5!")
125 print("Merging in LD Map..")

File /mnt/gpfs/Users/wangjincheng/software/miniconda3_new/lib/python3.9/site-packages/allel/io/vcf_read.py:693, in vcf_to_hdf5(input, output, group, compression, compression_opts, shuffle, overwrite, vlen, fields, exclude_fields, rename_fields, types, numbers, alt_number, fills, region, tabix, samples, transformers, buffer_size, chunk_length, chunk_width, log)
690 store_samples, fields = _prep_fields_param(fields)
692 # setup chunk iterator
--> 693 fields, samples, headers, it = iter_vcf_chunks(
694 input, fields=fields, exclude_fields=exclude_fields, types=types,
695 numbers=numbers, alt_number=alt_number, buffer_size=buffer_size,
696 chunk_length=chunk_length, fills=fills, region=region, tabix=tabix,
697 samples=samples, transformers=transformers
698 )
700 # handle field renaming
701 if rename_fields:

File /mnt/gpfs/Users/wangjincheng/software/miniconda3_new/lib/python3.9/site-packages/allel/io/vcf_read.py:1138, in iter_vcf_chunks(input, fields, exclude_fields, types, numbers, alt_number, fills, region, tabix, samples, transformers, buffer_size, chunk_length)
1134 stream = _setup_input_stream(input=input, region=region, tabix=tabix,
1135 buffer_size=buffer_size)
1137 # setup iterator
-> 1138 fields, samples, headers, it = _iter_vcf_stream(stream, **kwds)
1140 # setup transformers
1141 if transformers is not None:
1142 # API flexibility

File /mnt/gpfs/Users/wangjincheng/software/miniconda3_new/lib/python3.9/site-packages/allel/io/vcf_read.py:1636, in _iter_vcf_stream(stream, fields, exclude_fields, types, numbers, alt_number, chunk_length, fills, region, samples)
1632 def _iter_vcf_stream(stream, fields, exclude_fields, types, numbers, alt_number,
1633 chunk_length, fills, region, samples):
1634
1635 # read VCF headers
-> 1636 headers = _read_vcf_headers(stream)
1638 # setup samples
1639 samples, loc_samples = _normalize_samples(samples=samples, headers=headers,
1640 types=types)

File /mnt/gpfs/Users/wangjincheng/software/miniconda3_new/lib/python3.9/site-packages/allel/io/vcf_read.py:1711, in _read_vcf_headers(stream)
1709 # read first header line
1710 header = stream.readline()
-> 1711 header = str(header, 'utf8')
1713 while header and header[0] == '#':
1715 headers.append(header)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
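For reference, 0x8b is the second byte of the gzip magic number (0x1f 0x8b), so this error usually means a gzip-compressed file was read as plain text. A minimal check (the path is a placeholder) to confirm whether the intermediate VCF handed to allel.vcf_to_hdf5 is actually compressed despite its .vcf name:

# Check the first two bytes against the gzip magic number (0x1f 0x8b).
with open("chr1.vcf", "rb") as f:
    magic = f.read(2)
print("gzip-compressed:", magic == b"\x1f\x8b")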

hringbauer commented on June 17, 2024

Hi @ymeili, I was on vacation, sorry for the belated reply!

That seems to be an issue with the VCF-to-HDF5 conversion. Did you manually test the underlying function vcf_to_hdf5 of scikit-allel?
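For example, a minimal manual test of the scikit-allel call, mirroring the arguments from the traceback above (the chr1 paths are placeholders):

import allel
import numpy as np

# Reproduce the failing conversion step directly, outside of ancIBD.
allel.vcf_to_hdf5(
    input="chr1.vcf",
    output="chr1.h5",
    fields=["variants/*", "calldata/*", "samples"],
    types={"samples": "S60", "calldata/GT": np.int8,
           "calldata/GP": np.float32, "calldata/PL": np.float32},
    compression="gzip",
)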

Did you solve the problem already?

ymeili commented on June 17, 2024

Thanks for your reply!

I have solved this problem: I set "path_vcf" to .vcf.gz format, and "in_vcf_path" should also be in .vcf.gz format (which is slightly different from your tutorial); with that change the error was resolved.
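For anyone hitting the same error, a sketch of the corrected call, with the same arguments as in the traceback above but both VCF paths ending in .vcf.gz (the long directory prefixes are abbreviated to "..."):

from ancIBD.IO.prepare_h5 import vcf_to_1240K_hdf

for ch in range(1, 23):
    vcf_to_1240K_hdf(
        in_vcf_path=f".../GLIMPSE_ligated/merged_chr{ch}.vcf.gz",  # input is bgzipped
        path_vcf=f".../ancIBD/chr{ch}.vcf.gz",                     # changed from .vcf to .vcf.gz
        path_h5=f".../ancIBD/chr{ch}.h5",
        marker_path=f".../filters/snps_bcftools_ch{ch}.csv",
        map_path=".../afs/v51.1_1240k.snp",
        af_path=f".../afs/v51.1_1240k_AF_ch{ch}.tsv",
        col_sample_af="",
        buffer_size=20000, chunk_width=8, chunk_length=20000,
        ch=ch,
    )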

Sorry to disturb your vacation. I am confused about how to determine remote kinship based on ancIBD results. The function "plot_pde_individual_from_ibd_df" from "ancIBD.plot.plot_karyotype" requires the parameters "comm_ancs", "ms", and "labels".

I used my data to visualize the results: two individuals have a first-degree relationship, yet the IBD pattern did not fit the curve. Would you please add some detailed description of how to determine remote kinship? The figure in your paper is quite impressive, but I don't know how to classify my data in a precise way.

Thanks a lot for your reply!

hringbauer commented on June 17, 2024

The paper does not contain a classifier "IBD sharing->degree of relatedness" as it focuses on a method to screen for IBD.

But as long as you have multiple long IBD segments (>20 cM), there is definitely a close biological relationship (up to ~6th degree). There is biological variation in IBD sharing, so relatedness classes start to overlap from the 3rd degree onwards. I do the classification manually and compare IBD patterns >12 cM to PEDSIM simulations.
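For example, a minimal pandas sketch of how such a per-pair summary could be tabulated (the table and its column names iid1, iid2, and length_cM are assumptions for illustration, not ancIBD's actual output schema):

import pandas as pd

# ibd_df: hypothetical per-segment table, one row per inferred IBD segment.
seg12 = ibd_df[ibd_df["length_cM"] > 12]                  # segments compared to simulations
summary = seg12.groupby(["iid1", "iid2"]).agg(
    n_seg_gt12=("length_cM", "size"),                     # number of segments > 12 cM
    sum_gt12=("length_cM", "sum"),                        # total length > 12 cM
    n_seg_gt20=("length_cM", lambda x: (x > 20).sum()),   # long segments > 20 cM
)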

The parameters comm_ancs, ms, and labels describe the number of shared ancestral haplotypes (four if the relationship is via full sibs, two otherwise), the number of meioses between the two individuals, and the labels of the respective relationships. You can adjust those to the relationships you want to plot. Keep in mind that there is ample biological variation around the expected values (each histogram bar is approximately Poisson distributed around its mean).
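A minimal sketch of such a call (only comm_ancs, ms, and labels are parameter names confirmed in this thread; the ibd_df argument and its position are assumptions):

from ancIBD.plot.plot_karyotype import plot_pde_individual_from_ibd_df

# Compare one pair's observed IBD against three candidate relationships.
plot_pde_individual_from_ibd_df(
    ibd_df,                      # hypothetical: the pair's IBD segment table
    comm_ancs=[4, 4, 4],         # shared ancestral haplotypes (4 when related via full sibs)
    ms=[2, 4, 6],                # meioses separating the pair
    labels=["Full siblings", "First cousins", "Second cousins"],
)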

ymeili commented on June 17, 2024

Thanks for your reply! One more question: we have identified a first-degree relationship between two samples, and the 1240k coverage is 0.9x. However, ancIBD suggests a 3rd- or 4th-degree relationship. This is an exception in our results, and we are quite confused. Have you encountered a similar situation before?
