project-male-assembly's Introduction

Project: chromosome Y assembly

This repository contains a Snakemake workflow that produces whole-genome Verkko assemblies and extracts the contigs that most likely represent the Y chromosome. The workflow requires HiFi and ONT reads as input; Illumina short reads are additionally needed for certain assembly evaluation tasks.

The input sample sheet is a simple tab-separated table listing the sample name (sample) and the file system location of each read set (hifi, ont, and short if available). Pass the sample sheet to Snakemake as follows:

$ snakemake --config samples=PATH_TO_SAMPLE_SHEET [...]

Example sample sheet
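
As an illustration only, a sample sheet with the columns named above could be parsed and checked as sketched below. The sample name, the file paths, and the helper name load_sample_sheet are hypothetical, and treating hifi and ont as mandatory and short as optional is an assumption derived from the description above; the workflow's own loading code may differ.

```python
import csv
import io

# Assumption: 'hifi' and 'ont' are mandatory, 'short' is optional.
REQUIRED_COLUMNS = ("sample", "hifi", "ont")


def load_sample_sheet(handle):
    """Parse a tab-separated sample sheet into {sample: {read_type: path}}."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"sample sheet lacks required column(s): {missing}")
    samples = {}
    for row in reader:
        name = row["sample"]
        if name in samples:
            raise ValueError(f"duplicate sample name: {name}")
        # Keep only read sets that actually have a location.
        samples[name] = {k: v for k, v in row.items() if k != "sample" and v}
    return samples


# Hypothetical example content; the sample name and paths are made up.
EXAMPLE_SHEET = (
    "sample\thifi\tont\tshort\n"
    "SAMPLE01\t/data/SAMPLE01/hifi\t/data/SAMPLE01/ont\t\n"
)

if __name__ == "__main__":
    print(load_sample_sheet(io.StringIO(EXAMPLE_SHEET)))
```

The empty short column is dropped, so downstream code can test for the presence of the key instead of checking for empty strings.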

Tool setup

The entire workflow uses Conda environments wherever possible to deploy software dependencies. A base environment containing Snakemake itself is defined in workflow/envs/run_env.yaml.

For software in the development/prototype stage (Verkko and VerityMap), manual steps are required: Verkko must be adapted to the local infrastructure, which cannot be automated, and VerityMap must be built from a specific bugfix version (see module workflow/envs/80_est_assm_errors.smk).

Plotting

The folder notebooks/ contains Jupyter notebooks used to plot various summary statistics of the generated assemblies. The notebooks contain a brief description documenting the necessary input files (produced by the Snakemake workflow).

Citation

In preparation

project-male-assembly's People

Contributors

ptrebert

project-male-assembly's Issues

GRCh38 chrY contig

Since many downstream analyses will be performed relative to GRCh38, can you think about (and decide) how to handle this sequence:

chrY_KI270740v1_random (37240 bp)

Open a project?

@ptrebert
Do you think it would help to open a project here - I can try to start writing things down step-by-step.

HMMER: additional seq motifs

Add the following motifs to the HMMER runs (various purposes).
The FASTA files are located under references.

  • DYZ19_Yq: located in Yq euchromatin, not for contig identification, just annotate chrY contigs
  • DYZ3-prim_Ycentro: 171bp rep unit, not for contig identification, just annotate chrY contigs
  • DYZ3-sec_Ycentro: 5941bp rep unit, threshold hits >1700 and use for contig identification
  • "TSPY": to be defined
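
A minimal sketch of how the per-motif settings above could be encoded and applied. It assumes that the threshold of 1700 refers to a per-hit score and that hits are available as (contig, motif, score) tuples; both are assumptions, and all names are hypothetical. TSPY is left out because its settings are still to be defined.

```python
# Per-motif settings taken from the list above.
# Assumption: the ">1700" threshold applies to a per-hit score.
MOTIFS = {
    "DYZ19_Yq": {"min_score": None, "identify": False},
    "DYZ3-prim_Ycentro": {"min_score": None, "identify": False},
    "DYZ3-sec_Ycentro": {"min_score": 1700, "identify": True},
}


def high_quality_hits(hits, motifs=MOTIFS):
    """Keep hits of known motifs that exceed the motif's threshold, if any."""
    kept = []
    for contig, motif, score in hits:
        threshold = motifs[motif]["min_score"]
        if threshold is None or score > threshold:
            kept.append((contig, motif, score))
    return kept


def contigs_identified_by_motif(hits, motifs=MOTIFS):
    """Return contigs with at least one passing hit of an identifying motif."""
    passing = high_quality_hits(hits, motifs)
    return sorted({contig for contig, motif, _ in passing if motifs[motif]["identify"]})
```

Motifs flagged with identify set to False still pass through high_quality_hits, so they remain available for annotating contigs without influencing contig identification.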

hg38 Y seq annotations: align to all de novo Y assm

Pille:

Additional Y sequence annotations from hg38 - please add a step to align these to all chrY assemblies:
- using '--secondary=no'
- using '--secondary=yes' but not restricting the number of sec.alignments allowed
I've added both the bed and fasta to Globus /HHU/references/ - GRCh38_chrY-seq-classes_coord_plus_repeats.bed/fasta
These contain most of the Y repeats and SDs, so they are useful to understand the Y structure.

Comparing fasta sequences

@ptrebert
Btw, were you serious about a one-liner for comparing 2 FASTA sequences? Can you write this (can be multiple lines as well :P) in like a few minutes?
It would be even better if it could take multiple FASTA sequences as input (an alignment, essentially) and then, wherever the sequences differ, output the site across all sequences plus its position in the sequence. Maybe in a format similar to VCF, one site per row: position, gt/sample1, gt/sample2, etc. I can kind of think of a way of writing it, but in reality it would take me more time than I would like.
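
A minimal sketch of such a comparison, assuming the input is already a multiple sequence alignment in FASTA format (all sequences of equal length, gaps written as "-"). The function names and the toy sequences are hypothetical; this is one possible approach, not the script that was eventually used.

```python
import io


def read_fasta(handle):
    """Return {name: sequence} from a FASTA file handle, preserving order."""
    records, name = {}, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def variable_sites(records):
    """Yield (1-based position, [base per sequence]) for columns that differ."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) > 1:
        raise ValueError("sequences must be aligned to the same length")
    for pos, column in enumerate(zip(*records.values()), start=1):
        if len(set(column)) > 1:
            yield pos, list(column)


if __name__ == "__main__":
    # Toy alignment, made up for illustration.
    aln = io.StringIO(">s1\nACGT-A\n>s2\nACCT-A\n>s3\nACGTTA\n")
    records = read_fasta(aln)
    print("pos", *records, sep="\t")
    for pos, bases in variable_sites(records):
        print(pos, *bases, sep="\t")
```

Each output row is one differing site: the alignment position followed by one column per sequence, in input order. Gaps count as a difference, and positions refer to alignment columns, not to ungapped sequence coordinates.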

orient/order chrY contigs using T2T alignments

Order (orient) identified chrY contigs on the basis of the contig-to-reference alignments relative to T2T.
Contigs should be renamed; coordinate with Pille about naming convention.

Pille:

Contig names - maybe something as easy as just numbering from PAR1 as 1 to PAR2. E.g. chrY_01 to chrY_N?
Or do you think it would be worth retaining the original contig names as well?
[...] does it make sense to keep those with original contig names, or do we just rename the Y contigs?
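
One way the numbering idea could look, as a sketch only: contigs are sorted by the start of their alignment on the reference chrY (assuming PAR1 lies at the start of the reference coordinates) and numbered chrY_01 to chrY_N, while the original name is kept in a lookup table. The input tuples, the function name, and the contig names are hypothetical.

```python
def rename_contigs(placements):
    """Order contigs along the reference and assign chrY_NN names.

    placements: iterable of (contig_name, ref_start, strand) tuples, where
    ref_start is the alignment start on the reference chrY (hypothetical input).
    Returns a list of (new_name, original_name, needs_revcomp) tuples.
    """
    # Sort by reference start; ties are broken by the original name.
    ordered = sorted(placements, key=lambda p: (p[1], p[0]))
    # Zero-pad to at least two digits so that names sort lexicographically.
    width = max(2, len(str(len(ordered))))
    table = []
    for idx, (name, _ref_start, strand) in enumerate(ordered, start=1):
        new_name = f"chrY_{idx:0{width}d}"
        # Contigs aligned on the minus strand need to be reverse-complemented.
        table.append((new_name, name, strand == "-"))
    return table


if __name__ == "__main__":
    # Made-up contig names and coordinates.
    example = [("tigB", 5_000_000, "-"), ("tigA", 10_000, "+"), ("tigC", 20_000_000, "+")]
    for row in rename_contigs(example):
        print(*row, sep="\t")
```

Keeping the original contig name in the table makes the renaming reversible, which would cover both options raised above.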

identify Y contigs

Implement a simple, rule-based strategy to identify chrY contigs in the de novo assemblies.

Current rule set:

  • 1. tig has only chrY alignments
  • 2. tig has mixed alignments but Y-specific motif hits (above threshold)
  • 3. tig has primary alignments to chrY with many (current threshold: >300) unspecific motif hits (above threshold)
  • 4. tig is unaligned and has Y-specific motif hits (above threshold)
  • 5. tig has more than 90% bp primary alignments to chrY

Motif hits above threshold = "high-quality hits", thresholds set by expert curation
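
The rule set above could be sketched as follows. The per-contig summary fields are hypothetical, as are the function names. Two readings are assumptions: "mixed alignments" is taken to mean alignments to more than one chromosome, and "primary alignments to chrY" in rule 3 is taken to mean a non-zero fraction of bp in primary alignments to chrY. The motif hit counts are assumed to include only high-quality hits.

```python
from dataclasses import dataclass, field


@dataclass
class ContigSummary:
    """Hypothetical per-contig summary of alignment and motif results."""
    name: str
    aligned_chroms: set = field(default_factory=set)  # chromosomes with alignments
    frac_primary_chry: float = 0.0  # fraction of bp in primary alignments to chrY
    y_specific_hits: int = 0        # high-quality Y-specific motif hits
    unspecific_hits: int = 0        # high-quality unspecific motif hits


UNSPECIFIC_HIT_THRESHOLD = 300  # "current threshold: >300" in rule 3


def matching_rules(tig):
    """Return the numbers of all rules (1-5) that the contig satisfies."""
    rules = []
    if tig.aligned_chroms == {"chrY"}:
        rules.append(1)
    if len(tig.aligned_chroms) > 1 and tig.y_specific_hits > 0:
        rules.append(2)
    if tig.frac_primary_chry > 0 and tig.unspecific_hits > UNSPECIFIC_HIT_THRESHOLD:
        rules.append(3)
    if not tig.aligned_chroms and tig.y_specific_hits > 0:
        rules.append(4)
    if tig.frac_primary_chry > 0.9:
        rules.append(5)
    return rules


def is_chry_contig(tig):
    """A contig counts as chrY if at least one rule matches."""
    return bool(matching_rules(tig))
```

Returning the matching rule numbers instead of a plain yes/no keeps a record of why each contig was selected, which helps when thresholds are revised.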
