Giter Club home page Giter Club logo

sexcmd's Introduction

SEXCMD : Created Sex Marker and Determination

USAGE : perl determine.pl sex_marker_filtered.hg38.final.fasta input.fastq.gz 2> /dev/null 1> input.SEX_OUTPUT

SEXCMD is available at https://github.com/lovemun/SEXCMD

The code is written in Perl and bash script. This tool is supported on Linux and needs Python, lastz, blastall and bwa. The tool uses gzip compressed fastq file as input and R1 and R2 file can be used in case of paired-end or mate pair. And you should give marker fasta file path, input fastq file path as auguements.

Required software

  1. Python (https://www.python.org/ftp/python/2.7.13/python-2.7.13.msi)
  2. bwa(version 7.0 or later) (https://sourceforge.net/projects/bio-bwa/files/)
  3. blastall (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/)
  4. lastz(Release 1.02.00) (http://www.bx.psu.edu/~rsharris/lastz/lastz-1.04.00.tar.gz)
  5. samtools (https://sourceforge.net/projects/samtools/files/samtools/)
  6. R (https://www.r-project.org/)

Input Files

  1. Reference genome fasta file (ex. hg38.fa)
  2. Input sequence file (ex. ERR000000.fastq.gz)

Example

If you want to create marker file, you can make own sex marker file for your datafiles. There are 6 process steps for creating own sex marker file.

  1. Extract sex chromosome from Reference genome(ex. Human)

python 1.extract_chr_from_REF.py hg38.fasta X

python 1.extract_chr_from_REF.py hg38.fasta Y

  1. Mapping between sex chromosomes

lastz hg38.chrX.fasta hg38.chrY.fasta --format=maf > chrX_chrY.maf

  1. Generate sex marker text file

python 2.sex_marker.py chrX_chrY.maf hg38.ucsc.gtf 35 > sex_marker.hg38.txt (distance between marker : 35(recommended))

python 3.sex_marker_filtered.py sex_marker.hg38.txt hg38.ucsc.gtf 151 5 > sex_marker_filtered.hg38.txt (read length : 151 admit missmatch count : 5)

  1. Convert sex marker file format(text -> fasta)

python 4.make_markerFASTA.py sex_marker_filtered.hg38.txt

  1. Filtered ONLY MAPPED to chrX, chrY

blastn -db hg38.fa -query sex_marker_filtered.hg38.fasta -num_threads 4 -word_size 8 -outfmt 7 > sex_marker_filtered.hg38.fasta.blastm9

python 5.blast_filter.py sex_marker_filtered.hg38.fasta sex_marker_filtered.hg38.fasta.blastm9 sex_marker_filtered.hg38.final.fasta

  1. Index sex marker file

bwa index -a is sex_marker_filtered.hg38.final.fasta

Final sex marker : sex_marker_filtered.hg38.final.fasta

Sex determination

Rscript SEXCMD.R sex_marker_filtered.hg38.final.fasta [1/2/3] [XY/ZW] input.fastq.gz

[1/2/3] : sequencing type (1: Exome sequencing, 2: RNA sequencing, 3: Whole genome sequencing)

[XY/ZW] : Sex determination system

Contact

[email protected]

Citation

sexcmd's People

Contributors

lovemun avatar

Stargazers

Angelina Meiras Ottoni  avatar jekoy avatar  avatar  avatar Valentina avatar

Watchers

James Cloos avatar  avatar  avatar

sexcmd's Issues

Error in the SEXCMD.R script

Hi Seongmun,

I was using your script on several avian libraries and I found a couple of problems with the code in SEXCMD.R that you might want to check out:

Lines 118-127:

if (sextype == "XY"){
  ct <- data.frame(Length=do.call(rbind, marker_length_list[startsWith(x = names(marker_seq_list), prefix = "chrX")]), Count=do.call(rbind, marker_seq_list[startsWith(x = names(marker_seq_list), prefix = "chrX")]))
  ct <- cbind(ct, do.call(rbind, marker_seq_list[startsWith(x = names(marker_seq_list), prefix = "chrY")]))
  colnames(ct) <- c("Length", "Xcount", "Ycount")
}
if (sextype == "ZW"){
  ct <- data.frame(Length=do.call(rbind, marker_length_list[startsWith(x = names(marker_seq_list), prefix = "chrZ")]), Count=do.call(rbind, marker_seq_list[startsWith(x = names(marker_seq_list), prefix = "chrW")]))
  ct <- cbind(ct, do.call(rbind, marker_seq_list[startsWith(x = names(marker_seq_list), prefix = "chrW")]))
  colnames(ct) <- c("Length", "Zcount", "Wcount")
}

In the ZW system there is an inconsistency with respect to the XY system chunk of code. I think this line:
ct <- data.frame(Length=do.call(rbind, marker_length_list[startsWith(x = names(marker_seq_list), prefix = "chrZ")]), Count=do.call(rbind, marker_seq_list[startsWith(x = names(marker_seq_list), prefix = "chrW")]))
should be modified in:
ct <- data.frame(Length=do.call(rbind, marker_length_list[startsWith(x = names(marker_seq_list), prefix = "chrZ")]), Count=do.call(rbind, marker_seq_list[startsWith(x = names(marker_seq_list), prefix = "chrZ")]))
In this way it mimics the code for the XY system.

In lines 118, 123, 131, 140
if (sextype = "XY") and if (sextype = "ZW") should be replaced with if (sextype == "XY") and if (sextype == "ZW")

Thank you again for your tool :)

Valentina

Issue creating marker file

Hello,
I was using your pipeline and came across an issue in the 1.extract_chr_from_REF.py script. It runs until line 28 but then in line 29 it gives me the following error:

Traceback (most recent call last):
File "PATH/SEXCMD/1.extract_chr_from_REF.py", line 29, in
for idx in xrange(0, len(chr_dict[CHR]), 50):
KeyError: 'chrX'

I was wondering to get some insight into why this might be going on. Thank you very much.

Chicken/human sex marker files

Hello,

I was wondering if it would be possible to have the sex marker files for human and chickens (the ones generated for the publication of this tool) within the Example files, it would be a good resource for people who want to use this tool on those organisms again.

Thanks for your help,
Valentina

marker file creation issue

I am trying to make my own marker file for humans (since the software does not provide the one used in the paper).

In step 1 I got the following error

Traceback (most recent call last):
File "../../../aDNA_programs/SEXCMD/1.extract_chr_from_REF.py", line 28, in <m
outfile.write('>' + CHR + '\n')
I am using python 2.7
I tried different reference genomes (from NCBI, 1000K, UCSN) and builds 37/38 but did not worked.
Is there a solution for this?

In order to move forward I downloaded the chromosomes X and Y from NCBI, and changed the names of the fasta to CHRX and CHRY (because I think the script in step 1 was also doing this).

Then, in step 2 I met an allocation limit problem with lastz:

FAILURE: in add_segment()
table size (4,869,542,152 for 101,448,794 segments) exceeds allocation limit of 4,294,967,279;
consider raising scoring threshold (--hspthresh or --exact) or breaking your target sequence into smaller pieces

At first I tried the lastz_32 (more RAM memory) instead of lastz. Gave me the same error and also it was too slow.

Then I added in the 2nd command the argument --hspthresh=4000 (3000 is the default)
It return the same error
I gradually (per 1000) increased the value and at 10000 it did not gave me an error but it stack for two days. It did the same with 20000.

May I have some help with these issues?

Thank you in advance!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.