lovemun / sexcmd Goto Github PK

Python 76.87% R 23.13%

sexcmd's Introduction

SEXCMD : Created Sex Marker and Determination

USAGE : perl determine.pl sex_marker_filtered.hg38.final.fasta input.fastq.gz 2> /dev/null 1> input.SEX_OUTPUT

SEXCMD is available at https://github.com/lovemun/SEXCMD

The code is written in Perl and bash script. This tool is supported on Linux and needs Python, lastz, blastall and bwa. The tool uses gzip compressed fastq file as input and R1 and R2 file can be used in case of paired-end or mate pair. And you should give marker fasta file path, input fastq file path as auguements.

Required software

Python (https://www.python.org/ftp/python/2.7.13/python-2.7.13.msi)
bwa(version 7.0 or later) (https://sourceforge.net/projects/bio-bwa/files/)
blastall (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/)
lastz(Release 1.02.00) (http://www.bx.psu.edu/~rsharris/lastz/lastz-1.04.00.tar.gz)
samtools (https://sourceforge.net/projects/samtools/files/samtools/)
R (https://www.r-project.org/)

Input Files

Reference genome fasta file (ex. hg38.fa)
Input sequence file (ex. ERR000000.fastq.gz)

Example

If you want to create marker file, you can make own sex marker file for your datafiles. There are 6 process steps for creating own sex marker file.

Extract sex chromosome from Reference genome(ex. Human)

python 1.extract_chr_from_REF.py hg38.fasta X

python 1.extract_chr_from_REF.py hg38.fasta Y

Mapping between sex chromosomes

lastz hg38.chrX.fasta hg38.chrY.fasta --format=maf > chrX_chrY.maf

Generate sex marker text file

python 2.sex_marker.py chrX_chrY.maf hg38.ucsc.gtf 35 > sex_marker.hg38.txt (distance between marker : 35(recommended))

python 3.sex_marker_filtered.py sex_marker.hg38.txt hg38.ucsc.gtf 151 5 > sex_marker_filtered.hg38.txt (read length : 151 admit missmatch count : 5)

Convert sex marker file format(text -> fasta)

python 4.make_markerFASTA.py sex_marker_filtered.hg38.txt

Filtered ONLY MAPPED to chrX, chrY

blastn -db hg38.fa -query sex_marker_filtered.hg38.fasta -num_threads 4 -word_size 8 -outfmt 7 > sex_marker_filtered.hg38.fasta.blastm9

python 5.blast_filter.py sex_marker_filtered.hg38.fasta sex_marker_filtered.hg38.fasta.blastm9 sex_marker_filtered.hg38.final.fasta

Index sex marker file

bwa index -a is sex_marker_filtered.hg38.final.fasta

Final sex marker : sex_marker_filtered.hg38.final.fasta

Sex determination

Rscript SEXCMD.R sex_marker_filtered.hg38.final.fasta [1/2/3] [XY/ZW] input.fastq.gz

[1/2/3] : sequencing type (1: Exome sequencing, 2: RNA sequencing, 3: Whole genome sequencing)

[XY/ZW] : Sex determination system

Contact

[email protected]

Citation

sexcmd's People

Contributors

Stargazers

Watchers

Forkers

qnie biocko oicr-gsi jing-xinxing leornardzhou

sexcmd's Issues

Error in the SEXCMD.R script

Hi Seongmun,

I was using your script on several avian libraries and I found a couple of problems with the code in SEXCMD.R that you might want to check out:

Lines 118-127:

if (sextype == "XY"){
  ct <- data.frame(Length=do.call(rbind, marker_length_list[startsWith(x = names(marker_seq_list), prefix = "chrX")]), Count=do.call(rbind, marker_seq_list[startsWith(x = names(marker_seq_list), prefix = "chrX")]))
  ct <- cbind(ct, do.call(rbind, marker_seq_list[startsWith(x = names(marker_seq_list), prefix = "chrY")]))
  colnames(ct) <- c("Length", "Xcount", "Ycount")
}
if (sextype == "ZW"){
  ct <- data.frame(Length=do.call(rbind, marker_length_list[startsWith(x = names(marker_seq_list), prefix = "chrZ")]), Count=do.call(rbind, marker_seq_list[startsWith(x = names(marker_seq_list), prefix = "chrW")]))
  ct <- cbind(ct, do.call(rbind, marker_seq_list[startsWith(x = names(marker_seq_list), prefix = "chrW")]))
  colnames(ct) <- c("Length", "Zcount", "Wcount")
}

In the ZW system there is an inconsistency with respect to the XY system chunk of code. I think this line:
ct <- data.frame(Length=do.call(rbind, marker_length_list[startsWith(x = names(marker_seq_list), prefix = "chrZ")]), Count=do.call(rbind, marker_seq_list[startsWith(x = names(marker_seq_list), prefix = "chrW")]))
should be modified in:
ct <- data.frame(Length=do.call(rbind, marker_length_list[startsWith(x = names(marker_seq_list), prefix = "chrZ")]), Count=do.call(rbind, marker_seq_list[startsWith(x = names(marker_seq_list), prefix = "chrZ")]))
In this way it mimics the code for the XY system.

In lines 118, 123, 131, 140
if (sextype = "XY") and if (sextype = "ZW") should be replaced with if (sextype == "XY") and if (sextype == "ZW")

Thank you again for your tool :)

Valentina

Issue creating marker file

Hello,
I was using your pipeline and came across an issue in the 1.extract_chr_from_REF.py script. It runs until line 28 but then in line 29 it gives me the following error:

Traceback (most recent call last):
File "PATH/SEXCMD/1.extract_chr_from_REF.py", line 29, in
for idx in xrange(0, len(chr_dict[CHR]), 50):
KeyError: 'chrX'

I was wondering to get some insight into why this might be going on. Thank you very much.

Chicken/human sex marker files

Hello,

I was wondering if it would be possible to have the sex marker files for human and chickens (the ones generated for the publication of this tool) within the Example files, it would be a good resource for people who want to use this tool on those organisms again.

Thanks for your help,
Valentina

marker file creation issue

I am trying to make my own marker file for humans (since the software does not provide the one used in the paper).

In step 1 I got the following error

Traceback (most recent call last):
File "../../../aDNA_programs/SEXCMD/1.extract_chr_from_REF.py", line 28, in <m
outfile.write('>' + CHR + '\n')
I am using python 2.7
I tried different reference genomes (from NCBI, 1000K, UCSN) and builds 37/38 but did not worked.
Is there a solution for this?

In order to move forward I downloaded the chromosomes X and Y from NCBI, and changed the names of the fasta to CHRX and CHRY (because I think the script in step 1 was also doing this).

Then, in step 2 I met an allocation limit problem with lastz:

FAILURE: in add_segment()
table size (4,869,542,152 for 101,448,794 segments) exceeds allocation limit of 4,294,967,279;
consider raising scoring threshold (--hspthresh or --exact) or breaking your target sequence into smaller pieces

At first I tried the lastz_32 (more RAM memory) instead of lastz. Gave me the same error and also it was too slow.

Then I added in the 2nd command the argument --hspthresh=4000 (3000 is the default)
It return the same error
I gradually (per 1000) increased the value and at 10000 it did not gave me an error but it stack for two days. It did the same with 20000.

May I have some help with these issues?

Thank you in advance!