
transcriptclean's People

Contributors

dewyman · fairliereese · muhammedhasan · nargesr


transcriptclean's Issues

Cannot run TranscriptClean with qsub

I can run it beautifully interactively on our server, but through qsub it gives this error. Is it some issue with handling multi-threading?

My base environment uses Python 3.6.7 rather than 3.7, but it does work interactively, so I assumed this shouldn't be the cause of the issue.

#!/usr/bin

# Set source of conda install
source miniconda3/etc/profile.d/conda.sh
python TranscriptClean/TranscriptClean.py -t 4 --sam /analysisdata/rawseq/fastq/SHARED/000078/Mouse_aging/BAM_MD/Day1_01_DRS_pass.sam --genome /home/callum/Genome_files/Mus_muscu
/analysisdata/rawseq/fastq/SHARED/000078/Mouse_aging/TranscriptClean/scripts/TranscriptClean_Day1_01.sh
Traceback (most recent call last):
  File "TranscriptClean/TranscriptClean.py", line 1593, in <module>
    main()
  File "TranscriptClean/TranscriptClean.py", line 37, in main
    header, sam_chroms, sam_chunks = split_SAM(sam_file, n_threads)
  File "TranscriptClean/TranscriptClean.py", line 513, in split_SAM
    chunks = split_input(transcript_lines, n)
  File "TranscriptClean/TranscriptClean.py", line 527, in split_input
    batch = my_list[index:]
TypeError: slice indices must be integers or None or have an __index__ method
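For what it's worth, a minimal sketch (not TranscriptClean's actual code, just a guess at the mechanism) of one way this TypeError can arise: in Python 3, plain division returns a float, and a float cannot be used as a slice index.

reads = ["read1", "read2", "read3", "read4"]
n_threads = 4

chunk_size = len(reads) / n_threads   # 1.0 -- a float under Python 3
try:
    batch = reads[chunk_size:]        # raises the same TypeError as above
except TypeError as err:
    print(err)

chunk_size = len(reads) // n_threads  # floor division keeps it an integer
batch = reads[chunk_size:]            # works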

CIGAR operations out of range after splice junction correction

Hi,

Since the update in #14 to update transcripts after correcting splice junctions, I get the following error. I think there may be a bug in the rescueNoncanonicalJunction function, which results in a CIGAR string incompatible with the sequence (i.e. more match/insert/sub operations than there are characters in the sequence), but I haven't been able to trace the exact source of the error. Could you please look into it if you have the chance?

Thanks!

  File "bin/TranscriptClean_fork/TranscriptClean.py", line 1042, in <module>
    main()
  File "bin/TranscriptClean_fork/TranscriptClean.py", line 171, in main
    writeTranscriptOutput(noncanTranscripts, sjDict, oSam, oFa, transcriptLog, genome)
  File "bin/TranscriptClean_fork/TranscriptClean.py", line 184, in writeTranscriptOutput
    outSam.write(Transcript2.printableSAM(currTranscript, genome, spliceAnnot) + "\n")
  File "/scratch/mrstone3/pacbio_iPSC/bin/TranscriptClean_fork/transcript2.py", line 294, in printableSAM
    self.NM, self.MD = self.getNMandMDFlags(genome)
  File "/scratch/mrstone3/pacbio_iPSC/bin/TranscriptClean_fork/transcript2.py", line 360, in getNMandMDFlags
    currBase = self.SEQ[seqPos]
IndexError: string index out of range
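To help narrow it down, here is a minimal sketch (a hypothetical helper, not part of TranscriptClean) that checks whether a CIGAR string is consistent with its sequence, i.e. whether the query-consuming operations add up to the sequence length:

import re

def cigar_matches_seq(cigar, seq):
    # Query-consuming operations (M, I, S, =, X) must sum to len(SEQ);
    # a longer CIGAR walks past the end of the sequence, as in the error above.
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    query_len = sum(int(n) for n, op in ops if op in "MIS=X")
    return query_len == len(seq)

print(cigar_matches_seq("5M1I4M", "ACGTACGTAC"))  # True:  5 + 1 + 4 == 10
print(cigar_matches_seq("6M1I4M", "ACGTACGTAC"))  # False: 11 != 10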

RuntimeWarning: overflow encountered in int_scalars

Hi,

I am using TranscriptClean to process some ONT long read data with the following parameters;

python TranscriptClean.py --threads 10 --sam input_long_read.sam --genome GRCh38.primary_assembly.genome.fa --spliceJns illumina_SJ.out.tab --correctMismatches true --correctIndels true --maxLenIndel 5 --maxSJOffset 5 --variants 00-common_all.vcf --outprefix output_TC

Everything runs fine and to completion, but for each thread I receive the following error.

/home/s/sem66/Desktop/TranscriptClean-2.0.3/TranscriptClean.py:1365: RuntimeWarning: overflow encountered in int_scalars
if dist_0*dist_1 <= 0:
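For reference, a minimal sketch of how this warning can arise, assuming (my guess) that the two distances are numpy 32-bit integer scalars whose product overflows; casting to plain Python integers avoids it:

import numpy as np

dist_0 = np.int32(100000)
dist_1 = np.int32(100000)
product = dist_0 * dist_1           # overflows int32: emits a RuntimeWarning and wraps around

if int(dist_0) * int(dist_1) <= 0:  # Python ints have arbitrary precision, no overflow
    print("signs differ")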

The SAM files produced are roughly 60% of the original size, so I just wanted to check whether this error can be ignored or whether there is a specific way to address it.

NB: the splice junction file used is the STAR output for the Illumina data from the same study, using 2-pass mapping with the genome annotation.

Any help would be greatly appreciated,

python3 ?

Hi,

Interesting idea, but I don't see the point in trying to install and use it, since Python 2 support ends in 2020. Are you considering a Python 3 version?

cheers,
Colin

Quality scores lost

I recently ran TranscriptClean on my PacBio reads, and it ran fine after making the changes you suggested (thanks!). However I noticed that the quality scores present in the input SAM file are not preserved in the output SAM. Would it be possible to add this feature at some point?

Of course, the quality string would need to be edited correspondingly with whatever indels are corrected. For insertions, the corresponding qual value would be removed, and for deletions you'd need to impute something (maybe "!"?).
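A rough sketch of that idea (a hypothetical helper, not an existing TranscriptClean function; the corrections format is made up for illustration):

def adjust_qual(qual, corrections, pad_char="!"):
    # corrections: (read_position, kind, length) tuples, where kind "D" means
    # bases were removed from the read (a corrected insertion) and "I" means
    # bases were added to the read (a corrected deletion).
    qual = list(qual)
    for pos, kind, length in sorted(corrections, reverse=True):  # right to left
        if kind == "D":
            del qual[pos:pos + length]         # drop quals of removed bases
        elif kind == "I":
            qual[pos:pos] = pad_char * length  # impute a low quality value
    return "".join(qual)

print(adjust_qual("IIIIIIII", [(3, "I", 2)]))  # III!!IIIII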

If you're OK with a pull request I'd be happy to take a stab at it :)

AttributeError: 'Fasta' object has no attribute 'sequence'

Whilst running the accessory script get_SJs_from_gtf.py I got the following error:

Traceback (most recent call last):
  File "/projects/b1177/software/TranscriptClean/accessory_scripts/get_SJs_from_gtf.py", line 114, in <module>
    spliceJn = formatSJOutput(info, prev_exonEnd, genome, minIntron)
  File "/projects/b1177/software/TranscriptClean/accessory_scripts/get_SJs_from_gtf.py", line 28, in formatSJOutput
    intronMotif = getIntronMotif(chromosome, intron_start, intron_end, genome)
  File "/projects/b1177/software/TranscriptClean/accessory_scripts/get_SJs_from_gtf.py", line 45, in getIntronMotif
    startBases = genome.sequence({'chr': chrom, 'start': start, 'stop': start + 1}, one_based=True)
AttributeError: 'Fasta' object has no attribute 'sequence'

Additionally, I also see this warning when running TranscriptClean.py without providing a file of reference splice junctions:

'Fasta' object has no attribute 'sequence'

Although the script runs through. Do you know what might be the issue?

Many thanks,
Catherine

Incompatible with CIGAR operators X/=

This tool fails for SAM files using X/= CIGAR operators instead of M, which are coming into more common use. It seems like a quick fix would be to look for X and = anywhere the current code looks for M, but there may be some side effects I'm not aware of.
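For illustration, a minimal sketch of that quick fix as a pre-processing step (hypothetical; note that collapsing =/X into M discards the match-vs-mismatch information):

import re

def collapse_eq_x(cigar):
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    merged = []
    for length, op in ops:
        op = "M" if op in "=X" else op        # treat = and X as M
        if merged and merged[-1][1] == op:
            merged[-1][0] += int(length)      # merge adjacent same-type operations
        else:
            merged.append([int(length), op])
    return "".join(f"{n}{op}" for n, op in merged)

print(collapse_eq_x("1307=1X205="))  # 1513M
print(collapse_eq_x("10=2I5X3D7="))  # 10M2I5M3D7M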

scalar errors in output but runs to completion

I seem to get many errors, but TranscriptClean just finishes and joins the outputs up into one file as expected.

Can these errors be ignored?

I provide an example output file.

minimap2

minimap2 -t 10 -2 -ax splice -uf --MD --junc-bed /home/scratch/callum/Genome_files/gencode.v39.annotation.bed --secondary=no /home/scratch/callum/Genome_files/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.mmi tail_trimmer/iPSC_rep1_run1_out/fastq/iPSC_rep1_run1_out.cleaned.fq.gz > iPSC_rep1_run1_pass_genome.sam

TranscriptClean

python 3.6.5

commit ae9e715 (HEAD -> master, tag: v2.0.3, origin/master, origin/HEAD)
Author: M. Hasan Celk [email protected]
Date: Fri Sep 3 19:04:55 2021 -0700

python /home/callum/TranscriptClean/TranscriptClean.py -t 12 --sam iPSC_rep1_run1/iPSC_rep1_run1_pass_genome.sam --genome /home/scratch/callum/Genome_files/GRCh38_no_alt_analysis_set_GCA_000001405.15_cleanheader.fasta -j /home/scratch/callum/Genome_files/gencode.v39.annotation.SJs.txt --outprefix iPSC_rep1_run1/TranscriptClean_ver3 --deleteTmp --tmpDir iPSC_rep1_run1/TranscriptClean_ver3/tmp

slurm-5134.out.txt

Fixing c42860/f2p4/2485

bug

STAR: (minus sequence field)
The mapping has created a 7-bp micro-exon with a canonical but likely incorrect junction to its left, and a non-canonical junction on its right
before_correction_sam

Post-correction: (minus sequence field)
We ended up with two introns next to each other with a zero-length exon
after_correction_sam

The correct arrangement would be:

Reference junctions:
chr11 2978343 2992253 2 2 1 0 2 1
chr11 2979238 2992253 2 2 1 136 2 31
chr11 2989198 2992253 2 2 0 5 1 41
chr11 2989272 2992253 2 2 1 2 3 38
chr11 2989198 2990908 2 2 1 0 2 1

Create file containing this transcript for testing:

grep "c42860/f2p4/2485" /bio/dwyman/pacbio_f2016/GM12878/PB36/STAR/STAR_out/PB36_pool/Aligned.out.sam > test_TranscriptClean/c42860-f2p4-2485.sam

For now, the best way to fix this may be to decline fixing transcripts containing an exon that is smaller in size than the correction distance.

python /pub/dwyman/clean_splice_jns/TranscriptClean.py --sam test_TranscriptClean/c42860-f2p4-2485.sam --genome /bio/dwyman/pacbio_f2016/data/STAR_hg38_ENCODE/hg38.fa --spliceJns /bio/dwyman/pacbio_f2016/data/GM12878_illumina_SJs_ENCODE/SJ.out.tab --maxLenIndel 5 --maxSJOffset 5 --outprefix test_TranscriptClean/c42860-f2p4-2485 --variants /bio/dwyman/pacbio_f2016/data/NA12878_variants/NA12878.vcf.gz
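A sketch of the guard proposed above (an idea only, not existing TranscriptClean code): refuse to shift a junction when either flanking exon is shorter than the distance it would be moved.

def safe_to_correct(left_exon_len, right_exon_len, correction_dist, max_sj_offset=5):
    dist = abs(correction_dist)
    if dist > max_sj_offset:
        return False
    # Moving a junction further than a flanking exon is long would consume the
    # whole exon and leave a zero-length exon, as in the case above.
    return dist < left_exon_len and dist < right_exon_len

print(safe_to_correct(3, 120, correction_dist=5))   # False: would consume the 3-bp exon
print(safe_to_correct(85, 120, correction_dist=3))  # True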

TranscriptClean.py samFile not found error

Hi there,

I'm having a little trouble running TranscriptClean. I have a set of sorted bam files I wanted to run through it prior to Talon. I converted each bam back to sam (e.g. samtools view -h my_sorted.bam > my_sorted.sam) and submitted the following:

TranscriptClean --sam /my/indir/my_sorted.sam --genome /my/reference/my_genome.fa --outprefix /my/outdir/my_prefix

All dependencies are installed and seemingly run fine. The error received suggests there is an issue reading the SAM file:

File "/opt/software/TranscriptClean/TranscriptClean.py", line 1593, in
main()
File "/opt/software/TranscriptClean/TranscriptClean.py", line 37, in main
header, sam_chroms, sam_chunks = split_SAM(sam_file, n_threads)
File "/opt/software/TranscriptClean/TranscriptClean.py", line 502, in split_SAM
with open(samFile, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: ''

I've had a look through the TranscriptClean.py source code and my sam files but can't find what may be causing the conflict. The sam files possess headers and seem to work fine with other analyses and the paths are correct in the submission script. Is this something you've maybe seen before? I must have made a mistake somewhere. Hopefully it's just a daft error, but I can't see it?

Many thanks for your time & insights,

Dave

invalid literal for int() with base 10: '1307=1' and UserWarning: Problem parsing transcript with ID 'transcript/10670'

Hi,

I am getting issues like the ones below when I run TranscriptClean:
"Correcting transcripts...
invalid literal for int() with base 10: '1307=1'
invalid literal for int() with base 10: '15=1'
invalid literal for int() with base 10: '1094=1'
invalid literal for int() with base 10: '1094=1'
invalid literal for int() with base 10: '1093=1'
invalid literal for int() with base 10: '1509=1'
invalid literal for int() with base 10: '511=1'
invalid literal for int() with base 10: '91=15588'
invalid literal for int() with base 10: '77=19737'
invalid literal for int() with base 10: '77=19737'
.."

Also,
"/data_disk2/software/TranscriptClean-2.0.3/TranscriptClean.py:339: UserWarning: Problem parsing transcript with ID 'transcript/10670'
warnings.warn("Problem parsing transcript with ID '" +
/data_disk2/software/TranscriptClean-2.0.3/TranscriptClean.py:339: UserWarning: Problem parsing transcript with ID 'transcript/10345'
warnings.warn("Problem parsing transcript with ID '" +
/data_disk2/software/TranscriptClean-2.0.3/TranscriptClean.py:339: UserWarning: Problem parsing transcript with ID 'transcript/11633'
warnings.warn("Problem parsing transcript with ID '" +
/data_disk2/software/TranscriptClean-2.0.3/TranscriptClean.py:339: UserWarning: Problem parsing transcript with ID 'transcript/11869'
warnings.warn("Problem parsing transcript with ID '" +
/data_disk2/software/TranscriptClean-2.0.3/TranscriptClean.py:339: UserWarning: Problem parsing transcript with ID 'transcript/23980'
warnings.warn("Problem parsing transcript with ID '" +
/data_disk2/software/TranscriptClean-2.0.3/TranscriptClean.py:339: UserWarning: Problem parsing transcript with ID 'transcript/224'
warnings.warn("Problem parsing transcript with ID '" +"

Can you please help me to fix the issue?

Thanks
Philge

Migrate from pyfasta to pyfaidx

I'm having problems with TranscriptClean's pyfasta dependency in Python 3.7 and 3.8:

$ python3 -m venv /tmp/venv
$ source /tmp/venv/bin/activate
$ pip list | grep pyfasta
pyfasta         0.5.2
(venv) $ python -c 'from pyfasta import Fasta'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/venv/lib/python3.7/site-packages/pyfasta/__init__.py", line 3, in <module>
    from fasta import Fasta, complement, DuplicateHeaderException
ModuleNotFoundError: No module named 'fasta'

The pyfasta project has not been updated in eight years (since 2014) and the author has since marked his repository read only and suggested people migrate to pyfaidx. Would it be feasible to migrate to pyfaidx in TranscriptClean?

list index out of range

Hi,

Thanks for an awesome piece of software. I have used TranscriptClean on large-scale assemblies with high success before.

However, I am now running a metatranscriptomics project in which I am only interested in reads that map to the COI gene of the taxon of interest. I have 6504 reads that map to the full-length reference COI sequence. I have sorted the .sam file with samtools, but when I run TranscriptClean I get the error "list index out of range". When I inspect the mapping in a genome map viewer, it looks good, albeit with some gaps here and there. Still, all my reads are within the reference.

I sort the samples
samtools sort -O sam -T sample.sort -o sample.sort.sam mapped1.sam

I run this command
python ....../TranscriptClean.py --sam sample.sort.sam --genome mygenome.fasta --out outfile

The program then returns
list index out of range Took 0:00:54 to process transcript batch. Took 0:00:00 to combine all outputs.
Below is a snippet from the sorted .sam file.

@HD VN:1.0 SO:coordinate @SQ SN:Facetotecta LN:1527 @RG ID:Unpaired_reads_assembled_against_Facetotecta SM: @PG ID:samtools PN:samtools VN:1.14 CL:samtools sort -O sam -T sample.sort -o sample.sort.sam mapped1_sorted_Facetotecta_cut_extraction.sam m54057_190926_040405/25100833/ccs_1 0 Facetotecta 1 255 2M1P2M1P1M5P1M10P1M3P1M2P1M5P1M4P1M1P1M2P1M2P1M1P1M1P1M1P2M3P1M2P1M2P1M1P1M2P1M1P1M2P4M3P1M1P1M1P1M2P2M1P1M2P1M5P2M3P1M1P1M2P2M4P2M1P1M2P1M2P1M8P1M3P1M21P2M4P1M5P1M2P1M2P1M3P1M4P1M5P1M3P1M5P1M1P1M3P2M3P1M1P1M6P1M1P1M3P1M2P1M1P1M10P1M2P1M18P1M1P2M4P1M6P1M1P1M8P2M3P2M11P1M2P1M6P1M2P1M6P1M3P2M2P1M7P1M6P1M1P1M1P1M1P1M1P1M9P1M8P3M5P1M1P1M1P1M1P1M1P2M7P1M3P1M1P1M1P1M2P1M1P1M14P1M1P1M4P1M4P1M1P1M12P3M2P1M6P1M1P1M3P2M2P1M1P1M1P2M3P1M3P2M3P1M3P2M1P1M4P2M23P1M4P1M4P1M8P1M2P1M1P1M1P1M17P1M1P1M5P1M3P1M1P1M16P1M1P1M1P2M3P1M5P1M1P3M1P3M1P1M2P2M4P2M1P1M1P1M5P2M3P1M7P1M5P1M2P1M2P1M1P1M1P1M1P1M1P1M3P2M1P1M30P1M1P1M2P1M1P2M8P1M3P1M8P1M1P1M1P1M8P2M1P2M1P1M1P1M4P2M5P1M1P1M4P1M10P1M5P1M4P1M5P1M2P1M10P1M1P1M1P1M3P1M4P1M4P2M1P2M9P2M4P2M3P2M2P1M1P1M3P2M2P1M2P2M2P3M3P1M17P1M4P1M1P1M3P2M2P1M4P1M8P1M1P1M1P1M2P1M2P3M1P1M3P1M4P1M1P2M2P1M1P1M3P1M3P1M3P1M4P1M3P1M1P1M3P1M1P1M5P2M2P1M1P2M1P1M3P1M1P1M1P1M3P1M8P2M1P2M1P1M2P2M11P1M3P2M8P1M1P2M14P1M14P2M8P1M1P2M3P1M4P1M5P1M1P1M1P1M1P1M2P2M6P1M1P1M1P1M1P1M4P1M2P1M3P1M6P1M2P1M2P2M2P1M1P3M1P1M2P1M6P1M2P1M1P1M2P1M9P1M2P2M4P1M4P1M5P1M6P1M1P1M1P1M3P1M3P1M4P1M7P1M8P1M8P1M9P1M16P1M2P1M18P1M4P1M12P1M6P1M3P1M3P1M2P1M6P1M1P1M12P1M1P1M1P1M12P2M7P1M1P1M3P1M3P1M1P2M1P1M4P1M3P1M3P1M1P1M7P1M3P1M2P2M21P1M6P1M3P1M1P2M29P2M2P1M2P1M1P1M81P2M1P1M4P2M1P1M2P1M2P1M1P2M1P1M1P1M1P2M1P1M4P2M1P1M2P2M10P1M3P1M3P1M1P1M4P1M1P1M1P1M1P1M2P1M1P1M1P2M6P1M9P1M3P1M2P2M3P1M7P1M2P1M3P1M1P3M4P1M6P1M2P1M1P2M1P1M3P1M1P1M9P1M1P1M1P2M3P1M4P2M3P3M1P1M10P1M8P1M4P1M2P1M4P1M2P1M2P1M4P2M5P1M2P1M5P1M1P2M3P1M1P1M1P2M13P1M1P1M1P1M2P1M2P1M12P1M9P1M1P1M1P1M1P1M2P1M3P2M2P1M2P1M8P1M1P2M1P2M1P3M10P2M4P1M2P1M4P1M4P1M1P1M8P1M2P1M1P1M4P2M1P2M2P1M2P1M3P1M9P1M5P2M4P2M17P1M1P1M13P1M2P1M3P1M11P1M2P1M10P1M2P1M22P1M1P1M19P1M4P1M3P1M14P1M5P1M3P1M2P1M3P1M5P1M12P1M11P1M2P1M2P1M6P1M2P1M10P1M1P1M9P1M3P1M1P1M4P1M2P1M2P1M1P1M12P1M3P1M2P1M1P1M1P1M2P1M2P1M3P1M2P1M4P1M5P2M1P1M2P1M2P1M1P1M2P1M3P1M3P1M6P1M1P1M3P1M2P1M6P1M3P1M6P1M1P1M3P1M1P1M1P1M4P1M4P1M8P1M6P1M1P1M1P1M2P2M * 0 0 ATGAAACGATGATTATTTTCCACTAACCACAAAGACATTGGTACAATGTACTTTATCCTGGGAGCGTGATCAGGTATAATCGGTACTGGTATAAGAATACTTATTCGAAGGGAACTAGGTCAACCCGGTAGACTTATTGGTAATGACCAAATTTACAACGTAATTGTTACAGCTCATGCATTTATCATAATTTTCTTTATAGTTATACCTATTATAATTGGAGGCTTTGGCAATTGGCTTGTTCCTCTTATAATTGGAGCTCCTGATATAGCCTTCCCTCGAATAAACAATATAAGATTTTGACTTCTTCCTCCTTCCCTCTCTCTTCTTTTATCAAGAAGATTAACTGAATCTGGAGTTGGAACAGGATGAACAGTTTACCCTCCTCTTTCAAGTAATATTGCCCACAGTGGTATTTCCGTTGACTTAGCTATCTTCTCACTCCATTTGGCAGGAGCAAGATCAATTTTAGGTGCCATTAATTTCATTACTACTATCATCAATATACGTAATAAAATAATCACAATAGACCGATTACCTCTATTTGTATGATCAGTTTTCATCACAGCGTTTCTCC * RG:Z:Unpaired_reads_assembled_against_Facetotecta m54057_190926_040405/7602703/ccs_2 0 Facetotecta 1 255 
2M1P2M1P1M5P1M10P1M3P1M2P1M5P1M4P1M1P1M2P1M2P1M1P1M1P1M1P2M3P1M2P1M2P1M1P1M2P1M1P1M2P4M3P1M1P1M1P1M2P2M1P1M2P1M5P2M3P1M1P1M2P2M4P2M1P1M2P1M2P1M8P1M3P1M21P2M4P1M5P1M2P1M2P1M3P1M4P1M5P1M3P1M5P1M1P1M3P2M3P1M1P1M6P1M1P1M3P1M2P1M1P1M10P1M2P1M18P1M1P2M4P1M6P1M1P1M8P2M3P2M11P1M2P1M6P1M2P1M6P1M3P2M2P1M7P1M6P1M1P1M1P1M1P1M1P1M9P1M8P3M5P1M1P1M1P1M1P1M1P2M7P1M3P1M1P1M1P1M2P1M1P1M14P1M1P1M4P1M4P1M1P1M12P3M2P1M6P1M1P1M3P2M2P1M1P1M1P2M3P1M3P2M3P1M3P2M1P1M4P2M23P1M4P1M4P1M8P1M2P1M1P1M1P1M17P1M1P1M5P1M3P1M1P1M16P1M1P1M1P2M3P1M5P1M1P3M1P3M1P1M2P2M4P2M1P1M1P1M5P2M3P1M7P1M5P1M2P1M2P1M1P1M1P1M1P1M1P1M3P2M1P1M30P1M1P1M2P1M1P2M8P1M3P1M8P1M1P1M1P1M8P2M1P2M1P1M1P1M4P2M5P1M1P1M4P1M10P1M5P1M4P1M5P1M2P1M10P1M1P1M1P1M3P1M4P1M4P2M1P2M9P2M4P2M3P2M2P1M1P1M3P2M2P1M2P2M2P3M3P1M17P1M4P1M1P1M3P2M2P1M4P1M8P1M1P1M1P1M2P1M2P3M1P1M3P1M4P1M1P2M2P1M1P1M3P1M3P1M3P1M4P1M3P1M1P1M3P1M1P1M5P2M2P1M1P2M1P1M3P1M1P1M1P1M3P1M8P2M1P2M1P1M2P2M11P1M3P2M8P1M1P2M14P1M14P2M8P1M1P2M3P1M4P1M5P1M1P1M1P1M1P1M2P2M6P1M1P1M1P1M1P1M4P1M2P1M3P1M6P1M2P1M2P2M2P1D1P3D1P1D2P1M6P1M2P1M1I1M2P1M8P1I1M2P2M4P1M4P1M5P1M6P1M1P1M1P1M3P1M3P1M4P1M4P3I1M8P1M8P1M9P1M16P1M2P1M18P1M4P1M12P1M6P1M3P1M3P1M2P1M6P1M1P1M12P1M1P1M1P1M12P2M7P1M1P1M3P1M3P1M1P2M1P1M4P1M3P1M3P1M1P1M7P1M3P1M2P2M21P1M6P1M3P1M1P2M29P2M2P1M2P1M1P1M81P2M1P1M4P2M1P1M2P1M2P1M1P2M1P1M1P1M1P2M1P1M4P2M1P1M2P2M10P1M3P1M3P1M1P1M4P1M1P1M1P1M1P1M2P1M1P1M1P2M6P1M9P1M3P1M2P2M3P1M7P1M2P1M3P1M1P3M4P1M6P1M2P1M1P2M1P1M3P1M1P1M9P1M1P1M1P2M3P1M4P2M3P3M1P1M10P1M8P1M4P1M2P1M4P1M2P1M2P1M4P2M5P1M2P1M5P1M1P2M3P1M1P1M1P2M13P1M1P1M1P1M2P1M2P1M12P1M9P1M1P1M1P1M1P1M2P1M3P2M2P1M2P1M8P1M1P2M1P2M1P3M10P2M4P1M2P1M4P1M4P1M1P1M8P1M2P1M1P1M4P2M1P2M2P1M2P1M3P1M9P1M5P2M4P2M17P1M1P1M13P1M2P1M3P1M11P1M2P1M10P1M2P1M22P1M1P1M19P1M4P1M3P1M14P1M5P1M3P1M2P1M3P1M5P1M12P1M11P1M2P1M2P1M6P1M2P1M10P1M1P1M9P1M3P1M1P1M4P1M2P1M2P1M1P1M12P1M3P1M2P1M1P1M1P1M2P1M2P1M3P1M2P1M4P1M5P2M1P1M2P1M2P1M1P1M2P1M3P1M3P1M6P1M1P1M3P1M2P1M6P1M3P1M6P1M1P1M3P1M1P1M1P1M4P1M4P1M8P1M6P1M1P1M1P1M2P2M4P3M2P1M2P1I1M3P1M4P1M20P1D4P1M2P1M1P1M1P2M13P1M6P2M2P1M5P2M2P2M1P1M1P1M2P1M1P1M1P1M1P1M1P1M4P3M1P2M1P1M2P1M1P1M3P6M4P1M1P2M1P2M2P2M2P1M5P1M1P1M13P1M2P1M1P1M1P1M1P1M8P1M8P1M2P1M4P1M3P1M1P2M14P3M1P1M4P1M3P1M2P1M2P1M12P2M1P3M2P2M2P2M11P1M1P1M2P1D6P1D3P1D7P1D2P1D14P1D3P1M2P1M1P1M3P2M1P1M4P2M2P3M1P2M1P1M1P2M1P1M1P1M2P2M2P2M1P2M8P2M1P3M1P1M1P5M2P1M8P2M1P1M1P2M2P1M1P2M1P1M2P1M1P1M1P3M2P2M1P1M1P1M1P2M2P1M8P1M1P1M1P2M3P1M1P3M2P1M4P1M2P2M2P1M1P1M1P1M1P2M3P1M2P2M51P1M1P1M1P1M1P1M2P2M1P2M208P1M1P1M1P1M1P1M1P1M1P1M2P1M1P1M1P2M9P1M1P2M1P2M4P1M1P1M4P1M1P1M1P1M3P1M3P1M1P1M1P1M1P1M1P1M14P1M4P1M * 0 0 ATGAAACGATGATTATTTTCAACCAATCATAAAGATATTGGAACTATATATATAATATTCGGCGCCTGATCCGGCACTATAGGAGTGGCAATAAGAATAATTATCCGTAGAGAACTAGGGCAACCCGGTTCTCTAATTGGTAACGATCAAATCTATAATGTAATTGTAACTGCCCACGCCTTTATCATAATTTTCTTTATAGTAATACCAATCATAATTGGAGGATTTGGAAACTGACTAATTCCTCTGATATTAGGATCCCCTGATATAGCATTTCCACGGATAAATAACATAAGATTCTGACTACTCCCCCCATCATTAATTCTTTTAATTAGAAGAAGACTAACAGAAAGGGGGGTAGGAACAGGATGAACGGTCTATCCTCCTCTTTCAAGAAATATCTCTCATAGAGGAGTCTCAGTAGACATGGCCATCTTCTCCCTCCACTTAGCTGGAGCAAGATCCATTTTAGGAGCCATTAATTTTATTACTACGATCATTAATATACGCAACAAAAACCTTTCTTTTGACCGTCTACCATTATTAGTATGATCTATCTTTATTACTACTATCCTTTTACTACTTTCTTTACCAGTACTTGCCGGAGCTATTACCATACTATTAACAGATCGAAATATTAATACTTCATTCTTTGATCCAGGTGGGGATCCTGTATTATATCAACATCTATTTTGATTTTTCGGACACCCAGAAGTTTATATTTTAATTCTACCAGGGTTTGGAATAGTTTCCCACATTATTAGACAAGAAAG *

Any ideas about what is going wrong? It looks like TranscriptClean cannot run without a proper genome map or chromosome list, but I wanted to ask in case others are getting the same "error".

Appreciate any help and I am open to other solutions.
Niklas


Running slow on grid with qsub

I was wondering about a preferable way to run the Python script faster on the grid. This should parallelize, but it still seems to run very slowly.

Can I split my SAM into numerous sub-SAMs and submit them as separate jobs? There are non-primary mappings that may end up in different files; would this affect TranscriptClean?
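If it helps, a minimal sketch of splitting a SAM into pieces while keeping the header in each piece (filenames are just examples; whether non-primary alignments landing in different pieces matters is a question for the authors):

def split_sam(path, n_pieces, prefix="chunk"):
    with open(path) as f:
        lines = f.readlines()          # loads the whole file; fine as a sketch
    header = [l for l in lines if l.startswith("@")]
    reads = [l for l in lines if not l.startswith("@")]
    per_piece = -(-len(reads) // n_pieces)              # ceiling division
    for i in range(n_pieces):
        piece = reads[i * per_piece:(i + 1) * per_piece]
        with open(f"{prefix}_{i}.sam", "w") as out:
            out.writelines(header + piece)              # header repeated in every piece

split_sam("my_reads.sam", n_pieces=8)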

generate_report.R has an unused argument

Hi,
I am having an issue with the generate_report.R script. I am attempting to run TC to correct my PacBio reads.
Below are commands:
python ~/TC/TranscriptClean.py --threads 30 --sam fruit.sorted.sam --genome final.genome.fasta --spliceJns all_sample.SJ.out.tab --maxLenIndel=50 --maxSJOffset=80000 --outprefix fruit
and I get these files: fruit_clean.fa, fruit_clean.log, fruit_clean.sam, fruit_clean.TE.log

Then I want to use the generate_report.R script to visualize the TranscriptClean results, with the following command:
Rscript ~/TC/generate_report.R ./fruit

But I get an error like this:
Error in read_delim(logFileTE, "\t", escape_double = FALSE, col_names = TRUE, :
unused argument (trim_ws = TRUE) Calls: main -> read_delim

If I remove 'trim_ws = TRUE' from the script, I get the following result:
[1] "Reading log files............"
[1] "Creating tables.............."
[1] "Plot 1.................."
[1] "Plot 2.................."
[1] "Plot 3.................."
[1] "Plot 5.................."
[1] "Plot 6.................."
null device
1
There is no plot 4, and the fruit_report.pdf does not contain the noncanonical junction counts. Where is this going wrong? Could you respond?
Thanks.

warning message: "Problem encountered while correcting transcript " single positional indexer is out-of-bounds

I am facing the following error:

/home/yanlab/Tools/TranscriptClean/TranscriptClean.py:449: UserWarning: Problem encountered while correcting transcript with ID 51ae56aa-43a8-4e20-bbc5-85ed8ea749bb. Will output original version.
warnings.warn(("Problem encountered while correcting transcript "
single positional indexer is out-of-bounds
/home/yanlab/Tools/TranscriptClean/TranscriptClean.py:449: UserWarning: Problem encountered while correcting transcript with ID ef176324-7623-4a17-8caf-a50ae7ed7859. Will output original version.
warnings.warn(("Problem encountered while correcting transcript "
single positional indexer is out-of-bounds
/home/yanlab/Tools/TranscriptClean/TranscriptClean.py:449: UserWarning: Problem encountered while correcting transcript with ID dfa9ff9b-1b1c-48ce-9ae8-b803c3ffbb6a. Will output original version.
warnings.warn(("Problem encountered while correcting transcript "
single positional indexer is out-of-bounds
/home/yanlab/Tools/TranscriptClean/TranscriptClean.py:449: UserWarning: Problem encountered while correcting transcript with ID 8f1a89de-355a-431a-b6d3-721a49d84f85. Will output original version.
warnings.warn(("Problem encountered while correcting transcript "
single positional indexer is out-of-bounds

The command lines used:

python ./accessory_scripts/get_SJs_from_gtf.py --f GRCh38_basic.gtf --g GRCh38.fa --o sj_file.tsv
python ./TranscriptClean.py -t 20 -s sample.sam -g GRCh38.fa -o output_file -j sj_file.tsv

Could you help me?

AssertionError

Hi,
I'm using TranscriptClean to correct mismatches in long reads

(python2.7 TranscriptClean.py --sam /genoma/cDNA_Bham_Run1.sam --genome /hg38/GRCh38.primary_assembly.genome_filter.fa --outprefix /Map/outputfile_transClean)

But I had the following error:

Reading genome ..............................
No splice annotation provided. Will skip splice junction correction.
No variant file provided. Transcript correction will not be variant-aware.
Processing SAM file .........................
Traceback (most recent call last):
File "TranscriptClean.py", line 994, in
main()
File "TranscriptClean.py", line 132, in main
canTranscripts, noncanTranscripts = processSAM(samFile, genome, sjDict, oSam, oFa, transcriptLog, primaryOnly)
File "TranscriptClean.py", line 207, in processSAM
t = Transcript2(line, genome, spliceAnnot)
File "/home/joel/Documents/Soft_joe/TranscriptClean/transcript2.py", line 60, in init
self.NM, self.MD = self.getNMandMDFlags(genome)
File "/home/joel/Documents/Soft_joe/TranscriptClean/transcript2.py", line 344, in getNMandMDFlags
refBase = genome.sequence({'chr': self.CHROM, 'start': genomePos, 'stop': genomePos}, one_based=True)
File "/home/joel/anaconda3/envs/py27/lib/python2.7/site-packages/pyfasta/fasta.py", line 197, in sequence
assert 'chr' in f and f['chr'] in self, (f, f['chr'], self.keys())
AssertionError: ({'start': 104265275, 'chr': 'chr12', 'stop': 104265275}, 'chr12', ['chr5 5', 'chrY Y', 'chr4 4', 'chr7 7', 'chr19 19', 'chr22 22', 'chr8 8', 'chr14 14', 'chr17 17', 'chr21 21', 'chr6 6', 'chr3 3', 'chr2 2', 'chr20 20', 'chr12 12', 'chrX X', 'chr1 1', 'chr11 11', 'chr9 9', 'chr18 18', 'chr10 10', 'chr15 15', 'chrM MT', 'chr16 16', 'chr13 13'])

Do you have any suggestions to correct it?
Thanks

Assertion Error

Hi, I'm trying to use TranscriptClean to remove indels from PacBio data. Would you be able to clarify this error? I also included a sample line from the SAM file I'm trying to process.
Thanks!

Run & error:

lauren@notebook:~/results$ docker run -v /home/lauren/data:/data -v /home/lauren/results:/results transcriptclean /bin/bash -c "python /TranscriptClean-master/TranscriptClean.py --sam /results/test.sam --genome /data/ref_seqs/barcoded_stlA_amplicon.fa --outprefix /results/tc_ --dryRun"
Reading genome ..............................
Dry run mode: Cataloguing indels.........
Traceback (most recent call last):
  File "/TranscriptClean-master/TranscriptClean.py", line 1023, in <module>
    main()
  File "/TranscriptClean-master/TranscriptClean.py", line 100, in main
    dryRun_recordIndels(samFile, outprefix, genome)
  File "/TranscriptClean-master/TranscriptClean.py", line 980, in dryRun_recordIndels
    transcript = Transcript2(line, genome, spliceAnnot)
  File "/TranscriptClean-master/transcript2.py", line 60, in __init__
    self.NM, self.MD = self.getNMandMDFlags(genome)
  File "/TranscriptClean-master/transcript2.py", line 361, in getNMandMDFlags
    refBase = genome.sequence({'chr': self.CHROM, 'start': genomePos, 'stop': genomePos}, one_based=True) 
  File "/opt/conda/lib/python2.7/site-packages/pyfasta/fasta.py", line 197, in sequence
    assert 'chr' in f and f['chr'] in self, (f, f['chr'], self.keys())
AssertionError: ({'start': 1, 'chr': 'barcoded_stlA_amplicon.dna', 'stop': 1}, 'barcoded_stlA_amplicon.dna', ['barcoded_stlA_amplicon.dna  (2103 bp)'])

SAM file:

lauren@notebook:~/results$ head test.sam
@HD	VN:1.9	SO:coordinate	pb:3.0.7
@SQ	SN:barcoded_stlA_amplicon.dna	LN:2103
@RG	ID:default	PL:PACBIO	DS:READTYPE=UNKNOWN	PU:default	SM:UnnamedSample	PM:SEQUEL
@PG	ID:pbmm2	PN:pbmm2	VN:1.0.0 (commit 1.0.0)	CL:pbmm2 align data/ref_seqs/barcoded_stlA_amplicon.fa results/ccs_trimmed_filtered.fq results/ctf.bam --preset CCS --sort -o 7
m54328_190509_191658/4260017/ccs	0	barcoded_stlA_amplicon.dna	1	60	595M1D4M1I580M1M385M1I312M3I30M91M3I104M	*	0	AAGCCCGCTTATTTTTTACATGCCAATACAATGTAGGCTGCTCTACACCTAGCTTCTGGGCGAGTTTACGGGTTGTTAAACCTTCGATTCCGACCTCATTAAGCAGCTCTAATGCGCTGTTAATCACTTTACTTTTATCTAATCTAGACATCATTAATTCCTAATTTTTGTTGACACTCTATCATTGATAGAGTTATTTTACCACTCCCTATCAGTGATAGAGAAAAGTGAACTCTAGAAATAATTTTGTTTAACTTTAAGAAGGAGATATACATATGAAAGCTAAAGATGTTCAGCCAACCATTATTATTAATAAAAATGGCCTTATCTCTTTGGAAGATATCTATGACATTGCGATAAAACAAAAAAAAGTAGAAATATCAACGGAGATCACTGAACTTTTGACGCATGGTCGTGAAAAATTAGAGGAAAAATTAAATTCAGGAGAGGTTATATATGGAATCAATACAGGATTTGGAGGGAATGCCAATTTAGTTGTGCCATTTGAGAAAATCGCAGAGCATCAGCAAAATCTGTTAACTTTTCTTTCTGCTGGTACTGGGGACTATATGTCCAAACCTTGTATTAAAGCGTCCAATGTTACTATGTTACTTTCTGTTTGCAAAGGTTGGTCTGCAACCAGACCAATTGTCGCTCAAGCAATTGTTGATCATATTAATCATGACATTGTTCCTCTGGTTCCTCGCTATGGCTCAGTGGGTGCAAGCGGTGATTTAATTCCTTTATCTTATATTGCACGAGCATTATGTGGTATCGGCAAAGTTTATTATATGGGCGCAGAAATTGACGCTGCTGAAGCAATTAAACGTGCAGGGTTGACACCATTATCGTTAAAAGCCAAAGAAGGTCTTGCTCTGATTAACGGCACCCGGGTAATGTCAGGAATCAGTGCAATCACCGTCATTAAACTGGAAAAACTATTTAAAGCCTCAATTTCTGCGATTGCCCTTGCTGTTGAAGCATTACTTGCATCTCATGAACATTATGATGCCCGGATTCAACAAGTAAAAAATCATCCTGGTCAAAACGCGGTGGCAAGTGCATTGCGTAATTTATTGGCAGGTTCAACGCAGGTTAATCTATTATCTGGGGTTAAAGAACAAGCCAATAAAGCTTGTCGTCATCAAGAAATTACCCAACTAAATGATACCTTACAGGACGTTTATTCAATTCGCTGTGCACCACAAGTATTAGGTATAGTGCCAGAATCTTTAGCTACCGCTCGGAAAATATTGGAACGGGAAGTTATCTCAGCTAATGATAATCCATTGATAGATCCAGAAAATGGCGATGTTCTACACGGTGGAAATTTTATGGGGCAATATGTCGCCCGAACAATGGATGCATTAAAACTGGATATTGCTTTAATTGCCAATCATCTTCACGCCATTGTGGCTCTTATGATGGATAACCGTTTCTCTCGTGGATTACCTAATTCACTGAGTCCGACACCCGGCATGTATCAAGGTTTTAAAGGCGTCCAACTTTCTCAAACCGCTTTAGTTGCTGCAATTCGCCATGATTGTGCTGCATCAGGTATTCATAGCCCTCGCCACAGAACAATACAATCAAGATATTGTCAGTTTAGGTCTGCATGCCGCTCAAGATGTTTTAGAGATGGAGCAGAAATTACGCAATATTGTTTCAATGACAATTCTGGTAGTTTGTCAGGCCATTCATCTTCGCGGCAATATTAGTGAAATTGCGCCTGAAACTGCTAAATTTTACCATGCAGTACGCGAAATCAGTTCTCCTTTGATCACTGATCGTGCGTTGGATGAAGATATAATCCGCATTGCGGATGCAATTATTAATGATCAACTTCCTCTGCCAGAAATCATGCTGGAAGAATAACAGCACAAGTGAGCATATACGTAAACTTTGTACCCCGTCACTCAAAGGCGGTAGTACGGGTTTTGCTGCCCGCAAACGGGCTGTTCTGGTGTTGCTAGTTTGTTATCAGAATCGCAGATCCGGCTTCAGCCGGTTTGCCGGCTGAAAGCGCTATTTCTTCCAGAATTGCCATGATTTTTTCCCCACGGGAGGCGTCACTGGCTCCCGTGTTGTCGGCAGCTTTGATTCGATAAGC	
~~~s~~~~~~~.~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~d~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~s~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~q~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~z~~~~~~~~~v~~~~~~~~~~~~~~~~v~~~~~~~~~~~~~~~~~~~~Z~~~~Y~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~p~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~X~~~~G~~~~~~~~~~~~~~~~~~p~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\~~~~~~~~~~~M~~~~~~~~~~~~~~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`~~~~~~~~~~~~~~~~~~Y~~~~~~~~~~~~~~~~~~Z~~~~~~~~~~~~y~~~~~~~~~~~~~~~~~~}~~~~~~~~~~~~~~m~~~~~~~~~~~x~~~~~~v3~m~~C~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~p~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~}~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~P~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~4~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~R~~~~~~~~~~~~~~~~y~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~c~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~H~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~m~~~~~~~~~~~~~~~~~u~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~F~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~x~~~~~~~~~~~~~~~~p~~~~~~~~~~~~~~~~~~~~~~n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`v~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~t~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~I~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~U~~~~~~~~~~v~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~b<'ae~D~A~I~*~~~~/S~~~Z~~tP~IF)j~Q~~~~~~~~~~~~~~~~~~~~~~D~~~~~~~~~~~~~~~~~~~~e~~~~~v~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{~~~~~~~~~~~~~~~~~~~~J~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~	RG:Z:defaultqs:i:0	qe:i:2110	mc:f:98.1043

TODO: Implement function in transcript2 class to compute jM and jI sam fields

Output sam files from the STARlong aligner contain two custom tags (jM and jI) that describe whether each splice junction is canonical and where each intron begins and ends. TranscriptClean uses these tags when correcting noncanonical splice junctions. However, not everyone can use STARlong to align their transcripts. Computing the jM and jI fields directly in my script would expand the splice junction correction feature to more people.
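A sketch of how this could look (my approximation of STAR's jM/jI convention; the motif codes should be double-checked against the STAR manual, and get_seq is a placeholder genome accessor):

import re

# STAR motif codes, as I understand them: 0 non-canonical, 1 GT/AG, 2 CT/AC,
# 3 GC/AG, 4 CT/GC, 5 AT/AC, 6 GT/AT (with 20 added for annotated junctions).
MOTIF_CODES = {("GT", "AG"): 1, ("CT", "AC"): 2, ("GC", "AG"): 3,
               ("CT", "GC"): 4, ("AT", "AC"): 5, ("GT", "AT"): 6}

def compute_jM_jI(chrom, pos, cigar, get_seq):
    # pos: 1-based leftmost mapping position; get_seq(chrom, start, end) returns
    # the 1-based, inclusive genomic sequence (stand-in for any genome accessor).
    jM, jI = [], []
    genome_pos = pos
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op == "N":                                   # an intron
            intron_start = genome_pos
            intron_end = genome_pos + length - 1
            donor = get_seq(chrom, intron_start, intron_start + 1).upper()
            acceptor = get_seq(chrom, intron_end - 1, intron_end).upper()
            jI.extend([intron_start, intron_end])
            jM.append(MOTIF_CODES.get((donor, acceptor), 0))
        if op in "MDN=X":                               # reference-consuming ops
            genome_pos += length
    return jM, jI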

SAM headers not found in FASTA

I think I have some idea of how to fix the error message but I wanted to check that I am understanding properly.

I am trying to correct nanopore dRNA reads with splice junctions extracted from mapping Illumina TruSeq reads with STAR against GRCm38.dna.primary.assembly.fa

Because the Ensembl FASTA has messy header titles, it seems to be causing an issue. Based on the error message below, can I clean up the header titles in the primary_assembly file to include only the main part, so that they match the abbreviated headers used in the SAM file?

python TranscriptClean.py --thread 20 --sam /analysisdata/rawseq/fastq/SHARED/000078/Mouse_aging/BAM/all.sam --genome /home/callum/Genome_files/Mus_musculus.GRCm38.dna.primary_assembly.fa --outprefix /home/callum/Mouse_aging/TranscriptClean_output --spliceJns /home/scratch/callum/MM_brain_cDNA_TruSeq_stranded.2ndPassSJ.out.tab
Traceback (most recent call last):
  File "TranscriptClean.py", line 1593, in <module>
    main()
  File "TranscriptClean.py", line 38, in main
    validate_chroms(options.refGenome, options.variantFile, sam_chroms)
  File "TranscriptClean.py", line 468, in validate_chroms
    raise RuntimeError(error_msg)
RuntimeError: One or more SAM chromosomes were not found in the fasta reference.
SAM chromosomes:
{"JH584299.1", "GL456382.1", "JH584295.1", "JH584294.1", "JH584297.1", "1", "GL456360.1", "GL456396.1", "GL456366.1", "14", "GL456212.1", "GL456216.1", "GL456370.1", "12", "GL456379.1", "GL456221.1", "Y", "GL456378.1", "MT", "15", "GL456211.1", "4", "13", "3", "5", "17", "GL456210.1", "JH584293.1", "GL456359.1", "GL456213.1", "7", "JH584296.1", "11", "10", "9", "X", "JH584298.1", "19", "8", "18", "JH584304.1", "2", "GL456350.1", "GL456233.1", "6", "16"}
FASTA chromosomes:
{"GL456239.1 dna:scaffold scaffold:GRCm38:GL456239.1:1:40056:1 REF", "GL456360.1 dna:scaffold scaffold:GRCm38:GL456360.1:1:31704:1 REF", "11 dna:chromosome chromosome:GRCm38:11:1:122082543:1 REF", "JH584295.1 dna:scaffold scaffold:GRCm38:JH584295.1:1:1976:1 REF", "16 dna:chromosome chromosome:GRCm38:16:1:98207768:1 REF", "GL456350.1 dna:scaffold scaffold:GRCm38:GL456350.1:1:227966:1 REF", "JH584292.1 dna:scaffold scaffold:GRCm38:JH584292.1:1:14945:1 REF", "GL456394.1 dna:scaffold scaffold:GRCm38:GL456394.1:1:24323:1 REF", "7 dna:chromosome chromosome:GRCm38:7:1:145441459:1 REF", "JH584297.1 dna:scaffold scaffold:GRCm38:JH584297.1:1:205776:1 REF", "GL456219.1 dna:scaffold scaffold:GRCm38:GL456219.1:1:175968:1 REF", "GL456385.1 dna:scaffold scaffold:GRCm38:GL456385.1:1:35240:1 REF", "2 dna:chromosome chromosome:GRCm38:2:1:182113224:1 REF", "X dna:chromosome chromosome:GRCm38:X:1:171031299:1 REF", "GL456354.1 dna:scaffold scaffold:GRCm38:GL456354.1:1:195993:1 REF", "17 dna:chromosome chromosome:GRCm38:17:1:94987271:1 REF", "GL456221.1 dna:scaffold scaffold:GRCm38:GL456221.1:1:206961:1 REF", "GL456233.1 dna:scaffold scaffold:GRCm38:GL456233.1:1:336933:1 REF", "GL456393.1 dna:scaffold scaffold:GRCm38:GL456393.1:1:55711:1 REF", "GL456213.1 dna:scaffold scaffold:GRCm38:GL456213.1:1:39340:1 REF", "JH584301.1 dna:scaffold scaffold:GRCm38:JH584301.1:1:259875:1 REF", "GL456367.1 dna:scaffold scaffold:GRCm38:GL456367.1:1:42057:1 REF", "GL456382.1 dna:scaffold scaffold:GRCm38:GL456382.1:1:23158:1 REF", "Y dna:chromosome chromosome:GRCm38:Y:1:91744698:1 REF", "6 dna:chromosome chromosome:GRCm38:6:1:149736546:1 REF", "GL456216.1 dna:scaffold scaffold:GRCm38:GL456216.1:1:66673:1 REF", "10 dna:chromosome chromosome:GRCm38:10:1:130694993:1 REF", "12 dna:chromosome chromosome:GRCm38:12:1:120129022:1 REF", "19 dna:chromosome chromosome:GRCm38:19:1:61431566:1 REF", "3 dna:chromosome chromosome:GRCm38:3:1:160039680:1 REF", "4 dna:chromosome chromosome:GRCm38:4:1:156508116:1 REF", "GL456211.1 dna:scaffold scaffold:GRCm38:GL456211.1:1:241735:1 REF", "JH584294.1 dna:scaffold scaffold:GRCm38:JH584294.1:1:191905:1 REF", "GL456210.1 dna:scaffold scaffold:GRCm38:GL456210.1:1:169725:1 REF", "GL456378.1 dna:scaffold scaffold:GRCm38:GL456378.1:1:31602:1 REF", "JH584300.1 dna:scaffold scaffold:GRCm38:JH584300.1:1:182347:1 REF", "GL456372.1 dna:scaffold scaffold:GRCm38:GL456372.1:1:28664:1 REF", "GL456387.1 dna:scaffold scaffold:GRCm38:GL456387.1:1:24685:1 REF", "GL456359.1 dna:scaffold scaffold:GRCm38:GL456359.1:1:22974:1 REF", "GL456368.1 dna:scaffold scaffold:GRCm38:GL456368.1:1:20208:1 REF", "9 dna:chromosome chromosome:GRCm38:9:1:124595110:1 REF", "GL456381.1 dna:scaffold scaffold:GRCm38:GL456381.1:1:25871:1 REF", "GL456212.1 dna:scaffold scaffold:GRCm38:GL456212.1:1:153618:1 REF", "JH584302.1 dna:scaffold scaffold:GRCm38:JH584302.1:1:155838:1 REF", "GL456379.1 dna:scaffold scaffold:GRCm38:GL456379.1:1:72385:1 REF", "JH584298.1 dna:scaffold scaffold:GRCm38:JH584298.1:1:184189:1 REF", "GL456392.1 dna:scaffold scaffold:GRCm38:GL456392.1:1:23629:1 REF", "8 dna:chromosome chromosome:GRCm38:8:1:129401213:1 REF", "JH584299.1 dna:scaffold scaffold:GRCm38:JH584299.1:1:953012:1 REF", "JH584303.1 dna:scaffold scaffold:GRCm38:JH584303.1:1:158099:1 REF", "GL456389.1 dna:scaffold scaffold:GRCm38:GL456389.1:1:28772:1 REF", "5 dna:chromosome chromosome:GRCm38:5:1:151834684:1 REF", "15 dna:chromosome chromosome:GRCm38:15:1:104043685:1 REF", "GL456370.1 dna:scaffold scaffold:GRCm38:GL456370.1:1:26764:1 REF", "14 dna:chromosome 
chromosome:GRCm38:14:1:124902244:1 REF", "GL456366.1 dna:scaffold scaffold:GRCm38:GL456366.1:1:47073:1 REF", "GL456390.1 dna:scaffold scaffold:GRCm38:GL456390.1:1:24668:1 REF", "1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF", "MT dna:chromosome chromosome:GRCm38:MT:1:16299:1 REF", "JH584296.1 dna:scaffold scaffold:GRCm38:JH584296.1:1:199368:1 REF", "JH584304.1 dna:scaffold scaffold:GRCm38:JH584304.1:1:114452:1 REF", "GL456383.1 dna:scaffold scaffold:GRCm38:GL456383.1:1:38659:1 REF", "JH584293.1 dna:scaffold scaffold:GRCm38:JH584293.1:1:207968:1 REF", "18 dna:chromosome chromosome:GRCm38:18:1:90702639:1 REF", "GL456396.1 dna:scaffold scaffold:GRCm38:GL456396.1:1:21240:1 REF", "13 dna:chromosome chromosome:GRCm38:13:1:120421639:1 REF"}
One common cause of this problem is when the fasta headers contain more than one word. If this is the case, try trimming the headers to include only the chromosome name (i.e. '>chr1').
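If trimming the headers is indeed the way to go, a small standalone sketch of doing so (the output filename is just an example):

in_fa = "Mus_musculus.GRCm38.dna.primary_assembly.fa"
out_fa = "GRCm38.primary_assembly.trimmed_headers.fa"

with open(in_fa) as infile, open(out_fa, "w") as outfile:
    for line in infile:
        if line.startswith(">"):
            # ">1 dna:chromosome chromosome:GRCm38:1:..." becomes ">1"
            line = line.split()[0] + "\n"
        outfile.write(line)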

Updated syntax to Python 3

Hey,

I am currently trying to create a Dockerfile for the new TranscriptClean version for BioContainers.
I noticed an update to "get_SJs_from_gtf.py" for Python 3; could this update also be applied to "get_corrected_SJs_from_log.py" and "clean_splice_jns.py"?
I can then finalize the container and utilize the multithreading in the pipeline. Thank you!

No such file or directory

Hi Dana

I am having an issue with one particular file in my data set when I attempt to run it through TC.

Below are commands

python $TC --threads 6 --sam $SSAM --genome $REF --spliceJns $SPLICE --deleteTmp --outprefix EXT2_TC

Reading genome ..............................
Reading genome ..............................
Reading genome ..............................
cat: 'TC_tmp/*/*.sam': No such file or directory
cat: 'TC_tmp/*/*.fa': No such file or directory
cat: 'TC_tmp/*/*.log': No such file or directory
cat: 'TC_tmp/*/*.TElog': No such file or directory
Took 0:00:00 to combine all outputs.

I have attempted to clear the /tmp/ directory before trying this again (I notice pybedtools creates many files there), but it didn't help. After restarting my PC I got further (thinking it might clear temporary files causing issues), yet the output files don't appear to be correct. This replicate has the largest file size of them all, yet TC processed it to an end file size substantially smaller than all the rest, and it was also processed very quickly, whereas the rest took 3+ hours. To be sure, I remapped the original file with minimap2 and tried once more. I have also tried without the --deleteTmp option.

Cheers
Dean

Is there a strong repair mode?

Hi:
The software you have developed is excellent.
I would like to know whether there is a strong (aggressive) repair option that corrects splice junctions in third-generation reads to the conventional motifs (GT-AG, GC-AG), or alternatively which code in the software I could change so that junctions in my PacBio data are corrected completely, without requiring the GT-AG motif.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 6485: invalid start byte

Hi,

Thanks for the great program, I've had no issues at all running through the pipeline aside from one minor hiccup.

I've corrected 5 datasets with TC, all of which ran perfectly fine; however, one dataset (ONT directRNA, mapped with minimap2) is continually outputting the error below. I have tried running this on an HPC and on two local machines, with different versions of TC, but to no avail.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 6485: invalid start byte

I have cleaned two other replicates with the same -j -g -v files as I am using for this one. Both were also mapped with the exact same options on minimap2, sorted via samtools.

I'm having suspicions that there could be something funny going on in my minimap2 output .sam file....

Any help would be appreciated
Cheers
Dean

Recommended aligner

Hi Dewyman,

Thank you for this tool. Which aligner and alignment parameters do you recommend? I used GMAP for my Nanopore reads with the following parameters, but I got an IndexError:
--cross-species --max-intronlength-ends=10000 -n 1 -z sense_force -f samse

Traceback (most recent call last):
File "/home/banthony/software/TranscriptClean-1.0.4/TranscriptClean.py", line 989, in
main()
File "/home/banthony/software/TranscriptClean-1.0.4/TranscriptClean.py", line 140, in main
correctMismatches(canTranscripts, genome, snps, transcriptErrorLog)
File "/home/banthony/software/TranscriptClean-1.0.4/TranscriptClean.py", line 555, in correctMismatches
mergeOperations, mergeCounts = transcript.mergeMDwithCIGAR()
File "/netmount/ip14_home/banthony/software/TranscriptClean-1.0.4/transcript2.py", line 223, in mergeMDwithCIGAR
if cigarOperation[cigarIndex] in ("H", "S", "I", "N"):
IndexError: list index out of range

Kindly advise whether another aligner, for example minimap2, would be better.

Regards,

Library suggestion

I noticed from the source code that the pyfasta library is used to fetch genomic sequences.
Using pyfaidx may be a better choice for this. pyfaidx never loads the sequence into memory but instead uses the FASTA index (.fai) to fetch sequences quickly, which would significantly alleviate the memory requirements.
I have a version of TranscriptClean that uses this library, and based on a few tests the output is unchanged while virtual memory is reduced to less than half.
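For reference, a minimal sketch of the equivalent lookup with pyfaidx (slicing is 0-based and end-exclusive, so the 1-based coordinates used elsewhere need converting):

from pyfaidx import Fasta

genome = Fasta("hg38.fa")       # builds/uses hg38.fa.fai instead of loading sequences into memory
chrom, start, stop = "chr12", 104265275, 104265275

ref_base = genome[chrom][start - 1:stop].seq   # convert 1-based inclusive -> 0-based slice
print(ref_base)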
