orf-rater's Issues

Reuse the method on CHX only data

Dear Alex,

Thank you for contributing a great method to the community. We are very interested in applying your algorithm to our data. However, we only have cycloheximide (CHX)-treated datasets. From reading the source code, it seems I can still apply the code directly to CHX-only data without modification; is that the case?

It was mentioned in your paper that a modified algorithm was applied to ND-only datasets. Have you tried it on CHX-only data? What would you expect in terms of performance and accuracy? If I have to modify the code for CHX-only data, could you point out how to do it?

Another question concerns the classification and monotonization. You selected six features to train the classifier, including Wald statistics for the genomic start site in the Harr, LTM, and CHX/ND datasets, etc. In our case, the Harr and LTM features will not be available; does the code need to be modified to cope with this? Also, your training set selected ORFs >= 100 aa; what about small ORFs below that threshold? I had a hard time interpreting rate_regression_output.py, so your advice is greatly appreciated.
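To make the question concrete, here is roughly what I imagine doing, with the Harr/LTM columns simply dropped before training. This is a sketch only; all file and column names are hypothetical stand-ins, not the actual names used by rate_regression_output.py:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical tables and column names standing in for the regression output.
training_orfs = pd.read_hdf('regression.h5', 'train')        # hypothetical
candidate_orfs = pd.read_hdf('regression.h5', 'candidates')  # hypothetical

# Train on only the CHX-derived features, omitting Harr/LTM columns.
available = ['chx_start_wald', 'chx_stop_wald']  # hypothetical names
clf = RandomForestClassifier(n_estimators=500)
clf.fit(training_orfs[available], training_orfs['is_annotated'])
orfrating = clf.predict_proba(candidate_orfs[available])[:, 1]
```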

Best wishes,
Fengyuan

Cutoff for orfrating value and the forest score

Hello Dr. Fields,

Thanks for contributing the software! It is much easier to run than other ORF-finding tools.

I was wondering which value I can use to decide the confidence of the new ORFs: the orfrating value or the forest score. Is there a good cutoff on the orfrating value and the forest score for novel ORFs? Any help will be appreciated. Thanks in advance!
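For instance, is a pattern like the following reasonable? The 0.8 cutoff below is just a guess on my part, which is exactly what I am asking about:

```python
import pandas as pd

ratings = pd.read_csv('orfratings.csv')
# 0.8 is an illustrative cutoff, not a recommended value.
confident = ratings[ratings['orfrating'] >= 0.8]
print(len(confident), 'of', len(ratings), 'ORFs retained')
```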

bug reading in h5 files and filtering on multiple columns using "where"

Dear Alex,

I'm writing to you because I'm gratefully using your great ORF-Rater tool but have encountered a bug that I think is worth a note:

In the make_orf_bed.py and quantify_orfs.py scripts, you read in the h5 files using the built-in SQL-like `where` argument to subset the data.
Unfortunately, when more than one column is selected and one of these columns contains NaN values, the logical combination of the selection vectors is faulty.

At first I thought it might be due to a mismatch between the encoding used for writing and my default in Python, but both are UTF-8.

The results so far always contain smaller tables, but I did not investigate in detail exactly how the selection fails; I simply edited your scripts to read in the full data and then filter accordingly.

Please verify the bug; otherwise I'll have to keep troubleshooting, I guess.

best
Daniel

As an example, run on Python 2.7 and 3.7.6:

data = pd.read_hdf("orfratings.h5", 'orfratings')       # [317568 rows x 21 columns]
data.loc[(data.orfrating >= 0.8) & (data.AAlen >= 10)]  # [8949 rows x 21 columns]

And, in comparison, several ways of using the where-based read-in:

d2 = pd.read_hdf("orfratings.h5", 'orfratings', where="AAlen >= 0 and orfrating >= 0.8")      # [4216 rows x 21 columns]
d2 = pd.read_hdf("orfratings.h5", 'orfratings', where="AAlen >= 0 & orfrating >= 0.8")        # [4216 rows x 21 columns]
d2 = pd.read_hdf("orfratings.h5", 'orfratings', where="(AAlen >= 0) and (orfrating >= 0.8)")  # [4216 rows x 21 columns]
d2 = pd.read_hdf("orfratings.h5", 'orfratings', where="(AAlen >= 0) & (orfrating >= 0.8)")    # [4216 rows x 21 columns]
d2 = pd.read_hdf("orfratings.h5", 'orfratings', where=["AAlen >= 0", "orfrating >= 0.8"])     # [4216 rows x 21 columns]
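For reference, here is a self-contained toy of the pattern I mean (my own sketch, with column names chosen to mimic orfratings.h5; whether it reproduces the discrepancy will depend on the pandas/PyTables versions):

```python
import numpy as np
import pandas as pd

# Toy table mimicking orfratings.h5, with NaNs in one selectable column.
df = pd.DataFrame({'orfrating': [0.9, 0.85, 0.95, 0.10],
                   'AAlen': [50.0, np.nan, 120.0, 8.0]})
df.to_hdf('toy.h5', key='orfratings', format='table', data_columns=True)

# In-memory filtering: rows with NaN simply compare False.
in_mem = df[(df.orfrating >= 0.8) & (df.AAlen >= 10)]

# On-disk selection via 'where' on the same two columns.
on_disk = pd.read_hdf('toy.h5', 'orfratings',
                      where='orfrating >= 0.8 & AAlen >= 10')

# If the bug applies, these row counts disagree.
print(len(in_mem), len(on_disk))
```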

Prune_transcripts.py error

Hi Alex,

Thanks for making the ORF-RATE pipeline public and accessible.

When running prune_transcripts.py with the following parameters in Python 2.7.15:
prune_transcripts.py --inbed annot.bed --minlen 27 --maxlen 32 genome.fa test.bam -p 10 --verbose

I am getting the following error:

[2019-09-30 11:59:51] Reading transcriptome and genome
[2019-09-30 12:00:05] Parsing sequence and count information

Traceback (most recent call last):
  File "prune_transcripts.py", line 179, in <module>
    % (len(tid_summary), lowreads_dropped, len(tid_summary)-lowreads_dropped, opts.minreads, opts.peakfrac))
ValueError: All 46526 transcripts dropped due to too few reads (46526) or too many reads coming from one position (0). Consider decreasing MINREADS (currently 64) or increasing PEAKFRAC (currently 0.200000), or check validity of input BAM file.

I have done as suggested in the ValueError and decreased MINREADS and increased PEAKFRAC, but I keep getting the same error no matter what parameters I use. I've checked the input BAM file and it seems okay too, and I am quite sure the problem isn't low coverage. samtools flagstat on the BAM file gives the following (a read-length sanity check follows below):

203414522 + 0 in total (QC-passed reads + QC-failed reads)
100572976 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
203414522 + 0 mapped (100.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
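Since prune_transcripts.py also filters on read length, one check that might be relevant is how many primary alignments actually fall in the 27-32 nt window implied by --minlen 27 --maxlen 32. A sketch of my own, using pysam on the same test.bam as in the command above:

```python
import pysam
from collections import Counter

# Tally primary-alignment read lengths across the whole BAM.
lengths = Counter()
with pysam.AlignmentFile('test.bam', 'rb') as bam:
    for read in bam.fetch():  # requires an indexed BAM
        if not read.is_secondary and not read.is_unmapped:
            lengths[read.query_length] += 1

in_window = sum(n for length, n in lengths.items() if 27 <= length <= 32)
print(in_window, 'of', sum(lengths.values()), 'primary alignments are 27-32 nt')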

Please find attached a sample of the structure of my input files:
genome.fa.txt
annot.bed.txt

Thank you so much,

Marina

Unique hits and combine biological replicates

Hi Alex,

I have a couple of questions:

Why did you allow multiple hits in your alignment? Won't it increase false-positive signals?

In my case, I kept only uniquely aligned reads, but that resulted in far fewer reads than I expected (a large proportion of reads had been removed as contaminants). I'm thinking of combining all biological replicates to increase depth. Does that make sense?
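Concretely, the pooling step I have in mind is just merging the replicate BAMs, e.g. via pysam's wrapper around samtools merge (file names illustrative):

```python
import pysam

# Pool the uniquely-aligned reads from all biological replicates into one
# BAM (file names illustrative), then index the result.
pysam.merge('-f', 'merged.bam', 'rep1.bam', 'rep2.bam', 'rep3.bam')
pysam.index('merged.bam')
```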

Thanks,
Fengyuan

Regarding prune transcripts

Dear Alex,

In your paper, you first aligned the footprints to a BMDC-specific transcriptome, and then aligned the remaining reads to UCSC-annotated transcripts. In my case, I do not have an assembled transcriptome for our data (reconstructing a transcriptome is not trivial), so I aligned all the reads to annotated transcripts, at the expense of losing some true signal. I then quantified transcript expression and, taking 10 raw read counts as an expression cutoff (based on our bioinformatician's experience), I can filter out most of the non-expressed transcripts before running prune_transcripts.py. Do you think that's reasonable?
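For concreteness, the pre-filtering step I describe would look something like this (file names hypothetical; expressed_tids.txt would hold the transcript IDs passing the cutoff):

```python
# Keep only BED12 lines whose transcript ID (column 4) passed the
# >= 10 raw-read-count expression cutoff; file names are hypothetical.
with open('expressed_tids.txt') as f:
    expressed = {line.strip() for line in f}

with open('annot.bed') as src, open('annot.expressed.bed', 'w') as dst:
    for line in src:
        if line.split('\t')[3] in expressed:
            dst.write(line)
```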

In prune_transcripts.py, you set 64 as the minimum raw read count per transcript (and the same number as TopHat's max multihits); how did you decide on this number?

For peakfrac, you excluded transcripts that have 1/5 or more of their reads aligned to a single position. It's known that ribosome profiling shows peaks over start codons; could peakfrac rule out good transcripts that have 1/5 or more of their reads aligned around the start codon?

Thanks,
Fengyuan

prune_transcripts.py Error Message

Hello,

We are receiving the following error when running the prune_transcripts script:

(ENV) Rhaegal:ORF-RATER matamala$ python prune_transcripts.py --inbed /Users/matamala/Desktop/MM9_Files/mm9_TOME.bed12 -p 2 /Users/matamala/Desktop/MM9_Files/mm9_REFGEN.fa /Users/matamala/Desktop/MM9_Files/Final_Alignment_BAM.bam > logfile.out

Traceback (most recent call last):
  File "prune_transcripts.py", line 168, in <module>
    tid_summary = pd.concat(workers.map(_get_tid_info, bedlinedict.keys()))
  File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 253, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 572, in get
    raise self._value
ValueError: invalid literal for int() with base 4: ''

Prune_transcripts.py error

Hello, this script always reports errors, and I don't know whether the problem is my input file or the script. Do you have a test file?
python prune_transcripts.py --inbed /data1/zhoulab/xiajiaqi/neo_ribosome/reference/gencode.v29lift37.annotation.bed /data1/zhoulab/xiajiaqi/reference/hg19canonic.fa /data1/zhoulab/xiajiaqi/neo_ribosome/data/GSM1446874.sort.bam -p 10 --verbose
[2019-10-24 10:38:07] Reading transcriptome and genome
[2019-10-24 10:38:50] Parsing sequence and count information
[2019-10-24 10:46:44] Partitioning sequences to identify multimappers
Traceback (most recent call last):
  File "prune_transcripts.py", line 229, in <module>
    mm_df = workers.map(_find_mm_in_range, range(npart))
  File "/data1/zhoulab/xiajiaqi/software/miniconda3/envs/orf-rater27/lib/python2.7/multiprocessing/pool.py", line 253, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/data1/zhoulab/xiajiaqi/software/miniconda3/envs/orf-rater27/lib/python2.7/multiprocessing/pool.py", line 572, in get
    raise self._value
ValueError: Using oa_ndim == 0 when op_axes is NULL. Use oa_ndim == -1 or the MultiNew iterator for NumPy <1.8 compatibility

Quantify result is a small subset of ORFs searching result

Hi Dr. Fields,

Thanks a lot for the quick response! I ran quantify_orfs.py after running rate_regression_output.py and found that quant.csv contains only a small subset of the ORFs in orfratings.csv. I was wondering why quantify_orfs.py filters out so many ORFs.

Best Regards,
Xiaoli

Prune_transcripts.py

```
Traceback (most recent call last):
  File "prune_transcripts.py", line 229, in <module>
    mm_df = workers.map(_find_mm_in_range, range(npart))
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
ValueError: No objects to concatenate
```

Hi Alex,

I get the above error when running prune_transcripts.py. I've indexed my BAM files, so that's not the issue.
In case anyone else reads this: the scripts are not compatible with Python 3 (functions such as maketrans were removed in Python 3; 2.7 works fine).
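For anyone attempting a port, the maketrans incompatibility looks like this (reverse-complement tables are a typical use in this kind of code; the exact call sites in the scripts may differ):

```python
# Python 2 idiom (string.maketrans was removed in Python 3):
#   from string import maketrans
#   table = maketrans('ACGT', 'TGCA')
# Python 3 replacement: str.maketrans is a built-in static method.
table = str.maketrans('ACGT', 'TGCA')
print('GATTACA'.translate(table))  # -> CTAATGT
```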

Thank you.
Sofina

Evaluate the likelihood that ORFs were protein coding

Hello Dr. Fields,

Thank you so much for the explanation of cutoffs. I noticed you used PhyloCSF to evaluate the likelihood that ORFs are protein-coding. I followed this idea and downloaded the mm10 Multiz alignment (multiz60way) table (MAF format) from the UCSC Table Browser. When I tried to extract the multiple alignments of new ORFs found by ORF-RATER, most of them were "NNNNN...". I was wondering whether you used a more comprehensive MAF file, or whether you now have a better way to evaluate this likelihood.

Best Regards,
Xiaoli

prune_transcripts mapping information not recorded

Hi,
I'm getting the following error while trying to run prune_transcripts.py. I downloaded the BED file from the UCSC Table Browser.

 [jfreimer@h1 ORF-RATER]$ python prune_transcripts.py --inbed gencode.vM7.basic.bed -p 1 --minlen 28 --maxlen 30 -v GRCm38.primary_assembly.genome.fa ribo1/accepted_hits.bam ribo2/accepted_hits.bam > logfile.out
Traceback (most recent call last):
  File "prune_transcripts.py", line 161, in <module>
    tid_summary = pd.concat(workers.map(_get_tid_info, bedlinedict.keys()))
  File "/N/u/jfreimer/Mason/miniconda/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/N/u/jfreimer/Mason/miniconda/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
ValueError: mapping information not recorded in index or index not available

python version conflicts

Dear Alex,

I am trying to set up a conda environment to run ORF-RATER, but I am having some difficulties running the code. One of the required packages is plastid, which requires Python 3.3 or greater. However, some function calls in your code do not work in Python 3.7 (they seem to be written for Python 2.7). Do you have any recommendations in that regard? Have you planned to port your code from Python 2 to Python 3?

Thanks,

Lídia

prune_transcripts.py stuck

Hi Dr. Fields,

prune_transcripts.py seems to get stuck at the "Parsing sequence and count information" step. I used the -p 32 option, but a minute or two after "Parsing sequence and count information" appears, only one process is working and its CPU usage is nearly 0. I don't see any error message in the output.

My bed12 file contains 200+ lines, and prune_transcripts.py worked when I extracted just the first 10 lines of it, so I think the BAM and FASTA files should be fine. Do you have any idea what else could be wrong? Thanks in advance.

Best Regards,
Xiaoli
