toniwestbrook / paladin

Protein Alignment and Detection Interface

License: MIT License

Makefile 0.45% C 89.25% Shell 0.13% C++ 2.29% Python 7.72% Dockerfile 0.16%

unh bioinformatics protein-alignment nucleotide-alignment uniprot paladin

paladin's Introduction

Thanks for taking an interest in my projects! Listed below are my scientific applications developed for work and a variety of hardware and retrocomputing projects I work on in my spare time.

✨ I'm currently working on Cloudburst Connection, an RPG adventure game where you run your own dial-up Bulletin Board System (BBS). Add new features, keep your users happy, and explore other boards as a deeper story unfolds. Check here to follow its development!

UNH and Bioinformatics Projects

Project	Description
RepeatFS	File system providing scientific reproducibility through provenance and automation
PALADIN	Protein sequence alignment tool designed for the accurate functional characterization of metagenomes
PALADIN_Plugins	Pipeline plugins for PALADIN, providing HPC support, abundance (taxonomy, go terms), automation, etc
Mitobin	Taxonomic classification and read binning of mitochondrial DNA
ImageJ_Plugins	A collection of plugins for ImageJ/FIJI

Personal and Synthetic Dreams Projects

Project	Description
Shredz64	Guitar Hero style game for the Commodore 64
NetPaint	100% text-only drawing program, compatible with any terminal emulator that supports the mouse, such as PuTTY, Konsole, iTerm2, and many others
Whirlwind	Nintendo Entertainment System (NES) compatible FPGA core
IntLog	Monitor and log BIOS and DOS interrupts for debugging 16-bit DOS programs

paladin's Issues

Piping fastq results in "-.pro"

I would like to cat my fastq directly into paladin align, and this seems to work OK. However, it means we end up with a file called "-.pro" in the current working directory.

That in itself is not a problem, unless I want to run 50 samples in the same directory, in which case I assume the "-.pro" files will clash.

Is there a way to provide a prefix for this file in such instances?

Cheers
Mick
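
One possible workaround, until a prefix option exists, is to give each sample its own working directory so the "-.pro" files cannot collide. The following is a minimal Python sketch under that assumption; `run_in_sample_dir` and its arguments are hypothetical names, and `cmd` stands for whatever paladin align command line you already use (it is not spelled out here on purpose).

```python
import subprocess
from pathlib import Path


def run_in_sample_dir(cmd, fastq_path, work_root):
    """Run cmd with fastq_path on stdin, inside a directory private to the sample.

    Any file the command drops into its current working directory
    (such as "-.pro") then lands in work_root/<sample name>/.
    """
    fastq_path = Path(fastq_path).resolve()
    sample_dir = Path(work_root).resolve() / fastq_path.stem
    sample_dir.mkdir(parents=True, exist_ok=True)
    with open(fastq_path, "rb") as reads:
        # cwd changes only the child's working directory, not ours.
        subprocess.run(cmd, stdin=reads, cwd=sample_dir, check=True)
    return sample_dir
```

Because the child process runs in a different directory, any reference or output paths inside `cmd` would need to be absolute.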

Paladin align error: Reporting can only be used on prepared indices.

Hi,
Thank you for this tool, it seems to be just what I have been looking for. I am trying to align metagenomic reads to a custom fasta file of protein sequences (genes). Following the manual's instructions, I used 'paladin index -r3 derep_clust90_proteins.faa' and then tried to use 'paladin align' and received this error:

[M::command_align] Loading the index for reference 'derep_clust90_proteins.faa'...
[M::index_load_from_disk] Read 0 ALT contigs
[E::command_align] Reporting can only be used on prepared indices.

My understanding of the manual is that running 'paladin index' on my custom reference is enough, but this error suggests I need to use 'paladin prepare' after my index step. According to the help for 'paladin prepare', my input reference database should be either Swiss-Prot or UniRef90, but my reference data are neither. How should I proceed?

few reads mapping

With version 0.1.3, few reads are mapping: for the Merlot samples, about 4% of reads map, and with AcidovoraxAvenaeATCC19860-se-250-1000-10.fq only 12% map. 😭

I thought this was related to last night's fix, but when rolling back to 0.1.2, the mapping is about the same, maybe even a bit worse.

taxa in uniprot report

Count   UniProtKB   ID  Organism    Protein Names   Genes   Pathway Features    Gene Ontology   Reviewd Existence   Comments
324 sp|P85153|CO2A1_MAMAE
243 sp|P0C2W2|CO1A1_TYREX
233 sp|P49756|RBM25_HUMAN
222 sp|P19275|VTP3_TTV1V
213 sp|B2RY56|RBM25_MOUSE

Is there any way to collapse 2 rows and add the counts together, e.g., RBM25_HUMAN and RBM25_MOUSE?
I guess I'm having an issue with the taxonomy being so front and center. We don't care about it, and we know that our method is not accurate. So having our top hits be EXTINCT SPECIES (mastodon and T. rex) is off-putting. I wonder if we should just report the protein ID and cut out the species. People can still get there via UniProtKB, but it would not be so 'in your face'.
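
The row-collapsing idea can be sketched as a post-processing step on the report. This is a minimal Python sketch, not something PALADIN does itself. It assumes the Swiss-Prot entry-name convention, where the part before the final underscore is the protein mnemonic and the part after it is the species code. `collapse_by_protein` is a hypothetical helper, and the rows are the five shown above.

```python
from collections import Counter


def collapse_by_protein(rows):
    """Sum counts over entries that share the protein mnemonic before '_'."""
    totals = Counter()
    for count, entry in rows:
        entry_name = entry.split("|")[-1]        # e.g. RBM25_HUMAN
        protein = entry_name.rsplit("_", 1)[0]   # e.g. RBM25
        totals[protein] += int(count)
    return totals.most_common()                  # sorted by descending count


rows = [
    (324, "sp|P85153|CO2A1_MAMAE"),
    (243, "sp|P0C2W2|CO1A1_TYREX"),
    (233, "sp|P49756|RBM25_HUMAN"),
    (222, "sp|P19275|VTP3_TTV1V"),
    (213, "sp|B2RY56|RBM25_MOUSE"),
]
print(collapse_by_protein(rows))
# [('RBM25', 446), ('CO2A1', 324), ('CO1A1', 243), ('VTP3', 222)]
```

A shared mnemonic does not guarantee that two entries are true orthologs, so the merged counts are best read as a convenience view rather than a curated grouping.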

process substitution?

not sure how hard this would be to implement:

Currently get a seg fault.

paladin align -t 4 -u 2 data/uniprot_sprot.fasta <(cat soil_S29_merged.assembled.fastq soil_S29_merged.unassembled.forward.fastq soil_S29_merged.unassembled.reverse.fastq)
[M::index_load_from_disk] read 0 ALT contigs
Segmentation fault (core dumped)

make fails

macmanes@davinci:/share/paladin$ make
gcc -c -g -Wall -Wno-unused-function -O2  -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  utils.c -o utils.o
gcc -c -g -Wall -Wno-unused-function -O2  -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  kthread.c -o kthread.o
gcc -c -g -Wall -Wno-unused-function -O2  -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  kstring.c -o kstring.o
gcc -c -g -Wall -Wno-unused-function -O2  -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  ksw.c -o ksw.o
gcc -c -g -Wall -Wno-unused-function -O2  -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  bwt.c -o bwt.o
bwt.c: In function ‘bwt_invPsi’:
bwt.c:56:2: warning: passing argument 1 of ‘unpackBWTValue’ discards ‘const’ qualifier from pointer target type [enabled by default]
  x = unpackBWTValue(bwt, x);
  ^
bwt.c:28:9: note: expected ‘struct bwt_t *’ but argument is of type ‘const struct bwt_t *’
 ubyte_t unpackBWTValue(bwt_t * passBWT, int passSeqIdx) {
         ^
bwt.c: In function ‘bwt_occ’:
bwt.c:132:2: warning: passing argument 1 of ‘getOccInterval’ discards ‘const’ qualifier from pointer target type [enabled by default]
  n = ((bwtint_t*)(p = getOccInterval(bwt, k)))[c];
  ^
bwt.c:18:12: note: expected ‘struct bwt_t *’ but argument is of type ‘const struct bwt_t *’
 uint32_t * getOccInterval(bwt_t * passBWT, bwtint_t passIndex) {
            ^
bwt.c: In function ‘bwt_2occ’:
bwt.c:161:3: warning: passing argument 1 of ‘getOccInterval’ discards ‘const’ qualifier from pointer target type [enabled by default]
   n = ((bwtint_t*)(p = getOccInterval(bwt, k)))[c];
   ^
bwt.c:18:12: note: expected ‘struct bwt_t *’ but argument is of type ‘const struct bwt_t *’
 uint32_t * getOccInterval(bwt_t * passBWT, bwtint_t passIndex) {
            ^
bwt.c: In function ‘bwt_occ4’:
bwt.c:195:2: warning: passing argument 1 of ‘getOccInterval’ discards ‘const’ qualifier from pointer target type [enabled by default]
  p = getOccInterval(bwt, k);
  ^
bwt.c:18:12: note: expected ‘struct bwt_t *’ but argument is of type ‘const struct bwt_t *’
 uint32_t * getOccInterval(bwt_t * passBWT, bwtint_t passIndex) {
            ^
bwt.c:204:3: warning: passing argument 1 of ‘unpackBWTValue’ discards ‘const’ qualifier from pointer target type [enabled by default]
   cnt[unpackBWTValue(bwt, k/128*128 + supercount)]++;
   ^
bwt.c:28:9: note: expected ‘struct bwt_t *’ but argument is of type ‘const struct bwt_t *’
 ubyte_t unpackBWTValue(bwt_t * passBWT, int passSeqIdx) {
         ^
bwt.c:205:3: warning: passing argument 1 of ‘unpackBWTValue’ discards ‘const’ qualifier from pointer target type [enabled by default]
   cnt[unpackBWTValue(bwt, k/128*128 + supercount+1)]++;
   ^
bwt.c:28:9: note: expected ‘struct bwt_t *’ but argument is of type ‘const struct bwt_t *’
 ubyte_t unpackBWTValue(bwt_t * passBWT, int passSeqIdx) {
         ^
bwt.c:206:3: warning: passing argument 1 of ‘unpackBWTValue’ discards ‘const’ qualifier from pointer target type [enabled by default]
   cnt[unpackBWTValue(bwt, k/128*128 + supercount+2)]++;
   ^
bwt.c:28:9: note: expected ‘struct bwt_t *’ but argument is of type ‘const struct bwt_t *’
 ubyte_t unpackBWTValue(bwt_t * passBWT, int passSeqIdx) {
         ^
bwt.c:207:3: warning: passing argument 1 of ‘unpackBWTValue’ discards ‘const’ qualifier from pointer target type [enabled by default]
   cnt[unpackBWTValue(bwt, k/128*128 + supercount+3)]++;
   ^
bwt.c:28:9: note: expected ‘struct bwt_t *’ but argument is of type ‘const struct bwt_t *’
 ubyte_t unpackBWTValue(bwt_t * passBWT, int passSeqIdx) {
         ^
bwt.c:214:3: warning: passing argument 1 of ‘unpackBWTValue’ discards ‘const’ qualifier from pointer target type [enabled by default]
   cnt[unpackBWTValue(bwt, k/128*128 + supercount++)]++;
   ^
bwt.c:28:9: note: expected ‘struct bwt_t *’ but argument is of type ‘const struct bwt_t *’
 ubyte_t unpackBWTValue(bwt_t * passBWT, int passSeqIdx) {
         ^
bwt.c:184:15: warning: unused variable ‘tmp’ [-Wunused-variable]
  uint32_t *p, tmp, *end;
               ^
bwt.c:183:11: warning: variable ‘x’ set but not used [-Wunused-but-set-variable]
  bwtint_t x;
           ^
bwt.c: In function ‘bwt_2occ4’:
bwt.c:222:15: warning: variable ‘_l’ set but not used [-Wunused-but-set-variable]
  bwtint_t _k, _l;
               ^
bwt.c:222:11: warning: variable ‘_k’ set but not used [-Wunused-but-set-variable]
  bwtint_t _k, _l;
           ^
gcc -c -g -Wall -Wno-unused-function -O2  -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  bntseq.c -o bntseq.o
bntseq.c: In function ‘add1’:
bntseq.c:315:3: warning: array subscript has type ‘char’ [-Wchar-subscripts]
   int c = (int) aa_encode_hash[seq->seq.s[i]];
   ^
gcc -c -g -Wall -Wno-unused-function -O2  -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  bwa.c -o bwa.o
gcc -c -g -Wall -Wno-unused-function -O2  -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  bwamem.c -o bwamem.o
bwamem.c: In function ‘mem_aln2sam’:
bwamem.c:931:3: warning: array subscript has type ‘char’ [-Wchar-subscripts]
   for (i = qb; i < qe; ++i) str->s[str->l++] = aa_ascii_hash[s->seq[i]];
   ^
bwamem.c:946:3: warning: array subscript has type ‘char’ [-Wchar-subscripts]
   for (i = qe-1; i >= qb; --i) str->s[str->l++] = aa_ascii_hash[s->seq[i]];
   ^
bwamem.c: In function ‘mem_reg2sam’:
bwamem.c:1037:9: warning: unused variable ‘uniprotEntry’ [-Wunused-variable]
  char * uniprotEntry;
         ^
bwamem.c:1035:12: warning: unused variable ‘parseIdx’ [-Wunused-variable]
  int k, l, parseIdx;
            ^
In file included from bwamem.c:19:0:
bwamem.c: At top level:
uniprot.h:17:22: warning: ‘uniprotEntryLists’ defined but not used [-Wunused-variable]
 static UniprotList * uniprotEntryLists = 0;
                      ^
uniprot.h:18:12: warning: ‘uniprotListCount’ defined but not used [-Wunused-variable]
 static int uniprotListCount = 0;
            ^
gcc -c -g -Wall -Wno-unused-function -O2  -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  bwamem_pair.c -o bwamem_pair.o
gcc -c -g -Wall -Wno-unused-function -O2  -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  bwamem_extra.c -o bwamem_extra.o
gcc -c -g -Wall -Wno-unused-function -O2  -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  malloc_wrap.c -o malloc_wrap.o
ar -csru libbwa.a utils.o kthread.o kstring.o ksw.o bwt.o bntseq.o bwa.o bwamem.o bwamem_pair.o bwamem_extra.o malloc_wrap.o
gcc -c -g -Wall -Wno-unused-function -O2  -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  is.c -o is.o
gcc -c -g -Wall -Wno-unused-function -O2  -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS  bwtindex.c -o bwtindex.o
bwtindex.c:23:6: error: conflicting types for ‘packValue’
 void packValue(bwt_t * passBWT, int64_t passSeqIdx, bwtint_t passValue) {
      ^
In file included from bwtindex.c:9:0:
bwtindex.h:10:6: note: previous declaration of ‘packValue’ was here
 void packValue(bwt_t * passBWT, int passSeqIdx, bwtint_t passValue);
      ^
bwtindex.c:31:9: error: conflicting types for ‘unpackValue’
 ubyte_t unpackValue(bwt_t * passBWT, int64_t passSeqIdx) {
         ^
In file included from bwtindex.c:9:0:
bwtindex.h:13:9: note: previous declaration of ‘unpackValue’ was here
 ubyte_t unpackValue(bwt_t * passBWT, int passSeqIdx);
         ^
bwtindex.c: In function ‘command_index’:
bwtindex.c:259:4: warning: implicit declaration of function ‘writeIndexTestProtein’ [-Wimplicit-function-declaration]
    writeIndexTestProtein(prefix, proName); break;
    ^
make: *** [bwtindex.o] Error 1

alignment stats

Print alignment stats at the end of the alignment phase.

PALADIN successfully aligned % of n total reads in X minutes

Problem retrieving data from Uniprot

Hi,
I'm having trouble running paladin when it tries to retrieve data from Uniprot. I get a message saying "Received unexpected job ID size". I copy the complete log below.

I tried installing from bioconda first, and then I did a manual installation and the same problem showed up both times. The program does create the sam and tsv files, and they look fine.

When I run the script "make_test.sh" from the sample_data directory it works fine, and retrieves the information from Uniprot.

Thanks,
Marcelo

[M::command_align] Loading the index for reference 'uniprot_sprot.fasta'...
[M::index_load_from_disk] Read 0 ALT contigs
[W::writeReadsProtein] Brute force ORF detection redundant to MF index, disabling...
[M::writeReadsProtein] Detecting open reading frames...
[M::writeReadsProtein] Detected and translated 101914 open reading frames in 208577 sequences
[M::process] Read 611484 protein sequences (34983238 AA)...
[M::mem_process_seqs] Processed 611484 protein sequences in 102.528 CPU sec, 17.206 real sec
[M::renderNumberAligned] Aligned 13099 out of 102112 total detected ORF sequences (12.83%)
[M::prepareUniprotLists] Aggregating 13099 entries for UniProt report
[M::retrieveUniprotOnline] Submitted 5266 of 5266 entries to UniProt...
[E::retrieveUniprotOnline] Received unexpected job ID size
[E::retrieveUniprotOnline] Received unexpected job ID size
[E::retrieveUniprotOnline] Received unexpected job ID size
[E::retrieveUniprotOnline] Received unexpected job ID size
[E::retrieveUniprotOnline] Received unexpected job ID size

can't find download locations of uniprot indexes

Where do the uniprot indexes download to when using paladin prepare? I downloaded swiss-prot just fine, but when I try to download uniref it says it failed to write to disk, so I'm guessing it ran out of disk space wherever it was trying to download to. Can I change the default download location, and then specify that location in some kind of config file so that paladin knows where to find the indexed databases?

paladin prepare work around - pre-indexed UniRef90 available?

Like others, I ran into memory issues with paladin prepare when run on a 256G node. UniRef90 v.2021_03 is currently 31G and the packed file (.pac) is 85G. I have made a special request to use a 1Tb node for 12 hours.
Questions:

Does 12 hours seem a reasonable time?
My job has been queued for 2 days already and has not yet even been allocated a run time so it may be a week or more before I can get an indexed database. It was suggested at one point that a pre-indexed UniRef90 might be made available. Does one exist?

Filtering by max quality

I have a question about filtering by max quality, which I find a little confusing. When I perform the alignment, I set the T parameter to 20, following the man page example for "preferring higher quality mappings" in the output. The output then lists hits with values from 60 down to 0 in the max quality column. Should I filter on that column, i.e., keep only hits with MaxQual = 60? 50? 40? What threshold do you recommend? Or was the necessary filtering already done on the command line with "T = 20"?
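
If filtering after the fact turns out to be the way to go, it could look roughly like the Python sketch below. It assumes the report is tab-separated and has a column named "Quality (Max)", as in the report header quoted in another issue on this page. `filter_by_max_quality` is a hypothetical helper, the cutoff of 40 is arbitrary rather than a recommendation, and the rows are made-up placeholders.

```python
import csv
import io


def filter_by_max_quality(tsv_text, min_quality, column="Quality (Max)"):
    """Keep report rows whose max mapping quality is at least min_quality."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if float(row[column]) >= min_quality]


# Made-up rows, only to show the mechanics.
example = (
    "Count\tQuality (Max)\tUniProtKB\n"
    "12\t60\tEXAMPLE_A\n"
    "7\t35\tEXAMPLE_B\n"
    "3\t0\tEXAMPLE_C\n"
)
kept = filter_by_max_quality(example, min_quality=40)
print([row["UniProtKB"] for row in kept])  # ['EXAMPLE_A']
```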

Using contigs rather than reads

Hi,
This is a follow-up to a Twitter conversation with @peromhc.
Is the following workflow available in Paladin?
(1) Find ORFs in assembled contigs
(2) Map amino acid sequences to a reference database
(3) Then use read-to-contig mapping information (in a SAM/BAM file) to estimate the abundance of amino acid sequences per sample?

Cheers,
Ameet

align help info

The usage message for paladin align incorrectly says paladin mem:

>paladin align

Usage: paladin mem [options] <idxbase> <in.fq>

Algorithm options:
...

PALADIN with PEAR-merged versus L and R cat

I took a small dataset from here: http://metagenomics.anl.gov/?page=MetagenomeOverview&metagenome=4520320.3

I ran 2 PALADIN tests:

Reads from L and R were simply 'catted' together. There were 294k reads of length 250bp.
Reads were merged using PEAR. 92% were merged, resulting in 136k reads.

The UniProt results were quite different, both in terms of counts and proteins discovered.

head pear_paladin.txt merged_paladin.txt
==> pear_paladin.txt <==
Count   UniProtKB   
324 sp|P85153|CO2A1_MAMAE
243 sp|P0C2W2|CO1A1_TYREX
233 sp|P49756|RBM25_HUMAN
222 sp|P19275|VTP3_TTV1V
213 sp|B2RY56|RBM25_MOUSE
182 sp|P16274|IFEA_HELPO
180 sp|Q54YZ9|DHKJ_DICDI
173 sp|Q86AH4|Y8592_DICDI
168 sp|P04917|SRGN_RAT

==> merged_paladin.txt <==
Count   UniProtKB   
3017    sp|P85153|CO2A1_MAMAE
2344    sp|P0C2W2|CO1A1_TYREX
1463    sp|P16274|IFEA_HELPO
1036    sp|O19816|MATK_ALLCA
898 sp|P14277|CB2F_SOLLC
830 sp|P14274|CB2A_SOLLC
814 sp|P14276|CB2E_SOLLC
771 sp|P14275|CB2C_SOLLC
513 sp|O48085|CYB_ERYTA

Summing the counts column gives 164996 hits for the PEAR-merged and 297183 for the catted results. Each has about the same number of UniProt hits, 81695 and 93698 respectively. What worries me is that the distribution of hits is very different - see plot below (kinda hokey). For instance, CO2A1_MAMAE got 3000 hits in the cat-dataset, but only 300 in the PEAR-merged reads.

mean and var of uniprot counts: PEAR: 2.01 and 11.49852
mean and var of uniprot counts: cat: 3.17 and 254.5488

I'm not sure which dataset to trust - longer reads in the PEAR dataset might be better, but there are more reads in the catted dataset, which might be better for dynamic range.

What do you all think?
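
For anyone wanting to repeat the comparison, the summary statistics above can be computed with a few lines of Python. This is a minimal sketch: `summarize` and `share` are hypothetical helper names, and the sample variance is used on the assumption that the quoted figures came from something like R's var(). Since the two runs have very different totals, it may also help to compare each protein's share of all hits rather than raw counts; the numbers plugged in below are the totals and CO2A1_MAMAE counts quoted in the post.

```python
import statistics


def summarize(counts):
    """Return (mean, sample variance, total) for a list of per-protein counts."""
    return statistics.mean(counts), statistics.variance(counts), sum(counts)


def share(count, total):
    """Fraction of all hits that one protein accounts for."""
    return count / total


# CO2A1_MAMAE: 324 of 164996 hits (PEAR) versus 3017 of 297183 hits (cat).
pear_share = share(324, 164996)
cat_share = share(3017, 297183)
print(f"PEAR: {pear_share:.4%}  cat: {cat_share:.4%}")
```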

extra count in report header.

head report.tsv

Count   Count % UniProtKB ...

Large Contiguous Blocks of Ambiguous IUPACs in Reference

Long sequences of N's in an AA reference can bias results - needs more investigation, then discussion on how to handle.

paladin prep?

Should we make a paladin prep that downloads the reference and indexes it automatically? Most people will use Uniprot. Could be paladin prep --reference uniprot.

Could be prep, setup, prepare, download, or something like that.

index info

Current:

Usage: paladin index [options] <reference.fasta> [annotation.gff]

Algorithm options:

       -f       Enable indexing all frames in nucleotide references
       -r<#>    Reference type:
                   0: Reference contains nucleotide sequences (requires annotation)
                   1: Reference contains nucleotide sequences (coding only)
                   2: Reference contains protein sequences
                   3: Development tests

Should add a little more info:

Reference contains nucleotide sequences (requires corresponding .gff annotation file)
Reference contains nucleotide sequences (coding only, e.g., a curated transcriptome)
Reference contains protein sequences (from UniProt or another source)

Also a sample command

Sample Command:

        paladin index -r2 uniprot_sprot.fasta

Has the PALADIN project been abandoned?

There isn't any commit activity since 2019 and issues haven't been responded to since then either. This would be a very useful tool for a tool I'm developing, as an alternative to using DIAMOND for my needs.

Output Directly to BAM

Error: [E::command_align] Reporting can only be used on prepared indices.

Hi,

I am trying to use paladin align with the Uniref90 database, but I get an error:

paladin align -t 4 -o $OUTPUT/filename $REFERENCE $INPUT
[E::command_align] Reporting can only be used on prepared indices

$REFERENCE is the path to the indexed database; however, the path contains a symlink to the actual file. Could this be the source of the error?

Next steps

Four of us met briefly today to touch base. Here is a short list of needs.

Outputs across more parameters. Seed sizes 8-12, gap penalties 0-6, and threshold 0, 10, 20, 30, 40. Need similar ranges for BWA as well.
Need to see BWA and Paladin % read mapping for some real dataset with real unknown microbes.
Need % reads mapped against the source CDS and proteins. This gives us the % possible. Do we already have that?
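
The parameter sweep in the first item can be enumerated up front so every combination gets run exactly once. The Python sketch below assumes the ranges are inclusive integers. The dictionary keys are placeholder labels, not PALADIN or BWA option names.

```python
from itertools import product

seed_sizes = range(8, 13)          # 8 through 12
gap_penalties = range(0, 7)        # 0 through 6
thresholds = (0, 10, 20, 30, 40)

# One dictionary per run; the keys are labels only.
grid = [
    {"seed": k, "gap": g, "threshold": t}
    for k, g, t in product(seed_sizes, gap_penalties, thresholds)
]
print(len(grid))  # 175
```

At 175 runs per tool, it may be worth deciding early which combinations actually need the full real dataset.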

Not deleting reads PRO file

The PRO file is no longer deleted correctly after alignment. This may be related to threading code not returning properly (similar to what happens in the Cygwin POSIX environment during Windows tests). Investigate whether this is related to the change in pipeline order of when worker2 runs within the process function of the alignment phase, which was made to fix the UniProt report issue.

make returns error (OSX 10.12.6)

Hi,

On OS X 10.12.6, when I attempt to make paladin, I get this error:

gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS translations.c -o translations.o
translations.c:46:9: warning: implicit declaration of function 'renderTranslations' is invalid in C99
[-Wimplicit-function-declaration]
renderTranslations();
^

paladin align -o

-o does not seem to be parsed properly. Setting it to -o 100 has no effect on the number of reads mapped.

[bwt_pac2bwt] Failed to allocate memory

Hi,

I'm trying to build a database from NCBI-NR, but seems like bowtie(?) is failing:

$ paladin index -r3 ../nr.gz
[M::command_index] Translating protein sequence...0.00 sec
[M::command_index] Packing protein sequence... 1735.20 sec
[M::command_index] Constructing BWT for the packed sequence... [bwt_pac2bwt] Failed to allocate 76177038661 bytes at bwtindex.c line 132: Cannot allocate memory

There's 500Gb of memory available, don't see why it couldn't allocate 76Gb. Any suggestions?
Thanks!

Saving unmapped reads

Toni, is there a way to save the unmapped reads as a separate file please? I ask because what if a user wants to run the unmapped reads through another reference database given a considerable proportion of the reads do not map. Just a thought...
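
In the meantime, unmapped reads can in principle be pulled out of the SAM output, since the SAM format marks an unmapped record with bit 0x4 in the FLAG field. The Python sketch below assumes PALADIN writes unmapped records to its SAM output with that bit set, as BWA does; this is not verified here. `unmapped_read_names` is a hypothetical helper.

```python
def unmapped_read_names(sam_lines):
    """Return the names of records whose FLAG has the 0x4 (unmapped) bit set."""
    names = []
    for line in sam_lines:
        if line.startswith("@"):       # header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:            # skip blank or malformed lines
            continue
        if int(fields[1]) & 0x4:
            names.append(fields[0])
    return names
```

The resulting names could then be used to subset the original read file for a second pass against another reference.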

align issue (-T and -k options)

Hi Toni! I ran into an obstacle when using paladin to align. I have many short peptide sequences, about 5-50 AA long. With default options, only peptides longer than 15 AA map. I suspected this was due to the score threshold, so I set -T 5 (minimum score to output) and aligned again; this time peptides longer than 11 AA mapped. Finally, I set both -k 5 (minimum seed length) and -T 5, and peptides longer than 5 AA mapped.
So a seed length of 5 seems to be a reasonable choice, but how should I set the minimum score? For short protein sequences, how do I select a reasonable score threshold?

command "paladin prepare -r2" throwing error related to memory ?

Hi,
Is there any way or hack to run this step without hitting the error below on a machine with less memory (in my case only 256GB)? Or can I use a pre-prepared or pre-indexed database?

Error: "Constructing BWT for the packed sequence... [is_bwt] Failed to allocate 482146077304 bytes at is.c line 212: Cannot allocate memory"

Kindly suggest.

Align protein sequences

I have a multi-FASTA file containing protein sequences, that I wish to align w/ PALADIN, i.e. neither the ORF detection step nor a translation to proteins is needed, only the alignment and reporting step.

From my understanding of the PALADIN pipeline, this should be possible, but apparently the functionality is not exposed to the user – correct me if I'm wrong. How do you advise to proceed?

Support for non-standard genetic codes

Provide options for specifying non-standard nuclear/mitochondrial/plastid genetic codes to be used during the ORF detection and translation process.
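
To illustrate what such an option would change, here is a minimal Python sketch of table-driven translation. It is not PALADIN's implementation. The two 64-character strings are the standard code and the vertebrate mitochondrial code written in the usual TCAG codon order; the function names are hypothetical.

```python
BASES = "TCAG"

# Amino acids for all 64 codons in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
VERT_MITO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"


def codon_table(amino_acids):
    """Map each codon to its amino acid for one genetic code."""
    codons = (a + b + c for a in BASES for b in BASES for c in BASES)
    return dict(zip(codons, amino_acids))


def translate(dna, table):
    """Translate frame 0 of dna; codons with ambiguous bases become 'X'."""
    dna = dna.upper()
    return "".join(table.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


seq = "ATGATATGA"
print(translate(seq, codon_table(STANDARD)))   # MI*
print(translate(seq, codon_table(VERT_MITO)))  # MMW
```

The same nine bases read as a terminated ORF under the standard code and as an open one under the mitochondrial code, which is why ORF detection would need to be code-aware as well.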

Columns ID to Cross Reference(EnsemblBacteria) are empty

Hi,

paladin is not filling in the columns from ID through Cross Reference(EnsemblBacteria), as in this example run with "test":

Count Abundance Quality (Avg) Quality (Max) UniProtKB ID Organism Protein Names Genes Pathway Features Gene Ontology Reviewed Existence Comments Cross Reference (KEGG) Cross Reference (GeneID) Cross Reference (PATRIC) Cross Reference(EnsemblBacteria)
2 66.66667 0.00000 0 Y1118_DICDI
1 33.33334 0.00000 0 SRGN_RAT

Multithread in db creation

Is there any possibility of adding multithreading to database creation, i.e., to the command paladin prepare? Database creation is very slow no matter which computer you use.

Thanks

seg fault when index is missing or mis-specified.

Currently, PALADIN has a seg fault when it can't find the index - it should fail more gracefully.

paladin align uniprot data/test.soil.fq
Segmentation fault (core dumped)

flag for not checking read ids in PE mode?

Hi,

I am testing Paladin as an alternative to Diamond for 2 main reasons:

paired-end mode
ability to produce bam files

I can successfully produce bam files in single-end mode, but when feeding it data from the NCBI SRA, the read ids don't seem to be paired, and I get this error below.

Is there a way to tell Paladin to take the R1 + R2 reads as they come without checking if the read ids match between the two?

[M::writeReadsProtein] Detecting open reading frames...
[M::writeReadsProtein] Detected and translated 79304452 open reading frames in 112455309 sequences
[M::process] Read 3419102 protein sequences (200000116 AA)...
[M::process] Read 3408448 protein sequences (200000094 AA)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 0, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] skip orientation FR as there are not enough pairs
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[mem_sam_pe] paired reads have different names: "0:1:1:SRR7107847.1", "SRR7107847.2"
[mem_sam_pe] paired reads have different names: "1:2:2:SRR7107847.2", "SRR7107847.9"
[mem_sam_pe] paired reads have different names: "0:2:2:SRR7107847.1", "SRR7107847.3"
[W::sam_read1] Parse error at line 20
[main_samview] truncated file.
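
Before looking for a flag, it may be worth checking whether the R1 and R2 files are actually in the same order, since the names in the error differ by more than a suffix. The Python sketch below does that on the raw FASTQ lines. It assumes plain 4-line FASTQ records and mate suffixes of the form /1 and /2; `read_key` and `first_mismatch` are hypothetical helpers, not part of PALADIN.

```python
import re


def read_key(header):
    """Strip '@', any description text, and a trailing /1 or /2 from a header."""
    name = header.lstrip("@").split()[0]
    return re.sub(r"/[12]$", "", name)


def first_mismatch(r1_lines, r2_lines):
    """Return (record_index, name1, name2) for the first unpaired record, else None."""
    headers1 = r1_lines[0::4]   # every 4th line is a header in 4-line FASTQ
    headers2 = r2_lines[0::4]
    for i, (h1, h2) in enumerate(zip(headers1, headers2)):
        if read_key(h1) != read_key(h2):
            return i, read_key(h1), read_key(h2)
    return None
```

This compares only as many records as the shorter file holds, so a difference in record count between the two files would need a separate check.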

Uniprot report

Using paladin align -u 2 data/uniprot_sprot.fasta data/test.assembled.fastq, I get a report that looks like this:

Count   UniProtKB   ID  Organism    Protein Names   Genes   Pathway Features    Gene Ontology   Reviewd Existence   Comments
324 sp|P85153|CO2A1_MAMAE
243 sp|P0C2W2|CO1A1_TYREX
233 sp|P49756|RBM25_HUMAN
222 sp|P19275|VTP3_TTV1V
213 sp|B2RY56|RBM25_MOUSE
182 sp|P16274|IFEA_HELPO
180 sp|Q54YZ9|DHKJ_DICDI
173 sp|Q86AH4|Y8592_DICDI
168 sp|P04917|SRGN_RAT
142 sp|P43974|Y258_HAEIN
138 sp|Q8MP30|Y7791_DICDI
112 sp|Q54Q42|Y8578_DICDI
106 sp|P08399|PHXR5_MOUSE

Should the other columns contain info as well?

docker image downloads and indexes references to installation directory

On my system I found the downloaded swiss-prot reference here:

/var/lib/docker/overlay2/ccc85725455b9c3bd02131e0874eb05efbbc31d45891f6beddd0b5efff047d4c/diff/uniprot_sprot.fasta.gz

This could prove to be problematic when building the uniref90 database because I likely wouldn't have enough disk space. Is there a way to make the docker image run like the fresh installation where references are downloaded and indexed in the working directory? Or give an option to specify output directory?

Bug where last sequence in fasta/fastq is ignored

Hello Developers,

Thanks for the tool. Here I want to point out a small bug.

I am running the following command:

paladin align -p template.fasta blah.fasta

The bug always ignores the last fasta sequence.

Similarly, if the query FASTA contains only one sequence, the resulting SAM file is empty. (I was stuck here for a while before I figured out the bug.)

Thanks.

align -u 2

I think writing to stdout by default is a bad idea; better to write to some smartly named file by default and let the user override the default naming with -o [filename].

Recommended RAM for Db creation

Hi,
It seems that Paladin requires a huge amount of RAM for database generation (>256Gb), so what is the recommended amount of RAM when using the full Uniprot90 sequences?

Thank you: Blaize