bioconductor / biostrings Goto Github PK

Efficient manipulation of biological strings

Home Page: https://bioconductor.org/packages/Biostrings

R 57.50% C 42.50%

biostrings's Introduction

Biostrings is an R/Bioconductor package for efficient manipulation of biological strings.

See https://bioconductor.org/packages/Biostrings for more information including how to install the release version of the package (please refrain from installing directly from GitHub).

biostrings's People

Contributors

Stargazers

Watchers

Forkers

alinavinogradovaa michbur jdalcin zachary-foster dpique twmccart felixernst stjordanis raoulkam ahl27 acvill villafup hsadia538 mcarlsn berylkanali gaoce

biostrings's Issues

readDNAStringSet for fasta with long headers

I am getting the error "cannot read line XX, line is too long" when opening a fasta file with readDNAStringSet. It seems to be because the headers of some sequences are very long. Would it be possible to increase the maximum header length allowed to 100 k characters with spaces or more?

Thanks,

Eva

Comparing single base DNAString with single character causes stack error

Under Biostring 2.58.0 and R4.0.5 on both macOS and CentOS.

> Biostrings::DNAString("A") == "A"
Error: C stack usage  7969280 is too close to the limit

Preconverting either side of the comparison works fine

> as.character(Biostrings::DNAString("A")) == "A"
[1] TRUE
> Biostrings::DNAString("A") == Biostrings::DNAString("A")
[1] TRUE

Bug in reverseComplement() with QualityScaledDNAStringSet

Consider the following example:

library(Biostrings)
stuff <- QualityScaledDNAStringSet(DNAStringSet("AAATTTCCCGGG"),
                                   PhredQuality("(((!!!###..."))

Running reverseComplement(stuff) gives me:

  A QualityScaledDNAStringSet instance containing:

  A DNAStringSet instance of length 1
    width seq
[1]    12 CCCGGGAAATTT

  A PhredQuality instance of length 1
    width seq
[1]    12 (((!!!###...

... where the DNA sequence is correctly reverse-complemented, but the quality string is not reversed. As a result, one would incorrectly think that "T" has a score of "." while "C" has a score of "(".

Session info:

R Under development (unstable) (2017-10-27 r73632)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: /home/cri.camres.org/lun01/Software/R/trunk/lib/libRblas.so
LAPACK: /home/cri.camres.org/lun01/Software/R/trunk/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] stringr_1.2.0              ShortRead_1.37.0          
 [3] GenomicAlignments_1.15.2   SummarizedExperiment_1.9.5
 [5] DelayedArray_0.5.14        matrixStats_0.52.2        
 [7] Biobase_2.39.1             Rsamtools_1.31.1          
 [9] BiocParallel_1.13.1        sarlacc_1.0.0             
[11] Biostrings_2.47.1          XVector_0.19.1            
[13] GenomicRanges_1.31.3       GenomeInfoDb_1.15.1       
[15] IRanges_2.13.4             S4Vectors_0.17.15         
[17] BiocGenerics_0.25.1       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.14           RColorBrewer_1.1-2     compiler_3.5.0        
 [4] pillar_1.0.1           prettyunits_1.0.2      GenomicFeatures_1.31.1
 [7] bitops_1.0-6           tools_3.5.0            zlibbioc_1.25.0       
[10] progress_1.1.2         biomaRt_2.35.5         digest_0.6.13         
[13] bit_1.1-12             RSQLite_2.0            memoise_1.1.0         
[16] tibble_1.4.1           lattice_0.20-35        pkgconfig_2.0.1       
[19] rlang_0.1.6            igraph_1.1.2           Matrix_1.2-12         
[22] DBI_0.7                GenomeInfoDbData_1.1.0 hwriter_1.3.2         
[25] rtracklayer_1.39.4     httr_1.3.1             ade4_1.7-10           
[28] bit64_0.9-7            grid_3.5.0             R6_2.2.2              
[31] AnnotationDbi_1.41.4   XML_3.98-1.9           RMySQL_0.10.13        
[34] latticeExtra_0.6-28    seqinr_3.4-5           magrittr_1.5          
[37] blob_1.1.0             MASS_7.3-48            assertthat_0.2.0      
[40] muscle_3.21.0          stringi_1.1.6          RCurl_1.95-4.9

Warnings, sometimes errors, when running sort() on DNAStringSet

Running the following bit of innocuous code:

library(Biostrings)
a <- DNAStringSet(c("AAA", "AA", "A"))
sort(a)

... yields the correct result, but an odd warning:

  A DNAStringSet instance of length 3
    width seq
[1]     1 A
[2]     2 AA
[3]     3 AAA
Warning message:
In order(..., na.last = na.last, decreasing = decreasing) :
  "order" method for XRawList objects ignores the 'na.last' argument

More concerning is the fact that the code fails if flowCore is loaded prior to Biostrings:

library(flowCore)
library(Biostrings)
a <- DNAStringSet(c("AAA", "AA", "A"))
sort(a)

... which produces the error:

Error in validObject(.Object) : 
  invalid class “MethodWithNext” object: Error : evaluation nested too deeply: infinite recursion / options(expressions=)?

... presumably because there is a sort in flowCore that overrides something.

R Under development (unstable) (2017-10-27 r73632)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: /home/cri.camres.org/lun01/Software/R/trunk/lib/libRblas.so
LAPACK: /home/cri.camres.org/lun01/Software/R/trunk/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] Biostrings_2.47.9   XVector_0.19.8      IRanges_2.13.28    
[4] S4Vectors_0.17.35   BiocGenerics_0.25.3 flowCore_1.45.7    

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.15       lattice_0.20-35    matrixStats_0.53.1 mvtnorm_1.0-7     
 [5] corpcor_1.6.9      rrcov_1.4-3        grid_3.5.0         pcaPP_1.9-73      
 [9] graph_1.57.1       zlibbioc_1.25.0    robustbase_0.92-8  Biobase_2.39.2    
[13] DEoptimR_1.0-8     compiler_3.5.0     cluster_2.0.6

Error with complement()

I'm having an error using the complement function.

I installed Biostrings using the installation instructions here.

> library(Biostrings)
> packageVersion("Biostrings")
[1] ‘2.58.0’
> Biostrings::complement(Biostrings::DNAString("GCATCGATGAAGAACGCAGC"))
Error in as(mcols, mcols_target_class) : 
  no method or default for coercing “DFrame” to “DataTable_OR_NULL”
>

Could you help me get this working.

Many thanks!

Misleading show() method for XStringSet objects

This is a follow up of https://support.bioconductor.org/p/122340/#122400

The show() method for XStringSet objects currently suggests the existence of a seq() getter for these objects:

library(Biostrings)
library(drosophila2probe)
dna <- DNAStringSet(drosophila2probe)
dna
#   A DNAStringSet instance of length 265400
#          width seq
#      [1]    25 CCTGAATCCTGGCAATGTCATCATC
#      [2]    25 ATCCTGGCAATGTCATCATCAATGG
#      [3]    25 ATCAGTTGTCAACGGCTAATACGCG
#      [4]    25 ATCAATGGCGATTGCCGCGTCTGCA
#      [5]    25 CCGCGTCTGCAATGTGAGGGCCTAA
#      ...   ... ...
# [265396]    25 TACTACTTGAGCCACAACCATCTGA
# [265397]    25 AGGGACTAAAGAGGCCCCATGCTCT
# [265398]    25 CATGCTCTGTCTGGTGTCAGCGCTA
# [265399]    25 GTCAGCGCTACATGGTCCAGGACAA
# [265400]    25 CCAGGACAAGTATGGACTTCCCCAC

but there is no such getter.

Same issue with the show() method for XString objects:

dna[[1]]
#  25-letter "DNAString" instance
# seq: CCTGAATCCTGGCAATGTCATCATC

Also it would be good to make these show() methods more consistent with other show() methods in S4Vectors/IRanges/GenomicRanges:

library(IRanges)
IRanges(1:3, 10, names=LETTERS[1:3], score=runif(3))
# IRanges object with 3 ranges and 1 metadata column:
#         start       end     width |             score
#     <integer> <integer> <integer> |         <numeric>
#   A         1        10        10 | 0.267148569226265
#   B         2        10         9 | 0.106218574102968
#   C         3        10         8 | 0.649568639695644

In particular the names on a DNAStringSet object should be displayed on the left. Also its metadata columns should be displayed (right now they are not):

dna2 <- dna[1:3]
names(dna2) <- LETTERS[1:3]
mcols(dna2)$score <- runif(3)
dna2
#   A DNAStringSet instance of length 3
#     width seq                                               names               
# [1]    25 CCTGAATCCTGGCAATGTCATCATC                         A
# [2]    25 ATCCTGGCAATGTCATCATCAATGG                         B
# [3]    25 ATCAGTTGTCAACGGCTAATACGCG                         C

Stack smashed by writeXStringSet for FASTQ output

Calling writeXStringSet on a DNAStringSet with a very long element:

library(Biostrings)
N <- 21000
x <- paste(sample(c("A", "C", "G", "T"), N, replace=TRUE), collapse="")
q <- runif(N)
y <- QualityScaledDNAStringSet(DNAStringSet(x), PhredQuality(q))
names(y) <- "YAY"
writeXStringSet(y, qualities=quality(y), file="whee.fastq", format="fastq")
## *** stack smashing detected ***: /home/cri.camres.org/lun01/Software/R/devel/bin/exec/R terminated
## Aborted (core dumped)

The exact value of N will probably vary between machines depending on the size of the stack, but nonetheless, this isn't good. I've tested this on BioC-devel but the same error happens on release.

Session info

R Under development (unstable) (2019-02-19 r76128)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /home/cri.camres.org/lun01/Software/R/trunk/lib/libRblas.so
LAPACK: /home/cri.camres.org/lun01/Software/R/trunk/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] Biostrings_2.51.2   XVector_0.23.0      IRanges_2.17.4     
[4] S4Vectors_0.21.10   BiocGenerics_0.29.1

loaded via a namespace (and not attached):
[1] zlibbioc_1.29.0 compiler_3.6.0

bug in 'reverse'?

Dear Biostrings team:

Using the devel version of the package (Biostrings_2.47.7) the following command returns an error:

subject <- DNAStringSet(c("ATCATGCCATCATGAT", "ATTTGGAATCAT"))

reverse(subject)

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘windows’ for signature ‘"DNAString"’

This is my sessionInfo():

R Under development (unstable) (2018-02-08 r74238)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS: /usr/lib/libblas.so.3.8.0
LAPACK: /usr/lib/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=es_AR.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=es_AR.UTF-8        LC_COLLATE=es_AR.UTF-8    
 [5] LC_MONETARY=es_AR.UTF-8    LC_MESSAGES=es_AR.UTF-8   
 [7] LC_PAPER=es_AR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=es_AR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] ShortRead_1.37.1            GenomicAlignments_1.15.10  
 [3] SummarizedExperiment_1.9.13 DelayedArray_0.5.17        
 [5] matrixStats_0.53.0          Biobase_2.39.2             
 [7] Rsamtools_1.31.3            GenomicRanges_1.31.16      
 [9] GenomeInfoDb_1.15.5         Biostrings_2.47.7          
[11] XVector_0.19.8              IRanges_2.13.22            
[13] S4Vectors_0.17.30           BiocParallel_1.13.1        
[15] BiocGenerics_0.25.2         FastqCleaner_1.0.0         

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.15           zlibbioc_1.25.0        xtable_1.8-2          
 [4] lattice_0.20-35        R6_2.2.2               hwriter_1.3.2         
 [7] tcltk_3.5.0            tools_3.5.0            grid_3.5.0            
[10] latticeExtra_0.6-28    htmltools_0.3.6        digest_0.6.15         
[13] Matrix_1.2-12          shiny_1.0.5            GenomeInfoDbData_1.1.0
[16] RColorBrewer_1.1-2     bitops_1.0-6           RCurl_1.95-4.10       
[19] mime_0.5               compiler_3.5.0         httpuv_1.3.5

Could it be a bug in the function?

Thanks,
Leandro

Support for BSgenomeViews objects in matchPWM

Would it be possible to add a method for BSgenomeViews objects in the matchPWM function? The manual says it can take ordinary Views objects, but I don't see any support for BSgenomeViews objects?

Broken QualityScaledDNAStringSet constructor

It seems that QualityScaledDNAStringSet got broken somewhere. Using the example in the docs:

x1 <- DNAStringSet(c("TTGA", "CTCN"))
q1 <- PhredQuality(c("*+,-", "6789"))
qx1 <- QualityScaledDNAStringSet(x1, q1)

... gives me:

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘narrow’ for signature ‘"DNAStringSet"’

Session information:

R Under development (unstable) (2017-10-27 r73632)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: /home/cri.camres.org/lun01/Software/R/trunk/lib/libRblas.so
LAPACK: /home/cri.camres.org/lun01/Software/R/trunk/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] Biostrings_2.47.8   XVector_0.19.8      IRanges_2.13.21    
[4] S4Vectors_0.17.29   BiocGenerics_0.25.2

loaded via a namespace (and not attached):
[1] zlibbioc_1.25.0 compiler_3.5.0  tools_3.5.0

readDNAStringSet for FASTQ files with variable length reads

It seems as if variable length reads in FASTQ files are not yet supported. For example, if I have a FASTQ file named blah.fq with the following contents:

@READ1
ACTAGCGACACT
+
981731287232
@READ2
CTAGACT
+
3723123

... and I run (on Biostrings version 2.47.9):

library(Biostrings)
readDNAStringSet("blah.fq", format="fastq")

... I get the error:

Error in .Call2("read_XStringSet_from_fastq", filexp_list, nrec, skip,  : 
  read_XStringSet_from_fastq(): FASTQ files with variable sequence lengths are not supported yet

Is support for variable-length FASTQ files on the radar? This is the typical case when dealing with Nanopore data. I know I can use the ShortRead package but it seems unnecessary to reach for a different package when there seems to be a perfectly suitable function in Biostrings.

getSeq: Unable to find inherited method for DNAString

Hello,
I am trying to extract some sequences from a genome file. I have read in the genome into RStudio using readDNAStringSet, and then converting it to just a DNAString (its a bacterial genome, so there's only one chromosome/sequence).
I have a data frame, containing the positions for each gene for which I want to extract the sequence. I have also duplicated the data frame as a Genomic Ranges object.
Whenever I try to use the getSeq function to retrieve the desired sequences from the genome, I get the following error message:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘getSeq’ for signature ‘"DNAString"’
I have tried using both the GRanges object and specifying the appropriate columns from the data frame. I have included some of my code below. Any help would be greatly appreciated.

Read in the genome sequence

pao1_genome <- readDNAStringSet("../PAO1_genome/GCF_000006765.1_ASM676v1_genomic.fna", format = "fasta")
pao1_genome <- DNAString(pao1_genome$NC_002516.2 Pseudomonas aeruginosa PAO1 chromosome, complete genome)

Try to extract sequences with data frame

getSeq(pao1_genome, master_genes_list$locus_tag, start = master_genes_list$promoter, end = master_genes_list$start, strand = master_genes_list$strand)

And with the GRanges object

getSeq(x = pao1_genome, names = pao1_gr)

Data frame

locus_tag Up_Down start end strand feature_interval_length OperonID promoter
1 PA0070 Up 82404 83318 - 915 12045 82154
2 PA0078 Up 95048 96397 - 1350 12046 94798
3 PA0082 Up 100124 101158 + 1035 12047 99874

GRanges

GRanges object with 155 ranges and 0 metadata columns:
seqnames ranges strand

[1] PA0070 [ 82154, 82404] -
[2] PA0078 [ 94798, 95048] -
[3] PA0082 [ 99874, 100124] +

SesionInfo

R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.10

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8 LC_MONETARY=en_CA.UTF-8
[6] LC_MESSAGES=en_CA.UTF-8 LC_PAPER=en_CA.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] Biostrings_2.46.0 XVector_0.18.0 GenomicRanges_1.30.3 GenomeInfoDb_1.14.0 IRanges_2.12.0 S4Vectors_0.16.0
[7] BiocGenerics_0.24.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.24.0 compiler_3.4.4 tools_3.4.4 GenomeInfoDbData_1.0.0 RCurl_1.95-4.10 yaml_2.1.18
[7] bitops_1.0-6

Thanks!

Need for packages in Depends rather than Imports?

Many people use Biostrings without wanting to be forced to effectively call e.g. library(S4Vectors). Various non-Biostrings functions being imported in a user's R session leads to confusing problems such as e.g. dplyr's rename() being superseded by plyr's rename() if and only if Biostrings is loaded after dplyr. Since functions such as plyr's rename do not seem required to use Biostrings, wouldn't it be better to put the offending packages as Imports? I think it is general best practice for R packages to use Imports over Depends for this reason. (I may very well be missing some obvious reasons for why these packages are in Depends and perhaps the Bioconductor team has considered all this and decided it is a necessary evil, but posting as I didn't see an issue on it yet).

Edit: closing this because the plyr::rename issue should probably be brought up at S4Vectors and I should do more digging to see if Biostrings is designed to only be used with the Depends packages.

Ns in DNAstringset greys out all text in console that follows

This is minor and only affects visualization in the R/Rstudio console window.

I love the base colouring in the console. But I noticed that if there are Ns in a sequence, it causes all subsequent text to be greyed out.

#this looks fine
myfasta<-DNAStringSet(x=c('ACTG','GGGG'))
myfasta
print('Hi, can you see me?')

#this greys out
myfasta<-DNAStringSet(x=c('ACTG','NN','GGGG'))
myfasta
print('Hi, can you see me?')

writeXStringSet on QualityScaledDNAStringSet should automatically set qualities=

Consider:

x1 <- DNAStringSet(c("TTGA", "CTCN"))
q1 <- PhredQuality(c("*+,-", "6789"))
qdna1 <- QualityScaledDNAStringSet(x1, q1)
names(qdna1) <- c("A", "B")

writeXStringSet(qdna1, format="fastq", file="whee.fq") # (1)
writeXStringSet(qdna1, format="fastq", file="whee.fq", qualities=quality(qdna1)) # (2) 
writeQualityScaledXStringSet(qdna1, file="whee.fq") # (3)

(1) has no qualities while (2) and (3) add the qualities correctly. Nonetheless, it would be a useful quality-of-life improvement for (1) to attempt to extract the qualities from qdna1 if it is a QSDSS and qualities= is not explicitly provided in the function call. (2) is too verbose and it is hard to remember that (3) exists.

In general, if you're going to provide format="fastq", it would be reasonable to expect that the function would consider the qualities embedded in the supplied object, given that the qualities are the only difference from FASTA.

readAxt() and DNAStringSet() are unable to keep lowercase sequences

readAxt() and DNAStringSet() functions automatically convert the lowercase (repetitive) sequences to the uppercase which produces wrong outcomes in research and many papers have already been published without knowing this internal fault of these functions.

matchPattern wildcard for AAstringset

Hi there,

The function matchPattern allows the use of wildcards for DNA sequences;

testnucl <- DNAString("AAAACTAAA")
matchPattern("CNA", testnucl, fixed = F)

Returns:

Views on a 9-letter DNAString subject
subject: AAAACTAAA
views:
    start end width
[1]     5   7     3 [CTA]

While the same strategy for amino acids:

testprot <- AAString("MMMMPSTMMM")
matchPattern("PXT", testprot, fixed = T)

Returns:

Error in normargFixed(fixed, subject) : 
  'fixed' value only supported for a DNA or RNA subject (you can only use 'fixed=TRUE' with your subject)

Is there a reason for this wildcard search not being allowed on amino acid level?

Many thanks in advance,
Sander

Consistency of seqinfo getter/setter on DNAStringSet

Hi @hpages and @lawremi,

I was looking to set the Seqinfo on a DNAStringSet (created by importing a FASTA file) but found some inconsistent/incomplete behaviour, especially when rtracklayer was loaded and attached.
I'm not actually sure whether DNAStringSet is meant to support Seqinfo, but I'd really like it if it did (and had reliable/consistent behaviour).
I think this example should illustrate the issue(s).

Cheers,
Pete

suppressPackageStartupMessages(library(Biostrings))
suppressPackageStartupMessages(library(GenomeInfoDb))
suppressPackageStartupMessages(library(GenomicRanges))

y <- DNAStringSet(x = c("s1" = "CAG", "s2" = "GGGGGT"))
seqinfo(y)
#> Seqinfo object with 2 sequences from an unspecified genome; no seqlengths:
#>   seqnames seqlengths isCircular genome
#>   s1               NA         NA   <NA>
#>   s2               NA         NA   <NA>
seqnames(y) # Shouldn't this give same result as seqnames(seqinfo(y))?
#> Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'seqnames' for signature '"DNAStringSet"'
seqlevels(y)
#> [1] "s1" "s2"
isCircular(y)
#> s1 s2 
#> NA NA
genome(y)
#> s1 s2 
#> NA NA
seqlengths(y)
#> s1 s2 
#> NA NA

seqinfo(y) <- Seqinfo(
  seqnames = as.character(seq_along(y)), 
  seqlengths = lengths(y) + 100, # Unsure if changing length should be allowed.
  isCircular = c(FALSE, TRUE),
  genome = rep("fake", length(y)))
seqinfo(y)
#> Seqinfo object with 2 sequences (1 circular) from fake genome:
#>   seqnames seqlengths isCircular genome
#>   1               103      FALSE   fake
#>   2               106       TRUE   fake
seqnames(y)
#> Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'seqnames' for signature '"DNAStringSet"'
seqlevels(y)
#> [1] "1" "2"
isCircular(y)
#>     1     2 
#> FALSE  TRUE
genome(y)
#>      1      2 
#> "fake" "fake"
seqlengths(y) # Unsure if changing length should be allowed.
#>   1   2 
#> 103 106

suppressPackageStartupMessages(library(rtracklayer)) 
# Now get completely different results!
seqinfo(y)
#> Seqinfo object with 2 sequences from an unspecified genome:
#>   seqnames seqlengths isCircular genome
#>   s1                3         NA   <NA>
#>   s2                6         NA   <NA>
seqnames(y)
#> Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'seqnames' for signature '"DNAStringSet"'
seqlevels(y)
#> [1] "s1" "s2"
isCircular(y)
#> s1 s2 
#> NA NA
genome(y)
#> s1 s2 
#> NA NA
seqlengths(y)
#> s1 s2 
#>  3  6

^{Created on 2018-11-07 by the reprex package (v0.2.1)}

Session info

devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.5.1 (2018-07-02)
#>  os       macOS Sierra 10.12.6        
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_AU.UTF-8                 
#>  ctype    en_AU.UTF-8                 
#>  tz       Australia/Adelaide          
#>  date     2018-11-07                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package              * version   date       lib source        
#>  assertthat             0.2.0     2017-04-11 [1] CRAN (R 3.5.0)
#>  backports              1.1.2     2017-12-13 [1] CRAN (R 3.5.0)
#>  base64enc              0.1-3     2015-07-28 [1] CRAN (R 3.5.0)
#>  Biobase                2.42.0    2018-10-30 [1] Bioconductor  
#>  BiocGenerics         * 0.28.0    2018-10-30 [1] Bioconductor  
#>  BiocParallel           1.16.0    2018-10-30 [1] Bioconductor  
#>  Biostrings           * 2.50.0    2018-10-30 [1] Bioconductor  
#>  bitops                 1.0-6     2013-08-17 [1] CRAN (R 3.5.0)
#>  callr                  3.0.0     2018-08-24 [1] CRAN (R 3.5.0)
#>  cli                    1.0.1     2018-09-25 [1] CRAN (R 3.5.0)
#>  crayon                 1.3.4     2017-09-16 [1] CRAN (R 3.5.0)
#>  debugme                1.1.0     2017-10-22 [1] CRAN (R 3.5.0)
#>  DelayedArray           0.8.0     2018-10-30 [1] Bioconductor  
#>  desc                   1.2.0     2018-05-01 [1] CRAN (R 3.5.0)
#>  devtools               2.0.1     2018-10-26 [1] CRAN (R 3.5.1)
#>  digest                 0.6.18    2018-10-10 [1] CRAN (R 3.5.1)
#>  evaluate               0.12      2018-10-09 [1] CRAN (R 3.5.1)
#>  fs                     1.2.6     2018-08-23 [1] CRAN (R 3.5.0)
#>  GenomeInfoDb         * 1.18.0    2018-10-30 [1] Bioconductor  
#>  GenomeInfoDbData       1.2.0     2018-10-06 [1] Bioconductor  
#>  GenomicAlignments      1.18.0    2018-10-30 [1] Bioconductor  
#>  GenomicRanges        * 1.34.0    2018-10-30 [1] Bioconductor  
#>  glue                   1.3.0     2018-07-17 [1] CRAN (R 3.5.0)
#>  htmltools              0.3.6     2017-04-28 [1] CRAN (R 3.5.0)
#>  IRanges              * 2.16.0    2018-10-30 [1] Bioconductor  
#>  knitr                  1.20      2018-02-20 [1] CRAN (R 3.5.0)
#>  lattice                0.20-38   2018-11-04 [1] CRAN (R 3.5.0)
#>  magrittr               1.5       2014-11-22 [1] CRAN (R 3.5.0)
#>  Matrix                 1.2-15    2018-11-01 [1] CRAN (R 3.5.0)
#>  matrixStats            0.54.0    2018-07-23 [1] CRAN (R 3.5.1)
#>  memoise                1.1.0     2017-04-21 [1] CRAN (R 3.5.0)
#>  pkgbuild               1.0.2     2018-10-16 [1] CRAN (R 3.5.1)
#>  pkgload                1.0.2     2018-10-29 [1] CRAN (R 3.5.1)
#>  prettyunits            1.0.2     2015-07-13 [1] CRAN (R 3.5.0)
#>  processx               3.2.0     2018-08-16 [1] CRAN (R 3.5.0)
#>  ps                     1.2.1     2018-11-06 [1] CRAN (R 3.5.1)
#>  R6                     2.3.0     2018-10-04 [1] CRAN (R 3.5.0)
#>  Rcpp                   0.12.19   2018-10-01 [1] CRAN (R 3.5.1)
#>  RCurl                  1.95-4.11 2018-07-15 [1] CRAN (R 3.5.0)
#>  remotes                2.0.2     2018-10-30 [1] CRAN (R 3.5.0)
#>  rlang                  0.3.0.1   2018-10-25 [1] CRAN (R 3.5.0)
#>  rmarkdown              1.10      2018-06-11 [1] CRAN (R 3.5.0)
#>  rprojroot              1.3-2     2018-01-03 [1] CRAN (R 3.5.0)
#>  Rsamtools              1.34.0    2018-10-30 [1] Bioconductor  
#>  rtracklayer          * 1.42.0    2018-10-30 [1] Bioconductor  
#>  S4Vectors            * 0.20.0    2018-10-30 [1] Bioconductor  
#>  sessioninfo            1.1.1     2018-11-05 [1] CRAN (R 3.5.1)
#>  stringi                1.2.4     2018-07-20 [1] CRAN (R 3.5.1)
#>  stringr                1.3.1     2018-05-10 [1] CRAN (R 3.5.0)
#>  SummarizedExperiment   1.12.0    2018-10-30 [1] Bioconductor  
#>  testthat               2.0.1     2018-10-13 [1] CRAN (R 3.5.0)
#>  usethis                1.4.0     2018-08-14 [1] CRAN (R 3.5.0)
#>  withr                  2.1.2     2018-03-15 [1] CRAN (R 3.5.0)
#>  XML                    3.98-1.16 2018-08-19 [1] CRAN (R 3.5.0)
#>  XVector              * 0.22.0    2018-10-30 [1] Bioconductor  
#>  yaml                   2.2.0     2018-07-25 [1] CRAN (R 3.5.1)
#>  zlibbioc               1.28.0    2018-10-30 [1] Bioconductor  
#> 
#> [1] /Library/Frameworks/R.framework/Versions/3.5/Resources/library

FastqQuality as.matrix is overly complicated

I would expect that to work:
random_fastq_one_read %>% quality %>% as.integer

Even the regular conversions are not working:
error:
as.matrix(random_fastq_one_read %>% quality %>% quality %>% FastqQuality)

I have to jump through a loop:
works:
as(random_fastq_one_read %>% quality %>% quality %>% FastqQuality, "matrix")

replaceAt with VCF Object

It would be a nice feature if the user could create derivative DNA sequences based on variants (including indels which shift the coordinates in the string) from a VCF file. GATK4 has FastaAlternateReferenceMaker, but it has some important caveats.

writeQualityScaledXStringSet omitted the third line

Hi there:
write to the fastq file use writeQualityScaledXStringSet function .
i want omitted on the third line ,Is there a better way.
like ShortRead::writeFastq(..., full = FALSE)

@HWI-EAS88_1_1_1_1001_499
GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT
+
]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS
@HWI-EAS88_1_1_1_898_392
GATTTCTTACCTATTAGTGGTTGAACAGCATCGGAC
+
]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZPICCK

thanks!

Give `matchProbePair` the `fixed` parameter to allow IUPAC matching?

I recently need to predict amplicons using primers with lots of ambiguity codes. The matchProbePair function seemed perfect but it does not interpret the IUPAC codes. I noticed the code uses matchPattern to find primer matches and that function has the fixed option that allows interpretation of ambiguity codes when fixed=FALSE.

Can the matchProbePair function also have this option and pass its value to matchPattern like so?

matchProbePair <- function(Fprobe, Rprobe, subject, algorithm="auto",
             logfile=NULL, verbose=FALSE, fixed = FALSE)
    {
        ## This won't copy the data if Fprobe and Rprobe are already DNAString objects
        F <- DNAString(Fprobe)
        R <- DNAString(Rprobe)

        ## F and R hits on the + strand
        Fp_hits <- start(matchPattern(F, subject, algorithm=algorithm, fixed = fixed))
        Rp_hits <- start(matchPattern(R, subject, algorithm=algorithm, fixed = fixed))

        ## F and R hits on the - strand
        Fm_hits <- end(matchPattern(reverseComplement(F), subject, algorithm=algorithm, fixed = fixed))
        Rm_hits <- end(matchPattern(reverseComplement(R), subject, algorithm=algorithm, fixed = fixed))

        if (verbose) {
            cat("Fp_hits:", Fp_hits, "  Rp_hits:", Rp_hits,
                "  Fm_hits:", Fm_hits, "  Rm_hits:", Rm_hits, "\n")
        }

        matches0 <- Biostrings:::reduceProbePairMatches(c(Fp_hits, Rp_hits), c(Fm_hits, Rm_hits))
        ans <- Views(subject, start=matches0$start, end=matches0$end)

        if (!is.null(logfile)) {
            nFp <- length(Fp_hits)
            nRp <- length(Rp_hits)
            nFm <- length(Fm_hits)
            nRm <- length(Rm_hits)
            nmatches0 <- length(ans)
            ## cat("", ..., sep="\t") is a trick to get an extra tab
            cat("", nFp, nRp, nFm, nRm, nmatches0, file=logfile, sep="\t")
        }
        ans
}

I used the code above to do my analysis and it seemed to work.

I can submit a PR if you are interested.

Thanks!

Suggestion for Biostrings::translate - option to disambiguate

Hi,

Thanks for Biostrings, it's a fantastic R package that I use frequently.

The translate function works well to fuzzy translate ambiguous nucleotides as demonstrated in the help page of the function:

dna2 <- DNAString("HTG ATH **TGR** CCC YTR TRA")
translate(dna2, if.fuzzy.codon="solve")
     seq: MIXPL*

This works exactly as documented. I'd like to suggest an enhancement if it's easy to add. The function has translated TGR to MIXPL* but there are two possible translation solution a stop * or W.

The expected output would be:

translate(dna2, if.fuzzy.codon="solve")
      seq: MI*PL*
      seq: MIWPL*

It works well for short sequences but I can see this growing with very quickly with n > 3 ambiguous bases.

If this is outside the scope of this package or the function, please feel free to close the issue.

Thanks again,

Ammar

adding single element to existing DNAStringSetList

Hi there,

DNAStringSetList seems to have trouble with one intuitive (to me!) way to add a single element using [[ notation. I think the code below should show you what I mean.

thanks!

Janet

library(Biostrings)

dna1 <- c("AAA", "AC", "", "T", "GGATA")
dna2 <- c("G", "TT", "C")

x <- DNAStringSetList(dna1, dna2)
names(x) <- c("a", "b")

## an ordinary list for comparison
y <- list(a=DNAStringSet(dna1), b=DNAStringSet(dna2))

## this works
x[ c("c","d") ] <- DNAStringSetList(dna1,dna2)
x

class(x[["a"]])
# [1] "DNAStringSet"
# attr(,"package")
# [1] "Biostrings"

## this doesn't work, seems like it should (by analogy with ordinary lists)
x[["e"]] <- DNAStringSet(dna1)
    # Error in .wrap_in_length_one_list_like_object(value, name, x) : 
    #   failed to coerce 'list(value)' to a DNAStringSetList object of length 1
# works for an ordinary list
y[["c"]] <- DNAStringSet(dna1)

## this also doesn't work, but it makes sense to me that it shouldn't work
x[["e"]] <- DNAStringSetList(dna1)
    # Error in .wrap_in_length_one_list_like_object(value, name, x) : 
    #   failed to coerce 'list(value)' to a DNAStringSetList object of length 1

## single [ notation does work though
x["e"] <- DNAStringSetList(dna1)


sessionInfo()

R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Biostrings_2.56.0   XVector_0.28.0      IRanges_2.22.2      S4Vectors_0.26.1    BiocGenerics_0.34.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.34.0 compiler_4.0.2  tools_4.0.2     crayon_1.3.4

DNAStringSet(x, start=, end=) loses quality information

When trimming a DNAStringSet containing qualites using DNAStringSet(x, start=, end=), the
quality information is lost:

library(Biostrings)

fl <- system.file( package="Biostrings", "extdata", "s_1_sequence.txt")
fq <- readDNAStringSet(fl, format="fastq", with.qualities=TRUE)[1:4]
print(class(elementMetadata(fq)))
# [1] "DFrame"
# attr(,"package")
# [1] "S4Vectors"
fq.trimmed <- DNAStringSet(fq, start=10, end=30)
print(class(elementMetadata(fq.trimmed)))
# [1] "NULL"

The quality information is present after reading the data with readDNAStringSet(), after trimming
the elementMetadata slot is empty.

This is rather unexpected, IMHO the quality information should be retained and trimmed as well.

The environment is as follows:

R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ShortRead_1.52.0            GenomicAlignments_1.30.0    SummarizedExperiment_1.24.0
 [4] Biobase_2.54.0              MatrixGenerics_1.6.0        matrixStats_0.61.0         
 [7] Rsamtools_2.10.0            GenomicRanges_1.46.1        BiocParallel_1.28.2        
[10] Biostrings_2.62.0           GenomeInfoDb_1.30.0         XVector_0.34.0             
[13] IRanges_2.28.0              S4Vectors_0.32.3            BiocGenerics_0.40.0        

loaded via a namespace (and not attached):
 [1] zlibbioc_1.40.0        lattice_0.20-45        jpeg_0.1-9             hwriter_1.3.2         
 [5] tools_4.1.2            parallel_4.1.2         grid_4.1.2             png_0.1-7             
 [9] latticeExtra_0.6-29    crayon_1.4.2           Matrix_1.3-4           GenomeInfoDbData_1.2.7
[13] RColorBrewer_1.1-2     bitops_1.0-7           RCurl_1.98-1.5         DelayedArray_0.20.0   
[17] compiler_4.1.2

Key error from new_XStringSet_from_CHARACTER call

I'm hoping for some help debugging this issue. It's the same error as this previous issue: #35
However the character "1" which is said to throw the error is not in my DNA sequences, I have confirmed several times over with various grep commands. The only characters in my DNA sequences are ACTG, N, and a few degenerate bases.

Here is the error:

Error in .Call2("new_XStringSet_from_CHARACTER", class(x0), elementType(x0),  : 
  key 49 (char '1') not in lookup table

Here is the command (yes, this is from dada2 assignTaxonomy function, but I am asking here because it's the biostrings function call that's throwing the error, even some clarity that the problem is upstream will help me, if that is the case):
taxa <- assignTaxonomy(seqtab, "COI_reference_dada2gen.fa.gz", multithread=TRUE, verbose=TRUE)

Here is a snippet of the input fasta file:

>Eukaryota;Metazoa;Arthropoda;Insecta;Coleoptera;Coccinellidae;Epilachna;Epilachna admirabilis;AB002178|77537;
CGCCTTTATCAGCAAATATTGCTCATGGA...
>Eukaryota;Metazoa;Arthropoda;Insecta;Coleoptera;Coccinellidae;Epilachna;Epilachna sp. H;AB002179|77538;
AGCAAATATCGCACATGGAGGATCATCTG...
>Eukaryota;Metazoa;Arthropoda;Insecta;Coleoptera;Coccinellidae;Henosepilachna;Henosepilachna vigintioctopunctata;AB002180|420089;
AGTTTACCCACCTCTTTCAGCTAATATTG...
>Eukaryota;Metazoa;Arthropoda;Insecta;Coleoptera;Coccinellidae;Henosepilachna;Henosepilachna vigintioctopunctata;AB002181|420089;
AGTTTACCCTCCTCTTTCAGCTAATATTG...

A grep to find any errant numbers in the fasta sequence yields nothing:
grep '[0-9]' COI_reference_dada2gen.fa | grep -v ">"

Please note all sequences begin and end with [ACTG] or degenerate bases, I've truncated them in the example to make this post easier to read.

Thanks in advance for any guidance anyone can provide.

stringDist returns negative values

subst <- matrix(nrow=4,ncol=4,dimnames=list(c("A","C","T","G"),c("A","C","T","G")),data=1)
diag(subst) <- 0
subst["A","G"] <- 0.75
subst["T","C"] <- 0.75
subst["G","A"] <- 0.75
subst["C","T"] <- 0.75

stringDist(DNAStringSet(c("CGTCTTA", "GACA" )), method = "substitutionMatrix",
type = "global", substitutionMatrix = subst,
diag = T, upper = T, gapOpening = 1)

 1    2

1 0.0 -0.5
2 -0.5 0.0

Why is there even such a thing as a negative distance? Makes no sense to me.

Error in .Call2("new_XStringSet_from_CHARACTER", class(x0), elementType(x0), : key 51 (char '3') not in lookup table

Hi all,

Even in these horrific times of COVID-19 pandemic we try to focus on our research.

The DADA2 pipeline Tutorial was very useful but now I am struggling big time on the phyloseq part.
I just removed all Eukaryota en NA from my data set:
ps_EPSO1_noEuks_noUnk <- subset_taxa(ps_EPSO1, Kingdom!="Eukaryota" &
Phylum!=" ")

print(ps_EPSO1_noEuks_noUnk)
phyloseq-class experiment-level object
otu_table() OTU Table: [ 3455 taxa and 40 samples ]
sample_data() Sample Data: [ 40 samples by 13 sample variables ]
tax_table() Taxonomy Table: [ 3455 taxa by 6 taxonomic ranks ]
refseq() DNAStringSet: [ 3455 reference sequences ]
But I receive on my next step an error:

dna <- Biostrings::DNAStringSet(taxa_names(ps_EPSO1_noEuks_noUnk))
Error in .Call2("new_XStringSet_from_CHARACTER", class(x0), elementType(x0), :
key 51 (char '3') not in lookup table
If I look at my ps_EPSO1_noEuks_noUnk, my pay_tree is NULL. I have no idea how come.

Some help would be much appreciated!

BW
H.

findPalindromes and palindromeArmLength - inconsistent arguments

hi there,

I was excited to see there's a findPalindromes function - this could be really useful for a project I'm working on at the moment.

I'll probably want to use some combination of findPalindromes and palindromeLeftArm (/RightArm). But it looks like findPalindromes and palindromeLeftArm/palindromeRightArm/palindromeArmLength are not behaving consistently with one another.

Seems like it would be sensible for all four palindrome functions to behave the same as each other (with same arguments and same defaults) - is that easy to implement?

Some code to show my problem is below.

Also, I intend to use the function a bunch of times on a bunch of different sequences. I imagine it might get slow, so I'm wondering if there could be a way to only call the function once and get all features of the palindrome hit returned at once? Right now I think I need to call findPalindromes to see the start and end positions of the palindrome, and then either palindromeArmLength or palindromeLeftArm/palindromeRightArm to determine the extent of each arm (because I'm allowing a loop, the arms are not simply half of the seqlength).

thanks, as always,

Janet

library(Biostrings)

#### set up test data
# pal1 is a perfect 10 char palindrome with a 3 char loop
# pal2 is the same thing, but with one mismatch between arms
# pal3 has two mismatches between arms
pal1 <- BString("abcdefghijNOPjihgfedcba")
pal2 <- BString("abcXefghijNOPjihgfedcba")
pal3 <- BString("aXcXefghijNOPjihgfedcba")


#### the perfect palindome with loop (pal1) behaves OK:
findPalindromes(pal1, max.looplength=3)
#   Views on a 23-letter BString subject
# subject: abcdefghijNOPjihgfedcba
# views:
#     start end width
# [1]     1  23    23 [abcdefghijNOPjihgfedcba]

palindromeLeftArm(pal1, max.looplength=3)
#   10-letter "BString" instance
# seq: abcdefghij
palindromeRightArm(pal1, max.looplength=3)
#   10-letter "BString" instance
# seq: jihgfedcba


#### but for pal2...
## findPalindromes looks OK - it's controllable using the args as I would like:
findPalindromes(pal2, max.looplength=3)
#   Views on a 23-letter BString subject
# subject: abcXefghijNOPjihgfedcba
# views:
#     start end width
# [1]     5  19    15 [efghijNOPjihgfe]

## but palindromeLeftArm / palindromeRightArm / palindromeArmLength don't pay attention to the criteria I'd like to supply, even though they don't raise a warning about unused arguments

palindromeLeftArm(pal2, max.looplength=3)
#   3-letter "BString" instance
# seq: abc

palindromeRightArm(pal2, max.looplength=3)
#  3-letter "BString" instance
# seq: cba

palindromeArmLength(pal2, max.looplength=3)
# [1] 3

palindromeArmLength(pal2, max.looplength=3, min.armlength=4)
# [1] 3

## gets a bit ridiculous to report super-short palindromes like this:
palindromeLeftArm(pal3, max.looplength=3, min.armlength=4)
#   1-letter "BString" instance
# seq: a

sessionInfo()


R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Biostrings_2.53.0   XVector_0.25.0      IRanges_2.19.10     S4Vectors_0.23.17   BiocGenerics_0.31.4

loaded via a namespace (and not attached):
[1] zlibbioc_1.31.0 compiler_3.6.1

N50 function cannot operate on genomes larger than 2.1Gbp

R's integers have long precision, meaning they max out around 2.14 x 10^9. Many genomes (e.g. human) and other data sets that are desirable to compute N50s for (e.g. long-read sequence data) have cumulative sizes far in excess of this. For this function to be able to operate on these larger data sets, it needs to leave the data in numeric form (ie double precision).

I've created a pull request (#27 ) that removes the cast to integer which the function performs.

Quality to integer conversion

Hi,

there is a bug causing a warning during conversion of a XStringQuality object containing multiple sequences to integer values. This is probaly due to charToRaw not being vectorized.

# works
as.integer(PhredQuality("!!!!!!!!!h"))
# causes warning
as.integer(PhredQuality(c("!!!!!!!!!h","!!!!!!!!!h","!!!!!!!!!h")))
Warning message:
In charToRaw(as.character(x)) :
  argument should be a character vector of length 1
all but the first element will be ignored

If this behavior unintended or am I using it wrongly?

Kind regards
Felix

PS: If this is really unintended behavior, a fix for this would concern line 54 of R/XStringQuality-class.R only. A change from

value <- as.integer(charToRaw(as.character(x))) - offset(scale)

value <- IntegerList(lapply(lapply(as.character(q),charToRaw),as.integer)) - offset(scale)

if(is.null(names(q))){
    unname(split(as.integer(unlist(q)),Rle(seq_along(q),width(q)))) - offset(scale)
} else { 
    split(as.integer(unlist(q)),Rle(names(q),width(q))) - offset(scale)
}

I didn't do any benchmarking.

Unexpected behaviour in pairwiseAlignment()

It appears that pairwiseAligment behaves as if it was always performing local alignment, no matter what value I set in the type parameter. Just taking the example from https://www.bioconductor.org/packages/3.7/bioc/vignettes/Biostrings/inst/doc/PairwiseAlignments.pdf

pairwiseAlignment(pattern = c("succeed", "precede"), subject = "supersede", gapOpening = 0, gapExtension = 1)

what I'm getting:

Global PairwiseAlignmentsSingleSubject (1 of 2)
pattern: [1] su-cce--ed 
subject: [1] sup--ersed 
score: 7.945507

whereas the expected result (taken from the pdf I linked) would be:

Global PairwiseAlignmentsSingleSubject (1 of 2)
pattern: su-cce--ed-
subject: sup--ersede
score: 7.945507

My session info:

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.3

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BiocInstaller_1.28.0 seqinr_3.4-5         reshape2_1.4.3       bio3d_2.3-4          plyr_1.8.4           ggplot2_2.2.1        stringr_1.2.0        Biostrings_2.46.0   
 [9] XVector_0.18.0       IRanges_2.12.0       S4Vectors_0.16.0     BiocGenerics_0.24.0  biomaRt_2.34.2      

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.15         pillar_1.1.0         compiler_3.4.3       prettyunits_1.0.2    bitops_1.0-6         tools_3.4.3          progress_1.1.2       zlibbioc_1.24.0     
 [9] digest_0.6.14        bit_1.1-12           RSQLite_2.0          memoise_1.1.0        tibble_1.4.1         gtable_0.2.0         rlang_0.1.6          DBI_0.7             
[17] curl_3.1             yaml_2.1.16          httr_1.3.1           ade4_1.7-11          bit64_0.9-7          grid_3.4.3           Biobase_2.38.0       R6_2.2.2            
[25] AnnotationDbi_1.40.0 XML_3.98-1.9         blob_1.1.0           magrittr_1.5         MASS_7.3-48          scales_0.5.0         assertthat_0.2.0     colorspace_1.3-2    
[33] labeling_0.3         stringi_1.1.6        RCurl_1.95-4.10      lazyeval_0.2.1       munsell_0.4.3

Many thanks,
Greg

from txdb/getSeq to fasta

Hi, I'm aiming to extract CD sequencesin fasta format by the following:

# load GTF
txdb <- makeTxDbFromGFF("BFgenomic.gff", format="gff")
# extract CDS
txdb.cds_by_transcript <- cdsBy(txdb, by="tx", use.names = TRUE)
dna <- readDNAStringSet("BFgenomic.fa")
# Extract CDS sequences
cds_by_trnascript <- getSeq(dna, txdb.cds_by_transcript)
# Write to fasta
writeXStringSet(cds_by_trnascript, "my.fasta")

but running into the error

Error in writeXStringSet(cds_by_trnascript, "my.fasta") : 
  'x' must be an XStringSet object

What I need to maybe convert my cds_by_trnascript to a XStringSet object?

Thank you

findPalindrome / palindromeLeftArm and mismatches - extends into loop

hi there,

Another thought about findPalindromes. Something about the way it treats mismatches between palindrome arms is a little unintuitive, biologically, while it is probably correct in the formal sense. My biologist's intuition is that mismatches should be tolerated WITHIN the palindrome arms. However, the code also allows mismatches at the termini of palindromes, so that palindromes get extended into the loop region. I want to tolerate mismatches so I can recognize diverged palindrome arms, but I don't want to artificially extend the palindrome arm into the loop. See code below.

I also found some weird behavior with one of my examples - also illustrated below.

thanks again,

Janet


library(Biostrings)

#### set up test data
# pal1 is a perfect 10 char palindrome with a 7 char loop
# pal2 is the same thing, but with one mismatch between arms
pal1 <- BString("abcdefghijNOPQRSTjihgfedcba")
pal2 <- BString("abcXefghijNOPQRSTjihgfedcba")


## without mismatches (pal1), things look good:
findPalindromes(pal1, max.looplength=7)
#   Views on a 27-letter BString subject
# subject: abcdefghijNOPQRSTjihgfedcba
# views:
#     start end width
# [1]     1  27    27 [abcdefghijNOPQRSTjihgfedcba]
palindromeLeftArm(pal1, max.looplength=7)
#   10-letter "BString" instance
# seq: abcdefghij

## allowing mismatches extends the palidrome into the loop. I want to tolerate mismatches within arms, but not to extend them.
palindromeLeftArm(pal1, max.looplength=7, max.mismatch=1)
#   11-letter "BString" instance
# seq: abcdefghijN
palindromeLeftArm(pal1, max.looplength=7, max.mismatch=2)
#   12-letter "BString" instance
# seq: abcdefghijNO
palindromeLeftArm(pal1, max.looplength=7, max.mismatch=3)
#   13-letter "BString" instance
# seq: abcdefghijNOP


## and I found something weird (an error??) with pal2 (where there's a mismatch). palindromeLeftArm finds the longer palindrome when I allow a mismatch, but findPalindromes misses the longer palindrome and only reports the shorter one, truncating at the mismatch. Are they using different algorithms?   Maybe it's worth including a brief description of the algorithm used in the help.

findPalindromes(pal2, max.looplength=7, max.mismatch=1)
#   Views on a 27-letter BString subject
# subject: abcXefghijNOPQRSTjihgfedcba
# views:
#     start end width
# [1]     5  23    19 [efghijNOPQRSTjihgfe]

palindromeLeftArm(pal2, max.looplength=7, max.mismatch=1)
#   10-letter "BString" instance
# seq: abcXefghij


sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Biostrings_2.53.0   XVector_0.25.0      IRanges_2.19.10     S4Vectors_0.23.17   BiocGenerics_0.31.4

loaded via a namespace (and not attached):
[1] zlibbioc_1.31.0 compiler_3.6.1  tools_3.6.1

readDNAStringSet for long FASTQ files

Is it possible to import ONT long reads with readDNAStringSet. I get this error message when I try it "reading FASTQ file fail//fastq_runid_1d3e3951d2d9b150a6b515f944bf47b50d01104f_0.fastq: cannot read line 10, line is too long".

Is there any way to extract a sequence between two other sequences and allow for mismatches

First of all thank you for maintaining and developing Biostrings and you make our lives better everyday.
The following might not be an issue, but mostly its a request/ I am asking for help. Please accept my apologies in advance for taking the advantage.
I have a string that looks like this

dna_string <- DNAString("AAAAANNNNNNNNNNNNNNNNNNNNNNNNNCCCCC")

I want to find a way to extract the sequence between left-pattern=AAAAA and right pattern=CCCCC and allow for mismatches on left and right.
I would like to set the minimum distance between left and right pattern to six and the maximum to 30.

Do you have any idea if this is possible?
I am aware of the super cool

matchLRPatterns("AAAAA", "CCCCC", 25, dna_string)

but in here I can only set the maximum distance and not the minimum?

Subsetting the SharedRaw_Pool

Parts of my workflow involve taking a subsequence and saving it to file in a RDS object. However, the way that subseq works on the SharedRaw_Pool means that the entire sequence is still in the object:

X <- DNAStringSet("ACACTACGACGATCGATCGATCGATCGA")
object.size(X)
## 4088 bytes
object.size(subseq(X, 1, 1))
## 4088 bytes

I understand the rationale for this - I do a very similar thing in InteractionSet - but is there a function to reconstruct the pool from only the parts of sequence that are in use? This will reduce the size of the objects being passed around, which would make life a lot easier for analyzing long-read sequencing data.

Session information

R version 3.5.0 Patched (2018-04-30 r74679)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /home/cri.camres.org/lun01/Software/R/R-3-5-branch/lib/libRblas.so
LAPACK: /home/cri.camres.org/lun01/Software/R/R-3-5-branch/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] Biostrings_2.49.1   XVector_0.21.3      IRanges_2.15.18    
[4] S4Vectors_0.19.19   BiocGenerics_0.27.1

loaded via a namespace (and not attached):
[1] zlibbioc_1.27.0 compiler_3.5.0  tools_3.5.0

translate gives wrong results for triplets TTG and CTG

Hello,

Since at least Biostring 2.42, the function "translate" gives wrong results for triplets TTG and CTG.

translate(DNAString("TTG"))
1-letter "AAString" instance
seq: M

translate(DNAString("CTG"))
1-letter "AAString" instance
seq: M

=> Both should be L (leucine).

We think it's related to the erroneous default use of alternative genetic codes.

Cheers,

Seb

extract the "start" position of "matchPattern"

hi, guys:

For the following code, how could I extract the "start“ of the match and calculate the average, which will be 3?

Thanks!

Jie

Error when trying to convert string to DNAString

I am getting the following error when trying to convert a string to a DNAString:

Error in as(mcols, mcols_target_class) :
no method or default for coercing “DFrame” to “DataTable_OR_NULL”

I reinstalled Bioconductor but no improvement. Any ideas?

AAString - Amino acid code enforced?

I worte on the BioC devel list a few days ago about a problem/behavior I have/I encounter with the AAString class. I haven't received a reply, so I am adding this as an issue, because in my opinion it is one. Thanks for having a look at it.

Below are the contents of the mail describing the issue:

I tried the following code, which should according to ?AAString not work, since ÜÖÄ are not part of any AA code.

> AAString("ÜÄÖ")
  3-letter "AAString" instance
seq: ÜÄÖ
> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Biostrings_2.46.0   XVector_0.18.0      IRanges_2.12.0      S4Vectors_0.16.0    BiocGenerics_0.24.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.24.0 compiler_3.4.4  tools_3.4.4     yaml_2.1.18

I don’t have access right now to the devel version of Biostrings, bit I checked out the current Code in the github repo and its recent changes. I am pretty sure, that this behavior is also in the current devel branch. Can someone confirm this?

My current interest is in using the XString classes and methods for an additional biological string representation. The initial question was, how can I restrict this to a certain character set, if the characters are not saved byte encoded? The latter option is not available to me, since characters like ‚«‘ or ‚=‘ result in a two byte code using the charToRaw function. This trips up the build of the internal lookup table, which are passed down to the C backend.

Therefore I looked into, how this is done for an AAString differing from a BString. I discovered, that it currently doesn‘t. I also looked into the current 2.47.12 repo, which as far as I can tell does not use the AMINO_ACID_CODE constant in the creation of an AAString object.

So my questions are:

What is the best practice for extending a class from XString with a restricted character set, which is not byte encoded?
Is there a way to use byte encoding for chars with two ore more bytes?

Thanks in advance for any help and suggestions.

Best regards,
Felix

PS: regarding the second question: One could change „as.integer(charToRaw(paste(letters, collapse="")))“ to „lapply(lapply(letters,charToRaw),as.integer)“ in .letterAsByteVal, but in any case it will not be atomic anymore, which I think is required to be excepted by the C backend. I didn’t test it.

Introduce checks for R interrupts in the C code?

Would it be possible to introduce checks for R in the C code? I would like to be able to use some of these functions with R.utils::withTimeout(), but the C code ignores attempts to interrupt it by the user. I think that https://rdrr.io/github/HenrikBengtsson/RNativeAPI/man/R_CheckUserInterrupt.html might be applicable for this.

Subset assignment into a QualityScaledStringSet does not work

I would have expected this to work;

library(Biostrings)
stuff <- QualityScaledDNAStringSet(DNAStringSet(c("AAA", "TTT", "CCC", "GGG")),
                                   PhredQuality(c("(((", "!!!", "###", "...")))
stuff[1] <- stuff[2]

... except that I get the following complaint:

Error in validObject(.Object) : 
  invalid class “QualityScaledDNAStringSet” object: invalid object for slot "quality" in class "QualityScaledDNAStringSet": got class "S4", should be or extend class "XStringQuality"

There are no problems with the equivalent DNAStringSet operation:

whee <- as(stuff, "DNAStringSet")
whee[1] <- whee[2] # Works as expected.

Session info:

R Under development (unstable) (2017-10-27 r73632)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: /home/cri.camres.org/lun01/Software/R/trunk/lib/libRblas.so
LAPACK: /home/cri.camres.org/lun01/Software/R/trunk/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] stringr_1.2.0              ShortRead_1.37.0          
 [3] GenomicAlignments_1.15.2   SummarizedExperiment_1.9.5
 [5] DelayedArray_0.5.14        matrixStats_0.52.2        
 [7] Biobase_2.39.1             Rsamtools_1.31.1          
 [9] BiocParallel_1.13.1        sarlacc_1.0.0             
[11] Biostrings_2.47.1          XVector_0.19.1            
[13] GenomicRanges_1.31.3       GenomeInfoDb_1.15.1       
[15] IRanges_2.13.4             S4Vectors_0.17.15         
[17] BiocGenerics_0.25.1       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.14           RColorBrewer_1.1-2     compiler_3.5.0        
 [4] pillar_1.0.1           prettyunits_1.0.2      GenomicFeatures_1.31.1
 [7] bitops_1.0-6           tools_3.5.0            zlibbioc_1.25.0       
[10] progress_1.1.2         biomaRt_2.35.5         digest_0.6.13         
[13] bit_1.1-12             RSQLite_2.0            memoise_1.1.0         
[16] tibble_1.4.1           lattice_0.20-35        pkgconfig_2.0.1       
[19] rlang_0.1.6            igraph_1.1.2           Matrix_1.2-12         
[22] DBI_0.7                GenomeInfoDbData_1.1.0 hwriter_1.3.2         
[25] rtracklayer_1.39.4     httr_1.3.1             ade4_1.7-10           
[28] bit64_0.9-7            grid_3.5.0             R6_2.2.2              
[31] AnnotationDbi_1.41.4   XML_3.98-1.9           RMySQL_0.10.13        
[34] latticeExtra_0.6-28    seqinr_3.4-5           magrittr_1.5          
[37] blob_1.1.0             MASS_7.3-48            assertthat_0.2.0      
[40] muscle_3.21.0          stringi_1.1.6          RCurl_1.95-4.9

translate uses alternative start codons by default

I am not sure, that this behavior is intended:

library(Biostrings)
translate(DNAStringSet(c("TTG","CTG","TAT")))
#>   A AAStringSet instance of length 3
#>     width seq
#> [1]     1 M
#> [2]     1 M
#> [3]     1 Y
GENETIC_CODE
#> TTT TTC TTA TTG TCT TCC TCA TCG TAT TAC TAA TAG TGT TGC TGA TGG CTT CTC 
#> "F" "F" "L" "L" "S" "S" "S" "S" "Y" "Y" "*" "*" "C" "C" "*" "W" "L" "L" 
#> CTA CTG CCT CCC CCA CCG CAT CAC CAA CAG CGT CGC CGA CGG ATT ATC ATA ATG 
#> "L" "L" "P" "P" "P" "P" "H" "H" "Q" "Q" "R" "R" "R" "R" "I" "I" "I" "M" 
#> ACT ACC ACA ACG AAT AAC AAA AAG AGT AGC AGA AGG GTT GTC GTA GTG GCT GCC 
#> "T" "T" "T" "T" "N" "N" "K" "K" "S" "S" "R" "R" "V" "V" "V" "V" "A" "A" 
#> GCA GCG GAT GAC GAA GAG GGT GGC GGA GGG 
#> "A" "A" "D" "D" "E" "E" "G" "G" "G" "G" 
#> attr(,"alt_init_codons")
#> [1] "TTG" "CTG"

In my opinion TTG and CTG should return L, shouldn't they?

readDNAStringSet fails with Legacy Mac (CR) line breaks

I got a help request from someone running iOS 10.12. The person couldn't figure out why readDNAStringSet didn't load the correctly formatted fasta file. After some troubleshooting I found out that the file contained Legacy Mac (CR) line breaks which apparently aren't recognised as line breaks by readDNAStringSet.

The function runs without error or warning but the result is meaningless (1, 0-width sequence).

Is it possible to support these line breaks or raise an informative error?

thanks for the great package!

Fabian

allow gap characters in translate and codons functions?

hi there,

I've been reading in some multiple sequence alignments, as DNAStringSet objects. They're nucleotide alignments that encode proteins. I've been playing with using 'translate', but it looks like it's not set up to deal with gap characters.

I might be missing some nice alternative way to do this, but if not, I guess I'd like to suggest enhancement to better deal with in-frame nucleotide alignments of coding sequences. I think the code below will show you what I mean, but if it's not clear please let me know.

thanks!

Janet

Dr. Janet Young

Malik lab
http://research.fhcrc.org/malik/en.html

Division of Basic Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Avenue N., A2-025,
P.O. Box 19024, Seattle, WA 98109-1024, USA.

tel: (206) 667 4512
email: jayoung ...at... fredhutch.org

library(Biostrings)

####### example data
### make a multiple sequence alignment of coding sequences
aln <- DNAStringSet(c( "ATGGCATCTACTTTGTATGACTATTGCAGAGTGCCCATGGGTGACATCTGTAAGAAAGATGGGGATAAGCGCTGTAAGCTT", "ATGGCATCTACTTCGTATGACTATTGCAGAGTGCCCATG---------------GAAGACGGGGATAAGCGCTGTAAGCTT", "ATGGCATCTACTTTGTATGACTATTGCAGAGTGCCCATGGGTGACATCTGTAAGAAAGATGGGGATAAGCGCTGTAAGCTT"))
names(aln) <- c("seq1","seq2","seq3")

### get ungapped seqs
ungapped <- DNAStringSet(gsub("-","",aln))



####### codons function
## codons only works if there are no gaps. How about adding an option on codons that the user can set to allow gap characters? 

codons(aln[[1]])
   # works

codons(aln[[2]])
# Error in .XString.codons(x) : 
#   some trinucleotides in 'x' contain non-base letters

## Might also be useful to allow N characters (the translate function enables this with the if.fuzzy.codon option, but the codons function is more strict):
codons(DNAString("ATGNCA"))
# Error in .XString.codons(x) : 
#   some trinucleotides in 'x' contain non-base letters

translate(DNAString("ATGNCA"), if.fuzzy.codon="solve")
#   2-letter "AAString" instance
# seq: MX

codons(DNAString("ATGNCA"), if.fuzzy.codon="solve")
# Error in codons(DNAString("ATGNCA"), if.fuzzy.codon = "solve") : 
#   unused argument (if.fuzzy.codon = "solve")



####### translation function

# works fine on the ungapped sequences
translate(ungapped)
#  A AAStringSet instance of length 3
#    width seq                                               names               
#[1]    27 MASTLYDYCRVPMGDICKKDGDKRCKL                       seq1
#[2]    22 MASTSYDYCRVPMEDGDKRCKL                            seq2
#[3]    27 MASTLYDYCRVPMGDICKKDGDKRCKL                       seq3

### but translation doesn't work when there are gaps (whether it's a single sequence or an alignment)
translate(aln)
# Error in .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet],  : 
#   in 'x[[2]]': not a base at pos 40

translate(aln[[2]])
# Error in .Call2("DNAStringSet_translate", x, skip_code, dna_codes[codon_alphabet],  : 
#   not a base at pos 40

translate(aln[[1]])
#   27-letter "AAString" instance
# seq: MASTLYDYCRVPMGDICKKDGDKRCKL

### I thought I might be able to define my own genetic code to deal with this, but I can't. Would it be possible to add an option to allow genetic codes to include additional codon types?

gapCodon <- "-"
names(gapCodon) <- "---"
my_GENETIC_CODE <- c(GENETIC_CODE, gapCodon)

translate(aln[[1]], genetic.code=GENETIC_CODE)
#   27-letter "AAString" instance
# seq: MASTLYDYCRVPMGDICKKDGDKRCKL

translate(aln[[1]], genetic.code=my_GENETIC_CODE)
# Error in .normarg_genetic.code(genetic.code) : 
#   'genetic.code' must have the same names as predefined constant GENETIC_CODE


### if you do implement translation of gapped sequences, it'll be necessary to deal with cases like the following, where the alignment gap only spans part of a codon. Depending on what my downstream usage is, I usually either mark this in translated sequences as a frameshift (i.e. translate to X or to ! character) or I include a gap at that position in the translated sequence. 
translate(DNAString("ATG-CA"))





####### session info

sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS/LAPACK: /app/easybuild/software/OpenBLAS/0.2.18-GCC-5.4.0-2.26-LAPACK-3.6.1/lib/libopenblas_prescottp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] Biostrings_2.52.0   XVector_0.24.0      IRanges_2.18.1     
[4] S4Vectors_0.22.0    BiocGenerics_0.30.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.30.0 compiler_3.6.1

trinucleotideFrequency Segmentation Fault if Input Data is Large

My aim is to compute entropy of all of the unmapped reads for each patient sample of whole genome sequencing data and remove low-complexity sequences before doing a microbiome analysis. Filtering works for the smaller samples, but segmentation faults on the larger ones. I use R 4.1.0 and Biostrings 2.60.1. Ihave narrowed the crashing down to trinucleotideFrequency.

> testCase <- readFastq("OSCC_1-N_unmapped_all.fastq.gz")
testCase 
> testCase
class: ShortReadQ
length: 48324630 reads; width: 150 cycles
> trinucleotideFrequency(sread(testCase))

 *** caught segfault ***
address 0x7fa808276fb8, cause 'memory not mapped'

Traceback:
 1: .Call2("XStringSet_oligo_frequency", x, width, step, as.prob,     as.array, fast.moving.side, with.labels, simplify.as, base_codes,     PACKAGE = "Biostrings")
 2: .local(x, width, step, as.prob, as.array, fast.moving.side, with.labels,     ...)
 3: oligonucleotideFrequency(x, 3L, step = step, as.prob = as.prob,     as.array = as.array, fast.moving.side = fast.moving.side,     with.labels = with.labels, ...)
 4: oligonucleotideFrequency(x, 3L, step = step, as.prob = as.prob,     as.array = as.array, fast.moving.side = fast.moving.side,     with.labels = with.labels, ...)
 5: trinucleotideFrequency(sread(testCase))

narrow(x, start=, end=) does not trim quality information

When trimming a DNAStringSet containing qualites using narrow(x, start=, end=), the quality information is not trimmed, which leds to an error in writeXStringSet():

library(Biostrings)

fl <- system.file( package="Biostrings", "extdata", "s_1_sequence.txt")
fq <- readDNAStringSet(fl, format="fastq", with.qualities=TRUE)[1:4]
print(mcols(fq)$qualities)
# BStringSet object of length 4:
#     width seq
# [1]    36 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS
# [2]    36 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZPICCK
# [3]    36 ]]]]Y]]]]]V]T]]]]]T]]]]]V]TMJEUXEFLA
# [4]    36 ]]]]]]]]]]]]]]]]]]]]]]T]]]]RJRZTQLOA
fq.narrow <- narrow(fq, start=10, end=30)
print(mcols(fq.narrow)$qualities)
# BStringSet object of length 4:
#     width seq
# [1]    36 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS
# [2]    36 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZPICCK
# [3]    36 ]]]]Y]]]]]V]T]]]]]T]]]]]V]TMJEUXEFLA
# [4]    36 ]]]]]]]]]]]]]]]]]]]]]]T]]]]RJRZTQLOA

writeXStringSet(fq.narrow, "fq_narrowed.fq", format="fastq")
# Error in .Call2("write_XStringSet_to_fastq", x, filexp_list, qualities,  : 
#  'x' and 'quality' must have the same width

Sequences are trimmed as expected, but the quality scores seem to be untouched, as they have the same length and content as before.

The environment is as follows:

R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ShortRead_1.52.0            GenomicAlignments_1.30.0    SummarizedExperiment_1.24.0
 [4] Biobase_2.54.0              MatrixGenerics_1.6.0        matrixStats_0.61.0         
 [7] Rsamtools_2.10.0            GenomicRanges_1.46.1        BiocParallel_1.28.2        
[10] Biostrings_2.62.0           GenomeInfoDb_1.30.0         XVector_0.34.0             
[13] IRanges_2.28.0              S4Vectors_0.32.3            BiocGenerics_0.40.0        

loaded via a namespace (and not attached):
 [1] zlibbioc_1.40.0        lattice_0.20-45        jpeg_0.1-9             hwriter_1.3.2         
 [5] tools_4.1.2            parallel_4.1.2         grid_4.1.2             png_0.1-7             
 [9] latticeExtra_0.6-29    crayon_1.4.2           Matrix_1.3-4           GenomeInfoDbData_1.2.7
[13] RColorBrewer_1.1-2     bitops_1.0-7           RCurl_1.98-1.5         DelayedArray_0.20.0   
[17] compiler_4.1.2

Streaming read sequences from a connection with readDNAStringSet

Would it be much work to allow reads to be streamed in from a connection to a FASTQ file? Something like:

fhandle <- open_some_FASTQ_file("my.fastq")
first.chunk <- readDNAStringSet(fhandle, nrec=1000) # first 1000
second.chunk <- readDNAStringSet(fhandle, nrec=1000) # next 1000
# etc.
close(fhandle)

This would allow us to process reads in blocks, which would be more memory-friendly than having to read the entire FASTQ file into memory for simultaneous processing. To achieve this right now, I would need to use skip, which presumably is less efficient as it needs to re-run through the skipped records in the file.

Package installation fails in latest R 4.2.0-nightly

Hi, I am trying to set up my environment for the development of a R package in Bioconductor, using the latest (15/03) nightly build and installing dependencies using Bioconductor devel. It looks like the package fails to compile under these conditions:

> BiocManager::install('Biostrings', version="devel")
'getOption("repos")' replaces Bioconductor standard repositories, see
'?repositories' for details

replacement repositories:
    CRAN: https://mirrors.dotsrc.org/cran

Bioconductor version 3.15 (BiocManager 1.30.16), R Under development (unstable)
  (2022-03-14 r81896)
Installing package(s) 'Biostrings'
trying URL 'https://bioconductor.org/packages/3.15/bioc/src/contrib/Biostrings_2.63.1.tar.gz'
Content type 'application/x-gzip' length 12645826 bytes (12.1 MB)
==================================================
downloaded 12.1 MB

* installing *source* package ‘Biostrings’ ...
** using staged installation
** libs
gcc -I"/data/user/teo/builds/R-4.2.0-devel/lib/R/include" -DNDEBUG  -I'/data/user/teo/builds/R-4.2.0-devel/lib/R/library/S4Vectors/include' -I'/data/user/teo/builds/R-4.2.0-devel/lib/R/library/IRanges/include' -I'/data/user/teo/builds/R-4.2.0-devel/lib/R/library/XVector/include' -I/usr/local/include   -fpic  -g -O2  -c BAB_class.c -o BAB_class.o
gcc -I"/data/user/teo/builds/R-4.2.0-devel/lib/R/include" -DNDEBUG  -I'/data/user/teo/builds/R-4.2.0-devel/lib/R/library/S4Vectors/include' -I'/data/user/teo/builds/R-4.2.0-devel/lib/R/library/IRanges/include' -I'/data/user/teo/builds/R-4.2.0-devel/lib/R/library/XVector/include' -I/usr/local/include   -fpic  -g -O2  -c BitMatrix.c -o BitMatrix.o
BitMatrix.c:9:10: fatal error: S.h: No such file or directory
 #include <S.h> /* for Salloc() */
          ^~~~~
compilation terminated.

Looking at the R-devel changelog I see:

The header ‘S.h’ which has been unsupported since Jan 2016 has been removed. Use ‘R.h’ instead. [Scheduled for Mar 15.]

so this is probably the first release with this problem. I guess just replacing the header in the #include would be enough.

Thank you!

bioconductor / biostrings Goto Github PK

biostrings's Introduction

biostrings's People

Contributors

Stargazers

Watchers

Forkers

biostrings's Issues

Read in the genome sequence

Try to extract sequences with data frame

And with the GRanges object

Data frame

GRanges

SesionInfo

Recommend Projects

Recommend Topics

Recommend Org