dnaio's People

Contributors

bede, luchaoqi, marcelm, rhpvorderman

dnaio's Issues

Fastq_iter improvements

Hi, I have made quite a few PRs in the last week and I still have some plans, so I am putting them here so that the PRs do not come out of the blue.

Planned PRs:

  • Improve paired FASTQ iteration by moving it into _core.pyx. I found that record_names_match takes 40 ns, which is quite good. Then I benchmarked a Python function that simply returns True: also 40 ns. A Cython function that returns True takes 28 ns, so of the 40 ns, 28 ns is call overhead. This matters because iterating over two paired FASTQ files currently takes about 40% longer than iterating over the two files individually, even though zip and zip_longest by themselves add very little overhead. By moving the loop into Cython, a lot of the per-record call overhead and checks in the loop can be eliminated.
  • Add behaviour that recognizes sequence_class=tuple and simply yields a tuple: yield name, sequence, qualities. This is 10% faster than yielding Sequence objects and is quite nice when the rich interface of Sequence is not needed.
  • Add a binary mode, enabled with dnaio.open(file, mode="rb"). This uses PyBytes_FromStringAndSize internally instead of PyUnicode_DecodeLatin1 and improves speed by more than 20%. (See the sketch after this list.)
  • Replace PyUnicode_DecodeLatin1 with PyUnicode_DecodeASCII for sequence and qualities. PyUnicode_DecodeLatin1 already has a tremendous overhead compared to PyBytes_FromStringAndSize (1100 ms vs 850 ms for parsing, respectively), and the only added benefit is that it returns a string. By using PyUnicode_DecodeASCII for sequence and qualities, all sequence and quality characters are checked to be below 128. Since characters of 128 and higher are all invalid sequence and quality characters, this is a value add. It does have a cost, however: parsing becomes a little slower (1200 ms). Given the tremendous speed improvements thanks to memchr, I reckon there is room for a little quality-of-life improvement on the string side.
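
As a rough illustration of where this is heading, a binary-mode read loop might look as follows. Note that mode="rb" is only proposed above, so the exact API is an assumption:

import dnaio

# Hypothetical: mode="rb" does not exist in dnaio yet; records are assumed
# to come back as plain tuples of bytes objects, as proposed above.
with dnaio.open("reads.fastq.gz", mode="rb") as reader:
    for name, sequence, qualities in reader:
        ...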

Identifying bottlenecks

A while ago I watched some videos on optimization and on reading generated assembly code. Since dnaio is a relatively straightforward project, I decided to have a go at it. Unfortunately, Cython does not generate very nice C code, so I decided to rewrite the core functionality in C. It is in the allinc branch.

It turns out that this makes reading substantially faster, while writing stays at the same speed. So Cython apparently has some friction when creating an iterator that generates a lot of objects. We knew this already, so this rewrite didn't teach me anything new. But it was fun to read records "superfaster" than what we are used to, so I wanted to share it here.

Looking at the C code, I don't see any way to make our algorithms faster; the code is already very close to the machine. I tried using x86 intrinsics instead of memchr, but GCC already does this through its builtin memchr optimizations, so there is no way to create something faster.

SequenceRecord array class

The discussion in #76 reminded me of an idea I had a while ago for possibly improving performance when dealing with many SequenceRecord objects: We could implement some kind of SequenceRecordArray class that represents an array of sequence records. It would support iteration (of course) to get the individual SequenceRecord objects. And as #76 suggests, it could also support a fastq_bytes() method that serializes all records to FASTQ en bloc.

This would replace the current functions for reading chunked data. So instead of the read_chunks() function, which returns memoryviews, one would then use a function that returns SequenceRecordArrays. Similar to how pandas.read_table does it, we could even add a chunksize parameter to dnaio.open that would then make it return SequenceRecordArray objects.

I think this would simplify some of the code in Cutadapt that transfers sequence data to and from worker processes. For that, I’d also want such an object to have extra attributes chunk_index and n_records.
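
A purely hypothetical sketch of how this could look from the user's side (neither the chunksize parameter nor the SequenceRecordArray class exists yet; the attribute names follow the suggestions above):

import dnaio

# Hypothetical API: dnaio.open(..., chunksize=...) yields SequenceRecordArray objects.
with dnaio.open("reads.fastq.gz", chunksize=10_000) as reader:
    for record_array in reader:
        print(record_array.chunk_index, record_array.n_records)
        data = record_array.fastq_bytes()   # serialize the whole chunk en bloc
        for record in record_array:         # iteration yields SequenceRecord objects
            ...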

Use a newline index for FastqIter

Currently the __next__ function is quite messy: memchr is called several times and a NULL check needs to be performed after each call. Any of these conditions might exit the while loop that is in place.

Alternatively, the update_buffer call can be used to create a newline index. The __next__ call can then simply take the next four lines from the index and yield a record (a rough sketch follows below).

Advantages:

  • For each __next__ call it is known in advance whether a SequenceRecord can be yielded, because the number of records in the buffer is known.
  • It can massively simplify the __next__ call, which now uses a while loop to allow for repeated buffer resizes. Instead, the newline indexing step can indicate whether an entire record is in the buffer, which means update_buffer can guarantee that at least one FASTQ record is present. That makes the while loop redundant and removes a lot of NULL checks; basically only the FASTQ integrity checks remain.
  • The newline indexing step can be implemented as a very simple function based on memchr. Such simple loops are fast to execute, since CPUs like doing the same thing over and over on a limited amount of memory.

Disadvantages:

  • A newline_index array must be made. This can be an array of pointers or an array of offsets.
  • A newline_pos variable must be added to the struct.
  • Memory is essentially read twice: once for building the index and once more for the integrity (@ and +) and \r checks. This means the cache lines need to be fetched from memory/cache again, whereas in the current solution they are probably already populated with the correct memory.

Even if there is no notable speedup, a reduction in code complexity (taking the line diff as a measure) would also be nice.
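
To make the idea concrete, here is a minimal pure-Python sketch, with bytes.find standing in for memchr and illustrative names; this is not the actual FastqIter implementation:

def build_newline_index(buffer: bytes) -> list:
    # Record the position of every newline in the buffer (one memchr-style loop).
    index = []
    pos = buffer.find(b"\n")
    while pos != -1:
        index.append(pos)
        pos = buffer.find(b"\n", pos + 1)
    return index

def records_from_index(buffer: bytes, index: list):
    # Yield (name, sequence, qualities) by taking four lines at a time from the index.
    start = 0
    for i in range(0, len(index) - len(index) % 4, 4):
        name_end, seq_end, plus_end, qual_end = index[i:i + 4]
        yield (buffer[start + 1:name_end],        # skip the leading '@'
               buffer[name_end + 1:seq_end],
               buffer[plus_end + 1:qual_end])
        start = qual_end + 1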

API review

Here are a couple of things I noticed that should possibly be revised before releasing version 1.0. I'll extend this list when I find more.

  • SequenceRecord.fastq_bytes_two_headers() is redundant with SequenceRecord.fastq_bytes(two_headers=True)
  • record_names_match and record_names_match_bytes should possibly be renamed to record_ids_match and record_ids_match_bytes because those names are more accurate (also requires that the C-level record_ids_match is renamed)

Possible improvements for paired_fastq_heads

I saw this comment in _core.pyx:

# It would be nice to be able to have the first parameter be an
# unsigned char[:] (memory view), but this fails with a BufferError
# when a bytes object is passed in.
# See <https://stackoverflow.com/questions/28203670/>

This is possible by using the buffer protocol. For a basic implementation see here. It acquires the buffer from a Python object and converts the void * to unsigned char *.

I also think paired_fastq_heads would benefit from using memchr to search the newlines.

Determining file type from contents rather than name

Hi, I have been trying to run trim_galore on FASTQ files with names like *.fastq.1.gz. However, there is an error which seems to relate to identifying the file type from the contents, i.e.:

import dnaio
f = dnaio.open("13714_1#1.fastq.2.gz")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "//python3.6/site-packages/dnaio/__init__.py", line 83, in open
    file1, fileformat=fileformat, mode=mode, qualities=qualities)
  File "//python3.6/site-packages/dnaio/__init__.py", line 148, in _open_single
    if file.seekable():
AttributeError: 'PipedGzipReader' object has no attribute 'seekable'

ascii_check.h should be in its own repo

I want to use it for my htspy project (the one that reads BAM files). BAM records have a read_name property, and it only makes sense to expose this property as an ASCII string (the current implementation uses bytes). It is of course very bad practice to advertise maxchar=127 without having checked it.

Furthermore I think the string_is_ascii test suite we have is a bit out of place here.

What I can do is make a separate repository at https://github.com/rhpvorderman/ascii-check. Then I can

  • Write a README for others on how to use this code
  • Write tests
  • Create multiple implementations of the same header file. One using standard C, one using SSE4.1 SIMD instructions and one using AVX SIMD instructions.

Re-importing it will be a simple git submodule add https://github.com/rhpvorderman/ascii-check plus a `cdef extern from "ascii-check/ascii-check.h"`. Another option would be to simply copy the ascii-check file, as the license permits it.

Any improvements would then benefit both dnaio and htspy, and other projects as well. Of course I checked whether a header-only library for ASCII checking is already available, and there is not. At least not one that uses the "check using an 8-byte mask" trick, despite that being a fairly common trick.

I'd like to make the code available under the CC0 public domain license in that repo, so it can be used for any purpose.

Since I technically waived any rights to this code by publishing it in your repository, I need your explicit permission before I continue with the above.

Profiling of fastq iter method

Hi, I profiled the fastq_iter method using hyperfine and by commenting out code. This was done on a branch based on the PR in #15.

The script used was simple:

#!/usr/bin/env python3

import dnaio
import sys

with dnaio.FastqReader(sys.argv[1]) as records:
    for record in records:
        pass

Profile

When viewing these tables, I took it as a rule of thumb that a WGS sequencing file contains 1 billion records. A time of 1 ns per read therefore corresponds to 1 second when traversing a WGS file.

| Area | Time for 5 million reads (ms) | Time for one read (ns, estimate) | Additional remarks |
|---|---|---|---|
| Startup of dnaio and determining two_headers | 37 | | |
| Finding start and end positions of FASTQ record and basic testing | 250 | 50 | 140 ms user, 110 ms system (!) |
| Yielding an object and iterating over a for loop | 150 | 30 | |
| Creating a tuple | 5 | 1 | |
| Creating a Sequence object | 100 | 20 | |
| Creating three bytes objects | 280 | 56 | |
| Creating three str objects (latin-1) | 580 | 116 | |
| Creating three str objects (name: latin-1, sequence and qualities: ASCII) | 640 | 128 | |

The following benchmarks were done by adapting the read script to check the differences between tuples and Sequence objects.

| Area | Time for 5 million reads (ms) | Time for one read (ns, estimate) | Additional remarks |
|---|---|---|---|
| Unpack tuple in for loop | 100 | 20 | for name, sequence, qualities in records |
| Access individual tuple member by index | 90 | 18 | record[1] |
| Access individual tuple member by global variable index | 130 | 26 | record[SEQUENCE] |
| Passing a tuple as an argument using * | 100 | 20 | my_func(*record) |
| Access Sequence member by name | 160 | 32 | record.sequence |

Conclusions

  • The core parsing loop that checks all the beginnings and ends of name, sequence, second_header and qualities takes only 50 ns per read, more than 40% of which is system time. This means there is less than 30 ns to be saved here, so further optimizations to the parser code (such as the quality-length shortcut in #14) are not worth it.
  • Bytes objects are created more than twice as fast as string objects.
  • Decoding into ASCII is 10% slower than Latin-1, not 50%; the comment in _core.pyx is outdated. Currently an ASCII mask is used to check for ASCII values. (It is possible that each individual character was checked in older versions of Python, which would explain the difference.)
  • Decoding into Latin-1 is not worth it anymore. The ASCII check is valuable in that it rejects any characters of 128 and higher, which are all invalid for sequence and qualities; that is worth the 10% extra cost. A Latin-1 string is a glorified bytes object, but twice as slow to create.
  • Tuples are faster than Sequence objects, both in creation and in access time. Furthermore, a tuple can be passed as an argument directly using * (useful in a function that creates a new FASTQ record). Sequence objects are way more convenient, though.

There is a clear convenience-versus-speed tradeoff here. Currently dnaio leans heavily towards convenience. After the holiday I will implement BinaryFastqReader and BinaryFastqWriter objects, which will lean towards the speed side.

| | FastqReader | BinaryFastqReader |
|---|---|---|
| dnaio.open mode | r | rb |
| iter yields | Sequence objects | tuples |
| String type | ASCII (for sequence and qualities) | bytes |

I implemented this on a dummy branch. Traversing 5 million records takes about 1170ms for the ASCII FastqReader and about 720 ms for the BinaryFastqReader.

After #15 and #23 are merged it will be possible to implement this. See you after the holiday!

check quality score integrity

In quite a lot of routines concerning quality I use the following pattern:

  • cast the qualities to a uint8_t *.
  • loop over the qualities and for each value:
    • subtract 33
    • check whether the result is higher than 93 (because of the unsignedness there is no need to check two bounds)
    • do a lookup in the quality-to-error-rate table.

This is also how it is done in the cutadapt maximum expected errors code.

How nice would it be if the bounds check could be removed? That would make things faster. It would be great if dnaio could guarantee this.

I think it can do so quite cost-effectively. Simply do the above pattern without the lookup and instead store each quality after checking it; a poor-performance version of memcpy that can be used when the qualities PyObject is created.
This can then be further upgraded using SSE2 vector instructions (_mm_loadu, _mm_cmpgt_epi8, _mm_cmplt_epi8, _mm_storeu and _mm_or): a quite simple loop where data is loaded, both bounds are checked (there is no unsigned comparison for 8-bit integers in SSE), the results of both bounds checks are ORed into a register that accumulates the result, and the vector is then stored at the destination. At the end, check the accumulated bounds-check vector to see whether any bound was crossed, and if so return an error.

Because dnaio does the bounds check, it can be eliminated from any program using dnaio. That is quite handy for performance.
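
A plain-Python sketch of the proposed check (the real version would be a C loop over uint8_t values, optionally vectorised with SSE2 as described above):

def checked_phred_scores(qualities: bytes) -> list:
    # Convert ASCII qualities to phred scores, rejecting anything outside 0-93.
    phreds = []
    for q in qualities:                # iterating bytes yields integers
        phred = q - 33
        if not 0 <= phred <= 93:       # in C, an unsigned subtraction needs only one compare
            raise ValueError(f"Invalid quality character: {chr(q)!r}")
        phreds.append(phred)
    return phreds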

fastq_iter should be made a class.

fastq_iter is now a function that uses yield; at least, that is the representation in Cython's custom syntax.

I recently implemented my first iterator in the Python C API. To do so you need to create a Python class that fills the tp_iter and tp_iternext slots, which is exactly what Cython does internally.

What we could do is rewrite fastq_iter as a cdef class fastq_iter and implement a __next__ method. This would require some reorganizing, but not too much.

The big advantage of this is that we can expose n_records on this new iterator class.

This will help paired reading immensely: we can simply use zip(fastq_iter1, fastq_iter2), iterate over that, and at the end check whether fastq_iter1.n_records == fastq_iter2.n_records. This should reduce the overhead of our paired FASTQ iteration massively!
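
A sketch of the paired-reading pattern this would enable, assuming fastq_iter becomes a class that exposes n_records:

# Assumes fastq_iter1 and fastq_iter2 are the proposed iterator objects.
for record1, record2 in zip(fastq_iter1, fastq_iter2):
    ...  # process the pair

# zip() stops at the shorter file, so compare the counts afterwards.
if fastq_iter1.n_records != fastq_iter2.n_records:
    raise ValueError("Paired FASTQ files have an unequal number of records")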

API review: an alternative name for `dnaio.open`?

I kind of like offering dnaio.open() as the main entry function, but the open() name of course collides with the built-in function of the same name. This prevents users from writing from dnaio import open. For reading, this is kind of okay, but for writing it does not look so nice, because you also need SequenceRecord and then end up with two import lines:

import dnaio
from dnaio import SequenceRecord

with dnaio.open("out.fastq", mode="w") as f:
    f.write(SequenceRecord("name", "ACGT", "####"))

Alternatively, one could write dnaio.SequenceRecord and avoid the second import, but as a user, I may not want the dnaio. prefix everywhere. One can also rename the open function when importing, but that’s also not optimal.
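
For illustration, the rename workaround would be:

from dnaio import open as dnaio_open

with dnaio_open("out.fastq", mode="w") as f:
    ...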

So perhaps we should offer the dnaio.open function under a different name.

Add Sequence.id attribute

and also Sequence.comment and perhaps Sequence.sep. They could be lazily computed. Perhaps also add .header as an alias for .name? Should setting .id and .comment be supported? Convenient, but perhaps slower than setting the full header directly.
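
A rough sketch of the lazy splitting this implies, assuming the id is the part of the header before the first whitespace and the comment is everything after it:

name = "read123/1 sample=A lane=2"            # example header
record_id, sep, comment = name.partition(" ")
# record_id == "read123/1", sep == " ", comment == "sample=A lane=2"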

Switch single&paired API to single&multiple API.

Currently I am working a lot with UMI data that is stored in a separate FASTQ file, meaning I now have 3 files.

I needed to filter those files on average error rate, so I adapted the fastq-filter program to work with multiple files.

To keep the pipeline simple, I opted for a multiple-file reader. This yields 1-tuples for 1 file, 2-tuples for 2 files, 3-tuples for 3 files, etc.
This way I can write the filters to always handle a tuple of SequenceRecord objects and use the same filter in all cases.
Similarly, I wrote a multiple-file writer.

I am wondering if we should do this in dnaio too. There are now two cases in dnaio:

  1. Single file. Yield one SequenceRecord object.
  2. Paired file. Yield a 2-tuple of SequenceRecord objects.

I propose replacing the latter with a multiple-file reader that can read from n files and yields n-tuples of SequenceRecord objects. The PairedEndReader and PairedEndWriter interfaces can still be maintained, but these can simply inherit from the MultipleReader and provide a backwards-compatible interface. (This shouldn't be too hard, given that it is just the 2-case of the MultipleReader.)

This way I do not have to reinvent the wheel across multiple projects. I also feel this is needed for cutadapt, which needs a sort of auxiliary file option where the auxiliary file with the UMIs is kept in sync with the FASTQ files that cutadapt outputs. Currently I have to use biopet-fastqsync to sync the UMI FASTQ file afterwards. (This is not the correct place to raise that issue, but I state it here to show why I think this will be a good move for the future.)

I have already implemented a multiple-file reader in my fastq-filter project. At first it was written in a generic manner (everything is a list of multiple files), but I discovered that this severely harms the single-end and paired-end cases: LUMC/fastq-filter#16. I wonder what the best way is to implement this in dnaio. Alternatively, there could be separate 1-tuple, 2-tuple and n-tuple readers that all share the same interface through abstract classes.
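
A hypothetical shape of the proposed interface (the call signature is illustrative only; nothing like this exists in dnaio yet):

import dnaio

# Hypothetical: open n files at once and get n-tuples of SequenceRecord objects.
with dnaio.open("R1.fastq.gz", "R2.fastq.gz", "R3.fastq.gz") as reader:
    for records in reader:   # an n-tuple of SequenceRecord objects, one per file, kept in sync
        ...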

conda install fails for py38 (also prevents cutadapt in py38)

conda install -c bioconda dnaio 

or 

conda install -c conda-forge -c bioconda dnaio
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: | 
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed                                                                                                                                             

UnsatisfiableError: The following specifications were found
to be incompatible with the existing python installation in your environment:

Specifications:

  - dnaio -> python[version='>=3.6,<3.7.0a0|>=3.7,<3.8.0a0']

Your python: python=3.8

If python is on the left-most side of the chain, that's the version you've asked for.
When python appears to the right, that indicates that the thing on the left is somehow
not available for the python version you are constrained to. Note that conda will not
change your python version to a different minor version unless you explicitly specify
that.

FastqFormatError, while reading in chunks

Hi @marcelm,

I am revising the miRge3.0 code, which uses cutadapt for read trimming (for single-end data). Here, I am trying to pass reads in chunks. Below is the code, followed by the error:

import io

import dnaio
from xopen import xopen

# infile is the path to the input FASTQ file
buffer_size = 4 * 1024**2
with xopen(infile, "rb") as f:
    tlds = [i for i in dnaio.read_chunks(f, buffer_size)]

chunks_ind = []
for i in tlds:
    z = io.BytesIO(i)
    with dnaio.open(z) as f:
        chunks_ind.append([k for k in f])

I later send chunks_ind into a ProcessPoolExecutor() to process the list in parallel.

Traceback (most recent call last):
  File "chunk_digest_mp.py", line 41, in <module>
    chunks_ind.append([k for k in f])
  File "chunk_digest_mp.py", line 41, in <listcomp>
    chunks_ind.append([k for k in f])
  File "src/dnaio/_core.pyx", line 183, in fastq_iter
dnaio.exceptions.FastqFormatError: Error in FASTQ file at line 1341: Line expected to start with '@', but found 'J'

If I give a different buffer size, I get the FASTQ format error at a different line. However, the files themselves don't have issues. I don't know what I am missing here, and memory_file can't be sent through the pool since it can't be pickled. Any suggestions/thoughts?

Thank you,
Arun.

Three-way is_mate comparison

UMIs are becoming a thing now and our sequencing center is in the habit of sending three files (R1, R2, R3, with the second one annoyingly being the file with the UMIs). Given that we check for the numbers 1-3 in dnaio, this seems to be fairly standard.

However there is no three-way comparison between SequenceRecord objects as of now.

This should be fairly doable to implement. I just can't decide on the name yet. I like is_mate, so I guess are_mates is the most appropriate name?
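
Hypothetical usage, assuming the function ends up being called are_mates and accepts any number of records (the name is still open for discussion):

# Name and signature are assumptions; nothing like this exists in dnaio yet.
if not are_mates(record1, record2, record3):
    raise ValueError("Reads 1, 2 and 3 are out of sync")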

Two_headers should be a SequenceRecord attribute

Currently dnaio uses a slightly unusual architecture where the first value yielded by fastq_iter is a boolean, not a SequenceRecord. This determines whether all subsequent FASTQ records are printed with two headers. FastqWriter has a rather quirky implementation to determine its write method.

I think this is best solved by giving each SequenceRecord a boolean attribute. This can be set instantly without branching (no if statement). We can then add a fastq_bytes_as_input method, which prints one or two headers based on the boolean attribute and which can then be used by the FastqWriter class.

This will be fairly trivial to implement once the C-code PR is merged.

Typing warnings from mypy

Hello @marcelm!
I recently started to add typing to my code, so I installed mypy in my environment and as a PyCharm plugin, and when testing my code some warnings came up from using your module.

So I made a fresh pull from your repo and ran mypy on it; this was the output:

/dnaio_copy$ mypy --version
mypy 0.971 (compiled: yes)

/dnaio_copy$ mypy .
doc/conf.py:24: error: Need type annotation for "exclude_patterns" (hint: "exclude_patterns: List[<type>] = ...")
tests/test_open.py:377: error: List item 1 has incompatible type "Tuple[Dict[str, str], Type[MultipleFastqWriter]]"; expected "Tuple[object, Type[object]]"
tests/test_open.py:378: error: List item 2 has incompatible type "Tuple[Dict[str, str], Type[MultipleFastqWriter]]"; expected "Tuple[object, Type[object]]"
tests/test_open.py:379: error: List item 3 has incompatible type "Tuple[Dict[str, str], Type[MultipleFastaWriter]]"; expected "Tuple[object, Type[object]]"
tests/test_internal.py:31: error: Module "dnaio._core" has no attribute "bytes_ascii_check"
tests/read_from_stdin.py:6: error: Item "SingleEndReader" of "Union[SingleEndReader, PairedEndReader, SingleEndWriter, PairedEndWriter, MultipleFileReader, MultipleFileWriter]" has no attribute "__enter__"
tests/read_from_stdin.py:6: error: Item "PairedEndReader" of "Union[SingleEndReader, PairedEndReader, SingleEndWriter, PairedEndWriter, MultipleFileReader, MultipleFileWriter]" has no attribute "__enter__"
tests/read_from_stdin.py:6: error: Item "SingleEndWriter" of "Union[SingleEndReader, PairedEndReader, SingleEndWriter, PairedEndWriter, MultipleFileReader, MultipleFileWriter]" has no attribute "__enter__"
tests/read_from_stdin.py:6: error: Item "PairedEndWriter" of "Union[SingleEndReader, PairedEndReader, SingleEndWriter, PairedEndWriter, MultipleFileReader, MultipleFileWriter]" has no attribute "__enter__"
tests/read_from_stdin.py:6: error: Item "MultipleFileWriter" of "Union[SingleEndReader, PairedEndReader, SingleEndWriter, PairedEndWriter, MultipleFileReader, MultipleFileWriter]" has no attribute "__enter__"
tests/read_from_stdin.py:6: error: Item "SingleEndReader" of "Union[SingleEndReader, PairedEndReader, SingleEndWriter, PairedEndWriter, MultipleFileReader, MultipleFileWriter]" has no attribute "__exit__"
tests/read_from_stdin.py:6: error: Item "PairedEndReader" of "Union[SingleEndReader, PairedEndReader, SingleEndWriter, PairedEndWriter, MultipleFileReader, MultipleFileWriter]" has no attribute "__exit__"
tests/read_from_stdin.py:6: error: Item "SingleEndWriter" of "Union[SingleEndReader, PairedEndReader, SingleEndWriter, PairedEndWriter, MultipleFileReader, MultipleFileWriter]" has no attribute "__exit__"
tests/read_from_stdin.py:6: error: Item "PairedEndWriter" of "Union[SingleEndReader, PairedEndReader, SingleEndWriter, PairedEndWriter, MultipleFileReader, MultipleFileWriter]" has no attribute "__exit__"
tests/read_from_stdin.py:6: error: Item "MultipleFileWriter" of "Union[SingleEndReader, PairedEndReader, SingleEndWriter, PairedEndWriter, MultipleFileReader, MultipleFileWriter]" has no attribute "__exit__"
Found 15 errors in 4 files (checked 22 source files)

There were also several pep8 warnings:

/dnaio_copy$ pycodestyle .
./setup.py:16:80: E501 line too long (87 > 79 characters)
./src/dnaio/__init__.py:116:80: E501 line too long (81 > 79 characters)
./src/dnaio/__init__.py:139:80: E501 line too long (80 > 79 characters)
./src/dnaio/__init__.py:155:80: E501 line too long (85 > 79 characters)
.....

I'm just starting to learn about type annotations, but I thought it could help to report this to you.

Cheers,

Version 1.0

Given that #110 warrants a new release as you say, I think we should name it 1.0.0. It is a stable library which has been used in production for years, and easily the best and one of the most feature-rich parsers out there. See also https://0ver.org

Compilation error

Hi,
I am a long-time user of cutadapt.
I've tried to update recently, but I encounter this error while installing dnaio:

building 'dnaio._core' extension
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -DUSE_SSE2 -I/usr/include/python3.8 -c src/dnaio/_core.c -o build/temp.linux-x86_64-cpython-38/src/dnaio/_core.o
x86_64-linux-gnu-gcc: error: src/dnaio/_core.c: No such file or directory
x86_64-linux-gnu-gcc: fatal error: no input files
compilation terminated.
error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1

Can't figure out why _core.c is missing.

ResourceWarnings

When running the tests, there are 12 ResourceWarnings. Showing just one of them:

$ python -X dev -X tracemalloc -m pytest
[...]
tests/test_open.py::test_paired_open_with_multiple_args[fasta-r-PairedEndReader]
  .../dnaio/.venv/lib/python3.10/site-packages/_pytest/python.py:192: ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/pytest-of-marcel/pytest-191/test_paired_open_with_multiple0/file' encoding='UTF-8'>
    result = testfunction(**testargs)
  
  Object allocated at:
    File ".../dnaio/.venv/lib/python3.10/site-packages/_pytest/python.py", line 192
      result = testfunction(**testargs)
[...]

I don’t have time right now to fix this, so I'm just filing an issue. I would like to at least know what is going on before releasing 0.10.

These warnings were not there in v0.9.1.

Consider renaming Sequence to SequenceRecord

This has bothered me for a long time: A Sequence isn’t actually just a sequence; it is an object containing a sequence of nucleotides, a sequence of quality values and a name. The misnaming becomes obvious when one considers that a Sequence has a sequence attribute, so I’m sometimes tempted to write something like s = sequence.sequence or similar.

Biopython has SeqRecord, but I don’t like abbreviations in class names, and to reduce the risk of confusion with the Biopython class, I think it should be SequenceRecord.

Implement ASCII check

Currently I am not happy with the way we are creating strings, setting maxchar to 255. For FASTQ, only ASCII characters are valid, so I feel we should perform some form of check, the same as would happen if a user used open("my.fastq", "rt", encoding="ascii"). I have several reasons for this:

  • Only ASCII is valid in FASTQ
  • ASCII strings have the smallest and simplest representation inside the Python C-API. This helps memory use and speed.
  • I want to implement stricter checks on the SequenceRecord object. If we enforce strict ASCII checks in the __init__ method and in the getters and setters, we can be 100% sure that a SequenceRecord object never contains anything but ASCII strings. That means we can drop a lot of the checks in other methods such as fastq_bytes, helping speed. Currently this is not possible because the strings we provide are maxchar 255, which would be incompatible. We could of course make SequenceRecord check for maxchar 255, but that does not make sense.

Back when #32 was merged, I noted that we may be able to do it faster than PyUnicode_DecodeASCII. Today I thought of a way to do so.

Checking for ASCII is usually done with an ASCII mask, as ASCII is a 7-bit encoding: the most significant eighth bit is always 0. A mask can be constructed from bytes with the exact value 128 (0x80 in hex). Since we have 64-bit computers, we can test 8 bytes at a time with the mask 0x8080808080808080. This is also what happens in CPython when PyUnicode_DecodeASCII is used.

But an 8-byte mask cannot always be applied. If we have a read name of 29 characters:

  • 29 / 8 = 3
  • 29 % 8 = 5

So we perform 3 eight-byte checks and 5 one-byte checks, 8 checks in total.

The numbers for a 150-bp Illumina record with a 29-byte name:

  • 8 checks for the name
  • 150 / 8 = 18, 150 % 8 = 6: 24 checks for sequence
  • Same as above: 24 checks for qualities.

A total of 56 checks.

However, since we have the entire record in memory, we can check the entire record for ASCII at once. In this case that is 335 characters:

  • 335 /8 = 41, 335 % 8 = 7.

A total of 48 checks.

Not to mention the simpler logic if we apply one ASCII check to the entire record rather than checking each SequenceRecord member individually.

The ASCII check code can be easily programmed in a separate C header file and imported in Cython. It should have minimal impact on performance.
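
To illustrate the arithmetic above, here is the same mask trick written out in plain Python (the real implementation would of course live in the C header file):

ASCII_MASK = 0x8080808080808080   # the high bit of each of the 8 bytes

def is_ascii(data: bytes) -> bool:
    n = len(data)
    for i in range(0, n - n % 8, 8):                     # 8-byte checks
        if int.from_bytes(data[i:i + 8], "little") & ASCII_MASK:
            return False
    for byte in data[n - n % 8:]:                        # remaining 0-7 one-byte checks
        if byte & 0x80:
            return False
    return True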

Does SequenceRecord need to support the buffer protocol?

The buffer protocol is basically a pointer to a struct that contains two function pointers: one function to acquire the buffer and one to release it. Simple buffers only have a buffer acquire function; it increases the reference count of the object and uses the object's internal data pointer. Since Python automatically decreases the reference count of the buffer-exporting object on release, an external release function is not required. This is what is used for bytes objects. More complex buffer objects do some extra work, like setting a lock value on the object, to ensure it can't be changed while exporting a buffer.

Why support this for SequenceRecord? Because it makes it possible to write a FASTQ record to a block of memory, expose that as the buffer, and free the memory again afterwards. This basically adds a __bytes__ method to SequenceRecord. It also allows SequenceRecord objects to be written directly to a Python file without using the fastq_bytes method, which shaves off a little overhead.

I faced the same choice when programming the BamRecord object for htspy and decided not to do it (yet), because I feel a to_bytes method is more intuitive and a new buffer needs to be created anyway, so it is best to communicate this to users even if it carries a minor performance penalty (a method lookup). Nevertheless, I am curious about your opinion on this.

Remove BytesSequenceRecord

When I added BytesSequenceRecord there were two major reasons:

  • PyBytes_FromStringAndSize was much faster than PyUnicode_DecodeASCII
  • Retrieving pointers from strings was deemed difficult, as strings do not have the buffer protocol.

I think both issues are now gone.

  • ASCII-checking the entire buffer and then using PyUnicode_New(..., 127) is only slightly slower (about 10%) than PyBytes_FromStringAndSize.
  • Retrieving pointers from strings can be done very fast with PyUnicode_DATA. Since SequenceRecord ensures that its strings can never be anything other than ASCII, this is as fast as PyBytes_AS_STRING.

On top of that, strings are more useful than bytes. Names should be strings. Sequences of nucleotides work more intuitively as strings. And qualities are phred scores; they are an ASCII representation of the actual score and thus work best as strings.

I was working on #65 when I realised that BytesSequenceRecord is just a maintenance burden at this point.

Add ubam support

Nanopore reads can be delivered in uBAM format. While writing a full-fledged BAM processor is a fool's errand (I have been there...), it is actually quite straightforward to just parse the name, sequence and qualities from a uBAM file.

  • Qualities are encoded with offset 0. This is a simple copy-and-add-33 manoeuvre.
  • Sequences are encoded, but can be converted back with a simple lookup table.
  • The name can simply be copied.
  • BAM is basically a bgzip format. Since the indexing feature is not needed, a simple gzip reader can stream the entire file; python-isal will read this quite fast.

This will add uBAM support to cutadapt. uBAM is very annoying anyway, as minimap2 won't accept it, so it will be nice if cutadapt can take care of the conversion while also trimming away any nanopore helper sequences.
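
A rough Python sketch of the two conversions (this is not dnaio code; the helper names are made up, and the nucleotide table is the standard BAM 4-bit encoding):

SEQ_NT16 = "=ACMGRSVTWYHKDBN"   # BAM 4-bit nucleotide codes

def decode_bam_seq(packed: bytes, length: int) -> str:
    # Unpack the 4-bit encoded sequence via a lookup table, high nibble first.
    bases = []
    for byte in packed:
        bases.append(SEQ_NT16[byte >> 4])
        bases.append(SEQ_NT16[byte & 0x0F])
    return "".join(bases[:length])   # drop the padding nibble for odd lengths

def decode_bam_qual(qual: bytes) -> str:
    # BAM stores phred scores with offset 0; FASTQ wants offset 33.
    return "".join(chr(q + 33) for q in qual)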

Why decode?

Hi Marcel,

I got a bit fed up with the crappy FASTQ filters in the bioinformatics space, so I started on something that is fast and actually evaluates quality scores correctly, taking into account that they are log values (many people simply compute the arithmetic mean...).

Currently I have implemented a very steady and simple pure-Python parser that constructs a NamedTuple containing the 4 lines (without \n) as bytes objects. It is about 2.5 times slower than dnaio.

Bytes objects are nice, since quality scores are integers and bytes objects can be treated in Python as a sequence of integers directly. (You can sum bytes, throw them into statistics.median, etc. with no issue.)

The same goes for the sequence string. AGTCN etc. are all in the ASCII range, so there is no reason to represent it in unicode.
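
A small example of what the bytes representation buys you (the error-rate formula is just the standard phred definition):

import statistics

qualities = b"II?5+#"                           # raw quality bytes of one record
phreds = [q - 33 for q in qualities]            # iterating bytes yields ints directly
median_phred = statistics.median(phreds)
mean_error = statistics.mean(10 ** (-p / 10) for p in phreds)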

As for cutadapt: bytes have a buffer protocol that allows for some very lean and mean reading of their internal uint8_t array. Strings do not have this. This can be quite powerful; see this function for mean quality.

So I wonder why decode is used at all. Wouldn't it be more convenient for Sequence objects to have bytes members instead? No decoding means faster parsing and easier use with other libraries that use Cython (i.e. cutadapt).

Maybe it would be possible to set up a BinarySequence object which is parsed faster. For backwards compatibility this could be converted into a Sequence object by default. What do you think about this?

dnaio.open() should return fewer classes

Probably just these, which are interfaces that the FastaWriter/FastqReader etc. classes implement:

  • SequenceReader
  • SequenceWriter
  • PairedSequenceReader
  • PairedSequenceWriter

reverse complement

Dear dnaio team,

is there a method to get reverse complement?

Thanks
Jorge

ASCII check should be moved to update_buffer

This way the check can be performed on large contiguous chunks of memory. It also removes a branch from the __next__ method. Since the update_buffer method is called relatively rarely compared to __next__, it should be a net win.

Also, because of the large contiguous chunks, the benefits of SIMD optimizations might become noticeable.

I cannot implement this right now, but I have to write this down otherwise it will be stuck in my head for the rest of the day.

Error in dnaio

When running cutadapt 3.1 to process FASTQ files, I receive the following fatal error:
  File "/usr/local/lib/python3.7/site-packages/cutadapt/pipeline.py", line 571, in run
    dnaio.read_paired_chunks(f, f2, self.buffer_size)):
  File "/usr/local/lib/python3.7/site-packages/dnaio/chunks.py", line 111, in read_paired_chunks
    raise ValueError("FASTQ record does not fit into buffer")
ValueError: FASTQ record does not fit into buffer

The error occurs a little over half the time. Other FASTQ files process normally. Have you seen a similar issue? How can I prevent this?

compresslevel and open_threads should be arguments in the dnaio.open interface

Hi, I have been working on the benchmarking and I am quite intrigued by fastp. The project is very similar to dnaio in terms of its API: paired reading and writing, a focus on speed, using zlib-compatible libraries to speed up (de)compression, etc.

Anyway, I noticed that the fastq.gz reading benchmark between dnaio and fastp is not entirely fair, as the benchmark program always uses a single thread and dnaio gets around the GIL by launching an igzip process. This made me realize that in all my programs I do this:

import functools
from xopen import xopen

DEFAULT_OPENER = functools.partial(xopen, threads=0, compresslevel=1)

# Use DEFAULT_OPENER with all dnaio.open calls

This is quite an ugly way to interact with an API.

I propose that we add open_threads=0 and compresslevel=1 to the dnaio.open interface. open_threads should default to 0, as any polite program only goes multithreaded when explicitly asked (unless it has "parallel" in its name). This will use python-isal by default, which is more efficient than piping through igzip (in terms of total CPU time, not wall-clock time). We can add a nice docstring to the parameter explaining that an external program is opened when threads > 0.

compresslevel should default to 1. This is the most efficient compression level in terms of compute time for a given final size. Tools producing FASTQ almost always produce intermediate files (they are mapped afterwards), so the most common use case is one where throughput matters most. An advantage of having a compresslevel parameter in dnaio.open is that it is very easy for users to change it if they feel compression is more important than throughput. I know you have argued differently for cutadapt, but IIRC that was primarily motivated by backwards-compatibility concerns, where drastically changing the behaviour of the program was undesirable (which I agree with). dnaio is not as well established, and in fact is still in its 0.x days.

I would have proposed this change as a PR, but since #87 also touches the open function, I will leave this for after that is merged (or rejected).
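
The proposed call would then look something like this, with the suggested defaults written out explicitly (both parameters are only proposed here):

import dnaio

with dnaio.open("trimmed.fastq.gz", mode="w", compresslevel=1, open_threads=0) as writer:
    ...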

Opening a single empty file using format auto-detect raises OSError.

Hi,

I have recently stumbled into an error when trying to open an empty file that does not have a regular extension (fasta, fastq, ...) using open with auto-detection (fileformat=None). Quick example:

$ python3 -m venv .venv
$ source .venv/bin/activate
(.venv) $ pip install dnaio
Collecting dnaio
  Using cached dnaio-0.4.2-cp38-cp38-manylinux1_x86_64.whl (126 kB)
Collecting xopen>=0.8.2
  Using cached xopen-0.9.0-py2.py3-none-any.whl (8.3 kB)
Installing collected packages: xopen, dnaio
Successfully installed dnaio-0.4.2 xopen-0.9.0
(.venv) $ python3
Python 3.8.2 (default, Apr 27 2020, 15:53:34)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dnaio
>>> import pathlib
>>> pathlib.Path('test.tmp').touch()
>>> dnaio.open('test.tmp')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".venv/lib/python3.8/site-packages/dnaio/__init__.py", line 108, in open
    return _open_single(
  File ".venv/lib/python3.8/site-packages/dnaio/__init__.py", line 177, in _open_single
    fileformat = _detect_format_from_content(file)
  File ".venv/lib/python3.8/site-packages/dnaio/__init__.py", line 203, in _detect_format_from_content
    file.seek(-1, 1)
OSError: [Errno 22] Invalid argument
>>> dnaio.open('test.tmp', fileformat='Fasta')
<dnaio.readers.FastaReader object at 0x7f0910594880>

I understand that in this case it is not possible to infer a file type, but I would like to suggest using a more detailed error message to explain what is going on.

Thank you for your time!
Dries
