
pod5-file-format's Introduction

POD5 File Format

POD5 is a file format for storing nanopore DNA data in an easily accessible way. The format can be written in a streaming manner, which allows a sequencing instrument to write data directly as it is acquired.

Data in POD5 is stored using Apache Arrow, allowing users to consume data in many languages using standard tools.

What does this project contain

This project contains a core library for reading and writing POD5 data, and a toolkit for accessing this data in other languages.

Documentation

Full documentation is found at https://pod5-file-format.readthedocs.io/

Usage

POD5 is also bundled as a Python module for easy use in scripts. Users can install it with:

> pip install pod5

This Python module provides the library to write custom scripts against.

Please see examples for documentation on using the library.
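
As a quick orientation, here is a minimal sketch of reading a file with the Python API (the file name is a placeholder; see the linked examples for fuller, authoritative usage):

import pod5

# Iterate over every read in a POD5 file and print basic per-read information.
with pod5.Reader("example.pod5") as reader:
    for read in reader.reads():
        print(read.read_id, len(read.signal))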

The pod5 package also provides a selection of tools.

Design

For information about the design of POD5, see the docs.

Development

If you want to contribute to pod5_file_format, or our pre-built binaries do not meet your platform requirements, you can build pod5 from source using the instructions in DEV.md.


pod5-file-format's Issues

Guppy can't load pod5 files

Apologies if this is not the place to ask this; I also asked on the Nanopore Community page, but figured I'd ask here too.

I recently converted some fast5 files to pod5 to do modified base calling with Guppy.

This is the command I used: 

./ont-guppy/bin/guppy_basecaller -i pods/ -a resources/ref.mmi -s guppy_out/ -c dna_r10.4_e8.1_modbases_5mc_cg_sup.cfg -x auto --recursive --bam_out --index --compress_fastq

I receive this output:

ONT Guppy basecalling software version 6.4.2+97a7f06, minimap2 version 2.24-r1122
config file: /home/matthew/snake_guppy/ont-guppy/data/dna_r10.4_e8.1_modbases_5mc_cg_sup.cfg
model file: /home/matthew/snake_guppy/ont-guppy/data/template_r10.4_e8.1_sup.jsn
input path: pods/
save path: guppy_out/
chunk size: 2000
chunks per runner: 208
minimum qscore: 10
records per file: 4000
fastq compression: ON
num basecallers: 4
gpu device: auto
kernel path:
runners per device: 12

alignment file: resources/ref.mmi
alignment type: auto

Use of this software is permitted solely under the terms of the end user license agreement (EULA).
By running, copying or accessing this software, you are demonstrating your acceptance of the EULA.
The EULA may be found in /home/matthew/snake_guppy/ont-guppy/bin
loading new index: resources/ref.mmi
Full alignment will be performed.
Found 5 input read files to process.
Init time: 5485 ms

0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|


Caller time: 101 ms, Samples called: 0, samples/s: 0
There were fast5 file loading problems! Failed to load 5 out of 5 fast5 files. Check log file for details.
Finishing up any open output files.
Basecalling completed successfully.

The log file says:

The EULA may be found in /home/matthew/snake_guppy/ont-guppy/bin
2023-01-25 16:49:07.330495 [guppy/info] crashpad_handler successfully launched.

2023-01-25 16:49:07.434523 [guppy/info] CUDA device 0 (compute 8.6) initialised, memory limit 25438322688B (24809373696B free)
2023-01-25 16:49:09.463378 [guppy/message] loading new index: resources/ref.mmi
2023-01-25 16:49:12.813098 [guppy/message] Full alignment will be performed.
2023-01-25 16:49:12.814092 [guppy/message] Found 5 input read files to process.
2023-01-25 16:49:12.814483 [guppy/info] Error attempting to open file "pods/OM2.pod5": Failed to query batch count: Invalid: null file passed to C API
2023-01-25 16:49:12.814569 [guppy/info] Error attempting to open file "pods/YM3.pod5": Failed to query batch count: Invalid: null file passed to C API
2023-01-25 16:49:12.814633 [guppy/info] Error attempting to open file "pods/OM1.pod5": Failed to query batch count: Invalid: null file passed to C API
2023-01-25 16:49:12.814701 [guppy/info] Error attempting to open file "pods/OM3.pod5": Failed to query batch count: Invalid: null file passed to C API
2023-01-25 16:49:12.814769 [guppy/info] Error attempting to open file "pods/YM2.pod5": Failed to query batch count: Invalid: null file passed to C API
2023-01-25 16:49:12.815703 [guppy/message] Init time: 5485 ms
2023-01-25 16:49:12.915834 [guppy/message] Caller time: 101 ms, Samples called: 0, samples/s: 0
2023-01-25 16:49:12.915859 [guppy/message] There were fast5 file loading problems! Failed to load 5 out of 5 fast5 files. Check log file for details.
2023-01-25 16:49:12.915872 [guppy/message] Finishing up any open output files.
2023-01-25 16:49:12.986101 [guppy/info] Stats for model /home/matthew/snake_guppy/ont-guppy/data/template_r10.4_e8.1_sup.jsn, 12 runners/device, 208 chunks/run, 2000 blocks/chunk, lifetime 4.52 s
CUDA device 0: 0 runs with 0 chunks (-nan%), 0 samples (-nan%), avg max size -nan, avg size -nan (-nan% of max), 0 samples/s
2023-01-25 16:49:13.012831 [guppy/message] Basecalling completed successfully.

I'm not sure what the issue is; does anyone have any advice?

I checked the files with pod5 inspect summary and they seem fine and have the expected sizes.

Installation error (no package 'arrow' found)

Hello!
I have tried to install pod5 via pip inside a conda environment multiple times, on different machines, without success. The typical error is:

-- Checking for module 'arrow'
      --   No package 'arrow' found
      CMake Error at /home/asan/miniconda3/envs/4pod5-env/share/cmake-3.25/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
        Could NOT find Arrow (missing: ARROW_INCLUDE_DIR ARROW_LIB_DIR
        ARROW_FULL_SO_VERSION ARROW_SO_VERSION)
      Call Stack (most recent call first):
        /home/asan/miniconda3/envs/4pod5-env/share/cmake-3.25/Modules/FindPackageHandleStandardArgs.cmake:600 (_FPHSA_FAILURE_MESSAGE)
        cmake_modules/FindArrow.cmake:450 (find_package_handle_standard_args)
        cmake_modules/FindArrowPython.cmake:46 (find_package)
        CMakeLists.txt:231 (find_package)
      
      
      -- Configuring incomplete, errors occurred!
      See also "/tmp/pip-install-959xe01e/pyarrow_81abfae6eef142db98c35f0c4d548b21/build/temp.linux-x86_64-cpython-311/CMakeFiles/CMakeOutput.log".
      error: command '/home/asan/miniconda3/envs/4pod5-env/bin/cmake' failed with exit code 1
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for pyarrow
Failed to build pyarrow
ERROR: Could not build wheels for pyarrow, which is required to install pyproject.toml-based projects

The Conan route has not led to any success either.
The conda environment contains Python 3.11, cmake, boost and compilers (gcc_linux-64, gxx_linux-64 and gfortran_linux-64). The hosts are Ubuntu 22.04 and 23.04. Installing arrow (or pyarrow) inside conda, either via the conda package manager or via pip, did not help.

What would be a solution? Are there any plans to add pod5 to the conda repositories?

How fast is it to convert from fast5 to pod5?

Hi, thank you so much for sharing this tool and detailed document.

I would like to know how fast it is to convert from fast5 to pod5. For PromethION data there are a lot of .fast5 files, maybe 20 million reads; if I convert these files into .pod5, how long does it take? And can I use multiple CPU cores to do it?

Thank you so much!

Understanding the signal_row count and the POD5 writing

Hi,

Could you please help me understand what is happening behind the following piece of code (which I took from Dorado's POD5 reading):

            if (pod5_get_signal_row_info(file, signal_row_count, signal_rows_indices,
                                        signal_rows.data()) != POD5_OK) {
                fprintf(stderr,"Failed to get read %ld signal row locations: %s\n", row, pod5_get_error_string());
            }

            fprintf(stderr,"ROw count\t%s\t%ld\n", read_id_tmp, signal_row_count);

            size_t total_sample_count = 0;
            for (size_t i = 0; i < signal_row_count; ++i) {
                total_sample_count += signal_rows[i]->stored_sample_count;
            }

            int16_t *samples = (int16_t*)malloc(sizeof(int16_t)*total_sample_count);
            size_t samples_read_so_far = 0;
            for (size_t i = 0; i < signal_row_count; ++i) {
                if (pod5_get_signal(file, signal_rows[i], signal_rows[i]->stored_sample_count,
                                   samples + samples_read_so_far) != POD5_OK) {
                }

                samples_read_so_far += signal_rows[i]->stored_sample_count;
            }

Is this signal_row_count the number of chunks that MinKNOW is expected to write when writing directly? When I converted 500,000 reads from fast5 to pod5, none of the reads had a signal_row_count other than 1. When MinKNOW is writing files, what would the expected value for signal_row_count be? I am asking because the primary design goal of POD5 has been writing (and the need to write in chunks), so if the converter does not produce files like those MinKNOW produces, none of the reading-related benchmarks we run on pod5 files generated by fast5 conversion are representative of reality, since seek system calls (or major page faults, if mmap is used internally) are ignored. If MinKNOW is expected to reconvert chunked POD5 to unchunked POD5, the benchmarks would still be representative, but if such a conversion is done it contradicts the need for a 'balanced' file format.

Also, have any benchmarks been done to evaluate POD5's writing performance? And is there a C API for POD5 writing?

Thank you.

Pod5 convert fast5 killed midway.

Hi, I am trying to convert some fast5 files to pod5. The command checks the files and starts converting them, but it crashes after a while.

pod5 convert fast5 fast5/*.fast5 -r -o pod5/ -O fast5/
Converting 886 Fast5s:   3%|##1                                                                | 111250/3543065 [00:47<22:58, 2488.94Reads/s]
Killed

pod5 to fast5?

Hi folks,

I was able to convert a folder of fast5s to pod5 without any issues. Is there a tool for reversing the process?

Installation on macOS

Hi,

I'm having difficulties installing pod5 on macOS Ventura. When I use pip install pod5 I get the following error:
ERROR: Cannot install pod5==0.0.43 and pod5==0.1 because these package versions have conflicting dependencies.

The same command on Ubuntu 20.04.5 works fine.

Cheers,

Angus

adding recursive option

Hi,
it would be great to add an -r (recursive) option to convert all fast5 files in a directory. Although wildcards in the path work, an -r option would be great.
Best
Florian

Stable access to read.reader.schema.metadata in the Python API

Hi,

In my software, it is useful to have access to the POD5 file format version. It's possible, but not via the public API.

Old version:

p5_handle._read_reader.reader.schema.metadata[b'MINKNOW:pod5_version']

Current version:

p5_handle._handles.read.reader.schema.metadata[b'MINKNOW:pod5_version']

The metadata dict also contains a file UUID and the name of the software that made the file - both useful too. Would it be possible to make access to this metadata dict part of the public/documented Python API please?

Cheers,

TIM
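
Until such access is public, a small helper can wrap the private attribute shown above. This is only a fragile sketch that relies on pod5's internal _handles layout, so it may break between releases:

import pod5

def pod5_file_version(path: str) -> str:
    """Read the MINKNOW:pod5_version metadata entry via private attributes (fragile)."""
    with pod5.Reader(path) as reader:
        # Mirrors the access shown above; `_handles` is not part of the public API.
        metadata = reader._handles.read.reader.schema.metadata
        return metadata[b"MINKNOW:pod5_version"].decode()

# Example: print(pod5_file_version("example.pod5"))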

better handling of corrupt files

Hi

For some unknown reason we have some fast5 files in a skip folder that appear to be corrupted.

The pod5 convert command runs happily until it reaches one of the "corrupt" files and then crashes completely. I can then manually remove the reported file, but I have to start over again.

I would prefer if corrupt files were simply reported and left out of the pod5 conversion.

Best regards
Rasmus

Error message example:

pod5 convert fast5 corrupt_fast5s/ dummy.pod5
Converting 16 fast5 files.. 
0 reads,	 0 Samples,	 0/16 files,	 0.0 MB/s
Error processing corrupt_fast5s/PAK66154_skip_ecff0cbe_52bfab07_177.fast5

Sub-process trace:
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/pod5/tools/pod5_convert_from_fast5.py", line 309, in get_reads_from_files
    _f5[read_id],
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/ubuntu/.local/lib/python3.10/site-packages/h5py/_hl/group.py", line 357, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (file read failed: time = Mon Mar 13 15:03:13 2023\n, filename = 'corrupt_fast5s/PAK66154_skip_ecff0cbe_52bfab07_177.fast5', file descriptor = 10, errno = 5, error message = 'Input/output error', buf = 0x557f9c22e9d0, total read size = 80, bytes this sub-read = 80, bytes actually read = 18446744073709551615, offset = 0)"

An unexpected error occurred: 

POD5 has encountered an error: ''

For detailed information set POD5_DEBUG=1'

I ended up looping until no more corrupt files were left:

RUNAGAIN=1;

while [ $RUNAGAIN -gt 0 ]
do
	pod5 convert fast5 ./20230301_1540_1E_PAK66154_ecff0cbe/fast5_skip/*.fast5 out.pod5 2> errout.txt
	corrupt_file=$(cat errout.txt | grep "filename" | sed -E "s/.*filename = '(.*.fast5).*/\1/")
	if  [ -f $corrupt_file ];
	then
	    mv $corrupt_file corrupt_fast5s/;
	    rm errout.txt;
	    rm out.pod5;
	else
	    echo "No more corrupt files"
	    RUNAGAIN=0;
	fi
done
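
An alternative workaround is to pre-screen the fast5 files and move unreadable ones aside before conversion. Below is a minimal sketch using h5py (which, per the traceback above, is what the converter uses to read fast5 files); the directory names are placeholders:

import shutil
from pathlib import Path

import h5py

def quarantine_unreadable_fast5(src_dir: str, quarantine_dir: str) -> None:
    """Move fast5 files that cannot be fully walked by h5py into a quarantine folder."""
    Path(quarantine_dir).mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).glob("*.fast5")):
        try:
            with h5py.File(path, "r") as f5:
                # Touch every read group so I/O errors surface here rather than mid-conversion.
                for read_id in f5.keys():
                    f5[read_id]
        except (OSError, KeyError, RuntimeError):
            shutil.move(str(path), quarantine_dir)

# Example: quarantine_unreadable_fast5("fast5_skip/", "corrupt_fast5s/")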

pod5 convert fast5 -> 100% of reads converted but process not finished when converting bigger datasets

I use the following line in a bash script to convert my fast5 files to pod5

pod5 convert fast5 *.fast5 --output converted.pod5

In general, it works as expected. However, if I run the bash script on bigger data sets, conversion starts, reaches 100%, and then nothing happens. The next parts of my script are not executed. If I use the same script on a "smaller" data set, the conversion and the whole script finish as expected.
Interestingly, if I terminate the conversion with Ctrl+C when it reaches 100%, the remaining steps are executed.

Here is the terminal output when killing the conversion:

Converting 206 Fast5s: 100%|#######| 821696/821696 [01:40<00:00, 8144.87Reads/s]
^CException ignored in atexit callback: <function _exit_function at 0x7f91e832ecb0>
Traceback (most recent call last):
  File "/home/nanopore/software/anaconda3/lib/python3.10/multiprocessing/util.py", line 360, in _exit_function
    _run_finalizers()
  File "/home/nanopore/software/anaconda3/lib/python3.10/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/home/nanopore/software/anaconda3/lib/python3.10/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/home/nanopore/software/anaconda3/lib/python3.10/multiprocessing/queues.py", line 199, in _finalize_join
    thread.join()
  File "/home/nanopore/software/anaconda3/lib/python3.10/threading.py", line 1096, in join
    self._wait_for_tstate_lock()
  File "/home/nanopore/software/anaconda3/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
KeyboardInterrupt:

Of note, the same happens if I only run the command above in a terminal on its own. It finished for small data sets but not for the bigger ones, although both reach 100%. I was waiting for an hour and nothing happened.

Training a basecaller

I am trying to train a basecaller with the raw signals saved in pod5 format. Does the team have any best practices with regard to accessing reads individually from a single output.pod5 file? Currently I am using this with PyTorch, so when defining a dataset object each read is accessed individually, and each time we call:

with p5.Reader(fpath) as read:
    read = next(read.reads([read_id]))

Is this the most efficient way to access a single read from the reader object?

Thanks!
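
For comparison, here is a sketch of fetching many reads in a single pass over one open Reader, rather than constructing a new Reader for every item. The function name is hypothetical, and passing the list of read IDs to reads() simply mirrors the positional usage in the snippet above:

import pod5

def load_signals(fpath, read_ids):
    """Fetch the signals for a set of read IDs in one pass over the file."""
    with pod5.Reader(fpath) as reader:
        # Passing the read ID list restricts iteration to those reads,
        # as in the snippet above, but amortises the file-open cost.
        return {str(read.read_id): read.signal for read in reader.reads(read_ids)}

# Example: signals = load_signals("output.pod5", ["<read-id-1>", "<read-id-2>"])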

pod5 merge error "too many open files"

Hi, I tried to merge all the pod5 files for one sample (>8k files) but encountered this:

POD5 has encountered an error: '[Errno 24] Too many open files'

For detailed information set POD5_DEBUG=1'

What should I try next? Should I use cat to merge the files directly? Many thanks!

Best,
CW
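
One possible workaround, pending a better answer, is to merge in batches so the number of simultaneously open files stays under the OS limit, then merge the intermediate outputs. A sketch follows; the exact pod5 merge arguments are an assumption, so check pod5 merge --help:

import subprocess
from pathlib import Path

def merge_in_batches(pod5_dir, output, batch_size=500):
    """Merge pod5 files in batches to avoid hitting the open-file limit."""
    files = sorted(Path(pod5_dir).glob("*.pod5"))
    parts = []
    for i in range(0, len(files), batch_size):
        part = Path(f"{output}.part{i // batch_size}.pod5")
        # Assumed CLI form: pod5 merge <inputs...> --output <file>
        subprocess.run(["pod5", "merge", *map(str, files[i:i + batch_size]),
                        "--output", str(part)], check=True)
        parts.append(part)
    subprocess.run(["pod5", "merge", *map(str, parts), "--output", output], check=True)

# Example: merge_in_batches("sample_pod5s/", "merged.pod5")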

Installation problems

I have been trying to install this tool in order to convert my fast5 files into pod5 files.
Unfortunately, every time I use the conan build command, as described in #5, I get the following error:
ERROR: Error loading conanfile at '/lustre/nobackup/WUR/ABGC/hoger006/Tools/pod5-file-format/conanfile.py': Unable to load conanfile in /lustre/nobackup/WUR/ABGC/hoger006/Tools/pod5-file-format/conanfile.py
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/lustre/nobackup/WUR/ABGC/hoger006/Tools/pod5-file-format/conanfile.py", line 3, in <module>
    from conans import CMake, ConanFile, tools
ImportError: cannot import name 'CMake' from 'conans' (/home/WUR/hoger006/lustre_dir/Tools/mambaforge/envs/conan/lib/python3.11/site-packages/conans/__init__.py)
Because I am trying to install this tool on an HPC, I do not have administrative access, which might cause this error to occur.

Is there a package which contains a pre-built version of this tool, or a containerised version?

For reference, I cloned the GitHub repository and am using conan version 2.0.2 through mamba/conda.

Python quits unexpectedly when running pod5-convert-to-fast5 (segmentation fault 11)

Hello,

I am converting a directory of fast5 files from an ONT run to POD5 for use with Dorado. pod5 convert keeps crashing Python. The program keeps running, but eventually the other Python instances related to pod5 convert crash and the program stalls before converting all files. Is there a way I can avoid this happening?

System:
Apple M1 Pro 10-core CPU, 32 GB RAM
MacOS 13.1
pod5 installed through pip
python 3.10

Error report below

Thanks.

-Isaac

-------------------------------------
Translated Report (Full Report Below)
-------------------------------------

Process:               Python [4426]
Path:                  /Library/Frameworks/Python.framework/Versions/3.10/Resources/Python.app/Contents/MacOS/Python
Identifier:            org.python.python
Version:               3.10.2 (3.10.2)
Code Type:             ARM-64 (Native)
Parent Process:        Python [4418]
Responsible:           Terminal [632]
User ID:               501

Date/Time:             2022-12-27 21:08:55.7683 -0500
OS Version:            macOS 13.1 (22C65)
Report Version:        12
Anonymous UUID:        3093BEF8-BED7-F432-D82E-4805C4F3C24B


Time Awake Since Boot: 2400 seconds

System Integrity Protection: enabled

Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_INVALID_ADDRESS at 0x0000000280800000
Exception Codes:       0x0000000000000001, 0x0000000280800000

Termination Reason:    Namespace SIGNAL, Code 11 Segmentation fault: 11
Terminating Process:   exc handler [4426]

VM Region Info: 0x280800000 is not in any region.  Bytes after previous region: 1  Bytes before following region: 8388608
      REGION TYPE                    START - END         [ VSIZE] PRT/MAX SHRMOD  REGION DETAIL
      MALLOC_SMALL                280000000-280800000    [ 8192K] rw-/rwx SM=PRV  
--->  GAP OF 0x800000 BYTES
      MALLOC_SMALL                281000000-281800000    [ 8192K] rw-/rwx SM=PRV  

Kernel Triage:
VM - pmap_enter retried due to resource shortage
VM - pmap_enter retried due to resource shortage
VM - pmap_enter retried due to resource shortage
VM - pmap_enter retried due to resource shortage


Error processing message when running pod5 convert fast5

Hi, when I run the pod5 convert fast5 command, I get this message after it has been running for a while:
Sub-process trace:
A process in the process pool was terminated abruptly while the future was running or pending.

I do not know the reason, nor how to solve this. Thank you very much!

KeyError: 'sample_id'

Hi,
I am trying to convert fast5 files to pod5 to perform basecalling and modification calling using Dorado, on the nanopore-wgs-consortium NA12878 dataset, but I am getting KeyError: 'sample_id'.
I was getting a similar error, "can't locate attribute: 'sample_id'", while using Bonito for basecalling and modification calling.
Is there something I can do to make it work or to debug the issue?

object 'channel_id' doesn't exist

Hello,

When trying to convert fast5 into pod5 using pod5-convert-from-fast5 I get the following error message relating to the channel_id. Any suggestions or help?

Error in file Raw/0/AllenMiseq2_20170425_FN_MN19868_mux_scan_sample_id_37726_ch507_read77_strand.fast5: "Unable to open object (object 'channel_id' doesn't exist)"
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/pod5_format_tools/pod5_convert_from_fast5.py", line 157, in get_reads_from_files
    channel_id = inp[key]["channel_id"]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/usr/local/lib/python3.8/site-packages/h5py/_hl/group.py", line 288, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open

Installation

Hi,

I am trying to install this program on an HPC (PBSpro) but am not sure which part of the instructions I should follow.
I have tried both "Developing with conan" and "Pre commit".

FYI, this is what I did.

conda activate pip and cmake

git clone https://github.com/nanoporetech/pod5-file-format.git
cd pod5-file-format
git submodule update --init --recursive
mkdir build
cd build

conan install --build=missing -s build_type=Release ..
cmake -DUSE_CONAN=ON -DCMAKE_BUILD_TYPE=Release ..

Both attempts were unsuccessful at the cmake step.

ERROR: boost/1.78.0: Error in build() method, line 875
self.run(full_command, run_environment=True)
ConanException: Error 1 while executing b2 -q target-os=linux architecture=x86 address-model=64 binary-format=elf abi=sysv --layout=system --user-config=/home/uqhjung3/.conan/data/boost/1.78.0///source/source_subfolder/tools/build/user-config.jam -sNO_ZLIB=0 -sNO_BZIP2=0 -sNO_LZMA=1 -sNO_ZSTD=1 boost.locale.icu=off --disable-icu boost.locale.iconv=on boost.locale.iconv.lib=libc threading=multi visibility=hidden link=static variant=release --with-atomic --with-chrono --with-container --with-context --with-contract --with-coroutine --with-date_time --with-exception --with-filesystem --with-iostreams --with-locale --with-log --with-program_options --with-random --with-regex --with-serialization --with-stacktrace --with-system --with-test --with-thread --with-timer --with-type_erasure --with-wave toolset=gcc define=GLIBCXX_USE_CXX11_ABI=0 pch=on cxxflags="-fPIC -DBOOST_STACKTRACE_ADDR2LINE_LOCATION=/usr/bin/addr2line" install --prefix=/home/uqhjung3/.conan/data/boost/1.78.0///package/cf5b1011055d170fc18a05ba048979d2089d1695 -j24 --abbreviate-paths -d0 --debug-configuration --build-dir="/home/uqhjung3/.conan/data/boost/1.78.0//_/build/cf5b1011055d170fc18a05ba048979d2089d1695"

Any idea or suggestion on this matter?

Many thanks in advance!

Taek

pod5 convert fast5 hangs indefinitely at 100% done

I had no trouble getting pod5 convert fast5 up and running by installing using pip into a conda environment on my Linux server running Slurm. Initial tests on smaller data sets/number of files worked fine. However, when I run the command on my full ONT dataset, the program gets to 100% and never exits.

# command
pod5 convert fast5 --threads 16 ./fast5_pass/*.fast5 --output pod5 --one-to-one fast5_pass

# hang state in log - stays here for hours
Converting 674 Fast5s: 100%|##########| 2695235/2695235 [2:27:51<00:00, 303.81Reads/s] 

# idle state of pod process
top
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND 
336777 wilsonte  20   0 9780.7m 223624  43464 S   0.0   0.1  18:18.87 pod5   

# verification of done state in file output
ls fast5_pass | wc -w
674
ls pod5 | wc -w
674

I can find no evidence that the command has anything more to do, or is doing anything, but it never exits, which prevents my pipeline from progressing. I have forced a stop and just continued on with Dorado basecalling - so far that seems to have no problems with the pod5 files created above.

No pod5 conda package

Dear pod5 developers,

please consider creating a conda package for pod5.
I tried to create a pod5 recipe from the pypi pod5 package myself using grayskull, but it fails:

There is no sdist package on pypi for pod5.

It would be super helpful to create a conda pod5 package, as many scientists work with conda.

Kind regards,
Jannes Spangenberg

error building from conan

I'm trying to build pod5 based on the instructions using conan. I was able to successfully obtain all the dependencies with conan and create the build directory but when I try to run make I get the error listed below.

Error message

In file included from pod5-file-format/c++/pod5_format/internal/combined_file_utils.h:3,
                 from pod5-file-format/c++/pod5_format/file_reader.cpp:3:
pod5-file-format/build/c++/pod5_flatbuffers/footer_generated.h: In function ‘const char* Minknow::ReadsFormat::EnumNameContentType(Minknow::ReadsFormat::ContentType)’:
pod5-file-format/build/c++/pod5_flatbuffers/footer_generated.h:49:20: error: ‘IsOutRange’ is not a
member of ‘flatbuffers’
   49 |   if (flatbuffers::IsOutRange(e, ContentType_ReadsTable, ContentType_OtherIndex)) return "";
      |                    ^~~~~~~~~~
pod5-file-format/build/c++/pod5_flatbuffers/footer_generated.h: In function ‘const char* Minknow::ReadsFormat::EnumNameFormat(Minknow::ReadsFormat::Format)’:
pod5-file-format/build/c++/pod5_flatbuffers/footer_generated.h:76:20: error: ‘IsOutRange’ is not a
member of ‘flatbuffers’
   76 |   if (flatbuffers::IsOutRange(e, Format_FeatherV2, Format_FeatherV2)) return "";
      |                    ^~~~~~~~~~
In file included from pod5-file-format/c++/pod5_format/internal/combined_file_utils.h:3,
                 from pod5-file-format/c++/pod5_format/file_writer.cpp:3:
pod5-file-format/build/c++/pod5_flatbuffers/footer_generated.h: In function ‘const char* Minknow::ReadsFormat::EnumNameContentType(Minknow::ReadsFormat::ContentType)’:
pod5-file-format/build/c++/pod5_flatbuffers/footer_generated.h:49:20: error: ‘IsOutRange’ is not a
member of ‘flatbuffers’
   49 |   if (flatbuffers::IsOutRange(e, ContentType_ReadsTable, ContentType_OtherIndex)) return "";
      |                    ^~~~~~~~~~
pod5-file-format/build/c++/pod5_flatbuffers/footer_generated.h: In function ‘const char* Minknow::ReadsFormat::EnumNameFormat(Minknow::ReadsFormat::Format)’:
pod5-file-format/build/c++/pod5_flatbuffers/footer_generated.h:76:20: error: ‘IsOutRange’ is not a
member of ‘flatbuffers’
   76 |   if (flatbuffers::IsOutRange(e, Format_FeatherV2, Format_FeatherV2)) return "";
      |                    ^~~~~~~~~~
make[2]: *** [c++/CMakeFiles/pod5_format.dir/build.make:94: c++/CMakeFiles/pod5_format.dir/pod5_format/file_reader.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: *** [c++/CMakeFiles/pod5_format.dir/build.make:80: c++/CMakeFiles/pod5_format.dir/pod5_format/file_writer.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:244: c++/CMakeFiles/pod5_format.dir/all] Error 2
make: *** [Makefile:166: all] Error 2

Steps to produce:

mkdir build
cd build
conan remote add -I 0 conancenter https://center.conan.io
conan install --build=missing -s build_type=Release ..
cmake -DUSE_CONAN=ON -DCMAKE_BUILD_TYPE=Release ..
make

POD5 streaming data functionality

Hello,

Could you please explain the streaming functionality in C/C++, in the case where I want to extract raw data for Read Until / selective sequencing?
I know a FAST5 file has to be completely written before it can be read. Does POD5 have any advantages for Read Until?
How would future chunks of a read be handled? Appended to the same file or to a new file?

Question about using the C API to access a POD5 file efficiently using multiple threads

Dear POD5 developers,

I have been trying to use the POD5 C API to write a simple example of converting raw signal data to picoamperes. I have a single POD5 file containing a large number of reads, and I want to iterate through all the reads while exploiting as many threads as possible. Learning from the Dorado code, I have written something; a code snippet is given below. I have a few questions.

  1. Is there a way to call pod5_get_signal_row_info() without using a C++ vector (i.e. using pure C structs)? See the comment in the code below.
  2. Depending on the memory available on the system, is there an example that shows how to fetch a batch of an arbitrary size? That is to say, my programme takes the batch size as a user input parameter, and I want to load that many reads at a time rather than the batch size hardcoded into the file.
  3. Are the Arrow libraries underneath exploiting the multiple threads available on the system? If so, how can I control the number of threads they are allowed to use based on user input? If not, how do I exploit multiple threads to efficiently load, decompress and parse the data into in-memory arrays on the user side?
pod5_init();

Pod5FileReader_t* file = pod5_open_combined_file(argv[1]);

if (!file) {
   fprintf(stderr,"Error in opening file\n");
   perror("perr: ");
   exit(EXIT_FAILURE);
}

size_t batch_count = 0;
if (pod5_get_read_batch_count(&batch_count, file) != POD5_OK) {
	fprintf(stderr, "Failed to query batch count: %s\n", pod5_get_error_string());
}

int read_count = 0;

for (size_t batch_index = 0; batch_index < batch_count; ++batch_index) {


	Pod5ReadRecordBatch_t* batch = NULL;
	if (pod5_get_read_batch(&batch, file, batch_index) != POD5_OK) {
	   fprintf(stderr,"Failed to get batch: %s\n", pod5_get_error_string());
	}

	size_t batch_row_count = 0;
	if (pod5_get_read_batch_row_count(&batch_row_count, batch) != POD5_OK) {
		fprintf(stderr,"Failed to get batch row count\n");
	}

	rec_t *rec = (rec_t*)malloc(batch_row_count * sizeof(rec_t));

	// need to find out if this part can be multi-threaded, and if so the best way; for instance, should this be parallelised using an OpenMP for loop? Or is it internally using threads via the Arrow library, which is opaque to the user?
	for (size_t row = 0; row < batch_row_count; ++row) {
		uint8_t read_id[16];
		int16_t pore = 0;
		int16_t calibration_idx = 0;
		uint32_t read_number = 0;
		uint64_t start_sample = 0;
		float median_before = 0.0f;
		int16_t end_reason = 0;
		int16_t run_info = 0;
		int64_t signal_row_count = 0;
		if (pod5_get_read_batch_row_info(batch, row, read_id, &pore, &calibration_idx,
										&read_number, &start_sample, &median_before,
										&end_reason, &run_info, &signal_row_count) != POD5_OK) {
			fprintf(stderr,"Failed to get read %ld\n", row );
		}
		read_count += 1;

		char read_id_tmp[37];
		pod5_error_t err = pod5_format_read_id(read_id, read_id_tmp);

		CalibrationDictData_t *calib_data = NULL;
		if (pod5_get_calibration(batch, calibration_idx, &calib_data) != POD5_OK) {
			fprintf(stderr, "Failed to get read %ld calibration_idx data: %s\n", row,  pod5_get_error_string());
		}

		uint64_t *signal_rows_indices= (uint64_t*) malloc(signal_row_count * sizeof(uint64_t));

		if (pod5_get_signal_row_indices(batch, row, signal_row_count,
									   signal_rows_indices) != POD5_OK) {
			fprintf(stderr,"Failed to get read %ld; signal row indices: %s\n", row, pod5_get_error_string());
		}

		// cannot get to work this in C, So using C++
		//SignalRowInfo_t *signal_rows = (SignalRowInfo_t *)malloc(sizeof(SignalRowInfo_t)*signal_row_count);
		std::vector<SignalRowInfo_t *> signal_rows(signal_row_count);

		if (pod5_get_signal_row_info(file, signal_row_count, signal_rows_indices,
									signal_rows.data()) != POD5_OK) {
			fprintf(stderr,"Failed to get read %ld signal row locations: %s\n", row, pod5_get_error_string());
		}

		size_t total_sample_count = 0;
		for (size_t i = 0; i < signal_row_count; ++i) {
			total_sample_count += signal_rows[i]->stored_sample_count;
		}

		int16_t *samples = (int16_t*)malloc(sizeof(int16_t)*total_sample_count);
		size_t samples_read_so_far = 0;
		for (size_t i = 0; i < signal_row_count; ++i) {
			if (pod5_get_signal(file, signal_rows[i], signal_rows[i]->stored_sample_count,
							   samples + samples_read_so_far) != POD5_OK) {
				fprintf(stderr,"Failed to get read  %ld; signal: %s\n", row, pod5_get_error_string());
				fprintf(stderr,"Failed to get read  %ld; signal: %s\n", row, pod5_get_error_string());
			}

			samples_read_so_far += signal_rows[i]->stored_sample_count;
		}

		rec[row].len_raw_signal = samples_read_so_far;
		rec[row].raw_signal = samples;
		rec[row].scale = calib_data->scale;
		rec[row].offset = calib_data->offset;
		rec[row].read_id = strdup(read_id_tmp);

		pod5_release_calibration(calib_data);
		pod5_free_signal_row_info(signal_row_count, signal_rows.data());

		free(signal_rows_indices);

	}


	//process the batch here 
	
	//print the output here
	
	if (pod5_free_read_batch(batch) != POD5_OK) {
		fprintf(stderr,"Failed to release batch\n");
	}

	for (size_t row = 0; row < batch_row_count; ++row) {
		free(rec[row].read_id);
		free(rec[row].raw_signal);
	}
	free(rec);

}

Is the above implementation the most efficient way to use POD5 on a multi-core system?

pod5:Enqueueing exception with pod5 convert

I want to convert my single-read fast5 files into multi-read fast5 file(s), and then into one .pod5 file.

single_to_multi_fast5 -i {input} -s {output}
pod5 convert fast5 {output}/*.fast5 --output converted.pod5

where {input} is simply the folder with all the single-read fast5 files.

The command single_to_multi_fast5 converts my input files into a file "batch_0.fast5" and additionally it outputs a "filename_mapping.txt".

But when I try to use the pod5 command, the following error appears:

Converting 1 Fast5s:   0%|   | 0/4000 [00:00<?, ?Reads/s]ERROR:pod5:Enqueueing exception: batch_0.fast5 'sample_id'
Converting 1 Fast5s:   0%|   | 0/4000 [00:00<?, ?Reads/s]
WARNING:pod5:Unfinished exceptions found during shutdown!

I can't do much with the error message; I think maybe something that pod5 needs got lost in the conversion.

I am using the following libraries:

pod5 0.2.2 
ont-fast5-api 4.1.1 

Assertion failed when writing large batches.

Hi, I've run into the following problem after migrating to 0.2.0 (I'm not sure if I'm misunderstanding the C API or if this is a bug):
When writing POD5 files through the C API via "pod5_add_reads_data", with more than 1000 reads, the program fails on a buffer assertion. The problem specifically happens on read 999. Changing read_table_batch_size to 10000 in the writer options fixes the issue.
I'll add some debug info in case this is a bug:

The exception happens in:
expandable_buffer.h@46 called by
read_table_writer.cpp@122 (write_batch) called by
read_table_writer@88 (add_read) called by
file_writer@92 (add_complete_read) called by
c_api@1124 (pod5_add_reads_data) called by my code (copy.cpp)

The dataset used is: s3://ont-open-data/gm24385_2020.09/analysis/r9.4.1/20200914_1357_1-E11-H11_PAF27462_d3c9678e/guppy_v4.0.11_r9.4.1_hac_prom/align_unfiltered/chr15/fast5/batch12.fast5
It was converted using the pod5 conversion tool given in the Python package.

Here is the full call stack from VSCode:
libc.so.6!__pthread_kill_implementation(int no_tid, int signo, pthread_t threadid) (pthread_kill.c:44)
libc.so.6!__pthread_kill_internal(int signo, pthread_t threadid) (pthread_kill.c:78)
libc.so.6!__GI___pthread_kill(pthread_t threadid, int signo) (pthread_kill.c:89)
libc.so.6!__GI_raise(int sig) (raise.c:26)
libc.so.6!__GI_abort() (abort.c:79)
libc.so.6!__assert_fail_base(const char * fmt, const char * assertion, const char * file, unsigned int line, const char * function) (assert.c:92)
libc.so.6!__GI___assert_fail(const char * assertion, const char * file, unsigned int line, const char * function) (assert.c:101)
pod5::ExpandableBuffer::get_data_span(const pod5::ExpandableBuffer * const this) (PATH/pod5/c++/pod5_format/expandable_buffer.h:46)
pod5::detail::StringDictionaryKeyBuilder::get_typed_offset_data(const pod5::detail::StringDictionaryKeyBuilder * const this) (PATH/pod5/c++/pod5_format/read_table_writer_utils.h:90)
pod5::detail::get_array_data(const std::shared_ptrarrow::DataType & type, const pod5::detail::StringDictionaryKeyBuilder & builder, std::size_t expected_length) (PATH/pod5/c++/pod5_format/read_table_writer_utils.cpp:33)
pod5::RunInfoWriter::get_value_array(pod5::RunInfoWriter * const this) (PATH/pod5/c++/pod5_format/read_table_writer_utils.cpp:228)
pod5::DictionaryWriter::build_dictionary_array(pod5::DictionaryWriter * const this, const std::shared_ptrarrow::Array & indices) (PATH/pod5/c++/pod5_format/read_table_writer_utils.cpp:198)
pod5::detail::BuilderHelperarrow::DictionaryArray::Finish(pod5::detail::BuilderHelperarrow::DictionaryArray * const this, std::shared_ptrarrow::Array * dest) (PATH/pod5/c++/pod5_format/schema_field_builder.h:174)
pod5::FieldBuilder<pod5::Field<0, pod5::UuidArray>, pod5::ListField<1, arrow::ListArray, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<2, arrow::NumericArrayarrow::UInt32Type >, pod5::Field<3, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<4, arrow::NumericArrayarrow::FloatType >, pod5::Field<5, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<6, arrow::NumericArrayarrow::FloatType >, pod5::Field<7, arrow::NumericArrayarrow::FloatType >, pod5::Field<8, arrow::NumericArrayarrow::FloatType >, pod5::Field<9, arrow::NumericArrayarrow::FloatType >, pod5::Field<10, arrow::NumericArrayarrow::UInt32Type >, pod5::Field<11, arrow::NumericArrayarrow::FloatType >, pod5::Field<12, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<13, arrow::NumericArrayarrow::UInt16Type >, pod5::Field<14, arrow::NumericArrayarrow::UInt8Type >, pod5::Field<15, arrow::DictionaryArray>, pod5::Field<16, arrow::NumericArrayarrow::FloatType >, pod5::Field<17, arrow::NumericArrayarrow::FloatType >, pod5::Field<18, arrow::DictionaryArray>, pod5::Field<19, arrow::BooleanArray>, pod5::Field<20, arrow::DictionaryArray> >::finish_columns()::{lambda(auto:1&, unsigned long)#1}::operator()<pod5::detail::BuilderHelperarrow::DictionaryArray >(pod5::detail::BuilderHelperarrow::DictionaryArray&, unsigned long) const(const struct {...} * const __closure, pod5::detail::BuilderHelperarrow::DictionaryArray & element, std::size_t index) (PATH/pod5/c++/pod5_format/schema_field_builder.h:240)
pod5::detail::for_each<std::tuple<pod5::detail::BuilderHelperpod5::UuidArray, pod5::detail::ListBuilderHelper<arrow::ListArray, arrow::NumericArrayarrow::UInt64Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt32Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt64Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt64Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt32Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt64Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt16Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt8Type >, pod5::detail::BuilderHelperarrow::DictionaryArray, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelperarrow::DictionaryArray, pod5::detail::BuilderHelperarrow::BooleanArray, pod5::detail::BuilderHelperarrow::DictionaryArray >&, pod5::FieldBuilder<pod5::Field<0, pod5::UuidArray>, pod5::ListField<1, arrow::ListArray, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<2, arrow::NumericArrayarrow::UInt32Type >, pod5::Field<3, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<4, arrow::NumericArrayarrow::FloatType >, pod5::Field<5, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<6, arrow::NumericArrayarrow::FloatType >, pod5::Field<7, arrow::NumericArrayarrow::FloatType >, pod5::Field<8, arrow::NumericArrayarrow::FloatType >, pod5::Field<9, arrow::NumericArrayarrow::FloatType >, pod5::Field<10, arrow::NumericArrayarrow::UInt32Type >, pod5::Field<11, arrow::NumericArrayarrow::FloatType >, pod5::Field<12, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<13, arrow::NumericArrayarrow::UInt16Type >, pod5::Field<14, arrow::NumericArrayarrow::UInt8Type >, pod5::Field<15, arrow::DictionaryArray>, pod5::Field<16, arrow::NumericArrayarrow::FloatType >, pod5::Field<17, arrow::NumericArrayarrow::FloatType >, pod5::Field<18, arrow::DictionaryArray>, pod5::Field<19, arrow::BooleanArray>, pod5::Field<20, arrow::DictionaryArray> >::finish_columns()::{lambda(auto:1&, unsigned long)#1}, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20>(std::tuple<pod5::detail::BuilderHelperpod5::UuidArray, pod5::detail::ListBuilderHelper<arrow::ListArray, arrow::NumericArrayarrow::UInt64Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt32Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt64Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt64Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt32Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt64Type >, 
pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt16Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt8Type >, pod5::detail::BuilderHelperarrow::DictionaryArray, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelperarrow::DictionaryArray, pod5::detail::BuilderHelperarrow::BooleanArray, pod5::detail::BuilderHelperarrow::DictionaryArray >&, pod5::FieldBuilder<pod5::Field<0, pod5::UuidArray>, pod5::ListField<1, arrow::ListArray, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<2, arrow::NumericArrayarrow::UInt32Type >, pod5::Field<3, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<4, arrow::NumericArrayarrow::FloatType >, pod5::Field<5, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<6, arrow::NumericArrayarrow::FloatType >, pod5::Field<7, arrow::NumericArrayarrow::FloatType >, pod5::Field<8, arrow::NumericArrayarrow::FloatType >, pod5::Field<9, arrow::NumericArrayarrow::FloatType >, pod5::Field<10, arrow::NumericArrayarrow::UInt32Type >, pod5::Field<11, arrow::NumericArrayarrow::FloatType >, pod5::Field<12, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<13, arrow::NumericArrayarrow::UInt16Type >, pod5::Field<14, arrow::NumericArrayarrow::UInt8Type >, pod5::Field<15, arrow::DictionaryArray>, pod5::Field<16, arrow::NumericArrayarrow::FloatType >, pod5::Field<17, arrow::NumericArrayarrow::FloatType >, pod5::Field<18, arrow::DictionaryArray>, pod5::Field<19, arrow::BooleanArray>, pod5::Field<20, arrow::DictionaryArray> >::finish_columns()::{lambda(auto:1&, unsigned long)#1}, std::integer_sequence<int, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20>)(std::tuple<pod5::detail::BuilderHelperpod5::UuidArray, pod5::detail::ListBuilderHelper<arrow::ListArray, arrow::NumericArrayarrow::UInt64Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt32Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt64Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt64Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt32Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt64Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt16Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt8Type >, pod5::detail::BuilderHelperarrow::DictionaryArray, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelperarrow::DictionaryArray, pod5::detail::BuilderHelperarrow::BooleanArray, pod5::detail::BuilderHelperarrow::DictionaryArray > & t, struct {...} f) (PATH/pod5/c++/pod5_format/tuple_utils.h:11)
pod5::detail::for_each_in_tuple<pod5::detail::BuilderHelperpod5::UuidArray, pod5::detail::ListBuilderHelper<arrow::ListArray, arrow::NumericArrayarrow::UInt64Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt32Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt64Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt64Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt32Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt64Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt16Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt8Type >, pod5::detail::BuilderHelperarrow::DictionaryArray, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelperarrow::DictionaryArray, pod5::detail::BuilderHelperarrow::BooleanArray, pod5::detail::BuilderHelperarrow::DictionaryArray, pod5::FieldBuilder<pod5::Field<0, pod5::UuidArray>, pod5::ListField<1, arrow::ListArray, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<2, arrow::NumericArrayarrow::UInt32Type >, pod5::Field<3, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<4, arrow::NumericArrayarrow::FloatType >, pod5::Field<5, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<6, arrow::NumericArrayarrow::FloatType >, pod5::Field<7, arrow::NumericArrayarrow::FloatType >, pod5::Field<8, arrow::NumericArrayarrow::FloatType >, pod5::Field<9, arrow::NumericArrayarrow::FloatType >, pod5::Field<10, arrow::NumericArrayarrow::UInt32Type >, pod5::Field<11, arrow::NumericArrayarrow::FloatType >, pod5::Field<12, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<13, arrow::NumericArrayarrow::UInt16Type >, pod5::Field<14, arrow::NumericArrayarrow::UInt8Type >, pod5::Field<15, arrow::DictionaryArray>, pod5::Field<16, arrow::NumericArrayarrow::FloatType >, pod5::Field<17, arrow::NumericArrayarrow::FloatType >, pod5::Field<18, arrow::DictionaryArray>, pod5::Field<19, arrow::BooleanArray>, pod5::Field<20, arrow::DictionaryArray> >::finish_columns()::{lambda(auto:1&, unsigned long)#1}>(std::tuple<pod5::detail::BuilderHelperpod5::UuidArray, pod5::detail::ListBuilderHelper<arrow::ListArray, arrow::NumericArrayarrow::UInt64Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt32Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt64Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt64Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt32Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt64Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt16Type >, 
pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt8Type >, pod5::detail::BuilderHelperarrow::DictionaryArray, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelperarrow::DictionaryArray, pod5::detail::BuilderHelperarrow::BooleanArray, pod5::detail::BuilderHelperarrow::DictionaryArray >&, pod5::FieldBuilder<pod5::Field<0, pod5::UuidArray>, pod5::ListField<1, arrow::ListArray, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<2, arrow::NumericArrayarrow::UInt32Type >, pod5::Field<3, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<4, arrow::NumericArrayarrow::FloatType >, pod5::Field<5, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<6, arrow::NumericArrayarrow::FloatType >, pod5::Field<7, arrow::NumericArrayarrow::FloatType >, pod5::Field<8, arrow::NumericArrayarrow::FloatType >, pod5::Field<9, arrow::NumericArrayarrow::FloatType >, pod5::Field<10, arrow::NumericArrayarrow::UInt32Type >, pod5::Field<11, arrow::NumericArrayarrow::FloatType >, pod5::Field<12, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<13, arrow::NumericArrayarrow::UInt16Type >, pod5::Field<14, arrow::NumericArrayarrow::UInt8Type >, pod5::Field<15, arrow::DictionaryArray>, pod5::Field<16, arrow::NumericArrayarrow::FloatType >, pod5::Field<17, arrow::NumericArrayarrow::FloatType >, pod5::Field<18, arrow::DictionaryArray>, pod5::Field<19, arrow::BooleanArray>, pod5::Field<20, arrow::DictionaryArray> >::finish_columns()::{lambda(auto:1&, unsigned long)#1})(std::tuple<pod5::detail::BuilderHelperpod5::UuidArray, pod5::detail::ListBuilderHelper<arrow::ListArray, arrow::NumericArrayarrow::UInt64Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt32Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt64Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt64Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt32Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt64Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt16Type >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::UInt8Type >, pod5::detail::BuilderHelperarrow::DictionaryArray, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelper<arrow::NumericArrayarrow::FloatType >, pod5::detail::BuilderHelperarrow::DictionaryArray, pod5::detail::BuilderHelperarrow::BooleanArray, pod5::detail::BuilderHelperarrow::DictionaryArray > & t, struct {...} f) (PATH/pod5/c++/pod5_format/tuple_utils.h:18)
pod5::FieldBuilder<pod5::Field<0, pod5::UuidArray>, pod5::ListField<1, arrow::ListArray, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<2, arrow::NumericArrayarrow::UInt32Type >, pod5::Field<3, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<4, arrow::NumericArrayarrow::FloatType >, pod5::Field<5, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<6, arrow::NumericArrayarrow::FloatType >, pod5::Field<7, arrow::NumericArrayarrow::FloatType >, pod5::Field<8, arrow::NumericArrayarrow::FloatType >, pod5::Field<9, arrow::NumericArrayarrow::FloatType >, pod5::Field<10, arrow::NumericArrayarrow::UInt32Type >, pod5::Field<11, arrow::NumericArrayarrow::FloatType >, pod5::Field<12, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<13, arrow::NumericArrayarrow::UInt16Type >, pod5::Field<14, arrow::NumericArrayarrow::UInt8Type >, pod5::Field<15, arrow::DictionaryArray>, pod5::Field<16, arrow::NumericArrayarrow::FloatType >, pod5::Field<17, arrow::NumericArrayarrow::FloatType >, pod5::Field<18, arrow::DictionaryArray>, pod5::Field<19, arrow::BooleanArray>, pod5::Field<20, arrow::DictionaryArray> >::finish_columns(pod5::FieldBuilder<pod5::Field<0, pod5::UuidArray>, pod5::ListField<1, arrow::ListArray, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<2, arrow::NumericArrayarrow::UInt32Type >, pod5::Field<3, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<4, arrow::NumericArrayarrow::FloatType >, pod5::Field<5, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<6, arrow::NumericArrayarrow::FloatType >, pod5::Field<7, arrow::NumericArrayarrow::FloatType >, pod5::Field<8, arrow::NumericArrayarrow::FloatType >, pod5::Field<9, arrow::NumericArrayarrow::FloatType >, pod5::Field<10, arrow::NumericArrayarrow::UInt32Type >, pod5::Field<11, arrow::NumericArrayarrow::FloatType >, pod5::Field<12, arrow::NumericArrayarrow::UInt64Type >, pod5::Field<13, arrow::NumericArrayarrow::UInt16Type >, pod5::Field<14, arrow::NumericArrayarrow::UInt8Type >, pod5::Field<15, arrow::DictionaryArray>, pod5::Field<16, arrow::NumericArrayarrow::FloatType >, pod5::Field<17, arrow::NumericArrayarrow::FloatType >, pod5::Field<18, arrow::DictionaryArray>, pod5::Field<19, arrow::BooleanArray>, pod5::Field<20, arrow::DictionaryArray> > * const this) (PATH/pod5/c++/pod5_format/schema_field_builder.h:238)
pod5::ReadTableWriter::write_batch(pod5::ReadTableWriter * const this) (PATH/pod5/c++/pod5_format/read_table_writer.cpp:122)
pod5::ReadTableWriter::add_read(pod5::ReadTableWriter * const this, const pod5::ReadData & read_data, const gsl::span & signal, uint64_t signal_duration) (PATH/pod5/c++/pod5_format/read_table_writer.cpp:88)
pod5::FileWriterImpl::add_complete_read(pod5::FileWriterImpl * const this, const pod5::ReadData & read_data, const gsl::span & signal) (PATH/pod5/c++/pod5_format/file_writer.cpp:92)
pod5::FileWriter::add_complete_read(pod5::FileWriter * const this, const pod5::ReadData & read_data, const gsl::span & signal) (PATH/pod5/c++/pod5_format/file_writer.cpp:340)
pod5_add_reads_data(Pod5FileWriter_t * file, uint32_t read_count, uint16_t struct_version, const void * row_data, const int16_t ** signal, const uint32_t * signal_size) (PATH/pod5/c++/pod5_format/c_api.cpp:1124)
main(int argc, char ** argv) (PATH/src/c++/copy.cpp:431)

Regards,
Rafael.

adc_max/min=zero and adc_range missing, so can't calculate digitisation

Hello,

When I convert a set of fast5 files to pod5, the adc_max/min values are zero.

The description of these fields states that the digitisation is adc_max - adc_min; however, these values are zero in all of my reads, so I can't calculate the expected 2048.0.

An alternative way to calculate the digitisation is from the adc_range; however, when a fast5 file is read by ( https://github.com/nanoporetech/pod5-file-format/blob/dcc0b99a45f742f06fe45d7d99f4dc8a0255e5a7/python/pod5_format/pod5_format/writer.py ), that value is only used together with the digitisation to compute the scale, and only the scale is recorded; adc_range is discarded.

Is it possible to maintain the adc_range value in the conversion step, or ideally digitisation and adc_range?
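
To make the relationship concrete, here is a minimal sketch of the maths as I understand it (illustrative only; the function names are mine, not part of the pod5 API):

def digitisation(adc_max: int, adc_min: int) -> int:
    # per the RunInfo field description below: adc_max - adc_min is the digitisation
    return adc_max - adc_min


def scale_from_range(adc_range: float, digitisation: float) -> float:
    # what the fast5-to-pod5 conversion stores in the calibration; adc_range itself is dropped
    return adc_range / digitisation


def signal_pa(raw_sample: int, scale: float, offset: float) -> float:
    # calibrated current in picoamps from a raw ADC sample
    return (raw_sample + offset) * scale


# e.g. with the values from the read dump below:
# signal_pa(raw_sample, scale=0.36551764607429504, offset=-223.0)

With adc_max and adc_min both zero and adc_range dropped, neither the digitisation nor the range can be recovered from the pod5 file; only the final scale and offset survive.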

data dumps below

Cheers,
James


[types.RunInfo.fields.adc_max]
type = "int16"
description = "The maximum ADC value that might be encountered. This is a hardware constraint."

[types.RunInfo.fields.adc_min]
type = "int16"
description = "The minimum ADC value that might be encountered. This is a hardware constraint. adc_max - adc_min is the digitisation."

#### read.run_info dump

run info
    acquisition_id: bfdfd1d840e2acaf5c061241fd9b8e5c3cfe729f
    acquisition_start_time: 2020-10-27 05:41:50+00:00
    adc_max: 0                                                         <-----| these are zero
    adc_min: 0                                                          <-----|
    context_tags
      barcoding_enabled: 0
      basecall_config_filename: dna_r9.4.1_450bps_hac_prom.cfg
      experiment_duration_set: 4320
      experiment_type: genomic_dna
      local_basecalling: 1
      package: bream4
      package_version: 6.0.7
      sample_frequency: 4000
      sequencing_kit: sqk-lsk109
    experiment_name:
    flow_cell_id: PAF25452
    flow_cell_product_code: FLO-PRO002
    protocol_name: sequencing/sequencing_PRO002_DNA:FLO-PRO002:SQK-LSK109
    protocol_run_id: 97d631c6-c622-473d-9e7d-3cb9297b0036
    protocol_start_time: 1970-01-01 00:00:00+00:00
    sample_id: NA12878_SRE
    sample_rate: 4000
    sequencing_kit: sqk-lsk109
    sequencer_position: 3A
    sequencer_position_type: promethion
    software: python-pod5-converter
    system_name:
    system_type:
    tracking_id
      asic_id: 0004A30B00F25467
      asic_id_eeprom: 0004A30B00F25467
      asic_temp: 31.996552
      asic_version: Unknown
      auto_update: 0
      auto_update_source: https://mirror.oxfordnanoportal.com/software/MinKNOW/
      bream_is_standard: 0
      configuration_version: 4.0.13
      device_id: 3A
      device_type: promethion
      distribution_status: stable
      distribution_version: 20.06.9
      exp_script_name: sequencing/sequencing_PRO002_DNA:FLO-PRO002:SQK-LSK109
      exp_script_purpose: sequencing_run
      exp_start_time: 2020-10-27T05:41:50Z
      flow_cell_id: PAF25452
      flow_cell_product_code: FLO-PRO002
      guppy_version: 4.0.11+f1071ce
      heatsink_temp: 32.164288
      hostname: PC24A004
      hublett_board_id: 013b01308fa78662
      hublett_firmware_version: 2.0.14
      installation_type: nc
      ip_address:
      local_firmware_file: 1
      mac_address:
      operating_system: ubuntu 16.04
      protocol_group_id: PLPN243131
      protocol_run_id: 97d631c6-c622-473d-9e7d-3cb9297b0036
      protocols_version: 6.0.7
      run_id: bfdfd1d840e2acaf5c061241fd9b8e5c3cfe729f
      sample_id: NA12878_SRE
      satellite_board_id: 013c763bef6cca9d
      satellite_firmware_version: 2.0.14
      usb_config: firm_1.2.3_ware#rbt_4.5.6_rbt#ctrl#USB3
      version: 4.0.3

### read and read.calibration

read_id: 000dab68-15a2-43c1-b33d-9598d95b37de
channel: 861
well: 1
pore_type: not_set
read_number: 261
start_sample: 3856185
end_reason: data_service_unblock_mux_change
median_before: 204.2
sample_count: 331742
byte_count: 226302
signal_compression_ratio: 0.341
scale: 0.36551764607429504
offset: -223.0

Julia library for accessing POD5 data

Are there any plans to expand the list of languages able to access POD5 data files? I'd be particularly interested in a Julia package.

Given the completeness of the Python package (which is brilliant for scripting), merely having the ability to load and extract data from a POD5 file in other languages would be sufficient. There is no need for options to manipulate the data or write out new files. From the Julia perspective this should be fairly trivial, given that Arrow.jl can do the heavy lifting with the data tables, but I couldn't figure out from the documentation how you have wrapped these tables up in the container.

Happy to put in some effort to get this off the ground, as it would remove another dependency in my workflows, and working with HDF5 is horrid. I'd like to make the jump ASAP!

Thanks, Tom.

Maintain input folder hierarchy on conversion to/from fast5

Hi,

I would like to convert a large set of old MinION runs to pod5 for long-term storage and possibly re-basecalling. By default only one pod5 file is created, and when I try --output-one-to-one each fast5 is converted to a separate pod5 but placed in the same folder. I got the error below, likely because files in the pass/fail folders have the same names. It would be great if either the folder structure could be kept or pass/fail added to the file names, and also if the conversion kept track of read IDs to remove duplicates, as I think is done in guppy and in the single_to_multi fast5 conversion tool.

An unexpected error occurred: Input path already exists. Refusing to overwrite.

Traceback (most recent call last):
  File "/home/minion/anaconda3/bin/pod5-convert-from-fast5", line 8, in <module>
    sys.exit(main())
  File "/home/minion/anaconda3/lib/python3.7/site-packages/pod5_format_tools/pod5_convert_from_fast5.py", line 623, in main
    args.signal_chunk_size,
  File "/home/minion/anaconda3/lib/python3.7/site-packages/pod5_format_tools/pod5_convert_from_fast5.py", line 603, in convert_from_fast5
    raise exc
  File "/home/minion/anaconda3/lib/python3.7/site-packages/pod5_format_tools/pod5_convert_from_fast5.py", line 565, in convert_from_fast5
    writer = output_handler.get_writer(item.file)
  File "/home/minion/anaconda3/lib/python3.7/site-packages/pod5_format_tools/pod5_convert_from_fast5.py", line 395, in get_writer
    return self._open_writer(output_path=output_path)
  File "/home/minion/anaconda3/lib/python3.7/site-packages/pod5_format_tools/pod5_convert_from_fast5.py", line 381, in _open_writer
    writer = p5.Writer(output_path)
  File "/home/minion/anaconda3/lib/python3.7/site-packages/pod5_format/writer.py", line 84, in __init__
    raise FileExistsError("Input path already exists. Refusing to overwrite.")
FileExistsError: Input path already exists. Refusing to overwrite.

No matching distribution found for lib-pod5~=0.1

Installing with pip in a Python 3.11 docker image results in the following error, which suggests lib-pod5~=0.1 is not available:

(base) ➜  v0.1.0 git:(master) ✗ docker build --platform linux/amd64 -t zeunas/pod5tools:0.1.0 .
[+] Building 12.1s (7/7) FINISHED
 => [internal] load build definition from Dockerfile                                                                     0.0s
 => => transferring dockerfile: 519B                                                                                     0.0s
 => [internal] load .dockerignore                                                                                        0.0s
 => => transferring context: 2B                                                                                          0.0s
 => [internal] load metadata for docker.io/library/python:3.11                                                           1.2s
 => [auth] library/python:pull token for registry-1.docker.io                                                            0.0s
 => [1/3] FROM docker.io/library/python:3.11@sha256:11560799e4311fd5abcca7ace13585756d7222ce5471162cd78c78a4ecaf62bd     0.0s
 => CACHED [2/3] WORKDIR /usr/src/app                                                                                    0.0s
 => ERROR [3/3] RUN pip install --no-cache-dir --upgrade pip &&     pip install --no-cache-dir pod5==0.1.0              10.8s
------
 > [3/3] RUN pip install --no-cache-dir --upgrade pip &&     pip install --no-cache-dir pod5==0.1.0:
#7 6.721 Requirement already satisfied: pip in /usr/local/lib/python3.11/site-packages (22.3.1)
#7 7.384 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
#7 9.518 Collecting pod5==0.1.0
#7 9.694   Downloading pod5-0.1-py3-none-any.whl (47 kB)
#7 9.728      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47.6/47.6 kB 2.8 MB/s eta 0:00:00
#7 9.890 Collecting iso8601
#7 9.920   Downloading iso8601-1.1.0-py3-none-any.whl (9.9 kB)
#7 10.10 Collecting jsonschema
#7 10.13   Downloading jsonschema-4.17.3-py3-none-any.whl (90 kB)
#7 10.16      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90.4/90.4 kB 4.9 MB/s eta 0:00:00
#7 10.23 ERROR: Could not find a version that satisfies the requirement lib-pod5~=0.1 (from pod5) (from versions: none)
#7 10.23 ERROR: No matching distribution found for lib-pod5~=0.1
------
executor failed running [/bin/sh -c pip install --no-cache-dir --upgrade pip &&     pip install --no-cache-dir pod5==0.1.0]: exit code: 1

Any idea what's going on? Strangely enough, I can see lib-pod5 0.1.0 should be available on pypi: https://pypi.org/project/lib-pod5/

Variable output pod5 subset

Hi,

I need to partition 1500 reads that are spread across 1375 pod5 files into 5 new pod5 files. However, each time I try, I end up with a different output.

I made a mapping.csv file detailing read IDs and the corresponding filename I want the read to end up in. Here are the first three lines:

chikungunya_virus.pod5,2993b28e-b5f0-44dd-8612-e0fce1167e22
chikungunya_virus.pod5,27b8d110-bf05-4b78-ab4d-a5e8661343f3
chikungunya_virus.pod5,42292daf-5d66-47a5-a6d5-3900b83462dc

Then I run the following command:

pod5 subset --threads 5 --csv mapping.csv *.pod5

and get this message:

Subsetting 15000 read_ids into 5 outputs using 5 workers

and after it appears to have been completed successfully, I check the POD5 file content using:

pod5 inspect summary *.pod5

Issue: sometimes I end up with POD5 files containing only a fraction of the reads/raw signals, while other times I just get an error, for example: “Failed to open pod5 file: zika_virus.pod5: IOError: Invalid signature in file”.

When I run pod5 subset multiple times, I end up with different numbers of reads in each POD5 file, and always a low number of reads.

Any advice on what I should try next?

Thanks in advance!

Wim

multiprocessing reads/batches

Hello,

What is the "best" way to do multiprocessing while reading a whole pod5 file?

Currently, I'm using something like this from the benchmarking code, which gets the total number of batches and splits those batches into groups of batches. Then each group is sent to a worker to run on a spawned process. That worker then goes through each batch, and reads each read in that batch, before moving to the next batch.

import multiprocessing as mp

import pod5_format


# worker to process a set of batches in a pod5 file
def batch_worker(filename, select_batches, result_queue):
    # each spawned worker opens its own handle to the pod5 file
    file = pod5_format.open_combined_file(filename)
    # for each batch in our set of batches, get it, and process the reads
    for batch_id in select_batches:
        batch = file.get_batch(batch_id)
        for read in batch.reads():
            # get read stuff (read_id is just a stand-in for whatever is needed)
            result_queue.put(read.read_id)


def main():
    # setup mp
    mp.set_start_method("spawn")
    result_queue = mp.Queue()
    runners = 10
    processes = []
    filename = "example.pod5"  # path to the pod5 file being read

    # open file (in the parent, only to count batches)
    file = pod5_format.open_combined_file(filename)

    # get range of batches and split into groups for each runner
    batches = list(range(file.batch_count))
    approx_chunk_size = max(1, len(batches) // runners)
    start_index = 0

    # submit each set of batches to the runners to process
    while start_index < len(batches):
        select_batches = batches[start_index : start_index + approx_chunk_size]
        p = mp.Process(
            target=batch_worker,
            args=(filename, select_batches, result_queue),
        )
        p.start()
        processes.append(p)
        start_index += len(select_batches)

    # consume results from result_queue here (not shown), then
    # clean up processes and other code not shown here
    for p in processes:
        p.join()


if __name__ == "__main__":
    main()

The other method I came up with was submitting each batch index to a task queue; each worker then pulls from that queue, processes that batch of reads, and places the results in a results queue (a rough sketch of this variant is below). But it's essentially the same, in that each worker still processes one batch at a time.
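
Roughly, that queue-based variant would look like this (using the same older pod5_format calls as above; the example filename and the None sentinels are just placeholders):

import multiprocessing as mp

import pod5_format


def queue_worker(filename, task_queue, result_queue):
    # each spawned worker opens its own handle to the file
    file = pod5_format.open_combined_file(filename)
    while True:
        batch_id = task_queue.get()
        if batch_id is None:
            # sentinel: no more batches to process
            break
        batch = file.get_batch(batch_id)
        for read in batch.reads():
            result_queue.put(read.read_id)


def run(filename="example.pod5", runners=10):
    mp.set_start_method("spawn", force=True)
    task_queue = mp.Queue()
    result_queue = mp.Queue()

    # queue up every batch index, plus one sentinel per worker
    file = pod5_format.open_combined_file(filename)
    for batch_id in range(file.batch_count):
        task_queue.put(batch_id)
    for _ in range(runners):
        task_queue.put(None)

    workers = [
        mp.Process(target=queue_worker, args=(filename, task_queue, result_queue))
        for _ in range(runners)
    ]
    for w in workers:
        w.start()
    # results need to be drained from result_queue here, before joining,
    # otherwise workers can block on a full queue
    for w in workers:
        w.join()


if __name__ == "__main__":
    run()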

Is there another or better way to do this? What is the most efficient way to read a pod5 file if you are reading all the data, not just a selection of it?

Cheers,
James

Potential buffer overflow and dead code in C API

Hi,
I'm trying to read POD5 files with your C API. My problem comes specifically from the pod5_get_pore_type and pod5_get_end_reason functions. When I malloc a 16 char block for the end reason and an end reason larger than 16 chars is found, a buffer overflow occurs.
Specifically lines 663 to 667 contain:

   POD5_C_ASSIGN_OR_RAISE(auto const end_reason_val, batch->batch.get_end_reason(end_reason));
   *end_reason_string_value_size = end_reason_val.second.size() + 1;
   if (end_reason_val.second.size() >= *end_reason_string_value_size) {
       return POD5_ERROR_STRING_NOT_LONG_ENOUGH;
   }

My understanding is that this if statement is dead code: *end_reason_string_value_size has just been set to end_reason_val.second.size() + 1, so the condition can never be true, POD5_ERROR_STRING_NOT_LONG_ENOUGH is never returned, and a client application has no way of knowing whether the alloc'd memory is sufficient. Should I be checking the required string size in another way?

Thanks,
Rafael.

pod5 convert fast5 stuck at checking fast5 stage

(pod5 0.1.16, python 3.7.13)

I have a folder of ~7000 fast5 files that I want to convert into pod5. Here is the --help output, followed by the command I am using:

usage: pod5 convert fast5 [-h] -o OUTPUT [-r] [-t THREADS] [--strict]
                          [-O ONE_TO_ONE] [-f]
                          [--signal-chunk-size SIGNAL_CHUNK_SIZE]
                          inputs [inputs ...]

Convert fast5 file(s) into a pod5 file(s)

positional arguments:
  inputs                Input path for fast5 file

optional arguments:
  -h, --help            show this help message and exit
  -r, --recursive       Search for input files recursively (default: False)
  -t THREADS, --threads THREADS
                        Set the number of threads to use [default: 10]
                        (default: 10)
  --strict              Immediately quit if an exception is encountered during
                        conversion instead of continuing with remaining inputs
                        after issuing a warning (default: False)

required arguments:
  -o OUTPUT, --output OUTPUT
                        Output path for the pod5 file(s). This can be an
                        existing directory (creating 'output.pod5' within it)
                        or a new named file path. A directory must be given
                        when using --one-to-one. (default: None)

output control arguments:
  -O ONE_TO_ONE, --one-to-one ONE_TO_ONE
                        Output files are written 1:1 to inputs. 1:1 output
                        files are written to the output directory in a new
                        directory structure relative to the directory path
                        provided to this argument. This directory path must be
                        a relative parent of all inputs. (default: None)
  -f, --force-overwrite
                        Overwrite destination files (default: False)
  --signal-chunk-size SIGNAL_CHUNK_SIZE
                        Chunk size to use for signal data set (default:
                        102400)

nohup pod5 convert fast5 -o pod5/CHM13.pod5 --recursive --strict -t 60 multi_fast5/ &

When I check nohup, I'm getting this:

cat nohup.out
Checking Fast5 Files:   0%|          | 0/7469 [00:00<?, ?Files/s]

Using ps, I'm getting the following status codes, which I believe indicate it's waiting for something to finish:

ps ux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
billylau 3471706  1.5  0.0 19561020 308316 pts/12 Sl  15:19   0:32 /home/billylau/.conda/envs/pod5/bin/python /home/billylau/.conda/envs/pod5/bin/pod5 convert fast5 -o pod5/CHM13.pod5 --recursive --strict -t 60 multi_fast5/
billylau 3472653  0.0  0.0   8892  3280 pts/12   R+   15:55   0:00 ps ux

top also indicates that nothing is happening:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                   
3471706 billylau  20   0   18.7g 311000  42388 S   2.0   0.0   0:34.00 pod5

It's been like this for ~30 minutes now with nothing updating. When I look at the code, e.g. https://github.com/nanoporetech/pod5-file-format/blob/73617c63ac310cc4e9f8d23cf06f2bfde5d21b7b/python/pod5/src/pod5/tools/pod5_convert_from_fast5.py, it doesn't look like it's doing anything more than checking whether each file is multi-read or not, so it should update the progress bar quickly.

Edit: doing a single pod5 file works fine:

nohup pod5 convert fast5 -o pod5/CHM13.pod5 --recursive --strict -t 60 multi_fast5/FAK50913_c7ef4ac6a67eb7bce6220608fccf5b19227f4904_5.fast5 &

cat nohup.out
Converting 1 Fast5s: 100%|##########| 4000/4000 [00:11<00:00, 352.08Reads/s]

Edit 2: adding a wildcard to the input seems to make it "not stuck", but it still seems absurdly slow to me on the checking step:

nohup pod5 convert fast5 -o pod5/CHM13.pod5 --recursive --strict -t 60 multi_fast5/*
Checking Fast5 Files:  12%|#2        | 920/7469 [20:50<8:54:48,  4.90s/Files]

It seems like it's fast at first but then slows down in the first minute:

Checking Fast5 Files:   0%|          | 0/7469 [00:00<?, ?Files/s](pod5) billylau@suzuki:/mnt/ix1/Projects_lite/20230403_BL_CHM13_nanopore_raw/00_fast5$ cat nohup.out
Checking Fast5 Files:   6%|5         | 425/7469 [00:13<03:03, 38.39Files/s](pod5) billylau@suzuki:/mnt/ix1/Projects_lite/20230403_BL_CHM13_nanopore_raw/00_fast5$ cat nohup.out
Checking Fast5 Files:   7%|6         | 512/7469 [00:16<03:15, 35.58Files/s](pod5) billylau@suzuki:/mnt/ix1/Projects_lite/20230403_BL_CHM13_nanopore_raw/00_fast5$ cat nohup.out
Checking Fast5 Files:   7%|6         | 512/7469 [00:16<03:15, 35.58Files/s](pod5) billylau@suzuki:/mnt/ix1/Projects_lite/20230403_BL_CHM13_nanopore_raw/00_fast5$ cat nohup.out
Checking Fast5 Files:   7%|6         | 512/7469 [00:16<03:15, 35.58Files/s](pod5) billylau@suzuki:/mnt/ix1/Projects_lite/20230403_BL_CHM13_nanopore_raw/00_fast5$ cat nohup.out
Checking Fast5 Files:   8%|7         | 566/7469 [00:23<05:25, 21.21Files/s](pod5) billylau@suzuki:/mnt/ix1/Projects_lite/20230403_BL_CHM13_nanopore_raw/00_fast5$ cat nohup.out
Checking Fast5 Files:   8%|7         | 566/7469 [00:23<05:25, 21.21Files/s](pod5) billylau@suzuki:/mnt/ix1/Projects_lite/20230403_BL_CHM13_nanopore_raw/00_fast5$ cat nohup.out
Checking Fast5 Files:   8%|7         | 566/7469 [00:23<05:25, 21.21Files/s](pod5) billylau@suzuki:/mnt/ix1/Projects_lite/20230403_BL_CHM13_nanopore_raw/00_fast5$ cat nohup.out
Checking Fast5 Files:   8%|7         | 584/7469 [00:40<14:34,  7.87Files/s](pod5) billylau@suzuki:/mnt/ix1/Projects_lite/20230403_BL_CHM13_nanopore_raw/00_fast5$ cat nohup.out
Checking Fast5 Files:   8%|7         | 584/7469 [00:40<14:34,  7.87Files/s](pod5) billylau@suzuki:/mnt/ix1/Projects_lite/20230403_BL_CHM13_nanopore_raw/00_fast5$ cat nohup.out
Checking Fast5 Files:   8%|7         | 584/7469 [00:40<14:34,  7.87Files/s](pod5) billylau@suzuki:/mnt/ix1/Projects_lite/20230403_BL_CHM13_nanopore_raw/00_fast5$ cat nohup.out
Checking Fast5 Files:   8%|7         | 593/7469 [00:50<23:56,  4.79Files/s]

Missing `pod5_format_export.h` ?

Hi, I tried to find pod5_format_export.h in this project, but it is missing. However, your other .cpp files include it many times; where is it?

pod5 subset and filter throw core dump error.

❯ pod5 filter WTC-11-NGN2-hiPSC_chromatin_1.pod5 --ids readID_5e5.txt --output WTC-11-NGN2-hiPSC_chromatin_1_subset.pod5 --force
[1]    1432189 illegal hardware instruction (core dumped)  pod5 filter WTC-11-NGN2-hiPSC_chromatin_1.pod5 --ids readID_5e5.txt --output 

Running the latest 0.2 version.

pod5 inspect reads input.pod5 works, so I think there is nothing wrong with the input pod5.

Basecalling takes a long time to start up with Guppy v6.5.7 and a merged pod5

I am attempting to use Guppy v6.5.7 with a merged pod5 from a PromethION run because of some strange issues that I've been having with fast5s. I used pod5 convert to create a single merged pod5 file; it's ~900GB.

When I start Guppy, it seems to take a long time after the "init" phase before basecalling actually starts -- as much as 30 minutes. It doesn't seem to be doing much on CPU except for slowly ramping up the memory usage. Is it trying to do a memory map?

This long startup time is really a bummer because I am submitting small jobs to the HPC using Guppy's --resume feature, and this really cuts into the server time if this happens every single time a new job starts.

A sample log -- take a look in between the "Init time" timestamp and when the first read is loaded:

2023-06-07 10:39:13.108864 [guppy/message] ONT Guppy basecalling software version 6.5.7+ca6d6af, minimap2 version 2.24-r1122
config file:        /home/groups/hanleeji/ont-guppy_v6.5.7/data/dna_r9.4.1_450bps_modbases_5mc_cg_sup_prom.cfg
model file:         /home/groups/hanleeji/ont-guppy_v6.5.7/data/template_r9.4.1_450bps_sup_prom.jsn
input path:         pod5/
save path:          guppy_5mc_prom_pod5/
chunk size:         2000
chunks per runner:  768
minimum qscore:     7
records per file:   4000
fastq compression:  ON
num basecallers:    4
gpu device:         cuda:all
kernel path:        
runners per device: 12

alignment file:     /home/groups/hanleeji/hs38_naa.mmi
alignment type:     auto

Use of this software is permitted solely under the terms of the end user license agreement (EULA).
By running, copying or accessing this software, you are demonstrating your acceptance of the EULA.
The EULA may be found in /home/groups/hanleeji/ont-guppy_v6.5.7/bin
2023-06-07 10:39:13.110617 [guppy/info] crashpad_handler not supported on this platform.
2023-06-07 10:39:13.523133 [guppy/info] CUDA device 0 (compute 8.0) initialised, memory limit 85031714816B (84594458624B free)
2023-06-07 10:39:17.480038 [guppy/message] loading new index: /home/groups/hanleeji/hs38_naa.mmi
2023-06-07 10:40:39.909546 [guppy/message] Full alignment will be performed.
2023-06-07 10:40:51.638841 [guppy/message] Resuming basecall from previous logfile: guppy_5mc_prom_pod5/guppy_basecaller_log-2023-06-07_10-07-54.log
2023-06-07 10:41:35.454594 [guppy/message] Found 1 input read file to process.
2023-06-07 10:41:35.503698 [guppy/info] lamp_arrangements arrangement folder not found: /home/groups/hanleeji/ont-guppy_v6.5.7/data/barcoding/lamp_arrangements
2023-06-07 10:41:35.961987 [guppy/info] lamp_arrangements arrangement folder not found: /home/groups/hanleeji/ont-guppy_v6.5.7/data/read_splitting/lamp_arrangements
2023-06-07 10:41:35.997585 [guppy/message] Init time: 142887 ms
2023-06-07 11:10:37.310822 [guppy/info] Read '000b2109-cd1a-4713-bcb6-265a84b14ed4' from file "20211119_PRM_1118.pod5" has been loaded.
2023-06-07 11:10:37.342269 [guppy/info] Read '000c5547-4ccf-4e80-a8bf-e4c5e61356be' from file "20211119_PRM_1118.pod5" has been loaded.
2023-06-07 11:10:37.342317 [guppy/info] Read '001455e0-3bc8-4640-b672-d0ac87237293' from file "20211119_PRM_1118.pod5" has been loaded.
2023-06-07 11:10:37.342343 [guppy/info] Read '00187ac0-90c8-4807-9e51-9e5298fcff54' from file "20211119_PRM_1118.pod5" has been loaded.
2023-06-07 11:10:37.342361 [guppy/info] Read '00370473-eb9e-45bb-bf4b-efe54e32d8e6' from file "20211119_PRM_1118.pod5" has been loaded.
2023-06-07 11:10:37.342380 [guppy/info] Read '0039136e-a13c-4d6c-89a3-28ae5ce3c217' from file "20211119_PRM_1118.pod5" has been loaded.
2023-06-07 11:10:37.342397 [guppy/info] Read '003a74f4-fc9e-4273-9e17-0f3ddfef99ec' from file "20211119_PRM_1118.pod5" has been loaded.
2023-06-07 11:10:37.342413 [guppy/info] Read '0043353b-2c49-43ea-bba4-fd21b4f99a9e' from file "20211119_PRM_1118.pod5" has been loaded.

About potential MinKNOW output

How will the MinKNOW output be organised, and what would be the default batch size?
Also, is MinKNOW going to output one large single POD5 file per sequencing run, or will it be multiple POD5 files, as is currently done with FAST5?

guppy basecalling failing with a pod5 file

Hey,

I am trying to run guppy (v6.3.7) basecalling on our HPC on a pod5 file. The job finishes without crashing after about 5 seconds; no error message is thrown and no output is created.

pod5:Empty queue or timeout

Hi,
When using the command "pod5 convert fast5," I consistently encounter an error that causes the program to stop running. However, upon inspecting my data, I can confirm that the data exists and is actually stored. How can I resolve this issue?

pod5 convert fast5 ./fast5/*.fast5 --output pod5/

[image: screenshot of the error message]

pod5 inspect reads format to match the needs for pod5 subset

Ideally the pod5 tool guides would include an example of generating a summary file from a folder of pod5s and then subsetting based on the channel information, to facilitate easier duplex calling. nanoporetech/dorado#68 (comment)

I am struggling a bit with the formats, and I am not sure that what pod5 subset calls a summary is even the same file as the output of pod5 inspect reads.

From the guides:
pod5 inspect reads https://github.com/nanoporetech/pod5-file-format/blob/master/python/pod5/README.md#pod5-inspect-reads
"Inspect all reads and print a csv table"

pod5 subset https://github.com/nanoporetech/pod5-file-format/blob/master/python/pod5/README.md#subsetting-from-a-summary
"pod5 subset can dynamically generate output targets and collect associated reads based on a tab-separated file "

I tried to convert the csv file from pod5 inspect reads to tab-separated format using csvkit and provide that file as the summary, but that gave me the following error: "Number of passed names did not match number of header fields in the file".

What I tried to run:
# get summary information from pod5 reads
pod5 inspect reads pod5_original/*pod5 > summary.csv

# convert list to tsv format
csvformat -T summary.csv > summary.tsv

# subset based on channel
pod5 subset pod5_original/*pod5 --output barcode_channel_subset --summary summary.tsv --columns channel --template "{channel}.subset.pod5"
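
As a possible workaround I was considering building the summary TSV myself with the Python API, roughly like this (this assumes pod5.Reader and read.pore.channel behave the way I think they do; untested):

import csv
from pathlib import Path

import pod5 as p5

# Write a minimal summary with just read_id and channel, tab separated, to pass
# to: pod5 subset pod5_original/*pod5 --summary summary.tsv --columns channel
with open("summary.tsv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["read_id", "channel"])
    for pod5_path in Path("pod5_original").glob("*.pod5"):
        with p5.Reader(pod5_path) as reader:
            for read in reader.reads():
                writer.writerow([str(read.read_id), read.pore.channel])

But it would be good to know the intended way of doing this with the command line tools alone.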

The writing speed of POD5

Compared to FAST5, how fast/efficient is POD5 for writing?

Will the PromethION P48 be able to keep up with writing when all 48 flowcells are operating, and at double the current sampling rate (i.e. 8 kHz)?

Fast5 to pod5 - 5khz mode

ONT recently changed the sampling rate to 5 kHz.

I think I remember that the pod5 format was also developed to be able to save this kind of data.

By accident, the raw read output format was changed from pod5 to fast5 for one of our runs. Now I am not sure: if I convert the fast5 to pod5 afterwards, does the resulting pod5 file contain real 5 kHz sampled data, or was it lost/reduced to 4 kHz due to the initial fast5 format?

Note I'll also post this question on the dorado github, as I am not sure which site is correct for this issue.
