Comments (20)
from pod5-file-format.
Hi @HITzhongyu ,
Could you set the POD5_DEBUG=1 environment variable and run the same command again? The converter will then generate a number of log files which show the state of the Queue at runtime. I can use these to help resolve this issue.
POD5_DEBUG=1 pod5 convert fast5 ./fast5/*.fast5 --output debug_pod5/
Kind regards,
Rich
Hi @HalfPhoton
I changed to a new set of test data and reran the command. This time, I encountered an error right from the beginning. The program can still run, but it gets stuck at 99% and throws an error. When I run POD5_DEBUG=1 pod5 convert fast5 ./test/*.fast5 --output debug_pod5/, the errors are as follows:
Kind regards,
Zhongyu
The first report shows you using -t/--threads 40, which gives a different error to the second report. You might be requesting too many resources, which is why the tool fails to create a new process or thread, resulting in "resource temporarily unavailable". I would suggest reducing the value given to --threads.
For the second report, which relates to the original issue raised: there should be .log files created now that POD5_DEBUG=1 is set. Can you share those with me please?
It looks like the Queue that contains the conversion tasks is somehow becoming empty, or a single conversion task is timing out after 600 seconds (which should be plenty of time). The log files will help me track down why this happens: either a process is getting stuck, or the queue logic is failing in your example.
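The timeout behaviour described here boils down to a blocking get() with a deadline. A minimal sketch of that pattern (hypothetical names, not the converter's actual code):

```python
import queue

def drain(work, timeout=0.1):
    """Consume tasks until a get() times out (the real converter waits 600 s)."""
    done = []
    while True:
        try:
            task = work.get(timeout=timeout)
        except queue.Empty:
            # Timed out: either every producer finished, or one stalled.
            return done
        done.append(task)

work = queue.Queue()
for name in ("a.fast5", "b.fast5"):
    work.put(name)
print(drain(work))  # → ['a.fast5', 'b.fast5']
```

If a producer dies without enqueueing its sentinel or result, every consumer eventually falls into the queue.Empty branch, which matches the stall seen at 99%.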
Kind regards,
Rich
@HalfPhoton
Here are all the log files, thanks!
2023-06-27--20-01-54-p-11518-pod5.log
2023-06-27--20-01-54-p-11519-pod5.log
2023-06-27--20-01-54-p-11520-pod5.log
2023-06-27--20-01-54-p-11511-pod5.log
2023-06-27--20-01-54-p-11512-pod5.log
2023-06-27--20-01-54-p-11513-pod5.log
2023-06-27--20-01-54-p-11514-pod5.log
2023-06-27--20-01-54-p-11517-pod5.log
2023-06-27--20-01-52-main-pod5.log
Kind regards,
Zhongyu
Hi @HITzhongyu
Thank you very much for the logs. They've been very helpful.
From the main-pod5.log we can see that one of the worker processes has been killed by a segmentation fault:
2023-06-27 20:21:44,357 DEBUG 66:'terminate_processes': ... SpawnProcess-11, stopped[SIGSEGV] daemon ...
and in the worker's 11513-pod5.log we see that the log ends abruptly here:
--- Finishing previous file FAQ32498_pass_09083b73_65.fast5
2023-06-27 20:11:05,414 DEBUG 53:'convert_fast5_file':Done:37.193s
2023-06-27 20:11:05,425 DEBUG 53:'convert_fast5_file':Returned:4000
2023-06-27 20:11:05,427 INFO Enqueueing file end: FAQ32498_pass_09083b73_65.fast5 reads: 4000
2023-06-27 20:11:05,428 DEBUG c7:'enqueue_data'
--- Getting next file FAQ32498_pass_09083b73_71.fast5
2023-06-27 20:11:05,430 DEBUG 56:'get_input':(<pod5.tools.pod5_convert_from_fast5.QueueManager object at 0x7f8b6c3b5b10>,), {}
2023-06-27 20:11:05,430 DEBUG 56:'get_input':Done:0.000s
2023-06-27 20:11:05,430 DEBUG 56:'get_input':Returned:test/FAQ32498_pass_09083b73_71.fast5
--- Testing is_multi_read_fast5 on FAQ32498_pass_09083b73_71.fast5
2023-06-27 20:11:05,431 DEBUG 72:'is_multi_read_fast5':(PosixPath('test/FAQ32498_pass_09083b73_71.fast5'),), {}
--- Segfault
We'd expect to see:
2023-06-27 20:10:26,479 DEBUG fd:'is_multi_read_fast5':(PosixPath('test/FAQ32498_pass_09083b73_65.fast5'),), {}
2023-06-27 20:10:28,220 DEBUG fd:'is_multi_read_fast5':Done:1.741s
2023-06-27 20:10:28,220 DEBUG fd:'is_multi_read_fast5':Returned:True
Can you please check that the file test/FAQ32498_pass_09083b73_71.fast5 is not corrupt in some way?
Kind regards,
Rich
Hi @HalfPhoton
I tried to open test/FAQ32498_pass_09083b73_71.fast5, but HDFView can't open it.
I can upload the file to you so you can test it.
Kind regards,
Zhongyu
@HITzhongyu ,
Can you open it with python?
Using the same environment where pod5 is installed, these module imports should exist:
# Get a path to the file
from pathlib import Path
path = Path("test/FAQ32498_pass_09083b73_71.fast5")
assert path.exists()

# Can we open the file with h5py? If it fails here then the HDF5 file is corrupted somehow
import h5py
h5 = h5py.File(path, "r")

# Is the file empty? If it fails here there's nothing to do anyway and the file should be deleted
assert len(h5) > 0

# Can pod5 check the file? If it fails here then there might be something we can do
from pod5.tools.pod5_convert_from_fast5 import is_multi_read_fast5
is_multi_read_fast5(path)
@HalfPhoton
It reports an error: Segmentation fault (core dumped)
Can you add a few print statements between tests or run it line-by-line in an interpreter to determine where the segfault occurs?
@HalfPhoton
sure!
from pathlib import Path
path = Path("/home/user/ydliu/hitbic/HG002/test/FAQ32498_pass_09083b73_71.fast5")
assert path.exists()
print("666")
import h5py
h5 = h5py.File(path)
print("777")
assert len(h5) > 0
print("888")
from pod5.tools.pod5_convert_from_fast5 import is_multi_read_fast5
print(is_multi_read_fast5(path))
Ok,
Please try this:
print("start")
with h5py.File(path) as _h5:
    print("open")
    print(_h5)
    _h5.attrs
    print("can access_h5.attrs")
    print(_h5.attrs)
    # The "file_type" attribute might be present on supported multi-read fast5 files.
    if _h5.attrs.get("file_type") == "multi-read":
        return True
    print("is not multi-read file type")
    if len(_h5) == 0:
        return True
    print("is not len 0")
    # if there are "read_x" keys, this is a multi-read file
    if any(key for key in _h5 if key.startswith("read_")):
        print("found a read")
        return True
print("closing handle")
print("everything is fine?!")
I modified your code because it caused some errors:
print("start")
with h5py.File(path) as _h5:
    print("open")
    print(_h5)
    _h5.attrs
    print("can access_h5.attrs")
    print(_h5.attrs)
    # The "file_type" attribute might be present on supported multi-read fast5 files.
    if _h5.attrs.get("file_type") == "multi-read":
        print("True")
        # return True
    print("is not multi-read file type")
    if len(_h5) == 0:
        print("True")
        # return True
    print("is not len 0")
    # if there are "read_x" keys, this is a multi-read file
    if any(key for key in _h5 if key.startswith("read_")):
        print("found a read")
        # return True
print("closing handle")
print("everything is fine?!")
It reports an error:
start
open
<HDF5 file "FAQ32498_pass_09083b73_71.fast5" (mode r)>
can access_h5.attrs
<Attributes of HDF5 object at 139974581599904>
is not multi-read file type
is not len 0
Traceback (most recent call last):
  File "test.py", line 40, in <module>
    if any(key for key in _h5 if key.startswith("read_")):
  File "test.py", line 40, in <genexpr>
    if any(key for key in _h5 if key.startswith("read_")):
  File "/home/user/ydliu/miniconda3/envs/remora/lib/python3.8/site-packages/h5py/_hl/group.py", line 499, in __iter__
    for x in self.id.__iter__():
  File "h5py/h5g.pyx", line 128, in h5py.h5g.GroupIter.__next__
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5l.pyx", line 316, in h5py.h5l.LinkProxy.iterate
RuntimeError: Link iteration failed (incorrect metadata checksum after all read attempts)
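A failure like this is a Python-level RuntimeError rather than a segfault, so it can be caught and treated as a corruption marker. A minimal sketch, with a stand-in function in place of the real h5py iteration:

```python
def is_readable(iterate_keys, path):
    """Return False if HDF5 link iteration raises, as it does on corrupt files."""
    try:
        iterate_keys(path)
        return True
    except RuntimeError as exc:  # h5py surfaces checksum failures this way
        print(f"{path}: corrupt ({exc})")
        return False

# Stand-in for the real iteration, reproducing the failure above:
def broken_iterate(path):
    raise RuntimeError("incorrect metadata checksum after all read attempts")

print(is_readable(broken_iterate, "FAQ32498_pass_09083b73_71.fast5"))  # → False
```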
Hi @HITzhongyu ,
It does appear that your fast5 file is corrupt. This is the same issue as seen here: megalodon#279
I'm not sure what we can do other than to recommend that you check your files and drop those that are corrupt before continuing with pod5 convert. [Edit: subset -> convert]
Apologies, we don't have a better solution.
Kind regards,
Rich
Hi @HalfPhoton
Thank you very much for your patient explanation.
I have another question: if the fast5 data is corrupted, why is there no issue with it during Guppy processing, but problems arise specifically with pod5?
Regarding this issue, could pod5 perform a filtering step before converting, skipping any damaged fast5 files (those recognised as single fast5) without affecting the rest of the run? If there are only a few such damaged files, it should not impact the results of large-scale methylation detection.
Or, if it's convenient for you, could you please let me know which part of the code needs to be modified? I can make the changes on my end.
Kind regards,
Zhongyu
pod5 convert will try to ignore bad fast5 files unless --strict is set. We removed the up-front fast5 checking because it was so slow.
In your case, the files cause a prompt segfault which kills the worker process immediately instead of allowing it to handle the error gracefully. This is an issue with h5py.
There are potential changes we could make to how we handle dead workers, which we might investigate.
As for how Guppy can handle this file when pod5 cannot: I'm not sure, but Guppy is not using Python / h5py, which is where I believe the issue is caused.
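One way such dead-worker handling can look (a sketch with hypothetical names, not pod5's code): a supervisor inspects each worker's exitcode, which on POSIX is negative when the process was killed by a signal such as SIGSEGV.

```python
import multiprocessing as mp
import os
import signal

def crashy_worker():
    # Simulate the crash h5py triggers on the corrupt file.
    os.kill(os.getpid(), signal.SIGSEGV)

if __name__ == "__main__":
    p = mp.Process(target=crashy_worker)
    p.start()
    p.join()
    if p.exitcode is not None and p.exitcode < 0:
        sig = signal.Signals(-p.exitcode).name
        print(f"worker died from signal {sig}; requeue its task or fail fast")
```

A supervisor that notices the negative exitcode could then report which file the worker was processing instead of letting the queue time out 600 seconds later.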
Kind regards,
Rich
Edit: subset -> convert
Thank you very much for your patient explanation!
Kind regards,
Zhongyu
Hi @HalfPhoton
I find that pod5 subset checks pod5 files, not fast5:
usage: pod5 subset [-h] [-o OUTPUT] [-r] [-f] [-t THREADS] [--csv CSV]
                   [-s TABLE] [-R READ_ID_COLUMN] [-c COLUMNS [COLUMNS ...]]
                   [--template TEMPLATE] [-T] [-M] [-D]
                   inputs [inputs ...]

Given one or more pod5 input files, take subsets of reads into one or more pod5 output files by a user-supplied mapping.

positional arguments:
  inputs                Pod5 filepaths to use as inputs

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Destination directory to write outputs (default:
                        /home/user/ydliu/hitbic/HG002)
  -r, --recursive       Search for input files recursively matching `*.pod5`
                        (default: False)
  -f, --force-overwrite
                        Overwrite destination files (default: False)
  -t THREADS, --threads THREADS
                        Number of subsetting workers (default: 8)

direct mapping:
  --csv CSV             CSV file mapping output filename to read ids (default:
                        None)

table mapping:
  -s TABLE, --summary TABLE, --table TABLE
                        Table filepath (csv or tsv) (default: None)
  -R READ_ID_COLUMN, --read-id-column READ_ID_COLUMN
                        Name of the read_id column in the summary (default:
                        read_id)
  -c COLUMNS [COLUMNS ...], --columns COLUMNS [COLUMNS ...]
                        Names of --summary / --table columns to subset on
                        (default: None)
  --template TEMPLATE   template string to generate output filenames (e.g.
                        "mux-{mux}_barcode-{barcode}.pod5"). default is to
                        concatenate all columns to values as shown in the
                        example. (default: None)
  -T, --ignore-incomplete-template
                        Suppress the exception raised if the --template string
                        does not contain every --columns key (default: None)

content settings:
  -M, --missing-ok      Allow missing read_ids (default: False)
  -D, --duplicate-ok    Allow duplicate read_ids (default: False)

Example: pod5 subset inputs.pod5 --output subset_mux/ --summary summary.tsv --columns mux
Sorry, my error. I meant to say pod5 convert, not pod5 subset, when explaining the --strict option above.
Are you happy with the solution, @HITzhongyu? Can we close this issue?