
Comments (20)

HITzhongyu commented on June 16, 2024

Hi @HalfPhoton

sure!
Thank you very much for your patient explanation!

Kind regards,
Zhongyu

from pod5-file-format.

HalfPhoton commented on June 16, 2024

Hi @HITzhongyu ,

Could you set the POD5_DEBUG=1 environment variable and run the same command again? The converter will now generate a number of log files which show the state of the Queue at runtime. I can use these to help resolve this issue.

POD5_DEBUG=1 pod5 convert fast5 ./fast5/*.fast5 --output debug_pod5/

Kind regards,
Rich


HITzhongyu commented on June 16, 2024

Hi @HalfPhoton

I changed to a new set of test data and reran the command. This time, I encountered an error right from the beginning, as follows:
[screenshots]

However, the program still runs, but it gets stuck at 99% and throws the following error:
[screenshot]

When I run POD5_DEBUG=1 pod5 convert fast5 ./test/*.fast5 --output debug_pod5/, the errors are as follows:
[screenshot]

Kind regards,
Zhongyu


HalfPhoton commented on June 16, 2024

@HITzhongyu

The first report shows you using -t/--threads 40, which gives a different error from the second report. You might be requesting too many resources, which is why the tool fails to create a new process or thread, resulting in resource temporarily unavailable. I would suggest reducing the value given to --threads.

For the second report, which is related to the original issue raised: there should be .log files created now that POD5_DEBUG=1 is set. Can you share those with me, please?

It looks like the Queue that contains the conversion tasks is becoming empty somehow, or timing out after 600 seconds on a single conversion task (which should be plenty of time).
The log files will help me track down why this happens. Either the process is getting stuck or the queue logic is failing in your example.
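
For illustration only, the timeout mechanism described above can be sketched with Python's standard queue module. This is an assumption about the general pattern, not the actual pod5 internals, and the 600-second timeout is shortened for the demo:

```python
import queue

# Conversion tasks are handed to workers through a queue; a worker
# blocks on get() and gives up after a timeout.
tasks: queue.Queue = queue.Queue()
tasks.put("FAQ32498_pass_09083b73_65.fast5")

received = tasks.get(timeout=1.0)   # a task is available: returns at once

try:
    tasks.get(timeout=0.1)          # queue is now empty: raises after the timeout
    timed_out = False
except queue.Empty:
    timed_out = True
```

If a worker dies without enqueueing its result, the consumer on the other end sees exactly this kind of timeout rather than an explicit error.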

Kind regards,
Rich


HITzhongyu commented on June 16, 2024

@HalfPhoton
Here are all the log files, thanks !

2023-06-27--20-01-54-p-11518-pod5.log
2023-06-27--20-01-54-p-11519-pod5.log
2023-06-27--20-01-54-p-11520-pod5.log
2023-06-27--20-01-54-p-11511-pod5.log
2023-06-27--20-01-54-p-11512-pod5.log
2023-06-27--20-01-54-p-11513-pod5.log
2023-06-27--20-01-54-p-11514-pod5.log
2023-06-27--20-01-54-p-11517-pod5.log
2023-06-27--20-01-52-main-pod5.log

Kind regards,
Zhongyu


HalfPhoton commented on June 16, 2024

Hi @HITzhongyu

Thank you very much for the logs. They've been very helpful.

From the main-pod5.log we can see that one of the worker processes has been killed by a segmentation fault:

2023-06-27 20:21:44,357 DEBUG 66:'terminate_processes': ... SpawnProcess-11, stopped[SIGSEGV] daemon ...

and in the worker 11513-pod5.log we see that the log ends abruptly here:

--- Finishing previous file FAQ32498_pass_09083b73_65.fast5
2023-06-27 20:11:05,414 DEBUG 53:'convert_fast5_file':Done:37.193s
2023-06-27 20:11:05,425 DEBUG 53:'convert_fast5_file':Returned:4000
2023-06-27 20:11:05,427 INFO Enqueueing file end: FAQ32498_pass_09083b73_65.fast5 reads: 4000
2023-06-27 20:11:05,428 DEBUG c7:'enqueue_data'

--- Getting next file FAQ32498_pass_09083b73_71.fast5
2023-06-27 20:11:05,430 DEBUG 56:'get_input':(<pod5.tools.pod5_convert_from_fast5.QueueManager object at 0x7f8b6c3b5b10>,), {}
2023-06-27 20:11:05,430 DEBUG 56:'get_input':Done:0.000s
2023-06-27 20:11:05,430 DEBUG 56:'get_input':Returned:test/FAQ32498_pass_09083b73_71.fast5

--- Testing is_multi_read_fast5 on FAQ32498_pass_09083b73_71.fast5
2023-06-27 20:11:05,431 DEBUG 72:'is_multi_read_fast5':(PosixPath('test/FAQ32498_pass_09083b73_71.fast5'),), {}

--- Segfault

We'd expect to see

2023-06-27 20:10:26,479 DEBUG fd:'is_multi_read_fast5':(PosixPath('test/FAQ32498_pass_09083b73_65.fast5'),), {}
2023-06-27 20:10:28,220 DEBUG fd:'is_multi_read_fast5':Done:1.741s
2023-06-27 20:10:28,220 DEBUG fd:'is_multi_read_fast5':Returned:True

Can you please try and check that this file test/FAQ32498_pass_09083b73_71.fast5 is not corrupt in some way?

Kind regards,
Rich


HITzhongyu commented on June 16, 2024

Hi @HalfPhoton
I tried to open test/FAQ32498_pass_09083b73_71.fast5, but HDFView can't open it.
I can upload the file to you so you can test it.

Kind regards,
Zhongyu


HalfPhoton commented on June 16, 2024

@HITzhongyu ,
Can you open it with Python?

Using the same environment where pod5 is installed, the following imports and checks should work:

# Get a path to the file
from pathlib import Path
path = Path("test/FAQ32498_pass_09083b73_71.fast5")
assert path.exists()

# Can we open the file with h5py? If it fails here, the HDF5 file is corrupted somehow
import h5py
h5 = h5py.File(path, "r")

# Is the file empty? If it fails here, there's nothing to do anyway and the file should be deleted
assert len(h5) > 0

# Can pod5 check the file? If it fails here, there might be something we can do
from pod5.tools.pod5_convert_from_fast5 import is_multi_read_fast5
is_multi_read_fast5(path)


HITzhongyu commented on June 16, 2024

@HalfPhoton
It reports an error: Segmentation fault (core dumped)
[screenshot]


HalfPhoton commented on June 16, 2024

Can you add a few print statements between tests or run it line-by-line in an interpreter to determine where the segfault occurs?


HITzhongyu commented on June 16, 2024

@HalfPhoton
sure!

from pathlib import Path
path = Path("/home/user/ydliu/hitbic/HG002/test/FAQ32498_pass_09083b73_71.fast5")
assert path.exists()
print("666")

import h5py
h5 = h5py.File(path)
print("777")

assert len(h5) > 0
print("888")

from pod5.tools.pod5_convert_from_fast5 import is_multi_read_fast5
print(is_multi_read_fast5(path))

[screenshot]


HalfPhoton commented on June 16, 2024

Ok,

Please try this:

print("start")
with h5py.File(path) as _h5:
  print("open")           
  print(_h5)

  _h5.attrs
  print("can access_h5.attrs")
  print(_h5.attrs)

  # The "file_type" attribute might be present on supported multi-read fast5 files.
  if _h5.attrs.get("file_type") == "multi-read":
    return True
  print( "is not multi-read file type")

  if len(_h5) == 0:
    return True
  print( "is not len 0")

  # if there are "read_x" keys, this is a multi-read file
  if any(key for key in _h5 if key.startswith("read_")):
    print("found a read")
    return True

  print("closing handle")
print("everything is fine?!")


HITzhongyu commented on June 16, 2024

I modified your code because the bare return statements caused an error outside a function:

print("start")
with h5py.File(path) as _h5:
    print("open")           
    print(_h5)

    _h5.attrs
    print("can access_h5.attrs")
    print(_h5.attrs)

    # The "file_type" attribute might be present on supported multi-read fast5 files.
    if _h5.attrs.get("file_type") == "multi-read":
        print("True")
        # return True
    print( "is not multi-read file type")

    if len(_h5) == 0:
        print("True")
        # return True
    print( "is not len 0")

    # if there are "read_x" keys, this is a multi-read file
    if any(key for key in _h5 if key.startswith("read_")):
        print("found a read")
        # return True

    print("closing handle")
print("everything is fine?!")

It reports an error:

start
open
<HDF5 file "FAQ32498_pass_09083b73_71.fast5" (mode r)>
can access_h5.attrs
<Attributes of HDF5 object at 139974581599904>
is not multi-read file type
is not len 0
Traceback (most recent call last):
  File "test.py", line 40, in <module>
    if any(key for key in _h5 if key.startswith("read_")):
  File "test.py", line 40, in <genexpr>
    if any(key for key in _h5 if key.startswith("read_")):
  File "/home/user/ydliu/miniconda3/envs/remora/lib/python3.8/site-packages/h5py/_hl/group.py", line 499, in __iter__
    for x in self.id.__iter__():
  File "h5py/h5g.pyx", line 128, in h5py.h5g.GroupIter.__next__
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5l.pyx", line 316, in h5py.h5l.LinkProxy.iterate
RuntimeError: Link iteration failed (incorrect metadata checksum after all read attempts)


HalfPhoton commented on June 16, 2024

Hi @HITzhongyu ,

It does appear that your fast5 file is corrupt. This is the same issue as seen here: megalodon#279

I'm not sure what we can do other than recommend that you check your files and drop those that are corrupt before continuing with pod5 convert. [Edit: subset -> convert]

Apologies that we don't have a better solution.

Kind regards,
Rich


HITzhongyu commented on June 16, 2024

Hi @HalfPhoton

Thank you very much for your patient explanation.

I have another question: if the Fast5 data is corrupted, why is there no issue during Guppy processing, but problems arise specifically with pod5?

Regarding this issue, could you add a filtering step before pod5 converts, skipping any damaged Fast5 files that are recognized as single-read Fast5, without affecting the rest of the run? If there are only a few such damaged files, it should not impact the results of large-scale methylation detection.

Or, if it's convenient for you, could you please let me know which part of the code needs to be modified? I can make the changes on my end.

Kind regards,
Zhongyu


HalfPhoton commented on June 16, 2024

@HITzhongyu

pod5 convert will try to ignore bad fast5 files unless --strict is set. We removed the up-front fast5 checking because it was so slow.

In your case, the files are causing a prompt segfault which kills the worker process immediately instead of allowing it to handle the error gracefully. This is an issue with h5py.

There are potential changes we could make to how we handle dead workers, which we might investigate.

As for how Guppy can handle this file when pod5 cannot; I'm not sure, but Guppy is not using python / h5py which is where I believe the issue is caused.
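
As a rough workaround sketch, not part of pod5 itself: one way to pre-filter files is to probe each fast5 in a child process, so that a segfault inside the HDF5 library only kills the probe. The helper names here are hypothetical and the snippet assumes h5py is installed:

```python
import multiprocessing as mp
from pathlib import Path

def _probe(path_str: str) -> None:
    # Imported in the child so the parent process never touches HDF5.
    import h5py
    with h5py.File(path_str, "r") as h5:
        # Iterating the root group is what triggered the crash above.
        for _ in h5:
            pass

def is_readable_fast5(path: Path, timeout: float = 60.0) -> bool:
    """True if the probe process reads the file without crashing or hanging."""
    proc = mp.Process(target=_probe, args=(str(path),))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()
        proc.join()
        return False
    # exitcode 0 = clean read; negative = killed by a signal (e.g. SIGSEGV)
    return proc.exitcode == 0

if __name__ == "__main__":
    good = [p for p in Path("test").glob("*.fast5") if is_readable_fast5(p)]
    print(f"{len(good)} readable fast5 files")
```

This check is slow (one process per file), which is consistent with why the up-front checking was removed from pod5 convert, so it is probably best reserved for files that have already caused trouble.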

Kind regards,
Rich

Edit: subset -> convert


HITzhongyu commented on June 16, 2024

@HalfPhoton

Thank you very much for your patient explanation!

Kind regards,
Zhongyu


HITzhongyu commented on June 16, 2024

Hi @HalfPhoton
I found that pod5 subset checks pod5 files, not fast5:

usage: pod5 subset [-h] [-o OUTPUT] [-r] [-f] [-t THREADS] [--csv CSV]
                   [-s TABLE] [-R READ_ID_COLUMN] [-c COLUMNS [COLUMNS ...]]
                   [--template TEMPLATE] [-T] [-M] [-D]
                   inputs [inputs ...]

Given one or more pod5 input files, take subsets of reads into one or more pod5 output files by a user-supplied mapping.

positional arguments:
  inputs                Pod5 filepaths to use as inputs

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Destination directory to write outputs (default:
                        /home/user/ydliu/hitbic/HG002)
  -r, --recursive       Search for input files recursively matching `*.pod5`
                        (default: False)
  -f, --force-overwrite
                        Overwrite destination files (default: False)
  -t THREADS, --threads THREADS
                        Number of subsetting workers (default: 8)

direct mapping:
  --csv CSV             CSV file mapping output filename to read ids (default:
                        None)

table mapping:
  -s TABLE, --summary TABLE, --table TABLE
                        Table filepath (csv or tsv) (default: None)
  -R READ_ID_COLUMN, --read-id-column READ_ID_COLUMN
                        Name of the read_id column in the summary (default:
                        read_id)
  -c COLUMNS [COLUMNS ...], --columns COLUMNS [COLUMNS ...]
                        Names of --summary / --table columns to subset on
                        (default: None)
  --template TEMPLATE   template string to generate output filenames (e.g.
                        "mux-{mux}_barcode-{barcode}.pod5"). default is to
                        concatenate all columns to values as shown in the
                        example. (default: None)
  -T, --ignore-incomplete-template
                        Suppress the exception raised if the --template string
                        does not contain every --columns key (default: None)

content settings:
  -M, --missing-ok      Allow missing read_ids (default: False)
  -D, --duplicate-ok    Allow duplicate read_ids (default: False)

Example: pod5 subset inputs.pod5 --output subset_mux/ --summary summary.tsv --columns mux


HalfPhoton commented on June 16, 2024

Sorry, my error. I meant to say pod5 convert not pod5 subset when explaining the --strict option above.


HalfPhoton commented on June 16, 2024

Are you happy with the solution, @HITzhongyu? Can we close this issue?

