
helen's Introduction

POLISHER UPDATE: P.E.P.P.E.R.

We have released a new polisher, PEPPER, that replaces MarginPolish-HELEN. If you have newer data (basecalled with Guppy >= 3.0.5), please use PEPPER instead of MarginPolish-HELEN. PEPPER is fully supported by our team.

H.E.L.E.N.

H.E.L.E.N. (Homopolymer Encoded Long-read Error-corrector for Nanopore)

HELEN is published in Nature Biotechnology: Shafin, K., Pesout, T., Lorig-Roach, R., et al., "Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes," Nature Biotechnology 38 (2020).


Overview

HELEN uses a Recurrent Neural Network (RNN)-based Multi-Task Learning (MTL) model that predicts a base and a run-length for each genomic position from the weights generated by MarginPolish.

© 2020 Kishwar Shafin, Trevor Pesout, Benedict Paten.
Computational Genomics Lab (CGL), University of California, Santa Cruz.

Why MarginPolish-HELEN?

  • MarginPolish-HELEN outperforms other graph-based and neural-network-based polishing pipelines.
  • Simple installation steps.
  • HELEN can use multiple GPUs at the same time.
  • Highly optimized pipeline that is faster than other available polishing tools.
  • We have sequenced, assembled, and polished 11 samples to ensure robustness, runtime consistency, and cost efficiency.
  • We tested GPU usage on Amazon Web Services (AWS) and Google Cloud Platform (GCP) to ensure scalability.
  • Open source (MIT License).

Walkthrough

Installation

MarginPolish-HELEN is supported on Ubuntu 16.10/18.04 and other Linux-based systems.

Install prerequisites

Before you follow any of the methods, make sure you install all the dependencies:

sudo apt-get -y install git cmake make gcc g++ autoconf bzip2 lzma-dev zlib1g-dev \
libcurl4-openssl-dev libpthread-stubs0-dev libbz2-dev liblzma-dev libhdf5-dev \
python3-pip python3-virtualenv virtualenv

Method 1: Install MarginPolish-HELEN from GitHub

You can install from the GitHub repository:

git clone https://github.com/kishwarshafin/helen.git
cd helen
make install
. ./venv/bin/activate

helen --help
marginpolish --help

Each time you want to use it, activate the virtualenv:

. <path/to/helen/venv/bin/activate>

Method 2: Install using PyPi

Install the prerequisites, then install MarginPolish-HELEN using pip:

python3 -m pip install helen --user

python3 -m helen.helen --help
python3 -m helen.marginpolish --help

Update the installed version:

python3 -m pip install --upgrade pip
python3 -m pip install helen --upgrade

You can also add module locations to path:

echo 'export PATH="$(python3 -m site --user-base)/bin":$PATH' >> ~/.bashrc
source ~/.bashrc

marginpolish --help
helen --help

Method 3: Use docker image

CPU based docker:

# SEE CONFIGURATION
docker run --rm -it --ipc=host kishwars/helen:latest helen --help
docker run --rm -it --ipc=host kishwars/helen:latest marginpolish --help

# RUN HELEN
docker run -it --ipc=host --user=`id -u`:`id -g` --cpus="16" \
-v </directory/with/inputs_outputs>:/data kishwars/helen:latest \
helen --help

GPU based docker:
sudo apt-get install -y nvidia-docker2
# SEE CONFIGURATION
nvidia-docker run -it --ipc=host kishwars/helen:latest helen torch_stat
nvidia-docker run -it --ipc=host kishwars/helen:latest helen --help
nvidia-docker run -it --ipc=host kishwars/helen:latest marginpolish --help

# RUN HELEN
nvidia-docker run -it --ipc=host --user=`id -u`:`id -g` --cpus="16" \
-v </directory/with/inputs_outputs>:/data kishwars/helen:latest \
helen --help
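
As a concrete sketch, a GPU-accelerated polishing run inside the container might look like the following. The mounted directory, file names, and the HELEN_r941_guppy344_human.pkl model name are placeholders (that model name appears elsewhere on this page); the helen flags are the ones documented in the Usage section below.

# RUN HELEN WITH GPU (paths and model name are placeholders)
nvidia-docker run -it --ipc=host --user=`id -u`:`id -g` \
-v </directory/with/inputs_outputs>:/data kishwars/helen:latest \
helen polish \
--image_dir /data/marginpolish_images \
--model_path /data/HELEN_r941_guppy344_human.pkl \
--batch_size 256 \
--num_workers 4 \
--threads 16 \
--output_dir /data/helen_output \
--output_prefix polished_assembly.fa \
--gpu_mode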

Usage

MarginPolish requires a draft assembly and a mapping of reads to the draft assembly. We recommend using Shasta as the initial assembler and minimap2 for the mapping.

Step 1: Generate an initial assembly

Generate an assembly with an ONT assembler; we recommend Shasta.

Step 2: Create an alignment between the reads and the Shasta assembly

We recommend using minimap2 to generate the mapping between the reads and the assembly. You don't have to follow these exact commands.

minimap2 -ax map-ont -t 32 shasta_assembly.fa reads.fq | samtools view -hb -F 0x904 > unsorted.bam;
samtools sort -@32 -o reads_2_assembly.0x904.bam unsorted.bam;
samtools index -@32 reads_2_assembly.0x904.bam
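
Before generating images, it can help to confirm that the sorted BAM is intact and indexed. The two subcommands below are standard samtools functionality, not part of MarginPolish-HELEN:

samtools quickcheck reads_2_assembly.0x904.bam && echo "BAM OK"
samtools flagstat reads_2_assembly.0x904.bam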

Step 3: Generate images using MarginPolish

Download the models:

helen download_models \
--output_dir <path/to/mp_helen_models/>

Run MarginPolish:

You can generate images using MarginPolish by running:

marginpolish reads_2_assembly.bam \
Assembly.fa \
</path/to/model_name.json> \
-t <number_of_threads> \
-o <path/to/marginpolish_images> \
-f

The MarginPolish model (the .json file used above) and the HELEN model (the .pkl file used in the next step) are included in the directory created by the download_models command.
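
For reference, a filled-in version of the command above using the file names from the earlier steps; the model file name is one that appears elsewhere on this page and is only an example — pick the model matching your basecaller from the downloaded directory:

marginpolish reads_2_assembly.0x904.bam \
shasta_assembly.fa \
mp_helen_models/MP_r941_guppy344_human.json \
-t 32 \
-o marginpolish_images/mp_image \
-f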

Step 4: Run HELEN

Next, run HELEN to polish the assembly using an RNN.

helen polish \
--image_dir </path/to/marginpolish_images/> \
--model_path </path/to/model.pkl> \
--batch_size 256 \
--num_workers 4 \
--threads <num_of_threads> \
--output_dir </path/to/output_dir> \
--output_prefix <output_filename.fa> \
--gpu_mode

If you are using CPUs then remove the --gpu_mode argument.
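
For example, a CPU-only run with a smaller batch size might look like this; all paths and the model file name are placeholders for your own files:

helen polish \
--image_dir marginpolish_images/ \
--model_path mp_helen_models/HELEN_r941_guppy344_human.pkl \
--batch_size 128 \
--num_workers 4 \
--threads 16 \
--output_dir helen_output/ \
--output_prefix polished_assembly.fa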

Help

Please open a GitHub issue if you face any difficulties.

Acknowledgement

We are thankful to Sergey Koren and Karen Miga for their help with the CHM13 data and evaluation.

We downloaded data from the Telomere-to-Telomere (T2T) consortium to evaluate our pipeline against CHM13.

We acknowledge the work of the developers of the packages this pipeline depends on.

Fun Fact

The name "HELEN" is inspired from the A.I. created by Tony Stark in the Marvel Comics (Earth-616). HELEN was created to control the city Tony was building named "Troy" making the A.I. "HELEN of Troy".

READ MORE: HELEN

© 2020 Kishwar Shafin, Trevor Pesout, Benedict Paten.

helen's People

Contributors

cgjosephlee, esrice, kishwarshafin, tpesout


helen's Issues

call_consensus.py fails to run

I can't run python3 call_consensus.py; I get the following message:

Traceback (most recent call last):
  File "./call_consensus.py", line 3, in <module>
    from modules.python.TextColor import TextColor
ImportError: No module named 'modules.python'

Can anyone help me?
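
A likely cause of this ImportError is invoking the script from outside the repository root, so Python cannot find the modules package. A hedged workaround, assuming the clone lives in ~/helen:

cd ~/helen
python3 call_consensus.py --help
# or make the package importable from anywhere
export PYTHONPATH=~/helen:$PYTHONPATH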

Issue on stitch.py

Hello!

I have a problem with stitch.py. When I use the script with the following command A), I get the error B). Do you have any suggestions to resolve the issue? Thanks in advance!

A)
python3.6 /home/nozawa/Software/helen/stitch.py -t 16
-i /home/nozawa/Data/pal/MinION/MarginPolish_HELEN/consensus/HELEN_prediction.hdf
-o /home/nozawa/Data/pal/MinION/MarginPolish_HELEN/consensus

B)
Traceback (most recent call last):
  File "/home/nozawa/Software/helen/stitch.py", line 5, in <module>
    from modules.python.Stitch import Stitch
  File "/home/nozawa/Software/helen/modules/python/Stitch.py", line 10, in <module>
    from build import HELEN
ImportError: cannot import name 'HELEN'
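
The build package is the compiled C++ extension, so this error usually means the checkout was never built. A hedged sketch, assuming the make install target from the installation section above applies to this checkout:

cd /home/nozawa/Software/helen
make install                # builds the C++ extension used by "from build import HELEN"
. ./venv/bin/activate
python3 stitch.py --help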

marginPolish option

Hi,

Just compiled marginPolish and Helen according to your installation tutorial and ran it on a test dataset.

The marginPolish images are not being created.
I get the FASTA output but not the images (hdf?).

Also, when looking at the marginPolish options, I don't get the same options as are posted on the GitHub repo (the -f parameter is missing).

Mine shows:

./marginPolish
usage: marginPolish <BAM_FILE> <ASSEMBLY_FASTA> <PARAMS> [options]
Version: 1.0.0

Polishes the ASSEMBLY_FASTA using alignments in BAM_FILE.

Required arguments:
    BAM_FILE is the alignment of reads to the assembly (or reference).
    ASSEMBLY_FASTA is the reference sequence BAM file in fasta format.
    PARAMS is the file with marginPolish parameters.

Default options:
    -h --help                : Print this help screen
    -a --logLevel            : Set the log level [default = info]
    -t --threads             : Set number of concurrent threads [default = 1]
    -o --outputBase          : Name to use for output files [default = 'output']
    -r --region              : If set, will only compute for given chromosomal region.
                                 Format: chr:start_pos-end_pos (chr3:2000-3000).

Miscellaneous supplementary output options:
    -i --outputRepeatCounts  : Output base to write out the repeat counts [default = NULL]
    -j --outputPoaTsv        : Output base to write out the poa as TSV file [default = NULL]


Theirs is:

marginPolish <BAM_FILE> <ASSEMBLY_FASTA> <PARAMS> [options] 

Polishes the ASSEMBLY_FASTA using alignments in BAM_FILE.

Required arguments:
    BAM_FILE is the alignment of reads to the assembly (or reference).
    ASSEMBLY_FASTA is the reference sequence BAM file in fasta format.
    PARAMS is the file with marginPolish parameters.

Default options:
    -h --help                : Print this help screen
    -a --logLevel            : Set the log level [default = info]
    -t --threads             : Set number of concurrent threads [default = 1]
    -o --outputBase          : Name to use for output files [default = 'output']
    -r --region              : If set, will only compute for given chromosomal region.
                                 Format: chr:start_pos-end_pos (chr3:2000-3000).

HELEN feature generation options:
    -f --produceFeatures     : output features for HELEN.
    -F --featureType         : output features of chunks for HELEN.  Valid types:
                                 splitRleWeight:  [default] run lengths split into chunks
                                 nuclAndRlWeight: split into nucleotide and run length (RL across nucleotides)
                                 rleWeight:       weighted likelihood from POA nodes (RLE)
                                 simpleWeight:    weighted likelihood from POA nodes (non-RLE)
    -L --splitRleWeightMaxRL : max run length (for 'splitRleWeight' type only) [default = 10]
    -u --trueReferenceBam    : true reference aligned to ASSEMBLY_FASTA, for HELEN
                               features.  Setting this parameter will include labels
                               in output.

Miscellaneous supplementary output options:
    -i --outputRepeatCounts  : Output base to write out the repeat counts [default = NULL]
    -j --outputPoaTsv        : Output base to write out the poa as TSV file [default = NULL]

I am missing the whole set of HELEN feature generation options.
Do you have a docker image which I could use?

Thanks,
Michel
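
The missing "HELEN feature generation options" suggest an older marginPolish build. The docker image from the installation section ships a build that includes -f/--produceFeatures; a hedged example, with the mount path as a placeholder:

docker pull kishwars/helen:latest
docker run -it --ipc=host --user=`id -u`:`id -g` \
-v </directory/with/inputs_outputs>:/data kishwars/helen:latest \
marginpolish --help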

h5 error

Hi,

I run into the following error when running helen (CPU-only installation) on a Flye assembly:

python3 ~/tools/helen/call_consensus.py -i . -m /net/fs-1/home01/michelmo/tools/helen/r941_flip231_v001.pkl
INFO: OUTPUT DIRECTORY: ./output/
INFO: TORCH THREADS SET TO: 1.
Loading data
Traceback (most recent call last):
  File "/mnt/users/michelmo/tools/helen/call_consensus.py", line 133, in <module>
    FLAGS.gpu_mode)
  File "/mnt/users/michelmo/tools/helen/call_consensus.py", line 53, in polish_genome
    predict(image_filepath, output_filename, model_path, batch_size, num_workers, threads, gpu_mode)
  File "/net/fs-1/home01/michelmo/tools/helen/modules/python/models/predict.py", line 61, in predict
    test_data = SequenceDataset(test_file)
  File "/net/fs-1/home01/michelmo/tools/helen/modules/python/models/dataloader_predict.py", line 35, in __init__
    with h5py.File(hdf5_file_path, 'r') as hdf5_file:
  File "/mnt/users/michelmo/.conda/envs/HELEN/lib/python3.7/site-packages/h5py/_hl/files.py", line 394, in __init__
    swmr=swmr)
  File "/mnt/users/michelmo/.conda/envs/HELEN/lib/python3.7/site-packages/h5py/_hl/files.py", line 170, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 85, in h5py.h5f.open
OSError: Unable to open file (file signature not found)

I suspect that the h5 files are somehow corrupted (some are empty), but that's difficult for me to assess. I expected the total size of the images to be larger than the 5.9 MB I got for a 2.5 Gb genome.

total 5.9M
420K mGimageRainbowtrout.T00.h5  212K mGimageRainbowtrout.T11.h5  316K mGimageRainbowtrout.T22.h5
212K mGimageRainbowtrout.T01.h5     0 mGimageRainbowtrout.T12.h5     0 mGimageRainbowtrout.T23.h5
212K mGimageRainbowtrout.T02.h5     0 mGimageRainbowtrout.T13.h5  212K mGimageRainbowtrout.T24.h5
   0 mGimageRainbowtrout.T03.h5  212K mGimageRainbowtrout.T14.h5  212K mGimageRainbowtrout.T25.h5
420K mGimageRainbowtrout.T04.h5  212K mGimageRainbowtrout.T15.h5  420K mGimageRainbowtrout.T26.h5
212K mGimageRainbowtrout.T05.h5     0 mGimageRainbowtrout.T16.h5     0 mGimageRainbowtrout.T27.h5
212K mGimageRainbowtrout.T06.h5     0 mGimageRainbowtrout.T17.h5     0 mGimageRainbowtrout.T28.h5
   0 mGimageRainbowtrout.T07.h5  212K mGimageRainbowtrout.T18.h5  212K mGimageRainbowtrout.T29.h5
420K mGimageRainbowtrout.T08.h5  212K mGimageRainbowtrout.T19.h5  212K mGimageRainbowtrout.T30.h5
   0 mGimageRainbowtrout.T09.h5  420K mGimageRainbowtrout.T20.h5  212K mGimageRainbowtrout.T31.h5
420K mGimageRainbowtrout.T10.h5  212K mGimageRainbowtrout.T21.h5

MarginPolish was run with default settings:

/marginPolish $BAM \
$ASM \
/net/fs-1/home01/michelmo/tools/marginPolish/params/allParams.np.human.guppy-ff-233.json \
-t 32 \
-o mGimageRainbowtrout \
-f 2>&1 | tee mG.log

Any ideas or hints about what could have gone wrong would be appreciated.

Michel
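
One quick way to narrow this down is to list the image files that were written but are empty, and to check that the non-empty ones have a readable HDF5 header. h5ls comes from the hdf5-tools package and is not part of MarginPolish-HELEN:

# list empty image files
find . -maxdepth 1 -name "mGimageRainbowtrout.T*.h5" -empty

# flag files that are empty or cannot be opened as HDF5
for f in mGimageRainbowtrout.T*.h5; do
    h5ls "$f" > /dev/null 2>&1 || echo "unreadable: $f"
done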

helen polish UserWarning on batch size

I am running helen polish and received the warning message below. Is this a harmless warning, or should I do something about it?

INFO: POLISH MODULE SELECTED
INFO: RUN-ID: 04112022_102154
INFO: PREDICTION OUTPUT DIRECTORY: /HELEN/predictions_04112022_102154
INFO: CALL CONSENSUS STARTING
INFO: OUTPUT FILE: /HELEN/predictions_04112022_102154/output_AngusONTpolish.fa
INFO: MODEL LOADING TO ONNX
INFO: SAVING MODEL TO ONNX
/opt/conda/lib/python3.7/site-packages/torch/onnx/symbolic_opset9.py:1436: UserWarning: Exporting a model to ONNX with a batch_size other than 1, with a variable lenght with GRU can cause an error when running the ONNX model with a different batch size. Make sure to save the model with a batch size of 1, or define the initial states (h0/c0) as inputs of the model.
"or define the initial states (h0/c0) as inputs of the model. ")
INFO: TORCH THREADS SET TO: 4.

HAC models

Dear HELEN developers,

When running MarginPolish with the allParams.np.human.guppy-ff-235.json model, I get a Calloc error.

udocker run -v /mnt/SCRATCH/michelmo/Projects/MudMinnow/Nhub_guppy305_flye10K:/data mGPolish reads_2_assembly.bam assembly.fasta allParams.np.human.guppy-ff-235.json -t 32 -o /mnt/SCRATCH/michelmo/Projects/MudMinnow/Nhub_guppy305_flye10K/mG305 -f

 ******************************************************************************
 *                                                                            *
 *               STARTING 137351bb-4e04-3309-9bf5-ae016625cef7                *
 *                                                                            *
 ******************************************************************************
 executing: sh
Set log level to INFO
Running OpenMP with 32 threads.
> Parsing model parameters from file: allParams.np.human.guppy-ff-235.json
Calloc failed with request for -2 lots of 16 bytes
Command exited with non-zero status 1

DEBUG_MAX_MEM:4608
DEBUG_RUNTIME:0:00.06

The program runs if using another model:

udocker run -v /mnt/SCRATCH/michelmo/Projects/MudMinnow/Nhub_guppy305_flye10K:/data mGPolish reads_2_assembly.bam assembly.fasta allParams.np.human.guppy-ff-233.json -t 32 -o /mnt/SCRATCH/michelmo/Projects/MudMinnow/Nhub_guppy305_flye10K/mG305 -f

 ******************************************************************************
 *                                                                            *
 *               STARTING 137351bb-4e04-3309-9bf5-ae016625cef7                *
 *                                                                            *
 ******************************************************************************
 executing: sh
Set log level to INFO
Running OpenMP with 32 threads.
> Parsing model parameters from file: allParams.np.human.guppy-ff-233.json
> Parsing reference sequences from file: assembly.fasta
> Going to write polished reference in : /mnt/SCRATCH/michelmo/Projects/MudMinnow/Nhub_guppy305_flye10K/mG305.fa
...

Is the 235 model file corrupted?

Also, I saw that your latest polishing model is named guppy 2.3.5.
Is it trained on the HAC configuration files?

We are currently using PromethION data basecalled with the HAC models of Guppy 3.0.5 provided by ONT, and I wonder which model would fit the data best.

model files used for basecalling:

md5sum dna_r9.4.1_450bps_hac_prom.cfg   c9dc5f42f63c005085ed89e4094e0bb4
md5sum template_r9.4.1_450bps_hac_prom.jsn     6ee479f9ae82a7d26cb47bd24a7882fd

Maybe it would be more accurate to name the models after the basecalling models they were trained for instead of Guppy versions?

Thanks,
michel
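
The failure during parameter parsing can point to a truncated or corrupted JSON file rather than a problem with the reads. A hedged first check is to re-download the published models into a clean directory and retry with the fresh copy (file names inside the new directory may differ from the old allParams.* naming):

helen download_models --output_dir fresh_models/
ls -l fresh_models/

If the freshly downloaded parameter file parses cleanly, the original copy was most likely damaged during download.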

MarginPolish

Request for MarginPolish to either:

  1. Be able to run Docker without sudo.
  2. Run Docker with sudo but inside Singularity.
  3. Fix the binary so it does not segfault.

Any of those 3 options would be great. I'm not sure what you need in terms of system configuration, but I'll provide you with some basics on my primary test system, and you can let me know if you need more:

O/S CentOS v. 7.6
Dual Intel(R) Xeon(R) CPU E5-2640 v2 CPUs
256GB RAM
GCC v. 4.8.5 default compiler, but other compilers are available
Using Environment modules system
Cmake 3.11
O/S repos include CentOS 7 Basic, Plus, and EPEL

We have a mixture of systems, but the configuration above is pretty typical. On the university HPC, they use SLURM for resource management. On our primary lab servers we can run in standard user mode, or using Torque/PBS. All of my tests have been performed running outside of a resource management system.

Let me know what else you might need.

Thanks,
John
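
Two common routes for running the published image without sudo, both standard Docker/Singularity practice rather than anything specific to MarginPolish-HELEN:

# option 1: allow your user to run docker without sudo (requires re-login)
sudo usermod -aG docker $USER

# option 2: convert the image to a Singularity container and run it unprivileged
singularity pull helen_latest.sif docker://kishwars/helen:latest
singularity exec helen_latest.sif marginpolish --help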

Question about model's training data

Hi:
MarginPolish && HELEN is such an excellent pipeline for polishing ONT assembly, which is easy to run and has very high accuracy. I am using the latest model to polishing some human data. I wonder what data do you use to train the model MP_r941_guppy344_human.json and HELEN_r941_guppy344_human.pkl. The training datasets of this two models were not mentioned in the paper. Which specie and which chromosome is used, HG002, CHM13 or HG00733 and chr1-6 or chr1-19, chr21-22?

Neng

Only polish sequences of interest

Hi, I used Shasta to assemble nanopore sequencing data. I have hundreds of assembled contigs, and I am only interested in the sequences that do not appear in the reference genome (NRS). I want to know whether I can use MarginPolish-HELEN to polish only the NRS I extracted from the assembly instead of the entire assembly FASTA file.
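
One way to do this is to extract the contigs of interest into their own FASTA, re-map the reads against that subset, and run MarginPolish-HELEN on it. A hedged sketch, where nrs_ids.txt (one contig name per line) and the file names are placeholders:

# pull the NRS contigs out of the full assembly
samtools faidx shasta_assembly.fa
xargs samtools faidx shasta_assembly.fa < nrs_ids.txt > nrs_contigs.fa

# re-map the reads to the subset and continue with Steps 2-4 above
minimap2 -ax map-ont -t 32 nrs_contigs.fa reads.fq | \
samtools view -hb -F 0x904 > nrs_unsorted.bam
samtools sort -@32 -o reads_2_nrs.bam nrs_unsorted.bam
samtools index -@32 reads_2_nrs.bam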

Models for Guppy 3.6

Hi,

I really like your polishing pipeline and it gives great results so far.
Last week a new and improved version of Guppy with boosted accuracy was released. Are you planning to provide models for this version of Guppy and if so, when can we expect these?

Thank you,
Dominik

Plant species

Hi Kishwar,
Can I use this polisher on a complex plant genome?
Does the model trained on human sequencing data work for plant species?

Thank you.

Jolvii

MarginPolish output zero .fa file

Dear author,
Thanks for your great assembly tool Shasta and the polishing tools MarginPolish and HELEN. I have used Shasta to assemble a genome and generated the Assembly.fasta file.
Next, I tried to use HELEN to polish the genome. I generated the .bam file with minimap2 and indexed it with samtools. However, the MarginPolish step produces an empty FASTA file, output.fa. The log is as follows.

Running OpenMP with 2 threads.
> Parsing model parameters from file:  ./helen_model/MP_r941_guppy344_human.json
> Parsing reference sequences from file: Assembly.fasta
> Going to write polished reference in : margin_image/output.fa
> Set up bam chunker with chunk size 5000 and overlap 50 (for region=all), resulting in 538336 total chunks
Warning! ***HDF5 library version mismatched error***
The HDF5 header files used to compile this application do not match
the version used by the HDF5 library to which this application is linked.
Data corruption or segmentation faults may occur if the application continues.
This can happen when an application was compiled by one version of HDF5 but
linked with a different version of static or shared HDF5 library.
You should recompile the application or check your shared library related
settings such as 'LD_LIBRARY_PATH'.
You can, at your own risk, disable this warning by setting the environment
variable 'HDF5_DISABLE_VERSION_CHECK' to a value of '1'.
Setting it to 2 or higher will suppress the warning messages totally.
Headers are 1.8.12, library is 1.8.11
	    SUMMARY OF THE HDF5 CONFIGURATION
	    =================================

General Information:
-------------------
		   HDF5 Version: 1.8.11
		  Configured on: Wed May  8 16:20:56 CDT 2013
		  Configured by: hdftest@koala
		 Configure mode: production
		    Host system: x86_64-unknown-linux-gnu
	      Uname information: Linux koala 2.6.18-348.1.1.el5 #1 SMP Tue Jan 22 16:19:19 EST 2013 x86_64 x86_64 x86_64 GNU/Linux
		       Byte sex: little-endian
		      Libraries: static, shared
	     Installation point: /mnt/scr1/pre-release/hdf5/v1811/thg-builds/koala

Compiling Options:
------------------
               Compilation Mode: production
                     C Compiler: /usr/bin/gcc ( gcc (GCC) 4.1.2 20080704 )
                         CFLAGS: 
                      H5_CFLAGS: -std=c99 -pedantic -Wall -Wextra -Wundef -Wshadow -Wpointer-arith -Wbad-function-cast -Wcast-qual -Wcast-align -Wwrite-strings -Wconversion -Waggregate-return -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wredundant-decls -Wnested-externs -Winline -Wno-long-long -Wfloat-equal -Wmissing-format-attribute -Wmissing-noreturn -Wpacked -Wdisabled-optimization -Wformat=2 -Wunreachable-code -Wendif-labels -Wdeclaration-after-statement -Wold-style-definition -Winvalid-pch -Wvariadic-macros -Wnonnull -Winit-self -Wmissing-include-dirs -Wswitch-default -Wswitch-enum -Wunused-macros -Wunsafe-loop-optimizations -Wc++-compat -Wvolatile-register-var -O3 -fomit-frame-pointer -finline-functions
                      AM_CFLAGS: 
                       CPPFLAGS: 
                    H5_CPPFLAGS: -D_POSIX_C_SOURCE=199506L   -DNDEBUG -UH5_DEBUG_API
                    AM_CPPFLAGS: -I/mnt/hdf/packages/szip/shared/encoder/Linux2.6-x86_64-gcc/include -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -D_BSD_SOURCE 
               Shared C Library: yes
               Static C Library: yes
  Statically Linked Executables: yes
                        LDFLAGS: 
                     H5_LDFLAGS: 
                     AM_LDFLAGS:  -L/mnt/hdf/packages/szip/shared/encoder/Linux2.6-x86_64-gcc/lib
 	 	Extra libraries:  -lsz -lz -lrt -ldl -lm 
 		       Archiver: ar
 		 	 Ranlib: ranlib
 	      Debugged Packages: 
		    API Tracing: no

Languages:
----------
                        Fortran: yes
               Fortran Compiler: /usr/bin/gfortran ( GNU Fortran (GCC) 4.1.2 20080704 )
          Fortran 2003 Compiler: no
                  Fortran Flags: 
               H5 Fortran Flags:  
               AM Fortran Flags: 
         Shared Fortran Library: yes
         Static Fortran Library: yes

                            C++: yes
                   C++ Compiler: /usr/bin/g++ ( g++ (GCC) 4.1.2 20080704 )
                      C++ Flags: 
                   H5 C++ Flags:  
                   AM C++ Flags: 
             Shared C++ Library: yes
             Static C++ Library: yes

Features:
---------
                  Parallel HDF5: no
             High Level library: yes
                   Threadsafety: no
            Default API Mapping: v18
 With Deprecated Public Symbols: yes
         I/O filters (external): deflate(zlib),szip(encoder)
         I/O filters (internal): shuffle,fletcher32,nbit,scaleoffset
                            MPE: no
                     Direct VFD: no
                        dmalloc: no
Clear file buffers before write: yes
           Using memory checker: no
         Function Stack Tracing: no
                           GPFS: no
      Strict File Format Checks: no
   Optimization Instrumentation: no
       Large File Support (LFS): yes
Bye...

The command I used to run marginpolish is

marginpolish reads_2_assembly.0x904q60.bam Assembly.fasta $MODELDIR/MP_r941_guppy344_human.json -t 2 -o margin_image/output -f

Do you have any solutions to my issue?

Best
Xiaofei
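
The log itself points at the cause: the HDF5 headers (1.8.12) do not match the runtime library (1.8.11). The proper fix is to rebuild against a single HDF5 installation (for example the libhdf5-dev from the prerequisites section), but the warning also describes a temporary way to let the run proceed:

# temporary workaround described by the HDF5 warning above; use at your own risk
export HDF5_DISABLE_VERSION_CHECK=1
marginpolish reads_2_assembly.0x904q60.bam Assembly.fasta \
$MODELDIR/MP_r941_guppy344_human.json -t 2 -o margin_image/output -f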

How to train a new model with in-lab testing data

Our lab is doing research on some channel proteins, and the sequencing error profile seems different from the R9 pore, so we want to train MarginPolish/HELEN on these new data. Could you tell me how to do it? Thanks.

How to install helen in a gcc 4.8 environment?

The server I'm using runs RedHat with gcc 4.8.
I have spent more than one day installing helen, and the errors keep coming one after another. Has anybody met similar problems?
Please let me know. Thanks.

marginpolish docker stuck at 99%

Hi,

I am running your new docker container to streamline assembly polishing and ran into some trouble with marginPolish. It looks like MarginPolish is stalling at the very end.

 singularity run /net/cn-1/mnt/SCRATCH/michelmo/Projects/CONTAINERS/helen_latest20200519.sif  marginpolish ../SimonFlye27_15K.ONTremap.0x904.bam SimonFlye27_15K.fasta /net/cn-1/mnt/SCRATCH/mic
helmo/Projects/CONTAINERS/MP_r941_guppy344_human.json -t 64 -o . -f
Running OpenMP with 64 threads.
> Parsing model parameters from file: /net/cn-1/mnt/SCRATCH/michelmo/Projects/CONTAINERS/MP_r941_guppy344_human.json
> Parsing reference sequences from file: SimonFlye27_15K.fasta
> Going to write polished reference in : ./output.fa
> Set up bam chunker with chunk size 5000 and overlap 50 (for region=all), resulting in 546365 total chunks
> Polishing  1% complete (5623/546365).  Estimated time remaining: 31h 25m
> Polishing  2% complete (10934/546365).  Estimated time remaining: 25h 50m
> Polishing  3% complete (16427/546365).  Estimated time remaining: 23h 52m
> Polishing  4% complete (21903/546365).  Estimated time remaining: 22h 37m
> Polishing  5% complete (27374/546365).  Estimated time remaining: 22h 48m
> Polishing  6% complete (32813/546365).  Estimated time remaining: 22h 32m
> Polishing  7% complete (38250/546365).  Estimated time remaining: 22h 2m
> Polishing  8% complete (43711/546365).  Estimated time remaining: 21h 55m
> Polishing  9% complete (49186/546365).  Estimated time remaining: 22h 18m
> Polishing 10% complete (54652/546365).  Estimated time remaining: 22h 35m
> Polishing 11% complete (60114/546365).  Estimated time remaining: 22h 50m
> Polishing 12% complete (65596/546365).  Estimated time remaining: 22h 55m
> Polishing 13% complete (71045/546365).  Estimated time remaining: 22h 54m
> Polishing 14% complete (76500/546365).  Estimated time remaining: 22h 52m
> Polishing 15% complete (81977/546365).  Estimated time remaining: 22h 50m
> Polishing 16% complete (87432/546365).  Estimated time remaining: 22h 48m
> Polishing 17% complete (92924/546365).  Estimated time remaining: 22h 43m
.....
> Polishing 91% complete (497222/546365).  Estimated time remaining: 2h 41m
> Polishing 92% complete (502673/546365).  Estimated time remaining: 2h 23m
> Polishing 93% complete (508160/546365).  Estimated time remaining: 2h 5m
> Polishing 94% complete (513630/546365).  Estimated time remaining: 1h 47m
> Polishing 95% complete (519110/546365).  Estimated time remaining: 1h 29m
> Polishing 96% complete (524517/546365).  Estimated time remaining: 1h 12m
> Polishing 97% complete (530110/546365).  Estimated time remaining: 54m 3s
> Polishing 98% complete (535445/546365).  Estimated time remaining: 35m 59s
> Polishing 99% complete (541585/546365).  Estimated time remaining: 17m 57s

The H5 files have been created and written to, but no more writing has happened for the last few hours.

The process is still running but has been using only 1 thread for the last 5 hours.
Is this expected? Does marginpolish do some final wrap-up at the end which takes longer than expected?

368136 michelmo  20   0  121.9g 119.1g   1484 S 100.0  3.9 112466:10 marginPolish                                 

Thank you,
Michel

stitch.py ValueError triggered by contig sequence name

stitch.py throws a ValueError if one of my contigs is named as follows, but it works fine if I rename it to something like CLUS3951:

bc.1+2.clus.3951.fa.poa:1.0-7835.0

It is apparently trying to convert the 7835.0 at the end into an integer

File "stitch.py", line 93, in <module> process_marginpolish_h5py(FLAGS.sequence_hdf, FLAGS.output_dir, FLAGS.threads) File "stitch.py", line 58, in process_marginpolish_h5py consensus_sequence = stich_object.create_consensus_sequence(hdf_file_path, contig, chunk_keys, threads) File "modules/python/Stitch.py", line 280, in create_consensus_sequence sequence_chunk_key_list.append((contig, int(st), int(end))) ValueError: invalid literal for int() with base 10: '7835.0'

wrong script in example Usage in README.md

Dear,

In README.md Step 2 example script:

samtools sort -@ 32 unsorted.bam | samtools view > reads_2_assembly.0x904q60.bam

should be

samtools sort -@ 32 unsorted.bam >reads_2_assembly.0x904q60.bam

Best,
Jia

Add helen to bioconda.

Hi,

Can you add helen to bioconda? It is so difficult to install helen on a CentOS machine.

Best
Kun

margin docker run fail

hi,
I ran the marginPolish process (docker version) and it failed:
root@ecs-9875:/media/datarun/blnanodata/data# tail marginPolish.log
/usr/bin/time -f '\nDEBUG_MAX_MEM:%M\nDEBUG_RUNTIME:%E\n' /opt/MarginPolish/build/marginPolish reads_2_assembly.bam new.fasta allParams.np.human.guppy-ff-233.json -t 32 -o output/marginpolish_images -f

Running OpenMP with 32 threads.

Parsing model parameters from file: allParams.np.human.guppy-ff-233.json
Calloc failed with request for -2 lots of 16 bytes
Command exited with non-zero status 1

DEBUG_MAX_MEM:3836
DEBUG_RUNTIME:0:00.00

Can you help me fix it?

Exception: process 6 terminated with signal SIGKILL

Hello,

I am trying to run helen in polishing mode. Here is my command:

helen polish -i marginPolish_images -m helen_models/HELEN_r941_guppy344_microbial.pkl -o helen_polish/ -t 16

However, I face the following error:
Traceback (most recent call last):
  File "/lustre-gseg/software/bin/helen", line 33, in <module>
    sys.exit(load_entry_point('helen==0.0.23', 'console_scripts', 'helen')())
  File "/lustre-gseg/software/MarginPolish-HELEN/py36_venv/lib64/python3.6/site-packages/helen/helen.py", line 313, in main
    FLAGS.callers)
  File "/lustre-gseg/software/MarginPolish-HELEN/py36_venv/lib64/python3.6/site-packages/helen/modules/python/PolishInterface.py", line 87, in polish_genome
    callers)
  File "/lustre-gseg/software/MarginPolish-HELEN/py36_venv/lib64/python3.6/site-packages/helen/modules/python/CallConsensusInterface.py", line 153, in call_consensus
    callers, threads_per_caller, num_workers)
  File "/lustre-gseg/software/MarginPolish-HELEN/py36_venv/lib64/python3.6/site-packages/helen/modules/python/models/predict_cpu.py", line 248, in predict_cpu
    join=True)
  File "/lustre-gseg/software/MarginPolish-HELEN/py36_venv/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/lustre-gseg/software/MarginPolish-HELEN/py36_venv/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/lustre-gseg/software/MarginPolish-HELEN/py36_venv/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 108, in join
    (error_index, name)
Exception: process 6 terminated with signal SIGKILL

Please could you shed some light on this?

Many thanks in advance.
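
A worker terminated with SIGKILL during CPU prediction is most often the kernel's out-of-memory killer. A hedged mitigation, using only the flags documented in the Usage section above, is to lower the batch size and the number of parallel workers so each prediction process needs less memory:

helen polish \
-i marginPolish_images \
-m helen_models/HELEN_r941_guppy344_microbial.pkl \
-o helen_polish/ \
-t 8 \
--batch_size 64 \
--num_workers 1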

torch.set_num_threads error in docker image

Hi, I am trying to run the docker image and get a torch runtime error.
Is this an error in the docker image? Thanks!

INFO: POLISH MODULE SELECTED
INFO: RUN-ID: 09012020_134236
INFO: PREDICTION OUTPUT DIRECTORY: /.../helen_out/predictions_09012020_134236
INFO: CALL CONSENSUS STARTING
INFO: OUTPUT FILE: /.../helen_out/predictions_09012020_134236/265L12.cont.cor.fa
INFO: MODEL LOADING TO ONNX
Traceback (most recent call last):
File "/opt/conda/bin/helen", line 8, in
sys.exit(main())
File "/opt/conda/lib/python3.7/site-packages/helen/helen.py", line 313, in main
FLAGS.callers)
File "/opt/conda/lib/python3.7/site-packages/helen/modules/python/PolishInterface.py", line 87, in polish_genome
callers)
File "/opt/conda/lib/python3.7/site-packages/helen/modules/python/CallConsensusInterface.py", line 153, in call_consensus
callers, threads_per_caller, num_workers)
File "/opt/conda/lib/python3.7/site-packages/helen/modules/python/models/predict_cpu.py", line 248, in predict_cpu
join=True)
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/opt/conda/lib/python3.7/site-packages/helen/modules/python/models/predict_cpu.py", line 194, in setup
threads)
File "/opt/conda/lib/python3.7/site-packages/helen/modules/python/models/predict_cpu.py", line 65, in predict
torch.set_num_threads(threads)
RuntimeError: set_num_threads expects a positive integer
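
The error means torch.set_num_threads received a value that is not a positive integer, which (judging from the threads_per_caller variable in the traceback) suggests the thread count divided among the callers came out as zero. A hedged workaround is to pass an explicit, larger --threads value; the paths and model name below are placeholders:

docker run -it --ipc=host --user=`id -u`:`id -g` \
-v </directory/with/inputs_outputs>:/data kishwars/helen:latest \
helen polish \
--image_dir /data/marginpolish_images \
--model_path /data/HELEN_r941_guppy344_human.pkl \
--threads 16 \
--batch_size 128 \
--num_workers 4 \
--output_dir /data/helen_out \
--output_prefix polished.fa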

Problems when evaluating NA12878 chromosome 21 data

hi
I ran helen to polish the draft assembly of NA12878 chromosome 21, but there seem to be some problems in the polished results.
First I ran marginPolish to generate image features with the command:
marginPolish read2assembly.sort.bam ../assembly.fasta ~/tools/MarginPolish/params/allParams.np.human.r94-g235.json -o chr21_margin -t 60 -f
Second, I ran helen to generate a more accurate assembly with the command:
helen polish -i output_files/ -m ~/tools/helen/models/HELEN_r941_guppy344_human.pkl -b 512 -w 4 -t 60 -o helenPolish -p chr21_helen -g
After that, I used pomoxis to evaluate the error rate of the polished assembly.
[Two figures showing the pomoxis results for the marginPolish and helen outputs are not reproduced here.]

Neng
