
helen's Introduction

POLISHER UPDATE: P.E.P.P.E.R.

We have released a new polisher, PEPPER, that replaces MarginPolish-HELEN. If you have newer data (basecalled with Guppy >= 3.0.5), please use PEPPER instead of MarginPolish-HELEN. PEPPER is fully supported by our team.

H.E.L.E.N.

H.E.L.E.N. (Homopolymer Encoded Long-read Error-corrector for Nanopore)

HELEN is published in Nature Biotechnology: Shafin, K., Pesout, T., Lorig-Roach, R., et al., "Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes," Nature Biotechnology 38 (2020).


Overview

HELEN uses a Recurrent Neural Network (RNN)-based Multi-Task Learning (MTL) model that predicts a base and a run-length for each genomic position from the weights generated by MarginPolish.

© 2020 Kishwar Shafin, Trevor Pesout, Benedict Paten.
Computational Genomics Lab (CGL), University of California, Santa Cruz.

Why MarginPolish-HELEN?

  • MarginPolish-HELEN outperforms other graph-based and neural-network-based polishing pipelines.
  • Simple installation steps.
  • HELEN can use multiple GPUs at the same time.
  • Highly optimized pipeline that is faster than other available polishing tools.
  • We have sequenced, assembled, and polished 11 samples to ensure robustness, runtime consistency, and cost efficiency.
  • We tested GPU usage on Amazon Web Services (AWS) and Google Cloud Platform (GCP) to ensure scalability.
  • Open source (MIT License).

Walkthrough

Installation

MarginPolish-HELEN is supported on Ubuntu 16.10/18.04 and other Linux-based systems.

Install prerequisites

Before you follow any of the methods, make sure you install all the dependencies:

sudo apt-get -y install git cmake make gcc g++ autoconf bzip2 lzma-dev zlib1g-dev \
libcurl4-openssl-dev libpthread-stubs0-dev libbz2-dev liblzma-dev libhdf5-dev \
python3-pip python3-virtualenv virtualenv

Method 1: Install MarginPolish-HELEN from GitHub

You can install from the GitHub repository:

git clone https://github.com/kishwarshafin/helen.git
cd helen
make install
. ./venv/bin/activate

helen --help
marginpolish --help

Each time you want to use it, activate the virtualenv:

. <path/to/helen/venv/bin/activate>

Method 2: Install using PyPi

Install the prerequisites, then install MarginPolish-HELEN using pip:

python3 -m pip install helen --user

python3 -m helen.helen --help
python3 -m helen.marginpolish --help

Update the installed version:

python3 -m pip install --upgrade pip
python3 -m pip install helen --upgrade

You can also add module locations to path:

echo 'export PATH="$(python3 -m site --user-base)/bin":$PATH' >> ~/.bashrc
source ~/.bashrc

marginpolish --help
helen --help

Method 3: Use docker image

CPU based docker:

# SEE CONFIGURATION
docker run --rm -it --ipc=host kishwars/helen:latest helen --help
docker run --rm -it --ipc=host kishwars/helen:latest marginpolish --help

# RUN HELEN
docker run -it --ipc=host --user=`id -u`:`id -g` --cpus="16" \
-v </directory/with/inputs_outputs>:/data kishwars/helen:latest \
helen --help

GPU based docker:
sudo apt-get install -y nvidia-docker2
# SEE CONFIGURATION
nvidia-docker run -it --ipc=host kishwars/helen:latest helen torch_stat
nvidia-docker run -it --ipc=host kishwars/helen:latest helen --help
nvidia-docker run -it --ipc=host kishwars/helen:latest marginpolish --help

# RUN HELEN
nvidia-docker run -it --ipc=host --user=`id -u`:`id -g` --cpus="16" \
-v </directory/with/inputs_outputs>:/data kishwars/helen:latest \
helen --help
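
As a concrete sketch, a GPU-accelerated polishing run inside the container might look like the following. The mounted directory, file names, and the HELEN_r941_guppy344_human.pkl model name are placeholders (that model name appears elsewhere on this page); the helen flags are the ones documented in the Usage section below.

# RUN HELEN WITH GPU (paths and model name are placeholders)
nvidia-docker run -it --ipc=host --user=`id -u`:`id -g` \
-v </directory/with/inputs_outputs>:/data kishwars/helen:latest \
helen polish \
--image_dir /data/marginpolish_images \
--model_path /data/HELEN_r941_guppy344_human.pkl \
--batch_size 256 \
--num_workers 4 \
--threads 16 \
--output_dir /data/helen_output \
--output_prefix polished_assembly.fa \
--gpu_mode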

Usage

MarginPolish requires a draft assembly and a mapping of reads to the draft assembly. We recommend using Shasta as the initial assembler and minimap2 for the mapping.

Step 1: Generate an initial assembly

Generate an assembly with an ONT assembler; we recommend Shasta.

Step 2: Create an alignment between the reads and the Shasta assembly

We recommend using minimap2 to generate the mapping between the reads and the assembly. You don't have to follow these exact commands.

minimap2 -ax map-ont -t 32 shasta_assembly.fa reads.fq | samtools view -hb -F 0x904 > unsorted.bam;
samtools sort -@32 -o reads_2_assembly.0x904.bam unsorted.bam;
samtools index -@32 reads_2_assembly.0x904.bam
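
Before generating images, it can help to confirm that the sorted BAM is intact and indexed. The two subcommands below are standard samtools functionality, not part of MarginPolish-HELEN:

samtools quickcheck reads_2_assembly.0x904.bam && echo "BAM OK"
samtools flagstat reads_2_assembly.0x904.bam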

Step 3: Generate images using MarginPolish

Download the models:

helen download_models \
--output_dir <path/to/mp_helen_models/>

Run MarginPolish:

You can generate images using MarginPolish by running:

marginpolish reads_2_assembly.bam \
Assembly.fa \
</path/to/model_name.json> \
-t <number_of_threads> \
-o <path/to/marginpolish_images> \
-f

The MarginPolish model (the .json file used above) and the HELEN model (the .pkl file used in the next step) are included in the directory created by the download_models command.
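
For reference, a filled-in version of the command above using the file names from the earlier steps; the model file name is one that appears elsewhere on this page and is only an example — pick the model matching your basecaller from the downloaded directory:

marginpolish reads_2_assembly.0x904.bam \
shasta_assembly.fa \
mp_helen_models/MP_r941_guppy344_human.json \
-t 32 \
-o marginpolish_images/mp_image \
-f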

Step 4: Run HELEN

Next, run HELEN to polish the assembly using an RNN.

helen polish \
--image_dir </path/to/marginpolish_images/> \
--model_path </path/to/model.pkl> \
--batch_size 256 \
--num_workers 4 \
--threads <num_of_threads> \
--output_dir </path/to/output_dir> \
--output_prefix <output_filename.fa> \
--gpu_mode

If you are using CPUs then remove the --gpu_mode argument.
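
For example, a CPU-only run with a smaller batch size might look like this; all paths and the model file name are placeholders for your own files:

helen polish \
--image_dir marginpolish_images/ \
--model_path mp_helen_models/HELEN_r941_guppy344_human.pkl \
--batch_size 128 \
--num_workers 4 \
--threads 16 \
--output_dir helen_output/ \
--output_prefix polished_assembly.fa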

Help

Please open a GitHub issue if you face any difficulties.

Acknowledgement

We are thankful to Sergey Koren and Karen Miga for their help with the CHM13 data and evaluation.

We downloaded data from the Telomere-to-Telomere (T2T) consortium to evaluate our pipeline against CHM13.

We acknowledge the work of the developers of the packages this pipeline depends on.

Fun Fact

The name "HELEN" is inspired from the A.I. created by Tony Stark in the Marvel Comics (Earth-616). HELEN was created to control the city Tony was building named "Troy" making the A.I. "HELEN of Troy".

READ MORE: HELEN

© 2020 Kishwar Shafin, Trevor Pesout, Benedict Paten.

helen's People

Contributors

cgjosephlee, esrice, kishwarshafin, tpesout


helen's Issues

call_consensus.py fails to run

I can't run python3 call_consensus.py; I get the following message:

Traceback (most recent call last):
  File "./call_consensus.py", line 3, in <module>
    from modules.python.TextColor import TextColor
ImportError: No module named 'modules.python'

Can anyone help me?
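
A likely cause of this ImportError is invoking the script from outside the repository root, so Python cannot find the modules package. A hedged workaround, assuming the clone lives in ~/helen:

cd ~/helen
python3 call_consensus.py --help
# or make the package importable from anywhere
export PYTHONPATH=~/helen:$PYTHONPATH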

Issue on stitch.py

Hello!

I have a problem with stitch.py. When I use the script with the following command A), I get the error B). Do you have any suggestions to resolve the issue? Thanks in advance!

A)
python3.6 /home/nozawa/Software/helen/stitch.py -t 16
-i /home/nozawa/Data/pal/MinION/MarginPolish_HELEN/consensus/HELEN_prediction.hdf
-o /home/nozawa/Data/pal/MinION/MarginPolish_HELEN/consensus

B)
Traceback (most recent call last):
  File "/home/nozawa/Software/helen/stitch.py", line 5, in <module>
    from modules.python.Stitch import Stitch
  File "/home/nozawa/Software/helen/modules/python/Stitch.py", line 10, in <module>
    from build import HELEN
ImportError: cannot import name 'HELEN'
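
The build package is the compiled C++ extension, so this error usually means the checkout was never built. A hedged sketch, assuming the make install target from the installation section above applies to this checkout:

cd /home/nozawa/Software/helen
make install                # builds the C++ extension used by "from build import HELEN"
. ./venv/bin/activate
python3 stitch.py --help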

marginPolish option

Hi,

Just compiled marginPolish and Helen according to your installation tutorial and ran it on a test dataset.

The marginPolish images are not being created.
I get the FASTA output but not the images (hdf?).

Also, when looking at the marginPolish options, I don't get the same options as are posted on the GitHub repo (the -f parameter is missing).

Mine shows:

./marginPolish
usage: marginPolish <BAM_FILE> <ASSEMBLY_FASTA> <PARAMS> [options]
Version: 1.0.0

Polishes the ASSEMBLY_FASTA using alignments in BAM_FILE.

Required arguments:
    BAM_FILE is the alignment of reads to the assembly (or reference).
    ASSEMBLY_FASTA is the reference sequence BAM file in fasta format.
    PARAMS is the file with marginPolish parameters.

Default options:
    -h --help                : Print this help screen
    -a --logLevel            : Set the log level [default = info]
    -t --threads             : Set number of concurrent threads [default = 1]
    -o --outputBase          : Name to use for output files [default = 'output']
    -r --region              : If set, will only compute for given chromosomal region.
                                 Format: chr:start_pos-end_pos (chr3:2000-3000).

Miscellaneous supplementary output options:
    -i --outputRepeatCounts  : Output base to write out the repeat counts [default = NULL]
    -j --outputPoaTsv        : Output base to write out the poa as TSV file [default = NULL]


Theirs is:

marginPolish <BAM_FILE> <ASSEMBLY_FASTA> <PARAMS> [options] 

Polishes the ASSEMBLY_FASTA using alignments in BAM_FILE.

Required arguments:
    BAM_FILE is the alignment of reads to the assembly (or reference).
    ASSEMBLY_FASTA is the reference sequence BAM file in fasta format.
    PARAMS is the file with marginPolish parameters.

Default options:
    -h --help                : Print this help screen
    -a --logLevel            : Set the log level [default = info]
    -t --threads             : Set number of concurrent threads [default = 1]
    -o --outputBase          : Name to use for output files [default = 'output']
    -r --region              : If set, will only compute for given chromosomal region.
                                 Format: chr:start_pos-end_pos (chr3:2000-3000).

HELEN feature generation options:
    -f --produceFeatures     : output features for HELEN.
    -F --featureType         : output features of chunks for HELEN.  Valid types:
                                 splitRleWeight:  [default] run lengths split into chunks
                                 nuclAndRlWeight: split into nucleotide and run length (RL across nucleotides)
                                 rleWeight:       weighted likelihood from POA nodes (RLE)
                                 simpleWeight:    weighted likelihood from POA nodes (non-RLE)
    -L --splitRleWeightMaxRL : max run length (for 'splitRleWeight' type only) [default = 10]
    -u --trueReferenceBam    : true reference aligned to ASSEMBLY_FASTA, for HELEN
                               features.  Setting this parameter will include labels
                               in output.

Miscellaneous supplementary output options:
    -i --outputRepeatCounts  : Output base to write out the repeat counts [default = NULL]
    -j --outputPoaTsv        : Output base to write out the poa as TSV file [default = NULL]

I am missing the whole set of HELEN feature generation options.
Do you have a docker image which I could use?

Thanks,
Michel
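
The missing "HELEN feature generation options" suggest an older marginPolish build. The docker image from the installation section ships a build that includes -f/--produceFeatures; a hedged example, with the mount path as a placeholder:

docker pull kishwars/helen:latest
docker run -it --ipc=host --user=`id -u`:`id -g` \
-v </directory/with/inputs_outputs>:/data kishwars/helen:latest \
marginpolish --help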

h5 error

Hi,

I run into the following error when running helen (CPU-only installation) on a Flye assembly:

python3 ~/tools/helen/call_consensus.py -i . -m /net/fs-1/home01/michelmo/tools/helen/r941_flip231_v001.pkl
INFO: OUTPUT DIRECTORY: ./output/
INFO: TORCH THREADS SET TO: 1.
Loading data
Traceback (most recent call last):
  File "/mnt/users/michelmo/tools/helen/call_consensus.py", line 133, in <module>
    FLAGS.gpu_mode)
  File "/mnt/users/michelmo/tools/helen/call_consensus.py", line 53, in polish_genome
    predict(image_filepath, output_filename, model_path, batch_size, num_workers, threads, gpu_mode)
  File "/net/fs-1/home01/michelmo/tools/helen/modules/python/models/predict.py", line 61, in predict
    test_data = SequenceDataset(test_file)
  File "/net/fs-1/home01/michelmo/tools/helen/modules/python/models/dataloader_predict.py", line 35, in __init__
    with h5py.File(hdf5_file_path, 'r') as hdf5_file:
  File "/mnt/users/michelmo/.conda/envs/HELEN/lib/python3.7/site-packages/h5py/_hl/files.py", line 394, in __init__
    swmr=swmr)
  File "/mnt/users/michelmo/.conda/envs/HELEN/lib/python3.7/site-packages/h5py/_hl/files.py", line 170, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 85, in h5py.h5f.open
OSError: Unable to open file (file signature not found)

I suspect that the h5 files are somehow corrupted (some are empty), but that's difficult for me to assess. I expected the total size of the images to be larger than the 5.9 MB I got for a 2.5 Gb genome.

total 5.9M
420K mGimageRainbowtrout.T00.h5  212K mGimageRainbowtrout.T11.h5  316K mGimageRainbowtrout.T22.h5
212K mGimageRainbowtrout.T01.h5     0 mGimageRainbowtrout.T12.h5     0 mGimageRainbowtrout.T23.h5
212K mGimageRainbowtrout.T02.h5     0 mGimageRainbowtrout.T13.h5  212K mGimageRainbowtrout.T24.h5
   0 mGimageRainbowtrout.T03.h5  212K mGimageRainbowtrout.T14.h5  212K mGimageRainbowtrout.T25.h5
420K mGimageRainbowtrout.T04.h5  212K mGimageRainbowtrout.T15.h5  420K mGimageRainbowtrout.T26.h5
212K mGimageRainbowtrout.T05.h5     0 mGimageRainbowtrout.T16.h5     0 mGimageRainbowtrout.T27.h5
212K mGimageRainbowtrout.T06.h5     0 mGimageRainbowtrout.T17.h5     0 mGimageRainbowtrout.T28.h5
   0 mGimageRainbowtrout.T07.h5  212K mGimageRainbowtrout.T18.h5  212K mGimageRainbowtrout.T29.h5
420K mGimageRainbowtrout.T08.h5  212K mGimageRainbowtrout.T19.h5  212K mGimageRainbowtrout.T30.h5
   0 mGimageRainbowtrout.T09.h5  420K mGimageRainbowtrout.T20.h5  212K mGimageRainbowtrout.T31.h5
420K mGimageRainbowtrout.T10.h5  212K mGimageRainbowtrout.T21.h5

MarginPolish was run with default settings:

/marginPolish $BAM \
$ASM \
/net/fs-1/home01/michelmo/tools/marginPolish/params/allParams.np.human.guppy-ff-233.json \
-t 32 \
-o mGimageRainbowtrout \
-f 2>&1 | tee mG.log

Any ideas or hints about what could have gone wrong would be appreciated.

Michel
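
One quick way to narrow this down is to list the image files that were written but are empty, and to check that the non-empty ones have a readable HDF5 header. h5ls comes from the hdf5-tools package and is not part of MarginPolish-HELEN:

# list empty image files
find . -maxdepth 1 -name "mGimageRainbowtrout.T*.h5" -empty

# flag files that are empty or cannot be opened as HDF5
for f in mGimageRainbowtrout.T*.h5; do
    h5ls "$f" > /dev/null 2>&1 || echo "unreadable: $f"
done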

helen polish UserWarning on batch size

I am running helen polish and received the warning message below. Is this a harmless warning, or should I do something about it?

INFO: POLISH MODULE SELECTED
INFO: RUN-ID: 04112022_102154
INFO: PREDICTION OUTPUT DIRECTORY: /HELEN/predictions_04112022_102154
INFO: CALL CONSENSUS STARTING
INFO: OUTPUT FILE: /HELEN/predictions_04112022_102154/output_AngusONTpolish.fa
INFO: MODEL LOADING TO ONNX
INFO: SAVING MODEL TO ONNX
/opt/conda/lib/python3.7/site-packages/torch/onnx/symbolic_opset9.py:1436: UserWarning: Exporting a model to ONNX with a batch_size other than 1, with a variable lenght with GRU can cause an error when running the ONNX model with a different batch size. Make sure to save the model with a batch size of 1, or define the initial states (h0/c0) as inputs of the model.
"or define the initial states (h0/c0) as inputs of the model. ")
INFO: TORCH THREADS SET TO: 4.

HAC models

Dear HELEN developers,

When running MarginPolish with the allParams.np.human.guppy-ff-235.json model, I get a Calloc error.

udocker run -v /mnt/SCRATCH/michelmo/Projects/MudMinnow/Nhub_guppy305_flye10K:/data mGPolish reads_2_assembly.bam assembly.fasta allParams.np.human.guppy-ff-235.json -t 32 -o /mnt/SCRATCH/michelmo/Projects/MudMinnow/Nhub_guppy305_flye10K/mG305 -f

 ******************************************************************************
 *                                                                            *
 *               STARTING 137351bb-4e04-3309-9bf5-ae016625cef7                *
 *                                                                            *
 ******************************************************************************
 executing: sh
Set log level to INFO
Running OpenMP with 32 threads.
> Parsing model parameters from file: allParams.np.human.guppy-ff-235.json
Calloc failed with request for -2 lots of 16 bytes
Command exited with non-zero status 1

DEBUG_MAX_MEM:4608
DEBUG_RUNTIME:0:00.06

The program runs if using another model:

udocker run -v /mnt/SCRATCH/michelmo/Projects/MudMinnow/Nhub_guppy305_flye10K:/data mGPolish reads_2_assembly.bam assembly.fasta allParams.np.human.guppy-ff-233.json -t 32 -o /mnt/SCRATCH/michelmo/Projects/MudMinnow/Nhub_guppy305_flye10K/mG305 -f

 ******************************************************************************
 *                                                                            *
 *               STARTING 137351bb-4e04-3309-9bf5-ae016625cef7                *
 *                                                                            *
 ******************************************************************************
 executing: sh
Set log level to INFO
Running OpenMP with 32 threads.
> Parsing model parameters from file: allParams.np.human.guppy-ff-233.json
> Parsing reference sequences from file: assembly.fasta
> Going to write polished reference in : /mnt/SCRATCH/michelmo/Projects/MudMinnow/Nhub_guppy305_flye10K/mG305.fa
...

Is the 235 model file corrupted?

Also, I saw that your latest polishing model is named guppy 2.3.5.
Is it trained on the HAC configuration files?

We are currently using PromethION data basecalled with the HAC models of Guppy 3.0.5 provided by ONT, and I wonder which model would fit the data best.

model files used for basecalling:

md5sum dna_r9.4.1_450bps_hac_prom.cfg   c9dc5f42f63c005085ed89e4094e0bb4
md5sum template_r9.4.1_450bps_hac_prom.jsn     6ee479f9ae82a7d26cb47bd24a7882fd

Maybe it would be more accurate to name the models after the basecalling models they were trained for instead of Guppy versions?

Thanks,
michel
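
The failure during parameter parsing can point to a truncated or corrupted JSON file rather than a problem with the reads. A hedged first check is to re-download the published models into a clean directory and retry with the fresh copy (file names inside the new directory may differ from the old allParams.* naming):

helen download_models --output_dir fresh_models/
ls -l fresh_models/

If the freshly downloaded parameter file parses cleanly, the original copy was most likely damaged during download.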

MarginPolish

Request for MarginPolish to either:

  1. Be able to run Docker without sudo.
  2. Run Docker with sudo but inside Singularity.
  3. Fix the binary so it does not segfault.

Any of those 3 options would be great. I'm not sure what you need in terms of system configuration, but I'll provide you with some basics on my primary test system, and you can let me know if you need more:

O/S CentOS v. 7.6
Dual Intel(R) Xeon(R) CPU E5-2640 v2 CPUs
256GB RAM
GCC v. 4.8.5 default compiler, but other compilers are available
Using Environment modules system
Cmake 3.11
O/S repos include CentOS 7 Basic, Plus, and EPEL

We have a mixture of systems, but the configuration above is pretty typical. On the university HPC, they use SLURM for resource management. On our primary lab servers we can run in standard user mode, or using Torque/PBS. All of my tests have been performed running outside of a resource management system.

Let me know what else you might need.

Thanks,
John
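
Two common routes for running the published image without sudo, both standard Docker/Singularity practice rather than anything specific to MarginPolish-HELEN:

# option 1: allow your user to run docker without sudo (requires re-login)
sudo usermod -aG docker $USER

# option 2: convert the image to a Singularity container and run it unprivileged
singularity pull helen_latest.sif docker://kishwars/helen:latest
singularity exec helen_latest.sif marginpolish --help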

Question about model's training data

Hi:
MarginPolish && HELEN is such an excellent pipeline for polishing ONT assembly, which is easy to run and has very high accuracy. I am using the latest model to polishing some human data. I wonder what data do you use to train the model MP_r941_guppy344_human.json and HELEN_r941_guppy344_human.pkl. The training datasets of this two models were not mentioned in the paper. Which specie and which chromosome is used, HG002, CHM13 or HG00733 and chr1-6 or chr1-19, chr21-22?

Neng

Only polish sequences of interest

Hi, I used Shasta to assemble nanopore sequencing data. I have hundreds of assembled contigs, and I am only interested in the sequences that do not appear in the reference genome (NRS). I want to know whether I can use MarginPolish-HELEN to polish only the NRS I extracted from the assembly instead of the entire assembly FASTA file.
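
One way to do this is to extract the contigs of interest into their own FASTA, re-map the reads against that subset, and run MarginPolish-HELEN on it. A hedged sketch, where nrs_ids.txt (one contig name per line) and the file names are placeholders:

# pull the NRS contigs out of the full assembly
samtools faidx shasta_assembly.fa
xargs samtools faidx shasta_assembly.fa < nrs_ids.txt > nrs_contigs.fa

# re-map the reads to the subset and continue with Steps 2-4 above
minimap2 -ax map-ont -t 32 nrs_contigs.fa reads.fq | \
samtools view -hb -F 0x904 > nrs_unsorted.bam
samtools sort -@32 -o reads_2_nrs.bam nrs_unsorted.bam
samtools index -@32 reads_2_nrs.bam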

Models for Guppy 3.6

Hi,

I really like your polishing pipeline and it gives great results so far.
Last week a new and improved version of Guppy with boosted accuracy was released. Are you planning to provide models for this version of Guppy and if so, when can we expect these?

Thank you,
Dominik

Plant species

Hi Kishwar,
Can I use this polisher on a complex plant genome?
Does the model trained on human sequencing data work for plant species?

Thank you.

Jolvii

MarginPolish output zero .fa file

Dear author,
Thanks for your great assembly tool Shasta and the polishing tools MarginPolish and HELEN. I have used Shasta to assemble a genome and generated the Assembly.fasta file.
Next, I tried to use HELEN to polish the genome. I generated the .bam file with minimap2 and indexed it with samtools. However, the MarginPolish step produces an empty FASTA file, output.fa. The log is as follows.

Running OpenMP with 2 threads.
> Parsing model parameters from file:  ./helen_model/MP_r941_guppy344_human.json
> Parsing reference sequences from file: Assembly.fasta
> Going to write polished reference in : margin_image/output.fa
> Set up bam chunker with chunk size 5000 and overlap 50 (for region=all), resulting in 538336 total chunks
Warning! ***HDF5 library version mismatched error***
The HDF5 header files used to compile this application do not match
the version used by the HDF5 library to which this application is linked.
Data corruption or segmentation faults may occur if the application continues.
This can happen when an application was compiled by one version of HDF5 but
linked with a different version of static or shared HDF5 library.
You should recompile the application or check your shared library related
settings such as 'LD_LIBRARY_PATH'.
You can, at your own risk, disable this warning by setting the environment
variable 'HDF5_DISABLE_VERSION_CHECK' to a value of '1'.
Setting it to 2 or higher will suppress the warning messages totally.
Headers are 1.8.12, library is 1.8.11
	    SUMMARY OF THE HDF5 CONFIGURATION
	    =================================

General Information:
-------------------
		   HDF5 Version: 1.8.11
		  Configured on: Wed May  8 16:20:56 CDT 2013
		  Configured by: hdftest@koala
		 Configure mode: production
		    Host system: x86_64-unknown-linux-gnu
	      Uname information: Linux koala 2.6.18-348.1.1.el5 #1 SMP Tue Jan 22 16:19:19 EST 2013 x86_64 x86_64 x86_64 GNU/Linux
		       Byte sex: little-endian
		      Libraries: static, shared
	     Installation point: /mnt/scr1/pre-release/hdf5/v1811/thg-builds/koala

Compiling Options:
------------------
               Compilation Mode: production
                     C Compiler: /usr/bin/gcc ( gcc (GCC) 4.1.2 20080704 )
                         CFLAGS: 
                      H5_CFLAGS: -std=c99 -pedantic -Wall -Wextra -Wundef -Wshadow -Wpointer-arith -Wbad-function-cast -Wcast-qual -Wcast-align -Wwrite-strings -Wconversion -Waggregate-return -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wredundant-decls -Wnested-externs -Winline -Wno-long-long -Wfloat-equal -Wmissing-format-attribute -Wmissing-noreturn -Wpacked -Wdisabled-optimization -Wformat=2 -Wunreachable-code -Wendif-labels -Wdeclaration-after-statement -Wold-style-definition -Winvalid-pch -Wvariadic-macros -Wnonnull -Winit-self -Wmissing-include-dirs -Wswitch-default -Wswitch-enum -Wunused-macros -Wunsafe-loop-optimizations -Wc++-compat -Wvolatile-register-var -O3 -fomit-frame-pointer -finline-functions
                      AM_CFLAGS: 
                       CPPFLAGS: 
                    H5_CPPFLAGS: -D_POSIX_C_SOURCE=199506L   -DNDEBUG -UH5_DEBUG_API
                    AM_CPPFLAGS: -I/mnt/hdf/packages/szip/shared/encoder/Linux2.6-x86_64-gcc/include -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -D_BSD_SOURCE 
               Shared C Library: yes
               Static C Library: yes
  Statically Linked Executables: yes
                        LDFLAGS: 
                     H5_LDFLAGS: 
                     AM_LDFLAGS:  -L/mnt/hdf/packages/szip/shared/encoder/Linux2.6-x86_64-gcc/lib
 	 	Extra libraries:  -lsz -lz -lrt -ldl -lm 
 		       Archiver: ar
 		 	 Ranlib: ranlib
 	      Debugged Packages: 
		    API Tracing: no

Languages:
----------
                        Fortran: yes
               Fortran Compiler: /usr/bin/gfortran ( GNU Fortran (GCC) 4.1.2 20080704 )
          Fortran 2003 Compiler: no
                  Fortran Flags: 
               H5 Fortran Flags:  
               AM Fortran Flags: 
         Shared Fortran Library: yes
         Static Fortran Library: yes

                            C++: yes
                   C++ Compiler: /usr/bin/g++ ( g++ (GCC) 4.1.2 20080704 )
                      C++ Flags: 
                   H5 C++ Flags:  
                   AM C++ Flags: 
             Shared C++ Library: yes
             Static C++ Library: yes

Features:
---------
                  Parallel HDF5: no
             High Level library: yes
                   Threadsafety: no
            Default API Mapping: v18
 With Deprecated Public Symbols: yes
         I/O filters (external): deflate(zlib),szip(encoder)
         I/O filters (internal): shuffle,fletcher32,nbit,scaleoffset
                            MPE: no
                     Direct VFD: no
                        dmalloc: no
Clear file buffers before write: yes
           Using memory checker: no
         Function Stack Tracing: no
                           GPFS: no
      Strict File Format Checks: no
   Optimization Instrumentation: no
       Large File Support (LFS): yes
Bye...

The command I used to run marginpolish is

marginpolish reads_2_assembly.0x904q60.bam Assembly.fasta $MODELDIR/MP_r941_guppy344_human.json -t 2 -o margin_image/output -f

Do you have any solutions to my issue?

Best
Xiaofei
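
The log itself points at the cause: the HDF5 headers (1.8.12) do not match the runtime library (1.8.11). The proper fix is to rebuild against a single HDF5 installation (for example the libhdf5-dev from the prerequisites section), but the warning also describes a temporary way to let the run proceed:

# temporary workaround described by the HDF5 warning above; use at your own risk
export HDF5_DISABLE_VERSION_CHECK=1
marginpolish reads_2_assembly.0x904q60.bam Assembly.fasta \
$MODELDIR/MP_r941_guppy344_human.json -t 2 -o margin_image/output -f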

How to train a new model with in-lab testing data

Our lab is doing research on some channel proteins, and the sequencing error profile seems different from the R9 pore, so we want to train MarginPolish/HELEN on these new data. Could you tell me how to do it? Thanks.

How to install helen in a gcc 4.8 environment?

The server I'm using runs RedHat with gcc 4.8.
I have spent more than one day installing helen, and the errors keep coming one after another. Has anybody met similar problems?
Please let me know. Thanks.

marginpolish docker stuck at 99%

Hi,

I am running your new docker container to streamline assembly polishing and ran into some trouble with marginPolish. It looks like MarginPolish is stalling at the very end.

 singularity run /net/cn-1/mnt/SCRATCH/michelmo/Projects/CONTAINERS/helen_latest20200519.sif  marginpolish ../SimonFlye27_15K.ONTremap.0x904.bam SimonFlye27_15K.fasta /net/cn-1/mnt/SCRATCH/mic
helmo/Projects/CONTAINERS/MP_r941_guppy344_human.json -t 64 -o . -f
Running OpenMP with 64 threads.
> Parsing model parameters from file: /net/cn-1/mnt/SCRATCH/michelmo/Projects/CONTAINERS/MP_r941_guppy344_human.json
> Parsing reference sequences from file: SimonFlye27_15K.fasta
> Going to write polished reference in : ./output.fa
> Set up bam chunker with chunk size 5000 and overlap 50 (for region=all), resulting in 546365 total chunks
> Polishing  1% complete (5623/546365).  Estimated time remaining: 31h 25m
> Polishing  2% complete (10934/546365).  Estimated time remaining: 25h 50m
> Polishing  3% complete (16427/546365).  Estimated time remaining: 23h 52m
> Polishing  4% complete (21903/546365).  Estimated time remaining: 22h 37m
> Polishing  5% complete (27374/546365).  Estimated time remaining: 22h 48m
> Polishing  6% complete (32813/546365).  Estimated time remaining: 22h 32m
> Polishing  7% complete (38250/546365).  Estimated time remaining: 22h 2m
> Polishing  8% complete (43711/546365).  Estimated time remaining: 21h 55m
> Polishing  9% complete (49186/546365).  Estimated time remaining: 22h 18m
> Polishing 10% complete (54652/546365).  Estimated time remaining: 22h 35m
> Polishing 11% complete (60114/546365).  Estimated time remaining: 22h 50m
> Polishing 12% complete (65596/546365).  Estimated time remaining: 22h 55m
> Polishing 13% complete (71045/546365).  Estimated time remaining: 22h 54m
> Polishing 14% complete (76500/546365).  Estimated time remaining: 22h 52m
> Polishing 15% complete (81977/546365).  Estimated time remaining: 22h 50m
> Polishing 16% complete (87432/546365).  Estimated time remaining: 22h 48m
> Polishing 17% complete (92924/546365).  Estimated time remaining: 22h 43m
.....
> Polishing 91% complete (497222/546365).  Estimated time remaining: 2h 41m
> Polishing 92% complete (502673/546365).  Estimated time remaining: 2h 23m
> Polishing 93% complete (508160/546365).  Estimated time remaining: 2h 5m
> Polishing 94% complete (513630/546365).  Estimated time remaining: 1h 47m
> Polishing 95% complete (519110/546365).  Estimated time remaining: 1h 29m
> Polishing 96% complete (524517/546365).  Estimated time remaining: 1h 12m
> Polishing 97% complete (530110/546365).  Estimated time remaining: 54m 3s
> Polishing 98% complete (535445/546365).  Estimated time remaining: 35m 59s
> Polishing 99% complete (541585/546365).  Estimated time remaining: 17m 57s

The H5 files have been created and written to, but no more writing has happened for the last few hours.

The process is still running but has been using only 1 thread for the last 5 hours.
Is this expected? Does marginpolish do some final wrap-up at the end which takes longer than expected?

368136 michelmo  20   0  121.9g 119.1g   1484 S 100.0  3.9 112466:10 marginPolish                                 

Thank you,
Michel

stitch.py ValueError triggered by contig sequence name

stitch.py throws a ValueError if one of my contigs is named as follows, but it works fine if I rename it to something like CLUS3951:

bc.1+2.clus.3951.fa.poa:1.0-7835.0

It is apparently trying to convert the 7835.0 at the end into an integer

File "stitch.py", line 93, in <module> process_marginpolish_h5py(FLAGS.sequence_hdf, FLAGS.output_dir, FLAGS.threads) File "stitch.py", line 58, in process_marginpolish_h5py consensus_sequence = stich_object.create_consensus_sequence(hdf_file_path, contig, chunk_keys, threads) File "modules/python/Stitch.py", line 280, in create_consensus_sequence sequence_chunk_key_list.append((contig, int(st), int(end))) ValueError: invalid literal for int() with base 10: '7835.0'

wrong script in example Usage in README.md

Dear,

In README.md Step 2 example script:

samtools sort -@ 32 unsorted.bam | samtools view > reads_2_assembly.0x904q60.bam

should be

samtools sort -@ 32 unsorted.bam >reads_2_assembly.0x904q60.bam

Best,
Jia

Add helen to bioconda.

Hi,

Can you add helen to bioconda? It is so difficult to install helen on a CentOS machine.

Best
Kun

margin docker run fail

hi,
I ran the marginPolish process (docker version) and it failed:
root@ecs-9875:/media/datarun/blnanodata/data# tail marginPolish.log
/usr/bin/time -f '\nDEBUG_MAX_MEM:%M\nDEBUG_RUNTIME:%E\n' /opt/MarginPolish/build/marginPolish reads_2_assembly.bam new.fasta allParams.np.human.guppy-ff-233.json -t 32 -o output/marginpolish_images -f

Running OpenMP with 32 threads.

Parsing model parameters from file: allParams.np.human.guppy-ff-233.json
Calloc failed with request for -2 lots of 16 bytes
Command exited with non-zero status 1

DEBUG_MAX_MEM:3836
DEBUG_RUNTIME:0:00.00

Can you help me fix it?

Exception: process 6 terminated with signal SIGKILL

Hello,

I am trying to run helen in polishing mode. Here is my command:

helen polish -i marginPolish_images -m helen_models/HELEN_r941_guppy344_microbial.pkl -o helen_polish/ -t 16

However, I face the following error:
Traceback (most recent call last):
  File "/lustre-gseg/software/bin/helen", line 33, in <module>
    sys.exit(load_entry_point('helen==0.0.23', 'console_scripts', 'helen')())
  File "/lustre-gseg/software/MarginPolish-HELEN/py36_venv/lib64/python3.6/site-packages/helen/helen.py", line 313, in main
    FLAGS.callers)
  File "/lustre-gseg/software/MarginPolish-HELEN/py36_venv/lib64/python3.6/site-packages/helen/modules/python/PolishInterface.py", line 87, in polish_genome
    callers)
  File "/lustre-gseg/software/MarginPolish-HELEN/py36_venv/lib64/python3.6/site-packages/helen/modules/python/CallConsensusInterface.py", line 153, in call_consensus
    callers, threads_per_caller, num_workers)
  File "/lustre-gseg/software/MarginPolish-HELEN/py36_venv/lib64/python3.6/site-packages/helen/modules/python/models/predict_cpu.py", line 248, in predict_cpu
    join=True)
  File "/lustre-gseg/software/MarginPolish-HELEN/py36_venv/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/lustre-gseg/software/MarginPolish-HELEN/py36_venv/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/lustre-gseg/software/MarginPolish-HELEN/py36_venv/lib64/python3.6/site-packages/torch/multiprocessing/spawn.py", line 108, in join
    (error_index, name)
Exception: process 6 terminated with signal SIGKILL

Please could you shed some light on this?

Many thanks in advance.
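
A worker terminated with SIGKILL during CPU prediction is most often the kernel's out-of-memory killer. A hedged mitigation, using only the flags documented in the Usage section above, is to lower the batch size and the number of parallel workers so each prediction process needs less memory:

helen polish \
-i marginPolish_images \
-m helen_models/HELEN_r941_guppy344_microbial.pkl \
-o helen_polish/ \
-t 8 \
--batch_size 64 \
--num_workers 1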

torch.set_num_threads error in docker image

Hi, I am trying to run the docker image and get a torch runtime error.
Is this an error in the docker image? Thanks!

INFO: POLISH MODULE SELECTED
INFO: RUN-ID: 09012020_134236
INFO: PREDICTION OUTPUT DIRECTORY: /.../helen_out/predictions_09012020_134236
INFO: CALL CONSENSUS STARTING
INFO: OUTPUT FILE: /.../helen_out/predictions_09012020_134236/265L12.cont.cor.fa
INFO: MODEL LOADING TO ONNX
Traceback (most recent call last):
File "/opt/conda/bin/helen", line 8, in
sys.exit(main())
File "/opt/conda/lib/python3.7/site-packages/helen/helen.py", line 313, in main
FLAGS.callers)
File "/opt/conda/lib/python3.7/site-packages/helen/modules/python/PolishInterface.py", line 87, in polish_genome
callers)
File "/opt/conda/lib/python3.7/site-packages/helen/modules/python/CallConsensusInterface.py", line 153, in call_consensus
callers, threads_per_caller, num_workers)
File "/opt/conda/lib/python3.7/site-packages/helen/modules/python/models/predict_cpu.py", line 248, in predict_cpu
join=True)
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/opt/conda/lib/python3.7/site-packages/helen/modules/python/models/predict_cpu.py", line 194, in setup
threads)
File "/opt/conda/lib/python3.7/site-packages/helen/modules/python/models/predict_cpu.py", line 65, in predict
torch.set_num_threads(threads)
RuntimeError: set_num_threads expects a positive integer
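
The error means torch.set_num_threads received a value that is not a positive integer, which (judging from the threads_per_caller variable in the traceback) suggests the thread count divided among the callers came out as zero. A hedged workaround is to pass an explicit, larger --threads value; the paths and model name below are placeholders:

docker run -it --ipc=host --user=`id -u`:`id -g` \
-v </directory/with/inputs_outputs>:/data kishwars/helen:latest \
helen polish \
--image_dir /data/marginpolish_images \
--model_path /data/HELEN_r941_guppy344_human.pkl \
--threads 16 \
--batch_size 128 \
--num_workers 4 \
--output_dir /data/helen_out \
--output_prefix polished.fa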

Problems when evaluating NA12878 chromosome 21 data

hi
I ran helen to polish the draft assembly of NA12878 chromosome 21, but there seem to be some problems in the polished results.
First I ran marginPolish to generate image features with the command:
marginPolish read2assembly.sort.bam ../assembly.fasta ~/tools/MarginPolish/params/allParams.np.human.r94-g235.json -o chr21_margin -t 60 -f
Second, I ran helen to generate a more accurate assembly with the command:
helen polish -i output_files/ -m ~/tools/helen/models/HELEN_r941_guppy344_human.pkl -b 512 -w 4 -t 60 -o helenPolish -p chr21_helen -g
After that, I used pomoxis to evaluate the error rate of the polished assembly.
[Two figures showing the pomoxis results for the marginPolish and helen outputs are not reproduced here.]

Neng
