NeuralTE

NeuralTE uses a Convolutional Neural Network (CNN) to classify transposable elements (TEs) at the superfamily level, based on the TE structure and k-mer occurrence features.
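
To make the notion of a k-mer occurrence feature concrete, here is a minimal sketch that turns one sequence into a fixed-length count vector. The function name `kmer_occurrences` and the choice to skip windows containing non-ACGT characters are assumptions made for illustration; this is not NeuralTE's own feature extraction code.

```python
from collections import Counter
from itertools import product
from typing import List


def kmer_occurrences(sequence: str, k: int) -> List[int]:
    """Count every k-mer over the alphabet ACGT, reported in a fixed order."""
    sequence = sequence.upper()
    # Slide a window of width k over the sequence and tally each substring.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Enumerate all 4**k possible k-mers so every sequence yields a vector
    # of the same length; windows with other characters (e.g. N) are ignored.
    vocabulary = ("".join(p) for p in product("ACGT", repeat=k))
    return [counts[kmer] for kmer in vocabulary]


vector = kmer_occurrences("ACGTACGTNN", k=2)
print(len(vector))  # 16 possible 2-mers
print(sum(vector))  # 7 windows free of N
```

A fixed vocabulary order matters because a neural network expects the same feature at the same position for every input sequence.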

It is recommended that the TE library to be classified consist of full-length TEs. TE libraries often divide LTR retrotransposons into terminal and internal sequences, such as Copia-62_PHord-LTR and Copia-62_PHord-I in Repbase; we suggest restoring them to full-length LTR retrotransposons before classification.

We recommend using HiTE to generate full-length TE libraries, as it can identify more full-length TEs than other tools. For the classification of fragmented TEs, you can use RepeatClassifier configured with a complete Dfam library.
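
The restoration step can be sketched as follows, assuming the library is held as a name-to-sequence dictionary and that terminal and internal parts share a stem with the `-LTR` and `-I` suffixes seen in the Repbase example. The function name `restore_full_length_ltrs` is hypothetical, and reusing the same terminal sequence on both ends is a simplification; the repository's own preprocessing may handle more cases.

```python
from typing import Dict


def restore_full_length_ltrs(records: Dict[str, str]) -> Dict[str, str]:
    """Join '<stem>-LTR' and '<stem>-I' entries into LTR + internal + LTR."""
    restored = {}
    for name, seq in records.items():
        if name.endswith("-I"):
            stem = name[:-len("-I")]
            ltr = records.get(stem + "-LTR")
            if ltr is not None:
                restored[stem] = ltr + seq + ltr
            else:
                restored[name] = seq  # no matching terminal; keep unchanged
        elif name.endswith("-LTR"):
            stem = name[:-len("-LTR")]
            if stem + "-I" not in records:
                restored[name] = seq  # terminal without internal part; keep unchanged
        else:
            restored[name] = seq  # not an LTR fragment
    return restored


library = {
    "Copia-62_PHord-LTR": "TGTTG",
    "Copia-62_PHord-I": "AAACCC",
    "Mariner-1": "GGG",
}
print(restore_full_length_ltrs(library))
```

The toy sequences above are placeholders, not real Repbase entries.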

Installation

NeuralTE is built on Python3 and Keras.

System Requirements

NeuralTE requires only a standard computer to run the Convolutional Neural Network (CNN). Using a GPU can accelerate TE classification.

Recommended hardware: 40 CPU processors, 128 GB RAM.

Recommended OS: Linux (Ubuntu 16.04, CentOS 7, etc.)

Option 1. Run with Conda

git clone https://github.com/CSU-KangHu/NeuralTE.git ## Alternatively, you can download the zip file directly from the repository.
cd NeuralTE
chmod +x tools/*

conda install mamba -c conda-forge
# Create the environment from the environment.yml file in the project directory
mamba env create --name NeuralTE -f environment.yml
conda activate NeuralTE

Next, run NeuralTE with the demo data (see Demo data).

Pre-trained models

See models:

  • NeuralTE model: This model uses k-mer occurrences, LTR/TIR terminals, TE domains, and 5-bp terminal ends as features. It was trained on Repbase version 28.06.

  • NeuralTE-TSDs model: This model uses k-mer occurrences, LTR/TIR terminals, TE domains, 5-bp terminal ends, and target site duplications (TSDs) as features. It was trained on a subset of Repbase version 28.06 covering 493 species. This model must be used together with the genome assembly of the species being classified.

Specify the GPUs

Skip this section when running on CPUs.

  --start_gpu_num start_gpu_num
                        The starting index for using GPUs. default = [ 0 ]
  --use_gpu_num use_gpu_num
                        Specifying the number of GPUs in use. default = [ 1 ]

For example, --start_gpu_num 0 and --use_gpu_num 2 indicate a total of two GPUs to be used, with the assigned GPU indices being gpu:0 and gpu:1.

Default configurations are set in gpu_config.py.
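
The relationship between the two options can be sketched as below. This only reproduces the index arithmetic from the example above; the function names are hypothetical, and how gpu_config.py actually selects devices is not shown here.

```python
from typing import List


def gpu_indices(start_gpu_num: int = 0, use_gpu_num: int = 1) -> List[int]:
    """Return the consecutive GPU indices implied by the two options."""
    if start_gpu_num < 0 or use_gpu_num < 1:
        raise ValueError("start_gpu_num must be >= 0 and use_gpu_num must be >= 1")
    return list(range(start_gpu_num, start_gpu_num + use_gpu_num))


def device_names(start_gpu_num: int = 0, use_gpu_num: int = 1) -> List[str]:
    """Format the indices in the gpu:N style used in the example above."""
    return [f"gpu:{i}" for i in gpu_indices(start_gpu_num, use_gpu_num)]


print(device_names(0, 2))  # ['gpu:0', 'gpu:1']
print(device_names(2, 2))  # ['gpu:2', 'gpu:3']
```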

Demo data

The demo directory contains data to try NeuralTE with:

  • test.fa: demo TE library to be classified.
  • genome.fa: demo sequences of the genome assembly.
# 1. Classify the TE library without a genome
# Inputs: 
#       --data: TE library to be classified.
#       --model_path: Pre-trained NeuralTE model without using TSDs features.
#       --outdir: Output directory. The `--outdir` should not be the same as the directory 
#                 where the `--data` file is located.
#       --thread: The thread number used in data preprocessing.
# Outputs: 
#       classified.info: Classification labels corresponding to TE names.
#       classified_TE.fa: Classified TE library.
#       ${data}.domain: Mapping table for the positioning of TE sequences and domains.
python ${pathTo}/NeuralTE/src/Classifier.py \
 --data ${pathTo}/NeuralTE/demo/test.fa \
 --model_path ${pathTo}/NeuralTE/models/NeuralTE_model.h5 \
 --outdir ${outdir} \
 --thread ${threads_num}
 # e.g., my command: 
 # python /home/hukang/NeuralTE/src/Classifier.py \
 # --data /home/hukang/NeuralTE/demo/test.fa \
 # --model_path /home/hukang/NeuralTE/models/NeuralTE_model.h5 \
 # --outdir /home/hukang/NeuralTE/demo/work \
 # --thread 40
 
 
# 2. Classify the TE library with a genome
# Inputs:
#       --data: TE library to be classified.
#       --genome: The genome assembly corresponding to the TE library.
#       --use_TSD: Use the TSD feature to classify TEs.
#       --model_path: Pre-trained NeuralTE model using TSDs features.
#       --outdir: Output directory. The `--outdir` should not be the same as the directory 
#                 where the `--data` file is located.
#       --thread: The thread number used in data preprocessing.
# Outputs:
#       classified.info: Classification labels corresponding to TE names.
#       classified_TE.fa: Classified TE library.
#       ${data}.domain: Mapping table for the positioning of TE sequences and domains.
python ${pathTo}/NeuralTE/src/Classifier.py \
 --data ${pathTo}/NeuralTE/demo/test.fa \
 --genome ${pathTo}/NeuralTE/demo/genome.fa \
 --use_TSD 1 \
 --model_path ${pathTo}/NeuralTE/models/NeuralTE-TSDs_model.h5 \
 --outdir ${outdir} \
 --thread ${threads_num}
 # e.g., my command: 
 # python /home/hukang/NeuralTE/src/Classifier.py \
 # --data /home/hukang/NeuralTE/demo/test.fa \
 # --genome /home/hukang/NeuralTE/demo/genome.fa \
 # --use_TSD 1 \
 # --model_path /home/hukang/NeuralTE/models/NeuralTE-TSDs_model.h5 \
 # --outdir /home/hukang/NeuralTE/demo/work \
 # --thread 40
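
Both commands require that `--outdir` differ from the directory holding the `--data` file. A small pre-flight check along these lines can catch the mistake before launching a run; `outdir_is_safe` is a hypothetical helper and is not part of NeuralTE.

```python
from pathlib import Path


def outdir_is_safe(data: str, outdir: str) -> bool:
    """Return True when outdir is not the directory that contains the data file."""
    data_dir = Path(data).resolve().parent
    # resolve() normalizes trailing slashes and relative segments before comparing.
    return Path(outdir).resolve() != data_dir


data = "/home/hukang/NeuralTE/demo/test.fa"
print(outdir_is_safe(data, "/home/hukang/NeuralTE/demo/work"))  # True
print(outdir_is_safe(data, "/home/hukang/NeuralTE/demo/"))      # False
```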

Train a new model

# 0. Preprocess Repbase database (including merging subfiles, concatenating LTR terminal and internal sequences, filtering incomplete LTR transposons, etc.)
# Inputs:
#        repbase_dir: Directory containing all Repbase files
#        out_dir: Output directory containing preprocessed results
# Outputs:
#        all_repbase.ref: Merged sequence of all Repbase databases
python ${pathTo}/NeuralTE/utils/preprocess_repbase.py \
 --repbase_dir ${pathTo}/RepBase${version}.fasta \
 --out_dir ${out_dir}
 # e.g., my command: 
 # python /home/hukang/NeuralTE/utils/preprocess_repbase.py \
 # --repbase_dir /home/hukang/RepBase28.06.fasta/ \
 # --out_dir /home/hukang/test/
 
 
# 1. Splitting train and test datasets
# Inputs:
#        data_path: All Repbase database sequences
#        out_dir: Output directory after dataset partition
#        ratio: Fraction of sequences assigned to the training set (0.8 gives an 80/20 train/test split)
# Outputs:
#        train.ref: 80% of all Repbase database sequences for training
#        test.ref: 20% of all Repbase database sequences for testing
python ${pathTo}/NeuralTE/utils/split_train_test.py \
 --data_path ${Step0_out_dir}/all_repbase.ref \
 --out_dir ${out_dir} \
 --ratio 0.8
 # e.g., my command: 
 # python /home/hukang/NeuralTE/utils/split_train_test.py \
 # --data_path /home/hukang/test/all_repbase.ref \
 # --out_dir /home/hukang/test/ \
 # --ratio 0.8


# 2. Train a new NeuralTE model
# Inputs: 
#        train.ref: training set
# Outputs: 
#        model_${time}.h5: Generate h5 format file in the ${pathTo}/NeuralTE/models directory
python ${pathTo}/NeuralTE/src/Trainer.py \
 --data ${Step1_out_dir}/train.ref \
 --is_train 1 \
 --is_predict 0 \
 --use_TSD 0 \
 --outdir ${outdir} \
 --thread ${threads_num} \
 --start_gpu_num ${start_gpu_num} \
 --use_gpu_num ${use_gpu_num}
 # e.g., my command: 
 # python /home/hukang/NeuralTE/src/Trainer.py \
 # --data /home/hukang/test/train.ref \
 # --is_train 1 \
 # --is_predict 0 \
 # --use_TSD 0 \
 # --outdir /home/hukang/test/work \
 # --thread 40 \
 # --start_gpu_num 0 \
 # --use_gpu_num 1
 
 # Replace original model
 cd ${pathTo}/NeuralTE/models && mv model_${time}.h5 NeuralTE_model.h5
 
 
# 3. Train a new NeuralTE-TSDs model
# Inputs: 
#        train.ref: training set
#        genome.info: Modify the 'genome.info' file in the ${pathTo}/NeuralTE/data directory. Ensure that 'Scientific Name' corresponds to the species names in `train.ref`, and 'Genome Path' should be an absolute path.
# Outputs: 
#        model_${time}.h5: Generate h5 format file in the ${pathTo}/NeuralTE/models directory
python ${pathTo}/NeuralTE/src/Trainer.py \
 --data ${Step1_out_dir}/train.ref \
 --genome ${pathTo}/NeuralTE/data/genome.info \
 --is_train 1 \
 --is_predict 0 \
 --use_TSD 1 \
 --outdir ${outdir} \
 --thread ${threads_num} \
 --start_gpu_num ${start_gpu_num} \
 --use_gpu_num ${use_gpu_num}
 # e.g., my command: 
 # python /home/hukang/NeuralTE/src/Trainer.py \
 # --data /home/hukang/test/train.ref \
 # --genome /home/hukang/NeuralTE/data/genome.info \
 # --is_train 1 \
 # --is_predict 0 \
 # --use_TSD 1 \
 # --outdir /home/hukang/test/work \
 # --thread 40 \
 # --start_gpu_num 0 \
 # --use_gpu_num 1
 
 # Replace original model
cd ${pathTo}/NeuralTE/models && mv model_${time}.h5 NeuralTE-TSDs_model.h5
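
To illustrate what the `--ratio 0.8` option in step 1 means, here is a minimal sketch of a ratio-based split over a list of records. It is not the implementation of split_train_test.py; the function name, the shuffling, and the `seed` parameter are assumptions, and the real script may partition the data differently.

```python
import random
from typing import List, Tuple


def split_records(records: List[str], ratio: float = 0.8,
                  seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle records and split them into (train, test) by the given ratio."""
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be strictly between 0 and 1")
    shuffled = records[:]                  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


names = [f"TE_{i}" for i in range(10)]
train, test = split_records(names, ratio=0.8)
print(len(train), len(test))  # 8 2
```

Fixing the seed keeps the same sequences in the test set across runs, which makes results from separately trained models comparable.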

Experiment reproduction

All experimental results in the NeuralTE manuscript can be reproduced step by step by following Experiment reproduction.

Usage

1. Classify TE library

usage: Classifier.py [-h] --data data --outdir output_dir [--use_TSD use_TSD] [--is_predict is_predict] [--start_gpu_num start_gpu_num] [--use_gpu_num use_gpu_num] [--keep_raw keep_raw] [--genome genome] [--species species] [--model_path model_path]
                     [--use_kmers use_kmers] [--use_terminal use_terminal] [--use_minority use_minority] [--use_domain use_domain] [--use_ends use_ends] [--is_wicker is_wicker] [--is_plant is_plant] [--threads thread_num] [--internal_kmer_sizes internal_kmer_sizes]
                     [--terminal_kmer_sizes terminal_kmer_sizes]

########################## NeuralTE, version 1.0.1 ##########################

optional arguments:
  -h, --help            show this help message and exit
  --data data           Input fasta file used to predict, header format: seq_name label species_name, refer to "data/test.example.fa" for example.
  --outdir output_dir   Output directory, store temporary files
  --use_TSD use_TSD     Whether to use TSD features, 1: true, 0: false. default = [ 0 ]
  --is_predict is_predict
                        Enable prediction mode, 1: true, 0: false. default = [ 1 ]
  --start_gpu_num start_gpu_num
                        The starting index for using GPUs. default = [ 0 ]
  --use_gpu_num use_gpu_num
                        Specifying the number of GPUs in use. default = [ 1 ]
  --keep_raw keep_raw   Whether to retain the raw input sequence, 1: true, 0: false; only save species having TSDs. default = [ 0 ]
  --genome genome       Genome path, use to search for TSDs
  --species species     Which species does the TE library to be classified come from.
  --model_path model_path
                        Input the path of trained model, absolute path.
  --use_kmers use_kmers
                        Whether to use kmers features, 1: true, 0: false. default = [ 1 ]
  --use_terminal use_terminal
                        Whether to use LTR, TIR terminal features, 1: true, 0: false. default = [ 1 ]
  --use_minority use_minority
                        Whether to use minority features, 1: true, 0: false. default = [ 0 ]
  --use_domain use_domain
                        Whether to use domain features, 1: true, 0: false. default = [ 1 ]
  --use_ends use_ends   Whether to use 5-bp terminal ends features, 1: true, 0: false. default = [ 1 ]
  --is_wicker is_wicker
                        Use Wicker or RepeatMasker classification labels, 1: Wicker, 0: RepeatMasker. default = [ 1 ]
  --is_plant is_plant   Is the input genome of a plant? 0 represents non-plant, while 1 represents plant. default = [ 0 ]
  --threads thread_num  Input thread num, default = [ 104 ]
  --internal_kmer_sizes internal_kmer_sizes
                        The k-mer size used to convert internal sequences to k-mer frequency features, default = [ [5] MB ]
  --terminal_kmer_sizes terminal_kmer_sizes
                        The k-mer size used to convert terminal sequences to k-mer frequency features, default = [ [3, 4] ]
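
As a worked example of what the default k-mer sizes imply, the sketch below counts how many distinct k-mers exist for each setting, assuming a four-letter DNA alphabet and one feature per possible k-mer. The helper name is hypothetical, and how NeuralTE arranges or combines these features into the model input is not described here.

```python
from typing import Iterable


def kmer_feature_length(kmer_sizes: Iterable[int], alphabet_size: int = 4) -> int:
    """Total number of distinct k-mers across all requested sizes."""
    return sum(alphabet_size ** k for k in kmer_sizes)


internal_length = kmer_feature_length([5])     # 4**5 = 1024
terminal_length = kmer_feature_length([3, 4])  # 4**3 + 4**4 = 320
print(internal_length, terminal_length)
```

Because the feature length depends on these sizes, a model trained with one setting generally expects inputs produced with the same setting.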

2. Train a new model

usage: Trainer.py [-h] --data data --outdir output_dir --use_TSD use_TSD --is_train is_train --is_predict is_predict [--start_gpu_num start_gpu_num] [--use_gpu_num use_gpu_num] [--only_preprocess only_preprocess] [--keep_raw keep_raw] [--genome genome]
                  [--use_kmers use_kmers] [--use_terminal use_terminal] [--use_minority use_minority] [--use_domain use_domain] [--use_ends use_ends] [--threads thread_num] [--internal_kmer_sizes internal_kmer_sizes] [--terminal_kmer_sizes terminal_kmer_sizes]
                  [--cnn_num_convs cnn_num_convs] [--cnn_filters_array cnn_filters_array] [--cnn_kernel_sizes_array cnn_kernel_sizes_array] [--cnn_dropout cnn_dropout] [--batch_size batch_size] [--epochs epochs] [--use_checkpoint use_checkpoint]

########################## NeuralTE, version 1.0.1 ##########################

optional arguments:
  -h, --help            show this help message and exit
  --data data           Input fasta file used to train model, header format: seq_name label species_name, refer to "data/train.example.fa" for example.
  --outdir output_dir   Output directory, store temporary files
  --use_TSD use_TSD     Whether to use TSD features, 1: true, 0: false. default = [ 0 ]
  --is_train is_train   Enable train mode, 1: true, 0: false. default = [ 0 ]
  --is_predict is_predict
                        Enable prediction mode, 1: true, 0: false. default = [ 1 ]
  --start_gpu_num start_gpu_num
                        The starting index for using GPUs. default = [ 0 ]
  --use_gpu_num use_gpu_num
                        Specifying the number of GPUs in use. default = [ 1 ]
  --only_preprocess only_preprocess
                        Whether to only perform data preprocessing, 1: true, 0: false.
  --keep_raw keep_raw   Whether to retain the raw input sequence, 1: true, 0: false; only save species having TSDs. default = [ 0 ]
  --genome genome       Genome path, use to search for TSDs
  --use_kmers use_kmers
                        Whether to use kmers features, 1: true, 0: false. default = [ 1 ]
  --use_terminal use_terminal
                        Whether to use LTR, TIR terminal features, 1: true, 0: false. default = [ 1 ]
  --use_minority use_minority
                        Whether to use minority features, 1: true, 0: false. default = [ 0 ]
  --use_domain use_domain
                        Whether to use domain features, 1: true, 0: false. default = [ 1 ]
  --use_ends use_ends   Whether to use 5-bp terminal ends features, 1: true, 0: false. default = [ 1 ]
  --threads thread_num  Input thread num, default = [ 104 ]
  --internal_kmer_sizes internal_kmer_sizes
                        The k-mer size used to convert internal sequences to k-mer frequency features, default = [ [5] MB ]
  --terminal_kmer_sizes terminal_kmer_sizes
                        The k-mer size used to convert terminal sequences to k-mer frequency features, default = [ [3, 4] ]
  --cnn_num_convs cnn_num_convs
                        The number of CNN convolutional layers. default = [ 3 ]
  --cnn_filters_array cnn_filters_array
                        The number of filters in each CNN convolutional layer. default = [ [16, 16, 16] ]
  --cnn_kernel_sizes_array cnn_kernel_sizes_array
                        The kernel size in each of CNN convolutional layer. default = [ [7, 7, 7] ]
  --cnn_dropout cnn_dropout
                        The threshold of CNN Dropout. default = [ 0.5 ]
  --batch_size batch_size
                        The batch size in training model. default = [ 32 ]
  --epochs epochs       The number of epochs in training model. default = [ 50 ]
  --use_checkpoint use_checkpoint
                        Whether to use breakpoint training. 1: true, 0: false. The model will continue training from the last failed parameters to avoid training from head. default = [ 0 ]
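
Both scripts expect FASTA headers in the form `seq_name label species_name`. A minimal parsing sketch follows, assuming the three fields are separated by whitespace and that only the species name may contain spaces; check data/train.example.fa for the exact separator. The `TEHeader` type, the `parse_header` function, and the sample header are illustrative and not taken from NeuralTE.

```python
from typing import NamedTuple


class TEHeader(NamedTuple):
    seq_name: str
    label: str
    species_name: str


def parse_header(line: str) -> TEHeader:
    """Split a FASTA header into name, label, and species (which may contain spaces)."""
    # maxsplit=2 keeps a multi-word species name together in the last field.
    fields = line.lstrip(">").strip().split(None, 2)
    if len(fields) != 3:
        raise ValueError(f"expected 'seq_name label species_name', got: {line!r}")
    return TEHeader(*fields)


header = parse_header(">Copia-62_PHord Copia Hordeum vulgare")
print(header.seq_name, header.label, header.species_name, sep=" | ")
```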

Citations

Please cite our paper if you find NeuralTE useful:

Hu, K., Xu, M., Gao, X. & Wang, J. (2024). NeuralTE: an accurate approach for Transposable Element superfamily classification with multi-feature fusion. bioRxiv.

neuralte's Issues

ValueError: Input 0 is incompatible with layer model: expected shape=(None, 1733, 1), found shape=(None, 305, 1)

Hello!

First of all, thanks for developing NeuralTE.
I was trying to run Classifier.py using the NeuralTE/demo/test.fa dataset exactly as described in "Demo data" README section.

python /home/NeuralTE/src/Classifier.py --data /home/NeuralTE/demo/test.fa --outdir /home/NeuralTE_test/test --model_path /home/NeuralTE/models/NeuralTE_model.h5 --thread 10

I installed NeuralTE on a Linux machine as specified in "Option 1. Run with Conda".

The outputs in the outdir are:
domain minority segLTR2intactLTR.map temp test.fa test.fa.domain

I am getting the following error:
ValueError: Input 0 is incompatible with layer model: expected shape=(None, 1733, 1), found shape=(None, 305, 1)

Is there anything I can do about this error?
Thanks!
