superzchen / ilearnplus

iLearnPlus is the first machine-learning platform with both graphical- and web-based user interface that enables the construction of automated machine-learning pipelines for computational analysis and predictions using nucleic acid and protein sequences.

Python 100.00%

automated-modelling bioinformatics-tool biomedical-data-analytics deep-learning feature-selection machine-learning prediction python sequence-analysis

ilearnplus's Issues

A question about the PSTNPss function in util/FileProcessing.py

Hello, I am interested in your project, but I have some questions when reading your source code. I hope you can help me answer them.

My question is about the PSTNPss function in util/FileProcessing.py.
Question 1: I see that in this function, you subtract one from the total number of samples for the corresponding label and subtract one from the trinucleotide count at the corresponding location. I don’t understand the purpose and principle of doing this.

p_num, n_num = positive_number, negative_number
po_number = matrix_po[j][order[sequence[j: j + 3]]]
if i[0] in positive_key and po_number > 0:
    po_number -= 1
    p_num -= 1
ne_number = matrix_ne[j][order[sequence[j: j + 3]]]
if i[0] in negative_key and ne_number > 0:
    ne_number -= 1
    n_num -= 1

Question 2: This function processes the training dataset and the testing dataset differently. For the training dataset, you perform the subtraction above, but not for the testing dataset. I don’t understand why there is such a difference. I have attached your code snippet for your convenience. Thank you for your time and help!

    def PSTNPss(self):
        try:
            if not self.is_equal:
                self.error_msg = 'PSTNPss descriptor need fasta sequence with equal length.'
                return False

            fastas = []
            for item in self.fasta_list:
                if item[3] == 'training':
                    fastas.append(item)
                    fastas.append([item[0], item[1], item[2], 'testing'])
                else:
                    fastas.append(item)

            for i in fastas:
                if re.search('[^ACGT-]', i[1]):
                    self.error_msg = 'Illegal character included in the fasta sequences, only the "ACGT[U]" are allowed by this encoding scheme.'
                    return False

            encodings = []
            header = ['SampleName', 'label']
            for pos in range(len(fastas[0][1]) - 2):
                header.append('Pos.%d' % (pos + 1))
            encodings.append(header)

            positive = []
            negative = []
            positive_key = []
            negative_key = []
            for i in fastas:
                if i[3] == 'training':
                    if i[2] == '1':
                        positive.append(i[1])
                        positive_key.append(i[0])
                    else:
                        negative.append(i[1])
                        negative_key.append(i[0])

            nucleotides = ['A', 'C', 'G', 'T']
            trinucleotides = [n1 + n2 + n3 for n1 in nucleotides for n2 in nucleotides for n3 in nucleotides]
            order = {}
            for i in range(len(trinucleotides)):
                order[trinucleotides[i]] = i

            matrix_po = self.CalculateMatrix(positive, order)
            matrix_ne = self.CalculateMatrix(negative, order)

            positive_number = len(positive)
            negative_number = len(negative)

            for i in fastas:
                if i[3] == 'testing':
                    name, sequence, label = i[0], i[1], i[2]
                    code = [name, label]
                    for j in range(len(sequence) - 2):
                        if re.search('-', sequence[j: j + 3]):
                            code.append(0)
                        else:
                            p_num, n_num = positive_number, negative_number
                            po_number = matrix_po[j][order[sequence[j: j + 3]]]
                            if i[0] in positive_key and po_number > 0:
                                po_number -= 1
                                p_num -= 1
                            ne_number = matrix_ne[j][order[sequence[j: j + 3]]]
                            if i[0] in negative_key and ne_number > 0:
                                ne_number -= 1
                                n_num -= 1
                            code.append(po_number / p_num - ne_number / n_num)
                            # print(sequence[j: j+3], order[sequence[j: j+3]], po_number, p_num, ne_number, n_num)
                    encodings.append(code)
            self.encoding_array = np.array([])
            self.encoding_array = np.array(encodings, dtype=str)
            self.column = self.encoding_array.shape[1]
            self.row = self.encoding_array.shape[0] - 1
            del encodings
            if self.encoding_array.shape[0] > 1:
                return True
            else:
                return False
        except Exception as e:
            self.error_msg = str(e)
            return False
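One possible reading of the snippet above, not confirmed by the maintainers: every training sequence is appended a second time with the role 'testing' so that it gets encoded, and the subtraction removes that sequence's own contribution from the counts before its features are computed (a leave-one-out style correction). Sequences that are genuinely for testing never contributed to matrix_po or matrix_ne, so there is nothing to subtract for them. The sketch below illustrates the effect on toy data. The function names, the own_class flag, and the zero-total guard are assumptions made for this illustration and are not part of iLearnPlus.

```python
from collections import Counter


def position_counts(sequences, k=3):
    """Count each k-mer at each position across equal-length sequences."""
    length = len(sequences[0]) - k + 1
    counts = [Counter() for _ in range(length)]
    for seq in sequences:
        for j in range(length):
            counts[j][seq[j:j + k]] += 1
    return counts


def ratio(count, total):
    """Frequency with a guard for an empty class (assumption: the quoted code has no such guard)."""
    return count / total if total else 0.0


def encode(seq, positives, negatives, own_class=None, k=3):
    """Position-specific frequency difference between the positive and negative sets.

    own_class is 'pos' or 'neg' when seq itself belongs to that training set,
    otherwise None. The quoted code decides this by sample name instead.
    """
    pos_counts = position_counts(positives, k)
    neg_counts = position_counts(negatives, k)
    features = []
    for j in range(len(seq) - k + 1):
        kmer = seq[j:j + k]
        p, p_total = pos_counts[j][kmer], len(positives)
        n, n_total = neg_counts[j][kmer], len(negatives)
        # Remove the sequence's own occurrence so it does not score itself.
        if own_class == "pos" and p > 0:
            p, p_total = p - 1, p_total - 1
        if own_class == "neg" and n > 0:
            n, n_total = n - 1, n_total - 1
        features.append(ratio(p, p_total) - ratio(n, n_total))
    return features


positives = ["ACGT", "ACGA"]
negatives = ["TTTT", "TTGA"]

# Without the correction, "ACGT" is partly scored against its own counts.
print(encode("ACGT", positives, negatives))                   # [1.0, 0.5]
# With the correction, the self-contributed 0.5 at position 2 disappears.
print(encode("ACGT", positives, negatives, own_class="pos"))  # [1.0, 0.0]
```

In this toy case the second feature drops from 0.5 to 0.0 because the only positive sequence containing "CGT" at that position was the sequence being encoded. Without the subtraction, training features would carry information about each sample's own label that is not available for unseen test sequences.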

I have issues with selecting particular feature descriptors.

First of all, thank you for such a nice feature extraction tool.

In iLearnPlus Basic, I was not able to select particular descriptors. Could you please let me know how to solve this issue?

I have also attached an image for your reference. Please check it.

I look forward to hearing from you soon.

Thank you.

PSTNPss cannot be used

Hello, I entered the FASTA sequence in the required format. However, PSTNPss cannot be used.

Multi label problems

In multi-label problems, all performance evaluation metrics are shown as NA except Accuracy. The ROC/PRC curves are also not generated. Kindly help.

Pop out errors

Sometimes, when processing FASTA sequence data, an error pops up showing 'RG', sometimes "divided by zero", and sometimes just a generic error (with the program not responding). How can I fix errors like these? Thanks in advance.

Inquiry: Missing Source Code for Protein Features

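I remember that when I visited before, the source code for each feature was provided, for example the source code for protein AAC. This time, however, I cannot find it. Has something changed, or am I missing something?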

Is it possible to run iLearnPlus in the terminal with threads?

I would like to run parallel computations on a large dataset using AWS (either an EC2 instance or SageMaker). Is there a way to scale iLearnPlus to large datasets for faster processing?
Thank you for developing this wonderful package.

Show a warning if the special FASTA header format is violated

In a large dataset of automatically downloaded sequences, some names may include the "|" symbol.
I also append the class and train/test labels automatically.
When I try to analyze such a file, I get uninformative error messages like:

ValueError: could not convert string to float: 'P42577.2'
ValueError: invalid literal for int() with base 10: '6LPD'

These are caused by incorrect FASTA headers:

P42577.2_sp|P42577.2|FRIS_LYMST|0|training
6LPD_pdb|6LPD|F|1|training

A simple check when importing the file could show a warning to the user.
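A minimal sketch of such a check is given below. It assumes the expected header layout is name|label|role, with an integer label and a role of either training or testing. This layout is inferred from the examples in this issue, not taken from the iLearnPlus source. The function names check_header and find_bad_headers are hypothetical.

```python
import warnings


def check_header(header):
    """Return a list of problems found in one FASTA header (without the leading '>')."""
    fields = header.split("|")
    if len(fields) != 3:
        return ["expected 3 '|'-separated fields, found %d" % len(fields)]
    _name, label, role = fields
    problems = []
    try:
        int(label)
    except ValueError:
        problems.append("label %r is not an integer" % label)
    if role not in ("training", "testing"):
        problems.append("role %r is not 'training' or 'testing'" % role)
    return problems


def find_bad_headers(lines):
    """Warn about every malformed header and return (line number, header, problems) tuples."""
    bad = []
    for line_no, line in enumerate(lines, start=1):
        if not line.startswith(">"):
            continue  # sequence line, nothing to validate here
        header = line[1:].strip()
        problems = check_header(header)
        if problems:
            warnings.warn("line %d: %s: %s" % (line_no, header, "; ".join(problems)))
            bad.append((line_no, header, problems))
    return bad


example = [
    ">P42577.2_sp|P42577.2|FRIS_LYMST|0|training",
    "MKTAYIAK",
    ">seq_1|1|training",
    "MKTAYIAK",
]
bad = find_bad_headers(example)
```

With the two headers from this issue, the extra "|" characters inside the sequence name produce five fields instead of three, so the problem is reported at import time instead of surfacing later as a ValueError.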

The results of DNA sequence and reverse complementary sequence are inconsistent

Why are the results of my trained model inconsistent between a DNA sequence and its complementary sequence? For example, Score 1 for a DNA sequence is larger than 0.5, but Score 1 for its complementary sequence is smaller than 0.5.
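One possible contributing factor, offered as an illustration and not as a diagnosis of this particular model: a sequence and its reverse complement are different strings, so composition-based or position-based descriptors computed from them generally differ, and a model that was not trained on both strands has no reason to score them identically. The sketch below shows this for trinucleotide counts. The helper names are hypothetical and are not iLearnPlus functions.

```python
from collections import Counter

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_counts(seq, k=3):
    """Count overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


seq = "AAACGG"
rc = reverse_complement(seq)

print(rc)                # CCGTTT
print(kmer_counts(seq))  # k-mers: AAA, AAC, ACG, CGG
print(kmer_counts(rc))   # k-mers: CCG, CGT, GTT, TTT
```

The two sequences share no trinucleotide here, so any descriptor built on these counts would give the model different inputs for the two strands.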

Something unrelated, but can you please help?

Hi,
I have tried all the solutions I could find on Google for installing PyQt5.

(base) amit@amit-X705UDR:~$ /home/amit/miniconda3/bin/python
Python 3.8.12 (default, Oct 12 2021, 13:49:34) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from ilearnplus import runiLearnPlus
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/amit/.local/lib/python3.8/site-packages/ilearnplus/__init__.py", line 5, in <module>
    from .iLearnPlusBasic import *
  File "/home/amit/.local/lib/python3.8/site-packages/ilearnplus/iLearnPlusBasic.py", line 7, in <module>
    from PyQt5.QtWidgets import (QApplication, QWidget, QPushButton, QFileDialog, QLabel, QHBoxLayout, QGroupBox, QTextEdit,
ModuleNotFoundError: No module named 'PyQt5'

Kindly suggest.
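A small diagnostic sketch that may help narrow this down: it prints which interpreter is running and whether that interpreter can locate PyQt5. It uses only the standard library, and the helper name module_available is hypothetical.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the running interpreter can locate a top-level module."""
    return importlib.util.find_spec(name) is not None


print("Interpreter:", sys.executable)
print("PyQt5 importable:", module_available("PyQt5"))
```

If this prints False when run with the same /home/amit/miniconda3/bin/python shown in the traceback, PyQt5 is not installed for that interpreter. Installing it with that interpreter's own pip (for example, running that Python with -m pip install PyQt5) would be one thing to try, assuming a wheel is available for the platform.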
