human_animal_ecoli's Introduction

human animal ecoli

This project demonstrates XGBoost classification on k-mer data. The pipeline pre-processes the k-mer data into a manageable format, trains an XGBoost model, and evaluates its performance.


Table of Contents

  1. Installation
  2. Usage
  3. Dataset Processing
  4. XGBoost Model Training
  5. Performance Evaluation
  6. Contact

Installation

Clone this repository to your local machine and navigate to the project directory. Execute the following command to install necessary dependencies:

pip install -r requirements.txt

Usage

Train the model and compute SHAP values

Edit the parameters in the input.json file, then run:

python main.py input.json

Plot data

Import kmer_ml_pacakge.visualization to plot the data.


Dataset Processing

K-mer Dataset Processing

  • Reading Chunk Files: The k-mer data is split across 2000 chunk files, each containing part of the data to be processed.
  • Generating a Sparse Matrix: A sparse matrix is built from the processed chunk files, which keeps storage and computation efficient.
  • Filtering Data: The dataset is filtered by label to remove two designated data types. The remaining data is categorized into Human & Animal (HA), Human & Human (HH), and Animal & Animal (AA) based on predefined conditions.
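The steps above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the record layout `(sample_id, kmer, count)`, the label names, and the helper names `build_sparse_matrix`, `filter_samples`, and `pair_category` are all assumptions, and reading HA/HH/AA as categories of sample pairs is one possible interpretation of the "predefined conditions". A library such as scipy.sparse would normally hold the matrix; a plain dictionary of non-zero entries is used here to keep the sketch self-contained.

```python
from collections import defaultdict


def build_sparse_matrix(chunks):
    """Accumulate non-zero entries from an iterable of chunks.

    Each chunk is an iterable of (sample_id, kmer, count) records.
    This layout is hypothetical, not the project's file format.
    """
    sample_index, kmer_index = {}, {}
    entries = defaultdict(int)  # (row, col) -> count; zeros are never stored
    for chunk in chunks:
        for sample_id, kmer, count in chunk:
            row = sample_index.setdefault(sample_id, len(sample_index))
            col = kmer_index.setdefault(kmer, len(kmer_index))
            entries[(row, col)] += count
    return dict(entries), sample_index, kmer_index


def filter_samples(entries, sample_index, labels, excluded):
    """Drop rows whose label is in `excluded` and re-index the kept rows."""
    kept = [s for s in sample_index if labels[s] not in excluded]
    new_index = {s: i for i, s in enumerate(kept)}
    old_to_new = {sample_index[s]: new_index[s] for s in kept}
    filtered = {
        (old_to_new[row], col): value
        for (row, col), value in entries.items()
        if row in old_to_new
    }
    return filtered, new_index


def pair_category(label_a, label_b):
    """Map a pair of host labels to HA, HH, or AA (assumed rule)."""
    pair = {label_a, label_b}
    if pair == {"human"}:
        return "HH"
    if pair == {"animal"}:
        return "AA"
    if pair == {"human", "animal"}:
        return "HA"
    raise ValueError(f"unexpected labels: {label_a!r}, {label_b!r}")


# Two toy chunks standing in for the 2000 chunk files.
chunks = [
    [("s1", "ACG", 2), ("s2", "CGT", 1)],
    [("s1", "CGT", 3), ("s3", "ACG", 5), ("s4", "GTA", 1)],
]
# Hypothetical labels; "environment" stands in for a removed data type.
labels = {"s1": "human", "s2": "animal", "s3": "environment", "s4": "human"}

entries, sample_index, kmer_index = build_sparse_matrix(chunks)
filtered, new_index = filter_samples(
    entries, sample_index, labels, excluded={"environment"}
)
```

Processing one chunk at a time means only the non-zero counts accumulate in memory, which is the point of using a sparse representation for k-mer data.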

XGBoost Model Training

Grid Search and K-Fold Cross Validation

  • Hyperparameter Tuning: A grid search is run over candidate hyperparameter values to find the best-performing set of parameters for the XGBoost model.
  • K-Fold Cross Validation: The model's performance is assessed with K-fold cross-validation, which checks that it is consistent across different subsets of the data.
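The mechanics of combining the two steps can be sketched as follows. This is an illustration under stated assumptions, not the project's training code: `k_fold_indices`, `grid_search_cv`, and `threshold_accuracy` are hypothetical names, and the toy threshold "model" stands in for an XGBoost classifier so that the example runs with the standard library alone.

```python
from itertools import product
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def grid_search_cv(param_grid, X, y, fit_and_score, k=5):
    """Return (best_params, best_mean_score) over every grid combination."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for train_idx, test_idx in k_fold_indices(len(X), k):
            fold_scores.append(
                fit_and_score(
                    params,
                    [X[i] for i in train_idx], [y[i] for i in train_idx],
                    [X[i] for i in test_idx], [y[i] for i in test_idx],
                )
            )
        score = mean(fold_scores)  # average over folds, not a single split
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def threshold_accuracy(params, X_train, y_train, X_test, y_test):
    """Stand-in model: predict 1 when x exceeds params['threshold'].

    It ignores the training data; a real model would be fitted on it.
    """
    predictions = [1 if x > params["threshold"] else 0 for x in X_test]
    return sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)


X = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.15, 0.85]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
best_params, best_score = grid_search_cv(
    {"threshold": [0.05, 0.5, 0.95]}, X, y, threshold_accuracy, k=5
)
```

With XGBoost installed, `fit_and_score` would instead fit an `xgboost.XGBClassifier(**params)` on the training fold and score it on the held-out fold. The folds here are contiguous for simplicity; shuffled or stratified folds are usually preferable for class-imbalanced data.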

Performance Evaluation

Confidence Intervals for Classification Metrics

  • Calculating Confidence Intervals: Confidence intervals for classification metrics such as precision, recall, and F1-score are calculated and stored. Each interval gives a range within which the true value of the metric is likely to fall, indicating how stable the model's performance is.
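One common way to obtain such intervals is a percentile bootstrap over the test predictions. The README does not state which method the project uses, so the sketch below is an assumption; the function names, the 95% level, and the toy labels are illustrative only.

```python
import random


def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall, and F1 with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def bootstrap_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for each metric (assumed method)."""
    rng = random.Random(seed)  # fixed seed so the intervals are reproducible
    n = len(y_true)
    samples = {"precision": [], "recall": [], "f1": []}
    for _ in range(n_resamples):
        # Resample prediction/label pairs with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        metrics = precision_recall_f1(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )
        for name, value in metrics.items():
            samples[name].append(value)
    intervals = {}
    for name, values in samples.items():
        values.sort()
        lower = values[int((alpha / 2) * n_resamples)]
        upper = values[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
        intervals[name] = (lower, upper)
    return intervals


# Toy labels and predictions, not results from the project.
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1, 0, 0]
point_estimates = precision_recall_f1(y_true, y_pred)
intervals = bootstrap_ci(y_true, y_pred)
```

A narrow interval suggests the metric would change little on a different sample of the same size, while a wide one signals that the point estimate should be read with caution.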

Contact


This documentation provides a high-level overview of the code structure and the project's primary functionalities. For more detailed information, please refer to the inline comments within the script file.

Contributors

tacshi, gzhoubioinf
