
sequence2vec's Introduction

kmer2vec

kmer2vec is an algorithm that provides an embedding strategy for short sequences (k <= 10), so that reads which share overlapping sequences tend to be "close" in N-dimensional space.

read2vec

read2vec is an algorithm that provides a read-level (n = 150, 250, or 300) embedding strategy, so that reads which share overlapping sequences tend to be "close" in N-dimensional space.

contig2vec

contig2vec is an algorithm that clusters and "assembles" reads based on graphical features of the reads and biological evidence (e.g. paired-end reads). Segments from the same genome should be "close" to each other.

sequence2vec's People

Contributors

hh1985


sequence2vec's Issues

Function for benchmarking the embedding

Develop a function to benchmark the proposed embedding approach against the true positions of the reads.

  1. Fix the order of the reads and calculate the relative distances between them (to measure the variation of the distance).
  2. Sort the distances and calculate the maximum order difference (to avoid false positives).
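
One possible reading of the two checks above is sketched below. It is not the project's implementation: the function names, the use of Euclidean distance, and the choice of the first read as the reference point are all assumptions made for illustration. The embeddings are assumed to be passed in true-position order.

```python
import math
import statistics


def adjacent_distances(embeddings):
    """Euclidean distance between consecutive reads.

    Assumes `embeddings` is already sorted by the true read position.
    """
    return [math.dist(a, b) for a, b in zip(embeddings, embeddings[1:])]


def distance_variation(embeddings):
    """Population standard deviation of the neighbour-to-neighbour distances."""
    return statistics.pstdev(adjacent_distances(embeddings))


def max_order_difference(embeddings):
    """Largest rank shift when reads are re-sorted by embedding distance.

    Reads are ranked by their distance to the first read (an assumed
    reference); the result is the largest gap between that rank and the
    true-position index.
    """
    ref = embeddings[0]
    order = sorted(range(len(embeddings)),
                   key=lambda i: math.dist(ref, embeddings[i]))
    return max(abs(rank - idx) for rank, idx in enumerate(order))


if __name__ == "__main__":
    ordered = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
    print(distance_variation(ordered), max_order_difference(ordered))
```

A low variation and a small maximum order difference would both suggest that the embedding preserves the true read order; which thresholds count as "good" is left open here.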

Updated execution step

Discussion with ChatGPT about the development workflow.

Steps:

  1. K-mer Representation:

    • Convert each DNA read into a k-mer frequency vector.
  2. Dimensionality Reduction (Optional):

    • Use PCA or another technique to reduce the dimensions of your k-mer vectors.
  3. Grid-based Partitioning:

    • Divide the (reduced) feature space into grids and assign each k-mer vector to a grid cell.
  4. Local Graph Construction:

    • Within each grid cell, construct a graph where nodes represent individual reads.
    • Create edges between nodes based on the actual overlap of the reads.
  5. Edge Weights (Optional):

    • Assign weights to the edges based on the degree of overlap.
  6. Global Graph Integration:

    • Connect graphs from different grid cells based on overlapping reads that fall into adjacent cells.
  7. Graph Refinement:

    • Remove or merge nodes/edges based on specific criteria, like low overlap or edge weight.
  8. Analysis:

    • Apply graph analytics to identify patterns, like connected components that might indicate bacterial strains.
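
Steps 1 and 3 can be sketched with the standard library alone. This is a minimal illustration, not the project's code: `kmer_vector`, `grid_cell`, `partition`, and the default `cell_size` are hypothetical names and values. The optional PCA step is skipped to keep the sketch dependency-free, so the grid is built directly on the k-mer frequency vectors; in practice the vectors would likely be reduced first.

```python
from collections import Counter, defaultdict
from itertools import product


def kmer_vector(read, k=2):
    """Normalised k-mer frequency vector over the alphabet ACGT (length 4**k).

    K-mers containing other characters (e.g. N) are ignored.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    total = sum(counts[km] for km in kmers) or 1  # avoid division by zero
    return [counts[km] / total for km in kmers]


def grid_cell(vector, cell_size=0.25):
    """Map a vector to integer grid coordinates, one per dimension."""
    return tuple(int(x // cell_size) for x in vector)


def partition(reads, k=2, cell_size=0.25):
    """Group read indices by the grid cell of their k-mer vector."""
    cells = defaultdict(list)
    for idx, read in enumerate(reads):
        cells[grid_cell(kmer_vector(read, k), cell_size)].append(idx)
    return dict(cells)


if __name__ == "__main__":
    print(partition(["AAAA", "AAAA", "CCCC"], k=1))
```

Each key of the returned dictionary is one grid cell, and each value lists the reads that would feed the local graph construction in step 4.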

Potential Issues:

  1. Scalability: Creating a graph within each grid cell can still be computationally expensive for large datasets.

  2. Boundary Cases: Reads near the edge of one grid cell might actually be more related to reads in an adjacent cell, leading to missed connections.

  3. Parameter Tuning: The size of the grid and the criteria for edge creation need to be carefully selected.

Adding grid-based partitioning as a step can help manage computational costs by allowing you to work on smaller subsets of the data at a time.
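
One common way to soften the boundary-case issue listed above is to compare each read against the reads in adjacent cells as well as its own. The sketch below assumes that cells are keyed by tuples of integer coordinates; `neighbouring_cells` and `candidate_reads` are hypothetical helper names.

```python
from itertools import product


def neighbouring_cells(cell):
    """Cells that differ from `cell` by at most 1 in every dimension.

    The cell itself is excluded.
    """
    offsets = product((-1, 0, 1), repeat=len(cell))
    return [tuple(c + o for c, o in zip(cell, off))
            for off in offsets if any(off)]


def candidate_reads(cell, cells):
    """Read indices in `cell` followed by those in adjacent occupied cells."""
    found = list(cells.get(cell, []))
    for nb in neighbouring_cells(cell):
        found.extend(cells.get(nb, []))
    return found


if __name__ == "__main__":
    cells = {(0, 0): [0], (1, 1): [1], (3, 3): [2]}
    print(candidate_reads((0, 0), cells))
```

The number of neighbours grows as 3**d - 1 with the number of dimensions d, so this lookup is only practical after the feature space has been reduced to a few dimensions.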

Extract data points in local regions.

Perform k-mer representation, PCA and grid partitioning to extract local data points. The purpose of this step is to build a local graph based on sequence overlap, using only the local data points.
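
Once the local data points of a cell are known, a local graph could be built roughly as follows. This is a sketch under stated assumptions: overlap is taken to mean an exact suffix-prefix match of at least `min_overlap` bases (with `min_overlap` >= 1), the edge weight is the overlap length, and all function names are hypothetical. Real reads would need tolerance for sequencing errors and reverse complements, which is not handled here.

```python
def overlap_length(a, b, min_overlap=3):
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 if no such overlap of at least `min_overlap` bases exists.
    """
    for n in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def build_local_graph(reads, indices, min_overlap=3):
    """Weighted undirected edges {(i, j): overlap} among the reads of one cell."""
    edges = {}
    for pos, i in enumerate(indices):
        for j in indices[pos + 1:]:
            weight = max(overlap_length(reads[i], reads[j], min_overlap),
                         overlap_length(reads[j], reads[i], min_overlap))
            if weight:
                edges[(i, j)] = weight
    return edges


def connected_components(indices, edges):
    """Group read indices that are linked by edges, using union-find."""
    parent = {i: i for i in indices}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    groups = {}
    for i in indices:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    reads = ["ACGTAC", "TACGGA", "GGATTT", "CCCCCC"]
    local = [0, 1, 2, 3]  # read indices that fell into one grid cell
    edges = build_local_graph(reads, local)
    print(edges, connected_components(local, edges))
```

The pairwise comparison is quadratic in the number of reads per cell, which is the reason for restricting it to local data points.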
