kmer-counting's Introduction

Kmer counting

Task 1: Implement countKmers function

Implement the countKmers function in countKmers.R. The function takes two arguments: 1) a character vector containing the sequence, and 2) a number, k. It should count every k-mer of length k in the given sequence and return a table. The returned table must support retrieving the count of a k-mer with single-bracket item access: for example, if you counted all 5-mers, result["ATGCC"] should give the count of the ATGCC 5-mer. (30 pts)

Note about package dependencies

You should implement this function using only base R or tidyverse packages. In particular, you may not use the Biostrings package, which contains a similar implementation.
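One possible base-R approach is sketched below. It is not the reference solution: it assumes the input is a single sequence (a character vector with several elements is simply pasted together), and it returns NA rather than 0 for k-mers that never occur.

```r
# Illustrative sketch only; the graded solution may differ.
countKmers <- function(sequence, k) {
  # Assumption: several elements are contiguous pieces of one sequence.
  s <- paste(sequence, collapse = "")
  n <- nchar(s)
  if (k < 1 || k > n) {
    return(table(character(0)))  # no k-mers to count
  }
  # Every k-mer starts at one of the positions 1..(n - k + 1).
  starts <- seq_len(n - k + 1)
  kmers <- substring(s, starts, starts + k - 1)
  table(kmers)
}

counts <- countKmers("ATGCCATGCC", 5)
counts["ATGCC"]  # ATGCC occurs at positions 1 and 6
```

Because table() returns a named object, single-bracket access by k-mer name works without extra code.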

Task 2: Use PCA to explore relationships among sequences

For the second task, use your countKmers function to count all 4-mers for each of the sequences found in the fasta folder. Build a matrix of 4-mer counts, with 1 row per sequence and 1 column per 4-mer. Next, use PCA to reduce the dimensionality of the 4-mer matrix, and plot the first two principal components in a scatterplot using ggplot2. (20 pts)
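The steps above can be sketched as follows. The sequences here are randomly generated stand-ins (the real ones come from the fasta folder), count4 is a hypothetical local helper standing in for your countKmers, and prcomp is used without scaling because 4-mers that never occur would give zero-variance columns.

```r
# Toy stand-ins for the sequences in the fasta folder (hypothetical data).
set.seed(1)
bases <- c("A", "C", "G", "T")
seqs <- vapply(1:6, function(i) {
  paste(sample(bases, 200, replace = TRUE), collapse = "")
}, character(1))
names(seqs) <- paste0("seq", 1:6)

# All 256 possible 4-mers, so every row of the matrix has the same columns.
all4mers <- apply(
  expand.grid(rep(list(bases), 4), stringsAsFactors = FALSE),
  1, paste, collapse = ""
)

# Hypothetical helper: counts 4-mers, including those with a count of zero.
count4 <- function(s) {
  starts <- seq_len(nchar(s) - 3)
  table(factor(substring(s, starts, starts + 3), levels = all4mers))
}

# One row per sequence, one column per 4-mer.
kmerMatrix <- t(vapply(seqs, function(s) as.vector(count4(s)), integer(256)))
colnames(kmerMatrix) <- all4mers

# PCA on the centered counts; keep the first two components for plotting.
pca <- prcomp(kmerMatrix, center = TRUE, scale. = FALSE)
scores <- data.frame(
  sequence = rownames(kmerMatrix),
  PC1 = pca$x[, 1],
  PC2 = pca$x[, 2]
)

# Plot only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(scores, ggplot2::aes(x = PC1, y = PC2, label = sequence)) +
    ggplot2::geom_point() +
    ggplot2::geom_text(vjust = -0.8)
  print(p)
}
```

Fixing the column set to all 256 possible 4-mers keeps the rows aligned even when a sequence lacks some 4-mers.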

Task 3: Construct a Bloom Filter to test for k-mer presence

Bloom filters are membership testers that can be used to test whether a given k-mer is present in a large dataset. They are efficient in both memory and speed. For this task, you will implement a Bloom Filter for k-mers. You should be able to insert k-mers into the Bloom Filter as short strings, and then test the filter for presence.

Use the skeleton R class provided in bloomFilter.R. The outline has been provided for you; you will need to implement 4 methods on this class: addElement (15 pts), testElement (15 pts), hashElement (15 pts), and resetFilter (5 pts).
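To illustrate how the four methods fit together, here is a minimal closure-based sketch. It does not follow the skeleton in bloomFilter.R, whose structure may differ: makeBloomFilter, its size and nHashes parameters, and the polynomial hash with fixed multipliers are all assumptions made for the example.

```r
# Illustrative sketch only; not the class layout from bloomFilter.R.
makeBloomFilter <- function(size = 1000, nHashes = 3) {
  candidates <- c(31, 37, 41, 43, 47, 53, 59, 61)
  stopifnot(nHashes >= 1, nHashes <= length(candidates))
  multipliers <- candidates[seq_len(nHashes)]
  bits <- logical(size)  # the bit array, all FALSE at the start

  # Map a string to nHashes positions in 1..size using polynomial hashes
  # with different multipliers (a simple, hypothetical choice of hash).
  hashElement <- function(element) {
    codes <- utf8ToInt(element)
    vapply(multipliers, function(m) {
      h <- 7
      for (code in codes) {
        h <- (h * m + code) %% 2147483647  # stays well within double precision
      }
      h %% size + 1
    }, numeric(1))
  }

  # Set every bit selected by the hashes.
  addElement <- function(element) {
    bits[hashElement(element)] <<- TRUE
    invisible(NULL)
  }

  # Present only if every selected bit is set; false positives are possible.
  testElement <- function(element) {
    all(bits[hashElement(element)])
  }

  # Clear the whole bit array.
  resetFilter <- function() {
    bits <<- logical(size)
    invisible(NULL)
  }

  list(addElement = addElement, testElement = testElement,
       hashElement = hashElement, resetFilter = resetFilter)
}

bf <- makeBloomFilter(size = 1000, nHashes = 3)
for (kmer in c("ATGCC", "TGCCA", "GCCAT")) bf$addElement(kmer)
insertedFound <- bf$testElement("ATGCC")  # inserted elements always test TRUE
bf$resetFilter()
foundAfterReset <- bf$testElement("ATGCC")  # FALSE once the filter is cleared
```

An inserted element can never test FALSE, because adding it sets exactly the bits that testing it checks; only false positives can occur.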

To test your work and see examples of how to use the class, consult tests/testDriverBloom.R. To pass the tests, your Bloom Filter must achieve a 100% true positive rate (every inserted element must return TRUE when tested for membership) and a false positive rate below 10% when non-inserted elements are tested for presence under the test conditions. If your hash functions are well designed, your Bloom Filter should, in theory, be able to achieve a 2% false positive rate under the conditions of the tests.
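The theoretical false positive rate can be estimated with the standard approximation (1 - exp(-k * n / m))^k for a filter with m bits, n inserted elements, and k hash functions. The sketch below uses made-up values, since the actual sizes used by the tests are not stated here; bloomFalsePositiveRate is a hypothetical helper name.

```r
# Standard approximation for the false positive rate of a Bloom filter,
# assuming independent, uniformly distributed hash functions.
bloomFalsePositiveRate <- function(m, n, k) {
  (1 - exp(-k * n / m))^k
}

# Hypothetical parameters, not the ones used by tests/testDriverBloom.R.
fpRate <- bloomFalsePositiveRate(m = 1000, n = 100, k = 3)
fpRate  # roughly 0.017, i.e. a bit under 2%
```

If a real filter's measured rate is far above this estimate, the hash functions are probably clustering elements onto the same bits.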

Testing

You can test your work by running:

Rscript tests/testDriver.R

To test only the k-mer counting task:

Rscript tests/testDriverKmer.R

To test only the Bloom Filter task:

Rscript tests/testDriverBloom.R

kmer-counting's People

Contributors

nleroy917
