genome-scripts

A hodgepodge of scripts for working with genomes.

Transcriptomes

This folder contains instructions for the KU BI transcriptome pipeline.

Other stuff

filter_for_length_fastx.R removes scaffolds/contigs below a given length and reports total assembly size, number of contigs, and N10-N90/L10-L90. It requires a fastx file (tabular, one sequence per line) as input, which you can generate with the FASTX-Toolkit's fasta_formatter using the -t flag. Paste the R script into R, then call it as filter_for_length_fastx(your_fastx_file,contig_length).
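To make the Nx/Lx statistics concrete, here is a minimal Python sketch of how they can be computed from a tabular file of the kind fasta_formatter -t writes (name, a tab, then the sequence). It is an illustration only, not the code in filter_for_length_fastx.R: the function names are hypothetical, and treating the length cutoff as inclusive (keep sequences at least min_length long) is an assumption.

```python
from typing import Iterable, List, Tuple


def read_tabular(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse 'name<TAB>sequence' lines into (name, sequence) pairs."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        name, seq = line.split("\t", 1)
        records.append((name, seq))
    return records


def nx_lx(lengths: List[int], x: int) -> Tuple[int, int]:
    """Return (Nx, Lx).

    Walk the contigs from longest to shortest until the running total
    reaches x% of the assembly size. Nx is the length of the contig that
    crosses the threshold; Lx is how many contigs were needed.
    """
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running * 100 >= total * x:
            return length, count
    raise ValueError("no contigs supplied")


def filter_by_length(records: List[Tuple[str, str]],
                     min_length: int) -> List[Tuple[str, str]]:
    """Keep records at least min_length bases long (inclusive cutoff is an assumption)."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]
```

For example, contigs of 10, 8, 6, 4 and 2 bp give an assembly size of 30 bp, so N50 is 8 and L50 is 2: the two longest contigs together hold 18 bp, which is the first point at which at least half of the assembly is covered.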

filter_contig_by_length.R does the same thing but takes a fasta file as input. It is MUCH slower than filter_for_length_fastx.R, so it is not recommended.

onelining.R removes the line breaks within the sequence of each contig/scaffold (i.e. gets rid of the "hard wrap" that many programs place in fasta files by default). The resulting fasta file should have twice as many lines as there are contigs (one line for the header, one line for the sequence).
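The unwrapping step can be sketched in a few lines of Python. This is a hypothetical illustration of the idea, not a port of onelining.R; the function name unwrap_fasta is made up for the example.

```python
from typing import Iterable, Iterator


def unwrap_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield each header line followed by its sequence joined onto one line."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                # Emit the record collected so far before starting a new one.
                yield header
                yield "".join(chunks)
            header = line
            chunks = []  # any sequence text before the first header is dropped
        else:
            chunks.append(line)
    if header is not None:
        # Emit the final record.
        yield header
        yield "".join(chunks)
```

Calling list(unwrap_fasta(open_file)) on a wrapped fasta file returns two lines per contig, which matches the line-count check described above.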

linebylineblast.sh (and the associated R scripts onelining_tempseq.R and linebyline.R) blasts an assembly against itself to look for regions that may have been subject to a "false tandem duplication" (e.g. instead of alleles being assembled together, they have been assembled end to end).

summarizing_blast_hits.R summarizes the results of linebylineblast.sh.

length_dist.R can be used to get the lengths of the contigs/scaffolds in the assembly and to generate histogram summaries of this data, as well as the percentage of the assembly located in each contig/scaffold length bin.
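As a rough illustration of the per-bin summary, the following Python sketch groups contig lengths into fixed-width bins and reports, for each bin, the number of contigs and the percentage of the assembly they hold. The function name and the fixed bin width are assumptions made for the example; length_dist.R may choose its bins differently.

```python
from typing import Dict, List, Tuple


def length_bin_summary(lengths: List[int],
                       bin_width: int) -> List[Tuple[int, int, int, float]]:
    """Return (bin_start, bin_end, n_contigs, percent_of_assembly) per bin.

    A contig of length L falls in the bin [start, start + bin_width),
    where start is L rounded down to a multiple of bin_width.
    Empty bins are not reported.
    """
    total = sum(lengths)
    counts: Dict[int, int] = {}
    bases: Dict[int, int] = {}
    for length in lengths:
        start = (length // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
        bases[start] = bases.get(start, 0) + length
    return [
        (start, start + bin_width, counts[start], 100.0 * bases[start] / total)
        for start in sorted(counts)
    ]
```

With contigs of 100, 150, 250 and 500 bp and a bin width of 200, the 0-200 bin holds two contigs and 25% of the assembly, the 200-400 bin one contig and 25%, and the 400-600 bin one contig and 50%.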

vectorcontam.sh (and the associated R script scrub_genome.R) searches your genome for contamination, scrubs it from the genome, and outputs contigs/scaffolds greater than 100 bp in size.

restart_vector_contam.sh (and the associated R script restart_scrub_genome.R) restarts vectorcontam.sh if the job crashes partway through.

Version history

Published with Anolis TBD

These scripts wouldn't be possible without:
R Core Team. 2015. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.r-project.org/

Wickham, H., Francois, R., Henry, L. and Müller, K. dplyr: A grammar of data manipulation.

Wickham, H., Hester, J. and Francois, R. readr: Read rectangular text data.

Wickham, H. stringr: Simple, consistent wrappers for common string operations.

Contributors

laninsky
