genhub's Introduction

GenHub (deprecated)

NOTE: GenHub has been integrated into the AEGeAn Toolkit as the fidibus module and IS NO LONGER ACTIVELY DEVELOPED HERE!!!


GenHub is BSD-3 licensed.

GenHub is a free open-source software framework for analyzing eukaryotic genome content and organization. The Fidibus program calculates and reports a variety of statistics on interval loci (iLoci). Fidibus can analyze user-supplied genomes, and can also retrieve and process dozens of reference genomes directly from public databases (such as NCBI RefSeq) for easily reproducible comparative analysis.

For more information, see the GenHub user manual.

Obtaining GenHub

The easiest way to obtain GenHub is to install from the Python Package Index (PyPI) using the pip command.

pip install genhub

Make sure you have GenomeTools and AEGeAn installed. For more info and troubleshooting tips, be sure to check out the complete installation instructions.

Quick start: example usages

# Show all configuration settings
fidibus --help

# Compute iLoci for a user-supplied genome
fidibus --workdir=./ --local --gdna=MyGenome.fasta --gff3=MyAnnotation.gff3 \
        --prot=MyProteins.fasta --label=Gnm1 \
        prep iloci

# List all available reference genomes
fidibus list

# Download and pre-process the budding yeast genome, but do not compute iLoci
fidibus --workdir=/opt/data/genomes/ --refr=Scer download prep

# Download and completely process a few dozen Hymenopteran genomes, 4 at a time
fidibus --workdir=/opt/data/genomes/ --refr=hymenoptera --numprocs=4 \
        download prep iloci breakdown stats

# Download 9 green algae genomes, cluster proteins to identify homologous iLoci
fidibus --workdir=~/mydata/ --refrbatch=chlorophyta --numprocs=6 \
        download prep iloci breakdown cleanup cluster

# Process a user-supplied genome and several reference genomes for comparison
fidibus --workdir=/data/ --numprocs=4 --local --gdna=MyGenome.fasta \
        --gff3=MyAnnotation.gff3 --prot=MyProteins.fasta --label=Gnm1 \
        --refr=Atha,Bdis,Bole,Cari,Gmax,Grai,Mtru,Osat,Tcac \
        download prep iloci breakdown stats

For more detailed instructions on running Fidibus and other ancillary scripts, see the user manual.

Citing GenHub

GenHub is research software and must be cited if it is used in a published research project. GenHub will soon be in print, but in the meantime it can be cited as follows.

Standage DS, Brendel VP (2016) GenHub. GitHub repository, https://github.com/standage/genhub.

Additional Details

GenHub was originally dubbed HymHub and designed specifically to facilitate reproducible analysis of hymenopteran genomes. The need for a more general solution motivated the development of GenHub in its current incarnation. Rather than distributing processed data (which can occupy more than 1 GB of storage space per genome), GenHub provides portable code so that researchers can easily process reference genomes on their own computing resources. This is all tied closely to our research philosophy and our conviction that published computational results (along with supporting software and data) should be reproducible and transparent. More recently, we have implemented support for processing of user-supplied non-reference genomes.

genhub's People

Contributors

standage, vpbrendel

genhub's Issues

Polish documentation, dev environment

The pyyaml and pycurl packages have now been properly added to the setup.py file, so they are installed when you pip install genhub. The other packages in requirements.txt are for development. I need to update the docs (and my own development environment) to account for this.

Easily identify piLocus representatives

The *protein2ilocus.tsv file shows all proteins, not just those chosen to represent each piLocus. It would be helpful to have another mapping file with only the iLocus representatives shown.

Pass source to `genhub-format-gff3.py`

The script is a mess of unnecessary if/elif/else statements right now, and it would be a lot cleaner (although probably just as verbose) if it were passed the annotation source as a parameter.

Recipes for green algae

  • Auxenochlorella_protothecoides
  • Chlamydomonas_reinhardtii
  • Chlorella_variabilis
  • Coccomyxa_subellipsoidea
  • Micromonas_pusilla
  • Micromonas_sp._RCC299
  • Ostreococcus_lucimarinus
  • Ostreococcus_tauri
  • Volvox_carteri

nose --> py.test

The nose testing framework is no longer supported. The transition to py.test should be pretty seamless, as they support similar conventions for naming test functions. It'll mostly be a matter of updating dependencies, fixing makefile, etc.

Move, rename config files

  • Move the directory containing the config files into the distribution
  • Update setup.py accordingly
  • Rename to "recipes" or something like that

Script paths during build

Currently, the format task and the format.sh script are calling other scripts using relative file paths, assuming the user is calling from the genhub root directory. There needs to be a better way to resolve the script paths that doesn't involve clogging up a bunch of function signatures.
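One possible approach, sketched below, is to resolve script paths relative to the installed package rather than the caller's working directory. The helper names (`package_dir`, `script_path`) and the `scripts` subdirectory are assumptions made for illustration; they are not part of GenHub.

```python
import os


def package_dir():
    """Return the directory containing this module.

    Falls back to the current working directory when __file__ is not
    defined (for example, in an interactive session).
    """
    try:
        return os.path.dirname(os.path.abspath(__file__))
    except NameError:
        return os.getcwd()


def script_path(scriptname, basedir=None):
    """Return the path of a helper script.

    The "scripts" subdirectory is a hypothetical layout, not GenHub's
    actual directory structure.
    """
    if basedir is None:
        basedir = package_dir()
    return os.path.join(basedir, "scripts", scriptname)
```

Because the lookup lives in one module-level helper, callers only pass a script name and no function signature needs an extra path argument.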

Dmel format task fails

This is due to a gene model where the exon is labeled as a pseudogene but the gene feature itself is not. It therefore eludes the tidygff3 attempts to correct the feature types, and causes problems when downstream processing software tries to find the exon's parent RNA feature.

New rice recipe?

New rice entry in RefSeq: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/Oryza_sativa/all_assembly_versions/GCF_001433935.1_IRGSP-1.0/

NT_ and NW_ accessions in mammalian genomes correspond to patches, variants

The mouse and human genomes are very well finished, and the chromosome sequences are assigned NC_ accessions. NW_ and NT_ accessions do not correspond to unplaced genomic scaffolds as they do in many other species; instead, they correspond to patches or variants not (yet?) integrated into a major build release. This information is redundant and should be filtered out in preprocessing. Filtering annotations is simple, but if we don't want redundant sequences to be included in calculations, this will require implementing a new filtering mechanism.
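The annotation side of this filter could look roughly like the following sketch, which drops GFF3 features and `##sequence-region` pragmas whose sequence ID begins with an excluded prefix. The function names are hypothetical, and the filter is only appropriate for genomes (such as human and mouse) where these accessions denote patches or variants.

```python
def is_excluded(seqid, prefixes=("NT_", "NW_")):
    """Return True if the sequence ID starts with an excluded prefix."""
    return seqid.startswith(tuple(prefixes))


def filter_gff3(lines, prefixes=("NT_", "NW_")):
    """Yield GFF3 lines that do not refer to excluded sequences."""
    for line in lines:
        if line.startswith("##sequence-region"):
            # Pragma format: ##sequence-region seqid start end
            fields = line.split()
            if len(fields) > 1 and is_excluded(fields[1], prefixes):
                continue
        elif not line.startswith("#"):
            # Feature line: the sequence ID is the first tab-delimited column
            seqid = line.split("\t", 1)[0]
            if is_excluded(seqid, prefixes):
                continue
        yield line
```

The same `is_excluded` test could be applied to Fasta record IDs to keep redundant sequences out of downstream calculations, although that part is not shown here.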

Support for generic input

One of the benefits we claim regarding iLoci is that the software accepts a small number of standard inputs and generates a wealth of useful output. While the latter half of that claim is definitely true, we need to work on the first half.

If your genome happens to already be in RefSeq, you're in luck: all you need to do is copy one of the existing RefSeq configuration files, change a few values (species name, genome accession, etc) and you're in business. But if you simply have a pair of Fasta and GFF3 files? You're basically relegated to running all the genhub-build.py steps by yourself, from scratch.

We need better support for generic inputs: we need to document what is expected from the Fasta and GFF3 files, and then we need to fix the genhub-build.py script to support this.

Consolidate file name resolution

Currently, different scripts and modules all redundantly implement similar functionality for resolving file paths.

filepath = '%s/%s/%s' % (workdir, speclabel, filename)

It would be more robust and easier to maintain/fix/change in the future if we used a single function for doing this.

# File doesn't exist yet, no need to test file existence
outfilepath = genhub.file_path(filename, speclabel, workdir=workdir)

# Input file, check to make sure it exists
infilepath = genhub.file_path(filename, speclabel, workdir=workdir, check_exist=True)
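A minimal sketch of such a function is shown below. It follows the call signature used in the examples above, but the body is an assumption about how it might be written, not the actual GenHub implementation. It assumes Python 3 for `FileNotFoundError`.

```python
import os


def file_path(filename, speclabel, workdir=".", check_exist=False):
    """Build the path of a data file in a species' working directory.

    When check_exist is True, raise an error if the file is missing.
    """
    path = os.path.join(workdir, speclabel, filename)
    if check_exist and not os.path.isfile(path):
        raise FileNotFoundError("file does not exist: " + path)
    return path
```

Using `os.path.join` instead of string formatting also avoids doubled separators when `workdir` ends with a slash.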

Test whether ilens file can be ignored

If all iiLocus and ziLocus lengths are easily parsed directly from GFF3, this can prevent cluttering of the working directory with many ancillary files.

Deprecate `*.simple-iloci.txt` file

This is a vestige of an old system, before we had the current terminology settled. It should be removed as soon as we can be sure it's not used anywhere.

Improve configuration parsing

Currently there are two available options for loading configuration files.

  • the -c/--cfg option for providing the path of a single config file
  • the --cfgdir option for providing the path of a directory, from which GenHub will attempt to load all .yml files

The --cfgdir option is fine as is, but I propose the following additions and changes for other config loading options.

  • --cfglist option for providing a file with config files (one per line)
  • --cfgpath option for providing one or more directories in which to search for config files
  • --cfgfullpath option for indicating that value(s) provided by -c/--cfg option or --cfglist option are full file paths; by default, they are treated as relative paths and GenHub searches all directories specified by --cfgpath for these files

The option labels might need tweaking, but I think the functionality supports most/all conceivable use cases with a relatively simple interface.
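To make the proposal concrete, here is a rough sketch of the proposed options using `argparse`. The option labels come from the list above; everything else (function names, allowing `-c/--cfg` and `--cfgpath` to be repeated, the error on a missing file) is an assumption. The existing `--cfgdir` option is unchanged and omitted.

```python
import argparse
import os


def build_parser():
    """Build a parser with the proposed config-loading options."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--cfg", action="append", default=[],
                        help="config file; may be given more than once")
    parser.add_argument("--cfglist",
                        help="file listing config files, one per line")
    parser.add_argument("--cfgpath", action="append", default=[],
                        help="directory to search; may be given more than once")
    parser.add_argument("--cfgfullpath", action="store_true",
                        help="treat config file values as full paths")
    return parser


def resolve_configs(args):
    """Return the list of config file paths implied by the options."""
    names = list(args.cfg)
    if args.cfglist:
        with open(args.cfglist) as listfile:
            names.extend(line.strip() for line in listfile if line.strip())
    if args.cfgfullpath:
        return names
    resolved = []
    for name in names:
        # Search each --cfgpath directory in the order given
        for directory in args.cfgpath:
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                resolved.append(candidate)
                break
        else:
            raise FileNotFoundError("config file not found: " + name)
    return resolved
```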

New Amel genome

GFF3 checksums are failing, presumably due to an update of the RefSeq files. Need to investigate and take action.

Script names

Scripts have very generic names at this point, which is fine for git clone installation but not for a system-wide/virtualenv/pip type installation. Need to select concise names that minimize collision risk.

Refactor `format.py`

The functions have gotten quite long, and could benefit from some decomposing.

More species to integrate

BeeBase consortium data sets (10 bee genomes paper)

  • Dufourea novaeangliae
  • Eufriesea mexicana
  • Habropoda laboriosa
  • Lasioglossum albipes
  • Melipona quadrifasciata

Species only available in HymenopteraBase

  • Cardiocondyla obscurior

HymenopteraBase versions of already integrated species?

  • Apis mellifera
  • Bombus impatiens
  • Bombus terrestris
  • Nasonia vitripennis
  • Atta cephalotes
  • Acromyrmex echinatior
  • Camponotus floridanus
  • Harpegnathos saltator
  • Linepithema humile
  • Pogonomyrmex barbatus
  • Solenopsis invicta

.txt --> .tsv for some ancillary data files

Some of the plain text supporting data files in each genome's directory are in fact simply tab-separated value (TSV) files that lack a header row. There isn't really a compelling reason for these files not to have self-documenting headers, which in turn facilitate easy loading into R/Python/etc for data analysis.

  • *.ilocus.mrnas.txt (this will probably require a change to AEGeAn's pmrna program)
  • *.protein2ilocus.txt (this is internal to GenHub)
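For files internal to GenHub, the conversion could be as simple as the sketch below, which copies a headerless file and prepends a header row. The function name is hypothetical, and the column names in the usage note are placeholders, since the actual columns of these files are not specified here.

```python
def add_header(infile, outfile, columns):
    """Copy a headerless tab-separated file, prepending a header row.

    Raises ValueError if any row has a different number of fields
    than the header.
    """
    with open(infile) as instream, open(outfile, "w") as outstream:
        outstream.write("\t".join(columns) + "\n")
        for linenum, line in enumerate(instream, start=1):
            nfields = len(line.rstrip("\n").split("\t"))
            if nfields != len(columns):
                raise ValueError(
                    "line %d has %d fields, expected %d"
                    % (linenum, nfields, len(columns))
                )
            outstream.write(line)
```

For example, `add_header("Scer.protein2ilocus.txt", "Scer.protein2ilocus.tsv", ["protein_id", "ilocus_id"])` would produce a file that loads into R or Python with named columns, assuming the file has exactly those two columns.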

Anything I'm missing or any other comments @vpbrendel @cycoyuk?

Overlapping exons prematurely kill C. reinhardtii build

Gene GeneID:5716281 in C. reinhardtii prematurely kills an important step in the prepare task. Two issues to address. The first, and pressing, matter is to discard this gene using the typical annotfilter mechanism. The second matter, which will probably have to wait, is the fact that the build continued even though the canon-gff3 step of the grep -v | pmrna --locus | canon-gff3 pipeline failed.
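For the second matter, one option is to check the exit status of every stage of the pipeline rather than only the last one. The sketch below does this with `subprocess`; the function name and error handling are assumptions, not GenHub code. If the pipeline is run through a shell instead, bash's `set -o pipefail` serves a similar purpose.

```python
import subprocess


def run_pipeline(commands, infile, outfile):
    """Run commands as a pipeline from infile to outfile.

    Each command is a list of arguments. Raises RuntimeError if any
    stage exits with a non-zero status.
    """
    procs = []
    with open(infile) as instream, open(outfile, "w") as outstream:
        prev = instream
        for i, cmd in enumerate(commands):
            last = i == len(commands) - 1
            proc = subprocess.Popen(
                cmd, stdin=prev,
                stdout=outstream if last else subprocess.PIPE,
            )
            if procs:
                # Close our copy of the upstream pipe so the upstream
                # process is notified if this stage exits early
                prev.close()
            procs.append(proc)
            prev = proc.stdout
        codes = [proc.wait() for proc in procs]
    failed = [(cmd[0], code) for cmd, code in zip(commands, codes) if code != 0]
    if failed:
        raise RuntimeError("pipeline stage(s) failed: %r" % failed)
```

Note that `grep` exits with status 1 when it selects no lines, so a real implementation would need to decide whether that counts as a failure.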
