gene-tools

Tools to reliably map protein IDs to gene names, made during my UROP at the Lage Lab at the Broad Institute of Harvard and MIT.

map

The main project: automating the assignment of protein IDs / accession numbers to HGNC gene names. It takes a newline-separated list of IDs as input, returns the assigned gene names where possible, and provides information about all other cases. For instance, map reports unassigned protein IDs and returns their Ensembl IDs wherever possible.

Setup

The tools draw information from UniProt and HGNC, both from local copies of their data and programmatically through online queries. To set up the local databases properly, follow the instructions in the ./human_data and ./hgnc_data folders. These are repeated below for completeness.

  1. ./human_data: From the UniProt Downloads page, go to 'Taxonomic Divisions' and download the uniprot_sprot_human.dat.gz and uniprot_trembl_human.dat.gz files. Extract them into the ./human_data folder. Then run grep.sh to create the data.txt file.

  2. ./hgnc_data: From the HGNC Custom Downloads page, download two files:

       a. A file with only 'Approved Symbol' and 'UniProt ID' checked. Save this as hgnc_symbol_ac.txt in the ./hgnc_data folder.

       b. A file with only 'Approved Symbol', 'Previous Symbols' and 'Synonyms' checked. Save this as hgnc_symbol_previous_synonym.txt in the ./hgnc_data folder.

At the end of this setup, your directory should look like this (the map and match directories are not expanded; they should not be modified):

gene-tools
- hgnc_data
--- hgnc_symbol_ac.txt
--- hgnc_symbol_previous_synonym.txt
- human_data
--- data.txt
--- grep.sh
--- uniprot_sprot_human.dat
--- uniprot_trembl_human.dat
- map
- README.md
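
Before running map, it can help to confirm that the local databases are in place. The following is a minimal sketch, not part of the repository: the helper name `missing_setup_files` is hypothetical, and the file list is copied from the directory listing above.

```python
from pathlib import Path

# Files the Setup section says should exist, relative to the repository root.
EXPECTED_FILES = [
    "hgnc_data/hgnc_symbol_ac.txt",
    "hgnc_data/hgnc_symbol_previous_synonym.txt",
    "human_data/data.txt",
    "human_data/grep.sh",
    "human_data/uniprot_sprot_human.dat",
    "human_data/uniprot_trembl_human.dat",
]


def missing_setup_files(root="."):
    """Return the expected files that are absent under `root` (hypothetical helper)."""
    base = Path(root)
    return [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]


if __name__ == "__main__":
    missing = missing_setup_files(".")
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("Local databases look complete.")
```

Run it from the gene-tools directory; an empty result means every file named in the setup steps was found.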

Usage

Note that using map requires an Internet connection, since some queries will be resolved online.

map

map takes input from ./map/in.txt, which is a list of UniProt IDs or accession numbers, one per line, and writes ./map/results.txt, a tab-separated list of those IDs with their corresponding HGNC gene names, along with status flags that indicate how each gene name was obtained. For instance, the ID Q15465 is mapped to SHH directly from HGNC.
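
The direct HGNC lookup can be illustrated with a short sketch. This is not the repository's code: the function name `load_symbol_by_accession` is hypothetical, and the column layout (a header row, then `symbol<TAB>UniProt ID`) is an assumption about what the hgnc_symbol_ac.txt export looks like. The second sample row is made up to show a symbol with no UniProt ID.

```python
import csv
import io


def load_symbol_by_accession(handle):
    """Build {UniProt accession: approved symbol} from a two-column TSV.

    Assumes a header row followed by rows of `symbol<TAB>accession`; the real
    HGNC export may differ, so adjust the column order if needed.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    mapping = {}
    for row in reader:
        if len(row) < 2:
            continue  # malformed or empty line
        symbol = row[0].strip()
        # Assumption: several accessions may share one cell, comma-separated.
        for accession in row[1].split(","):
            accession = accession.strip()
            if accession:
                mapping[accession] = symbol
    return mapping


# Illustrative stand-in for hgnc_data/hgnc_symbol_ac.txt.
sample = io.StringIO(
    "Approved symbol\tUniProt ID\n"
    "SHH\tQ15465\n"
    "GENE_WITHOUT_ID\t\n"
)
symbols = load_symbol_by_accession(sample)
print(symbols.get("Q15465"))  # SHH
```

IDs that are absent from this table are the problematic cases discussed next.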

map also identifies problematic cases (i.e. IDs that cannot be mapped using HGNC alone) and resolves them where possible. For instance:

  1. Obsolete IDs: IDs that were once in use but no longer are. For example, E9PEB9 is an obsolete ID on UniProt and cannot be found on HGNC. map reports that the last existing gene name on UniProt is DST and checks whether DST is the correct HGNC gene name (it is).

  2. Unassigned IDs: IDs that exist but have not been assigned a gene symbol. For example, P00761 does not have an assigned gene name. map reports this and returns its Ensembl ID where possible.

  3. Bad IDs: IDs that do not exist, possibly as a result of a typo.

  4. Not found in HGNC: IDs that can be mapped in UniProt but not on HGNC. map reports the UniProt gene name instead.
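
The cases above amount to a decision order, which can be sketched as follows. This is an illustration under stated assumptions, not map's actual implementation: the function `classify`, the status labels, and the in-memory dictionaries are all hypothetical (map resolves some of these cases through online queries), and the Ensembl lookup for unassigned IDs is not modeled. The `A0A_EXAMPLE` entry is made up.

```python
def classify(accession, hgnc, uniprot_current, uniprot_obsolete, hgnc_symbols):
    """Return (gene_name, status) for one ID. All status labels are hypothetical.

    hgnc             -- {accession: approved symbol} from the HGNC table
    uniprot_current  -- {accession: gene name or None} for live UniProt entries
    uniprot_obsolete -- {accession: last known gene name} for retired entries
    hgnc_symbols     -- set of approved HGNC symbols
    """
    if accession in hgnc:
        return hgnc[accession], "HGNC"
    if accession in uniprot_current:
        gene = uniprot_current[accession]
        if gene is None:
            return None, "UNASSIGNED"  # entry exists but has no gene symbol
        return gene, "NOT_IN_HGNC"  # fall back to the UniProt gene name
    if accession in uniprot_obsolete:
        gene = uniprot_obsolete[accession]
        confirmed = gene in hgnc_symbols
        return gene, "OBSOLETE_CONFIRMED" if confirmed else "OBSOLETE_UNCONFIRMED"
    return None, "BAD_ID"  # unknown everywhere, possibly a typo


# Toy tables built from the README's examples plus one made-up entry.
hgnc = {"Q15465": "SHH"}
uniprot_current = {"P00761": None, "A0A_EXAMPLE": "GENE_X"}
uniprot_obsolete = {"E9PEB9": "DST"}
hgnc_symbols = {"SHH", "DST"}

for acc in ["Q15465", "E9PEB9", "P00761", "A0A_EXAMPLE", "Q1546"]:
    gene, status = classify(acc, hgnc, uniprot_current, uniprot_obsolete, hgnc_symbols)
    print(acc, gene, status, sep="\t")
```

Checking HGNC first keeps the common case cheap; only IDs that miss there need the UniProt fallbacks.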
