genes's Introduction

Genes for Project Cognoma

This repository creates the set of genes to be used in Project Cognoma. The human subset of Entrez Gene is the basis of Cognoma genes. All genes in Cognoma should be converted to Entrez GeneIDs (using a preferred variable name of entrez_gene_id).

When encountering genes in Project Cognoma, identify which of the following approaches applies:

  • If the input genes are only in symbols, open an issue to discuss mapping options.
  • If the input genes contain chromosome and symbol information, use chromosome-symbol-map.tsv to map the genes to Entrez GeneIDs.
  • If the genes are already encoded as Entrez GeneIDs, update the GeneIDs to their most recent versions using updater.tsv and remove GeneIDs that are not in genes.tsv.
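
The second and third approaches can be sketched in Python as follows. This is a minimal illustration rather than the project's own code: the column names (chromosome, symbol, entrez_gene_id, old_entrez_gene_id, new_entrez_gene_id), the helper function names, and the placeholder old GeneID 1000001 are all assumptions, and the inline strings stand in for the real chromosome-symbol-map.tsv, updater.tsv, and genes.tsv files.

```python
import csv
import io

# Hypothetical excerpts standing in for the repository's files. The column
# names and the placeholder old GeneID 1000001 are assumptions.
CHROMOSOME_SYMBOL_MAP_TSV = (
    "chromosome\tsymbol\tentrez_gene_id\n"
    "X\tSHOX\t6473\n"
    "Y\tSHOX\t6473\n"
    "X\tCD99\t4267\n"
)
UPDATER_TSV = "old_entrez_gene_id\tnew_entrez_gene_id\n1000001\t4267\n"
GENES_TSV = "entrez_gene_id\tsymbol\n4267\tCD99\n6473\tSHOX\n"


def read_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def map_chromosome_symbol(pairs, map_rows):
    """Map (chromosome, symbol) pairs to GeneIDs; unmapped pairs become None."""
    lookup = {
        (row["chromosome"], row["symbol"]): row["entrez_gene_id"]
        for row in map_rows
    }
    return [lookup.get(pair) for pair in pairs]


def update_gene_ids(gene_ids, updater_rows, gene_rows):
    """Replace outdated GeneIDs, then keep only GeneIDs present in genes.tsv."""
    old_to_new = {
        row["old_entrez_gene_id"]: row["new_entrez_gene_id"]
        for row in updater_rows
    }
    current = {row["entrez_gene_id"] for row in gene_rows}
    result = []
    for gene_id in gene_ids:
        gene_id = old_to_new.get(gene_id, gene_id)
        # Drop GeneIDs missing from genes.tsv and avoid duplicates
        # created when an old and a new ID both appear in the input.
        if gene_id in current and gene_id not in result:
            result.append(gene_id)
    return result


mapped = map_chromosome_symbol(
    [("X", "SHOX"), ("1", "NOT_A_GENE")],
    read_tsv(CHROMOSOME_SYMBOL_MAP_TSV),
)
updated = update_gene_ids(
    ["1000001", "6473", "999999999"],
    read_tsv(UPDATER_TSV),
    read_tsv(GENES_TSV),
)
print(mapped)
print(updated)
```

GeneIDs are kept as strings here so that they compare equal to the values read from the TSV text; convert them to integers consistently if your input uses numeric IDs.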

Downloads and data

The raw (downloaded) data is stored in the download directory. versions.json contains timestamps for the raw data. The raw data is tracked in this repository because the Entrez Gene FTP site does not version or archive its files.

Created data is stored in the data directory. Applications should use the processed data rather than the raw data, if possible. Applications are strongly encouraged to use versioned (commit-hash-containing) links when accessing data from this repository.
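
As an illustration of a versioned link, the sketch below builds a raw-file URL that pins a specific commit. It assumes the repository is hosted on GitHub; the function name, the owner "example-org", the commit hash, and the path data/genes.tsv are placeholders chosen for this example, not values taken from this repository.

```python
import re


def versioned_raw_url(owner, repo, commit, path):
    """Build a raw GitHub URL pinned to a commit hash (hypothetical helper)."""
    # Reject branch names such as "main": only a hexadecimal commit hash
    # (abbreviated or full) keeps the link stable over time.
    if not re.fullmatch(r"[0-9a-f]{7,40}", commit):
        raise ValueError(f"not a commit hash: {commit!r}")
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{commit}/{path}"


# Placeholder values for illustration only.
url = versioned_raw_url(
    owner="example-org",
    repo="genes",
    commit="0123456789abcdef0123456789abcdef01234567",
    path="data/genes.tsv",
)
print(url)
```

Pinning the commit means a later reprocessing of the data cannot silently change what an application reads.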

Execution

Use the following commands to run the analysis, inside the environment specified by environment.yml:

# To run the entire analysis
python 1.download.py
python 2.process.py

# To run just the data processing
python 2.process.py

In general, we don't anticipate redownloading the data frequently. If you submit a pull request to create additional datasets, please do not execute 1.download.py.

genes's People

Contributors

dhimmel, gwaybio

genes's Issues

Genes with multiple chromosomes

What does it mean for a gene to have multiple chromosomes? Here are all the genes from genes.tsv that exhibited multiple chromosomes:

| entrez_gene_id | symbol | description | chromosome | gene_type | synonyms |
|---|---|---|---|---|---|
| 263 | AMD1P2 | adenosylmethionine decarboxylase 1 pseudogene 2 | X Y | pseudo | |
| 293 | SLC25A6 | solute carrier family 25 member 6 | X Y | protein-coding | |
| 438 | ASMT | acetylserotonin O-methyltransferase | X Y | protein-coding | |
| 1438 | CSF2RA | colony stimulating factor 2 receptor alpha subunit | X Y | protein-coding | |
| 3563 | IL3RA | interleukin 3 receptor subunit alpha | X Y | protein-coding | |
| 3581 | IL9R | interleukin 9 receptor | X Y | protein-coding | |
| 4267 | CD99 | CD99 molecule | X Y | protein-coding | |
| 6473 | SHOX | short stature homeobox | X Y | protein-coding | |
| 6845 | VAMP7 | vesicle associated membrane protein 7 | X Y | protein-coding | |
| 7501 | XGR | XG and CD99 regulator | X Y | other | |
| 8225 | GTPBP6 | GTP binding protein 6 (putative) | X Y | protein-coding | |
| 8227 | AKAP17A | A-kinase anchoring protein 17A | X Y | protein-coding | |
| 8623 | ASMTL | acetylserotonin O-methyltransferase-like | X Y | protein-coding | |
| 9189 | ZBED1 | zinc finger BED-type containing 1 | X Y | protein-coding | |
| 10251 | SPRY3 | sprouty RTK signaling antagonist 3 | X Y | protein-coding | |
| 28227 | PPP2R3B | protein phosphatase 2 regulatory subunit B''beta | X Y | protein-coding | |
| 55344 | PLCXD1 | phosphatidylinositol specific phospholipase C X domain containing 1 | X Y | protein-coding | |
| 64109 | CRLF2 | cytokine receptor-like factor 2 | X Y | protein-coding | |
| 80161 | ASMTL-AS1 | ASMTL antisense RNA 1 | X Y | ncRNA | |
| 207063 | DHRSX | dehydrogenase/reductase X-linked | X Y | protein-coding | |
| 283981 | LINC00685 | long intergenic non-protein coding RNA 685 | X Y | ncRNA | |
| 286530 | P2RY8 | purinergic receptor P2Y8 | X Y | protein-coding | |
| 401577 | CD99P1 | CD99 molecule pseudogene 1 | X Y | pseudo | |
| 442442 | RPL14P5 | ribosomal protein L14 pseudogene 5 | X Y | pseudo | |
| 619538 | OMS | otitis media, susceptibility to | 10 19 3 | | |
| 644218 | TRPC6P | transient receptor potential cation channel subfamily C member 6, pseudogene | X Y | pseudo | |
| 652608 | LOC652608 | 60S ribosomal protein L6-like | X Y | pseudo | |
| 653440 | WASH6P | WAS protein family homolog 6 pseudogene | X Y | pseudo | |
| 727856 | DDX11L16 | DEAD/H-box helicase 11 like 16 | X Y | pseudo | |
| 751580 | LINC00106 | long intergenic non-protein coding RNA 106 | X Y | ncRNA | |
| 100128260 | WASIR1 | WASH and IL9R antisense RNA 1 | X Y | ncRNA | |
| 100287692 | TCEB1P24 | transcription elongation factor B subunit 1 pseudogene 24 | X Y | pseudo | |
| 100359394 | LINC00102 | long intergenic non-protein coding RNA 102 | X Y | ncRNA | |
| 100418703 | LOC100418703 | repetin pseudogene | X Y | pseudo | |
| 100500894 | MIR3690 | microRNA 3690 | X Y | ncRNA | |
| 101928032 | LOC101928032 | uncharacterized LOC101928032 | X Y | ncRNA | |
| 101928055 | LOC101928055 | uncharacterized LOC101928055 | X Y | ncRNA | |
| 101928070 | LOC101928070 | uncharacterized LOC101928070 | X Y | ncRNA | |
| 101928092 | LOC101928092 | uncharacterized LOC101928092 | X Y | ncRNA | |
| 102464837 | MIR6089 | microRNA 6089 | X Y | ncRNA | |
| 102724521 | LOC102724521 | uncharacterized LOC102724521 | X Y | ncRNA | |
| 102725051 | LOC102725051 | uncharacterized LOC102725051 | 1 Un | ncRNA | |
| 105373102 | LOC105373102 | uncharacterized LOC105373102 | X Y | protein-coding | |
| 105373105 | LOC105373105 | uncharacterized LOC105373105 | X Y | ncRNA | |
| 105379413 | LOC105379413 | uncharacterized LOC105379413 | X Y | ncRNA | |
| 105379414 | LOC105379414 | uncharacterized LOC105379414 | X Y | ncRNA | |
| 105379561 | LOC105379561 | uncharacterized LOC105379561 | 8 Un | protein-coding | |
| 106478924 | DHRSX-IT1 | DHRSX intronic transcript 1 | X Y | ncRNA | |
| 106478926 | DPH3P2 | diphthamide biosynthesis 3 pseudogene 2 | X Y | pseudo | |
| 106480712 | FABP5P13 | fatty acid binding protein 5 pseudogene 13 | X Y | pseudo | |
| 106480770 | RNA5SP498 | RNA, 5S ribosomal pseudogene 498 | X Y | pseudo | |
| 107985637 | LOC107985637 | uncharacterized LOC107985637 | X Y | ncRNA | |
| 107985677 | LOC107985677 | uncharacterized LOC107985677 | X Y | ncRNA | |
| 107985697 | LOC107985697 | uncharacterized LOC107985697 | X Y | ncRNA | |
| 107985706 | LOC107985706 | uncharacterized LOC107985706 | X Y | ncRNA | |
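
A listing like the one above could be produced along the following lines. This is a sketch under assumptions: the inline text is a hypothetical three-row excerpt rather than the real genes.tsv, the function names are made up for this example, and the separator between chromosomes is not confirmed by the listing, so the code splits on either "|" or whitespace.

```python
import csv
import io
import re

# Hypothetical excerpt of genes.tsv. Column names follow the listing above;
# the "|" separator between chromosomes is an assumption.
GENES_TSV = (
    "entrez_gene_id\tsymbol\tchromosome\n"
    "293\tSLC25A6\tX|Y\n"
    "619538\tOMS\t10|19|3\n"
    "7157\tTP53\t17\n"
)


def split_chromosomes(value):
    """Split a chromosome field on "|" or whitespace, dropping empty pieces."""
    return [part for part in re.split(r"[|\s]+", value.strip()) if part]


def multi_chromosome_genes(tsv_text):
    """Return the rows whose chromosome field lists more than one chromosome."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row for row in reader
        if len(split_chromosomes(row["chromosome"])) > 1
    ]


rows = multi_chromosome_genes(GENES_TSV)
for row in rows:
    print(row["entrez_gene_id"], row["symbol"], row["chromosome"])
```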
