TARGET classification workflow (using the GDC Data Portal)

Set up the directory structure:

project_dir="/data/BIDS-HPC/private/projects/dmi2"
working_dir="/home/weismanal/notebook/2020-06-10/dmi"
mkdir -p "$project_dir" "$working_dir"
cd "$working_dir"
git clone git@github.com:andrew-weisman/target_classification.git "$project_dir/checkout"
mkdir "$project_dir/data"

Note: The work that uses data directly from the TARGET data website (as opposed to the GDC Data Portal) is in the target_data_website branch of this repository.

Download the manifest for all the gene expression quantification files in the TARGET program (click on the blue "Manifest" button):

[Screenshot: all_gene_expression_files_in_target.png]

Save the downloaded manifest file as $project_dir/checkout/manifests/gdc_manifest.2020-06-10-all_gene_expression_files_in_target.txt.

In addition, click on the blue "Add All Files to Cart" button, go to the cart (top right of page), click on the two blue buttons "Sample Sheet" and "Metadata", and save the resulting two files to $project_dir/data. The two files will be named, e.g., gdc_sample_sheet.2020-07-02.tsv and metadata.cart.2020-07-02.json.

Note that these 5,149 files correspond to 1,192 cases (a case is an individual person).
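
The case count can be cross-checked against the downloaded sample sheet. The following is a minimal sketch, assuming the sample sheet is a tab-separated file whose header row contains a column named "Case ID" (an assumption about the export format that should be confirmed against the actual file). The function name count_unique_values is hypothetical and is not part of the repository.

```shell
# Count the distinct values in a named column of a tab-separated file.
# Usage: count_unique_values FILE COLUMN_NAME
# The column name (e.g. "Case ID") is an assumption about the file's header.
count_unique_values() {
    awk -F'\t' -v col="$2" '
        NR == 1 {
            # Locate the requested column in the header row.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        # Count each value the first time it is seen.
        !seen[$idx]++ { n++ }
        END { if (idx) print n + 0 }
    ' "$1"
}
```

For example, count_unique_values "$project_dir/data/gdc_sample_sheet.2020-07-02.tsv" "Case ID" should agree with the number of cases noted above if the column assumption holds and each row lists a single case.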

Download the expression files from the manifest on Helix:

module load gdc-client
mkdir "$project_dir/data/all_gene_expression_files_in_target"
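cd "$project_dir/data/all_gene_expression_files_in_target"  # explicit path; "cd !!:1" relies on interactive history expansion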
gdc-client download -m "$project_dir/checkout/manifests/gdc_manifest.2020-06-10-all_gene_expression_files_in_target.txt"
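
Before extracting anything, it can be worth confirming that every file in the manifest arrived intact. The sketch below assumes the manifest is tab-separated with a header row and columns in the order id, filename, md5, and that each file was saved as <id>/<filename> under the download directory; both are assumptions to check against the actual manifest and download layout. The function name verify_manifest is hypothetical. Run it before the gunzip step, since decompressed files will no longer match the manifest checksums.

```shell
# Verify downloaded files against the md5 column of a manifest.
# Usage: verify_manifest MANIFEST DOWNLOAD_DIR
# Assumes columns "id<TAB>filename<TAB>md5<TAB>..." and files at DIR/id/filename.
# Prints one line per missing or mismatching file; returns non-zero if any.
verify_manifest() {
    manifest="$1"
    download_dir="$2"
    # Skip the header row, then check each listed file in a subshell.
    tail -n +2 "$manifest" | (
        status=0
        while IFS="$(printf '\t')" read -r id filename md5 _rest; do
            path="$download_dir/$id/$filename"
            if [ ! -f "$path" ]; then
                echo "MISSING  $path"; status=1
            elif [ "$(md5sum "$path" | awk '{print $1}')" != "$md5" ]; then
                echo "MISMATCH $path"; status=1
            fi
        done
        exit "$status"
    )
}
```

A call such as verify_manifest "$project_dir/checkout/manifests/gdc_manifest.2020-06-10-all_gene_expression_files_in_target.txt" . from inside the download directory prints nothing and returns zero when every file is present and matches.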

Extract the resulting compressed files and link to them from a single folder $project_dir/data/all_gene_expression_files_in_target/links:

mkdir links
cd links
# Decompress every downloaded .gz file in place
find .. -iname "*.gz" -exec gunzip {} +
# Link every data file (skipping client logs and annotations) into this folder
find .. -type f ! -path "*/logs/*" ! -name "annotations.txt" -exec ln -s {} . \;
ln -s "$project_dir/checkout/manifests/gdc_manifest.2020-06-10-all_gene_expression_files_in_target.txt" MANIFEST.txt

Note that

for file in $(ls | grep -v MANIFEST.txt); do echo $file | awk -v FS="." '{print $1}'; done | sort -u | wc -l

shows that, ostensibly, there are 2,481 unique expression files (independent of normalization). This count is based only on the filenames and is not actually correct.
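
Because the filename prefix alone is unreliable here, it can help to see how the linked files break down by everything after the first dot in each name. The following is a rough sketch only; the function name tally_suffixes is hypothetical, and it assumes the part after the first dot contains no spaces.

```shell
# Tally the entries in a directory by the part of each name after the first dot.
# Usage: tally_suffixes DIR [NAME_TO_SKIP]
# Prints "suffix count" pairs, one per line, sorted by suffix.
tally_suffixes() {
    dir="$1"
    skip="${2:-MANIFEST.txt}"
    for path in "$dir"/*; do
        # Ignore an unmatched glob (empty directory) but keep symlinks.
        [ -e "$path" ] || [ -L "$path" ] || continue
        name="${path##*/}"
        [ "$name" = "$skip" ] && continue
        # Strip the shortest prefix ending in a dot, leaving the suffix.
        echo "${name#*.}"
    done | sort | uniq -c | awk '{print $2, $1}'
}
```

Running tally_suffixes . inside the links folder lists each distinct suffix with the number of files that carry it, which gives a quick view of how many files exist per file type.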

Start an interactive allocation, using, e.g.,

sinteractive --mem=40g # --mem=20g may be fine

Go through the Python Jupyter notebook /data/BIDS-HPC/private/projects/dmi2/checkout/main.ipynb. Use the conda environment /data/BIDS-HPC/public/software/conda/envs/r_env. (Note this environment contains pandas version 1.1.0, whereas Biowulf's default python module has pandas version 0.24.2, which is insufficient.) See here for more notes on the environment.
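
A quick way to confirm that the active environment has a new enough pandas before opening the notebook is sketched below. The source states only that 0.24.2 is insufficient and that 1.1.0 works, so treating 1.1.0 as the minimum is an assumption. The helper name version_at_least is hypothetical, and it relies on the -V (version sort) option of GNU sort.

```shell
# Succeed if dotted version $1 is greater than or equal to version $2.
# Relies on the -V (version sort) option of GNU sort.
version_at_least() {
    lowest="$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)"
    [ "$lowest" = "$2" ]
}

# Example (assumes python and pandas are available in the active environment):
#   installed="$(python -c 'import pandas; print(pandas.__version__)')"
#   version_at_least "$installed" 1.1.0 || echo "pandas $installed may be too old"
```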
