genomicdatacommons's Introduction

GenomicDataCommons

What is the GDC?

From the Genomic Data Commons (GDC) website:

The National Cancer Institute’s (NCI’s) Genomic Data Commons (GDC) is a data sharing platform that promotes precision medicine in oncology. It is not just a database or a tool; it is an expandable knowledge network supporting the import and standardization of genomic and clinical data from cancer research programs.

The GDC contains NCI-generated data from some of the largest and most comprehensive cancer genomic datasets, including The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Therapies (TARGET). For the first time, these datasets have been harmonized using a common set of bioinformatics pipelines, so that the data can be directly compared.

As a growing knowledge system for cancer, the GDC also enables researchers to submit data, and harmonizes these data for import into the GDC. As more researchers add clinical and genomic data to the GDC, it will become an even more powerful tool for making discoveries about the molecular basis of cancer that may lead to better care for patients.

The data model for the GDC is complex, but it is worth a quick overview. The data model is encoded as a so-called property graph. Nodes represent entities such as Projects, Cases, Diagnoses, Files (of various kinds), and Annotations. The relationships between these entities are maintained as edges. Both nodes and edges may have Properties that supply instance details. The GDC API exposes these nodes and edges through a somewhat simplified set of RESTful endpoints.
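
For orientation, those endpoints can be queried directly over HTTP; a minimal sketch with httr, assuming the current public API host (api.gdc.cancer.gov):

library(httr)

# One project record from the 'projects' endpoint; the package wraps calls like this.
resp <- GET("https://api.gdc.cancer.gov/projects?size=1&format=JSON")
stop_for_status(resp)
str(content(resp)$data$hits)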

Quickstart

This software is available on Bioconductor.org and can be installed via BiocManager::install().

To report bugs or problems, either submit a new issue on GitHub or run bug.report(package='GenomicDataCommons') from within R (which will redirect you to the new-issue page on GitHub).

Installation

Installation can be achieved via Bioconductor’s BiocManager package.

if (!require("BiocManager"))
    install.packages("BiocManager")

BiocManager::install('GenomicDataCommons')
library(GenomicDataCommons)

Check basic functionality

status()
#> $commit
#> [1] "4dd3680528a19ed33cfc83c7d049426c97bb903b"
#> 
#> $data_release
#> [1] "Data Release 34.0 - July 27, 2022"
#> 
#> $status
#> [1] "OK"
#> 
#> $tag
#> [1] "3.0.0"
#> 
#> $version
#> [1] 1

Find data

The following code builds a manifest that can be used to guide the download of raw data. Here, filtering finds gene expression files quantified as raw counts using STAR from ovarian cancer patients.

ge_manifest <- files() |>
    filter( cases.project.project_id == 'TCGA-OV') |>
    filter( type == 'gene_expression' ) |>
    filter( analysis.workflow_type == 'STAR - Counts') |>
    manifest(size = 5)
ge_manifest
#>                                     id data_format     access                                                                   file_name
#> 1 7c69529f-2273-4dc4-b213-e84924d78bea         TSV       open d6472bd0-b4e2-4ed1-a892-e1702c195dc7.rna_seq.augmented_star_gene_counts.tsv
#> 2 0eff4634-f8c4-4db9-8a7c-331b21689bae         TSV       open 42165baf-b32c-4fc4-8b04-29c5b4e76de0.rna_seq.augmented_star_gene_counts.tsv
#> 3 7d74b4c5-6391-4b3e-95a3-020ea0869e86         TSV controlled   accf08d4-a784-4908-831a-7a08d4c5f0f5.rna_seq.star_splice_junctions.tsv.gz
#> 4 dc2aeea4-3cd0-4623-92f4-bbbc962851cc         TSV controlled   8ab508b9-2993-4e66-b8f9-81e32e936d4a.rna_seq.star_splice_junctions.tsv.gz
#> 5 0cf852be-d2e3-4fde-bba8-c93efae2961a         TSV       open 93831282-1dd1-49a3-acd7-dae2a49ca62e.rna_seq.augmented_star_gene_counts.tsv
#>                           submitter_id           data_category       acl            type file_size                 created_datetime                           md5sum
#> 1 7085a70b-2f63-4402-9e53-70f091f26fcb Transcriptome Profiling      open gene_expression   4254435 2021-12-13T20:53:42.329364-06:00 19d5596bba8949f4c138793608497d56
#> 2 f0d44930-b1ad-447a-86b9-27d0285954b9 Transcriptome Profiling      open gene_expression   4257461 2021-12-13T20:47:24.326497-06:00 d89d71b7c028c1643d7a3ee7857d8e01
#> 3 e6473134-6d65-414c-9f52-2c25057fac7d Transcriptome Profiling phs000178 gene_expression   3109435 2021-12-13T21:03:56.008440-06:00 fb8332d6413c44a9de02a1cbe6b018aa
#> 4 f99b93a9-70cb-44f8-bd1f-4edeee4425a4 Transcriptome Profiling phs000178 gene_expression   4607701 2021-12-13T21:02:23.944851-06:00 26231bed1ef67c093d3ce2b39def81cd
#> 5 fb4d7abe-b61a-4f35-9700-605f1bc1512f Transcriptome Profiling      open gene_expression   4265694 2021-12-13T20:50:55.234254-06:00 050763aabd36509f954137fbdc4eeb00
#>                   updated_datetime                              file_id                      data_type    state experimental_strategy
#> 1 2022-01-19T14:47:28.965154-06:00 7c69529f-2273-4dc4-b213-e84924d78bea Gene Expression Quantification released               RNA-Seq
#> 2 2022-01-19T14:47:07.478144-06:00 0eff4634-f8c4-4db9-8a7c-331b21689bae Gene Expression Quantification released               RNA-Seq
#> 3 2022-01-19T14:01:15.621847-06:00 7d74b4c5-6391-4b3e-95a3-020ea0869e86 Splice Junction Quantification released               RNA-Seq
#> 4 2022-01-19T14:01:15.621847-06:00 dc2aeea4-3cd0-4623-92f4-bbbc962851cc Splice Junction Quantification released               RNA-Seq
#> 5 2022-01-19T14:47:07.036781-06:00 0cf852be-d2e3-4fde-bba8-c93efae2961a Gene Expression Quantification released               RNA-Seq
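
Before committing to a download, it can be useful to check how many records match a query; count() (listed under "Executing an API call" below) returns just the number of hits. A small sketch:

files() |>
    filter(cases.project.project_id == 'TCGA-OV') |>
    filter(analysis.workflow_type == 'STAR - Counts') |>
    count()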

Download data

This code block downloads the 5 gene expression files specified in the query above. Using multiple processes for the download can speed up the transfer significantly in many cases. The following completes in about 15 seconds.

library(BiocParallel)
register(MulticoreParam())
destdir <- tempdir()
# bplapply() uses the registered MulticoreParam to download files in parallel
fnames <- bplapply(ge_manifest$id, gdcdata)

If the download had included controlled-access data, the download above would have needed to include a token. Details are available in the authentication section below.
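
As a sketch only (it requires dbGaP authorization and a previously downloaded GDC token, located via gdc_token() listed under Authentication below), a controlled-access download would pass the token to gdcdata():

fnames <- gdcdata(ge_manifest$id, token = gdc_token())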

Metadata queries

Here we use a couple of ad-hoc helper functions to handle the output of the query. See the inst/script/README.Rmd file in the package source for the helper definitions.
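
If you do not want to dig out that script, the two helpers amount to only a few lines each; a rough sketch of what filterAllNA() and bindrowname() do (not the exact source):

# Drop columns that are entirely NA.
filterAllNA <- function(df) {
    df[, colSums(!is.na(df)) > 0, drop = FALSE]
}

# Bind a list of per-case results into one data.frame and
# use the case identifiers (the list names) as row names.
bindrowname <- function(dflist) {
    res <- do.call(rbind, dflist)
    rownames(res) <- names(dflist)
    res
}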

First, create a data.frame from the clinical data:

expands <- c("diagnoses","annotations",
             "demographic","exposures")
clinResults <- cases() |>
    GenomicDataCommons::select(NULL) |>
    GenomicDataCommons::expand(expands) |>
    results(size=6)
demoDF <- filterAllNA(clinResults$demographic)
exposuresDF <- bindrowname(clinResults$exposures)
demoDF[, 1:4]
#>                                      cause_of_death         race gender              ethnicity
#> 2525bfef-6962-4b7f-8e80-6186400ce624           <NA> not reported female           not reported
#> 126507c3-c0d7-41fb-9093-7deed5baf431 Cancer Related not reported female           not reported
#> c43ac461-9f03-44bc-be7d-3d867eb708a0           <NA> not reported female           not reported
#> a59a90d9-f1b0-49dd-9c97-bcaa6ba55d44 Cancer Related not reported   male           not reported
#> 59122a43-606a-4669-806b-6747e0ac9985           <NA>        white   male not hispanic or latino
#> 4447a969-e5c8-4291-b83c-53a0f7e77cbc Cancer Related        white female not hispanic or latino
exposuresDF[, 1:4]
#>                                       submitter_id                 created_datetime    alcohol_intensity pack_years_smoked
#> 2525bfef-6962-4b7f-8e80-6186400ce624 C3N-03839-EXP 2019-12-30T10:23:07.190853-06:00 Lifelong Non-Drinker                NA
#> 126507c3-c0d7-41fb-9093-7deed5baf431 C3N-01518-EXP 2018-06-21T14:27:48.817254-05:00 Lifelong Non-Drinker                NA
#> c43ac461-9f03-44bc-be7d-3d867eb708a0 C3N-03933-EXP 2019-03-14T08:23:14.054975-05:00 Lifelong Non-Drinker                NA
#> a59a90d9-f1b0-49dd-9c97-bcaa6ba55d44 C3N-02695-EXP 2019-03-14T08:23:14.054975-05:00   Occasional Drinker              16.8
#> 59122a43-606a-4669-806b-6747e0ac9985 C3L-03642-EXP 2019-06-24T07:53:15.534197-05:00 Lifelong Non-Drinker              39.0
#> 4447a969-e5c8-4291-b83c-53a0f7e77cbc C3L-03728-EXP 2019-06-24T07:53:15.534197-05:00 Lifelong Non-Drinker                NA

Note that the diagnoses data has multiple lines per patient:

diagDF <- bindrowname(clinResults$diagnoses)
diagDF[, 1:4]
#>                                      ajcc_pathologic_stage                 created_datetime tissue_or_organ_of_origin age_at_diagnosis
#> 2525bfef-6962-4b7f-8e80-6186400ce624             Stage IIB 2019-07-22T06:40:02.183501-05:00          Head of pancreas            19956
#> 126507c3-c0d7-41fb-9093-7deed5baf431          Not Reported 2018-12-03T12:05:16.846188-06:00             Temporal lobe            26312
#> c43ac461-9f03-44bc-be7d-3d867eb708a0             Stage III 2019-03-14T10:37:34.405260-05:00       Floor of mouth, NOS            25635
#> a59a90d9-f1b0-49dd-9c97-bcaa6ba55d44          Not Reported 2019-03-14T10:37:34.405260-05:00       Floor of mouth, NOS            16652
#> 59122a43-606a-4669-806b-6747e0ac9985          Not Reported 2019-07-22T06:40:02.183501-05:00          Upper lobe, lung            23384
#> 4447a969-e5c8-4291-b83c-53a0f7e77cbc          Not Reported 2019-05-07T07:41:33.411909-05:00              Frontal lobe            29326

Basic design

This package design is meant to have some similarities to the “tidyverse” approach of dplyr. Roughly, the functionality for finding and accessing files and metadata can be divided into:

  1. Simple query constructors based on GDC API endpoints.
  2. A set of verbs that, when applied, adjust filtering, field selection, and faceting (fields for aggregation) and result in a new query object (an endomorphism)
  3. A set of verbs that take a query and return results from the GDC

In addition, there are auxiliary functions for asking the GDC API for information about available and default fields, slicing BAM files, and downloading actual data files. Here is an overview of the functionality [1]; a short end-to-end example follows the list.

  • Creating a query
    • projects()
    • cases()
    • files()
    • annotations()
  • Manipulating a query
    • filter()
    • facet()
    • select()
  • Introspection on the GDC API fields
    • mapping()
    • available_fields()
    • default_fields()
    • grep_fields()
    • available_values()
    • available_expand()
  • Executing an API call to retrieve query results
    • results()
    • count()
    • response()
  • Raw data file downloads
    • gdcdata()
    • transfer()
    • gdc_client()
  • Summarizing and aggregating field values (faceting)
    • aggregations()
  • Authentication
    • gdc_token()
  • BAM file slicing
    • slicing()
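
Putting a few of these verbs together, a minimal end-to-end sketch that counts cases per project via faceting:

res <- cases() |>
    facet('project.project_id') |>
    aggregations()
head(res$project.project_id)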

Footnotes

  1. See the individual function and method documentation for specific details.

genomicdatacommons's People

Contributors

hpages, jwokaty, link-ny, mtmorgan, nturaga, seandavi, vobencha


genomicdatacommons's Issues

magrittr in Depends field

Hi Sean @seandavi,
On second thought, it seems like it would be more convenient to keep magrittr in the Depends: field.
This would avoid all of the library() calls in the examples.
Usually, the Depends: field is used when the current package extends that package.

Taking a quick look at Organism.dplyr, @mtmorgan puts dplyr in the Depends field:
https://github.com/Bioconductor/Organism.dplyr/blob/master/DESCRIPTION#L16

@mtmorgan Thoughts?

Also, the tag should be @importFrom magrittr "%>%", at least that worked for me.

Thanks,
Marcel

command failed

biocLite('Bioconductor/GenomicDataCommons')

BioC_mirror: https://bioconductor.org
Using Bioconductor 3.5 (BiocInstaller 1.25.3), R 3.4.0 (2017-04-21).
Installing github package(s) ‘Bioconductor/GenomicDataCommons’
Downloading GitHub repo Bioconductor/GenomicDataCommons@master
from URL https://api.github.com/repos/Bioconductor/GenomicDataCommons/zipball/master
Installing GenomicDataCommons
"C:/PROGRA1/R/R-341.0/bin/i386/R" --no-site-file --no-environ --no-save
--no-restore --quiet CMD INSTALL
"C:/Users/User/AppData/Local/Temp/Rtmpe2qcsl/devtools158829604138/Bioconductor-GenomicDataCommons-e10a1d8"
--library="C:/Program Files/R/R-3.4.0/library" --install-tests

Error: Command failed (65535)

Why did it not work?

.gdc_list print output might be a bit misleading

Below, the files_list returned shows cases(8), which someone might interpret as 8 cases associated with the file. I am not sure what the right answer is supposed to be, but in this particular instance it is ambiguous what "cases" means; this should be clarified.

> a = files(fields=grep('case',mapping('files')$fields,value=TRUE)[1:20])
> a[1]
class: files_list
files: 1
names:
    5fc5ee74-3001-4deb-bfdf-b21695128a09
fields:
    cases(8)
> a[1]$`5fc5ee74-3001-4deb-bfdf-b21695128a09`
$cases
                          submitter_id           demographic.updated_datetime 
                        "TCGA-P3-A6T6"     "2016-09-02T18:57:19.863724-05:00" 
                    demographic.gender             demographic.demographic_id 
                                "male" "3dca884d-56d7-5cab-b435-4b64c1986ec9" 
             demographic.year_of_birth                       demographic.race 
                                "1956"                                "white" 
              demographic.submitter_id                       updated_datetime 
            "TCGA-P3-A6T6_demographic"     "2016-09-08T12:09:02.830347-05:00" 

File caching behavior and implementation

This discussion seems related to but distinct from the discussion of #40, so I am opening a new issue here. @mtmorgan pointed out that BiocFileCache fills at least part of this need.

The GDC uses UUIDs for everything, including files. They seem to serve a nice purpose for uniquely describing resources in the GDC. As such, the file UUID is an ideal key in any local cache. These UUIDs also serve to disambiguate any files with the same name, so incorporating them into a local file path is likely useful.

I would envision, then, keys that look like 7cde9495-e573-4b38-b89c-991076cf8cf8 and file paths inside the BiocFileCache that look something like 7cde9495-e573-4b38-b89c-991076cf8cf8/originalfilename.txt. The original filename is important as some functions rely on file suffixes.
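
A rough sketch of that idea on top of BiocFileCache (the gdcdata() call and the copy action below are assumptions, not a finished design):

library(BiocFileCache)

# Key each cache entry by the GDC file UUID.
bfc <- BiocFileCache(cache = file.path(tempdir(), "gdc_cache"))
uuid <- "7cde9495-e573-4b38-b89c-991076cf8cf8"
hit <- bfcquery(bfc, uuid, field = "rname", exact = TRUE)
if (nrow(hit) == 0L) {
    # Download via the existing mechanism, then register under the UUID key.
    localfile <- gdcdata(uuid)
    bfcadd(bfc, rname = uuid, fpath = localfile, action = "copy")
}
cached_path <- bfcrpath(bfc, rnames = uuid)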

slicing fails on multiple regions

When submitting a character list to slice() (works given a single region but not multiple)
Error: Tried to unbox a vector of length 2

Following the error leads to the unbox(regions) call in the code.
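
The error comes from jsonlite: unbox() only accepts length-1 vectors, so multiple regions have to be serialized as a JSON array rather than unboxed. A minimal illustration:

library(jsonlite)

unbox("chr1:1-1000")                           # fine: a scalar
# unbox(c("chr1:1-1000", "chr2:1-1000"))       # error: Tried to unbox a vector of length 2
toJSON(list(regions = c("chr1:1-1000", "chr2:1-1000")))
# {"regions":["chr1:1-1000","chr2:1-1000"]}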

Implement support for facets

Here is an example of output of a basic facet (using pure GET):

x = GET("https://gdc-api.nci.nih.gov/projects?format=JSON&pretty=FALSE&fields=dbgap_accession_number&size=10&from=1&facets=program.name")
> x
Response [https://gdc-api.nci.nih.gov/projects?format=JSON&pretty=FALSE&fields=dbgap_accession_number&size=10&from=1&facets=program.name]
  Date: 2016-09-27 22:06
  Status: 200
  Content-Type: application/json
  Size: 608 B

> content(x)
$data
$data$pagination
$data$pagination$count
[1] 10

$data$pagination$sort
[1] ""

$data$pagination$from
[1] 1

$data$pagination$page
[1] 1

$data$pagination$total
[1] 39

$data$pagination$pages
[1] 4

$data$pagination$size
[1] 10


$data$hits
$data$hits[[1]]
$data$hits[[1]]$dbgap_accession_number
NULL


$data$hits[[2]]
$data$hits[[2]]$dbgap_accession_number
NULL


$data$hits[[3]]
$data$hits[[3]]$dbgap_accession_number
NULL


$data$hits[[4]]
$data$hits[[4]]$dbgap_accession_number
NULL


$data$hits[[5]]
$data$hits[[5]]$dbgap_accession_number
[1] "phs000467"


$data$hits[[6]]
$data$hits[[6]]$dbgap_accession_number
NULL


$data$hits[[7]]
$data$hits[[7]]$dbgap_accession_number
NULL


$data$hits[[8]]
$data$hits[[8]]$dbgap_accession_number
NULL


$data$hits[[9]]
$data$hits[[9]]$dbgap_accession_number
[1] "phs000468"


$data$hits[[10]]
$data$hits[[10]]$dbgap_accession_number
NULL



$data$aggregations
$data$aggregations$program.name
$data$aggregations$program.name$buckets
$data$aggregations$program.name$buckets[[1]]
$data$aggregations$program.name$buckets[[1]]$key
[1] "TCGA"

$data$aggregations$program.name$buckets[[1]]$doc_count
[1] 33


$data$aggregations$program.name$buckets[[2]]
$data$aggregations$program.name$buckets[[2]]$key
[1] "TARGET"

$data$aggregations$program.name$buckets[[2]]$doc_count
[1] 6






$warnings
named list()
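
For reference, in the package this kind of aggregation is expressed with facet() and aggregations(); roughly equivalent to the facets=program.name request above:

projects() |>
    facet('program.name') |>
    aggregations()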

token is not required for gdcdata call in vignette 2.4

Since the example uses open-access data, you do not need to set token=gdc_token() in

fnames = bplapply(ge_manifest$id,gdcdata,
                  token=gdc_token(),destination_dir=destdir,
                  BPPARAM = MulticoreParam(progressbar=TRUE))

Attempting to do so as a token-deprived person like myself leads to an error.

404 when using legacy=TRUE

Dear developers,

I am getting a 404 when using legacy=TRUE to get a set of files().

For example:

file_list = files(legacy = FALSE) %>% results()

works fine, but

file_list = files(legacy = TRUE) %>% results()

returns

Error in .gdc_post(entity_name(x), body = body, legacy = x$legacy, token = NULL, :
Not Found (HTTP 404).

Is this intended behavior? I have been having trouble in general accessing the legacy archive on the GDC website.

shiny dependency

It would be nice to have a package that just retrieves data programmatically without bringing along a GUI. Maybe the Shiny GUI could be optional or a separate package? Btw, if you're going to use data.table (presumably because R is too slow?), do you really also need readr?

expand for deeply-nested fields

The package vignette provides an example of expanding first-level fields to obtain a data frame. However, the approach does not work for more deeply nested fields. For example,

files() %>% 
   GenomicDataCommons::select(NULL) %>%
   GenomicDataCommons::expand("cases.samples") %>%
   results()

produces a list with all children of the samples field concatenated into a comma-separated string without field names, e.g.

$cases
$cases$`3fe677f6-8329-447c-b999-5e70582624aa`
samples
1 01, 2017-03-04T16:37:25.946840-06:00, NA, true, NA, TCGA-IA-A83W-01A, NA, 2e4dfa77-839a-445d-beef-60b6396adf0c, FALSE, 10CCB12F-77E0-4100-A87A-0D36E5AF7F8B, NA, NA, Primary Tumor, live, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3607, 140, NA

This is of limited utility, as the order of the fields cannot be trusted, so the values cannot be reliably mapped back to field names. The only work-around I could find was to provide a custom response handler to prevent jsonlite from simplifying the vectors (and consequently other structures).

library(magrittr)  # provides %>% and the exposition pipe %$%
library(dplyr)     # provides bind_rows()

# Custom response handler: keep the nested list structure instead of letting
# jsonlite simplify vectors (and other structures).
respHandler <- function(txt, ...) { jsonlite::fromJSON(txt, simplifyVector = FALSE) }
files() %>%
   GenomicDataCommons::select(NULL) %>%
   GenomicDataCommons::expand("cases.samples") %>%
   response(response_handler = respHandler) %$%
   lapply(results, unlist, recursive = TRUE) %>%
   lapply(as.list) %>%
   bind_rows()

However, it would be nice if such expansion happened automatically when results are called.

Add legacy endpoint to API

The legacy endpoint contains data that have not gone through the new GDC pipelines. The legacy archive is described here. Because some data are only available via the legacy archive, we should support that endpoint.
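
As eventually exposed in the package (see the legacy = TRUE report above), the query constructors take a legacy argument; a sketch:

# Query the legacy (pre-harmonization) archive rather than the default endpoint.
legacy_files <- files(legacy = TRUE) |> results(size = 10)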

Error when downloading files

Hi, I keep getting this error when trying to download files.

Error: lexical error: invalid char in json text.
<?xml version="1.0" encoding="U
(right here) ------^

This is my code:
ge_manifest_CNV_primarytumour = files() %>%
    filter( ~ cases.project.project_id == 'TCGA-LIHC' &
                data_type == 'Masked Copy Number Segment' &
                cases.samples.sample_type == 'Primary Tumor' &
                analysis.workflow_type == 'DNAcopy') %>%

Please advise me further. Thank you

Cherlyn

file_id not recognized as field for file endpoint.

It doesn't seem to recognize 'file_id' as a valid field for the files endpoint. I know that, as a workaround, the UUIDs can be retrieved with the command below, but it might be good to keep the fields consistent.

> fields <-c("file_id","file_name","cases.submitter_id")
> query2 = files(fields=fields,filters=make_filter("experimental_strategy"=="WXS"&"data_format"=="BAM"),size=20000)
> query2[[1]]
$file_name
[1] "C529.TCGA-HC-7737-11A-02D-2114-08.1_gdc_realn.bam"

$cases
  submitter_id 
"TCGA-HC-7737"
> names(query2[1])
[1] "164511a9-2f56-49e0-b5cf-9c4be32f8fc7"```

quick start code does not run

Several examples in the quick start chunk on the front page for
this repo fail. Perhaps there is an authentication issue for some
of the commands. It would be very helpful to have links for the
authentication setup steps, as the NCI GDC web pages do not give
a clear path to a workable approach -- where do I go, and what do I do to
establish valid credentials (or know that I have failed to do so)?

library(GenomicDataCommons)
endpoints()
available endpoints:
status, projects, cases, files, annotations, data, manifest,
slicing
?experiments
No documentation for ‘experiments’ in specified packages and libraries:
you could try ‘??experiments’
experiments(size=20)
Error: could not find function "experiments"
No suitable frames for recover()
status()
Error in curl::curl_fetch_memory(url, handle = handle) :
SSL connect error

Enter a frame number, or 0 to exit

1: status()
2: .gdc_get(paste(version, "status", sep = "/"))
3: GET(uri, add_headers(X-Auth-Token = token), ...)
4: request_perform(req, hu$handle$handle)
5: request_fetch(req$output, req$url, handle)
6: request_fetch.write_memory(req$output, req$url, handle)
7: curl::curl_fetch_memory(url, handle = handle)

sessionInfo()
R Under development (unstable) (2016-10-26 r71594)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X El Capitan 10.11.6

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] GenomicDataCommons_0.1.4 rmarkdown_1.3

loaded via a namespace (and not attached):
[1] Rcpp_0.12.8 digest_0.6.10 rprojroot_1.1 R6_2.2.0
[5] jsonlite_1.2 backports_1.0.4 magrittr_1.5 evaluate_0.10
[9] httr_1.2.1 stringi_1.1.2 curl_2.3 xml2_1.0.0
[13] tools_3.4.0 stringr_1.1.0 htmltools_0.3.5 knitr_1.15.1

Windows install

Overall, looks good
Just a few comments

  1. For the install to work on Windows R, I had to perform the following:
source("https://bioconductor.org/biocLite.R")
biocLite("BiocInstaller")
install.packages('devtools')
library(devtools)
install.packages("forecast", repos=c("http://rstudio.org/_packages", "http://cran.rstudio.com"))
devtools::install_github('Bioconductor/GenomicDataCommons')
library(GenomicDataCommons)
  2. Also, ??GenomicDataCommons did not bring up the vignettes or the help pages.

Using Zenhub

Just a note that I am trying ZenHub on this project. I think if you install it in Chrome, you should be able to see it.

Convert all requests to POST

Some URLs become too long when specifying all fields or long lists in filter queries. We could leave an option to use GET instead if necessary.
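
For reference, the GDC API accepts the same parameters in a JSON request body; a hedged sketch of the POST form using httr (the filter structure follows the GDC JSON filter syntax):

library(httr)
library(jsonlite)

# POST keeps long field lists and filter expressions out of the URL.
body <- list(
    filters = list(op = "=",
                   content = list(field = "cases.project.project_id",
                                  value = "TCGA-OV")),
    fields = "file_id,file_name",
    size = "10"
)
resp <- POST("https://api.gdc.cancer.gov/files",
             body = toJSON(body, auto_unbox = TRUE),
             content_type_json())
stop_for_status(resp)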

Better mechanism for field checking in using filters in API calls

Right now, the filters() function does the right thing, but only if the correct endpoint is specified in the call. Instead of having the user define the endpoint, it would be better to have each API call guarantee the correct endpoint and, thus, field checking in the call.

Instead of this:

cases(filters=filters(....,endpoint='cases'))

Do this:

cases(filters=FILTER_EXPRESSION)
# inside cases
filters = filters(FILTER_EXPRESSION, endpoint='cases')

Error: lexical error: invalid char in json text.

Here is the error:



                               type == 'gene_expression' &
+                                analysis.workflow_type == 'HTSeq - Counts')
> manifest_df = qfiles %>% manifest()
Error: lexical error: invalid char in json text.
                                       <?xml version="1.0" encoding="U
                     (right here) ------^
sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=it_IT.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=it_IT.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=it_IT.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_2.2.1            knitr_1.20               GenomicDataCommons_1.2.0
[4] magrittr_1.5            

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17           pillar_1.2.3           compiler_3.4.4        
 [4] BiocInstaller_1.28.0   GenomeInfoDb_1.14.0    plyr_1.8.4            
 [7] XVector_0.18.0         bitops_1.0-6           tools_3.4.4           
[10] zlibbioc_1.24.0        jsonlite_1.5           tibble_1.4.2          
[13] gtable_0.2.0           pkgconfig_2.0.1        rlang_0.2.0           
[16] rstudioapi_0.7         curl_3.2               yaml_2.1.19           
[19] parallel_3.4.4         GenomeInfoDbData_1.0.0 httr_1.3.1            
[22] xml2_1.2.0             S4Vectors_0.16.0       IRanges_2.12.0        
[25] hms_0.4.2              stats4_3.4.4           grid_3.4.4            
[28] data.table_1.11.4      R6_2.2.2               readr_1.1.1           
[31] scales_0.5.0           BiocGenerics_0.24.0    GenomicRanges_1.30.3  
[34] colorspace_1.3-2       labeling_0.3           utf8_1.1.4            
[37] RCurl_1.95-4.10        lazyeval_0.2.1         munsell_0.4.3         
[40] crayon_1.3.4 

guidance on solving SSL error?

source 0.99.7
%vjcair> R CMD build GenomicDataCommons

  • checking for file ‘GenomicDataCommons/DESCRIPTION’ ... OK
  • preparing ‘GenomicDataCommons’:
  • checking DESCRIPTION meta-information ... OK
  • installing the package to build vignettes
  • creating vignettes ... ERROR
    Quitting from lines 91-98 (api.Rmd)
    Error: processing vignette 'api.Rmd' failed with diagnostics:
    SSL connect error
    Execution halted

Higher-level functionality than just data access using GenomicDataCommons package

This isn't really an issue (or maybe a documentation issue). This package allows querying and downloading the GDC data. Does it stop there?

For example, TCGAbiolinks performs a similar function and can create a SummarizedExperiment or a data.frame from the downloaded data. Does GenomicDataCommons do something like that? Can it transform the downloaded files into some sort of a matrix structure?
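
Assembling a matrix by hand is only a few lines once the files are downloaded; a rough sketch using the fnames and ge_manifest objects from the README example above (the STAR counts column names gene_id and unstranded are assumptions and should be checked against an actual file):

# Read one STAR gene counts TSV and return a named vector of raw counts.
read_counts <- function(path) {
    df <- read.delim(path, comment.char = "#")
    df <- df[!grepl("^N_", df$gene_id), ]   # drop N_unmapped etc. summary rows
    setNames(df$unstranded, df$gene_id)
}
count_list <- lapply(fnames, read_counts)
count_mat <- do.call(cbind, count_list)
colnames(count_mat) <- ge_manifest$file_name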

Proposed new name

For discussion, @mtmorgan has mentioned the idea of changing the name here to be something like gdc to be the "low-level" API with the CamelCase version being the higher-level interface to full bioconductor objects. I could go either way.

Another issue is that there is likely going to be more than one "genomic data commons"; should we rename to include NCI or some other term to identify this as the NCI Genomic Data Commons?

extract clinical data from previous research

I want to download all of the clinical data for the RNA-seq data selected:


expands = c("diagnoses","annotations",
            "demographic","exposures")
clinResults = cases() %>%

  GenomicDataCommons::select(filter( ~ cases.project.project_id == 'TCGA-OV' &
                                       type == 'gene_expression' &
                                       analysis.workflow_type == 'HTSeq - Counts') ) %>%
  GenomicDataCommons::expand(expands) %>%
  results(size=300)
str(clinResults,list.len=10)
write.table(clinResults, "Clinical_results.csv", sep = "\t", row.names = FALSE)

Proposal for caching behavior--comments.

We desire a caching behavior for data coming from the NCI GDC. Features might include:

  • Cache is used when a file is requested, optionally disabled if desired
  • Option to invalidate any or all of the cache
  • Cache location configurable
  • Ability to use either GDC transfer mechanism (either data transfer endpoint or via the data transfer tool) as a download mechanism, configurable by user (https://docs.gdc.cancer.gov/API/Users_Guide/Downloading_Files/)
  • Downloads use Access Token, if supplied, for controlled-access data

API suggestions:

  • Combine functionality of gdcdata and transfer into one method that accepts file UUIDs and serves back files.
  • Back the functionality in number 1 by a subclass of BiocFileCache that overrides the update and add methods to use the requested file transfer mechanism +/- Token.

Considerations:

  • How much metadata should we mirror in the cache versus making calls to the files() endpoint to gather metadata, on demand?
  • Conserve filenames?
  • Reporting of what is in the cache can be "smart" with respect to metadata (how many files from project x, or of type y).

Manifest won't generate for larger groups of files

The manifest endpoint seems to consistently break after 108 files. See below:

> query = files(fields=fields,filters=make_filter("experimental_strategy"=="RNA-Seq"&"data_format"=="BAM"),size=20000)
> query
class: files_list
files: 11607
names:
    a7fd6aae-6af5-490f-b65e-97c7bfcf44bf, 7d810229-fa70-4df1-88d6-82e40618ec81, c9eebf0c-3768-43a1-b5c5-874a0d4843c2, ...,
    cf153337-d01f-40bd-8547-92af3f344217, 37b6f86e-c70f-4f01-965e-f18ed9024dd4
> mf2 = manifest(uuids=names(query[1:109])) #Works fine
> mf2 = manifest(uuids=names(query[1:110]))
Error in .gdc_download_one(uri, destination, overwrite = FALSE, progress = FALSE,  : 
  Not Found (HTTP 404).
>  
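
Until that is fixed, a workaround sketch is to request the manifest in chunks and row-bind the pieces:

# Chunk the UUIDs (100 per request, an arbitrary size below the failure point).
ids <- names(query)
chunks <- split(ids, ceiling(seq_along(ids) / 100))
mf <- do.call(rbind, lapply(chunks, function(x) manifest(uuids = x)))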

Problem connecting to gdc

Successful installation done with biocLite('Bioconductor/GenomicDataCommons').

> files() Error in curl::curl_fetch_memory(url, handle = handle) : SSL connect error
> httr::GET('https://gdc-api.nci.nih.gov/status') Error in curl::curl_fetch_memory(url, handle = handle) : SSL connect error
> httr::GET('http://gdc-api.nci.nih.gov/status') Error in curl::curl_fetch_memory(url, handle = handle) : SSL connect error
> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.7 (Santiago)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] ggplot2_2.2.1 GenomicDataCommons_0.99.8 magrittr_1.5
[4] dplyr_0.5.0 BiocInstaller_1.20.3

loaded via a namespace (and not attached):
[1] Rcpp_0.12.9 git2r_0.14.0 plyr_1.8.4 R.methodsS3_1.7.1 R.utils_2.4.0
[6] tools_3.2.3 digest_0.6.12 jsonlite_1.3 memoise_1.0.0 tibble_1.2
[11] gtable_0.2.0 R.cache_0.12.0 shiny_1.0.0 DBI_0.5-1 curl_2.3
[16] R.rsp_0.30.0 withr_1.0.1 httr_1.2.1 knitr_1.14 xml2_1.1.1
[21] devtools_1.12.0 grid_3.2.3 data.table_1.10.4 R6_2.2.0 readr_1.0.0
[26] scales_0.4.1 htmltools_0.3.5 assertthat_0.1 mime_0.5 colorspace_1.2-7
[31] xtable_1.8-2 httpuv_1.3.3 labeling_0.3 miniUI_0.1.1 lazyeval_0.2.0
[36] munsell_0.4.3 R.oo_1.20.0

Need to return more complex object to support counting hits and facets

See #16 for details on facets and the need for this proposed change.

Currently, the API calls return only the "hits" data. To support counting of results (in the pagination JSON object) and facets (in the aggregation JSON object), we need to return a more complicated object. I am proposing that we return a list with three elements:

  • results -- contains the current gdc_list representation
  • facets -- if facets specified, this will be a list of data.frames, one for each facet field
  • pages -- gives counts, page size, and pages
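
A rough sketch of the proposed return value (names and values are illustrative only; the numbers are borrowed from the facet GET example earlier in this issue list):

response <- list(
    results = list(),   # the current gdc_list representation of the hits
    facets  = list(program.name = data.frame(key = c("TCGA", "TARGET"),
                                              doc_count = c(33, 6))),
    pages   = list(count = 10, total = 39, size = 10, pages = 4)
)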

'filters' not defined

R CMD INSTALL GenomicDataCommons

  • installing to library ‘/Library/Frameworks/R.framework/Versions/3.4/Resources/library’
  • installing source package ‘GenomicDataCommons’ ...
    ** R
    ** inst
    ** preparing package for lazy loading
    ** help
    *** installing help indices
    ** building package indices
    ** installing vignettes
    ** testing if installed package can be loaded
    Error in namespaceExport(ns, exports) : undefined exports: filters
    Error: loading failed
    recover called non-interactively; frames dumped, use debugger() to view
  • DONE (GenomicDataCommons)

GDC server down, try to use this package later

HI,

I was trying to use TCGAbiolinks. Functions related to GDC gave me the error:
"GDC server down, try to use this package later" for the whole morning.

Any hint what's happening here?

Thanks,
Jessie
