rnabioco / nihexporter

An R data package for NIH EXPORTER data

Home Page: https://rnabioco.github.io/nihexporter/

License: Other

R 100.00%

nih spending

nihexporter's Introduction

nihexporter

The nihexporter R package provides a minimal set of data from the NIH EXPORTER database, which contains information on NIH biomedical research funding from 1985 to 2021.

To keep the package lightweight, many details are omitted but can be easily retrieved from NIH RePORTER.

Installation

Install the package from GitHub with:

# install.packages('pak')
pak::pkg_install("jayhesselberth/nihexporter")

Note: this is a large data package (>40 MB).

Tables

projects: provides data on projects funded by NIH.
project_pis: links project numbers (project.num) to principal investigator IDs (pi.id).
publinks: links PubMed IDs (pmid) to project numbers (project.num).
publications: provides information for individual publications, including their Relative Citation Ratio values (rcr).
patents: links project IDs (project.num) to patent.id.
clinical_studies: links project IDs to associated clinical trials.
project_io: pre-computed n.pubs, n.patents and project.cost for each project.num.
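
The linking tables are meant to be joined on their shared keys. As a minimal sketch, the example below uses small made-up data frames that only mimic the `publinks` and `project_io` column names listed above. All values, including the project numbers, are invented, and the real tables may carry additional columns. It counts publications per project and derives a simple cost-per-publication figure (`cost.per.pub` is a hypothetical name, not a package column):

```r
library(dplyr)

# Toy stand-in for `publinks`: one row per (pmid, project.num) link.
# Values are invented for illustration only.
toy_publinks <- tibble(
  pmid        = c(1L, 2L, 3L, 4L),
  project.num = c("R01AA000001", "R01AA000001", "R01AA000002", "R01AA000001")
)

# Toy stand-in for `project_io`, with an invented total cost per project.
toy_project_io <- tibble(
  project.num  = c("R01AA000001", "R01AA000002"),
  project.cost = c(300000, 100000)
)

# Count linked publications per project, attach the cost, and
# divide cost by publication count.
pubs_per_project <- toy_publinks %>%
  count(project.num, name = "n.pubs") %>%
  left_join(toy_project_io, by = "project.num") %>%
  mutate(cost.per.pub = project.cost / n.pubs)

print(pubs_per_project)
```

With the packaged data, the same pattern should apply to `publinks` and `project_io` directly, assuming they share the `project.num` key as described above.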

Note: Abstracts from NIH EXPORTER are not provided as they significantly increase the size of the package.

Functions

rcr() retrieves Relative Citation Ratios and associated information for PubMed IDs.
nihexporter_sqlite() can be used to cache data in a local SQLite database.

Variables

nih.institutes: 27 NIH institutes in two-letter format

nihexporter's Issues

analyze lag times between project input and output

Add additional tables

Support for project.organization and project.pi is in the package, but the resulting tables are too big to upload to GitHub.

Need to set up an indexing scheme for project PI and organization (just a relational index). Then we can really see exactly who the big winners (and losers) are!
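
A relational index of this kind could be sketched as follows. This is only an illustration on invented toy data, not the scheme the package uses: `org_index`, `org.id`, and `project_org_links` are hypothetical names. The idea is to store each long organization name once in a lookup table and keep only a compact integer key in the per-project link table.

```r
library(dplyr)

# Toy project-to-organization rows; names and numbers are invented.
toy_project_orgs <- tibble(
  project.num = c("P1", "P2", "P3", "P4"),
  org.name    = c("UNIV A", "UNIV B", "UNIV A", "UNIV C")
)

# Lookup table: one row per distinct organization with an integer key.
org_index <- toy_project_orgs %>%
  distinct(org.name) %>%
  mutate(org.id = row_number())

# Slim link table: the long strings are replaced by the integer key.
project_org_links <- toy_project_orgs %>%
  left_join(org_index, by = "org.name") %>%
  select(project.num, org.id)

print(org_index)
print(project_org_links)
```

The same approach would apply to principal investigators, with a PI lookup table keyed by `pi.id`.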

Link clinical studies to application ID

I am wondering whether the clinical trials table could be linked to the projects table by the application ID (rather than just the project number).

library(nihexporter)
#> Loading required package: jsonlite
#> Loading required package: httr
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats

# this produces several NA values for application.id 
left_join(clinical_studies, projects) %>%
  select(application.id, one_of(names(clinical_studies)))
#> Joining, by = "project.num"
#> # A tibble: 248,786 × 4
#>    application.id                project.num    trial.id
#>             <int>                      <chr>       <chr>
#> 1              NA        261201100031C-0-0-1 NCT01831778
#> 2              NA 261201200042I-0-26100006-1 NCT02772003
#> 3              NA        261201400046C-0-0-1 NCT02464332
#> 4              NA        268200700015C-2-0-0 NCT00534495
#> 5              NA        268200700036C-5-0-1 NCT00556439
#> 6              NA        268200900040C-1-0-1 NCT01206062
#> 7              NA        268201000048C-5-0-1 NCT01322165
#> 8              NA        268201300046C-4-0-1 NCT00005485
#> 9              NA        268201300047C-4-0-1 NCT00005485
#> 10             NA        268201300048C-4-0-1 NCT00005485
#> # ... with 248,776 more rows, and 1 more variables: study.status <fctr>
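
One way to narrow down where the NA values come from is to list the `project.num` values in `clinical_studies` that have no match in `projects`. The following is a minimal sketch on toy data: only the column names come from the output above, and the rows are invented.

```r
library(dplyr)

# Toy stand-ins for the two tables; all values are invented.
toy_clinical_studies <- tibble(
  project.num = c("X-1", "X-2", "X-3"),
  trial.id    = c("NCT-A", "NCT-B", "NCT-C")
)

toy_projects <- tibble(
  application.id = c(101L, 102L),
  project.num    = c("X-2", "X-4")
)

# Rows of the clinical table whose project.num never appears in projects.
# These are the rows that would get an NA application.id after a left join.
unmatched <- anti_join(toy_clinical_studies, toy_projects, by = "project.num")

print(unmatched)
```

Running the same `anti_join()` on the packaged `clinical_studies` and `projects` tables should show which project numbers are responsible for the missing application IDs.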

RCR analysis

Compare RCR to:

total cost (overall for NIH and by institute)
budget mechanism
type of grant (e.g. R01, P01, R21, etc.).

shiny: make productivity more reactive with `project_io` table

compare cost of publishing in specific journals

add `arra.funded` to `projects` table

model productivity as function of time and money

productivity in intramural vs extramural grants

DUNS numbers are incorrect in `project_orgs` table

Sent the following to the NIH EXPORTER help folks on 2015 Mar 11, still waiting for fix.

For reference, I generated an R data package that pulls in the CSV
formatted data from 2000-2014:

https://github.com/jayhesselberth/nihexporter

The problem I identified is that several organizations have apparently
been assigned the same DUNS number in the PROJECTS tables. Given that
this is supposed to be the authoritative number for cross-referencing
institution information, it would be nice if this were fixed.

For example, if I look at all of the organizations that have been
assigned to DUNS number 001910777 across all fiscal years, I get the
following result:

  org.duns  org.name                                count
1 001910777 JOHNS HOPKINS UNIVERSITY                 3905
2 001910777 UNIVERSITY OF TEXAS MD ANDERSON CAN CTR  3912
3 001910777 UNIVERSITY OF VIRGINIA CHARLOTTESVILLE   5864
4 001910777 OSEL, INC.                                 33

In this case, Johns Hopkins is the correct one, but there are actually
more entries assigned to UVA. In fact, UVA is only assigned to this DUNS
number in the PROJECT tables, but its actual DUNS number is 065391526.

Moreover, if I look up DUNS number 065391526, I get the following
result:

  org.name                       n()
1 UNIVERSITY OF COLORADO DENVER 7642
2 UNIVERSITY OF VIRGINIA        1328

so some of the entries with this DUNS number point to UVA (under a
different name), but more point to University of Colorado Denver.

I looked up a particular UVA grant (F31AT000058) in NIH REPORTER, and
REPORTER has the correct UVA DUNS number on the DETAILS page. I would
have thought these are pulled from the same database, so it seems like
the EXPORTER export of DUNS numbers is not working correctly.
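
The kind of check described in the message above can be sketched as follows. This is not the original query, only an illustration on toy data: the organization names echo the example, while the rows, the `999999999` number, and the names `ambiguous_duns` and `n.names` are invented. The sketch flags every DUNS number that maps to more than one organization name.

```r
library(dplyr)

# Toy rows: org names echo the example above, everything else is invented.
toy_orgs <- tibble(
  org.duns = c("001910777", "001910777", "001910777",
               "065391526", "999999999"),
  org.name = c("JOHNS HOPKINS UNIVERSITY",
               "UNIVERSITY OF VIRGINIA CHARLOTTESVILLE",
               "JOHNS HOPKINS UNIVERSITY",
               "UNIVERSITY OF VIRGINIA",
               "SOME OTHER ORG")
)

# Count distinct names per DUNS number and keep the ambiguous ones.
ambiguous_duns <- toy_orgs %>%
  group_by(org.duns) %>%
  summarize(n.names = n_distinct(org.name)) %>%
  filter(n.names > 1)

print(ambiguous_duns)
```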

library(nihexporter)
#> Loading required package: jsonlite
#> Loading required package: httr
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyverse)
library(rlang)
#> 
#> Attaching package: 'rlang'
#> The following objects are masked from 'package:purrr':
#> 
#>     %@%, %||%, as_function, flatten, flatten_chr, flatten_dbl,
#>     flatten_int, flatten_lgl, invoke, list_along, modify, prepend,
#>     rep_along, splice
#> The following objects are masked from 'package:jsonlite':
#> 
#>     flatten, unbox
library(cowplot)
#> 
#> Attaching package: 'cowplot'
#> The following object is masked from 'package:ggplot2':
#> 
#>     ggsave

grant_funds <- function(codes, code_name) {
  code_name <- rlang::sym(code_name)
  
  projects %>%
    filter(activity %in% codes) %>%
    select(application.id, activity, fy.cost, fiscal.year) %>%
    left_join(project_orgs, by = "application.id") %>%
    left_join(org_info, by = "org.duns") %>%
    select(activity, fy.cost, fiscal.year, org.state) %>%
    mutate(
      activity = fct_collapse(
        activity, !!code_name := codes
      )
    ) %>%
    na.omit() %>%
    # filter for US states
    filter(org.state %in% state.abb)
}

fund_summary <- function(funds) {
  group_by(funds, activity, fiscal.year, org.state) %>%
    summarize(total.cost = sum(fy.cost, na.rm = TRUE)) %>%
    ungroup()
}

academic_funds <- grant_funds(c('R01'), "academic") %>% fund_summary()
industry_funds <- grant_funds(c('R41','R42'), "industry") %>% fund_summary()

combined_funds <- bind_rows(academic_funds, industry_funds) %>%
  spread(activity, total.cost) %>% na.omit() %>%
  mutate(fund.ratio = log10(academic / industry))
#> Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector

#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector

ggplot(combined_funds, aes(fiscal.year, fund.ratio)) +
  geom_point() + geom_line() + 
  facet_wrap(~ org.state)

Created on 2018-05-09 by the reprex package (v0.2.0).

shiny: make interactive funding times with `dygraphs`

http://rstudio.github.io/dygraphs/index.html