nihexporter

The nihexporter R package provides a minimal set of data from the NIH EXPORTER database, which contains information on NIH biomedical research funding from 1985 to 2021.

To keep the package lightweight, many details are omitted but can be easily retrieved from NIH RePORTER.

Installation

Install the package from GitHub with:

# install.packages('pak')
pak::pkg_install("jayhesselberth/nihexporter")

Note: this is a large data package (>40 MB).

Tables

  • projects: provides data on projects funded by NIH.

  • project_pis: links project numbers (project.num) to principal investigator IDs (pi.id).

  • publinks: links PubMed IDs (pmid) to project numbers (project.num).

  • publications: provides information for individual publications, including their Relative Citation Ratio values (rcr).

  • patents: links project numbers (project.num) to patent IDs (patent.id).

  • clinical_studies: links project IDs to associated clinical trials.

  • project_io: pre-computed n.pubs, n.patents and project.cost for each project.num.
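Because these tables share the project.num key, they can be combined with ordinary joins. The following is a minimal sketch in base R, using small made-up data frames in place of the real tables. The column names follow the list above; the `_toy` objects and all values are invented for illustration and are not part of the package.

```r
# Toy stand-ins for the package tables; all values are invented.
publinks_toy <- data.frame(
  pmid        = c(101L, 102L, 103L, 104L),
  project.num = c("R01AA000001", "R01AA000001", "R21BB000002", "R21BB000002"),
  stringsAsFactors = FALSE
)
project_pis_toy <- data.frame(
  project.num = c("R01AA000001", "R21BB000002"),
  pi.id       = c(11L, 22L),
  stringsAsFactors = FALSE
)

# Join publications to PIs through the shared project.num key.
pubs_by_pi <- merge(publinks_toy, project_pis_toy, by = "project.num")

# Count the publications linked to each PI.
pub_counts <- aggregate(pmid ~ pi.id, data = pubs_by_pi, FUN = length)
names(pub_counts)[2] <- "n.pubs"
print(pub_counts)
```

With the package loaded, the same pattern should apply to the real publinks and project_pis tables, for example with dplyr's left_join().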

Note: Abstracts from NIH EXPORTER are not provided as they significantly increase the size of the package.

Functions

  • rcr() retrieves Relative Citation Ratios and associated information for PubMed IDs.

  • nihexporter_sqlite() can be used to cache data in a local SQLite database.

Variables

  • nih.institutes: 27 NIH institutes in two-letter format

Contributors

jayhesselberth, speach

nihexporter's Issues

Add additional tables

Support for project.organization and project.pi is in the package, but the tables end up too big to import into GitHub.

Need to set up an indexing scheme for project PI and organization (just a relational index). Then we can really see exactly who the big winners (and losers) are!
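One way to read "relational index" here is a lookup table that assigns each organization (or PI) a compact integer key, so the large link table stores integers instead of repeated strings. A minimal base R sketch under that assumption follows; the data, the `org.id` column, and the object names are hypothetical and do not reflect the package's actual schema.

```r
# Hypothetical project-to-organization records; all values are invented.
project_orgs_raw <- data.frame(
  project.num = c("P1", "P2", "P3", "P4"),
  org.name    = c("UNIVERSITY A", "UNIVERSITY B", "UNIVERSITY A", "INSTITUTE C"),
  stringsAsFactors = FALSE
)

# Lookup table: one row per distinct organization with an integer key.
org_names <- sort(unique(project_orgs_raw$org.name))
org_index <- data.frame(
  org.id   = seq_along(org_names),
  org.name = org_names,
  stringsAsFactors = FALSE
)

# Link table: store the compact integer key instead of the repeated string.
project_org_links <- data.frame(
  project.num = project_orgs_raw$project.num,
  org.id      = match(project_orgs_raw$org.name, org_index$org.name),
  stringsAsFactors = FALSE
)

# The original names can be recovered with a join on org.id.
restored <- merge(project_org_links, org_index, by = "org.id")
print(restored)
```

The same approach would work for PIs, with a pi.id lookup in place of org.id.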

Link clinical studies to application ID

I am wondering if the clinical trials table could be linked to the projects table by application ID (rather than just by project.num).

library(nihexporter)
#> Loading required package: jsonlite
#> Loading required package: httr
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats

# this produces several NA values for application.id 
left_join(clinical_studies, projects) %>%
  select(application.id, one_of(names(clinical_studies)))
#> Joining, by = "project.num"
#> # A tibble: 248,786 × 4
#>    application.id                project.num    trial.id
#>             <int>                      <chr>       <chr>
#> 1              NA        261201100031C-0-0-1 NCT01831778
#> 2              NA 261201200042I-0-26100006-1 NCT02772003
#> 3              NA        261201400046C-0-0-1 NCT02464332
#> 4              NA        268200700015C-2-0-0 NCT00534495
#> 5              NA        268200700036C-5-0-1 NCT00556439
#> 6              NA        268200900040C-1-0-1 NCT01206062
#> 7              NA        268201000048C-5-0-1 NCT01322165
#> 8              NA        268201300046C-4-0-1 NCT00005485
#> 9              NA        268201300047C-4-0-1 NCT00005485
#> 10             NA        268201300048C-4-0-1 NCT00005485
#> # ... with 248,776 more rows, and 1 more variables: study.status <fctr>
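A left join fills application.id with NA for every clinical study whose project.num has no counterpart in projects, which is one likely source of the NA rows above. The sketch below reproduces that behavior on invented data and isolates the unmatched rows; the `_toy` frames and their values are hypothetical.

```r
# Toy stand-ins for clinical_studies and projects; all values are invented.
clinical_toy <- data.frame(
  project.num = c("X-0-0-1", "R01AA000001", "R21BB000002"),
  trial.id    = c("T1", "T2", "T3"),
  stringsAsFactors = FALSE
)
projects_toy <- data.frame(
  project.num    = c("R01AA000001", "R21BB000002"),
  application.id = c(1001L, 1002L),
  stringsAsFactors = FALSE
)

# Left join: keep every clinical study, matched or not (like left_join()).
joined <- merge(clinical_toy, projects_toy, by = "project.num", all.x = TRUE)

# Studies whose project.num is absent from projects end up with NA.
unmatched <- joined[is.na(joined$application.id), ]
print(unmatched)
```

Running the equivalent check on the real tables (for example with dplyr's anti_join()) would show whether the NA rows are limited to project numbers that never appear in projects.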

RCR analysis

Compare RCR to:

  • total cost (overall for NIH and by institute)
  • budget mechanism
  • type of grant (e.g. R01, P01, R21, etc.).

DUNS numbers are incorrect in `project_orgs` table

Sent the following to the NIH EXPORTER help folks on 2015 Mar 11, still waiting for fix.

For reference, I generated an R data package that pulls in the CSV
formatted data from 2000-2014:

https://github.com/jayhesselberth/nihexporter

The problem I identified is that there are several organizations that
have apparently been
assigned to the same number in the PROJECTS tables. Given that this is
supposed to be the
authoritative number for cross-referencing with institution
information, it would be nice if this were fixed.

For example, if I look at all of the organizations that have been
assigned to DUNS number 001910777 across all fiscal years, I get the
following result:

  org.duns  org.name                                count
1 001910777 JOHNS HOPKINS UNIVERSITY                 3905
2 001910777 UNIVERSITY OF TEXAS MD ANDERSON CAN CTR  3912
3 001910777 UNIVERSITY OF VIRGINIA CHARLOTTESVILLE   5864
4 001910777 OSEL, INC.                                 33

In this case, Johns Hopkins is the correct one, but there are actually
more assigned to UVA. In fact, UVA is only assigned to this DUNS
number in the PROJECT tables, but its actual DUNS number is 065391526.

Moreover, if I look up DUNS number 065391526, I get the following
result:

  org.name                       n()
1 UNIVERSITY OF COLORADO DENVER 7642
2 UNIVERSITY OF VIRGINIA        1328

with some of the DUNS numbers hitting UVA (under a different name
though), but more with University of Colorado Denver.

I looked up a particular UVA grant (F31AT000058) in NIH RePORTER, and
RePORTER has the correct UVA DUNS number on the DETAILS page. I would
have thought these are pulled from the same database, so it seems like
the EXPORTER export of DUNS numbers is not working correctly.
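The kind of check described in this message can be sketched as follows: count the distinct organization names attached to each DUNS number and flag any number with more than one. The data frame below is rebuilt by hand from the first table in the message; it is not read from the package, and the object names are illustrative.

```r
# Counts copied from the first table in the message above.
duns_counts <- data.frame(
  org.duns = rep("001910777", 4),
  org.name = c("JOHNS HOPKINS UNIVERSITY",
               "UNIVERSITY OF TEXAS MD ANDERSON CAN CTR",
               "UNIVERSITY OF VIRGINIA CHARLOTTESVILLE",
               "OSEL, INC."),
  count    = c(3905L, 3912L, 5864L, 33L),
  stringsAsFactors = FALSE
)

# Number of distinct organization names per DUNS number.
names_per_duns <- tapply(duns_counts$org.name, duns_counts$org.duns,
                         function(x) length(unique(x)))

# DUNS numbers shared by more than one organization name.
conflicting <- names(names_per_duns)[names_per_duns > 1]
print(conflicting)
```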

Add congressional district

Include ORG_DISTRICT in raw projects data for better geographic filtering options for funding advocacy.

FOIA

File a FOIA request for the total number of submissions per mechanism per year; success rates can then be calculated from those totals.
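Once submission totals are available, the success rate is the number of funded applications divided by the number submitted, per mechanism and year. A small sketch with hypothetical counts (the column names `n.submitted` and `n.funded` and all numbers are invented; real totals would come from the FOIA response):

```r
# Hypothetical counts; these are not real NIH figures.
applications <- data.frame(
  activity    = c("R01", "R01", "R21"),
  fiscal.year = c(2013L, 2014L, 2014L),
  n.submitted = c(1000L, 1200L, 500L),
  n.funded    = c(180L, 204L, 60L),
  stringsAsFactors = FALSE
)

# Success rate = funded applications / submitted applications.
applications$success.rate <- applications$n.funded / applications$n.submitted
print(applications)
```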

Investigator-initiated vs. industry collaborations

library(nihexporter)
#> Loading required package: jsonlite
#> Loading required package: httr
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyverse)
library(rlang)
#> 
#> Attaching package: 'rlang'
#> The following objects are masked from 'package:purrr':
#> 
#>     %@%, %||%, as_function, flatten, flatten_chr, flatten_dbl,
#>     flatten_int, flatten_lgl, invoke, list_along, modify, prepend,
#>     rep_along, splice
#> The following objects are masked from 'package:jsonlite':
#> 
#>     flatten, unbox
library(cowplot)
#> 
#> Attaching package: 'cowplot'
#> The following object is masked from 'package:ggplot2':
#> 
#>     ggsave

grant_funds <- function(codes, code_name) {
  code_name <- rlang::sym(code_name)
  
  projects %>%
    filter(activity %in% codes) %>%
    select(application.id, activity, fy.cost, fiscal.year) %>%
    left_join(project_orgs, by = "application.id") %>%
    left_join(org_info, by = "org.duns") %>%
    select(activity, fy.cost, fiscal.year, org.state) %>%
    mutate(
      activity = fct_collapse(
        activity, !!code_name := codes
      )
    ) %>%
    na.omit() %>%
    # filter for US states
    filter(org.state %in% state.abb)
}

fund_summary <- function(funds) {
  group_by(funds, activity, fiscal.year, org.state) %>%
    summarize(total.cost = sum(fy.cost, na.rm = TRUE)) %>%
    ungroup()
}

academic_funds <- grant_funds(c('R01'), "academic") %>% fund_summary()
industry_funds <- grant_funds(c('R41','R42'), "industry") %>% fund_summary()

combined_funds <- bind_rows(academic_funds, industry_funds) %>%
  spread(activity, total.cost) %>% na.omit() %>%
  mutate(fund.ratio = log10(academic / industry))
#> Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector

#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector

ggplot(combined_funds, aes(fiscal.year, fund.ratio)) +
  geom_point() + geom_line() + 
  facet_wrap(~ org.state)

Created on 2018-05-09 by the reprex package (v0.2.0).
