ukbschemas

Lifecycle: experimental

This R package can be used to create and/or load a database containing the UK Biobank Data Showcase schemas, which are data dictionaries describing the structure of the UK Biobank main dataset.

Installation

You can install the current version of ukbschemas from GitHub with:

# install.packages("devtools")
devtools::install_github("bjcairns/ukbschemas")

library(ukbschemas)

Examples

The package supports two workflows.

Save-Load workflow (recommended)

The recommended approach is to use ukbschemas_db() to download the schema tables and save them to an SQLite database, then use load_db() to load the tables from the database and store them as tibbles in a named list:

db <- ukbschemas_db(path = tempdir())
sch <- load_db(db = db)

By default, the database is named ukb-schemas-YYYY-MM-DD.sqlite (where YYYY-MM-DD is the current date) and placed in the current working directory. (path = tempdir() in the above example puts it in the current temporary directory instead.) At the most recent compilation of the database (03 August 2019), the size of the .sqlite database file produced by ukbschemas_db() was approximately 10.1MB.
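
Once loaded, the individual schemas can be inspected directly as tibbles. A minimal sketch (the table name fields assumes the tidied layout described under Design notes, below):

# List the tables that were loaded, then preview one of them
names(sch)
head(sch$fields)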

Note that without further arguments, ukbschemas_db() tidies up the database to give it a more consistent relational structure (the changes are summarised in the output of the first example, above). Alternatively the raw data can be loaded with the as_is argument:

db <- ukbschemas_db(path = tempdir(), overwrite = TRUE, as_is = TRUE)

The overwrite argument controls whether an existing database file is overwritten: TRUE allows it, FALSE prevents it, and if it is left unspecified in an interactive session (interactive() == TRUE), the user is prompted to decide.

Note: If you have created a schemas database with an earlier version of ukbschemas, it should be possible to load that database with the latest version of load_db(), which (currently) should load any SQLite database, regardless of contents.

Load-Save workflow

The second approach is to download the schemas and store them in memory as a list, saving them to a database only as required.

This is not recommended, because it is better (for everyone) not to download the schema files every time they are needed, and because the database assumes a certain structure that should be guaranteed when the database is saved. If you still want to take this approach, use:

sch <- ukbschemas()
db <- save_db(sch, path = tempdir())

Why R?

This package was originally written in bash (a Unix shell scripting language). However, R is more accessible, and all dependencies are installed along with the package; there is no need to install any secondary software (not even SQLite).

Notes

Design notes

  • All the encoding value tables (esimpint, esimpstring, esimpreal, esimpdate, ehierint, ehierstring) have been harmonised and combined into a single table encvalues. The value column in encvalues has type TEXT, but a type column has been added in case the value is not clear from context. The original type-specific tables have been deleted. (See the query sketch after this list.)
  • To avoid redundancy, category parent-child relationships have been moved to table categories, as column parent_id, from table catbrowse (which has been deleted).
  • The reference to the category to which a field belongs is held in the main_category column of the fields schema; this column has been renamed category_id for consistency with the categories schema.
  • Details of several of the field properties (value_type, stability, item_type, strata and sexed) are available elsewhere on the Data Showcase. These have been added manually to tables valuetypes, stability, itemtypes, strata and sexed, and appropriate ID references have been renamed with the _id suffix in tables fields and encodings.
  • There are several columns in the tables which are not well-documented (e.g. base_type in fields, availability in encodings and categories, and others). Additional tables documenting these encoded values may be included in future versions (and suggestions are welcome).
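
As a sketch of what the tidied layout allows, the combined encvalues table can be queried directly with DBI. (The database file name and the 'int' type label here are assumptions, not values confirmed by the package.)

library(DBI)

# Connect to a previously created schemas database (file name illustrative)
con <- dbConnect(RSQLite::SQLite(),
                 file.path(tempdir(), "ukb-schemas-2019-08-03.sqlite"))

# The type column added during tidying identifies the original
# type-specific table; "int" is an assumed label
int_values <- dbGetQuery(con, "SELECT * FROM encvalues WHERE type = 'int'")

dbDisconnect(con)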

Known code issues

  • The UK Biobank data schemas are regularly updated as new data are added to the system. ukbschemas does not currently include a facility for updating the database; it is necessary to create a new database.
  • Because readr::read_csv() reads whole numbers as type double, not integer (allowing 64-bit integers without loss of information), column types in schemas loaded in R will differ depending on whether the schemas are loaded directly to R or first saved to a database. This should make little or no difference for most applications. (See the illustration after this list.)
  • Any other issues are listed on the package's GitHub issue tracker.
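
To illustrate the column type point, a small sketch (file contents made up):

# readr guesses double for whole numbers, so an ID column of small
# integers still comes back with type "double"
tf <- tempfile(fileext = ".csv")
writeLines(c("field_id", "1", "2"), tf)
typeof(readr::read_csv(tf)$field_id)
#> [1] "double"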

ukbschemas's Issues

.tidy_schemas() does not handle missing tables

Most of the package works just fine with missing or extra tables because the functions generically save or load whatever tables are present.

.tidy_schemas(), however, fixes up specific tables, and it currently has no error handling for missing tables; it should warn about and skip any tables that are absent.
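
One possible shape for the fix, sketched with a hypothetical helper (not the package's actual internals):

# Hypothetical guard: apply a tidy step only if its table is present
.tidy_if_present <- function(schemas, name, tidy_fun) {
  if (!name %in% names(schemas)) {
    warning("table '", name, "' is missing; skipping its tidy step")
    return(schemas)
  }
  schemas[[name]] <- tidy_fun(schemas[[name]])
  schemas
}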

Category 119 is redundant

Category 119 is probably best left out of the schemas. It isn't linked to any other category so there is no loss.

UKB may end up fixing the category 100032 description instead, and if they decide otherwise then it would be included automatically by create_schema_db() anyway.

Error in guess_header_

When running db <- ukbschemas_db(), either with a new temp path or referring to an existing sqlite file, it gives me the following cryptic error:

Error in guess_header_(datasource, tokenizer, locale) : 
  Expected single logical value

It used to work fine. Did any of the dependencies change?

Should "recommended" table be folded into "fields" or "categories"?

The recommended table seems to link categories and fields in a way that is (presumably) distinct from the category_id column of fields. From the schema table:

UK Biobank maintains lists of 'recommended' data fields which researchers are encouraged to use as starting points for baskets in various areas of interest. Properties included are: Category ID; Field ID. These categories are identified in schema 3 as having group_type=0.

The structure of recommended is not yet clear. If recommended is 1:m or m:1 (categories:fields) then it could be added to the table for which its entries are NOT unique. A 1:m structure seems likely because categories with recommended fields are flagged in the categories table with group_type == 0.

If 1:1 then it depends on whether category_id in recommended matches category_id in fields for the same field_id. If so, then it can be removed and a recommended flag added to fields. If not, then the category_id could be added to fields as recommended_category_id or similar.

If m:n then it should be left as-is, because it cannot be added directly to either table without row or column duplication.
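
The cardinality question could be settled with a quick check along these lines (column names assumed):

# rec is the recommended table as a data frame (column names assumed).
# 1:m (one category, many fields): category_id repeats, field_id does not;
# m:1: the reverse; m:n: both repeat; 1:1: neither repeats.
c(field_id_repeats    = any(duplicated(rec$field_id)),
  category_id_repeats = any(duplicated(rec$category_id)))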

Use on.exit() where needed to ensure connection closed

E.g. this code, adapted from r-dbi/RSQLite/R/table.R, preserves state and rolls back when writing to the database:

dbBegin(conn, name = "dbWriteTable")              # open a named savepoint
on.exit(dbRollback(conn, name = "dbWriteTable"))  # roll back if anything fails
# ... write to the database ...
dbCommit(conn, name = "dbWriteTable")             # all writes succeeded: commit
on.exit(NULL)                                     # cancel the rollback handler

For each function taking this approach, tests need to be added to ensure the connection is closed in each circumstance.
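
For the connection-closing case specifically, a minimal sketch of the pattern (the wrapper name is hypothetical):

# on.exit() registers the disconnect immediately, so the connection
# is closed even if fun() throws an error part-way through
with_db_connection <- function(path, fun) {
  con <- DBI::dbConnect(RSQLite::SQLite(), path)
  on.exit(DBI::dbDisconnect(con))
  fun(con)
}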

README issues

This is a list of current issues with the README. They can all be fixed together, and this issue should be closed when the list is completed.

  • Description isn't very accurate ("create ... the UK Biobank Data Showcase schemas"?)
  • The primary (ukbschemas_db()) workflow should be in one block
  • Low-level headings should separate the two workflow options

Fix singular/plural use of "schema[s]"

Although the plural of schema is schemata, no one uses that. So schemas it is. And especially, schema is singular.

Usage is currently inconsistent so this should be fixed.

After creation tables do not follow the CREATE TABLE

When loading tidied tables, the tables are not stored in the form pre-specified by the CREATE TABLE SQL statements. Variable order and type are incorrect. For example:

db <- ukbschema::create_schema_db(path = tempdir())

The table encvalues is expected to have structure:

CREATE TABLE encvalues(
  "encoding_id" INTEGER,
  "code_id" INTEGER,
  "parent_id" INTEGER,
  "type" TEXT,
  "value" TEXT,
  "meaning" TEXT,
  "selectable" INTEGER,
  "showcase_order" INTEGER,
  PRIMARY KEY ("encoding_id", "code_id")
);

But then

$ sqlite3 ukb-schema-2019-07-11.sqlite ".schema encvalues"
CREATE TABLE `encvalues` (
  `encoding_id` REAL,
  `value` TEXT,
  `meaning` TEXT,
  `showcase_order` REAL,
  `parent_id` INTEGER,
  `selectable` INTEGER,
  `type` TEXT,
  `code_id` INTEGER
);
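
A plausible cause (an assumption, not confirmed against the code): writing with DBI::dbWriteTable(..., overwrite = TRUE) drops the pre-created table and recreates it with column order and types inferred from the data frame. Appending instead preserves the declared structure, as in this sketch:

library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Declare the intended structure first (abbreviated here) ...
dbExecute(con, '
  CREATE TABLE encvalues(
    "encoding_id" INTEGER,
    "code_id" INTEGER,
    "value" TEXT,
    PRIMARY KEY ("encoding_id", "code_id")
  )')

# ... then append rows, which keeps the declared column order and types;
# overwrite = TRUE would drop the table and re-infer both from the data
dbWriteTable(con, "encvalues",
             data.frame(encoding_id = 1L, code_id = 1L, value = "a"),
             append = TRUE)

dbDisconnect(con)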

Changes to data.table's fread function mean that date variables will now be imported as 'IDate' class, as opposed to 'character' class

Previously, data.table::fread() (used by the .tryRead() internal function of .import_schemas()) would import date variables as class character, necessitating a subsequent coercion to Date class. In order to achieve this, two further internal functions were used: .isISOdate() & .autoISOdate(). It would appear as though these functions are now redundant. However, as data.table::fread() imports date variables as IDate class, a new step is required to coerce these variables back to Date class. The version of data.table in which this change occurred should also be set as a hard dependency in the DESCRIPTION file.
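
A sketch of the new coercion step (column detection by class; file name illustrative):

library(data.table)

dt <- fread("codings.txt")  # file name illustrative

# fread() now returns date columns as IDate; coerce them back to Date
idate_cols <- names(dt)[vapply(dt, inherits, logical(1), what = "IDate")]
if (length(idate_cols) > 0) {
  dt[, (idate_cols) := lapply(.SD, as.Date), .SDcols = idate_cols]
}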

Overwrite allowed on Travis with open connection

On Travis-CI, build 6 failed because the test suite failed with the following report:

  ── 1. Failure: create_schema_db() fails to overwrite when db is connected (@test
  `{ ... }` did not throw an error.
  
  ══ testthat results  ═══════════════════════════════════════════════════════════
  OK: 19 SKIPPED: 1 WARNINGS: 1 FAILED: 1
  1. Failure: create_schema_db() fails to overwrite when db is connected (@test-create-schema-db.R#73) 

The failing test passes on my local Win10 OS, so maybe this is an OS issue?

The current work-around is to skip the test except on Windows, but it would be nice to know whether this goes deeper, e.g. if on Linux systems file.remove() works even when there is an open database connection. If so, some more robust way to check for open connections would be nice to have.

.tryRead() currently allows for 64-bit integers to be imported as class 'integer64', as opposed to class 'double'

While it may be appropriate to import 64-bit integers as class integer64, this poses some problems. The first is that it adds an optional dependency on the bit64 package. The second is that it breaks consistency with the fallback readr::read_delim() function used by .tryRead(), which does not have the ability to import variables as this class.

Setting the integer64 argument of fread() to "double" should rectify this issue, and ensure that all 64-bit integers are imported as class double.
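
A sketch of the proposed call (file name illustrative):

library(data.table)

# Read 64-bit integers as double, avoiding the optional bit64
# dependency and matching the readr::read_delim() fallback
dt <- fread("fields.txt", integer64 = "double")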

Feature to load schemas without creating database

In a sense this is the main purpose of the package: to get the schemas into R. But of course it is inefficient to download them every time from the UKB server.

I'm thinking about having a function, probably called ukb_schemas() or ukbschemas(), to do this, but not sure that it is a good idea.

A function like this could help make the create_schema_db() process more modular and user-accessible. If added, it should probably include warnings/recommendations to consider saving the database.

The ukbschemas_db function generates a parsing failure

Upon executing the following code a parsing failure is produced:

db <- ukbschemas_db(path = tempdir(), overwrite = T)

The parsing failure reads:

Warning: 1 parsing failure.
row col  expected    actual                                                            file
 84  -- 6 columns 9 columns 'http://biobank.ndph.ox.ac.uk/showcase/scdown.cgi?fmt=txt&id=4'

Review Unix-specific file removal code (per #22)

The use of lsof in #22 will allow testing of code to protect open database files, and error handling is implicit.

However, some errors (especially if lsof is missing/unavailable) will be misreported and better handling might be useful.

This code needs review, and consideration of how best to implement this fix to #5.
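
For reference, a minimal sketch of the kind of check involved (hypothetical helper; it requires lsof on the PATH and silently misreports when lsof is missing, which is exactly the weakness described above):

# Returns TRUE if lsof reports any process holding the file open;
# if lsof is absent, the output is empty and the file looks closed
file_is_open <- function(path) {
  out <- suppressWarnings(system2("lsof", shQuote(path), stdout = TRUE))
  length(out) > 0
}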

Tests require download of the actual schema files

A current limitation of testing the package is that it needs to download the schema files. Locally, this is handled by keeping a cached copy (and Travis should do the same). However, this is increasingly undesirable as the collective size of the schemas grows (now >60MB in the sqlite file).

Moreover, any problems with the data which cause parse errors will cause tests and builds to fail until the schema files are corrected by UKB. Responsive as they are, it would be better not to have this point of failure which is out of our control.

It is worth exploring whether a dummy dataset can be created for testing the package (on Travis, especially) which avoids the substantial download of the schema files.
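
One interim pattern, sketched with hypothetical paths and helper names, is to serve tests from a local cache and skip when it is absent:

# Hypothetical test helper: prefer a cached schema file; skip the
# test (rather than download the schemas) when no cache is available
cached_schema <- function(id, cache_dir = "tests/cache") {
  path <- file.path(cache_dir, sprintf("schema-%d.txt", id))
  if (!file.exists(path)) {
    testthat::skip("no cached schema files available")
  }
  path
}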
