ukbschemas

Lifecycle: experimental

This R package can be used to create and/or load a database containing the UK Biobank Data Showcase schemas, which are data dictionaries describing the structure of the UK Biobank main dataset.

Installation

You can install the current version of ukbschemas from GitHub with:

# install.packages("devtools")
devtools::install_github("bjcairns/ukbschemas")

library(ukbschemas)

Examples

The package supports two workflows.

Save-Load workflow (recommended)

The recommended approach is to use ukbschemas_db() to download the schema tables and save them to an SQLite database, then use load_db() to load the tables from the database and store them as tibbles in a named list:

db <- ukbschemas_db(path = tempdir())
sch <- load_db(db = db)

By default, the database is named ukb-schemas-YYYY-MM-DD.sqlite (where YYYY-MM-DD is the current date) and placed in the current working directory. (path = tempdir() in the above example puts it in the current temporary directory instead.) At the most recent compilation of the database (03 August 2019), the size of the .sqlite database file produced by ukbschemas_db() was approximately 10.1MB.
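
Once loaded, the individual schemas can be inspected directly as tibbles. A minimal sketch (the table name fields assumes the tidied layout described under Design notes, below):

# List the tables that were loaded, then preview one of them
names(sch)
head(sch$fields)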

Note that without further arguments, ukbschemas_db() tidies up the database to give it a more consistent relational structure (the changes are summarised in the output of the first example, above). Alternatively the raw data can be loaded with the as_is argument:

db <- ukbschemas_db(path = tempdir(), overwrite = TRUE, as_is = TRUE)

The overwrite argument controls whether an existing database file is overwritten: TRUE allows it, FALSE prevents it, and if it is left unspecified in an interactive session (interactive() == TRUE), the user is prompted to decide.

Note: If you have created a schemas database with an earlier version of ukbschemas, it should be possible to load that database with the latest version of load_db(), which (currently) should load any SQLite database, regardless of contents.

Load-Save workflow

The second approach is to download the schemas and store them in memory as a list, saving them to a database only as required.

This is not recommended, because it is better (for everyone) not to download the schema files every time they are needed, and because the database assumes a certain structure that should be guaranteed when the database is saved. If you still want to take this approach, use:

sch <- ukbschemas()
db <- save_db(sch, path = tempdir())

Why R?

This package was originally written in bash (a Unix shell scripting language). However, R is more accessible, and all dependencies are installed along with the package; there is no need to install any secondary software (not even SQLite).

Notes

Design notes

  • All the encoding value tables (esimpint, esimpstring, esimpreal, esimpdate, ehierint, ehierstring) have been harmonised and combined into a single table encvalues. The value column in encvalues has type TEXT, but a type column has been added in case the value is not clear from context. The original type-specific tables have been deleted. (See the query sketch after this list.)
  • To avoid redundancy, category parent-child relationships have been moved to table categories, as column parent_id, from table catbrowse (which has been deleted).
  • The reference to the category to which a field belongs is held in the main_category column of the fields schema; this column has been renamed category_id for consistency with the categories schema.
  • Details of several of the field properties (value_type, stability, item_type, strata and sexed) are available elsewhere on the Data Showcase. These have been added manually to tables valuetypes, stability, itemtypes, strata and sexed, and appropriate ID references have been renamed with the _id suffix in tables fields and encodings.
  • There are several columns in the tables which are not well-documented (e.g. base_type in fields, availability in encodings and categories, and others). Additional tables documenting these encoded values may be included in future versions (and suggestions are welcome).
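
As a sketch of what the tidied layout allows, the combined encvalues table can be queried directly with DBI. (The database file name and the 'int' type label here are assumptions, not values confirmed by the package.)

library(DBI)

# Connect to a previously created schemas database (file name illustrative)
con <- dbConnect(RSQLite::SQLite(),
                 file.path(tempdir(), "ukb-schemas-2019-08-03.sqlite"))

# The type column added during tidying identifies the original
# type-specific table; "int" is an assumed label
int_values <- dbGetQuery(con, "SELECT * FROM encvalues WHERE type = 'int'")

dbDisconnect(con)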

Known code issues

  • The UK Biobank data schemas are regularly updated as new data are added to the system. ukbschemas does not currently include a facility for updating the database; it is necessary to create a new database.
  • Because readr::read_csv() reads whole numbers as type double, not integer (allowing 64-bit integers without loss of information), column types in schemas loaded in R will differ depending on whether the schemas are loaded directly to R or first saved to a database. This should make little or no difference for most applications. (See the illustration after this list.)
  • Any other issues are listed on the package's GitHub issue tracker.
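
To illustrate the column type point, a small sketch (file contents made up):

# readr guesses double for whole numbers, so an ID column of small
# integers still comes back with type "double"
tf <- tempfile(fileext = ".csv")
writeLines(c("field_id", "1", "2"), tf)
typeof(readr::read_csv(tf)$field_id)
#> [1] "double"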

ukbschemas's Issues

.tidy_schemas() does not handle missing tables

Most of the package works just fine with missing or extra tables because the functions generically save or load whatever tables are present.

.tidy_schemas(), however, fixes up specific tables, and it currently has no error handling for missing tables; it should warn about and skip any tables that are absent.
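
One possible shape for the fix, sketched with a hypothetical helper (not the package's actual internals):

# Hypothetical guard: apply a tidy step only if its table is present
.tidy_if_present <- function(schemas, name, tidy_fun) {
  if (!name %in% names(schemas)) {
    warning("table '", name, "' is missing; skipping its tidy step")
    return(schemas)
  }
  schemas[[name]] <- tidy_fun(schemas[[name]])
  schemas
}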

Category 119 is redundant

Category 119 is probably best left out of the schemas. It isn't linked to any other category so there is no loss.

UKB may end up fixing the category 100032 description instead, and if they decide otherwise then it would be included automatically by create_schema_db() anyway.

Error in guess_header_

When running db <- ukbschemas_db(), either with a new temp path or referring to an existing sqlite file, it gives me the following cryptic error:

Error in guess_header_(datasource, tokenizer, locale) : 
  Expected single logical value

It used to work fine. Did any of the dependencies change?

Should "recommended" table be folded into "fields" or "categories"?

The recommended table seems to link categories and fields in a way that is (presumably) distinct from the category_id column of fields. From the schema table:

UK Biobank maintains lists of 'recommended' data fields which researchers are encouraged to use as starting points for baskets in various areas of interest. Properties included are: Category ID; Field ID. These categories are identified in schema 3 as having group_type=0.

The structure of recommended is not yet clear. If recommended is 1:m or m:1 (categories:fields) then it could be added to the table for which its entries are NOT unique. A 1:m structure seems likely because categories with recommended fields are flagged in the categories table with group_type == 0.

If 1:1 then it depends on whether category_id in recommended matches category_id in fields for the same field_id. If so, then it can be removed and a recommended flag added to fields. If not, then the category_id could be added to fields as recommended_category_id or similar.

If m:n then it should be left as-is, because it cannot be added directly to either table without row or column duplication.
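
The cardinality question could be settled with a quick check along these lines (column names assumed):

# rec is the recommended table as a data frame (column names assumed).
# 1:m (one category, many fields): category_id repeats, field_id does not;
# m:1: the reverse; m:n: both repeat; 1:1: neither repeats.
c(field_id_repeats    = any(duplicated(rec$field_id)),
  category_id_repeats = any(duplicated(rec$category_id)))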

Use on.exit() where needed to ensure connection closed

E.g. this code, adapted from r-dbi/RSQLite/R/table.R, preserves state and rolls back when writing to the database:

dbBegin(conn, name = "dbWriteTable")              # open a named savepoint
on.exit(dbRollback(conn, name = "dbWriteTable"))  # roll back if anything fails
# ... write to the database ...
dbCommit(conn, name = "dbWriteTable")             # all writes succeeded: commit
on.exit(NULL)                                     # cancel the rollback handler

For each function taking this approach, tests need to be added to ensure the connection is closed in each circumstance.
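
For the connection-closing case specifically, a minimal sketch of the pattern (the wrapper name is hypothetical):

# on.exit() registers the disconnect immediately, so the connection
# is closed even if fun() throws an error part-way through
with_db_connection <- function(path, fun) {
  con <- DBI::dbConnect(RSQLite::SQLite(), path)
  on.exit(DBI::dbDisconnect(con))
  fun(con)
}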

README issues

This is a list of current issues with the README. They can all be fixed together, and this issue should be closed when the list is completed.

  • Description isn't very accurate ("create ... the UK Biobank Data Showcase schemas"?)
  • The primary (ukbschemas_db()) workflow should be in one block
  • Low-level headings should separate the two workflow options

Fix singular/plural use of "schema[s]"

Although the plural of schema is schemata, no one uses that. So schemas it is. And especially, schema is singular.

Usage is currently inconsistent so this should be fixed.

After creation tables do not follow the CREATE TABLE

When loading tidied tables, the tables are not stored in the form pre-specified by the CREATE TABLE SQL statements. Variable order and type are incorrect. For example:

db <- ukbschema::create_schema_db(path = tempdir())

The table encvalues is expected to have structure:

CREATE TABLE encvalues(
  "encoding_id" INTEGER,
  "code_id" INTEGER,
  "parent_id" INTEGER,
  "type" TEXT,
  "value" TEXT,
  "meaning" TEXT,
  "selectable" INTEGER,
  "showcase_order" INTEGER,
  PRIMARY KEY ("encoding_id", "code_id")
);

But then

$ sqlite3 ukb-schema-2019-07-11.sqlite ".schema encvalues"
CREATE TABLE `encvalues` (
  `encoding_id` REAL,
  `value` TEXT,
  `meaning` TEXT,
  `showcase_order` REAL,
  `parent_id` INTEGER,
  `selectable` INTEGER,
  `type` TEXT,
  `code_id` INTEGER
);
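
A plausible cause (an assumption, not confirmed against the code): writing with DBI::dbWriteTable(..., overwrite = TRUE) drops the pre-created table and recreates it with column order and types inferred from the data frame. Appending instead preserves the declared structure, as in this sketch:

library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Declare the intended structure first (abbreviated here) ...
dbExecute(con, '
  CREATE TABLE encvalues(
    "encoding_id" INTEGER,
    "code_id" INTEGER,
    "value" TEXT,
    PRIMARY KEY ("encoding_id", "code_id")
  )')

# ... then append rows, which keeps the declared column order and types;
# overwrite = TRUE would drop the table and re-infer both from the data
dbWriteTable(con, "encvalues",
             data.frame(encoding_id = 1L, code_id = 1L, value = "a"),
             append = TRUE)

dbDisconnect(con)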

Changes to data.table's fread function mean that date variables will now be imported as 'IDate' class, as opposed to 'character' class

Previously, data.table::fread() (used by the .tryRead() internal function of .import_schemas()) would import date variables as class character, necessitating a subsequent coercion to Date class. In order to achieve this, two further internal functions were used: .isISOdate() & .autoISOdate(). It would appear as though these functions are now redundant. However, as data.table::fread() imports date variables as IDate class, a new step is required to coerce these variables back to Date class. The version of data.table in which this change occurred should also be set as a hard dependency in the DESCRIPTION file.
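
A sketch of the new coercion step (column detection by class; file name illustrative):

library(data.table)

dt <- fread("codings.txt")  # file name illustrative

# fread() now returns date columns as IDate; coerce them back to Date
idate_cols <- names(dt)[vapply(dt, inherits, logical(1), what = "IDate")]
if (length(idate_cols) > 0) {
  dt[, (idate_cols) := lapply(.SD, as.Date), .SDcols = idate_cols]
}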

Overwrite allowed on Travis with open connection

On Travis-CI, build 6 failed because the test suite failed with the following report:

  ── 1. Failure: create_schema_db() fails to overwrite when db is connected (@test
  `{ ... }` did not throw an error.
  
  ══ testthat results  ═══════════════════════════════════════════════════════════
  OK: 19 SKIPPED: 1 WARNINGS: 1 FAILED: 1
  1. Failure: create_schema_db() fails to overwrite when db is connected (@test-create-schema-db.R#73) 

The failing test passes on my local Win10 OS, so maybe this is an OS issue?

The current work-around is to skip the test except on Windows, but it would be nice to know whether this goes deeper, e.g. if on Linux systems file.remove() works even when there is an open database connection. If so, some more robust way to check for open connections would be nice to have.

.tryRead() currently allows for 64-bit integers to be imported as class 'integer64', as opposed to class 'double'

While it may be appropriate to import 64-bit integers as class integer64, this poses some problems. The first is that it adds an optional dependency on the bit64 package. The second is that it breaks consistency with the fallback readr::read_delim() function used by .tryRead(), which does not have the ability to import variables as this class.

Setting the integer64 argument of fread() to "double" should rectify this issue, and ensure that all 64-bit integers are imported as class double.
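
A sketch of the proposed call (file name illustrative):

library(data.table)

# Read 64-bit integers as double, avoiding the optional bit64
# dependency and matching the readr::read_delim() fallback
dt <- fread("fields.txt", integer64 = "double")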

Feature to load schemas without creating database

In a sense this is the main purpose of the package: to get the schemas into R. But of course it is inefficient to download them every time from the UKB server.

I'm thinking about having a function, probably called ukb_schemas() or ukbschemas(), to do this, but not sure that it is a good idea.

A function like this could help make the create_schema_db() process more modular and user-accessible. If added, it should probably include warnings/recommendations to consider saving the database.

The ukbschemas_db function generates a parsing failure

Upon executing the following code a parsing failure is produced:

db <- ukbschemas_db(path = tempdir(), overwrite = T)

The parsing failure reads:

Warning: 1 parsing failure.
row col  expected    actual                                                            file
 84  -- 6 columns 9 columns 'http://biobank.ndph.ox.ac.uk/showcase/scdown.cgi?fmt=txt&id=4'

Review Unix-specific file removal code (per #22)

The use of lsof in #22 will allow testing of code to protect open database files, and error handling is implicit.

However, some errors (especially if lsof is missing/unavailable) will be misreported and better handling might be useful.

This code needs review, and consideration of how best to implement this fix to #5.
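
For reference, a minimal sketch of the kind of check involved (hypothetical helper; it requires lsof on the PATH and silently misreports when lsof is missing, which is exactly the weakness described above):

# Returns TRUE if lsof reports any process holding the file open;
# if lsof is absent, the output is empty and the file looks closed
file_is_open <- function(path) {
  out <- suppressWarnings(system2("lsof", shQuote(path), stdout = TRUE))
  length(out) > 0
}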

Tests require download of the actual schema files

A current limitation of testing the package is that it needs to download the schema files. Locally, this is handled by keeping a cached copy (and Travis should do the same). However, this is increasingly undesirable as the collective size of the schemas grows (now >60MB in the sqlite file).

Moreover, any problems with the data which cause parse errors will cause tests and builds to fail until the schema files are corrected by UKB. Responsive as they are, it would be better not to have this point of failure which is out of our control.

It is worth exploring whether a dummy dataset can be created for testing the package (on Travis, especially) which avoids the substantial download of the schema files.
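
One interim pattern, sketched with hypothetical paths and helper names, is to serve tests from a local cache and skip when it is absent:

# Hypothetical test helper: prefer a cached schema file; skip the
# test (rather than download the schemas) when no cache is available
cached_schema <- function(id, cache_dir = "tests/cache") {
  path <- file.path(cache_dir, sprintf("schema-%d.txt", id))
  if (!file.exists(path)) {
    testthat::skip("no cached schema files available")
  }
  path
}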
