
linelist's People

Contributors

actions-user, annacarnegie, bisaloo, pitmonticone, thibautjombart, timtaylor

linelist's Issues

Move dplyr to Suggests

Once we are able to move ahead with the deprecation process for select(tags =) and select_tags().

Document the meaning of each tag

Tags provided by default in linelist currently don't include any explanation of their meaning. You can get a list of tags, or their allowed class, but that's it.

library(linelist)

tags_names()
#>  [1] "id"             "date_onset"     "date_reporting" "date_admission"
#>  [5] "date_discharge" "date_outcome"   "date_death"     "gender"        
#>  [9] "age"            "location"       "occupation"     "hcw"           
#> [13] "outcome"

tags_types()
#> $id
#> [1] "numeric"   "integer"   "character"
#> 
#> $date_onset
#> [1] "integer" "numeric" "Date"    "POSIXct" "POSIXlt"
#> 
#> $date_reporting
#> [1] "integer" "numeric" "Date"    "POSIXct" "POSIXlt"
#> 
#> $date_admission
#> [1] "integer" "numeric" "Date"    "POSIXct" "POSIXlt"
#> 
#> $date_discharge
#> [1] "integer" "numeric" "Date"    "POSIXct" "POSIXlt"
#> 
#> $date_outcome
#> [1] "integer" "numeric" "Date"    "POSIXct" "POSIXlt"
#> 
#> $date_death
#> [1] "integer" "numeric" "Date"    "POSIXct" "POSIXlt"
#> 
#> $gender
#> [1] "character" "factor"   
#> 
#> $age
#> [1] "numeric" "integer"
#> 
#> $location
#> [1] "character" "factor"   
#> 
#> $occupation
#> [1] "character" "factor"   
#> 
#> $hcw
#> [1] "logical"   "integer"   "character" "factor"   
#> 
#> $outcome
#> [1] "character" "factor"

tags_defaults()
#> $id
#> NULL
#> 
#> $date_onset
#> NULL
#> 
#> $date_reporting
#> NULL
#> 
#> $date_admission
#> NULL
#> 
#> $date_discharge
#> NULL
#> 
#> $date_outcome
#> NULL
#> 
#> $date_death
#> NULL
#> 
#> $gender
#> NULL
#> 
#> $age
#> NULL
#> 
#> $location
#> NULL
#> 
#> $occupation
#> NULL
#> 
#> $hcw
#> NULL
#> 
#> $outcome
#> NULL

Created on 2023-06-20 with reprex v2.0.2

This change should probably be done by creating a csv file containing the following columns:

  • tag name
  • tag type
  • tag meaning

The tags_names(), tags_types() and tags_defaults() functions should then be updated to read from this file.
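As a rough illustration of the proposal above, the sketch below parses a small tag dictionary and derives names, types and meanings from it. The column layout, the ";"-separated types, the dict_* helper names and the "meaning" texts are all assumptions made for this example; they are not taken from the package.

```r
# Hypothetical tag dictionary. The "meaning" texts are illustrative only.
tags_csv <- paste(
  "tag,types,meaning",
  "id,numeric;integer;character,Unique identifier of the case",
  "date_onset,integer;numeric;Date;POSIXct;POSIXlt,Date of symptom onset",
  "age,numeric;integer,Age of the case",
  sep = "\n"
)

# In a package this would be read from a file shipped with it; here the
# text is parsed inline so that the example is self-contained.
tag_dictionary <- read.csv(text = tags_csv, stringsAsFactors = FALSE)

# Hypothetical helpers that read from the dictionary
dict_tags_names <- function(dict = tag_dictionary) {
  dict$tag
}

dict_tags_types <- function(dict = tag_dictionary) {
  # Allowed classes are stored as a ";"-separated string in the CSV
  stats::setNames(strsplit(dict$types, ";", fixed = TRUE), dict$tag)
}

dict_tags_meaning <- function(dict = tag_dictionary) {
  stats::setNames(dict$meaning, dict$tag)
}

dict_tags_names()
dict_tags_types()$age
dict_tags_meaning()[["date_onset"]]
```

Keeping a single file as the source of truth would mean that names, types and documentation cannot drift apart.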

Anonymisation and anonymity testing

A fundamental barrier to sharing linelist data for further analysis/processing is the risk of identifying individuals, with substantial ethical and, potentially, legal implications. I wonder if linelist could help mitigate this risk by providing tools that help users check that none of the data it contains is identifiable.

I can see two potential functions that linelist could provide:

  1. A function to assess the re-identification risk, e.g. calculating its k-anonymity
  2. Some support to reduce re-identification risk, e.g. by replacing a column or set of columns with a unique identifier.
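For the first point, a minimal base-R sketch of a k-anonymity calculation could look like the following. The function name k_anonymity() and the toy data are hypothetical, and missing values are not given any special treatment here.

```r
# k-anonymity: size of the smallest group of records that share the same
# combination of quasi-identifier values.
k_anonymity <- function(x, quasi_identifiers) {
  stopifnot(is.data.frame(x), all(quasi_identifiers %in% names(x)))
  # One factor level per observed combination of quasi-identifiers
  combos <- interaction(x[quasi_identifiers], drop = TRUE)
  min(table(combos))
}

# Toy data, made up for this example
toy <- data.frame(
  gender    = c("f", "f", "m", "m", "m"),
  age_group = c("0-17", "0-17", "18-64", "18-64", "65+"),
  location  = c("A", "A", "B", "B", "B")
)

k_anonymity(toy, c("gender", "age_group", "location"))
k_anonymity(toy, "gender")
```

A value of 1 means that at least one record is unique on the chosen columns, which is the situation a user would want to be warned about before sharing.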

Compatibility linelist-dplyr

Hi Hugo, I was using linelist to build one of the pipelines for the case studies I'm working on, and I realised that there were some unexpected results, particularly when using the mutate() and filter() functions from dplyr.
I was creating a reproducible example for this when @jamesmbaazam brought to my attention (thanks James for all your help) that there is a WIP vignette that talks about this. I think it'd be very useful to bring this information to the forefront for users, so that they don't assume that this is a bug. Also, if possible, I think that linelist objects should be made compatible with all dplyr functions, as they are the most widely used and taught to public health practitioners.
Let me know what you think, many thanks!

Clarify policy on tag inclusion

linelist offers the ability for users to define their own tags but also includes some tags by default.

At the moment, we are not documenting how these default tags have been determined. This raises questions about addition of new tags. For example, the vaccineff package works with linelist data and would benefit from linelist having, e.g., a vaccination_status tag.

We should clarify:

  • how the original default tags have been chosen
  • what is the policy and the process to propose and add new tags

Release linelist 1.0.0

Prepare for release:

  • git pull
  • Check current CRAN check results
  • Check if any deprecation processes should be advanced, as described in Gradual deprecation
  • Polish NEWS
  • urlchecker::url_check()
  • devtools::build_readme()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • git push
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('major')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • Add preemptive link to blog post in pkgdown news menu
  • usethis::use_github_release()
  • usethis::use_dev_version(push = TRUE)
  • Finish blog post
  • Tweet

implement rename for linelist objects

dplyr::rename is a dangerous operation, as we risk losing tagged variables by renaming them. Ideally, we'll capture the inputs and process them to rename tags accordingly; as a first, easier solution, we should issue a warning if some tags have changed.

Class: linelist

Create a linelist S3 class, inheriting data.frame.

  • class constructor
  • handle: dates of onset, reporting, infection, outcome, admission, discharge, death
  • handle: strata
  • handle: gender
  • handle: outcome
  • handle: age

Consider using *labelled* as backend

It seems the package {labelled} is doing something very similar to our tagging system. We should consider using it as a backend. Some considerations:

  • check it can actually replace current functionalities
  • check if it adds useful functionalities
  • check how many deps would be pulled in

Add support for counts

While the package is primarily designed for case line lists, all features would equally apply to pre-calculated counts. It could be useful to add support for numbers of cases associated with specific date ranges, to be passed to incidence2::incidence(counts = ...).

implement restore_tags

When subsetted by columns, the tags may be altered. We need a procedure to restore the tags whose variables are still present in the output object.
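One possible shape for such a procedure is sketched below in base R. It assumes that tags are a named list (tag -> column name) kept in a "tags" attribute; the function name restore_tags_sketch() is hypothetical and this is not the package's implementation.

```r
# Assumed representation: tags are a named list (tag -> column name)
# stored in a "tags" attribute of a data.frame.
restore_tags_sketch <- function(x, old_tags) {
  # Keep only the tags whose variable is still a column of `x`
  still_there <- vapply(
    old_tags,
    function(var) !is.null(var) && var %in% names(x),
    logical(1)
  )
  attr(x, "tags") <- old_tags[still_there]
  x
}

x <- cars
old_tags <- list(date_onset = "dist", date_outcome = "speed")
attr(x, "tags") <- old_tags

# After column subsetting, the tags may be missing or out of date
y <- x["speed"]
y <- restore_tags_sketch(y, old_tags)
attr(y, "tags")
```

Here only date_outcome survives, because "dist" is no longer a column of the subsetted object.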

Tagging and validating aggregated epi data

Last month we had a meeting with the HARMONIZE team, where we discussed best ways to link climate and epi data, and identified opportunities for collaboration with Epiverse.
One of the issues that we identified was the lack of wrangling and cleaning tools for aggregated data, which is the most common format for climate data, often the format found for outbreak data (e.g., in situation reports), and therefore the format that would have to be used to merge both types of dataset.
When discussing the overlap between both projects, we realised that there is an existing gap that could be addressed using linelist functionality, namely the tagging and validation feature. HARMONIZE collaborators mentioned that it would be very valuable to extend this feature to tag aggregated data, to prevent the accidental loss of data in spatiotemporal units (e.g., data from a certain location over time) in the subsequent steps of the analysis.

At the moment, linelist only provides functions to handle individual-level data. I'm raising this issue here to have a discussion about expanding the functionality of the package to also provide functions for aggregated data.

Add a tidyselector for tags

And:

  • deprecate the tags argument in select.linelist()
  • ultimately, remove the custom select.linelist() as we are now fully compatible with the NextMethod()

Handle column subsetting

Subsetting columns risks dropping tagged variables. The proposed strategy is to implement the common generics for subsetting columns, which will silently drop the corresponding tags. This includes:

  • the [,] operator for 2-D objects
  • the [] operator for lists
  • the tidyverse select function

Add support for default date

Default date to be used in downstream analyses:

  • automated if there is a single date
  • user-specified if there are several dates
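The two rules above could be sketched as follows. The helper name default_date_sketch(), its arguments, and the convention that date tags start with "date_" are assumptions made for this example.

```r
# Hypothetical helper: choose which date tag downstream analyses should use.
# `tags` is assumed to be a named list (tag -> column name).
default_date_sketch <- function(tags, default = NULL) {
  date_tags <- grep("^date_", names(tags), value = TRUE)

  # User-specified default: must be one of the date tags present
  if (!is.null(default)) {
    if (!default %in% date_tags) {
      stop(
        "`default` must be one of: ", paste(date_tags, collapse = ", "),
        call. = FALSE
      )
    }
    return(default)
  }

  # Automated choice when there is a single date tag
  if (length(date_tags) == 1L) {
    return(date_tags)
  }

  stop(
    "Found ", length(date_tags), " date tags; please specify `default`.",
    call. = FALSE
  )
}

default_date_sketch(list(id = "case_id", date_onset = "onset"))
default_date_sketch(
  list(date_onset = "onset", date_death = "death"),
  default = "date_death"
)
```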

Make rename clever


Implement [[<- operator

The action x[[some_var]] <- NULL could possibly delete tagged variables. We need to fool-proof this by adding a dedicated method.
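A dedicated method could follow the pattern below. This is a sketch on a hypothetical "tagged_df" class whose tags are a named list in a "tags" attribute; the class name and the choice to warn (rather than error) are assumptions, not the package's behaviour.

```r
# `[[<-` method for a hypothetical "tagged_df" class
`[[<-.tagged_df` <- function(x, i, value) {
  tags <- attr(x, "tags")
  cls <- class(x)

  # Delegate the assignment itself to the data.frame method
  class(x) <- setdiff(cls, "tagged_df")
  x[[i]] <- value

  # Detect tags whose variable no longer exists after the assignment
  lost <- !vapply(tags, function(var) var %in% names(x), logical(1))
  if (any(lost)) {
    warning(
      "The following tags have lost their variable: ",
      paste(names(tags)[lost], collapse = ", "),
      call. = FALSE
    )
  }

  # Keep the remaining tags and restore the class
  attr(x, "tags") <- tags[!lost]
  class(x) <- cls
  x
}

x <- cars
attr(x, "tags") <- list(date_onset = "dist")
class(x) <- c("tagged_df", class(x))

x[["dist"]] <- NULL  # issues a warning about the lost date_onset tag
names(x)
```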

Tweak README incidence plot

As of reconverse/incidence2@3db823f, you can obtain a slightly nicer plot for the current README:

x_no_geo %>%
  tags_df() %>%
  incidence("date_onset", groups = c("gender", "outcome")) %>%
  plot(
    fill = "outcome",
    angle = 45,
    nrow = 2,
    border_colour = "white",
    legend = "bottom"
  )

See #70 for draft PR for reference.

build assessors for dates

Dates can be any numeric values, or a character to be converted to Date. POSIXct and POSIXlt should also be acceptable.

Add support for default group

Implement default group to be used by downstream analyses:

  • use location if provided and no user input
  • user-specified group

Accessors

Build accessors for all meta data stored in a linelist object:

Dates

  • date of onset
  • date of infection
  • date of reporting
  • date of outcome
  • date of death
  • date of admission
  • date of discharge

Other

  • strata
  • outcome
  • gender
  • age

linelist objects don't play well with `dplyr::relocate()`

library(linelist)

x <- make_linelist(cars, date_onset = "dist", date_outcome = "speed")

x
#> 
#> // linelist object
#>    speed dist
#> 1      4    2
#> 2      4   10
#> 3      7    4
#> 4      7   22
#> 5      8   16
#> 6      9   10
#> 7     10   18
#> 8     10   26
#> 9     10   34
#> 10    11   17
#> 11    11   28
#> 12    12   14
#> 13    12   20
#> 14    12   24
#> 15    12   28
#> 16    13   26
#> 17    13   34
#> 18    13   34
#> 19    13   46
#> 20    14   26
#> 21    14   36
#> 22    14   60
#> 23    14   80
#> 24    15   20
#> 25    15   26
#> 26    15   54
#> 27    16   32
#> 28    16   40
#> 29    17   32
#> 30    17   40
#> 31    17   50
#> 32    18   42
#> 33    18   56
#> 34    18   76
#> 35    18   84
#> 36    19   36
#> 37    19   46
#> 38    19   68
#> 39    20   32
#> 40    20   48
#> 41    20   52
#> 42    20   56
#> 43    20   64
#> 44    22   66
#> 45    23   54
#> 46    24   70
#> 47    24   92
#> 48    24   93
#> 49    24  120
#> 50    25   85
#> 
#> // tags: date_onset:dist, date_outcome:speed

dplyr::relocate(x)
#> 
#> // linelist object
#>   speed dist
#> 1     4    2
#> 2     4   10
#> 
#> // tags: date_onset:dist, date_outcome:speed

Created on 2023-04-18 with reprex v2.0.2.9000

methods may need to pass through an appropriate `strict` argument to `modify_defaults()`

library(linelist)
dat <- data.frame(a=1)
ll <- make_linelist(dat, a = "a", allow_extra = TRUE)
ll["a"]
#> Error in modify_defaults(tags_defaults(), new_tags): Unknown variable types: a
#>   Use only tags listed in `tags_names()`, or set `allow_extra = TRUE`

Created on 2023-04-27 with reprex v2.0.2

I think the issue is that methods need to pass an appropriate strict argument through to modify_defaults(), depending on whether extra tags have been allowed.


couple thoughts

Hey Thibaut - just read through this as it popped up on my timeline, and had a few thoughts.
We already spoke briefly about this a while back, but I like the concept and think it is going to be super important to have an implementation like this that allows for generalising script templates (e.g. r4epis).
We also talked briefly about whether a whole new class is entirely necessary, and I thought I would just highlight a few things that would probably make the dev side easier and make it more useful for a wider audience:

  • have you checked out {labelled}? It is similar to tags (and epis are used to this type of system from e.g. Stata). Combined with {matchmaker}, it is a pretty strong way of standardising data with dictionaries. There is a simple implementation in the PR for r4epis mortality templates, currently WIP. (It also plays nicely with {gtsummary}, so clean output tables are much easier.)
  • in {epidict} we wrote a function for checking whether variables from a dictionary are used in a script
  • in terms of validation have you looked at {pointblank}?
  • in {epikit} we have a function to check that start date is before end date.
  • also wanted to re-emphasise that as a community we should work with OCHA to create standardised HXL tags for common epi variables, and then a package like Dirk's {rhxl} would be a very lightweight solution when combined with {labelled}

anyway, long story short - I like the idea and think as a community we should emphasise the importance of data dictionaries and build around that with existing infrastructure, and maybe de-emphasise the need for a separate class and infrastructure cascade.
Just my two cents though :)

Build assessors for tagged variables

As we know what tagged variables should contain, we can build assessors for them:

  • date fields: numeric, Date, POSIXct, POSIXlt
  • categorical variables: character or factor
  • numeric variables
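The three kinds of checks listed above can be written as small predicates in base R. The function names below are hypothetical and the sketch only covers the classes named in the list.

```r
# Date fields: plain numbers, Date, POSIXct or POSIXlt
is_date_like <- function(x) {
  is.numeric(x) || inherits(x, c("Date", "POSIXct", "POSIXlt"))
}

# Categorical variables: character or factor
is_categorical <- function(x) {
  is.character(x) || is.factor(x)
}

# Numeric variables: is.numeric() is FALSE for factors and dates
is_numeric_var <- function(x) {
  is.numeric(x)
}

is_date_like(as.Date("2023-06-20"))
is_date_like("2023-06-20")
is_categorical(factor(c("recovered", "died")))
is_numeric_var(c(12, 45.5))
```

Note that is.numeric() returns FALSE for Date and POSIXct objects, which is why date classes are tested separately with inherits().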

Implement names<- operator

This is the base-R version of rename, which we need to implement for linelist with tag restoration feature.
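A sketch of what such a method might look like is given below, again on a hypothetical "tagged_df" class with tags stored as a named list in a "tags" attribute. Tags are updated by matching each tagged column's old name to its new name by position; this is an illustration under those assumptions rather than the package's code.

```r
# `names<-` method for a hypothetical "tagged_df" class
`names<-.tagged_df` <- function(x, value) {
  old_names <- names(x)
  tags <- attr(x, "tags")
  cls <- class(x)

  # Rename the columns using the default behaviour
  class(x) <- setdiff(cls, "tagged_df")
  names(x) <- value

  # Point each tag at the new name of the column it referred to
  attr(x, "tags") <- lapply(
    tags,
    function(var) value[match(var, old_names)]
  )
  class(x) <- cls
  x
}

x <- cars
attr(x, "tags") <- list(date_onset = "dist", date_outcome = "speed")
class(x) <- c("tagged_df", class(x))

names(x) <- c("velocity", "distance")
attr(x, "tags")
```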

Remove option to use `lost_tags_action()` as part of a pipeline

The trick of passing "error"/"warning"/"none" either as the first or the second argument in lost_tags_action() seems like a hack. I'm afraid this will cause us maintenance issues in the future and will confuse users.
My first reaction would be to remove the option to use it in a pipeline. This is a global option and is not, strictly speaking, part of the pipeline, so I find it confusing to use it here.
If you really want to allow its usage as part of a pipeline, a better option IMO would be to use ... and force the usage of a named argument for action:

lost_tags_action <- function(..., action = c("warning", "error", "none"),
                             quiet = FALSE) {
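To make the suggestion concrete, here is a self-contained sketch built around that signature. The function name lost_tags_action_sketch(), the option name and the message are hypothetical; only the argument list comes from the proposal above. The native pipe requires R 4.1 or later.

```r
# Because `...` comes first, a piped object is captured by `...` and
# `action` can only be set by name.
lost_tags_action_sketch <- function(..., action = c("warning", "error", "none"),
                                    quiet = FALSE) {
  action <- match.arg(action)

  # Hypothetical option name, used only in this example
  options(linelist_sketch_lost_tags_action = action)
  if (!quiet) {
    message("Lost tags will now issue: ", action)
  }

  # Return the piped object (if any) unchanged, so the pipeline can go on
  dots <- list(...)
  if (length(dots) > 0L) invisible(dots[[1L]]) else invisible(NULL)
}

cars |>
  lost_tags_action_sketch(action = "none", quiet = TRUE) |>
  head(2)
```

With this design, a positional call such as lost_tags_action_sketch("error") no longer changes the action, which removes the first-or-second-argument ambiguity at the cost of a slightly longer call.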

Compatibility with tibble and data.frame subclasses

This comes from a short slack discussion with @TimTaylor:

Should we support data.frame & tibble subclasses by default or only support classes that we have thoroughly tested with linelist?

One good example of a data.frame subclass that doesn't play well with linelist (because linelist breaks many of its assumptions) is data.table. Support for data.table is now disabled in #55 but it is likely that other data.frame / tibble subclasses will have the same issue and produce undefined behaviour that doesn't match user expectations.

An extreme measure would be to allow only classes that we have properly and thoroughly tested, i.e., only data.frame and tibble. But then we could make it difficult for users to use other subclasses that we don't know about and that are strictly compatible with data.frame and tibble, and by extension with linelist.

Re-implement safer select

Current implementation of select is user-friendly but unsafe as tags are not formally differentiated from regular variables. A safer implementation will be to have:

  • tags_df(x): returns a data.frame of all tagged variables
  • select_tags(x, ...): return a data.frame, i.e. losing the linelist class
  • select(x, ..., tags = NULL): return a linelist, checking that tags are not lost, and renaming tags if tags is used

Typo in NEWS

  • I have the most recent version of the package and R
  • I have found a bug (more of a typo)

The NEWS.md document links to #76 for parallelisation of testing but I believe it should instead direct to #77.

create package skeleton

  • basic structure
  • testing infrastructure
  • README
  • codecov + testing coverage tag
  • gh actions for pkg check
  • pkg check tag

validator for linelist objects

The function validate_linelist() should perform a series of sanity checks on the object:

  • the linelist object should have the right class
  • have correctly formed tags
  • each tagged variable should be of the right type
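The three checks could be chained as in the sketch below. It uses a hypothetical "tagged_df" class with tags stored as a named list in a "tags" attribute, and takes the allowed classes as an argument; validate_sketch() is an illustrative name, not the package function.

```r
validate_sketch <- function(x, allowed_types) {
  # 1. The object should have the right class
  if (!inherits(x, "tagged_df")) {
    stop("`x` must inherit from 'tagged_df'", call. = FALSE)
  }

  # 2. Tags should be a named list of single column names present in `x`
  tags <- attr(x, "tags")
  well_formed <- is.list(tags) && (
    length(tags) == 0L || (
      !is.null(names(tags)) &&
        all(nzchar(names(tags))) &&
        all(vapply(
          tags,
          function(v) is.character(v) && length(v) == 1L,
          logical(1)
        ))
    )
  )
  if (!well_formed) {
    stop("Tags must be a named list of column names", call. = FALSE)
  }
  missing_vars <- setdiff(unlist(tags), names(x))
  if (length(missing_vars) > 0L) {
    stop(
      "Tagged variables not found: ", paste(missing_vars, collapse = ", "),
      call. = FALSE
    )
  }

  # 3. Each tagged variable should be of an allowed type
  for (tag in names(tags)) {
    allowed <- allowed_types[[tag]]
    if (!is.null(allowed) && !any(class(x[[tags[[tag]]]]) %in% allowed)) {
      stop(
        "Variable tagged as '", tag, "' must be one of: ",
        paste(allowed, collapse = ", "),
        call. = FALSE
      )
    }
  }
  invisible(x)
}

allowed_types <- list(
  date_onset = c("integer", "numeric", "Date", "POSIXct", "POSIXlt"),
  gender = c("character", "factor")
)
x <- data.frame(onset = as.Date("2023-06-20") + 0:1, sex = c("f", "m"))
attr(x, "tags") <- list(date_onset = "onset", gender = "sex")
class(x) <- c("tagged_df", class(x))

validate_sketch(x, allowed_types)  # passes silently
```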
