Generate random data sets

wakefield

wakefield is designed to quickly generate random data sets. The user passes n (number of rows) and predefined vectors to the r_data_frame function to produce a dplyr::tbl_df object.

Installation
Contact
Demonstration

Installation

To download the development version of wakefield:

Download the zip ball or tar ball, decompress and run R CMD INSTALL on it, or use the pacman package to install the development version:

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/wakefield")
pacman::p_load(dplyr, tidyr, ggplot2)

Contact

You are welcome to:

* submit suggestions and bug reports at: https://github.com/trinker/wakefield/issues
* send a pull request on: https://github.com/trinker/wakefield/
* compose a friendly e-mail to: [email protected]

Demonstration

Getting Started

The r_data_frame function (random data frame) takes n (the number of rows) and any number of variables (columns). These columns are typically produced from a wakefield variable function. Each of these variable functions has a preset behavior that produces a named vector of length n, allowing the user to lazily pass unnamed functions (optionally, without call parentheses). The column name is stored in a varname attribute. For example, here we see the race variable function:

race(n=10)

##  [1] Bi-Racial White     Bi-Racial Native    White     White     White     Asian     White     Hispanic 
## Levels: White Hispanic Black Asian Bi-Racial Native Other Hawaiian

attributes(race(n=10))

## $levels
## [1] "White"     "Hispanic"  "Black"     "Asian"     "Bi-Racial" "Native"    "Other"     "Hawaiian" 
## 
## $class
## [1] "variable" "factor"  
## 
## $varname
## [1] "Race"

When this variable is used inside of r_data_frame the varname is used as a column name. Additionally, the n argument is not set within variable functions but is set once in r_data_frame:

r_data_frame(
    n = 500,
    race
)

## Warning: `tbl_df()` is deprecated as of dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

## # A tibble: 500 x 1
##    Race    
##    <fct>   
##  1 White   
##  2 White   
##  3 White   
##  4 White   
##  5 Black   
##  6 Black   
##  7 White   
##  8 White   
##  9 Hispanic
## 10 White   
## # ... with 490 more rows
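
To make the varname mechanism concrete, here is a minimal base R sketch. It is not wakefield's implementation: toy_race and toy_frame are hypothetical names, and the three levels are an arbitrary subset. It only illustrates how a column name can travel with a vector as an attribute while n is supplied once by the builder:

```r
# Hypothetical variable function: returns a factor tagged with a varname attribute
toy_race <- function(n, name = "Race") {
  x <- factor(sample(c("White", "Hispanic", "Black"), n, replace = TRUE))
  attr(x, "varname") <- name
  x
}

# Hypothetical builder: calls each function with the shared n and
# uses the varname attribute of each result as the column name
toy_frame <- function(n, ...) {
  cols <- lapply(list(...), function(f) f(n))
  names(cols) <- vapply(cols, function(x) attr(x, "varname"), character(1))
  data.frame(cols, check.names = FALSE)
}

set.seed(1)
toy_frame(5, toy_race)
```

Because the name lives on the vector itself, the caller never has to type `Race = ...`.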

The power of r_data_frame is apparent when we use many modular variable functions:

r_data_frame(
    n = 500,
    id,
    race,
    age,
    sex,
    hour,
    iq,
    height,
    died
)

## # A tibble: 500 x 8
##    ID    Race        Age Sex    Hour        IQ Height Died 
##    <chr> <fct>     <int> <fct>  <times>  <dbl>  <dbl> <lgl>
##  1 001   White        25 Female 00:00:00    93     69 TRUE 
##  2 002   White        80 Male   00:00:00    87     59 FALSE
##  3 003   White        60 Female 00:00:00   119     74 TRUE 
##  4 004   Bi-Racial    54 Female 00:00:00   109     72 FALSE
##  5 005   White        75 Female 00:00:00   106     70 FALSE
##  6 006   White        54 Male   00:00:00    89     67 TRUE 
##  7 007   Hispanic     67 Male   00:00:00    94     73 TRUE 
##  8 008   Bi-Racial    86 Female 00:00:00   100     65 TRUE 
##  9 009   Hispanic     56 Male   00:00:00    92     76 FALSE
## 10 010   Hispanic     52 Female 00:00:00   104     71 FALSE
## # ... with 490 more rows

There are 49 wakefield-based variable functions to choose from, spanning R’s various data types (see ?variables for details).

age	dice	hair	military	sex_inclusive
animal	dna	height	month	smokes
answer	dob	income	name	speed
area	dummy	internet_browser	normal	state
car	education	iq	political	string
children	employment	language	race	upper
coin	eye	level	religion	valid
color	grade	likert	sat	year
date_stamp	grade_level	lorem_ipsum	sentence	zip_code
death	group	marital	sex

Available Variable Functions

However, the user may also pass their own vector producing functions or vectors to r_data_frame. Those with an n argument can be set by r_data_frame:

r_data_frame(
    n = 500,
    id,
    Scoring = rnorm,
    Smoker = valid,
    race,
    age,
    sex,
    hour,
    iq,
    height,
    died
)

## # A tibble: 500 x 10
##    ID    Scoring Smoker Race       Age Sex    Hour        IQ Height Died 
##    <chr>   <dbl> <lgl>  <fct>    <int> <fct>  <times>  <dbl>  <dbl> <lgl>
##  1 001    0.833  FALSE  White       20 Female 00:00:00    92     69 TRUE 
##  2 002   -0.529  TRUE   Hispanic    83 Female 00:00:00    99     74 TRUE 
##  3 003   -0.704  TRUE   Hispanic    24 Male   00:00:00   115     62 TRUE 
##  4 004   -0.839  TRUE   Asian       19 Female 00:00:00   113     69 TRUE 
##  5 005    0.606  TRUE   White       70 Male   00:00:00    95     68 FALSE
##  6 006    1.46   FALSE  Other       45 Female 00:00:00   110     78 FALSE
##  7 007   -0.681  TRUE   Black       47 Female 00:00:00    98     64 TRUE 
##  8 008    0.541  FALSE  White       88 Male   00:30:00    75     70 TRUE 
##  9 009   -0.294  FALSE  Hispanic    89 Male   00:30:00   104     63 FALSE
## 10 010    0.0749 FALSE  Hispanic    74 Female 00:30:00   105     69 TRUE 
## # ... with 490 more rows

r_data_frame(
    n = 500,
    id,
    age, age, age,
    grade, grade, grade
)

## # A tibble: 500 x 7
##    ID    Age_1 Age_2 Age_3 Grade_1 Grade_2 Grade_3
##    <chr> <int> <int> <int>   <dbl>   <dbl>   <dbl>
##  1 001      67    24    89    82.4    86.8    90.6
##  2 002      55    76    27    87.3    85.4    89.8
##  3 003      60    61    22    82.2    87      90.1
##  4 004      50    19    56    96.4    86.6    95.6
##  5 005      83    77    71    88.8    87.5    84.4
##  6 006      55    71    76    87.3    96.5    86.5
##  7 007      88    36    75    92.1    91.6    93.4
##  8 008      71    48    81    87.9    91.4    80.9
##  9 009      76    78    21    86.9    93.6    84.3
## 10 010      49    68    47    85.5    93      86.6
## # ... with 490 more rows

While passing variable functions to r_data_frame without call parentheses is handy, the user may wish to set arguments. This can be done through call parentheses, as we do with data.frame or dplyr::data_frame:

r_data_frame(
    n = 500,
    id,
    Scoring = rnorm,
    Smoker = valid,
    `Reading(mins)` = rpois(lambda=20),  
    race,
    age(x = 8:14),
    sex,
    hour,
    iq,
    height(mean=50, sd = 10),
    died
)

## # A tibble: 500 x 11
##    ID    Scoring Smoker `Reading(mins)` Race       Age Sex    Hour        IQ Height Died 
##    <chr>   <dbl> <lgl>            <int> <fct>    <int> <fct>  <times>  <dbl>  <dbl> <lgl>
##  1 001    2.48   FALSE               10 White        9 Male   00:00:00    93     44 TRUE 
##  2 002    0.566  FALSE               14 Hispanic    10 Male   00:00:00   116     58 FALSE
##  3 003   -0.563  FALSE               19 Hispanic     8 Female 00:00:00    97     64 TRUE 
##  4 004    0.0187 TRUE                19 White        9 Male   00:00:00   104     58 TRUE 
##  5 005   -0.462  FALSE               17 Hispanic    11 Male   00:00:00    96     53 FALSE
##  6 006   -1.13   FALSE               17 White       10 Male   00:00:00    91     66 TRUE 
##  7 007   -0.673  TRUE                15 White       13 Female 00:00:00    99     61 FALSE
##  8 008    0.164  TRUE                22 White       11 Male   00:00:00   106     47 FALSE
##  9 009   -0.227  FALSE               21 White       12 Female 00:00:00   101     54 TRUE 
## 10 010    0.762  TRUE                22 White        8 Male   00:00:00   107     50 FALSE
## # ... with 490 more rows
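
One way to picture how a bare name (rnorm) and a partial call (rpois(lambda = 20)) can both receive n is the following base R sketch. It is an assumption about the general technique, not wakefield's actual code, and mini_frame is a hypothetical name:

```r
# Hypothetical helper: accepts bare function names or partial calls
# and supplies n to each of them.
mini_frame <- function(n, ...) {
  exprs <- as.list(substitute(list(...)))[-1]  # unevaluated arguments
  env <- parent.frame()                        # where the caller's names live
  cols <- lapply(exprs, function(e) {
    if (is.call(e)) {
      e$n <- n          # rpois(lambda = 20) becomes rpois(lambda = 20, n = <n>)
      eval(e, env)
    } else {
      eval(e, env)(n)   # a bare name is looked up and called with n
    }
  })
  # Column names: the supplied name, else the function's own name
  given <- names(exprs)
  if (is.null(given)) given <- rep("", length(exprs))
  fallback <- vapply(exprs, function(e) {
    if (is.call(e)) deparse(e[[1]]) else deparse(e)
  }, character(1))
  names(cols) <- ifelse(nzchar(given), given, fallback)
  data.frame(cols, check.names = FALSE)
}

set.seed(1)
mini_frame(5, Scoring = rnorm, `Reading(mins)` = rpois(lambda = 20), runif)
```

The sketch only handles functions whose size argument is called n, which matches the restriction stated above for user-supplied functions.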

Random Missing Observations

Data often contain missing values. wakefield allows the user to add a proportion of missing values per column/vector via the r_na (random NA) function. This works nicely within a dplyr/magrittr %>% pipeline:

r_data_frame(
    n = 30,
    id,
    race,
    age,
    sex,
    hour,
    iq,
    height,
    died,
    Scoring = rnorm,
    Smoker = valid
) %>%
    r_na(prob=.4)

## # A tibble: 30 x 10
##    ID    Race       Age Sex    Hour        IQ Height Died  Scoring Smoker
##    <chr> <fct>    <int> <fct>  <times>  <dbl>  <dbl> <lgl>   <dbl> <lgl> 
##  1 01    Hispanic    24 Female 01:30:00    92     70 NA     NA     NA    
##  2 02    White       NA Female <NA>        NA     NA FALSE   0.696 TRUE  
##  3 03    Hispanic    NA Female 02:00:00   107     68 FALSE  -0.113 TRUE  
##  4 04    Black       29 Female <NA>        93     75 TRUE   -1.64  TRUE  
##  5 05    <NA>        43 Female 03:30:00    NA     NA NA     -0.705 FALSE 
##  6 06    Black       NA <NA>   04:00:00    93     NA TRUE   NA     NA    
##  7 07    Hispanic    60 <NA>   <NA>        NA     NA TRUE   NA     NA    
##  8 08    Hispanic    NA <NA>   <NA>        NA     NA TRUE   NA     FALSE 
##  9 09    <NA>        34 <NA>   05:30:00    NA     70 NA     -1.44  TRUE  
## 10 10    White       88 <NA>   <NA>        NA     NA NA     NA     NA    
## # ... with 20 more rows
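
For readers who want to see the idea without the package, here is a base R sketch of random NA injection. add_random_na is a hypothetical helper, not r_na itself. Skipping the first column by default is an assumption that mirrors the untouched ID column in the output above, and blanking an exact count of cells per column is a simplification:

```r
# Hypothetical helper: set a proportion `prob` of each chosen column to NA
add_random_na <- function(df, prob = 0.05, cols = seq_along(df)[-1]) {
  n <- nrow(df)
  k <- round(n * prob)                   # number of cells to blank per column
  for (j in cols) {
    df[[j]][sample.int(n, k)] <- NA      # independent positions for each column
  }
  df
}

set.seed(10)
toy <- data.frame(
  id    = sprintf("%02d", 1:30),
  age   = sample(20:89, 30, replace = TRUE),
  score = rnorm(30)
)
head(add_random_na(toy, prob = 0.4))
```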

Repeated Measures & Time Series

The r_series function allows the user to pass a single wakefield function and dictate how many columns (j) to produce.

set.seed(10)

r_series(likert, j = 3, n=10)

## # A tibble: 10 x 3
##    Likert_1          Likert_2          Likert_3         
##  * <ord>             <ord>             <ord>            
##  1 Neutral           Agree             Agree            
##  2 Strongly Agree    Strongly Disagree Strongly Agree   
##  3 Agree             Strongly Disagree Agree            
##  4 Disagree          Strongly Disagree Agree            
##  5 Neutral           Strongly Agree    Strongly Agree   
##  6 Agree             Disagree          Disagree         
##  7 Agree             Agree             Strongly Disagree
##  8 Agree             Strongly Disagree Agree            
##  9 Strongly Disagree Agree             Neutral          
## 10 Neutral           Strongly Disagree Neutral

Often the user wants a numeric score for Likert-type columns and similar variables. For series with multiple factors, the as_integer function converts all columns to integer values. Additionally, we may want to specify column name prefixes. This can be accomplished via the variable function’s name argument. Both of these features are demonstrated here.

set.seed(10)

as_integer(r_series(likert, j = 5, n=10, name = "Item"))

## # A tibble: 10 x 5
##    Item_1 Item_2 Item_3 Item_4 Item_5
##     <int>  <int>  <int>  <int>  <int>
##  1      3      4      4      4      5
##  2      5      1      5      3      1
##  3      4      1      4      5      4
##  4      2      1      4      4      5
##  5      3      5      5      2      5
##  6      4      2      2      3      4
##  7      4      4      1      4      1
##  8      4      1      4      1      2
##  9      1      4      3      5      3
## 10      3      1      3      5      5
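
The integer scores follow the order of the factor levels. A small base R illustration of the same conversion, using the Likert labels shown above (this uses plain as.integer rather than wakefield's as_integer, and the data frame is made up for the example):

```r
likert_levels <- c("Strongly Disagree", "Disagree", "Neutral",
                   "Agree", "Strongly Agree")

# An ordered factor stores each response as its position in `levels`
resp <- factor(c("Neutral", "Strongly Agree", "Agree", "Disagree"),
               levels = likert_levels, ordered = TRUE)
as.integer(resp)   # 3 5 4 2

# Convert every column of a data frame of such factors in one step
scores <- data.frame(Item_1 = resp, Item_2 = rev(resp))
scores[] <- lapply(scores, as.integer)
scores
```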

r_series can be used within a r_data_frame as well.

set.seed(10)

r_data_frame(n=100,
    id,
    age,
    sex,
    r_series(likert, 3, name = "Question")
)

## # A tibble: 100 x 6
##    ID      Age Sex    Question_1        Question_2        Question_3       
##    <chr> <int> <fct>  <ord>             <ord>             <ord>            
##  1 001      26 Male   Strongly Agree    Disagree          Disagree         
##  2 002      72 Male   Disagree          Agree             Strongly Disagree
##  3 003      89 Male   Strongly Disagree Strongly Disagree Strongly Agree   
##  4 004      71 Female Agree             Strongly Agree    Disagree         
##  5 005      56 Female Strongly Disagree Disagree          Neutral          
##  6 006      32 Female Strongly Disagree Strongly Agree    Disagree         
##  7 007      32 Female Strongly Disagree Strongly Agree    Strongly Disagree
##  8 008      59 Female Neutral           Strongly Agree    Strongly Disagree
##  9 009      88 Male   Agree             Agree             Agree            
## 10 010      51 Male   Agree             Disagree          Neutral          
## # ... with 90 more rows

set.seed(10)

r_data_frame(n=100,
    id,
    age,
    sex,
    r_series(likert, 5, name = "Item", integer = TRUE)
)

## # A tibble: 100 x 8
##    ID      Age Sex    Item_1 Item_2 Item_3 Item_4 Item_5
##    <chr> <int> <fct>   <int>  <int>  <int>  <int>  <int>
##  1 001      26 Male        5      2      2      4      5
##  2 002      72 Male        2      4      1      4      3
##  3 003      89 Male        1      1      5      4      4
##  4 004      71 Female      4      5      2      1      2
##  5 005      56 Female      1      2      3      3      2
##  6 006      32 Female      1      5      2      5      1
##  7 007      32 Female      1      5      1      1      5
##  8 008      59 Female      3      5      1      4      1
##  9 009      88 Male        4      4      4      3      2
## 10 010      51 Male        4      2      3      1      3
## # ... with 90 more rows

Related Series

The user can also create related series via the relate argument in r_series. It allows the user to specify the relationship between columns. relate may be a named list or a shorthand string of the form "fM_sd" where:

f is one of (+, -, *, /)
M is a mean value
sd is a standard deviation of the mean value

For example you may use relate = "*4_1". If relate = NULL no relationship is generated between columns. I will use the short hand string form here.
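
The shorthand can be read as "apply operation f to the previous column, using a random value with mean M and standard deviation sd". The base R sketch below shows one plausible reading of that rule. It is an assumption, not wakefield's code: parse_relate and related_series are hypothetical names, and the sketch does not attempt the rounding visible in the package's printed output.

```r
# Hypothetical parser for the "fM_sd" shorthand, e.g. "+1_6" or "*1.05_.1"
parse_relate <- function(relate) {
  m <- regmatches(relate, regexec("^([-+*/])([0-9.]+)_([0-9.]+)$", relate))[[1]]
  if (length(m) != 4) stop("relate must look like 'fM_sd', e.g. '+1_6'")
  list(operation = m[2], mean = as.numeric(m[3]), sd = as.numeric(m[4]))
}

# Hypothetical generator: column k is column k - 1 combined with random noise
related_series <- function(x, j, relate) {
  r <- parse_relate(relate)
  op <- match.fun(r$operation)          # turns "+" into the function `+`
  out <- vector("list", j)
  out[[1]] <- x
  for (k in seq_len(j)[-1]) {
    out[[k]] <- op(out[[k - 1]], rnorm(length(x), r$mean, r$sd))
  }
  names(out) <- paste0("V_", seq_len(j))
  as.data.frame(out)
}

# With sd = 0 every step adds exactly 5
related_series(c(83, 48, 80), j = 5, relate = "+5_0")
```

Under this reading, a standard deviation of 0 makes each column a deterministic function of the previous one.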

Some Examples With Variation

r_series(grade, j = 5, n = 100, relate = "+1_6")

## # A tibble: 100 x 5
##    Grade_1    Grade_2    Grade_3    Grade_4    Grade_5   
##  * <variable> <variable> <variable> <variable> <variable>
##  1 90.0       98.7        98.6      104.6      114.1     
##  2 96.3       97.9        98.4      102.7      103.9     
##  3 96.6       92.6        94.9       92.7       98.8     
##  4 84.5       89.5        81.9       87.4       83.4     
##  5 86.8       84.1        82.2       82.8       94.0     
##  6 82.1       77.9        74.3       76.4       73.0     
##  7 90.9       96.1       107.5      120.2      126.8     
##  8 86.7       88.6        90.3       89.0       83.8     
##  9 86.1       84.1        88.9       90.1       72.6     
## 10 86.4       92.3        88.5       94.6       99.0     
## # ... with 90 more rows

r_series(age, 5, 100, relate = "+5_0")

## # A tibble: 100 x 5
##    Age_1      Age_2      Age_3      Age_4      Age_5     
##  * <variable> <variable> <variable> <variable> <variable>
##  1 83         88         93         98         103       
##  2 48         53         58         63          68       
##  3 80         85         90         95         100       
##  4 46         51         56         61          66       
##  5 33         38         43         48          53       
##  6 53         58         63         68          73       
##  7 34         39         44         49          54       
##  8 31         36         41         46          51       
##  9 81         86         91         96         101       
## 10 50         55         60         65          70       
## # ... with 90 more rows

r_series(likert, 5,  100, name ="Item", relate = "-.5_.1")

## # A tibble: 100 x 5
##    Item_1 Item_2 Item_3 Item_4 Item_5
##  *  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1      1      0     -1     -1     -2
##  2      3      3      2      2      2
##  3      4      3      3      3      3
##  4      3      2      1      0      0
##  5      3      3      3      3      3
##  6      5      4      3      2      1
##  7      4      3      2      1      1
##  8      1      0     -1     -2     -2
##  9      3      2      1      1      1
## 10      1      0      0     -1     -2
## # ... with 90 more rows

r_series(grade, j = 5, n = 100, relate = "*1.05_.1")

## # A tibble: 100 x 5
##    Grade_1    Grade_2    Grade_3    Grade_4    Grade_5   
##  * <variable> <variable> <variable> <variable> <variable>
##  1 90.8       90.80       99.880    109.8680   109.8680  
##  2 89.8       80.82       80.820     64.6560    58.1904  
##  3 90.3       99.33      109.263    109.2630    98.3367  
##  4 95.2       76.16       91.392     91.3920   100.5312  
##  5 89.1       98.01      117.612    105.8508   105.8508  
##  6 86.8       95.48       95.480    114.5760   160.4064  
##  7 93.4       93.40       93.400    102.7400   123.2880  
##  8 92.7       83.43       91.773    110.1276   121.1404  
##  9 84.9       93.39       93.390    102.7290   113.0019  
## 10 84.7       84.70       93.170     93.1700   111.8040  
## # ... with 90 more rows

Adjust Correlations

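Use the sd component of relate to adjust correlations: an sd of 0 makes the columns perfectly correlated, and larger values weaken the correlation between columns.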

round(cor(r_series(grade, 8, 10, relate = "+1_2")), 2)

##         Grade_1 Grade_2 Grade_3 Grade_4 Grade_5 Grade_6 Grade_7 Grade_8
## Grade_1    1.00    0.84    0.57    0.41    0.31    0.30    0.16    0.15
## Grade_2    0.84    1.00    0.86    0.73    0.71    0.70    0.52    0.50
## Grade_3    0.57    0.86    1.00    0.93    0.92    0.90    0.77    0.71
## Grade_4    0.41    0.73    0.93    1.00    0.93    0.89    0.76    0.66
## Grade_5    0.31    0.71    0.92    0.93    1.00    0.93    0.83    0.79
## Grade_6    0.30    0.70    0.90    0.89    0.93    1.00    0.93    0.92
## Grade_7    0.16    0.52    0.77    0.76    0.83    0.93    1.00    0.95
## Grade_8    0.15    0.50    0.71    0.66    0.79    0.92    0.95    1.00

round(cor(r_series(grade, 8, 10, relate = "+1_0")), 2)

##         Grade_1 Grade_2 Grade_3 Grade_4 Grade_5 Grade_6 Grade_7 Grade_8
## Grade_1       1       1       1       1       1       1       1       1
## Grade_2       1       1       1       1       1       1       1       1
## Grade_3       1       1       1       1       1       1       1       1
## Grade_4       1       1       1       1       1       1       1       1
## Grade_5       1       1       1       1       1       1       1       1
## Grade_6       1       1       1       1       1       1       1       1
## Grade_7       1       1       1       1       1       1       1       1
## Grade_8       1       1       1       1       1       1       1       1

round(cor(r_series(grade, 8, 10, relate = "+1_20")), 2)

##         Grade_1 Grade_2 Grade_3 Grade_4 Grade_5 Grade_6 Grade_7 Grade_8
## Grade_1    1.00   -0.11    0.14   -0.21   -0.42   -0.29   -0.30   -0.27
## Grade_2   -0.11    1.00    0.49    0.44    0.18    0.24    0.23    0.51
## Grade_3    0.14    0.49    1.00    0.86    0.48    0.59    0.70    0.81
## Grade_4   -0.21    0.44    0.86    1.00    0.63    0.76    0.76    0.87
## Grade_5   -0.42    0.18    0.48    0.63    1.00    0.92    0.85    0.79
## Grade_6   -0.29    0.24    0.59    0.76    0.92    1.00    0.91    0.89
## Grade_7   -0.30    0.23    0.70    0.76    0.85    0.91    1.00    0.93
## Grade_8   -0.27    0.51    0.81    0.87    0.79    0.89    0.93    1.00

round(cor(r_series(grade, 8, 10, relate = "+15_20")), 2)

##         Grade_1 Grade_2 Grade_3 Grade_4 Grade_5 Grade_6 Grade_7 Grade_8
## Grade_1    1.00    0.48    0.47    0.63    0.58    0.66    0.35    0.18
## Grade_2    0.48    1.00    0.90    0.87    0.54    0.43    0.67    0.23
## Grade_3    0.47    0.90    1.00    0.81    0.63    0.53    0.74    0.30
## Grade_4    0.63    0.87    0.81    1.00    0.75    0.72    0.71    0.47
## Grade_5    0.58    0.54    0.63    0.75    1.00    0.88    0.57    0.42
## Grade_6    0.66    0.43    0.53    0.72    0.88    1.00    0.68    0.54
## Grade_7    0.35    0.67    0.74    0.71    0.57    0.68    1.00    0.77
## Grade_8    0.18    0.23    0.30    0.47    0.42    0.54    0.77    1.00
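
Why sd controls the correlation can be checked with a short base R simulation. This is a sketch under stated assumptions rather than wakefield's code: the base column is drawn from a normal distribution with mean 88 and sd 4 (values chosen for illustration), and the next column adds independent noise with mean 1. For independent noise the correlation is sd_x / sqrt(sd_x^2 + sd^2), so it falls as sd grows.

```r
# Hypothetical helper: correlation between a column and its "+1_<sd>" successor
adjacent_cor <- function(sd, n = 1000) {
  first  <- rnorm(n, mean = 88, sd = 4)          # assumed base column
  second <- first + rnorm(n, mean = 1, sd = sd)  # additive noise, as in "+1_sd"
  cor(first, second)
}

set.seed(42)
c0  <- adjacent_cor(0)    # no noise: the columns differ by a constant
c2  <- adjacent_cor(2)
c20 <- adjacent_cor(20)
round(c(sd_0 = c0, sd_2 = c2, sd_20 = c20), 2)
```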

Visualize the Relationship

dat <- r_data_frame(12,
    name,
    r_series(grade, 100, relate = "+1_6")
) 

dat %>%
    gather(Time, Grade, -c(Name)) %>%
    mutate(Time = as.numeric(gsub("\\D", "", Time))) %>%
    ggplot(aes(x = Time, y = Grade, color = Name, group = Name)) +
        geom_line(size=.8) + 
        theme_bw()

Expanded Dummy Coding

The user may wish to expand a factor into j dummy-coded columns. The r_dummy function expands a factor into j columns and works similarly to the r_series function. The user may wish to use the original factor name as the prefix to the j columns. Setting prefix = TRUE within r_dummy accomplishes this.

set.seed(10)
r_data_frame(n=100,
    id,
    age,
    r_dummy(sex, prefix = TRUE),
    r_dummy(political)
)

## # A tibble: 100 x 8
##    ID      Age Sex_Male Sex_Female Democrat Republican Libertarian Green
##    <chr> <int>    <int>      <int>    <int>      <int>       <int> <int>
##  1 001      26        1          0        0          0           1     0
##  2 002      72        1          0        1          0           0     0
##  3 003      89        1          0        0          1           0     0
##  4 004      71        0          1        1          0           0     0
##  5 005      56        0          1        0          1           0     0
##  6 006      32        0          1        0          1           0     0
##  7 007      32        0          1        1          0           0     0
##  8 008      59        0          1        0          1           0     0
##  9 009      88        1          0        0          1           0     0
## 10 010      51        1          0        0          1           0     0
## # ... with 90 more rows
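
Dummy coding itself is simple to express in base R. The sketch below is not r_dummy: expand_dummy is a hypothetical helper, and it takes the prefix as a string rather than the prefix = TRUE flag used above. It produces one 0/1 column per factor level, so every row contains exactly one 1.

```r
# Hypothetical helper: one integer indicator column per level of `f`
expand_dummy <- function(f, prefix = NULL) {
  f <- as.factor(f)
  out <- lapply(levels(f), function(l) as.integer(f == l))
  nms <- levels(f)
  if (!is.null(prefix)) nms <- paste(prefix, nms, sep = "_")
  names(out) <- nms
  data.frame(out, check.names = FALSE)
}

sex <- factor(c("Male", "Female", "Male"), levels = c("Male", "Female"))
expand_dummy(sex, prefix = "Sex")
```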

Visualizing Column Types

It is helpful to see the column types and NAs as a visualization. The table_heat function (also the plot method assigned to tbl_df) provides a visual glimpse of data types and missing cells.

set.seed(10)

r_data_frame(n=100,
    id,
    dob,
    animal,
    grade, grade,
    death,
    dummy,
    grade_letter,
    gender,
    paragraph,
    sentence
) %>%
   r_na() %>%
   plot(palette = "Set1")

wakefield's Issues

Header HTML is broken on main README

Earlier this month, GitHub changed how it renders Markdown and it no longer plays as nicely with HTML tags. This is affecting this package's README landing page.

Controlling the mean and sd for the first variable in r_series

@trinker , amazing package. Quick question, can I have control over the mean and sd for the first variable in an r_series? It would probably look like this:

r_series(normal(mean = 35, sd = 10, min = 10, max = 45), 2, 100, name = "Question", relate = list(operation = "+", mean = 5, sd = 10))

Start date End date

There is already dob, date_stamp, and we could use these to get start and end dates, e.g.:

set.seed(1)
r_data_frame(
  n = 5,
  dob(name = "date_start")) %>% 
  mutate(date_end = date_start + runif(n(), 1, 100))

# # A tibble: 5 x 2
#   date_start date_end  
#   <date>     <date>    
# 1 2007-04-20 2007-06-01
# 2 2007-03-23 2007-06-16
# 3 2008-06-06 2008-07-29
# 4 2007-02-21 2007-04-17
# 5 2008-05-08 2008-07-31

Could we have a dedicated function that would create start and end date, where obviously date_end can't be before the date_start?

Or maybe even better, let date_stamp accept column name for start=, like below:

r_data_frame(
  n = 5,
  dob(name = "date_start")) %>% 
  mutate(date_end = date_stamp(start = date_start))
# Error in seq(start, length = k, by = by) : object 'date_start' not found

Generate multiple variables

I am trying to create a wide dataset with multiple variables of a certain type and do not want to have to type them all out. If I were to type them all out, it would look something like this:

sample_size = 1e6
df_foo = r_data_frame(
  n = sample_size,
  ID = id,
  y = dummy,
  x1 = rnorm,
  x2  = rnorm,
  x3 = rnorm, 
  ...,
  x100 = rnorm
)

I wonder if something like this is possible with wakefield or there is a workaround to get this done?

Thanks.
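
As a possible workaround in plain base R (not a wakefield feature, and the column definitions here are stand-ins for id and dummy), the repeated columns can be generated programmatically and bound to the rest of the data:

```r
set.seed(1)
n <- 1000   # kept small here; the question uses 1e6
p <- 100    # number of rnorm columns

# Build the p numeric columns as a named list: x1, x2, ..., x100
x_cols <- replicate(p, rnorm(n), simplify = FALSE)
names(x_cols) <- paste0("x", seq_len(p))

# data.frame() expands a named list into one column per element
df_foo <- data.frame(
  ID = seq_len(n),
  y  = rbinom(n, size = 1, prob = 0.5),  # stand-in for a 0/1 dummy column
  x_cols
)
dim(df_foo)
```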

Retain the likert scale when using r_series

When performing r_series(likert, 5, 100, name ="Item", relate = "-.5_.1") (as in the readme file), it does not retain the original likert range. For example, running the code above results in the following ranges for the five items:

Item_1: 1-5
Item_2: 0-5
Item_3: -1-5
Item_4: -2-5
Item_5: -3-4

I would like to be able to create likert scales with the same response scales (e.g., 1 through 5) that are related to one another (e.g., alpha = .80). The current code creates scales that are related but no longer on the same response scale.
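
One possible post-processing step, sketched in base R, is to round and clamp the generated values back onto the response scale. clamp_likert is a hypothetical helper, not part of wakefield, and clamping piles values up at the end points, so it changes the distribution and can weaken the intended relationship between items.

```r
# Hypothetical helper: force values back onto an integer scale lo..hi
clamp_likert <- function(x, lo = 1, hi = 5) {
  as.integer(pmin(pmax(round(x), lo), hi))
}

clamp_likert(c(-2, 0, 3, 7))   # 1 1 3 5
```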

don't inhibit errors in user supplied generators

Currently all user errors are swallowed by try blocks in r_list:

> r_data_frame(10,
+              id = sample(NULL, 10),
+              answer())
Error in sample(NULL, 10, n = 10) (from r_list.R#71) : unused argument (n = 10)
>

which is unrelated to the real problem

> sample(NULL, 10)
Error in sample.int(length(x), size, replace, prob) : 
  invalid first argument

BTW, you can eliminate such errors and avoid the substitution of each var_function by having a smart n argument that would automatically pick n from the calling environment:

> var_fun <- function(a = get(".N.", parent.frame(1))){print(a)}
> caller <- function(){ .N. <- 10; var_fun()}
> caller()
[1] 10

More gender groups

Add more groups to gender. http://abcnews.go.com/blogs/headlines/2014/02/heres-a-list-of-58-gender-options-for-facebook-users/

Multiple response values

Is it possible to generate values in a multi-select environment? This could apply to survey research (select all that apply), or graphs (edges between nodes of a certain type).

Below is a really hacky function that demonstrates this use-case in a hard-coded way

## helper function:  obviously not a production-quality function
build_multi = function(ids) {
  df_data = data.frame()
  for (i in seq_along(ids)) {
    ## randomize how many choices are made
    n_obs = sample(x=1:2, size=1, prob = c(.75, .25))
    ## what are the choices available in the multi-select
    cvals = c("BIZ","ARTS","SCIENCE","HEALTH","OTHER")
    vals = sample(x = cvals,
                  replace = FALSE, 
                  prob = c(.4,.2,.15,.2, .05), 
                  size = n_obs)
    tmp_df = data.frame(id = ids[i],
                        values = vals)
    df_data = dplyr::bind_rows(df_data, tmp_df)
  }
  return(df_data)
}

## generate a set of ids
my_df = r_data_frame(id = id, n=100)

and generate the data

## return a long dataset of multiselect options for given probabilities (which I hardcoded)
my_long = build_multi(my_df$id)

Above generates the dataset in a structure that I would need, but I wasn't sure if this already existed in the current package.

gender_inclusive() and sex_inclusive() do not use probs stated in docs

The documentation for gender_inclusive() and sex_inclusive() implies that these probabilities are used for the x values:

Gender	Percent
Male	51.07 %
Female	48.63 %
Trans*	0.30 %

However, instead, all options are given equal weighting. Data is also only provided for gender_inclusive() and not sex_inclusive().
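
For comparison, this is what sampling with the documented percentages would look like in base R. It is only an illustration of weighted sampling with sample(), not a patch to gender_inclusive():

```r
gender_levels <- c("Male", "Female", "Trans*")
gender_probs  <- c(0.5107, 0.4863, 0.0030)   # percentages from the table above

set.seed(1)
x <- factor(
  sample(gender_levels, 10000, replace = TRUE, prob = gender_probs),
  levels = gender_levels
)
round(prop.table(table(x)), 3)
```

With equal weighting, each level would instead appear in roughly a third of the rows.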

Generate data with correlations

I started work awhile ago on a much less ambitious project than wakefield to attempt to generate random data sets on the fly with a known correlation structure. You can see the seeds of that work here: https://github.com/jknowles/datasynthR

It would be cool to include the ability to generate numeric or factor data with a known correlation structure to build structural relationships into the very realistic looking data generated by wakefield.

Update documentation

Not a major issue, but I was looking through help(package = "wakefield") for the function that added NA values. I entirely missed r_na because its documentation still says Title.

https://github.com/trinker/wakefield/blob/1d14ca0b0b10652eeef9ebf6bae7e86a31c2168a/R/r_na.R

Ditto for r_dummy. Description is also missing. Great work on this package BTW.

Consider adding a simple mean/sd function with rounding

library(wakefield)
fun <- function(n, ..., digits = 0) round(rnorm(n=n, ...), digits)

foo <- r_data_frame(
    n = 10000,
    Income = fun(mean=54000, sd=1100),
    race,
    Age= fun(mean=35, sd=10),
    sex,
    internet_browser,
    state,
    marital
)

including "name" in r_data_frame results in error

The following code produces an error:

df = r_data_frame(
  n = 500,
  name,
  id,
  race,
  age,
  sex,
  hour,
  iq,
  height,
  died
)
View(df)

Error Produced: Error in sample.int(length(x), size, replace, prob) : cannot take a sample larger than the population when 'replace = FALSE'

This code however works just fine:

df = r_data_frame(
  n = 500,
  id,
  race,
  age,
  sex,
  hour,
  iq,
  height,
  died
)
View(df)

`name()` should use weights from babynames dataset

name() currently generates mostly unusual names because it doesn't do any weighting. It would be nice if the default would use frequency weights so that you're more likely to get common names.

generating 1000 names with replacement using r_data_frame

This may be a problem with passing arguments to r_data_frame.

works

x <- data.frame(name(1000, replace = TRUE))

fails, doesn't pass replace = TRUE.

x <- r_data_frame(1000,name,replace = TRUE)
Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'

Multiple time measurements

It would be nice for those of us who are lazy to have convenient names for repeated measures in a wide format.

Consider:

r_data_frame(
  n = 5,
  id, 
  race, race, race, 
  age, age, age
)
# Source: local data frame [5 x 7]
# 
#   ID     Race   Race.1   Race.2 Age Age.1 Age.2
#1  1 Hispanic    White Hispanic  30    30    32
#2  2    White Hispanic    White  31    20    30
#3  3    White    White    White  26    23    25
#4  4    White    Black    White  20    30    31
#5  5    Asian    White    White  20    28    24

Generally, the preferred form would be to have all "times" identified. Thus, at the very minimum, Race should become Race.0 for balance in the naming scheme.

I know I can just do:

r_data_frame(
  n = 5,
  id, 
  Race_1 = race, Race_2 = race, Race_3 = race, 
  Age_1 = age, Age_2 = age, Age_3 = age
)
# Source: local data frame [5 x 7]
# 
#   ID   Race_1   Race_2 Race_3 Age_1 Age_2 Age_3
#1  1    White Hispanic  White    24    30    23
#2  2 Hispanic    White  Black    32    35    32
#3  3    White    White  White    28    21    25
#4  4    White    White  White    33    22    24
#5  5    White    Black  White    31    30    21

But that's a lot of extra typing :-(

I haven't dug into your code (hence raising an issue and not a pull request), but it's possible that the fix might be something as easy as:

r_data_frame <- function (n, ...) 
{
  out <- r_list(n = n, ...)
  temp <- names(out)
  temp <- ave(temp, temp, FUN = function(x) 
    if (length(x) == 1) x else paste(x, seq_along(x), sep = "_"))
  out <- setNames(data.frame(out, stringsAsFactors = FALSE, 
                             check.names = FALSE), temp)
  dplyr::tbl_df(out)
}

Minor issue, fix warning, use tibble::as_tibble

To avoid the warning, as tbl_df is deprecated, use as_tibble:

# R version 4.0.2 (2020-06-22), wakefield_0.3.6

r_data_frame(n = 5, dob)
# # A tibble: 5 x 1
#   DOB       
#   <date>    
# 1 2008-05-15
# 2 2007-03-23
# 3 2008-04-27
# 4 2008-01-21
# 5 2007-09-19
# Warning message:
# `tbl_df()` was deprecated in dplyr 1.0.0.
# Please use `tibble::as_tibble()` instead.

An update to stringi 1.5.3 breaks wakefield

Apologies, but an updated version of https://github.com/gagolews/stringi/ unfortunately breaks your package's checks, can you fix the following issues in the next release?

Change-log: https://cran.r-project.org/web/packages/stringi/NEWS:

[DEPRECATION WARNING] nparagraphs in stri_rand_lipsum() has been renamed n_paragraphs.

Package: wakefield
Check: examples
New result: ERROR
  Running examples in ‘wakefield-Ex.R’ failed
  The error most likely occurred in:
  
  > base::assign(".ptime", proc.time(), pos = "CheckExEnv")
  > ### Name: lorem_ipsum
  > ### Title: Generate Random Lorem Ipsum Strings
  > ### Aliases: lorem_ipsum paragraph
  > ### Keywords: character string
  > 
  > ### ** Examples
  > 
  > lorem_ipsum(10)
  Error in stringi::stri_rand_lipsum(n = n, ...) : 
    argument 1 matches multiple formal arguments
  Calls: lorem_ipsum
  Execution halted
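The error looks like a partial argument matching clash: once both n_paragraphs and nparagraphs are formals, an argument named n is a prefix of both. A base-R sketch with a hypothetical stand-in function (lipsum_like only mimics the argument names from the change-log; it is not the real stringi function):

```r
# Hypothetical stand-in with both the new and the deprecated argument name
lipsum_like <- function(n_paragraphs, start_lipsum = TRUE,
                        nparagraphs = n_paragraphs) {
  n_paragraphs
}

# `n` partially matches n_paragraphs and nparagraphs, so R refuses to guess
msg <- tryCatch(lipsum_like(n = 3), error = function(e) conditionMessage(e))
print(msg)

# Passing the value positionally (or with the full name) avoids the clash
ok <- lipsum_like(3)
print(ok)
```

Under that reading, passing n positionally or as n_paragraphs in the call inside lorem_ipsum would be one possible fix.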

Typo: empoyment in employment.R.

Hi Tyler,

Thanks for your great package! It's very useful.

I've just noticed that there's a small spelling mistake in employment.R. In the script, the function is written as "empoyment", so it gives an error when you type it.

Best,
Danilo

feature request

Nice! It would also be great to be able to set a percentage distribution per variable group. E.g. for Age: 10% aged 10-20, 20% aged 30-40, 70% aged 50-80, and so forth.
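One way the request could be approximated today in base R is to sample a band first and then an age within it. This is only a sketch: the bands, probabilities and variable names are taken from or invented for the example above and are not a wakefield feature.

```r
set.seed(1)

# Age bands and their shares, as in the example (10%, 20%, 70%)
bands <- list(c(10, 20), c(30, 40), c(50, 80))
probs <- c(0.1, 0.2, 0.7)
n <- 1000

# Step 1: pick a band for each row according to the shares
idx <- sample(seq_along(bands), n, replace = TRUE, prob = probs)

# Step 2: pick a whole-number age uniformly inside the chosen band
ages <- vapply(idx, function(i) sample(bands[[i]][1]:bands[[i]][2], 1),
               integer(1))

# Observed share per band, which should be close to probs
print(round(prop.table(table(idx)), 2))
```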

add occupation function

group() limited to 2 groups?

Thanks for wakefield!

I need to generate factors with more than two levels. group() accepts x with length > 2, but only samples the first two. Here is a toy example:

tg <- group(n=100, x=c("a", "b", "c"), name = "test")
summary(tg)
a b c
43 57 0
tg <- group(n=100, x=c("a", "b", "c"), prob = c(0.1, 0.2, 0.7), name = "test")
Error in sample.int(n = 2, size = n, replace = TRUE, prob = prob) :
incorrect number of probabilities
tg <- group(n=100, x=c("a", "b", "c"), prob = c(0.1, 0.2), name = "test")
summary(tg)
a b c
32 68 0

HTH,
Jim
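For comparison, the behaviour being asked for can be sketched in base R, where sample() draws from all supplied levels and accepts one probability per level. This is a workaround under that assumption, not a patch to group():

```r
set.seed(42)

x <- c("a", "b", "c")
prob <- c(0.1, 0.2, 0.7)

# Draw from every level of x, one probability per level
tg <- factor(sample(x, size = 100, replace = TRUE, prob = prob), levels = x)

print(table(tg))
```

The error in the second call above suggests the internal sample.int() is hard-coded to n = 2 rather than length(x).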

`tbl_df()` was deprecated in dplyr 1.0.0.

tbl_df() was deprecated in dplyr 1.0.0.
ℹ Please use tibble::as_tibble() instead.
ℹ The deprecated feature was likely used in the wakefield package.

Correlated fields

Hi! So frequently, when I am training people on data analysis, I will put in specific issues for the students to pick up on. For example, I might have people from one city showing many more deaths, or I might show women as performing higher than men in 3 regions. The way I do this is:

Create a normal set where, let's say, education is normally (or whatever) distributed. Then create a smaller dataset with the deviation, combine the two with bind_rows, and randomize the order.

That's all well and good, but I was wondering if there would be benefit from some kind of function like:

r_data_frame(
  n = 10, 
  sex(x = c("M", "F"), prob = c(0.9, 0.1)),
  died(correlator = "sex==F", corr_effect = "+20%")
)

where correlator and corr_effect are the new adverbs.
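Until something like that exists, the same effect can be sketched in base R by making the probability of the outcome depend on the other column. The 10% base rate and the reading of "+20%" as 20 percentage points are assumptions made for illustration:

```r
set.seed(123)
n <- 10000

# Sex drawn with the proportions from the example call
sex <- sample(c("M", "F"), n, replace = TRUE, prob = c(0.9, 0.1))

# Assumed base death rate, raised by 20 percentage points for sex == "F"
base_rate <- 0.10
p_died <- ifelse(sex == "F", base_rate + 0.20, base_rate)

# One Bernoulli draw per row using that row's probability
died <- runif(n) < p_died

# Death rate by sex
rates <- tapply(died, sex, mean)
print(round(rates, 2))
```

This avoids building two data sets and shuffling them, since the deviation is generated row by row.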

use package without calling library(wakefield)

Thanks for making this useful package.

I tried to use it initially by doing:

wakefield::r_data_frame(id, age, n = 500)

This fails because, it seems, the id and age, etc, functions need to be available. It is beyond my knowledge, but perhaps you could interpret the symbols given in the wakefield namespace. Adding all those generic-sounding names to my namespace is a little worrying. I suppose I can call each function directly, using wakefield:: but this becomes more cumbersome. Just a thought.
