strengejacke / sjmisc

Data transformation and utility functions for R

Home Page: https://strengejacke.github.io/sjmisc

License: GNU General Public License v3.0

R 99.63% TeX 0.37%

data-transformation r data-wrangling labelled-data recoding

sjmisc's Issues

Warning in set_na

Hi,

I get a warning for trying to use set_na on a dataframe:

library(tidyverse)
library(sjPlot)
library(sjmisc)
setwd("/home/cs/Dropbox/Lehre/Methoden der politischen Soziologie/0 Daten/GLES Vorwahl")
d <- read_stata('GLES_Vorwahlquerschnitt_ZA5700_v1-0-0.dta')

d <- set_na(d, value = -99:-71)

no non-missing arguments to min; returning Inf
no non-missing arguments to max; returning -Inf
Can't set value labels for "x". Infinite value range.
no non-missing arguments to min; returning Inf
no non-missing arguments to max; returning -Inf
Can't set value labels for "x". Infinite value range.
no non-missing arguments to min; returning Inf
no non-missing arguments to max; returning -Inf
Can't set value labels for "x". Infinite value range.
no non-missing arguments to min; returning Inf
no non-missing arguments to max; returning -Inf
Can't set value labels for "x". Infinite value range.
no non-missing arguments to min; returning Inf
no non-missing arguments to max; returning -Inf
Can't set value labels for "x". Infinite value range.

Is this because not every variable contains values between -99 and -71?
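One way to test that guess: base R's min() and max() emit exactly these "no non-missing arguments" warnings when they receive no non-missing values. The sketch below flags columns that would contain nothing but missing values once the codes are set to NA. Whether this is what set_na() trips over internally is an assumption, and the helper name and toy data are hypothetical.

```r
# Hypothetical helper: which columns hold only NA or "missing" codes?
all_missing_after <- function(df, na_codes) {
  flags <- vapply(
    df,
    function(col) all(is.na(col) | col %in% na_codes),
    logical(1)
  )
  names(flags)[flags]
}

# Toy data standing in for the Stata file
d_toy <- data.frame(
  a = c(1, 2, -99),    # still has valid values after recoding
  b = c(-99, -98, -97) # becomes entirely NA after recoding
)

all_missing_after(d_toy, -99:-71)
```

Columns returned here are the ones where min()/max() would have nothing left to work with after the recoding.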

rec() problems with val.labels and suggestion

Hi,

I have a very strange problem with rec(): When I use efc data to recode an age variable, val.labels works without any issues:

data(efc)
 
> alt_kat <-  rec(efc$e17age, "18:23=1; 24:65=2; 66:max=3",
+    var.label="Alter kat.", 
+    val.labels= c("18-23"=1,"24-65"=2,"ueber65"=3))
> frq(alt_kat)
# Alter kat.

 val   label frq raw.prc valid.prc cum.prc
   1   18-23   0    0.00       0.0     0.0
   2   24-65  32    3.52       3.6     3.6
   3 ueber65 858   94.49      96.4   100.0
   4      NA  18    1.98        NA      NA

However, when I use the exact same syntax for a Stata dataset the following happens:

d$alter <- 2013 -  set_na(d$q2c,-99:-97)
 d$alt_kat <-  rec(d$alter, "18:23=1; 24:65=2; 66:max=3",
+    var.label="Alter kat.", 
+    val.labels= c("18-23"=1,"24-65"=2,"ueber65"=3))
> frq(d$alt_kat)
# Alter kat.

 val label  frq raw.prc valid.prc cum.prc
   1     1   77    3.85      3.90    3.90
   2     2 1174   58.67     59.47   63.37
   3     3  723   36.13     36.63  100.00
   4    NA   27    1.35        NA      NA

The value labels were not carried over. The only way around this is to not specify the numeric values for the value labels explicitly:

alt_kat <-  rec(d$alter, "18:23=1; 24:65=2; 66:max=3",
+    var.label="Alter kat.", 
+    val.labels= c("18-23","24-65","ueber65"))
> frq(alt_kat)
# Alter kat.

 val   label  frq raw.prc valid.prc cum.prc
   1   18-23   77    3.85      3.90    3.90
   2   24-65 1174   58.67     59.47   63.37
   3 ueber65  723   36.13     36.63  100.00
   4      NA   27    1.35        NA      NA

Do you know why this happens? I would be happy to just skip the numericals, but from what I understand the order is not based upon the order of recodings, but based upon which values appear first in a vector. At least this seems to be true for another example:

frq(d$q1)
# Geschlecht
 val                label  frq raw.prc valid.prc cum.prc
 -99    -99. keine Angabe    0    0.00      0.00    0.00
 -98     -98. weiss nicht    0    0.00      0.00    0.00
 -97 -97. trifft nicht zu    0    0.00      0.00    0.00
   1         1. maennlich 1013   50.62     50.62   50.62
   2          2. weiblich  988   49.38     49.38  100.00
   3                   NA    0    0.00        NA      NA

> d$sex <- rec(d$q1, "2=1;1=0;else=NA", 
+              var.label="sex", val.labels=c('Frau'=1, 'Mann'=0))
> d$sex2 <- rec(d$q1, "2=1;1=0;else=NA", 
+              var.label="sex", val.labels=c('Frau', 'Mann'))

> frq(d$sex)
# sex
val label  frq raw.prc valid.prc cum.prc
   0  Frau  988   49.38     49.38   49.38
   1  Mann 1013   50.62     50.62  100.00
   2    NA    0    0.00        NA      NA

> frq(d$sex2)
# sex
 val label  frq raw.prc valid.prc cum.prc
   0  Frau 1013   50.62     50.62   50.62
   1  Mann  988   49.38     49.38  100.00
   2    NA    0    0.00        NA      NA

The complete dataset I used for the output (and for most other examples I provide in issues) is available [here](https://www.dropbox.com/s/42xt2u5zsj98z8m/GLES_Vorwahlquerschnitt_ZA5700_v1-0-0.dta?dl=1).

On a side note, this is solved quite conveniently in Stata, where the labels are included directly in the recoding procedure:
recode alter (18/23=1 "18bis23") (24/65=2 "24bis65") (66/99=3 ">65"), gen (alterkat)
This might be tricky to implement, but it could work with regular expressions, for example by using a special character such as $ or §, or [ ], within the recode string:

 rec(efc$e17age, "18:23=1 [18bis23]; 24:65=2 [24bis65]; 66:max=3 [ueber 65]",
+    var.label="Alter kat.")

It might even be possible to implement this without breaking the old behavior, right? A parameter like val.labels="auto" could scan for the special characters and apply the automated procedure if they are detected.
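As a rough illustration of the regular-expression idea, the following base R sketch splits a recode string into plain recode rules plus a named vector of value labels taken from the square brackets. The function name parse_rec_labels is hypothetical and not part of sjmisc; it only shows that the proposed syntax can be parsed without touching the recode rules themselves.

```r
# Hypothetical parser for recode strings with "[label]" suffixes
parse_rec_labels <- function(rec_string) {
  # One element per recode rule
  parts <- trimws(strsplit(rec_string, ";", fixed = TRUE)[[1]])
  parts <- parts[nzchar(parts)]

  # Pull out the text inside the square brackets, if any
  has_label <- grepl("\\[.*\\]", parts)
  labels <- sub("^.*\\[(.*)\\].*$", "\\1", parts[has_label])

  # Strip the bracket part to recover the plain recode rules
  rules <- trimws(sub("\\[.*\\]", "", parts))

  # The new value is whatever follows the last "=" in a rule
  new_values <- as.numeric(sub("^.*=", "", rules[has_label]))

  list(
    rules = paste(rules, collapse = "; "),
    val.labels = setNames(new_values, labels)
  )
}

parsed <- parse_rec_labels(
  "18:23=1 [18bis23]; 24:65=2 [24bis65]; 66:max=3 [ueber 65]"
)
parsed$rules
parsed$val.labels
```

The two list elements could then be passed on as the recode string and the val.labels argument, which would leave the existing behavior untouched for strings without brackets.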

Error for set_na()

Hi,

I think something went wrong when the parameter name was changed from value to na. Although the parameter is correctly listed as `na` in the docs, the following example raises an error:

x <- factor(c("a", "b", "c"))
x
set_na(x, na = "b", as.tag = TRUE)

Error in set_na(x, na = "b", as.tag = TRUE) : 
  argument "value" is missing, with no default

Error in terms.formula(formula, data = data) : 'data' argument is of the wrong type

Hi Daniel,

I'm having some strange issues with plotting the model assumptions of a linear model using:
sjp.lm(lm.test, type = "ma")

on the following linear model:
lm.test = lm(Chao1 ~ Water_content + N.NH4 + N.NO3 + OM + pH + Cl + K, df).

It returns an error:
Error in terms.formula(formula, data = data) : 'data' argument is of the wrong type

It behaves quite strangely, because when I take out the Water_content variable, it suddenly works. I tried renaming it and removing rows with NA values, but that did not help.

I'm attaching the test data.

test_data.csv.zip

Thanks a lot!

to_factor not working as expected?

Starting with a clean R session, running RStudio 1.1.39 and R 3.3.2 on OS X Sierra, I was playing with to_factor(), but the output differs from what I expected. What could be going on here? I also reinstalled sjmisc via devtools, so it can't be a package version issue.

> library(sjmisc)
> data(efc)
> str(efc$e42dep)

 atomic [1:908] 3 3 3 4 4 4 4 4 4 4 ...
 - attr(*, "label")= chr "elder's dependency"
 - attr(*, "labels")= Named num [1:4] 1 2 3 4
  ..- attr(*, "names")= chr [1:4] "independent" "slightly dependent" "moderately dependent" "severely dependent"

> table(to_factor(efc$e42dep))

  1   2   3   4 
 66 225 306 304 

> barplot(table(to_factor(efc$e42dep)))

The code below works though with the output in line with the documentation:

library(sjPlot)
efc$c172code <- to_factor(efc$c172code)
fit <- lm(barthtot ~ c160age + c12hour + c172code + c161sex, data = efc)
sjt.lm(fit, group.pred = TRUE)

`set_na()` could remove level from NA value for factors

Instead of

x <- factor(c("a", "b", "c"))
x
# [1] a b c
# Levels: a b c

set_na(x) <- "b"
x
# [1] a    <NA> c   
# attr(,"labels")
#  b 
# NA 
# Levels: a b c

better

set_na(x) <- "b"
x
# [1] a    <NA> c   
# attr(,"labels")
#  b 
# NA 
# Levels: a c
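For comparison, the requested result (value set to NA and the now unused level removed) can be sketched in base R as follows. This is only an illustration of the desired behavior, not how set_na() is implemented; the helper name is hypothetical, and the "labels" attribute shown above is not handled here.

```r
# Hypothetical helper: set factor values to NA and drop their levels
set_na_drop_level <- function(x, na) {
  x[x %in% na] <- NA
  droplevels(x) # removes levels that no longer occur in x
}

x <- factor(c("a", "b", "c"))
y <- set_na_drop_level(x, "b")
levels(y)
```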

get_var_labels and labelled character

It's possible to create labelled character variables. Cf. doc of labelled:

s1 <- labelled(c("M", "M", "F"), c(Male = "M", Female = "F"))

Proposition of modification: #5

write_spss() coerces data to be contiguous

When sjmisc::write_spss() writes data to an SPSS file, it coerces data to be contiguous:

x <- set_labels(rep(c(1:5, 99), times = 3), c("Strongly Agree" = 1, "Agree" = 2, "Neutral" = 3, "Disagree" = 4, "Strongly Disagree" = 5, "Did Not Answer" = 99))
x
# [1]  1  2  3  4  5 99  1  2  3  4  5 99  1  2  3  4  5 99
#attr(,"labels")
#   Strongly Agree             Agree           Neutral          Disagree Strongly Disagree    Did Not Answer 
#                1                 2                 3                 4                 5                99 
sjmisc::write_spss(data.frame(x), path = "C:/TEMP/temp.sav")

And in SPSS the data is now contiguous but is inconsistent with the original structure of the data.

Is it possible to prevent this behavior, as it causes a disconnect between the data in SPSS and R?

Suggestion: search function for variable labels

Hi,

I have a suggestion for a utility function to search for all variables containing a specific string within their variable labels. I'm sure you would come up with a better version and without the dplyr dependency:

lookfor <- function(x, df) {
  cols <- colnames(df)
  labels <- get_label(df)
  combined <- data.frame(cols, labels) %>% 
  filter(grepl(tolower(x), tolower(labels)))
  return(combined)
}

> lookfor('elder', efc)
      name                 label
1 e15relat relationship to elder
2   e16sex        elder's gender
3   e17age            elder' age
4   e42dep    elder's dependency
5 tot_sc_e  Services for elderly
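A version without the dplyr dependency could look roughly like the sketch below. It reads the "label" attribute of each column directly instead of calling get_label(), which is an assumption about how the variable labels are stored; the function name lookfor_base and the toy data are hypothetical.

```r
# Hypothetical base R variant: search variable labels for a pattern
lookfor_base <- function(pattern, df) {
  # Collect one label per column; NA where no label is set
  labels <- vapply(df, function(col) {
    lab <- attr(col, "label", exact = TRUE)
    if (is.null(lab)) NA_character_ else as.character(lab)[1]
  }, character(1))

  # Case-insensitive match on the labels
  hits <- !is.na(labels) & grepl(pattern, labels, ignore.case = TRUE)

  data.frame(
    name = names(df)[hits],
    label = unname(labels[hits]),
    stringsAsFactors = FALSE
  )
}

# Toy data with "label" attributes
toy <- data.frame(e17age = c(70, 80), c160age = c(40, 50), id = 1:2)
attr(toy$e17age, "label") <- "elder' age"
attr(toy$c160age, "label") <- "carer' age"

res <- lookfor_base("elder", toy)
res
```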

mismatch between code and doc

For frq there seems to be a mismatch between code and doc (or between the argument description and the description of the function's signature):

I found
#' @param sort.frq Logical, if \code{TRUE}, rows will be sorted according to
#'   value frequencies.

But sort.frq seems to expect a character, not a logical.

Clarification regarding to_factor

Hi,

Sorry to disturb you again. I would like to understand the purpose of to_factor. So, it turns the variable into a factor, using the values of the variable as the levels of the factor, and keeps the "labels" attribute?

In which case is there a specific need to keep the attr(*, "labels")?

else = keep as option for rec

`merge_df()` unexpected behaviour

I have a list of data.frames each with labelled and/or labelled_spss variables that I want to merge. Unfortunately, plyr::join_all() removes the "label", "na_values", and "format.spss" attributes.

I was hoping that merge_df() would be able to retain these attributes, but I got the following error message:

Error in `[[<-.data.frame`(`*tmp*`, i, value = c(NA, NA, NA, NA, NA, NA,  : 
  replacement has 4821 rows, data has 9441

If this issue is more haven-esque I'll make an issue there, thanks!

Labels disappear with too many variables in sjp.glm?

Thanks for the great work Daniel! Today I encountered a minor issue. When I make a plot with many variables my labels disappear. Is there a way or argument to avoid this?

y <- sample(0:1, 200, replace = TRUE)
a <- as.factor(sample(1:4, 200, replace = TRUE))
b <- as.factor(sample(1:4, 200, replace = TRUE))
c <- sample(1:4, 200, replace = TRUE)

set_label(y) <- "var y"
set_label(a) <- "var a"
set_label(b) <- "var b"
set_label(c) <- "var c"

#OK: shows the labels of the variables
fit1 <- glm(y~a+c, family = binomial(link = "logit"))
sjp.glm(fit1)

#NOT OK: shows the variable names instead of the labels
fit2 <- glm(y~a+b, family = binomial(link = "logit"))
sjp.glm(fit2)

set_labels doesn't properly set labels on factor

x<-factor(rep(c(1,2),10)) 
y<-set_labels(x, labels=c('A','B'))

after which

> y
 [1] 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2
attr(,"labels")
A B 
1 2 
Levels: 1 2

For all the labelled-agnostic world y is still a factor with levels 1 and 2. Is it by design, or is it a bug?

`get_var_labels` fail for read_spss(option="haven") imported data when having empty var labels

get_var_labels does not seem to work with data where some variable labels are empty:

raw.data <- sjmisc::read_spss("sample_data.sav")
# str(raw.data)[1:5]

sjmisc::get_var_labels(raw.data)

But it works well when all variables have labels:

raw.data2 <- sjmisc::read_spss("sample_data_full_labels.sav")
# str(raw.data2)[1:5]

sjmisc::get_var_labels(raw.data2)

add_labels error

I was trying the example code and got an error:

efc.sub <- subset(efc, subset = e16sex == 1, select = c(4:8))
efc.sub <- add_labels(efc.sub, efc)

--->
Error in order(as.numeric(all.labels)) :
(list) object cannot be coerced to type 'double'

Can you take a look?

unable to process set_na()

I got this message when I use this function:

tscs2013r <- set_na(tscs2013r, c(92:99, "NA"))
Error in as.Date.numeric(value) : 'origin' must be supplied

It worked this summer but somehow this won't work this time when I came back to this project. Did I miss anything? Thank you.

rev parameter in rec() not working as intended

Hi,
trying to use rev for reversing value codings during rec() is not working as intended:

d <- read_stata('GLES_Vorwahlquerschnitt_ZA5700_v1-0-0.dta')

d$q3 %>% set_na(-99:-71) %>% frq()

# Politisches Interesse

 val               label frq raw.prc valid.prc cum.prc
   1       1. sehr stark 123    6.15      6.16    6.16
   2            2. stark 362   18.09     18.12   24.27
   3    3. mittelmaessig 813   40.63     40.69   64.96
   4    4. weniger stark 499   24.94     24.97   89.94
   5 5. ueberhaupt nicht 201   10.04     10.06  100.00
   6                  NA   3    0.15        NA      NA

The correct output should be a reverse ordering of these 5 categories with automatically adjusted values labels. Instead, this happens:

d$polint <- d$q3 %>% set_na(-99:-71) %>%  rec('rev')
frq(d$polint)

NAs introduced by coercion
# Politisches Interesse

 val             label frq raw.prc valid.prc cum.prc
   1                 1 201    9.57     10.06   10.06
   2                 2 499   23.76     24.97   35.04
   3                 3 813   38.71     40.69   75.73
   4                 4 362   17.24     18.12   93.84
   5                 5 123    5.86      6.16  100.00
   6                     3    0.14      0.15  100.15
   6                     3    0.14      0.15  100.30
   6                     3    0.14      0.15  100.45
   6                     3    0.14      0.15  100.60
   6                     3    0.14      0.15  100.75
   6 -99. keine Angabe   3    0.14      0.15  100.90
   6  -98. weiss nicht   3    0.14      0.15  101.05
   6               -97   3    0.14      0.15  101.20
   6               -96   3    0.14      0.15  101.35
   6               -95   3    0.14      0.15  101.50
   6               -94   3    0.14      0.15  101.65
   6               -93   3    0.14      0.15  101.80
   6               -92   3    0.14      0.15  101.95
   6               -91   3    0.14      0.15  102.10
   6               -90   3    0.14      0.15  102.25
   6               -89   3    0.14      0.15  102.40
   6               -88   3    0.14      0.15  102.55
   6               -87   3    0.14      0.15  102.70
   6               -86   3    0.14      0.15  102.85
   6               -85   3    0.14      0.15  103.00
   6               -84   3    0.14      0.15  103.15
   6               -83   3    0.14      0.15  103.30
   6               -82   3    0.14      0.15  103.45
   6               -81   3    0.14      0.15  103.60
   6               -80   3    0.14      0.15  103.75
   6               -79   3    0.14      0.15  103.90
   6               -78   3    0.14      0.15  104.05
   6               -77   3    0.14      0.15  104.20
   6               -76   3    0.14      0.15  104.35
   6               -75   3    0.14      0.15  104.50
   6               -74   3    0.14      0.15  104.65
   6               -73   3    0.14      0.15  104.80
   6               -72   3    0.14      0.15  104.95
   6               -71   3    0.14        NA      NA

While reversing the initial values works, the value labels are not attached, and the output behaves strangely with respect to the range of values that was set to NA. Values like -85, however, never existed in the first place.
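To make the expected result concrete, here is a minimal base R sketch of reversing a scale together with its value labels. It assumes the labels are stored as a named vector in the "labels" attribute and simply discards labels for codes outside the observed range; the function name and the English toy labels are hypothetical, and this is not the rec() implementation.

```r
# Hypothetical helper: reverse values and carry the labels along
reverse_scale <- function(x) {
  labs <- attr(x, "labels", exact = TRUE)
  vals <- as.vector(x) # plain values without attributes
  lo <- min(vals, na.rm = TRUE)
  hi <- max(vals, na.rm = TRUE)

  out <- lo + hi - vals # 1 becomes 5, 2 becomes 4, ...

  if (!is.null(labs)) {
    # Keep only labels for codes inside the observed range
    keep <- labs >= lo & labs <= hi
    attr(out, "labels") <- sort(lo + hi - labs[keep])
  }
  out
}

x <- c(1, 2, 5, NA, 3)
attr(x, "labels") <- c("very strong" = 1, "not at all" = 5, "no answer" = -99)

r <- reverse_scale(x)
as.vector(r)
attr(r, "labels")
```

Dropping the out-of-range labels is one possible design choice; another would be to keep them unchanged so that the codes already set to NA remain documented.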

Add functions get_na_values and to_na

a feature of set_na() wanted: select or drop variables

See strengejacke/sjPlot#75

Update vignettes

set_labels() bug

Hi,

I discovered a bug when using set_labels() that scrambles values and labels:

d <- read_stata('GLES_Vorwahlquerschnitt_ZA5700_v1-0-0.dta')
vars <- c(93,104:108)
d[vars] <- d[vars]%>% set_na(-99:-71)
d$merkel_eigen <- rowMeans(d[104:108], na.rm=T)
frq(d$merkel_eigen)


              val frq            label raw.prc valid.prc cum.prc
                1  18                1    0.90      0.91    0.91
              1.2   4              1.2    0.20      0.20    1.11
 1.33333333333333   1 1.33333333333333    0.05      0.05    1.16
              1.4  14              1.4    0.70      0.70    1.86
              1.5   3              1.5    0.15      0.15    2.01
              1.6   8              1.6    0.40      0.40    2.41
 1.66666666666667   2 1.66666666666667    0.10      0.10    2.52
             1.75   3             1.75    0.15      0.15    2.67
              1.8  22              1.8    1.10      1.11    3.77
                2  35                2    1.75      1.76    5.53
              2.2  34              2.2    1.70      1.71    7.24
             2.25   4             2.25    0.20      0.20    7.44
 2.33333333333333   4 2.33333333333333    0.20      0.20    7.65
              2.4  52              2.4    2.60      2.62   10.26
              2.5  10              2.5    0.50      0.50   10.76
              2.6  84              2.6    4.20      4.23   14.99
 2.66666666666667   4 2.66666666666667    0.20      0.20   15.19
             2.75   6             2.75    0.30      0.30   15.49
              2.8  79              2.8    3.95      3.97   19.47
                3 164                3    8.20      8.25   27.72
              3.2 127              3.2    6.35      6.39   34.10
             3.25  14             3.25    0.70      0.70   34.81
 3.33333333333333   7 3.33333333333333    0.35      0.35   35.16
              3.4 131              3.4    6.55      6.59   41.75
              3.5  14              3.5    0.70      0.70   42.45
              3.6 122              3.6    6.10      6.14   48.59
 3.66666666666667   8 3.66666666666667    0.40      0.40   48.99
             3.75  13             3.75    0.65      0.65   49.65
              3.8 132              3.8    6.60      6.64   56.29
                4 200                4   10.00     10.06   66.35
              4.2 130              4.2    6.50      6.54   72.89
             4.25   6             4.25    0.30      0.30   73.19
 4.33333333333333   2 4.33333333333333    0.10      0.10   73.29
              4.4 150              4.4    7.50      7.55   80.84
              4.5  11              4.5    0.55      0.55   81.39
              4.6 115              4.6    5.75      5.78   87.17
 4.66666666666667   2 4.66666666666667    0.10      0.10   87.27
             4.75   1             4.75    0.05      0.05   87.32
              4.8  92              4.8    4.60      4.63   91.95
                5 160                5    8.00      8.05  100.00
              NaN  13              NaN    0.65      0.65  100.65
               42   0               NA    0.00        NA      NA



d$merkel_eigen2 <-   set_labels(d$merkel_eigen, c('sehr negative Bewertung'=1, 'sehr positive Bewertung'=5)) 
frq(d$merkel_eigen2)


       val                   label frq raw.prc valid.prc cum.prc
  1.000000 sehr negative Bewertung  18    0.90      0.91    0.91
  1.200000                     1.2   0    0.00      0.00    0.91
  1.333333        1.33333333333333   0    0.00      0.00    0.91
  1.400000                     1.4   0    0.00      0.00    0.91
  1.500000                     1.5   0    0.00      0.00    0.91
  1.600000                     1.6   0    0.00      0.00    0.91
  1.666667        1.66666666666667   0    0.00      0.00    0.91
  1.750000                    1.75   0    0.00      0.00    0.91
  1.800000                     1.8   0    0.00      0.00    0.91
  2.000000                       2   4    0.20      0.20    1.11
  2.200000                     2.2   0    0.00      0.00    1.11
  2.250000                    2.25   0    0.00      0.00    1.11
  2.333333        2.33333333333333   0    0.00      0.00    1.11
  2.400000                     2.4   0    0.00      0.00    1.11
  2.500000                     2.5   0    0.00      0.00    1.11
  2.600000                     2.6   0    0.00      0.00    1.11
  2.666667        2.66666666666667   0    0.00      0.00    1.11
  2.750000                    2.75   0    0.00      0.00    1.11
  2.800000                     2.8   0    0.00      0.00    1.11
  3.000000                       3   1    0.05      0.05    1.16
  3.200000                     3.2   0    0.00      0.00    1.16
  3.250000                    3.25   0    0.00      0.00    1.16
  3.333333        3.33333333333333   0    0.00      0.00    1.16
  3.400000                     3.4   0    0.00      0.00    1.16
  3.500000                     3.5   0    0.00      0.00    1.16
  3.600000                     3.6   0    0.00      0.00    1.16
  3.666667        3.66666666666667   0    0.00      0.00    1.16
  3.750000                    3.75   0    0.00      0.00    1.16
  3.800000                     3.8   0    0.00      0.00    1.16
  4.000000                       4  14    0.70      0.70    1.86
  4.200000                     4.2   0    0.00      0.00    1.86
  4.250000                    4.25   0    0.00      0.00    1.86
  4.333333        4.33333333333333   0    0.00      0.00    1.86
  4.400000                     4.4   0    0.00      0.00    1.86
  4.500000                     4.5   0    0.00      0.00    1.86
  4.600000                     4.6   0    0.00      0.00    1.86
  4.666667        4.66666666666667   0    0.00      0.00    1.86
  4.750000                    4.75   0    0.00      0.00    1.86
  4.800000                     4.8   0    0.00      0.00    1.86
  5.000000 sehr positive Bewertung   3    0.15      0.15    2.01
  6.000000                      NA   8    0.40      0.40    2.41
  7.000000                      NA   2    0.10      0.10    2.52
  8.000000                      NA   3    0.15      0.15    2.67
  9.000000                      NA  22    1.10      1.11    3.77
 10.000000                      NA  35    1.75      1.76    5.53
 11.000000                      NA  34    1.70      1.71    7.24
 12.000000                      NA   4    0.20      0.20    7.44
 13.000000                      NA   4    0.20      0.20    7.65
 14.000000                      NA  52    2.60      2.62   10.26
 15.000000                      NA  10    0.50      0.50   10.76
 16.000000                      NA  84    4.20      4.23   14.99
 17.000000                      NA   4    0.20      0.20   15.19
 18.000000                      NA   6    0.30      0.30   15.49
 19.000000                      NA  79    3.95      3.97   19.47
 20.000000                      NA 164    8.20      8.25   27.72
 21.000000                      NA 127    6.35      6.39   34.10
 22.000000                      NA  14    0.70      0.70   34.81
 23.000000                      NA   7    0.35      0.35   35.16
 24.000000                      NA 131    6.55      6.59   41.75
 25.000000                      NA  14    0.70      0.70   42.45
 26.000000                      NA 122    6.10      6.14   48.59
 27.000000                      NA   8    0.40      0.40   48.99
 28.000000                      NA  13    0.65      0.65   49.65
 29.000000                      NA 132    6.60      6.64   56.29
 30.000000                      NA 200   10.00     10.06   66.35
 31.000000                      NA 130    6.50      6.54   72.89
 32.000000                      NA   6    0.30      0.30   73.19
 33.000000                      NA   2    0.10      0.10   73.29
 34.000000                      NA 150    7.50      7.55   80.84
 35.000000                      NA  11    0.55      0.55   81.39
 36.000000                      NA 115    5.75      5.78   87.17
 37.000000                      NA   2    0.10      0.10   87.27
 38.000000                      NA   1    0.05      0.05   87.32
 39.000000                      NA  92    4.60      4.63   91.95
 40.000000                      NA 160    8.00      8.05  100.00
 41.000000                      NA  13    0.65      0.65  100.65
 42.000000                      NA   0    0.00        NA      NA

describe for single variables or vectors

Hi, I just found your wrapper sjt.df which produces a nice summary of descriptive statistics. Is it possible to have this for one or several variables and without the html formatting? I'm basically looking for the equivalent of frq for the summary statistics.

Of course, one could do something like sjt.df(efc[['e17age']]), but this has no console output, and it would be convenient not to have to extract the variable(s) from the dataset first.

Typo/bug in docs of set_labels

t-values?

Hi!

Thanks so much for this wonderful package! I really enjoy using it.

One thing that I can't seem to find is how to add t-values to the table. Is that possible?

thx!

mean method for labelled vector

Hello,

I was running a few tests around the labelled package and discovered a few things.

I created the following object:

notes <- c(12, 16, 99, 14, 17)
val_label(notes, 99) <- "absent le jour de l'examen"
missing_val(notes) <- 99

I was expecting to observe:

> mean(notes)
[1] 31.6
> mean(missing_to_na(notes), na.rm = TRUE)
[1] 14.75

Surprisingly, in some cases, I was obtaining:

> mean(notes)
[1] 14.75

I tried to find the reason for this. Using getS3method, it appears that R was calling the mean method defined in sjmisc (even though I had not attached sjmisc). I don't know why R picked up this method in some cases and not in others, but it raises at least a few comments about having a mean method defined for labelled vectors:

- First of all, the mean method will return different values depending on which packages are currently loaded in memory, and this is completely invisible to most users.
- I get a result even though I didn't specify na.rm = TRUE, so it is not consistent with the original method.
- Almost none of the statistical functions in R take the labelled class into account. Nevertheless, the advantage of the labelled class is that it preserves the original values and that these vectors can be used with any statistical function in R, whereas the approach used by memisc with data.set objects works only with a limited number of compatible functions.
- Therefore, when dealing with labelled data, the first thing a user needs to be aware of is that it is their responsibility to say explicitly what they want to do with self-defined missing values: (i) keep them (and thus treat them as valid values) or (ii) convert them to NA with missing_to_na.

Depending on the analysis being performed, both approaches could be appropriate. To be consistent with the vast majority of R functions, my position is that self-defined missing values should be converted to NA only if explicitly requested by the user via the missing_to_na function. That way, the data are always used in the same way, without having to remember which functions take label attributes into account.

Therefore, I'm not sure it would be appropriate to provide a mean method for the labelled class. If such a method is provided, it should at least return the same results as without it, for consistency. However, one option could be to display a warning message alerting the user that self-defined missing values are present, for example by providing a missing_to_na argument with a default value of NULL:

- if NULL, results are the same as with the default mean method, but a warning is displayed if some missing values are defined;
- if FALSE, self-defined missing values are not converted and no warning is displayed;
- if TRUE, self-defined missing values are converted to NA before computing the mean, and the user still always has to specify the value of na.rm.

What do you think?
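The three-way behavior proposed above could be sketched like this in base R. It is a plain function rather than an S3 method, and the self-defined missing values are passed in explicitly through a missing_values argument. Both the function name and that argument are assumptions made for illustration, since the real information would come from the labelled object itself.

```r
# Hypothetical sketch of the proposed missing_to_na argument
mean_with_missing <- function(x, missing_values = NULL,
                              missing_to_na = NULL, na.rm = FALSE) {
  has_missing <- length(missing_values) > 0 && any(x %in% missing_values)

  if (is.null(missing_to_na)) {
    # Default: same result as mean(), but alert the user
    if (has_missing) {
      warning("x contains self-defined missing values; ",
              "they are treated as valid values")
    }
  } else if (isTRUE(missing_to_na)) {
    # Explicit request: convert self-defined missing values to NA
    x[x %in% missing_values] <- NA
  }
  # missing_to_na = FALSE: no conversion, no warning

  mean(x, na.rm = na.rm)
}

notes <- c(12, 16, 99, 14, 17)

mean_with_missing(notes, 99, missing_to_na = FALSE)               # 31.6
mean_with_missing(notes, 99, missing_to_na = TRUE, na.rm = TRUE)  # 14.75
mean_with_missing(notes, 99, missing_to_na = TRUE)                # NA
```

With the default missing_to_na = NULL, the call returns 31.6 and raises the warning, so the result never silently depends on which packages happen to be loaded.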

set_na() results in 0 instead of NA

load(tscs2013)
str(tscs2013$v66)
Factor w/ 8 levels "1","2","3","4",..: 3 2 1 4 2 3 1 3 2 2 ...

 - attr(*, "value.labels")= Named chr [1:10] "99" "98" "97" "96" ...
  ..- attr(*, "names")= chr [1:10] "遺漏值" "拒答" "不知道" "跳答" ...
 - attr(*, "label")= Named chr "66 如果台灣宣布獨立,請問您認為兩岸會不會發生戰爭? "
  ..- attr(*, "names")= chr "v66"

tscs2013 <- set_na(tscs2013, 94:98)
str(tscs2013$v66)
Factor w/ 8 levels "1","2","3","4",..: 3 2 1 4 2 3 1 3 2 2 ...

 - attr(*, "value.labels")= Named chr [1:5] "99" "4" "3" "2" ...
  ..- attr(*, "names")= chr [1:5] "遺漏值" "一定不會" "可能不會" "可能會" ...
 - attr(*, "label")= Named chr "66 如果台灣宣布獨立,請問您認為兩岸會不會發生戰爭? "
  ..- attr(*, "names")= chr "v66"

table(tscs2013$v66)
  1   2   3   4  94  95  97  98 
425 736 464 179   0   0   0   0 

I was expecting the number of levels of variables like v66 to go down to 4, and 94 to 98 to become NA. Did I miss anything? Thank you.
P.S. The data used for this example can be downloaded from here: http://cl.ly/0I332p1i2Q3H/tscs2013.rda

descr() does not always work

Hi,

For some strange reason I cannot understand, `descr()` does not work for me and always throws the following error:

data(efc)
descr(efc, e17age, c160age)

Error in mutate_impl(.data, dots) : basic_string::_M_replace_aux
13. stop(structure(list(message = "basic_string::_M_replace_aux", call = mutate_impl(.data, dots), cppstack = NULL), .Names = c("message", "call", "cppstack"), class = c("std::length_error", "C++Error", "error", "condition")))
12. mutate_impl(.data, dots)
11. mutate_.tbl_df(.data, .dots = lazyeval::lazy_dots(...))
10. mutate_(.data, .dots = lazyeval::lazy_dots(...))
9. dplyr::mutate(., label = unname(get_label(.data, def.value = colnames(.data))))
8. function_list[[i]](value)
7. freduce(value, `_function_list`)
6. `_fseq`(`_lhs`)
5. eval(expr, envir, enclos)
4. eval(quote(`_fseq`(`_lhs`)), env, env)
3. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
2. tibble::as_tibble(psych::describe(.data)) %>% tibble::rownames_to_column(var = "variable") %>% dplyr::select_("-vars", "-mad") %>% dplyr::mutate(label = unname(get_label(.data, def.value = colnames(.data)))) %>% var_rename(median = "md")
1. descr(d$skalocdu)

Update: the same error also sometimes occurs when using frq():

d <- read_stata('GLES_Vorwahlquerschnitt_ZA5700_v1-0-0.dta')
d$skalocdu <- set_na(d$q21a, -99:-71)
d$skalorrg <- set_na(d$q79f, -99:-71)

d %>% select(skalorrg, skalocdu) %>% frq()

get_labels() inconsistent with get_values() - behavior different

I recently updated to v2.2.1 of sjmisc and noticed a difference in behavior for get_labels() from v2.0.1. Previously the values and value labels from get_values() and get_labels() were aligned, but in some of the data I work with this is no longer the case.

library(sjmisc)

online_path <- tempfile()
download.file("https://www.dropbox.com/s/3b4xiaurawztvfj/get_labels_discrepancy.zip?dl=1", online_path, mode = "wb")
rdata_path <- unzip(online_path)
(load(rdata_path))

attr(mydf[[1]], "labels")
get_labels(mydf[[1]])
get_values(mydf[[1]])

In 2.0.1 these commands would return:

# Behavior in 2.0.1
> attr(mydf[[1]], "labels")
39 40 44 46 37 61 42 45 63 35 43 49 52 51 38 32 48 33 41 36 34 50 47 55 53 64 56 29 66 31 54 59 30 27 58 57 60 25 26 24 28 62 23 19 65 
39 40 44 46 37 61 42 45 63 35 43 49 52 51 38 32 48 33 41 36 34 50 47 55 53 64 56 29 66 31 54 59 30 27 58 57 60 25 26 24 28 62 23 19 65 
22 
22 
> get_labels(mydf[[1]])
 [1] "39" "40" "44" "46" "37" "61" "42" "45" "63" "35" "43" "49" "52" "51" "38" "32" "48" "33" "41" "36" "34" "50" "47" "55" "53" "64"
[27] "56" "29" "66" "31" "54" "59" "30" "27" "58" "57" "60" "25" "26" "24" "28" "62" "23" "19" "65" "22"
> get_values(mydf[[1]])
 [1] 39 40 44 46 37 61 42 45 63 35 43 49 52 51 38 32 48 33 41 36 34 50 47 55 53 64 56 29 66 31 54 59 30 27 58 57 60 25 26 24 28 62 23 19
[45] 65 22

In 2.0.1 the values and value labels are aligned. But with v 2.2.1, these commands return:

# Behavior in 2.2.1
> attr(mydf[[1]], "labels")
39 40 44 46 37 61 42 45 63 35 43 49 52 51 38 32 48 33 41 36 34 50 47 55 53 64 56 29 66 31 54 59 30 27 58 57 60 25 26 24 28 62 23 19 65 
39 40 44 46 37 61 42 45 63 35 43 49 52 51 38 32 48 33 41 36 34 50 47 55 53 64 56 29 66 31 54 59 30 27 58 57 60 25 26 24 28 62 23 19 65 
22 
22 
> get_labels(mydf[[1]])
 [1] "19" "22" "23" "24" "25" "26" "27" "28" "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45" "46"
[27] "47" "48" "49" "50" "51" "52" "53" "54" "55" "56" "57" "58" "59" "60" "61" "62" "63" "64" "65" "66"
> get_values(mydf[[1]])
 [1] 39 40 44 46 37 61 42 45 63 35 43 49 52 51 38 32 48 33 41 36 34 50 47 55 53 64 56 29 66 31 54 59 30 27 58 57 60 25 26 24 28 62 23
[44] 19 65 22

The values and value labels are correct and aligned in the attributes, but the ordering returned by get_labels() in 2.2.1 differs from its previous behavior. This can be seen in all variables in the provided data set. This change in behavior is quite problematic for many of the processes I use. Is this an intended change or a bug?

lapply(mydf, function(x) data.frame(values = get_values(x), labels = get_labels(x), stringsAsFactors = FALSE))
# Output not shown for brevity
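
Until the ordering question is settled, one workaround is to read values and labels from the same "labels" attribute, so the two cannot drift apart. The following is a minimal base R sketch, not sjmisc code; the toy vector and its labels are made up for illustration.

```r
# Toy labelled vector; values and labels are invented for this example.
x <- c(39, 19, 40, 19)
attr(x, "labels") <- c("thirty-nine" = 39, "forty" = 40, "nineteen" = 19)

# Read values and labels from the same attribute so they stay paired.
labs <- attr(x, "labels", exact = TRUE)
aligned <- data.frame(
  values = unname(labs),
  labels = names(labs),
  stringsAsFactors = FALSE
)

# Optional: sort by value while keeping each label attached to its value.
aligned_sorted <- aligned[order(aligned$values), ]
print(aligned_sorted)
```

Sorting the data frame rows, rather than sorting values and labels separately, is what keeps the pairs intact.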

Update read.me

Wrong value assignments in rec()

Hi,

when using the new square-bracket option to assign value labels in rec(), the actual values are assigned incorrectly:

frq(d$bl)

# Bundesland

 val                     label frq raw.prc valid.prc cum.prc
   1     1. Baden-Wuerttemberg 178    9.04      9.04    9.04
   2                 2. Bayern 281   14.28     14.28   23.32
   3                 3. Berlin  98    4.98      4.98   28.30
   4            4. Brandenburg 108    5.49      5.49   33.79
   5                 5. Bremen  11    0.56      0.56   34.35
   6                6. Hamburg  11    0.56      0.56   34.91
   7                 7. Hessen 112    5.69      5.69   40.60
   8 8. Mecklenburg-Vorpommern  81    4.12      4.12   44.72
   9          9. Niedersachsen 182    9.25      9.25   53.96
  10   10. Nordrhein-Westfalen 299   15.19     15.19   69.16
  11       11. Rheinland-Pfalz  73    3.71      3.71   72.87
  12              12. Saarland  16    0.81      0.81   73.68
  13               13. Sachsen 226   11.48     11.48   85.16
  14        14. Sachsen-Anhalt 109    5.54      5.54   90.70
  15    15. Schleswig-Holstein  58    2.95      2.95   93.65
  16            16. Thueringen 125    6.35      6.35  100.00
  17                        NA   0    0.00        NA      NA

d$west <- rec(d$bl, '1,2,5:7,9,10:12,15 = 1 [West]; 3,4,8,13,14,16 = 0 [Ost]',
              var.label='Dummy: West')
d$west2 <- rec(d$bl, '1,2,5:7,9,10:12,15 = 1; 3,4,8,13,14,16 = 0')

frq(d$west)
# Dummy: West

 val label  frq raw.prc valid.prc cum.prc
   0  West 1221   62.04     62.04   62.04
   1   Ost  747   37.96     37.96  100.00
   2    NA    0    0.00        NA      NA

frq(d$west2)
# Bundesland

 val  frq label raw.prc valid.prc cum.prc
   0  747     0   37.96     37.96   37.96
   1 1221     1   62.04     62.04  100.00
   2    0    NA    0.00        NA      NA

The dataset used here is again "GLES_Vorwahlquerschnitt_ZA5700_v1-0-0.dta".
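
A quick way to check which of the two results is right, independently of rec(), is to recode with base R and cross-tabulate against the original codes. This is only a sketch on made-up toy codes; `bl` here is a stand-in for d$bl, and the label assignment at the end is one possible way to attach labels by value.

```r
# Toy stand-in for d$bl (Bundesland codes 1 to 16); not the GLES data.
bl <- c(1, 2, 3, 4, 5, 13, 16, 10)

# Codes that the recode string maps to 1 (West).
west_codes <- c(1, 2, 5:7, 9, 10:12, 15)

# Reference recode: 1 = West, 0 = Ost.
expected <- as.numeric(bl %in% west_codes)

# Attach labels by value, so each label is tied to its own code.
attr(expected, "labels") <- c("Ost" = 0, "West" = 1)

# Every original code should fall into exactly one recoded category.
print(table(original = bl, recoded = as.vector(expected)))
```

Comparing this table with table(d$bl, d$west) would show whether the values or only the labels are swapped.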

descr() issues

Sorry to bother you again with potential installation problems and version conflicts, but descr() still does not work for me after reinstalling the CRAN version of dplyr. Here is my sessionInfo():

R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] sjstats_0.7.0   sjPlot_2.1.2    sjmisc_2.1.0    dplyr_0.5.0     purrr_0.2.2     readr_1.0.0     tidyr_0.6.0     tibble_1.2     
 [9] ggplot2_2.2.0   tidyverse_1.0.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.8        stringdist_0.9.4.2 mvtnorm_1.0-5      lattice_0.20-33    zoo_1.7-13         lmtest_0.9-34      assertthat_0.1    
 [8] digest_0.6.10      psych_1.6.9        mime_0.5           R6_2.2.0           plyr_1.8.4         stats4_3.3.0       coda_0.18-1       
[15] lazyeval_0.2.0     multcomp_1.4-6     minqa_1.2.4        nloptr_1.0.4       Matrix_1.2-6       DT_0.2             splines_3.3.0     
[22] lme4_1.1-12        stringr_1.1.0      foreign_0.8-66     htmlwidgets_0.8    munsell_0.4.3      shiny_0.14.2       broom_0.4.1       
[29] httpuv_1.3.3       modelr_0.1.0       mnormt_1.5-5       htmltools_0.3.5    nnet_7.3-12        coin_1.1-3         codetools_0.2-14  
[36] MASS_7.3-45        grid_3.3.0         nlme_3.1-127       arm_1.9-3          xtable_1.8-2       gtable_0.2.0       DBI_0.5-1         
[43] magrittr_1.5       scales_0.4.1       stringi_1.1.2      reshape2_1.4.2     effects_3.1-2      sandwich_2.3-4     blme_1.0-4        
[50] TH.data_1.0-7      tools_3.3.0        abind_1.4-5        parallel_3.3.0     survival_2.40-1    colorspace_1.3-1   knitr_1.15.1      
[57] haven_1.0.0        merTools_0.3.0     modeltools_0.2-21

And here is the example with traceback:

d <- read_stata('GLES_Vorwahlquerschnitt_ZA5700_v1-0-0.dta')
set.seed(339487731)
randgroup <- runif(nrow(d), min=0, max=1)
psych::describe(randgroup)

   vars    n mean   sd median trimmed  mad min max range  skew kurtosis   se
X1    1 2001  0.5 0.29   0.51     0.5 0.38   0   1     1 -0.01    -1.25 0.01

descr(randgroup)

Error in mutate_impl(.data, dots) : Unsupported type NILSXP for column "label"
13.
stop(structure(list(message = "Unsupported type NILSXP for column \"label\"", call = mutate_impl(.data, dots), cppstack = structure(list( file = "", line = -1L, stack = "C++ stack not available on this system"), .Names = c("file", "line", "stack"), class = "Rcpp_stack_trace")), .Names = c("message", ...
12.
mutate_impl(.data, dots)
11.
mutate_.tbl_df(.data, .dots = lazyeval::lazy_dots(...))
10.
mutate_(.data, .dots = lazyeval::lazy_dots(...))
9.
dplyr::mutate(., label = unname(get_label(dd, def.value = colnames(dd))))
8.
function_list[[i]](value)
7.
freduce(value, `_function_list`)
6.
`_fseq`(`_lhs`)
5.
eval(expr, envir, enclos)
4.
eval(quote(`_fseq`(`_lhs`)), env, env)
3.
withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
2.
tibble::as_tibble(psych::describe(dd)) %>% tibble::rownames_to_column(var = "variable") %>% dplyr::select_("-vars", "-mad") %>% dplyr::mutate(label = unname(get_label(dd, def.value = colnames(dd)))) %>% var_rename(median = "md")
1.
descr(randgroup)
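
Reading the traceback, the label is built with get_label(dd, def.value = colnames(dd)). For a plain vector, colnames() returns NULL, so it seems plausible (this is an assumption, not confirmed) that the label column ends up NULL. A minimal base R sketch of a fallback, using a hypothetical helper that is not part of sjmisc:

```r
# Hypothetical helper, not part of sjmisc: always returns a length-one
# character label, even for an unlabelled atomic vector.
label_or_default <- function(x, default = "x") {
  lab <- attr(x, "label", exact = TRUE)
  if (is.null(lab)) default else as.character(lab)
}

set.seed(339487731)
randgroup <- runif(10)

# A plain vector has no column names, so colnames() gives NULL here.
no_colnames <- is.null(colnames(randgroup))

# The fallback supplies a usable label instead of NULL.
lab <- label_or_default(randgroup, "randgroup")
print(lab)
```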

remove_labels should unclass vector

Suggestion: xtab function similar to frq()

Hi, it would be pretty nice to have a simple function xtab that, like frq() in relation to sjt.frq(), generates a simple non-HTML contingency table including value labels.

Related to this, it would also be useful either to have a parameter for this function and frq() that prints the variable label above the table, or to print the variable label by default.

Erroneous assignment of value labels to subset data when using copy_labels

An issue concerning the usage of copy_labels:

When subsetting data using subset{base}, haven-style variable and value labels are lost.
To restore labels, sjmisc offers the copy_labels function.
However, this function may fail to reproduce the correct labels in the following scenario: (a) the subset data frame is a subset of both variables (columns) AND cases (rows); and (b) consequently, some of the original variable values are no longer contained in the subset.
In this case, copy_labels uses only the first value labels:

More labels than values of "var1". Using first 3 labels.

This can lead to wrong assignments of value labels, however. In my specific case, value labels of missing values from the original dataframe were erroneously assigned to non-missing values in the subset dataframe.

Is there any way to make sure that the correct original value labels are reassigned?

I think that for such cases, copy_labels could benefit from an argument similar to "force.labels = TRUE" in set_labels?

Note: This issue only appears to occur when using haven-style labels, not when requesting foreign-style labels in sjmisc's read_spss command. Might this have to do with the way in which missing values are treated by default in haven vs. foreign?

Thanks!
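
One way to avoid the positional mismatch described above is to match labels by value rather than by position. The following is a minimal base R sketch under that assumption; copy_labels_by_value is a hypothetical helper, not the sjmisc implementation, and the toy variable is made up.

```r
# Hypothetical helper, not part of sjmisc: keep only the labels whose values
# still occur in the subset, matched by value rather than by position.
copy_labels_by_value <- function(subset_x, original_x) {
  labs <- attr(original_x, "labels", exact = TRUE)
  attr(subset_x, "labels") <- labs[labs %in% subset_x]
  attr(subset_x, "label") <- attr(original_x, "label", exact = TRUE)
  subset_x
}

# Toy data: -99 is a labelled missing code in the original variable.
var1 <- c(-99, 1, 2, 3, 2, -99)
attr(var1, "labels") <- c("no answer" = -99, "low" = 1, "mid" = 2, "high" = 3)
attr(var1, "label") <- "Example item"

# Subsetting with [ drops the label attributes.
sub <- var1[var1 > 0]

# Restore labels; "no answer" is not carried over because -99 is gone.
sub <- copy_labels_by_value(sub, var1)
print(attr(sub, "labels"))
```

Matching by value means a label for a missing code can never end up on a non-missing value, whatever rows were dropped.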

docs from remove_labels should mention tagged NA

Suggestion: Include content of blog posts in a vignette

Hi,

I'd like to suggest combining the most important stuff from the blog posts (1, 2, 3) in a vignette. This could include basic examples of analysis with labelled data and it would be very useful to have all this captured in one document. Especially for teaching it would be nice, where online access in labs is not always guaranteed and students easily get overwhelmed.

As a more ambitious project, you could think about writing an R equivalent to data analysis with Stata ;-)

Suggestion: easier way to include value labels in rec()

Hi,

recoding is solved quite conveniently in Stata, where the labels are directly included in the recoding procedure.
recode age (18/23=1 "18to23") (24/65=2 "24to65") (66/99=3 ">65"), gen (agekat)

Implementing this in rec() could work with regular expressions. For example, we could use a special character like $ or §, or [ ], within the recode string:

rec(efc$e17age, "18:23=1 [18to23]; 24:65=2 [24to65]; 66:max=3 [> 65]",
    var.label = "Age kat.")

It might even be possible to implement this without breaking the old behavior, right? A parameter like val.labels="auto" could scan for the special character(s) and apply the automated procedure if they are detected.
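
To illustrate the regular-expression idea, here is a minimal base R sketch that splits such a recode string into plain rules and a named vector of value labels. parse_rec_labels is a hypothetical helper, not the rec() implementation, and it assumes labels contain neither ";" nor "=".

```r
# Hypothetical parser, not the rec() implementation: splits a recode string
# into plain rules and a named vector of value labels.
parse_rec_labels <- function(rec_string) {
  rules <- trimws(strsplit(rec_string, ";", fixed = TRUE)[[1]])
  # Label text inside [...], or NA when a rule carries no label.
  has_label <- grepl("\\[.*\\]", rules)
  labels <- ifelse(has_label, sub(".*\\[(.*)\\].*", "\\1", rules), NA_character_)
  # Rule without the bracket part, e.g. "18:23=1".
  plain <- trimws(sub("\\[.*\\]", "", rules))
  # The new value is whatever follows the "=" sign.
  new_values <- as.numeric(trimws(sub(".*=", "", plain)))
  list(
    rules = paste(plain, collapse = "; "),
    val.labels = setNames(new_values[has_label], labels[has_label])
  )
}

parsed <- parse_rec_labels("18:23=1 [18to23]; 24:65=2 [24to65]; 66:max=3 [> 65]")
print(parsed$rules)
print(parsed$val.labels)
```

Because each label is taken from the same rule as its new value, labels and values stay paired regardless of the order in which the rules are written.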

drop_labels()

Hi,

I guess it's a matter of design philosophy as well, but I would like to suggest calling drop_labels() by default in functions like sjt.xtab(), and/or letting set_na() optionally drop unused labels as well.
Often variables are just copied while encoding missing values, like so:

d <- read_stata('GLES_Vorwahlquerschnitt_ZA5700_v1-0-0.dta')
vars <- select(d, q41, q119a, q192, q11ba) %>% set_na(-99:-83)
sjt.xtab(vars$q41, vars$q11ba) # gets messed up without drop_labels()

It is only a small detail, but I dislike having to call drop_labels() again in such cases to drop any unused NA labels. If one does not use drop_labels(), the output of sjt.xtab() etc. gets ugly because of the zero-count categories for unused labels.

I guess this all comes down to one question for the whole package: do you ever need unused labels again for most analyses? I personally don't, but maybe this is different for other people.
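
For illustration, dropping unused labels amounts to keeping only those labels whose value still occurs in the data. The sketch below uses base R and a hypothetical helper; it is not sjmisc's drop_labels(), and the toy variable is made up.

```r
# Hypothetical helper, not sjmisc's drop_labels(): keep only labels whose
# value actually occurs among the non-missing data.
drop_unused_labels <- function(x) {
  labs <- attr(x, "labels", exact = TRUE)
  if (!is.null(labs)) {
    attr(x, "labels") <- labs[labs %in% x[!is.na(x)]]
  }
  x
}

# Toy variable: -99 and -98 are missing codes.
q <- c(1, 2, -99, 2, -98, 1)
attr(q, "labels") <- c("refused" = -99, "don't know" = -98, "yes" = 1, "no" = 2)

# Recode missing codes to NA; the labels attribute is still attached.
q[q %in% c(-99, -98)] <- NA

# Now remove the labels that no longer have any cases.
q <- drop_unused_labels(q)
print(attr(q, "labels"))
```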

Feature request: Include label attribute into xtable header

I recently raised this question on Stack Overflow.
I collaborated with the xtable maintainer to work out a solution, which is posted as an answer there.
Would it not be better to include this functionality in the sjmisc package?

http://stackoverflow.com/questions/34692402/include-label-attribute-into-xtable-header/34705889#34705889

glmmPQL objects

I'm struggling to graph my fitted glmmPQL model. I'm trying to use sjPlot, but only a few functions work on this kind of object. It would be great if sjPlot could handle glmmPQL objects.

New functions is_odd and is_even

Format data frame in ragged style

The functionality to convert a data frame into a ragged representation (à la ftable) might be something for your sjmisc package.

http://stackoverflow.com/questions/41324110/convert-normal-r-data-frame-into-ragged-format-a-la-ftable
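
As a rough idea of what such a conversion could look like, the base R sketch below blanks out repeated leading values so that each group label appears only once. to_ragged is a hypothetical helper; it assumes the data frame is already sorted by the grouping columns and contains no missing values in them.

```r
# Hypothetical helper: blank out repeated leading values, ftable-style.
# Assumes df is sorted by `cols` and has no NA in those columns.
to_ragged <- function(df, cols) {
  if (nrow(df) < 2) return(df)
  out <- df
  same_so_far <- rep(TRUE, nrow(df) - 1)
  for (col in cols) {
    v <- as.character(df[[col]])
    # A cell is blanked only if it and all columns to its left
    # repeat the row above.
    same_so_far <- same_so_far & v[-1] == v[-length(v)]
    v[c(FALSE, same_so_far)] <- ""
    out[[col]] <- v
  }
  out
}

df <- data.frame(
  group = c("a", "a", "a", "b", "b"),
  sub   = c("x", "x", "y", "x", "x"),
  value = 1:5,
  stringsAsFactors = FALSE
)
ragged <- to_ragged(df, c("group", "sub"))
print(ragged)
```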

Install error

Hi,

I'm not sure whether this is a big problem, but installing sjmisc from GitHub ends with the following error:

** preparing package for lazy loading
Error : object ‘zap_inf’ is not exported by 'namespace:sjmisc'
ERROR: lazy loading failed for package ‘sjPlot’
* removing ‘/home/cs/R/x86_64-pc-linux-gnu-library/3.3/sjPlot’
Error: Command failed (1)

a "labelled" package

In other projects, I need to deal with labelled variables. I'm therefore interested in some functions of sjmisc, but only those related to the manipulation of labelled variables.

It could be useful to have a package dealing only with the manipulation of such variables. This package could later be used by several projects.

Would you be interested in such a package?

If not, would you mind if I used some of the code you wrote for sjmisc?
If yes, would you want to lead it?

regards

Possible error in frq

Every year since 2005 I've gone searching for a frequency function that works more like SAS or SPSS. I finally found it in frq. Thanks for writing it! It looks like there is a bug in either frq or haven::read_spss. See the example below, where frq "freaks out" (heh, heh) at gender. If this is a haven problem, please let me know and I'll post it over there.

I'm also a bit perplexed that the help file for frq uses the standard haven approach to the labels, while the help examples for sjmisc::read_spss use a different structure (atomic vs. labelled). Are you in the middle of converting from one to the other?

> (mydata <- haven::read_spss("mydata.sav"))
# A tibble: 8 × 7
     id  workshop    gender        q1        q2        q3        q4
  <dbl> <dbl+lbl> <chr+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl>
1     1         1         f         1         1         5         1
2     2         2         f         2         1         4         1
3     3         1         f         2         2         4         3
4     4         2                   3         1       NaN         3
5     5         1         m         3         5         2         4
6     6         2         m         5         4         5         5
7     7         1         m         5         3         4         4
8     8         2         m         4         5         5         5
> mydata %>%
+   select(gender) %>%
+   frq()
Error in order(names(labels), lvls) : argument lengths differ
> mydata %>%
+   select(workshop,q1:q4) %>%
+   frq()
$workshop
 val label frq raw.prc valid.prc cum.prc
   1     R   4      50        50      50
   2   SAS   4      50        50     100
   3    NA   0       0        NA      NA

$q1
# The instructor was well prepared.

 val             label frq raw.prc valid.prc cum.prc
   1 Strongly Disagree   1    12.5      12.5    12.5
   2          Disagree   2    25.0      25.0    37.5
   3           Neutral   2    25.0      25.0    62.5
   4             Agree   1    12.5      12.5    75.0
   5    Strongly Agree   2    25.0      25.0   100.0
   6                NA   0     0.0        NA      NA

$q2
# The instructor communicated well.

 val             label frq raw.prc valid.prc cum.prc
   1 Strongly Disagree   3    37.5      37.5    37.5
   2          Disagree   1    12.5      12.5    50.0
   3           Neutral   1    12.5      12.5    62.5
   4             Agree   1    12.5      12.5    75.0
   5    Strongly Agree   2    25.0      25.0   100.0
   6                NA   0     0.0        NA      NA

$q3
# The course material was helpful.

 val             label frq raw.prc valid.prc cum.prc
   1 Strongly Disagree   0     0.0      0.00    0.00
   2          Disagree   1    12.5     14.29   14.29
   3           Neutral   0     0.0      0.00   14.29
   4             Agree   3    37.5     42.86   57.14
   5    Strongly Agree   3    37.5     42.86  100.00
   6                NA   1    12.5        NA      NA

$q4
# Overall, I found this workshop useful.

 val             label frq raw.prc valid.prc cum.prc
   1 Strongly Disagree   2      25        25      25
   2          Disagree   0       0         0      25
   3           Neutral   2      25        25      50
   4             Agree   2      25        25      75
   5    Strongly Agree   2      25        25     100
   6                NA   0       0        NA      NA

Function to replace na with values

atomic_to_fac needs exact=T in attr
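
For context: base R's attr() uses partial matching by default, so asking for "label" can silently return the "labels" attribute when no variable label exists. A minimal base R demonstration on a made-up vector:

```r
# Toy vector with value labels only, and no variable label.
x <- c(1, 2, 1)
attr(x, "labels") <- c("yes" = 1, "no" = 2)

# Default partial matching: "label" uniquely matches "labels".
partial <- attr(x, "label")

# Exact matching: there is no attribute called exactly "label".
exact <- attr(x, "label", exact = TRUE)

print(is.null(partial))
print(is.null(exact))
```

With exact = TRUE, the value labels are no longer mistaken for a variable label.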

Small suggestion: parameter for find_var() to return variable labels

Hi,

I'd like to suggest a very subtle enhancement of find_var(): adding a boolean parameter, like as.lab=T, which makes the function return variable labels for all matches. In almost all cases where I use find_var for large datasets, I want to get the variable labels directly and use this syntax:
find_var(data, 'string', as.df=T) %>% get_label()
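
As a sketch of the requested behavior, the base R helper below searches both variable names and "label" attributes and returns the labels of all matches. find_var_labels is hypothetical and not part of sjmisc; the toy data frame and its labels are made up.

```r
# Hypothetical helper, not sjmisc's find_var(): search variable names and
# "label" attributes, and return the labels of all matching columns.
find_var_labels <- function(data, pattern) {
  labs <- vapply(data, function(col) {
    lab <- attr(col, "label", exact = TRUE)
    if (is.null(lab)) NA_character_ else as.character(lab)[1]
  }, character(1))
  hit <- grepl(pattern, names(data), ignore.case = TRUE) |
    (!is.na(labs) & grepl(pattern, labs, ignore.case = TRUE))
  labs[hit]
}

d <- data.frame(q1 = 1:3, q2 = 4:6, age = c(20, 30, 40))
attr(d$q1, "label") <- "Trust in parliament"
attr(d$q2, "label") <- "Trust in parties"
attr(d$age, "label") <- "Age of respondent"

matches <- find_var_labels(d, "trust")
print(matches)
```

The result is a named character vector, so the variable names remain available alongside the labels.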
