strengejacke / sjmisc
Data transformation and utility functions for R
Home Page: https://strengejacke.github.io/sjmisc
License: GNU General Public License v3.0
Hi,
I get a warning when trying to use `set_na()` on a data frame:
library(tidyverse)
library(sjPlot)
library(sjmisc)
setwd("/home/cs/Dropbox/Lehre/Methoden der politischen Soziologie/0 Daten/GLES Vorwahl")
d <- read_stata('GLES_Vorwahlquerschnitt_ZA5700_v1-0-0.dta')
d <- set_na(d, value = -99:-71)
no non-missing arguments to min; returning Inf
no non-missing arguments to max; returning -Inf
Can't set value labels for "x". Infinite value range.
(the same three messages are printed five times in total)
Is this because not every variable contains values between -99 and -71?
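A possible explanation, sketched in base R. It assumes that some columns contain only values in the range -99 to -71: once those become `NA`, nothing non-missing is left, and `min()` and `max()` emit exactly these warnings. The vector `all_missing` is a made-up stand-in for such a column, not taken from the GLES data.

```r
# Hypothetical column in which every value is a missing-value code
all_missing <- c(-99, -98, -97)
all_missing[all_missing %in% -99:-71] <- NA

# With no non-missing values left, min() and max() warn and
# fall back to Inf and -Inf, matching the messages above
lo <- suppressWarnings(min(all_missing, na.rm = TRUE))
hi <- suppressWarnings(max(all_missing, na.rm = TRUE))
is.infinite(c(lo, hi))
```

If this is the cause, the warning is harmless for the affected columns, although the value labels cannot be rebuilt for them.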
Hi,
I have a very strange problem with `rec()`: when I use the `efc` data to recode an age variable, `val.labels` works without any issues:
data(efc)
> alt_kat <- rec(efc$e17age, "18:23=1; 24:65=2; 66:max=3",
+ var.label="Alter kat.",
+ val.labels= c("18-23"=1,"24-65"=2,"ueber65"=3))
> frq(alt_kat)
# Alter kat.
val label frq raw.prc valid.prc cum.prc
1 18-23 0 0.00 0.0 0.0
2 24-65 32 3.52 3.6 3.6
3 ueber65 858 94.49 96.4 100.0
4 NA 18 1.98 NA NA
However, when I use the exact same syntax for a Stata dataset the following happens:
d$alter <- 2013 - set_na(d$q2c,-99:-97)
d$alt_kat <- rec(d$alter, "18:23=1; 24:65=2; 66:max=3",
+ var.label="Alter kat.",
+ val.labels= c("18-23"=1,"24-65"=2,"ueber65"=3))
> frq(d$alt_kat)
# Alter kat.
val label frq raw.prc valid.prc cum.prc
1 1 77 3.85 3.90 3.90
2 2 1174 58.67 59.47 63.37
3 3 723 36.13 36.63 100.00
4 NA 27 1.35 NA NA
The value labels were not carried over. The only workaround is to omit the numeric values when specifying the value labels:
alt_kat <- rec(d$alter, "18:23=1; 24:65=2; 66:max=3",
+ var.label="Alter kat.",
+ val.labels= c("18-23","24-65","ueber65"))
> frq(alt_kat)
# Alter kat.
val label frq raw.prc valid.prc cum.prc
1 18-23 77 3.85 3.90 3.90
2 24-65 1174 58.67 59.47 63.37
3 ueber65 723 36.13 36.63 100.00
4 NA 27 1.35 NA NA
Do you know why this happens? I would be happy to just skip the numeric values, but from what I understand the labels are then not assigned in the order of the recodings, but in the order in which values appear in the vector. At least this seems to be true for another example:
frq(d$q1)
# Geschlecht
val label frq raw.prc valid.prc cum.prc
-99 -99. keine Angabe 0 0.00 0.00 0.00
-98 -98. weiss nicht 0 0.00 0.00 0.00
-97 -97. trifft nicht zu 0 0.00 0.00 0.00
1 1. maennlich 1013 50.62 50.62 50.62
2 2. weiblich 988 49.38 49.38 100.00
3 NA 0 0.00 NA NA
> d$sex <- rec(d$q1, "2=1;1=0;else=NA",
+ var.label="sex", val.labels=c('Frau'=1, 'Mann'=0))
> d$sex2 <- rec(d$q1, "2=1;1=0;else=NA",
+ var.label="sex", val.labels=c('Frau', 'Mann'))
> frq(d$sex)
# sex
val label frq raw.prc valid.prc cum.prc
0 Frau 988 49.38 49.38 49.38
1 Mann 1013 50.62 50.62 100.00
2 NA 0 0.00 NA NA
> frq(d$sex2)
# sex
val label frq raw.prc valid.prc cum.prc
0 Frau 1013 50.62 50.62 50.62
1 Mann 988 49.38 49.38 100.00
2 NA 0 0.00 NA NA
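The difference between the two calls can be illustrated in base R. This is only a sketch, not sjmisc's actual implementation: `attach_labels()` is a hypothetical helper that, as an assumption, pairs unnamed labels with the sorted unique values. Under that assumption it reproduces the swapped labels seen for `sex2`.

```r
# Hypothetical helper, not part of sjmisc
attach_labels <- function(x, labels) {
  if (is.null(names(labels))) {
    # Unnamed labels: paired by position with the sorted unique values
    values <- sort(unique(x))
    labels <- stats::setNames(values[seq_along(labels)], labels)
  }
  attr(x, "labels") <- labels
  x
}

sex <- c(0, 1, 1, 0)  # after recoding: 0 = Mann, 1 = Frau

named   <- attach_labels(sex, c(Frau = 1, Mann = 0))
unnamed <- attach_labels(sex, c("Frau", "Mann"))

attr(named, "labels")    # Frau = 1, Mann = 0
attr(unnamed, "labels")  # Frau = 0, Mann = 1 (swapped)
```

With unnamed labels, the order of the label vector has to follow the sorted values, not the order of the recode rules.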
The complete dataset I used for the output (and for most other examples I provide in issues) is available [here](https://www.dropbox.com/s/42xt2u5zsj98z8m/GLES_Vorwahlquerschnitt_ZA5700_v1-0-0.dta?dl=1).
On a side note, this is solved quite conveniently in Stata, where the labels are included directly in the recoding procedure.
recode alter (18/23=1 "18bis23") (24/65=2 "24bis65") (66/99=3 ">65"), gen (alterkat)
It might be tricky to implement but could work with regular expressions, for example if we use a special character like `$` or `§`, or `[ ]` within the recode string:
rec(efc$e17age, "18:23=1 [18bis23]; 24:65=2 [24bis65]; 66:max=3 [ueber 65]",
    var.label="Alter kat.")
It might even be possible to implement this without breaking the old behavior, right? A parameter like `val.labels="auto"` could scan for the special character and apply the automated procedure if the special characters are detected.
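The bracket idea can be sketched with base R regular expressions. `parse_rec_labels()` is a hypothetical helper written for illustration, not an sjmisc function: it splits the recode string at `;`, pulls the text in `[ ]` out of each rule, and returns the cleaned recode string together with a named vector of labels.

```r
# Hypothetical helper, not part of sjmisc
parse_rec_labels <- function(rec_string) {
  parts <- trimws(strsplit(rec_string, ";", fixed = TRUE)[[1]])
  # A rule carries a label if it ends in "[...]"
  has_label <- grepl("\\[.*\\][[:space:]]*$", parts)
  labels <- ifelse(
    has_label,
    sub("^.*\\[(.*)\\][[:space:]]*$", "\\1", parts),
    NA_character_
  )
  # Strip the bracket part to get the plain recode rule
  rules <- trimws(sub("[[:space:]]*\\[.*\\][[:space:]]*$", "", parts))
  # The new value is whatever follows the last "="
  new_values <- trimws(sub("^.*=", "", rules))
  list(
    rules  = paste(rules, collapse = "; "),
    labels = stats::setNames(new_values[has_label], labels[has_label])
  )
}

res <- parse_rec_labels(
  "18:23=1 [18bis23]; 24:65=2 [24bis65]; 66:max=3 [ueber 65]"
)
res$rules
res$labels
```

Rules without brackets pass through unchanged, so a string in the old format would yield the same recode string and an empty label vector. The new values are kept as character strings here so that targets like `NA` do not need special handling.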
Hi,
I think something went wrong when the parameter name was changed from `value` to `na`. Although the parameter is correctly listed as `na` in the docs, the following example raises an error:
x <- factor(c("a", "b", "c"))
x
set_na(x, na = "b", as.tag = TRUE)
Error in set_na(x, na = "b", as.tag = TRUE) :
argument "value" is missing, with no default
Hi Daniel,
I'm having some strange issues with plotting the model assumptions of a linear model using:
sjp.lm(lm.test, type = "ma")
on the following linear model:
lm.test = lm(Chao1 ~ Water_content + N.NH4 + N.NO3 + OM + pH + Cl + K, df)
It returns an error:
Error in terms.formula(formula, data = data) : 'data' argument is of the wrong type
It behaves quite strangely, because when I take out the `Water_content` variable, it suddenly works. I tried renaming it and removing rows with NA values, but that didn't help.
I'm attaching the test data.
Thanks a lot!
Starting with a clean R session, running RStudio 1.1.39 and R 3.3.2 on OS X Sierra, I was playing with `to_factor()` but the output differs from what I expected. What could be going on here? I also reinstalled `sjmisc` via `devtools`, so it can't be a package version issue.
> library(sjmisc)
> data(efc)
> str(efc$e42dep)
atomic [1:908] 3 3 3 4 4 4 4 4 4 4 ...
> table(to_factor(efc$e42dep))
1 2 3 4
66 225 306 304
> barplot(table(to_factor(efc$e42dep)))
The code below works though with the output in line with the documentation:
library(sjPlot)
efc$c172code <- to_factor(efc$c172code)
fit <- lm(barthtot ~ c160age + c12hour + c172code + c161sex, data = efc)
sjt.lm(fit, group.pred = TRUE)
Instead of
x <- factor(c("a", "b", "c"))
x
# [1] a b c
# Levels: a b c
set_na(x) <- "b"
x
# [1] a <NA> c
# attr(,"labels")
# b
# NA
# Levels: a b c
it would be better to drop the unused level:
set_na(x) <- "b"
x
# [1] a <NA> c
# attr(,"labels")
# b
# NA
# Levels: a c
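In base R, the requested behavior corresponds to calling `droplevels()` after the value has been set to `NA`. The sketch below uses only base functions and says nothing about how `set_na()` is implemented.

```r
x <- factor(c("a", "b", "c"))

# Setting a value to NA leaves its level in place
x[x == "b"] <- NA
levels(x)

# droplevels() removes levels that no longer occur in the data
x <- droplevels(x)
levels(x)
```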
It's possible to create labelled character variables. Cf. the documentation of `labelled`:
s1 <- labelled(c("M", "M", "F"), c(Male = "M", Female = "F"))
Proposed modification: #5
When sjmisc::write_spss() writes data to an SPSS file, it coerces data to be contiguous:
x <- set_labels(rep(c(1:5, 99), times = 3), c("Strongly Agree" = 1, "Agree" = 2, "Neutral" = 3, "Disagree" = 4, "Strongly Disagree" = 5, "Did Not Answer" = 99))
x
# [1] 1 2 3 4 5 99 1 2 3 4 5 99 1 2 3 4 5 99
# attr(,"labels")
#    Strongly Agree             Agree           Neutral          Disagree Strongly Disagree    Did Not Answer
#                 1                 2                 3                 4                 5                99
sjmisc::write_spss(data.frame(x), path = "C:/TEMP/temp.sav")
In SPSS the data is now contiguous, which is inconsistent with the original structure of the data.
Is it possible to prevent this behavior? It causes a disconnect between the data in SPSS and in R.
Hi,
I have a suggestion for a utility function to search for all variables containing a specific string within their variable labels. I'm sure you would come up with a better version and without the `dplyr` dependency:
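lookfor <- function(x, df) {
  # Pair each column name with its variable label
  cols <- colnames(df)
  labels <- get_label(df)
  combined <- data.frame(name = cols, label = labels) %>%
    # Keep rows whose label contains the search string (case-insensitive)
    filter(grepl(tolower(x), tolower(label)))
  return(combined)
}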
> lookfor('elder', efc)
name label
1 e15relat relationship to elder
2 e16sex elder's gender
3 e17age elder' age
4 e42dep elder's dependency
5 tot_sc_e Services for elderly
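A version without the `dplyr` dependency could look like the sketch below. `lookfor_base()` is a hypothetical name, and the code assumes that variable labels are stored in the `"label"` attribute of each column (the haven convention); the toy data frame is made up for the example.

```r
# Hypothetical base R variant of lookfor()
lookfor_base <- function(pattern, df) {
  # Read the "label" attribute of every column; NA if a column has none
  labels <- vapply(df, function(col) {
    lab <- attr(col, "label", exact = TRUE)
    if (is.null(lab)) NA_character_ else as.character(lab)[1]
  }, character(1))
  hits <- !is.na(labels) & grepl(pattern, labels, ignore.case = TRUE)
  data.frame(
    name  = names(df)[hits],
    label = unname(labels[hits]),
    stringsAsFactors = FALSE
  )
}

# Toy data with made-up labels
toy <- data.frame(a = 1:2, b = 3:4, c = 5:6)
attr(toy$a, "label") <- "elder's age"
attr(toy$b, "label") <- "carer's age"

found <- lookfor_base("elder", toy)
found
```

Unlabelled columns such as `c` are skipped instead of causing an error.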
For `frq` there seems to be a mismatch between the code and the docs (or between the argument description and the description of the function's signature):
I found
#' @param sort.frq Logical, if \code{TRUE}, rows will be sorted according to
#'   value frequencies.
But `sort.frq` seems to expect a character, not a logical.
Hi,
sorry to disturb you again. I would like to understand the purpose of `to_factor`. So, it's turning the variable into a factor using the values of the variable as the labels of the factor, and keeping the "labels" attribute?
In which case is there a specific need to keep the `attr(*, "labels")`?
I have a list of data.frames, each with `labelled` and/or `labelled_spss` variables, that I want to merge. Unfortunately, `plyr::join_all()` removes the `"label"`, `"na_values"`, and `"format.spss"` attributes.
I was hoping that `merge_df()` would be able to retain these attributes, but I got the following error message:
Error in `[[<-.data.frame`(`*tmp*`, i, value = c(NA, NA, NA, NA, NA, NA, :
replacement has 4821 rows, data has 9441
If this issue is more `haven`-esque I'll make an issue there, thanks!
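Until the attributes survive the merge itself, one workaround is to copy them back afterwards. The sketch below uses base R only; `restore_attrs()` is a hypothetical helper, and the attribute names it copies are taken from the description above. It does not restore the `labelled` class, and if two sources disagree about an attribute, the first one wins.

```r
# Hypothetical helper: copy selected attributes from the source columns
# back onto the merged columns of the same name
restore_attrs <- function(merged, sources,
                          attr_names = c("label", "labels",
                                         "na_values", "format.spss")) {
  for (src in sources) {
    for (col in intersect(names(src), names(merged))) {
      for (a in attr_names) {
        val <- attr(src[[col]], a, exact = TRUE)
        # Only fill in attributes that are missing after the merge
        if (!is.null(val) && is.null(attr(merged[[col]], a, exact = TRUE))) {
          attr(merged[[col]], a) <- val
        }
      }
    }
  }
  merged
}

# Toy data with made-up labels
a <- data.frame(id = 1:3, x = c(10, 20, 30))
attr(a$x, "label") <- "Variable x"
b <- data.frame(id = 2:4, y = c(1, 2, 3))
attr(b$y, "label") <- "Variable y"

merged <- merge(a, b, by = "id", all = TRUE)
merged <- restore_attrs(merged, list(a, b))
attr(merged$x, "label")
```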
Thanks for the great work Daniel! Today I encountered a minor issue. When I make a plot with many variables my labels disappear. Is there a way or argument to avoid this?
y <- sample(0:1, 200, replace = TRUE)
a <- as.factor(sample(1:4, 200, replace = TRUE))
b <- as.factor(sample(1:4, 200, replace = TRUE))
c <- sample(1:4, 200, replace = TRUE)
set_label(y) <- "var y"
set_label(a) <- "var a"
set_label(b) <- "var b"
set_label(c) <- "var c"
#OK: shows the labels of the variables
fit1 <- glm(y~a+c, family = binomial(link = "logit"))
sjp.glm(fit1)
#NOT OK: shows the variable names instead of the labels
fit2 <- glm(y~a+b, family = binomial(link = "logit"))
sjp.glm(fit2)
x<-factor(rep(c(1,2),10))
y<-set_labels(x, labels=c('A','B'))
after which
> y
[1] 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2
attr(,"labels")
A B
1 2
Levels: 1 2
For all the `labelled`-agnostic world, `y` is still a factor with levels `1` and `2`. Is it by design, or is it a bug?
`get_var_labels` does not seem to work with data in which some variable labels are empty:
raw.data <- sjmisc::read_spss("sample_data.sav")
# str(raw.data)[1:5]
sjmisc::get_var_labels(raw.data)
But it works well when every variable has a label:
raw.data2 <- sjmisc::read_spss("sample_data_full_labels.sav")
# str(raw.data2)[1:5]
sjmisc::get_var_labels(raw.data2)
I was trying the example code and got an error:
efc.sub <- subset(efc, subset = e16sex == 1, select = c(4:8))
efc.sub <- add_labels(efc.sub, efc)
--->
Error in order(as.numeric(all.labels)) :
(list) object cannot be coerced to type 'double'
Can you take a look?
I got this message when I use this function:
tscs2013r <- set_na(tscs2013r, c(92:99, "NA"))
Error in as.Date.numeric(value) : 'origin' must be supplied
It worked this summer but somehow this won't work this time when I came back to this project. Did I miss anything? Thank you.
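One thing worth checking, although it may not be the cause of the error: combining numbers with the string `"NA"` in `c()` coerces everything to character, so `set_na()` receives strings instead of numbers. The variable names below are only for illustration.

```r
# Mixing numbers with the string "NA" gives a character vector
codes <- c(92:99, "NA")
class(codes)

# Using the NA constant keeps the vector numeric (integer here)
codes_num <- c(92:99, NA)
class(codes_num)
```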
Hi,
trying to use `rev` for reversing value codings in `rec()` is not working as intended:
d <- read_stata('GLES_Vorwahlquerschnitt_ZA5700_v1-0-0.dta')
d$q3 %>% set_na(-99:-71) %>% frq()
# Politisches Interesse
val label frq raw.prc valid.prc cum.prc
1 1. sehr stark 123 6.15 6.16 6.16
2 2. stark 362 18.09 18.12 24.27
3 3. mittelmaessig 813 40.63 40.69 64.96
4 4. weniger stark 499 24.94 24.97 89.94
5 5. ueberhaupt nicht 201 10.04 10.06 100.00
6 NA 3 0.15 NA NA
The correct output should be a reverse ordering of these 5 categories with automatically adjusted values labels. Instead, this happens:
d$polint <- d$q3 %>% set_na(-99:-71) %>% rec('rev')
frq(d$polint)
NAs introduced by coercion
# Politisches Interesse
val label frq raw.prc valid.prc cum.prc
1 1 201 9.57 10.06 10.06
2 2 499 23.76 24.97 35.04
3 3 813 38.71 40.69 75.73
4 4 362 17.24 18.12 93.84
5 5 123 5.86 6.16 100.00
6 3 0.14 0.15 100.15
6 3 0.14 0.15 100.30
6 3 0.14 0.15 100.45
6 3 0.14 0.15 100.60
6 3 0.14 0.15 100.75
6 -99. keine Angabe 3 0.14 0.15 100.90
6 -98. weiss nicht 3 0.14 0.15 101.05
6 -97 3 0.14 0.15 101.20
6 -96 3 0.14 0.15 101.35
6 -95 3 0.14 0.15 101.50
6 -94 3 0.14 0.15 101.65
6 -93 3 0.14 0.15 101.80
6 -92 3 0.14 0.15 101.95
6 -91 3 0.14 0.15 102.10
6 -90 3 0.14 0.15 102.25
6 -89 3 0.14 0.15 102.40
6 -88 3 0.14 0.15 102.55
6 -87 3 0.14 0.15 102.70
6 -86 3 0.14 0.15 102.85
6 -85 3 0.14 0.15 103.00
6 -84 3 0.14 0.15 103.15
6 -83 3 0.14 0.15 103.30
6 -82 3 0.14 0.15 103.45
6 -81 3 0.14 0.15 103.60
6 -80 3 0.14 0.15 103.75
6 -79 3 0.14 0.15 103.90
6 -78 3 0.14 0.15 104.05
6 -77 3 0.14 0.15 104.20
6 -76 3 0.14 0.15 104.35
6 -75 3 0.14 0.15 104.50
6 -74 3 0.14 0.15 104.65
6 -73 3 0.14 0.15 104.80
6 -72 3 0.14 0.15 104.95
6 -71 3 0.14 NA NA
While reversing the initial values works, the value labels are not attached, and the output shows strange behavior related to the range of values that was set to `NA`. Values like `-85`, however, never existed in the first place.
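The expected result can be sketched in base R. `reverse_labelled()` is a hypothetical helper, not the code behind `rec(..., "rev")`; it assumes that value labels live in a `"labels"` attribute, mirrors each value within the observed range, and keeps only labels that fall inside that range, so labels of values already set to `NA` are dropped.

```r
# Hypothetical helper: reverse the coding and move the labels along
reverse_labelled <- function(x) {
  labs <- attr(x, "labels", exact = TRUE)
  rng <- range(x, na.rm = TRUE)
  # Mirror each value within the observed range: min <-> max
  out <- rng[1] + rng[2] - as.vector(x)
  if (!is.null(labs)) {
    # Drop labels outside the observed range (e.g. old missing codes)
    keep <- labs >= rng[1] & labs <= rng[2]
    attr(out, "labels") <- sort(rng[1] + rng[2] - labs[keep])
  }
  out
}

x <- c(1, 2, 5, NA, 3)
attr(x, "labels") <- c(low = 1, high = 5, missing = -99)

rev_x <- reverse_labelled(x)
as.vector(rev_x)
attr(rev_x, "labels")
```

Each label stays with its category: the category labelled `high` was coded 5 and is coded 1 after reversing.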
Hi,
I discovered a bug when using `set_labels()` which mixes up values and labels:
d <- read_stata('GLES_Vorwahlquerschnitt_ZA5700_v1-0-0.dta')
vars <- c(93,104:108)
d[vars] <- d[vars]%>% set_na(-99:-71)
d$merkel_eigen <- rowMeans(d[104:108], na.rm=T)
frq(d$merkel_eigen)
val frq label raw.prc valid.prc cum.prc
1 18 1 0.90 0.91 0.91
1.2 4 1.2 0.20 0.20 1.11
1.33333333333333 1 1.33333333333333 0.05 0.05 1.16
1.4 14 1.4 0.70 0.70 1.86
1.5 3 1.5 0.15 0.15 2.01
1.6 8 1.6 0.40 0.40 2.41
1.66666666666667 2 1.66666666666667 0.10 0.10 2.52
1.75 3 1.75 0.15 0.15 2.67
1.8 22 1.8 1.10 1.11 3.77
2 35 2 1.75 1.76 5.53
2.2 34 2.2 1.70 1.71 7.24
2.25 4 2.25 0.20 0.20 7.44
2.33333333333333 4 2.33333333333333 0.20 0.20 7.65
2.4 52 2.4 2.60 2.62 10.26
2.5 10 2.5 0.50 0.50 10.76
2.6 84 2.6 4.20 4.23 14.99
2.66666666666667 4 2.66666666666667 0.20 0.20 15.19
2.75 6 2.75 0.30 0.30 15.49
2.8 79 2.8 3.95 3.97 19.47
3 164 3 8.20 8.25 27.72
3.2 127 3.2 6.35 6.39 34.10
3.25 14 3.25 0.70 0.70 34.81
3.33333333333333 7 3.33333333333333 0.35 0.35 35.16
3.4 131 3.4 6.55 6.59 41.75
3.5 14 3.5 0.70 0.70 42.45
3.6 122 3.6 6.10 6.14 48.59
3.66666666666667 8 3.66666666666667 0.40 0.40 48.99
3.75 13 3.75 0.65 0.65 49.65
3.8 132 3.8 6.60 6.64 56.29
4 200 4 10.00 10.06 66.35
4.2 130 4.2 6.50 6.54 72.89
4.25 6 4.25 0.30 0.30 73.19
4.33333333333333 2 4.33333333333333 0.10 0.10 73.29
4.4 150 4.4 7.50 7.55 80.84
4.5 11 4.5 0.55 0.55 81.39
4.6 115 4.6 5.75 5.78 87.17
4.66666666666667 2 4.66666666666667 0.10 0.10 87.27
4.75 1 4.75 0.05 0.05 87.32
4.8 92 4.8 4.60 4.63 91.95
5 160 5 8.00 8.05 100.00
NaN 13 NaN 0.65 0.65 100.65
42 0 NA 0.00 NA NA
d$merkel_eigen2 <- set_labels(d$merkel_eigen, c('sehr negative Bewertung'=1, 'sehr positive Bewertung'=5))
frq(d$merkel_eigen2)
val label frq raw.prc valid.prc cum.prc
1.000000 sehr negative Bewertung 18 0.90 0.91 0.91
1.200000 1.2 0 0.00 0.00 0.91
1.333333 1.33333333333333 0 0.00 0.00 0.91
1.400000 1.4 0 0.00 0.00 0.91
1.500000 1.5 0 0.00 0.00 0.91
1.600000 1.6 0 0.00 0.00 0.91
1.666667 1.66666666666667 0 0.00 0.00 0.91
1.750000 1.75 0 0.00 0.00 0.91
1.800000 1.8 0 0.00 0.00 0.91
2.000000 2 4 0.20 0.20 1.11
2.200000 2.2 0 0.00 0.00 1.11
2.250000 2.25 0 0.00 0.00 1.11
2.333333 2.33333333333333 0 0.00 0.00 1.11
2.400000 2.4 0 0.00 0.00 1.11
2.500000 2.5 0 0.00 0.00 1.11
2.600000 2.6 0 0.00 0.00 1.11
2.666667 2.66666666666667 0 0.00 0.00 1.11
2.750000 2.75 0 0.00 0.00 1.11
2.800000 2.8 0 0.00 0.00 1.11
3.000000 3 1 0.05 0.05 1.16
3.200000 3.2 0 0.00 0.00 1.16
3.250000 3.25 0 0.00 0.00 1.16
3.333333 3.33333333333333 0 0.00 0.00 1.16
3.400000 3.4 0 0.00 0.00 1.16
3.500000 3.5 0 0.00 0.00 1.16
3.600000 3.6 0 0.00 0.00 1.16
3.666667 3.66666666666667 0 0.00 0.00 1.16
3.750000 3.75 0 0.00 0.00 1.16
3.800000 3.8 0 0.00 0.00 1.16
4.000000 4 14 0.70 0.70 1.86
4.200000 4.2 0 0.00 0.00 1.86
4.250000 4.25 0 0.00 0.00 1.86
4.333333 4.33333333333333 0 0.00 0.00 1.86
4.400000 4.4 0 0.00 0.00 1.86
4.500000 4.5 0 0.00 0.00 1.86
4.600000 4.6 0 0.00 0.00 1.86
4.666667 4.66666666666667 0 0.00 0.00 1.86
4.750000 4.75 0 0.00 0.00 1.86
4.800000 4.8 0 0.00 0.00 1.86
5.000000 sehr positive Bewertung 3 0.15 0.15 2.01
6.000000 NA 8 0.40 0.40 2.41
7.000000 NA 2 0.10 0.10 2.52
8.000000 NA 3 0.15 0.15 2.67
9.000000 NA 22 1.10 1.11 3.77
10.000000 NA 35 1.75 1.76 5.53
11.000000 NA 34 1.70 1.71 7.24
12.000000 NA 4 0.20 0.20 7.44
13.000000 NA 4 0.20 0.20 7.65
14.000000 NA 52 2.60 2.62 10.26
15.000000 NA 10 0.50 0.50 10.76
16.000000 NA 84 4.20 4.23 14.99
17.000000 NA 4 0.20 0.20 15.19
18.000000 NA 6 0.30 0.30 15.49
19.000000 NA 79 3.95 3.97 19.47
20.000000 NA 164 8.20 8.25 27.72
21.000000 NA 127 6.35 6.39 34.10
22.000000 NA 14 0.70 0.70 34.81
23.000000 NA 7 0.35 0.35 35.16
24.000000 NA 131 6.55 6.59 41.75
25.000000 NA 14 0.70 0.70 42.45
26.000000 NA 122 6.10 6.14 48.59
27.000000 NA 8 0.40 0.40 48.99
28.000000 NA 13 0.65 0.65 49.65
29.000000 NA 132 6.60 6.64 56.29
30.000000 NA 200 10.00 10.06 66.35
31.000000 NA 130 6.50 6.54 72.89
32.000000 NA 6 0.30 0.30 73.19
33.000000 NA 2 0.10 0.10 73.29
34.000000 NA 150 7.50 7.55 80.84
35.000000 NA 11 0.55 0.55 81.39
36.000000 NA 115 5.75 5.78 87.17
37.000000 NA 2 0.10 0.10 87.27
38.000000 NA 1 0.05 0.05 87.32
39.000000 NA 92 4.60 4.63 91.95
40.000000 NA 160 8.00 8.05 100.00
41.000000 NA 13 0.65 0.65 100.65
42.000000 NA 0 0.00 NA NA
Hi, I just found your wrapper `sjt.df`, which produces a nice summary of descriptive statistics. Is it possible to have this for one or several variables and without the HTML formatting? I'm basically looking for the equivalent of `frq` for the summary statistics.
Of course, one could do something like `sjt.df(efc[['e17age']])`, but this has no console output, and it would be convenient not to have to extract the variable(s) from a dataset first.
Hi!
thx so much for this wonderful package! I really enjoy using it.
One thing that I can't seem to find is how to add t-values to the table. Is it possible?
thx!
Z
Hello,
I was doing a few tests around the labelled package and I discovered a few things.
I created the following object:
notes <- c(12, 16, 99, 14, 17)
val_label(notes, 99) <- "absent le jour de l'examen"
missing_val(notes) <- 99
I was expecting to observe:
> mean(notes)
[1] 31.6
> mean(missing_to_na(notes), na.rm = TRUE)
[1] 14.75
Surprisingly, in some cases, I was obtaining:
> mean(notes)
[1] 14.75
I tried to find out why, and using `getS3method` it appears that the `mean` method defined in sjmisc was called by R (even though I hadn't attached sjmisc). I don't know why R picked up this method in some cases and not in others, but it raises at least a few comments about having a `mean` method defined for `labelled` vectors:

- The `mean` method will return different values depending on the packages currently loaded in memory, and this is completely invisible to most users;
- […] `na.rm = TRUE` and so it's not consistent with the original method;
- […] `labelled` class. Nevertheless, the advantage of the labelled class is that it preserves original values and that these vectors can be used with any statistical function in R, while the approach used by memisc with `data.set` objects works with a limited number of compatible functions;
- […] `NA` with `missing_to_na`.
- […] `missing_to_na` function. In that scenario, we ensure that the way to use the data is always the same, without having to remember which functions take the labels attributes into account.

Therefore, I'm not sure that it would be appropriate to provide a `mean` method for the `labelled` class. If we provide such a method, we should at least get the same results as without it, for consistency. However, an option could be to display a warning message to alert the user that self-defined missing values are present. For example, by providing a `missing_to_na` argument with a default value of `NULL`:

- `NULL`: results will be the same as with the default mean method, but a warning will be displayed in case some missing values are defined;
- `FALSE`: self-defined missing values are not converted and no warning is displayed;
- `TRUE`: self-defined missing values are converted to NA before computing the mean, with the user always having to specify the value of `na.rm`.
What do you think?
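The three proposed settings can be sketched as a plain function. `mean_with_missing()` is a hypothetical name, and storing the user-defined missing values in an `"na_values"` attribute is an assumption made for this example; it is not how either package implements its `mean` method.

```r
# Hypothetical sketch of the proposed missing_to_na argument
mean_with_missing <- function(x, missing_to_na = NULL, na.rm = FALSE) {
  user_na <- attr(x, "na_values", exact = TRUE)
  values <- as.vector(x)
  if (is.null(missing_to_na)) {
    # Default: same result as the plain mean, but warn the user
    if (length(user_na) > 0) {
      warning("x has user-defined missing values that were not converted to NA")
    }
  } else if (isTRUE(missing_to_na)) {
    # Convert user-defined missing values to NA first
    values[values %in% user_na] <- NA
  }
  mean(values, na.rm = na.rm)
}

notes <- c(12, 16, 99, 14, 17)
attr(notes, "na_values") <- 99

m_default   <- suppressWarnings(mean_with_missing(notes))
m_raw       <- mean_with_missing(notes, missing_to_na = FALSE)
m_converted <- mean_with_missing(notes, missing_to_na = TRUE, na.rm = TRUE)
c(m_default, m_raw, m_converted)
```

With the values from the example above, the default and `FALSE` settings give 31.6, and `TRUE` with `na.rm = TRUE` gives 14.75.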
load(tscs2013)
str(tscs2013$v66)
 Factor w/ 8 levels "1","2","3","4",..: 3 2 1 4 2 3 1 3 2 2 ...
 - attr(*, "value.labels")= Named chr [1:10] "99" "98" "97" "96" ...
  ..- attr(*, "names")= chr [1:10] "遺漏值" "拒答" "不知道" "跳答" ...
 - attr(*, "label")= Named chr "66 如果台灣宣布獨立,請問您認為兩岸會不會發生戰爭? "
  ..- attr(*, "names")= chr "v66"
tscs2013 <- set_na(tscs2013, 94:98)
str(tscs2013$v66)
 Factor w/ 8 levels "1","2","3","4",..: 3 2 1 4 2 3 1 3 2 2 ...
 - attr(*, "value.labels")= Named chr [1:5] "99" "4" "3" "2" ...
  ..- attr(*, "names")= chr [1:5] "遺漏值" "一定不會" "可能不會" "可能會" ...
 - attr(*, "label")= Named chr "66 如果台灣宣布獨立,請問您認為兩岸會不會發生戰爭? "
  ..- attr(*, "names")= chr "v66"
table(tscs2013$v66)
  1   2   3   4  94  95  97  98
425 736 464 179   0   0   0   0
I was expecting the number of levels of variables like v66 to go down to 4 and the values 94 to 98 to become NA. Did I miss anything? Thank you.
ps.
The data used for this practice is downloadable from here: http://cl.ly/0I332p1i2Q3H/tscs2013.rda
Hi,
for some strange reason that I cannot understand, `descr()` does not work for me and always throws the following error:
data(efc)
descr(efc, e17age, c160age)
Error in mutate_impl(.data, dots) : basic_string::_M_replace_aux
13. stop(structure(list(message = "basic_string::_M_replace_aux", call = mutate_impl(.data, dots), cppstack = NULL), .Names = c("message", "call", "cppstack"), class = c("std::length_error", "C++Error", "error", "condition")))
12. mutate_impl(.data, dots)
11. mutate_.tbl_df(.data, .dots = lazyeval::lazy_dots(...))
10. mutate_(.data, .dots = lazyeval::lazy_dots(...))
9. dplyr::mutate(., label = unname(get_label(.data, def.value = colnames(.data))))
8. function_list[[i]](value)
7. freduce(value, `_function_list`)
6. `_fseq`(`_lhs`)
5. eval(expr, envir, enclos)
4. eval(quote(`_fseq`(`_lhs`)), env, env)
3. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
2. tibble::as_tibble(psych::describe(.data)) %>% tibble::rownames_to_column(var = "variable") %>% dplyr::select_("-vars", "-mad") %>% dplyr::mutate(label = unname(get_label(.data, def.value = colnames(.data)))) %>% var_rename(median = "md")
1. descr(d$skalocdu)
Update: the same error also sometimes occurs when using `frq()`:
d <- read_stata('GLES_Vorwahlquerschnitt_ZA5700_v1-0-0.dta')
d$skalocdu <- set_na(d$q21a, -99:-71)
d$skalorrg <- set_na(d$q79f, -99:-71)
d %>% select(skalorrg, skalocdu) %>% frq()
I recently updated to v2.2.1 of sjmisc and noticed a difference in behavior for get_labels() from v2.0.1. Previously the values and value labels from get_values() and get_labels() were aligned, but in some of the data I work with this is no longer the case.
library(sjmisc)
online_path <- tempfile()
download.file("https://www.dropbox.com/s/3b4xiaurawztvfj/get_labels_discrepancy.zip?dl=1", online_path, mode = "wb")
rdata_path <- unzip(online_path)
(load(rdata_path))
attr(mydf[[1]], "labels")
get_labels(mydf[[1]])
get_values(mydf[[1]])
In 2.0.1 these commands would return:
# Behavior in 2.0.1
> attr(mydf[[1]], "labels")
39 40 44 46 37 61 42 45 63 35 43 49 52 51 38 32 48 33 41 36 34 50 47 55 53 64 56 29 66 31 54 59 30 27 58 57 60 25 26 24 28 62 23 19 65
39 40 44 46 37 61 42 45 63 35 43 49 52 51 38 32 48 33 41 36 34 50 47 55 53 64 56 29 66 31 54 59 30 27 58 57 60 25 26 24 28 62 23 19 65
22
22
> get_labels(mydf[[1]])
[1] "39" "40" "44" "46" "37" "61" "42" "45" "63" "35" "43" "49" "52" "51" "38" "32" "48" "33" "41" "36" "34" "50" "47" "55" "53" "64"
[27] "56" "29" "66" "31" "54" "59" "30" "27" "58" "57" "60" "25" "26" "24" "28" "62" "23" "19" "65" "22"
> get_values(mydf[[1]])
[1] 39 40 44 46 37 61 42 45 63 35 43 49 52 51 38 32 48 33 41 36 34 50 47 55 53 64 56 29 66 31 54 59 30 27 58 57 60 25 26 24 28 62 23 19
[45] 65 22
In 2.0.1 the values and value labels are aligned. But with v 2.2.1, these commands return:
# Behavior in 2.2.1
> attr(mydf[[1]], "labels")
39 40 44 46 37 61 42 45 63 35 43 49 52 51 38 32 48 33 41 36 34 50 47 55 53 64 56 29 66 31 54 59 30 27 58 57 60 25 26 24 28 62 23 19 65
39 40 44 46 37 61 42 45 63 35 43 49 52 51 38 32 48 33 41 36 34 50 47 55 53 64 56 29 66 31 54 59 30 27 58 57 60 25 26 24 28 62 23 19 65
22
22
> get_labels(mydf[[1]])
[1] "19" "22" "23" "24" "25" "26" "27" "28" "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45" "46"
[27] "47" "48" "49" "50" "51" "52" "53" "54" "55" "56" "57" "58" "59" "60" "61" "62" "63" "64" "65" "66"
> get_values(mydf[[1]])
[1] 39 40 44 46 37 61 42 45 63 35 43 49 52 51 38 32 48 33 41 36 34 50 47 55 53 64 56 29 66 31 54 59 30 27 58 57 60 25 26 24 28 62 23
[44] 19 65 22
The values and value labels are correct and aligned in the attributes, but the ordering returned by get_labels() in 2.2.1 differs from its previous behavior. This can be seen in all variables in the provided data set. This change in behavior is quite problematic for many of the processes I use. Is this an intended change or a bug?
lapply(mydf, function(x) data.frame(values = get_values(x), labels = get_labels(x), stringsAsFactors = F))
# Output not shown for brevity
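Since the `"labels"` attribute itself keeps values and labels paired, an aligned table can be built from it directly, independent of the ordering that `get_labels()` returns. This is a base R sketch with made-up data, not a statement about which ordering sjmisc intends.

```r
# Toy vector with made-up labels, stored in unsorted order
x <- c(39, 40, 19, 22)
attr(x, "labels") <- c("thirty-nine" = 39, "forty" = 40,
                       "nineteen" = 19, "twenty-two" = 22)

# Values and their names come from the same attribute,
# so each row is guaranteed to belong together
labs <- attr(x, "labels", exact = TRUE)
aligned <- data.frame(
  values = unname(labs),
  labels = names(labs),
  stringsAsFactors = FALSE
)
aligned
```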
Hi,
when using the new square-bracket option to assign value labels in `rec()`, the labels are assigned to the wrong values:
frq(d$bl)
# Bundesland
val label frq raw.prc valid.prc cum.prc
1 1. Baden-Wuerttemberg 178 9.04 9.04 9.04
2 2. Bayern 281 14.28 14.28 23.32
3 3. Berlin 98 4.98 4.98 28.30
4 4. Brandenburg 108 5.49 5.49 33.79
5 5. Bremen 11 0.56 0.56 34.35
6 6. Hamburg 11 0.56 0.56 34.91
7 7. Hessen 112 5.69 5.69 40.60
8 8. Mecklenburg-Vorpommern 81 4.12 4.12 44.72
9 9. Niedersachsen 182 9.25 9.25 53.96
10 10. Nordrhein-Westfalen 299 15.19 15.19 69.16
11 11. Rheinland-Pfalz 73 3.71 3.71 72.87
12 12. Saarland 16 0.81 0.81 73.68
13 13. Sachsen 226 11.48 11.48 85.16
14 14. Sachsen-Anhalt 109 5.54 5.54 90.70
15 15. Schleswig-Holstein 58 2.95 2.95 93.65
16 16. Thueringen 125 6.35 6.35 100.00
17 NA 0 0.00 NA NA
d$west <- rec(d$bl, '1,2,5:7,9,10:12,15 = 1 [West]; 3,4,8,13,14,16 = 0 [Ost]',
var.label='Dummy: West')
d$west2 <- rec(d$bl, '1,2,5:7,9,10:12,15 = 1; 3,4,8,13,14,16 = 0')
frq(d$west)
# Dummy: West
val label frq raw.prc valid.prc cum.prc
0 West 1221 62.04 62.04 62.04
1 Ost 747 37.96 37.96 100.00
2 NA 0 0.00 NA NA
frq(d$west2)
# Bundesland
val frq label raw.prc valid.prc cum.prc
0 747 0 37.96 37.96 37.96
1 1221 1 62.04 62.04 100.00
2 0 NA 0.00 NA NA
The dataset used here is again "GLES_Vorwahlquerschnitt_ZA5700_v1-0-0.dta".
Sorry to bother you again with potential installation problems and version conflicts, but `descr()` still does not work for me after reinstalling the CRAN version of `dplyr`. Here is my `sessionInfo()`:
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] sjstats_0.7.0 sjPlot_2.1.2 sjmisc_2.1.0 dplyr_0.5.0 purrr_0.2.2 readr_1.0.0 tidyr_0.6.0 tibble_1.2
[9] ggplot2_2.2.0 tidyverse_1.0.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.8 stringdist_0.9.4.2 mvtnorm_1.0-5 lattice_0.20-33 zoo_1.7-13 lmtest_0.9-34 assertthat_0.1
[8] digest_0.6.10 psych_1.6.9 mime_0.5 R6_2.2.0 plyr_1.8.4 stats4_3.3.0 coda_0.18-1
[15] lazyeval_0.2.0 multcomp_1.4-6 minqa_1.2.4 nloptr_1.0.4 Matrix_1.2-6 DT_0.2 splines_3.3.0
[22] lme4_1.1-12 stringr_1.1.0 foreign_0.8-66 htmlwidgets_0.8 munsell_0.4.3 shiny_0.14.2 broom_0.4.1
[29] httpuv_1.3.3 modelr_0.1.0 mnormt_1.5-5 htmltools_0.3.5 nnet_7.3-12 coin_1.1-3 codetools_0.2-14
[36] MASS_7.3-45 grid_3.3.0 nlme_3.1-127 arm_1.9-3 xtable_1.8-2 gtable_0.2.0 DBI_0.5-1
[43] magrittr_1.5 scales_0.4.1 stringi_1.1.2 reshape2_1.4.2 effects_3.1-2 sandwich_2.3-4 blme_1.0-4
[50] TH.data_1.0-7 tools_3.3.0 abind_1.4-5 parallel_3.3.0 survival_2.40-1 colorspace_1.3-1 knitr_1.15.1
[57] haven_1.0.0 merTools_0.3.0 modeltools_0.2-21
And here is the example with traceback:
d <- read_stata('GLES_Vorwahlquerschnitt_ZA5700_v1-0-0.dta')
set.seed(339487731)
randgroup <- runif(nrow(d), min=0, max=1)
psych::describe(randgroup)
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 2001 0.5 0.29 0.51 0.5 0.38 0 1 1 -0.01 -1.25 0.01
descr(randgroup)
Error in mutate_impl(.data, dots) : Unsupported type NILSXP for column "label"
13. stop(structure(list(message = "Unsupported type NILSXP for column \"label\"", call = mutate_impl(.data, dots), cppstack = structure(list( file = "", line = -1L, stack = "C++ stack not available on this system"), .Names = c("file", "line", "stack"), class = "Rcpp_stack_trace")), .Names = c("message", ...
12. mutate_impl(.data, dots)
11. mutate_.tbl_df(.data, .dots = lazyeval::lazy_dots(...))
10. mutate_(.data, .dots = lazyeval::lazy_dots(...))
9. dplyr::mutate(., label = unname(get_label(dd, def.value = colnames(dd))))
8. function_list[[i]](value)
7. freduce(value, `_function_list`)
6. `_fseq`(`_lhs`)
5. eval(expr, envir, enclos)
4. eval(quote(`_fseq`(`_lhs`)), env, env)
3. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
2. tibble::as_tibble(psych::describe(dd)) %>% tibble::rownames_to_column(var = "variable") %>% dplyr::select_("-vars", "-mad") %>% dplyr::mutate(label = unname(get_label(dd, def.value = colnames(dd)))) %>% var_rename(median = "md")
1. descr(randgroup)
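The message points at a `NULL` value for the `label` column (`NILSXP` is R's internal type for `NULL`). A plain vector like `randgroup` has neither a label nor column names, so the fallback `colnames(dd)` in frame 9 is `NULL` as well. A NULL-safe lookup could be sketched as follows; `label_or_default()` is a hypothetical helper, and reading the `"label"` attribute is an assumption about where labels are stored.

```r
# Hypothetical helper: return the variable label, or a default if none is set
label_or_default <- function(x, default = NA_character_) {
  lab <- attr(x, "label", exact = TRUE)
  if (is.null(lab)) default else as.character(lab)
}

set.seed(1)
randgroup <- runif(10)

colnames(randgroup)  # NULL for a plain vector, so it cannot serve as fallback
label_or_default(randgroup, default = "randgroup")
```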
Hi, it would be pretty nice to have a simple function `xtab` that, like `frq()` in relation to `sjt.frq()`, generates a simple non-HTML contingency table including value labels.
Related to this, it would also be useful to either have a parameter for this function and `frq()` which would also print the variable label above the table, or to just print the variable label by default.
An issue concerning the usage of `copy_labels`:
When subsetting data using `base::subset()`, haven-style variable and value labels are lost.
To restore labels, sjmisc offers the `copy_labels` function.
However, this function may fail to reproduce the correct labels in the following scenario: (a) the subset data frame is a subset of variables (columns) AND cases (rows); and (b) consequently, some of the original variable values are no longer contained in the subset.
In this case, `copy_labels` uses only the first value labels and reports:
More labels than values of "var1". Using first 3 labels.
This can lead to wrong assignments of value labels, however. In my specific case, value labels of missing values from the original dataframe were erroneously assigned to non-missing values in the subset dataframe.
Is there any way to make sure that the correct original value labels are reassigned?
I think for such instances, `copy_labels` could benefit from an argument similar to `force.labels = TRUE` in `set_labels`?
Note: This issue only appears to occur when requesting haven-style labels, not foreign-style labels, in sjmisc's `read_spss` command. Might this have to do with the way in which missing values are treated by default in haven vs. foreign?
Thanks!
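One way to avoid wrong assignments is to match labels by value instead of by position. The sketch below is base R only; `copy_value_labels()` is a hypothetical helper, not the `copy_labels` implementation, and it assumes that value labels are stored in a `"labels"` attribute.

```r
# Hypothetical helper: copy value labels and, optionally, keep only the
# labels whose values still occur in the subset
copy_value_labels <- function(subset_col, original_col, drop_unused = TRUE) {
  labs <- attr(original_col, "labels", exact = TRUE)
  if (is.null(labs)) return(subset_col)
  if (drop_unused) {
    # Match by value, so a label can never move to a different value
    labs <- labs[labs %in% unique(subset_col)]
  }
  attr(subset_col, "labels") <- labs
  subset_col
}

# Toy data with a made-up missing-value code
original <- c(1, 2, -99, 1, 2)
attr(original, "labels") <- c(missing = -99, low = 1, high = 2)

sub_col <- original[original > 0]  # subsetting drops the attribute
restored <- copy_value_labels(sub_col, original)
attr(restored, "labels")
```

With `drop_unused = FALSE` all original labels are kept, including those of values that no longer occur.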
Hi,
I'd like to suggest combining the most important stuff from the blog posts (1, 2, 3) in a vignette. This could include basic examples of analysis with labelled data and it would be very useful to have all this captured in one document. Especially for teaching it would be nice, where online access in labs is not always guaranteed and students easily get overwhelmed.
As a more ambitious project you could think about writing an R equivalent to data analysis with stata ;-)
Hi,
recoding is solved quite conveniently in Stata, where the labels are directly included in the recoding procedure.
recode age (18/23=1 "18to23") (24/65=2 "24to65") (66/99=3 ">65"), gen (agekat)
Implementing this in rec()
could work with regular expressions, for example if we use a special character like `$` or `§`, or `[ ]` within the recode string:
rec(efc$e17age, "18:23=1 [18to23]; 24:65=2 [24to65]; 66:max=3 [> 65]",
    var.label="Age kat.")
It might even be possible to implement this without breaking the old behavior, right? A parameter like `val.labels="auto"` could scan for the special character(s) and apply the automated procedure if they are detected.
Hi,
I guess it's a matter of design philosophy as well, but I would like to suggest calling drop_labels() by default in functions like sjt.tab(), and/or letting set_na() optionally drop unused labels as well.
Often variables are just copied while encoding missing values like so:
d <- read_stata('GLES_Vorwahlquerschnitt_ZA5700_v1-0-0.dta')
vars <- select(d, q41, q119a, q192, q11ba) %>% set_na(-99:-83)
sjt.xtab(vars$q41, vars$q11ba) # gets messed up without drop_labels()
It is only a small detail, but I dislike having to call drop_labels() again in such cases just to drop unused NA labels. Without drop_labels(), the output of sjt.xtab() etc. gets ugly because of the zero-count categories for unused labels.
I guess this all comes down to one question for the whole package: Do you ever need unused labels again for most analyses? I personally don't, but maybe this is different for other people.
I recently raised the above question on SO.
I collaborated with the xtable maintainer to work out a solution, which is posted as an answer on SO.
Would it not be better to include this functionality in the sjmisc package?
I'm struggling to plot my fitted glmmPQL model. I'm trying to use sjPlot, but only a few functions work on this kind of object. It would be great if sjPlot could handle glmmPQL objects.
The functionality to convert a dataframe into a ragged representation (à la ftable) might be something for your sjmisc package.
Hi,
I'm not sure whether this is a big problem, but installing sjmisc from GitHub raises the following warning at the end:
** preparing package for lazy loading
Error : object ‘zap_inf’ is not exported by 'namespace:sjmisc'
ERROR: lazy loading failed for package ‘sjPlot’
* removing ‘/home/cs/R/x86_64-pc-linux-gnu-library/3.3/sjPlot’
Error: Command failed (1)
Hi
In other projects, I need to deal with labelled variables. I'm therefore interested in some functions of sjmisc, but only those related to manipulating labelled variables.
It could be useful to have a package that deals only with the manipulation of such variables. This package could later be used by several projects.
Would you be interested in such a package?
regards
Every year since 2005 I've gone searching for a frequency function that works more like SAS or SPSS. I finally found it in frq. Thanks for writing it! It looks like there is a bug either in frq or in haven::read_spss. See the example below, where frq "freaks out" (heh, heh) at gender. If this is a haven problem, please let me know and I'll post it over there.
I'm also a bit perplexed that the help file for frq uses the standard haven approach to the labels, while the help examples for sjmisc::read_spss use a different structure (atomic vs. labelled). Are you in the middle of converting from one to the other?
> (mydata <- haven::read_spss("mydata.sav"))
# A tibble: 8 × 7
id workshop gender q1 q2 q3 q4
<dbl> <dbl+lbl> <chr+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl>
1 1 1 f 1 1 5 1
2 2 2 f 2 1 4 1
3 3 1 f 2 2 4 3
4 4 2 3 1 NaN 3
5 5 1 m 3 5 2 4
6 6 2 m 5 4 5 5
7 7 1 m 5 3 4 4
8 8 2 m 4 5 5 5
> mydata %>%
+ select(gender) %>%
+ frq()
Error in order(names(labels), lvls) : argument lengths differ
> mydata %>%
+ select(workshop,q1:q4) %>%
+ frq()
$workshop
val label frq raw.prc valid.prc cum.prc
1 R 4 50 50 50
2 SAS 4 50 50 100
3 NA 0 0 NA NA
$q1
# The instructor was well prepared.
val label frq raw.prc valid.prc cum.prc
1 Strongly Disagree 1 12.5 12.5 12.5
2 Disagree 2 25.0 25.0 37.5
3 Neutral 2 25.0 25.0 62.5
4 Agree 1 12.5 12.5 75.0
5 Strongly Agree 2 25.0 25.0 100.0
6 NA 0 0.0 NA NA
$q2
# The instructor communicated well.
val label frq raw.prc valid.prc cum.prc
1 Strongly Disagree 3 37.5 37.5 37.5
2 Disagree 1 12.5 12.5 50.0
3 Neutral 1 12.5 12.5 62.5
4 Agree 1 12.5 12.5 75.0
5 Strongly Agree 2 25.0 25.0 100.0
6 NA 0 0.0 NA NA
$q3
# The course material was helpful.
val label frq raw.prc valid.prc cum.prc
1 Strongly Disagree 0 0.0 0.00 0.00
2 Disagree 1 12.5 14.29 14.29
3 Neutral 0 0.0 0.00 14.29
4 Agree 3 37.5 42.86 57.14
5 Strongly Agree 3 37.5 42.86 100.00
6 NA 1 12.5 NA NA
$q4
# Overall, I found this workshop useful.
val label frq raw.prc valid.prc cum.prc
1 Strongly Disagree 2 25 25 25
2 Disagree 0 0 0 25
3 Neutral 2 25 25 50
4 Agree 2 25 25 75
5 Strongly Agree 2 25 25 100
6 NA 0 0 NA NA
Hi,
I'd like to suggest a very subtle enhancement of find_var(): adding a boolean parameter, like as.lab=T, which makes the function return variable labels for all matches. In almost all cases where I use find_var for large datasets, I want to get the variable labels directly and use this syntax:
find_var(data, 'string', as.df=T) %>% get_label()
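To make the requested behavior concrete, here is a small base R sketch of what such an option might return. It does not use or reproduce find_var(); `find_var_labels` is a hypothetical function, and it assumes variable labels are stored in each column's "label" attribute, as haven does.

```r
# Hypothetical function (not part of sjmisc): return the variable labels of
# all columns whose name or variable label matches a pattern.
# Assumption: variable labels live in attr(column, "label").
find_var_labels <- function(data, pattern) {
  var_labels <- vapply(data, function(col) {
    lab <- attr(col, "label", exact = TRUE)
    if (is.null(lab)) NA_character_ else as.character(lab)
  }, character(1))

  # grepl() returns FALSE for NA, so unlabelled columns only match by name
  hits <- grepl(pattern, names(data), ignore.case = TRUE) |
    grepl(pattern, var_labels, ignore.case = TRUE)

  var_labels[hits]
}

# Made-up example data with two labelled columns and one unlabelled column
d <- data.frame(q1 = 1:3, age = c(20, 30, 40), q2 = c("a", "b", "c"))
attr(d$q1, "label") <- "Age of respondent's partner"
attr(d$age, "label") <- "Age of respondent"

res <- find_var_labels(d, "age")
res
```

The result is a named character vector, so the variable names stay attached to their labels in a single call.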