statdivlab / rigr Goto Github PK

View Code? Open in Web Editor NEW

10.0 10.0 3.0 5.46 MB

Regression, Inference, and General Data Analysis Tools for R

License: Other

R 99.29% TeX 0.71%

rigr's People

Contributors

Stargazers

Watchers

rigr's Issues

`ttest()` and `ttesti()` should return objects, in addition to printing output

I think that these functions should return an object (a list, say) with all relevant output, in addition to printing. This matches e.g. wilcoxon(), and I think is far more user-friendly than only printing.

data argument in bplot doesn't work the way it should

When running:

mri <- read.table("http://www.emersonstatistics.com/datasets/mri.txt", header=TRUE);
bplot(data = mri, y=atrophy, x=male, xlab="Gender", ylab="Atrophy")

you get an error that "male" is not found because the dataset has not been attached. the data argument should work in a way that doesn't require users to attach the dataset to their environment beforehand

Wilcoxon bug

Wilcoxon bug.docx

`ttest` does not throw error for matched test with vectors of unequal length

Waiting to fix this until unit tests are written

when replaceZeroes = TRUE in `regress`, it does not actually replace zeroes with one-half the lowest nonzero value

The function replaces missing values with TRUE (aka 1), at least for cases when fnctl = "geometric mean" is specified

Confirm `ttest` handles NA properly in all use cases

Make styling consistent

Currently, function names and objects follow inconsistent styling. Ideally, we would follow a style guide (e.g., Hadley Wickham's or Google's) so that it is easier to read and debug the code.

I propose following Hadley's for the most part, and using underscores (_) in both function and object names.

Also, ideally the main user-facing functions will have their own files (e.g., descrip.R) and internal helper functions will live in a separate file (e.g., descrip_utils.R or possibly a separate file for each internal function).

Remove `version` parameter from functions, add version and/or package creation date when package is loaded

It will be far more maintainable to have the version listed in one spot (for example, when the package is loaded) as opposed to in individual functions. version parameter has already been removed from regress, descrip, and both of their utility functions, but the parameter likely still exists elsewhere.

At the suggestion of @bdwilliamson, we could consider is adding a print message when the package gets loaded that says something about the version and/or the package creation date.

This issue builds off of #33

online documentation

It would be wonderful to have online documentation. I haven't looked into this for many years but I recall this being pretty easy with github.io and pkgdown.

force users to write out the full fnctl argument in `regress`

We no longer want to support shorted fnctl strings in regress

Discussed in #53

^{Originally posted by taylorokonek August 9, 2021}
Currently, users can enter a shortened character string for the fnctl argument in regress, and so long as it corresponds to a unique substring, the function will run. As an example, a user could simply set fnctl = "g", and the function would take this to mean fnctl = "geometric mean". Personally, I don't think there's a huge harm in forcing users to spell out the fnctl argument. Does anyone have any insight as to why this particular functionality was introduced (@bdwilliamson)?

remove gee and hazard functionality from `regress`

interaction terms in regress

notice in the output of the following, the interaction term is not clearly labelled

also the overall F-statistic from regress does not agree with the F-statistic from lm

regress("mean",fev~ht*as.integer(sex))

devtools::check() fails

As a result of some of the change is associated with modernizing the package, devtools::check() fails.

`print.uRegress` throws an error for models with categorical variables with >2 levels

Example:

library(rigr); data(mri); regress("mean", atrophy ~ race, data=mri)

Error message:

Error in dimnames(x$model) <- *vtmp* :
length of 'dimnames' [1] not equal to array extent
In addition: Warning message:
In cbind(nms, levels) :
number of rows of result is not a multiple of vector length (arg 2)

`equal` function could be cleaner

In termTraverse.R (soon to be in regress_utils.R), there is a function defined called equal, and the value returned calls the function itself (not sure if there are any implications to this?). The function seems to be doing the same thing as checking if length(unique(x)) == 1, other than that they would handle vectors that include NAs differently. This is mostly a reminder to myself to edit this helper function to be a bit cleaner in the future.

equal <- function(x){
if(length(x)==1){
return(TRUE)
} else {
if(x[1]!=x[2]){
return(FALSE)
} else {
return(equal(x[-1]))
}
}
}

bplot axis label issue

the followiong produces a plot with double x axis labels

bplot(fev,as.integer(agecat),xlab="Age Category",ylab="FEV (liters)")

Does `ifelse1` need to exist?

I think this function has identical behavior and interface as ifelse - I'll look into it but it doesn't seem like we need this.

`descrip()` returns `firstEvent` column even for non time-to-event variables

The following code produces a value for firstEvent but has lastEvent as NA:

a <- rnorm(10,0,1) descrip(a)

I would imagine we'd want firstEvent to be NA as well in this case

Incorrect label in ttest

I downloaded the rigr package on 7/23 and was using ttest() and got this result. Notice the incorrect label above the output that says "One-sample t-test"

with(hiv,ttest(cd4[art==0],cd4[art==1]))

Call:
ttest(var1 = cd4[art == 0], var2 = cd4[art == 1])

Two-sample t-test allowing for unequal variances :

One-sample t-test :

Summary:
Group Obs Missing Mean Std. Err. Std. Dev. 95% CI
cd4[art == 0] 78 14 388.5 32.5 260 [323.5, 454]
cd4[art == 1] 50 19 298.5 32.9 183 [231.3, 366]
Difference 128 33 90.1 46.3 [-2.02, 182]

Ho: difference in means = 0 ;
Ha: difference in means != 0
t = 1.946 , df = 80.7
Pr(|T| > t) = 0.0551195

salary dataset as dataset

Currently both some vignettes and some examples run the following to get the salary data

salary <- read.table("http://www.emersonstatistics.com/datasets/salary.txt", header = TRUE, stringsAsFactors = FALSE)

This should instead be a documented dataset released with the package.

This impact the vignette regress_intro amongst others.

CRAN

It would be great to be able to install this package from CRAN -- this would enormously help many of our students with using it for the first time. Target date: Sept 13.

GEE for salary data in regress_intro vignette

In the regress_intro vignette, the line regress("mean", salary ~ female*year, id = id, data = salary) no longer works. Not sure why.

why does `regress` no longer support proportional hazards regression, and what do we want moving forward?

The documentation for regress seems to suggest you can do proportional hazards regression using fnctl = "hazard", but the code will actually throw an error if you specify fnctl = "hazard", stating "proportional hazards regression no longer supported". That being said, the remainder of the code seems to still have if/else cases and other sections in case fnctl = "hazard", so is this error something that was thrown in recently? @bdwilliamson do you have any insight?

Another question is whether we want regress to be able to handle proportional hazards regression moving forward.

factors in bplot

bplot should be able to handle factors

bplot(fev,agecat,xlab="Age Category",ylab="FEV (liters)") # fails if agecat is a factor

`wilcoxon()` provides two (different) p-values when `exact = TRUE`

As currently written, wilcoxon() is (as far as I can tell) largely a copy-pasted version of wilcox.test() from the stats package. That copied code provides a p-value (variable PVAL in the code), either using an exact test, or Normal approximation.

One of the additions in wilcoxon() is the inf return object, which is described as "a formatted table of inference values, for printing." This object provides the PVAL mentioned above, but also includes the Z-score and corresponding Normal approximation p-value. This p-value will correspond to PVAL when exact = FALSE. However, when exact = TRUE is used, PVAL is an exact p-value, but the Normal approximation p-value is still included in inf.

Long story short: when exact = TRUE, wilcoxon() provides both the exact and approximate p-value, which I think would be confusing to a user. To see this run:

set.seed(2)
y <- rnorm(100)
wilcoxon(y, exact = TRUE)

The inf return object has p-value 0.70012 (exact) and 0.69762 (Normal approximation).

My feeling is that we shouldn't report two different p-values. Thoughts?

intercepts in linear models

If I recall correctly, you can use the argument intercept = FALSE in regress(). Is it possible to add the -1 functionality?

Make TRUE and FALSE consistent

Currently, T and F are sometimes used as shorthand for TRUE and FALSE. While this is possible, it isn't good practice -- TRUE and FALSE are protected, while T and F are valid object names and thus can be overwritten by the user.

We should use the full words!

changed behavior for descrp

The newest version of descrip prints, by default, the columns Max Restriction, FirstEvent, LastEvent, isDate. Those seem confusing for the average user, especially if none of the measures is a time-to-event. Can we have descrip not print those columns by default?

Also, at the bottom of the descrip output I see
attr(,"class")
[1] "uDescriptives"
which will be confusing to the naive user.

Finally, all statistics are printed in scientific notation which is different from before and harder to read.

Having said all this, I just figured out that I think the issue is simply that the print.uDescriptives function has not been included in the rigr package?

Streamline `ttest.R` code

The code for this function is much more verbose than it needs to be, including many statements like if (condition){}; else{do stuff}. For ease of future maintenance this should be cleaned up.

regress doesn't accept factors

Other regression commands in R accept factor covariates but
regress("mean", fev ~ ht+sex,data=fevdata) # fails if sex is a factor

update documentation for `regress` helper functions in regress_utils.R

at the very least, ensure that all parameters and return values are described well

`ttest` output lists one-sample and two-sample tests

The printed output for ttest reads "Two-sample t-test:" followed by "One-sample t-test:" whenever two samples are given. It should only read "Two-sample t-test:"

Document helper functions

Many helper functions (i.e., non-exported functions) have no documentation. I think a good goal is to have all functions be documented in some way.

remove `version` parameter from `descrip()`?

Currently if you specify version = TRUE in descrip(), the function will return "20160730" and nothing else. I'd imagine we either want to to return a date that's a bit closer to today, or just remove this parameter entirely. Personally, I'm in favor of removing it entirely. Any objections or other thoughts?

descrip no longer works is intended on `mri` data

A previous line in test_cases vignette was descrip(mri). This no longer works. Possible reasons include that mri is now a tibble rather than a dataframe, but perhaps also because it contains variables of different types (not only numeric).

Fix `ttesti` documentation

ttesti.R documentation needs some changes - a few of the parameter descriptions aren't working properly.

`ttest` returns NULL if two data vectors are provided with stratification using `by`

This should throw an error instead.

Streamline `wilcoxon.R` code

wilcoxon.R has some redundancies, including calculating the p-value multiple times. The whole function should be simplified.

Write unit tests for `ttest` geometric mean

Reminder to myself to come back to this.

add `by` argument to `wilcoxon`

bug in ttesti

one-sided p-value in ttesti wrong for Ha: <. Fixed in the attached file (to find changes, search for "JPH")
my_ttesti.txt

`regress` returns p-value based on t-statistic for logistic regression despite column name Pr(>|z|) in output

I believe this is just a problem with an if statement on line 887 of regress.R

what is 'stratify' supposed to do in bplot?

I can't seem to get an example working (that doesn't throw an error) where the 'strata' argument in bplot is used. Is this argument supposed to do something analogous to facet_wrap in ggplot, or something else?

coefficient names unclear for categorical variables in `regress`

Note that when the following code is run:

library(rigr); data(mri)
regress("mean", atrophy ~ sex, data=mri)

the coefficient names printed are "Intercept" and "sex", whereas from lm() we'd have "(Intercept)" and "sexMale", this telling us which value of the categorical variable the coefficient corresponds to. We should have something similar to this in our print method for regress. Maybe something like "sex: Male"?

`proptest`

Implement proptest function as a branch from ttest. Add continuity correction, which is not currently implemented.

bug in wilcoxon

one-sided p-value in wilcoxon wrong . Fixed in the attached file. Search for "JPH" to find changes
my_wilcoxon.txt

Alternative hypothesis in ttest

The alternative hypothesis in test is specified with the argument "test.type". It would be better to be consistent with other R functions (particularly t.test) and use the argument "alternative". The attached version of ttest changes the
argument name everywhere.
my_ttest.txt