statdivlab / rigr
Regression, Inference, and General Data Analysis Tools for R
License: Other
I think that these functions should return an object (a list, say) with all relevant output, in addition to printing. This matches e.g. wilcoxon(), and I think is far more user-friendly than only printing.
When running:
mri <- read.table("http://www.emersonstatistics.com/datasets/mri.txt", header=TRUE);
bplot(data = mri, y=atrophy, x=male, xlab="Gender", ylab="Atrophy")
you get an error that "male" is not found because the dataset has not been attached. The data argument should work without requiring users to attach the dataset to their environment beforehand.
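One common way to make a data argument work without attach() is to capture the column expression unevaluated and evaluate it inside the data frame. The following is a minimal sketch only: eval_in_data is a hypothetical helper, not part of rigr, and the toy data frame stands in for mri. It is not necessarily how bplot should be fixed.

```r
# Hypothetical helper, not part of rigr: look up `expr` in `data` first,
# then fall back to the environment of whoever called the helper.
eval_in_data <- function(expr, data = NULL) {
  eval(substitute(expr), data, parent.frame())
}

# Toy stand-in for the mri dataset (values are made up).
toy <- data.frame(male = c(0, 1, 1, 0), atrophy = c(30, 25, 41, 36))

# `male` is found inside `toy`; no attach() is needed.
male_values <- eval_in_data(male, data = toy)

# Names that are not columns (here `cutoff`) resolve in the calling environment.
cutoff <- 35
large <- eval_in_data(atrophy > cutoff, data = toy)
```

If the expression is passed through several nested functions before being evaluated, substitute() has to be applied at the outermost level, which may be relevant wherever a user-facing function hands its arguments to an internal helper.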
Waiting to fix this until unit tests are written
The function replaces missing values with TRUE (aka 1), at least for cases when fnctl = "geometric mean" is specified
Currently, function names and objects follow inconsistent styling. Ideally, we would follow a style guide (e.g., Hadley Wickham's or Google's) so that it is easier to read and debug the code.
I propose following Hadley's for the most part, and using underscores (_) in both function and object names.
Also, ideally the main user-facing functions will have their own files (e.g., descrip.R) and internal helper functions will live in a separate file (e.g., descrip_utils.R, or possibly a separate file for each internal function).
It will be far more maintainable to have the version listed in one spot (for example, when the package is loaded) as opposed to in individual functions. The version parameter has already been removed from regress, descrip, and both of their utility functions, but the parameter likely still exists elsewhere.
At the suggestion of @bdwilliamson, we could consider adding a message, printed when the package is loaded, that reports the version and/or the package creation date.
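A load-time message of this kind is usually written with .onAttach() and packageStartupMessage(). The sketch below assumes the code would live in a file such as R/zzz.R; startup_text is a hypothetical helper introduced only so the message can be built and checked outside the package.

```r
# Hypothetical helper: build the message text from the installed package's
# DESCRIPTION, so the version is recorded in exactly one place.
startup_text <- function(pkgname) {
  paste0(pkgname, " version ", as.character(utils::packageVersion(pkgname)))
}

# R calls .onAttach() when the package is attached with library().
# packageStartupMessage() lets users silence it via suppressPackageStartupMessages().
.onAttach <- function(libname, pkgname) {
  packageStartupMessage(startup_text(pkgname))
}

# The helper works for any installed package, e.g. the base "stats" package.
example_text <- startup_text("stats")
```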
This issue builds off of #33
It would be wonderful to have online documentation. I haven't looked into this for many years but I recall this being pretty easy with github.io and pkgdown.
We no longer want to support shortened fnctl strings in regress
Originally posted by taylorokonek August 9, 2021
Currently, users can enter a shortened character string for the fnctl argument in regress, and so long as it corresponds to a unique substring, the function will run. As an example, a user could simply set fnctl = "g", and the function would take this to mean fnctl = "geometric mean". Personally, I don't think there's a huge harm in forcing users to spell out the fnctl argument. Does anyone have any insight as to why this particular functionality was introduced (@bdwilliamson)?
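To illustrate the difference between the two behaviors, the sketch below contrasts base R's match.arg(), which accepts any unique prefix, with a strict check. The choices vector lists only the fnctl values named in these issues, check_fnctl is a hypothetical helper, and regress's own matching code may work differently.

```r
# Illustrative subset of fnctl values mentioned in these issues (an assumption,
# not the full list that regress accepts).
choices <- c("mean", "geometric mean", "hazard")

# Partial matching: match.arg() resolves a unique prefix to the full string.
partial <- match.arg("g", choices)

# Strict matching: accept only values that are spelled out in full.
# `check_fnctl` is hypothetical, not an existing rigr function.
check_fnctl <- function(fnctl, choices) {
  if (!is.character(fnctl) || length(fnctl) != 1 || !(fnctl %in% choices)) {
    stop("fnctl must be one of: ", paste(choices, collapse = ", "))
  }
  fnctl
}

exact_ok <- check_fnctl("geometric mean", choices)
exact_bad <- tryCatch(check_fnctl("g", choices), error = function(e) "rejected")
```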
regress("mean",fev~ht*as.integer(sex))
As a result of some of the changes associated with modernizing the package, devtools::check() fails.
Example:
library(rigr); data(mri); regress("mean", atrophy ~ race, data=mri)
Error message:
Error in dimnames(x$model) <- `*vtmp*` :
  length of 'dimnames' [1] not equal to array extent
In addition: Warning message:
In cbind(nms, levels) :
number of rows of result is not a multiple of vector length (arg 2)
In termTraverse.R (soon to be in regress_utils.R), there is a function defined called equal, and the value returned calls the function itself (not sure if there are any implications to this?). The function seems to be doing the same thing as checking if length(unique(x)) == 1, other than that they would handle vectors that include NAs differently. This is mostly a reminder to myself to edit this helper function to be a bit cleaner in the future.
equal <- function(x){
  if(length(x)==1){
    return(TRUE)
  } else {
    if(x[1]!=x[2]){
      return(FALSE)
    } else {
      return(equal(x[-1]))
    }
  }
}
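The NA difference mentioned above can be seen directly. In this sketch, equal is copied from the issue, while all_equal is a hypothetical name for the length(unique(x)) == 1 alternative; neither is claimed to be the final form of the helper.

```r
# Recursive helper as quoted in the issue.
equal <- function(x) {
  if (length(x) == 1) {
    return(TRUE)
  } else {
    if (x[1] != x[2]) {
      return(FALSE)
    } else {
      return(equal(x[-1]))
    }
  }
}

# Hypothetical non-recursive alternative.
all_equal <- function(x) length(unique(x)) == 1

# Without NAs the two agree.
same_a <- equal(c(2, 2, 2))
same_b <- all_equal(c(2, 2, 2))

# With an NA, `x[1] != x[2]` is NA, so if() fails inside equal(),
# while unique() treats NA as just another distinct value.
na_recursive <- tryCatch(equal(c(1, NA)), error = function(e) "error")
na_unique <- all_equal(c(1, NA))
```

Which NA behavior is wanted (an error, FALSE, or ignoring missing values) is a design decision the issue leaves open.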
bplot(fev,as.integer(agecat),xlab="Age Category",ylab="FEV (liters)")
I think this function has the same behavior and interface as ifelse - I'll look into it but it doesn't seem like we need this.
The following code produces a value for firstEvent but has lastEvent as NA:
a <- rnorm(10,0,1)
descrip(a)
I would imagine we'd want firstEvent to be NA as well in this case.
I downloaded the rigr package on 7/23 and was using ttest() and got this result. Notice the incorrect label above the output that says "One-sample t-test"
with(hiv,ttest(cd4[art==0],cd4[art==1]))
Call:
ttest(var1 = cd4[art == 0], var2 = cd4[art == 1])
Two-sample t-test allowing for unequal variances :
One-sample t-test :
Summary:
Group          Obs  Missing  Mean   Std. Err.  Std. Dev.  95% CI
cd4[art == 0]   78       14  388.5       32.5        260  [323.5, 454]
cd4[art == 1]   50       19  298.5       32.9        183  [231.3, 366]
Difference     128       33   90.1       46.3             [-2.02, 182]
Ho: difference in means = 0 ;
Ha: difference in means != 0
t = 1.946 , df = 80.7
Pr(|T| > t) = 0.0551195
Currently both some vignettes and some examples run the following to get the salary data:
salary <- read.table("http://www.emersonstatistics.com/datasets/salary.txt", header = TRUE, stringsAsFactors = FALSE)
This should instead be a documented dataset released with the package.
This impacts the vignette regress_intro, amongst others.
It would be great to be able to install this package from CRAN -- this would enormously help many of our students with using it for the first time. Target date: Sept 13.
In the regress_intro vignette, the line regress("mean", salary ~ female*year, id = id, data = salary) no longer works. Not sure why.
The documentation for regress seems to suggest you can do proportional hazards regression using fnctl = "hazard", but the code will actually throw an error if you specify fnctl = "hazard", stating "proportional hazards regression no longer supported". That being said, the remainder of the code seems to still have if/else cases and other sections in case fnctl = "hazard", so is this error something that was thrown in recently? @bdwilliamson do you have any insight?
Another question is whether we want regress to be able to handle proportional hazards regression moving forward.
bplot(fev,agecat,xlab="Age Category",ylab="FEV (liters)") # fails if agecat is a factor
As currently written, wilcoxon() is (as far as I can tell) largely a copy-pasted version of wilcox.test() from the stats package. That copied code provides a p-value (variable PVAL in the code), either using an exact test, or Normal approximation.
One of the additions in wilcoxon() is the inf return object, which is described as "a formatted table of inference values, for printing." This object provides the PVAL mentioned above, but also includes the Z-score and corresponding Normal approximation p-value. This p-value will correspond to PVAL when exact = FALSE. However, when exact = TRUE is used, PVAL is an exact p-value, but the Normal approximation p-value is still included in inf.
Long story short: when exact = TRUE, wilcoxon() provides both the exact and approximate p-value, which I think would be confusing to a user. To see this run:
set.seed(2)
y <- rnorm(100)
wilcoxon(y, exact = TRUE)
The inf return object has p-value 0.70012 (exact) and 0.69762 (Normal approximation).
My feeling is that we shouldn't report two different p-values. Thoughts?
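For comparison, stats::wilcox.test() reports a single p-value per call, and the exact argument decides which one. A small sketch using the same simulated data is below. Note that wilcox.test() applies a continuity correction by default (correct = TRUE), so its Normal-approximation value is not assumed to match the number printed by wilcoxon().

```r
set.seed(2)
y <- rnorm(100)

# One call, one p-value: exact signed-rank distribution.
p_exact <- stats::wilcox.test(y, exact = TRUE)$p.value

# One call, one p-value: Normal approximation (continuity-corrected by default).
p_approx <- stats::wilcox.test(y, exact = FALSE)$p.value

c(exact = p_exact, approximate = p_approx)
```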
If I recall correctly, you can use the argument intercept = FALSE in regress(). Is it possible to add the -1 functionality?
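One possible way to detect a -1 (or 0 +) in a formula is the "intercept" attribute of the terms object, which is how base R modelling functions record it. The sketch below uses a hypothetical helper, has_intercept, and made-up variable names; how regress would combine this with its intercept argument is left open.

```r
# Hypothetical helper: TRUE if the formula keeps the intercept.
has_intercept <- function(formula) {
  attr(stats::terms(formula), "intercept") == 1
}

with_int <- has_intercept(atrophy ~ age)       # default formula
minus_one <- has_intercept(atrophy ~ age - 1)  # intercept removed with -1
zero_plus <- has_intercept(atrophy ~ 0 + age)  # intercept removed with 0 +
```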
Currently, T and F are sometimes used as shorthand for TRUE and FALSE. While this is possible, it isn't good practice -- TRUE and FALSE are protected, while T and F are valid object names and thus can be overwritten by the user.
We should use the full words!
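A short demonstration of the risk, as a sketch: T can be shadowed by an ordinary assignment, while assigning to TRUE is an error because it is a reserved word.

```r
# T is an ordinary variable in base R, so a user can shadow it.
T <- 0
shadowed <- isTRUE(T)   # code relying on T now sees 0

# Removing the user's copy exposes base R's T again.
rm(T)
restored <- isTRUE(T)

# TRUE is a reserved word: assigning to it fails.
reserved <- tryCatch({
  eval(parse(text = "TRUE <- 0"))
  "assigned"
}, error = function(e) "error")
```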
The newest version of descrip prints, by default, the columns Max Restriction, FirstEvent, LastEvent, isDate. Those seem confusing for the average user, especially if none of the measures is a time-to-event. Can we have descrip not print those columns by default?
Also, at the bottom of the descrip output I see
attr(,"class")
[1] "uDescriptives"
which will be confusing to the naive user.
Finally, all statistics are printed in scientific notation which is different from before and harder to read.
Having said all this, I just figured out that I think the issue is simply that the print.uDescriptives function has not been included in the rigr package?
The code for this function is much more verbose than it needs to be, including many statements like if (condition){}; else{do stuff}. For ease of future maintenance this should be cleaned up.
Other regression commands in R accept factor covariates but
regress("mean", fev ~ ht+sex,data=fevdata) # fails if sex is a factor
At the very least, ensure that all parameters and return values are described well.
The printed output for ttest reads "Two-sample t-test:" followed by "One-sample t-test:" whenever two samples are given. It should only read "Two-sample t-test:"
Many helper functions (i.e., non-exported functions) have no documentation. I think a good goal is to have all functions be documented in some way.
Currently if you specify version = TRUE in descrip(), the function will return "20160730" and nothing else. I'd imagine we either want it to return a date that's a bit closer to today, or just remove this parameter entirely. Personally, I'm in favor of removing it entirely. Any objections or other thoughts?
A previous line in the test_cases vignette was descrip(mri). This no longer works. Possible reasons include that mri is now a tibble rather than a dataframe, but perhaps also because it contains variables of different types (not only numeric).
ttesti.R documentation needs some changes - a few of the parameter descriptions aren't working properly.
This should throw an error instead.
wilcoxon.R has some redundancies, including calculating the p-value multiple times. The whole function should be simplified.
Reminder to myself to come back to this.
The one-sided p-value in ttesti is wrong for Ha: <. This is fixed in the attached file (to find the changes, search for "JPH").
my_ttesti.txt
I believe this is just a problem with an if statement on line 887 of regress.R
I can't seem to get an example working (that doesn't throw an error) where the 'strata' argument in bplot is used. Is this argument supposed to do something analogous to facet_wrap in ggplot, or something else?
Note that when the following code is run:
library(rigr); data(mri)
regress("mean", atrophy ~ sex, data=mri)
the coefficient names printed are "Intercept" and "sex", whereas from lm() we'd have "(Intercept)" and "sexMale", thus telling us which value of the categorical variable the coefficient corresponds to. We should have something similar to this in our print method for regress. Maybe something like "sex: Male"?
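As a rough sketch of the proposed labelling, the code below starts from lm() coefficient names and rewrites them as "variable: level" using the factor levels that lm() stores in xlevels. The toy data and the label_levels helper are hypothetical, and the sketch handles main effects only, not interactions. It is not the uRegress print method.

```r
# Toy data standing in for mri (values are made up).
toy <- data.frame(
  atrophy = c(30, 25, 41, 36, 28, 33),
  sex = factor(c("Female", "Male", "Male", "Female", "Male", "Female"))
)
fit <- lm(atrophy ~ sex, data = toy)

# lm() pastes the variable name and level together.
lm_names <- names(coef(fit))

# Hypothetical helper: turn "sexMale" into "sex: Male" for each factor level.
label_levels <- function(fit) {
  nms <- names(coef(fit))
  for (v in names(fit$xlevels)) {
    for (lev in fit$xlevels[[v]]) {
      nms[nms == paste0(v, lev)] <- paste0(v, ": ", lev)
    }
  }
  nms
}

pretty_names <- label_levels(fit)
```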
Implement proptest function as a branch from ttest. Add continuity correction, which is not currently implemented.
The one-sided p-value in wilcoxon is wrong. This is fixed in the attached file. Search for "JPH" to find the changes.
my_wilcoxon.txt
The alternative hypothesis in ttest is specified with the argument "test.type". It would be better to be consistent with other R functions (particularly t.test) and use the argument "alternative". The attached version of ttest changes the argument name everywhere.
my_ttest.txt
It would be very useful if lincom could handle lm objects as well as uRegress objects
This seems to me to be a scoping issue, since wilcoxon() calls wilcoxon.do(). I tried using sys.frame(1) in the substitute(), without success. Commented out failing unit tests for this - restore these when fixed.
In an earlier email from Scott Emerson to me: "I have some updated functions that have not yet been put in the package."
Could the summer TA team please reach out to him about this and incorporate those edits?