Introduction to Statistics: an integrated textbook and workbook using R

The online version of the book is here.

This project is open source and licensed under the MIT license.
In R module 14, there are a couple of instances of "math-negative" language around formulas:
Suggest replacing these with something more positive or at least neutral. Some concrete suggestions, to which I am not wedded:
Throughout the modules, there is ample opportunity to put statistical techniques in the context of the history of the people who invented those techniques and some of their early applications. This is important because it also gives us an opportunity to invite students into conversations about how many of the practitioners of modern statistics were inextricably linked with the eugenics movement.
Several students have noted that the four variables we've asked them to remove are contiguous and at the beginning of the list. That means they could use the colon notation again, write `hf_select4 <- select(hf, DepTime:Diverted)`, and end up with the same data frame produced by the nominally correct answer `hf_select4 <- select(hf, -Year, -Month, -DayofMonth, -DayOfWeek)`.

Thus, I suggest asking students to remove four variables that aren't in a contiguous block in the dataset.
I've been following ISRS (see e.g. the box at the bottom of p. 70) in being careful about the difference between the test statistic and the point estimate: the test statistic, I'm saying to my students, is the recipe we follow, and the point estimate is the number that we get. For instance, in R module 8, I'd say that the test statistic is the difference in proportions, and the point estimate is -0.1532.
In the R modules, though, this difference is kinda elided. In particular, in the hypothesis testing rubric, we tell students to "report the test statistic" when perhaps we should more carefully be saying "report the point estimate of the test statistic."
Am I making a bigger deal out of this than it really warrants?
I noticed that the directions for the R modules in Canvas don't have a specific step to click "Preview" to generate the HTML file. Suggest adding one, probably as step 4:
Click on "Preview" to generate the HTML file. Make sure you don't have any inline code errors indicated by a yellow error bar at the top of your .Rmd window.
In the second module, `02_Using_R_Markdown`, students are required to Preview before running a code chunk. In theory, this should work fine, but a bug in RStudio prevents this from happening. (rstudio/rstudio#3288)
In the example chi-square test of goodness-of-fit for cylinders of cars from the mtcars data set: Is there any particular reason why we couldn't write
H_A: p_4 \neq p_6, p_4 \neq p_8, or p_6 \neq p_8 ?
This would generalize nicely to the case where we're comparing to a given probability distribution, as in the "Your turn:"
H_0: p_{general} = .15, p_{academic} = .6, p_{vocational} = .25 (we're implying an "and" here)
H_A: p_{general} \neq .15, p_{academic} \neq .6, or p_{vocational} \neq .25 (implied "or")
I suspect that there might just be some subtlety here that I'm not understanding. Is it just to avoid the "and" vs. "or" thing?
(This is mostly just a flag for me to look at and think about later when I have more brain space.)
There's currently no great way to determine whether students are using inline R code as requested, other than digging out their .Rmd file and looking directly. It would be really nice if we could get the results of inline R code to display differently -- perhaps in a different color or something. (I think this would also be a useful benefit for students -- oh look, those blue numbers are being automatically generated.)
However, if you look at the html source, it's clear that the values are being generated before they get to the html phase, so we can't just do the obvious thing and use CSS to style them.
So, if we're going to style them, it's going to have to be earlier in the stack -- maybe in pandoc or something.
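One place to intervene before pandoc is knitr's inline output hook. Here's a minimal sketch, assuming an HTML output format; the class name `inline-r` is made up for illustration:

```r
# Sketch (assumes HTML output): override knitr's inline hook so every
# inline result is wrapped in a <span> that CSS can style.
# The class name "inline-r" is hypothetical.
library(knitr)
knit_hooks$set(inline = function(x) {
  paste0('<span class="inline-r">', format(x), '</span>')
})
```

A matching rule like `.inline-r { color: steelblue; }` in the document's CSS would then color every inline result. Worth checking that this doesn't interfere with knitr's default inline number formatting.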
In the section on residuals, it would be good to have the student perform the same calculation using the results of `augment`. (Verify that the numbers are correct in, say, the second row.)
When knitting to PDF, scientific notation often broke the LaTeX. Now that we're using HTML, consider removing `options(scipen = 999)` and allowing scientific notation.
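For reference, `scipen` is just a penalty that biases R's choice between fixed and scientific notation, so removing the option restores the default of 0:

```r
# scipen penalizes scientific notation: large positive values force
# fixed notation, while the default (0) lets R pick the shorter form.
options(scipen = 999)
print(1 / 22758)    # fixed notation
options(scipen = 0) # back to the default
print(1 / 22758)    # scientific notation reappears
```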
Students cannot generate an HTML file by clicking Preview for `05_Manipulating_data.Rmd` after answering all the questions in this R module. RStudio throws an error: "Error creating notebook: pandoc document conversion failed with error 251".

This appears to be an out-of-memory error: to answer all the questions in this R module, students must define at least 15 large data frames (usually 22758 × 21), and the Westminster RStudio server simply runs out of memory when Preview is clicked, even though the large matrices can be run chunk by chunk in the script.
This is another one of those issues that seems to cause a lot of confusion for students. Suggestion: Make the first example a non-uniform distribution, then have a little section that says "What if the null is a uniform distribution?" This will generalize better, because then it makes more sense that the percentages should be 33, 33, 33 or whatever, and then we can talk about how the default is to think about the data being evenly split across categories.
Another benefit is that it would better match the presentation in the book, where the primary example is a nonuniform distribution.
So I'm thinking about the following modules:
I think it'd be nice to try to dig up some more background information to make a more reasonable specification of what the baseline comparison number is. Like, maybe we look at the CDC statistics on ulcerated melanoma tumors, or something.
Notice btw that the example in 19 is a good example of what I'd like: the value $63,204 is a reasonable number that comes from some kind of reasonable background information.
Of course, this issue may become moot as we find cooler datasets to bake into the modules. Maybe those datasets will come with more reasonable background information.
In Rmd 15, about line 92, suggest replacing:
\chi^{2} = \frac{(11 - 10.67)^{2}}{10.67} + \frac{(7 - 10.67)^{2}}{10.67} + \frac{(14 - 10.67)^{2}}{10.67} = 2.3.
with
\chi^{2} = \frac{(11 - 10.67)^{2}}{10.67} + \frac{(7 - 10.67)^{2}}{10.67} + \frac{(14 - 10.67)^{2}}{10.67} \approx 2.312.
Students might otherwise be confused as to why the value that was computed manually differs from the value computed by `chisq`.
Somehow the behavior of `set.seed()` has changed so that the random numbers in `sims` no longer agree with the expository text below.
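For what it's worth, this is probably the R 3.6.0 change to the default `sample()` algorithm ("Rounding" became "Rejection"). One way to confirm, as a sketch:

```r
# R 3.6.0 changed the default sampling algorithm, so set.seed() no longer
# reproduces streams generated under R <= 3.5.x. Temporarily restoring
# the old behavior lets us check the old expository text (not a fix):
suppressWarnings(RNGversion("3.5.0"))
set.seed(42)
# ... rerun the sims chunk here and compare against the old text ...
RNGversion(as.character(getRversion()))  # restore current defaults
```

If the old and new streams differ, the cleanest long-term fix is to regenerate the expository numbers rather than pin the RNG.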
In the hypothesis testing modules, the line that creates the simulation is generally in the same chunk as the line that plots the null distribution. For instance:
```{r}
sims <- do(5000) * diffprop(smoke ~ shuffle(gender), data = smoke_gender)
ggplot(sims, aes(x=diffprop)) +
geom_histogram() +
geom_vline(xintercept = obs_diff, color = "blue") +
geom_vline(xintercept = -obs_diff, color = "blue")
```
This can cause problems as students experiment to find a good binwidth -- the simulations keep getting regenerated every time the code chunk is run, and therefore the distribution or the calculated p-value may change slightly.
My suggestion is to train students to generate their simulations in a separate code chunk and then plot the null distribution in its own chunk:
```{r}
sims <- do(5000) * diffprop(smoke ~ shuffle(gender), data = smoke_gender)
```
```{r}
ggplot(sims, aes(x=diffprop)) +
geom_histogram(binwidth = 0.01) +
geom_vline(xintercept = obs_diff, color = "blue") +
geom_vline(xintercept = -obs_diff, color = "blue")
```
This way the simulations will at least be stable through multiple redrawings of the ggplot. It will also improve runtime and decrease memory load on the server.
Other possible solutions (that I don't really like) involve changing where `rflip` and `shuffle` are introduced and run multiple times.
In a recent edit, install.packages was moved to a footnote. But I missed a reference to it later in the exposition.
I need to alter the rubric to say
"Report the test statistic in context (when possible)."
Proportions, z scores, and t scores are interpretable. Chi-square and F values really don't have any kind of meaningful interpretation.
Align the code with the new ggplot2 syntax that deprecates variables like `..prop..`.
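For concreteness, the replacement is `after_stat()` (available since around ggplot2 3.3.0); `mtcars` here is just an illustrative stand-in for the modules' data:

```r
library(ggplot2)

# Old, deprecated syntax:
#   geom_bar(aes(y = ..prop.., group = 1))
# New syntax using after_stat():
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar(aes(y = after_stat(prop), group = 1))
```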
Okay, so there are 344 penguins in the `penguins` dataset. But two of them have NA body mass. This causes the following issues:

- `penguins %>% group_by(species) %>% summarize(mean(body_mass_g))` will return NA for the affected groups unless `, na.rm = TRUE` is added.
- `infer` does df_E = n - k, which students will probably want to compute as 344 - 3 = 341. But secretly n = 342, so df_E = 342 - 3 = 339.
- `infer` will throw warnings about removing 2 rows. (But maybe students don't look at that / don't know what that means.)

Suggestion: let's prompt students in the "Your turn" setup to filter out the penguins with NA body mass. I'm not sure we have shown them how to do this before, so we may even want to provide `%>% filter(!is.na(...))` as a hint.
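A sketch of what that hint might expand to in the "Your turn" setup (variable names taken from palmerpenguins):

```r
library(dplyr)
library(palmerpenguins)

# Drop the two penguins with missing body mass up front, so that
# group means and infer's df calculations use n = 342 consistently.
penguins_complete <- penguins %>%
  filter(!is.na(body_mass_g))

penguins_complete %>%
  group_by(species) %>%
  summarize(mean_mass = mean(body_mass_g))  # no na.rm needed now
```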
Happy to take a stab at this if you think it'd be good.
I think the labeling (and commentary) of the code chunks in lines 182-210ish is a little confusing.
The code chunk on line 184 computes the standard error, but line 182 says "We also compute a z score". Suggest changing line 182 to say, "Let's compute a z-score. We first need to compute the standard error using the formula from earlier:"
Then, suggest some additional commentary on line 191: "Now that we know the SE, we can compute the z-score in the usual way:"
Line 201 might also be confusing, leading students to believe that there is a different formula for the z-score. Rather, there's just a different formula for the standard error. Suggest changing the last sentence here to something like: "Note that we need to use the new SE formula for the two-proportion case, instead of the one for the one-proportion case."
In R module 4, Exercise 7, we ask students to pick one numerical and one categorical variable and draw some plots. Both `ptl` and `ftv` are sort of numerical but do not take on very many different values, so they end up being poor choices for the purposes of this exercise: they behave much more like categorical variables than numerical variables. Perhaps we should warn students away from choosing `ptl` or `ftv` for Exercise 7.
In the first module, some students (who aren't reading carefully) are trying to use `install.packages` on RStudio Server and getting errors. I need to reword this to make it somehow clear to novices what the difference is between accessing RStudio Server through a browser and using RStudio installed on their local machine.
On the rubric for module 22, Exercise 3 is associated with 3 points rather than 1.
Suggest modifying the example in Module 13 so that it doesn't involve the filter command. Yes, we really carefully tell students not to copy-paste the filter command, but lots of people do it anyway, and then run into silly and not-useful errors.
The wording for creating a new project has changed in RStudio. Instead of "Empty Project" it now says "New Project".
By googling my brains out, I found that if you use the date line below in the YAML header, it stamps the date and time of the preview/knit.
date: "`r format(Sys.time(), 'on %a %b %d, %Y, at %H:%M:%S')`"
I like it, even just for myself, since it tells you the most recent preview/knit right in the output.
Maybe something for future versions of the modules for students?
In module 5, if the student loads packages in a funky order, the `select` command might resolve to the one from the MASS package instead of dplyr.
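If we wanted to show students a defensive pattern, one sketch (assuming the conflict is `MASS::select` masking `dplyr::select`; `mtcars` is just illustrative):

```r
library(dplyr)
library(MASS)  # loaded after dplyr, so select() now means MASS::select

# Option 1: fully qualify the call so load order doesn't matter.
mt_small <- dplyr::select(mtcars, mpg, cyl)

# Option 2: pin select for the rest of the session.
select <- dplyr::select
```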
Module 5, line 28 or so (directly under the `## Load packages` section header), has one instance of `mosaic` typo'd as `mosiac`.
So you were totally right: I took a quick ctrl-f through all the modules, and you only ever adorn column percentages, so there's no inconsistency.
Okay, so then here is maybe a suggestion. What if there's a brief little thing when we first introduce `adorn_percentages("col")` that says: here we are interested in column percentages for such-and-such reasons, here's how we do that, bam. Okay, but what if, for some other reason in some other table, we're interested in the row percentages? Okay, well, `adorn_percentages("row")`, bam.
Alternatively: what students are ever going to do wrong is they're just going to leave the specifier off and not notice. So maybe instead, a thing that says: what happens if we accidentally leave off `"col"` in here? Oh no, that's row percentages. Compare what happens when we say `"row"`; wow, yeah, those are less useful here.
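Something like this little compare-and-contrast, sketched here on `mtcars` rather than the module's own table:

```r
library(dplyr)
library(janitor)

# Column percentages: each column sums to 1.
mtcars %>% tabyl(cyl, am) %>% adorn_percentages("col")

# Row percentages: each row sums to 1 -- compare how different
# (and, for this table, less useful) these look.
mtcars %>% tabyl(cyl, am) %>% adorn_percentages("row")
```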
The reason I'm putting this in as a "suggestion" rather than making a pull request is because (a) low bandwidth today lol and (b) I'm diffident about the pedagogical value of this move. Maybe it shows them a thing to watch out for, but maybe also it just muddies the waters.
It bothers me that all of the messages and warnings from library-ing packages at the beginning of an Rmd file end up in the final knit document, so a typical code chunk from a knit document looks like this:

```{r libraries, warning=F, message=F}
library(ggplot2)
library(tidyr)
library(mosaic)
```
but the notebooks are pretty twitchy about including the messages. Some extra hammering leads me to the (possibly ill-formed) conclusion that the very first time you library the package it will include the messages etc., but if you run it a second time it goes away. I hate having to skip all that blah all the time, so I googled up this post of several ideas, none for notebooks:
https://stackoverflow.com/questions/13090838/r-markdown-avoiding-package-loading-messages
so now I see two options. First (leaves the messages in the Rmd output but hides them from the html):

```{r libraries, results='hide'}
library(ggplot2)
library(tidyr)
library(mosaic)
```
and second (which also hides them from the Rmd output when running the chunk):

```{r libraries, results='hide', message=F}
library(ggplot2)
library(tidyr)
library(mosaic)
```
I prefer the first for now since the user still sees the occasionally useful messages in the Rmd window but I don't have to see them in student assignments.
MmmmKay?
The randomization modules could align a little better with the ISRS textbook if `shuffle` were presented before `rflip`, since the first simulation discussion occurs in the context of the gender discrimination study.
(In some ways this is not a particularly natural ordering. Maybe the better thing to do long-term is to just ensure that R modules 6 and 7 are independent, so that you could assign them in either order.)
I had a bunch of students struggle with making a factor (probably no surprise), but then the fact that the code to make the factor and the code to assign it to something else are in the same chunk caused additional issues. For example, in module 04_Categorical_data.Rmd, the example using `factor()` is:
```{r}
race <- factor(birthwt$race,
               levels = c(1, 2, 3),
               labels = c("White", "Black", "Other"))
race_df <- data.frame(race)
```
I'm recommending that we split this into two separate code chunks, for two reasons:

1. Some students had broken `factor()` code, so there was no vector called `race` for the assignment to find. If the chunks were separate, they would get an error about the `factor()` code, which is what was broken 97.547865% of the time. (Really the issue isn't this example, but when students copy, paste, and adapt it.)
2. Some students would run only the `factor()` part and not the bottom part, or they would always include the `race_df <- data.frame(race)` part even when they wanted to do something else. They thought you had to do all of it no matter what.

The explanation of the pieces is already all there; I'd just separate the code and explanations.
This is a general suggestion for any of the modules that deal with hypothesis testing. When we ask students to write the null and alternative hypotheses in symbols, or when we model how to do this, I suggest that we include in the subscript something about both categorical variables. This will help clarify which of the four proportions in the relative frequency table it is that we're comparing.
For instance, in the example in R module 8, we're investigating whether there's an association between smoking and low birth weight. We write the null hypothesis as:
H0: pnonsmoker - psmoker = 0
I'm suggesting that it would be more informative for us to say:
H0: plow,nonsmoker - plow,smoker = 0
I think it'd help students understand the interpretation of the p-value if we separate the computing-the-number part from the writing-the-sentence part. Then they'd be able to see that, oh, I computed the number right, but now I have to say what it means.
Module 04, Question 4a: the answer in the rubric lists the % of white women who smoke during pregnancy as 52.4%. It should be 54.2%, I think.
Might be worth leaving as check on rubric copying, as long as QUARC consultants know about it.
"Notice that the all the times are missing" is at the end of Exercise 11, remove extra words...
Add `{-}` to the web version to remove section numbering.
A suggestion: can we add `geom_qq_line()` to all of the example QQ plots? I have been asking my students to do this, and at least some of them are more aware of patterns in the QQ plot since the reference line is much more obvious; for example, deviations in the slope and tails are more obvious.
(PS I don't know if you remember how obsessed I got with these lines a few years ago, before geom_qq_line() existed but I'm simultaneously excited that it exists and a bit let down that it is now so easy. When I was a boy, we had to make ours by hand...)
just a thought...
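For reference, the change would just be one extra layer per example (sketched here on `mtcars` rather than the modules' data):

```r
library(ggplot2)

# geom_qq_line() adds the reference line through the theoretical
# quantiles, making slope and tail deviations easier to spot.
ggplot(mtcars, aes(sample = mpg)) +
  geom_qq() +
  geom_qq_line()
```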
I'm noticing that students, like, really don't understand how to "chunk up" R code into meaningful pieces. I wonder if better syntax highlighting would help. In particular, the only things that currently get a different color in code chunks are string literals and boolean literals, and also in particular, functions don't -- so I don't think students see what I mean when I say "the chisq.test function" or whatever.
A couple of thoughts: it's not quite clear how to get the help page for `hflights` to show up in the Help window if we haven't already loaded the `hflights` package with `library(hflights)`. Some suggested fixes:

- Run `library(hflights)` and then `?hflights` in the console.
- Type `hflights` in the search box in the Help tab and look through the resulting list of possible results to find the one they want.

The module text reads: "Using ggplot code, create the type of graph you identified above. Be sure to put involact on the y-axis and race` on the x-axis." There is a missing backtick in front of race.
Here's some text from Exercise 7d:
Using the normal model you just developed, determine how likely the drug trial data will be to show the drug as "effective" according to the 85% standard. In other words, how often will our sample give us a result that is 85% or higher (even though secretly we know the true effectiveness is only 83%)? (Hint: you'll need to use the `pdist` command.)
A lot of people are answering this just by drawing the `pdist` graph and not saying anything about it, and we can't really knock them based on this wording. Suggest the following text instead:
Using the normal model you just developed, determine how likely the drug trial data will be to show the drug as "effective" according to the 85% standard. In other words, how often will our sample give us a result that is 85% or higher (even though secretly we know the true effectiveness is only 83%)? Report your answer in a contextually meaningful full sentence using inline R code. (Hint: you'll need to use the `pdist` command.)
We might even create a separate part to ask them to draw the picture, something along the lines of the instructions in R module 10.
Because I posted a "clean" version of the HTML document with no code run, the Melanoma data table does not appear as promised. I need to see if I can run just that one code chunk without showing the output from all the others.
I need to add the final subsections to Chapter 8 about finalizing the assignment.
Okay, so here's a funny thing:
In R module 9, let's say that students code their ulcer variable "backwards":
`ulcer <- factor(Melanoma$ulcer, levels = c(0, 1), labels = c("absent", "present"))`
Well, then if they do `ulcer_prop <- prop(ulcer, data = ulcer_df)`, they get 0.561ish, and so they'll want to use `1 - ulcer_prop` for most of their calculations. However, this gives you the wrong two-sided p-value: `P2 <- 2 * prop(sims2$prop <= 1 - ulcer_prop)` results in 0.07.
The reason for this appears to be a rounding issue. Try doing `sims2 %>% filter(prop <= 1 - ulcer_prop) %>% arrange(prop)` versus `sims2 %>% filter(prop <= 90/205) %>% arrange(prop)` and you'll see what I mean: the former results in 35 rows, while the latter results in 50 rows, including some with a proportion equivalent to 90/205.
I think the easiest fix is to just mention 0.07 as another possibility for the two-sided p-value in the rubric, or else to very specifically instruct students to put the "success" criterion first.
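If we go the "specify success explicitly" route, a hedged sketch, assuming mosaic's `prop()` accepts a `success` argument in the version we ship (worth double-checking):

```r
library(MASS)    # for the Melanoma data
library(mosaic)

ulcer <- factor(Melanoma$ulcer, levels = c(0, 1),
                labels = c("absent", "present"))
ulcer_df <- data.frame(ulcer)

# Naming the success level removes the ambiguity about which
# proportion prop() reports, so students never need 1 - ulcer_prop.
ulcer_prop <- prop(~ ulcer, data = ulcer_df, success = "present")
```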
In Rmd 17 around line 197, the normal distribution is plotted in red and the T distribution is plotted in green. Suggest using different colors to be friendlier to students who are red-green colorblind.
I've noticed that the line numbers reported in error messages don't correspond very well to the line numbers given in the gutters of the .Rmd files. Usually they're low by, say, 10 or 20, which makes me think that it has to do with R not counting lines in the preamble.
In R module 4, Exercise 8 asks:
"Modify the following side-by-side bar chart by adding a title and labels for both the fill variable and the x-axis variable."
We've previously seen examples of adding labels for the x-axis and y-axis variables, but not the fill variable. It's reasonable enough for students to guess that the required command should indeed be `labs(..., fill = "desired label")`, but perhaps we want to give them a more explicit example, or else clue them in explicitly that `fill` is a thing you can tell `labs` about.
Some graphs are hard for me to see. (For example, the last graph in Chapter 7, including the "side" predictor.) How much overhead would there be to teach students about using colorblind-friendly palettes? It might be enough to set a theme by default in the header. Then again, maybe it's useful to teach students more explicitly how to do it and why it's important.
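A low-overhead sketch using the Okabe-Ito palette, hard-coded here so it works on any ggplot2 version (`mtcars` is just illustrative):

```r
library(ggplot2)

# Okabe-Ito: a widely used colorblind-friendly palette.
okabe_ito <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442",
               "#0072B2", "#D55E00", "#CC79A7", "#000000")

ggplot(mtcars, aes(x = factor(cyl), fill = factor(am))) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = okabe_ito)
```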