Introduction to Statistics: an integrated textbook and workbook using R

The online version of the book is here.

This project is open source and licensed under the MIT license.
In R module 14, there are a couple of instances of "math-negative" language around formulas:
Suggest replacing these with something more positive or at least neutral. Some concrete suggestions, to which I am not wedded:
Throughout the modules, there is ample opportunity to put statistical techniques in the context of the history of the people who invented those techniques and some of their early applications. This is important because it also gives us an opportunity to invite students into conversations about how many of the practitioners of modern statistics were inextricably linked with the eugenics movement.
Several students have noted that the four variables we've asked them to remove are contiguous and at the beginning of the list. That means they could use the colon notation again, write `hf_select4 <- select(hf, DepTime:Diverted)`, and end up with the same data frame produced by the nominally correct answer `hf_select4 <- select(hf, -Year, -Month, -DayofMonth, -DayOfWeek)`.

Thus, I suggest asking students to remove four variables that aren't in a contiguous block in the dataset.
I've been following ISRS (see e.g. the box at the bottom of p. 70) in being careful about the difference between the test statistic and the point estimate: the test statistic, I'm saying to my students, is the recipe we follow, and the point estimate is the number that we get. For instance, in R module 8, I'd say that the test statistic is the difference in proportions, and the point estimate is -0.1532.
In the R modules, though, this difference is kinda elided. In particular, in the hypothesis testing rubric, we tell students to "report the test statistic" when perhaps we should more carefully be saying "report the point estimate of the test statistic."
Am I making a bigger deal out of this than it really warrants?
I noticed that the directions for the R modules in Canvas don't have a specific step to click "Preview" to generate the HTML file. Suggest adding one, probably as step 4:
Click on "Preview" to generate the HTML file. Make sure you don't have any inline code errors indicated by a yellow error bar at the top of your .Rmd window.
In the second module, `02_Using_R_Markdown`, students are required to Preview before running a code chunk. In theory, this should work fine, but a bug in RStudio prevents this from happening. (rstudio/rstudio#3288)
In the example chi-square test of goodness-of-fit for cylinders of cars from the mtcars data set: Is there any particular reason why we couldn't write
H_A: p_4 \neq p_6, p_4 \neq p_8, or p_6 \neq p_8 ?
This would generalize nicely to the case where we're comparing to a given probability distribution, as in the "Your turn:"
H_0: p_{general} = .15, p_{academic} = .6, p_{vocational} = .25 (we're implying an "and" here)
H_A: p_{general} \neq .15, p_{academic} \neq .6, or p_{vocational} \neq .25 (implied "or")
I suspect that there might just be some subtlety here that I'm not understanding. Is it just to avoid the "and" vs. "or" thing?
(This is mostly just a flag for me to look at and think about later when I have more brain space.)
There's currently no great way to determine whether students are using inline R code as requested, other than digging out their .Rmd file and looking directly. It would be really nice if we could get the results of inline R code to display differently -- perhaps in a different color or something. (I think this would also be a useful benefit for students -- oh look, those blue numbers are being automatically generated.)
However, if you look at the html source, it's clear that the values are being generated before they get to the html phase, so we can't just do the obvious thing and use CSS to style them.
So, if we're going to style them, it's going to have to be earlier in the stack -- maybe in pandoc or something.
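One place to intervene before pandoc is knitr's inline output hook. Here's a minimal sketch, assuming an HTML output format; the class name `inline-r` is made up for illustration:

```r
# Sketch (assumes HTML output): override knitr's inline hook so every
# inline result is wrapped in a <span> that CSS can style.
# The class name "inline-r" is hypothetical.
library(knitr)
knit_hooks$set(inline = function(x) {
  paste0('<span class="inline-r">', format(x), '</span>')
})
```

A matching rule like `.inline-r { color: steelblue; }` in the document's CSS would then color every inline result. Worth checking that this doesn't interfere with knitr's default inline number formatting.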
In the section on residuals, it would be good to have the student perform the same calculation using the results of `augment`. (Verify that the numbers are correct in, say, the second row.)
When knitting to PDF, scientific notation often broke the LaTeX. Now that we're using HTML, consider removing `options(scipen = 999)` and allowing scientific notation.
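For reference, `scipen` is just a penalty that biases R's choice between fixed and scientific notation, so removing the option restores the default of 0:

```r
# scipen penalizes scientific notation: large positive values force
# fixed notation, while the default (0) lets R pick the shorter form.
options(scipen = 999)
print(1 / 22758)    # fixed notation
options(scipen = 0) # back to the default
print(1 / 22758)    # scientific notation reappears
```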
Students cannot generate an HTML file by clicking Preview for `05_Manipulating_data.Rmd` after answering all the questions in this R module. RStudio throws an error: "Error creating notebook: pandoc document conversion failed with error 251".

This appears to be an out-of-memory error: to answer all the questions in this R module, students must define at least 15 large data frames (usually 22758 × 21), and the Westminster RStudio server simply runs out of memory when Preview is clicked, even though the large matrices can be run chunk by chunk in the script.
This is another one of those issues that seems to cause a lot of confusion for students. Suggestion: Make the first example a non-uniform distribution, then have a little section that says "What if the null is a uniform distribution?" This will generalize better, because then it makes more sense that the percentages should be 33, 33, 33 or whatever, and then we can talk about how the default is to think about the data being evenly split across categories.
Another benefit is that it would better match the presentation in the book, where the primary example is a nonuniform distribution.
So I'm thinking about the following modules:
I think it'd be nice to try to dig up some more background information to make a more reasonable specification of what the baseline comparison number is. Like, maybe we look at the CDC statistics on ulcerated melanoma tumors, or something.
Notice btw that the example in 19 is a good example of what I'd like: the value $63,204 is a reasonable number that comes from some kind of reasonable background information.
Of course, this issue may become moot as we find cooler datasets to bake into the modules. Maybe those datasets will come with more reasonable background information.
In Rmd 15, about line 92, suggest replacing:
\chi^{2} = \frac{(11 - 10.67)^{2}}{10.67} + \frac{(7 - 10.67)^{2}}{10.67} + \frac{(14 - 10.67)^{2}}{10.67} = 2.3.
with
\chi^{2} = \frac{(11 - 10.67)^{2}}{10.67} + \frac{(7 - 10.67)^{2}}{10.67} + \frac{(14 - 10.67)^{2}}{10.67} \approx 2.312.
Students might otherwise be confused as to why the value that was computed manually differs from the value computed by `chisq`.
Somehow the behavior of `set.seed()` has changed so that the random numbers in `sims` no longer agree with the expository text below.
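For what it's worth, this is probably the R 3.6.0 change to the default `sample()` algorithm ("Rounding" became "Rejection"). One way to confirm, as a sketch:

```r
# R 3.6.0 changed the default sampling algorithm, so set.seed() no longer
# reproduces streams generated under R <= 3.5.x. Temporarily restoring
# the old behavior lets us check the old expository text (not a fix):
suppressWarnings(RNGversion("3.5.0"))
set.seed(42)
# ... rerun the sims chunk here and compare against the old text ...
RNGversion(as.character(getRversion()))  # restore current defaults
```

If the old and new streams differ, the cleanest long-term fix is to regenerate the expository numbers rather than pin the RNG.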
In the hypothesis testing modules, the line that creates the simulation is generally in the same chunk as the line that plots the null distribution. For instance:
```{r}
sims <- do(5000) * diffprop(smoke ~ shuffle(gender), data = smoke_gender)
ggplot(sims, aes(x=diffprop)) +
geom_histogram() +
geom_vline(xintercept = obs_diff, color = "blue") +
geom_vline(xintercept = -obs_diff, color = "blue")
```
This can cause problems as students experiment to find a good binwidth -- the simulations keep getting regenerated every time the code chunk is run, and therefore the distribution or the calculated p-value may change slightly.
My suggestion is to train students to generate their simulations in a separate code chunk and then plot the null distribution in its own chunk:
```{r}
sims <- do(5000) * diffprop(smoke ~ shuffle(gender), data = smoke_gender)
```
```{r}
ggplot(sims, aes(x=diffprop)) +
geom_histogram(binwidth = 0.01) +
geom_vline(xintercept = obs_diff, color = "blue") +
geom_vline(xintercept = -obs_diff, color = "blue")
```
This way the simulations will at least be stable through multiple redrawings of the ggplot. It will also improve runtime and decrease memory load on the server.
Other possible solutions (that I don't really like) involve changing where `rflip` and `shuffle` are introduced and run multiple times.
In a recent edit, install.packages was moved to a footnote. But I missed a reference to it later in the exposition.
I need to alter the rubric to say
"Report the test statistic in context (when possible)."
Proportions, z scores, and t scores are interpretable. Chi-square and F values really don't have any kind of meaningful interpretation.
Align the code with the new ggplot2 syntax that deprecates variables like `..prop..`.
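For concreteness, the replacement is `after_stat()` (available since around ggplot2 3.3.0); `mtcars` here is just an illustrative stand-in for the modules' data:

```r
library(ggplot2)

# Old, deprecated syntax:
#   geom_bar(aes(y = ..prop.., group = 1))
# New syntax using after_stat():
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar(aes(y = after_stat(prop), group = 1))
```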
Okay, so there are 344 penguins in the `penguins` dataset. But two of them have NA body mass. This causes the following issues:

- `penguins %>% group_by(species) %>% summarize(mean(body_mass_g))` will return NA for the affected groups unless `, na.rm = TRUE` is added.
- `infer` does df_E = n - k, which students will probably want to compute as 344 - 3 = 341. But secretly n = 342, so df_E = 342 - 3 = 339.
- `infer` will throw warnings about removing 2 rows. (But maybe students don't look at that / don't know what that means.)

Suggestion: let's prompt students in the "Your turn" setup to filter out the penguins with NA body mass. I'm not sure we have shown them how to do this before, so we may even want to provide `%>% filter(!is.na(...))` as a hint.
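A sketch of what that hint might expand to in the "Your turn" setup (variable names taken from palmerpenguins):

```r
library(dplyr)
library(palmerpenguins)

# Drop the two penguins with missing body mass up front, so that
# group means and infer's df calculations use n = 342 consistently.
penguins_complete <- penguins %>%
  filter(!is.na(body_mass_g))

penguins_complete %>%
  group_by(species) %>%
  summarize(mean_mass = mean(body_mass_g))  # no na.rm needed now
```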
Happy to take a stab at this if you think it'd be good.
I think the labeling (and commentary) of the code chunks in lines 182-210ish is a little confusing.
The code chunk on line 184 computes the standard error, but line 182 says "We also compute a z score". Suggest changing line 182 to say, "Let's compute a z-score. We first need to compute the standard error using the formula from earlier:"
Then, suggest some additional commentary on line 191: "Now that we know the SE, we can compute the z-score in the usual way:"
Line 201 might also be confusing, leading students to believe that there is a different formula for the z-score. Rather, there's just a different formula for the standard error. Suggest changing the last sentence here to something like: "Note that we need to use the new SE formula for the two-proportion case, instead of the one for the one-proportion case."
In R module 4, Exercise 7, we ask students to pick one numerical and one categorical variable and draw some plots. Both `ptl` and `ftv` are sort of numerical but do not take on very many different values, so they end up being poor choices for the purposes of this exercise: they behave much more like categorical variables than numerical variables. Perhaps we should warn students away from choosing `ptl` or `ftv` for Exercise 7.
In the first module, some students (who aren't reading carefully) are trying to use `install.packages` on RStudio Server and getting errors. I need to reword this to make it somehow clear to novices what the difference is between accessing RStudio Server through a browser and using RStudio installed on their local machine.
On the rubric for module 22, Exercise 3 is associated with 3 points rather than 1.
Suggest modifying the example in Module 13 so that it doesn't involve the filter command. Yes, we really carefully tell students not to copy-paste the filter command, but lots of people do it anyway, and then run into silly and not-useful errors.
The wording for creating a new project has changed in RStudio. Instead of "Empty Project" it now says "New Project".
By googling my brains out, I found that if you use the date line below in the YAML header, it stamps the date and time of the preview/knit.
date: "`r format(Sys.time(), 'on %a %b %d, %Y, at %H:%M:%S')`"
I like it, even just for myself, since it tells you the most recent preview/knit right in the output.
Maybe something for future versions of the modules for students?
In module 5, if the student loads packages in a funky order, the `select` command might resolve to the one from the MASS package instead of dplyr.
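If we wanted to show students a defensive pattern, one sketch (assuming the conflict is `MASS::select` masking `dplyr::select`; `mtcars` is just illustrative):

```r
library(dplyr)
library(MASS)  # loaded after dplyr, so select() now means MASS::select

# Option 1: fully qualify the call so load order doesn't matter.
mt_small <- dplyr::select(mtcars, mpg, cyl)

# Option 2: pin select for the rest of the session.
select <- dplyr::select
```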
Module 5, line 28 or so (directly under the `## Load packages` section header), has one instance of `mosaic` typo'd as `mosiac`.
So you were totally right: I took a quick ctrl-f through all the modules, and you only ever adorn column percentages, so there's no inconsistency.
Okay, so then here is maybe a suggestion. What if there's a brief little thing when we first introduce `adorn_percentages("col")` that says: here we are interested in column percentages for such-and-such reasons, here's how we do that, bam. Okay, but what if, for some other reason in some other table, we're interested in the row percentages? Okay, well, `adorn_percentages("row")`, bam.
Alternatively: what students are ever going to do wrong is they're just going to leave the specifier off and not notice. So maybe instead, a thing that says: what happens if we accidentally leave off `"col"` in here? Oh no, that's row percentages. Compare what happens when we say `"row"`; wow, yeah, those are less useful here.
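Something like this little compare-and-contrast, sketched here on `mtcars` rather than the module's own table:

```r
library(dplyr)
library(janitor)

# Column percentages: each column sums to 1.
mtcars %>% tabyl(cyl, am) %>% adorn_percentages("col")

# Row percentages: each row sums to 1 -- compare how different
# (and, for this table, less useful) these look.
mtcars %>% tabyl(cyl, am) %>% adorn_percentages("row")
```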
The reason I'm putting this in as a "suggestion" rather than making a pull request is because (a) low bandwidth today lol and (b) I'm diffident about the pedagogical value of this move. Maybe it shows them a thing to watch out for, but maybe also it just muddies the waters.
It bothers me that all of the messages and warnings from library-ing packages at the beginning of an Rmd file end up in the final knit document, so a typical code chunk from a knit document looks like this:

```{r libraries, warning=F, message=F}
library(ggplot2)
library(tidyr)
library(mosaic)
```
but the notebooks are pretty twitchy about including the messages. Some extra hammering leads me to the (possibly ill-formed) conclusion that the very first time you library the package it will include the messages etc., but if you run it a second time it goes away. I hate having to skip all that blah all the time, so I googled up this post of several ideas, none for notebooks:
https://stackoverflow.com/questions/13090838/r-markdown-avoiding-package-loading-messages
so now I see two options. First (leaves the messages in the Rmd output but hides them from the html):

```{r libraries, results='hide'}
library(ggplot2)
library(tidyr)
library(mosaic)
```
and second (which also hides them from the Rmd output when running the chunk):

```{r libraries, results='hide', message=F}
library(ggplot2)
library(tidyr)
library(mosaic)
```
I prefer the first for now since the user still sees the occasionally useful messages in the Rmd window but I don't have to see them in student assignments.
MmmmKay?
The randomization modules could align a little better with the ISRS textbook if `shuffle` were presented before `rflip`, since the first simulation discussion occurs in the context of the gender discrimination study.
(In some ways this is not a particularly natural ordering. Maybe the better thing to do long-term is to just ensure that R modules 6 and 7 are independent, so that you could assign them in either order.)
I had a bunch of students struggle with making a factor (probably no surprise), but then the fact that the code to make the factor and the code to assign it to something else are in the same chunk caused additional issues. For example, in module 04_Categorical_data.Rmd, the example using `factor()` is:
```{r}
race <- factor(birthwt$race,
               levels = c(1, 2, 3),
               labels = c("White", "Black", "Other"))
race_df <- data.frame(race)
```
I'm recommending that we split this into two separate code chunks, for two reasons:

1. Some students had broken `factor()` code, so there was no vector called `race` for the assignment to find. If the chunks were separate, they would get an error about the `factor()` code, which is what was broken 97.547865% of the time. (Really the issue isn't this example, but when students copy, paste, and adapt it.)
2. Some students would run only the `factor()` part and not the bottom part, or they would always include the `race_df <- data.frame(race)` part even when they wanted to do something else. They thought you had to do all of it no matter what.

The explanation of the pieces is already all there; I'd just separate the code and explanations.
This is a general suggestion for any of the modules that deal with hypothesis testing. When we ask students to write the null and alternative hypotheses in symbols, or when we model how to do this, I suggest that we include in the subscript something about both categorical variables. This will help clarify which of the four proportions in the relative frequency table it is that we're comparing.
For instance, in the example in R module 8, we're investigating whether there's an association between smoking and low birth weight. We write the null hypothesis as:
H0: pnonsmoker - psmoker = 0
I'm suggesting that it would be more informative for us to say:
H0: plow,nonsmoker - plow,smoker = 0
I think it'd help students understand the interpretation of the p-value if we separate the computing-the-number part from the writing-the-sentence part. Then they'd be able to see that, oh, I computed the number right, but now I have to say what it means.
Module 04, Question 4a: the answer in the rubric lists the % of white women who smoke during pregnancy as 52.4%. It should be 54.2%, I think.
Might be worth leaving as check on rubric copying, as long as QUARC consultants know about it.
"Notice that the all the times are missing" is at the end of Exercise 11, remove extra words...
Add `{-}` to the web version to remove section numbering.
A suggestion: can we add `geom_qq_line()` to all of the example QQ plots? I have been asking my students to do this, and at least some of them are more aware of patterns in the QQ plot since the reference line is much more obvious; for example, deviations in the slope and tails are more obvious.
(PS I don't know if you remember how obsessed I got with these lines a few years ago, before geom_qq_line() existed but I'm simultaneously excited that it exists and a bit let down that it is now so easy. When I was a boy, we had to make ours by hand...)
just a thought...
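For reference, the change would just be one extra layer per example (sketched here on `mtcars` rather than the modules' data):

```r
library(ggplot2)

# geom_qq_line() adds the reference line through the theoretical
# quantiles, making slope and tail deviations easier to spot.
ggplot(mtcars, aes(sample = mpg)) +
  geom_qq() +
  geom_qq_line()
```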
I'm noticing that students, like, really don't understand how to "chunk up" R code into meaningful pieces. I wonder if better syntax highlighting would help. In particular, the only things that currently get a different color in code chunks are string literals and boolean literals, and also in particular, functions don't -- so I don't think students see what I mean when I say "the chisq.test function" or whatever.
A couple of thoughts: it's not quite clear how to get the help page for `hflights` to show up in the Help window if we haven't already loaded the `hflights` package with `library(hflights)`. Some suggested fixes:

- Run `library(hflights)` and then `?hflights` in the console.
- Type `hflights` in the search box in the Help tab and look through the resulting list of possible results to find the one they want.

The module text reads: "Using ggplot code, create the type of graph you identified above. Be sure to put involact on the y-axis and race` on the x-axis." There is a missing backtick in front of race.
Here's some text from Exercise 7d:
Using the normal model you just developed, determine how likely the drug trial data will be to show the drug as "effective" according to the 85% standard. In other words, how often will our sample give us a result that is 85% or higher (even though secretly we know the true effectiveness is only 83%)? (Hint: you'll need to use the `pdist` command.)
A lot of people are answering this just by drawing the `pdist` graph and not saying anything about it, and we can't really knock them based on this wording. Suggest the following text instead:
Using the normal model you just developed, determine how likely the drug trial data will be to show the drug as "effective" according to the 85% standard. In other words, how often will our sample give us a result that is 85% or higher (even though secretly we know the true effectiveness is only 83%)? Report your answer in a contextually meaningful full sentence using inline R code. (Hint: you'll need to use the `pdist` command.)
We might even create a separate part to ask them to draw the picture, something along the lines of the instructions in R module 10.
Because I posted a "clean" version of the HTML document with no code run, the Melanoma data table does not appear as promised. I need to see if I can run just that one code chunk without showing the output from all the others.
I need to add the final subsections to Chapter 8 about finalizing the assignment.
Okay, so here's a funny thing:
In R module 9, let's say that students code their ulcer variable "backwards":
`ulcer <- factor(Melanoma$ulcer, levels = c(0, 1), labels = c("absent", "present"))`
Well, then if they do `ulcer_prop <- prop(ulcer, data = ulcer_df)`, they get 0.561ish, and so they'll want to use `1 - ulcer_prop` for most of their calculations. However, this gives you the wrong two-sided p-value: `P2 <- 2 * prop(sims2$prop <= 1 - ulcer_prop)` results in 0.07.
The reason for this appears to be a rounding issue. Try doing `sims2 %>% filter(prop <= 1 - ulcer_prop) %>% arrange(prop)` versus `sims2 %>% filter(prop <= 90/205) %>% arrange(prop)` and you'll see what I mean: the former results in 35 rows, while the latter results in 50 rows, including some with a proportion equivalent to 90/205.
I think the easiest fix is to just mention 0.07 as another possibility for the two-sided p-value in the rubric, or else to very specifically instruct students to put the "success" criterion first.
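If we go the "specify success explicitly" route, a hedged sketch, assuming mosaic's `prop()` accepts a `success` argument in the version we ship (worth double-checking):

```r
library(MASS)    # for the Melanoma data
library(mosaic)

ulcer <- factor(Melanoma$ulcer, levels = c(0, 1),
                labels = c("absent", "present"))
ulcer_df <- data.frame(ulcer)

# Naming the success level removes the ambiguity about which
# proportion prop() reports, so students never need 1 - ulcer_prop.
ulcer_prop <- prop(~ ulcer, data = ulcer_df, success = "present")
```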
In Rmd 17 around line 197, the normal distribution is plotted in red and the T distribution is plotted in green. Suggest using different colors to be friendlier to students who are red-green colorblind.
I've noticed that the line numbers reported in error messages don't correspond very well to the line numbers given in the gutters of the .Rmd files. Usually they're low by, say, 10 or 20, which makes me think that it has to do with R not counting lines in the preamble.
In R module 4, Exercise 8 asks:
"Modify the following side-by-side bar chart by adding a title and labels for both the fill variable and the x-axis variable."
We've previously seen examples of adding labels for the x-axis and y-axis variables, but not the fill variable. It's reasonable enough for students to guess that the required command should indeed be `labs(..., fill = "desired label")`, but perhaps we want to give them a more explicit example, or else clue them in explicitly that `fill` is a thing you can tell `labs` about.
Some graphs are hard for me to see. (For example, the last graph in Chapter 7, including the "side" predictor.) How much overhead would there be to teach students about using colorblind-friendly palettes? It might be enough to set a theme by default in the header. Then again, maybe it's useful to teach students more explicitly how to do it and why it's important.
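A low-overhead sketch using the Okabe-Ito palette, hard-coded here so it works on any ggplot2 version (`mtcars` is just illustrative):

```r
library(ggplot2)

# Okabe-Ito: a widely used colorblind-friendly palette.
okabe_ito <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442",
               "#0072B2", "#D55E00", "#CC79A7", "#000000")

ggplot(mtcars, aes(x = factor(cyl), fill = factor(am))) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = okabe_ito)
```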