
ims-tutorials's Introduction

OpenIntro::Introduction to Modern Statistics Tutorials

Learnr tutorials supporting OpenIntro::Introduction to Modern Statistics. Each tutorial corresponds to one part of the book and comprises multiple lessons, for a total of 35 lessons that cover the entire content of the book.

All tutorial content has been developed by Mine Çetinkaya-Rundel, Jo Hardin, Ben Baumer, and Andrew Bray and implemented in learnr with the help of Yanina Bellini Saibene, Florencia D’Andrea, Roxana Noelia Villafañe, and Allison Theobold.

ims-tutorials's People

Contributors

aaronrendahl, atheobold, hardin47, mamcisaac, mine-cetinkaya-rundel, npaterno, petzi53, russellrayner, staceyhancock


ims-tutorials's Issues

Survey data without context in exercises

In the Statistical inference tutorials, the GSS data is used. A subsample of the NORC data is used for the exercises. It's not clear how that subsample was selected, which isn't important for students to know. However, it is important to note that this data DOES come from a survey.

Suggestions to improve this include:

  • Ignore the multi-year data and only introduce the 2016 data; the multi-year data is never used.
  • Tie it back to the Sampling strategies and experimental design lesson and discuss the stratification/clustering used in the 2016 GSS, and/or include the 2016 GSS as an example in that section.
  • Mention that this data comes from a complex survey design and that appropriate inference would account for this design. The methods used in these tutorials are appropriate for simple random samples or experimental data; SRS data is very rare in practice, and more complex methods should be used for complex survey data. That's not in scope for this course, but the srvyr package would provide the easiest way to transition students to doing this in a more advanced course (see the sketch below).
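For reference, a minimal sketch of what that transition could look like with srvyr. The object and variable names here (gss2016, vpsu, vstrat, wtssall, cappun) are assumptions for illustration, not taken from the tutorials:

library(dplyr)
library(srvyr)

# Declare the complex design (placeholder design variables)
gss_design <- gss2016 %>%
  as_survey_design(ids = vpsu, strata = vstrat, weights = wtssall, nest = TRUE)

# Design-aware proportion with its standard error
gss_design %>%
  group_by(cappun) %>%
  summarize(prop = survey_mean())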

02-02-lesson

The explanation of the cars dataset in lesson 2 of chapter 2 (at the beginning, after the first subheader) refers to int and dbl variables. In the dataset itself, students see only dbl variables. Except for eng_size, they are actually all integers, but they are loaded in the setup chunk with read_csv() as dbl.

A simple change in the text is perhaps not advisable, because one of the aims of this part is to explain the difference between integers and floating points. Another solution would be to change the data type in the setup chunk. Even if this is not very practical, since R computes with doubles during calculations anyway, I would suggest this kind of solution from an educational perspective.
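As a rough sketch of that kind of setup-chunk change (the file path is a placeholder, and eng_size is the only column name I'm sure about):

library(readr)
library(dplyr)

# Read as usual, then convert every whole-number column except eng_size to integer
cars <- read_csv("data/cars.csv") %>%
  mutate(across(where(is.numeric) & !any_of("eng_size"), as.integer))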

If you agree, I could prepare a PR.

Tutorial 2 Lesson 4 - Case Study: Check-in 1

In the Tutorial 2 Lesson 4 - Case Study: Check-in 1 "Image and spam interpretation" section, the question posed is:
Which of the following interpretations of the plot is valid?

[Screenshot of the check-in question]

Could it be that the "correct" answer is inaccurate? One of the two "correct" responses states, "Emails with images have a higher proportion that are spam than do emails without images." Reviewing the proportion graph above the question, this is NOT a valid interpretation.

Tutorial 4 - Lesson 2: Randomization test, Why 0.05?, Sample size in randomization distribution, variable name

In "Tutorial 4 - Lesson 2: Randomization test", in the subsection "Why 0.05?" under "Sample size in randomization distribution" we have to compute two contigency tables.

The exercise text says: "1. Tabulate the small dataset, gender_discrimination_small. That is, call count(), passing the gender and promote columns, to get a contingency table."

The text suggests that "promote" is a variable in the dataset, while the variable is actually called "decision".
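For example, the call that matches the data as it stands would be something like this (assuming dplyr is attached, as in the tutorial):

# Use the column that actually exists in the dataset
gender_discrimination_small %>%
  count(gender, decision)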

Tutorial 2 - Lesson 4 - Emails - Problem with dataset

At the start of this lesson, it is mentioned that the dataset comprises "3,921 emails". In the dataset preview, one can see that only "1,000" rows are listed. Could it be that the dataset is truncated to only 1,000 observations?

In the following exercise, where you have to "group_by(spam)", there is only one case ("not spam"). Is it possible that there are no "spam" cases in the dataset?

I noticed the same issue in "Tutorial 2 - Lesson 1: Visualizing categorical data", where the text initially speaks of "over 23,000" cases, but in the end only "1,000 rows" are listed in the dataset.
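A quick way to check against the full data, assuming the tutorial's email object is meant to match openintro::email:

library(openintro)  # ships the full email data
library(dplyr)

nrow(email)             # 3,921 rows in the full dataset
email %>% count(spam)   # should show both spam and not-spam emails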

High traffic on ShinyApps

Many thanks for putting all this material together. I had assigned the tutorials for my class, but the IMS tutorials are seemingly too popular: the Shiny server window stays stuck at "Please wait" nearly indefinitely.

Outside of increasing capacity (which costs money), is there any alternative way to access the tutorials? e.g., I normally bundle my learnr files in a package that can be installed locally.

Revisit hints for Tutorial 8

Many of these are Markdown hints, but they're in R Markdown chunks. They should be replaced with Markdown hints as described here (a sketch of the target format is below).
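For reference, my understanding of the learnr convention is that a Markdown-formatted hint goes in a div whose id is the exercise label plus "-hint", rather than in an R chunk; roughly (the "filter" label is just an example):

<div id="filter-hint">
**Hint:** You may want to use the dplyr `filter()` function.
</div>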

01-01-lesson: Several typos

I am going to suggest several PRs, mostly typos. As I am not very experienced, please tell me if I am doing it the right way. No problem for me if you don't want these kinds of PRs. Tell me, and I will stop.

Put all tutorials in a package

So that they can be easily launched from the RStudio IDE Tutorials pane.

If so, it might make sense to add the data needed only for these tutorials to that package, which touches on #4.
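If that happens, launching a lesson locally could look roughly like this; the package name "imstutorials" and the tutorial name are placeholders, not real identifiers:

# remotes::install_github("openintrostat/ims-tutorials")  # hypothetical install path

library(learnr)
run_tutorial("02-01-lesson", package = "imstutorials")  # placeholder names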

Hints in divs don't show solution

It looks like when hints are given in divs instead of an R chunk, one can't scroll to the solution. I changed how hints are coded in http://openintro.shinyapps.io/ims-05-infer-05/ to confirm that when hints are in chunks, you can scroll to the solution, but the downside is they're not formatted nicely.

I need to check to make sure this is true with the latest (dev) version of learnr, and if not, open a feature request. Leaving the issue here for reference.

cc @petzi53

Move data files into openintro package

The following might not be an exhaustive list.

Tutorial 3 - Lesson 5: Model fit, last "Your turn!" - two issues

In the last "Your turn" of "Tutorial 3 - Lesson 5: Model fit" there are two issues:

  1. Typo: in the text "1. Use filter() to create a new dataset named nontrivial_players consisting of the players from mlbbat10 with at least 10 at-bats and and OBP of below 0.500.", the word "and" appears twice in a row.

  2. The sentence "2. Create a visualization of the relationship between at bats and obp for both the new and old datasets" asks for a visualization of the relationship between "at bats" (at_bat) and "obp" (obp). But the hints then use "slg" and "obp" instead of "at_bat" and "obp" (see the sketch after this list).
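Presumably the intended code is along these lines; a sketch matching the exercise text, assuming the openintro mlbbat10 column names at_bat and obp:

library(dplyr)
library(ggplot2)
library(openintro)  # for mlbbat10

# Players with at least 10 at-bats and an OBP below 0.500
nontrivial_players <- mlbbat10 %>%
  filter(at_bat >= 10, obp < 0.500)

# Relationship between at-bats and OBP for the filtered players
ggplot(nontrivial_players, aes(x = at_bat, y = obp)) +
  geom_point()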

Tutorial 3 - Lesson 5: Model fit, Your turn - wrong instruction

In "Tutorial 3 - Lesson 5: Model fit" in the first "Your turn" subsection, there is the following instruction in the text:

"2. Use the bdims_augment_wgt_hgt dataframe to compute the R2 of wgt_hgt_mod manually using the formula above. Hint: wgt is the response variable in this linear regression!"

There is no "bdims_augment_wgt_hgt" dataframe in the workspace. We rather have to use "wgt_hgt_augment".
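With the object that does exist, the manual computation is roughly as follows (assuming wgt_hgt_augment is the broom::augment() output of wgt_hgt_mod, so it has wgt and a .resid column):

library(dplyr)

# R-squared as 1 minus the ratio of residual variance to response variance
wgt_hgt_augment %>%
  summarize(R_squared = 1 - var(.resid) / var(wgt))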

Tutorial 3 - Lesson 5: Model fit, Measures of model fit - RMSE typo

In the "Measures of model fit" subsection of "Tutorial 3 - Lesson 5" under "RMSE" the text says:

"The RMSE also generalizes to any kind model for a single numerical response, so it is not specific to regression models."

Maybe it should be "any kind OF model" - just a suggestion.

Tutorial 4 - Lesson 2: Randomization test, What is a p-value?, Calculating the p-values - problem with code exercise

In "Tutorial 4 - Lesson 2: Randomization test", in the subsection "What is a p-value?", under "Calculating the p-values", there is a code exercise, where you have to use the "visualize()" and "get_p_value" functions.

Unfortunately, I am not able to solve this exercise, as the console always returns "argument is of length zero".

I was not sure whether this problem is due to my inability to solve the exercise or to some other issue. Could you help me clarify whether my code is wrong?

# Visualize and calculate the p-value for the original dataset
gender_discrimination_perm %>%
  visualize(obs_stat = diff_orig, direction = "greater")

gender_discrimination_perm %>%
  get_p_value(obs_stat = diff_orig, direction = "greater")

# Visualize and calculate the p-value for the small dataset
gender_discrimination_small_perm %>%
  visualize(obs_stat = diff_orig_small, direction = "greater")

gender_discrimination_small_perm %>%
  get_p_value(obs_stat = diff_orig_small, direction = "greater")

# Visualize and calculate the p-value for the big dataset
gender_discrimination_big_perm %>%
  visualize(obs_stat = diff_orig_big, direction = "greater")

gender_discrimination_big_perm %>%
  get_p_value(obs_stat = diff_orig_big, direction = "greater")

Phrasing correction, 01-01-lesson

The same positive relationship between math and science scores is still apparent. But we can also see that students in academic programs, shown with green points, tend to score higher than those in vocational programs, in blue, and general programs, in red.

From Ezra Halleck:

"But we can also see that students in academic programs, shown with green points, tend to score higher than those in vocational programs, in blue, and general programs, in red."

It should instead say:

"But we can also see that students in academic programs, shown with green points, tend to have higher math scores relative to their science scores than those in vocational programs, in blue, and general programs, in red."

IMS-02-explore-03 gap2007 not found

From Bradley Lubich ([email protected])

R Tutorial. IMS-02-explore-03. Section "Your Turn!" (concluding measures of variability)

Even when copying to clipboard, this chunk of code

# Compute stats for lifeExp in Americas
gap2007 %>%
  filter(continent == "Americas") %>%
  summarize(mean(lifeExp),
            sd(lifeExp))

returns error "object gap2007 not found" even though previous code chunks referring to object gap2007 were okay.

ims-02-explore-04/#section-check-in-1 - mc question correct answer wrong

From Bradley Lubich ([email protected])

R Tutorial. ims-02-explore-04/#section-check-in-1

The review question asks: Which of the following interpretations of the plot is valid?
1.) There are more emails with images than without images.
2.) Emails with images have a higher proportion that are spam than do emails without images.
3.) An email without an image is more likely to be not spam than spam.

The stated correct answer is (2 and 3), but number 2 is not correct as stated. Either number 2 should be edited so that it is correct, or the correct answer should be (3 only).

Tutorial 2 lesson 4: number and spam interpretation

Could there be a problem with the interpretation here? The question asks which interpretation is not valid. The third answer, "Given that an email contains a big number, it is more likely to be not spam", seems valid, so should it really be marked as correct = TRUE? I also don't follow the message incorrect = "Given the kind of number an email contains, we can't say whether it'll be more or less likely to be spam". This seems like precisely what we can say from a segmented bar chart.

question("Which of the following interpretations of the plot **is not** valid?",
  answer("Given that an email contains a small number, it is more likely to be 
not spam"),
  answer("Given that an email contains no number, it is more likely to be spam.", correct = TRUE),
  answer("Given that an email contains a big number, it is more likely to be not spam.", correct = TRUE),
  answer("Within both spam and not spam, the most common number is a small one.", message = "Wrong!"),
  incorrect = "Given the kind of number an email contains, we can't say whether it'll be more or less likely to be spam.",
  allow_retry = TRUE
)

Tutorial 4 - Lesson 1: Sampling variability, Random permutation, Randomized data under null model of independence, "homes" not loaded in environment

In "Tutorial 4 - Lesson 1: Sampling variabilitiy" under the subsection "Random permutation" and the subsubsection "Randomized data under null model of independence", there are three "R Code" exercises, where we are introduced to the "specify", "hypothesize" and "generate" commands. In these three exercises we need the "homes" data.frame, but they are not present in the environment in the second and third exercise, therefore the code won't execute.

Tutorial 3, lesson 3

The indicator "west_coast" is being referred to as "coast" later in the document. Perhaps better to change it to "coast" from the beginning and refer to the levels as "west" and "not west" ?

`arrange_()` is deprecated warning

The offending lesson is this one.

Need to track down what's causing it.

Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
Please use `arrange()` instead.
See vignette('programming') for more help
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
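If the call turns out to be in the tutorial code rather than in a dependency, the usual fix is simply the non-underscore verb. A self-contained illustration with toy data (not from the lesson):

library(dplyr)

df <- tibble(group = c("a", "b", "c"), n = c(3, 10, 5))

# Deprecated form that triggers the warning:
# df %>% arrange_(~ desc(n))

# Current equivalent:
df %>% arrange(desc(n))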

Tutorial 3 Lesson 3 Understanding Linear Models

For the Tutorial 3 Lesson 3 Understanding Linear Models section, within the Your Turn! portion, should the answer comment read "The E represents the noise and the β0 represents the y-intercept."?


[Screenshot of the answer feedback]

Figures Missing in 01-02

The figures named design-observational-association and design-experiment-causation are missing from the images folder in episode 02 of lesson 01.

Tutorial 4 - Lesson 3: Errors in hypothesis testing, Example: opportunity cost, Opportunity cost conclusion typo?

In "Tutorial 4 - Lesson 3: Errors in hypothesis testing" in subsection "Example: opportunity cost" under "Opportunity cost conclusion" in the last code field, we get returned a p-value of "0.007", although we computed a value of "0.017". Then, in the last single choice question, the text says "Based on this result of 0.017, what can you conclude from the study about the effect of reminding students to save money?", so again, we have the value "0.017".

Could it be that the value of "0.007" is mentioned there by mistake?

Tutorial 4 - Lesson 4: Parameters and confidence intervals, Bootstrapping, Variability of p̂ from the sample (bootstrapping), difference in number mentioned

In "Tutorial 4 - Lesson 4: Parameters and confidence intervals", in subsection "Boostrapping" under "Standard error" the standard error is computed as "0.08683128". In the text later on, under "Variability of p^ from the sample (bootstrapping)" the standard error is said to be " 0.085". This value appears two times, while it probably should be "0.087" instead.

Create feedback form

It would be nice to have a feedback form for the tutorials. I'm not sure whether it's best to add it to the end of each lesson or just on the index page for all lessons in a tutorial.

We should keep it short. Google Forms seems easy but I know some have concerns around the data it collects. We could look into something a bit more lightweight as well.

Typo in Foundations of Inference: Lesson 2- Randomization Test

From Lori McCune:

I think there's a typo in the tutorial Foundations of Inference: Lesson 2- Randomization Test. In the example for gender discrimination, the variable diff_orig calculates the difference as female-male, but the randomization technique calculates the difference as male-female. This means that the one-sided p-values that are calculated using diff_orig are all close to 1 rather than zero.
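One way to keep the two consistent is to pass the same explicit order to both calculations, assuming the infer workflow is used for both. The dataset name, formula, and level names below are guesses for illustration:

library(infer)

# Observed difference, promoted males minus promoted females (placeholder names)
diff_orig <- gender_discrimination %>%
  specify(decision ~ gender, success = "promoted") %>%
  calculate(stat = "diff in props", order = c("male", "female"))

# Randomization distribution computed with the same order
gender_discrimination_perm <- gender_discrimination %>%
  specify(decision ~ gender, success = "promoted") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in props", order = c("male", "female"))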

Tutorial 3 - Lesson 2: Correlation, Problem with Exercise (The Anscombe dataset)

When trying to solve the "Compute properties of anscombe" exercise, I get the following error:

"Problem with summarise() column correlation_between_x_and_y. ℹ correlation_between_x_and_y = cor(x, y). ✖ 'x' must be numeric ℹ The error occurred in group 1: set = ""."

I get this error even after using the code provided in the hints.

Tutorial 4 - Lesson 4: Parameters and confidence intervals, Variability in p̂ , Bootstrap t-confidence interval, definition of the p_hat variable

In "Tutorial 4 - Lesson 4: Parameters and confidence intervals", in subsection "Variability in p̂ ", under "Bootstrap t-confidence interval", we have to compute the bootstrap t-confidence interval. I am not sure, whether you defined the "p_hat" variable, as intended by the exercise text.

From the text, it seems that for "lower" and "upper" you want "p_hat +/- 2 * sd(stat)", when in the coding exercise we actually have to write "p_hat$stat +/- 2 * sd(stat)".

So I wanted to ask: did you mean to define "p_hat" as a numeric vector instead of a data.frame? (A sketch of that version is below.)
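For illustration, the numeric-vector version could look something like this; one_poll_boot and the stat column are placeholder names based on the exercise description, not the actual objects:

library(dplyr)

# Pull the observed proportion out of the one-row summary so p_hat is a single number
p_hat <- p_hat %>% pull(stat)

# Then the interval reads exactly as the exercise text implies
lower <- p_hat - 2 * sd(one_poll_boot$stat)
upper <- p_hat + 2 * sd(one_poll_boot$stat)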

01-03-lesson

The tutorial on sampling and experimental design (chapter 1 lesson 3) might have a problem.
[Screenshot showing the problem]

Broken tutorial for Case Study, Section 1(01-data-04)

The Case Study in Section 1 (01-data-04) now returns an error page with the message:

processing file: 01-04-lesson.Rmd
output file: 01-04-lesson.knit.md


Output created: 01-04-lesson.html
Error in value[[3L]](cond) : lexical error: invalid char in json text.
                                       NA
                     (right here) ------^
Calls: local ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
Execution halted

It did work a couple of days ago...

Tutorial 4 - Lesson 2: Randomization test, What is a p-value?, Practice calculating p-values, wrong naming of datasets

In "Tutorial 4 - Lesson 2: Randomization test", in the subsection "What is a p-value?" under "Practice calculating p-values" there is the following sentence in the exercise text:

"The gender_discrimination_perm and gender_discrimination_perm datasets are also available in your workspace."

Could it be that the sentence should instead refer to the two objects "gender_discrimination_new_perm" and "diff_orig_new" that we need for the following exercises?

Some thought on 'Perception of correlation'

It is a great idea to train the perception of correlation. But maybe its implementation could be improved?

To make an estimate without knowing the correct value, students have to delete the unfinished code in the exercise chunk, run the code for the plot, make their guess, and then write or paste the code for the correlation calculation again. This seems laborious.

A simple solution would be to separate these two actions, e.g. either have two exercise chunks (one for the plot, the other for the calculation) or just display the plot along with a single exercise chunk where students are asked to do the calculation.

Another possibility is to develop a trial-and-error exercise using the gradethis package. To see how this feels, I designed a demo. A native speaker should probably write better feedback messages, and I am not sure whether the random praise and encouragement messages should be used. But I think these kinds of exercises would be more challenging and, especially, more fun. (A rough sketch of the idea is below.)
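For concreteness, a gradethis check for a guessed correlation might look something like this; the exercise label, the plot_data object, and the tolerance are all made up for illustration:

```{r guess-cor, exercise = TRUE}
# Type your guess for the correlation between x and y, e.g. 0.7

```

```{r guess-cor-check}
gradethis::grade_this({
  true_cor <- cor(plot_data$x, plot_data$y)  # plot_data is a placeholder name
  if (abs(.result - true_cor) < 0.1) {
    pass("Close enough - nicely estimated!")
  } else {
    fail("Not quite. Look at the scatterplot again and try another value.")
  }
})
```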
