
ims-tutorials's Introduction

OpenIntro::Introduction to Modern Statistics Tutorials

Learnr tutorials supporting OpenIntro::Introduction to Modern Statistics. Each tutorial corresponds to one part of the book and comprises multiple lessons, for a total of 35 lessons that cover the entire content of the book.

All tutorial content has been developed by Mine Çetinkaya-Rundel, Jo Hardin, Ben Baumer, and Andrew Bray and implemented in learnr with the help of Yanina Bellini Saibene, Florencia D’Andrea, Roxana Noelia Villafañe, and Allison Theobold.

ims-tutorials's People

Contributors

aaronrendahl, atheobold, hardin47, mamcisaac, mine-cetinkaya-rundel, npaterno, petzi53, russellrayner, staceyhancock


ims-tutorials's Issues

Survey data without context in exercises

In the Statistical inference tutorials, the GSS data is used. A subsample of the NORC data is used for the exercises. It's not clear how that subsample was selected, which isn't important for students to know. However, it is important to note that this data DOES come from a survey.

Suggestions to improve this include:

  • Ignore the multi-year data and only introduce the 2016 data; the multi-year data is never used.
  • Tie it back to the Sampling strategies and experimental design lesson and discuss the stratification/clustering used in the 2016 GSS, and/or include the 2016 GSS as an example in that section.
  • Mention that this data comes from a complex survey design and that appropriate inference would account for this design. The methods used in these tutorials are appropriate for simple random samples or experimental data; SRS data is very rare in practice, and more complex methods should be used for complex survey data. That's not in scope for this course, but the srvyr package would provide the easiest way to transition students to doing this in a more advanced course (see the sketch below).
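For reference, a minimal sketch of what that transition could look like with srvyr. The object and variable names here (gss2016, vpsu, vstrat, wtssall, cappun) are assumptions for illustration, not taken from the tutorials:

library(dplyr)
library(srvyr)

# Declare the complex design (placeholder design variables)
gss_design <- gss2016 %>%
  as_survey_design(ids = vpsu, strata = vstrat, weights = wtssall, nest = TRUE)

# Design-aware proportion with its standard error
gss_design %>%
  group_by(cappun) %>%
  summarize(prop = survey_mean())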

02-02-lesson

The explanation of the cars dataset in lesson 2 of chapter 2 (at the beginning, after the first subheader) refers to int and dbl variables. In the dataset itself, students see only dbl variables. Except for eng_size, they are actually all integers, but they are loaded in the setup chunk with read_csv() as dbl.

A simple change in the text is perhaps not advisable, because one of the aims of this part is to explain the difference between integers and floating points. Another solution would be to change the data type in the setup chunk. Even if this is not very practical, since R computes with doubles during calculations anyway, I would suggest this kind of solution from an educational perspective.
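As a rough sketch of that kind of setup-chunk change (the file path is a placeholder, and eng_size is the only column name I'm sure about):

library(readr)
library(dplyr)

# Read as usual, then convert every whole-number column except eng_size to integer
cars <- read_csv("data/cars.csv") %>%
  mutate(across(where(is.numeric) & !any_of("eng_size"), as.integer))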

If you agree, I could prepare a PR.

Tutorial 2 Lesson 4 - Case Study: Check-in 1

In the Tutorial 2 Lesson 4 - Case Study: Check-in 1 "Image and spam interpretation" section, the question posed is:
Which of the following interpretations of the plot is valid?

[Screenshot of the check-in question]

Could it be that the "correct" answer is inaccurate? One of the two "correct" responses states, "Emails with images have a higher proportion that are spam than do emails without images." Reviewing the proportion graph above the question, this is NOT a valid interpretation.

Tutorial 4 - Lesson 2: Randomization test, Why 0.05?, Sample size in randomization distribution, variable name

In "Tutorial 4 - Lesson 2: Randomization test", in the subsection "Why 0.05?" under "Sample size in randomization distribution" we have to compute two contigency tables.

The exercise text says: "1. Tabulate the small dataset, gender_discrimination_small. That is, call count(), passing the gender and promote columns, to get a contingency table."

The text suggests that "promote" is a variable in the dataset, while the variable is actually called "decision".
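For example, the call that matches the data as it stands would be something like this (assuming dplyr is attached, as in the tutorial):

# Use the column that actually exists in the dataset
gender_discrimination_small %>%
  count(gender, decision)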

Tutorial 2 - Lesson 4 - Emails - Problem with dataset

At the start of this lesson, it is mentioned that the dataset comprises "3,921 emails". In the dataset preview, one can see that only "1,000" rows are listed. Could it be that the dataset is truncated to only 1,000 observations?

In the following exercise, where you have to "group_by(spam)", there is only one case ("not spam"). Is it possible that there are no "spam" cases in the dataset?

I noticed the same issue in "Tutorial 2 - Lesson 1: Visualizing categorical data", where the text initially speaks of "over 23,000" cases, but in the end only "1,000 rows" are listed in the dataset.
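A quick way to check against the full data, assuming the tutorial's email object is meant to match openintro::email:

library(openintro)  # ships the full email data
library(dplyr)

nrow(email)             # 3,921 rows in the full dataset
email %>% count(spam)   # should show both spam and not-spam emails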

High traffic on ShinyApps

Many thanks for putting all this material together. I had assigned the tutorials for my class, but the IMS tutorials are seemingly too popular: the Shiny server window stays stuck at "Please wait" nearly indefinitely.

Outside of increasing capacity (which costs money), is there any alternative way to access the tutorials? e.g., I normally bundle my learnr files in a package that can be installed locally.

Revisit hints for Tutorial 8

Many of these are Markdown hints, but they're in R Markdown chunks. They should be replaced with Markdown hints as described here (a sketch of the target format is below).
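For reference, my understanding of the learnr convention is that a Markdown-formatted hint goes in a div whose id is the exercise label plus "-hint", rather than in an R chunk; roughly (the "filter" label is just an example):

<div id="filter-hint">
**Hint:** You may want to use the dplyr `filter()` function.
</div>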

01-01-lesson: Several typos

I am going to suggest several PRs, mostly typos. As I am not very experienced, please tell me if I am doing it the right way. No problem for me if you don't want these kinds of PRs. Tell me, and I will stop.

Put all tutorials in a package

So that they can be easily launched from the RStudio IDE Tutorials pane.

If so, it might make sense to add the data needed only for these tutorials to that package, which touches on #4.
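If that happens, launching a lesson locally could look roughly like this; the package name "imstutorials" and the tutorial name are placeholders, not real identifiers:

# remotes::install_github("openintrostat/ims-tutorials")  # hypothetical install path

library(learnr)
run_tutorial("02-01-lesson", package = "imstutorials")  # placeholder names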

Hints in divs don't show solution

It looks like when hints are given in divs instead of an R chunk, one can't scroll to the solution. I changed how hints are coded in http://openintro.shinyapps.io/ims-05-infer-05/ to confirm that when hints are in chunks, you can scroll to the solution, but the downside is they're not formatted nicely.

I need to check to make sure this is true with the latest (dev) version of learnr, and if not, open a feature request. Leaving the issue here for reference.

cc @petzi53

Move data files into openintro package

The following might not be an exhaustive list.

Tutorial 3 - Lesson 5: Model fit, last "Your turn!" - two issues

In the last "Your turn" of "Tutorial 3 - Lesson 5: Model fit" there are two issues:

  1. Typo: in the text "1. Use filter() to create a new dataset named nontrivial_players consisting of the players from mlbbat10 with at least 10 at-bats and and OBP of below 0.500.", the word "and" appears twice in a row.

  2. The sentence "2. Create a visualization of the relationship between at bats and obp for both the new and old datasets" asks for a visualization of the relationship between "at bats" (at_bat) and "obp" (obp). But the hints then use "slg" and "obp" instead of "at_bat" and "obp" (see the sketch after this list).
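Presumably the intended code is along these lines; a sketch matching the exercise text, assuming the openintro mlbbat10 column names at_bat and obp:

library(dplyr)
library(ggplot2)
library(openintro)  # for mlbbat10

# Players with at least 10 at-bats and an OBP below 0.500
nontrivial_players <- mlbbat10 %>%
  filter(at_bat >= 10, obp < 0.500)

# Relationship between at-bats and OBP for the filtered players
ggplot(nontrivial_players, aes(x = at_bat, y = obp)) +
  geom_point()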

Tutorial 3 - Lesson 5: Model fit, Your turn - wrong instruction

In "Tutorial 3 - Lesson 5: Model fit" in the first "Your turn" subsection, there is the following instruction in the text:

"2. Use the bdims_augment_wgt_hgt dataframe to compute the R2 of wgt_hgt_mod manually using the formula above. Hint: wgt is the response variable in this linear regression!"

There is no "bdims_augment_wgt_hgt" dataframe in the workspace. We rather have to use "wgt_hgt_augment".
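With the object that does exist, the manual computation is roughly as follows (assuming wgt_hgt_augment is the broom::augment() output of wgt_hgt_mod, so it has wgt and a .resid column):

library(dplyr)

# R-squared as 1 minus the ratio of residual variance to response variance
wgt_hgt_augment %>%
  summarize(R_squared = 1 - var(.resid) / var(wgt))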

Tutorial 3 - Lesson 5: Model fit, Measures of model fit - RMSE typo

In the "Measures of model fit" subsection of "Tutorial 3 - Lesson 5" under "RMSE" the text says:

"The RMSE also generalizes to any kind model for a single numerical response, so it is not specific to regression models."

Maybe it should be "any kind OF model" - just a suggestion.

Tutorial 4 - Lesson 2: Randomization test, What is a p-value?, Calculating the p-values - problem with code exercise

In "Tutorial 4 - Lesson 2: Randomization test", in the subsection "What is a p-value?", under "Calculating the p-values", there is a code exercise, where you have to use the "visualize()" and "get_p_value" functions.

Unfortunately, I am not able to solve this exercise, as the console always returns "argument is of length zero".

I was not sure whether this problem is due to my inability to solve the exercise or to some other issue. Could you help me clarify whether my code is wrong?

# Visualize and calculate the p-value for the original dataset
gender_discrimination_perm %>%
  visualize(obs_stat = diff_orig, direction = "greater")

gender_discrimination_perm %>%
  get_p_value(obs_stat = diff_orig, direction = "greater")

# Visualize and calculate the p-value for the small dataset
gender_discrimination_small_perm %>%
  visualize(obs_stat = diff_orig_small, direction = "greater")

gender_discrimination_small_perm %>%
  get_p_value(obs_stat = diff_orig_small, direction = "greater")

# Visualize and calculate the p-value for the big dataset
gender_discrimination_big_perm %>%
  visualize(obs_stat = diff_orig_big, direction = "greater")

gender_discrimination_big_perm %>%
  get_p_value(obs_stat = diff_orig_big, direction = "greater")

Phrasing correction, 01-01-lesson

The same positive relationship between math and science scores is still apparent. But we can also see that students in academic programs, shown with green points, tend to score higher than those in vocational programs, in blue, and general programs, in red.

From Ezra Halleck:

"But we can also see that students in academic programs, shown with green points, tend to score higher than those in vocational programs, in blue, and general programs, in red."

It should instead say:

"But we can also see that students in academic programs, shown with green points, tend to have higher math scores relative to their science scores than those in vocational programs, in blue, and general programs, in red."

IMS-02-explore-03 gap2007 not found

From Bradley Lubich ([email protected])

R Tutorial. IMS-02-explore-03. Section "Your Turn!" (concluding measures of variability)

Even when copying to clipboard, this chunk of code

# Compute stats for lifeExp in Americas
gap2007 %>%
  filter(continent == "Americas") %>%
  summarize(mean(lifeExp),
            sd(lifeExp))

returns error "object gap2007 not found" even though previous code chunks referring to object gap2007 were okay.

ims-02-explore-04/#section-check-in-1 - mc question correct answer wrong

From Bradley Lubich ([email protected])

R Tutorial. ims-02-explore-04/#section-check-in-1

The review question asks: Which of the following interpretations of the plot is valid?
1.) There are more emails with images than without images.
2.) Emails with images have a higher proportion that are spam than do emails without images.
3.) An email without an image is more likely to be not spam than spam.

The stated correct answer is (2 and 3), but number 2 is not correct as stated. Either number 2 should be edited so that it is correct, or the correct answer should be (3 only).

Tutorial 2 lesson 4: number and spam interpretation

Could there be a problem with the interpretation here? The question asks which interpretation is not valid. The third answer, "Given that an email contains a big number, it is more likely to be not spam", seems valid, so should it really be marked as correct = TRUE? I also don't follow the message incorrect = "Given the kind of number an email contains, we can't say whether it'll be more or less likely to be spam". This seems like precisely what we can say from a segmented bar chart.

question("Which of the following interpretations of the plot **is not** valid?",
  answer("Given that an email contains a small number, it is more likely to be 
not spam"),
  answer("Given that an email contains no number, it is more likely to be spam.", correct = TRUE),
  answer("Given that an email contains a big number, it is more likely to be not spam.", correct = TRUE),
  answer("Within both spam and not spam, the most common number is a small one.", message = "Wrong!"),
  incorrect = "Given the kind of number an email contains, we can't say whether it'll be more or less likely to be spam.",
  allow_retry = TRUE
)

Tutorial 4 - Lesson 1: Sampling variability, Random permutation, Randomized data under null model of independence, "homes" not loaded in environment

In "Tutorial 4 - Lesson 1: Sampling variabilitiy" under the subsection "Random permutation" and the subsubsection "Randomized data under null model of independence", there are three "R Code" exercises, where we are introduced to the "specify", "hypothesize" and "generate" commands. In these three exercises we need the "homes" data.frame, but they are not present in the environment in the second and third exercise, therefore the code won't execute.

Tutorial 3, lesson 3

The indicator "west_coast" is being referred to as "coast" later in the document. Perhaps better to change it to "coast" from the beginning and refer to the levels as "west" and "not west" ?

`arrange_()` is deprecated warning

The offending lesson is this one.

Need to track down what's causing it.

Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
Please use `arrange()` instead.
See vignette('programming') for more help
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
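If the call turns out to be in the tutorial code rather than in a dependency, the usual fix is simply the non-underscore verb. A self-contained illustration with toy data (not from the lesson):

library(dplyr)

df <- tibble(group = c("a", "b", "c"), n = c(3, 10, 5))

# Deprecated form that triggers the warning:
# df %>% arrange_(~ desc(n))

# Current equivalent:
df %>% arrange(desc(n))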

Tutorial 3 Lesson 3 Understanding Linear Models

For the Tutorial 3 Lesson 3 Understanding Linear Models section, within the Your Turn! portion, should the answer comment read "The E represents the noise and the β0 represents the y-intercept."?


[Screenshot of the answer feedback]

Figures Missing in 01-02

The figures named design-observational-association and design-experiment-causation are missing from the images folder in episode 02 of lesson 01.

Tutorial 4 - Lesson 3: Errors in hypothesis testing, Example: opportunity cost, Opportunity cost conclusion typo?

In "Tutorial 4 - Lesson 3: Errors in hypothesis testing" in subsection "Example: opportunity cost" under "Opportunity cost conclusion" in the last code field, we get returned a p-value of "0.007", although we computed a value of "0.017". Then, in the last single choice question, the text says "Based on this result of 0.017, what can you conclude from the study about the effect of reminding students to save money?", so again, we have the value "0.017".

Could it be that the value of "0.007" is mentioned there by mistake?

Tutorial 4 - Lesson 4: Parameters and confidence intervals, Bootstrapping, Variability of p̂ from the sample (bootstrapping), difference in number mentioned

In "Tutorial 4 - Lesson 4: Parameters and confidence intervals", in subsection "Boostrapping" under "Standard error" the standard error is computed as "0.08683128". In the text later on, under "Variability of p^ from the sample (bootstrapping)" the standard error is said to be " 0.085". This value appears two times, while it probably should be "0.087" instead.

Create feedback form

It would be nice to have a feedback form for the tutorials. I'm not sure whether it's best to add it to the end of each lesson or just on the index page for all lessons in a tutorial.

We should keep it short. Google Forms seems easy but I know some have concerns around the data it collects. We could look into something a bit more lightweight as well.

Typo in Foundations of Inference: Lesson 2- Randomization Test

From Lori McCune:

I think there's a typo in the tutorial Foundations of Inference: Lesson 2- Randomization Test. In the example for gender discrimination, the variable diff_orig calculates the difference as female-male, but the randomization technique calculates the difference as male-female. This means that the one-sided p-values that are calculated using diff_orig are all close to 1 rather than zero.
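One way to keep the two consistent is to pass the same explicit order to both calculations, assuming the infer workflow is used for both. The dataset name, formula, and level names below are guesses for illustration:

library(infer)

# Observed difference, promoted males minus promoted females (placeholder names)
diff_orig <- gender_discrimination %>%
  specify(decision ~ gender, success = "promoted") %>%
  calculate(stat = "diff in props", order = c("male", "female"))

# Randomization distribution computed with the same order
gender_discrimination_perm <- gender_discrimination %>%
  specify(decision ~ gender, success = "promoted") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in props", order = c("male", "female"))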

Tutorial 3 - Lesson 2: Correlation, Problem with Exercise (The Anscombe dataset)

When trying to solve the "Compute properties of anscombe" exercise, I get the following error:

"Problem with summarise() column correlation_between_x_and_y. ℹ correlation_between_x_and_y = cor(x, y). ✖ 'x' must be numeric ℹ The error occurred in group 1: set = ""."

I get this error even after using the code provided in the hints.

Tutorial 4 - Lesson 4: Parameters and confidence intervals, Variability in p̂ , Bootstrap t-confidence interval, definition of the p_hat variable

In "Tutorial 4 - Lesson 4: Parameters and confidence intervals", in subsection "Variability in p̂ ", under "Bootstrap t-confidence interval", we have to compute the bootstrap t-confidence interval. I am not sure, whether you defined the "p_hat" variable, as intended by the exercise text.

From the text, it seems that for "lower" and "upper" you want "p_hat +/- 2 * sd(stat)", when in the coding exercise we actually have to write "p_hat$stat +/- 2 * sd(stat)".

So I wanted to ask: did you mean to define "p_hat" as a numeric vector instead of a data.frame? (A sketch of that version is below.)
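For illustration, the numeric-vector version could look something like this; one_poll_boot and the stat column are placeholder names based on the exercise description, not the actual objects:

library(dplyr)

# Pull the observed proportion out of the one-row summary so p_hat is a single number
p_hat <- p_hat %>% pull(stat)

# Then the interval reads exactly as the exercise text implies
lower <- p_hat - 2 * sd(one_poll_boot$stat)
upper <- p_hat + 2 * sd(one_poll_boot$stat)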

01-03-lesson

The tutorial on sampling and experimental design (chapter 1 lesson 3) might have a problem.
[Screenshot showing the problem]

Broken tutorial for Case Study, Section 1(01-data-04)

The Case Study in Section 1 (01-data-04) now returns an error page with the message:

processing file: 01-04-lesson.Rmd
output file: 01-04-lesson.knit.md


Output created: 01-04-lesson.html
Error in value[[3L]](cond) : lexical error: invalid char in json text.
                                       NA
                     (right here) ------^
Calls: local ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
Execution halted

It did work a couple of days ago...

Tutorial 4 - Lesson 2: Randomization test, What is a p-value?, Practice calculating p-values, wrong naming of datasets

In "Tutorial 4 - Lesson 2: Randomization test", in the subsection "What is a p-value?" under "Practice calculating p-values" there is the following sentence in the exercise text:

"The gender_discrimination_perm and gender_discrimination_perm datasets are also available in your workspace."

Could it be that the sentence should instead refer to the two objects "gender_discrimination_new_perm" and "diff_orig_new" that we need for the following exercises?

Some thought on 'Perception of correlation'

It is a great idea to train the perception of correlation. But maybe its implementation could be improved?

To make an estimate without knowing the correct value, students have to delete the unfinished code in the exercise chunk, run the code for the plot, make their guess, and then write or paste the code for the correlation calculation again. This seems laborious.

A simple solution would be to separate these two actions, e.g. either have two exercise chunks (one for the plot, the other for the calculation) or just display the plot along with a single exercise chunk where students are asked to do the calculation.

Another possibility is to develop a trial-and-error exercise using the gradethis package. To see how this feels, I designed a demo. A native speaker should probably write better feedback messages, and I am not sure whether the random praise and encouragement messages should be used. But I think these kinds of exercises would be more challenging and, especially, more fun. (A rough sketch of the idea is below.)
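For concreteness, a gradethis check for a guessed correlation might look something like this; the exercise label, the plot_data object, and the tolerance are all made up for illustration:

```{r guess-cor, exercise = TRUE}
# Type your guess for the correlation between x and y, e.g. 0.7

```

```{r guess-cor-check}
gradethis::grade_this({
  true_cor <- cor(plot_data$x, plot_data$y)  # plot_data is a placeholder name
  if (abs(.result - true_cor) < 0.1) {
    pass("Close enough - nicely estimated!")
  } else {
    fail("Not quite. Look at the scatterplot again and try another value.")
  }
})
```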
