
openintro-statistics's Introduction

Project Organization

  • Each chapter's content is in one of the eight chapter folders that start with "ch_". Within each folder, there is a "figures" folder and a "TeX" folder. The TeX folder contains the text files that are used to typeset the chapters in the textbook.
  • In many cases, R code is supplied with figures to regenerate the figure. It will often be necessary to install the "openintro" R package that is available from GitHub (https://github.com/OpenIntroOrg) if you would like to regenerate a figure. Other packages may also occasionally be required.
  • Exercise figures may be found in [chapter folder] > figures > eoce > [exercise figure folders]. "EOCE" means end-of-chapter exercises.
  • The extraTeX folder contains files for the front and back matter of the textbook and also the style files. Note that use of any style files, like all other files here, is under the Creative Commons license cited in the LICENSE file.

Typesetting the Textbook

The textbook may be typeset using the main.tex file, which pulls in all of the necessary TeX files and figures. For a final version, typeset in the following order:

  • LaTeX 3 times.
  • MakeIndex once.
  • BibTeX once.
  • LaTeX once.
  • MakeIndex once.
  • LaTeX once.

This isn't important for casual browsing, but it is important for a "final" version. The repeated runs account for cross-references shifting between passes: early runs can move content from one page to the next, e.g. as a \ref{...} gets filled in.


Learning LaTeX

If you are not familiar with LaTeX but would like to learn how to use it, check out the slides from two LaTeX mini-courses at

https://github.com/OpenIntroOrg/mini-course-materials

PDFs:

  • Basics of LaTeX
  • Math and BibTeX

For a more authoritative review, the book "Guide to LaTeX" is an excellent resource.

Also, see the branches of this repo by Kwangchun Lee for Korean translations of these mini-course materials.

openintro-statistics's People

Contributors

dahmian, daviddiez, hackettccp, hugheso, ivanistheone, mine-cetinkaya-rundel, openintroorg


openintro-statistics's Issues

incorrect explanation of variance increase for exercise 2.27

The exercise here: https://github.com/OpenIntroStat/openintro-statistics/blob/master/ch_summarizing_data/TeX/review_exercises.tex#L5

has a solution here:

(c)~The new score is more than 1 standard deviation away from the previous mean, so

But the justification is incorrect.

The update equation for the unbiased sample variance, given the mean and sample variance of the n-1 old measurements and a new nth measurement, is:

sampvar_{new} = \frac{n-2}{n-1}\,sampvar_{old} + \frac{1}{n}\left(x_{n}-\bar{x}_{old}\right)^{2}

(you can see this is identical to the second equation at https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm)

So the critical value of x_n for which the sample variance stays the same satisfies

\frac{n}{n-1}\,sampvar = \left(x_{n}-\bar{x}_{old}\right)^{2}

Therefore, there is a factor of 25/24 missing in the justification. I would submit a pull request, but I am not sure if the textbook covers these equations.
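
A quick numerical check of the update equation, written as a base R sketch (the n = 25 scores and the seed are made up for illustration):

# Base R sketch: verify that a new score at the claimed critical
# distance leaves the unbiased sample variance unchanged.
set.seed(1)
x_old <- rnorm(24, mean = 70, sd = 10)  # the n - 1 = 24 old scores
n     <- 25
m_old <- mean(x_old)
v_old <- var(x_old)                     # unbiased sample variance

# critical new value: (x_n - xbar_old)^2 = n/(n-1) * v_old
x_new <- m_old + sqrt(n / (n - 1) * v_old)

v_upd <- (n - 2) / (n - 1) * v_old + (x_new - m_old)^2 / n
v_dir <- var(c(x_old, x_new))
c(v_old, v_upd, v_dir)                  # all three agree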

Pedagogy discussion: alternative sequencing of statistical inference

Hi all,

I'm a college instructor and teach courses in statistics for social science. I have a specific question about pedagogy and ordering of topics in an intro statistics textbook and course. I'm posting this here for two reasons: (1) I believe I'm likely to receive good feedback from the experts working on OpenIntro Statistics (OIS), and (2) I have been hoping to switch my textbook for one of my courses to OIS, and will make a fork of OIS and make the following changes if I'm still convinced they'll improve pedagogy.

For background, the strategy to explain inference in OIS appears to be:

  1. Present repeat-sample inference for proportions (in chapter 5), central limit theorem, confidence intervals and hypothesis tests for proportions using the normal distribution.

  2. Then, in chapter 7, introduce that the sample mean is approximately normal in large samples if sigma is known. Then, in Section 7.1.3, explain that when sigma is not known, the t-distribution is used instead. Then turn to explaining confidence intervals and so on with t-distributions.

This is a very common approach to statistics instruction. I was curious what you thought about an alternative approach:

  1. Present repeat-sample inference for means of numeric variables first, and introduce the central limit theorem, confidence intervals and hypothesis tests for these. (Start with means rather than proportions.)

  2. Explain that if the sample size is large enough, the normal distribution approximation still works well even if the sample standard deviation is plugged in for sigma. Thus avoid introducing the t-distribution at this point, and keep emphasizing the normal distribution approximation works for large samples. Have exercises where students use the normal distribution but plug in s for sigma, rather than using the t-distribution. Call these "large sample confidence intervals and hypothesis tests."

  3. Then point out that proportions are means (the mean of an indicator variable is a proportion), so that all the methods introduced for means can be used for proportions. This basically skips all the proportions analysis in a traditional stats course (binomial distribution, etc.), and I think it communicates the essence of what is done in practice for proportions. (The advantage of using the explicit variance formula for proportions is, I think, a smaller standard error, but this advantage disappears as n -> infinity. There are of course problems with the CLT when either np or n(1-p) is small, but that happens when the sample size is small and the variable is very skewed; an indicator variable is skewed when it is very often 0 or 1. Small sample sizes and/or very skewed variables also lead to CLT problems with numeric variables, so this too can be seen as a special case of what happens for means.)

  4. Then in an "In Practice" chapter, talk about (a) the t-distribution improvement (which is dropped in for the normal distribution), (b) the specific proportion standard deviation formula (and improvements that result from using it rather than the usual formula for s), and (c) anything else skipped because of the restructuring.

A related thought I've had is to only talk about two-tailed hypothesis tests during points (1) through (3) above, then talk about one-tailed hypothesis tests only in point (4). (In social science, folks practically always use two-tailed tests anyway; I think one-tailed tests are kind of a distraction during the period when the "meat of statistical inference" is being introduced.)

In general I think intro stats courses might introduce too much too quickly, so I have been thinking about this approach which makes means and the normal distribution a common thread as intuition about statistics is being built, until the end where some practical adjustments are provided. I would be grateful for any thoughts on this approach! (I'm especially interested in what could go wrong.)
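
As a concrete illustration of points (2) and (4), here is a minimal base R sketch (with simulated data; the numbers are arbitrary) comparing a "large sample" interval that plugs s in for sigma against the t-based interval that would be introduced later as a refinement:

# Base R sketch: large-sample normal CI with s plugged in for sigma,
# vs. the t-based CI introduced later as a practical refinement.
set.seed(42)
x  <- rnorm(100, mean = 5, sd = 2)  # hypothetical sample, n = 100
n  <- length(x)
se <- sd(x) / sqrt(n)

mean(x) + c(-1, 1) * qnorm(0.975) * se           # large-sample CI
mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se  # t-based CI

For n = 100 the two intervals differ by only a few thousandths, which is the sense in which the t-distribution could be deferred to an "In Practice" chapter.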

Exercise 4.14

book → os
feedback_type → textbook_typo_exercise
textbook_page → 148
feedback_name → Neely Atkinson
feedback_email → [email protected]

feedback
Exercise 4.14 (c). The example is a geometric distribution with p = 0.02. The question asks how many failures are expected before the first success. The answer key gives 1/p = 50. I believe this is the expected number of the trial on which the first success occurs; the expected number of failures is therefore 49. This may be related to the fact that the text and R define the geometric distribution (and the negative binomial distribution as well) differently. My apologies if this is not correct.
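
For reference, a quick base R check of the two conventions (R's geometric distribution counts failures before the first success):

p <- 0.02
1 / p                 # 50: expected trial of the first success (text's convention)
(1 - p) / p           # 49: expected number of failures (R's convention)
mean(rgeom(1e5, p))   # simulation, approximately 49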

Change title of pdf to lowercase in eoceSolutions.tex

On line 558 of eoceSolutions.tex, GRE_intro.pdf is required. However, the file is actually called gre_intro.pdf (lowercase), as is the corresponding R file.

This breaks compilation of main.tex in TeXLive 2015. The error is

! Package pdftex.def Error: File 'ch_distributions/figures/eoce/GRE_intro/GRE_intro.pdf' not found.

Ch 5: R code for simulation

https://github.com/OpenIntroOrg/openintro-statistics/blob/b7e661929a25ca0f7fb530139e1c5aa015f1f459/ch_foundations_for_inf/TeX/ch_foundations_for_inf.tex#L171-L188

  • I suggest explicitly writing out 250000000 instead of 250e06 since some students won't be familiar with the latter notation.
  • Also, to be explicit I'd define possible_entries as
pop_size <- 250000000
possible_entries <- c(rep("support", pop_size * 0.887), rep("not", pop_size * 0.113))
  • We're looking for sampled_entries == "support" not "justified", right?
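
Putting the suggestions together, the simulation might look like the following base R sketch (the 0.887/0.113 split and n = 1000 come from the chapter; note that building the full population allocates a very large vector):

# Base R sketch of the chapter's simulation with the edits applied.
pop_size <- 250000000   # written out rather than 250e06
possible_entries <- c(rep("support", pop_size * 0.887),
                      rep("not", pop_size * 0.113))

sampled_entries <- sample(possible_entries, size = 1000)
sum(sampled_entries == "support") / 1000   # sample proportion p-hat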

Streamline model diagnostics plots

In IMS we're using residuals vs. predicted. In the multiple regression chapters of this book we're also using residuals vs. predicted. We should use those (as opposed to residuals vs. x) in the simple linear regression chapter of this book too.

OpenIntroStat/ims#61 mentions it would be nice to keep this consistent across books.
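
For concreteness, a minimal base R sketch of the proposed residuals vs. predicted plot (the data here are simulated placeholders, not the book's):

set.seed(1)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100)
m <- lm(y ~ x)

plot(predict(m), resid(m),
     xlab = "Predicted values", ylab = "Residuals")
abline(h = 0, lty = 2)   # reference line at zero residual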

Guided practice answer hyperlinks

I really like the hyperlinks from the guided practice prompts to the answers. It also makes sense that the hyperlinks at the actual answers point to web resources. However, when I download the PDF, these answer hyperlinks (to external resources) do not launch my browser. Instead, they open a text file like this:

<html>
<head>
<title>Redirecting...</title>
<meta http-equiv="refresh" content="0; url=http://www.fda.gov/food/foodborneillnesscontaminants/metals/ucm115644.htm">


<script type="text/javascript">
  window.location.href = "http://www.fda.gov/food/foodborneillnesscontaminants/metals/ucm115644.htm"
</script></head>
<body>
<p>
  If you are not redirected automatically, use the following link:<br>
  <a href="http://www.fda.gov/food/foodborneillnesscontaminants/metals/ucm115644.htm">http://www.fda.gov/food/foodborneillnesscontaminants/metals/ucm115644.htm</a>
</p>
</body>
</html>

I am using the Okular pdf viewer.

Ch 5: First mention of 1.96

How about adding a bit more info here? Instead of just

https://github.com/OpenIntroOrg/openintro-statistics/blob/b7e661929a25ca0f7fb530139e1c5aa015f1f459/ch_foundations_for_inf/TeX/ch_foundations_for_inf.tex#L732-L736

something like

While about 95\% of the data is within 2 standard deviations 
in a normal distribution, it would be more precise to use 
a value of 1.96 standard deviations as the area under the normal 
curve within 1.96 standard deviations is closer to 0.95. This more 
precise value is typically what is used to construct a 95\% confidence 
interval.
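
The 1.96 and 0.95 figures are easy to verify in R:

qnorm(0.975)                 # 1.959964: the precise multiplier
pnorm(1.96) - pnorm(-1.96)   # 0.9500042: area within 1.96 SDs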

Compiling gives errors

Compilation gives a bunch of "file not found" errors. I think some folder names were changed but not corrected in the main document. Maybe I could help correct it.

Issues compiling the print color book

Hi, what changes need to be made to main.tex to produce the color print book, since the default is the tablet version of the book? I tried reversing some of the comments but could not get the document to compile.

Epub format

Not sure how hard this might be, but if it is not too much work, epub is a pretty portable format. Main advantages include responsive sizing (nice when one wants to zoom in for bigger text on tablets) and flexible coloring, which makes it nice for dark mode. PDF is less flexible with these things.

Thank you for the amazing efforts :)

Navigation Issues

The final PDF document is difficult to navigate in common document readers and tablet software. The table of contents only includes chapters and not the corresponding subsections. I've tried Amazon's Kindle reader (iOS, Windows, OSX) and OSX's Preview app.

To improve navigation and rendering, a native eBook format could be used (e.g. epub or mobi). Has anyone attempted to convert the LaTeX to a native ebook format? One possible option might be tex4ebook.

Ch 5: Figure 5.5 - Sampling dists for several p and n

  • Can we add a y-axis label to the first plot in each row to indicate the value of p, so a reader who doesn't read the caption (✋) can quickly understand what the differences between the rows are? I'm indifferent to additionally including the text in the caption; no objections to leaving it there.
  • Is the blue color not used here, or is it not visible because the bins are very narrow? If the former, it might be nice to add it for consistency.

OS 4: Is data plural or singular?

Currently we use it both as singular and plural, and I'm actually fine with this. But I wanted to confirm this was intentional.

e.g.

  • "data suggests the actual support is close to"
  • "data provide strong evidence"

Misprint on page 236

Footnote 18 on page 236 mentions "John Oliver on This Week Tonight"; however, the name of his show is Last Week Tonight.

Section 5.2 title

Section 5.2 is titled CONFIDENCE INTERVALS FOR A SAMPLE PROPORTION, but I think it should be CONFIDENCE INTERVALS FOR A PROPORTION since the CI is for the population parameter. This would also match the title of the next section on hypothesis testing.

Ch 5: CIs only try to capture population parameters

I really like the essence of the following summary, but the image of CIs actively capturing something is a bit odd to me.

https://github.com/OpenIntroOrg/openintro-statistics/blob/ecad440aa2922a5dd8a4fd7eaa9cf577e26fd119/ch_foundations_for_inf/TeX/ch_foundations_for_inf.tex#L1088-L1093

I propose the following alternative wording:

 Another important consideration of confidence intervals is that they 
 \emph{are only about the population parameter}. A confidence 
 interval says nothing about individual observations, a proportion of the 
 observations, or point estimates. Confidence intervals only provide a 
 plausible range for population parameters, at a given confidence 
 level.

Ch 5: "accuracy" of parameter estimates

I'm not sure if accuracy is the right word here, as opposed to precision, or both.

In discussing confidence intervals we talk about accuracy with regard to confidence levels, and precision with regard to the magnitude of the margin of error. I think we mean both here, so how about instead of

https://github.com/OpenIntroOrg/openintro-statistics/blob/b7e661929a25ca0f7fb530139e1c5aa015f1f459/ch_foundations_for_inf/TeX/ch_inference_for_props.tex#L6

we say

accuracy and precision of parameter estimates?

Ch 5: Understanding the variability of a point estimate and unbiasedness

In this section we simulate samples of size n = 1000 from a population of N = 250 million with an assumed proportion parameter of 0.887. Then we say that the generated sampling distribution is centered at 0.887, and hence the estimate p-hat = 0.887 is unbiased.

But the sampling distribution was generated assuming the true p is 0.887, so we shouldn't use the resulting sampling distribution to justify unbiasedness.

Am I missing something?
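
To make the circularity concrete, here is a minimal base R sketch (the seed and replication count are arbitrary):

# Simulating with an assumed p necessarily yields a sampling
# distribution centered near that p, whatever the real data say.
set.seed(3)
p_assumed <- 0.887
p_hats <- replicate(10000, mean(rbinom(1000, 1, p_assumed)))
mean(p_hats)   # ~0.887 by construction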

Ch 5: Definition of standard error

https://github.com/OpenIntroOrg/openintro-statistics/blob/b7e661929a25ca0f7fb530139e1c5aa015f1f459/ch_foundations_for_inf/TeX/ch_foundations_for_inf.tex#L212-L218

A more accurate definition would be that the variability of the sample statistic is called the standard error when it is calculated using sample statistics; it is called the standard deviation when calculated using the true population parameters.

Not sure if going into that distinction is worthwhile here, since the latter case rarely happens, but the definition here makes it sound like the variability of a sample statistic is always called standard error, which is not accurate.

Relatedly, the formula below is not correct:

https://github.com/OpenIntroOrg/openintro-statistics/blob/b7e661929a25ca0f7fb530139e1c5aa015f1f459/ch_foundations_for_inf/TeX/ch_foundations_for_inf.tex#L284

If using p, this should be called sigma. If using p-hat, it would be called SE.

I would suggest the following changes to address these two concerns:

  • In line 214: We typically measure the variability of a sampling distribution, or the variability of a point estimate, with \termsub{standard error}{standard error (SE)}, and the notation $SE_{\hat{p}}$ is used for the standard error associated with the sample proportion.
  • In line 284: \sigma_{\hat{p}} = \sqrt{\frac{p (1 - p)}{n}}
  • And then a new line 286 before discussing the S-F condition: Often we don't know p, so we can't directly calculate \sigma_{\hat{p}}, and instead estimate it with SE_{\hat{p}} = \sqrt{\frac{\hat{p} (1 - \hat{p})}{n}}.
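
A quick numeric illustration with the chapter's running example (n = 1000, assumed p = 0.887; the sample proportion happens to equal the assumed p here, so the two formulas give the same value):

n <- 1000
p <- 0.887                       # true (assumed) parameter
sqrt(p * (1 - p) / n)            # sigma_p-hat, about 0.0100
p_hat <- 0.887                   # sample proportion
sqrt(p_hat * (1 - p_hat) / n)    # SE_p-hat, the estimate of sigma_p-hat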

os4 branch

Should we delete the os4 branch since master now has the most updated content?

Ch 5: Use of subsample of Pew Survey

I assume this was for simplicity of calculations, which I like, but I feel like the idea of a subsample will be confusing to students who are just starting to learn about drawing inferences from a sample. Also, as a teacher, my first reaction before reading the footnote was "really, exactly 1000 people?". Do you think the benefits of using n = 1000 for simplicity outweigh these potential costs? I'd suggest sticking with the original sample but could be convinced otherwise.

https://github.com/OpenIntroOrg/openintro-statistics/blob/b7e661929a25ca0f7fb530139e1c5aa015f1f459/ch_foundations_for_inf/TeX/ch_foundations_for_inf.tex#L67-L69

Ch 7: Typo in Hypothesis Testing Instructions

There is a typo on page 258, Chapter 7, Section 7.15, towards the bottom of the page.

In the table labeled "Hypothesis Testing for a single mean", the "Check" step reads as:
"Verify conditions to ensure x-bar is nearly normal"

The x-bar should be the set of x values ( {x} ), not the mean of x.

Ch 5: "Applying" the normal model

Throughout the chapter there is phrasing like "the normal model may be applied to p-hat". I prefer phrasing this as "p-hat can be assumed to follow a normal distribution". We're not really "applying" something, we're assuming CLT holds and basing calculations off of that. What do you think about replacing this language with my suggested phrasing?

Cherry blossom data

The provenance of some of the random samples of Cherry Blossom data is iffy. In the next round of edits, let's use the datasets from the 📦 cherryblossom package and add any random samples there for future reproducibility. These are the datasets called run*.
