telmo-correa / all-of-statistics
Self-study on Larry Wasserman's "All of Statistics"
There is an error in the presented variance of the MLE.
This implies changes to the Wald statistic and to the simulation, of course.
The solution presented from part a) to part c) is not general; it already assumes
Do not confuse the number of data points of
The MLE of
Taking the derivative of
The MLE estimator of
The Fisher information matrix would be
Hi!
First of all thanks for this repo! It's been a big help for self-studying the book.
In question 8.5.8, I think `se = r.std()` should be replaced by `se = r.sem()` or `math.sqrt(r.var() / len(r))`. The `std` call calculates the standard deviation of the values themselves, whereas we want the standard error of the mean estimate, which is calculated as $\widehat{\mathrm{se}} = \hat{\sigma} / \sqrt{n}$.
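As a minimal check (my own snippet, assuming `r` is a pandas Series of replicates, as it appears to be in the notebook), the two quantities differ by a factor of $\sqrt{n}$:
```
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's series of replicates
r = pd.Series(np.random.default_rng(0).normal(size=1000))

print(r.std())                    # standard deviation of the values
print(r.sem())                    # standard error of the mean
print(np.sqrt(r.var() / len(r)))  # equal to sem(): sd / sqrt(n)
```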
There is an issue with the variance of the true sampling distribution, which should be
This change will induce a less concentrated distribution than the one in the currently presented solution (as the scale should be
the answer involves both Φ (the standard normal CDF) and φ (its density), instead of only Φ
The current version of the solution uses the fact that
The correct statements are:
All these changes lead us to the same expected result.
As pointed out in the statement of the question
Honestly, I do not understand the simulation provided, since I am not familiar with tqdm and maybe other things. If it helps, what I did was to generate 100 Bernoulli(p) random vectors of length N = 10,000, compute the 100 running sample means (each of length N = 10,000), and average the coverage over the 100 repetitions. My caveat: I am not sure I really need to average over a given number of random vectors (which I arbitrarily decided to be 100).
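For reference, the interval being checked is the Hoeffding interval from the chapter, $C_n = (\hat{p}_n - \epsilon_n, \hat{p}_n + \epsilon_n)$ with $\epsilon_n = \sqrt{\frac{1}{2n} \log \frac{2}{\alpha}}$, which is exactly what the `epsilon` function below computes.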
Here is my code:
```
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import bernoulli

np.random.seed(1)

N = 10000       # length of each Bernoulli sequence
p = 0.4         # true success probability
alpha = 0.05    # 1 - alpha is the nominal coverage
n_reps = 100    # number of independent repetitions

nn = np.arange(1, N + 1)
# 100 running sample means of Bernoulli(p) draws
XX_cum = [np.cumsum(bernoulli.rvs(p, size=N)) / nn for _ in range(n_reps)]

def epsilon(n, alpha):
    # Hoeffding half-width for a (1 - alpha) confidence interval
    return np.sqrt(np.log(2 / alpha) / (2 * n))

# Coverage indicator per repetition and sample size, averaged over repetitions
counter = np.array([np.abs(xx - p) <= epsilon(nn, alpha) for xx in XX_cum])
coverage = counter.mean(axis=0)

plt.plot(nn, coverage, label='Coverage')
plt.plot(nn, np.full(N, 1 - alpha), label="Hoeffding's lower bound")
plt.xlabel('N')
plt.ylabel(r'$P(p \in C_n)$')
plt.legend()
plt.show()
```
Hello! Thank you for the great work! There seems to be a mistake in Chapter 21 - Nonparametric Curve Estimation, in the function j_hat_kde: as mentioned in the book and in your text, K^(2) is N(0, 2), so the function should look like the version below. The comment also contains a mistake: h is a bandwidth, but the comment says the dataset is rescaled to [0, 1], and I did not find any evidence for such a transformation in the book, nor does the code of this function perform it. Could you please double-check this?
```
import numpy as np
from itertools import product
from scipy.stats import norm

def j_hat_kde(X, h):
    r"""
    Calculate the approximate estimated KDE risk J_hat for a N(0, 1) Gaussian kernel

    \hat{J}(h) = \frac{1}{hn^2}\sum_{i, j} K^* \left( \frac{X_i - X_j}{h} \right) + \frac{2}{nh} K(0)

    where:
        n is the dataset size
        h is the bandwidth
        K^* is K^{(2)}(x) - 2 K(x), and K^{(2)} is the convolved kernel,
            K^{(2)}(z) = \int K(z - y) K(y) dy
        K is the original kernel
    """
    n = len(X)
    Kstar_args = np.array([X.iloc[i] - X.iloc[j]
                           for i, j in product(range(n), range(n))]) / h
    # K^{(2)} = N(0, 2) has variance 2, and scipy's scale is a standard deviation
    sum_value = np.sum(norm.pdf(Kstar_args, loc=0, scale=np.sqrt(2))
                       - 2 * norm.pdf(Kstar_args, loc=0, scale=1))
    return sum_value / (h * n * n) + 2 * norm.pdf(0) / (n * h)
```
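Note that the convolution of two N(0, 1) densities is N(0, 2), i.e. variance 2; since scipy's `scale` parameter is a standard deviation, the convolved-kernel term above uses `scale=np.sqrt(2)` rather than `scale=2`.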
There is a typo:
It looks to me like you are picking a random lsat and a random gpa and making them a pair for the bootstrap.
I think we want to pick a random pair directly instead.
e.g. with lsat = [1, 2] and gpa = [3, 4], we want to pick randomly from the pairs (1, 3) and (2, 4), so (1, 4) would be impossible.
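A minimal sketch of resampling by index (the array values below are hypothetical stand-ins, not the notebook's data):
```
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the notebook's LSAT/GPA arrays
lsat = np.array([576, 635, 558, 578, 666])
gpa = np.array([3.39, 3.30, 2.81, 3.03, 3.44])

# Resample row indices so each draw keeps the pair (lsat[i], gpa[i]) together
n = len(lsat)
idx = rng.integers(0, n, size=n)
lsat_star, gpa_star = lsat[idx], gpa[idx]
```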
all-of-statistics/Chapter 04 - Expectation.ipynb
Line 1292 in e8aa207
is $U \sim \text{Bernoulli}(1/2)$
meant to be $C \sim \text{Bernoulli}(1/2)$?
Also, is $U_2 \sim \text{Uniform}(3, 4)$?
Hello,
The right answer for f(r) should be 2r instead of 2πr; otherwise, you can't normalize f(r) to 1.
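A quick check of the normalization (my own one-liner, assuming $r \in [0, 1]$ as for a uniform point on the unit disk):

$$\int_0^1 2r \, dr = \left[ r^2 \right]_0^1 = 1, \qquad \text{while} \qquad \int_0^1 2\pi r \, dr = \pi \neq 1.$$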
Best wishes,
Kirill
The proposed solution is algebraically correct on the left-hand side of the inequality in the first probability, but
In the Cauchy run of the nonparametric confidence band test, the CDF the code compares against is from the normal distribution, not the Cauchy distribution as it should be. This leads to many more instances of the true CDF falling outside the confidence band.
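A minimal sketch of the intended comparison (variable names are mine, not the notebook's), using the DKW band from the chapter on estimating the CDF:
```
import numpy as np
from scipy.stats import cauchy

n, alpha = 1000, 0.05
X = np.sort(cauchy.rvs(size=n, random_state=0))

# Empirical CDF and DKW confidence band
ecdf = np.arange(1, n + 1) / n
eps = np.sqrt(np.log(2 / alpha) / (2 * n))
lower, upper = np.clip(ecdf - eps, 0, 1), np.clip(ecdf + eps, 0, 1)

# Compare against the Cauchy CDF, not norm.cdf, since the data are Cauchy
true_cdf = cauchy.cdf(X)
covered = np.all((lower <= true_cdf) & (true_cdf <= upper))
```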
In parts a) and b) (and by extension in the rest of the exercise) there are some issues with the meaning of
As I previously said, these changes have effects on the rest of the solution. However, the modifications I propose lead to the same final results in parts c) and d).
1- I don't think the "fourth method" the exercise asks for is the parametric bootstrap, but rather the jackknife confidence interval explained in the appendix of the chapter.
2- The second part of the exercise asks for "the true coverage", which means the percentage of intervals that contain the true value of the statistic (in this case, the skewness parameter). My way of solving this is the following (excuse me if my code is not very fluent):
```
import numpy as np
from scipy.stats import norm
from tqdm.notebook import tqdm

np.random.seed(1)

def skew(x):
    # Sample skewness: third central moment over the cubed standard deviation
    mean = np.mean(x)
    sd = np.std(x)
    return np.mean((x - mean) ** 3) / sd ** 3

B = 10000  # number of bootstrap replications

def T_boot(x):
    # Bootstrap replications of the skewness and their standard deviation
    t_boot = np.empty(B)
    n_x = len(x)
    for i in tqdm(range(B)):
        x_star = np.random.choice(x, n_x, replace=True)
        t_boot[i] = skew(x_star)
    return t_boot, np.std(t_boot)

# Create 1000 random vectors of n = 50 data points, X = exp(Y) with Y ~ N(0, 1)
Y2 = norm.rvs(loc=0, scale=1, size=(1000, 50))
X2 = np.exp(Y2)

# Map each random vector through the bootstrap to produce the confidence intervals
t_bootV = list(map(T_boot, X2))

# Normal 95% confidence intervals
z_95 = norm.ppf(0.975)
NCI = np.array(list(map(lambda x, y: (skew(x) - z_95 * y, skew(x) + z_95 * y),
                        X2, list(zip(*t_bootV))[1])))
# Percentile 95% confidence intervals
PerCI = np.array(list(map(lambda t: (np.quantile(t, 0.025), np.quantile(t, 0.975)),
                          list(zip(*t_bootV))[0])))
# Pivotal 95% confidence intervals
PivCI = np.array(list(map(lambda x, t: (2 * skew(x) - np.quantile(t, 0.975),
                                        2 * skew(x) - np.quantile(t, 0.025)),
                          X2, list(zip(*t_bootV))[0])))
```
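For reference, the three interval types implemented above are the Chapter 8 constructions, with $\hat{\theta}$ the sample skewness, $\widehat{\mathrm{se}}_{\mathrm{boot}}$ the bootstrap standard error, and $\theta^*_{\beta}$ the $\beta$ quantile of the bootstrap replications:

$$\text{Normal: } \hat{\theta} \pm z_{\alpha/2} \, \widehat{\mathrm{se}}_{\mathrm{boot}}, \qquad \text{Percentile: } \left( \theta^*_{\alpha/2}, \theta^*_{1-\alpha/2} \right), \qquad \text{Pivotal: } \left( 2\hat{\theta} - \theta^*_{1-\alpha/2}, \, 2\hat{\theta} - \theta^*_{\alpha/2} \right).$$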
According to theory, the skewness of the lognormal distribution with $\mu = 0$ and $\sigma = 1$ is
$$\kappa = \frac{e^3 - 3e + 2}{(e - 1)^{3/2}} = (e + 2)\sqrt{e - 1} \approx 6.185.$$
This will be the value we expect to fall within our confidence intervals (not only for each method, but also for each simulated random vector). The "true coverage" estimation continues as follows:
```
# The true skewness parameter of the lognormal, (e + 2) * sqrt(e - 1)
skewness_f = (2 - 3 * np.exp(1) + np.exp(1) ** 3) / (np.exp(1) - 1) ** (3 / 2)

# Fraction of intervals that bracket the true parameter, for each method
# (elementwise &, since both endpoints must bracket the value)
cov_NCI = np.count_nonzero((NCI.T[1] >= skewness_f) & (skewness_f >= NCI.T[0])) / len(NCI.T[0])
cov_PerCI = np.count_nonzero((PerCI.T[1] >= skewness_f) & (skewness_f >= PerCI.T[0])) / len(PerCI.T[0])
cov_PivCI = np.count_nonzero((PivCI.T[1] >= skewness_f) & (skewness_f >= PivCI.T[0])) / len(PivCI.T[0])

print('The true coverage for the simulated Normal confidence intervals is: %.3f' % cov_NCI)
print('The true coverage for the simulated Percentile confidence intervals is: %.3f' % cov_PerCI)
print('The true coverage for the simulated Pivotal confidence intervals is: %.3f' % cov_PivCI)
```
The output of the simulation:
This means that our 50 data points are not enough for our bootstrap 95% confidence intervals to reach the expected coverage...
With
With
Assuming that my procedure is correct, I only need more data points (which is more computationally expensive, which is why I do not report further results)...
I would appreciate some feedback to compare against.
As correctly shown at the beginning of part b),
Notice that this solution depends on the sample size, as expected for a posterior, which applies both to
I also suggest respecting the symbol definitions in the preface, in order to avoid confusion
Exercise 2.10.1.
$\lim_{n \to \infty} \mathbb{P}(A_n) = 1 - \mathbb{P}(A)$ (3)
The RHS should be $\mathbb{P}(A)$.
$\beta_1$ and $\beta_2$ are not independent; the covariance term needs to be included.
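For reference (my addition, assuming the quantity in question is a linear combination of the two estimates):

$$\operatorname{Var}(a\hat{\beta}_1 + b\hat{\beta}_2) = a^2 \operatorname{Var}(\hat{\beta}_1) + b^2 \operatorname{Var}(\hat{\beta}_2) + 2ab \operatorname{Cov}(\hat{\beta}_1, \hat{\beta}_2).$$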
In the very last code cell, the MSE is the variance, not the standard error.
The question asked for
https://github.com/telmo-correa/all-of-statistics/blob/master/Chapter%2003%20-%20Random%20Variables.ipynb
Hi, there might be an issue with Chapter 03, in the solution to
In equation (13), it shows $1 - P(|X| < c) = 1 - P(-c < X < -c)$; it should be $1 - P(-c < X < c)$.
Then in equation (14), it should be $1 - P\left(\frac{-c-3}{4} < Z < \frac{c-3}{4}\right)$, because this normal distribution is not symmetric about zero.
So finally I got $P(|X| > |x|) = \Phi\left(\frac{x+3}{4}\right) - \Phi\left(\frac{x-3}{4}\right) = 0.005$, and by brute force found the solution $|x| = 9.5464$.
But of course I don't know whether I am right.
A line break is missing between the equations of the two moments