gedeck / practical-statistics-for-data-scientists
Code repository for the O'Reilly book.
License: GNU General Public License v3.0
I'm trying to open a pull request for some files I've added to this project: the Python files Chapter..N...py broken down into smaller files to make them easier to read. I couldn't see how to open a pull request without write access to this repo, so I cloned it and created my own at https://github.com/pdxrod/practical-statistics-for-data-scientists. I'll delete that repo if requested to do so by Peter Gedeck.
The main purpose of this branch (small-files) was to make it easier for me to read the book and understand it, by seeing the code in smaller sections, whereas the Chapter..N...py files average 395 lines.
Chapter 7 Unsupervised Learning
Cell #18 (Dendrogram) is giving the following error:
Please fix the error and upload the corrected code to the GitHub web page.
Thanks
ValueError                                Traceback (most recent call last)
<ipython-input> in <module>
      1 fig, ax = plt.subplots(figsize=(5, 5))
      2
----> 3 dendrogram(Z, labels=df.index, color_threshold=0)
      4 plt.xticks(rotation=90)
      5 ax.set_ylabel('distance')

C:\ProgramData\Anaconda3\lib\site-packages\scipy\cluster\hierarchy.py in dendrogram(Z, p, truncate_mode, color_threshold, get_leaves, orientation, labels, count_sort, distance_sort, show_leaf_counts, no_plot, no_labels, leaf_font_size, leaf_rotation, leaf_label_func, show_contracted, link_color_func, ax, above_threshold_color)
   3275                         "'bottom', or 'right'")
   3276
-> 3277     if labels and Z.shape[0] + 1 != len(labels):
   3278         raise ValueError("Dimensions of Z and labels must be consistent.")
   3279

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in __nonzero__(self)
   2148     def __nonzero__(self):
   2149         raise ValueError(
-> 2150             f"The truth value of a {type(self).__name__} is ambiguous. "
   2151             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   2152         )

ValueError: The truth value of a Index is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
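A likely fix (a minimal sketch, assuming Z and df are defined as earlier in the notebook): pass the labels as a plain list, since scipy's `if labels` check raises the "truth value is ambiguous" ValueError on a pandas Index:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

fig, ax = plt.subplots(figsize=(5, 5))
dendrogram(Z, labels=list(df.index), color_threshold=0)  # list(...) avoids the ambiguous-truth-value error
plt.xticks(rotation=90)
ax.set_ylabel('distance')
plt.show()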
Since conda can handle the R dependencies as well, I'd recommend adding the following to the existing environment file (a one-off install command is sketched after the list):
r-vioplot
r-corrplot
r-gmodels
r-matrixstats
r-lmperm
r-pwr
r-fnn
r-klar
r-dmwr
r-xgboost
r-ellipse
r-mclust
r-ca
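Equivalently, as a one-off command (a sketch; it assumes all of these packages are available from conda-forge):

conda install -c conda-forge r-vioplot r-corrplot r-gmodels r-matrixstats \
    r-lmperm r-pwr r-fnn r-klar r-dmwr r-xgboost r-ellipse r-mclust r-ca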
Optional: add the conda-forge build of rstudio-desktop, which is better supported than the rstudio-desktop version that is part of the default channel:
conda install -c conda-forge rstudio-desktop
This is in reference to the Python Jupyter notebook for Chapter 5: Classification, section: Undersampling.
The code and outputs, as shown in the notebook, are below.
However, when I rerun that notebook, the output is as shown below.
Needless to say, the output is drastically different from what is in the original notebook. I have rerun the same code in a different notebook, and the output still differs from the original.
Please look into this.
I am getting an error with the three sampling examples in R version 4.1.0:
"Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'"
The error is also generated for the sample of 5 and the sample of 20 starting on lines 39 and 45.
I've fixed it by passing loans_income$x into the sample call, not just loans_income, based on the suggestion in this post: https://stackoverflow.com/questions/19648238/r-says-cannot-take-a-sample-larger-than-the-population-but-i-am-not-taking/19648272
I'm using R 4.1.0, but the arm64 build.
In Chapter 1, in the Correlation section, the first two cells give the same error:
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>()
      4
      5 # Filter data for dates July 2012 through June 2015
----> 6 telecom = sp500_px.loc[sp500_px.index >= '2012-07-01', telecomSymbols]
      7 telecom.corr()
      8 telecom

/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in cmp_method(self, other)
    120             else:
    121                 with np.errstate(all="ignore"):
--> 122                     result = op(self.values, np.asarray(other))
    123
    124         if is_bool_dtype(result):

TypeError: '>=' not supported between instances of 'numpy.ndarray' and 'numpy.ndarray'
FYI, I'm running the cells on Google Colab.
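A likely fix (a sketch; it assumes the CSV was loaded without date parsing, leaving a plain object index, and SP500_DATA_CSV is a placeholder for the data path): convert the index to datetimes before comparing it to a date string:

import pandas as pd

sp500_px = pd.read_csv(SP500_DATA_CSV, index_col=0)
sp500_px.index = pd.to_datetime(sp500_px.index)  # a datetime64 index supports '>=' against a date string
telecom = sp500_px.loc[sp500_px.index >= '2012-07-01', telecomSymbols]
telecom.corr()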
The following code prints the chi2 value calculated by the permutation test (chi2observed) instead of the chi2 value computed with the scipy stats module (chisq).
chisq, pvalue, df, expected = stats.chi2_contingency(clicks)
print(f'Observed chi2: {chi2observed:.4f}')
print(f'p-value: {pvalue:.4f}')
I believe the first print line should be
print(f'Observed chi2: {chisq:.4f}')
since the purpose is to demonstrate using the chi2 module for statistical tests rather than the previous section's permutation test.
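Put together (a minimal sketch, assuming clicks is the contingency table built earlier in the notebook):

from scipy import stats

# chi2_contingency returns the statistic, p-value, degrees of freedom, and expected counts
chisq, pvalue, df, expected = stats.chi2_contingency(clicks)
print(f'Observed chi2: {chisq:.4f}')  # scipy's statistic, not the permutation value
print(f'p-value: {pvalue:.4f}')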
Thanks!
Chapter 4 code fails with
ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).
Error: Process completed with exit code 1.
The root cause of this failure is a change in the pandas get_dummies function: it used to create 0/1 columns and now creates True/False columns.
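One way to restore the old behavior (a sketch; the toy data frame is a placeholder): ask get_dummies for an integer dtype explicitly:

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'red']})  # placeholder data
dummies = pd.get_dummies(df, dtype=int)  # 0/1 columns instead of the newer True/False default
print(dummies)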
The Prince package changed its plotting API to use Vega. Replace with a custom plot.
Feedback on errata page:
The mean of the random values generated using the rexp(n=100, rate=0.2) function in R is ~5, which makes sense given that the rate (the mean number of events per time period) is 0.2, so the mean time between events is 1/0.2 = 5. However, for the Python code given in the book, stats.expon.rvs(0.2, size=100), the mean of the generated random values is ~1.2, because the first positional argument, loc=0.2, is interpreted as the starting location of the exponential distribution (with the default scale=1). To get the same range of random values as with R, we need to use stats.expon.rvs(scale=5, size=100) instead.
Make the change in the notebook.
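For comparison (a minimal sketch of the corrected call):

import numpy as np
from scipy import stats

# R's rexp(n=100, rate=0.2) has mean 1/rate = 5; scipy's equivalent is scale = 1/rate
samples = stats.expon.rvs(scale=5, size=100)
print(np.mean(samples))  # ~5, matching the R output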
Using perm_fun(x, nA, nB) for the permutation tests on pages 99-101 now results in a deprecation warning:
"FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead."
Hi, I hope it is OK that I am commenting on this here.
In Chapter 3 I am stuck at this step:
3. Find the squared differences between the shuffled counts and expected counts, then sum them.
Do you mean "calculate the chi-square statistic" for each resampled set, where you calculate Pearson residuals first, or do you literally sum the squared differences between observed and expected counts? Thank you.
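For reference, the conventional chi-square statistic divides each squared difference by the expected count, i.e., it sums squared Pearson residuals. A minimal sketch of that reading (my illustration with placeholder click counts, not the author's confirmed intent):

def chi2_stat(observed, expected):
    # chi-square statistic: sum((O - E)^2 / E) over all cells
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

expected = [34 / 3] * 3  # e.g. 34 clicks spread evenly over three headlines
print(chi2_stat([14, 8, 12], expected))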
On 29 July 2020, two records were added to house_sales.csv by gedeck. Those records' zip codes are 9800 and 89118. Because of them, many of the execution results of the code in the book, especially in Chapter 4, do not match the GitHub code's results. 9800 and 89118 are not even zip codes in King County. They were not in the original data, the printed book, or the O'Reilly Learning contents. Are they really needed?
At a minimum, make sure that the R code executes without a problem.
Example GitHub Action to run R:
https://blog--simonpcouch.netlify.app/blog/r-github-actions-commit/
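A minimal sketch of such a workflow (the script path and action versions are assumptions, not taken from the linked post):

name: run-r-code
on: [push]
jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - run: Rscript src/chapter1.R   # placeholder path to one of the chapter scripts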
Using the given code produces an error (1).
Per the Prince CA documentation, I was able to get it working (2).
Python version: 3.11.4 (using Jupyter Notebook)
(1) Original Python code and error:
housetasks = pd.read_csv(HOUSE_TASKS_CSV, index_col=0)
ca = prince.CA(n_components=2)
ca = ca.fit(housetasks)
ca.plot_coordinates(housetasks, figsize=(6, 6))
plt.tight_layout()
plt.show()
ERROR: AttributeError: 'CA' object has no attribute 'plot_coordinates'
(2) Updated Python code:
import pandas as pd
import prince
import altair as alt

# Load the data
housetasks = pd.read_csv(HOUSE_TASKS_CSV, index_col=0)

# Create and fit the model
ca = prince.CA(n_components=2)
ca = ca.fit(housetasks)

# Extract the column coordinates dataframe and rename the columns
cc = ca.column_coordinates(housetasks).reset_index()
cc.columns = ['name', 'x', 'y']

# Extract the row coordinates dataframe and rename the columns
rc = ca.row_coordinates(housetasks).reset_index()
rc.columns = ['name', 'x', 'y']

# Combine the dataframes
crc_df = pd.concat([cc, rc], ignore_index=True)

# Plot and annotate
points = ca.plot(housetasks, x_component=0, y_component=1)
annot = alt.Chart(crc_df).mark_text(
    align='left',
    baseline='middle',
    fontSize=10,
    dx=7
).encode(
    x='x',
    y='y',
    text='name'
)
points + annot
[Confusion Matrix]
In [18]:
# Confusion matrix
pred <- predict(logistic_gam, newdata=loan_data)
pred_y <- as.numeric(pred > 0)
true_y <- as.numeric(loan_data$outcome=='default')
true_pos <- (true_y==1) & (pred_y==1)
true_neg <- (true_y==0) & (pred_y==0)
false_pos <- (true_y==0) & (pred_y==1)
false_neg <- (true_y==1) & (pred_y==0)
conf_mat <- matrix(c(sum(true_pos), sum(false_pos),
                     sum(false_neg), sum(true_neg)), 2, 2)
colnames(conf_mat) <- c('Yhat = 1', 'Yhat = 0')
rownames(conf_mat) <- c('Y = 1', 'Y = 0')
conf_mat
|       | Yhat = 1 | Yhat = 0 |
|-------|----------|----------|
| Y = 1 | 14293    | 8378     |
| Y = 0 | 8051     | 14620    |
In the R notebook, the correctly predicted defaults are 14,293 and the incorrectly predicted ones are 8,378, but in the printed book they are 14,295 and 8,376.
And in Python, I got yet another set of numbers:
              Yhat = default  Yhat = paid off
Y = default            14336             8335
Y = paid off            8148            14523
Which one is correct? If the notebook's results are right, the numbers in the first paragraph of page 222 should be edited.
[AUC]
In [21]:
sum(roc_df$recall[-1] * diff(1-roc_df$specificity))
head(roc_df)
0.692623197044616
The result in the notebook is 0.692623197044616, but it is 0.6926172 in the book. Please check the Python code and result too.
The code runs without problems up to page 276. But without explicitly setting eval_metric="error", you will eventually get errors on page 280. I think it would be better to edit the GitHub code.
In [12]:
set.seed(1010103)
df <- sp500_px[row.names(sp500_px)>='2011-01-01', c('XOM', 'CVX')]
km <- kmeans(df, centers=4, nstart=1)
df$cluster <- factor(km$cluster)
head(df)
XOM CVX cluster
2011-01-03 0.73680496 0.2406809 1
2011-01-04 0.16866845 -0.5845157 4
2011-01-05 0.02663055 0.4469854 1
2011-01-06 0.24855834 -0.9197513 4
2011-01-07 0.33732892 0.1805111 1
2011-01-10 0.00000000 -0.4641675 4
In the notebook, the first six records are assigned to either cluster 1 or cluster 4. The means of the clusters are below.
In [13]:
centers <- data.frame(cluster=factor(1:4), km$centers)
centers
cluster XOM CVX
1 0.2315403 0.3169645
2 0.9270317 1.3464117
3 -1.1439800 -1.7502975
4 -0.3287416 -0.5734695
But the execution results in the book are a little different: there, the records are assigned to cluster 1 or 2. However, as you can see in [Figure 7-5], clusters 3 and 4 are in the negative region (lower left of the graph), and they look like they represent the "down" market. So I think the code results and some sentences on pages 296-297 should be changed.
> x <- loan_data[1:5, c('dti', 'payment_inc_ratio', 'home_', 'purpose_')]
> x
dti payment_inc_ratio home purpose
<dbl> <dbl> <fctr> <fctr>
1 1.00 2.39320 RENT car
...
It should be changed like this.
> x <- loan_data[1:5, c('dti', 'payment_inc_ratio', 'home_', 'purpose_')]
> x
dti payment_inc_ratio home_ purpose_
<dbl> <dbl> <fctr> <fctr>
1 1.00 2.39320 RENT major_purchase
...
Please check them all and let me know if I got something wrong. :) Thanks in advance.
In this function:
def perm_fun(box):
    # three independent draws from box, so the same clicks can be counted in more than one sample
    sample_clicks = [sum(random.sample(box, 1000)),
                     sum(random.sample(box, 1000)),
                     sum(random.sample(box, 1000))]
Shouldn't it be
def perm_fun(box):
    # shuffle once and partition, so every click is used exactly once
    random.shuffle(box)
    sample_clicks = [sum(box[0:1000]),
                     sum(box[1000:2000]),
                     sum(box[2000:3000])]
to ensure the total count of clicks is always 34?
The predicted probability results are different. They should be 0.4798964 (paid off) and 0.5201036 (default).
I ran the code in Colab. Would you check this notebook?
https://colab.research.google.com/drive/1ChitMlzaMHYDru6ngI1qBHhJGcIP-RhI#scrollTo=1EnynWD14l2R&line=7&uniqifier=1
Line 318 needs a line break.
Line 453 also needs a line break.
And line 452 has a type error. Would you check this line?
"TypeError: Object with dtype category cannot perform the numpy op subtract"
It would be better to set eval_metric='error' in the Python code too.
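For example (a sketch, assuming the notebook uses the xgboost scikit-learn wrapper; any other hyperparameters are omitted here):

from xgboost import XGBClassifier

# newer xgboost releases changed the default evaluation metric,
# so pin it explicitly to reproduce the book's results
model = XGBClassifier(objective='binary:logistic', eval_metric='error')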
The Jupyter notebook for Chapter 5 (Classification) is giving the following error:
Matplotlib is currently using agg, which is a non-GUI backend, so can't show the figure.
Please fix these errors and update the notebook code files on this book's GitHub page.
Thanks and best regards,
SSJ
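A possible workaround (a sketch; it assumes the warning comes from calling plt.show() while the non-interactive Agg backend is selected):

import matplotlib
matplotlib.use('TkAgg')  # any installed GUI backend; inside Jupyter, %matplotlib inline is the usual fix
import matplotlib.pyplot as plt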
There is a TypeError when running the Chapter 3 Web Stickiness notebook:
The line:
print(np.mean(perm_diffs > mean_b - mean_a))
results in the following TypeError: '>' not supported between instances of 'list' and 'float',
which can be fixed with a map object such as:
mapObj = map(lambda d: d > (mean_b - mean_a), perm_diffs)
print(f'{sum(mapObj) * 100 / len(perm_diffs):4.2f}%')
This line raises the same TypeError: '>' not supported between instances of 'list' and 'float'.
It would be better to correct this line to:
print(np.mean(np.array(perm_diffs) > mean_b - mean_a))
Yoohoo!!
In Chapter 1, in the "Frequency Tables and Histograms" section, I tried to replicate the histogram code with a different Python package, lets-plot, which should be similar to R's hist() plot. However, the y-axis (the frequency) is different from what the R and Python code generated with the same number of bins.
The histogram generated from the textbook code:
Code:
ax = (state['Population'] / 1_000_000).plot.hist(bins=10)
ax.set_xlabel('Population (millions)')
The histogram generated by lets-plot (aka ggplot in Python):
Code:
from lets_plot import *

temp_df = pd.DataFrame(state['Population'] / 1_000_000)
ggplot(temp_df, aes(x="Population")) + geom_histogram(bins=10)