Comments (3)
Naive Bayes:
I checked this and I also get these changed values with a fresh installation of R. Since the results were practically identical between the Python and R implementations, I reached out to the author of the klaR package to find out whether something has changed that would explain this large difference. The changelog of the klaR package mentions changes to the Laplace smoothing that may be responsible for this. When I set alpha in Python to about 2000, I get results similar to what R produces now. I'll wait to see what the author of the R package says.
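To illustrate why a very large smoothing parameter can shift the results so much, here is a minimal sketch of additive (Laplace) smoothing in plain Python. The counts and the helper name smoothed_probs are made up for illustration; this is not the klaR or scikit-learn implementation, only the textbook formula (count + alpha) / (total + alpha * k).

```python
def smoothed_probs(counts, alpha):
    """Additive (Laplace) smoothing over k categories.

    Returns (count + alpha) / (total + alpha * k) for each category.
    """
    total = sum(counts)
    k = len(counts)
    return [(c + alpha) / (total + alpha * k) for c in counts]


# Hypothetical category counts within one class (not taken from loan_data)
counts = [90, 10]

low = smoothed_probs(counts, alpha=1)      # mild smoothing, stays close to 90/10
high = smoothed_probs(counts, alpha=2000)  # heavy smoothing, pulled towards 50/50

print(low)
print(high)
```

With alpha far larger than the observed counts, the conditional probabilities are pulled towards a uniform distribution, which would change the posterior noticeably.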
Formatting changes
I'm not going to change the formatting. The main reason is that the same code is also used in the book and I try to keep it as concise/short as possible.
XGBoost / type error line 452
I thought I had already added eval_metric='error' to the notebook, but I had only done it in another notebook, not this one. It is added now. At the same time, I addressed the warning that the label encoder will be deprecated in xgboost, and therefore changed the definition of y to:
X = pd.get_dummies(loan_data[predictors], drop_first=True)
y = pd.Series([1 if o == 'default' else 0 for o in loan_data[outcome]])
That seems to have resolved the type error as well; I no longer see it.
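As a small self-contained check of what this encoding produces, here is a sketch on a made-up data frame. The column names and values below are hypothetical stand-ins for loan_data, predictors, and outcome; only the two lines defining X and y are taken from the change above.

```python
import pandas as pd

# Hypothetical stand-in for loan_data; the real data set has more columns
loan_data = pd.DataFrame({
    'outcome': ['default', 'paid off', 'default', 'paid off'],
    'purpose': ['car', 'home', 'car', 'other'],
    'loan_amnt': [1000, 2500, 1800, 3000],
})
predictors = ['purpose', 'loan_amnt']  # assumed predictor names
outcome = 'outcome'

# Same definitions as in the notebook change
X = pd.get_dummies(loan_data[predictors], drop_first=True)
y = pd.Series([1 if o == 'default' else 0 for o in loan_data[outcome]])

print(X.columns.tolist())
print(y.tolist())
```

Because y is already an integer 0/1 series, the classifier does not need to encode string labels itself, which is presumably why the deprecation warning no longer applies.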
from practical-statistics-for-data-scientists.
Naive Bayes (update):
Fortunately, I had an older installation of R around, so I could confirm that with the older version the results are as printed in the book. The old version was R 3.6.1 and the new version is R 4.0.3. One thing that changed between these two versions is the handling of string columns when reading data files. The difference is in default.stringsAsFactors(), which is TRUE in 3.6.1 and FALSE in 4.0.3. When I explicitly set
loan_data <- read.csv(file.path(PSDS_PATH, 'data', 'loan_data.csv.gz'), stringsAsFactors=TRUE)
I get the original results printed in the book.
$posterior
paid off default
[1,] 0.3463013 0.6536987
I'm going to change the code to use stringsAsFactors=TRUE.
Closed with 7b89f15