carpentries-incubator / machine-learning-trees-python
Introduction to tree models with Python
Home Page: https://carpentries-incubator.github.io/machine-learning-trees-python
License: Other
What kind of exercises or tasks could we add to Section 7 to make it more interactive?
Note, the markdown for a standard exercise is:
> ## Exercise
> A) Q1
> B) Q2
> > ## Solution
> > A) Q1
> > B) Q2
> {: .solution}
{: .challenge}
It would be helpful to provide a bit more detail about Gini impurity at:
https://carpentries-incubator.github.io/machine-learning-trees-python/02-decision-tree/index.html
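A short, self-contained snippet could anchor that extra detail. This is a minimal sketch in pure Python (the function name and labels are illustrative, not from the lesson):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions.
    0.0 means the group is pure; 0.5 is the maximum for two classes."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

# A pure group has zero impurity; a 50/50 group has the two-class maximum.
print(gini_impurity(["died", "died", "died"]))      # 0.0
print(gini_impurity(["died", "survived"]))          # 0.5
print(gini_impurity(["died", "died", "survived"]))  # 4/9 ~ 0.444
```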
What kind of exercises or tasks could we add to Section 2 to make it more interactive?
Note, the markdown for a standard exercise is:
> ## Exercise
> A) Q1
> B) Q2
> > ## Solution
> > A) Q1
> > B) Q2
> {: .solution}
{: .challenge}
The following paragraph (at https://carpentries-incubator.github.io/machine-learning-trees-python/02-decision-tree/index.html) is forward-referencing: to understand it, we need concepts that haven't yet been taught.
While we will only be looking at classification here, regression isn’t too different. After grouping the data (which is essentially what a decision tree does), classification involves assigning all members of the group to the majority class of that group during training. Regression is the same, except you would assign the average value, not the majority.
We should move the discussion of regression to later in the content.
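Wherever the regression discussion ends up, a tiny leaf-prediction example could make the majority-vote vs. average distinction concrete. A minimal sketch in pure Python (the leaf contents are made up for illustration):

```python
from collections import Counter

# Suppose a decision tree has grouped these training examples into one leaf.
leaf_class_labels = ["survived", "died", "survived", "survived"]
leaf_target_values = [72.0, 65.0, 80.0, 75.0]

# Classification: the leaf predicts the majority class of its members.
prediction_cls = Counter(leaf_class_labels).most_common(1)[0][0]
print(prediction_cls)  # survived

# Regression: the leaf predicts the average of its members' values.
prediction_reg = sum(leaf_target_values) / len(leaf_target_values)
print(prediction_reg)  # 73.0
```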
What kind of exercises or tasks could we add to Section 6 to make it more interactive?
Note, the markdown for a standard exercise is:
> ## Exercise
> A) Q1
> B) Q2
> > ## Solution
> > A) Q1
> > B) Q2
> {: .solution}
{: .challenge}
What kind of exercises or tasks could we add to Section 4 to make it more interactive?
Note, the markdown for a standard exercise is:
> ## Exercise
> A) Q1
> B) Q2
> > ## Solution
> > A) Q1
> > B) Q2
> {: .solution}
{: .challenge}
Thoughts after DUSC workshop 15/11/23:
It's interesting to see how commonly papers using XGBoost (and similar methods) are published following the approach that we've worked through in the lesson. Perhaps add a link to https://pubmed.ncbi.nlm.nih.gov/?term=xgboost+mortality+prediction at https://carpentries-incubator.github.io/machine-learning-trees-python/07-gradient-boosting/index.html
`base_estimator` has been renamed to `estimator` in recent versions of scikit-learn.
On the random forest page, we specify `max_features=1`, but the decision boundaries are all bivariate. This makes for a very confusing introduction to random forests:
https://carpentries-incubator.github.io/machine-learning-trees-python/06-random-forest/index.html
As highlighted by @alanocallaghan in #26, some of the software used in this lesson has become outdated (in particular sklearn). Bump the sklearn version and update the syntax.
I can think of two visualisations that may help with explaining boosting (though I am not sure how easy they would be to make).
One is to show the same image with the six trees next to each other, but colour each tree's points by whether they have been correctly classified or missed so far. This might make it easier to see what goes on in tree 6.
The second one is to create a composite image of all the decision boundaries that would illustrate the idea of the last step, where a weighted average is taken. Basically, there would be areas that are more or less orange and areas that are more or less blue, and it would (hopefully) be easy to see how the final decision surface comes about. (This may be an overly simplistic idea on my part, and it might turn out that this is not actually helpful. But may be worth a try).
Again, I could give these a shot.
The evaluation/performance section should include discussion of model calibration:
https://carpentries-incubator.github.io/machine-learning-trees-python/08-performance/
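A reliability-curve sketch could support that discussion. Here is a minimal pure-Python version of the binning idea (in practice `sklearn.calibration.calibration_curve` does this; the function name and example values here are illustrative):

```python
def reliability_bins(probs, outcomes, n_bins=5):
    """For each probability bin, compare the mean predicted probability
    with the observed fraction of positive outcomes. A well-calibrated
    model has these two values roughly equal in every bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        i = min(int(p * n_bins), n_bins - 1)
        bins[i].append((p, y))
    results = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            frac_pos = sum(y for _, y in members) / len(members)
            results.append((mean_pred, frac_pos))
    return results

# A model that predicts 0.9 for events that happen only half the time is
# overconfident: mean predicted ~0.9 vs. observed fraction 0.5.
print(reliability_bins([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))
```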
Builds are failing because Node.js 16 actions are deprecated.
See: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/
Fix appears to be at: https://docs.github.com/en/actions/creating-actions/metadata-syntax-for-github-actions#runs-for-javascript-actions
What kind of exercises or tasks could we add to Section 8 to make it more interactive?
Note, the markdown for a standard exercise is:
> ## Exercise
> A) Q1
> B) Q2
> > ## Solution
> > A) Q1
> > B) Q2
> {: .solution}
{: .challenge}
Make it clear that the random forest is restricting the variables available at each decision in the node (rather than restricting which variables are available for training of the entire tree): https://carpentries-incubator.github.io/machine-learning-trees-python/06-random-forest/
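One way to make the per-split restriction unmistakable is a sketch of the split-finding loop, showing that the feature subset is redrawn fresh at every split rather than fixed once per tree (pure Python; feature names are illustrative):

```python
import random

random.seed(0)
all_features = ["age", "heart_rate", "blood_pressure", "temperature"]
max_features = 2

def features_for_split():
    # A random forest draws a FRESH random subset of candidate features
    # at EACH split in the tree...
    return random.sample(all_features, max_features)

# ...so different splits within the same tree can use different features.
for depth in range(3):
    print(f"split at depth {depth} considers: {features_for_split()}")
```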
What kind of exercises or tasks could we add to Section 3 to make it more interactive?
Note, the markdown for a standard exercise is:
> ## Exercise
> A) Q1
> B) Q2
> > ## Solution
> > A) Q1
> > B) Q2
> {: .solution}
{: .challenge}
What kind of exercises or tasks could we add to Section 1 to make it more interactive?
Note, the markdown for a standard exercise is:
> ## Exercise
> A) Q1
> B) Q2
> > ## Solution
> > A) Q1
> > B) Q2
> {: .solution}
{: .challenge}
Consider adding a section on how to explore variable importance for the models (e.g. shapley).
Here is a possible exercise to help people build intuition: show them just a small section of the data (maybe 10 data points) and have them build their own decision trees of depths 1 and 2.
We could do this by showing the data on a 2-D graph, coloured by outcome, and having users draw a decision surface by hand. Or we could write a widget that lets people play around with different decision models and see what happens.
This would be done before doing it with code for the entire dataset.
I would be happy to write this out and add it if you all think it's a good idea.
On the performance page, it would be good to add logistic regression as a baseline:
https://carpentries-incubator.github.io/machine-learning-trees-python/08-performance/
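A minimal sketch of that baseline comparison, assuming scikit-learn is available (the generated data here is illustrative; in the lesson this would be the existing train/test split):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Illustrative data; the lesson would reuse its own train/test split.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a simple logistic regression as the baseline to beat.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])
print(f"Logistic regression baseline AUC: {auc:.3f}")
```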
What kind of exercises or tasks could we add to Section 5 to make it more interactive?
Note, the markdown for a standard exercise is:
> ## Exercise
> A) Q1
> B) Q2
> > ## Solution
> > A) Q1
> > B) Q2
> {: .solution}
{: .challenge}