
machine-learning-trees-python's Introduction

Introduction to Tree Models in Python

Decision trees are a family of algorithms that are based around a tree-like structure of decision rules. These algorithms often perform well in tasks such as prediction and classification.
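As a minimal sketch of the idea, a decision tree classifier can be fitted in a few lines with scikit-learn. The iris dataset below is just a stand-in for the mortality data used in the lesson:

```python
# Minimal sketch: fit a shallow decision tree and check its accuracy.
# The iris data here is a placeholder, not the lesson's mortality dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
print(f"Test accuracy: {tree.score(X_test, y_test):.2f}")
```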

This lesson explores the properties of tree models in the context of mortality prediction. The lesson also covers topics such as overfitting, ensemble models, boosting, and bagging.

It is the second lesson in the machine learning curriculum. In later lessons we explore neural networks for image classification, and responsible machine learning.

  1. Introduction to Machine Learning in Python [Lesson materials; Code repository]
  2. Introduction to Tree Models in Python [Lesson materials; Code repository]
  3. Introduction to artificial neural networks in Python [Lesson materials; Code repository]
  4. Responsible machine learning in Python [Lesson materials; Code repository]

Workshop schedule

These lessons are being run at the University of Edinburgh as part of the Ed-DaSH Data Science training programme for Health and Biosciences.

The first lessons were taught in May 2022: https://edcarp.github.io/2022-05-24_ed-dash_machine-learning/. For a list of future lessons, see: https://edcarp.github.io/Ed-DaSH/workshops

Contributing

We welcome all contributions to improve the lesson! Maintainers will do their best to help you if you have any questions, concerns, or experience any difficulties along the way.

We'd like to ask you to familiarize yourself with our Contribution Guide and have a look at the more detailed guidelines on proper formatting, ways to render the lesson locally, and even how to write new episodes.

Please see the current list of issues for ideas for contributing to this repository. For making your contribution, we use the GitHub flow, which is nicely explained in the chapter Contributing to a Project in Pro Git by Scott Chacon. Look for the tag good_first_issue. This indicates that the maintainers will welcome a pull request fixing this issue.

Maintainer(s)

Current maintainers of this lesson are:

Authors

A list of contributors to the lesson can be found in AUTHORS.

Citation

To cite this lesson, please consult CITATION.

machine-learning-trees-python's People

Contributors

anenadic, lauracmurphy, mcmaurer, tompollard, zkamvar


machine-learning-trees-python's Issues

Exercises for section 3 (Variance)

What kind of exercises or tasks could we add to Section 3 to make it more interactive?

Note, the markdown for a standard exercise is:

> ## Exercise
> A) Q1
> B) Q2
> > ## Solution
> > A) Q1
> > B) Q2
> {: .solution}
{: .challenge}

Exercises for section 4 (boosting)

What kind of exercises or tasks could we add to Section 4 to make it more interactive?

Note, the markdown for a standard exercise is:

> ## Exercise
> A) Q1
> B) Q2
> > ## Solution
> > A) Q1
> > B) Q2
> {: .solution}
{: .challenge}

Exercises for section 6 (random forest)

What kind of exercises or tasks could we add to Section 6 to make it more interactive?

Note, the markdown for a standard exercise is:

> ## Exercise
> A) Q1
> B) Q2
> > ## Solution
> > A) Q1
> > B) Q2
> {: .solution}
{: .challenge}

Post DUSC Instructor thoughts

Thoughts after the DUSC workshop on 15/11/23:

  • Using a different dataset with a more even split (50:50) between the two classes would make the bias-variance trade-off clearer
  • Using a non-medical dataset would also broaden the lesson's applicability
  • Generally needs more programming tasks. If this lesson follows on from the intro to ML course, then you can:
  1. Set the data pre-processing as a task
  2. Get learners to compute accuracy on both training and test data from the outset
  3. Set a general task at the end that lets learners train on more data or on different data types
  • The course really could do with highlighting the benefits of random forests and gradient boosting. This can only be done by adding more features sooner.
  • Reduce the amount of plotting. It is effective early on for visualising decision trees, but ineffective and time-consuming when comparing later models.
  • Perhaps drop gradient boosting entirely. It is skimmed over so quickly that it doesn't convey any of its benefits or differences relative to random forests.
  • To show the power of random forests, try running the model on highly correlated features
  • Ideally, the code should not keep reassigning the mdl variable; create a new variable for each model to make comparison easier (see the sketch after this list)
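A minimal sketch of the last two suggestions combined (one clearly named variable per model, compared on held-out data); the dataset and variable names here are illustrative, not the lesson's own code:

```python
# One variable per model, so models can be compared side by side.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder data; the lesson would use its own dataset here.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)
boosted = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Report train and test accuracy for each model, which also surfaces
# overfitting (large train/test gaps) from the outset.
for name, model in [("tree", tree), ("forest", forest), ("boosted", boosted)]:
    print(f"{name}: train={model.score(X_train, y_train):.2f}, "
          f"test={model.score(X_test, y_test):.2f}")
```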

Visualisation to explain boosting

Visualisations that may help explain boosting

I can think of two visualisations that may help with explaining boosting (though I am not sure how easy they would be to make).

One is to show the same image with the six trees next to each other, but colour each tree by how many points it has correctly classified and how many it has missed so far. This might help us better see what goes on in tree 6.

The second one is to create a composite image of all the decision boundaries that would illustrate the idea of the last step, where a weighted average is taken. Basically, there would be areas that are more or less orange and areas that are more or less blue, and it would (hopefully) be easy to see how the final decision surface comes about. (This may be an overly simplistic idea on my part, and it might turn out not to be helpful, but it may be worth a try.)

Again, I could give these a shot.
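As a rough starting point for the second idea, something like the following could work; the dataset and all names here are placeholders, not the lesson's own data or code:

```python
# Sketch: plot the weighted vote of a small boosted ensemble as a soft
# colour surface, so blended orange/blue regions show how the final
# decision boundary emerges. Toy 2-D data stands in for the lesson's.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
model = AdaBoostClassifier(n_estimators=6, random_state=0).fit(X, y)

# decision_function gives the weighted vote of the 6 trees: its sign is
# the final class, its magnitude how strongly the trees agree there.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
zz = model.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz, levels=20, cmap="coolwarm", alpha=0.6)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", edgecolor="k")
plt.title("Weighted vote of 6 boosted trees")
plt.show()
```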

Idea for an exercise: Build your own decision tree

Idea for an exercise: Build your own tree

Here is a possible exercise to help people build intuition: show them just a small section of the data (maybe 10 data points), and have them build their own decision trees of depths 1 and 2.

This could be done by showing the data on a 2-D graph, coloured by outcome, and having learners draw a decision boundary by hand. Alternatively, a widget could let people play around with different decision models and see what happens.

This would be done before doing it with code for the entire dataset.

I would be happy to write this out and add it if you all think it's a good idea.
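One possible starting point for such an exercise, assuming scikit-learn and a synthetic 10-point dataset (all names and data here are illustrative):

```python
# Show 10 points for learners to split by eye, then reveal what small
# trees actually learn. Synthetic blobs stand in for the lesson data.
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = make_blobs(n_samples=10, centers=2, random_state=0)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", edgecolor="k")
plt.title("Where would you draw the first split?")
plt.show()

# After learners have drawn their own boundary, compare with real trees.
for depth in (1, 2):
    small_tree = DecisionTreeClassifier(max_depth=depth).fit(X, y)
    plot_tree(small_tree, filled=True)
    plt.title(f"Decision tree of depth {depth}")
    plt.show()
```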

Exercises for section 8 (performance)

What kind of exercises or tasks could we add to Section 8 to make it more interactive?

Note, the markdown for a standard exercise is:

> ## Exercise
> A) Q1
> B) Q2
> > ## Solution
> > A) Q1
> > B) Q2
> {: .solution}
{: .challenge}

Exercises for section 2 (decision tree)

What kind of exercises or tasks could we add to Section 2 to make it more interactive?

Note, the markdown for a standard exercise is:

> ## Exercise
> A) Q1
> B) Q2
> > ## Solution
> > A) Q1
> > B) Q2
> {: .solution}
{: .challenge}

Exercises for section 7 (gradient boosting)

What kind of exercises or tasks could we add to Section 7 to make it more interactive?

Note, the markdown for a standard exercise is:

> ## Exercise
> A) Q1
> B) Q2
> > ## Solution
> > A) Q1
> > B) Q2
> {: .solution}
{: .challenge}

Exercises for section 1 (introduction)

What kind of exercises or tasks could we add to Section 1 to make it more interactive?

Note, the markdown for a standard exercise is:

> ## Exercise
> A) Q1
> B) Q2
> > ## Solution
> > A) Q1
> > B) Q2
> {: .solution}
{: .challenge}

Discussion of regression should be moved to later in the workshop.

The following paragraph (at https://carpentries-incubator.github.io/machine-learning-trees-python/02-decision-tree/index.html) makes a forward reference: understanding it requires concepts that haven't yet been taught.

While we will only be looking at classification here, regression isn’t too different. After grouping the data (which is essentially what a decision tree does), classification involves assigning all members of the group to the majority class of that group during training. Regression is the same, except you would assign the average value, not the majority.

We should move the discussion of regression to later in the content.
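To make the contrast concrete wherever it ends up, here is a hedged toy example (data and values illustrative) of the leaf-level behaviour described in the quoted paragraph: a classification tree predicts a leaf's majority class, while a regression tree predicts the leaf's mean value.

```python
# Toy contrast: classification predicts the leaf majority, regression
# predicts the leaf mean. All data here is made up for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.array([[1], [2], [3], [10], [11], [12]])
y_class = np.array([0, 0, 1, 1, 1, 1])              # class labels
y_reg = np.array([1.0, 1.5, 2.0, 9.0, 10.0, 11.0])  # continuous targets

clf = DecisionTreeClassifier(max_depth=1).fit(X, y_class)
reg = DecisionTreeRegressor(max_depth=1).fit(X, y_reg)

print(clf.predict([[2.5]]))  # majority class of its leaf -> 0
print(reg.predict([[2.5]]))  # mean of its leaf -> (1.0 + 1.5 + 2.0) / 3 = 1.5
```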

Exercises for section 5 (bagging)

What kind of exercises or tasks could we add to Section 5 to make it more interactive?

Note, the markdown for a standard exercise is:

> ## Exercise
> A) Q1
> B) Q2
> > ## Solution
> > A) Q1
> > B) Q2
> {: .solution}
{: .challenge}
