Introduction

Website sources for Applied Machine Learning for Tabular Data

Welcome! This is a work in progress. We want to create a practical guide to developing quality predictive models from tabular data. We'll publish materials here as we create them and welcome community contributions in the form of discussions, suggestions, and edits.

We also want these materials to be reusable and open. The sources are in a public GitHub repository with a Creative Commons license attached (see below).

Our intention is to write these materials and, when we feel we're done, pick a publishing partner to produce a print version.

The book takes a holistic view of the predictive modeling process and focuses on a few areas that are usually left out of similar works. For example, the effectiveness of the model can be driven by how the predictors are represented. Because of this, we tightly couple feature engineering methods with machine learning models. Also, quite a lot of work happens after we have determined our best model and created the final fit. These post-modeling activities are an important part of the model development process and will be described in detail.
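To make this coupling concrete, here is a minimal sketch (ours, not an excerpt from the book) using tidymodels, where the preprocessing steps are bundled with the model so that they are always applied together; the `ames` housing data from the modeldata package is used for illustration:

```r
# A minimal sketch of coupling feature engineering with a model via tidymodels.
# Assumes the tidymodels and modeldata packages are installed.
library(tidymodels)

data(ames, package = "modeldata")

# Feature engineering: how the predictors are represented can drive the
# model's effectiveness, so the steps live in a recipe...
rec <- recipe(Sale_Price ~ Gr_Liv_Area + Year_Built, data = ames) |>
  step_log(Gr_Liv_Area, base = 10) |>
  step_normalize(all_numeric_predictors())

# ...and the recipe travels with the model inside a workflow, so fitting,
# resampling, and prediction always use the same preprocessing.
wflow <- workflow() |>
  add_recipe(rec) |>
  add_model(linear_reg())

fit(wflow, data = ames)
```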

To cite this work, we suggest:

@online{aml4td,
  author = {Kuhn, M. and Johnson, K.},
  title = {{Applied Machine Learning for Tabular Data}},
  year = {2023},
  url = {https://aml4td.org},
  urldate = {2023-11-20}
}

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Our goal is to have an open book where people can reuse and reference the materials but can't just put their names on them and resell them (without our permission).

Intended Audience

Our intended audience includes data analysts of many types: statisticians, data scientists, professors and instructors of machine learning courses, laboratory scientists, and anyone else who wants to understand how to create a model for prediction. We don't expect readers to be experts in these methods or the math behind them. Instead, our approach throughout this work is applied. That is, we want readers to use this material to build intuition about the predictive modeling process. What are good and bad ideas for the modeling process? What pitfalls should we look out for? How can we be confident that the model will be predictive for new samples? What are the advantages and disadvantages of different types of models? These are just some of the questions that this work will address.

Some background in modeling and statistics will be extremely useful. Having seen or used basic regression models is good, and an understanding of basic statistical concepts such as variance, correlation, populations, and samples is needed. There will also be some mathematical notation, so you'll need to be able to grasp these abstractions, but we will keep the notation to the parts where it is absolutely necessary. There are a few more statistically sophisticated sections for some of the more advanced topics.

If you would like a more theoretical treatment of machine learning models, then we recommend Hastie et al. (2017). Other books for gaining a more in-depth understanding of machine learning are Bishop and Nasrabadi (2006), Arnold et al. (2019) and, for more of a deep learning focus, Goodfellow et al. (2016).

Is there code?

We definitely want to decouple the content of this work from specific software. One of our other books on modeling had computing sections. Many people found these sections to be a useful resource at the time of the book's publication. However, code can quickly become outdated in today's computational environment. In addition, this information takes up a lot of page space that would be better used for other topics.

We will create computing supplements to go along with the materials. Since we use R's tidymodels framework for the calculations, a tidymodels supplement is currently in progress.

If you are interested in working on a Python/scikit-learn supplement, please file an issue.

Are there exercises?

Many readers found the Exercise sections of Applied Predictive Modeling to be helpful for solidifying the concepts presented in each chapter. The current set can be found at exercises.aml4td.org.

How can I ask questions?

If you have questions about the content, it is probably best to ask on a public forum, like Cross Validated. You'll most likely get a faster answer there if you take the time to ask the question in the best way possible.

If you want a direct answer from us, you should follow what we call Yihui's Rule: add an issue to GitHub (labeled as "Discussion") first. It may take some time for us to get back to you.

If you think there is a bug, please file an issue.

Can I contribute?

There is a contributing page with details on how to get up and running to compile the materials (there are a lot of software dependencies) and suggestions on how to help.

If you just want to fix a typo, you can make a pull request to alter the appropriate .qmd file.

Please feel free to improve the quality of this content by submitting pull requests. A merged PR will make you appear in the contributor list. It will, however, be considered a donation of your work to this project. You are still bound by the conditions of the license, meaning that you are not considered an author, copyright holder, or owner of the content once it has been merged in.

Issues

2023-12-19 release

  • review potential changes to the DESCRIPTION file
  • (optional) update R packages with DESCRIPTION file
  • (optional) renv::snapshot()
  • search for pre-processing and other formats
  • remake bibtex files (R/post-process-bib-file.R)
  • review potential changes to contributing/preface pages.
  • delete items in _cache and _freeze
  • quarto render
  • quarto preview
  • review results
  • bump the version number in DESCRIPTION
  • usethis::use_github_release()
  • quarto publish gh-pages --no-render
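For reference, a hedged sketch (ours, not an official release script) of how the scriptable parts of this checklist could be chained together from R; the manual review steps above still apply:

```r
# A sketch of the automatable steps from the release checklist above.
renv::snapshot()                          # (optional) lock package versions

source("R/post-process-bib-file.R")       # remake the bibtex files

# delete cached items so the site re-renders from scratch
unlink(c("_cache", "_freeze"), recursive = TRUE)

system("quarto render")                   # render the site
system("quarto preview")                  # preview and review the results

# after manually bumping the version number in DESCRIPTION:
usethis::use_github_release()
system("quarto publish gh-pages --no-render")
```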

additional individual transformations

For the numeric predictors chapter, we had previously talked about transformations for percentages and proportions (like the arcsine; see this reference).

Also, describe some transformations based on conventions or scientific knowledge.
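For instance, a minimal sketch of the arcsine square-root transformation for proportions, which is often used to stabilize their variance:

```r
# Arcsine square-root transformation for proportions in [0, 1].
p <- c(0.01, 0.10, 0.50, 0.90, 0.99)
transformed <- asin(sqrt(p))
transformed

# The inverse recovers the original proportions.
sin(transformed)^2
```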

where to initially discuss the variance/bias tradeoff?

We'll talk a lot about model complexity with regard to:

  • the overfitting chapter (e.g., as complexity ☝️, bias 👇)
  • resampling methods
  • ensembles
  • regression performance (MSE decomposition)

Should we have an initial section on it though?
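Wherever it lands, here is a hedged simulation sketch (ours) of the tradeoff, using polynomial degree as the complexity knob and decomposing the error of a simple model at a single point into squared bias and variance:

```r
# A simulation sketch of the decomposition MSE = bias^2 + variance + noise.
set.seed(1)
true_f <- function(x) sin(2 * pi * x)
x0 <- 0.9                    # point at which we decompose the error

# Refit a polynomial of a given degree to many simulated training sets and
# collect the predictions at x0.
sim_preds <- function(degree, n_train = 30, n_sims = 500) {
  vapply(seq_len(n_sims), function(i) {
    x <- runif(n_train)
    y <- true_f(x) + rnorm(n_train, sd = 0.3)
    fit <- lm(y ~ poly(x, degree))
    unname(predict(fit, newdata = data.frame(x = x0)))
  }, numeric(1))
}

# Low complexity: high bias, low variance. High complexity: the reverse.
for (deg in c(1, 3, 9)) {
  preds <- sim_preds(deg)
  cat(sprintf("degree %d: bias^2 = %.4f, variance = %.4f\n",
              deg, (mean(preds) - true_f(x0))^2, var(preds)))
}
```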

Add basic regression tests

A few implementations, specifically torch and libsvm/kernlab, don't have good control over random number usage. Also, we have seen differences in results across Intel and Apple Silicon chips (but that seems to be getting better).

We have some places where we programmatically write out results in-line. If the results change, our encoded conclusions might no longer be valid.

We can take a few key objects and save their results once their usage is finalized. Then we can use testthat to verify that those results are the same (or within some tolerance). Since the project is almost structured like an R package, this means that we can use devtools::test() to check for consistency of results.
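A minimal testthat sketch of the idea, assuming reference values were saved once with `saveRDS()`; `compute_torch_metrics()` is a hypothetical helper that re-runs the computation in question:

```r
library(testthat)

test_that("model results are stable", {
  # reference results, saved once after their usage was finalized
  reference <- readRDS(test_path("ref", "torch-metrics.rds"))

  # hypothetical helper that recomputes the current values
  current <- compute_torch_metrics()

  # identical, or within a tolerance for platform-level differences
  expect_equal(current, reference, tolerance = 0.01)
})
```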

Dark mode theme

Although it won't affect images, we should have some CSS for this.

python computing supplement

I'd be interested in helping with a python computing supplement.

Did you have a format in mind? It seems likely that after the setup section, most sections could be tightly coupled between the R and Python versions, which suggests that having two independent repositories isn't ideal. I think Quarto supports panelsets (as "tabsets"); that strikes me as a nice way to display the two, but it would also mean that both versions of the code would need to be updated when a change is made.

One other thing that would be nice to decide on early: which Python plotting library to use? plotnine mimics ggplot, matplotlib is already used by sklearn and pandas, others are slicker...
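For what it's worth, a minimal sketch of the tabset idea in Quarto markup (the chunk contents here are placeholders, not working examples):

````
::: {.panel-tabset}

## R

```{r}
# tidymodels version of the example
```

## Python

```{python}
# scikit-learn version of the example
```

:::
````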
