
practical-ml-with-pytorch's Introduction






ICCS Practical Machine Learning with PyTorch


This repository contains documentation, resources, and code for the Introduction to Machine Learning with PyTorch session designed and delivered by Jack Atkinson (@jatkinson1000) and Jim Denholm (@jdenholm) of ICCS.
The material has been delivered at both the ICCS and NCAS summer schools.
All materials, including slides and videos, are available such that individuals can cover the course in their own time.

A website for this workshop can be found at https://cambridge-iccs.github.io/practical-ml-with-pytorch/.


Learning Objectives

The key learning objective from this workshop could be simply summarised as:
Provide the ability to develop ML models in PyTorch.

However, more specifically we aim to:

  • provide an understanding of the structure of a PyTorch model and ML pipeline,
  • introduce the different functionalities PyTorch might provide,
  • encourage good research software engineering (RSE) practice, and
  • exercise careful consideration and understanding of data used for training ML models.

With regards to specific ML content we cover:

  • using ML for both classification and regression,
  • artificial neural networks (ANNs) and convolutional neural networks (CNNs), and
  • treatment of both tabular and image data.

Teaching Material

Slides

The slides for this workshop can be viewed on the ICCS Summer School Website.

The slides are generated from markdown using Quarto. The raw markdown and HTML files can be found in the slides directory.

Exercises

The exercises for the course can be found in the exercises directory.
These take the form of partially complete Jupyter notebooks.

Videos

Videos from past workshops may be useful if you are following along independently.
These can be found on the ICCS YouTube channel under the 2023 Summer School materials.

Worked Solutions

Worked solutions for all of the exercises can be found in the worked solutions directory.
These are for recapping after the course in case you missed anything, and contain ideal solutions complete with docstrings, outfitted with type hints, linted, and conforming to the black code style.

Preparation and prerequisites

To get the most out of the session we assume a basic understanding of a few areas and expect you to do some preparation in advance. The expected knowledge is outlined below, along with resources for further reading if you are unfamiliar.

Mathematics and Machine Learning

Basic mathematics knowledge:

  • calculus - differentiating a function
  • matrix algebra - matrix multiplication and representing data as a matrix
  • regression - fitting a function to data

Neural Networks:

Python

The course will be taught in Python using PyTorch. Whilst no prior knowledge of PyTorch is expected, we assume users are familiar with the basics of Python3. This includes:

  • Basic mathematical operations
  • Writing and running scripts/programs
  • Writing and using functions
  • The concept of object orientation
    i.e. that an object, e.g. a dataset, can have functions/methods associated with it.
  • Basic use of the following libraries (see the short example after this list):
    • numpy for mathematical and array operations
    • matplotlib for plotting and visualisation
    • pandas for storing and accessing tabular data
  • Familiarity with the concept of a Jupyter notebook
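
For reference, a minimal sketch of the level of familiarity we assume with these libraries (the file and column names here are purely illustrative):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Read tabular data into a DataFrame (hypothetical file name)
df = pd.read_csv("measurements.csv")

# Basic array mathematics with numpy
values = df["temperature"].to_numpy()
anomalies = values - np.mean(values)

# A simple plot with matplotlib
plt.hist(anomalies, bins=20)
plt.xlabel("Temperature anomaly")
plt.show()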

git and GitHub

You will be expected to know how to

  • clone and/or fork a repository,
  • commit, and
  • push.

The workshop from the 2022 ICCS Summer School should provide the necessary knowledge.
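
As a quick refresher, a typical sequence of commands (with placeholder names) looks like:

git clone https://github.com/<your-username>/<your-fork>.git
git add <files-you-changed>
git commit -m "Describe your change"
git push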

Preparation

In preparation for the course please ensure that your computer contains the following:

Note for Windows users: We have linked suitable applications for Windows in the above lists. However, you may wish to refer to Windows' getting started with Python information for a complete guide to getting set up on a Windows system.

If you require assistance or further information with any of these please reach out to us before a training session.

Installation and setup

There are three options for participating in this workshop for which instructions are provided below:

We recommend the local install approach, especially if you forked the repository, as it is the easiest way to keep a copy of your work and push back to GitHub.

However, if you experience issues with the installation or are unfamiliar with the terminal, there is the option to run the notebooks in Google Colab or on Binder.

Local Install

1. Clone or fork the repository

Navigate to the location you want to install this repository on your system and clone via HTTPS by running:

git clone https://github.com/Cambridge-ICCS/practical-ml-with-pytorch.git

This will create a directory practical-ml-with-pytorch/ with the contents of this repository.

Please note that if you have a GitHub account and want to preserve any work you do we suggest you first fork the repository and then clone your fork. This will allow you to push your changes and progress from the workshop back up to your fork for future reference.

2. Create a virtual environment

Before installing any Python packages it is important to first create a Python virtual environment. This provides an insulated environment inside which we can install Python packages without polluting the operating system's Python environment.

If you have never done this before don't worry: it is very good practice, especially when you are working on multiple projects, and easy to do.

python3 -m venv MLvenv

This will create a directory called MLvenv containing software for the virtual environment. To activate the environment run:

source MLvenv/bin/activate

You can now work with Python from within this isolated environment, installing packages as you wish without disturbing your base system environment.

When you have finished working on this project run:

deactivate

to deactivate the venv and return to the system Python environment.

You can always boot back into the venv as you left it by running the activate command again.

3. Install dependencies

It is now time to install the dependencies for our code, for example PyTorch. The project has been packaged with a pyproject.toml so can be installed in one go. From within the root directory, with the virtual environment active, run:

pip install .

This will download the relevant dependencies into the venv as well as setting up the datasets that we will be using in the course.
Whilst the workshop should install and run with the latest versions of Python libraries, it has been tested with the following versions of the major dependencies: torch 2.0.1, pandas 2.1.0, palmerpenguins 0.1.4, ipykernel 6.25.2, matplotlib 3.8.0, notebook 7.0.3.

4. Run the notebook

From the current directory, launch the Jupyter notebook server:

jupyter notebook

This command should then point you to the right location within your browser to use the notebook, typically http://localhost:8888/.

(Optional) Keep the virtual environment persistent in Jupyter notebooks

The following step is sometimes useful if you're having trouble with your Jupyter notebook finding the virtual environment. You will want to do this before launching the Jupyter notebook.

python -m ipykernel install --user --name=MLvenv

Google Colab

Running on Colab is useful as it allows you to access GPU resources.
To launch the notebooks in Google Colab click the following links for each of the exercises:

Notes:

  • Running in Google Colab requires you to have a Google account.
  • If you leave a Colab session your work will be lost, so be careful to save any work you want to keep.

Binder

If you cannot operate using a local install, and do not wish to sign up for a Google account, the repository can be launched on Binder.

Notes:

  • If you leave a Binder session your work will be lost, so be careful to save any work you want to keep.
  • Due to the limited resources provided by Binder you will struggle to run training in exercises 3 and 4.

JOSE Publication

This workshop has been published in JOSE, the Journal of Open Source Education, with DOI 10.21105/jose.00239. The paper materials can be found in the JOSE_paper/ directory.

If you re-use or build on this material please cite this publication using the information in the CITATION.cff file.

@article{Atkinson2024,
  doi = {10.21105/jose.00239},
  url = {https://doi.org/10.21105/jose.00239},
  year = {2024},
  publisher = {The Open Journal},
  volume = {7},
  number = {76},
  pages = {239},
  author = {Jack Atkinson and Jim Denholm},
  title = {Practical machine learning with PyTorch},
  journal = {Journal of Open Source Education}
}

License

The code materials in this project are licensed under the MIT License.

The teaching materials are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

CC BY-NC-SA 4.0

Contribution Guidelines and Support

If you spot an issue with the materials please let us know by opening an issue here on GitHub clearly describing the problem.

If you are able to fix an issue that you spot, or an existing open issue please get in touch by commenting on the issue thread.

Contributions from the community are welcome. To contribute back to the repository please first fork it, make the necessary changes to fix the problem, and then open a pull request back to this repository clearly describing the changes you have made. We will then perform a review and merge once ready.

If you would like support using these materials, adapting them to your needs, or delivering them please get in touch either via GitHub or via ICCS.


practical-ml-with-pytorch's Issues

Rename repository

Issue #46 made me wonder if we should perhaps re-name this repo as practical-ml-with-pytorch or similar as it better reflects the paper name, workshop name, and removes any confusion that this is intended for 'general ML' or similar.

Thoughts from @ma595 @surbhigoel77 @tztsai @dorchard @jdenholm appreciated.

It would be good to do this before the JOSE publication is accepted, which I hope is imminent.
Let me know and I will implement if we are happy.

Prerequisites

Jack and Jim have quite a lot on this previously:

  • we probably could condense this
  • add a pytorch primer.

Mathematics and Machine Learning

Basic mathematics knowledge:

  • calculus - differentiating a function
  • matrix algebra - matrix multiplication and representing data as a matrix
  • regression - fitting a function to data

Neural Networks:

Python

The course will be taught in Python using PyTorch.
Whilst no prior knowledge of PyTorch is expected, we assume users are familiar with the basics of Python3.
This includes:

  • Basic mathematical operations
  • Writing and running scripts/programs
  • Writing and using functions
  • The concept of object orientation
    i.e. that an object, e.g. a dataset, can have functions/methods associated with it.
  • Basic use of the following libraries:
    • numpy for mathematical and array operations
    • matplotlib for plotting and visualisation
    • pandas for storing and accessing tabular data
  • Familiarity with the concept of a Jupyter notebook

git and GitHub

You will be expected to know how to

  • clone and/or fork a repository,
  • commit, and
  • push.

The workshop from the 2022 ICCS Summer School
should provide the necessary knowledge.

Preparation

In preparation for the course please ensure that your computer contains the following:

Note for Windows users: We have linked suitable applications for Windows in the above lists.
However, you may wish to refer to Windows' getting started with Python information
for a complete guide to getting set up on a Windows system.

If you require assistance or further information with any of these please reach out to
us before a training session.

changes to slides

Surbhi and I plan to spend more time explaining the theory of NNs with more focus on:

  • Datasets. Dimensions.
  • An expanded introduction:
    • Potential applications. What NNs can do...
    • Other alternatives to NNs.
    • More of an overview of pytorch. What is jax etc.
  • Links to previous statistical techniques like regression.

Specific notes on current slides:

11/29

  • neural is spelt nerual
  • some weird aberration on the slide
  • weird formatting here.

13/29

  • "workshop lecture thing"

General notes on slides

  • quite a brief overview of nns
    - do mention 3blue1brown ml course, can't do much better than this.
    - could we include some of this content in additional slides to better visualise some concepts?
  • jumps straight into SGD without much intro
    • more of a justification for why we're using NNs
  • more images
    • visualising data
      • tabulated data
    • potentials of what NNs can do
  • comparison of different models for different datasets.
    • Other methods like SVMs instead of NNs.
  • Drawbacks of NNs?
    - black-boxes. https://towardsdatascience.com/the-math-behind-kan-kolmogorov-arnold-networks-7c12a164ba95
  • pytorch vs scikit learn
    • why we even need pytorch

Summary of what is currently covered

Penguin classification:

  • loading penguin dataframe and inspecting data
  • introduce a torch.utils.data.Dataset.
  • split into train and validation
  • transforming the data using torchvision.transforms.Compose
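
A rough sketch of the pattern summarised above (a minimal illustration only; the file name, column names, and transforms are placeholders rather than the exact workshop code):

import pandas as pd
import torch
from torch.utils.data import Dataset, random_split
from torchvision.transforms import Compose

class PenguinDataset(Dataset):
    """Minimal tabular Dataset wrapping a pandas DataFrame."""

    def __init__(self, df, x_cols, y_col, x_tfms=None):
        self.df = df.reset_index(drop=True)
        self.x_cols, self.y_col = x_cols, y_col
        self.x_tfms = x_tfms

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        features = row[self.x_cols].to_numpy(dtype="float32")
        if self.x_tfms is not None:
            features = self.x_tfms(features)
        target = torch.tensor(row[self.y_col], dtype=torch.long)
        return features, target

# Load, wrap, transform, and split into train/validation sets
df = pd.read_csv("penguins.csv")  # hypothetical file name
x_tfms = Compose([torch.from_numpy])
dataset = PenguinDataset(df, ["bill_length_mm", "body_mass_g"], "species_code", x_tfms)
n_train = int(0.8 * len(dataset))
train_set, valid_set = random_split(dataset, [n_train, len(dataset) - n_train])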

Suggestions for guided tutorial / workshop

  • Add some boilerplate / template for them to add to. Notes.
  • Extensions. Compare different methods?

Other resources

  • Stanford ML resources

Todos

  • Think about what content we want to include as part of the expanded introduction.
  • Notebook improvements.
    • Adding more content to the exercises to make them easier to complete
    • Statistical summary of the data: max / min / missing values, ...

Suggested changes to the ML training material

Proposed sequence of topics

  1. How ML works
  2. Steps involved in developing a model - loading data, preprocessing, model definition (algorithm, optimiser, loss), training, prediction (a minimal skeleton illustrating this sequence is sketched below)
  3. Pytorch intro (data type it works with - tensors, other key concepts)
  4. Types of problems - Regression and Classification
  5. Hands-on with a Regression example
  6. Hands-on with Classification example

Algorithm used: Neural Network
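
A minimal PyTorch skeleton following the proposed sequence might look something like the sketch below (layer sizes, optimiser settings, and the random placeholder data are illustrative only, not a prescription for the material):

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# 1-2. Load and preprocess data (random placeholder tensors here)
features = torch.randn(200, 4)
targets = torch.randint(0, 3, (200,))
loader = DataLoader(TensorDataset(features, targets), batch_size=16, shuffle=True)

# 3. PyTorch model definition: network, optimiser, and loss
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3))
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# 4-5. Training loop for a classification problem
for epoch in range(5):
    for batch_x, batch_y in loader:
        optimiser.zero_grad()
        loss = loss_fn(model(batch_x), batch_y)
        loss.backward()
        optimiser.step()

# 6. Prediction
with torch.no_grad():
    predicted_classes = model(features).argmax(dim=1)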

changes to exercises

Fixes #60

At present the students have to write too much of the boilerplate themselves. It would be better to provide some of the boilerplate so that they can concentrate on implementing the key details of the algorithm.

One idea is to do the following:

# code to complete the training loop should go here... (please change this line to fix the training loop). 
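
For instance (purely illustrative placeholders, not the actual exercise code), the notebook could provide everything around the loop and leave only the training step for students to complete:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Provided boilerplate: model, optimiser, loss function, and data loader
model = nn.Linear(4, 2)
optimiser = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
train_loader = DataLoader(
    TensorDataset(torch.randn(32, 4), torch.randint(0, 2, (32,))), batch_size=8
)

for epoch in range(3):
    for batch_inputs, batch_targets in train_loader:
        # === Students complete the training step here ===
        # 1. zero the gradients
        # 2. forward pass and compute the loss
        # 3. backpropagate and step the optimiser
        pass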

General comments

  • Transformation of categorical data (see the rough sketch after this list).

  • more visualisations

    • data exploration steps
      • understanding of the data: plot the distribution of each variable and statistical information for each variable.
    • For evaluation we could provide a correlation matrix
  • Look at CNN notebook. Currently it's MNIST. Could choose something more climate focussed?
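
As a rough sketch of the categorical-encoding and correlation-matrix ideas above (the file and column names are hypothetical):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("penguins.csv")  # hypothetical file name

# One-hot encode a categorical column
df = pd.get_dummies(df, columns=["island"])

# Statistical summary and a simple correlation-matrix plot
print(df.describe())
corr = df.corr(numeric_only=True)
plt.matshow(corr)
plt.colorbar()
plt.show()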

Fix numpy version

Installing now leads to the following issue:

python3 
Python 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:11:10) [Clang 16.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/matt/miniforge3/envs/ml-workshop-2/lib/python3.12/site-packages/torch/__init__.py", line 1477, in <module>
    from .functional import *  # noqa: F403
  File "/Users/matt/miniforge3/envs/ml-workshop-2/lib/python3.12/site-packages/torch/functional.py", line 9, in <module>
    import torch.nn.functional as F
  File "/Users/matt/miniforge3/envs/ml-workshop-2/lib/python3.12/site-packages/torch/nn/__init__.py", line 1, in <module>
    from .modules import *  # noqa: F403
  File "/Users/matt/miniforge3/envs/ml-workshop-2/lib/python3.12/site-packages/torch/nn/modules/__init__.py", line 35, in <module>
    from .transformer import TransformerEncoder, TransformerDecoder, \
  File "/Users/matt/miniforge3/envs/ml-workshop-2/lib/python3.12/site-packages/torch/nn/modules/transformer.py", line 20, in <module>
    device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
/Users/matt/miniforge3/envs/ml-workshop-2/lib/python3.12/site-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:84.)
  device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
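
Until the pinned dependencies are updated, a likely workaround (assuming nothing else in the environment requires NumPy 2) is to constrain numpy, e.g.:

pip install "numpy<2"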

Topics for Pier Luigi's session

Pier Luigi and Anna Sommer:

Summary:

  • Practical part will include 3 examples from the paper with a reduced dataset.
  • Students are required to be familiar with data visualisation ahead of time. Specifically cartopy and matplotlib. I have checked with James and this will be part of the visualisation lecture.
  • Mostly using pandas.

Jack inquired about what we could teach to complement the following session by Anna Sommer:

The plan is to do the following:

I’m preparing a practical part for our session now. It’ll be in the form of a Jupyter notebook with some theoretical explanation, questions, and code to run. The topic is “Observation System Simulation Experiments in the Atlantic Ocean for enhanced surface ocean pCO2 reconstructions”; it is based on my paper with the same title published in 2021: https://os.copernicus.org/articles/17/1011/2021/.
The practical part will contain three examples from this paper with a reduced data set due to time restrictions.
We would like to show students that physical models can also be used to plan the deployment of measuring instruments, in particular by localising the areas where the source of data is most important for understanding some phenomena.

The packages I use in the exercise are:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import cartopy.crs as ccrs

I’m working with data in dataframe format. Data will be in .csv files.
An important part is data visualisation. It would be great if students already have experience with matplotlib and cartopy.

  • First, we will plot the distribution of available data in the Atlantic Ocean.
  • In the second step, we’ll run a machine learning model and look at the loss function.
  • The third part will be a visualisation of results: maps of standard deviation, mean of differences between reference (ocean model) and ML output, correlation coefficient between reference and ML output.

These are the three main steps of the exercise, with questions and analysis work in each of them.

If you need more details or have comments/questions/suggestions, please don’t hesitate to contact me.

Hi Laura, Pier Luigi mentioned that you could help during the practical part. As we have around an hour for the practical part it would be great to have someone to help if students struggle technically or have questions. If you feel comfortable with this topic it would be helpful if you could join us. As mentioned before, if you need more details or have comments/questions/suggestions, please don’t hesitate to contact me. I’ll share the Jupyter notebook file when it is ready (hopefully in a week’s time).

Comments on exercises

It is not quite clear to me whether these exercises will be done in a guided, live-coding fashion, or self-guided by the participants (with instructors available for questions). I assume the latter, but in that case some parts might need additional instructions.

Exercise1

  • Task 2, Line 63: What are we supposed to do with the "magic methods" which are mentioned in brackets? Implement them? Use them?
  • Line 65: spoiller -> spoiler
  • Tasks 6 and 7: Will you guide participants through these exercises? Otherwise more info would be needed.
  • Task 8, Line 189: Link could be integrated in the line above (not crucial, though).
  • Tasks 8, 9 and 10 need more guidance on what needs to be done.

Exercise 2

  • Task 3, Line 73: Remeber -> Remember
  • Task 6, Line 142: on scratch -> from scratch; can solve -> want to solve; is cumbersome -> would be cumbersome
  • Line 144: very ugly -> not very nice (or something else less offensive)
  • Line 155: Leave out the "while my peers..." bit, just say "It is useful to know"
  • Again, more guidance for tasks 8, 9 and 10

Exercise 3

  • Task 1a, Line 37 - Do we know how to plot image tensors, or does that need more guidance?
  • Task 1a, Line 38 - Do we know how to create a validation data set, or does that need more guidance?
  • Numbering inconsistent: 1, 1a, 2a, 3 ?
  • Task 4, Line 101 wont ->won't
  • Task 5, Line 442: choose -> chose
  • "Test 6"? -> Task 6?: "But I forgot" -> Is that intended, or do you need to change something above?

Tidy text in readme and notebooks

Relatively low priority, but when reviewing #29 I noticed that there are a few minor typos in the readme (e.g. "ploting", "neccessary", "sesison", "systems's") that would be nice to clean up during the next pass.

As mentioned in #29, some of the text in the exercise/solution notebooks is also a little out of sync.

Simpler PenguinDataset

To make the data reading aspect a little easier to understand, we intend to embed a simpler version of the PenguinDataset directly into the notebook. The src/ml_workshop/_penguins.py file will remain untouched and can still be used as before.

Thought process around the 'simpler' class (discussion between @jatkinson1000 and @ma595)

Load pandas df in notebook:

Put the definition of the PenguinDataset in the notebook (a rough sketch of what this might look like follows the list below):

  • Remove x_tfms and y_tfms
  • Hardcode one_hot, tensor, fp32 into __getitem__

Propagate change to

  • solution notebook
  • colab
  • ex 2+? Consider leaving this as is.

Suggestion: include used library versions

I'd suggest mentioning which versions you used for major Python libraries, specifically pytorch, matplotlib, pandas, numpy. For example, just saying: "The exercises were tested with versions: pytorch=2.2.2, pandas=2.1.0, etc".

In general, I think this is a good idea for future-proofing the repository -- it is useful for users to know which versions should work. (For me it's ok that versions are not specified in the pyproject.toml; in some cases I've found problems when too many packages and versions are specified.)

Apple Silicon issues

Running CNNs on the CPU with Apple silicon is prohibitively slow.

PyTorch now has an MPS backend, which works very similarly to CUDA, to alleviate this.

Consider adding to exercises.
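
A device-selection snippet along these lines could be added to the notebooks (a sketch only; requires a recent PyTorch build with MPS support):

import torch

# Prefer CUDA, then Apple's MPS backend, falling back to CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

# Models and tensors are then moved with .to(device), exactly as with CUDA
# model = model.to(device)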

Name of repo

Should we consider changing the repo name to practical DL with pytorch?

Changes to network architectures

For some reason the non-linear activation functions have been removed from the models in the solutions. This may well work, but it seems very irregular and, since this is for educational purposes, it should be corrected.
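
For example, the hidden layers should be interleaved with non-linearities along these lines (layer sizes are illustrative only):

from torch import nn

model = nn.Sequential(
    nn.Linear(4, 16),
    nn.ReLU(),  # non-linear activation between layers
    nn.Linear(16, 16),
    nn.ReLU(),
    nn.Linear(16, 3),
)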
