flames-python-data-wrangling's Introduction

Data wrangling in Python

Introduction

Handling data is a recurring task for most scientists. Reading in experimental data, checking its properties, and creating visualisations can become tedious, so making this process more efficient benefits many researchers. Spreadsheet-based software supports this process poorly because it offers little automation and repeatability. A high-level scripting language such as Python is well suited to these tasks.

This course trains students to use Python effectively for these tasks. The course focuses on data manipulation and cleaning, exploratory analysis and visualisation using some important packages such as Pandas, NumPy and Matplotlib.

The course does not cover statistics, data mining, machine learning, or predictive modelling. It aims to provide researchers with the means to tackle commonly encountered data handling tasks effectively, in order to increase the overall efficiency of their research.

The course was developed for the Flanders’ Training Network for Methodology and Statistics (Flames), but can be taught to others upon request.

Aim & scope

This course is intended for researchers who have at least basic programming skills. A basic (scientific) programming course that is part of the regular curriculum should suffice. For those who have experience in another programming language (e.g. Matlab, R, ...), following a Python tutorial prior to the course is advised.

It is intended for researchers who want to enhance their general data manipulation and analysis skills in Python. The course is NOT intended to be a course on statistics or machine learning.

Getting started

The course uses Python 3 and some data analysis packages such as Pandas, NumPy and Matplotlib. To install the required libraries, we highly recommend Anaconda or Miniconda (https://www.anaconda.com/download/) or another Python distribution that includes the scientific libraries (this recommendation applies to all platforms: Windows, Linux and Mac).

For detailed instructions to get started on your local machine, see the setup instructions.
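Once the installation is done, a quick check that the packages named above can be found may save time at the start of the course. The snippet below is only a minimal sketch, not part of the official setup instructions: the helper name `missing_packages` is hypothetical, and it only tests whether each package can be located, not which version is installed.

```python
import importlib.util

# Packages named in this README (import names are lowercase).
REQUIRED = ["pandas", "numpy", "matplotlib"]


def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

If any package is reported as missing, the setup instructions remain the reference for how to install it.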

In case you do not want to install everything and just want to try out the course material, use the environment set up by Binder and open the notebooks right away (inside the notebooks directory).

Contributing

Found a typo or have a suggestion? See how to contribute.

Meta

Authors: Joris Van den Bossche, Stijn Van Hoey

flames-python-data-wrangling's Issues

General feedback/notes from November 2020 course

  • General: we can update the data to be more recent (e.g. VMM discharge, bike count, air quality)

  • pandas 3a selecting data

    • string split:

      • better show the "pandas" way, and not the apply
      • split in two exercises: first do it on a single string, then in the second translate to str.split(",").str.get(0)
      • better tips

  • pandas timeseries notebook

    • groupby note is not relevant (groupby has not been covered yet at this point)
    • add a hint about the aggregate([])

  • visualization matplotlib:

    • include the illustration of Figure/Axes/Axis in the notebook
    • "object-oriented" -> difficult terminology

  • visualization seaborn:

    • start with a non-count plot
    • first show/explain figure-level plots, before directly also explaining the difference between figure level and axis level
    • exercise: "Use a different color for the Age Sex
    • use data= consistently

  • case3 bacterial

    • put a short explanation of the setup (different bacteria combined with different viruses, see evolution of bacteria population == optical density)
    • at the beginning drop the columns we don't need (drop())
    • "Based on density_mean, make a barplot of the (mean) values for each Bacterial_genotype, with for each Bacterial_genotype an individual bar and with each Phage_t in a different color/hue (i.e. grouped bar chart)." -> "mean value for each "optical_density"" !
    • use data= consistently
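
The two-step string split suggested in the notes above could look roughly as follows. This is a sketch under assumptions: the names in the Series are made-up example data, not taken from the course notebooks, and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical example data, not taken from the course notebooks.
names = pd.Series(
    ["Braund, Mr. Owen", "Cumings, Mrs. John", "Heikkinen, Miss. Laina"]
)

# Step 1: plain Python on a single string.
single = names[0]
surname_single = single.split(",")[0]

# Step 2: the same logic on the whole Series through the .str accessor,
# without apply.
surnames = names.str.split(",").str.get(0)

print(surname_single)
print(surnames.tolist())
```

Working out the logic on one string first keeps the exercise small, and the second step then only has to introduce the `.str` accessor.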
