Giter Club home page Giter Club logo

python_tidydata's Introduction

Pandas Data Exploration Journey

The Pandas Data Exploration Journey series will consist of following notebooks :

  1. Data Profiling (01_16_av_automated_data_profiling_MS_WORD.ipynb)
  2. Data Quality and Transformations Decisions
  3. Tidy Data and Data Tranformations
  4. Analytics & Visualisations

Objective

The main objective of this notebook is only to understand the data using Data Profiling. The output is a MS Woprd document which documents all the data profiling infoprmation and plots for the given data file. The code is largely kept generic so that it could be used with any shape of data. Any data quality or data tidying recommendations will be dealt in other notebooks.

Data Profile Dataframe (DPD) - The Game Changer

The game changer for exploratory data analysis is the final Data Profile Dataframe that is generated which combines all the information required to inform data cleaning, tidy data and optimisations (memory and processing) decisions.
Instead of using various Pandas commands at different instances and going back and forth to cross refer information, Data Profile Dataframe brings all information into a single dataframe. This will be very useful when reviewing the data profile with the business subject matter or other team members as all information related to data profile is in a single easy to understand format.

image.png

Understanding the data is the critical step in preparing the data to be used for analytics. As many experts will point out the data preparation and transforming the data into a tidy format takes about 80% of the effort in any data analytics or data analysis project.
Understanding the data requires good understanding of the domain and/or access to a subject matter expert (SME) to help make decisions about data quality and data usage:

  • What are the columns and what do they mean?
  • How to interpret each columns and possible values of a column?
  • Should the columns be renamed (and cleaned e.g. trim)?
  • Are there columns that may have similar information that could be dropped in favour of one master column?
  • Can columns with no values (or all empty) be dropped?
  • Can columns which have more than certain threshold of blank values be dropped?
  • How can the missing values be filled and can it be filled meaningfully?
  • Can rows that have missing values for certain columns or combination of columns be dropped? i.e. the row is meaningless wihtout those values.
  • Can the numeric data type columns be converted / down casted to optimise memory usage based on the data values?
    • or will there be outliers possibly in future data sets that we cannot do this?
    • can the min and max values be used to determine the lowest possible data type?
  • Can some string/object columns be converted to Category types?
    • based on count of unique values
  • Can any columns be discarded that may not be required for analytics?

python_tidydata's People

Contributors

analyticsinsightsninja avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.