tidypluspy's Introduction

tidyplusPy: a tool for data wrangling in Python

Contributors:

Latest

  • Date : March 18, 2018
  • Release : v4

About

The tidyplusPy package is a data cleaning package that provides missing value treatment, data type cleansing, and markdown table generation for documents. It adds a few functions on top of existing data wrangling packages such as pandas. The objective is to provide a small set of specific functions that address common pain points in data cleaning.

Install and import

The package needs to be installed from GitHub. Open an Anaconda prompt or a terminal and run:

# if you already have git installed on your computer, try:
pip install git+https://github.com/UBC-MDS/tidyplusPy.git

# if you do not have git installed, try:
pip install https://github.com/UBC-MDS/tidyplusPy/zipball/master

The package tidyplusPy contains 5 modules: typemix, cleanmix, EM, md and mmm. To import the package:

import tidyplusPy

Functions included:

The functions in tidyplusPy fall into three groups:

  • Data Type Cleansing :

    • typemix

      • The function finds columns that contain a mixture of data types, such as character and numeric values. The input is a data frame, and the output is a list of 3 data frames reporting details about the mixture. The first data frame is the same as the input, the second gives the location and type of each value in the columns where types are mixed, and the third is a summary of the second.
    • cleanmix

      • The function cleans a data frame. Once typemix has shown where the data types are mixed, cleanmix can keep or delete one type of data in chosen columns. The inputs are the output of typemix, the ID(s) of the column(s) to clean (the ID is the column's position number), the type of data to act on, and whether to keep or delete that type. The output is a data frame like the original one, but with the specified type kept or deleted in the chosen columns.
  • Missing Value Treatment : Basic Imputation (mmm) and EM Imputation (em)

    • Basic Imputation: the mmm function replaces missing values in one or more columns of a dataframe, according to the chosen method of imputation

      • (Method = 'Mean') replace using mean
      • (Method = 'Median') replace using median
      • (Method = 'Mode') replace using mode
    • EM Imputation: bonus function em (method = "EM")

      • Uses the EM (Expectation-Maximization) algorithm to predict the closest value to the missing value
      • Can be used for both numeric and categorical predictions
  • Markdown Table:

    • md_new(): This function creates a bare-bones empty markdown table.

Example

The basic examples below show how to use each function on a common problem:

Data type cleansing with typemix

# prepare data frame
import pandas as pd
from tidyplusPy.typemix import typemix

d={'x1':[1,2,3,"1.2.3"],
   'x2':["test","test",1,True],
   'x3':[True,True,False,False]}
dat=pd.DataFrame(data=d)

# run the function
typemix(dat)
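To make the idea of a "type mixture" concrete, here is a minimal sketch in plain pandas that counts the Python types found in each column. The helper name `type_summary` and its output layout are assumptions made for illustration; this is not the typemix implementation and its output will not match typemix's three data frames.

```python
import pandas as pd


def type_summary(df):
    """Count Python types per column (hypothetical helper, not typemix itself)."""
    rows = []
    for col in df.columns:
        # Map every value to the name of its Python type, then count names.
        counts = df[col].map(lambda v: type(v).__name__).value_counts()
        for type_name, n in counts.items():
            rows.append({"column": col, "type": type_name, "count": int(n)})
    return pd.DataFrame(rows)


dat = pd.DataFrame({"x1": [1, 2, 3, "1.2.3"],
                    "x2": ["test", "test", 1, True],
                    "x3": [True, True, False, False]})

summary = type_summary(dat)

# A column is "mixed" when more than one type name appears in it.
types_per_column = summary.groupby("column")["type"].nunique()
mixed_columns = types_per_column[types_per_column > 1].index.tolist()
print(summary)
print(mixed_columns)
```

Here `x1` and `x2` hold more than one type, while `x3` holds only booleans.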

Data type cleansing with cleanmix

# prepare data frame
import pandas as pd
from tidyplusPy.typemix import typemix
from tidyplusPy.cleanmix import cleanmix

d={'x1':[1,2,3,"1.2.3"],
   'x2':["test","test",1,True],
   'x3':[True,True,False,False]}
dat=pd.DataFrame(data=d)
result=typemix(dat) # need result from typemix function as input

# run the function
# pass Python lists (the R-style c() is not defined in Python)
cleanmix(result,column=[1,2],type=["number","character"])
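As a rough illustration of what removing one type of data from a column means, the sketch below blanks out values of a given kind using plain pandas. The helper `drop_type`, its parameters, and the choice to replace removed values with NaN are all assumptions for illustration; cleanmix itself may behave differently.

```python
import numbers

import pandas as pd


def drop_type(df, columns, kind):
    """Replace values of one kind with NaN in the given columns.

    Hypothetical helper for illustration; not the cleanmix implementation.
    """
    checks = {
        "character": lambda v: isinstance(v, str),
        # bool is a subclass of int in Python, so exclude it explicitly.
        "number": lambda v: isinstance(v, numbers.Number)
        and not isinstance(v, bool),
    }
    if kind not in checks:
        raise ValueError("kind must be 'character' or 'number'")
    out = df.copy()
    for col in columns:
        is_kind = out[col].map(checks[kind]).astype(bool)
        out[col] = out[col].mask(is_kind)  # matching entries become NaN
    return out


dat = pd.DataFrame({"x1": [1, 2, 3, "1.2.3"],
                    "x2": ["test", "test", 1, True],
                    "x3": [True, True, False, False]})

cleaned = drop_type(dat, columns=["x1"], kind="character")
print(cleaned)
```

Only the string `"1.2.3"` in `x1` is replaced; the other columns are left untouched.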

Imputation with mean / median / mode

  • Works on a pandas dataframe
import pandas as pd
from tidyplusPy import mmm as tm

NaN = float('nan')
ID = [1, 2, 3, 4, 5, 6, 7]
A = [NaN, NaN, NaN, 0.1, 0.3, 0.3, 0.4]
B = [0.2, NaN, 0.2, 0.7, 0.9, NaN, NaN]
C = [NaN, 'A', 'B', NaN, 'C', 'D', 'D']
D = [NaN, 'C', 'E', NaN, 'C', 'H', 'D']
columns = {'A':A, 'B':B, 'C':C, 'D':D}
df = pd.DataFrame(columns, index=ID)
df.index.name = 'ID'


tm.mmm(df,method = "mode") # method can be changed to mean or median as well
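For comparison, the same three strategies can be sketched with pandas alone. The function `impute_simple` below is a hypothetical stand-in, not the mmm implementation; in particular, skipping non-numeric columns for mean and median is an assumption about reasonable behaviour.

```python
import pandas as pd


def impute_simple(df, method="mean"):
    """Fill missing values column by column (hypothetical helper, not mmm)."""
    if method not in ("mean", "median", "mode"):
        raise ValueError("method must be 'mean', 'median' or 'mode'")
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if method == "mode":
            modes = series.mode(dropna=True)
            if modes.empty:
                continue  # column has no observed values
            fill = modes.iloc[0]  # on ties, take the first (sorted) mode
        else:
            if not pd.api.types.is_numeric_dtype(series):
                continue  # mean and median only make sense for numbers
            fill = series.mean() if method == "mean" else series.median()
        out[col] = series.fillna(fill)
    return out


NaN = float("nan")
df = pd.DataFrame({"A": [NaN, NaN, NaN, 0.1, 0.3, 0.3, 0.4],
                   "C": [NaN, "A", "B", NaN, "C", "D", "D"]},
                  index=[1, 2, 3, 4, 5, 6, 7])

by_mode = impute_simple(df, method="mode")
by_mean = impute_simple(df, method="mean")
print(by_mode)
print(by_mean)
```

With `method="mean"` the text column `C` keeps its missing values, which is one reason mode imputation is useful for categorical data.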

Imputation with EM

  • Works ONLY on NumPy nd-arrays for now
import numpy as np
from tidyplusPy import EM as te

matrix= np.random.rand(5,4)
matrix[1,0] = np.nan
matrix[2,1] = np.nan
matrix[4,2] = np.nan
matrix[3,3] = np.nan

te.em(matrix)
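The general idea behind EM-style imputation of numeric data can be sketched as follows, assuming the rows are roughly multivariate normal. This is a simplified illustration, not the algorithm used by `te.em`: it alternates between re-estimating the mean and covariance and replacing each missing entry with its conditional mean, and it omits the conditional-covariance correction that a full EM M-step includes. The function name `em_impute_sketch` and its parameters are hypothetical.

```python
import numpy as np


def em_impute_sketch(X, max_iter=100, tol=1e-6):
    """Iterative conditional-mean imputation (simplified EM-style sketch)."""
    X = np.array(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: fill each missing entry with its column mean.
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])
    for _ in range(max_iter):
        previous = X.copy()
        # Re-estimate the parameters from the currently completed data.
        mu = X.mean(axis=0)
        sigma = np.cov(X, rowvar=False)
        # Replace missing entries by their conditional mean given the
        # observed entries in the same row.
        for i in range(X.shape[0]):
            m = missing[i]
            if not m.any() or m.all():
                continue
            o = ~m
            s_oo = sigma[np.ix_(o, o)]
            s_mo = sigma[np.ix_(m, o)]
            X[i, m] = mu[m] + s_mo @ np.linalg.pinv(s_oo) @ (X[i, o] - mu[o])
        if np.max(np.abs(X - previous)) < tol:
            break
    return X


rng = np.random.default_rng(0)
matrix = rng.random((20, 4))
matrix[1, 0] = np.nan
matrix[2, 1] = np.nan
matrix[4, 2] = np.nan
matrix[3, 3] = np.nan

filled = em_impute_sketch(matrix)
print(filled[:5])
```

Observed entries are never modified; only the positions that were NaN in the input receive estimated values.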

Create empty markdown table

from tidyplusPy import md

tbl1 = md.md_new()
# print out the table in markdown syntax
print(tbl1)

# create table of size 3x3; alignment to center
tbl2 = md.md_new(nrow = 3, ncol = 3, align = "c")
print(tbl2)

# provide header
tbl3 = md.md_new(header = ["foo","boo"])
print(tbl3)
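To show what an empty markdown table looks like as text, here is a small sketch that builds one from scratch. The helper `empty_markdown_table` and its defaults are assumptions chosen to mirror the `nrow`, `ncol`, `align` and `header` arguments used above; the actual output format of `md_new` may differ.

```python
def empty_markdown_table(nrow=2, ncol=2, align="l", header=None):
    """Return an empty markdown table as a string (hypothetical helper)."""
    rules = {"l": ":---", "c": ":---:", "r": "---:"}
    if align not in rules:
        raise ValueError("align must be 'l', 'c' or 'r'")
    if header is None:
        header = [" "] * ncol
    else:
        ncol = len(header)  # the header decides the number of columns

    def row(cells):
        return "| " + " | ".join(cells) + " |"

    lines = [row(header), row([rules[align]] * ncol)]
    lines += [row([" "] * ncol) for _ in range(nrow)]
    return "\n".join(lines)


table = empty_markdown_table(nrow=3, header=["foo", "boo"], align="c")
print(table)
```

The second line of the result is the alignment row that markdown renderers use to decide left, center or right alignment.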

User Scenario

Using Data Manipulation functionalities

  • Users can use the package to clean and wrangle their data. For example, if the data has not been cleaned yet, typemix shows where it is not clean and cleanmix cleans it. In our experience, a mix of numbers and characters is common in data collected from surveys. Once the data is clean, the Missing Value Treatment functions can fill in missing data, for example with the EM algorithm. Finally, md_new() creates an empty markdown table.

Existing features in Python ecosystem similar to tidyplusPy

  • Data Type Cleansing
    • Pandas string processing functions: pandas does not provide a function that explicitly performs string processing or data type conversion on individual values without affecting the overall column type, which is inconvenient for messy data that mixes strings and numbers.
  • Missing Value treatment
    • At the time of writing, the existing imputation methods in Python do not include one that uses the EM algorithm for missing value treatment, which can be an efficient and accurate approach.
  • Markdown table in Python
    • At the time of writing, Python does not have a package or library that outputs a dataset in the form of a markdown table, so this has to be user defined.

Branch coverage

License

MIT

Contributing

This is an open source project. Please follow the guidelines below for contribution.

  • Open an issue for any feedback and suggestions.
  • For contributing to the project, please refer to Contributing for details.

tidypluspy's People

Contributors

akshi8, tinaqian2017

tidypluspy's Issues

milestone3: Error running python version in Data type cleansing with cleanmix

prepare data frame

import pandas as pd
from tidyplusPy.typemix import typemix
from tidyplusPy.cleanmix improt cleanmix

d={'x1':[1,2,3,"1.2.3"],
'x2':["test","test",1,True],
'x3':[True,True,False,False]}
dat=pd.DataFrame(data=d)
result=typemix(dat) # need result from typemix function as input

run the function

cleanmix(result,column=c(1,2),type=c("number","character"))

Error: NameError: name 'c' is not defined
Error: from tidyplusPy.typemix import typemix

Feedback

Hi All,

Congrats for your work and you did a good job for finding a series of functions to develop. I have some suggestions regarding your current proposal:

As discussed in today's office hour, it would be better to specify in advance, for each function, the data types of the input parameters, the data type of the return value, any corner cases to consider in the unit tests, and the expected computational complexity in big-O notation. These are all important aspects of designing a thoughtful function.

For other documents especially for the README.MD description, very well documented and described. Well done!

Regards
Jason

Feedback for Milestone 2

Hi All,

Good job done for the milestone 2 project. Please refer to my comments for your current work below:

  1. Good README.MD update including example usage for all functions and used scenario
  2. Good dist folder for tar.gz compressed package
  3. Suggest to have a TO-DO list for task to be done
  4. Suggest to refer to https://google.github.io/styleguide/pyguide.html for Python coding style improvement
  5. For your first line of your python file, it is suggested that you can include #!/usr/bin/env python just in case the user is running your code in Linux(like me)
  6. For each .py main function file, suggest to have a init() function to initialize your global variable and parameter configuration
  7. For comments, please use #, rather than # and ## for style issue
  8. typemix.py line 45, no column datatype check
  9. typemix.py line 47, variable named a is not suggested
  10. typemix.py: suggest to have all variable declaration in the front of a py file (line 65 - 67)
  11. md.py: what is the usage for InputError since you pass in the function body
  12. md.py: line 71, h is not a good name for a variable
  13. md.py: why do you need to print in line 80?
  14. cleanmix.py: line 44: too much space between lines and other places like line 56 and line 81
  15. mmm.py: dummy dataset should be included in the test file, not the source file. If you, you can put it up in the front of the source code

Regards
Jason
