Hands-On Data Preprocessing in Python, published by Packt

License: MIT License

hands-on-data-preprocessing-in-python's Introduction

This is the code repository for Hands-On Data Preprocessing in Python, published by Packt.

Learn how to effectively prepare data for successful data analytics

What is this book about?

Data preprocessing is the first step in data visualization, data analytics, and machine learning, where data is prepared for analytics functions to get the best possible insights. Around 90% of the time spent on data analytics, data visualization, and machine learning projects is dedicated to performing data preprocessing.

This book covers the following exciting features:

Use Python to perform analytics functions on your data
Understand the role of databases and how to effectively pull data from databases
Perform data preprocessing steps defined by your analytics goals
Recognize and resolve data integration challenges
Identify the need for data reduction and execute it

If you feel this book is for you, get your copy today!

Instructions and Navigations

All of the code is organized into folders, for example, Chapter02.

The code will look like the following:

from ipywidgets import interact, widgets
interact(plotyear, year=widgets.IntSlider(min=2010, max=2019, step=1, value=2010))

Following is what you need for this book: Junior and senior data analysts, business intelligence professionals, engineering undergraduates, and data enthusiasts looking to perform preprocessing and data cleaning on large amounts of data will find this book useful. Basic programming skills, such as working with variables, conditionals, and loops, along with beginner-level knowledge of Python and simple analytics experience, are assumed.

With the following software and hardware list you can run all code files present in the book (Chapter 1-18).

Software and Hardware List

Chapter	Software required	OS required
1 - 18	Python using the Jupyter Notebook	Windows Or Mac OS

We also provide a PDF file that has color images of the screenshots/diagrams used in this book.

Errata

Chapter 5 - page 126: The code chunk in Chapter 5, Data Visualization, under the subsection Example of comparing populations using boxplots (page 126), is misplaced. The correct chunk can be found in the book's GitHub repository and is reproduced here:

income_possibilities = adult_df.income.unique()
dataForBox_dic= {}
for poss in income_possibilities:
    BM = adult_df.income == poss
    dataForBox_dic[poss] = adult_df[BM]['education-num']
    
plt.boxplot(dataForBox_dic.values(),vert=False)
plt.yticks([1,2],income_possibilities)
plt.show()
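
To try the corrected chunk without the book's dataset, the following is a minimal, self-contained sketch. The small adult_df built here is made-up stand-in data (not the real dataset used in the book), and the tick positions are derived from the number of income classes instead of being hard-coded as [1,2]. Both choices are assumptions of this sketch, not part of the erratum.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the book's adult_df; only the two columns
# used by the corrected chunk are included.
adult_df = pd.DataFrame({
    "income": ["<=50K", ">50K", "<=50K", ">50K", "<=50K", ">50K"],
    "education-num": [9, 13, 10, 14, 7, 16],
})

income_possibilities = adult_df["income"].unique()

# One entry per income class, holding that class's education-num values.
dataForBox_dic = {}
for poss in income_possibilities:
    BM = adult_df["income"] == poss  # Boolean mask selecting this class
    dataForBox_dic[poss] = adult_df.loc[BM, "education-num"]

fig, ax = plt.subplots()
ax.boxplot(list(dataForBox_dic.values()), vert=False)

# Boxes are drawn at positions 1..k, so the ticks follow the class count.
ax.set_yticks(list(range(1, len(income_possibilities) + 1)))
ax.set_yticklabels(income_possibilities)
# plt.show()  # uncomment when running as a script with a display
```

With the real adult_df loaded as in the book, the DataFrame construction at the top would simply be dropped.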

Chapter 6 - page 166: The following code chunk, from Chapter 6, Prediction, under Example of applying linear regression to perform regression analysis (page 166), has an error:

X = ['P_Football_Performance','P_2SMA']
Y = 'N_Applications'

The correct chunk of code:

X = ['P_Football_Performance','SMAn2']
Y = 'N_Applications'

The code in the GitHub repository is correct.

Chapter 12 - page 380: The first sentence in Exercise 5 of Chapter 12 (page 380) should be:

“Recreate Figure 5.23 from Chapter 5, Data Visualization, but instead of using WH Report_preprocessed.csv, integrate the following three files yourself first: WH Report.csv, populations.csv, and Countries.csv.”

Instead of

“Recreate Figure 5.20 from Chapter 5, Data Visualization, but instead of using WH Report_preprocessed.csv, integrate the following three files yourself first: WH Report.csv, populations.csv, and Countries.csv.”

Get to Know the Author

Roy Jafari, Ph.D., is an assistant professor of business analytics at the University of Redlands. Roy has taught and developed college-level courses that cover data cleaning, decision making, data science, machine learning, and optimization. Roy's teaching style is hands-on, and he believes the best way to learn is by doing. He follows an active-learning teaching philosophy, which readers will get to experience in this book. Roy believes that successful data preprocessing only happens when you are equipped with the most efficient tools, have an appropriate understanding of data analytic goals, are aware of data preprocessing steps, and can compare a variety of methods. This belief has shaped the structure of this book.

Download a free PDF

If you have already purchased a print or Kindle version of this book, you can get a DRM-free PDF version at no cost.
Simply click on the link to claim your free PDF.

https://packt.link/free-ebook/9781801072137

hands-on-data-preprocessing-in-python's Issues

Hands-On-Data-Preprocessing-in-Python

text errors in chapter 1

Errors in Chapter 1:

Section: Overview of Jupyter Notebook

"programming" instead of "programing":

Jupyter Notebook is becoming increasingly popular as a successful User Interface (UI) for Python programing.

Text below figure 1.5

Wrong function name: np.arange() instead of np.arramge():

Next we will learn four NumPy functions: np.arramge(), np.zeros(), np.ones(), and np.linspace().

Section: The np.linspace() function

Incorrect representation under the first mathematical statement:
The text Formula_B17397_01_001 serves no purpose at this point.

Section: Pandas functions for exploring a DataFrame
Subsection: The .info() function

The function info() must be lowercase:

If you run adult_df.Info(), ...

Section: Pandas functions for exploring a DataFrame
Subsection: Histograms and boxplots to visualize numerical columns

Wrong function: adult_df.age.plot.box() instead of adult_df.age.box():

To create the boxplot for the age column, all you need to change is the last part of the code: adult_df.age.box(). Give it a try.

Text below figure 1.34

Function parameter "Normalize" must be lowercase:

To get the relative frequency table, all you need to do is to specify that you want the table to be normalized: .value_counts(Normalize = True). Give it a try!

Text above figure 1.37

The variable CapitalNet must begin with a lowercase letter:

As we are only interested in the correlations between education_num, lifeNoEd, and CapitalNet, the last line of the code has removed other columns before running the .corr() function.

A few lines later, in the text below figure 1.37

The variable CapitalNet must again begin with a lowercase letter:

... the correlation between education_num and CapitalNet is higher, at 0.117891.

inside figure 1.38:

functions .var() and .describe() must be lowercase

Subsection under figure 1.44 Multi-level access

Spelling error: "sizable" instead of "sizeable":

In this subchapter, we gathered sizable exposure to multi-level indexing and columns.
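
Several of the corrected spellings reported in this issue can be checked quickly with a small sketch. demo_df is a made-up stand-in for the book's adult_df, and only a few of the corrections (np.arange(), .info(), and normalize) are exercised here. The misspelled variants fail because Python names are case-sensitive.

```python
import io

import numpy as np
import pandas as pd

# Tiny stand-in for the book's adult_df (values are made up).
demo_df = pd.DataFrame({
    "age": [25, 38, 52, 38],
    "income": ["<=50K", ">50K", "<=50K", "<=50K"],
})

# np.arange (not np.arramge): evenly spaced values, stop excluded.
evens = np.arange(0, 10, 2)

# np.linspace: a fixed number of points, both endpoints included.
grid = np.linspace(0, 1, 5)

# .info() is lowercase; it prints a summary, captured here in a buffer.
buffer = io.StringIO()
demo_df.info(buf=buffer)
info_text = buffer.getvalue()

# The keyword is normalize (lowercase); it returns relative frequencies.
rel_freq = demo_df["income"].value_counts(normalize=True)

print(evens)
print(grid)
print(rel_freq)
```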

Exercise files in repository have typo 'Excercises' in name

The exercise files in the repository are all named like "Chapter 1 Excercises.ipynb" instead of "Chapter 1 Exercises.ipynb".

This is a typo that I frequently make, so I noticed it when I was posting a message to my students.
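
One possible way to fix the names in a local clone is sketched below. The function fix_exercise_names and its dry_run flag are hypothetical helpers introduced for this sketch, and it assumes the affected files are .ipynb notebooks whose names contain "Excercises".

```python
from pathlib import Path


def fix_exercise_names(root, dry_run=True):
    """Find notebooks with 'Excercises' in the name and plan (or apply) renames.

    Returns a list of (old_path, new_path) pairs. Nothing is renamed
    unless dry_run is False.
    """
    renames = []
    # sorted() materializes the matches before any file is renamed.
    for old_path in sorted(Path(root).rglob("*Excercises*.ipynb")):
        new_name = old_path.name.replace("Excercises", "Exercises")
        new_path = old_path.with_name(new_name)
        renames.append((old_path, new_path))
        if not dry_run:
            old_path.rename(new_path)
    return renames


if __name__ == "__main__":
    # Dry run from the current directory: only prints the planned renames.
    for old, new in fix_exercise_names("."):
        print(f"{old} -> {new}")
```

In a Git checkout, applying the same renames with git mv would keep the file history easier to follow.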

Bug: API connection link for Finnhub.io for candle statistics

The link used in the exercise on accessing the API is:
address_template = 'https://finnhub.io/api/v1/stock/candle?symbol={}&resolution={}&from={}&to={}&token={}'

Access to candle statistics is only available to premium users.

When demonstrating how to connect to an API and pull data, could we instead install the corresponding Python package (finnhub-python in this case) and extract the data with it?
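
For context, here is a minimal sketch of how the five placeholders in address_template get filled. It only builds the request URL and does not contact Finnhub, so it runs regardless of account tier. The symbol AAPL, the resolution D, the dates, and the placeholder token are illustrative assumptions, as is the reading of from and to as UNIX timestamps in seconds. to_unix and build_candle_url are hypothetical helper names.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit

address_template = (
    "https://finnhub.io/api/v1/stock/candle"
    "?symbol={}&resolution={}&from={}&to={}&token={}"
)


def to_unix(year, month, day):
    """Return midnight UTC of the given date as a UNIX timestamp in seconds."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def build_candle_url(symbol, resolution, start, end, token):
    """Fill the template's placeholders in order; no request is sent."""
    return address_template.format(symbol, resolution, start, end, token)


url = build_candle_url(
    "AAPL",                # assumed example symbol
    "D",                   # assumed daily resolution
    to_unix(2021, 1, 1),
    to_unix(2021, 2, 1),
    "YOUR_API_TOKEN",      # placeholder, not a real token
)

# Parse the query string back to confirm each field landed where expected.
params = parse_qs(urlsplit(url).query)
print(url)
print(params)
```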

Text errors in chapter 2

Below Figure 2.5 – Anatomy of Matplotlib visuals, directly above the code block

Wrong function:
plt.ylabel() instead of plt.ylable()

The following screenshot shows the application of plt.title() and plt.ylable() to add a title to the visual and add a label to the y-axis respectively.

Section: Resizing visuals and saving them
Subsection: Saving

Missing space:
"to create" instead of "tocreate"

This function takes the name of the file you would like tocreate for saving ...

Section: Example of Matplotilb assisting data preprocessing

In the first paragraph, in the first code block, and then a third time in the next text block:
Not really an error, but I think the filename should be "ColumnsVisualization.png" instead of "ColumnsVsiaulization.png"

Same with Numerical_colums in this code block: No error but better "Numerical_columns"

But a real spelling error is in the section title:
"Matplotilb" instead of "Matplotlib"

Link error

Chapter: 1 Review of the core modules of NumPy and Pandas
Section: Technical requirements

The link https://github.com/PacktPublishing/Hands-On-Data-Preprocessing-in-Python doesn't work, because this repository ends with a hyphen.
