In this lesson, you'll synthesize the data loading skills you've learned so far to merge multiple datasets from various sources.
You will be able to:
- Understand the ETL process and the steps it consists of
- Understand the challenges of working with data from multiple sources
import sqlite3
import pandas as pd
# Create a connection
con = sqlite3.connect('data.sqlite')
# Create a cursor
cur = con.cursor()
# Select some data
cur.execute("""SELECT * FROM orders JOIN orderdetails USING(orderNumber);""")
df = pd.DataFrame(cur.fetchall())
df.columns = [i[0] for i in cur.description]
print(df.shape)
df.head()
(2996, 11)
orderNumber | orderDate | requiredDate | shippedDate | status | comments | customerNumber | productCode | quantityOrdered | priceEach | orderLineNumber | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | 363 | S18_1749 | 30 | 136.00 | 3 | |
1 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | 363 | S18_2248 | 50 | 55.09 | 2 | |
2 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | 363 | S18_4409 | 22 | 75.46 | 4 | |
3 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | 363 | S24_3969 | 49 | 35.29 | 1 | |
4 | 10101 | 2003-01-09 | 2003-01-18 | 2003-01-11 | Shipped | Check on availability. | 128 | S18_2325 | 25 | 108.06 | 4 |
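As a shorthand, pandas can run the query and build the DataFrame (column names included) in one step with `pd.read_sql()`, instead of fetching rows from a cursor and assigning columns manually. The sketch below builds a tiny in-memory database with a few hypothetical rows (not the real contents of `data.sqlite`) so it runs on its own:

```python
import sqlite3
import pandas as pd

# Build a tiny in-memory database to demonstrate; the rows here are
# hypothetical stand-ins for the real data.sqlite tables.
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute("CREATE TABLE orders (orderNumber INTEGER, status TEXT)")
cur.execute("""CREATE TABLE orderdetails
               (orderNumber INTEGER, productCode TEXT, quantityOrdered INTEGER)""")
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [(10100, 'Shipped'), (10101, 'Shipped')])
cur.executemany("INSERT INTO orderdetails VALUES (?, ?, ?)",
                [(10100, 'S18_1749', 30), (10100, 'S18_2248', 50),
                 (10101, 'S18_2325', 25)])
con.commit()

# pd.read_sql executes the query and returns a labeled DataFrame directly
df = pd.read_sql("SELECT * FROM orders JOIN orderdetails USING(orderNumber);", con)
con.close()
print(df.shape)  # (3, 4)
```

Either approach works; `pd.read_sql()` simply saves the `fetchall()` and column-assignment boilerplate.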
# Create a connection
con = sqlite3.connect('data.sqlite')
# Create a cursor
cur = con.cursor()
# Select some data
cur.execute("""SELECT * FROM products;""")
df = pd.DataFrame(cur.fetchall())
df.columns = [i[0] for i in cur.description]
print(df.shape)
df.head()
(110, 9)
productCode | productName | productLine | productScale | productVendor | productDescription | quantityInStock | buyPrice | MSRP | |
---|---|---|---|---|---|---|---|---|---|
0 | S10_1678 | 1969 Harley Davidson Ultimate Chopper | Motorcycles | 1:10 | Min Lin Diecast | This replica features working kickstand, front... | 7933 | 48.81 | 95.70 |
1 | S10_1949 | 1952 Alpine Renault 1300 | Classic Cars | 1:10 | Classic Metal Creations | Turnable front wheels; steering function; deta... | 7305 | 98.58 | 214.30 |
2 | S10_2016 | 1996 Moto Guzzi 1100i | Motorcycles | 1:10 | Highway 66 Mini Classics | Official Moto Guzzi logos and insignias, saddl... | 6625 | 68.99 | 118.94 |
3 | S10_4698 | 2003 Harley-Davidson Eagle Drag Bike | Motorcycles | 1:10 | Red Start Diecast | Model features, official Harley Davidson logos... | 5582 | 91.02 | 193.66 |
4 | S10_4757 | 1972 Alfa Romeo GTA | Classic Cars | 1:10 | Motor City Art Classics | Features include: Turnable front wheels; steer... | 3252 | 85.68 | 136.00 |
Recall that you can also join data from multiple tables in SQL.
# Create a connection
con = sqlite3.connect('data.sqlite')
# Create a cursor
cur = con.cursor()
# Select some data
cur.execute("""SELECT * FROM products
JOIN orderdetails
USING (productCode);""")
df = pd.DataFrame(cur.fetchall())
df.columns = [i[0] for i in cur.description]
print(df.shape)
df.head()
(2996, 13)
productCode | productName | productLine | productScale | productVendor | productDescription | quantityInStock | buyPrice | MSRP | orderNumber | quantityOrdered | priceEach | orderLineNumber | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | S10_1678 | 1969 Harley Davidson Ultimate Chopper | Motorcycles | 1:10 | Min Lin Diecast | This replica features working kickstand, front... | 7933 | 48.81 | 95.70 | 10107 | 30 | 81.35 | 2 |
1 | S10_1678 | 1969 Harley Davidson Ultimate Chopper | Motorcycles | 1:10 | Min Lin Diecast | This replica features working kickstand, front... | 7933 | 48.81 | 95.70 | 10121 | 34 | 86.13 | 5 |
2 | S10_1678 | 1969 Harley Davidson Ultimate Chopper | Motorcycles | 1:10 | Min Lin Diecast | This replica features working kickstand, front... | 7933 | 48.81 | 95.70 | 10134 | 41 | 90.92 | 2 |
3 | S10_1678 | 1969 Harley Davidson Ultimate Chopper | Motorcycles | 1:10 | Min Lin Diecast | This replica features working kickstand, front... | 7933 | 48.81 | 95.70 | 10145 | 45 | 76.56 | 6 |
4 | S10_1678 | 1969 Harley Davidson Ultimate Chopper | Motorcycles | 1:10 | Min Lin Diecast | This replica features working kickstand, front... | 7933 | 48.81 | 95.70 | 10159 | 49 | 81.35 | 14 |
You can also merge in data from a separate CSV file. For example, suppose you have a separate data source containing daily sales summaries for the various branches. You might first generate a view from the database:
# Create a connection
con = sqlite3.connect('data.sqlite')
# Create a cursor
cur = con.cursor()
# Select some data
cur.execute("""SELECT * FROM customers
JOIN orders
USING(customerNumber);""")
df = pd.DataFrame(cur.fetchall())
df.columns = [i[0] for i in cur.description]
print(df.shape)
df.head()
(326, 19)
customerNumber | customerName | contactLastName | contactFirstName | phone | addressLine1 | addressLine2 | city | state | postalCode | country | salesRepEmployeeNumber | creditLimit | orderNumber | orderDate | requiredDate | shippedDate | status | comments | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 103 | Atelier graphique | Schmitt | Carine | 40.32.2555 | 54, rue Royale | Nantes | 44000 | France | 1370 | 21000.00 | 10123 | 2003-05-20 | 2003-05-29 | 2003-05-22 | Shipped | |||
1 | 103 | Atelier graphique | Schmitt | Carine | 40.32.2555 | 54, rue Royale | Nantes | 44000 | France | 1370 | 21000.00 | 10298 | 2004-09-27 | 2004-10-05 | 2004-10-01 | Shipped | |||
2 | 103 | Atelier graphique | Schmitt | Carine | 40.32.2555 | 54, rue Royale | Nantes | 44000 | France | 1370 | 21000.00 | 10345 | 2004-11-25 | 2004-12-01 | 2004-11-26 | Shipped | |||
3 | 112 | Signal Gift Stores | King | Jean | 7025551838 | 8489 Strong St. | Las Vegas | NV | 83030 | USA | 1166 | 71800.00 | 10124 | 2003-05-21 | 2003-05-29 | 2003-05-25 | Shipped | Customer very concerned about the exact color ... | |
4 | 112 | Signal Gift Stores | King | Jean | 7025551838 | 8489 Strong St. | Las Vegas | NV | 83030 | USA | 1166 | 71800.00 | 10278 | 2004-08-06 | 2004-08-16 | 2004-08-09 | Shipped |
And then load the separate data file:
daily_sums = pd.read_csv('Daily_Sales_Summaries.csv')
daily_sums.head()
orderDate | min | max | sum | mean | std | |
---|---|---|---|---|---|---|
0 | 2003-01-06 | 1660.12 | 4080.00 | 10223.83 | 2555.957500 | 1132.572429 |
1 | 2003-01-09 | 1463.85 | 4343.56 | 10549.01 | 2637.252500 | 1244.866467 |
2 | 2003-01-10 | 1768.33 | 3726.45 | 5494.78 | 2747.390000 | 1384.599930 |
3 | 2003-01-29 | 1283.48 | 5571.80 | 50218.95 | 3138.684375 | 1168.280303 |
4 | 2003-01-31 | 1338.04 | 4566.99 | 40206.20 | 3092.784615 | 1148.570425 |
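A summary file like this could itself be produced with a groupby-aggregate step. The sketch below assumes hypothetical raw rows where each record carries an order date and an order total (in the real data this would be `quantityOrdered * priceEach` summed per order), and reproduces the summary columns shown above:

```python
import pandas as pd

# Hypothetical raw order totals; stand-ins for real line-item data
raw = pd.DataFrame({
    'orderDate': ['2003-01-06', '2003-01-06', '2003-01-09'],
    'orderTotal': [1660.12, 4080.00, 1463.85],
})

# Aggregate per day to mirror the columns in Daily_Sales_Summaries.csv
daily = (raw.groupby('orderDate')['orderTotal']
            .agg(['min', 'max', 'sum', 'mean', 'std'])
            .reset_index())
print(daily)
```

This yields one row per date with `min`, `max`, `sum`, `mean`, and `std` columns, matching the structure of the loaded CSV.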
merged = pd.merge(df, daily_sums)
It's always good practice to check assumptions and preview transformed data views throughout your process. Let's take a look:
merged.head()
customerNumber | customerName | contactLastName | contactFirstName | phone | addressLine1 | addressLine2 | city | state | postalCode | ... | orderDate | requiredDate | shippedDate | status | comments | min | max | sum | mean | std | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 103 | Atelier graphique | Schmitt | Carine | 40.32.2555 | 54, rue Royale | Nantes | 44000 | ... | 2003-05-20 | 2003-05-29 | 2003-05-22 | Shipped | 2163.50 | 5282.64 | 14571.44 | 3642.860000 | 1322.891537 | |||
1 | 103 | Atelier graphique | Schmitt | Carine | 40.32.2555 | 54, rue Royale | Nantes | 44000 | ... | 2004-09-27 | 2004-10-05 | 2004-10-01 | Shipped | 1938.24 | 4128.54 | 6066.78 | 3033.390000 | 1548.775983 | |||
2 | 103 | Atelier graphique | Schmitt | Carine | 40.32.2555 | 54, rue Royale | Nantes | 44000 | ... | 2004-11-25 | 2004-12-01 | 2004-11-26 | Shipped | 557.60 | 7573.50 | 20564.45 | 2570.556250 | 2178.832190 | |||
3 | 350 | Marseille Mini Autos | Lebihan | Laurence | 91.24.4555 | 12, rue des Bouchers | Marseille | 13008 | ... | 2004-11-25 | 2004-12-02 | 2004-11-29 | Shipped | 557.60 | 7573.50 | 20564.45 | 2570.556250 | 2178.832190 | |||
4 | 112 | Signal Gift Stores | King | Jean | 7025551838 | 8489 Strong St. | Las Vegas | NV | 83030 | ... | 2003-05-21 | 2003-05-29 | 2003-05-25 | Shipped | Customer very concerned about the exact color ... | 798.38 | 4704.92 | 40207.06 | 2680.470667 | 1255.052262 |
5 rows × 24 columns
Pandas' `merge()` function conveniently uses column names common to both DataFrames as the join keys. You can also explicitly specify which columns to join on with the `on` keyword, as in `pd.merge(df1, df2, on=['col1', 'col2'])`. Columns that are not identically named will not be matched by this convenience method (the `left_on` and `right_on` parameters handle that case). Additionally, it is imperative to check the formatting of the join keys between the tables: a number stored as a string can ruin a join, and differing conventions such as 'U.S.' versus 'USA' are also important preprocessing considerations before merging data files from various sources. In this case everything worked smoothly, but it's good to keep these potential problems in mind.
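To make the key-formatting pitfall concrete, here is a small sketch with two toy frames (hypothetical rows, not the lesson's data) whose join key matches in meaning but not in dtype; normalizing the key type before merging fixes it:

```python
import pandas as pd

# Toy frames: customerNumber is an integer in one and a string in the other
left = pd.DataFrame({'customerNumber': [103, 112],
                     'country': ['France', 'USA']})
right = pd.DataFrame({'customerNumber': ['103', '112'],
                      'creditLimit': [21000.0, 71800.0]})

# Merging an integer key against a string key fails to match,
# so normalize the dtype first
right['customerNumber'] = right['customerNumber'].astype(int)
merged = pd.merge(left, right, on='customerNumber')
print(merged.shape)  # (2, 3)
```

After the `astype(int)` conversion both keys are integers and every row matches; without it, the mismatched dtypes would produce no matches or an error, depending on the pandas version.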
Finally, we can save our transformed dataset.
merged.to_csv('Merged_Dataset.csv', index=False)
Well done! In this lesson you reviewed merges, as well as potential pitfalls in merging datasets from different sources. In the next lab, you'll get some practice doing this as an initial step to a regression task.