In this lesson, you'll synthesize the data loading skills you've learned so far to merge multiple datasets from various sources.
You will be able to:
- Understand the ETL process and the steps it consists of
- Understand the challenges of working with data from multiple sources
import sqlite3
import pandas as pd
# Create a connection
con = sqlite3.connect('data.sqlite')
# Create a cursor
cur = con.cursor()
# Select some data
cur.execute("""SELECT * FROM orders JOIN orderdetails USING(orderNumber);""")
df = pd.DataFrame(cur.fetchall())
df.columns = [i[0] for i in cur.description]
print(df.shape)
df.head()
(2996, 11)
orderNumber | orderDate | requiredDate | shippedDate | status | comments | customerNumber | productCode | quantityOrdered | priceEach | orderLineNumber | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | 363 | S18_1749 | 30 | 136.00 | 3 | |
1 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | 363 | S18_2248 | 50 | 55.09 | 2 | |
2 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | 363 | S18_4409 | 22 | 75.46 | 4 | |
3 | 10100 | 2003-01-06 | 2003-01-13 | 2003-01-10 | Shipped | 363 | S24_3969 | 49 | 35.29 | 1 | |
4 | 10101 | 2003-01-09 | 2003-01-18 | 2003-01-11 | Shipped | Check on availability. | 128 | S18_2325 | 25 | 108.06 | 4 |
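As a shorthand, pandas can run the query and build the DataFrame (column names included) in one step with `pd.read_sql()`, instead of fetching rows from a cursor and assigning columns manually. The sketch below builds a tiny in-memory database with a few hypothetical rows (not the real contents of `data.sqlite`) so it runs on its own:

```python
import sqlite3
import pandas as pd

# Build a tiny in-memory database to demonstrate; the rows here are
# hypothetical stand-ins for the real data.sqlite tables.
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute("CREATE TABLE orders (orderNumber INTEGER, status TEXT)")
cur.execute("""CREATE TABLE orderdetails
               (orderNumber INTEGER, productCode TEXT, quantityOrdered INTEGER)""")
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [(10100, 'Shipped'), (10101, 'Shipped')])
cur.executemany("INSERT INTO orderdetails VALUES (?, ?, ?)",
                [(10100, 'S18_1749', 30), (10100, 'S18_2248', 50),
                 (10101, 'S18_2325', 25)])
con.commit()

# pd.read_sql executes the query and returns a labeled DataFrame directly
df = pd.read_sql("SELECT * FROM orders JOIN orderdetails USING(orderNumber);", con)
con.close()
print(df.shape)  # (3, 4)
```

Either approach works; `pd.read_sql()` simply saves the `fetchall()` and column-assignment boilerplate.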
# Create a connection
con = sqlite3.connect('data.sqlite')
# Create a cursor
cur = con.cursor()
# Select some data
cur.execute("""SELECT * FROM products;""")
df = pd.DataFrame(cur.fetchall())
df.columns = [i[0] for i in cur.description]
print(df.shape)
df.head()
(110, 9)
productCode | productName | productLine | productScale | productVendor | productDescription | quantityInStock | buyPrice | MSRP | |
---|---|---|---|---|---|---|---|---|---|
0 | S10_1678 | 1969 Harley Davidson Ultimate Chopper | Motorcycles | 1:10 | Min Lin Diecast | This replica features working kickstand, front... | 7933 | 48.81 | 95.70 |
1 | S10_1949 | 1952 Alpine Renault 1300 | Classic Cars | 1:10 | Classic Metal Creations | Turnable front wheels; steering function; deta... | 7305 | 98.58 | 214.30 |
2 | S10_2016 | 1996 Moto Guzzi 1100i | Motorcycles | 1:10 | Highway 66 Mini Classics | Official Moto Guzzi logos and insignias, saddl... | 6625 | 68.99 | 118.94 |
3 | S10_4698 | 2003 Harley-Davidson Eagle Drag Bike | Motorcycles | 1:10 | Red Start Diecast | Model features, official Harley Davidson logos... | 5582 | 91.02 | 193.66 |
4 | S10_4757 | 1972 Alfa Romeo GTA | Classic Cars | 1:10 | Motor City Art Classics | Features include: Turnable front wheels; steer... | 3252 | 85.68 | 136.00 |
Recall that you can also join data from multiple tables in SQL.
# Create a connection
con = sqlite3.connect('data.sqlite')
# Create a cursor
cur = con.cursor()
# Select some data
cur.execute("""SELECT * FROM products
JOIN orderdetails
USING (productCode);""")
df = pd.DataFrame(cur.fetchall())
df.columns = [i[0] for i in cur.description]
print(df.shape)
df.head()
(2996, 13)
productCode | productName | productLine | productScale | productVendor | productDescription | quantityInStock | buyPrice | MSRP | orderNumber | quantityOrdered | priceEach | orderLineNumber | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | S10_1678 | 1969 Harley Davidson Ultimate Chopper | Motorcycles | 1:10 | Min Lin Diecast | This replica features working kickstand, front... | 7933 | 48.81 | 95.70 | 10107 | 30 | 81.35 | 2 |
1 | S10_1678 | 1969 Harley Davidson Ultimate Chopper | Motorcycles | 1:10 | Min Lin Diecast | This replica features working kickstand, front... | 7933 | 48.81 | 95.70 | 10121 | 34 | 86.13 | 5 |
2 | S10_1678 | 1969 Harley Davidson Ultimate Chopper | Motorcycles | 1:10 | Min Lin Diecast | This replica features working kickstand, front... | 7933 | 48.81 | 95.70 | 10134 | 41 | 90.92 | 2 |
3 | S10_1678 | 1969 Harley Davidson Ultimate Chopper | Motorcycles | 1:10 | Min Lin Diecast | This replica features working kickstand, front... | 7933 | 48.81 | 95.70 | 10145 | 45 | 76.56 | 6 |
4 | S10_1678 | 1969 Harley Davidson Ultimate Chopper | Motorcycles | 1:10 | Min Lin Diecast | This replica features working kickstand, front... | 7933 | 48.81 | 95.70 | 10159 | 49 | 81.35 | 14 |
You can also merge in data from a separate CSV file. For example, suppose you have a separate data source containing daily sales summaries for the various branches. You might first generate a view from the database:
# Create a connection
con = sqlite3.connect('data.sqlite')
# Create a cursor
cur = con.cursor()
# Select some data
cur.execute("""SELECT * FROM customers
JOIN orders
USING(customerNumber);""")
df = pd.DataFrame(cur.fetchall())
df.columns = [i[0] for i in cur.description]
print(df.shape)
df.head()
(326, 19)
customerNumber | customerName | contactLastName | contactFirstName | phone | addressLine1 | addressLine2 | city | state | postalCode | country | salesRepEmployeeNumber | creditLimit | orderNumber | orderDate | requiredDate | shippedDate | status | comments | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 103 | Atelier graphique | Schmitt | Carine | 40.32.2555 | 54, rue Royale | Nantes | 44000 | France | 1370 | 21000.00 | 10123 | 2003-05-20 | 2003-05-29 | 2003-05-22 | Shipped | |||
1 | 103 | Atelier graphique | Schmitt | Carine | 40.32.2555 | 54, rue Royale | Nantes | 44000 | France | 1370 | 21000.00 | 10298 | 2004-09-27 | 2004-10-05 | 2004-10-01 | Shipped | |||
2 | 103 | Atelier graphique | Schmitt | Carine | 40.32.2555 | 54, rue Royale | Nantes | 44000 | France | 1370 | 21000.00 | 10345 | 2004-11-25 | 2004-12-01 | 2004-11-26 | Shipped | |||
3 | 112 | Signal Gift Stores | King | Jean | 7025551838 | 8489 Strong St. | Las Vegas | NV | 83030 | USA | 1166 | 71800.00 | 10124 | 2003-05-21 | 2003-05-29 | 2003-05-25 | Shipped | Customer very concerned about the exact color ... | |
4 | 112 | Signal Gift Stores | King | Jean | 7025551838 | 8489 Strong St. | Las Vegas | NV | 83030 | USA | 1166 | 71800.00 | 10278 | 2004-08-06 | 2004-08-16 | 2004-08-09 | Shipped |
And then load the separate data file:
daily_sums = pd.read_csv('Daily_Sales_Summaries.csv')
daily_sums.head()
orderDate | min | max | sum | mean | std | |
---|---|---|---|---|---|---|
0 | 2003-01-06 | 1660.12 | 4080.00 | 10223.83 | 2555.957500 | 1132.572429 |
1 | 2003-01-09 | 1463.85 | 4343.56 | 10549.01 | 2637.252500 | 1244.866467 |
2 | 2003-01-10 | 1768.33 | 3726.45 | 5494.78 | 2747.390000 | 1384.599930 |
3 | 2003-01-29 | 1283.48 | 5571.80 | 50218.95 | 3138.684375 | 1168.280303 |
4 | 2003-01-31 | 1338.04 | 4566.99 | 40206.20 | 3092.784615 | 1148.570425 |
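A summary file like this could itself be produced with a groupby-aggregate step. The sketch below assumes hypothetical raw rows where each record carries an order date and an order total (in the real data this would be `quantityOrdered * priceEach` summed per order), and reproduces the summary columns shown above:

```python
import pandas as pd

# Hypothetical raw order totals; stand-ins for real line-item data
raw = pd.DataFrame({
    'orderDate': ['2003-01-06', '2003-01-06', '2003-01-09'],
    'orderTotal': [1660.12, 4080.00, 1463.85],
})

# Aggregate per day to mirror the columns in Daily_Sales_Summaries.csv
daily = (raw.groupby('orderDate')['orderTotal']
            .agg(['min', 'max', 'sum', 'mean', 'std'])
            .reset_index())
print(daily)
```

This yields one row per date with `min`, `max`, `sum`, `mean`, and `std` columns, matching the structure of the loaded CSV.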
merged = pd.merge(df, daily_sums)
It's always good practice to check assumptions and preview transformed data views throughout your process. Let's take a look:
merged.head()
customerNumber | customerName | contactLastName | contactFirstName | phone | addressLine1 | addressLine2 | city | state | postalCode | ... | orderDate | requiredDate | shippedDate | status | comments | min | max | sum | mean | std | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 103 | Atelier graphique | Schmitt | Carine | 40.32.2555 | 54, rue Royale | Nantes | 44000 | ... | 2003-05-20 | 2003-05-29 | 2003-05-22 | Shipped | 2163.50 | 5282.64 | 14571.44 | 3642.860000 | 1322.891537 | |||
1 | 103 | Atelier graphique | Schmitt | Carine | 40.32.2555 | 54, rue Royale | Nantes | 44000 | ... | 2004-09-27 | 2004-10-05 | 2004-10-01 | Shipped | 1938.24 | 4128.54 | 6066.78 | 3033.390000 | 1548.775983 | |||
2 | 103 | Atelier graphique | Schmitt | Carine | 40.32.2555 | 54, rue Royale | Nantes | 44000 | ... | 2004-11-25 | 2004-12-01 | 2004-11-26 | Shipped | 557.60 | 7573.50 | 20564.45 | 2570.556250 | 2178.832190 | |||
3 | 350 | Marseille Mini Autos | Lebihan | Laurence | 91.24.4555 | 12, rue des Bouchers | Marseille | 13008 | ... | 2004-11-25 | 2004-12-02 | 2004-11-29 | Shipped | 557.60 | 7573.50 | 20564.45 | 2570.556250 | 2178.832190 | |||
4 | 112 | Signal Gift Stores | King | Jean | 7025551838 | 8489 Strong St. | Las Vegas | NV | 83030 | ... | 2003-05-21 | 2003-05-29 | 2003-05-25 | Shipped | Customer very concerned about the exact color ... | 798.38 | 4704.92 | 40207.06 | 2680.470667 | 1255.052262 |
5 rows × 24 columns
Pandas' `merge()` function conveniently uses column names common to both DataFrames as the join keys. You can also explicitly specify which columns to join on with the `on` keyword, as in `pd.merge(df1, df2, on=['col1', 'col2'])`. Columns that are not identically named will not be matched by this convenience method (the `left_on` and `right_on` parameters handle that case). Additionally, it is imperative to check the formatting of the join keys between the tables: a number stored as a string can ruin a join, and differing conventions such as 'U.S.' versus 'USA' are also important preprocessing considerations before merging data files from various sources. In this case everything worked smoothly, but it's good to keep these potential problems in mind.
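To make the key-formatting pitfall concrete, here is a small sketch with two toy frames (hypothetical rows, not the lesson's data) whose join key matches in meaning but not in dtype; normalizing the key type before merging fixes it:

```python
import pandas as pd

# Toy frames: customerNumber is an integer in one and a string in the other
left = pd.DataFrame({'customerNumber': [103, 112],
                     'country': ['France', 'USA']})
right = pd.DataFrame({'customerNumber': ['103', '112'],
                      'creditLimit': [21000.0, 71800.0]})

# Merging an integer key against a string key fails to match,
# so normalize the dtype first
right['customerNumber'] = right['customerNumber'].astype(int)
merged = pd.merge(left, right, on='customerNumber')
print(merged.shape)  # (2, 3)
```

After the `astype(int)` conversion both keys are integers and every row matches; without it, the mismatched dtypes would produce no matches or an error, depending on the pandas version.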
Finally, we can save our transformed dataset.
merged.to_csv('Merged_Dataset.csv', index=False)
Well done! In this lesson you reviewed merges, as well as potential pitfalls in merging datasets from different sources. In the next lab, you'll get some practice doing this as an initial step to a regression task.