Just as with Pandas, we can use aggregate functions in SQL to assist with data manipulation. Sometimes you may wish to find the mean, median, min, or max of a column feature. For example, there could be a customer relational database that you've been working with and you may wonder if there are differences in overall sales across offices or regions.
You will be able to:
- Describe the relationship between aggregate functions and
GROUP BY
statements - Use
GROUP BY
statements in SQL to apply aggregate functions like:COUNT
,MAX
,MIN
, andSUM
- Create an alias in a SQL query
- Use the
HAVING
clause to compare different aggregates - Compare the difference between the
WHERE
andHAVING
clause
import sqlite3
import pandas as pd
As usual, start by creating a connection to the database and instantiating a cursor object.
conn = sqlite3.Connection('data.sqlite')
cur = conn.cursor()
Let's start by looking at some GROUP BY
statements to aggregate our data. The GROUP BY
clause groups records into summary rows and returns one record for each group.
Typically, GROUP BY
also involves an aggregate function (COUNT
, AVG
, etc.). Lastly, GROUP BY
can group by one or more columns.
In the cell below, we'll join the offices and employees tables in order to count the number of employees per city.
cur.execute("""SELECT city, COUNT(employeeNumber)
FROM offices
JOIN employees
USING(officeCode)
GROUP BY city
ORDER BY count(employeeNumber) DESC;""")
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
city | COUNT(employeeNumber) | |
---|---|---|
0 | San Francisco | 6 |
1 | Paris | 5 |
2 | Sydney | 4 |
3 | Boston | 2 |
4 | London | 2 |
An Alias is a shorthand for a table or column name. Aliases reduce the amount of typing required to enter a query. Generally, complex queries with aliases are easier to read.
Aliases are useful with JOIN
, GROUP BY
, and aggregates (SUM
, COUNT
, etc.).
An Alias only exists for the duration of the query.
You can alias your GROUP BY
by specifying the index of our selection order that we want to group by. This is simply written as GROUP BY 1
, with the number "1" referring to the first column name that we are selecting.
Additionally, we can also rename our aggregate to a more descriptive name using the AS
clause.
cur.execute("""SELECT city, COUNT(employeeNumber) AS numEmployees
FROM offices
JOIN employees
USING(officeCode)
GROUP BY 1
ORDER BY numEmployees DESC;""")
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
city | numEmployees | |
---|---|---|
0 | San Francisco | 6 |
1 | Paris | 5 |
2 | Sydney | 4 |
3 | Boston | 2 |
4 | London | 2 |
Aside from COUNT()
some other useful aggregations include:
MIN()
MAX()
SUM()
AVG()
cur.execute("""SELECT customerName,
COUNT(customerName) AS number_purchases,
MIN(amount) AS min_purchase,
MAX(amount) AS max_purchase,
AVG(amount) AS avg_purchase,
SUM(amount) AS total_spent
FROM customers
JOIN payments
USING(customerNumber)
GROUP BY customerName
ORDER BY SUM(amount) DESC;""")
df = pd.DataFrame(cur.fetchall())
df. columns = [i[0] for i in cur.description]
print(len(df))
df.head()
98
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
customerName | number_purchases | min_purchase | max_purchase | avg_purchase | total_spent | |
---|---|---|---|---|---|---|
0 | Euro+ Shopping Channel | 13 | 20009.53 | 120166.58 | 55056.844615 | 715738.98 |
1 | Mini Gifts Distributors Ltd. | 9 | 11044.30 | 111654.40 | 64909.804444 | 584188.24 |
2 | Australian Collectors, Co. | 4 | 7565.08 | 82261.22 | 45146.267500 | 180585.07 |
3 | Muscle Machine Inc | 4 | 20314.44 | 58841.35 | 44478.487500 | 177913.95 |
4 | Dragon Souveniers, Ltd. | 4 | 2611.84 | 105743.00 | 39062.757500 | 156251.03 |
df.tail()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
customerName | number_purchases | min_purchase | max_purchase | avg_purchase | total_spent | |
---|---|---|---|---|---|---|
93 | Royale Belge | 4 | 1128.20 | 14379.90 | 7304.295000 | 29217.18 |
94 | Frau da Collezione | 2 | 7612.06 | 17746.26 | 12679.160000 | 25358.32 |
95 | Atelier graphique | 3 | 1676.14 | 14571.44 | 7438.120000 | 22314.36 |
96 | Auto-Moto Classics Inc. | 3 | 5858.56 | 9658.74 | 7184.753333 | 21554.26 |
97 | Boards & Toys Co. | 2 | 3452.75 | 4465.85 | 3959.300000 | 7918.60 |
Finally, we can also filter our aggregated views with the HAVING
clause. The HAVING
clause works similarly to the WHERE
clause, except it is used to filter data selections on conditions after the GROUP BY
clause. For example, if we wanted to filter based on a customer's last name, we would use the WHERE
clause. However, if we wanted to filter a list of cities with at least 5 customers, we would use the HAVING
clause. First, we would GROUP BY
city and then use the HAVING
clause, which will allow us to pass conditions on the result of this aggregation.
cur.execute("""SELECT city, COUNT(customerNumber) AS number_customers
FROM customers
GROUP BY 1
HAVING COUNT(customerNumber)>=5;""")
df = pd.DataFrame(cur.fetchall())
df. columns = [i[0] for i in cur.description]
print(len(df))
df.head()
2
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
city | number_customers | |
---|---|---|
0 | Madrid | 5 |
1 | NYC | 5 |
We can also use the WHERE
and HAVING
clauses in conjunction with each other for more complex rules.
For example, let's say we want a list of customers who have made at least 2 purchases of over 50K each.
cur.execute("""SELECT customerName,
COUNT(amount) AS number_purchases_over_50K
FROM customers
JOIN payments
USING(customerNumber)
WHERE amount >= 50000
GROUP BY customerName
HAVING count(amount) >= 2
ORDER BY count(amount) DESC;""")
df = pd.DataFrame(cur.fetchall())
df. columns = [i[0] for i in cur.description]
print(len(df))
df.head()
4
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
customerName | number_purchases_over_50K | |
---|---|---|
0 | Euro+ Shopping Channel | 5 |
1 | Mini Gifts Distributors Ltd. | 5 |
2 | Muscle Machine Inc | 2 |
3 | Online Diecast Creations Co. | 2 |
In this lesson, you learned how to use aggregate functions, aliases, and the HAVING
clause to filter selections.