Grouping Data with SQL

Introduction

Just as with Pandas, we can use aggregate functions in SQL to assist with data manipulation. Sometimes you may wish to find the mean, median, min, or max of a column feature. For example, there could be a customer relational database that you've been working with and you may wonder if there are differences in overall sales across offices or regions.

Objectives

You will be able to:

Describe the relationship between aggregate functions and GROUP BY statements
Use GROUP BY statements in SQL to apply aggregate functions like: COUNT, MAX, MIN, and SUM
Create an alias in a SQL query
Use the HAVING clause to compare different aggregates
Compare the difference between the WHERE and HAVING clause

Database Schema

import sqlite3
import pandas as pd

Connecting to the Database

As usual, start by creating a connection to the database and instantiating a cursor object.

conn = sqlite3.Connection('data.sqlite')
cur = conn.cursor()

`GROUP BY` and Aggregate Functions

Let's start by looking at some GROUP BY statements to aggregate our data. The GROUP BY clause groups records into summary rows and returns one record for each group. Typically, GROUP BY also involves an aggregate function (COUNT, AVG, etc.). Lastly, GROUP BY can group by one or more columns.

In the cell below, we'll join the offices and employees tables in order to count the number of employees per city.

cur.execute("""SELECT city, COUNT(employeeNumber)
                      FROM offices
                      JOIN employees
                      USING(officeCode)
                      GROUP BY city
                      ORDER BY count(employeeNumber) DESC;""")
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	city	COUNT(employeeNumber)
0	San Francisco	6
1	Paris	5
2	Sydney	4
3	Boston	2
4	London	2

Aliasing

An Alias is a shorthand for a table or column name. Aliases reduce the amount of typing required to enter a query. Generally, complex queries with aliases are easier to read. Aliases are useful with JOIN, GROUP BY, and aggregates (SUM, COUNT, etc.). An Alias only exists for the duration of the query.

You can alias your GROUP BY by specifying the index of our selection order that we want to group by. This is simply written as GROUP BY 1, with the number "1" referring to the first column name that we are selecting.

Additionally, we can also rename our aggregate to a more descriptive name using the AS clause.

cur.execute("""SELECT city, COUNT(employeeNumber) AS numEmployees
               FROM offices
               JOIN employees
               USING(officeCode)
               GROUP BY 1
               ORDER BY numEmployees DESC;""")
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	city	numEmployees
0	San Francisco	6
1	Paris	5
2	Sydney	4
3	Boston	2
4	London	2

Other Aggregations

Aside from COUNT() some other useful aggregations include:

MIN()
MAX()
SUM()
AVG()

cur.execute("""SELECT customerName,
               COUNT(customerName) AS number_purchases,
               MIN(amount) AS min_purchase,
               MAX(amount) AS max_purchase,
               AVG(amount) AS avg_purchase,
               SUM(amount) AS total_spent
               FROM customers
               JOIN payments
               USING(customerNumber)
               GROUP BY customerName
               ORDER BY SUM(amount) DESC;""")
df = pd.DataFrame(cur.fetchall())
df. columns = [i[0] for i in cur.description]
print(len(df))
df.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	customerName	number_purchases	min_purchase	max_purchase	avg_purchase	total_spent
0	Euro+ Shopping Channel	13	20009.53	120166.58	55056.844615	715738.98
1	Mini Gifts Distributors Ltd.	9	11044.30	111654.40	64909.804444	584188.24
2	Australian Collectors, Co.	4	7565.08	82261.22	45146.267500	180585.07
3	Muscle Machine Inc	4	20314.44	58841.35	44478.487500	177913.95
4	Dragon Souveniers, Ltd.	4	2611.84	105743.00	39062.757500	156251.03

df.tail()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	customerName	number_purchases	min_purchase	max_purchase	avg_purchase	total_spent
93	Royale Belge	4	1128.20	14379.90	7304.295000	29217.18
94	Frau da Collezione	2	7612.06	17746.26	12679.160000	25358.32
95	Atelier graphique	3	1676.14	14571.44	7438.120000	22314.36
96	Auto-Moto Classics Inc.	3	5858.56	9658.74	7184.753333	21554.26
97	Boards & Toys Co.	2	3452.75	4465.85	3959.300000	7918.60

The `HAVING` clause

Finally, we can also filter our aggregated views with the HAVING clause. The HAVING clause works similarly to the WHERE clause, except it is used to filter data selections on conditions after the GROUP BY clause. For example, if we wanted to filter based on a customer's last name, we would use the WHERE clause. However, if we wanted to filter a list of cities with at least 5 customers, we would use the HAVING clause. First, we would GROUP BY city and then use the HAVING clause, which will allow us to pass conditions on the result of this aggregation.

cur.execute("""SELECT city, COUNT(customerNumber) AS number_customers
               FROM customers
               GROUP BY 1
               HAVING COUNT(customerNumber)>=5;""")
df = pd.DataFrame(cur.fetchall())
df. columns = [i[0] for i in cur.description]
print(len(df))
df.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	city	number_customers
0	Madrid	5
1	NYC	5

Combining the `WHERE` and `HAVING` clause

We can also use the WHERE and HAVING clauses in conjunction with each other for more complex rules. For example, let's say we want a list of customers who have made at least 2 purchases of over 50K each.

cur.execute("""SELECT customerName,
               COUNT(amount) AS number_purchases_over_50K
               FROM customers
               JOIN payments
               USING(customerNumber)
               WHERE amount >= 50000
               GROUP BY customerName
               HAVING count(amount) >= 2
               ORDER BY count(amount) DESC;""")
df = pd.DataFrame(cur.fetchall())
df. columns = [i[0] for i in cur.description]
print(len(df))
df.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	customerName	number_purchases_over_50K
0	Euro+ Shopping Channel	5
1	Mini Gifts Distributors Ltd.	5
2	Muscle Machine Inc	2
3	Online Diecast Creations Co.	2

Summary

In this lesson, you learned how to use aggregate functions, aliases, and the HAVING clause to filter selections.

camladd / dsc-grouping-data-with-sql Goto Github PK

dsc-grouping-data-with-sql's Introduction

Grouping Data with SQL

Introduction

Objectives

Database Schema

Connecting to the Database

`GROUP BY` and Aggregate Functions

Aliasing

Other Aggregations

The `HAVING` clause

Combining the `WHERE` and `HAVING` clause

Summary

dsc-grouping-data-with-sql's People

Contributors

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

camladd / dsc-grouping-data-with-sql Goto Github PK

dsc-grouping-data-with-sql's Introduction

Grouping Data with SQL

Introduction

Objectives

Database Schema

Connecting to the Database

GROUP BY and Aggregate Functions

Aliasing

Other Aggregations

The HAVING clause

Combining the WHERE and HAVING clause

Summary

dsc-grouping-data-with-sql's People

Contributors

Recommend Projects

Recommend Topics

Recommend Org

`GROUP BY` and Aggregate Functions

The `HAVING` clause

Combining the `WHERE` and `HAVING` clause