Giter Club home page Giter Club logo

dsc-grouping-data-with-sql's Introduction

Grouping Data with SQL

Introduction

Just as with Pandas, we can use aggregate functions in SQL to assist with data manipulation. Sometimes you may wish to find the mean, median, min, or max of a column feature. For example, there could be a customer relational database that you've been working with and you may wonder if there are differences in overall sales across offices or regions.

Objectives

You will be able to:

  • Describe the relationship between aggregate functions and GROUP BY statements
  • Use GROUP BY statements in SQL to apply aggregate functions like: COUNT, MAX, MIN, and SUM
  • Create an alias in a SQL query
  • Use the HAVING clause to compare different aggregates
  • Compare the difference between the WHERE and HAVING clause

Database Schema

import sqlite3
import pandas as pd

Connecting to the Database

As usual, start by creating a connection to the database and instantiating a cursor object.

conn = sqlite3.Connection('data.sqlite')
cur = conn.cursor()

GROUP BY and Aggregate Functions

Let's start by looking at some GROUP BY statements to aggregate our data. The GROUP BY clause groups records into summary rows and returns one record for each group. Typically, GROUP BY also involves an aggregate function (COUNT, AVG, etc.). Lastly, GROUP BY can group by one or more columns.

In the cell below, we'll join the offices and employees tables in order to count the number of employees per city.

cur.execute("""SELECT city, COUNT(employeeNumber)
                      FROM offices
                      JOIN employees
                      USING(officeCode)
                      GROUP BY city
                      ORDER BY count(employeeNumber) DESC;""")
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
city COUNT(employeeNumber)
0 San Francisco 6
1 Paris 5
2 Sydney 4
3 Boston 2
4 London 2

Aliasing

An Alias is a shorthand for a table or column name. Aliases reduce the amount of typing required to enter a query. Generally, complex queries with aliases are easier to read. Aliases are useful with JOIN, GROUP BY, and aggregates (SUM, COUNT, etc.). An Alias only exists for the duration of the query.

You can alias your GROUP BY by specifying the index of our selection order that we want to group by. This is simply written as GROUP BY 1, with the number "1" referring to the first column name that we are selecting.

Additionally, we can also rename our aggregate to a more descriptive name using the AS clause.

cur.execute("""SELECT city, COUNT(employeeNumber) AS numEmployees
               FROM offices
               JOIN employees
               USING(officeCode)
               GROUP BY 1
               ORDER BY numEmployees DESC;""")
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
city numEmployees
0 San Francisco 6
1 Paris 5
2 Sydney 4
3 Boston 2
4 London 2

Other Aggregations

Aside from COUNT() some other useful aggregations include:

  • MIN()
  • MAX()
  • SUM()
  • AVG()
cur.execute("""SELECT customerName,
               COUNT(customerName) AS number_purchases,
               MIN(amount) AS min_purchase,
               MAX(amount) AS max_purchase,
               AVG(amount) AS avg_purchase,
               SUM(amount) AS total_spent
               FROM customers
               JOIN payments
               USING(customerNumber)
               GROUP BY customerName
               ORDER BY SUM(amount) DESC;""")
df = pd.DataFrame(cur.fetchall())
df. columns = [i[0] for i in cur.description]
print(len(df))
df.head()
98
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
customerName number_purchases min_purchase max_purchase avg_purchase total_spent
0 Euro+ Shopping Channel 13 20009.53 120166.58 55056.844615 715738.98
1 Mini Gifts Distributors Ltd. 9 11044.30 111654.40 64909.804444 584188.24
2 Australian Collectors, Co. 4 7565.08 82261.22 45146.267500 180585.07
3 Muscle Machine Inc 4 20314.44 58841.35 44478.487500 177913.95
4 Dragon Souveniers, Ltd. 4 2611.84 105743.00 39062.757500 156251.03
df.tail()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
customerName number_purchases min_purchase max_purchase avg_purchase total_spent
93 Royale Belge 4 1128.20 14379.90 7304.295000 29217.18
94 Frau da Collezione 2 7612.06 17746.26 12679.160000 25358.32
95 Atelier graphique 3 1676.14 14571.44 7438.120000 22314.36
96 Auto-Moto Classics Inc. 3 5858.56 9658.74 7184.753333 21554.26
97 Boards & Toys Co. 2 3452.75 4465.85 3959.300000 7918.60

The HAVING clause

Finally, we can also filter our aggregated views with the HAVING clause. The HAVING clause works similarly to the WHERE clause, except it is used to filter data selections on conditions after the GROUP BY clause. For example, if we wanted to filter based on a customer's last name, we would use the WHERE clause. However, if we wanted to filter a list of cities with at least 5 customers, we would use the HAVING clause. First, we would GROUP BY city and then use the HAVING clause, which will allow us to pass conditions on the result of this aggregation.

cur.execute("""SELECT city, COUNT(customerNumber) AS number_customers
               FROM customers
               GROUP BY 1
               HAVING COUNT(customerNumber)>=5;""")
df = pd.DataFrame(cur.fetchall())
df. columns = [i[0] for i in cur.description]
print(len(df))
df.head()
2
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
city number_customers
0 Madrid 5
1 NYC 5

Combining the WHERE and HAVING clause

We can also use the WHERE and HAVING clauses in conjunction with each other for more complex rules. For example, let's say we want a list of customers who have made at least 2 purchases of over 50K each.

cur.execute("""SELECT customerName,
               COUNT(amount) AS number_purchases_over_50K
               FROM customers
               JOIN payments
               USING(customerNumber)
               WHERE amount >= 50000
               GROUP BY customerName
               HAVING count(amount) >= 2
               ORDER BY count(amount) DESC;""")
df = pd.DataFrame(cur.fetchall())
df. columns = [i[0] for i in cur.description]
print(len(df))
df.head()
4
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
customerName number_purchases_over_50K
0 Euro+ Shopping Channel 5
1 Mini Gifts Distributors Ltd. 5
2 Muscle Machine Inc 2
3 Online Diecast Creations Co. 2

Summary

In this lesson, you learned how to use aggregate functions, aliases, and the HAVING clause to filter selections.

dsc-grouping-data-with-sql's People

Contributors

tkoar avatar mas16 avatar mathymitchell avatar loredirick avatar sumedh10 avatar sproulhimself avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.