Giter Club home page Giter Club logo

dsc-1-05-14-grouping-data-with-sql-nyc-career-ds-102218's Introduction

Grouping Data with SQL

Introduction

In this section, you'll learn how to use aggregate functions in SQL.

Objectives

You will be able to:

  • Write queries with aggregate functions like COUNT, MAX, MIN, and SUM
  • Create an alias for the return value of an aggregate function
  • Use GROUP BY to sort the data sets returned by aggregate functions
  • Compare aggregates using the HAVING clause
import sqlite3
import pandas as pd

Database Schema

Connecting to the Database

conn = sqlite3.Connection('data.sqlite')
cur = conn.cursor()

Groupby and Aggregate Functions

Lets start by looking at some groupby statements to aggregate our data.

#Here we join the offices and employees tables in order to count the number of employees per city.
cur.execute("""select city,
                      count(employeeNumber)
                      from offices
                      join employees
                      using(officeCode)
                      group by city;""")
pd.DataFrame(cur.fetchall())
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
0 1
0 Boston 2
1 London 2
2 NYC 2
3 Paris 5
4 San Francisco 6
5 Sydney 4
6 Tokyo 2

Ordering and Aliasing

We can also alias our groupby by specifying the number of our selection order that we want to group by. Additionally, we can also order or limit our selection with the order by and limit clauses.

cur.execute("""select city,
                      count(employeeNumber)
                      from offices
                      join employees
                      using(officeCode)
                      group by 1
                      order by count(employeeNumber) desc
                      limit 5;""")
pd.DataFrame(cur.fetchall())
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
0 1
0 San Francisco 6
1 Paris 5
2 Sydney 4
3 Boston 2
4 London 2

Retrieving Column Names

Recall that we can also retrieve our column names when using sqlite3 (note that this will be the default behavior in other environments such as sql workbench)

cur.execute("""select city,
                      count(employeeNumber)
                      from offices
                      join employees
                      using(officeCode)
                      group by 1
                      order by count(employeeNumber) desc
                      limit 5;""")
df = pd.DataFrame(cur.fetchall())
df. columns = [i[0] for i in cur.description]
df.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
city count(employeeNumber)
0 San Francisco 6
1 Paris 5
2 Sydney 4
3 Boston 2
4 London 2

Aliasing our Aggregate Function Name

Now that we can view our column names, we can also practice using alias's to name our aggregations.

cur.execute("""select city,
                      count(employeeNumber) as employeeCount
                      from offices
                      join employees
                      using(officeCode)
                      group by 1
                      order by count(employeeNumber) desc
                      limit 5;""")
df = pd.DataFrame(cur.fetchall())
df. columns = [i[0] for i in cur.description]
df.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
city employeeCount
0 San Francisco 6
1 Paris 5
2 Sydney 4
3 Boston 2
4 London 2

Other Aggregations

Aside from count() some other useful aggregations include: * min() * max() * sum() * avg()

cur.execute("""select customerName,
                      count(*) as number_purchases,
                      min(amount) as min_purchase,
                      max(amount) as max_purchase,
                      avg(amount) as avg_purchase,
                      sum(amount) as total_spent
                      from customers
                      join payments
                      using(customerNumber)
                      group by 1
                      order by sum(amount) desc;""")
df = pd.DataFrame(cur.fetchall())
df. columns = [i[0] for i in cur.description]
print(len(df))
df.head()
98
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
customerName number_purchases min_purchase max_purchase avg_purchase total_spent
0 Euro+ Shopping Channel 13 116208.40 65071.26 55056.844615 715738.98
1 Mini Gifts Distributors Ltd. 9 101244.59 85410.87 64909.804444 584188.24
2 Australian Collectors, Co. 4 44894.74 82261.22 45146.267500 180585.07
3 Muscle Machine Inc 4 20314.44 58841.35 44478.487500 177913.95
4 Dragon Souveniers, Ltd. 4 105743.00 44380.15 39062.757500 156251.03
df.tail()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
customerName number_purchases min_purchase max_purchase avg_purchase total_spent
93 Royale Belge 4 1128.20 1627.56 7304.295000 29217.18
94 Frau da Collezione 2 17746.26 7612.06 12679.160000 25358.32
95 Atelier graphique 3 14571.44 6066.78 7438.120000 22314.36
96 Auto-Moto Classics Inc. 3 5858.56 9658.74 7184.753333 21554.26
97 Boards & Toys Co. 2 3452.75 4465.85 3959.300000 7918.60

The having clause

Finally, we can also filter our aggregated views with the having clause. The having clause works like the where clause but is used to filter data selections on conditions post the group by. For example, if we wanted to filter based on a customer's last name, we would use the where clause. However, if we wanted to filter a list of city's with at least 5 customers, we would using the having clause; we would first groupby city and count the number of customers, and the having clause allows us to pass conditions on the result of this aggregation.

cur.execute("""select city,
                      count(customerNumber) as number_customers
                      from customers
                      group by 1
                      having count(customerNumber)>=5;""")
df = pd.DataFrame(cur.fetchall())
df. columns = [i[0] for i in cur.description]
print(len(df))
df.head()
2
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
city number_customers
0 Madrid 5
1 NYC 5

Combining the where and having clause

We can also use the where and having clause in conjunction with each other for more complex rules. For example, let's say we want a list of customers who have made at least 3 purchases of over 50K each.

cur.execute("""select customerName,
                      count(amount) as number_purchases_over_50K
                      from customers
                      join payments
                      using(customerNumber)
                      where amount >= 50000
                      group by 1
                      having count(amount) >= 3
                      order by count(amount) desc;""")
df = pd.DataFrame(cur.fetchall())
df. columns = [i[0] for i in cur.description]
print(len(df))
df.head()
53
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
customerName number_purchases_over_50K
0 Euro+ Shopping Channel 13
1 Mini Gifts Distributors Ltd. 9
2 Anna's Decorations, Ltd 4
3 Australian Collectors, Co. 4
4 Baane Mini Imports 4
df.tail()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
customerName number_purchases_over_50K
48 Stylish Desk Decors, Co. 3
49 Suominen Souveniers 3
50 Toys of Finland, Co. 3
51 Toys4GrownUps.com 3
52 Vitachrome Inc. 3

Summary

After this section, you should have a good idea of how to use aggregate functions, aliases and the having clause to filter selections.

dsc-1-05-14-grouping-data-with-sql-nyc-career-ds-102218's People

Contributors

loredirick avatar mathymitchell avatar tkoar avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.