Giter Club home page Giter Club logo

jaffle_shop_duckdb's Introduction

Testing dbt project: jaffle_shop

jaffle_shop is a fictional ecommerce store. This dbt project transforms raw data from an app database into a customers and orders model ready for analytics.

What is this repo?

What this repo is:

  • A self-contained playground dbt project, useful for testing out scripts, and communicating some of the core dbt concepts.

What this repo is not:

  • A tutorial — check out the Getting Started Tutorial for that. Notably, this repo contains some anti-patterns to make it self-contained, namely the use of seeds instead of sources.
  • A demonstration of best practices — check out the dbt Learn Demo repo instead. We want to keep this project as simple as possible. As such, we chose not to implement:
    • our standard file naming patterns (which make more sense on larger projects, rather than this five-model project)
    • a pull request flow
    • CI/CD integrations
  • A demonstration of using dbt for a high-complex project, or a demo of advanced features (e.g. macros, packages, hooks, operations) — we're just trying to keep things simple here!

What's in this repo?

This repo contains seeds that includes some (fake) raw data from a fictional app along with some basic dbt models, tests, and docs for this data.

The raw data consists of customers, orders, and payments, with the following entity-relationship diagram:

Jaffle Shop ERD

Why should I care about this repo?

If you're just starting your cloud data warehouse journey and are hungry to get started with dbt before your organization officially gets a data warehouse, you should check out this repo.

If you want to run 28 SQL operations with dbt in less than 1 second, for free, and all on your local machine, you should check out this repo. dbt_performance

If you want an adrenaline rush from a process that used to take dbt newcomers 1 hour and is now less than 1 minute, you should check out this repo.

dbt_full_deploy_commands

Verified GitHub Action on dbt Performance

Running this project

Prerequisities: Python >= 3.5

Mach Speed: No explanation needed

Run dbt as fast as possible in a single copy and paste motion!

POSIX bash/zsh
git clone https://github.com/dbt-labs/jaffle_shop_duckdb.git
cd jaffle_shop_duckdb
python3 -m venv venv
source venv/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install -r requirements.txt
source venv/bin/activate
dbt build
dbt docs generate
dbt docs serve
POSIX fish
git clone https://github.com/dbt-labs/jaffle_shop_duckdb.git
cd jaffle_shop_duckdb
python3 -m venv venv
source venv/bin/activate.fish
python3 -m pip install --upgrade pip
python3 -m pip install -r requirements.txt
source venv/bin/activate.fish
dbt build
dbt docs generate
dbt docs serve
POSIX csh/tcsh
git clone https://github.com/dbt-labs/jaffle_shop_duckdb.git
cd jaffle_shop_duckdb
python3 -m venv venv
source venv/bin/activate.csh
python3 -m pip install --upgrade pip
python3 -m pip install -r requirements.txt
source venv/bin/activate.csh
dbt build
dbt docs generate
dbt docs serve
POSIX PowerShell Core
git clone https://github.com/dbt-labs/jaffle_shop_duckdb.git
cd jaffle_shop_duckdb
python3 -m venv venv
venv/bin/Activate.ps1
python3 -m pip install --upgrade pip
python3 -m pip install -r requirements.txt
venv/bin/Activate.ps1
dbt build
dbt docs generate
dbt docs serve
Windows cmd.exe
git clone https://github.com/dbt-labs/jaffle_shop_duckdb.git
cd jaffle_shop_duckdb
python -m venv venv
venv\Scripts\activate.bat
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
venv\Scripts\activate.bat
dbt build
dbt docs generate
dbt docs serve
Windows PowerShell
git clone https://github.com/dbt-labs/jaffle_shop_duckdb.git
cd jaffle_shop_duckdb
python -m venv venv
venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
venv\Scripts\Activate.ps1
dbt build
dbt docs generate
dbt docs serve
GitHub Codespaces / Dev Containers

Steps

  1. Ensure you have Codespaces enabled for your GitHub organization or turned on as a beta feature if you're an individual user
  2. Click the green Code button on near the top right of the page of this repo's homepage (you may already be on it)
  3. Instead of cloning the repo like you normally would, instead select the Codespaces tab of the pop out, then "Create codespace on duckdb" dbt_full_deploy_commands
  4. Wait for codespace to boot (~1 min?)
  5. Decide whether you'd like to use the Web IDE or open the codespace in your local environment
  6. When the codespace opens, a Task pane will show up and call dbt build just to show you how it's done
  7. Decide whether or not you'd like the recommended extensions installed (like dbt Power User extension)
  8. Open up a new terminal and type:
    dbt build
    
  9. Explore some of the bells and whistles (see below)

If you don't have Codespaces or would like to just run the environment in a local Docker container, you can by:

  1. Install Docker Desktop
  2. Install the VSCode Dev Containers extension (formerly known as the "Remote - Containers" extension). Video tutorial here.
  3. Clone this repo and open it in VSCode
  4. First time: View > Command Palette > Remote-Containers: Open Folder in Container
    • Wait for container to build -- expected to take several minutes
    • Open a new terminal
  5. Subsequent times: Click Reopen in Container and wait for container to spin up Reopen in Container
  6. Continue on step 7 above

bells and whistles

There's some bells and whistles defined in the .devcontainer.json.devcontainer.json) that are worth calling out. Also a great reference is the Setting up VSCode for dbt guide.

  1. there is syntax highlighting provided by the vdcode-dbt extension. However, it is configured such that files in your target/run and target/compiled folder are not syntax highlighted, as a reminder that these files are not where you should be making changes!
  2. basic sqlfluff linting is enabled as you type. Syntax errors will be underlined in red at the error, and will also be surfaced in the Problems tab of the Terminal pane. It's configured to lint as you type.
  3. Autocompletion is enabled for generic dbt macros via the vdcode-dbt extension. For example, if you type macro you'll notice a pop up that you can select with the arrow keys then click tab to get a macro snippet. image image
  4. the find-related extension allows an easy shortcut to navigating using CMD+Rto jump from
    • a model file to it's corresponding compiled version,
    • from a compiled file to either the original model file or the version in target/run
  5. The vscode-yaml YAML, combined with the JSON schema defined in dbt-labs/dbt-jsonschema, autocomplete options while working with dbt's YAML files: i.e. :
    • Project definition files (dbt_project.yml)
    • Package files (packages.yml)
    • Selectors files (selectors.yml)
    • Property files (models/whatever.yml)

Step-by-step explanation

To get up and running with this project:

  1. Clone this repository.

  2. Change into the jaffle_shop_duck directory from the command line:

    cd jaffle_shop_duckdb
  3. Install dbt and DuckDB in a virtual environment.

    Expand your shell below:

    POSIX bash/zsh
    python3 -m venv venv
    source venv/bin/activate
    python3 -m pip install --upgrade pip
    python3 -m pip install -r requirements.txt
    source venv/bin/activate
    POSIX fish
    python3 -m venv venv
    source venv/bin/activate.fish
    python3 -m pip install --upgrade pip
    python3 -m pip install -r requirements.txt
    source venv/bin/activate.fish
    POSIX csh/tcsh
    python3 -m venv venv
    source venv/bin/activate.csh
    python3 -m pip install --upgrade pip
    python3 -m pip install -r requirements.txt
    source venv/bin/activate.csh
    POSIX PowerShell Core
    python3 -m venv venv
    venv/bin/Activate.ps1
    python3 -m pip install --upgrade pip
    python3 -m pip install -r requirements.txt
    venv/bin/Activate.ps1
    Windows cmd.exe
    python -m venv venv
    venv\Scripts\activate.bat
    python -m pip install --upgrade pip
    python -m pip install -r requirements.txt
    venv\Scripts\activate.bat
    Windows PowerShell
    python -m venv venv
    venv\Scripts\Activate.ps1
    python -m pip install --upgrade pip
    python -m pip install -r requirements.txt
    venv\Scripts\Activate.ps1

    Why a 2nd activation of the virtual environment?

    This may not be necessary for many users, but might be for some. Read on for a first-person report from @dbeatty10.

    I use zsh as my shell on my MacBook Pro, and I use pyenv to manage my Python environments. I already had an alpha version of dbt Core 1.2 installed (and yet another via pipx):

    $ which dbt
    /Users/dbeatty/.pyenv/shims/dbt
    $ dbt --version
    Core:
      - installed: 1.2.0-a1
      - latest:    1.1.1    - Ahead of latest version!
    
    Plugins:
      - bigquery:  1.2.0a1 - Ahead of latest version!
      - snowflake: 1.2.0a1 - Ahead of latest version!
      - redshift:  1.2.0a1 - Ahead of latest version!
      - postgres:  1.2.0a1 - Ahead of latest version!

    Then I ran all the steps to create a virtual environment and install the requirements of our DuckDB-based Jaffle Shop repo:

    $ python3 -m venv venv
    $ source venv/bin/activate
    (venv) $ python3 -m pip install --upgrade pip
    (venv) $ python3 -m pip install -r requirements.txt

    Let's examine where dbt is installed and which version it is reporting:

    (venv) $ which dbt
    /Users/dbeatty/projects/jaffle_duck/venv/bin/dbt
    (venv) $ dbt --version
    Core:
      - installed: 1.2.0-a1
      - latest:    1.1.1    - Ahead of latest version!
    
    Plugins:
      - bigquery:  1.2.0a1 - Ahead of latest version!
      - snowflake: 1.2.0a1 - Ahead of latest version!
      - redshift:  1.2.0a1 - Ahead of latest version!
      - postgres:  1.2.0a1 - Ahead of latest version!

    ❌ That isn't what we expected -- something isn't right. 😢

    So let's reactivate the virtual environment and try again...

    (venv) $ source venv/bin/activate
    (venv) $ dbt --version
    Core:
      - installed: 1.1.1
      - latest:    1.1.1 - Up to date!
    
    Plugins:
      - postgres: 1.1.1 - Up to date!
      - duckdb:   1.1.3 - Up to date!

    ✅ This is what we want -- the 2nd reactivation worked. 😎

  4. Ensure your profile is setup correctly from the command line:

    dbt --version
    dbt debug
  5. Load the CSVs with the demo data set, run the models, and test the output of the models using the dbt build command:

    dbt build
  6. Query the data:

    Launch a DuckDB command-line interface (CLI):

    duckcli jaffle_shop.duckdb

    Run a query at the prompt and exit:

    select * from customers where customer_id = 42;
    exit;
    

    Alternatively, use a single-liner to perform the query:

    duckcli jaffle_shop.duckdb -e "select * from customers where customer_id = 42"

    or:

    echo 'select * from customers where customer_id = 42' | duckcli jaffle_shop.duckdb
  7. Generate and view the documentation for the project:

    dbt docs generate
    dbt docs serve

Running build steps independently

  1. Load the CSVs with the demo data set. This materializes the CSVs as tables in your target schema. Note that a typical dbt project does not require this step since dbt assumes your raw data is already in your warehouse.

    dbt seed
  2. Run the models:

    dbt run

    NOTE: If you decide to run this project in your own data warehouse (outside of this DuckDB demo) and steps fail, it might mean that you need to make small changes to the SQL in the models folder to adjust for the flavor of SQL of your target database. Definitely consider this if you are using a community-contributed adapter.

  3. Test the output of the models using the test command:

    dbt test

Browsing the data

Some options:

Troubleshooting

You may get an error like this, in which case you will need to disconnect from any sessions that are locking the database:

IO Error: Could not set lock on file "jaffle_shop.duckdb": Resource temporarily unavailable

This is a known issue in DuckDB. If you are using DBeaver, this means shutting down DBeaver (merely disconnecting didn't work for me).

Very worst-case, deleting the database file will get you back in action (BUT you will lose all your data).

GitHub Codespaces and VSCode Remote Container

If you're using a privacy-forward browser such as Firefox and Brave, or a tracking-cookie-blocking extension like UBlock Origin or Privacy Badger, you may see the below error. You can either change your cookie settings, use a browser like Chrome, or just ignore the error because it doesn't affect the demo

image


For more information on dbt:

jaffle_shop_duckdb's People

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

jaffle_shop_duckdb's Issues

VS Code dev container pops up dbt power user error

When running this repo in VS Code (locally or via codespaces), it looks like the tasks in the devcontainer complete okay, but the dbt power user extension pops up an error.

Any idea how to prevent the error? It seems maybe related to needing DBT_PROFILES_DIR set?

image

How to Seed or Read Parquet File

Hi Team,

we would like to seed or read Parquet file from local storage and Azure Blob Storage in project , Can you please guide the configuration which we can do to read the file.

Regards,
Akash

Cant get dbt build to work properly

Hi I've attempted to start the project by first copying the following instructions:

git clone https://github.com/dbt-labs/jaffle_shop_duckdb.git
cd jaffle_shop_duckdb
python3 -m venv venv
source venv/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install -r requirements.txt
source venv/bin/activate
dbt build
dbt docs generate
dbt docs serve

And that led to this error.
Screenshot 2023-02-13 at 15 04 56

Screenshot 2023-02-13 at 15 05 39

So then I tried by using the provided vscode dev container but then it somehow fails to find the csv seed files. (even tho they exist in the repo)

Screenshot 2023-02-13 at 15 03 45

prevent 'wheel' is not installed errors within requirements.txt

Also got a bunch of 'wheel' is not installed, Would it make sense to add wheel to the requirements.txt ?

Collecting text-unidecode>=1.3
  Using cached text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Using legacy 'setup.py install' for dbt-core, since package 'wheel' is not installed.
Using legacy 'setup.py install' for dbt-postgres, since package 'wheel' is not installed.
Using legacy 'setup.py install' for configobj, since package 'wheel' is not installed.

github action to prove mach speed works with macos, linux, windows terminals

follow the readme toggles and copy them into github action format.

This will serve as auto checks too.

It'll be cool for new folks coming across this repo for the first time thinking, "Yeah right, no way I can get started that fast" then click through the github actions and go, "Oh they already proved it's that fast..."

Update README text

Synced with @sungchun12 on some minor updates to the README to hopefully improve readability :)

  • Linking to documentation on profiles.yml in step 4
  • Removing the "What is a jaffle?" section; I know that's a cornerstone of jaffle shop, but I can't tell if it's just adding dead weight to a semi-long readme
  • Moving Doug's callout/dropdown to the virtual environment reactivation earlier up (maybe at the step it occurs)
  • The note in Step 2 here is potentially confusing—this is bey default using a duckDB database/adapter, so the code should work without having to change any of the model code. I think the note could be reworded along the lines of, "If you decided to use this project in your own data warehouse, make sure....blah blah"
  • Linking to documentation on dbt build in step 5

Can payment_methods be defined once and referred multiple times?

The accepted values for payment_method field for stg_customers are defined in `models/staging/schema.yml'.

version: 2

models:
  - name: stg_customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null

  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']

  - name: stg_payments
    columns:
      - name: payment_id
        tests:
          - unique
          - not_null
      - name: payment_method
        tests:
          - accepted_values:
              values: ['credit_card', 'coupon', 'bank_transfer', 'gift_card']

These values are used in models/orders.sql.

-- ...

order_payments as (

    select
        order_id,

        {% for payment_method in payment_methods -%}
        sum(case when payment_method = '{{ payment_method }}' then amount else 0 end)
            as {{ payment_method }}_amount,
        {% endfor -%}

        sum(amount) as total_amount

    from payments

    group by order_id

),

Can payment_methods defined once in `models/staging/schema.yml', so that we could avoid data inconsistencies in the future. If it is possible, how can I do it?

1st Phase Feedback

Notes from working through this:

  • Encountered an error with python3 -m pip install -r requirements.txt and had to both update pip and brew install postgresql. Rerunning after updating pip and installing postrgesql worked! Had I not encountered this error, definitely would have been up and running in 1 min 🙂
  • Forgot to run source venv/bin/activate after installing requirements, so it was defaulting to my local snowflake adapter instead of the duckdb one. Don't be like me and read the instructions carefully!
  • Everything else ran super super smoothly! This is such a wildly fantastic idea and I can't wait to see it in use.

Reduce size of README

Currently, the README is a giant wall of text that is intimidating to me.

As a user, I always appreciate the main focus to be install instructions that help me get hands-on. Everything else is a distraction from that goal, in my opinion.

What good looks like

  • minimum prerequisites
    • minimum software to install
    • minimum skills/knowledge required
  • the user can quickly find all the commands they need to run
    • the relevant commands don't need to be fished out from the middle of sentences (see example below)
  • work for the greatest number of users
    • instructions that are cross-platform (zsh, PowerShell, etc)
  • ideally fits on a single page

Example of fishing out relevant commands

  • pip upgrade command needs to be fished out from here
  • Should just be a stand-alone line that says this instead:
    pip install --upgrade pip

Constraint application of primary and foreign keys

Hi, I'm quite new to DBT and DuckDB. I'm still a bit confused about setting up primary and foreign keys in the models.

I saw that this is how it is being implemented in the models:

- name: order_id
tests:
- unique
- not_null
description: This is a unique identifier for an order
- name: customer_id
description: Foreign key to the customers table
tests:
- not_null
- relationships:
to: ref('customers')
field: customer_id

When I open the resulting .db file in DBeaver or Metabase, the constraints doesn't seem to be implemented.

My question is:

  • Is this something that is expected in DBT or am I missing something?
  • Why don't we implement this in the .sql scripts?

PS:
I also found these tools:

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.