Giter Club home page Giter Club logo

superset-and-metabase's Introduction

Superset and Metabase Evaluations

Useful analysis of Apache Superset at Dropbox.

Supersetup

With Python virtual environment from scratch: Installing from scratch

With Docker: Recommended to run on Linux OS with Docker and docker-compose.

  1. Clone the Apache Superset repo on GitHub

     git clone https://github.com/apache/superset.git; cd superset
    
  2. (optional) If you have data and SQL scripts for loading data, change the db component in the docker-compose-non-dev.yml file to the following:

     db:
         env_file: docker/.env-non-dev
         image: postgres:latest
         container_name: superset_db
         restart: unless-stopped
         volumes:
           - db_home:/var/lib/postgresql/data
           - ../source-data:/source-data
           - ../database-setup/sourcedb.sql:/docker-entrypoint-initdb.d/sourcedb.sql
         ports:
           - "5433:5432" 
    

    The default settings also has no port mapping for the database, which prevents external connections to the database (e.g. from a VSCode extension). Specify the port mapping as above to make external connections possible. Other environment variables are in the .env-non-dev file.

    Modify the docker/.env-non-dev file not to load examples because it takes way too long to load them.

     SUPERSET_LOAD_EXAMPLES=no
    
  3. Run this production version of the docker-compose file

     docker-compose -f docker-compose-non-dev.yml up
    

    NOTE: Although the official instructions at this stage say that you can login with username and password admin at localhost:8088, I could not. This was a past issue as well.

    NOTE: Running containers in detached mode gives some sort of permission error. Better to run in non-detached mode.

  4. To fix this, go inside the superset_app container and run the docker-init.sh Bash script:

     docker exec -it superset_app bash
     cd /app/docker
     bash docker-init.sh
    

    This will create the Superset Admin user, with username and password admin. If SUPERSET_LOAD_EXAMPLES is set to no, running the Bash script will not load them. Otherwise, it will take ~30 min to load the examples.

    Another way of fixing this is to directly run the Bash script without going inside the container:

     docker exec superset_app bash /app/docker/docker-init.sh
    
  5. Exit the container and go to localhost:8088 and login with username and password admin.

Connect to database with URI string:

postgresql+psycopg2://superset:superset@db:5432/superset

If connecting to containers from a remote server (e.g. 192.168.1.13), use the IP address of the local server (the server I'm connecting from, e.g. 192.168.1.2):

postgresql+psycopg2://sourcedb2:[email protected]:5440/sourcedb2

sourcedb2 is a Postgres database hosted in a container on 192.168.1.13 and I want to connect from 192.168.1.2. Not sure why plain localhost doesn't work, but my hunch is that there could be two localhosts if I don't differentiate between them.

Local Superset Installation

Local Installation

Start Superset server with

superset run -p 8088 --with-threads --reload --debugger

Database Connections: If you install psycopg2 or its binaries, then PostgreSQL databases can be added. Otherwise, there's only SQLite shown on the UI.

Excel Uploads: Also have to install xlrd to have the option to upload Excel files.

Configuring Superset

To configure Superset, create a superset_config.py file and put it in the superset folder next to the config.py file. For a local installation using a Python virtual environment (named .venv for example), the config.py file is inside the .venv/Lib/site-packages/superset/ folder, where the installed Superset library is located.

Then, according to the docs the superset_config.py file has to be added to the PYTHONPATH.

To add to PYTHONPATH, follow some of the answers here. For Windows 10, Python v3.9.6, and with a local installation of Superset, this worked for me:

Add to PYTHONPATH

For Mac and Linux, adding the following command to ~/.bashrc will do the trick:

export PYTHONPATH="${PYTHONPATH}:/home/username/project/.venv/lib/python3.8/site-packages/superset/superset_config.py"

To see what the PYTHONPATH looks like, echo it on the terminal. For Windows, the pythonpath.py script prints it out.

Size Limit of CSV Uploads: To set an upper bound on the size limit of CSV uploads, (either a row limit or a byte limit), the variables ROW_LIMIT and UPLOAD_MAX_BYTES can be specified as shown in the superset_config.py file. For example:

UPLOAD_MAX_BYTES = 100 * 1024 * 1024
ROW_LIMIT = 1200000

Metabase Setup

Also recommended to run on Linux OS, facing some problems in Windows.

One-liner for Docker (not recommended):

docker run -d -p 3000:3000 --name metabase metabase/metabase

Custom setup (modified) with a docker-compose file, including database container and secret files:

docker-compose up -d

Connect to database from Metabase setup page with hostname:port = postgres-metabase:5432. Connect from external applications using localhost:5434.

NOTE: Docker secrets don't work with docker-compose, they only work with Docker Swarm. Use an environment file .env instead.

If containers keep exiting unexpectedly (usually when database connection credentials are invalid), check their logs with

docker logs [CONTAINER_NAME | CONTAINER_ID]

CSV, Excel, and Columnar File Uploads

Superset

CSV, Excel, and Parquet files can be uploaded but they require a SQL database to hold the data.

File uploads

Requires SQL database

Otherwise, there will be an error saying that there is no database allowing you to upload your CSV:

ERROR: No database is allowed for your csv upload

To fix this, go to Data -> Databases, click on Edit under the Actions column, go to Advanced -> Security and check the Allow data upload option.

Edit database security

Allow data upload (CSV, Excel, Parquet)

This uploads the CSV file as a physical table into the SQL database, but the column data types are automatically inferred by Pandas (e.g. zip codes, which should be treated as CHAR(5) become INT), which is undesirable in most cases. However, data types are preserved when uploading Parquet files.

For Excel files, there is an option to choose which sheet to upload.

Metabase

Metabase doesn't have a native way to import CSV files. Instead, the CSV has to be uploaded to a database and then Metabase can connect to that database. Direct imports from Excel and Parquet files are also not supported.

Metabase is mainly for asking questions about data that is already in databases, and it is expected for the user to bring their own database to connect to.

When adding a new database table, the table doesn't appear immediately unless the database schema is synced:

Metabase DB sync

The sync can be scheduled hourly or daily.

Report Scheduling, Logging, and Dashboard Refresh

Superset

Alerts and scheduled reports are possible, with some setup. They are disabled by default.

Logs are recorded in Settings > Security > Action Logs.

Dashboards can be manually refreshed or refresh intervals can be set.

Refreshing Dashboard

Refresh intervals

I tested this refresh by deleting rows from a table, waiting for the dashboard to refresh, and checking if the dashboard is updated afterwards. Then I reinserted the rows, waited for dashboard refresh, and verified that the dashboard goes back to its original numbers/visualizations.

Before refresh

SQL query to delete rows from the cleaned_sales_data table and hence, alter the dashboard:

-- create duplicate of vehicle sales data
CREATE TABLE cleaned_sales_data_dup AS
SELECT
  *
FROM
  cleaned_sales_data;
  
-- delete rows where sales > 3000 (this will delete 1541 rows)
DELETE FROM
  cleaned_sales_data
WHERE
  sales > 3000;

-- check that the `Sales Dashboard` refreshes with new data

After refresh, deleted rows

Reinsert the rows back into the table:

-- insert back the deleted rows and check that the dashboard goes back to original
INSERT INTO
  cleaned_sales_data
SELECT
  *
FROM
  cleaned_sales_data_dup
WHERE
  sales > 3000;

After refresh, after reinserting deleted rows

The SQL queries above are not allowed in the SQL Lab > SQL Editor page, unless the Allow DML option is ticked for that particular database.

Metabase

Pulses, soon to be deprecated in favor of Dashboard subscriptions can be set up to send Slack alerts and scheduled reports to emails, after setting up a Sender Email. Unlike Superset, Metabase provides a UI to set up this Sender Email for sending scheduled reports.

Logs are recorded in Settings > Admin > Troubleshooting > Logs. A bit verbose though. These logs can also be viewed through docker logs [CONTAINER_NAME].

Like Superset, Metabase dashboards also support refresh intervals, and I tested this in a similar way as well.

Dashboard refresh

Drilldowns, Custom Visualizations, and Sharing

Superset

Superset does not support drilldowns but they can be introduced via custom JavaScript code. There is currently a draft pull request in progress for adding drilldown functionality to Superset.

Custom plugins for visualizations can be built using the React framework.

Dashboards and charts can be shared via email or URLs. Public access to dashboards can be granted by providing datasource access permission to the Public role. For example, for the Sales Dashboard, the Public user must have datasource access on [examples].[clean_sales_data] in order for them to view the dashboard.

Superset can also be hosted on the cloud, which I tried out here, with the help of this article and this repo. The functionality is limited, since the Python installation of Superset doesn't allow uploads via Excel or Parquet, and the attached Postgres database on Heroku only allows 10,000 rows max. The Python Functions subsection in the Advanced Analytics section also doesn't appear.

A better option would be to host it on a cloud vendor web hosting service, or a container registry.

Metabase

Drilldowns are supported in Metabase and interactive dashboards with customized drilldowns can also be created.

Adding custom visualizations are much harder to do in Metabase and it does not have a custom plugin architecture like Superset.

Dashboards can be shared with public links, and can be embedded into websites as well. Sender Email has to be set up first, and if using TLS encryption, choose STARTTLS instead. Public Sharing in Settings > Admin > Public Sharing has to be enabled as well.

Running Metabase in production

superset-and-metabase's People

Contributors

ismaildawoodjee avatar

Stargazers

Sibelius Seraphini avatar Tien Bui avatar

Watchers

 avatar  avatar

Forkers

7u4

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.