
weaverbird's Introduction

Weaverbird

(screenshot of the Weaverbird user interface)

Weaverbird is Toucan Toco's data pipelines toolkit. It contains:

  • a pipeline Data Model, currently supporting more than 40 transformation steps
  • a friendly User Interface for building those pipelines without writing any code, made with TypeScript, VueJS & VueX
  • a set of back-ends to execute those pipelines:
    • the MongoDB Translator, which generates Mongo queries, written in TypeScript
    • the Pandas Executor, which computes results using pandas DataFrames, written in Python
    • the Snowflake SQL translator, written in Python

For in-depth user & technical documentation, have a look at weaverbird.toucantoco.dev or at the documentation's source files in the docs directory.

Last but not least, you can play with Weaverbird on our online playground!

Badges

UI

npm CI UI Coverage Maintainability Rating Lines of Code

Server

pypi CI server

Project setup

yarn install

See the Dockerfile for the supported Node version.

Compiles target library

yarn build-bundle

This will generate an importable JS weaverbird library in the dist directory.

Important note: While we do our best to embrace semantic versioning, we do not guarantee full backward compatibility until version 1.0.0 is released.

Run your tests

The basic command to run all tests is:

yarn test

Lints and fixes files

yarn format:fix
yarn lint --fix

Build the API documentation

yarn build-doc

This will run typedoc on the src/ directory and generate the corresponding documentation in the dist/docs directory.

Build and run the documentation website

The web documentation is powered by Jekyll.

You can find all the sources in the docs folder.

To build and launch the documentation locally, you need Ruby and gem installed. Then:

# install ruby
sudo apt install ruby ruby-dev

# install bundler
gem install bundler

# run jekyll and a local server with dependencies:
cd docs
bundle install
bundle exec jekyll serve

Enrich it!

Put your .md file in the docs folder. You can also create a subfolder for better organization.

At the beginning of your .md file, don't forget to declare this front matter:

---
title: your title doc name
permalink: /docs/your-page-doc-name/
---

Finally, to get your page into the doc navigation, add it to `_data/docs.yml`.

Example:

- title: Technical documentation
  docs:
  - steps
  - stepforms
  - your-page-doc-name

Run the storybook

Storybook uses the bundled lib, so all showcased components must be in the public API.

yarn storybook

This will run storybook, displaying the stories (use cases) of UI components.

Stories are defined in the stories/ directory.

Publication

This library is published on npm under the name weaverbird automatically each time a release is created in GitHub.

Create a release (frontend)

  • Define new version using semantic versioning

  • Create a new local branch release/X.Y.Z from master

    ex: release/0.20.0

  • Update the version property in package.json and in sonar-project.properties

  • Check differences between last release and current and fill CHANGELOG.md with updates

    • Delete the `## changes` title at the start of CHANGELOG.md if present

    • Add the date and version at start of CHANGELOG.md following this convention

      [X.Y.Z] - YYYY-MM-DD
      

      ex: [0.20.0] - 2020-08-03

    • At the end of CHANGELOG.md, add a link comparing this version to the previous one

      [X.Y.Z]: https://github.com/ToucanToco/weaverbird/compare/voldX.oldY.oldZ...vX.Y.Z
      

      ex: [0.20.0]: https://github.com/ToucanToco/weaverbird/compare/v0.19.2...v0.20.0

  • Commit changes with version number

    ex: v0.20.0

  • Push branch

  • Create a pull request into master from your branch

  • When pull request is merged, create a release with the version number in tag version and title (no description needed)

    ex: v0.20.0

  • Hit the "publish release" button (this will automatically create a tag and trigger the package publication)

Create a release (backend)

  • Create a new local branch chore/bump-server-version-x-x-x

  • Edit server/pyproject.toml & increment the version in [tool.poetry] section

  • Push branch

  • Create a pull request into master from your branch

  • Once the PR is approved & merged into master, publish the release to PyPI with make build & make upload

Usage as library

Without any module bundler

<!-- Import styles -->
<link rel="stylesheet" href="weaverbird/dist/weaverbird.css" />

<!-- Import scripts -->
<script src="vue.js"></script>
<script src="weaverbird/dist/weaverbird.umd.min.js"></script>

With an ES module bundler (typically webpack, vite or rollup)

import { Pipeline } from "weaverbird";

By default, the CommonJS module is imported. If you prefer the ES module version, import dist/weaverbird.esm.js.

API

Modules

See the documentation generated in dist/docs directory

Styles

TODO: document here the Sass variables that can be overridden

Playground

The /playground directory hosts a demo application with a small server that showcases how to integrate the exported components and API. To run it, use the provided Dockerfile:

docker build -t weaverbird-playground .
docker run -p 5000:5000 --rm -d weaverbird-playground

which is basically a shortcut for the following steps:

cd ui
# install front-end dependencies
yarn
# build the front-end bundle
yarn build-bundle

cd ../server
# install the backend dependencies
pip install -e ".[playground,all]"
# run the server
QUART_APP=playground QUART_ENV=development quart run
# note: in the dockerfile, a production-ready webserver is used instead of a development one

Once the server is started, you should be able to open http://localhost:5000 in your favorite browser and enjoy!

When developing the UI, use yarn vite, which provides its own dev server.

Mongo back-end

The default back-end for the playground is a small server passing queries to MongoDB. Connect the playground to a running MongoDB instance with the environment variables:

  • MONGODB_CONNECTION_STRING (defaults to localhost:27017)
  • MONGODB_DATABASE_NAME (defaults to 'data')

Run Weaverbird + MongoDB or PostgreSQL with docker-compose

If you want to test the playground with a populated MongoDB or PostgreSQL instance, you can use docker-compose:

# At the directory root
# For MongoDB
docker-compose up -d weaverbird mongodb
# For PostgreSQL
docker-compose up -d weaverbird postgres

Pandas back-end

An alternative back-end for the playground is a small server running in python, executing pipelines with pandas. Add ?backend=pandas to the URL to see it in action.

Athena back-end

In order to run the playground with AWS Athena, make sure the following environment variables are set before starting the docker-compose stack:

  • ATHENA_REGION
  • ATHENA_SECRET_ACCESS_KEY
  • ATHENA_ACCESS_KEY_ID
  • ATHENA_DATABASE
  • ATHENA_OUTPUT

BigQuery back-end

In order to run the playground with Google BigQuery, download the JSON file containing the credentials for your service account, place it at the root of the weaverbird repo and name it bigquery-credentials.json. It will be mounted inside of the playground container.

Use your own data files

CSVs from playground/datastore are available to use in the playground with pandas. You can override this folder when running the container by adding a volume parameter: -v /path/to/your/folder/with/csv:/weaverbird/playground/datastore.

Use another back-end

You can point a front-end to another API by using a query parameter: ?api=http://localhost:5000. This is particularly useful for front-end development, with yarn vite.

To avoid CORS issues when front-end and back-end are on different domains, whitelist your front-end domain using ALLOW_ORIGIN=<front-end domain> or ALLOW_ORIGIN=*.


weaverbird's Issues

Keep SASS style scoped in .vue component

We wanted to add the possibility to override some style variables (e.g. active-color), so we had to remove the scoped attribute on style blocks, which used to prevent classes from being overwritten unintentionally.

So now, styles live in a separate file, which prevents them from being computed before we can override them in another project.

I think a better solution could be found in order to follow Vue's 'Single File Component' convention, but I failed to find it.

'pivot' step to Mongo

This 'pivot' step config in our VQB "language"...

{
   name: 'pivot',
   index: ['column_1', 'column_2'],
   column_to_pivot: 'column_3'
   value_column: 'column_4'
   agg_function: 'sum' # 'mean' by default
}

... should yield the following mongo aggregation pipeline step:

{
  $group:
    _id:
      column_1: "$column_1"
      column_2: "$column_2"
      column_3: "$column_3"
    column_4: $sum: "$column_4" # first we perform the aggregation ('$mean' by default), with the _id including the column to pivot
}
{
  $group:
    _id:
      column_1: "$_id.column_1"
      column_2: "$_id.column_2"
    array:
      $addToSet:
        column_3: "$_id.column_3"
        column_4: "$column_4" # then we group with with index columns as _id and we push documents as an array of sets including a column for the column to pivot and a column for the corresponding value
}
{
  $project:
    tmp_object:
      $arrayToObject:
        $zip:
          inputs: ["$array.column_3", "$array.column_4"] # then we project a tmp key to get an object from the array of couples [column_to_pivot, corresponding_value]
}
{
  $addFields:
    "tmp_object.column_1": "$_id.column_1"
    "tmp_object.column_2": "$_id.column_2" # Then we include back in every document created in the previous step the index columns (still accessible in the _id object as we did not explicitly specified '_id: 0' in the previous step)
}
{
  $replaceRoot:
    newRoot: "$tmp_object" # then we replace the root of the document tree to get our columns ready for our needed table-like format !
}
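The four `$group`/`$group`/`$project`+`$addFields`/`$replaceRoot` stages above boil down to "aggregate, regroup, zip the pivoted values into columns". A minimal pure-Python simulation of that logic (an illustrative sketch, not Weaverbird's implementation; the sample column names are made up):

```python
from collections import defaultdict

def pivot(rows, index, column_to_pivot, value_column, agg_function=sum):
    """Simulate the 4-stage Mongo pivot: aggregate, regroup, zip into columns."""
    # Stage 1: aggregate value_column per (index columns + column_to_pivot)
    aggregated = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in index) + (row[column_to_pivot],)
        aggregated[key].append(row[value_column])
    # Stages 2-4: regroup by index only, turning each pivoted value into a column
    result = defaultdict(dict)
    for key, values in aggregated.items():
        idx, pivoted = key[:-1], key[-1]
        result[idx][pivoted] = agg_function(values)
    return [dict(zip(index, idx), **cols) for idx, cols in result.items()]
```

For instance, pivoting a `kpi`/`value` pair over `["country", "year"]` with the default `sum` turns two "sales" rows and one "costs" row into a single row with `sales` and `costs` columns.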

Delete a step

As a query designer, in the pipeline viewer, I can click on a step's "trash" button, which will display a popover with a confirmation message: "Are you sure you want to delete this step?". If I say "yes", the step is removed from the pipeline and the dataset is recomputed.

Dropdown operation buttons

Each transformation step has an associated category. For now, we envision the following categories:

  • filter
  • compute
  • text
  • date
  • aggregate
  • reshape

For each of these categories, we should have a dropdown button that will popover the list of related transformation steps.

'rename' step to Mongo

These 'rename' steps config in our VQB "language"...

{
   name: 'rename',
   oldname: 'foo',
   newname: 'bar'
}
{
   name: 'rename',
   oldname: 'old',
   newname: 'new'
}

... should yield the following mongo aggregation pipeline step:

{
  $project:
    <all other columns>: 1
    bar: "$foo"
    new: "$old"
}
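A sketch of how that translation could look in Python (illustrative only; `translate_rename` is a hypothetical helper, and the explicit `all_columns` list stands in for the '<all other columns>' shorthand above):

```python
def translate_rename(steps, all_columns):
    """Build the single Mongo $project stage for a list of 'rename' steps."""
    renamed = {s["oldname"]: s["newname"] for s in steps}
    # Keep every column that is not being renamed...
    project = {col: 1 for col in all_columns if col not in renamed}
    # ...and map each new name onto the old column's value
    project.update({new: f"${old}" for old, new in renamed.items()})
    return {"$project": project}
```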

Toy backend for easier testing

As a developer, it would be nice to test the editor in real conditions with real data (e.g. CSV files or a toy mongo database).

One could imagine something like:

git clone https://github.com/ToucanToco/vue-query-builder
python examples/toy-backend.py my-config-file.yml
yarn start-demo

Define a minimal dataset API

We'll need at least the following information:

{
   length: 12345,
   offset: 100,
   limit: 50,
   columns: [
      {
        name: 'my-col',
        type: 'integer',
        // possibly later some extra information such as `unique_values`, `min`, `max`, ...
      }
   ]
}

Column types will be useful for a number of use cases (e.g. detect which operations apply and which don't)
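For clarity, the proposed shape can be written down as typed Python (a sketch of the proposal, not an existing Weaverbird type):

```python
from typing import List, TypedDict

class ColumnMeta(TypedDict):
    name: str
    type: str  # e.g. 'integer'; used to decide which operations apply

class DatasetMeta(TypedDict):
    length: int   # total number of rows in the full dataset
    offset: int   # index of the first row of this page
    limit: int    # page size
    columns: List[ColumnMeta]

meta: DatasetMeta = {
    "length": 12345,
    "offset": 100,
    "limit": 50,
    "columns": [{"name": "my-col", "type": "integer"}],
}
```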

'Aggregate' step to Mongo :: Support count distinct

TO DO

  • Support basic aggregation functions with straightforward implementation
  • Support count_distinct

Example 1: Several aggregations on several dimensions

This 'group' step config in our VQB "language"...

 {
      name: 'group',
      on: ['dimension1', "dimension2"]
      aggregations: [
        {
           name: "sum_value1" # default name: '<function>_<column>'
           column: "value1"
           agg_function: 'sum'
        }
        {
           name: "avg_value2"
           column: "value2"
           agg_function: 'avg'
        }
        {
           name: "MIN"
           column: "value3"
           agg_function: 'min'
        }
        {
           name: "MAX"
           column: "value4"
           agg_function: 'max'
        }
        {
           name: "count_value5"
           column: "value5"
           agg_function: 'count'
        }
        {
           name: "value6"
           column: "value6"
           agg_function: 'first'
        }
        {
           name: "value7"
           column: "value7"
           agg_function: 'last'
        }
      ]
  }

... should yield the following mongo aggregation pipeline:

query: [
  $group:
    _id: 
      dimension1: "$dimension1"
      dimension2: "$dimension2"
    sum_value1:
      $sum: '$value1'
    avg_value2:
      $avg: '$value2'
    MIN:
      $min: "$value3"
    MAX:
      $max: "$value4"
    count_value5:
      $sum: 1 # Careful here: there is no '$count' operator, we '$sum' a series of 1 to count
    value6:
      $first: "$value6"
    value7:
      $last: "$value7"
,
  $project:
    dimension1: "$_id.dimension1"
    dimension2: "$_id.dimension2"
    sum_value1: 1
    avg_value2: 1
    MIN: 1
    MAX: 1
    count_value5: 1
    value6: 1
    value7: 1
]
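A minimal Python sketch of such a translation (illustrative, not the actual translator; note that 'count' is mapped to `$sum: 1` since `$group` has no `$count` accumulator):

```python
AGG_OPS = {"sum": "$sum", "avg": "$avg", "min": "$min", "max": "$max",
           "first": "$first", "last": "$last"}

def translate_group(step):
    """Translate a 'group' step into a [$group, $project] Mongo pipeline."""
    group = {"_id": {dim: f"${dim}" for dim in step["on"]}}
    project = {dim: f"$_id.{dim}" for dim in step["on"]}
    for agg in step["aggregations"]:
        if agg["agg_function"] == "count":
            group[agg["name"]] = {"$sum": 1}  # no '$count' accumulator in $group
        else:
            group[agg["name"]] = {AGG_OPS[agg["agg_function"]]: f"${agg['column']}"}
        project[agg["name"]] = 1
    return [{"$group": group}, {"$project": project}]
```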

Example 2: We keep the same generic approach for aggregations on one dimension

This 'group' step config in our VQB "language"...

 {
      name: 'group',
      on: ['dimension1']
      aggregations: [
        {
           name: "sum_value1"
           column: "value1"
           agg_function: 'sum'
        }
      ]
  }

... should yield the following mongo aggregation pipeline:

query: [
  $group:
    _id: 
      dimension1: "$dimension1" # not "_id: '$dimension'" as it may be possible but we want to keep the approach as generic as possible so we use a "levelled" definition of "_id" as it generalises to grouping on several dimensions
    sum_value1:
      $sum: '$value1'
,
  $project:
    dimension1: "$_id.dimension1"
    sum_value1: 1
]

Example 3: The unobvious 'count_distinct' in Mongo....

This 'group' step config in our VQB "language"...

 {
      name: 'group',
      on: ['dimension1', 'dimension2']
      aggregations: [
        {
           name: "count_distinct_client"
           column: "client"
           agg_function: 'count_distinct'
        }
      ]
  }

... should yield the following mongo aggregation pipeline:

query: [
  $group:
    _id: 
      dimension1: "$dimension1"
      dimension2: "$dimension2"
      client: "$client" # This first '$group' builds unique groups and therefore makes every client unique in every combination of dimension1 x dimension2 x client
,
  $group:
    _id: 
      dimension1: "$_id.dimension1"
      dimension2: "$_id.dimension2"
    count_distinct_client:
      $sum: 1 # Count the unique groups and therefore the clients for every combination of dimension1 x dimension2
,
  $project:
    dimension1: "$_id.dimension1"
    dimension2: "$_id.dimension2"
    count_distinct_client: 1
]
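The double-`$group` trick is easier to see in pure Python, where the first pass is simply a set of unique combinations (an illustrative simulation, not Weaverbird code):

```python
def count_distinct(rows, on, column):
    """Pure-Python equivalent of the double-$group count-distinct trick."""
    # First $group: one entry per unique (dimensions + column) combination
    unique = {tuple(r[d] for d in on) + (r[column],) for r in rows}
    # Second $group: count those unique entries per dimension combination
    counts = {}
    for key in unique:
        counts[key[:-1]] = counts.get(key[:-1], 0) + 1
    return [dict(zip(on, k), **{f"count_distinct_{column}": v})
            for k, v in counts.items()]
```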

Example 4: ... And Mongo is even trickier when 'count_distinct' needs to be combined with other aggregations!

Important notice: We show here how we can perform such aggregations in the Mongo aggregation pipeline, but the query gets quite complicated, and we do not even cover the challenge of performing several count_distinct at the same time... In such a case, the approach detailed below would need to be nested as many times as there are count_distinct aggregations! It quickly becomes unacceptable both in terms of readability and performance.

=> So we believe that we should switch to the postprocess option as soon as the aggregation contains a 'count_distinct' combined with at least one other aggregation.

This 'group' step config in our VQB "language"...

 {
      name: 'group',
      on: ['dimension1', 'dimension2']
      aggregations: [
        {
           name: "sum_value"
           column: "value"
           agg_function: 'sum'
        }
        {
           name: "count_distinct_client"
           column: "client"
           agg_function: 'count_distinct'
        }
      ]
  }

... should yield the following mongo aggregation pipeline:

query: [
  $group:
    _id: 
      dimension1: "$dimension1"
      dimension2: "$dimension2"
      client: "$client"
    array_detail:
      $push: '$$ROOT' # To keep track of the details and be able to perform every aggregation, we push every document ('$$ROOT' is a mongo variable similar to a cursor representing the current document being read) in arrays, with one array per group of [dimensions + the column for which we need to get the number of unique values]
,
  $group:
    _id: 
      dimension1: "$_id.dimension1"
      dimension2: "$_id.dimension2"
    client:
      $sum: 1 # here we count the unique clients, as they are unique as per the group _id definition of the previous step
    array_group:
      $push: '$$ROOT' # Same thing here as in the step above: we need to keep track of every document to then be able to perform the other aggregations (see steps below)
,
  $unwind: '$array_group' # The unwind step "destructure" the array to get a root document per document found in the array => in tabular terms, we get 1 row per document found in the array
,
  $project:
    dimension1: "$_id.dimension1"
    dimension2: "$_id.dimension2"
    client: 1
    array: "$array_group.array_detail" # We get the array with the most granular level of documents here to unwind it in the following step to get a row per document
,
  $unwind: "$array"
,
  $group:
    _id:
      dimension1: "$dimension1"
      dimension2: "$dimension2"
    sum_value:
      $sum: "$array.value" # Now we can perform the other aggregations based on the most granular level
    count_distinct_client:
      $first: "$client" # And we need to keep the first value of count distinct found per group as this value was duplicated after the '$unwind' operations
,
  $project:
    dimension1: "$_id.dimension1"
    dimension2: "$_id.dimension2"
    sum_value: 1
    count_distinct_client: 1
]

Example 5: Several aggregations on the same column

This 'group' step config in our VQB "language"...

 {
      name: 'group',
      on: ['dimension1', "dimension2"]
      aggregations: [
        {
           name: "sum_value1" # default name: '<function>_<column>'
           column: "value1"
           agg_function: 'sum'
        }
        {
           name: "my_name"
           column: "value1"
           agg_function: 'count'
        }
      ]
  }

... should yield the following mongo aggregation pipeline:

query: [
  $group:
    _id: 
      dimension1: "$dimension1"
      dimension2: "$dimension2"
    sum_value1:
      $sum: '$value1'
    my_name:
      $sum: 1 # careful: there is no '$count' operator in $group, we '$sum' a series of 1 to count
,
  $project:
    dimension1: "$_id.dimension1"
    dimension2: "$_id.dimension2"
    sum_value1: 1
    my_name: 1
]

Example of query 1

This pipeline of steps in our VQB "language"...

[
  {
      name: 'domain',
      domain: 'focus_media_campaigns_levers_objectives'
  }
  {
      name: 'filter',
      column: 'period',
      value: "Monthly View"
      operator: 'eq'
  }
  {
      name: 'filter',
      column: 'period',
      value: ["2017", "2018"]
      operator: 'in'
  }
  {
      name: 'group',
      dimensions: ['day', "Campaign name"]
      measures: [
        {
           column: "Clicks"
           agg_function: 'avg'
        }
        {
           column: "Budget"
           agg_function: 'sum'
        }
        {
           column: "Impressions"
           agg_function: 'sum'
        }
      ]
  }
  {
     name: 'new_column_formula',
     new_column_name: 'CPM'
     formula: "[Budget] / [Impressions] * 1000"
  }
  {
    name: 'unpivot'
    id: ["day", "campaign"]
    value: ["Budget", "Impressions", "CPM"]
    dropna: true
  }
]

... should yield the following query + postprocess pipelines:

query: [
  $match:
    domain: "focus_media_campaigns_levers_objectives"
    period: 'Monthly View'
    year: $in: ["2017", "2018"]
,
  $group:
    _id: 
      day: "$day"
      campaign: "$Campaign name"
    Clicks:
      $avg: '$Clicks'
    Budget:
      $sum: '$Budget'
    Impressions:
      $sum: "$Impressions"
,
  $project:
    day: "$_id.day"
    campaign: "$_id.campaign"
    Clicks: 1
    Budget: 1
    Impressions: 1
    CPM:
      $multiply: [ $divide: [ "$Budget", "$Impressions"] , 1000 ]
]
postprocess: [
  melt:
    id: ["day", "campaign"]
    value: ["Budget", "Impressions", "CPM"]
    dropna: true
]

Notes:

  • Our "group" step should generate both a {$group} step and a {$project} step in Mongo, to project both the measures and the "_id". The latter is poorly named and may include several levels, as in our example. We want to get readable column names rather than the default document-oriented "_id" of mongo, so that we can keep a consistent tabular-ready format. If following steps generate a $project, we are then able to merge the projects, as in our example. To illustrate our point, the mongo "$project" of our example is the result of merging the following "$project" stages, deriving from 2 different steps in our pipeline:
query: [
  ...
,
  # Derived from our group step
  $project:
    day: "$_id.day"
    campaign: "$_id.campaign"
    Clicks: 1
    Budget: 1
    Impressions: 1
,
  # Derived from our formula step
  $project:
    CPM:
      $multiply: [ $divide: [ "$Budget", "$Impressions"] , 1000 ]
]
...
  • In our formula step, we propose the convention of escaping column names between brackets ("[column_1]")

Visualize the generated query

As a query designer, I am able to see the query that corresponds to the current transformation pipeline.

At this point, the goal is mainly to ease the debugging process, there's no need to design it thoroughly.

make it responsive !

Define a few responsive behaviours (to be refined):

  • the "search field" disappears, only the magnifying glass icon remains,
  • the left pane shrinks to keep the right one as big as possible
  • …

popover component

We need a generic popover component that will be used both on category buttons and column headers.

display a list of unique values in the column

As a query designer, when I click on a column header, I see the list of unique values to be able to filter on it (or just to get some insights).

If the dataset is paginated, when I first click, I see only the unique values that are currently displayed. I can click on a label "see all unique values" that will display all the unique values in this column, whether visible or not.

keyboard shortcuts

As an expert designer, I can use keyboard shortcuts to navigate in my dataset viewer, search for operations, edit the pipeline, update the preview, etc.

'fillna' step

This 'fillna' step config in our VQB "language"...

{
   name: 'fillna',
   column: "foo"
   value: "bar"
}

... should yield the following mongo aggregation pipeline step:

{
  $project:
    <all other columns>: 1
    foo: $cond: [{$eq: ["$foo", null]}, "bar", "$foo"]
}
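A Python sketch of a generator for this stage (a hypothetical helper; the explicit `all_columns` list makes the '<all other columns>' shorthand concrete):

```python
def translate_fillna(step, all_columns):
    """Build the $project stage that replaces nulls in one column."""
    col, value = step["column"], step["value"]
    # Pass all other columns through unchanged
    project = {c: 1 for c in all_columns if c != col}
    # Replace the column's value only when it is null
    project[col] = {"$cond": [{"$eq": [f"${col}", None]}, value, f"${col}"]}
    return {"$project": project}
```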

get rid of webpack

We use rollup to build and ship an ES6 module. We should therefore get rid of webpack since we don't want to keep two distinct package bundlers.

'argmax' / 'argmin' step

Example 1: with grouping

This 'argmax' step config in our VQB "language"...

{
   name: 'argmax',
   groups: ['foo'],
   column: 'bar' # column in which to search for the max value
}

... should yield the following mongo aggregation pipeline step:

{
  $group:
    _id:
      foo: "$foo"
    array:
      $push: "$$ROOT"
    value_to_compare: $max: "$bar" # $min for argmin
}
{
  $unwind: "$array"
}
{
  $project:
    value_to_compare: 1
    <all_other_columns>: "$array.<all_other_columns>" # We want to keep all columns
}
{
  $redact: # shortcut operator to avoid firstly creating a boolean column via $project and then filtering on 'true' rows via $match
    $cond: [
      {
        $eq: ["$bar", "$value_to_compare"]
      }
      "$$KEEP"
      "$$PRUNE" # Magic happens here with the mongo variables "$$KEEP" (for... keeping rows matching the condition) and "$$PRUNE" (for excluding rows not matching the condition)
    ]
}

Example 2: without grouping

This 'argmin' step config in our VQB "language"...

{
   name: 'argmin',
   column: 'bar'
}

... should yield the following mongo aggregation pipeline step:

{
  $group:
    _id: null
    array:
      $push: "$$ROOT"
    value_to_compare: $min: "$bar"
}
{
  $unwind: "$array"
}
{
  $project:
    value_to_compare: 1
    <all_other_columns>: "$array.<all_other_columns>" # We want to keep all columns
}
{
  $redact: # shortcut operator to avoid firstly creating a boolean column via $project and then filtering on 'true' rows via $match
    $cond: [
      {
        $eq: ["$bar", "$value_to_compare"]
      }
      "$$KEEP"
      "$$PRUNE" # Magic happens here with the mongo variables "$$KEEP" (for... keeping rows matching the condition) and "$$PRUNE" (for excluding rows not matching the condition)
    ]
}
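What the `$group`/`$unwind`/`$redact` combination computes can be simulated in a few lines of Python (illustrative only; `arg_extremum` is a made-up name covering both argmax and argmin):

```python
def arg_extremum(rows, column, groups=(), keep_max=True):
    """Keep the rows where `column` equals its group-wise max (or min)."""
    pick = max if keep_max else min
    # First pass: compute the extremum per group (the $group stage)
    extremes = {}
    for row in rows:
        key = tuple(row[g] for g in groups)
        extremes[key] = pick(extremes.get(key, row[column]), row[column])
    # Second pass: keep matching rows (the $redact stage)
    return [r for r in rows
            if r[column] == extremes[tuple(r[g] for g in groups)]]
```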

'unpivot' step to Mongo

Example 1: with the 'dropna' parameter set to true

This 'unpivot' step in our VQB "language"...

{
    name: 'unpivot'
    keep: ["MARCHE", "CANAL"]
    unpivot: ["NB_CLIENTS_TOTAL", "NB_ROWS"]
    unpivot_column_name: "KPI"
    value_column_name: "VALUE"
    dropna: true # default
}

... should yield the following mongo aggregation steps:

{
  $project:
    MARCHE: "$MARCHE"
    CANAL: "$CANAL"
    dimensions:
      $objectToArray: {
        NB_CLIENTS_TOTAL: "$NB_CLIENTS_TOTAL"
        NB_ROWS: "$NB_ROWS"
      }
}
{
  $unwind: "$dimensions"
}
{
  $project:
    MARCHE: 1
    CANAL: 1
    KPI: "$dimensions.k"
    VALUE: "$dimensions.v"
}
{
  $match:
    VALUE: $ne: null # dropna here
}

Example 2: with the 'dropna' parameter set to false

This 'unpivot' step in our VQB "language"...

{
    name: 'unpivot'
    keep: ["MARCHE", "CANAL"]
    unpivot: ["NB_CLIENTS_TOTAL", "NB_ROWS"]
    unpivot_column_name: "KPI"
    value_column_name: "VALUE"
    dropna: false
}

... should yield the following mongo aggregation steps:

{
  $project:
    MARCHE: "$MARCHE"
    CANAL: "$CANAL"
    dimensions:
      $objectToArray: {
        NB_CLIENTS_TOTAL: "$NB_CLIENTS_TOTAL"
        NB_ROWS: "$NB_ROWS"
      }
}
{
  $unwind: "$dimensions"
}
{
  $project:
    MARCHE: 1
    CANAL: 1
    KPI: "$dimensions.k"
    VALUE: "$dimensions.v"
}
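In pure Python, the whole `$objectToArray`/`$unwind`/`$match` dance is a simple melt (an illustrative simulation, not the translator's code):

```python
def unpivot(rows, keep, unpivot_cols, kpi_col="KPI", value_col="VALUE", dropna=True):
    """Melt `unpivot_cols` into (KPI, VALUE) pairs, keeping the `keep` columns."""
    out = []
    for row in rows:
        for col in unpivot_cols:
            value = row.get(col)
            if dropna and value is None:  # the final $match stage
                continue
            out.append({**{k: row[k] for k in keep}, kpi_col: col, value_col: value})
    return out
```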

'formula' step to Mongo

This 'formula' step config in our VQB "language"...

{
   name: 'formula'
   new_column: 'new_col' # new_column can be an existing column to be updated with the formula result
   formula: '(column_1 + column_2) / column_3 - column_4 * 100'
}

... should yield the following mongo aggregation pipeline step:

{
  $project:
    <all other columns>: 1
    new_col:
      $subtract: [
        {
          $divide: [
            {
              $add: [
                "$column_1"
                "$column_2"
              ]
            }
            "$column_3"
          ]
        }
        {
          $multiply: [
            "$column_4"
            100
           ]
        }
      ]
}
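Generating such nested operator trees mostly amounts to walking an expression AST. A hypothetical sketch using Python's own parser for operator precedence (the real formula step has its own parser and supports more syntax, e.g. bracket-escaped column names):

```python
import ast

# Map Python binary operators to Mongo arithmetic operators
MONGO_OPS = {ast.Add: "$add", ast.Sub: "$subtract",
             ast.Mult: "$multiply", ast.Div: "$divide"}

def formula_to_mongo(formula):
    """Translate an arithmetic formula string into a nested Mongo expression."""
    def convert(node):
        if isinstance(node, ast.BinOp):
            return {MONGO_OPS[type(node.op)]: [convert(node.left), convert(node.right)]}
        if isinstance(node, ast.Name):      # bare identifier -> column reference
            return f"${node.id}"
        if isinstance(node, ast.Constant):  # numeric literal
            return node.value
        raise ValueError(f"unsupported syntax: {ast.dump(node)}")
    return convert(ast.parse(formula, mode="eval").body)
```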

'top' step to Mongo

Example 1: With grouping

This 'top' step config in our VQB "language"...

{
  name: 'top',
  groups: ['foo']
  value: 'bar',
  sort: 'desc' # or 'asc'
  limit: 10
}

... should yield the following mongo aggregation pipeline step:

{
  $sort:
    bar: -1 # or 1 when 'asc'
}
{
  $group:
    _id:
      foo: "$foo"
    array:
      $push: "$$ROOT"
}
{
  $project:
    top_elems: $slice: ["$array", 10]
}
{
  $unwind: "$top_elems"
}
{
  $project:
    <all_columns>: "$top_elems.<all_columns>" # We want to keep all columns
}

Example 2: Without grouping

This 'top' step config in our VQB "language"...

{
  name: 'top',
  value: 'bar',
  sort: 'desc' # or 'asc'
  limit: 10
}

... should yield the following mongo aggregation pipeline step:

{
  $sort:
    bar: -1 # or 1 when 'asc'
}
{
  $group:
    _id: null
    array:
      $push: "$$ROOT"
}
{
  $project:
    top_elems: $slice: ["$array", 10]
}
{
  $unwind: "$top_elems"
}
{
  $project:
    <all_columns>: "$top_elems.<all_columns>" # We want to keep all columns
}
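A pure-Python simulation of the sort/group/slice logic of both examples (illustrative only):

```python
def top(rows, value, limit, sort="desc", groups=()):
    """Keep the top `limit` rows per group, ordered by `value`."""
    # The $sort stage
    ordered = sorted(rows, key=lambda r: r[value], reverse=(sort == "desc"))
    out, counts = [], {}
    for row in ordered:
        key = tuple(row[g] for g in groups)
        if counts.get(key, 0) < limit:  # the $slice of each group's array
            counts[key] = counts.get(key, 0) + 1
            out.append(row)
    return out
```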

time travel

As a query designer, in the pipeline viewer, I can click on the small circle beside each step. This will disable the steps below the one I've just clicked on. Visually, the disabled steps will be slightly faded and the dataset viewer will update to the newly generated query.

Search operation

As a query designer, I can enter some text in the search field to find operations matching this text.

I can see a list of suggested operations as I type. Ideally, the search is fuzzy and if there's no match in a given category, the corresponding buttons fade out.

[table] select a cell

In the dataset viewer, it should be possible to select (and highlight) a specific cell. It should also be possible to select (and highlight) multiple cells using ctrl+click.

Configure step forms by clicking on cells / columns

Some of the step editing forms can be updated by simply clicking on columns or cells.

E.g. if a step applies to a specific column, I can edit the step. If I want to change the processed column, I simply have to click on the one I want. Ideally, if this kind of step doesn't apply to a given column, it's disabled and I can't click on it.

Limit number of rows to N

If the component is initialized with a large dataset, we won't be able to display all
the lines. For the first version, we'll fix a viewport limit of N (50?) rows.

A future version will do better.

Replace value with another

As a query designer, I can click on a column and choose the "replace value" transformation step. The step creation form is displayed and I'm asked to select values (either by manually entering values or by clicking on a cell) and choose the corresponding replacement value.

This will generate a ReplaceStep in the pipeline.

Filter by value

As a query designer, I can click on a column, click on the "filter" step and pick the value to filter on either by clicking on a cell or manually entering the value.

This will generate a FilterStep in the pipeline

'percentage' step to Mongo

Example 1: with groups and result in new column

This 'percentage' step config in our VQB "language"...

{
   name: 'percentage'
   new_column: 'new_col'
   column: 'bar'
   group: ['foo']
}

... should yield the following mongo aggregation pipeline step:

{
  $group:
    _id:
      foo: "$foo"
    array:
      $push: "$$ROOT"
    total_denum: $sum: "$bar"
}
{
  $unwind: "$array"
}
{
  $project:
    new_col: $cond: [$eq: ["$total_denum", 0], null, $divide: ["$array.bar", "$total_denum"]] # we need to explicitly handle the case where '$total_denum' is 0, otherwise the query may just fail
    <all_other_columns>: "$array.<all_other_columns>"
}

Example 2: without groups and result in an existing column

This 'percentage' step config in our VQB "language"...

{
   name: 'percentage'
   column: 'bar'
}

... should yield the following mongo aggregation pipeline step:

{
  $group:
    _id: null
    array:
      $push: "$$ROOT"
    total_denum: $sum: "$bar"
}
{
  $unwind: "$array"
}
{
  $project:
    bar: $cond: [$eq: ["$total_denum", 0], null, $divide: ["$array.bar", "$total_denum"]] # we need to explicitly handle the case where '$total_denum' is 0, otherwise the query may just fail
    <all_other_columns>: "$array.<all_other_columns>"
}
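The translation above can be sketched as a small function turning a 'percentage' step config into a Mongo pipeline. This is a hypothetical Python sketch, not the actual weaverbird translator; for brevity it omits projecting the other columns back from `$array`:

```python
def translate_percentage(step):
    """Hypothetical sketch of a 'percentage' step -> Mongo pipeline translator."""
    target = step.get("new_column", step["column"])
    group_keys = step.get("group", [])
    # without groups, Mongo expects _id: null to aggregate the whole collection
    group_id = {col: f"${col}" for col in group_keys} if group_keys else None
    return [
        {"$group": {
            "_id": group_id,
            "array": {"$push": "$$ROOT"},
            "total_denum": {"$sum": f"${step['column']}"},
        }},
        {"$unwind": "$array"},
        {"$project": {
            # guard against division by zero, as in the examples above
            target: {"$cond": [
                {"$eq": ["$total_denum", 0]},
                None,
                {"$divide": [f"$array.{step['column']}", "$total_denum"]},
            ]},
            # the real translator would also project every other column
            # back from "$array.<column>"
        }},
    ]

pipeline = translate_percentage(
    {"name": "percentage", "new_column": "new_col", "column": "bar", "group": ["foo"]}
)
```

With no `group` key, the `$group` stage's `_id` falls back to `null`, matching Example 2.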

'filter' step to Mongo: complex filtering (AND/OR)

TO DO

  • Basic filtering (single conditions)
  • Complex filtering (combination of conditions)

This 'filter' step config in our VQB "language"...

{
  name: 'filter',
  and: [
    {
      column: 'column_1'
      value: ['foo', 'bar']
      operator: 'in'
    }
    {
      column: 'column_2'
      value: ['toucan', 'toco']
      operator: 'nin'
    }
    {
      or: [
        {
          column: 'column_3'
          value: 'toto'
          operator: 'eq'
        }
        {
          and: [
           {
              column: 'column_4'
              value: 42
              operator: 'lte'
           }
           {
              column: 'column_4'
              value: 0
              operator: 'gt'
           }
          ]
        }
      ]
    }
  ]
}

... should yield the following mongo aggregation pipeline step:

{
  $match:
    column_1: $in: ["foo", "bar"]
    column_2: $nin: ["toucan", "toco"] # an 'and' between conditions on different columns does not need an explicit '$and' operator
    $or: [
      {
        column_3: $eq: "toto" # '$eq' is optional as it is implicit in Mongo
      }
      {
        $and: [ # an 'and' between conditions on the same column does need an explicit '$and' operator
          {
             column_4: $lte: 42
          }
          {
             column_4: $gt: 0
          }
        ]
      }
    ]
}
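Translating the nested condition tree is naturally recursive. A hypothetical Python sketch (not the actual weaverbird translator; it always emits explicit `$and`/`$or` instead of applying the merging rules commented above):

```python
def translate_condition(cond):
    """Recursively translate a (possibly nested) filter condition tree."""
    if "and" in cond:
        return {"$and": [translate_condition(c) for c in cond["and"]]}
    if "or" in cond:
        return {"$or": [translate_condition(c) for c in cond["or"]]}
    # leaf condition: {column, value, operator} -> {column: {$operator: value}}
    return {cond["column"]: {f"${cond['operator']}": cond["value"]}}

def translate_filter(step):
    return {"$match": translate_condition(step)}

match = translate_filter({
    "name": "filter",
    "and": [
        {"column": "column_1", "value": ["foo", "bar"], "operator": "in"},
        {"or": [
            {"column": "column_3", "value": "toto", "operator": "eq"},
            {"and": [
                {"column": "column_4", "value": 42, "operator": "lte"},
                {"column": "column_4", "value": 0, "operator": "gt"},
            ]},
        ]},
    ],
})
```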

[table] select a column

It must be possible to select (and highlight) a column in the table. It should also be possible to select (and highlight) more than one column at once with ctrl+click.

'replace' step to Mongo

Example 1: replace more than one value in the same column, in place

These 'replace' steps config in our VQB "language"...

{
   name: 'replace',
   search_column: "column_1"
   new_column: "column_1" # inplace by default
   oldvalue: 'foo'
   newvalue: 'bar'
}
{
   name: 'replace',
   search_column: "column_1"
   new_column: "column_1" # inplace by default
   oldvalue: 'old',
   newvalue: 'new'
}

... should yield the following mongo aggregation pipeline step:

{
  $project:
    <all other columns>: 1
    column_1: $cond: [{$eq: ["$column_1", "foo"]}, "bar", "$column_1"] # No, there is no more elegant way of doing it in the mongo aggregation pipeline :) But it does have the advantage of working for any data type
}
{
  # We need a distinct stage for the second replace (assuming it is more readable than nesting $cond expressions in one stage). We cannot merge the $project stages: since the replacement is performed in place, both would write the same 'column_1' key, and Mongo would retain only one replacement.
  $project:
    <all other columns>: 1
    column_1: $cond: [{$eq: ["$column_1", "old"]}, "new", "$column_1"]
}

Example 2: replace more than one value in the same column, in a new column

These 'replace' steps config in our VQB "language"...

{
   name: 'replace',
   search_column: "column_1"
   new_column: "new_column"
   oldvalue: 'foo'
   newvalue: 'bar'
}
{
   name: 'replace',
   search_column: "new_column"
   oldvalue: 'old',
   newvalue: 'new'
}

... should yield the following mongo aggregation pipeline step:

{
  $project:
    <all other columns>: 1
    column_1: 1 # We do want to keep the source column
    new_column: $cond: [{$eq: ["$column_1", "foo"]}, "bar", "$column_1"]
}
{
  # We need a distinct stage for the second replace: it must read the 'new_column' produced by the first stage, and merging the $project stages would make both replacements write the same 'new_column' key, so Mongo would retain only one of them.
  $project:
    <all other columns>: 1
    column_1: 1 # We do want to keep the source column
    new_column: $cond: [{$eq: ["$new_column", "old"]}, "new", "$new_column"] # Here we base our replacement on the 'new_column' from step 1
}

Example 3: replace one value in different columns

These 'replace' steps config in our VQB "language"...

{
   name: 'replace',
   search_column: "column_1"
   new_column: "column_1" # inplace by default
   oldvalue: 'foo'
   newvalue: 'bar'
}
{
   name: 'replace',
   search_column: "column_2"
   new_column: "new_column"
   oldvalue: 'old',
   newvalue: 'new'
}

... should yield the following mongo aggregation pipeline step:

{
  $project:
    <all other columns>: 1
    column_2: 1
    column_1: $cond: [{$eq: ["$column_1", "foo"]}, "bar", "$column_1"]
    new_column: $cond: [{$eq: ["$column_2", "old"]}, "new", "$column_2"]
}
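A single 'replace' step can be sketched as a `$project` builder. This is a hypothetical Python sketch, not the actual translator; the `<all_other_columns>` placeholder stands for the real column list:

```python
def translate_replace(step):
    """Hypothetical sketch of one 'replace' step -> one $project stage."""
    source = step["search_column"]
    target = step.get("new_column", source)  # in place by default
    project = {
        "<all_other_columns>": 1,  # the real translator lists the actual columns
        target: {"$cond": [
            {"$eq": [f"${source}", step["oldvalue"]]},
            step["newvalue"],
            f"${source}",  # keep the original value when there is no match
        ]},
    }
    if target != source:
        project[source] = 1  # we do want to keep the source column
    return {"$project": project}
```

Successive replace steps then simply append one such stage each, as in the examples above.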

Implement the 2-panes view

The left pane will host a graphic rendering of the pipeline. The right pane (720px min?) will host the dataset viewer.

Multiple datasets

As a query designer, I'm able to generate multiple queries on multiple datasets and combine the results.

'sort' step

Ex 1: sort on one column

This 'sort' step config in our VQB "language"...

{
   name: 'sort',
   columns: ["foo"]
   order: ["asc"] # 'asc' by default
}

... should yield the following mongo aggregation pipeline step:

{
  $sort:
    foo: 1
}

Ex 2: sort on multiple columns

This 'sort' step config in our VQB "language"...

{
   name: 'sort',
   columns: ["foo", "bar"]
   order: ["asc", "desc"]
}

... should yield the following mongo aggregation pipeline step:

{
  $sort:
    foo: 1
    bar: -1
}
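The 'sort' translation is a direct mapping. A hypothetical Python sketch (dict insertion order preserves the sort priority, matching Mongo's ordered `$sort` document):

```python
def translate_sort(step):
    """Hypothetical sketch of a 'sort' step -> $sort stage."""
    orders = step.get("order") or ["asc"] * len(step["columns"])  # 'asc' by default
    return {"$sort": {
        col: 1 if order == "asc" else -1
        for col, order in zip(step["columns"], orders)
    }}
```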

Typecast popover

As a query designer, I can click on the "type" label in a column header and change it.

The corresponding cast step is generated in the pipeline.

New column based on conditions on existing ones

An important pain point for query editors in Toucan Toco today is expressing basic "if...then...else" conditions to populate a new column based on conditions on existing ones.

This need was not obvious in our step-usage analysis because it corresponds to an operator ($cond) rather than a stage in the MongoDB aggregation pipeline, and it is not implemented in Toucan Toco's postprocess Python utils.

Nevertheless, it is widely used and we should give it some priority.

A possible approach would be to use a dedicated formula in our formula editor, accessed when creating a new column. A basic Excel-like formula should do the trick:
IF( [COLUMN_1] < [COLUMN_2] AND [COLUMN_3] = "TCTC", "OK", [COLUMN_4])

Another, more graphical approach (and more self-explanatory for the user) would be to have a dropdown appear when clicking on "New column", offering a choice between "Formula" (for basic arithmetic operations between columns) and "Based on condition" (opening a modal to graphically express the conditions and outputs).
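For reference, the Excel-like IF formula above could translate to a Mongo $cond expression. A hypothetical sketch; the `$addFields` stage and the `new_col` output name are assumptions, not part of the spec:

```python
# Hypothetical Mongo translation of:
#   IF([COLUMN_1] < [COLUMN_2] AND [COLUMN_3] = "TCTC", "OK", [COLUMN_4])
new_column_stage = {
    "$addFields": {
        "new_col": {  # assumed output column name
            "$cond": [
                {"$and": [
                    {"$lt": ["$COLUMN_1", "$COLUMN_2"]},
                    {"$eq": ["$COLUMN_3", "TCTC"]},
                ]},
                "OK",         # value if the condition holds
                "$COLUMN_4",  # fallback value otherwise
            ]
        }
    }
}
```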

Delete a column

As a query designer, I can click on a column and choose to delete it.

This will generate a DeleteStep in the pipeline.

Dropdown icon on hover only

In a column header, there is a small "call-to-action" dropdown icon to incite the user to click.

To keep the UI as light as possible, this dropdown icon should only be visible when the user hovers over the column header.

"preview" mode

As a query designer, when I create / edit a transformation step, I have access to a "preview" button that will ask the host to recompute the dataset so I can preview the result.

The preview is also accessible when I click on a specific step in the pipeline pane.

Choose backend

As a query designer, I can choose to generate queries on different backends. We could imagine the following ones:

  • mongo 3.6 + python postprocess
  • mongo 4 + python postprocess
  • pandas only
  • sql + python postprocess

We don't want to implement all those backends here; we just want to prepare the code (and the UI) architecture to support multiple backends.

dataset rowcount badge

The number of rows in my dataset is visible in a badge in the bottom-left corner of the dataset viewer.

[table] display the data

In the right pane, we need a table that displays the dataset. The table should be scrollable horizontally.
