
data-drift's Introduction


Datadrift logo


Metrics Observability & Troubleshooting

Datadrift is an open-source metric observability framework that helps data teams deliver trusted and reliable metrics.

DataDrift

Data monitoring tools fall short by focusing on static tests (e.g. null, unique, expected values) and metadata monitoring (e.g. column-level checks).

Datadrift monitors your metrics, sends alerts when anomalies are detected and automates root cause analysis.
Data teams detect and solve data issues faster with Datadrift's row-level monitoring & troubleshooting.


🚀 Quickstart

dbt integration

pip install driftdb

Here is a quick demo. For a step-by-step guide on the dbt installation, see the docs.

Python integration

Install the monitor in your pipeline.

>>> import pandas as pd
>>> from driftdb.connectors import LocalConnector
>>> dataframe: pd.DataFrame = ...  # your metric table as a pandas DataFrame
>>> LocalConnector().snapshot_table(table_dataframe=dataframe, table_name="revenue")

For a step-by-step guide on the Python installation, see the docs.

Datadrift cloud

We are in development and we would love to do the installation with you. Fill in the form on our website so we can do a 15-minute demo. If the tool solves your problem, the installation takes about 30 minutes.


⚡️ Key Features

🔮 Metrics monitoring & custom alerting

Get full visibility into metric variations and proactively detect data quality issues. Become aware of unknown unknowns with custom metric drift alerting.

DataDrift new drift custom alerting

🧑‍🎤 Automated root cause analysis & troubleshooting

Operationalize your monitoring and solve the underlying data quality issue with a lineage drill-down that reveals the root cause of the problem.

DataDrift diff compare table

💎 Shared understanding of metric variation

Give visibility to data analysts and data consumers with a shared explanation of metric variation.

DataDrift metric drift changelog

🧠 And much more

We are in the early days of Datadrift. Open a new issue to tell us about your use case and see how we could help!


💚 Community

We 💚 contributions big and small; everything is appreciated.


🗓 Upcoming features

Track planning on GitHub Projects and help us prioritise by upvoting or creating issues.

data-drift's People

Contributors

bl3f, lucasdvrs, marcbllv, phc, samox


data-drift's Issues

AAU, the main datagit branch cannot be deleted

The computed branch is deleted and recreated on each run. If users enter a wrong name (like "main"), they will lose all their data.
The computed branch should not be a parameter, but a default value derived from the metric name.
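A derived default could look like this minimal sketch; the `computed_branch_name` helper and the `datadrift/computed/` prefix are assumptions for illustration, not the actual datagit implementation:

```python
import re

def computed_branch_name(metric_name: str) -> str:
    # Hypothetical helper: derive a reserved, slugified branch name from the
    # metric name so a protected branch like "main" can never be targeted.
    slug = re.sub(r"[^a-z0-9]+", "-", metric_name.lower()).strip("-")
    return f"datadrift/computed/{slug}"

print(computed_branch_name("Revenue EoP"))  # -> datadrift/computed/revenue-eop
```

Because the prefix is always applied, no user input can resolve to the bare "main" branch.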

dbt installation

Problem

Installation of datagit requires setting up a pipeline that runs the datagit package script.
For dbt users we want to leverage metric definitions and pipeline runs.

Features

  • As a dbt user I can configure a git repository, with authentication details
  • As a dbt user I can configure a table to be historized in datagit
  • As a dbt user, when I run a dbt command (probably dbt run), the datagit package is called with a GitHub token and the dataframe containing the table

Technical strategy

From current knowledge, dbt offers either dbt packages or dbt adapters.
dbt adapters are made for SQL engines, and datagit is not a SQL engine. But a datagit repository looks like a DWH, since you can read from and write to it.
One implementation could be a SQL facade that pushes the data to GitHub. Another could be an adapter whose run method implements only the write.

dbt packages do not seem able to invoke Python packages or make API calls; they should be used for utility operations.

AAU, I can see the drift in a table, filtered on the date

TODO:

  • Add a dual table that scrolls
  • Design tables to represent red and green
  • Deploy storybook
  • Display the commit diff
    • display an ellipsis between hunks
    • make the table header sticky
    • complete the header
  • Month filter
    • I see the months in a dropdown
    • When I select a month in the dropdown, the diff is filtered on it
  • deploy in production (example)
  • Display the diff on the metric

Issues management

  • Add datadrift drilldown link to issue
  • Add comment to changelog (in notion report)

Improve datagit documentation

Why we need this

The datagit documentation needs to be improved for a faster and smoother installation.

Task

  • Add the PyGithub import
  • Document the auto_merge drift option; auto_merge should be enabled by default in the store_metric function
  • Add a step to authorize your datadrift app in your Notion page
  • Remove date_column from the Notion config JSON
  • Add rounding numbers as a best practice when setting up your query
  • Add lineage to the documentation

AAU, I have more context when I open a changelog event page on Notion

Why we need this

When opening a changelog event page on Notion, users do not have much context. They need to click the commit link to get any.

Solution Design

Directly add context in the body of the event page. Context should include:

  • total number of diffs introduced by the commit
  • impact day by day

Task

tbd

[Notion] Fix changelog when metrics is 0

When a metric starts at 0, every modification that introduces a drift of 0 is considered an "Initial Value" event.

In the changelog database, the Initial Value event has 0 impact, whereas it should carry the initial value.
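A minimal sketch of the intended distinction, with hypothetical names (the real driftdb event logic differs): an event should be classified as "Initial Value" only when the key was never seen before, not whenever the computed drift equals 0.

```python
def classify_event(previous_value, new_value):
    # Hypothetical helper: "Initial Value" means the key appeared for the
    # first time; its impact is the initial value itself, even when it is 0.
    if previous_value is None:
        return ("Initial Value", new_value)
    # Any later modification is a drift, even if the delta happens to be 0.
    return ("Drift", new_value - previous_value)

print(classify_event(None, 0))  # a metric starting at 0 is still an Initial Value
```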

DriftDB Connector

Implement a driftdb-connector in the Python library. It should look like the local connector, but use the API of the datadrift backend.
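A rough sketch of what such a connector could look like, mirroring LocalConnector's snapshot_table interface; the route, auth header, and CSV payload are assumptions about the backend API, not its documented contract:

```python
import io
import urllib.request

class DriftDBConnector:
    """Hypothetical connector: same interface as LocalConnector, but ships
    the snapshot to the datadrift backend over HTTP instead of local git."""

    def __init__(self, base_url: str, api_key: str):
        self.base_url = base_url.rstrip("/")
        self.api_key = api_key

    def build_payload(self, table_dataframe) -> bytes:
        # Serialize the dataframe to CSV, like the local connector stores it.
        buffer = io.StringIO()
        table_dataframe.to_csv(buffer, index=False)
        return buffer.getvalue().encode("utf-8")

    def snapshot_table(self, table_dataframe, table_name: str) -> None:
        request = urllib.request.Request(
            f"{self.base_url}/tables/{table_name}/snapshots",  # assumed route
            data=self.build_payload(table_dataframe),
            headers={
                "Authorization": f"Bearer {self.api_key}",  # assumed auth scheme
                "Content-Type": "text/csv",
            },
            method="POST",
        )
        urllib.request.urlopen(request)
```

Keeping the interface identical to LocalConnector would let users switch connectors without touching the rest of their pipeline.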

AAU, I can see the drift on a metric between 2 given dates

Why we need this

Users want to compare a metric at two points in time and see by how much, and why, it has drifted. Use cases: compare with the previous investor update or board pack.

I want to compare today's EoP MRR with the one communicated to the board in July 2023 (and not with the latest main, or since the first historization).

Solution Design

tbd

Task

tbd

Refactoring of driftdb API

  • Rename all metric to table
  • Remove prints; use Python logging
  • Use isort within the formatter
  • Use a 120-character line width
  • Use a class for connectors and the drift evaluator
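For the prints-to-logging item, the usual pattern is a module-level named logger (a sketch; the "driftdb" logger name is an assumption, not the library's actual choice):

```python
import logging

# A named logger lets library users tune or silence driftdb's output,
# which print() statements cannot offer.
logger = logging.getLogger("driftdb")

def snapshot_table(table_name: str) -> None:
    logger.info("Snapshotting table %s", table_name)  # was: print(...)
```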

Snapshot analysis

  • Handle the default case when the user has not defined a guard
  • Add a verbose parameter
  • Make the CLI work without a date, with a "latest snapshot of today" option
  • Configure the alert connector in the datadrift file
  • More colorful logs
  • Documentation

AAU, I see upgraded cohorts charts in my Summary Report

Why we need this

The graph is not easily understandable and does not provide a wow effect. Users do not understand:

  • what the x and y axes represent
  • which data series is which

Also, the chart URLs expire.

Solution Design

Implement a new chart library and new chart features that allow users to "align timeline" as in GitHub Star History.

Tasks

  • Fixed color code for Y and M
  • On hover, display the metric name (metric + date)
  • Display unit on X axis
  • Decide between timeline or cohort display mode

As a Data-drift dev, I have a clearer interface for function `store_metric()`

Issue

The datagit.github_connector.store_metric() function has strange behavior regarding opening or merging PRs:

  • If assignees is set, then a PR is opened
  • If not, no PR is opened, but:
    • if branch happens to be set to main, then changes are pushed to the main branch and can be used in future drift computations
    • if branch is not set, or set to something else, the drift is simply stored in a branch

We have 3 different behaviors (related to "how to handle changes?") that are controlled by 2 params that do not seem really related.

Also, having changes pushed to the main branch is nice, but what would be nicer is having a PR opened and then automatically merged: better for keeping track of who changed what.

Proposal

  • New param: merge_policy = [ no_pr, open_pr, auto_merge_pr ]
    • no_pr -> Simply push changes to a branch, and do nothing more
    • open_pr -> Same as no_pr but also open a PR
    • auto_merge_pr -> Same as open_pr but immediately auto-merge it
  • We keep the param assignees, but if it's None or an empty list, the PR is opened with no assignee
  • We keep the param branch; it simply controls where things are pushed

We could use an Enum for that:

class MergePolicy(Enum):
    NoPR = "NoPR"
    OpenPR = "OpenPR"
    AutoMergePR = "AutoMergePR"


github_connector.store_metric(
    ghClient=Github(github_api_token),
    dataframe=dataframe,
    merge_policy=MergePolicy.AutoMergePR,
    assignees=[],
    branch=branch,
)
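The three policies nest, which a small dispatch makes explicit. This is a hypothetical sketch restating the enum so it is self-contained; handle_changes and the action names are illustrative, not real datagit code:

```python
from enum import Enum

class MergePolicy(Enum):
    NoPR = "NoPR"
    OpenPR = "OpenPR"
    AutoMergePR = "AutoMergePR"

def handle_changes(policy: MergePolicy) -> list:
    # Every policy starts by pushing the drift to the configured branch.
    actions = ["push_branch"]
    if policy in (MergePolicy.OpenPR, MergePolicy.AutoMergePR):
        actions.append("open_pr")   # open_pr builds on no_pr
    if policy is MergePolicy.AutoMergePR:
        actions.append("merge_pr")  # auto_merge_pr builds on open_pr
    return actions
```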

[Notion] AAU, the report is not purged when updated; I can edit the report and my modifications are kept when the report is updated

As of now, the report page is found via a reportId generated from the metric name, timegrain, dimension and period, but the content of the report is deleted and recreated at each refresh.
There is a check on the reportHash to avoid unnecessary recreation.

The feature:

What:

  • When there is a report update, we only update what is mandatory (new graph, updated summary) and append data when needed (new line in the changelog). Nothing is deleted.

Why:

  • The Notion page behaves as expected instead of being deleted. Users can add comments and content.
  • Avoid too many calls to the Notion API. Rate limiting during a report refresh can leave the report partially deleted, then partially recreated.

How:

  • Store the IDs of the blocks that need to be updated, and the ID of the block where the changelog will be appended.
  • Store those IDs in a database. This makes the project stateful, and harder to self-host.
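The stateful part could be sketched as follows. All names here are hypothetical, and the state lives in a JSON file for illustration, whereas the issue suggests a database:

```python
import json
from pathlib import Path

# Remember, per reportId, which Notion block IDs must be updated in place
# and which block new changelog lines are appended under.
STATE_FILE = Path("notion_report_state.json")

def save_block_ids(report_id, updatable_blocks, changelog_anchor):
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    state[report_id] = {
        "updatable": updatable_blocks,
        "changelog_anchor": changelog_anchor,
    }
    STATE_FILE.write_text(json.dumps(state))

def load_block_ids(report_id):
    # Returns None when the report has never been rendered before,
    # in which case the page must be created from scratch.
    if not STATE_FILE.exists():
        return None
    return json.loads(STATE_FILE.read_text()).get(report_id)
```

On each refresh, the updater would load the stored IDs, patch only the "updatable" blocks, and append the new changelog line under the anchor block, leaving user-added content untouched.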
