
data-drift's Introduction


Datadrift logo


Metrics Observability & Troubleshooting

Datadrift is an open-source metric observability framework that helps data teams deliver trusted and reliable metrics.

DataDrift

Data monitoring tools fall short by focusing on static tests (e.g. null, unique, expected values) and metadata monitoring (e.g. column-level checks).

Datadrift monitors your metrics, sends alerts when anomalies are detected and automates root cause analysis.
Data teams detect and solve data issues faster with Datadrift's row-level monitoring & troubleshooting.


🚀 Quickstart

dbt integration

pip install driftdb

Here is a quick demo. For a step-by-step guide on the dbt installation, see the docs.

Python integration

Install the monitor in your pipeline.

>>> import pandas as pd
>>> from driftdb.connectors import LocalConnector
>>> dataframe: pd.DataFrame = ...  # your metric table as a pandas DataFrame
>>> LocalConnector().snapshot_table(table_dataframe=dataframe, table_name="revenue")

For a step-by-step guide on the Python installation, see the docs.

Datadrift cloud

We are in development and we would love to do the installation with you. Fill in the form on our website so we can do a 15-minute demo. If the tool solves your problem, the installation takes about 30 minutes.


⚡️ Key Features

🔮 Metrics monitoring & custom alerting

Get full visibility into metric variations and proactively detect data quality issues. Become aware of unknown unknowns with custom metric drift alerting.

DataDrift new drift custom alerting

🧑‍🎤 Automated root cause analysis & troubleshooting

Operationalize your monitoring and solve the underlying data quality issue with a lineage drill-down that reveals the root cause of the problem.

DataDrift diff compare table

💎 Shared understanding of metric variation

Give visibility to data analysts and data consumers with a shared explanation of metric variation.

DataDrift metric drift changelog

🧠 And much more

We are in the early days of Datadrift. Open a new issue to tell us about your use case and see how we could help!


💚 Community

We 💚 contributions big and small; everything is appreciated.


🗓 Upcoming features

Track planning on GitHub Projects and help us prioritise by upvoting or creating issues.

data-drift's People

Contributors

bl3f, lucasdvrs, marcbllv, phc, samox


data-drift's Issues

AAU, the main datagit branch cannot be deleted

The computed branch is deleted and recreated on each run. If users enter a wrong name (like "main"), they will lose all their data.
The computed branch should not be a parameter, but a default value derived from the metric name.
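A derived default could look like this minimal sketch; the `computed_branch_name` helper and the `datadrift/computed/` prefix are assumptions for illustration, not the actual datagit implementation:

```python
import re

def computed_branch_name(metric_name: str) -> str:
    # Hypothetical helper: derive a reserved, slugified branch name from the
    # metric name so a protected branch like "main" can never be targeted.
    slug = re.sub(r"[^a-z0-9]+", "-", metric_name.lower()).strip("-")
    return f"datadrift/computed/{slug}"

print(computed_branch_name("Revenue EoP"))  # -> datadrift/computed/revenue-eop
```

Because the prefix is always applied, no user input can resolve to the bare "main" branch.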

dbt installation

Problem

Installation of datagit requires setting up a pipeline that runs the datagit package script.
For dbt users we want to leverage metric definitions and pipeline runs.

Features

  • As a dbt user I can configure a git repository, with authentication details
  • As a dbt user I can configure a table to be historized in datagit
  • As a dbt user, when I run a dbt command (probably dbt run), the datagit package is called with a GitHub token and the dataframe containing the table

Technical strategy

From current knowledge, dbt offers either dbt packages or dbt adapters.
dbt adapters are made for SQL engines, and datagit is not a SQL engine. But a datagit repository looks like a DWH, since you can read from and write to it.
One implementation could be a SQL facade that pushes the data to GitHub. Another could be an adapter whose run method implements only the write.

dbt packages do not seem able to invoke Python packages or make API calls; they should be used for utility operations.

AAU, I can see the drift in a table, filtered on the date

TODO:

  • Add a dual table that scrolls
  • Design tables to represent red and green
  • Deploy storybook
  • Display the commit diff
    • display an ellipsis between hunks
    • make the table header sticky
    • complete the header
  • Month filter
    • I see the months in a dropdown
    • When I select a month in the dropdown, the diff is filtered on it
  • deploy in production (example)
  • Display the diff on the metric

Issues management

  • Add datadrift drilldown link to issue
  • Add comment to changelog (in notion report)

Improve datagit documentation

Why we need this

The datagit documentation needs to be improved for a faster and smoother installation.

Task

  • Add the PyGithub import
  • Document the auto_merge drift option; auto_merge should be enabled by default in the store_metric function
  • Add a step to authorize your datadrift app in your Notion page
  • Remove date_column from the Notion config JSON
  • Add rounding numbers as a best practice when setting up your query
  • Add lineage to the documentation

AAU, I have more context when I open a changelog event page on Notion

Why we need this

When opening a changelog event page on Notion, users do not have much context. They need to click the commit link to get any.

Solution Design

Directly add context in the body of the event page. Context should include:

  • total number of diffs introduced by the commit
  • impact day by day

Task

tbd

[Notion] Fix changelog when metrics is 0

When a metric starts at 0, every modification that introduces a drift of 0 is considered an "Initial Value" event.

In the changelog database, the Initial Value event has 0 impact, whereas it should carry the initial value.
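A minimal sketch of the intended distinction, with hypothetical names (the real driftdb event logic differs): an event should be classified as "Initial Value" only when the key was never seen before, not whenever the computed drift equals 0.

```python
def classify_event(previous_value, new_value):
    # Hypothetical helper: "Initial Value" means the key appeared for the
    # first time; its impact is the initial value itself, even when it is 0.
    if previous_value is None:
        return ("Initial Value", new_value)
    # Any later modification is a drift, even if the delta happens to be 0.
    return ("Drift", new_value - previous_value)

print(classify_event(None, 0))  # a metric starting at 0 is still an Initial Value
```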

DriftDB Connector

Implement a driftdb-connector in the Python library. It should look like the local connector, but use the API of the datadrift backend.
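A rough sketch of what such a connector could look like, mirroring LocalConnector's snapshot_table interface; the route, auth header, and CSV payload are assumptions about the backend API, not its documented contract:

```python
import io
import urllib.request

class DriftDBConnector:
    """Hypothetical connector: same interface as LocalConnector, but ships
    the snapshot to the datadrift backend over HTTP instead of local git."""

    def __init__(self, base_url: str, api_key: str):
        self.base_url = base_url.rstrip("/")
        self.api_key = api_key

    def build_payload(self, table_dataframe) -> bytes:
        # Serialize the dataframe to CSV, like the local connector stores it.
        buffer = io.StringIO()
        table_dataframe.to_csv(buffer, index=False)
        return buffer.getvalue().encode("utf-8")

    def snapshot_table(self, table_dataframe, table_name: str) -> None:
        request = urllib.request.Request(
            f"{self.base_url}/tables/{table_name}/snapshots",  # assumed route
            data=self.build_payload(table_dataframe),
            headers={
                "Authorization": f"Bearer {self.api_key}",  # assumed auth scheme
                "Content-Type": "text/csv",
            },
            method="POST",
        )
        urllib.request.urlopen(request)
```

Keeping the interface identical to LocalConnector would let users switch connectors without touching the rest of their pipeline.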

AAU, I can see the drift on a metric between 2 given dates

Why we need this

Users want to compare a metric at two points in time and see by how much, and why, it has drifted. Use cases: compare with the previous investor update or board pack.

I want to compare today's EoP MRR with the one communicated to the board in July 2023 (and not with the latest main, or since the first historization).

Solution Design

tbd

Task

tbd

Refactoring of driftdb API

  • Rename all metric to table
  • Remove prints; use Python logging
  • Use isort within the formatter
  • Use a 120-character line width
  • Use a class for connectors and the drift evaluator
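For the prints-to-logging item, the usual pattern is a module-level named logger (a sketch; the "driftdb" logger name is an assumption, not the library's actual choice):

```python
import logging

# A named logger lets library users tune or silence driftdb's output,
# which print() statements cannot offer.
logger = logging.getLogger("driftdb")

def snapshot_table(table_name: str) -> None:
    logger.info("Snapshotting table %s", table_name)  # was: print(...)
```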

Snapshot analysis

  • Handle the default case when the user has not defined a guard
  • Add a verbose parameter
  • Make the CLI work without a date, with a "latest snapshot of today" option
  • Configure the alert connector in the datadrift file
  • More colorful logs
  • Documentation

AAU, I see upgraded cohorts charts in my Summary Report

Why we need this

The graph is not easily understandable and does not provide a wow effect. Users do not understand:

  • what the x and y axes represent
  • which data series is which

Also, the chart URLs expire.

Solution Design

Implement a new chart library and new chart features that allow users to "align timeline" as in GitHub Star History.

Tasks

  • Fixed color code for Y and M
  • On hover, display the metric name (metric + date)
  • Display unit on X axis
  • Decide between timeline or cohort display mode

As a Data-drift dev, I have a clearer interface for function `store_metric()`

Issue

The datagit.github_connector.store_metric() function has strange behavior regarding opening or merging PRs:

  • If assignees is set, then a PR is opened
  • If not, no PR is opened, but:
    • if branch happens to be set to main, then changes are pushed to the main branch and can be used in future drift computations
    • if branch is not set, or set to something else, the drift is simply stored in a branch

We have 3 different behaviors (related to "how to handle changes?") that are controlled by 2 params that do not seem really related.

Also, having changes pushed to the main branch is nice, but what would be nicer is having a PR opened and then automatically merged: better for keeping track of who changed what.

Proposal

  • New param: merge_policy = [ no_pr, open_pr, auto_merge_pr ]
    • no_pr -> Simply push changes to a branch, and do nothing more
    • open_pr -> Same as no_pr but also open a PR
    • auto_merge_pr -> Same as open_pr but immediately auto-merge it
  • We keep the param assignees, but if it's None or an empty list, the PR is opened with no assignee
  • We keep the param branch; it simply controls where things are pushed

We could use an Enum for that:

class MergePolicy(Enum):
    NoPR = "NoPR"
    OpenPR = "OpenPR"
    AutoMergePR = "AutoMergePR"


github_connector.store_metric(
    ghClient=Github(github_api_token),
    dataframe=dataframe,
    merge_policy=MergePolicy.AutoMergePR,
    assignees=[],
    branch=branch,
)
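The three policies nest, which a small dispatch makes explicit. This is a hypothetical sketch restating the enum so it is self-contained; handle_changes and the action names are illustrative, not real datagit code:

```python
from enum import Enum

class MergePolicy(Enum):
    NoPR = "NoPR"
    OpenPR = "OpenPR"
    AutoMergePR = "AutoMergePR"

def handle_changes(policy: MergePolicy) -> list:
    # Every policy starts by pushing the drift to the configured branch.
    actions = ["push_branch"]
    if policy in (MergePolicy.OpenPR, MergePolicy.AutoMergePR):
        actions.append("open_pr")   # open_pr builds on no_pr
    if policy is MergePolicy.AutoMergePR:
        actions.append("merge_pr")  # auto_merge_pr builds on open_pr
    return actions
```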

[Notion] AAU, the report is not purged when updated; I can edit the report and my modifications are kept when the report is updated

As of now, the report page is found via a reportId generated from the metric name, timegrain, dimension and period, but the content of the report is deleted and recreated at each refresh.
There is a check on the reportHash to avoid unnecessary recreation.

The feature:

What:

  • When there is a report update, we only update what is mandatory (new graph, updated summary) and append data when needed (new line in the changelog). Nothing is deleted.

Why:

  • The Notion page behaves as expected instead of being deleted. Users can add comments and content.
  • Avoid too many calls to the Notion API. Rate limiting during a report refresh can leave the report partially deleted, then partially recreated.

How:

  • Store the IDs of the blocks that need to be updated, and the ID of the block where the changelog will be appended.
  • Store those IDs in a database. This makes the project stateful, and harder to self-host.
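The stateful part could be sketched as follows. All names here are hypothetical, and the state lives in a JSON file for illustration, whereas the issue suggests a database:

```python
import json
from pathlib import Path

# Remember, per reportId, which Notion block IDs must be updated in place
# and which block new changelog lines are appended under.
STATE_FILE = Path("notion_report_state.json")

def save_block_ids(report_id, updatable_blocks, changelog_anchor):
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    state[report_id] = {
        "updatable": updatable_blocks,
        "changelog_anchor": changelog_anchor,
    }
    STATE_FILE.write_text(json.dumps(state))

def load_block_ids(report_id):
    # Returns None when the report has never been rendered before,
    # in which case the page must be created from scratch.
    if not STATE_FILE.exists():
        return None
    return json.loads(STATE_FILE.read_text()).get(report_id)
```

On each refresh, the updater would load the stored IDs, patch only the "updatable" blocks, and append the new changelog line under the anchor block, leaving user-added content untouched.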
