kedro-org / kedro-devrel
The Kedro developer relations team uses this repository for content creation: ideation and execution.
License: Apache License 2.0
This captures the creation of content to promote best practices and encourage Kedro usage as a way to follow those practices.
We should probably write two posts: one covering the best practice itself (which forms the basis of Academy), and a second which builds on it to talk about Kedro. Then we can submit the first post to a publication like "Towards Data Science", which doesn't like marketing, and publish the second ourselves.
There's also scope for other formats such as video, webinar or podcast script, but for now, this issue is purely about the mechanics of getting two complementary articles written, in draft, to share with GetInData for co-marketing.
It's about time we stepped outside the box a bit when it comes to blog post ideas and similar content.
I think it would be a good idea to write a post about Kedro Academy that describes our recent course.
My initial thoughts are that this becomes a McKinsey news post, but later more broadly distributed.
This is a child task of #13 and covers creation of MVP slides about a semantic project structure for Kedro Academy.
It would be great if we had a page explaining Kedro and dynamic pipelines and what the best solutions are if you can't use Kedro.
We are asked about it regularly:
Hello, good afternoon, I was adding a catalog item from a database that has timeseries data and I wanted to have a dynamic param so when I run the code It retrieves the data regarding a certain day.
I ran into this post:
kedro-org/kedro#1089
But I am failing to understand the reasoning behind it. I would like to understand what the approach in Kedro would be: loading the whole table from SQL and then running the transformations in code? That would be quite inefficient when running in production.
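A blog post answering this question could include a small sketch of the alternative to loading the whole table: build a day-scoped query from a run parameter instead. The table and column names below are hypothetical, and in a real Kedro project the date would typically arrive as a runtime parameter (e.g. via `kedro run --params=...`); this is just an illustration of the idea.

```python
from datetime import date

def build_timeseries_query(table: str, run_date: date) -> str:
    """Build a SQL query that pulls only one day's rows from a
    timeseries table, instead of loading the whole table.

    `table` and the column name `event_date` are hypothetical.
    Real code should use bound query parameters rather than
    string interpolation, to avoid SQL injection.
    """
    return (
        f"SELECT * FROM {table} "
        f"WHERE event_date = '{run_date.isoformat()}'"
    )

query = build_timeseries_query("sensor_readings", date(2023, 1, 31))
print(query)
```

The resulting query string can then be handed to whichever SQL dataset the project uses, so only the relevant day's data ever leaves the database.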
https://github.com/kedro-org/kedro-viz/blob/main/LAYOUT_ENGINE.md would make a nice blog post to add to our set.
It would be good to have open community calls about Kedro and its ecosystem. Some ideas:
We already have a Kedro meetup. We could leverage that, or switch to something else like Luma, or something else (tooling is not so important, the format and the cadence and the content are).
There are various ways to do it. Some inspiration from Jupyter: https://blog.jupyter.org/online-collaboration-caf%C3%A9-launch-jupyterhub-team-meetings-to-become-more-collaborative-spaces-b713edadf15
The JupyterHub community often cite the team meetings as a touchpoint for newcomers to familiarise themselves with the JupyterHub project. However, this is not always the case. The team meetings have an emergent agenda built by the community members — which is great! — but it also means that it is a potluck as to whether the meeting you happen to attend will actually be useful depending on the agenda, and a newcomer may have to attend multiple meetings over a long period before feeling comfortable to ask questions and know where they can begin to help.
By reformatting the meeting into a collaborative co-working space using breakout rooms, we can cater for both the need of newcomers to be oriented to the project, and for the community to discuss in-depth topics in an emergent nature.
Meltano Office Hours: https://www.youtube.com/playlist?list=PLO0YrxtDbWAtuuubcEz7mnCHoGfIf8voT, https://www.addevent.com/calendar/Li390615 (looks like they're still going)
Airbyte demo calls: https://airbyte.com/events
This is a child task of #13 and covers creation of MVP slides about dependency management for Kedro Academy.
We've agreed on the following sections as an MVP for a course about levelling up data science code quality to the standard demanded by more traditional software engineering (which I'll abbreviate as SWE for DS):
Then, to teach within Notebooks
Then, to teach within VSCode
Apart from the introduction, each of these seven sections will have a common structure.
The goal is to deliver the first draft of materials in January 2023 to trial with a select audience and refine. We'd then create more of the course and look at the options to record as video.
One of the big open issues right now is how to enable the user to follow along with a REPL (to avoid setup issues). There are various options in play: Educative, Gitpod, Replit, GitHub Codespaces, Rhyme (virtual machine).
I'm sure I'm not imagining this -- we used to have a downloadable cheatsheet of Kedro commands, or a page of them.
But, anyway, there's a community version for this and I think we can make something similar and include a page in docs or as a blog post. Not high priority but a good idea, thanks @astrojuanlu @datajoely for the prompt.
Examples:
Before embarking on this though, we should have a more clear way of tagging "good first issues" (easy low hanging fruit tasks that any beginner can pick up for a small win) and "help wanted" (nice-to-have features that are otherwise not a priority for the core team, but also more complex than what a beginner can handle). Maybe I'll open an issue about this separately.
Following our first training session we need to review whether Replit is the best option for collaborative coding exercises
I think it's time to start writing some tutorials that form useful blog posts and show off aspects of Kedro and Kedro Viz. One of these would be about experiment tracking so this is the ticket for that post.
We need to pick an interesting project (maybe one that's already available as a notebook or using another experiment tracking system, then convert that to a kedro project and use it as the example).
How would you describe Kedro in one sentence? This is key in defining how Kedro is perceived by users, the community, and all stakeholders. This is part of other messaging efforts of Kedro: #2099, #2094, #72
When you search for Kedro, the result is some version of: "Kedro: An open-source Python framework to create reproducible, maintainable, and modular data science code."
Does this capture the value proposition of Kedro succinctly?
Some other examples:
MLflow - A platform for the machine learning lifecycle
neptune.ai - ML metadata store, Google search - Build models with confidence
Weights & Biases - The developer-first MLOps platform, Google search - Developer tools for ML
This would clearly highlight Kedro’s value proposition in one sentence, increasing awareness and adoption.
Kedro offers multiple features that help improve run performance, but people seem not to know about them or how to use them. Create a blog post that showcases how to use e.g. --async and CachedDataSet.
kedro-org/kedro#2036 (comment)
Need to identify which features to showcase.
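The blog post could open with a toy illustration of what a caching dataset such as CachedDataSet buys you: the expensive load runs once per run, and subsequent loads reuse the in-memory copy. This is a sketch of the concept, not Kedro's actual implementation.

```python
from typing import Any, Callable

class CachingLoader:
    """Toy stand-in for a caching dataset: wraps an expensive load
    function and only calls it once, serving the cached result on
    every subsequent load within the same run."""

    def __init__(self, load_fn: Callable[[], Any]):
        self._load_fn = load_fn
        self._cache: Any = None
        self._loaded = False
        self.load_count = 0  # how many times the expensive load actually ran

    def load(self) -> Any:
        if not self._loaded:
            self._cache = self._load_fn()  # e.g. a slow database query
            self._loaded = True
            self.load_count += 1
        return self._cache

loader = CachingLoader(lambda: list(range(3)))  # pretend this hits a database
first = loader.load()
second = loader.load()
print(loader.load_count)  # the underlying load ran only once
```

In a pipeline where several nodes consume the same dataset, this is exactly the saving the post should quantify.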
Create a blog post showing users how to implement a custom Runner that will run every possible Node in the pipeline when one Node fails.
As an illustration, all of the default runners currently exhibit the following behaviour when one Node fails:
Note that independent Nodes that could be run are not run. The custom runner should exhibit the following behaviour:
This feature was requested by users in kedro-org/kedro#503. The blog post should direct readers to this GitHub issue.
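The desired behaviour can be sketched as a toy scheduler over a dependency graph: when a node fails, only its downstream dependents are skipped, and every independent node still runs. The function name and signature below are illustrative, not Kedro's actual Runner API.

```python
from typing import Callable, Dict, List

def run_resilient(
    nodes: Dict[str, Callable[[], None]],
    deps: Dict[str, List[str]],
) -> Dict[str, str]:
    """Run every node whose upstream dependencies succeeded.

    A node whose dependency failed (or was skipped) is marked
    'skipped' rather than aborting the whole run."""
    status: Dict[str, str] = {}
    remaining = list(nodes)
    while remaining:
        progressed = False
        for name in list(remaining):
            upstream = deps.get(name, [])
            if any(status.get(d) in ("failed", "skipped") for d in upstream):
                status[name] = "skipped"  # a parent failed: don't run this one
                remaining.remove(name)
                progressed = True
            elif all(status.get(d) == "ok" for d in upstream):
                try:
                    nodes[name]()
                    status[name] = "ok"
                except Exception:
                    status[name] = "failed"
                remaining.remove(name)
                progressed = True
        if not progressed:  # unresolved dependencies (e.g. a cycle)
            break
    return status

def boom():
    raise RuntimeError("boom")

status = run_resilient(
    {"a": lambda: None, "b": boom, "c": lambda: None, "d": lambda: None},
    {"c": ["b"]},
)
print(status)  # a and d ran, b failed, c was skipped
```

The blog post would then show how to package this logic as a subclass of Kedro's runner base class instead of a free function.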
We need to take another look at the examples on the decks, and take any other feedback into account, to finalise the first 4 modules.
We already have Linen active on the Kedro Discord. Linen makes it possible for us to access messages beyond the 6-month limit on Slack, for improved collaboration and knowledge sharing. We activated Linen when we sunset Discord.
The scope of this ticket includes working with @datajoely to set up Linen on our Slack workspace.
What are the alternatives to using Kedro?
I took the following from https://blog.streamlit.io/using-chatgpt-to-build-a-kedro-ml-pipeline/
Airflow: Airflow is a platform to programmatically author, schedule, and monitor workflows. It was originally developed by Airbnb, and is now a popular open-source project. Airflow provides a flexible and powerful platform for building and managing ML pipelines, and has a large user base and community.
Kubeflow: Kubeflow is an open-source project that aims to make it easy to deploy and manage machine learning workflows on Kubernetes. It provides a range of tools and frameworks for building, deploying, and managing ML pipelines, including Jupyter notebooks, TensorFlow, and PyTorch.
Prefect: Prefect is an open-source platform for building, scheduling, and monitoring data pipelines. It is designed to be highly scalable and resilient, and provides a range of features for building and managing ML pipelines, including support for distributed execution and automatic retries.
Luigi: Luigi is an open-source Python library for building complex pipelines of batch jobs. It was developed by Spotify and is designed to be easy to use and flexible. Luigi provides a range of features for building and managing ML pipelines, including support for dependencies, parallel execution, and error handling.
These are just a few examples of the alternatives to Kedro that you can consider. Each platform has its own strengths and weaknesses, and the right choice will depend on your specific needs and requirements.
I think we should create some content (either a non-navigable "topic" page or a blog post) that compares Kedro with each of these and maybe provides a visual "matrix" to check off which each offers and compare with Kedro (maybe a bit like this).
Or something like this. Basically a listicle.
This is a general "housekeeping" ticket to work out how we use kedro-community better, so it's up to date and useful.
This may be the basis of a nice blog post about the architecture of Viz: https://github.com/kedro-org/kedro-viz/blob/main/ARCHITECTURE.md
Create a new and revamped deck highlighting what Kedro is, what problem it solves and how it does it (features).
Audience is non-technical people
Not a Kedro topic, but one to highlight our expertise/thought leadership. It's low-ish priority in many ways since it doesn't touch Kedro, but we already have content since @deepyaman has already shared it.
Maybe something he could publish on our blog as a way to start a fire/fight on Python subreddits at some point and take us viral 🤣
This captures the creation of a deck to promote and encourage awareness and adoption of Kedro Experiment Tracking. It is part of a wider Kedro Experiment Tracking product marketing strategy #38.
The goal is to create a deck highlighting what Kedro Experiment Tracking is, what problem it solves, and how it solves it (features).
Audience is both technical and non-technical people.
Here's my list so far:
Potentially interesting events:
Bottom line: meet our users where they are.
This is a child task of #13 and covers creation of MVP slides about version control for Kedro Academy.
(Drawing from some ideas by @stichbury)
Kedro channels:
Possible third-party channels:
Guest channels:
Obviously I was overly comprehensive here and I don't think we should strive to be everywhere, nor re-posting content across channels. Instead, we should pick a few of those channels based on
For example: discarding Instagram and LinkedIn, picking Mastodon over Twitter, choosing YouTube over Twitch, holding on Reddit and TikTok for now. Hence long-form blogging + microblogging + occasional video for now. To be discussed.
This is a child task of #13 and covers creation of MVP slides about using VSCode for basics of development as part of the Kedro Academy course.
As per https://kedro-org.slack.com/archives/C03RKP2LW64/p1671548554220589 -- it's clearly causing some problems. We have found in the past that it's hard to keep up with the release cycle of GE so we don't have an example of using GE with Kedro that's in a known state, and we're reluctant to commit to having one, plus docs, since it will be an ongoing overhead.
We could, however, write a blog post to illustrate usage at a particular point in time.
It would be super awesome to put this on the GE blog perhaps, or have them author with us. Something to consider in Q1 2023 -- maybe something for Juan Luis to add to his queue, otherwise I can do it.
This involves moving the ‘What are the primary advantages of Kedro?’ section from the FAQ on kedro docs to the kedro website.
This would clearly highlight the value proposition of Kedro to Data Scientists, Machine Learning Engineers, Data Engineers, and Project Leads, increasing awareness and adoption.
This follows a conversation in the Software Engineering for Data Scientists course about how users learn about Kedro and software engineering principles. This ticket proposes the creation of an Introduction to Kedro training on YouTube.
We've been running "Kedro Beginner Bootcamp" trainings for many years now and this model is not ideal because:
So we need to think of ways to scale learning about Kedro beyond the capabilities of the team and this will impact all users.
We should create an Introduction to Kedro training on YouTube. Some of this work includes:
Therefore, we believe that this will help drive adoption of Kedro.
This is a child task of #13 and covers creation of materials to help reinforce the learnings:
These are repos that use Kedro and the authors potentially could be approached to write a case study about why they chose Kedro, what they're doing etc.
Our interview with Waylon was a nice read but we could improve the narrative flow, bring it up to date and extend it to maybe include some of the Notebook refactoring discussions we've had.
This may or may not mean we also pull in material from "Power is nothing without control", which I'd also like to rework as a blog post. Let's assume for now this is all one post rather than two.
I've checked and nobody seems to be using the kedro-training repo content, which is dated but may be useful for future courses. Let's archive it for now and start adopting kedro-academy as the de facto location for publicly-shared decks and training graphics.
This is a child task of #13 and covers creation of MVP slides about working with a CLI for Kedro Academy.
Based on @astrojuanlu's upcoming talk:
“Analyze your data at the speed of light with Polars and Kedro”
Abstract: “The pandas library is one of the key factors that enabled the growth of Python in the Data Science industry and continues to help data scientists thrive almost 15 years after its creation. Because of this success, nowadays several open-source projects claim to improve pandas in various ways. Polars is one of those new dataframe libraries: it’s backed by Arrow and Rust, and offers an expressive API for dataframe manipulation with excellent performance. In this webinar I will show you how to combine Polars for your data manipulation needs with Kedro, a data science framework that will help you write more maintainable code.”
This is a child task of #13 and covers creation of MVP slides about configuration files for Kedro Academy.
A press kit is a page on your website that makes it incredibly easy for journalists to learn about your brand and product, and access media assets (photos and videos) to use in their content or articles about your brand/product.
This provides journalists with the latest, accurate, and relevant information.
It should contain: an Overview (including Story and Mission), Product Information, Contact Details (including separate media email address), Media Assets, Logos, and Others (e.g. notable awards and quotes).
Here are some press kit examples.
Providing a press kit would drive brand awareness of Kedro.
kedro-org/kedro#2274 made me remember that not only do we have Kedro users, but we also have potential contributors/developers (including both people who want to contribute to the core projects and people extending Kedro with plugins). We should have a distinct and explicit comms/docs strategy for contributors, and also think about in what circumstances a Kedro user could become a contributor, or the other way around.
Currently, our documentation for deploying Kedro on Databricks is heavily based on AWS Databricks. Many of our users use Azure Databricks. We should define and document specific recommendations for these users.
This issue is a child of kedro-org/kedro#2185.
The documentation written for this ticket could be a subsection of a new docs section dedicated to Databricks deployment.
This captures the creation of a video walkthrough to promote and encourage awareness and adoption of Kedro-Viz. It is part of a wider product marketing strategy for Kedro-Viz.
The goal is to create a video walkthrough highlighting what Kedro-Viz is, what problem it solves, and how it solves it (features).
These key features include:
Audience is both technical and non-technical people.
It is common to see misconceptions around which category Kedro should fall into across blog posts, data science articles, and industry rankings.
These misconceptions include Kedro being seen as an orchestrator, a platform like SageMaker, or belonging to a general category like ML or AI, as seen here and here.
The goal is to define the right category for Kedro (e.g. development framework), and position Kedro there, starting with a blog post.
This misperception determines whether and how users implement Kedro in their projects.
Secondly, during client projects, if clients think Kedro is a platform, they may be reluctant to use it for fear of 'vendor lock-in', or because they are already locked in with another vendor.
Effectively communicating our unique positioning and category would drive adoption of Kedro.
This is a bit of a woolly remit but a set of blog posts about how the open source community and big business can (and do) work together would be good. I'm constantly mindful that McKinsey doesn't seem like a natural fit with an open source project and I think it would be helpful for us to build a set of content that reinforces the way consultancies, tech companies and industry use and contribute to open source. We need to demonstrate a serious belief in our community and in being part of other communities.
This is a parent ticket to create a set of posts/issues for ideas about posts.
How the two compare. How the two complement each other.
Based on some discussion following this discussion on hacker news and some ongoing discussion over on the Hamilton Slack
We used Kedro at MoovAI. The standardized structure is reaaally valuable in consulting where team members change over the course of a project! The folks at potloc like it a lot and presented it at the most recent Montréal MLOps community event!
While using Kedro, I wanted to create modular functional code for data transformation, but creating a node for each function would require me to specify input-output for each node. In addition, if the output of these nodes would be pandas Series, I would have to assemble them manually at intermediary steps.
That's when I learned about Hamilton, which exactly met my needs for quick iterations of data processing pipelines with little/no boilerplate. I ended up calling Hamilton within a single Kedro node! (similar to Metaflow+Hamilton)
I think one of the main appeal of Kedro for orgs is the visualization tool that encompasses functions, data, code, experiments, etc. (+MLFlow and Airflow plugins). Integrating Kedro-viz with other DAG tools could be very exciting for users to have full visibility of their ETL pipeline. For example, at the MLOps meetup, someone asked if it would be possible to plug their Airflow ETL (upstream of the data science Kedro project) into the tool!
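The pattern described in the quote above (Hamilton-style column functions running inside a single Kedro node) can be sketched in a few lines: each function's name is the column it produces, and its parameter names are the columns it consumes. Plain lists stand in for pandas Series here; real Hamilton has its own Driver API, so this is illustration only.

```python
import inspect
from typing import Callable, Dict, List

def resolve(
    columns: Dict[str, List],
    funcs: Dict[str, Callable],
) -> Dict[str, List]:
    """Compute derived columns from a micro-DAG of functions.

    Function name = output column; parameter names = input columns.
    This mirrors the Hamilton convention, without the Hamilton library."""
    available = dict(columns)

    def compute(name: str) -> List:
        if name in available:
            return available[name]
        fn = funcs[name]
        # Recursively compute each input the function declares
        args = {p: compute(p) for p in inspect.signature(fn).parameters}
        available[name] = fn(**args)
        return available[name]

    for name in funcs:
        compute(name)
    return available

def total(price, quantity):
    return [p * q for p, q in zip(price, quantity)]

result = resolve({"price": [2, 3], "quantity": [10, 1]}, {"total": total})
print(result["total"])  # [20, 3]
```

Wrapped in one Kedro node, this gives quick iteration on column-level transforms without declaring a Kedro node (and its inputs/outputs) per function.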
The Kedro principles page is a nice enough page but doesn't really work in documentation. Let's move it out into a post or page.
Or, failing that, a topic page that sits on the website (but isn't in the nav, so you can find it organically but we don't push it to the user).
This captures all the product marketing activities to promote and encourage awareness and adoption of Kedro Experiment Tracking.
This includes updating the documentation, preparing a blog post, presentation, and webinar, and distributing the content beyond the existing Kedro slack organisation.
The goal is to increase adoption of Kedro Experiment Tracking and Kedro-Viz, especially among existing Kedro users, by 30%.
This is targeted at (DS, DE, and low-tech) users with a two-pronged approach:
This would include:
This content would then be repurposed for different forms and mediums.
Admin:
Key Activities:
Joel wrote a nice blog post about data layers (I remember his awesome presentation about this in 2021) which we put on Towards Data Science: https://towardsdatascience.com/the-importance-of-layered-thinking-in-data-engineering-a09f685edc71
We have a little about it in the FAQ too, but I'm removing it because it doesn't work as Kedro documentation. It is useful knowledge for someone in the field, but it doesn't tell you about Kedro (so much as advocate for Kedro's opinionated data storage).
I think we should take the blog post we published and maybe repurpose it into a post for the new Kedro blog, then point to that when we need to demonstrate the reason for layering.
Bruce Philp and Guilherme Braccialli are the brains behind a layered data-engineering convention as a model of managing data.
Refer to the table below for a high-level guide to each layer's purpose.
The data layers don't have to exist locally in the data folder within your project, but we recommend that you structure your S3 buckets or other data stores in a similar way.

| Folder in data | Description |
|---|---|
| Raw | Initial start of the pipeline, containing the sourced data model(s) that should never be changed; it forms your single source of truth to work from. These data models are typically un-typed, e.g. csv, but this will vary from case to case. |
| Intermediate | Optional data model(s), introduced to type your raw data model(s), e.g. converting string-based values into their correct typed representation. |
| Primary | Domain-specific data model(s) containing cleansed, transformed and wrangled data from either raw or intermediate; this forms the layer you feed into your feature engineering. |
| Feature | Analytics-specific data model(s) containing a set of features defined against the primary data, grouped by feature area of analysis and stored against a common dimension. |
| Model input | Analytics-specific data model(s) containing all feature data against a common dimension and, in the case of live projects, against an analytics run date, to ensure that you track the historical changes of the features over time. |
| Models | Stored, serialised pre-trained machine learning models. |
| Model output | Analytics-specific data model(s) containing the results generated by the model based on the model input data. |
| Reporting | Reporting data model(s) that combine primary, feature, model input and model output data to drive dashboards and the views constructed. This encapsulates and removes the need to define any blending or joining of data, improves performance, and allows replacement of the presentation layer without redefining the data models. |
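A blog post could make the convention concrete with the numbered folders Kedro's project template uses for these layers (the numbering reflects pipeline order). The small helper below is a sketch for mapping a layer name to its conventional path, e.g. when structuring an S3 bucket the same way as the local data folder.

```python
from pathlib import Path

# Numbered folder names used by the Kedro project template for the
# layers in the table above.
LAYER_FOLDERS = {
    "raw": "01_raw",
    "intermediate": "02_intermediate",
    "primary": "03_primary",
    "feature": "04_feature",
    "model_input": "05_model_input",
    "models": "06_models",
    "model_output": "07_model_output",
    "reporting": "08_reporting",
}

def layer_path(layer: str, base: str = "data") -> Path:
    """Resolve a layer name to its conventional folder under `base`."""
    return Path(base) / LAYER_FOLDERS[layer]

print(layer_path("raw").as_posix())  # data/01_raw
```

Swapping `base` for an S3 prefix (e.g. "s3://my-bucket") applies the same layout to remote storage.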