
kedro-devrel's People

Contributors

afaqueahmad7117, deepyaman, diegoliraqb, kuriantom369, stichbury


kedro-devrel's Issues

Create a blog post about why you should use Kedro

This captures the creation of content to promote best practices and encourage Kedro usage as a way to follow those practices.

We should probably write two posts: the first covering the best practice itself (forming the basis of Academy material), and the second building on it to talk about Kedro. We could then place the first post in a publication like "Towards Data Science", which doesn't like marketing, and publish the second ourselves.

There's also scope for other formats such as video, webinar or podcast script, but for now, this issue is purely about the mechanics of getting two complementary articles written, in draft, to share with GetInData for co-marketing.

Write a Kedro Academy article

It's about time we stepped outside the box a bit when it comes to blog post ideas and similar content.

I think it would be a good idea to write a post about Kedro Academy that describes our recent course.

  • Aims of the course (what we cover and why we think it's important)
  • How we designed it (survey, interviews, MVP, delivery and refinement)
  • How we delivered it (tips and tricks for delivering an engaging course on a Friday afternoon)
  • What's next and how to join

My initial thoughts are that this becomes a McKinsey news post, but later more broadly distributed.

Create one or more blog posts about Kedro & "dynamic pipelines" that sets out the various requirements, solutions and links to docs

It would be great if we had a page explaining Kedro and dynamic pipelines and what the best solutions are if you can't use Kedro.

We are asked about it regularly:

Hello, good afternoon. I was adding a catalog item for a database that has time-series data, and I wanted a dynamic parameter so that when I run the code it retrieves the data for a certain day.
I ran into this post:
kedro-org/kedro#1089
But I am failing to understand the reasoning behind it. I would like to understand what the approach would be in Kedro. Loading the whole table from SQL and then running the transformations in code? That would be quite inefficient when running in production.
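For reference when drafting the post: one common answer today is to template the catalog entry with a runtime parameter, so only the requested day is queried in the database. A minimal sketch, assuming a recent Kedro with the OmegaConfigLoader and the pandas.SQLQueryDataset from kedro-datasets (the table, parameter and credential names are hypothetical):

```yaml
# conf/base/catalog.yml
daily_readings:
  type: pandas.SQLQueryDataset
  sql: "SELECT * FROM readings WHERE day = '${runtime_params:run_date}'"
  credentials: db_credentials
```

The value is then supplied at run time, e.g. `kedro run --params run_date=2023-05-01`, so the filtering happens in the database rather than after loading the whole table.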

Community calls

It would be good to have open community calls about Kedro and its ecosystem. Some ideas:

  • Presentations from the engineering team about a specific topic
  • Office hours, support requests, refactoring help
  • Discussion about development
  • "Showcase your Kedro project"

We already have a Kedro meetup. We could leverage that, or switch to something else like Luma (the tooling is not so important; the format, the cadence and the content are).

There are various ways to do it. Some inspiration from Jupyter: https://blog.jupyter.org/online-collaboration-caf%C3%A9-launch-jupyterhub-team-meetings-to-become-more-collaborative-spaces-b713edadf15

The JupyterHub community often cite the team meetings as a touchpoint for newcomers to familiarise themselves with the JupyterHub project. However, this is not always the case. The team meetings have an emergent agenda built by the community members — which is great! — but it also means that it is a potluck as to whether the meeting you happen to attend will actually be useful depending on the agenda, and a newcomer may have to attend multiple meetings over a long period before feeling comfortable to ask questions and know where they can begin to help.

By reformatting the meeting into a collaborative co-working space using breakout rooms, we can cater for both the need of newcomers to be oriented to the project, and for the community to discuss in-depth topics in an emergent nature.

Meltano Office Hours: https://www.youtube.com/playlist?list=PLO0YrxtDbWAtuuubcEz7mnCHoGfIf8voT, https://www.addevent.com/calendar/Li390615 (looks like they're still going)

Airbyte demo calls: https://airbyte.com/events

Create first set of materials for Kedro Academy course on code quality

Introduction

We've agreed on the following sections as an MVP for a course about levelling up data science code quality to the standard demanded by more traditional software engineering (which I'll abbreviate as SWE for DS):

  • Introduction: Why are you even learning SWE for DS?

Then, to teach within Notebooks

  • Python functions/Functional (pure) programming
  • Configuration
  • Virtual environments and managing dependencies
  • VSCode

Then, to teach within VSCode

  • CLIs
  • Semantic folders/project template
  • Git, Github and collaborative version control (pull from clean code workshop)
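For the "Python functions/Functional (pure) programming" chapter, this is the kind of before/after contrast the materials could open with (an illustrative sketch, not drawn from existing workshop content):

```python
# Impure: reads hidden global state, so the result changes whenever
# `rates` does, and the function can't be tested in isolation.
rates = {"GBP": 1.27}

def to_usd_impure(amount):
    return amount * rates["GBP"]

# Pure: all inputs are explicit, the same arguments always give the
# same result, and a unit test is a one-liner.
def to_usd(amount, rate):
    return amount * rate
```

The pure version also composes cleanly into a pipeline node, which sets up the later Kedro material.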

Requirements

Apart from the introduction, each of these seven sections will have a common structure:

  • Each section should take less than 45 minutes to deliver and be introduced with a standard page (What you will learn, Why it matters, How to test your knowledge: exercise & quiz)
  • Each section should be split into a series of short "chapters" that take 7-10 minutes each so they can be chunked within a Udemy/Coursera course or we can create checkpoints to add onto a YouTube video.
  • Each section should have a set of slides and a script, so it can be delivered in person or converted into a video. There should be an accompanying example exercise on a GitHub repo and a quiz.

The goal is to deliver the first draft of materials in January 2023 to trial with a select audience and refine. We'd then create more of the course and look at the options to record as video.

One of the big open issues right now is how to enable users to follow along with a REPL (to avoid setup issues); various options are in play: Educative, Gitpod, Replit, GitHub Codespaces, Rhyme (virtual machine).

Bring back the Kedro cheatsheet

I'm sure I'm not imagining this -- we used to have a downloadable cheatsheet of Kedro commands, or a page of them.

But, anyway, there's a community version of this, and I think we can make something similar and include it as a page in the docs or as a blog post. Not high priority, but a good idea; thanks @astrojuanlu @datajoely for the prompt.

Explore internship and apprenticeship initiatives

Examples:

Before embarking on this, though, we should have a clearer way of tagging "good first issues" (easy, low-hanging-fruit tasks that any beginner can pick up for a small win) and "help wanted" (nice-to-have features that are otherwise not a priority for the core team, but also more complex than what a beginner can handle). Maybe I'll open a separate issue about this.

Create a blog post that guides users in using experiment tracking with viz using a tutorial (that isn't spaceflights)

I think it's time to start writing some tutorials that form useful blog posts and show off aspects of Kedro and Kedro Viz. One of these would be about experiment tracking so this is the ticket for that post.

We need to pick an interesting project (maybe one that's already available as a notebook or uses another experiment tracking system), then convert it to a Kedro project and use it as the example.

Messaging - One Sentence Summary for Kedro

Context

How would you describe Kedro in one sentence? This is key in defining how Kedro is perceived by users, the community, and all stakeholders. This is part of other messaging efforts of Kedro: #2099, #2094, #72

When you search for Kedro the result is some version of - Kedro - An open-source Python framework to create reproducible, maintainable, and modular data science code.

Does this capture the value proposition of Kedro succinctly?

Some other examples:

Why is this important?

This would clearly highlight Kedro’s value proposition in one sentence, increasing awareness and adoption.

Next Steps

  • Research and decide on a one-sentence summary for Kedro
  • Publish on Kedro's website, documentation, press kit, and community

Create a blog post guide for users to create a soft-fail custom runner

Description

Create a blog post showing users how to implement a custom Runner that, when one Node fails, still runs every other Node in the pipeline that can be run.

As an illustration, all of the default runners currently exhibit the following behaviour when one Node fails:

[Screenshot (2022-11-30): default behaviour — the run stops and independent Nodes are left unexecuted]

Note that independent Nodes that could be run are not run. The custom runner should exhibit the following behaviour:

[Screenshot (2022-11-30): desired soft-fail behaviour — independent Nodes still run]

Context

This feature was requested by users in kedro-org/kedro#503. The blog post should direct readers to this GitHub issue.
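To seed the post, here is a language-level sketch of the soft-fail scheduling idea, deliberately independent of Kedro's actual Runner API (which the post would use instead): run every node whose dependencies succeeded, mark the dependents of a failure as skipped, and keep going.

```python
# Sketch of "soft-fail" scheduling, not Kedro's Runner API: if a node
# fails, mark its downstream dependents "skipped" instead of aborting
# the whole run, and still execute every unaffected node.

def soft_fail_run(pipeline):
    """pipeline maps node name -> (callable, list of dependency names).
    Returns a dict mapping each node to "success", "failed" or "skipped"."""
    status = {}
    remaining = dict(pipeline)
    while remaining:
        progressed = False
        for name, (func, deps) in list(remaining.items()):
            if any(d not in status for d in deps):
                continue  # wait until every dependency has a status
            del remaining[name]
            progressed = True
            if any(status[d] != "success" for d in deps):
                status[name] = "skipped"  # upstream failed: don't run
            else:
                try:
                    func()
                    status[name] = "success"
                except Exception:
                    status[name] = "failed"  # record the error, carry on
        if not progressed:
            raise ValueError("pipeline contains a cycle")
    return status
```

A run with one failing node then reports the failed node, skips only its dependents, and completes the independent branch, which is exactly the behaviour the two screenshots contrast.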

Setup Linen on Kedro Slack

Description

We already had Linen active on the Kedro Discord. Linen makes it possible for us to access messages beyond the 6-month limit on Slack, for improved collaboration and knowledge sharing. We activated Linen before we sunset Discord.

The scope of this ticket includes working with @datajoely to set up Linen on our Slack workspace.

Comparative content to clarify Kedro positioning

What are the alternatives to using Kedro?

I took the following from https://blog.streamlit.io/using-chatgpt-to-build-a-kedro-ml-pipeline/

Airflow: Airflow is a platform to programmatically author, schedule, and monitor workflows. It was originally developed by Airbnb, and is now a popular open-source project. Airflow provides a flexible and powerful platform for building and managing ML pipelines, and has a large user base and community.

Kubeflow: Kubeflow is an open-source project that aims to make it easy to deploy and manage machine learning workflows on Kubernetes. It provides a range of tools and frameworks for building, deploying, and managing ML pipelines, including Jupyter notebooks, TensorFlow, and PyTorch.

Prefect: Prefect is an open-source platform for building, scheduling, and monitoring data pipelines. It is designed to be highly scalable and resilient, and provides a range of features for building and managing ML pipelines, including support for distributed execution and automatic retries.

Luigi: Luigi is an open-source Python library for building complex pipelines of batch jobs. It was developed by Spotify and is designed to be easy to use and flexible. Luigi provides a range of features for building and managing ML pipelines, including support for dependencies, parallel execution, and error handling.

These are just a few examples of the alternatives to Kedro that you can consider. Each platform has its own strengths and weaknesses, and the right choice will depend on your specific needs and requirements.

I think we should create some content (either a non-navigable "topic" page or a blog post) that compares Kedro with each of these and maybe provides a visual "matrix" to check off which each offers and compare with Kedro (maybe a bit like this).

Blog post idea: Five Kedro hacks that will improve your workflow

Or something like this. Basically a listicle.

  • Onboarding style/awareness option: Less focussed on Kedro and more on data science. Such as "Ten data science hacks that will improve your productivity" or "Seven ways to make your data science code better" and have some academy content in there as well as a "Use Kedro".
  • Improving Kedro-skills option: Make the list Kedro-specific and cover five things you may not know about using Kedro, or some ways to do things that help move beyond onboarding to give beginner users more confidence and intermediate users super-powers.

Create new deck for Kedro

Description

Create a new and revamped deck highlighting what Kedro is, what problem it solves and how it does it (features).

Audience is non-technical people

Create a blog post about python dependency management

Not a Kedro topic, but one that highlights our expertise/thought leadership. It's low-ish priority in many ways since it doesn't touch Kedro, but we already have content, since @deepyaman has already shared it.

Maybe something he could publish on our blog as a way to start a fire/fight on Python subreddits at some point and take us viral 🤣

Create a deck for Kedro Experiment Tracking

Context

This captures the creation of a deck to promote and encourage awareness and adoption of Kedro Experiment Tracking. It is part of a wider Kedro Experiment Tracking product marketing strategy #38.

Goal

The goal is to create a deck highlighting what Kedro Experiment Tracking is, what problem it solves, and how it solves it (its features).

Audience is both technical and non-technical people.

Establish content and social media strategies

(Drawing from some ideas by @stichbury)

Kedro channels:

  • Documentation
  • QB Medium
  • Blog (upcoming)

Possible third-party channels:

  • Twitter
  • Microblogging Fediverse (Mastodon or similar)
  • LinkedIn
  • Reddit
  • Instagram
  • YouTube
  • Twitch
  • TikTok

Guest channels:

  • Towards Data Science
  • ?

Obviously I was overly comprehensive here and I don't think we should strive to be everywhere, nor re-posting content across channels. Instead, we should pick a few of those channels based on

  • Target audience/User personas
  • Content format

For example: discarding Instagram and LinkedIn, picking Mastodon over Twitter, choosing YouTube over Twitch, holding on Reddit and TikTok for now. Hence long-form blogging + microblogging + occasional video for now. To be discussed.

Add a blog post with an example for Great Expectations

As per https://kedro-org.slack.com/archives/C03RKP2LW64/p1671548554220589 -- it's clearly causing some problems. We have found in the past that it's hard to keep up with the release cycle of GE so we don't have an example of using GE with Kedro that's in a known state, and we're reluctant to commit to having one, plus docs, since it will be an ongoing overhead.

We could, however, write a blog post to illustrate usage at a particular point in time.

It would be super awesome to put this on the GE blog perhaps, or have them author with us. Something to consider in Q1 2023 -- maybe something for Juan Luis to add to his queue, otherwise I can do it.

Scaling introductory Kedro training

Description

This follows a "Software Engineering for Data Scientists" conversation about how users learn about Kedro and software engineering principles. This ticket proposes the creation of an Introduction to Kedro YouTube course.

Why is this important?

We've been running "Kedro Beginner Bootcamp" trainings for many years now and this model is not ideal because:

  • We mostly host internal sessions, so our open-source users don't benefit from these learnings
  • We have only run one external training, which is well-viewed

So we need to think of ways to scale learning about Kedro beyond the capabilities of the team and this will impact all users.

Proposal

We should create an Introduction to Kedro training on YouTube. Some of this work includes:

  • Designing and tweaking the curriculum to fit this platform
  • Recording the sessions
  • Defining the learner support model
  • Creating a marketing plan

Evidence markers

  • Users did find Kedro because of DataEngineerOne
  • Users still try and use our old trainings
  • There is little evidence, from @stichbury's research, that we should do a Udemy course

Therefore, we believe that this will help drive adoption of Kedro.

Find community projects that use Kedro to build case-study blog posts

These are repos that use Kedro and the authors potentially could be approached to write a case study about why they chose Kedro, what they're doing etc.

Blog post to revise the "data science framework revolution" blog post

Our interview with Waylon was a nice read, but we could improve the narrative flow, bring it up to date, and extend it to include some of the notebook-refactoring discussions we've had.

  • Challenges data scientists face when their code moves into production
  • Frameworks for data science workflow
  • Scope of data science has increased
  • How Kedro splits the workflow so expertise can be sought from team members and they don't have to understand the whole project

This may or may not mean we also pull in material from "Power is nothing without control", which I'd also like to rework as a blog post. Let's assume for now this is all part of one post rather than two.

Archive content on the `kedro-training` repo and sunset it

I've checked and nobody seems to be using the kedro-training repo content, which has dated but may be useful for future courses. Let's archive it for now and start adopting kedro-academy as the de facto location for publicly shared decks and training graphics.

Blog post about using Polars and Kedro

Based on @astrojuanlu's upcoming talk:
“Analyze your data at the speed of light with Polars and Kedro”

Abstract: “The pandas library is one of the key factors that enabled the growth of Python in the Data Science industry and continues to help data scientists thrive almost 15 years after its creation. Because of this success, nowadays several open-source projects claim to improve pandas in various ways. Polars is one of those new dataframe libraries: it’s backed by Arrow and Rust, and offers an expressive API for dataframe manipulation with excellent performance. In this webinar I will show you how to combine Polars for your data manipulation needs with Kedro, a data science framework that will help you write more maintainable code.”
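The post could show how little changes on the Kedro side: swapping a catalog entry to a Polars-backed dataset. A sketch assuming kedro-datasets with Polars support installed (the exact class name, e.g. polars.CSVDataset vs polars.CSVDataSet, depends on the kedro-datasets version; the dataset name and path are illustrative):

```yaml
# conf/base/catalog.yml
companies:
  type: polars.CSVDataset
  filepath: data/01_raw/companies.csv
```

Nodes then receive a Polars DataFrame instead of a pandas one, and the pipeline wiring is otherwise unchanged.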

Messaging - Create Kedro Press Kit

What is a Press Kit?

A press kit is a page on your website that makes it incredibly easy for journalists to learn about your brand and product, and access media assets (photos and videos) to use in their content or articles about your brand/product.

Why is it important?

This provides journalists with the latest, accurate, and relevant information.

It should contain: an Overview (including Story and Mission), Product Information, Contact Details (including separate media email address), Media Assets, Logos, and Others (e.g. notable awards and quotes).

Here are some press kit examples.

Providing a press kit would drive brand awareness of Kedro.

Next Steps for Kedro Press Kit

  • Design
  • Gather content: Overview (including Story and Mission), Product Information, Contact Details (including separate media email address), Media Assets, Logos, and Others (e.g. notable awards and quotes)
  • Publish on kedro website

Strategy for Kedro developers and contributors

kedro-org/kedro#2274 reminded me that we not only have Kedro users but also potential contributors/developers (both people who want to contribute to the core projects and people extending Kedro with plugins). We should have a distinct and explicit comms/docs strategy for contributors, and also think about the circumstances in which a Kedro user could become a contributor, or the other way around.

Blog posts for working with Kedro and Databricks on the Azure platform

Description

Currently, our documentation for deploying Kedro on Databricks is heavily based on AWS Databricks. Many of our users use Azure Databricks. We should define and document specific recommendations for these users.

Context

This issue is a child of kedro-org/kedro#2185.

Possible Implementation

The documentation written for this ticket could be a subsection of a new docs section dedicated to Databricks deployment.

Create a video walkthrough of Kedro-Viz

Context

This captures the creation of a video walkthrough to promote and encourage awareness and adoption of Kedro-Viz. It is part of a wider Kedro-Viz product marketing strategy.

Goal

The goal is to create a video walkthrough highlighting what Kedro-Viz is, what problem it solves, and how it solves it (its features).

These key features include:

  • Dataset preview.
  • Plotly and Matplotlib plots.
  • Experiment tracking.

Audience is both technical and non-technical people.

Create a blog post to offer messaging about the category in which to place Kedro

What is this?

It is common to see misconceptions about which category Kedro falls into across blog posts, data science articles, and industry rankings.

These misconceptions include Kedro being seen as an orchestrator, as a platform like SageMaker, or as part of a general category like ML or AI.

The goal is to define the right category for Kedro (e.g. development framework), and position Kedro there, starting with a blog post.

Why do we care about this?

This wrong perception affects whether and how users adopt Kedro for their projects.

Secondly, during client projects, if clients believe Kedro is a platform, they may be reluctant to use it for fear of 'vendor lock-in', or because they are already locked in with another vendor.

Effectively communicating our unique positioning and category would drive adoption of Kedro.

What needs to happen?

  • Benchmark our current perceived position
  • Define our category
  • Write a blog post about this
  • Approach industry map makers and communicate Kedro's category

Blog post series to celebrate open source communities

This is a bit of a woolly remit but a set of blog posts about how the open source community and big business can (and do) work together would be good. I'm constantly mindful that McKinsey doesn't seem like a natural fit with an open source project and I think it would be helpful for us to build a set of content that reinforces the way consultancies, tech companies and industry use and contribute to open source. We need to demonstrate a serious belief in our community and in being part of other communities.

This is a parent ticket to create a set of posts/issues for ideas about posts.

"Hamilton vs Kedro" plus "Hamilton and Kedro"

How the two compare. How the two complement each other.

Based on some discussion following this thread on Hacker News, and some ongoing discussion over on the Hamilton Slack:

We used Kedro at MoovAI. The standardized structure is reaaally valuable in consulting where team members change over the course of a project! The folks at potloc like it a lot and presented it at the most recent Montréal MLOps community event!
While using Kedro, I wanted to create modular functional code for data transformation, but creating a node for each function would require me to specify input-output for each node. In addition, if the output of these nodes would be pandas Series, I would have to assemble them manually at intermediary steps.
That's when I learned about Hamilton, which exactly met my needs for quick iterations of data processing pipelines with little/no boilerplate. I ended up calling Hamilton within a single Kedro node! (similar to Metaflow+Hamilton)
I think one of the main appeals of Kedro for orgs is the visualization tool that encompasses functions, data, code, experiments, etc. (+MLflow and Airflow plugins). Integrating Kedro-Viz with other DAG tools could be very exciting for users to have full visibility of their ETL pipeline. For example, at the MLOps meetup, someone asked if it would be possible to plug their Airflow ETL (upstream of the data science Kedro project) into the tool!

Kedro Experiment Tracking Product Marketing Strategy

Context

This captures all the product marketing activities to promote and encourage awareness and adoption of Kedro Experiment Tracking.

This includes updating the documentation; preparing a blog post, presentation, and webinar; and distributing the content beyond the existing Kedro Slack organisation.

Goal

The goal is to increase adoption of Kedro experiment tracking and Kedro-Viz, especially among existing Kedro users, by 30%.

Target audience

This is targeted at (DS, DE, and low-tech) users with a two-pronged approach:

  1. Marketing to Kedro users to adopt Kedro-Viz
  2. Marketing to non-Kedro users to adopt Kedro/Kedro-Viz

Content

This would include:

  • What is experiment tracking?
  • Why did Kedro implement experiment tracking?
  • Why would you choose to use experiment tracking in Kedro vs other tools?
  • What features are in experiment tracking for Kedro?
  • How did we get to this version of experiment tracking? (user testing process)
  • What's the future of this feature?

This content would then be repurposed for different forms and mediums.

Plan/Checklist

Admin:

  • Create GitHub issues to track the product marketing strategy and sub-activities (blog post etc.)
  • Metrics - Determine success metrics for product marketing and feature adoption

Key Activities:

  • Update Kedro-Viz docs #2144 and the spaceflights tutorial
  • Write and publish a blog article (#1227) to address all the questions in the content section above
  • Video - Present and record a video walkthrough of Kedro experiment tracking (especially for existing users of experiment tracking and user interview/testing participants) - #45 - to demo the features in Kedro experiment tracking
  • Presentation - #46 - Prepare a presentation for internal & external audiences on Kedro experiment tracking
  • Do a Google search on experiment tracking to benchmark other related content for messaging and positioning
  • Explore other channels beyond Slack to reach users
  • Create a set of content that compares experiment tracking in Kedro to the 3 (or 5) most likely "competitors". This content could be video scripts (to film at a later stage) or text to publish as blog posts or other content marketing.

Create/re-publish post about data layers

Joel wrote a nice blog post about data layers (I remember his awesome presentation about this in 2021) which we put on Towards Data Science: https://towardsdatascience.com/the-importance-of-layered-thinking-in-data-engineering-a09f685edc71

We have a little about it in the FAQ too, but I'm removing that because it doesn't work as Kedro documentation. It is useful knowledge for someone in the field, but it doesn't tell you about Kedro (so much as advocate for Kedro's opinionated data storage).

I think we should take the blog post we published and maybe repurpose it into a post for the new Kedro blog, then point to that when we need to demonstrate the reason for layering.


Bruce Philp and Guilherme Braccialli are the brains behind a layered data-engineering convention as a model for managing data.

Refer to the table below for a high-level guide to each layer's purpose.

The data layers don’t have to exist locally in the data folder within your project, but we recommend that you structure your S3 buckets or other data stores in a similar way.


| Folder in data | Description |
| --- | --- |
| Raw | Initial start of the pipeline, containing the sourced data model(s) that should never be changed; it forms your single source of truth to work from. These data models are typically un-typed, e.g. csv, but this will vary from case to case. |
| Intermediate | Optional data model(s), introduced to type your raw data model(s), e.g. converting string-based values into their typed representation. |
| Primary | Domain-specific data model(s) containing cleansed, transformed and wrangled data from either raw or intermediate; this forms the layer you feed into feature engineering. |
| Feature | Analytics-specific data model(s) containing a set of features defined against the primary data, grouped by feature area of analysis and stored against a common dimension. |
| Model input | Analytics-specific data model(s) containing all feature data against a common dimension and, in the case of live projects, against an analytics run date, to track the historical changes of the features over time. |
| Models | Stored, serialised, pre-trained machine learning models. |
| Model output | Analytics-specific data model(s) containing the results generated by the model based on the model input data. |
| Reporting | Reporting data model(s) that combine primary, feature, model input and model output data to drive dashboards and views. This encapsulates and removes the need to define blending or joining of data, improves performance, and allows replacement of the presentation layer without redefining the data models. |
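If the repurposed post shows the layers in practice, it could include how a layer is attached to a catalog entry so Kedro-Viz can group datasets visually. A sketch (the exact key has moved between Kedro versions: older releases used a top-level layer attribute, recent ones use metadata; the dataset name and path are illustrative):

```yaml
# conf/base/catalog.yml
companies:
  type: pandas.CSVDataset
  filepath: data/01_raw/companies.csv
  metadata:
    kedro-viz:
      layer: raw
```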
