kedro-org / kedro-devrel
The Kedro developer relations team uses this repository for content creation: ideation and execution.
License: Apache License 2.0
This captures the creation of content to promote best practices and encourage Kedro usage as a way to follow those practices.
We should probably write two posts: one covering the best practice itself (which forms the basis of Academy), and a second which builds on it to talk about Kedro. Then we can submit the first post to a publication like "Towards Data Science", which doesn't like marketing, and publish the second ourselves.
There's also scope for other formats such as video, webinar or podcast script, but for now, this issue is purely about the mechanics of getting two complementary articles written, in draft, to share with GetInData for co-marketing.
It's about time we stepped outside the box a bit when it comes to blog post ideas and similar content.
I think it would be a good idea to write a post about Kedro Academy that describes our recent course.
My initial thoughts are that this becomes a McKinsey news post, but later more broadly distributed.
This is a child task of #13 and covers creation of MVP slides about a semantic project structure for Kedro Academy.
It would be great if we had a page explaining Kedro and dynamic pipelines and what the best solutions are if you can't use Kedro.
We are asked about it regularly:
Hello, good afternoon, I was adding a catalog item from a database that has timeseries data and I wanted to have a dynamic param so when I run the code It retrieves the data regarding a certain day.
I ran into this post:
kedro-org/kedro#1089
But I am failing to understand the reasoning behind it. I would like to understand what the approach in Kedro would be: loading the whole table from SQL and then running the transformations in code? That would be quite inefficient when running in production.
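A blog post answering this question could include a small sketch of the alternative to loading the whole table: build a day-scoped query from a run parameter instead. The table and column names below are hypothetical, and in a real Kedro project the date would typically arrive as a runtime parameter (e.g. via `kedro run --params=...`); this is just an illustration of the idea.

```python
from datetime import date

def build_timeseries_query(table: str, run_date: date) -> str:
    """Build a SQL query that pulls only one day's rows from a
    timeseries table, instead of loading the whole table.

    `table` and the column name `event_date` are hypothetical.
    Real code should use bound query parameters rather than
    string interpolation, to avoid SQL injection.
    """
    return (
        f"SELECT * FROM {table} "
        f"WHERE event_date = '{run_date.isoformat()}'"
    )

query = build_timeseries_query("sensor_readings", date(2023, 1, 31))
print(query)
```

The resulting query string can then be handed to whichever SQL dataset the project uses, so only the relevant day's data ever leaves the database.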
https://github.com/kedro-org/kedro-viz/blob/main/LAYOUT_ENGINE.md would make a nice blog post to add to our set.
It would be good to have open community calls about Kedro and its ecosystem. Some ideas:
We already have a Kedro meetup. We could leverage that, or switch to something else like Luma, or something else (tooling is not so important, the format and the cadence and the content are).
There are various ways to do it. Some inspiration from Jupyter: https://blog.jupyter.org/online-collaboration-caf%C3%A9-launch-jupyterhub-team-meetings-to-become-more-collaborative-spaces-b713edadf15
The JupyterHub community often cite the team meetings as a touchpoint for newcomers to familiarise themselves with the JupyterHub project. However, this is not always the case. The team meetings have an emergent agenda built by the community members — which is great! — but it also means that it is a potluck as to whether the meeting you happen to attend will actually be useful depending on the agenda, and a newcomer may have to attend multiple meetings over a long period before feeling comfortable to ask questions and know where they can begin to help.
By reformatting the meeting into a collaborative co-working space using breakout rooms, we can cater for both the need of newcomers to be oriented to the project, and for the community to discuss in-depth topics in an emergent nature.
Meltano Office Hours: https://www.youtube.com/playlist?list=PLO0YrxtDbWAtuuubcEz7mnCHoGfIf8voT, https://www.addevent.com/calendar/Li390615 (looks like they're still going)
Airbyte demo calls: https://airbyte.com/events
This is a child task of #13 and covers creation of MVP slides about dependency management for Kedro Academy.
We've agreed on the following sections as an MVP for a course about levelling up data science code quality to the standard demanded by more traditional software engineering (which I'll abbreviate as SWE for DS):
Then, to teach within Notebooks
Then, to teach within VSCode
Apart from the introduction, each of these seven sections will have a common structure.
The goal is to deliver the first draft of materials in January 2023 to trial with a select audience and refine. We'd then create more of the course and look at the options to record as video.
One of the big open issues right now is how to enable the user to follow along with a REPL (to avoid setup issues). There are various options in play: Educative, Gitpod, Replit, GitHub Codespaces, Rhyme (virtual machine).
I'm sure I'm not imagining this -- we used to have a downloadable cheatsheet of Kedro commands, or a page of them.
But, anyway, there's a community version for this and I think we can make something similar and include a page in docs or as a blog post. Not high priority but a good idea, thanks @astrojuanlu @datajoely for the prompt.
Examples:
Before embarking on this though, we should have a more clear way of tagging "good first issues" (easy low hanging fruit tasks that any beginner can pick up for a small win) and "help wanted" (nice-to-have features that are otherwise not a priority for the core team, but also more complex than what a beginner can handle). Maybe I'll open an issue about this separately.
Following our first training session we need to review whether Replit is the best option for collaborative coding exercises
I think it's time to start writing some tutorials that form useful blog posts and show off aspects of Kedro and Kedro Viz. One of these would be about experiment tracking so this is the ticket for that post.
We need to pick an interesting project (maybe one that's already available as a notebook or using another experiment tracking system, then convert that to a kedro project and use it as the example).
How would you describe Kedro in one sentence? This is key in defining how Kedro is perceived by users, the community, and all stakeholders. This is part of other messaging efforts of Kedro: #2099, #2094, #72
When you search for Kedro, the result is some version of: "Kedro: An open-source Python framework to create reproducible, maintainable, and modular data science code."
Does this capture the value proposition of Kedro succinctly?
Some other examples:
MLflow - A platform for the machine learning lifecycle
neptune.ai - ML metadata store, Google search - Build models with confidence
Weights & Biases - The developer-first MLOps platform, Google search - Developer tools for ML
This would clearly highlight Kedro’s value proposition in one sentence, increasing awareness and adoption.
Kedro offers multiple features that help improve run performance, but people seem not to know about them or how to use them. Create a blog post that showcases how to use e.g. --async and CachedDataSet.
kedro-org/kedro#2036 (comment)
Need to identify which features to showcase.
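The blog post could open with a toy illustration of what a caching dataset such as CachedDataSet buys you: the expensive load runs once per run, and subsequent loads reuse the in-memory copy. This is a sketch of the concept, not Kedro's actual implementation.

```python
from typing import Any, Callable

class CachingLoader:
    """Toy stand-in for a caching dataset: wraps an expensive load
    function and only calls it once, serving the cached result on
    every subsequent load within the same run."""

    def __init__(self, load_fn: Callable[[], Any]):
        self._load_fn = load_fn
        self._cache: Any = None
        self._loaded = False
        self.load_count = 0  # how many times the expensive load actually ran

    def load(self) -> Any:
        if not self._loaded:
            self._cache = self._load_fn()  # e.g. a slow database query
            self._loaded = True
            self.load_count += 1
        return self._cache

loader = CachingLoader(lambda: list(range(3)))  # pretend this hits a database
first = loader.load()
second = loader.load()
print(loader.load_count)  # the underlying load ran only once
```

In a pipeline where several nodes consume the same dataset, this is exactly the saving the post should quantify.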
Create a blog post showing users how to implement a custom Runner that will run every possible Node in the pipeline when one Node fails.
As an illustration, all of the default runners currently exhibit the following behaviour when one Node fails:
Note that independent Nodes that could be run are not run. The custom runner should exhibit the following behaviour:
This feature was requested by users in kedro-org/kedro#503. The blog post should direct readers to this GitHub issue.
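The desired behaviour can be sketched as a toy scheduler over a dependency graph: when a node fails, only its downstream dependents are skipped, and every independent node still runs. The function name and signature below are illustrative, not Kedro's actual Runner API.

```python
from typing import Callable, Dict, List

def run_resilient(
    nodes: Dict[str, Callable[[], None]],
    deps: Dict[str, List[str]],
) -> Dict[str, str]:
    """Run every node whose upstream dependencies succeeded.

    A node whose dependency failed (or was skipped) is marked
    'skipped' rather than aborting the whole run."""
    status: Dict[str, str] = {}
    remaining = list(nodes)
    while remaining:
        progressed = False
        for name in list(remaining):
            upstream = deps.get(name, [])
            if any(status.get(d) in ("failed", "skipped") for d in upstream):
                status[name] = "skipped"  # a parent failed: don't run this one
                remaining.remove(name)
                progressed = True
            elif all(status.get(d) == "ok" for d in upstream):
                try:
                    nodes[name]()
                    status[name] = "ok"
                except Exception:
                    status[name] = "failed"
                remaining.remove(name)
                progressed = True
        if not progressed:  # unresolved dependencies (e.g. a cycle)
            break
    return status

def boom():
    raise RuntimeError("boom")

status = run_resilient(
    {"a": lambda: None, "b": boom, "c": lambda: None, "d": lambda: None},
    {"c": ["b"]},
)
print(status)  # a and d ran, b failed, c was skipped
```

The blog post would then show how to package this logic as a subclass of Kedro's runner base class instead of a free function.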
We need to take another look at the examples on the decks, and take any other feedback into account, to finalise the first 4 modules.
We already have Linen active on the Kedro Discord. Linen makes it possible for us to access messages beyond the 6-month limit on Slack, for improved collaboration and knowledge sharing. We activated Linen when we sunset Discord.
The scope of this ticket includes working with @datajoely to set up Linen on our Slack workspace.
What are the alternatives to using Kedro?
I took the following from https://blog.streamlit.io/using-chatgpt-to-build-a-kedro-ml-pipeline/
Airflow: Airflow is a platform to programmatically author, schedule, and monitor workflows. It was originally developed by Airbnb, and is now a popular open-source project. Airflow provides a flexible and powerful platform for building and managing ML pipelines, and has a large user base and community.
Kubeflow: Kubeflow is an open-source project that aims to make it easy to deploy and manage machine learning workflows on Kubernetes. It provides a range of tools and frameworks for building, deploying, and managing ML pipelines, including Jupyter notebooks, TensorFlow, and PyTorch.
Prefect: Prefect is an open-source platform for building, scheduling, and monitoring data pipelines. It is designed to be highly scalable and resilient, and provides a range of features for building and managing ML pipelines, including support for distributed execution and automatic retries.
Luigi: Luigi is an open-source Python library for building complex pipelines of batch jobs. It was developed by Spotify and is designed to be easy to use and flexible. Luigi provides a range of features for building and managing ML pipelines, including support for dependencies, parallel execution, and error handling.
These are just a few examples of the alternatives to Kedro that you can consider. Each platform has its own strengths and weaknesses, and the right choice will depend on your specific needs and requirements.
I think we should create some content (either a non-navigable "topic" page or a blog post) that compares Kedro with each of these and maybe provides a visual "matrix" to check off which each offers and compare with Kedro (maybe a bit like this).
Or something like this. Basically a listicle.
This is a general "housekeeping" ticket to work out how we use kedro-community better, so it's up to date and useful.
This may be the basis of a nice blog post about the architecture of Viz: https://github.com/kedro-org/kedro-viz/blob/main/ARCHITECTURE.md
Create a new and revamped deck highlighting what Kedro is, what problem it solves and how it does it (features).
Audience is non-technical people
Not a Kedro topic, but one to highlight our expertise/thought leadership. It's low-ish priority in many ways since it doesn't touch Kedro, but we already have content since @deepyaman has already shared it.
Maybe something he could publish on our blog as a way to start a fire/fight on Python subreddits at some point and take us viral 🤣
This captures the creation of a deck to promote and encourage awareness and adoption of Kedro Experiment Tracking. It is part of a wider Kedro Experiment Tracking product marketing strategy #38.
The goal is to create a deck highlighting what Kedro Experiment Tracking is, what problem it solves, and how it solves it (features).
Audience is both technical and non-technical people.
Here's my list so far:
Potentially interesting events:
Bottom line: meet our users where they are.
This is a child task of #13 and covers creation of MVP slides about version control for Kedro Academy.
(Drawing from some ideas by @stichbury)
Kedro channels:
Possible third-party channels:
Guest channels:
Obviously I was overly comprehensive here and I don't think we should strive to be everywhere, nor re-posting content across channels. Instead, we should pick a few of those channels based on
For example: discarding Instagram and LinkedIn, picking Mastodon over Twitter, choosing YouTube over Twitch, holding on Reddit and TikTok for now. Hence long-form blogging + microblogging + occasional video for now. To be discussed.
This is a child task of #13 and covers creation of MVP slides about using VSCode for basics of development as part of the Kedro Academy course.
As per https://kedro-org.slack.com/archives/C03RKP2LW64/p1671548554220589 -- it's clearly causing some problems. We have found in the past that it's hard to keep up with the release cycle of GE so we don't have an example of using GE with Kedro that's in a known state, and we're reluctant to commit to having one, plus docs, since it will be an ongoing overhead.
We could, however, write a blog post to illustrate usage at a particular point in time.
It would be super awesome to put this on the GE blog perhaps, or have them author with us. Something to consider in Q1 2023 -- maybe something for Juan Luis to add to his queue, otherwise I can do it.
This involves moving the ‘What are the primary advantages of Kedro?’ section from the FAQ on kedro docs to the kedro website.
This would clearly highlight the value proposition of Kedro to Data Scientists, Machine Learning Engineers, Data Engineers, and Project Leads, increasing awareness and adoption.
This follows a conversation in the Software Engineering for Data Scientists course about how users learn about Kedro and software engineering principles. This ticket proposes the creation of an Introduction to Kedro training on YouTube.
We've been running "Kedro Beginner Bootcamp" trainings for many years now and this model is not ideal because:
So we need to think of ways to scale learning about Kedro beyond the capabilities of the team and this will impact all users.
We should create an Introduction to Kedro training on YouTube. Some of this work includes:
Therefore, we believe that this will help drive adoption of Kedro.
This is a child task of #13 and covers creation of materials to help reinforce the learnings:
These are repos that use Kedro and the authors potentially could be approached to write a case study about why they chose Kedro, what they're doing etc.
Our interview with Waylon was a nice read but we could improve the narrative flow, bring it up to date and extend it to maybe include some of the Notebook refactoring discussions we've had.
This may or may not mean we also pull in material from "Power is nothing without control", which I'd also like to rework as a blog post. Let's assume for now this is all one post rather than two.
I've checked and nobody seems to be using the kedro-training repo content, which is dated but may be useful for future courses. Let's archive it for now and start adopting kedro-academy as the de facto location for publicly-shared decks and training graphics.
This is a child task of #13 and covers creation of MVP slides about working with a CLI for Kedro Academy.
Based on @astrojuanlu's upcoming talk:
“Analyze your data at the speed of light with Polars and Kedro”
Abstract: “The pandas library is one of the key factors that enabled the growth of Python in the Data Science industry and continues to help data scientists thrive almost 15 years after its creation. Because of this success, nowadays several open-source projects claim to improve pandas in various ways. Polars is one of those new dataframe libraries: it’s backed by Arrow and Rust, and offers an expressive API for dataframe manipulation with excellent performance. In this webinar I will show you how to combine Polars for your data manipulation needs with Kedro, a data science framework that will help you write more maintainable code.”
This is a child task of #13 and covers creation of MVP slides about configuration files for Kedro Academy.
A press kit is a page on your website that makes it incredibly easy for journalists to learn about your brand and product, and access media assets (photos and videos) to use in their content or articles about your brand/product.
This provides journalists with the latest, accurate, and relevant information.
It should contain: an Overview (including Story and Mission), Product Information, Contact Details (including separate media email address), Media Assets, Logos, and Others (e.g. notable awards and quotes).
Here are some press kit examples.
Providing a press kit would drive brand awareness of Kedro.
kedro-org/kedro#2274 made me remember that not only do we have Kedro users, but we also have potential contributors/developers (including both people who want to contribute to the core projects and people extending Kedro with plugins). We should have a distinct and explicit comms/docs strategy for contributors, and also think about in what circumstances a Kedro user could become a contributor, or the other way around.
Currently, our documentation for deploying Kedro on Databricks is heavily based on AWS Databricks. Many of our users use Azure Databricks. We should define and document specific recommendations for these users.
This issue is a child of kedro-org/kedro#2185.
The documentation written for this ticket could be a subsection of a new docs section dedicated to Databricks deployment.
This captures the creation of a video walkthrough to promote and encourage awareness and adoption of Kedro-Viz. It is part of a wider product marketing strategy for Kedro-Viz.
The goal is to create a video walkthrough highlighting what Kedro-Viz is, what problem it solves, and how it solves it (features).
These key features include:
Audience is both technical and non-technical people.
It is common to see misconceptions around which category Kedro should fall into across blog posts, data science articles, and industry rankings.
These misconceptions include Kedro being seen as an orchestrator, a platform like SageMaker, or belonging to a general category like ML or AI, as seen here and here.
The goal is to define the right category for Kedro (e.g. development framework), and position Kedro there, starting with a blog post.
This misperception determines whether and how users implement Kedro in their projects.
Secondly, during client projects, if clients think Kedro is a platform, they may be reluctant to use it for fear of 'vendor lock-in', or because they are already locked in with another vendor.
Effectively communicating our unique positioning and category would drive adoption of Kedro.
This is a bit of a woolly remit but a set of blog posts about how the open source community and big business can (and do) work together would be good. I'm constantly mindful that McKinsey doesn't seem like a natural fit with an open source project and I think it would be helpful for us to build a set of content that reinforces the way consultancies, tech companies and industry use and contribute to open source. We need to demonstrate a serious belief in our community and in being part of other communities.
This is a parent ticket to create a set of posts/issues for ideas about posts.
How the two compare. How the two complement each other.
Based on some discussion following this discussion on hacker news and some ongoing discussion over on the Hamilton Slack
We used Kedro at MoovAI. The standardized structure is reaaally valuable in consulting where team members change over the course of a project! The folks at potloc like it a lot and presented it at the most recent Montréal MLOps community event!
While using Kedro, I wanted to create modular functional code for data transformation, but creating a node for each function would require me to specify input-output for each node. In addition, if the output of these nodes would be pandas Series, I would have to assemble them manually at intermediary steps.
That's when I learned about Hamilton, which exactly met my needs for quick iterations of data processing pipelines with little/no boilerplate. I ended up calling Hamilton within a single Kedro node! (similar to Metaflow+Hamilton)
I think one of the main appeal of Kedro for orgs is the visualization tool that encompasses functions, data, code, experiments, etc. (+MLFlow and Airflow plugins). Integrating Kedro-viz with other DAG tools could be very exciting for users to have full visibility of their ETL pipeline. For example, at the MLOps meetup, someone asked if it would be possible to plug their Airflow ETL (upstream of the data science Kedro project) into the tool!
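The pattern described in the quote above (Hamilton-style column functions running inside a single Kedro node) can be sketched in a few lines: each function's name is the column it produces, and its parameter names are the columns it consumes. Plain lists stand in for pandas Series here; real Hamilton has its own Driver API, so this is illustration only.

```python
import inspect
from typing import Callable, Dict, List

def resolve(
    columns: Dict[str, List],
    funcs: Dict[str, Callable],
) -> Dict[str, List]:
    """Compute derived columns from a micro-DAG of functions.

    Function name = output column; parameter names = input columns.
    This mirrors the Hamilton convention, without the Hamilton library."""
    available = dict(columns)

    def compute(name: str) -> List:
        if name in available:
            return available[name]
        fn = funcs[name]
        # Recursively compute each input the function declares
        args = {p: compute(p) for p in inspect.signature(fn).parameters}
        available[name] = fn(**args)
        return available[name]

    for name in funcs:
        compute(name)
    return available

def total(price, quantity):
    return [p * q for p, q in zip(price, quantity)]

result = resolve({"price": [2, 3], "quantity": [10, 1]}, {"total": total})
print(result["total"])  # [20, 3]
```

Wrapped in one Kedro node, this gives quick iteration on column-level transforms without declaring a Kedro node (and its inputs/outputs) per function.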
The Kedro principles page is a nice enough page but doesn't really work in documentation. Let's move it out into a post or page.
Or, failing that, a topic page that sits on the website (but isn't in the nav, so you can find it organically but we don't push it to the user).
This captures all the product marketing activities to promote and encourage awareness and adoption of Kedro Experiment Tracking.
This includes updating the documentation, preparing a blog post, presentation, and webinar, and distributing the content beyond the existing Kedro slack organisation.
The goal is to increase adoption of Kedro Experiment Tracking and Kedro-Viz, especially among existing Kedro users, by 30%.
This is targeted at (DS, DE, and low-tech) users with a two-pronged approach:
This would include:
This content would then be repurposed for different forms and mediums.
Admin:
Key Activities:
Joel wrote a nice blog post about data layers (I remember his awesome presentation about this in 2021) which we put on Towards Data Science: https://towardsdatascience.com/the-importance-of-layered-thinking-in-data-engineering-a09f685edc71
We have a little about it in the FAQ too, but I'm removing it because it doesn't work as Kedro documentation. It is useful knowledge for someone in the field, but it doesn't tell you about Kedro (so much as advocate for Kedro's opinionated data storage).
I think we should take the blog post we published and maybe repurpose it into a post for the new Kedro blog, then point to that when we need to demonstrate the reason for layering.
Bruce Philp and Guilherme Braccialli are the brains behind a layered data-engineering convention as a model of managing data.
Refer to the table below for a high-level guide to each layer's purpose.
The data layers don't have to exist locally in the data folder within your project, but we recommend that you structure your S3 buckets or other data stores in a similar way.

| Folder in data | Description |
|---|---|
| Raw | Initial start of the pipeline, containing the sourced data model(s) that should never be changed; it forms your single source of truth to work from. These data models are typically un-typed, e.g. csv, but this will vary from case to case. |
| Intermediate | Optional data model(s), introduced to type your raw data model(s), e.g. converting string-based values into their correct typed representation. |
| Primary | Domain-specific data model(s) containing cleansed, transformed and wrangled data from either raw or intermediate; this forms the layer you feed into your feature engineering. |
| Feature | Analytics-specific data model(s) containing a set of features defined against the primary data, grouped by feature area of analysis and stored against a common dimension. |
| Model input | Analytics-specific data model(s) containing all feature data against a common dimension and, in the case of live projects, against an analytics run date, to ensure that you track the historical changes of the features over time. |
| Models | Stored, serialised pre-trained machine learning models. |
| Model output | Analytics-specific data model(s) containing the results generated by the model based on the model input data. |
| Reporting | Reporting data model(s) that combine primary, feature, model input and model output data to drive dashboards and the views constructed. This encapsulates and removes the need to define any blending or joining of data, improves performance, and allows replacement of the presentation layer without redefining the data models. |
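A blog post could make the convention concrete with the numbered folders Kedro's project template uses for these layers (the numbering reflects pipeline order). The small helper below is a sketch for mapping a layer name to its conventional path, e.g. when structuring an S3 bucket the same way as the local data folder.

```python
from pathlib import Path

# Numbered folder names used by the Kedro project template for the
# layers in the table above.
LAYER_FOLDERS = {
    "raw": "01_raw",
    "intermediate": "02_intermediate",
    "primary": "03_primary",
    "feature": "04_feature",
    "model_input": "05_model_input",
    "models": "06_models",
    "model_output": "07_model_output",
    "reporting": "08_reporting",
}

def layer_path(layer: str, base: str = "data") -> Path:
    """Resolve a layer name to its conventional folder under `base`."""
    return Path(base) / LAYER_FOLDERS[layer]

print(layer_path("raw").as_posix())  # data/01_raw
```

Swapping `base` for an S3 prefix (e.g. "s3://my-bucket") applies the same layout to remote storage.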