TidyX

Hosts

Ellis Hughes and Patrick Ward.

Ellis has been working with R since 2015 and has a background working as a statistical programmer in support of both Statistical Genetics and HIV Vaccines. He also runs the Seattle UseR Group.

Patrick's current work centers on research and development in professional sport, with an emphasis on data analysis in American football. Previously, he was a sport scientist within the Nike Sports Research Lab. His research interests include training and competition analysis as they apply to athlete health, injury, and performance.

Description

The goal of TidyX is to explain how R code works. We are focusing on submissions to the #TidyTuesday Project to help promote the great work being done there.

In this repository, you will find copies of the code we've explained, and the code we wrote to show the concept on a new dataset.

To submit code for review, email us at [email protected]

To watch more episodes, go to our YouTube channel.

Patreon

If you appreciate what we are doing and would like to support TidyX, please consider signing up to be a patron through Patreon.

https://www.patreon.com/Tidy_Explained

TidyX Episodes

  • Episode 1: Introduction and Treemaps!

  • Episode 2: The Office, Sentiment, and Wine

  • Episode 3: TBI, Polar Plots and the NBA

  • Episode 4: A New Hope, {Patchwork} and Interactive Plots

  • Episode 5: Tour de France and {gganimate}

  • Episode 6: Lollipop Charts

  • Episode 7: GDPR Faceting

  • Episode 8: Broadway Line Tracing

  • Episode 9: Tables and Animal Crossing

  • Episode 10: Volcanoes and Plotly

    • Ellis and Patrick explore this week's #TidyTuesday dataset!
  • Episode 11: Time Series and Bayes

  • Episode 12: Cocktails with Thomas Mock

  • Episode 13: Marble Races and Bump Plots

  • Episode 14: African American Achievements

  • Episode 15: Juneteenth and Census Tables

    • Ellis and Patrick show US Census tables in a report, broken down into divisions, and highlight values using {colortable}
    • Source Code
  • Episode 16: Caribou Migrations and NBA Shots on Basket

  • Episode 17: Uncanny X-men and Feature Engineering

  • Episode 18: Coffee and Random Forest

  • Episode 19: Astronauts and Dashboards

  • Episode 20: Cocktails with David Robinson

  • Episode 21: The Birds

  • Episode 22: European Energy and Ball Hogs

  • Episode 23: Mailbag and Expected Wins

    • Ellis and Patrick go into the mailbag and focus on a recent request about loops and functions.
    • Source Code
  • Episode 24: Waffle plots and Shiny

  • Episode 25: Intro To Shiny

    • This is the start of a series of episodes covering more in-depth uses of {shiny}, an R package for creating web applications by Joe Cheng. In this episode we cover the basics of Shiny and explain the concept of reactive programming.
    • Source Code
  • Episode 26: Labels and ShinyCARMELO - Part 1

  • Episode 27: LIX and ShinyCARMELO - Part 2

  • Episode 28: Nearest Neighbors and ReactiveValues

    • This week Ellis and Patrick explore how to perform career analysis and projections using the KNN algorithm. Using those concepts, we jump into part three of our Shiny demo series, where we have Shiny execute a KNN for the input players. We show how to create an action button to execute the code, and how to use reactiveValues to store the results for plotting (a minimal sketch of this pattern follows below)!
    • Source Code
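
A minimal sketch of the actionButton/reactiveValues pattern described above - assuming {class} for the KNN and the built-in iris data as a hypothetical stand-in for the player data:

```r
library(shiny)
library(class) # provides knn()

ui <- fluidPage(
  numericInput("k", "Number of neighbors", value = 5, min = 1),
  actionButton("run", "Run KNN"),
  plotOutput("result")
)

server <- function(input, output, session) {
  # reactiveValues persists results between button presses
  rv <- reactiveValues(pred = NULL)

  # only run the KNN when the button is clicked
  observeEvent(input$run, {
    train_idx <- sample(nrow(iris), 100)
    rv$pred <- knn(
      train = iris[train_idx, 1:4],
      test  = iris[-train_idx, 1:4],
      cl    = iris$Species[train_idx],
      k     = input$k
    )
  })

  output$result <- renderPlot({
    req(rv$pred) # wait for at least one button press
    plot(rv$pred, main = "Predicted classes for held-out rows")
  })
}

shinyApp(ui, server)
```
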
  • Episode 29: Palettes and Random Effects

  • Episode 30: Tweet Sentiment

    • Patrick and Ellis were inspired by all the sentiment analysis performed for #TidyTuesday this week, so we decided to look at tweets to show, and comment on, additional things to be aware of when doing sentiment analysis. Using {rtweet}, we pull over 50,000 tweets that used #Debate2020 and discuss how context is incredibly important to analysis.
    • Source Code
  • Episode 31: Reactable

    • This week's #TidyTuesday dataset was on NCAA Women's Basketball Tournament appearances. Patrick and Ellis have shown in the past how tables can be used for data visualization, and wanted to learn another table package. {reactable} is a really cool-looking package, so we spend some time showing how to use it, apply column definitions, and even embed HTML widgets within the table!
    • Source Code
  • Episode 32: Shiny with Eric Nantz

    • This week's #TidyTuesday dataset was a super fun one. Ellis and Patrick are joined by Eric Nantz, who created a shiny app to explore and animate the data. We talk through several new shiny concepts, like using {golem}, {crosstalk}, and other shiny packages like {bs4Dash}!

    • UseR Highlighted: Eric Nantz

    • Source Code

  • Episode 33: Beer and State Maps

  • Episode 34: Wind and Maps

  • Episode 35: Rectangles

  • Episode 36: Animated Plotly

    • This week's #TidyTuesday dataset was on mobile and landline subscriptions across the world. We saw lots of animated plots this week and wanted to add our own. Using {plotly}, we make an interactive plot that animates across time to show how GDP relates to the raw subscription numbers. We also do some exploration with line plots.
    • Source Code
  • Episode 37: Code Review

    • Looking back at one's code can show you just how far you have come. Sparked by a conversation between Ben Baldwin (@benbaldwin), Patrick, and Ellis, this week's episode is about code review and refactoring. Ben dug into his past and furnished a set of code for us to try to refactor. In the spirit of things, neither of us looked closely at the code ahead of time, and we recorded our initial reactions and the process of refactoring Ben's code into a function that can be applied to multiple datasets!
    • UseR Highlighted: Ben Baldwin
    • Original Tweet
    • Tweet Source Code
    • TidyX Source Code
  • Episode 38: Polar Plots

  • Episode 39: Imputing Missingness

    • This week we reach into our mailbag to answer a request from Eric Fletcher (@iamericfletcher) on imputing NAs. In this video we scrape 2013 draft data and use various techniques to impute missing times for the three-cone event. We also attempt to discuss Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR) - but we decide at the end to leave it to the professionals.
    • Source Code
  • Episode 40: Inspiring Women and Plotly

  • Episode 41: Worm Charts with Alice Sweeting

    • Alice Sweeting (@alicesweeting) joins us as a guest explainer this week! We are very excited to have her on as she walks us through how she created a worm chart of a Super Netball game. She talks through common techniques she uses to process data, mixing base R with the tidyverse. Then we spend some time discussing Alice's background, current role, and advice for folks looking to get started in sports analytics or R programming in general.

    • UseR Highlighted: Alice Sweeting

    • Source Code

  • Episode 42: Highlighting Lines

  • Episode 43: Funnel Plots, Plotly, and Hockey

    • With no #TidyTuesday dataset this week, we continue working through our learning of plotly - this time using a tool known as a funnel plot.
    • Source Code
  • Episode 44: Transit Costs, steps, and Plotly Maps

  • Episode 45: NHL Pythagorean Wins and Regression

    • This week we reflect back on the past year and combine techniques from multiple episodes. We scrape multiple tables from the Hockey Reference website, use regular expressions to clean and organize the data, and use for loops to determine the optimal Pythagorean win exponent. We visualize the data using several different techniques, like scatter and lollipop charts. We show some fun tools for regularizing values in linear regressions and how to predict and visualize the results.
    • Source Code
  • Episode 46: Circle Plots, NHL Salaries, and Logistic Regression

  • Episode 47: NHL Win Probabilities and GT Tables

    • This week we play with a new technique for optimization: the optim() function! We scrape the 2019-2020 NHL season to generate power rankings for every NHL team, plus a home-ice edge, which we can use to predict each team's win probability (a minimal sketch of the optim() idea follows below). We then combine that with season summary data to generate a pretty {gt} table!
    • Source Code
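
A minimal sketch of the optim() idea, fitting toy team ratings plus a home-ice edge by minimizing a logistic loss. The teams and results below are made up, not the episode's scraped data:

```r
set.seed(42)
teams <- c("BOS", "TBL", "COL", "VGK")
games <- data.frame(
  home = sample(teams, 40, replace = TRUE),
  away = sample(teams, 40, replace = TRUE)
)
games <- subset(games, home != away)
games$home_win <- rbinom(nrow(games), 1, 0.55) # fake outcomes

# negative log-likelihood of a simple logistic model:
# P(home win) = plogis(home rating - away rating + home-ice edge)
neg_log_lik <- function(par) {
  ratings  <- par[seq_along(teams)]
  home_adv <- par[length(par)]
  p <- plogis(ratings[match(games$home, teams)] -
                ratings[match(games$away, teams)] + home_adv)
  -sum(games$home_win * log(p) + (1 - games$home_win) * log(1 - p))
}

# ratings are only identified relative to each other; optim still finds a fit
fit <- optim(par = rep(0, length(teams) + 1), fn = neg_log_lik)
setNames(round(fit$par, 2), c(teams, "home_ice"))
```
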
  • Episode 48: NBA Point Simulations

    • In this episode we show how to scrape the current NBA season's scores and build a simple game simulator. Using {purrr} with some base R functions, we generate outputs and show how to simulate thousands of games to generate outcome predictions.
    • Source Code
  • Episode 49: MLB Batting Simulations

    • We continue looking at simulations this week, but this time for individual players. Using {Lahman}, we pull the 2019 MLB player batting stats and visualize them using histograms and density plots. Next, to generate confidence intervals around batting averages, we use rbinom() combined with techniques from the {tidyverse} to make simulation easy (a minimal sketch follows below). Finally, we visualize the data using {gt} combined with {sparkline}.
    • Source Code
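
A minimal sketch of the rbinom() simulation idea, using a made-up hitter (150 hits in 500 at bats) rather than the {Lahman} data:

```r
library(dplyr)

hits <- 150
at_bats <- 500
obs_avg <- hits / at_bats

sims <- tibble(
  sim = 1:10000,
  # simulate a full season of at bats at the observed hit rate
  sim_hits = rbinom(10000, size = at_bats, prob = obs_avg),
  sim_avg  = sim_hits / at_bats
)

# a simple 95% simulation interval around the batting average
sims %>%
  summarise(
    lower = quantile(sim_avg, 0.025),
    upper = quantile(sim_avg, 0.975)
  )
```
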
  • Episode 50: MLB Batting Simulations

    • Another MLB batting episode. This time we use the James-Stein estimator (paper below) to apply a shrinkage estimate to player batting averages and get a "true" estimate with the luck removed (a minimal sketch follows below). Using {Lahman}, we pull the 2018 MLB player batting stats and explain how to implement the estimator. Next, we compare the estimates against the 2019 season. Finally, we visualize the data using {gt} with header spans and cell styling. For the grand finale, we combine this {gt} table of batting averages with plots using {patchwork}!
    • Source Code
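
A minimal sketch of the James-Stein shrinkage calculation on made-up batting averages, assuming every player has the same number of at bats:

```r
avgs <- c(0.345, 0.310, 0.275, 0.260, 0.240, 0.215) # hypothetical players
at_bats <- 400

grand_mean <- mean(avgs)
# binomial sampling variance of an individual batting average
sigma2 <- grand_mean * (1 - grand_mean) / at_bats

# shrinkage factor: how far to pull each player toward the grand mean
c_shrink <- 1 - (length(avgs) - 3) * sigma2 / sum((avgs - grand_mean)^2)

js_estimate <- grand_mean + c_shrink * (avgs - grand_mean)
round(data.frame(observed = avgs, james_stein = js_estimate), 3)
```
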
  • Episode 51: Deploying Models with Shiny

    • Sharing the results of a modeling effort is an important skill for any data scientist. However, just sharing the weights of each predictor is often not enough to get buy-in from stakeholders who are understandably skeptical of your results. Using the power of shiny, you can show your stakeholders exactly how your model interprets inputs and then predicts the results. In this episode, we use the {palmerpenguins} package with {randomForest} to build a model that predicts the species of a new penguin. With shiny, we then deploy the model so users can record a new penguin's attributes and see whether the model thinks it is an Adelie, Chinstrap, or Gentoo! The output is a boxplot indicating the model's probability for each species given the inputs.
    • Source Code
  • Episode 52: Too Many Gentoo with Xaringan

    • There are too many Gentoo, your PI proclaims. In this week's episode, Patrick and Ellis talk about how to use the {xaringan} package to produce reproducible HTML presentations using Rmarkdown syntax. We discuss how we looked at "raw" tech data and used summary statistics to compare against the gold-standard {palmerpenguins} package from Dr. Allison Horst and Dr. Alison Hill, with data from Dr. Kristen Gorman. We use last week's highly powerful machine learning model to generate predictions of species, and build a confusion matrix of our data vs. the predictions. Finally, we talk about the value of basing your presentation on Rmd and being able to update it at the click of a button.
    • Source Code
  • Episode 53: MLB Pitch Classification Introduction

    • This week we start a series on using machine learning to automate pitch classification. In this first episode, we discuss ways to start looking at your data and the questions to formulate. We use hierarchical clustering in a few different ways to start seeing relationships between the different pitch types and the statistics captured around each pitch (a minimal sketch follows below)!
    • Source Code
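
A minimal sketch of hierarchical clustering on numeric features, using mtcars as a stand-in for the pitch statistics:

```r
dists <- dist(scale(mtcars))           # scale the features, then compute distances
hc <- hclust(dists, method = "ward.D2") # agglomerative clustering
plot(hc)                                # dendrogram of the relationships
cutree(hc, k = 4)                       # assign each row to one of 4 clusters
```
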
  • Episode 54: MLB Pitch Classification 2 - KNN, Caret and UMAP

    • In the second episode on using machine learning to automate pitch classification from PitchF/X data, we apply the k-nearest-neighbors algorithm as our first attempt at classification. We start by using the results from our naive hierarchical clustering to select 4 groups and apply the KNN algorithm. We then look at how to evaluate the performance of the model, both by total misclassification and within-class misclassification. Then we use {caret} to optimize for the best clustering and compare the results. Finally, we use UMAP to perform dimensionality reduction, visualizing multiple dimensions as two and viewing relationships within the clusters.
    • Source Code
  • Episode 55: MLB Pitch Classification 3 - Decision Trees, Random Forests, optimization

    • For the third episode in the series on using machine learning to automate pitch classification from PitchF/X data, we talk about decision trees and their famous variant: random forests. We start by discussing what a decision tree is and its value. We visualize the results and discuss the quality of the fit. Then we expand on decision trees using the random forest algorithm and discuss its performance. Finally, we use {caret} and {doParallel} to do a grid search for the optimal mtry, using parallel processes to speed up the search!
    • Source Code
  • Episode 56: MLB Pitch Classification 4 - XGBoost

    • We now turn to the famous XGBoost algorithm for the fourth episode in the series on using machine learning to automate pitch classification from PitchF/X data. We start by training with default parameters and show some tricks to make training faster. Then we use {caret} and {doParallel} to do a grid search for the optimal training settings, and discuss the merits and disadvantages of using ever more complicated ML models.
    • Source Code
  • Episode 57: MLB Pitch Classification 5 - Naive Bayes Classification

    • We naively turn to Bayes... okay, I'm done. In this episode we use the Naive Bayes classifier from the {e1071} package to classify pitches from our PitchF/X data. We briefly discuss how the algorithm works, and review its performance against the tree-based algorithms we've used so far.
    • Source Code
  • Episode 58: MLB Pitch Classification 6 - TensorFlow

    • The next model type is one that has generated a lot of excitement over the last decade with the promise of "AI": deep learning. Using the {keras} package from RStudio, we attempt to train a model to automate pitch classification from PitchF/X data. We talk about the differences to consider when building a deep learning model and the data prep that must be done. We finally review the results and talk a bit about black-box ML models.
    • Source Code
  • Episode 59: MLB Pitch Classification 7 - Class Imbalance and Model Evaluation Intro

    • Throughout this series, we've been attempting to predict pitch type using PitchF/X data. However, we have not directly addressed a major flaw in our data: class imbalance. The four-seam fastball makes up nearly 37% of our data! In this episode we apply a couple of techniques to help address the class imbalance and look at ways to evaluate our models' performance. We talk about the pros and cons to consider, and set up for the last episode of the series.
    • Source Code
  • Episode 60: MLB Pitch Classification 8 - Model Evaluation and Visualization

    • This week we apply everything we have learned over the last several weeks to pick the best model for our project. As a reminder, we are attempting to predict pitch type using a subset of PitchF/X data. We productionalize our evaluations by writing a series of functions that allow quick iteration across multiple input types and capture the relevant information. Finally, we visualize the evaluations using two {gt} tables. Thank you all so much for joining us for this mini-series on ML models and being with us as we hit episode 60. This has been a wonderful ride!
    • Source Code
  • Episode 61: Data Cleaning - Regular Expressions

    • Okay, we've gotta say it - there is nothing "regular" about regular expressions. BUT that does not mean they are not an incredibly valuable tool in your programming toolbox. In this episode we go through how to apply regular expressions to a dataset and talk through some of the common tokens you might use (a few are sketched below).
    • Source Code
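
A minimal sketch of a few common regex tokens, applied to a made-up character vector:

```r
players <- c("LeBron James 38pts", "Stephen Curry 29pts", "Joel Embiid 41pts")

# \\d+ matches one or more digits; extract the point totals
regmatches(players, regexpr("\\d+", players))

# ^ anchors to the start of the string; (\\w+) captures the first word
sub("^(\\w+).*", "\\1", players)

# $ anchors to the end; strip the trailing points label
gsub(" \\d+pts$", "", players)
```
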
  • Episode 62: Data Cleaning - REGEX applied & stringr

    • This week we continue using regex, this time applying it to generate data for plots. Additionally, we discuss techniques such as grouping, and use the {stringr} package for its str_* variants of the base R regex functions.
    • Source Code
  • Episode 63: Data Cleaning - REGEX lookarounds & Player Gantt Charts

    • We lookaround with regex this week, showing an alternative approach to setting anchors in your regular expressions using lookarounds. We apply this to extracting player substitutions, then calculate the number of stints and their durations to create a player Gantt chart for Game 2 of the Eastern Conference Playoffs between the Miami Heat and Milwaukee Bucks.
    • Source Code
  • Episode 64: Data Cleaning - Ugly Excel Files Part 1

    • Ugly data. Ugly Excel data. That's pretty common to come across as a data scientist. People unfamiliar with how to format data are often the ones creating the Excel files you work with. This week, Patrick and Ellis talk through some techniques to handle these files and turn them into usable data. Patrick wrote up this week's example, parsing through the data to generate a nice data.frame from the ugly Excel file.
    • Source Code
  • Episode 65: Data Cleaning - Ugly Excel Files Part 2

    • This week Ellis works through the ugly Excel file, writing the code live as he goes and explaining how to break the parsing into nice, bite-sized pieces and generalize them. Patrick is there asking questions and clarifying how things work. At the end of the cast they end up with similar data.frames, ready to munge for final processing.
    • Source Code
  • Episode 66: Data Cleaning - Ugly Excel Files Part 3

    • Now that we have the Excel file in a nice format, we go over the final pieces of processing to turn the incorrectly formatted fields into usable data. We talk about generating date objects, ifelse() vs. if_else() (a minimal sketch of the difference follows below), and have some fun!
    • Source Code
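
A minimal sketch of the ifelse() vs. if_else() difference with dates:

```r
library(dplyr)

dates <- as.Date(c("2021-01-01", "2021-06-15"))
floor_date <- as.Date("2021-03-01")

# base ifelse() silently strips the Date class and returns bare numbers
ifelse(dates > floor_date, dates, floor_date)

# dplyr::if_else() preserves the Date class and enforces matching types
if_else(dates > floor_date, dates, floor_date)
```
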
  • Episode 67: Data Cleaning - Viewer Submitted Excel File

    • For the first time in over a year and 65 episodes, Ellis and Patrick are in the same room! This week they work on a viewer-submitted Excel file. After last week's episode, we put out a call to our viewers to submit the ugly data they see so we can try to help. GitHub user MikePrt submitted a file from the UK's Office for National Statistics (ONS) as an example. We extract the data and produce a simple plot.
    • Source Code
  • Episode 68: Data Cleaning - Ugly Excel Files Part 4 - Saving Outputs

    • We continue our series on data cleaning and discuss sharing your outputs. Patrick and Ellis go over a few output file formats and two different Excel packages that give you differing levels of control over the outputs.
    • Source Code
  • Episode 69: Modern Pentathlons with Mara Averick

    • Ellis and Patrick are joined today by Mara Averick, a Developer Advocate at RStudio. We conclude our series on messy Excel data by talking through cleaning an Excel file from the UIPM and reasoning out what the fields and scoring are. Then we talk about Mara's role, career history, and the advice she has for our viewers.
    • UseR Highlighted: Mara Averick
    • Source Code
  • Episode 70: Databases with {dplyr}

    • Making friends with your friendly database administrator is a great way to improve your effectiveness as a data scientist in your organization. But what do you do if you don't know any SQL? We present {dbplyr} by the folks at RStudio, which lets you easily connect to, interact with, and send queries to databases using familiar dplyr syntax and commands (a minimal sketch follows below).
    • Source Code
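
A minimal sketch of the {dbplyr} workflow, using an in-memory SQLite database (assumes {RSQLite} is installed) so the example is self-contained:

```r
library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# tbl() creates a lazy reference; dplyr verbs are translated to SQL
cars_db <- tbl(con, "mtcars")

query <- cars_db %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE))

show_query(query) # inspect the generated SQL
collect(query)    # run the query and bring the results into R

dbDisconnect(con)
```
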
  • Episode 71: Databases in R | Exploring Your Database with NBA data

    • Being handed a database without knowing its contents or where to start can be daunting. We talk about techniques you can use to start exploring it just like any other dataset: getting a list of the tables in the database, pulling the column names, and writing SQL to get the head of a table.
    • Source Code
  • Episode 72: Databases in R | Shiny and Databases

    • The fastest way for a data scientist to multiply their impact is to enable their customers to do the analysis themselves (with guardrails, of course). Shiny provides a great user interface; combining it with some basic queries your clients may want improves response time and lets them search to their heart's content. This week we show a simple way to add interactivity with your database, using {shiny} to query teams' mean point differential at home across the 2001-2002 season.
    • Source Code
  • Episode 73: Databases in R | Shiny,Databases, and Reactive Polling

    • Now that we have a shiny app that allows users to access and interact with the data in our database, how do we make sure the UI is showing the most up-to-date information for selection? This is done through reactive polling - a timed check that looks for updates to the database and refreshes the UI selection interface accordingly. We discuss the benefits and show how to use the reactivePoll function combined with an observeEvent function to really supercharge our shiny app (a minimal sketch follows below)!
    • Source Code
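
A minimal sketch of the reactivePoll() pattern; the database file and teams table here are hypothetical stand-ins:

```r
library(shiny)
library(DBI)

ui <- fluidPage(
  selectInput("team", "Team", choices = NULL)
)

server <- function(input, output, session) {
  con <- dbConnect(RSQLite::SQLite(), "app_data.sqlite") # hypothetical file
  onStop(function() dbDisconnect(con))

  # every 5 seconds run the cheap check query; only when its result changes
  # does the (potentially expensive) value query re-run
  teams <- reactivePoll(
    intervalMillis = 5000,
    session = session,
    checkFunc = function() dbGetQuery(con, "SELECT COUNT(*) AS n FROM teams"),
    valueFunc = function() dbGetQuery(con, "SELECT DISTINCT team FROM teams")
  )

  observeEvent(teams(), {
    # refresh the dropdown whenever new teams appear in the database
    updateSelectInput(session, "team", choices = teams()$team)
  })
}

shinyApp(ui, server)
```
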
  • Episode 74: Databases with R | Joins in SQL vs Local

    • Continuing the SQL/database saga, we look at joins. We scrape a bunch of play-by-play information and game info and generate a database with it. We then compare the speed of joining tables locally versus within the SQL database!
    • Source Code
  • Episode 75: Databases with R | Joins, databases, and commits in Shiny

    • Now that we have a database full of data, and a shiny app to play with it, how do we capture and share the information across our users using the database? In this episode we share how we might create a sample database filled with play-by-play NBA data and create a shiny app to allow a coach or SME to review and add comments to the data as they review it. Then, they can decide to commit and save their thoughts for the future!
    • Source Code
  • Episode 76: Databases with R | Polling databases in Shiny

    • In Episode 75 we introduced the idea of committing changes from a shiny app to a database. But what about scenarios with multiple users? Ellis and Patrick explore an idea for polling the database and pushing updates that were committed to it out to the active views of the other users. We use reactive polling, as introduced in Episode 73, along with updating reactiveValues.
    • Source Code
  • Episode 77: Tidymodels - LM

    • tidymodels is an ecosystem of packages from RStudio (Max Kuhn and Julia Silge, to name a few authors) designed to help folks apply good modeling practices from cleaned data all the way to a fully productionalized model. We are going to step through and learn how to apply tidymodels together. The first episode is on applying a simple linear model versus the base R method (a minimal sketch of the comparison follows below)!
    • Source Code
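
A minimal sketch of the comparison: the same linear model fit with base R and with tidymodels, using mtcars as a stand-in dataset:

```r
library(tidymodels)

# base R
base_fit <- lm(mpg ~ wt + hp, data = mtcars)

# tidymodels: declare the model spec, then fit it
tm_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ wt + hp, data = mtcars)

coef(base_fit)
tidy(tm_fit) # same coefficients, returned as a tidy tibble
```
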
  • Episode 78: Tidymodels - Splits and Recipes

    • tidymodels is an ecosystem of packages from RStudio (Max Kuhn and Julia Silge, to name a few authors) designed to help folks apply good modeling practices from cleaned data all the way to a fully productionalized model. In the second episode we discuss how to set up your test/train splits as well as data preprocessing using the {recipes} package in conjunction with {workflows}! This smooths out and applies good practices simply and effectively, making data prep for modeling a breeze.
    • Source Code
  • Episode 79: Tidymodels - Cross-validation and Metrics

    • In the third episode on tidymodels, we continue our data prep and model training by exploring cross-validation and metric evaluation. Ellis and Patrick show how to set up a 5-fold cross-validation set on your training split, as well as how to fit a tidymodels workflow! We finally show how to display and extract model evaluation metrics.
    • Source Code
  • Episode 80: Tidymodels - Decision Trees and Tuning

    • In the fourth episode on tidymodels, we sort out how to tune a model's parameters using the tune package (a minimal sketch follows below). We set up a grid to train across and select the best model based on model metrics. We then retrain this model on the full training set and evaluate its performance against the final test set.
    • Source Code
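
A minimal sketch of grid tuning with {tune}, using a decision tree on mtcars as a stand-in:

```r
library(tidymodels)

folds <- vfold_cv(mtcars, v = 5)

# mark the parameters we want tuned rather than fixed
tree_spec <- decision_tree(
  mode = "regression",
  cost_complexity = tune(),
  min_n = tune()
)

wf <- workflow() %>%
  add_model(tree_spec) %>%
  add_formula(mpg ~ .)

# train across a small grid of candidate parameter values
res <- tune_grid(wf, resamples = folds, grid = 10)
show_best(res, metric = "rmse")

# lock in the best parameters, ready to fit on the full training set
final_wf <- finalize_workflow(wf, select_best(res, metric = "rmse"))
```
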
  • Episode 81: Tidymodels - Logistic Regression with GLM

    • This week we look at how to perform a logistic regression using the tidymodels framework. In this fifth episode on tidymodels, we show how to set up a logistic regression using GLM, perform a custom test/train split on the data, and calculate metrics such as ROC AUC, kappa, and accuracy. We visualize the performance and evaluate how well our model performed.
    • Source Code
  • Episode 82: Tidymodels - Random Forest Classification

    • Continuing our look at classification models via tidymodels, this week we tackle a multiclass classification problem using random forests. We show how to tune your model, extract the optimal workflow, train it on your full training set, and compare its performance on the test set. We calculate performance metrics such as ROC AUC and visualize the results.
    • Source Code
  • Episode 83: Tidymodels - Naive Bayes of Penguins

    • Naive Bayes is the model we apply this week in our tidymodels series. We look at how to perform a multiclass classification problem using Naive Bayes via the {discrim} package from tidymodels, with the {klaR} package supplying the engine. We show how to evaluate your model using 5-fold cross-validation, then train it on your full training set and compare its performance on the test set. We calculate performance metrics such as ROC AUC and visualize the results.
    • Source Code
  • Episode 84: Tidymodels - Workflow Sets and model selection

    • Tidymodels makes it simple to try a multitude of model types by separating the preprocessing from the model specification and creating a standardized way to apply different models. Workflow sets take this a step further, letting you train and compare these models at the same time, just like tuning (a minimal sketch follows below). Using data from Kaggle, we look at how to fit three model types and select the best workflow, train it on our full training set, and compare against our held-out test set. We calculate performance metrics such as RMSE and R-squared and visualize the results.
    • Source Code
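
A minimal sketch of the workflow sets idea, comparing two model types on mtcars as a stand-in for the Kaggle data (assumes {tidymodels} and {workflowsets} are installed):

```r
library(tidymodels)
library(workflowsets)

folds <- vfold_cv(mtcars, v = 5)
rec <- recipe(mpg ~ ., data = mtcars) %>% step_normalize(all_predictors())

# one preprocessor crossed with two model specs = two candidate workflows
models <- workflow_set(
  preproc = list(normalized = rec),
  models = list(
    lm   = linear_reg(),
    tree = decision_tree(mode = "regression")
  )
)

# fit every workflow on the same resamples, then rank them
results <- workflow_map(models, "fit_resamples", resamples = folds)
rank_results(results, rank_metric = "rmse")
```
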
  • Episode 85: Tidymodels - Tuning Workflow Sets

    • Tidymodels makes it simple to try a multitude of model types by separating the preprocessing from the model specification and creating a standardized way to apply different models. In this episode we show how to use workflow sets along with tuning to create optimal models. Using wine data from Kaggle, we look at two different recipes and three different models requiring different levels of tuning. We select the best workflow and optimal tuned parameters, train on our full training set, and compare against our held-out test set. We calculate performance metrics such as RMSE and R-squared and visualize the results.
    • Source Code
  • Episode 86: Tidymodels - Julia Silge and Tune Racing

    • This week we are thrilled to have Dr. Julia Silge from RStudio join us to talk about tidymodels. Julia is one of the software engineers we have to thank for tidymodels and the ecosystem of packages that help us perform our data preprocessing and modeling steps with ease! In this episode we have a short interview with Julia, where she talks a bit about her background, her current role, and tidymodels. We then jump into explaining some code she wrote and shared in one of her own screencasts on training an XGBoost model to predict home runs. One unique part of it is that Julia applies tune racing, making the tuning run faster using some clever comparisons to ensure only the best models continue to be trained across all cross-validation folds. Patrick and Ellis ask questions throughout about how the code works and Julia's philosophies.
    • Julia Silge's Blog Post on Racing Methods
  • Episode 87: Advent of Code Day 6 - Efficient Problem Solving

    • This week we take a look at a problem from the Advent of Code, specifically Day 6. Advent of Code is a fun time of year where the data science community comes together to solve a series of 25 problems posed by Eric Wastl. The goal is to see who can solve the problems quickly and efficiently. It also provides an opportunity to work on problems unlike most of what you see in your day-to-day job. We work on finding an efficient solution to Day 6 - Lanternfish. The fish reproduce at a standard rate, but calculating how many exist after a certain number of days is trivial for a small number of days and quickly becomes too large for your computer if you approach the problem the wrong way!
    • Source Code
  • Episode 88: Advent of Code Day 7 - For Loops and Lookup Vectors

    • We work on finding an efficient solution to Advent of Code Day 7 - Whales. We need to find the most efficient position to align a series of crab submarines in order to escape, under several different constraints. We discuss how to set up an efficient for loop and create a lookup vector (a minimal sketch follows below)!
    • Source Code
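
A minimal sketch of the lookup-vector idea on the Day 7 example positions, using the triangular fuel cost from Part 2 of the puzzle:

```r
positions <- c(16, 1, 2, 0, 4, 2, 7, 1, 2, 14)

max_pos <- max(positions)
# lookup[d + 1] gives the fuel cost of moving a distance of d
# (0, 1, 1+2, 1+2+3, ...), precomputed once instead of re-summed in the loop
lookup <- cumsum(0:max_pos)

total_fuel <- numeric(max_pos + 1)
for (target in 0:max_pos) {
  dist <- abs(positions - target)
  total_fuel[target + 1] <- sum(lookup[dist + 1])
}

which.min(total_fuel) - 1 # best alignment position
min(total_fuel)           # fuel required
```
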
  • Episode 89: Tables for Research

    • We reach into our mailbag this week to answer a question from one of our viewers. In one of our episodes we talked about how you could extract coefficients from your fitted models using the {broom} package. However, how would one turn that into a publication-ready table? In this episode we use {gt} by Rich Iannone to convert our coefficient data.frame into a nice, publication-ready table!
    • Source Code
  • Episode 90: Rmarkdown Guide - RMD Formatting

    • Rmarkdown is an incredible tool, widely used by R analysts to combine prose and code into a beautiful symphony of reproducible outputs and information sharing. However, some of the setup can be confusing for a newcomer. We are starting a series to share the knowledge that helps users get going on their Rmarkdown journey. This week we start with the bones and structure of Rmarkdown documents, talk about markdown syntax, set up your text to format as expected, and add some code chunks!
    • Source Code
  • Episode 91: Rmarkdown Guide - Code Chunk Options & Figure Options

    • Rmarkdown is an incredible tool, widely used by R analysts to combine prose and code into a beautiful symphony of reproducible outputs and information sharing. However, some of the setup can be confusing for a newcomer. This week we continue where we left off, talking through common chunk options that modify how your code and its outputs appear in the resulting document, and whether the code even gets run at all. Then we cover common chunk options that modify figure outputs, which are incredibly useful! Finally, we start an Rmarkdown report to demonstrate how we would use these options in a real report.
    • Source Code
  • Episode 92: Rmarkdown Guide - Formatting Tabs for HTML outputs

    • This week's episode features a trick for making tabsets in your HTML outputs in Rmarkdown, as well as some advice on organizing your code within an Rmarkdown document. Using the palmerpenguins dataset, we show how to make your code chunks super easy to update and what to think about when making your output.
    • Source Code
  • Episode 93: Rmarkdown Guide - YAML Header

    • The YAML header controls the macro-level behaviors of your Rmarkdown document: the output type, title, author, date, custom styling, table of contents, and more. In this episode we cover the basic YAML header contents and how to add this customization to your Rmarkdown documents. We also show two example outputs, for HTML and Word.
    • Source Code
  • Episode 94: Rmarkdown Guide - Parameterized Reports

    • Parameterized reports allow data scientists to multiply their impact by reducing the amount of work needed to produce new reports. Using the YAML header, a data scientist can set parameters that change based on user inputs to create customized reports at the click of a button. In this episode we go over the basics of adding a parameter, how to customize the input either interactively or programmatically, and how to use the parameter in your code. Then we create a custom example, pulling NBA basketball data for multiple years and displaying a team of interest.

Contributors

pw2 and thebioengineer
