TidyTuesday: Cocktails pt.2

This is part 2 of TidyTuesday: Cocktails. It shows how we can use #rstats to write a cocktail recommendation system that takes in a drink and returns a few other cocktails with similarly mixed ingredients.

Load libraries

    library(tidyverse)
    library(recommenderlab)

Download and parse data

Note: please check out part 1 for details on the processing steps.

    bc_raw <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-26/boston_cocktails.csv')

    bc <- bc_raw %>%
      mutate(ingredient = str_to_lower(ingredient)) %>%
      distinct() %>%
      select(name, ingredient)

    bc_tidy <- bc %>%
      filter(!str_detect(ingredient, ","))

    bc_untidy <- bc %>%
      filter(str_detect(ingredient, ",")) %>%
      mutate(ingredient = str_split(ingredient, ", ")) %>%
      unnest(ingredient)

    bc_clean <- rbind(bc_tidy, bc_untidy) %>%
      distinct()

    df <- bc_clean %>%
      mutate(ingredient = str_replace_all(ingredient, "-", "_"),
             ingredient = str_replace_all(ingredient, " ", "_"),
             ingredient = str_replace_all(ingredient, "old_mr._boston_", ""),
             ingredient = str_replace_all(ingredient, "old_thompson_", ""))

    df_processed <- df %>%
      mutate(value = 1) %>%
      pivot_wider(names_from = name) %>%
      replace(is.na(.), 0)

Recommendation algorithm

Transform data to binary rating matrix

    cocktails_matrix <- df_processed %>%
      select(-ingredient) %>%
      as.matrix() %>%
      as("binaryRatingMatrix")

Create evaluation scheme

    scheme <- cocktails_matrix %>%
      evaluationScheme(method = "cross", k = 5, train = 0.8, given = -1)

Input customer cocktail preference

Let’s check the ingredients for a very simple cocktail: ...
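As a rough illustration of recommending drinks by shared ingredients, here is a minimal Python sketch that ranks cocktails by the Jaccard similarity of their ingredient sets. It is not the recommenderlab model used in the post: the `jaccard` and `recommend` helpers and the toy ingredient lists are assumptions made up for the example.

```python
def jaccard(a, b):
    """Share of ingredients two cocktails have in common (0 to 1)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, cocktails, n=2):
    """Return the n cocktails most similar to `target`, best first."""
    scores = [
        (name, jaccard(cocktails[target], ingredients))
        for name, ingredients in cocktails.items()
        if name != target
    ]
    # Sort by similarity (descending), then by name so ties are stable.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:n]


# Hypothetical toy data, not taken from the Boston cocktails dataset.
toy_cocktails = {
    "gin_and_tonic": {"gin", "tonic_water", "lime"},
    "gin_rickey": {"gin", "lime", "soda_water"},
    "gimlet": {"gin", "lime"},
    "screwdriver": {"vodka", "orange_juice"},
}

top = recommend("gin_and_tonic", toy_cocktails)
print(top)
```

Set overlap is only one possible notion of "similarly mixed"; a fitted recommender may weigh ingredients differently.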

May 28, 2020 · Christopher Yee

TidyTuesday: Cocktails

Data from #tidytuesday week of 2020-05-26 (source)

If you are looking for the R script then you can find it here

Load packages

    library(tidyverse)
    library(ggrepel)
    library(FactoMineR)

Download data

    bc_raw <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-26/boston_cocktails.csv')

Data processing

Standardize cases

    bc_raw %>%
      count(ingredient, sort = TRUE) %>%
      filter(str_detect(ingredient, "red pepper sauce"))

    ## # A tibble: 2 x 2
    ##   ingredient               n
    ##   <chr>                <int>
    ## 1 Hot red pepper sauce     4
    ## 2 hot red pepper sauce     1

Let’s fix that by converting all ingredient values to lower case: ...
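The effect of standardizing case before counting can be sketched in a few lines of Python. The input list simply reproduces the two counts shown in the tibble above (4 and 1); the variable names are illustrative.

```python
from collections import Counter

# Counts from the excerpt: 4 capitalized rows and 1 lower-case row.
ingredients = ["Hot red pepper sauce"] * 4 + ["hot red pepper sauce"]

# Without normalization the same ingredient is counted under two keys.
raw_counts = Counter(ingredients)

# Lower-casing first merges the two spellings into one key.
clean_counts = Counter(s.lower() for s in ingredients)

print(raw_counts)
print(clean_counts)
```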

May 26, 2020 · Christopher Yee

R script for the CausalImpact package

Google has an amazing #rstats package called CausalImpact to predict the counterfactual: what would have happened if an intervention did not occur. This is a quick technical post to get someone up and running rather than a review of its literature, usage, or idiosyncrasies.

Load libraries

    library(tidyverse)
    library(CausalImpact)

Download (dummy) data

    df <- read_csv("https://raw.githubusercontent.com/Eeysirhc/random_datasets/master/cimpact_sample_data.csv")

    df %>% sample_n(5)

    ## # A tibble: 5 x 3
    ##   date       experiment_type revenue
    ##   <date>     <chr>             <dbl>
    ## 1 2020-04-02 control           309.
    ## 2 2020-05-05 experiment        257.
    ## 3 2020-02-29 control           928.
    ## 4 2020-03-13 control           467.
    ## 5 2020-03-02 experiment         35.0

Shape data

Before we can run our analysis, the CausalImpact package requires three columns: ...

May 19, 2020 · Christopher Yee

TidyTuesday: Volcano Eruptions (python)

Data from #tidytuesday week of 2020-05-12 (source) but plotting in python.

Load modules

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

Download and parse data

    volcano_raw = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-12/volcano.csv")

    volcano = volcano_raw[['primary_volcano_type', 'elevation']].sort_values(by='elevation', ascending=False)

Visualize dataset

    sns.set(style="darkgrid")
    plt.figure(figsize=(20,15))

    p = sns.boxplot(x=volcano.elevation, y=volcano.primary_volcano_type)
    p = sns.swarmplot(x=volcano.elevation, y=volcano.primary_volcano_type, color=".35")

    plt.xlabel("Elevation")
    plt.ylabel("")
    plt.title("What is the average elevation by volcano type?", x=0.01, horizontalalignment="left", fontsize=20)
    plt.figtext(0.9, 0.08, "by: @eeysirhc", horizontalalignment="right")
    plt.figtext(0.9, 0.07, "Source: The Smithsonian Institute", horizontalalignment="right")

    plt.show()

...

May 12, 2020 · Christopher Yee

TidyTuesday: Animal Crossing

Data from #tidytuesday week of 2020-05-05 (source)

Load packages

    library(tidyverse)
    library(ggfortify)

Download data

    villagers_raw <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/villagers.csv')

Process data

    villagers <- villagers_raw %>%
      select(gender, species, personality) %>%
      mutate(species = str_to_title(species)) %>%
      group_by(gender, species, personality) %>%
      summarize(n = n()) %>%
      mutate(pct_total = n / sum(n)) %>%
      ungroup()

Visualize data

    villagers %>%
      ggplot(aes(personality, pct_total, fill = gender, color = gender, group = gender)) +
      geom_polygon(alpha = 0.5) +
      geom_point() +
      coord_polar() +
      facet_wrap(~species) +
      labs(x = NULL, y = NULL, color = NULL, fill = NULL,
           title = "Animal Crossing: villager personality traits by species & gender",
           caption = "by: @eeysirhc\nsource:VillagerDB") +
      theme_bw() +
      theme(legend.position = 'top',
            axis.text.y = element_blank(),
            axis.ticks.y = element_blank())

...
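One detail of the processing step is easy to miss: after `summarize()`, dplyr drops the last grouping level, so `pct_total` is each personality's share within a gender and species pair. A minimal Python sketch of that calculation follows; the villager tuples are hypothetical and not drawn from the dataset.

```python
from collections import Counter, defaultdict

# Hypothetical villagers as (gender, species, personality) tuples.
villagers = [
    ("female", "Cat", "peppy"),
    ("female", "Cat", "peppy"),
    ("female", "Cat", "normal"),
    ("male", "Cat", "lazy"),
]

# Equivalent of group_by(gender, species, personality) + summarize(n = n()).
counts = Counter(villagers)

# The remaining groups are (gender, species), so totals are summed per pair.
totals = defaultdict(int)
for (gender, species, _personality), n in counts.items():
    totals[(gender, species)] += n

# Share of each personality within its (gender, species) group.
pct_total = {key: n / totals[key[:2]] for key, n in counts.items()}
print(pct_total)
```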

May 6, 2020 · Christopher Yee

Exploratory data analysis on COVID-19 search queries

The team at Bing were generous enough to release search query data with COVID-19 intent. The files are broken down by country and state level granularity so we can understand how the world is coping with the pandemic through search. What follows is an exploratory analysis on how US Bing users are searching for COVID-19 (a.k.a. coronavirus) information.

tl;dr

COVID-19 search queries generally fall into five distinct categories:

1. Awareness
2. Consideration
3. Management
4. Unease
5. Advocacy (?)

...

May 5, 2020 · Christopher Yee

TidyTuesday: Beer Production

Data from #tidytuesday week of 2020-03-31 (source)

Load packages

    library(tidyverse)
    library(gganimate)
    library(gifski)

Download data

    beer_states_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-31/beer_states.csv")

Clean data

    beer_total <- beer_states_raw %>%
      # FILL NULL VALUES WITH 0
      replace(., is.na(.), 0) %>%
      # REMOVE LINE ITEM FOR 'TOTAL'
      filter(state != 'total') %>%
      # COMPUTE TOTAL BARRELS PER YEAR BY STATE
      group_by(year, state) %>%
      summarize(total_barrels = sum(barrels)) %>%
      ungroup()

Create rankings

    beer_final <- beer_total %>%
      group_by(year) %>%
      mutate(
        # CALCULATE RANKINGS BY TOTAL BARRELS PRODUCED EACH YEAR
        rank = min_rank(-total_barrels) * 1.0,
        # STATE TOTAL DIVIDE BY STATE RANKED #1 PER YEAR
        produced = total_barrels / total_barrels[rank == 1],
        # CLEANED TEXT LABEL
        produced_label = paste0(" ", round(total_barrels / 1e6, 2), " M")) %>%
      group_by(state) %>%
      # SELECT TOP 20
      filter(rank <= 20) %>%
      ungroup()

Animate bar chart

    p <- beer_final %>%
      ggplot(aes(rank, produced, fill = state)) +
      geom_col(show.legend = FALSE) +
      geom_text(aes(rank, y = 0, label = state, hjust = 1.5)) +
      geom_text(aes(rank, y = produced, label = produced_label, hjust = 0)) +
      coord_flip() +
      scale_x_reverse() +
      theme_minimal(base_size = 15) +
      theme(axis.text.x = element_blank(),
            axis.text.y = element_blank()) +
      transition_time(year) +
      labs(title = "US Beer Production by State",
           subtitle = "Barrels produced each year: {round(frame_time)}",
           caption = "by: @eeysirhc\nsource: Alcohol and Tobacco Tax and Trade Bureau",
           x = NULL, y = NULL)

    animate(p, nframes = 300, fps = 12, width = 1000, height = 800,
            renderer = gifski_renderer())

...
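The ranking step, in which each state is ranked by barrels within a year and then scaled against the top producer, can be sketched in plain Python for a single year. The `min_rank_desc` helper and the barrel figures are assumptions for illustration, and the sketch scales by the maximum value rather than indexing on `rank == 1`.

```python
def min_rank_desc(values):
    """Rank 1 = largest value; tied values share the smallest rank,
    mirroring dplyr's min_rank(-x)."""
    return [1 + sum(other > v for other in values) for v in values]


# Hypothetical totals for one year, not real production figures.
barrels = {"A": 300.0, "B": 150.0, "C": 150.0, "D": 75.0}

# Rank every state within the year.
ranks = dict(zip(barrels, min_rank_desc(list(barrels.values()))))

# Scale each state against the year's top producer (bar length in the chart).
top = max(barrels.values())
produced = {state: total / top for state, total in barrels.items()}

print(ranks)
print(produced)
```

Note how the tie between B and C leaves rank 3 unused, which is the behaviour of a min-rank scheme.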

April 14, 2020 · Christopher Yee

Script to track COVID-19 cases in the US

A couple weeks ago I shared an #rstats script to track global coronavirus cases by country. The New York Times also released COVID-19 data for new cases in the United States, both at the state and county level. You can run the code below on a daily basis to get the most up to date figures. Feel free to modify for your own needs:

    library(scales)
    library(tidyverse)
    library(gghighlight)

    state <- read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
    county <- read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv")

State

    state %>%
      group_by(date, state) %>%
      mutate(total_cases = cumsum(cases)) %>%
      ungroup() %>%
      filter(total_cases >= 100) %>% # MINIMUM 100 CASES
      group_by(state) %>%
      mutate(day_index = row_number(),
             n = n()) %>%
      ungroup() %>%
      filter(n >= 12) %>% # MINIMUM 12 DAYS
      ggplot(aes(day_index, total_cases, color = state, fill = state)) +
      geom_point() +
      geom_smooth() +
      gghighlight() +
      scale_y_log10(labels = comma_format()) +
      facet_wrap(~state, ncol = 4) +
      labs(title = "COVID-19: cumulative daily new cases by US states (log scale)",
           x = "Days since 100th reported case",
           y = NULL, fill = NULL, color = NULL,
           caption = "by: @eeysirhc\nSource: New York Times") +
      theme_minimal() +
      theme(legend.position = 'none') +
      expand_limits(x = 30)

...
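The "days since 100th reported case" axis comes from filtering each state to days with at least 100 cases and then numbering the remaining rows per state. Below is a minimal Python sketch of that indexing. The rows are hypothetical, and it assumes the case counts are cumulative and already sorted by date.

```python
from collections import defaultdict

# Hypothetical (date, state, cumulative_cases) rows, sorted by date.
rows = [
    ("2020-03-01", "X", 40),
    ("2020-03-02", "X", 120),
    ("2020-03-02", "Y", 90),
    ("2020-03-03", "X", 260),
    ("2020-03-03", "Y", 100),
]

day_counter = defaultdict(int)
indexed = []
for date, state, cases in rows:
    if cases < 100:  # drop days before the 100-case threshold
        continue
    day_counter[state] += 1  # row_number() within each state
    indexed.append((state, day_counter[state], cases))

print(indexed)
```

Indexing this way lines every state up on a common starting point, so curves can be compared regardless of calendar date.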

March 30, 2020 · Christopher Yee