TidyTuesday: Cocktails

Data from #tidytuesday week of 2020-05-26 (source). If you are looking for the R script, you can find it here.

Load packages

```r
library(tidyverse)
library(ggrepel)
library(FactoMineR)
```

Download data

```r
bc_raw <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-26/boston_cocktails.csv')
```

Data processing

Standardize cases

```r
bc_raw %>%
  count(ingredient, sort = TRUE) %>%
  filter(str_detect(ingredient, "red pepper sauce"))
## # A tibble: 2 x 2
##   ingredient               n
##   <chr>                <int>
## 1 Hot red pepper sauce     4
## 2 hot red pepper sauce     1
```

Let's fix that by converting all ingredient values to lower case: ...
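The same case-standardization idea can be sketched in pandas with made-up data that mirrors the duplicate-ingredient issue above (the post itself continues in R, so the frame and values here are illustrative assumptions):

```python
import pandas as pd

# Hypothetical sample reproducing the mixed-case duplicates shown above
df = pd.DataFrame({"ingredient": ["Hot red pepper sauce"] * 4 + ["hot red pepper sauce"]})

# Lower-casing collapses the two spellings into a single ingredient value
df["ingredient"] = df["ingredient"].str.lower()

counts = df["ingredient"].value_counts()
print(counts)
```

After the transformation the two variants merge into one row with a count of 5.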

May 26, 2020 · Christopher Yee

R script for the CausalImpact package

Google has an amazing #rstats package called CausalImpact to predict the counterfactual: what would have happened if an intervention had not occurred. This is a quick technical post to get someone up and running rather than a review of its literature, usage, or idiosyncrasies.

Load libraries

```r
library(tidyverse)
library(CausalImpact)
```

Download (dummy) data

```r
df <- read_csv("https://raw.githubusercontent.com/Eeysirhc/random_datasets/master/cimpact_sample_data.csv")

df %>% sample_n(5)
## # A tibble: 5 x 3
##   date       experiment_type revenue
##   <date>     <chr>             <dbl>
## 1 2020-04-02 control            309.
## 2 2020-05-05 experiment         257.
## 3 2020-02-29 control            928.
## 4 2020-03-13 control            467.
## 5 2020-03-02 experiment         35.0
```

Shape data

Before we can run our analysis, the CausalImpact package requires three columns: ...
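The excerpt cuts off before naming the three columns, but the general reshaping idea (one row per date, with the test and control series side by side) can be sketched in pandas; the column names and values below are assumptions for illustration, not the post's actual schema:

```python
import pandas as pd

# Hypothetical long-format data shaped like the dummy set above
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-01", "2020-03-01", "2020-03-02", "2020-03-02"]),
    "experiment_type": ["control", "experiment", "control", "experiment"],
    "revenue": [100.0, 90.0, 110.0, 95.0],
})

# Pivot long to wide so each date carries an experiment and a control column
wide = df.pivot(index="date", columns="experiment_type", values="revenue").reset_index()
wide = wide[["date", "experiment", "control"]]
print(wide)
```

This yields one row per date, which matches the wide layout CausalImpact-style analyses expect (response series first, covariate series after).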

May 19, 2020 · Christopher Yee

TidyTuesday: Volcano Eruptions (python)

Data from #tidytuesday week of 2020-05-12 (source) but plotting in python.

Load modules

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

Download and parse data

```python
volcano_raw = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-12/volcano.csv")

volcano = volcano_raw[['primary_volcano_type', 'elevation']].sort_values(by='elevation', ascending=False)
```

Visualize dataset

```python
sns.set(style="darkgrid")
plt.figure(figsize=(20, 15))

p = sns.boxplot(x=volcano.elevation, y=volcano.primary_volcano_type)
p = sns.swarmplot(x=volcano.elevation, y=volcano.primary_volcano_type, color=".35")

plt.xlabel("Elevation")
plt.ylabel("")
plt.title("What is the average elevation by volcano type?", x=0.01, horizontalalignment="left", fontsize=20)
plt.figtext(0.9, 0.08, "by: @eeysirhc", horizontalalignment="right")
plt.figtext(0.9, 0.07, "Source: The Smithsonian Institute", horizontalalignment="right")
plt.show()
```

...

May 12, 2020 · Christopher Yee

TidyTuesday: Animal Crossing

Data from #tidytuesday week of 2020-05-05 (source)

Load packages

```r
library(tidyverse)
library(ggfortify)
```

Download data

```r
villagers_raw <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/villagers.csv')
```

Process data

```r
villagers <- villagers_raw %>%
  select(gender, species, personality) %>%
  mutate(species = str_to_title(species)) %>%
  group_by(gender, species, personality) %>%
  summarize(n = n()) %>%
  mutate(pct_total = n / sum(n)) %>%
  ungroup()
```

Visualize data

```r
villagers %>%
  ggplot(aes(personality, pct_total, fill = gender, color = gender, group = gender)) +
  geom_polygon(alpha = 0.5) +
  geom_point() +
  coord_polar() +
  facet_wrap(~species) +
  labs(x = NULL, y = NULL, color = NULL, fill = NULL,
       title = "Animal Crossing: villager personality traits by species & gender",
       caption = "by: @eeysirhc\nsource: VillagerDB") +
  theme_bw() +
  theme(legend.position = 'top',
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())
```

...
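The `pct_total` step (each personality's share within a gender-species group) can be sketched in pandas with a grouped transform; the toy counts below are hypothetical, not the real villager data:

```python
import pandas as pd

# Hypothetical grouped counts, analogous to the summarize(n = n()) output
villagers = pd.DataFrame({
    "gender": ["female", "female", "male"],
    "species": ["cat", "cat", "cat"],
    "personality": ["normal", "peppy", "lazy"],
    "n": [6, 4, 5],
})

# Share of each personality within its gender-species group,
# mirroring n / sum(n) after the grouped summarize in R
villagers["pct_total"] = villagers["n"] / villagers.groupby(["gender", "species"])["n"].transform("sum")
```

Within the female cats the shares come out to 0.6 and 0.4, and the single male-cat row gets 1.0.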

May 6, 2020 · Christopher Yee

Exploratory data analysis on COVID-19 search queries

The team at Bing was generous enough to release search query data with COVID-19 intent. The files are broken down by country- and state-level granularity, so we can understand how the world is coping with the pandemic through search. What follows is an exploratory analysis of how US Bing users are searching for COVID-19 (a.k.a. coronavirus) information.

tl;dr

COVID-19 search queries generally fall into five distinct categories:

1. Awareness
2. Consideration
3. Management
4. Unease
5. Advocacy (?)

...

May 5, 2020 · Christopher Yee

TidyTuesday: Beer Production

Data from #tidytuesday week of 2020-03-31 (source)

Load packages

```r
library(tidyverse)
library(gganimate)
library(gifski)
```

Download data

```r
beer_states_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-31/beer_states.csv")
```

Clean data

```r
beer_total <- beer_states_raw %>%
  # FILL NULL VALUES WITH 0
  replace(., is.na(.), 0) %>%
  # REMOVE LINE ITEM FOR 'TOTAL'
  filter(state != 'total') %>%
  # COMPUTE TOTAL BARRELS PER YEAR BY STATE
  group_by(year, state) %>%
  summarize(total_barrels = sum(barrels)) %>%
  ungroup()
```

Create rankings

```r
beer_final <- beer_total %>%
  group_by(year) %>%
  mutate(
    # CALCULATE RANKINGS BY TOTAL BARRELS PRODUCED EACH YEAR
    rank = min_rank(-total_barrels) * 1.0,
    # STATE TOTAL DIVIDED BY STATE RANKED #1 PER YEAR
    produced = total_barrels / total_barrels[rank == 1],
    # CLEANED TEXT LABEL
    produced_label = paste0(" ", round(total_barrels / 1e6, 2), " M")) %>%
  group_by(state) %>%
  # SELECT TOP 20
  filter(rank <= 20) %>%
  ungroup()
```

Animate bar chart

```r
p <- beer_final %>%
  ggplot(aes(rank, produced, fill = state)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(rank, y = 0, label = state, hjust = 1.5)) +
  geom_text(aes(rank, y = produced, label = produced_label, hjust = 0)) +
  coord_flip() +
  scale_x_reverse() +
  theme_minimal(base_size = 15) +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank()) +
  transition_time(year) +
  labs(title = "US Beer Production by State",
       subtitle = "Barrels produced each year: {round(frame_time)}",
       caption = "by: @eeysirhc\nsource: Alcohol and Tobacco Tax and Trade Bureau",
       x = NULL, y = NULL)

animate(p, nframes = 300, fps = 12, width = 1000, height = 800, renderer = gifski_renderer())
```

...
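The per-year ranking and relative-production steps can be sketched in pandas as well; the state totals below are invented for illustration (the post itself works in R with `min_rank`):

```python
import pandas as pd

# Hypothetical yearly state totals, analogous to beer_total above
beer = pd.DataFrame({
    "year": [2019, 2019, 2019, 2020, 2020, 2020],
    "state": ["CA", "TX", "CO", "CA", "TX", "CO"],
    "total_barrels": [25e6, 20e6, 18e6, 24e6, 21e6, 17e6],
})

# Rank states within each year by barrels produced (1 = largest),
# mirroring min_rank(-total_barrels)
beer["rank"] = (beer.groupby("year")["total_barrels"]
                    .rank(method="min", ascending=False)
                    .astype(int))

# Each state's output relative to that year's top producer,
# like total_barrels / total_barrels[rank == 1]
beer["produced"] = beer["total_barrels"] / beer.groupby("year")["total_barrels"].transform("max")
```

The top state in each year gets rank 1 and a `produced` value of exactly 1.0, which is what keeps the animated bars on a comparable scale across years.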

April 14, 2020 · Christopher Yee

Script to track COVID-19 cases in the US

A couple weeks ago I shared an #rstats script to track global coronavirus cases by country. The New York Times also released COVID-19 data for new cases in the United States, at both the state and county level. You can run the code below on a daily basis to get the most up-to-date figures. Feel free to modify it for your own needs:

```r
library(scales)
library(tidyverse)
library(gghighlight)

state <- read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
county <- read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv")
```

State

```r
state %>%
  group_by(date, state) %>%
  mutate(total_cases = cumsum(cases)) %>%
  ungroup() %>%
  filter(total_cases >= 100) %>% # MINIMUM 100 CASES
  group_by(state) %>%
  mutate(day_index = row_number(),
         n = n()) %>%
  ungroup() %>%
  filter(n >= 12) %>% # MINIMUM 12 DAYS
  ggplot(aes(day_index, total_cases, color = state, fill = state)) +
  geom_point() +
  geom_smooth() +
  gghighlight() +
  scale_y_log10(labels = comma_format()) +
  facet_wrap(~state, ncol = 4) +
  labs(title = "COVID-19: cumulative daily new cases by US states (log scale)",
       x = "Days since 100th reported case", y = NULL,
       fill = NULL, color = NULL,
       caption = "by: @eeysirhc\nSource: New York Times") +
  theme_minimal() +
  theme(legend.position = 'none') +
  expand_limits(x = 30)
```

...
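The "days since 100th reported case" indexing above can be sketched in pandas with a filter plus a grouped cumulative count; the case numbers below are hypothetical, not the NYT data:

```python
import pandas as pd

# Hypothetical cumulative case counts for two states, in date order
state = pd.DataFrame({
    "state": ["A"] * 4 + ["B"] * 3,
    "cases": [90, 120, 150, 200, 80, 110, 160],
})

# Keep only days at or above the 100-case threshold,
# then number the remaining days within each state
state = state[state["cases"] >= 100].copy()
state["day_index"] = state.groupby("state").cumcount() + 1
```

Each state's first qualifying day becomes `day_index` 1, which is what lets the faceted curves share a common x-axis.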

March 30, 2020 · Christopher Yee

TardyThursday: College Tuition, Diversity & Pay

The differences between this unsanctioned #tardythursday and the official #tidytuesday:

- These will publish on Thursday (obviously)
- The dataset will come from a completely different week of TidyTuesday
- For a surprise, I'll code with either #rstats or python (similar to #makeovermonday)

Load modules

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
```

Download and parse data

```python
df_raw = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/salary_potential.csv")

df = df_raw[['state_name', 'early_career_pay', 'mid_career_pay']].groupby('state_name').mean().reset_index()
```

Visualize dataset

```python
sns.set(style="darkgrid")
plt.figure(figsize=(20, 15))

g = sns.regplot(x="early_career_pay", y="mid_career_pay", data=df)

# Label each point with its state name
for line in range(0, df.shape[0]):
    g.text(df.early_career_pay[line] + 0.01, df.mid_career_pay[line],
           df.state_name[line], horizontalalignment='left',
           size='medium', color='black')

plt.xlabel("Early Career Pay")
plt.ylabel("Mid Career Pay")
plt.title("Average Salary Potential by State: Early vs Mid Career", x=0.01, horizontalalignment="left", fontsize=16)
plt.figtext(0.9, 0.09, "by: @eeysirhc", horizontalalignment="right")
plt.figtext(0.9, 0.08, "Source: TuitionTracker.org", horizontalalignment="right")
plt.show()
```

...

March 19, 2020 · Christopher Yee