Christopher Yee

Algorithm to prioritize home improvement projects

I moved to Los Angeles with my wife in October 2019 where we had a list of home improvement projects we wanted to complete or things to purchase. The problem we faced was disagreement on where to start since we had to juggle costs and compromise on what was most important at the time. For example, if we focused too much on lower ticket purchases we would delay projects that had potential to improve our home value. ...

On Business Value vs Technical Knowledge

The purpose of this article is to elaborate and visualize (surprise!) my comment I left over on Reddit: Professional data scientists: did you overcome the feeling of never knowing enough? If so, how? I think this concept can be applied to any field - not just data science. My personal advice that has worked for me to quell any “insecurities” is frame your mindset in terms of business value vs technical knowledge. ...

Recreating plots in R: intro to bootstrapping

Objective: recreate and visualize the 500K sampling distribtuion of means from this intro to bootstrapping in statistics post using R. Load libraries library(tidyverse) library(rsample) Download data df <- read_csv("https://statisticsbyjim.com/wp-content/uploads/2017/04/body_fat.csv") Bootstrap resampling 500K df_bs <- df %>% bootstraps(times = 500000) %>% mutate(average = map_dbl(splits, ~ mean(as.data.frame(.)$`%Fat`))) Visualize sampling distribution of means df_bs %>% ggplot(aes(average)) + geom_histogram(binwidth = 0.1, alpha = 0.75, color = 'white', fill = 'steelblue') + scale_x_continuous(limits = c(25, 32)) + scale_y_continuous(labels = scales::comma_format()) + labs(title = "Histogram of % Fat", subtitle = "500K bootstrapped samples with 92 observations in each", x = "Average Mean", y = "Frequency") + theme_minimal() ...

TidyTuesday: Cocktails pt.2

This is part 2 of TidyTuesday: Cocktails. Below shows how we can use #rstats to write a cocktail recommendation system that takes in a drink and returns a few other cocktails based on similarly mixed ingredients. Load libraries library(tidyverse) library(recommenderlab) Download and parse data Note: please check out part 1 for deatils on processing steps bc_raw <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-26/boston_cocktails.csv') bc <- bc_raw %>% mutate(ingredient = str_to_lower(ingredient)) %>% distinct() %>% select(name, ingredient) bc_tidy <- bc %>% filter(!str_detect(ingredient, ",")) bc_untidy <- bc %>% filter(str_detect(ingredient, ",")) %>% mutate(ingredient = str_split(ingredient, ", ")) %>% unnest(ingredient) bc_clean <- rbind(bc_tidy, bc_untidy) %>% distinct() df <- bc_clean %>% mutate(ingredient = str_replace_all(ingredient, "-", "_"), ingredient = str_replace_all(ingredient, " ", "_"), ingredient = str_replace_all(ingredient, "old_mr._boston_", ""), ingredient = str_replace_all(ingredient, "old_thompson_", "")) df_processed <- df %>% mutate(value = 1) %>% pivot_wider(names_from = name) %>% replace(is.na(.), 0) Recommendation algorithm Transform data to binary rating matrix cocktails_matrix <- df_processed %>% select(-ingredient) %>% as.matrix() %>% as("binaryRatingMatrix") Create evaluation scheme scheme <- cocktails_matrix %>% evaluationScheme(method = "cross", k = 5, train = 0.8, given = -1) Input customer cocktail preference Let’s check the ingredients for a very simple cocktail: ...

TidyTuesday: Cocktails

Data from #tidytuesday week of 2020-05-26 (source) If you are looking for the R script then you can find it here Load packages library(tidyverse) library(ggrepel) library(FactoMineR) Download data bc_raw <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-26/boston_cocktails.csv') Data processing Standardize cases bc_raw %>% count(ingredient, sort = TRUE) %>% filter(str_detect(ingredient, "red pepper sauce")) ## # A tibble: 2 x 2 ## ingredient n ## <chr> <int> ## 1 Hot red pepper sauce 4 ## 2 hot red pepper sauce 1 Let’s fix that by making all ingredient values to lower case: ...

R script for the CausalImpact package

Google has an amazing #rstats package called CausalImpact to predict the counterfactual: what would have happened if an intervention did not occur. This is a quick technical post to get someone up and running rather than a review of its literature, usage, or idiosyncrasies Load libraries library(tidyverse) library(CausalImpact) Download (dummy) data df <- read_csv("https://raw.githubusercontent.com/Eeysirhc/random_datasets/master/cimpact_sample_data.csv") df %>% sample_n(5) ## # A tibble: 5 x 3 ## date experiment_type revenue ## <date> <chr> <dbl> ## 1 2020-04-02 control 309. ## 2 2020-05-05 experiment 257. ## 3 2020-02-29 control 928. ## 4 2020-03-13 control 467. ## 5 2020-03-02 experiment 35.0 Shape data Before we can run our analysis, the CausalImpact package requires three columns: ...

TidyTuesday: Volcano Eruptions (python)

Data from #tidytuesday week of 2020-05-12 (source) but plotting in python. Load modules import pandas as pd import matplotlib.pyplot as plt import seaborn as sns Download and parse data volcano_raw = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-12/volcano.csv") volcano = volcano_raw[['primary_volcano_type', 'elevation']].sort_values(by='elevation', ascending=False) Visualize dataset sns.set(style="darkgrid") plt.figure(figsize=(20,15)) p = sns.boxplot(x=volcano.elevation, y=volcano.primary_volcano_type) p = sns.swarmplot(x=volcano.elevation, y=volcano.primary_volcano_type, color=".35") plt.xlabel("Elevation") plt.ylabel("") plt.title("What is the average elevation by volcano type?", x=0.01, horizontalalignment="left", fontsize=20) plt.figtext(0.9, 0.08, "by: @eeysirhc", horizontalalignment="right") plt.figtext(0.9, 0.07, "Source: The Smithsonian Institute", horizontalalignment="right") plt.show() ...

TidyTuesday: Animal Crossing

Data from #tidytuesday week of 2020-05-05 (source) Load packages library(tidyverse) library(ggfortify) Download data villagers_raw <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/villagers.csv') Process data villagers <- villagers_raw %>% select(gender, species, personality) %>% mutate(species = str_to_title(species)) %>% group_by(gender, species, personality) %>% summarize(n = n()) %>% mutate(pct_total = n / sum(n)) %>% ungroup() Visualize data villagers %>% ggplot(aes(personality, pct_total, fill = gender, color = gender, group = gender)) + geom_polygon(alpha = 0.5) + geom_point() + coord_polar() + facet_wrap(~species) + labs(x = NULL, y = NULL, color = NULL, fill = NULL, title = "Animal Crossing: villager personality traits by species & gender", caption = "by: @eeysirhc\nsource:VillagerDB") + theme_bw() + theme(legend.position = 'top', axis.text.y = element_blank(), axis.ticks.y = element_blank()) ...