Retrospective Introspection in 2018

With the year coming to a close and 2019 just around the corner, I thought I would try something new and reflect on the defining moments of my 2018 - along with my $0.02. I do not put enough of this (or my personal thoughts) into written form, but I recently migrated my blog from WordPress to Hugo, so this is as good a time as any.

Lucky to work with some of the smartest digital marketing folks in fintech

It’s rare to find the perfect team. I always thought that was a load of crock from techies who wrote on Medium and sent everyone on some wild goose chase. ...

December 31, 2018 · Christopher Yee

TidyTuesday: Cetaceans Dataset

Analyzing data for #tidytuesday week of 12/18/2018 (source)

```r
# LOAD PACKAGES AND PARSE DATA
library(tidyverse)
library(scales)
library(RColorBrewer)
library(forcats)
library(lubridate)
library(tidytext)

cetaceans_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-12-18/allCetaceanData.csv")

cetaceans <- cetaceans_raw
```

What is the most notable cause of death for male versus female cetaceans?

```r
cetaceans %>%
  select(sex, COD) %>%
  filter(sex != "U") %>%
  na.omit() %>%
  mutate(sex = replace(sex, str_detect(sex, "F"), "Female"),
         sex = replace(sex, str_detect(sex, "M"), "Male")) %>%
  unnest_tokens(bigram, COD, token = "ngrams", n = 2) %>%
  count(sex, bigram) %>%
  bind_tf_idf(bigram, sex, n) %>%
  arrange(desc(tf_idf)) %>%
  filter(tf_idf > 0.0011) %>%
  ggplot() +
  geom_col(aes(reorder(bigram, tf_idf), tf_idf, fill = sex)) +
  coord_flip() +
  scale_fill_brewer(palette = 'Set2', name = "") +
  labs(x = "", y = "",
       title = "Bigrams with highest TF-IDF for cause of death \n between Cetacean genders",
       caption = "Source: The Pudding") +
  theme_bw()
```

...
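For context, bind_tf_idf() treats each sex as its own "document" and weights every bigram by term frequency times inverse document frequency; tidytext computes the idf with a natural log:

$$\text{tf-idf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}} \times \ln \frac{N}{\lvert \{ d : t \in d \} \rvert}$$

With only N = 2 documents here (Female and Male), any bigram appearing in both groups gets an idf of ln(2/2) = 0, so the chart surfaces only causes of death unique to one sex.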

December 18, 2018 · Christopher Yee

TidyTuesday: NYC Restaurant Inspections

Analyzing data for #tidytuesday week of 12/11/2018 (source)

```r
# LOAD PACKAGES AND PARSE DATA
library(tidyverse)
library(scales)
library(RColorBrewer)
library(forcats)
library(lubridate)
library(ebbr)

nyc_restaurants_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-12-11/nyc_restaurants.csv")

nyc_restaurants <- nyc_restaurants_raw %>%
  filter(inspection_date != '01/01/1900')
```

What is the rate of “A” inspection grades by cuisine type? The first step is to compute the relevant statistics:

```r
cuisine_grades <- nyc_restaurants %>%
  select(cuisine_description, grade) %>%
  na.omit() %>%
  group_by(cuisine_description) %>%
  count(grade) %>%
  mutate(total = sum(n),
         pct_total = n / total) %>%
  ungroup()
```

Next we apply empirical Bayesian estimation and filter the top 20 results ...
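The excerpt truncates before that step, but as a minimal sketch of how the shrinkage might look with ebbr's add_ebb_estimate() (the original post's exact code may differ):

```r
# A sketch, not the post's verbatim code: shrink raw A-grade rates toward
# a beta prior fitted across all cuisines, then keep the top 20
top_cuisines <- cuisine_grades %>%
  filter(grade == "A") %>%
  add_ebb_estimate(n, total) %>%  # adds .fitted (posterior rate), .low, .high
  arrange(desc(.fitted)) %>%
  head(20)
```

Shrinkage matters here because a cuisine with 3 of 3 "A" grades should not outrank one with 950 of 1,000.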

December 11, 2018 · Christopher Yee

TidyTuesday: Medium Article Metadata

Analyzing data for #tidytuesday week of 12/4/2018 (source)

```r
# LOAD PACKAGES AND PARSE DATA
library(tidyverse)
library(scales)
library(RColorBrewer)
library(forcats)
library(tidytext)
library(stringr)

articles_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-12-04/medium_datasci.csv")

articles <- articles_raw
```

Who are the top 10 authors in terms of total articles published?

```r
top_authors <- articles %>%
  select(author) %>%
  group_by(author) %>%
  count() %>%
  arrange(desc(n)) %>%
  na.omit() %>%
  head(10)

top_authors %>%
  ggplot() +
  geom_col(aes(reorder(author, n), n), fill = "darkslategray4", alpha = 0.8) +
  coord_flip() +
  theme_bw() +
  labs(x = "", y = "",
       title = "Top 10 authors on Medium in terms of total articles published")
```

...
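As a side note, the group_by()/count()/arrange() chain collapses into one call with count()'s sort argument; an equivalent sketch:

```r
# Equivalent to the chain above: count() handles grouping and sorting,
# and tidyr's drop_na() removes the missing authors
top_authors <- articles %>%
  count(author, sort = TRUE) %>%
  drop_na(author) %>%
  head(10)
```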

December 4, 2018 · Christopher Yee

TidyTuesday: Baltimore Bridges

Analyzing data for #tidytuesday week of 11/27/2018 (source)

```r
# LOAD PACKAGES AND PARSE DATA
library(tidyverse)
library(scales)
library(RColorBrewer)
library(forcats)

bridges_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-11-27/baltimore_bridges.csv")

bridges <- bridges_raw
```

Do bridge conditions get better over time?

```r
# REORDER BRIDGE_CONDITION FACTORS
x <- bridges
x$bridge_condition <- as.factor(x$bridge_condition)
x$bridge_condition <- factor(x$bridge_condition,
                             levels = c("Poor", "Fair", "Good"))

x %>%
  filter(yr_built >= 1900) %>%  # removing 2017 due to outlier
  select(lat, long, yr_built, bridge_condition, avg_daily_traffic) %>%
  group_by(yr_built, bridge_condition) %>%
  summarize(avg_daily_traffic = mean(avg_daily_traffic)) %>%
  ggplot() +
  geom_col(aes(yr_built, avg_daily_traffic, fill = bridge_condition), alpha = 0.3) +
  scale_y_continuous(label = comma_format(), limits = c(0, 223000)) +
  scale_fill_brewer(palette = 'Set1') +
  scale_color_brewer(palette = 'Set1') +
  geom_smooth(aes(yr_built, avg_daily_traffic, color = bridge_condition), se = FALSE) +
  theme_bw() +
  labs(x = "", y = "",
       title = "Baltimore bridges: average daily traffic by year built",
       subtitle = "Applied smoothing to highlight differences in bridge conditions and dampen outliers",
       fill = "Bridge Condition", color = "Bridge Condition")
```

...
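Since forcats is already loaded, the base-R releveling could also be written inline with fct_relevel(); a minimal equivalent sketch:

```r
# Equivalent releveling with forcats instead of two base factor() calls
x <- bridges %>%
  mutate(bridge_condition = fct_relevel(bridge_condition, "Poor", "Fair", "Good"))
```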

November 27, 2018 · Christopher Yee

TidyTuesday: Thanksgiving Dinner

Analyzing data for #tidytuesday week of 11/20/2018 (source)

```r
# LOAD PACKAGES AND PARSE DATA
library(tidyverse)
library(scales)
library(RColorBrewer)
library(forcats)

thanksgiving_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-11-20/thanksgiving_meals.csv")

thanksgiving <- thanksgiving_raw %>%
  filter(celebrate != 'No')
```

What are the most popular pies for Thanksgiving?

```r
thanksgiving %>%
  select(pie1:pie13) %>%
  pivot_longer(pie1:pie13, names_to = "pie_type") %>%
  filter(value != 'None') %>%
  select(value) %>%
  group_by(value) %>%
  count() %>%
  filter(n > 10) %>%
  ungroup() %>%
  ggplot(aes(reorder(value, n), n, label = n)) +
  geom_bar(aes(fill = value), alpha = 0.9, stat = 'identity') +
  coord_flip() +
  theme_classic() +
  theme(legend.position = 'none') +
  labs(title = "Most Popular Pies for Thanksgiving (n=980)",
       subtitle = "Question: Which type of pie is typically served at your Thanksgiving dinner? \n Please select all that apply",
       x = "", y = "")
```

...
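The tallying portion of that pipeline can also be condensed; a sketch of an equivalent count:

```r
# Equivalent tally of pie mentions: pivot the pie columns, drop "None",
# and let count() handle the grouping and sorting in one pass
pie_counts <- thanksgiving %>%
  select(pie1:pie13) %>%
  pivot_longer(everything(), names_to = "pie_type", values_to = "pie") %>%
  filter(pie != "None") %>%
  count(pie, sort = TRUE) %>%
  filter(n > 10)
```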

November 20, 2018 · Christopher Yee

For the Love of Data, Segment!

Aggregated data is misleading. Let’s read that again: aggregated data is misleading. Why? Because the homogenized data set buries the meaningful insights. For example, I recently came across a competitive SEO analysis that examined the relationship between the number of ranking organic keywords and the estimated traffic from organic search for a handful of websites. In my opinion, this is a great start for understanding the opportunity size of a market and how a given business stacks up against its competitors. ...
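To make that concrete, here is a small simulated sketch (illustrative data of my own, not from the post): two segments with very different keyword-to-traffic slopes that a single pooled trend line would wash out.

```r
library(tidyverse)

# Purely illustrative, simulated data: brand queries convert ranking
# keywords into traffic at a much higher rate than non-brand queries
set.seed(2018)
seo_data <- bind_rows(
  tibble(segment  = "Brand",
         keywords = runif(50, 100, 1000),
         traffic  = 50 * keywords + rnorm(50, sd = 5000)),
  tibble(segment  = "Non-brand",
         keywords = runif(50, 100, 1000),
         traffic  = 5 * keywords + rnorm(50, sd = 500))
)

# Per-segment trend lines vs the pooled (dashed) trend line
seo_data %>%
  ggplot(aes(keywords, traffic)) +
  geom_point(aes(color = segment), alpha = 0.6) +
  geom_smooth(aes(color = segment), method = "lm", se = FALSE) +
  geom_smooth(method = "lm", se = FALSE, color = "black", linetype = "dashed") +
  theme_bw() +
  labs(x = "Ranking organic keywords", y = "Estimated organic traffic")
```

The pooled line splits the difference between the two segments and describes neither, which is the whole argument for segmenting first.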

August 29, 2018 · Christopher Yee

Data Viz: Top Marketing Words in LinkedIn Job Titles

I abhor tabulated data for a number of reasons:

- It is quite difficult for the human eye to spot trends
- It puts a burden on the end user to spend extra time digesting the information
- True insights get lost because the devil is in the details

In fact, individuals who join Square’s SEO team (I’m hiring, by the way) are required to read this book on how to visualize data before making any presentations - an SEO bible, if you will. ...

March 10, 2018 · Christopher Yee