Christopher Yee

Long Tail Search Keywords: Volume or Length?

I stumbled upon an interesting poll last week…

SEO folks. When you reference “a long tail query” do you mean: (Trying again - deleted last poll as I phrased it ambiguously. Sorry.)
— Will Critchlow (@willcritchlow) May 14, 2019

…and was amazed by the sheer number of incorrect responses. Ever the skeptic, I wanted to make sure this was a real phenomenon despite the 1.2K survey responses. Using Monte Carlo simulation, we find there is a 100% chance the polling differences between the three groups are statistically significant (see notes for conjugate priors). ...
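The excerpt does not show the simulation itself, so here is a rough sketch of what a Monte Carlo check with a conjugate prior could look like. The vote counts are made up (they only sum to roughly the 1.2K responses mentioned above), and both the flat Dirichlet prior and the base R approach are assumptions for illustration, not the method from the full post.

```r
# Hypothetical vote counts for the three poll options (NOT the real poll results)
votes <- c(option_a = 600, option_b = 400, option_c = 200)

set.seed(2019)
n_draws <- 10000

# Conjugate update: Dirichlet(1, 1, 1) prior + multinomial counts
# gives a Dirichlet(votes + 1) posterior
alpha_post <- votes + 1

# Draw from the Dirichlet posterior by normalising independent gamma draws
gamma_draws <- sapply(alpha_post, function(a) rgamma(n_draws, shape = a, rate = 1))
posterior <- gamma_draws / rowSums(gamma_draws)

# Share of simulated draws in which the observed ranking a > b > c holds
prob_ordered <- mean(posterior[, "option_a"] > posterior[, "option_b"] &
                     posterior[, "option_b"] > posterior[, "option_c"])
prob_ordered
```

With gaps this wide between the made-up counts, nearly every draw preserves the ranking, which is the kind of result a "100% chance" statement would summarise.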

Getting Started With R Using Google Search Console Data

Follow along with the full code here.

A number of people have approached me over the years asking for materials on how to use R to draw insights from their search engine marketing data. Although I had no reservations about giving away that information (with redactions), it was largely reserved for internal training courses at my previous companies. This guide is long overdue, though, and is meant to address a few things on my mind: ...

Text Mining the Redacted Mueller Report

After two politically charged years, Robert Mueller finally concluded his investigation into Russian interference in the 2016 presidential election. The outcome was a 440+ page report on the findings - the perfect candidate for some text mining.

Side note: the idea for this post came when my attempts to extract the PDF text proved unsuccessful because it was locked in an unsearchable version. As a consequence, I did a little tweet mining instead:

too busy to read all 400+ pages of the #muellerreport but apparently not busy enough to do some text/tweet mining with #rstats according to this network graph though I should definitely check out page 290 pic.twitter.com/RPiahsrg9A ...

TidyTuesday: Women in the Workforce

Analyzing data for #tidytuesday week of 3/05/2019 (source)

Load libraries

    library(tidyverse)
    library(scales)
    library(lubridate)

    jobs_gender <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/jobs_gender.csv")

Clean & plot data

    jobs_gender %>%
      filter(year == 2016) %>%
      # earnings gap vs. the occupation average, weighted by each group's share of workers
      mutate(male_diff = ((total_earnings_male / total_earnings) - 1) * workers_male / total_workers,
             female_diff = ((total_earnings_female / total_earnings) - 1) * workers_female / total_workers) %>%
      ggplot() +
      geom_jitter(aes(total_earnings, female_diff), color = 'salmon', alpha = 0.5, size = 2.5) +
      geom_jitter(aes(total_earnings, male_diff), color = 'steelblue', alpha = 0.5, size = 2.5) +
      geom_hline(yintercept = 0, color = 'grey54', lty = 'dashed') +
      facet_wrap(~major_category) +
      scale_x_continuous(labels = dollar_format(), limits = c(0, 200000)) +
      scale_y_continuous(labels = percent_format(accuracy = 1), limits = c(-0.3, 0.3)) +
      labs(x = "Average Median Earnings",
           y = "Difference from Average",
           caption = "Graphic: @eeysirhc\nSource: Bureau of Labor Statistics",
           title = "2016 Earnings Differences (Weighted) by Job Sector",
           subtitle = "Blue = Male; Red = Female") +
      theme_bw() +
      theme(panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
            plot.subtitle = element_text(size = 12),
            legend.position = 'none')

...
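To make the weighted difference in the mutate() step easier to check by hand, here is a small worked example in base R. The helper name weighted_diff and the single occupation's numbers are hypothetical, not values from the Bureau of Labor Statistics data.

```r
# Hypothetical helper mirroring the male_diff / female_diff formula above
weighted_diff <- function(group_earnings, total_earnings, group_workers, total_workers) {
  ((group_earnings / total_earnings) - 1) * group_workers / total_workers
}

# Made-up occupation: 1,000 workers earning 50,000 on average
male_diff   <- weighted_diff(55000, 50000, 600, 1000)  # 10% above average, 60% of workers
female_diff <- weighted_diff(45000, 50000, 400, 1000)  # 10% below average, 40% of workers

round(c(male = male_diff, female = female_diff), 4)
```

Here the male figure works out to 0.10 × 0.6 = 0.06 and the female figure to -0.10 × 0.4 = -0.04, so the same 10% gap counts for more in the group that makes up more of the occupation.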

Hello, can we stop using pie charts?

I came across this tweet and its corresponding graph a few days ago:

Did you know? 🧐 1‐word keywords account for only 2.8% of all the keywords people search for in the United States. pic.twitter.com/GXdfttn3jk
— Tim Soulo (@timsoulo) February 21, 2019

I love ahrefs and all but it’s 2019 - WHY ARE WE STILL USING PIE CHARTS?! I’ll spare my opinion since there is already a ton of literature out there, but here are a few to get started: ...

TidyTuesday: Housing Prices

Instead of a static visualization I decided to build a barebones Shiny app this week. The purpose is to improve the interactivity of the final output - one of my 2019 goals to level up advanced R knowledge. You can find the full code here.

Analyzing data for #tidytuesday week of 2/5/2019 (source)

    # LOAD PACKAGES
    library(tidyverse)
    library(scales)
    library(shiny)

    state_hpi_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-05/state_hpi.csv")

Process the raw data

    state_hpi <- state_hpi_raw %>%
      group_by(state, year) %>%
      summarize(us_avg = mean(us_avg),
                price_index = mean(price_index)) %>%
      mutate(pct_diff = (price_index / us_avg) - 1,
             segment = ifelse(pct_diff > 0, 'above', 'below'),
             segment = str_to_title(segment))

Build the UI level

Include a drop down menu to select output data by state abbreviation

    ui <- fluidPage(
      "Housing Price Index: US Average vs State",
      selectInput(inputId = "select_state",
                  label = "Choose a state",
                  choices = state.abb),
      plotOutput("hpi1"),
      plotOutput("hpi2")
    )

Build the server level

Plot 1: Time series for a given state average annual housing price index compared to the US average
Plot 2: Time series for the percentage difference of a given state housing price index against the US average

    server <- function(input, output, session) {

      # Plot 1: state index (line) against the US average (bars)
      output$hpi1 <- renderPlot({
        state_hpi %>%
          filter(state == input$select_state) %>%
          group_by(year, state) %>%
          summarize(price_index = mean(price_index),
                    us_avg = mean(us_avg)) %>%
          ggplot() +
          geom_line(aes(year, price_index), size = 2, color = 'steelblue') +
          geom_col(aes(year, us_avg), alpha = 0.3, fill = 'grey54') +
          labs(x = NULL, y = "Housing Price Index") +
          theme_bw(base_size = 15) +
          scale_y_continuous(limits = c(0, 300))
      })

      # Plot 2: percentage difference between the state and the US average
      output$hpi2 <- renderPlot({
        state_hpi %>%
          filter(state == input$select_state) %>%
          ggplot() +
          geom_col(aes(year, pct_diff, fill = segment), alpha = 0.8) +
          geom_hline(yintercept = 0, lty = 'dashed') +
          scale_fill_brewer(palette = 'Set1', direction = -1) +
          scale_y_continuous(labels = percent_format(accuracy = 1)) +
          theme_bw(base_size = 15) +
          theme(legend.position = 'top') +
          labs(x = NULL, y = "Difference to US Average", fill = NULL)
      })
    }

Build the app level

    shinyApp(ui, server)

Check out the final production build here ...
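For anyone who wants to sanity-check the numbers behind Plot 2 without launching the app, the sketch below repeats the yearly averaging and percentage-difference steps on a handful of made-up rows. It uses base R instead of dplyr so it runs without extra packages; the toy_hpi values are illustrative only.

```r
# Made-up monthly index values for one state (NOT taken from the real dataset)
toy_hpi <- data.frame(
  state = "CA",
  year = c(2000, 2000, 2001, 2001),
  price_index = c(110, 130, 90, 100),
  us_avg = c(100, 100, 100, 100)
)

# Average within each state-year, as the group_by() + summarize() step does
yearly <- aggregate(cbind(price_index, us_avg) ~ state + year,
                    data = toy_hpi, FUN = mean)

# Relative gap to the US average and the label used for the bar colours
yearly$pct_diff <- yearly$price_index / yearly$us_avg - 1
yearly$segment <- ifelse(yearly$pct_diff > 0, "Above", "Below")
yearly
```

In this toy data the state sits 20% above the US average in 2000 and 5% below it in 2001, so the two bars would land on opposite sides of the dashed zero line.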

TidyTuesday: Milk Production

Analyzing data for #tidytuesday week of 1/29/2019 (source)

    # LOAD PACKAGES
    library(tidyverse)
    library(scales)
    library(lubridate)
    library(ggmap)
    library(gganimate)
    library(ggthemes)
    library(transformr)
    library(gifski)
    library(mapproj)

    milk_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-29/state_milk_production.csv")

    milk <- milk_raw

Extract geospatial data and parse data

    # map_data() needs the maps package to be installed
    usa <- as_tibble(map_data("state"))
    usa$region <- str_to_title(usa$region)
    usa <- usa %>%
      rename(state = region)

    milk_parsed <- milk %>%
      select(-region) %>%
      mutate(milk_10billion = milk_produced / 10000000000,
             year = as.integer(year)) %>%
      full_join(usa, by = "state") %>%
      filter(!is.na(year), !is.na(long), !is.na(lat))

Build animation

    milk_animation <- milk_parsed %>%
      ggplot(aes(long, lat, group = group, fill = milk_10billion)) +
      geom_polygon(color = 'black') +
      scale_fill_gradient2(low = "gray97", mid = "steelblue", high = "midnightblue", midpoint = 2.5) +
      theme_map() +
      coord_map() +
      labs(x = NULL, y = NULL, fill = NULL,
           title = "Milk production (tens of billions of pounds)",
           subtitle = "Year: {round(frame_time)}",
           caption = "Source: USDA") +
      transition_time(year)

    animate(milk_animation, height = 800, width = 800)

...

TidyTuesday: Incarceration Trends

Analyzing data for #tidytuesday week of 1/22/2019 (source)

    # LOAD PACKAGES AND PARSE DATA
    library(tidyverse)
    library(scales)
    library(lubridate)
    library(RColorBrewer)

    prison_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-22/prison_population.csv")

    prison <- prison_raw

Process the raw data

    total <- prison %>%
      filter(pop_category != 'Total' & pop_category != 'Male' & pop_category != 'Female') %>%
      select(county_name, urbanicity, pop_category, population, prison_population) %>%
      na.omit() %>%
      group_by(county_name, urbanicity, pop_category) %>%
      summarize(population = sum(population),
                prison_population = sum(prison_population)) %>%
      ungroup() %>%
      # each group's share of the county's residents and of its prisoners
      group_by(county_name, urbanicity) %>%
      mutate(pct_population = population / sum(population),
             pct_prisoner = prison_population / sum(prison_population))

What is the proportion of population to prisoners per demographic group?

    total %>%
      filter(pop_category != 'Other') %>%
      ggplot() +
      geom_point(aes(pct_population, pct_prisoner), alpha = 0.1, size = 2, color = 'grey') +
      geom_smooth(aes(pct_population, pct_prisoner, color = pop_category), size = 1.2, se = FALSE) +
      theme_light() +
      scale_y_continuous(labels = percent_format()) +
      scale_x_continuous(labels = percent_format()) +
      labs(x = "County Population",
           y = "Prisoner Population",
           color = "",
           title = "Comparison of county to prison population by ethnicity from 1970 to 2016",
           subtitle = "Specific groups are overrepresented in the prisoner population",
           caption = "Source: Vera Institute of Justice") +
      geom_abline(linetype = 'dashed') +
      scale_color_brewer(palette = 'Set1') +
      theme(panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
            legend.position = 'top',
            panel.background = element_rect(fill = 'gray97', color = 'gray97', size = 0.5, linetype = 'solid'))

...
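The dashed diagonal in the plot marks where a group's share of prisoners equals its share of residents. As a rough illustration of how a point ends up above or below that line, here is a base R sketch on two made-up counties; the county names, group labels, counts and the ratio column are all hypothetical and not drawn from the Vera Institute data.

```r
# Made-up counts for two counties and two groups (illustrative only)
toy <- data.frame(
  county_name = c("A", "A", "B", "B"),
  pop_category = c("Group 1", "Group 2", "Group 1", "Group 2"),
  population = c(800, 200, 500, 500),
  prison_population = c(30, 20, 10, 10)
)

# Shares within each county, mirroring the group_by() + mutate() step
toy$pct_population <- toy$population / ave(toy$population, toy$county_name, FUN = sum)
toy$pct_prisoner <- toy$prison_population / ave(toy$prison_population, toy$county_name, FUN = sum)

# Ratio above 1: the group sits above the diagonal (overrepresented in prison)
toy$ratio <- toy$pct_prisoner / toy$pct_population
toy
```

In county A, Group 2 is 20% of residents but 40% of prisoners, so it lands above the diagonal with a ratio of 2. In county B both groups sit exactly on the line.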