TidyTuesday: Steam Games

Data from #tidytuesday week of 2019-07-30 (source)

Load R packages

    library(tidyverse)
    library(RColorBrewer)
    library(scales)

Download data

    steam_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-07-30/video_games.csv")

Parse data

    steam_games <- steam_raw %>%
      # VARIABLE FOR AGE OF GAME
      mutate(release_year = substring(release_date, 8, 12), # EXTRACT YEAR
             release_year = as.numeric(str_trim(release_year)),
             release_year = case_when(release_year == 5 ~ 2015, # INCORRECT DATA POINT
                                      TRUE ~ release_year),
             age = 2019 - release_year) %>%
      # VARIABLE FOR MIN/MAX NUMBER OF OWNERS
      mutate(max_owners = str_trim(word(owners, 2, sep = "\\..")),
             max_owners = as.numeric(str_replace_all(max_owners, ",", "")),
             min_owners = str_trim(word(owners, 1, sep = "\\..")),
             min_owners = as.numeric(str_replace_all(min_owners, ",", ""))) %>%
      # REMOVE VALUES WITH INCONSISTENT RELEASE_DATE FORMAT (n=37)
      filter(age < 15) %>%
      # FILTER OUT STUDIO SOFTWARE
      filter(price < 150)

Visualize data

Question: how many people still play games that are X years old (on Steam)? ...
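To see what the min/max owners step is doing, here is a minimal base R sketch of the same idea on a few made-up values shaped like the `owners` column. It is an illustration, not the post's code: the sample strings and the `to_number` helper are assumptions, and it uses `strsplit()` instead of the stringr calls above.

```r
# Toy values in the same "min .. max" shape as the `owners` column
# (made up for illustration, not taken from the dataset)
owners <- c("10,000,000 .. 20,000,000", "0 .. 20,000", "50,000 .. 100,000")

# Split on the literal ".." separator; fixed = TRUE avoids regex interpretation
parts <- strsplit(owners, "..", fixed = TRUE)

# Hypothetical helper: trim whitespace, drop thousands separators, convert to numeric
to_number <- function(x) as.numeric(gsub(",", "", trimws(x)))

min_owners <- to_number(sapply(parts, `[`, 1))
max_owners <- to_number(sapply(parts, `[`, 2))

data.frame(owners, min_owners, max_owners)
```

Splitting on a fixed string sidesteps the need to escape the dots, which is the part of the original pattern that is easiest to misread.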

July 30, 2019 · Christopher Yee

Classifying keywords with the fuzzyjoin R package

A few months ago I tweeted a complex (and tedious) Excel formula on how to classify keywords: For the #seo who insists on completing their keyword/intent research in excel Philosophy: keyword intent is not absolute so it won't fall neatly into an assigned bucket. For this reason a keyword can live under multiple conversion funnels since we can't be 100% certain. pic.twitter.com/JcTl9P11mC — Christopher Yee (@Eeysirhc) April 24, 2019 I then ended it with: ...
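The philosophy in that tweet (one keyword can sit in several buckets) can be sketched in a few lines of base R. This is a stand-in for illustration only and not the post's approach: the keywords, intent labels, and patterns below are all hypothetical, and the matching is done with `grepl()` over a cross join rather than with fuzzyjoin's regex joins.

```r
# Hypothetical keywords to classify
keywords <- c("buy best running shoes",
              "running shoes review",
              "what are running shoes",
              "running shoes")

# Hypothetical intent buckets, each defined by a regex pattern
intents <- data.frame(
  intent  = c("transactional", "commercial", "commercial", "informational"),
  pattern = c("buy|price|cheap", "best", "review", "what|how|why"),
  stringsAsFactors = FALSE
)

# Cross join: every keyword paired with every pattern
pairs <- merge(data.frame(keyword = keywords, stringsAsFactors = FALSE),
               intents, by = NULL)

# Keep the pairs where the pattern matches the keyword
is_match <- unname(mapply(grepl, pairs$pattern, pairs$keyword))
matched  <- unique(pairs[is_match, c("keyword", "intent")])

matched
```

A keyword that matches several patterns shows up once per intent, and a keyword that matches nothing is simply absent, which makes unclassified terms easy to spot with an anti-join style check.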

July 19, 2019 · Christopher Yee

Visualizing Netflix viewing activity

If you are like me then it’s very likely you share your Netflix account with multiple users. If you are also like me then it’s very likely you would be curious about how your Netflix viewing activity compares and contrasts to all the parasites on your account! In this post we’ll leverage #rstats to visualize what that will look like.

Load packages

Let’s fire up our favorite packages.

    library(tidyverse)
    library(lubridate)
    library(igraph)
    library(ggraph)
    library(tidygraph)
    library(influenceR)

Download data

With the exception of my own viewing activity (I’m not ashamed!), I have provided anonymized Netflix viewing data from a few family and friends for you to follow along. ...

July 2, 2019 · Christopher Yee

Mining Google Trends data with R

Google Trends is great for understanding relative search popularity for a given keyword or phrase. However, if we want to explore a topic further, retrieving that data through the web interface is quite clunky. Enter the gtrendsR package for #rstats and what better way to demonstrate how this works than by pulling search popularity for ramen, pho, and spaghetti (hot on the heels of my last article about ramen ratings)! ...

June 28, 2019 · Christopher Yee

TidyTuesday: Ramen Ratings

Data from #tidytuesday week of 2019-06-04 (source)

Load R packages

    library(tidyverse)
    library(plotly)

Download and parse data frame

    ramen_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-06-04/ramen_ratings.csv")

    ramen <- ramen_raw %>%
      group_by(brand, country) %>%
      summarize(avg_rating = round(mean(stars),2),
                total_reviews = n())

Build plotly chart

    plot_ly(data = ramen,
            x = ~total_reviews,
            y = ~avg_rating,
            size = 15,
            color = ~country,
            colors = 'Paired',
            text = ~paste("Brand: ", brand, "<br>Average Rating: ", avg_rating),
            showlegend = FALSE) %>%
      layout(xaxis = list(title = "Total Reviews"),
             yaxis = list(title = "Average Ratings"))

June 4, 2019 · Christopher Yee

Long Tail Search Keywords: Volume or Length?

I stumbled upon an interesting poll last week… SEO folks. When you reference “a long tail query” do you mean: (Trying again - deleted last poll as I phrased it ambiguously. Sorry.) — Will Critchlow (@willcritchlow) May 14, 2019 …and was amazed by the sheer number of incorrect responses. Ever the skeptic, I wanted to make sure this was a real phenomenon despite the 1.2K survey responses. Using Monte Carlo simulation we find there is a 100% chance the polling differences between the three groups are statistically significant (see notes for conjugate priors). ...
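For readers curious what a Monte Carlo check with conjugate priors can look like, here is a minimal base R sketch. Everything specific in it is an assumption for illustration: the vote counts are made up (they only sum to roughly the 1.2K responses mentioned), the Dirichlet prior is uniform, and the Dirichlet-multinomial model itself may differ from what the full post uses.

```r
set.seed(42)

# Hypothetical vote counts for the three poll options (not the real poll results)
votes <- c(a = 540, b = 420, c = 240)

# Uniform Dirichlet prior; with multinomial data the posterior is Dirichlet(votes + prior)
prior  <- c(1, 1, 1)
alpha  <- votes + prior
n_sims <- 10000

# Draw from the Dirichlet posterior by normalizing independent gamma draws;
# rep(..., each = n_sims) fills the matrix one option (column) at a time
g     <- matrix(rgamma(n_sims * 3, shape = rep(alpha, each = n_sims)), ncol = 3)
theta <- g / rowSums(g)
colnames(theta) <- names(votes)

# Share of simulations in which one option's true share beats another's
p_a_gt_b <- mean(theta[, "a"] > theta[, "b"])
p_b_gt_c <- mean(theta[, "b"] > theta[, "c"])

c(p_a_gt_b = p_a_gt_b, p_b_gt_c = p_b_gt_c)
```

Each probability is simply the fraction of posterior draws in which the ordering holds, so a value at or near 1 means the simulated shares almost never overlap.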

May 20, 2019 · Christopher Yee

Getting Started With R Using Google Search Console Data

Follow along with the full code here.

A number of people have approached me over the years asking for materials on how to use R to draw insights from their search engine marketing data. Although I had no reservations giving away that information (with redactions) it was largely reserved for internal training courses at my previous companies. This guide is long overdue though and is meant to address a few things on my mind: ...

May 2, 2019 · Christopher Yee

Text Mining the Redacted Mueller Report

After two politically charged years, Robert Mueller finally concluded his investigation into Russian interference in the 2016 presidential election. The outcome was a 440+ page report on their findings - the perfect candidate for some text mining. Side note: the idea for this post came when my attempts to extract the PDF text proved unsuccessful because it was locked in an unsearchable version. As a consequence of that I did a little tweet mining instead: too busy to read all 400+ pages of the #muellerreport but apparently not busy enough to do some text/tweet mining with #rstats according to this network graph though I should definitely check out page 290 pic.twitter.com/RPiahsrg9A ...

April 18, 2019 · Christopher Yee