We may be inundated with data, but sometimes collecting it can be a challenge in and of itself. A few reasons off the top of my head:
- Sparsity
- Difficult to measure
- Impractical to devote company resources to it
- Lack of technical expertise to actually build or acquire it
- Lazy (yours truly - except for that one time)
Through simulation we can generate our own dataset with the added benefit of fully understanding what features we choose to put in our models (or leave out).
In fact, a few of the machine learning models I wrote and put into production at work are based on simulated data!
This article provides a quick walkthrough to get you up and running with #rstats.
Background
I am in the market for a smart camera, so while shopping online I also compiled some page speed data for a few eCommerce websites. You can follow along with the data here:
library(tidyverse)
library(scales)
library(knitr)
library(kableExtra)
df <- read_csv("https://raw.githubusercontent.com/Eeysirhc/random_datasets/master/page_speed_benchmark.csv")
# VIEW RANDOM SAMPLE
df_sample <- df %>%
  sample_n(10)
website | page_type | product | time_to_interactive |
---|---|---|---|
Smarthome | Category | Electrical | 9.8 |
Home Depot | Category | Sensors | 25.1 |
IKEA | Category | Entertainment | 9.8 |
Smarthome | Category | Sensors | 9.8 |
Smarthome | Category | Electrical | 8.9 |
Amazon | Category | Sensors | 7.1 |
Walmart | Category | Sensors | 15.9 |
Amazon | Home |  | 10.4 |
Amazon | Category | Sensors | 7.1 |
BestBuy | Category | Sensors | 29.4 |
I used Google Chrome’s built-in page audit (Lighthouse) to log the time for each website, page type and product category.
There are other page speed metrics, but for educational purposes we'll focus on time_to_interactive.
Let's pose the question: which site is fastest in terms of time_to_interactive?
Standard approach
One way to answer that question is to compute descriptive statistics: take the averages, find the percent difference from the fastest site, and call it a day.
df_standard <- df %>%
  filter(page_type != 'Home') %>%
  group_by(website) %>%
  summarize(time_interactive = mean(time_to_interactive)) %>%
  ungroup() %>%
  arrange(time_interactive) %>%
  # 7.59 is Amazon's average time_to_interactive (the fastest site)
  mutate(slower_than_amazon = round((time_interactive / 7.59 - 1) * 100, 0),
         slower_than_amazon = paste0(slower_than_amazon, "%"))
website | time_interactive | slower_than_amazon |
---|---|---|
Amazon | 7.59000 | 0% |
IKEA | 9.44000 | 24% |
Smarthome | 9.44375 | 24% |
Walmart | 16.63077 | 119% |
Target | 21.36667 | 182% |
Home Depot | 25.20000 | 232% |
BestBuy | 31.23077 | 311% |
However, there is a problem - sample size!
df_size <- df %>%
  filter(page_type != 'Home') %>%
  group_by(website) %>%
  summarize(time_interactive = mean(time_to_interactive),
            count = n()) %>%
  ungroup() %>%
  arrange(desc(count))
website | time_interactive | count |
---|---|---|
Smarthome | 9.44375 | 16 |
BestBuy | 31.23077 | 13 |
Walmart | 16.63077 | 13 |
Amazon | 7.59000 | 10 |
Home Depot | 25.20000 | 10 |
Target | 21.36667 | 6 |
IKEA | 9.44000 | 5 |
eCommerce sites have hundreds of thousands of pages, so how can we be certain our summary captures actual page speed performance? Perhaps by adding a confidence interval?
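For example, a rough normal-approximation 95% interval around each site's mean might look like the sketch below (with samples this small, the intervals come out fairly wide, which is exactly the concern):
df %>%
  filter(page_type != 'Home') %>%
  group_by(website) %>%
  summarize(time_interactive = mean(time_to_interactive),
            se = sd(time_to_interactive) / sqrt(n())) %>%
  ungroup() %>%
  # rough 95% interval: mean +/- 1.96 standard errors
  mutate(lower = time_interactive - 1.96 * se,
         upper = time_interactive + 1.96 * se) %>%
  arrange(time_interactive)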
At a more granular level, IKEA and Smarthome have practically the same average time_to_interactive of about 9.44 seconds, but I recorded far fewer samples for the former (5 vs 16) - which site is actually faster?
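A quick housekeeping note: the plots below use cbPalette, cbPalette2 and cbPalette3, custom color vectors that aren't defined in the code shown here. If you are following along, any stand-in will work, for example:
# stand-in color palettes (arbitrary picks; swap in whatever you prefer)
cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#D55E00")
cbPalette2 <- c("#0072B2", "#CC79A7")
cbPalette3 <- c("#E69F00", "#56B4E9")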
df %>%
  filter(website %in% c('IKEA', 'Smarthome')) %>%
  ggplot(aes(website, time_to_interactive, color = website)) +
  geom_point(show.legend = FALSE, size = 5, alpha = 0.5) +
  geom_hline(yintercept = 9.44, lty = 2, color = 'red') +
  scale_color_manual(values = cbPalette3) +
  scale_y_continuous(limits = c(0, 15)) +
  labs(x = NULL, y = "Time to Interactive (seconds)",
       subtitle = "Dashed line represents average of 9.44s")
What if I don’t know how to write a script to grab every URL and then feed it into Google Lighthouse? Or more realistically, what if I am not inclined to go and collect 11 more data points for IKEA?
These questions can be answered with the help of simulation. By applying the central limit theorem and law of large numbers we can directly address measurement uncertainty.
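As a quick aside to illustrate the law of large numbers, the mean of simulated draws settles toward the true mean as the number of observations grows (here using IKEA's summary values; your exact numbers will differ since the draws are random):
# sample means converge toward the true mean (9.44) as n increases
sapply(c(10, 100, 1000, 1e5), function(n) mean(rnorm(n, mean = 9.44, sd = 0.688)))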
Simulating “big” data
To get started we will need three values:
- n = number of observations
- mean = vector of means
- sd = vector of standard deviations
df_summary <- df %>%
  filter(page_type != 'Home') %>%
  group_by(website) %>%
  summarize(time_interactive = mean(time_to_interactive),
            sd = sd(time_to_interactive),
            count = n()) %>%
  ungroup() %>%
  arrange(time_interactive)
website | time_interactive | sd | count |
---|---|---|---|
Amazon | 7.59000 | 1.9081405 | 10 |
IKEA | 9.44000 | 0.6877500 | 5 |
Smarthome | 9.44375 | 1.1488944 | 16 |
Walmart | 16.63077 | 0.8300448 | 13 |
Target | 21.36667 | 2.0175893 | 6 |
Home Depot | 25.20000 | 1.4055446 | 10 |
BestBuy | 31.23077 | 2.6042864 | 13 |
Now that we have the minimum requirements we can simulate our data. Let’s start with IKEA:
ikea <- rnorm(1e4, 9.44, 0.688) %>%
  as_tibble() %>%
  mutate(website = paste0("ikea"))
There was a lot to unpack there, so let's break it down:
- rnorm() is the R function for generating random numbers from a Gaussian (normal) distribution
- 1e4 is scientific notation for 10,000 observations
- 9.44 is the mean time_to_interactive for IKEA
- 0.688 is the standard deviation
- as_tibble() moves our data into the tidyverse, converting the numeric vector into a tibble with a single column named value
- mutate() adds a website column labeling this set of results as IKEA
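Note that rnorm() draws are random, so the exact probabilities quoted below will vary slightly from run to run; if you want reproducible numbers, set a seed before simulating:
# optional: fix the random number generator for reproducible results (seed value is arbitrary)
set.seed(2019)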
We can now plot the distribution of our 10K simulated time_to_interactive scores for IKEA.
ikea %>%
  ggplot(aes(value, fill = website)) +
  geom_histogram(position = 'identity', binwidth = 0.05, alpha = 0.8,
                 show.legend = FALSE) +
  scale_fill_manual(values = cbPalette) +
  scale_x_continuous(limits = c(0, 20)) +
  labs(x = "Time to Interactive (seconds)", y = NULL)
What this illustrates is the relative frequency of page speed scores and the range of plausible values.
In other words, if we take a random page on IKEA and measure its time_to_interactive, scores could plausibly come in as low as 8s or as high as 11s. There is a central tendency for scores to fall around 9s, and it would be extremely unlikely to see a score of 1s or more than 13s.
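To put rough numbers on that range, we can check the extreme quantiles of the simulated draws (your exact values will differ slightly because the draws are random):
# middle 99% of the simulated IKEA scores
quantile(ikea$value, probs = c(0.005, 0.995))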
This is an improvement over a single summary statistic, but what if we wanted to ask more complicated questions? What if we wanted to know the likelihood a random IKEA page takes more than 11s to become interactive? Or which site is faster: Amazon or IKEA?
The solution is to condition on our simulated dataset.
Conditioning on the imaginary
Let's ask the question: what is the probability IKEA will have a page speed greater than 11 seconds?
We can easily get that answer by computing the share of simulated values that cross that threshold:
sum(ikea$value > 11) / length(ikea$value) * 100
## [1] 1.45
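Since the mean of a logical vector is just the proportion of TRUE values, the same answer can be written more compactly (this is the form used later in the post):
# equivalent: mean() of a logical vector gives the proportion directly
mean(ikea$value > 11) * 100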
And to drive that home with some data viz…
ikea %>%
  mutate(greater_11s = ifelse(value > 11, 'yes', 'no')) %>%
  ggplot(aes(value, fill = greater_11s)) +
  geom_histogram(binwidth = 0.05, alpha = 0.7) +
  scale_fill_manual(values = cbPalette) +
  scale_x_continuous(limits = c(0, 20)) +
  labs(x = "Time to Interactive (seconds)", y = NULL)
We can also head in the opposite direction: what is the probability IKEA will have a page speed of less than 9 seconds?
sum(ikea$value < 9) / length(ikea$value) * 100
## [1] 25.72
And to plot our results…
ikea %>%
  mutate(less_than_9s = ifelse(value < 9, 'yes', 'no')) %>%
  ggplot(aes(value, fill = less_than_9s)) +
  geom_histogram(binwidth = 0.05, alpha = 0.7) +
  scale_fill_manual(values = cbPalette) +
  scale_x_continuous(limits = c(0, 20)) +
  labs(x = "Time to Interactive (seconds)", y = NULL)
We can also ask more complicated questions, such as: what is the probability a random Amazon page will be faster than a random IKEA page?
# SIMULATE AMAZON DATA
amazon <- rnorm(1e4, 7.59, 1.91) %>%
  as_tibble() %>%
  mutate(website = paste0("amazon"))

# compare the 10K Amazon draws against the 10K IKEA draws pairwise
mean(amazon$value < ikea$value) * 100
## [1] 82.39
And…well…you get the idea….
pagespeed <- rbind(ikea, amazon)

pagespeed %>%
  ggplot(aes(value, fill = website)) +
  geom_histogram(binwidth = 0.05, alpha = 0.7, position = 'identity') +
  scale_fill_manual(values = cbPalette2) +
  scale_x_continuous(limits = c(0, 20)) +
  labs(x = "Time to Interactive (seconds)", y = NULL)
Side note
From my personal experience working at large companies, business executives respond very well to probabilities. Thus, “our site is 24% slower than Amazon” is not as impactful as stating “if we were to take 10K random pages from each site, there is an 82% chance our site will be slower than Amazon.”
What about Amazon?
In the past I wrote about segmenting data to reveal deeper insights hidden beneath the aggregation.
So, what is really driving Amazon's page speed score? Let's simulate some data, and for fun we'll increase our observations from 1e4 (10K) to 1e5 (100K). A quick reminder of Amazon's page speed data by product category:
df_amazon <- df %>%
  filter(website == 'Amazon', page_type != 'Home') %>%
  group_by(product) %>%
  summarize(time_interactive = mean(time_to_interactive),
            sd = sd(time_to_interactive)) %>%
  arrange(time_interactive)
product | time_interactive | sd |
---|---|---|
Electrical | 5.750000 | 0.212132 |
Security | 7.933333 | 2.458319 |
Sensors | 8.066667 | 1.674316 |
Entertainment | 8.200000 | 2.545584 |
And the code to generate our data with the subsequent plot:
electrical <- rnorm(1e5, 5.75, 0.212) %>% as_tibble() %>%
  mutate(product = paste0("electrical"))
security <- rnorm(1e5, 7.93, 2.46) %>% as_tibble() %>%
  mutate(product = paste0("security"))
sensors <- rnorm(1e5, 8.07, 1.67) %>% as_tibble() %>%
  mutate(product = paste0("sensors"))
entertainment <- rnorm(1e5, 8.2, 2.55) %>% as_tibble() %>%
  mutate(product = paste0("entertainment"))
amazon <- rbind(electrical, security, sensors, entertainment)

amazon %>%
  ggplot(aes(value, fill = product)) +
  geom_histogram(binwidth = 0.05, alpha = 0.7, position = 'identity') +
  scale_x_continuous(limits = c(0, 20)) +
  scale_y_continuous(labels = comma_format()) +
  scale_fill_brewer(palette = 'Spectral') +
  labs(x = "Time to Interactive (seconds)", y = NULL,
       title = "Amazon: time to interactive by product category (n=100K)")
Well, this is quite shocking (pun intended) - the electrical category's time_to_interactive is not only faster, but its scores are also far less dispersed around the center than the other three categories.
Why might that be the case? A specific business focus on these products? Fewer teams working on them? Not enough products? All conjecture on my part without digging deeper into the site.
Finally, what is the probability the electrical category is faster than security (2nd place)?
mean(electrical$value < security$value) * 100
## [1] 81.145
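We could even ask: what is the chance a random electrical page is the fastest of all four categories?
# probability electrical beats security, sensors AND entertainment on a random draw
mean(electrical$value < security$value &
       electrical$value < sensors$value &
       electrical$value < entertainment$value) * 100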
Wrapping up
With data simulation we can account for sample size and uncertainty while keeping the results interpretable. And we can achieve this understanding without building anything entirely new.
This methodology can be applied to any aspect of digital marketing where data is the lifeblood of the channel. For example…
- Rankings: your daily rank jumps from #9 to #1, you celebrate too early, and it drops back down to #8 the next day. If you had calculated the probability of hitting #1 from the source data, would you have celebrated so early?
- Traffic: was the spike in site visitors a real phenomenon or just random chance?
- Click-through Rate: how do you handle low volume data where you only received 2 clicks and 2 impressions? You don’t want to kick out data because it is telling you something! (check out my R guide and the section on estimating CTR with empirical Bayes)
Intentional exclusion
I specifically left out certain concepts to get the reader excited about using R to simulate their own data. Although important, I will leave those for future articles; the following are notes to myself:
- Foundation for Bayesian statistics
- No mention of incorporating conjugate priors
- Other R distribution functions: dnorm, pnorm, qnorm
- Minimized mathematical notation