We may be inundated with data, but sometimes collecting it can be a challenge in and of itself. A few reasons off the top of my head:
- Sparsity
- Difficult to measure
- Impractical to devote company resources to it
- Lack of technical expertise to actually build or acquire it
- Lazy (yours truly - except for that one time)
Through simulation we can generate our own dataset with the added benefit of fully understanding what features we choose to put in our models (or leave out).
In fact, a few of the machine learning models I wrote and put into production at work are based on simulated data!
This article provides a quick walkthrough to get you up and running with #rstats.
Background
I am in the market for a smart camera, so while shopping online I also compiled some page speed data for a few eCommerce websites. You can follow along with the data here:
library(tidyverse)
library(scales)
library(knitr)
library(kableExtra)
df <- read_csv("https://raw.githubusercontent.com/Eeysirhc/random_datasets/master/page_speed_benchmark.csv")
# VIEW RANDOM SAMPLE
df_sample <- df %>%
  sample_n(10)
website | page_type | product | time_to_interactive |
---|---|---|---|
Smarthome | Category | Electrical | 9.8 |
Home Depot | Category | Sensors | 25.1 |
IKEA | Category | Entertainment | 9.8 |
Smarthome | Category | Sensors | 9.8 |
Smarthome | Category | Electrical | 8.9 |
Amazon | Category | Sensors | 7.1 |
Walmart | Category | Sensors | 15.9 |
Amazon | Home |  | 10.4 |
Amazon | Category | Sensors | 7.1 |
BestBuy | Category | Sensors | 29.4 |
I used Google Chrome’s built-in page audit (Lighthouse) to log the time for each website, page type and product category.
There are other page speed metrics, but for educational purposes we'll focus on time_to_interactive.
Let's pose the question: which site is fastest in terms of time_to_interactive?
Standard approach
One way to answer that question is to compute descriptive statistics: take the averages, find the percent difference from the fastest site, and call it a day.
df_standard <- df %>%
  filter(page_type != 'Home') %>%
  group_by(website) %>%
  summarize(time_interactive = mean(time_to_interactive)) %>%
  ungroup() %>%
  arrange(time_interactive) %>%
  # 7.59 is Amazon's average time_to_interactive (the fastest site)
  mutate(slower_than_amazon = round((time_interactive / 7.59 - 1) * 100, 0),
         slower_than_amazon = paste0(slower_than_amazon, "%"))
website | time_interactive | slower_than_amazon |
---|---|---|
Amazon | 7.59000 | 0% |
IKEA | 9.44000 | 24% |
Smarthome | 9.44375 | 24% |
Walmart | 16.63077 | 119% |
Target | 21.36667 | 182% |
Home Depot | 25.20000 | 232% |
BestBuy | 31.23077 | 311% |
However, there is a problem - sample size!
df_size <- df %>%
  filter(page_type != 'Home') %>%
  group_by(website) %>%
  summarize(time_interactive = mean(time_to_interactive),
            count = n()) %>%
  ungroup() %>%
  arrange(desc(count))
website | time_interactive | count |
---|---|---|
Smarthome | 9.44375 | 16 |
BestBuy | 31.23077 | 13 |
Walmart | 16.63077 | 13 |
Amazon | 7.59000 | 10 |
Home Depot | 25.20000 | 10 |
Target | 21.36667 | 6 |
IKEA | 9.44000 | 5 |
eCommerce sites have hundreds of thousands of pages, so how can we be certain our summary captures actual page speed performance? Perhaps by adding a confidence interval?
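For example, a rough normal-approximation 95% interval around each site's mean might look like the sketch below (with samples this small, the intervals come out fairly wide, which is exactly the concern):
df %>%
  filter(page_type != 'Home') %>%
  group_by(website) %>%
  summarize(time_interactive = mean(time_to_interactive),
            se = sd(time_to_interactive) / sqrt(n())) %>%
  ungroup() %>%
  # rough 95% interval: mean +/- 1.96 standard errors
  mutate(lower = time_interactive - 1.96 * se,
         upper = time_interactive + 1.96 * se) %>%
  arrange(time_interactive)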
At a more granular level, IKEA and Smarthome have practically the same average time_to_interactive of about 9.44 seconds, but I recorded far fewer samples for the former (5 vs 16) - which site is actually faster?
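A quick housekeeping note: the plots below use cbPalette, cbPalette2 and cbPalette3, custom color vectors that aren't defined in the code shown here. If you are following along, any stand-in will work, for example:
# stand-in color palettes (arbitrary picks; swap in whatever you prefer)
cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#D55E00")
cbPalette2 <- c("#0072B2", "#CC79A7")
cbPalette3 <- c("#E69F00", "#56B4E9")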
df %>%
  filter(website %in% c('IKEA', 'Smarthome')) %>%
  ggplot(aes(website, time_to_interactive, color = website)) +
  geom_point(show.legend = FALSE, size = 5, alpha = 0.5) +
  geom_hline(yintercept = 9.44, lty = 2, color = 'red') +
  scale_color_manual(values = cbPalette3) +
  scale_y_continuous(limits = c(0, 15)) +
  labs(x = NULL, y = "Time to Interactive (seconds)",
       subtitle = "Dashed line represents average of 9.44s")
What if I don’t know how to write a script to grab every URL and then feed it into Google Lighthouse? Or more realistically, what if I am not inclined to go and collect 11 more data points for IKEA?
These questions can be answered with the help of simulation. By applying the central limit theorem and law of large numbers we can directly address measurement uncertainty.
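As a quick aside to illustrate the law of large numbers, the mean of simulated draws settles toward the true mean as the number of observations grows (here using IKEA's summary values; your exact numbers will differ since the draws are random):
# sample means converge toward the true mean (9.44) as n increases
sapply(c(10, 100, 1000, 1e5), function(n) mean(rnorm(n, mean = 9.44, sd = 0.688)))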
Simulating “big” data
To get started we will need three values:
- n = number of observations
- mean = vector of means
- sd = vector of standard deviations
df_summary <- df %>%
  filter(page_type != 'Home') %>%
  group_by(website) %>%
  summarize(time_interactive = mean(time_to_interactive),
            sd = sd(time_to_interactive),
            count = n()) %>%
  ungroup() %>%
  arrange(time_interactive)
website | time_interactive | sd | count |
---|---|---|---|
Amazon | 7.59000 | 1.9081405 | 10 |
IKEA | 9.44000 | 0.6877500 | 5 |
Smarthome | 9.44375 | 1.1488944 | 16 |
Walmart | 16.63077 | 0.8300448 | 13 |
Target | 21.36667 | 2.0175893 | 6 |
Home Depot | 25.20000 | 1.4055446 | 10 |
BestBuy | 31.23077 | 2.6042864 | 13 |
Now that we have the minimum requirements we can simulate our data. Let’s start with IKEA:
ikea <- rnorm(1e4, 9.44, 0.688) %>%
  as_tibble() %>%
  mutate(website = paste0("ikea"))
There was a lot to unpack there, so let's break it down:
- rnorm() is the R function for generating random numbers from a Gaussian (normal) distribution
- 1e4 is scientific notation for 10,000 observations
- 9.44 is the mean time_to_interactive for IKEA
- 0.688 is the standard deviation
- as_tibble() moves our data into the tidyverse, converting the numeric vector into a tibble with a single column named value
- mutate() adds a website column labeling this set of results as IKEA
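Note that rnorm() draws are random, so the exact probabilities quoted below will vary slightly from run to run; if you want reproducible numbers, set a seed before simulating:
# optional: fix the random number generator for reproducible results (seed value is arbitrary)
set.seed(2019)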
We can now plot the distribution of our 10K simulated time_to_interactive scores for IKEA.
ikea %>%
  ggplot(aes(value, fill = website)) +
  geom_histogram(position = 'identity', binwidth = 0.05, alpha = 0.8,
                 show.legend = FALSE) +
  scale_fill_manual(values = cbPalette) +
  scale_x_continuous(limits = c(0, 20)) +
  labs(x = "Time to Interactive (seconds)", y = NULL)
What this illustrates is the relative frequency of page speed scores and the range of plausible values.
In other words, if we take a random page on IKEA and measure its time_to_interactive, scores could plausibly come in as low as 8s or as high as 11s. There is a central tendency for scores to fall around 9s, and it would be extremely unlikely to see a score of 1s or more than 13s.
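To put rough numbers on that range, we can check the extreme quantiles of the simulated draws (your exact values will differ slightly because the draws are random):
# middle 99% of the simulated IKEA scores
quantile(ikea$value, probs = c(0.005, 0.995))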
This is an improvement over a single summary statistic, but what if we wanted to ask more complicated questions? What if we wanted to know the likelihood a random IKEA page takes more than 11s to become interactive? Or which site is faster: Amazon or IKEA?
The solution is to condition on our simulated dataset.
Conditioning on the imaginary
Let's ask the question: what is the probability IKEA will have a page speed greater than 11 seconds?
We can easily get that answer by computing the share of simulated values that cross that threshold:
sum(ikea$value > 11) / length(ikea$value) * 100
## [1] 1.45
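Since the mean of a logical vector is just the proportion of TRUE values, the same answer can be written more compactly (this is the form used later in the post):
# equivalent: mean() of a logical vector gives the proportion directly
mean(ikea$value > 11) * 100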
And to drive that home with some data viz…
ikea %>%
  mutate(greater_11s = ifelse(value > 11, 'yes', 'no')) %>%
  ggplot(aes(value, fill = greater_11s)) +
  geom_histogram(binwidth = 0.05, alpha = 0.7) +
  scale_fill_manual(values = cbPalette) +
  scale_x_continuous(limits = c(0, 20)) +
  labs(x = "Time to Interactive (seconds)", y = NULL)
We can also head in the opposite direction: what is the probability IKEA will have a page speed of less than 9 seconds?
sum(ikea$value < 9) / length(ikea$value) * 100
## [1] 25.72
And to plot our results…
ikea %>%
  mutate(less_than_9s = ifelse(value < 9, 'yes', 'no')) %>%
  ggplot(aes(value, fill = less_than_9s)) +
  geom_histogram(binwidth = 0.05, alpha = 0.7) +
  scale_fill_manual(values = cbPalette) +
  scale_x_continuous(limits = c(0, 20)) +
  labs(x = "Time to Interactive (seconds)", y = NULL)
We can also ask more complicated questions, such as: what is the probability a random Amazon page will be faster than a random IKEA page?
# SIMULATE AMAZON DATA
amazon <- rnorm(1e4, 7.59, 1.91) %>%
  as_tibble() %>%
  mutate(website = paste0("amazon"))

# compare the 10K Amazon draws against the 10K IKEA draws pairwise
mean(amazon$value < ikea$value) * 100
## [1] 82.39
And…well…you get the idea….
pagespeed <- rbind(ikea, amazon)

pagespeed %>%
  ggplot(aes(value, fill = website)) +
  geom_histogram(binwidth = 0.05, alpha = 0.7, position = 'identity') +
  scale_fill_manual(values = cbPalette2) +
  scale_x_continuous(limits = c(0, 20)) +
  labs(x = "Time to Interactive (seconds)", y = NULL)
Side note
From my personal experience working at large companies, business executives respond very well to probabilities. Thus, “our site is 24% slower than Amazon” is not as impactful as stating “if we were to take 10K random pages from each site, there is an 82% chance our site will be slower than Amazon.”
What about Amazon?
In the past I wrote about segmenting data to reveal deeper insights hidden beneath the aggregation.
So, what is really driving Amazon's page speed score? Let's simulate some data, and for fun we'll increase our observations from 1e4 (10K) to 1e5 (100K). A quick reminder of Amazon's page speed data by product category:
df_amazon <- df %>%
  filter(website == 'Amazon', page_type != 'Home') %>%
  group_by(product) %>%
  summarize(time_interactive = mean(time_to_interactive),
            sd = sd(time_to_interactive)) %>%
  arrange(time_interactive)
product | time_interactive | sd |
---|---|---|
Electrical | 5.750000 | 0.212132 |
Security | 7.933333 | 2.458319 |
Sensors | 8.066667 | 1.674316 |
Entertainment | 8.200000 | 2.545584 |
And the code to generate our data with the subsequent plot:
electrical <- rnorm(1e5, 5.75, 0.212) %>% as_tibble() %>%
  mutate(product = paste0("electrical"))
security <- rnorm(1e5, 7.93, 2.46) %>% as_tibble() %>%
  mutate(product = paste0("security"))
sensors <- rnorm(1e5, 8.07, 1.67) %>% as_tibble() %>%
  mutate(product = paste0("sensors"))
entertainment <- rnorm(1e5, 8.2, 2.55) %>% as_tibble() %>%
  mutate(product = paste0("entertainment"))
amazon <- rbind(electrical, security, sensors, entertainment)

amazon %>%
  ggplot(aes(value, fill = product)) +
  geom_histogram(binwidth = 0.05, alpha = 0.7, position = 'identity') +
  scale_x_continuous(limits = c(0, 20)) +
  scale_y_continuous(labels = comma_format()) +
  scale_fill_brewer(palette = 'Spectral') +
  labs(x = "Time to Interactive (seconds)", y = NULL,
       title = "Amazon: time to interactive by product category (n=100K)")
Well, this is quite shocking (pun intended) - the electrical category's time_to_interactive is not only faster, but its scores are also far less dispersed around the center than the other three categories.
Why might that be the case? A specific business focus on these products? Fewer teams working on them? Not enough products? All conjecture on my part without digging deeper into the site.
Finally, what is the probability the electrical category is faster than security (2nd place)?
mean(electrical$value < security$value) * 100
## [1] 81.145
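We could even ask: what is the chance a random electrical page is the fastest of all four categories?
# probability electrical beats security, sensors AND entertainment on a random draw
mean(electrical$value < security$value &
       electrical$value < sensors$value &
       electrical$value < entertainment$value) * 100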
Wrapping up
With data simulation we can account for sample size and uncertainty while keeping the results interpretable. And we can achieve this understanding without building anything entirely new.
This methodology can be applied to any aspect of digital marketing where data is the lifeblood of the channel. For example…
- Rankings: your daily rank jumps from #9 to #1, you celebrate too early, and it drops back down to #8 the next day. If you had calculated the probability of hitting #1 from the source data, would you have celebrated so early?
- Traffic: was the spike in site visitors a real phenomenon or just random chance?
- Click-through Rate: how do you handle low volume data where you only received 2 clicks and 2 impressions? You don’t want to kick out data because it is telling you something! (check out my R guide and the section on estimating CTR with empirical Bayes)
Intentional exclusion
I specifically left out certain concepts to get the reader excited about using R to simulate their own data. Although important, I will leave those for future articles; the following are notes to myself:
- Foundation for Bayesian statistics
- No mention of incorporating conjugate priors
- Other R distribution functions: dnorm, pnorm, qnorm
- Minimized mathematical notation