Powerball Analysis Overview

Intro

The lottery has long captivated the imagination of the public — a tantalizing mix of chance, dreams, and data. Behind every draw lies a rich history of numbers, waiting to be explored. While winning may be left to fate, analyzing the patterns, distributions, and correlations in lottery data reveals something just as intriguing: the structure of randomness itself.

This Quarto document briefly explores decades of Powerball data using frequency polygons, box plots, 3D surface plots, and correlation heatmaps.

Whether you’re a curious lottery fan, a statistician interested in probability, or a data viz nerd looking to see Plotly in action — this analysis offers a visually rich and data-driven perspective on the numbers that fuel billion-dollar dreams.

Summary Statistics

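# Load packages (pb_df is assumed to be loaded already: DrawDate plus the ball columns)
library(tidyverse)
library(knitr)

# Compute min, max, mean, and sd for each numeric column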
pb_summary <- pb_df %>%
  select(-DrawDate) %>%
  summarise(across(everything(), list(
    min = ~min(.x, na.rm = TRUE),
    max = ~max(.x, na.rm = TRUE),
    mean = ~mean(.x, na.rm = TRUE),
    sd = ~sd(.x, na.rm = TRUE)
  ))) %>%
  pivot_longer(everything(), names_to = c("variable", ".value"), names_sep = "_") %>%
  arrange(variable)

# Calculate mode for each variable
modes <- pb_df %>%
  select(-DrawDate) %>%
  map_int(~ as.integer(names(sort(table(.x), decreasing = TRUE)[1])))

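# Add mode to summary, matching by name: modes follows pb_df's column order,
# while pb_summary is sorted alphabetically by variable
pb_summary$mode <- unname(modes[pb_summary$variable])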

# Reorder columns
pb_summary <- pb_summary %>% select(variable, min, max, mean, mode, sd)

# Display as a clean HTML table
kable(pb_summary, caption = "Summary Statistics of Powerball Numbers")
Summary Statistics of Powerball Numbers

variable  min  max       mean  mode         sd
PB          1   45  18.459519    20  11.238591
PP          2   10   2.786220     2   1.173435
WB1         1   52   9.930367     1   8.123502
WB2         2   61  19.701851    15  10.555695
WB3         3   65  29.476927    27  11.651599
WB4         6   68  39.326610    39  11.911665
WB5        10   69  48.883670    45  11.014188

Line Plot of Drawn Numbers Over Time

# Plot white ball numbers over time
plot(pb_df$DrawDate, pb_df$WB1, type = "l", col = "red", ylim = c(0, 70), xlab = "Date", ylab = "Value")
lines(pb_df$DrawDate, pb_df$WB2, col = "blue")
lines(pb_df$DrawDate, pb_df$WB3, col = "green")
lines(pb_df$DrawDate, pb_df$WB4, col = "purple")
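lines(pb_df$DrawDate, pb_df$WB5, col = "orange")
# Label each line so the colors can be read
legend("topleft", legend = paste0("WB", 1:5), lty = 1, horiz = TRUE, bty = "n", cex = 0.8,
       col = c("red", "blue", "green", "purple", "orange"))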

Box Plot of Powerball Components

# Create a boxplot for the numeric columns
# (box labels come from the column names, so no positional indexing is needed)
pb_df %>%
  select(-DrawDate) %>%
  boxplot(
    main = "Distribution of Powerball Components",
    ylab = "Values",
    col = "lightblue",
    border = "darkblue"
  )

Histograms and Frequency Polygons

Distribution of WB1, the lowest of the five white balls in each draw.

# Histogram and frequency polygon for WB1 (same binning for both layers)
ggplot(pb_df) +
  geom_histogram(aes(WB1), fill = "skyblue", bins = 30) +
  geom_freqpoly(aes(WB1), color = "red", linewidth = 1, bins = 30)

All five white ball positions overlaid.

# Frequency polygons for WB1 to WB5
ggplot(pb_df) +
  geom_freqpoly(aes(WB1), color = "red", linewidth = 1, bins = 30) +
  geom_freqpoly(aes(WB2), color = "blue", linewidth = 1, bins = 30) +
  geom_freqpoly(aes(WB3), color = "green", linewidth = 1, bins = 30) +
  geom_freqpoly(aes(WB4), color = "orange", linewidth = 1, bins = 30) +
  geom_freqpoly(aes(WB5), color = "brown", linewidth = 1, bins = 30) +
  labs(x = "White ball number", y = "Count")

Heatmap Interpretation

The correlation heatmap of white ball numbers since 2014 shows generally low correlation across positions, which is consistent with the expected randomness of Powerball draws. The slight positive correlations between adjacent positions, such as WB2 and WB3, are likely an artifact of storing each draw's numbers in ascending order (a larger WB2 forces a larger WB3) rather than any meaningful predictive relationship. Overall, the heatmap shows no structure beyond that ordering effect, so past results offer no reliable basis for prediction.

# Filter and correlate data since 2014
forcor <- pb_df %>%
  filter(DrawDate >= as.Date("2014-01-22")) %>%
  select(WB1:WB5)

heatmap(cor(forcor), 
        col = colorRampPalette(c("yellow", "orange", "red"))(100), 
        scale = "none", 
        symm = TRUE,
        main = "Correlation Heatmap of White Balls")
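
The ordering explanation above can be checked with a small simulation. The sketch below is not part of the original analysis: it uses simulated draws rather than pb_df, and it assumes a fixed pool of 69 white balls and 10,000 draws purely for illustration (the real number pool changed over the years). It compares correlations between positions when each draw is sorted ascending with the same draws after shuffling positions within each draw.

```r
# Simulated stand-in for real draws (assumption: 5 distinct numbers from 1..69)
set.seed(42)
n_draws <- 10000
sim <- t(replicate(n_draws, sort(sample(1:69, 5))))  # one sorted draw per row
colnames(sim) <- paste0("WB", 1:5)

# Correlation between sorted positions: any structure here comes from the ordering
sorted_cor <- cor(sim)

# Same numbers, but positions shuffled within each draw, which removes the ordering
unsorted <- t(apply(sim, 1, sample))
colnames(unsorted) <- paste0("Pos", 1:5)
unsorted_cor <- cor(unsorted)

print(round(sorted_cor, 2))
print(round(unsorted_cor, 2))
```

If the ordering is what drives the pattern, the sorted matrix should show positive correlations that are strongest between neighbouring positions, while the shuffled matrix should sit close to zero off the diagonal.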

An Interactive 3D Perspective of Winning Powerball Numbers

library(plotly)

# Prepare data for 3D surface plot: one row per draw, one column per variable
z_matrix <- pb_df %>% select(-DrawDate) %>% as.matrix()
x_values <- as.Date(pb_df$DrawDate)
y_values <- seq_len(ncol(z_matrix))

# Generate interactive 3D surface plot
plot_ly(x = ~x_values, y = ~y_values, z = ~t(z_matrix), type = "surface") %>%
  layout(
    scene = list(
      xaxis = list(title = "Date", tickformat = "%Y-%m-%d"),
      yaxis = list(title = "Variables", tickvals = y_values, ticktext = colnames(z_matrix)),
      zaxis = list(title = "Value", range = c(0, 70))
    )
  )