The excitement of Formula 1 racing lies in the intense competition between drivers and teams as they navigate the twists and turns of each race to claim the coveted championship title. In recent years, data-driven predictions have become increasingly popular in sports, and F1 is no exception. Inspired by FiveThirtyEight's comprehensive soccer predictions, we set out to create a similar model for predicting F1 driver standings using the R programming language. In this blog post, we'll walk you through the process of building such a model, from data collection to model output.

### Data Collection

To build a robust F1 driver standings prediction model, we first need access to a wealth of historical race data. A good source for this is the Ergast Developer API, which provides comprehensive F1 data dating back to the 1950s. The data can either be queried through the API or downloaded from the website as CSV tables.
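As a minimal sketch of the API route (assuming the public Ergast endpoint at `http://ergast.com/api/f1` and the `jsonlite` package; the helper function and its arguments are illustrative, not part of the original code), a season's race results can be requested like this:

```
library(jsonlite)

# Build an Ergast request URL for one season's data;
# appending ".json" asks the API for JSON instead of the default XML
ergast_url <- function(season, endpoint = "results", limit = 1000) {
  paste0("http://ergast.com/api/f1/", season, "/", endpoint,
         ".json?limit=", limit)
}

# Example: all 2023 race results in a single request
url_2023 <- ergast_url(2023)
# res <- fromJSON(url_2023)  # requires an internet connection
```

The CSV route is even simpler: the downloadable database dump contains one file per table (results, drivers, races, and so on), which can be read with `readr::read_csv()`.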


### Feature Engineering

These forecasts work by having some quality measure as the input. The models developed by FiveThirtyEight use a measure of team strength, such as expected goal difference in football, as the basis for the forecast. I therefore need to estimate driver/team quality and use that to generate the range of possible finishing positions for a driver. I have decided to use the average finishing position, because the better driver/car combination will finish higher. However, the mean finishing position is not an accurate reflection of quality in the first few races of a season, so in the early part of the season the driver's average race pace will be used to estimate the full-season result instead.
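To illustrate the core quality measure (using made-up results and column names mirroring the Ergast `results` table; the data here are hypothetical), the mean finishing position per driver is a simple grouped summary:

```
library(dplyr)

# Toy results table: three rounds for two (hypothetical) driver codes
results_toy <- tibble::tibble(
  code          = c("VER", "VER", "VER", "HAM", "HAM", "HAM"),
  round         = c(1, 2, 3, 1, 2, 3),
  positionOrder = c(1, 1, 2, 3, 2, 4)
)

# Average finishing position per driver so far this season
avpos_toy <- results_toy %>%
  group_by(code) %>%
  summarise(meanpos = mean(positionOrder))
```

The same summary, grouped additionally by year, is what the real pipeline computes from the full Ergast results table.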

```
# Race-pace measure: for each cutoff round x, estimate each driver's mean
# lap-time delta to the fastest lap, across rounds before x
pos_func = function(x) {
  all_data %>%
    left_join(min_lap, by = c("raceId", "lap")) %>%
    filter(round < x) %>%
    filter(year > 2009) %>%
    mutate(delta = milliseconds / minl - 1) %>%
    filter(delta < 0.1) %>%  # keep representative racing laps only
    group_by(code, driverId, year) %>%
    summarise(meandel = mean(delta)) %>%
    left_join(avpos_dri, by = c("code", "year")) %>%
    mutate(roundav = x) %>%
    select(meandel, meanpos, roundav)
}

# For each cutoff, regress race pace on season-average finishing position
rac1 = 2:19
pace_mes = map_dfr(rac1, pos_func)
pace_mes2 = pace_mes %>%
  ungroup() %>%
  group_by(roundav) %>%
  nest() %>%
  mutate(mod = map(data, ~ lm(meandel ~ meanpos, data = .)),
         quality = map(mod, glance)) %>%
  unnest(quality) %>%
  select(roundav, r.squared) %>%
  mutate(meas = "racepace")

# Position measure: mean finishing position up to round x, regressed on the
# full-season average finishing position
pos_func2 = function(x) {
  results %>%
    left_join(drivers, by = "driverId") %>%
    left_join(constructors, by = "constructorId") %>%
    left_join(races, by = "raceId") %>%
    filter(milliseconds != "\\N") %>%
    filter(year > 2008) %>%
    filter(round < x) %>%
    group_by(year, code) %>%
    summarise(meanpos1 = mean(positionOrder, na.rm = TRUE)) %>%
    left_join(avpos_dri, by = c("code", "year")) %>%
    ungroup() %>%
    mutate(roundav = x) %>%
    select(roundav, meanpos, meanpos1)
}

rac = 2:19
pos_sum = map_dfr(rac, pos_func2)
pos_sum2 = pos_sum %>%
  ungroup() %>%
  group_by(roundav) %>%
  nest() %>%
  mutate(mod = map(data, ~ lm(meanpos1 ~ meanpos, data = .)),
         quality = map(mod, glance)) %>%
  unnest(quality) %>%
  select(roundav, r.squared) %>%
  mutate(meas = "position")

# Compare the two measures by how well each predicts the season average
com_sum = pace_mes2 %>%
  bind_rows(pos_sum2)
cols = c("position" = "#2f0094", "racepace" = "#ff8519")
ggplot(com_sum, aes(x = roundav, y = r.squared, col = meas)) +
  geom_line(size = 2) +
  scale_colour_manual(values = cols) +
  labs(x = "Race No.", y = "R Squared",
       title = "Average Finishing Position Prediction") +
  theme(panel.background = element_blank())
```


The average finishing position is the primary input to the model, and the graph above shows that before race 4 of the season, the driver's average race pace is the better predictor. From race 4 onwards, the driver's mean finishing position can be used for the rest of the season.

## Model Development

The selected feature is the driver's average finishing position for the season, as it captures both car quality and driver quality. Next, I need a probability distribution to turn that average into a range of plausible results. I will compare the normal distribution and the Poisson distribution.

```
set.seed(13)   # for reproducibility
n <- 23000     # number of draws
mu <- 7        # mean finishing position
sigma <- 3.33  # standard deviation for the normal draws

# Generate n random draws from each candidate distribution
samples <- rnorm(n, mean = mu, sd = sigma)
sample2 <- rpois(n, lambda = mu)
samp  <- tibble(samples)
samp2 <- tibble(sample2)

# Empirical probability of each (rounded) finishing position
norm = samp %>%
  mutate(result = round(samples, 0)) %>%
  group_by(result) %>%
  summarise(n = n()) %>%
  mutate(prob = n / sum(n)) %>%
  mutate(dist = "norm")

# Poisson can produce 0, which is mapped to position 1
pois = samp2 %>%
  mutate(result = if_else(sample2 == 0, 1, as.double(sample2))) %>%
  group_by(result) %>%
  summarise(n = n()) %>%
  mutate(prob = n / sum(n)) %>%
  mutate(dist = "pois")

dist_comp = norm %>% bind_rows(pois)

# Actual distribution of finishing positions for drivers whose
# season average rounds to 7
ave_pos_num2 = avpos_dri %>% mutate(roundpos = round(meanpos, 0))
pos_dri = results %>%
  left_join(drivers, by = "driverId") %>%
  left_join(constructors, by = "constructorId") %>%
  left_join(races, by = "raceId") %>%
  filter(milliseconds != "\\N") %>%
  ungroup() %>%
  left_join(ave_pos_num2, by = c("year", "code")) %>%
  filter(roundpos == 7) %>%
  group_by(positionOrder) %>%
  summarise(n = n()) %>%
  mutate(prob = n / sum(n)) %>%
  rename(result = positionOrder) %>%
  mutate(dist = "act")

dist_comp2 = dist_comp %>% bind_rows(pos_dri)
cols = c("pois" = "#ff8519", "norm" = "#2f0094", "act" = "#009c39")
ggplot(dist_comp2, aes(x = result, y = prob, col = dist)) +
  geom_line(size = 1.5) +
  scale_color_manual(values = cols) +
  labs(x = "Finishing Position", y = "Probability of Finishing",
       title = "Probability Distributions Compared to Actual") +
  guides(colour = guide_legend(title = "Distribution")) +
  theme(panel.background = element_blank())
```

To identify the best distribution, I compared the normal and Poisson distributions to the actual distribution of finishing positions. The Poisson distribution follows the actual distribution much more closely than the normal distribution, so the Poisson distribution will be used in the model.
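The intuition can also be checked directly from the theoretical distributions (a quick sketch using the same `mu = 7` and `sigma = 3.33` as above): the Poisson puts no mass below zero and is right-skewed, like real finishing positions, whereas the normal is symmetric and assigns real probability to impossible negative positions.

```
mu <- 7
sigma <- 3.33

# Poisson assigns zero probability to negative values...
p_neg_pois <- ppois(-1, lambda = mu)          # exactly 0
# ...while the normal puts real mass below position 1
p_below_norm <- pnorm(0.5, mean = mu, sd = sigma)

# Right skew: around a mean of 7, P(X = 5) exceeds P(X = 9)
skewed <- dpois(5, mu) > dpois(9, mu)
```

This matches the empirical comparison in the plot: positions cluster just below the average with a long tail toward the back of the field.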

```
# For each driver, draw a large Poisson sample around their average
# finishing position and expand it into one row per simulated result
dat_create <- function(x) {
  position <- pce23_2 %>% filter(driverId == x)
  mu <- position[[5]]  # column 5 holds the driver's mean position
  samp2 <- tibble(sample2 = rpois(10000000, lambda = mu))
  samp2 %>%
    mutate(sample3 = if_else(sample2 == 0, 1, as.double(sample2))) %>%
    group_by(sample3) %>%
    summarise(n = n()) %>%
    uncount(n) %>%
    mutate(driverId = x)
}

# The 2023 grid, by Ergast driverId
driver <- c(1, 4, 807, 815, 822, 825, 830, 832, 839, 840,
            842, 844, 846, 847, 848, 852, 855, 856, 857, 858)
dri_23 <- map_dfr(driver, dat_create)

# Sample drivers for one finishing position and rename the column
sample_and_rename <- function(data, position) {
  data %>%
    filter(sample3 == position) %>%
    sample_n(size = 100000) %>%
    select(driverId) %>%
    rename(!!paste0("pos_", position) := driverId)
}

# Loop through the ten scoring positions and bind the samples as columns
all_pos <- tibble()
for (i in 1:10) {
  all_pos <- all_pos %>% bind_cols(sample_and_rename(dri_23, i))
}
```

I now have the model input and the probability distribution. The next step is to feed the input into the distribution and generate an output for the 10 points-scoring positions. Each driver's average finishing position is used to draw 10 million samples from the Poisson distribution, so drivers who produce a given position more often are proportionally more likely to be sampled for it.

```
doParallel::registerDoParallel()
all_pos2 = data.frame(all_pos)

# Simulate one race: pick a driver for each of the ten scoring positions,
# excluding any driver already placed higher up the order
race_res_sim <- function(x) {
  result <- numeric(10)
  result[1] <- sample(all_pos2$pos_1, 1)
  for (i in 2:10) {
    pos_i <- paste0("pos_", i)
    candidates <- all_pos2[[pos_i]]
    candidates <- candidates[!is.na(candidates) &
                             !(candidates %in% result[1:(i - 1)])]
    result[i] <- candidates[sample.int(length(candidates), 1)]
  }
  c(result, x)  # last element is the race id
}

# Run the simulation once per race id (not once with the whole vector)
raceid <- 1:23000
res3 <- as.data.frame(t(sapply(raceid, race_res_sim)))
colnames(res3) <- c(paste0("Pos_", 1:10), "race")
re4 <- res3 %>% select(starts_with("Pos_"))
```

The code above goes through each points-scoring position and selects one driver. It does this 23,000 times, giving the results of 23,000 simulated F1 races based on the Poisson distribution of each driver's average finishing position. These can then be condensed into 1,000 runs of a full 23-race F1 season to create the 538-style table for both the drivers' and constructors' championships.
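As a sketch of that final aggregation step (assuming the standard 2010-onwards points scale of 25-18-15-12-10-8-6-4-2-1 with no fastest-lap bonus; the long-format `sim_results` table and its columns are hypothetical stand-ins for the simulation output), each simulated race is scored and then summed per driver within each simulated season:

```
library(dplyr)

# Points awarded for positions 1-10 under the standard F1 scale
points_scale <- c(25, 18, 15, 12, 10, 8, 6, 4, 2, 1)

# Hypothetical long-format simulation output: one row per driver per race,
# with `season` identifying which simulated season the race belongs to
sim_results <- tibble::tibble(
  season   = c(1, 1, 1, 1),
  race     = c(1, 1, 2, 2),
  driverId = c(830, 1, 830, 1),
  position = c(1, 2, 2, 3)
)

# Score each race, then total points per driver per simulated season
season_table <- sim_results %>%
  mutate(points = points_scale[position]) %>%
  group_by(season, driverId) %>%
  summarise(total = sum(points), .groups = "drop") %>%
  arrange(season, desc(total))
```

Averaging each driver's final rank across the 1,000 simulated seasons then yields the championship probabilities shown in a 538-style table.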