Hello, and welcome to this blog. A few days ago I tweeted the graph below.
It was the output from the model I have created, which predicts the qualifying time for each driver. I will review the model's outputs in the next blog, but today I'm going to discuss how I developed it from the start.
The first task was to identify what I wanted to achieve. That was pretty clear – predict the qualifying time for each driver for every grand prix and therefore predict the final grid (before penalties).
In every Formula 1 race weekend there are 3 practice sessions before qualifying, which sets the grid for the race. My idea was to use headline data from those practice sessions. After every practice session, the F1 website publishes the leaderboard, as below:
I can scrape that data from the F1 website and use each driver's fastest lap, as well as the number of laps and the delta to first. The gap to first is important because the bigger a driver's gap to first, the less likely that driver is to be on pole position. I also want to include the number of laps, because the more laps a driver completes, the better they understand the set-up and therefore the quicker they should go. Therefore, I wrote the following script to scrape the results from the first practice at any grand prix:
This is the basic code used to scrape the results of all the practice sessions and create a data frame with the required data. I am also going to add data such as the circuit length, which varies from 3.337km for the shortest circuit (Monaco) to 7km for the longest circuit (Spa). This will have a significant impact on the lap times. Finally, in this previous blog I clustered circuits to find circuits similar to each other.
I also used the group that each circuit belongs to as another variable for this analysis.
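To make the practice-session features concrete, here is a minimal Python sketch of how a leaderboard could be turned into a fastest lap in seconds, a delta to first and a lap count for each driver. This is not my scraping script: the function names, the tuple layout and the sample values are all assumptions made up for illustration, and the lap-time format is assumed to be `m:ss.sss`.

```python
from dataclasses import dataclass


def lap_time_to_seconds(text: str) -> float:
    """Convert a lap time such as '1:23.456' (or '58.123') to seconds."""
    minutes, _, seconds = text.strip().rpartition(":")
    return (int(minutes) if minutes else 0) * 60 + float(seconds)


@dataclass
class PracticeRow:
    driver: str
    car: str
    fastest_lap: float     # seconds
    delta_to_first: float  # seconds behind the quickest lap of the session
    laps: int


def build_practice_rows(leaderboard):
    """Turn (driver, car, lap time text, laps) tuples into feature rows."""
    parsed = [(d, c, lap_time_to_seconds(t), int(n)) for d, c, t, n in leaderboard]
    best = min(t for _, _, t, _ in parsed)
    return [PracticeRow(d, c, t, round(t - best, 3), n) for d, c, t, n in parsed]


# Placeholder values for illustration only, not real session results.
sample_leaderboard = [
    ("Driver A", "Team X", "1:20.000", 25),
    ("Driver B", "Team Y", "1:20.500", 30),
    ("Driver C", "Team X", "1:21.250", 18),
]
rows = build_practice_rows(sample_leaderboard)
```

Computing the delta from the parsed times, rather than reading it from the page, keeps the three features consistent with each other even if the published gap column is formatted differently.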
In theory this could be 3 separate models; however, I am going to focus on the one model which predicts based on the times from practice 1 and 2. My predictor variables are the practice 1 fastest time, the practice 2 fastest time, the number of laps, the delta to first, the manufacturer of the car and the category of the circuit. My target variable is the fastest qualifying time for each driver. The model I am going to use is XGBoost, a gradient boosting algorithm.
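The manufacturer and the circuit category are labels rather than numbers, so they need to be turned into numeric columns before a gradient boosting model can use them. One common option is one-hot encoding, sketched below in plain Python. The field names (`fp1_time`, `fp2_delta` and so on), the category lists and the values are hypothetical, and treating laps and delta as separate per-session columns is an assumption, not necessarily how my model is set up.

```python
def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1.0 if value == c else 0.0 for c in categories]


def make_feature_vector(record, manufacturers, circuit_groups):
    """Combine the numeric practice features with one-hot encoded labels."""
    numeric = [
        record["fp1_time"],
        record["fp2_time"],
        float(record["fp1_laps"]),
        float(record["fp2_laps"]),
        record["fp1_delta"],
        record["fp2_delta"],
    ]
    return (
        numeric
        + one_hot(record["manufacturer"], manufacturers)
        + one_hot(record["circuit_group"], circuit_groups)
    )


# Hypothetical category lists and a made-up record, for illustration only.
manufacturers = ["Team X", "Team Y", "Team Z"]
circuit_groups = ["group_1", "group_2"]
record = {
    "fp1_time": 80.5, "fp2_time": 80.0,
    "fp1_laps": 25, "fp2_laps": 30,
    "fp1_delta": 0.5, "fp2_delta": 0.0,
    "manufacturer": "Team Y", "circuit_group": "group_2",
}
features = make_feature_vector(record, manufacturers, circuit_groups)
```

Each driver and session pair becomes one such vector, and the matching target is that driver's fastest qualifying time.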
Training and Initial Results
Training the model was pretty quick. I selected 100 rounds of training, and it finished with a training data RMSE of 0.07. The full code for this model is available here on my GitHub.
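For anyone unfamiliar with the metric, RMSE (root mean squared error) is the square root of the average squared gap between the predicted and actual values, so it is in the same units as the target. A minimal sketch of the calculation is below; the function is my own illustration rather than XGBoost's internal code, and the numbers are placeholders, not results from the model.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Placeholder qualifying times and predictions, for illustration only.
actual_times = [80.0, 81.0]
predicted_times = [80.1, 80.9]
error = rmse(actual_times, predicted_times)
```

Because the errors are squared before averaging, one badly missed prediction pushes the RMSE up more than several small misses would.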
In the next blog I will break down the season so far, running this model for each qualifying session to really understand its strengths and weaknesses.