Hello and welcome to the second part of my mini-series using cluster analysis to categorise Formula 1 circuits. Please go and check out the first part, which outlines the data we are using to categorise the circuits and gives an overview of the method used for hierarchical clustering. Today we are going to look at K-means clustering.
For K-means clustering we have to choose our own value for k. We are going to do that with two different types of analysis: an elbow plot and silhouette analysis.
The code below was used to generate the elbow plot, and the plot it produced follows:
Reviewing the elbow plot, it already looks like we are seeing a slightly different number of clusters than we got from hierarchical clustering. The elbow looks to be at k = 3, though you could also argue for k = 4.
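To make the idea behind an elbow plot concrete, here is a minimal sketch in Python using only the standard library. It is not the code used for the plot in this post. The points are made up (imagine already-scaled straight length and corner speed pairs), the function names `kmeans` and `within_ss` are just for illustration, and the K-means here starts from the first k points so that it is repeatable, whereas real tools normally use several random starts.

```python
import math

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm, started from the first k points (an assumption
    made so the sketch is repeatable; real tools use random starts)."""
    centres = list(points[:k])
    clusters = []
    for _ in range(max_iter):
        # Assign every point to its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centres[i]))
            clusters[nearest].append(p)
        # Move each centre to the mean of its points (keep it if empty).
        new_centres = [
            tuple(sum(axis) / len(cl) for axis in zip(*cl)) if cl else centres[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centres == centres:
            break
        centres = new_centres
    return centres, clusters

def within_ss(centres, clusters):
    """Total within-cluster sum of squares: the y-axis of an elbow plot."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centre))
        for centre, cluster in zip(centres, clusters)
        for p in cluster
    )

# Made-up points forming three tight pairs, not real circuit data.
points = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0),
          (0.0, 1.0), (5.0, 6.0), (10.0, 1.0)]

wss = {k: within_ss(*kmeans(points, k)) for k in range(1, 5)}
for k, value in wss.items():
    print(k, round(value, 2))
```

On these made-up points the total drops steeply up to k = 3 and barely moves after that, which is the "elbow" you look for in the plot.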
The other way to decide on a value of k for K-means clustering is to produce a silhouette graph. This takes every point in the analysis and scores how well it fits the cluster it has been assigned to compared with the next nearest cluster, with -1 meaning it doesn’t fit at all and 1 meaning it fits well. You then plot the average silhouette width for each value of k, and the k with the highest average is the one to choose. I have put a picture of the code below and also the silhouette graph produced:
Fascinatingly, there are two high points: one at k = 9 and another at k = 3. I am going to choose k = 3, as this closely matches what we saw in the elbow plot, and 9 clusters is just too many to deal with.
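For anyone who wants to see what the average silhouette width actually measures, here is a small Python sketch (standard library only, and again not the code behind the graph above). It assumes you already have a grouping for each candidate k; the points and the two groupings are made up, and `mean_silhouette` is a name chosen for this illustration.

```python
import math

def mean_silhouette(clusters):
    """Average silhouette width for a grouping given as a list of clusters.
    Needs at least two clusters."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for p in cluster:
            if len(cluster) == 1:
                scores.append(0.0)  # usual convention for a one-point cluster
                continue
            # a: mean distance to the other points in the same cluster
            a = sum(math.dist(p, q) for q in cluster) / (len(cluster) - 1)
            # b: mean distance to the points of the nearest other cluster
            b = min(
                sum(math.dist(p, q) for q in other) / len(other)
                for cj, other in enumerate(clusters) if cj != ci
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Made-up points grouped two different ways, not real circuit data.
groupings = {
    2: [[(0, 0), (0, 1)],
        [(5, 5), (5, 6), (10, 0), (10, 1)]],
    3: [[(0, 0), (0, 1)],
        [(5, 5), (5, 6)],
        [(10, 0), (10, 1)]],
}

widths = {k: mean_silhouette(g) for k, g in groupings.items()}
best_k = max(widths, key=widths.get)
print(widths, best_k)
```

The grouping with the highest average width wins, which is the same rule used to read the silhouette graph.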
The above graph shows all the circuits on the calendar plotted by average straight length and average corner speed, coloured by the cluster they have been put in. I am a bit unsatisfied with this, as I feel it doesn’t quite fit the different circuits on the calendar. For instance, Singapore is different to China and Germany. Therefore K-means is not going to be the clustering I use in the final blog to look at pace trends across the season.
Look out for the final blog, in which we will look at the pace across all circuits so far for all the teams, along with some other metrics like overtakes and pit stops.
Hello there! As you know, I’m currently working through the Datacamp course Data Scientist with R. (If the people from Datacamp are reading this, I’m open to sponsorship!) There will be a further update on how I’m getting on with this later this week; today, however, I wanted to focus on applying something new that I have learnt: cluster analysis. Cluster analysis allows you to take a dataframe (here with two variables) and work out which rows are best grouped together. There are two main methods that we are going to look at: hierarchical clustering and K-means clustering.
We are going to look at Formula 1 circuits. There are 21 different circuits currently on the calendar, all with different lengths, height profiles and types of tarmac, but can we group them together by certain characteristics? For me, as an avid Formula 1 watcher, the differences between the circuits come down to the lengths of the straights and the speed of the corners. Therefore the two metrics we are using are the average straight length and the average corner speed.
- Average straight length – calculated by measuring each stretch of track on which the F1 car would be running at full throttle, removing any stretches shorter than 100 m. An example for Spain is below, with the estimated straights in green.
- Average corner speed – calculated by allocating each corner to either slow, medium or fast. (Unfortunately, I don’t have data for the exact corner speeds, but if any F1 team wants to send it over, email me!) You can see in the table below how many corners of each type were allocated for each circuit.
As I don’t know the exact speeds of these corners, I have estimated that a slow corner is 80 km/h, a medium-speed corner is 150 km/h and a fast corner is 200 km/h. This has left us with the following table:
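The two metrics described above can be sketched in a few lines of Python. This is an illustration rather than the code used to build the table: the function names are hypothetical, the example numbers are made up and do not describe a real circuit, and the corner-speed average is assumed to be weighted by the number of corners in each category.

```python
from statistics import mean

# Estimated speeds for each corner category, in km/h (from the post).
CORNER_SPEED_KMH = {"slow": 80, "medium": 150, "fast": 200}
MIN_STRAIGHT_M = 100  # full-throttle stretches shorter than this are dropped

def average_straight_length(stretches_m):
    """Mean length of the full-throttle stretches of at least 100 m."""
    kept = [s for s in stretches_m if s >= MIN_STRAIGHT_M]
    return mean(kept) if kept else 0.0

def average_corner_speed(corner_counts):
    """Average of the category speeds, weighted by corners per category."""
    total_corners = sum(corner_counts.values())
    weighted = sum(CORNER_SPEED_KMH[kind] * n for kind, n in corner_counts.items())
    return weighted / total_corners

# Made-up example values for illustration only.
example_stretches_m = [1000, 90, 500, 300]
example_corners = {"slow": 5, "medium": 3, "fast": 2}

print(average_straight_length(example_stretches_m))  # 90 m stretch is ignored
print(average_corner_speed(example_corners))
```

Each circuit then becomes a single row with two numbers, which is the shape of data the clustering needs.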
The first thing we are going to look at is hierarchical clustering. The table above is fed into the following code:
This produces the following output:
We have five distinct clusters that the F1 circuits fit into. It’s not too surprising that Singapore, Monaco and Hungary fall into the same cluster, or that Belgium and Great Britain come out as similar circuits.
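If you want to see what hierarchical clustering is doing under the hood, here is a rough Python sketch of the bottom-up (agglomerative) version using only the standard library. It is not the code used above: it assumes complete linkage (the distance between two groups is the distance between their farthest members), the points are made up, and `agglomerate` is a name chosen for this illustration.

```python
import math

def complete_linkage(a, b):
    """Distance between two clusters: that of their farthest pair of points."""
    return max(math.dist(p, q) for p in a for q in b)

def agglomerate(points, n_clusters):
    """Start with every point on its own and keep merging the two closest
    clusters until only n_clusters are left."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(
            pairs,
            key=lambda ij: complete_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters.pop(j)  # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return clusters

# Made-up, already-scaled points, not real circuit data.
points = [(0, 0), (5, 5), (10, 0), (0, 1), (5, 6), (10, 1)]

groups = agglomerate(points, 3)
for group in groups:
    print(sorted(group))
```

Stopping the merging at a chosen number of groups plays the same role as cutting the tree at a given height to get the final clusters.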
In the scatter plot above you can clearly see the difference between the two main clusters, 1 and 2. In cluster 1 the straights are often shorter, but the corners are faster. Cluster 2 circuits often have longer straights but slower corners. With a few circuits from each group already raced this season, it will be interesting to see whether there are any trends in car speed.
That’s it for the first part of this series. Next week we will look at any differences when using K-means clustering. In the final part, we will look at applying what we have seen so far this year to try to predict who will win in the later rounds.