Hello, and welcome to today’s data adventure, where we are going to scout for the next Chris Gayle. I am going to use K-means clustering to do this. The first question is: what numbers am I going to use? Previously I detailed the creation of several metrics with which to review batsmen in Twenty20 cricket. If you missed it, go check it out here:
There were 10 or 11 metrics for looking at players, and the values for each of them highlight different characteristics of a player. Therefore, if I can find players with similar metric values, they should be similar players and could be signed to fill a certain player’s role if he were to leave. Enter Chris Gayle, who is now 39 and won’t be around much longer. His current IPL franchise is Kings XI Punjab. What if they want to find the best player to replace him?
Above you can see Chris Gayle’s profile from his all-time IPL record. The highlights: he has a high average, a high strike rate and a low non-boundary strike rate, i.e. he doesn’t rotate the strike. He hits a lot of boundaries, which are mainly sixes. Those are the characteristics we are looking for in our replacement. Now I could filter through the whole dataset by hand, and that might work, but it could take a while and is not very scientific. Enter K-means. I can run K-means clustering on all of the metrics, find the cluster Chris Gayle belongs to, then review the other players in that cluster.
Create dataset – I am using the ball-by-ball data found on Kaggle that I used in the previous blog. By running the function outlined in the previous blog, I create a dataset with every batsman to have ever played in the IPL. Below you can see the dataset.
Scale and remove NAs – we now have 11 metrics to cluster on, but they are all on different scales, so I need to scale the data to make them comparable.
Above you can see the summary of the missing data. I wonder if the missing values can be removed by looking at how many balls each batsman has faced and filtering out batsmen who have faced fewer than 40 balls.
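As a rough illustration of this step (not the code used for this post), here is a minimal Python sketch that drops batsmen below the 40-ball threshold, drops any rows that still have missing values, and then z-scores each metric so they are comparable. The column names `batsman`, `balls_faced`, `average` and `strike_rate`, the helper `filter_and_scale`, and the example rows are all assumptions made up for the example; filtering before scaling is also a choice made here rather than something the post specifies.

```python
from statistics import mean, pstdev

MIN_BALLS = 40  # threshold mentioned in the post


def filter_and_scale(rows, metric_names, min_balls=MIN_BALLS):
    """Drop low-sample batsmen and rows with missing metrics, then z-score each metric."""
    kept = [
        r for r in rows
        if r["balls_faced"] >= min_balls
        and all(r.get(m) is not None for m in metric_names)
    ]
    scaled = [{"batsman": r["batsman"]} for r in kept]
    for m in metric_names:
        values = [r[m] for r in kept]
        mu, sigma = mean(values), pstdev(values)
        for out, v in zip(scaled, values):
            # A constant column carries no information, so map it to 0.
            out[m] = (v - mu) / sigma if sigma else 0.0
    return scaled


# Made-up rows purely for illustration, not real IPL figures.
rows = [
    {"batsman": "A", "balls_faced": 300, "average": 40.0, "strike_rate": 150.0},
    {"batsman": "B", "balls_faced": 120, "average": 20.0, "strike_rate": 120.0},
    {"batsman": "C", "balls_faced": 12, "average": None, "strike_rate": 90.0},
    {"batsman": "D", "balls_faced": 500, "average": 30.0, "strike_rate": 135.0},
]
scaled = filter_and_scale(rows, ["average", "strike_rate"])
for row in scaled:
    print(row)
```

After scaling, every metric has mean 0 and unit spread, so a metric measured in the hundreds (strike rate) no longer dominates one measured in the tens (average) when distances are computed.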
Decide centres – Now that I have the scaled data frame, I am ready to perform K-means clustering once I remove the batsman name column. The first thing I need to do is set the number of centres. I’m going to do this with an elbow plot.
The elbow plot can be used to decide on a suitable number of centres. The idea is to take the number of centres (the value of k) where the plot levels off, or elbows. There are several points on this plot where I could take my value of k. However, I want as many clusters as possible: there are over 250 players in this list, and if I used 8 clusters there would be too many players in each one. Therefore I’m choosing the elbow at k = 23 to keep each pool as small as possible.
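An elbow plot typically shows the total within-cluster sum of squares (WSS) against k. The sketch below computes those values with a small pure-Python K-means (Lloyd's algorithm) so that it is self-contained. It is an illustration under assumptions, not the code behind the plot above: the helper names, the toy 2-D points and the deterministic farthest-point initialisation are all invented for the example, and a real analysis would more likely use a library implementation with random restarts.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centres):
    """Label each point with the index of its nearest centre."""
    return [min(range(len(centres)), key=lambda j: sq_dist(p, centres[j])) for p in points]


def init_centres(points, k):
    """Farthest-point initialisation (an assumption, chosen so the sketch is reproducible)."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centres)))
    return centres


def kmeans_wss(points, k, iterations=100):
    """Run Lloyd's algorithm and return the total within-cluster sum of squares."""
    centres = init_centres(points, k)
    for _ in range(iterations):
        labels = assign(points, centres)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centres[j])  # keep the old centre if a cluster empties
        if updated == centres:
            break
        centres = updated
    labels = assign(points, centres)
    return sum(sq_dist(p, centres[lab]) for p, lab in zip(points, labels))


# Toy data: three well-separated groups of made-up 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
wss_by_k = {k: kmeans_wss(points, k) for k in range(1, 6)}
for k, wss in wss_by_k.items():
    print(k, round(wss, 2))
```

On this toy data the WSS falls sharply up to k = 3 and barely moves afterwards, which is the "elbow". Real data such as the batting metrics rarely gives such a clean bend, which is why several candidate values of k can look plausible.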
So Chris Gayle is in cluster 23. The next step is to filter the full dataset to see who else is in cluster 23.
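The lookup itself is simple once every player has a cluster label. A minimal Python sketch, assuming the player names and the cluster labels are held in two parallel lists; the function `cluster_mates`, the name format and the labels below are invented for illustration and are not output from the actual model:

```python
def cluster_mates(names, labels, target):
    """Return the target's cluster label and the other players who share it."""
    label_of = dict(zip(names, labels))
    cluster = label_of[target]
    mates = [n for n, lab in zip(names, labels) if lab == cluster and n != target]
    return cluster, mates


# Hypothetical labels purely for illustration; real ones come from the fitted clustering.
names = ["Chris Gayle", "Jason Roy", "Chris Morris", "Player X", "Player Y"]
labels = [23, 23, 23, 4, 11]
cluster, mates = cluster_mates(names, labels, "Chris Gayle")
print(cluster, mates)
```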
Within this list of 19 players, there are obviously players who don’t play anymore (Collingwood, Hayden, Pietersen), so Kings XI can’t get them to replace him. One thing that is clear is that none of the players gets particularly close to Chris Gayle’s six-hitting rate. The closest is Chris Morris; however, he bats well down the order, so he might not be as successful doing it against the new ball.
The name that stands out for me is Jason Roy who, like Gayle, is an opening batsman.
- I’m not 100% satisfied with this method, as the list that came back is still too big and there still seem to be some big differences in the metrics. Maybe using K-means is overcomplicating this. Let me know your thoughts.
- Maybe I have used too many variables in the K-means.
Thanks for reading this mildly successful blog. All the code is on my GitHub. Let me know your thoughts and whether you think I could develop this further.