Hello, and welcome to the next R stats adventure. I am going to develop a model that can be used to scout the batsmen of the future in Twenty20 cricket. You may have seen previously (and if you haven’t, go check it out) that I looked at the Second Eleven County Championship in order to find the next James Anderson. I am going to use a similar method here, but I will build it from the ground up.
Data
In order to do this I need some data, so I am using the second eleven Twenty20 competition in England. The competition has been running since 2011, so there are 8 seasons of data to draw on. For each of the 8 seasons I will take all the batsmen who scored more than 50 runs in the competition. For each season I will record:
- Matches
- Date of Birth
- Innings
- Not Out no.
- Runs
- Highest score
- Batting Average
- Strike Rate
- 100s
- 50s
- Caught
- Stumped
Methodology
The aim is to find the best future batsmen. To do that, I need to find the best performers and work out how their performance links to future performance. The data I have covers players of all ages, but I am going to focus on players under 22. However, a 21-year-old is more developed and experienced than an 18-year-old, so I have to normalise the data for age. The players have also played different numbers of games, so I will apply some regression to the mean to make the comparison as fair as possible across different numbers of innings.
Once that is done, I need to link each player's second eleven average and strike rate to the same stats in first eleven Twenty20 competitions. It's that relationship I can use to identify the players of the future.
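To give a feel for what regression to the mean could look like here, below is a minimal sketch in R. It is not the method used later in this series: the toy data, the column names (`runs`, `innings`, `not_outs`) and the constant `k` are all assumptions made purely for illustration. The idea is to add `k` "phantom" dismissals at the competition-wide average, so that players with few innings are pulled more strongly towards the overall mean.

```r
# Toy data; the column names are hypothetical, not the real data set's.
batsmen <- data.frame(
  player   = c("A", "B", "C"),
  runs     = c(60, 300, 150),
  innings  = c(3, 12, 8),
  not_outs = c(1, 2, 0)
)

# Batting average = runs per dismissal (assumes at least one dismissal each).
dismissals <- batsmen$innings - batsmen$not_outs
batsmen$average <- batsmen$runs / dismissals

# Competition-wide average across all players in the toy data.
league_average <- sum(batsmen$runs) / sum(dismissals)

# k is an assumed constant: the number of "phantom" dismissals added
# at the league average. Larger k means stronger shrinkage.
k <- 5
batsmen$shrunk_average <- (batsmen$runs + k * league_average) / (dismissals + k)

print(batsmen)
```

In this sketch players A and B have the same raw average, but A's adjusted average ends up closer to the competition mean because it rests on fewer dismissals.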
Implementing
First, let's check out the data frame with all the batsmen:


Above you can see the code used to read the data in, along with the output. In total there are 1,615 records. These are not all different players, as some players appear more than once. The first thing I need to do is create a column with each player’s age:


For this analysis, I am going to use the player's age on the 1st of April in the year of that season. The age is rounded to a whole number of years.
The next job is to get averages and strike rates for all players as if they had played the same number of games. I will cover that in part 2, along with identifying the factor used to adjust the scores for younger batsmen. In the final part, I will look at how this performance links to first team performance, and then I can build the prediction.
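As a rough illustration of this age calculation, here is a minimal sketch in R. The toy data frame and the column names `dob` and `season` are assumptions for the example, not the names in the real data set, and dividing by 365.25 is just one simple way to turn days into years.

```r
# Toy data; column names dob and season are hypothetical.
batsmen <- data.frame(
  player = c("A", "B"),
  dob    = as.Date(c("1997-10-15", "1999-03-20")),
  season = c(2018L, 2018L)
)

# Reference date: 1st of April in the year of each season.
season_start <- as.Date(paste0(batsmen$season, "-04-01"))

# Age in days on the reference date, converted to years and rounded.
age_days <- as.numeric(difftime(season_start, batsmen$dob, units = "days"))
batsmen$age <- round(age_days / 365.25)

print(batsmen)
```

Because the age is rounded rather than truncated, a player who is about 20 and a half or older on the 1st of April would be recorded as 21 under this sketch.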
Thanks for reading this initial outline of the method, hope you enjoyed the R stats adventure. See you in the next one!