Building a Model in R to Predict FPL Points

As I identified in the first part of this series, the key parts of the model are predicting how many goals there might be in the match, and identifying the type of player and therefore whether they are likely to have a scoring impact on the match. So first things first, I need to create a score forecasting model.

**Data**

In order to do this I need a methodology for identifying how strong each team in the match is, and then a way to correlate that difference in team strength to an estimate of the score for both teams. Fundamentally, I'm going to work out how the average team performs in the selected metric, compare each team's performances to that average, and use the difference between the two as the variable to compare to the score.

**Metric options**

- Goals scored – a team's goals scored are the truest measure of how good it is, but can be affected by short-term variability in performances. This data is freely available.
- xG – expected goals, the expected scoring rate once various details about each shot are taken into account.
- Shots taken – better teams will probably take more shots.

I have decided to use expected goals data for each team. As a prediction metric it is highly correlated with actual performance, and there is a source for the data thanks to fivethirtyeight.com, who publish team-level expected goals data for many leagues freely on their GitHub. I can read the data into R using the code below:

```
library(tidyverse) # for read_csv and the data wrangling below

matchdat <- read_csv("https://projects.fivethirtyeight.com/soccer-api/club/spi_matches.csv")
```

I can then use that data to rate a team over the season and compare it to the average team.

```
## calculating the average performances
homeav <- matchdat %>% filter(league_id == 2411) %>% #working out the average home xg
summarise(avfor = mean(xg1, na.rm = T), avag = mean(xg2, na.rm = T))
homfor <- homeav[[1]] # extracting the number for home and away
homeag <- homeav[[2]]
# comparing each teams performance to average for xg for and against
homeratdat2 <- matchdat %>% filter(league_id == 2411) %>%
group_by(team1) %>%
summarise(avfor = mean(xg1, na.rm = T), avag = mean(xg2, na.rm = T)) %>%
mutate(xgfh = avfor - homfor, xgah = avag - homeag) %>%
select(team1, xgfh, xgah)
```

The code above calculates how the average team performs for expected goals for and against. Each team's actual performance is then compared to this, giving each team in each season a delta. The code above is the calculation for home teams, but the same code is used for away teams (producing `awayratdat2`, used below).

```
## creating a data frame with matches the for and against scores and the xg deltas
matches <- matchdat %>% filter(league_id == 2411) %>%
left_join(homeratdat2, by = "team1") %>%
left_join(awayratdat2, by = "team2") %>%
mutate(deltafh = xgfh + xgaa, deltagh = xgah+xgfa) %>%
select(season, score1, score2, deltafh, deltagh)
## splitting it up for home and away
mat1 <- matches %>% select(season, score1, deltafh) %>%
mutate(loc = "home")
colnames(mat1)[2] <- "score"
colnames(mat1)[3] <- "delta"
mat2 <- matches %>% select(season, score2, deltagh) %>%
mutate(loc = "away")
colnames(mat2)[2] <- "score"
colnames(mat2)[3] <- "delta"
## putting it together and creating the score as a factor
matall <- mat1 %>% bind_rows(mat2) %>%
mutate(scorecat = as.factor(if_else(score > 5,"5", as.character(score)))) %>%
select(-score, -season) ### data for calculating the chance of goals scored
```

Next, I joined each team's xG deltas to the actual scores from historical seasons. This gives me data linking each team's xG delta to the actual goals they scored and conceded in a historical match. I also split the data out so I could add a category for whether the team was home or away. Finally, as matches where a team scores or concedes more than 5 goals are rare, I capped each score at 5. Scorelines higher than that are so rare that the model would not be able to make good predictions for them.

```
## splitting the data into training and testing
score_split <- initial_split(matall, prop = 0.9, strata = scorecat)
score_train <- training(score_split)
score_test <- testing(score_split)
## creating the classification random forest
rand1 <- rand_forest() %>% # type of model
  set_engine("ranger") %>% # the engine used to fit the model; randomForest is another option for random forests
  set_mode("classification") %>% # the mode, as random forests can do both classification and regression
  fit(scorecat ~ ., data = score_train) # fitting the model
### rand1 is the random forest for the score of the match
```

Above, I used functions from the tidymodels packages to simply fit a random forest for the score prediction of a match, based on whether the team is home or away and the difference in the expected goals ratings.

Above you can see the output from the initial model for the Wolves vs Burnley game in 2019. For both teams there's an estimate of how likely they are to score each number of goals. Wolves are likely to score more than Burnley, but there isn't too much difference between the teams. Now, for a player playing in this match, I can run 10,000 simulations of the match, and in just over 30% of them Wolves will score 1 goal. This then becomes one of the arguments in the models that predict goals scored and assists by a player. Defensive points can be directly calculated from the opposition's estimated goals scored. Now that I have the basis of a rough prediction of match outcomes, I can move on to the precise player predictions to make this relevant to FPL.
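As a sketch of that simulation step, here is how 10,000 runs could be drawn from the model's class probabilities in base R. The probabilities below are illustrative placeholders, not the actual model output for Wolves:

```r
## simulating one team's goals from predicted class probabilities
## (illustrative probabilities, not the real model output)
probs <- c("0" = 0.30, "1" = 0.31, "2" = 0.22, "3" = 0.11, "4" = 0.04, "5" = 0.02)
set.seed(42)
sims <- sample(names(probs), size = 10000, replace = TRUE, prob = probs)
## proportion of simulated matches where the team scores exactly 1 goal
mean(sims == "1")
```

The resulting vector of simulated scorelines can then feed straight into the player-level models.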

Win Probability Added – Batsman Review

Twenty20 Win Probability Added

The first blog covered building the model and the second looked at reviewing batsman performances. Today, though, I'm going to be looking at reviewing bowlers.

Above you can see the best 10 performances for win probability added in the data I have. All of them include multiple wickets, whereas for runs conceded some are super low and some around average. No bowler appears twice, but these bowlers have added a similar amount to the batsmen in the top 10. Herath's 3-7 spell must have been incredible to watch.

Clearly, the more wickets a bowler takes, the bigger their contribution to their team's chances of winning the match. There is a significant rise: when a bowler takes 0 wickets they hardly ever have a positive impact on their team, while if they take 3 or more wickets they nearly always have a positive impact.

For runs conceded there is some downward trend, with a worse impact the more runs conceded, but it doesn't seem to be as clear cut as wickets taken. Is this a hint that taking wickets is more important than limiting runs conceded?

This distribution of win probability added per 10 balls for all bowlers in this year's Vitality Blast and IPL illustrates how difficult the IPL is. There are more players operating in the positive region in the Vitality Blast than in the IPL, and more players operating in the strongly negative region in the IPL, though the majority of players are around the same level in both competitions.

There's no point in a metric if it has no predictive power and can't tell you how a player might do. For the plot above I took each bowler's WPA in the Vitality Blast across all previous seasons and compared it to the latest, 2020 season. There is a trend in the numbers, so there is some predictive power. Obviously no metric is going to be perfect, because there is so much variability in player performance.

Ben Stokes is not the best Twenty20 bowler. I guess that should be expected, as he's an all-rounder. His latest season of bowling in the IPL was simply awful, with each over on average costing his team 3% of win probability. Maybe his last two seasons have been affected by injury, but his 2020 international performance seems to be at the level he has historically performed at.

Finally, looking at one of the best Twenty20 fast bowlers in the world, Jasprit Bumrah: his IPL performances show a clearly positive trajectory as he has got older and more experienced. There are obviously a lot of other players that could be looked at, so in a future blog I will detail building a Shiny dashboard to search players and look at games. A further blog will look at age effects and type-of-bowler effects.

Building the model, it's important to have a clear statement of the aim, and here that is clearly an estimate of the points a player might achieve in a game week or over several game weeks. First things first, I need data, and most of it is conveniently available in the fplscrapR package. The get_player_details function from the package returns a table with the required season's stats. To create this plot I did that for all seasons, producing one data frame of all the data.

```
# downloading match-by-match data for each player from the fplscrapR package;
# season = 18 is the 18/19 season (repeated for seasons 16, 17 and 19)
season18 <- get_player_details(season = 18)

# taking each season's data, filtering the required columns and putting them together
s16p <- season16 %>% filter(minutes > 0) %>%
  select(playername, total_points, fixture)
s17p <- season17 %>% filter(minutes > 0) %>%
  select(playername, total_points, fixture)
s18p <- season18 %>% filter(minutes > 0) %>%
  select(playername, total_points, fixture)
s19p <- season19 %>% filter(minutes > 0) %>%
  select(playername, total_points, fixture)

# creating the data frame for total points
totpo <- s16p %>%
  bind_rows(s17p) %>%
  bind_rows(s18p) %>%
  bind_rows(s19p)
```

If you have a normally distributed target variable it makes it much easier to model. This, though, is not close to any standard distribution, and I think that's because of the randomness of football and because a certain type of player is much more successful. Also, the points system used in the game means there isn't an equal chance of all numbers appearing. Using the same data set I created the comparison between 2 players in the same team.

The graph above shows the point of the model. They are both players who play in midfield for the same team and therefore should have the same opportunity to score points. However, as you can see, Salah scores high points much more frequently than Henderson, and therefore the model has to have information to make those determinations.

I used the data frame I had already downloaded to look at how the number of goals scored by a team affects how many points a player gets.

There's not a perfect trend, as you would expect; a lot of variability exists in this data due to the different playing positions. There is a weak trend that total game week points increase as the number of goals a player's team scores increases. There is also a clear split in a lot of the data: when their team has scored no goals but a player has lots of points, these are mostly defenders or goalkeepers keeping clean sheets and saving penalties. When more goals are scored, that's when the strikers and attacking midfielders start to score the big points.

```
### collecting each players key statistics from each season
playdat1 <- twenty16 %>% bind_rows(twenty17) %>%
bind_rows(twenty18) %>%
bind_rows(twenty19) %>%
group_by(player_name) %>%
summarise(ninet = sum(time)/90, xgs = sum(xG), xas = sum(xA), npxGs = sum(npxG)) %>%
mutate(xg90 = xgs/ninet, xA90= xas/ninet, npxg90 = npxGs / ninet) %>%
select(player_name, xg90, xA90, npxg90) %>%
left_join(play2, by = "player_name") %>%
select(player_name, xg90, xA90, npxg90, NameID, Pos, FPLPos)
### getting each player's points per 90 and then joining to the key stats of the player
playpoints <- season19 %>% group_by(playername) %>%
summarise(tp = sum(total_points), tm = sum(minutes)) %>%
mutate(p90 = tp/(tm/90)) %>%
filter(tm > 270) %>%
left_join(play2, by = "playername") %>%
left_join(playdat1, by = "player_name") %>%
filter(!is.na(FPLPos.y))
## creating the plot
cols <- c("Defender" = "#8900a1", "GK" = "#c46900", "Midfielder" = "#0ea300", "Striker" = "#0042a6")
ggplot(playpoints, aes(x = xg90, y = p90, col = FPLPos.y)) +
geom_point(alpha = 0.6, size = 3) +
guides(colour = guide_legend(title = "Position")) +
scale_colour_manual(values = cols) + labs(x = "xG/90", y = "Points/90", title = "FPL Points compared to players Expected Goals") +
theme(panel.background = element_rect(fill = "#b8b8b8"), panel.grid.minor = element_blank(),panel.grid.major = element_line(colour = "#363636"), legend.background = element_rect(fill = "#b8b8b8"), plot.background = element_rect(fill = "#b8b8b8"))
```

When xG per 90 is compared to a player's points per 90, you can see it has a strong correlation with the points a striker will score per 90 minutes played. It also seems to have some impact for midfielders, but not really any for defenders and goalkeepers. Therefore it's good to have a source of expected goals data to make the points prediction more accurate. Goalkeepers' and defenders' scores will mainly be impacted by how good the opposition is, so as well as individual player xG/90 I will be using the whole-team number too.

The key takeaway is that the model needs to take into account the expected result. Then it needs a method to split out the attacking players from the other players: your Mo Salahs from your Jordan Hendersons. Therefore this is going to be multiple separate models combined into one overall model. In the next blog I'll go through the first part of it: what data I am using, where it comes from, and getting the data ready for the model.

https://theparttimeanalyst.com/2020/06/20/twenty20-win-probability-added/

I created the metric using a logistic regression machine learning model. Now it's time to apply the model to real data and look at what insights it can show.

The first question I want to ask is which performance in my data had the biggest impact on a team's chances of winning.

There we see the top 10 batting performances for win probability added. As we can see, the Universe Boss Chris Gayle appears 3 times in the top 10, with the best performance ever being his 151 off 62 balls in the Blast against Kent.

Chris Gayle has always opened the batting, which leads me to wonder whether where a batsman bats affects how much win probability they add.

When in an innings a batsman faces their first ball is a proxy for where they batted in the order. As you can see from the graph above, the win probability added is quite similar for any innings starting up to just after the halfway point. After that, innings have on average had a negative impact on the team's chances of winning. This tells me that either the best players open the batting, or batting earlier makes it easier to strike at a high rate and score more runs. I guess, like anything, it's probably a mixture of the two.

For the final part of the overall analysis I looked at how a batsman's WPA is affected by the competition. A net positive means the competition is easier, as a player generally adds more than their career average; a negative means the player adds less. I found it quite surprising that the Caribbean Premier League is the most difficult for batsmen. It looks like a lot of batsmen went there and struggled.

This year's Blast finished in October with the Notts Outlaws taking the trophy for the second time. For this win probability metric I think I will normalise it to win probability added per 10 balls faced by the batsman.
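A hypothetical helper for that normalisation might look like this (the function name and argument names are mine, for illustration):

```r
## normalise a batsman's total WPA to a per-10-balls rate
wpa_per_10 <- function(total_wpa, balls_faced) {
  ifelse(balls_faced > 0, total_wpa / balls_faced * 10, NA_real_)
}

wpa_per_10(0.24, 40)  # 0.06 win probability added per 10 balls faced
```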

As we can see, most of the performances in this year's Blast were around 0 for the batsmen, but there are clear players who regularly made positive contributions to their team's outcomes. There are also a lot of players with negative outcomes.

Next, we can see that win probability added is highly correlated with the batsman's strike rate. This suggests that, as a general rule, the higher strike rate player might be better even if another player generally scores more runs.

I have looked at some of the big picture areas, like how batting position affects the number. Now I want to move on to player performances over the years, and the first player I want to look at is Jos Buttler.

He seems, in general, to be improving as the years have gone on. In his early years he was facing fewer balls and not contributing as much as he is now. This could be because he was starting his innings much later in the team's innings.

When you look at Buttler's batting position over the span of his career, you can clearly see he changes from coming in on average around halfway through, and not really contributing (lower WPA), to coming in earlier, which coincides with his increase in WPA.

Thanks for reading.

First things first, looking at the xPos data for the whole season. If you want to see how this is calculated, see this blog:

https://theparttimeanalyst.com/2019/11/02/f1-drivers-rated/

Now, it's not the perfect measure, and I have a plan to revise it further in another blog. For example, I think Ocon is overrated by it, which could be caused by poor qualifying performances. However, the metric does show that the best drivers are Verstappen and Hamilton. I don't think many people would argue that they are not the current 2 best drivers on the grid, so it's good to see they top the metric. One driver who is a lot lower than I would expect is Bottas, with a -1 xPos loss, above only the drivers in the two Haas cars and the Williams cars.

When you review both drivers' seasons, you can see Hamilton's consistency and his incredible performance in Turkey, whereas Bottas isn't quite as consistent. You can argue he had some bad luck, like the tyre failure in the last few laps at Silverstone (his worst performance of the season); however, at the Turkish and the 2 Bahrain races he just looks like he went missing, and he was comprehensively beaten by a newcomer in the same car.

Comparing all teams' race laps to each race's fastest lap, on average Mercedes were the fastest car. No surprise there and no real insight to be gained from it; I don't think anyone would put the teams in a different order.

Now, that same data compared to each team's performance in 2019 starts to show some insights. The first thing is that Ferrari and the Ferrari-powered teams are the only ones that drifted backwards after last season. What's interesting is that Alfa Romeo and Haas went only slightly backwards but Ferrari went significantly backwards. This suggests how much their car was designed around the possibly illegal engine, in that losing it had a much bigger impact on Ferrari. Overall, most of the field has got closer to the fastest car, the biggest improvement being Williams, but it's easiest to make improvements from a low base. Racing Point look to be clearly the 3rd quickest but didn't achieve 3rd place in the constructors' championship.

Finally, looking at drivers' performances compared to their teammates, I created this plot, which, if I'm honest, I'm not satisfied with, but I think you can just about see the message I'm trying to convey. Leclerc, Verstappen, Ricciardo and Perez were clearly quicker than their teammates in both qualifying and the race. When you look at the Mercedes drivers, Hamilton is not as far ahead of Bottas as I expected. Williams have the only driver pairing where one is significantly quicker in qualifying but the other is quicker in the race. Lastly, the closest pairing looks to be McLaren: there is nothing between the two drivers in qualifying, but on race pace Sainz looked to have the slight edge. Maybe that is experience on Sainz's part, and Norris will soon progress to that level.

Thanks for reading; that's my review of some key information from the 2020 F1 season. Roll on the 2021 season in March. I hope you, your families and your friends have a great 2021.

Today I'm going to do a little exploration of the data from the F1 2020 season so far, looking at a number of questions. First of all, qualifying, and why a lot of teams are annoyed by (t)Racing Point and the strategy they have used to develop their car. Reviewing the average qualifying positions for each car shows you the quality of the cars on the grid. As an example, here's McLaren.

Since 1990 McLaren have had some big up-and-down changes in grid position. It looks like you're more likely to fall far down the grid than to move far forward: at no time have McLaren ever gained 5 places on the grid compared to the year before.

To really see the improvement Racing Point have made, I compared their average grid position this season to last season: on average they are 6.7 places higher on the grid than last year. Calculating each team's yearly change in qualifying position, unsurprisingly most of the time it doesn't change much forward or back. There are only 7 teams who have made a bigger improvement from one season to the next, and most of those happened with large rule changes (Brawn GP 2009 or Williams 2014). If you are one of Racing Point's rivals, like Renault or McLaren, you would be right to be annoyed if you're struggling to improve by more than 3 places in a season and Racing Point come along and improve by 7 places using development methods that are definitely in the grey.
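The comparison itself is simple enough to sketch in a few lines of base R. The averages below are illustrative stand-ins, chosen only to reproduce the 6.7-place gain quoted above:

```r
## year-on-year change in a team's average grid position (illustrative numbers)
quali <- data.frame(
  season   = c(2019, 2020),
  avg_grid = c(11.9, 5.2)
)

change <- diff(quali$avg_grid[order(quali$season)])
change  # negative means higher up the grid: here a 6.7-place gain
```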

Moving onto the grid in general: this is a small sample size, but clearly so far this season has been dominated by whoever is on pole position. The only race not won from pole was the 70th Anniversary GP, where Verstappen beat both Mercedes. Hopefully that percentage reduces over the coming races, or we could be in for a boring season.

Finally, there has been some controversy this week caused by F1's own rankings of the fastest qualifier. A few months ago I came up with a simple model to track how an F1 driver performs based on where they finished the race compared to expected. See here:

https://theparttimeanalyst.com/2019/11/02/f1-drivers-rated/

It was a simple model and I have ideas to refine it further, coming to a blog near you soon, but here is what the current version says about driver performances so far this year.

Verstappen is way ahead of the other drivers, and part of that is because of Hungary. He only qualified 7th but finished 2nd, which is a big gain; drivers starting higher than 8th generally go backwards on average, so that was a big win for Verstappen. His current value of 3.5 is crazily high compared to how this number looks over the long term, and therefore I expect it to reduce over the next few races. Other good performances look to be Stroll and Perez, but they could be boosted by their poor qualifying at the Styrian GP. Stroll I think is the biggest surprise by this metric, and he seems to be having a good season. Russell looks to be having a bad season, but maybe he has been putting the car on the grid way higher than it should be. This measure is far from perfect; look out for the update to improve it.

That's it for a summary of the data from the F1 season so far. We will see how the season develops over the next few races, and I will update this later in the season so it can be understood further.

This is the output from my model to forecast the Formula 1 grid. In this blog I am going to explain how I went about it.

First things first, I need some data to train the model on. The way a Formula 1 weekend works is that there is free practice on Friday, qualifying on Saturday and the race on Sunday. The aim is to use the data generated on the Friday to forecast the grid. So what data points am I going to use?

- Practice 1 lap time
- Practice 1 difference
- Practice 1 laps
- Practice 2 lap time
- Practice 2 difference
- Practice 2 laps

There are other variables that could be used, such as a way of classifying the type of circuit to try and tease out a car's strengths and weaknesses, but that's an area for future development. I collected data from practice 1 and 2 from 2015 to now.

This produced the above data frame, with over 1700 records in total. As I will be creating a classification model, I also need to add the driver's qualifying position. I wanted a classification model because I want the output to be a percentage chance of achieving each position.

Now I have the data, I can use the tidymodels collection of packages to create the model.

First I use rsample to split the data into training and testing sets. The first part of rsample is creating the rules for splitting your data. The arguments are the data you are using, the proportion you want in training and testing, and then the strata argument, which I haven't used. The strata argument allows you to balance your target variable across the training and testing sets. As I have many possible outcomes (20, the total size of the F1 grid) and a relatively small data set, it was not possible to use the strata argument.

The next step is that you can either create a recipe to pre-process your data using the recipes package, or go straight to parsnip to create your model. I went straight to parsnip, as my data was relatively simple and there was no need to pre-process it. As you can see in the code above, I am using a random forest model initially, and you can see how simple it is to create.
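As a minimal, runnable illustration of that parsnip step, here is the same pattern on toy data. The data frame and its column names are assumptions standing in for the real practice-session features:

```r
library(parsnip)

set.seed(1)
## toy stand-in for the practice data: 20 possible grid positions as the target
toy <- data.frame(
  quali_pos = factor(sample(1:20, 400, replace = TRUE)),
  fp1_time  = rnorm(400, 90, 2),
  fp2_time  = rnorm(400, 89, 2)
)

## random forest classifier fitted through the ranger engine
grid_rf <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification") %>%
  fit(quali_pos ~ ., data = toy)
```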

To review the model and test its quality on the testing set, the yardstick package has numerous functions. Above I have plotted a ROC curve for each classification option, i.e. every position on the grid. There is a difference between positions: the model seems to be a lot better at predicting the first 3 grid positions than the others. I think this is because the pattern in F1 recently is that the top 2 or 3 teams have been well ahead of the rest, which makes the classification job a bit easier for the model.

The next step, now I have a baseline model, is to tune it using the tune package. There are 3 tunable hyperparameters for a ranger random forest: trees, mtry and min_n. Trees is the number of trees in the forest and needs to be high enough to stabilise the error rate; I am going to set it to 1000 to start off with. Mtry controls the split-variable randomisation and is limited by the number of features in your data set. Min_n is short for minimum node size and controls when the trees stop splitting. The tune package allows you to conduct a grid search across those parameters to find the right ones.

The results of the tune, ranked by the ROC AUC metric, are shown above. Clearly trees doesn't make any difference; it's spread almost randomly. Mtry shows some variation, with the best value looking to be 3-5. Min_n is on a slope: the tune function automatically selects a range to train over, but it looks like the AUC is still increasing, and maybe the best value is a lot higher than 40. Therefore I am going to use a grid search to tune the model.

I conducted a grid search across a range of values for min_n and mtry. You can see the best value for min_n is between 200-300, with mtry at 4. Doing a grid search across both parameters means you control for the influence of each on the other, and therefore get the best value for both. I then trained another model with those values.
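The tuning setup described above can be sketched like this: mark mtry and min_n as tunable, fix trees at 1000, and build the manual grid with expand.grid. The value ranges are assumptions following the results discussed above:

```r
library(parsnip)
library(tune)

## tunable random forest spec: trees fixed, mtry and min_n left to the search
rf_tune_spec <- rand_forest(trees = 1000, mtry = tune(), min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

## manual grid over the promising ranges
rf_grid <- expand.grid(mtry = 3:5, min_n = seq(100, 400, by = 50))
nrow(rf_grid)  # 21 candidate combinations to evaluate
```

The grid would then be passed to tune_grid along with a set of resamples of the training data.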

Comparing the original model to the now tuned model, it is slightly better for most positions. This is the model I will use going forward. To improve it, I think I need to add weather data: for example, in the recent Hungarian Grand Prix second practice was affected by rain, which makes this model difficult to run. During an F1 race weekend the teams have 3 different tyre compounds (soft, medium and hard). Adding the compound each lap was set on would improve the model, because a lap might have been set on a slower compound than others.

Full Code:

https://github.com/alexthom2/F1PoleForecast/blob/master/Polepos_reg.Rmd

Also check out the tidymodels website here

Each player's impact on the game can then be quantified, and the best players will have the highest impact on the match.

In the plot above, 1 is a win and 0 is a loss. For 4 balls in the middle of the first innings there is clearly some segmentation: more runs at any stage means more chance of winning (there is more purple). The task then is to translate this visualisation into a usable model of how much extra chance of winning the player has added.

I am using Cricsheet's ball-by-ball data for all Twenty20 matches, found here:

Therefore I have ball-by-ball data for the IPL, internationals, the Big Bash, the Blast and the PSL. This totals over 600,000 balls and should be a nice large data set to train the model.

I am going to create 2 models: one for the first innings and one for the second. For both models the main features are how many balls have been faced, how many runs have been scored and how many wickets have been lost. For the second innings there is the effect of scoreboard pressure, so I will also add the feature of how many runs are required. Maybe it's even irrelevant how many runs are required from the first ball, and it's all about how many runs are required at a particular stage.

I am using the tidymodels collection of packages to create the initial model and the subsequent models in this series. Splitting the data into training and testing sets is easy: rsample has a function called initial_split which is perfect for this example.

With initial_split you specify the data being used for the model and the proportion with which to split it; I have also used the strata argument. This means both my training and testing data sets have a balanced number of winning and losing rows. The split object can then be used with the training and testing functions to simply create the two data sets.
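A minimal sketch of that split, on a toy stand-in for the ball-by-ball data (the column names are my assumptions):

```r
library(rsample)

set.seed(1)
## toy data: did the batting side go on to win the match at this ball?
balls <- data.frame(
  won  = factor(rep(c("yes", "no"), each = 500)),
  runs = rpois(1000, 7)
)

ball_split <- initial_split(balls, prop = 0.8, strata = won)
ball_train <- training(ball_split)
ball_test  <- testing(ball_split)
table(ball_train$won)  # roughly equal counts, thanks to strata
```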

I am going to start by training a logistic regression model. In the future I will look at other models and compare them to this one.

Now that I have training and testing data I can move to the parsnip package.

Above you can see the code to set up the first logistic regression model. The first part, logistic_reg, sets up the type of model I am going to create. The mode argument matters more for model types that can be used for both classification and regression; logistic regression is only for classification problems, so the argument isn't really needed here. Next comes the set_engine function, and this is the great benefit of tidymodels: there are many different machine learning engines, and set_engine gives access to many of them through a common interface. In this version I have just used a simple glm engine, though others are possible (stan, spark and keras).

Finally, the fit function is the same function you would use whatever the model; it simply takes the formula and the data.
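For reference, here is a base-R sketch of the equivalent fit that the glm engine ultimately dispatches to, on toy data. The column names are assumptions, not the real feature names:

```r
set.seed(1)
## toy ball-by-ball stand-in: win indicator plus the match-state features
balls <- data.frame(
  win   = rbinom(1000, 1, 0.5),
  runs  = rpois(1000, 8),
  balls = sample(1:120, 1000, replace = TRUE),
  wkts  = sample(0:9, 1000, replace = TRUE)
)

## the glm call that set_engine("glm") wraps
wp_mod <- glm(win ~ runs + balls + wkts, data = balls, family = binomial)

## predicted win probability for one hypothetical game state
predict(wp_mod, newdata = data.frame(runs = 60, balls = 48, wkts = 2),
        type = "response")
```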

I tested the model on the testing data by plotting a ROC curve.

Comparing the first and second innings models, the second innings model is clearly better at classifying which position is a winning one. However, that isn't really the point; the key is that both models are better than random guessing, and I can use this baseline to further enhance the models.

Applying the models to a match from 2019, you can see how the win probability varies across both teams. In the first half of the innings, when Durham batted, it was pretty even. However, when Northamptonshire chased, Durham took control and won the match.

Above you can see a summary of each batsman's contribution to the match. Despite what appears to be a poor batting performance, Northamptonshire have the 2 batsmen with the best contributions. Durham's batsmen didn't make any big negative or positive contributions; they just kept the team in the game, and it looks like the bowling won the match.

Moving onto the bowlers, you can clearly see where the match-winning contribution came from: Potts took 3 wickets for just 8 runs in 3.3 overs. This looks to have been the match-winning contribution, although, looking at the information, Short got man of the match.

That's it for the first part of this series. In the next part I will take this simple implementation and review it further with more complicated machine learning models. If you have any feedback or comments please let me know. Stay safe.


In it I looked at calculating the Pythagorean win percentage for each team in the IPL, and then moved on to calculating how many extra runs are needed to win one extra game. All team building should therefore be done to get to that number: where can you get an extra 60 runs from?

The question I am looking to answer is predicting how many runs a batsman may score. I think by far the most predictive element of how many runs a batsman will score is how many balls they face.

Plotted together they have an r squared of 0.8627, so 86% of the variance in runs is explained by how many balls a batsman faces. The next question is how to come up with a good value for how many balls a batsman will face. It will depend on the bowler and on when in the innings the batsman is batting; these are areas where the model could be further refined going forward. To start with, however, I will keep it simple.

Overall the distributions of innings lengths are similar, as you would expect. The distributions peak at low values, reflecting that most innings in the IPL are relatively short, but they have a long tail showing that there are plenty of innings of substantial length as well.

In order to simulate how many runs a batsman might score, I am going to use the beta distribution and randomly draw from it.

The beta distribution with shape parameters of 1.25 and 6, over 900 draws, gives a shape broadly similar to the historical shape of all IPL innings. Each team plays 14 games in the group stage of the IPL, which is a relatively small sample size, and batsmen vary widely in quality. The beta distribution takes two shape arguments which dictate its overall shape; the idea is to use a batsman's historical average balls faced to adjust the shape2 parameter. This will then give a more reasonable draw of balls faced for each batsman.
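A minimal sketch of the draw. Scaling the unit-interval beta draws up to a 120-ball maximum (a full 20 overs) is my assumption about how the draws map to balls faced:

```r
# draw 900 simulated innings lengths from a beta(1.25, 6) distribution,
# scaled to a maximum innings of 120 balls (assumed scaling)
set.seed(42)
balls_faced <- round(rbeta(900, shape1 = 1.25, shape2 = 6) * 120)

# compare the simulated shape to the historical innings lengths
hist(balls_faced, breaks = 30, main = "Simulated balls faced per innings")
```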

Reviewing the average number of balls per innings compared to the total number of balls faced

It looks like, over a fairly decent career length, it is pretty difficult to average more than 30 balls per innings. However, there are a few batsmen who average significantly more over a relatively short career. To keep the model accurate, if a batsman has an average innings length of more than 30 balls but has faced fewer than 1000 balls in total, I am going to cap their average at 30. This will stop the model over-weighting small-sample-size batsmen. This methodology can be further refined in the future.
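The capping rule can be expressed as a single mutate; `batsmen`, `av_balls` and `total_balls` are placeholder names for the summary table and its columns:

```r
library(dplyr)

# cap small-sample batsmen: averaging over 30 balls per innings off
# fewer than 1000 career balls gets pulled back to 30
batsmen <- batsmen %>%
  mutate(av_balls = if_else(av_balls > 30 & total_balls < 1000,
                            30, av_balls))
```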

We now have an output which simulates how many balls each batsman might face in an innings. The next step is to turn that into a number of runs. I am going to use a linear model to predict the runs scored. This model will be further refined in the future, and I will talk about it another time. These are the predictors I will be using:

**Features**

- No. balls faced – drawn from the beta model above
- Dot percentage – percentage of balls faced which end in dot balls
- Non-boundary strike rate – does the batsman rotate the strike or just stand there hitting sixes?
- Six percentage and four percentage – what percentage of their balls do they hit for six or four?
- Strike rate – the batsman's overall strike rate

These features can be used to predict how many runs a batsman would be expected to score in the IPL. For now I am just using a simple linear model; this could be improved with a more powerful model and probably more powerful predictors, but this is a first version, so I am keeping it simple.
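A sketch of that linear model with base R's lm; the data frame `batsman_dat` and the feature column names are placeholders matching the list above:

```r
# simple linear model predicting runs from the features listed above
runs_mod <- lm(runs ~ balls_faced + dot_pct + non_boundary_sr +
                 six_pct + four_pct + strike_rate,
               data = batsman_dat)

# inspect coefficients and fit
summary(runs_mod)
```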

Now onto evaluating model performance. The model is only any use if it produces values around what you would expect, so it's important to test it. The first test: in the 2019 season, what percentage of batsmen did the model get within +/- 50 runs?

As you can see, across 10000 runs of the model around 55% of batsmen were within 50 runs, which I am relatively pleased with. Model performance is an iterative process and this looks to be a good baseline to start from.
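That check can be computed with a couple of dplyr verbs; `eval_2019`, `pred_runs` and `actual_runs` are assumed names for the evaluation table and its columns:

```r
library(dplyr)

# proportion of batsmen whose simulated season total landed
# within +/- 50 runs of their actual 2019 total
eval_2019 %>%
  mutate(within_50 = abs(pred_runs - actual_runs) <= 50) %>%
  summarise(pct_within_50 = mean(within_50))
```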

Also, when the actual runs are plotted against the predicted runs, most batsmen follow a similar line. The furthest point away seems to be KL Rahul, who scored a lot fewer runs than predicted. That's the model for today. In the next blog I'm going to look at individual player performances and compare the 2020 IPL squads: who bought the best players? The code for the model is available on GitHub:

https://github.com/alexthom2/IPL_Moneyball/blob/master/Modv1.Rmd

The model I will be using is fivethirtyeight.com's. Nate Silver's model makes predictions for a number of leagues and competitions across the world. Below is a link to the predictions:

https://projects.fivethirtyeight.com/soccer-predictions/?ex_cid=rrpromo

and this is the methodology used:

So the idea is to compare how this model does across all the leagues: if the Championship is the most unpredictable league, then the model will perform worst on it. Simple logic. Just to be clear, this is not about picking faults in the model; my skill is nowhere near that of Nate Silver and his team.

They produce a csv file which has all the predictions since the 16/17 season. The first thing I'm going to look at is simply how many correct results the model has got by league. If the Championship is the most unpredictable league, it should have the fewest correct predictions.
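A sketch of how "correct" can be counted, calling a prediction correct when the outcome given the highest probability actually happened. The column names (prob1, prob2, probtie, score1, score2, league) follow the fivethirtyeight spi_matches.csv, though this scoring rule is my assumption, not necessarily the one used in the charts:

```r
library(dplyr)

spi %>%
  mutate(pred = case_when(prob1 >= prob2 & prob1 >= probtie ~ "home",
                          prob2 >= prob1 & prob2 >= probtie ~ "away",
                          TRUE ~ "tie"),
         actual = case_when(score1 > score2 ~ "home",
                            score2 > score1 ~ "away",
                            TRUE ~ "tie")) %>%
  group_by(league) %>%
  summarise(pct_correct = mean(pred == actual, na.rm = TRUE)) %>%
  arrange(desc(pct_correct))
```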

Not true. While the Championship is on the lower end, it is some way above the least predictable, which is League 2. The Chinese Super League looks to be the most predictable. Looking at the leagues near the top, the Barclays Premier League and the Scottish Premiership both have teams that are a lot better than the others (the top 6 in the Premier League; Celtic and Rangers), which makes those leagues a lot more predictable.

In League 2, a lot of teams must be pretty evenly matched, making predictions harder.

The trend of the English leagues shows that over the three years the Championship predictions have become more accurate. This could be improvements in the model or the Championship itself becoming more predictable; it's impossible to tell which, though if it were model performance then the other leagues would probably have improved too.

The model includes a percentage chance of each result, so to look at how close a league is I compared the difference in win probability between the two teams in each match.

Measuring the difference in win probability between the favourite for a match and the other team, the Championship ends up the 5th closest league overall. A small gap between the favourite and the other team means that a league has a lot of parity and will therefore be unpredictable. So although the Championship has not been the most unpredictable league so far, it does seem to be an unpredictable one.
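This measure is straightforward to compute from the same csv; again the column names follow the fivethirtyeight file, and `spi` is an assumed name for the loaded data:

```r
library(dplyr)

# average gap in win probability between the favourite and the
# other team, by league: smaller gap = more parity
spi %>%
  mutate(fav_gap = abs(prob1 - prob2)) %>%
  group_by(league) %>%
  summarise(mean_gap = mean(fav_gap, na.rm = TRUE)) %>%
  arrange(mean_gap)
```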

Focusing in on the Championship across multiple seasons, the difference in win probability doesn't seem to alter much from year to year. In fact, compared to the correct prediction rate, which definitely increased, there is no discernible change in the difference between the two teams' win probabilities.

Overall, I think the Championship is shown to be quite unpredictable. It is not the most unpredictable, but on a few measures it sits amongst the group of most unpredictable leagues. There are two main reasons for a league being unpredictable: a lack of information about the league, or a lot of parity within it. With the Championship it is definitely the latter. This has also highlighted some predictable leagues which may be fruitful for betting.

All my code should now be on my GitHub below:

https://github.com/alexthom2/TheChampionship/blob/master/UnpreditableExploration.Rmd
