Today I’m going to do a little exploration of the data from the F1 2020 season so far, looking at a number of questions about the season. First of all, qualifying, and why a lot of teams are annoyed by (t)Racing Point and the strategy they have used to develop their car. Reviewing the average qualifying position for each car shows you the quality of the car on the grid. As an example, here’s McLaren.

Since 1990 McLaren have had some big swings in grid position, both up and down. It looks like you’re more likely to fall a long way down the grid than to move a long way forward: at no point have McLaren gained five places on the grid compared to the year before.

To really see the improvement that Racing Point have made, I have compared their average position on the grid this season to last season. On average they are 6.7 places higher on the grid than last year. Calculating each team’s yearly change in qualifying position, unsurprisingly most of the time it doesn’t change much in either direction. There are only 7 teams who have made a bigger improvement from one season to the next, and most of those came with large rule changes (Brawn GP 2009 or Williams 2014). If you are one of Racing Point’s rivals like Renault or McLaren, you would be right to be annoyed if you are struggling to improve by more than 3 places in a season while Racing Point come along and improve by 7 places using development methods that are definitely in the grey.
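The yearly-change calculation behind this is simple to sketch. The post’s analysis is in R; here is a hedged Python/pandas version with entirely made-up grid positions, purely to show the shape of the groupby-and-diff step:

```python
import pandas as pd

# Hypothetical qualifying results: one row per driver per race (invented numbers)
quali = pd.DataFrame({
    "team": ["Racing Point"] * 4 + ["Renault"] * 4,
    "year": [2019, 2019, 2020, 2020] * 2,
    "grid": [13, 14, 6, 7, 9, 10, 8, 7],
})

# Average grid slot per team per season, then the year-on-year change
avg = quali.groupby(["team", "year"])["grid"].mean().reset_index()
avg["change"] = avg.groupby("team")["grid"].diff()

# A negative change means the team moved up the grid vs the previous season
print(avg)
```

With real data you would then rank every team-season pair by `change` to see how rare a seven-place jump actually is.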

Moving onto the grid in general, this is a small sample size, but so far this season has clearly been dominated by whoever is on pole position. The only race not won from pole was the 70th Anniversary GP, where Verstappen beat both Mercedes. Hopefully that percentage reduces over the coming races or we could be in for a boring season.

Finally, there has been some controversy this week caused by F1’s own rankings of the fastest qualifier. A few months ago I came up with a simple model to track how an F1 driver performs based on where they finish the race compared to where they were expected to. See here:

https://theparttimeanalyst.com/2019/11/02/f1-drivers-rated/

It was a simple model and I have ideas to refine it further, coming to a blog near you soon, but here is what the current version says about drivers’ performances so far this year.

Verstappen is way ahead of the other drivers, and part of that is because of Hungary. He only qualified 7th but finished 2nd, which is a big gain: a driver starting higher than 8th on average goes backwards, so that was a big win for Verstappen. His current value of 3.5 is crazily high compared to how this number looks over the long term, and therefore I expect it to reduce over the next few races. Other good performances look to be Stroll and Perez, but they could be boosted by their poor qualifying at the Styrian GP. Stroll is, I think, the biggest surprise by this metric, and he seems to be having a good season. Russell looks to be having a bad season, but maybe he has been putting the car on the grid far higher than it should be. This measure is far from perfect; look out for the update to improve it.

That’s it for a summary of the data from the F1 season so far. We will see how the season develops over the next few races, and I will look to update this later in the season so we can understand it further.

This is the output from my model to forecast the Formula 1 grid. In this blog I am going to explain how I went about it.

First things first, I need some data to train the model on. The way a Formula 1 weekend works is there is free practice on Friday, qualifying on Saturday and the race on Sunday. The aim is to use the data generated on the Friday to forecast the grid. So what data points am I going to use:

- Practice 1 lap time
- Practice 1 difference
- Practice 1 laps
- Practice 2 lap time
- Practice 2 difference
- Practice 2 laps

There are other variables that could be used, such as a way of classifying the type of circuit to try and tease out a car’s strengths and weaknesses, but that’s an area for future development. I collected practice 1 and practice 2 data from 2015 to now.

This produced the above data frame with over 1,700 records in total. As I will be creating a classification model I also need to add each driver’s qualifying position. I wanted a classification model because I want the output to be a percentage chance of achieving each position.

Now I have the data I can use the tidymodels collection of packages to create the model.

First I use rsample to split the data into training and testing sets. The first part of rsample is to create the rules for splitting your data. The arguments are the data you are using, the proportion you want in training versus testing, and then there is the strata argument, which I haven’t used. The strata argument allows you to balance your target variable across the testing and training sets. As I have lots of possible outcomes (20, the total size of the F1 grid) and a relatively small data set, it was not possible to use the strata argument.
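To make concrete what a stratified split does, here is a hedged sketch in Python using scikit-learn’s `stratify` argument, which plays the same role as rsample’s `strata` (the blog’s own code is in R; the toy data below is invented):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with an imbalanced target (80 zeros, 20 ones)
X = list(range(100))
y = [0] * 80 + [1] * 20

# stratify=y keeps the 80/20 class balance in both splits,
# which is what rsample's strata argument does in R
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=42)

print(Counter(y_tr), Counter(y_te))
```

With 20 qualifying positions as classes and a small data set, some classes would have too few rows to stratify on, which is why it wasn’t used here.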

The next step is to either create a recipe to process your data using the recipes package or go straight to parsnip to create your model. I went straight to parsnip as my data was relatively simple and there was no need to pre-process it. As you can see in the code above I am using a random forest model initially, and you can see how simple it is to create.

To review the model and test its quality on the testing set, the yardstick package has numerous functions. Above I have plotted a ROC curve for each classification option, i.e. every position on the grid. There is a difference between positions: the model seems to be a lot better at predicting the first 3 grid positions than the others. I think this is because the pattern in F1 recently is that the top 2 or 3 teams have been well ahead of the rest, which makes the classification job a bit easier for the model.

The next step, now I have a baseline model, is to tune it using the tune package. There are 3 tunable hyperparameters for a ranger random forest: trees, mtry and min_n. Trees is the number of trees in the forest and needs to be high enough to reduce the error rate; I am going to set it at 1,000 to start off with. Mtry controls the split-variable randomisation and is limited by the number of features in your data set. Min_n is short for minimum node size and controls when the trees stop splitting. The tune package allows you to conduct a grid search across those parameters to find the right ones.
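The same kind of search can be sketched in Python with scikit-learn, where `max_features` and `min_samples_leaf` are rough analogues of ranger’s mtry and min_n (the mapping is approximate, and the data below is synthetic):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))               # six practice-session-style features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # stand-in binary target

# ranger's trees/mtry/min_n map roughly to
# n_estimators/max_features/min_samples_leaf in scikit-learn
grid = {"max_features": [2, 3, 4], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    grid, scoring="roc_auc", cv=3)
search.fit(X, y)

print(search.best_params_)
```

The tune package does the equivalent over a tidymodels workflow, scoring each parameter combination by cross-validated ROC AUC.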

The results of the tune, selecting the best by the ROC AUC metric, are shown above. Clearly trees doesn’t make much difference; the results are spread almost randomly. Mtry shows some variation, with the best value looking to be between 3 and 5. Min_n is on a slope: the tune function automatically selects a range to train over, but it looks like the AUC is still increasing and maybe the best value is a lot higher than 40. Therefore I am going to use a wider grid search to tune the model.

I conducted a grid search across a range of values for min_n and mtry. You can see the best value for min_n is between 200 and 300, and for mtry it is 4. Doing a grid search across both parameters means you can control for their influence on each other and therefore get the best value for both. I then trained another model with those values.

Comparing the original model to the now tuned model, it is slightly better for most positions. This is the model I will use going forward. To improve it I think I need to add weather data: for example, in the recent Hungarian Grand Prix second practice was affected by rain, and that makes this model difficult to run. During an F1 race weekend the teams have 3 different tyre compounds (soft, medium and hard). Adding the compound on which each lap was set would improve the model, because a lap might have been set on a slower compound than others.

Full Code:

https://github.com/alexthom2/F1PoleForecast/blob/master/Polepos_reg.Rmd

Also check out the tidymodels website here

Each player’s impact on the game can then be quantified, and the best players will have the highest impact on the match.

In the plot above 1 is a win and 0 is a loss. For 4 balls in the middle of the first innings some clear segmentation can be seen: more runs at any stage means you have more chance of winning (there is more purple). The task then is to translate this visualisation into a usable model of how much extra chance of winning the player has added.

I am using Cricsheet’s ball-by-ball data for all Twenty20 matches, found here:

This gives me ball-by-ball data for the IPL, internationals, the Big Bash, the Blast and the PSL. It totals over 600,000 balls and should be a nice large data set to train the model on.

I am going to create 2 models, one for the first innings and one for the second innings. For both models the main features are how many balls have been faced, how many runs have been scored and how many wickets have been lost. For the second innings there is the effect of scoreboard pressure, so I will be adding how many runs are required as a feature. Maybe it’s even irrelevant how many runs are required from the first ball, and it’s all about how many runs are required at a particular stage.

I am using the tidymodels collection of packages to create the initial model and the subsequent models in this series. Splitting the data into training and testing sets is easy: rsample has a function called initial_split which is perfect for this example.

With initial_split you specify the data being used for the model and the proportion with which to split it; I have also used the strata argument. This means both my training and testing data sets have a balanced number of winning and losing rows. The split object can then be used with the training and testing functions to simply create the two data sets.

I am going to start with a logistic regression model. In the future I will be looking at other models and comparing them to this one.

Now that I have training and testing data I can move to the parsnip package.

Above you can see the code to set up the first logistic regression model. The logistic_reg call sets up which model I am going to create. The mode argument matters more for model types that can be used for both classification and regression; logistic regression is only for classification problems, so the argument isn’t really needed here. Next is the set_engine function, and this is the great benefit of tidymodels: there are many different machine learning engines, and set_engine gives access to many of them through a common interface. In this version I have just used the simple glm engine, though others are possible (stan, spark and keras).
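For readers more familiar with Python, the same idea, fitting a logistic regression on match state and reading off a win probability, looks like this (a sketch with synthetic data; the real model is fitted in R via parsnip):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Toy first-innings state: runs scored and wickets lost (invented data)
runs = rng.integers(0, 200, size=500)
wkts = rng.integers(0, 10, size=500)
X = np.column_stack([runs, wkts])
# Stand-in target: higher scores with wickets in hand tend to win
y = (runs - 10 * wkts + rng.normal(0, 20, 500) > 60).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)
# predict_proba gives the chance of winning from a given match state
print(model.predict_proba([[120, 2]])[0, 1])
```

The per-ball win probabilities this produces are exactly what gets differenced later to credit each player’s contribution.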

Finally, the fit function is the same function you would use whatever the model: it simply takes the formula and the data.

I test the model on the testing data by plotting a ROC curve.

Comparing the first and second innings models, the second innings model is clearly better at classifying which score is a winning one. However, that isn’t really the point; the key here is that both models are better than random guessing, and I can use this baseline to further enhance them.

Applying the models to a match from 2019, you can see how the win probability varies across both teams. In the first innings, when Durham batted, it was pretty even. However, when Northampton chased, Durham took control and won the match.

Above you can see a summary of each batsman’s contribution to the match. Despite what appears to be a poor batting performance, Northamptonshire have the 2 batsmen with the best contributions. Durham’s batsmen didn’t make any big negative or positive contribution; they just kept the team in the game, and it looks like the bowling won it.

Moving onto the bowlers, you can clearly see where the match-winning contribution came from. Potts took 3 wickets for just 8 runs in 3.3 overs. That looks to have been the match-winning contribution, although looking at the scorecard Short got man of the match.

That’s it for the first part of this series. In the next part I will take this simple implementation and review it further with more complicated machine learning models. If you have any feedback or comments please let me know. Stay safe.

https://wordpress.com/view/theparttimeanalyst.com

In it I looked at calculating the Pythagorean win percentage for each team in the IPL, and then moved on to calculating how many extra runs are needed to win one extra game. All team building should therefore be done to get to that number: where can you get an extra 60 runs from?

The question I am looking to answer is predicting how many runs a batsman may score. If I were building a model, I think by far the most predictive element of how many runs a batsman will score is how many balls they face.

Plotted together they have an r squared of 0.8627, so 86% of the variance in runs is explained by how many balls a batsman faces. The next question is how to come up with a good value for how many balls a batsman will face. It will depend on the bowler and on when in the innings the batsman is batting; these are areas where the model could be further refined going forward.
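The kind of fit behind that r squared figure can be sketched like this (Python with synthetic data standing in for real innings; the slope and noise level are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
balls = rng.integers(1, 70, size=300)               # balls faced per innings
runs = 1.3 * balls + rng.normal(0, 8, size=300)     # runs roughly scale with balls

# Simple least-squares fit of runs on balls faced
slope, intercept = np.polyfit(balls, runs, 1)
pred = slope * balls + intercept

# r squared = 1 - residual variance / total variance
ss_res = np.sum((runs - pred) ** 2)
ss_tot = np.sum((runs - runs.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))
```

With a strong linear signal like this, most of the variance in runs falls out of balls faced alone, which is the whole premise of the model.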

Overall the distributions of innings lengths are similar, as you would expect. The distributions peak at a low value, reflecting that most innings in the IPL are relatively short, but they have a long tail showing there are plenty of innings of substantial length as well.

In order to simulate how many runs a batsman might score, I am going to use the beta distribution and draw randomly from it.

The beta distribution with shape parameters of 1.25 and 6, over 900 draws, gives a shape broadly similar to the historical shape of all IPL innings. There are 14 games in the group stage of the IPL, which is a relatively small sample, and there are many different qualities of batsman. The two shape arguments of the beta distribution dictate its overall shape; the idea is to use a batsman’s historical average balls faced to adjust the shape2 parameter. This will then give a more reasonable draw of balls faced for each batsman.
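Drawing from that distribution is a one-liner. Here is a Python sketch using the post’s Beta(1.25, 6) parameters; scaling the draws to a 120-ball Twenty20 innings is my assumption about how the unit-interval draws map to balls faced:

```python
import numpy as np

rng = np.random.default_rng(42)
MAX_BALLS = 120  # a full Twenty20 innings is 20 overs = 120 balls (assumed scaling)

# Beta(1.25, 6) scaled to 120 balls gives a right-skewed innings-length shape:
# most draws are short innings, with a long tail of big ones
draws = np.round(rng.beta(1.25, 6, size=900) * MAX_BALLS)

print(draws.mean(), draws.max())
```

Raising or lowering the shape2 parameter per batsman shifts the mass of this distribution, which is how a batsman’s historical average balls faced feeds in.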

Reviewing the average number of balls per innings compared to the total number of balls faced:

It looks like over a fairly decent career it’s pretty difficult to average more than 30 balls per innings. However, there are a few batsmen who over a relatively short career average significantly more. To keep the model accurate, if a batsman has an average innings length of more than 30 balls and has faced fewer than 1,000 balls in total, I am going to cap their average at 30. This will stop the model over-weighting small-sample batsmen. The methodology can be refined further in the future.

We now have an output which simulates how many balls each batsman might face in an innings. The next step is to turn that into a number of runs. I am going to use a linear model to predict the runs; this model will be refined further in the future and I will talk about it another time. These are the predictors I will be using:

**Features**

- No. balls faced – drawn from the beta model previously
- Dot percentage – the percentage of balls faced which end in dot balls
- Non-boundary strike rate – does the batsman rotate the strike or just stand there hitting sixes
- Six percentage and four percentage – what percentage of their balls do they hit for six or four
- Strike rate – the batsman’s overall average strike rate

These features can be used to predict how many runs a batsman would be expected to score in the IPL. For now I am just using a simple linear model; this could be improved with a more powerful model and probably more powerful predictors, but this is a first version, so I am keeping it simple.

Now onto evaluating the model’s performance. The model is only any use if it produces values around what you would expect, so it’s important to test it. The first test: in the 2019 season, what percentage of batsmen did the model get within +/- 50 runs?

As you can see, over 10,000 runs of the model around 55% of batsmen were within 50 runs, which I am relatively pleased with. Model performance is an iterative process and this looks to be a good baseline to start from.

Also, when the actual runs are plotted against the predicted runs, most batsmen follow a similar line. The furthest point away seems to be KL Rahul, who scored a lot fewer runs than predicted. That’s the model for today; in the next blog I’m going to look at individual player performances and compare the 2020 IPL squads. Who bought the best players? The code for the model is available on GitHub:

https://github.com/alexthom2/IPL_Moneyball/blob/master/Modv1.Rmd

The model I will be using is fivethirtyeight.com’s. Nate Silver’s model makes predictions for a number of leagues and competitions across the world. Below is a link to the predictions

https://projects.fivethirtyeight.com/soccer-predictions/?ex_cid=rrpromo

and this is the methodology used:

So the idea is to compare how this model does across all the leagues; if the Championship is the most unpredictable league, then the model will perform worst on it. Simple logic. Just to make clear, this is not about picking faults in the model: my skill is nowhere near that of Nate Silver and his team.

They produce a CSV file which has all the predictions since the 16/17 season. The first thing I’m going to look at is simply how many correct results the model has got by league. If the Championship is the most unpredictable league then it will have the lowest proportion of correct predictions.

Not true. While the Championship is at the lower end, it is some way above the least accurately predicted league, which is League 2. The Chinese Super League looks to be the most predictable. Looking at the leagues near the top, the Barclays Premier League and the Scottish Premiership, there are teams a lot better than the others (the top 6 in the Premier League, Celtic and Rangers), which will make those leagues a lot more predictable.

In League 2 a lot of teams must be pretty evenly matched, making predictions harder.

The trend of the English leagues shows that over the 3 years the Championship’s predictions have become more accurate. This could be improvements in the model or the Championship itself becoming more predictable; it’s impossible to tell which, though if it were model performance then the other leagues would probably improve too.

The model includes a percentage chance of each result, so to look at how close a league is I compared the difference in percentage chance of a win between the two teams in each match.

Measuring the difference in win percentage between the favourite for a match and the other team, the Championship ends up the 5th closest league overall. A lower gap between the favourite and the other team means a league has a lot of parity and will therefore be unpredictable. So although the Championship has not proved to be the most unpredictable league, it does seem to be an unpredictable one.

Focusing in on the Championship across multiple seasons, the difference in win probability doesn’t seem to alter much from year to year. In fact, compared to the proportion of correct predictions, which definitely increased, there is no discernible change in the difference between the two teams’ win probabilities.

I think overall the Championship is shown to be quite unpredictable – not the most unpredictable – but across a few measures it sits amongst the group of most unpredictable leagues. There are 2 main reasons for a league being unpredictable: no information about the league, or a lot of parity in the league. With the Championship it is definitely the latter. This has also highlighted some predictable leagues which may be fruitful for betting.

All my code is now on my GitHub below

https://github.com/alexthom2/TheChampionship/blob/master/UnpreditableExploration.Rmd

The background to this is that I have been working through the Math behind Moneyball course on Coursera, linked here:

https://www.coursera.org/learn/mathematics-sport

The problem is most of the course is based in Excel, and in the modern world I like to use code to analyse data, R in particular. It is also based mostly on baseball, and as I am not a particularly big baseball fan I am going to apply it to cricket.

The first concept the course looks at is the Pythagorean expectation. It’s slightly different to the theorem you probably remember from school about the sides of a triangle: this one was created by Bill James for baseball, and it uses the number of runs scored and conceded to estimate a team’s win percentage. Applying it to cricket, I’m going to focus on the IPL. A key part of the formula is the exponent. This is a constant, and for baseball it is 2; however, there is no great literature for cricket. I have found one blog that quotes it as 8, but let’s compute it ourselves and see what comes out.
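The formula itself is tiny. Here is a Python sketch of the Pythagorean expectation, with the exponent defaulted to 10 to match the IPL estimate arrived at below (for baseball you would use 2):

```python
def pythag_win_pct(runs_scored, runs_conceded, exponent=10):
    """Bill James' Pythagorean expectation: RS^x / (RS^x + RA^x)."""
    rs = runs_scored ** exponent
    ra = runs_conceded ** exponent
    return rs / (rs + ra)

# Scoring exactly as many runs as you concede should give a 50% win expectation
print(pythag_win_pct(2225, 2225))
```

Fitting the exponent then just means searching for the value of x that best matches teams’ actual win percentages.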

Based on this data the best value for the exponent in the IPL looks to be around 10 or 11, which differs from the previous work I found. Applying this to every team since the 2013 season, I can compare their actual win percentage against the predicted win percentage.

Some interesting trends are visible here. Sunrisers Hyderabad have gradually increased their predicted win percentage since 2013: is this smarter recruitment? They also massively underperformed their predicted win percentage in 2019, so is there a chance of regression to the mean in 2020? Chennai seem to have overperformed the last 2 years and show a general downward trend in predicted win percentage. The next thing that can be done with this is to calculate how many extra runs you need to win one extra Twenty20 game.

Above I create a data frame with runs scored increasing in steps of 5, from 2,225 (the average total a team scores across an IPL league season) up to 2,285.

The summary table of the output shows that scoring an extra 60 runs over the season is equivalent to one extra win. Therefore you need to recruit the players to achieve that extra 60 runs. That can also come from bowlers: if you concede 30 fewer runs, you only need to score 30 more. That is going to be the subject of the next blog.
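The runs-per-win sweep can be reproduced like this. A sketch in Python assuming a 14-game league stage, the 2,225-run seasonal average from above, and the exponent of 10 estimated earlier:

```python
GAMES = 14   # IPL league-stage matches per team
RA = 2225    # assumed average runs conceded over a season
EXP = 10     # exponent estimated earlier in the post

def expected_wins(runs_scored, runs_conceded=RA, exponent=EXP, games=GAMES):
    """Pythagorean win expectation scaled to a season's worth of games."""
    rs = runs_scored ** exponent
    ra = runs_conceded ** exponent
    return games * rs / (rs + ra)

# Sweep runs scored upwards in steps of 5 and watch expected wins climb
baseline = expected_wins(RA)  # 7.0 wins at parity
for runs in range(2225, 2290, 5):
    print(runs, round(expected_wins(runs), 2))
```

Under these assumptions, moving from 2,225 to around 2,285 runs lifts expected wins from 7 towards 8, which is where the roughly-60-extra-runs-per-win figure comes from.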

The idea is to use historical data on where the average driver has finished compared to their grid slot. So if a driver qualifies 2nd, and drivers who qualified 2nd have historically finished 3.4 on average, and this driver finishes 1st, that performance would be worth 2.4. This number can either be averaged over the long term or, in the short term, used as a plus/minus statistic in F1 broadcasting.
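The bookkeeping for this metric is straightforward. A Python/pandas sketch with invented drivers and results, purely to show the expected-finish-minus-actual-finish calculation:

```python
import pandas as pd

# Hypothetical results: grid slot and finishing position per driver per race
results = pd.DataFrame({
    "driver": ["A", "B", "A", "B", "C", "C"],
    "grid":   [2, 5, 2, 5, 2, 5],
    "finish": [1, 7, 3, 6, 4, 5],
})

# Historical average finish for each grid slot (the "expected" finish)
expected = results.groupby("grid")["finish"].mean().rename("exp_finish")
results = results.join(expected, on="grid")

# Positive = finished ahead of what that grid slot usually produces
results["plus_minus"] = results["exp_finish"] - results["finish"]
print(results.groupby("driver")["plus_minus"].mean())
```

Averaging `plus_minus` over a career gives the long-term rating; a single race’s value is the broadcast-friendly plus/minus.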

The Data

I often go back to this, but Kaggle is one of the best sources of data for whatever you want to look at. For this there is a whole F1 data set covering all sorts of information. All I need here are the results and races tables. The results table has each grand prix result for each driver as well as the qualifying position they started from.

The first thing to look at is the average finishing position by qualifying position.

Overall, if you start in the top 8 you are on average going backwards in the race; from 9th downwards, you on average finish higher than you started. However, I wonder how much the lower starters are affected by retirements. If you start last, every retirement is in front of you and you can only move forward; if you start from pole, every retirement is behind you and you can only stay where you are or go backwards. Hence on average the finishing position is lower than the starting position for first. The first thing I need to do is control for retirements so everyone is on a level playing field.

Now I can see the percentage of retirements by grid position. Clearly the worse cars towards the back of the grid have a higher retirement rate, and I can use that to normalise the results in the KPI calculation.

After running the model the first time, this is the list of the best drivers since 2000 by their average position change over their career. I think there must be an error here, as I don’t think, with all due respect, that Alex Yoong and Enrique Bernoldi are the best drivers to have graced the F1 grid. FYI, the Verstappen you can see in 9th is not Max but his dad Jos, who was nowhere near as good.

The error was that I was creating the adjusted position from the grid position rather than the qualifying position. Making that change and creating the same graph shows this:

Now that’s more like it: these are the top 29 drivers by finishing position, and there are a lot of pretty big names on it, including all the world champions of the last 20 years. There are also some interesting names people maybe wouldn’t instantly think of, such as Kobayashi and Friesacher.

When a driver’s xP is compared to the number of races they competed in, you can clearly see that drivers with better ratings do more races, and some of the drivers with the highest ratings are world champions.

Let’s focus on a couple of drivers: first the current World Champion, Lewis Hamilton, and second Nico Hulkenberg.

Hamilton’s performance over the years has two distinct periods. In the early years at McLaren the field was a lot closer and he was rarely in a dominant car. Then, moving into the hybrid era, his total increases significantly, partly due to having a more dominant car and maybe worse reliability, meaning he sometimes started lower on the grid. A more dominant car means that if you start lower you gain more positions in those races. This is maybe a limitation of the metric, and going forward I may have to control for how inherently fast the car is.

Hulkenberg’s career up to 2017 was a bit of a mixed bag. Across those 8 seasons he has only 2 with strongly positive position differences; four are strongly negative. I chose Hulkenberg because he’s the driver with the most race starts without a podium, and looking at this record you can possibly see why. This rating obviously isn’t the be-all and end-all of a driver’s career, but it is a way to try and understand who the good and bad ones are.

This is just a first exploration of a way of measuring F1 drivers’ performances. There are probably other measures that could be used to gain a wider picture of how good an F1 driver truly is. I think I can also further improve the model by including the circuit, as certain circuits are easier or harder to overtake at, which will affect the numbers. Room for further development.

https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data

Let’s read the data into R and take a look at it.

I can see there are 17 columns and over 48,000 records, with information covering the price and location of each AirBnB. Looking at the location column, let’s see where most of the AirBnBs in New York are.

Above you can see the code and its output, which is the plot above. Most of the AirBnBs are located in the Williamsburg and Bedford-Stuyvesant areas of Brooklyn. Manhattan areas are also very prevalent in the top 30 neighbourhoods. Now let’s look at how the price of an AirBnB changes by area.

Looking at how the price varies by location doesn’t really show much yet; it suggests the most expensive AirBnB is in some area of Brooklyn, which intuitively doesn’t seem correct to me.

A quick histogram of the price shows there may be some questionable data in the AirBnB data set. I would say around 90% of the AirBnBs in the New York area are less than $1,000 per night. There are some above that, but because they are so much more expensive than the rest of the data set I suspect they are bad data. In fact, if we look at the most expensive listings, some are shared rooms; no one is going to pay $10,000 for that. This is bad data and must be removed.

Removing the bad data and redoing the plot, there is now a much clearer picture: there seems to be a high density of expensive AirBnBs in upper Manhattan. This shows all listings, however, and the price probably differs by room type. It is not so surprising, though, as the higher prices seem to be in the tourist hot spots. Moving on, let’s start building a model to predict the price.

I am just going to use simple linear modelling, but I have split the data into training and testing sets for the model. The first variable I am going to use is room type: shared room, entire dwelling or private room.

Above we can see the summary of the first fitted model. Compared to an entire home, private rooms are about $114 cheaper per night and a shared room is about $130 cheaper per night. The bad part is that the residual error is pretty poor at 128, which means on average this model would be about $128 out when predicting a price.

I re-ran the model with neighbourhood, number of reviews and reviews per month included. Above you can see a summary of the variables that affect the price the most, both negatively and positively. Room type seems to have the biggest negative effect, while the Tribeca and Flatiron neighbourhoods seem to be the most sought-after locations. The residual error is down to 105.7; to improve it further I will need to do some feature engineering.

The first thing I am going to look at is the description column. I wonder if there are some words in there that highlight more expensive homes. One word I can think of is luxury: anything described as luxury seems more expensive than something that isn’t.

I have used the str_detect function on the name variable to find properties with “luxury” in their name. Adding that to the linear model shows that properties with luxury in the name are often $59 higher per night than others.
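This kind of keyword flag translates directly to other tools. For anyone following along in Python rather than R, here is an equivalent sketch with `str.contains` in place of str_detect (the four listings are invented):

```python
import pandas as pd

listings = pd.DataFrame({
    "name": ["Luxury loft in SoHo", "Cozy room near park",
             "Spacious 2BR apartment", "luxury midtown suite"],
    "price": [350, 80, 200, 300],
})

# Flag names containing "luxury" (case-insensitive), like str_detect in R
listings["is_luxury"] = listings["name"].str.contains("luxury", case=False)

# Compare average price for flagged vs unflagged listings
print(listings.groupby("is_luxury")["price"].mean())
```

The resulting boolean column then goes into the regression like any other dummy variable.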

What’s clear, looking at the counts of each word in the listing names, is that a lot of people like highlighting where in New York the place is. Other words that could be used for the model: cozy, spacious, beautiful and large. I’m going to add clean to the list as well.

Now I can see that, apart from luxury, none of the other words in the descriptions have much effect on the price. Cozy is maybe a euphemism for the property being small, which would explain its negative effect on the price. Now I have my model, let’s apply it to the unseen data and see how it does.

Running the model on the unseen data gave some surprising results: a few places have negative predicted prices, which clearly isn’t correct. So I may have to model against the log of the actual price. Taking the log dramatically improves the predictive power of the model: the residual error is now 0.614 with an r squared of 0.52.
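The reason the log transform fixes the negative predictions is worth spelling out: a linear model on log(price) is exponentiated back to price, and exp() of any number is positive. A small Python sketch with synthetic multiplicative prices (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy listing feature and prices with multiplicative noise,
# like real prices, which can't go below zero
nights = rng.integers(1, 30, size=500)
price = 50 * np.exp(0.05 * nights + rng.normal(0, 0.3, 500))

# Fit log(price) instead of price; back-transform with exp()
slope, intercept = np.polyfit(nights, np.log(price), 1)
pred_price = np.exp(slope * nights + intercept)

# Every back-transformed prediction is strictly positive
print(pred_price.min() > 0)
```

The same trick applied to the AirBnB model guarantees no listing is ever predicted at a negative nightly rate.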

Above we can see the results of the log-transformed model. There are still some higher-priced rentals that the model doesn’t really predict. However, let’s use it to find the most over- and under-priced listings (according to the model).

Above you can see the most expensive listing compared to its predicted price, and the link to it on AirBnB. All I can say is wow, it looks incredible. Pictures are something that could never be in the model.

Above you can see a summary of the 10 most undervalued listings. They are all located in Manhattan and seem to be split between the Tribeca district and Midtown. That’s it for today’s blog, slightly longer than normal, but I hope you enjoyed it.

theparttimeanalyst.com/2019/07/10/predicting-f1-qualifying/

Today I am going to dissect the model to understand its strengths and weaknesses and to see whether there is any bias within it. First let’s look at the importance matrix.

The most important variable is the fastest time produced in practice 2. This is no surprise: the plan that most teams follow is to prepare for qualifying in practice 2, so practice 2 times were always going to be important for predicting qualifying. It's also not surprising that track length is important, as that will always be a key driver of the final lap time.
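I don't have the original model code to hand, so here is the same idea sketched in Python, using scikit-learn's gradient boosting as a stand-in for xgboost, on simulated data where the practice 2 time drives the target (all variable names are assumptions):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)

# Simulated qualifying data: fp2_time dominates the target, track_length
# matters less, and noise_feature is irrelevant.
fp2_time = rng.normal(90, 2, size=300)
track_length = rng.normal(5.0, 0.5, size=300)
noise_feature = rng.normal(size=300)
quali_time = fp2_time * 0.98 + track_length * 0.5 + rng.normal(scale=0.2, size=300)

X = np.column_stack([fp2_time, track_length, noise_feature])
model = GradientBoostingRegressor(random_state=0).fit(X, quali_time)

# feature_importances_ plays the role of xgboost's importance matrix.
names = ["fp2_time", "track_length", "noise"]
top = names[int(np.argmax(model.feature_importances_))]
print(top)  # → fp2_time
```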

Looking at the RMSE by race, there are a few races where the model has been really accurate, but others where it has struggled. The worst is the British Grand Prix, which has the highest RMSE. I think that's because some of the training data includes wet sessions: the model has no weather variable, and wet running clearly affects lap times, so this is a likely source of error. Other than that, most races are predicted pretty well.

RMSE by team shows some differences across the grid, with Williams being particularly far off compared to the other teams. Let's look at whether the error is mostly in a particular direction for Williams.
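Computing RMSE per team is a simple group-by; a Python sketch on hypothetical prediction data (team names aside, the columns and numbers are made up):

```python
import pandas as pd

# Hypothetical per-lap predictions for three teams.
results = pd.DataFrame({
    "team": ["Williams", "Williams", "Mercedes", "Mercedes", "Renault", "Renault"],
    "actual": [92.0, 93.5, 88.0, 88.2, 89.0, 89.5],
    "predicted": [90.5, 92.0, 88.1, 88.3, 89.2, 89.4],
})

# RMSE per team: mean squared error within each group, then square root.
errs = results.assign(sq_err=(results["actual"] - results["predicted"]) ** 2)
rmse_by_team = errs.groupby("team")["sq_err"].mean() ** 0.5
print(rmse_by_team.idxmax())  # → Williams
```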

When I plot, for each driver at each race weekend, the difference between the predicted qualifying time and the actual qualifying time, I can see that most of the time the model underestimates the driver's final qualifying time. For most races the prediction is within half a second. There are a few outliers, mainly Williams, which is probably why Williams has the highest RMSE. Having looked at the model on real data, let's try to understand it further by feeding it fake data and seeing what results come out. The first thing to look at is the effect of team, so I provided the model with data where the only differentiation is the team.

With all cars having the same data apart from the team, Renault comes out as the fastest team, with McLaren a lot slower than everyone else. I wonder if this is because historically McLaren over-performed in practice 2 and often went backwards as the weekend went on.
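The probing technique above can be sketched as follows. This is a Python toy, not the original xgboost model: a linear model with team dummies is trained on invented data, then asked to predict identical rows where only the team changes.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy training data: qualifying time is FP2 time plus a per-team offset
# (teams are real, every number is invented for illustration).
train = pd.DataFrame({
    "fp2_time": [90.0, 90.5, 91.0, 91.5, 92.0, 92.5],
    "team": ["Renault", "Renault", "McLaren", "McLaren", "Mercedes", "Mercedes"],
    "quali_time": [89.0, 89.5, 91.5, 92.0, 91.5, 92.0],
})

X = pd.get_dummies(train[["fp2_time", "team"]])
model = LinearRegression().fit(X, train["quali_time"])

# Probe: identical fake rows where the only differentiation is the team.
probe = pd.DataFrame({"fp2_time": [91.0] * 3,
                      "team": ["Renault", "McLaren", "Mercedes"]})
probe_X = pd.get_dummies(probe).reindex(columns=X.columns, fill_value=0)
probe["predicted"] = model.predict(probe_X)

# The team with the lowest predicted time on identical inputs is "fastest".
fastest = probe.sort_values("predicted").iloc[0]["team"]
print(fastest)  # → Renault
```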

Ultimately this model's judgement comes when compared to another type of model on the same data. To do that I have created a Bayesian generalised linear model, which I will compare to the machine learning xgboost model. All the code for this model can be found on my GitHub.

- Jennings
- Burns
- Denly
- Root
- Stokes
- Butler
- Bairstow
- Ali
- Wood
- Broad
- Anderson

First things first: Jennings is out. He has had 32 test innings now but averages only 25.19, far below what you would expect from a top-quality opening batsman. His record so far this year in the county championship is no better: 260 runs at 23.63. It's just not good enough, so he's not picked.

Moving onto Burns, who in 12 test innings has averaged just 25 with a high score of 84. Not brilliant; however, in the county championship this year he has averaged 37 with 1 hundred and 2 fifties. I would select him for the first Ashes test for some continuity; you can't keep chopping and changing.

For the other opening spot I think there are 2 options: Jason Roy and Dominic Sibley. Jason Roy looks to be the player they will likely choose, but this comes with a tinge of caution. In the last year Roy has only played 2 games of first-class cricket, with 3 innings in total; he did pretty well, but he didn't open. Since his debut I can only find two innings in first-class cricket in which he opened. The positive for Roy is that he's in excellent form, as shown by his cricket in the world cup. Still, selecting Roy as opener is a risk due to his limited opening experience in first-class cricket.

The other option is Dominic Sibley, who in the county championship this season has scored over 1000 runs at an average of 63.47. He's clearly in great form, and one of the selection adages is to always pick people in form. For now, though, I would leave him as the next cab off the rank; if Burns has a bad start to the series, Sibley could be an option for the 4th or 5th test.

The current holder of this spot is Joe Denly. He has only played two test matches, opening in one and batting at number 3 in the other, so on that small sample it's hard to judge his test career so far. In the county championship this year he has had 11 innings, scoring 504 runs at 56. That's really good form.

The other option at number 3 is Jonny Bairstow, who did bat there during the winter tours of Sri Lanka and the West Indies. His stats show he did OK, and the big benefit of playing Bairstow at number 3 is that it frees up another spot further down the order. We will get to that later.

I think the middle order is universally agreed: Joe Root, Ben Stokes and Jos Butler.

As far as I see it there are 2 options at number 7. If you are playing Joe Denly, then Bairstow has to bat at 7, or higher if he is swapped with Stokes or Butler. If instead you bat Bairstow at 3, you have a bonus spot to fill. The player I suggest would be ideal here is Sam Curran. He had an incredible summer last year, winning the man of the match award in the first test against India. He struggled in the West Indies during the winter, but his form this season has been excellent: in the county championship he has averaged 33, scoring 301 runs in 9 innings, while his bowling has been even more impressive, taking 24 wickets at 22.45. In a straight fight between Curran and Denly I would pick Curran. Being only 20, Curran is also only going to improve.

Unless there are any dust bowls during the Ashes I fully expect England to play only one spinner. I think there are 3 options:

**Moeen Ali** – The current holder of the shirt and now a veteran of 58 tests who is useful with both bat and ball. However, if world cup form is to be used for test selection (as the case for Roy is built on), then Moeen is in really poor form, averaging only 18 in the games he played and taking only 5 wickets. His form with the bat in test matches has also been declining, and his bowling record against Australia is really poor: in 10 matches he has taken only 17 wickets at 65.94. I therefore think it's worth looking at other options.

**Jack Leach** – My pick for the spinner slot in this England team. So far he has had a short test career of 4 matches, some of them in the spin-friendly conditions of Sri Lanka. His stats from that series were excellent, and so far this year in the county championship he has taken 34 wickets at 21.97. He's clearly in great form and deserves his chance to be England's spinner in the Ashes.

There are 3 spots left for fast bowlers in this team. I think it's a given that Jimmy Anderson will be one of them. Will this be his final series? That's a discussion for another day. For the two remaining spots there are probably 5 bowlers competing: Stuart Broad, Chris Woakes, Jofra Archer, Mark Wood and Olly Stone. Mark Wood can be discounted straight away, as he has a side strain and will not be available in the early part of the Ashes.

**Chris Woakes** – His test record is really interesting. If this series were away from home he wouldn't be in the equation; the comparison between his away and home records is like night and day. At home, though, he is really good, taking his wickets at 23.33. He is also a very handy batsman, so he would add to the batting strength.

**Stuart Broad** – Not much needs to be said about one of England's best ever bowlers. Is there a sign of decline? Maybe slightly: his bowling average in the last few months has been in the 28-30 range, when at his peak he was operating around the mid-twenties mark.

**Jofra Archer** – Based on his form in the world cup, the fact he's a different type of bowler than the others mentioned, and an excellent first-class record (131 wickets at 23.44 in 28 matches), he takes one spot next to Jimmy Anderson.

**Olly Stone** – Seems to be liked by the selectors, but if the pace of Archer is already in the team I'm not sure he's better than Broad or Woakes. He's also not in great form: so far this season in the county championship he has taken only 7 wickets at 38.57.

For the final selection, between Broad and Woakes, I'm picking Woakes due to the extra batting and his home record, particularly at Edgbaston and Lord's. My team would therefore look like this:

- Burns
- Roy
- Bairstow
- Root
- Stokes
- Butler
- Curran
- Woakes
- Leach
- Archer
- Anderson

Let me know your thoughts. Do you agree or disagree with my selection?
