## Finding the Next James Anderson

Hello, and welcome to today’s blog, in which we will be scouting for the next James Anderson. Possibly. The ideas behind this blog are nothing new; in fact, I have borrowed them from another sport. The Rangers Report blog wrote a piece about scouting for the best youth players using age-adjusted stats, an idea based on Vollman’s work on ice hockey. I always like to say that a lot of the best ideas are re-purposed from other areas! So how am I going to apply it? We are going to take the bowling stats from the Second XI County Championship, age-adjust them, and see which bowler under the age of 21 looks the most promising.

Above are the first 20 rows of the 350 bowlers we are going to be looking at. The top wicket-taker is Nijjar from Essex; however, he is 24 and therefore will not be included. The first under-21 player on the top wicket list is 20-year-old Mike of Leicestershire. So let’s adjust the data and see where we end up.

The first issue is that the players have all bowled differing numbers of balls, so I need to get everyone onto the same footing. To do that, I took the strike rate for the balls each bowler actually delivered and extended it to a full season of, say, 1,500 balls. In the table below, the estwicket column is the number of wickets a bowler would take if his strike rate continued over a full 1,500 balls.

However, a bowler who has taken a lot of wickets in a small number of balls is unlikely to sustain that rate to the end of a season (sorry, Ben Coad). Regression to the mean is a well-established statistical phenomenon, so when extrapolating performance we need to account for it.

In the Rangers Report blog, they applied 1% regression for every match of projected performance. I can’t do that, as my analysis is based on balls, and in multi-day matches bowlers bowl different numbers of balls. Therefore I decided to apply the same 1% regression for every 100 extrapolated balls.
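For anyone who wants to play with the idea, here is a minimal Python sketch of the two adjustments, with the 1,500-ball season and the 1%-per-100-balls rule taken from the text above (the function names and the worked example are mine, not from the original analysis):

```python
def estimated_wickets(wickets, balls, full_season=1500):
    """Extrapolate a bowler's wicket tally to a full season of balls,
    assuming their strike rate (balls per wicket) holds."""
    strike_rate = balls / wickets
    return full_season / strike_rate

def regress_to_mean(est_wickets, balls, full_season=1500):
    """Apply 1% regression to the mean for every 100 extrapolated balls."""
    extrapolated = max(full_season - balls, 0)
    reduction = 0.01 * (extrapolated / 100)
    return est_wickets * (1 - reduction)

# A bowler with 20 wickets in 500 balls (strike rate 25):
est = estimated_wickets(20, 500)   # extrapolates to 60 wickets over 1500 balls
adj = regress_to_mean(est, 500)    # 1000 extra balls -> 10% reduction -> 54
```

A bowler who has already bowled a near-full season barely gets regressed, while someone extrapolated from a handful of overs loses a big chunk, which is exactly the behaviour we want.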

The table to the side shows how much this affects the bowlers on the list, with most bowlers seeing a reduction in wickets.

Now the final step is to adjust the total wickets based on the player’s age. The first part is to filter for all players 21 and younger and then apply Vollman’s age curve below.

The age curve is by year and month; however, I only have ages in whole years, so I will just be using the numbers in the first column. The graph below shows the results.
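To make the age step concrete, here is a hedged sketch. The multipliers below are placeholders I have invented purely for illustration; they are not Vollman’s actual curve values:

```python
# Hypothetical age multipliers (NOT Vollman's real curve) -- the idea is
# that younger players get a bigger boost because they are further from peak.
AGE_FACTOR = {17: 1.5, 18: 1.4, 19: 1.3, 20: 1.2, 21: 1.1}

def age_adjusted_wickets(regressed_wickets, age):
    """Scale a regressed wicket estimate by the age multiplier;
    players over 21 are filtered out of this analysis entirely."""
    if age not in AGE_FACTOR:
        raise ValueError("only players aged 21 and under are included")
    return regressed_wickets * AGE_FACTOR[age]
```

So a 20-year-old on 50 regressed wickets would be credited with 60 age-adjusted wickets under these illustrative factors.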

Based on this method, Szymanski is the best bowler in the 2nd XI County Championship. However, the large dot indicates that we had to extrapolate a lot for him. Another interesting point is that three of the top five estimated wicket-takers are left-arm spinners. The England team is badly missing a reliable spinner, particularly away from home; could one of the three be the future England spinner?

There is lots more work that can be done with this data, such as looking back historically at how these numbers relate to future County Championship averages. A similar model could also be applied to batsmen, which will be detailed in a future blog. Let me know your thoughts: have you seen Szymanski bowl? Ideally, I would have preferred even younger players, but I think they play in the Under-17 county championship.

## Does the Dog Get Adopted?? — P2

Today we are going to look into the second part of creating a classification tree for the outcomes of dogs in the Dallas animal shelter. Today it’s the exciting stuff: building the actual classification tree. If you want to understand how I prepared the data, go and check out the first blog, where I go into the data preparation in detail.

As previously mentioned, we are using the classification tree method, and the columns we will initially base the outcome on are intake_type, intake_condition, chip_status, animal_origin and pedigree. I am using the rpart package to create the classification trees and will split the data so there is 75% to train the tree on and 25% to test it on.
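As a rough companion to the R/rpart code shown in the post, here is a scikit-learn sketch of the same 75/25 split and tree fit. The toy data frame and its values are mine; only the column names come from the post:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the shelter data -- the real analysis uses the Dallas
# animal-shelter columns and is done in R with rpart.
data = pd.DataFrame({
    "intake_type":      ["stray", "owner", "stray", "owner"] * 25,
    "intake_condition": ["healthy", "treatable", "unhealthy", "healthy"] * 25,
    "chip_status":      ["chipped", "none", "none", "chipped"] * 25,
    "adopted":          [1, 1, 0, 1] * 25,
})

# Trees need numeric inputs, so encode the categorical columns
X = OrdinalEncoder().fit_transform(data.drop(columns="adopted"))
y = data["adopted"]

# 75/25 train/test split, mirroring the proportions in the text
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
accuracy = tree.score(X_test, y_test)
```

Scoring on the held-out 25% is what lets us judge whether the tree generalises rather than just memorising the training dogs.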

Above you can see the code and resulting classification tree for the first model. One thing immediately obvious is that it is highly complex and possibly overfitted to the data. Let’s check how this tree performs:

Well, that’s not great at all. This classification tree seems barely better than random chance! That really isn’t ideal and means the model is currently pretty much worthless. Let’s have a look at what we can do to improve it.

The first thing I am going to do is look at the intake_condition column.

There are seven different categories within this column; however, I think they can be simplified into Healthy, Treatable and Unhealthy. So let’s do this and check the results:
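The collapse from seven categories to three can be expressed as a simple lookup. The raw category spellings below are my guesses at the Dallas data, so treat them as placeholders:

```python
# Hypothetical spellings of the seven raw intake conditions -- the real
# column values may differ; the point is the three-bucket collapse.
SIMPLIFY = {
    "healthy":                 "Healthy",
    "treatable manageable":    "Treatable",
    "treatable rehabilitable": "Treatable",
    "treatable unknown":       "Treatable",
    "unhealthy untreatable":   "Unhealthy",
    "unhealthy contagious":    "Unhealthy",
    "unknown":                 "Unhealthy",
}

def simplify_condition(raw):
    """Map a raw intake_condition value to one of the three buckets."""
    return SIMPLIFY[raw.strip().lower()]
```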

Success: a much simpler tree. However:

The accuracy of the model has gone down! It is now less accurate than random chance; I am actually just wasting my time here. Let’s take a step back and look at the table showing the predictions against the outcomes. As you can see, the model is currently predicting lots of dogs which died or were not adopted as being adopted.

I reviewed the composition of each column in the data frame after filtering for dogs predicted to be adopted whose actual outcome was death. The biggest difference was seen in the condition column. Apparently, a lot of the dogs that died were classed as treatable; how can that be?

I went back to the original dataset and filtered for the dogs classed as treatable, making no change to the outcome_type column, as I guessed this could be where the problem lay. The graph above looks at the outcomes of those treatable dogs. Clearly, a lot of them were euthanised, which is probably where the confusion comes from: euthanised dogs are classed as dying, while normal logic says a treatable dog should survive. This is interesting, as it highlights how decisions made at the start of an analysis can affect it later on. The next question is whether there is another column in the data frame that can be used to identify euthanisation.

I think I found it with the kennel_status column. By far the most common kennel for the euthanised dogs is the lab. Therefore we are going to add kennel status to the analysis and see where it goes:

Success: the model is now much better at predicting the outcome for each dog at the shelter. However, the classification tree is back to being over-complex and could be overfitting the training data. Next, I will see if this tree can be pruned.

Above you can see the complexity plot for the overly complex classification tree. The tree doesn’t improve much beyond 4–7 levels and a complexity of around 0.00075. Pruning can be done either before or after the model is created; here I am going to prune before running the model, so I will run the fourth and hopefully final version of the model.

Above you can see the final classification tree and the code used to create it. In my call to rpart, I used the control argument to limit the complexity to 0.00075, based on the complexity plot, and the maximum depth to 5. This produced a much less complex tree, with performance similar to the previous, complex tree.
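For anyone following along in Python rather than R: rpart’s cp and maxdepth controls map loosely onto scikit-learn’s ccp_alpha and max_depth. The scales are not identical, so treat the 0.00075 here as illustrative rather than a direct translation, and the random toy data is mine:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Random toy data standing in for the encoded shelter columns
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 4))
y = rng.integers(0, 2, size=200)

# Cost-complexity pruning plus a hard depth cap, roughly mirroring
# rpart.control(cp = 0.00075, maxdepth = 5)
pruned = DecisionTreeClassifier(ccp_alpha=0.00075, max_depth=5,
                                random_state=42).fit(X, y)
```

Capping the depth at 5 guarantees the tree stays readable even if the complexity penalty alone would not prune it enough.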

This could be further developed with more data: does the sex of the dog have an effect on the results, or the size or type of dog? Small dogs could be more likely to be adopted, and certain types more likely to be euthanised. It could also be built on further with a random forest model. Thanks for reading; well done if you got to the end, as it’s a bit longer than I normally aim for. Please let me know your thoughts, or if there is anything I have missed or could have included.

## F1 Circuit Cluster Analysis – 3

Hello and welcome to what was meant to be the final F1 circuit cluster analysis blog; however, I have thought of some ideas that extend it to a fourth, so we shall see how that goes. Today we will review what we can from the season so far, and in the next one we will look at some methods for predicting how the rest of the season will pan out. That one might not arrive until the summer break.

Above is a summary of the season, with each circuit coloured by the cluster it belongs to. Circuits have been clustered using hierarchical clustering; please see the other blogs in the series for the method used. The tracks in clusters 1 and 2 are pretty evenly distributed across the season. What is interesting is that the two wildcard tracks, which don’t belong to any of the other three clusters, are still to come. Could they prove crucial in the fight for the title?

Above you can see the pace difference for each cluster, with 0 being the fastest car in each cluster and around 3% the gap to the slowest car. The first thing to take away is that in clusters 1 and 2 Mercedes and Ferrari are neck and neck, with Mercedes only slightly quicker overall. Red Bull get better the more low-speed corners and the fewer straights there are, highlighting the car’s engine weakness; cluster 3, the slow, twisty circuits, is their forte. They must be looking forward to Hungary next. Elsewhere, apart from the top three, one of the big stories is Haas. Their car looks well suited to the fast, flowing circuits of cluster 1 but is the slowest at the stop-start circuits with short straights. That is clearly a car whose strengths are high-speed downforce and engine power. The gap between the top three teams and the rest is pretty consistent across all the clusters.

Finally, we look at how the drivers rank across the different clusters. Some interesting points are apparent. In cluster 1, Hamilton seems to have a clear advantage: it’s close, but on average he is clearly faster than the other drivers. There is also a significant difference between Hamilton and his teammate Bottas, suggesting fast, twisty circuits could be Bottas’s weakness. Compare that with cluster 2, where Hamilton, Vettel and Bottas are very evenly matched. At Ferrari, Raikkonen is a lot closer to Vettel on the fast, twisty circuits than in cluster 2, which has much bigger braking zones and slower corners. The opposite pattern is seen at Red Bull: Ricciardo is a good distance behind Verstappen on the fast, twisty circuits, but is actually slightly faster on the slower ones. Elsewhere, Alonso has a clear advantage over Vandoorne on both types of circuit.

So that’s it for today’s blog. I am going to put the R code and the spreadsheet for this on GitHub, so if you have any further ideas for what can be done with this dataset, I’d love to see what you come up with. There will be a fourth part in this series, where we look at historical trends and then at forecasting the future.

## Tidy Tuesday 2 – World Life Expectancy

Hello, welcome to today’s blog, my second covering the Tidy Tuesday dataset. This week it was a dataset with the life expectancy of every country in the world since 1950. I decided you could do some cluster analysis on it and then, once you have the clusters, analyse them further to understand trends. We are going to use K-means clustering to group the countries together and then look for trends and differences between the clusters. The dataset has country, year (between 1950 and 2015) and the life expectancy for that year. To do clustering you need at least two measures, so I created one: the change in life expectancy per year. The other measure is the life expectancy in 2015.
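Here is a sketch of how those two per-country measures could be built. This is a Python/pandas version with a tiny two-country toy table of my own; the real Tidy Tuesday data has one row per country-year from 1950 to 2015:

```python
import pandas as pd

# Toy stand-in for the Tidy Tuesday data: one row per country-year
df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year":    [1950, 2015, 1950, 2015],
    "life_expectancy": [45.0, 75.0, 60.0, 82.0],
})

# Two features per country: average change per year over the 65-year span,
# and the most recent (2015) life expectancy.
features = (
    df.sort_values("year")
      .groupby("country")["life_expectancy"]
      .agg(change_per_year=lambda s: (s.iloc[-1] - s.iloc[0]) / 65,
           latest="last")
)
```

These two columns are then exactly the kind of two-measure table that K-means can be run on.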

To find our value of K, I produced the silhouette plot below. You’re meant to use the value of K with the highest average silhouette width, which in this case would be 3. However, with so many different countries, I feel that would group them up too much. There are further spikes at 6 and 10.

I produced the cluster plot for both k values, 6 and 10. The plot for 10 is below.

Ten seems like a good value: there are not too many clusters to deal with, but there is still good variation between them. We will take k equal to 10 for further analysis.

The comparison above looks at causes of death, grouped up to give the mean for each cause in each cluster. Conclusions that can be made:

• Cancer is prevalent across all clusters; however, the higher the life expectancy, the more prevalent it is. This could be because you are more likely to get cancer at older ages.
• Dementia is another cause which seems to increase with higher life expectancy.
• HIV is highest in the two lowest life-expectancy clusters, as are neonatal deaths.
• Finally, road accidents are an interesting cause: by far the highest cluster is cluster 7, which seems to be the cluster with the biggest increase in life expectancy over the last 65 years. Could this be because these are fast-developing nations that do not yet have safe road infrastructure in place?

That’s it for this little intro into reviewing the data this way. Let me know your thoughts and comments. There are lots of datasets on the World Health Organisation’s website, as well as other datasets, such as economic growth, that I can add to this analysis to develop it further.

## F1 Circuit Cluster Analysis – Part 2

Hello and welcome to the second part of my mini-series using cluster analysis to categorise Formula 1 circuits. Please go and check out the first part; it outlines the basic data we are using to categorise the circuits and gives an overview of the hierarchical clustering method. Today we are going to use K-means clustering.

For K-means clustering we have to set our own value for K, and we are going to do that with two different types of analysis: an elbow plot and silhouette analysis.

Elbow Plot

The code below is what was used to generate the elbow plot, and the resulting plot follows:
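The post’s elbow-plot code is in R; a Python sketch of the same idea, on made-up circuit-like data of my own, looks like this:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the circuit metrics (e.g. average speed, average
# straight length): three loose groups of seven points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.5, size=(7, 2)) for loc in (0, 5, 10)])

# Total within-cluster sum of squares (inertia) for each candidate k;
# the "elbow" is where adding more clusters stops paying off.
wss = [KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_
       for k in range(1, 9)]
```

Plotting `wss` against k gives the elbow plot; you then read off the k where the curve flattens out.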

Reviewing the elbow plot, it looks like we are already seeing a slightly different number of clusters than we got from hierarchical clustering. The elbow looks to be at 3, though you could also argue for 4 as the value of k.

Silhouette Analysis

The other way to decide on a k value for k-means clustering is to produce a silhouette graph. This takes every point in the analysis and rates how well it fits its cluster, with -1 meaning it doesn’t fit at all and 1 meaning it fits well. You then compute the average silhouette width for each value of k, and the k with the highest value is the one to choose. I have put a picture of the code below, along with the silhouette graph produced.
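As with the elbow plot, the original code is R; the loop below is a Python sketch of the same selection procedure, again on made-up data of my own:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated toy clusters standing in for the circuit metrics
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc, 0.4, size=(8, 2)) for loc in (0, 4, 8)])

# Average silhouette width for each candidate k; pick the k that
# maximises it (silhouette needs at least 2 clusters).
scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                        random_state=2).fit_predict(X))
          for k in range(2, 8)}
best_k = max(scores, key=scores.get)
```

On this toy data the score peaks at k = 3, matching how well separated the three generated groups are; on real data there can be several local peaks, as the post finds.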

Fascinatingly, there are two high points: one at k = 9 and another at k = 3. I am going to choose k = 3, as this aligns closely with what we saw in the elbow plot, and nine clusters are just too many to deal with.

The graph above shows all the circuits on the calendar plotted by average straight length and average speed, coloured by the cluster they have been put in. I am a bit unsatisfied with this; I feel it doesn’t quite fit the different circuits on the calendar. For instance, Singapore is different from China and Germany. Therefore K-means is not the clustering I will use in the final blog to look at pace trends across the season. Look out for that blog, where we will look at the pace at all circuits so far for all the teams, along with some other metrics like overtakes and pit stops.

## World Cup Group F

Hello, and welcome to the preview of Group F at the World Cup. Thanks for all the support so far on these previews; I would love to hear people’s thoughts and predictions on the competition. Today we will be looking at Group F, which contains Germany, Mexico, South Korea and Sweden.

The first thing to look at is the age distribution of all four teams. Germany seems to have one of the younger squads in the tournament, with a relatively small spread between the youngest and oldest players. Mexico has players from their early 20s all the way up to near 40. South Korea and Sweden have the same median ages, but South Korea has more players clustered around the median and the fewest players over 30.

Mexico has what looks to be the most experienced squad, with most players having around 50 caps but some up to 150. Germany has a lot of players with relatively few caps, but also shows the trend we have seen with other squads of having a group of players with a lot of caps. I wonder if those players are of a similar age and could therefore be a golden generation. Sweden possibly has the most inexperienced squad in the group, with a lot of players on fewer than 50 caps.

On the face of it, Germany seems to have a small number of attackers in the squad. However, they have more midfielders, and a few of them are creative attacking midfielders, so I don’t think they will struggle for goals. South Korea has the same number of attackers as Germany but seems to have picked more defenders, which could leave them struggling to score.

Last but not least, we look at each team’s chances using the probabilities implied by the betting odds. No surprise really: Germany are big favourites to get out of the group. However, second place looks to be a realistic target for the other three teams, and it looks particularly close between Mexico and Sweden. They play each other in the last game of the group stage, so it could be a straight shoot-out for second place. Also, South Korea play Germany, who may have already qualified by then and may therefore make changes, which could give South Korea an outside chance if it goes to the final game.
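For anyone wanting to reproduce the implied-probability step, here is a small sketch. The odds values are hypothetical, not the actual bookmaker prices used for the charts:

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds to probabilities, normalising away the
    bookmaker's margin (the overround) so they sum to 1."""
    raw = [1 / o for o in decimal_odds]
    total = sum(raw)
    return [p / total for p in raw]

# Hypothetical group-winner odds for Germany, Mexico, Sweden, South Korea
probs = implied_probabilities([1.3, 6.0, 7.0, 15.0])
```

The shortest-priced team always ends up with the largest implied probability, so the ordering of the teams matches the market even after the margin is stripped out.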

That’s it for today’s look at Group F. Please let me know your thoughts; I would love to start a good debate. Also, check out the other blogs in the series.

## World Cup Group D

Group D is next on the agenda in this series previewing the World Cup. This is part of a series looking at all the groups, so please take a look at the others and let me know your thoughts. Group D contains Argentina, Croatia, Iceland and Nigeria. So let’s take a look at the age make-up of the four squads.

Argentina has one of the oldest median ages we have seen so far, at about 30. Does this mean it is this squad’s last opportunity to win the World Cup? Lionel Messi will not be around forever, and this is probably his last chance. Both Croatia and Iceland have similar medians, around the area we have seen for most squads so far in this preview series. Nigeria has a relatively young median age; interestingly, however, they have more players over 30 than Croatia and Iceland.

There seems to be a clear link between Nigeria’s relatively young squad and their having the fewest caps. Croatia seems to have a relatively experienced squad, with most players having more than 25 caps; this should stand them in good stead if experience is a key attribute of any good squad. Argentina’s caps seem to be evenly distributed across the whole range, and they also seem to have the most players above 100 caps.

Next, we review squad composition for the four teams in Group D. All the teams in this group have varying numbers in the different departments. What’s surprising is that Argentina seem to have the fewest attackers; however, the attackers they do have are all world class, and it’s going to be difficult to fit them all into the team. Croatia seem to have the most defenders, which could mean they are strong defensively. Iceland and Nigeria have a similar make-up, with only a slight difference in attackers and defenders.

Finally, we look at the chances of each team in the tournament based on implied odds. As you can see, this group is expected to be pretty easy for Croatia and Argentina. Iceland and Nigeria look quite evenly matched but are not expected to have much impact on the group. Looking at the chances of winning the competition, Argentina are unsurprisingly one of the big favourites. However, Croatia also look to have a good outside chance, so it will be interesting to see how they do.

That’s it for today’s Group D overview. Please let me know your thoughts in the comments below and check out the other previews.