Finding the Next James Anderson

Hello, welcome to today’s blog, in which we will be scouting for the next James Anderson. Possibly. The ideas behind this blog are nothing new; in fact, I have stolen the idea from another sport. The Rangers Report blog wrote a piece about scouting the best youth players using age-adjusted stats, an idea based on Vollman’s work on ice hockey. I always like to say a lot of the best ideas are re-purposed from other areas! So how am I going to apply it? Well, we are going to take the bowling stats from the Second XI County Championship, age-adjust them, and see which bowler under the age of 21 looks the most promising.

[Table: bowling data, first 20 rows]

Above are the first 20 rows for the 350 bowlers we are going to be looking at. The top wicket-taker is Nijjar from Essex; however, he is 24 and therefore will not be included. The first under-21 player on the top wicket list is 20-year-old Mike of Leicestershire. So let’s adjust the data and see where we end up.

The first issue is that the players have all bowled different numbers of balls, so I need to get them onto a level footing. To do that, I took each bowler’s strike rate over the balls they actually bowled and extended it to a full season of, say, 1500 balls. In the table below, the estwicket column is the number of wickets a bowler would take if his strike rate continued over a full 1500 balls.

[Table: estimated wickets over a full 1500 balls]
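As a rough sketch of the calculation in R (the rows and column names below are my own stand-ins, not the real data):

```r
library(dplyr)

# Toy stand-in for the real table; names and numbers are invented
bowlers <- tibble(
  bowler  = c("Bowler A", "Bowler B"),
  age     = c(24, 20),
  balls   = c(1200, 450),
  wickets = c(40, 20)
)

bowlers <- bowlers %>%
  mutate(
    strike_rate = balls / wickets,    # balls per wicket
    estwicket   = 1500 / strike_rate  # wickets if that rate held for 1500 balls
  )
```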

However, a bowler who has taken a lot of wickets in a small number of balls is unlikely to sustain that rate to the end of a season (sorry, Ben Coad). Regression to the mean is a well-established statistical phenomenon, so when we extrapolate performance we need to account for it.

In the Rangers Report blog, they applied 1% regression for every match they projected performance for. I can’t do that, as my analysis is based on balls, and in multi-day matches bowlers will bowl different numbers of balls. Therefore I decided to apply the same 1% regression for every 100 extrapolated balls.
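Continuing the sketch above, my reading of that rule in code (the exact form of the adjustment is my interpretation):

```r
bowlers <- bowlers %>%
  mutate(
    extra_balls = pmax(1500 - balls, 0),         # balls being extrapolated
    regression  = 0.01 * extra_balls / 100,      # 1% per 100 extrapolated balls
    reg_wickets = estwicket * (1 - regression)   # regressed wicket estimate
  )
```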

[Table: regressed wicket estimates]

The table above shows how much this affects the bowlers on the list, with most bowlers seeing a reduction in estimated wickets.

Now the final thing is to adjust the total wickets based on each player’s age. The first step is to filter for all players 21 and younger and then apply Vollman’s age curve below.

[Figure: Vollman’s age curve]

The age curve is broken down by year and month; however, I only have ages in whole years, so I will just be using the numbers in the first column. The graph below shows the results.

[Figure: age-adjusted wicket estimates for under-21 bowlers, with dot size showing the amount of extrapolation]
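For completeness, the adjustment step itself can be sketched as a simple lookup and join. The multiplier values below are placeholders only, not the numbers from Vollman’s actual curve:

```r
# Placeholder multipliers; NOT Vollman's real curve values
age_curve <- tibble(
  age        = 17:21,
  multiplier = c(1.25, 1.20, 1.15, 1.10, 1.05)
)

young_bowlers <- bowlers %>%
  filter(age <= 21) %>%
  left_join(age_curve, by = "age") %>%
  mutate(adj_wickets = reg_wickets * multiplier)
```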

Based on this method, Szymanski is the best bowler in the 2nd XI County Championship, although the large dot means we had to extrapolate a lot for him. Another interesting point is that three of the top five estimated wicket-takers are left-arm spinners. The England team is badly missing a reliable spinner, particularly away from home; could one of the three be the future England spinner?

There is lots more work that can be done with this data, such as looking back historically at how these numbers relate to future County Championship averages, or applying a similar model to batsmen, which will be detailed in a future blog. Let me know your thoughts: have you seen Szymanski bowl? Ideally, I would have preferred younger players, but I think the youngest play in the Under-17 County Championship.

 

 


Biketown EDA – P2

Hello, welcome to the second part of this blog doing exploratory data analysis on the Biketown dataset. If you haven’t read the first part, go check it out. As an overview: we found that most of the records were either subscribers to the system or casual users who might just use it every now and then, so I decided to compare those two groups. We saw in the smaller labelled sample that most of the casual users used the system for recreation and the subscribers used it mostly for commuting. Today we are going to look at the distances and speeds the groups travel and where they rent the bikes from, but first we are going to look at when they rent the bikes:

[Figure: rentals by hour of day for each payment plan]

This graph totally fits with subscribers tending to rent the bikes for commuting, as there are two large spikes at around 8am and around 5pm, the peak hours for people going to and from work. Casual users don’t tend to use the system in the morning; however, there’s consistent usage throughout the afternoon, which could be tourists exploring the city. It’s also strange that both groups don’t reach their minimum until well after 2am. Are these people using the system to get home from their nights out? Beware of the drunk Portland cyclist!
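For anyone curious, a plot along these lines can be built by extracting the hour from each trip’s start time. A minimal sketch, where the column names and toy rows are my assumptions:

```r
library(dplyr)
library(ggplot2)
library(lubridate)

# Toy stand-in for the real data; column names are assumptions
trips <- tibble(
  payment_plan = c("Subscriber", "Subscriber", "Casual"),
  start_time   = ymd_hms(c("2017-07-03 08:05:00",
                           "2017-07-03 17:40:00",
                           "2017-07-03 14:20:00"))
)

trips %>%
  mutate(start_hour = hour(start_time)) %>%   # hour of day for each rental
  count(payment_plan, start_hour) %>%
  ggplot(aes(start_hour, n, fill = payment_plan)) +
  geom_col(position = "dodge") +
  labs(x = "Hour of day", y = "Rentals", fill = "Payment plan")
```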

[Figure: density plot of trip distances by payment plan]

The density plot of the distances the two groups travel is interesting, but I think it fits the pattern so far. Subscribers tend to make shorter journeys, perhaps covering the last few miles to work, whereas casual users may be exploring the city and therefore cover more distance.

[Figure: speed vs distance curves by payment plan]

Now let’s check out the speed curves for both groups: the further a person travelled, the slower their speed. Subscribers are generally the faster riders at all distances. The increase in speed towards the 25-mile distance is, I think, down to the small amount of data out there. At the shorter distances subscribers have significantly faster speeds; are they using the system to get from A to B as quickly as possible?
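Speed isn’t recorded directly; it’s just distance over time. A minimal sketch, assuming a distance column in miles and a duration column in seconds (both assumptions about the real column names):

```r
library(dplyr)

# Assumed columns: distance in miles, duration in seconds
trips <- tibble(
  distance_miles   = c(1.2, 4.5),
  duration_seconds = c(540, 2700)
)

trips <- trips %>%
  mutate(speed_mph = distance_miles / (duration_seconds / 3600))
```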

[Figure: heat map of casual riders’ start locations]

Now let’s look at the start locations for the casual riders. The heat map above shows that most riders are clustered around the city centre, possibly moving from one tourist spot to another. There seems to be a fairly even distribution across the central locations.

[Figure: heat map of subscribers’ start locations]

The subscriber heat map above shows that, in general, people take the bikes from much further out compared to the casual riders, who are possibly just using them to get to the tourist hot spots. Once again most of the usage is on the left side of the river, which must be the area where people like to get about. Also, the start locations around the outside are much denser, so it’s clear people are using the system from further out to ride into the centre of the city.

That’s it for this exploratory data analysis. I think we have found some interesting insights, and at minimum been able to confirm what you would expect. I hope you have found this interesting and informative; let me know your thoughts, and if you have looked at this dataset yourself, share what you found. Check out the code on my GitHub, which should be linked somewhere on the website.

Nike Biketown – Exploratory Data Analysis

Hello, welcome to today’s blog, in which we are going to take a large dataset and do some exploratory data analysis on it. I am going to look at the Biketown dataset, which featured on Tidy Tuesday. If you’re new around here, Tidy Tuesday is a hashtag on Twitter that the R for Data Science online learning community actively promotes, with a new dataset every Tuesday. If you’re inspired to learn R and data science like I was, it is a really great community full of wonderful people to start with. I am not going to post the full code within the blog as I think it gets too long; however, the complete code used will be posted on my GitHub.
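The data comes as numerous CSV files, so the first step was to read them all in and stack them into one large data frame. A minimal sketch of that step (the path here is a placeholder):

```r
library(readr)
library(purrr)

# Placeholder path; point this at wherever the Biketown CSVs live
files <- list.files("data/biketown", pattern = "\\.csv$", full.names = TRUE)

# Read every file and stack the results into one data frame
biketown <- map_df(files, read_csv)
```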

[Figure: structure of the data frame]

Above is the structure of the resulting data frame. The second column, the payment plan, seems an interesting one: it has three constituents, Casual, Subscriber and blank. The system in Portland has a way for a regular user to automate payments to save time. Let’s look at how much of the dataset falls under each of the three payment types:

[Figure: counts of each payment plan]

As you can see, the vast majority of this dataset is either Casual or Subscriber, and I think it would be interesting to review the differences between the people on the two main payment plans. Therefore, going forward in the EDA, we are going to remove the entries with a blank payment plan. After this, we could possibly look at a method for working out which payment plan the blanks belong to. First things first, let’s have a look at what type of trips either group takes:

[Figures: trip types for casual and subscriber users]

The big issue here is that drawing any conclusion about the type of trips each group takes is going to be difficult, because each group has over 200 thousand entries but fewer than 1,000 trips with a recorded trip type. What we can say is that, in this smaller sample, subscribers tend to use the system for commuting and casual users clearly use it for recreation, which makes total sense.

[Figure: payment methods used by each group]

Now we look at the payment methods that both groups have used. By far the three main payment types for both groups are keypad, mobile and keypad_rfid_card, with subtle differences between the two groups. The RFID card share is clearly higher among subscribers, which must be because subscribers are given a card in order to gain access to the bikes. Also, casual users are much more likely to use their mobile to gain access. In both groups the vast majority use the keypad system.

That’s it for part 1 of this exploratory data analysis on the Biketown data. Tomorrow we will look at the distances the groups travel as well as locations and usage times. Let me know your comments on the first part.

Does the Dog Get Adopted?? — P2

Today we are going to look at the second part of creating the classification tree examining the outcomes of dogs in the Dallas animal shelter. Today it’s the exciting stuff: creating the actual classification tree. If you want to understand how I prepared the data, go and check out the first blog, where I go into the data preparation in detail.

As previously mentioned, we are using the classification tree method, and the columns we will initially base the outcome on are intake_type, intake_condition, chip_status, animal_origin and pedigree. I am using the rpart package to create the classification trees and will split the data so there’s 75% to train the tree on and 25% to test it on.
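In outline, that split and model call look something like the sketch below. The object names (dogs, train, test, tree1) are my assumptions; the actual code is in the screenshot that follows:

```r
library(rpart)

set.seed(42)  # make the split reproducible

# Assumes the prepared data frame from part 1 is called dogs
n         <- nrow(dogs)
train_idx <- sample(n, size = floor(0.75 * n))
train     <- dogs[train_idx, ]
test      <- dogs[-train_idx, ]

tree1 <- rpart(
  outcome_type ~ intake_type + intake_condition + chip_status +
    animal_origin + pedigree,
  data = train, method = "class"
)
```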

[Code: first rpart model]

[Figure: first classification tree]

Above you can see the code and resulting classification tree for the first model. One thing immediately obvious is that the tree is highly complex and possibly overfitted to the data. Let’s check how this tree performs:
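Using the assumed names from the sketch above, the performance check itself is roughly:

```r
# Predict classes for the held-out 25% and compare with reality
pred <- predict(tree1, newdata = test, type = "class")

table(predicted = pred, actual = test$outcome_type)  # confusion matrix
mean(pred == test$outcome_type)                      # overall accuracy
```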

[Table: confusion matrix and accuracy for the first model]

Well, that’s not great at all. This classification tree seems to be barely better than random chance! That really isn’t ideal and means the model is currently pretty much worthless. Let’s have a look at what we can do to improve it.

The first thing I am going to do is look at the intake_condition column:

[Table: intake_condition categories]

There are seven different categories within this column; however, I think these can be simplified into Healthy, Treatable and Unhealthy. So let’s do this and check the results:
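The grouping itself might look something like this; the label patterns are my guesses at the real category names:

```r
library(dplyr)

# The patterns below are guesses at the real category labels
dogs <- dogs %>%
  mutate(intake_condition = case_when(
    grepl("TREATABLE", intake_condition) ~ "Treatable",
    grepl("UNHEALTHY", intake_condition) ~ "Unhealthy",
    TRUE                                 ~ "Healthy"
  ))
```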

[Figure: simplified classification tree]

Success, a much simpler tree. However:

[Table: confusion matrix and accuracy for the simplified model]

The accuracy of the model has gone down! It is now less accurate than random chance; I am actually just wasting my time here. Let’s take a step back and look at the table, which shows the predictions against the outcomes. As you can see, the model is currently predicting that lots of dogs which died or were not adopted would be adopted.

[Table: column composition for dogs predicted to be adopted that actually died]

I reviewed the composition of each column in the data frame after filtering for dogs predicted to be adopted whose actual outcome was that they died. The biggest difference was seen in the condition column. Apparently, a lot of the dogs that died were treatable; how can that be?

[Figure: outcomes for dogs classed as treatable]

I took a step back, went back to the original dataset and filtered for the dogs classed as treatable, making no changes to the outcome_type column, as I guessed this could be where the problem was. The graph above looks at the outcomes of the dogs classed as treatable. There are clearly a lot of dogs euthanized, which is probably where the confusion is coming from: these are classed as having died, while normal logic says you would expect treatable dogs to survive. This is interesting, as it highlights how the decisions you make at the start of any analysis can affect it later on. The next question is whether there is another column in the data frame that can be used to identify euthanisation.

[Figure: kennel status for euthanized dogs]

I think I found it with the kennel_status column: by far the most common kennel for the euthanized dogs is the lab. Therefore we are going to add kennel status to the analysis and see where it goes:
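Adding the column is just one more term in the formula; sketched with the same assumed names as before:

```r
# Same assumed names as earlier, with kennel_status added to the formula
tree3 <- rpart(
  outcome_type ~ intake_type + intake_condition + chip_status +
    animal_origin + pedigree + kennel_status,
  data = train, method = "class"
)
```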

[Figure: classification tree including kennel_status]

 

[Table: confusion matrix and accuracy for the third model]

Success, the model is now much more successful in predicting the outcome for each dog at the shelter. However, the classification tree is back to being over-complex and could be overfitting the training data. Next, I will see if this tree can be pruned.

[Figure: complexity plot for the third model]

Above you can see the complexity plot for the overly complex classification tree. The tree doesn’t improve much beyond 4-7 levels and a complexity parameter of around 0.00075. Pruning can be done either before or after creating the model; here I am going to prune before fitting, so I am going to run the fourth and hopefully final version of the model.

[Figure: final pruned classification tree]

[Code: final rpart model with control argument]

Above you can see the final classification tree and the code used to create it. In my call to rpart, I have used the control argument, limiting the complexity parameter to 0.00075 based on the complexity plot and the max depth to 5. This has produced a much less complex tree, and performance was similar to the previous complex tree.
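In plain text, that call looks roughly like this (a sketch with the same assumed object names; the screenshot above has the actual code):

```r
# cp and maxdepth come straight from the complexity plot discussion above
tree4 <- rpart(
  outcome_type ~ intake_type + intake_condition + chip_status +
    animal_origin + pedigree + kennel_status,
  data    = train,
  method  = "class",
  control = rpart.control(cp = 0.00075, maxdepth = 5)
)
```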

This could be further developed with more data: does the sex of the dog have an effect on the results, or the size or type of dog? Small dogs could be more likely to be adopted, and certain types could be more likely to be euthanized. It could also be built on further with a random forest model. Thanks for reading, and well done if you got to the end; it’s a bit longer than what I normally aim for. Please let me know your thoughts, or if there is anything I have missed or could have included.

Does the Dog Get Adopted?? — P1

Hello, welcome to the next blog. I was inspired by this week’s Tidy Tuesday dataset. I’m sure I have said this before, but if you want to learn rstats it’s a great resource, with a weekly dataset to practice your burgeoning skills. This week’s data was from the Dallas open data project, and the particular dataset was from the Dallas animal shelter. I thought, wouldn’t it be great to create a model which, based on the information about an animal when it arrived at the shelter, could predict what might happen to it?

[Figure: structure of the data frame]

Above is the structure of the dataframe. The first thing to note is that there are a number of different animals logged in the dataset. Creating a model for the five different types could be quite complicated, therefore I am going to focus on dogs; out of the 35,000 or so observations, I think dogs will make up most of them anyway. The model type I think is most suited to this problem is a classification tree. A classification tree works by building a network of yes-or-no splits with the various outcomes at the leaves. It works well when you have lots of factor variables, which this dataset is full of.

Now we need to select the columns the model is going to be based on, summarised below:

animal_breed – identifies the type of dog. I think this is key information, as some breeds are more likely to be adopted than others.

intake_type – how did the dog arrive at the shelter? There will clearly be an effect on the dog’s outcome.

intake_condition – what was the dog’s condition when it arrived? An unhealthy dog is possibly less likely to be adopted.

chip_status – did the dog have a microchip? Dogs with microchips are more likely to be reunited with their owners.

animal_origin – where was the animal found, or how did it come to the shelter?

outcome_type – finally, the most important column, as this is what we will be predicting.

Now we have our columns selected, we need to prepare the data. The first thing to look at is the outcome column; I wanted to make sure there aren’t too many outcomes to predict. When you look back at the structure, there are 12 separate outcomes, which is far too many, so let’s see if we can group some together. Below is a summary of the different levels of the outcome_type column, and I think there is definitely scope to group some of them.

[Table: summary of outcome_type levels]

Dead on arrival should be excluded: if the dog is dead on arrival, that is the outcome, and there’s nothing to predict. Died and euthanized I am going to group into a single died outcome, as predicting how a dog died is beyond the scope of this model. Foster, transfer and other will be grouped under unadopted. The remaining levels will then be filtered out.

The final thing in this opening blog of preparing the data is the animal_breed column. There are over 100 different dog breeds in this column, which would be impossible for the classification tree. On closer inspection, the column consists of either an individual dog breed or “mixed”, which I assumed means not pedigree. I decided to convert this to a column recording whether the dog is a pedigree or a cross breed. The final data preparation code is below:

[Code: data preparation]
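Since the screenshot may be hard to read, here is a rough sketch of what that preparation could look like. This is not the exact code, and the label strings are guesses at the real values in the data:

```r
library(dplyr)

# A sketch only; the label strings are guesses at the real values
dogs <- animals %>%
  filter(animal_type == "DOG") %>%
  mutate(
    pedigree = if_else(grepl("MIX", animal_breed), "Cross breed", "Pedigree"),
    outcome_type = case_when(
      outcome_type %in% c("DIED", "EUTHANIZED")          ~ "Died",
      outcome_type %in% c("FOSTER", "TRANSFER", "OTHER") ~ "Unadopted",
      outcome_type == "ADOPTION"                         ~ "Adopted",
      TRUE                                               ~ NA_character_  # drop the rest
    )
  ) %>%
  filter(!is.na(outcome_type)) %>%
  select(pedigree, intake_type, intake_condition,
         chip_status, animal_origin, outcome_type)
```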

That’s it for today’s opener; tomorrow we will look at the results of the model and how, if required, I optimise it.

World Cup Group C

Hi there, welcome to the next in a series of little previews ahead of the FIFA World Cup. Today we are dissecting the four teams in Group C: France, Peru, Denmark and Australia. Please do check out the other previews; further previews are coming at 6 pm every day ahead of the first game.

[Figure: age distribution of Group C squads]

On the face of it, these look to be some of the youngest squads in the tournament. Australia seems to have players from both ends of the spectrum and a good grouping around peak-age players. France probably has the lowest median age of any squad in the competition. Peru doesn’t have too many players between 20-25, but has a good grouping between 25-28.

[Figure: distribution of caps in Group C squads]

Looking at the distribution of caps in each squad, it looks like all four teams have relatively inexperienced players. Denmark has the largest number of players with around 25 caps. They also show the familiar trend of a spike higher up, indicating a good number of experienced pros, vital in any squad make-up. Peru seems to have the most experienced players in their squad, which could stand them in good stead to get out of the group. The big question for France is whether their lack of experience will affect them later in the competition.

[Figure: squad composition of Group C teams]

Finally, looking at squad composition, France and Australia seem to have the most attackers: France by bringing fewer midfielders, Australia by bringing fewer defenders. Peru has gone in a totally different direction to the rest of the group, with a squad overloaded with midfielders. Most are attacking midfielders, though, so they should still have goalscoring options.

[Figure: qualification and tournament-winning probabilities for Group C]

Now we look at each team’s probability of getting out of the group and winning the tournament. Finally, we have a group that, on the face of it, could be quite competitive, at least for second place. Denmark is the clear favourite for that spot, but both Peru and Australia seem to have good outside chances, at least according to the bookies. France has a decent chance of winning the whole tournament and is currently 4th favourite, so it will be interesting to see how they do with their young squad.

That’s it for today’s overview. Let me know your thoughts: how far do you think France will go, and who do you think will get out of the group?

World Cup Group B

Today we are going to look at Group B in the World Cup. This is part of my series reviewing each squad in the World Cup in order to assess strengths and weaknesses and understand squad make-up. If you haven’t seen the other blogs, go check them out: Group A went live yesterday and the other groups will follow over the coming days. Group B consists of Spain, Portugal, Iran and Morocco.

[Figure: age distribution of Group B squads]

The first thing to look at is the age composition of the four squads. Interestingly, Iran seem generally to have the youngest squad in the group, with the lowest median. Spain have the largest grouping around the peak ages of 27-30. Morocco, despite having the highest median, have the youngest individual players in the group. Portugal have some young players but also some of the oldest, with a lot of squad members above 30.

[Figure: distribution of caps in Group B squads]

Looking at the experience of the players, Morocco are clearly the least experienced squad. This could be because of the high number of younger players compared to the other teams. Spain and Portugal have similar caps profiles, with a group of inexperienced players complemented by a few very experienced ones.

[Figure: squad composition of Group B teams]

The most interesting thing about the squad composition of the four teams is that Portugal and Spain have the same composition. Is this a template the bigger countries seem to be following? There is also an increase in attacking players in these four squads compared to Group A, which should mean they are all better balanced. In fact, Morocco, Portugal and Spain have the same number of attackers. Iran have more midfielders than any other team, which could give them more options, although their lack of attackers could harm them if chasing a game.

[Figure: qualification and tournament-winning probabilities for Group B]

Finally, we look at each team’s chances of qualifying from the group and of winning the World Cup. Qualification looks like pretty much a done deal: Portugal and Spain have by far the strongest chances of getting out of the group, which could make this group not too interesting for spectators. Portugal and Spain do play each other in the first game, and if there is a loser, that could add extra pressure when they come to play Morocco or Iran. I’m surprised by Portugal’s low chance of winning the title compared to Spain’s; Portugal are the reigning European champions and have the mercurial talent of Cristiano Ronaldo. Spain, however, look to be one of the big favourites, so it will be interesting to see how they do after the last World Cup’s total failure.

That’s it for the Group B overview. Any questions or comments, let me know, and if you have any ideas for other things I should look at, let me know too.