## NHL Assessing Player Contributions

Today on the blog we are going to look at a method I have been investigating for measuring how much a player contributes to their team in each game. Normally you might look at how many goals a player scores. That’s always the headline figure, but players contribute in numerous different ways in ice hockey: assists, shots, hits and takeaways. We can also look at negative contributions to the team: penalty minutes, giveaways and goals conceded. These events can be given point values depending on how common the event is and how key the event is to winning a match. We can then look at each player’s contributions in a game and see which players affected the result most.

The first thing I have to say is thank you to NaturalStatTrick.com for making the data available. This would not be possible without them. Also, this method does not include goaltenders. I think they need a separate scoring system, so they are left out.

The table shows how I have initially scored the different events. At the moment the values are arbitrary, based on how I value each event and how common it is. For instance, goals are the most important, as without a goal it’s impossible to win a game of hockey. Faceoffs are relatively common and, although they could be key to scoring, I have awarded them a low score. One current flaw is giveaways: a giveaway in your own defensive zone is probably a lot worse than a giveaway in the opponent’s defensive zone. I have no data for the location of the giveaway so can’t include that in the scoring. Location data would be very useful to refine this model, maybe in the future.
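As a separate illustration of the idea (not the table or the sample from this post), the scoring is just a weighted sum of event counts. The sketch below is in Python, and every point value and name in it is a made-up placeholder, since the actual values from the table are not reproduced here.

```python
# Hypothetical point values: placeholders for illustration only,
# NOT the values from the post's scoring table.
EVENT_POINTS = {
    "goals": 10.0,
    "assists": 6.0,
    "shots": 1.0,
    "hits": 0.5,
    "takeaways": 1.0,
    "faceoffs": 0.25,
    "penalty_minutes": -0.5,   # negative contributions carry negative points
    "giveaways": -1.0,
    "goals_conceded": -3.0,
}


def game_score(event_counts):
    """Sum each event count multiplied by its point value."""
    return sum(EVENT_POINTS[event] * count for event, count in event_counts.items())


# Made-up stat line for one skater in one game.
example_line = {"goals": 1, "assists": 1, "shots": 4, "giveaways": 2, "penalty_minutes": 2}
example_score = game_score(example_line)
```

Keeping the weights in one dictionary makes it easy to try different valuations, which matters while the values are still arbitrary.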

Above you can see a sample of the point scores for the first 2 games of the NHL season. In the first game, you can see how the Washington Capitals forwards dominated in the 7-0 win. In the second match, despite a much closer game, the Toronto forwards also looked to dominate. This is just the first explanation of the method and a quick initial look at 2 games. Things I will look at going forward:

• Comparison between defencemen and forwards. This method possibly overvalues forwards, as defensive actions are often not quantifiable. Splitting it up will show the best offensive defencemen
• Point leaders so far in the season and where they score their points
• Look at certain players over the previous seasons

Let me know your thoughts on this and if you have any ideas for what you want to see in the future.

## Tidy Tuesday – Films Dataset

Hello, today we are going to be looking at this week’s Tidy Tuesday dataset. This is just a quick EDA, as I got a bit carried away with the dataset. I initially set out to do just one interesting graph but kept finding more and more interesting insights. Below you can see the structure of the dataset:

As you can see, it’s got 3401 observations of 9 variables, each a different film. First of all, let’s look at how the production budgets are distributed with a histogram:

We can see that the vast majority of films in this dataset have a budget less than 25 million dollars. But how does that change for each genre in the dataset:

Now you can see some interesting insights. Comedy, drama and horror films have clear peaks at the lower end of the production budget. Action and Adventure films are much more spread out across all production budgets.

Now let’s look at how the production budget influences how much the film grosses. Action and Adventure have the steeper slopes, so on average the more money put into these films, the higher the reward.

Above you can see when each film is released during the year. Notice the peak for horror in the 10th month, Halloween. Also, Drama seems to increase toward the end of the year in time for Oscar season. Adventure has a peak in July for summer blockbusters and in December maybe aiming for the holiday season.

Finally, for now, let’s look at the median profit percent per month: can we get any ideas about when it’s best to release a particular genre? Action, adventure and comedy seem to have 2 peaks, one in the middle of the year and one towards the end. Horror seems to have generally the highest median earnings. This dataset is a simple one, but one in which insights are easy to come by. I could definitely write at least one more blog with the information I have found from this dataset. Another day perhaps.
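For anyone wanting to reproduce the last chart, here is a rough Python sketch of a median profit percent per genre and month. It assumes profit percent means (gross - budget) / budget × 100, which is my guess at the definition; the rows and field names are made up for illustration and are not from the dataset.

```python
from collections import defaultdict
from statistics import median

# Made-up rows for illustration; field names are hypothetical.
films = [
    {"genre": "Horror", "month": 10, "budget": 5.0, "gross": 20.0},
    {"genre": "Horror", "month": 10, "budget": 10.0, "gross": 15.0},
    {"genre": "Horror", "month": 10, "budget": 4.0, "gross": 2.0},
    {"genre": "Action", "month": 7, "budget": 100.0, "gross": 250.0},
    {"genre": "Action", "month": 7, "budget": 80.0, "gross": 60.0},
]


def profit_percent(budget, gross):
    """Assumed definition: profit as a percentage of the production budget."""
    return (gross - budget) / budget * 100.0


def median_profit_by_genre_month(rows):
    """Group films by (genre, month) and take the median profit percent."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["genre"], row["month"])
        groups[key].append(profit_percent(row["budget"], row["gross"]))
    return {key: median(values) for key, values in groups.items()}


summary = median_profit_by_genre_month(films)
```

The median is used rather than the mean so that one runaway hit on a tiny budget does not dominate a month.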

## Finding the Next James Anderson

Hello, welcome to today’s blog, in which we will be scouting for the next James Anderson. Possibly. The ideas behind this blog are nothing new; in fact, I have stolen the idea from another sport. The Rangers Report blog wrote a piece about scouting for the best youth players by using age-adjusted stats. This idea was based on Vollman’s work on ice hockey. I always like to say a lot of the best ideas are re-purposed from other areas! So how am I going to apply it? Well, we are going to use the bowling stats from the Second XI County Championship, age-adjust them and see which bowler under the age of 21 looks to be the most promising.

Above are the first 20 rows for the 350 bowlers we are going to be looking at. The top wicket-taker is Nijjar from Essex; however, he is 24 and therefore will not be included overall. The first under-21 player on the top wicket list is 20-year-old Mike from Leicestershire. So let’s adjust the data and see where we end up.

The first thing to deal with is that the players have all bowled differing numbers of balls. I need to get all the players to the same number of balls. To do that, I took the strike rate for their balls actually bowled and extended this to a full season of, say, 1500 balls. In the table below the estwicket column is the number of wickets if the bowler’s strike rate continued over a full 1500 balls.
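The extrapolation itself is simple arithmetic. A minimal Python sketch, assuming strike rate means balls bowled per wicket (the usual bowling definition); the function names are mine and the numbers are made up.

```python
FULL_SEASON_BALLS = 1500  # the "say 1500 balls" full season from the text


def strike_rate(balls, wickets):
    """Balls bowled per wicket taken (lower is better)."""
    return balls / wickets


def estimated_wickets(balls, wickets, season_balls=FULL_SEASON_BALLS):
    """Wickets expected if the current strike rate held for a full season."""
    if wickets == 0:
        return 0.0  # no wickets yet, so nothing to extrapolate
    return season_balls / strike_rate(balls, wickets)


# Made-up bowler: 20 wickets from 600 balls is a strike rate of 30.
example_estimate = estimated_wickets(600, 20)
```

A bowler with very few balls gets a large multiplier here, which is exactly why the estimate needs tempering afterwards.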

However, a bowler who has taken a lot of wickets in a small number of balls is unlikely to continue this rate to the end of a season. (Sorry Ben Coad). Regression to the mean is a widely accepted statistical phenomenon, so when we are extrapolating performance we need to account for it.

In the Rangers Report blog, they applied 1% regression for every match they projected performance for. I can’t do that as my analysis is based on balls, and in multi-day matches bowlers will bowl different numbers of balls. Therefore I decided to apply the same 1% regression for every extrapolated 100 balls.
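One possible reading of that rule, sketched in Python: the extrapolated wicket total is pulled towards the group’s average estimate by 1% for every 100 balls that had to be extrapolated. Regressing towards the group mean, the cap at 100% and all the names here are my assumptions for illustration, not necessarily the exact calculation behind the table.

```python
FULL_SEASON_BALLS = 1500
REGRESSION_PER_100_BALLS = 0.01  # 1% per extrapolated 100 balls, from the text


def regression_weight(balls_bowled, season_balls=FULL_SEASON_BALLS):
    """Fraction of the way to pull the estimate towards the mean."""
    extrapolated = max(season_balls - balls_bowled, 0)
    return min(REGRESSION_PER_100_BALLS * extrapolated / 100, 1.0)


def regressed_wickets(est_wickets, balls_bowled, group_mean_est):
    """Shrink an extrapolated wicket total towards the group mean (assumed target)."""
    weight = regression_weight(balls_bowled)
    return est_wickets + weight * (group_mean_est - est_wickets)


# Made-up bowler: 500 balls bowled, so 1000 balls are extrapolated (weight 0.10).
example_regressed = regressed_wickets(est_wickets=75.0, balls_bowled=500, group_mean_est=40.0)
```

Under this reading, bowlers above the average lose wickets and anyone who actually bowled 1500 balls or more is left unchanged.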

The table to the side shows how much this affects the bowlers on the list, with most bowlers having a reduction in wickets.

Now the final thing is to adjust the total wickets based on the player’s age. The first part will be to filter for all players 21 and younger and then apply Vollman’s age curve below:

The age curve is by year and month; however, I only have the age in years, so I will just be using the numbers in the first column. The graph below shows the results:

Based on this method Szymanski is the best bowler of the 2nd XI County Championship. However, the large dot means we have to extrapolate a lot for Szymanski. Another interesting point is that 3 of the top 5 estimated wicket-takers are left-arm spinners. The England team is badly missing a reliable spinner, particularly away from home. Could one of the 3 be the future England spinner?

There is lots more work that can be done with this data, such as looking back historically at how these numbers relate to future County Championship averages. A similar model could also be applied to batsmen, which will be detailed in a future blog. Let me know your thoughts. Have you seen Szymanski bowl? Ideally, I would have preferred younger players, but I think younger players play in the under-17 county championship.

## Biketown EDA – P2

Hello, welcome to the second part of this blog doing exploratory data analysis on the Biketown dataset. If you haven’t read the first one then go check it out. As an overview, we found that most of the records were either subscribers to the system or casual users who might just use it every now and then, so I decided to compare those two groups. We saw in the smaller sample that most of the casual users used it for recreation and the subscribers used it mostly for commuting. Today we are going to look at the distances and speeds the groups travel and where they rent the bikes from, but first we are going to look at when they rent the bikes:

This graph fits the idea that subscribers tend to rent the bikes for commuting, as there are two large spikes at around 8am and around 5pm, the peak hours for people going to and from work. Casual users don’t tend to use the system in the morning; however, there’s consistent usage throughout the afternoon. This could be tourists exploring the city, for instance. It’s strange that both groups don’t reach their minimum until well after 2am. Are these people using the system to get home from their nights out? Beware of the drunk Portland cyclist!

The density plot for the distances the two groups go is interesting; however, I think it does fit the current pattern. Subscribers are looking to do shorter journeys because they might be covering the last few miles to work, whereas the casual users are maybe exploring the city and therefore cover more distance.

Now let’s check out the speed curves for both groups: the further the distance the person travelled, the slower the speed. The subscribers are generally the much faster riders at all distances. The increase in speed towards the 25-mile distance is, I think, down to lower amounts of data. At the lower distances the subscribers have significantly faster speeds. Are they using the system to get from A to B as quickly as possible?

Now let’s look at the start and end locations for the casual riders. The heat map above shows that most riders are clustered around the city centre possibly moving from one tourist spot to another. There seems to be a fairly even distribution across the centre locations.

The subscriber heat map above shows that, generally, subscribers take the bike from much further out than the casual riders, who are possibly just using it to get to the tourist hot spots. Once again most of the usage is on the left side of the river; that must be the area where people like to get about. Also, the start locations around the outside are much denser, so it’s clear people are using the system from further out to go into the centre of the city.

That’s it for this exploratory data analysis. In this blog I think we have found some interesting insights and, at the minimum, been able to confirm what you would expect. I hope you have found this interesting and informative. Let me know your thoughts, and if you have looked at this dataset yourself let’s see what you found. Check out the code on my GitHub, which should be linked somewhere on the website.

## Nike Biketown – Exploratory Data Analysis

Hello, welcome to today’s blog, in which we are going to take a large dataset and do some exploratory data analysis on it. I am going to look at the Biketown dataset, which was a dataset on Tidy Tuesday. If you’re new around here, Tidy Tuesday is a hashtag on Twitter which the R for Data Science online learning community actively promotes and runs every Tuesday. If you’re inspired to learn R and data science like I was, that is a really great community full of wonderful people to start with. I am not going to post code snippets within the blog as I think it gets too long; however, the full code used will be posted on my GitHub.

Above is the structure of the data frame. The data comes in numerous CSV files, so I read them all in and created one large data frame structured like so. The second column, Payment plan, seems an interesting column: it has 3 values, Casual, Subscriber and another. The system in Portland has a way a regular user can automate payments to save time. Let’s look at how much of the dataset is based on the 3 types of payment plan:

As you can see, the vast majority of this dataset is either casual or subscriber, and I think it would be interesting to review the differences between the people on the two main payment plans. Therefore, going forward in the EDA, we are going to remove the entries that are neither casual nor subscriber. After this, we could possibly look at a method for working out what type of payment plan the blanks are. First things first, let’s have a look at what type of trips each group takes:

The big issue here is that making any conclusion on the type of trips each group takes is going to be difficult. This is because each group has over 200 thousand entries and fewer than 1000 of them have a recorded trip type. What we can say is that, in this smaller sample, subscribers tend to use the system for commuting and casual users use the system for recreation, which makes total sense.

Now we look at the payment methods that both groups have used. By far the 3 main payment types for both groups are keypad, mobile and keypad_rfid_card, with subtle differences between the 2 groups. The RFID card is clearly higher for subscribers, which must be because subscribers are given a card in order to gain access to the bikes. Casual users are also much more likely to use their mobile to gain access. In both groups the vast majority use the keypad system.

That’s it for part 1 of today’s exploratory data analysis on the Biketown data. Tomorrow we will look at the distances the groups go as well as location and usage time. Let me know your comments on the first part.

## Does the Dog Get Adopted?? — P2

Today we are going to look into the second part of creating the classification tree to look at the outcomes of dogs in the Dallas animal shelter. Today it’s the exciting stuff: creating the actual classification tree. If you want to understand how I have prepared the data, go and check out the first blog, where I go into the data preparation in detail.

As previously mentioned we are using the classification tree method and the columns we will be basing the outcome on initially are intake_type, intake_condition, chip_status, animal_origin and pedigree. I am using the rpart package in order to create the classification trees and will split the data so there’s 75% to train the tree on and 25% to test the tree on.

Above you can see the code and resulting classification tree for the first model. One thing immediately obvious in the first classification tree is that it’s highly complex and is possibly overfitted to the data. Let’s check how this tree performs:

Well, that’s not great at all. This classification tree seems to be barely better than random chance! This really isn’t ideal and means currently the model is pretty much worthless. Let’s have a look at what we can do to improve this.

The first thing I am going to do is look at the intake_condition column:

There are 7 different categories within this column; however, I think this can be simplified into Healthy, Treatable and Unhealthy. So let’s do this and check the results:

Success, much simpler tree, however:

The accuracy of the model has gone down! It is now less accurate than random chance; I am actually just wasting my time here. Let’s take a step back and look at the table which shows the predicted outcome against the actual outcome. As you can see, the model is currently predicting lots of dogs which died or were not adopted as being adopted.
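If you have not built a predicted-against-actual table before, it is just a count of each (actual, predicted) pair, and accuracy is the share of rows where the two match. A small Python sketch with made-up labels (these are not the model’s results, and the function names are mine):

```python
from collections import Counter


def confusion_table(actual, predicted):
    """Count how often each (actual, predicted) pair of outcomes occurs."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Share of rows where the prediction matches the actual outcome."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Made-up outcomes for six dogs, mimicking a model that over-predicts "adopted".
actual = ["adopted", "died", "unadopted", "adopted", "died", "unadopted"]
predicted = ["adopted", "adopted", "adopted", "adopted", "died", "adopted"]

table = confusion_table(actual, predicted)
overall_accuracy = accuracy(actual, predicted)
```

Reading the off-diagonal counts, such as `table[("died", "adopted")]`, is what points at which outcomes are being confused, which a single accuracy number hides.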

I reviewed the composition of each column in the data frame after filtering for dogs that were predicted to be adopted but actually died. The biggest difference was seen in the condition column. Apparently, a lot of dogs that died were treatable. How can that be?

I went back to the original dataset and filtered for the dogs which are treatable, but made no change to the outcome_type column, as I guessed this could be where the problem was. The above graph looks at the outcomes of the dogs which are classed as treatable. There are clearly a lot of dogs euthanized, which is possibly where the confusion is coming from: these are classed as died, whereas by normal logic you would expect treatable dogs to survive. This is interesting as it highlights how the decisions you make at the start of any analysis can affect it later on. The next question is whether there is another column in the data frame that can be used to identify the euthanization.

I think I found it with the kennel_status column. By far the most common kennel for the euthanized dogs to go in is the lab. Therefore we are going to add the kennel status to the analysis and see where it goes:

Success, the model is now much more successful in predicting the outcome for each dog at the shelter. However, the classification tree is now back to being over-complex and could possibly overfit the training data. Next, I will see if this tree can be pruned.

Above you can see the complexity plot for the overly complex classification tree. The tree isn’t much improved when you go over 4-7 levels and a complexity of around 0.00075. This pruning can be done either before or after creating the model. For this I am going to do the pruning before running the model, so I am going to run the fourth and hopefully final version of the model:

Above you can see the final classification tree and the code with which to create it. In my call to rpart, I have used the control argument and limited the complexity to 0.00075 based on the complexity plot and the max depth to 5. This has produced a much less complex tree, and performance was similar to the previous complex tree.

This could be further developed with more data: does the sex of the dog have an effect on the results, or the size or type of dog? Small dogs could be more likely to be adopted and certain types could be more likely to be euthanized. This could also be built on further by creating a random forest model. Thanks for reading, and well done if you got to the end; it’s a bit longer than what I normally aim for. Please let me know your thoughts, or if there is anything I have missed or could have included.

## Does the Dog Get Adopted?? — P1

Hello, welcome to the next blog. I was inspired by this week’s Tidy Tuesday dataset. I’m sure I have said this before, but if you want to learn rstats it’s a great resource, with a weekly dataset to practice your burgeoning skills. This week’s data was from the Dallas open data project, and the particular dataset was from the Dallas animal shelter. I thought, wouldn’t it be great to create a model which, based on the information about the animal when it arrived at the shelter, could predict what might happen to the animal?

Above is the structure of the data frame. The first thing to note is that there are a number of different animals logged in the dataset. Creating a model for the 5 different types could be quite complicated, therefore I am going to focus on dogs. I think out of the 35000 or so observations dogs will make up most of them as well. The model type I think is most suited to this problem is a classification tree. A classification tree works by building a network of yes-or-no questions with the various outcomes at the end. It works well when you have various factor variables, which this dataset is full of.

Now we need to select the columns the model is going to be based on, summarised below:

animal_breed – identifies the type of dog. I think this is key information, as some breeds of dog are more likely to be adopted than others.

intake_type – how the dog arrived at the shelter. This will clearly have an effect on the dog’s outcome.

intake_condition – the dog’s condition when it arrived. An unhealthy dog is possibly unlikely to be adopted.

chip_status – whether the dog had a microchip. Dogs with microchips are more likely to be reunited with their owners.

animal_origin – where the animal was found or how it came to the shelter.

outcome_type – finally, the most important column, as this is what we will be predicting.

Now we have our columns selected we need to prepare the data. The first thing we will look at is the outcome column, as I wanted to make sure there are not too many outcomes to predict. When you look back at the structure of the column there are 12 separate outcomes. This is far too many, so let’s see if we can group some together. Below is a summary of the different values in the outcome_type column, and I think there is definitely scope to group some together.

Dead on arrival should be excluded: if the dog is dead on arrival, that is the outcome and there’s nothing to predict. Died and Euthanized I am going to group into a single died outcome, as predicting how the dog died is beyond the scope of this prediction. Foster, transfer and other will be grouped under unadopted. The other outcomes will then be filtered out of the list.
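The grouping described above boils down to a lookup table plus a filter. Here is a rough Python sketch; the exact label spellings in the real data are not shown in this post, so the strings below are hypothetical, and keeping adoption as its own class is my assumption based on the outcomes the model goes on to predict.

```python
# Hypothetical raw labels (lower-cased) mapped to the grouped outcome.
OUTCOME_GROUPS = {
    "adoption": "adopted",      # assumption: adoption stays as its own class
    "died": "died",
    "euthanized": "died",
    "foster": "unadopted",
    "transfer": "unadopted",
    "other": "unadopted",
}


def group_outcome(raw_outcome):
    """Return the grouped outcome, or None if the row should be dropped."""
    return OUTCOME_GROUPS.get(raw_outcome.strip().lower())


def prepare_outcomes(rows):
    """Keep only rows with a grouped outcome and store it in a new field."""
    prepared = []
    for row in rows:
        grouped = group_outcome(row["outcome_type"])
        if grouped is not None:  # drops dead on arrival and unlisted outcomes
            prepared.append({**row, "outcome_group": grouped})
    return prepared


# Made-up rows for illustration.
sample = [
    {"outcome_type": "Euthanized"},
    {"outcome_type": "Dead on arrival"},
    {"outcome_type": "Foster"},
]
prepared_sample = prepare_outcomes(sample)
```

Returning None for anything not in the table means a new or misspelled label is dropped rather than silently landing in the wrong group.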

The final thing in this opening blog on preparing the data is the animal_breed column. There are over 100 different dog breeds in this column, which would be impossible for the classification tree. On close inspection, the column consisted of either an individual dog breed or mixed, which I assumed is not pedigree. I decided to convert this to a column marking the dog as either pedigree or cross breed. Therefore the final data preparation code is below:

That’s it for today’s opening blog. Tomorrow we will look at the results of the model and how, if required, I optimise it.