Does the Dog Get Adopted?? — P2

Today we are going to look at the second part of building a classification tree to predict the outcomes of dogs at the Dallas animal shelter. This is the exciting stuff: creating the actual classification tree. If you want to understand how I prepared the data, go and check out the first blog, where I go into the data preparation in detail.

As previously mentioned, we are using the classification tree method, and the columns we will initially base the prediction on are intake_type, intake_condition, chip_status, animal_origin and pedigree. I am using the rpart package to create the classification trees and will split the data 75/25, with 75% to train the tree on and 25% to test it on.

[Code: fitting the first classification tree]
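Since the original code is only shown as a screenshot, here is a minimal sketch of what the split and first fit might look like; the data frame name dogs, the seed and the use of rpart.plot are my assumptions, not the original code:

library(rpart)
library(rpart.plot)

set.seed(42)  # assumed seed, purely for reproducibility

# 75/25 train/test split of the prepared dogs data frame from part 1
train_idx <- sample(nrow(dogs), size = round(0.75 * nrow(dogs)))
train <- dogs[train_idx, ]
test  <- dogs[-train_idx, ]

# First classification tree on the five chosen intake columns
tree1 <- rpart(outcome_type ~ intake_type + intake_condition +
                 chip_status + animal_origin + pedigree,
               data = train, method = "class")

rpart.plot(tree1)  # draw the tree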

[Figure: first classification tree]

Above you can see the code and resulting classification tree for the first model. One thing that is immediately obvious is that it's highly complex and possibly overfitted to the data. Let's check how this tree performs:
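A sketch of that check, reusing the assumed names from the sketch above:

# Predict classes on the held-out test set
pred <- predict(tree1, newdata = test, type = "class")

# Confusion matrix: predicted outcome vs actual outcome
table(predicted = pred, actual = test$outcome_type)

# Overall accuracy on the held-out 25%
mean(pred == test$outcome_type)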

[Figure: performance of the first model]

Well, that’s not great at all. This classification tree seems to be barely better than random chance! This really isn’t ideal and means currently the model is pretty much worthless. Let’s have a look at what we can do to improve this.

The first thing I am going to do is look at the intake_condition column:

[Figure: categories in intake_condition]

There are 7 different categories within this column; however, I think these can be simplified into Healthy, Treatable and Unhealthy. So let's do this and check the results:
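One way this grouping might be done is with forcats::fct_collapse; the level names below are placeholders for whatever the seven real categories are, so treat this only as a sketch:

library(forcats)

# Collapse the seven intake conditions into three broader groups
# (the right-hand level names here are assumed, not the real labels)
dogs$intake_condition <- fct_collapse(
  dogs$intake_condition,
  Healthy   = c("HEALTHY"),
  Treatable = c("TREATABLE MANAGEABLE", "TREATABLE REHABILITABLE"),
  Unhealthy = c("UNHEALTHY UNTREATABLE", "CRITICAL")
)

# Re-split on the same indices and refit
train <- dogs[train_idx, ]
test  <- dogs[-train_idx, ]
tree2 <- rpart(outcome_type ~ intake_type + intake_condition +
                 chip_status + animal_origin + pedigree,
               data = train, method = "class")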

[Figure: second classification tree]

Success, a much simpler tree. However:

[Figure: performance of the second model]

The accuracy of the model has gone down! It is now less accurate than random chance; I am actually just wasting my time here. Let's take a step back and look at the table which shows the predicted outcomes against the actual outcomes. As you can see, the model is currently predicting lots of dogs which died or were not adopted as being adopted.

[Table: predicted vs actual outcomes]

I reviewed the composition of each column in the data frame after filtering for dogs that were predicted to be adopted but actually died. The biggest difference was seen in the condition column: apparently, a lot of the dogs that died were classed as treatable. How can that be?

[Figure: outcomes of dogs classed as treatable]

I took a step back, went back to the original dataset and filtered for the dogs classed as treatable, making no change to the outcome_type column, as I guessed this could be where the problem was. The graph above looks at the outcomes of the dogs classed as treatable. There are clearly a lot of dogs euthanized, which is possibly where the confusion is coming from: euthanized dogs are classed as having died, whereas by normal logic you would expect treatable dogs to survive. This is interesting, as it highlights how the decisions you make at the start of an analysis can affect it later on. The next question is whether there is another column in the dataframe that can be used to identify the euthanized dogs.

[Figure: kennel_status of euthanized dogs]

I think I found it with the kennel_status column. By far the most common kennel for the euthanized dogs to go in is the lab. Therefore, we are going to add kennel_status to the analysis and see where it goes:
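A sketch of the updated model, which is just the previous formula with kennel_status added (names again carried over from the earlier sketches):

# Third tree: the original predictors plus kennel_status
tree3 <- rpart(outcome_type ~ intake_type + intake_condition +
                 chip_status + animal_origin + pedigree + kennel_status,
               data = train, method = "class")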

[Figure: third classification tree]

[Figure: performance of the third model]

Success, the model is now much better at predicting the outcome for each dog at the shelter. However, the classification tree is back to being overly complex and could be overfitting the training data. Next, I will see if this tree can be pruned.

[Figure: complexity parameter plot]

Above you can see the complexity plot for the overly complex classification tree. The tree doesn't improve much beyond 4-7 levels and a complexity parameter of around 0.00075. Pruning can be done either before or after the model is created. Here I am going to prune before running the model, so I am going to run the fourth and hopefully final version of the model.
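A sketch of that fourth model; the cp of 0.00075 and max depth of 5 come from the discussion below, while the other names are assumptions carried over from the earlier sketches:

plotcp(tree3)  # draws a complexity plot like the one above

# Fourth tree: constrain complexity and depth up front, so the tree
# is pruned as it is grown rather than afterwards
tree4 <- rpart(outcome_type ~ intake_type + intake_condition +
                 chip_status + animal_origin + pedigree + kennel_status,
               data = train, method = "class",
               control = rpart.control(cp = 0.00075, maxdepth = 5))

rpart.plot(tree4)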

[Figure: final classification tree]

[Code: fitting the final classification tree]

Above you can see the final classification tree and the code with which to create it. In my call to rpart, I have used the control argument to limit the complexity parameter to 0.00075, based on the complexity plot, and the max depth to 5. This has produced a much less complex tree, and performance was similar to the previous, more complex tree.

This could be further developed with more data: does the sex of the dog have an effect on the results, or the size or type of dog? Small dogs could be more likely to be adopted, and certain types could be more likely to be euthanized. This could also be built on further with a random forest model. Thanks for reading, and well done if you got to the end; it's a bit longer than what I normally aim for. Please let me know your thoughts, or if there is anything I have missed or could have included.


Does the Dog Get Adopted?? — P1

Hello, welcome to the next blog. I was inspired by this week's Tidy Tuesday dataset. I'm sure I have said this before, but if you want to learn rstats it's a great resource, with a weekly dataset to practice your burgeoning skills on. This week's data was from the Dallas open data project, and the particular dataset was from the Dallas animal shelter. I thought, wouldn't it be great to create a model which, based on the information about an animal when it arrived at the shelter, could predict what might happen to it?

[Figure: structure of the dataframe]

Above is the structure of the dataframe. The first thing to note is that there are a number of different animals logged in the dataset. Creating a model for the 5 different types could be quite complicated, therefore I am going to focus on dogs. I think dogs will make up most of the 35,000 or so observations as well. The model type I think is most suited to this problem is a classification tree. A classification tree works by building a network of yes-or-no questions with the various outcomes at the end. It works well when you have lots of factor variables, which this dataset is full of.

Now we need to select the columns which the model is going to be based on, summarised below:

animal_breed – identifies the type of dog. I think this is key information, as some breeds of dog are more likely to be adopted than others.

intake_type – how the dog arrived at the shelter. This will clearly have an effect on the dog's outcome.

intake_condition – the dog's condition when it arrived. An unhealthy dog is possibly less likely to be adopted.

chip_status – whether the dog had a microchip. Dogs with microchips are more likely to be reunited with their owners.

animal_origin – where the animal was found or how it came to the shelter.

outcome_type – finally, the most important column, as this is the outcome we will be predicting.

Now we have our columns selected, we need to prepare the data. The first thing we will look at is the outcome column; I wanted to make sure there are not too many outcomes the prediction is based on. When you look back at the structure of the column, there are 12 separate outcomes, which is far too many, so let's see if we can group some together. Below is a summary of the different parts of the outcome_type column, and I think there is definitely scope to group some together.

[Figure: summary of outcome_type categories]

Dead on arrival should be excluded: if the dog is dead on arrival, that is the outcome and there's nothing to predict. Died and Euthanized I am going to group into a single Died outcome, as predicting how a dog died is beyond the scope of this model. Foster, Transfer and Other will be grouped under Unadopted. The remaining outcomes will then be filtered out of the list.

The final data preparation step in this opening blog is the animal_breed column. There are over 100 different dog breeds in this column, which would be unworkable for the classification tree. On close inspection, the column consisted of either individual dog breeds or "mixed", which I assumed means not pedigree. I decided to convert this to a column recording whether the dog is a pedigree or a cross breed. The final data preparation code is below:

[Code: data preparation]
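As that code is a screenshot, here is a sketch of what the preparation described above might look like; the raw data frame name animals and the exact level strings are assumptions:

library(dplyr)
library(stringr)

dogs <- animals %>%
  # Keep only dogs and drop dead-on-arrival cases
  filter(animal_type == "DOG", outcome_type != "DEAD ON ARRIVAL") %>%
  mutate(
    # Group the 12 raw outcomes into Adopted / Died / Unadopted;
    # anything unmatched becomes NA and is filtered out below
    outcome_type = case_when(
      outcome_type == "ADOPTION"                         ~ "Adopted",
      outcome_type %in% c("DIED", "EUTHANIZED")          ~ "Died",
      outcome_type %in% c("FOSTER", "TRANSFER", "OTHER") ~ "Unadopted"
    ),
    # Breeds containing "MIX" are treated as cross breeds
    pedigree = if_else(str_detect(animal_breed, "MIX"),
                       "Cross breed", "Pedigree")
  ) %>%
  filter(!is.na(outcome_type)) %>%
  mutate(across(where(is.character), as.factor))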

That’s it for today’s opening tomorrow we will look at the results of the model and how if required I optimise it.

F1 Circuit Cluster Analysis – 3

Hello and welcome to what was meant to be the final F1 circuit cluster analysis blog; however, I have thought of some ideas to extend it to a fourth, so we shall see how that goes. The idea is that today we will review what we can from the season so far, and in the next one look at some methods for predicting how the rest of the season will pan out. That one might not be until the summer break.

[Figure: season calendar coloured by circuit cluster]

Above is a summary of the season with each circuit coloured by the cluster it belongs to. Circuits have been clustered using hierarchical clustering; please see the other blogs in the series for the method used. The tracks that belong to clusters 1 and 2 are pretty evenly distributed across the season. What is interesting is that the two wildcard tracks which don't belong to any of the other three clusters are still to come. Could they prove crucial in the fight for the title?

[Figure: pace difference by team for each cluster]

Above you can see, for each cluster, the pace difference, with 0 being the fastest car in that cluster, up to around 3% for the slowest car. The first thing to take away is that in clusters 1 and 2 Mercedes and Ferrari are neck and neck; Mercedes are slightly, but only slightly, quicker overall. Red Bull get better the more low-speed corners and the fewer straights there are, highlighting the car's engine weakness, with cluster 3's slow, twisty circuits being their forte. They must be looking forward to Hungary next. Elsewhere, apart from the top three, one of the big stories is Haas. Their car looks well suited to the fast, flowing circuits of cluster 1 but is the slowest on the stop-start circuits with short straights. That is clearly a car with strengths in high-speed downforce and engine power. The gap between the top 3 teams and the rest is pretty consistent across all the clusters.

[Figure: driver pace ranking by cluster]

Finally, we look at how the drivers rank in the different clusters. Some interesting points are apparent. In cluster 1, Hamilton seems to have a clear advantage: it's close, but he's clearly on average faster than the other drivers. There's also a significant difference between Hamilton and his teammate Bottas, suggesting fast, twisty circuits could be Bottas's weakness. Compare that to cluster 2, where Hamilton, Vettel and Bottas are very evenly matched. At Ferrari, Raikkonen is a lot closer to Vettel on the fast, twisty circuits than in cluster 2, which has much bigger braking zones and slower corners. The opposite pattern is seen at Red Bull: Ricciardo is a good distance behind Verstappen on the fast, twisty circuits, but is actually slightly faster on the slower circuits. Elsewhere, Alonso has a clear advantage over Vandoorne on both types of circuit.

So that’s it for today’s blog. I am going to put the R code for this on GitHub and the spreadsheet so if you have any further ideas what can be done with this dataset then id love to see what you come up with. There will be a fourth part in this series where we look at historical trends and then look at forecasting the future.

Tidy Tuesday 2 – World Life Expectancy

Hello, welcome to today’s blog which is going to be my second one covering the tidy Tuesday dataset. This week it was looking at a dataset with life expectancy for every country in the world since 1950. I decided you could do some cluster analysis on this dataset and then once you have the clusters can further analyse to understand trends. We are going to use K-means clustering to put the countries together then look for trends and differences between the clusters. So the dataset has country, year between 1950 and 2015 and the life expectancy of that year. Now in order to do clustering, you need at least two measures, therefore, I created one with the change in life expectancy per year. The other measure is going to be the life expectancy in 2015.

In order to find our value of k, I made the silhouette plot below. Now, you're meant to use the value of k with the highest average silhouette width, which in this case would be 3. However, with so many different countries I feel that would group the countries up too much. There are further spikes at 6 and 10.

[Figure: silhouette widths for different values of k]

I decided to produce the cluster plot for both k values, 6 and 10. The plot for 10 is below:

[Figure: K-means clusters with k = 10]

10 seems like a good value, as there are not too many clusters to deal with but still good variation between the different clusters. We will take k equal to 10 for further analysis.

[Figure: mean causes of death by cluster]

The comparison above looks at causes of death, grouped to get the mean for each cause within each cluster. Conclusions that can be made:

  • Cancer is prevalent across all clusters; however, the higher the life expectancy, the more prevalent it is. This could be because you're more likely to get cancer at older ages.
  • Dementia is another cause which seems to increase with higher life expectancy.
  • HIV is highest in the two lowest life expectancy clusters, as are neonatal deaths.
  • Finally, road accidents are an interesting cause: by far the highest is cluster 7, which seems to be the cluster with the largest increase in life expectancy over the last 65 years. Could this be because these are fast-developing nations that have not yet built safe road infrastructure?


That’s it for a little intro into reviewing the data this way. Let me know your thoughts and comments. There are lots of dataset on the World Health Organisations website as well as other datasets such as economic growth i can add to this analysis and develop it further.


F1 Circuit Cluster Analysis – Part 2

Hello and welcome to the second part of my mini-series using cluster analysis to categorise Formula 1 circuits. Please go check out the first part; it outlines the basic data we are using to categorise the circuits and gives an overview of the hierarchical clustering method. Today we are going to go with K-means clustering.

For K-means clustering we have to set our own value for k, and we are going to do that with two different types of analysis: an elbow plot and silhouette analysis.

Elbow Plot

The code below is what was used to generate the elbow plot, which follows it:

[Code: generating the elbow plot]
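A sketch of what that code might look like; circuits as the data frame and the two column names are my assumptions:

library(ggplot2)

# Total within-cluster sum of squares for k = 1..10
metrics <- scale(circuits[, c("avg_straight_length", "avg_corner_speed")])
wss <- sapply(1:10, function(k) {
  kmeans(metrics, centers = k, nstart = 25)$tot.withinss
})

# Plot wss against k and look for the "elbow" where the curve flattens
ggplot(data.frame(k = 1:10, wss = wss), aes(k, wss)) +
  geom_line() +
  geom_point()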

[Figure: elbow plot]

Reviewing the elbow plot, it looks like we are already seeing a slightly different number of clusters than we got from hierarchical clustering. The elbow of the plot looks to be at 3, though you could also argue for 4 as the value for k.

Silhouette Analysis

The other way to decide a k value when conducting K-means clustering is to produce a silhouette graph. This takes every point in the analysis and rates how well it fits its cluster, with -1 being doesn't fit at all and 1 being fits well. You then plot the average silhouette width for each value of k, and the highest point gives the value of k. I have put a picture of the code below, along with the silhouette graph produced:

[Code: silhouette analysis]
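A sketch of the silhouette analysis using the cluster package, with the same assumed names as the elbow-plot sketch:

library(cluster)

# Average silhouette width for k = 2..10 (undefined for k = 1)
d <- dist(metrics)
avg_sil <- sapply(2:10, function(k) {
  km <- kmeans(metrics, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

plot(2:10, avg_sil, type = "b",
     xlab = "k", ylab = "Average silhouette width")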

[Figure: silhouette plot]

Fascinatingly, there are two high points: one for a k of 9 and another for a k of 3. I am going to choose a k of 3, as this aligns closely with what we saw in the elbow plot, and 9 clusters are just too many to deal with.

[Figure: circuits coloured by K-means cluster]

The above graph shows all the circuits on the calendar plotted by average straight length and average corner speed, coloured by the cluster they have been put in. I am a bit unsatisfied with this; I feel it doesn't quite fit the different circuits on the calendar. For instance, Singapore is different to China and Germany. Therefore, K-means is not the clustering I will use in the final blog to look at pace trends across the season. Look out for the final blog, in which we will look at the pace across all circuits so far for all the teams, along with some other metrics like overtakes and pitstops.


F1 Circuit Cluster Analysis – Part 1

Hello there. As you know, I'm currently working through the Datacamp Data Scientist with R course. (If the people from Datacamp are reading this, I'm open to sponsorship!) There will be a further update on how I'm getting on with it later this week; however, today I wanted to focus on applying something new that I learnt: cluster analysis. Cluster analysis allows you to take a dataframe of variables and calculate which rows are best grouped together. There are two main methods we are going to look at: hierarchical clustering and K-means clustering. We are going to look at Formula 1 circuits. The idea is that there are 21 different circuits currently on the calendar, all with different lengths, height profiles and types of tarmac; however, can we group them together by certain characteristics? For me, as an avid Formula 1 watcher, the differences between the circuits come down to the lengths of the straights and the speed of the corners. Therefore, the two metrics we are using are the average straight length and the average corner speed.

  • Average straight length – calculated by measuring each stretch of track on which an F1 car would be running at full throttle, removing any stretches less than 100m. An example for Spain is below, with the straights estimated in green.

[Figure: Spain circuit with estimated straights marked in green]

  • Average corner speed – I have calculated this by allocating each corner to slow, medium or fast. (Unfortunately, I don't have data for the exact corner speeds, but if any F1 team wants to send them over, email me!) You can see in the table below how many of each were allocated per circuit.

[Table: slow, medium and fast corner counts per circuit]

As I don’t know the exact speeds of these corners I have estimated that a slow corner is 80 km/h, medium speed corner is 150 km/h and a fast corner is 200km/h. This has left us with the following table:

[Table: average straight length and average corner speed per circuit]

Hierarchical Clustering

The first thing we are going to look at is hierarchical clustering. The table above is fed into the following code:

[Code: hierarchical clustering]
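A sketch of that clustering step, again with assumed names for the table's columns:

# Scale the two metrics so neither dominates the distance calculation
features <- scale(circuits[, c("avg_straight_length", "avg_corner_speed")])

# Hierarchical clustering on Euclidean distances
hc <- hclust(dist(features), method = "complete")

# Dendrogram labelled by circuit, then cut into 5 clusters
plot(hc, labels = circuits$circuit)
circuits$cluster <- cutree(hc, k = 5)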

This produces the following output:

[Figure: dendrogram of F1 circuit clusters]

We have 5 distinct clusters that the F1 circuits fit into. It's not too surprising that Singapore, Monaco and Hungary fall into the same cluster, or that Belgium and Great Britain are similar circuits.

[Figure: scatter of circuits coloured by cluster]

In the scatter above you can clearly see the difference between the two main clusters, 1 and 2. In cluster 1 the straights are often shorter but the corners are faster; cluster 2 circuits often have longer straights but slower corners. With a few circuits from each group used so far this season, it would be interesting to see if there are any trends in car speed. That's it for the first part of this series; next week we will look at any differences using K-means clustering. In the final part, we will look at applying what we have seen so far this year to try and predict who will win the later rounds.

World Cup Group H

Hello, welcome to the next blog in my series previewing the World Cup. All 7 other groups are covered elsewhere on my blog, so please go check them out and let me know who you think the favourites are. The final group contains Colombia, Poland, Japan and Senegal. This promises to be quite an interesting group, and no one team stands out as an absolute favourite.

[Figure: age distribution of Group H squads]

Looking at the ages of the four squads, Senegal have what looks to be the youngest squad in the group, with 2 players under the age of 20. The medians for the 4 teams, however, are about the same. Japan seem to have the oldest players, but nothing too extreme. All the teams seem to have most of their players around peak age, which should mean they are in peak condition.

[Figure: caps distribution of Group H squads]

Looking at the caps distribution, Poland look to have the flattest range of caps, with no overload of experienced or inexperienced players. The Senegal squad looks to be the least experienced, most probably because, as we have seen, it has a lot of younger players. Most teams in the tournament have squads containing players with over 100 caps; however, none of the squads in Group H do.

[Figure: squad composition of Group H teams]

Now, the Senegal squad composition looks interesting: they don't have many midfielders compared to the other teams, and lots of attackers. It looks like Senegal could be a fun team to watch in the tournament. The other squads don't look to have too many different options, with them all having the same number of midfielders and just a small difference between defenders and attackers.

[Figure: chances of each Group H team in the World Cup]

Finally, we look at the chances of each team in the World Cup. First of all, I think it's the closest group to call, because the expected favourites Colombia have the lowest percentage chance of all the group favourites we have seen so far. Japan could also feel they have a half-decent chance of getting through the group, with it being so wide open. It will be fascinating to see how this group, with no so-called big nations, turns out; it might be the most exciting of the tournament. As for winning the whole thing, Colombia possibly have an outside chance, but none of these teams are among the top favourites.

That’s it for the last in this series of blogs looking at each group in the world cup. I hope you have enjoyed them and they have increased your understanding of the teams in the World Cup.