Today we are looking at the second part of creating a classification tree to predict the outcomes of dogs in the Dallas animal shelter. Today it’s the exciting stuff: creating the actual classification tree. If you want to understand how I prepared the data, go and check out the first blog, where I go into the data preparation in detail.
As previously mentioned, we are using the classification tree method, and the columns we will initially base the outcome on are intake_type, intake_condition, chip_status, animal_origin and pedigree. I am using the rpart package to create the classification trees and will split the data so that 75% is used to train the tree and 25% to test it.
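As a rough sketch of that setup, the split and the first call to rpart could look something like the code below. The data frame here is a randomly generated stand-in, not the shelter data: the predictor names follow this post, but the factor levels, the `outcome` column name and the object names (`dogs`, `train`, `test`, `model_1`) are assumptions made purely for illustration.

```r
library(rpart)

set.seed(42)  # reproducible toy data and split

# Hypothetical stand-in for the prepared shelter data. Column names follow the
# post; every level and value below is invented for illustration only.
n <- 400
dogs <- data.frame(
  intake_type      = factor(sample(c("STRAY", "OWNER SURRENDER", "CONFISCATED"), n, replace = TRUE)),
  intake_condition = factor(sample(c("HEALTHY", "TREATABLE", "UNHEALTHY"), n, replace = TRUE)),
  chip_status      = factor(sample(c("CHIP", "NO CHIP"), n, replace = TRUE)),
  animal_origin    = factor(sample(c("FIELD", "OVER THE COUNTER"), n, replace = TRUE)),
  pedigree         = factor(sample(c("PEDIGREE", "MIXED"), n, replace = TRUE)),
  outcome          = factor(sample(c("ADOPTED", "DIED", "NOT ADOPTED"), n, replace = TRUE))
)

# 75% of the rows go to training, the remaining 25% to testing.
train_idx <- sample(seq_len(nrow(dogs)), size = floor(0.75 * nrow(dogs)))
train <- dogs[train_idx, ]
test  <- dogs[-train_idx, ]

# method = "class" asks rpart for a classification (not regression) tree.
model_1 <- rpart(
  outcome ~ intake_type + intake_condition + chip_status + animal_origin + pedigree,
  data   = train,
  method = "class"
)

print(model_1)
```

Because the toy outcome is pure noise, the printed tree says nothing about real shelter dogs; the point is only the shape of the split and the model call.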
Above you can see the code and resulting classification tree for the first model. One thing that is immediately obvious is that the tree is highly complex and is possibly overfitted to the data. Let’s check how this tree performs:
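For anyone following along, a minimal sketch of this kind of check is below. It uses invented toy data rather than the shelter data, and comparing against a "always guess the most common outcome" baseline is my assumption about what a fair chance-level comparison might be.

```r
library(rpart)

set.seed(1)  # reproducible toy data and split

# Hypothetical toy data, NOT the shelter data: the outcome is loosely tied to
# the condition so that the tree has something to learn.
n <- 400
dogs <- data.frame(
  intake_condition = factor(sample(c("HEALTHY", "TREATABLE", "UNHEALTHY"), n, replace = TRUE)),
  chip_status      = factor(sample(c("CHIP", "NO CHIP"), n, replace = TRUE))
)
dogs$outcome <- factor(ifelse(
  dogs$intake_condition == "UNHEALTHY" & runif(n) < 0.8, "DIED",
  ifelse(runif(n) < 0.6, "ADOPTED", "NOT ADOPTED")
))

train_idx <- sample(seq_len(n), size = floor(0.75 * n))
train <- dogs[train_idx, ]
test  <- dogs[-train_idx, ]

model <- rpart(outcome ~ intake_condition + chip_status, data = train, method = "class")

# type = "class" returns the predicted label for each test row.
predicted <- predict(model, newdata = test, type = "class")

# Accuracy: share of test rows where the prediction matches the real outcome.
accuracy <- mean(predicted == test$outcome)

# Baseline: always guess the most common outcome seen in the training set.
majority <- names(which.max(table(train$outcome)))
baseline <- mean(test$outcome == majority)

cat("accuracy:", round(accuracy, 3), " baseline:", round(baseline, 3), "\n")
```

A model is only earning its keep if its accuracy sits clearly above a baseline like this one.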
Well, that’s not great at all. This classification tree seems to be barely better than random chance! This really isn’t ideal and means the model is currently pretty much worthless. Let’s have a look at what we can do to improve this.
The first thing I am going to do is look at the intake_condition column.
There are 7 different categories within this column; however, I think this can be simplified into Healthy, Treatable and Unhealthy. So let’s do this and check the results:
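One way this kind of simplification could be done is with a named lookup vector, sketched below. The seven raw category names and the mapping are placeholders I have made up, since the real values are not listed here; only the three target groups come from the post.

```r
# Hypothetical raw values: seven placeholder category names, invented for
# illustration ("HEALTHY" appears twice to show repeated values are fine).
raw_condition <- c(
  "HEALTHY", "TREATABLE REHABILITABLE", "TREATABLE MANAGEABLE",
  "UNHEALTHY UNTREATABLE", "CRITICAL", "FATAL", "NORMAL", "HEALTHY"
)

# Named lookup vector: raw category -> simplified category (assumed mapping).
condition_map <- c(
  "HEALTHY"                 = "Healthy",
  "NORMAL"                  = "Healthy",
  "TREATABLE REHABILITABLE" = "Treatable",
  "TREATABLE MANAGEABLE"    = "Treatable",
  "UNHEALTHY UNTREATABLE"   = "Unhealthy",
  "CRITICAL"                = "Unhealthy",
  "FATAL"                   = "Unhealthy"
)

# Index the lookup by the raw values, then store the result as a factor.
condition <- factor(
  unname(condition_map[raw_condition]),
  levels = c("Healthy", "Treatable", "Unhealthy")
)

# Any raw value missing from the lookup would come back as NA, so check for that.
stopifnot(!anyNA(condition))

table(condition)
```

The NA check matters: a category that is silently dropped from the mapping would otherwise disappear from the model without any warning.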
Success: a much simpler tree. However:
The accuracy of the model has gone down! It is now less accurate than random chance, so I am actually just wasting my time here. Let’s take a step back and look at the table which shows the predicted outcome against the actual outcome. As you can see, the model is currently predicting that lots of dogs which died or were not adopted would be adopted.
I filtered for the dogs that were predicted to be adopted but actually died, and reviewed the composition of each column in the data frame for those rows. The biggest difference was seen in the condition column. Apparently, a lot of the dogs that died were treatable. How can that be?
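The table and the filtering described above can be sketched in a few lines of base R. The small `results` data frame below is entirely made up, with hypothetical column names (`condition`, `predicted`, `outcome`), just to show the mechanics.

```r
# Hypothetical predictions alongside the real outcomes (invented rows).
results <- data.frame(
  condition = c("Treatable", "Treatable", "Healthy", "Unhealthy", "Treatable", "Healthy"),
  predicted = c("ADOPTED", "ADOPTED", "ADOPTED", "DIED", "ADOPTED", "NOT ADOPTED"),
  outcome   = c("DIED", "DIED", "ADOPTED", "DIED", "ADOPTED", "NOT ADOPTED")
)

# Confusion table: rows are predictions, columns are what actually happened.
confusion <- table(predicted = results$predicted, actual = results$outcome)
print(confusion)

# Keep only the rows the model got wrong in one specific way:
# predicted to be adopted, but the dog actually died.
missed <- results[results$predicted == "ADOPTED" & results$outcome == "DIED", ]

# Composition of a column within those misclassified rows.
print(prop.table(table(missed$condition)))
```

Repeating the last line for each column is a quick way to see which feature the misclassified dogs have in common.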
I went back to the original dataset and filtered for the dogs classed as treatable, leaving the outcome_type column unchanged, as I guessed this could be where the problem was. The above graph shows the outcomes of those dogs. Clearly a lot of them were euthanized, which is possibly where the confusion is coming from: these dogs are classed as having died, whereas by normal logic you would expect treatable dogs to survive. This is interesting as it highlights how the decisions you make at the start of any analysis can affect it later on. The next question is whether there is another column in the data frame that can be used to identify the euthanized dogs.
I think I found it in the kennel_status column. By far the most common kennel for the euthanized dogs to go into is the lab. Therefore, we are going to add the kennel status to the analysis and see where it goes:
Success, the model is now much more successful at predicting the outcome for each dog at the shelter. However, the classification tree is now back to being overly complex and could possibly be overfitting the training data. Next, I will see if this tree can be pruned.
Above you can see the complexity plot for the overly complex classification tree. The tree doesn’t improve much beyond 4 to 7 levels and a complexity parameter of around 0.00075. Pruning can be done either before or after the model is created. Here I am going to do the pruning before running the model, so I am going to run the fourth and hopefully final version of the model:
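For reference, here is a minimal sketch of how a complexity table can be inspected with rpart, and how the post-creation route (pruning an already grown tree) might look. The data is invented: only the "LAB" kennel status comes from this post, and the other levels, the `outcome` column and the "pick the cp with the lowest cross-validated error" rule are assumptions for illustration.

```r
library(rpart)

set.seed(7)  # reproducible toy data and cross-validation

# Hypothetical toy data, NOT the shelter data.
n <- 600
dogs <- data.frame(
  intake_condition = factor(sample(c("Healthy", "Treatable", "Unhealthy"), n, replace = TRUE)),
  kennel_status    = factor(sample(c("LAB", "AVAILABLE", "UNAVAILABLE"), n, replace = TRUE)),
  chip_status      = factor(sample(c("CHIP", "NO CHIP"), n, replace = TRUE))
)
dogs$outcome <- factor(ifelse(
  dogs$kennel_status == "LAB" & runif(n) < 0.85, "DIED",
  ifelse(runif(n) < 0.6, "ADOPTED", "NOT ADOPTED")
))

# cp = 0 lets the tree grow as far as the other defaults allow.
full_tree <- rpart(outcome ~ ., data = dogs, method = "class",
                   control = rpart.control(cp = 0))

# The complexity table: one row per candidate tree size.
printcp(full_tree)
# plotcp(full_tree) would draw the same information as a plot.

# One common rule (an assumption here): take the cp with the lowest
# cross-validated error (the "xerror" column).
cp_table <- full_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Post-pruning: cut the grown tree back to that complexity.
pruned_tree <- prune(full_tree, cp = best_cp)

cat("nodes before:", nrow(full_tree$frame), " after:", nrow(pruned_tree$frame), "\n")
```

Pruning afterwards has the advantage that the full tree only needs to be grown once, and several cp values can then be tried cheaply.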
Above you can see the final classification tree and the code used to create it. In my call to rpart, I have used the control argument to limit the complexity parameter to 0.00075, based on the complexity plot, and the maximum depth to 5. This has produced a much less complex tree, and its performance was similar to that of the previous complex tree.
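As a hedged illustration of that kind of call, the sketch below passes `rpart.control(cp = 0.00075, maxdepth = 5)` through the control argument. The cp and depth values are the ones quoted in this post; the data frame, its levels and the object names are invented stand-ins.

```r
library(rpart)

set.seed(11)  # reproducible toy data

# Hypothetical toy data, NOT the shelter data.
n <- 600
dogs <- data.frame(
  intake_type      = factor(sample(c("STRAY", "OWNER SURRENDER", "CONFISCATED"), n, replace = TRUE)),
  intake_condition = factor(sample(c("Healthy", "Treatable", "Unhealthy"), n, replace = TRUE)),
  kennel_status    = factor(sample(c("LAB", "AVAILABLE", "UNAVAILABLE"), n, replace = TRUE))
)
dogs$outcome <- factor(ifelse(
  dogs$kennel_status == "LAB" & runif(n) < 0.85, "DIED",
  ifelse(runif(n) < 0.6, "ADOPTED", "NOT ADOPTED")
))

# Pre-pruning: stop the tree from growing too far in the first place.
#   cp       - a split must improve the fit by at least this much to be kept
#   maxdepth - maximum depth of any node, with the root counted as depth 0
limits <- rpart.control(cp = 0.00075, maxdepth = 5)

model_4 <- rpart(outcome ~ ., data = dogs, method = "class", control = limits)

# rpart numbers nodes so that a node's depth is floor(log2(node number)).
node_ids  <- as.numeric(rownames(model_4$frame))
max_depth <- max(floor(log2(node_ids)))

cat("deepest node is at depth", max_depth, "\n")
```

Checking the depth from the node numbers is a quick sanity test that the limits were actually applied.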
This could be further developed with more data: does the sex of the dog have an effect on the results, or the size or type of dog? Small dogs could be more likely to be adopted, and certain types could be more likely to be euthanized. This could also be built on further by creating a random forest model.
Thanks for reading, and well done if you got to the end; it’s a bit longer than what I normally aim for. Please let me know your thoughts, or if there is anything I have missed or could have included.