Hello, and welcome to the next blog post. I was inspired by this week’s Tidy Tuesday dataset. I’m sure I have said this before, but if you want to learn rstats it’s a great resource, with a weekly dataset to practise your burgeoning skills on. This week’s data came from the Dallas open data project, and the particular dataset was from the Dallas animal shelter. I thought it would be great to create a model that, based on the information recorded about an animal when it arrived at the shelter, could predict what might happen to that animal.
Above is the structure of the data frame. The first thing to notice is that a number of different animals are logged in the dataset. Creating a model for all 5 types could get quite complicated, so I am going to focus on dogs. I expect dogs to make up most of the 35,000 or so observations anyway. The model type I think is best suited to this problem is a classification tree. A classification tree works by building a network of yes/no questions, with the various outcomes at the ends of the branches. It works well when you have lots of factor variables, which this dataset is full of.
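To make the yes/no idea concrete, here is a minimal sketch of fitting a classification tree with the `rpart` package. The data frame `toy` and the rule inside it are made up purely for illustration (they are not the shelter data), and choosing `rpart` is my assumption about a reasonable tool, not necessarily what the final model will use.

```r
library(rpart)  # a recommended package that ships with standard R installs

# Toy data, invented for illustration only (not the shelter data)
toy <- data.frame(
  chip_status = factor(rep(c("chip", "no chip"), each = 20)),
  intake_type = factor(rep(c("stray", "owner surrender"), times = 20))
)

# Made-up rule: chipped dogs go home, unchipped dogs get adopted
toy$outcome_type <- factor(
  ifelse(toy$chip_status == "chip", "returned to owner", "adoption")
)

# method = "class" asks for a classification tree rather than a regression tree
fit <- rpart(outcome_type ~ chip_status + intake_type,
             data = toy, method = "class")

# Printing the fit shows the yes/no splits and the class at each leaf
print(fit)

# Predict the outcome class for a new arrival
new_dog <- data.frame(
  chip_status = factor("chip", levels = levels(toy$chip_status)),
  intake_type = factor("stray", levels = levels(toy$intake_type))
)
pred <- predict(fit, newdata = new_dog, type = "class")
```

Because the toy outcome depends only on `chip_status`, the tree should need just one question to separate the two classes.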
Now we need to select the columns the model is going to be based on. They are summarised below:
animal_breed – identifies the type of dog. I think this is key information, as some breeds are more likely to be adopted than others.
intake_type – how the dog arrived at the shelter. This will clearly have an effect on the dog’s outcome.
intake_condition – the dog’s condition when it arrived. An unhealthy dog is possibly less likely to be adopted.
chip_status – whether the dog had a microchip. Dogs with microchips are more likely to be reunited with their owners.
animal_origin – where the animal was found, or how it came to the shelter.
outcome_type – finally, the most important column, as this is what we will be predicting.
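Keeping only the dogs and only these six columns could look roughly like this in base R. The data frame name `shelter`, the `animal_type` and `kennel_number` columns, and every value in the toy data are assumptions for illustration; check them against the real data before reusing this.

```r
# Toy stand-in for the shelter data; all values are invented
shelter <- data.frame(
  animal_type      = c("DOG", "CAT", "DOG"),   # assumed column name
  animal_breed     = c("LABRADOR RETR", "DOMESTIC SH", "PIT BULL MIX"),
  intake_type      = c("STRAY", "STRAY", "OWNER SURRENDER"),
  intake_condition = c("HEALTHY", "HEALTHY", "TREATABLE"),
  chip_status      = c("SCAN CHIP", "SCAN NO CHIP", "SCAN NO CHIP"),
  animal_origin    = c("FIELD", "OVER THE COUNTER", "OVER THE COUNTER"),
  outcome_type     = c("RETURNED TO OWNER", "ADOPTION", "ADOPTION"),
  kennel_number    = c("A1", "B2", "C3"),      # hypothetical unused column
  stringsAsFactors = FALSE
)

# The six columns described in the list above
model_cols <- c("animal_breed", "intake_type", "intake_condition",
                "chip_status", "animal_origin", "outcome_type")

# Keep only the dog rows, and only the modelling columns
dogs <- shelter[shelter$animal_type == "DOG", model_cols]
```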
Now that we have our columns selected, we need to prepare the data. The first thing to look at is the outcome column, as I wanted to make sure there are not too many outcomes for the model to predict. Looking back at the structure of the column, there are 12 separate outcomes. This is far too many, so let’s see if we can group some together. Below is a summary of the different values in the outcome_type column, and I think there is definitely scope to group some of them.
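A tally of this kind can be produced with `table()`. The sketch below uses a tiny made-up `dogs` data frame, so the counts are not the shelter’s real numbers.

```r
# Toy data: invented outcome labels, not the real shelter counts
dogs <- data.frame(
  outcome_type = c("ADOPTION", "EUTHANIZED", "ADOPTION",
                   "TRANSFER", "ADOPTION", "DIED"),
  stringsAsFactors = FALSE
)

# Count each outcome and list the most common first
outcome_counts <- sort(table(dogs$outcome_type), decreasing = TRUE)
print(outcome_counts)
```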
Dead on arrival should be excluded: if the dog is dead on arrival, that is the outcome and there is nothing to predict. Died and Euthanized I am going to group into a single died outcome, as predicting how the dog died is beyond the scope of this model. Foster, transfer and other will be grouped under unadopted. The other outcome types will then be filtered out.
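The regrouping described above can be sketched with a named lookup vector. The exact spelling of the labels is an assumption, as is the choice to keep adoption and returned to owner as classes of their own (the text does not spell this out), and "MISSING" is just an invented example of an outcome that gets dropped.

```r
# Toy data; the label spellings are assumed, not taken from the real file
dogs <- data.frame(
  outcome_type = c("ADOPTION", "DIED", "EUTHANIZED", "FOSTER", "TRANSFER",
                   "OTHER", "RETURNED TO OWNER", "DEAD ON ARRIVAL", "MISSING"),
  stringsAsFactors = FALSE
)

# Lookup from raw label to grouped label.
# Any label not listed here maps to NA and is filtered out below.
outcome_map <- c(
  "DIED"              = "died",
  "EUTHANIZED"        = "died",
  "FOSTER"            = "unadopted",
  "TRANSFER"          = "unadopted",
  "OTHER"             = "unadopted",
  "ADOPTION"          = "adopted",            # assumption: kept as its own class
  "RETURNED TO OWNER" = "returned to owner"   # assumption: kept as its own class
)

dogs$outcome_group <- unname(outcome_map[dogs$outcome_type])

# Drop dead on arrival and anything else without a group
dogs <- dogs[!is.na(dogs$outcome_group), , drop = FALSE]
dogs$outcome_group <- factor(dogs$outcome_group)
```

A lookup vector keeps the grouping rules in one place, so moving a label to a different group later is a one-line change.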
The final thing in this opening blog on preparing the data is the animal_breed column. There are over 100 different dog breeds in this column, which would be unmanageable for the classification tree. On close inspection, the column consists of either an individual dog breed or mixed, which I assumed means not pedigree. I decided to convert this to a column recording whether the dog is pedigree or cross breed. Therefore the final data preparation code is below:
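As an illustration of the breed step only, collapsing the column to two levels might look like the sketch below. It assumes that mixed breeds can be spotted by the text "MIX" in the label; the breed values and the new column name `breed_group` are made up, so treat this as a sketch rather than the exact code used.

```r
# Toy data; breed labels are invented for illustration
dogs <- data.frame(
  animal_breed = c("LABRADOR RETR", "MIXED", "PIT BULL MIX", "CHIHUAHUA SH"),
  stringsAsFactors = FALSE
)

# Assumption: any breed label containing "MIX" denotes a cross breed
is_mixed <- grepl("MIX", dogs$animal_breed, ignore.case = TRUE)

# Two-level factor in place of 100+ individual breeds
dogs$breed_group <- factor(ifelse(is_mixed, "cross breed", "pedigree"))

print(table(dogs$breed_group))
```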
That’s it for today’s opening post. Tomorrow we will look at the results of the model and how, if required, I optimise it.