Does the Dog Get Adopted?? — P2

Today we are going to look at the second part of creating a classification tree for the outcomes of dogs in the Dallas animal shelter. Today it’s the exciting stuff: creating the actual classification tree. If you want to understand how I prepared the data, go and check out the first blog, where I go into the data preparation in detail.

As previously mentioned, we are using the classification tree method, and the columns we will initially base the outcome on are intake_type, intake_condition, chip_status, animal_origin and pedigree. I am using the rpart package to create the classification trees and will split the data so there’s 75% to train the tree on and 25% to test it on.
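To make the 75/25 split concrete, here is a minimal sketch in Python (not the R code used for the model in this post). The function name, the seed and the stand-in records are all assumptions made for illustration.

```python
import random

def train_test_split(rows, train_fraction=0.75, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical stand-in for the prepared dog records
dogs = [{"id": i} for i in range(100)]
train, test = train_test_split(dogs)
print(len(train), len(test))  # 75 25
```

Shuffling before cutting matters here: if the records are stored in intake order, taking the first 75% would train on older dogs and test on newer ones.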

carbon (9).png

classtree1.png

Above you can see the code and resulting classification tree for the first model. One thing immediately obvious in the first classification tree is that it’s highly complex and possibly overfitted to the training data. Let’s check how this tree performs:

performance

Well, that’s not great at all. This classification tree seems to be barely better than random chance! This really isn’t ideal and means the model is currently pretty much worthless. Let’s have a look at what we can do to improve this.

The first thing I am going to do is look at the intake_condition column:

intecond.PNG

There are 7 different categories within this column; however, I think this can be simplified into Healthy, Treatable and Unhealthy. So let’s do this and check the results:

class2
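The regrouping of intake_condition into three levels could be sketched like this in Python (again, not the R code behind the tree above). The raw category strings are an assumption: I am supposing each one starts with the word healthy, treatable or unhealthy, and the real spellings may differ.

```python
def simplify_condition(raw):
    """Collapse a detailed intake condition to its first word.

    Assumes each raw value starts with one of: healthy, treatable, unhealthy.
    """
    first_word = raw.strip().split()[0].title()
    if first_word not in {"Healthy", "Treatable", "Unhealthy"}:
        raise ValueError(f"unexpected intake condition: {raw!r}")
    return first_word

# Assumed examples of raw values, for illustration only
raw_conditions = [
    "HEALTHY",
    "TREATABLE REHABILITABLE NON-CONTAGIOUS",
    "UNHEALTHY UNTREATABLE CONTAGIOUS",
]
print([simplify_condition(c) for c in raw_conditions])
# ['Healthy', 'Treatable', 'Unhealthy']
```

Raising an error on anything unexpected is deliberate: a silent fallback would hide categories that don’t fit the three-level scheme.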

Success, a much simpler tree. However:

results2.PNG

The accuracy of the model has gone down! It is now less accurate than random chance, so I am actually just wasting my time here. Let’s take a step back and look at the table which shows the predicted outcome against the actual outcome. As you can see, the model is currently predicting lots of dogs which died or were not adopted as being adopted.

comp
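For anyone who wants to build this kind of predicted-against-actual table themselves, here is a small Python sketch. The labels are made up purely for illustration and are not the model’s real predictions. The majority-class baseline is my own addition, as one possible yardstick for “better than chance”.

```python
from collections import Counter

def confusion_table(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Share of rows where the prediction matches the actual outcome."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def majority_baseline(actual):
    """Accuracy of always guessing the most common outcome."""
    return Counter(actual).most_common(1)[0][1] / len(actual)

# Made-up labels, not results from the post
actual    = ["adopted", "died", "unadopted", "died", "adopted", "unadopted"]
predicted = ["adopted", "adopted", "adopted", "died", "adopted", "unadopted"]

table = confusion_table(actual, predicted)
print(table[("died", "adopted")])   # dogs that died but were predicted adopted
print(accuracy(actual, predicted))
print(majority_baseline(actual))
```

Looking at the off-diagonal cells, such as ("died", "adopted"), shows which kind of mistake the model is making, which a single accuracy number hides.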

I reviewed the composition of each column in the data frame after filtering for dogs that were predicted to be adopted but actually died. The biggest difference was seen in the condition column. Apparently, a lot of the dogs that died were treatable. How can that be?

invest1

I took a step back, went back to the original dataset and filtered for the dogs which are treatable, this time making no change to the outcome_type column, as I guessed this could be where the problem was. The above graph looks at the outcomes of the dogs which are classed as treatable. There are clearly a lot of dogs euthanized, which is possibly where the confusion is coming from: these are classed as died, whereas by normal logic you would expect treatable dogs to survive. This is interesting as it highlights how the decisions you make at the start of any analysis can affect it later on. The next question is whether there is another column in the dataframe that can be used to identify the euthanized dogs.

Rplot01.png

I think I found it with the kennel_status column. By far the most common kennel status for the euthanized dogs is the lab. Therefore we are going to add kennel_status to the analysis and see where it goes:

class3

 

results3

Success, the model is now much better at predicting the outcome for each dog at the shelter. However, the classification tree is now back to being over-complex and could possibly overfit the training data. Next, I will see if this tree can be pruned.

complexplot

Above you can see the complexity plot for the overly complex classification tree. The tree doesn’t improve much beyond 4 to 7 levels and a complexity of around 0.00075. Pruning can be done either before or after the model is created. Here I am going to do the pruning up front, so I am going to run the fourth and hopefully final version of the model:

finaltree

carbon (11).png

Above you can see the final classification tree and the code used to create it. In my call to rpart, I have used the control argument to limit the complexity parameter to 0.00075, based on the complexity plot, and the max depth to 5. This has produced a much less complex tree, and performance was similar to the previous complex tree.

This could be further developed with more data: does the sex of the dog have an effect on the results, or the size or type of dog? Small dogs could be more likely to be adopted and certain types could be more likely to be euthanized. It could also be built on further by creating a random forest model. Thanks for reading, and well done if you got to the end; it’s a bit longer than what I normally aim for. Please let me know your thoughts, or if there is anything I have missed or could have included.


Does the Dog Get Adopted?? — P1

Hello, welcome to the next blog. I was inspired by this week’s Tidy Tuesday dataset. I’m sure I have said this before, but if you want to learn rstats it’s a great resource, with a weekly dataset to practise your burgeoning skills on. This week’s data was from the Dallas open data project, and the particular dataset was from the Dallas animal shelter. I thought, wouldn’t it be great to create a model which, based on the information about the animal when it arrived at the shelter, could predict what might happen to the animal?

dataframe structure

Above is the structure of the dataframe. The first thing to note is that there are a number of different animals logged in the dataset. Creating a model for the 5 different types could be quite complicated, therefore I am going to focus on dogs. I think dogs will make up most of the 35000 or so observations as well. The model type I think is most suited to this problem is a classification tree. A classification tree works by building a network of yes or no questions with the various outcomes at the end. It works well when you have various factor variables, which this dataset is full of.
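To give a feel for how a tree chooses its yes or no questions, here is a minimal Python sketch of scoring one candidate question with Gini impurity, a common purity measure for classification trees. The records are made up, and this is an illustration of the general idea rather than the code used for the model.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 when all labels agree, higher when they are mixed."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(rows, column, value, target="outcome"):
    """Weighted impurity after asking 'is row[column] equal to value?'."""
    yes = [r[target] for r in rows if r[column] == value]
    no = [r[target] for r in rows if r[column] != value]
    n = len(rows)
    return len(yes) / n * gini(yes) + len(no) / n * gini(no)

# Made-up records for illustration only
rows = [
    {"chip_status": "chip", "outcome": "adopted"},
    {"chip_status": "chip", "outcome": "adopted"},
    {"chip_status": "no chip", "outcome": "unadopted"},
    {"chip_status": "no chip", "outcome": "died"},
]
before = gini([r["outcome"] for r in rows])
after = split_impurity(rows, "chip_status", "chip")
print(before, after)  # 0.625 0.25
```

A tree builder tries many such questions and keeps the one that lowers the impurity the most, then repeats the process inside each branch.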

Now we need to select the columns the model is going to be based on, summarised below:

animal_breed – identifies the type of dog. I think this is key information, as some breeds of dog are more likely to be adopted than others.

intake_type – how the dog arrived at the shelter. This will clearly have an effect on the dog’s outcome.

intake_condition – what the dog’s condition was when it arrived. An unhealthy dog is possibly unlikely to be adopted.

chip_status – whether the dog had a microchip. Dogs with microchips are more likely to be reunited with their owners.

animal_origin – where the animal was found or how it came to the shelter.

outcome_type – finally, the most important column, as this is what we will be predicting.

Now we have our columns selected we need to prepare the data. The first thing we will look at is the outcome column, as I wanted to make sure there are not too many outcomes to predict. When you look back at the structure of the column there are 12 separate outcomes, which is far too many, so let’s see if we can group some together. Below is a summary of the different values in the outcome_type column, and I think there is definitely scope to group some together.

summary

Dead on arrival should be excluded: if the dog is dead on arrival, that is the outcome and there’s nothing to predict. Died and Euthanized I am going to group into a single died outcome, as predicting how the dog died is beyond the scope of this prediction. Foster, transfer and other will be grouped under unadopted. The remaining outcomes, other than adoption, will then be filtered out of the list.
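The grouping described above can be sketched as a simple lookup in Python (the actual preparation for this post was done in R). The labels are written the way this post names them; the exact spellings and capitalisation in the raw dataset are an assumption.

```python
# Outcome labels as named in the post; raw dataset spellings may differ.
OUTCOME_GROUPS = {
    "Adoption": "adopted",
    "Died": "died",
    "Euthanized": "died",
    "Foster": "unadopted",
    "Transfer": "unadopted",
    "Other": "unadopted",
}

def group_outcome(raw):
    """Return the grouped outcome, or None if the row should be dropped."""
    return OUTCOME_GROUPS.get(raw.strip().title())

# Hypothetical raw values, including one that gets filtered out
records = ["ADOPTION", "Euthanized", "dead on arrival", "Transfer", "Died"]
grouped = [g for g in map(group_outcome, records) if g is not None]
print(grouped)  # ['adopted', 'died', 'unadopted', 'died']
```

Anything not in the lookup, such as dead on arrival, comes back as None and is dropped, which mirrors the filtering step.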

The final thing in this opening blog on preparing the data is the animal_breed column. There are over 100 different dog breeds in this column, which would be unworkable for the classification tree. On close inspection, the column consists of either an individual dog breed or mixed, which I assumed means not pedigree. I decided to convert this to a column marking the dog as either pedigree or cross breed. The final data preparation code is below:

carbon (8)
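As a rough Python illustration of the pedigree step (separate from the R code in the image above), the breed column can be reduced to two levels by checking for the word mixed. The example breed strings are hypothetical, and treating any value containing “mixed” as a cross breed is my assumption about how the column is written.

```python
def pedigree_status(breed):
    """Label a breed as 'cross breed' if it mentions 'mixed', else 'pedigree'."""
    return "cross breed" if "mixed" in breed.lower() else "pedigree"

# Hypothetical breed values for illustration
breeds = ["Labrador", "MIXED", "Chihuahua Mixed"]
statuses = [pedigree_status(b) for b in breeds]
print(statuses)  # ['pedigree', 'cross breed', 'cross breed']
```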

That’s it for today’s opening post. Tomorrow we will look at the results of the model and how, if required, I optimise it.

World Cup Group C

Hi there, welcome to the next in the series of little previews ahead of the FIFA World Cup. Today we are dissecting the 4 teams in group C: France, Peru, Denmark and Australia. Please do check out the other previews; further previews are coming at 6 pm every day ahead of the first game.

agec

On the face of it, these look to be some of the youngest squads in the tournament. Australia seems to have players from both ends of the spectrum and a good grouping of peak-age players. France has probably the lowest median age across all squads in the competition. Peru doesn’t have too many players between 20 and 25, but has a good grouping between 25 and 28.

capsc

Looking at the distribution of caps in each squad, it looks like all four teams have relatively inexperienced players. Denmark has the most players with around 25 caps. They also show the familiar trend of a spike higher up, showing a good number of experienced pros, vital in any squad make-up. Peru seems to have the most players with experience in their squad, which could stand them in good stead to get out of the group. The big question for France is whether their lack of experience will affect them later in the competition.

compc

Finally, looking at squad composition, France and Australia seem to have the most attackers. France has done this by bringing fewer midfielders, Australia by bringing fewer defenders. Peru seems to have gone in a totally different direction to the rest of the teams, with a squad overloaded with midfielders. Most are attacking midfielders, so they should still have goalscoring options.

probc

Now we look at each team’s probability of getting out of the group and winning the tournament. Finally, we have a group that on the face of it could be quite competitive, for second place at least. Denmark is a clear favourite, but both Peru and Australia seem to have good outside chances, at least according to the bookies. France has a decent chance of winning the whole tournament and is currently 4th favourite, so it will be interesting to see how they do with their young squad.

That’s it for today’s overview. Let me know your thoughts: how far do you think France will go, and who do you think will get out of the group?

World Cup Group B

Today we are going to look at group B in the World Cup. This is part of my series reviewing each squad in the World Cup in order to assess strengths and weaknesses and understand squad make-up. If you haven’t seen the other blogs, go and check them out: group A went live yesterday and the other groups will follow over the coming days. Group B consists of Spain, Portugal, Iran and Morocco.

ageb

The first thing to look at is the age composition of the 4 squads. Interestingly, Iran seem to generally have the youngest squad in the group, with the lowest median. Spain seem to have the largest grouping around peak age, between 27 and 30. Morocco, despite having the highest median, have the youngest players in the group. Portugal have some young players but also some of the older players, with a lot of squad members above 30.

capsb

Looking at the experience of the players, Morocco clearly look the least experienced squad. This could be because of the high number of younger players compared to the other teams. Spain and Portugal have similar caps profiles, with a group of inexperienced players complemented by a few experienced players.

compostionb

The main thing that’s interesting with the squad composition of the 4 teams is that Portugal and Spain have the same composition. Is this a template that the bigger countries seem to be following? There is also an increase in attacking players in these 4 squads compared to Group A, which should mean these are all better balanced. In fact Morocco, Portugal and Spain have the same number of attackers. With Iran having more midfielders than any other team, this could give them more options; however, a lack of attackers could harm them if chasing a game.

chanceb

Finally we look at the chances of each team qualifying from the group and the chance of winning the World Cup. Qualification from the group looks like pretty much a done deal. Portugal and Spain look to have by far the strongest chances of qualifying from the group. This could make this group not too interesting for spectators. Portugal and Spain do play each other in the first game which, if there is a loser, could add extra pressure when they come to play Morocco or Iran. I’m surprised by Portugal’s low chance of winning the title compared to Spain. Portugal are the reigning European champions and have the mercurial talent of Cristiano Ronaldo. Spain, however, look to be one of the big favourites, so it will be interesting to see how they do after the last World Cup’s total failure.

That’s it for the group B overview. If you have any questions or comments, or any ideas of other things I should look at, let me know.

World Cup Group A

Hello, welcome to the first of my blogs looking at each group in the World Cup. Over the next 8 blogs I hope to dissect each country’s squad and finally look at their chances of progressing and winning the cup. So today we start with group A, which contains hosts Russia, Uruguay, Saudi Arabia and Egypt.

groupaage

The first thing to look at is the age range of each squad in group A. All 4 teams have a median around the same area. As you can see, Egypt have a 45 year old player, one of their GKs, who is the oldest player in any squad in the tournament. Uruguay and Egypt tend to have some younger players than Russia and Saudi Arabia. Saudi Arabia look to have generally one of the older squads in the tournament.

caps

Next we look at the caps all the players in the squad have received. It’s clear Russia has the most players with little international experience. Will they struggle to cope with the pressure of playing in front of a home crowd? Saudi Arabia, despite having the oldest squad of the 4 teams, seem to have generally the least experienced team. Uruguay, however, seem to have a good balance, with experience at all different levels.

asquadcomp

Looking at each team’s squad composition, clearly all have the same percentage of GKs, as 3 GKs is stipulated in the rules. One thing that’s clear is Egypt, Russia and Saudi Arabia have a low number of strikers within their squads. All three have just 3 recognised strikers; will this leave them all struggling to score goals? Egypt also seem to have the fewest midfielders compared to the rest of the group, with an increase in defenders. This will give Egypt lots of options in defence in case of injuries, but could leave them exposed if they need to make changes from the bench to try and win games.

chance

Finally we use the implied probability from the betting odds to look at the chance of each team getting out of the group and winning the tournament. Overall, group A seems to have no teams really capable of mounting a serious challenge for the World Cup, with Russia, even with home advantage, rated lower than Uruguay. It also doesn’t seem to be a particularly close group for qualification: the clear favourites are Uruguay and Russia. The wild card in this is Egypt. If Mo Salah is fit for the tournament, then expect their chances compared to Russia to increase considerably. If he isn’t, expect this to be a pretty straightforward group.
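For anyone curious how implied probability falls out of betting odds, here is a minimal Python sketch. It assumes decimal odds on a single-winner market (such as winning the group), and the odds themselves are made up rather than the ones behind the chart above. The rescaling step is one common way of handling the bookmaker’s margin, not necessarily what was done for these posts.

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds to probabilities that sum to 1.

    1 / odds gives the raw implied probability. The raw values add up to
    more than 1 because of the bookmaker's margin, so they are rescaled.
    """
    raw = {team: 1 / odds for team, odds in decimal_odds.items()}
    total = sum(raw.values())
    return {team: p / total for team, p in raw.items()}

# Made-up decimal odds, not the odds used in the post
odds = {"Uruguay": 2.0, "Russia": 2.5, "Egypt": 5.0, "Saudi Arabia": 20.0}
probs = implied_probabilities(odds)
for team, p in probs.items():
    print(f"{team}: {p:.1%}")
```

For a market where two teams qualify, the probabilities should sum to 2 rather than 1, so the rescaling would need adjusting.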

That’s it for today’s look at group A. Do you think any teams in this group can go far? Let me know your thoughts in the comments. Group B will follow tomorrow.

Middlesbrough Performance Review

Hello, welcome to the next blog. If this is your first time here then please have a read of all the other blogs on here and let me know your thoughts, anything I haven’t spotted, or things you want looked at. Today we are going to look at the performances of the Middlesbrough first team throughout the season.

For this I have used the rating each player gets on whoscored.com. Overall it was a season to forget for Middlesbrough. Ahead of the season the chairman had promised they would smash the league, and the £40 million spent in the transfer market seemed to suggest that could be possible. However, they ended up 25 points behind winners Wolves and were easily knocked out by Aston Villa in the play-off semi-final. The idea here is to look at performances over the full season and see whether the change of managers had an effect: did performances improve? I also want to see which areas of the team generally performed well and which didn’t, which might provide insight into where the team could be improved in the transfer market.

season perf

Above you can see box plots for each league game of the season, including the play-offs. Generally the team played better in the wins than in the defeats. How’s that for an earth-shattering conclusion! What is interesting is that the team had two managers during the season, and it does look like the performances were more consistent under Tony Pulis. Let’s look at this in more detail.

density plot

Now let’s look at performances under both managers. The density plot above shows there really wasn’t much difference. The players generally performed at the same level under both managers; however, Pulis seemed to be able to get more out of them when it comes to ratings above 8.

players

25 different players started a game for the team this season, with one player the clear outstanding performer: Adama Traore. However, Traore also has the largest spread of performances, showing he can be an inconsistent performer. It also seems that attacking players are generally more consistent performers. What will be disappointing is that Ben Gibson seems to be the worst performing defender in the team overall. In midfield it looks pretty close between Adam Clayton and Jonny Howson for the best median performances; however, Clayton looks to be much more consistent.
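The median and spread comparisons above are what a box plot summarises. As a rough Python sketch of those numbers, with made-up ratings rather than real whoscored.com values and hypothetical player names:

```python
import statistics

def summarise(ratings):
    """Median and interquartile range, the numbers a box plot is built from."""
    q1, _, q3 = statistics.quantiles(ratings, n=4)
    return {"median": statistics.median(ratings), "iqr": q3 - q1}

# Made-up match ratings for two hypothetical players
ratings = {
    "Player A": [6.0, 6.5, 7.0, 7.5, 8.0],
    "Player B": [6.9, 7.0, 7.0, 7.0, 7.1],
}
summary = {name: summarise(values) for name, values in ratings.items()}
most_consistent = min(summary, key=lambda name: summary[name]["iqr"])
print(most_consistent)  # Player B
```

Both players here have the same median, so it is only the interquartile range that separates the consistent performer from the erratic one.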

Overall it’s interesting to review the players’ performances over the season. It could be interesting to stretch this further to look at other teams, or at previous seasons for specific players. It could also be drilled down further into home and away performances. Let me know your thoughts or if you have any questions; I really would like to hear from you.

Formula 1 – The Competitive Picture

Hello, welcome to this blog. Today we are going to look at something we haven’t looked at yet in this blog: Formula 1. I have watched F1 since 1997 and often wondered, whenever they say “we reviewed the data”, what exactly the data is and what process they use to review it. Now sadly I don’t have access to anything like the data F1 teams have (one day maybe!), however the main piece of data is freely available: the qualifying time. I decided I wanted to have a look at the competitive picture, and now we are 4 races in that’s a decent sample size.

So to do this analysis I took each driver’s fastest lap from the 4 qualifying sessions so far. I then added them all up to get each driver’s total qualifying time. The result is plotted in the graph below:

F11
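The adding-up step is simple enough to sketch in a few lines of Python. The drivers and lap times below are made up, and the tuple layout is just an assumption about how the data might be held.

```python
from collections import defaultdict

def total_qualifying_times(laps):
    """Sum each driver's fastest qualifying lap (in seconds) across races."""
    totals = defaultdict(float)
    for driver, _race, seconds in laps:
        totals[driver] += seconds
    return dict(totals)

# Made-up fastest laps: (driver, race, seconds)
laps = [
    ("Driver A", "Race 1", 81.2),
    ("Driver B", "Race 1", 81.5),
    ("Driver A", "Race 2", 87.9),
    ("Driver B", "Race 2", 87.8),
]
totals = total_qualifying_times(laps)
ranking = sorted(totals, key=totals.get)  # fastest total first
print(ranking)  # ['Driver A', 'Driver B']
```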

So after 4 races Vettel has the lowest total qualifying time, closely followed by Hamilton. What is clear from this is the large gap between the top 6 drivers from the top 3 teams and the rest. Also, apart from Ferrari and Mercedes being mixed up, every other team’s drivers sit together, 2 by 2. This is surprising considering the small gaps between teams in the midfield. The next question I had was about the differences between team mates, as in Formula 1 your main rival to beat is always your team mate.

f12

The graph above shows the difference between each team’s drivers, with points at the top right showing a smaller difference than those at the bottom left. The team with clearly the closest matched drivers is Red Bull, with 0.07 seconds between them. This is good news for Ricciardo in particular, who can use this information to increase his value in his contract talks. At the other end there is big pressure on Stoffel Vandoorne and Kimi Raikkonen. Both have been over a second in total behind their team mates, which if it carries on could see them losing their seats.
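The team mate comparison boils down to a subtraction per team. Here is a small Python sketch of it; the totals, driver names and team names are all made up for illustration.

```python
def teammate_gaps(totals, teams):
    """Absolute gap in total qualifying time between each team's two drivers."""
    return {
        team: abs(totals[first] - totals[second])
        for team, (first, second) in teams.items()
    }

# Made-up total qualifying times in seconds
totals = {
    "Driver A": 340.10, "Driver B": 340.17,
    "Driver C": 343.50, "Driver D": 344.80,
}
teams = {"Team X": ("Driver A", "Driver B"), "Team Y": ("Driver C", "Driver D")}

gaps = teammate_gaps(totals, teams)
closest = min(gaps, key=gaps.get)
print(closest, round(gaps[closest], 2))  # Team X 0.07
```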

I’m going to keep this dataset up to date as the season goes on, and I have similar information for total race time. I think there’s more information you can derive from this, such as who’s developing their car the best. Please let me know your thoughts or if you have any questions; I like to hear feedback.