Tidy Tuesday – Films Dataset

Hello, today we are going to be looking at this week's Tidy Tuesday dataset. This is just a quick EDA, as I got a bit carried away with the dataset: I initially set out to do just one interesting graph but kept finding more and more interesting insights. Below you can see the structure of the dataset:

[Figure: structure of the dataset]

As you can see, it has 3,401 observations of 9 variables, all different films. First of all, let's look at how the production budgets are distributed with a histogram:

[Figure: histogram of production budgets]

We can see that the vast majority of films in this dataset have a budget of less than 25 million dollars. But how does that change for each genre in the dataset?

[Figure: histograms of production budget by genre]

Now you can see some interesting insights. Comedy, drama and horror films have clear peaks at the lower end of the budget range, while action and adventure films are spread much more evenly across all production budgets.
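For anyone following along in R, a faceted histogram like the one above takes only a few lines of ggplot2. This is a minimal sketch; the data frame name `films` and the column names `production_budget` and `genre` are my assumptions about the cleaned data.

```r
library(ggplot2)

# Production budget histograms, one panel per genre
# (`films`, `production_budget` and `genre` are assumed names)
ggplot(films, aes(x = production_budget / 1e6)) +
  geom_histogram(binwidth = 25) +
  facet_wrap(~ genre) +
  labs(x = "Production budget ($ million)", y = "Number of films")
```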

[Figure: production budget vs gross by genre]

Now let's look at how the production budget influences how much the film grosses. Action and adventure have the steepest slopes, so on average the more money put into these films, the higher the reward.
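A scatter plot with a straight-line fit per genre is one way to get those slopes. Again a sketch only, assuming a `worldwide_gross` column alongside the names used above.

```r
library(ggplot2)

# Budget vs gross with a linear trend per genre (`worldwide_gross` is an assumed column)
ggplot(films, aes(x = production_budget / 1e6, y = worldwide_gross / 1e6, colour = genre)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Production budget ($ million)", y = "Worldwide gross ($ million)")
```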

[Figure: release dates by genre]

Above you can see when each film is released during the year. Notice the peak for horror in the tenth month: Halloween. Drama seems to increase toward the end of the year, in time for Oscar season, and adventure has a peak in July for the summer blockbusters and another in December, perhaps aiming for the holiday season.

[Figure: median profit percentage per month by genre]

Finally, for now, let's look at the median profit percentage per month; can we get any idea of when it's best to release a particular genre? Action, adventure and comedy seem to have two peaks: one in the middle of the year and one towards the end. Horror generally has the highest median earnings. This dataset is a simple one, but one in which insights are easy to come by. I could definitely write at least another blog with more information I have found from this dataset. Another day perhaps.
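The monthly medians above come down to a group-and-summarise. A rough dplyr sketch, where the profit definition and the column names are my assumptions:

```r
library(dplyr)
library(lubridate)

# Median profit percentage per release month and genre
# (profit definition and column names are assumptions)
films %>%
  mutate(
    month      = month(release_date, label = TRUE),
    profit_pct = 100 * (worldwide_gross - production_budget) / production_budget
  ) %>%
  group_by(genre, month) %>%
  summarise(median_profit_pct = median(profit_pct, na.rm = TRUE))
```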


Biketown EDA – P2

Hello, welcome to the second part of this blog doing exploratory data analysis on the Biketown dataset. If you haven't read the first one, go check it out. As an overview, we found that most of the records were either subscribers to the system or casual users who might just use it every now and then, so I decided to compare those two groups. We saw in the smaller sample that most of the casual users used the bikes for recreation and the subscribers used them mostly for commuting. Today we are going to look at the distances and speeds the groups travel and where they rent the bikes from, but first we are going to look at when they rent the bikes:

[Figure: rentals by hour of day for each payment plan]

This graph totally fits the idea that subscribers tend to rent the bikes for commuting, as there are two large spikes at around 8am and around 5pm, the peak hours for people going to and from work. Casual users don't tend to use the system in the morning; however, there's consistent usage throughout the afternoon. This could be tourists exploring the city, for instance. It's strange that both groups don't reach their minimum until well after 2am. Are these people using the system to get home from their nights out? Beware of the drunk Portland cyclist!
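The hourly breakdown is essentially a count by hour and payment plan. A minimal sketch, assuming a `trips` data frame with `StartTime` stored as an "HH:MM" string and a `PaymentPlan` column:

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Rentals per hour of day, split by payment plan
# (`trips`, `StartTime` and `PaymentPlan` are assumed names/formats)
trips %>%
  mutate(start_hour = hour(hm(StartTime))) %>%
  count(PaymentPlan, start_hour) %>%
  ggplot(aes(x = start_hour, y = n, colour = PaymentPlan)) +
  geom_line() +
  labs(x = "Hour of day", y = "Number of rentals")
```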

[Figure: density plot of trip distances by payment plan]

The density plot of the distances the two groups travel is interesting, but I think it does fit the pattern so far. Subscribers are looking to do shorter journeys because they might only be covering the last few miles to work, whereas the casual users are maybe exploring the city and therefore cover more distance.

[Figure: speed vs distance curves by payment plan]

Now let's check out the speed curves for both groups: the further a person travels, the slower their average speed. Subscribers are clearly the faster riders at all distances. The increase in speed towards the 25-mile mark is, I think, down to the small amount of data out there. At shorter distances subscribers are significantly faster; are they using the system to get from A to B as quickly as possible?
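One way to sketch the speed curves is to derive a speed from distance and duration and let geom_smooth draw one curve per group. The column names and units (miles and minutes) are assumptions:

```r
library(dplyr)
library(ggplot2)

# Average speed against trip distance, one smoothed curve per payment plan
# (assumes Distance_Miles in miles and Duration in minutes; names are assumptions)
trips %>%
  mutate(speed_mph = Distance_Miles / (Duration / 60)) %>%
  ggplot(aes(x = Distance_Miles, y = speed_mph, colour = PaymentPlan)) +
  geom_smooth() +
  labs(x = "Trip distance (miles)", y = "Average speed (mph)")
```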

[Figure: heat map of casual riders' start locations]

Now let's look at the start and end locations for the casual riders. The heat map above shows that most trips are clustered around the city centre, possibly moving from one tourist spot to another. The distribution across the central locations is fairly even.

[Figure: heat map of subscribers' start locations]

The subscriber heat map above shows that subscribers generally take the bikes from much further out than the casual riders, who are possibly just using the system to get between the tourist hot spots. Once again most of the usage is on the left side of the river, which must be the area where people like to get about. The start locations further out are also much denser, so it's clear people are using the system to ride in from the outskirts towards the centre of the city.
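One way to sketch a heat map like these is a 2-D density layer over the start coordinates; the column names below are assumptions and no basemap is drawn here.

```r
library(dplyr)
library(ggplot2)

# 2-D density "heat map" of start locations for subscribers
# (StartLongitude / StartLatitude are assumed column names)
trips %>%
  filter(PaymentPlan == "Subscriber") %>%
  ggplot(aes(x = StartLongitude, y = StartLatitude)) +
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon", alpha = 0.4) +
  coord_quickmap() +
  labs(x = "Longitude", y = "Latitude", fill = "Density")
```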

That's it for this exploratory data analysis. In this blog I think we have found some interesting insights and, at a minimum, been able to confirm what you would expect. I hope you have found this interesting and informative; let me know your thoughts, and if you have looked at this dataset yourself, let's see what you found. Check out the code on my GitHub, which should be linked somewhere on the website.

Nike Biketown – Exploratory Data Analysis

Hello, welcome to today's blog, in which we are going to take a large dataset and do some exploratory data analysis on it. I am going to look at the Biketown dataset, which featured on Tidy Tuesday. If you're new around here, Tidy Tuesday is a hashtag on Twitter which the R for Data Science online learning community actively promotes every Tuesday. If you're inspired to learn R and data science like I was, it is a really great community full of wonderful people to start with. I am not going to post code snippets within the blog as I think it gets too long; however, the full code used will be posted on my GitHub.

[Figure: structure of the data frame]

Above is the structure of the data frame. The data comes in numerous CSV files, so I read them all in and created one large data frame structured like so. The second column, the payment plan, seems an interesting one: it has three values, casual, subscriber and blank. The system in Portland has a way for a regular user to automate payments to save time. Let's look at how much of the dataset falls under each of the three payment types:
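Reading all the files and stacking them into one data frame is a one-liner with purrr. A sketch, where the folder path is an assumption:

```r
library(readr)
library(purrr)

# Read every CSV in the download folder and stack them into one data frame
# (the folder path is an assumption)
csv_files <- list.files("data/biketown", pattern = "\\.csv$", full.names = TRUE)
trips <- map_dfr(csv_files, read_csv)
```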

[Figure: count of records by payment plan]

As you can see, the vast majority of this dataset is either casual or subscriber, and I think it would be interesting to review the differences between the people on the two main payment plans. Therefore, going forward in the EDA we are going to remove the entries without a recorded payment plan. After this, we could possibly look at a method for working out which payment plan the blanks belong to. First things first, let's have a look at what type of trips either group takes:

[Figures: trip type counts for casual and subscriber users]

The big issue here is that drawing any conclusion on the type of trips each group takes is going to be difficult: each group has over 200 thousand entries, but fewer than 1,000 trips per group have a recorded trip type. What we can say is that, within this smaller sample, subscribers tend to use the system for commuting and casual users use it for recreation, which makes total sense.

[Figure: payment methods used by each group]

Now we look at the payment methods that both groups have used. By far the three main payment types for both groups are keypad, mobile and keypad_rfid_card, with subtle differences between the two groups. The RFID card is clearly higher among subscribers, which must be because subscribers are given a card in order to gain access to the bikes. Casual users, meanwhile, are much more likely to use their mobile to gain access. In both groups the vast majority use the keypad system.

That's it for part 1 of today's exploratory data analysis on the Biketown data. Tomorrow we will look at the distances the groups travel, as well as locations and usage times. Let me know your comments on the first part.

Does the Dog Get Adopted?? — P2

Today we are going to look at the second part of creating the classification tree for the outcomes of dogs in the Dallas animal shelter. Today it's the exciting stuff: creating the actual classification tree. If you want to understand how I prepared the data, go and check out the first blog, where I go into the data preparation in detail.

As previously mentioned, we are using the classification tree method, and the columns we will initially base the prediction on are intake_type, intake_condition, chip_status, animal_origin and pedigree. I am using the rpart package to create the classification trees and will split the data so there's 75% to train the tree on and 25% to test it on.
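The split and the first model can be set up in a few lines. This is a sketch of the approach described, not necessarily the exact code in the screenshot below; the data frame name `dogs` and the seed are assumptions.

```r
library(rpart)

# 75% / 25% train-test split (`dogs` and the seed are assumed)
set.seed(123)
train_rows <- sample(nrow(dogs), size = floor(0.75 * nrow(dogs)))
dogs_train <- dogs[train_rows, ]
dogs_test  <- dogs[-train_rows, ]

# First classification tree on the intake columns
tree_1 <- rpart(outcome_type ~ intake_type + intake_condition + chip_status +
                  animal_origin + pedigree,
                data = dogs_train, method = "class")
```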

[Code screenshot: fitting the first classification tree]

[Figure: first classification tree]

Above you can see the code and resulting classification tree for the first model. One thing immediately obvious is that it's highly complex and possibly overfitted to the data. Let's check how this tree performs:

[Figure: confusion matrix and accuracy of the first model]

Well, that’s not great at all. This classification tree seems to be barely better than random chance! This really isn’t ideal and means currently the model is pretty much worthless. Let’s have a look at what we can do to improve this.

The first thing I am going to do is look at the intake_condition column.

[Figure: counts of each intake_condition category]

There are 7 different categories within this column; however, I think these can be simplified into Healthy, Treatable and Unhealthy. So let's do this and check the results:
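Collapsing the factor levels is straightforward with forcats. A sketch only: the original level names used below are illustrative assumptions, not the real labels in the Dallas data.

```r
library(dplyr)
library(forcats)

# Collapse the seven intake conditions into three broader groups
# (the original level names below are assumptions)
dogs <- dogs %>%
  mutate(intake_condition = fct_collapse(
    intake_condition,
    Healthy   = c("HEALTHY"),
    Treatable = c("TREATABLE MANAGEABLE", "TREATABLE REHABILITABLE"),
    Unhealthy = c("UNHEALTHY UNTREATABLE", "SICK", "INJURED", "CRITICAL")
  ))
```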

[Figure: second classification tree]

Success, a much simpler tree. However:

[Figure: confusion matrix and accuracy of the second model]

The accuracy of the model has gone down! It is now less accurate than random chance; I am actually just wasting my time here. Let's take a step back and look at the table which shows the predictions against the actual outcomes. As you can see, the model is currently predicting lots of dogs which died or were not adopted as being adopted.
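For reference, the confusion table and accuracy come from predicting on the held-out set; `tree_2` is an assumed name for the refitted model.

```r
# Performance of the second tree on the test set (`tree_2` is an assumed name)
pred_2 <- predict(tree_2, newdata = dogs_test, type = "class")

# Confusion matrix: predicted vs actual outcome
table(predicted = pred_2, actual = dogs_test$outcome_type)

# Overall accuracy
mean(pred_2 == dogs_test$outcome_type)
```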

[Figure: column composition for dogs predicted to be adopted that actually died]

I reviewed the composition of each column in the data frame after filtering for dogs predicted to be adopted whose actual outcome was that they died. The biggest difference was seen in the condition column. Apparently, a lot of dogs that died were classed as treatable; how can that be?

[Figure: outcomes of dogs classed as treatable]

I took a step back, went back to the original dataset and filtered for the dogs which are treatable, but made no change to the outcome_type column, as I guessed this could be where the problem was. The above graph looks at the outcomes of the dogs classed as treatable. A lot of dogs are clearly euthanized, which is possibly where the confusion is coming from: these are classed as having died, whereas normal logic says you would expect a treatable dog to survive. This is interesting, as it highlights how the decisions you make at the start of any analysis can affect it later on. The next question is whether there is another column in the data frame that can be used to identify euthanization.

[Figure: kennel_status for euthanized dogs]

I think I found it with the kennel_status column. By far the most common kennel for the euthanized dogs to go in is the lab. Therefore we are going to add kennel_status to the analysis and see where it goes:

[Figure: third classification tree]


[Figure: confusion matrix and accuracy of the third model]

Success: the model is now much better at predicting the outcome for each dog at the shelter. However, the classification tree is back to being over-complex and could possibly overfit the training data. Next, I will see if this tree can be pruned.

[Figure: complexity parameter plot for the third model]

Above you can see the complexity plot for the overly complex classification tree. The tree isn't much improved once you go beyond 4-7 levels and a complexity parameter of around 0.00075. This pruning can be done either before or after the model is created. Here I am going to prune before running the model, so I am going to fit the fourth and hopefully final version of the model.
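The complexity plot itself comes from rpart's built-in helpers, and pruning after the fact is also possible; `tree_3` is an assumed name for the third model.

```r
library(rpart)

# Complexity parameter table and plot for the overgrown tree (`tree_3` is an assumed name)
printcp(tree_3)
plotcp(tree_3)

# Post-hoc alternative: cut the already-fitted tree back at the chosen cp
tree_3_pruned <- prune(tree_3, cp = 0.00075)
```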

[Figure: final pruned classification tree]

[Code screenshot: final model with rpart.control]

Above you can see the final classification tree and the code used to create it. In my call to rpart, I have used the control argument, limiting the complexity parameter to 0.00075 based on the complexity plot and the maximum depth to 5. This has produced a much less complex tree, and performance was similar to the previous, complex tree.
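A sketch of what that final call might look like, following the description above (variable names are assumptions carried over from the earlier sketches):

```r
library(rpart)
library(rpart.plot)

# Fourth model: the intake columns plus kennel_status, with complexity and depth limited up front
tree_4 <- rpart(outcome_type ~ intake_type + intake_condition + chip_status +
                  animal_origin + pedigree + kennel_status,
                data = dogs_train, method = "class",
                control = rpart.control(cp = 0.00075, maxdepth = 5))

rpart.plot(tree_4)
```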

This could be further developed with more data: does the sex of the dog have an effect on the results, or the size or type of dog? Small dogs could be more likely to be adopted, and certain types could be more likely to be euthanized. This could also be built on further and a random forest model created. Thanks for reading, and well done if you got to the end; it's a bit longer than what I normally aim for. Please let me know your thoughts, or if there is anything I have missed or could have included.

Does the Dog Get Adopted?? — P1

Hello, welcome to the next blog. I was inspired by this week's Tidy Tuesday dataset. I'm sure I have said this before, but if you want to learn rstats it's a great resource, with a weekly dataset to practice your burgeoning skills on. This week's data was from the Dallas open data project, and the particular dataset was from the Dallas animal shelter. I thought: wouldn't it be great to create a model which, based on the information about an animal when it arrives at the shelter, could predict what might happen to it?

[Figure: structure of the data frame]

Above is the structure of the data frame. The first thing to notice is that a number of different animals are logged in the dataset. Creating a model for the five different types could be quite complicated, so I am going to focus on dogs; out of the 35,000 or so observations, I think dogs will make up most of them anyway. The model type I think is most suited to this problem is a classification tree. A classification tree works by building a network of yes/no splits with the various outcomes at the end. It works well when you have lots of factor variables, which this dataset is full of.

Now we need to select the columns the model is going to be based on, summarised below:

animal_breed – identifies the type of dog. I think this is key information; some breeds of dog are more likely to be adopted than others.

intake_type – how the dog arrived at the shelter. There will clearly be an effect on the dog's outcome.

intake_condition – the dog's condition when it arrived. An unhealthy dog is possibly less likely to be adopted.

chip_status – whether the dog had a microchip. Dogs with microchips are more likely to be reunited with their owners.

animal_origin – where the animal was found, or how it came to the shelter.

outcome_type – finally, the most important column, as this is what we will be predicting.

Now we have our columns selected, we need to prepare the data. The first thing we will look at is the outcome column; I wanted to make sure there are not too many outcomes the prediction is based on. Looking back at the structure of the column, there are 12 separate outcomes, which is far too many, so let's see if we can group some together. Below is a summary of the different values of the outcome_type column, and I think there is definitely scope to group some of them.

[Figure: summary of outcome_type values]

Dead on arrival should be excluded: if the dog is dead on arrival, that is the outcome and there's nothing to predict. Died and euthanized I am going to group into a single died outcome, as predicting how the dog died is beyond the scope of this prediction. Foster, transfer and other will be grouped under unadopted. The remaining outcomes will then be filtered out.

The final thing in this opening blog of preparing the data is the animal_breed column. There are over 100 different dog breeds in this column, which would be impossible for the classification tree. On closer inspection, the column consists of either individual dog breeds or mixed, which I assumed means not pedigree. I decided to convert this to a column marking the dog as either pedigree or cross breed. The final data preparation code is therefore below:

[Code screenshot: final data preparation]
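For readers who want a starting point, here is a rough sketch of that preparation in dplyr. The raw column names and level labels (e.g. "ADOPTION", "MIX") are assumptions, not the exact values in the Dallas data.

```r
library(dplyr)
library(stringr)

# Sketch of the data preparation described above (names and labels are assumptions)
dogs <- animals %>%
  filter(animal_type == "DOG", outcome_type != "DEAD ON ARRIVAL") %>%
  mutate(
    outcome_type = case_when(
      outcome_type %in% c("DIED", "EUTHANIZED")          ~ "died",
      outcome_type %in% c("FOSTER", "TRANSFER", "OTHER") ~ "unadopted",
      outcome_type == "ADOPTION"                         ~ "adopted",
      TRUE                                               ~ NA_character_
    ),
    # breeds containing "mix" are treated as cross breed, everything else as pedigree
    pedigree = if_else(str_detect(animal_breed, regex("mix", ignore_case = TRUE)),
                       "cross breed", "pedigree")
  ) %>%
  filter(!is.na(outcome_type)) %>%
  select(intake_type, intake_condition, chip_status, animal_origin, pedigree, outcome_type)
```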

That's it for today's opening blog. Tomorrow we will look at the results of the model and, if required, how I optimise it.

F1 Circuit Cluster Analysis – 3

Hello and welcome to what was meant to be the final F1 circuit cluster analysis blog; however, I have thought of some ideas to extend it to a fourth, so we shall see how that goes. The idea is that today we will review what we can from the season so far and, in the next one, look at some methods for predicting how the rest of the season will pan out. That one might not be until the summer break.

[Figure: season calendar coloured by circuit cluster]

Above is a summary of the season with each circuit coloured by the cluster it belongs to. Circuits have been grouped using hierarchical clustering; please see the other blogs in the series for the method used. The tracks that belong to clusters 1 and 2 are pretty evenly distributed across the season. What is interesting is that the two wildcard tracks, which don't belong to any of the other three clusters, are still to come. Could they prove crucial in the fight for the title?
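As a quick reminder of the kind of clustering step covered in the earlier posts, it looks roughly like the sketch below. The circuit feature columns and the number of clusters here are assumptions; the earlier posts describe the actual choices.

```r
# Hierarchical clustering of circuits on scaled track characteristics
# (feature columns and k are assumptions; see the earlier posts for the real setup)
circuit_features <- scale(circuits[, c("avg_lap_speed", "pct_full_throttle", "slow_corners")])

hc <- hclust(dist(circuit_features), method = "ward.D2")
circuits$cluster <- cutree(hc, k = 4)
```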

[Figure: pace deficit to the fastest car by team and cluster]

Above you can see, for each cluster, the pace difference to the fastest car: 0 is the fastest car in each cluster, up to around 3%, which is the gap to the slowest car. The first thing to take away is that in clusters 1 and 2 Mercedes and Ferrari are neck and neck, with Mercedes slightly, but only slightly, quicker overall. Red Bull get better the more low-speed corners and the fewer straights there are, highlighting the car's engine weakness, with the slow, twisty circuits of cluster 3 being their forte; they must be looking forward to Hungary next. Elsewhere, apart from the top three, one of the big stories is Haas. Their car looks well suited to the fast, flowing circuits of cluster 1 but is the slowest at the stop-start circuits with short straights. That is clearly a car whose strengths are high-speed downforce and engine power. The gap between the top three teams and the rest is pretty consistent across all the clusters.

[Figure: driver pace comparison by cluster]

Finally, we look at how the drivers rank across the different clusters. Some interesting points are apparent. In cluster 1 Hamilton seems to have a clear advantage over the others: it's close, but he's clearly on average faster than the other drivers. There's also a significant difference between Hamilton and his teammate Bottas, suggesting fast, twisty circuits could be Bottas's weakness. Compare that with cluster 2, where Hamilton, Vettel and Bottas are very evenly matched. At Ferrari, Raikkonen is a lot closer to Vettel on the fast, twisty circuits than in cluster 2, which has much larger braking zones and slower corners. The opposite pattern is seen at Red Bull: Ricciardo is a good distance behind Verstappen on the fast, twisty circuits, but is actually slightly faster on the slower circuits. Elsewhere, Alonso has a clear advantage over Vandoorne on both types of circuit.

So that's it for today's blog. I am going to put the R code and the spreadsheet for this on GitHub, so if you have any further ideas for what can be done with this dataset, I'd love to see what you come up with. There will be a fourth part in this series where we look at historical trends and then at forecasting the future.

Tidy Tuesday 2 – World Life Expectancy

Hello, welcome to today's blog, which is going to be my second one covering the Tidy Tuesday dataset. This week it was a dataset with life expectancy for every country in the world since 1950. I decided you could do some cluster analysis on this dataset and then, once you have the clusters, analyse them further to understand trends. We are going to use k-means clustering to group the countries, then look for trends and differences between the clusters. The dataset has country, year (between 1950 and 2015) and the life expectancy for that year. In order to do clustering you need at least two measures, so I created one with the change in life expectancy per year. The other measure is the life expectancy in 2015.

In order to find our value of k, I produced the silhouette plot below. You're meant to use the value of k with the highest average silhouette width, which in this case would be 3. However, with so many different countries I feel that would group the countries up too much. There are further spikes at 6 and 10.

[Figure: average silhouette width for different values of k]
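The silhouette sweep and the final k-means fit can be sketched as below, assuming a two-column feature matrix `life_features` (change in life expectancy per year and life expectancy in 2015, scaled):

```r
library(cluster)

# Average silhouette width for a range of k values (`life_features` is an assumed, scaled matrix)
d <- dist(life_features)
avg_sil <- sapply(2:15, function(k) {
  km <- kmeans(life_features, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
plot(2:15, avg_sil, type = "b", xlab = "k", ylab = "Average silhouette width")

# Final clustering with the chosen k
set.seed(1)
km_10 <- kmeans(life_features, centers = 10, nstart = 25)
```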

I decided to produce the cluster plot for both k values, 6 and 10. The plot for 10 is below.

[Figure: k-means clusters for k = 10]

10 seems like a good value, as there are not too many clusters to deal with but there is still good variation between the different clusters. We will take k equal to 10 for further analysis.

[Figure: mean share of deaths by cause for each cluster]

The comparison above looks at causes of death; I have grouped the data to get the mean for each cause within each cluster. Conclusions that can be made:

  • Cancer is prevalent across all clusters; however, the higher the life expectancy, the more prevalent it is. This could be because you're more likely to get cancer at older ages.
  • Dementia is another cause which seems to increase with higher life expectancy.
  • HIV is highest in the two lowest life expectancy clusters; the same goes for neonatal deaths.
  • Finally, road accidents are an interesting cause: by far the highest is cluster 7, which seems to be the cluster with the greatest increase in life expectancy over the last 65 years. Could this be because these are fast-developing nations that do not yet have safe road infrastructure in place?


That's it for a little intro into reviewing the data this way. Let me know your thoughts and comments. There are lots of datasets on the World Health Organisation's website, as well as other datasets such as economic growth, that I could add to this analysis to develop it further.