Biketown EDA – P2

Hello, welcome to the second part of this blog doing exploratory data analysis on the bike town dataset. If you haven’t read the first one then go check it out. An overview is we found that most of the records were either subscribers to the system of casual people who might just use it every now and then. So I decided to compare those two groups. We saw in the smaller sample size that most of the casual users used it for recreation and the subscribers used it mostly for commuting. Today we are going to look at the distances and speeds the groups travel and where they rent the bikes from but first we are going to look at when they rent the bikes:

plot8.png

This graph totally makes sense that subscribers tend to rent the bikes for commuting as there are two large spikes at around 8am and around 5pm. The peak hours for people going too and from work. Casual users don’t tend to use the system in the morning, however, there’s consistent usage throughout the afternoon. This could be tourists exploring the city for instance. It’s strange both groups don’t reach their minimum to well after 2am, are these people using the system to get home from their night outs? Beware of the drunk Portland cyclist!

distance

The density plot for the distances the two groups go is interesting however i think does fit the current pattern. Subscribers are looking to do shorter journeys because they might be covering the last few miles to work whereas the casual users are maybe exploring the city and therefore cover more distance.

speeds.png

Now let’s check out the speed curves for both groups the further the distance the person travelled the slower the speed. The subscribers are obviously generally the much faster riders at all distances. The increase in speed towards the 25-mile distance i think is down to lower amounts of data. At the lower distance the subscribers have significantly faster speeds are they using the system to get from A to B as quickly as possible.

start.png

Now let’s look at the start and end locations for the casual riders. The heat map above shows that most riders are clustered around the city centre possibly moving from one tourist spot to another. There seems to be a fairly even distribution across the centre locations.

subscriber.png

The subscriber heat map above shows that generally, people take the bike from much further out than compared to the casual riders who are possibly just using it to get to the tourist hot spots. Once again mostly the usage is on the left side of the river that must be the area where people like to get about. Also, the start locations around the outside are much denser, therefore, its clear people are using the system from further out to go into the centre of the city.

That’s it for this exploratory data analysis in this blog I think we have found some interesting insights and at the minimum able to confirm what you would expect. I hope you have found this interesting and informative let me know your thoughts and if you have looked at this dataset yourself lets see your thoughts. Check out the code on my GitHub should be linked somewhere on the website.

Advertisements

Nike Biketown – Exploratory Data Analysis

Hello welcome to today’s blog which we are going to take a large dataset and do some exploratory data analysis on it. I am going to look at the biketown dataset which was a dataset on tidy tuesday. If you’re new around here tidy Tuesday is a hashtag on twitter which the R for data science online learning community actively promotes and has every Tuesday. If you’re inspired to learn R and data science like I was that is a really great community full of wonderful people to start with. I am not going to post code snippets within the blog as I think it gets too long, however, the full code used will be posted on my GitHub.

structure

Above is the structure of the dataframe. The data comes in numerous csv files so i read them all in and created one large data frame structured like so. The second column Payment plan seems an interesting column it has 3 constituents Casual, subscriber and another. The system in Portland has a way a regular user can automate payments to save time. Let’s look at how much of the dataset is based on the 3 types of payment:

plot1

As you can see the vast majority of this dataset is based on either casual and subscriber and I think it would be interesting to review the differences between the people on the two main payment plans.  Therefore going forward in the EDA we are going to remove the entries without a subscriber. After this, we could possibly look at a method for working out what type of payment plan the blanks are. First thing first let’s have a look at what type of trips either group takes:

plot5plot2

The big issue here with making any conclusion on the type of trips each group takes is going to be difficult. This is because each group has over 200 thousand entries and there are less than 1000 recorded trips for each group. What we can say is that it makes sense that subscribers in this smaller group tend to use the system for commuting and casual users clearly in the small sample size use the system for recreation which makes total sense.

plot6

Now we look at the payment methods that both groups have used. By far the 3 main payment types for both groups are keypad, mobile and keypad_rfid_card, with subtle differences between the 2 groups. The RFID card is clearly higher in the subscriber which must be because subscribers are given a card in order to gain access to the bikes. Also, casual users tend to be much more likely to use their mobile to gain access. Both groups have the vast majority using the keypad system.

That’s it for part 1 of today’s exploratory data analysis on the bike town data. Tomorrow we will look at the distances the groups go as well as location and usage time. Let me know your comments on the first part.

F1 Circuit Cluster Analysis – 3

Hello and welcome to the meant to be final F1 circuit cluster analysis blog, however, I have thought of some ideas to extend it to a fourth so we shall see how that goes. The idea is today we will review what we can from the season so far and in the next one look at some methods for predicting how the rest of the season will pan out. That one might not be until the summer break.

f1 season cluster

Above is a summary of the season with each circuit coloured by what cluster they belong to. Circuits have been clustered according to hierarchical clustering please see othe blogs in the series to see the method used. The tracks that belong in cluster 1 and 2 are pretty evenly distributed across the season. What is interesting the two wildcard tracks which don’t belong to any of the other three clusters are still to come. Could they prove crucial in the fight for the title?

plot3

Above you can see for each cluster the pace difference with 0 being the fastest car in each cluster up to around 3% which is the difference to the slowest car. The first thing to take away in cluster 1 and 2 Mercedes and Ferrari are neck and neck. Mercedes are slightly but only slightly quicker overall. Red Bull get better the with more lower speed corners and fewer straights there are. Highlighting the cars engine weakness. With cluster three the slow twisty circuits being their forte. They must be looking forward to Hungary next. Elsewhere apart from the top three one of the big stories is Haas. Their car looks well suited to the fast flowing circuits of cluster 1 but is the slowest in the stop-start circuits with short straights. That is clearly a car with strengths in high-speed downforce and engine power. The gap between the top 3 teams and the rest is pretty consistent across all the clusters.

driver

Finally, we look at how the drivers rank at the different clusters. Some interesting points are apparent, In cluster 1 Hamilton seems to have a clear advantage over the others its close but he’s clearly on average faster than other drivers. There’s also a significant difference between Hamilton and his teammate Bottas showing fast twisty circuits could be Bottas weakness. Compared to cluster 2 where Hamilton, Vettel and Bottos are very evenly matched. At Ferrari Raikkonen is a lot closer to Vettel on the fast twisty circuits compared to cluster 2 which has much more large breaking zones and slower corners. The opposite pattern is seen at Red Bull Ricciardo is a good distance behind Verstappen on the fast twisty circuits but Ricciardo is actually slightly faster on the slower circuits. Elsewhere Alonso has a clear advantage over Vandoorne on both types of circuits.

So that’s it for today’s blog. I am going to put the R code for this on GitHub and the spreadsheet so if you have any further ideas what can be done with this dataset then id love to see what you come up with. There will be a fourth part in this series where we look at historical trends and then look at forecasting the future.

Tidy Tuesday 2 – World Life Expectancy

Hello, welcome to today’s blog which is going to be my second one covering the tidy Tuesday dataset. This week it was looking at a dataset with life expectancy for every country in the world since 1950. I decided you could do some cluster analysis on this dataset and then once you have the clusters can further analyse to understand trends. We are going to use K-means clustering to put the countries together then look for trends and differences between the clusters. So the dataset has country, year between 1950 and 2015 and the life expectancy of that year. Now in order to do clustering, you need at least two measures, therefore, I created one with the change in life expectancy per year. The other measure is going to be the life expectancy in 2015.

In order to find our value of K, I did the below silhouette plot. Now you’re meant to use the value of K with the highest sil width, in this case, it would be 3. However, with so many different countries I feel that would be unfair and group the countries up too much. There are further spikes at 6 and 10. sil2

I decided to do the below plot for different k values both 6 and 10. The plot for 10 is below

kmean

10 seems like a good value as there are not too many clusters to deal with but also good variation between the different clusters. We will take k equal to 10 for further analysis.

compare

The comparison above looks at causes of death and i have grouped it up to get the mean for each for cause for each cluster. Conclusions that can be made:

  • Cancer is prevalent across all clusters, however, the higher the life expectancy the more prevalent it is. This could be because your more likely to get cancer at older ages.
  • Dementia is another cause which seems to increase with older life expectancy.
  • HIV is highest in the two lowest life expectancy clusters the same with neonatal deaths
  • Finally, road accident is an interesting cause, by far the highest cluster is cluster 7 which seems to be the cluster with the highest increase in life expectancy over the last 65 years. Could this because these are fast developing nations and have not got the safe road infrastructure in as yet.

 

That’s it for a little intro into reviewing the data this way. Let me know your thoughts and comments. There are lots of dataset on the World Health Organisations website as well as other datasets such as economic growth i can add to this analysis and develop it further.

 

World Cup Group F

Hello, welcome to the preview of Group F in the world cup. Thanks for all the support so far on these previews. I would love to hear peoples thoughts and predictions on the competition. Today we will be looking at group F which contains Germany, Mexico, South Korea and Sweden.

agef

The first thing to look at is the age distribution of all 4 teams. Germany seems to have one of the younger squads in the tournament with a relatively small distribution between youngest and oldest players. Mexico has players from the youngest in their 20’s all the way up to near 40. South Korea and Sweden have the same median ages but South Korea has more players clustered around their median and have the lowest amount of players above 30.

capsf

Mexico has what looks to be the most experienced squad with players mostly having around 50 caps but they have some players up to 150 caps. Germany has a lot of players with a relatively low amount of caps but also have the trend we have seen with other squads of having a group of players with a lot of caps. I wonder if these players would be a similar age and therefore could be a golden generation. Sweden possibly has the most inexperienced squads in the group with a lot of players less than 50 caps.

compf

On the face of it, Germany seems to have a small number of attackers in the squad. However, they have more midfielders and a few of them are creative attacking midfielders, therefore, I don’t think they will struggle for goals. South Korea also has the same amount of attackers as Germany but seem have picked more defenders. This could leave them struggling to score goals.

Rplot

Last but not least we look at each teams chances using the probability of implied odds. No surprise really Germany are big favourites to get out the group. However, the fight for second place looks to be a realistic target for the other three teams. It looks particularly close between Mexico and Sweden. They play each other in the last game of the group stage, therefore, it could be a straight shoot-out for second place. Also, South Korea playing Germany who may have already qualified and therefore may make changes could give them an outside chance if it goes to the last game.

That’s it for today’s look a group F please let me know your thoughts would love to start a good debate on your thoughts. Also, check out the other blogs in the series.

World Cup Group E

Todays World Cup preview we are going to be looking at group E. This group contains Brazil, Serbia, Switzerland and Costa Rica. As mentioned previously this is all part of my series previewing the world cup so please go check the others out and let me know your thoughts. So lets first look at the age distributions of the 4 squads

agee

First of all, I think this is the biggest differences between ages across a group we have seen so far. Brazil and Costa Rica have medians around late 20’s whereas Serbia and Switzerland both have medians around 25. Brazil seems to have a good cluster of players in the late 20’s age bracket which is peak age. Is everything aligning to make Brazil the strongest team in the competition? Costa Rica look to have the group around the same age all in their late thirties. Serbia has a squad towards the younger end of the scale with a lot of players between 20 -25.

capse

The age profiles in all four squads are reflected in the caps distribution. Both Brazil and Costa Rica have players with at least 30 caps. It would be interesting to investigate if the number of caps a team has affected the chances of winning the World Cup. Serbia having the youngest squad of the group also have the most players with the low amount of caps. Switzerland has a fairly even spread across the cap levels.

compe

All four teams seem to have similar squad compositions The only slight difference is Brazil has fewer midfielders at the expense of more attackers. Costa Rica looks to be relatively slim on the ground when it comes to attackers.

chancesee

Lets now take a look at how each teams chances in the tournament compare. This has been done by working out the chance based on implied bookies odds.  Whats not surprising is Brazil has an excellent chance of getting through the group seen as though they are the actual favourites of the competition. Whats good to see if that it looks to be a close competition for second place in the group with both Serbia and Switzerland with around 50/50 chance. Will be good to see how close it is when the games are close.

That’s it for today’s preview please check the others out and let me know your thoughts. I’ll be back with the next one tomorrow.

World Cup Group C

Hi there welcome to next in series of little previews ahead of the FIFA World Cup. Today we are dissecting the 4 teams in group C; France, Peru, Denmark and Australia. Please do check out the other previews and further previews are upcoming at 6 pm everyday ahead of the first game.

agec

On the face of it, these look to be some of the youngest squads in the tournament. Australia seems to have players from both ends of the spectrum and a good grouping around peak age players. France has probably the lowest median age across all squads in the competition. Peru doesn’t have too many players between 20-25, however, have a good grouping between 25-28.

capsc

Looking at the distribution of caps in each squad it looks like all four teams have relatively inexperienced players. Denmark has the most amount of players which have around 25 caps. They also have the familiar trend of having a spike higher up showing a good amount of experienced pros vital in any squad make up. Peru seems to have the most amount of players with experience in their squad which could stand them in good stead to get out the group. The big question for France is will their lack of experience affect them later in the competition.

compc

Finally looking at squad composition France and Australia seem to have the most amount of attackers. France has done this by bringing fewer midfielders Australia by bringing fewer Defenders. Peru seems to have gone a totally different direction to the rest of the team with a squad overloaded with midfielders. Most are attacking midfielders so they should still have goalscoring options.

probc

Now we look at each teams probability of getting out the group and winning the tournament. Finally, we have a group that on the face of it could be quite competitive for second place at least. Denmark is a clear favourite but both Peru and Australia seem to have good outside chances at least according to the bookies. France has a decent chance of winning the whole tournament and is currently 4th favourites, so it will be interesting to see how they do with their young squad.

That’s it for today’s overview let me know your thoughts how far do you think France will go and who you think will get out the group?