Hello, and welcome to today’s blog, my second covering the Tidy Tuesday dataset. This week’s dataset gives life expectancy for every country in the world since 1950. I decided to run some cluster analysis on it and then analyse the clusters further to understand trends. We are going to use K-means clustering to group the countries together, then look for trends and differences between the clusters. The dataset has the country, the year (between 1950 and 2015) and the life expectancy in that year. To cluster the countries I wanted at least two measures, so I created one: the change in life expectancy per year. The other measure is the life expectancy in 2015.
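As a rough illustration of how those two measures could be built for one country, here is a minimal Python sketch. It assumes the change per year is simply the difference between the last and first observations divided by the number of years between them; the post does not say exactly how it was calculated, and a fitted slope would be another reasonable choice. The function name and the numbers are made up for the example and are not taken from the dataset.

```python
def change_per_year(series):
    """Average yearly change between the first and last observation.

    `series` maps year -> life expectancy for a single country.
    """
    first_year, last_year = min(series), max(series)
    return (series[last_year] - series[first_year]) / (last_year - first_year)


# Hypothetical values for one country, purely for illustration.
example = {1950: 40.0, 1980: 55.0, 2015: 72.5}

# The two measures used for clustering: yearly change and the 2015 value.
features = (change_per_year(example), example[2015])
print(features)  # (0.5, 72.5)
```

Each country would contribute one such pair, giving one point per country to cluster.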
To find our value of K, I produced the silhouette plot below. You’re meant to use the value of K with the highest average silhouette width, which in this case would be 3. However, with so many different countries I feel that would group the countries together too coarsely. There are further spikes at 6 and 10.
I decided to plot the clusters for k values of both 6 and 10. The plot for 10 is below.
10 seems like a good value, as there are not too many clusters to deal with but there is still good variation between the clusters. We will take k equal to 10 for further analysis.
The comparison above looks at causes of death; I have grouped the data to get the mean for each cause in each cluster. Conclusions that can be made:
- Cancer is prevalent across all clusters; however, the higher the life expectancy, the more prevalent it is. This could be because you’re more likely to get cancer at older ages.
- Dementia is another cause which seems to increase with higher life expectancy.
- HIV is highest in the two lowest life expectancy clusters, and the same is true of neonatal deaths.
- Finally, road accidents are an interesting cause. By far the highest cluster is cluster 7, which seems to be the cluster with the largest increase in life expectancy over the last 65 years. Could this be because these are fast-developing nations that do not yet have safe road infrastructure in place?
That’s it for a little intro to reviewing the data this way. Let me know your thoughts and comments. There are lots of datasets on the World Health Organisation’s website, as well as other datasets such as economic growth, that I can add to this analysis to develop it further.
Hello and welcome to the second part of my mini-series using cluster analysis to categorise Formula 1 circuits. Please go and check out the first part; it outlines the basic data we are using to categorise the circuits and gives an overview of the method used for hierarchical clustering. Today we are going to go with K-means clustering.
For K-means clustering we have to set our own value for K. We are going to do that with two different types of analysis: an elbow plot and silhouette analysis.
The code below is what was used in order to generate the elbow plot. The elbow plot generated is below:
Reviewing the elbow plot, it looks like we are already seeing a slightly different number of clusters than we got when we conducted hierarchical clustering. The elbow of the plot looks to be at 3, but you could also argue for 4 as the value for k.
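For anyone who wants to see what sits behind an elbow plot, here is a small sketch in plain Python (not the R code that produced the plot above). It runs a basic K-means for several values of k and records the total within-cluster sum of squares, which is the quantity an elbow plot usually shows. Two assumptions to flag: the points are made-up toy data, not the circuit table, and the starting centroids are picked with a deterministic farthest-point rule so the example is repeatable, whereas random starts are more common in practice.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic start: first point, then repeatedly the farthest point."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Basic Lloyd's algorithm; returns (centroids, labels)."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


def within_cluster_ss(points, centroids, labels):
    """Total squared distance from each point to its own centroid."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))


# Toy data: three obvious groups (hypothetical, not the circuit data).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]

inertia = {}
for k in range(1, 6):
    centroids, labels = kmeans(points, k)
    inertia[k] = within_cluster_ss(points, centroids, labels)

for k, value in inertia.items():
    print(k, round(value, 2))
```

On this toy data the total drops sharply up to k = 3 and only slightly afterwards, which is the "elbow" being read off the plot.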
The other way to decide on a k value when conducting K-means clustering is to produce a silhouette graph. This takes every point in the analysis and rates how well it fits its own cluster compared with the nearest other cluster, with -1 being doesn’t fit at all and 1 being fits well. You then plot the average silhouette width for each value of k, and the highest point gives the value of k. I have put a picture of the code below and also the silhouette graph produced.
Fascinatingly, there are two high points: one for a k of 9 and another for a k of 3. I am going to choose a k of 3, as this is closely aligned with what we saw in the elbow plot and 9 clusters is just too many to deal with.
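To make the silhouette score a little more concrete, here is a minimal Python sketch of the standard calculation (again, not the R code pictured above). For each point, `a` is the average distance to the other points in its own cluster and `b` is the average distance to the points in the nearest other cluster; the width is `(b - a) / max(a, b)`. The toy points and the two labellings are hypothetical, and the function needs at least two clusters to work.

```python
import math


def silhouette_widths(points, labels):
    """Silhouette width of every point, given one cluster label per point."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    widths = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for a one-point cluster
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the closest of the other clusters.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths


def average_silhouette(points, labels):
    widths = silhouette_widths(points, labels)
    return sum(widths) / len(widths)


# Toy data: two tight pairs that are far apart (hypothetical).
points = [(0, 0), (0, 1), (10, 0), (10, 1)]

print(average_silhouette(points, [0, 0, 1, 1]))  # sensible grouping, close to 1
print(average_silhouette(points, [0, 1, 0, 1]))  # poor grouping, below 0
```

Repeating the average for each candidate k, using the labels from the clustering at that k, gives the points on a silhouette graph.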
The above graph shows all the circuits on the calendar and where they sit for average straight length and average corner speed, coloured by the cluster they have been put in. I am a bit unsatisfied with this. I feel it doesn’t quite fit the different circuits on the calendar. For instance, Singapore is different to China and Germany. Therefore K-means is not going to be the clustering I use in the final blog to look at pace trends across the season. Look out for the final blog, in which we will look at the pace across all circuits so far for all the teams, along with some other metrics like overtakes and pit stops.
Hello there. As you know, I’m currently working through the DataCamp course Data Scientist with R. (If the people from DataCamp are reading this, I’m open to sponsorship!) There will be a further update on how I’m getting on with this later this week; however, today I wanted to focus on applying something new that I learnt: cluster analysis. Cluster analysis allows you to take a dataframe of variables (two, in our case) and work out which rows are best grouped together. There are two main methods that we are going to look at: hierarchical clustering and K-means clustering. We are going to look at Formula 1 circuits. The idea is that there are 21 different circuits currently on the calendar, all with different lengths, height profiles and types of tarmac; however, can we group them together by certain characteristics? For me, as an avid Formula 1 watcher, the differences between the circuits come down to the lengths of the straights and the speed of the corners. Therefore the two metrics we are using are the average straight length and average corner speed.
- Average straight length – calculated by measuring each stretch of track on which the F1 car would be running at full throttle, removing any stretches shorter than 100m. In the example for Spain below, the straights are marked approximately in green.
- Average corner speed – I have calculated this by allocating each corner to slow, medium or fast speed. (Unfortunately, I don’t have data for the exact corner speeds, but if any F1 team wants to send it over, email me!) You can see in the table below how many corners of each type were allocated for each circuit.
As I don’t know the exact speeds of these corners, I have estimated that a slow corner is 80 km/h, a medium-speed corner is 150 km/h and a fast corner is 200 km/h. This has left us with the following table:
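The average corner speed for a circuit can be worked out as a weighted average of those three estimated speeds. A short Python sketch of that arithmetic is below; it assumes a plain count-weighted mean, and the corner counts are for a hypothetical circuit rather than one from the table.

```python
# Estimated speeds from the post, in km/h.
ASSUMED_SPEED_KMH = {"slow": 80, "medium": 150, "fast": 200}


def average_corner_speed(corner_counts):
    """Count-weighted mean speed across a circuit's corners."""
    total_corners = sum(corner_counts.values())
    total_speed = sum(
        ASSUMED_SPEED_KMH[kind] * count for kind, count in corner_counts.items()
    )
    return total_speed / total_corners


# Hypothetical circuit: 6 slow, 6 medium and 4 fast corners.
print(average_corner_speed({"slow": 6, "medium": 6, "fast": 4}))  # 136.25
```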
The first thing we are going to look at is hierarchical clustering. The table above is fed into the following code:
This produces the following output:
We have 5 distinct clusters that the F1 circuits fit into. It’s not too surprising that Singapore, Monaco and Hungary fall into the same cluster, or that Belgium and Great Britain come out as similar circuits.
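For readers who have not met hierarchical clustering before, the idea can be sketched in a few lines of Python (this is an illustration, not the R code used for the output above). Every point starts as its own cluster, and the two closest clusters are merged repeatedly until the desired number remains. The sketch assumes complete linkage, where the distance between two clusters is the largest distance between any pair of their members; the post does not say which linkage was used, and the points are hypothetical.

```python
import math
from itertools import combinations


def complete_linkage(cluster_a, cluster_b):
    """Distance between clusters: the farthest pair of members."""
    return max(math.dist(p, q) for p in cluster_a for q in cluster_b)


def agglomerate(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]  # every point starts on its own
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: complete_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters.pop(j)  # j > i, so index i is still valid
        clusters[i].extend(merged)
    return clusters


# Toy data: two close pairs and one point on its own (hypothetical).
toy_points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
for cluster in agglomerate(toy_points, 3):
    print(cluster)
```

Because straight length and corner speed are measured on very different scales, standardising each column before clustering is usually worth considering, otherwise the larger numbers can dominate the distances.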
In the scatter plot above you can clearly see the difference between the two main clusters, 1 and 2. In cluster 1 the straights are often shorter; however, the corners are faster. Cluster 2 circuits often have longer straights but slower corners. With a few circuits from each group used so far this season, it would be interesting to see if there are any trends in car speed. That’s it for the first part of this series. Next week we will look at any differences using K-means clustering. In the final part, we will look at applying what we have seen so far this year to try and predict who will win in the later rounds.
Group D is next on the agenda in this series previewing the World Cup. This is part of a series looking at all the groups in the World Cup, so please take a look at the others and let me know your thoughts. Group D contains Argentina, Croatia, Iceland and Nigeria. So let’s take a look at the age make-up of the 4 squads.
Argentina has one of the oldest median ages we have seen so far, at about 30. Does this mean it is this squad’s last opportunity to win the World Cup? Lionel Messi will not be around forever and this is probably his last chance. Croatia and Iceland have similar medians, around the level we have seen for most squads so far in this preview series. Nigeria has a relatively young median age; interestingly, however, they have more players over 30 than Croatia and Iceland.
Nigeria’s relatively young squad seems to go hand in hand with having the fewest caps. Croatia seems to have a relatively experienced squad, with most players having more than 25 caps; this should stand them in good stead in the tournament if experience is a key attribute of any good squad. Argentina’s caps seem to be evenly distributed across the whole range, and they also seem to have the most players above 100 caps.
Next, we review squad composition for the 4 teams in Group D. The teams in this group have varying numbers of players in each department. What’s surprising is that Argentina seem to have the fewest attackers; however, the attackers they do have are all world class and it’s going to be difficult to fit them all into the team. Croatia seems to have the most defenders, which could mean they are strong defensively. Iceland and Nigeria have a similar make-up in their squads, with only a slight difference in attackers and defenders.
Finally, we look at the chances of each team in the tournament, based on the probabilities implied by the betting odds. As you can see, this group is expected to be pretty easy for Croatia and Argentina. Iceland and Nigeria are expected to be quite evenly matched but are not expected to have any impact on the group. Looking at the chances of winning the competition, Argentina are unsurprisingly one of the big favourites. However, it also looks like Croatia are seen as having a good outside chance, so it will be interesting to see how they do in the competition.
That’s it for today’s Group D overview. Please let me know your thoughts in the comments below and check out the other previews.
Hi there, and welcome to the next in my series of little previews ahead of the FIFA World Cup. Today we are dissecting the 4 teams in Group C: France, Peru, Denmark and Australia. Please do check out the other previews; further previews are coming at 6 pm every day ahead of the first game.
On the face of it, these look to be some of the youngest squads in the tournament. Australia seems to have players from both ends of the spectrum and a good grouping of peak-age players. France has probably the lowest median age of all the squads in the competition. Peru doesn’t have too many players between 20 and 25; however, they have a good grouping between 25 and 28.
Looking at the distribution of caps in each squad, it looks like all four teams have relatively inexperienced players. Denmark has the most players with around 25 caps. They also show the familiar trend of a spike higher up, indicating a good number of the experienced pros vital to any squad make-up. Peru seems to have the most experienced players in their squad, which could stand them in good stead to get out of the group. The big question for France is whether their lack of experience will affect them later in the competition.
Next, looking at squad composition, France and Australia seem to have the most attackers. France has done this by bringing fewer midfielders, Australia by bringing fewer defenders. Peru seems to have gone in a totally different direction to the rest of the group, with a squad overloaded with midfielders. Most are attacking midfielders, so they should still have goalscoring options.
Now we look at each team’s probability of getting out of the group and winning the tournament. At last we have a group that, on the face of it, could be quite competitive, for second place at least. Denmark is a clear favourite, but both Peru and Australia seem to have good outside chances, at least according to the bookies. France has a decent chance of winning the whole tournament and is currently 4th favourite, so it will be interesting to see how they do with their young squad.
That’s it for today’s overview. Let me know your thoughts: how far do you think France will go, and who do you think will get out of the group?
Hello, and welcome to the first of my blogs looking at each group in the World Cup. Over the next 8 blogs I hope to dissect each country’s squad and finally look at their chances of progressing and winning the cup. So today we start with Group A, which contains hosts Russia, Uruguay, Saudi Arabia and Egypt.
The first thing to look at is the age range of each squad in Group A. All 4 teams have a median around the same area. As you can see, Egypt have a 45-year-old player, one of their goalkeepers, who is the oldest player in any squad in the tournament. Uruguay and Egypt tend to have some younger players than Russia and Saudi Arabia. Saudi Arabia look to have one of the older squads in the tournament.
Next we look at the caps the players in each squad have received. It’s clear Russia has the most players with little international experience. Will they struggle to cope with the pressure of playing in front of a home crowd? Saudi Arabia, despite having the oldest squad of the 4 teams, seems to have generally the least experienced team. Uruguay, however, seem to have a good balance, with experience at all different levels.
Looking at each team’s squad composition, clearly all have the same percentage of goalkeepers, as 3 goalkeepers are stipulated in the rules. One thing that’s clear is that Egypt, Russia and Saudi Arabia have a low number of strikers within their squads. All three have just 3 recognised strikers; will this leave them all struggling to score goals? Also, Egypt seem to have the fewest midfielders compared to the rest of the group, with a corresponding increase in defenders. This will give Egypt lots of options in defence in case of injuries; however, it could leave them exposed if they need to make changes from the bench to try and win games.
Finally, we use the implied probability from the betting odds to look at the chance of each team getting out of the group and winning the tournament. Overall, Group A seems to have no teams really capable of mounting a serious challenge for the World Cup, with Russia, even with home advantage, rated lower than Uruguay. It also doesn’t seem to be a particularly close group for qualification; the clear favourites are Uruguay and Russia. The wild card in this is Egypt: if Mo Salah is fit for the tournament, expect their chances compared to Russia to increase considerably. If he isn’t, expect this to be a pretty straightforward group.
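If you have not come across implied probability before, it is just the bookmaker’s odds turned into a percentage. Here is a minimal Python sketch, assuming decimal odds and an outright-winner market where exactly one team wins. The raw values of 1 / odds add up to more than 100% because of the bookmaker’s margin, so the sketch rescales them to sum to 1; whether the figures in these previews were rescaled in this way is not stated. The team names and odds are hypothetical.

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds for one market into probabilities summing to 1."""
    raw = {team: 1 / odds for team, odds in decimal_odds.items()}
    overround = sum(raw.values())  # above 1 when the bookmaker takes a margin
    return {team: p / overround for team, p in raw.items()}


# Hypothetical odds for a three-team market.
odds = {"Team A": 1.6, "Team B": 4.0, "Team C": 4.0}
for team, probability in implied_probabilities(odds).items():
    print(team, round(probability, 3))
```

Fractional odds such as 3/1 can be converted to decimal odds first by adding 1, so 3/1 becomes 4.0.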
That’s it for today’s look at Group A. Please let me know your thoughts in the comments: do you think any teams in this group can go far? Group B will follow tomorrow.
Hello, and welcome to today’s blog. We are going to be looking at whether Home Secretary is the poisoned chalice it is made out to be in the media. Recently Amber Rudd was forced to resign from the job after being found to have misled Parliament. Many political commentators following it commented on it being the hardest job in government and on the apparently high turnover of occupants. I thought that rather than take their word for it, it could be tested with readily available data. I created my own dataset covering the last 100 years or so, listing the holders of the 4 Great Offices of State and the number of days each served in the role. I didn’t include anyone who died in the job, as that has nothing to do with the difficulty of the job.
The first thing to look at is the number of holders of the 4 Great Offices of State since 1916. Clearly the “safest” job looks to be Prime Minister. I think this is because the Prime Minister is responsible for hiring and firing the holders of the other three jobs, and Prime Ministers will possibly often push incumbents out of those jobs in order to protect themselves. Coming back to the main question we asked at the start of this blog, Home Secretary has had the most incumbents in the last 100 years, suggesting there is a higher turnover than in the other jobs. However, Chancellor and Foreign Secretary are not too far behind.
The plot above shows the distribution of days in office, with the mean plotted as a black dot. Clearly Prime Minister has the highest mean number of days in office, but as you can see from the general spread it is broadly similar to the other three jobs; the mean has been dragged up by the two outliers (Thatcher and Blair). The other three jobs have very similar means, although Home Secretary does have the lowest. Its general distribution, though, is similar to Foreign Secretary and Chancellor. Therefore it could be the small sample size that is affecting the result. Looking at this, I definitely don’t think it’s as clear as the press make out.
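To show how a couple of long tenures can drag a mean upwards, here is a small Python sketch. The start and end dates for Thatcher and Blair are the publicly documented ones; the four shorter tenures are placeholder numbers invented for the example and are not taken from the dataset behind the plot.

```python
from datetime import date
from statistics import mean, median

# Publicly documented dates for the two Prime Ministers named as outliers.
tenures = {
    "Thatcher": (date(1979, 5, 4), date(1990, 11, 28)),
    "Blair": (date(1997, 5, 2), date(2007, 6, 27)),
}

# Days in office is just the difference between the two dates.
days = {name: (end - start).days for name, (start, end) in tenures.items()}
print(days)  # {'Thatcher': 4226, 'Blair': 3708}

# Hypothetical shorter tenures, purely to illustrate the effect of outliers.
placeholder_days = [600, 800, 1000, 1200]
sample = placeholder_days + list(days.values())

print(round(mean(sample)))  # the mean is pulled up by the two long tenures
print(median(sample))       # the median barely notices them
```

Plotting the median alongside the mean is one way to check how much of a difference between roles comes from a handful of unusually long tenures.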
Finally, we look at the general trend over the last 100 years for each of the 4 Great Offices of State. Overall you can see that Prime Minister and Chancellor times in office are generally increasing, possibly because in the last 20 years there have been two Prime Ministers who aligned themselves closely with their Chancellors. Foreign and Home Secretaries, however, have not changed, and their tenures have stayed at around the same level over the last 100 years.
In conclusion, I don’t think it’s clear that Home Secretary is the worst job in government; however, it does seem that Home Secretaries generally spend less time in position than holders of the other 3 Great Offices of State. What’s surprising is that Foreign Secretary is pretty similar to Home Secretary, when it’s a much smaller area to cover with a lot less that can go wrong. Maybe it’s easy to move the Foreign Secretary around in a reshuffle. Thanks for reading this blog. If you enjoyed it and want to see more, please let me know and give the blog a follow so you can see when I post a new one.