Hello, and welcome to today’s blog, my second one covering a Tidy Tuesday dataset. This week’s dataset has the life expectancy for every country in the world since 1950. I decided to run a cluster analysis on it and then analyse the clusters further to understand the trends. We are going to use K-means clustering to group the countries, then look for trends and differences between the clusters. The dataset has the country, the year (between 1950 and 2015) and the life expectancy in that year. I wanted two measures per country to cluster on, so I created one from the data: the change in life expectancy per year. The other measure is the life expectancy in 2015.
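As a rough illustration of how those two measures could be built, here is a minimal Python sketch. It is not the code behind this post. It assumes the data arrives as (country, year, life expectancy) rows and that "change per year" means the total change divided by the number of years; the function name `country_features` and the numbers are made up for the example.

```python
from collections import defaultdict

def country_features(rows, end_year=2015):
    """Return {country: (average change per year, life expectancy in end_year)}."""
    by_country = defaultdict(dict)
    for country, year, life_exp in rows:
        by_country[country][year] = life_exp

    features = {}
    for country, series in by_country.items():
        start_year = min(series)
        if end_year not in series or start_year == end_year:
            continue  # not enough data to measure a change
        change_per_year = (series[end_year] - series[start_year]) / (end_year - start_year)
        features[country] = (change_per_year, series[end_year])
    return features

# Made-up illustrative rows, not values from the real dataset.
rows = [
    ("Country A", 1950, 40.0), ("Country A", 2015, 66.0),
    ("Country B", 1950, 65.0), ("Country B", 2015, 78.0),
]
features = country_features(rows)
print(features)
```

Each country ends up as one point with two coordinates, which is the shape K-means needs.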
To find our value of K, I produced the silhouette plot below. You are meant to use the value of K with the highest average silhouette width, which in this case would be 3. However, with so many different countries I feel that would group the countries together too much. There are further spikes at 6 and 10.
I decided to plot the clusters for k values of both 6 and 10. The plot for 10 is below.
10 seems like a good value: there are not too many clusters to deal with, but there is still good variation between the different clusters. We will take k equal to 10 for further analysis.
The comparison above looks at causes of death. I have grouped the data to get the mean of each cause for each cluster. Conclusions that can be made:
- Cancer is prevalent across all clusters, but the higher the life expectancy, the more prevalent it is. This could be because you’re more likely to get cancer at older ages.
- Dementia is another cause that seems to increase with higher life expectancy.
- HIV is highest in the two lowest life expectancy clusters, and the same is true of neonatal deaths.
- Finally, road accidents are an interesting cause. By far the highest cluster is cluster 7, which seems to be the cluster with the largest increase in life expectancy over the last 65 years. Could this be because these are fast-developing nations that do not yet have safe road infrastructure in place?
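The grouping behind that comparison, a mean for each cause within each cluster, can be sketched in a few lines of Python. This is only an illustration, not the code used here: the row layout, the function name `mean_by_cluster` and the numbers are all assumptions.

```python
from collections import defaultdict

def mean_by_cluster(rows):
    """rows are (cluster, cause, rate) tuples; returns {(cluster, cause): mean rate}."""
    totals = defaultdict(lambda: [0.0, 0])  # running [sum, count] per group
    for cluster, cause, rate in rows:
        totals[(cluster, cause)][0] += rate
        totals[(cluster, cause)][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

# Made-up rates for illustration only (one row per country and cause).
rows = [
    (1, "cancer", 10.0),
    (1, "cancer", 20.0),
    (2, "cancer", 30.0),
    (1, "road accidents", 4.0),
]
means = mean_by_cluster(rows)
print(means)
```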
That’s it for a little intro to reviewing the data this way. Let me know your thoughts and comments. There are lots of datasets on the World Health Organisation’s website, as well as other datasets such as economic growth, that I can add to this analysis to develop it further.
Hello and welcome to the second part of my mini-series using cluster analysis to categorise Formula 1 circuits. Please go and check out the first part, which outlines the basic data we are using to categorise the circuits and gives an overview of the method used for hierarchical clustering. Today we are going to use K-means clustering.
For K-means clustering we have to set our own value for K. We are going to do that with two different types of analysis: an elbow plot and silhouette analysis.
The code below is what was used to generate the elbow plot. The elbow plot it generated is below:
Reviewing the elbow plot, it already looks like we are seeing a slightly different number of clusters than we got from hierarchical clustering. The elbow of the plot looks to be at 3, but you could also argue for 4 as the value of k.
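For anyone who wants to see what sits behind an elbow plot, here is a minimal Python sketch. It is not the code used for the plot above. It runs a plain K-means (Lloyd's algorithm) for several values of k and records the total within-cluster sum of squares, which is the number an elbow plot draws. The circuit values are hypothetical, and the helper names `kmeans` and `total_wss` are made up for this example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # assign every point to its nearest centroid
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster where it was
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels

def total_wss(points, k, restarts=10):
    """Lowest total within-cluster sum of squares over several random starts."""
    best = float("inf")
    for seed in range(restarts):
        centroids, labels = kmeans(points, k, seed=seed)
        wss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
        best = min(best, wss)
    return best

# Hypothetical (average straight length in m, average corner speed in km/h) pairs.
circuits = [
    (400, 110), (420, 115), (410, 105),
    (700, 120), (720, 125), (710, 130),
    (550, 160), (560, 165), (540, 170),
]
elbow = {k: total_wss(circuits, k) for k in range(1, 7)}
for k, wss in elbow.items():
    print(k, round(wss, 1))
```

The "elbow" is the value of k after which the printed numbers stop dropping sharply. The sketch does not standardise the two metrics first, which is a separate choice worth thinking about when they are on different scales.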
The other way to decide on a k value for K-means clustering is to produce a silhouette graph. This takes every point in the analysis and rates how well it fits its own cluster compared with the nearest other cluster, with -1 meaning it doesn’t fit at all and 1 meaning it fits well. You then plot the average silhouette width for each value of k, and the highest point gives the value of k. I have put a picture of the code below, along with the silhouette graph produced.
Fascinatingly, there are two high points: one for a k of 9 and another for a k of 3. I am going to choose a k of 3, as this closely aligns with what we saw in the elbow plot and 9 clusters are just too many to deal with.
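To make the silhouette width a bit more concrete, here is a small Python sketch of the calculation. It is an illustration under my own assumptions, not the code pictured above. For each point, `a` is the mean distance to the other points in its own cluster and `b` is the mean distance to the points in the nearest other cluster; the width is `(b - a) / max(a, b)`. The function name `mean_silhouette` and the example points are made up.

```python
from math import dist

def mean_silhouette(points, labels):
    """Average silhouette width over all points (needs at least 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    widths = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # convention: a cluster of one point scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so it does not affect the sum)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)

# Two obvious groups of made-up points, labelled sensibly and then badly.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = mean_silhouette(points, [0, 0, 1, 1])
bad = mean_silhouette(points, [0, 1, 0, 1])
print(round(good, 3), round(bad, 3))
```

Running this for the labels produced at each value of k and plotting the averages gives the silhouette graph.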
The above graph shows all the circuits on the calendar plotted by average straight length and average corner speed, coloured by the cluster they have been put in. I am a bit unsatisfied with this. I feel it doesn’t quite fit the different circuits on the calendar. For instance, Singapore is different to China and Germany. Therefore K-means is not going to be the clustering I use in the final blog to look at pace trends across the season. Look out for the final blog, in which we will look at the pace across all circuits so far for all the teams, along with some other metrics like overtakes and pitstops.
Hello there. As you know, I’m currently working through the Datacamp course Data Scientist with R. (If the people from Datacamp are reading this, I’m open to sponsorship!) There will be a further update on how I’m getting on with this later this week. Today, however, I wanted to focus on applying something new that I learnt: cluster analysis. Cluster analysis allows you to take a dataframe of two (or more) variables and work out which rows are best grouped together. There are two main methods that we are going to look at: hierarchical clustering and K-means clustering. We are going to look at Formula 1 circuits. There are 21 different circuits currently on the calendar, all with different lengths, height profiles and types of tarmac. The question is whether we can group them together by certain characteristics. For me, as an avid Formula 1 watcher, the differences between the circuits come down to the lengths of the straights and the speed of the corners. Therefore the two metrics we are using are the average straight length and the average corner speed.
- Average straight length: calculated by measuring each stretch of track where the F1 car would be running at full throttle, removing any stretches shorter than 100m. An example for Spain is below, with the estimated straights in green.
- Average corner speed: I have calculated this by allocating each corner to either slow, medium or fast speed. (Unfortunately, I don’t have data for the exact corner speeds, but if any F1 team wants to send it over, email me!) You can see in the table below how many corners of each type were allocated to each circuit.
As I don’t know the exact speeds of these corners, I have estimated that a slow corner is 80 km/h, a medium-speed corner is 150 km/h and a fast corner is 200 km/h. This has left us with the following table:
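The two metrics described above come down to simple averages. Here is a short Python sketch of them, using the 100m cut-off and the 80/150/200 km/h estimates from the text. The function names and the example circuit are hypothetical, and I am assuming the corner speed is a plain average weighted by the number of corners of each type.

```python
CORNER_SPEED_KMH = {"slow": 80, "medium": 150, "fast": 200}

def average_straight_length(lengths_m, min_length_m=100):
    """Mean length of the full-throttle stretches, ignoring any under min_length_m."""
    straights = [length for length in lengths_m if length >= min_length_m]
    if not straights:
        raise ValueError("no straights at or above the minimum length")
    return sum(straights) / len(straights)

def average_corner_speed(corner_counts):
    """Average corner speed from the counts of slow, medium and fast corners."""
    total = sum(corner_counts.values())
    if total == 0:
        raise ValueError("circuit has no corners")
    return sum(CORNER_SPEED_KMH[kind] * n for kind, n in corner_counts.items()) / total

# A made-up circuit, not one from the calendar.
straight = average_straight_length([850, 90, 400, 250])               # 90m stretch is dropped
corner = average_corner_speed({"slow": 5, "medium": 3, "fast": 2})
print(straight, corner)  # 500.0 125.0
```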
The first thing we are going to look at is hierarchical clustering. The table above is fed into the following code:
This produces the following output:
We have 5 distinct clusters that the F1 circuits fit into. It’s not too surprising that Singapore, Monaco and Hungary fit into a similar cluster, or that Belgium and Great Britain come out as similar circuits.
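For readers curious about what hierarchical clustering does under the hood, here is a minimal Python sketch. It is not the code used above. It starts with every circuit in its own cluster and keeps merging the two closest clusters until the requested number is left. I am assuming complete linkage (the distance between two clusters is the distance between their two farthest members); the function name `hierarchical_clusters` and the points are made up.

```python
from math import dist

def hierarchical_clusters(points, n_clusters):
    """Agglomerative clustering with complete linkage; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # complete linkage: distance between the two farthest members
        return max(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # find the closest pair of clusters and merge them
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Made-up points: two tight pairs and one outlier.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (50, 50)]
groups = hierarchical_clusters(points, 3)
print(groups)
```

Recording the order of the merges, rather than stopping at a fixed number of clusters, is what gives the tree-shaped dendrogram.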
In the scatter above you can clearly see the difference between the two main clusters, 1 and 2. In cluster 1 the straights are often shorter but the corners are faster. Cluster 2 circuits often have longer straights but slower corners. With a few circuits from each group already used this season, it would be interesting to see if there are any trends in car speed. That’s it for the first part of this series. Next week we will look at any differences using K-means clustering. In the final part, we will look at applying what we have seen so far this year to try and predict who will win in the later rounds.
Hello, welcome to the next blog in my series previewing the World Cup. All 7 other groups are covered elsewhere in my blog, so please go and check them out and let me know who you think the favourites are. The final group contains Colombia, Poland, Japan and Senegal. This promises to be quite an interesting group, and there’s not one team that stands out as an absolute favourite.
Looking at the ages of the four squads, Senegal have what looks to be the youngest squad in the group, with 2 players under the age of 20. The medians for the 4 teams, however, are about the same. Japan seems to have the oldest players, but nothing too high. All four seem to have most of their players around peak age, which should mean they are in peak condition.
Looking at the caps distribution, Poland looks to have the flattest range of caps, with no overload of experienced or inexperienced players. The Senegal squad looks to be the least experienced, most probably because, as we have seen, it has a lot of younger players. Most teams in the tournament have players with over 100 caps, but none of the squads in Group H do.
Now the Senegal squad composition looks interesting. They don’t have too many midfielders compared to the other teams, and they have lots of attackers. It looks like Senegal could be a fun team to watch in the tournament. The other squads don’t differ much from each other: they all have the same number of midfielders and only a small difference between defenders and attackers.
Now finally we look at the chances of each team in the World Cup. First of all, I think it’s the closest group to call, because the expected favourites, Colombia, have the lowest percentage chance of all the group favourites we have seen so far. Japan could also think that they have a half-decent chance of getting through the group with it being so wide open. It will be fascinating to see how this group with no so-called big nations turns out. It might be the most exciting of the tournament. As for the teams’ overall chances, Colombia possibly has an outside chance, but none of these teams are among the top favourites.
That’s it for the last in this series of blogs looking at each group in the World Cup. I hope you have enjoyed them and that they have increased your understanding of the teams in the World Cup.
Welcome to today’s blog, part of my series looking at all the teams in the tournament group by group. Check the others out if you haven’t already and let me know your thoughts. Today is our second-to-last group, Group G, which contains Belgium, England, Tunisia and Panama. As an Englishman myself, I hope this World Cup doesn’t end in penalty shoot-out heartbreak.
As you know if you have read the other blogs, we start with the age comparison. England looks to have one of the younger squads in the competition overall. Not many players are at peak age, so it could be a tournament too early for this group of England players. Belgium, on the other hand, look to have the perfect squad makeup, with most players at peak age, suggesting this tournament has come at the right time for their golden generation. Panama has possibly the oldest squad in the group, but also players of all ages. Tunisia’s squad is generally just slightly older than England’s.
Looking at the caps breakdown for all teams within Group G, Belgium are showing the common trend of the squad being split in two: a group of younger players with low caps, and a group of experienced players who can guide everyone. England probably has one of the squads with the lowest total caps we have seen in the tournament, which strengthens the view that this might be a building tournament for England. The positive is that England don’t have many players with the baggage of the last World Cups. Tunisia seems to have a similarly inexperienced squad to England, which fits with its low average age. Panama has the most experienced squad in the group, with a lot of players with over 100 caps.
Now, let’s take a look at the squad compositions. First things first, it looks like Tunisia have abandoned the midfield. I don’t think a team in this World Cup has had fewer midfielders than Tunisia. They have plenty of defenders and attackers, so it should be interesting to see how they play. Belgium has gone the opposite way to Tunisia, with more midfielders at the expense of defenders and attackers.
Now finally let’s take a look at the chances of all four teams in this tournament. On the face of it, this looks an easy group with the top 2 already decided. However, because these are based on bookmaker odds, England’s chances might be slightly overstated, as lots of expectant England fans backing them will shorten the odds. Looking at the chances for the whole tournament, Belgium have a good chance, but will they bottle it on the biggest stage again?
That’s it for today’s look at the World Cup groups. As always, please let me know your thoughts and if you have any questions.
Hello, welcome to the preview of Group F in the World Cup. Thanks for all the support so far on these previews. I would love to hear people’s thoughts and predictions on the competition. Today we will be looking at Group F, which contains Germany, Mexico, South Korea and Sweden.
The first thing to look at is the age distribution of all 4 teams. Germany seems to have one of the younger squads in the tournament, with a relatively small spread between the youngest and oldest players. Mexico’s players range from the youngest in their 20s all the way up to nearly 40. South Korea and Sweden have the same median age, but South Korea has more players clustered around the median and has the fewest players above 30.
Mexico has what looks to be the most experienced squad, with most players having around 50 caps and some having up to 150. Germany has a lot of players with relatively few caps, but also shows the trend we have seen with other squads of having a group of players with a lot of caps. I wonder if these players are a similar age and therefore could be a golden generation. Sweden possibly has the most inexperienced squad in the group, with a lot of players on fewer than 50 caps.
On the face of it, Germany seems to have a small number of attackers in the squad. However, they have more midfielders, and a few of them are creative attacking midfielders, so I don’t think they will struggle for goals. South Korea has the same number of attackers as Germany but seems to have picked more defenders. This could leave them struggling to score goals.
Last but not least, we look at each team’s chances using the probabilities implied by the odds. No surprise really: Germany are big favourites to get out of the group. However, second place looks to be a realistic target for the other three teams. It looks particularly close between Mexico and Sweden. They play each other in the last game of the group stage, so it could be a straight shoot-out for second place. Also, South Korea play Germany in their last game; Germany may have already qualified by then and may make changes, which could give South Korea an outside chance if it goes to the last game.
That’s it for today’s look at Group F. Please let me know your thoughts, as I would love to start a good debate. Also, check out the other blogs in the series.
In today’s World Cup preview we are going to be looking at Group E. This group contains Brazil, Serbia, Switzerland and Costa Rica. As mentioned previously, this is all part of my series previewing the World Cup, so please go and check the others out and let me know your thoughts. So let’s first look at the age distributions of the 4 squads.
First of all, I think this is the biggest difference in ages across a group we have seen so far. Brazil and Costa Rica have medians in the late 20s, whereas Serbia and Switzerland both have medians around 25. Brazil seems to have a good cluster of players in the late 20s age bracket, which is peak age. Is everything aligning to make Brazil the strongest team in the competition? Costa Rica look to have a group of players around the same age, all in their late twenties. Serbia has a squad towards the younger end of the scale, with a lot of players between 20 and 25.
The age profiles of all four squads are reflected in the caps distribution. Both Brazil and Costa Rica have players with at least 30 caps. It would be interesting to investigate whether the number of caps a team has affects its chances of winning the World Cup. Serbia, having the youngest squad in the group, also has the most players with a low number of caps. Switzerland has a fairly even spread across the cap levels.
All four teams seem to have similar squad compositions. The only slight difference is that Brazil has fewer midfielders and more attackers. Costa Rica looks to be relatively thin on the ground when it comes to attackers.
Let’s now take a look at how each team’s chances in the tournament compare. This has been done by working out the chances implied by the bookies’ odds. What’s not surprising is that Brazil has an excellent chance of getting through the group, seeing as they are the favourites for the whole competition. What’s good to see is that it looks to be a close contest for second place in the group, with Serbia and Switzerland each at around 50/50. It will be good to see how close it is when the games are played.
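For anyone wondering how odds turn into a percentage chance, here is a minimal Python sketch of the usual approach. It is not necessarily the exact calculation behind the charts here. With decimal odds, the implied probability is 1 divided by the odds, and because bookmakers build in a margin the raw values add up to more than 100%, so the sketch rescales them to sum to 1. The function name, the team names and the odds are all made up, and the rescaling assumes a market with a single winner.

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds for a single-winner market into probabilities summing to 1."""
    raw = {team: 1 / odds for team, odds in decimal_odds.items()}
    overround = sum(raw.values())  # above 1 because of the bookmaker's margin
    return {team: p / overround for team, p in raw.items()}

# Hypothetical odds to win a group, not real bookmaker prices.
odds = {"Team A": 1.5, "Team B": 4.0, "Team C": 6.0, "Team D": 12.0}
chances = implied_probabilities(odds)
for team, p in chances.items():
    print(team, f"{p:.1%}")
```

For a market where two teams succeed, such as qualifying from the group, the probabilities would be rescaled to sum to 2 instead.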
That’s it for today’s preview. Please check the others out and let me know your thoughts. I’ll be back with the next one tomorrow.