Hello, and welcome to today’s blog, my second covering a Tidy Tuesday dataset. This week’s dataset has life expectancy for every country in the world since 1950. I decided to run a cluster analysis on it, because once you have the clusters you can analyse them further to understand trends. We are going to use K-means clustering to group the countries, then look for trends and differences between the clusters. The dataset has the country, the year (between 1950 and 2015) and the life expectancy in that year. I wanted at least two measures to cluster on, so I created one: the change in life expectancy per year. The other measure is the life expectancy in 2015.
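To make the two measures concrete, here is a minimal Python sketch of how they could be built. It is an illustration rather than the code behind this post: the rows are made up, the function name `build_features` is hypothetical, and I am assuming "change per year" means the total change between 1950 and 2015 divided by the number of years.

```python
# Hypothetical sample rows: (country, year, life expectancy).
# The values are made up purely for illustration.
rows = [
    ("Country A", 1950, 40.0), ("Country A", 2015, 66.0),
    ("Country B", 1950, 68.0), ("Country B", 2015, 81.0),
]

def build_features(rows, start_year=1950, end_year=2015):
    """Return {country: (change per year, life expectancy in end_year)}."""
    by_country = {}
    for country, year, life_exp in rows:
        by_country.setdefault(country, {})[year] = life_exp

    features = {}
    for country, series in by_country.items():
        # Skip countries that are missing either end of the period.
        if start_year not in series or end_year not in series:
            continue
        # Assumption: average yearly change over the whole period.
        change = (series[end_year] - series[start_year]) / (end_year - start_year)
        features[country] = (change, series[end_year])
    return features

features = build_features(rows)
for country, (change, final) in features.items():
    print(country, round(change, 2), final)
```

The two measures are on very different scales (fractions of a year against tens of years), so it may be worth standardising them before clustering. That step is not shown here.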
To find our value of K, I produced the silhouette plot below. You are meant to use the value of K with the highest average silhouette width, which in this case would be 3. However, with so many different countries I feel that would lump them together too coarsely. There are further spikes at 6 and 10.
I decided to plot the clusters for k values of both 6 and 10. The plot for 10 is below.
10 seems like a good value, as there are not too many clusters to deal with but there is still good variation between them. We will take k equal to 10 for further analysis.
The comparison above looks at causes of death; I have grouped the data to get the mean for each cause in each cluster. Conclusions that can be made:
- Cancer is prevalent across all clusters; however, the higher the life expectancy, the more prevalent it is. This could be because you’re more likely to get cancer at older ages.
- Dementia is another cause that seems to increase with life expectancy.
- HIV is highest in the two lowest life expectancy clusters, and the same goes for neonatal deaths.
- Finally, road accidents are an interesting cause. By far the highest cluster is cluster 7, which seems to be the cluster with the highest increase in life expectancy over the last 65 years. Could this be because these are fast-developing nations that do not yet have safe road infrastructure in place?
That’s it for a little intro to reviewing the data this way. Let me know your thoughts and comments. There are lots of datasets on the World Health Organization’s website, as well as other datasets such as economic growth, that I can add to this analysis to develop it further.
Hello and welcome to the second part of my mini-series using cluster analysis to categorise Formula 1 circuits. Please go and check out the first part; it outlines the basic data we are using to categorise the circuits and gives an overview of the method used for hierarchical clustering. Today we are going to go with K-means clustering.
For K-means clustering we have to set our own value for K. We are going to do that with two different types of analysis: an elbow plot and silhouette analysis.
The code below is what was used to generate the elbow plot. The plot itself is below:
Reviewing the elbow plot, it looks like we are already seeing a slightly different number of clusters than we got from hierarchical clustering. The elbow of the plot looks to be at 3, but you could also argue for 4 as the value of k.
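For anyone who wants to try the elbow idea themselves, here is a small Python sketch. It is not the code used for the plot in this post: the points are made up, the `kmeans` and `within_ss` helpers are hypothetical names, and the starting centres are picked in a simple deterministic way rather than at random. It runs K-means for several values of k and prints the total within-cluster sum of squares, which is the number an elbow plot shows on its vertical axis.

```python
import math

# Hypothetical 2-D points forming three obvious groups (made-up values).
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.8),
]

def assign(points, centres):
    """Put each point in the cluster of its nearest centre."""
    clusters = [[] for _ in centres]
    for p in points:
        nearest = min(range(len(centres)), key=lambda c: math.dist(p, centres[c]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, k, iterations=50):
    # Simple deterministic start: k points spread evenly through the list.
    step = len(points) / k
    centres = [points[int(i * step)] for i in range(k)]
    for _ in range(iterations):
        clusters = assign(points, centres)
        new_centres = [
            tuple(sum(v) / len(members) for v in zip(*members)) if members else centres[c]
            for c, members in enumerate(clusters)
        ]
        if new_centres == centres:  # nothing moved, so we have converged
            break
        centres = new_centres
    return centres, assign(points, centres)

def within_ss(centres, clusters):
    """Total squared distance from each point to its cluster centre."""
    return sum(
        math.dist(p, centres[c]) ** 2
        for c, members in enumerate(clusters)
        for p in members
    )

wss = {k: within_ss(*kmeans(points, k)) for k in range(1, 6)}
for k, value in wss.items():
    print(k, round(value, 2))
```

On data like this the total drops sharply up to k = 3 and then flattens out, and that bend is the "elbow". Real data is rarely this clean, which is why you can end up arguing between two neighbouring values.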
The other way to decide on a k value for K-means clustering is to produce a silhouette graph. This takes every point in the analysis and rates how well it fits its own cluster compared with the nearest other cluster, with -1 meaning it doesn’t fit at all and 1 meaning it fits well. You then plot the average silhouette width for each value of k, and the highest point gives the value of k. I have put a picture of the code below, along with the silhouette graph produced.
Fascinatingly, there are two high points: one for a k of 9 and another for a k of 3. I am going to choose a k of 3, as this is closely aligned with what we saw in the elbow plot and 9 clusters are just too many to deal with.
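To show what the silhouette width actually measures, here is a minimal Python sketch. Again this is not the code from the picture: the points and labels are made up and `silhouette_widths` is a hypothetical helper. For each point it takes a, the mean distance to the other points in its own cluster, and b, the lowest mean distance to the points of any other cluster, and scores the point as (b - a) / max(a, b).

```python
import math

def silhouette_widths(points, labels):
    """Silhouette width per point; needs at least two clusters."""
    widths = []
    for i, p in enumerate(points):
        # Collect distances from this point to every other point, by cluster.
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.get(labels[i])
        if not own:
            # A point alone in its cluster is given a width of 0 by convention.
            widths.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for label, d in dists.items() if label != labels[i])
        denom = max(a, b)
        widths.append((b - a) / denom if denom else 0.0)
    return widths

# Hypothetical points in two clear groups (made-up values).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the two groups
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the groups up

good_avg = sum(silhouette_widths(points, good_labels)) / len(points)
bad_avg = sum(silhouette_widths(points, bad_labels)) / len(points)
print(round(good_avg, 2), round(bad_avg, 2))
```

The sensible labelling gets an average close to 1 and the mixed-up one does much worse. Repeating this for the clusters found at each k, and plotting the averages, gives the silhouette graph.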
The above graph shows all the circuits on the calendar plotted by average straight length and average speed, coloured by the cluster they have been put in. I am a bit unsatisfied with this, as I feel it doesn’t quite fit the different circuits on the calendar. For instance, Singapore is different from China and Germany. Therefore K-means is not going to be the clustering I use in the final blog to look at pace trends across the season. Look out for that final blog, in which we will look at the pace across all circuits so far for all the teams, along with some other metrics like overtakes and pit stops.
Hello and welcome to the next post on this blog. If this is your first time here, please have a read of the other posts and let me know your thoughts, anything I haven’t spotted, or things you want looked at. Today we are going to look at the performances of the Middlesbrough first team throughout the season.
The data I have used is the rating each player gets on whoscored.com. Overall it was a season to forget for Middlesbrough. Ahead of the season the chairman had promised they would smash the league, and the £40 million spent in the transfer market seemed to suggest that could be possible. However, they ended up 25 points behind winners Wolves and were easily knocked out by Aston Villa in the play-off semi-final. The idea here is to look at performances over the full season and see whether the change of manager had an effect: did performances improve? Also, which areas of the team generally performed well and which didn’t? That might provide insight into where the team could be improved in the transfer market.
Above you can see box plots for each league game of the season, including the play-offs. Generally the team played better in the wins than in the defeats. How’s that for an earth-shattering conclusion! What is interesting is that the team had two managers during the season, and it does look like performances were more consistent under Tony Pulis. Let’s look at this in more detail.
Now let’s look at performances under both managers. The density plot above shows there really wasn’t much difference. The players generally performed at the same level under both managers; however, Pulis seemed to get more ratings above 8 out of them.
25 different players started a game for the team this season, with one clear outstanding performer: Adama Traore. However, Traore also has the largest spread of performances, showing he can be inconsistent. It also seems that attacking players are generally more consistent performers. What will be disappointing is that Ben Gibson seems to be the worst-performing defender in the team overall. In midfield it looks pretty close between Adam Clayton and Jonny Howson for the best median performances; however, Clayton looks to be much more consistent.
Overall it’s interesting to review the players’ performances over the season. It could be interesting to stretch this further to look at other teams or at previous seasons for specific players. It could also be drilled down into home and away performances. Let me know your thoughts or if you have any questions; I would really like to hear from you.
Hello and welcome to today’s blog. We are going to see whether Home Secretary is the poisoned chalice it is made out to be in the media. Recently Amber Rudd was forced to resign from the job after misleading Parliament. Many political commentators covering the story described it as the hardest job in government and pointed to the apparently high turnover of occupants. I thought that rather than take their word for it, this could be tested with readily available data. I created my own dataset covering the last 100 years or so, listing the holders of the four Great Offices of State and the number of days each served in the role. I didn’t include anyone who died in the job, as that has nothing to do with the difficulty of the job.
The first thing to look at is the number of holders of each of the four Great Offices of State since 1916. The “safest” job looks to be Prime Minister. I think this is because the Prime Minister is responsible for hiring and firing the holders of the other three jobs, and Prime Ministers will possibly often push incumbents out of those jobs in order to protect themselves. Coming back to the main question we asked at the start of this blog, Home Secretary has had the most incumbents in the last 100 years, suggesting a higher turnover than the other jobs. However, Chancellor and Foreign Secretary are not too far behind.
The plot above shows the distribution of days in office, with the mean plotted as a black dot. The Prime Minister has the highest mean number of days in office, but as you can see from the general spread it is broadly similar to the other three jobs; the mean has been dragged up by the two outliers (Thatcher and Blair). The other three jobs have very similar means, although Home Secretary does have the lowest. Its general distribution, though, is similar to Foreign Secretary and Chancellor, so it could be the small sample size that is affecting the result. Looking at this, I definitely don’t think it’s as clear-cut as the press make out.
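The point about outliers dragging the mean up is easy to see with a few numbers. The Python sketch below uses made-up tenures for two unnamed offices, not my real dataset, and compares the mean with the median, which is far less sensitive to a couple of very long stints.

```python
from statistics import mean, median

# Hypothetical days-in-office figures, made up for illustration only.
tenures = {
    "Office A": [700, 900, 1100, 1300, 3700, 4000],  # two very long tenures
    "Office B": [600, 800, 1000, 1200, 1400],        # no outliers
}

summary = {
    office: {"mean": mean(days), "median": median(days)}
    for office, days in tenures.items()
}

for office, stats in summary.items():
    print(office, stats["mean"], stats["median"])
```

For the office with the two long tenures the mean sits well above the median, while for the other they are the same. Plotting the median alongside the mean would be one way to check how much of a gap between jobs is down to a handful of individuals.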
Finally we look at the general trend over the last 100 years for each of the four Great Offices of State. Overall you can see that Prime Minister and Chancellor times in office are generally increasing, possibly because in the last 20 years there have been two Prime Ministers who aligned themselves closely with their Chancellors. Foreign and Home Secretaries, however, have not changed, and their tenures have stayed at around the same level over the last 100 years.
In conclusion, I don’t think it’s clear that Home Secretary is the worst job in government; however, it does seem that its holders generally spend less time in the position than holders of the other three Great Offices of State. What’s surprising is that Foreign Secretary is pretty similar to Home Secretary, when it has a lot smaller area to cover and a lot less that can go wrong. Maybe it’s just easy to move the Foreign Secretary around in a reshuffle. Thanks for reading this blog. If you enjoyed it and want to see more, please let me know and give the blog a follow so you can see when I post a new one.
Hello and welcome to this blog. Today we are going to look at something we haven’t looked at yet on this blog: Formula 1. I have watched F1 since 1997 and have often wondered, whenever they say “we reviewed the data”, what exactly the data is and what process they use to review it. Sadly I don’t have access to anything like the data F1 teams have (one day maybe!); however, the main piece of data is freely available: the qualifying time. I decided I wanted to have a look at the competitive picture, and now that we are 4 races in, that’s a decent sample size.
To do this analysis I took each driver’s fastest lap from each of the 4 qualifying sessions so far. I then added them up to get each driver’s total qualifying time. The result is plotted in the graph below:
So after 4 races Vettel has the lowest total qualifying time, closely followed by Hamilton. What is clear from this is the large gap between the top 6 drivers, from the top 3 teams, and the rest. Also, apart from Ferrari and Mercedes being mixed up, every other team is 2 by 2, with team mates next to each other. This is surprising considering the small gaps between teams in the midfield. The next question I had was about the differences between team mates, as in Formula 1 your main rival to beat is always your team mate.
The graph above shows the difference between each team’s drivers, with points at the top right showing a smaller difference than those at the bottom left. The team with clearly the closest-matched drivers is Red Bull, with 0.07 seconds between them. This is good news for Ricciardo in particular, who can use this information to increase his value in his contract talks. At the other end there is big pressure on Stoffel Vandoorne and Kimi Raikkonen. Both have been over a second in total behind their team mates, which, if it carries on, could see them losing their seats.
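If you want to reproduce the two calculations, here is a rough Python sketch. The lap times, driver names and team names are placeholders I have made up, and the helper names `total_times` and `team_mate_gaps` are hypothetical. It adds up each driver’s fastest qualifying laps and then takes the gap between the two totals within each team.

```python
from collections import defaultdict

# Hypothetical data: (driver, team, race, fastest qualifying lap in seconds).
laps = [
    ("Driver A", "Team X", "Race 1", 80.00), ("Driver A", "Team X", "Race 2", 90.50),
    ("Driver B", "Team X", "Race 1", 80.25), ("Driver B", "Team X", "Race 2", 90.50),
    ("Driver C", "Team Y", "Race 1", 81.00), ("Driver C", "Team Y", "Race 2", 91.00),
    ("Driver D", "Team Y", "Race 1", 82.50), ("Driver D", "Team Y", "Race 2", 91.00),
]

def total_times(laps):
    """Sum each driver's fastest laps across all sessions."""
    totals = defaultdict(float)
    for driver, team, race, seconds in laps:
        totals[(team, driver)] += seconds
    return dict(totals)

def team_mate_gaps(totals):
    """Gap in total time between the two drivers of each team, smallest first."""
    by_team = defaultdict(list)
    for (team, driver), total in totals.items():
        by_team[team].append(total)
    gaps = {
        team: max(times) - min(times)
        for team, times in by_team.items()
        if len(times) == 2  # only teams with exactly two drivers
    }
    return sorted(gaps.items(), key=lambda item: item[1])

totals = total_times(laps)
gaps = team_mate_gaps(totals)
print(gaps)
```

One thing to watch: the totals are only comparable if every driver set a time in every session, so a driver who missed a session would need handling separately.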
I’m going to keep this dataset up to date as the season goes on, and I have similar information for total race time. I think there’s more information you can derive from this, such as who’s developing their car the best. Please let me know your thoughts or if you have any questions; I like to hear feedback.
Hello, this is going to be a shorter blog than normal; I just felt I had to share the early findings. It was inspired by the R4DS online learning community’s recent Tidy Tuesday, in which we looked at a dataset of the wages of various positions in the NFL. Reviewing it showed that while salaries for some positions were increasing at a high rate, others were not growing at all.
I decided to look at wages in the English Premier League. I got the data from the same website, which had the wages for all players for every year from 2013 to the current year. I took that data and plotted the graph below, which shows the wages for the top 50 players in each position.
Despite all the money going into the league with the latest increases in TV deal money, the wages for all players seem to be staying at the same level. This shocked me, and I can only think of a couple of reasons why:
- The increase in TV money has been spent on things other than wages, such as transfer fees, or has gone to the owners of the clubs
- If higher wages have been paid, they have gone to the less skilled, lower-paid players
Stay tuned. I have a few ideas for how we can review this further and work out whether this is true.
Hello and welcome to another blog, this time looking at win odds in the Championship so far this year and comparing each team’s odds. The aim is to review the data and see if there are any trends we can spot. To get the data I downloaded the raw CSV from Football Data. The CSV is available on their website for free and contains lots of other interesting information.
The summary above shows each team in the Sky Bet Championship with their home and away game odds plotted. The big thing to take away is the spread for some teams. If you look at Wolves, they were generally well fancied in both their home and away games. Burton, however, even in their best-fancied games at home, were still less fancied than other teams at home. Also, the better the team, the more the home and away odds overlap.
I have now updated the graphs to look at home and away games separately. For home games, Wolves again generally have the lowest odds. The only team with odds close to Wolves is Aston Villa, who are obviously a well-fancied home team. The biggest surprise for me is that, despite Burton clearly having higher odds than any other team in the division, they don’t hold the least fancied odds for a home team. That accolade goes to Barnsley. A similar pattern is seen with the away odds, which are obviously generally higher than the home odds. This data seems to suggest that the better teams have both lower odds and a tighter grouping of odds. This could also be a way to review how closely matched a league is: the more spread out the odds, the closer the teams are in terms of quality.
The graph above compares the number of home wins for each team against their average odds to win at home. As expected, the lower the average home win odds, the more home wins a team has. However, there are some interesting outliers. The two big overachievers against the bookies’ odds are Cardiff and Bolton. On their odds, Cardiff look like they should have a similar number of wins to the teams in the play-off mix, and Bolton look like they should theoretically have the second-lowest number of home wins in the league. Underachievers look to be maybe Brentford and Norwich, though there seem to be more teams overachieving than underachieving.
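Here is a rough Python sketch of how the home wins and average home odds could be pulled out of a results CSV. The sample rows are made up, and the column names (`HomeTeam`, `FTR` for the full-time result, `B365H` for one bookmaker’s home-win odds) are my assumption about the file layout, so check them against the file you download. I have also added an "expected wins" figure, which is the sum of 1 / odds over a team’s home games. That is an extra I am introducing here rather than something plotted above, and it ignores the bookmaker’s margin, so treat it as a rough yardstick only.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the assumed layout; team names and odds are made up.
SAMPLE_CSV = """HomeTeam,AwayTeam,FTR,B365H
Team A,Team B,H,1.50
Team A,Team C,H,2.00
Team B,Team A,A,4.00
Team B,Team C,H,2.50
"""

def home_summary(csv_text):
    """Per team: home wins, average home-win odds and odds-implied expected wins."""
    stats = defaultdict(lambda: {"games": 0, "home_wins": 0, "odds_total": 0.0, "expected": 0.0})
    for row in csv.DictReader(io.StringIO(csv_text)):
        odds = float(row["B365H"])  # decimal odds for a home win
        team = stats[row["HomeTeam"]]
        team["games"] += 1
        if row["FTR"] == "H":       # "H" marks a home win
            team["home_wins"] += 1
        team["odds_total"] += odds
        team["expected"] += 1 / odds  # implied win probability, margin ignored
    return {
        name: {
            "home_wins": s["home_wins"],
            "avg_odds": s["odds_total"] / s["games"],
            "expected_wins": s["expected"],
        }
        for name, s in stats.items()
    }

summary = home_summary(SAMPLE_CSV)
for team, s in summary.items():
    print(team, s["home_wins"], round(s["avg_odds"], 2), round(s["expected_wins"], 2))
```

A team whose actual home wins sit well above the expected figure would count as an overachiever in the sense used above, and one well below as an underachiever.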
Finally, away wins show the same trend; however, this time there are clear groups of teams at the bottom and top, showing how much harder it is to win away from home. The big overachiever away from home is Burton, which suggests they play well when teams underestimate them. A team which has underachieved is Middlesbrough, who look to have been expected to get more than 10 away wins this season but have only 7.