In today’s blog we are going to look at the 2018 Formula 1 season and the numbers behind it, in particular the qualifying pace of each car. I have the data for each driver’s and car’s fastest lap in qualifying.
Above you can see the average difference to 1st for each position on the grid. Obviously for 1st the average difference is 0, and for every other position the difference is in seconds. There is a large gap between the first 3 rows of the grid and the rest, possibly a hint of how far ahead of the rest the top 3 teams are.
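As a rough illustration of how an average like this could be put together, here is a minimal Python sketch (it is not the R code behind the chart). It assumes each race is simply a list of fastest qualifying lap times in seconds, and the lap times in the example are made up purely for illustration.

```python
from collections import defaultdict
from statistics import mean


def average_gap_by_position(races):
    """Average gap in seconds to 1st place for each grid position."""
    gaps = defaultdict(list)
    for lap_times in races:
        ordered = sorted(lap_times)  # fastest lap first, i.e. grid order
        pole = ordered[0]
        for position, lap in enumerate(ordered, start=1):
            gaps[position].append(lap - pole)
    # Average the gaps collected for each position across all races
    return {position: mean(values) for position, values in sorted(gaps.items())}


# Made-up lap times (seconds) for two races, illustration only
races = [
    [81.2, 81.5, 82.4, 83.0],
    [90.0, 90.5, 91.0, 93.0],
]
average_gaps = average_gap_by_position(races)
print(average_gaps)
```

Position 1 always comes out as 0 because every gap is measured against the fastest lap of that race.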
When we break the first graph into drivers and teams, two things become clear. There is a big gap between the top 3 teams and the rest. There are also some fairly big gaps between teammates. Pace-wise the closest battle was between Renault, Haas and Force India, with all 6 drivers close together and mixed up.
Next we look at how each team’s difference to pole position trended over the season. Ferrari started the season with the quickest car. However, Mercedes out-developed them and clearly had the fastest car by the end of the season. The team that made the biggest improvement was Sauber, going from one of the slowest teams at the first race to the fastest of the midfield by the end of the season. The graph also shows a bad season for the former great team McLaren. They got rid of the Honda engine at the end of last season, however, the team that took it, Toro Rosso, overtook them by mid-season. McLaren slowly drifted further from the pace as the season went on.
Sauber’s improvement looks even more impressive here. They improved the car from being over 4% behind at the first race of the season to around 2% behind towards the end. This wasn’t driven by one driver being quicker than the other. Both drivers improved at similar rates, though it was clear Leclerc was the quicker driver.
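For reference, a percentage gap like the ones quoted above can be worked out as the difference to the pole lap divided by the pole lap time. Below is a small Python sketch of that calculation. The function name and the lap times are hypothetical and only there to show the arithmetic, not taken from the real data.

```python
def percent_gap_to_pole(lap_time, pole_time):
    """Gap to the pole lap as a percentage of the pole lap time."""
    return (lap_time - pole_time) / pole_time * 100


# Hypothetical lap times in seconds, illustration only
pole_time = 80.0
early_season_lap = 83.2
late_season_lap = 81.6

early_gap = percent_gap_to_pole(early_season_lap, pole_time)
late_gap = percent_gap_to_pole(late_season_lap, pole_time)
print(f"early season: {early_gap:.1f}% behind, late season: {late_gap:.1f}% behind")
```

Working in percentages rather than seconds makes races comparable, because a one second gap means more on a short lap than on a long one.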
The closest teammates were the Red Bull drivers Verstappen and Ricciardo, the Renaults of Sainz and Hulkenberg and the Williams of Sirotkin and Stroll. For all the hype around Max Verstappen this season, he has not been much quicker than his teammate. At the other end of the scale, McLaren and Sauber had the biggest differences between teammates. It was no surprise that Vandoorne got dropped by McLaren at the end of the season, even though he was up against the great Fernando Alonso.
I hope you enjoyed this look at the season we had in 2018. There were some interesting trends, however, the clear gap between the top 3 teams and the rest is worrying. Hopefully we get a closer season next year and the other teams are able to close the gap.
Hello, welcome to the second part of this blog doing exploratory data analysis on the biketown dataset. If you haven’t read the first one then go check it out. As a quick recap, we found that most of the records belonged to either subscribers to the system or casual users who might just use it every now and then, so I decided to compare those two groups. We saw in the smaller sample that most of the casual users used it for recreation and the subscribers used it mostly for commuting. Today we are going to look at the distances and speeds the groups travel and where they rent the bikes from, but first we are going to look at when they rent the bikes:
This graph supports the idea that subscribers tend to rent the bikes for commuting, as there are two large spikes at around 8am and around 5pm, the peak hours for people going to and from work. Casual users don’t tend to use the system in the morning, however, there’s consistent usage throughout the afternoon. This could be tourists exploring the city, for instance. It’s strange that both groups don’t reach their minimum until well after 2am. Are these people using the system to get home from their nights out? Beware of the drunk Portland cyclist!
The density plot for the distances the two groups travel is interesting, and I think it fits the pattern so far. Subscribers are looking to do shorter journeys because they might be covering the last few miles to work, whereas the casual users are maybe exploring the city and therefore cover more distance.
Now let’s check out the speed curves for both groups. The further a person travelled, the slower their speed. The subscribers are generally the much faster riders at all distances. The increase in speed towards the 25-mile distance is, I think, down to there being less data out there. At the lower distances the subscribers have significantly faster speeds. Are they using the system to get from A to B as quickly as possible?
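To give an idea of how speed curves like these could be built, here is a short Python sketch rather than the R code on my GitHub. It assumes each trip has a rider group, a distance in miles and a duration in minutes, and it averages speed within distance bands. The column choices, the 5-mile band width and the example trips are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def speed_mph(distance_miles, duration_minutes):
    """Average speed of one trip in miles per hour."""
    return distance_miles / (duration_minutes / 60)


def mean_speed_by_distance(trips, band_miles=5):
    """Mean speed per (group, distance band); a band is labelled by its lower edge."""
    bands = defaultdict(list)
    for group, distance, duration in trips:
        if duration <= 0:
            continue  # skip trips without a usable duration
        band = int(distance // band_miles) * band_miles
        bands[(group, band)].append(speed_mph(distance, duration))
    return {key: mean(speeds) for key, speeds in bands.items()}


# Made-up trips: (group, distance in miles, duration in minutes)
trips = [
    ("Subscriber", 2.0, 12.0),
    ("Subscriber", 3.0, 15.0),
    ("Casual", 2.0, 24.0),
    ("Casual", 6.0, 90.0),
]
speeds = mean_speed_by_distance(trips)
print(speeds)
```

Bands with only a handful of trips give noisy averages, which is one way a bump like the one near 25 miles can appear.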
Now let’s look at the start and end locations for the casual riders. The heat map above shows that most riders are clustered around the city centre, possibly moving from one tourist spot to another. There seems to be a fairly even distribution across the centre locations.
The subscriber heat map above shows that, generally, subscribers take the bike from much further out than the casual riders, who are possibly just using it to get to the tourist hot spots. Once again most of the usage is on the left side of the river, which must be the area where people like to get about. Also, the start locations around the outside are much denser, so it’s clear people are using the system from further out to go into the centre of the city.
That’s it for this exploratory data analysis. I think we have found some interesting insights in this blog and, at the minimum, been able to confirm what you would expect. I hope you have found this interesting and informative. Let me know your thoughts, and if you have looked at this dataset yourself, let’s see what you found. Check out the code on my GitHub, which should be linked somewhere on the website.
Hello, welcome to today’s blog, in which we are going to take a large dataset and do some exploratory data analysis on it. I am going to look at the biketown dataset, which was a dataset on Tidy Tuesday. If you’re new around here, Tidy Tuesday is a hashtag on Twitter which the R for Data Science online learning community actively promotes and runs every Tuesday. If you’re inspired to learn R and data science like I was, that is a really great community full of wonderful people to start with. I am not going to post code snippets within the blog as I think it gets too long, however, the full code used will be posted on my GitHub.
Above is the structure of the data frame. The data comes in numerous CSV files, so I read them all in and created one large data frame structured like so. The second column, payment plan, seems an interesting one. It has 3 values: casual, subscriber and one other. The system in Portland has a way for a regular user to automate payments to save time. Let’s look at how much of the dataset is based on the 3 types of payment:
As you can see, the vast majority of this dataset is either casual or subscriber, and I think it would be interesting to review the differences between the people on the two main payment plans. Therefore, going forward in the EDA we are going to remove the entries that are neither casual nor subscriber. After this, we could possibly look at a method for working out what type of payment plan the blanks are. First things first, let’s have a look at what type of trips each group takes:
Making any conclusion on the type of trips each group takes is going to be difficult. This is because each group has over 200 thousand entries and fewer than 1000 of them have a recorded trip type. What we can say is that, in this smaller sample, subscribers tend to use the system for commuting and casual users clearly use it for recreation, which makes total sense.
Now we look at the payment methods that both groups have used. By far the 3 main payment types for both groups are keypad, mobile and keypad_rfid_card, with subtle differences between the 2 groups. The RFID card is clearly higher in the subscriber group, which must be because subscribers are given a card in order to gain access to the bikes. Also, casual users are much more likely to use their mobile to gain access. In both groups the vast majority use the keypad system.
That’s it for part 1 of today’s exploratory data analysis on the biketown data. Tomorrow we will look at the distances the groups travel as well as location and usage time. Let me know your comments on the first part.
Hello and welcome to what was meant to be the final F1 circuit cluster analysis blog. However, I have thought of some ideas to extend it to a fourth, so we shall see how that goes. The idea is that today we will review what we can from the season so far, and in the next one look at some methods for predicting how the rest of the season will pan out. That one might not be until the summer break.
Above is a summary of the season with each circuit coloured by the cluster it belongs to. Circuits have been grouped using hierarchical clustering, so please see the other blogs in the series for the method used. The tracks that belong to clusters 1 and 2 are pretty evenly distributed across the season. What is interesting is that the two wildcard tracks, which don’t belong to any of the other three clusters, are still to come. Could they prove crucial in the fight for the title?
Above you can see the pace difference for each cluster, with 0 being the fastest car in each cluster, up to around 3%, which is the difference to the slowest car. The first thing to take away is that in clusters 1 and 2 Mercedes and Ferrari are neck and neck. Mercedes are slightly, but only slightly, quicker overall. Red Bull get better the more low-speed corners and the fewer straights there are, highlighting the car’s engine weakness, with the slow twisty circuits of cluster three being their forte. They must be looking forward to Hungary next. Elsewhere, apart from the top three, one of the big stories is Haas. Their car looks well suited to the fast flowing circuits of cluster 1 but is the slowest on the stop-start circuits with short straights. That is clearly a car with strengths in high-speed downforce and engine power. The gap between the top 3 teams and the rest is pretty consistent across all the clusters.
Finally, we look at how the drivers rank in the different clusters. Some interesting points are apparent. In cluster 1 Hamilton seems to have a clear advantage over the others. It’s close, but on average he’s clearly faster than the other drivers. There’s also a significant difference between Hamilton and his teammate Bottas, showing fast twisty circuits could be Bottas’s weakness. Compare that to cluster 2, where Hamilton, Vettel and Bottas are very evenly matched. At Ferrari, Raikkonen is a lot closer to Vettel on the fast twisty circuits than in cluster 2, which has many more large braking zones and slower corners. The opposite pattern is seen at Red Bull, where Ricciardo is a good distance behind Verstappen on the fast twisty circuits but is actually slightly faster on the slower circuits. Elsewhere, Alonso has a clear advantage over Vandoorne on both types of circuit.
So that’s it for today’s blog. I am going to put the R code for this and the spreadsheet on GitHub, so if you have any further ideas about what can be done with this dataset then I’d love to see what you come up with. There will be a fourth part in this series where we look at historical trends and then at forecasting the future.
Hello, welcome to today’s blog, which is going to be my second one covering the Tidy Tuesday dataset. This week it was a dataset with life expectancy for every country in the world since 1950. I decided to do some cluster analysis on this dataset and then, once we have the clusters, analyse them further to understand trends. We are going to use K-means clustering to group the countries together, then look for trends and differences between the clusters. The dataset has country, year between 1950 and 2015 and the life expectancy for that year. Now, in order to do clustering you need at least two measures, therefore I created one with the change in life expectancy per year. The other measure is going to be the life expectancy in 2015.
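To make the two measures concrete, here is a minimal Python sketch of how they could be derived for each country. It assumes the change per year is the difference between the last and first recorded life expectancy divided by the number of years between them, which is one plausible reading and not necessarily the exact calculation in my R code. The records in the example are made up.

```python
def country_features(records):
    """Return {country: (change_per_year, latest_life_expectancy)}.

    records is a list of (country, year, life_expectancy) rows.
    """
    by_country = {}
    for country, year, life_expectancy in records:
        by_country.setdefault(country, {})[year] = life_expectancy

    features = {}
    for country, series in by_country.items():
        first_year, last_year = min(series), max(series)
        if first_year == last_year:
            continue  # a single year gives no rate of change
        change_per_year = (series[last_year] - series[first_year]) / (last_year - first_year)
        features[country] = (change_per_year, series[last_year])
    return features


# Made-up records, illustration only
records = [
    ("Country A", 1950, 40.0),
    ("Country A", 2015, 66.0),
    ("Country B", 1950, 68.0),
    ("Country B", 2015, 81.0),
]
features = country_features(records)
print(features)
```

The two measures sit on very different scales (fractions of a year against tens of years), so it is worth thinking about scaling them before running a distance-based method such as K-means.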
In order to find our value of K, I did the silhouette plot below. Now, you’re meant to use the value of K with the highest silhouette width, which in this case would be 3. However, with so many different countries I feel that would be unfair and would group the countries up too much. There are further spikes at 6 and 10.
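For anyone unfamiliar with the silhouette width, here is a small Python sketch of the calculation for a given set of cluster labels. It is written from the standard definition and is not the R function used to draw the plot. The points and labels are made up, and in practice you would run K-means for each candidate K and compare the resulting scores.

```python
from math import dist
from statistics import mean


def mean_silhouette_width(points, labels):
    """Average silhouette width of a clustering (ranges from -1 to 1)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for single-point clusters
            continue
        # a: mean distance to the other points in the same cluster
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster
        b = min(
            mean(dist(point, other) for other in members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return mean(scores)


# Made-up points forming two obvious groups
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_score = mean_silhouette_width(points, [0, 0, 1, 1])  # sensible grouping
bad_score = mean_silhouette_width(points, [0, 1, 0, 1])   # mixed-up grouping
print(good_score, bad_score)
```

A score near 1 means points sit much closer to their own cluster than to the next nearest one, which is why the highest average is the usual pick for K.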
I decided to do the plot below for two different k values, 6 and 10. The plot for 10 is below.
10 seems like a good value, as there are not too many clusters to deal with but there is still good variation between the different clusters. We will take k equal to 10 for further analysis.
The comparison above looks at causes of death, and I have grouped the data to get the mean for each cause in each cluster. Conclusions that can be made:
Cancer is prevalent across all clusters, however, the higher the life expectancy the more prevalent it is. This could be because you’re more likely to get cancer at older ages.
Dementia is another cause which seems to increase with higher life expectancy.
HIV is highest in the two lowest life expectancy clusters, and the same goes for neonatal deaths.
Finally, road accidents are an interesting cause. By far the highest cluster is cluster 7, which seems to be the cluster with the highest increase in life expectancy over the last 65 years. Could this be because these are fast developing nations which do not yet have safe road infrastructure in place?
That’s it for a little intro into reviewing the data this way. Let me know your thoughts and comments. There are lots of datasets on the World Health Organisation’s website, as well as other datasets such as economic growth, that I can add to this analysis to develop it further.
Hello, welcome to the preview of Group F in the World Cup. Thanks for all the support so far on these previews. I would love to hear people’s thoughts and predictions on the competition. Today we will be looking at Group F, which contains Germany, Mexico, South Korea and Sweden.
The first thing to look at is the age distribution of all 4 teams. Germany seems to have one of the younger squads in the tournament, with a relatively small spread between the youngest and oldest players. Mexico has players ranging from their 20s all the way up to nearly 40. South Korea and Sweden have the same median age, but South Korea has more players clustered around the median and has the fewest players above 30.
Mexico has what looks to be the most experienced squad, with players mostly having around 50 caps and some with up to 150 caps. Germany has a lot of players with a relatively low number of caps but also shows the trend we have seen with other squads of having a group of players with a lot of caps. I wonder if these players are a similar age and therefore could be a golden generation. Sweden possibly has the most inexperienced squad in the group, with a lot of players on fewer than 50 caps.
On the face of it, Germany seems to have a small number of attackers in the squad. However, they have more midfielders and a few of them are creative attacking midfielders, therefore I don’t think they will struggle for goals. South Korea has the same number of attackers as Germany but seems to have picked more defenders. This could leave them struggling to score goals.
Last but not least, we look at each team’s chances using the probabilities implied by the odds. No surprise really, Germany are big favourites to get out of the group. However, second place looks to be a realistic target for the other three teams. It looks particularly close between Mexico and Sweden. They play each other in the last game of the group stage, therefore it could be a straight shoot-out for second place. Also, South Korea play Germany last, who may have already qualified and therefore may make changes, which could give South Korea an outside chance if it goes to the last game.
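As a quick illustration of what implied probability means, here is a small Python sketch. It assumes decimal odds, where the implied probability is simply 1 divided by the odds, and it rescales the results so they add up to a chosen total, because bookmaker odds include a margin. The team names and odds are hypothetical, and the rescaling step is just one common approach, not necessarily what sits behind the chart.

```python
def implied_probability(decimal_odds):
    """Probability implied by decimal odds, before removing the bookmaker margin."""
    if decimal_odds <= 1:
        raise ValueError("decimal odds must be greater than 1")
    return 1 / decimal_odds


def normalise(probabilities, total=1.0):
    """Rescale implied probabilities so they sum to `total`."""
    scale = total / sum(probabilities.values())
    return {team: probability * scale for team, probability in probabilities.items()}


# Hypothetical decimal odds for each team to win a group, illustration only
odds = {"Team A": 1.5, "Team B": 5.0, "Team C": 6.0, "Team D": 15.0}
raw = {team: implied_probability(value) for team, value in odds.items()}
fair = normalise(raw)
print(raw)
print(fair)
```

For a market where two teams go through, such as qualifying from a group, the same rescaling could use a total of 2 instead of 1.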
That’s it for today’s look at Group F. Please let me know your thoughts, as I would love to start a good debate. Also, check out the other blogs in the series.
In today’s World Cup preview we are going to be looking at Group E. This group contains Brazil, Serbia, Switzerland and Costa Rica. As mentioned previously, this is all part of my series previewing the World Cup, so please go check the others out and let me know your thoughts. Let’s first look at the age distributions of the 4 squads.
First of all, I think this is the biggest difference in ages across a group we have seen so far. Brazil and Costa Rica have medians in the late 20s, whereas Serbia and Switzerland both have medians around 25. Brazil seems to have a good cluster of players in the late 20s age bracket, which is peak age. Is everything aligning to make Brazil the strongest team in the competition? Costa Rica look to have a group of players around the same age, all in their late thirties. Serbia has a squad towards the younger end of the scale, with a lot of players between 20 and 25.
The age profiles of all four squads are reflected in the caps distribution. Both Brazil and Costa Rica have players with at least 30 caps. It would be interesting to investigate whether the number of caps a team has affects its chances of winning the World Cup. Serbia, having the youngest squad of the group, also has the most players with a low number of caps. Switzerland has a fairly even spread across the cap levels.
All four teams seem to have similar squad compositions. The only slight difference is that Brazil has fewer midfielders and more attackers. Costa Rica looks to be relatively thin on the ground when it comes to attackers.
Let’s now take a look at how each team’s chances in the tournament compare. This has been done by working out the chance implied by the bookies’ odds. What’s not surprising is that Brazil has an excellent chance of getting through the group, seeing as they are the favourites for the competition. What’s good to see is that it looks to be a close competition for second place in the group, with Serbia and Switzerland each at around 50/50. It will be good to see how close it is when the games are played.
That’s it for today’s preview. Please check the others out and let me know your thoughts. I’ll be back with the next one tomorrow.