F1 – 2018 Review

In today's blog we are going to look at the 2018 Formula 1 season and the numbers behind it, in particular the qualifying pace of each car. I have the data for each driver and car's fastest lap in qualifying.

 

Above you can see, for each position on the grid, the average difference to 1st. Obviously for 1st the average difference is 0, but after that the difference is measured in seconds. There is a large gap between the first 3 rows on the grid and the rest, possibly a hint of how far ahead the top 3 teams are of the rest.
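For anyone curious how a graph like that is put together, here is a rough sketch in R. The data frame name `quali` and its columns (`race`, `grid_position`, `lap_time` in seconds, one row per driver per race) are illustrative assumptions, not the exact names I used:

```r
library(dplyr)

# Average gap to pole by grid position (column names are illustrative)
avg_gap <- quali %>%
  group_by(race) %>%
  mutate(gap_to_pole = lap_time - min(lap_time)) %>%  # difference to 1st in that race
  group_by(grid_position) %>%
  summarise(mean_gap = mean(gap_to_pole), .groups = "drop") %>%
  arrange(grid_position)
```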

When we break the first graph down into drivers and teams, two things become clear: there is a big gap between the top 3 teams and the rest, and there are also some fairly big gaps between teammates. Pace-wise the closest battle was between Renault, Haas and Force India, with all 6 drivers close together and mixed up.

Looking at how each team's difference to pole position trended over the season, Ferrari started the season with the quickest car. However, Mercedes out-developed them and clearly had the fastest car by the end of the season. The team that made the biggest improvement was Sauber, going from one of the slowest teams at the first race to the fastest of the midfield by the end of the season. It was a bad season for the former great team McLaren: they got rid of the Honda engine at the end of last season, yet Toro Rosso, the team that took it on, had overtaken them by mid-season, and they slowly drifted further from the pace as the season went on.

Sauber's improvement looks even more impressive up close: they improved the car from being over 4% off the pace at the first race of the season to around 2% towards the end. This wasn't driven by one driver being quicker than the other; both drivers improved at similar rates, though it was clear Leclerc was the quicker driver.
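The percentage gap itself is simple to compute; a sketch using the same assumed `quali` columns, taking each team's faster car at every race:

```r
library(dplyr)

# Percentage off pole per team per race (column names are illustrative)
team_trend <- quali %>%
  group_by(race) %>%
  mutate(pct_off_pole = 100 * (lap_time - min(lap_time)) / min(lap_time)) %>%
  group_by(race, team) %>%
  summarise(team_pct_off_pole = min(pct_off_pole), .groups = "drop")  # faster car of the two
```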

The closest teammates were the Red Bull drivers Verstappen and Ricciardo, the Renaults of Sainz and Hulkenberg and the Williams of Sirotkin and Stroll. For all the hype around Max Verstappen this season, he was not much quicker than his teammate. At the other end of the scale, McLaren and Sauber had the biggest differences between teammates. It was no surprise that Vandoorne was dropped by McLaren at the end of the season, even though he was up against the great Fernando Alonso.

I hope you enjoyed this look back at the 2018 season. There were some interesting trends, though the clear gap between the top 3 teams and the rest is worrying. Hopefully next year the other teams are able to close the gap and we get a closer season.


NHL – Assessing Player Contributions

Today in the blog we are going to look at a method I have been investigating for measuring how much a player contributes to their team in each game. Normally you might look at how many goals a player scores. That's always the headline figure, but players contribute in numerous other ways in ice hockey – assists, shots, hits, takeaways – and we can also look at negative contributions to the team: penalty minutes, giveaways, goals conceded. These events can be given points values depending on how common the event is and how key it is to winning a match. We can then add up each player's contributions in a game and see whose affected the result more.

The first thing I have to say is thank you to NaturalStatTrick.com for making the data available; this would not be possible without them. Also, this method does not include goaltenders – I think they need a separate scoring system, so they are not included.

nhl points

The table shows how I have initially scored the different events. At the moment the values are fairly arbitrary, based on how much I value each event and how common it is. For instance, goals are the most important as without a goal it's impossible to win a game of hockey. Faceoffs are relatively common, and although they can be key to scoring I have awarded them a low score. One flaw at the moment is giveaways: a giveaway in your own defensive zone is probably a lot worse than a giveaway in the opponent's defensive zone. I have no data on the location of the giveaway, so I can't include that in the scoring. Location data would be very useful to refine this model, maybe for the future.
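To make the scoring concrete, here is a minimal sketch of the weighted sum in R. The weight values below are only placeholders (the real values are the ones in the table), and the `game_stats` data frame and its column names are assumed for illustration:

```r
library(dplyr)

# Illustrative weights only -- the real values are in the table above
weights <- c(goals = 5, assists = 3, shots = 1, hits = 1, takeaways = 1,
             faceoffs_won = 0.5, giveaways = -1, penalty_minutes = -2)

# Assumed shape: `game_stats` has one row per player per game, with a count
# column matching each name in `weights`
game_stats$contribution <- as.numeric(
  as.matrix(game_stats[, names(weights)]) %*% weights
)
```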

nhl1nhl44

Above you can see a sample of the point scores for the first 2 games of the NHL season. Clearly, in the first game, you can see how the Washington Capitals forwards dominated their 7-0 win. In the second match, despite a much closer game, the Toronto forwards also looked to dominate. This is just a first explanation of the method and a quick initial look at 2 games. Things I will look at going forward:

  • Comparison between defencemen and forwards. This method possibly overvalues forwards, as defensive actions are often not quantifiable. Splitting it up will show the best offensive defencemen
  • Point leaders so far in the season and where they score their points
  • Look at certain players over the previous seasons

Let me know your thoughts on this and if you have any ideas for what you want to see in the future.

Tidy Tuesday – Films Dataset

Hello, today we are going to be looking at this week's Tidy Tuesday dataset. This is just quick EDA, as I got a bit carried away with the dataset. I initially set out to make just one interesting graph but kept finding more and more interesting insights. Below you can see the structure of the dataset:

structuredata

As you can see, it's got 3401 observations of 9 variables, all different films. So first of all, let's look at how the production budgets are distributed with a histogram:

histogramfilm

We can see that the vast majority of films in this dataset have a budget of less than 25 million dollars. But how does that change for each genre in the dataset?
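A quick sketch of how a faceted histogram like the one below can be drawn in ggplot2, assuming the data sits in a data frame called `movies` with `production_budget` (in dollars) and `genre` columns (names assumed):

```r
library(ggplot2)

# Budget histogram split out by genre (column names are illustrative)
ggplot(movies, aes(x = production_budget / 1e6)) +
  geom_histogram(binwidth = 25) +            # 25-million-dollar bins
  facet_wrap(~ genre) +
  labs(x = "Production budget ($ millions)", y = "Number of films")
```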

histograms

Now you can see some interesting insights. Comedy, drama and horror films have clear peaks at the lower end of the production budget. Action and Adventure films are much more spread out across all production budgets.

vsgross

Now let's look at how the production budget influences how much a film grosses. Action and adventure have the steepest slopes, so the more money put into these films, on average, the higher the reward.
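Here's a sketch of that plot with a separate linear fit per genre, using the same assumed column names (`worldwide_gross` is my assumption for the gross column):

```r
library(ggplot2)

# Budget vs gross with one linear fit per genre
ggplot(movies, aes(x = production_budget / 1e6,
                   y = worldwide_gross / 1e6,
                   colour = genre)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +   # slope = average return per extra dollar
  labs(x = "Production budget ($ millions)", y = "Worldwide gross ($ millions)")
```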

release date

Above you can see when each film is released during the year. Notice the peak for horror in the 10th month, around Halloween. Drama seems to increase toward the end of the year in time for Oscar season, and adventure has a peak in July for summer blockbusters and another in December, maybe aiming for the holiday season.

profit

Finally, for now, let's look at the median profit percent per month – can we get any idea of when it's best to release a particular genre? Action, adventure and comedy seem to have 2 peaks, one in the middle of the year and one towards the end. Horror seems to have generally the highest median earnings. This dataset is a simple one, but one from which insights are easy to come by. I could definitely write at least one more blog with the other things I have found in this dataset. Another day perhaps.
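As a sketch, the monthly medians can be computed along these lines, assuming `release_date` is already a Date and defining profit percent relative to the production budget (both of which are my assumptions here):

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Median profit percent by release month and genre (column names assumed)
profit_by_month <- movies %>%
  mutate(month = month(release_date),
         profit_pct = 100 * (worldwide_gross - production_budget) / production_budget) %>%
  group_by(genre, month) %>%
  summarise(median_profit = median(profit_pct, na.rm = TRUE), .groups = "drop")

ggplot(profit_by_month, aes(month, median_profit, colour = genre)) +
  geom_line() +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Release month", y = "Median profit (%)")
```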

Finding the Next James Anderson

Hello, welcome to today's blog, in which we will be scouting for the next James Anderson. Possibly. The idea behind this blog is nothing new; in fact I have stolen it from another sport. The Rangers Report blog wrote a piece about scouting for the best youth players using age-adjusted stats, an idea based on Vollman's work on ice hockey. I always like to say a lot of the best ideas are re-purposed from other areas! So how am I going to apply it? Well, we are going to use the bowling stats from the Second XI County Championship, age-adjust them and see which bowler under the age of 21 looks to be the most promising.

bowlerdata

Above are the first 20 rows for the 350 bowlers we are going to be looking at. The top wicket-taker is Nijjar from Essex; however, he is 24 and therefore will not be included. The first under-21 player on the top wicket list is 20-year-old Mike of Leicestershire. So let's adjust the data and see where we end up.

The first issue is that the players have all bowled differing numbers of balls, so I need to get them all to the same number. To do that, I took the strike rate over the balls actually bowled and extended this to a full season of, say, 1500 balls. In the table below, the estwicket column is the number of wickets if the bowler's strike rate continued over a full 1500 balls.
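As a sketch, the extrapolation is only a couple of lines in R, assuming a `bowlers` data frame with `balls` and `wickets` columns (names are illustrative):

```r
library(dplyr)

# Extrapolate each bowler's strike rate to a full 1500-ball season
bowlers <- bowlers %>%
  mutate(strike_rate = balls / wickets,        # balls per wicket
         estwickets  = 1500 / strike_rate)     # wickets over a full 1500 balls
```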

ffff

However, a bowler who has taken a lot of wickets in a small number of balls is unlikely to continue at that rate to the end of a season (sorry, Ben Coad). Regression to the mean is a well-established statistical phenomenon, so when we extrapolate performance we need to account for it.

In the Rangers Report blog, they applied 1% regression for every match of projected performance. I can't do that, as my analysis is based on balls, and in multi-day matches bowlers will bowl different numbers of balls. Therefore I decided to apply the same 1% regression for every 100 extrapolated balls.
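The exact formula can be implemented in different ways; one reading of the 1% rule, continuing the sketch above, is to shrink each estimate towards the sample mean by 1% for every 100 balls that had to be extrapolated (this is my interpretation, not necessarily the exact calculation behind the table):

```r
library(dplyr)

# Shrink extrapolated estimates towards the mean: 1% per 100 extrapolated balls
mean_est <- mean(bowlers$estwickets, na.rm = TRUE)

bowlers <- bowlers %>%
  mutate(extrapolated_balls = pmax(1500 - balls, 0),
         shrink             = pmin(0.01 * extrapolated_balls / 100, 1),
         regressed_wickets  = estwickets - shrink * (estwickets - mean_est))
```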

regreewick

The table above shows how much this affects the bowlers on the list, with most bowlers having a reduction in wickets.

The final thing is to adjust the total wickets based on the player's age. The first step is to filter for all players 21 and younger and then apply Vollman's age curve below:

volmanage.png

The age curve is by year and month; however, I only have ages in years, so I will just be using the numbers in the first column. The graph below shows the results:
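Continuing the sketch, the age adjustment is just a join and a multiply. The multipliers below are hypothetical stand-ins for the first column of the curve above, not the real values:

```r
library(dplyr)

# Hypothetical multipliers standing in for the first column of the age curve
age_curve <- data.frame(age        = 17:21,
                        multiplier = c(1.40, 1.30, 1.20, 1.10, 1.05))

young_bowlers <- bowlers %>%
  filter(age <= 21) %>%                        # under-21 players only
  left_join(age_curve, by = "age") %>%
  mutate(age_adj_wickets = regressed_wickets * multiplier)
```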

bowlers

Based on this method, Szymanski is the best bowler in the Second XI County Championship. However, the large dot means we had to extrapolate a lot for him. Another interesting point is that 3 of the top 5 estimated wicket-takers are left-arm spinners. The England team is badly missing a reliable spinner, particularly away from home – could one of these 3 be the future England spinner?

There is lots more work that can be done with this data, such as looking back historically at how these numbers relate to future County Championship averages, and applying a similar model to batsmen, which will be detailed in a future blog. Let me know your thoughts – have you seen Szymanski bowl? Ideally, I would have preferred younger players, but I think the youngest players play in the under-17 county competitions.


Data Science Ethics and Societal Implications

Hello, I wanted to do something different today. Recently I applied to a data science master's at a local university, and for the application you needed to discuss the ethical and societal implications of an aspect of data science. Unfortunately, I didn't get accepted on the course, so I'm going to share my answer with you, as I thought it was an interesting question.

“Information is the oil of the 21st century, and analytics is the combustion engine.” However, with this great power comes great responsibility.

Data analytics encompasses everything from daily shopping habits, to application usage on a smartphone and, more significantly, to users’ Google search history. The data collected, generated and analysed can have huge political ramifications in determining election outcomes and, ultimately, party victories or losses.

Nonetheless, this raises the question of how ethical such data harvesting is. In particular, there are key ethical debates surrounding the awareness and understanding of each individual who is targeted in the name of data science. An especially challenging issue is that very often data is generated without an individual's consent, knowledge or understanding, practices that, had they occurred in other fields, would be viewed as highly unethical.

Algorithms are now more and more prevalent throughout society. They can be extremely useful in some situations, from suggesting suitable music linked to a user's profile to presenting advertisements in line with a Google search history. If the algorithm goes wrong in these instances, you either end up listening to a song for 30 seconds before skipping to the next one or see an advert for something you have no interest in. Other algorithms, however, such as those used to decide whether an individual gets parole, can have far more sinister outcomes. In order to be considered ethical and fair, these life-altering algorithms must, in my view, have some form of independent verification. This is to ensure they are just and free from discrimination against all ethnic groups, sexual orientations and political affiliations within society. The users of these algorithms should also be thoroughly trained to understand that no decision the algorithm produces can be 100% certain; in other words, there will always be an element of probability.

Significantly, the Cambridge Analytica scandal is one of the key ethical case studies within data science. The negative media coverage and public outcry from this scandal suggest that it is highly unethical to obtain an individual's data from a public website and subsequently employ this sensitive information to target and direct political advertising and influence voting. However, would such data harvesting still be perceived as unethical if it were employed in medical research? For example, what if the data had been utilised to identify people in the earliest stages of cancer and could, most positively of all, increase survival rates? Would there be such universal outrage, or perhaps a more accepting ethical stance, if the data were employed for “good”?

Additionally, data science is having an impact on society through the risk of improperly communicated data being used to reinforce a particular belief and, ultimately, as a weapon to heighten prejudice. This has a wide impact, particularly in general elections, and could cause politicians to focus on misguided policies. It could lead to the requirement for an independent body within society to fact-check and review published data science work, ensuring the public receive accurate information.

Biketown EDA – P2

Hello, welcome to the second part of this blog doing exploratory data analysis on the Biketown dataset. If you haven't read the first part, go check it out. As an overview, we found that most of the records were either subscribers to the system or casual users who might just use it every now and then, so I decided to compare those two groups. We saw in the smaller sample that most of the casual users used the bikes for recreation, while the subscribers used them mostly for commuting. Today we are going to look at the distances and speeds the groups travel and where they rent the bikes from, but first we are going to look at when they rent the bikes:

plot8.png

This graph fits the idea that subscribers tend to rent the bikes for commuting: there are two large spikes at around 8am and around 5pm, the peak hours for people going to and from work. Casual users don't tend to use the system in the morning; however, there's consistent usage throughout the afternoon. This could be tourists exploring the city, for instance. It's strange that both groups don't reach their minimum until well after 2am – are these people using the system to get home from their nights out? Beware of the drunk Portland cyclist!
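A sketch of how a plot like this can be built, assuming the `trips` data frame has `start_time` ("HH:MM:SS") and `payment_plan` columns (the names are illustrative, not the exact ones in the files):

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Rentals by hour of day for each payment plan (column names assumed)
rentals_by_hour <- trips %>%
  mutate(hour = hour(hms(start_time))) %>%
  count(payment_plan, hour)

ggplot(rentals_by_hour, aes(hour, n, colour = payment_plan)) +
  geom_line() +
  labs(x = "Hour of day", y = "Number of rentals")
```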

distance

The density plot of the distances the two groups travel is interesting, but I think it fits the pattern so far. Subscribers tend to do shorter journeys, perhaps because they are covering the last few miles to work, whereas the casual users are maybe exploring the city and therefore cover more distance.

speeds.png

Now let's check out the speed curves for both groups: the further the person travelled, the slower the speed. The subscribers are generally the faster riders at all distances. The increase in speed towards the 25-mile distance is, I think, down to the small amount of data out there. At the shorter distances the subscribers have significantly faster speeds – are they using the system to get from A to B as quickly as possible?
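The speeds themselves are just distance over duration; a sketch with the same assumed column names (`distance_miles` and a `duration` string in "HH:MM:SS" form):

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Average speed in mph from distance and ride duration (column names assumed)
trips <- trips %>%
  mutate(duration_hours = period_to_seconds(hms(duration)) / 3600,
         speed_mph      = distance_miles / duration_hours)

ggplot(trips, aes(distance_miles, speed_mph, colour = payment_plan)) +
  geom_smooth() +
  labs(x = "Distance (miles)", y = "Average speed (mph)")
```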

start.png

Now let's look at the start and end locations for the casual riders. The heat map above shows that most rides are clustered around the city centre, possibly moving from one tourist spot to another. There seems to be a fairly even distribution across the central locations.

subscriber.png

The subscriber heat map above shows that, generally, people take the bikes from much further out compared to the casual riders, who are possibly just using them to get to the tourist hot spots. Once again most of the usage is on the left side of the river – that must be the area where people like to get about. Also, the start locations around the outside are much denser, so it's clear people are using the system from further out to ride into the centre of the city.
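For reference, a simple way to sketch heat maps like these is to bin the start coordinates, assuming longitude/latitude columns along the lines below:

```r
library(ggplot2)

# 2-D binned "heat map" of start locations, split by payment plan
# (coordinate column names are assumed)
ggplot(trips, aes(start_longitude, start_latitude)) +
  geom_bin2d(bins = 60) +
  facet_wrap(~ payment_plan) +
  labs(x = "Longitude", y = "Latitude", fill = "Trips")
```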

That's it for this exploratory data analysis. I think we have found some interesting insights and, at the very least, been able to confirm what you would expect. I hope you have found this interesting and informative – let me know your thoughts, and if you have looked at this dataset yourself, share what you found. Check out the code on my GitHub, which should be linked somewhere on the website.

Nike Biketown – Exploratory Data Analysis

Hello, welcome to today's blog, in which we are going to take a large dataset and do some exploratory data analysis on it. I am going to look at the Biketown dataset, which featured on Tidy Tuesday. If you're new around here, Tidy Tuesday is a hashtag on Twitter that the R for Data Science online learning community actively promotes every Tuesday. If you're inspired to learn R and data science like I was, it is a really great community full of wonderful people to start with. I am not going to post all of the code within the blog as I think it gets too long; however, the full code used will be posted on my GitHub.

structure

Above is the structure of the data frame. The data comes in numerous csv files, so I read them all in and created one large data frame structured like so. The second column, payment plan, seems an interesting one: it has 3 values – casual, subscriber and blank. The system in Portland has a way for a regular user to automate payments to save time. Let's look at how much of the dataset falls under each of the 3 types of payment:
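Reading the files in is straightforward; a minimal sketch, assuming the csvs sit in a single folder (the path below is illustrative):

```r
library(readr)
library(purrr)

# Read every monthly csv from a folder into one large data frame
files <- list.files("data/biketown", pattern = "\\.csv$", full.names = TRUE)
trips <- map_dfr(files, read_csv)
```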

plot1

As you can see, the vast majority of this dataset is either casual or subscriber, and I think it would be interesting to review the differences between the people on the two main payment plans. Therefore, going forward in the EDA we are going to remove the entries with a blank payment plan. Later, we could possibly look at a method for working out which payment plan the blanks belong to. First things first, let's have a look at what type of trips each group takes, as sketched below:
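A sketch of that filtering step and the trip-type counts, with the plan labels and column names assumed rather than taken from the real files:

```r
library(dplyr)

# Keep only the two main payment plans (exact labels assumed) and count the
# recorded trip types for each group
trips_main <- trips %>%
  filter(payment_plan %in% c("Casual", "Subscriber"))

count(trips_main, payment_plan, trip_type)
```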

plot5plot2

The big issue here is that making any firm conclusion about the type of trips each group takes is going to be difficult. This is because each group has over 200 thousand entries, yet fewer than 1000 trips per group have a recorded trip type. What we can say is that, within this smaller sample, subscribers tend to use the system for commuting and casual users use it for recreation, which makes sense.

plot6

Now we look at the payment methods that both groups have used. By far the 3 main payment types for both groups are keypad, mobile and keypad_rfid_card, with subtle differences between the 2 groups. The RFID card share is clearly higher among subscribers, which must be because subscribers are given a card in order to gain access to the bikes. Casual users are much more likely to use their mobile to gain access. In both groups the vast majority use the keypad system.

That's it for part 1 of the exploratory data analysis on the Biketown data. Tomorrow we will look at the distances the groups travel as well as locations and usage times. Let me know your comments on the first part.