https://theparttimeanalyst.com/2020/06/20/twenty20-win-probability-added/

I created the metric using a logistic regression machine learning model. Now it's time to apply the model to real data and look at what insights it can show.

The first question I want to ask is which performance, in the data I have, had the biggest impact on a team's chances of winning.

Above are the top 10 batting performances by win probability added. The Universe Boss, Chris Gayle, appears three times in the top 10, with the best performance ever being his 151 off 62 balls in the Blast against Kent.

Chris Gayle has always opened the batting, which leads me to wonder: does where a batsman bats affect how much win probability they add?

When in an innings a batsman faces their first ball is a proxy for their position in the batting order. As the graph above shows, the win probability added is quite similar for any entry point up to just after the halfway mark. After that, innings starting later have on average had a negative impact on the team's chances of winning. This tells me that either the best players open the batting, or batting earlier makes it easier to strike at high rates and score more runs. Like anything, it's probably a mixture of the two.

For the final part of the overall analysis I looked at how a batsman's WPA is affected by the competition. A net positive means the competition is easier, as a player is generally adding more than their career average, while a negative means the player is adding less. I found it quite surprising that the Caribbean Premier League came out as the most difficult for batsmen. It looks like a lot of batsmen went there and struggled.

This year's Blast finished in October with the Notts Outlaws taking the trophy for the second time. For this win probability metric I am going to normalise it to win probability added per 10 balls faced.
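
The post doesn't show the normalisation code, but the idea is simple enough to sketch. This is a Python illustration with hypothetical names (the blog itself works in R):

```python
def wpa_per_10_balls(total_wpa, balls_faced):
    """Scale a batsman's total win probability added to a per-10-balls rate."""
    if balls_faced <= 0:
        raise ValueError("batsman must have faced at least one ball")
    return total_wpa / balls_faced * 10

# a batsman adding 0.12 win probability over 40 balls
print(round(wpa_per_10_balls(0.12, 40), 3))
```

Normalising like this stops a long, slow innings looking better than a short, explosive one purely because it lasted longer.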

As we can see, most performances in this year's Blast were around 0, but there are clear players who regularly made positive contributions to their team's outcomes. There are also a lot of players with negative contributions.

Next we can see that win probability added is highly correlated with a batsman's strike rate. This suggests that, as a general rule, the higher strike rate player might be better even if another player generally scores more runs.

I have looked at some of the big picture areas, like how batting position affects the number. Now I want to move on to player performances over the years, and the first player I want to look at is Jos Buttler.

He seems, in general, to be improving as the years have gone on. In his early years he was facing fewer balls and not contributing as much as he does now. This could be because he was starting his innings much later in the team's innings.

When you look at Buttler's batting position over the span of his career, you can clearly see he changes from coming in around halfway on average (and contributing lower WPA) to coming in earlier, which coincides with his increase in WPA.

Thanks for reading.

First things first, looking at the xPos data for the whole season. If you want to see how this is calculated, see this blog here:

https://theparttimeanalyst.com/2019/11/02/f1-drivers-rated/

Now it's not the perfect measure; I have a plan to revise it further in another blog. For example, I think Ocon is overrated by it, which could be caused by poor qualifying performances. However, the metric does show the best drivers to be Verstappen and Hamilton. I don't think many people would argue that they are not the current two best drivers on the grid, so it's good to see them top the metric. One driver who is a lot lower than I would expect is Bottas, with a -1 xPos loss and only the drivers in the two Haas cars and the Williams cars below him.

When you review both drivers' seasons you can see Hamilton's consistency and his incredible performance in Turkey, whereas Bottas isn't quite as consistent. You can argue he had some bad luck, like the tyre failure in the last few laps at Silverstone (his worst performance of the season); however, at the Turkish and the two Bahrain races he just looks like he went missing, and he was comprehensively beaten by a newcomer in the same car.

Comparing all teams' race laps to each race's fastest lap, on average Mercedes were the fastest car. No great surprise there and no real insight to be gained from it; I don't think anyone would put the teams in a different order.

Comparing that same data to each team's performance in 2019 starts to show some insights. The first thing is that Ferrari and the Ferrari-powered teams are the only ones that drifted backwards after last season. What's interesting is that Alfa Romeo and Haas went slightly backwards but Ferrari went significantly backwards. This demonstrates how much their car was designed around the possibly illegal engine, in that losing it had a much bigger impact on Ferrari. Overall, most of the field got closer to the fastest car, with the biggest improvement coming from Williams, though it's easiest to improve from a low base. Racing Point look to be clearly the third quickest but didn't achieve third place in the constructors' championship.

Finally, looking at a driver's performance compared to their teammate, I created this plot which, if I'm honest, I'm not satisfied with, but I think you can just about see the message I'm trying to convey. Leclerc, Verstappen, Ricciardo and Perez were clearly quicker than their teammates in both qualifying and the race. When you look at the Mercedes drivers, Hamilton is not as far ahead of Bottas as I expected. Williams is the only driver pairing where one driver is significantly quicker in qualifying but the other is quicker in the race. Lastly, the closest pairing looks to be McLaren: there is nothing between the two drivers in qualifying, but on race pace Sainz looked to have the slight edge. Maybe that is experience on Sainz's part and Norris will soon progress to that level.

Thanks for reading; that's my review of some key information from the 2020 F1 season. Roll on the 2021 season in March. I hope you, your families and your friends have a great 2021.

Today I'm going to do a little exploration of the data from the F1 2020 season so far, looking at a number of questions about the season. First of all, qualifying, and why a lot of teams are annoyed by (t)Racing Point and the strategy they have used to develop their car. Reviewing the average qualifying position for each car shows you the quality of the car on the grid. As an example, here's McLaren.

Since 1990 McLaren have had some big changes in grid position, both up and down. It looks like you are more likely to fall far down the grid than to move far forward: at no time have McLaren ever gained 5 places on the grid compared to the year before.

To really see the improvement Racing Point have made, I compared their average grid position this season to last season. On average they are 6.7 places higher on the grid than last year. Calculating each team's yearly change in qualifying position, unsurprisingly, most of the time it doesn't change much forward or back. There are only 7 teams who have made a bigger improvement from one season to the next, and most of those happened alongside large rule changes (Brawn GP in 2009 or Williams in 2014). If you are one of Racing Point's rivals, like Renault or McLaren, you would be right to be annoyed if you are struggling to improve by more than 3 places in a season while Racing Point come along and improve by 7 places using development methods that are definitely in a grey area.

Moving on to the grid in general: this is a small sample size, but so far this season has clearly been dominated by whoever is on pole position. The only race not won from pole was the 70th Anniversary GP, where Verstappen beat both Mercedes. Hopefully that percentage reduces over the coming races or we could be in for a boring season.

Finally, there has been some controversy this week caused by F1's own rankings for the fastest qualifier. A few months ago I came up with a simple model to track how an F1 driver performs, based on where they finish races compared to expectation. See here:

https://theparttimeanalyst.com/2019/11/02/f1-drivers-rated/

It was a simple model and I have ideas to refine it further, coming to a blog near you soon, but here is what the current version says about drivers' performances so far this year.

Verstappen is way ahead of the other drivers, and part of that is because of Hungary. He only qualified 7th but finished 2nd, which is a big gain; generally, the average driver starting higher than 8th goes backwards, so that was a big win for Verstappen. His current value of 3.5 is extremely high compared to how this number looks over the long term, and therefore I expect it to reduce over the next few races. Other good performances look to be Stroll and Perez, but they could be boosted by their poor qualifying at the Styrian GP. Stroll, I think, is the biggest surprise by this metric, and he seems to be having a good season. Russell looks to be having a bad season, but maybe he has been putting the car on the grid far higher than it should be. This measure is far from perfect; look out for the update to improve it.

That's it for a summary of the data from the F1 season so far. We will see how the season develops in the next few races, and I will look to update this later in the season so it can be better understood.

This is the output from my model to forecast the Formula 1 grid. In this blog I am going to explain how I went about it.

First things first, I need some data to train the model on. The way a Formula 1 weekend works is that there is free practice on Friday, qualifying on Saturday and the race on Sunday. The aim is to use the data generated on the Friday to forecast the grid. These are the data points I am going to use:

- Practice 1 lap time
- Practice 1 difference
- Practice 1 laps
- Practice 2 lap time
- Practice 2 difference
- Practice 2 laps

There are other variables that could be used, such as a way of classifying the type of circuit to try to tease out a car's strengths and weaknesses, but that's an area for future development. I collected practice 1 and practice 2 data from 2015 to now.

This produced the above data frame with over 1700 records in total. As I will be creating a classification model, I also need to add each driver's qualifying position. I wanted a classification model because I want the output to be a percentage chance of achieving each position.

Now I have the data, I can use the tidymodels collection of packages to create the model.

First I use rsample to split the data into training and testing sets. The first part of rsample is to create the rules for splitting your data. The arguments are: what data you are using, what proportion you want in training versus testing, and then the strata argument, which I haven't used. The strata argument lets you balance your target variable across the training and testing sets. As I have lots of possible outcomes (20, the total size of the F1 grid) and a relatively small data set, it was not possible to use the strata argument.

For the next step you can either create a recipe to process your data using the recipes package, or go straight to parsnip to create your model. I went straight to parsnip as my data was relatively simple and there was no need to pre-process it. As you can see in the code above, I am initially using a random forest model, and you can see how simple it is to create.
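
The blog's actual code is R/tidymodels; as a rough analogue, the split-then-fit steps look like this in scikit-learn. The features and targets below are synthetic stand-ins, not the real practice data:

```python
# A rough scikit-learn analogue of the rsample/parsnip steps above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 400
X = rng.normal(size=(n, 6))       # stand-ins for the six practice-session features
y = rng.integers(1, 21, size=n)   # qualifying position, 1 to 20

# equivalent of initial_split with a 0.75 proportion; no strata, as in the blog
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=42)

# equivalent of rand_forest() with a ranger-style engine
model = RandomForestClassifier(n_estimators=500, random_state=42)
model.fit(X_train, y_train)

# each row of predict_proba is the percentage chance of each grid position
probs = model.predict_proba(X_test)
```

The key point carried over from the blog is that a classifier's probability output, not just its hard prediction, is what turns the model into a grid forecast.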

To review the model and test its quality on the testing set, the yardstick package has numerous functions. Above I have plotted a ROC curve for each classification option, i.e. every position on the grid. There is a difference between positions: the model seems to be a lot better at predicting the first 3 grid positions than the others. I think this is because the pattern in F1 recently is that the top 2 or 3 teams have been well ahead of the rest, which makes the classification job a bit easier.

The next step, now I have a baseline model, is to tune it using the tune package. There are 3 tunable hyperparameters for a ranger random forest: trees, mtry and min_n. Trees is the number of trees in each random forest and needs to be high enough to reduce the error rate; I am going to set it to 1000 to start with. Mtry controls the split-variable randomisation and is limited by the number of features in your data set. Min_n is short for minimum node size and controls when the trees stop splitting. The tune package allows you to conduct a grid search across those parameters to find the right values.
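
In scikit-learn terms, mtry maps roughly to `max_features` and min_n to `min_samples_leaf`, so the same kind of search can be sketched as follows. The data and parameter ranges here are illustrative only:

```python
# Sketch of a hyperparameter grid search analogous to the tune step.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = rng.integers(0, 3, size=300)

param_grid = {
    "max_features": [2, 3, 4, 5],     # mtry: features tried at each split
    "min_samples_leaf": [5, 20, 40],  # min_n: minimum node size
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    param_grid, cv=3, scoring="roc_auc_ovr")
search.fit(X, y)
best = search.best_params_
```

As in the blog, scoring on ROC AUC (one-vs-rest for the multiclass case) rather than raw accuracy keeps the tuning focused on how well the probabilities rank the outcomes.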

The results of tuning on the ROC AUC metric are shown above. Clearly trees doesn't make any difference; it is spread almost randomly. Mtry shows some variation, with the best value looking to be 3 to 5. Min_n is on a slope: the tune function automatically selects a range to train over, but it looks like the AUC is still increasing, and maybe the best value is a lot higher than 40. Therefore I am going to use a grid search to tune the model.

I conducted a grid search across a range of values for min_n and mtry. You can see the best value for min_n is between 200 and 300, with mtry at 4. Doing a grid search across both parameters means you can control for their influence on each other and therefore get the best value for both. I then trained another model with those values.

Comparing the original model to the now tuned model, it is slightly better for most positions. This is the model I will use going forward. To improve it, I think I need to add weather data; for example, in the recent Hungarian Grand Prix, second practice was affected by rain, which makes this model difficult to run. During an F1 race weekend teams have 3 different tyre compounds (soft, medium and hard). Adding the compound each lap was set on would improve the model, because a lap might have been set on a slower tyre than others.

Full Code:

https://github.com/alexthom2/F1PoleForecast/blob/master/Polepos_reg.Rmd

Also check out the tidymodels website here

Each player's impact on the game can then be quantified, and the best players will have the highest impact on the match.

In the plot above, 1 is a win and 0 is a loss. For balls in the middle of the first innings, some clear segmentation can be seen: more runs at any stage means a higher chance of winning (there is more purple). The task is then to translate this visualisation into a usable model of how much extra chance of winning a player has added.

I am using Cricsheet's ball-by-ball data for all Twenty20 matches, found here:

Therefore I have ball-by-ball data for the IPL, internationals, the Big Bash, the Blast and the PSL. This totals over 600,000 balls and should be a nice large data set to train the model.

I am going to create 2 models: one for the first innings and one for the second innings. For both models the main features are how many balls have been bowled, how many runs have been scored and how many wickets have been lost. For the second innings there is the effect of scoreboard pressure, so I will add a feature for how many runs are required. Maybe how many runs were required at the first ball is even irrelevant, and it's all about how many runs are required at each particular stage.

I am using the tidymodels collection of packages to create the initial model and the subsequent models in this series. Splitting the data into training and testing sets is easy: rsample has a function called initial_split which is perfect for this example.

With initial_split you specify the data being used for the model and the proportion with which to split it; here I have also used the strata argument. This means both my training and testing data sets have a balanced number of winning and losing rows. The split object can then be used with the training and testing functions to simply create the two different datasets.

I am going to start by training a logistic regression model. In the future I will look at other models and compare them to this one.

Now that I have training and testing data I can move to the parsnip package.

Above you can see the code to set up the first logistic regression model. The first part, logistic_reg, sets up which model I am going to create. The mode argument matters more for model types that can be used for both classification and regression; logistic regression is only for classification problems, so the argument isn't really needed. Next is the set_engine function, and this is the great benefit of tidymodels: there are many different machine learning engines, and set_engine gives access to many of them through a common interface. In this version I have just used a simple glm engine, though others are possible (stan, spark and keras).
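
For readers who don't use R, the same kind of fit can be sketched in Python with scikit-learn. The features mirror the ones described above, but the data below is synthetic and the toy win label is my own construction, not the blog's:

```python
# Illustrative Python version of the first-innings win probability model
# (the blog uses parsnip's logistic_reg with a glm engine).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000
balls = rng.integers(1, 121, size=n)                        # balls bowled so far
runs = (balls * rng.uniform(0.8, 2.0, size=n)).astype(int)  # runs scored so far
wickets = rng.integers(0, 10, size=n)                       # wickets lost so far

# toy label: a better run rate and fewer wickets down make a win more likely
won = ((runs / balls) - 0.15 * wickets + rng.normal(0, 0.5, n) > 1.0).astype(int)

X = np.column_stack([balls, runs, wickets])
model = LogisticRegression(max_iter=1000).fit(X, won)

# win probability at 10 overs, 90 for 2
p = model.predict_proba([[60, 90, 2]])[0, 1]
```

The model's probability output at each ball is what becomes the in-game win probability curve later in the post.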

Finally, the fit function is the same function you would use whatever the model. It simply takes the formula and the data.

I tested the model on the testing data by plotting a ROC curve.

Comparing the first and second innings models, the second innings model is clearly the better classifier of which positions are winning ones. However, that isn't really the point; the key is that both models are better than random guessing, and I can use this baseline to enhance the models further.

Applying the models to a match from 2019, you can see how the win probability varies across both teams. In the first half of the match, when Durham batted, it was pretty even. However, when Northamptonshire chased, Durham took control and won the match.

Above you can see a summary of each batsman's contribution to the match. Despite what appears to be a poor batting performance, Northamptonshire have the 2 batsmen with the best contributions. Durham's batsmen didn't make any big negative or positive contributions; they just kept the team in the game, and it looks like the bowling won the match.

Moving on to the bowlers, you can clearly see where the match-winning contribution came from. Potts took 3 wickets for just 8 runs in 3.3 overs. This looks to have been the match-winning contribution, but looking at the information, Short got man of the match.

That's it for the first part of this series. In the next part I will take this simple implementation and review it further with more complicated machine learning models. If you have any feedback or comments please let me know. Stay safe.

https://wordpress.com/view/theparttimeanalyst.com

In it I looked at calculating the Pythagorean win percentage for each team in the IPL, and then moved on to calculating how many extra runs are needed to win one extra game. All team building should then be aimed at reaching that number: where can you get an extra 60 runs from?

The question I am looking to answer is how to predict how many runs a batsman may score. If I were to build a model, I think by far the most predictive element of how many runs a batsman will score is how many balls they face.

Plotted together they have an r-squared of 0.8627, so 86% of the variance in runs is explained by how many balls a batsman faces. The next question is how to come up with a good value for how many balls a batsman will face. It will depend on the bowler and on when in the innings the batsman comes in; these are areas where the model could be refined going forward. However, to start with:
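
The 0.86 figure comes from the real IPL innings data, but the calculation itself is just a squared correlation, which can be sanity-checked on toy data:

```python
# Quick sanity check of the balls-faced vs runs relationship on synthetic innings.
import numpy as np

rng = np.random.default_rng(7)
balls = rng.integers(1, 70, size=500)
runs = balls * 1.3 + rng.normal(0, 8, size=500)  # made-up scoring pattern

r = np.corrcoef(balls, runs)[0, 1]
r_squared = r * r  # share of variance in runs explained by balls faced
```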

Overall the distributions of innings lengths are similar, as you would expect. They peak at a low number of balls, reflecting that most innings in the IPL are relatively short, but they have a long tail showing there are plenty of innings of substantial length as well.

In order to simulate how many runs a batsman might score, I am going to use the beta distribution and draw from it randomly.

The beta distribution with shape parameters of 1.25 and 6, over 900 draws, gives a shape broadly similar to the historical shape of all IPL innings. There are 14 games in the group stage of the IPL, which is a relatively small sample size, and there are many different qualities of batsman. The two arguments to the beta distribution dictate the overall shape, and the idea is to use a batsman's historical average balls faced to adjust the shape2 parameter. This will then give a more reasonable draw of balls faced for each batsman.
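
A minimal Python sketch of the simulator follows. The specific way shape2 is adjusted by career average is my assumption about how the blog's tweak might work, not its exact formula:

```python
# Sketch of the balls-faced simulator using beta draws scaled to a T20 innings.
import numpy as np

rng = np.random.default_rng(0)
MAX_BALLS = 120  # a T20 innings lasts at most 120 legal deliveries

def simulate_balls_faced(avg_balls, n_sims=900, shape1=1.25,
                         base_shape2=6.0, base_avg=20.0):
    # assumption: a higher career average shrinks shape2, fattening the right tail
    shape2 = base_shape2 * base_avg / avg_balls
    draws = rng.beta(shape1, shape2, size=n_sims)
    return np.rint(draws * MAX_BALLS).astype(int)

sims = simulate_balls_faced(avg_balls=25)
```

Drawing from a distribution rather than using the average directly is what lets the simulation produce the occasional long innings alongside the many short ones.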

Reviewing the average number of balls per innings compared to the total number of balls faced:

It looks like over a fairly decent career length it's pretty difficult to average more than 30 balls per innings. However, there are a few batsmen who average significantly more over a relatively short career. To create the most accurate model, if a batsman has an average innings length of more than 30 but has faced fewer than 1000 balls, I am going to cap their average at 30. This will stop the model over-weighting small-sample batsmen. The methodology can be refined further in the future.

We now have output which simulates how many balls each batsman might face in an innings. The next step is to turn that into a number of runs. I am going to use a linear model to predict the runs; this model will be refined in the future and I will talk about it another time. These are the predictors I will be using:

**Features**

- No. balls faced – drawn from the beta model above
- Dot percentage – the percentage of balls faced which end in dot balls
- Non-boundary strike rate – does the batsman rotate the strike or just stand there hitting sixes?
- Six percentage and four percentage – what percentage of their balls they hit for six and for four
- Strike rate – the batsman's overall average strike rate

These features can be used to predict how many runs a batsman would be expected to score in the IPL. For now I am just using a simple linear model; this could be improved with a more powerful model and probably more powerful predictors, but this is a first version, so I am keeping it simple.
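
The feature list above can be sketched as a linear model in Python. The training data here is synthetic (the real model is fit on IPL innings), and the query point at the end is a made-up batsman profile:

```python
# Sketch of the runs model with the features listed above, on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 300
balls = rng.integers(1, 70, size=n)
dot_pct = rng.uniform(0.2, 0.6, size=n)
six_pct = rng.uniform(0.0, 0.15, size=n)
four_pct = rng.uniform(0.05, 0.2, size=n)
# made-up relationship tying strike rate to the shot-mix features
strike_rate = 100 * (1 - dot_pct) + 500 * six_pct + 300 * four_pct
runs = balls * strike_rate / 100 + rng.normal(0, 5, size=n)  # toy target

X = np.column_stack([balls, dot_pct, six_pct, four_pct, strike_rate])
model = LinearRegression().fit(X, runs)

# predicted runs for a hypothetical 40-ball innings
pred = model.predict([[40, 0.35, 0.05, 0.12, 126.0]])[0]
```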

Now on to evaluating the model's performance. The model is only useful if it produces values around what you would expect, so it's important to test it. The first test: in the 2019 season, what percentage of batsmen did the model get within +/- 50 runs?

As you can see, over 10,000 runs of the model, around 55% of batsmen were within 50 runs, which I am relatively pleased with. Improving model performance is an iterative process, and this looks to be a good baseline to start from.

Also, when actual runs are plotted against predicted runs, most batsmen follow a similar line. The furthest point away seems to be KL Rahul, who scored a lot fewer runs than predicted. That's the model for today; in the next blog I'm going to look at individual player performances and compare the 2020 IPL squads. Who bought the best players? The code for the model is available on GitHub:

https://github.com/alexthom2/IPL_Moneyball/blob/master/Modv1.Rmd

The model I will be using is fivethirtyeight.com's. Nate Silver's model makes predictions for a number of leagues and competitions across the world. Below is a link to the predictions:

https://projects.fivethirtyeight.com/soccer-predictions/?ex_cid=rrpromo

and this is the methodology used:

So the idea is to compare how this model does across all the leagues: if the Championship is the most unpredictable league, then the model will perform worst on it. Simple logic. Just to be clear, this is not about picking faults in the model; my skill is nowhere near that of Nate Silver and his team.

They produce a csv file with all the predictions since the 16/17 season. The first thing I'm going to look at is simply how many correct results the model has got by league. If the Championship is the most unpredictable league, it will have the lowest proportion of correct predictions.

Not true. While the Championship is at the lower end, it is some way above the least accurate, which is League 2. The Chinese Super League looks to be the most predictable. Looking at the leagues near the top, the Premier League and the Scottish Premiership both have teams a lot better than the rest (the top 6 in the Premier League; Celtic and Rangers), which makes those leagues a lot more predictable.

In the less predictable leagues a lot of teams must be pretty evenly matched, making predictions harder.

The trend of the English leagues shows that over the 3 years the Championship's predictions have become more accurate. This could be improvements in the model or the Championship itself becoming more predictable; it's impossible to tell which, though if it were model performance then other leagues would probably have improved too.

The model includes a percentage chance for each result, so to look at how close a league is, I compared the difference in win probability between the two teams in each match.

Measuring the difference in win probability between the favourite and the other team, the Championship ends up as the 5th closest league overall. A smaller gap between the favourite and the other team means a league has a lot of parity and will therefore be unpredictable. Although the Championship has not come out as the most unpredictable league so far, it does appear to be an unpredictable one.

Focusing on the Championship across multiple seasons, the difference in win probability doesn't seem to change much from year to year. In fact, compared to the proportion of correct predictions, which definitely increased, there is no discernible change in the difference between the two teams' win probabilities.

I think overall the Championship is shown to be quite unpredictable. Not the most unpredictable, but across a few measures it sits among the group of most unpredictable leagues. There are 2 main reasons for a league being unpredictable: a lack of information about the league, or a lot of parity within it. With the Championship it is definitely the latter. This has also revealed some predictable leagues, which may be fruitful for betting.

All my code should now be on my github below

https://github.com/alexthom2/TheChampionship/blob/master/UnpreditableExploration.Rmd

The background to this is that I have been reviewing The Math Behind Moneyball course on Coursera. The course is linked here:

https://www.coursera.org/learn/mathematics-sport

The problem is that most of the course is based in Excel, and in the modern world I like to use code to analyse data, R in particular. It is also based mostly on baseball, and as I am not a particularly big baseball fan, I am going to apply it to cricket.

The first concept it looks at is the Pythagorean expectation. It's slightly different from the theorem you probably remember from school about the sides of a triangle: this one was created by Bill James for baseball, and it uses the number of runs scored and conceded to estimate win percentage. Applying it to cricket, I'm going to focus on the IPL. A key part of the formula is the exponent. This is a constant, and for baseball it is 2; however, there is no great literature on it for cricket. I found one blog that quotes it as 8, but let's compute it ourselves and see what comes out.

Based on this data, the best value for the exponent in the IPL looks to be around 10 or 11, which is different from the previous work I found. Applying this to every team since the 2013 season, I can compare their predicted win percentage against their actual win percentage.
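
The formula itself is short; here it is as a Python sketch with the exponent fitted above (the team totals in the example are made up for illustration):

```python
# Bill James style Pythagorean expectation, with the ~10 exponent fitted for the IPL.
def pythagorean_win_pct(runs_scored, runs_conceded, exponent=10):
    rs = runs_scored ** exponent
    ra = runs_conceded ** exponent
    return rs / (rs + ra)

# a hypothetical side scoring 2300 while conceding 2250 over a season
p = pythagorean_win_pct(2300, 2250)
```

With baseball's exponent of 2 the same inputs would give an estimate much closer to 50%, which is why fitting the exponent to T20 data matters: a large exponent makes small run differences translate into big swings in expected win percentage.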

Some interesting trends are visible here. If you look at Sunrisers Hyderabad, they have gradually increased their predicted win percentage since 2013. Is this smarter recruitment? They also massively underperformed their predicted win percentage in 2019, so is there a chance of regression to the mean in 2020? Chennai seem to have overperformed over the last 2 years and show a general downward trend in predicted win percentage. The next thing that can be done with this is to calculate how many extra runs you need to win one extra Twenty20 game.

Above I create a data frame with runs scored increasing in steps of 5 from 2225, the average total runs a team scores over an IPL league season, up to 2285.

The summary table of the output shows that scoring an extra 60 runs over the season is equivalent to one extra win. Therefore you need to recruit the players to achieve those extra 60 runs. That can also come from bowlers: if you restrict the opposition to 30 fewer runs, you only need to score 30 more. That is going to be the subject of the next blog.
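
The same table can be reproduced in a few lines of Python. The 2225 baseline is the blog's figure; assuming runs conceded stay fixed at the same value is my simplification for the sketch:

```python
# Extra wins gained per extra runs scored, via the Pythagorean expectation.
def pythagorean_win_pct(rs, ra, k=10):
    return rs ** k / (rs ** k + ra ** k)

GAMES = 14          # IPL league-stage matches per team
BASE = 2225         # average season run total from the blog
base_wins = GAMES * pythagorean_win_pct(BASE, BASE)

extra_wins = {extra: GAMES * pythagorean_win_pct(BASE + extra, BASE) - base_wins
              for extra in range(0, 65, 5)}
# around 60 extra runs, the gain approaches one whole win
```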

The idea is to use historical data on where the average driver has finished compared to their grid slot. So if a driver qualifies 2nd, and the average driver who has qualified 2nd has historically finished 3.4, and this driver finishes 1st, that would be worth +2.4. This number can either be averaged over the long term, or in the short term it could be used as a plus/minus statistic in F1 broadcasting.
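
As a minimal sketch, the metric reduces to a lookup and a subtraction. The lookup values below are hypothetical; in the real analysis they would be computed from the historical results data:

```python
# Hypothetical average historical finish by qualifying slot.
avg_finish_by_quali = {1: 2.1, 2: 3.4, 3: 4.2}

def xpos_delta(quali_pos, actual_finish):
    """Positive means the driver beat the average driver from that grid slot."""
    return avg_finish_by_quali[quali_pos] - actual_finish

# the example above: qualified 2nd (expected ~3.4), finished 1st
print(xpos_delta(2, 1))
```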

The Data

I often come back to this, but Kaggle is one of the best sources of data for whatever you want to look at. For this there is a whole F1 data set covering all sorts of information. All I need here is the results and the races: the results contain each grand prix result for each driver, as well as the qualifying position they started from.

The first thing to look at is the average finishing position by qualifying position.

Overall, if you start in the top 8, on average you go backwards in the race; from 9th down, on average you finish higher than you started. However, I wonder how much the lower starters are affected by retirements. If you start last, all of the retirements are in front of you and you will always move forward; if you start from pole, all retirements are behind you and you can only stay where you are or go backwards. Hence, on average, the finishing position is lower than the starting position for first place. The first thing I need to do is control for retirements so everyone is on a level playing field.

Now I can see the percentage of retirements by grid position. Clearly the worse cars towards the back of the grid have a higher retirement rate, and in calculating the KPI I can use that to normalise the results.

After running the model the first time, this is the list of the best drivers since 2000 by average position change over their careers. I think there must be an error here, as I don't think, with all due respect, that Alex Yoong and Enrique Bernoldi are the best drivers to have graced the F1 grid. FYI, the Verstappen you can see in 9th is not Max but his dad Jos, who was nowhere near as good as Max.

The error was that I was creating the adjusted position from the grid position, not the qualifying position. Making that change and creating the same graph shows this:

Now that's more like it: these are the top 29 drivers by average position change, and there are a lot of pretty big names on the list, including all the world champions of the last 20 years. There are also some interesting names people maybe wouldn't instantly think of, such as Kobayashi and Friesacher.

When a driver's xP is compared to the number of races they competed in, you can clearly see that drivers with better ratings do more races, and some of the drivers with the highest ratings are world champions.

Let's focus on a couple of drivers: first the current world champion, Lewis Hamilton, and second, Nico Hulkenberg.

Hamilton's performances over the years seem to have 2 distinct periods. In the early years at McLaren the field was a lot closer and he was rarely in a dominant car; then, in the hybrid era, his total significantly increases, partly due to having a more dominant car and maybe worse reliability, meaning he sometimes started lower on the grid. A more dominant car means that if you start lower, you gain more positions in those races. This is perhaps a limitation of the metric, and going forward I may have to control for how inherently fast the car is.

Hulkenberg's career up to 2017 was a bit of a mixed bag. Overall, in those eight seasons he has only two with strong positive position differences, while four are strong negatives. I chose Hulkenberg because he's the driver with the most race starts without a podium, and looking at this record you can possibly see why. This rating obviously isn't the be-all and end-all of a driver's career, but it's a way to try and understand who the good and bad ones are.

This is just a first exploration into a way of measuring F1 drivers' performances. There are probably other measures that can also be used to gain a wider picture of how good an F1 driver truly is. I think I can also further improve the model by including the circuit, since certain circuits are easier or harder to overtake on and that will therefore affect the results. Room for further development.

https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data

Let's read the data into R and take a look at it.
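A minimal sketch of the load step is below. The real file from Kaggle is named AB_NYC_2019.csv (my assumption from the dataset page); a tiny stand-in CSV keeps this example self-contained.

```r
# Minimal sketch of loading the data. AB_NYC_2019.csv is my assumption
# for the real Kaggle file name; a stand-in CSV keeps this runnable.
csv <- tempfile(fileext = ".csv")
writeLines(c("name,neighbourhood_group,neighbourhood,room_type,price",
             "Cosy loft,Brooklyn,Williamsburg,Private room,80",
             "Luxury suite,Manhattan,Tribeca,Entire home/apt,450"),
           csv)

airbnb <- read.csv(csv, stringsAsFactors = FALSE)
dim(airbnb)    # 2 x 5 here; roughly 48,000 x 17 in the real file
str(airbnb)    # column names and types at a glance
```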

So I can see there are 17 columns and over 48,000 records, with information covering the price and location of each AirBnB. Looking at the location column, let's see where most of the AirBnBs in New York are.

Above you can see the code and the resulting plot. I can see that most of the AirBnBs are located in the Williamsburg and Bedford-Stuyvesant areas of Brooklyn. Manhattan areas are also very prevalent in the top 30 neighbourhoods. Now let's look at how the price of an AirBnB changes by area.
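The count-by-neighbourhood step can be sketched in base R as below; the `neighbourhood` column name follows the Kaggle dataset, and the data here is a tiny made-up sample.

```r
# Counting listings per neighbourhood and plotting the top 30; the
# 'neighbourhood' column name follows the Kaggle dataset (toy data here).
airbnb <- data.frame(
  neighbourhood = c("Williamsburg", "Williamsburg", "Bedford-Stuyvesant",
                    "Harlem", "Williamsburg", "Bedford-Stuyvesant")
)

counts <- sort(table(airbnb$neighbourhood), decreasing = TRUE)
top30  <- head(counts, 30)
barplot(top30, las = 2, cex.names = 0.7,
        main = "Listings per neighbourhood")
```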

Looking at how the price varies by location doesn't really show much at this stage, other than that the most expensive AirBnB is in some area of Brooklyn. Intuitively this doesn't seem correct to me.

A quick histogram of the price shows there may be some questionable data within the AirBnB data set. I would say around 90% of the AirBnBs in the New York area are less than $1,000 per night. There are some that cost more; however, because they are so far above the rest of the data set, I suspect this is bad data. In fact, if we look at the most expensive listings, some are shared rooms. No one is going to pay $10,000 for that. This is bad data and must be removed.
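This sanity check can be sketched as below: a histogram of nightly price, then a look at implausibly expensive shared rooms. Column names follow the Kaggle dataset; the data is a made-up sample.

```r
# Sketch of the sanity check: histogram of nightly price, then a look at
# implausibly expensive shared rooms (toy data, Kaggle column names).
airbnb <- data.frame(
  price     = c(60, 90, 120, 150, 200, 10000),
  room_type = c("Private room", "Entire home/apt", "Private room",
                "Entire home/apt", "Shared room", "Shared room")
)

hist(airbnb$price, breaks = 30, main = "Nightly price ($)")

# Shared rooms listed at extreme prices look like bad data
subset(airbnb, room_type == "Shared room" & price > 1000)
```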

Removing the bad data and redoing the plot gives a much clearer picture. There seems to be a high density of expensive AirBnBs in upper Manhattan. This shows all listings, however, and the price probably differs by room type. It's not so surprising, though, that the higher prices seem to be in the tourist hot spots. Moving on, let's start building a model to predict the price.
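The filter-and-replot step might look like this in base R; the $1,000 cut-off is the one eyeballed from the histogram, and the coordinates here are made up.

```r
# Dropping the suspect rows before re-plotting; the $1,000 cut-off comes
# from the histogram above (toy coordinates for illustration).
airbnb <- data.frame(
  price     = c(60, 90, 120, 10000, 150),
  latitude  = c(40.71, 40.72, 40.80, 40.75, 40.73),
  longitude = c(-73.96, -73.95, -73.94, -73.99, -73.98)
)

clean <- subset(airbnb, price < 1000)

# Simple map-style scatter: darker points are pricier listings
plot(clean$longitude, clean$latitude, pch = 19,
     col = grey(1 - clean$price / max(clean$price)),
     xlab = "Longitude", ylab = "Latitude")
```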

I am just going to be using simple linear modelling; however, I have split the data set into training and testing sets for the model. The first variable I am going to use is room type: shared room, entire dwelling or private room.
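The split and first model can be sketched as below on synthetic data; the post fits the same `lm(price ~ room_type)` form on the real Kaggle columns.

```r
# Sketch of the train/test split and first model on synthetic data;
# the post fits the same formula on the real Kaggle columns.
set.seed(1)
n    <- 200
room <- sample(c("Entire home/apt", "Private room", "Shared room"),
               n, replace = TRUE)
base <- c("Entire home/apt" = 200, "Private room" = 90, "Shared room" = 60)
airbnb <- data.frame(price     = base[room] + rnorm(n, sd = 20),
                     room_type = room)

idx   <- sample(nrow(airbnb), 0.8 * nrow(airbnb))
train <- airbnb[idx, ]
test  <- airbnb[-idx, ]

m1 <- lm(price ~ room_type, data = train)
summary(m1)  # coefficients are relative to the baseline, Entire home/apt
```

Because "Entire home/apt" sorts first alphabetically, it becomes the factor baseline, so the private-room and shared-room coefficients read directly as dollars cheaper per night.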

Above we can see the summary of the first model fitted. I can see that, compared to an entire home, private rooms are about $114 cheaper per night and shared rooms about $130 cheaper per night. The bad part is that the residual error is pretty poor at 128, which means on average this model would be about $128 out when predicting the price.

I re-ran the model with neighbourhood, number of reviews and reviews per month included. Above you can see a summary of the variables that affect the price the most, both negatively and positively. Room type seems to have the biggest negative effect, while Tribeca and Flatiron seem to be the most sought-after locations. The residual error is down to 105.7. To improve it further I will need to do some feature engineering.
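A sketch of the extended model is below; `neighbourhood`, `number_of_reviews` and `reviews_per_month` are the Kaggle column names, but the data is synthetic, so the numbers won't match the post's.

```r
# Sketch of the extended model; Kaggle column names, synthetic data.
set.seed(3)
n <- 150
d <- data.frame(
  room_type         = sample(c("Entire home/apt", "Private room"), n, TRUE),
  neighbourhood     = sample(c("Flatiron District", "Harlem", "Tribeca"),
                             n, TRUE),
  number_of_reviews = rpois(n, 20),
  reviews_per_month = runif(n, 0, 5)
)
d$price <- ifelse(d$room_type == "Entire home/apt", 200, 90) +
           ifelse(d$neighbourhood == "Tribeca", 120, 0) +
           rnorm(n, sd = 30)

m2 <- lm(price ~ room_type + neighbourhood + number_of_reviews +
                 reviews_per_month, data = d)
summary(m2)$sigma   # residual standard error (105.7 on the real data)
```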

The first thing I am going to look at is the description column. I wonder if there are some words in there that highlight more expensive homes. One word I can think of is luxury: anything described as luxury seems more expensive than something that isn't.

I have used the string detect function on the name variable to find properties with luxury in their name. Adding that flag to the linear model shows that such properties are often $59 higher per night than others.
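The flag can be sketched as below on toy data; `grepl` is the base-R equivalent of stringr's `str_detect` used in the post.

```r
# Flagging listings with "luxury" in the name; grepl is the base-R
# equivalent of stringr::str_detect (toy data for illustration).
airbnb <- data.frame(
  name  = c("Luxury loft in Soho", "Cozy room near the park",
            "LUXURY suite with view", "Spacious 2BR"),
  price = c(400, 80, 350, 150)
)

airbnb$luxury <- grepl("luxury", airbnb$name, ignore.case = TRUE)
lm(price ~ luxury, data = airbnb)   # luxuryTRUE coefficient is the premium
```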

What's clear from counting the words used in listing titles is that a lot of people like highlighting where in New York the place is. Other words that can be used for the model are cozy, spacious, beautiful and large. I'm going to add clean to the list as well.
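Building one indicator column per candidate word might look like this (toy data again; the post does the equivalent with `str_detect` per word).

```r
# One indicator feature per candidate word (toy data; the post does the
# equivalent with str_detect for each word).
airbnb <- data.frame(
  name  = c("Luxury loft", "Cozy studio", "Large clean room",
            "Beautiful spacious flat"),
  price = c(400, 80, 120, 180)
)

words <- c("luxury", "cozy", "spacious", "beautiful", "large", "clean")
for (w in words) {
  airbnb[[w]] <- grepl(w, airbnb$name, ignore.case = TRUE)
}
colSums(airbnb[words])   # how often each word appears in the titles
```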

Now I can see that, apart from luxury, none of the other words in the descriptions have much effect on the price. Cozy is maybe a euphemism for the property being small, and that's why it has a negative effect on the price. Now that I have my model, let's apply it to the unseen data and see how it does.

Running the model on the unseen data produced some surprising results. There are a few places with negative predicted prices, which clearly isn't correct, so I decided to model against the log of the actual price instead. Taking the log dramatically improves the predictive power of the model: the residual error is now 0.614 with an r-squared of 0.52.
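A sketch of why the log helps: predictions come back through `exp()`, which is always positive, so negative prices are impossible by construction. Synthetic data again, with the same Kaggle-style columns.

```r
# Why the log-price model fixes negative predictions: exp() of any
# fitted value is positive (synthetic data, Kaggle-style columns).
set.seed(2)
n     <- 300
room  <- sample(c("Entire home/apt", "Private room"), n, replace = TRUE)
price <- ifelse(room == "Entire home/apt", 200, 90) * exp(rnorm(n, sd = 0.4))
d     <- data.frame(price = price, room_type = room)

m_log <- lm(log(price) ~ room_type, data = d)
preds <- exp(predict(m_log, d))   # back on the dollar scale
min(preds) > 0                    # always TRUE by construction
```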

Above we can see the results of the log-transformed model. There are still some higher-priced rentals that the model doesn't really predict. However, let's use it to see the most over- and under-priced listings (according to the model).

Above you can see the most expensive listing compared to its predicted price, with a link to it on AirBnB. All I can say is wow, it looks incredible. Pictures are something that could never be in the model.

Above you can see a summary of the 10 most undervalued. They are all located in Manhattan and seem to be split between the Tribeca district and Midtown. That's it for today's blog, slightly longer than normal, but I hope you enjoyed it.
