Hello, and welcome to today’s blog post, where I develop my first model to enter a Kaggle competition. I am entering the PUBG competition, so if you haven’t read my two previous exploratory data analysis posts, go check them out.
So the idea here is that you have the whole dataset and you have to predict winPlacePerc, which is essentially where the player finished in the game. Submissions are scored by Mean Absolute Error (MAE). The data is almost entirely numerical, apart from one column that identifies the game type, so we are going to start with a linear model.
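For reference, MAE is just the average absolute difference between predictions and the true values. A one-line R helper (which I’ll reuse in the sketches below) looks like this:

```r
# Mean Absolute Error: the competition metric, averaging the absolute
# difference between predicted and actual winPlacePerc values
mae <- function(actual, predicted) {
  mean(abs(actual - predicted))
}
```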
The initial code covers the import of the data and the removal of some variables that my EDA showed were not key predictors; dropping them should also make the model run quicker. I also need to create the training and testing datasets, which is shown below.
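The original code was shown as a screenshot, so here is a minimal sketch of what that step looks like. The file path and the exact columns dropped are illustrative assumptions, not necessarily the ones I used:

```r
library(caret)

# Import the competition training data (the path is illustrative)
pubg <- read.csv("train_V2.csv")

# Drop variables the EDA suggested were weak predictors
# (the exact columns listed here are assumptions)
pubg <- subset(pubg, select = -c(Id, groupId, matchId))

# Split into training and testing sets (75/25)
set.seed(42)
inTrain  <- createDataPartition(pubg$winPlacePerc, p = 0.75, list = FALSE)
training <- pubg[inTrain, ]
testing  <- pubg[-inTrain, ]
```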
The code for the initial linear model is shown below, along with a summary of the model.
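Again, a sketch rather than the exact code from the post, using the split and the mae() helper from above:

```r
# Fit a linear model predicting winPlacePerc from all remaining variables
lm_fit <- lm(winPlacePerc ~ ., data = training)
summary(lm_fit)

# Evaluate on the held-out testing set
lm_pred <- predict(lm_fit, newdata = testing)
mae(testing$winPlacePerc, lm_pred)
```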
This model has an MAE of 0.096, which, looking at the current leaderboard, wouldn’t get me very high at all. We are therefore going to need a more complicated model. I am going to use the caret package to build a random forest model; let’s see how that performs.
The code for the initial random forest model is outlined below, along with its MAE. I am using an mtry of 15 to start with in this first model. As you can see, the random forest offers a significant improvement in MAE over the linear model when predicting players’ finishing positions. That would get me a few places higher on the leaderboard, but not super high. We can use cross-validation on the training dataset to get the model into a better fit.
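Here is a sketch of that first random forest fit through caret. The mtry of 15 matches what I described above; everything else (a single fit with no resampling, the classic "rf" method) is an assumption:

```r
# Single random forest fit with mtry fixed at 15 and no resampling
rf_fit <- train(
  winPlacePerc ~ ., data = training,
  method    = "rf",
  tuneGrid  = data.frame(mtry = 15),
  trControl = trainControl(method = "none")
)

# MAE on the held-out testing set
rf_pred <- predict(rf_fit, newdata = testing)
mae(testing$winPlacePerc, rf_pred)
```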
So I used cross-validation repeated twice; however, it doesn’t seem to have improved the model at all, so this looks to be the best this method can achieve. I therefore submitted this model, and at the time of writing I sit 463rd out of 591 entries. So it isn’t great.
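For completeness, a sketch of the cross-validated version. The "repeated twice" matches the post, but the fold count is an assumption:

```r
# 5-fold cross-validation repeated twice (the fold count is illustrative)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2)

rf_cv_fit <- train(
  winPlacePerc ~ ., data = training,
  method    = "rf",
  tuneGrid  = data.frame(mtry = 15),
  trControl = ctrl,
  metric    = "MAE"
)

# Resampled performance estimates across the folds
rf_cv_fit$results
```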
Things I can look at to improve the result:
- Use a different model; xgboost is the model that has been used to win the most Kaggle competitions (see the sketch after this list)
- Experiment with the parameters of the model, such as the number of cross-validation folds and repeats
- Use feature engineering to develop new variables in the dataset in order to predict better
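As a pointer for the first item, here is a hedged sketch of how xgboost could be tried through the same caret workflow. Every hyperparameter value in the grid is an illustrative assumption, not a tested setting:

```r
# xgboost via caret's "xgbTree" method; all grid values are illustrative
xgb_grid <- expand.grid(
  nrounds          = 200,
  max_depth        = 6,
  eta              = 0.1,
  gamma            = 0,
  colsample_bytree = 0.8,
  min_child_weight = 1,
  subsample        = 0.8
)

xgb_fit <- train(
  winPlacePerc ~ ., data = training,
  method    = "xgbTree",
  tuneGrid  = xgb_grid,
  trControl = trainControl(method = "repeatedcv", number = 5, repeats = 2),
  metric    = "MAE"
)

xgb_pred <- predict(xgb_fit, newdata = testing)
mae(testing$winPlacePerc, xgb_pred)
```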
Overall, I’m happy with my first attempt, and there are still two months of the competition left, so I will develop it further and hopefully move up the leaderboard.