Hello! Today we are going to ask a question and try to answer it with an analysis of data. I want to create a board game. To do that, I want to understand what makes one board game more highly rated than another. I want to create a popular board game after all! To do this I am using another Tidy Tuesday data set, found here:
As previously mentioned, if you want to practise your data science skills, check it out: it is a useful source of data sets and has an immensely supportive community!
Once the data is read into R, the column titles are as follows:
I am going to create an XGBoost model. It may be slight overkill for the task in hand, but I want to get better at using the algorithm and develop my understanding of it, and for that I need practice. One thing the XGBoost algorithm needs is numeric variables. I can see some key ones that will be very useful: max_players, max_playtime, min_age, min_players, min_playtime and playing_time. The average rating is the variable I am trying to predict. Category interests me, but it is not a numeric variable, which poses a challenge when I’m using an algorithm that can only use numeric variables.
The first stage of the data cleaning is to review the category variable to see if we can use it in the model.
First things first: a few games have no category recorded, so theirs have come through as NA. I can’t do anything with them, so they will be filtered out. Next, it looks like games can have multiple categories, separated by commas. Therefore, I am going to split the column at the commas into 3 separate columns and then use one-hot encoding to turn them into numbers I can use in the XGBoost model.
One-hot encoding is used on a categorical variable column: each distinct value becomes a column header, and the columns are filled with either 1 or 0. In this case, a column holds 1 if that category belongs to the game and 0 if it does not. That way I can have every mixture of categories in one big data frame. Let’s implement this:
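As a rough sketch of the idea (toy data and base R only, not the exact code used on the full data set), the snippet below drops the NA rows, splits the comma-separated categories and builds the 0/1 columns. The game names and categories are made up for illustration, and it encodes straight from the split list rather than going through 3 intermediate columns.

```r
# Toy data: hypothetical games, not rows from the real data set
games <- data.frame(
  name = c("Game A", "Game B", "Game C", "Game D"),
  category = c("Fantasy,Card Game", "Economic", NA, "Card Game,Economic,Fantasy"),
  stringsAsFactors = FALSE
)

# Drop games with no recorded category
games <- games[!is.na(games$category), ]

# Split each comma-separated string into a vector of categories
cat_list <- strsplit(games$category, ",", fixed = TRUE)
cat_list <- lapply(cat_list, trimws)

# Every distinct category becomes one column
all_cats <- sort(unique(unlist(cat_list)))

# 1 if the game has the category, 0 otherwise (one row per game)
one_hot <- t(vapply(
  cat_list,
  function(cats) as.integer(all_cats %in% cats),
  integer(length(all_cats))
))
colnames(one_hot) <- all_cats

encoded <- cbind(games["name"], one_hot)
print(encoded)
```

Each game ends up as one row of 0s and 1s, so a game with several categories simply has several 1s in its row.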
Above you can see the code used to clean the category column and one-hot encode all the categories. We now have a data frame with 248 variables, all of them numeric, so we are ready to go to the next stage.
The XGBoost algorithm works best with a data matrix, so the next stage is to take my data frame and create a matrix. I also need to separate my data into training and testing sets so I can assess my model’s performance on unseen data. First I’m going to shuffle the data set, so that any ordering in it is removed and both sets are a fair representation of the data. Then I’m going to use an 80/20 split for training and testing. Finally, XGBoost takes the target variable as a separate argument from the rest of the data, so I will be separating it out.
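The steps just described can be sketched in base R as follows. This is a minimal illustration on made-up data, not the original code: the column names match the ones mentioned earlier, but the values, the seed and the variable names (`train_x`, `test_y` and so on) are assumptions.

```r
set.seed(42)  # arbitrary seed so the shuffle is reproducible

# Toy stand-in for the cleaned, all-numeric data frame (values are made up)
n <- 100
games <- data.frame(
  average_rating = runif(n, 1, 10),
  max_players = sample(2:8, n, replace = TRUE),
  min_age = sample(6:18, n, replace = TRUE)
)

# Shuffle the rows so any ordering in the file is removed
games <- games[sample(nrow(games)), ]

# Separate the target from the predictors and convert to a matrix
labels <- games$average_rating
features <- as.matrix(games[, setdiff(names(games), "average_rating")])

# 80/20 split: first 80% of the shuffled rows for training, the rest for testing
n_train <- floor(0.8 * nrow(features))
train_idx <- seq_len(n_train)

train_x <- features[train_idx, , drop = FALSE]
train_y <- labels[train_idx]
test_x <- features[-train_idx, , drop = FALSE]
test_y <- labels[-train_idx]
```

From here, the xgboost package’s `xgb.DMatrix(data = train_x, label = train_y)` would wrap the training matrix and its labels together.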
Above you can see the code used to accomplish that. Finally, you can see the creation of the XGBoost matrix, which should make training quicker than with an ordinary matrix.
Running the Model
Below you can see an example of running the model. The first argument is the training data matrix I created previously. The second argument, nrounds, sets how many boosting rounds to run; I started with 2. The objective is the type of modelling used. I have used the linear regression objective, but if you have a binary problem, e.g. predicting whether something fails or not, you can use binary logistic regression instead. In the console you then get a printout of the RMSE (root mean square error) for each round. Since 2.98 is a very large error, I think I need to run it for more than two rounds. Let’s test that by running it for 10 rounds and, say, 50 rounds, and testing both on the unseen data.
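For readers who want something to run, here is a hedged sketch of a 2-round training call on made-up data. It assumes the xgboost package is installed and uses `xgb.train()` with `objective = "reg:squarederror"`, the current name for the squared-error regression objective (older releases called it `"reg:linear"`). The data, seed and variable names are illustrative only, and the RMSE is computed by hand rather than read from the console printout.

```r
library(xgboost)

set.seed(1)

# Made-up numeric features and ratings, purely for illustration
n <- 200
x <- matrix(
  runif(n * 3), ncol = 3,
  dimnames = list(NULL, c("max_players", "min_age", "playing_time"))
)
y <- 5 + 2 * x[, "max_players"] - x[, "min_age"] + rnorm(n, sd = 0.5)

# Wrap features and target together in an XGBoost matrix
dtrain <- xgb.DMatrix(data = x, label = y)

# Two boosting rounds with the squared-error regression objective
model <- xgb.train(
  params = list(objective = "reg:squarederror"),
  data = dtrain,
  nrounds = 2
)

# RMSE on the training data, computed by hand
pred <- predict(model, dtrain)
train_rmse <- sqrt(mean((pred - y)^2))
print(train_rmse)
```

Swapping the objective for `"binary:logistic"` (with a 0/1 label) would give the binary version mentioned above.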
I therefore trained two versions of the model, one for 10 rounds and one for 50 rounds, and then tested both on the unseen data: the 20% of the data I kept out of the training set for this very purpose. Performance on the training set doesn’t matter much, as it is easy to fit a perfect model to the training data. How the model performs on unseen data is what matters.
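A comparison along these lines might look like the sketch below. Again this runs on made-up data with the xgboost package assumed to be installed; the `rmse()` helper and the names `model_10`, `model_50` and `dtest` are hypothetical, so the numbers it prints will not match the ones quoted in this post.

```r
library(xgboost)

set.seed(2)

# Made-up data; rows are already in random order, so no extra shuffle is needed
n <- 500
x <- matrix(
  runif(n * 3), ncol = 3,
  dimnames = list(NULL, c("max_players", "min_age", "playing_time"))
)
y <- 5 + 2 * x[, "max_players"] - x[, "min_age"] + rnorm(n, sd = 0.5)

# 80/20 split into training and unseen test data
train_idx <- seq_len(floor(0.8 * n))
dtrain <- xgb.DMatrix(data = x[train_idx, ], label = y[train_idx])
dtest <- xgb.DMatrix(data = x[-train_idx, ], label = y[-train_idx])
test_y <- y[-train_idx]

# Root mean square error between actual and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

params <- list(objective = "reg:squarederror")
model_10 <- xgb.train(params = params, data = dtrain, nrounds = 10)
model_50 <- xgb.train(params = params, data = dtrain, nrounds = 50)

# Score both models on the 20% they never saw during training
pred1 <- predict(model_10, dtest)
pred2 <- predict(model_50, dtest)

cat("10 rounds, unseen RMSE:", rmse(test_y, pred1), "\n")
cat("50 rounds, unseen RMSE:", rmse(test_y, pred2), "\n")
```

Scoring only on the held-out rows is the design choice that matters here: more rounds will keep lowering the training error, so the unseen RMSE is the fairer measure.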
Considering the model behind pred2 was trained for 5 times as many rounds as the one behind pred1, its RMSE is not much different, so it looks to me like 10 rounds is close to the right value. In fact, just to illustrate the point, I trained a 10,000-round model and its unseen RMSE was 0.82, which is worse than the 10-round model. Overfitting in action.
Around 10 rounds looks to be the best for both time and accuracy. Now, how about answering the original question? I have two board games:
One is more of a family board game and the other is a fantasy card game for adults. Which one is going to get the better rating, and therefore possibly be the better seller? Let’s run the model on both and see.
The first game got a predicted average rating of 6.4 and the second one got 5.47. That suggests I should create the first game. The difference is also greater than the model’s error, which suggests the first game really would be the better one to create. That’s it for today; hopefully you were able to follow the whole blog. Let me know your thoughts, or if I missed anything.