Hello and welcome to today’s blog, in which we are going to take a large dataset and do some exploratory data analysis (EDA) on it. I am going to look at the Biketown dataset, which featured on Tidy Tuesday. If you’re new around here, Tidy Tuesday is a hashtag on Twitter that the R for Data Science online learning community actively promotes and runs every Tuesday. If you’re inspired to learn R and data science like I was, it is a really great community full of wonderful people to start with. I am not going to walk through all of the code within the blog as I think it gets too long; however, the full code used will be posted on my GitHub.
Above is the structure of the data frame. The data comes in numerous CSV files, so I read them all in and combined them into one large data frame structured like so. The second column, payment plan, looks interesting: it has 3 values, Casual, Subscriber and a third group where the plan is blank. The system in Portland gives a regular user a way to automate payments to save time. Let’s look at how much of the dataset falls under each of the 3 types of payment plan:
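A minimal sketch of how the read-and-combine step and the payment plan breakdown could look in R. It uses two tiny made-up CSV files in a temporary folder rather than the real download, and the column names RouteID and PaymentPlan are assumptions about the file layout, so treat it as an illustration rather than the exact code behind the chart.

```r
library(dplyr)
library(readr)

# Toy stand-in for the real download: two small CSV files in a temp folder.
# The folder name, file names and columns are hypothetical.
csv_dir <- file.path(tempdir(), "biketown_demo")
dir.create(csv_dir, showWarnings = FALSE)
write_csv(tibble(RouteID = 1:3, PaymentPlan = c("Casual", "Subscriber", NA)),
          file.path(csv_dir, "2017_01.csv"))
write_csv(tibble(RouteID = 4:5, PaymentPlan = c("Subscriber", "Casual")),
          file.path(csv_dir, "2017_02.csv"))

# Read every CSV in the folder and stack them into one data frame.
# Reading all columns as character avoids type clashes between files.
files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
bikes <- bind_rows(
  lapply(files, read_csv, col_types = cols(.default = col_character()))
)

# Count rows per payment plan and each plan's share of the dataset.
plan_counts <- bikes %>%
  count(PaymentPlan) %>%
  mutate(share = n / sum(n))

print(plan_counts)
```

Blank plans show up as their own NA row in the count, which makes it easy to see how much of the data they account for.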
As you can see, the vast majority of this dataset is either casual or subscriber, and I think it would be interesting to review the differences between the people on the two main payment plans. Therefore, going forward in the EDA, we are going to remove the entries without a recorded payment plan. After this, we could possibly look at a method for working out what type of payment plan the blanks are. First things first, let’s have a look at what type of trips each group takes:
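As a rough sketch, dropping the blank payment plans and then counting trip types per group could look like this with dplyr. The toy data, the column names PaymentPlan and TripType, and the trip type labels are all assumptions made for illustration.

```r
library(dplyr)

# Hypothetical miniature version of the combined data frame.
bikes <- tibble(
  PaymentPlan = c("Casual", "Subscriber", NA, "", "Subscriber", "Casual"),
  TripType    = c("Recreation", "Commute", "Commute", NA, NA, NA)
)

# Keep only rows with a recorded plan (drop both NA and empty strings).
bikes_known <- bikes %>%
  filter(!is.na(PaymentPlan), PaymentPlan != "")

# Count trips by payment plan and trip type; missing trip types
# are kept as their own NA rows so they stay visible.
trip_counts <- bikes_known %>%
  count(PaymentPlan, TripType)

print(trip_counts)
```

Checking for both NA and the empty string is a cautious choice, since a blank cell can be read in either way depending on how the CSV is imported.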
The big issue here is that drawing any conclusion about the type of trips each group takes is going to be difficult. This is because each group has over 200 thousand entries, yet fewer than 1000 trips in each group have a recorded trip type. What we can say is that, within this small sample, subscribers tend to use the system for commuting and casual users tend to use it for recreation, which makes total sense.
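To put a number on how thin the trip type data is, one option is to compare recorded trip types against the total rows per group. This is a sketch on made-up data, again assuming columns named PaymentPlan and TripType.

```r
library(dplyr)

# Hypothetical data: most rows have no trip type recorded.
bikes_known <- tibble(
  PaymentPlan = c("Casual", "Casual", "Casual",
                  "Subscriber", "Subscriber", "Subscriber", "Subscriber"),
  TripType    = c("Recreation", NA, NA,
                  "Commute", NA, NA, NA)
)

# For each group: total rows, rows with a trip type, and the share recorded.
trip_coverage <- bikes_known %>%
  group_by(PaymentPlan) %>%
  summarise(
    total          = n(),
    recorded       = sum(!is.na(TripType)),
    share_recorded = recorded / total
  )

print(trip_coverage)
```

A very small share_recorded is the signal that any pattern in trip types should be read as a hint rather than a firm conclusion.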
Now we look at the payment methods that both groups have used. By far the 3 main payment types for both groups are keypad, mobile and keypad_rfid_card, with subtle differences between the 2 groups. The RFID card share is clearly higher among subscribers, presumably because subscribers are given a card to access the bikes. Casual users are much more likely to use their mobile to gain access. In both groups, the vast majority use the keypad system.
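Because the two groups differ in size, comparing them is easier with shares inside each group than with raw counts. A minimal sketch of that calculation follows. The data is invented and the column name RentalAccessPath is an assumption; only the three method labels come from the text above.

```r
library(dplyr)

# Hypothetical data using the three main methods named above.
bikes_known <- tibble(
  PaymentPlan = c(rep("Casual", 4), rep("Subscriber", 4)),
  RentalAccessPath = c("keypad", "keypad", "mobile", "mobile",
                       "keypad", "keypad", "keypad_rfid_card", "mobile")
)

# Count each method per group, then convert to a within-group share.
access_share <- bikes_known %>%
  count(PaymentPlan, RentalAccessPath) %>%
  group_by(PaymentPlan) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

print(access_share)
```

The shares sum to 1 within each payment plan, so the casual and subscriber rows can be compared side by side.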
That’s it for part 1 of today’s exploratory data analysis on the Biketown data. Tomorrow we will look at the distances each group travels, as well as location and usage time. Let me know your comments on the first part.