Hello, and welcome to today’s data adventure, in which I am going to apply the many-models workflow to a dataset. This workflow lets you train many models on one dataset and compare them easily. It has been applied to the gapminder dataset many times, and further reading can be found in the R for Data Science textbook by Hadley Wickham and Garrett Grolemund here (section 25):
However, I will be applying it to milk production across the United States. The data is available here:
Also, check out Tidy Tuesday on Twitter, where there are new datasets to practise with every week. The first step is to read the data into RStudio.
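Here is a minimal sketch of that reading step. It assumes the data has been saved as a CSV file with columns called `region`, `state`, `year` and `milk_produced`; those column names, the file path and the numbers are placeholders of mine, so swap in the real file. To keep the example runnable without a download, it first writes a tiny made-up file to a temporary location.

```r
library(readr)

# Stand-in for the downloaded file: a tiny CSV with invented states and values,
# written to a temporary path so the example runs offline.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "region,state,year,milk_produced",
  "North,Alpha,1970,2272000000",
  "North,Alpha,1971,2300000000",
  "South,Beta,1970,454400000"
), csv_path)

# Read the file into a tibble; replace csv_path with the path to the real data.
milk <- read_csv(csv_path, show_col_types = FALSE)

milk
```

Printing the tibble straight away is a quick check that the columns arrived with the types you expect.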
I can see that the dataset is made up of each state’s milk production in lbs per year, as well as an approximate region for each state. The first thing I am going to do is clean the milk produced column and change it from lbs to litres. As a European, I have no concept of imperial weight measurements. Also, to me it makes more sense to have the units as volume rather than weight, as milk is a liquid suspension. I found on the internet that dividing pounds of milk by 2.272 converts it to litres.
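As a rough sketch, the conversion can be done with `dplyr::mutate()`. The 2.272 factor is the one quoted above; the column names (`milk_produced`, `milk_litres`, `milk_million_litres`) and the values are made up for illustration. The sketch also scales the result to millions of litres, which is the unit used for the rest of the post.

```r
library(dplyr)

lbs_per_litre <- 2.272  # conversion factor quoted in the text

# Invented example values, in lbs
milk <- tibble(
  state = c("Alpha", "Beta"),
  year = c(1970, 1970),
  milk_produced = c(2272000000, 454400000)
)

milk_clean <- milk %>%
  mutate(
    milk_litres = milk_produced / lbs_per_litre,  # lbs -> litres
    milk_million_litres = milk_litres / 1e6       # litres -> millions of litres
  )

milk_clean  # roughly 1000 and 200 million litres
```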
I then divided the result by a million to get millions of litres. Hopefully, the numbers will be a bit easier to handle. Next, I am going to do a basic plot of milk produced by year to see which way it generally trends.
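A basic version of that plot might look like the sketch below. It draws one line per state with `ggplot2`; colouring by state is my assumption (colouring by region would work the same way), and the data are invented.

```r
library(ggplot2)

# Invented data: three made-up states, four years each
milk_clean <- data.frame(
  state = rep(c("Alpha", "Beta", "Gamma"), each = 4),
  year = rep(c(1970, 1980, 1990, 2000), times = 3),
  milk_million_litres = c(900, 1500, 2600, 4000,
                          1200, 1100, 1000, 950,
                          40, 45, 38, 42)
)

# One line per state, milk production against year
p <- ggplot(milk_clean, aes(x = year, y = milk_million_litres, colour = state)) +
  geom_line() +
  labs(x = "Year", y = "Milk produced (million litres)", colour = "State")

p
```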
Wow, a lot of the trends are hidden in this graph. However, we can see that one state on the Pacific coast went from being the 5th highest producer to the highest by some distance, and one state looks to have gone from producing no milk in 1970 to being around the 3rd or 4th highest producer. I’m going to log transform the Y-axis to see if any new trends can be seen.
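One way to do the log transform is to keep the data as it is and only change the axis with `scale_y_log10()`. This is a sketch on invented data, not necessarily how the original plot was made. It works here because every value is positive; zeros would need handling first.

```r
library(ggplot2)

# Invented data: one large and one small producer
milk_clean <- data.frame(
  state = rep(c("Alpha", "Gamma"), each = 4),
  year = rep(c(1970, 1980, 1990, 2000), times = 2),
  milk_million_litres = c(900, 1500, 2600, 4000,
                          40, 45, 38, 42)
)

# Same line plot, but with a log10 y-axis so small producers are not squashed flat
p_log <- ggplot(milk_clean, aes(x = year, y = milk_million_litres, colour = state)) +
  geom_line() +
  scale_y_log10() +
  labs(x = "Year", y = "Milk produced (million litres, log scale)")

p_log
```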
Now some trends are more easily seen. There seems to be a mixture of states with increasing production and states with decreasing production. Let’s build our many models. I am going to nest the data frame by state and then train a linear model of milk production against year for each state.
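The nest-then-model step can be sketched like this, following the pattern from R for Data Science. The helper name `fit_state` and the column names are mine, and the data are invented, so treat it as an outline rather than the exact code behind this post.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Invented data: two made-up states, five years each
milk_clean <- tibble(
  state = rep(c("Alpha", "Beta"), each = 5),
  year = rep(1970:1974, times = 2),
  milk_million_litres = c(100, 111, 119, 131, 140,
                          50, 48, 51, 47, 49)
)

# Hypothetical helper: a linear model of production against year for one state
fit_state <- function(df) {
  lm(milk_million_litres ~ year, data = df)
}

by_state <- milk_clean %>%
  nest(data = -state) %>%              # one row per state, data in a list-column
  mutate(model = map(data, fit_state)) # one fitted model per state

by_state
```

Keeping the data and the model side by side in one row per state is what makes the later steps (residuals, R-squared, slopes) a single `map()` call each.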
Now I have a nested data frame with 50 individual models for the 50 states in the dataset. I have also obtained the residuals for each model, so let’s take a look at a plot of them now.
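A sketch of the residual step is below. It uses base R's `residuals()` to add a residual column to each state's data, which is my choice for keeping dependencies low (`modelr::add_residuals()` is a common alternative). The data and column names are invented.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)

# Invented data: two made-up states, five years each
milk_clean <- tibble(
  state = rep(c("Alpha", "Beta"), each = 5),
  year = rep(1970:1974, times = 2),
  milk_million_litres = c(100, 111, 119, 131, 140,
                          50, 48, 51, 47, 49)
)

by_state <- milk_clean %>%
  nest(data = -state) %>%
  mutate(
    model = map(data, ~ lm(milk_million_litres ~ year, data = .x)),
    # attach each model's residuals to that state's rows
    resids = map2(data, model, ~ mutate(.x, resid = residuals(.y)))
  )

# Flatten the list-column back into one row per state and year
state_resids <- by_state %>%
  select(state, resids) %>%
  unnest(cols = resids)

# Residuals against year, one line per state, with a reference line at zero
p_resid <- ggplot(state_resids, aes(x = year, y = resid, group = state)) +
  geom_line(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed")

p_resid
```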
To do that, I unnested the residual column and plotted it against year. There seems to be a bit of a mix, with some states being really well predicted and others showing wide variation. I am going to check each state’s R-squared value to see if that can shed some light.
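One way to pull out R-squared for every model is sketched here with base R's `summary()`. R for Data Science uses `broom::glance()` for the same job, so this is a substitute rather than the book's code. The data are invented: one state with a strong trend and one that is flat but noisy.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Invented data: "Alpha" trends upwards, "Beta" is flat but noisy
milk_clean <- tibble(
  state = rep(c("Alpha", "Beta"), each = 5),
  year = rep(1970:1974, times = 2),
  milk_million_litres = c(100, 111, 119, 131, 140,
                          50, 48, 51, 47, 49)
)

state_r2 <- milk_clean %>%
  nest(data = -state) %>%
  mutate(
    model = map(data, ~ lm(milk_million_litres ~ year, data = .x)),
    r_squared = map_dbl(model, ~ summary(.x)$r.squared)  # one number per model
  ) %>%
  select(state, r_squared) %>%
  arrange(r_squared)  # worst-fitting states first

state_r2
```

A flat series scores a low R-squared even when the model is not badly wrong, because there is very little trend for year to explain.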
Above you can see a summary table with the R-squared for each state explicitly stated. The model for Montana seems to fit very poorly compared to the one for Alabama, which has an R-squared of almost 1. Let’s have a look at the data for those two states in depth.
This is really interesting. Montana has been pretty flat over the 40 years but has varied significantly up and down. Alabama has seen milk production decrease dramatically over the 40 years. There must be a reason for this. Any thoughts on what it might be?
Finally, let’s compare the slopes of each state’s regression model. California seems to be the state with the biggest increase per year in milk production. Missouri seems to be declining the most, even more than Alabama. A lot of states seem to have had essentially no change in over 40 years of milk production. It would be fascinating to see whether adding further data can identify the reasons behind these trends.
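The slope comparison can be sketched by pulling the `year` coefficient out of each model with `coef()`. As before, the data and column names are invented. The slope is in millions of litres per year because that is the unit of the response.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Invented data: "Alpha" grows by about 10 a year, "Beta" barely moves
milk_clean <- tibble(
  state = rep(c("Alpha", "Beta"), each = 5),
  year = rep(1970:1974, times = 2),
  milk_million_litres = c(100, 111, 119, 131, 140,
                          50, 48, 51, 47, 49)
)

state_slopes <- milk_clean %>%
  nest(data = -state) %>%
  mutate(
    model = map(data, ~ lm(milk_million_litres ~ year, data = .x)),
    slope = map_dbl(model, ~ coef(.x)[["year"]])  # change per year
  ) %>%
  select(state, slope) %>%
  arrange(desc(slope))  # biggest yearly increase first

state_slopes
```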
Thanks for reading this data adventure today. If you enjoyed it, please leave a like for more content like this, and subscribe if you want to be the first to know when a new blog is posted.