Hello, and welcome to today’s blog, which looks at a Tidy Tuesday dataset: train delays in France. The first thing to do is read the data into RStudio and get an overview of it:
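As a minimal sketch of this first step, the snippet below reads a tiny inline CSV and prints an overview. The three rows are made-up illustrative values, not real figures, and the column names are my assumption about how the Tidy Tuesday file is laid out. In practice you would point `read.csv()` at the downloaded file instead.

```r
# Toy stand-in for the Tidy Tuesday file: values are invented for illustration,
# and the column names are an assumption about the real dataset.
csv_text <- "year,month,departure_station,arrival_station,journey_time_avg,total_num_trips
2017,1,PARIS LYON,MARSEILLE ST CHARLES,190,500
2017,1,PARIS MONTPARNASSE,BORDEAUX ST JEAN,125,620
2017,2,PARIS LYON,MARSEILLE ST CHARLES,192,480"

trains <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(trains)      # column types and the first few values of each column
summary(trains)  # min / median / max for each numeric column
```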


I can see the data is broken up by year and month. There are also the different routes, the number of trips on each route and how long they normally take, amongst other things. I’m thinking of seeing if I can get some metrics for the rate at which trains are cancelled or depart late. Let’s first see how many different stations this covers, and the longest and shortest routes:
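Something along these lines would count the stations and rank the routes. It is only a sketch on invented toy data, with column names assumed to match the dataset, and it treats a "route" as a departure and arrival pair ranked by average journey time.

```r
# Toy data: invented values; column names assumed from the Tidy Tuesday dataset.
trains <- data.frame(
  departure_station = c("PARIS LYON", "PARIS LYON", "PARIS NORD", "LYON PART DIEU"),
  arrival_station   = c("MARSEILLE ST CHARLES", "LYON PART DIEU", "LILLE", "PARIS LYON"),
  journey_time_avg  = c(190, 120, 62, 121),
  stringsAsFactors  = FALSE
)

# Distinct stations appearing as either an origin or a destination
stations <- unique(c(trains$departure_station, trains$arrival_station))
n_stations <- length(stations)

# Average journey time per route (route = departure + arrival pair)
routes <- aggregate(journey_time_avg ~ departure_station + arrival_station,
                    data = trains, FUN = mean)
routes <- routes[order(-routes$journey_time_avg), ]

head(routes, 1)  # longest route by average journey time
tail(routes, 1)  # shortest route by average journey time
```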


Most of the trains depart from Paris stations, with 4 of the top 5 stations being in Paris. A summary of the longest train journeys is above. I could go on a tangent and look at journey time against journey length. Heck, I’m going on a tangent!



Above you can see the summary of the 25 longest routes, the distance each route covers (as the crow flies) and then the percentage arriving late, departing late and cancelled. The longer the distance, the more likely your train looks to be cancelled or to arrive late. I guess that makes sense: on longer journeys there is more that can go wrong along the way, so they arrive late more often.
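For anyone wondering how an "as the crow flies" distance can be worked out, one common option is the haversine formula, sketched below. This is an assumption about a reasonable method rather than a record of exactly what was run here, and the coordinates are approximate city-centre values, not exact station locations.

```r
# Great-circle ("as the crow flies") distance in km using the haversine formula.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(sqrt(a))
}

# Approximate city-centre coordinates (assumed, not the exact stations)
paris_marseille_km <- haversine_km(48.8566, 2.3522, 43.2965, 5.3698)
paris_marseille_km  # roughly 660 km
```

Joining a distance like this onto each route would then let you compare it against the late and cancelled percentages.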
Let’s look at the cancelled rate %, the late-departure rate % and the late-arrival rate % for all routes. I’m going to plot a histogram for each of the 3 metrics to see what the range currently is.
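Here is a rough sketch of how the three rates and their histograms could be produced. The counts are invented, the column names are assumed to match the dataset, and dividing every count by the total number of trips is a simplifying assumption (you could argue the late-arrival rate should exclude cancelled trains).

```r
# Toy data: invented counts; column names assumed from the Tidy Tuesday dataset.
trains <- data.frame(
  total_num_trips        = c(500, 620, 480, 300),
  num_of_canceled_trains = c(0, 3, 24, 0),
  num_late_at_departure  = c(50, 31, 96, 15),
  num_arriving_late      = c(75, 62, 120, 30)
)

# Each rate is a count divided by the trips scheduled for that route and month
trains$pct_cancelled      <- 100 * trains$num_of_canceled_trains / trains$total_num_trips
trains$pct_late_departure <- 100 * trains$num_late_at_departure  / trains$total_num_trips
trains$pct_late_arrival   <- 100 * trains$num_arriving_late      / trains$total_num_trips

# One histogram per metric
for (metric in c("pct_cancelled", "pct_late_departure", "pct_late_arrival")) {
  hist(trains[[metric]], main = metric, xlab = "% of trips")
}
```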



Probably not surprising to see for cancellations: on the vast majority of routes, in most months, no trains are cancelled. However, there have been some instances where 60% of trains on a given route in a month were cancelled. You are significantly more likely to have a train depart late or arrive late than be cancelled. However, the histograms don’t really allow a comparison of which of departing late and arriving late is more likely. Let’s have a look at how these 3 KPIs have evolved over time:



The first thing that stands out is that the % of trains delayed at departure and the % arriving late follow a similar pattern, which fits the assumption that if a train departs late the chances are it will arrive late. Both see significant increases during the summer months (June, July & August). Also, surprisingly, there was a big increase in late-departing trains in 2018. The cancelled rate is generally low, with most months below 5%, although there are a number of months with extreme values. Overall though, 2018 does look to be worse than other years.
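A monthly trend like this can be built by summing the counts across routes for each month and only then taking the ratio. The sketch below shows the idea on invented numbers, with column names assumed to match the dataset. Whether to sum first or average the per-route percentages is a design choice; summing first stops small routes from being over-weighted.

```r
# Toy data: invented counts; column names assumed from the Tidy Tuesday dataset.
trains <- data.frame(
  year                  = c(2017, 2017, 2018, 2018),
  month                 = c(7, 7, 7, 7),
  total_num_trips       = c(400, 600, 500, 500),
  num_late_at_departure = c(40, 60, 100, 150)
)

# Sum counts over all routes within each year and month, then take the ratio
monthly <- aggregate(cbind(total_num_trips, num_late_at_departure) ~ year + month,
                     data = trains, FUN = sum)
monthly$pct_late_departure <- 100 * monthly$num_late_at_departure / monthly$total_num_trips
monthly <- monthly[order(monthly$year, monthly$month), ]
monthly
```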
Further Work
- Calculate the probability of a train departing late and then use Bayesian statistics to implement a model for forecasting future lateness
- Investigate if particular stations have issues with late arriving trains
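
As a starting point for the Bayesian idea, a Beta-Binomial update is probably the simplest thing to try. The sketch below uses a uniform Beta(1, 1) prior and invented counts, and the 220 scheduled trips are a hypothetical figure, so treat it as an illustration of the mechanics rather than a finished forecasting model.

```r
# Beta-Binomial update for the probability that a train departs late.
# Prior Beta(1, 1) is uniform; the counts below are invented for illustration.
prior_alpha  <- 1
prior_beta   <- 1
late_trains  <- 30
total_trains <- 200

# Conjugate update: add late trains to alpha, on-time trains to beta
post_alpha <- prior_alpha + late_trains
post_beta  <- prior_beta + (total_trains - late_trains)

post_mean <- post_alpha / (post_alpha + post_beta)          # posterior mean of P(late)
cred_int  <- qbeta(c(0.025, 0.975), post_alpha, post_beta)  # 95% credible interval

# Expected late departures among a hypothetical 220 trips next month
expected_late <- 220 * post_mean
```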