Transport Analysis

The aim of this project was to show the relationship between weather and public transport usage. We thought there might be interesting correlations between weather conditions (such as rain and temperature) and people's decisions to use public transport, and that these could help in planning adequate transport arrangements for future weather events.

This was a side project attempted by Hubert and David on top of our other main projects. Our initial plan was to find temperature and rainfall data and fit linear models and random forests to see whether we could find a relationship between weather and public transport usage.

The weather data we used was drawn from the BOM Climate Data API. We took rainfall, minimum temperature and maximum temperature data from the Sydney Observatory station and assumed that the weather all over Sydney was consistent with Sydney Observatory.

Although Opal data initially seemed promising, the main problem was that it covered only short periods, so the potential for noise (such as special events or a dry spell) to affect our results was high. We therefore set it aside initially.

Next, we tried to explore Bus Occupancy Data. Unfortunately, since the dataset provided was 10GB, it was impossible to segment, clean and use it in time for this Hackathon.

Finally, we turned to Train Occupancy Data. These datasets were far smaller and more manageable, so this is what we ended up using. We initially aimed to cover more months but ultimately only looked at the November 2016 data.

Even the November dataset had over 1 million entries, making it excessively hard to feature engineer with the time and technology available (I only had R for cleaning and feature engineering, so SQL wasn't an option :( ). To resolve this, we sampled 20,000 entries from this dataset, creating a dataset that was much easier to manipulate.
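The sampling step can be sketched as follows. This is a minimal illustration in Python rather than the R we actually used; the `sample_rows` helper, the fixed seed and the integer stand-in records are all assumptions made for the example, not part of the original workflow.

```python
import random

def sample_rows(rows, k, seed=0):
    """Return up to k rows drawn uniformly at random without replacement."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the sample is reproducible
    if k >= len(rows):
        return list(rows)
    return rng.sample(rows, k)

# Hypothetical stand-in for the November records: plain integers, not real data.
records = list(range(1_000_000))
subset = sample_rows(records, 20_000)
```

A uniform sample keeps roughly the same mix of occupancy categories as the full month, so any imbalance in the full data carries over to the sample.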

Unfortunately, Train Occupancy only had 3 categories, which were equally spread, and this resulted in the bulk of entries being listed as "empty train". One method that could have been used instead was Time Series Analysis to explore weekly and hourly relationships. However, this would have required far more time than was available.
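As a rough idea of what a first step towards the weekly and hourly exploration might look like, the sketch below groups observations by weekday and hour and computes the share of non-empty trains in each slot. We did not get to do this; the function name, the (timestamp, is_empty) record shape and the three observations are made up purely for illustration.

```python
from collections import defaultdict
from datetime import datetime

def non_empty_share_by_slot(observations):
    """Group (timestamp, is_empty) pairs by (weekday, hour) and
    return the share of non-empty trains in each slot."""
    counts = defaultdict(lambda: [0, 0])  # slot -> [non_empty, total]
    for ts, is_empty in observations:
        slot = (ts.weekday(), ts.hour)  # Monday is weekday 0
        counts[slot][0] += 0 if is_empty else 1
        counts[slot][1] += 1
    return {slot: non_empty / total for slot, (non_empty, total) in counts.items()}

# Made-up observations for illustration only (not project data).
obs = [
    (datetime(2016, 11, 7, 8, 5), False),
    (datetime(2016, 11, 7, 8, 40), True),
    (datetime(2016, 11, 7, 13, 10), True),
]
shares = non_empty_share_by_slot(obs)
```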

Essentially, this form of classification meant our model could not show the information we wanted, as small fluctuations in passenger numbers due to weather wouldn't be picked up on such a broad scale. Despite this, we tried to do more feature engineering. This included creating indicator variables for empty and full trains: if a train was classified as empty, the empty indicator would be 1 and the full indicator 0. If a train was full, the full indicator would be 1 and the empty indicator 0. If a train was in between, both would be 0.
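The empty/full encoding described above can be written as a small function. This is a sketch in Python, not our original R code, and the category labels "empty", "in between" and "full" are assumed names; the labels in the actual dataset may differ.

```python
def occupancy_indicators(category):
    """Map an occupancy category to a pair of 0/1 flags: (is_empty, is_full).

    Any category other than "empty" or "full" (the in-between case)
    maps to (0, 0). The label strings are assumptions for this sketch.
    """
    label = category.strip().lower()
    is_empty = 1 if label == "empty" else 0
    is_full = 1 if label == "full" else 0
    return is_empty, is_full

flags = [occupancy_indicators(c) for c in ["empty", "in between", "full"]]
```

Using two indicators for three categories leaves the in-between case as the baseline, which is the usual way to avoid a redundant column in a regression.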

We also only included the lines whose trains were less likely to be empty. These were chosen by using a linear model to predict which train lines were more often not empty. In the end, we included City Circle, North Shore, East Hills and Northern line via Macq Park. This helped somewhat with our data, and from this we found using logistic regression that Minimum Temperature had an influence on trains not being as empty. However, this result could be noise, and we did not run any diagnostics to test it other than goodness of fit (AIC).
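To make the logistic regression and AIC comparison concrete, here is a minimal sketch in Python. It is not the R code we ran: the fit uses plain gradient ascent instead of a library routine, the temperatures and outcomes are synthetic numbers invented for the example, and all function names are hypothetical. AIC is computed as 2k - 2 ln L, where k is the number of fitted parameters and L the maximised likelihood.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(x, y, lr=0.1, steps=5000):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by gradient ascent on the mean log-likelihood."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            err = yi - sigmoid(b0 + b1 * xi)
            g0 += err
            g1 += err * xi
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

def log_likelihood(x, y, b0, b1):
    """Bernoulli log-likelihood of the data under the fitted model."""
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for xi, yi in zip(x, y):
        p = min(max(sigmoid(b0 + b1 * xi), eps), 1.0 - eps)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * loglik

# Synthetic data for illustration only: daily minimum temperature (deg C)
# and whether a train was recorded as not empty (1) or empty (0).
min_temp = [12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
not_empty = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

# Centre the predictor so gradient ascent converges quickly.
mean_temp = sum(min_temp) / len(min_temp)
x = [t - mean_temp for t in min_temp]

b0, b1 = fit_logistic(x, not_empty)
aic_fit = aic(log_likelihood(x, not_empty, b0, b1), n_params=2)

# Intercept-only model for comparison: predicts the overall rate for every row.
rate = sum(not_empty) / len(not_empty)
null_b0 = math.log(rate / (1.0 - rate))
aic_null = aic(log_likelihood(x, not_empty, null_b0, 0.0), n_params=1)
```

A lower AIC for the model with temperature than for the intercept-only model suggests temperature adds information, but on its own it says nothing about whether the effect is noise, which is the limitation noted above.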

We also attempted further feature engineering by flagging peak hour times as defined by Sydney Trains (7am-9am and 4:30pm-6pm) to work out whether peak hour had an effect. Unfortunately, there wasn't enough time to investigate this.
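A peak hour flag using the windows above could look like the sketch below. It is written in Python for illustration; treating each window as including its start time and excluding its end time, and ignoring weekends and public holidays, are assumptions of this example.

```python
from datetime import time

# Peak windows from the text: 7am-9am and 4:30pm-6pm.
AM_PEAK = (time(7, 0), time(9, 0))
PM_PEAK = (time(16, 30), time(18, 0))

def is_peak(t):
    """Return 1 if time-of-day t falls in either peak window, else 0.

    Windows are treated as half-open [start, end), an assumption of this sketch.
    """
    return int(any(start <= t < end for start, end in (AM_PEAK, PM_PEAK)))

peak_flags = [is_peak(t) for t in (time(8, 0), time(12, 0), time(17, 15))]
```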

If we had had more time, we would have been able not only to explore the other datasets but also to come up with more accurate observations for weather. Ideally we would have used Opal data, but unfortunately the period covered by the datasets wasn't long enough to give an accurate picture of weather affecting transport. We would also have done further feature engineering and Time Series Analysis. We were unfortunately not able to make a video in time.

Team name: Transport Analysis
Datasets used:
Train Occupancy - Nov 2016 to Feb 2017
BOM Climate Data