In a machine learning course at Aalto University, a project involved researching the performance of two machine learning methods applied to a real-life problem. The project received a perfect grade of 5, but the model’s accuracy – 50% – was not satisfactory. What’s the appropriate response? Develop another, better model independently, document the process, and then share it with the provider of the dataset, Bird Co.
1 Introduction
This project aims to understand customer behavior by predicting the demand for shared electric scooters in Helsinki and to create a machine learning model with real-world applicability. My approach combines ride data with weather data, analyzes the combined data frame using machine learning methods, and predicts demand from the temporal and weather features. The insights gained from this study have real-life value in guiding both fleet management strategies and administrative decisions in the micromobility sector, as the study answers which factors affect customer behavior and by how much; this can be used to ensure optimal scooter supply and to forecast revenue. After this introductory section, Section 2 discusses the problem formulation, Section 3 explains the two machine learning models applied to our problem, Section 4 displays the results of the study, Section 5 draws conclusions from the results, and the references and code are appended at the end of the project (editor’s note: the code is confidential and not included in this article).
2 Problem Formulation
2.1 Objective
The objective of the research is to predict electric scooter ride demand on an hourly basis using historical ride data and its associated features: date, weekday, time interval, temperature, and weather conditions. A secondary objective is to predict demand accurately enough that the model, linked to weather forecasts, would have real-world applicability. This is a supervised learning task, specifically a time series forecasting (regression) problem. The results of the research can be used to create a privately hosted machine learning model, linked to weather forecasts, that confidently predicts the demand for electric scooters. These predictions can be applied to fleet management strategies, especially in designing shift times for workers, with the end goal of ensuring optimal scooter supply to meet the predicted demand. They can also inform administrative strategies: predicted demand can be used to forecast revenue, and operations can be adjusted accordingly.
2.2 Data Points, Features, and Labels
Each data point represents an individual scooter ride completed during the timeframe. Our features are:
- Date – the specific day the ride took place (Categorical)
- Weekday – e.g. Monday, Tuesday (Categorical)
- Time Interval – the specific 1-hour interval during which the ride commenced, e.g. 10:00-11:00, 11:00-12:00 (Categorical)
- Temperature – in degrees Celsius (Continuous)
- Weather Conditions – divided into four categories: clear, cloudy, thunder, rain (Categorical)
As our label, we have:
- Ride Count for the Interval – the total number of rides that commenced during a specific 1-hour interval.
3 Methods
3.1 Dataset Overview and Preprocessing
One combined dataset is formed from ride data and weather data. The ride dataset is exclusively provided by Bird Co., a global micromobility company, and stores shared electric scooter ride data from Helsinki over a span of three months, from 22.06.2023 to 20.09.2023. The dataset is expansive and extensive, offering insight into the behavioral patterns of shared electric scooter users in Helsinki. There are 220 769 data points, each of which represents a completed scooter ride. For each ride, the dataset stores the ride ID, start time, end time, ride distance, ride duration, and the ride’s start and end coordinates.
The weather dataset is fetched from the Finnish Meteorological Institute’s open data service. The Finnish Meteorological Institute (Finnish: Ilmatieteenlaitos) is a government agency responsible for gathering and reporting weather data and forecasts in Finland, and it encourages developing applications on top of its weather and oceanographic data through its open data web services. For this project, I created a Python program to fetch machine-readable data from the Helsinki Kaisaniemi weather station to match the timeframe of our ride data. The dataset features one data point per 10-minute interval and stores the latitude and longitude of the weather station, a timestamp, the temperature, the wind speed, and a SmartSymbol code corresponding to the weather conditions.
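To give a sense of the fetching step, below is a minimal sketch using the Finnish Meteorological Institute’s public WFS interface at opendata.fmi.fi with the "fmi::observations::weather::simple" stored query. The endpoint, query name, and parameter values are assumptions based on FMI’s public documentation; the actual (confidential) project code may differ.

```python
# Minimal sketch: fetch weather observations from the FMI open data WFS service.
# Endpoint and stored query are assumptions; the confidential project code may differ.
import requests

FMI_WFS = "https://opendata.fmi.fi/wfs"
params = {
    "service": "WFS",
    "version": "2.0.0",
    "request": "getFeature",
    "storedquery_id": "fmi::observations::weather::simple",
    "place": "Kaisaniemi,Helsinki",
    "starttime": "2023-06-22T00:00:00Z",
    "endtime": "2023-06-22T01:00:00Z",
}

response = requests.get(FMI_WFS, params=params, timeout=30)
response.raise_for_status()

# The service returns XML; each element carries a timestamp, a parameter name
# (e.g. temperature), and its value, which can then be parsed into a table.
print(response.text[:500])
```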
Our preprocessing consisted of filtering out missing, incomplete, or unnecessary records from both datasets to maintain data integrity. For the ride data, preprocessing included filtering out non-Helsinki rides, i.e. any rides whose start city was not Helsinki. The ride durations were cleaned by removing commas and filtering out rides shorter than 1 minute, and the Start Time feature was split into three new features: Date, Weekday, and Time. Additionally, a Time Interval feature was created to assign each ride to its respective time slot, unnecessary columns were dropped, and rides were grouped and counted. I grouped the data by ‘Date’, ‘Time_Interval’, and ‘Weekday’, then counted the number of rides in each unique combination, which yields the number of rides per time interval. For the weather data, preprocessing included converting the Unix timestamps to match the timestamps in the ride data and mapping the numerous possible SmartSymbol codes to a comprehensible weather condition.
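The grouping and counting step can be illustrated with a short pandas sketch. The file name and column names are illustrative assumptions, since the actual project code is confidential.

```python
# Illustrative pandas sketch of the grouping step described above.
# File and column names are assumptions; the confidential project code may differ.
import pandas as pd

rides = pd.read_csv("rides_helsinki.csv", parse_dates=["start_time"])  # hypothetical file

# Derive the temporal features from the ride start time.
rides["Date"] = rides["start_time"].dt.date
rides["Weekday"] = rides["start_time"].dt.day_name()
rides["Time_Interval"] = rides["start_time"].dt.hour  # hour marking the start of the 1-hour slot

# Count how many rides commenced in each (Date, Time_Interval, Weekday) combination.
demand = (
    rides.groupby(["Date", "Time_Interval", "Weekday"])
         .size()
         .reset_index(name="Ride_Count")
)
print(demand.head())
```

The hourly weather features (temperature and weather condition) are then merged onto this table by date and time interval before modeling.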
3.2 Model Selection and Justification
For our study’s first method, we chose the Random Forest. The nature of the Random Forest machine learning method suited our dataset and project goals – Random Forest is an ensemble technique, functioning by creating a “forest” of decision trees, usually trained with the “bagging” method (GeeksForGeeks 2023). The predictions from the individual decision trees are combined into a final prediction, allowing the model to capture complex, non-linear relationships in the data effectively – our data likely doesn’t follow many linear patterns. Its hypothesis space is vast and flexible, offering the ability to navigate the patterns and relationships in our unique dataset. The model is highly capable of handling missing data, outliers, and noisy features (GeeksForGeeks 2023), and it provides insight into feature importance, essentially computing the relevance of each feature for our problem.
For our second method, we chose XGBoost. Essentially, XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable (XGBoost Developers 2022). XGBoost claims to solve many data science problems in a fast and accurate way thanks to its parallel tree boosting. The method has been known to win machine learning competitions due to its speed and performance – it handles various types of data and tends to work well out of the box. XGBoost also provides a built-in function to plot feature importance, which gives insight into which features are most influential in predicting demand.
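As a rough sketch of how the two regressors could be set up on the prepared hourly table (after the weather merge), see below. The encoding choices, hyperparameters, and column names are illustrative assumptions, not the confidential project code.

```python
# Sketch of setting up the two regressors on the engineered features.
# Encoding choices and hyperparameters are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# 'demand' is the hourly table from preprocessing, assumed to already contain the
# merged weather features; one-hot encode the categorical columns.
X = pd.get_dummies(
    demand[["Time_Interval", "Weekday", "Temperature", "Weather"]],
    columns=["Weekday", "Weather"],
)
y = demand["Ride_Count"]

rf = RandomForestRegressor(n_estimators=300, random_state=42)
xgb = XGBRegressor(
    n_estimators=300,
    learning_rate=0.1,
    objective="reg:squarederror",  # squared error loss, discussed in the next subsection
    random_state=42,
)
```

Both models are then fit on the chronological training split described in Section 3.4.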
3.3 Loss Function Selection
Per the project guidelines on allowed methods, it was advised to use the squared error, commonly known as the mean squared error (MSE), as the loss function for both of our models. MSE quantifies the difference between predicted and actual scooter ride counts. This choice is justified by MSE penalizing larger errors more heavily than smaller ones due to its quadratic nature, thereby making the model pay significant attention to outliers. This matters in our time series forecasting problem, where sudden spikes or drops can have significant consequences. MSE is a natural fit for regression models, as squaring the differences emphasizes each deviation and pushes the model to align closely with the actual values. It is also interpretable: the value of the squared error offers a clear indication of the model’s performance.
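For reference, writing y_i for the actual ride count in hour i and ŷ_i for the model’s prediction over n hours, the loss is the standard

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
```

and the squaring is exactly what makes one large miss (for example, a sudden demand spike) dominate the loss far more than several small deviations.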
3.4 Model Validation Process
We split our dataset in chronological order, preventing the use of future data to predict past events.
- The training set is the earliest 70% of the data (~154 000 data points).
- The validation set contains the 15% of the data immediately following the training set (~33 000 data points).
- The test set is the final, most recent 15% of the data (~33 000 data points).
My choice to use a straightforward chronological split is due to the sequential nature of the data. Using techniques that mix time points, such as standard k-fold cross-validation, could leak future information into training and introduce inconsistencies. With the chronological split, the model is always trained on past observations and evaluated on later ones, mirroring how it would be used in practice.
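A minimal sketch of the 70/15/15 chronological split, using index-based slicing on the time-sorted table (variable names continue from the earlier sketches and are illustrative):

```python
# Chronological 70/15/15 split of the time-sorted demand table.
# Assumes X and y are ordered by date and time interval.
n = len(X)
train_end = int(0.70 * n)
val_end = int(0.85 * n)

X_train, y_train = X.iloc[:train_end], y.iloc[:train_end]                # earliest 70%
X_val, y_val = X.iloc[train_end:val_end], y.iloc[train_end:val_end]      # next 15%
X_test, y_test = X.iloc[val_end:], y.iloc[val_end:]                      # most recent 15%

rf.fit(X_train, y_train)
xgb.fit(X_train, y_train)
```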
4 Results
4.1 Visualization: Random Forest (Left) vs. XGBoost (Right)
Figure 1. Scatterplots of model predictions versus actual data: training set (70%) on the left, validation set (15%) in the middle, and test set (15%) on the right. The closer the scatter points are to the dashed line, the better the model’s predictions.
Figure 2. The feature importance graph provides a representation of the relative importance of each feature in predicting the target variable, the number of rides. The longer the bar, the more significant the feature is in making accurate predictions.
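Plots like Figure 2 can be produced directly from the fitted models. Below is a brief sketch assuming the fitted rf and xgb objects from Section 3; the exact styling of the report’s figures may differ.

```python
# Sketch of producing feature importance plots similar to Figure 2.
import matplotlib.pyplot as plt
import pandas as pd
from xgboost import plot_importance

# Random Forest: impurity-based importances aligned with the feature columns.
rf_importance = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values()
rf_importance.plot(kind="barh", title="Random Forest feature importance")
plt.tight_layout()
plt.show()

# XGBoost ships a built-in plotting helper for its own importance scores.
plot_importance(xgb)
plt.show()
```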
Figure 3. These scatter plots compare the actual and predicted number of rides against the two most significant features: time interval and temperature. Blue dots represent the actual data points; red dots indicate the model’s predictions. These graphs can indicate if there are any systematic deviations in predictions and highlight the variability in the data and the predictions.
Figure 4. A comparison table presenting the performance metrics for Random Forest and XGBoost. MSE, or mean squared error, measures the model’s accuracy in predicting the number of rides. MAE, or mean absolute error, represents the average magnitude of error in the ride count predictions. R-squared is a statistical measure indicating the proportion of the variance in the dependent variable (number of rides) that is predictable from the independent variables.
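The metrics in Figure 4 are standard regression metrics; a sketch of how they could be computed with scikit-learn, using the fitted models and splits from Section 3 (names are illustrative):

```python
# Sketch of computing the Figure 4 metrics for each model and split.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def report(model, X_part, y_part, name):
    preds = model.predict(X_part)
    print(
        f"{name}: MSE={mean_squared_error(y_part, preds):.2f}, "
        f"MAE={mean_absolute_error(y_part, preds):.2f}, "
        f"R2={r2_score(y_part, preds):.4f}"
    )

for model, label in [(rf, "Random Forest"), (xgb, "XGBoost")]:
    report(model, X_train, y_train, f"{label} / train")
    report(model, X_val, y_val, f"{label} / validation")
    report(model, X_test, y_test, f"{label} / test")
```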
4.2 Results Breakdown
The R-squared values on the test set – 86.81% for Random Forest and 86.37% for XGBoost – indicate that each model explains a substantial portion of the variance in ride demand. These high R-squared values reflect strong predictive ability, with both models closely aligning with the observed data. The Random Forest Regressor has a higher training set mean squared error (MSE, 423.54) than XGBoost (292.38), which indicates that XGBoost fits the training data better. However, the validation set MSE is higher for XGBoost (688.36) than for Random Forest (632.16). The greater gap between training and validation MSE for XGBoost suggests that it may be overfitting the training data slightly more than Random Forest – learning its noise and idiosyncrasies – and therefore performing slightly worse on unseen validation data. Conversely, the Random Forest model shows a smaller increase from training to validation MSE, indicating better generalization and more robust resistance to overfitting. The mean absolute error (MAE) was included to make the results more comprehensible: it shows by how many rides, on average, the predictions deviate from the actual ride counts.
The comparative analysis of the feature importance plots from both Random Forest and XGBoost models reveals a distinction in how each model prioritizes the predictors. Specifically, the Random Forest model exhibits a preference for the time interval feature – as demonstrated in Figure 2 – indicating its significant role in predicting the number of rides. In contrast, the XGBoost model attributes more balanced importance across features, with temperature emerging as the most influential predictor. This difference underlines the different mechanisms by which each model processes the features to generate predictions.
The scatter plots – in Figure 3 – provide a visual comparison of how the Random Forest and XGBoost models predict the number of rides across different temperatures and time intervals – the two most important features for both models. From the plots, it appears that the XGBoost predictions are more dispersed throughout the range of actual values, suggesting that XGBoost is potentially more responsive to the underlying patterns in the data, including outliers. On the other hand, the Random Forest predictions seem more concentrated around the mean of the data, which indicates conservatism in predicting values that significantly deviate from the average, thereby potentially underestimating or overestimating in regions with higher variability.
4.3 The Final Chosen Model
XGBoost is chosen for its broad feature utilization and diverse prediction range, matching the dynamic nature of ride data. Its even feature importance allocation signifies a comprehensive understanding of ride-sharing factors – from weather to time intervals – essential for capturing demand extremes. Although XGBoost shows a slightly higher validation error than Random Forest, the marginal difference doesn’t imply severe overfitting, and the model maintains strong predictive capability. Furthermore, XGBoost’s strong test set performance suggests good real-world applicability – another key factor behind the choice, as real-world applicability was one of the main objectives of this research. Consequently, XGBoost stands out as a robust and adaptable model for ride-sharing demand forecasting, warranting its selection as the model of choice.
5 Conclusion
To conclude, this project sought to develop a machine learning model capable of forecasting the demand for shared electric scooters in Helsinki, using an extensive dataset provided by Bird Co. covering three months of summer 2023 ride data and comprising over 220 000 data points. Through systematic data preprocessing and thoughtful feature engineering, the chosen models – Random Forest Regressor and XGBoost – were trained and evaluated, with the latter proving to be the final chosen method. The results suggest that the XGBoost model handles non-linear patterns in the data and delivers demand predictions with a high degree of accuracy, indicating potential for practical application in managing and optimizing fleet operations, as well as in aiding administrative decision-making in Helsinki.
Both objectives of the research were met: the model explained roughly 86% of the variance in unseen data, showcasing strong potential for real-world implementation. The practical utility of this model lies in its potential integration with live weather forecasting, leveraging the same data sources from the Finnish Meteorological Institute. Such an application could enhance fleet management operations by allowing dynamic, data-driven allocation of vehicles based on anticipated demand and by pinpointing the optimal times for repair projects. When demand is predicted to be lower, it is an optimal time for vehicle maintenance and inbound logistics; conversely, when supply is well matched to customer demand, higher satisfaction and service reliability are achieved. This predictive capability not only benefits operational efficiency but can also serve as a tool for administrative decision-making, offering insights into expected revenue streams and enabling more informed economic strategies: the higher the predicted demand, the higher the expected revenue.
Integrating weather forecasts into the model will likely reduce the overall accuracy, as the forecasts carry uncertainty of their own. To compensate for this loss of accuracy in a real-world model, a larger training dataset – one spanning a longer timeframe – could be useful. This would enhance the model’s understanding of year-round demand fluctuation and reduce seasonal bias. Additionally, the model could be improved by integrating other relevant datasets, such as events in the city and public transportation disruptions, to provide a more comprehensive demand forecast. Investing in real-time data processing and analytics would enable dynamic model adjustments as new data becomes available, leading to more accurate, timely predictions. This would allow a proactive approach to fleet management, contributing to a more efficient and responsive urban mobility system.
6 References
ChatGPT (2023). Used for researching reasons on which machine learning model to choose for our specific study. Note: The use of ChatGPT was allowed in project instructions. OpenAI.com. Available at: https://chat.openai.com/ [Accessed 21 Sep. 2023].
GeeksForGeeks (2019). Random Forest Regression in Python. [online] GeeksforGeeks.org. Available at: https://www.geeksforgeeks.org/random-forest-regression-in-python/ [Accessed 22 Sep. 2023].
Jung, A. (2022). Machine Learning: The Basics. Springer, Singapore.
Sigg, S. (2023). Lecture 2: Regression. In CS-C3240 – Machine Learning (D). Delivered on September 8, 2023.
XGBoost Developers (2022). XGBoost Documentation — xgboost 1.5.1 documentation. [online] xgboost.readthedocs.io. Available at: https://xgboost.readthedocs.io/en/stable/.
7 Appendices
The code is confidential.