PROJECT FINAL PAPER
EE2: Weather-Calibrated Data Cleaning for Urban Mobility Modeling
Huiling Liu, Jiaying Zhou, Min Jiang, Shreya Sawant, Yi Kiu Ho, Zuo Wang
CIAM
BUS501
Dr. Edmund Khashadourian
June 28, 2025
EE1 Summary: Market Trends and Behavioral Insights
In the EE1 phase, we systematically analyzed Capital Bikeshare’s operating environment and post-pandemic market trends by combining Ivey case materials with market data from Statista. Our analysis shows that the U.S. bike-sharing sector is recovering robustly. The projections we reviewed have the industry growing from $800 million in 2023 to $1.1 billion by 2029, with the user base passing 60 million by 2028.
Demand for bikesharing follows a dual pattern. The first is seasonal: rides peak in spring and summer, with 67% of all annual rides occurring between April and September. The second is daily: rides peak from midday until early evening. Demand is also highly sensitive to weather conditions. For example, ridership can drop by 30% to 40% when temperatures rise above 35°C or when rainfall exceeds 5 mm.
User behavior is also increasingly diverse. Around 62% of rides appear to be for commuting and 32% for leisure or social use. These findings point toward a focused product strategy for Cycle Right helmets: the helmets should serve both commuters, who need a light and safe design, and casual users, who also care about style. In addition, the spring-summer peak should be used to grow market share.
The Importance of Data Cleaning in EE2
Our preliminary analysis revealed that Capital Bikeshare demand is not only seasonal but also highly responsive to weather conditions. Ride volume peaks during warm, dry months (especially April to September) and drops sharply in adverse weather. Similarly, usage spikes midday and in the afternoon. These patterns imply that the market demand for Cycle Right helmets will align closely with high-usage periods and favorable weather—making precise demand modeling essential. This begins with rigorous data cleaning.
In EE2, we focus on building a high-quality data foundation. Patterns uncovered in EE1—such as the optimal temperature range (18–28°C) and sensitivity thresholds for rainfall—serve as benchmarks for identifying anomalies. Cleaned, reliable data is critical for pinpointing ideal product launch windows—especially for dynamic inventory planning during peak seasons and for tailoring distribution strategies across commuting and leisure use cases.
Our Targeted Data Cleaning Framework
We propose a data-cleaning methodology grounded in statistical theory and domain-specific best practices. This approach aims to preserve key variable relationships while improving overall data quality to support robust modeling.
1. Diagnosing the Missingness Mechanism
The first step is to assess why data is missing. Missing values can be:
· MCAR: Missing Completely At Random
· MAR: Missing At Random (conditional on observed data)
· MNAR: Missing Not At Random (dependent on unobserved data)
For example, missing weather data may result from sensor outages (likely MAR), while gaps in rental counts might indicate system downtime or logging issues. As Rubin (1976) emphasized, identifying the missingness mechanism is crucial—it determines whether deletion, imputation, or model-based methods are appropriate.
In our case, we found 14 missing entries for temperature fields (Max, Min, and Avg Temp), mostly in early February 2023 (snow days) and late April 2023 (rainy days). This suggests possible sensor downtime or data upload failures.
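A first diagnostic for this step is simply to count where the gaps fall. The sketch below is illustrative only: the records, field names, and dates are hypothetical stand-ins, not the actual project dataset, and `None` is assumed to mark a missing reading.

```python
from collections import Counter

# Hypothetical daily records (NOT the project data); None marks a missing value.
records = [
    {"date": "2023-02-01", "max_temp_f": None, "min_temp_f": None, "avg_temp_f": None},
    {"date": "2023-02-02", "max_temp_f": 41.0, "min_temp_f": 30.0, "avg_temp_f": 35.5},
    {"date": "2023-04-28", "max_temp_f": None, "min_temp_f": 50.0, "avg_temp_f": None},
    {"date": "2023-04-29", "max_temp_f": 63.0, "min_temp_f": 52.0, "avg_temp_f": 57.5},
]

TEMP_FIELDS = ("max_temp_f", "min_temp_f", "avg_temp_f")  # assumed column names


def missing_by_field(rows, fields):
    """Count missing (None) entries for each field."""
    return {f: sum(1 for r in rows if r[f] is None) for f in fields}


def missing_by_month(rows, fields):
    """Count missing entries per calendar month (YYYY-MM) across all fields."""
    counts = Counter()
    for r in rows:
        n_missing = sum(1 for f in fields if r[f] is None)
        if n_missing:
            counts[r["date"][:7]] += n_missing
    return dict(counts)


field_counts = missing_by_field(records, TEMP_FIELDS)
month_counts = missing_by_month(records, TEMP_FIELDS)
print(field_counts)
print(month_counts)
```

If the gaps cluster in particular months or weather conditions, as they do in our data, that is evidence against MCAR, and the clustering variable should be included in any later imputation model.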
2. Contextual Imputation of Numerical Variables
For numeric variables like temperature, humidity, and ride counts, imputation strategy should match the pattern and duration of missingness:
· Short gaps (1–3 hours): Linear interpolation maintains continuity and local trends.
For the remaining gaps, given the time-series nature of the dataset and strong inter-variable correlations, we recommend Multiple Imputation by Chained Equations (MICE). As shown by van Buuren and Groothuis-Oudshoorn (2011), this method preserves inter-variable relationships better than mean or forward-fill imputation.
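The short-gap rule can be sketched as follows. This is a minimal illustration, assuming an evenly spaced hourly series in which `None` marks a missing reading. The function name `interpolate_short_gaps` and the `max_gap` parameter are hypothetical, and the sample values are made up.

```python
def interpolate_short_gaps(values, max_gap=3):
    """Linearly interpolate runs of None that are at most max_gap long.

    Runs at the start or end of the series, and runs longer than
    max_gap, are left as None so another method can handle them.
    """
    result = list(values)
    n = len(result)
    i = 0
    while i < n:
        if result[i] is not None:
            i += 1
            continue
        start = i  # first missing index in this run
        while i < n and result[i] is None:
            i += 1
        end = i  # first observed index after the run (or n)
        gap = end - start
        if start == 0 or end == n or gap > max_gap:
            continue  # no anchor on one side, or gap too long
        left, right = result[start - 1], result[end]
        step = (right - left) / (gap + 1)
        for k in range(gap):
            result[start + k] = left + step * (k + 1)
    return result


# Made-up hourly temperatures in °C: one 2-hour gap and one 4-hour gap.
hourly_temp_c = [10.0, None, None, 16.0, None, None, None, None, 20.0]
filled = interpolate_short_gaps(hourly_temp_c, max_gap=3)
print(filled)
```

Only the 2-hour gap is filled here. The 4-hour gap is deliberately left untouched so that a multivariate method such as MICE can use the other weather variables to fill it.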
3. Unit Consistency and Standardization
Although the original temperature fields are recorded in Fahrenheit (℉), we converted them to Celsius (°C) for three reasons:
· Alignment with EE1: Our analysis defined optimal ride demand at 18–28°C.
· Consistency with external sources: The Ivey case and most global references (e.g., WMO, Statista) use °C.
· Better model readability and interoperability: Using °C simplifies interpretation and avoids future unit mismatches with other datasets.
The standard conversion formula used was:
°C = (°F – 32) × 5/9
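As a minimal sketch, the conversion and the EE1 optimal-range check could be written as below. The function names and sample readings are illustrative assumptions. Only the formula and the 18–28°C range come from our analysis.

```python
def fahrenheit_to_celsius(temp_f):
    """Apply °C = (°F - 32) × 5/9; missing values (None) pass through."""
    if temp_f is None:
        return None
    return (temp_f - 32) * 5 / 9


OPTIMAL_RANGE_C = (18.0, 28.0)  # optimal ride-demand range from EE1


def in_optimal_range(temp_c):
    """Return True when a Celsius reading falls inside the EE1 range."""
    low, high = OPTIMAL_RANGE_C
    return temp_c is not None and low <= temp_c <= high


# Made-up Fahrenheit readings, including one missing value.
readings_f = [32.0, 68.0, 95.0, None]
readings_c = [fahrenheit_to_celsius(t) for t in readings_f]
print(readings_c)
print([in_optimal_range(t) for t in readings_c])
```

Passing `None` through unchanged keeps the conversion separate from imputation, so the unit change cannot silently turn a missing value into a number.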
4. Outlier Detection and Contextual Validation
Outliers must be assessed in context—not blindly removed. For instance, 500 rentals at 3 AM is likely an error, while 500 rentals at 1 PM on a holiday is likely valid.
Common methods for outlier detection include:
· Z-score or modified Z-score
· Domain checks (e.g., a temperature above 45°C is implausible for the service area, and negative humidity is physically impossible)
· Temporal benchmarks (e.g., moving median or seasonal baselines)
Importantly, we avoid discarding extreme yet valid data points (e.g., spikes during festivals or heatwaves), as they may reveal critical commercial opportunities for helmet marketing.
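The sketch below illustrates two of these checks, the modified Z-score and simple domain rules. It flags values for review and does not delete them. The 3.5 cutoff is a commonly used convention that we assume here, not a value tuned on our data, and the rental counts are made up.

```python
from statistics import median


def modified_z_scores(values):
    """Modified Z-score: 0.6745 * (x - median) / MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Simplification: with zero MAD the score is undefined, so flag nothing.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def flag_for_review(values, threshold=3.5):
    """Mark values whose modified Z-score exceeds the assumed threshold."""
    return [abs(z) > threshold for z in modified_z_scores(values)]


def violates_domain(temp_c, humidity_pct):
    """Domain check: temperature above 45°C or negative humidity."""
    return temp_c > 45 or humidity_pct < 0


# Made-up rental counts for the 3 AM hour on seven different days.
rentals_3am = [12, 15, 9, 14, 11, 500, 13]
flags = flag_for_review(rentals_3am)
print(flags)
```

Comparing each count against the same hour on other days is what makes the check contextual: 500 rentals stands out at 3 AM but would not necessarily stand out in a 1 PM holiday series. Flagged rows should then be checked against event calendars before any correction.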
5. Final Validation & Sensitivity Testing
Once imputation and anomaly correction are complete, we perform integrity checks:
· Compare pre- and post-cleaning descriptive statistics
· Re-fit baseline models (e.g., random forest) to identify residual issues
· Run sensitivity analysis to assess how different imputation strategies affect predictions
These checks help ensure that the cleaning itself does not introduce unintended modeling bias, an issue highlighted in the missing-data literature (Little & Rubin, 2019).
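The first check, comparing descriptive statistics before and after cleaning, can be sketched as follows. The series are made up for illustration, and the helper names `describe` and `relative_shift` are hypothetical.

```python
from statistics import mean, stdev


def describe(values):
    """Summarize the observed (non-missing) values of a series."""
    observed = [v for v in values if v is not None]
    return {"n": len(observed), "mean": mean(observed), "std": stdev(observed)}


def relative_shift(before, after, key):
    """Relative change in one statistic between two summaries."""
    return abs(after[key] - before[key]) / abs(before[key])


# Made-up temperature series in °C: raw (with gaps) and after interpolation.
raw = [10.0, None, 14.0, 16.0, None, 20.0]
cleaned = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]

before = describe(raw)
after = describe(cleaned)
mean_shift = relative_shift(before, after, "mean")
std_shift = relative_shift(before, after, "std")
print(before, after)
print(mean_shift, std_shift)
```

In this toy case the mean is unchanged but the standard deviation shrinks, because interpolated points lie between their neighbors. A shift like this is worth reporting, since reduced variance can make downstream models look more certain than the raw data supports.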
References
1. Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592. https://doi.org/10.1093/biomet/63.3.581
2. van Buuren, S., & Groothuis-Oudshoorn, K. (2011). MICE: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1–67. https://doi.org/10.18637/jss.v045.i03
3. Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177. https://doi.org/10.1037/1082-989X.7.2.147
4. Little, R. J. A., & Rubin, D. B. (2019). Statistical Analysis with Missing Data (3rd ed.). Wiley. https://doi.org/10.1002/9781119482260