Optimizing Data Pipelines for Green AI: A Comparative Analysis of Pandas, Polars, and PySpark for CO<sub>2</sub> Emission Prediction
Abstract
This study evaluates the performance and energy trade-offs of three popular data processing libraries (Pandas, Polars, and PySpark) applied to GreenNav, a CO<sub>2</sub> emission prediction pipeline for urban traffic. GreenNav is an eco-friendly navigation app that predicts CO<sub>2</sub> emissions and identifies low-carbon routes using a hybrid CNN-LSTM model, integrated into a complete pipeline for ingesting and processing large, heterogeneous geospatial and road data. For each library, we measure the end-to-end execution time, cumulative CPU load, and peak RAM consumption of the GreenNav pipeline, then convert these metrics into energy consumption and CO<sub>2</sub> equivalents. Experiments on datasets ranging from 100 MB to 8 GB show that Polars in lazy mode offers substantial gains: it reduces processing time by a factor of more than twenty, memory consumption by about two-thirds, and energy consumption by about 60%, while maintaining the predictive accuracy of the model (R<sup>2</sup> ≈ 0.91). These results show that a careful choice of data processing library can reconcile high computing performance with environmental sustainability in large-scale machine learning applications.
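The conversion from measured runtime, CPU load, and peak RAM to energy and CO<sub>2</sub> equivalents can be illustrated with a minimal sketch. The abstract does not give the authors' conversion formula, so everything below is an assumption made for illustration: the linear power model, the function names, and the parameters `cpu_tdp_w`, `ram_w_per_gb`, and `grid_intensity_g_per_kwh` are hypothetical and would need to be replaced with values for the actual hardware and electricity grid.

```python
def energy_kwh(runtime_s, avg_cpu_util, cpu_tdp_w, peak_ram_gb, ram_w_per_gb):
    """Estimate the energy of one pipeline run in kWh.

    Assumed (hypothetical) power model: CPU power scales linearly with
    average utilization (0.0 to 1.0) up to its TDP, and RAM draws a fixed
    number of watts per GB at peak usage.
    """
    power_w = avg_cpu_util * cpu_tdp_w + peak_ram_gb * ram_w_per_gb
    # watts * seconds = joules; 1 kWh = 3.6e6 J
    return power_w * runtime_s / 3_600_000.0


def co2e_grams(energy, grid_intensity_g_per_kwh):
    """Convert energy in kWh to grams of CO2 equivalent.

    grid_intensity_g_per_kwh is the carbon intensity of the electricity
    supply; it is an input the caller must provide, not a measured value.
    """
    return energy * grid_intensity_g_per_kwh


if __name__ == "__main__":
    # Illustrative inputs only; these are not figures from the study.
    e = energy_kwh(runtime_s=3600, avg_cpu_util=0.5, cpu_tdp_w=100,
                   peak_ram_gb=10, ram_w_per_gb=0.5)
    print(f"energy: {e:.4f} kWh, CO2e: {co2e_grams(e, 400):.1f} g")
```

Under a model of this kind, a shorter runtime and a lower memory footprint both reduce the estimated energy, which is why processing time and peak RAM are reported alongside the energy figures.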
Authors (3)
Youssef Mekouar
Mohammed Lahmer
Mohammed Karim
Quick Access
- Publication year: 2025
- Source database: DOAJ
- DOI: 10.3390/computers14080319
- Access: Open Access ✓