DOAJ Open Access 2025

Optimizing Data Pipelines for Green AI: A Comparative Analysis of Pandas, Polars, and PySpark for CO<sub>2</sub> Emission Prediction

Youssef Mekouar, Mohammed Lahmer, Mohammed Karim

Abstract

This study evaluates the performance and energy trade-offs of three popular data processing libraries—Pandas, PySpark, and Polars—applied to GreenNav, a CO<sub>2</sub> emission prediction pipeline for urban traffic. GreenNav is an eco-friendly navigation app designed to predict CO<sub>2</sub> emissions and determine low-carbon routes using a hybrid CNN-LSTM model integrated into a complete pipeline for the ingestion and processing of large, heterogeneous geospatial and road data. Our study quantifies the end-to-end execution time, cumulative CPU load, and maximum RAM consumption for each library when applied to the GreenNav pipeline; it then converts these metrics into energy consumption and CO<sub>2</sub> equivalents. Experiments conducted on datasets ranging from 100 MB to 8 GB demonstrate that Polars in lazy mode offers substantial gains, reducing the processing time by a factor of more than twenty, memory consumption by about two-thirds, and energy consumption by about 60%, while maintaining the predictive accuracy of the model (R<sup>2</sup> ≈ 0.91). These results clearly show that the careful selection of data processing libraries can reconcile high computing performance and environmental sustainability in large-scale machine learning applications.
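The conversion from runtime metrics to energy and CO<sub>2</sub> equivalents described above can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not give their conversion formula. The sketch assumes a simple power model (CPU power scaled by average utilisation, plus a per-GB memory term), and every name and constant in it (`RunMetrics`, `CPU_TDP_WATTS`, `RAM_WATTS_PER_GB`, `GRID_G_CO2_PER_KWH`) is a hypothetical placeholder to be replaced with measured values.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    """Measurements for one pipeline run (field names are illustrative)."""
    wall_time_s: float    # end-to-end execution time, in seconds
    mean_cpu_util: float  # average CPU utilisation, from 0.0 to 1.0
    peak_ram_gb: float    # maximum RAM consumption, in GB


# Hypothetical constants, NOT taken from the paper; substitute real values.
CPU_TDP_WATTS = 65.0         # assumed CPU power draw at full load
RAM_WATTS_PER_GB = 0.375     # assumed memory power draw per GB
GRID_G_CO2_PER_KWH = 475.0   # assumed grid carbon intensity (g CO2e per kWh)


def energy_kwh(m: RunMetrics) -> float:
    """Estimate energy as (CPU power + RAM power) * time, converted to kWh."""
    cpu_watts = CPU_TDP_WATTS * m.mean_cpu_util
    ram_watts = RAM_WATTS_PER_GB * m.peak_ram_gb
    joules = (cpu_watts + ram_watts) * m.wall_time_s
    return joules / 3.6e6  # 1 kWh = 3.6e6 J


def co2_grams(m: RunMetrics) -> float:
    """Convert the energy estimate into grams of CO2 equivalent."""
    return energy_kwh(m) * GRID_G_CO2_PER_KWH


if __name__ == "__main__":
    # Made-up measurements for two runs of the same workload.
    baseline = RunMetrics(wall_time_s=600.0, mean_cpu_util=0.9, peak_ram_gb=12.0)
    candidate = RunMetrics(wall_time_s=60.0, mean_cpu_util=0.8, peak_ram_gb=4.0)
    for name, run in [("baseline", baseline), ("candidate", candidate)]:
        print(f"{name}: {energy_kwh(run):.6f} kWh, {co2_grams(run):.3f} g CO2e")
```

Under this model a shorter run lowers energy in direct proportion to its duration, so comparing libraries on energy requires recording time, CPU load, and memory together for each run.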

Authors (3)

Youssef Mekouar

Mohammed Lahmer

Mohammed Karim

Citation Format

Mekouar, Y., Lahmer, M., Karim, M. (2025). Optimizing Data Pipelines for Green AI: A Comparative Analysis of Pandas, Polars, and PySpark for CO<sub>2</sub> Emission Prediction. https://doi.org/10.3390/computers14080319

Journal Information
Publication Year: 2025
Source Database: DOAJ
DOI: 10.3390/computers14080319
Access: Open Access ✓