Machine Learning to identify water pollution hotspots in groundwater globally.

Publication date

DOI

Document Type

Master Thesis

Collections

License

CC-BY-NC-ND

Abstract

This study evaluated how groundwater-quality constituents can be integrated into machine learning models to identify and predict global contamination hotspots. Twelve key constituents, spanning nutrients, heavy metals, microbial indicators, organic micropollutants, and physico-chemical parameters, were selected based on both their prevalence in the literature and data availability. These were paired with hydroclimatic, geographic, and socioeconomic variables from the HydroBASINS and HydroATLAS datasets to construct a comprehensive input feature set. A set of machine learning models, including Random Forest, XGBoost, HistGradientBoostingRegressor, and Multi-layer Perceptron, was trained and compared using spatially aware validation strategies. Although the Random Forest model, unsurprisingly, emerged as the most robust overall, performance declined substantially under spatial cross-validation, particularly under Leave-One-Country-Out validation, highlighting the models' limited generalizability in regions with sparse data. The spatial distribution of training data was highly skewed, with over 98% of observations concentrated in North America (54.85%), Europe (43.63%), and Asia (1.17%). This geographic imbalance, compounded by a low number of unique monitoring sites, contributed to overfitting and poor performance in data-scarce regions such as Africa and Australia. Although model outputs suggested potential contamination hotspots in areas such as western China, northern Africa, and southwestern South America, these predictions must be interpreted with caution because the available data are limited and spatially skewed. The results emphasize the importance of both data coverage and appropriate validation when modelling environmental phenomena at global scales. While this framework provides a starting point for hotspot identification, more equitable global monitoring efforts and region-specific modelling approaches will be critical for improving reliability and practical utility.
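The Leave-One-Country-Out validation mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis code: the function name `leave_one_country_out`, the country labels, and the concentration values are all hypothetical, and a simple training-mean predictor stands in for the Random Forest and other models. The point is only to show how each country in turn is withheld entirely from training, so that the score reflects performance in a region the model has never seen.

```python
from collections import defaultdict
from statistics import mean


def leave_one_country_out(countries):
    """Yield (held_out_country, train_idx, test_idx) once per distinct country."""
    by_country = defaultdict(list)
    for idx, country in enumerate(countries):
        by_country[country].append(idx)
    for held_out, test_idx in by_country.items():
        # Training set: every sample that does NOT belong to the held-out country.
        train_idx = [i for i in range(len(countries)) if countries[i] != held_out]
        yield held_out, train_idx, test_idx


def mean_absolute_error(y_true, y_pred):
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


# Hypothetical toy data: one country label and one measured value per sample.
countries = ["US", "US", "US", "DE", "DE", "CN"]
values = [2.0, 2.5, 3.0, 5.0, 5.5, 9.0]

folds = list(leave_one_country_out(countries))
for held_out, train_idx, test_idx in folds:
    # Stand-in "model" (an assumption for illustration): predict the training mean.
    prediction = mean(values[i] for i in train_idx)
    error = mean_absolute_error(
        [values[i] for i in test_idx], [prediction] * len(test_idx)
    )
    print(f"held out {held_out}: n_test={len(test_idx)}, MAE={error:.2f}")
```

Because no sample from the held-out country appears in training, a model that relies on regional patterns is likely to score worse here than under a random split, which is consistent with the performance drop the abstract reports.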

Keywords

water pollution, geodata, machine learning

Citation