Elsevier

Water Research

Volume 246, 1 November 2023, 120676

Machine learning framework for intelligent aeration control in wastewater treatment plants: Automatic feature engineering based on variation sliding layer

https://doi.org/10.1016/j.watres.2023.120676

Highlights

  • VSL was designed as a specialized feature engineering approach for intelligent control of WWTPs.
  • VSL improved the performance of multiple classes of machine learning models.
  • Machine learning models based on VSL reduce the energy consumption of aeration.
  • An automated Python library called 'wwtpai' has been released free and open source.

Abstract

Intelligent control of wastewater treatment plants (WWTPs) has the potential to reduce energy consumption and greenhouse gas emissions significantly. Machine learning (ML) provides a promising solution to handle the increasing amount and complexity of generated data. However, relationships between the features of wastewater datasets are generally inconspicuous, which hinders the application of artificial intelligence (AI) in the intelligent control of WWTPs. In this study, we develop an automatic feature engineering framework based on variation sliding layer (VSL) to control the air demand precisely. Results demonstrated that using VSL in classic machine learning, deep learning, and ensemble learning could significantly improve the efficiency of intelligent aeration control in WWTPs. Bayesian regression and ensemble learning achieved the highest accuracy for predicting air demand. The developed VSL-ML models were also successfully implemented in a full-scale wastewater treatment plant, showing a 16.12 % reduction in air demand compared to conventional aeration control with preset dissolved oxygen (DO) and feedback to the blower. The VSL-ML models showed great potential for application in precision air demand prediction and control. The package, a third-party Python library called wwtpai, is freely accessible on GitHub and CSDN to remove technical barriers to the application of AI technology in WWTPs.

Graphical abstract

The surrounding water environment represents the wastewater treatment plant; the phrases (TN, TEMP, COD, Time, DO, MLSS, Flow rate and NH3-N) in the bubbles indicate that the commonly used indicators of the wastewater plant are used to predict the air demand of aeration at the top of the figure. The human brain represents the artificial intelligence methods; the neurons represent the 12 algorithms (GBDT, LSTM, ANN, HUBER, KNN, SVM, ROBUST, DT, LGBM, BAYES, RF and XGB). The aeration quantity can be predicted through the various algorithms, and the Python library (wwtpai), built on variation sliding layer (VSL) encapsulation, is used to optimize the prediction results.

Keywords

Machine learning
Wastewater treatment plants
Feature engineering
Intelligent control
Variation sliding layer


1. Introduction

Wastewater treatment is an energy-intensive process that consumes considerable electrical power and yields significant quantities of greenhouse gas (GHG), posing a formidable challenge to environmental sustainability (Almuhtaram et al., 2021; Yan et al., 2017). It has been estimated that in 2018, wastewater treatment plants (WWTPs) may have accounted for approximately 1–3 % of total electricity production in the United States and European countries (Sabia et al., 2020). Meanwhile, the electricity consumption of WWTPs in China exceeds 1.5 × 10^11 kW·h/year. The electrical demand of WWTPs will rise proportionally in the coming years as their capacity and treatment standards progressively increase, inevitably resulting in an appreciable increase in carbon emissions and electrical power consumption. GHG emissions and the climate change they drive are among the most pressing challenges of our time (Burke et al., 2015; Hallegatte and Rozenberg, 2017; Hubacek et al., 2017). In response, countries worldwide are adopting policies to reduce carbon emissions to varying extents. Nevertheless, the harsh reality remains that the growth of carbon dioxide and other GHG emissions, as well as the global warming caused by waste and wastewater treatment, continues unabated (Ramanathan et al., 2021), with power consumption consistently serving as the main source of carbon emissions (Howells et al., 2013; Liu et al., 2021). Enhancing the energy efficiency of WWTPs has therefore emerged as a critical strategy for energy conservation, resource preservation, and environmental protection.
Aeration is the crux of the biological treatment process in WWTPs, consistently representing a significant fraction of total power consumption. It is a complex system, riddled with slow and uncertain biological mechanisms coupled with diverse chemical transformations in wastewater treatment (Bourgin et al., 2018; Comninellis, 2006). Therefore, the accurate prediction and real-time, rapid regulation of the aeration amount is of paramount practical, scientific, and engineering importance (Wang et al., 2017, 2018). Traditional mechanistic models, such as activated sludge models, can provide relatively robust results that meet operational requirements and are the most commonly used models for prediction and control in WWTPs. However, they rely heavily on complex and hard-to-interpret model parameters, so their efficient application in WWTPs has been challenging given computational power limitations (Fenu et al., 2010). Moreover, mechanistic models require a greater variety of initial data types, which complicates the calibration of results. Worse still, traditional mechanistic models are known to have slow response times and cannot be regulated in real time (Larsen et al., 2016; Miller et al., 2018).
Fortunately, the rapid maturation of artificial intelligence (AI) technology has seen its role in scientific research burgeon, with applications spanning a broad array of disciplines (Haibe-Kains et al., 2020), including mathematics (Stump, 2021), physics (Hatfield et al., 2021), environmental science, biology, chemistry, and medical science (Li et al., 2018; Wallis, 2019). Increasingly, fusing AI with domain-specific expertise to address problems is a growing trend (Hosny et al., 2018). Implementing AI technology, such as machine learning (ML) models, can alleviate the limitations of traditional mechanistic models. Through their superior speed and accuracy, ML models can enhance the efficiency of environmental modeling, yielding quicker response times and better outcomes: machine learning models can respond within minutes, whereas traditional mechanistic models may take more than 10 h. By utilizing ML models, researchers can reduce the time and resources required for small-scale experiments and gather a more extensive range of parameters for their models. Generally, in the context of WWTPs, integrating AI technology with the control system offers an innovative approach to explore energy-saving solutions (Chen et al., 2020; Xia et al., 2022; Zhou et al., 2019). Even subtle modifications to the aeration operational strategy within WWTPs can significantly reduce energy consumption (Ahmad and Danish, 2018; Manaia et al., 2018).
The achievement of intelligent aeration control and energy conservation within WWTPs hinges significantly on the precise prediction of the aeration demand of biochemical processes. Guo et al. proposed a framework to predict the wastewater treatment process based on CNN and RNN (Guo et al., 2020). Sangeeta et al. compared advanced machine learning models to obtain the most accurate method for predicting aeration efficiency through the Parshall flume (Sangeeta et al., 2021). These studies demonstrate the feasibility of machine learning for aeration. The collective endeavors of government, academia, and the private sector have catalyzed the transformative impact of AI on the wastewater treatment industry (Krajewski et al., 2017).
Previous studies have mostly focused on technical process enhancements and on the construction and application of control models. Such an orientation has frequently neglected the importance of the validity of the input data for the intelligent control model, as well as the critical role of big data in propelling AI algorithms for air demand control. Fundamentally, the effectiveness of a model is profoundly influenced by the quality of the dataset it utilizes (Briscoe and Marin, 2020; Kim et al., 2022). A significant challenge lies in mining high-quality, relevant parameter information from WWTPs and lessening the workload of pre-processing feature engineering within the extensive wastewater dataset. Thus, pre-processing to mine in-depth information is essential to achieve intelligent control of the biochemical process (Sermet and Ibrahim Demir, 2017).
In light of this, this study introduces an automated feature engineering method for multi-source datasets in WWTPs, aimed at the intelligent prediction and control of air demand within the biochemical treatment process. The foundational steps and principles of this automated feature engineering method are introduced, and its feasibility for predicting air demand within the biochemical treatment process of WWTPs is demonstrated. Moreover, the efficiency of the air demand prediction model developed with this automated feature engineering method was validated through implementation in an actual full-scale WWTP, affirming its effectiveness in energy conservation and consumption reduction.

2. Materials and methods

2.1. Description of the dataset

The data used for the model development were obtained from a full-scale domestic WWTP located in Shenzhen, China. As shown in Fig. 1, it employs an anaerobic-anoxic-oxic (A2O) process system for efficient biological nutrient removal. The wastewater characteristics, including ammonium (NH3-N, mg/L), nitrate (NO3-N, mg/L), COD (mg/L), and DO (mg/L), were measured in real time by an autosampler (IQ Sensor Net system, WTW, Xylem Inc., Germany). In the modeling and computational process of this study, the sampling interval of these wastewater characteristics was set at 15 min. Moreover, other parameters, such as temperature (°C), wastewater flowrates (m3/h), and air demand (m3/h), were automatically collected by the SCADA system, also at 15-min intervals.

Fig. 1. Conceptual diagram of intelligent air demand prediction and control of wastewater treatment plant driven by variation sliding layer machine learning models.

During the period from May 1, 2022, to September 30, 2022, a comprehensive dataset comprising 14,688 groups from 153 days was collected for model development. The experimental dataset was partitioned into two subsets: a training set encompassing 14,208 samples (148 days) and a test set consisting of 480 samples (5 days). The training set was mainly used to train the parameters of the machine learning models relating inputs to outputs. Conversely, the test set was leveraged to validate the performance of the models. This study employed 16 characteristic indicators in the model development. Temperature indicators include wastewater influent (Tin) and effluent temperature (Teff). Wastewater flowrate indicators include influent flowrate (Fin), effluent flowrate (Feff), and aerobic tank outlet flowrate (Faer), which respectively represent the average flowrate at the inlet of the WWTP, the outlet of the WWTP, and the aeration tank outlet during the previous sampling interval (15 min). Influent wastewater quality indicators include influent COD (CODin), ammonia (ANin) and total nitrogen (TNin). Effluent wastewater quality indicators include effluent COD (CODeff), ammonia (ANeff), and total nitrogen (TNeff). Furthermore, parameters within the anaerobic tank featured ammonia concentration (ANaer), nitrate nitrogen concentration (NNaer) and MLSS (MLSSaer), together with dissolved oxygen in the anoxic tank (DOano) and dissolved oxygen of the effluent (DOeff).
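The chronological 148-day/5-day split described above can be sketched in pandas; the column name and random values here are placeholders, not the plant's data:

```python
import pandas as pd
import numpy as np

# Hypothetical 15-min interval dataset: 14,688 rows over 153 days,
# mirroring the collection period described in the text.
idx = pd.date_range("2022-05-01", periods=14688, freq="15min")
df = pd.DataFrame({"air_demand": np.random.rand(14688)}, index=idx)

# Chronological split: the last 5 days (480 samples) form the test set,
# the first 148 days (14,208 samples) form the training set.
train, test = df.iloc[:-480], df.iloc[-480:]
print(len(train), len(test))  # 14208 480
```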

2.2. Machine learning model

In order to substantiate the adaptability of the proposed methodologies, it was deemed necessary to select a diverse array of ML models for the investigation (Fu et al., 2022). Accordingly, this study meticulously curated twelve ML models to serve as the fundamental models (Fig. 2). This comprehensive selection aimed to ensure a robust and diverse examination of the adaptability across various algorithmic structures. These twelve fundamental models were categorized into three primary classifications.

Fig. 2. Schematic of the principles of different fundamental machine learning models. Six kinds of classical machine learning: (a) Linear regression in ROBUST and HUBER (b) SVM. (c) KNN. (d) BAYES. Two kinds of deep learning: (e) ANN. (f) LSTM. Four kinds of ensemble learning: (g) Boosting in LGBM, GBDT, XGB. The subtree consists of DT. (h) Bagging in RF.

Firstly, six classical machine learning models were employed: robust regression (M1_ROBUST), Huber regression (M2_HUBER), support vector machine (M3_SVM), k-nearest neighbors (M4_KNN), Bayesian regression (M5_BAYES) and decision tree (M6_DT). As illustrated in Fig. 2a–d, classical machine learning models are usually of low complexity and fast to train (Gnann et al., 2022; Haggerty et al., 2023; Huang et al., 2021). A more detailed explanation of classical machine learning models can be found in Supporting Information Text S1.
Secondly, two deep learning models were utilized, namely artificial neural network (M7_ANN) and long short-term memory (M8_LSTM). The artificial neural network (ANN), a multilayer feedforward neural network, is a typical deep learning model, as depicted in Fig. 2e. The neurons in each layer are directly connected to the neurons of the succeeding layer, and the weights are trained primarily via backpropagation (BP) (Harry and Braccini, 2021; Jia et al., 2023). The long short-term memory (LSTM) is a type of recurrent neural network (RNN) consisting of a cell and three functional gates: an input gate, an output gate, and a forget gate (Fig. 2f). LSTM is able to maintain long-term memory, which makes it very powerful in time series prediction (Chen et al., 2018). For a more detailed explanation of deep learning models, please refer to Supporting Information Text S2.
Lastly, four ensemble learning models were included: light gradient boosting machine (M9_LGBM), gradient boosting decision tree (M10_GBDT), extreme gradient boosting (M11_XGB) and random forest (M12_RF) (Supporting Information Text S3). Ensemble learning improves predictive performance through model fusion (Butler et al., 2018; Zhang et al., 2023). The light gradient boosting machine (LGBM) is an ensemble learning model that can be used for regression (Fig. 2g). The gradient boosting decision tree (GBDT) consists of three parts: regression decision trees, gradient boosting and shrinkage; the GBDT model updates the sample weights based on the error of the weak learner in the previous round and performs multiple rounds of iterations. The extreme gradient boosting (XGB) introduces a regularization term to control model complexity, making the final model less prone to overfitting (Li et al., 2022). The random forest (RF) is an ensemble learning method based on the bagging idea (Liu et al., 2022; Newhart et al., 2019): sub-datasets are drawn from the original dataset by sampling with replacement to train different base learners, and multiple decision trees are trained on these sub-datasets (Fig. 2h).
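As an illustration only, several of the fundamental models above can be instantiated through scikit-learn's shared estimator interface; the hyperparameters shown are library defaults, not the settings used in this study, RANSAC stands in for generic robust regression, and the LGBM, XGB and deep learning models (which live in their own libraries) are omitted:

```python
from sklearn.linear_model import RANSACRegressor, HuberRegressor, BayesianRidge
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Default-parameter stand-ins for eight of the twelve fundamental models.
models = {
    "M1_ROBUST": RANSACRegressor(),       # robust regression (assumed variant)
    "M2_HUBER": HuberRegressor(),
    "M3_SVM": SVR(),
    "M4_KNN": KNeighborsRegressor(),
    "M5_BAYES": BayesianRidge(),
    "M6_DT": DecisionTreeRegressor(),
    "M10_GBDT": GradientBoostingRegressor(),
    "M12_RF": RandomForestRegressor(),
}

# Every estimator shares the same interface:
#   model.fit(X_train, y_train); y_pred = model.predict(X_test)
```

The uniform fit/predict interface is what makes it cheap to benchmark twelve models side by side, as done in Section 3.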

2.3. Original feature engineering machine learning (Ori-ML) model

Feature engineering is a process that refines data to enhance its usefulness for ML models, extract more relevant information for prediction, or more accurately represent the original data. Practically, simplifying the model is especially crucial since less complex models are more manageable, more transparent and interpretable, making it easier to identify and address problems that may arise. Feature engineering encompasses feature construction, feature extraction and feature selection. The conventional approach to feature engineering necessitates manually creating features based on domain-specific knowledge.
In this study, the twelve original feature engineering machine learning models adopt the conventional approach to feature engineering. For this approach, all numerical data were converted to the float64 data type, and all non-numerical data were transformed into numerical data. Various padding methods for missing values were explored, including the value before the missing position, the value after it, a fixed value of 0, and the column average, with the most favourable results obtained using the value before the missing position. After filling in the missing values and examining the distribution of the numerical data, a few outliers were identified that could potentially affect the model's accuracy. To address this, the dataset was sorted 16 times according to the 16 columns of numerical features, from smallest to largest, and sample rows containing values that deviated from the column mean by more than 3σ were removed.
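A minimal pandas sketch of this preprocessing, with toy values standing in for the real 16-column dataset: missing values are filled with the value before the missing position (forward fill), and rows beyond 3σ are dropped:

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the WWTP dataset; one missing value and one
# gross outlier (2000 mg/L COD) are planted deliberately.
df = pd.DataFrame({
    "CODin": [180.0, 182.0, 185.0, 188.0, 190.0, 184.0,
              186.0, 183.0, 187.0, 189.0, 181.0, 2000.0],
    "Fin":   [1700.0, 1650.0, np.nan, 1680.0, 1690.0, 1670.0,
              1660.0, 1655.0, 1675.0, 1685.0, 1695.0, 1665.0],
})

# Fill each missing value with the value before the missing position.
df = df.ffill()

# Remove sample rows whose value in any column lies beyond 3 sigma;
# here the 2000 mg/L row is dropped (its z-score is about 3.2).
z = (df - df.mean()) / df.std()
df_clean = df[(z.abs() <= 3).all(axis=1)]
```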

2.4. Variation sliding layer machine learning (VSL-ML) model

Generally, for the datasets obtained from WWTPs, a notable challenge lies in their limited dimensional features and intricate internal correlations. Regrettably, this drawback hinders the optimal harnessing of big data techniques and machine learning models. In addition, traditional feature engineering is demanding, time-consuming, and error-prone. This study proposes an automatic feature engineering method based on variation sliding layer (VSL). And twelve variation sliding layer machine learning (VSL-ML) models were developed by integrating the VSL and fundamental ML models.
The VSL comprises four layers: an outlier discarding layer, a variation layer, a sliding layer and a feature discarding layer (Fig. 3). The pseudocode for VSL is presented in Table 1. The transition from the first to the second layer involves outlier removal. The design of this layer is informed by the common challenge of missing data in WWTPs: operational issues such as sensor blockages typically lead to numerous missing and abnormal values in wastewater datasets. This layer identifies outlier data using a range fraction (parameter a) and a maximum discard percentile (parameter b). Both parameters are adjustable, with default values of 10 % and 1.5 %, respectively. The transition from the second to the third layer employs a variation (removal rate) structure and a sliding average structure. For tasks related to WWTPs, variables typically include time, meteorological, water volume, and wastewater quality indicators; their feature engineering treatments are similar, producing constructed variables with analogous environmental significance. Feature expansion is realized by calculating average-value features over multiple data samples in the variation layer and sliding layer, and indicators for the removal performance of COD, AN, and TN are constructed. Finally, the transition from the third to the fourth layer discards redundant features: feature selection is performed on the constructed features based on variance importance, retaining only the top 'c' features (parameter c). The parameter 'c' is adjustable to suit different contexts.

Fig. 3. Schematic framework backbone and functionality of feature engineering based on variation sliding layer. The four layers of the framework are the outlier discarding layer, the variation layer, the sliding layer and the feature discarding layer.

Table 1. Pseudocode of the VSL-based framework algorithm.

Algorithm VSL pseudocode
Input: Dataset train, test; optional parameters (a, b, c)
Output: Dataset train, test
1: function OPT(opt_rate, df, threshold)
2:   ADindex ← list()
3:   thresholdnum ← threshold * len(df)
4:   for col in df.columns do
5:     diffmin ← min(col) + opt_rate * (max(col) − min(col))
6:     diffmax ← min(col) + (1 − opt_rate) * (max(col) − min(col))
7:     minlist ← all values in col that are less than diffmin
8:     maxlist ← all values in col that are greater than diffmax
9:     if len(minlist) <= thresholdnum then
10:      add all values from minlist to ADindex
11:    end if
12:    if len(maxlist) <= thresholdnum then
13:      add all values from maxlist to ADindex
14:    end if
15:  end for
16:  ADindex ← sorted(list(set(ADindex)))
17:  return ADindex
18: end function
19: function VSL(train, test)
20:   optindex ← OPT(opt_rate=a, df=train, threshold=b)
21:   train ← delete the rows in optindex from train
22:   data ← concat([train, test])
23:   data ← add hour, minute, weekday and quarter features to data
24:   for col in train.columns do
25:     add the change value of col between input and output to data
26:     add the rate of change of col between input and output to data
27:   end for
28:   for i in {1, 2, 3, 4, 5, 10, 15, 20, 30, 60, 120} do
29:     compute rolling features along the time series based on period i
30:   end for
31:   data ← select the top c % of features from data based on variance
32:   train, test ← data
*AD means air demand.
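A compact Python sketch of the four layers of Table 1 might look as follows. It is an illustrative reading of the pseudocode, not the wwtpai implementation: the defaults a=0.10 and b=0.015 follow the text, while c=50 is an assumed feature count (the pseudocode phrases c as a percentage), and the time features of line 23 are omitted for brevity:

```python
import pandas as pd

def vsl(train, test, a=0.10, b=0.015, c=50):
    """Sketch of the four VSL layers: outlier discard, variation,
    sliding averages, and variance-based feature discard."""
    # Layer 1: drop training rows inside the extreme a-fraction of a
    # column's range, but only if they number at most b*len(train).
    drop = set()
    for col in train.columns:
        rng = train[col].max() - train[col].min()
        lo = train[col].min() + a * rng
        hi = train[col].min() + (1 - a) * rng
        for mask in (train[col] < lo, train[col] > hi):
            idx = train.index[mask]
            if len(idx) <= b * len(train):
                drop.update(idx)
    train = train.drop(index=list(drop))

    data = pd.concat([train, test])
    # Layer 2 (variation): change and rate-of-change features per column.
    for col in list(data.columns):
        data[f"{col}_diff"] = data[col].diff()
        data[f"{col}_rate"] = data[col].pct_change()
    # Layer 3 (sliding): rolling means over the window lengths of line 28.
    for w in (1, 2, 3, 4, 5, 10, 15, 20, 30, 60, 120):
        for col in train.columns:
            data[f"{col}_roll{w}"] = data[col].rolling(w, min_periods=1).mean()
    # Layer 4: keep the top-c features by variance (crude on unscaled data).
    keep = data.var().sort_values(ascending=False).index[:c]
    data = data[keep].fillna(0)
    return data.iloc[:len(train)], data.iloc[len(train):]
```

Note that variance-based selection on unscaled features favours large-magnitude columns; a production version would normalize first.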

2.5. Modeling conditions and performance evaluation

The framework and extension library versions were set as follows: Python==3.9.7, Pandas==1.4.1, NumPy==1.21.5, Matplotlib==3.5.1, Seaborn==0.11.2, SciPy==2.6.2, Scikit-Learn==1.1.1, Torch==1.11.0, cuda==11.3, cudnn==8200, xgboost==1.6.1, lightgbm==3.3.2.
In this study, y_act is employed to represent the true value, y_pre the predicted value, and \bar{y} the average value. As shown in Eq. (1), three evaluation indicators, namely root mean square error (RMSE), mean absolute percentage error (MAPE), and R-square (R2), were employed to assess the loss and performance of the models. RMSE represents the deviation between the predicted value y_pre and the actual value y_act; MAPE reflects the average absolute percentage error of each predicted value; R2 is a statistic measuring the goodness of fit of the regression equation, reflecting the proportion of the variation in the dependent variable explained by the model.

$$\text{Loss1:}\quad \mathrm{RMSE}(y_{act},y_{pre})=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{act,i}-y_{pre,i}\right)^{2}}$$
$$\text{Loss2:}\quad \mathrm{MAPE}(y_{act},y_{pre})=\frac{1}{n}\sum_{i=1}^{n}\frac{\left|y_{act,i}-y_{pre,i}\right|}{\left|y_{act,i}\right|}$$
$$\text{Loss3:}\quad R^{2}(y_{act},y_{pre})=1-\frac{\sum_{i=1}^{n}\left(y_{act,i}-y_{pre,i}\right)^{2}}{\sum_{i=1}^{n}\left(y_{act,i}-\bar{y}\right)^{2}} \qquad \text{(Eq. 1)}$$
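The three loss functions of Eq. (1) translate directly into NumPy:

```python
import numpy as np

def rmse(y_act, y_pre):
    """Root mean square error: deviation between predicted and actual values."""
    y_act, y_pre = np.asarray(y_act), np.asarray(y_pre)
    return np.sqrt(np.mean((y_act - y_pre) ** 2))

def mape(y_act, y_pre):
    """Mean absolute percentage error of each predicted value."""
    y_act, y_pre = np.asarray(y_act), np.asarray(y_pre)
    return np.mean(np.abs(y_act - y_pre) / np.abs(y_act))

def r2(y_act, y_pre):
    """Proportion of variation in the dependent variable explained."""
    y_act, y_pre = np.asarray(y_act), np.asarray(y_pre)
    ss_res = np.sum((y_act - y_pre) ** 2)
    ss_tot = np.sum((y_act - np.mean(y_act)) ** 2)
    return 1 - ss_res / ss_tot
```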

2.6. Applicability tests

The applicability of the developed air demand control strategies was evaluated in the full-scale WWTP under two scenarios, i.e., preset fixed DO (Scenario 1) and real-time air demand prediction by the VSL-ML models (Scenario 2). Scenario 1 (October 1, 2022 to October 5, 2022) employs the conventional method, utilizing preset DO levels and blower feedback for aeration control. The DO control point in the WWTP is in the last gallery of the aerobic tank, with a control range of 2.2 mg/L to 2.6 mg/L; control is performed through manual feedback, with more aeration when DO is low and less when it is high.
Scenario 2 (October 11, 2022 to October 15, 2022) utilizes the VSL-ML models to predict the air demand of the WWTP in real time. The 16 on-line characteristic datasets, sampled at 15-min intervals, were fed directly into the VSL-ML models for automated processing, and the models leverage the procured real-time data to calculate the predicted air demand. The central control system then compares the predicted air demand with the current air demand. When |predicted air demand − current air demand| / predicted air demand × 100 % > 3 %, the central control system autonomously adjusts the position of the guide vanes of the blower's inlet and outlet, so that the air output aligns with the predicted air demand. Otherwise, when |predicted air demand − current air demand| / predicted air demand × 100 % ≤ 3 %, the system maintains the current air demand and performs no operation on the blower, protecting the blower over its life cycle. Moreover, a gateway (network protocol converter, Red Lion Controls Inc., USA) was integrated into the blower system, enabling seamless data exchange between the central control programmable logic controller (PLC) and the machine learning precision aeration system PLC through the plant's industrial loop network. The parameters of the central control PLC and the machine learning precision aeration system PLC are listed in Table S2.
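The 3 % deadband decision rule can be sketched as a single function; the function name and return labels are illustrative, not part of the plant's PLC logic:

```python
def blower_action(predicted, current, deadband=0.03):
    """Decide whether to adjust the blower guide vanes, following the
    3 % deadband rule: act only when the relative deviation between
    predicted and current air demand exceeds the deadband."""
    deviation = abs(predicted - current) / predicted
    return "adjust_guide_vanes" if deviation > deadband else "hold"

# A 16 % deviation triggers an adjustment; a 1 % deviation does not:
print(blower_action(10000, 8400))  # adjust_guide_vanes
print(blower_action(10000, 9900))  # hold
```

Holding within the deadband avoids constant small guide-vane movements, which is the life-cycle protection described above.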

3. Results and discussion

3.1. Ori-ML model development and performance

Generally, a significant proportion of indicators within the WWTP dataset demonstrate notable discreteness. Following normalization, each variable manifests substantial variance in its respective indicators, exhibiting non-uniform distribution trends (Fig. 4). For instance, Fin was observed to range from 734.6 to 3521.2 m3/h with an average of 1683.4 ± 523.2 m3/h. The influent parameters also showed drastic fluctuation: influent COD ranged from 87 to 396 mg/L (average 184.8 ± 55.3 mg/L) and TN from 17.4 to 69.6 mg/L (average 25.8 ± 6.0 mg/L).

Fig. 4. Probability density mountain range plots for the indicators. All indicator characteristics were normalized.

Based on the preprocessed data, twelve kinds of Ori-ML models were established as soft-sensor models for real-time air demand prediction. Preliminary results from these models, grounded in original feature engineering, suggested their potential to forecast air demand. Nonetheless, several models (such as M1_ROBUST_Ori, M2_HUBER_Ori and M12_RF_Ori) manifested significant oscillations in their prediction outcomes (Fig. 5, blue line). Among the three categories of Ori-ML models, ensemble learning demonstrated the best average performance, followed by deep learning and finally classical machine learning. Interestingly, it is worth noting that M5_BAYES_Ori, a classical machine learning model, achieved the best performance among the twelve models, with MAPE, R2, and RMSE of 9.27 %, 0.4383, and 1869.98, respectively (Fig. 6). Conversely, the remaining five classical machine learning models failed to accurately predict air demand, as evidenced in particular by negative R2 values, because aeration prediction is not a simple linear problem. M1_ROBUST_Ori, based on linear regression, had the highest MAPE error of 20.66 %, followed by M2_HUBER_Ori with a MAPE error of 18.01 %.

Fig. 5. Air demand prediction in the test set by machine learning models using different feature engineering methods: (a) ROBUST. (b) SVM. (c) DT. (d) KNN. (e) ANN. (f) LSTM. (g) HUBER. (h) BAYES. (i) LGBM. (j) GBDT. (k) XGB. (l) RF. ("_Raw" denotes the actual aeration measured in the WWTP; "_Ori" denotes the machine learning model using original feature engineering; "_VSL" denotes the machine learning model using the variation sliding layer.)


Fig. 6. Comparison of forecast indicators of different models using Ori-ML (Ori) and VSL-ML (VSL), with percentage of variation (PV): (a) MAPE. (b) RMSE. (c) R2.

In contrast, the two deep learning models performed acceptably but slightly worse. The predictions based on M7_ANN_Ori showed the better performance of the two, with MAPE, RMSE and R2 of 13.34 %, 2462.88 and 0.0255, respectively. However, M8_LSTM_Ori did not perform as well as expected, which may be because aeration has no prominent time series feature. Generally, the ensemble learning models (M9_LGBM_Ori, M10_GBDT_Ori, M11_XGB_Ori and M12_RF_Ori) demonstrated superior prediction effects, with MAPEs of 10.67 %, 11.33 %, 11.92 % and 18.61 %, respectively (Fig. 6). It is worth noting that, through model ensembling, the ensemble learning models achieve enhanced predictive power compared to individual machine learning models. Among these ensemble learning algorithms, M9_LGBM_Ori achieved the best aeration prediction, with MAPE, RMSE and R2 of 10.67 %, 2200.71 and 0.2219, respectively. These findings underscore the diversity of learning behavior across the different models on the training set: different types of models yield varied prediction outcomes. Fig. S2 portrays the learning behavior on the training set after applying original feature engineering. In general, using original feature engineering on the training set culminates in a relatively diminished learning effect, posing challenges to the extraction of underlying data patterns.

3.2. VSL-ML model development and performance

Figs. 5 and 6 compare the performance of the Ori-ML and VSL-ML models for real-time air demand prediction. In summary, the incorporation of VSL significantly enhanced the performance of all Ori-ML models in this study. In terms of MAPE, the percentage variation (PV) of reduction between the VSL-ML and Ori-ML models ranges from 7.57 % to 58.7 %, and the M12_RF_VSL model experienced the most significant performance improvement, with a MAPE reduction of 58.7 %. Regarding RMSE, the ANN model showed the smallest reduction rate of 2.2 %, decreasing slightly from 2462.89 (M7_ANN_Ori) to 2407.47 (M7_ANN_VSL). By contrast, the M12_RF_VSL model demonstrated the most substantial improvement over M12_RF_Ori, with RMSE falling from 3320.59 to 1567.36, a PV decrease of 52.8 %. Moreover, VSL proved to be a practicable method for raising R2 to positive values (Fig. 6c). The data processed through VSL feature engineering displayed a much smoother pattern than with the original feature engineering. These results demonstrate reductions in the MAPE and RMSE values for each VSL-ML model, alongside a substantial enhancement in R2.
Upon optimization, all six classical machine learning models raised their R2 to positive values. Echoing the results of the Ori-ML models, M5_BAYES_VSL displayed superior performance among the twelve VSL-ML models, with a MAPE of 6.27 %, R2 of up to 0.7028, and RMSE of merely 1360.01. The precision of the deep learning models also improved: the MAPE of M8_LSTM_VSL is only 11.41 %. The R2 values of all ensemble learning models surpassed 0.6, with M12_RF_VSL showing the most significant improvement among the twelve models, a near 60 % enhancement on all three indicators. Among the twelve models, M9_LGBM_VSL has the best performance, with an R2 of 0.73.

3.3. Optimization principle of the VSL feature engineering

The VSL feature engineering comprises four steps: an outlier discarding layer, a variation layer, a sliding layer and a feature discarding layer. Each layer enhances the model's ultimate predictive outcomes. To elucidate the mechanism of action of VSL, this study compares the feature alterations within the crucial layers, delving deeper into how VSL generates and deletes features.
The outlier discarding layer increases the reliability of data samples by mitigating the influence of noise across the dataset. It operates through two coupled variables: the range of discarded samples and the maximum discarded sample count. As the discard range widens, more samples are discarded; to preserve prediction accuracy, the maximum number of discarded samples must therefore be constrained. In the air demand prediction task, deploying this layer compressed the training set from 14,208 samples to 12,712, attenuating the noise within the dataset. Because the accuracy of real-time measurement devices and sensor clogging can significantly impact data collection and modeling efficacy, the discard mechanism of this layer effectively mitigates these problems.
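The two coupled variables of this layer can be sketched with pandas as follows. The sketch assumes a standard-deviation-based discard range and drops the worst offenders first when the cap is reached; the threshold logic and names are illustrative and may differ from the actual wwtpai implementation:

```python
import pandas as pd

def discard_outliers(df, column, n_sigma=3.0, max_discard=1500):
    """Drop at most `max_discard` samples whose value in `column`
    lies outside mean ± n_sigma * std. The two parameters mimic the
    layer's coupled 'discard range' and 'maximum discarded count'."""
    mu, sigma = df[column].mean(), df[column].std()
    deviation = (df[column] - mu).abs()
    outliers = deviation[deviation > n_sigma * sigma]
    # Respect the cap: discard the most extreme samples first.
    to_drop = outliers.sort_values(ascending=False).head(max_discard).index
    return df.drop(index=to_drop)
```

Widening `n_sigma` discards fewer samples, while `max_discard` bounds the total loss of data, matching the trade-off described above.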
The variation layer utilizes domain-specific environmental knowledge to generate difference and removal-rate features. Proficiency in environmental science is essential for using this layer effectively: the user defines difference features for variables such as temperature and flow, and both difference and removal-rate features for substances such as COD, TN and AN. This layer enriches the initial data with environmental context, expanding the feature set from 18 dimensions to 32 dimensions (Table S3).
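For a substance with paired influent and effluent measurements, the difference and removal-rate features can be sketched as below (the `*_in`/`*_eff` column names are illustrative placeholders, not the library's documented schema):

```python
import pandas as pd

def add_variation_features(df, substances=("COD", "TN", "AN")):
    """Append a difference column and a removal-rate column for each
    substance, computed from influent (`*_in`) and effluent (`*_eff`)
    measurements, as the variation layer does for pollutant features."""
    out = df.copy()
    for s in substances:
        inf, eff = out[f"{s}_in"], out[f"{s}_eff"]
        out[f"{s}_diff"] = inf - eff             # removed load (difference value)
        out[f"{s}_removal"] = (inf - eff) / inf  # removal rate
    return out
```

Variables such as temperature and flow would receive only the difference feature, since a removal rate is not physically meaningful for them.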
The sliding layer computes average-value features over every 1, 2, 3, 4, 5, 10, 15, 20, 30, 60 and 120 data samples, expanding the 32 dimensions to 208. Averaging historical data effectively minimizes the accumulation of minor or inaccurate errors in machine learning model predictions. The LSTM results validate the efficacy of this layer: because LSTM already accounts for the impact of past results on the present, its prediction accuracy is relatively high to begin with, and its performance metrics improve only slightly under VSL feature engineering compared with the other models. For instance, VSL feature engineering improved the MAPE of LSTM only from 12.40 % to 11.41 %, whereas other classes of models improved more, such as LGBM from 10.67 % to 6.45 %.
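The trailing averages over the window sizes listed above can be sketched with pandas rolling means (an illustrative sketch; the exact wwtpai implementation may differ):

```python
import pandas as pd

WINDOWS = [1, 2, 3, 4, 5, 10, 15, 20, 30, 60, 120]

def add_sliding_features(df, columns, windows=WINDOWS):
    """For each selected feature, append its trailing mean over each
    window size, so every column gains len(windows) averaged variants."""
    out = df.copy()
    for col in columns:
        for w in windows:
            out[f"{col}_mean{w}"] = out[col].rolling(window=w, min_periods=1).mean()
    return out
```

`min_periods=1` lets the first rows receive a partial-window average instead of NaN, so no samples are lost at the start of the series.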
Lastly, the feature discarding layer compresses the 208-dimensional features into 166 dimensions, identifying and eliminating over-expanded features that the model might perceive as noise. This layer optimizes the network's weights and biases and improves the accuracy of the results obtained from each model. The impact of feature discarding can be observed by comparing the heat maps of the original features and the VSL features, as shown in Fig. 7a and b: certain features are removed, and the distribution between features changes.
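One common way to cull over-expanded features, consistent with the correlation heat maps of Fig. 7, is a pairwise-correlation filter. The sketch below illustrates the general idea only; the study does not specify its exact discard criterion:

```python
import numpy as np
import pandas as pd

def drop_correlated_features(df, threshold=0.95):
    """Drop one feature from every pair whose absolute Pearson
    correlation exceeds `threshold`, keeping the earlier column."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is seen once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop)
```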

Fig. 7. Heat maps showing the degree of correlation between the original dimensional characteristics of WWTP aeration. (a) Feature correlation using the original feature engineering; the feature dimension here is 16. (b) Feature correlation using the VSL-ML models; after automated construction and culling, the raw 16-dimensional features are reduced to 13. (c) Top six features in the global interpretation of distinguishing feature values. (d) Supervised clustering of the five-day test set samples.

3.4. Model interpretability

In this study, a feature interpretation method based on SHAP (SHapley Additive exPlanations) was introduced to evaluate the models for practical deployment in WWTPs (Futagami et al., 2021; Lundberg and Lee, 2017). The SHAP values of the machine learning models are illustrated in Fig. S3, while the global interpretation of distinguishing feature values is depicted in Fig. 7c. The vertical axis of Fig. 7c ranks the features by the aggregate of SHAP values across all samples, while the horizontal axis depicts the distribution of the influence of individual-sample SHAP values on the model output. Each point represents a single sample; samples are stacked vertically, with redder hues indicating larger feature values. The six features exerting the most significant impact are Tin, ANin, DOeff, TNin, DOano and CODin. The global importance of ANin, ranked second, is 86.65 % of that of Tin, whereas that of DOeff, ranked third, is just 51.50 % of Tin. The findings suggest that increased levels of ANin, DOeff, TNin, DOano and CODin have a positive influence on the system output, whereas higher levels of Tin affect it negatively.
Fig. 7d employs supervised clustering along with a heat map to visualize the underlying substructure of the test dataset, displaying the feature distribution of each of the 480 samples across the five-day test period. Viewed along the vertical alignment of all samples, the color blocks of the initial samples are distinctly red: the sum of their SHAP values (denoted f(x)) exceeds the mean, implying that they are categorized as high-quality samples. This outcome suggests that VSL remains relatively stable during the prediction phase, successfully avoiding severe overfitting.
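SHAP assigns each feature the average of its marginal contributions over all feature orderings. For a small additive model the exact Shapley value reduces to the feature's own contribution, which the brute-force computation below verifies (a toy illustration of the underlying principle, not the study's SHAP pipeline, which uses the shap library on trained models):

```python
from itertools import permutations

def shapley_values(model, baseline, x):
    """Exact Shapley values by averaging each feature's marginal
    contribution over all orderings; feasible only for a few features."""
    n = len(x)
    phi = [0.0] * n
    perms = list(permutations(range(n)))
    for order in perms:
        current = list(baseline)  # start from the baseline sample
        for i in order:
            before = model(current)
            current[i] = x[i]     # reveal feature i
            phi[i] += model(current) - before
    return [p / len(perms) for p in phi]

# Toy additive model: f(v) = 2*v0 + 3*v1 + v2.
model = lambda v: 2 * v[0] + 3 * v[1] + v[2]
phi = shapley_values(model, baseline=[0, 0, 0], x=[1, 1, 1])
```

By construction the values sum to f(x) − f(baseline), the additivity property that makes SHAP rankings such as Fig. 7c internally consistent.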

3.5. VSL-ML model applicability evaluation

To facilitate practical application in intelligent aeration processes within real WWTPs, the VSL-ML models have been encapsulated into a Python library named wwtpai. The library comprises three distinct components, corresponding to outlier removal, feature construction and feature deletion, respectively. It is released under the General Public License (GPL) open-source protocol, and its code is available on open-source platforms (Supporting Information). WWTP practitioners can install the library through the pip package manager with the command pip install wwtpai, then specify the locations of the training and test sets together with a list of environmental characteristics to be constructed. Upon completion, the optimized training and test sets are generated automatically and can be used directly for model training, leading to a substantial enhancement in model performance.
The precision of aeration control directly impacts the power utilization efficiency of a WWTP. The applicability test of the VSL-ML models over a span of five days demonstrated that they can significantly enhance the energy utilization efficiency of the full-scale WWTP, thereby reducing economic costs. Implementing VSL-ML models for air demand control can mitigate the impact of influent COD and TN fluctuations on A2O process operations, ensuring that effluent COD and TN remain relatively stable, which results in significant air demand savings for the biochemical treatment process. In Scenario 2, the WWTP's aeration strategy employed the VSL-ML models to predict air demand in real time, and the pollutant removal performance was comparable to Scenario 1, achieving a COD removal rate of 95.01 ± 1.16 % and a TN removal rate of 74.11 ± 3.27 % (Fig. 8a & c). Significantly, deploying the VSL-ML models considerably reduced the air demand. Compared with Scenario 1, the oxygen demand per unit of COD removal decreased from 1.08 kg-O2/kg-COD to 0.91 kg-O2/kg-COD, a decrease of 15.74 % (Fig. 8b), and the oxygen demand per unit of TN removal decreased from 5.49 kg-O2/kg-TN to 4.88 kg-O2/kg-TN, a reduction of 11.11 % (Fig. 8d). The Supporting Information reports the air demand values of aeration: the average decreased from 15,425.21 m3/h to 12,938.66 m3/h, a decrease of 16.12 %.
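The reported savings are simple relative reductions and can be checked directly against the values in the text (the helper name below is illustrative):

```python
def reduction_pct(before, after):
    """Relative decrease, in percent."""
    return (before - after) / before * 100

cod_oxygen = reduction_pct(1.08, 0.91)          # kg-O2/kg-COD, ≈ 15.74 %
tn_oxygen = reduction_pct(5.49, 4.88)           # kg-O2/kg-TN, ≈ 11.11 %
air_demand = reduction_pct(15425.21, 12938.66)  # m3/h, ≈ 16.12 %
```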

Fig. 8. Performance of the VSL-ML models for air demand prediction and control in the applicability tests. (a) COD and (c) TN removal performance of the full-scale WWTP. (b) COD and (d) TN unit oxygen demand of the full-scale WWTP.

The applicability test results demonstrate that ML models can predict the air demand of a WWTP, with the VSL-ML models significantly enhancing prediction accuracy. The wide array of readily available wastewater quality indicators facilitates the quick establishment of ML models for practical applications. In practical large-scale applications, the VSL-ML model is proficient at filtering out high-frequency noise characterized by significant fluctuations in the original data, leading to smoother data and improved prediction accuracy. Moreover, although DOeff is an indirect indicator of process performance, suggesting a non-oxygen-limited aeration process, it alone cannot predict energy consumption. ML models, however, are capable of uncovering the historical aeration information embedded in the effluent DO, enabling them to correct historical aeration demand through negative feedback adjustment. The results also demonstrated that when DOeff is removed, the accuracy metrics of the machine learning models decrease, irrespective of whether Ori-ML or VSL-ML models are used (Fig. S3).

3.6. Perspectives

This study introduces wwtpai, a novel automatic machine learning framework that integrates VSL feature engineering with machine learning models to enable precise air demand prediction in WWTPs. The framework demonstrated superior accuracy and flexibility in predicting and controlling aeration. Several limitations, however, should be recognized for subsequent research. For instance, while Python is the most widely used programming language in machine learning, the principles of VSL have so far been incorporated only into a third-party library for Python. Moreover, this study concentrated primarily on predicting air demand, suggesting that further verification of VSL on other prediction tasks in WWTPs is needed. Looking forward, additional hybrid algorithms are anticipated to be developed by integrating various machine learning models, aiming at higher accuracy or expedited model construction. Given that the accuracy of neural networks is closely tied to the volume of input data, it is also paramount to explore techniques that minimize data usage without compromising model performance. Additionally, the efficacy of VSL in other environmental modeling endeavors will be further validated, and model fusion techniques will be employed to optimize prediction capabilities. Leveraging machine learning to address environmental challenges is a significant advancement, and it is logical to continue investigating the applications of AI in this field. Creating open-source environmental libraries for various tasks could give environmental practitioners worldwide convenient and rapid access to these resources, thereby contributing to the development of more sustainable solutions.

4. Conclusion

This study devises an automatic feature engineering approach based on VSL for intelligent air demand prediction and control in WWTPs. The VSL method enhances the performance of fundamental machine learning models: the results reveal that metrics including RMSE, MAPE and R2 improved for all machine learning models when applying VSL, compared with the original feature engineering. Moreover, data processed by VSL exhibit lower dispersion and fewer outliers. Among the tested ML models, the Bayesian model exhibited the most favorable MAPE, while the LGBM model achieved the best RMSE and R2. Simultaneously, the effluent concentrations of COD and TN remained within a consistently stable range, indicating the potential of VSL in facilitating intelligent control of WWTPs. The VSL-ML models have been encapsulated into an automated Python library named wwtpai, which can be applied effortlessly, significantly reducing modeling time for practitioners within WWTPs. Merging VSL with machine learning models, this hybrid modeling approach therefore offers a precise strategy for achieving accurate aeration control in WWTPs.

CRediT authorship contribution statement

Yu-Qi Wang: Conceptualization, Methodology, Writing – review & editing, Data curation, Visualization. Hong-Cheng Wang: Supervision, Resources, Conceptualization, Writing – review & editing. Yun-Peng Song: Writing – review & editing. Shi-Qing Zhou: Data curation, Writing – review & editing. Qiu-Ning Li: Writing – review & editing. Bin Liang: Writing – review & editing. Wen-Zong Liu: Writing – review & editing. Yi-Wei Zhao: Writing – review & editing. Ai-Jie Wang: Methodology, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

We gratefully acknowledge the financial support by open project of State Key Laboratory of Urban Water Resources and Environment (Grant No 2022TS30), the National Natural Science Foundation of China (No. 52293445, 52321005), and the characteristic innovation project of Guangdong Province Department of Education (No. 2022KTSCX215).

Appendix. Supplementary materials


Data availability

  • Data will be made available on request.
