Elsevier

Energy Conversion and Management

Volume 301, 1 February 2024, 118076

Faults detection and diagnosis of PV systems based on machine learning approach using random forest classifier

https://doi.org/10.1016/j.enconman.2024.118076

Highlights

  • A novel usage of the translation technique to correct measured I-V curves to STC.
  • Robust PV modeling based on MGWO and the translation technique to extract ODM parameters.
  • The RFC-based fault detection and diagnosis method reached 99.4% accuracy.
  • The proposed method outperformed SVM, KNN, MLP Classifier, DT, and SGDC ML models.

Abstract

Accurate and reliable fault detection procedures are crucial for optimizing photovoltaic (PV) system performance. Establishing a trustworthy PV array model is the primary step and a vital tool for monitoring and diagnosing PV systems. This paper outlines a two-step approach for creating a reliable PV array model and implementing a fault detection procedure using Random Forest Classifiers (RFCs).
Firstly, we extracted the five unknown parameters of the one-diode model (ODM) by combining the current–voltage translation method to predict the reference curve and employing the modified grey wolf optimization (MGWO) algorithm. In the second step, we simulated the PV array to obtain maximum power point (MPP) coordinates and construct operational databases through co-simulations in PSIM/MATLAB. We developed two RFCs: one for fault detection (a binary classifier) and another for fault diagnosis (a multiclass classifier).
Our results confirmed the accuracy of the PV array modeling approach. We achieved a root mean square error (RMSE) value of 0.0122 for the ODM parameter extraction and RMSEs lower than 0.3 in dynamic PV array output current simulations under cloudy conditions. Regarding the fault detection procedure, our results demonstrate exceptional classification accuracy rates of 99.4% for both fault detection and diagnosis, surpassing other tested models like Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Neural Networks (MLP Classifier), Decision Trees (DT), and Stochastic Gradient Descent (SGDC).

Keywords

MGWO algorithm
Random forest classifier
Parameter extraction
Dynamic MPP model
Fault detection
Co-simulation


1. Introduction

In recent years, worldwide energy policies have focused on reducing carbon footprints and moving towards more sustainable energy sources. This is reflected by an increased adoption of renewable energy sources (RES) to ensure a greener future. Among RES, solar photovoltaics (PV) is identified as a key energy source, addressing environmental concerns at very competitive costs [1]. As a result, global PV capacity surpassed the terawatt threshold in early 2022, accounting for two-thirds of the projected increase in global renewable capacity by 2023 [2].
PV systems are designed to operate under harsh external conditions, including extreme weather situations, wind-induced vibrations, and exposure to ultraviolet radiation [3], [4]. In these demanding environments, various malfunctions and failures may occur, potentially shortening the lifespan of PV modules, reducing the overall system’s energy yields, compromising system availability, and posing safety risks to personnel involved in their operation and maintenance [5]. Hence, the early detection and diagnosis of faults is paramount for ensuring the long-term reliability and sustainable operation of the entire PV system.
Multiple methods for detecting and diagnosing faults in PV systems have emerged over the last decade. Model-based procedures simulate the performance of the actual PV installation and compare the simulated output power with the monitored one [6], [7]. Chouder and Silvestre introduced a fault detection methodology for PV systems based on power loss analysis, categorizing identified faults into a faulty string, faulty module, and partial shading through detailed analysis of simulated and measured output ratios [8]. Silvestre et al. presented an automated procedure for fault detection in grid-connected PV systems centered on current and voltage indicators [9]. The method sets thresholds based on typical operational behavior, triggers a fault signal when they are exceeded, and identifies faults by analyzing current and voltage ratios. Drews et al. set a power residual threshold using weather satellite data for irradiance and temperature instead of on-site sensors [10]. While this obviates the need for additional on-site sensors, it may compromise accuracy because satellite-derived weather data can carry larger margins of error. In contrast, Garoudja et al.'s method sets a threshold on the exponentially weighted moving average of current, voltage, and power residuals, integrating historical data for fault detection rather than relying solely on the most recent observation [11]. Overall, the fault detection techniques mentioned above are straightforward to implement. Nevertheless, their reliability hinges on selecting suitable thresholds, which remains the primary challenge.
In recent years, diverse artificial intelligence (AI) techniques encompassing Machine Learning (ML) and Deep Learning (DL) have been incorporated as the core methodologies of PV fault detection and diagnosis due to their excellent capabilities in addressing feature extraction and classification problems. Several ML techniques were developed for fault detection and diagnosis in PV systems [12], [13], [14], [15]. Among these techniques, artificial neural network (ANN), support vector machine (SVM), and Random Forest (RF) are the most common approaches. Bendary et al. proposed two adaptive neuro-fuzzy inference system-based controllers (ANFIS) to address cleaning, tracking, and faulty issues in PV systems [16]. The method is based on associating the actual measured values of current and voltage with respect to the trained historical values for this parameter while considering the ambient changes in conditions, including irradiation and temperature. Madeti and Singh introduced an algorithm based on k-nearest neighbors (KNN) for real-time fault detection in PV systems, capable of detecting and classifying open circuit, line-to-line, and partial shading faults [17]. However, it's important to note that the method's accuracy is not flawless compared to its computational efficiency. Eskandari et al. proposed an ensemble learning method that combines three algorithms—Support Vector Machine (SVM), Naïve Bayes (NB), and KNN [18]. The selected classifiers exhibited impressive performance with an accuracy rate of up to 99.5 %. Nonetheless, it's worth noting that this method was specifically developed for detecting line-to-line faults. Similarly, Kapucu et al. explored an ensemble learning approach that integrates quadratic discriminant analysis (QDA), extra trees with entropy (ETent), and decision trees (DT) [19]. 
Their investigation focused on identifying two PV faults, partial shading and short circuit; the method achieved an initial accuracy of 97.46 %, which increased to 97.67 % after optimization. Likewise, Adhya et al. utilized a diagnostic approach comprising the light gradient boosting machine (LGBM), categorical boosting (CatBoost), and extreme gradient boosting (XGBoost) to identify faults in PV systems [20]. This combination of diverse ML algorithms resulted in an accuracy of approximately 99 %. However, despite these promising outcomes, the approach remains intricate and calls for further refinement. Akram et al. proposed a monitoring method for the DC side of PV arrays employing the Probabilistic Neural Network (PNN) [21]. Their approach reached a classification accuracy of 98.53 %, but it was tested only for detecting and classifying short- and open-circuit faults. Chen et al. utilized an RF to classify partial shading, degradation, open circuit, and short circuit faults, employing only high-frequency current and voltage measurements in parallel circuit substrings [22]. This work showed good results, although it was based on a limited range of operating weather conditions. Likewise, Gong et al. utilized the classification and regression tree to address photovoltaic array fault diagnosis [23]. The method is based on I-V curves generated under specific working conditions, and the obtained classification accuracy was 97.9 %. Mellit et al. developed an embedded system for remote monitoring and fault diagnosis of PV systems based on two conventional neural network models [24]. The first ANN is used for fault detection, while the second deals with fault diagnosis. Both ML algorithms showed good accuracy when embedded into a low-cost edge device for real-time diagnosis of a PV array.
On the other hand, the emergence of DL algorithms represents a transformative leap in machine learning, gaining considerable attention for their prowess in pattern recognition, data mining, and knowledge discovery. A notable contribution in this domain comes from Gao et al. [25], where they introduce a DL approach that integrates a stacked autoencoder (SAE) with a multi-grained cascade forest for diagnosing PV faults – associated with partial shading, open circuit, and short circuit faults – without needing weather data or I-V curves as inputs. In this approach, the SAE extracts the fault features automatically from normalized sequence waveforms of string current and voltage, while the multi-grained cascade forest is responsible for diagnosis. In parallel endeavors, Liu et al. introduced a fault diagnosis method for a PV array utilizing SAE and clustering [26]. This approach mines inherent I-V characteristics, enabling automatic feature extraction and fault diagnosis. In addition, Chen et al. presented an innovative deep residual network (ResNet) for intelligent fault detection and diagnosis. Leveraging output, I-V characteristic curves, and input ambient condition data, this novel approach adds depth to fault analysis [27].
Despite the effectiveness of AI-based fault detection and diagnosis procedures, their accuracy is limited by the data used in the training stage. Actual measurement data alone are not enough to train AI models, so building a trustworthy training database requires an accurate model of the PV system. Efficient models are essential to fully replicate PV system operation under various faults and outdoor conditions. Furthermore, data processing is another crucial aspect to consider when deploying ML procedures for PV diagnosis. Wang et al. underscored its importance in increasing the accuracy of ML-based algorithms for categorizing complex faults, with accuracies in the range of 81 % to 99 % [28].
The contributions of the present work include a robust PV model that serves as the foundation for monitoring and fault detection techniques, whether AI-based or conventional model-based. Additionally, the work introduces a fault detection procedure based on Random Forest Classifiers, with hyperparameters tuned through a grid search. The adopted methodology unfolds in two steps:
  • In the first step, we focus on accurately identifying the unknown parameters of the One-Diode Model (ODM) of the PV array operating under outdoor conditions. This is achieved through a novel application of the translation technique, which corrects randomly measured current–voltage (I-V) curves to reference standard test conditions (STC). The translation technique employs analytical formulations to derive these parameters across various operating conditions, accounting for variations in irradiance and temperature [29]. To determine the five unknown parameters of the ODM at STC, we utilize the Modified Grey Wolf Optimization (MGWO) algorithm, a variant of the Grey Wolf Optimizer originally introduced by Mirjalili et al. in 2014 [30]. The position updating concept of the MGWO algorithm enables more efficient search and exploitation while maintaining rapid convergence. Subsequently, based on the identified parameters, we derived and simulated the evolution of the maximum power point (MPP) model using actual dynamic measurements from a grid-connected PV system in Algeria.
  • The second step involves simulating the PV array to extract MPP coordinates and construct its operational databases through PSIM/MATLAB co-simulations. Additionally, we implement a fault detection and diagnosis process based on the Random Forest Classifier (RFC). This entails the development of two RFCs: the first for fault detection (a binary classifier) and the second for fault diagnosis (a multiclass classifier). Finally, we compare our approach with other machine-learning techniques for detecting and diagnosing faults in the considered grid-connected PV system. The testing phase encompasses five operating scenarios: a healthy system, three short-circuited modules in one string, a line-to-line fault, a string disconnected from the array, and shading on three panels.
The remainder of this paper is organized as follows: Section 2 comprehensively describes the PV system utilized to validate the proposed methods. Section 3 explores the novel approach to PV modeling and parameter extraction. Section 4 is dedicated to explaining the developed fault detection approach. Section 5 presents the results obtained, accompanied by in-depth discussions to elucidate the methodology's performance and effectiveness. Finally, Section 6 summarizes the conclusions drawn from this study.

2. Experimental setup description

To rigorously assess the accuracy of the proposed fault detection methodology and the new procedure for extracting the PV model’s unknown parameters, monitored data from a grid-connected PV system were used. The proposed PV system is located in Algiers, Algeria, at coordinates 36°43′N latitude and 3°15′E longitude. This PV installation boasts a total capacity of 9.54 kW, organized into three sub-arrays, each with a capacity of 3.18 kW. Each sub-array comprises 30 Isofoton 106–12 panels arranged in two parallel strings of 15 modules in series. These PV modules are connected to a 2.5 kW single-phase inverter (IG30 Fronius).
The PV plant's tilted and horizontal irradiance levels are monitored using a Kipp & Zonen CM11 thermoelectric pyranometer. Additionally, temperature measurements of the PV modules are conducted using K-type thermocouples. Meteorological and electrical variables are systematically recorded using a data logger (Agilent 34970). The data, comprising weather parameters (solar irradiance G and module temperature T) and PV output parameters at the maximum power point (Impp, Vmpp, Pmpp), were collected at a sampling interval of 1 min.
The main specifications of the selected PV array used in this work are listed in Table 1, while further details of the whole PV installation can be found in [31].

Table 1. Main specifications of the selected PV array.

Parameter                   Description
Module technology           Mono-crystalline (mc-Si)
PV array nominal power      3.18 kWp
Inverter type and size      IG30 Fronius single-phase, 2.5 kW
Modules per inverter        30
Modules in series (Ns)      15
Strings in parallel (Np)    2
Tilt - Azimuth              35° - 10° West

Table 2 summarizes the key electrical parameters for the Isofoton 106–12 PV module under Standard Test Conditions (STC), characterized by a temperature of 25 °C and an irradiance level of 1000 W/m².

Table 2. Electrical characteristics of the considered PV module.

Parameter        Value
Pmp (W)          106
ISC (A)          6.54
VOC (V)          21.6
Imp (A)          6.10
Vmp (V)          17.4
βVOC (%/°C)      -0.36
αISC (%/°C)      0.06
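
As a quick consistency check, the module ratings in Table 2 can be scaled to one sub-array using the layout in Table 1 (15 modules in series, 2 strings in parallel). The figures below assume ideal series/parallel scaling with identical modules and no wiring or mismatch losses, which is a simplification introduced here and not a statement from the measurements:

```latex
\begin{aligned}
P_{\text{sub-array}} &= N_s \, N_p \, P_{mp} = 15 \times 2 \times 106\ \text{W} = 3180\ \text{W} = 3.18\ \text{kWp} \\
P_{\text{plant}} &= 3 \times 3.18\ \text{kWp} = 9.54\ \text{kWp} \\
V_{OC,\text{sub-array}} &\approx N_s \, V_{OC} = 15 \times 21.6\ \text{V} = 324\ \text{V} \\
I_{SC,\text{sub-array}} &\approx N_p \, I_{SC} = 2 \times 6.54\ \text{A} = 13.08\ \text{A}
\end{aligned}
```

The first two results match the nominal sub-array and plant capacities quoted in this section.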

3. Developed approach for PV modeling

The basis for our photovoltaic (PV) modeling approach is the widely adopted one-diode, five-parameter solar cell model [32]. This model is a popular choice in PV module modeling for various technologies, encompassing crystalline and thin-film designs. It is favored for striking a balance between model complexity and predictive accuracy. The solar cell I–V characteristic is described by the implicit and nonlinear expression given in Eq. (1).

$$I = I_{ph} - I_o\left[\exp\left(\frac{q\,(V + R_s I)}{n k T}\right) - 1\right] - \frac{V + R_s I}{R_{sh}} \tag{1}$$

where Io is the diode saturation current (A), Iph is the photocurrent (A), n is the diode ideality factor, k is the Boltzmann constant ($1.38 \times 10^{-23}$ J K$^{-1}$), T is the cell temperature (K), and q is the electron charge ($1.602 \times 10^{-19}$ C). The thermal voltage Vt (V) is expressed as Vt = kT/q. Finally, Rsh and Rs represent the shunt and series resistances (Ω). For an in-depth description of this model, including the equations needed to extend it from a single solar cell to an entire PV array, see [33].
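
Because Eq. (1) is implicit in I, evaluating the model at a given voltage requires a numerical root search. The following is a minimal sketch, assuming a single cell and Newton's method; the function name `cell_current` and the parameter values are hypothetical illustrations, not values from this work.

```python
import math

K_BOLTZMANN = 1.38e-23   # J/K, value quoted in the text
Q_ELECTRON = 1.602e-19   # C, value quoted in the text


def cell_current(v, i_ph, i_o, n, r_s, r_sh, t_cell=298.15, tol=1e-12, max_iter=100):
    """Solve Eq. (1) for the current I at terminal voltage v (single cell)."""
    v_t = K_BOLTZMANN * t_cell / Q_ELECTRON  # thermal voltage Vt = kT/q
    i = i_ph  # initial guess: the photocurrent
    for _ in range(max_iter):
        e = math.exp((v + r_s * i) / (n * v_t))
        # Eq. (1) rearranged as g(I) = 0, and its derivative dg/dI
        g = i_ph - i_o * (e - 1.0) - (v + r_s * i) / r_sh - i
        dg = -i_o * e * r_s / (n * v_t) - r_s / r_sh - 1.0
        step = g / dg
        i -= step
        if abs(step) < tol:
            break
    return i


# Hypothetical cell parameters (assumed for illustration only)
params = dict(i_ph=6.5, i_o=1e-9, n=1.3, r_s=0.005, r_sh=100.0)
i_sc = cell_current(0.0, **params)   # current at short circuit
i_06 = cell_current(0.6, **params)   # current at 0.6 V
```

Starting the iteration at the photocurrent keeps the Newton steps on one side of the root for non-negative voltages, since the rearranged function is concave and decreasing in I. Other root finders would serve equally well.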
The five parameters, namely Iph, Io, n, Rsh, and Rs, are typically not explicitly provided by PV module manufacturers. Previous investigation revealed that the extracted actual values of these parameters often deviate from calculated ones based on nominal data provided in the datasheet specified at the STC [34]. Consequently, achieving a precise alignment between the PV model outputs defined by Eq. (1) and real-world monitored data is essential for accurate simulation and fault detection. Therefore, the necessity of using an effective parameter identification procedure is crucial.

3.1. Current-voltage translation to reference conditions

The parameters of PV cells are notably influenced by weather conditions, making it inaccurate to assume their constancy. Additionally, the mathematical expressions employed in these models depend on access to reference parameters. However, replicating the standard test conditions proves challenging under typical outdoor conditions. To address this challenge, we present an efficient translation method inspired by a technique introduced in [29] initially employed for analyzing the degradation of amorphous silicon-based modules. This method transforms three arbitrary I-V curves, each obtained under varying temperature and irradiance conditions, into a reference curve.
It is important to note that many translation methods in literature often necessitate prior knowledge of additional parameters. In contrast, our innovative approach requires no prior information about temperature coefficients or internal parameters. It solely relies on data obtained from three measured I-V curves (Curves 1,2 and 3) defined as:
  • Curve 1: (V1[i], I1[i]), where i = 1, …, n1, measured at an irradiance G1 and a cell temperature T1
  • Curve 2: (V2[j], I2[j]), where j = 1, …, n2, measured at an irradiance G2 and a cell temperature T2
  • Curve 3: (V3[k], I3[k]), where k = 1, …, n3, measured at an irradiance G3 and a cell temperature T3
The proposed methodology derives a new Curve 0, denoted by (V0[i], I0[i]), that corresponds to the desired conditions G0 and T0 at standard test conditions (STC). To achieve this, an intermediate curve, Curve 4, is introduced at operating conditions G4 and T4. First, Curve 4 is interpolated from Curve 1 and Curve 2. Subsequently, Curve 3 and Curve 4 are used to obtain the target Curve 0. The interpolation is first carried out in the irradiance/temperature plane, as illustrated in Fig. 1, and then in the voltage/current space, using identical parameters as elaborated below.

Fig. 1. The operating conditions of curves 1, 2, and 3 are interpolated to obtain the operating conditions of Curves 4 and 0.

The values of G4 and T4 are established as combinations of G1 and G2, and of T1 and T2, respectively, as given in Eqs. (2) and (3), where the parameter α is to be determined. Additionally, as shown in Eqs. (4) and (5), the desired irradiance G0 and temperature T0 are obtained from G3 and G4, and from T3 and T4, respectively, through a second unknown parameter β. This yields a system of four equations in four unknowns (G4, T4, α, and β).

$$G_4 = G_1 + \alpha\,(G_2 - G_1) \tag{2}$$

$$T_4 = T_1 + \alpha\,(T_2 - T_1) \tag{3}$$

$$G_0 = G_3 + \beta\,(G_4 - G_3) \tag{4}$$

$$T_0 = T_3 + \beta\,(T_4 - T_3) \tag{5}$$
The system is simplified by introducing a translation parameter ω, defined as the product of the second interpolation parameter β and α. Substituting G4 and T4 from Eqs. (2) and (3) into Eqs. (4) and (5) gives Eqs. (6) and (7), a linear system in β and ω that can be solved directly.

$$G_0 - G_3 = (G_1 - G_3)\,\beta + (G_2 - G_1)\,\omega \tag{6}$$

$$T_0 - T_3 = (T_1 - T_3)\,\beta + (T_2 - T_1)\,\omega \tag{7}$$
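
Eqs. (6) and (7) form a 2×2 linear system, so the interpolation weights follow from Cramer's rule. The sketch below is illustrative: the function name `interpolation_weights`, the symbol β for the second weight, and the three operating points are assumptions made for this example.

```python
def interpolation_weights(g, t, g0=1000.0, t0=25.0):
    """Solve Eqs. (6)-(7) for beta and omega, then recover alpha = omega / beta.

    g, t: irradiances (G1, G2, G3) and temperatures (T1, T2, T3) of the
    three measured curves; g0, t0: target conditions (STC by default).
    """
    g1, g2, g3 = g
    t1, t2, t3 = t
    a11, a12, b1 = g1 - g3, g2 - g1, g0 - g3   # coefficients of Eq. (6)
    a21, a22, b2 = t1 - t3, t2 - t1, t0 - t3   # coefficients of Eq. (7)
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("the three operating points are collinear in the (G, T) plane")
    beta = (b1 * a22 - a12 * b2) / det
    omega = (a11 * b2 - a21 * b1) / det
    # If beta is zero, Curve 0 coincides with Curve 3 and alpha is irrelevant
    alpha = omega / beta if beta != 0.0 else 0.0
    return alpha, beta, omega


# Hypothetical operating points (W/m^2 and deg C), assumed for illustration
G = (800.0, 600.0, 900.0)
T = (45.0, 40.0, 55.0)
alpha, beta, omega = interpolation_weights(G, T)

# Rebuild the intermediate and target conditions from Eqs. (2)-(5)
g4 = G[0] + alpha * (G[1] - G[0])
t4 = T[0] + alpha * (T[1] - T[0])
g0 = G[2] + beta * (g4 - G[2])
t0 = T[2] + beta * (t4 - T[2])
```

The weights are not restricted to the interval [0, 1]: when STC lies outside the triangle spanned by the three measured conditions, as in this example, the procedure extrapolates rather than interpolates.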
The next step constructs the I-V curves. Let ISC1 and ISC2 be the short-circuit currents of Curve 1 and Curve 2, respectively. For each point (V1[i], I1[i]) of Curve 1, its partner (V2[j], I2[j]) is sought in Curve 2 such that I2[j] − I1[i] = ISC2 − ISC1. A new point (V4[i], I4[i]) of Curve 4 is then obtained by applying Eqs. (8) and (9). In the same manner, for each point (V3[i], I3[i]) of Curve 3, the best matching point (V4[j], I4[j]) of Curve 4 is selected such that I4[j] − I3[i] = ISC4 − ISC3, and the point (V0[i], I0[i]) of Curve 0 is produced from Eqs. (10) and (11), using the same interpolation parameters as in Eqs. (2) to (5).

$$V_4[i] = V_1[i] + \alpha\,(V_2[j] - V_1[i]) \tag{8}$$

$$I_4[i] = I_1[i] + \alpha\,(I_2[j] - I_1[i]) \tag{9}$$

$$V_0[i] = V_3[i] + \beta\,(V_4[j] - V_3[i]) \tag{10}$$

$$I_0[i] = I_3[i] + \beta\,(I_4[j] - I_3[i]) \tag{11}$$
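
The point-pairing and interpolation steps can be sketched as follows. This is a simplified illustration, not the authors' implementation: curves are assumed to be lists of (V, I) samples, the short-circuit current is approximated by the sample closest to V = 0, and the partner point is taken as the nearest sample rather than interpolated between samples. All function names are hypothetical.

```python
def short_circuit_current(curve):
    """Current of the sample closest to V = 0 (assumed proxy for Isc)."""
    return min(curve, key=lambda p: abs(p[0]))[1]


def interpolate_curve(curve_a, curve_b, weight):
    """Return curve_a + weight * (partner in curve_b - curve_a), point by point."""
    shift = short_circuit_current(curve_b) - short_circuit_current(curve_a)
    result = []
    for v_a, i_a in curve_a:
        # Partner: the sample of curve_b whose current best satisfies
        # I_b - I_a = Isc_b - Isc_a
        v_b, i_b = min(curve_b, key=lambda p: abs(p[1] - (i_a + shift)))
        result.append((v_a + weight * (v_b - v_a), i_a + weight * (i_b - i_a)))
    return result


def translate_to_reference(curve_1, curve_2, curve_3, alpha, beta):
    """Build Curve 4 from Curves 1 and 2, then Curve 0 from Curves 3 and 4."""
    curve_4 = interpolate_curve(curve_1, curve_2, alpha)   # Eqs. (8) and (9)
    return interpolate_curve(curve_3, curve_4, beta)       # Eqs. (10) and (11)


# Synthetic curves (assumed): curve_b is curve_a shifted by +0.5 V and +1 A
curve_a = [(0.0, 5.0), (10.0, 4.8), (15.0, 4.0), (20.0, 0.0)]
curve_b = [(0.5, 6.0), (10.5, 5.8), (15.5, 5.0), (20.5, 1.0)]
halfway = interpolate_curve(curve_a, curve_b, 0.5)
```

With densely sampled measured curves the nearest-sample pairing is usually adequate; for sparse curves, interpolating the partner point along the curve would reduce the pairing error.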

3.2. Parameter extraction based on modified grey wolf optimization

In this section, we introduce an offline optimization method for parameter identification. The reason for opting for an optimization approach is that the characteristic equation, as defined in Eq. (1), is implicit, which makes direct parameter identification challenging. The parameter identification process can be cast as an optimization problem, which we tackle using the Modified Grey Wolf Optimization (MGWO) algorithm. The method adjusts the unknown parameters until the implicit characteristic equation agrees with the measurement-derived data. In this approach, we quantify the disparity between the outputs derived from Eq. (1) and the data obtained from the current–voltage translation methodology described above (Section 3.1). We employ the root mean square error (RMSE) as the criterion to measure this difference. For each set of experimental values (I, V), the RMSE is computed according to the following formulas:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} f(V_i, I_i, x)^2} \tag{12}$$

$$f(V, I, x) = I - \left\{ I_{ph} - I_o\left[\exp\left(\frac{q\,(V + R_s I)}{n k T}\right) - 1\right] - \frac{V + R_s I}{R_{sh}} \right\} \tag{13}$$

where x = (Iph,ref, Io,ref, Rsh,ref, Rs,ref, nref) and N represents the number of data points.
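
The objective in Eqs. (12) and (13) can be written compactly as a function of the candidate parameter vector. The sketch below assumes cell-level (V, I) samples and a fixed cell temperature; the function names and the numerical values are hypothetical.

```python
import math

K_BOLTZMANN = 1.38e-23   # J/K, value quoted in the text
Q_ELECTRON = 1.602e-19   # C, value quoted in the text


def residual(v, i, x, t_cell=298.15):
    """Eq. (13): measured current minus the right-hand side of Eq. (1)."""
    i_ph, i_o, r_sh, r_s, n = x   # same ordering as the vector x in the text
    v_t = K_BOLTZMANN * t_cell / Q_ELECTRON
    model = i_ph - i_o * (math.exp((v + r_s * i) / (n * v_t)) - 1.0) - (v + r_s * i) / r_sh
    return i - model


def rmse(points, x, t_cell=298.15):
    """Eq. (12): root mean square of the residuals over all (V, I) samples."""
    return math.sqrt(sum(residual(v, i, x, t_cell) ** 2 for v, i in points) / len(points))


# Hypothetical candidate parameters (Iph, Io, Rsh, Rs, n), for illustration only
x_candidate = (6.5, 1e-9, 100.0, 0.0, 1.3)
error_exact = rmse([(0.0, 6.5)], x_candidate)   # sample lies on the model curve
error_off = rmse([(0.0, 6.0)], x_candidate)     # sample 0.5 A below the model
```

Because the measured current appears on both sides of the residual, no implicit solve is needed inside the objective, which keeps each fitness evaluation cheap for a population-based optimizer.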
The classical Grey Wolf Optimizer (GWO) algorithm was introduced in 2014 by Mirjalili et al. [30]. Its mathematical model of the pack's social behavior is:

$$\vec{D} = \left|\vec{C}\cdot\vec{X}_p(t) - \vec{X}(t)\right| \tag{14}$$

$$\vec{X}(t+1) = \vec{X}_p(t) - \vec{A}\cdot\vec{D} \tag{15}$$

where t is the current iteration, Xp is the position vector of the prey, X is the position vector of a grey wolf, and A and C are coefficient vectors, calculated as follows:

$$\vec{A} = 2\,\vec{a}\cdot\vec{r}_1 - \vec{a}, \qquad \vec{C} = 2\,\vec{r}_2 \tag{16}$$

where the components of a decrease linearly from 2 to 0 over the course of the iterations and r1, r2 are random vectors with components in [0, 1]. The equations for the position update are shown below.

$$\vec{D}_\alpha = \left|\vec{C}_1\cdot\vec{X}_\alpha - \vec{X}\right|, \quad \vec{D}_\beta = \left|\vec{C}_2\cdot\vec{X}_\beta - \vec{X}\right|, \quad \vec{D}_\delta = \left|\vec{C}_3\cdot\vec{X}_\delta - \vec{X}\right| \tag{17}$$

$$\vec{X}_1 = \vec{X}_\alpha - \vec{A}_1\cdot\vec{D}_\alpha, \quad \vec{X}_2 = \vec{X}_\beta - \vec{A}_2\cdot\vec{D}_\beta, \quad \vec{X}_3 = \vec{X}_\delta - \vec{A}_3\cdot\vec{D}_\delta \tag{18}$$
Each wolf in the pack updates its position following the positions X1, X2, and X3, which stand for the top three solutions thus far in the iteration process.

$$\vec{X}(t+1) = \frac{\vec{X}_1 + \vec{X}_2 + \vec{X}_3}{3} \tag{19}$$
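
To make the update rules in Eqs. (16) to (19) concrete, here is a minimal sketch of the classical GWO loop (not the modified variant), applied to a simple test function. The function name `gwo_minimize`, the population size, the iteration count, the bound clipping, and the sphere test function are all assumptions chosen for illustration.

```python
import random


def gwo_minimize(objective, bounds, n_wolves=20, n_iter=200, seed=0):
    """Classical GWO sketch: minimize `objective` inside box `bounds`."""
    rng = random.Random(seed)
    wolves = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_wolves)]
    leaders = []  # best three positions found so far: alpha, beta, delta
    for t in range(n_iter):
        leaders = sorted(leaders + wolves, key=objective)[:3]
        a = 2.0 * (1.0 - t / n_iter)  # decreases linearly from 2 towards 0
        updated = []
        for x in wolves:
            new_x = []
            for d, (lo, hi) in enumerate(bounds):
                total = 0.0
                for leader in leaders:
                    coef_a = 2.0 * a * rng.random() - a      # Eq. (16)
                    coef_c = 2.0 * rng.random()              # Eq. (16)
                    dist = abs(coef_c * leader[d] - x[d])    # Eq. (17)
                    total += leader[d] - coef_a * dist       # Eq. (18)
                # Eq. (19): average of X1, X2, X3, clipped to the search box
                new_x.append(min(max(total / 3.0, lo), hi))
            updated.append(new_x)
        wolves = updated
    return min(leaders + wolves, key=objective)


def sphere(x):
    """Simple convex test function with its minimum at the origin."""
    return sum(c * c for c in x)


best = gwo_minimize(sphere, [(-5.0, 5.0)] * 3)
```

For parameter extraction, `objective` would be the RMSE of Eq. (12) and `bounds` would hold the search range of each of the five ODM parameters.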
This article introduces an adaptable method that leverages the GWO algorithm with a minor modification in the selection phase. Fig. 2 outlines the steps of the proposed Modified GWO (MGWO) technique, which closely follows a method employed in a prior study [35]. The technique determines the alpha, beta, and delta members by evaluating the fitness function at each individual position, that is, at each candidate set of the five unknown parameters. The other agents adjust their positions accordingly.

Fig. 2. MGWO algorithm flowchart.

A novel position-updating step is integrated into GWO, enhancing both exploration and exploitation while ensuring rapid convergence. The concept draws inspiration from the competitive exclusion method found in genetic algorithms [36]. In this approach, a search agent's (wolf's) position from the previous iteration is replaced only if its position in the current iteration exhibits higher fitness. Only the top positions are then considered in the final phase, when the new alpha, beta, and delta members are selected. The search agent positions are updated based on these selections, and the process repeats until the maximum number of iterations is reached [37]. With this additional phase, the MGWO can search for optimal results without relying on the additional tuning parameters that conventional methods require.
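The greedy replacement described above can be sketched as follows. This is one reading of the selection phase, not the authors' code: "higher fitness" is taken to mean a lower cost (for example, a lower RMSE), and the function names are hypothetical.

```python
import numpy as np


def greedy_replace(old_pos, old_cost, new_pos, new_cost):
    """Keep each wolf's new position only where it lowers the cost."""
    improved = new_cost < old_cost
    pos = np.where(improved[:, None], new_pos, old_pos)
    cost = np.where(improved, new_cost, old_cost)
    return pos, cost


def select_leaders(pos, cost):
    """Return the three best positions (alpha, beta, delta), best first."""
    best = np.argsort(cost)[:3]
    return pos[best]
```

Because a wolf never accepts a worse position, the best cost in the pack cannot increase from one iteration to the next, which is consistent with the rapid convergence claimed for the modified algorithm.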

3.3. Prediction of the PV outputs under actual outdoor conditions

Using fully analytical formulas and the reference parameters obtained through the MGWO algorithm, the next step is to determine the values of the parameters under real operating conditions. Eqs. (20) to (25) give the analytical expressions for the five parameters as functions of temperature and irradiance [38], [39], [40].

$$n(T) = n_{ref}\,\frac{T}{T_{ref}} \tag{20}$$

$$I_{ph}(G, T) = \frac{G}{G_{ref}}\left[ I_{ph,ref} + \alpha\,(T - T_{ref}) \right] \tag{21}$$

$$R_{sh}(G) = R_{sh,ref}\,\frac{G_{ref}}{G} \tag{22}$$

$$E_g = E_{g,ref}\left[ 1 - 0.0002677\,(T - T_{ref}) \right] \tag{23}$$

$$R_s(G, T) = R_{s,ref}\,\frac{T}{T_{ref}}\left[ 1 - \beta \ln\frac{G}{G_{ref}} \right] \tag{24}$$

$$I_o(G, T) = I_{o,ref}\left(\frac{T}{T_{ref}}\right)^{3} \exp\left[ \frac{q}{n K_B}\left( \frac{E_{g,ref}}{T_{ref}} - \frac{E_g}{T} \right) \right] \tag{25}$$
$E_g$ is the semiconductor's band gap energy, and $E_{g,ref}$ is the band gap energy at reference conditions. $I_{ph}$, $I_o$, n, $R_s$, and $R_{sh}$ are the five parameters at actual operating conditions, whereas $I_{ph,ref}$, $I_{o,ref}$, $n_{ref}$, $R_{s,ref}$, and $R_{sh,ref}$ are the five parameters at the reference conditions, obtained by applying the extraction method.
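Eqs. (20) to (25) translate directly into code. The sketch below assumes temperatures in kelvin, the band gap expressed in eV, reference conditions equal to STC (1000 W/m², 298.15 K), and that the ideality factor in Eq. (25) is the temperature-corrected value from Eq. (20). The function name and dictionary keys are hypothetical, and the coefficients `alpha` and `beta` must be supplied by the user.

```python
import math

Q = 1.602176634e-19   # elementary charge (C)
K_B = 1.380649e-23    # Boltzmann constant (J/K)


def translate_parameters(ref, g, t, alpha, beta, eg_ref, g_ref=1000.0, t_ref=298.15):
    """Translate the five reference ODM parameters to irradiance g and temperature t.

    ref: dict with keys "i_ph", "i_o", "r_sh", "r_s", "n" (reference values).
    """
    n = ref["n"] * t / t_ref                                                   # Eq. (20)
    i_ph = (g / g_ref) * (ref["i_ph"] + alpha * (t - t_ref))                   # Eq. (21)
    r_sh = ref["r_sh"] * g_ref / g                                             # Eq. (22)
    e_g = eg_ref * (1.0 - 0.0002677 * (t - t_ref))                             # Eq. (23)
    r_s = ref["r_s"] * (t / t_ref) * (1.0 - beta * math.log(g / g_ref))        # Eq. (24)
    # Eq. (25); assumption: n here is the translated ideality factor n(T).
    exponent = Q / (n * K_B) * (eg_ref / t_ref - e_g / t)
    i_o = ref["i_o"] * (t / t_ref) ** 3 * math.exp(exponent)
    return {"i_ph": i_ph, "i_o": i_o, "r_sh": r_sh, "r_s": r_s, "n": n, "e_g": e_g}
```

At G = G_ref and T = T_ref every correction factor equals one, so the function returns the reference parameters unchanged, which is a convenient sanity check.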

4. Faults detection and diagnosis strategy

Operating a PV system while certain types of faults are present can compromise its security, cause catastrophic damage, and create safety risks. The primary objective of this work is to establish a robust and reliable fault detection procedure using Random Forest Classifiers to detect anomalies within a PV system and pinpoint their root causes. To accomplish this, it is imperative to build a high-quality database that clearly delineates the characteristics of each fault class. A reliable simulation model that accurately represents the behavior of the PV system in both its healthy and faulty states is therefore the best way to build such a database. Fig. 3 provides a flowchart of the steps used to develop the proposed strategy.

Fig. 3. Flowchart of the proposed fault detection and diagnosis strategy.

The validated PV system model described in the preceding section serves as the foundation for constructing databases that capture the performance of the PV system under actual outdoor conditions. The model is used to produce datasets covering optimal (healthy) operation and intentionally simulated faults, driven by daily solar irradiance and module temperature profiles. To achieve this, the physical model of the grid-connected PV system under consideration was implemented in the PSIM™ software platform. The values of the unknown parameters extracted under reference conditions were then incorporated into the physical PV array model.
In this work, the simulated healthy/faulty scenarios, encompassing the most prevalent issues encountered in grid-connected PV systems, are described below and depicted in Fig. 4.
  • a) A healthy system: The PV system operates without any anomalies.
  • b) Three short-circuited modules: One string of the PV system operates with fewer PV panels.
  • c) Open-circuit fault: One string within the PV system becomes non-functional.
  • d) Line-to-line fault: A short circuit occurs between two PV strings.
  • e) Three PV modules shaded: This scenario replicates the partial shading that PV systems experience for a specific duration due to factors such as cloud movement or nearby objects.

Fig. 4. Failure types considered in the proposed methodology (#1 partial shading, #2 open-circuit fault, #3 short-circuit fault, and #4 line-to-line fault).

The resulting databases contain five key attributes - Irradiance, Temperature, and the output Current, Voltage, and Power at Maximum Power Point (MPP) - extracted from each simulated operational scenario. An illustration of the simulated faults and their impact on the output power of the grid-connected PV system, based on clear-sky weather data for a typical day, is presented in Fig. 5.

Fig. 5. DC output power of the grid-connected PV system within various fault scenarios.

The concluding phase of the proposed fault detection strategy involves the deployment of two Random Forest Classifiers (RFCs). The first RFC is dedicated to identifying anomalies within the PV system, while the second RFC is responsible for diagnosing the specific faults that have been detected.

4.1. Random forest classifier

The Random Forest (RF) algorithm is a supervised machine-learning technique widely employed for classification and regression tasks. It operates on the principle of ensemble learning: multiple decision trees are trained on different subsets of the input data, and their outputs are combined to enhance predictive accuracy. In general, the predictive performance of a Random Forest improves as the number of trees increases, although the gains diminish beyond a certain ensemble size.
The RF model used in this study inherits the speed and high accuracy of the decision tree algorithm. Its construction involves two key steps: first, selecting a sampling method to create a data subset, and second, constructing a decision tree, as illustrated in Fig. 6. Four hyperparameters must be considered: the minimum number of samples for leaf nodes, the minimum number of samples for internal node splitting, the maximum number of features selected at each split, and the maximum depth of the decision tree. Because the RFC's performance also depends on other hyperparameters, such as the splitting criterion, this study optimizes their combination for enhanced results [41].

Fig. 6. The general structure of the deployed RF model.

After the RF model is established, the test set samples are fed into it. Each decision tree classifies every sample, and a voting mechanism combines the results: the class that receives the most votes across all trees is assigned to the sample [42]. A grid search is used to optimize the RF hyperparameters, as illustrated in Fig. 7. This approach identifies the hyperparameter combination that gives the RF model the best classification performance. First, the Cartesian product of the value sets of the hyperparameters generates the hyperparameter configuration space (the box on the left side of Fig. 7), which contains all candidate combinations. The grid search algorithm then trains a model for each combination in the configuration space. As shown in the box on the right side of Fig. 7, the combination with the lowest validation-set error is selected as the optimal set of hyperparameters [43].
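The grid search principle described above can be sketched explicitly with scikit-learn. The synthetic dataset, the value sets in `grid`, and the single hold-out validation split are assumptions chosen to keep the example small; they are not the settings used in this study.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the five-feature (T, G, Impp, Vmpp, Pmpp) dataset.
X, y = make_classification(n_samples=400, n_features=5, n_informative=4,
                           n_redundant=0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Value set of each hyperparameter (illustrative values only).
grid = {
    "n_estimators": [10, 30],
    "max_depth": [3, None],
    "criterion": ["gini", "entropy"],
}

# Configuration space = Cartesian product of the value sets.
names = list(grid)
configs = [dict(zip(names, values)) for values in product(*grid.values())]

# Train one model per configuration and record its validation error.
results = []
for cfg in configs:
    model = RandomForestClassifier(random_state=0, **cfg).fit(X_train, y_train)
    results.append((1.0 - model.score(X_val, y_val), cfg))

# The configuration with the lowest validation error is retained.
best_error, best_cfg = min(results, key=lambda item: item[0])
```

In practice, scikit-learn's `GridSearchCV` wraps the same loop and replaces the single validation split with cross-validation.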

Fig. 7. Principle of the grid search algorithm.

As mentioned earlier, this study incorporates two RFCs, each with a specific role. The first classifier is designed to identify any indications of faults within the PV system, while the second classifier's task is to pinpoint the particular type of fault.
The diagnosis model aims to provide output that categorizes the specific fault cases (fault #1, fault #2, fault #3, fault #4), as visualized in Fig. 4. The detection model has five input parameters (T, G, Impp, Vmpp, and Pmpp), a data processing module, the Random Forest/Grid search optimization method application, and two outputs (healthy state, faulty state). This setup enables the models to effectively detect and diagnose faults within the PV system, offering a comprehensive view of the specific fault conditions.
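The two-classifier arrangement can be sketched as a simple cascade in which the diagnosis model is only consulted for samples that the detection model flags as faulty. This is an illustrative reading of the setup, not the authors' code: the function names are hypothetical, default hyperparameters are used, and labels are encoded as 0 for healthy and 1 to 4 for the fault types, which differs from the class labels in Table 3.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def fit_cascade(X, y_fault, random_state=0):
    """Train a binary detector and a multiclass diagnoser.

    y_fault: 0 for healthy samples, a positive fault id otherwise.
    """
    faulty = y_fault != 0
    detector = RandomForestClassifier(random_state=random_state)
    detector.fit(X, faulty.astype(int))
    diagnoser = RandomForestClassifier(random_state=random_state)
    diagnoser.fit(X[faulty], y_fault[faulty])   # trained on faulty samples only
    return detector, diagnoser


def predict_cascade(detector, diagnoser, X):
    """Return 0 for healthy samples and the diagnosed fault id otherwise."""
    out = np.zeros(len(X), dtype=int)
    flagged = detector.predict(X).astype(bool)
    if flagged.any():
        out[flagged] = diagnoser.predict(X[flagged])
    return out
```

Each row of `X` would hold the five inputs (T, G, Impp, Vmpp, Pmpp) after preprocessing.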

4.2. Data preparation for learning and testing stages

Preprocessing the raw data is essential for achieving high classification accuracy. In this context, the 'sklearn' library offers a suite of functions for handling missing values, and the 'isnull' function (from pandas) allows us to identify them so that they can be addressed. To reveal relationships within the data, Pearson's correlation coefficient is employed. This metric ranges from −1 (a perfect negative correlation) to +1 (a perfect positive correlation) and quantifies the strength of linear relationships. Notably, it captures only linear association and therefore does not reflect every form of dependence between variables [44]. To allow a meaningful comparison across the attributes of the dataset, we normalized the data by standardization: values are centered on the mean and scaled to unit standard deviation. Additionally, we assigned appropriate class labels to the recorded data, producing well-defined data samples. The defined classes and their corresponding fault types are given in Table 3.
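The preprocessing steps above can be sketched with pandas and scikit-learn. The records below are hypothetical values chosen for illustration, and dropping incomplete rows is only one possible way of addressing missing values; the text does not state which strategy was used.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

FEATURES = ["T", "G", "Impp", "Vmpp", "Pmpp"]

# Hypothetical MPP records; the numbers are illustrative only.
raw = pd.DataFrame(
    {
        "T": [25.0, 31.2, 40.5, 36.8, 28.3],
        "G": [310.0, 520.0, 905.0, 760.0, 415.0],
        "Impp": [1.9, 3.3, np.nan, 4.9, 2.6],
        "Vmpp": [262.0, 258.5, 249.0, 252.3, 260.1],
    }
)
raw["Pmpp"] = raw["Impp"] * raw["Vmpp"]

# 1) Identify missing values, then drop incomplete rows (assumed strategy).
missing_per_column = raw.isnull().sum()
clean = raw.dropna().reset_index(drop=True)

# 2) Pearson correlation between attributes.
pearson = clean[FEATURES].corr(method="pearson")

# 3) Standardize: zero mean and unit standard deviation per attribute.
scaler = StandardScaler()
scaled = scaler.fit_transform(clean[FEATURES])
```

The fitted `scaler` should be reused unchanged on the test set so that both sets share the same mean and standard deviation.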

Table 3. Defined classes and their corresponding fault type.

| Phase | Class | Corresponding fault type |
|---|---|---|
| Detection | 0 | Healthy |
| | 1 | Faulty |
| Diagnosis | 0 | #2: Open-circuit fault |
| | 1 | #1: Partial shading |
| | 3 | #3: Short-circuit fault |
| | 9 | #4: Line-to-line fault |

To train and evaluate the two RFCs, we used a dataset of monitored data from 60 selected days throughout the year, covering all seasonal variations. 75 % of the data samples were randomly chosen for training, and the remaining 25 % were held out as an independent set of unseen data to assess performance in each scenario. After preprocessing, the dataset comprised 242,890 samples for detection and 194,312 samples for diagnosis (Table 4). As explained in the previous section, the classifiers were fed with learning and testing datasets comprising five key attributes (T, G, Impp, Vmpp, and Pmpp). These attributes serve as input features, and the outputs are the estimated class labels for each data point. The constructed detection and diagnosis databases are detailed in Table 4.

Table 4. Details of the detection and diagnosis database construction.

| Phase | Class | Test data set (25 %) | Train data set (75 %) | Total |
|---|---|---|---|---|
| Detection | 0 | 12,145 | 36,433 | 242,890 |
| | 1 | 48,578 | 145,734 | |
| Diagnosis | 0 | 12,145 | 36,433 | 194,312 |
| | 1 | 12,144 | 36,434 | |
| | 3 | 12,145 | 36,433 | |
| | 9 | 12,144 | 36,434 | |

A performance evaluation of the classifiers was conducted, employing the confusion matrix as a critical assessment tool to gauge their effectiveness. The confusion matrix provides insights into the accuracy of the classifier's predictions and reveals areas where it made errors. In this matrix, the rows represent the actual labels, and the columns depict the predicted labels. The diagonal values indicate how often the predicted label aligns with the actual label, demonstrating correct predictions. Values in the remaining cells reflect instances where the classifier incorrectly assigned labels to observations, with columns indicating what the classifier predicted and rows showing the actual correct labels.
To comprehensively evaluate the proposed system, we use accuracy, Precision, Recall, and F1score, as expressed in the following equations [45]. These metrics assess the models' overall performance and reliability.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{26}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{27}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{28}$$

$$\mathrm{F1score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{29}$$

where TP (true positives) is the number of samples correctly classified into class “x”. FN (false negatives) is the number of samples that belong to class “x” but were assigned to another class by the classifier. TN (true negatives) is the number of samples that do not belong to class “x” and were correctly assigned to a different class. FP (false positives) is the number of samples that the classifier labeled as class “x” although they do not belong to it.

For better interpretability in the multi-class case, we also compute the macro and weighted averages of Precision, Recall, and F1score. The macro average (Macro avg) is the unweighted mean over classes, so it penalizes the model when performance on minority classes is poor. The weighted average (Weighted avg) weights each class by its number of true instances to account for class imbalance and consequently favors the majority class.
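The metrics above can be computed directly from a confusion matrix whose rows are actual labels and whose columns are predicted labels. The following is a minimal NumPy sketch; the function names are hypothetical, and for more than two classes the accuracy is taken as the fraction of samples on the diagonal.

```python
import numpy as np


def _safe_div(num, den):
    """Element-wise division that returns 0 where the denominator is 0."""
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    return np.divide(num, den, out=np.zeros_like(num), where=den != 0)


def classification_metrics(cm):
    """Per-class and averaged metrics from a confusion matrix (rows = actual)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp          # predicted as the class, but actually another
    fn = cm.sum(axis=1) - tp          # actually the class, but predicted as another
    support = cm.sum(axis=1)          # number of true instances per class
    precision = _safe_div(tp, tp + fp)
    recall = _safe_div(tp, tp + fn)
    f1 = _safe_div(2 * precision * recall, precision + recall)
    per_class = {"precision": precision, "recall": recall, "f1": f1}
    macro = {name: float(vals.mean()) for name, vals in per_class.items()}
    weighted = {name: float((vals * support).sum() / support.sum())
                for name, vals in per_class.items()}
    return {
        "accuracy": float(tp.sum() / cm.sum()),
        "per_class": per_class,
        "macro": macro,
        "weighted": weighted,
    }
```

With balanced classes the macro and weighted averages coincide; they diverge as soon as one class dominates the sample counts.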

5. Results and discussion

This section demonstrates the validation of the newly developed PV array modeling approach. Subsequently, the effectiveness of the proposed automatic fault detection system is assessed under various weather conditions, considering different faulty patterns in PV array operation. Finally, the fault detection method based on Random Forest Classifier (RFC) is evaluated through benchmarking against other established machine learning techniques. It's worth noting that these validations rely on data collected from the monitored PV system described earlier.

5.1. PV modeling and parameter estimation approach validation

The newly developed procedure for Current-Voltage translation to STC has been validated using three measured curves denoted as Curve 1, Curve 2, and Curve 3, as shown in Fig. 8. The methodology involves an intermediate step where Curve 4 is determined based on the operating conditions (T and G) of Curve 1 and Curve 2. Subsequently, the target curve, which represents the operation of the PV array at STC (referred to as Curve 0), is predicted from Curve 4 and Curve 3.

Fig. 8. Predicted I-V curve at STC (Curve 0) using the current–voltage translation method.

Following the extraction of the reference I-V curve for the PV array, the unknown parameters of the one-diode model (ODM) have been extracted through a parameter extraction technique utilizing the Modified Grey Wolf Optimization (MGWO). The optimization algorithm has demonstrated high accuracy, with an RMSE value of 0.0122, and the resulting parameters are presented in Table 5.

Table 5. Extracted ODM parameters at STC.

| Parameter | Value |
|---|---|
| Rp (Ω) | 42.9633 |
| Rs (Ω) | 0.2212 |
| Io (A) | 4.344 × 10⁻⁷ |
| n | 45.1606 |
| Iph (A) | 6.8378 |
| RMSE | 0.0122 |

The proposed PV array modeling methodology was validated extensively by using the extracted parameters to simulate the PV array under varying irradiance (G) and temperature (T) conditions, as described in Eqs. (20)–(25). The experimental I-V and P-V curves were compared with the simulated data to evaluate the model's accuracy under static conditions. The results, depicted in Fig. 9, show close agreement between the measured and simulated values, which is corroborated by RMSE values of 0.0266 and 0.1024, respectively.

Fig. 9. PV array model validation under a) T = 28.1, G = 749, b) T = 28.2, G = 800.

Dynamic validation of the PV array model was conducted using an adapted co-simulation model that integrates the MATLAB and PSIM environments. This dynamic validation incorporated daily temperature and irradiation profiles, along with measured MPP output profiles from a real PV system located in Algiers, under three distinct weather conditions: a) clear sky, b) semi-cloudy, and c) cloudy day. Fig. 10 illustrates the temporal evolution of the developed model's simulated PV array output current. The results demonstrate a strong agreement between the measured and estimated values of the maximum power point current, as indicated by the RMSE values (RMSE = 0.1416, 0.216, and 0.2971, respectively). This substantiates the efficacy of the identification process and the robustness of the proposed approach.

Fig. 10. Dynamic validation of the PV array model under different weather conditions.

5.2. Evaluation of the proposed fault detection and diagnosis strategy

The automated fault detection procedure based on Random Forest (RF) is implemented with the Python library scikit-learn, using the “RandomForestClassifier” class from the “ensemble” module. The method is built entirely in Python, with key libraries including scikit-learn, NumPy, SciPy, seaborn, matplotlib, and the open-source machine learning library dlib [46]. Scikit-learn primarily handles the random forest implementation, while dlib is employed for automatic error detection and diagnosis. The work was carried out on a personal computer equipped with an Intel Core i7 processor (2.50 GHz), 16 GB of RAM, and a GTX 1060 GPU with 6 GB of memory.
As explained in the section above, the grid search algorithm was utilized to optimize the hyperparameters in this study. Table 6 lists the optimal hyperparameters for each RF model.

Table 6. Optimal hyperparameters.

| Hyperparameter | RF detection model | RF diagnosis model |
|---|---|---|
| max_depth | 45 | 85 |
| n_estimators | 65 | 35 |
| criterion | gini | entropy |
| bootstrap | True | True |
| min_samples_leaf | 1 | 1 |
| min_samples_split | 2 | 2 |
| max_features | 6 | 6 |

The output reports of the two Random Forest Classifiers (RFCs) are summarized in Table 7 and Table 8. Both RFCs perform very well, each reaching an accuracy of 99.4 %. Among classification metrics, particular emphasis is placed on Precision and Recall (sensitivity), as they are more informative for imbalanced class distributions. However, there is typically a trade-off between the two, and the nature of the classification problem often dictates which one matters more. In our case, the lowest Precision and Recall values occur in the diagnosis phase and equal 0.978 and 0.974, respectively. This can be explained by the difficulty of distinguishing between faults #1 and #3, which represent three partially shaded and three short-circuited PV modules, respectively. These two faults have a similar impact on the PV output power: as seen in Fig. 5, when three PV modules are shaded, the output power of the PV system closely resembles that of the scenario with three short-circuited PV modules (represented in green). Despite this difficulty, the developed fault detection procedure maintains an overall accuracy of 99.4 %.

Table 7. Classification report of RF detection model.

| | Precision | Recall | F1score | Samples number |
|---|---|---|---|---|
| Class0 | 1.00 | 0.970 | 0.985 | 12,145 |
| Class1 | 0.993 | 1.000 | 0.996 | 48,578 |
| Macro avg | 0.996 | 0.985 | 0.991 | 60,723 |
| Weighted avg | 0.994 | 0.994 | 0.994 | 60,723 |
| Accuracy (%) | 99.4 | | | 60,723 |
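
As a quick consistency check on the averaged rows of Table 7, the macro and weighted averages can be recomputed from the per-class values and sample counts reported there. The tabulated per-class values are already rounded, so the last digit of a recomputed average may differ slightly.

```latex
\begin{align*}
\text{Macro Recall} &= \frac{0.970 + 1.000}{2} = 0.985 \\[4pt]
\text{Weighted Recall} &= \frac{12{,}145 \times 0.970 + 48{,}578 \times 1.000}{60{,}723}
  = \frac{60{,}358.65}{60{,}723} \approx 0.994 \\[4pt]
\text{Weighted Precision} &= \frac{12{,}145 \times 1.000 + 48{,}578 \times 0.993}{60{,}723}
  = \frac{60{,}382.95}{60{,}723} \approx 0.994
\end{align*}
```

The weighted averages sit close to the Class1 values because Class1 accounts for about 80 % of the detection test samples.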

Table 8. Classification report of RF diagnosis model.

| | Precision | Recall | F1score | Samples number |
|---|---|---|---|---|
| Class0 | 0.978 | 1.000 | 0.989 | 12,145 |
| Class1 | 1.000 | 0.974 | 0.987 | 12,144 |
| Class3 | 0.999 | 1.000 | 1.000 | 12,145 |
| Class9 | 0.998 | 1.000 | 0.999 | 12,144 |
| Macro avg | 0.994 | 0.994 | 0.994 | 48,578 |
| Weighted avg | 0.994 | 0.994 | 0.994 | 48,578 |
| Accuracy (%) | 99.4 | | | 48,578 |

The normalized confusion matrices of the two RF models, one for fault detection and the other for fault diagnosis, are presented in Fig. 11 and Fig. 12, respectively. Table 7 and Fig. 11 show that the binary classification model performs very well. For the healthy system (Class0), the model achieves a Precision of 1.00, i.e., very few false positives, with a slightly lower Recall of 0.970: 97 % of healthy cases are recognized as healthy. For the faulty cases (Class1), the model attains a Precision of 0.993, a Recall of 1.000, and an F1score of 0.996, indicating near-perfect performance.

Fig. 11. Normalized confusion matrix of the RF detection model.

Fig. 12. Normalized confusion matrix of the RF diagnosis model.

The diagnosis phase is summarized in Table 8 and Fig. 12. The rows and columns of the normalized confusion matrix of the RF diagnosis model are explained below:
  • Row 1 (Class0): The first row corresponds to fault #2. The value 1.00 in the top-left cell means that the model identifies Class0 instances correctly. The remaining values in this row are zero, indicating that Class0 is not misclassified as any other class. The model is therefore highly accurate in recognizing open-circuit faults.
  • Row 2 (Class1): The second row represents fault #1. The value 0.974 in the second cell (from the left) means that the model correctly identifies Class1 instances with a true positive rate of about 97.4 %. The value 0.023 in the third cell shows that Class1 is occasionally misclassified as Class3, and the value 0.003 in the first cell shows that it is occasionally misclassified as Class0. The remaining value in this row is zero. The model is thus highly effective at identifying Class1 but may occasionally confuse it with Class3.
  • Row 3 (Class3): The third row corresponds to fault #3. The value 1.00 in the third cell (from the left) means that the model recognizes Class3 instances with a true positive rate of 100 %. The remaining values in this row are all zero, indicating that Class3 instances are not assigned to any other class. This underscores the model's reliability in identifying Class3.
  • Row 4 (Class9): The fourth row represents fault #4. The value 1.0 in the last cell indicates a perfect true positive rate, meaning that the model identifies all Class9 instances. The remaining values in this row are zero: the model never misclassifies Class9 as any other class, which emphasizes its performance in recognizing line-to-line faults.
In summary, for Class0 (fault #2), Class3 (fault #3), and Class9 (fault #4), the model achieves a high true positive rate and accurately predicts instances of these faults. For Class1 (fault #1), Precision remains high, with minimal false positives, but Recall is slightly lower (0.974). This could be attributed to the similarity between this type of fault and Class3 (fault #3), as previously explained. Overall, the model efficiently minimizes misclassifications.
To better explain the classification results, we created graphical representations of the classification outputs of both RFC models, presented in Fig. 13 for the detection stage and Fig. 14 for the diagnosis stage. The graphical outputs agree with the data in the confusion matrices. The RFC models thus perform robustly across all fault classes defined in this work. Their high precision and recall are essential for accurate classification: they reduce misclassifications and make the models a valuable tool for fault detection and diagnosis in grid-connected PV systems.

Fig. 13. Fault detection results.

Fig. 14. Fault diagnosis results.

5.3. Comparative analysis

To underscore the effectiveness of our RFCs in fault detection and diagnosis, we conducted a comparative analysis with several alternative machine learning approaches: Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Neural Networks (MLP Classifier), Decision Trees (DT), and the Stochastic Gradient Descent Classifier (SGDC). To ensure a fair and thorough comparison, we followed the same steps as outlined in our study and tuned the hyperparameters of each algorithm using a grid search. The summarized results are presented in Table 9.
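The comparison protocol described above (same data split for every model, hyperparameters tuned by grid search) can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the paper does not name its software library, the data here are synthetic stand-ins generated with `make_classification` (the actual database is confidential), and the hyperparameter grids are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the (confidential) operational database.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=2, random_state=0,
)
# One fixed split shared by every candidate model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Hypothetical search grids; the paper does not report the grids it used.
candidates = {
    "RF": (RandomForestClassifier(random_state=0),
           {"n_estimators": [20, 50], "max_depth": [3, None]}),
    "KNN": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}),
    "DT": (DecisionTreeClassifier(random_state=0),
           {"max_depth": [3, 5, None]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # Cross-validated grid search on the training split only.
    search = GridSearchCV(estimator, grid, cv=3, scoring="accuracy")
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        # Accuracy of the refitted best model on the held-out test split.
        "test_accuracy": search.score(X_test, y_test),
    }

for name, info in results.items():
    print(name, info["best_params"], round(info["test_accuracy"], 3))
```

Tuning on the training split and scoring on a held-out split that is identical for every model keeps the comparison fair: differences in test accuracy then reflect the algorithms rather than the data each one happened to see.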

Table 9. Comparative Analysis between SVM, KNN, DT, SGDC, MLP, and RF trained and tested using the same data set.

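| Phase     | Indicator    | Label | SVM   | MLP Classifier | KNN   | DT    | SGDC  | RF    |
|-----------|--------------|-------|-------|----------------|-------|-------|-------|-------|
| Detection | Precision    | 0     | 0.979 | 0.913          | 0.979 | 0.988 | 0.000 | 1.000 |
| Detection | Precision    | 1     | 0.984 | 0.990          | 0.984 | 0.981 | 0.796 | 0.993 |
| Detection | Recall       | 0     | 0.939 | 0.960          | 0.939 | 0.927 | 0.000 | 0.970 |
| Detection | Recall       | 1     | 0.995 | 0.977          | 0.995 | 0.997 | 1.000 | 1.000 |
| Detection | F1score      | 0     | 0.959 | 0.936          | 0.959 | 0.956 | 0.000 | 0.985 |
| Detection | F1score      | 1     | 0.990 | 0.983          | 0.990 | 0.989 | 0.886 | 0.996 |
| Detection | Accuracy (%) |       | 84.5  | 97.3           | 98.3  | 98.3  | 79.6  | 99.4  |
| Diagnosis | Precision    | 0     | 0.958 | 0.992          | 0.923 | 0.992 | 0.851 | 0.978 |
| Diagnosis | Precision    | 1     | 1.000 | 1.000          | 0.996 | 0.997 | 0.876 | 1.000 |
| Diagnosis | Precision    | 3     | 0.933 | 0.974          | 0.998 | 0.972 | 0.986 | 0.999 |
| Diagnosis | Precision    | 9     | 0.964 | 0.964          | 0.997 | 0.971 | 0.908 | 0.998 |
| Diagnosis | Recall       | 0     | 0.928 | 0.975          | 0.997 | 0.972 | 0.967 | 1.000 |
| Diagnosis | Recall       | 1     | 0.951 | 0.972          | 0.905 | 0.975 | 0.930 | 0.974 |
| Diagnosis | Recall       | 3     | 0.968 | 0.981          | 0.997 | 0.985 | 0.697 | 1.000 |
| Diagnosis | Recall       | 9     | 1.000 | 1.000          | 1.000 | 0.997 | 1.000 | 1.000 |
| Diagnosis | F1score      | 0     | 0.943 | 0.983          | 0.959 | 0.982 | 0.905 | 0.989 |
| Diagnosis | F1score      | 1     | 0.975 | 0.986          | 0.948 | 0.986 | 0.902 | 0.987 |
| Diagnosis | F1score      | 3     | 0.951 | 0.978          | 0.997 | 0.979 | 0.816 | 1.000 |
| Diagnosis | F1score      | 9     | 0.981 | 0.982          | 0.998 | 0.984 | 0.952 | 0.999 |
| Diagnosis | Accuracy (%) |       | 96.1  | 98.2           | 97.5  | 98.3  | 89.8  | 99.4  |
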
For the detection phase, the SVM reached an accuracy of 84.5 % and the MLP Classifier a good accuracy of 97.3 %. The SGDC yielded the lowest accuracy at 79.6 %, whereas KNN and DT performed similarly well, both achieving 98.3 %. In the diagnosis stage, Table 9 shows that SVM, the MLP Classifier, and SGDC improved (to 96.1 %, 98.2 %, and 89.8 %, respectively), while KNN dropped slightly to 97.5 % and DT remained at 98.3 %, the highest accuracy among the alternative methods. Notably, our proposed RFC method outperformed all other methods in both the detection and diagnosis phases, achieving an accuracy of 99.4 % in each. Our RFC models also achieved the highest F1score for every class, together with the best or near-best Precision and Recall, when compared with SVM, KNN, DT, SGDC, and the MLP Classifier.
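The Precision, Recall, F1score, and Accuracy values in Table 9 follow the standard definitions based on a confusion matrix. The sketch below computes them in plain Python; the function names and the counts are illustrative assumptions. The hypothetical counts describe a classifier that assigns every sample to label 1 in a set where 79.6 % of the samples carry label 1, which is the pattern the SGDC detection column is consistent with (zero precision and recall for label 0, recall of 1.000 for label 1).

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


def per_class_metrics(counts, k):
    """Precision, recall and F1 for class index k.

    counts[i][j] = number of samples of true class i predicted as class j.
    """
    tp = counts[k][k]
    predicted_k = sum(row[k] for row in counts)   # column total
    actual_k = sum(counts[k])                     # row total
    precision = tp / predicted_k if predicted_k else 0.0
    recall = tp / actual_k if actual_k else 0.0
    return precision, recall, f1_from(precision, recall)


def accuracy(counts):
    """Fraction of all samples that lie on the diagonal."""
    correct = sum(counts[i][i] for i in range(len(counts)))
    total = sum(sum(row) for row in counts)
    return correct / total


# Hypothetical counts (NOT the paper's data): everything is predicted as label 1.
degenerate = [
    [0, 204],   # true label 0: none recognized
    [0, 796],   # true label 1: all recognized
]

p0, r0, f0 = per_class_metrics(degenerate, 0)
p1, r1, f1 = per_class_metrics(degenerate, 1)
acc = accuracy(degenerate)

print(f"label 0: P={p0:.3f} R={r0:.3f} F1={f0:.3f}")
print(f"label 1: P={p1:.3f} R={r1:.3f} F1={f1:.3f}")
print(f"accuracy = {100 * acc:.1f} %")

# Consistency check on a Table 9 entry (RF, detection, label 0): P=1.000, R=0.970.
rf_f1_label0 = f1_from(1.000, 0.970)
```

The example shows why accuracy alone can mislead on imbalanced data: a classifier that never predicts label 0 still scores 79.6 % accuracy, and only the per-class metrics expose the failure.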
The results obtained in this study align well with the findings in the existing literature. Eskandari et al. proposed an SVM-based method specifically for detecting and classifying line-to-line (LL) faults, achieving average accuracies of 96 % and 97.5 %, respectively [18]. While their accuracies surpass those of our SVM model, our study encompasses various types of faults and is not limited to LL faults alone. Our K-Nearest Neighbors (KNN) model achieved an accuracy of 98.3 %, closely matching the results of Madeti and Singh [17], who attained an average fault classification accuracy of 98.70 %, focusing on open-circuit, line-to-line, and different short-circuit faults, including those represented by bypass diodes. In the domain of neural networks, our Multilayer Perceptron (MLP) Classifier achieved an accuracy of 98.2 %. In comparison, Chine et al. [47] reported an accuracy of 90.3 %, a gap potentially attributable to the number of faults considered and the absence of neural network architecture optimization in their work. Benkercha and Moulahoum presented a fault detection and diagnosis technique based on the Decision Trees (DT) algorithm, achieving an overall accuracy of around 99 % [48]. Although this is slightly higher than the 98.3 % obtained with our DT model, the difference can be explained by the broader spectrum of fault types considered in our study. Kapucu and Cubukcu [19] reported slightly lower accuracies of 97.46 % and 97.67 % using quadratic discriminant analysis-extra trees-Decision Trees (QDA-ETent-DT) for PV fault detection before and after optimization, respectively; their study focused on partial shading and short-circuit faults without accounting for changes in weather conditions.

6. Conclusion

Photovoltaic systems are continuously exposed to many faults that significantly impact their performance and overall efficiency. These faults, including short circuits, shading, line-to-line faults, and open circuits, can substantially reduce the harvested solar energy. In response, this manuscript introduces a robust machine learning (ML) technique that uses the Random Forest Classifier (RFC) to detect and diagnose faults in PV systems.
Our approach builds on a precise one-diode model (ODM) that accurately replicates actual PV system behavior. The unknown parameters of the ODM are identified through a new application of the current–voltage translation technique combined with the Modified Grey Wolf Optimization (MGWO) algorithm.
The extracted ODM parameters are integrated into the developed physical model of the studied PV system. Trustworthy databases representing normal and abnormal PV system operation are constructed using PSIM and MATLAB software co-simulations.
Following the development of the RFC-based fault detection procedure, our results demonstrate classification accuracies of 99.4 % for both fault detection and diagnosis. This outperforms alternative models such as Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Neural Networks (MLP Classifier), Decision Trees (DT), and the Stochastic Gradient Descent Classifier (SGDC).
In conclusion, the RF algorithm emerges as a robust tool for fault diagnosis, offering higher accuracy and efficiency, particularly in cases of partial shading. While these results are encouraging, it is essential to acknowledge the complexity of PV systems, which poses potential challenges for fault detection. Furthermore, the proposed technique requires no sensors beyond those already present in a standard PV installation, which makes it straightforward to implement across various PV systems. Future research will explore more advanced techniques, potentially using deep learning methods, to precisely locate faults within PV systems, with the aim of enhancing the reliability and efficiency of PV systems for a more sustainable solar energy future.

CRediT authorship contribution statement

Ahmed Faris Amiri: . Houcine Oudira: Writing – review & editing, Supervision, Resources, Methodology, Formal analysis, Conceptualization. Aissa Chouder: Writing – review & editing, Supervision, Resources, Methodology, Investigation, Conceptualization. Sofiane Kichou: Writing – review & editing, Writing – original draft, Visualization, Validation, Methodology, Investigation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The data that has been used is confidential.

References