Abstract

The aviation industry holds a large amount of information and maintenance data that could be used to obtain meaningful results in forecasting future actions. This study aims to introduce machine learning models based on feature selection and data elimination to predict failures of aircraft systems. Maintenance and failure data for aircraft equipment were collected over a period of two years, and nine input variables and one output variable were identified. A hybrid data preparation model is proposed to improve the success of failure count prediction in two stages. In the first stage, ReliefF, a feature selection method for attribute evaluation, is used to find the most effective and ineffective parameters. In the second stage, a K-means algorithm is modified to eliminate noisy or inconsistent data. The performance of the hybrid data preparation model on the equipment maintenance dataset is evaluated with three machine learning algorithms: a Multilayer Perceptron (MLP) as an Artificial Neural Network (ANN), Support Vector Regression (SVR), and Linear Regression (LR). The Correlation Coefficient (CC), Mean Absolute Error (MAE), and Root Mean Square Error (RMSE) are used as performance criteria to evaluate the models. The results indicate that the hybrid data preparation model is successful in predicting the failure count of the equipment.

1. Introduction

Reliability and availability of aircraft components have always been an important consideration in aviation. Accurate prediction of possible failures will increase the reliability of aircraft components and systems. The scheduling of maintenance operations helps determine the overall maintenance and overhaul costs of aircraft components. Maintenance costs constitute a significant portion of the total operating expenditure of aircraft systems.

There are three main types of maintenance for equipment: corrective maintenance, preventive maintenance, and predictive maintenance [1]. Corrective maintenance helps manage repair actions and unscheduled fault events, such as equipment and machine failures. When aircraft equipment fails while it is in use, it is repaired or replaced. Preventive maintenance can reduce the need for unplanned repair operations. It is implemented through periodic maintenance to avoid equipment failures or machinery breakdowns. Tasks for this type of maintenance are planned to prevent unexpected downtime and breakdown events that would lead to repair operations. Predictive maintenance, as the name suggests, uses parameters measured while the equipment is in operation to estimate when failures might happen. It aims to intervene in the system before faults occur [1, 2] and helps reduce the number of unexpected failures by providing maintenance personnel with more reliable scheduling options for preventive maintenance. Assessing system reliability is important for choosing the right maintenance strategy.

Machine learning is a rapidly developing technology that is expected to advance further in the future. Machine learning methods are applied in predictive/preventive systems, communications, security, energy management, and so on [3]. The data preparation level is the core module of machine learning and decision-making systems. It manages the data to make it useful for decisions. Decision making depends on future forecasts, failure events, and the availability of equipment [4]. Data mining is a way of classifying and grouping data into comprehensible information. It extracts applicable models from a mass of information and adopts different approaches to uncover hidden patterns in the data. Data mining can be defined as knowledge derivation from raw data [5].

Feature selection is a fundamental issue in data mining and machine learning; it focuses on the features that are most relevant to the intended prediction [6]. Features collected from the observation of a situation are not all equally significant. Operational data tend to be incomplete, insufficient, or only partially meaningful, and sometimes not meaningful at all. Some features may be noisy, redundant, or irrelevant. Feature selection aims to choose a feature set that is relevant to a specific task. This is a complex and multidimensional problem [7]. Hsu [8] proposed a novel feature selection algorithm based on the correlation coefficient clustering method. It focused on reducing noisy, repeated, or redundant features. Computational speed and classification accuracy can be improved by removing the irrelevant features. Data processing methods help improve the quality of the data and increase the accuracy of data mining, thereby making it more efficient [8]. Data quality is important for information discovery, checking data anomalies, and predicting and analyzing for decision making [9]. Predicting equipment failures is essential to reduce repair and equipment costs and to assess equipment availability [1].

Large volumes of data can be useful for businesses and can guide systems in the right direction. To boost the performance of machine learning algorithms, it is critical that meaningful information be gathered from the dataset. To eliminate noisy and irrelevant data during data preparation, we used the K-means clustering algorithm, which is one of the most popular unsupervised machine learning algorithms. It defines k centroids and allocates every case to the nearest cluster while keeping the distances between the cases and their centroid small [10]. The “means” in K-means refers to averaging the cases in a cluster to find its centroid. The algorithm assigns each case to a single cluster only. The purpose is to achieve a high level of similarity within the clusters and low similarity across them [11]. It is used for more effective clustering with decreased complexity.
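
The assignment and averaging steps described above can be sketched in a few lines of plain Python. This is a minimal illustration of standard K-means, not the authors' implementation; the function name `kmeans`, the choice of the first k points as initial centroids, and the toy data are assumptions made for this example.

```python
import math


def kmeans(points, k, n_iter=100):
    """Plain K-means on a list of numeric tuples (illustrative sketch)."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep centroid
        if new_centroids == centroids:
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
          (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
```

Each point ends up in exactly one cluster, which matches the single-membership property noted above. The result depends on the initial centroids, so a practical run would normally try several initializations.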

There are many studies on maintenance data and forecasting failure rates. Data preparation is a critical step in the feature selection process, and it has a major effect on the success of a machine learning algorithm. Gurbuz et al. [9] applied various preprocessing techniques along with feature selection to 15 datasets of a Turkish airline company to understand and clean the datasets and to find the relationships between input and output features. They came up with 15 rules for creating failure alerts, and these were found useful by the experts of the aviation company. A classification algorithm was used to extract patterns within the equipment data. Kutlylowska [12] proposed the application of artificial neural networks to failure rate modelling. Data from a water utility in Poland were used to estimate the output values of failure frequency. The results showed that artificial neural networks could be an option for assessing the frequency of problems in water supply systems. Ramos et al. [13] carried out a study to predict malfunctions and to apply predictive maintenance to a piece of manufacturing equipment. In this study, ARIMA forecasting methods were compared with neural network models. The results indicated that both models were good at forecasting defibrator disc replacement, but ARIMA was much better at forecasting the distance between the discs. Trani et al. [14] introduced a basic method to estimate aircraft fuel consumption through the use of an artificial neural network. A fuel consumption model supported by a neural network was created using the data given in the performance manual of the aircraft. The results from the neural network model were compared with analytical models. The results revealed that the proposed three-layer ANN with nonlinear transfer functions could correctly estimate fuel consumption in different stages of a flight. Ming et al. [15] investigated the use of the ANN method in vibration analysis, using integrated data from vibration devices to predict equipment failures. The ANN model was applied to diagnose the faults in a mill. The results lent support to the efficiency of this methodology. Kozik and Sep [16] applied ANN forecasting to identify the demand for spare parts to be replaced during aircraft engine overhaul. The results indicated that a forecasting method incorporating the engine’s hard-time calculation can support the implementation of lean manufacturing in MRO (maintenance, repair, and overhaul) facilities. Altay et al. [17] used 532 failures of 60 aircraft to build an artificial neural network model for forecasting failures. The proposed model produced high correlation between the predicted and target failure times of aircraft. Benkedjouh et al. [18] proposed a method to estimate the remaining useful life (RUL) of machinery with bearings. For this purpose, the researchers used the isometric feature mapping reduction technique (ISOMAP) and support vector regression (SVR). Moura et al. [19] presented a comparative analysis of the effectiveness of SVM in predicting failure time, in which the performance of SVM regression was compared with that of other learning methods.

This paper discusses the feature selection of variables in the maintenance data obtained from an aviation company in Turkey. The proposed system will help companies collect, extract, and create data to improve maintenance actions through more accurate predictions. This study proposes a hybrid data preparation method for maintenance data and predicts the failure counts of equipment, comparing the results of three different algorithms. The feature selection method (the ReliefF algorithm in the present study) is used for selecting attributes, and the modified K-means algorithm is used to eliminate the redundant data. Three methods for predicting equipment failure counts are introduced and compared: MLP as an ANN algorithm, SVR, and LR. The next section presents an overview of the materials and methodology, followed by experimental results and conclusions.

2. Materials and Methodology

The present case study was carried out at an aviation company in Ankara, Turkey. The maintenance data were collected from the records of the maintenance department. They included removal of equipment, repair activities, experience of the operators, flight hours of the equipment, and other information relevant to the case study.

2.1. Dataset Acquisition

In the ERP platform, a program was developed to collect data and to format the dataset for analysis by machine learning algorithms. The candidate parameters were grouped as the input variables, while the failure count was taken as the output variable. The selected parameters are evaluated by the ReliefF feature selection method to find the parameters that have the most influence on the failures.

The flow logic of the developed program is presented in Figure 1. First, serial-numbered equipment used in the landing gear system was selected, and its maintenance and operational data were identified. Attributes of the maintenance and failure data were identified in cooperation with maintenance personnel and technicians. Each instance of a no fault found (NFF) status was examined to find confirmed failure data. The total flight hours for each piece of equipment across different aircraft were calculated for a given time period. Nine input variables that affect the failure of the equipment were determined. The failure count was calculated as the output variable. These nine input variables and one output variable were used for modelling with the machine learning algorithms used in the present study (MLP, SVR, and LR).

2.2. Feature Selection and ReliefF Algorithm

Feature selection is a technique to obtain the relevant features by removing irrelevant and noisy data from the original dataset. It is the process of selecting a subset of features that can adequately describe the whole dataset. The main objective of feature selection is to mine the data to obtain the minimum number of features that achieves maximum accuracy. Feature selection methods are used in data mining and machine learning, as well as in artificial intelligence. They reduce model complexity and let algorithms operate faster. Relief is a feature selection algorithm that uses randomly selected instances to calculate feature weights. The Relief algorithm was proposed by Kira and Rendell in 1992 [20, 21]. It estimates feature weights iteratively, according to their ability to distinguish between neighboring instances. The original Relief is limited to two-class classification problems and does not cope well with incomplete data. Kononenko [22] proposed an extension called ReliefF that deals with noisy, irrelevant, and missing data and addresses multiclass problems. The ReliefF algorithm finds a near miss for each different class and averages their contributions to revise the feature weights.
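
To make the weight update concrete, the sketch below implements the basic two-class Relief idea in plain Python: a feature gains weight when it differs on the nearest instance of the other class (nearest miss) and loses weight when it differs on the nearest instance of the same class (nearest hit). This is the original Relief rather than ReliefF, and it is not the authors' code. The function name `relief_weights`, the use of Manhattan distance, visiting every instance instead of sampling, and the toy data are assumptions made for illustration.

```python
def relief_weights(X, y):
    """Basic two-class Relief weights (illustrative sketch, not ReliefF)."""
    n, d = len(X), len(X[0])
    # Scale each feature to [0, 1] so that differences are comparable.
    lo = [min(row[f] for row in X) for f in range(d)]
    hi = [max(row[f] for row in X) for f in range(d)]
    span = [(hi[f] - lo[f]) or 1.0 for f in range(d)]
    Z = [[(row[f] - lo[f]) / span[f] for f in range(d)] for row in X]

    def dist(a, b):
        return sum(abs(u - v) for u, v in zip(a, b))  # Manhattan distance

    weights = [0.0] * d
    for i in range(n):  # assumption: visit every instance instead of sampling
        others = [j for j in range(n) if j != i]
        hit = min((j for j in others if y[j] == y[i]),
                  key=lambda j: dist(Z[i], Z[j]))
        miss = min((j for j in others if y[j] != y[i]),
                   key=lambda j: dist(Z[i], Z[j]))
        for f in range(d):
            # Reward separation from the miss, penalise separation from the hit.
            weights[f] += (abs(Z[i][f] - Z[miss][f])
                           - abs(Z[i][f] - Z[hit][f])) / n
    return weights


# Hypothetical data: feature 0 follows the class, feature 1 does not.
X = [[0.0, 0.1], [0.1, 0.9], [0.05, 0.5],
     [0.9, 0.2], [1.0, 0.8], [0.95, 0.4]]
y = [0, 0, 0, 1, 1, 1]
weights = relief_weights(X, y)
```

Ranking the features by these weights and keeping the top ones is the selection step. For a numeric target such as a failure count, a regression variant of the algorithm would be needed instead of this two-class form.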

2.3. Data Elimination and Modified K-Means Algorithm

K-means, an algorithm used in a wide range of applications, was first developed in 1967 by MacQueen [23]. It allows each data point to be a member of a single cluster only. It has two limitations: the value of K must be fixed in advance, and initial centroids must be chosen. It uses the Euclidean distance as its distance criterion. Fahad and Alam [10] proposed a method using a modified K-means algorithm, which proved less time consuming yet more efficient in clustering. The quality of the resulting clusters depends on the selection of the initial centroids. The K-means algorithm makes it possible to create a new data cluster by eliminating the smallest class value represented in the cluster. Yilmaz et al. [11] proposed a system using a modified K-means algorithm to eliminate noisy and irrelevant data. In this study, we used the modified K-means algorithm as in [11] and developed the pseudocode given in Algorithm 1.

Algorithm 1: Data elimination with the modified K-means algorithm.
Procedure prepare_data_set
  Get clustered_dataset, distance to cluster center, elimination_number
  For i = 1 to cluster_count
    Sort data of clustered_dataset(i) by distance to cluster_center(i), descending
    For j = 1 to elimination_number
      Delete the first (farthest) data in clustered_dataset(i)
    End for
  End for
End procedure

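
The elimination step in Algorithm 1 can be sketched in plain Python as follows. It assumes the clustering has already been done, so cluster labels and centers are given as inputs. The function name `eliminate_farthest` and the toy data are hypothetical and chosen only for illustration.

```python
import math


def eliminate_farthest(points, labels, centers, elimination_number):
    """Return indices of records kept after dropping the farthest ones per cluster."""
    kept = []
    for c, center in enumerate(centers):
        members = [i for i, lab in enumerate(labels) if lab == c]
        # Sort this cluster's records by Euclidean distance to its center, descending.
        members.sort(key=lambda i: math.dist(points[i], center), reverse=True)
        # Drop the first `elimination_number` (farthest) records, keep the rest.
        kept.extend(members[elimination_number:])
    return sorted(kept)


# Hypothetical clustered data: each cluster has one record far from its center.
points = [(0.0, 0.0), (0.0, 1.0), (0.0, 5.0),
          (10.0, 10.0), (10.0, 11.0), (10.0, 20.0)]
labels = [0, 0, 0, 1, 1, 1]
centers = [(0.0, 0.0), (10.0, 10.0)]
kept = eliminate_farthest(points, labels, centers, elimination_number=1)
```

Slicing the sorted list removes the farthest records in one step, which avoids the index shifting that occurs when items are deleted one at a time inside a loop.
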
2.4. Prediction Methods

In recent years, research on machine learning algorithms and data mining has been carried out to study failure prediction applications. In this study, the MLP, SVR, and LR algorithms were examined to model maintenance data and predict the failure count.

2.4.1. Multilayer Perceptron as an Artificial Neural Network

An Artificial Neural Network (ANN) is a mathematical model built from an interconnected group of artificial neurons and inspired by biological neural networks. An ANN imitates the way the brain processes information as an approach to computation [24, 25]. Neural networks are a nonlinear statistical data modelling and machine learning method. They can be used to model complex nonlinear relationships between inputs and outputs in the data. They also describe patterns or relationships in the data, and they help forecast output values through training, learning, and testing processes. A cell in a neural network is called a neuron, and a fixed number of neurons form a layer. Neurons connect to neurons in other layers through weight factors. ANN algorithms compute weights for input values, hidden layer neurons, and output layer neurons by a feed forward approach [26, 27]. Weights in an ANN are calculated by a training algorithm, the most popular of which is backpropagation. Backpropagation is a learning algorithm that seeks to minimize the difference between the actual and target outputs. The weights are updated so that the total error is distributed to the various neurons in the neural network. The error is kept at a low level through repeated feeding forward and backpropagating [17]. The predictive capability of neural networks comes from their multilayered structure. Neural networks have an input layer, one or more hidden layers, and an output layer. The behavior of an MLP also depends on the activation function of its neurons [28]. In this study, multilayer perceptron (MLP) feed forward neural networks were used with a backpropagation learning algorithm.
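
The feed forward and backpropagation cycle can be illustrated with a very small network in plain Python. The sketch below is not the configuration used in the study: the architecture (one input, two tanh hidden neurons, one linear output), the fixed initial weights, the learning rate, the number of epochs, and the toy data are all assumptions chosen to keep the example short and reproducible.

```python
import math


def train_mlp(xs, ys, lr=0.05, epochs=500):
    """Train a tiny 1-2-1 MLP with backpropagation; return (initial, final) MSE."""
    # Assumption: fixed initial weights so that the run is reproducible.
    w1, b1 = [0.5, -0.5], [0.0, 0.0]  # input -> hidden weights and biases
    w2, b2 = [0.5, -0.5], 0.0         # hidden -> output weights and bias

    def forward(x):
        h = [math.tanh(w1[j] * x + b1[j]) for j in range(2)]
        return h, sum(w2[j] * h[j] for j in range(2)) + b2

    def mse():
        return sum((forward(x)[1] - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

    initial_loss = mse()
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            h, out = forward(x)  # feed forward
            err = out - y        # difference between actual and target output
            for j in range(2):
                # Backpropagate the error through the tanh hidden neuron.
                grad_h = err * w2[j] * (1.0 - h[j] ** 2)
                w2[j] -= lr * err * h[j]
                w1[j] -= lr * grad_h * x
                b1[j] -= lr * grad_h
            b2 -= lr * err
    return initial_loss, mse()


# Hypothetical training data following y = 2x on [0, 1].
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [0.0, 0.5, 1.0, 1.5, 2.0]
initial_loss, final_loss = train_mlp(xs, ys)
```

Comparing `initial_loss` with `final_loss` shows the error shrinking as the weights are updated, which is the behaviour described above.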

2.4.2. Support Vector Regression

The Support Vector Machine (SVM) algorithm was introduced by Cortes and Vapnik in 1995 [29]. It is a linear model used to address classification and regression problems. The SVM algorithm produces a hyperplane that classifies the data. In the basic case, two distinct classes are separated by a linear hyperplane. Training the algorithm involves identifying the parameters of this hyperplane [11]. Support Vector Regression (SVR) is a regression algorithm that applies the same approach as SVM to regression analysis. SVR is a supervised machine learning algorithm that can be used for prediction and data mining and has been successfully adopted for regression problems.

2.4.3. Linear Regression

Linear regression is a supervised machine learning algorithm that performs a regression task. It is used to model the linear relationship between a dependent variable and one or more independent variables. It helps determine the relationship between variables and supports prediction. Schuld et al. [30] proposed a prediction algorithm on a quantum computer, based on a linear regression model with least-squares optimization. Their scheme focused on the machine learning task of predicting the output corresponding to a new input. The prediction result can be used for further quantum information processing routines.
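
For a single input variable, the least-squares fit has a closed form that is easy to show in plain Python. The sketch below is a generic illustration rather than the model used in the study, which has several inputs. The function name `fit_line` and the toy data are assumptions.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
prediction = slope * 4.0 + intercept  # predicted output for a new input x = 4
```

Once the slope and intercept are fitted, predicting the output for a new input is a single multiplication and addition.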

2.5. Evaluation Performance Measures

In this study, the mean absolute error (MAE), root mean square error (RMSE), and correlation coefficient (CC) criteria were used to evaluate the success of all the models. These are among the most commonly used of the many available error measures. The error parameters, adopted from [31], are given here in their standard forms, respectively:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|X_i - Y_i\right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(X_i - Y_i\right)^2}$$

$$\mathrm{CC} = \frac{\sum_{i=1}^{N}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{\sqrt{\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2 \sum_{i=1}^{N}\left(Y_i - \bar{Y}\right)^2}}$$

where $N$ is the number of data; $X_i$ is the observed value; $Y_i$ is the predicted value; $\bar{X}$ is the mean of the observed data, and $\bar{Y}$ is the mean of the predicted data. CC measures how closely the predicted values follow the variability of the observed data.
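
The three criteria can be computed with a few lines of plain Python. The sketch below uses their usual definitions (with CC taken as the Pearson correlation coefficient), which is an assumption about the exact form used in [31]. The function names and the sample values are hypothetical.

```python
import math


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(x - y) for x, y in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean square error."""
    squared = sum((x - y) ** 2 for x, y in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def cc(observed, predicted):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(observed)
    mean_x, mean_y = sum(observed) / n, sum(predicted) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(observed, predicted))
    var_x = sum((x - mean_x) ** 2 for x in observed)
    var_y = sum((y - mean_y) ** 2 for y in predicted)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical failure counts: only the last prediction is off, by one.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
scores = (mae(observed, predicted), rmse(observed, predicted), cc(observed, predicted))
```

MAE can never exceed RMSE on the same data, which is a useful sanity check when reading reported results.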

3. Proposed Methods

In this study, as noted in Section 2.1, 585 rows of maintenance data collected over two years from a Turkish aviation company were used. The dataset consists of nine input variables and one output variable (failure count). The input variables are operational and environmental parameters that could influence failure occurrence and the length of operation before failures occur. They include such parameters as flight hours, the number of removals of equipment, and the number of faults with planned/unplanned removals. These data were analysed and represented in a format suitable for modelling, and the variables were characterised with the corresponding domain classification, shown in Table 1. The output variable is the number of equipment failures. A sample of the dataset is provided in Table 2.

Feature selection was carried out using these ten attributes. The number of equipment failures is the target of the analysis. For this purpose, the ReliefF feature selection algorithm was used to find relations and weighting coefficient dependencies. According to the ranked values, the four most effective attributes were selected (Table 3).

Noisy and inconsistent data in the prepared datasets often affect prediction negatively and reduce the performance of the system. Therefore, the modified K-means algorithm was used to eliminate the noisy and inconsistent data and so increase prediction performance. It was developed using the pseudocode given in Algorithm 1. In this model, cluster centers are allocated initially, and instances are distributed to the clusters. A predetermined number (N = 5) of records farthest from the center in each cluster were eliminated. The distance criterion was the Euclidean distance. The eliminated instances are shown in Figure 2.

Seventy-five records (approximately 13%) of the dataset were eliminated by the pseudocode of the proposed data preparation model, leaving 510 of the original 585 records. Our proposed hybrid data preparation model comprises two stages, as shown in Figure 3. In the first stage, the nine input attributes were reduced to four attributes by the ReliefF feature selection algorithm. In the second stage, the dataset was reduced to 510 records by the modified K-means algorithm. The resulting dataset with 510 records was provided as input to the MLP, SVR, and LR prediction algorithms.
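
The record counts follow directly from the clustering settings reported in the paper (k = 15 clusters in Section 4 and N = 5 eliminated records per cluster). The short check below only restates that arithmetic:

```latex
\begin{align*}
  \text{eliminated records} &= k \times N = 15 \times 5 = 75,\\
  \text{remaining records}  &= 585 - 75 = 510,\\
  \text{eliminated share}   &= \frac{75}{585} \approx 0.128 \approx 13\%.
\end{align*}
```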

4. Experimental Results

A program was developed to gather data for analysis by machine learning algorithms. The maintenance and operational data of the selected equipment were identified. Nine input variables and one output variable were determined. Using the raw data of 585 rows, nine inputs, and one output (585 × 10), the MLP, LR, and SVR models were trained and tested. The parameters of the predictors used in the study are provided in Tables 4-6, respectively.

To illustrate the performance of the suggested two-phase hybrid system, the prediction results for the raw dataset that is composed of 585 records and 9 attributes are presented in Table 7. Table 7 shows that based on the CC performance criterion, the best results were provided by the SVR algorithm, while the LR algorithm provided the best results based on the MAE and RMSE performance criteria.

Then, the ReliefF algorithm was applied to the raw data to identify the features that prove most effective in prediction. The ReliefF feature selection algorithm was applied to the nine input parameters, and according to the ranked values, the five lowest-ranked parameters were eliminated. The MLP, LR, and SVR models were built with the selected 585 rows, four inputs, and one output. The dataset (585 × 5) was used for training and testing. As seen in Table 8, all the results are better than those obtained without feature selection. The results provided by the MLP (CC = 0.9127, MAE = 0.7301, RMSE = 0.9853) are better than those of the other algorithms. The performance results for the error parameters of each prediction algorithm are provided in Table 8.

In the final phase, the modified K-means algorithm was applied to the dataset (585 × 5) to eliminate noisy and inconsistent data. The best k value was found to be k = 15. The five records farthest from the center of each cluster were eliminated. As a result, 75 rows were eliminated. In this way, a hybrid model approach was applied to the maintenance data, and the quality of the data was improved. The LR, MLP, and SVR models were built with the selected 510 rows, four inputs, and one output. The selected data (510 × 5) were used for training and testing. The results indicated that the models performed better than in the earlier experiments without feature selection and data reduction. The performance of the LR, MLP, and SVR algorithms is presented in Table 9.

As shown in Table 9, based on the CC, MAE, and RMSE performance criteria, the best results were provided by the LR algorithm in the suggested two-phase hybrid system. For the test data, CC = 0.9316, MAE = 0.7108, and RMSE = 0.835 were obtained.

Figures 4-6 show the linear correlation between the predicted and target results for the test data of LR, SVR, and MLP, respectively. The regression lines between the target and predicted test data are y1 = 0.9976x + 0.0155 for LR, y2 = 0.9184x + 0.48 for SVR, and y3 = 0.9999x − 0.0744 for MLP.

The target and predicted values in the test dataset of the suggested two-phase hybrid system are shown for LR, SVR, and MLP in Figures 7-9, respectively. The test results provided in Figures 7-9 are the graphical representation of the test results of the 1-fold cross validation.

Table 9 summarizes the performance obtained through the two-phase hybrid system, and Figures 7-9 show the corresponding predicted and target values. As seen in Figures 7-9, the proposed hybrid data preparation model increased the performance of the prediction models LR, SVR, and MLP. The suggested hybrid system helped attain higher accuracy in prediction, as it enabled us to select the most effective features and eliminate noisy or redundant data that could lower the accuracy of predictions.

5. Conclusions

In aviation, the use of maintenance data is critical in the analysis of reliability and maintenance costs, because predictive maintenance scheduling can be planned in line with estimates. The main target of predictive maintenance is to predict equipment failures and to plan spare part strategies for system components, in order to analyze the reliability and maintainability of a complex repairable system. In this study, a hybrid data preparation model was applied to the landing gear system maintenance dataset, using the ReliefF feature selection algorithm to select attributes and a modified K-means algorithm to eliminate noisy and inconsistent data. The proposed hybrid data preparation method was put into practice through LR, SVR, and MLP models. The results indicated that the LR model performed better than the MLP and SVR models in predicting the failure counts, and that the proposed hybrid data preparation model improves the accuracy of failure count prediction. This study could serve as a guide for using hybrid data preparation methods in machine learning algorithms and data mining.

Data Availability

The maintenance data used to support the findings of this study have not been made available because sharing the data might compromise data privacy. Moreover, the authors are not allowed to share these data due to security concerns.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by the Scientific Research Project of Havelsan and Presidency of Defence Industries project, grant no. HVL-SÖZ-18/033.