
  Unbalanced data processing and multimodal dynamic fusion

  Yuan Yihang


Key Laboratory of Modern Manufacturing Technology, Ministry of Education, Guizhou University, Guiyang, Guizhou 550025


Abstract: As advanced manufacturing technology develops, the fault prediction and diagnosis of key components in mechanical equipment deserve increasing attention. Predicting faults in advance and diagnosing equipment can not only reduce the risk of major disasters but also minimize economic losses. Bearings, tools, rotors, and other key components of mechanical systems operate under high loads and high intensity for long periods, making them prone to failure and wear. Predictive maintenance of these key components is therefore both a focus of enterprise operations and maintenance and a hot topic in academic research. Currently, most research on predictive maintenance of key components in mechanical systems assumes balanced data. In practice, however, the collected data are often imbalanced, with intra-class, inter-class, and time-series imbalances, and are further corrupted by noise, which poses significant challenges for fault prediction and diagnosis. Multimodal fusion integrates information from multiple modalities to achieve more accurate predictions and has made significant progress in a wide range of scenarios, including autonomous driving and medical diagnosis; its development has brought new solutions to fault diagnosis in the industrial sector. Starting from multimodal data, this paper introduces a multimodal fusion technique that integrates data from different modalities, extracts information from each modality, compensates for imbalanced data, and alleviates the impact of data imbalance, thereby addressing the data-imbalance problem in the fault prediction and diagnosis of key mechanical components through multimodal fusion.

Keywords: fault diagnosis; imbalanced data; noise; multimodal data; dynamic fusion

  1. Introduction


In recent years, China has designated "smart manufacturing engineering" as one of the five key projects of "Made in China 2025" and is vigorously promoting the rapid development of the manufacturing industry. Mechanical equipment, as the core foundation of manufacturing, is used widely both in high-precision fields such as spacecraft, military products, and integrated circuits, and in everyday transportation such as automobiles, trains, and subways. However, with the rapid development of technology and growing production demands, the structures of mechanical devices have become increasingly complex in order to cope with complicated and harsh working conditions. In this environment, key components of mechanical systems such as bearings, tools, and rotors may gradually degrade in performance and health due to the combined effects of internal factors (such as self-wear) and external factors (such as high temperature and pressure, excessive loads, and external impacts), and may ultimately fail completely.


Once key components of a mechanical system experience abnormal damage, it can lead to product quality issues and equipment downtime at best, or to irreversible major safety accidents such as injuries or fatalities at worst. However, if people can accurately predict these potential failures in advance and perform equipment maintenance, such tragedies may be averted. Therefore, timely, reasonable, and effective condition monitoring, early warning, and the formulation of maintenance and repair strategies for key components of mechanical systems such as bearings, tools, and rotors are essential to ensure their safe operation and fundamentally prevent catastrophic accidents. The principle of monitoring and predicting key components of mechanical systems involves using sensors for force, temperature, vibration, etc., on the machinery to extract state information of the components, followed by feature extraction, and establishing models based on feature data for fault monitoring and lifespan prediction, thereby providing reasonable maintenance and care suggestions to the staff. However, machinery primarily operates under normal conditions.

In the normal state, data are easy to collect and abundant, whereas the time spent in fault states is usually short and fault data are relatively scarce, which leads to an imbalance in the state data [3].

Generally, methods to address the imbalance problem include feature selection methods, improved classifier methods, and resampling techniques, with feature selection often combined with the other two. Compared with resampling techniques, classifier improvement methods do not change the distribution of the original dataset; instead, they address the imbalance by assigning a higher misclassification cost to minority samples, so achieving good results in imbalanced classification scenarios characterized by noisy data and overlapping classes can be challenging. Resampling techniques, especially oversampling, have potential advantages in handling noisy data and class overlap. In addition to traditional sampling-based fault diagnosis frameworks, GAN (Generative Adversarial Network)-based fault diagnosis techniques have also become very popular in recent years. However, both traditional oversampling techniques applied to imbalanced fault monitoring and GAN-based oversampling techniques fail to clearly describe the sample distribution characteristics of noisy, imbalanced, small datasets, and these shortcomings often lead to less than ideal outcomes in most research in this field. Studying the complexity of such data should therefore be a priority. Wei Jian'an's team emphasizes that challenges related to "noise," "data imbalance," "limited samples," and other complexities are urgent pain points that need attention in this field (Yuan et al., 2023).

At the same time, with the arrival of the big data era, single-modal data analysis has become insufficient for complex tasks. Multimodal learning aims to extract information from multiple data sources and enhance prediction and decision-making by integrating features from various modalities. For example, sentiment analysis can combine the tone and speed of speech with the semantic features of text to comprehensively assess the user's emotional state. Effectively processing multimodal data and organically integrating it can not only improve the predictive power of the model but also enhance the robustness and generalization ability of the system.

Our perception of the world is diverse: we experience it through touch, vision, hearing, smell, and taste. Although some sensory signals are unreliable, humans can extract useful clues from imperfect multimodal inputs and piece together the entire scene of an event. With the development of sensor technology, various forms of data can easily be collected for analysis. To fully utilize the information in each modality, multimodal fusion has become a promising approach for handling imbalanced data or small samples, obtaining accurate and reliable predictions by integrating information from multiple modalities, for example in medical image analysis, autonomous vehicles, and emotion recognition. It also offers significant potential for the imbalanced fault data problem in mechanical fault diagnosis: the data obtained from a single modality are limited, but by integrating information from modalities such as images, videos, and tables, the reliability of the data can be greatly enhanced. Intuitively, fusing information from different modalities makes it possible to explore cross-modal correlations and achieve better performance.

In this article, I introduce a new multimodal fusion framework: the Predictive Dynamic Fusion (PDF) framework. This framework effectively reduces the upper bound of the generalization error and significantly improves the reliability and stability of multimodal systems. Specifically, PDF predicts the collaborative belief (Co-Belief) of each modality, composed of Mono-Confidence and Holo-Confidence, which derive respectively from the intra-modal negative covariance and the inter-modal positive covariance between the fusion weights and the loss. In addition, data quality in open environments changes constantly, so prediction errors are inevitable. To address this, the framework further proposes a relative calibration method that calibrates the predicted Co-Belief from the perspective of the whole multimodal system, meaning that the relative advantage of each modality should change dynamically with the quality of the other modalities rather than remain static. Experiments show that this method generalizes well
and achieves excellent results on multiple datasets [10].
  2. Theory

In this section, the basic setup and formulas of multimodal fusion are first illustrated. Next, the formula for the generalization error bound is revisited, and its connection to the fusion weights is established, revealing the theoretical guarantee for reducing the upper limit of the generalization error. Finally, a predictable dynamic fusion framework that meets the above theoretical analysis is proposed.

  2.1 Basic Settings


Given a multimodal task, let $\mathcal{M}$ be the set of modalities, so that $|\mathcal{M}|$ is its cardinality. The training dataset is $\mathcal{D}_{\text{train}}=\{x_{i}, y_{i}\}_{i=1}^{N} \in \mathcal{X} \times \mathcal{Y}$, where $N$ is the number of samples in $\mathcal{D}_{\text{train}}$, each $x_{i}=\{x_{i}^{m}\}_{m=1}^{|\mathcal{M}|}$ contains $|\mathcal{M}|$ modalities, and $y_{i} \in \mathcal{Y}$ is the corresponding label. The goal of the framework is to design a predictable fusion weight $\omega$ for each modality and achieve robust multimodal fusion. The unimodal projection function $f^{m}: \mathcal{X} \rightarrow \mathcal{Y}$, with $m \in \mathcal{M}$, is trained while its fusion weight $\omega^{m}$ is adjusted dynamically during training. Decision-level multimodal fusion is:
$$f(x)=\sum_{m=1}^{|\mathcal{M}|} \omega^{m} \cdot f^{m}\left(x^{m}\right)$$
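As a minimal illustration of this decision-level fusion, the sketch below sums the per-modality class logits weighted by per-sample fusion weights; the modality names and tensor shapes are illustrative assumptions, not part of the original framework.

```python
import torch

def decision_level_fusion(unimodal_logits, weights):
    """Weighted decision-level fusion: f(x) = sum_m w_m * f_m(x_m).

    unimodal_logits: list of tensors, each of shape (batch, num_classes),
                     one per modality.
    weights: tensor of shape (batch, num_modalities) holding one fusion
             weight per modality and sample.
    """
    fused = torch.zeros_like(unimodal_logits[0])
    for m, logits in enumerate(unimodal_logits):
        fused = fused + weights[:, m:m + 1] * logits
    return fused

# Toy usage with two hypothetical modalities and 3 classes.
logits_a = torch.randn(4, 3)                    # e.g., vibration-signal branch
logits_b = torch.randn(4, 3)                    # e.g., image branch
w = torch.softmax(torch.randn(4, 2), dim=1)     # per-sample modality weights
print(decision_level_fusion([logits_a, logits_b], w).shape)  # torch.Size([4, 3])
```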

  2.2 Generalization Error Upper Bound


The generalization error bound (GEB) is an important concept in machine learning, referring to an upper bound on a model's error on unseen data (Zhang et al., 2023) [11]. Generally, the smaller the generalization error bound, the better the model's generalization ability, i.e., the better it performs on the unknown joint distribution. For binary classification, the generalization error (GE) of a model f can be defined as:
$$\operatorname{GE}(f)=\mathbb{E}_{(x, y) \sim \mathcal{D}}[\ell(f(x), y)]$$

where $\ell$ is the convex log-sigmoid loss function and $\mathcal{D}$ is the unknown data distribution.

  2.3 Confidence Level

  2.3.1 Single-Modality Confidence (Mono-Confidence)


Using the loss directly as the fusion weight for each modality presents significant challenges. Notably, the loss is progressively minimized as training proceeds.

In this regime, even a slight deviation in the loss can cause substantial disturbances. This sensitivity to small errors in loss estimation may undermine the stability and effectiveness of the weights. Furthermore, the loss ranges from zero to positive infinity, making precise prediction extremely difficult. To mitigate these challenges, the method replaces the loss with the probability of the true class label, $p_{\text{true}} \in [0,1]$, which is inversely related to the loss through $\ell=-\log p_{\text{true}}$.


By analyzing the properties of $p_{\text{true}}$, it was found that it reflects the confidence of a modality, as elaborated in earlier work (Corbière et al., 2019) [12]. Using $p_{\text{true}}$ as the fusion weight not only helps to reduce the upper bound of the generalization error but also provides a theoretical guarantee for dynamic multimodal fusion. Since the predictable $p_{\text{true}}$ considers only the confidence of the current modality, it is defined as the Mono-Confidence:
$$\text{Mono-Conf}^{m}=\hat{p}_{\text{true}}^{m}$$

Here $\hat{p}_{\text{true}}^{m}$ denotes the prediction of $p_{\text{true}}^{m}$, because true labels are not available in the testing phase.
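During training, $p_{\text{true}}^{m}$ can be read off the softmax output at the ground-truth class; at test time the framework uses a predicted $\hat{p}_{\text{true}}^{m}$ instead, which is not shown in this minimal sketch (tensor shapes are illustrative).

```python
import torch
import torch.nn.functional as F

def mono_confidence(logits, labels):
    """Mono-Conf^m = p_true^m: softmax probability assigned to the true class.

    logits: (batch, num_classes) output of one unimodal branch.
    labels: (batch,) ground-truth class indices (available only in training).
    """
    probs = F.softmax(logits, dim=1)
    p_true = probs.gather(1, labels.unsqueeze(1)).squeeze(1)  # (batch,)
    return p_true

# The corresponding loss satisfies l = -log(p_true).
logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 1])
p_true = mono_confidence(logits, labels)
loss = -torch.log(p_true)   # equals F.cross_entropy(logits, labels, reduction="none")
```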

  2.3.2 Overall Confidence (Holo-Confidence)


To construct the weight of a given modality, consider using the sum of the losses of the other modalities. Based on the relation $\ell=-\log p_{\text{true}}$, $\ell$ can again be replaced with $p_{\text{true}}$. Because this term captures cross-modal interactions of $p_{\text{true}}$, it is defined as the Holo-Confidence:
$$\text{Holo-Conf}^{m}=\frac{\sum_{j \neq m}^{|\mathcal{M}|} \hat{\ell}^{j}}{\sum_{i=1}^{|\mathcal{M}|} \hat{\ell}^{i}}=\frac{\log \prod_{j \neq m} \hat{p}_{\text{true}}^{j}}{\log \prod_{i=1}^{|\mathcal{M}|} \hat{p}_{\text{true}}^{i}}$$

where $\hat{\ell}^{i}$ and $\hat{\ell}^{j}$ are the predictions of $\ell^{i}$ and $\ell^{j}$, respectively.
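A minimal sketch of Holo-Confidence computed from (predicted) per-modality losses; the tensors and values below are illustrative only.

```python
import torch

def holo_confidence(pred_losses, m):
    """Holo-Conf^m: ratio of the other modalities' predicted losses to the total.

    pred_losses: (batch, num_modalities) predicted losses l_hat^i = -log p_hat_true^i.
    m: index of the modality whose Holo-Confidence is computed.
    """
    total = pred_losses.sum(dim=1)              # sum over all modalities
    others = total - pred_losses[:, m]          # sum over j != m
    return others / total.clamp_min(1e-8)

# Two-modality toy example: the modality with the smaller own loss
# receives the larger Holo-Confidence.
losses = torch.tensor([[0.2, 1.5],
                       [0.9, 0.3]])
print(holo_confidence(losses, m=0))             # tensor([0.8824, 0.2500])
```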

  2.3.3 Collaborative Confidence (Co-Belief)


Since the Mono-Confidence and Holo-Confidence promote collaborative interaction between modalities, the collaborative confidence (Co-Belief) is defined as a linear combination of the predictable Mono-Confidence and Holo-Confidence and can be used as the final fusion weight:

$$\text{Co-Belief}^{m}=\text{Mono-Conf}^{m}+\text{Holo-Conf}^{m} \qquad (5)$$

  3. Method

To achieve reliable predictions, a relative calibration strategy is proposed to calibrate the predicted Co-Belief and address the inevitable uncertainty. With this reliable prediction, the multimodal fusion framework is referred to as Predictive Dynamic Fusion (PDF).

It is worth noting that in open environments, data quality often changes dynamically, leading to inevitable uncertainty in predictions. To reduce the potential uncertainty of collaborative beliefs in complex scenarios, a method called relative calibration (RC) is further proposed, which calibrates the collaborative beliefs of predictions from the perspective of multimodal systems. This means that the relative advantages of each modality should change dynamically with variations in the quality of other modalities, rather than changing statically.


First, the distribution uniformity $\mathrm{DU}^{m}$ of the m-th modality in the multimodal system is defined as
$$\mathrm{DU}^{m}=\frac{1}{C} \sum_{i=1}^{C}\left|\operatorname{Softmax}\left(f^{m}\left(x^{m}\right)\right)_{i}-\mu\right|$$

where $C$ is the number of classes and $\mu=\frac{1}{C}$ is the average probability. The probability distribution after the softmax provides key insight into the model's uncertainty: a uniform distribution usually indicates high uncertainty, while a peaked distribution indicates low uncertainty in the prediction (Huang et al., 2021a) [13].
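A minimal sketch of the distribution-uniformity term, computed directly from one branch's logits (shapes are illustrative).

```python
import torch
import torch.nn.functional as F

def distribution_uniformity(logits):
    """DU^m = (1/C) * sum_i |Softmax(f^m(x^m))_i - 1/C|.

    logits: (batch, C) logits of one unimodal branch.
    Returns a (batch,) tensor; larger values indicate a more peaked
    (less uncertain) distribution, smaller values indicate more uncertainty.
    """
    probs = F.softmax(logits, dim=1)
    C = probs.shape[1]
    return (probs - 1.0 / C).abs().mean(dim=1)

# A confident (peaked) prediction yields a larger DU than a nearly flat one.
print(distribution_uniformity(torch.tensor([[8.0, 0.0, 0.0],
                                            [0.1, 0.0, 0.1]])))
```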

Considering the ever-changing environment, the uncertainty of different modalities in a multimodal system should be relative, meaning that the uncertainty of each modality should change dynamically with the variations in the uncertainties of other modalities. One modality should dynamically perceive the changes in other modalities and adjust its relative contribution to the multimodal system. Therefore, this framework introduces relative calibration (RC) to calibrate the relative uncertainty of each modality. The relative calibration of the m-th modality can be formulated as follows (for the two-modality case, with $m, n \in \mathcal{M}$):
$$\mathrm{RC}^{m}=\frac{\mathrm{DU}^{m}}{\mathrm{DU}^{n}}=\frac{\sum_{i=1}^{C}\left|\operatorname{Softmax}\left(f^{m}\left(x^{m}\right)\right)_{i}-\mu\right|}{\sum_{i=1}^{C}\left|\operatorname{Softmax}\left(f^{n}\left(x^{n}\right)\right)_{i}-\mu\right|}$$

Considering real-world factors, $\mathrm{RC}^{m}$ takes an asymmetric form to further calibrate the Co-Belief. Specifically, a modality with $\mathrm{RC}^{m}<1$ is assumed to have greater uncertainty and tends to produce relatively unreliable predictions of $\hat{p}_{\text{true}}^{m}$ (Gawlikowski et al., 2023), so the accuracy of its Co-Belief carries potential risk. Therefore, the contribution of such a modality is reduced by multiplying its predicted Co-Belief by $\mathrm{RC}^{m}$ ($\mathrm{RC}^{m}<1$). In contrast, a modality with $\mathrm{RC}^{m}>1$ is considered to have lower uncertainty and an accurate Co-Belief, so its contribution is kept unchanged to reduce optimization difficulty. Based on this, the asymmetric calibration term is defined as:
$$k^{m}= \begin{cases}\mathrm{RC}^{m}=\dfrac{\mathrm{DU}^{m}}{\mathrm{DU}^{n}} & \text{if } \mathrm{DU}^{m}<\mathrm{DU}^{n} \\ 1 & \text{otherwise}\end{cases}$$
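A minimal sketch of the asymmetric calibration term for the two-modality case, following the case distinction above (names and values are illustrative).

```python
import torch

def calibration_factor(du_m, du_n):
    """k^m = DU^m / DU^n if DU^m < DU^n, otherwise 1.

    du_m, du_n: (batch,) distribution-uniformity values of the two modalities.
    The more uncertain modality (smaller DU) is down-weighted; the other
    keeps its Co-Belief unchanged.
    """
    rc = du_m / du_n.clamp_min(1e-8)
    return torch.where(du_m < du_n, rc, torch.ones_like(rc))

du_image = torch.tensor([0.10, 0.40])
du_text = torch.tensor([0.30, 0.20])
print(calibration_factor(du_image, du_text))   # tensor([0.3333, 1.0000])
```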

Applying the asymmetric calibration strategy to the Co-Belief of the m-th modality, the calibrated Co-Belief (CCB) is:
$$\mathrm{CCB}^{m}=\text{Co-Belief}^{m} \cdot k^{m}$$

The framework uses the CCB of each modality as its fusion weight in the multimodal system
$$f(x)=\sum_{m=1}^{|\mathcal{M}|} \omega^{m} \cdot f^{m}\left(x^{m}\right)=\sum_{m=1}^{|\mathcal{M}|} \operatorname{Softmax}\left(\mathrm{CCB}^{m}\right) \cdot f^{m}\left(x^{m}\right)$$
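A hedged end-to-end sketch of the fusion step for the two-modality case, with illustrative tensor shapes: the Co-Belief is formed from Mono- and Holo-Confidence, calibrated asymmetrically, and passed through a softmax across modalities to obtain the fusion weights.

```python
import torch

def pdf_fuse(unimodal_logits, mono_conf, holo_conf, du):
    """Fuse unimodal decisions with calibrated Co-Belief (CCB) weights.

    unimodal_logits: list of (batch, C) tensors, one per modality.
    mono_conf, holo_conf, du: (batch, 2) tensors with Mono-Confidence,
        Holo-Confidence and distribution uniformity (two-modality case).
    """
    co_belief = mono_conf + holo_conf                       # Eq. (5)
    # Asymmetric calibration: down-weight the more uncertain modality.
    du_other = du.flip(dims=[1])                            # swap the two columns
    rc = du / du_other.clamp_min(1e-8)
    k = torch.where(du < du_other, rc, torch.ones_like(rc))
    ccb = co_belief * k
    weights = torch.softmax(ccb, dim=1)                     # softmax across modalities
    fused = sum(weights[:, m:m + 1] * logits
                for m, logits in enumerate(unimodal_logits))
    return fused, weights
```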

As described above, the predictive dynamic fusion framework consists of seven parts, as shown in Figure 1: input data preparation, Mono-Confidence calculation, Holo-Confidence calculation, Co-Belief calculation, the relative calibration strategy, calibrated Co-Belief calculation, and multimodal fusion. The framework principle is shown in Figure 2.
  Figure 1. Predictive Dynamic Fusion Framework

  4. Experiment

  4.1 Data Processing

  4.1.1 Select dataset


Choose a task dataset that contains multimodal data and has imbalanced labels. Multimodal methods are often used in sentiment analysis and medical diagnosis, so multimodal datasets in these areas can be utilized, with modalities mainly including images, videos, text, and tables.

  4.1.2 Data Preprocessing

  Preprocessing steps based on data types:
  Image data:

Resize: adjust all images to a fixed size (e.g., $224 \times 224$ pixels). Normalization: normalize pixel values to $[0,1]$ or standardize them (subtract the mean, divide by the standard deviation). Data augmentation: apply operations such as rotation, flipping, and cropping to increase the diversity of the training data.
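A minimal image-preprocessing sketch assuming PyTorch/torchvision; the specific sizes, angles, and normalization statistics are illustrative choices, not prescribed by the framework.

```python
from torchvision import transforms

# Training pipeline: augmentation + resize + normalization.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),            # augmentation: flipping
    transforms.RandomRotation(degrees=10),        # augmentation: rotation
    transforms.RandomResizedCrop(224),            # augmentation: cropping + resize
    transforms.ToTensor(),                        # scales pixel values to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # standardize per channel
                         std=[0.229, 0.224, 0.225]),
])

# Evaluation pipeline: deterministic resize + normalization only.
eval_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```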

Figure 2. Framework schematic: (a) PDF (Ours); (b) Static Late Fusion

  Text data:

Text cleaning: remove stop words, punctuation, non-alphabetic characters, etc. Tokenization: convert the text into words or subwords. Word vectorization: use pre-trained word vectors (such as GloVe or Word2Vec) or encode the text with a pre-trained model such as BERT.
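For the text-preprocessing steps above, a minimal sketch assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (both are illustrative assumptions).

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

texts = ["bearing vibration rises sharply", "tool wear within normal range"]
batch = tokenizer(texts, padding=True, truncation=True,
                  max_length=128, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**batch)

# Use the [CLS] embedding as the text feature f_text.
f_text = outputs.last_hidden_state[:, 0, :]    # shape (batch, 768)
```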

  Table data:

Missing value imputation: preprocess the missing parts of tabular data using self-supervised learning methods or common interpolation methods (such as mean, median, or KNN imputation) [14]; in particular, missing values can be filled in by model predictions. Standardization: standardize numerical features to ensure consistent scales across features [15].
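A minimal sketch of the tabular preprocessing above, assuming scikit-learn; KNN imputation and standardization are shown, with mean or median imputation (SimpleImputer) as simple alternatives.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [np.nan, 180.0],
              [4.0, 220.0]])

# KNN imputation; alternatives: SimpleImputer(strategy="mean") or "median".
X_imputed = KNNImputer(n_neighbors=2).fit_transform(X)

# Standardization: zero mean, unit variance per feature.
X_scaled = StandardScaler().fit_transform(X_imputed)
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```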
  Label data:
  Label encoding: for classification tasks, convert the labels using one-hot encoding (One-Hot Encoding) or label encoding (Label Encoding).
  4.2 Predictive Dynamic Fusion Framework
  4.2.1 Weight Calculation

For image and text features $f_{\text{image}}$ and $f_{\text{text}}$, the weight of each modality is calculated through a fully connected (FC) layer or a self-attention mechanism.

This mechanism dynamically adjusts weights based on the features of the current modality and the context of the task.
$$\alpha_{\text{image}}=\frac{\exp\left(W_{\text{image}} f_{\text{image}}\right)}{\exp\left(W_{\text{image}} f_{\text{image}}\right)+\exp\left(W_{\text{text}} f_{\text{text}}\right)}, \qquad \alpha_{\text{text}}=1-\alpha_{\text{image}}$$

where $W_{\text{image}}$ and $W_{\text{text}}$ are learnable weight matrices representing the contributions of the image and text features.
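A minimal sketch of the weight computation above, assuming each modality feature is projected to a scalar score by a learnable linear map (standing in for $W_{\text{image}}$ and $W_{\text{text}}$) and the two scores are normalized with a softmax.

```python
import torch
import torch.nn as nn

class ModalityWeighting(nn.Module):
    """Computes alpha_image and alpha_text from f_image and f_text."""

    def __init__(self, dim_image, dim_text):
        super().__init__()
        self.w_image = nn.Linear(dim_image, 1)   # plays the role of W_image
        self.w_text = nn.Linear(dim_text, 1)     # plays the role of W_text

    def forward(self, f_image, f_text):
        scores = torch.cat([self.w_image(f_image),
                            self.w_text(f_text)], dim=1)   # (batch, 2)
        alphas = torch.softmax(scores, dim=1)
        # alpha_text = 1 - alpha_image holds by construction of the softmax.
        return alphas[:, 0], alphas[:, 1]

# Toy usage with hypothetical feature sizes.
weighting = ModalityWeighting(dim_image=512, dim_text=768)
a_img, a_txt = weighting(torch.randn(4, 512), torch.randn(4, 768))
```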

After the weight of each modality is calculated, the modalities are fused dynamically according to the predictive dynamic fusion framework described above, with the dynamic fusion weights computed from the various confidence measures.

  4.2.2 Classification Layer

  The fused feature $f_{\text{fusion}}$ is fed into a fully connected layer for classification:
$$\widehat{y}=\operatorname{softmax}\left(W_{\text{fusion}} f_{\text{fusion}}\right)$$

where $W_{\text{fusion}}$ is the weight matrix of the fully connected layer and $\widehat{y}$ is the predicted class probability, from which the predicted category label is obtained [16].
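A minimal sketch of the classification layer; the fused feature size and number of classes are hypothetical.

```python
import torch
import torch.nn as nn

num_classes = 5          # hypothetical number of fault categories
fusion_dim = 256         # hypothetical fused feature size

classifier = nn.Linear(fusion_dim, num_classes)   # W_fusion (plus bias)

f_fusion = torch.randn(8, fusion_dim)
logits = classifier(f_fusion)
y_hat = torch.softmax(logits, dim=1)              # predicted class probabilities
pred_labels = y_hat.argmax(dim=1)                 # predicted category labels
```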

  4.2.3 Model Training

  Establish the loss function
  Use cross-entropy loss function to optimize multi-class problems:
$$\mathcal{L}=-\sum_{i=1}^{N} y_{i} \cdot \log\left(\widehat{y}_{i}\right)$$

where $y_{i}$ is the true label and $\widehat{y}_{i}$ is the predicted probability for sample $i$.
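A minimal training-step sketch using the cross-entropy loss above; note that torch.nn.functional.cross_entropy applies the softmax internally, so it is given the pre-softmax logits (the model, optimizer, and sizes are illustrative).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, fusion_dim = 5, 256                  # hypothetical sizes
classifier = nn.Linear(fusion_dim, num_classes)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

def train_step(f_fusion, labels):
    """One optimization step on a batch of fused features and integer labels."""
    logits = classifier(f_fusion)                 # (batch, num_classes)
    loss = F.cross_entropy(logits, labels)        # equivalent to -sum_i y_i * log(y_hat_i)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

loss_value = train_step(torch.randn(8, fusion_dim),
                        torch.randint(0, num_classes, (8,)))
```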