
Flow-MAE: Leveraging Masked AutoEncoder for Accurate, Efficient and Robust Malicious Traffic Classification

Zijun Hang, National University of Defense Technology, China, hangzijun17@nudt.edu.cn
Yuliang Lu, National University of Defense Technology, China, luyuliang@nudt.edu.cn
Yongjie Wang, National University of Defense Technology, China, w_yong_j@189.cn
Yi Xie, National University of Defense Technology, China, heilongjiangxieyi@163.com

DOI: https://doi.org/10.1145/3607199.3607206

RAID '23: The 26th International Symposium on Research in Attacks, Intrusions and Defenses, Hong Kong, China, October 2023

Malicious traffic classification is crucial for Intrusion Detection Systems (IDS). However, traditional Machine Learning approaches necessitate expert knowledge and a significant amount of well-labeled data. Although recent studies have employed pre-training models from the Natural Language Processing domain, such as ET-BERT, for traffic classification, their effectiveness is impeded by limited input length and fixed Byte Pair Encoding.

To address these challenges, this paper presents Flow-MAE, a pre-training model that employs Masked AutoEncoders (MAE) from the Computer Vision domain to achieve accurate, efficient, and robust malicious network traffic classification. Flow-MAE overcomes these challenges by utilizing burst (a generic representation of network traffic) and patch embedding to accommodate extensive traffic length. Moreover, Flow-MAE introduces a self-supervised pre-training task, the Masked Patch Model, which captures unbiased representations from bursts with varying lengths and patterns.

Experimental results from six datasets reveal that Flow-MAE achieves new state-of-the-art accuracy (>0.99), efficiency (>900 samples/s), and robustness across diverse network traffic types. In comparison to the state-of-the-art ET-BERT, Flow-MAE exhibits improvements in accuracy and speed by 0.41%-1.93% and 7.8x-10.3x, respectively, while necessitating only 0.2% FLOPs and 44% memory overhead. The efficacy of the core designs is validated through few-shot learning and ablation experiments. The code is publicly available at https://github.com/NLear/Flow-MAE.

CCS Concepts: • Security and privacy → Network security; • Computing methodologies → Artificial intelligence; • Security and privacy → Intrusion detection systems;

Keywords: Malicious Traffic Classification, Masked AutoEncoder, Pre-training Model, Masked Patch Model

ACM Reference Format:
Zijun Hang, Yuliang Lu, Yongjie Wang, and Yi Xie. 2023. Flow-MAE: Leveraging Masked AutoEncoder for Accurate, Efficient and Robust Malicious Traffic Classification. In The 26th International Symposium on Research in Attacks, Intrusions and Defenses (RAID '23), October 16-18, 2023, Hong Kong, China. ACM, New York, NY, USA, 18 pages. https://doi.org/10.1145/3607199.3607206

1 INTRODUCTION

Malicious traffic classification is a crucial network security mechanism for Intrusion Detection Systems (IDS) [49]. This multi-class classification problem aims to distinguish various types of malicious network traffic to uncover network security vulnerabilities [28]. Machine Learning (ML) has emerged as a promising network security paradigm [3, 7], particularly for malicious traffic classification, complementing traditional rule or fingerprint-based approaches [43].

However, conventional ML techniques require expert knowledge and a substantial amount of well-labeled data to extract effective traffic features [24]. Since the majority of network traffic is benign, identifying malicious traffic presents a considerable challenge [30]. Furthermore, approaches that focus primarily on specific tasks limit their transferability, necessitating redesign or retraining for new tasks [39]. Consequently, these methods exhibit inefficiency, inaccuracy, and suboptimal robustness capabilities.

In recent years, transformer-based [48] Deep Learning (DL) models have emerged as a promising approach beyond the traditional ML-based paradigm. These models have shown remarkable progress in various applications, inspired by the success of pre-training models [53] in the fields of Natural Language Processing (NLP) [6, 8] and Computer Vision (CV) [9, 16]. Recently, ET-BERT [27] has applied BERT [8], a pre-training model in NLP, to traffic classification and achieved state-of-the-art performance.

Although significant progress has been made, there are two areas where ET-BERT could be further improved: (a) Limited by the BERT model's input length, ET-BERT adopts a network flow length of 128 bytes, which may not adequately represent extensive network traffic ranging from KBs to MBs. However, the 128-byte input length also results in increased computation and memory overheads for the ET-BERT model. (b) ET-BERT utilizes Byte Pair Encoding (BPE) [36] to convert the network traffic byte stream into BERT input tokens. This requires the use of a fixed dictionary, potentially reducing robustness in the face of varying traffic patterns.

Innovative pre-training models from the CV domain, such as Masked AutoEncoders (MAE) [16], have emerged as promising alternatives for malicious traffic classification. MAE can learn deep latent state representations of unlabeled images through a self-supervised task by randomly masking partial patches of input images and reconstructing the missing pixels. Flow-MAE employs an asymmetric encoder-decoder architecture similar to MAE. The encoder operates solely on the visible subset of the byte stream to generate latent representations, from which a lightweight decoder reconstructs the masked bytes. Experiments show that masking a modest proportion of patches (e.g., 15%) creates a meaningful pre-training task.

Flow-MAE follows a two-stage process, including pre-training and subsequent fine-tuning: (1) Initially, the MAE model is pre-trained on a large volume of unlabeled traffic [38], in order to learn in-depth, unbiased, contextualized datagram-level representations of traffic autonomously. (2) Thereafter, the Flow-MAE model is fine-tuned on a small amount of task-specific labeled data, increasing its effectiveness for a designated downstream task.

The primary contributions of this paper are summarized as follows:

  1. We present Flow-MAE, a pre-training framework that utilizes a Masked AutoEncoder for malicious traffic classification. Flow-MAE acquires unbiased datagram representations from a large volume of unlabeled traffic and subsequently fine-tunes on a downstream task.

  2. We use burst (§ 3.2), a generic representation for network traffic, in conjunction with patch embedding (§ 3.3) to adapt extensive traffic to Flow-MAE model input. This approach bridges the gap between large-scale traffic data and the truncated input sequence in Flow-MAE (limited to 32 bytes).

  3. We propose a generic self-supervised pre-training task, the Masked Patch Model (MPM) (§ 3.4), to capture contextual interdependence and achieve aligned generic representations from bursts of varying lengths.

  4. We conduct experiments on six downstream datasets, comparing Flow-MAE with existing approaches to validate the core designs. Results show that integrating these designs enables efficient and effective training of the high-capacity MAE model. Flow-MAE achieves new state-of-the-art accuracy exceeding 99%, an efficiency of 900 samples/s, and impressive robustness on six downstream datasets. It outperforms the state-of-the-art ET-BERT in accuracy and speed by 0.41%-1.93% and 7.8x-10.3x, respectively, while only requiring 0.2% FLOPs and 44% memory overhead. The code is publicly available at https://github.com/NLear/Flow-MAE.

Figure 1: Flow-MAE Architecture Overview

2 BACKGROUND AND MOTIVATIONS

2.1 Existing Methods

Malicious traffic classification refers to the task of identifying and categorizing network traffic that is associated with malicious or suspicious activities. It involves identifying malicious intent, such as malware infections, network intrusions, denial-of-service attacks, or data exfiltration attempts, protecting legitimate network users from concealed malware and intrusions. In general, there are three primary types of malicious traffic classification methodologies.

2.1.1 Fingerprint Methods. Fingerprint methodologies, such as FlowPrint [47], initially build a database of spatial and temporal features extracted from numerous malicious flows. Consequently, hidden malicious flows can be detected by matching fingerprints with those in the database. The effectiveness of flow classification depends on the integrity of the fingerprint database and the quality of real-time flow fingerprinting.

2.1.2 Statistical Machine Learning Methods. Statistical Machine Learning (ML) approaches extract a range of spatial or temporal statistical features to form the feature vector, then employ ML algorithms to model the feature distributions of network flows. Over the past decade, the intersection of network security and ML methodologies has produced groundbreaking outcomes, including AppScanner [45], CUMUL [33], BIND [1] and k-fingerprinting (K-fp) [14].

However, traditional ML-based methods rely on accurate feature extraction and a significant amount of labeled malicious data. Given that only a small portion of all traffic constitutes real malicious traffic, manually extracting well-labeled malicious traffic is a daunting task, let alone obtaining valuable and accurate features.

2.1.3 Deep Learning Methods. Deep Learning (DL) constitutes an emerging paradigm for malicious traffic detection, offering end-to-end solutions that surpass traditional statistical Machine Learning (ML) techniques. These models facilitate automatic feature extraction by employing raw flow data as input and integrate feature learning and classification within a unified pipeline. Existing DL models include DF [41], which employs Convolutional Neural Networks (CNNs), FS-Net [29] using AutoEncoders, TSCRNN [26] utilizing Recurrent Neural Networks (RNNs), as well as Deeppacket [31] relying on AutoGluon [11].

However, DL models require a considerable amount of well-labeled malicious traffic for convergence. Moreover, these models may face difficulties when applied to different flow schemas. As a result, a complete redesign or re-training of the model is needed when adapting the model to a new task or dataset, leading to sub-optimal robustness.

In summary, the problems of efficacy, efficiency, and robustness in the aforementioned methods remain unresolved.

2.2 Pre-training Models

2.2.1 Benefits of Pre-training Models. Recently, transcending traditional ML-based approaches, pre-training techniques [53] for DL models have emerged. Pre-training models can autonomously learn unbiased data representations from vast amounts of unlabeled data and subsequently transfer to downstream tasks through fine-tuning over limited epochs on minimal labeled data. During fine-tuning, the models adapt the previously learned representations to the specific task at hand, incorporating the labeled data to refine their performance. Since the pre-training phase is independent of any specific task or labels, the representations obtained from Flow-MAE are considered unbiased, as they are learned solely from the raw data without any specific task-related bias.

These unbiased representations enable Flow-MAE to generalize well to different downstream tasks, as they capture the essential features and patterns present in the data. This transferability of representations from pre-training to fine-tuning allows to achieve high performance even with limited labeled data. Numerous pre-training models favor the transformer [48] backbone, a DL model based on a self-attention mechanism. The transformer has two primary branches: the BERT [8] (Bidirectional Encoder Representations from Transformers) Model in NLP and the MAE (Masked AutoEncoder) model [16] in CV.

2.2.2 BERT Model in NLP. Contemporary research has implemented BERT-like pre-training architectures for traffic classification, yielding substantial advancements. PERT [15] adapts the ALBERT [22] pre-training model for traffic classification, attaining a 93.23% F1 score on the ISCX-VPN-2016 dataset [10]. The state-of-the-art ET-BERT [27] incorporates a bespoke Byte Pair Encoding representation for encrypted traffic and two pre-training tasks based on the BERT [8] model, enhancing the F1 score to 99%. Both PERT and ET-BERT extend BERT-like models to the networking domain, exemplifying the merits of pre-training architectures when capitalizing on copious unlabeled traffic data.

Despite the remarkable accomplishments attained so far, it is evident that two refinements can be implemented for ET-BERT: (a) Constrained by the BERT model's input dimensions, ET-BERT employs a 128-byte sequence to represent network traffic; this is substantially shorter than the majority of network traffic (ranging from kilobytes to megabytes) and might inadequately represent extensive network traffic. Furthermore, by adopting solely packet payloads, the packet headers containing temporal and spatial features of the traffic are discarded, leading to accuracy loss. Nonetheless, the 128-byte input length still incurs elevated computation and memory overheads for the ET-BERT model. (b) ET-BERT relies upon the Byte Pair Encoding (BPE) [36] technique to convert the network traffic byte stream into BERT input tokens, utilizing a fixed dictionary, which consequently results in diminished performance under disparate traffic patterns.

2.2.3 MAE-like Model in CV. The Masked AutoEncoder (MAE) [16] constitutes a pre-training architecture with a transformer [48] backbone in the Computer Vision (CV) domain. This model employs a transformer autoencoder to learn deep latent state representations by randomly masking input image patches and subsequently reconstructing the masked regions. The image size in the MAE model surpasses the sentence length in the BERT model by orders of magnitude, thereby facilitating the processing of extended input sequences. Furthermore, MAE incorporates a versatile convolutional embedding layer as opposed to the fixed BPE encoding layer, which encodes sequences utilizing a rigid vocabulary. This feature enables the model to adapt seamlessly to varying input patterns.

3 FLOW-MAE SYSTEM DESIGN

3.1 Model Architecture Overview

Motivated by the Masked AutoEncoders (MAE) pre-training model from the CV domain, we introduce Flow-MAE, a MAE designed as a versatile self-supervised learner for classifying malicious network traffic. Flow-MAE adheres to a two-stage process comprising preliminary pre-training and subsequent fine-tuning within the scope of self-supervised learning. The architectural design is delineated in Figure 1.

3.1.1 Preprocessing. Flow-MAE adopts burst-level traffic representation. An initial preprocessing phase, termed burst representation (§ 3.2), converts the traffic datagram into bursts. Subsequently, the patch embedding step (§ 3.3) projects these bursts into patches, conforming to the input dimensions of the MAE model.

3.1.2 Pre-training Procedure (Section 3.4). The Flow-MAE model undergoes pre-training on unlabeled background traffic, facilitating the acquisition of deep contextualized burst-level unbiased representations from large-scale unlabeled network data.

Flow-MAE employs a transformer autoencoder architecture, consisting of multiple encoder and decoder layers, each containing multi-head self-attention blocks. During the pre-training process, randomly selected input patches are masked. The encoder focuses exclusively on visible patches, capturing latent relationships and generating an output representation for the burst. The decoder then attempts to reconstruct the masked patches using the encoder output. The optimization goal is to minimize the reconstruction loss.

3.1.3 Fine-tuning Procedure (§ 3.5). Upon completing the pre-training phase, the Flow-MAE encoder is capable of producing a deep latent representation for any given burst. The classifier model inherits the encoder component of the transformer autoencoder (including its architecture and weights) and replaces the decoder with a fully connected linear classifier.

By fine-tuning the classifier model on a limited amount of task-specific labeled traffic, it can be customized for downstream applications. When exposed to target-specific traffic, the classifier model can accurately determine its classification.

3.2 Burst Representation Preprocessing

Network traffic comprises numerous flows with diverse types (e.g., various applications, protocols, or services) and properties (e.g., benign and malicious). To achieve a consistent representation and fit the model's input length, preliminary traffic segmentation is vital.

We select the burst as the fundamental unit for malicious traffic classification and preprocess the traffic in three distinct phases: (1) session splitting (§ 3.2.1): dividing the traffic into separate sessions; (2) burst splitting (§ 3.2.2): subdividing a session into multiple bursts to create model input sequences; and (3) burst padding (§ 3.2.3): generating masks for bursts of different lengths and padding all bursts to achieve a uniform length. Consequently, each burst originates from a specific traffic category and is prepared for model input.

3.2.1 Session split. Initially, we separate the entire traffic into sessions, according to the 5-tuple: source IP address, destination IP address, protocol type, source port, and destination port. The definition of a session is as follows:

Definition 3.1 (Session). Given network traffic $P = \{p_i \mid i \in \mathbb{N}^+\}$ comprising multiple packets $p_i$, a session $S = \{p_j \mid p_j \in P, j \in \mathbb{N}^+\} \subseteq P$ denotes those bidirectional packets $p_j$ of a specific protocol transmitted between two ports on two hosts. The session length $|S| = n_s$ represents the number of packets $n_s$ within session $S$.

As session packets are bidirectional, a session includes request and response packets of a specific application exchanged between two hosts. For example, a standard network application communicates via the TCP protocol, establishing a connection between ports on two hosts. A virus might also create a connection to enable lateral movement within a local network. A session can characterize an application since different sessions may possess unique and intrinsic features that facilitate their differentiation.

However, in certain scenarios (e.g., P2P downloading, video calling, and live streaming), the session length can be considerable (reaching MBs or even more). Therefore, further splitting is necessary to accommodate the model input length.
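
As a concrete illustration of the session split, the following Python sketch groups already-parsed packets by a direction-agnostic 5-tuple; the packet dictionary layout and helper names are assumptions for illustration, not the paper's implementation.

```python
from collections import defaultdict

def session_key(src_ip, dst_ip, proto, src_port, dst_port):
    """Build a direction-agnostic 5-tuple key so that request and
    response packets of the same connection map to one session."""
    a = (src_ip, src_port)
    b = (dst_ip, dst_port)
    return (proto,) + (a + b if a <= b else b + a)

def split_sessions(packets):
    """packets: iterable of dicts carrying the 5-tuple fields.
    Returns {session_key: [packet, ...]} preserving capture order."""
    sessions = defaultdict(list)
    for pkt in packets:
        key = session_key(pkt["src_ip"], pkt["dst_ip"], pkt["proto"],
                          pkt["src_port"], pkt["dst_port"])
        sessions[key].append(pkt)
    return sessions
```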

3.2.2 Burst Split. We subsequently divide each session into bursts. Formally, the definition of a burst is:

Definition 3.2 (Burst). Given a session $S = \{p_j \mid p_j \in P, j \in \mathbb{N}^+\}$ comprising multiple packets $p_j$ from network traffic $P$, a burst $B = \{p_k \mid p_k \in S, k \in \mathbb{N}^+\} \subseteq S$ is a subset of bidirectional contiguous network packets within a short time window $\tau$ in the session. We denote the number of bytes in a burst B as the burst length $\ell_B$.

A burst can characterize the pattern of network flow transmission from the application layer perspective; while it is as descriptive as a session, it is more focused on an interval.

We use the raw bytes in an anonymous burst B to construct an input sequence, where each burst B is represented by a sequence of $\ell_B$ bytes. Concretely, we anonymize the burst by removing the MAC address, IP address, and port fields from each packet header. Anonymity is essential, as bursts from the same flow category may share identical addresses or ports; these can be used as explicit identifiers and prevent models from learning latent and unbiased representations of different flow categories. Subsequently, we concatenate the remaining bytes from the burst in packet order so that the temporal relations in the burst are preserved. We also limit the maximum number of bursts generated from the same session, randomly sampling from these bursts to guarantee unbiased burst generation.

A burst is a temporal, application-level representation that facilitates stable, fine-grained, and representative feature extraction for various traffic types. Another advantage of burst representation is that it allows for organizing numerous flows into smaller groups, significantly reducing the sequence length for model input.
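
The burst split and anonymization steps can be sketched as follows, assuming packets arrive as dictionaries with a timestamp and raw bytes, plain Ethernet/IPv4 framing, and a fixed 20-byte IP header; the time window value and the byte offsets are illustrative assumptions, not the paper's exact preprocessing code.

```python
def split_bursts(session_packets, tau=1.0):
    """Group a session's packets into bursts: consecutive packets whose
    inter-arrival gap stays below the time window tau (seconds)."""
    bursts, current, last_time = [], [], None
    for pkt in session_packets:  # packets kept in capture order
        if last_time is not None and pkt["time"] - last_time > tau:
            bursts.append(current)
            current = []
        current.append(pkt)
        last_time = pkt["time"]
    if current:
        bursts.append(current)
    return bursts

def anonymize(pkt_bytes):
    """Drop MAC addresses, IP addresses and ports from a plain
    Ethernet/IPv4/TCP-or-UDP packet (fixed 20-byte IP header assumed)."""
    eth_rest = pkt_bytes[12:14]   # keep EtherType, drop both MAC addresses
    ip_rest = pkt_bytes[14:26]    # first 12 IP-header bytes, drop src/dst IP
    l4_rest = pkt_bytes[38:]      # drop the 4 port bytes at the L4 start
    return eth_rest + ip_rest + l4_rest

def burst_to_bytes(burst):
    """Concatenate anonymized packets, in packet order, into one byte string."""
    return b"".join(anonymize(bytes(p["bytes"])) for p in burst)
```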

However, a disparity remains between the burst and the MAE model input. Different bursts can have varying burst lengths, while the MAE model requires fixed-sized image inputs. Consequently, we introduce a burst cropping and padding scheme, as described below.

3.2.3 Burst Cropping and Padding. Cropping and padding are common techniques in transformer models for NLP and time series analysis (e.g., BERT [8], ALBERT [22], and Informer [54]). Since sentences and time series have varying lengths, input sequences are cropped or padded to a predefined maximum length $\ell_{MAX}$ that the model can process. A mask sequence $M$ of equivalent length ($\ell_M = \ell_{MAX}$) is introduced to indicate the valid sequence from the original input in the cropped or padded sequence.

$$B' = \begin{cases} B \,\|\, 0^{\ell_{MAX} - \ell_B}, & \text{if } \ell_B < \ell_{MAX} \\ B[0 : \ell_{MAX}], & \text{otherwise} \end{cases} \tag{1}$$

$$M = \begin{cases} 1^{\ell_B} \,\|\, 0^{\ell_{MAX} - \ell_B}, & \text{if } \ell_B < \ell_{MAX} \\ 1^{\ell_{MAX}}, & \text{otherwise} \end{cases} \tag{2}$$

Specifically, for a burst B with a length $\ell_B$ smaller than the maximum length $\ell_{MAX}$, we pad $\ell_{MAX} - \ell_B$ zeros at its end to reach the maximum length. The corresponding positions in the mask $M[\ell_B : \ell_{MAX}]$ are filled with 0s, while the remaining positions $M[0 : \ell_B - 1]$ corresponding to the original sequence are filled with 1s. Otherwise, if a burst B is longer than the maximum length $\ell_{MAX}$, we crop it to length $\ell_{MAX}$ by discarding the posterior $\ell_B - \ell_{MAX}$ bytes and fill all $\ell_{MAX}$ positions of the mask with 1s.

It is important to note that the mask used in the padding preprocessing is different from the one applied in Masked Patch Model pre-training. The bytes masked in preprocessing are not calculated or updated during model pre-training. Instead, they remain untouched throughout the pre-training, as explained in § 3.4.
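
A minimal NumPy sketch of eqs. (1)-(2), assuming the burst is already a one-dimensional byte array; the function and argument names are illustrative.

```python
import numpy as np

def crop_or_pad(burst: np.ndarray, l_max: int = 1024):
    """Return (burst', mask) per eqs. (1)-(2): pad short bursts with zeros
    and mark padded positions with 0 in the mask; crop long bursts."""
    l_b = burst.shape[0]
    if l_b < l_max:
        padded = np.concatenate([burst, np.zeros(l_max - l_b, dtype=burst.dtype)])
        mask = np.concatenate([np.ones(l_b, dtype=np.int64),
                               np.zeros(l_max - l_b, dtype=np.int64)])
        return padded, mask
    return burst[:l_max], np.ones(l_max, dtype=np.int64)
```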

3.3 Patch Embedding

3.3.1 Burst to Patch Embedding. In the MAE model, an operation called patch embedding initially uses a convolutional layer to encode the input image. The input image is partitioned uniformly into numerous contiguous, non-overlapping segments, referred to as patches. Following this, each patch is embedded into $\mathbb{R}^d$, where d represents the embedding dimensionality of the MAE model. For example, within the MAE-Base model, an input image measuring 224x224 with a patch dimension of 16x16 yields 14x14=196 patches, each possessing d = 768 dimensions. The input image is thus down-sampled by a factor of 256. This process allows MAE to efficiently process and analyze larger inputs. On the other hand, BERT employs BPE (Byte Pair Encoding), which maps pairs of bytes to a single 768-dimensional vector. BPE encoding differs from patch embedding and does not directly support the incorporation of patch embedding techniques. Therefore, applying patch embedding to the BERT model, as used in MAE, may not be straightforward or feasible due to the fundamental differences in their encoding mechanisms.

Patch embedding is performed for two key reasons in Flow-MAE:

(1) First, as input length increases, the computational and storage requirements of the transformer MAE grow quadratically. As a result, transformer models typically employ a maximum input length of 128 or 256 to minimize computation and memory overhead. For extended input bursts, patch embedding significantly reduces the input sequence length, making it compatible with the MAE's input dimensions. Consequently, more comprehensive burst data, including packet headers, can be incorporated.

(2) Second, bursts exhibit spatially localized sparsity. The convolutional patch embedding layer combines the sparse local data originating from the patches. Considering that the downstream task is traffic classification, which shows lower sensitivity to local semantics, patch embedding can be employed to extract a comprehensive high-level representation of traffic. The sparsity attribute enables the integration of the down-sampling embedding layer, yielding compact encoding and efficient high-level representation learning, while maintaining minimal computation and memory overheads.

In the Flow-MAE framework, each input burst B constitutes a byte stream preprocessed to a fixed length $\ell_B$ as delineated in § 3.2. Treating each burst B as a one-dimensional input sequence $B \in \mathbb{N}^{1 \times \ell_B}$, a one-dimensional convolution layer with a single input channel, $d$ output channels, a kernel size of $K = (1, k)$, and stride $S = (1, k)$ is employed for patch embedding. This process yields $\frac{\ell_B}{k}$ patches $p_i \in \mathbb{R}^d$. Consequently, patch embedding uniformly subdivides a burst B into $|P| = \frac{\ell_B}{k}$ non-overlapping segments $P$.

$$P \leftarrow \mathrm{Conv1D}_{d,k}(B) \quad \text{s.t. } P = \{p_i \mid p_i \in \mathbb{R}^d,\ i = 0, 1, \ldots, \tfrac{\ell_B}{k} - 1\} \tag{3}$$

Owing to the potential padding of bytes to burst B during preprocessing, it becomes essential to ascertain whether a patch is derived from the padded or original bytes in burst B. This distinction enables the calculation of attention scores solely from the original bytes, avoiding disruptions from padded bytes. An attention mask $M'$ is generated from $M$:

$$M' = \{m_i \mid m_i \in \{0, 1\},\ i = 0, 1, \ldots, |P| - 1\}, \tag{4}$$

$$\text{s.t. } m_i = \begin{cases} 0, & \text{if } M[ik : (i+1)k] = 0^k \\ 1, & \text{otherwise} \end{cases} \tag{5}$$

Per eqs. (4) and (5), the patch mask $m_i$ assumes a value of 1 provided that at least one original byte is present within the corresponding patch $p_i$. Conversely, the patch mask $m_i$ is assigned a value of 0 if all bytes within patch $p_i$ are padded.

3.3.2 Positional Embedding. Datagram transmission order may serve as a distinguishing characteristic for certain malware activities. Given that the self-attention block within the transformer layer treats each patch as equivalent, positional embedding is implemented to introduce positional biases into patches. This approach represents the positional information within the sequence and enables the model to discern the spatial relationships among patches.

The positional embedding value for patch $p_i$ is represented by $Pos(i) \in \mathbb{R}^h$, depending on the relative position of $p_i$ within the patches $P$. $Pos(\cdot)$ signifies the embedding function.

$$p_i \leftarrow p_i + Pos(i), \quad Pos(\cdot): \mathbb{R} \rightarrow \mathbb{R}^h \tag{6}$$

3.4 Pre-training Masked Patch Model

3.4.1 Patch Masking. Patch masking randomly masks each patch with a probability of $r$ ($0 < r < 1$). A $\mathrm{Shuffle}$ operation first permutes all patches $P$ at random. Following this, the shuffled patches are divided into two subsets: the final $r|P|$ elements constitute the masked patches $P_m$, while the remaining patches form the visible patches $P_v$.

$$\underbrace{p_0, \ldots, p_{|P| - r|P| - 1}}_{P_v},\ \underbrace{p_{|P| - r|P|}, \ldots, p_{|P| - 1}}_{P_m} \leftarrow \mathrm{Shuffle}(P) \quad \text{s.t. } \begin{cases} P = P_v \cup P_m, & |P_m| = r|P| \\ P_v \cap P_m = \emptyset \end{cases} \tag{7}$$

Only the visible patches $P_v$ are incorporated into the encoder for representation learning, with the masked patches $P_m$ being disregarded. The shuffle operation randomly permutes all the patches in the dataset. This step introduces randomness and diversity into the training process. By shuffling and masking the patches, the model is exposed to different patch arrangements, ensuring that it learns robust representations that are invariant to the specific ordering of patches. This helps prevent the model from relying solely on the spatial arrangement of patches and encourages it to capture more meaningful and generalizable features.

Afterwards, a class patch $p_{cls} = 0^d$ is concatenated preceding the visible patches, facilitating the fusion of information from all visible patches at a later stage.

$$P_{cv} \leftarrow \{p_{cls}\} \cup P_v \tag{8}$$

3.4.2 Masked Encoder. The transformer MAE captures latent inter-patch relationships within the visible patches $P_{cv}$, subsequently reconstructing the masked patches $P_m$. These bidirectional transformer autoencoder blocks facilitate information capture from both antecedent and subsequent visible patches. The self-attention mechanism autonomously derives suitable representations from input patches, expediting the recovery of masked patches.

Delving deeper, the transformer encoder $E = \{\mathrm{Encoder}_i(\cdot) \mid i \in \mathbb{N}\}$ incorporates $|E|$ layers, with each layer $\mathrm{Encoder}_i(\cdot)$ discerning latent relationships among input patches $P_i$ and outputting a deep latent representation $P_{i+1}$.

$$P_{i+1} \leftarrow \mathrm{Encoder}_i(P_i, M'_v), \tag{9}$$

$$\text{s.t. } \begin{cases} P_0 = P_{cv} \\ P_{i+1}, P_i, M'_v \in \mathbb{R}^{|P_v| \times d} \end{cases} \tag{10}$$

While maintaining the same shape as $P_i$, each patch $p_{i+1,j} \in P_{i+1}$ fuses information from all patches in $P_i$ via the self-attention encoder block.

$$p_{i+1,j} \leftarrow \mathrm{Encoder}_i(p_{i,0}, p_{i,1}, \ldots, p_{i,|P_v|-1}, M'_v) \tag{11}$$

Hence, the class patch $p_{cls}$ fuses information from all visible patches, enabling the representation of the burst as an integrated entity.

Note that the attention mask $M'_v$ is provided to the self-attention block within each layer of transformer encoders $\mathrm{Encoder}_i$, serving to mask the padded patches. $M'_v$ conceals the attention scores of padded patches, ensuring that only the original patches will be attended to and precluding interference from padded patches.

3.4.3 Decoder. With the final layer's latent representation of the visible patches, denoted as $P_{|E|}$, we can reconstruct the masked patches using the transformer decoder. The decoder's architecture resembles that of the encoder, consisting of multiple layers with self-attention blocks. There are two key differences: (1) the decoder is relatively lightweight, and (2) the number of patches is $|P|$.

To reconstruct the masked patches, we initially insert placeholders at the masked patch positions and return the visible patches to their original locations. This operation, referred to as $\mathrm{Unshuffle}$, produces $P_0$.

$$P_0 \leftarrow \mathrm{Unshuffle}(\underbrace{P_{|E|}}_{|P_{|E|}| = |P_v|},\ \underbrace{0^d, 0^d, \ldots, 0^d}_{|P_m|}) \tag{12}$$

The placeholder patches, represented by $|P_m|$ zero vectors $0^d$, are soon replaced by the decoder with recovered masked patches. The $\mathrm{Unshuffle}$ operation restores the patch count to $|P| = |P_v| + |P_m|$. Subsequently, a positional embedding is added to $P$. The decoder layers, denoted as $D = \{\mathrm{Decoder}_i(\cdot) \mid i \in \mathbb{N}\}$, are then employed to replace the placeholders with recovered masked patches. The decoding process proceeds as follows:

$$P_{i+1} \leftarrow \mathrm{Decoder}_i(P_i), \tag{13}$$

$$\text{s.t. } P_{i+1}, P_i \in \mathbb{R}^{|P| \times d} \tag{14}$$

The final decoder layer's latent representation, $P_{|D|}$, has its placeholder patches extracted and projected to conform to the input burst's original shape, represented as $P_r$.

The pre-training loss is calculated as the Mean Square Error (MSE) between the restored patches $P_r$ and the original masked patches $P_m$:

$$\mathcal{L}_{pretrain} = \mathrm{MSE}(P_r,\ P_m) \tag{15}$$

$$= \frac{1}{|P_m|} \sum_{i=0}^{|P_m|-1} \left\| p_{r_i} - p_{m_i} \right\|^2 \tag{16}$$

3.5 Fine-tuning Flow-MAE

After the pre-training phase, the Flow-MAE encoder can generate deep latent representations for any given burst. Subsequently, the classifier model is fine-tuned using a modest amount of task-specific labeled traffic data. The classifier model retains the MAE's encoder component (including its architecture and weights), while the decoder part is replaced by a fully connected linear classifier.

Fine-tuning is effective for downstream classification tasks for several reasons: (1) The encoder inherits from the pre-trained model, enabling the use of the same burst representation input in downstream tasks; (2) The pre-trained encoder output representation is universally applicable to all traffic patterns, facilitating adaptation to specific feature representations; (3) The output class patch $p_{cls}$ integrates information from all visible patches and represents the entire burst, making it suitable for direct classification.

A fully connected linear layer serves as a $\mathrm{Classifier}$, projecting the encoder output class patch $p_{|D|, cls}$ to a probability vector $\hat{Y} \in [0, 1]^c$, where $c$ denotes the number of classes in the fine-tuning dataset. The predicted label index is defined as:

$$\hat{y} = \arg\max \hat{Y} \tag{17}$$

For a classification task involving $c$ classes, given an input burst B, ground-truth label probability $Y = \{y_1, y_2, \ldots, y_c\}$, and prediction probability vector $\hat{Y} = \{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_c\}$, the loss function is defined as the cross-entropy loss:

$$\mathcal{L}_{CrossEntropy} = \mathrm{CrossEntropy}(\mathrm{Softmax}(\hat{Y}), Y) \tag{18}$$

$$= -\sum_{i=1}^{c} \log \frac{\exp(\hat{y}_i)}{\sum_{j=1}^{c} \exp(\hat{y}_j)} \cdot y_i \tag{19}$$
Specifically, for our classification task, the ground truth label index $y$ constitutes an integer within the range $[0, c-1]$, and $Y$ represents a one-hot encoding vector. The cross-entropy loss is expressed as:

$$\mathcal{L}_{CrossEntropy} = -\log \frac{\exp(\hat{y}_y)}{\sum_{j=1}^{c} \exp(\hat{y}_j)} \tag{20}$$
Thus, our fine-tuning objective is to minimize the cross-entropy loss $\mathcal{L}_{CrossEntropy}$. Upon completion of the fine-tuning process, the model is primed for the classification task. Given target-specific traffic, the fine-tuned model can accurately predict its classification.

4 EXPERIMENT RESULTS

4.1 Experiment Setup

4.1.1 Flow-MAE Implementation.

Table 1: Hyper-parameter Configurations in Pre-training and Fine-tuning

Hyper-parameter             Pre-training Value   Fine-tuning Value
burst length l_MAX          1024                 1024
embedding stride s          8                    8
embedding kernel size k     8                    8
embedding dimensions d      768                  768
mask ratio r_m              0.15                 0
attention drop rate r_a     0                    0.1
forward drop rate r_f       0                    0.1
embedding patches |P|       128                  128
encoder layers |E|          12                   12
encoder heads h_E           12                   12
decoder layers |D|          12                   None
decoder heads h_D           12                   None
epochs                      800                  25
batch size                  32                   32
learning rate               1e-4                 1e-4
warm up ratio               0.015                0

We adopt the Vision Transformer base (ViT-base) [9] model as the masked autoencoder backbone in Flow-MAE. ViT-base incorporates 12 encoder layers and 12 decoder layers, and adopts a 768-dimension embedding for each input patch.

Table 1 presents the parameter configurations for the pre-training and fine-tuning experiments. The Flow-MAE model comprises $|E| = 12$ transformer encoder layers, each featuring a self-attention block with $h_E = 2$ heads. The pre-training Masked Patch Model decoder comprises $|D| = 12$ layers and $h_D = 12$ self-attention heads. The input burst length $l_{MAX}$ is set at 1024 bytes. The convolution embedding layer, employing a stride of s = 32 and kernel size k = 32, projects the burst into $|P| = 32$ patches of d = 768 dimensions. During pre-training, with a mask ratio $r_m = 0.15$, $|P_m| = 4$ of the 32 total patches are masked, leaving the remaining $|P_v| = 28$ patches visible. The attention drop rate $r_a$ and forward drop rate $r_f$ are assigned values of 0 during pre-training and 0.1 during fine-tuning, respectively. In total, the Flow-MAE model contains 85.12 M parameters and a model size of 450 MB.

All experiments were conducted on a testbed equipped with an i7-12700K CPU (8 P-cores @4.7GHz and 4 E-cores @3.6GHz), 64 GB DDR5 DRAM (@6000MT/s), and two NVIDIA GeForce 3090Ti GPUs (24 GB of GDDR6X memory each). The software environment of the testbed includes Ubuntu 22.04.1 LTS (kernel 5.15.0-50), Python 3.8.13, PyTorch 1.12.0, and CUDA 11.6.

4.1.2 Datasets.

Table 2: Statistical Information of the Datasets

Dataset                         #Benign   #Malicious   #Flow    #Packet
CIC-IDS2018 [38]                1         7            4.5M     80M
USTC-TFC-2016 [50]              10        10           9.8K     97.1K
ISCX-VPN-2016 [10]              6         6            8.4K     18.7K
ISCX-Tor-2016 [23]              8         8            3K       80K
Cross-Platform (Android) [46]   215       0            27.8K    656K

The experiments are conducted using the following datasets, as summarized in Table 2:

  1. CIC-IDS2018 [38]: An intrusion detection system evaluation dataset collected by the Canadian Institute for Cybersecurity (CIC). The CIC-IDS2018 dataset comprises benign background traffic and seven distinct malicious scenarios: Brute-force, Heartbleed, Botnet, DoS, DDoS, Web attacks, and infiltration, collected over a 10-day period. The network infrastructure includes 50 malicious machines in a local network, as well as 420 victim machines and 30 victim servers distributed across five departments.

  2. USTC-TFC-2016 [50]: A dataset containing encrypted traffic from 10 malware and 10 benign applications. This well-recognized dataset was published by a researcher from the University of Science and Technology of China (USTC) and has been extensively adopted by researchers.

  3. ISCX-VPN-2016-App [10]: ISCX-VPN-App is a dataset that contains encrypted network traffic collected by the Canadian Institute for Cybersecurity (CIC). The dataset focuses on Virtual Private Networks (VPNs) used for network communication. VPNs are widely used for intrusion purposes as they allow users to bypass censorship and hide their location through protocol obfuscation and proxy mode. The ISCX-VPN-App dataset is classified based on applications and consists of 17 different applications.

  4. ISCX-VPN-2016-Service [10]: ISCX-VPN-Service is another dataset within the ISCX-VPN-2016 collection. Similar to ISCX-VPN-App, this dataset contains encrypted network traffic collected by the CIC. It specifically focuses on the service level of VPNs. The ISCX-VPN-Service dataset is classified based on 12 different services related to VPN usage. By studying this dataset, researchers can gain insights into the characteristics and behavior of various VPN services.

  5. ISCX-Tor-2016 [23]: This dataset comprises traffic from 16 applications utilizing the Onion Router (Tor) for encrypted communications. Tor obscures data between the sender and the receiver through a distributed routing network. Some intruders may use Tor to conceal their identity and activities, as the resulting obfuscation makes tracking and traffic classification challenging.

  6. Cross-Platform [46]: This dataset includes encrypted traffic from Android applications, featuring 215 apps from the top 100 Apps in the US, China, and India. It is representative of encrypted traffic from current applications, as it covers a wide range of applications worldwide. Moreover, it has the highest number of classes among all datasets used in the experiment, making it more challenging than the others.

Flow-MAE is pre-trained on a subset of unlabeled background traffic from CIC-IDS2018 [38], specifically the traffic collected on Tuesday, 20-02-2018, after removing malicious traffic directed towards the victim host 172.31.69.25. This subset dataset from 20-02-2018 consists of 451 raw pcap files, totaling 57.34 GB, and generates a pre-training dataset of 3.86 million bursts. To evaluate Flow-MAE's effectiveness and robustness, the model is fine-tuned for malicious traffic classification experiments on the first five public datasets mentioned above. The pre-training dataset is significantly larger in size compared to the fine-tuning dataset. For instance, the fine-tuning data on CIC-IDS2018 consists of approximately 10GB of malicious traffic. Additionally, the Cross-Platform (Android) dataset [46], which includes only benign traffic from 215 apps, is employed to validate Flow-MAE's transferability and robustness in the encrypted flow classification task involving large categories. However, to ensure an adequate amount of data for analysis, a data size threshold of at least 20MB was imposed. As a result, only 71 out of the 215 apps were included in the evaluation. This task demonstrates the model's capacity to adapt to different classification scenarios, even when handling a larger number of classes and diverse data types.

As outlined in the preprocessing procedure section (§ 3.2.2), the burst's IP address and port fields are anonymized to prevent explicit identification information from interfering with representation learning. Anonymous TCP and UDP sessions from each class are selected to create labeled burst datasets, which are then divided into training and testing sets in a 9:1 ratio.

4.1.3 Baselines. Twelve state-of-the-art generic malicious traffic detection methods are utilized as baselines: AppScanner [45], CUMUL [33], BIND [1], K-fp [14], FlowPrint [47], DF [41], FS-Net [29], GraphDApp [40], TSCRNN [26], Deeppacket [31], PERT [15], and ET-BERT [27]. These baseline methods represent a diverse range of approaches in the literature: (1) they encompass both traditional machine learning-based and deep learning-based methods; (2) they employ a wide variety of features, such as flow statistics, time series, raw packet headers, and raw payloads.

4.1.4 Evaluation Metrics. We evaluate the performance of Flow-MAE using four typical metrics in the literature: Accuracy (AC), Precision (PR), Recall (RC), and F1-score (F1). These metrics are defined as follows:

  • Accuracy (AC): $AC = \frac{TP + TN}{TP + TN + FP + FN}$, the ratio of the number of correctly classified instances (True Positives $TP$ and True Negatives $TN$) to all instances in the test dataset, which comprises all input bursts $TP + TN + FP + FN$.

  • Precision rate: $PR = \frac{TP}{TP + FP}$, the ratio of the number of correctly reported instances (True Positives $TP$) to the number of reported instances $TP + FP$.

  • Recall rate: $RC = \frac{TP}{TP + FN}$, the ratio of the number of correctly reported instances to the number of correct instances $TP + FN$.

  • F1-score: $F1 = \frac{2 \times PR \times RC}{PR + RC}$, where PR is the Precision rate and RC is the Recall rate.

To account for imbalanced traffic categories and avoid biased results, Macro Average [44] is employed to calculate the mean values of AC, PR, RC, and F1 for all traffic categories. Moreover, to further mitigate potential biases, a five-fold validation is performed on each dataset.
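
For reference, the macro-averaged metrics can be computed with scikit-learn as in the snippet below; this is a standard recipe, not the authors' evaluation script.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def macro_metrics(y_true, y_pred):
    """Return (AC, PR, RC, F1) with PR/RC/F1 macro-averaged over classes."""
    ac = accuracy_score(y_true, y_pred)
    pr, rc, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return ac, pr, rc, f1
```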

Table 3: Accuracy Evaluation Results on CIC-IDS-2018, USTC-TFC-2016 and ISCX-VPN-App Datasets

                          CIC-IDS-2018 [38]                  USTC-TFC-2016 [50]                 ISCX-VPN-2016-App [10]
Method                    AC      PR      RC      F1         AC      PR      RC      F1         AC      PR      RC      F1
AppScanner [45]           0.9091  0.9147  0.9074  0.9110     0.8954  0.8984  0.8968  0.8892     0.6266  0.4864  0.5198  0.4935
CUMUL [33]                0.5860  0.6181  0.6212  0.6196     0.5675  0.6171  0.5738  0.5513     0.5365  0.4129  0.4535  0.4236
BIND [1]                  0.8661  0.8271  0.8091  0.8180     0.8457  0.8681  0.8382  0.8396     0.6767  0.5152  0.5153  0.4965
K-fp [14]                 -       -       -       -          -       -       -       -          0.6070  0.5478  0.5430  0.5303
FlowPrint [47]            0.7841  0.7102  0.6921  0.7010     0.8146  0.6434  0.7002  0.6573     0.8767  0.6697  0.6651  0.6531
DF [41]                   0.7976  0.7554  0.7441  0.7497     0.7787  0.7883  0.7819  0.7593     0.6116  0.5706  0.4752  0.4799
FS-Net [29]               0.9015  0.8994  0.8997  0.8996     0.8846  0.8846  0.8920  0.8840     0.6647  0.4819  0.4848  0.4737
GraphDApp [40]            0.8667  0.8441  0.8439  0.8440     0.8789  0.8226  0.8260  0.8234     0.6328  0.5900  0.5472  0.5558
TSCRNN [26]               -       0.9678  0.9724  0.9701     -       0.9870  0.9860  0.9870     -       -       -       -
Deeppacket [31]           0.9661  0.9874  0.9883  0.9879     0.9640  0.9650  0.9631  0.9641     0.9758  0.9785  0.9745  0.9765
PERT [15]                 0.9302  0.9479  0.9367  0.9423     0.9909  0.9911  0.9910  0.9911     0.8229  0.7092  0.7173  0.6992
ET-BERT (flow) [27]       0.9943  0.9944  0.9942  0.9943     0.9929  0.9930  0.9930  0.9930     0.8519  0.7508  0.7294  0.7306
ET-BERT (packet) [27]     0.9955  0.9933  0.9932  0.9933     0.9915  0.9915  0.9916  0.9916     0.9962  0.9936  0.9938  0.9937
Flow-MAE (ours)           0.9958  0.9959  0.9958  0.9958     0.9988  0.9984  0.9989  0.9986     0.9987  0.9991  0.9989  0.9990
Table 4: Accuracy Evaluation Results on ISCX-VPN-Service, ISCX-Tor-2016, and Cross-Platform (Android) Datasets

                          ISCX-VPN-2016-Service [10]         ISCX-Tor-2016 [23]                 Cross-Platform (Android) [46]
Method                    AC      PR      RC      F1         AC      PR      RC      F1         AC      PR      RC      F1
AppScanner [45]           0.7182  0.7339  0.7225  0.7197     0.6722  0.3756  0.4422  0.3913     0.3868  0.2523  0.2594  0.2440
CUMUL [33]                0.5610  0.5883  0.5676  0.5668     0.6606  0.3850  0.4416  0.3918     0.3525  0.2221  0.2409  0.2189
BIND [1]                  0.7534  0.7583  0.7488  0.7420     0.7185  0.4598  0.4515  0.4511     0.4728  0.3126  0.3253  0.3026
K-fp [14]                 0.6430  0.6492  0.6417  0.6395     0.6472  0.5576  0.5849  0.5522     0.2248  0.2113  0.2104  0.2052
FlowPrint [47]