
Flow-MAE: Leveraging Masked AutoEncoder for Accurate, Efficient and Robust Malicious Traffic Classification

Zijun Hang, National University of Defense Technology, China, hangzijun17@nudt.edu.cn
Yuliang Lu, National University of Defense Technology, China, luyuliang@nudt.edu.cn
Yongjie Wang, National University of Defense Technology, China, w_yong_j@189.cn
Yi Xie, National University of Defense Technology, China, heilongjiangxieyi@163.com

DOI: https://doi.org/10.1145/3607199.3607206

RAID '23: The 26th International Symposium on Research in Attacks, Intrusions and Defenses, Hong Kong, China, October 2023

Malicious traffic classification is crucial for Intrusion Detection Systems (IDS). However, traditional Machine Learning approaches necessitate expert knowledge and a significant amount of well-labeled data. Although recent studies have employed pre-training models from the Natural Language Processing domain, such as ET-BERT, for traffic classification, their effectiveness is impeded by limited input length and fixed Byte Pair Encoding.

To address these challenges, this paper presents Flow-MAE, a pre-training model that employs Masked AutoEncoders (MAE) from the Computer Vision domain to achieve accurate, efficient, and robust malicious network traffic classification. Flow-MAE overcomes these challenges by utilizing burst (a generic representation of network traffic) and patch embedding to accommodate extensive traffic length. Moreover, Flow-MAE introduces a self-supervised pre-training task, the Masked Patch Model, which captures unbiased representations from bursts with varying lengths and patterns.

Experimental results from six datasets reveal that Flow-MAE achieves new state-of-the-art accuracy (>0.99), efficiency (>900 samples/s), and robustness across diverse network traffic types. In comparison to the state-of-the-art ET-BERT, Flow-MAE exhibits improvements in accuracy and speed by 0.41%-1.93% and 7.8x-10.3x, respectively, while necessitating only 0.2% FLOPs and 44% memory overhead. The efficacy of the core designs is validated through few-shot learning and ablation experiments. The code is publicly available at https://github.com/NLear/Flow-MAE.

CCS Concepts: • Security and privacy → Network security; • Computing methodologies → Artificial intelligence; • Security and privacy → Intrusion detection systems;

Keywords: Malicious Traffic Classification, Masked AutoEncoder, Pre-training Model, Masked Patch Model

ACM Reference Format:
Zijun Hang, Yuliang Lu, Yongjie Wang, and Yi Xie. 2023. Flow-MAE: Leveraging Masked AutoEncoder for Accurate, Efficient and Robust Malicious Traffic Classification. In The 26th International Symposium on Research in Attacks, Intrusions and Defenses (RAID '23), October 16-18, 2023, Hong Kong, China. ACM, New York, NY, USA, 18 pages. https://doi.org/10.1145/3607199.3607206

1 INTRODUCTION

Malicious traffic classification is a crucial network security mechanism for Intrusion Detection Systems (IDS) [49]. This multi-class classification problem aims to distinguish various types of malicious network traffic to uncover network security vulnerabilities [28]. Machine Learning (ML) has emerged as a promising network security paradigm [3, 7], particularly for malicious traffic classification, complementing traditional rule or fingerprint-based approaches [43].

However, conventional ML techniques require expert knowledge and a substantial amount of well-labeled data to extract effective traffic features [24]. Since the majority of network traffic is benign, identifying malicious traffic presents a considerable challenge [30]. Furthermore, approaches that focus primarily on specific tasks limit their transferability, necessitating redesign or retraining for new tasks [39]. Consequently, these methods exhibit inefficiency, inaccuracy, and suboptimal robustness capabilities.

In recent years, transformer-based [48] Deep Learning (DL) models have emerged as a promising approach beyond the traditional ML-based paradigm. These models have shown remarkable progress in various applications, inspired by the success of pre-training models [53] in the fields of Natural Language Processing (NLP) [6, 8] and Computer Vision (CV) [9, 16]. Recently, ET-BERT [27] has applied BERT [8], a pre-training model in NLP, to traffic classification and achieved state-of-the-art performance.

Although significant progress has been made, there are two areas where ET-BERT could be further improved: (a) Limited by the BERT model's input length, ET-BERT adopts a network flow length of 128 bytes, which may not adequately represent extensive network traffic ranging from KBs to MBs. However, the 128-byte input length also results in increased computation and memory overheads for the ET-BERT model. (b) ET-BERT utilizes Byte Pair Encoding (BPE) [36] to convert the network traffic byte stream into BERT input tokens. This requires the use of a fixed dictionary, potentially reducing robustness in the face of varying traffic patterns.

Innovative pre-training models from the CV domain, such as Masked AutoEncoders (MAE) [16], have emerged as promising alternatives for malicious traffic classification. MAE can learn deep latent state representations of unlabeled images through a self-supervised task by randomly masking partial patches of input images and reconstructing the missing pixels. Flow-MAE employs an asymmetric encoder-decoder architecture similar to MAE. The encoder operates solely on the visible subset of the byte stream to generate latent representations, from which a lightweight decoder reconstructs the masked bytes. Experiments show that masking a modest proportion of patches (e.g., 15%) creates a meaningful pre-training task.

Flow-MAE follows a two-stage process, including pre-training and subsequent fine-tuning: (1) Initially, the MAE model is pre-trained on a large volume of unlabeled traffic [38], in order to learn in-depth, unbiased, contextualized datagram-level representations of traffic autonomously. (2) Thereafter, the Flow-MAE model is fine-tuned on a small amount of task-specific labeled data, increasing its effectiveness for a designated downstream task.

The primary contributions of this paper are summarized as follows:

  1. We present Flow-MAE, a pre-training framework that utilizes a Masked AutoEncoder for malicious traffic classification. Flow-MAE acquires unbiased datagram representations from a large volume of unlabeled traffic and subsequently fine-tunes on a downstream task.

  2. We use burst (§ 3.2), a generic representation for network traffic, in conjunction with patch embedding (§ 3.3) to adapt extensive traffic to Flow-MAE model input. This approach bridges the gap between large-scale traffic data and the truncated input sequence in Flow-MAE (limited to 32 bytes).

  3. We propose a generic self-supervised pre-training task, the Masked Patch Model (MPM) (§ 3.4), to capture contextual interdependence and achieve aligned generic representations from bursts of varying lengths.

  4. We conduct experiments on six downstream datasets, comparing Flow-MAE with existing approaches to validate the core designs. Results show that integrating these designs enables efficient and effective training of the high-capacity MAE model. Flow-MAE achieves new state-of-the-art accuracy exceeding 99%, an efficiency of 900 samples/s, and impressive robustness on six downstream datasets. It outperforms the state-of-the-art ET-BERT in accuracy and speed by 0.41%-1.93% and 7.8x-10.3x, respectively, while only requiring 0.2% FLOPs and 44% memory overhead. The code is publicly available at https://github.com/NLear/Flow-MAE.

Figure 1: Flow-MAE Architecture Overview

2 BACKGROUND AND MOTIVATIONS

2.1 Existing Methods

Malicious traffic classification refers to the task of identifying and categorizing network traffic that is associated with malicious or suspicious activities. It involves identifying malicious intent, such as malware infections, network intrusions, denial-of-service attacks, or data exfiltration attempts, protecting legitimate network users from concealed malware and intrusions. In general, there are three primary types of malicious traffic classification methodologies.

2.1.1 Fingerprint Methods. Fingerprint methodologies, such as FlowPrint [47], initially build a database of spatial and temporal features extracted from numerous malicious flows. Consequently, hidden malicious flows can be detected by matching fingerprints with those in the database. The effectiveness of flow classification depends on the integrity of the fingerprint database and the quality of real-time flow fingerprinting.

2.1.2 Statistical Machine Learning Methods. Statistical Machine Learning (ML) approaches extract a range of spatial or temporal statistical features to form the feature vector, then employ ML algorithms to model the feature distributions of network flows. Over the past decade, the intersection of network security and ML methodologies has produced groundbreaking outcomes, including AppScanner [45], CUMUL [33], BIND [1] and k-fingerprinting (K-fp) [14].

However, traditional ML-based methods rely on accurate feature extraction and a significant amount of labeled malicious data. Given that only a small portion of all traffic constitutes real malicious traffic, manually extracting well-labeled malicious traffic is a daunting task, let alone obtaining valuable and accurate features.

2.1.3 Deep Learning Methods. Deep Learning (DL) constitutes an emerging paradigm for malicious traffic detection, offering end-to-end solutions that surpass traditional statistical Machine Learning (ML) techniques. These models facilitate automatic feature extraction by employing raw flow data as input and integrate feature learning and classification within a unified pipeline. Existing DL models include DF [41], which employs Convolutional Neural Networks (CNNs), FS-Net [29] using AutoEncoders, TSCRNN [26] utilizing Recurrent Neural Networks (RNNs), as well as Deeppacket [31] relying on AutoGluon [11].

However, DL models require a considerable amount of well-labeled malicious traffic for convergence. Moreover, these models may face difficulties when applied to different flow schemas. As a result, a complete redesign or re-training of the model is needed when adapting the model to a new task or dataset, leading to sub-optimal robustness.

In summary, the problems of efficacy, efficiency, and robustness in the aforementioned methods remain unresolved.

2.2 Pre-training Models

2.2.1 Benefits of Pre-training Models. Recently, transcending traditional ML-based approaches, pre-training techniques [53] for DL models have emerged. Pre-training models can autonomously learn unbiased data representations from vast amounts of unlabeled data and subsequently transfer to downstream tasks through fine-tuning over limited epochs on minimal labeled data. During fine-tuning, the models adapt the previously learned representations to the specific task at hand, incorporating the labeled data to refine their performance. Since the pre-training phase is independent of any specific task or labels, the representations obtained from Flow-MAE are considered unbiased, as they are learned solely from the raw data without any specific task-related bias.

These unbiased representations enable Flow-MAE to generalize well to different downstream tasks, as they capture the essential features and patterns present in the data. This transferability of representations from pre-training to fine-tuning allows to achieve high performance even with limited labeled data. Numerous pre-training models favor the transformer [48] backbone, a DL model based on a self-attention mechanism. The transformer has two primary branches: the BERT [8] (Bidirectional Encoder Representations from Transformers) Model in NLP and the MAE (Masked AutoEncoder) model [16] in CV.

2.2.2 BERT Model in NLP. Contemporary research has implemented BERT-like pre-training architectures for traffic classification, yielding substantial advancements. PERT [15] adapts the ALBERT [22] pre-training model for traffic classification, attaining a 93.23% F1 score on the ISCX-VPN-2016 dataset [10]. The state-of-the-art ET-BERT [27] incorporates a bespoke Byte Pair Encoding representation for encrypted traffic and two pre-training tasks based on the BERT [8] model, enhancing the F1 score to 99%. Both PERT and ET-BERT extend BERT-like models to the networking domain, exemplifying the merits of pre-training architectures when capitalizing on copious unlabeled traffic data.

Despite the remarkable accomplishments attained so far, it is evident that two refinements can be implemented for ET-BERT: (a) Constrained by the BERT model's input dimensions, ET-BERT employs a 128-byte sequence to represent network traffic; this is substantially shorter than the majority of network traffic (ranging from kilobytes to megabytes) and might inadequately represent extensive network traffic. Furthermore, by adopting solely packet payloads, the packet headers containing temporal and spatial features of the traffic are discarded, leading to accuracy loss. Nonetheless, the 128-byte input length still incurs elevated computation and memory overheads for the ET-BERT model. (b) ET-BERT relies upon the Byte Pair Encoding (BPE) [36] technique to convert the network traffic byte stream into BERT input tokens, utilizing a fixed dictionary, which consequently results in diminished performance under disparate traffic patterns.

2.2.3 MAE-like Model in CV. The Masked AutoEncoder (MAE) [16] constitutes a pre-training architecture with a transformer [48] backbone in the Computer Vision (CV) domain. This model employs a transformer autoencoder to learn deep latent state representations by randomly masking input image patches and subsequently reconstructing the masked regions. The image size in the MAE model surpasses the sentence length in the BERT model by orders of magnitude, thereby facilitating the processing of extended input sequences. Furthermore, MAE incorporates a versatile convolutional embedding layer as opposed to the fixed BPE encoding layer, which encodes sequences utilizing a rigid vocabulary. This feature enables the model to adapt seamlessly to varying input patterns.

3 FLOW-MAE SYSTEM DESIGN

3.1 Model Architecture Overview

Motivated by the Masked AutoEncoders (MAE) pre-training model from the CV domain, we introduce Flow-MAE, a MAE designed as a versatile self-supervised learner for classifying malicious network traffic. Flow-MAE adheres to a two-stage process comprising preliminary pre-training and subsequent fine-tuning within the scope of self-supervised learning. The architectural design is delineated in Figure 1.

3.1.1 Preprocessing. Flow-MAE adopts burst-level traffic representation. An initial preprocessing phase, termed burst representation (§ 3.2), converts the traffic datagram into bursts. Subsequently, the patch embedding step (§ 3.3) projects these bursts into patches, conforming to the input dimensions of the MAE model.

3.1.2 Pre-training Procedure (Section 3.4). The Flow-MAE model undergoes pre-training on unlabeled background traffic, facilitating the acquisition of deep contextualized burst-level unbiased representations from large-scale unlabeled network data.

Flow-MAE employs a transformer autoencoder architecture, consisting of multiple encoder and decoder layers, each containing multi-head self-attention blocks. During the pre-training process, randomly selected input patches are masked. The encoder focuses exclusively on visible patches, capturing latent relationships and generating an output representation for the burst. The decoder then attempts to reconstruct the masked patches using the encoder output. The optimization goal is to minimize the reconstruction loss.

3.1.3 Fine-tuning Procedure (§ 3.5). Upon completing the pre-training phase, the Flow-MAE encoder is capable of producing a deep latent representation for any given burst. The classifier model inherits the encoder component of the transformer autoencoder (including its architecture and weights) and replaces the decoder with a fully connected linear classifier.

By fine-tuning the classifier model on a limited amount of task-specific labeled traffic, it can be customized for downstream applications. When exposed to target-specific traffic, the classifier model can accurately determine its classification.

3.2 Burst Representation Preprocessing

Network traffic comprises numerous flows with diverse types (e.g., various applications, protocols, or services) and properties (e.g., benign and malicious). To achieve a consistent representation and fit the model's input length, preliminary traffic segmentation is vital.

We select the burst as the fundamental unit for malicious traffic classification and preprocess the traffic in three distinct phases: (1) session splitting (§ 3.2.1): dividing the traffic into separate sessions; (2) burst splitting (§ 3.2.2): subdividing a session into multiple bursts to create model input sequences; and (3) burst padding (§ 3.2.3): generating masks for bursts of different lengths and padding all bursts to achieve a uniform length. Consequently, each burst originates from a specific traffic category and is prepared for model input.

3.2.1 Session split. Initially, we separate the entire traffic into sessions, according to the 5-tuple: source IP address, destination IP address, protocol type, source port, and destination port. The definition of a session is as follows:

Definition 3.1 (Session). Given network traffic $P = \{p_i \mid i \in \mathbb{N}^+\}$ comprising multiple packets $p_i$, a session $S = \{p_j \mid p_j \in P, j \in \mathbb{N}^+\} \subseteq P$ denotes those bidirectional packets $p_j$ of a specific protocol transmitted between two ports on two hosts. The session length $|S| = n_s$ represents the number of packets $n_s$ within session $S$.

As session packets are bidirectional, a session includes request and response packets of a specific application exchanged between two hosts. For example, a standard network application communicates via the TCP protocol, establishing a connection between ports on two hosts. A virus might also create a connection to enable lateral movement within a local network. A session can characterize an application since different sessions may possess unique and intrinsic features that facilitate their differentiation.

However, in certain scenarios (e.g., P2P downloading, video calling, and live streaming), the session length can be considerable (reaching MBs or even more). Therefore, further splitting is necessary to accommodate the model input length.
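
As a concrete illustration of the session split, the following Python sketch groups already-parsed packets by a direction-agnostic 5-tuple; the packet dictionary layout and helper names are assumptions for illustration, not the paper's implementation.

```python
from collections import defaultdict

def session_key(src_ip, dst_ip, proto, src_port, dst_port):
    """Build a direction-agnostic 5-tuple key so that request and
    response packets of the same connection map to one session."""
    a = (src_ip, src_port)
    b = (dst_ip, dst_port)
    return (proto,) + (a + b if a <= b else b + a)

def split_sessions(packets):
    """packets: iterable of dicts carrying the 5-tuple fields.
    Returns {session_key: [packet, ...]} preserving capture order."""
    sessions = defaultdict(list)
    for pkt in packets:
        key = session_key(pkt["src_ip"], pkt["dst_ip"], pkt["proto"],
                          pkt["src_port"], pkt["dst_port"])
        sessions[key].append(pkt)
    return sessions
```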

3.2.2 Burst Split. We subsequently divide each session into bursts. Formally, the definition of a burst is:

Definition 3.2 (Burst). Given a session $S = \{p_j \mid p_j \in P, j \in \mathbb{N}^+\}$ comprising multiple packets $p_j$ from network traffic $P$, a burst $B = \{p_k \mid p_k \in S, k \in \mathbb{N}^+\} \subseteq S$ is a subset of bidirectional contiguous network packets within a short time window $\tau$ in the session. We denote the number of bytes in a burst B as the burst length $\ell_B$.

A burst can characterize the pattern of network flow transmission from the application layer perspective; while it is as descriptive as a session, it is more focused on an interval.

We use the raw bytes in an anonymous burst B to construct an input sequence, where each burst B is represented by a sequence of $\ell_B$ bytes. Concretely, we anonymize the burst by removing the MAC address, IP address, and port fields from each packet header. Anonymity is essential, as bursts from the same flow category may share identical addresses or ports; these can be used as explicit identifiers and prevent models from learning latent and unbiased representations of different flow categories. Subsequently, we concatenate the remaining bytes from the burst in packet order so that the temporal relations in the burst are preserved. We also limit the maximum number of bursts generated from the same session, randomly sampling from these bursts to guarantee unbiased burst generation.

A burst is a temporal, application-level representation that facilitates stable, fine-grained, and representative feature extraction for various traffic types. Another advantage of burst representation is that it allows for organizing numerous flows into smaller groups, significantly reducing the sequence length for model input.
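
The burst split and anonymization steps can be sketched as follows, assuming packets arrive as dictionaries with a timestamp and raw bytes, plain Ethernet/IPv4 framing, and a fixed 20-byte IP header; the time window value and the byte offsets are illustrative assumptions, not the paper's exact preprocessing code.

```python
def split_bursts(session_packets, tau=1.0):
    """Group a session's packets into bursts: consecutive packets whose
    inter-arrival gap stays below the time window tau (seconds)."""
    bursts, current, last_time = [], [], None
    for pkt in session_packets:  # packets kept in capture order
        if last_time is not None and pkt["time"] - last_time > tau:
            bursts.append(current)
            current = []
        current.append(pkt)
        last_time = pkt["time"]
    if current:
        bursts.append(current)
    return bursts

def anonymize(pkt_bytes):
    """Drop MAC addresses, IP addresses and ports from a plain
    Ethernet/IPv4/TCP-or-UDP packet (fixed 20-byte IP header assumed)."""
    eth_rest = pkt_bytes[12:14]   # keep EtherType, drop both MAC addresses
    ip_rest = pkt_bytes[14:26]    # first 12 IP-header bytes, drop src/dst IP
    l4_rest = pkt_bytes[38:]      # drop the 4 port bytes at the L4 start
    return eth_rest + ip_rest + l4_rest

def burst_to_bytes(burst):
    """Concatenate anonymized packets, in packet order, into one byte string."""
    return b"".join(anonymize(bytes(p["bytes"])) for p in burst)
```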

However, a disparity remains between the burst and the MAE model input. Different bursts can have varying burst lengths, while the MAE model requires fixed-sized image inputs. Consequently, we introduce a burst cropping and padding scheme, as described below.

3.2.3 Burst Cropping and Padding. Cropping and padding are common techniques in transformer models for NLP and time series analysis (e.g., BERT [8], ALBERT [22], and Informer [54]). Since sentences and time series have varying lengths, input sequences are cropped or padded to a predefined maximum length $\ell_{MAX}$ that the model can process. A mask sequence $M$ of equivalent length ($\ell_M = \ell_{MAX}$) is introduced to indicate the valid sequence from the original input in the cropped or padded sequence.

$$B' = \begin{cases} B \,\|\, 0^{\ell_{MAX} - \ell_B}, & \text{if } \ell_B < \ell_{MAX} \\ B[0 : \ell_{MAX}], & \text{otherwise} \end{cases} \tag{1}$$

$$M = \begin{cases} 1^{\ell_B} \,\|\, 0^{\ell_{MAX} - \ell_B}, & \text{if } \ell_B < \ell_{MAX} \\ 1^{\ell_{MAX}}, & \text{otherwise} \end{cases} \tag{2}$$

Specifically, for a burst B with a length $\ell_B$ smaller than the maximum length $\ell_{MAX}$, we pad $\ell_{MAX} - \ell_B$ zeros at its end to reach the maximum length. The corresponding positions in the mask $M[\ell_B : \ell_{MAX}]$ are filled with 0s, while the remaining positions $M[0 : \ell_B - 1]$ corresponding to the original sequence are filled with 1s. Otherwise, if a burst B is longer than the maximum length $\ell_{MAX}$, we crop it to length $\ell_{MAX}$ by discarding the posterior $\ell_B - \ell_{MAX}$ bytes and fill all $\ell_{MAX}$ positions of the mask with 1s.

It is important to note that the mask used in the padding preprocessing is different from the one applied in Masked Patch Model pre-training. The bytes masked in preprocessing are not calculated or updated during model pre-training. Instead, they remain untouched throughout the pre-training, as explained in § 3.4.
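
A minimal NumPy sketch of eqs. (1)-(2), assuming the burst is already a one-dimensional byte array; the function and argument names are illustrative.

```python
import numpy as np

def crop_or_pad(burst: np.ndarray, l_max: int = 1024):
    """Return (burst', mask) per eqs. (1)-(2): pad short bursts with zeros
    and mark padded positions with 0 in the mask; crop long bursts."""
    l_b = burst.shape[0]
    if l_b < l_max:
        padded = np.concatenate([burst, np.zeros(l_max - l_b, dtype=burst.dtype)])
        mask = np.concatenate([np.ones(l_b, dtype=np.int64),
                               np.zeros(l_max - l_b, dtype=np.int64)])
        return padded, mask
    return burst[:l_max], np.ones(l_max, dtype=np.int64)
```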

3.3 Patch Embedding

3.3.1 Burst to Patch Embedding. In the MAE model, an operation called patch embedding initially uses a convolutional layer to encode the input image. The input image is partitioned uniformly into numerous contiguous, non-overlapping segments, referred to as patches. Following this, each patch is embedded into $\mathbb{R}^d$, where d represents the embedding dimensionality of the MAE model. For example, within the MAE-Base model, an input image measuring 224x224 with a patch dimension of 16x16 yields 14x14=196 patches, each possessing d = 768 dimensions. The input image is thus down-sampled by a factor of 256. This process allows MAE to efficiently process and analyze larger inputs. On the other hand, BERT employs BPE (Byte Pair Encoding), which maps pairs of bytes to a single 768-dimensional vector. BPE encoding differs from patch embedding and does not directly support the incorporation of patch embedding techniques. Therefore, applying patch embedding to the BERT model, as used in MAE, may not be straightforward or feasible due to the fundamental differences in their encoding mechanisms.

Patch embedding is performed for two key reasons in Flow-MAE:

(1) First, as input length increases, the computational and storage requirements of the transformer MAE grow quadratically. As a result, transformer models typically employ a maximum input length of 128 or 256 to minimize computation and memory overhead. For extended input bursts, patch embedding significantly reduces the input sequence length, making it compatible with the MAE's input dimensions. Consequently, more comprehensive burst data, including packet headers, can be incorporated.

(2) Second, bursts exhibit spatially localized sparsity. The convolutional patch embedding layer combines the sparse local data originating from the patches. Considering that the downstream task is traffic classification, which shows lower sensitivity to local semantics, patch embedding can be employed to extract a comprehensive high-level representation of traffic. The sparsity attribute enables the integration of the down-sampling embedding layer, yielding compact encoding and efficient high-level representation learning, while maintaining minimal computation and memory overheads.

In the Flow-MAE framework, each input burst B constitutes a byte stream preprocessed to a fixed length $\ell_B$ as delineated in § 3.2. Treating each burst B as a one-dimensional input sequence $B \in \mathbb{N}^{1 \times \ell_B}$, a one-dimensional convolution layer with a single input channel, $d$ output channels, a kernel size of $K = (1, k)$, and stride $S = (1, k)$ is employed for patch embedding. This process yields $\frac{\ell_B}{k}$ patches $p_i \in \mathbb{R}^d$. Consequently, patch embedding uniformly subdivides a burst B into $|P| = \frac{\ell_B}{k}$ non-overlapping segments $P$.

$$P \leftarrow \mathrm{Conv1D}_{d,k}(B) \quad \text{s.t. } P = \{p_i \mid p_i \in \mathbb{R}^d,\ i = 0, 1, \ldots, \tfrac{\ell_B}{k} - 1\} \tag{3}$$

Owing to the potential padding of bytes to burst B during preprocessing, it becomes essential to ascertain whether a patch is derived from the padded or original bytes in burst B. This distinction enables the calculation of attention scores solely from the original bytes, avoiding disruptions from padded bytes. An attention mask $M'$ is generated from $M$:

$$M' = \{m_i \mid m_i \in \{0, 1\},\ i = 0, 1, \ldots, |P| - 1\}, \tag{4}$$

$$\text{s.t. } m_i = \begin{cases} 0, & \text{if } M[ik : (i+1)k] = 0^k \\ 1, & \text{otherwise} \end{cases} \tag{5}$$

Per eqs. (4) and (5), the patch mask $m_i$ assumes a value of 1 provided that at least one original byte is present within the corresponding patch $p_i$. Conversely, the patch mask $m_i$ is assigned a value of 0 if all bytes within patch $p_i$ are padded.

3.3.2 Positional Embedding. Datagram transmission order may serve as a distinguishing characteristic for certain malware activities. Given that the self-attention block within the transformer layer treats each patch as equivalent, positional embedding is implemented to introduce positional biases into patches. This approach represents the positional information within the sequence and enables the model to discern the spatial relationships among patches.

The positional embedding value for patch $p_i$ is represented by $Pos(i) \in \mathbb{R}^h$, depending on the relative position of $p_i$ within the patches $P$. $Pos(\cdot)$ signifies the embedding function.

$$p_i \leftarrow p_i + Pos(i), \quad Pos(\cdot): \mathbb{R} \rightarrow \mathbb{R}^h \tag{6}$$

3.4 Pre-training Masked Patch Model

3.4.1 Patch Masking. Patch masking randomly masks each patch with a probability of $r$ ($0 < r < 1$). A $\mathrm{Shuffle}$ operation first permutes all patches $P$ at random. Following this, the shuffled patches are divided into two subsets: the final $r|P|$ elements constitute the masked patches $P_m$, while the remaining patches form the visible patches $P_v$.

$$\underbrace{p_0, \ldots, p_{|P| - r|P| - 1}}_{P_v},\ \underbrace{p_{|P| - r|P|}, \ldots, p_{|P| - 1}}_{P_m} \leftarrow \mathrm{Shuffle}(P) \quad \text{s.t. } \begin{cases} P = P_v \cup P_m, & |P_m| = r|P| \\ P_v \cap P_m = \emptyset \end{cases} \tag{7}$$

Only the visible patches $P_v$ are incorporated into the encoder for representation learning, with the masked patches $P_m$ being disregarded. The shuffle operation randomly permutes all the patches in the dataset. This step introduces randomness and diversity into the training process. By shuffling and masking the patches, the model is exposed to different patch arrangements, ensuring that it learns robust representations that are invariant to the specific ordering of patches. This helps prevent the model from relying solely on the spatial arrangement of patches and encourages it to capture more meaningful and generalizable features.

Afterwards, a class patch $p_{cls} = 0^d$ is concatenated preceding the visible patches, facilitating the fusion of information from all visible patches at a later stage.

$$P_{cv} \leftarrow \{p_{cls}\} \cup P_v \tag{8}$$

3.4.2 Masked Encoder. The transformer MAE captures latent inter-patch relationships within the visible patches $P_{cv}$, subsequently reconstructing the masked patches $P_m$. These bidirectional transformer autoencoder blocks facilitate information capture from both antecedent and subsequent visible patches. The self-attention mechanism autonomously derives suitable representations from input patches, expediting the recovery of masked patches.

Delving deeper, the transformer encoder $E = \{\mathrm{Encoder}_i(\cdot) \mid i \in \mathbb{N}\}$ incorporates $|E|$ layers, with each layer $\mathrm{Encoder}_i(\cdot)$ discerning latent relationships among input patches $P_i$ and outputting a deep latent representation $P_{i+1}$.

$$P_{i+1} \leftarrow \mathrm{Encoder}_i(P_i, M'_v), \tag{9}$$

$$\text{s.t. } \begin{cases} P_0 = P_{cv} \\ P_{i+1}, P_i, M'_v \in \mathbb{R}^{|P_v| \times d} \end{cases} \tag{10}$$

While maintaining the same shape as $P_i$, each patch $p_{i+1,j} \in P_{i+1}$ fuses information from all patches in $P_i$ via the self-attention encoder block.

$$p_{i+1,j} \leftarrow \mathrm{Encoder}_i(p_{i,0}, p_{i,1}, \ldots, p_{i,|P_v|-1}, M'_v) \tag{11}$$

Hence, the class patch $p_{cls}$ fuses information from all visible patches, enabling the representation of the burst as an integrated entity.

Note that the attention mask $M'_v$ is provided to the self-attention block within each layer of transformer encoders $\mathrm{Encoder}_i$, serving to mask the padded patches. $M'_v$ conceals the attention scores of padded patches, ensuring that only the original patches will be attended to and precluding interference from padded patches.

3.4.3 Decoder. With the final layer's latent representation of the visible patches, denoted as $P_{|E|}$, we can reconstruct the masked patches using the transformer decoder. The decoder's architecture resembles that of the encoder, consisting of multiple layers with self-attention blocks. There are two key differences: (1) the decoder is relatively lightweight, and (2) the number of patches is $|P|$.

To reconstruct the masked patches, we initially insert placeholders at the masked patch positions and return the visible patches to their original locations. This operation, referred to as $\mathrm{Unshuffle}$, produces $P_0$.

$$P_0 \leftarrow \mathrm{Unshuffle}(\underbrace{P_{|E|}}_{|P_{|E|}| = |P_v|},\ \underbrace{0^d, 0^d, \ldots, 0^d}_{|P_m|}) \tag{12}$$

The placeholder patches, represented by $|P_m|$ zero vectors $0^d$, are soon replaced by the decoder with recovered masked patches. The $\mathrm{Unshuffle}$ operation restores the patch count to $|P| = |P_v| + |P_m|$. Subsequently, a positional embedding is added to $P$. The decoder layers, denoted as $D = \{\mathrm{Decoder}_i(\cdot) \mid i \in \mathbb{N}\}$, are then employed to replace the placeholders with recovered masked patches. The decoding process proceeds as follows:

$$P_{i+1} \leftarrow \mathrm{Decoder}_i(P_i), \tag{13}$$

$$\text{s.t. } P_{i+1}, P_i \in \mathbb{R}^{|P| \times d} \tag{14}$$

The final decoder layer's latent representation, $P_{|D|}$, has its placeholder patches extracted and projected to conform to the input burst's original shape, represented as $P_r$.

The pre-training loss is calculated as the Mean Square Error (MSE) between the restored patches $P_r$ and the original masked patches $P_m$:

$$\mathcal{L}_{pretrain} = \mathrm{MSE}(P_r,\ P_m) \tag{15}$$

$$= \frac{1}{|P_m|} \sum_{i=0}^{|P_m|-1} \left\| p_{r_i} - p_{m_i} \right\|^2 \tag{16}$$

3.5 Fine-tuning Flow-MAE

After the pre-training phase, the Flow-MAE encoder can generate deep latent representations for any given burst. Subsequently, the classifier model is fine-tuned using a modest amount of task-specific labeled traffic data. The classifier model retains the MAE's encoder component (including its architecture and weights), while the decoder part is replaced by a fully connected linear classifier.

Fine-tuning is effective for downstream classification tasks for several reasons: (1) The encoder inherits from the pre-trained model, enabling the use of the same burst representation input in downstream tasks; (2) The pre-trained encoder output representation is universally applicable to all traffic patterns, facilitating adaptation to specific feature representations; (3) The output class patch $p_{cls}$ integrates information from all visible patches and represents the entire burst, making it suitable for direct classification.

A fully connected linear layer serves as a $\mathrm{Classifier}$, projecting the encoder output class patch $p_{|D|, cls}$ to a probability vector $\hat{Y} \in [0, 1]^c$, where $c$ denotes the number of classes in the fine-tuning dataset. The predicted label index is defined as:

$$\hat{y} = \arg\max \hat{Y} \tag{17}$$

For a classification task involving $c$ classes, given an input burst B, ground-truth label probability $Y = \{y_1, y_2, \ldots, y_c\}$, and prediction probability vector $\hat{Y} = \{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_c\}$, the loss function is defined as the cross-entropy loss:

$$\mathcal{L}_{CrossEntropy} = \mathrm{CrossEntropy}(\mathrm{Softmax}(\hat{Y}), Y) \tag{18}$$

$$= -\sum_{i=1}^{c} \log \frac{\exp(\hat{y}_i)}{\sum_{j=1}^{c} \exp(\hat{y}_j)} \cdot y_i \tag{19}$$
Specifically, for our classification task, the ground truth label index $y$ constitutes an integer within the range $[0, c-1]$, and $Y$ represents a one-hot encoding vector. The cross-entropy loss is expressed as:

$$\mathcal{L}_{CrossEntropy} = -\log \frac{\exp(\hat{y}_y)}{\sum_{j=1}^{c} \exp(\hat{y}_j)} \tag{20}$$
Thus, our fine-tuning objective is to minimize the cross-entropy loss $\mathcal{L}_{CrossEntropy}$. Upon completion of the fine-tuning process, the model is primed for the classification task. Given target-specific traffic, the fine-tuned model can accurately predict its classification.

4 EXPERIMENT RESULTS

4.1 Experiment Setup

4.1.1 Flow-MAE Implementation.

Table 1: Hyper-parameter Configurations in Pre-training and Fine-tuning

Hyper-parameter             Pre-training Value   Fine-tuning Value
burst length l_MAX          1024                 1024
embedding stride s          8                    8
embedding kernel size k     8                    8
embedding dimensions d      768                  768
mask ratio r_m              0.15                 0
attention drop rate r_a     0                    0.1
forward drop rate r_f       0                    0.1
embedding patches |P|       128                  128
encoder layers |E|          12                   12
encoder heads h_E           12                   12
decoder layers |D|          12                   None
decoder heads h_D           12                   None
epochs                      800                  25
batch size                  32                   32
learning rate               1e-4                 1e-4
warm up ratio               0.015                0

We adopt the Vision Transformer base (ViT-base) [9] model as the masked autoencoder backbone in Flow-MAE. ViT-base incorporates 12 encoder layers and 12 decoder layers, and adopts a 768-dimension embedding for each input patch.

Table 1 presents the parameter configurations for the pre-training and fine-tuning experiments. The Flow-MAE model comprises $|E| = 12$ transformer encoder layers, each featuring a self-attention block with $h_E = 2$ heads. The pre-training Masked Patch Model decoder comprises $|D| = 12$ layers and $h_D = 12$ self-attention heads. The input burst length $l_{MAX}$ is set at 1024 bytes. The convolution embedding layer, employing a stride of s = 32 and kernel size k = 32, projects the burst into $|P| = 32$ patches of d = 768 dimensions. During pre-training, with a mask ratio $r_m = 0.15$, $|P_m| = 4$ of the 32 total patches are masked, leaving the remaining $|P_v| = 28$ patches visible. The attention drop rate $r_a$ and forward drop rate $r_f$ are assigned values of 0 during pre-training and 0.1 during fine-tuning, respectively. In total, the Flow-MAE model contains 85.12 M parameters and a model size of 450 MB.

All experiments were conducted on a testbed equipped with an i7-12700K CPU (8 P-cores @4.7GHz and 4 E-cores @3.6GHz), 64 GB DDR5 DRAM (@6000MT/s), and two NVIDIA GeForce 3090Ti GPUs (24 GB of GDDR6X memory each). The software environment of the testbed includes Ubuntu 22.04.1 LTS (kernel 5.15.0-50), Python 3.8.13, PyTorch 1.12.0, and CUDA 11.6.

4.1.2 Datasets.

Table 2: Statistical Information of the Datasets

Dataset                         #Benign   #Malicious   #Flow    #Packet
CIC-IDS2018 [38]                1         7            4.5M     80M
USTC-TFC-2016 [50]              10        10           9.8K     97.1K
ISCX-VPN-2016 [10]              6         6            8.4K     18.7K
ISCX-Tor-2016 [23]              8         8            3K       80K
Cross-Platform (Android) [46]   215       0            27.8K    656K

The experiments are conducted using the following datasets, as summarized in Table 2:

  1. CIC-IDS2018 [38]: An intrusion detection system evaluation dataset collected by the Canadian Institute for Cybersecurity (CIC). The CIC-IDS2018 dataset comprises benign background traffic and seven distinct malicious scenarios: Brute-force, Heartbleed, Botnet, DoS, DDoS, Web attacks, and infiltration, collected over a 10-day period. The network infrastructure includes 50 malicious machines in a local network, as well as 420 victim machines and 30 victim servers distributed across five departments.

  2. USTC-TFC-2016 [50]: A dataset containing encrypted traffic from 10 malware and 10 benign applications. This well-recognized dataset was published by a researcher from the University of Science and Technology of China (USTC) and has been extensively adopted by researchers.

  3. ISCX-VPN-2016-App [10]: ISCX-VPN-App is a dataset that contains encrypted network traffic collected by the Canadian Institute for Cybersecurity (CIC). The dataset focuses on Virtual Private Networks (VPNs) used for network communication. VPNs are widely used for intrusion purposes as they allow users to bypass censorship and hide their location through protocol obfuscation and proxy mode. The ISCX-VPN-App dataset is classified based on applications and consists of 17 different applications.

  4. ISCX-VPN-2016-Service [10]: ISCX-VPN-Service is another dataset within the ISCX-VPN-2016 collection. Similar to ISCX-VPN-App, this dataset contains encrypted network traffic collected by the CIC. It specifically focuses on the service level of VPNs. The ISCX-VPN-Service dataset is classified based on 12 different services related to VPN usage. By studying this dataset, researchers can gain insights into the characteristics and behavior of various VPN services.

  5. ISCX-Tor-2016 [23]: This dataset comprises traffic from 16 applications utilizing the Onion Router (Tor) for encrypted communications. Tor obscures data between the sender and the receiver through a distributed routing network. Some intruders may use Tor to conceal their identity and activities, as the resulting obfuscation makes tracking and traffic classification challenging.

  6. Cross-Platform [46]: This dataset includes encrypted traffic from Android applications, featuring 215 apps from the top 100 Apps in the US, China, and India. It is representative of encrypted traffic from current applications, as it covers a wide range of applications worldwide. Moreover, it has the highest number of classes among all datasets used in the experiment, making it more challenging than the others.

Flow-MAE is pre-trained on a subset of unlabeled background traffic from CIC-IDS2018 [38], specifically the traffic collected on Tuesday, 20-02-2018, after removing malicious traffic directed towards the victim host 172.31.69.25. This subset dataset from 20-02-2018 consists of 451 raw pcap files, totaling 57.34 GB, and generates a pre-training dataset of 3.86 million bursts. To evaluate Flow-MAE's effectiveness and robustness, the model is fine-tuned for malicious traffic classification experiments on the first five public datasets mentioned above. The pre-training dataset is significantly larger in size compared to the fine-tuning dataset. For instance, the fine-tuning data on CIC-IDS2018 consists of approximately 10GB of malicious traffic. Additionally, the Cross-Platform (Android) dataset [46], which includes only benign traffic from 215 apps, is employed to validate Flow-MAE's transferability and robustness in the encrypted flow classification task involving large categories. However, to ensure an adequate amount of data for analysis, a data size threshold of at least 20MB was imposed. As a result, only 71 out of the 215 apps were included in the evaluation. This task demonstrates the model's capacity to adapt to different classification scenarios, even when handling a larger number of classes and diverse data types.

As outlined in the preprocessing procedure section (§ 3.2.2), the burst's IP address and port fields are anonymized to prevent explicit identification information from interfering with representation learning. Anonymous TCP and UDP sessions from each class are selected to create labeled burst datasets, which are then divided into training and testing sets in a 9:1 ratio.

4.1.3 Baselines. Twelve state-of-the-art generic malicious traffic detection methods are utilized as baselines: AppScanner [45], CUMUL [33], BIND [1], K-fp [14], FlowPrint [47], DF [41], FS-Net [29], GraphDApp [40], TSCRNN [26], Deeppacket [31], PERT [15], and ET-BERT [27]. These baseline methods represent a diverse range of approaches in the literature: (1) they encompass both traditional machine learning-based and deep learning-based methods; (2) they employ a wide variety of features, such as flow statistics, time series, raw packet headers, and raw payloads.

4.1.4 Evaluation Metrics. We evaluate the performance of Flow-MAE using four typical metrics in the literature: Accuracy (AC), Precision (PR), Recall (RC), and F1-score (F1). These metrics are defined as follows:

  • Accuracy (AC): $AC = \frac{TP + TN}{TP + TN + FP + FN}$, the ratio of the number of correctly classified instances (True Positives $TP$ and True Negatives $TN$) to all instances in the test dataset, which comprises all input bursts $TP + TN + FP + FN$.

  • Precision rate: $PR = \frac{TP}{TP + FP}$, the ratio of the number of correctly reported instances (True Positives $TP$) to the number of reported instances $TP + FP$.

  • Recall rate: $RC = \frac{TP}{TP + FN}$, the ratio of the number of correctly reported instances to the number of correct instances $TP + FN$.

  • F1-score: $F1 = \frac{2 \times PR \times RC}{PR + RC}$, where PR is the Precision rate and RC is the Recall rate.

To account for imbalanced traffic categories and avoid biased results, Macro Average [44] is employed to calculate the mean values of AC, PR, RC, and F1 for all traffic categories. Moreover, to further mitigate potential biases, a five-fold validation is performed on each dataset.
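
For reference, the macro-averaged metrics can be computed with scikit-learn as in the snippet below; this is a standard recipe, not the authors' evaluation script.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def macro_metrics(y_true, y_pred):
    """Return (AC, PR, RC, F1) with PR/RC/F1 macro-averaged over classes."""
    ac = accuracy_score(y_true, y_pred)
    pr, rc, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return ac, pr, rc, f1
```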

Table 3: Accuracy Evaluation Results on CIC-IDS-2018, USTC-TFC-2016 and ISCX-VPN-App Datasets

                          CIC-IDS-2018 [38]                  USTC-TFC-2016 [50]                 ISCX-VPN-2016-App [10]
Method                    AC      PR      RC      F1         AC      PR      RC      F1         AC      PR      RC      F1
AppScanner [45]           0.9091  0.9147  0.9074  0.9110     0.8954  0.8984  0.8968  0.8892     0.6266  0.4864  0.5198  0.4935
CUMUL [33]                0.5860  0.6181  0.6212  0.6196     0.5675  0.6171  0.5738  0.5513     0.5365  0.4129  0.4535  0.4236
BIND [1]                  0.8661  0.8271  0.8091  0.8180     0.8457  0.8681  0.8382  0.8396     0.6767  0.5152  0.5153  0.4965
K-fp [14]                 -       -       -       -          -       -       -       -          0.6070  0.5478  0.5430  0.5303
FlowPrint [47]            0.7841  0.7102  0.6921  0.7010     0.8146  0.6434  0.7002  0.6573     0.8767  0.6697  0.6651  0.6531
DF [41]                   0.7976  0.7554  0.7441  0.7497     0.7787  0.7883  0.7819  0.7593     0.6116  0.5706  0.4752  0.4799
FS-Net [29]               0.9015  0.8994  0.8997  0.8996     0.8846  0.8846  0.8920  0.8840     0.6647  0.4819  0.4848  0.4737
GraphDApp [40]            0.8667  0.8441  0.8439  0.8440     0.8789  0.8226  0.8260  0.8234     0.6328  0.5900  0.5472  0.5558
TSCRNN [26]               -       0.9678  0.9724  0.9701     -       0.9870  0.9860  0.9870     -       -       -       -
Deeppacket [31]           0.9661  0.9874  0.9883  0.9879     0.9640  0.9650  0.9631  0.9641     0.9758  0.9785  0.9745  0.9765
PERT [15]                 0.9302  0.9479  0.9367  0.9423     0.9909  0.9911  0.9910  0.9911     0.8229  0.7092  0.7173  0.6992
ET-BERT (flow) [27]       0.9943  0.9944  0.9942  0.9943     0.9929  0.9930  0.9930  0.9930     0.8519  0.7508  0.7294  0.7306
ET-BERT (packet) [27]     0.9955  0.9933  0.9932  0.9933     0.9915  0.9915  0.9916  0.9916     0.9962  0.9936  0.9938  0.9937
Flow-MAE (ours)           0.9958  0.9959  0.9958  0.9958     0.9988  0.9984  0.9989  0.9986     0.9987  0.9991  0.9989  0.9990
Table 4: Accuracy Evaluation Results on ISCX-VPN-Service, ISCX-Tor-2016, and Cross-Platform (Android) Datasets

                          ISCX-VPN-2016-Service [10]         ISCX-Tor-2016 [23]                 Cross-Platform (Android) [46]
Method                    AC      PR      RC      F1         AC      PR      RC      F1         AC      PR      RC      F1
AppScanner [45]           0.7182  0.7339  0.7225  0.7197     0.6722  0.3756  0.4422  0.3913     0.3868  0.2523  0.2594  0.2440
CUMUL [33]                0.5610  0.5883  0.5676  0.5668     0.6606  0.3850  0.4416  0.3918     0.3525  0.2221  0.2409  0.2189
BIND [1]                  0.7534  0.7583  0.7488  0.7420     0.7185  0.4598  0.4515  0.4511     0.4728  0.3126  0.3253  0.3026
K-fp [14]                 0.6430  0.6492  0.6417  0.6395     0.6472  0.5576  0.5849  0.5522     0.2248  0.2113  0.2104  0.2052
FlowPrint [47]