The Emerging Trends of Multi-Label Learning

Weiwei Liu, Haobo Wang, Xiaobo Shen, and Ivor W. Tsang
Weiwei Liu is with the School of Computer Science, Wuhan University, Wuhan 430079, China. E-mail: liuweiwei863@gmail.com.
Haobo Wang is with the College of Computer Science and Technology, Zhejiang University. E-mail: wanghaobo@zju.edu.cn.
Xiaobo Shen is with the School of Computer and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China. E-mail: njust.shenxiaobo@gmail.com.
Ivor W. Tsang is with the Centre for Artificial Intelligence, FEIT, University of Technology Sydney, NSW, Australia. E-mail: ivor.tsang@uts.edu.au.
This work is supported by the National Natural Science Foundation of China under Grant No. 61976161, 62176126 and 61906091, the Natural Science Foundation of Jiangsu Province, China (Youth Fund Project) under Grant No. BK20190440, the Fundamental Research Funds for the Central Universities under Grant No. 30921011210, the ARC under Grant No. DP180100106 and DP200101328. (Corresponding author: Weiwei Liu.)

Abstract

Exabytes of data are generated daily, which creates a growing need for new approaches to the grand challenges that big data brings to multi-label learning. For example, extreme multi-label classification is an active and rapidly growing research area that deals with classification tasks with an extremely large number of classes or labels, and building multi-label classification models from massive data with limited supervision has become valuable for practical applications. In addition, tremendous effort has gone into harnessing the strong learning capability of deep learning to better capture label dependencies in multi-label learning, which is the key for deep learning to address real-world classification tasks. However, there has been a lack of systematic studies that focus explicitly on analyzing the emerging trends and new challenges of multi-label learning in the era of big data. A comprehensive survey is needed to fulfill this mission and to delineate future research directions and new applications.

Index Terms:

Extreme Multi-label Learning, Multi-label Learning with Limited Supervision, Deep Learning for Multi-label Learning, Online Multi-label Learning, Statistical Multi-label Learning, New Applications.


1 Introduction

Multi-label classification (MLC), which assigns multiple labels to each instance simultaneously, is of paramount importance in a variety of fields ranging from protein function classification and document classification to automatic image categorization. For example, an image may have the tags Cloud, Tree and Sky; a document may cover a range of topics, such as News, Finance and Sport; a gene may be associated with the functions Protein Synthesis, Metabolism and Transcription.

Traditional multi-label classification methods do not cope well with the growing demands of today's large and complex data. As a result, there is a pressing need for new multi-label learning paradigms, and new trends are emerging. This paper aims to provide a comprehensive survey of these emerging trends and the state-of-the-art methods, and to discuss valuable directions for future research.

With the advent of the big data era, extreme multi-label classification (XMLC) has become a rapidly growing line of research that focuses on multi-label problems with an extremely large number of labels. Many challenging applications, such as image or video annotation, web page categorization, gene function prediction and language modeling, can benefit from being formulated as multi-label classification tasks with millions, or even billions, of labels. Existing MLC techniques cannot address the XMLC problem because their computational cost becomes prohibitive when the number of labels is large. One of the pioneering works in XMLC is SLEEC [1], which learns a small ensemble of local distance-preserving embeddings. The authors of SLEEC also contributed the popular public Extreme Classification Repository (http://manikvarma.org/downloads/XC/XMLRepository.html), which has promoted the development of XMLC. The state-of-the-art XMLC techniques are mostly based on one-vs-all classifiers [2, 3, 4, 5], trees [6, 7, 8, 9, 10] and embeddings [11, 1, 12, 13, 14]. Unfortunately, theoretical results for XMLC in very high-dimensional settings remain relatively under-explored. Moreover, the labels are extremely sparse, which leads to a long-tail label distribution. Precisely predicting all the positive labels of test examples therefore poses a serious challenge in XMLC.

As data volumes grow quickly, acquiring full supervision is usually expensive and time-consuming. In MLC tasks, the high-dimensional output space makes it even harder. To mitigate this problem, many works have proposed settings of MLC with limited supervision. For example, multi-label learning with missing labels (MLML) [15] assumes that only a subset of the labels is observed; semi-supervised MLC (SS-MLC) [16] assumes a few fully labeled examples and a large amount of unlabeled data; partial multi-label learning (PML) [17] studies an ambiguous setting in which a superset of the true labels is given. Many effective models have also been proposed based on graphs [15, 18, 19], embeddings [11, 20, 21], probabilistic models [22, 23] and so on. Other interesting settings with imperfect supervision have been considered recently, such as MLC with noisy labels [24], multi-label zero-shot learning [25] and multi-label active learning [26]. These settings make MLC practical in real-world applications by reducing supervision costs and thus deserve more attention.

Deep learning has shown excellent potential since 2012, when AlexNet achieved surprising performance on the single-label image classification task of the ILSVRC2012 challenge. As most natural images contain multiple objects, it is more practical to associate each image with multiple tags or labels. Developing deep learning techniques that can address the MLC problem is therefore in high practical demand for real-world image classification tasks. Large-scale multi-label image databases, e.g., Open Images [27] and the newly released Tencent ML-Images [28], promote deep learning for the MLC problem. In this area, BP-MLL [29] is the first method to utilize a neural network (NN) architecture for the MLC problem. The Canonical Correlated AutoEncoder (C2AE) [30] is the first deep NN (DNN) based embedding method for the MLC problem. In addition, deep learning methods have been developed for challenging MLC problems, such as extreme MLC [31, 32, 33], partial and weakly-supervised MLC [34, 19, 23] and MLC with unseen labels [35, 36]. Advanced deep learning architectures [37, 38, 39, 40] for MLC problems have also been studied recently. How to harness the strong learning capability of deep learning to better capture label dependencies is the key for deep learning to address MLC problems.

The Web continues to generate quintillions of bytes of streaming data daily, which poses key challenges for MLC tasks. First, existing off-line MLC algorithms are impractical for streaming data because they need to store the entire data set in memory; second, it is non-trivial to adapt off-line multi-label methods to sequential data. Therefore, several approaches for online multi-label classification have recently been proposed, including [41, 42, 43]. However, the experimental and theoretical results obtained so far remain limited and unsatisfactory. There is a pressing need for rigorous research into online multi-label learning.

Figure 1: The structure of this paper.

Many references [44, 45] have shown that multi-label learning methods which explicitly capture label dependency usually achieve better prediction performance. Modeling label dependency has therefore been one of the major challenges in multi-label classification over the past few years, and a plethora of methods have been proposed to model it. For example, the classifier chain (CC) model [46] captures label dependency by using binary label predictions as extra input attributes for the subsequent classifiers in a chain. CCA [47] uses canonical correlation analysis to learn label dependency. CPLST [48] uses principal component analysis to capture both the label and the feature dependencies. Unfortunately, the statistical properties and asymptotic analysis of these methods are still not well explored. One of the emerging trends is to develop statistical theory for understanding multi-label dependency modeling.

During the past decade, multi-label classification has been successfully applied in computer vision, natural language processing and data mining. This paper will briefly review these emerging applications, which may inspire the community to explore more interesting applications. The structure of this paper is shown in Figure 1. Some evaluation metrics and important notations used in this paper can be found in the Supplementary Materials.

2 Extreme Multi-label Learning

Extreme multi-label classification (XMLC) aims to learn a classifier that can automatically annotate a data point with the most relevant subset of labels from an extremely large label set, which has opened up a new research frontier in data mining and machine learning. For example, millions of people upload selfies to Facebook every day; based on these selfies, one might wish to build a classifier that recognizes who appears in each image. XMLC applications have been found in various domains ranging from language modeling, document classification and face recognition to gene function prediction. The main challenge of XMLC is that it learns with hundreds of thousands, or even millions, of labels, features and training points. To address this issue, the state-of-the-art XMLC techniques are mostly based on embeddings, trees and one-vs-all classifiers. We review these techniques in this section. Note that there are also some new deep learning-based XMLC models, but we defer their discussion to §4.

2.1 Embedding Methods

To deal with many labels, [49] assumes that label vectors have small support, i.e., that they are sparse. Under this assumption, each label vector can be projected into a lower-dimensional compressed label space, which can be regarded as encoding. A regressor is then learned for each compressed label. Lastly, a compressed sensing technique is used to decode the labels from the regression outputs of each testing instance. Many embedding-based works have recently been developed in this learning paradigm. These works mainly differ in their compression and decompression methods, such as canonical correlation analysis (CCA) [47] and Bloom filters [50]. Amongst them, SLEEC [1] is one of the seminal embedding methods in XMLC due to its simplicity and promising experimental results [1].

SLEEC learns low-dimensional embeddings which non-linearly capture label correlations by preserving the pairwise distances between only the closest (rather than all) label vectors. Regressors are then trained in the embedding space. SLEEC uses a $k$-nearest neighbor ($k$NN) classifier in the embedding space for prediction.

Assume $x_{i}\in\mathbb{R}^{d\times 1}$ is a real vector representing an input or instance (feature) and $y_{i}\in\{0,1\}^{L\times 1}$ is the corresponding output or label vector $(i\in\{1,\ldots,n\})$, where $n$ denotes the number of training instances. The input matrix is $X=[x_{1},\ldots,x_{n}]\in\mathbb{R}^{d\times n}$ and the output matrix is $Y=[y_{1},\ldots,y_{n}]\in\{0,1\}^{L\times n}$. SLEEC maps the label vector $y_{i}$ to a $\varpi$-dimensional vector $z_{i}\in\mathbb{R}^{\varpi\times 1}$ (where $\varpi<L$ is a small constant) and learns a set of regressors $V\in\mathbb{R}^{\varpi\times d}$ such that $z_{i}\approx Vx_{i},\forall i\in\{1,\ldots,n\}$. During prediction, for a testing instance $x$, SLEEC first computes its embedding $Vx$ and then performs $k$NN over the set $[Vx_{1},\ldots,Vx_{n}]$. We denote the transpose of a vector or matrix by the superscript $T$ and the logarithm to base 2 by $\log$. Let $||\cdot||_{F}$ and $||\cdot||_{1}$ denote the Frobenius norm and the $\ell_{1}$ norm of a matrix, respectively.

SLEEC aims to learn an embedding matrix $Z=[z_{1},\ldots,z_{n}]\in\mathbb{R}^{\varpi\times n}$ through the following formula:

$$\min_{Z\in\mathbb{R}^{\varpi\times n}}\|P_{\Omega}(Y^{T}Y)-P_{\Omega}(Z^{T}Z)\|^{2}_{F}\tag{1}$$

where the index set $\Omega$ denotes the set of neighbors: $(i,j)\in\Omega$ iff $j\in\mathcal{N}_{i}$ . $\mathcal{N}_{i}$ denotes a set of nearest neighbors of $i$ . $P_{\Omega}(\cdot)$ is defined as:

$$\big(P_{\Omega}(Y^{T}Y)\big)_{(i,j)}=\begin{cases}y_{i}^{T}y_{j},&\text{if }(i,j)\in\Omega\\ 0,&\text{otherwise.}\end{cases}\tag{2}$$
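
The projection in Equation (2) can be illustrated with a minimal NumPy sketch. The function name `masked_gram` and the toy neighbor sets are hypothetical and chosen only for illustration; SLEEC itself obtains $\mathcal{N}_{i}$ from nearest-neighbor search over the label vectors.

```python
import numpy as np

def masked_gram(Y, neighbors):
    """Return P_Omega(Y^T Y) for a label matrix Y of shape (L, n).

    neighbors[i] lists the indices j with (i, j) in Omega (hypothetical input).
    """
    n = Y.shape[1]
    gram = Y.T @ Y                      # (n, n) matrix of inner products y_i^T y_j
    mask = np.zeros((n, n), dtype=bool)
    for i, nbrs in enumerate(neighbors):
        mask[i, list(nbrs)] = True      # keep only entries (i, j) with j in N_i
    return np.where(mask, gram, 0)

# Toy example: L = 3 labels, n = 3 instances (columns are label vectors).
Y = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 0]])
neighbors = [[1], [0, 2], [1]]          # assumed neighbor sets N_i
P = masked_gram(Y, neighbors)
```

Entries outside $\Omega$ are zeroed, so only the inner products between neighboring label vectors enter the objective in Equation (1).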

Given the embedding matrix $Z$, SLEEC finds the regressors $V$ by minimizing the following objective with $\ell_{1}$ and $\ell_{2}$ regularization, which reduces the prediction time and the model size and helps avoid overfitting:

$$\min_{V\in\mathbb{R}^{\varpi\times d}}\|Z-VX\|^{2}_{F}+\mu\|V\|^{2}_{F}+\lambda\|VX\|_{1}\tag{3}$$

where $\mu>0$ and $\lambda>0$ are the regularization parameters.
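
As a rough illustration of the regression and prediction steps, the sketch below makes two simplifying assumptions that are not part of SLEEC: it sets $\lambda=0$ so that Equation (3) has a closed-form ridge solution, and it runs a brute-force $k$NN over all training embeddings. The function names are hypothetical.

```python
import numpy as np

def fit_ridge_regressors(Z, X, mu):
    """Closed-form minimiser of ||Z - V X||_F^2 + mu ||V||_F^2 (the l1 term is omitted)."""
    d = X.shape[0]
    # V = Z X^T (X X^T + mu I)^{-1}; solve the symmetric system instead of inverting.
    return np.linalg.solve(X @ X.T + mu * np.eye(d), X @ Z.T).T

def knn_label_scores(V, X_train, Y_train, x, k):
    """Score labels by summing the label vectors of the k nearest training embeddings."""
    train_emb = V @ X_train                               # (varpi, n)
    dists = np.linalg.norm(train_emb - (V @ x)[:, None], axis=0)
    nearest = np.argsort(dists)[:k]
    return Y_train[:, nearest].sum(axis=1)                # (L,) vote counts per label

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))                               # d = 4, n = 6
Z = rng.normal(size=(2, 6))                               # varpi = 2 (random stand-in for learned embeddings)
Y = rng.integers(0, 2, size=(5, 6))                       # L = 5
V = fit_ridge_regressors(Z, X, mu=0.1)
scores = knn_label_scores(V, X, Y, X[:, 0], k=1)          # query with a training point
```

With $k=1$ and a training point as the query, the nearest neighbor is the point itself, so the scores reproduce its label vector. The $\ell_{1}$ term in Equation (3) requires an iterative solver and is not covered by this sketch.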

To scale to large data sets, SLEEC clusters the training set into smaller local regions based only on features, without considering label information. Therefore, instances that have similar labels are not guaranteed to fall into the same region. This partitioning may affect the quality of the embeddings learned by SLEEC.

Many methods have been developed to address this issue. For example, AnnexML [12] presents a graph embedding method based on the $k$-nearest neighbor graph (KNNG). AnnexML aims to reconstruct the KNNG of label vectors in the embedding space to improve both the prediction accuracy and the speed of the $k$-nearest neighbor classifier. DEFRAG [51] represents each feature $j\in[d]$ as an $L$-dimensional vector $q^{j}=\sum_{i=1}^{n}x_{i}^{j}y_{i}$, which is a weighted aggregate of the label vectors of the data points where feature $j$ is non-zero. After creating these representative vectors, DEFRAG performs hierarchical clustering on them to obtain feature clusters, and then performs agglomeration by summing up the coordinates of the feature vectors within a cluster. [51] reports that DEFRAG is faster and achieves better performance.
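
The DEFRAG representation and agglomeration steps reduce to two small matrix operations, sketched below in NumPy. The function names are hypothetical, and the feature clusters are passed in as given; the hierarchical clustering that produces them in [51] is not reproduced here.

```python
import numpy as np

def feature_representatives(X, Y):
    """Row j is q^j = sum_i x_i^j y_i, with X of shape (d, n) and Y of shape (L, n)."""
    return X @ Y.T                                        # (d, L)

def agglomerate(X, clusters):
    """Sum the rows of X (features) that fall in the same cluster."""
    return np.vstack([X[idx, :].sum(axis=0) for idx in clusters])

X = np.array([[1, 0, 2],
              [0, 3, 0]])                                 # d = 2 features, n = 3 instances
Y = np.array([[1, 0, 1],
              [0, 1, 1]])                                 # L = 2 labels
Q = feature_representatives(X, Y)
X_small = agglomerate(X, clusters=[[0, 1]])               # assumed single cluster {0, 1}
```

Each row of `Q` aggregates the label vectors of the instances in which the corresponding feature is active, weighted by the feature value, and `X_small` is the reduced feature matrix obtained after merging the clustered features.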

Word embeddings have been successfully used for learning non-linear representations of text data for natural language processing (NLP) tasks, such as understanding word and document semantics and classifying documents. Recently, [52] was the first to use word embedding techniques to learn the label embeddings of instances. [52] treats each instance as a "word" and defines its "context" as the $k$ nearest neighbors of the instance in the space formed by the training label vectors $y_{i}$. Based on the Skip-Gram Negative Sampling (SGNS) technique, [52] learns embeddings $z_{1},\ldots,z_{n}$ through the following formula:

$$\max_{z_{1},\ldots,z_{n}}\sum_{i=1}^{n}\bigg(\sum_{j\in\mathcal{N}_{i}}\log(\sigma(\langle z_{i},z_{j}\rangle))+C\sum_{j^{\prime}}\log(\sigma(-\langle z_{i},z_{j^{\prime}}\rangle))\bigg)\tag{4}$$

where $j^{\prime}\in\{1,\ldots,n\}$, $\sigma(\cdot)$ is the sigmoid function, $\langle\cdot,\cdot\rangle$ denotes the inner product and $C$ is a constant. After learning the label embeddings $z_{1},\ldots,z_{n}$, [52] follows the learning algorithm of SLEEC to learn $V$ and make predictions. [52] reports prediction accuracies that are competitive with state-of-the-art embedding methods and brings a new insight to XMLC from the popular word2vec model in NLP, which may open a new line of research.
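
To make the objective in Equation (4) concrete, the following sketch evaluates it for a given set of embeddings. It is an illustration only: the function names and the explicit `negatives` lists (the sampled indices $j^{\prime}$ for each instance) are assumptions, and the natural logarithm is used, which rescales the objective by a constant relative to base 2.

```python
import numpy as np

def log_sigmoid(t):
    # Numerically stable log(sigma(t)) = -log(1 + exp(-t)), natural logarithm.
    return -np.logaddexp(0.0, -t)

def sgns_objective(Z, neighbors, negatives, C):
    """Evaluate objective (4) for embeddings Z of shape (varpi, n)."""
    total = 0.0
    for i in range(Z.shape[1]):
        pos = Z[:, neighbors[i]].T @ Z[:, i]      # <z_i, z_j> for j in N_i
        neg = Z[:, negatives[i]].T @ Z[:, i]      # <z_i, z_j'> for sampled j'
        total += log_sigmoid(pos).sum() + C * log_sigmoid(-neg).sum()
    return total

Z = np.zeros((2, 3))                              # varpi = 2, n = 3
neighbors = [[1], [0], [1]]                       # assumed context sets N_i
negatives = [[2], [2], [0]]                       # assumed negative samples
value = sgns_objective(Z, neighbors, negatives, C=2.0)
```

With all-zero embeddings every inner product is zero, so each term equals $\log(1/2)$; maximizing the objective pushes neighboring instances together and sampled non-neighbors apart.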

The embedding matrix $Z=[z_{1},\ldots,z_{n}]\in\mathbb{R}^{\varpi\times n}$ of existing embedding methods is real-valued. Hence regressors are needed for training, which may involve solving expensive optimization problems. To overcome this limitation, several works leverage coding techniques to train the model efficiently. For example, based on Bloom filters [53], a well-known space-efficient randomized data structure designed for approximate membership testing, [50] designs a simple scheme that selects $k$ representative bits per label for training and proposes a robust decoding algorithm for prediction. However, Bloom filters may yield many false positives.
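
The sketch below shows a standard Bloom filter applied to a label set, to illustrate why false positives can arise. It is not the scheme of [50]: the salted SHA-256 hashing, the function names and the parameter values are assumptions for illustration, and in the learning setting the bits would be predicted by classifiers rather than computed exactly.

```python
import hashlib

def bloom_bits(label, num_bits, num_hashes):
    """Bit positions for one label, derived from salted SHA-256 digests."""
    positions = set()
    for seed in range(num_hashes):
        digest = hashlib.sha256(f"{seed}:{label}".encode()).digest()
        positions.add(int.from_bytes(digest[:8], "big") % num_bits)
    return positions

def encode(labels, num_bits, num_hashes):
    """Bitwise OR of the codes of all labels in the set."""
    code = [0] * num_bits
    for label in labels:
        for pos in bloom_bits(label, num_bits, num_hashes):
            code[pos] = 1
    return code

def decode(code, all_labels, num_hashes):
    """Return every label whose bits are all set; may include false positives."""
    num_bits = len(code)
    return {l for l in all_labels
            if all(code[p] for p in bloom_bits(l, num_bits, num_hashes))}

true_labels = {3, 17, 42}
code = encode(true_labels, num_bits=32, num_hashes=2)
decoded = decode(code, range(100), num_hashes=2)
```

Decoding never misses a true label, but any other label whose bit positions happen to be covered by the true labels is also returned, which is the false-positive behavior noted above.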

To address this issue, [54] casts MLC as a group testing problem. In group testing, one wishes to identify a small number $k$ of defective elements in a large population of size $L$. The idea is to test the items in groups, on the premise that most tests will return negative results, so that only $\varpi<L$ tests are needed to detect the $k$ defective elements. [54] trains $\varpi$ binary classifiers on $z_{i}$, each of which learns to test whether an instance belongs to a group (of labels), and then uses a simple, inexpensive decoding scheme to recover the label vector from the predictions of the classifiers. Recently, [55] developed a sparse coding tree framework for XMLC based on Huffman coding and Shannon-Fano coding. Together, [50, 54, 55] introduce coding theory into MLC, a novel direction that is worth further exploration.
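
A minimal sketch of group-testing encoding and decoding follows. The test matrix `A`, the function names and the `max_misses` tolerance are assumptions made for illustration; here the test outcomes are computed exactly from the label vector, whereas in [54] they are predicted by the trained binary classifiers.

```python
import numpy as np
from itertools import combinations

def encode_groups(A, y):
    """z_t = 1 iff group t (row t of A) contains at least one positive label."""
    return (A @ y > 0).astype(int)

def decode_groups(A, z, max_misses=0):
    """Predict label l as positive if at most max_misses of its groups tested negative."""
    misses = A.T @ (1 - z)        # negative tests among the groups containing each label
    return (misses <= max_misses).astype(int)

# Assumed design: L = 6 labels, varpi = 4 tests, each label assigned to a distinct pair of tests.
A = np.zeros((4, 6), dtype=int)
for l, groups in enumerate(combinations(range(4), 2)):
    A[list(groups), l] = 1

y = np.array([0, 0, 0, 1, 0, 0])  # a single positive label (k = 1)
z = encode_groups(A, y)
y_hat = decode_groups(A, z)
```

Because no column of this `A` is contained in another, a single positive label is recovered exactly from four tests. Handling larger $k$ or noisy classifier outputs requires a suitably designed matrix and a non-zero tolerance.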

Remark. Embedding methods are the most popular strategies for addressing XMLC. SLEEC is a seminal work among them and is recommended as a starting point for beginners. The major limitation of existing embedding methods is that the correlations between the input and output are ignored, so the learned embeddings are not well aligned, which degrades prediction performance. How to build an embedding space that preserves the relations between the input and output is an important topic for future research. For example, [30, 56, 57] explore the correlations between the input and output. They propose to jointly learn a semantic common subspace and view-specific mappings within one framework. The semantic similarity structure among the embeddings is further preserved, ensuring that close embeddings share similar labels. Another limitation of existing embedding methods is that both the training and testing time complexities are high (see Table I). Techniques such as random projection, hashing and parallelization may be able to accelerate the training and testing processes. Tree-based methods, discussed in the next subsection, achieve fast testing speed.

2.2 Tree-based Methods

For tree-based methods, the original large-scale problem is divided into a sequence of small-scale subproblems by hierarchically partitioning the instance set or the label set. The root node is initialized to contain the entire set. A partitioning formulation is then optimized to partition the set in a node into a fixed number $k$ of subsets, which are linked to $k$ child nodes. Nodes are recursively decomposed until a stopping condition is met on the subsets. Each node involves two optimization problems: optimizing the partition criterion, and defining a condition or building a classifier on the feature space to decide which child node an instance belongs to. In the prediction phase, an instance is passed down the tree until it reaches a leaf (instance tree) or several leaves (label tree). For a label tree, the reached leaves contain the predicted labels. For an instance tree, the prediction is made by a classifier trained on the instances in the leaf node. Thus, the main advantage of tree-based methods is that the prediction cost is sub-linear, or even logarithmic if the tree is balanced.

FastXML [6] learns the hierarchy by optimizing a ranking loss function, the normalized Discounted Cumulative Gain (nDCG). nDCG brings two main benefits to XMLC. First, nDCG is sensitive to both ranking and relevance and therefore ensures that the relevant positive labels are predicted with ranks that are as high as possible. This cannot be guaranteed by rank-insensitive measures such as the Gini index or the clustering error. Second, by being rank-sensitive, nDCG can be optimized across all $L$ labels at the current node, thereby ensuring that the local optimization is not myopic. The experiments show that nDCG is more suitable for extreme multi-label learning.
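
The rank sensitivity of nDCG can be seen in a short sketch of nDCG@$k$ for a single instance. The helper below uses the common definition with a $1/\log_{2}(1+\text{rank})$ discount and normalization by the best achievable DCG; the function name is hypothetical and this is not FastXML's node-level optimization.

```python
import numpy as np

def ndcg_at_k(scores, y, k):
    """nDCG@k of the ranking induced by scores against a binary label vector y."""
    k = min(k, len(y))
    order = np.argsort(-scores)[:k]                    # indices of the k top-scored labels
    discounts = 1.0 / np.log2(np.arange(2, k + 2))     # 1 / log2(1 + rank), rank = 1..k
    dcg = float(np.sum(y[order] * discounts))
    n_pos = min(k, int(y.sum()))
    ideal = float(np.sum(discounts[:n_pos]))           # DCG of a perfect ranking
    return dcg / ideal if ideal > 0 else 0.0

y = np.array([1, 0, 1, 0])
good = ndcg_at_k(np.array([0.9, 0.1, 0.8, 0.2]), y, k=2)   # both positives ranked on top
poor = ndcg_at_k(np.array([0.9, 0.8, 0.1, 0.2]), y, k=2)   # one positive displaced
```

The second ranking places an irrelevant label at rank 2 and is penalized accordingly, which a rank-insensitive measure computed on the label set alone would not capture.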

Building on FastXML, PfastreXML [7] studies how to improve the prediction accuracy of tail labels. The labels in XMLC follow a power-law distribution. Infrequently occurring labels usually convey more information, but they have little training data and are harder to predict than frequently occurring ones. PfastXML improves FastXML by replacing the nDCG loss with its unbiased propensity-scored variant, which assigns higher rewards for accurately predicting tail labels. Moreover, PfastreXML re-ranks PfastXML's predictions using tail-label classifiers. [7] shows that PfastreXML achieves promising performance in predicting tail labels and has been successfully applied to tagging, recommendation and ranking problems. SwiftXML [9] maintains all the scaling properties of PfastreXML, but improves its prediction accuracy by considering more information about revealed item preferences and item features. SwiftXML proposes a novel node partitioning function that optimizes two separating hyperplanes, one in the user feature space and one in the item feature space. Experiments on tagging on Wikipedia and item-to-item recommendation on Amazon reveal that SwiftXML is more accurate than leading extreme classifiers by 14%.

TABLE I: The training and testing time complexity of XMLC methods ($\text{nnz}(X)$ denotes the number of non-zeros in $X$, $C$ is a constant, $O(\zeta)$ denotes the computational complexity of $\varpi$-bit Hamming distance calculation, $T$ is the number of trees, $h$ is the number of levels in the tree, $c$ is the number of top-scoring items being reranked by the base-classifiers, and $k\ll L$ is a small constant).

Methods	Training Time	Testing Time
Embedding: SLEEC [1]	$O(n\varpi^{2}+n\varpi C)$	$O(nd+kL)$
Embedding: DEFRAG [51]	$O(\text{nnz}(X)\log d)$	$O(nd+kL)$
Embedding: CoH [56]	$O(n(d^{2}+L^{2}))$	$O(n\zeta+kL)$
Tree: FastXML [6]	$O(nT\log L+\text{nnz}(X)nT)$	$O(Td\log L)$
Tree: SwiftXML [9]	$O(\text{nnz}(X)Tn\log n)$	$O((T\log n+c)\text{nnz}(X))$
Tree: GBDT-SPARSE [58]	$O(\text{nnz}(X)dTh\log k)$	$O(Tk\log k)$
OVA: PD-Sparse [59]	$O(ndC)$	$O(dL)$
OVA: LF [60]	$O(nd+L\log L+n\log n+nL)$	$O(d+\log(2L))$
OVA: Parabel [4]	$O((nd\log L)/L)$	$O(\text{nnz}(X)Tk\log L)$
OVA: Slice [5]	$O(nd\log L)$	$O(d\log L)$

FastXML, PfastreXML and SwiftXML study ranking-based measures such as nDCG and its variants. Recently, [8] focuses on the F-measure, a commonly used performance measure in multi-label classification as well as in other fields such as information retrieval and natural language processing. [8] proposes sparse probability estimates (SPEs) to reduce the complexity of threshold tuning in XMLC and then develops three algorithms that use SPEs to maximize the F-measure in the Empirical Utility Maximization (EUM) framework. Probabilistic label trees (PLTs) and FastXML are discussed as means of computing SPEs. The theory in [10] shows that the pick-one-label reduction is inconsistent with respect to Precision@$k$, whereas the PLT model achieves zero regret (i.e., it is consistent) in terms of marginal probability estimation and Precision@$k$ in the multi-label setting. Inspired by [10], [61] further studies the consistency of other reduction strategies with respect to a different metric, Recall@$k$.
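
For reference, the two measures discussed above can be computed for a single instance as in the sketch below. The function names are hypothetical, and returning 0 when both the prediction and the ground truth are empty is one convention among several.

```python
import numpy as np

def precision_at_k(scores, y, k):
    """Fraction of the k top-scored labels that are relevant."""
    top = np.argsort(-scores)[:k]
    return float(y[top].sum()) / k

def f_measure(y_pred, y_true):
    """F1 between binary prediction and ground-truth vectors (0 if both are empty)."""
    tp = int(np.sum(y_pred * y_true))
    denom = int(y_pred.sum() + y_true.sum())
    return 2.0 * tp / denom if denom > 0 else 0.0

y_true = np.array([1, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.1, 0.2])
p1 = precision_at_k(scores, y_true, k=1)
p2 = precision_at_k(scores, y_true, k=2)
f1 = f_measure(np.array([1, 1, 0, 0]), y_true)   # prediction obtained by thresholding scores at 0.5
```

Precision@$k$ depends only on the ranking of the scores, while the F-measure depends on a thresholded prediction, which is why threshold tuning matters for the latter.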

Remark. Tree-based methods are an efficient strategy for addressing XMLC, with a logarithmic dependence on the number of labels (see Table I). FastXML is a popular method and is recommended for practitioners. However, one of the major problems of tree-based methods such as FastXML, PfastreXML and SwiftXML is that they involve a complex non-convex optimization problem at each node. How to obtain a cheap and scalable tree structure is an important topic for future research. For example, GBDT-SPARSE [58] studies gradient boosted decision trees (GBDT) for XMLC. At each node, the features are first projected into a low-dimensional space and then a simple inexact search strategy is used to find a good split. This significantly reduces the training time, prediction time and model size of GBDT, making it suitable for XMLC. CRAFTML [62] uses fast partitioning strategies and exploits the random forest algorithm. CRAFTML first randomly projects the features and labels into lower-dimensional spaces. A $k$-means algorithm is then applied to the projected labels to partition the instances into $k$ temporary subsets. Moreover, GBDT-SPARSE and CRAFTML open the way to parallelization, which may motivate further research.
备注。基于树的方法是一种有效的策略，用于对标签数量的对数依赖性 XMLC（请参阅表 I）。FastXML 是一种流行的方法，推荐给从业者。然而，基于树的方法（如 FastXML、PfastreXML 和 SwiftXML）的主要问题之一是它们在每个节点都涉及复杂的非凸优化问题。如何获得廉价且可扩展的树结构是未来重要的研究课题。例如，GBDT-SPARSE [58] 研究了 XMLC 的梯度提升决策树（GBDT）。在每个节点中，首先将特征投影到低维空间中，然后使用简单的不精确搜索策略来找到一个好的分割。它们显著减少了 GBDT 的训练和预测时间以及模型大小，使其适用于 XMLC。CRAFTML [62] 尝试使用快速分区策略并利用随机森林算法。CRAFTML 首先将特征和标签随机投影到低维空间中。然后， $k$ 在投影的标签中使用 -means 算法将实例划分为 $k$ 临时子集。此外，GBDT-SPARSE 和 CRAFTML 也为并行化开辟了道路，这能够激励进一步的研究。

2.3 One-vs-all Methods

One-vs-all (OVA) methods, which independently train a binary classifier for each label, are among the most popular strategies for multi-label classification. However, this technique suffers from two major limitations in XMLC: 1) training one-vs-all classifiers with off-the-shelf solvers such as Liblinear can be infeasible in terms of computation and memory; 2) the model size for an XMLC data set can be extremely large, which leads to slow prediction. Recently, many works have been developed to address these issues.

By exploiting the sparsity of the data, some sub-linear algorithms have been proposed to adapt one-vs-all methods to the extreme classification setting. For example, PD-Sparse [59] minimizes the separation ranking loss with an $\ell_{1}$ penalty in an Empirical Risk Minimization (ERM) framework for XMLC. The separation ranking loss penalizes the prediction on an instance by the highest response among the negative labels minus the lowest response among the positive labels. PD-Sparse obtains a solution that is extremely sparse in both the primal and the dual at sub-linear time cost, while yielding higher accuracy than SLEEC, FastXML and some other XMLC methods. By introducing separable loss functions, PPDSparse [3] parallelizes PD-Sparse with sub-linear algorithms to scale out the training. Because training is separated across labels, PPDSparse also reduces the memory cost of PD-Sparse by orders of magnitude. DiSMEC [2] also presents a sparse model, obtained with a parameter thresholding strategy, and employs a double layer of parallelization to scale one-vs-all methods to problems involving hundreds of thousands of labels. ProXML [63] uses an $\ell_{1}$-regularized Hamming loss to address the tail-label issue and shows, based on graph theory, that one-vs-all methods that minimize the Hamming loss work well for tail-label prediction in XMLC.
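The separation ranking loss described above can be sketched in a few lines. The hinge at zero and the unit margin follow the usual max-margin form and are assumptions here, since the text only states the "highest negative minus lowest positive" part; the function name and toy scores are likewise illustrative.

```python
def separation_ranking_loss(scores, positives, margin=1.0):
    """Hinge on (highest negative response - lowest positive response).

    scores    : list of real-valued responses, one per label
    positives : set of indices of the positive labels
    margin    : assumed unit margin of the max-margin formulation
    """
    pos = [s for j, s in enumerate(scores) if j in positives]
    neg = [s for j, s in enumerate(scores) if j not in positives]
    if not pos or not neg:
        return 0.0  # nothing to separate
    return max(0.0, margin + max(neg) - min(pos))

scores = [2.0, -1.0, 0.5, 3.0]                             # toy responses for 4 labels
loss_separated = separation_ranking_loss(scores, {0, 3})   # positives well above negatives
loss_violated = separation_ranking_loss(scores, {0, 2})    # label 3 (negative) scores highest
```

Only the single most violating pair of labels contributes to the loss, which is one reason why the resulting dual solution can be very sparse.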

PD-Sparse, PPDSparse, DiSMEC and ProXML achieve high prediction accuracy with small model sizes. However, they still train a separate linear classifier for each label and linearly scan every label to decide whether it is relevant. The training and testing costs of these methods therefore grow linearly with the number of labels. Some advanced methods have been presented to address this issue. For example, to reduce the linear prediction cost of one-vs-all methods, [60] proposes to predict on a small candidate set of labels, which is generated by projecting a test instance onto a filtering line and retaining only the labels that have training instances in the vicinity of this projection. The candidate set should contain most of the true labels of the test instance while being as small as possible. The label filters are trained by casting these two principles as a mixed integer problem. Label filters can reduce the testing time of existing XMLC classifiers by orders of magnitude while yielding comparable prediction accuracy. [60] thus offers an interesting technique for finding a small number of potentially relevant labels instead of going through a very long list of labels. How to use label filters to speed up training is left as an open problem.
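The prediction-time use of a label filter can be illustrated as follows. This is only a sketch of the idea: the filtering direction `w` is taken as given, and each label's "vicinity" is simplified to the interval spanned by the projections of its positive training instances, whereas [60] learns both through a mixed integer formulation. All names and data below are hypothetical.

```python
import numpy as np

def fit_intervals(X, Y, w):
    """For each label, the [low, high] range of projections of its positive instances."""
    proj = X @ w
    num_labels = Y.shape[1]
    low = np.full(num_labels, np.inf)     # labels without positives keep an empty interval
    high = np.full(num_labels, -np.inf)
    for j in range(num_labels):
        p = proj[Y[:, j] == 1]
        if p.size:
            low[j], high[j] = p.min(), p.max()
    return low, high

def candidate_labels(x, w, low, high):
    """Labels whose interval contains the projection of the test instance x."""
    t = x @ w
    return np.flatnonzero((low <= t) & (t <= high))

# Toy data: 4 training instances, 2 labels, filtering line along the first axis.
X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [6.0, 0.0]])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
w = np.array([1.0, 0.0])
low, high = fit_intervals(X, Y, w)
near_first = candidate_labels(np.array([0.5, 3.0]), w, low, high)
near_second = candidate_labels(np.array([5.5, 0.0]), w, low, high)
in_gap = candidate_labels(np.array([3.0, 0.0]), w, low, high)
```

An existing one-vs-all classifier then only needs to be evaluated on the returned candidates rather than on all labels.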

Parabel [4] reduces the training time of one-vs-all methods from $O(ndL)$ to $O((nd\log L)/L)$ by learning balanced binary label trees based on an efficient and informative label representation. It also presents a probabilistic hierarchical multi-label model that generalizes hierarchical softmax to the multi-label setting, together with a logarithmic-time prediction algorithm for XMLC. Experiments show that Parabel can be orders of magnitude faster at training and prediction than the state-of-the-art one-vs-all extreme classifiers. However, Parabel is less accurate on low-dimensional data sets, because it cannot guarantee that similar labels are placed in the same group, and the resulting errors are propagated through the deep trees. To reduce this error propagation, Bonsai [64] uses a shallow $k$-ary label tree structure with a generalized label representation. Slice [5] presents a novel negative sampling technique to improve the prediction accuracy for low-dimensional dense feature representations. Slice cuts the training cost of one-vs-all methods from linear in $L$ to $O(nd\log L)$ by training each classifier on only the $O(n/L\log L)$ most confusing negative examples rather than on all $n$ training points. It employs a generative model to estimate the $O(n/L\log L)$ negative examples for each label based on approximate nearest neighbor search (ANNS) in time $O((n+L)d\log L)$, and conducts prediction on the $O(\log L)$ most probable labels for each test point. Slice is up to 15% more accurate than Parabel and scales to 100 million labels and 240 million training points. The experiments in [5] show that negative sampling is a powerful tool in XMLC, and the gains offered by more advanced negative sampling techniques may be explored in future research.

Remark. One-vs-all methods are simple strategies for dealing with XMLC, and PD-Sparse is a good first choice for beginners. As mentioned before, one-vs-all methods independently train a binary classifier for each label, so computation and memory costs are a serious issue in XMLC; moreover, these methods do not consider the correlations between labels. Although the methods reviewed in this subsection ease the computational issue, how to use label correlations to boost the performance of one-vs-all methods remains a challenging problem for future work. One possible way is to design one-vs-all learning models that take various label correlations into account.

3 Multi-label Learning With Limited Supervision

Collecting fully-supervised data is usually hard and expensive, and is thus a critical bottleneck in real-world classification tasks. In MLC problems, an instance may have many ground-truth labels and the output space can be very large, which further aggravates the difficulty of precise annotation. To mitigate this problem, plenty of works have studied different settings of MLC with limited supervision. Modeling label dependencies and handling incomplete supervision are the two major challenges in these tasks. In this section, we concentrate on several advanced topics. Among them, multi-label learning with missing labels (MLML) assumes that only a subset of the labels is given; semi-supervised MLC (SS-MLC) assumes that a large set of unlabeled data is given along with the labeled data; partial multi-label learning (PML) allows the annotators to provide a superset of the labels as candidates. We illustrate the connections between these supervision types in Figure 2. Note that in all these settings, although the classifier is trained with imperfect supervision, it is still evaluated on a perfectly supervised test set to quantify its predictive performance.

3.1 Multi-Label Learning With Missing Labels

In real-world scenarios, it is often intractable for annotators to identify all the ground-truth labels, owing to the complicated structure or the sheer size of the output space. Instead, only a subset of the labels is obtained; learning in this setting is called multi-label learning with missing labels (MLML). There are two main settings in MLML. The first setting [15] obtains only a subset of the relevant labels. It views MLML as a positive-unlabeled learning task in which all the remaining labels are regarded as negative. The other setting [65] explicitly indicates which labels are missing. Formally, given a feature vector $x_{i}$, we denote the label vectors of these two settings by $\hat{y}_{i}\in\{-1,+1\}^{L\times 1}$ (where $-1$ can be a missing or a negative label) and $\tilde{y}_{i}\in\{-1,0,+1\}^{L\times 1}$ (where $0$ represents a missing label), respectively. We distinguish these two settings in Figure 2. Moreover, two different learning targets may be considered. One is transductive, where the goal is only to complete the missing entries. The other is inductive, where a classifier is trained for unseen data. For simplicity, we do not explicitly distinguish between them.
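The two label encodings can be made concrete with a small example. The sketch below starts from a hypothetical fully labeled matrix and a mask of the entries the annotator actually provided, and derives $\tilde{y}$ (explicit setting) and $\hat{y}$ (implicit setting). Instances are stored as rows for convenience, whereas the text uses column vectors; the variable names are illustrative.

```python
import numpy as np

# Hypothetical ground truth for 2 instances and 4 labels, coded in {-1, +1}.
Y_true = np.array([[ 1, -1,  1, -1],
                   [-1,  1,  1, -1]])

# True where the annotator provided the entry, False where it is missing.
observed = np.array([[True,  True,  False, False],
                     [True,  False, True,  True]])

# Explicit setting [65]: missing entries are marked with 0.
Y_tilde = np.where(observed, Y_true, 0)

# Implicit setting [15]: only observed positives are kept as +1;
# every other entry (negative or missing) is coded as -1.
Y_hat = np.where(observed & (Y_true == 1), 1, -1)
```

In the implicit encoding a $-1$ is ambiguous, since the learner cannot tell a true negative from an unobserved positive, which is why this setting is treated as positive-unlabeled learning.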

Next, we will review state-of-the-art MLML methods which are mainly based on low-rank and graph assumptions.

3.1.1 Low-Rank and Embedding Methods

As discussed in §2.1, the existence of label correlations usually implies that the output space is low-rank. Interestingly, this assumption has been widely used to fill in the missing entries of a matrix in matrix completion tasks [66]. Since it serves both key targets in MLML, i.e. label correlation extraction and missing label completion, many MLML methods based on the low-rank assumption have been developed.

In [66], the MLML problem is regarded as a low-rank matrix completion problem with side information, i.e. the features. To accelerate learning, the label matrix is decomposed as $Y=AWB$, where $A$ and $B$ are side information matrices and $W$ is assumed to be low-rank. In MLML problems, $A$ is exactly the feature matrix $X$ and $B$ is the identity matrix, since there is no side information for the labels. Therefore, $W$ can be viewed as a linear classifier that makes the predicted label matrix $Y=XW$ low-rank. LEML [11] then generalizes this paradigm to a flexible empirical risk minimization framework, formulated as follows,

$$W=\arg\min_{W}\mathcal{L}(\hat{Y},XW)+\lambda r(W),\quad\textrm{s.t. }\textrm{rank}(W)\leq k \tag{5}$$

where $\lambda$ and $k$ are constants and $r(W)$ is the regularizer. $\mathcal{L}$ can be any empirical risk evaluated on the observed entries. To solve this problem, [11] decomposes the classifier into two rank-$k$ ($k\ll L$) matrices $V$ and $U$ such that $W=VU$. An alternating optimization method is then used to handle large-scale problems efficiently. Nevertheless, the presence of tail labels may break the low-rank property. Hence, [20] treats the tail labels as outliers and decomposes the label matrix as $\hat{Y}\approx Y_{1}+Y_{2}$, where $Y_{1}$ is low-rank and $Y_{2}$ is sparse. These two components can be obtained by solving the following objective,

$$\begin{split}\min_{U,V,H}\ &||\hat{Y}-Y_{1}-Y_{2}||_{F}^{2}+\lambda_{1}||H||_{F}^{2}+\lambda_{2}(||U||_{F}^{2}+||V||_{F}^{2})+\lambda_{3}||XH||_{1}\\ \textrm{s.t.}\quad &Y_{1}=XUV,\quad Y_{2}=XH\end{split} \tag{6}$$
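As an illustration of the rank-constrained framework in Eq. (5), the following sketch factorizes $W=VU$ and alternates exact ridge updates of $U$ and $V$ on the observed entries. It is a toy instance under stated assumptions rather than the implementation of [11]: the squared loss, the regularizer $r(W)=||V||_{F}^{2}+||U||_{F}^{2}$, the dense solver and all names (`fit`, `M`, `lam`) are our own choices, and the label matrix is stored as $n\times L$.

```python
import numpy as np

def objective(X, Y, M, V, U, lam):
    """Squared loss on observed entries (M == 1) plus Frobenius regularization."""
    R = M * (Y - X @ V @ U)
    return (R ** 2).sum() + lam * ((V ** 2).sum() + (U ** 2).sum())

def update_U(X, Y, M, V, lam):
    """With V fixed, each column of U is an independent ridge regression."""
    Z = X @ V                                   # n x k latent features
    k, num_labels = V.shape[1], Y.shape[1]
    U = np.zeros((k, num_labels))
    for j in range(num_labels):
        rows = M[:, j].astype(bool)             # instances where label j is observed
        Zj = Z[rows]
        U[:, j] = np.linalg.solve(Zj.T @ Zj + lam * np.eye(k), Zj.T @ Y[rows, j])
    return U

def update_V(X, Y, M, U, lam):
    """With U fixed, solve one ridge problem in vec(V) (columns of V stacked)."""
    d, k = X.shape[1], U.shape[0]
    i_idx, j_idx = np.nonzero(M)
    # x_i^T V u_j equals kron(u_j, x_i) . vec(V)
    A = np.stack([np.kron(U[:, j], X[i]) for i, j in zip(i_idx, j_idx)])
    b = Y[i_idx, j_idx]
    v = np.linalg.solve(A.T @ A + lam * np.eye(d * k), A.T @ b)
    return v.reshape((d, k), order="F")

def fit(X, Y, M, k, lam=0.1, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    V = rng.normal(scale=0.1, size=(X.shape[1], k))
    U = update_U(X, Y, M, V, lam)
    history = [objective(X, Y, M, V, U, lam)]
    for _ in range(iters):
        V = update_V(X, Y, M, U, lam)
        U = update_U(X, Y, M, V, lam)
        history.append(objective(X, Y, M, V, U, lam))
    return V, U, history

# Toy data: 40 instances, 5 features, 6 labels generated by a rank-2 classifier.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
Y = np.sign(X @ rng.normal(size=(5, 2)) @ rng.normal(size=(2, 6)))
M = (rng.random(Y.shape) < 0.7).astype(float)   # roughly 70% of the entries observed
V, U, history = fit(X, Y, M, k=2)
W = V @ U                                       # rank(W) <= 2 by construction
```

Because each block update is an exact minimizer, the objective cannot increase from one iteration to the next, and the rank constraint is enforced by the factorization itself rather than by an explicit projection.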

Many works follow these two learning frameworks. For example, [67] studies the problem in which both features and labels are incomplete. The proposed solution, ColEmbed, requires both the classifier and the recovered feature matrix to be low-rank. Moreover, the kernel trick is used to make the classifier non-linear. Some recent works [68, 69, 70] further integrate graph-based techniques to obtain more effective models, which we will discuss in §3.1.2.

The low-rank assumption is rather flexible and may be exploited in various ways. For example, COCO [71] considers a more complex setting in which features and labels are missing simultaneously. It forces the concatenation of the recovered feature matrix and label matrix to be low-rank via the trace norm. Some works also exploit the assumption through a low-rank label correlation matrix. ML-LRC [72] assumes that the label matrix can be reconstructed using a correlation matrix $U$ such that $Y=\hat{Y}^{T}U$, where $U\in\mathbb{R}^{L\times L}$ is low-rank. The loss is then measured between the output and the reconstructed labels, $||XW-YU||_{F}^{2}$. Based on this assumption, ML-LEML [73] further introduces an instance-wise label correlation matrix $V$ such that $Y=\hat{Y}V$, where $V\in\mathbb{R}^{n\times n}$ is also low-rank.

Another popular approach follows the paradigm of embedding methods, which project the label vectors into a low-dimensional space. [30] proposes C2AE, a model based on deep neural networks. The features and labels are jointly embedded into a latent space by two neural networks $F_{x}$ and $F_{e}$ such that their codewords are maximally correlated. The feature codewords are then decoded by another neural network $F_{d}$, which is also used for prediction. For MLML problems, the decoded labels are evaluated on the observed entries. Although the labels are decoded from a low-dimensional space, the low-rank assumption is not necessarily satisfied under a non-linear projection. Thus, REFDHF [74] adds a trace norm regularization term on the decoded label matrix. [74] also proposes a novel hypergraph fusion technique that explores and exploits the complementarity between the feature space and the label space. Compared to methods based on low-rank classifiers, embedding methods are more flexible since the classifier can be non-linear, and are thus worth exploring.

3.1.2 Graph-based Methods

To handle missing labels, one of the most popular solutions is the graph-based model. Denote a weighted graph by $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathcal{W})$, where $\mathcal{V}=\{x_{i}|1\leq i\leq n\}$ denotes the vertex set and $\mathcal{E}=\{(x_{i},x_{j})\}$ denotes the edge set. $\mathcal{W}=[w_{ij}]_{n\times n}$ is a weight matrix where $w_{ij}=0$ if $(x_{i},x_{j})\notin\mathcal{E}$. With the graph defined, the most typical strategy is to add a manifold regularization term to the empirical risk minimization framework. Note that in this section, we slightly abuse the notation $w_{ij}$ to represent an entry of the graph weight matrix for the sake of simplicity.

The pioneering work [15] is the first to propose the concept of multi-label learning with weak labels, i.e. the implicit setting of MLML. The proposed method, named WELL, constructs a label-specific graph for each label from a feature-induced similarity graph. Manifold regularization terms are then added separately for each label. [65] formalizes the other setting of MLML and, following [16], incorporates three assumptions into MLML,

• Label Consistency: the predicted labels should be consistent with the initial labels, which is usually achieved by the empirical risk minimization principle;
• Sample-level Smoothness: if two samples $x_{i}$ and $x_{j}$ are close to each other, so are their predicted label vectors;
• Label-level Smoothness: if two incomplete label vectors $y_{i}$ and $y_{j}$ are semantically similar, so are the predicted label vectors.

Formally, a $k$-nearest neighbor graph is constructed to satisfy the sample-level smoothness, where the weight matrix $\mathcal{W}^{x}$ is computed by $w_{ij}^{x}=\exp(-\frac{||x_{i}-x_{j}||_{2}^{2}}{||x_{i}-x_{h}||_{2}||x_{j}-x_{h}||_{2}})$, where $x_{h}$ is the $h$-th nearest neighbor of $x_{i}$ ($h$ is a fixed constant). For the label-level smoothness, the authors construct an $L\times L$ weight matrix $\mathcal{W}^{y}$ where $w_{ij}^{y}=\exp(-\eta[1-\frac{\langle\tilde{Y}_{i.},\tilde{Y}_{j.}\rangle}{||\tilde{Y}_{i.}||_{2}||\tilde{Y}_{j.}||_{2}}])$ and $\tilde{Y}_{i.}$ is the $i$-th row vector of the incomplete matrix $\tilde{Y}$. Finally, the predicted score matrix $\dot{Y}$ is learned by,

$$\min_{\dot{Y}}||\dot{Y}-\tilde{Y}||_{F}^{2}+\frac{\lambda_{x}}{2}\textrm{Tr}(\dot{Y}L_{x}\dot{Y}^{T})+\frac{\lambda_{y}}{2}\textrm{Tr}(\dot{Y}^{T}L_{y}\dot{Y}) \tag{7}$$

where $L_{x}$ and $L_{y}$ are the Laplacian matrices of $\mathcal{W}^{x}$ and $\mathcal{W}^{y}$, and $\lambda_{x}$ and $\lambda_{y}$ are trade-off parameters. This learning paradigm is followed by several recent works. [75] proposes an inductive version in which the trained classifier can also predict on unseen data. [76] chooses the hinge loss instead of the squared loss as the empirical risk. To tackle the severe class imbalance problem in MLML, [77] adds two class cardinality constraints to Eq. (7) that force the number of positive labels to lie in a predefined range. When hierarchical label information is provided, MLMG-GO [70] imposes semantic hierarchical constraints such that the score of a label $y_{a}$ is smaller than that of its parent label $y_{b}$. In [19], a new regularization framework, IMCL, is proposed that interactively learns the two similarity graphs.
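Since Eq. (7) is an unconstrained quadratic problem, setting its gradient to zero gives the Sylvester equation $(2I+\lambda_{y}L_{y})\dot{Y}+\lambda_{x}\dot{Y}L_{x}=2\tilde{Y}$, which can be solved in closed form from the eigendecompositions of the two Laplacians. The sketch below illustrates this under several assumptions of ours: $\tilde{Y}$ is stored as an $L\times n$ matrix, the Laplacians are unnormalized ($D-W$), the sample graph is dense rather than restricted to $k$ nearest neighbors, and the sample weights use the distance from each point to its own $h$-th neighbor as a bandwidth, which is our reading of the formula for $w_{ij}^{x}$. The function names are illustrative.

```python
import numpy as np

def sample_weights(X, h=2):
    """Gaussian-type weights with a per-point bandwidth (distance to h-th neighbour)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    sigma = np.sort(D, axis=1)[:, h]        # column 0 is the point itself (distance 0)
    W = np.exp(-D ** 2 / (sigma[:, None] * sigma[None, :]))
    np.fill_diagonal(W, 0.0)
    return W

def label_weights(Y_tilde, eta=1.0):
    """exp(-eta * (1 - cosine similarity)) between rows (labels) of Y_tilde."""
    norms = np.linalg.norm(Y_tilde, axis=1)  # assumes no label row is entirely missing
    cos = (Y_tilde @ Y_tilde.T) / np.outer(norms, norms)
    W = np.exp(-eta * (1.0 - cos))
    np.fill_diagonal(W, 0.0)
    return W

def laplacian(W):
    """Unnormalized graph Laplacian D - W."""
    return np.diag(W.sum(axis=1)) - W

def solve_eq7(Y_tilde, Lx, Ly, lam_x=1.0, lam_y=1.0):
    """Closed-form minimizer of Eq. (7) via two symmetric eigendecompositions."""
    a, P = np.linalg.eigh(Ly)                # label graph:  L x L
    b, Q = np.linalg.eigh(Lx)                # sample graph: n x n
    C = P.T @ (2.0 * Y_tilde) @ Q
    Z = C / (2.0 + lam_y * a[:, None] + lam_x * b[None, :])
    return P @ Z @ Q.T

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                        # 8 instances, 3 features
Y_tilde = rng.choice([-1, 0, 1], size=(4, 8))      # 4 labels, 0 marks a missing entry
Y_tilde[:, 0] = 1                                  # guarantees every label row is non-zero
Lx = laplacian(sample_weights(X))
Ly = laplacian(label_weights(Y_tilde))
Y_dot = solve_eq7(Y_tilde, Lx, Ly)
grad = 2.0 * (Y_dot - Y_tilde) + Y_dot @ Lx + Ly @ Y_dot   # gradient at lam_x = lam_y = 1
```

The entries of `Y_dot` at positions where `Y_tilde` is zero are the completed scores, and `grad` should vanish up to numerical error. For large $n$ the dense eigendecomposition becomes impractical, so iterative solvers would be preferable.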

Many graph-based methods concentrate only on the sample-level smoothness principle. That is, the graph information is mainly used for disambiguating the incomplete supervision, while other techniques are used to exploit the label correlations. [69] treats the problem of one-class matrix factorization with side information as an MLML task. Inspired by [11], the linear classifier is restricted to be low-rank and the predicted label matrix is smoothed by a manifold regularization term. Since the low-rank assumption fails in many applications, MLMG-SL [70] further assumes that the output of the graph model can be decomposed into a low-rank matrix and a sparse matrix. Several recent works focus instead on the label-level smoothness. LSML [78] learns a label correlation matrix, i.e. a label graph, that is used simultaneously to complete the missing labels, smooth the label predictions and guide the learning of label-specific features. GLOCAL [79] trains a low-rank model with manifold regularization that exploits global label-level smoothness. In addition, as label correlations may vary from one local region to another, GLOCAL partitions the instances into several groups and learns local label correlations by group-wise manifold regularization. In [34], a fully connected graph whose vertices are the labels is built, and a graph neural network (GNN) is trained to model the label dependencies. The input of the GNN is the $L$-sized feature vector of the image extracted by a convolutional neural network, and the outputs are the predicted labels. To disambiguate the missing labels, [34] proposes two novel strategies. For the known labels, the authors propose a partial binary cross-entropy loss (Partial-BCE) that rescales the normalization factor according to the proportion of known labels. To complete the missing entries, [34] adopts a curriculum learning strategy that learns a self-paced model.

Besides, some studies exploit other kinds of graph information. APG-Graph [68] proposes a novel semantic descriptor-based approach for visual tasks to construct an instance-instance correlation graph. Specifically, [68] makes use of the posterior probabilities of classifications on other public large-scale data sets, and a $k$NN graph is then constructed from these predicted tags. [69] regards the user-item interactions in a recommender system as a bipartite graph.

In the past few years, graph mining techniques have received huge attention. We believe that future graph-based MLML models will involve more expressive graph models, e.g. graph neural networks [80], and various types of graphs, e.g. social networks [81].

3.1.3 Other Techniques for Missing Labels

Many other techniques can be used for MLML tasks, such as co-regularized learning [82] and binary coding embedding [83]. In what follows, we focus on some advanced MLML algorithms.

Owing to their capability of exploring the data distribution, probabilistic graphical models (PGMs) have been popular for MLML problems, since the missing labels can be completed in a generative manner. SSC-HDP [84] uses a correspondence hierarchical Dirichlet process (Corr-HDP) that enables the dimension of the latent factors to be chosen dynamically. Based on Corr-HDP, SSC-HDP iteratively updates the likelihood $P(y^{j}|x)$ for an instance $x$ whose $j$-th label is missing, while the likelihood of the remaining labels is fixed to $1$. CRBM [85] proposes a conditional restricted Boltzmann machine model to capture high-order label dependence relationships. Specifically, a latent layer is added above the label layer to form a restricted Boltzmann machine, with the features serving as the conditions. Based on a latent factor model, GenEML [22] proposes a scalable generative model that introduces an exposure variable for each missing label. BMLS [86] jointly learns a low-rank embedding of the label matrix and the label co-occurrence matrix using a Poisson-Dirichlet-gamma non-negative factorization method [87]. Note that [85, 86] are also capable of incorporating auxiliary label relatedness information, such as that from Wikipedia.

Reweighting the empirical risk is also a common strategy. [88] observes that in the MLML setting, the traditional multi-label ranking error may overestimate the classification error. Hence, a slack variable is introduced to account for the error of ranking an unassigned class before an assigned class. [7] proposes an unbiased propensity-scored variant of the nDCG loss, and [34] presents Partial-BCE, both of which we have discussed in previous sections. [89] assigns a weight factor to each term of the binary cross-entropy loss. In particular, the weights of the positive labels are fixed to $1$, and the weights of the missing entries are set to $P(\hat{y}_{c}|y)$, i.e. the probability that the $c$-th label is negative given the label vector $y$. This probability is estimated from the ground-truth label matrix based on label co-occurrences.
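The per-term reweighting of the binary cross-entropy loss can be sketched as follows. This is a generic weighted BCE under our own assumptions, not the exact formulation of [89]: targets are coded in $\{0,1\}$, missing entries are treated as negatives (target $0$), and their weights, which [89] estimates from label co-occurrences, are simply passed in as given numbers. All values below are hypothetical.

```python
import math

def weighted_bce(probs, targets, weights):
    """Weighted binary cross-entropy over the labels of one instance.

    probs   : predicted probabilities in (0, 1), one per label
    targets : 1 for an observed positive, 0 for an entry treated as negative
    weights : 1.0 for positives; for missing entries, the estimated
              probability that the label is truly negative
    """
    total = 0.0
    for p, t, w in zip(probs, targets, weights):
        total -= w * (t * math.log(p) + (1 - t) * math.log(1 - p))
    return total

probs = [0.9, 0.2, 0.6]        # model outputs for 3 labels
targets = [1, 0, 0]            # label 0 observed positive; labels 1 and 2 missing
weights = [1.0, 0.95, 0.3]     # label 2 is likely a hidden positive, so it is down-weighted
loss_weighted = weighted_bce(probs, targets, weights)
loss_plain = weighted_bce(probs, targets, [1.0, 1.0, 1.0])
```

With a small weight on label 2, a confident prediction for that label is penalized much less than it would be if the missing entry were taken as a certain negative.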

Recently, bandit learning-based approaches have also been introduced. Specifically, one pioneering work [90] considers the contextual bandit problem in the extreme multi-label learning context. It modifies the inverse gap weighting sampling strategy to select the top-$k$ arms, which results in good generalization performance. In addition, this work proposes a tree-based algorithm that groups similar arms, so that the model enjoys a poly-logarithmic computational cost w.r.t. the number of arms.
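For reference, the standard single-arm inverse gap weighting rule from the contextual bandit literature can be sketched as below: every non-greedy arm is sampled with probability inversely proportional to its predicted reward gap, and the greedy arm receives the remaining mass. This is the unmodified rule only; the top-$k$ modification of [90] is not reproduced here, and the parameter name `gamma` and the toy rewards are assumptions.

```python
def inverse_gap_weighting(pred_rewards, gamma):
    """Sampling distribution over arms given predicted rewards.

    pred_rewards : predicted reward of each arm (higher is better)
    gamma        : exploration parameter; larger values favour the greedy arm
    """
    num_arms = len(pred_rewards)
    best = max(range(num_arms), key=lambda a: pred_rewards[a])
    probs = [0.0] * num_arms
    for a in range(num_arms):
        if a != best:
            gap = pred_rewards[best] - pred_rewards[a]
            probs[a] = 1.0 / (num_arms + gamma * gap)
    probs[best] = 1.0 - sum(probs)      # remaining mass goes to the greedy arm
    return probs

pred = [0.9, 0.5, 0.1]                  # toy predicted rewards for 3 arms (labels)
p_explore = inverse_gap_weighting(pred, gamma=1.0)
p_exploit = inverse_gap_weighting(pred, gamma=10.0)
```

Each non-greedy probability is at most $1/K$ for $K$ arms, so the greedy arm always keeps a probability of at least $1/K$ and the output is a valid distribution.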

Remark. To date, graph-based methods and embedding-based methods are still dominant in the MLML context. Although recent works [19] have tried to employ deep models to improve performance, they mainly rely on simple convolutional networks and autoencoders. It would be promising to design model architectures that are better tailored to MLML. Other techniques are also worth exploring. For example, given the success of existing PGM-based MLML methods, we believe that Bayesian deep learning (BDL) [91] can further improve performance owing to its strength in handling high-dimensional data and complex uncertainty.

3.2 Semi-Supervised Multi-Label Classification

In semi-supervised MLC (SS-MLC) [92], the data set comprises two parts: fully labeled data and unlabeled data. Though SS-MLC has a far longer history than MLML, it can be regarded as a special case of MLML in which the labels of some instances are entirely missing. In fact, similar to MLML, plenty of SS-MLC algorithms are based on graph models [93, 94] and low-rank assumptions [79, 95]. In what follows, we first review some state-of-the-art SS-MLC algorithms and then discuss a novel learning setting called weakly-supervised MLC.

3.2.1 State-of-the-art Algorithms

Graph-based methods are very popular in SS-MLC; they differ mainly in how they exploit label correlations. SLRM [96] forces the classifier to be low-rank, while a manifold regularization term is added to ensure sample-level smoothness. [95] proposes a triple low-rank regularization approach in which the graph is dynamically updated using a low-rank feature-recovery matrix. Based on curriculum learning, ML-TLLT [97] forces a pair of teachers to generate similar curricula if the corresponding two labels are highly correlated over the labeled examples. CMLP [94] makes use of the collaboration technique [98] to design a scalable multi-label propagation method. Specifically, it breaks the predicted label into two parts: 1) its own prediction; 2) the predictions of others, i.e. the collaborative part.

As mentioned above, other techniques can also be used. COIN [99] adapts the well-known co-training strategy to the SS-MLC setting. In each co-training round, a dichotomy over the feature space is learned by maximizing the diversity between the two classifiers induced on the two dichotomized feature subsets. Then, pairwise ranking predictions on unlabeled data are iteratively communicated for model refinement. Based on COIN, [100] further proposes an ensemble method to accommodate streamed SS-MLC data. DRML [101] designs a dual-classifier domain adaptation network to align the features in a latent space. To model label dependencies, DRML generates the final prediction by feeding the outer product of the two predicted label vectors to a relation extraction network.

3.2.2 Weakly-Supervised MLC

Because the output space is large, collecting precisely labeled data takes extensive effort and cost even in SS-MLC problems. Hence, a new setting called weakly-supervised multi-label classification (WS-MLC) has attracted considerable attention. In WS-MLC, the data set may simultaneously contain fully labeled data, incompletely-labeled data and unlabeled data. In this survey, we follow the definition of WS-MLC in [23]. However, weakly-supervised MLC may also have other meanings in the literature. In a broad sense, any noisy supervision can be termed weak supervision. Readers should also be careful about the difference between WS-MLC and multi-label learning with weak labels [15, 102]. The latter sometimes refers implicitly to the MLML setting.

Many effective approaches have been developed for WS-MLC problems. For example, WeSed [103] handles the missing labels with a weighted ranking loss and integrates the unlabeled data via a triplet similarity loss. In [104], missing labels are first estimated using a correlation matrix. Then, a linear classifier is trained by minimizing a graph-regularized model. SSWL [105] proposes a novel dual similarity regularizer $\|Y-VYU\|$ to characterize both sample-level and label-level smoothness. Here $V$ is the weight matrix of the $k$NN graph over the training data and $U$ is a trainable variable that represents the label similarity. Moreover, SSWL uses an ensemble of multiple models to improve robustness. Though these works have demonstrated promising results, they directly use logical labels and thus ignore the relative importance of each label to an instance. To bridge this gap, WSMLLE [106] transforms the original problem into a label distribution learning problem [107]. Specifically, it proposes a new label enhancement method that combines the concept of local correlation [79] with the dual similarity regularizer [105]. The label enhancement technique is also adopted by fully-supervised MLC [108] and PML [109] models, and we discuss the latter in detail in §3.3.
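To make the dual similarity regularizer concrete, the following is a minimal NumPy sketch that evaluates $\|Y-VYU\|$ for a label matrix $Y$ of shape $n\times L$. The uniform row-normalized $k$NN weights, the Frobenius norm, and the helper names `knn_weight_matrix` and `dual_similarity_penalty` are illustrative assumptions, not details taken from SSWL [105].

```python
import numpy as np

def knn_weight_matrix(X, k):
    """Row-normalized kNN weight matrix V (n x n).

    Assumption: each of the k nearest neighbors gets weight 1/k.
    """
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared distances
    np.fill_diagonal(d2, np.inf)                      # a point is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :k]                # indices of the k nearest neighbors
    V = np.zeros((n, n))
    V[np.repeat(np.arange(n), k), nn.ravel()] = 1.0 / k
    return V

def dual_similarity_penalty(Y, V, U):
    """Frobenius norm of Y - V Y U, with Y (n x L), V (n x n), U (L x L)."""
    return np.linalg.norm(Y - V @ Y @ U, ord="fro")
```

Left-multiplying by $V$ averages the labels of neighboring samples (sample-level smoothness), while right-multiplying by $U$ mixes correlated labels (label-level smoothness). The penalty is small when $Y$ is consistent with both. In SSWL, $U$ is learned; here it is simply passed in.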

Probabilistic models are also popular for WS-MLC tasks, since the distribution of unlabeled data can be seamlessly integrated into a probabilistic framework. DSGM [23] proposes a deep sequential generative model which assumes that an instance $x$ is generated from its label $y$ together with a latent variable $z$. DSGM leverages information from observed labels in a sequential manner. The model is then trained by maximizing the likelihood,

$$\max_{\theta}\ \sum_{i\in D_{l}}\log p_{\theta}(x_{i},y_{i})+\sum_{j\in D_{o}}\log p_{\theta}(x_{j},\tilde{y}_{j})+\sum_{k\in D_{u}}\log p_{\theta}(x_{k}) \tag{8}$$

where $\theta$ denotes the model parameters, and $D_{l}$, $D_{o}$ and $D_{u}$ are the index sets of fully labeled, incompletely-labeled and unlabeled data respectively. [23] also proposes a variational inference method that maximizes the evidence lower bound of the objective. [110] designs an embedding-based probabilistic model called ESMC, which addresses some key issues in WS-MLC tasks. Since the low-rank assumption may be broken by tail labels, ESMC uses Gaussian processes to perform a non-linear projection. To handle missing labels, ESMC introduces a set of auxiliary random variables, a.k.a. experts, to model the relationship between the real-valued probability score and the observed logical labels. Finally, the unlabeled data can also be integrated to learn a smooth mapping from the feature space to the label space.

Remark. Compared to MLML problems, the presence of a large amount of unlabeled data in SS-MLC can severely restrict the representation ability of the model. However, few efforts have been made to bring techniques from state-of-the-art deep semi-supervised learning to SS-MLC. We recommend techniques such as consistency regularization [111] and self-supervised pretraining [112], which have shown a strong ability to exploit unlabeled data.

3.3 Partial Multi-Label Learning

In practice, the complicated structure of the label space often makes it hard to decide whether some difficult labels are relevant. For example, it is usually hard to decide whether a dog is a malamute or a husky. One might naively drop these labels and regard the original problem as an MLML task. However, missing labels provide no information at all. Hence, partial multi-label learning (PML) [17] is proposed to address this issue by preserving all the potentially correct labels. Formally, each instance $x$ is equipped with a set of candidate labels $S$, only some of which are the true relevant labels. The remaining labels are called false positive labels or distractor labels. Technically, PML can be regarded as a dual problem of MLML and solved by existing MLML techniques. However, this strategy may be less practical owing to the sparsity of the label space. Moreover, PML also provides a safe way to protect data privacy since, as opposed to MLML data, no label can be determined to be the ground truth.

3.3.1 Two-stage Learning Methods

In PML, while label correlation still matters, the other key issue becomes identifying the ground truth within the candidate label set rather than completing missing labels. To handle these issues, some PML algorithms adopt a two-stage learning framework. Formally, an enriched label representation $\Lambda=[\lambda_{ij}]\in\mathbb{R}^{L\times n}$ is learned, where each $\lambda_{ij}$ is a real number. The sign of $\lambda_{ij}$ indicates whether the label is positive or negative, while the magnitude reflects the confidence of the relevance. The PML problem is then transformed into a canonical supervised learning problem and a classifier can be easily induced. To obtain $\Lambda$, PARTICLE [113] uses a label propagation technique that aggregates information from the $k$-nearest neighbors. After that, the confidences are converted back to logical labels by thresholding. To train an MLC classifier, [113] adopts a pairwise label ranking model coupled with virtual label splitting or maximum a posteriori (MAP) reasoning. PARTICLE has two main drawbacks. First, the confidences carry richer information than logical labels, but this information is discarded by thresholding. Second, only second-order label correlation is considered.
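The first stage of such a two-stage pipeline can be illustrated with a deliberately simplified sketch: each candidate label receives a confidence equal to the weighted fraction of neighbors that also list it as a candidate, and the confidences are then thresholded. This is a generic neighbor-voting stand-in, not the propagation procedure of PARTICLE [113]. The function names, the masking step and the threshold `tau` are assumptions. For convenience the arrays are stored as $n\times L$ (instances in rows), i.e., the transpose of $\Lambda$.

```python
import numpy as np

def neighbor_vote_confidences(W, S):
    """Confidence of each candidate label from the neighbors' candidate sets.

    W: (n x n) row-normalized kNN weight matrix.
    S: (n x L) binary candidate-label matrix.
    Labels outside an instance's own candidate set keep confidence 0.
    """
    return (W @ S) * S

def threshold_to_logical(conf, tau=0.5):
    """Convert real-valued confidences back to 0/1 labels (information is lost here)."""
    return (conf >= tau).astype(int)
```

For instance, with three mutually neighboring instances whose candidate sets are {0, 1}, {0} and {0, 2}, label 0 is supported by every neighbor while labels 1 and 2 are supported by none, so only label 0 survives thresholding. The hard threshold also shows the first drawback noted above: two labels with confidences 0.51 and 0.99 become indistinguishable.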

To tackle these problems, DRAMA [18] generates the label confidence matrix under the guidance of the feature manifold and the candidate label set. Then, a novel gradient boosting decision tree (GBDT) based multi-output regressor is trained directly on the transformed data set $\tilde{\mathcal{D}}=\{(x_{i},\lambda_{i})|i\in\{1,\ldots,n\}\}$, where $\lambda_{i}$ is the $i$-th column vector of $\Lambda$. In the $t$-th boosting round, DRAMA augments the feature space using previously learned labels. Therefore, high-order label correlations are automatically exploited to improve performance.

The major limitation of the aforementioned methods is that disambiguation is achieved purely through features. However, label correlation itself can help to identify the correct labels, and insufficient disambiguation makes the induced MLC classifier error-prone. To this end, PML-LD [109] proposes a novel label enhancement method that transforms the PML problem into a label distribution learning (LDL) problem [107]. When learning the label confidence matrix, PML-LD leverages sample-level smoothness and local label-level smoothness [79] so that the candidate label set can be fully disambiguated. Then, the confidences are normalized by softmax to form an LDL problem and a multi-output support vector machine is induced.
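The softmax normalization step mentioned above can be written in a few lines. The sketch below assumes the confidences are stored with one instance per row ($n\times L$). The function name `confidences_to_distribution` is illustrative and not taken from PML-LD [109].

```python
import numpy as np

def confidences_to_distribution(conf):
    """Row-wise softmax: turns real-valued label confidences into label distributions."""
    shifted = conf - conf.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)
```

Each output row is non-negative and sums to one, and the ordering of labels within a row is preserved, so the relative importance encoded in the confidences carries over to the distribution that the downstream regressor is trained on.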

The advantages of two-stage PML methods are twofold. First, once the label confidences are obtained, we can apply well-studied multi-output learning methods [114]. Second, the real-valued confidences reflect the relative intensity of relevance or irrelevance, which may give us more information about the data.

3.3.2 End-to-end Learning Methods

As mentioned above, two-stage PML methods usually need to be carefully designed, or the induced MLC classifier may be error-prone due to insufficient disambiguation. Hence, many PML algorithms are developed in an end-to-end fashion, and they vary considerably from one to another.

[17] proposes a ranking model that uses the label confidence as a weight in the ranking loss. To estimate the label confidences, [17] provides two practical approaches, based on label correlation and feature prototypes respectively. Moreover, the classification model and the ground-truth confidences are optimized in a unified framework so that the two subproblems can benefit from each other. [115] presents a soft sign thresholding method to measure the discrepancy between the real-valued confidences and the candidate labels. Similar to [17], classifier training and disagreement minimization are performed at the same time. Nevertheless, [115] does not exploit label correlations well, and thus its performance is limited.
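As a rough illustration of how a label confidence can weight a ranking loss, the sketch below sums a hinge penalty over all (candidate, non-candidate) label pairs of one instance and scales each pair by the confidence of the candidate label. The hinge surrogate, the unit margin, the lack of normalization and the name `weighted_ranking_loss` are assumptions made for this example. The exact objective of [17] may differ.

```python
import numpy as np

def weighted_ranking_loss(scores, candidate_mask, confidence):
    """Confidence-weighted pairwise hinge loss for a single instance.

    scores:         (L,) classifier outputs.
    candidate_mask: (L,) 1 for candidate labels, 0 otherwise.
    confidence:     (L,) confidence that each candidate label is a true label.
    """
    pos = np.flatnonzero(candidate_mask == 1)
    neg = np.flatnonzero(candidate_mask == 0)
    # margin violation for every (candidate p, non-candidate q) pair
    hinge = np.maximum(0.0, 1.0 - (scores[pos][:, None] - scores[neg][None, :]))
    return float(np.sum(confidence[pos][:, None] * hinge))
```

A candidate label with low confidence contributes little to the loss, so the classifier is not forced to rank a likely distractor above the non-candidate labels. In a unified framework of the kind described above, `confidence` would be updated alongside the classifier rather than fixed.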

Some methods adopt the low-rank assumption. fPML [116] introduces a matrix factorization technique to obtain a shared latent space for both features and labels. The classifier is then trained by fitting the recovered labels. PML-LRS [21] uses a low-rank and sparse decomposition scheme. That is, it assumes that the distractor label matrix is sparse while the ground-truth matrix is low-rank. Both fPML and PML-LRS treat the false-positive labels as randomly generated noise. However, in real-world applications, the false-positive labels may be caused by ambiguous contents of the instance. Therefore, [117] divides the classifier $W$ into two parts, $W=U+V$. Here $U$ is the multi-label classifier and $V$ is the distractor label identifier. $U$ is constrained to be low-rank to exploit label correlations. Since distractor labels usually correlate with only a few ambiguous features, $V$ is regularized to be sparse. MUSER [118] takes redundant labels together with noisy features into account by jointly exploring feature and label subspaces. Furthermore, it uses a manifold regularizer to ensure consistency between features and latent labels.
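Low-rank and sparse structure of the kind imposed on $U$ and $V$ is commonly encouraged with the nuclear norm and the $\ell_1$ norm, whose proximal operators are singular value thresholding and entry-wise soft thresholding. The sketch below shows these two standard operators. Using them for the decomposition $W=U+V$ is an assumption for illustration, since the optimization procedure of [117] is not described here.

```python
import numpy as np

def soft_threshold(A, tau):
    """Proximal operator of tau * ||A||_1: shrinks every entry toward zero (promotes sparsity)."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def singular_value_threshold(A, tau):
    """Proximal operator of tau * ||A||_*: shrinks singular values (promotes low rank)."""
    P, s, Qt = np.linalg.svd(A, full_matrices=False)
    return (P * np.maximum(s - tau, 0.0)) @ Qt
```

In a proximal-gradient style solver, one would alternate a gradient step on the data-fitting term with `singular_value_threshold` applied to the low-rank part and `soft_threshold` applied to the sparse part.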

Remark. The PML problem is drawing increasing attention in the community. However, the assumption that all candidate labels are equally likely to be relevant can be impractical, since some ground-truth labels can be easily distinguished. Therefore, the key assumptions of PML should be carefully revisited. Here we recommend a more practical setting in which, besides providing the candidate set, annotators also provide partial rankings indicating which labels are more likely to be correct. Besides, existing PML data sets are mainly built upon multi-instance multi-label [119] data sets, and thus there is also an urgent need to establish a benchmark for PML problems.

3.4 Other Settings

The complexity of the label space has given rise to various kinds of improperly-supervised MLC settings. In what follows, we briefly review some other recent settings in the literature.

MLC with Noisy Labels (Noisy-MLC). While MLML and PML consider single-side noise, Noisy-MLC assumes that noise occurs in both relevant and irrelevant labels. Many effective Noisy-MLC algorithms have been proposed to address this problem, including graph-based methods [120], probabilistic models [121], and teacher-student models [122]. In [106], the WS-MLC framework is extended and the data set is assumed to contain noisy labels. Some works [123, 24] maintain a small set of clean data to reduce the noise in the large data set. Since learning from label noise has been a hot topic in the community, Noisy-MLC deserves more attention.

MLC with Unseen Labels. In the aforementioned settings, the label space is fixed during training and testing. However, in practice, the label space may be dynamically expanded. For instance, [124] studies an online MLC setting in which an arriving data instance may be associated with unknown labels. In [35], a knowledge distillation method is used to handle streaming labels. Multi-label zero-shot learning (ML-ZSL) [36] requires the prediction of unknown labels that are not defined during training. To make ML-ZSL feasible, external semantic information is usually involved, such as word vectors [125] and knowledge graphs [36]. [126] also considers few-shot labels, i.e., labels that relate to only a few instances in the data set and are thus nearly unseen.

Multi-Label Active Learning (MLAL). Active learning is a notable way to alleviate the difficulty of multi-label tagging. The idea is to carefully select the most informative data instances for labeling such that better models can be trained with less labeling effort. A variety of works have studied MLAL problems. For example, [127] adopts maximum loss reduction with maximal confidence as the sampling criterion for MLAL. [26] solves MLAL problems via a probability model. Moreover, MLAL is also considered in crowdsourcing [128] and novel queries [129] tasks.

Label Distribution Learning (LDL). LDL [107] is a general framework that assigns to each instance $L$ normalized real values, one per label, each indicating the degree to which the label describes the instance. It aims to tackle inherent ambiguity in data annotation, e.g., a facial expression usually conveys a complex mixture of basic emotions. Since it is difficult to obtain the label distribution directly, many works [108, 130, 131] focus on recovering label distributions from logical labels, which is known as label enhancement (LE). LE is an effective learning strategy for dealing with label ambiguity. It is also applied in WS-MLC [106] and PML [109] to handle imperfect supervision signals.

MLC with Multiple Instances (MIML). MIML [119] is a popular setting which assumes that each example is described by multiple instances and associated with multiple binary labels. Recent studies in MIML [132, 133] have developed many deep learning models that can effectively identify noisy instances. Nevertheless, MIML mainly focuses on instance-level ambiguity rather than label ambiguity. Hence, we do not discuss it further.

Remark. Intelligent systems are engaged in increasingly difficult and complicated tasks, and thus new settings like PML and LDL deserve more attention. Moreover, more challenging and complicated settings in real-world applications remain to be explored. For instance, out-of-distribution detection [134], domain shift [135] and other problems may arise in MLC.

4 Deep Learning for Multi-label Learning

Owing to its powerful learning capability, deep learning has achieved state-of-the-art performance in many real-world multi-label applications, e.g., multi-label image classification. In MLC problems, the key is to harness deep learning to better capture label dependencies. In this section, we first introduce some representative deep embedding methods for MLC, then present deep learning for challenging MLC, and finally review advanced deep learning for MLC.

4.1 Deep Embedding Methods for MLC

Unlike conventional multi-label methods, deep neural networks (Deep NNs) often seek a new feature space and place a multi-label classifier on top. To our knowledge, BP-MLL [29] is the first method to use an NN architecture for the multi-label learning problem. To explicitly exploit the dependencies among labels, given the neural network $F$, BP-MLL introduces a pairwise loss function for each instance $x_{i}$:

$$E_{i}=\frac{1}{\lvert y_{i}^{1}\rvert\lvert y_{i}^{0}\rvert}\sum_{(p,q)\in y_{i}^{1}\times y_{i}^{0}}\exp\left(-\left(F(x_{i})^{p}-F(x_{i})^{q}\right)\right) \tag{9}$$

where $y_{i}^{1}$ and $y_{i}^{0}$ denote the sets of positive and negative labels of the $i$-th instance $x_{i}$ respectively, and $F(x_{i})^{p}$ denotes the $p$-th entry of $F(x_{i})$. The difference $F(x_{i})^{p}-F(x_{i})^{q}$ measures the gap between the network outputs on a positive and a negative label, and the exponential function severely penalizes pairs in which the negative label is scored above the positive one. Minimizing (9) therefore drives the network to output larger values for positive labels and smaller values for negative labels. [29] further shows that (9) is closely related to the ranking loss.
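The per-instance loss in (9) translates directly into array code. The sketch below assumes the network outputs $F(x_{i})$ are already available as a vector and that the instance has at least one positive and one negative label. The function name `bpmll_instance_loss` is illustrative.

```python
import numpy as np

def bpmll_instance_loss(outputs, y):
    """Pairwise exponential loss of Eq. (9) for one instance.

    outputs: (L,) network outputs F(x_i).
    y:       (L,) binary label vector; needs at least one 1 and one 0.
    """
    pos = outputs[y == 1]                      # outputs on positive labels
    neg = outputs[y == 0]                      # outputs on negative labels
    diff = pos[:, None] - neg[None, :]         # F(x_i)^p - F(x_i)^q for every pair (p, q)
    return float(np.exp(-diff).sum() / (pos.size * neg.size))
```

With outputs `[1.0, -1.0]` and labels `[1, 0]`, the single pair has a gap of 2 and the loss is $e^{-2}$. Swapping the labels makes the gap $-2$ and the loss $e^{2}$, which shows how sharply mis-ordered pairs are penalized.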

Later, [136] finds that BP-MLL does not perform as expected on data sets in the textual domain. To address this issue, [136] builds on BP-MLL and proposes a comparably simple NN approach that achieves state-of-the-art performance in large-scale multi-label text classification. They show that the ranking loss in BP-MLL can be efficiently and effectively replaced by the commonly used cross-entropy function, and that several NN techniques, namely rectified linear units (ReLUs), Dropout, and AdaGrad, can be effectively employed in this setting.
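For comparison with the pairwise loss, a per-label cross-entropy for one instance can be sketched as follows. The sigmoid output layer, the sum over labels and the function name `multilabel_cross_entropy` are assumptions for this example and are not claimed to match the exact formulation in [136].

```python
import numpy as np

def multilabel_cross_entropy(logits, y):
    """Sum over labels of the binary cross-entropy with sigmoid outputs.

    Uses the stable form max(z, 0) - z*y + log(1 + exp(-|z|)) to avoid overflow.
    """
    z = np.asarray(logits, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z)))))
```

Unlike (9), whose cost grows with the number of (positive, negative) label pairs, this loss is a sum of $L$ independent terms, which is one reason it is cheaper to evaluate when the label set is large.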

Embedding methods are effective at capturing label dependency and reducing computation costs. However, earlier embedding methods are shallow models, which may not be powerful enough to discover high-order dependencies among labels. To fill this gap, [30] proposes the Canonical Correlated AutoEncoder (C2AE), which is, to our knowledge, the first DNN-based embedding method for MLC. The basic idea of C2AE is to seek a deep latent space in which instances and labels are jointly embedded. C2AE performs feature-aware label embedding and label-correlation aware prediction. The former is realized by jointly learning deep canonical correlation analysis (DCCA) and the encoding stage of an autoencoder, while the latter is achieved by the loss function introduced for the decoding outputs.

C2AE consists of two DNN modules, i.e., DCCA and an autoencoder, and seeks three mapping functions: the feature mapping $F_{x}$, the encoding function $F_{e}$, and the decoding function $F_{d}$. For training, C2AE receives instances $X$ and labels $Y$, associates them in the latent space $L$, and enforces the recovery of $Y$ using the autoencoder. The objective function of C2AE is defined as follows:

$$\min_{F_{x},F_{e},F_{d}}\ \Phi(F_{x},F_{e})+\alpha\Gamma(F_{e},F_{d}) \tag{10}$$

where $\Phi(F_{x},F_{e})$ and $\Gamma(F_{e},F_{d})$ denote the losses in the latent and output spaces respectively, and $\alpha$ balances the two terms. Inspired by CCA, C2AE learns the deep latent space by maximizing the correlation between instances and labels. Thus $\Phi(F_{x},F_{e})$ can be defined as:

$$\min_{F_{x},F_{e}}\ \|F_{x}(X)-F_{e}(Y)\|_{F}^{2}\quad\text{s.t.}\quad F_{x}(X)F_{x}(X)^{T}=F_{e}(Y)F_{e}(Y)^{T}=I \tag{11}$$

In addition, C2AE recovers the labels using the autoencoder with the aim of preserving label dependency. Inspired by [29], $\Gamma(F_{e},F_{d})$ is defined as follows:

$$\Gamma(F_{e},F_{d})=\sum_{i=1}^{N}E_{i},\qquad E_{i}=\frac{1}{\lvert y_{i}^{1}\rvert\lvert y_{i}^{0}\rvert}\sum_{(p,q)\in y_{i}^{1}\times y_{i}^{0}}\exp\left(-\left(F_{d}(F_{e}(x_{i}))^{p}-F_{d}(F_{e}(x_{i}))^{q}\right)\right) \tag{12}$$

where $N$ is the number of instances and $F_{d}(F_{e}(x_{i}))$ is the label of $x_{i}$ recovered by the autoencoder. For prediction, given a test instance $\hat{x}$, C2AE predicts $\hat{y}=F_{d}(F_{x}(\hat{x}))$.
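The way the two loss terms combine, and why prediction uses $F_{x}$ rather than $F_{e}$, can be seen in a small sketch. Here the three mappings are passed in as plain Python callables standing in for the deep networks, the orthogonality constraint of (11) is not enforced, and the autoencoder path is read as the decoder applied to the encoded label vector of instance $i$. All of these, along with the function names, are simplifying assumptions rather than the C2AE implementation.

```python
import numpy as np

def pairwise_exp_loss(out, y):
    """E_i of Eq. (12) for one recovered label vector `out` and binary labels `y`."""
    pos, neg = out[y == 1], out[y == 0]
    return float(np.exp(-(pos[:, None] - neg[None, :])).sum() / (pos.size * neg.size))

def c2ae_objective(X, Y, Fx, Fe, Fd, alpha):
    """Phi + alpha * Gamma, following Eqs. (10)-(12) without the constraint in (11)."""
    Zx, Ze = Fx(X), Fe(Y)                      # embed features and labels in the latent space
    phi = np.sum((Zx - Ze) ** 2)               # squared Frobenius distance between embeddings
    recon = Fd(Ze)                             # labels recovered by the autoencoder
    gamma = sum(pairwise_exp_loss(recon[i], Y[i]) for i in range(Y.shape[0]))
    return float(phi + alpha * gamma)

def c2ae_predict(x_new, Fx, Fd):
    """Test-time path: no labels are available, so features are embedded and then decoded."""
    return Fd(Fx(x_new))
```

Because $\Phi$ pulls $F_{x}(X)$ toward $F_{e}(Y)$, the decoder trained on label embeddings can be reused on feature embeddings at test time, which is what `c2ae_predict` does.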

Later, inspired by C2AE, [137] presents a two-stage label embedding model based on the neural factorization machine. It first exploits second-order label correlation via a factorization layer and then learns high-order correlation with additional fully-connected layers. [57] proposes another deep embedding method, Deep Correlation Structure Preserved Label Space Embedding (DCSPE). In addition to DCCA, DCSPE develops deep multidimensional scaling (DMDS) to preserve the intrinsic structure of the latent space. At test time, DCSPE transforms a test instance into the latent space, searches for its nearest neighbor, and takes the label of this neighbor as the prediction. However, as the $k$NN search is time-consuming, $k$NN embedding methods are computationally expensive in the large-scale setting. To solve this issue, [138] proposes deep binary prototype compression (DBPC) for fast multi-label prediction. DBPC compresses the database into a small set of short binary prototypes, and uses the prototypes for prediction.
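The appeal of short binary prototypes is that the neighbor search reduces to counting differing bits. The sketch below shows only this lookup step with a Hamming distance. How the codes and prototypes are learned in DBPC [138] is not covered, and the function name and array layout are assumptions for illustration.

```python
import numpy as np

def hamming_nearest_label(query_code, proto_codes, proto_labels):
    """Return the label vector of the prototype closest to the query in Hamming distance.

    query_code:   (b,) binary code of the test instance.
    proto_codes:  (m, b) binary prototype codes.
    proto_labels: (m, L) label vectors attached to the prototypes.
    """
    dists = np.count_nonzero(proto_codes != query_code, axis=1)  # number of differing bits
    return proto_labels[np.argmin(dists)]
```

With $m$ prototypes of $b$ bits each, a query costs on the order of $m\cdot b$ bit comparisons regardless of the size of the original training set, which is where the speed-up over an exhaustive real-valued $k$NN search comes from.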

For multi-label emotion classification, [139] recently proposes latent emotion memory (LEM) to learn the latent emotion distribution without external knowledge. LEM includes latent emotion and memory modules that learn the emotion distribution and emotional features respectively, and the concatenation of the two is fed into a Bi-directional Gated Recurrent Unit (BiGRU) for prediction. For multi-label image classification, [140] proposes a unified deep neural network that exploits both semantic and spatial relations between labels with only image-level supervision. Specifically, the authors propose a Spatial Regularization Network (SRN) that generates attention maps for all labels and captures the underlying relations between them via learnable convolutions. [141] finds that the attention regions of CNN classifiers are not consistent under many transforms. To address this issue, the authors propose a two-branch network that takes original and transformed images as inputs, and introduce a new attention consistency loss that measures the consistency of the attention heatmaps between the two branches. Later, [142] proposes Adjacency-based Similarity Graph Embedding (ASGE) and Cross-modality Attention (CMA) to capture the dependencies between labels and to discover the locations of discriminative features respectively. Specifically, ASGE learns semantic label embeddings that explicitly exploit label correlations, and CMA generates meaningful attention maps by leveraging more prior semantic information. Instead of requiring laborious object-level annotations, [143] proposes to distill knowledge from a weakly-supervised detection (WSD) task to boost MLC performance. The authors construct an end-to-end MLC framework augmented by a knowledge distillation module in which the WSD model guides the classification model on object RoIs. WSD and MLC are the teacher and student models respectively.

Remark. Deep embedding methods are the most widely used deep methods for MLC. Among them, C2AE is a pioneering deep embedding work for MLC and has been applied in many real-world applications, including multi-label emotion classification. It is a good starting point for beginners who want to understand the basic mechanism. Label correlation is key for MLC, and some objectives, e.g., (12), have been used to model label correlation in deep embedding methods. However, existing research shows that (12) may not be effective in the textual domain. How to effectively capture label correlation therefore remains an important research topic for deep embedding methods. Some advanced techniques, e.g., graph convolutional networks (GCN) and recurrent neural networks (RNN), open the door to better capturing label correlation and can motivate further research on deep embedding methods for MLC.

4.2 Deep Learning for Challenging MLC

In real-world applications, multi-label learning is often challenging due to the complex setting of labels. For instance, the number of labels may be extremely large, a setting known as XMLC; the labels are often only partially or weakly given; and labels may emerge continuously or even be previously unseen. This section reviews recent advances in deep learning that address these challenging MLC problems.

DL for Extreme MLC. To our knowledge, XML-CNN [31] is the first attempt at applying deep learning to XMLC. XML-CNN applies a convolutional neural network (CNN) with dynamic pooling to learn the text representation, and uses a hidden bottleneck layer much smaller than the output layer to achieve computational efficiency. However, XML-CNN still struggles to capture the subtext that is important for each label. To address this issue, AttentionXML [32] is proposed with two distinctive features: 1) a multi-label attention mechanism with raw text as input, which captures the part of the text most relevant to each label, and 2) a shallow and wide probabilistic label tree (PLT), which makes it possible to handle millions of labels, especially "tail labels". Meanwhile, based on C2AE, a new deep embedding method, Ranking-based Auto-Encoder (Rank-AE) [33], is proposed for XMLC. Rank-AE first uses an efficient attention mechanism to learn rich representations from any type of input features, then learns a latent space for instances and labels, and finally develops a margin-based ranking loss that is more effective for XMLC and noisy labels. [144] empirically demonstrates that overfitting leads to the poor performance of DNN-based embedding methods for XMLC. Based on this finding, [144] further proposes a new regularizer, GLaS, for embedding-based neural network approaches. [145] fine-tunes a pretrained deep transformer for better feature representation. The authors propose a novel label clustering model for XMLC in which the transformer serves as a neural matcher. With these techniques, state-of-the-art performance is achieved on several widely used extreme data sets. Very recently, [146] develops the DeepXML framework, which generates a family of algorithms by decomposing the problem into four sub-tasks: intermediate representation, negative sampling, transfer learning, and classifier learning. It yields the Accelerated Short Text Extreme Classifier (Astec), which is more accurate and faster than state-of-the-art deep XMLC methods on public short-text data sets. DECAF [147] considers label metadata, e.g., textual descriptions of labels, which is informative but usually ignored by existing methods. DECAF jointly learns a model enriched by label metadata together with the feature representation, and makes accurate predictions at the scale of millions of labels. ECLARE [148] incorporates label text and label correlations, and develops a frugal architecture and scalable techniques to train a model with a label correlation graph over millions of labels. Similarly, GalaXC [149] learns collaboratively over joint document-label graphs that can incorporate various sources, e.g., label metadata. GalaXC further introduces label-wise attention to obtain high-capacity extreme classifiers. [149] reports that GalaXC is up to 18% more accurate than state-of-the-art methods, while training 2-50 times faster and predicting 10 times faster on benchmark data sets.
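
The label-wise (multi-label) attention idea mentioned for AttentionXML and GalaXC can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `label_wise_attention`, the shapes, and the random inputs are assumptions made for illustration, and in practice the token representations would come from a trained text encoder.

```python
import numpy as np

def label_wise_attention(H, W):
    """Compute one context vector per label.

    H: (T, d) token representations of one document (assumed given).
    W: (L, d) one attention query vector per label (hypothetical parameters).
    Returns (context, alpha): context is (L, d), alpha is (L, T).
    """
    scores = W @ H.T                              # (L, T) relevance of each token to each label
    scores -= scores.max(axis=1, keepdims=True)   # stabilise the softmax
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)     # each label's weights sum to 1 over tokens
    context = alpha @ H                           # (L, d) label-specific document summaries
    return context, alpha

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))   # 6 tokens, 4-dimensional representations
W = rng.normal(size=(3, 4))   # 3 labels
context, alpha = label_wise_attention(H, W)
```

Each row of `alpha` shows which tokens a given label attends to, which is what lets a rare label pick out its own relevant part of the text instead of sharing a single pooled document vector.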

DL for partial and weakly-supervised MLC. Several efforts [150, 34, 19, 151] have been made towards MLC with partial labels. [34] empirically shows that partially annotating all images is better than fully annotating a small subset. Accordingly, [34] generalizes the standard binary cross-entropy loss by exploiting label proportion information, and develops an approach based on graph neural networks (GNNs) to explicitly model the correlation between categories. Later, [19] regularizes the cross-entropy loss with a cost function that measures the smoothness of labels and image features on the data manifold, and develops an efficient interactive learning framework in which similarity learning and CNN training interact and improve each other. [23] presents the first deep generative model to tackle weakly-supervised MLC (WS-MLC): a probabilistic framework that integrates sequential prediction and generation processes to exploit information from unlabeled or partially labeled data.
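
To make the idea of a cross-entropy loss that accounts for the proportion of known labels concrete, the following is a simplified sketch rather than the exact loss of [34]. It assumes a hypothetical encoding in which 1 marks a positive label, 0 a negative label and -1 an unknown label, and it simply averages each example's loss over its known labels.

```python
import numpy as np

def partial_bce(probs, targets, eps=1e-12):
    """Binary cross-entropy over known labels only.

    probs:   (n, L) predicted probabilities in (0, 1).
    targets: (n, L) with 1 = positive, 0 = negative, -1 = unknown (assumed encoding).
    Each example's loss is averaged over its known labels, so sparsely
    annotated examples are not down-weighted relative to fully labeled ones.
    """
    known = targets >= 0
    y = np.where(known, targets, 0).astype(float)
    p = np.clip(probs, eps, 1.0 - eps)
    per_label = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    per_label = np.where(known, per_label, 0.0)    # unknown labels contribute nothing
    n_known = np.maximum(known.sum(axis=1), 1)     # avoid division by zero
    return float(np.mean(per_label.sum(axis=1) / n_known))

probs = np.array([[0.9, 0.2, 0.5], [0.1, 0.8, 0.7]])
targets = np.array([[1, 0, -1], [0, -1, -1]])
loss = partial_bce(probs, targets)
```

Dividing by the number of known labels is the simplest normalization by label proportion; a learned or tuned normalization function could be substituted without changing the masking logic.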

DL for MLC with unseen labels. Conventional MLC assumes that the label set is fixed and static, ignoring the fact that labels emerge continuously in changing environments. To fill this gap, a novel DNN-based method, Deep Streaming Label Learning (DSLL) [35], is proposed to deal effectively with MLC with newly emerged labels. DSLL uses streaming label mapping, streaming feature distillation, and a senior student network to exploit the knowledge from past labels and historical models when learning new labels. In addition, [35] theoretically proves that DSLL admits tight generalization error bounds for new labels in the DNN framework. Different from DSLL, [36] incorporates additional knowledge graphs for multi-label zero-shot learning (ML-ZSL). [36] develops a label propagation mechanism in the semantic space, which enables the learned model to reason about and predict unseen labels.

Remark. MLC is challenging due to the high complexity of labels. [149], [34], [23], and [35] are representative deep learning works that beginners can consult for extreme MLC, partial MLC, weakly-supervised MLC, and MLC with unseen labels, respectively. The above attempts focus only on the challenges of the label space. However, real-world MLC problems also pose challenges in the feature space, e.g., some features may vanish or be augmented, and the feature distribution may change. Addressing challenges in the label and feature spaces simultaneously is more difficult still, and can be regarded as future research on challenging MLC problems.

4.3 Advanced Deep Learning for MLC

Recently some advanced deep learning architectures have been developed for MLC problems.

To exploit the underlying rich label structure, [37] proposes Architectures Deep In Output Space (ADIOS), which partitions the labels into a Markov Blanket Chain and then applies a novel deep architecture that exploits the partition. In multi-label image classification, CNN-RNN [152] utilizes recurrent neural networks (RNNs) to better exploit the higher-order label dependencies of an image. CNN-RNN learns a joint image-label embedding to characterize the semantic label dependency as well as the image-label relevance, and it can be trained end-to-end from scratch to integrate both kinds of information in a unified framework. In addition, instead of using a classifier chain, [38] proposes to use an RNN to convert MLC into a sequential prediction problem, where the labels are first ordered in an arbitrary fashion. The key advantage is that the model can focus on predicting only the positive labels, a much smaller set than the full set of possible labels. [153] employs a Long Short-Term Memory (LSTM) sub-network to sequentially predict semantic labeling scores on the located regions and capture the global dependencies of these regions, achieving superior performance in large-scale multi-label image classification. The method of [154] does not require pre-defined label orders; it integrates visual attention and LSTM layers and learns them jointly for multi-label image classification. Instead of a fixed, static label ordering, [39] assumes a dynamic, context-dependent label ordering. [39] presents two approaches: a simple EM-like algorithm that bootstraps the learned model, and a more principled approach based on reinforcement learning. The experiments show that the reinforcement-learning-based dynamic label ordering outperforms RNNs with a fixed label ordering. [155] proposes a new framework based on optimal completion distillation and multitask learning that also does not require a predefined label order. Recently, [156] proposes predicted label alignment (PLA) and minimal loss alignment (MLA) to dynamically order the ground truth labels according to the predicted label sequence. This allows faster training of better LSTM models and obtains state-of-the-art results in large-scale image classification.

Graph Convolutional Networks (GCNs) [80] have also been used successfully to model label dependency in MLC. For multi-label image classification, [157] first builds a directed graph over the object labels, employs a GCN to model the correlations between labels, and maps the label representations to inter-dependent object classifiers. Similarly, Semantic-Specific Graph Representation Learning (SSGRL) [158] includes a semantic decoupling module and a semantic interaction module, which learn and correlate semantic-specific representations respectively. The correlation is achieved by a GCN on a graph built from label co-occurrence. Later, [159] adds lateral connections between the GCN and the CNN at shallow, middle and deep layers, so that label information can be better injected into the backbone CNN for label awareness. For multi-label patent classification, which is regarded as a multi-label text classification problem, [160] proposes a new deep learning model based on GCN to capture rich semantic information. The authors design an adaptive non-local second-order attention layer to model long-range semantic dependencies in text content as label attention for patent categories.
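
The pipeline described for [157] (label graph, GCN over label representations, inter-dependent classifiers) can be sketched as below. This is an illustrative simplification under stated assumptions: it uses a symmetric co-occurrence graph with the standard GCN normalization rather than the directed graph of [157], and all names, sizes and random weights are hypothetical.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalisation D^{-1/2} (A + I) D^{-1/2} of the standard GCN layer."""
    A_hat = A + np.eye(A.shape[0])                # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_label_classifiers(A, E, W1, W2):
    """Two GCN layers mapping label embeddings E (L, e) to classifiers (L, d)."""
    A_norm = normalize_adjacency(A)
    hidden = np.maximum(A_norm @ E @ W1, 0.0)     # ReLU
    return A_norm @ hidden @ W2

rng = np.random.default_rng(0)
L, e, h, d = 4, 5, 8, 6
Y = rng.integers(0, 2, size=(50, L))              # toy training label matrix
A = (Y.T @ Y).astype(float)                       # label co-occurrence counts
np.fill_diagonal(A, 0.0)
E = rng.normal(size=(L, e))                       # stand-in for label word embeddings
W1, W2 = rng.normal(size=(e, h)), rng.normal(size=(h, d))
classifiers = gcn_label_classifiers(A, E, W1, W2) # (L, d), one classifier per label
image_feature = rng.normal(size=d)                # stand-in for a CNN image feature
scores = classifiers @ image_feature              # (L,) one logit per label
```

Because every classifier is produced by propagating information over the label graph, labels that co-occur frequently end up with related classifiers, which is how the dependency is injected.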

As an alternative to DNNs, deep forest [161] is a recent deep learning framework based on tree model ensembles, which does not rely on backpropagation. Motivated by the advantages of deep forest models, [40] introduces deep forest for MLC. The proposed Multi-Label Deep Forest (MLDF) can handle two challenging problems in MLC: optimizing different performance measures and reducing overfitting. Extensive experiments show that MLDF achieves the best performance on the hamming loss, one-error, coverage, ranking loss, average precision and macro-AUC measures.

Remark. Advanced deep architectures have more powerful learning capability, and can thus be more effective for MLC. Beginners can try representative works based on RNN [152], GCN [157] and deep forest [40] to address MLC problems. However, these advanced deep methods usually contain a very large number of parameters and incur high training and prediction costs. Devising lightweight architectures for efficient training and prediction is therefore worth exploring for advanced deep methods for MLC.

5 Online Multi-label Learning

Many real-world applications generate a massive volume of streaming data. For example, content from many web-related applications, such as Twitter and Facebook posts and RSS feeds, carries multiple forms of categorization tags. In the search industry, revenue comes from clicks on ads embedded in the result pages, and ad selection and placement can be significantly improved if ads are tagged correctly. Learning in this streaming scenario, referred to as online multi-label learning, is a popular way to address large-scale multi-label classification tasks.

Current off-line MLC methods assume that all data are available in advance for learning. However, designing MLC methods under this assumption has two major limitations: first, these methods are impractical for large-scale data sets, since they require the entire data set to be stored in memory; second, it is non-trivial to adapt off-line multi-label methods to sequential data. In practice, data are collected sequentially, and data collected earlier may expire as time passes. Therefore, it is important to develop new online multi-label learning methods to deal with streaming data. This section reviews the latest algorithms for online multi-label classification.

[162] proposes an online universal classifier (OUC) to handle binary, multi-class and multi-label classification problems. To accommodate all three types of classification, OUC pre-processes the data set so that the target label is always represented as a vector whose dimension equals the number of output labels. A deep learning model is then employed for online training.

Based on the extreme learning machine (ELM) [163], a single-hidden-layer feedforward neural network model, [42] proposes the OSML-ELM approach to handle streaming multi-label data. OSML-ELM uses a sigmoid activation function and output weights to predict the labels. At each step, the output weights are learned from a specific update equation. OSML-ELM converts the label set from a single to a multiple representation in order to solve multi-label classification problems.
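
As a rough illustration of how an ELM's output weights can be updated as data chunks arrive, the sketch below uses the generic recursive least-squares update familiar from online sequential ELMs. It is an assumption that this matches the spirit of OSML-ELM; the exact equation of [42] may differ. The class name, the ridge term `reg` and the bipolar (+1/-1) label encoding are all hypothetical choices.

```python
import numpy as np

class OnlineELM:
    """Single-hidden-layer ELM with recursive least-squares output weights."""

    def __init__(self, n_features, n_hidden, n_labels, reg=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(n_features, n_hidden))  # random hidden weights, never trained
        self.b = rng.normal(size=n_hidden)
        self.beta = np.zeros((n_hidden, n_labels))        # output weights
        self.P = np.eye(n_hidden) / reg                   # inverse of the regularised Gram matrix

    def _hidden(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))  # sigmoid activation

    def partial_fit(self, X, Y):
        H = self._hidden(X)
        # Rank-k update of P via the Woodbury identity, then correct beta.
        K = np.linalg.inv(np.eye(H.shape[0]) + H @ self.P @ H.T)
        self.P = self.P - self.P @ H.T @ K @ H @ self.P
        self.beta = self.beta + self.P @ H.T @ (Y - H @ self.beta)

    def predict(self, X, threshold=0.0):
        return (self._hidden(X) @ self.beta > threshold).astype(int)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
Y = np.where(rng.random(size=(40, 3)) < 0.3, 1.0, -1.0)   # bipolar multi-label targets
model = OnlineELM(n_features=5, n_hidden=10, n_labels=3)
for start in range(0, 40, 8):                              # stream of chunks of 8 examples
    model.partial_fit(X[start:start + 8], Y[start:start + 8])
```

After all chunks have been processed, the output weights coincide (up to numerical error) with the ridge-regression solution computed on the full data, so no past examples need to be kept in memory.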

OLANSGD [41] is based on label ranking, where the ranking functions are learned by minimizing the ranking loss in the large-margin framework. However, the memory and computational costs of this process are high on large-scale data sets. Stochastic gradient descent (SGD) approaches update the model parameters using only the gradient information calculated from a single label at each iteration. OLANSGD minimizes the primal form using Nesterov’s smoothing, which has recently been extended to the stochastic setting.

However, none of these methods analyzes the loss function or uses the correlations between labels and features. Some works have been developed to address this issue. For example, [164] presents a novel cost-sensitive dynamic principal projection (CS-DPP) method for online MLC. Inspired by matrix stochastic gradient, the authors develop an efficient online dimension reducer and provide a theoretical guarantee for their carefully designed online regression learner. Moreover, the cost information is embedded into label weights to achieve cost-sensitivity with theoretical guarantees. However, CS-DPP cannot capture the joint information between features and labels. To capture such joint information, [43] proposes a novel online metric learning paradigm for MLC. The authors first project features and labels into the same embedding space, and then learn the distance metric by enforcing the constraint that the distance between an embedded instance and its correct label must be smaller than the distance between the embedded instance and any other label. Moreover, an efficient optimization algorithm is presented for online MLC, and an upper bound on the cumulative loss is derived. The experimental results show that the proposed algorithm outperforms the aforementioned baselines.
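
The distance constraint described for [43] can be illustrated with a simplified online update. This sketch is not the algorithm of [43]: it assumes a linear map `W` from features into the label space (labels are embedded as themselves), a squared Euclidean distance, and a plain hinge-loss gradient step. The function names, learning rate and margin are hypothetical.

```python
import numpy as np

def hinge_margin_loss(W, x, y_true, y_other, margin=1.0):
    """max(0, margin + ||Wx - y_true||^2 - ||Wx - y_other||^2)."""
    z = W @ x
    return max(0.0, margin + np.sum((z - y_true) ** 2) - np.sum((z - y_other) ** 2))

def online_step(W, x, y_true, y_other, lr=0.1, margin=1.0):
    """One gradient step on the hinge loss; W is unchanged when the margin holds."""
    if hinge_margin_loss(W, x, y_true, y_other, margin) > 0.0:
        # Gradient of the active hinge w.r.t. W is 2 * outer(y_other - y_true, x).
        W = W - lr * 2.0 * np.outer(y_other - y_true, x)
    return W

rng = np.random.default_rng(0)
x = rng.normal(size=5)                    # one incoming instance
y_true = np.array([1.0, 0.0, 1.0])        # its correct label vector
y_other = np.array([0.0, 1.0, 1.0])       # a competing label vector
W = np.zeros((3, 5))
before = hinge_margin_loss(W, x, y_true, y_other)
W = online_step(W, x, y_true, y_other)
after = hinge_margin_loss(W, x, y_true, y_other)
```

Each violated constraint pulls the embedded instance towards its correct label vector and away from the competing one, so the loss on that triple strictly decreases after the step.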

Recently, some works [165, 166] study the online semi-supervised MLC (SS-MLC) problem, where some data examples are unlabeled. [165] proposes a growing neural gas-based method, which constructs a dynamic graph from incoming data. OnSeML [166] adopts a label embedding approach in which a regression model is learned to fit the latent label vectors. To incorporate the unlabeled data, it extends the regularized moving least-squares model [167] with a local smoothness regularizer. It is noteworthy that online learning from weakly-supervised data has long been a difficult problem, since the global data structure is not available [168]. Hence, it is valuable to develop online MLC classifiers that work with limited supervision.

Remark. Online multi-label learning opens a new way to address large-scale MLC problems with limited memory. Unfortunately, the models, algorithms and theoretical results obtained so far are very limited. More effort is needed to explore this direction.

6 Statistical Multi-label Learning

The generalization error of multi-label learning has been analyzed in many papers. For example, [11] formulates MLC as the problem of learning a low-rank linear model in the standard empirical risk minimization (ERM) framework, which can accommodate a variety of loss functions and regularizers. The authors analyze generalization error bounds for low-rank-promoting trace norm regularization. Other statistical works focus on the consistency of multi-label learning, i.e., whether the expected loss of a learned classifier converges to the Bayes loss as the training set size increases. For example, [169] studies two well-known multi-label loss functions: ranking loss and hamming loss. The authors provide a necessary and sufficient condition for the consistency of multi-label learning based on surrogate loss functions. For hamming loss, they propose a surrogate loss function that is consistent in the deterministic case. However, they find that no convex surrogate loss defined on label pairs is consistent with the ranking loss. [170] transforms MLC into a bipartite ranking problem and proposes a simple univariate convex surrogate loss (exponential or logistic) defined on single labels, which is consistent with the ranking loss and comes with explicit regret bounds and convergence rates. Recently, [10] shows that the pick-one-label reduction cannot achieve zero regret with respect to Precision@$k$, whereas the PLT model achieves zero regret (i.e., it is consistent) in terms of marginal probability estimation and Precision@$k$ in the multi-label setting. Inspired by [10], [61] further studies the consistency of the one-versus-all, pick-all-labels, normalised one-versus-all and normalised pick-all-labels reduction methods based on a different metric, Recall@$k$. All these works study the generalization error and consistency of approaches that address multi-label learning by decomposing it into a set of binary classification problems. However, the existing theory of generalization error and consistency does not consider label correlations, and more effort is needed in this direction.
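
For readers less familiar with the losses and metrics named above, the following sketch computes hamming loss, ranking loss and Precision@$k$ on a toy example. The function names and the tie-handling convention (a tie counts as half an error) are choices made for illustration; they follow common definitions rather than any specific paper cited here.

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of instance-label pairs that are predicted incorrectly."""
    return float(np.mean(Y_true != Y_pred))

def ranking_loss(y_true, scores):
    """Fraction of (relevant, irrelevant) label pairs ordered incorrectly; ties count 1/2."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    if pos.size == 0 or neg.size == 0:
        return 0.0
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff < 0) + 0.5 * (diff == 0)))

def precision_at_k(y_true, scores, k):
    """Fraction of the k highest-scored labels that are relevant."""
    top_k = np.argsort(-scores)[:k]
    return float(np.mean(y_true[top_k]))

y_true = np.array([1, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.3, 0.1])
y_pred = (scores >= 0.5).astype(int)

print(hamming_loss(y_true, y_pred))       # 0.5
print(ranking_loss(y_true, scores))       # 0.25
print(precision_at_k(y_true, scores, 2))  # 0.5
```

Hamming loss depends only on thresholded predictions, whereas ranking loss and Precision@$k$ depend on the ordering of the scores, which is why their consistency has to be analyzed separately.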

As mentioned above, several XMLC methods, such as PD-Sparse [59] and SLEEC [1], use $\ell_{1}$ regularization to exploit the sparsity of the data. However, $\ell_{1}$ regularization suffers from two major limitations: 1) [171, 172, 173] show that $\ell_{1}$ regularization introduces a bias into the resulting estimator and harms the estimation accuracy; 2) [174] proves that the oracle property does not hold for $\ell_{1}$ regularization. To address these issues, [175] proposes a unified framework for SLEEC with nonconvex penalties, such as the minimax concave penalty (MCP) [173] and the smoothly clipped absolute deviation (SCAD) penalty [174], which have recently attracted much attention because they can eliminate the estimation bias and attain attractive statistical properties. Theoretically, the authors show that the proposed estimator enjoys the oracle property, i.e., it performs as well as if the underlying model were known beforehand, and attains a desirable statistical convergence rate of $\mathcal{O}(\frac{\sigma\sqrt{\varpi}+\sqrt{s^{*}}}{\mu\sqrt{n}})$, where $\sigma,\varpi,\mu$ are positive constants, $n$ is the sample size and $s^{*}$ denotes the cardinality of the true support of the underlying model. When the magnitude of the entries in the underlying model is taken into account, a refined convergence rate of $\mathcal{O}(\frac{\sqrt{s^{*}}}{\mu\sqrt{n}})$ can be achieved under suitable conditions. This work could inspire the community to bring more powerful statistical penalty methods and theory into MLC.
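
The contrast between $\ell_{1}$ and the nonconvex penalties can be seen by evaluating them on a few coefficient values. The sketch below uses the standard textbook forms of MCP and SCAD; the parameter defaults (`gamma=3.0`, `a=3.7`) are conventional choices and are not taken from [175].

```python
import numpy as np

def l1_penalty(t, lam):
    return lam * np.abs(t)

def mcp_penalty(t, lam, gamma=3.0):
    """Minimax concave penalty (gamma > 1): flat once |t| exceeds gamma * lam."""
    a = np.abs(t)
    return np.where(a <= gamma * lam,
                    lam * a - a ** 2 / (2.0 * gamma),
                    0.5 * gamma * lam ** 2)

def scad_penalty(t, lam, a=3.7):
    """Smoothly clipped absolute deviation penalty (a > 2): flat once |t| exceeds a * lam."""
    x = np.abs(t)
    mid = (2.0 * a * lam * x - x ** 2 - lam ** 2) / (2.0 * (a - 1.0))
    return np.where(x <= lam, lam * x,
                    np.where(x <= a * lam, mid, 0.5 * (a + 1.0) * lam ** 2))

t = np.array([0.0, 0.5, 1.0, 5.0, 50.0])
lam = 1.0
l1_vals = l1_penalty(t, lam)
mcp_vals = mcp_penalty(t, lam)
scad_vals = scad_penalty(t, lam)
```

The $\ell_{1}$ penalty keeps growing with the coefficient magnitude, which shrinks large coefficients and causes the bias discussed above, while MCP and SCAD become constant for large coefficients and therefore leave them essentially unpenalized.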

Remark. A key challenge in MLC is to model the interdependencies between labels and features. Existing methods, such as classifier chain, CCA and CPLST, attempt to model the correlations between labels and features. However, the statistical properties of these dependency models are less explored, and their theoretical analysis is an important topic for future research. The copula is an influential statistical tool for modeling the dependence of multivariate data, and was first brought into MLC for modeling label and feature dependencies by [176]. In particular, [176] first constructs a continuous distribution in the output space by employing the kernel trick, and then develops an unbiased and consistent estimator. The authors also present an asymptotic analysis and the mean squared error of the estimator. However, the main limitation of this work is that it cannot handle high-dimensional problems. The use of copulas for modeling label and feature dependencies reveals new statistical insights into multi-label learning, and could motivate more work on the high-dimensional setting in this direction.

7 New Applications

During the past decade, multi-label classification has been successfully applied in various applications, such as protein function classification, music categorization and semantic scene classification. Recently, some new applications in computer vision (CV), natural language processing (NLP) and data mining (DM) are emerging, which are summarized in the Supplementary Materials. This section will briefly review some of them.

7.1 Computer Vision

7.1.1 Video Annotation

With the rapid growth of videos on the Internet (e.g., on YouTube, Flickr and Facebook), efficient and effective indexing and searching of video corpora has become increasingly important for the research and industry communities. In many real-world video corpora, the videos are multi-labeled. For instance, most of the videos in the popular TRECVID data set [177] are annotated with more than one label from a set of 39 different concepts. Semantic-level video annotation (i.e., semantic video concept detection) has been an important research topic in the multimedia research community; it aims to tag videos with a set of concepts of interest, including scenes (e.g., garden, sky, tree), objects (e.g., animals, people, airplane, car), events (e.g., election, ceremony) and certain named entities (e.g., university, person, home). [178] attempts to capture the correlations between different labels to improve annotation performance on video concepts. [179] proposes a novel online multi-label learning method for large-scale video annotation.

7.1.2 Facial Action Unit Recognition

Thoughts and feelings are revealed in the face: facial muscle movements convey a person’s social behavior, psychopathology and internal states. Facial Action Unit (AU) recognition plays an important role in describing comprehensive facial expressions, and has been successfully applied in mental state analysis. Some works [180] have provided evidence that the occurrences of AUs are strongly correlated and that the sample distribution of AUs is unbalanced. Given these properties, multi-label learning methods are well matched to this learning scenario. For example, [181] introduces joint-patch and multi-label learning (JPML) to leverage group sparsity by selecting a sparse subset of facial patches while learning a multi-label classifier. [182] presents deep region and multi-label learning (DRML) for AU detection. Recently, [183] proposes a semi-supervised multi-label approach for AU recognition that utilizes a large number of web face images without AU labels.

7.1.3 Neonatal Brains

Effective and consistent segmentation of brain white matter bundles at the neonatal stage plays a vital role in detecting white matter abnormalities and understanding brain development for the prediction of psychiatric disorders. Because of the complexity of white matter anatomy and the limited spatial resolution of diffusion-weighted MR imaging, multiple fiber bundles can pass through one voxel. [184] aims to assign one or multiple anatomical labels of white matter bundles to each voxel to reflect the complex white matter anatomy of the neonatal brain. To achieve this goal, [184] explores a supervised multi-label learning algorithm in Riemannian diffusion tensor spaces, which treats diffusion tensors as lying on the Log-Euclidean Riemannian manifold of symmetric positive definite (SPD) matrices and uses the corresponding vector space as the feature space. [184] demonstrates that the method can automatically learn the number of white matter bundles at a location and provide anatomical annotation of the neonatal white matter. Recently, [185] and [186] present weakly-supervised multi-label learning methods for neonatal brain extraction.

7.2 Natural Language Processing

7.2.1 Mobile Applications

Recently, the development of mobile applications has become one of the most important topics in communications [187]. Under this field, advanced high performance algorithms for mobile applications have attracted the attention of researchers. Recommendation systems are widely used to predict the “rating” or “preference” that a user would give to an item. A good recommendation system with high performance is able to attract users to the service for 5G applications. [188] focuses on high performance multi-label classification methods and their applications for medical recommendations in the domain of 5G communication. [189] develops a deep convolutional neural network for iris segmentation of noisy images acquired by mobile devices. A novel multi-label active learning method is proposed by [190] for mobile reviews classification tasks. Mobile applications involve language understandings, we group it to NLP.

7.2.2 Legal Text Mining

MLC has been widely used in the legal domain, especially for legal text mining tasks. In 2008, Mencía and Fürnkranz [191] collected the EUR-Lex data set, which comprises documents about European Union law, including treaties, legislation, case law and legislative proposals. The documents are categorized into several orthogonal concepts according to the European Vocabulary (EUROVOC), to allow for multiple search facilities. Interest in this area has recently been renewed. In [192], a new legal MLC data set, dubbed EURLEX-57K, is released. It is a larger-scale version of the EUR-Lex data set (19.6k documents, 4k EUROVOC labels) and contains 57k EU legislative documents from the EUR-Lex portal, each labeled with concepts drawn from a set of about 4.3k EUROVOC concepts. In the Chinese AI and Law challenge [193], MLC is also applied to the legal judgment prediction (LJP) task, which aims to empower the machine to predict the judgment results of legal cases after reading fact descriptions. Since each criminal case can be relevant to multiple law articles, charges and prison terms, the LJP task can be regarded as a multi-label text classification problem. XMLC [192] and DL MLC [193] methods have been proposed to address these tasks. Because it relies on syntactic and grammatical features, legal text mining is categorized as NLP.

7.3 Data Mining

7.3.1 Recommender Systems

Recommendation can be naturally regarded as an MLC task, since we usually recommend multiple items to a user simultaneously. For example, [194] develops an MLC model to automatically recommend bid phrases to an advertiser from a given ad landing page; [195] approaches the item-to-item recommendation task on Amazon, which aims at predicting the subset of items (labels) that a user might buy along with a given item. A recent work [145] treats keyword recommendation as an XMLC task that provides keyword suggestions for advertisers to create campaigns. The MLC model receives product-query customer purchase records and then suggests queries that are relevant to any given product by utilizing product information such as title, description and brand. The applications of XMLC in recommendation have been widely studied in the literature [194, 195, 145].

7.3.2 User Profiling

In many applications, such as social media and e-commerce, it is essential to provide adaptive and personalized services to users. Therefore, user profiling, which infers user characteristics and personal interests from user-generated data, has been widely adopted by many online platforms. Some works regard this problem as a single-label learning task. However, richer user characteristics lead to better personalization, and the correlations between different user profiles can help improve the quality of user profiling. Hence, some works try to infer multiple attributes simultaneously. For example, Farnadi et al. [196] propose a hybrid deep learning framework to infer multiple types of user profiles from multiple modalities of user data. Their experiments on 5K Facebook users also validate the superiority of multi-label learning over single-label learning. [197] explores user profiles on Weibo, a well-known social network platform in China, by using graph information in social networks. Another example is fraud detection on e-commerce platforms [94], since fraudulent users usually exhibit several different spam behaviors simultaneously. [94] presents a collaboration-based multi-label propagation method to utilize the correlations among different fraud behaviors.

8 Conclusion

Multi-label classification has attracted significant attention from the community over the last decade. This paper provides a comprehensive review of the emerging topics of multi-label learning, which include extreme multi-label classification, multi-label learning with limited supervision, deep learning for multi-label learning, online multi-label learning, statistical multi-label learning and new applications. We provide an overview of the representative works referenced throughout. In addition, we emphasize the challenges of these emerging topics, along with future research directions and promising extensions that are worthy of further study.

Appendix A Evaluation Metrics and Notations and New Applications

TABLE II: Important notations used in the main paper.

Notations	Explanations
$x_{i},y_{i}$	Input and output vectors
$X,Y$	Input and output matrices
$\tilde{\mathcal{D}}$	Transformed data set
$\hat{Y},\tilde{Y}$	Label matrices of implicit and explicit missing labels
$Z=[z_{1},\ldots,z_{n}]$	Embedding matrix and vectors
$\dot{y}$	Predicted score vector
$\breve{Y},\breve{y}$	Predicted logical label matrix and vector
$\Upsilon(y_{i})$	The indices of the positive labels of $y_{i}$
$\Lambda=[\lambda_{ij}]$	Enriched real-value label representation
$S$	The candidate label set in PML
$\Omega$	The index set of neighbors
$\mathcal{N}_{i}$	The index set of neighbors of the $i$-th instance
$D_{l},D_{o},D_{u}$	The index sets of labeled, incompletely-labeled and unlabeled data
$\bigtriangleup$	The symmetric difference between two sets
$|\cdot|$	The set cardinality
$\langle\cdot,\cdot\rangle$	Inner product
$O(\cdot)$	Computational complexity
$\cdot^{T}$	Matrix transpose
$\sigma(\cdot)$	Sigmoid function
$\text{nnz}(\cdot)$	The number of non-zero entries
$\|\cdot\|_{F},\|\cdot\|_{2},\|\cdot\|_{1}$	Frobenius norm, $\ell_{2}$ and $\ell_{1}$ norm of a matrix (vector)
$\text{Tr}(\cdot)$	Trace operator
$r(\cdot)$	The regularizer function
$\mathcal{L}(\cdot)$	Empirical risk function
$n$	The number of training data
$d,L$	Feature dimensions and the number of labels
$\varpi$	Dimension of embedding vectors
$\mathbb{R}$	Set of real numbers
$s^{*}$	The cardinality of the true support of the underlying model
$\alpha,\mu,\lambda,C$	Trade-off hyperparameters
$I$	Identity matrix
$A,B$	Side information matrices w.r.t. input and output
$W,U,V,H$	Projection or similarity matrix
$F_{e},F_{x},F_{d}$	Label encoding, feature encoding and decoding network of C2AE
$\mathcal{W}=[w_{ij}]_{n\times n}$	Graph weight matrix
$L_{x},L_{y}$	The Laplacian matrices of $\mathcal{W}^{x}$ and $\mathcal{W}^{y}$
$\Phi(F_{x},F_{e}),\Gamma(F_{e},F_{d})$	The losses of C2AE in the latent and output space

Assume $x_{i}\in\mathbb{R}^{d\times 1}$ is a real vector representing an input or instance (feature), and $y_{i}=(y_{i,1},\cdots,y_{i,L})\in\{0,1\}^{L\times 1}$ is the corresponding output or label vector $(i\in\{1,\ldots,n\})$. $n$, $d$ and $L$ denote the number of training data, feature dimensions and the number of labels, respectively. The input matrix is $X=[x_{1},\ldots,x_{n}]\in\mathbb{R}^{d\times n}$ and the output matrix is $Y=[y_{1},\ldots,y_{n}]\in\{0,1\}^{L\times n}$. MLC aims to learn a classifier that predicts the set of proper labels for a testing instance as accurately as possible. Let $\breve{Y}=[\breve{y}_{1},\ldots,\breve{y}_{n}]\in\{0,1\}^{L\times n}$ be the predicted label matrix. We first introduce some evaluation metrics for MLC.
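As a concrete illustration of this notation, the following minimal Python sketch builds a toy label matrix $Y\in\{0,1\}^{L\times n}$ from per-instance sets of positive labels and recovers $\Upsilon(y_{i})$ from it. The function names `labels_to_matrix` and `positive_indices`, the 0-based label indices and the toy data are illustrative assumptions and do not come from any of the referenced methods.

```python
def labels_to_matrix(label_sets, num_labels):
    """Build Y with L rows (labels) and n columns (instances).

    label_sets[i] is the set of positive label indices of instance i
    (0-based indexing is an assumption of this sketch).
    """
    n = len(label_sets)
    Y = [[0] * n for _ in range(num_labels)]
    for i, positives in enumerate(label_sets):
        for j in positives:
            Y[j][i] = 1
    return Y


def positive_indices(Y, i):
    """Return Upsilon(y_i): the indices of the positive labels in column i of Y."""
    return {j for j in range(len(Y)) if Y[j][i] == 1}


# Toy data: n = 3 instances, L = 4 labels; the third instance has no positive label.
label_sets = [{0, 2}, {1}, set()]
Y = labels_to_matrix(label_sets, num_labels=4)
```

Storing one column per instance mirrors the $L\times n$ layout used in the text; many libraries use the transposed $n\times L$ layout instead.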

TABLE III: The new applications of multi-label learning.

Reference	New Applications	Approaches	Evaluation Metrics
[178]	CV: automatic video annotation	XMLC [198], online MLC [179]	Precision@$k$, Recall@$k$ and Hamming loss
[133]	CV: action recognition and localization in videos	multi-instance MLC [133]	Hamming loss
[183]	CV: facial action unit recognition	DL MLC [182], semi-supervised MLC [183]	Hamming loss and Ranking loss
[199]	CV: visual object recognition	online MLC [199, 200]	Hamming loss
[119]	CV: visual mobile robot navigation	multi-instance MLC [119]	Hamming loss
[201]	CV: biomedical image segmentation	semi-supervised MLC [201]	Hamming loss
[184]	CV: neonatal brains	semi-supervised MLC [185, 186]	Hamming loss
[188]	NLP: 5G mobile medical recommendations	DL MLC [189], MLAL [190]	Hamming loss and Ranking loss
[202]	NLP: social network analysis	DL MLC [81]	Hamming loss and Ranking loss
[203]	NLP: high-speed streaming data	online MLC [203]	Hamming loss
[204]	NLP: web page categorization	DL MLC [204]	Hamming loss and Ranking loss
[205]	NLP: protein subcellular localization	XMLC [205]	Precision@$k$ and Recall@$k$
[205]	NLP: legal text mining	XMLC [192], DL MLC [193]	Precision@$k$, Recall@$k$ and Hamming loss
[196]	DM: recommender system	XMLC [194, 195, 145]	Precision@$k$, Recall@$k$, F-measure and Ranking loss
[196]	DM: user profiling in social media	DL MLC [196], semi-supervised MLC [206]	Hamming loss and Ranking loss
[94]	DM: e-commercial fraud user detection	semi-supervised MLC [94]	Hamming loss and Ranking loss

Hamming loss. Hamming loss is defined as follows:

\begin{split}1/n\sum_{i=1}^{n}|\Upsilon(y_{i})\bigtriangleup\Upsilon(\breve{y}_{i})|/L\end{split}

where $\Upsilon(y_{i})$ denotes the indices of the positive labels of $y_{i}$, $\bigtriangleup$ stands for the symmetric difference between two sets, and $|\cdot|$ denotes set cardinality. The Hamming loss evaluates the fraction of misclassified instance-label pairs.
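The Hamming loss above can be sketched in a few lines of Python, representing each $\Upsilon(y_{i})$ and $\Upsilon(\breve{y}_{i})$ as a set of label indices so that the symmetric difference is the set operator `^`. The function name `hamming_loss` and the toy data are illustrative assumptions.

```python
def hamming_loss(true_sets, pred_sets, num_labels):
    """Average over instances of |true XOR predicted| / L.

    true_sets[i] and pred_sets[i] are the sets of positive label indices
    of the ground truth and of the prediction for instance i.
    """
    n = len(true_sets)
    mismatches = 0
    for true_pos, pred_pos in zip(true_sets, pred_sets):
        # Symmetric difference: labels that are positive in exactly one of the two sets.
        mismatches += len(true_pos ^ pred_pos)
    return mismatches / (n * num_labels)


# Toy data: n = 2 instances, L = 4 labels.
true_sets = [{0, 2}, {1}]
pred_sets = [{0}, {1, 3}]
loss = hamming_loss(true_sets, pred_sets, num_labels=4)
```

In this toy case each instance has exactly one wrong instance-label pair (a missed label for the first, a spurious label for the second), so 2 of the 8 pairs are misclassified.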

Ranking loss. Let $f(x_{i},a)$ be a real-valued function that scores label $a$ for instance $x_{i}$. Ranking loss is defined as follows:

\begin{split}1/n\!\sum_{i=1}^{n}\frac{|\{(a,b)\!:\!f(x_{i},a)\!\leq\!f(x_{i},b),(a,b)\in\!\Upsilon(y_{i})\!\times\!\bar{y_{i}}\}|}{|\Upsilon(y_{i})||\bar{y_{i}}|}\end{split}

where $\bar{y_{i}}$ is the complementary set of $\Upsilon(y_{i})$ in the label space. The ranking loss evaluates the fraction of reversely ordered label pairs.
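A minimal Python sketch of the ranking loss follows, where `scores[i][a]` plays the role of $f(x_{i},a)$. The definition is undefined for an instance with no positive or no negative labels; letting such an instance contribute zero is an assumption of this sketch, as are the function name and the toy data.

```python
def ranking_loss(scores, true_sets):
    """Average fraction of (positive, negative) label pairs that are reversely ordered.

    scores[i][a] is the real-valued score of label a for instance i;
    true_sets[i] is the set of positive label indices of instance i.
    """
    n = len(scores)
    total = 0.0
    for s, positives in zip(scores, true_sets):
        negatives = set(range(len(s))) - set(positives)
        if not positives or not negatives:
            continue  # assumption: undefined cases contribute zero
        # A pair is reversed when the positive label does not score strictly higher.
        reversed_pairs = sum(1 for a in positives for b in negatives if s[a] <= s[b])
        total += reversed_pairs / (len(positives) * len(negatives))
    return total / n


# Toy data: n = 2 instances, L = 4 labels.
scores = [[0.9, 0.2, 0.6, 0.1],
          [0.3, 0.4, 0.8, 0.1]]
true_sets = [{0, 2}, {1}]
loss = ranking_loss(scores, true_sets)
```

Ties count as errors here because the definition uses $\leq$ rather than a strict inequality.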

F-measure.

\begin{split}\text{F-measure}=1/L\sum_{j=1}^{L}\frac{2\sum_{i=1}^{n}y_{i,j}\breve{y}_{i,j}}{\sum_{i=1}^{n}y_{i,j}+\sum_{i=1}^{n}\breve{y}_{i,j}}\end{split}

The F-measure computes, for each label, an F-1 score from the true positives, false positives and false negatives of that label, and then averages these scores over all $L$ labels (macro-averaging).
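The F-measure formula can be sketched in Python as follows. For readability this sketch stores one row per instance ($n\times L$), i.e. the transpose of the $L\times n$ layout used in the notation, and it scores a label with no true and no predicted positives as zero; both choices, the function name and the toy data are assumptions.

```python
def macro_f_measure(Y_true, Y_pred):
    """Label-wise F-1 scores averaged over labels.

    Y_true[i][j] and Y_pred[i][j] are the binary ground-truth and predicted
    values of label j for instance i.
    """
    num_labels = len(Y_true[0])
    total = 0.0
    for j in range(num_labels):
        # Numerator term: true positives of label j.
        tp = sum(y[j] * p[j] for y, p in zip(Y_true, Y_pred))
        # Denominator: number of true positives plus number of predicted positives.
        denom = sum(y[j] for y in Y_true) + sum(p[j] for p in Y_pred)
        total += 2 * tp / denom if denom else 0.0  # assumption: 0/0 is scored as 0
    return total / num_labels


# Toy data: n = 3 instances, L = 3 labels.
Y_true = [[1, 0, 1],
          [0, 1, 1],
          [1, 1, 0]]
Y_pred = [[1, 0, 0],
          [0, 1, 1],
          [0, 1, 0]]
f = macro_f_measure(Y_true, Y_pred)
```

The three per-label scores in this toy case are 2/3, 1 and 2/3, which average to 7/9.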

Precision@$k$.

\begin{split}\text{Precision@}k=1/k\sum_{l\in\text{rank}_{k}(\dot{y})}y_{l}\end{split}

where $\dot{y}\in\mathbb{R}^{L\times 1}$ is a predicted score vector, $y$ is a ground truth label vector and $\text{rank}_{k}(\dot{y})$ returns the indices of the $k$ largest entries of $\dot{y}$, ranked in descending order.
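A minimal Python sketch of Precision@$k$ for a single instance is given below. The helper `top_k_indices` stands in for $\text{rank}_{k}(\dot{y})$; the function names, the tie-breaking by original index order and the toy data are illustrative assumptions.

```python
def top_k_indices(scores, k):
    """Indices of the k largest scores, in descending order of score."""
    # sorted() is stable, so tied scores keep their original index order.
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]


def precision_at_k(scores, y, k):
    """Fraction of the k highest-scored labels that are truly positive."""
    return sum(y[j] for j in top_k_indices(scores, k)) / k


# Toy data: L = 5 labels, two of which are positive.
scores = [0.1, 0.7, 0.3, 0.9, 0.5]
y = [0, 1, 0, 0, 1]
top3 = top_k_indices(scores, 3)
p_at_3 = precision_at_k(scores, y, 3)
```

In practice the value is averaged over all test instances; the sketch shows the per-instance quantity only.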

Recall@$k$.

\begin{split}\text{Recall@}k=1/|\Upsilon(y)|\sum_{l\in\text{rank}_{k}(\dot{y})}y_{l}\end{split}
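Recall@$k$ differs from Precision@$k$ only in its normalizer, as the following Python sketch for a single instance shows. Returning zero when the instance has no positive label (where the definition divides by zero) is an assumption of this sketch, as are the function name and the toy data.

```python
def recall_at_k(scores, y, k):
    """Fraction of the truly positive labels that appear among the k highest-scored labels."""
    # Indices of the k largest scores, in descending order of score.
    top = sorted(range(len(scores)), key=lambda j: -scores[j])[:k]
    num_positive = sum(y)  # |Upsilon(y)| for a binary label vector
    if num_positive == 0:
        return 0.0  # assumption: undefined case is scored as 0
    return sum(y[j] for j in top) / num_positive


# Toy data: L = 5 labels, two of which are positive.
scores = [0.1, 0.7, 0.3, 0.9, 0.5]
y = [0, 1, 0, 0, 1]
r_at_2 = recall_at_k(scores, y, 2)
r_at_3 = recall_at_k(scores, y, 3)
```

With these scores the top two labels contain one of the two positives and the top three contain both, so recall grows with $k$ while precision need not.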

Precision@$k$ and Recall@$k$ evaluate top-$k$ precision and recall over labels, respectively, and both are standard measures for XMLC. F-measure and Ranking loss are usually used in recommender systems. Some CV and NLP applications, such as facial action unit recognition and web page categorization, usually use the Hamming loss and Ranking loss as performance metrics. The important notations and new applications in the main paper are summarized in Tables II and III, respectively.

References

[1] K. Bhatia, H. Jain, P. Kar, M. Varma, and P. Jain, “Sparse local embeddings for extreme multi-label classification,” in NeurIPS, 2015, pp. 730–738.
[2] R. Babbar and B. Schölkopf, “Dismec: Distributed sparse machines for extreme multi-label classification,” in WSDM, 2017, pp. 721–729.
[3] I. E. Yen, X. Huang, W. Dai, P. Ravikumar, I. S. Dhillon, and E. P. Xing, “Ppdsparse: A parallel primal-dual sparse method for extreme classification,” in KDD, 2017, pp. 545–553.
[4] Y. Prabhu, A. Kag, S. Harsola, R. Agrawal, and M. Varma, “Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising,” in WWW, 2018, pp. 993–1002.
[5] H. Jain, V. Balasubramanian, B. Chunduri, and M. Varma, “Slice: Scalable linear extreme classifiers trained on 100 million labels for related searches,” in WSDM, 2019, pp. 528–536.
[6] Y. Prabhu and M. Varma, “Fastxml: a fast, accurate and stable tree-classifier for extreme multi-label learning,” in KDD, 2014, pp. 263–272.
[7] H. Jain, Y. Prabhu, and M. Varma, “Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications,” in KDD, 2016, pp. 935–944.
[8] K. Jasinska, K. Dembczynski, R. Busa-Fekete, K. Pfannschmidt, T. Klerx, and E. Hüllermeier, “Extreme f-measure maximization using sparse probability estimates,” in ICML, 2016, pp. 1435–1444.
[9] Y. Prabhu, A. Kag, S. Gopinath, K. Dahiya, S. Harsola, R. Agrawal, and M. Varma, “Extreme multi-label learning with label features for warm-start tagging, ranking & recommendation,” in WSDM, 2018, pp. 441–449.
[10] M. Wydmuch, K. Jasinska, M. Kuznetsov, R. Busa-Fekete, and K. Dembczynski, “A no-regret generalization of hierarchical softmax to extreme multi-label classification,” in NeurIPS, 2018, pp. 6358–6368.
[11] H. Yu, P. Jain, P. Kar, and I. S. Dhillon, “Large-scale multi-label learning with missing labels,” in ICML, 2014, pp. 593–601.
[12] Y. Tagami, “Annexml: Approximate nearest neighbor search for extreme multi-label classification,” in KDD, 2017, pp. 455–464.
[13] W. Liu, D. Xu, I. W. Tsang, and W. Zhang, “Metric learning for multi-output tasks,” TPAMI, vol. 41, no. 2, pp. 408–422, 2019.
[14] X. Gong, D. Yuan, and W. Bao, “Fast multi-label learning,” in IJCAI, 2021, pp. 2432–2438.
[15] Y. Sun, Y. Zhang, and Z. Zhou, “Multi-label learning with weak label,” in AAAI, 2010.
[16] G. Chen, Y. Song, F. Wang, and C. Zhang, “Semi-supervised multi-label learning by solving a sylvester equation,” in SDM, 2008, pp. 410–419.
[17] M. Xie and S. Huang, “Partial multi-label learning,” in AAAI, 2018, pp. 4302–4309.
[18] H. Wang, W. Liu, Y. Zhao, C. Zhang, T. Hu, and G. Chen, “Discriminative and correlative partial multi-label learning,” in IJCAI, 2019, pp. 3691–3697.
[19] D. Huynh and E. Elhamifar, “Interactive multi-label cnn learning with partial labels,” in CVPR, 2020, pp. 9423–9432.
[20] C. Xu, D. Tao, and C. Xu, “Robust extreme multi-label learning,” in KDD, 2016, pp. 1275–1284.
[21] L. Sun, S. Feng, T. Wang, C. Lang, and Y. Jin, “Partial multi-label learning by low-rank and sparse decomposition,” in AAAI, 2019, pp. 5016–5023.
[22] V. Jain, N. Modhe, and P. Rai, “Scalable generative models for multi-label learning with missing labels,” in ICML, 2017, pp. 1636–1644.
[23] H. Chu, C. Yeh, and Y. F. Wang, “Deep generative models for weakly-supervised multi-label classification,” in ECCV, 2018, pp. 409–425.
[24] M. Hu, H. Han, S. Shan, and X. Chen, “Weakly supervised image classification through noise regularization,” in CVPR, 2019, pp. 11 517–11 525.
[25] Z. Ji, B. Cui, H. Li, Y.-G. Jiang, T. Xiang, T. Hospedales, and Y. Fu, “Deep ranking for image zero-shot multi-label classification,” TIP, 2020.
[26] W. Shi and Q. Yu, “Fast direct search in an optimally compressed continuous target space for efficient multi-label active learning,” in ICML, 2019, pp. 5769–5778.
[27] A. Kuznetsova, H. Rom, N. Alldrin, J. R. R. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, and V. Ferrari, “The open images dataset V4: unified image classification, object detection, and visual relationship detection at scale,” CoRR, vol. abs/1811.00982, 2018.
[28] B. Wu, W. Chen, Y. Fan, Y. Zhang, J. Hou, J. Liu, and T. Zhang, “Tencent ml-images: A large-scale multi-label image database for visual representation learning,” IEEE Access, vol. 7, pp. 172 683–172 693, 2019.
[29] M.-L. Zhang and Z.-H. Zhou, “Multilabel neural networks with applications to functional genomics and text categorization,” TKDE, vol. 18, no. 10, pp. 1338–1351, 2006.
[30] C. Yeh, W. Wu, W. Ko, and Y. F. Wang, “Learning deep latent space for multi-label classification,” in AAAI, 2017, pp. 2838–2844.
[31] J. Liu, W. Chang, Y. Wu, and Y. Yang, “Deep learning for extreme multi-label text classification,” in SIGIR, 2017, pp. 115–124.
[32] R. You, Z. Zhang, Z. Wang, S. Dai, H. Mamitsuka, and S. Zhu, “Attentionxml: Label tree-based attention-aware deep model for high-performance extreme multi-label text classification,” in NeurIPS, 2019, pp. 5812–5822.
[33] B. Wang, L. Chen, W. Sun, K. Qin, K. Li, and H. Zhou, “Ranking-based autoencoder for extreme multi-label classification,” in NAACL-HLT, 2019, pp. 2820–2830.
[34] T. Durand, N. Mehrasa, and G. Mori, “Learning a deep convnet for multi-label classification with partial labels,” in CVPR, 2019, pp. 647–657.
[35] Z. Wang, L. Liu, and D. Tao, “Deep streaming label learning,” in ICML, 2020.
[36] C. Lee, W. Fang, C. Yeh, and Y. F. Wang, “Multi-label zero-shot learning with structured knowledge graphs,” in CVPR, 2018, pp. 1576–1585.
[37] M. Cissé, M. Al-Shedivat, and S. Bengio, “ADIOS: architectures deep in output space,” in ICML, 2016, pp. 2770–2779.
[38] J. Nam, E. L. Mencía, H. J. Kim, and J. Fürnkranz, “Maximizing subset accuracy with recurrent neural networks in multi-label classification,” in NeurIPS, 2017, pp. 5413–5423.
[39] J. Nam, Y. Kim, E. L. Mencía, S. Park, R. Sarikaya, and J. Fürnkranz, “Learning context-dependent label permutations for multi-label classification,” in ICML, 2019, pp. 4733–4742.
[40] L. Yang, X. Wu, Y. Jiang, and Z. Zhou, “Multi-label learning with deep forest,” CoRR, vol. abs/1911.06557, 2019.
[41] S. Park and S. Choi, “Online multi-label learning with accelerated nonsmooth stochastic gradient descent,” in ICASSP, 2013, pp. 3322–3326.
[42] R. Venkatesan, M. J. Er, M. Dave, M. Pratama, and S. Wu, “A novel online multi-label classifier for high-speed streaming data applications,” Evolving Systems, vol. 8, no. 4, pp. 303–315, 2017.
[43] X. Gong, D. Yuan, and W. Bao, “Online metric learning for multi-label classification,” in AAAI, 2020, pp. 4012–4019.
[44] M.-L. Zhang and Z.-H. Zhou, “ML-KNN: A lazy learning approach to multi-label learning,” PR, vol. 40, no. 7, pp. 2038–2048, 2007.
[45] W. Liu, I. W. Tsang, and K.-R. Müller, “An easy-to-hard learning paradigm for multiple classes and multiple labels,” JMLR, vol. 18, no. 94, pp. 1–38, 2017.
[46] J. Read, B. Pfahringer, G. Holmes, and E. Frank, “Classifier chains for multi-label classification,” in ECML/PKDD, 2009, pp. 254–269.
[47] Y. Zhang and J. G. Schneider, “Multi-label output codes using canonical correlation analysis,” in AISTATS, 2011, pp. 873–882.
[48] Y.-N. Chen and H.-T. Lin, “Feature-aware label space dimension reduction for multi-label classification,” in NeurIPS, 2012, pp. 1538–1546.
[49] D. Hsu, S. Kakade, J. Langford, and T. Zhang, “Multi-label prediction via compressed sensing,” in NeurIPS, 2009, pp. 772–780.
[50] M. Cissé, N. Usunier, T. Artières, and P. Gallinari, “Robust bloom filters for large multilabel classification tasks,” in NeurIPS, 2013, pp. 1851–1859.
[51] A. Jalan and P. Kar, “Accelerating extreme classification via adaptive feature agglomeration,” in IJCAI, 2019, pp. 2600–2606.
[52] V. Gupta, R. Wadbude, N. Natarajan, H. Karnick, P. Jain, and P. Rai, “Distributional semantics meets multi-label learning,” in AAAI, 2019, pp. 3747–3754.
[53] B. H. Bloom, “Space/time trade-offs in hash coding with allowable errors,” Communications of the ACM, vol. 13, no. 7, pp. 422–426, 1970.
[54] S. Ubaru and A. Mazumdar, “Multilabel classification with group testing and codes,” in ICML, 2017, pp. 3492–3501.
[55] W. Liu and I. W. Tsang, “Making decision trees feasible in ultrahigh feature and label dimensions,” JMLR, vol. 18, no. 81, pp. 1–36, 2017.
[56] X. Shen, W. Liu, I. W. Tsang, Q. Sun, and Y. Ong, “Multilabel prediction via cross-view search,” TNNLS, vol. 29, no. 9, pp. 4324–4338, 2018.
[57] K. Wang, M. Yang, W. Yang, and Y. Yin, “Deep correlation structure preserved label space embedding for multi-label classification,” in ACML, 2018, pp. 1–16.
[58] S. Si, H. Zhang, S. S. Keerthi, D. Mahajan, I. S. Dhillon, and C. Hsieh, “Gradient boosted decision trees for high dimensional sparse output,” in ICML, 2017, pp. 3182–3190.
[59] I. E. Yen, X. Huang, P. Ravikumar, K. Zhong, and I. S. Dhillon, “Pd-sparse : A primal and dual sparse approach to extreme multiclass and multilabel classification,” in ICML, 2016, pp. 3069–3077.
[60] A. Niculescu-Mizil and E. Abbasnejad, “Label filters for large scale multilabel classification,” in AISTATS, 2017, pp. 1448–1457.
[61] A. K. Menon, A. S. Rawat, S. J. Reddi, and S. Kumar, “Multilabel reductions: what is my loss optimising?” in NeurIPS, 2019, pp. 10 599–10 610.
[62] W. Siblini, F. Meyer, and P. Kuntz, “Craftml, an efficient clustering-based random forest for extreme multi-label learning,” in ICML, 2018, pp. 4671–4680.
[63] R. Babbar and B. Schölkopf, “Data scarcity, robustness and extreme multi-label classification,” Machine Learning, vol. 108, no. 8-9, pp. 1329–1351, 2019.
[64] S. Khandagale, H. Xiao, and R. Babbar, “Bonsai - diverse and shallow trees for extreme multi-label classification,” CoRR, vol. abs/1904.08249, 2019.
[65] B. Wu, Z. Liu, S. Wang, B. Hu, and Q. Ji, “Multi-label learning with missing labels,” in ICPR, 2014, pp. 1964–1968.
[66] M. Xu, R. Jin, and Z. Zhou, “Speedup matrix completion with side information: Application to multi-label learning,” in NeurIPS, 2013, pp. 2301–2309.
[67] Y. Han, G. Sun, Y. Shen, and X. Zhang, “Multi-label learning with highly incomplete data via collaborative embedding,” in KDD, 2018, pp. 1494–1503.
[68] H. Yang, J. T. Zhou, and J. Cai, “Improving multi-label learning with missing labels by structured semantic correlations,” in ECCV, 2016, pp. 835–851.
[69] H. Yu, H. Huang, I. S. Dhillon, and C. Lin, “A unified algorithm for one-class structured matrix factorization with side information,” in AAAI, 2017, pp. 2845–2851.
[70] B. Wu, F. Jia, W. Liu, B. Ghanem, and S. Lyu, “Multi-label learning with missing labels using mixed dependency graphs,” IJCV, vol. 126, no. 8, pp. 875–896, 2018.
[71] M. Xu, G. Niu, B. Han, I. W. Tsang, Z. Zhou, and M. Sugiyama, “Matrix co-completion for multi-label classification with missing features and labels,” CoRR, vol. abs/1805.09156, 2018.
[72] L. Xu, Z. Wang, Z. Shen, Y. Wang, and E. Chen, “Learning low-rank label correlations for multi-label classification with missing labels,” in ICDM, 2014, pp. 1067–1072.
[73] J. Ma, Z. Tian, H. Zhang, and T. W. S. Chow, “Multi-label low-dimensional embedding with missing labels,” KBS, vol. 137, pp. 65–82, 2017.
[74] K. Wang, “Robust embedding framework with dynamic hypergraph fusion for multi-label classification,” in ICME, 2019, pp. 982–987.
[75] B. Wu, S. Lyu, B. Hu, and Q. Ji, “Multi-label learning with missing labels for image annotation and facial action unit recognition,” PR, vol. 48, no. 7, pp. 2279–2289, 2015.
[76] Y. Liu, K. Wen, Q. Gao, X. Gao, and F. Nie, “SVM based multi-label learning with missing labels for image annotation,” PR, vol. 78, pp. 307–317, 2018.
[77] B. Wu, S. Lyu, and B. Ghanem, “Constrained submodular minimization for missing labels and class imbalance in multi-label learning,” in AAAI, 2016, pp. 2229–2236.
[78] J. Huang, F. Qin, X. Zheng, Z. Cheng, Z. Yuan, and W. Zhang, “Learning label-specific features for multi-label classification with missing labels,” in Fourth IEEE International Conference on Multimedia Big Data, 2018, pp. 1–5.
[79] Y. Zhu, J. T. Kwok, and Z. Zhou, “Multi-label learning with global and local label correlation,” TKDE, vol. 30, no. 6, pp. 1081–1094, 2018.
[80] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and M. Sun, “Graph neural networks: A review of methods and applications,” CoRR, vol. abs/1812.08434, 2018.
[81] H. Dong, W. Wang, K. Huang, and F. Coenen, “Joint multi-label attention networks for social text annotation,” in NAACL-HLT, 2019, pp. 1348–1354.
[82] M. Chen, A. X. Zheng, and K. Q. Weinberger, “Fast image tagging,” in ICML, 2013, pp. 1274–1282.
[83] Q. Wang, B. Shen, S. Wang, L. Li, and L. Si, “Binary codes embedding for fast image tagging with incomplete labels,” in ECCV, 2014, pp. 425–439.
[84] Z. Qi, M. Yang, Z. M. Zhang, and Z. Zhang, “Mining partially annotated images,” in KDD, 2011, pp. 1199–1207.
[85] X. Li, F. Zhao, and Y. Guo, “Conditional restricted boltzmann machines for multi-label learning with incomplete labels,” in AISTATS, 2015.
[86] H. Zhao, P. Rai, L. Du, and W. L. Buntine, “Bayesian multi-label learning with sparse features and labels, and label co-occurrences,” in AISTATS, 2018, pp. 1943–1951.
[87] M. Zhou, L. Hannah, D. B. Dunson, and L. Carin, “Beta-negative binomial process and poisson factor analysis,” in AISTATS, 2012, pp. 1462–1471.
[88] S. S. Bucak, R. Jin, and A. K. Jain, “Multi-label learning with incomplete class assignments,” in CVPR, 2011, pp. 2801–2808.
[89] K. M. Ibrahim, E. V. Epure, G. Peeters, and G. Richard, “Confidence-based weighted loss for multi-label classification with missing labels,” in ICMR, 2020, pp. 291–295.
[90] R. Sen, A. Rakhlin, L. Ying, R. Kidambi, D. P. Foster, D. N. Hill, and I. S. Dhillon, “Top-k extreme contextual bandits with arm hierarchy,” CoRR, vol. abs/2102.07800, 2021.
[91] M. E. Khan, D. Nielsen, V. Tangkaratt, W. Lin, Y. Gal, and A. Srivastava, “Fast and scalable bayesian deep learning by weight-perturbation in adam,” in ICML, 2018, pp. 2616–2625.
[92] Y. Liu, R. Jin, and L. Yang, “Semi-supervised multi-label learning by constrained non-negative matrix factorization,” in AAAI, 2006, pp. 421–426.
[93] X. Kong, M. K. Ng, and Z. Zhou, “Transductive multilabel learning via label set propagation,” TKDE, vol. 25, no. 3, pp. 704–719, 2013.
[94] H. Wang, Z. Li, J. Huang, P. Hui, W. Liu, T. Hu, and G. Chen, “Collaboration based multi-label propagation for fraud detection,” in IJCAI, 2020.
[95] L. Sun, S. Feng, G. Lyu, and C. Lang, “Robust semi-supervised multi-label learning by triple low-rank regularization,” in PAKDD, 2019, pp. 269–280.
[96] L. Jing, L. Yang, J. Yu, and M. K. Ng, “Semi-supervised low-rank mapping learning for multi-label classification,” in CVPR, 2015, pp. 1483–1491.
[97] C. Gong, D. Tao, J. Yang, and W. Liu, “Teaching-to-learn and learning-to-teach for multi-label propagation,” in AAAI, 2016, pp. 1610–1616.
[98] L. Feng, B. An, and S. He, “Collaboration based multi-label learning,” in AAAI, 2019, pp. 3550–3557.
[99] W. Zhan and M. Zhang, “Inductive semi-supervised multi-label learning with co-training,” in KDD, 2017, pp. 1305–1314.
[100] Z. Chu, P. Li, and X. Hu, “Co-training based on semi-supervised ensemble classification approach for multi-label data stream,” in ICBK, 2019, pp. 58–65.
[101] L. Wang, Y. Liu, C. Qin, G. Sun, and Y. Fu, “Dual relation semi-supervised multi-label learning,” in AAAI, 2020, pp. 6227–6234.
[102] Q. Wang, L. Yang, and Y. Li, “Learning from weak-label data: A deep forest expedition,” in AAAI, 2020, pp. 6251–6258.
[103] F. Wu, Z. Wang, Z. Zhang, Y. Yang, J. Luo, W. Zhu, and Y. Zhuang, “Weakly semi-supervised deep learning for multi-label image annotation,” IEEE Transactions on Big Data, vol. 1, no. 3, pp. 109–122, 2015.
[104] Q. Tan, Y. Yu, G. Yu, and J. Wang, “Semi-supervised multi-label classification using incomplete label information,” Neurocomputing, vol. 260, pp. 192–202, 2017.
[105] H. Dong, Y. Li, and Z. Zhou, “Learning from semi-supervised weak-label data,” in AAAI, 2018, pp. 2926–2933.
[106] J. Lv, N. Xu, R. Zheng, and X. Geng, “Weakly supervised multi-label learning via label enhancement,” in IJCAI, 2019, pp. 3101–3107.
[107] X. Geng, “Label distribution learning,” TKDE, vol. 28, no. 7, pp. 1734–1748, 2016.
[108] R. Shao, N. Xu, and X. Geng, “Multi-label learning with label enhancement,” in ICDM, 2018, pp. 437–446.
[109] N. Xu, Y. Liu, and X. Geng, “Partial multi-label learning with label distribution,” in AAAI, 2020, pp. 6510–6517.
[110] A. Akbarnejad and M. S. Baghshah, “An efficient semi-supervised multi-label classifier capable of handling missing labels,” TKDE, vol. 31, no. 2, pp. 229–242, 2019.
[111] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” in NeurIPS, 2017, pp. 1195–1204.
[112] T. Chen, S. Kornblith, M. Norouzi, and G. E. Hinton, “A simple framework for contrastive learning of visual representations,” in ICML, vol. 119, 2020, pp. 1597–1607.
[113] M. Zhang and J. Fang, “Partial multi-label learning via credible label elicitation,” TPAMI, 2020.
[114] D. Xu, Y. Shi, I. W. Tsang, Y. Ong, C. Gong, and X. Shen, “A survey on multi-output learning,” CoRR, vol. abs/1901.00248, 2019.
[115] S. He, K. Deng, L. Li, S. Shu, and L. Liu, “Discriminatively relabel for partial multi-label learning,” in ICDM, 2019, pp. 280–288.
[116] G. Yu, X. Chen, C. Domeniconi, J. Wang, Z. Li, Z. Zhang, and X. Wu, “Feature-induced partial multi-label learning,” in ICDM, 2018, pp. 1398–1403.
[117] M. Xie and S. Huang, “Partial multi-label learning with noisy label identification,” in AAAI, 2020, pp. 6454–6461.
[118] Z. Li, G. Lyu, and S. Feng, “Partial multi-label learning via multi-subspace representation,” in IJCAI, 2020, pp. 2612–2618.
[119] J. He, H. Gu, and Z. Wang, “Multi-instance multi-label learning based on Gaussian process with application to visual mobile robot navigation,” Information Sciences, vol. 190, pp. 162–177, 2012.
[120] C. Zhang, Z. Yu, H. Fu, P. Zhu, L. Chen, and Q. Hu, “Hybrid noise-oriented multilabel learning,” TCYB, vol. 50, no. 6, pp. 2837–2850, 2020.
[121] Z. Cui, Y. Zhang, and Q. Ji, “Label error correction and generation through label relationships,” in AAAI, 2020, pp. 3693–3700.
[122] M. Hu, H. Han, S. Shan, and X. Chen, “Multi-label learning from noisy labels with non-linear feature transformation,” in ACCV, 2018, pp. 404–419.
[123] A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta, and S. J. Belongie, “Learning from noisy large-scale datasets with minimal supervision,” in CVPR, 2017, pp. 6575–6583.
[124] Y. Zhu, K. M. Ting, and Z. Zhou, “Multi-label learning with emerging new labels,” TKDE, vol. 30, no. 10, pp. 1901–1914, 2018.
[125] Y. Zhang, B. Gong, and M. Shah, “Fast zero-shot image tagging,” in CVPR, 2016, pp. 5985–5994.
[126] A. Alfassy, L. Karlinsky, A. Aides, J. Shtok, S. Harary, R. S. Feris, R. Giryes, and A. M. Bronstein, “LaSO: Label-set operations networks for multi-label few-shot learning,” in CVPR, 2019, pp. 6548–6557.
[127] B. Yang, J. Sun, T. Wang, and Z. Chen, “Effective multi-label active learning for text classification,” in KDD, 2009, pp. 917–926.
[128] S. Li, Y. Jiang, N. V. Chawla, and Z. Zhou, “Multi-label learning from crowds,” TKDE, vol. 31, no. 7, pp. 1369–1382, 2019.
[129] S. Huang, S. Chen, and Z. Zhou, “Multi-label active learning: Query type matters,” in IJCAI, 2015, pp. 946–952.
[130] N. Xu, J. Shu, Y. Liu, and X. Geng, “Variational label enhancement,” in ICML, vol. 119, 2020, pp. 10 597–10 606.
[131] N. Xu, Y. Liu, and X. Geng, “Label enhancement for label distribution learning,” TKDE, vol. 33, no. 4, pp. 1632–1643, 2021.
[132] H. Yang, J. T. Zhou, J. Cai, and Y. Ong, “MIML-FCN+: multi-instance multi-label learning via fully convolutional networks with privileged information,” in CVPR, 2017, pp. 5996–6004.
[133] X. Zhang, H. Shi, C. Li, and P. Li, “Multi-instance multi-label action recognition and localization based on spatio-temporal pre-trimming for untrimmed videos,” in AAAI, 2020, pp. 12 886–12 893.
[134] D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,” in ICLR, 2017.
[135] M. Wang and W. Deng, “Deep visual domain adaptation: A survey,” Neurocomputing, vol. 312, pp. 135–153, 2018.
[136] J. Nam, J. Kim, E. L. Mencía, I. Gurevych, and J. Fürnkranz, “Large-scale multi-label text classification - revisiting neural networks,” in ECML-PKDD, 2014, pp. 437–452.
[137] C. Chen, H. Wang, W. Liu, X. Zhao, T. Hu, and G. Chen, “Two-stage label embedding via neural factorization machine for multi-label classification,” in AAAI, 2019, pp. 3304–3311.
[138] X. Shen, W. Liu, Y. Luo, Y. Ong, and I. W. Tsang, “Deep discrete prototype multilabel learning,” in IJCAI, 2018, pp. 2675–2681.
[139] H. Fei, Y. Zhang, Y. Ren, and D. Ji, “Latent emotion memory for multi-label emotion classification,” in AAAI, 2020, pp. 7692–7699.
[140] F. Zhu, H. Li, W. Ouyang, N. Yu, and X. Wang, “Learning spatial regularization with image-level supervisions for multi-label image classification,” in CVPR, 2017, pp. 2027–2036.
[141] H. Guo, K. Zheng, X. Fan, H. Yu, and S. Wang, “Visual attention consistency under image transforms for multi-label image classification,” in CVPR, 2019, pp. 729–739.
[142] R. You, Z. Guo, L. Cui, X. Long, Y. Bao, and S. Wen, “Cross-modality attention with semantic graph embedding for multi-label classification,” in AAAI, 2020, pp. 12 709–12 716.
[143] Y. Liu, L. Sheng, J. Shao, J. Yan, S. Xiang, and C. Pan, “Multi-label image classification via knowledge distillation from weakly-supervised detection,” in ACM MM, 2018, pp. 700–708.
[144] C. Guo, A. Mousavi, X. Wu, D. N. Holtmann-Rice, S. Kale, S. J. Reddi, and S. Kumar, “Breaking the glass ceiling for embedding-based classifiers for large output spaces,” in NeurIPS, 2019, pp. 4944–4954.
[145] W. Chang, H. Yu, K. Zhong, Y. Yang, and I. S. Dhillon, “Taming pretrained transformers for extreme multi-label text classification,” in KDD, 2020.
[146] K. Dahiya, D. Saini, A. Mittal, A. Shaw, K. Dave, A. Soni, H. Jain, S. Agarwal, and M. Varma, “DeepXML: A deep extreme multi-label learning framework applied to short text documents,” in WSDM, 2021, pp. 31–39.
[147] A. Mittal, K. Dahiya, S. Agrawal, D. Saini, S. Agarwal, P. Kar, and M. Varma, “DECAF: deep extreme classification with label features,” in WSDM, 2021, pp. 49–57.
[148] A. Mittal, N. Sachdeva, S. Agrawal, S. Agarwal, P. Kar, and M. Varma, “ECLARE: extreme classification with label graph correlations,” in WWW, 2021, pp. 3721–3732.
[149] D. Saini, A. K. Jain, K. Dave, J. Jiao, A. Singh, R. Zhang, and M. Varma, “GalaXC: Graph neural networks with labelwise attention for extreme classification,” in WWW, 2021, pp. 3733–3744.
[150] X. Gong, J. Yang, D. Yuan, and W. Bao, “Generalized large margin kNN for partial label learning,” TMM, 2021.
[151] X. Gong, D. Yuan, and W. Bao, “Top-k partial label machine,” TNNLS, 2021.
[152] J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu, “CNN-RNN: A unified framework for multi-label image classification,” in CVPR, 2016, pp. 2285–2294.
[153] Z. Wang, T. Chen, G. Li, R. Xu, and L. Lin, “Multi-label image recognition by recurrently discovering attentional regions,” in ICCV, 2017, pp. 464–472.
[154] S. Chen, Y. Chen, C. Yeh, and Y. F. Wang, “Order-free RNN with visual attention for multi-label classification,” in AAAI, 2018, pp. 6714–6721.
[155] C. Tsai and H. Lee, “Order-free learning alleviating exposure bias in multi-label classification,” in AAAI, 2020, pp. 6038–6045.
[156] V. O. Yazici, A. Gonzalez-Garcia, A. Ramisa, B. Twardowski, and J. van de Weijer, “Orderless recurrent models for multi-label classification,” in CVPR, 2020, pp. 13 437–13 446.
[157] Z. Chen, X. Wei, P. Wang, and Y. Guo, “Multi-label image recognition with graph convolutional networks,” in CVPR, 2019, pp. 5177–5186.
[158] T. Chen, M. Xu, X. Hui, H. Wu, and L. Lin, “Learning semantic-specific graph representation for multi-label image recognition,” in ICCV, 2019, pp. 522–531.
[159] Y. Wang, D. He, F. Li, X. Long, Z. Zhou, J. Ma, and S. Wen, “Multi-label classification with label graph superimposing,” in AAAI, 2020, pp. 12 265–12 272.
[160] P. Tang, M. Jiang, B. N. Xia, J. W. Pitera, J. Welser, and N. V. Chawla, “Multi-label patent categorization with non-local attention-based graph convolutional network,” in AAAI, 2020.
[161] Z. Zhou and J. Feng, “Deep forest: Towards an alternative to deep neural networks,” in IJCAI, 2017, pp. 3553–3559.
[162] M. J. Er, R. Venkatesan, and N. Wang, “An online universal classifier for binary, multi-class and multi-label classification,” in IEEE International Conference on Systems, Man, and Cybernetics, 2016, pp. 3701–3706.
[163] S. Ding, H. Zhao, Y. Zhang, X. Xu, and R. Nie, “Extreme learning machine: algorithm, theory and applications,” Artificial Intelligence Review, vol. 44, no. 1, pp. 103–115, 2015.
[164] H. Chu, K. Huang, and H. Lin, “Dynamic principal projection for cost-sensitive online multi-label classification,” Machine Learning, vol. 108, no. 8-9, pp. 1193–1230, 2019.
[165] S. Boulbazine, G. Cabanes, B. Matei, and Y. Bennani, “Online semi-supervised growing neural gas for multi-label data classification,” in IJCNN, 2018, pp. 1–8.
[166] P. Li, H. Wang, C. Böhm, and J. Shao, “Online semi-supervised multi-label classification with label compression and local smooth regression,” in IJCAI, 2020, pp. 1359–1365.
[167] D. Yeung and H. Chang, “Locally smooth metric learning with application to image retrieval,” in ICCV, 2007, pp. 1–7.
[168] H. Wang, Y. Qiang, C. Chen, W. Liu, T. Hu, Z. Li, and G. Chen, “Online partial label learning,” in ECML/PKDD, 2020.
[169] W. Gao and Z. Zhou, “On the consistency of multi-label learning,” Artificial Intelligence, vol. 199-200, pp. 22–44, 2013.
[170] K. Dembczynski, W. Kotlowski, and E. Hüllermeier, “Consistent multilabel ranking through univariate losses,” in ICML, 2012.
[171] H. Zou, “The adaptive lasso and its oracle properties,” Journal of the American Statistical Association, vol. 101, no. 476, pp. 1418–1429, 2006.
[172] C.-H. Zhang and J. Huang, “The sparsity and bias of the lasso selection in high-dimensional linear regression,” Annals of Statistics, vol. 36, no. 4, pp. 1567–1594, 2008.
[173] C.-H. Zhang, “Nearly unbiased variable selection under minimax concave penalty,” Annals of Statistics, vol. 38, no. 2, pp. 894–942, 2010.
[174] J. Fan and R. Li, “Variable selection via nonconcave penalized likelihood and its oracle properties,” Journal of the American Statistical Association, vol. 96, no. 456, pp. 1348–1360, 2001.
[175] W. Liu and X. Shen, “Sparse extreme multi-label learning with oracle property,” in ICML, 2019, pp. 4032–4041.
[176] W. Liu, “Copula multi-label learning,” in NeurIPS, 2019, pp. 6334–6343.
[177] C. Snoek, M. Worring, J. Geusebroek, D. Koelma, F. J. Seinstra, and A. W. M. Smeulders, “The semantic pathfinder: Using an authoring metaphor for generic multimedia indexing,” TPAMI, vol. 28, no. 10, pp. 1678–1689, 2006.
[178] G.-J. Qi, X.-S. Hua, Y. Rui, J. Tang, T. Mei, M. Wang, and H.-J. Zhang, “Correlative multi-label video annotation with temporal kernels,” ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 5, no. 1, 2008.
[179] X.-S. Hua and G.-J. Qi, “Online multi-label active learning for large-scale multimedia annotation,” Microsoft, Tech. Rep., 2008.
[180] Z. Wang, Y. Li, S. Wang, and Q. Ji, “Capturing global semantic relationships for facial action unit recognition,” in ICCV, 2013, pp. 3304–3311.
[181] K. Zhao, W. Chu, F. D. la Torre, J. F. Cohn, and H. Zhang, “Joint patch and multi-label learning for facial action unit detection,” in CVPR, 2015, pp. 2207–2216.
[182] K. Zhao, W. Chu, and H. Zhang, “Deep region and multi-label learning for facial action unit detection,” in CVPR, 2016, pp. 3391–3399.
[183] X. Niu, H. Han, S. Shan, and X. Chen, “Multi-label co-regularization for semi-supervised facial action unit recognition,” in NeurIPS, 2019, pp. 907–917.
[184] N. Ratnarajah and A. Qiu, “Multi-label segmentation of white matter structures: Application to neonatal brains,” NeuroImage, vol. 102, pp. 913–922, 2014.
[185] N. Noorizadeh, K. Kazemi, H. Danyali, and A. Aarabi, “Multi-atlas based neonatal brain extraction using a two-level patch-based label fusion strategy,” Biomedical Signal Processing and Control, vol. 54, 2019.
[186] N. Noorizadeh, K. Kazemi, H. Danyali, A. Babajani-Feremi, and A. Aarabi, “Multi-atlas based neonatal brain extraction using atlas library clustering and local label fusion,” Multimedia Tools and Applications, vol. 79, no. 27-28, pp. 19 411–19 433, 2020.
[187] G. Wunder, P. Jung, M. Kasparick, T. Wild, F. Schaich, Y. Chen, S. ten Brink, I. Gaspar, N. Michailow, A. Festag, L. L. Mendes, N. Cassiau, D. Ktenas, M. Dryjanski, S. Pietrzyk, B. Eged, P. Vago, and F. Wiedmann, “5GNOW: non-orthogonal, asynchronous waveforms for future mobile applications,” IEEE Communications Magazine, vol. 52, no. 2, pp. 97–105, 2014.
[188] L. Guo, B. Jin, R. Yu, C. Yao, C. Sun, and D. Huang, “Multi-label classification methods for green computing and application for mobile medical recommendations,” IEEE Access, vol. 4, pp. 3201–3209, 2016.
[189] C. Wang, Y. Wang, B. Xu, Y. He, Z. Dong, and Z. Sun, “A lightweight multi-label segmentation network for mobile iris biometrics,” in ICASSP, 2020, pp. 1006–1010.
[190] M. B. Messaoud, I. Jenhani, N. B. Jemaa, and M. W. Mkaouer, “A multi-label active learning approach for mobile app user review classification,” in KSEM, 2019, pp. 805–816.
[191] E. L. Mencía and J. Fürnkranz, “Efficient multilabel classification algorithms for large-scale problems in the legal domain,” in Semantic Processing of Legal Texts: Where the Language of Law Meets the Law of Language, ser. Lecture Notes in Computer Science, vol. 6036, 2010, pp. 192–215.
[192] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, and I. Androutsopoulos, “Large-scale multi-label text classification on EU legislation,” in ACL, 2019, pp. 6314–6322.
[193] C. Xiao, H. Zhong, Z. Guo, C. Tu, Z. Liu, M. Sun, Y. Feng, X. Han, Z. Hu, H. Wang, and J. Xu, “CAIL2018: A large-scale legal dataset for judgment prediction,” CoRR, vol. abs/1807.02478, 2018.
[194] R. Agrawal, A. Gupta, Y. Prabhu, and M. Varma, “Multi-label learning with millions of labels: recommending advertiser bid phrases for web pages,” in WWW, 2013, pp. 13–24.
[195] J. J. McAuley, R. Pandey, and J. Leskovec, “Inferring networks of substitutable and complementary products,” in KDD, 2015, pp. 785–794.
[196] G. Farnadi, J. Tang, M. D. Cock, and M. Moens, “User profiling through deep multimodal fusion,” in WSDM, 2018, pp. 171–179.
[197] J. Wen, L. Wei, W. Zhou, J. Han, and T. Guo, “GCN-IA: user profile based on graph convolutional network with implicit association labels,” in ICCS, vol. 12139, 2020, pp. 355–364.
[198] M. R. Naphade, J. R. Smith, J. Tesic, S. Chang, W. H. Hsu, L. S. Kennedy, A. G. Hauptmann, and J. Curtis, “Large-scale concept ontology for multimedia,” TMM, vol. 13, no. 3, pp. 86–91, 2006.
[199] S. S. Bucak, R. Jin, and A. K. Jain, “Multi-label multiple kernel learning by stochastic approximation: Application to visual object recognition,” in NeurIPS, 2010, pp. 325–333.
[200] D. Y. Kim, B. Vo, and B. Vo, “Online visual multi-object tracking via labeled random finite set filtering,” CoRR, vol. abs/1611.06011, 2016.
[201] L. Grady and G. Funka-Lea, “Multi-label image segmentation for medical applications based on graph-theoretic electrical potentials,” in Computer Vision and Mathematical Methods in Medical and Biomedical Image Analysis, 2004, pp. 230–245.
[202] A. Schulz, E. L. Mencía, and B. Schmidt, “A rapid-prototyping framework for extracting small-scale incident-related information in microblogs: Application of multi-label classification on tweets,” Information Systems, vol. 57, pp. 88–110, 2016.
[203] R. Venkatesan, M. J. Er, M. Dave, M. Pratama, and S. Wu, “A novel online multi-label classifier for high-speed streaming data applications,” CoRR, vol. abs/1609.00086, 2016.
[204] P. M. Ciarelli, E. Oliveira, and E. O. T. Salles, “Multi-label incremental learning applied to web page categorization,” Neural Computing and Applications, vol. 24, no. 6, pp. 1403–1419, 2014.
[205] S. Wan, M. Mak, B. Zhang, Y. Wang, and S. Kung, “Ensemble random projection for multi-label classification with application to protein subcellular localization,” in ICASSP, 2014, pp. 5999–6003.
[206] L. Wei, W. Zhou, J. Wen, M. Lin, J. Han, and S. Hu, “MLP-IA: multi-label user profile based on implicit association labels,” in ICCS, vol. 11536, 2019, pp. 548–561.

Weiwei Liu received his PhD degree in computer science from the University of Technology Sydney, Australia, in 2017, under the supervision of Prof. Ivor W. Tsang. He is currently a full professor with the School of Computer Science, Wuhan University, China. His research interest is machine learning. His work has been published in journals and conferences including JMLR, IEEE TPAMI, IEEE TNNLS, IEEE TIP, IEEE TCYB, NeurIPS, ICML, ACL, AAAI, and IJCAI.
