
PROTOtypical Logic Tensor Networks (PROTO-LTN) for Zero Shot Learning

Simone Martone, Francesco Manigrasso, Fabrizio Lamberti and Lia Morra simone.martone@studenti.polito.it, {francesco.manigrasso, fabrizio.lamberti, lia.morra}@polito.it
Department of Control and Computer Engineering Politecnico di Torino
Abstract

Semantic image interpretation can vastly benefit from approaches that combine sub-symbolic distributed representation learning with the capability to reason at a higher level of abstraction. Logic Tensor Networks (LTNs) are a class of neuro-symbolic systems based on a differentiable, first-order logic grounded into a deep neural network. LTNs replace the classical concept of a training set with a knowledge base of fuzzy logical axioms. By defining a set of differentiable operators to approximate the role of connectives, predicates, functions and quantifiers, a loss function is automatically specified so that LTNs can learn to satisfy the knowledge base. We focus here on the subsumption or isOfClass predicate, which is fundamental to encoding most semantic image interpretation tasks. Unlike conventional LTNs, which rely on a separate predicate for each class (e.g., dog, cat), each with its own set of learnable weights, we propose a common isOfClass predicate, whose level of truth is a function of the distance between an object embedding and the corresponding class prototype. PROTOtypical Logic Tensor Networks (PROTO-LTN) extend the current formulation by grounding abstract concepts as parametrized class prototypes in a high-dimensional embedding space, while reducing the number of parameters required to ground the knowledge base. We show how this architecture can be effectively trained in few- and zero-shot learning scenarios. Experiments on Generalized Zero Shot Learning benchmarks validate the proposed implementation as a competitive alternative to traditional embedding-based approaches. The proposed formulation opens up new opportunities in zero shot learning settings, as the LTN formalism allows background knowledge, in the form of logical axioms, to be integrated to compensate for the lack of labelled examples. PROTO-LTN was implemented in TensorFlow and is available at https://github.com/FrancescoManigrass/PROTO-LTN.git

I Introduction

Despite their impressive performance when trained on large-scale, supervised datasets, deep neural networks still have difficulty generalizing to unseen categories. Humans, by contrast, can leverage logical reasoning to make guesses about new circumstances, and are able to infer knowledge from few or even zero examples. Recent efforts towards Neural-Symbolic (NeSy) integration [1, 2] make it possible to assimilate symbolic representation and reasoning into deep architectures: background knowledge, in the form of logical axioms, can thus be exploited during training, opening up new scenarios for settings in which labelled examples are scarce or noisy [3, 4]. Specifically, we focus here on Logic Tensor Networks (LTNs) [5], a NeSy architecture that replaces the classical concept of a training set with a Knowledge Base $\mathcal{K}$ of logical axioms, ultimately interpreted in a fuzzy way, and formulates the learning objective as maximizing the satisfiability of $\mathcal{K}$. While this framework has been applied to multi-label classification problems [5, 6] and object detection [4], its application to few- and zero-shot image classification has not yet been investigated.

In this work, we explore this task from a NeSy perspective, and propose to integrate ideas and concepts from the few-shot learning (FSL) and zero-shot learning (ZSL) domains, namely the Prototypical Networks (PNs) [7] framework, within the LTN formulation. PNs define class prototypes in a high-dimensional embedding space, so that incoming examples are assigned to the class of their nearest prototype according to some distance measure. In the LTN framework, this is achieved by representing the isOfClass relationship as a function of the distance between a class prototype and an object instance, thus obtaining the Prototypical Logic Tensor Network (PROTO-LTN) architecture. As the embedding space is the focus of the learning procedure, such prototypes may be also defined for classes that are not seen at training time.

The present study thus formulates a theoretical framework that achieves competitive results with respect to standard embedding-based ZSL architectures such as DEM [8], while offering a higher degree of flexibility. Although our analysis shows that, in their basic settings, the two formulations are equivalent, PROTO-LTNs have greater potential in both standard and transductive ZSL. They can integrate into the training process prior knowledge and logical constraints from an external knowledge base, including information related to unseen classes [9]. Hence, a NeSy formulation allows the embedding space to be constrained via symbolic priors.

The proposed framework also has potential advantages over traditional LTNs, even outside of the FSL and ZSL settings, since classes are represented as parametrized prototypes rather than a discrete label space [5, 4]. First, representing higher-level concepts as distributed vectorized representations makes it natural to exploit the notion of distance for highlighting relationships between symbols, with semantically related symbols having similar representations [10]. Second, prototypes ground abstract concepts in a vectorized form that can be more easily manipulated: as an example, it would be easier to define a suitable grounding for predicates that directly operate on the abstract classes, as well as their instances. Third, prototypes are more interpretable than simple labels, as their incorporation into the embedding space can be easily visualized by employing dimensionality reduction methods, such as t-SNE [11].

The rest of the paper is organized as follows. In Section II, we place the present work in the context of the related literature, and provide background on LTNs. In Section III, we describe a simple theoretical scheme to assimilate PNs into an LTN for classification purposes (PROTO-LTN), both in the FSL and ZSL scenarios. Then, in Sections IV and V, we examine the behavior of the model in the Generalized Zero-Shot-Learning (GZSL) task on common benchmark datasets. Finally, in Section VI, we discuss conclusions and future work.

II Related work

II-A Neural-symbolic AI in Semantic Image Interpretation

Research on how to combine connectionist and symbolic approaches has flourished in the past few years [5, 12], with several applications in semantic image interpretation and visual query answering [5, 4, 13, 3, 14, 15, 16]. Among the plethora of compositional patterns that have been proposed [17, 12], the present work follows two main principles: knowledge representation (in the form of first order logic) is embedded into a neural network, which in turn makes it possible to constrain the search space by leveraging explicit (and human-interpretable) domain knowledge as a symbolic prior. This latter property is extremely useful in ZSL, in which some external source of information is exploited to offer an abstract description of the classes in lieu of providing training examples. On the other hand, compared to approaches based on Inductive Logic Programming (such as [14]), in which perception and reasoning are performed by separate modules, LTNs provide tighter integration between the two subsystems.

II-B Logic Tensor Networks

LTNs have proven effective in higher-level image interpretation tasks, such as object detection and scene graph construction [13, 5]. Donadello et al. applied them for scene relationship detection in a zero shot setting, showing how prior knowledge can compensate for the lack of supervision [3].

In the LTN framework, the term grounding denotes the interpretation of a First Order Language into a subset of the $\mathbb{R}^{n}$ domain [5]. It defines a collection of terms (objects) and formulas described in a Knowledge Base $\mathcal{K}$. For instance, to express the friendship between two terms defined as Alice and Bob, we can use the predicate friend_of:

$$\phi_{1}=\texttt{friend\_of}(Alice,Bob)\wedge\texttt{friend\_of}(Bob,Alice)$$

At the same time, we can specify formulas defining general properties, such as the symmetric nature of the friendship relationship within a specific domain:

$$\phi_{2}=\forall\,x,y\,\left(\texttt{friend\_of}(x,y)\Rightarrow\texttt{friend\_of}(y,x)\right)$$

Adopting Real Logic, both formulas and terms are grounded (interpreted) into a scalar value in the [0,1] interval. Specifying the grounding function $\mathcal{G}$, which maps terms and formulas into such real-valued features, yields a complete definition of a theory. Given a set of terms, aggregate formulas can be defined by approximating the unary and binary connectives and the quantifiers of fuzzy logic with suitable differentiable operators.
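As a rough illustration of how connectives and quantifiers can be approximated by differentiable operators, the sketch below evaluates $\phi_1$ and $\phi_2$ on fixed truth values. The operator choices (product t-norm for conjunction, Reichenbach implication, arithmetic mean for the universal quantifier) and the truth values assigned to friend_of are assumptions made for this example only; the text above does not prescribe them.

```python
from itertools import product

# Hypothetical truth values in [0, 1] for the friend_of predicate.
# In an LTN these would be produced by a learnable grounding.
FRIEND_OF = {
    ("Alice", "Bob"): 0.9,
    ("Bob", "Alice"): 0.8,
    ("Alice", "Alice"): 1.0,
    ("Bob", "Bob"): 1.0,
}


def and_prod(a, b):
    """Conjunction as the product t-norm (one possible choice)."""
    return a * b


def implies_reichenbach(a, b):
    """Implication as 1 - a + a * b (one possible choice)."""
    return 1.0 - a + a * b


def forall_mean(values):
    """Universal quantifier approximated by the arithmetic mean."""
    values = list(values)
    return sum(values) / len(values)


# phi_1 = friend_of(Alice, Bob) AND friend_of(Bob, Alice)
phi_1 = and_prod(FRIEND_OF[("Alice", "Bob")], FRIEND_OF[("Bob", "Alice")])

# phi_2 = forall x, y: friend_of(x, y) => friend_of(y, x)
people = ["Alice", "Bob"]
phi_2 = forall_mean(
    implies_reichenbach(FRIEND_OF[(x, y)], FRIEND_OF[(y, x)])
    for x, y in product(people, repeat=2)
)
```

Because every operator is a smooth function of its inputs, the truth value of a whole formula can be differentiated with respect to the parameters that produce the predicate values.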

In semantic image interpretation tasks, terms (objects) are typically grounded by features computed by a pre-trained convolutional neural network; it is also possible to jointly train the convolutional backbone and the LTNs in an end-to-end fashion [4]. Predicate symbols $p\in\mathcal{P}$ are grounded by a function $\mathcal{G}\left(D(p)\right)\rightarrow[0,1]$. A typical predicate in semantic image interpretation is isOfClass, which represents the probability that a given object belongs to class $c$.

In conventional LTNs [5, 13, 4], predicates are typically defined as the generalization of the neural tensor network:

$$\mathcal{G}\left(\mathcal{P}\right)(\mathbf{v})=\sigma\left(u_{P}^{T}\tanh\left(\mathbf{v}^{T}W_{P}^{[1:k]}\mathbf{v}+V_{P}\mathbf{v}+b_{P}\right)\right) \tag{1}$$

where $\sigma$ is the sigmoid function, and $W_{P}^{[1:k]}\in\mathbb{R}^{k\times mn\times mn}$, $V_{P}\in\mathbb{R}^{k\times mn}$, $u_{P}\in\mathbb{R}^{k}$, and $b_{P}\in\mathbb{R}$ are learnable tensors of parameters. For multi-class problems, the sigmoid function could be substituted by a softmax layer to enforce mutual exclusivity [5].
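To make the shape of Eq. 1 concrete, here is a minimal pure-Python sketch of the neural tensor network predicate. The function name `ntn_predicate`, the toy dimensions (input size 2 standing in for $mn$, and $k=2$) and the weight values are illustrative assumptions, not the reference implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ntn_predicate(v, W, V, b, u):
    """Truth value of a predicate for input vector v, following Eq. 1.

    W: k matrices of size d x d, V: k rows of length d,
    b: scalar bias, u: k output weights.
    """
    d = len(v)
    hidden = []
    for W_i, V_i in zip(W, V):
        # Bilinear term v^T W_i v for the i-th tensor slice.
        bilinear = sum(v[r] * W_i[r][c] * v[c] for r in range(d) for c in range(d))
        # Linear term (V v)_i.
        linear = sum(V_i[c] * v[c] for c in range(d))
        hidden.append(math.tanh(bilinear + linear + b))
    # Squash the weighted hidden units into (0, 1).
    return sigmoid(sum(u_i * h_i for u_i, h_i in zip(u, hidden)))


# Toy parameters (assumed values, for illustration only).
v = [0.5, -1.0]
W = [[[0.1, 0.0], [0.0, 0.1]], [[0.0, 0.2], [0.2, 0.0]]]
V = [[1.0, 0.0], [0.0, 1.0]]
u = [1.0, -1.0]
truth = ntn_predicate(v, W, V, 0.0, u)
```

Each class-specific predicate carries its own copy of `W`, `V`, `b` and `u`, so the parameter count grows with the number of classes.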

This grounding requires adding a predicate for each class (e.g., isDog, isPerson, etc.), which is embedded into a tensor network with separate weights. Additionally, since class symbols are not grounded, predicates can only be defined for object instances, which rapidly leads to very large knowledge bases when background logical axioms need to be imposed. In contrast, our proposed grounding does not require additional model parameters, or in any case limits them to a small set which is shared among all isOfClass predicates. Furthermore, it encodes abstract classes as parametric objects that live in the same embedding space as their instances, and can be used to establish relationships with other objects (e.g., macro-category relationships). This formulation thus supports more efficient and compact representations.

The best satisfiability problem, which is the optimization problem underlying LTNs, consists of determining the parameter values $\Theta^{*}$ that maximize the truth value of the conjunction of all formulas $\phi\in\mathcal{K}$:

$$\Theta^{*}=\operatorname*{argmax}_{\Theta}\;\hat{\mathcal{G}}_{\theta}\left(\bigwedge_{\phi\in\mathcal{K}}\phi\right)-\lambda\|\Theta\|_{2}^{2} \tag{2}$$

where $\lambda\|\Theta\|_{2}^{2}$ is a convenient regularization term.
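The quantity being maximized in Eq. 2 can be sketched as follows. The aggregation of the conjunction over $\mathcal{K}$ is assumed here to be the arithmetic mean of the per-formula truth values, and `sat_objective` is a hypothetical helper name; other aggregators are equally admissible.

```python
def sat_objective(formula_truths, params, lam=1e-3):
    """Regularized satisfiability of a knowledge base (cf. Eq. 2).

    formula_truths: truth values in [0, 1], one per formula in K.
    params: flat list of model parameters.
    lam: weight of the L2 penalty.
    """
    # Assumption: the conjunction over K is approximated by the mean.
    sat = sum(formula_truths) / len(formula_truths)
    # Squared L2 norm of the parameters.
    l2 = sum(p * p for p in params)
    return sat - lam * l2


score = sat_objective([1.0, 0.5], [1.0, 2.0], lam=0.1)
```

In a gradient-based setting one would typically minimize the negative of this value.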

II-C Zero-shot learning

In zero-shot learning, a learner must be able to recognize objects from test classes, not seen during training, by leveraging some sort of description, most commonly a vector of semantic attributes [18]. In this paper, we target the Generalized zero-shot learning (GZSL) scenario, in which both seen and unseen classes appear at test time [18]. State-of-the-art techniques for ZSL classification typically fall within two categories [18, 8]: embedding-based and generative-based.

Embedding-based models [8, 19, 20, 21] compare semantic characteristics (e.g., attributes) and visual characteristics (usually taken from a pre-trained convolutional neural network) by (learning a) mapping to a common embedding space. Mapping the semantic space to the more compact visual feature space, rather than the opposite, alleviates the so-called hubness problem and facilitates separation between classes [8]. Standard embedding-based models are completely agnostic to any information about the test set: neither examples (even unlabelled), nor class attributes are assumed to be available at training time. Although based on a NeSy formulation, the proposed PROTO-LTN approach can be regarded as an embedding-based technique, as semantic concepts and visual features are mapped onto a common embedding space.

Embedding-based models tend to be naturally biased towards seen classes. To alleviate this problem, generative models were proposed with the purpose of learning a conditioned probability distribution for each class, and thus generate artificial examples of unseen classes [22, 23, 24]. A conventional classifier is trained by utilizing both the true and the generated examples. Although impressive results, especially in a GZSL context, can be achieved by taking advantage of this machinery, reduced flexibility with respect to embedding methods is entailed, as unseen classes need to be defined, so that a number of corresponding examples can be artificially synthesized. PROTO-LTNs are thus best compared with other embedding-based models, although nothing prevents them from being trained on, or combined with, generative methods.

III PROTOtypical Logic Tensor Networks

First, we introduce the basic notation related to prototypical networks in the FSL (Section III-A) and ZSL (Section III-B) settings [7]. Then, in Sections III-C and III-D, we build on these concepts and show how the PROTO-LTN training cycle is constructed by substituting the original model with a grounded $\mathcal{K}$, and the original loss with a best satisfiability problem.

III-A Prototypical Networks: the FSL setting

We assume an $N$-way-$K$-shot FSL scenario, in which a classifier is asked to pick the right class among $N$ choices, while having the chance to observe $K$ examples per class [25, 26, 27]. More specifically, the labelled examples are referred to as the support examples, and the unlabelled ones as the query examples.

The underlying assumption is that there exists an embedding space in which elements of different classes are well scattered, and that this can be expressed mathematically as an embedding function $f_{\theta}$, whose parameters $\theta$ must be inferred, acting as a mapping

$$f_{\theta}:\mathbb{R}^{D}\to\mathbb{R}^{M}. \tag{3}$$

In Eq. 3, $D$ and $M$ are, respectively, the dimensions of the input space and of the embedding space. Thus, for an example $x$, $f_{\theta}(x)$ is the corresponding embedding.

In FSL, a prototype for class $n$ is obtained as the mean embedding of the $K$ support examples of class $n$ at train time:

$$p_{n}=\frac{1}{K}\sum_{\substack{(x^{\tilde{S}},\,y^{\tilde{S}})\in\tilde{S}\\ \text{s.t. }y^{\tilde{S}}=n}}f_{\theta}(x^{\tilde{S}}). \tag{4}$$

Class prototypes thus need to live in the embedding space, as they embody average features shared by elements of the class they represent. At training time, $\theta$ is optimized so that the distance between each prototype and the elements of its class is minimized, while the distance between different prototypes is maximized. Finally, classification at testing time is performed by assigning each query sample to its nearest prototype.
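The prototype computation of Eq. 4 and the nearest-prototype rule can be sketched in a few lines. The sketch assumes that the inputs are already embedded (that is, they are the outputs $f_{\theta}(x)$) and that distance is Euclidean; the helper names `class_prototypes` and `nearest_prototype` are hypothetical.

```python
import math
from collections import defaultdict


def class_prototypes(embeddings, labels):
    """Mean embedding per class (Eq. 4), given support embeddings and labels."""
    groups = defaultdict(list)
    for e, y in zip(embeddings, labels):
        groups[y].append(e)
    # Average each coordinate over the support examples of the class;
    # in a balanced N-way-K-shot episode, len(vecs) equals K.
    return {
        y: [sum(col) / len(vecs) for col in zip(*vecs)]
        for y, vecs in groups.items()
    }


def nearest_prototype(embedding, prototypes):
    """Label of the prototype closest to the query in Euclidean distance."""
    return min(prototypes, key=lambda y: math.dist(embedding, prototypes[y]))


# Toy 2-way-2-shot support set in a 2-dimensional embedding space.
support_embeddings = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]]
support_labels = [0, 0, 1, 1]
prototypes = class_prototypes(support_embeddings, support_labels)
predicted = nearest_prototype([1.0, 1.0], prototypes)
```

Only the embedding function carries learnable parameters here; the prototypes are recomputed from whichever support set is supplied.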

At testing time, a support set $S=\{(x^{S}_{1},y^{S}_{1}),\dots,(x^{S}_{N_{S}},y^{S}_{N_{S}})\}$ of $N_{S}$ labelled examples is available, where each $x^{S}_{i}\in\mathbb{R}^{D}$ is the feature vector of an example, and $y^{S}_{i}\in C\subset\mathbb{N}$ is the corresponding label. Assuming an $N$-way-$K$-shot scenario, exactly $K$ support examples are available for each of the $N$ classes. A query set $Q=\{x^{Q}_{1},\dots,x^{Q}_{N_{Q}}\}$ of $N_{Q}$ unlabelled examples is then supplied, and the task is to correctly sort the examples into their classes. The elements of the query set $Q$ belong to the same domain as those of the support set $S$.

At training time, it may be impossible to know which classes the testing scenario will yield. In other words, a support set $S$ is not accessible in advance. To cope with this, a training set $T=\{(x^{T}_{1},y^{T}_{1}),\dots,(x^{T}_{N_{T}},y^{T}_{N_{T}})\}$ is chosen that reflects the best prior information available about the testing scenario, with labels $y_{i}^{T}\in C_{T}\subset\mathbb{N}$ and $|C_{T}|=N_{T}$ classes, which can equal or outnumber the test classes ($N_{T}\geq N$). In other words, it is possible that $C\cap C_{T}\neq\emptyset$, but this cannot be known in advance. Then, fake support and query sets $\tilde{S}\subset T$ and $\tilde{Q}\subset T$ are extracted to mimic the testing scenario and instruct the model to learn accordingly.
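One way to extract fake support and query sets from $T$ is to sample episodes, as sketched below. The helper name `sample_episode`, the fixed number of query examples per class (`n_query`) and the toy training set are assumptions for illustration; the text does not specify how the fake sets are drawn.

```python
import random
from collections import defaultdict


def sample_episode(training_set, n_way, k_shot, n_query, rng=random):
    """Draw disjoint fake support and query sets from labelled pairs (x, y)."""
    by_class = defaultdict(list)
    for x, y in training_set:
        by_class[y].append(x)
    # Pick the N classes of this episode among the training classes.
    classes = rng.sample(sorted(by_class), n_way)
    support, query = [], []
    for c in classes:
        # Sample without replacement so support and query do not overlap.
        picked = rng.sample(by_class[c], k_shot + n_query)
        support += [(x, c) for x in picked[:k_shot]]
        query += [(x, c) for x in picked[k_shot:]]
    return support, query


# Toy training set: 3 classes with 5 examples each; x is a placeholder tuple.
T = [((c, i), c) for c in range(3) for i in range(5)]
support, query = sample_episode(T, n_way=2, k_shot=2, n_query=1, rng=random.Random(0))
```

Passing a seeded `random.Random` instance makes the episodes reproducible.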

III-B Prototypical networks: the ZSL setting

In ZSL, labelled examples are not available for all classes. Instead, it is assumed that $N$ abstract vectors $\{a^{(1)},a^{(2)},\dots,a^{(N)}\}$, with $a^{(n)}\in\mathbb{R}^{A}$, encode the characteristics of all $N$ classes.

As in FSL, at training time one takes advantage of a set $T=\{(x^{T}_{1},y^{T}_{1}),\dots,(x^{T}_{N_{T}},y^{T}_{N_{T}})\}$ of labelled examples from classes $y^{T}_{i}\in C_{T}\subset\mathbb{N}$, where preferably $|C_{T}|=N_{T}\geq N=|C|$. The training cycle remains unchanged in the ZSL case, but class prototypes are defined differently:

  • the embedding for a query example $x^{Q}$ is still obtained as $f_{\theta}(x^{Q})$, where $f_{\theta}:\mathbb{R}^{D}\to\mathbb{R}^{M}$;
  • the prototype for class $n\in C$ is extracted as $p_{n}=g_{\theta}(a^{(n)})$ via a separate embedding function $g_{\theta}:\mathbb{R}^{A}\to\mathbb{R}^{M}$, which maps the semantic attribute space to the common embedding space.
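The two bullet points above can be sketched as follows, with both $f_{\theta}$ and $g_{\theta}$ assumed, purely for illustration, to be linear maps given by weight matrices (in practice they would be neural networks). The class names, attribute vectors and weights are hypothetical.

```python
import math


def linear_map(weights, vector):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def zsl_classify(x, attributes, f_weights, g_weights):
    """Assign x to the class whose attribute-derived prototype is nearest."""
    # Query embedding f_theta(x), mapping R^D -> R^M.
    query = linear_map(f_weights, x)
    # Prototypes p_n = g_theta(a^(n)), mapping R^A -> R^M.
    prototypes = {n: linear_map(g_weights, a) for n, a in attributes.items()}
    return min(prototypes, key=lambda n: math.dist(query, prototypes[n]))


# Toy setting with D = 2, A = 3, M = 2 (assumed values).
attributes = {"zebra": [1.0, 0.0, 1.0], "horse": [0.0, 1.0, 0.0]}
f_weights = [[1.0, 0.0], [0.0, 1.0]]
g_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
label = zsl_classify([0.9, 0.2], attributes, f_weights, g_weights)
```

Since prototypes depend only on attribute vectors, a class with no training images can still be given a prototype, provided its attributes are known.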

III-C PROTO-LTN: the FSL scenario

Figure 1: Proto-LTN architecture for ZSL classification. The architecture is composed of a convolutional feature extractor and an attribute encoder. The two branches map semantic and visual features into a common embedding space. The isOfClass predicate aims to minimize the distance between instances (solid line circles) and class prototypes (dashed line circles) based on affirmative and negative formulas embedded in the knowledge base $\mathcal{K}$. At train time, the loss function maximizes the satisfiability (truth value) of all formulas in $\mathcal{K}$.

The overall architecture of PROTO-LTN, when tailored to the ZSL scenario, is illustrated in Fig. 1. The input image embeddings are extracted from a convolutional neural network, while attribute vectors are mapped into the embedding domain through an embedding function. In this section, we detail the groundings of constants, variables, functions and predicates. Then, we define the Knowledge Base $\mathcal{K}$, which encodes our learning problem.

III-C1 Grounding terms

Within a single training episode, a batch of training samples is selected in the form of fake support $\tilde{S}$ and query $\tilde{Q}$ sets. Groundings for variables and their domain $D$ (not learnable) can be defined as

\begin{align}
\mathcal{G}(q) &= \langle x^{\tilde{Q}}_{1},\dots,x^{\tilde{Q}}_{N_{\tilde{Q}}}\rangle, \tag{5}\\
\mathcal{G}(q_{l}) &= \langle y^{\tilde{Q}}_{1},\dots,y^{\tilde{Q}}_{N_{\tilde{Q}}}\rangle, \tag{6}\\
\mathcal{G}(q_{e}) &= \mathcal{G}(\texttt{getEmbedding}(q)) \tag{7}\\
&= \langle f_{\theta}(x^{\tilde{Q}}_{1}),\dots,f_{\theta}(x^{\tilde{Q}}_{N_{\tilde{Q}}})\rangle, \tag{8}\\
\mathcal{G}(s) &= \langle x^{\tilde{S}}_{1},\dots,x^{\tilde{S}}_{N_{S}}\rangle, \tag{9}\\
\mathcal{G}(s_{l}) &= \langle y^{\tilde{S}}_{1},\dots,y^{\tilde{S}}_{N_{S}}\rangle, \tag{10}\\
\mathcal{G}(p),\,\mathcal{G}(p_{l}) &= \mathcal{G}(\texttt{getPrototypes}(s,s_{l})) \tag{11}\\
&= \Pi_{\theta}(\mathcal{G}(s,s_{l})) \tag{12}\\
&= \Pi_{\theta}(\langle (x^{\tilde{S}}_{1},y^{\tilde{S}}_{1}),\dots,(x^{\tilde{S}}_{N_{S}},y^{\tilde{S}}_{N_{S}})\rangle), \tag{13}
\end{align}

where $q$ are the query examples ($D(q)=\texttt{features}$), $q_{l}$ are the corresponding labels ($D(q_{l})=\texttt{labels}$), and $q_{e}$ are their embeddings ($D(q_{e})=\texttt{embeddings}$). Similarly, $s$ are the examples in the support set ($D(s)=\texttt{features}$) and $s_{l}$ their labels. Finally, $p$ and $p_{l}$ are the prototypes and their labels, respectively, with $D(p)=\texttt{embeddings}$ and $D(p_{l})=\texttt{labels}$.

III-C2 Grounding functions and predicates

PROTO-LTNs are based on two functions (getEmbedding and getPrototypes) and the isOfClass predicate.

getEmbedding is a conventional LTN function that maps image features into the embedding space, hence $D_{\mathrm{in}}(\texttt{getEmbedding})=\texttt{features}$ and $D_{\mathrm{out}}(\texttt{getEmbedding})=\texttt{embeddings}$.

The getPrototypes function, with $D_{\mathrm{in}}(\texttt{getPrototypes})=\texttt{features}\times\texttt{labels}$ and $D_{\mathrm{out}}(\texttt{getPrototypes})=\texttt{embeddings}\times\texttt{labels}$, returns labelled prototypes given a support set of labelled examples. Each prototype is a function of all support points belonging to the same class, as defined in Eq. 4. getPrototypes is defined as a generalized LTN function, which accepts as input multiple instantiations of variables (and hence multiple domains). A more formal definition is given in Appendix A.

Groundings for both functions are defined as:

$$
\begin{aligned}
\mathcal{G}(\texttt{getEmbedding}) &= f_\theta, && (14)\\
\mathcal{G}(\texttt{getPrototypes}) &= \Pi_\theta, && (15)
\end{aligned}
$$

where $f_\theta:\mathbb{R}^{D}\to\mathbb{R}^{M}$ defines the embedding function, whereas

$$
\Pi_\theta:\ \bigcup_{l=1}^{\infty}\ \vartimes_{m=1}^{l} \big(\mathbb{R}^{D}\times\mathbb{N}\big)\ \to\ \bigcup_{l=1}^{\infty}\ \vartimes_{m=1}^{l} \big(\mathbb{R}^{M}\times\mathbb{N}\big) \qquad (16)
$$

accepts as input a list of $N_S$ labelled support examples, i.e., an element of $(\mathbb{R}^{D}\times\mathbb{N})^{N_S}$, and returns a list of labelled prototypes for all the $\tilde{N}$ classes seen in the support set, i.e., an element of $(\mathbb{R}^{M}\times\mathbb{N})^{\tilde{N}}$. Additional details are given in Appendix A.
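The behaviour of $\Pi_\theta$ can be illustrated with a minimal NumPy sketch. It assumes that the prototype of Eq. 4 is the mean embedding of the support points of each class, as in Prototypical Networks [7]; the function name `get_prototypes`, the argument `embed` (a stand-in for $f_\theta$) and the toy data are hypothetical and not taken from the released code.

```python
import numpy as np

def get_prototypes(features, labels, embed):
    """Sketch of Pi_theta: labelled support set -> labelled prototypes.

    features: (N_S, D) support features; labels: (N_S,) integer class ids;
    embed: callable standing in for f_theta, mapping (n, D) -> (n, M).
    """
    embeddings = embed(features)
    classes = np.unique(labels)  # the classes seen in the support set, sorted
    # Assumption: a prototype is the mean embedding of its class.
    prototypes = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return prototypes, classes

# Toy support set: N_S = 4 examples, D = 2, two classes.
support_x = np.array([[0.0, 0.0], [2.0, 2.0], [1.0, 3.0], [3.0, 1.0]])
support_y = np.array([0, 0, 1, 1])
double = lambda x: 2.0 * x  # hypothetical stand-in for f_theta
protos, proto_labels = get_prototypes(support_x, support_y, double)
print(protos, proto_labels)
```

The output list has one entry per class present in the support set, which is why the codomain in Eq. 16 is a union over list lengths rather than a fixed-size product.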

The isOfClass predicate for class $n \in C$ is grounded as:

$$
\mathcal{G}(\texttt{isOfClass}) = e^{-\alpha\, d(\cdot,\cdot)^{2}}, \qquad (17)
$$

where $\alpha$ is a hyperparameter and $d$ is a distance measure. $\mathcal{G}(\texttt{isOfClass}):\mathbb{R}^{M}\times\mathbb{R}^{M}\to[0,1]$ takes the value $1$ when the distance $d(\cdot,\cdot)$ from the class prototype is $0$. In our formulation the squared Euclidean distance is adopted, as in DEM [8]. Alternatively, parametric similarity functions could be used:

$$
\mathcal{G}''(\texttt{isOfClass}) = \sigma_\theta(\text{Concatenate}[\cdot,\cdot]), \qquad (18)
$$

where $\sigma_\theta$ could be an MLP with a sigmoid output activation. This formulation is closer to that of Relation Networks [19].
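As an illustration of the distance-based grounding in Eq. 17, the following NumPy sketch evaluates the truth value of isOfClass for every query/prototype pair using the squared Euclidean distance. The function name `is_of_class` and the toy inputs are assumptions made for this example.

```python
import numpy as np

def is_of_class(query_emb, prototypes, alpha=1.0):
    """Truth values exp(-alpha * d^2) for all pairs, shape (N_Q, N_P)."""
    diff = query_emb[:, None, :] - prototypes[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)  # squared Euclidean distance
    return np.exp(-alpha * sq_dist)

q_e = np.array([[1.0, 1.0], [0.0, 0.0]])  # two query embeddings
p = np.array([[1.0, 1.0], [2.0, 2.0]])    # two class prototypes
truth = is_of_class(q_e, p, alpha=0.5)
print(truth)
```

A query lying exactly on a prototype gets truth value 1, and the value decays towards 0 as the distance grows; $\alpha$ controls how fast it decays.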

III-C3 Knowledge Base

$\mathcal{K}$ represents our knowledge about the formulated problem and is updated at each training episode based on the current fake support set. $\mathcal{K}=\{\phi_{\text{aff}},\phi_{\text{neg}}\}$ contains two aggregations of formulas, which specify that each query item is a positive example for its class and a negative one for all the others:

$$
\begin{aligned}
\phi_{\text{aff}} &= \forall\,\text{Diag}(q_e,q_l)\,\big(\forall\,\text{Diag}(p,p_l): q_l = p_l\ (\texttt{isOfClass}(q_e,p))\big), && (19)\\
\phi_{\text{neg}} &= \forall\,\text{Diag}(q_e,q_l)\,\big(\forall\,\text{Diag}(p,p_l): q_l \neq p_l\ (\lnot\,\texttt{isOfClass}(q_e,p))\big). && (20)
\end{aligned}
$$

We have exploited both Diagonal Quantification and Guarded Quantifiers, whose formal definition can be found in [5].

PROTO-LTN is trained by maximizing the satisfiability of $\mathcal{K}$, i.e., by minimizing the loss

$$
\mathcal{L}^{\text{ep}} = 1-\Big(\bigwedge_{\phi\in\mathcal{K}}\phi\Big) = -\mathcal{G}(\phi_{\text{aff}}) - w_{\text{n}}\,\mathcal{G}(\phi_{\text{neg}}), \qquad (21)
$$

where the weight $w_{\text{n}}$ reflects the expectation that negations play a less discriminative role than affirmations in classification. In our experiments, we set $w_{\text{n}}=0$ and consider only $\phi_{\text{aff}}$, leaving exploration of this hyper-parameter to future work.

By introducing an aggregation function [5, 11], we obtain

$$
\mathcal{L}^{\text{ep}} = \Big(\big(-\log \mathcal{G}(\phi_{\text{aff}})\big)^{\frac{1}{p_{\text{agg}}}} + w_{\text{n}}\,\big(1-\mathcal{G}(\phi_{\text{neg}})\big)^{\frac{1}{p_{\text{agg}}}}\Big)^{p_{\text{agg}}}, \qquad (22)
$$

where $\mathcal{G}(\phi_{\text{aff}})$ is implemented through the generalized product $p$-mean operator $A_{pPR}$ and $\mathcal{G}(\phi_{\text{neg}})$ through the generalized mean operator $A_{pM}$:

$$
A_{pPR}(\tau_1,\dots,\tau_n)=\Big(\prod_{i=1}^{n}\tau_i\Big)^{\frac{1}{p_\forall}},
\qquad
A_{pM}(\tau_1,\dots,\tau_n)=\Big(\frac{1}{n}\sum_{i=1}^{n}\tau_i^{\,p_\forall}\Big)^{\frac{1}{p_\forall}}.
$$

Note that the choice of $p_{\text{agg}}$ does not need to coincide with that of $p_\forall$ for quantification, and both hyper-parameters need to be tuned experimentally.
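The following NumPy sketch shows one way the guarded quantifiers, the two aggregators and the episode loss could fit together. It rests on several assumptions that are not stated in the text: the guards are realized as boolean masks over a precomputed truth matrix, the nested quantifiers are aggregated jointly over all guarded pairs, negation is the standard $1-x$, and Eq. 22 is read as $\big((-\log \mathcal{G}(\phi_{\text{aff}}))^{1/p_{\text{agg}}} + w_{\text{n}}(1-\mathcal{G}(\phi_{\text{neg}}))^{1/p_{\text{agg}}}\big)^{p_{\text{agg}}}$. All function names are hypothetical.

```python
import numpy as np

def a_ppr(truths, p_forall):
    """Generalized product aggregator: (prod_i tau_i)^(1/p_forall)."""
    return np.prod(truths) ** (1.0 / p_forall)

def a_pm(truths, p_forall):
    """Generalized mean aggregator: (mean_i tau_i^p_forall)^(1/p_forall)."""
    return np.mean(truths ** p_forall) ** (1.0 / p_forall)

def episode_loss(truth, q_labels, p_labels, p_forall=2.0, p_agg=1.0, w_n=0.0):
    """truth[i, j] is isOfClass(q_e[i], p[j]); guards become boolean masks."""
    same = q_labels[:, None] == p_labels[None, :]  # guard q_l = p_l
    g_aff = a_ppr(truth[same], p_forall)           # satisfaction of phi_aff
    g_neg = a_pm(1.0 - truth[~same], p_forall)     # satisfaction of phi_neg
    return ((-np.log(g_aff)) ** (1.0 / p_agg)
            + w_n * (1.0 - g_neg) ** (1.0 / p_agg)) ** p_agg

truth = np.array([[0.9, 0.2], [0.1, 0.8]])
q_lab = np.array([0, 1])
p_lab = np.array([0, 1])
loss_aff_only = episode_loss(truth, q_lab, p_lab)            # w_n = 0
loss_with_neg = episode_loss(truth, q_lab, p_lab, w_n=0.5)
print(loss_aff_only, loss_with_neg)
```

With $w_{\text{n}}=0$ and $p_{\text{agg}}=1$, the settings reported for the experiments, the loss reduces to $-\log\mathcal{G}(\phi_{\text{aff}})$. For long products a log-space implementation would be numerically safer than `np.prod`.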

When optimizing a positive quantity, a common practice is to optimize its logarithm: the product of similarities takes a more convenient form when $A_{pPR}$ is used as the aggregation operator for $\forall$. Unfortunately, no equally appealing expression is obtained for $\phi_{\text{neg}}$.

If the squared Euclidean distance is used as similarity measure and the negation weight $w_{\text{n}}$ is set to 0, one obtains the same formulation as the loss function of DEM [8], up to a scaling constant. Since the guard of $\phi_{\text{aff}}$ in Eq. 19 selects the pairs whose labels match, the sum runs over each query and the prototype of its own class:

$$
\begin{aligned}
\mathcal{L}^{\text{ep}} &= -\log\Bigg(e^{-\frac{\alpha}{p_\forall}\,\Big(\sum_{n\in\tilde{C}}\ \sum_{\substack{(x^{\tilde{Q}},y^{\tilde{Q}})\in\tilde{Q}\\ \text{s.t. } y^{\tilde{Q}} = n}} d(f_\theta(x^{\tilde{Q}}),p_n)^2\Big)}\Bigg)\\
&= \frac{\alpha}{p_\forall}\,\Big(\sum_{n\in\tilde{C}}\ \sum_{\substack{(x^{\tilde{Q}},y^{\tilde{Q}})\in\tilde{Q}\\ \text{s.t. } y^{\tilde{Q}} = n}} d(f_\theta(x^{\tilde{Q}}),p_n)^2\Big). \qquad (23)
\end{aligned}
$$
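The intermediate step behind Eq. 23 can be written out explicitly. The derivation below is a sketch under two assumptions: $A_{pPR}$ is applied jointly to all query/prototype pairs that satisfy the guard $q_l = p_l$ of $\phi_{\text{aff}}$, and the pair index $(i,n)$ is a shorthand introduced here for a query $x_i$ with label $y_i$ and the prototype $p_n$ of class $n$.

```latex
\begin{aligned}
\mathcal{G}(\phi_{\text{aff}})
  &= \Big(\prod_{(i,n):\, y_i = n} e^{-\alpha\, d(f_\theta(x_i),\, p_n)^2}\Big)^{\frac{1}{p_\forall}}
   = \exp\Big(-\frac{\alpha}{p_\forall} \sum_{(i,n):\, y_i = n} d(f_\theta(x_i),\, p_n)^2\Big), \\
-\log \mathcal{G}(\phi_{\text{aff}})
  &= \frac{\alpha}{p_\forall} \sum_{(i,n):\, y_i = n} d(f_\theta(x_i),\, p_n)^2 .
\end{aligned}
```

The product of exponentials turns into the exponential of a sum, so the logarithm leaves a plain sum of squared distances scaled by $\alpha / p_\forall$.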
Algorithm 1 PROTO-LTN - GZSL Training procedure
function Train
    Input ← $q$: training images
    Input ← $q_l$: training labels
    Input ← $a$: semantic attribute set
    Input ← $a_l$: semantic attribute labels
    for $i = 1, \dots, N_{\text{TrainingSteps}}$ do
        $q_{e_i}$ ← getEmbedding($q$)
        $a_i, a_{l_i}$ ← getAttributes($a$)
        $p_i, p_{l_i}$ ← getPrototypes($a_i, a_{l_i}$)
        $\phi_{\text{aff}}$ = $\forall\,\text{Diag}(q_{e_i},q_{l_i})\,(\forall\,\text{Diag}(p_i,p_{l_i}): q_{l_i}=p_{l_i}\ (\texttt{isOfClass}(q_{e_i},p_i)))$
        $\phi_{\text{neg}}$ = $\forall\,\text{Diag}(q_{e_i},q_{l_i})\,(\forall\,\text{Diag}(p_i,p_{l_i}): q_{l_i}\neq p_{l_i}\ (\lnot\,\texttt{isOfClass}(q_{e_i},p_i)))$
        $\mathcal{L}^{\text{ep}}$ ← $\big((-\log \mathcal{G}(\phi_{\text{aff}}))^{\frac{1}{p_{\text{agg}}}} + w_{\text{n}}\,(1-\mathcal{G}(\phi_{\text{neg}}))^{\frac{1}{p_{\text{agg}}}}\big)^{p_{\text{agg}}}$
        computeGradient($\mathcal{L}^{\text{ep}}$)
        updateGradient
    end for
end function
function Test
    Input ← $q$: test images
    Input ← $a$: semantic attribute set
    $q_e$ ← getEmbedding($q$)
    $a, a_l$ ← getAttributes($a$)
    $p, p_l$ ← getPrototypes($a, a_l$)
    for $i$ in len($q_e$) do
        for $j$ in len($p$) do
            $prediction_{i,j}$ ← isOfClass($q_{e_i}, p_j$)
        end for
    end for
end function

III-D PROTO-LTN: the GZSL scenario

The GZSL setting is analogous to the FSL setting, with the main difference lying in how prototypes are defined and calculated. No generalized LTN functions are needed for the GZSL case. Computations for a training epoch are reported in Algorithm 1.

Since only one semantic vector $a^{(n)}$ is given for each class $n$, there is a one-to-one correspondence between elements of the support set and prototypes. Prototypes are produced by the semantic embedding function $g_\theta:\mathbb{R}^{A}\to\mathbb{R}^{D}$, which takes the feature space as the common embedding space. We simply define getPrototypes as a conventional LTN function, whose grounding is $\mathcal{G}(\texttt{getPrototypes})=g_\theta$. Nothing changes for the query map getEmbedding.
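A minimal NumPy sketch of the GZSL test step follows: each class semantic vector is mapped to a prototype by $g_\theta$, and a query is assigned to the class whose isOfClass truth value is highest. The random weights, the toy dimensions, the ReLU on the output layer and the `argmax` decision rule are assumptions made for illustration; query features are taken as already embedded.

```python
import numpy as np

rng = np.random.default_rng(0)
A, D, N = 5, 8, 3  # attribute dim, feature dim, number of classes (toy sizes)

# Hypothetical two-layer g_theta: R^A -> R^D with ReLU activations.
W1, b1 = rng.normal(size=(A, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, D)), np.zeros(D)

def g_theta(a):
    hidden = np.maximum(a @ W1 + b1, 0.0)
    return np.maximum(hidden @ W2 + b2, 0.0)

def predict(features, attributes, alpha=1.0):
    prototypes = g_theta(attributes)  # one prototype per class semantic vector
    sq_dist = ((features[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    truth = np.exp(-alpha * sq_dist)  # isOfClass(q_e, p) for every pair
    return truth.argmax(axis=1), truth

attributes = rng.normal(size=(N, A))
# Two queries placed exactly on the prototypes of classes 2 and 0.
features = g_theta(attributes)[[2, 0]]
pred, truth = predict(features, attributes)
print(pred)
```

Because the exponential is monotone, taking the `argmax` of the truth values is equivalent to a nearest-prototype rule on the squared distance.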

IV Experimental Settings

Experiments were conducted in both ZSL and GZSL settings on the Awa2 (Animals with Attributes) [18], CUB [28], aPY (Attribute Pascal and Yahoo) [29] and SUN (Scene Understanding) [30] benchmarks. For all datasets, image encodings, attributes and splits were collected from the original benchmark [18].

TABLE I: For PROTO-LTN, we show mean ± standard deviation and maximum (in parentheses) performance. $\text{TOP1}^{\text{ZSL}}$ (T1), $\text{TOP1}^{\text{GZSL\_UNSEEN}}$ (U), $\text{TOP1}^{\text{GZSL\_SEEN}}$ (S) and $\text{H}^{\text{GZSL}}$ (H) are always obtained on the proposed split (PS) of Awa2, CUB, aPY and SUN classes, as described in [18]. † assumes a transductive ZSL setting. Best performances are reported in bold.

| Method | Awa2 T1 | Awa2 U | Awa2 S | Awa2 H | CUB T1 | CUB U | CUB S | CUB H | aPY T1 | aPY U | aPY S | aPY H | SUN T1 | SUN U | SUN S | SUN H |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SYNC (2016) [31] | 46.6 | 10.0 | 90.5 | 18.0 | 55.6 | 11.5 | **70.9** | 19.8 | - | - | - | - | 56.3 | 7.9 | 43.3 | 13.4 |
| Relation Net (2017) [19] | 64.2 | 30.0 | 93.4 | 45.3 | 55.6 | 38.1 | 61.1 | 47 | - | - | - | - | - | - | - | - |
| PrEN (2019) [32] | 74.1 | 32.4 | 88.6 | 47.4 | 66.4 | 35.2 | 55.8 | 43.1 | - | - | - | - | **62.9** | **35.4** | 27.2 | **30.8** |
| VSE (2019) [20] | **84.4** | **45.6** | **88.7** | **60.2** | **71.9** | **39.5** | 68.9 | **50.2** | **65.4** | **43.6** | **78.7** | **56.2** | - | - | - | - |
| DEM (2017) [8] | 67.1 | 30.5 | 86.4 | 45.1 | 51.7 | 19.6 | 57.9 | 29.2 | 35.0 | 11.1 | 75.1 | 19.4 | 61.9 | 20.5 | 34.3 | 25.6 |
| PROTO-LTN | 67.6 ± 1.1 (70.8) | 32.0 ± 1.3 (34.8) | 83.7 ± 0.3 (84.3) | 46.2 ± 1.3 (49.1) | 48.8 ± 1.2 (50.3) | 20.8 ± 2.6 (23.4) | 54.3 ± 1.1 (55.7) | 30.0 ± 3.0 (33.0) | 35.0 ± 3.1 (38.6) | 17.1 ± 2.0 (19.4) | 66.2 ± 5.1 (70.7) | 27.21 ± 2.9 (30.0) | 60.4 ± 2.5 (62.1) | 20.4 ± 1.0 (22.15) | **36.8** ± 4.4 (39.9) | 26.2 ± 1.9 (28.0) |

The architecture consists of two blocks: the visual encoder and the semantic encoder. The embedding function $f_\theta$ is a ResNet101 [33] model, pretrained on ImageNet [34] and kept frozen, which converts an image $I$ into a vector $\mathbf{x}\in\mathbb{R}^{M}$, where $M=2048$. This setting is used in all experiments on all datasets.

Semantic vectors are mapped into the embedding space by a function $g_\theta$, which consists of two fully connected (FC) layers with ReLU activations, whose weights are initialized from a truncated normal distribution. We set the aggregation hyper-parameters to $p_{\text{agg}}=1$ and $p_\forall=2$, based in part on preliminary experiments on Awa2 [18].

The framework was implemented in Tensorflow based on the LTN package [5, 35]. Experiments were conducted on a workstation equipped with an Intel® Core™ i7-10700K CPU and an RTX 2080 Ti GPU. All networks were trained for 30 epochs with the Adam optimizer and a batch size of 64. Hyper-parameters (learning rate, $\alpha$ and regularization term $\lambda$) were optimized separately for each dataset. Details are reported in Appendix B. Standard performance metrics for GZSL were used, as defined in [18]. Mean and standard deviation were calculated by repeating each experiment three times.
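For reference, the GZSL metrics of [18] can be sketched in a few lines of plain Python: top-1 accuracy is averaged per class, and H is the harmonic mean of the accuracies on unseen (U) and seen (S) classes. The function names and the toy inputs are illustrative only.

```python
def per_class_top1(y_true, y_pred):
    """Top-1 accuracy averaged over classes rather than over samples."""
    classes = sorted(set(y_true))
    accuracies = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        accuracies.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(accuracies) / len(accuracies)

def harmonic_mean(acc_unseen, acc_seen):
    """H = 2 * U * S / (U + S); defined as 0 when both accuracies are 0."""
    total = acc_unseen + acc_seen
    return 0.0 if total == 0 else 2.0 * acc_unseen * acc_seen / total

acc = per_class_top1([0, 0, 0, 1], [0, 0, 1, 1])  # class 0: 2/3, class 1: 1/1
h = harmonic_mean(30.0, 90.0)
print(acc, h)
```

The harmonic mean penalizes a large gap between U and S, so a model that is accurate only on seen classes obtains a low H.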

V Results

PROTO-LTN results are reported in Table I, along with those for comparable embedding-based methods. Fig. 2 illustrates the embedding space with highlighted class prototypes.

As expected from our analysis, experimental performance is competitive with most embedding-based techniques, in particular DEM [8] and Relation Net [19], which rely on similar assumptions and the same input as the current PROTO-LTN implementation. As shown in Section III-C, under certain conditions the PROTO-LTN loss is equivalent to that of DEM, up to a scaling constant, albeit with different regularization terms. On unseen classes (U in Table I), PROTO-LTN outperforms DEM on Awa2, CUB and aPY and is on par with it on SUN (20.4 vs. 20.5): this suggests that the proposed formulation is a strong basis for a novel NeSy approach to the GZSL task.

Our method is outperformed by VSE [20], which relies on a different strategy to compute visual feature embeddings: a semantic loss aligns the embedding space with part-feature concepts provided by a semantic oracle. Since the latter relies on an external knowledge base, it contains concepts beyond the available semantic vectors $\{a^{(1)},a^{(2)},\dots,a^{(N)}\}$. This is especially advantageous in benchmarks like aPY, in which attributes are noisy and not visually informative [20]. This is a limitation of our current experiments, but it is not intrinsic to PROTO-LTNs. Indeed, $\mathcal{K}$ can be extended to include part-of relationships between concepts, and previous works have shown how these relationships can be leveraged to impose symbolic priors during learning, e.g., in object detection [4, 13]. However, the LTN formalism needs to be further extended to align part-based concepts with their visual groundings in an unsupervised fashion.

Figure 2: t-SNE visualization of class prototypes for the Awa2 dataset.

VI Conclusions and Future works

We introduced PROTO-LTN, a novel Neuro-Symbolic architecture which extends the classical formulation of LTN by borrowing from embedding-based techniques. Following the strategy of PNs, we focus entirely on learning embedding functions (such as $f_\theta$ and $g_\theta$), so that class prototypes are obtained ex-post, based on a support set. These methods are robust to noise, an essential property in FSL, and provide a scheme to embed both examples (images) and class prototypes in the same metric space. This is a key property in the context of LTNs, because it enables different levels of abstraction: one can state something either about a particular example or about an entire class, as prototypes can be viewed as parametrized labels for classes. We have shown the viability of our approach in GZSL and leave to future work the extension to other settings (e.g., few-shot or semi-supervised learning).

While our experimental results are encouraging, we argue that the strength of our formulation lies in its generality, and the full potential of PROTO-LTN is yet to be realized. Future work can aim at two complementary directions. First, alternative formulations of the isOfClass relationship could be explored, by changing the distance metric and/or the prototype encoding. Mapping class prototypes back to the input space, as done for instance in [36], could improve explainability.

Second, the knowledge base $\mathcal{K}$ could be extended to leverage prior information, e.g., from external knowledge bases, to improve generalization to unseen classes. Experiments should include both inductive and transductive settings: the assumption that information about attributes and relationships of unseen classes is available at training or test time (e.g., from WordNet) is less restrictive than assuming that actual examples, albeit unlabelled, are available.

References

  • [1] L. De Raedt, S. Dumancic, R. Manhaeve, and G. Marra, “From statistical relational to neuro-symbolic artificial intelligence,” in 29th International Joint Conference on Artificial Intelligence, 2021, pp. 4943–4950.
    L. De Raedt, S. Dumancic, R. Manhaeve, and G. Marra,“从统计关系到神经符号人工智能”,于第 29 届国际人工智能联合会议,2021 年,第 4943-4950 页。
  • [2] T. R. Besold, A. d. Garcez, S. Bader, H. Bowman, P. Domingos, P. Hitzler, K.-U. Kühnberger, L. C. Lamb, D. Lowd, P. M. V. Lima et al., “Neural-symbolic learning and reasoning: A survey and interpretation,” 2017.
    T. R. Besold, A. d. Garcez, S. Bader, H. Bowman, P. Domingos, P. Hitzler, K.-U. Kühnberger, L. C. Lamb, D. Lowd, P. M. V. Lima 等人,“神经符号学习与推理:调查与解释”,2017 年。
  • [3] I. Donadello and L. Serafini, “Compensating supervision incompleteness with prior knowledge in semantic image interpretation,” in 2019 International Joint Conference on Neural Networks (IJCNN), 2019, pp. 1–8.
    I. Donadello 和 L. Serafini,“在语义图像解释中利用先验知识弥补监督不完整性”,2019 年国际神经网络联合会议(IJCNN)论文集,2019 年,第 1-8 页。
  • [4] F. Manigrasso, F. D. Miro, L. Morra, and F. Lamberti, “Faster-LTN: a neuro-symbolic, end-to-end object detection architecture,” in International Conference on Artificial Neural Networks.   Springer, 2021, pp. 40–52.
    F. Manigrasso, F. D. Miro, L. Morra, 和 F. Lamberti, “Faster-LTN: 一种神经符号化的端到端目标检测架构,” 收录于《国际人工神经网络会议论文集》. Springer, 2021, 页码 40–52.
  • [5] S. Badreddine, A. d. Garcez, L. Serafini, and M. Spranger, “Logic tensor networks,” p. 103649, 2022.
    S. Badreddine, A. d. Garcez, L. Serafini, 和 M. Spranger, “逻辑张量网络,” 页码 103649, 2022 年。
  • [6] L. Serafini, A. d’Avila Garcez, S. Badreddine, I. Donadello, M. Spranger, and F. Bianchi, “Logic tensor networks: Theory and applications,” in Neuro-Symbolic Artificial Intelligence: The State of the Art.   IOS Press, 2021, pp. 370–394.
    L. Serafini, A. d’Avila Garcez, S. Badreddine, I. Donadello, M. Spranger, and F. Bianchi,“逻辑张量网络:理论与应用”,载于《神经符号人工智能:现状》。IOS Press,2021 年,页码 370-394。
  • [7] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” in 31st International Conference on Neural Information Processing Systems, 2017, pp. 4080–4090.
    J. Snell, K. Swersky, 和 R. Zemel, “Prototypical networks for few-shot learning,” 在第 31 届国际神经信息处理系统大会上,2017 年,第 4080-4090 页。
  • [8] L. Zhang, T. Xiang, and S. Gong, “Learning a deep embedding model for zero-shot learning,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3010–3019.
    L. Zhang, T. Xiang, 和 S. Gong, “Learning a deep embedding model for zero-shot learning,” 在 IEEE 计算机视觉和模式识别会议上,2017 年,第 3010–3019 页。
  • [9] Z. Wan, D. Chen, Y. Li, X. Yan, J. Zhang, Y. Yu, and J. Liao, “Transductive zero-shot learning with visual structure constraint,” pp. 9972–9982, 2019.
    Z. Wan, D. Chen, Y. Li, X. Yan, J. Zhang, Y. Yu, 和 J. Liao, “带视觉结构约束的传导式零样本学习,” 页码 9972–9982, 2019.
  • [10] A. Goyal and Y. Bengio, “Inductive biases for deep learning of higher-level cognition,” 2020.
    A. Goyal 和 Y. Bengio,“深度学习高层认知的归纳偏见”,2020。
  • [11] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” 2008.
    L. Van der Maaten 和 G. Hinton,“使用 t-sne 可视化数据。”2008。
  • [12] D. Yu, B. Yang, D. Liu, and H. Wang, “A survey on neural-symbolic systems,” 2021.
    余 D,杨 B,刘 D 和王 H,“神经符号系统调查”,2021。
  • [13] I. Donadello, L. Serafini, and A. D. Garcez, “Logic tensor networks for semantic image interpretation,” in 26th International Joint Conference on Artificial Intelligence, 2017, pp. 1596–1602.
    I. Donadello, L. Serafini, 和 A. D. Garcez, “逻辑张量网络用于语义图像解释,” 发表于第 26 届国际人工智能联合会议, 2017, 页码 1596–1602.
  • [14] K. Yi, J. Wu, C. Gan, A. Torralba, P. Kohli, and J. B. Tenenbaum, “Neural-symbolic VQA: disentangling reasoning from vision and language understanding,” in 32nd International Conference on Neural Information Processing Systems, 2018, pp. 1039–1050.
    K. Yi, J. Wu, C. Gan, A. Torralba, P. Kohli, and J. B. Tenenbaum,“神经符号 VQA:从视觉和语言理解中分离推理”,在第 32 届国际神经信息处理系统大会上,2018 年,第 1039-1050 页。
  • [15] R. Vedantam, K. Desai, S. Lee, M. Rohrbach, D. Batra, and D. Parikh, “Probabilistic neural symbolic models for interpretable visual question answering,” in International Conference on Machine Learning, 2019, pp. 6428–6437.
    R. Vedantam, K. Desai, S. Lee, M. Rohrbach, D. Batra, and D. Parikh,“用于可解释视觉问答的概率神经符号模型”,发表于 2019 年机器学习国际会议,第 6428-6437 页。
  • [16] Z. Li, E. Stengel-Eskin, Y. Zhang, C. Xie, Q. H. Tran, B. Van Durme, and A. Yuille, “Calibrating concepts and operations: Towards symbolic reasoning on real images,” in IEEE/CVF International Conference on Computer Vision, 2021, pp. 14 910–14 919.
    Z. Li, E. Stengel-Eskin, Y. Zhang, C. Xie, Q. H. Tran, B. Van Durme, and A. Yuille,“校准概念和操作:走向对真实图像的符号推理”,在 2021 年 IEEE/CVF 国际计算机视觉会议上,第 14 910–14 919 页。
  • [17] M. van Bekkum, M. de Boer, F. van Harmelen, A. Meyer-Vitali, and A. ten Teije, “Modular design patterns for hybrid learning and reasoning systems,” pp. 1–19, 2021.
    M. van Bekkum, M. de Boer, F. van Harmelen, A. Meyer-Vitali, 和 A. ten Teije, “混合学习和推理系统的模块化设计模式”,第 1-19 页,2021 年。
  • [18] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata, “Zero-shot learning — A comprehensive evaluation of the good, the bad and the ugly,” pp. 2251–2265, 2019.
    Y. Xian, C. H. Lampert, B. Schiele, 和 Z. Akata, “零样本学习 - 好坏丑的全面评估,” 页码 2251–2265, 2019.
  • [19] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales, “Learning to compare: Relation network for few-shot learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 1199–1208.
    F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales,“学习比较:关系网络用于少样本学习”,2018 年 IEEE/CVF 计算机视觉与模式识别会议,第 1199-1208 页。
  • [20] P. Zhu, H. Wang, and V. Saligrama, “Generalized zero-shot recognition based on visually semantic embedding,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
    朱 P,王 H,Saligrama V,“基于视觉语义嵌入的广义零样本识别”,2019 年 IEEE/CVF 计算机视觉与模式识别会议。
  • [21] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov, “DeViSE: A deep visual-semantic embedding model,” in 26th International Conference on Neural Information Processing Systems, 2013, p. 2121–2129.
    A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov,“DeViSE: 一种深度视觉-语义嵌入模型”,收录于 2013 年第 26 届国际神经信息处理系统大会,第 2121-2129 页。
  • [22] V. K. Verma, G. Arora, A. Mishra, and P. Rai, “Generalized zero-shot learning via synthesized examples,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018.
    V. K. Verma, G. Arora, A. Mishra, and P. Rai,“通过合成示例的广义零样本学习”,2018 年 IEEE 计算机视觉与模式识别会议。
  • [23] H. Huang, C. Wang, P. S. Yu, and C.-D. Wang, “Generative dual adversarial network for generalized zero-shot learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
    H. 黄,C. 王,P. S. 于,和 C.-D. 王,“用于广义零样本学习的生成式双对抗网络”,2019 年 IEEE/CVF 计算机视觉与模式识别会议。
  • [24] Y. Xing, S. Huang, L. Huangfu, F. Chen, and Y. Ge, “Robust bidirectional generative network for generalized zero-shot learning,” in IEEE International Conference on Multimedia and Expo, 2020, pp. 1–6.
    Y. Xing, S. Huang, L. Huangfu, F. Chen, and Y. Ge,“用于广义零样本学习的稳健双向生成网络”,收录于 2020 年 IEEE 国际多媒体与博览会,第 1-6 页。
  • [25] E. G. Miller, N. E. Matsakis, and P. A. Viola, “Learning from one example through shared densities on transforms,” in IEEE Conference on Computer Vision and Pattern Recognition, 2000, pp. 464–471.
    E. G. 米勒,N. E. 马特萨基斯,和 P. A. 维奥拉,“通过转换上的共享密度从一个示例中学习”,2000 年 IEEE 计算机视觉与模式识别会议,第 464-471 页。
  • [26] B. Lake, R. Salakhutdinov, J. Gross, and J. Tenenbaum, “One shot learning of simple visual concepts,” 2011.
    B. Lake, R. Salakhutdinov, J. Gross, 和 J. Tenenbaum, “One shot learning of simple visual concepts,” 2011.
  • [27] G. Koch, R. Zemel, R. Salakhutdinov et al., “Siamese neural networks for one-shot image recognition,” in ICML deep learning workshop, vol. 2.   Lille, 2015.
  • [28] C. Wah, S. Branson, P. Perona, and S. J. Belongie, “Multiclass recognition and part localization with humans in the loop,” pp. 2524–2531, 2011.
  • [29] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, “Describing objects by their attributes,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1778–1785.
  • [30] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “SUN database: Large-scale scene recognition from abbey to zoo,” in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 3485–3492.
  • [31] S. Changpinyo, W.-L. Chao, B. Gong, and F. Sha, “Synthesized classifiers for zero-shot learning,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5327–5336.
  • [32] M. Ye and Y. Guo, “Progressive ensemble networks for zero-shot recognition,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • [33] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [34] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
  • [35] S. Badreddine, A. Garcez, L. Serafini, and M. Spranger, “GTS: Logic Tensor Network library,” https://github.com/logictensornetworks/logictensornetworks, 2021.
  • [36] C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, and J. K. Su, “This looks like that: Deep learning for interpretable image recognition,” pp. 8930–8941, 2019.

-A Function grounding in PROTO-LTNs

PROTO-LTNs are based on two functions, $\texttt{embeddingFunction}=f_{\theta}$ and $\texttt{getPrototypes}$, and on the $\texttt{isOfClass}$ predicate.

The function $\texttt{getPrototypes}$, with $D_{\mathrm{in}}(\texttt{getPrototypes})=\texttt{features}\times\texttt{labels}$ and $D_{\mathrm{out}}(\texttt{getPrototypes})=\texttt{embeddings}\times\texttt{labels}$, returns labelled prototypes given a support set of labelled examples. Each prototype therefore depends on all the support examples of its class, as defined in Eq. 4, so $\texttt{getPrototypes}$ cannot be expressed as a conventional LTN function.

To understand why a generalized function is needed, recall that LTN variables are grounded onto the set of their instantiations. Assume that $s$ is a variable associated with the support points, that is:

$$\mathcal{G}(s)=\langle x^{\tilde{S}}_{1},\dots,x^{\tilde{S}}_{N_{S}}\rangle.$$

If $h$ is an LTN function that is compatible with the variable $s$, i.e., $D_{\text{in}}(h)=D(s)=\mathbb{R}^{D}$, the grounding for $h(s)$ is

$$\mathcal{G}(h(s))=\langle\mathcal{G}(h)(x^{\tilde{S}}_{1}),\dots,\mathcal{G}(h)(x^{\tilde{S}}_{N_{S}})\rangle.$$

This means that $\mathcal{G}(h)$ only takes as input a single element of $\mathbb{R}^{D}$. Unfortunately, a conventional LTN function such as $h$ cannot be used to compute prototypes, since the prototype of a class $n\in\tilde{C}$, given in Eq. 4, is defined as:

$$p_{n}=\frac{1}{K}\sum_{\substack{(x^{\tilde{S}},y^{\tilde{S}})\in\tilde{S}\\ \text{s.t. }y^{\tilde{S}}=n}}f_{\theta}(x^{\tilde{S}})=p_{n}(x^{\tilde{S}}_{1},\dots,x^{\tilde{S}}_{N_{S}}).$$

Every prototype is in fact a function of all support points belonging to the same class. As a consequence, we propose a novel definition for generalized LTN functions.

Definition 1

A generalized LTN function $F\in\mathcal{F}^{\text{gen}}$ is a function that lets multiple instantiations of variables be fed at once to $\mathcal{G}(F)$, and returns a variable. The grounding for a generalized function $F\in\mathcal{F}^{\text{gen}}$ is a function with flexible domain and range:

$$\mathcal{G}(F):\bigcup_{l=1}^{\infty}\,\vartimes_{m=1}^{l}\mathcal{G}(D_{\mathrm{in}}(F))\to\bigcup_{l=1}^{\infty}\,\vartimes_{m=1}^{l}\mathcal{G}(D_{\mathrm{out}}(F)).$$

If a generalized function $F\in\mathcal{F}^{\text{gen}}$ and a variable $x\in\mathcal{X}$ have compatible domains, i.e., $D_{\mathrm{in}}(F)=D(x)$, the grounding for $F(x)$ is defined by

$$\mathcal{G}(F(x))=\mathcal{G}(F)(\mathcal{G}(x)).$$
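The difference between a conventional and a generalized LTN function can be illustrated with a small NumPy sketch. The helper names `ground_pointwise` and `ground_generalized` are hypothetical and are not part of the LTN library; the toy values are chosen only for illustration.

```python
import numpy as np

def ground_pointwise(h, instances):
    """Conventional LTN function: h is applied to each instantiation separately."""
    return np.stack([h(x) for x in instances])

def ground_generalized(F, instances):
    """Generalized LTN function: F receives all instantiations at once."""
    return F(np.asarray(instances))

# Toy variable s with three instantiations in R^2 (illustrative values).
s = np.array([[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]])

# Pointwise grounding: one output per instantiation, shape (3, 2).
doubled = ground_pointwise(lambda x: 2.0 * x, s)

# Generalized grounding: the output depends on all instantiations together
# and may contain a different number of elements, here shape (1, 2).
centroid = ground_generalized(lambda xs: xs.mean(axis=0, keepdims=True), s)
```

A mean over the instantiations, as in the prototype definition, can only be written in the second form, because it needs every support point at the same time.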

Grounding for both functions is defined as

\begin{align}
\mathcal{G}(\texttt{embeddingFunction}) &= f_{\theta}, \tag{24}\\
\mathcal{G}(\texttt{getPrototypes}) &= \Pi_{\theta}, \tag{25}
\end{align}

where $f_{\theta}:\mathbb{R}^{D}\to\mathbb{R}^{M}$ is the same embedding function as in the FSL setting, while

$$\Pi_{\theta}:\bigcup_{l=1}^{\infty}\,\vartimes_{m=1}^{l}\mathbb{R}^{D}\times\mathbb{N}\to\bigcup_{l=1}^{\infty}\,\vartimes_{m=1}^{l}\mathbb{R}^{M}\times\mathbb{N}.$$

We structure $\Pi_{\theta}$ so that it is easy to implement (e.g., in a computational graph) and generalizes to settings in which $N_{S}$ and $\tilde{N}$ are not fixed, or in which the $N$-way-$K$-shot scenario does not hold exactly. Specifically, $\Pi_{\theta}$ works as follows.

  1. Take as input:

    (a) a support set $\tilde{S}=\{(x^{\tilde{S}}_{1},y^{\tilde{S}}_{1}),\dots,(x^{\tilde{S}}_{N_{S}},y^{\tilde{S}}_{N_{S}})\}\in(\mathbb{R}^{D}\times\mathbb{N})^{N_{S}}$ of labelled examples, with $x^{\tilde{S}}_{i}\in\mathbb{R}^{D}$ and $y^{\tilde{S}}_{i}\in\mathbb{N}$;

    (b) the parameter $\theta$ or, for the sake of clarity, the embedding function $f_{\theta}:\mathbb{R}^{D}\to\mathbb{R}^{M}$.

  2. Extract the classes contained in $\tilde{S}$ by applying

    $$p^{(labels)}=\text{Unique}(y^{\tilde{S}}),$$

    where the “Unique” function retrieves the unique elements of a vector. We call this variable $p^{(labels)}$ because it will be associated with the prototype labels. Define $\tilde{N}$ as the number of elements in $p^{(labels)}$.

  3. Define a sparse “labels” matrix $L\in\{0,1\}^{\tilde{N}\times N_{S}}$ whose $(i,j)$-th entry is equal to 1 if support item $j$ is of class $p^{(labels)}_{i}$, and 0 otherwise.

  4. Compute the prototypes tensor $p\in\mathbb{R}^{\tilde{N}\times M}$ as

    $$p=\text{Diag}(L\mathbf{1}_{N_{S}})^{-1}\,L\,f_{\theta}(x^{\tilde{S}}),$$

    where

    $$\mathbf{1}_{N_{S}}=[1,1,\dots,1]^{T}\in\mathbb{R}^{N_{S}}$$

    is a vector of $N_{S}$ ones, and

    $$f_{\theta}(x^{\tilde{S}})=[f_{\theta}(x^{\tilde{S}}_{1}),f_{\theta}(x^{\tilde{S}}_{2}),\dots,f_{\theta}(x^{\tilde{S}}_{N_{S}})]^{T}\in\mathbb{R}^{N_{S}\times M}$$

    is the element-wise application of $f_{\theta}$ to the elements of $x^{\tilde{S}}$, whereas “Diag” computes the diagonal matrix associated with a vector. The product $L\mathbf{1}_{N_{S}}$ counts the support items of each class, so this expression performs the same operation as Eq. 4, but it is more general because it allows for unbalanced support sets. In the case of balanced support sets, which correspond to a perfect $N$-way-$K$-shot scenario, one simply has $\text{Diag}(L\mathbf{1}_{N_{S}})^{-1}=\frac{1}{K}\,I$, where $I$ is the identity matrix.

  5. Return $p$ and $p^{(labels)}$.
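The five steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' TensorFlow implementation: the function name `get_prototypes`, the dense (rather than sparse) labels matrix, and the identity embedding used in the toy example are all choices made here for brevity.

```python
import numpy as np

def get_prototypes(x_support, y_support, embed):
    """Return (prototypes, prototype_labels) for a possibly unbalanced support set.

    x_support: array of shape (N_S, D)
    y_support: integer array of shape (N_S,)
    embed:     callable mapping (N_S, D) -> (N_S, M), standing in for f_theta
    """
    # Step 2: classes present in the support set (np.unique returns them sorted).
    proto_labels = np.unique(y_support)                       # (N_tilde,)
    # Step 3: L[i, j] = 1 if support item j belongs to class proto_labels[i].
    L = (proto_labels[:, None] == y_support[None, :]).astype(float)
    # Step 4: p = Diag(L 1)^{-1} L f_theta(x). Dividing each row by its
    # class count is equivalent to multiplying by the inverse diagonal matrix.
    counts = L.sum(axis=1, keepdims=True)                     # (N_tilde, 1)
    prototypes = (L @ embed(x_support)) / counts              # (N_tilde, M)
    # Step 5: return prototypes together with their labels.
    return prototypes, proto_labels

# Toy unbalanced support set: three items of class 7, two of class 3.
x = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 1.0], [1.0, 5.0], [3.0, 7.0]])
y = np.array([7, 7, 7, 3, 3])
# Identity embedding, an assumption made only to keep the example checkable by hand.
p, p_labels = get_prototypes(x, y, embed=lambda v: v)
```

Every class in `proto_labels` occurs at least once in `y_support`, so the class counts are never zero and the division is always defined.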

-B Experimental settings details

In Table II we report, for each dataset, the learning rate, $\alpha$ (distance parameter) and $\lambda$ (L2 regularization) that yielded the best performance.

TABLE II: Best hyperparameters used to train PROTO-LTN on each benchmark.

| Dataset | Lr | $\alpha$ | $\lambda$ |
|---|---|---|---|
| Awa2 | $1\times10^{-4}$ | $1\times10^{-5}$ | $1\times10^{-3}$ |
| CUB | $1\times10^{-4}$ | $1\times10^{-4}$ | $1\times10^{-3}$ |
| aPY | $1\times10^{-3}$ | $1\times10^{-5}$ | $1\times10^{-5}$ |
| SUN | $1\times10^{-3}$ | $1\times10^{-5}$ | $1\times10^{-5}$ |
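For convenience, the values in Table II could be kept in a small configuration mapping such as the one below. The dictionary layout, the key names (`lr`, `alpha`, `l2_lambda`) and the helper `hparams_for` are hypothetical and do not come from the released code; only the numbers are taken from the table.

```python
# Hyperparameters transcribed from Table II; the structure and key names are illustrative.
BEST_HPARAMS = {
    "Awa2": {"lr": 1e-4, "alpha": 1e-5, "l2_lambda": 1e-3},
    "CUB":  {"lr": 1e-4, "alpha": 1e-4, "l2_lambda": 1e-3},
    "aPY":  {"lr": 1e-3, "alpha": 1e-5, "l2_lambda": 1e-5},
    "SUN":  {"lr": 1e-3, "alpha": 1e-5, "l2_lambda": 1e-5},
}

def hparams_for(dataset):
    """Look up the reported settings for one benchmark; raises KeyError if unknown."""
    return BEST_HPARAMS[dataset]
```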