PROTOtypical Logic Tensor Networks (PROTO-LTN) for Zero Shot Learning

Simone Martone, Francesco Manigrasso, Fabrizio Lamberti and Lia Morra simone.martone@studenti.polito.it, {francesco.manigrasso, fabrizio.lamberti, lia.morra}@polito.it
Department of Control and Computer Engineering, Politecnico di Torino

Abstract

Semantic image interpretation can vastly benefit from approaches that combine sub-symbolic distributed representation learning with the capability to reason at a higher level of abstraction. Logic Tensor Networks (LTNs) are a class of neuro-symbolic systems based on a differentiable, first-order logic grounded into a deep neural network. LTNs replace the classical concept of a training set with a knowledge base of fuzzy logical axioms. By defining a set of differentiable operators to approximate the role of connectives, predicates, functions and quantifiers, a loss function is automatically specified so that LTNs can learn to satisfy the knowledge base. We focus here on the subsumption or isOfClass predicate, which is fundamental to encode most semantic image interpretation tasks. Unlike conventional LTNs, which rely on a separate predicate for each class (e.g., dog, cat), each with its own set of learnable weights, we propose a common isOfClass predicate, whose level of truth is a function of the distance between an object embedding and the corresponding class prototype. PROTOtypical Logic Tensor Networks (PROTO-LTNs) extend the current formulation by grounding abstract concepts as parametrized class prototypes in a high-dimensional embedding space, while reducing the number of parameters required to ground the knowledge base. We show how this architecture can be effectively trained in the few- and zero-shot learning scenarios. Experiments on Generalized Zero-Shot Learning benchmarks validate the proposed implementation as a competitive alternative to traditional embedding-based approaches. The proposed formulation opens up new opportunities in zero-shot learning settings, as the LTN formalism allows background knowledge to be integrated in the form of logical axioms to compensate for the lack of labelled examples. PROTO-LTN was implemented in TensorFlow and is available at https://github.com/FrancescoManigrass/PROTO-LTN.git

I Introduction

Despite their impressive performance when trained on large-scale, supervised datasets, deep neural networks still have difficulty generalizing to unseen categories. Humans, by contrast, can leverage logical reasoning to make guesses about new circumstances, and are able to infer knowledge from few or even zero examples. Recent efforts towards Neural-Symbolic (NeSy) integration [1, 2] make it possible to assimilate symbolic representation and reasoning into deep architectures: background knowledge, in the form of logical axioms, can thus be exploited during training, opening up new scenarios for settings in which labelled examples are scarce or noisy [3, 4]. Specifically, we focus here on Logic Tensor Networks (LTNs) [5], a NeSy architecture that replaces the classical concept of a training set with a Knowledge Base $\mathcal{K}$ of logical axioms, ultimately interpreted in a fuzzy way, and formulates the learning objective as maximizing the satisfiability of $\mathcal{K}$. While this framework has been applied to multi-label classification problems [5, 6] and object detection [4], its application to few- and zero-shot image classification has not yet been investigated.

In this work, we explore this task from a NeSy perspective, and propose to integrate ideas and concepts from the few-shot learning (FSL) and zero-shot learning (ZSL) domains, namely the Prototypical Networks (PNs) [7] framework, within the LTN formulation. PNs define class prototypes in a high-dimensional embedding space, so that incoming examples are assigned to the class of their nearest prototype according to some distance measure. In the LTN framework, this is achieved by representing the isOfClass relationship as a function of the distance between a class prototype and an object instance, thus obtaining the Prototypical Logic Tensor Network (PROTO-LTN) architecture. As the embedding space is the focus of the learning procedure, such prototypes may also be defined for classes that are not seen at training time.

The present study thus formulates a theoretical framework that achieves competitive results with respect to standard embedding-based ZSL architectures such as DEM [8], while offering a higher degree of flexibility. Although our analysis shows that, in their basic settings, the two formulations are equivalent, PROTO-LTNs have greater potential in both standard and transductive ZSL. They are able to integrate into the training process prior knowledge and logical constraints from an external knowledge base, including information related to unseen classes [9]. Hence, a NeSy formulation allows the embedding space to be constrained via symbolic priors.

The proposed framework also has potential advantages over traditional LTNs, even outside of the FSL and ZSL settings, since classes are represented as parametrized prototypes rather than a discrete label space [5, 4]. First, representing higher-level concepts as distributed vectorized representations makes it natural to exploit the notion of distance for highlighting relationships between symbols, with semantically related symbols having similar representations [10]. Second, prototypes ground abstract concepts in a vectorized form that can be more easily manipulated: for example, it becomes easier to define a suitable grounding for predicates that operate directly on the abstract classes, as well as on their instances. Third, prototypes are more interpretable than simple labels, as their placement in the embedding space can be easily visualized by employing dimensionality reduction methods, such as t-SNE [11].

The rest of the paper is organized as follows. In Section II, we place the present work in the context of the related literature, and provide a background on LTNs. In Section III, we describe a simple theoretical scheme to assimilate PNs into an LTN for classification purposes (PROTO-LTN), both in the FSL and ZSL scenarios. Then, in Sections IV and V, we examine the behavior of the model in the Generalized Zero-Shot Learning (GZSL) task on common benchmark datasets. Finally, in Section VI, we discuss conclusions and future work.

II Related work

II-A Neural-symbolic AI in Semantic Image Interpretation

Research on how to combine connectionist and symbolic approaches has flourished in the past few years [5, 12], with several applications in semantic image interpretation and visual query answering [5, 4, 13, 3, 14, 15, 16]. Among the plethora of compositional patterns that have been proposed [17, 12], the present work follows two main principles: knowledge representation (in the form of first-order logic) is embedded into a neural network, which in turn makes it possible to constrain the search space by leveraging explicit (and human-interpretable) domain knowledge as a symbolic prior. The latter property is extremely useful in ZSL, in which some external source of information is exploited to offer an abstract description of the classes in lieu of providing training examples. On the other hand, compared to approaches based on Inductive Logic Programming (such as [14]), in which perception and reasoning are performed by separate modules, LTNs provide tighter integration between the two subsystems.

II-B Logic Tensor Networks

LTNs have proven effective in higher-level image interpretation tasks, such as object detection and scene graph construction [13, 5]. Donadello et al. applied them to scene relationship detection in a zero-shot setting, showing how prior knowledge can compensate for the lack of supervision [3].

In the LTN framework, the term grounding denotes the interpretation of a First Order Language into a subset of the $\mathbb{R}^{n}$ domain [5]. It defines a collection of terms (objects) and formulas described in a Knowledge base $\mathcal{K}$ . For instance, to express the friendship between two terms defined as Alice and Bob, we can use the predicate friend_of:

$$\phi_{1}=\texttt{friend\_of}(Alice,Bob)\wedge\texttt{friend\_of}(Bob,Alice)$$

At the same time, we can specify formulas defining general properties, such as the symmetric nature of the friendship relationship within a specific domain:

$$\phi_{2}=\forall\,x,y\,(\texttt{friend\_of}(x,y)\Rightarrow\texttt{friend\_of}(y,x))$$

Adopting Real Logic, both formulas and terms are grounded (interpreted) as real values, with formulas taking a truth value in the $[0,1]$ interval. Specifying the grounding function $\mathcal{G}$, which maps terms and formulas into such real-valued features, yields a complete definition of a theory. Given a set of terms, aggregate formulas can be defined by approximating the unary and binary connectives and the quantifiers of fuzzy logic with suitable differentiable operators.
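As a minimal sketch of how such a grounding evaluates $\phi_{1}$ and $\phi_{2}$, the snippet below assumes the product t-norm for conjunction, the Reichenbach implication, and the arithmetic mean as the universal quantifier. These operators are one possible choice of differentiable approximations, not necessarily the configuration used in this work, and the truth values assigned to friend_of are made up for illustration.

```python
# Hypothetical truth values for friend_of(x, y) on a two-person domain.
friend_of = {
    ("Alice", "Bob"): 0.9,
    ("Bob", "Alice"): 0.8,
    ("Alice", "Alice"): 1.0,
    ("Bob", "Bob"): 1.0,
}


def fuzzy_and(a, b):
    """Product t-norm: a differentiable stand-in for conjunction."""
    return a * b


def fuzzy_implies(a, b):
    """Reichenbach implication: 1 - a + a * b."""
    return 1.0 - a + a * b


def forall(truths):
    """Universal quantifier approximated by the arithmetic mean (assumed choice)."""
    truths = list(truths)
    return sum(truths) / len(truths)


people = ["Alice", "Bob"]

# phi_1: friend_of(Alice, Bob) AND friend_of(Bob, Alice)
phi_1 = fuzzy_and(friend_of[("Alice", "Bob")], friend_of[("Bob", "Alice")])

# phi_2: for all x, y: friend_of(x, y) => friend_of(y, x)
phi_2 = forall(
    fuzzy_implies(friend_of[(x, y)], friend_of[(y, x)])
    for x in people
    for y in people
)
```

Both results are scalars in $[0,1]$, so they can be combined further or used directly as a training signal.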

In semantic image interpretation tasks, terms (objects) are typically grounded by features computed by a pre-trained convolutional neural network; it is also possible to jointly train the convolutional backbone and the LTN in an end-to-end fashion [4]. Predicate symbols $p\in\mathcal{P}$ are grounded by a function $\mathcal{G}\left(D(p)\right)\rightarrow[0,1]$. A typical predicate in semantic image interpretation is isOfClass, which represents the probability that a given object belongs to class $c$.

In conventional LTNs [5, 13, 4], predicates are typically defined as the generalization of the neural tensor network:

$$\mathcal{G}\left(\mathcal{P}\right)(\mathbf{v})=\sigma\left(u_{P}^{T}\tanh\left(\mathbf{v}^{T}W_{P}^{[1:k]}\mathbf{v}+V_{P}\mathbf{v}+b_{P}\right)\right) \tag{1}$$

where $\sigma$ is the sigmoid function, and $W_{P}^{[1:k]}\in\mathbb{R}^{k\times mn\times mn}$, $V_{P}\in\mathbb{R}^{k\times mn}$, $u_{P}\in\mathbb{R}^{k}$, and $b_{P}\in\mathbb{R}$ are learnable tensors of parameters. For multi-class problems, the sigmoid function can be replaced by a softmax layer to enforce mutual exclusivity [5].
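To make the structure of Eq. 1 concrete, the following sketch evaluates the predicate for a single input vector using plain Python lists. The function name `ntn_predicate` and the toy parameter values are illustrative assumptions; a real implementation would use a tensor library and learn the parameters by gradient descent.

```python
import math


def ntn_predicate(v, W, V, b, u):
    """Evaluate sigma(u^T tanh(v^T W[1:k] v + V v + b)) for one vector v.

    W is a list of k square matrices, V a list of k rows, u a list of k
    weights, and b a scalar bias shared by the k slices, as in Eq. 1.
    """
    dim = len(v)
    hidden = []
    for W_i, V_i in zip(W, V):
        # Bilinear term v^T W_i v for the i-th slice of the tensor.
        bilinear = sum(
            v[r] * W_i[r][c] * v[c] for r in range(dim) for c in range(dim)
        )
        # Linear term (V v)_i.
        linear = sum(w * x for w, x in zip(V_i, v))
        hidden.append(math.tanh(bilinear + linear + b))
    logit = sum(u_i * h for u_i, h in zip(u, hidden))
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid keeps the output in [0, 1]


# Toy parameters with k = 1 slice and a one-dimensional input.
truth = ntn_predicate(v=[1.0], W=[[[1.0]]], V=[[1.0]], b=0.0, u=[1.0])
```

Every class-specific predicate needs its own copy of `W`, `V`, `b` and `u`, which is the parameter overhead discussed in this section.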

This grounding requires adding a separate predicate for each class (e.g., isDog, isPerson, etc.), which is embedded into a tensor network with separate weights. Additionally, since class symbols are not grounded, predicates can only be defined for object instances, which rapidly leads to very large knowledge bases when background logical axioms need to be imposed. In contrast, our proposed grounding does not require additional model parameters, or in any case limits them to a small set which is shared among all isOfClass predicates. Furthermore, it encodes abstract classes as parametric objects that live in the same embedding space as their instances, and can be used to establish relationships with other objects (e.g., macro-category relationships). This formulation thus supports more efficient and compact representations.

The best satisfiability problem, which is the optimization problem underlying LTNs, consists of determining the values $\Theta^{*}$ that maximize the truth value of the conjunction of all formulas $\phi\in\mathcal{K}$:

$$\Theta^{*}=\operatorname*{arg\,max}_{\Theta}\;\hat{\mathcal{G}}_{\Theta}\left(\bigwedge_{\phi\in\mathcal{K}}\phi\right)-\lambda\|\Theta\|_{2}^{2} \tag{2}$$

where $\lambda||\Theta||_{2}^{2}$ is a convenient regularization term.

II-C Zero-shot learning

In zero-shot learning, a learner must be able to recognize objects from test classes, not seen during training, by leveraging some sort of description, most commonly a vector of semantic attributes [18]. In this paper, we target the Generalized zero-shot learning (GZSL) scenario, in which both seen and unseen classes appear at test time [18]. State-of-the-art techniques for ZSL classification typically fall within two categories [18, 8]: embedding-based and generative-based.

Embedding-based models [8, 19, 20, 21] compare semantic characteristics (e.g., attributes) and visual characteristics (usually taken from a pre-trained convolutional neural network) by learning a mapping to a common embedding space. Mapping the semantic space to the more compact visual feature space, rather than the opposite, alleviates the so-called hubness problem and facilitates separation between classes [8]. Standard embedding-based models are completely agnostic to any information about the test set: neither examples (even unlabelled ones) nor class attributes are assumed to be available at training time. Although based on a NeSy formulation, the proposed PROTO-LTN approach can be regarded as an embedding-based technique, as semantic concepts and visual features are mapped onto a common embedding space.

Embedding-based models tend to be naturally biased towards seen classes. To alleviate this problem, generative models were proposed with the purpose of learning a conditioned probability distribution for each class, and thus generate artificial examples of unseen classes [22, 23, 24]. A conventional classifier is trained by utilizing both the true and the generated examples. Although impressive results, especially in a GZSL context, can be achieved by taking advantage of this machinery, reduced flexibility with respect to embedding methods is entailed, as unseen classes need to be defined, so that a number of corresponding examples can be artificially synthesized. PROTO-LTNs are thus best compared with other embedding-based models, although nothing prevents them from being trained on, or combined with, generative methods.

III PROTOtypical Logic Tensor Networks

First, we introduce the basic notations related to prototypical networks in the FSL (Section III-A) and ZSL (Section III-B) settings [7]. Then, in Sections III-C and III-D, we build on these concepts and show how the PROTO-LTN training cycle is constructed by substituting the original model with a grounded $\mathcal{K}$ , and the original loss with a best satisfiability problem.

III-A Prototypical Networks: the FSL setting

We assume an $N$-way $K$-shot FSL scenario, in which a classifier is asked to pick the right class among $N$ choices, while having the chance to observe $K$ examples per class [25, 26, 27]. More specifically, the labelled examples are referred to as the support examples, and the unlabelled ones as the query examples.

The underlying assumption is that there exists an embedding space in which elements of different classes are well separated, and that this space can be reached through an embedding function $f_{\theta}$, whose parameters $\theta$ must be inferred, acting as a mapping

$$f_{\theta}:\mathbb{R}^{D}\to\mathbb{R}^{M}. \tag{3}$$

In Eq. 3, $D$ and $M$ are, respectively, the dimensions of the input space and of the embedding space. Thus, for an example $x$ , $f_{\theta}(x)$ is the corresponding embedding.

In FSL, a prototype for class $n$ is obtained as the mean embedding of the $K$ support examples of class $n$ at train time:

$$p_{n}=\frac{1}{K}\sum_{\substack{(x^{\tilde{S}},y^{\tilde{S}})\in\tilde{S}\\ \text{s.t. }y^{\tilde{S}}=n}}f_{\theta}(x^{\tilde{S}}). \tag{4}$$

Class prototypes thus need to live in the embedding space, as they embody average features shared by elements of the class they represent. At training time, $\theta$ is optimized so that the distance between each prototype and the elements of its class is minimized, while the distance between different prototypes is maximized. Finally, classification at testing time is performed by assigning each query sample to its nearest prototype.
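The prototype computation of Eq. 4 and the nearest-prototype rule can be sketched in a few lines. The helper names and the identity embedding used as a stand-in for $f_{\theta}$ are illustrative assumptions; in practice $f_{\theta}$ is a trained network.

```python
import math
from collections import defaultdict


def class_prototypes(support, embed):
    """Mean embedding per class (Eq. 4); `support` is a list of (x, label) pairs."""
    grouped = defaultdict(list)
    for x, label in support:
        grouped[label].append(embed(x))
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def nearest_prototype(x, prototypes, embed):
    """Assign x to the class whose prototype is closest in Euclidean distance."""
    z = embed(x)
    return min(prototypes, key=lambda label: math.dist(z, prototypes[label]))


def identity_embedding(x):
    """Placeholder for f_theta: returns the input features unchanged."""
    return list(x)


# A 2-way 2-shot toy support set.
support = [
    ((0.0, 0.0), 0),
    ((2.0, 0.0), 0),
    ((10.0, 10.0), 1),
    ((12.0, 10.0), 1),
]
protos = class_prototypes(support, identity_embedding)
predicted = nearest_prototype((1.5, 0.5), protos, identity_embedding)
```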

At testing time, a support set $S=\{(x^{S}_{1},y^{S}_{1}),...,(x^{S}_{N_{S}},y^{S}_{N_{S}})\}$ of $N_{S}$ labelled examples is available, where each $x^{S}_{i}\in\mathbb{R}^{D}$ is the feature vector of an example, and $y^{S}_{i}\in C\subset\mathbb{N}$ is the corresponding label. Assuming an $N$-way $K$-shot scenario, exactly $K$ support examples are available for each of the $N$ classes. A query set $Q=\{x^{Q}_{1},...,x^{Q}_{N_{Q}}\}$ of $N_{Q}$ unlabelled examples is then supplied, and the task is to assign each example to its correct class. The elements of the query set $Q$ belong to the same domain as those of the support set $S$.

At training time, it may be impossible to know which classes the testing scenario will yield. In other words, the support set $S$ is not accessible in advance. To cope with this, a training set $T=\{(x^{T}_{1},y^{T}_{1}),...,(x^{T}_{N_{T}},y^{T}_{N_{T}})\}$ is chosen that reflects the best prior information available about the testing scenario, with labels $y_{i}^{T}\in C_{T}\subset\mathbb{N}$ and $|C_{T}|=N_{T}$ classes, whose number can equal or exceed that of the test classes ($N_{T}\geq N$). The training and test classes may overlap ($C\cap C_{T}\neq\emptyset$), but this cannot be known in advance. Then, fake support and query sets $\tilde{S}\subset T$ and $\tilde{Q}\subset T$ are extracted to mimic the testing scenario and instruct the model to learn accordingly.
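One common way to build such fake support and query sets is episodic sampling: draw $N$ classes from $C_{T}$, then draw $K$ support and a few query examples per class without overlap. The sketch below illustrates this under assumed names (`sample_episode`, `n_query`); the exact sampling scheme used in the experiments may differ.

```python
import random
from collections import defaultdict


def sample_episode(train, n_way, k_shot, n_query, rng):
    """Draw a fake N-way K-shot support set and a disjoint query set from `train`.

    `train` is a list of (x, label) pairs; `n_query` is the number of query
    examples drawn per class (a hypothetical parameter for this sketch).
    """
    by_class = defaultdict(list)
    for x, y in train:
        by_class[y].append(x)
    classes = rng.sample(sorted(by_class), n_way)
    support, query = [], []
    for c in classes:
        picked = rng.sample(by_class[c], k_shot + n_query)
        support += [(x, c) for x in picked[:k_shot]]
        query += [(x, c) for x in picked[k_shot:]]
    return support, query


rng = random.Random(0)
# Toy training set: 5 classes with 10 examples each; features are (class, index).
train = [((c, i), c) for c in range(5) for i in range(10)]
support, query = sample_episode(train, n_way=3, k_shot=2, n_query=4, rng=rng)
```

During training the query labels are kept, since they are needed to evaluate the loss; only at testing time are the query examples truly unlabelled.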

III-B Prototypical networks: the ZSL setting

In ZSL, labelled examples are not available for all classes. Instead, it is assumed that $N$ abstract vectors denoted as $\{a^{(1)},a^{(2)},...,a^{(N)}\}$, with $a^{(n)}\in\mathbb{R}^{A}$, encode the characteristics of all $N$ classes.

As in FSL, at training time one takes advantage of a set $T=\{(x^{T}_{1},y^{T}_{1}),...,(x^{T}_{N_{T}},y^{T}_{N_{T}})\}$ of labelled examples from classes $y^{T}_{i}\in C_{T}\subset\mathbb{N}$, where preferably $|C_{T}|=N_{T}\geq N=|C|$. The training cycle remains unchanged in the ZSL case, but class prototypes are defined differently:

• the embedding for a query example $x^{Q}$ is still obtained as $f_{\theta}(x^{Q})$, where $f_{\theta}:\mathbb{R}^{D}\to\mathbb{R}^{M}$;

• the prototype for class $n\in C$ is extracted as $p_{n}=g_{\theta}(a^{(n)})$ via a separate embedding function $g_{\theta}:\mathbb{R}^{A}\to\mathbb{R}^{M}$, which maps the semantic attribute space to the common embedding space.
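The second point can be illustrated with a small sketch in which $g_{\theta}$ is stood in for by a fixed linear map from a 3-dimensional attribute space to a 2-dimensional embedding space. The class names, attribute values and weights are all hypothetical; the actual attribute encoder is a trained network.

```python
def linear_map(weights, bias):
    """Build a toy stand-in for g_theta: a -> W a + b, using plain lists."""
    def g(a):
        return [
            sum(w * v for w, v in zip(row, a)) + b
            for row, b in zip(weights, bias)
        ]
    return g


# Hypothetical encoder from A = 3 attributes to M = 2 embedding dimensions.
g_theta = linear_map(weights=[[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]], bias=[0.0, 0.5])

# Hypothetical attribute vectors; "zebra" could be a class unseen in training.
attributes = {
    "horse": [1.0, 0.0, 0.0],
    "zebra": [1.0, 1.0, 0.0],
}

# One prototype per class, computed without any labelled image of that class.
prototypes = {name: g_theta(a) for name, a in attributes.items()}
```

Because prototypes depend only on attribute vectors, adding a new class at testing time amounts to one extra call to the encoder.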

III-C PROTO-LTN: the FSL scenario

Figure 1: PROTO-LTN architecture for ZSL classification. The architecture is composed of a convolutional feature extractor and an attribute encoder. The two branches map semantic and visual features into a common embedding space. The isOfClass predicate aims to minimize the distance between instances (solid line circles) and class prototypes (dashed line circles) based on affirmative and negative formulas embedded in the knowledge base $\mathcal{K}$. At training time, the loss function maximizes the satisfiability (truth value) of all formulas in $\mathcal{K}$.

The overall architecture of PROTO-LTN, when tailored to the ZSL scenario, is illustrated in Fig. 1. The input image embeddings are extracted from a convolutional neural network, while attribute vectors are mapped into the embedding domain through an embedding function. In this section, we detail the groundings of the constants, variables, functions and predicates. Then, the Knowledge Base $\mathcal{K}$ which encodes our learning problem is defined.

III-C1 Grounding terms

Within a single training episode, a batch of training samples is selected in the form of fake support $\tilde{S}$ and query $\tilde{Q}$ sets. Groundings for variables and their domain $D$ (not learnable) can be defined as

$$
\begin{aligned}
\mathcal{G}(q) &= \langle x^{\tilde{Q}}_{1},...,x^{\tilde{Q}}_{N_{\tilde{Q}}}\rangle, && (5)\\
\mathcal{G}(q_{l}) &= \langle y^{\tilde{Q}}_{1},...,y^{\tilde{Q}}_{N_{\tilde{Q}}}\rangle, && (6)\\
\mathcal{G}(q_{e}) &= \mathcal{G}(\texttt{getEmbedding}(q)) && (7)\\
&= \langle f_{\theta}(x^{\tilde{Q}}_{1}),...,f_{\theta}(x^{\tilde{Q}}_{N_{\tilde{Q}}})\rangle, && (8)\\
\mathcal{G}(s) &= \langle x^{\tilde{S}}_{1},...,x^{\tilde{S}}_{N_{S}}\rangle, && (9)\\
\mathcal{G}(s_{l}) &= \langle y^{\tilde{S}}_{1},...,y^{\tilde{S}}_{N_{S}}\rangle, && (10)\\
\mathcal{G}(p),\,\mathcal{G}(p_{l}) &= \mathcal{G}(\texttt{getPrototypes}(s,s_{l})) && (11)\\
&= \Pi_{\theta}(\mathcal{G}(s,s_{l})) && (12)\\
&= \Pi_{\theta}(\langle (x^{\tilde{S}}_{1},y^{\tilde{S}}_{1}),...,(x^{\tilde{S}}_{N_{S}},y^{\tilde{S}}_{N_{S}})\rangle), && (13)
\end{aligned}
$$

where $q$ are the query examples ( $D(q)=\texttt{features}$ ), $q_{l}$ are the corresponding labels ( $D(q_{l})=\texttt{labels}$ ), and $q_{e}$ are their embeddings ( $D(q_{e})=\texttt{embeddings}$ ). Conversely, $s$ are the examples in the support set ( $D(s)=\texttt{features}$ ) and $s_{l}$ their labels. Finally, $p$ and $p_{l}$ are the prototypes and their labels, respectively, with $D(p)=\texttt{embeddings}$ and $D(p_{l})=\texttt{labels}$ .

III-C2 Grounding functions and predicates

PROTO-LTNs are based on two functions (getEmbedding and getPrototypes) and the isOfClass predicate.

getEmbedding is a conventional LTN function that maps image features into the embedding space, i.e., from $D_{\mathrm{in}}(\texttt{getEmbedding})=\texttt{features}$ to $D_{\mathrm{out}}(\texttt{getEmbedding})=\texttt{embeddings}$.

The getPrototypes function, with $D_{\mathrm{in}}(\texttt{getPrototypes})=\texttt{features}\times\texttt{labels}$ and $D_{\mathrm{out}}(\texttt{getPrototypes})=\texttt{embeddings}\times\texttt{labels}$, returns labelled prototypes given a support set of labelled examples. Each prototype is in fact a function of all support points belonging to the same class, as defined in Eq. 4. It is defined as a generalized LTN function, which accepts as input multiple instantiations of variables (and hence multiple domains). A more formal definition is given in Appendix A.

Groundings for both functions are defined as:

$$
\begin{aligned}
\mathcal{G}(\texttt{getEmbedding}) &= f_{\theta}, && (14)\\
\mathcal{G}(\texttt{getPrototypes}) &= \Pi_{\theta}, && (15)
\end{aligned}
$$

where $f_{\theta}:\mathbb{R}^{D}\to\mathbb{R}^{M}$ defines the embedding function, whereas

$$\Pi_{\theta}:\bigcup_{l=1}^{\infty}\vartimes_{m=1}^{l}\mathbb{R}^{D}\times\mathbb{N}\to\bigcup_{l=1}^{\infty}\vartimes_{m=1}^{l}\mathbb{R}^{M}\times\mathbb{N} \tag{16}$$

accepts as input a list of $N_{S}$ labelled support examples, i.e., an element of $(\mathbb{R}^{D}\times\mathbb{N})^{N_{S}}$, and returns a list of labelled prototypes for all the $\tilde{N}$ classes seen in the support set, i.e., an element of $(\mathbb{R}^{M}\times\mathbb{N})^{\tilde{N}}$. Additional details are given in Appendix A.

The isOfClass predicate for class $n\in C$ is grounded as:

$$\mathcal{G}(\texttt{isOfClass})=e^{-\alpha\,d(\cdot,\cdot)^{2}}, \tag{17}$$

where $\alpha$ is a hyperparameter and $d$ is a measure of distance. $\mathcal{G}(\texttt{isOfClass}):\mathbb{R}^{M}\times\mathbb{R}^{M}\to[0,1]$ ; $\mathcal{G}(\texttt{isOfClass})$ takes the value of $1$ when the distance from the class prototype $d(\cdot,\cdot)$ is $0$ . In our formulation the Euclidean distance squared is adopted, as in DEM [8]. Alternatively, parametric similarity functions could be used:

$$\mathcal{G}^{\prime\prime}(\texttt{isOfClass})=\sigma_{\theta}(\text{Concatenate}[\cdot,\cdot]). \tag{18}$$

where $\sigma_{\theta}$ could be an MLP with a sigmoid output activation. This formulation is closer to that of Relation Networks [19].
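The distance-based grounding of Eq. 17 is simple enough to write out directly. The sketch below reads $d$ as the Euclidean distance, so that $d^{2}$ is the squared Euclidean distance; the function name and the default value of $\alpha$ are illustrative assumptions.

```python
import math


def is_of_class(embedding, prototype, alpha=1.0):
    """Truth value exp(-alpha * d^2), with d the Euclidean distance (Eq. 17)."""
    squared_distance = sum((e - p) ** 2 for e, p in zip(embedding, prototype))
    return math.exp(-alpha * squared_distance)


# An embedding that coincides with the prototype is fully true.
exact_match = is_of_class([1.0, 2.0], [1.0, 2.0])

# At unit distance, alpha = ln 2 gives a truth value of one half.
unit_distance = is_of_class([0.0, 0.0], [1.0, 0.0], alpha=math.log(2.0))
```

The hyperparameter $\alpha$ controls how quickly the truth value decays with distance: larger values make the predicate more selective around each prototype. Note that the predicate itself has no learnable parameters; all learning happens in the embedding functions.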

III-C3 Knowledge Base

$\mathcal{K}$ represents our knowledge about the formulated problem and is updated at each training episode based on the current fake support set. $\mathcal{K}=\{\phi_{\text{aff}},\phi_{\text{neg}}\}$ contains two aggregations of formulas which specify that each query item is a positive example for its class, and a negative one for all the others:

\phi_{\text{aff}}=\forall\text{Diag}(q_{e},q_{l})(\forall\text{Diag}(p,p_{l}):{q_{l}=p_{l}}(\texttt{isOfClass}(q_{e},p))),

(19)

\phi_{\text{neg}}=\forall\,\text{Diag}(q_{e},q_{l})\,(\forall\,\text{Diag}(p,p_{l}):{q_{l}\neq p_{l}}\,(\lnot\texttt{isOfClass}(q_{e},p))).

(20)

We have exploited both Diagonal Quantification and Guarded Quantifiers, whose formal definition can be found in [5].

PROTO-LTN is trained by maximizing the satisfiability of $\mathcal{K}$, i.e., by minimizing the episode loss

\mathcal{L}^{\text{ep}}=1-\left(\bigwedge_{\phi\in\mathcal{K}}\phi\right)=-\mathcal{G}(\phi_{\text{aff}})-w_{\text{n}}\,\mathcal{G}(\phi_{\text{neg}}),

(21)

where the weight $w_{\text{n}}$ reflects the expectation that negations play a less discriminative role than affirmations in classification. In our experiments, we set $w_{\text{n}}=0$ and consider only $\phi_{\text{aff}}$, leaving the exploration of this hyper-parameter to future work.

By introducing an aggregation function [5, 11], we obtain

\mathcal{L}^{\text{ep}}=\bigg(\big(-\log\mathcal{G}(\phi_{\text{aff}})\big)^{\frac{1}{p_{\text{agg}}}}+w_{\text{n}}\big(1-\mathcal{G}(\phi_{\text{neg}})\big)^{\frac{1}{p_{\text{agg}}}}\bigg)^{p_{\text{agg}}}

(22)

where $\mathcal{G}(\phi_{\text{aff}})$ is implemented through the generalized product $p$-mean operator $A_{pPR}$ and $\mathcal{G}(\phi_{\text{neg}})$ through the generalized mean operator $A_{pM}$:

\displaystyle A_{pPR}(\tau_{1},...,\tau_{n})=\bigg(\prod_{i=1}^{n}\tau_{i}\bigg)^{\frac{1}{p_{\forall}}},\qquad A_{pM}(\tau_{1},...,\tau_{n})=\bigg(\frac{1}{n}\sum_{i=1}^{n}\tau_{i}^{p_{\forall}}\bigg)^{\frac{1}{p_{\forall}}}.

It should be noted that the choice of $p_{agg}$ does not need to coincide with that of $p_{\forall}$ used for quantification, and both hyper-parameters need to be tuned experimentally.
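The two aggregators can be sketched in a few lines of plain Python. This is only an illustrative reading of the definitions above, assuming that the inner exponent of $A_{pM}$ is also $p_{\forall}$; the function names and the truth values are hypothetical.

```python
import math


def a_ppr(truths, p_forall):
    """Generalized product aggregator: (prod_i tau_i) ** (1 / p_forall)."""
    return math.prod(truths) ** (1.0 / p_forall)


def a_pm(truths, p_forall):
    """Generalized mean aggregator: (mean_i tau_i ** p_forall) ** (1 / p_forall)."""
    n = len(truths)
    return (sum(t ** p_forall for t in truths) / n) ** (1.0 / p_forall)


# Hypothetical truth values of a formula over three groundings.
truths = [0.9, 0.8, 0.5]
print(a_ppr(truths, p_forall=2))  # product-based: penalizes any low truth value
print(a_pm(truths, p_forall=2))   # mean-based: closer to an average
```

With the product-based operator a single low truth value pulls the aggregate down sharply, whereas the mean-based operator is more tolerant of outliers.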

When optimizing a positive quantity, a common practice consists in optimizing its logarithm: the product between similarities takes a more desirable form when $A_{pPR}$ is used as the aggregation operator for $\forall$ . Unfortunately, one does not obtain an equally appealing expression for $\phi_{\text{neg}}$ .

If the squared Euclidean distance is used as the distance measure and the negation weight $w_{\text{n}}$ is set to 0, one obtains the same formulation as the loss function of DEM [8], up to a scaling constant:

	$\displaystyle\mathcal{L}^{\text{ep}}$	$\displaystyle=-\log\Bigg{(}e^{-\frac{\alpha}{p_{\forall}}\,\big{(}\sum_{n\in\tilde{C}}\,\sum_{\begin{subarray}{c}(x^{\tilde{Q}},y^{\tilde{Q}})\in\tilde{Q}\\ \text{s.t. }y^{\tilde{Q}}=n\end{subarray}}\,d(f_{\theta}(x^{\tilde{Q}}),p_{n})^{2}\big{)}}\Bigg{)}$
		$\displaystyle=\frac{\alpha}{p_{\forall}}\,\Big{(}\sum_{n\in\tilde{C}}\,\sum_{\begin{subarray}{c}(x^{\tilde{Q}},y^{\tilde{Q}})\in\tilde{Q}\\ \text{s.t. }y^{\tilde{Q}}=n\end{subarray}}\,d(f_{\theta}(x^{\tilde{Q}}),p_{n})^{2}\Big{)}.$		(23)
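The equivalence stated in Eq. 23 can be checked numerically. The sketch below assumes $w_{\text{n}}=0$, $p_{agg}=1$, and that each query is paired with the prototype of its own class, as required by the guard $q_{l}=p_{l}$ in $\phi_{\text{aff}}$. All names and toy values are hypothetical.

```python
import math


def squared_euclidean(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def episode_loss(queries, labels, prototypes, alpha, p_forall):
    """-log of the product aggregation of isOfClass(q, p_y) over all queries."""
    truths = [
        math.exp(-alpha * squared_euclidean(q, prototypes[y]))
        for q, y in zip(queries, labels)
    ]
    return -math.log(math.prod(truths) ** (1.0 / p_forall))


def scaled_squared_distances(queries, labels, prototypes, alpha, p_forall):
    """Closed form: (alpha / p_forall) * sum of squared query-prototype distances."""
    total = sum(squared_euclidean(q, prototypes[y]) for q, y in zip(queries, labels))
    return alpha / p_forall * total


# Toy episode (hypothetical): two classes, three embedded queries.
prototypes = {0: [0.0, 0.0], 1: [1.0, 1.0]}
queries = [[0.1, -0.2], [0.9, 1.3], [0.5, 0.5]]
labels = [0, 1, 0]

lhs = episode_loss(queries, labels, prototypes, alpha=0.1, p_forall=2)
rhs = scaled_squared_distances(queries, labels, prototypes, alpha=0.1, p_forall=2)
print(lhs, rhs)  # the two values agree up to floating-point error
```

In practice the closed form is preferable, since the product of many small truth values can underflow before the logarithm is taken.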

Algorithm 1 PROTO-LTN - GZSL Training procedure

function Train
  Input ← $q$ (training images)
  Input ← $q_{l}$ (training labels)
  Input ← $a$ (semantic attribute set)
  Input ← $a_{l}$ (semantic attribute labels)
  for $i$ in $N_{TrainingSteps}$ do
    $q_{e_{i}}\leftarrow\texttt{getEmbedding}(q)$
    $a_{i},a_{l_{i}}\leftarrow\texttt{getAttributes}(a)$
    $p_{i},p_{l_{i}}\leftarrow\texttt{getPrototypes}(a_{i},a_{l_{i}})$
    $\phi_{\text{aff}}\leftarrow\forall\text{Diag}(q_{e_{i}},q_{l_{i}})(\forall\text{Diag}(p_{i},p_{l_{i}}):{q_{l_{i}}=p_{l_{i}}}(\texttt{isOfClass}(q_{e_{i}},p_{i})))$
    $\phi_{\text{neg}}\leftarrow\forall\,\text{Diag}(q_{e_{i}},q_{l_{i}})\,(\forall\,\text{Diag}(p_{i},p_{l_{i}}):{q_{l_{i}}\neq p_{l_{i}}}\,(\lnot\texttt{isOfClass}(q_{e_{i}},p_{i})))$
    $\mathcal{L}^{\text{ep}}\leftarrow\bigg(\big(-\log\mathcal{G}(\phi_{\text{aff}})\big)^{\frac{1}{p_{\text{agg}}}}+w_{\text{n}}\big(1-\mathcal{G}(\phi_{\text{neg}})\big)^{\frac{1}{p_{\text{agg}}}}\bigg)^{p_{\text{agg}}}$
    computeGradient($\mathcal{L}^{\text{ep}}$)
    updateGradient
  end for
end function

function Test
  Input ← $q$ (test images)
  Input ← $a$ (semantic attribute set)
  $q_{e}\leftarrow\texttt{getEmbedding}(q)$
  $a,a_{l}\leftarrow\texttt{getAttributes}(a)$
  $p,p_{l}\leftarrow\texttt{getPrototypes}(a,a_{l})$
  for $i$ in len($q_{e}$) do
    for $j$ in len($p$) do
      $prediction_{i}\leftarrow\texttt{isOfClass}(q_{e_{i}},p_{j})$
    end for
  end for
end function

III-D PROTO-LTN: the GZSL scenario

The GZSL setting is analogous to the FSL setting, with the main difference lying in how prototypes are defined and calculated. No generalized LTN functions are needed for the GZSL case. Computations for a training epoch are reported in Algorithm 1.

Since only one semantic vector $a^{(n)}$ is given for each class $n$, there is a 1-to-1 correspondence between elements of the support set and prototypes. Prototypes are produced by the semantic embedding function $g_{\theta}:\mathbb{R}^{A}\to\mathbb{R}^{D}$, with the feature space serving as the common embedding space. We simply define getPrototypes as a conventional LTN function, whose grounding is $\mathcal{G}(\texttt{getPrototypes})=g_{\theta}$. Nothing changes for the query map getEmbedding.
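The GZSL test step can be sketched as follows: every class prototype is obtained by passing its semantic vector through $g_{\theta}$, and each query is compared with all prototypes via isOfClass. Assigning the class with the highest truth value is an assumption of this sketch (Algorithm 1 only lists the pairwise evaluations), and `toy_g` is a hypothetical stand-in for the learned $g_{\theta}$.

```python
import math


def is_of_class(embedding, prototype, alpha):
    """Truth value exp(-alpha * squared Euclidean distance), as in Eq. 17."""
    d2 = sum((a - b) ** 2 for a, b in zip(embedding, prototype))
    return math.exp(-alpha * d2)


def get_prototypes(attributes, attribute_labels, g_theta):
    """GZSL case: one prototype per class, obtained by embedding its semantic vector."""
    return [g_theta(a) for a in attributes], list(attribute_labels)


def predict(query_embeddings, prototypes, prototype_labels, alpha):
    """Assign each query to the class whose prototype yields the highest truth value."""
    predictions = []
    for q in query_embeddings:
        truths = [is_of_class(q, p, alpha) for p in prototypes]
        best = max(range(len(truths)), key=truths.__getitem__)
        predictions.append(prototype_labels[best])
    return predictions


def toy_g(a):
    """Hypothetical stand-in for g_theta: doubles every attribute value."""
    return [2.0 * x for x in a]


attributes = [[0.0, 1.0], [1.0, 0.0]]  # one semantic vector per class
class_ids = [7, 3]                     # arbitrary class labels
protos, proto_labels = get_prototypes(attributes, class_ids, toy_g)
queries = [[0.1, 1.8], [1.9, 0.2]]     # already-embedded test images
print(predict(queries, protos, proto_labels, alpha=0.5))  # [7, 3]
```

Because isOfClass decreases monotonically with distance, this rule coincides with nearest-prototype classification in the embedding space.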

IV Experimental Settings

Experiments were conducted in both ZSL and GZSL settings on the Awa2 (Animals with Attributes) [18], CUB [28], aPY (Attribute Pascal and Yahoo) [29] and SUN (Scene Understanding) [30] benchmarks. For all datasets, image encodings, attributes and splits were collected from the original benchmark [18].

TABLE I: For PROTO-LTN, we show mean $\pm$ standard deviation and maximum (in parentheses) performance. $\text{TOP1}^{\text{ZSL}}$ (T1), $\text{TOP1}^{\text{GZSL\_UNSEEN}}$ (U), $\text{TOP1}^{\text{GZSL\_SEEN}}$ (S) and $\text{H}^{\text{GZSL}}$ (H) are always obtained on the proposed split (PS) of Awa2, CUB, aPY and SUN classes, as described in [18]. $\dagger$ assumes a transductive ZSL setting. Best performances are reported in bold.

Method	Awa2				CUB				aPY				SUN
	T1	U	S	H	T1	U	S	H	T1	U	S	H	T1	U	S	H
SYNC (2016) [31]	$46.6$	$10.0$	$90.5$	$18.0$	$55.6$	$11.5$	70.9	$19.8$	-	-	-	-	$56.3$	$7.9$	$43.3$	$13.4$
Relation Net (2017) [19]	$64.2$	$30.0$	$93.4$	$45.3$	$55.6$	$38.1$	$61.1$	$47$	-	-	-	-	-	-	-	-
PrEN$^{\dagger}$ (2019) [32]	$74.1$	$32.4$	$88.6$	$47.4$	$66.4$	$35.2$	$55.8$	$43.1$	-	-	-	-	62.9	35.4	$27.2$	30.8
VSE (2019) [20]	84.4	45.6	88.7	60.2	71.9	39.5	$68.9$	50.2	65.4	43.6	78.7	56.2	-	-	-	-
DEM (2017) [8]	$67.1$	$30.5$	$86.4$	$45.1$	$51.7$	$19.6$	$57.9$	$29.2$	$35.0$	$11.1$	$75.1$	$19.4$	$61.9$	$20.5$	$34.3$	$25.6$
PROTO-LTN	$67.6$	$32.0$	$83.7$	$46.2$	$48.8$	$20.8$	$54.3$	$30.0$	$35.0$	$17.1$	$66.2$	$27.21$	$60.4$	$20.4$	36.8	$26.2$
	±1.1	±1.3	±0.3	±1.3	±1.2	±2.6	±1.1	±3.0	±3.1	±2.0	±5.1	±2.9	±2.5	±1.0	±4.4	±1.9
	(70.8)	(34.8)	(84.3)	(49.1)	(50.3)	(23.4)	(55.7)	(33.0)	(38.6)	(19.4)	(70.7)	(30.0)	(62.1)	(22.15)	(39.9)	(28.0)

The entire architecture is composed of two blocks: the visual encoder and the semantic encoder. The embedding function $f_{\theta}$ consists of a ResNet101 [33] model, pretrained on ImageNet [34] and kept frozen, which converts an image $I$ into a vector $\mathbf{x}\in\mathbb{R}^{M}$, where $M=2048$. This setting is maintained in all experiments on all datasets.

Semantic vectors are encoded in the embedding space via a function $g_{\theta}$, which consists of two fully connected (FC) layers with ReLU activation, initialized from a truncated normal distribution. We set the aggregation hyper-parameters to $p_{agg}=1$ and $p_{\forall}=2$, also taking into account preliminary experiments on Awa2 [18].

The framework was implemented in Tensorflow based on the LTN package [5, 35]. Experiments were conducted on a workstation equipped with an Intel® Core™ i7-10700K CPU and an RTX 2080 Ti GPU. All networks were trained for 30 epochs with the Adam optimizer and a batch size of 64. Hyper-parameters (learning rate, $\alpha$ and regularization term $\lambda$) were optimized separately for each dataset. Details are reported in Appendix B. Standard performance metrics for GZSL were used, as defined in [18]. Mean and standard deviation were calculated by repeating each experiment three times.
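For reference, the H metric of [18] is the harmonic mean of the top-1 accuracies on unseen (U) and seen (S) classes. The short sketch below, with a hypothetical function name, reproduces the DEM entry for Awa2 in Table I from its U and S values.

```python
def harmonic_mean(unseen_acc, seen_acc):
    """H = 2 * U * S / (U + S); defined as 0 when both accuracies are 0."""
    if unseen_acc + seen_acc == 0:
        return 0.0
    return 2.0 * unseen_acc * seen_acc / (unseen_acc + seen_acc)


# DEM on Awa2 in Table I: U = 30.5, S = 86.4.
print(round(harmonic_mean(30.5, 86.4), 1))  # 45.1
```

Applying the same formula to the mean U and S of PROTO-LTN on Awa2 gives about 46.3 rather than the reported 46.2; a plausible reason is that H is computed per run and then averaged, since the harmonic mean of averages generally differs from the average of harmonic means.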

V Results

PROTO-LTN results are reported in Table I, along with those for comparable embedding-based methods. Fig. 2 illustrates the embedding space with highlighted class prototypes.

As expected from our analysis, experimental performance is competitive with most embedding-based techniques, in particular DEM [8] and Relation Net [19], which rely on similar assumptions and the same input as the current PROTO-LTN implementation. As shown in Section III-C, under certain conditions the PROTO-LTN loss is equivalent to that of DEM, up to a scaling constant, albeit with different regularization terms. On unseen classes, PROTO-LTN outperforms DEM on Awa2, CUB and aPY and is on par with it on SUN (20.4 vs. 20.5 in Table I): this suggests that the proposed formulation is a strong basis for a novel, NeSy approach to the GZSL task.

Our method is outperformed by VSE, which relies on a different strategy to compute visual feature embeddings: a semantic loss aligns the embedding space with part-feature concepts provided by a semantic oracle. Since the latter relies on an external knowledge base, it contains concepts beyond the available semantic vectors $\{a^{(1)},a^{(2)},...,a^{(N)}\}$. This is especially advantageous in benchmarks like aPY, in which attributes are noisy and not visually informative [20]. This is a limitation of our current experiments, but it is not intrinsic to PROTO-LTNs. Indeed, $\mathcal{K}$ can be extended to include part-of relationships between concepts, and previous works have shown how these relationships can be leveraged to impose symbolic priors during learning, e.g., in object detection [4, 13]. However, the LTN formalism needs to be further extended to align part-based concepts with their visual groundings in an unsupervised fashion.

VI Conclusions and Future Work

We introduced PROTO-LTN, a novel Neuro-Symbolic architecture which extends the classical formulation of LTN by borrowing from embedding-based techniques. Following the strategy of PNs, we focus entirely on learning embedding functions (such as $f_{\theta}$ and $g_{\theta}$), implying that class prototypes are obtained ex-post, based on a support set. These methods are robust to noise, an essential property in FSL, and provide a scheme to embed both examples (images) and class prototypes in the same metric space. This is a key property in the context of LTNs, because it enables different levels of abstraction: one can either state something about a particular example, or about an entire class, as prototypes can be viewed as parametrized labels for classes. We have shown the viability of our approach in GZSL and leave to future work the extension to other settings (e.g., few-shot or semi-supervised learning).

While our experimental results are encouraging, we argue that the strength of our formulation lies in its generality, and the full potential of PROTO-LTN is yet to be realized. Future work can aim at two complementary directions. First, alternative formulations of the isOfClass relationship could be explored, by changing the distance metric and/or the prototype encoding. Mapping class prototypes back to the input space, as done for instance in [36], could improve explainability.

Second, the knowledge $\mathcal{K}$ could be extended to leverage prior information, e.g., from external knowledge bases, to improve generalization to unseen classes. Experiments should include both inductive and transductive settings: the assumption that information about attributes and relationships of unseen classes is available at training or test time (e.g., from WordNet) is less restrictive than assuming that actual examples, albeit unlabelled, are available.

References

[1] L. De Raedt, S. Dumancic, R. Manhaeve, and G. Marra, “From statistical relational to neuro-symbolic artificial intelligence,” in 29th International Joint Conference on Artificial Intelligence, 2021, pp. 4943–4950.
[2] T. R. Besold, A. d. Garcez, S. Bader, H. Bowman, P. Domingos, P. Hitzler, K.-U. Kühnberger, L. C. Lamb, D. Lowd, P. M. V. Lima et al., “Neural-symbolic learning and reasoning: A survey and interpretation,” 2017.
[3] I. Donadello and L. Serafini, “Compensating supervision incompleteness with prior knowledge in semantic image interpretation,” in 2019 International Joint Conference on Neural Networks (IJCNN), 2019, pp. 1–8.
[4] F. Manigrasso, F. D. Miro, L. Morra, and F. Lamberti, “Faster-LTN: a neuro-symbolic, end-to-end object detection architecture,” in International Conference on Artificial Neural Networks. Springer, 2021, pp. 40–52.
[5] S. Badreddine, A. d. Garcez, L. Serafini, and M. Spranger, “Logic tensor networks,” p. 103649, 2022.
[6] L. Serafini, A. d’Avila Garcez, S. Badreddine, I. Donadello, M. Spranger, and F. Bianchi, “Logic tensor networks: Theory and applications,” in Neuro-Symbolic Artificial Intelligence: The State of the Art. IOS Press, 2021, pp. 370–394.
[7] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” in 31st International Conference on Neural Information Processing Systems, 2017, pp. 4080–4090.
[8] L. Zhang, T. Xiang, and S. Gong, “Learning a deep embedding model for zero-shot learning,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3010–3019.
[9] Z. Wan, D. Chen, Y. Li, X. Yan, J. Zhang, Y. Yu, and J. Liao, “Transductive zero-shot learning with visual structure constraint,” pp. 9972–9982, 2019.
[10] A. Goyal and Y. Bengio, “Inductive biases for deep learning of higher-level cognition,” 2020.
[11] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” 2008.
[12] D. Yu, B. Yang, D. Liu, and H. Wang, “A survey on neural-symbolic systems,” 2021.
[13] I. Donadello, L. Serafini, and A. D. Garcez, “Logic tensor networks for semantic image interpretation,” in 26th International Joint Conference on Artificial Intelligence, 2017, pp. 1596–1602.
[14] K. Yi, J. Wu, C. Gan, A. Torralba, P. Kohli, and J. B. Tenenbaum, “Neural-symbolic VQA: disentangling reasoning from vision and language understanding,” in 32nd International Conference on Neural Information Processing Systems, 2018, pp. 1039–1050.
[15] R. Vedantam, K. Desai, S. Lee, M. Rohrbach, D. Batra, and D. Parikh, “Probabilistic neural symbolic models for interpretable visual question answering,” in International Conference on Machine Learning, 2019, pp. 6428–6437.
[16] Z. Li, E. Stengel-Eskin, Y. Zhang, C. Xie, Q. H. Tran, B. Van Durme, and A. Yuille, “Calibrating concepts and operations: Towards symbolic reasoning on real images,” in IEEE/CVF International Conference on Computer Vision, 2021, pp. 14 910–14 919.
[17] M. van Bekkum, M. de Boer, F. van Harmelen, A. Meyer-Vitali, and A. ten Teije, “Modular design patterns for hybrid learning and reasoning systems,” pp. 1–19, 2021.
[18] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata, “Zero-shot learning — A comprehensive evaluation of the good, the bad and the ugly,” pp. 2251–2265, 2019.
[19] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales, “Learning to compare: Relation network for few-shot learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 1199–1208.
[20] P. Zhu, H. Wang, and V. Saligrama, “Generalized zero-shot recognition based on visually semantic embedding,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[21] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov, “DeViSE: A deep visual-semantic embedding model,” in 26th International Conference on Neural Information Processing Systems, 2013, p. 2121–2129.
[22] V. K. Verma, G. Arora, A. Mishra, and P. Rai, “Generalized zero-shot learning via synthesized examples,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[23] H. Huang, C. Wang, P. S. Yu, and C.-D. Wang, “Generative dual adversarial network for generalized zero-shot learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[24] Y. Xing, S. Huang, L. Huangfu, F. Chen, and Y. Ge, “Robust bidirectional generative network for generalized zero-shot learning,” in IEEE International Conference on Multimedia and Expo, 2020, pp. 1–6.
[25] E. G. Miller, N. E. Matsakis, and P. A. Viola, “Learning from one example through shared densities on transforms,” in IEEE Conference on Computer Vision and Pattern Recognition, 2000, pp. 464–471.
[26] B. Lake, R. Salakhutdinov, J. Gross, and J. Tenenbaum, “One shot learning of simple visual concepts,” 2011.
[27] G. Koch, R. Zemel, R. Salakhutdinov et al., “Siamese neural networks for one-shot image recognition,” in ICML deep learning workshop, vol. 2. Lille, 2015.
[28] C. Wah, S. Branson, P. Perona, and S. J. Belongie, “Multiclass recognition and part localization with humans in the loop,” pp. 2524–2531, 2011.
[29] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, “Describing objects by their attributes,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1778–1785.
[30] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “SUN database: Large-scale scene recognition from abbey to zoo,” in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 3485–3492.
[31] S. Changpinyo, W.-L. Chao, B. Gong, and F. Sha, “Synthesized classifiers for zero-shot learning,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5327–5336.
[32] M. Ye and Y. Guo, “Progressive ensemble networks for zero-shot recognition,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[33] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[34] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
[35] S. Badreddine, A. Garcez, L. Serafini, and M. Spranger, “GTS: Logic Tensor Network library,” https://github.com/logictensornetworks/logictensornetworks, 2021.
[36] C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, and J. K. Su, “This looks like that: Deep learning for interpretable image recognition,” pp. 8930–8941, 2019.


Appendix A Function grounding in PROTO-LTNs

PROTO-LTNs are based on two functions (embeddingFunction, grounded as $f_{\theta}$, and getPrototypes) and the isOfClass predicate.

The function getPrototypes, with $D_{\mathrm{in}}(\texttt{getPrototypes})=\texttt{features}\times\texttt{labels}$ and $D_{\mathrm{out}}(\texttt{getPrototypes})=\texttt{embeddings}\times\texttt{labels}$, returns labelled prototypes given a support set of labelled examples. In this way each prototype depends on the support points of the same class, as defined in Eq. 4. As a consequence, we propose a novel definition for generalized LTN functions.

To understand why a generalized function is needed, recall that LTN variables are grounded onto the set of their instantiations. Assume that $s$ is a variable associated to support points, i.e.:

\displaystyle\mathcal{G}(s)=\braket{x^{\tilde{S}}_{1},...,x^{\tilde{S}}_{N_{S}}}.

If $h$ is an LTN function that is compatible with variable $s$, i.e., $D_{\text{in}}(h)=D(s)=\mathbb{R}^{D}$, the grounding for $h(s)$ is

\displaystyle\mathcal{G}(h(s))=\braket{\mathcal{G}(h)(x^{\tilde{S}}_{1}),...,\mathcal{G}(h)(x^{\tilde{S}}_{N_{S}})}.

This means that $\mathcal{G}(h)$ only takes as input a single element of $\mathbb{R}^{D}$ . Unfortunately, a conventional LTN function such as $h$ cannot help us with prototypes, as their definition for a class $n\in\tilde{C}$ , given in Eq. 4, is:

\displaystyle p_{n}=\frac{1}{K}\sum_{\begin{subarray}{c}(x^{\tilde{S}},y^{\tilde{S}})\in\tilde{S}\\ \text{s.t. }y^{\tilde{S}}=n\end{subarray}}f_{\theta}(x^{\tilde{S}})=p_{n}(x^{\tilde{S}}_{1},...,x^{\tilde{S}}_{N_{S}}).

Every prototype is in fact a function of all support points belonging to the same class. As a consequence, we propose a novel definition for generalized LTN functions.

Definition 1

A generalized LTN function $F\in\mathcal{F}^{\text{gen}}$ is a function that lets multiple instantiations of variables be fed at once to $\mathcal{G}(F)$ , and returns a variable. The grounding for a generalized function $F\in\mathcal{F}^{\text{gen}}$ is a function with flexible domain and range:

\displaystyle\mathcal{G}(F):\bigcup_{l=1}^{\infty}\,\vartimes_{m=1}^{l}\mathcal{G}(D_{\mathrm{in}}(F))\to\bigcup_{l=1}^{\infty}\,\vartimes_{m=1}^{l}\mathcal{G}(D_{\mathrm{out}}(F)).

If a generalized function $F\in\mathcal{F}^{\text{gen}}$ and a variable $x\in\mathcal{X}$ have compatible domains, or $D_{\mathrm{in}}(F)=D(x)$ , the grounding for $F(x)$ is defined by

\displaystyle\mathcal{G}(F(x))=\mathcal{G}(F)(\mathcal{G}(x)).


Grounding for both functions is defined as

	$\displaystyle\mathcal{G}(\texttt{embeddingFunction})$	$\displaystyle=f_{\theta},$		(24)
	$\displaystyle\mathcal{G}(\texttt{getPrototypes})$	$\displaystyle=\Pi_{\theta},$		(25)

where $f_{\theta}:\mathbb{R}^{D}\to\mathbb{R}^{M}$ is the same embedding function as in the FSL setting, while

\displaystyle\Pi_{\theta}:\bigcup_{l=1}^{\infty}\,\vartimes_{m=1}^{l}\mathbb{R}^{D}\times\mathbb{N}\to\bigcup_{l=1}^{\infty}\,\vartimes_{m=1}^{l}\mathbb{R}^{M}\times\mathbb{N}

We structure $\Pi_{\theta}$ to be computationally easy to implement (e.g., in a computational graph), and to generalize to a setting in which $N_{S}$ and $\tilde{N}$ are not fixed, or the $N$-way-$K$-shot scenario is not perfect. More specifically, $\Pi_{\theta}$ works as follows.

1.
Take as input:
1. (a)
  
  a support set $\tilde{S}=\{(x^{\tilde{S}}_{1},y^{\tilde{S}}_{1}),...,(x^{\tilde{S}}_{N_{S}},y^{\tilde{S}}_{N_{S}})\}\in(\mathbb{R}^{D}\times\mathbb{N})^{N_{S}}$ of labelled examples, with $x^{\tilde{S}}_{i}\in\mathbb{R}^{D}$ and $y^{\tilde{S}}_{i}\in\mathbb{N}$ ;
  
  （a）一个带有标记示例的支持集 $\tilde{S}=\{(x^{\tilde{S}}_{1},y^{\tilde{S}}_{1}),...,(x^{\tilde{S}}_{N_{S}},y^{\tilde{S}}_{N_{S}})\}\in(\mathbb{R}^{D}\times\mathbb{N})^{N_{S}}$ ，其中 $x^{\tilde{S}}_{i}\in\mathbb{R}^{D}$ 和 $y^{\tilde{S}}_{i}\in\mathbb{N}$ ；
2. (b)
  
  the parameter $\theta$ or, for the sake of clarity, the embedding function $f_{\theta}:\mathbb{R}^{D}\to\mathbb{R}^{M}$ .
  
  （b）参数 $\theta$ 或者为了清晰起见，嵌入函数 $f_{\theta}:\mathbb{R}^{D}\to\mathbb{R}^{M}$ 。
1. 输入：
2.

Extract the classes contained in $\tilde{S}$ by applying:

$\displaystyle p^{(labels)}=\text{Unique}(y^{\tilde{S}}),$

where the “Unique” function retrieves the unique elements of a vector. We call this variable $p^{(labels)}$ because it will be associated to prototype labels. Define $\tilde{N}$ as the number of elements in $p^{(labels)}$ .
“Unique”函数检索向量的唯一元素。我们将这个变量称为 $p^{(labels)}$ ，因为它将与原型标签关联。将 $\tilde{N}$ 定义为 $p^{(labels)}$ 中的元素数量。

2. 通过应用以下方法提取 $\tilde{S}$ 中包含的类：
3.

Define a sparse “labels” matrix $L\in\{0,1\}^{\tilde{N}\times N_{S}}$ whose $i,j$ -th entry is equal to 1 if support item $i$ is of class $p^{(labels)}_{j}$ , 0 otherwise.

3. 定义一个稀疏的“标签”矩阵 $L\in\{0,1\}^{\tilde{N}\times N_{S}}$ ，其中第 $i,j$ 个条目等于 1，如果支持项目 $i$ 属于类 $p^{(labels)}_{j}$ ，否则为 0。

4. Compute the prototypes tensor $p\in\mathbb{R}^{\tilde{N}\times M}$ as

   $\displaystyle p=\text{Diag}(L\mathds{1}_{N_{S}})^{-1}\,L\,f_{\theta}(x^{\tilde{S}}),$

   where

   $\displaystyle\mathds{1}_{N_{S}}=[1,1,...,1]^{T}\in\mathbb{R}^{N_{S}}$

   is a vector of $N_{S}$ ones,

   $\displaystyle f_{\theta}(x^{\tilde{S}})=[f_{\theta}(x^{\tilde{S}}_{1}),f_{\theta}(x^{\tilde{S}}_{2}),...,f_{\theta}(x^{\tilde{S}}_{N_{S}})]^{T}\in\mathbb{R}^{N_{S}\times M}$

   is the element-wise application of $f_{\theta}$ to the elements of $x^{\tilde{S}}$, and “Diag” builds the diagonal matrix associated with a vector. Since $L\mathds{1}_{N_{S}}$ counts the support items of each class, each row of $p$ is the mean embedding of one class. This expression performs the same operation as Eq. 4, but it is more general because it allows for unbalanced support sets. In the case of balanced support sets, which correspond to a perfect $N$-way-$K$-shot scenario, one simply has $\text{Diag}(L\mathds{1}_{N_{S}})^{-1}=\frac{1}{K}\,I$, where $I$ is the identity matrix.

5. Return $p$ and $p^{(labels)}$.
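
The five steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the TensorFlow implementation released with the paper: the function name `compute_prototypes`, the `embed` argument standing in for $f_{\theta}$, and the toy support set are all hypothetical, and the labels returned by “Unique” are sorted here purely to make the output order deterministic.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def compute_prototypes(
    xs: Sequence[Vector],
    ys: Sequence[int],
    embed: Callable[[Vector], Vector],
) -> Tuple[List[Vector], List[int]]:
    """Return (p, p_labels) for a possibly unbalanced support set."""
    # Step 2: unique class labels (sorted for a deterministic order).
    p_labels = sorted(set(ys))

    # Step 3: L[i][j] = 1 if support item j belongs to class p_labels[i].
    L = [[1 if y == c else 0 for y in ys] for c in p_labels]

    # f_theta(x): one embedded row per support item (N_S rows, M columns).
    F = [embed(x) for x in xs]
    M = len(F[0])

    # Step 4: p = Diag(L 1)^{-1} L F, i.e. the per-class mean embedding.
    counts = [sum(row) for row in L]  # L 1_{N_S}: items per class
    p = [
        [sum(L[i][j] * F[j][m] for j in range(len(F))) / counts[i] for m in range(M)]
        for i in range(len(p_labels))
    ]

    # Step 5: return the prototypes and their labels.
    return p, p_labels


# Hypothetical identity embedding and an unbalanced support set:
# two items of class 7 and one item of class 3.
support_x = [[0.0, 2.0], [2.0, 4.0], [10.0, 10.0]]
support_y = [7, 7, 3]
prototypes, labels = compute_prototypes(support_x, support_y, lambda x: list(x))
print(labels)      # [3, 7]
print(prototypes)  # [[10.0, 10.0], [1.0, 3.0]]
```

In the toy example the class counts are 1 and 2, so the normalization differs per row; with a balanced support set every count equals $K$ and the normalization reduces to the factor $\frac{1}{K}$ noted in step 4.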

-B Experimental settings details

Table II reports, for each dataset, the learning rate, the distance parameter $\alpha$, and the L2 regularization weight $\lambda$ that yielded the best performance.

TABLE II: Best hyperparameters used to train PROTO-LTN on each benchmark.

Dataset	Learning rate	$\alpha$	$\lambda$
Awa2	$1\times10^{-4}$	$1\times10^{-5}$	$1\times10^{-3}$
CUB	$1\times10^{-4}$	$1\times10^{-4}$	$1\times10^{-3}$
aPY	$1\times10^{-3}$	$1\times10^{-5}$	$1\times10^{-5}$
SUN	$1\times10^{-3}$	$1\times10^{-5}$	$1\times10^{-5}$
