
Logic Tensor Networks for
Semantic Image Interpretation

Ivan Donadello
Fondazione Bruno Kessler and
University of Trento
Trento, Italy
donadello@fbk.eu

Luciano Serafini
Fondazione Bruno Kessler
Via Sommarive 18, I-38123
Trento, Italy
serafini@fbk.eu

Artur d'Avila Garcez
City, University of London
Northampton Square
London EC1V 0HB, UK
a.garcez@city.ac.uk
Abstract

Semantic Image Interpretation (SII) is the task of extracting structured Semantic descriptions from images. It is widely agreed that the combined use of visual data and background knowledge is of great importance for SII. Recently, Statistical Relational Learning (SRL) approaches have been developed for reasoning under uncertainty and learning in the presence of data and rich knowledge. Logic Tensor Networks (LTNs) are an SRL framework which integrates neural networks with first-order fuzzy logic to allow (i) efficient learning from noisy data in the presence of logical constraints, and (ii) reasoning with logical formulas describing general properties of the data. In this paper, we develop and apply LTNs to two of the main tasks of SII, namely, the classification of an image’s bounding boxes and the detection of the relevant part-of relations between objects. To the best of our knowledge, this is the first successful application of SRL to such SII tasks. The proposed approach is evaluated on a standard image processing benchmark. Experiments show that the use of background knowledge in the form of logical constraints can improve the performance of purely data-driven approaches, including the state-of-the-art Fast Region-based Convolutional Neural Networks (Fast R-CNN). Moreover, we show that the use of logical background knowledge adds robustness to the learning system when errors are present in the labels of the training data.

1 Introduction

Semantic Image Interpretation (SII) is the task of generating a structured semantic description of the content of an image. This structured description can be represented as a labelled directed graph, where each vertex corresponds to a bounding box of an object in the image, and each edge represents a relation between pairs of objects; vertices are labelled with a set of object types and edges are labelled with binary relations. Such a graph is also called a scene graph in [15].

A major obstacle to be overcome by SII is the so-called semantic gap [19], that is, the lack of a direct correspondence between low-level features of the image and high-level semantic descriptions. To tackle this problem, a system for SII must learn the latent correlations that may exist between the numerical features that can be observed in an image and the semantic concepts associated with the objects. It is in this learning process that the availability of relational background knowledge can be of great help. Thus, recent SII systems have sought to combine, or even integrate, visual features obtained from data and symbolic knowledge in the form of logical axioms [30, 4, 8].

The area of Statistical Relational Learning (SRL), or Statistical Artificial Intelligence (StarAI), seeks to combine data-driven learning, in the presence of uncertainty, with symbolic knowledge [29, 2, 13, 7, 26, 23]. However, only very few SRL systems have been applied to SII tasks (cf. Section 2) due to the high complexity associated with image learning. Most systems for solving SII tasks have been based, instead, on deep learning and neural network models. These, on the other hand, do not in general offer a well-founded way of learning from data in the presence of relational logical constraints, and thus require the neural models to be highly engineered from scratch.

In this paper, we develop and apply, for the first time, the SRL framework called Logic Tensor Networks (LTNs) to computationally challenging SII tasks. LTNs combine learning in deep networks with relational logical constraints [27]. They use a First-order Logic (FOL) syntax interpreted in the real numbers, which is implemented as a deep tensor network. Logical terms are interpreted as feature vectors in a real-valued $n$-dimensional space. Function symbols are interpreted as real-valued functions, and predicate symbols as fuzzy logic relations. This syntax and semantics, called real semantics, allow LTNs to learn efficiently in hybrid domains, where elements are composed of both numerical and relational information.

We argue, therefore, that LTNs are a good candidate for learning SII because they can express relational knowledge in FOL, which serves as a constraint on the data-driven learning within tensor networks. Since LTN is a logic, it provides a notion of logical consequence, which forms the basis for learning within LTNs, defined as best satisfiability (cf. Section 4). Solving the best satisfiability problem amounts to finding the latent correlations that may exist between relational background knowledge and numerical data attributes. This formulation enables the specification of learning as reasoning, a unique characteristic of LTNs, which is seen as highly relevant for SII.

This paper specifies SII within LTNs, evaluating it on two important tasks: (i) the classification of bounding boxes, and (ii) the detection of the part-of relation between any two bounding boxes. Both tasks are evaluated using the PASCAL-Part dataset [5]. It is shown that LTNs improve the performance of the state-of-the-art object classifier Fast R-CNN [11] on the bounding box classification task. LTNs also outperform a rule-based heuristic (which uses the inclusion ratio of two bounding boxes) in the detection of part-of relations between objects. Finally, LTNs are evaluated on their ability to handle errors, specifically misclassifications of objects and part-of relations. Very large visual recognition datasets now exist which are noisy [24], and it is important for learning systems to be robust to noise. LTNs were trained systematically on progressively noisier datasets, with results on both SII tasks showing that LTN's logical constraints are capable of adding robustness to the system in the presence of errors in the labels of the training data.

The paper is organized as follows: Section 2 contrasts the LTN approach with related work that integrates visual features and background knowledge for SII. Section 3 specifies LTNs in the context of SII. Section 4 defines the best satisfiability problem in this context, which enables the use of LTNs for SII. Section 5 describes in detail the comparative evaluations of LTNs on the SII tasks. Section 6 concludes the paper and discusses directions for future work.

2 Related Work

The idea of exploiting logical background knowledge to improve SII tasks dates back to the early days of AI. In what follows, we review the most recent results in the area in comparison with LTNs.

Logic-based approaches have used Description Logics (DL), where the basic components of the scene are all assumed to have been already discovered (e.g. simple object types or spatial relations); logical reasoning then derives new facts about the scene from these basic components [19, 21]. Other logic-based approaches have used fuzzy DL to tackle uncertainty in the basic components [14, 6, 1]. These approaches have limited themselves to spatial relations or to refining the labels of the detected objects. In [8], the scene interpretation is created by combining image features with constraints defined using DL, but the method is tailored to the part-of relation and cannot easily be extended to account for other relations. LTNs, on the other hand, should be able to handle any semantic relation. In [18, 10], a symbolic knowledge base is used to improve object detection, but only the subsumption relation is explored, and it is not possible to inject more complex knowledge using logical axioms.

A second group of approaches seeks to encode background knowledge and visual features within probabilistic graphical models. In [30, 20], visual features are combined with knowledge gathered from datasets, web resources or annotators, about object labels, properties such as shape, colour and size, and affordances, using Markov Logic Networks (MLNs) [25] to predict facts in unseen images. Due to the specific knowledge-base schema adopted, the effectiveness of MLNs in this domain is evaluated only for Horn clauses, although the language of MLNs is more general. As a result, it is not easy to evaluate how the approach may perform with more complex axioms. In [2], a probabilistic fuzzy logic is used, but not with real semantics. Clauses are weighted and universally-quantified formulas are instantiated, as done by MLNs. This is different from LTNs where the universally-quantified formulas are computed by using an aggregation operation, which avoids the need for instantiating all variables.

In other related work, [4, 16] encode background knowledge into a generic Conditional Random Field (CRF), where the nodes represent detected objects and the edges represent logical relationships between objects. The task is to find a correct labelling for this graph. In [4], the edges encode logical constraints on a knowledge-base specified in DL. Although these ideas are close in spirit to the approach presented in this paper, they are not formalised as in LTNs, which use a deep tensor network and first-order logic, rather than CRFs or DL. In general, the logical theory behind the functions to be defined in the CRF is unclear. In [16], potential functions are defined as text priors such as co-occurrence of terms found in the image descriptions of Flickr.

In a final group of approaches, here called language-priors, background knowledge is taken from linguistic models [22, 17]. In [22], a neural network is built integrating visual features and a linguistic model to predict semantic relationships between bounding boxes. The linguistic model is a set of rules derived from WordNet [9], stating which types of semantic relationships occur between a subject and an object. In [17], a similar neural network is proposed for the same task, but with a more sophisticated language model that embeds triples of the form subject-relation-object in the same vector space, such that semantically similar triples are mapped closely together in the embedding space. In this way, even if no examples of some triples exist in the data, the relations can be inferred from similarity to more frequent triples. A drawback, however, is the possibility of inferring inconsistent triples, such as man-eats-chair, due to the embedding. LTNs avoid this problem with a logic-based approach (in the above example, with an axiom to the effect that chairs are not normally edible). LTNs can also handle exceptions, offering a system capable of dealing with crisp axioms and real-valued data, as specified in what follows.

3 Logic Tensor Networks

Let $\mathcal{L}$ be a first-order logic language, whose signature is composed of three disjoint sets $\mathcal{C}$, $\mathcal{F}$ and $\mathcal{P}$, denoting constants, functions and predicate symbols, respectively. For any function or predicate symbol $s$, let $\alpha(s)$ denote its arity. Logical formulas in $\mathcal{L}$ allow one to specify relational knowledge, e.g. the atomic formula $\mathsf{partOf}(o_1,o_2)$, stating that object $o_1$ is a part of object $o_2$; the formula $\forall xy(\mathsf{partOf}(x,y)\rightarrow\neg\mathsf{partOf}(y,x))$, stating that the relation $\mathsf{partOf}$ is asymmetric; or $\forall x(\mathsf{Cat}(x)\rightarrow\exists y(\mathsf{partOf}(y,x)\wedge\mathsf{Tail}(y)))$, stating that every cat should have a tail. In addition, exceptions are handled by allowing formulas to be interpreted in fuzzy logic, such that in the presence of an example of, say, a tailless cat, the above formula can be interpreted naturally as normally, every cat has a tail; this will be exemplified later.
\mathcal{L} 为一个一阶逻辑语言,其签名由三个不相交的集合 𝒞𝒞\mathcal{C}\mathcal{F}𝒫𝒫\mathcal{P} 组成,分别表示常量、函数和谓词符号。对于任何函数或谓词符号 s𝑠s ,令 α(s)𝛼𝑠\alpha(s) 表示其元数。 \mathcal{L} 中的逻辑公式允许指定关系知识,例如原子公式 𝗉𝖺𝗋𝗍𝖮𝖿(o1,o2)𝗉𝖺𝗋𝗍𝖮𝖿subscript𝑜1subscript𝑜2\mathsf{partOf}(o_{1},o_{2}) ,表示对象 o1subscript𝑜1o_{1} 是对象 o2subscript𝑜2o_{2} 的一部分,公式 xy(𝗉𝖺𝗋𝗍𝖮𝖿(x,y)¬𝗉𝖺𝗋𝗍𝖮𝖿(y,x))for-all𝑥𝑦𝗉𝖺𝗋𝗍𝖮𝖿𝑥𝑦𝗉𝖺𝗋𝗍𝖮𝖿𝑦𝑥\forall xy(\mathsf{partOf}(x,y)\rightarrow\neg\mathsf{partOf}(y,x)) ,表示关系 𝗉𝖺𝗋𝗍𝖮𝖿𝗉𝖺𝗋𝗍𝖮𝖿\mathsf{partOf} 是非对称的,或 x(𝖢𝖺𝗍(x)y(𝗉𝖺𝗋𝗍𝖮𝖿(x,y)𝖳𝖺𝗂𝗅(y)))for-all𝑥𝖢𝖺𝗍𝑥𝑦𝗉𝖺𝗋𝗍𝖮𝖿𝑥𝑦𝖳𝖺𝗂𝗅𝑦\forall x(\mathsf{Cat}(x)\rightarrow\exists y(\mathsf{partOf}(x,y)\wedge\mathsf{Tail}(y))) ,表示每只猫都应该有尾巴。此外,通过允许公式在模糊逻辑中进行解释来处理异常情况,例如,在存在无尾巴的猫的情况下,上述公式可以自然地被解释为通常情况下每只猫都有尾巴;这将在后面举例说明。

Semantics of $\mathcal{L}$: We define the interpretation domain as a subset of $\mathbb{R}^n$, i.e. every object in the domain is associated with an $n$-dimensional vector of real numbers. Intuitively, this $n$-tuple represents $n$ numerical features of an object, e.g. in the case of a person, their name in ASCII, height, weight, social security number, etc. Functions are interpreted as real-valued functions, and predicates are interpreted as fuzzy relations on real vectors. To emphasise the fact that we interpret symbols as real numbers, we use the term grounding instead of interpretation¹ in the following definition of semantics.

¹In logic, the term grounding indicates the operation of replacing the variables of a term or formula with constants or with terms that do not contain other variables. To avoid any confusion, we use the synonym instantiation for this purpose. It is worth noting that in LTN, differently from MLNs, the instantiation of every first-order formula is not required.

Definition 1. Let $n\in\mathbb{N}$. An $n$-grounding, or simply grounding, $\mathcal{G}$ for a FOL $\mathcal{L}$ is a function defined on the signature of $\mathcal{L}$ satisfying the following conditions:

1. $\mathcal{G}(c)\in\mathbb{R}^n$ for every constant symbol $c\in\mathcal{C}$;
2. $\mathcal{G}(f)\in\mathbb{R}^{n\cdot\alpha(f)}\longrightarrow\mathbb{R}^n$ for every $f\in\mathcal{F}$;
3. $\mathcal{G}(P)\in\mathbb{R}^{n\cdot\alpha(P)}\longrightarrow[0,1]$ for every $P\in\mathcal{P}$.

Given a grounding $\mathcal{G}$, the semantics of closed terms and atomic formulas is defined as follows:

$$\begin{aligned}
\mathcal{G}(f(t_1,\dots,t_m)) &= \mathcal{G}(f)(\mathcal{G}(t_1),\dots,\mathcal{G}(t_m))\\
\mathcal{G}(P(t_1,\dots,t_m)) &= \mathcal{G}(P)(\mathcal{G}(t_1),\dots,\mathcal{G}(t_m))
\end{aligned}$$

The semantics for connectives is defined according to fuzzy logic, using for instance the Lukasiewicz t-norm²:

$$\begin{aligned}
\mathcal{G}(\neg\phi) &= 1-\mathcal{G}(\phi)\\
\mathcal{G}(\phi\wedge\psi) &= \max(0,\mathcal{G}(\phi)+\mathcal{G}(\psi)-1)\\
\mathcal{G}(\phi\vee\psi) &= \min(1,\mathcal{G}(\phi)+\mathcal{G}(\psi))\\
\mathcal{G}(\phi\rightarrow\psi) &= \min(1,1-\mathcal{G}(\phi)+\mathcal{G}(\psi))
\end{aligned}$$

²Examples of t-norms include Lukasiewicz, product and Gödel. The Lukasiewicz t-norm is $\mu_{Luk}(x,y)=\max(0,x+y-1)$, the product t-norm is $\mu_{Pr}(x,y)=x\cdot y$, and the Gödel t-norm is $\mu_{G}(x,y)=\min(x,y)$. See [3] for details.

The LTN semantics for $\forall$ is defined in [27] using the $\min$ operator, that is, $\mathcal{G}(\forall x\,\phi(x))=\min_{t\in\mathit{term}(\mathcal{L})}\mathcal{G}(\phi(t))$, where $\mathit{term}(\mathcal{L})$ is the set of instantiated terms of $\mathcal{L}$. This, however, is inadequate for our purposes, as it does not tolerate exceptions well: a single exception to a universally-quantified formula, such as a cat without a tail, would falsify the formula. Instead, our intention in SII is that the more examples there are that satisfy a formula $\phi(x)$, the higher the truth-value of $\forall x\,\phi(x)$ should be. To capture this, we use for the semantics of $\forall$ a mean-operator, as follows:

$$\mathcal{G}(\forall x\,\phi(x))=\lim_{T\rightarrow\mathit{term}(\mathcal{L})}\mathrm{mean}_p(\mathcal{G}(\phi(t))\mid t\in T)$$

where $\mathrm{mean}_p(x_1,\dots,x_d)=\left(\frac{1}{d}\sum_{i=1}^{d}x_i^p\right)^{\frac{1}{p}}$ for $p\in\mathbb{Z}$.³

³For instance, the arithmetic, quadratic (root-mean-square) and harmonic means are obtained by setting $p=1$, $2$ and $-1$, respectively.
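To make the real semantics concrete, the following is a minimal sketch in plain Python/NumPy (not the authors' TensorFlow implementation; function names are illustrative) of the Lukasiewicz connectives and the $\mathrm{mean}_p$ aggregation:

```python
import numpy as np

# Lukasiewicz fuzzy connectives over truth-values in [0, 1].
def f_not(a):            # G(¬φ) = 1 − G(φ)
    return 1.0 - a

def f_and(a, b):         # G(φ ∧ ψ) = max(0, G(φ) + G(ψ) − 1)
    return np.maximum(0.0, a + b - 1.0)

def f_or(a, b):          # G(φ ∨ ψ) = min(1, G(φ) + G(ψ))
    return np.minimum(1.0, a + b)

def f_implies(a, b):     # G(φ → ψ) = min(1, 1 − G(φ) + G(ψ))
    return np.minimum(1.0, 1.0 - a + b)

def mean_p(truths, p=-1, eps=1e-6):
    """Aggregate the truth-values of all instantiations of a
    universally-quantified formula; p = -1 (harmonic mean) is the
    choice used in the experiments of Section 5."""
    t = np.clip(np.asarray(truths, dtype=float), eps, 1.0)
    return (np.mean(t ** p)) ** (1.0 / p)

# Truth of a universal, given per-cat truth-values of its body:
# one tailless cat lowers, but does not falsify, the universal.
body = [0.9, 0.95, 0.2, 0.85]
print(mean_p(body, p=-1))   # ≈ 0.48, instead of min(...) = 0.2
```

Note how the harmonic mean degrades gracefully with the single exception, whereas the min-based semantics of [27] would collapse the truth-value to 0.2.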

Finally, the classical semantics of $\exists$ is uniquely determined by the semantics of $\forall$, by making $\exists$ equivalent to $\neg\forall\neg$. This approach, however, has a drawback too when it comes to SII: if we adopt, for instance, the arithmetic mean for the semantics of $\forall$, then $\mathcal{G}(\forall x\,\phi(x))=\mathcal{G}(\exists x\,\phi(x))$. Therefore, we shall interpret existential quantification via Skolemization: every formula of the form $\forall x_1,\dots,x_n(\dots\exists y\,\phi(x_1,\dots,x_n,y))$ is rewritten as $\forall x_1,\dots,x_n(\dots\phi(x_1,\dots,x_n,f(x_1,\dots,x_n)))$, by introducing a new $n$-ary function symbol $f$, called a Skolem function. In this way, existential quantifiers can be eliminated from the language by introducing Skolem functions.
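As a worked instance of this rewriting (the Skolem function name $f_{\mathsf{tail}}$ is illustrative, not from the paper), the cat axiom above becomes:

$$\forall x\,\big(\mathsf{Cat}(x)\rightarrow\mathsf{partOf}(f_{\mathsf{tail}}(x),x)\wedge\mathsf{Tail}(f_{\mathsf{tail}}(x))\big)$$

where the grounding of $f_{\mathsf{tail}}$, a learned linear transformation as defined in Section 4, should map the feature vector of a cat to the feature vector of its (most plausible) tail.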

Formalizing SII in LTNs: To specify the SII problem, as defined in the introduction, we consider a signature $\Sigma_{\text{SII}}=\langle\mathcal{C},\mathcal{F},\mathcal{P}\rangle$, where $\mathcal{C}=\bigcup_{p\in\mathit{Pics}}b(p)$ is the set of identifiers for all the bounding boxes in all the images, $\mathcal{F}=\emptyset$, and $\mathcal{P}=\{\mathcal{P}_1,\mathcal{P}_2\}$, where $\mathcal{P}_1$ is a set of unary predicates, one for each object type, e.g. $\mathcal{P}_1=\{\mathsf{Dog},\mathsf{Cat},\mathsf{Tail},\mathsf{Muzzle},\mathsf{Train},\mathsf{Coach},\dots\}$, and $\mathcal{P}_2$ is a set of binary predicates representing relations between objects. Since in our experiments we focus on the part-of relation, $\mathcal{P}_2=\{\mathsf{partOf}\}$. The FOL formulas based on this signature can specify (i) simple facts, e.g. the fact that bounding box $b$ contains a cat, written $\mathsf{Cat}(b)$, or the fact that $b$ contains either a cat or a dog, written $\mathsf{Cat}(b)\vee\mathsf{Dog}(b)$, and (ii) general rules such as $\forall x(\mathsf{Cat}(x)\rightarrow\exists y(\mathsf{partOf}(y,x)\wedge\mathsf{Tail}(y)))$.

A grounding for $\Sigma_{\text{SII}}$ can be defined as follows: each constant $b$, denoting a bounding box, can be associated with a set of geometric features and a set of semantic features obtained from the output of a bounding box detector. Specifically, each bounding box is associated with geometric features describing the position and the dimension of the bounding box, and semantic features describing the classification score returned by the bounding box detector for each class. For example, for each bounding box $b\in\mathcal{C}$ and $C_i\in\mathcal{P}_1$, $\mathcal{G}(b)$ is the vector in $\mathbb{R}^{4+|\mathcal{P}_1|}$:

$$\langle class(C_1,b),\dots,class(C_{|\mathcal{P}_1|},b),\,x_0(b),y_0(b),x_1(b),y_1(b)\rangle$$

where the last four elements are the coordinates of the top-left and bottom-right corners of $b$, and $class(C_i,b)\in[0,1]$ is the classification score of the bounding box detector for $b$ on class $C_i$.
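A minimal sketch of this grounding in plain Python (the detector scores and corner coordinates are assumed given; names are illustrative):

```python
import numpy as np

def ground_bounding_box(class_scores, x0, y0, x1, y1):
    """G(b): concatenate the |P1| semantic features (detector scores
    in [0,1]) with the 4 geometric features (corner coordinates)."""
    scores = np.asarray(class_scores, dtype=float)
    return np.concatenate([scores, [x0, y0, x1, y1]])

# A hypothetical box whose detector scores over P1 = {Dog, Cat, Tail}
# peak at Cat, with top-left (10, 20) and bottom-right (110, 220).
g_b = ground_bounding_box([0.05, 0.90, 0.02], 10, 20, 110, 220)
print(g_b.shape)  # (|P1| + 4,) = (7,)
```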

An example of groundings for predicates can be defined by taking a one-vs-all multi-classifier approach, as follows. First, define the following grounding for each class $C_i\in\mathcal{P}_1$ (below, $\mathbf{x}=\langle x_1,\dots,x_{|\mathcal{P}_1|+4}\rangle$ is the vector corresponding to the grounding of a bounding box):

$$\mathcal{G}(C_i)(\mathbf{x})=\begin{cases}1 & \text{if } i=\mathop{\mathrm{argmax}}_{1\leq l\leq|\mathcal{P}_1|}x_l\\ 0 & \text{otherwise}\end{cases}\tag{3}$$
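A direct transcription of Eq. (3), again as an illustrative sketch:

```python
import numpy as np

def ground_class(i, x, num_classes):
    """G(C_i)(x): 1 if class i has the highest detector score among
    the first |P1| entries of the grounding vector x, else 0."""
    return 1.0 if i == int(np.argmax(x[:num_classes])) else 0.0

g_b = np.array([0.05, 0.90, 0.02, 10, 20, 110, 220])
print(ground_class(1, g_b, num_classes=3))  # 1.0: "Cat" wins
```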

Then, a simple rule-based approach for defining a grounding for the $\mathsf{partOf}$ relation is based on the naïve assumption that the more a bounding box $b$ is contained within a bounding box $b'$, the higher the probability should be that $b$ is part of $b'$. Accordingly, one can define $\mathcal{G}(\mathsf{partOf}(b,b'))$ as the inclusion ratio $\mathit{ir}(b,b')$ of bounding box $b$, with grounding $\mathbf{x}$, into bounding box $b'$, with grounding $\mathbf{x}'$ (formally, $\mathit{ir}(b,b')=\frac{area(b\cap b')}{area(b)}$). A slightly more sophisticated rule-based grounding for $\mathsf{partOf}$ (used as baseline in the experiments to follow) also takes into account type compatibilities, by multiplying the inclusion ratio by a factor $w_{ij}$. Hence, we define $\mathcal{G}(\mathsf{partOf}(b,b'))$ as follows:

$$\mathcal{G}(\mathsf{partOf}(b,b'))=\begin{cases}1 & \text{if } \mathit{ir}(b,b')\cdot\max_{i,j=1}^{|\mathcal{P}_1|}(w_{ij}\cdot x_i\cdot x'_j)\geq th_{ir}\\ 0 & \text{otherwise}\end{cases}\tag{6}$$

for some threshold $th_{ir}$ (we use $th_{ir}>0.5$), and with $w_{ij}=1$ if $C_i$ is a part of $C_j$, and $0$ otherwise. Given the above grounding, we can compute the grounding of any atomic formula, e.g. $\mathsf{Cat}(b_1)$, $\mathsf{Dog}(b_2)$, $\mathsf{leg}(b_3)$, $\mathsf{partOf}(b_3,b_1)$, $\mathsf{partOf}(b_3,b_2)$, thus expressing the degree of truth of the formula. The rule-based groundings (Eqs. (3) and (6)) may not satisfy some of the constraints to be imposed: the classification score may be wrong, a bounding box may include another which is not in the part-of relation, and so on. Furthermore, in many situations, it is not possible to define a grounding a priori. Instead, groundings may need to be learned automatically from examples, by optimizing the truth-values of the formulas in the background knowledge. This is discussed next.
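A sketch of this baseline (illustrative Python; boxes are (x0, y0, x1, y1) tuples and w is the part-whole compatibility matrix with $w_{ij}=1$ iff $C_i$ is a part of $C_j$):

```python
import numpy as np

def area(box):
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)

def inclusion_ratio(b, b_prime):
    """ir(b, b') = area(b ∩ b') / area(b)."""
    ix0 = max(b[0], b_prime[0]); iy0 = max(b[1], b_prime[1])
    ix1 = min(b[2], b_prime[2]); iy1 = min(b[3], b_prime[3])
    return area((ix0, iy0, ix1, iy1)) / area(b)

def ground_part_of(box_b, scores_b, box_bp, scores_bp, w, th_ir=0.7):
    """Eq. (6): 1 iff the inclusion ratio, weighted by the best
    type compatibility w[i, j] * x_i * x'_j, reaches the threshold."""
    compat = np.max(w * np.outer(scores_b, scores_bp))
    return 1.0 if inclusion_ratio(box_b, box_bp) * compat >= th_ir else 0.0

# A tail box fully inside a cat box; w says class 0 (Tail) is part of class 1 (Cat).
w = np.array([[0.0, 1.0], [0.0, 0.0]])
print(ground_part_of((40, 40, 60, 60), [0.9, 0.1],
                     (10, 20, 110, 220), [0.1, 0.9], w))  # 1.0
```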

4 Learning as Best Satisfiability

A partial grounding, denoted by $\hat{\mathcal{G}}$, is a grounding that is defined on a subset of the signature of $\mathcal{L}$. A grounding $\mathcal{G}$ is said to be a completion of $\hat{\mathcal{G}}$ if $\mathcal{G}$ is a grounding for $\mathcal{L}$ and coincides with $\hat{\mathcal{G}}$ on the symbols where $\hat{\mathcal{G}}$ is defined.

Definition 2. A grounded theory GT is a pair $\langle\mathcal{K},\hat{\mathcal{G}}\rangle$ with a set $\mathcal{K}$ of closed formulas and a partial grounding $\hat{\mathcal{G}}$.

Definition 3. A grounding $\mathcal{G}$ satisfies a GT $\langle\mathcal{K},\hat{\mathcal{G}}\rangle$ if $\mathcal{G}$ completes $\hat{\mathcal{G}}$ and $\mathcal{G}(\phi)=1$ for all $\phi\in\mathcal{K}$. A GT $\langle\mathcal{K},\hat{\mathcal{G}}\rangle$ is satisfiable if there exists a grounding $\mathcal{G}$ that satisfies $\langle\mathcal{K},\hat{\mathcal{G}}\rangle$.

According to the previous definition, deciding the satisfiability of $\langle\mathcal{K},\hat{\mathcal{G}}\rangle$ amounts to searching for a grounding $\mathcal{G}$ such that all the formulas of $\mathcal{K}$ are mapped to 1. Differently from classical satisfiability, when a GT is not satisfiable we are interested in the best possible satisfaction that we can reach with a grounding. This is defined as follows.

Definition 4. Let $\langle\mathcal{K},\hat{\mathcal{G}}\rangle$ be a grounded theory. We define the best satisfiability problem as the problem of finding a grounding $\mathcal{G}^*$ that maximizes the truth-value of the conjunction of all clauses $cl\in\mathcal{K}$, i.e.

$$\mathcal{G}^*=\mathop{\mathrm{argmax}}_{\hat{\mathcal{G}}\subseteq\mathcal{G}\in\mathbb{G}}\;\mathcal{G}\Big(\bigwedge_{cl\in\mathcal{K}}cl\Big).$$
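Definition 4 can be read operationally as a differentiable objective. The following is an illustrative sketch in plain Python (in the paper the maximization runs over the tensor-network parameters with RMSProp in TensorFlow; here we assume clauses are aggregated with the same harmonic mean used for the universal quantifier):

```python
import numpy as np

def clause_truth(literal_truths):
    """Truth of a disjunctive clause under the Lukasiewicz t-conorm:
    min(1, sum of literal truth-values)."""
    return min(1.0, float(np.sum(literal_truths)))

def best_sat_objective(clause_truth_values, p=-1, eps=1e-6):
    """Aggregate the truth of all clauses in K; the grounding G* is
    the parameter setting that maximizes this value."""
    t = np.clip(np.asarray(clause_truth_values, float), eps, 1.0)
    return (np.mean(t ** p)) ** (1.0 / p)

# Truth-values of three grounded clauses under the current parameters:
clauses = [clause_truth([0.5, 0.3]),   # 0.8
           clause_truth([0.95]),       # 0.95
           clause_truth([0.4, 0.2])]   # 0.6
print(best_sat_objective(clauses))     # ≈ 0.76; ascend its gradient
```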

Grounding $\mathcal{G}^*$ captures the latent correlation between the quantitative attributes of objects and their categorical and relational properties. Not all functions are suitable as a grounding; they should preserve some form of regularity. If $\mathcal{G}(\mathsf{Cat})(\mathbf{x})\approx 1$ (the bounding box with feature vector $\mathbf{x}$ contains a cat), then for every $\mathbf{x}'$ close to $\mathbf{x}$ (i.e. for every bounding box with features similar to $\mathbf{x}$), one should have $\mathcal{G}(\mathsf{Cat})(\mathbf{x}')\approx 1$. In particular, we consider groundings of the following form:

Function symbols are grounded to linear transformations. If $f$ is an $m$-ary function symbol, then $\mathcal{G}(f)$ is of the form:

$$\mathcal{G}(f)(\mathbf{v})=M_f\mathbf{v}+N_f$$

where $\mathbf{v}=\langle\mathbf{v}_1^\intercal,\dots,\mathbf{v}_m^\intercal\rangle^\intercal$ is the $mn$-ary vector obtained by concatenating each $\mathbf{v}_i$. The parameters for $\mathcal{G}(f)$ are the $n\times mn$ real matrix $M_f$ and the $n$-vector $N_f$.

The grounding of an $m$-ary predicate $P$, namely $\mathcal{G}(P)$, is defined as a generalization of the neural tensor network (which has been shown effective at knowledge completion in the presence of simple logical constraints [28]), as a function from $\mathbb{R}^{mn}$ to $[0,1]$, as follows:

$$\mathcal{G}(P)(\mathbf{v})=\sigma\left(u_P^\intercal\tanh\left(\mathbf{v}^\intercal W_P^{[1:k]}\mathbf{v}+V_P\mathbf{v}+b_P\right)\right)\tag{7}$$

with $\sigma$ the sigmoid function. The parameters for $P$ are: $W_P^{[1:k]}$, a 3-D tensor in $\mathbb{R}^{k\times mn\times mn}$; $V_P\in\mathbb{R}^{k\times mn}$; $b_P\in\mathbb{R}^k$; and $u_P\in\mathbb{R}^k$. This last parameter performs a linear combination of the quadratic features given by the tensor product. With this encoding, the grounding (i.e. truth-value) of a clause can be determined by a neural network which first computes the grounding of the literals contained in the clause, and then combines them using the specific t-norm.
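A forward pass of Eq. (7) can be sketched as follows (NumPy, with randomly initialized parameters standing in for the weights that best satisfiability would learn; the sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ground_predicate(v, W, V, b, u):
    """Eq. (7): truth-value in [0,1] of an m-ary predicate applied to
    the concatenated argument vector v of length m*n.
    W: (k, mn, mn) tensor, V: (k, mn), b: (k,), u: (k,)."""
    quadratic = np.einsum('i,kij,j->k', v, W, v)  # v^T W^[1:k] v
    return float(sigmoid(u @ np.tanh(quadratic + V @ v + b)))

rng = np.random.default_rng(0)
k, mn = 6, 2 * 7                 # k = 6 layers; binary predicate (m = 2), n = 7
W = rng.normal(size=(k, mn, mn)) * 0.1
V = rng.normal(size=(k, mn)) * 0.1
b = np.zeros(k)
u = rng.normal(size=k)

v = rng.uniform(size=mn)         # grounding of a pair of bounding boxes
print(ground_predicate(v, W, V, b, u))  # a value in (0, 1): G(partOf(b, b'))
```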

In what follows, we describe how a suitable GT can be built for SII. Let $\mathit{Pics}^t\subseteq\mathit{Pics}$ be a set of bounding boxes of images correctly labelled with the classes that they belong to, and let each pair of bounding boxes be correctly labelled with the part-of relation. In machine learning terminology, $\mathit{Pics}^t$ is a training set without noise. In real semantics, a training set can be represented by a theory $\mathcal{T}_{\mathrm{expl}}=\langle\mathcal{K}_{\mathrm{expl}},\hat{\mathcal{G}}\rangle$, where $\mathcal{K}_{\mathrm{expl}}$ contains the closed literals $C_i(b)$ (resp. $\neg C_i(b)$) and $\mathsf{partOf}(b,b')$ (resp. $\neg\mathsf{partOf}(b,b')$), for every bounding box $b$ labelled (resp. not labelled) with $C_i$ and for every pair of bounding boxes $\langle b,b'\rangle$ connected (resp. not connected) by the $\mathsf{partOf}$ relation. The partial grounding $\hat{\mathcal{G}}$ is defined on all bounding boxes of all the images in $\mathit{Pics}$, where both the semantic features $class(C_i,b)$ and the bounding box coordinates are computed by the Fast R-CNN object detector [11]. $\hat{\mathcal{G}}$ is not defined for the predicate symbols in $\mathcal{P}$, whose groundings are to be learned. $\mathcal{T}_{\mathrm{expl}}$ contains only assertional information about specific bounding boxes. This is the classical setting of machine learning, where classifiers (i.e. the groundings of predicates) are inductively learned from positive examples (such as $\mathsf{partOf}(b,b')$) and negative examples (such as $\neg\mathsf{partOf}(b,b')$) of a classification. In this learning setting, mereological constraints such as "cats have no wheels" or "a tail is a part of a cat" are not taken into account. Examples of mereological constraints state, for instance, that the part-of relation is asymmetric ($\forall xy(\mathsf{partOf}(x,y)\rightarrow\neg\mathsf{partOf}(y,x))$); or list the several parts of an object (e.g. $\forall xy(\mathsf{Cat}(x)\wedge\mathsf{partOf}(y,x)\rightarrow\mathsf{Tail}(y)\vee\mathsf{Muzzle}(y))$); or even state, for simplicity, that a whole object cannot be part of another object (e.g. $\forall xy(\mathsf{Cat}(x)\rightarrow\neg\mathsf{partOf}(x,y))$) and that a part cannot be divided further into parts (e.g. $\forall xy(\mathsf{Tail}(x)\rightarrow\neg\mathsf{partOf}(y,x))$). This general knowledge is available from on-line resources, such as WordNet [9], and can be retrieved by inheriting the meronymy relations for every concept corresponding to a whole object. A grounded theory that also considers mereological constraints as prior knowledge can be constructed by adding such axioms to $\mathcal{K}_{\mathrm{expl}}$. More formally, we define $\mathcal{T}_{\mathrm{prior}}=\langle\mathcal{K}_{\mathrm{prior}},\hat{\mathcal{G}}\rangle$, where $\mathcal{K}_{\mathrm{prior}}=\mathcal{K}_{\mathrm{expl}}\cup\mathcal{M}$, and $\mathcal{M}$ is the set of mereological axioms. To check the role of $\mathcal{M}$, we evaluate both theories and then compare the results.
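To make the construction concrete, here is a sketch of how the two knowledge bases might be assembled (the axiom strings and label names are illustrative, not the paper's exact syntax):

```python
def build_k_expl(boxes, labels, part_of_pairs):
    """Closed literals from the training annotations: one (possibly
    negated) type literal per box/class, one partOf literal per pair."""
    k_expl = []
    classes = sorted({c for cs in labels.values() for c in cs})
    for b in boxes:
        for c in classes:
            k_expl.append(f"{c}({b})" if c in labels[b] else f"~{c}({b})")
    for b1 in boxes:
        for b2 in boxes:
            if b1 == b2:
                continue
            rel = "partOf" if (b1, b2) in part_of_pairs else "~partOf"
            k_expl.append(f"{rel}({b1},{b2})")
    return k_expl

# Mereological axioms M, e.g. inherited from WordNet meronymy:
M = [
    "forall x,y: partOf(x,y) -> ~partOf(y,x)",               # asymmetry
    "forall x,y: Cat(x) & partOf(y,x) -> Tail(y) | Muzzle(y)",
    "forall x,y: Cat(x) -> ~partOf(x,y)",                    # wholes are not parts
]

boxes = ["b1", "b3"]
labels = {"b1": {"Cat"}, "b3": {"Tail"}}
k_expl = build_k_expl(boxes, labels, part_of_pairs={("b3", "b1")})
k_prior = k_expl + M
print(len(k_expl), "literals;", len(k_prior), "formulas in K_prior")
```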

5 Experimental Evaluation

Figure 1: Precision-recall curves for indoor object type classification and the $\mathsf{partOf}$ relation between objects. (a) LTNs with prior knowledge improve the performance of Fast R-CNN on object type classification, achieving an Area Under the Curve (AUC) of 0.800, compared with 0.756. (b) LTNs with prior knowledge outperform the rule-based approach of Eq. (6) in the detection of part-of relations, achieving an AUC of 0.598, compared with 0.172.

We evaluate the performance of our approach for SII⁴ on two tasks, namely, the classification of bounding boxes and the detection of $\mathsf{partOf}$ relations between pairs of bounding boxes. In particular, we chose the part-of relation because both data (the PASCAL-Part dataset [5]) and ontologies (WordNet) are available for it. In addition, part-of can be used to represent, via reification, a large class of relations [12] (e.g., the relation "a plant is lying on the table" can be reified as an object of type "lying event" whose parts are the plant and the table). However, it is worth noting that many other relations could have been included in this evaluation. The time complexity of LTN grows linearly with the number of axioms.

⁴LTN has been implemented as a Google TensorFlow™ library. Code, the $\mathsf{partOf}$ ontology, and the dataset are available at https://gitlab.fbk.eu/donadello/LTN_IJCAI17

We also evaluate the robustness of our approach with respect to noisy data. It has been acknowledged by many that, with the vast growth in size of the training sets for visual recognition [15], many data annotations may be affected by noise such as missing or erroneous labels, non-localised objects, and disagreements between annotations, e.g. human annotators often mistake “part-of” for the “have” relation [24].

We use the PASCAL-Part dataset, which contains 10103 images with bounding boxes annotated with object types and the part-of relation defined between pairs of bounding boxes. Labels are divided into three main groups: animals, vehicles and indoor objects, with their corresponding parts and "part-of" labels. Whole objects inside the same group can share parts; whole objects of different groups do not share any parts. Labels for parts are very specific, e.g. "left lower leg". Thus, without loss of generality, we merged the bounding boxes that referred to the same part into a single bounding box, e.g. bounding boxes labelled with "left lower leg" and "left upper leg" were merged into a single bounding box of type "leg". In this way, we limited our experiments to a dataset with 20 labels for whole objects and 39 labels for parts. In addition, we removed from the dataset any bounding box with height or width smaller than 6 pixels. The images were then split into a training set with 80% and a test set with 20% of the images, maintaining the same proportion of bounding boxes for each label.

Object Type Classification and Detection of the Part-Of Relation: Given a set of bounding boxes detected by an object detector (we use Fast R-CNN), the task of object classification is to assign an object type to each bounding box. The task of part-of detection is to decide, given two bounding boxes, whether the object contained in the first is a part of the object contained in the second. We use LTN to solve both tasks simultaneously. This is important because a bounding box type and the part-of relation are not independent. Their dependencies are specified in LTN using background knowledge in the form of logical axioms.

To show the effect of the logical axioms, we train two LTNs: the first containing only training examples of object types and part-of relations ($\mathcal{T}_{\mathrm{expl}}$), and the second also containing logical axioms about types and part-of ($\mathcal{T}_{\mathrm{prior}}$). The LTNs were set up with tensors of $k=6$ layers and a regularization parameter $\lambda=10^{-10}$. We chose the Lukasiewicz t-norm ($\mu(a,b)=\max(0,a+b-1)$) and use the harmonic mean as aggregation operator. We ran 1000 training epochs of the RMSProp learning algorithm available in TensorFlow™. We compare results with Fast R-CNN at object type classification (Eq. (3)), and with the inclusion ratio $\mathit{ir}$ baseline (Eq. (6)) at the part-of detection task⁵. If $\mathit{ir}$ is larger than a given threshold $th$ (in our experiments, $th=0.7$), then the bounding boxes are said to be in the $\mathsf{partOf}$ relation. Every bounding box $b$ is classified into $C\in\mathcal{P}_1$ if $\mathcal{G}(C(b))\geq th$. With this, a bounding box can be classified into more than one class. For each class, precision and recall are calculated in the usual way. Results for indoor objects are shown in Figure 1, where AUC is the area under the precision-recall curve. The results show that, for both object types and the part-of relation, the LTN trained with prior knowledge given by mereological axioms performs better than the LTN trained with examples only. Moreover, prior knowledge allows LTN to improve the performance of the Fast R-CNN (FRCNN) object detector. Notice that the LTN is trained using the Fast R-CNN results as features. FRCNN assigns a bounding box to a class if the value of the corresponding semantic feature exceeds $th$. This decision is local to the specific semantic features. If such local features are very discriminative (which is the case in our experiments), then very good levels of precision can be achieved. Differently from FRCNN, LTNs make a global choice which takes into consideration all (semantic and geometric) features together. This should offer robustness to the LTN classifier at the price of a drop in precision; the logical axioms compensate for this drop. For the other object types (animals and vehicles), LTN has results comparable to FRCNN: FRCNN beats $\mathcal{T}_{\mathrm{prior}}$ by 0.05 and 0.037 AUC, respectively, for animals and vehicles. Finally, we performed an initial experiment on small data, on the assumption that the LTN axioms should be able to compensate for a reduction in training data. By removing 50% of the training data for indoor objects, a performance similar to that of $\mathcal{T}_{\mathrm{prior}}$ with the full training set can be achieved: 0.767 AUC for object types and 0.623 AUC for the part-of relation (the latter even an improvement over the full-data result).

⁵A direct comparison with [4] is not possible because their code was not available.
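As an illustrative sketch (hypothetical helper names; not the evaluation code used in the paper), the thresholded multi-label decision rule and the per-class precision/recall can be computed as follows, where truths[b][C] stands for $\mathcal{G}(C(b))$:

```python
def classify(truths, th=0.7):
    """A box is assigned every class whose truth-value reaches th,
    so a single box may receive several labels."""
    return {b: {c for c, v in scores.items() if v >= th}
            for b, scores in truths.items()}

def precision_recall(predicted, gold, cls):
    tp = sum(cls in predicted[b] and cls in gold[b] for b in gold)
    fp = sum(cls in predicted[b] and cls not in gold[b] for b in gold)
    fn = sum(cls not in predicted[b] and cls in gold[b] for b in gold)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

truths = {"b1": {"Cat": 0.91, "Dog": 0.75}, "b2": {"Cat": 0.2, "Dog": 0.8}}
gold = {"b1": {"Cat"}, "b2": {"Dog"}}
pred = classify(truths)
print(pred["b1"])                           # {'Cat', 'Dog'}: multi-label
print(precision_recall(pred, gold, "Cat"))  # (1.0, 1.0)
```

Varying th then traces out the precision-recall curves of Figure 1.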

Robustness to Noisy Training Data: In this evaluation, we show that logical axioms improve the robustness of LTNs in the presence of errors in the labels of the training data. We added an increasing amount of noise to the PASCAL-Part training data, and measured how performance degrades in the presence and absence of axioms. For $k\in\{10,20,30,40\}$, we randomly select $k\%$ of the bounding boxes in the training data and randomly change their classification labels. In addition, we randomly select $k\%$ of the pairs of bounding boxes and flip the value of the part-of relation's label. For each value of $k$, we train LTNs $\mathcal{T}_{\mathrm{expl}}^k$ and $\mathcal{T}_{\mathrm{prior}}^k$ and evaluate the results on both SII tasks, as done before. As expected, adding too much noise to the training labels leads to a large drop in performance. Figure 2 shows the AUC measures for indoor objects with increasing error $k$. Each pair of bars indicates the AUC of $\mathcal{T}_{\mathrm{prior}}^k$ and $\mathcal{T}_{\mathrm{expl}}^k$ for a given $k\%$ of errors.
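A sketch of this noise-injection protocol (illustrative Python; labels and relation values are flipped uniformly at random):

```python
import random

def corrupt(labels, part_of, classes, k, seed=0):
    """Relabel k% of the boxes with a random different class, and flip
    the partOf truth-value of k% of the box pairs."""
    rng = random.Random(seed)
    noisy_labels = dict(labels)
    for b in rng.sample(sorted(labels), int(len(labels) * k / 100)):
        noisy_labels[b] = rng.choice([c for c in classes if c != labels[b]])
    noisy_rel = dict(part_of)
    for pair in rng.sample(sorted(part_of), int(len(part_of) * k / 100)):
        noisy_rel[pair] = not part_of[pair]
    return noisy_labels, noisy_rel

labels = {"b1": "Cat", "b2": "Dog", "b3": "Tail", "b4": "Leg"}
part_of = {("b3", "b1"): True, ("b4", "b2"): True, ("b1", "b2"): False}
print(corrupt(labels, part_of, ["Cat", "Dog", "Tail", "Leg"], k=25))
```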

Figure 2: AUCs for (a) indoor object types and (b) the part-of predicate, with increasing noise in the labels of the training data. The drop in performance is noticeably smaller for the LTN trained with background knowledge.

Results indicate that the LTN axioms offer robustness to noise: in addition to the expected overall drop in performance, an increasing gap can be seen between the performance drop of the LTN trained with examples only and that of the LTN trained with background knowledge.

6 Conclusion and Future Work

SII systems are required to address the semantic gap problem: combining visual low-level features with high-level concepts. We argue that the problem can be addressed by the integration of numerical and logical representations in deep learning. LTNs learn from numerical data and logical constraints, enabling approximate reasoning on unseen data to predict new facts. In this paper, LTNs were shown to improve on the state-of-the-art Fast R-CNN method for bounding box classification, and to outperform a rule-based method at learning part-of relations in the PASCAL-Part dataset. Moreover, LTNs were evaluated on their ability to handle noisy data through the systematic creation of training sets with errors in the labels. Results indicate that relational knowledge can add robustness to neural systems. As future work, we shall apply LTNs to larger datasets such as Visual Genome, and continue to compare the various instances of LTN with SRL, deep learning and other neural-symbolic approaches on such challenging visual intelligence tasks.

References

  • [1] J. Atif, C. Hudelot, and I. Bloch. Explanatory reasoning for image understanding using formal concept analysis and description logics. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 44(5):552–570, May 2014.
  • [2] S. H. Bach, M. Broecheler, B. Huang, and L. Getoor. Hinge-loss Markov random fields and probabilistic soft logic. CoRR, abs/1505.04406, 2015.
  • [3] M. Bergmann. An Introduction to Many-Valued and Fuzzy Logic: Semantics, Algebras, and Derivation Systems. Cambridge University Press, 2008.
  • [4] N. Chen, Q.-Y. Zhou, and V. Prasanna. Understanding web images by object relation network. In Proceedings of the 21st International Conference on World Wide Web, WWW ’12, pages 291–300, New York, NY, USA, 2012. ACM.
  • [5] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR, 2014.
  • [6] S. Dasiopoulou, Y. Kompatsiaris, and M. G. Strintzis. Applying fuzzy DLs in the extraction of image semantics. J. Data Semantics, 14:105–132, 2009.
  • [7] M. Diligenti, M. Gori, and C. Saccà. Semantic-based regularization for learning and inference. Artificial Intelligence, 2015.
  • [8] I. Donadello and L. Serafini. Integration of numeric and symbolic information for semantic image interpretation. Intelligenza Artificiale, 10(1):33–47, 2016.
  • [9] C. Fellbaum, editor. WordNet: an electronic lexical database. MIT Press, 1998.
  • [10] G. Forestier, C. Wemmert, and A. Puissant. Coastal image interpretation using background knowledge and semantics. Computers & Geosciences, 54:88–96, 2013.
  • [11] R. Girshick. Fast R-CNN. In International Conference on Computer Vision (ICCV), 2015.
  • [12] N. Guarino and G. Guizzardi. On the reification of relationships. In 24th Italian Symp. on Advanced Database Sys., pages 350–357, 2016.
  • [13] B. Gutmann, M. Jaeger, and L. De Raedt. Extending ProbLog with continuous distributions. In Proc. ILP, pages 76–91. Springer, 2010.
  • [14] C. Hudelot, J. Atif, and I. Bloch. Fuzzy spatial relation ontology for image interpretation. Fuzzy Sets and Systems, 159(15):1929–1951, 2008.
  • [15] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. Shamma, M. Bernstein, and L. Fei-Fei. Visual Genome: Connecting language and vision using crowdsourced dense image annotations, 2016.
  • [16] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating image descriptions. In CVPR, 2011.
  • [17] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detection with language priors. In ECCV, pages 852–869, 2016.
  • [18] M. Marszalek and C. Schmid. Semantic Hierarchies for Visual Object Recognition. In CVPR, 2007.
  • [19] B. Neumann and R. Möller. On scene interpretation with description logics. Image and Vision Computing, 26(1):82–101, 2008. Cognitive Vision-Special Issue.
  • [20] D. Nyga, F. Balint-Benczedi, and M. Beetz. PR2 looking at things: ensemble learning for unstructured information processing with Markov logic networks. In IEEE Intl. Conf. on Robotics and Automation, pages 3916–3923, 2014.
  • [21] I. S. Espinosa Peraldi, A. Kaya, and R. Möller. Formalizing multimedia interpretation based on abduction over description logic ABoxes. In Proc. of the 22nd Intl. Workshop on Description Logics, volume 477 of CEUR Workshop Proceedings. CEUR-WS.org, 2009.
  • [22] V. Ramanathan, C. Li, J. Deng, W. Han, Z. Li, K. Gu, Y. Song, S. Bengio, C. Rosenberg, and L. Fei-Fei. Learning semantic relationships for better action retrieval in images. In CVPR, 2015.
  • [23] I. Ravkic, J. Ramon, and J. Davis. Learning relational dependency networks in hybrid domains. Machine Learning, 100(2-3):217–254, 2015.
  • [24] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich. Training deep neural networks on noisy labels with bootstrapping. CoRR, abs/1412.6596, 2014.
  • [25] M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1-2):107–136, 2006.
  • [26] T. Rocktäschel, S. Singh, and S. Riedel. Injecting logical background knowledge into embeddings for relation extraction. In NAACL, 2015.
  • [27] L. Serafini and A. S. d’Avila Garcez. Learning and reasoning with logic tensor networks. In Proc. AI*IA, pages 334–348, 2016.
  • [28] R. Socher, D. Chen, C. D. Manning, and A. Ng. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems, pages 926–934, 2013.
  • [29] J. Wang and P. Domingos. Hybrid Markov logic networks. In AAAI, volume 8, pages 1106–1111, 2008.
  • [30] Y. Zhu, A. Fathi, and L. Fei-Fei. Reasoning about object affordances in a knowledge base representation. In ECCV, pages 408–424, 2014.