
MNER-QG: An End-to-End MRC framework for Multimodal Named Entity Recognition with Query Grounding

Meihuizi Jia1,2, Lei Shen 2, Xin Shen 3, Lejian Liao 1, Meng Chen 2, Xiaodong He 2,

Zhendong Chen 1, Jiaqi Li 1
Corresponding author
Abstract

Multimodal named entity recognition (MNER) is a critical step in information extraction, which aims to detect entity spans and classify them into the corresponding entity types given a sentence-image pair. Existing methods either (1) obtain named entities with coarse-grained visual clues from attention mechanisms, or (2) first detect fine-grained visual regions with toolkits and then recognize named entities. However, they suffer from improper alignment between entity types and visual regions, or from error propagation in the two-stage manner, which eventually imports irrelevant visual information into texts. In this paper, we propose a novel end-to-end framework named MNER-QG that can simultaneously perform MRC-based multimodal named entity recognition and query grounding. Specifically, with the assistance of queries, MNER-QG can provide prior knowledge of entity types and visual regions, and further enhance the representations of both texts and images. To conduct the query grounding task, we provide manual annotations and weak supervisions that are obtained by training a highly flexible visual grounding model with transfer learning. We conduct extensive experiments on two public MNER datasets, Twitter2015 and Twitter2017. Experimental results show that MNER-QG outperforms the current state-of-the-art models on the MNER task, and also improves the query grounding performance.

Introduction

Multimodal named entity recognition (MNER) is a vision-language task that extends the traditional text-based NER and alleviates ambiguity in natural languages by taking images as additional inputs. The essence of MNER is to effectively capture visual features corresponding to entity spans and incorporate certain visual regions into textual representations.

Figure 1: Two examples of MNER-QG with entity types "ORG", "PER", and "OTHER".

Existing MNER datasets contain few fine-grained annotations in each sentence-image pair, i.e., the relevant image is given as a whole, without regional signals for a particular entity type. Therefore, previous works implicitly align the contents of a sentence-image pair and fuse their representations with various attention mechanisms (Moon, Neves, and Carvalho 2018; Lu et al. 2018; Zhang et al. 2018; Arshad et al. 2019; Yu et al. 2020; Chen et al. 2021a; Xu et al. 2022). However, it is hard to interpret and evaluate the effectiveness of such implicit alignments. Recently, visual grounding toolkits (Yang et al. 2019) have been exploited to explicitly extract visual regions related to different entity types (Zhang et al. 2021). The detected regions are then bound to the input sentence and fed into the recognition model together (Jia et al. 2022c). In this two-stage manner, inaccurate visual regions from the first stage hurt the final results (error propagation).

With respect to problem formalization, early methods regard MNER as a sequence labeling task that integrates image embeddings into a sequence labeling model and assigns type labels to named entities. Recently, the machine reading comprehension (MRC) framework has been employed in many natural language processing tasks due to its solid language understanding capability (Li et al. 2019a, b; Chen et al. 2021b). To take advantage of the prior knowledge encoded in MRC queries (Li et al. 2019a), we treat MNER as an MRC task, which extracts entity spans by answering queries about entity types. In addition, to capture the fine-grained alignment between entity types and visual regions, we ground the MRC queries on image regions and output their positions as bounding boxes. For example, as shown in Figure 1 (a), recognizing entities of types PER and ORG in the sentence "Got to meet my favorite defensive player in the NFL today. Thank you @ Jurrellc for coming out today!" is formalized as extracting answer spans from the input sentence given the queries "Person: People's name…" and "Organization: Include club…". The answer spans "Jurrellc" and "NFL" are then obtained along with their visual regions, marked by red and yellow boxes.

To this end, we propose an end-to-end MRC framework for Multimodal Named Entity Recognition with Query Grounding (MNER-QG). This joint-training approach forces the model to explicitly align entity spans with the corresponding visual regions, and further improves the performance of both named entity recognition and query grounding. Specifically, we design unified queries with prior information as navigators to pilot our joint-training model. Meanwhile, we extract multi-scale visual features and design two interaction mechanisms, multi-scale cross-modality interaction and existence-aware uni-modality interaction, to enrich both textual and visual information. Since there are few fine-grained annotations for visual regions in existing MNER datasets, we provide two types of bounding box annotations, weak supervisions and manual annotations. The former is obtained by training a visual grounding model with transfer learning, while the latter aims to provide oracle results.

In summary, the contribution of this paper is three-fold:

  • We propose a novel end-to-end MRC framework, MNER-QG. Our model simultaneously performs MRC-based multimodal named entity recognition and query grounding in a joint-training manner. To the best of our knowledge, this is the first such attempt for MNER.

  • To fulfill the end-to-end training, we contribute weak supervisions via training a visual grounding model with transfer learning. Meanwhile, we offer manual annotations of bounding boxes as oracle results.

  • We conduct extensive experiments on two public MNER datasets, Twitter2015 and Twitter2017, to evaluate the performance of our framework. Experimental results show that MNER-QG outperforms the current state-of-the-art models on both datasets for MNER, and also improves the QG performance.

Related Work

Multimodal Named Entity Recognition

With the increasing popularity of multimodal data on social media platforms, multimodal named entity recognition (MNER) has become an important research direction, which assists the NER models (Li et al. 2021b, a, 2022) in better identifying entities by taking images as the auxiliary input. The critical challenge of MNER is how to align and fuse textual and visual information. Yu et al. (2020) proposed a multimodal transformer architecture for MNER, which captures expressive text-image representations by incorporating the auxiliary entity span detection. Zhang et al. (2021) created the graph connection between textual words and visual objects acquired by a visual grounding toolkit (Yang et al. 2019), and proposed a graph fusion approach to conduct graph encoding. Xu et al. (2022) proposed a matching and alignment framework for MNER to improve the consistency of representations in different modalities.

Lacking prior information of entity types and accurate annotations of visual regions corresponding to certain entity types, the above methods feed visual information (an entire image, image patches, or retrieved visual regions from toolkits) with the entire sentence into an entity recognition model, which inevitably makes it difficult to obtain the explicit alignment between images and texts.

Machine Reading Comprehension

Machine Reading Comprehension (MRC) aims to answer natural language queries given a set of contexts from which the answers can be inferred. Among the various forms of MRC, span extraction MRC (Peng et al. 2021; Jia et al. 2022a) is challenging: it extracts a span from the context as the answer. Span extraction can be regarded either as two multi-class classification tasks or as two binary classification tasks. For the former, the model needs to predict the start and end positions of an answer. For the latter, the model needs to decide whether each token is a start/end position. Early approaches used recurrent neural networks (RNNs) to encode textual information and a linear projection layer to predict answer spans (Yang et al. 2018; Nishida et al. 2019). Performance was then boosted by the development of large-scale pre-trained models (Qiu et al. 2019; Tu et al. 2020), such as ELMo (Peters et al. 2018), BERT (Devlin et al. 2019), and RoBERTa (Liu et al. 2019).

Recently, there has been a trend of converting NLP tasks to the MRC form, including named entity recognition (Li et al. 2019a), entity relation extraction (Li et al. 2019b), and sentiment analysis (Chen et al. 2021b). Owing to the powerful understanding ability of MRC, the performance on these tasks is improved.

Visual Grounding

Visual grounding aims to localize textual entities or referring expressions in an image. This task is divided into two paradigms: two-stage and one-stage. For the former, the first stage is exploited to extract region proposals as candidates via some region proposal methods (e.g., Edgebox (Zitnick and Dollár 2014), selective search (Uijlings et al. 2013), and Region Proposal Networks (Ren et al. 2015)), and then the second stage is designed to rank region-text candidate pairs.

Figure 2: Overview of our MNER-QG framework (M-s Fusion denotes Multi-scale Fusion).

For the latter, researchers utilize a one-stage model (e.g., YOLO (Redmon and Farhadi 2018; Bochkovskiy, Wang, and Liao 2020)) combined with extra features to directly output the final region(s). Compared with the two-stage manner, the one-stage framework is simpler and accelerates inference by conducting detection and matching simultaneously.

To connect visual grounding and MRC-based named entity recognition effectively, we use the queries from MRC as input texts and force the model to perform query grounding. Since the queries contain prior knowledge of entity types, our work can achieve explicit alignment between entity types and visual regions.

Method

Overview

Figure 2 illustrates the overall architecture of MNER-QG. Given a sentence $S=\{s_{0},s_{1},\ldots,s_{n-1}\}$ and its associated image $V$, where $n$ denotes the sentence length, we first design a natural language query $Q=\{q_{0},q_{1},\ldots,q_{m-1}\}$ with prior awareness of entity types. Then, our model performs multi-scale cross-modality interaction and existence-aware uni-modality interaction to simultaneously detect entity spans $s_{\mathrm{start,end}}$ and the corresponding visual regions via answering the query $Q$.
2 说明了 MNER-QG 的整体架构。给定一个句子 S={s0,s1,,sn1}subscript0subscript1subscript1S=\left\{s_{0},s_{1},...,s_{n-1}\right\} 及其关联的图像 VV ,其中 nn 表示句子长度,我们首先设计一个自然语言查询 Q={q0,q1,,qm1}subscript0subscript1subscript1Q=\left\{q_{0},q_{1},...,q_{m-1}\right\} ,并事先了解实体类型。然后,我们的模型进行多尺度跨模态交互和存在感知单模态交互,通过回答查询 QQ 来同时检测实体跨度 sstart,endsubscripts_{\mathrm{start,end}} 和相应的视觉区域。

Query Construction

Query plays a significant role as the navigator in our MNER-QG, and it should be expressed as generically, precisely, and effectively as possible. Table 1 shows examples of queries designed by us. We expect the queries to be moderate in difficulty and to provide informative knowledge for both the MNER and QG tasks, so that the model can exploit the solid understanding capability of MRC without limiting the performance of QG.

Entity Type          Natural Language Query
PER (Person)         Person: People's name and fictional character.
LOC (Location)       Location: Country, city, town, continent by geographical location.
ORG (Organization)   Organization: Include club, company, government party, school government, and news organization.
Table 1: Examples of transforming entity types to queries.

Model Architecture

Input Representation.

For text information, we concatenate a query and sentence pair $\{\mathrm{[CLS]},Q,\mathrm{[SEP]},S,\mathrm{[SEP]}\}$ and encode the result into a 768-dimensional real-valued vector with the pre-trained BERT model (Devlin et al. 2019), where $\mathrm{[CLS]}$ and $\mathrm{[SEP]}$ are special tokens. BERT then outputs a contextual representation $\mathbf{H}\in\mathbb{R}^{c\times d_{c}}$, where $c=m+n+3$ is the length of the BERT input. For visual information, inspired by Yang et al. (2019), we use Darknet-53 (Zhu et al. 2016) with feature pyramid networks (Lin et al. 2017) to extract visual features. The images are resized to $256\times 256$, and the feature maps are at $\frac{1}{32}$, $\frac{1}{16}$, and $\frac{1}{8}$ of the input resolution, respectively. Therefore, the three spatial resolutions are $8\times 8\times d_{1}$, $16\times 16\times d_{2}$, and $32\times 32\times d_{3}$, where $d_{1}=1024$, $d_{2}=512$, and $d_{3}=256$ are the feature channels.

We unify the dimensions of the three visual features and the textual feature to facilitate model computation. Specifically, we add a $1\times 1$ convolution layer with batch normalization and ReLU under the feature pyramid networks to map the feature channels $d_{1}$, $d_{2}$, and $d_{3}$ to the same dimension $d=512$. The new visual features are denoted as $\mathbf{U}_{1}$, $\mathbf{U}_{2}$, and $\mathbf{U}_{3}$. At the same time, we flatten the $8\times 8$, $16\times 16$, and $32\times 32$ grids to 64, 256, and 1024 locations, which are used to generate the visual representations $\mathbf{U}_{1}^{f}$, $\mathbf{U}_{2}^{f}$, and $\mathbf{U}_{3}^{f}$. For textual information, we use a linear projection to map $d_{c}$ to $d=512$, and the mapped representation is $\mathbf{H}^{\prime}$.
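A minimal PyTorch sketch of this dimension-unification step is given below; the module names and example shapes are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

d = 512  # unified hidden dimension

class VisualProjection(nn.Module):
    """Maps one pyramid feature map (B, d_i, r, r) to (B, r*r, d)."""
    def __init__(self, in_channels):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_channels, d, kernel_size=1),  # 1x1 convolution
            nn.BatchNorm2d(d),
            nn.ReLU(inplace=True),
        )

    def forward(self, u):                    # u: (B, d_i, r, r)
        u = self.proj(u)                     # (B, d, r, r)
        return u.flatten(2).transpose(1, 2)  # (B, r*r, d): 64, 256, or 1024 locations

text_proj = nn.Linear(768, d)  # maps BERT's d_c = 768 to d = 512

# Example with the three pyramid scales (batch size 2)
U1f = VisualProjection(1024)(torch.randn(2, 1024, 8, 8))   # (2, 64, 512)
U2f = VisualProjection(512)(torch.randn(2, 512, 16, 16))   # (2, 256, 512)
U3f = VisualProjection(256)(torch.randn(2, 256, 32, 32))   # (2, 1024, 512)
H_prime = text_proj(torch.randn(2, 53, 768))               # (2, c, 512)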

Multi-scale Cross-modality Interaction.

This module is shown in Figure 2 within the grey box. We first truncate the token-level representations of the query $\mathbf{Q}$ from $\mathbf{H}^{\prime}$. The query encoding now contains messages from the original sentence $S$, which can be passed to the QG task. Then, we use an attention-based approach (Rei and Søgaard 2019) to acquire the summarized query representation $\mathbf{q}\in\mathbb{R}^{1\times d}$ that will be fed into QG:

$$\alpha=\mathrm{softmax}\left(\mathrm{MLP}\left(\mathbf{Q}\right)\right),\quad\mathbf{q}=\sum_{k=0}^{m-1}\alpha_{k}\mathbf{Q}\left[k,:\right] \qquad (1)$$
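The following sketch shows one way to realize the attention-pooled query summarization in Equation (1); the two-layer MLP width and the activation are assumptions.

import torch
import torch.nn as nn

class QuerySummarizer(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.Tanh(), nn.Linear(d, 1))

    def forward(self, Q):                            # Q: (m, d) token-level query states
        alpha = torch.softmax(self.mlp(Q), dim=0)    # (m, 1) weights over query tokens
        return (alpha * Q).sum(dim=0, keepdim=True)  # q: (1, d) summarized query

q = QuerySummarizer()(torch.randn(12, 512))          # -> (1, 512)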

To fully exploit image information, we employ multi-scale visual representations to update the textual representation through a cross-modality attention mechanism, where $\mathbf{H}^{\prime}$ serves as the query matrix, while each of $\mathbf{U}_{1}^{f}$, $\mathbf{U}_{2}^{f}$, and $\mathbf{U}_{3}^{f}$ serves as the key and value matrix. The visual-enhanced attention outputs are denoted as $\mathbf{H}_{1}$, $\mathbf{H}_{2}$, and $\mathbf{H}_{3}\in\mathbb{R}^{c\times d}$. Then, we merge these matrices into a unified textual representation $\mathbf{H}_{a}$ using $\mathrm{MeanPooling}$. Finally, we concatenate $\mathbf{H}_{a}$ and $\mathbf{H}^{\prime}$, and feed the result into a feed-forward neural network to get the final textual representation $\mathbf{H}_{u}$.
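A hedged sketch of this interaction is shown below: the text attends to each visual scale separately, the three outputs are mean-pooled, concatenated with the text, and passed through a feed-forward network. The head count and layer sizes are assumptions.

import torch
import torch.nn as nn

d = 512

class MultiScaleCrossModality(nn.Module):
    def __init__(self, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, H_prime, visual_feats):
        # H_prime: (B, c, d); visual_feats: list of (B, r_i*r_i, d) for the 3 scales
        enhanced = [self.attn(H_prime, U, U)[0] for U in visual_feats]  # H_1, H_2, H_3
        H_a = torch.stack(enhanced, dim=0).mean(dim=0)                  # mean pooling
        return self.ffn(torch.cat([H_a, H_prime], dim=-1))              # H_u: (B, c, d)

H_u = MultiScaleCrossModality()(
    torch.randn(2, 53, d),
    [torch.randn(2, 64, d), torch.randn(2, 256, d), torch.randn(2, 1024, d)],
)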

Existence-aware Uni-modality Interaction.

Since the sentence does not always contain the entity asked about by the current query, we design a global existence signal to enhance the model's awareness of entity existence. Similar to Equation (1), we summarize the contextual representation $\mathbf{H}^{\prime}$ to acquire the existence representation $\mathbf{H}_{g}\in\mathbb{R}^{1\times d}$. Inspired by Qin et al. (2021) and Li et al. (2021b), we then employ a label attention network to update both the textual representation with the encoding of the start/end labels and the existence representation with the encoding of the existence label. (Note that $\mathbf{e}^{l}$ in Figure 2 denotes a label embedding lookup table.) Details of the label attention network are provided in the Appendix. We then obtain the start/end label-enhanced textual representations, $\mathbf{H}_{s}$/$\mathbf{H}_{e}$, which can be regarded as the start/end representations of entity spans, as well as the label-enhanced existence representation $\mathbf{\widehat{H}}_{g}$.

We calculate attention scores between $\mathbf{H}_{s}$ and $\mathbf{\widehat{H}}_{g}$, and define the existence-aware start representation $\mathbf{\widetilde{H}}_{s}$ as follows:

$$\mathbf{Z}_{s}=\mathrm{softmax}\left(\frac{\mathbf{Q}_{s}\mathbf{K}_{g}^{\top}}{\sqrt{d_{k}}}\right)\mathbf{V}_{g},\quad\mathbf{\widetilde{H}}_{s}=\mathrm{LN}\left(\mathbf{H}_{s}+\mathbf{Z}_{s}\right), \qquad (2)$$

where $\mathrm{LN}$ denotes the layer normalization function (Ba, Kiros, and Hinton 2016), $\mathbf{Q}_{s}\in\mathbb{R}^{c\times d}$, $\mathbf{K}_{g},\mathbf{V}_{g}\in\mathbb{R}^{1\times d}$, and $\mathbf{\widetilde{H}}_{s}\in\mathbb{R}^{c\times d}$. Similarly, we can obtain the updated end representation $\mathbf{\widetilde{H}}_{e}\in\mathbb{R}^{c\times d}$ and the updated existence representation $\mathbf{\widetilde{H}}_{g}\in\mathbb{R}^{1\times d}$.
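A small sketch of Equation (2) follows, with assumed projection layers for $\mathbf{Q}_{s}$, $\mathbf{K}_{g}$, and $\mathbf{V}_{g}$; it illustrates the residual, layer-normalized attention to the single existence vector rather than reproducing the exact implementation.

import math
import torch
import torch.nn as nn

d = 512

class ExistenceAwareInteraction(nn.Module):
    def __init__(self):
        super().__init__()
        self.q_proj, self.k_proj, self.v_proj = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.ln = nn.LayerNorm(d)

    def forward(self, H_s, H_g_hat):
        # H_s: (c, d) start representation; H_g_hat: (1, d) existence representation
        Q_s, K_g, V_g = self.q_proj(H_s), self.k_proj(H_g_hat), self.v_proj(H_g_hat)
        scores = Q_s @ K_g.transpose(0, 1) / math.sqrt(d)  # (c, 1)
        Z_s = torch.softmax(scores, dim=-1) @ V_g          # (c, d)
        return self.ln(H_s + Z_s)                          # existence-aware start repr.

H_s_tilde = ExistenceAwareInteraction()(torch.randn(50, d), torch.randn(1, d))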

Query Grounding

Following Yang et al. (2019), we first broadcast the query representation $\mathbf{q}$ to each spatial location, denoted as $(i,j)$, and then concatenate the query feature and the visual feature $\mathbf{U}_{i}$, where $i=1,2,3$. The feature dimension after concatenation is $512+512=1024$. Another $1\times 1$ convolution layer is appended to better fuse the above features at each location independently and map them to the dimension $d=512$.

Next, we perform the grounding operation. There are $8\times 8+16\times 16+32\times 32=1344$ locations across the three spatial resolutions, and each location is associated with a $512$-dimensional feature vector from the fusion layer. The YOLOv3 network centers three anchor boxes on each location, hence it predicts bounding boxes at three scales. The output of the YOLOv3 network at each scale is $\left[3\times\left(4+1\right)\right]\times r_{i}\times r_{i}$, for shifting the center, width, and height $\left(t_{x},t_{y},t_{w},t_{h}\right)$ of each anchor box, along with the confidence score of the shifted box, where $r_{i}\times r_{i}$ denotes the shape of each spatial resolution. Ultimately, only one region is output for query grounding. More details can be found in Yang et al. (2019).
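The sketch below illustrates this grounding head for a single scale, under assumed layer sizes: the pooled query is broadcast to every spatial location, fused with a 1x1 convolution, and mapped to 3 x (4 + 1) values per location (box offsets t_x, t_y, t_w, t_h plus a confidence score).

import torch
import torch.nn as nn

d = 512

class GroundingHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.fuse = nn.Sequential(nn.Conv2d(2 * d, d, kernel_size=1),
                                  nn.BatchNorm2d(d), nn.ReLU(inplace=True))
        self.pred = nn.Conv2d(d, 3 * (4 + 1), kernel_size=1)

    def forward(self, U, q):
        # U: (B, d, r, r) visual feature at one scale; q: (B, d) pooled query
        B, _, r, _ = U.shape
        q_map = q[:, :, None, None].expand(B, d, r, r)    # broadcast query
        fused = self.fuse(torch.cat([U, q_map], dim=1))   # (B, d, r, r)
        return self.pred(fused)                           # (B, 15, r, r)

out = GroundingHead()(torch.randn(2, d, 8, 8), torch.randn(2, d))  # (2, 15, 8, 8)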

The objective function $\mathcal{L}_{QG}$ of the QG task consists of a regression loss on the bounding box, $\mathcal{L}_{bbox}$, and an objectness score loss, $\mathcal{L}_{object}$. $\mathcal{L}_{bbox}$ is expected to assign bounding box regions to ground-truth objects via the mean squared error (MSE). $\mathcal{L}_{object}$ is used to classify the bounding box regions as object or non-object via binary cross-entropy (BCE).
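A compact sketch of this objective, under assumed tensor shapes, is:

import torch.nn.functional as F

def qg_loss(pred_offsets, gt_offsets, pred_conf, conf_target):
    # pred_offsets, gt_offsets: (4,) offsets (t_x, t_y, t_w, t_h) of the matched anchor
    # pred_conf: (N,) confidence logits over all anchor locations
    # conf_target: (N,) 1.0 for the anchor matched to the ground truth, 0.0 elsewhere
    l_bbox = F.mse_loss(pred_offsets, gt_offsets)                          # regression (MSE)
    l_object = F.binary_cross_entropy_with_logits(pred_conf, conf_target)  # objectness (BCE)
    return l_bbox + l_object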

Multimodal Named Entity Recognition

The core of multimodal named entity recognition is to predict the entity span in the sentence. In this section, we design an auxiliary task named existence detection (ED), which takes the existence representation $\mathbf{\widetilde{H}}_{g}$ and predicts whether the sentence contains entities of the specific type, cooperating with the entity span prediction task to extract entity spans.

Existence Detection.

This task and the entity span prediction task can share mutual information through the co-interactive attention mechanism. The existence of an entity is detected as follows:

$$\mathrm{P}_{exist}=\mathrm{softmax}\left(\mathbf{\widetilde{H}}_{g}\mathbf{W}_{exist}\right) \qquad (3)$$

where $\mathbf{W}_{exist}\in\mathbb{R}^{d\times 2}$ and $\mathrm{P}_{exist}\in\mathbb{R}^{1\times 2}$. We formulate the ED sub-task as a text classification task. The loss function is denoted by $\mathcal{L}_{ED}$, and the binary cross-entropy (BCE) loss is taken as the training objective.
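A tiny sketch of this head (assuming a two-way softmax over "absent"/"present"):

import torch
import torch.nn as nn

ed_head = nn.Linear(512, 2)                           # W_exist
H_g_tilde = torch.randn(1, 512)                       # existence representation
p_exist = torch.softmax(ed_head(H_g_tilde), dim=-1)   # (1, 2), Eq. (3)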

Entity Span Prediction.

To tag the entity span in a sentence using the MRC framework, it is necessary to find the start and end positions of the entity. We utilize two binary classifiers to predict whether each token in the sentence is a start or end index, respectively. The probability that each token is predicted to be a start position is:

$$\mathrm{P}_{start}=\mathrm{softmax}_{\mathrm{each\ row}}\left(\mathbf{\widetilde{H}}_{s}\mathbf{W}_{s}\right) \qquad (4)$$

where $\mathbf{W}_{s}\in\mathbb{R}^{d\times 2}$ and $\mathrm{P}_{start}\in\mathbb{R}^{c\times 2}$. Similarly, we can obtain the probability of the end position, $\mathrm{P}_{end}\in\mathbb{R}^{c\times 2}$.

Since there could be multiple entities of the same type in the sentence, we add a binary classification model to predict the matching probability of start and end positions inspired by Li et al. (2019a).

$$\mathrm{P}_{match}=\mathrm{sigmoid}\left(\mathbf{W}_{m}\left[\mathbf{\widetilde{H}}_{s};\mathbf{\widetilde{H}}_{e}\right]\right) \qquad (5)$$

where $\mathbf{W}_{m}\in\mathbb{R}^{1\times 2d}$ and $\mathrm{P}_{match}\in\mathbb{R}^{1\times 2}$. $\left[;\right]$ denotes concatenation along columns.
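The following sketch gives one illustrative reading of Equations (4)-(5): per-token start/end classifiers plus a pairwise start-end matching classifier over all token pairs; the exact pairing scheme is an assumption.

import torch
import torch.nn as nn

d = 512

class SpanPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.start_fc = nn.Linear(d, 2)      # W_s
        self.end_fc = nn.Linear(d, 2)        # W_e
        self.match_fc = nn.Linear(2 * d, 1)  # W_m

    def forward(self, H_s_tilde, H_e_tilde):
        # H_s_tilde, H_e_tilde: (c, d)
        p_start = torch.softmax(self.start_fc(H_s_tilde), dim=-1)  # (c, 2), Eq. (4)
        p_end = torch.softmax(self.end_fc(H_e_tilde), dim=-1)      # (c, 2)
        c = H_s_tilde.size(0)
        pairs = torch.cat([H_s_tilde[:, None, :].expand(c, c, d),
                           H_e_tilde[None, :, :].expand(c, c, d)], dim=-1)
        p_match = torch.sigmoid(self.match_fc(pairs)).squeeze(-1)  # (c, c), Eq. (5)
        return p_start, p_end, p_match

p_start, p_end, p_match = SpanPredictor()(torch.randn(50, d), torch.randn(50, d))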

During training, the objective function $\mathcal{L}_{ESP}$ of the entity span prediction (ESP) sub-task consists of the start position loss $\mathcal{L}_{start}$, the end position loss $\mathcal{L}_{end}$, and the matching loss $\mathcal{L}_{match}$, all computed with binary cross-entropy (BCE).

Finally, combining the two tasks QG and MNER, the overall objective function is as follows:

$$\mathcal{L}=\omega_{f}\mathcal{L}_{QG}+\lambda_{1}\mathcal{L}_{ED}+\lambda_{2}\mathcal{L}_{ESP} \qquad (6)$$

where $\omega_{f}$, $\lambda_{1}$, and $\lambda_{2}$ are hyper-parameters that control the contributions of each sub-task.
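For reference, the joint objective can be written as the sketch below; the values $\lambda_{1}=1$ and $\lambda_{2}=2$ follow the implementation details in the next section, and the dynamic balance factor $\omega_{f}$ is described in the Appendix.

def total_loss(l_qg, l_ed, l_esp, omega_f, lambda_1=1.0, lambda_2=2.0):
    # Eq. (6): weighted sum of query grounding, existence detection,
    # and entity span prediction losses
    return omega_f * l_qg + lambda_1 * l_ed + lambda_2 * l_esp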

Experiments

Dataset Construction

We use two widely-used MNER datasets, Twitter2015 (Zhang et al. 2018) and Twitter2017 (Lu et al. 2018), to evaluate the effectiveness of our framework. Both datasets are split into training, validation, and test sets with the same type distribution; statistics are listed in the Appendix. We then contribute two types of labels for public research: weak supervisions and manual annotations.

For weak supervisions, we apply the pre-trained fast and accurate one-stage visual grounding model (Yang et al. 2019) (denoted as FA-VG) as the base model. In the phrase localization setting, FA-VG was trained and evaluated on the Flickr30K Entities dataset (Plummer et al. 2015), which augments the original Flickr30K (Young et al. 2014) with region-phrase correspondence annotations. However, there are two obstacles: (1) these phrases/queries come from image captions and are not specially constructed for the named entity recognition task; (2) the MNER datasets (i.e., Twitter2015/2017) have different data domains from the Flickr30K Entities dataset. Thus, we utilize transfer learning to overcome the above issues. In addition, we contribute manual annotations for public research. We hire three crowd-sourced workers who are familiar with the MNER and object detection tasks to annotate the bounding boxes in the images. The annotators are asked to tag the visual regions in the image corresponding to the entity spans in the sentence. After annotation, we merge the instances with strong inter-annotator agreement from the three workers to acquire high-quality and explicit text-image alignment data. Details of the annotation with both types of labels are provided in the Appendix.

Experiment Settings

Evaluation Metrics.

For the MNER task, we use precision ($Pre.$), recall ($Rec.$), and F1 score ($F1$) to evaluate the performance over all entity types, and use $F1$ only for each single type. For QG, we follow prior work (Rohrbach et al. 2016) and use Accu@0.5 as the evaluation protocol. Given a query, an output image region is considered correct if its IoU with the ground-truth bounding box is at least 0.5. In addition, we add Accu@0.75 (IoU of at least 0.75) and Miou (mean IoU) as additional evaluation metrics.
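A small sketch of this protocol (boxes given as (x1, y1, x2, y2) corner coordinates, an assumed convention):

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def accuracy_at(pred_boxes, gt_boxes, threshold=0.5):
    # Accu@0.5 or Accu@0.75 depending on the threshold
    hits = sum(iou(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)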

Implementation Details.

The learning rate and dropout rate are set to 5e-5 and 0.3, which obtain the best performance on the validation sets of the two datasets after a grid search over the intervals [1e-5, 1e-4] and [0.1, 0.6]. We train the model with the AdamW optimizer. To further evaluate our joint-training model, we take the images from Twitter2015/2017 to train the QG model separately. For a fair comparison, we use the same configurations, such as batch size, learning rate, and optimizer, in both the QG model and our joint-training model. For the joint-training loss, we set the hyper-parameters $\lambda_{1}=1$ and $\lambda_{2}=2$ by tuning on the validation set. We additionally set a balance factor $\omega_{f}$ to dynamically scale the losses of MNER and QG. Please refer to the Appendix for calculation details.

Baseline Models.

Two groups of baselines are compared with our approach. The first group consists of text-based NER models that formalize MNER as a sequence labeling task: (1) BiLSTM-CRF (Huang, Xu, and Yu 2015); (2) CNN-BiLSTM-CRF (Ma and Hovy 2016); (3) HBiLSTM-CRF (Lample et al. 2016); (4) BERT (Devlin et al. 2019); (5) BERT-CRF; (6) T-NER (Ritter et al. 2011; Zhang et al. 2018). The second group contains several competitive MNER models: (1) GVATT-HBiLSTM-CRF (Lu et al. 2018); (2) GVATT-BERT-CRF (Yu et al. 2020); (3) AdaCAN-CNN-BiLSTM-CRF (Zhang et al. 2018); (4) AdaCAN-BERT-CRF (Yu et al. 2020); (5) UMT-BERT-CRF (Yu et al. 2020); (6) MT-BERT-CRF (Yu et al. 2020); (7) ATTR-MMKG-MNER (Chen et al. 2021a); (8) UMGF (Zhang et al. 2021); (9) MAF (Xu et al. 2022). The details of these models are described in the Appendix.

According to different derivations of bounding box labels in the images, we provide two versions of our model MNER-QG and MNER-QG (Oracle) for evaluation. In addition, we provide a variant of the model, MNER-QG-Text, which uses text input only.

Main Results

Table 2 shows the results of our model and the baselines. The upper results are from text-based models and the lower results are from multimodal models. First, we compare the multimodal models with their corresponding uni-modal baselines, such as AdaCAN-CNN-BiLSTM-CRF vs. CNN-BiLSTM-CRF and MNER-QG vs. MNER-QG-Text. We notice that almost all multimodal models significantly outperform their corresponding uni-modal competitors, indicating the effectiveness of images. We then compare our MNER-QG with the other multimodal baselines. The results show that MNER-QG outperforms all baselines on Twitter2017 and yields competitive results on Twitter2015. MNER-QG (Oracle), with more accurate manual annotations, yields further improvements on both datasets.

Methods Twitter2015 Twitter2017
Single Type (F1) Overall Single Type (F1) Overall
PER LOC ORG OTH. Pre. Rec. F1 PER LOC ORG OTH. Pre. Rec. F1
BiLSTM-CRF 76.77 72.56 41.33 26.80 68.14 61.09 64.42 85.12 72.68 72.50 52.56 79.42 73.43 76.31
CNN-BiLSTM-CRF 80.86 75.39 47.77 32.61 66.24 68.09 67.15 87.99 77.44 74.02 60.82 80.00 78.76 79.37
HBiLSTM-CRF 82.34 76.83 51.59 32.52 70.32 68.05 69.17 87.91 78.57 76.67 59.32 82.69 78.16 80.37
BERT 84.72 79.91 58.26 38.81 68.30 74.61 71.32 90.88 84.00 79.25 61.63 82.19 83.72 82.95
BERT-CRF 84.74 80.51 60.27 37.29 69.22 74.59 71.81 90.25 83.05 81.13 62.21 83.32 83.57 83.44
T-NER 83.64 76.18 59.26 34.56 69.54 68.65 69.09 - - - - - - -
MNER-QG-Text (Ours) 84.72 81.13 60.07 39.23 76.35 69.46 72.74 91.33 85.23 81.75 68.41 87.12 84.03 85.55
GVATT-HBiLSTM-CRF 82.66 77.21 55.06 35.25 73.96 67.90 70.80 89.34 78.53 79.12 62.21 83.41 80.38 81.87
AdaCAN-CNN-BiLSTM-CRF 81.98 78.95 53.07 34.02 72.75 68.74 70.69 89.63 77.46 79.24 62.77 84.16 80.24 82.15
GVATT-BERT-CRF 84.43 80.87 59.02 38.14 69.15 74.46 71.70 90.94 83.52 81.91 62.75 83.64 84.38 84.01
AdaCAN-BERT-CRF 85.28 80.64 59.39 38.88 69.87 74.59 72.15 90.20 82.97 82.67 64.83 85.13 83.20 84.10
MT-BERT-CRF 85.30 81.21 61.10 37.97 70.84 74.80 72.58 91.47 82.05 81.84 65.80 84.60 84.16 84.42
UMT-BERT-CRF 85.24 81.58 63.03 39.45 71.67 75.23 73.41 91.56 84.73 82.24 70.10 85.28 85.34 85.31
ATTR-MMKG-MNER 84.28 79.43 58.97 41.47 74.78 71.82 73.27 - - - - - - -
UMGF 84.26 83.17 62.45 42.42 74.49 75.21 74.85 91.92 85.22 83.13 69.83 86.54 84.50 85.51
MAF 84.67 81.18 63.35 41.82 71.86 75.10 73.42 91.51 85.80 85.10 68.79 86.13 86.38 86.25
MNER-QG (Ours) 85.31 81.65 63.41 41.32 77.43 72.15 74.70 92.92 86.19 84.52 71.67 88.26 85.65 86.94
MNER-QG (Oracle) (Ours) 85.68 81.42 63.62 41.53 77.76 72.31 74.94 93.17 86.02 84.64 71.83 88.57 85.96 87.25
Table 2: Results on two MNER datasets. We refer to the results of UMGF from Zhang et al. (2021) and the other results from Xu et al. (2022). Our model achieves a statistically significant improvement with p-value < 0.05 under a paired two-sided t-test.
Methods Twitter2015 Twitter2017
Pre. Rec. F1 Pre. Rec. F1
MNER-QG 77.43 72.15 74.70 88.26 85.65 86.94
- w/o QG loss 77.50 70.79 73.99 88.01 84.69 86.32
- w/o ED loss 77.53 71.20 74.23 87.81 85.28 86.53
- w/o QG+ED loss 77.17 70.29 73.57 87.63 84.47 86.02
Table 3: Ablation study of MNER-QG on the test set.

Ablation Study

Table 3 shows the ablation results. We observe that all sub-tasks are necessary. First, after removing the QG loss, the performance drops significantly on all metrics; in particular, the $F1$ scores on the two datasets degrade by 0.71% and 0.62%, respectively. This result shows that QG training promotes explicit alignment between text and image. Besides, removing the ED loss also damages the performance on all metrics: the $F1$ scores on the two datasets decrease by 0.47% and 0.41%, respectively. We conjecture that ED provides global information for the entire model, which helps the model determine whether the sentence contains the entities asked about by the query. Finally, after removing both the QG and ED losses, the performance drops more significantly, indicating that both tasks are essential in our framework.

Case Study

Here we conduct further qualitative analysis with two specific examples, comparing the results of MNER-QG, MNER-QG (Oracle), and the competitive model UMGF. In Figure 3 (a), the sentence contains two entities, "lebron james" and "Cavaliers", with types PER and ORG, respectively. However, UMGF locates the entity "lebron james" inaccurately and misjudges its type. We suspect this is because UMGF cannot detect the region of the person in the red T-shirt. In contrast, both MNER-QG and MNER-QG (Oracle) extract the region of "lebron james" (red box) for PER and the logo of "Cavaliers" on the clothing (yellow box) for ORG, and the regions extracted by MNER-QG (Oracle) are more accurate due to the more elaborate manual annotations. Compared with UMGF, our model can locate more relevant visual regions, which helps it recognize entities accurately. Figure 3 (b) shows a more challenging case, where the image cannot provide useful regions for LOC. As can be seen, none of UMGF, MNER-QG, and MNER-QG (Oracle) can locate relevant visual regions for this entity. However, both MNER-QG and MNER-QG (Oracle) still recognize "Epcot" and its type. We conjecture that the solid understanding capability of MRC and the guidance of the query's prior information contribute to the final correct prediction.

Figure 3: Example comparison among MNER-QG, MNER-QG (Oracle), and UMGF.

Discussions

Effectiveness of the End-to-End Manner.

Methods          Twitter2015                                                          Twitter2017
                 MNER (Pre. / Rec. / F1)   QG (Accu@0.5 / Accu@0.75 / Miou)           MNER (Pre. / Rec. / F1)   QG (Accu@0.5 / Accu@0.75 / Miou)
MNER-QG-Text     76.35 / 69.46 / 72.74     - / - / -                                  87.12 / 84.03 / 85.55     - / - / -
MNER-VG          77.03 / 71.08 / 73.94     - / - / -                                  87.91 / 84.22 / 86.03     - / - / -
FA-VG            - / - / -                 50.83 / 32.69 / 45.49                      - / - / -                 56.03 / 38.92 / 51.14
MNER-QG (Ours)   77.43 / 72.15 / 74.70     53.93 (M:54.86) / 40.22 (M:41.13) / 49.50 (M:50.41)   88.26 / 85.65 / 86.94   57.50 (M:58.49) / 43.03 (M:43.67) / 54.09 (M:55.3)
Table 4: Performance comparison of joint-training and single-training models on the test set (M denotes Max).

Table 4 shows the results of our joint-training approach and other single-training approaches on the different tasks. (We provide two results for the QG task: one is the QG result when MNER reaches its optimum, and the other is the optimal result on the QG task.) MNER-VG is a two-stage MNER model, which uses the VG model trained via transfer learning to acquire visual regions in the first stage and integrates them into the second stage to enhance token representations. FA-VG is a one-stage VG model, and we retrain it on the Twitter2015/2017 datasets. As can be seen, compared with the models MNER-QG-Text and FA-VG trained on a single data source (e.g., text or image) for the different tasks, our joint-training model significantly improves the performance of each task; e.g., the $F1$ score and Accu@0.5 on Twitter2015 are improved by 1.96% and 3.1% (max: 4.03%), respectively. Compared with the two-stage model MNER-VG, our end-to-end model still has obvious advantages; e.g., the $F1$ scores are increased by 0.76% and 0.91% on Twitter2015/2017, respectively. The above results indicate that the different tasks in our model complement each other in an end-to-end manner and enable the model to yield better performance.

Accuracy of QG.

Methods          Twitter2015 (W.S)   Twitter2015 (M.A)   Twitter2017 (W.S)   Twitter2017 (M.A)   Flickr30K
                 A@0.5               A@0.5               A@0.5               A@0.5               A@0.5
FA-VG            50.83               63.94               56.03               71.02               68.69
MNER-QG (Ours)   54.86               67.41               58.49               73.53               -
Table 5: Results with different bounding box labels on the test set (W.S and M.A denote weak supervisions and manual annotations, respectively; A@0.5 is Accu@0.5). The result of FA-VG on Flickr30K is from Yang et al. (2019).

To check the quality of the labels we contribute for QG, we present the results of different models on the two types of labels. In addition, we provide the result on the high-quality Flickr30K Entities dataset for comparison; the dataset links 31,783 images in Flickr30K with 427K referred entities. Table 5 shows the results. For both MNER-QG and FA-VG, manual annotations hold a relatively obvious advantage over weak supervisions, but the acquisition of weak supervisions is easier and less time-consuming. Regardless of the annotation method, our joint-training MNER-QG significantly improves the performance on the QG task compared with the single-training FA-VG. Compared with the results of FA-VG on Flickr30K Entities, the results on Twitter2015/2017 are competitive. In particular, Accu@0.5 on Twitter2017 with manual annotations is 2.33% higher than the result on Flickr30K Entities. (There are large deviations in the number of images and the data distributions between Twitter2015/2017 and Flickr30K Entities; a comparison of the three datasets is shown in the Appendix.) The results indicate that the two types of labels on Twitter2015/2017 for QG are reliable and leave ample scope for future research.

Effect of Query Transformations.

We explore different ways of query transformation, using the entity type ORG as an example. 1) Keyword: an entity type keyword, e.g., "Organization". 2) Rule-based Template Filling: phrases generated by a simple template, e.g., "Please find Organization". 3) Keyword's Wikipedia: the definition of the entity type keyword from Wikipedia, e.g., "An organization is an entity, such as an institution or an association, that has a collective goal and is linked to an external environment." 4) Keyword+Annotation: the concatenation of a keyword and its annotation, e.g., "Organization: Include club, company, government party, school…". Results are shown in Figure 4. Queries designed by methods 1 and 2 contain deficient information, which results in favorable QG results but limits the language understanding of MRC. For method 3, the definitions from Wikipedia are relatively general, leading to inferior results on both tasks. Compared with the other methods, the framework with queries designed by method 4 achieves the highest $F1$ score and Accu@0.5 on both tasks. We conjecture that method 4 best accords with the requirements for query construction.

Figure 4: Results with different query transformations in MNER and QG on the validation set (K, R-b, K's W, and K+A correspond to methods 1-4 of query transformations).

Conclusion and Future Work

In this work, we propose MNER-QG, an end-to-end MRC framework for MNER with QG. Guided by queries, our model provides prior knowledge of entity types and visual regions, enhances the representations of both texts and images through multi-scale cross-modality and existence-aware uni-modality interactions, and finally extracts entity spans while grounding the queries onto visual regions of the image. To perform the query grounding task, we contribute weak supervisions and manual labels. Experimental results on two widely-used datasets show that the joint-training model MNER-QG competes strongly with other baselines on different tasks. MNER-QG leaves ample scope for further research; in future work, we will explore more multimodal topics.

Acknowledgement

This work is supported by the National Key R&D Program of China under Grant No. 2020AAA0106600.

References

  • Arshad et al. (2019) Arshad, O.; Gallo, I.; Nawaz, S.; and Calefati, A. 2019. Aiding intra-text representations with visual context for multimodal named entity recognition. In 2019 International Conference on Document Analysis and Recognition (ICDAR), 337–342.
  • Ba, Kiros, and Hinton (2016) Ba, J. L.; Kiros, J. R.; and Hinton, G. E. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
  • Bochkovskiy, Wang, and Liao (2020) Bochkovskiy, A.; Wang, C.-Y.; and Liao, H.-Y. M. 2020. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934.
  • Bowman et al. (2015) Bowman, S. R.; Angeli, G.; Potts, C.; and Manning, C. D. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), 632–642.
  • Chen et al. (2021a) Chen, D.; Li, Z.; Gu, B.; and Chen, Z. 2021a. Multimodal Named Entity Recognition with Image Attributes and Image Knowledge. In Database Systems for Advanced Applications - 26th International Conference (DASFAA), 186–201.
  • Chen et al. (2021b) Chen, S.; Wang, Y.; Liu, J.; and Wang, Y. 2021b. Bidirectional machine reading comprehension for aspect sentiment triplet extraction. In Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI), 12666–12674.
  • Chen et al. (2019) Chen, W.; Wang, H.; Chen, J.; Zhang, Y.; Wang, H.; Li, S.; Zhou, X.; and Wang, W. Y. 2019. Tabfact: A large-scale dataset for table-based fact verification. In 8th International Conference on Learning Representations (ICLR).
  • Cui and Zhang (2019) Cui, L.; and Zhang, Y. 2019. Hierarchically-refined label attention network for sequence labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 4113–4126.
  • Devlin et al. (2019) Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 4171–4186.
  • Fleiss (1971) Fleiss, J. L. 1971. Measuring nominal scale agreement among many raters. Psychological bulletin, 76(5): 378.
  • Huang, Xu, and Yu (2015) Huang, Z.; Xu, W.; and Yu, K. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
  • Jia et al. (2022a) Jia, M.; Liao, L.; Wang, W.; Li, F.; Chen, Z.; Li, J.; and Huang, H. 2022a. Keywords-aware Dynamic Graph Neural Network for Multi-hop Reading Comprehension. Neurocomputing, 501: 25–40.
  • Jia et al. (2022b) Jia, M.; Liu, R.; Wang, P.; Song, Y.; Xi, Z.; Li, H.; Shen, X.; Chen, M.; Pang, J.; and He, X. 2022b. E-ConvRec: A Large-Scale Conversational Recommendation Dataset for E-Commerce Customer Service. In Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC), 5787–5796.
  • Jia et al. (2022c) Jia, M.; Shen, X.; Shen, L.; Pang, J.; Liao, L.; Song, Y.; Chen, M.; and He, X. 2022c. Query Prior Matters: A MRC Framework for Multimodal Named Entity Recognition. In Proceedings of the 30th ACM International Conference on Multimedia (ACM MM), 3549–3558.
  • Lample et al. (2016) Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; and Dyer, C. 2016. Neural architectures for named entity recognition. In 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 260–270.
  • Li et al. (2021a) Li, F.; Wang, Z.; Hui, S. C.; Liao, L.; Song, D.; and Xu, J. 2021a. Effective named entity recognition with boundary-aware bidirectional neural networks. In Proceedings of the Web Conference 2021 (WWW), 1695–1703.
  • Li et al. (2021b) Li, F.; Wang, Z.; Hui, S. C.; Liao, L.; Song, D.; Xu, J.; He, G.; and Jia, M. 2021b. Modularized Interaction Network for Named Entity Recognition. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL/IJCNLP), 200–209.
  • Li et al. (2022) Li, J.; Fei, H.; Liu, J.; Wu, S.; Zhang, M.; Teng, C.; Ji, D.; and Li, F. 2022. Unified named entity recognition as word-word relation classification. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 10965–10973.
  • Li et al. (2019a) Li, X.; Feng, J.; Meng, Y.; Han, Q.; Wu, F.; and Li, J. 2019a. A unified MRC framework for named entity recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 5849–5859.
  • Li et al. (2019b) Li, X.; Yin, F.; Sun, Z.; Li, X.; Yuan, A.; Chai, D.; Zhou, M.; and Li, J. 2019b. Entity-relation extraction as multi-turn question answering. In Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL), 1340–1350.
  • Lin et al. (2017) Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; and Belongie, S. 2017. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2117–2125.
  • Liu et al. (2019) Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
  • Lu et al. (2018) Lu, D.; Neves, L.; Carvalho, V.; Zhang, N.; and Ji, H. 2018. Visual attention model for name tagging in multimodal social media. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), 1990–1999.
  • Ma and Hovy (2016) Ma, X.; and Hovy, E. 2016. End-to-end sequence labeling via bi-directional lstm-cnns-crf. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL).
  • Moon, Neves, and Carvalho (2018) Moon, S.; Neves, L.; and Carvalho, V. 2018. Multimodal named entity recognition for short social media posts. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 852–860.
  • Nishida et al. (2019) Nishida, K.; Nishida, K.; Nagata, M.; Otsuka, A.; Saito, I.; Asano, H.; and Tomita, J. 2019. Answering while summarizing: Multi-task learning for multi-hop QA with evidence extraction. In Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL), 2335–2345.
  • Palatucci et al. (2009) Palatucci, M.; Pomerleau, D.; Hinton, G. E.; and Mitchell, T. M. 2009. Zero-shot learning with semantic output codes. In Advances in Neural Information Processing Systems 22 (NIPS), 1410–1418.
  • Peng et al. (2021) Peng, W.; Hu, Y.; Yu, J.; Xing, L.; and Xie, Y. 2021. APER: AdaPtive Evidence-driven Reasoning Network for machine reading comprehension with unanswerable questions. Knowledge-Based Systems, 229: 107364.
  • Peters et al. (2018) Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational (NAACL), 2227–2237.
  • Plummer et al. (2015) Plummer, B. A.; Wang, L.; Cervantes, C. M.; Caicedo, J. C.; Hockenmaier, J.; and Lazebnik, S. 2015. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision (ICCV), 2641–2649.
  • Qin et al. (2021) Qin, L.; Liu, T.; Che, W.; Kang, B.; Zhao, S.; and Liu, T. 2021. A co-interactive transformer for joint slot filling and intent detection. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 8193–8197.
  • Qiu et al. (2019) Qiu, L.; Xiao, Y.; Qu, Y.; Zhou, H.; Li, L.; Zhang, W.; and Yu, Y. 2019. Dynamically fused graph network for multi-hop reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), 6140–6150.
  • Redmon et al. (2016) Redmon, J.; Divvala, S.; Girshick, R.; and Farhadi, A. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 779–788.
  • Redmon and Farhadi (2018) Redmon, J.; and Farhadi, A. 2018. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767.
  • Rei and Søgaard (2019) Rei, M.; and Søgaard, A. 2019. Jointly learning to label sentences and tokens. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 6916–6923.
  • Ren et al. (2015) Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015 (NIPS), 91–99.
  • Ritter et al. (2011) Ritter, A.; Clark, S.; Etzioni, O.; et al. 2011. Named entity recognition in tweets: an experimental study. In Proceedings of the 2011 conference on empirical methods in natural language processing (EMNLP), 1524–1534.
  • Rohrbach et al. (2016) Rohrbach, A.; Rohrbach, M.; Hu, R.; Darrell, T.; and Schiele, B. 2016. Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision (ECCV), 817–834.
  • Socher et al. (2013) Socher, R.; Ganjoo, M.; Manning, C. D.; and Ng, A. 2013. Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems 26 (NIPS), 935–943.
  • Tu et al. (2020) Tu, M.; Huang, K.; Wang, G.; Huang, J.; He, X.; and Zhou, B. 2020. Select, answer and explain: Interpretable multi-hop reading comprehension over multiple documents. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 34, 9073–9080.
  • Uijlings et al. (2013) Uijlings, J. R.; Van De Sande, K. E.; Gevers, T.; and Smeulders, A. W. 2013. Selective search for object recognition. International journal of computer vision, 104(2): 154–171.
  • Xu et al. (2022) Xu, B.; Huang, S.; Sha, C.; and Wang, H. 2022. MAF: A General Matching and Alignment Framework for Multimodal Named Entity Recognition. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining (WSDM), 1215–1223.
  • Yang et al. (2019) Yang, Z.; Gong, B.; Wang, L.; Huang, W.; Yu, D.; and Luo, J. 2019. A fast and accurate one-stage approach to visual grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 4683–4693.
  • Yang et al. (2018) Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W. W.; Salakhutdinov, R.; and Manning, C. D. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2369–2380.
  • Young et al. (2014) Young, P.; Lai, A.; Hodosh, M.; and Hockenmaier, J. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. In Transactions of the Association for Computational Linguistics (TACL), 67–78.
  • Yu et al. (2020) Yu, J.; Jiang, J.; Yang, L.; and Xia, R. 2020. Improving multimodal named entity recognition via entity span detection with unified multimodal transformer. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 3342–3352.
  • Zhang et al. (2021) Zhang, D.; Wei, S.; Li, S.; Wu, H.; Zhu, Q.; and Zhou, G. 2021. Multi-modal Graph Fusion for Named Entity Recognition with Targeted Visual Guidance. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 14347–14355.
  • Zhang et al. (2018) Zhang, Q.; Fu, J.; Liu, X.; and Huang, X. 2018. Adaptive co-attention network for named entity recognition in tweets. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), 5674–5681.
  • Zhang and Saligrama (2016) Zhang, Z.; and Saligrama, V. 2016. Zero-shot learning via joint latent similarity embedding. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6034–6042.
  • Zhu et al. (2016) Zhu, Y.; Groth, O.; Bernstein, M.; and Fei-Fei, L. 2016. Visual7w: Grounded question answering in images. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 4995–5004.
  • Zitnick and Dollár (2014) Zitnick, C. L.; and Dollár, P. 2014. Edge boxes: Locating object proposals from edges. In European conference on computer vision (ECCV), 391–405.

Appendix A Details of Label Attention Network

Label attention networks have been exploited to better facilitate text classification (Socher et al. 2013; Zhang and Saligrama 2016). Inspired by previous works (Cui and Zhang 2019; Qin et al. 2021; Li et al. 2021b), we apply a label attention network to obtain the start/end label-enhanced textual representations and the label-enhanced existence representation in the Existence-aware Uni-modality Interaction Module. Due to space limitations, we omit the implementation details in the main paper and introduce the computational details here.

Label Representation.

Given sets of candidate labels $\mathrm{Y}_{start}=\{y_{s_{0}},\ldots,y_{\left|start\right|-1}\}$, $\mathrm{Y}_{end}=\{y_{e_{0}},\ldots,y_{\left|end\right|-1}\}$, and $\mathrm{Y}_{exist}=\{y_{i_{0}},\ldots,y_{\left|exist\right|-1}\}$ for start position prediction, end position prediction, and existence detection, we take start position prediction as an example to illustrate the calculation process. Each start-position label $y_{s_{i}}$ is represented by an embedding vector:

$\mathbf{S}_{i}=\mathbf{e}^{l}\left(y_{s_{i}}\right)$ (7)

where $\mathbf{e}^{l}$ denotes a label embedding lookup table. Label embeddings are randomly initialized and tuned during model training. The label representation of start positions is $\mathbf{S}\in\mathbb{R}^{\left|start\right|\times d}$, where $d$ denotes the dimension of the hidden state and $\left|start\right|$ represents the number of start-position labels.

Label-enhanced Textual Representation.

For the label attention layer, the attention mechanism produces an attention matrix containing the distribution of potential labels for each token. We first receive the contextual representation $\mathbf{H}_{u}$ that interacts with the image information and the initial global existence representation $\mathbf{H}_{g}$. We then define $\mathbf{Q}=\mathbf{H}_{u}$ and $\mathbf{K}=\mathbf{V}=\mathbf{S}$, and the start representation of the entity span with start position labels is calculated by:

$\mathbf{H}^{l}_{s}=\mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d}}\right)\mathbf{V}$ (8)

We use multi-head attention to capture multiple possible label distributions in parallel:

$\mathbf{H}^{l}_{s}=\mathrm{concat}\left(\mathrm{head}_{1},\ldots,\mathrm{head}_{h}\right)$ (9)
$\mathrm{head}_{i}=\mathrm{attention}\left(\mathbf{Q}\mathbf{W}^{Q}_{i},\mathbf{K}\mathbf{W}^{K}_{i},\mathbf{V}\mathbf{W}^{V}_{i}\right)$ (10)

where $\mathbf{W}^{Q}_{i},\mathbf{W}^{K}_{i},\mathbf{W}^{V}_{i}\in\mathbb{R}^{d\times\frac{d}{h}}$ are parameters learned during training and $h$ is the number of heads.

Finally, we obtain the start label-enhanced textual representation as follows:

$\mathbf{H}_{s}=\mathbf{W}^{l}_{s}\left[\mathbf{H}^{l}_{s};\mathbf{H}_{u}\right]$ (11)

where $\mathbf{W}^{l}_{s}\in\mathbb{R}^{d\times 2d}$ and $\mathbf{H}_{s}\in\mathbb{R}^{c\times d}$.

The end representation of the entity span and the existence representation are computed analogously. The end representation of the entity span is $\mathbf{H}_{e}\in\mathbb{R}^{c\times d}$ and the existence representation is $\widehat{\mathbf{H}}_{g}\in\mathbb{R}^{1\times d}$.
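To make the computation above concrete, the following PyTorch-style sketch mirrors Eqs. (7)–(11): label embeddings serve as keys and values, the contextual representation serves as queries, and the attended output is concatenated with the input and projected back to the hidden size. The class name, head count, and the binary start/non-start label set in the example are our illustrative assumptions, not details of the released implementation.

```python
import torch
import torch.nn as nn

class LabelAttention(nn.Module):
    """Minimal sketch of the label attention layer (Eqs. 7-11)."""
    def __init__(self, num_labels: int, d: int, num_heads: int = 8):
        super().__init__()
        self.label_emb = nn.Embedding(num_labels, d)             # Eq. (7): lookup table e^l
        self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.proj = nn.Linear(2 * d, d)                          # Eq. (11): W^l_s

    def forward(self, H_u: torch.Tensor) -> torch.Tensor:
        # H_u: (batch, c, d) contextual representation from the interaction module
        batch = H_u.size(0)
        S = self.label_emb.weight.unsqueeze(0).expand(batch, -1, -1)  # (batch, |labels|, d)
        # Eqs. (8)-(10): multi-head attention with Q = H_u, K = V = S
        H_l, _ = self.attn(query=H_u, key=S, value=S)
        # Eq. (11): concatenate with H_u and project to the label-enhanced representation
        return self.proj(torch.cat([H_l, H_u], dim=-1))          # (batch, c, d)

# Example usage (num_labels=2 assumes binary start/non-start labels)
layer = LabelAttention(num_labels=2, d=768)
H_s = layer(torch.randn(4, 32, 768))   # label-enhanced textual representation
```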

Appendix B Details of Datasets

Statistics of Datasets.

Both Twitter2015 and Twitter2017 contain four entity types for sentence-image pairs: Person (PER), Organization (ORG), Location (LOC), and Others (OTHER). The statistics are shown in Table 6.

Type       Twitter2015               Twitter2017
           Train    Dev     Test     Train    Dev     Test
PER        2,217    522     1,816    2,943    626     621
LOC        2,091    522     1,697    731      173     178
ORG        928      247     839      1,674    375     395
OTHER      940      225     726      701      150     157
Total      6,176    1,546   5,078    6,049    1,324   1,351
# Tweets   4,000    1,000   3,257    3,373    723     723
Table 6: The statistics summary of two MNER datasets.

Details of Weak Supervisions.

We train a visual grounding model via transfer learning to provide weak supervisions. To adapt the pre-trained FA-VG model to MNER-QG, we construct a corpus consisting of three sets of samples:

  1. Samples from the Flickr30K Entities dataset with phrases highly related to the pre-defined PER, LOC, ORG, and OTHER queries.

  2. Samples from (1) with phrases replaced by MNER-QG queries.

  3. Samples from the training set of the Twitter2015/2017 datasets with manually labeled regions of PER, LOC, and ORG types.

Since only a small part of the phrases in the Flickr30K Entities dataset are related to MNER-QG entity types, we first filter the original data to obtain highly relevant samples and then replace their phrases with MNER-QG queries. Specifically, all phrases in the Flickr30K Entities dataset and the four MNER-QG queries are represented by BERT embeddings. We then calculate the cosine similarity between each phrase embedding and each query embedding, and only keep samples with scores larger than a threshold, e.g., 0.7. To obtain the second set, we make a copy of the first set and conduct query replacement.
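As a rough illustration of this filtering step, the sketch below embeds phrases and queries with BERT and keeps a sample if its phrase is similar enough to any query. The checkpoint name, mean pooling, and helper names are our assumptions; the paper only specifies BERT embeddings, cosine similarity, and the 0.7 threshold.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Mean-pooled BERT embeddings (pooling strategy is an assumption)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state             # (n, len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)                  # (n, 768)

# The four MNER-QG queries (abbreviated here)
queries = ["Person: People's name ...", "Location: Country, city ...",
           "Organization: Include club, company ...", "Other: ..."]
query_emb = embed(queries)

def keep_sample(phrase: str, threshold: float = 0.7) -> bool:
    """Keep a Flickr30K Entities phrase if it matches any query closely enough."""
    sims = torch.nn.functional.cosine_similarity(embed([phrase]), query_emb)
    return bool(sims.max() >= threshold)
```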

Figure 5: Illustration of Corpus Construction. (a) Replacing original phrases in the Flickr30K Entities dataset with MRC queries. (b) Labelling existing regions related to PER, LOC, and ORG in the Twitter2015/2017-Training dataset.

As shown in Figure 5(a), the original query “team uniform” is replaced by “Organization: Include club, company…”, which is defined as the MNER-QG query for ORG in the table of examples for transforming entity types to queries in this paper. To take advantage of in-domain data, we randomly sample 1,000+ images from the training set of the Twitter2015/2017 datasets and manually annotate regions related to PER, LOC, ORG, and OTHER. Take Figure 5(b) as an example: the image is labeled with three pairs of regions and queries, a red box with “Person: People’s name…”, a blue box with “Location: Country, city…”, and a yellow box with “Organization: Include…”. The statistics of the constructed corpus are summarized in Table 7. We split the corpus into training/validation/test sets with a ratio of 9:0.5:0.5. During training, all sets of samples are shuffled so that the model can be finetuned to not only maintain its ability of accurate visual grounding but also adapt to the new task and domain.

Total data volume                   26,311
F.30K data (unmodified)             12,504
F.30K data + modified query data    12,504
Tw.15/17 data + query + b.-box      1,303
LOC query data                      2,983 (F.30K) + 700 (Tw.15/17)
ORG query data                      4,191 (F.30K) + 350 (Tw.15/17)
PER query data                      4,362 (F.30K) + 253 (Tw.15/17)
OTHER query data                    968 (F.30K)
Table 7: Statistics of our constructed VG corpus (F.30K and Tw.15/17 denote Flickr30K and Twitter2015/2017, respectively, and b.-box denotes bounding box).

FA-VG makes predictions based on the fused representations of image, text, and spatial features (the spatial feature captures the coordinates of the top-left corner, center, and bottom-right corner of the grid at ($i$, $j$)). After being finetuned on the constructed corpus, FA-VG achieves a grounding accuracy (IoU $>$ 0.5, where IoU, the Intersection over Union, measures the extent of overlap between two boxes) of 79.96% on the test set, which significantly outperforms the pre-trained FA-VG model without MNER adaptation (64.72%). The results show that a highly flexible visual grounding model with transfer learning can better localize regions related to entity types. Finally, we obtain a set of bounding-box labels from the output layer of FA-VG.
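For reference, a prediction is counted as correct when its IoU with the ground-truth box exceeds 0.5; a minimal sketch of the IoU computation, assuming boxes given as (x1, y1, x2, y2) corner coordinates, is shown below.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Grounding accuracy counts iou(pred_box, gold_box) > 0.5 as a hit.
```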

Details of Manual Annotations.

Three crowd-sourced workers are hired to annotate the bounding boxes in the images of Twitter2015 and Twitter2017. We ask the workers to match sentences and images in tweets when they perform annotations. If a sentence does not contain entities of a particular type, the labeled bounding box should cover the whole image. We assign 4.8k tweets to each worker from a total of 13K+ tweets in the two datasets and collect 1.3k cross-annotated tweets. Agreement between annotators is measured using a 50% IoU overlap criterion for image regions. To ensure the quality of the ground truth, we follow previous works (Bowman et al. 2015; Chen et al. 2019; Jia et al. 2022b) and employ the Fleiss Kappa (Fleiss 1971) as an indicator, where Fleiss $\mathcal{K}=\frac{\bar{p}_{c}-\bar{p}_{e}}{1-\bar{p}_{e}}$ is calculated from the observed agreement $\bar{p}_{c}$ and the agreement by chance $\bar{p}_{e}$. We obtain Fleiss $\mathcal{K}=0.85$, which indicates strong inter-annotator agreement.
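For completeness, the Fleiss Kappa above can be computed from a rating matrix whose entry (i, j) counts how many annotators assigned item i to category j; the following is a minimal sketch under that assumption.

```python
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """Fleiss' Kappa; ratings[i, j] = number of annotators assigning item i
    to category j, with the same number of annotators per item."""
    N = ratings.shape[0]
    n = ratings[0].sum()                              # annotators per item
    p_j = ratings.sum(axis=0) / (N * n)               # category proportions
    P_i = (np.square(ratings).sum(axis=1) - n) / (n * (n - 1))
    p_c, p_e = P_i.mean(), np.square(p_j).sum()       # observed vs. chance agreement
    return (p_c - p_e) / (1 - p_e)
```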

Comparisons of Twitter2015/2017 and Flickr 30K Entities.

Here, we compare Twitter2015/2017 and Flickr 30K Entities in terms of their image and query distributions. The details are shown in Table 8. The image statistics of the Flickr 30K Entities dataset are taken from Plummer et al. (2015). The distribution of queries in Flickr 30K Entities is obtained by counting the number of queries in the dataset (we acquire the Flickr 30K Entities dataset from https://bryanplummer.com/Flickr30kEntities/). The queries for Twitter2015/2017 are designed by us and are not included in the original datasets. From Table 8, we can see that Flickr 30K Entities contains more images and queries than Twitter2015/2017.

Nums    Twitter2015                Twitter2017               Flickr 30K
        Train    Dev     Test      Train    Dev     Test     Train     Dev     Test
Image   4,000    1,000   3,257     3,373    723     723      29,783    1,000   1,000
Query   16,000   4,000   13,028    13,492   2,892   2,892    69,888    5,066   5,016
Table 8: Comparisons of Twitter2015/2017 and Flickr 30K Entities.

Appendix C Details of Experiment Settings

Details of Baseline Models.

Figure 6: Error Cases by our model MNER-QG and UMGF (N.R. and I.B. denote Not Recognized and Inaccurate Boundaries, respectively).

We compare two groups of baselines with our approach. The first group consists of text-based NER models that formalize NER as a sequence labeling task: (1) BiLSTM-CRF, the vanilla NER model with a bidirectional LSTM layer and a CRF layer. (2) CNN-BiLSTM-CRF, an improvement of BiLSTM-CRF in which the embedding of each word is substituted with the concatenation of word-level and character-level embeddings. (3) HBiLSTM-CRF, a variant of CNN-BiLSTM-CRF that replaces the CNN layer with an LSTM layer to acquire character embeddings. (4) BERT, a competitive NER model with a multi-layer bidirectional Transformer encoder followed by a softmax layer. (5) BERT-CRF, a variant of BERT that replaces the softmax layer with a CRF decoder. (6) T-NER, a NER model designed specifically for tweets, which exploits widely used features, including dictionary, contextual, and orthographic features.

Besides, we compare several competitive multimodal NER models: (1) GVATT-HBiLSTM-CRF, which uses HBiLSTM-CRF to encode text and proposes a visual attention mechanism to acquire text-aware image representations. (2) GVATT-BERT-CRF, a variant of GVATT-HBiLSTM-CRF that replaces the HBiLSTM text encoder with BERT. (3) AdaCAN-CNN-BiLSTM-CRF, an MNER model based on CNN-BiLSTM-CRF that designs an adaptive co-attention mechanism to integrate image and text. (4) AdaCAN-BERT-CRF, a variant of AdaCAN-CNN-BiLSTM-CRF that replaces the CNN-BiLSTM text encoder with BERT. (5) UMT-BERT-CRF, which proposes a multimodal interaction module to acquire expressive text-visual representations by incorporating auxiliary entity span detection into a multimodal Transformer. (6) MT-BERT-CRF, a variant of UMT-BERT-CRF with the auxiliary module pruned. (7) ATTR-MMKG-MNER, which integrates both image attributes and image knowledge into the MNER model. (8) UMGF, which proposes a graph-based fusion approach to obtain text-visual representations. (9) MAF, which proposes a matching and alignment framework for MNER to alleviate the impact of mismatched text-image pairs on encoding.

Setting of Individual Parameter.

In the joint-training process, we find a very large gap between the QG loss and the MNER loss. We therefore design a balance factor $\omega_{f}$ to dynamically scale the losses of the two tasks to the same order of magnitude:

$\omega_{f}=\frac{1}{10^{\left|\left\lfloor\log_{10}{a}\right\rfloor-\left\lfloor\log_{10}{b}\right\rfloor\right|}}$ (12)

where $\left\lfloor\cdot\right\rfloor$ denotes the floor function, i.e., $\left\lfloor x\right\rfloor=\max\left\{n\in\mathbb{Z}\mid n\leq x\right\}$, and $a$ and $b$ represent the losses of the two tasks. In this paper, $a$ is one loss of the Entity Span Prediction sub-task (e.g., $\mathcal{L}_{start}$) and $b$ is the loss $\mathcal{L}_{QG}$ of the Query Grounding task.
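A minimal sketch of this balance factor follows; how the factor is applied (here, scaling the larger loss) is our assumption.

```python
import math

def balance_factor(a: float, b: float) -> float:
    """Balance factor of Eq. (12) for two positive losses a and b."""
    gap = abs(math.floor(math.log10(a)) - math.floor(math.log10(b)))
    return 1.0 / (10 ** gap)

# Example: L_start around 0.8 and L_QG around 120 give a gap of 3,
# so the larger loss would be scaled by 1e-3 before joint training.
print(balance_factor(0.8, 120.0))   # 0.001
```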

Appendix D Further Discussions

Error Analyses.

Here, we present four representative error cases for further analysis and include the results of UMGF as a comparison. Entities and their types in each sentence are highlighted in italicized bold fonts in different colors.

In Figure 6(a), the sentence contains four entities, “AC / DC”, “Katarina Benzova”, “Queen Elizabeth Olympic Park”, and “London”, with the PER and LOC types, respectively. We can observe that neither UMGF nor MNER-QG recognizes “AC / DC”, although both detect visual regions of the PER type in the image. We conjecture that this is because the image provides only obscure information and “AC / DC” does not have clear semantics for the PER type. Figure 6(b) shows a difficult case: the sentence contains one entity, “The Tribeca Film Festival”, of the OTHER type. There is only one woman wearing a watch and no region for the OTHER type in the image. Both UMGF and MNER-QG recognize the entity “Tribeca Film Festival” with the OTHER type, but it does not exactly match “The Tribeca Film Festival”. We surmise that the models can locate the entity span associated with the OTHER type but cannot detect the exact entity boundary. Figure 6(c) is a more challenging case: the sentence contains two entities, “British Special Forces” and “German”, with the PER and LOC types, respectively. Here, the entity “British Special Forces” should be recognized as the ORG type, not PER. Meanwhile, another entity, “British”, of the LOC type is embedded in “British Special Forces”. Hence, both UMGF and MNER-QG recognize the entity “British” with the LOC type, but “British” is not the target in this case. The image contains the region of a “German Shepherd”; our model MNER-QG grasps this visual clue and recognizes the entity “German Shepherd” with the OTHER type, but “German Shepherd” is still not the target. UMGF recognizes the entity “German” but mis-recognizes its type as OTHER. The entity that should be recognized in Figure 6(d) is “Almonte in Concert”, and its corresponding entity type is OTHER. Similar to the aforementioned cases, the image is not informative. Hence, our model misses the entity “Almonte in Concert”, and UMGF mis-recognizes the entity “Almonte” with the LOC type.

As the above error cases show, most of these images cannot provide effective visual clues for the text. This indicates that informative visual clues in the image are important for multimodal named entity recognition. At the same time, we find that our model has a high error rate on the OTHER type, which reminds us to pay more attention to this type. The four challenging cases above reveal that high-quality datasets are crucial for MNER, and there is still a long way to go for this task.