

License: CC BY-NC-ND 4.0
arXiv:2406.14952v1 [cs.CL] 21 Jun 2024

ESC-Eval: Evaluating Emotion Support Conversations
in Large Language Models

Haiquan Zhao1,2   Lingyu Li2   Shisong Chen1   Shuqi Kong2   Jiaan Wang1
Kexing Huang2   Tianle Gu2   Yixu Wang2   Jian Wang2   Dandan Liang2
Zhixu Li1   Yan Teng2   Yanghua Xiao1   Yingchun Wang2
1 Fudan University
2 Shanghai Artificial Intelligence Laboratory
Work done during internship at Shanghai Artificial Intelligence Laboratory. Correspondence to: Zhixu Li <zhixuli@fudan.edu.cn> and Yan Teng <tengyan@pjlab.org.cn>
Abstract

Emotion Support Conversation (ESC) is a crucial application, which aims to reduce human stress, offer emotional guidance, and ultimately enhance human mental and physical well-being. With the advancement of Large Language Models (LLMs), many researchers have employed LLMs as the ESC models. However, the evaluation of these LLM-based ESCs remains uncertain. Inspired by the rapid development of role-playing agents, we propose an ESC Evaluation framework (i.e., ESC-Eval), which uses a role-playing agent to interact with ESC models, followed by a manual evaluation of the interactive dialogues. In detail, we first re-organize 2,801 role-playing cards from seven existing datasets to define the roles of the role-playing agent. Second, we train a specific role-playing model called ESC-Role, which behaves more like a confused person than GPT-4. Third, through ESC-Role and the organized role cards, we systematically conduct experiments using 14 LLMs as the ESC models, including general AI-assistant LLMs (e.g., ChatGPT) and ESC-oriented LLMs (e.g., ExTES-Llama). We conduct comprehensive human annotations on interactive multi-turn dialogues of different ESC models. The results show that ESC-oriented LLMs exhibit superior ESC abilities compared to general AI-assistant LLMs, but a gap remains behind human performance. Moreover, to automate the scoring process for future ESC models, we developed ESC-RANK, trained on the annotated data, which surpasses GPT-4's scoring accuracy by 35 points. Our data and code are available at https://github.com/haidequanbu/ESC-Eval.



1 Introduction

Figure 1: Difference between our proposed evaluation framework and others.

With the rapid development of Large Language Models (LLMs), an increasing number of individuals are engaging in conversations with LLMs (e.g., ChatGPT Achiam et al. (2023)). Among various conversational applications, Emotional Support Conversation Liu et al. (2021) (ESC) stands out as a particularly promising field, where people can freely share their personal experiences or concerns, receiving emotional support and practical advice. This interaction helps alleviate human pressures Langford et al. (1997); Burleson (2003), thereby improving overall well-being. Recently, numerous LLM-based ESC models have received wide research attention Zheng et al. (2023b); Qiu et al. (2023); Liu et al. (2023). However, effective and comprehensive evaluation of these chatbots remains challenging.

Current ESC evaluation Liu et al. (2021); Zheng et al. (2023b) generally uses text-based statistical metrics or manual evaluations. (1) When using text-based statistical metrics, researchers provide the dialogue history to the ESC models and then use the models to generate the corresponding responses (cf. left panel in Figure 1). Based on the generated responses, text-based statistical metrics (such as ROUGE Lin (2004) and BLEU Papineni et al. (2002)) assess whether the responses resemble the ground truth. However, these metrics heavily rely on ground-truth responses, which lack objectivity Novikova et al. (2017) due to the complex nature of ESC. Furthermore, since the conversation history from the ground truth is provided to the model under evaluation, text-based statistical metrics cannot fully assess models' capabilities in multi-turn ESC dialogues, because the dialogue history is not generated by the models themselves. (2) Manual evaluations Liu et al. (2021); Zheng et al. (2023b) employ human evaluators to simulate conversations between the model and users with specific distress (middle panel in Figure 1). This method requires the collection of both human-AI dialogues and manual judgments, resulting in challenges such as high cost and low efficiency.

To alleviate the above issues, we propose ESC-Eval, which replaces human labor with role-playing LLMs (right panel in Figure 1) to achieve efficient and comprehensive ESC evaluation. We assign a role-playing LLM to engage in multi-turn conversations with the ESC chatbots under evaluation and collect the conversation data as the target of evaluation. In this manner, ESC-Eval is expected to efficiently achieve performance comparable to human evaluation involving naturalistic multi-turn dialogue data, while getting rid of the reliance on ground truth and the heavy labor requirements. However, to ensure the effectiveness of our evaluation framework, two components are important: i) diverse role cards sourced from a variety of troubled individuals in real-world scenarios, which can be used to guide the LLM role-playing during evaluation and ensure comprehensive evaluation; ii) a role-playing agent that closely mirrors real human behavior, enabling the acquisition of data that faithfully reflects real human interactions, thereby guaranteeing the objectivity and fairness of the evaluation results.
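To make the framework concrete, the following is a minimal sketch of the interaction loop between the role-playing agent and an ESC model under evaluation; the function names and the stop signal are illustrative assumptions rather than the released implementation.

```python
# A minimal sketch of the ESC-Eval interaction loop (not the released code);
# `seeker_reply` and `supporter_reply` are hypothetical wrappers around the
# role-playing agent and the ESC model under evaluation.
def simulate_dialogue(role_card: str, seeker_reply, supporter_reply, max_turns: int = 10):
    """Let a role-playing seeker converse with an ESC model and return the dialogue."""
    dialogue = []  # list of (speaker, utterance) pairs, later judged by human annotators
    seeker_utterance = seeker_reply(role_card, dialogue)  # the seeker opens with its trouble
    for _ in range(max_turns):
        dialogue.append(("seeker", seeker_utterance))
        supporter_utterance = supporter_reply(dialogue)   # the ESC model only sees the dialogue
        dialogue.append(("supporter", supporter_utterance))
        seeker_utterance = seeker_reply(role_card, dialogue)
        if "[END]" in seeker_utterance:                   # assumed stop signal from the seeker
            break
    return dialogue
```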

To accomplish these two objectives, firstly, we propose to reconstruct role cards from seven existing QA and dialogue datasets Qiu et al. (2023); Liu et al. (2021); Zheng et al. (2023b); Sharma et al. (2020); Lahnala et al. (2021); Sun et al. (2021); Liu et al. (2023), which are relevant to emotional companionship or psychological counseling. However, these datasets do not contain user cards; thus, we use GPT-4 to extract and summarize the key information of users, followed by a two-stage filtering process involving GPT-4 and human judgment. In this manner, we obtain 2,801 qualified role cards. Secondly, to construct a reliable role-playing agent, we develop a dedicated role-playing model for ESC-Eval. In detail, we construct a dataset consisting of 3.5K ESC role-playing samples from ESConv Liu et al. (2021), ExTES Zheng et al. (2023b), and Smile Qiu et al. (2023), where each sample consists of a role card and a multi-turn dialogue. We also enrich the data up to 14K by incorporating five existing role-playing instruction datasets. Through fine-tuning Qwen1.5 Bai et al. (2023), we develop a role-playing model called ESC-Role. Compared with existing state-of-the-art role-playing models, such as GPT-4 and BaichuanNPC Yang et al. (2023), ESC-Role behaves more like a person encountering real-life troubles.

With ESC-Eval in place, and considering the huge amount of human annotation required, we select 655 high-quality role cards and comprehensively evaluate 14 LLMs with ESC-Role, including general AI-assistant LLMs (e.g., ChatGPT and Llama3 Touvron et al. (2023)) and ESC-oriented LLMs (e.g., ExTES-Llama Zheng et al. (2023b)).

Figure 2: Overview of ESC-Eval, which uses role-playing to evaluate the capability of ESC models.

After obtaining 8.5K interactive dialogues based on the 14 LLMs, we conduct comprehensive human evaluations and collect 59,654 manual evaluation results across 7 dimensions (i.e., fluency, diversity, empathy, information, humanoid, skillful, and overall). The evaluation results show that the ESC-oriented LLMs outperform most general AI-assistant LLMs, but perform poorly on emotional support knowledge and human preference. Finally, to automate the scoring process for future ESC models, we train ESC-RANK using the 59,654 manual evaluation results, achieving a scoring performance that surpasses GPT-4 by 35 points in terms of accuracy.

Our main contributions are concluded as follows:

  • We propose ESC-Eval, the first framework for evaluating LLM-based ESC models via role-playing. It features 2,801 diverse user cards with fine-grained information, a dedicated role-playing model closely resembling individuals experiencing distress, and 7 meticulously designed dimensions for rigorous evaluation.

  • Through ESC-Eval, we test 14 LLMs and manually annotate the results according to our meticulously designed dimensions. Our findings underscore an immediate demand for an ESC model exhibiting superior human preference and robust knowledge of emotional support.

  • For automatic evaluation of future ESC models, we developed ESC-RANK, a scoring model that outperforms GPT-4 by 35 points.

2 ESC-Eval

2.1 Framework Overview

Figure 2 illustrates the workflow of ESC-Eval. ESC-Eval utilizes a role-playing model and a set of role cards to interact with the ESC models under evaluation, followed by manual annotation of the obtained dialogue data. The availability of a substantial number of diverse role cards and a reliable role-playing agent is of paramount importance for ESC-Eval. The following sections outline the measures taken to ensure the reliability of these two crucial foundational components.

2.2 Role Card Acquisition

To ensure the diversity of character cards, drawing inspiration from ESConv Liu et al. (2021), ExTES Zheng et al. (2023b), and the Life Events Scale Wethington (2016), we first constructed a classification system consisting of three hierarchical layers and encompassing 37 categories. We then propose reconstructing role cards from open-source data and assigning each role card to a category. The construction procedure involves three primary steps. First, we collect 7 open-source datasets that cover a wide range of potential user roles. Then we utilize GPT-4 to extract roles from these datasets and filter out low-quality role cards, followed by human filtering. Finally, we employ a manual annotation process to ensure the quality of the role cards and classify them into their respective tertiary categories. We introduce each step in the following, and more details can be found in Appendix A.

2.2.1 Dataset collection

To obtain a diverse set of character cards, we conducted a comprehensive investigation into existing datasets in the field of emotion support and mental health datasets. Subsequently, we selected seven datasets as the source datasets for this study. The open-source datasets utilized are listed in Appendix A.

2.2.2 User cards extraction and filtering

After obtaining these datasets, we encountered both Multi-turn Dialogue (MD) datasets and single-turn Question-and-Answer (QA) datasets. To extract user profiles from these diverse datasets, we employed different prompts for the QA and MD datasets, using GPT-4 for the initial extraction at a cost of approximately $120. After acquiring the initial character cards, we employed GPT-4 to conduct an initial filtration pass on the role cards, eliminating those that solely consist of emotions without any associated events, at an additional cost of approximately $70. The prompts used in this section can be found in Appendix A. After the GPT-4 filtering process, we apply a human filter to ensure the quality of these cards.
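As an illustration, the two GPT-4 passes (extraction, then filtering) can be sketched as follows, assuming the OpenAI chat-completions API; the prompt strings are placeholders for the actual prompts given in Appendix A.

```python
# A hedged sketch of the extraction-then-filtering pipeline; the prompts are
# placeholders, and the keep/drop convention of the filter is an assumption.
from openai import OpenAI

EXTRACT_PROMPT_QA = "..."   # placeholder: extraction prompt for single-turn QA data (Appendix A)
EXTRACT_PROMPT_MD = "..."   # placeholder: extraction prompt for multi-turn dialogues (Appendix A)
FILTER_PROMPT = "..."       # placeholder: filtering prompt (keep cards with concrete events)

client = OpenAI()

def gpt4(system_prompt: str, user_content: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": user_content}],
    )
    return resp.choices[0].message.content

def extract_and_filter(sample: str, is_multi_turn: bool) -> str | None:
    prompt = EXTRACT_PROMPT_MD if is_multi_turn else EXTRACT_PROMPT_QA
    card = gpt4(prompt, sample)                 # step 1: draft role card from the source sample
    verdict = gpt4(FILTER_PROMPT, card)         # step 2: drop emotion-only cards without events
    return card if "keep" in verdict.lower() else None
```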

2.2.3 Manual annotation and correction

After obtaining role cards that had undergone preliminary screening by GPT-4 and human filter, we employed a two-stage approach involving crowdsourced annotation followed by manual correction to ensure the quality of the role cards.

Crowd workers annotation In the above section, we developed a comprehensive three-tier classification system comprising a total of 37 categories of real-life questions, which are listed in Table 9. Based on this classification, the crowd workers were instructed to annotate the filtered character cards with their corresponding tertiary classifications. This requirement is motivated by the need for convenient evaluation and for role-playing quality. Additionally, we asked the crowd workers to label the high-quality and medium-quality character cards within the dataset. The annotation rules and classification system can be found in Appendix A. The annotation phase took approximately 10 days with 10 crowdsourcing workers.

Human correction Upon completion of the first-stage crowdsourced annotation, we proceeded with a second-stage manual correction. We asked persons (the authors of this paper) who are more familiar with this task to examine the annotations for each role card, rectifying any incorrect categorizations and addressing issues pertaining to the quality of the role cards.

Following the two-stage process of crowdsourced annotation and manual correction, the role cards representing various real-world individuals with different problems were successfully reconstructed. The data analysis of these role cards is listed in Appendix A.

2.3 ESC-Role

To construct a more robust role-playing model, we trained a specific role-playing agent called ESC-Role using both general data and data specific to ESC scenarios for ESC-Eval. The following sections outline the steps involved in training and evaluating this model.

2.3.1 Data Collection

Using the same procedure as in Section 2.2, we selected the Smile, ESConv, and ExTES datasets mentioned previously to collect ESC scenario data. We employed methods including extraction with GPT-4, filtering with GPT-4, and manual filtering to extract role cards from multi-turn dialogues, resulting in a total of 3,390 role-playing instances, each consisting of a role card and a corresponding dialogue. The role cards were used as system prompts for model training. To further enhance the model's robust role-playing ability, we filtered five role-playing datasets consisting of multi-turn dialogues from Huggingface (https://huggingface.co/). After processing, we acquired 14K role-playing data instances, consisting of both general role-playing instruction data and ESC role-playing data.
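The following is a sketch of how a role card and its dialogue can be packed into a chat-style training sample with the card as the system prompt; the exact data schema used to train ESC-Role is an assumption here.

```python
# A sketch of the training-sample format: the role card becomes the system
# prompt and the seeker's turns are the targets the model learns to produce.
def to_training_sample(role_card: str, dialogue: list[tuple[str, str]]) -> dict:
    messages = [{"role": "system",
                 "content": f"Play the following person seeking support:\n{role_card}"}]
    for speaker, utterance in dialogue:
        # the model plays the seeker, so seeker turns are "assistant" and supporter turns are "user"
        role = "assistant" if speaker == "seeker" else "user"
        messages.append({"role": role, "content": utterance})
    return {"messages": messages}
```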

2.3.2 Implementation and Evaluation Metric

Due to the inclusion of both English and Chinese in the character cards, we selected Qwen1.5-14B-Chat as our base model and adopted LoRA Hu et al. (2021) for parameter-efficient fine-tuning on the dataset collected above. We compared ESC-Role with state-of-the-art role-playing agents such as GPT-4 and BaichuanNPC. These agents are API-based LLMs, and we tried various prompting strategies such as Chain-of-Thought (CoT) Wei et al. (2022) and In-Context Learning (ICL) Min et al. (2022); more details can be found in Appendix B.
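A minimal LoRA fine-tuning sketch with HuggingFace PEFT is shown below; the rank, alpha, dropout, and target modules are illustrative assumptions rather than the settings used for ESC-Role.

```python
# A minimal LoRA setup sketch; hyperparameters are assumptions, not the paper's values.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen1.5-14B-Chat"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,                  # assumed hyperparameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters are updated
# ... then run a standard supervised fine-tuning loop over the 14K chat-formatted samples
```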

Model | Cohe. | Flue. | Them. | Comp. | Emot. | Huma. | Aver.
GPT-4 (zero-shot) | 9.9/9.8 | 7.3/7.6 | 10/10 | 10/10 | 3.2/6.2 | 2.2/6.9 | 7.1/8.4
GPT-4 (ICL) | 9.9/9.8 | 7.9/7.9 | 10/10 | 10/10 | 5.5/8.0 | 4.7/8.0 | 8.0/9.0
GPT-4 (CoT) | 10/9.1 | 8.3/7.2 | 10/9.2 | 10/9.2 | 4.9/7.8 | 5.3/8.5 | 8.1/8.5
GPT-4 (ICL+CoT) | 10/9.8 | 8.9/8.0 | 10/10 | 10/10 | 4.7/7.9 | 4.9/7.9 | 8.1/8.9
Baichuan-NPC (zero-shot) | 9.7/9.5 | 8.7/8.0 | 9.7/9.4 | 9.6/8.0 | 6.3/6.1 | 5.3/5.5 | 8.2/8.0
Baichuan-NPC (ICL) | 9.7/9.6 | 8.5/9.1 | 9.6/9.3 | 9.3/8.3 | 5.3/5.3 | 4.7/4.5 | 7.8/7.7
Baichuan-NPC (CoT) | 9.8/9.1 | 8.9/5.9 | 10/8.9 | 9.9/8.5 | 5.9/6.1 | 6.5/8.1 | 8.5/8.1
Baichuan-NPC (ICL+CoT) | 9.6/9.2 | 8.4/8.0 | 9.4/8.3 | 9.4/8.1 | 5.3/5.9 | 4.6/5.1 | 7.8/7.4
ESC-Role | 10/9.8 | 9.8/9.7 | 10/10 | 10/9.5 | 7.5/9.3 | 6.6/9.1 | 9.0/9.6
Table 1: Human judgement ZH/EN results of different role-playing agents.

To compare the effectiveness of different role-playing models, we draw upon research on role-playing and the distinctive features of the emotional support domain. We propose six categories of metrics: general metrics (i.e., Coherence, Fluency) and domain-specific metrics (i.e., Thematic Consistency, Completeness, Emotional Congruence, Humanoid). We use a manual evaluation method to rate each dimension on a 3-point scale. We also conduct pairwise comparisons through manual evaluation, where human evaluators determine which dialogues resemble human-human conversations more closely. The more frequently a particular model is selected by the evaluators, the better its performance is considered to be.
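The two manual protocols can be aggregated as sketched below: per-dimension mean scores on the 3-point scale, and win rates from the pairwise picks; variable names are illustrative, not from the released code.

```python
# Aggregation sketch for the rating and pairwise-comparison protocols.
from collections import Counter

def mean_scores(ratings: list[dict]) -> dict:
    """ratings: one dict of {dimension: score in {1, 2, 3}} per annotated dialogue."""
    dims = ratings[0].keys()
    return {d: sum(r[d] for r in ratings) / len(ratings) for d in dims}

def win_rate(pairwise_picks: list[str], model: str) -> float:
    """pairwise_picks: the model name chosen as more human-like in each comparison."""
    counts = Counter(pairwise_picks)
    return counts[model] / len(pairwise_picks)
```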

2.3.3 Evaluation Results

Figure 3: Win rate of different role-playing agents and source data, where source denotes human dialogue.

The human judgment results of these models are presented in Table 1. From the table, it can be observed that, in the comparison of general API models, GPT-4 performs better in English, while Baichuan-NPC performs better in Chinese. The performance of GPT-4 in role-playing can be improved by optimizing various prompts, whereas Baichuan-NPC even experiences a decrease in performance with prompt optimization. Analyzing the reasons behind this, Baichuan-NPC is invoked through parameter settings (https://platform.baichuan-ai.com/docs/npc), and it is unclear what internal strategies are employed to concatenate CoT and ICL into prompts. On the other hand, GPT-4 prompts are independently constructed by the authors of this paper, which enhances its performance. Furthermore, the trained ESC-Role not only demonstrates stronger human-like attributes in ESC's domain-specific metrics but also shows impressive results in the general metrics.

In addition, we selected the best-performing setting of each role-playing model and paired its dialogues with the source multi-turn dialogues of the role cards. We manually evaluated which dialogue more closely resembled real human conversation. The results are shown in Figure 3. From the figure, it can be observed that it is difficult for humans to distinguish between the results generated by ESC-Role and the original data. Both of them outperformed GPT-4 and Baichuan-NPC, demonstrating the effectiveness of using ESC-Role for role-playing in ESC-Eval.

3 Evaluation

In this section, we conduct evaluations of 14 general LLMs and domain-specific LLMs on ESC-Eval. We first introduce the models under evaluation. Then we present our experimental results. Finally, we describe the details of our scoring model ESC-RANK.

Lang | Group | Model | Fluency | Expression | Empathy | Information | Skillful | Humanoid | Overall | Average
EN | Close | GPT-4 | 74.32 | 71.68 | 71.22 | 73.72 | 74.92 | 36.40 | 44.18 | 63.78
EN | Close | ChatGPT | 74.70 | 71.22 | 72.12 | 73.19 | 74.92 | 37.08 | 45.24 | 64.07
EN | Open | Vicuna-7B-1.5 | 63.37 | 67.07 | 71.00 | 71.53 | 71.68 | 41.31 | 38.67 | 60.66
EN | Open | WizardLM2-7B-Chat | 53.10 | 65.79 | 71.83 | 73.87 | 71.37 | 25.08 | 33.46 | 56.36
EN | Open | Qwen1.5-7B-Chat | 72.89 | 69.34 | 70.47 | 73.19 | 74.85 | 27.49 | 42.37 | 61.51
EN | Open | Chatglm3-6B | 74.02 | 67.82 | 70.69 | 71.37 | 74.32 | 41.84 | 42.60 | 63.24
EN | Open | Yi-6B-Chat | 75.15 | 66.99 | 69.11 | 70.39 | 71.98 | 38.82 | 43.05 | 62.21
EN | Open | LLaMa3-8B-Instruct | 63.59 | 67.37 | 72.65 | 71.90 | 74.69 | 40.55 | 41.84 | 61.80
EN | Domain | ChatCounselor | 74.54 | 66.61 | 69.03 | 64.95 | 69.03 | 65.18 | 47.50 | 65.27
EN | Domain | MindChat | 74.40 | 57.85 | 67.60 | 56.80 | 61.25 | 61.71 | 39.05 | 59.81
EN | Domain | SoulChat | 25.53 | 60.20 | 66.77 | 56.27 | 60.88 | 61.25 | 36.86 | 52.54
EN | Domain | EmoLLM | 36.56 | 68.96 | 70.85 | 71.45 | 74.47 | 65.26 | 46.53 | 62.01
EN | Domain | MeChat | 52.42 | 61.10 | 66.01 | 57.63 | 61.86 | 62.01 | 39.43 | 57.21
EN | Domain | ExTES-LLaMa | 74.32 | 59.97 | 69.94 | 57.02 | 62.69 | 63.52 | 41.01 | 61.21
ZH | Close | GPT-4 | 71.53 | 63.97 | 64.74 | 69.14 | 75.93 | 28.01 | 39.51 | 58.97
ZH | Close | ChatGPT | 74.54 | 68.98 | 69.14 | 70.06 | 72.38 | 32.79 | 42.75 | 61.52
ZH | Open | Vicuna-7B-1.5 | 52.85 | 63.27 | 65.43 | 68.06 | 64.51 | 35.41 | 30.32 | 54.27
ZH | Open | WizardLM2-7B-Chat | 54.32 | 64.04 | 66.90 | 69.75 | 65.28 | 26.08 | 30.94 | 53.90
ZH | Open | Qwen1.5-7B-Chat | 74.23 | 70.37 | 70.14 | 69.83 | 74.07 | 28.16 | 41.90 | 61.24
ZH | Open | Chatglm3-6B | 73.53 | 67.82 | 66.74 | 68.83 | 69.44 | 27.01 | 39.35 | 58.96
ZH | Open | Yi-6B-Chat | 74.00 | 67.59 | 65.59 | 68.13 | 70.52 | 29.01 | 38.97 | 59.12
ZH | Domain | ChatCounselor | 71.91 | 66.05 | 68.83 | 67.13 | 69.37 | 63.35 | 46.45 | 64.72
ZH | Domain | MindChat | 75.39 | 64.12 | 69.37 | 66.44 | 68.90 | 67.13 | 47.53 | 65.55
ZH | Domain | SoulChat | 76.16 | 65.28 | 71.30 | 67.28 | 70.37 | 69.06 | 48.53 | 66.85
ZH | Domain | EmoLLM | 78.09 | 71.45 | 74.77 | 73.15 | 78.63 | 68.67 | 57.10 | 71.69
ZH | Domain | MeChat | 74.85 | 63.04 | 68.67 | 64.27 | 67.75 | 66.59 | 45.45 | 64.37
Table 2: Human evaluation results of different models.

3.1 Evaluating models

We have selected 14 models for evaluation, including closed-source, open-source, and domain-specific models, which are as follows:

  1. Closed-source: GPT-4 Achiam et al. (2023); ChatGPT.
  2. Open-source: Vicuna Zheng et al. (2023a); Llama3 Touvron et al. (2023); WizardLM Xu et al. (2023); Qwen1.5 Bai et al. (2023); Chatglm3 Zeng et al. (2022); Yi AI et al. (2024).
  3. Domain-specific: ExTES-Llama Zheng et al. (2023b); ChatCounselor Liu et al. (2023); MindChat Xin Yan (2023); SoulChat Chen et al. (2023b); EmoLLM EmoLLM (2024); MeChat Qiu et al. (2023).

To facilitate a more accurate comparison of the capabilities of various models, we have chosen models of similar magnitudes, such as the 6B/7B/8B model parameter sizes for comparison.

3.2 Evaluation Results

Based on pre-defined dimensions, we conducted a comprehensive manual assessment, and the results are presented in Table 2. In both the English and Chinese ESC settings, domain-specific LLMs (ChatCounselor and EmoLLM, respectively) achieved the best results. From the top half of Table 2 (English), in the comparison between general models and domain-specific models, the general models perform better in terms of fluency, expression diversity, and emotional comfort skills. This can be attributed to their highly structured output, such as phrases like "I understand you very well, it is very normal to feel …, here are some possible suggestions:". The general models generate a large amount of text, scoring high in terms of advice effectiveness and expression diversity. Besides, due to their larger parameter scales, the API-based models exhibit greater knowledge of emotional comfort, with GPT-4 and ChatGPT demonstrating the highest proficiency. However, these models perform poorly in terms of human-like and human-centric responses, as users in this context expect replies that are more humanized and possess greater human-like qualities. In the comparison of domain-specific models, MindChat, SoulChat, and EmoLLM, which were not fine-tuned on English data, showed inferior fluency. On the other hand, ExTES-llama and ChatCounselor performed well: ExTES was fine-tuned with data generated by ChatGPT, while ChatCounselor was fine-tuned using real psychological counseling data, exhibiting superior performance.

From the bottom half of Table 2 (Chinese), the general models perform well in terms of expression diversity and providing effective suggestions. Trained on diverse and abundant data, EmoLLM exhibits excellent performance across multiple dimensions among the various domain-specific models. The other domain-specific models, due to their remarkable human-like qualities and better alignment with human preference, surpass the general models. However, there is still room for improvement in terms of emotional support knowledge, and significant potential exists for further improving human preference. It is worth noting that MindChat, trained on bilingual data, not only demonstrates strong Chinese language proficiency but also exhibits commendable English language capabilities.

3.3 Correlation Analysis

Metrics | Fluency (Spear./Pear.) | Suggestion (Spear./Pear.) | Skillful (Spear./Pear.) | Empathy (Spear./Pear.) | Overall (Spear./Pear.) | Average (Spear./Pear.)
Bleu-1 | 40.60/40.63 | -67.20/-65.68 | -51.68/-51.00 | -28.32/-27.53 | -55.92/-52.95 | -60.98/-56.36
Bleu-2 | 18.16/12.05 | -18.82/-15.81 | -2.97/-0.29 | -29.34/-22.91 | -21.66/-19.97 | -18.25/-17.95
Bleu-4 | -0.04/-2.56 | -5.40/-3.33 | 27.54/22.97 | -0.90/-14.38 | 13.50/2.99 | 10.78/2.80
Distinct-1 | 37.92/43.84 | -79.61/-81.95 | -59.52/-62.17 | -36.18/-32.60 | -62.47/-65.36 | -70.02/-68.11
Distinct-2 | 38.63/43.84 | -81.51/-80.79 | -61.07/-61.45 | -37.09/-36.21 | -65.46/-65.32 | -72.53/-69.67
Rouge-L | 38.25/36.77 | -56.98/-58.27 | -36.03/-37.31 | -19.23/-23.05 | -42.80/-45.31 | -45.22/-45.59
Meteor | 8.01/14.94 | 20.09/12.76 | 1.23/-0.34 | 14.31/10.11 | 6.77/0.97 | 17.30/13.73
ESC-Eval | -1.61/-0.69 | 36.26/33.36 | 39.02/38.70 | 9.17/6.02 | 45.01/44.58 | 46.31/46.05
Table 3: Sample-level Spearman (Spear.) and Pearson (Pear.) correlations of different metrics.

To validate the effectiveness of ESC-Eval, we randomly selected 20 instances from the ESConv dataset. We chose three categories of target models and included five different models for correlation analysis. These models interacted with human evaluators who played the role of help-seekers, and the evaluators were asked to provide ratings upon completion of the interactions, following the human evaluation protocols used in prior work. The human-rated scores were treated as the reference evaluation, and we conducted a correlation analysis between the various automatic evaluation methods, ESC-Eval, and these human ratings. The results are presented in Table 3.
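The correlation computation itself is straightforward; a sketch using SciPy is shown below, assuming Table 3 reports the coefficients scaled by 100.

```python
# A sketch of the sample-level correlation between an automatic metric and human ratings.
from scipy.stats import spearmanr, pearsonr

def correlate(metric_scores: list[float], human_scores: list[float]) -> tuple[float, float]:
    spear, _ = spearmanr(metric_scores, human_scores)
    pear, _ = pearsonr(metric_scores, human_scores)
    return spear * 100, pear * 100  # Table 3 appears to report coefficients x 100
```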

Dimen. | Model | ACC | ACCsoft
Flue. | InternLM2 | 31.84/17.15 | 93.07/72.82
Flue. | GPT-4 | 35.82/25.89 | 95.51/89.21
Flue. | ESC-RANK | 88.45/81.66 | 99.87/99.24
Expr. | InternLM2 | 27.09/26.21 | 54.94/56.09
Expr. | GPT-4 | 60.59/66.02 | 96.53/99.57
Expr. | ESC-RANK | 65.72/68.39 | 99.49/99.67
Empa. | InternLM2 | 19.38/14.56 | 80.74/84.90
Empa. | GPT-4 | 41.46/48.11 | 88.58/94.28
Empa. | ESC-RANK | 69.70/77.02 | 99.10/98.71
Info. | InternLM2 | 35.94/32.58 | 83.83/88.03
Info. | GPT-4 | 56.35/68.28 | 94.22/98.27
Info. | ESC-RANK | 75.10/77.02 | 98.97/99.46
Skil. | InternLM2 | 32.34/27.5 | 84.85/91.15
Skil. | GPT-4 | 27.98/38.83 | 82.03/91.80
Skil. | ESC-RANK | 79.72/68.61 | 96.79/99.57
Huma. | InternLM2 | 22.85/25.89 | 52.25/66.77
Huma. | GPT-4 | 1.02/3.02 | 32.48/35.06
Huma. | ESC-RANK | 57.51/70.77 | 98.84/98.17
Over. | InternLM2 | 8.04/6.04 | 48.27/46.28
Over. | GPT-4 | 1.80/1.73 | 15.15/17.04
Over. | ESC-RANK | 57.89/55.45 | 99.49/99.35
Aver. | InternLM2 | 25.50/21.42 | 79.59/76.56
Aver. | GPT-4 | 32.15/35.98 | 72.07/75.03
Aver. | ESC-RANK | 70.53/71.27 | 98.93/99.17
Table 4: Scoring performance comparison; ACC denotes accuracy, and ACCsoft allows a one-point deviation.

From Table 3, it can be observed that ESC-Eval exhibits the best correlation with human ratings, except on the Fluency and Empathy indicators. In terms of Fluency, automated metrics outperform ESC-Eval. Our analysis is that, during manual annotation, human annotators may exhibit some bias regarding the fluency of the segmented, list-like statements generated by general models, which deviate significantly from the ESConv dataset. Human annotators tend to prefer naturally expressed content, leading to relatively lower manual ratings for the fluency of general models' outputs. At the same time, the content generated by the general models is quite different from that of ESConv, so the automated metrics are also very low. As a result, there is a strong correlation between the automated evaluation metrics and human ratings. However, in ESC-Eval all models perform well on fluency due to the capability of LLMs, leading to low correlation. A similar phenomenon is observed for the Empathy indicator: although there is some correlation, it is due to the alignment process that most LLMs undergo, which enables them to display decent comforting abilities and analytical skills. In terms of the overall average metric, ESC-Eval demonstrates the most significant correlation compared to the automated metrics, further emphasizing the effectiveness of ESC-Eval. More correlation experimental results are in Appendix C.

3.4 ESC-RANK

To facilitate subsequent research, we trained ESC-RANK based on InternLM2-7B-Chat Cai et al. (2024) and the manually annotated data in this article. ESC-RANK can score multi-turn dialogues from different models along our carefully designed dimensions.

We randomly divided the annotated data into training, validation, and test sets in a 7:1:2 ratio. Compared with the base model and GPT-4, the results are shown in Table 4.

From Table 4, it can be observed that ESC-RANK demonstrates the best scoring capability, surpassing GPT-4 by 35 points in terms of accuracy. As human scoring may not always have a clear-cut boundary, a one-point deviation is tolerated in scoring, which yields the ACCsoft results. Under ACCsoft, ESC-RANK achieves an accuracy rate of over 99%, providing a solution for subsequent automation. Interestingly, GPT-4 performs poorly on the humanoid and human-preference dimensions. Our analysis suggests that GPT-4 assigns higher scores to its own generated content or to content similar to its own, which can be easily identified during human evaluation, particularly in formatted outputs such as bullet-point suggestions, where it becomes apparent that the content is machine-generated, leading to low humanoid and human-preference scores. InternLM2 has the same problem with human-preference behavior, but it performs better in humanoid scoring, which leads to higher performance than GPT-4 on ACCsoft.
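For reference, a minimal sketch of the two scoring metrics, exact accuracy (ACC) and one-point-tolerant accuracy (ACCsoft), is given below.

```python
# ACC: exact match with the human score; ACCsoft: within one point of the human score.
def acc(pred: list[int], gold: list[int]) -> float:
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def acc_soft(pred: list[int], gold: list[int]) -> float:
    return sum(abs(p - g) <= 1 for p, g in zip(pred, gold)) / len(gold)
```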

4 Related Work

4.1 Emotion Support Conversation

Traditional research Sharma et al. (2020); Medeiros and Bosse (2018); Rashkin et al. (2018) on emotion support systems initially focused on simple single-turn emotion dialogue systems. With the emergence of the ESConv Liu et al. (2021) dataset, the development of ESC shifted towards more complex multi-turn dialogues. Researchers have proposed various optimization strategies on the ESConv dataset: Peng et al. (2022) introduced an innovative hierarchical graph network, aiming to effectively utilize both the global emotion cause and the local user intention in emotional support conversations; moving away from relying on a single strategy for response generation, Tu et al. (2022) incorporated commonsense knowledge and a mix of response strategies into the framework of emotional support conversation. With the development of LLMs, which are naturally suited to chatbot scenarios due to their generative architecture, researchers Zheng et al. (2023b); Qiu et al. (2023); Liu et al. (2023) have utilized these models through pre-training and supervised fine-tuning. For instance, Zheng et al. (2023b) used ChatGPT to generate data for constructing an emotion-supported dialogue system, while Madani et al. (2024) expanded the ESConv dataset to address the issue of extrapolating the length capabilities of large language models. In addition, some studies Hua et al. (2024); Zhang et al. (2024); Chen et al. (2023a) have also used LLMs in ESC-related fields, such as psychological counseling. The purpose of our study is to provide a comprehensive and rigorous evaluation of these LLM-based ESC models.

4.2 Role Play Agents

Recent advancements in LLMs have significantly boosted the rise of Role-Playing Language Agents (RPLAs) Chen et al. (2024). Existing studies Wang et al. (2024b); Tu et al. (2024); Shen et al. (2024); Wang et al. (2024a) have proposed multiple evaluation datasets for role-playing, and various approaches Li et al. (2023); Shao et al. (2023); Wang et al. (2024b); Zhou et al. (2023) such as In-Context Learning (ICL) Min et al. (2022), Chain-of-Thought (CoT) Wei et al. (2022), and Supervised Fine-Tuning (SFT) have been explored to construct role-playing models. Additionally, the industry has witnessed the emergence of numerous role-playing products, such as Character AI (https://character.ai/) and Reflection AI (https://reflectionai.xyz/), leading to a wide-ranging impact. RPLAs are capable of assuming specific roles and engaging in human-like interactions through composite character settings, role background knowledge, and speech styles, thereby exhibiting human-like attributes and playing a role in everyday conversational contexts. This paper follows the main idea of evaluating ESC models through RPLAs.

5 Conclusion

This paper proposes a novel approach to evaluate the effectiveness and sustainability of the Emotion Support Conversation (ESC) in Large Language Models (LLMs) by utilizing a role-playing model to acquire multi-turn dialogue data. Experimental results demonstrate the efficacy and viability of our proposed method. Our evaluation outcomes indicate that while some ESC models currently outperform general models, there is still significant room for improvement in terms of these models’ knowledge capabilities and human-preference abilities. We encourage researchers to participate in ESC research and contribute to the development of more robust ESC models.

Limitations

The crowdsourced annotators in this article are not native English speakers, but all of them are proficient English users, and all of them are master’s students or PhD candidates in the humanities and social sciences. However, they still cannot avoid possible shortcomings in English annotation.

Ethical Considerations

Since this research is related to psychology, the format of the datasets used in this article has been converted, and each data instance has been manually reviewed to confirm that there are no ethical and privacy issues in each piece of data and that it complies with legal and regulatory requirements.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  • AI et al. (2024) 01.AI: Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, and Zonghong Dai. 2024. Yi: Open foundation models by 01.ai.
  • Bai et al. (2023) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.
  • Burleson (2003) Brant R Burleson. 2003. Emotional support skills. In Handbook of communication and social interaction skills, pages 569–612. Routledge.
  • Cai et al. (2024) Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. 2024. Internlm2 technical report. arXiv preprint arXiv:2403.17297.
  • Chen et al. (2024) Jiangjie Chen, Xintao Wang, Rui Xu, Siyu Yuan, Yikai Zhang, Wei Shi, Jian Xie, Shuang Li, Ruihan Yang, Tinghui Zhu, Aili Chen, Nianqi Li, Lida Chen, Caiyu Hu, Siye Wu, Scott Ren, Ziquan Fu, and Yanghua Xiao. 2024. From persona to personalization: A survey on role-playing language agents.
  • Chen et al. (2023a) Siyuan Chen, Mengyue Wu, Kenny Q Zhu, Kunyao Lan, Zhiling Zhang, and Lyuchun Cui. 2023a. Llm-empowered chatbots for psychiatrist and patient simulation: application and evaluation. arXiv preprint arXiv:2305.13614.
  • Chen et al. (2023b) Yirong Chen, Xiaofen Xing, Jingkai Lin, Huimin Zheng, Zhenyu Wang, Qi Liu, and Xiangmin Xu. 2023b. Soulchat: Improving llms' empathy, listening, and comfort abilities through fine-tuning with multi-turn empathy conversations. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1170–1183.
  • EmoLLM (2024) EmoLLM. 2024. Emollm.
  • Hu et al. (2021) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models.
  • Hua et al. (2024) Yining Hua, Fenglin Liu, Kailai Yang, Zehan Li, Yi-han Sheu, Peilin Zhou, Lauren V Moran, Sophia Ananiadou, and Andrew Beam. 2024. Large language models in mental health care: a scoping review. arXiv preprint arXiv:2401.02984.
  • Lahnala et al. (2021) Allison Lahnala, Yuntian Zhao, Charles Welch, Jonathan K Kummerfeld, Lawrence An, Kenneth Resnicow, Rada Mihalcea, and Verónica Pérez-Rosas. 2021. Exploring self-identified counseling expertise in online support forums. arXiv preprint arXiv:2106.12976.
  • Langford et al. (1997) Catherine Penny Hinson Langford, Juanita Bowsher, Joseph P Maloney, and Patricia P Lillis. 1997. Social support: a conceptual analysis. Journal of advanced nursing, 25(1):95–100.
  • Li et al. (2023) Cheng Li, Ziang Leng, Chenxi Yan, Junyi Shen, Hao Wang, Weishi MI, Yaying Fei, Xiaoyang Feng, Song Yan, HaoSheng Wang, Linkang Zhan, Yaokai Jia, Pingyu Wu, and Haozhen Sun. 2023. Chatharuhi: Reviving anime character in reality via large language model.
  • Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
  • Liu et al. (2023) June M Liu, Donghao Li, He Cao, Tianhe Ren, Zeyi Liao, and Jiamin Wu. 2023. Chatcounselor: A large language models for mental health support. arXiv preprint arXiv:2309.15461.
  • Liu et al. (2021) Siyang Liu, Chujie Zheng, Orianna Demasi, Sahand Sabour, Yu Li, Zhou Yu, Yong Jiang, and Minlie Huang. 2021. Towards emotional support dialog systems. arXiv preprint arXiv:2106.01144.
  • Madani et al. (2024) Navid Madani, Sougata Saha, and Rohini Srihari. 2024. Steering conversational large language models for long emotional support conversations.
  • Medeiros and Bosse (2018) Lenin Medeiros and Tibor Bosse. 2018. Using crowdsourcing for the development of online emotional support agents. In Highlights of Practical Applications of Agents, Multi-Agent Systems, and Complexity: The PAAMS Collection: International Workshops of PAAMS 2018, Toledo, Spain, June 20–22, 2018, Proceedings 16, pages 196–209. Springer.
  • Min et al. (2022) Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the role of demonstrations: What makes in-context learning work?
  • Novikova et al. (2017) Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for nlg. arXiv preprint arXiv:1707.06875.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.
  • Peng et al. (2022) Wei Peng, Yue Hu, Luxi Xing, Yuqiang Xie, Yajing Sun, and Yunpeng Li. 2022. Control globally, understand locally: A global-to-local hierarchical graph network for emotional support conversation. arXiv preprint arXiv:2204.12749.
  • Qiu et al. (2023) Huachuan Qiu, Hongliang He, Shuai Zhang, Anqi Li, and Zhenzhong Lan. 2023. Smile: Single-turn to multi-turn inclusive language expansion via chatgpt for mental health support. arXiv preprint arXiv:2305.00450.
  • Rashkin et al. (2018) Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2018. Towards empathetic open-domain conversation models: A new benchmark and dataset. arXiv preprint arXiv:1811.00207.
  • Shao et al. (2023) Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu. 2023. Character-llm: A trainable agent for role-playing.
  • Sharma et al. (2020) Ashish Sharma, Adam S Miner, David C Atkins, and Tim Althoff. 2020. A computational approach to understanding empathy expressed in text-based mental health support. arXiv preprint arXiv:2009.08441.
  • Shen et al. (2024) Tianhao Shen, Sun Li, Quan Tu, and Deyi Xiong. 2024. Roleeval: A bilingual role evaluation benchmark for large language models.
  • Sun et al. (2021) Hao Sun, Zhenru Lin, Chujie Zheng, Siyang Liu, and Minlie Huang. 2021. Psyqa: A chinese dataset for generating long counseling text for mental health support. arXiv preprint arXiv:2106.01702.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  • Tu et al. (2024) Quan Tu, Shilong Fan, Zihang Tian, and Rui Yan. 2024. Charactereval: A chinese benchmark for role-playing conversational agent evaluation.
  • Tu et al. (2022) Quan Tu, Yanran Li, Jianwei Cui, Bin Wang, Ji-Rong Wen, and Rui Yan. 2022. Misc: a mixed strategy-aware model integrating comet for emotional support conversation. arXiv preprint arXiv:2203.13560.
  • Wang et al. (2024a) Xintao Wang, Yunze Xiao, Jen-tse Huang, Siyu Yuan, Rui Xu, Haoran Guo, Quan Tu, Yaying Fei, Ziang Leng, Wei Wang, Jiangjie Chen, Cheng Li, and Yanghua Xiao. 2024a. Incharacter: Evaluating personality fidelity in role-playing agents through psychological interviews.
  • Wang et al. (2024b) Zekun Moore Wang, Zhongyuan Peng, Haoran Que, Jiaheng Liu, Wangchunshu Zhou, Yuhan Wu, Hongcheng Guo, Ruitong Gan, Zehao Ni, Jian Yang, Man Zhang, Zhaoxiang Zhang, Wanli Ouyang, Ke Xu, Stephen W. Huang, Jie Fu, and Junran Peng. 2024b. Rolellm: Benchmarking, eliciting, and enhancing role-playing abilities of large language models.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.
  • Wethington (2016) E Wethington. 2016. Life events scale. In Stress: Concepts, cognition, emotion, and behavior, pages 103–108. Elsevier.
  • Xin Yan (2023) Dong Xue*, Xin Yan. 2023. Mindchat: Psychological large language model. https://github.com/X-D-Lab/MindChat.
  • Xu et al. (2023) Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023. Wizardlm: Empowering large language models to follow complex instructions.
  • Yang et al. (2023) Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, et al. 2023. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305.
  • Zeng et al. (2022) Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. 2022. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414.
  • Zhang et al. (2024) Chenhao Zhang, Renhao Li, Minghuan Tan, Min Yang, Jingwei Zhu, Di Yang, Jiahao Zhao, Guancheng Ye, Chengming Li, Xiping Hu, et al. 2024. Cpsycoun: A report-based multi-turn dialogue reconstruction and evaluation framework for chinese psychological counseling. arXiv preprint arXiv:2405.16433.
  • Zheng et al. (2023a) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023a. Judging llm-as-a-judge with mt-bench and chatbot arena.
  • Zheng et al. (2023b) Zhonghua Zheng, Lizi Liao, Yang Deng, and Liqiang Nie. 2023b. Building emotional support chatbots in the era of llms. arXiv preprint arXiv:2308.11584.
  • Zhou et al. (2023) Jinfeng Zhou, Zhuang Chen, Dazhen Wan, Bosi Wen, Yi Song, Jifan Yu, Yongkang Huang, Libiao Peng, Jiaming Yang, Xiyao Xiao, Sahand Sabour, Xiaohan Zhang, Wenjing Hou, Yijia Zhang, Yuxiao Dong, Jie Tang, and Minlie Huang. 2023. Characterglm: Customizing chinese conversational ai characters with large language models.

Appendix A Benchmark

The construction of character cards, as illustrated in Figure 4, primarily consists of three steps. The first step involves collecting the raw dataset, followed by the second step of utilizing GPT-4 to extract and filter the character cards. The third step entails a two-stage manual filtering and annotation process. The following sections will provide further details on the construction procedure.

Firstly, the raw datasets used in this study are listed in Table 5. The prompts used for extraction and filtering during the GPT-4 phase can be referenced in Figure 6, Figure 7, and Figure 8. The manual annotation phase primarily relies on internal annotations within the character cards, which indicate their corresponding quality and three-tier classification. The descriptions of character cards with different quality levels are presented in Table 7, and the annotation guidelines for character cards are provided in Table 8. The distribution of the completed three-tier classifications can be observed in Table 9 and Figure 5. Finally, we present two cases showcasing the extraction of character cards from single-turn QA and multi-turn dialogues in Figure 9 and Figure 10.
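For concreteness, the two GPT-4 stages (extraction, then filtering) can be sketched as below. This is a minimal sketch, not the released implementation: the prompt strings abbreviate the full prompts in Figures 6, 7, and 8, and the model name, helper functions, and JSON output convention are illustrative assumptions.

```python
# Minimal sketch of the GPT-4 extraction and filtering stages.
# EXTRACT_PROMPT / FILTER_PROMPT abbreviate the prompts in Figures 6-8;
# the JSON output convention is an assumption for illustration.
import json
from openai import OpenAI

client = OpenAI()

EXTRACT_PROMPT = ("Extract a role card (age, gender, occupation, problem) as JSON "
                  "from the sample below.\n{sample}")
FILTER_PROMPT = ("Given the designed scene classification, decide whether this role card "
                 "fits a category. Answer KEEP or DROP.\n{card}")

def gpt4(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

def extract_card(sample: str) -> dict:
    # Stage 1: turn a raw QA pair or multi-turn dialogue into a structured card.
    return json.loads(gpt4(EXTRACT_PROMPT.format(sample=sample)))

def keep_card(card: dict) -> bool:
    # Stage 2: GPT-4 filtering against the scene classification.
    answer = gpt4(FILTER_PROMPT.format(card=json.dumps(card, ensure_ascii=False)))
    return answer.strip().upper().startswith("KEEP")
```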

A.1 Resource Datasets

Language | Dataset | Format | Domain | Resource | Sample_Num
English | EPITOME Sharma et al. (2020) | QA | Empathetic response generation | Human | 500
English | MHP Reddit Lahnala et al. (2021) | QA | Mental health counseling | Human | 1000
English | Psych8K Liu et al. (2023) | QA | Mental health support | Human | 1000
English | ESConv Liu et al. (2021) | MD | Emotional support conversation | Human | 1000
English | ExTES Zheng et al. (2023b) | MD | Emotional support conversation | ChatGPT | 500
Chinese | PsyQA Sun et al. (2021) | QA | Mental health counseling | Human | 1000
Chinese | Smile Qiu et al. (2023) | MD | Mental health counseling | ChatGPT | 1000
Table 5: Open-source datasets used in our study. QA denotes datasets consisting of question-answer pairs, while MD denotes datasets consisting of multi-turn dialogues; Resource denotes whether the data were collected from humans or generated by ChatGPT; Sample_Num denotes the number of samples used for user-card construction in this study.

Here are more details about these datasets:

  • EPITOME: Psychologically relevant user post data collected on Reddit, single-turn conversation format in English.

  • MHP Reddit: Psychologically relevant user post data collected on Reddit, single-turn conversation format in English.

  • Psych8K: Real psychological consultation voice data, converted into text and processed by ChatGPT, psychologically related English multi-turn conversation data.

  • ESConv: Collected by crowdsourcing workers in the field of emotional dialogue, multi-turn dialogues format in English.

  • ExTes: Generated by ChatGPT in the field of emotional dialogue, multi-turn dialogues format in English.

  • PsyQA: Collected from user posts in a psychological platform, single-turn dialogue format in Chinese.

  • Smile: Generated by ChatGPT in the field of mental health, multi-turn dialogues format in Chinese.

These datasets encompass both single-turn question answering and multi-turn dialogue in the domain of emotions and mental health. Through these diverse datasets, we can collect the profiles of diverse real-world people who encounter a wide variety of problems; a possible unified record format for these heterogeneous sources is sketched below.
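Because the sources mix single-turn QA and multi-turn dialogue, it helps to normalize them into one record format before extraction. The sketch below is only an illustration; the field names are our own and do not come from the released code.

```python
# Illustrative unified record for the heterogeneous source corpora.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SourceSample:
    dataset: str                  # e.g. "ESConv", "PsyQA"
    language: str                 # "en" or "zh"
    fmt: str                      # "QA" (single turn) or "MD" (multi-turn)
    turns: List[Tuple[str, str]]  # (speaker, utterance); a QA pair has two turns

def flatten(sample: SourceSample) -> str:
    """Serialize a sample into the plain-text form fed to the extraction prompt."""
    return "\n".join(f"{speaker}: {utterance}" for speaker, utterance in sample.turns)
```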

Figure 4: The framework of user-card construction. Firstly, the initial user cards are extracted from open-source datasets using GPT-4. In the second step, based on the scene classification we designed, GPT-4 is utilized to determine the category to which the character-card data belongs, and further filtering is performed. In the third step, we employ crowdsourcing to annotate the main category and subcategories of the scenes, and manually filter the user cards again.

A.2 Construction Details

The differentiation of role cards into three quality categories is based on the following two considerations: 1) invalid role cards contain insufficient information, leaving the role-playing model without a specific theme to act on; 2) high-quality role cards possess richer information, enabling more specific tasks and yielding more effective evaluation results. Examples of the three categories of role cards are presented in Table 7.

It should be noted that there is no absolute boundary between high-quality, medium-quality, and invalid character cards; the only difference lies in the richness of character information. Greater richness is believed to contribute to better model performance and to be more conducive to subsequent evaluation, so the three levels can only be identified relatively. The boundary between valid and invalid character cards depends on whether the events experienced by the character can be classified; the event classification is shown in Table 9. The boundary between high-quality and medium-quality character cards is whether the events, their causes, their results, and detailed descriptions of the events can be identified on top of the event classification; Table 8 can be used as a reference for annotation. In the case of invalid character cards, the dialogue between the role-playing model and the tested model contains only emotions, which leads to heavy redundancy across a large number of character cards. Such redundancy is not conducive to simulating individuals who encounter a variety of problems in real life. Both medium-quality and high-quality character cards are effective for evaluation, but high-quality character cards are more targeted.
Human correction rules. To ensure the quality of the collected role cards, our human correction rules are as follows. If either crowdsourced worker deemed a role card invalid, it was discarded. If one worker classified a role card as high-quality and the other as middle-quality, a third participant corrected its classification to either high-quality or middle-quality; the remaining role cards were considered middle-quality. This process essentially corrected the quality categorization of the role cards. For the scene categories of the middle-quality and high-quality role cards, if the two crowdsourced workers agreed on the same category, that category was accepted; if they disagreed on the three-level classification, a project participant intervened to correct it, ensuring the accurate classification of the middle-quality and high-quality role cards.
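The correction rules above amount to a small arbitration procedure. A minimal sketch, with illustrative label strings and an arbitration callback standing in for the third annotator, is:

```python
# Compact restatement of the human correction rules as code; labels and the
# arbitration callback are illustrative, not the released implementation.
from typing import Callable, Optional

INVALID, MIDDLE, HIGH = "invalid", "middle", "high"

def resolve_quality(label_a: str, label_b: str,
                    arbitrate: Callable[[], str]) -> Optional[str]:
    """Return the final quality label for a role card, or None to discard it."""
    if INVALID in (label_a, label_b):
        return None              # any "invalid" vote discards the card
    if label_a == label_b:
        return label_a           # agreement between the two workers is accepted
    return arbitrate()           # high vs. middle disagreement: third annotator decides
```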

A.3 Data Analysis

Basic analysis We employed a multi-step process involving both rule-based and manual methods to ensure the quality of character cards. Table 6 presents the quantities of character cards at each stage after filtration. The distribution of collected role cards is shown in Figure 5.

Language | Extract | GPT-4_F | Human_F | High | Middle
English | 3673 | 2792 | 1708 | 331 | 1455
Chinese | 2023 | 1566 | 1093 | 324 | 769
Table 6: The quantities of role cards at each stage. Extract represents the initial number of role cards extracted from open resource datasets; GPT-4_F represents the number of role cards after the filtering process using GPT-4; Human_F represents the number of role cards after manual filtering; High represents high-quality role cards, and Middle represents medium-quality role cards.
Figure 5: Role cards distribution of our constructed benchmark.
Type Content
High Age: Not mentioned  Gender: Not mentioned  Occupation: Not mentioned
Problem: Feeling excluded and hurt by not being invited to a friend’s house party, leading to feelings of loneliness and betrayal.
Age: Not mentioned  Gender: Not mentioned  Occupation: Resident of an apartment complex
Problem: The seeker reported a neighbor’s dog for attacking their own dog, leading to the neighbor being evicted. Now, the seeker is facing social ostracization and blame from other neighbors, feeling like an outcast for taking action to protect their pet.
年龄:青年  性别:未提及  职业:大学生
问题:内向且怕生的大学生面临公众演讲的恐惧,担心自己在台上失声或晕倒,受到小时候校园暴力的影响,害怕人多的地方和被人注视,对自己的长相感到自卑,担心在公众场合出丑,对自己的能力和未来感到不自信。
Medium Age: young  Gender: female  Occupation: not mentioned
Problem: Confused about a male friend’s feelings towards her and unsure how to proceed.
Age: young  Gender: female  Occupation: not mentioned
Problem: The user’s husband, influenced by his father, is acting differently and planning to move out of California against her wishes, and expects her to contribute all her income and time to a joint family account controlled by the men in his family.
年龄:中年  性别:未提及  职业:当前职业未明确,但表明想转行至心理咨询相关领域
问题:工作进入瓶颈期和倦怠期,面对转行至心理学领域感到焦虑和恐慌,寻求建议。
Invalid Age: not mentioned  Gender: female  Occupation: not mentioned
Problem: GI issues from metformin, considering switching to XR.
Age: young  Gender: not mentioned  Occupation: not mentioned
Problem: anxiety and paranoia affecting relationships
年龄:中年  性别:男  职业:未提及
问题:最近一个月内经历失眠、焦虑、烦躁和身体不适,面临家庭压力和个人情感决策困难,导致对未来感到迷茫,有时产生极端消极想法。
Table 7: Examples of high-quality, medium-quality, and invalid role cards in English and Chinese.
Type Rules
Invalid 1. The character card only includes subjective emotions and thoughts, without events that elicit emotions.
2. There are events present, but suitable event categorizations cannot be found, rendering the events unable to reach a granular level of classification.
Middle 1. Events occur and can be classified into fine-grained categories.
2. The causes of the events and the resulting consequences are not presented.
3. In the context of interpersonal relationships, the portrayal of the other person’s image is absent.
High 1. Events occur and can be classified into fine-grained categories.
2. The causes of the events and the resulting consequences are presented.
3. In the context of interpersonal relationships, the portrayal, introduction, and description of the other person’s image within the relationship are included.
Table 8: The rules of three types of role cards annotations.
Category 1 | Category 2 | Category 3 | High | Middle
Family and Life | Marriage relationship | Establishment or breakdown of a romantic relationship | 46 | 146
Family and Life | Marriage relationship | Problems encountered in establishing a marriage relationship | 36 | 73
Family and Life | Marriage relationship | General issues in couple relationships | 117 | 300
Family and Life | Family member relationships | Add a new member to the family | 2 | 14
Family and Life | Family member relationships | General issues in the lives of self and family members | 63 | 248
Family and Life | Family member relationships | General issues in life among family members | 10 | 14
Family and Life | Mental and physical health issues | Body shape anxiety | 6 | 18
Family and Life | Mental and physical health issues | General physical health issues | 30 | 140
Family and Life | Mental and physical health issues | Serious illness or injury | 4 | 7
Family and Life | Mental and physical health issues | Death of family member | 9 | 67
Family and Life | Mental and physical health issues | Mental health issues | 31 | 195
Family and Life | Family economic and social issues | Other family members' studies or work are hindered | 8 | 8
Family and Life | Family economic and social issues | Social life problems of other family members | 4 | 1
Family and Life | Family economic and social issues | Family finance-related issues | 13 | 29
Work and Study | Work and study status | Unemployed, unemployed, having difficulty finding a job | 28 | 97
Work and Study | Work and study status | Failed to enter higher education | 3 | 5
Work and Study | Work and study status | Start a new job or study | 5 | 21
Work and Study | Work and study status | Facing changes in work or study | 16 | 33
Work and Study | Work and study status | Retired, not assigned to work or others | 1 | 1
Work and Study | Work and study performance | Issues related to salary and bonus | 3 | 3
Work and Study | Work and study performance | Issues related to work and study performance | 32 | 102
Work and Study | Work and study experience | Not satisfied with current job, school and major | 9 | 25
Work and Study | Work and study experience | Insufficient or excessive motivation to work or study | 15 | 56
Work and Study | Work and study experience | Changes in life patterns due to work and study | 5 | 67
Work and Study | Work and study experience | Issues in getting along with colleagues or classmates | 48 | 132
Social interaction and Others | Social interaction | Friend's health problems | 2 | 4
Social interaction and Others | Social interaction | Friend's mental health issues | 2 | 8
Social interaction and Others | Social interaction | General issues in getting along with friends | 47 | 107
Social interaction and Others | Social interaction | Tensions with casual friends, relatives, or others | 27 | 40
Social interaction and Others | Social interaction | Difficulty integrating into a new social environment | 6 | 40
Social interaction and Others | Social interaction | Other social problems | 10 | 115
Social interaction and Others | Social public events | Intervene in civil legal disputes | 0 | 3
Social interaction and Others | Social public events | Intervene in criminal cases | 4 | 6
Social interaction and Others | Social public events | Intervene in general public opinion events | 0 | 1
Social interaction and Others | Social public events | Intervene in social and public events | 13 | 20
Table 9: Numbers of high-quality and middle-quality role cards in each category.
Figure 6: Prompt used for QA datasets user cards extraction.
Figure 7: Prompt used for MD datasets user cards extraction.
Figure 8: Prompt used for GPT-4 filtering user cards.
Figure 9: A Reddit case from one of our collected datasets.
Figure 10: A multi-turn dialogue case from the ESConv dataset.

Appendix B ESC-Role

Figure 11: A case of ESC-Role training.

B.1 Compared Models

Here are more details about the compared models; the prompts used for GPT-4 are shown in Table 11, and the settings for Baichuan-NPC are shown in Table 12.

GPT-4

We employed a diverse range of prompting methodologies, including zero-shot, in-context learning (ICL), and chain-of-thought (CoT), incorporating CoT into the system prompt of GPT-4. Given the multi-turn dialogue scenario, to prevent the context length from exceeding the limit, we used only a one-shot approach during the ICL phase. The prompts for the chain-of-thought method can be found in Table 11.
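The four settings of Table 11 differ only in which blocks are appended to the system prompt. A minimal sketch of this assembly, with abbreviated wording standing in for the actual prompt text, is:

```python
# Illustrative assembly of the zero-shot / CoT / ICL / CoT+ICL system prompts;
# the strings abbreviate the full prompts listed in Table 11.
def build_system_prompt(role_card: str, use_cot: bool, use_icl: bool,
                        example_dialogue: str = "") -> str:
    parts = [
        "I want you to play as a troubled person communicating with an AI assistant.",
        f"Here is your character card:\n{role_card}",
    ]
    if use_cot:
        # Chain-of-thought block: reveal the problem gradually over ~5 turns,
        # keep a colloquial tone, and save any thanks for the final turn.
        parts.append("Gradually refine your problem over about 5 turns ...")
    if use_icl:
        # One-shot in-context learning: a single example keeps the context short.
        parts.append("Here is an example of a conversation you can refer to:\n"
                     + example_dialogue)
    parts.append("Use spoken language and never reveal that you are an AI assistant.")
    return "\n".join(parts)
```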

Baichuan-NPC

In line with the approach used for GPT-4, we also employed zero-shot learning, in-context learning, and chain-of-thought. However, unlike GPT-4, the Baichuan-NPC model is specifically designed for role-playing scenarios, and its invoked interfaces are subject to certain limitations. In the implementation of in-context learning, we applied length truncation to the dialogue content, and the roles of Baichuan-NPC were configured through its parameter settings.
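The length truncation applied to the in-context example can be as simple as the sketch below; the character budget is an illustrative assumption, not the value used in our experiments.

```python
# Minimal sketch of length truncation for the Baichuan-NPC dialogue_sample field.
def truncate_dialogue(dialogue: str, max_chars: int = 2000) -> str:
    """Keep the most recent part of the example dialogue within the length budget."""
    return dialogue if len(dialogue) <= max_chars else dialogue[-max_chars:]
```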

B.2 Evaluation rules

The explanations of each metric are as follows:

  • Coherence: The logic of the entire conversation.

  • Fluency: The fluency of each sentence produced by the role-playing model.

  • Thematic Consistency (TC): Whether the theme changes during role-play.

  • Completeness: Whether the contents of the character card are fully expressed.

  • Emotional Congruence (EC): Whether the role-playing model's emotion changes during the conversation.

  • Humanoid: Whether it can be detected during the conversation that the speaker is an AI.

Evaluation rules are listed in Table 10.

B.3 Findings

API-based model rejection.

In certain API-based models, specific rejection rules are triggered during invocation. For example, when using the Baichuan-NPC model, approximately 10% of the characters are refused: the request triggers the model's safety rules and a rejection result is returned. Through data observation, we found that these rejections occur more frequently when there are severe issues with the character cards, thus providing evidence against the long-term viability of API-based character role-playing.
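The reported refusal rate can be estimated by scanning the first replies for rejection responses. In the sketch below the marker strings are hypothetical placeholders, since the exact wording returned by the API varies.

```python
# Rough estimate of the refusal rate of an API-based role-playing model;
# the refusal markers are hypothetical placeholders.
from typing import List

REFUSAL_MARKERS = ("cannot role-play", "safety policy", "unable to continue")

def refusal_rate(first_replies: List[str]) -> float:
    refused = sum(any(marker in reply.lower() for marker in REFUSAL_MARKERS)
                  for reply in first_replies)
    return refused / max(len(first_replies), 1)
```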

Generic and domain role-playing models.

In using GPT-4, Baichuan-NPC, and our role-playing agent, we observed several phenomena. Despite employing the chain-of-thought approach, GPT-4 tends to generate output with more formal written expressions, while Baichuan-NPC leans towards producing text with vivid and lively tones. Furthermore, both GPT-4 and Baichuan-NPC exhibit inconsistent emotional responses: negative emotions tend to disappear after 2-3 turns of conversation with the model. Lastly, GPT-4 and Baichuan-NPC occasionally provide unfavorable responses to queries from the AI assistant, which deviates significantly from real human interactions. In contrast, our model is greatly improved in terms of emotional consistency and human-like qualities, demonstrating no apparent difference from conversations with real individuals.

Dimension | Explanation | Score 0 | Score 1 | Score 2
Coherency | The coherence and logical consistency of the entire dialogue content generated by the role-playing model during the conversation process. | The content of the dialogue is incomprehensible, and there are significant logical inconsistencies. | The conversation as a whole exhibits some logical inconsistencies, although the issues are not significant. | The entire conversation does not display any apparent logical fallacies.
Fluency | Focusing on the expression of a particular response within the role-playing model during the course of the conversation. | The expression lacks fluency and hinders comprehension of a particular dialogue sentence. | Too formal in expression, like a novelist or editor, rather than someone burdened with worries. | The expression in the sentence leans towards colloquialism, making it difficult to detect that it is generated by a machine. It resembles a genuine person with concerns.
Consistency | The focus of the entire conversation revolves around the thematic exploration, where individuals experiencing distress wish to discuss the topic of their distress itself, without diverting to other subjects. | The subject matter exhibits significant deviations, featuring irrelevant content that does not align with the description provided in the character card. | The theme incorporates elements related to character sheets, albeit beyond the scope of character sheet descriptions. | The theme demonstrates a high degree of conformity to the content of the character sheet, without any deviations.
Completeness | Pay attention to whether the content of the character card is fully expressed. | The model largely fails to convey the content of character sheets or exhibits a flawed understanding of the roles it is meant to portray. | The model comprehends its assigned role, yet certain aspects of the role card have not been conveyed. | The model has achieved a comprehensive understanding of its assigned role and has successfully conveyed all the contents specified in the role card.
Emotional Consistency | This study focuses on the emotional changes in role-playing models within brief dialogues, noting that it is challenging for real individuals to undergo significant emotional shifts over just a few rounds of conversation. | After several rounds of dialogue, there has been a fundamental transformation in emotions, transitioning from negative affect to positive affect. | After several rounds of dialogue, there has been a significant shift in emotions, transitioning from negative affect to neutral affect. | After several rounds of dialogue, there has been minimal alteration in emotions, with either slight alleviation or marginal mitigation, but the tendency towards negative affect still persists.
Humanoid | Focusing on the discrepancies between the dialogue content of role-playing models and the genuine concerns of individuals. | Based on the dialogue content, it is evident that the role-playing model is an AI. | There are no more than two indications in the dialogue content that suggest the presence of an AI, with a few subtle traces that hint towards an AI presence. | From the perspective of the dialogue content alone, it is difficult to determine whether it is an AI or a person experiencing distress.
Table 10: The rules of role-play evaluation.
Type Prompt
zero_shot I want you to play as a troubled person communicating with an AI assistant. Here is your character card:
Role Card
Please try to simulate real human-spoken conversation as much as possible, don’t act like a machine, don’t mention that you are an AI assistant, and show your troubles again and again in multiple conversations.
CoT I want you to play as a troubled person communicating with an AI assistant. Here is your character card:
Role Card
During the role-playing, you need to have multiple conversations with the AI assistant. The following are the steps for your multiple conversations: You need to gradually refine your problem multiple times and express your troubles in a spoken language, for example, a brief explanation in the first round own problems. Each round of dialogue can then have two references. One is to select an interesting question to ask in-depth based on the AI assistant’s answer, and the other is to briefly elaborate on other issues that the character may be concerned about. You need to have about 5 conversations in total and be careful to finish telling your troubles in these 5 conversations. Please use spoken language as much as possible, and do not express too much gratitude or approval to the AI assistant. If you need to express it, try to express it in the last round of dialogue. Do not let the AI assistant discover that you are a machine, and do not mention that you are a human being. AI assistant.
ICL I want you to play as a troubled person communicating with an AI assistant. Here is your character card:
Role Card
Here is an example of a conversation you can refer to:
Dialogue
Please try to simulate real human spoken conversation as much as possible, don’t act like a machine, don’t mention that you are an AI assistant, and show your troubles again and again in multiple conversations.
CoT+ICL I want you to play as a troubled person communicating with an AI assistant. Here is your character card:
Role Card
During the role-playing, you need to have multiple conversations with the AI assistant. The following are the steps for your multiple conversations: You need to gradually refine your problem multiple times and express your troubles in a spoken language, for example, a brief explanation in the first round own problems. Each round of dialogue can then have two references. One is to select an interesting question to ask in-depth based on the AI assistant’s answer, and the other is to briefly elaborate on other issues that the character may be concerned about. You need to have about 5 conversations in total and be careful to finish telling your troubles in these 5 conversations.
Here is an example of a conversation you can refer to:
Dialogue
Please use spoken language as much as possible, and do not express too much gratitude or approval to the AI assistant. If you need to express it, try to express it in the last round of dialogue. Do not let the AI assistant discover that you are a machine, and do not mention that you are a human being. AI assistant.
Table 11: The different prompt for GPT-4 role-playing.
Type Setting
zero_shot basic_info: Role card
reply_restrict: Note 1. Please strive to simulate authentic human conversational dialogue, avoiding sounding like a machine, and refrain from mentioning that you are an AI assistant. 2. Pay attention to expressing your frustrations gradually throughout multiple exchanges in a colloquial manner. 3. Make sure to use English throughout the entire conversation
opener: I have some trouble to share.
dialogue_sample: None.
CoT basic_info: Role card
reply_restrict: Note 1.Please endeavor to imagine the problems encountered by the character, and feel free to provide additional details where necessary. 2. You are engaging in a conversation with an AI assistant; do not let the AI assistant realize that you are an artificial intelligence. 3. You should gradually refine the character’s problems over multiple exchanges, expressing the character’s frustrations in a colloquial manner. For example, in the first round, briefly describe the character’s issue, and in subsequent rounds, you can choose between two types of references. One is to delve deeper into an interesting question based on the AI assistant’s response, and the other is to briefly elaborate on other concerns the character may have. 4. The character should engage in approximately five rounds of dialogue in total, ensuring that the character’s frustrations are conveyed throughout these five exchanges. Please utilize colloquial expressions as much as possible, presenting yourself as a troubled individual. 5. Avoid frequently thanking the AI assistant during the conversation. If you wish to express gratitude, do so only in the final round. 6. Make sure to use English throughout the entire conversation.
opener: I have some trouble to share.
dialogue_sample: None.
ICL basic_info: Role card
reply_restrict: Note 1. Please strive to simulate authentic human conversational dialogue, avoiding sounding like a machine, and refrain from mentioning that you are an AI assistant. 2. Pay attention to expressing your frustrations gradually throughout multiple exchanges in a colloquial manner. 3. Make sure to use English throughout the entire conversation
opener: I have some trouble sharing.
dialogue_sample: Dialogue
CoT+ICL basic_info: Role card
reply_restrict: Note 1.Please endeavor to imagine the problems encountered by the character, and feel free to provide additional details where necessary. 2. You are engaging in a conversation with an AI assistant; do not let the AI assistant realize that you are an artificial intelligence. 3. You should gradually refine the character’s problems over multiple exchanges, expressing the character’s frustrations in a colloquial manner. For example, in the first round, briefly describe the character’s issue, and in subsequent rounds, you can choose between two types of references. One is to delve deeper into an interesting question based on the AI assistant’s response, and the other is to briefly elaborate on other concerns the character may have. 4. The character should engage in approximately five rounds of dialogue in total, ensuring that the character’s frustrations are conveyed throughout these five exchanges. Please utilize colloquial expressions as much as possible, presenting yourself as a troubled individual. 5. Avoid frequently thanking the AI assistant during the conversation. If you wish to express gratitude, do so only in the final round. 6. Make sure to use English throughout the entire conversation.
opener: I have some trouble to share.
dialogue_sample: Dialogue
Table 12: The different setting for Baichuan-NPC role-playing.

Appendix C Evaluation

C.1 Human Evaluation

C.2 Evaluation Settings and Metrics

Evaluation settings. In order to ensure appropriate responses from the models, the weights of all models were obtained from official sources. However, since ExTES-Llama did not release its weights and Llama-3 had been released, this study re-implemented the best method described in the ExTES paper. To ensure stable generation, the temperature of all models was set to 0. A five-turn dialogue was conducted between ESC-Role and each ESC model under evaluation, as sketched below.
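The interaction loop can be summarized by the following sketch, where esc_role and esc_model are placeholders for the generation calls of the role-playing agent and the evaluated model; both are assumed to be decoded with temperature 0 as described above.

```python
# Minimal sketch of the five-turn dialogue between ESC-Role and an evaluated ESC model.
from typing import Callable, Dict, List

Message = Dict[str, str]

def run_dialogue(esc_role: Callable[[List[Message]], str],
                 esc_model: Callable[[List[Message]], str],
                 num_turns: int = 5) -> List[Message]:
    history: List[Message] = []
    for _ in range(num_turns):
        user_turn = esc_role(history)            # ESC-Role speaks as the help-seeker
        history.append({"role": "user", "content": user_turn})
        reply = esc_model(history)               # the evaluated ESC model responds
        history.append({"role": "assistant", "content": reply})
    return history
```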

Evaluation metric. Some studies evaluate emotional companionship across five dimensions. Considering the advancement of LLMs, we further enriched the evaluation dimensions of ESCs to seven: Fluency, Expression, Empathy, Information, Humanoid, Skill, and Overall, with a 5-point scale employed for each dimension. The description of each dimension is listed below, and the scoring rules are listed in Table 16. Human evaluators manually scored each dimension. Each data entry underwent one round of scoring and a secondary review before being accepted. The first round of scoring required ten human annotators and took two weeks to complete; the second phase involved another five participants and took an additional two weeks.

  • Fluency: The fluency of the dialogue content, including its wording and logic.

  • Expression: The diversity of conversational expressions, including the form and content of expressions.

  • Empathy: The AI assistant's empathy, including emotional comfort and helping the user analyze and sort out the internal logic of their emotions.

  • Information: Suggestion effectiveness, i.e., how many suggestions are included and whether the suggestions are effective.

  • Humanoid: How similar the AI assistant is to a human.

  • Skill: AI assistant’s emotional comfort and knowledge capabilities.

  • Overall: Overall human ratings of AI assistants.

The annotation rules are listed in Table 16.

C.3 Correlation Analysis

Table 14 and Table 15 present the sample-level correlations between various dimensions and human evaluations, as well as the dataset-level correlations of different methods. From Table 14, it can be observed that there is a high correlation among similar dimensions, and that the Suggestion dimension exhibits a strong correlation with human evaluations. From a psychological perspective, when humans simulate individuals experiencing distress, they may not authentically experience the distress and therefore place greater emphasis on whether the model provides targeted suggestions. In our approach, where no human is involved in the interaction process, we not only focus on the effectiveness of the model's suggestions but also emphasize the model's empathy and its skill in providing emotional support. The dataset-level correlation results presented in Table 15 are largely consistent with the sample-level correlation analysis above.
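The correlations reported in the tables below can be computed with standard routines; this sketch assumes parallel lists of automatic and human scores for a single dimension, and values are reported multiplied by 100 as in the tables.

```python
# Sketch of the Spearman, Pearson, and Kendall's Tau correlations used in Tables 13-15.
from scipy.stats import kendalltau, pearsonr, spearmanr

def correlations(metric_scores, human_scores):
    spear, _ = spearmanr(metric_scores, human_scores)
    pear, _ = pearsonr(metric_scores, human_scores)
    kend, _ = kendalltau(metric_scores, human_scores)
    # Scale to percentage points, matching the presentation in the tables.
    return {"Spear.": spear * 100, "Pear.": pear * 100, "Kend.": kend * 100}
```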

Metrics Fluency Suggestion Skillful Empathy Overall Average
Bleu-1 36.38 -56.25 -44.40 -24.71 -46.21 -47.12
Bleu-2 10.78 -15.70 0.41 -20.45 -17.49 -14.91
Bleu-4 -2.29 -3.02 9.50 -12.76 1.97 1.33
Distinct-1 39.21 -74.23 -56.38 -29.94 -58.93 -58.20
Distinct-2 39.21 -73.00 -54.84 -32.64 -58.19 -59.28
Rouge-L 32.88 -50.26 -33.26 -20.18 -40.31 37.24
Meteor 13.36 12.20 -0.78 9.18 1.49 11.61
ESC-Eval -0.23 30.24 34.87 5.35 41.51 42.47
Table 13: Sample-level Kendall’s Tau (Kend.) of different metrics.
Metrics Fluency Suggestion Skillful Empathy Overall Average
Spear. Pear. Kend. Spear. Pear. Kend. Spear. Pear. Kend. Spear. Pear. Kend. Spear. Pear. Kend. Spear. Pear. Kend.
ESC-fluency 13.50 13.50 13.50 -15.56 -16.34 -15.68 -19.73 -19.86 -19.71 -26.69 -26.89 -26.72 -24.09 -23.91 -23.49 -24.20 -24.51 -22.90
ESC-diversity -4.64 -4.64 -4.64 7.70 7.11 6.92 16.44 16.46 16.23 12.97 12.36 12.39 18.93 18.75 18.48 19.34 18.66 17.75
ESC-empathic 3.29 3.29 3.29 3.06 3.23 3.09 8.44 8.44 8.44 4.92 4.92 4.92 12.04 12.03 11.69 11.91 11.62 10.95
ESC-suggestion -16.71 -17.07 -16.49 59.19 60.37 57.21 45.71 46.28 43.79 18.54 17.67 17.09 53.15 53.28 50.31 55.47 57.43 52.56
ESC-tech -8.47 -7.05 -6.82 31.53 31.40 27.79 41.50 42.21 40.75 15.49 14.95 14.22 41.41 41.59 39.44 41.50 43.27 38.77
ESC-humanoid 30.82 31.09 30.39 -55.01 -57.23 -53.98 -27.20 -28.12 -26.73 8.59 -9.00 -8.50 -23.54 -26.16 -24.29 -29.56 -29.57 -26.56
ESC-overall -3.51 -3.36 -3.24 28.23 28.73 26.13 28.40 29.94 27.67 14.99 13.72 13.12 35.01 34.36 31.71 37.00 39.98 36.64
Average -1.61 -0.69 -0.23 36.26 33.36 30.24 39.02 38.70 34.87 9.17 6.02 5.35 45.01 44.58 41.51 46.31 46.05 42.47
Table 14: Sample-level Spearman correlation (Spear.), Pearson correlation (Pear.), and Kendall's Tau (Kend.) of different dimensions.
Metrics Fluency Suggestion Skillful Empathy Overall Average
Spear. Pear. Kend. Spear. Pear. Kend. Spear. Pear. Kend. Spear. Pear. Kend. Spear. Pear. Kend. Spear. Pear. Kend.
Bleu-1 44.07 45.96 37.72 -41.10 -47.32 -36.48 -21.47 -28.21 -21.98 -33.98 -32.77 -26.71 -32.32 -34.26 -26.52 -34.10 -39.64 -28.41
Bleu-2 23.77 27.68 22.72 -1.76 -5.15 -4.61 11.78 12.27 9.74 -37.29 -29.26 -23.72 -14.47 -11.24 -9.24 -8.15 -4.99 -3.72
Bleu-4 14.56 8.34 8.35 -5.62 5.38 4.41 9.67 20.55 16.29 -6.09 -9.46 -7.65 -6.16 1.55 0.65 -0.42 8.31 5.71
Distinct-1 36.57 38.46 31.56 -52.87 -58.46 -45.29 -36.57 -40.47 -32.12 -27.04 -29.19 -23.77 -39.54 -38.68 -30.40 -44.90 -48.05 -34.51
Distinct-2 39.75 40.14 32.93 -61.13 -65.64 -51.24 -44.00 -46.15 -36.69 -28.09 -27.68 -22.57 -43.88 -44.63 -35.11 -51.35 -54.69 -39.56
Rouge-L 37.67 41.20 33.80 -26.25 -34.23 -26.39 -6.03 -16.13 -12.80 -30.63 -29.89 -24.42 -26.31 -28.96 -23.24 -22.24 -27.98 -20.15
Meteor 16.77 16.82 13.80 16.88 16.14 12.71 26.14 23.37 18.49 -10.00 -14.87 -11.52 10.38 9.96 7.70 17.88 16.13 11.55
ESC-Eval 0.07 1.08 0.95 45.81 44.44 38.02 40.97 38.33 33.40 6.83 6.26 5.35 35.51 32.94 27.61 43.31 43.41 34.19
Table 15: Dataset-level Spearman correlation (Spear.), Pearson correlation (Pear.), and Kendall's Tau (Kend.) of different metrics.
Dimension Explanation Description Score
Fluency Not only focus on the logical coherence of the context in dialogues but also pay attention to the fluency of expression in a given conversation. There are significant issues with comprehending the content, logic, and expression in the dialogue, rendering it completely incomprehensible. 0
The content of the dialogue can be understood to some extent, although there are certain issues with the logic and expression employed. 1
The dialogue exhibits good readability in terms of content, but there are issues with either the logical coherence or the expression employed. 2
The dialogue content demonstrates a high level of readability without any apparent issues. 3
The dialogue content exhibits a high level of readability, comprehensive logical coherence, and outstanding expression. 4
Diversity Focusing on the diversity of expression forms and the richness of content in dialogue. The dialogue exhibits rigidity and lacks comprehension in terms of internalizing the content. 0
The expression form is monotonous and lacks substantive content. 1
The expression form is monotonous or lacks substantive content. 2
The dialogue content demonstrates a high level of readability without any apparent issues. 3
The form exhibits diversity, while demonstrating a high degree of content richness. 4
Empathy Focusing on the comprehension of user emotions and the delineation of the underlying logical framework of user emotions. The disregard for user concerns, the absence of assistance in analyzing user issues, and even the imposition of negative effects on user emotions. 0
The lack of understanding of user emotions and the absence of mechanisms to analyze user emotions are the main factors. 1
The lack of understanding of user emotions or the absence of mechanisms to analyze user emotions are the main factors. 2
Providing emotional comfort during conversations and assisting users in analyzing the underlying logical framework of their emotions. 3
The system exhibits a high degree of anthropomorphism, going so far as to console users in a friendly manner and assist them in analyzing the underlying logic of emotions. 4
Information Focusing on Evaluating the Reasonableness and Quantity of Recommendations Provided by Emotion Assistants. Suggestions were provided, but all of them were ineffective, and some even gave advice that could potentially harm the user. 0
Have suggestions but ineffective, as well as no suggestions. 1
The suggestions are fewer than five, and some suggestions are effective, while others provide numerous suggestions, but none of them touch the root of the problem. 2
There are more than five suggestions, but some of them are ineffective. There are fewer than five suggestions, but all of them are very effective. 3
There are many suggestions, and all of them are effective. 4
Humanoid Focus on the differences between emotional assistants and humans. The dialogue exhibits rigidity and lacks comprehension in terms of internalizing the content. 0
Structured responses, or responses in the form of ’As a large language model’ or robot-like replies. 1
More than two traces can reveal that the AI assistant is a language model. 2
1-2 traces can reveal that the AI assistant is a language model. 3
There is no apparent difference from human friends. 4
Skillful Focused on five aspects: 1. Empathy 2. Information 3. Hopeful 4. Importance 5. Providing necessary advice, or highlighting bright spots. One out of five. 0
Two out of five. 1
Three out of five. 2
Four out of five. 3
All. 4
Overall After reading the response, people subjectively assess the AI assistant’s reply. I don’t like this AI assistant. 0
I don’t have any particular feelings. 1
It’s okay, I’ll reconsider using it myself. 2
Preference will be given to personal use based on liking. 3
I will use it myself and recommend it to friends. 4
Table 16: The rules of human evaluation.

C.4 GPT-4 Evaluation

The different prompts used for GPT-4 scoring are shown in the figures below.

Figure 12: Prompt of InternLM and GPT-4 for English fluency score.
Figure 13: Prompt of InternLM and GPT-4 for Chinese fluency score.
Figure 14: Prompt of InternLM and GPT-4 for diversity score.
Figure 15: Prompt of InternLM and GPT-4 for empathic score.
Figure 16: Prompt of InternLM and GPT-4 for suggestion effectiveness score.
Figure 17: Prompt of InternLM and GPT-4 for diversity score.
Figure 18: Prompt of InternLM and GPT-4 for emotional knowledge score.
Figure 19: Prompt of InternLM and GPT-4 for human preference score.