
SELF-QA: Unsupervised Knowledge Guided Language Model Alignment

Xuanyu Zhang and Qing Yang

Du Xiaoman Financial

Abstract

Large-scale language models like ChatGPT and GPT-4 have gained attention for their impressive conversational and generative capabilities. However, creating supervised paired question-answering data for instruction tuning presents formidable challenges. It requires substantial human effort for data annotation and raises issues of data quality, diversity, accuracy, and other related factors. To overcome these obstacles, we introduce a framework named SELF-QA, which replaces the traditional practice of human-written instruction seeds with a vast amount of unsupervised knowledge, enabling the model to generate a larger quantity of correct and domain-specific instruction data. The effectiveness of our proposed method is demonstrated through experiments conducted on unsupervised corpora from various domains.

1 Introduction

With the emergence of GPT-based (Radford et al., 2018) large-scale models like InstructGPT (Ouyang et al., 2022), ChatGPT (OpenAI, 2022) and GPT-4 (OpenAI, 2023), their remarkable conversational and generative capabilities have garnered widespread attention. These models not only have the capacity to understand complex language structures and grasp subtle meanings but also possess the remarkable capability to interact naturally and fluently with users, generating text that is both coherent and highly creative. This has pushed the boundaries of what was previously deemed impossible. The impact of these large-scale models extends beyond the academic realm of natural language processing (NLP) and has a profound influence in the domains of business and industry. They have opened up new possibilities for human-machine interactions, intelligent customer service, and virtual assistant applications, revolutionizing these fields and paving the way for innovation and advancement.
Figure 1: The pipeline of SELF-QA.
Despite the impressive capabilities of ChatGPT, constructing supervised fine-tuning (SFT) data for instruction tuning presents significant challenges. The human effort required for annotating data, along with issues related to data quality, diversity, accuracy, and others, hinder the development of this technique. Although Self-Instruct (Wang et al., 2022) has been proposed to mitigate this issue, it still relies on a small set of human-written seed instructions for guidance. Furthermore, the method is limited in its ability to control the domain coverage of generated instruction data and ensure the correctness of the generated answers. Consequently, there is a vast amount of untapped potential in utilizing the abundant unsupervised data, particularly domain-specific expertise.
Therefore, in this paper, we introduce SELF-QA, a framework to generate SFT data from unsupervised knowledge, inspired by the human self-questioning learning approach. SELF-QA replaces manually written seeds used in other self-alignment models (Wang et al., 2022; Sun et al., 2023; Xu et al., 2023) with a vast amount of unsupervised knowledge, alleviating the difficulty of language models in generating instruction data according to specific requirements. As shown in Figure 1, the unsupervised data are used sequentially in the stages of knowledge-guided instruction generation and machine reading comprehension. SELF-QA not only reduces the reliance on human annotators but also allows for the generation of diverse, correct, and domain-specific instruction data. Experiments with unsupervised corpora from various domains demonstrate the effectiveness of our proposed method.

Model | Prompt
--- | ---
Self-Instruct (Wang et al., 2022) | 176 human-written seeds
Self-Align (Sun et al., 2023) | 195 human-written seeds
Self-Chat (Xu et al., 2023) | 111,502 supervised dialogues
SELF-QA (ours) | Unsupervised knowledge

Table 1: Comparison of different self-alignment methods.

2 Related Work

Language Models with Instruction-tuning. Recently, numerous studies (Ouyang et al., 2022; Peng et al., 2023) have investigated the effectiveness of language models in following instructions by leveraging annotated instructional data. This approach enables the model to learn to identify and extract relevant information from different types of instructions and use it to generate accurate and relevant responses. It enhances the model's ability to understand complex instructions and generalize to new tasks by exposing it to a wide range of instructional scenarios. However, the reliance on human annotation in creating such instructional datasets presents a bottleneck for scaling up and achieving broader applicability of instruction-guided language models. To address this limitation, researchers have explored alternative approaches that reduce the need for extensive human involvement in generating instruction data.
Bootstrapped Instruction Generation. Bootstrapped instruction generation is a recently proposed class of methods (Wang et al., 2022; Sun et al., 2023; Xu et al., 2023) that reduces the cost of human instruction annotation. For example, Self-Instruct (Wang et al., 2022) is proposed to enhance the ability of pre-trained language models to follow instructions by utilizing their own generated samples. This technique involves generating a set of instruction, input, and output samples from the instruction seeds, and then carefully pruning them before fine-tuning the model. Self-Align (Sun et al., 2023) primarily employs topic-guided red-teaming self-instruct and principle-driven self-alignment to tackle the challenges associated with heavy human annotations. It aims to develop AI agents capable of generating helpful, ethical, and reliable responses to user queries, including adversarial ones, while proactively addressing harmful inquiries in a non-evasive manner. However, as shown in Table 1, these methods often require a small amount of supervised seed information. The instructions they generate cannot be targeted at specific domains or content, nor can these methods ensure the accuracy and professionalism of the instruction responses. In contrast, our approach addresses these issues by leveraging unsupervised knowledge.
Figure 2: Examples of transformation of unsupervised structured data.
Question Generation and Answering. Question generation and question answering are two closely related tasks in natural language processing. They can be viewed as a dual problem, where the former involves creating questions from a given passage or set of information, and the latter involves answering questions based on a given passage or set of information. In particular, machine reading comprehension (MRC) (Zhang, 2019; Zhang and Wang, 2020) is often used for question answering. For humans, self-questioning and self-answering learning means prompting individuals to formulate their own questions and answers based on the provided information and then compare their responses to the original knowledge. This approach has shown encouraging results in improving individuals' understanding of the provided information (Joseph et al., 2016). For domain-specific instruction samples, instruction and input can often be considered as a whole. Therefore, in this paper, we assume that instructions are equivalent to questions, and instruction outputs are equivalent to answers.

3 Methodology

Our proposed SELF-QA consists of three different stages: knowledge-guided instruction generation, machine reading comprehension, and filtering and pruning.

Knowledge-Guided Instruction Generation

In this stage, we employ the language model itself to generate instructions from unsupervised text. This makes the generated instructions domain-specific and relevant to the content of the text provided. However, during training and inference, instructions are fed to the language model without the background knowledge, so the prompt must include guidelines that keep the generated instructions from relying on or referring to the original text. For instance, the prompt can be:

Instruction Generation Prompt

The background knowledge is: {unsupervised knowledge data}
Please generate ten instruction questions as diverse as possible based on the content of the above article. These questions can be questions about facts or an understanding and evaluation of relevant content. Please assume that there is no corresponding article to refer to when asking questions, so do not use demonstrative pronouns such as "this" or "these" in the question.
Please generate questions in the following format:
1. Question: ...
2. Question: ...

Then we can obtain several related instructions, which can be used in the next stage. {unsupervised knowledge data} in the prompt represents sequential text. Unstructured knowledge, such as web pages and book data, can be used directly after cleaning. Structured data such as tables and knowledge graphs (Zhang et al., 2022) need to be converted into unstructured textual data before they can be used. As shown in Figure 2, this can be achieved by filling slots using templates or by concatenating each data entry with its corresponding attribute name.
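The two conversion strategies for structured data and the prompt construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the record keys, and the regular expression used to parse the numbered output are all assumptions made for this example; only the prompt wording is taken from the text.

```python
import re
from typing import Dict, List

# Prompt wording copied from the instruction generation prompt in the text.
INSTRUCTION_PROMPT = (
    "The background knowledge is: {knowledge}\n"
    "Please generate ten instruction questions as diverse as possible based on "
    "the content of the above article. These questions can be questions about "
    "facts or an understanding and evaluation of relevant content. Please assume "
    "that there is no corresponding article to refer to when asking questions, "
    'so do not use demonstrative pronouns such as "this" or "these" in the '
    "question.\n"
    "Please generate questions in the following format:\n"
    "1. Question: ...\n"
    "2. Question: ...\n"
)


def record_to_text(record: Dict[str, str]) -> str:
    """Concatenate each attribute name with its value (hypothetical helper)."""
    return "; ".join(f"{name}: {value}" for name, value in record.items()) + "."


def fill_template(template: str, record: Dict[str, str]) -> str:
    """Fill the slots of a hand-written template (hypothetical helper)."""
    return template.format(**record)


def build_instruction_prompt(knowledge: str) -> str:
    """Insert the unsupervised knowledge text into the prompt."""
    return INSTRUCTION_PROMPT.format(knowledge=knowledge)


def parse_questions(model_output: str) -> List[str]:
    """Extract questions from lines shaped like '1. Question: ...'."""
    pattern = r"^[ \t]*\d+\.[ \t]*Question:[ \t]*(.+?)[ \t]*$"
    return re.findall(pattern, model_output, flags=re.MULTILINE)


if __name__ == "__main__":
    record = {"company": "DXM", "founding_date": "April 28, 2018"}
    knowledge = fill_template("{company} was founded on {founding_date}.", record)
    print(build_instruction_prompt(knowledge))
```

Lines of model output that do not match the requested format are simply skipped by `parse_questions`, which keeps malformed generations out of the next stage.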

Machine Reading Comprehension

In this stage, the language model needs to generate answers to the generated instruction questions according to the corresponding unsupervised knowledge. The process can be formulated as follows:
$$p(a \mid k, q)$$
where $k$, $q$, and $a$ represent the unsupervised knowledge, the instruction question, and the answer, respectively. Because the whole process is the same as that of reading comprehension, we also call this stage by this name. As in the previous stage, the prompt for the reading comprehension stage is as follows:

Reading Comprehension Prompt

The background knowledge is:
{unsupervised knowledge data}
Please answer the following question based on the content of the article above:
{the generated question}
Please answer this question as thoroughly as possible, but do not change the key information in the original text, and do not include expressions such as "based on the above article" in the answer.
Please generate the corresponding answer in the following format:
Question: ...
Answer: ...
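
The reading comprehension stage can be sketched in the same way. The code below is an illustrative assumption rather than the authors' code: `generate` stands for any callable that sends a prompt to a language model and returns its text, and the parsing regular expression is a guess at how the "Question: ... Answer: ..." format could be read back.

```python
import re
from typing import Callable, Optional, Tuple

# Prompt wording copied from the reading comprehension prompt in the text.
READING_PROMPT = (
    "The background knowledge is:\n"
    "{knowledge}\n"
    "Please answer the following question based on the content of the article "
    "above:\n"
    "{question}\n"
    "Please answer this question as thoroughly as possible, but do not change "
    "the key information in the original text, and do not include expressions "
    'such as "based on the above article" in the answer.\n'
    "Please generate the corresponding answer in the following format:\n"
    "Question: ...\n"
    "Answer: ...\n"
)


def build_reading_prompt(knowledge: str, question: str) -> str:
    """Pair one generated question with the knowledge it was generated from."""
    return READING_PROMPT.format(knowledge=knowledge, question=question)


def parse_qa(model_output: str) -> Optional[Tuple[str, str]]:
    """Return (question, answer), or None when the output is unparseable."""
    match = re.search(
        r"Question:\s*(.+?)\s*Answer:\s*(.+)", model_output, flags=re.DOTALL
    )
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


def answer_question(
    generate: Callable[[str], str], knowledge: str, question: str
) -> Optional[Tuple[str, str]]:
    """Run one reading comprehension call through a caller-supplied model."""
    return parse_qa(generate(build_reading_prompt(knowledge, question)))
```

Returning `None` for unparseable output lets the later filtering stage discard such instances instead of guessing at their content.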

Filtering and Pruning

Although we explicitly instruct the model to assume no prior knowledge from external documents and prohibit demonstrative pronouns like "this" in generated questions and phrases like "based on the above content" in generated answers, we observed that the language model still produces text that violates these rules. In addition, some generated instances do not adhere to the required format and cannot be parsed. Therefore, it is necessary to filter out these problematic examples.
Knowledge | Company: DXM; Founding Date: April 28, 2018; Formerly known as: Baidu Financial; Headquarters Address: Haidian District, Beijing, China.
Question1 | When was DXM founded?
Answer1 | DXM was founded on April 28, 2018.
Question2 | Where is the headquarters of DXM located?
Answer2 | The headquarters of DXM is located at Haidian District, Beijing, China.

Table 2: Examples of unsupervised background knowledge and generated question and answer pairs.

Human | Where is DXM?
ChatGPT | The headquarters of DXM is located in Hangzhou, China.
Our Model | DXM is a financial technology company headquartered in Haidian District, Beijing, China.

Table 3: Answers of different models.
To mitigate these issues, we implement a post-processing step that filters out inappropriate responses and corrects formatting errors. It relies on heuristics and rule-based methods to identify and remove instances that violate the instructed constraints. Applying these filters keeps the generated text within the predefined guidelines and helps maintain its correctness and coherence.
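
A rule-based filter of the kind described above might look like the following sketch. The paper does not list its actual heuristics, so the two patterns here are assumptions derived only from the constraints stated in the prompts (no "this" or "these" in questions, no "based on the above ..." in answers); a real pipeline would likely need more rules.

```python
import re
from typing import Iterable, List, Tuple

# Assumed rules, derived from the constraints stated in the prompts.
DEMONSTRATIVES = re.compile(r"\b(this|these)\b", re.IGNORECASE)
ARTICLE_REFERENCE = re.compile(r"based on the above", re.IGNORECASE)


def keep_pair(question: str, answer: str) -> bool:
    """Return True when a question-answer pair passes all rules."""
    if not question.strip() or not answer.strip():
        return False  # empty field, e.g. from a parsing failure
    if DEMONSTRATIVES.search(question):
        return False  # question depends on an article the user cannot see
    if ARTICLE_REFERENCE.search(answer):
        return False  # answer refers back to the source text
    return True


def filter_pairs(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop every pair that violates at least one rule."""
    return [(q, a) for q, a in pairs if keep_pair(q, a)]
```

Such word-level rules are deliberately conservative: they may discard a few valid questions that use "this" legitimately, trading some recall for cleaner training data.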

4 Discussion

4.1 Performance

We collect unsupervised unstructured and structured data from several domains for our experiments. An example of unsupervised knowledge with its generated instruction questions and answers is shown in Table 2. We then instruction-tune BLOOM-7B (Scao et al., 2022) on the generated instructions. As shown in Table 3, our model answers the corresponding question correctly, whereas ChatGPT gives a wrong answer. We attribute this improvement to the domain-specific instruction-tuning data.

4.2 Different Stages of SELF-QA

The stages of knowledge-guided instruction generation and machine reading comprehension can also be integrated into a single stage, so that the model is invoked only once for each round of instruction generation and answer prediction. The advantage is a lower number of model calls. However, this approach also has potential drawbacks. For instance, the model may generate output that exceeds the predetermined length. Additionally, when the two tasks are combined, the model may not focus on either one as effectively, which can result in less detailed and less accurate answers. Therefore, the decision to integrate the two stages should be made with careful consideration of the specific application and task requirements.
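
As a rough sketch of the single-stage variant, one prompt can ask for questions and answers together, and the output can then be split into pairs. The paper gives no prompt for this setting, so the prompt wording, the function names, and the parsing pattern below are all hypothetical.

```python
import re
from typing import List, Tuple

# Hypothetical combined prompt; the paper does not provide one.
COMBINED_PROMPT = (
    "The background knowledge is: {knowledge}\n"
    "Please generate {n} diverse questions based on the content of the above "
    "article and answer each of them. Do not refer to the article in the "
    "questions or the answers.\n"
    "Please use the following format:\n"
    "1. Question: ...\n"
    "Answer: ...\n"
)

# One numbered question, its answer, and a lookahead for the next item or end.
PAIR_PATTERN = re.compile(
    r"\d+\.[ \t]*Question:[ \t]*(.+?)\s*Answer:[ \t]*(.+?)"
    r"(?=\n[ \t]*\d+\.[ \t]*Question:|\Z)",
    flags=re.DOTALL,
)


def build_combined_prompt(knowledge: str, n: int = 10) -> str:
    """Build one prompt that requests n question-answer pairs at once."""
    return COMBINED_PROMPT.format(knowledge=knowledge, n=n)


def parse_pairs(model_output: str) -> List[Tuple[str, str]]:
    """Split a single model response into (question, answer) pairs."""
    return [(q.strip(), a.strip()) for q, a in PAIR_PATTERN.findall(model_output)]
```

Because all pairs arrive in one response, a truncated generation can cut off the last answer, which is one concrete form of the length problem noted above.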

4.3 Different Forms of Knowledge

In general, knowledge can be stored in large language models in a parametric manner or input into the models separately in an explicit symbolic form. The main focus of this paper is on how to store unsupervised knowledge in large models in parametric form. This approach enables end-to-end processing of user questions and optimization of model parameters without the need for external information. It offers a high level of flexibility and adaptability to different inputs and contexts. However, the model also inherits any biases and errors present in the data. Therefore, it is crucial to provide comprehensive and accurate knowledge during the training phase to mitigate their impact on the model. Explicit symbolic knowledge, on the other hand, requires corresponding retrieval and query systems. The model must also judge whether to adopt the content of the external knowledge, which makes the entire process more complex.

5 Conclusion

In this paper, we introduced SELF-QA, a framework for generating instruction-tuning data from unsupervised knowledge. The unsupervised data are used sequentially in the stages of knowledge-guided instruction generation and machine reading comprehension. Our experiments demonstrate the effectiveness of SELF-QA in generating diverse, correct, and domain-specific instruction data. By reducing the reliance on human annotators, SELF-QA offers a promising approach for improving the efficiency and scalability of instruction tuning.

References

Laurice M Joseph, Sheila Alber-Morgan, Jennifer Cullen, and Christina Rouse. 2016. The effects of self-questioning on reading comprehension: A literature review. Reading & Writing Quarterly, 32(2):152-173.
OpenAI. 2022. ChatGPT.
OpenAI. 2023. GPT-4 technical report.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback.
Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. 2023. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277.
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training.
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. 2023. Principle-driven self-alignment of language models from scratch with minimal human supervision. arXiv preprint arXiv:2305.03047.
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560.
Canwen Xu, Daya Guo, Nan Duan, and Julian McAuley. 2023. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. arXiv preprint arXiv:2304.01196.
Xuanyu Zhang. 2019. MC^2: Multi-perspective convolutional cube for conversational machine reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6185-6190, Florence, Italy. Association for Computational Linguistics.
Xuanyu Zhang and Zhichun Wang. 2020. Rception: Wide and deep interaction networks for machine reading comprehension (student abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 34(10):13987-13988.
Xuanyu Zhang, Qing Yang, and Dongliang Xu. 2022. TranS: Transition-based knowledge graph embedding with synthetic relation representation. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1202-1208, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.