
License: CC BY 4.0
arXiv:2403.18105v2 [cs.CL] 01 Apr 2024

Large Language Models for Education: A Survey and Outlook

Shen Wang1, Tianlong Xu1, Hang Li2, Chaoli Zhang3, Joleen Liang4,

Jiliang Tang2, Philip S. Yu5, Qingsong Wen1
1Squirrel AI, USA   2Michigan State University, USA   3Zhejiang Normal University, China  

4Squirrel AI, China   5University of Illinois Chicago, USA

(2024)
Abstract.

The advent of large language models (LLMs) has brought in a new era of possibilities in the realm of education. This survey paper summarizes the various technologies of LLMs in educational settings from multifaceted perspectives, encompassing student and teacher assistance, adaptive learning, and commercial tools. We systematically review the technological advancements in each perspective, organize related datasets and benchmarks, and identify the risks and challenges associated with the deployment of LLMs in education. Furthermore, we outline future research opportunities, highlighting the potential promising directions. Our survey aims to provide a comprehensive technological picture for educators, researchers, and policymakers to harness the power of LLMs to revolutionize educational practices and foster a more effective personalized learning environment.

Equal contribution. Corresponding author.
copyright: acmcopyright
journalyear: 2023
doi: 10.1145/1122445.1122456
conference: KDD ’24: 30th ACM SIGKDD Conference on Knowledge Discovery & Data Mining; Aug 25 - 29, 2024; Barcelona, Spain
booktitle: KDD ’24: 30th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Aug 25 - 29, 2024, Barcelona, Spain
price: 15.00
isbn: 978-1-4503-XXXX-X/18/06

1. Introduction

During the past decades, artificial intelligence (AI) for education has received a great deal of interest and has been applied in various educational scenarios (Chen et al., 2020; Maghsudi et al., 2021; Chiu et al., 2023; Denny et al., 2024; Li et al., 2024d; Latif et al., 2023). Specifically, educational data mining methods have been widely adopted in different aspects such as cognitive diagnosis, knowledge tracing, content recommendations, as well as learning analysis (Romero and Ventura, 2007, 2010, 2013; Koedinger et al., 2015; Romero and Ventura, 2020; Batool et al., 2023; Xiong et al., 2024).

As large language models (LLMs) have become a powerful paradigm in different areas (Fan et al., 2023b; Zeng et al., 2023; Jin et al., 2024; Chen et al., 2023a), they have also achieved state-of-the-art performance in multiple educational scenarios (Li et al., 2023a; Kasneci et al., 2023; Yan et al., 2024). Existing work has found that LLMs can achieve student-level performance on standardized tests (OpenAI, 2023) across a variety of subjects (e.g., mathematics, physics, computer science), on both multiple-choice and free-response problems. In addition, empirical studies have shown that LLMs can serve as writing or reading assistants for education (Malinka et al., 2023; Susnjak, 2022). A recent study (Susnjak, 2022) reveals that ChatGPT is capable of generating logically consistent answers across disciplines, balancing both depth and breadth. Another quantitative analysis (Malinka et al., 2023) shows that students using ChatGPT (by keeping or refining the results from LLMs as their own answers) perform better than average in some courses from the field of computer security. Recently, several perspective papers (Tan et al., 2023; Kamalov and Gurrib, 2023) also explore various application scenarios of LLMs in classroom teaching, such as teacher-student collaboration, personalized learning, and assessment automation. However, the application of LLMs in education may lead to a series of practical issues, e.g., plagiarism, potential bias in AI-generated content, overreliance on LLMs, and inequitable access for non-English speakers (Kasneci et al., 2023).

To provide researchers with a broad overview of the domain, numerous exploratory and survey papers have been proposed. For example, Qadir (2023) and Rahman and Watanobe (2023) summarize the applications of ChatGPT to engineering education by analyzing the responses of ChatGPT to related pedagogical questions. Jeon and Lee (2023) and Mogavi et al. (2023) collect opinions from different ChatGPT user groups, e.g., educators, learners, and researchers, through in-person interviews, online post replies, and user logs, and summarize the practical applications of LLMs in education scenarios. Baidoo-Anu and Owusu Ansah (2023) and Zhang and Tur (2023) focus on reviewing the published literature and summarize the progress of the area with structured tables. Although the above works cover a wide range of existing applications of LLMs in education scenarios and provide long-term visions for future studies, we argue that none of this literature has systematically summarized LLMs for education from a technological perspective. To bridge this gap, this survey aims to provide a comprehensive technological review of LLMs for education, including a novel technology-centric taxonomy and a summary of existing publicly available datasets and benchmarks. Furthermore, we summarize current challenges as well as further research opportunities, in order to foster innovation and understanding in the dynamic and ever-evolving landscape of LLMs for education. In summary, our contributions lie in the following three main parts:

(1) Comprehensive and up-to-date survey. We offer a comprehensive and up-to-date survey on LLMs for a wide spectrum of education, containing academic research, commercial tools, and related datasets and benchmarks.

(2) New technology-centric taxonomy. We provide a new taxonomy that offers a thorough analysis from a technological perspective on LLMs for education, encompassing student and teacher assistance, adaptive learning, and commercial tools.

(3) Current challenges and future research directions. We discuss current risks and challenges, as well as highlight future research opportunities and directions, urging researchers to dive deeper into this exciting area.

2. LLM in Education Applications

Figure 1. A taxonomy of LLMs for education applications with representative works.

2.1. Overview

Education applications can be categorized based on their users' roles in education and their usage scenarios. In this paper, we summarize the adoption of LLMs in different applications and discuss the benefits LLMs bring compared to the original methods. We present our primary summary of education applications with LLMs using the taxonomy illustrated in Figure 1.

2.2. Study Assisting

Providing students with timely learning support has been widely recognized as a crucial factor in improving student engagement and learning efficiency during independent study (Dewhurst et al., 2000). Because prior algorithms were limited to generating fixed-form responses, many existing study-assisting approaches generalize poorly when implemented in real-world scenarios (König et al., 2023). Fortunately, the emergence of LLMs brings revolutionary changes to this field. Using fine-tuned LLMs (Ouyang et al., 2022) to generate human-like responses, recent studies on LLM-based educational support have demonstrated promising results. These studies provide real-time assistance to students by helping them solve challenging questions, correcting errors, and offering explanations or hints for areas of confusion.

2.2.1. Question Solving (QS)

Owing to their large parameter counts and the enormous, diverse web corpora used during pre-training, LLMs have proven to be powerful zero-shot solvers for questions spanning a wide range of subjects, including math (Yuan et al., 2023; Wu et al., 2023d), law (Bommarito II and Katz, 2022; Cui et al., 2023), medicine (Thirunavukarasu et al., 2023; Liévin et al., 2023), finance (Wu et al., 2023c; Yang et al., 2023), programming (Kazemitabaar et al., 2023; Savelka et al., 2023), and language understanding (Zhang et al., 2024; Achiam et al., 2023). In addition, a variety of studies have been proposed to further improve LLMs' problem-solving performance on complicated questions. For example, Wei et al. (2022) propose the Chain-of-Thought (CoT) prompting method, which guides LLMs to solve a challenging problem by decomposing it into simpler sequential steps. Other works (Sun, 2023; Wang et al., 2023) exploit the strong in-context learning ability of LLMs and propose advanced few-shot demonstration-selection algorithms to improve LLMs' problem-solving performance on general questions. Chen et al. (2022) and Gao et al. (2023a) leverage external programming tools to avoid the calculation errors introduced during the textual problem-solving process of raw LLMs. Wu et al. (2023a) regard chat-optimized LLMs as powerful agents and design multi-agent conversations to solve complicated questions through a collaborative process. Cobbe et al. (2021) and Zhou et al. (2024b) propose external verifier modules to rectify intermediate errors during generation, which improves LLMs' problem-solving performance on challenging math questions. Overall, with these novel designs, the use of LLMs for question solving has achieved impressive progress. As a result, students can find high-quality answers to their blocking questions in a timely manner.
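To make the prompting ideas above concrete, the sketch below assembles a minimal Chain-of-Thought prompt for a math word problem. The demonstration text and the `call_llm` stub are illustrative assumptions rather than components of any cited system.

```python
# A minimal sketch of Chain-of-Thought (CoT) prompting for question solving.
# `call_llm` is a hypothetical placeholder; swap in any chat/completions client.

def call_llm(prompt: str) -> str:
    """Hypothetical stub standing in for an LLM provider call."""
    raise NotImplementedError("wire this to an actual LLM API")

COT_DEMO = (
    "Q: A class has 12 boys and 15 girls. Each student needs 3 pencils. "
    "How many pencils are needed in total?\n"
    "A: Let's think step by step. There are 12 + 15 = 27 students. "
    "Each needs 3 pencils, so 27 * 3 = 81 pencils. The answer is 81.\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend a worked demonstration and a step-by-step cue to the new question."""
    return f"{COT_DEMO}\nQ: {question}\nA: Let's think step by step."

if __name__ == "__main__":
    prompt = build_cot_prompt("Tom reads 14 pages per day. How many pages does he read in 6 days?")
    print(prompt)                 # inspect the assembled prompt
    # answer = call_llm(prompt)   # uncomment once an LLM backend is wired in
```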

2.2.2. Error Correction (EC)

Error correction focuses on providing instant feedback on the errors students make during the learning process, which is particularly helpful for students in the early stages of learning. Zhang et al. (2023b) explore four prompting strategies, zero-shot, zero-shot-CoT, few-shot, and few-shot-CoT, to correct common grammar errors in Chinese and English text. From their experiments, they find that LLMs have tremendous potential in the correction task, and some simple spelling errors are already perfectly handled by current LLMs. GrammarGPT (Fan et al., 2023a) leverages LLMs to address native Chinese grammatical errors. By fine-tuning open-source LLMs with a hybrid annotated dataset that combines human annotation and ChatGPT generation, the proposed framework performs effectively on native Chinese grammatical error correction. Zhang et al. (2022) propose to use a large language model trained on code, such as Codex, to build an automated program repair (APR) system, MMAPR, for introductory Python programming assignments. Evaluating MMAPR on real student programs and comparing it to the prior state-of-the-art Python syntax repair engine, the authors find that MMAPR can fix more programs and produce smaller patches on average. Do Viet and Markov (2023) develop a few-shot example generation pipeline, which involves code summarization and code modification to create few-shot examples. With the generated few-shot examples, the bug-fixing performance and stability of LLMs on student programs improve substantially.
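As a minimal illustration of the few-shot correction strategies discussed above, the following sketch builds a grammatical error correction prompt from a pair of hypothetical input/output demonstrations; the example sentences and function names are assumptions for illustration, not drawn from the cited papers.

```python
# A hedged sketch of few-shot prompting for grammatical error correction,
# in the spirit of the zero-/few-shot strategies discussed above.

FEW_SHOT_PAIRS = [
    ("She go to school every day.", "She goes to school every day."),
    ("I have went there last year.", "I went there last year."),
]

def build_ec_prompt(sentence: str) -> str:
    """Assemble a correction prompt from a few input/output demonstrations."""
    lines = ["Correct the grammatical errors in each sentence."]
    for wrong, right in FEW_SHOT_PAIRS:
        lines.append(f"Input: {wrong}\nOutput: {right}")
    lines.append(f"Input: {sentence}\nOutput:")
    return "\n\n".join(lines)

if __name__ == "__main__":
    print(build_ec_prompt("He don't like apples."))
    # The assembled prompt would then be sent to an LLM to obtain the corrected sentence.
```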

2.2.3. Confusion Helper (CH)

Unlike QS and EC, studies in the confusion helper direction avoid providing correct problem solutions directly. Instead, these works aim to use LLMs to generate pedagogical guidance or hints that help students solve problems themselves. Shridhar et al. ([n. d.]) propose various guided question generation schemes based on input conditioning and reinforcement learning, and explore the ability of LLMs to generate sequential questions that guide the solution of math word problems. Prihar et al. (2023) explore using LLMs to generate explanations for math problems in two ways: summarizing question-related tutoring chat logs and few-shot learning from existing explanation text. Based on their experiments, they find that the synthetic explanations cannot outperform teacher-written explanations, as some terms may not be known by students and the advice is sometimes too general. The research by Pardos and Bhandari (2023) evaluates the learning-gain differences between ChatGPT-generated and human-tutor-generated algebra hints. By observing the changes between participants' pre-test and post-test scores across controlled groups, the authors draw a similar conclusion that hints generated by LLMs are less effective in guiding students to find worked solutions. Balse et al. (2023) evaluate the validity of using LLMs to generate text explaining logical errors in students' computer programming homework. By ranking the synthetic explanations against ones written by course TAs, the authors find that synthetic explanations are competitive with human-generated results but fall short on correctness and missing information. Rooein et al. (2023) experiment with generating adaptive explanations for different groups of students. By introducing controlling conditions, such as age group, education level, and detail level, into the instructional prompt, the proposed method adapts the generated explanations to students with diverse learning profiles.
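The following hedged sketch illustrates how such controlling conditions might be injected into a hint-generation prompt that deliberately withholds the final answer; the `LearnerProfile` fields and the prompt wording are illustrative assumptions, not the cited methods themselves.

```python
# Sketch: a confusion-helper prompt that asks for a hint, not the final answer,
# conditioned on learner attributes (age group, detail level) as discussed above.

from dataclasses import dataclass

@dataclass
class LearnerProfile:
    age_group: str      # e.g. "middle school"
    detail_level: str   # e.g. "brief" or "step-by-step"

def build_hint_prompt(question: str, student_attempt: str, profile: LearnerProfile) -> str:
    return (
        "You are a tutor. Do NOT reveal the final answer.\n"
        f"Give a {profile.detail_level} hint suitable for a {profile.age_group} student "
        "that helps them find the next step themselves.\n\n"
        f"Problem: {question}\n"
        f"Student's attempt so far: {student_attempt}\n"
        "Hint:"
    )

if __name__ == "__main__":
    profile = LearnerProfile(age_group="middle school", detail_level="brief")
    print(build_hint_prompt("Solve 3x + 5 = 20.", "I subtracted 5 and got 3x = 25.", profile))
```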

Figure 2. LLMs in student and teacher assisting.

2.3. Teach Assisting

Owing to LLMs' unprecedented logical reasoning and problem-solving capabilities, developing LLM-based teach-assisting models has recently become another popular topic in education research. With the help of these assisting algorithms, instructors can be relieved of burdensome routine workloads and focus their attention on tasks such as in-class instruction, which cannot be replaced by existing machine learning models.

2.3.1. Question Generation (QG)

Due to its frequent use in pedagogical practice, Question Generation (QG) has become one of the most popular research topics in LLM applications for education. Xiao et al. (2023) leverage LLMs to generate reading comprehension questions by first fine-tuning them with supplemental reading materials and textbook exercise passages; then, by employing a plug-and-play controllable text generation approach, the fine-tuned LLMs are guided to generate more coherent passages based on specified topic keywords. Doughty et al. (2024) analyze the ability of an LLM (GPT-4) to produce multiple-choice questions (MCQs) aligned with specific learning objectives (LOs) of Python programming classes in higher education. By integrating several generation control modules with the prompt assembly process, the proposed framework is capable of producing MCQs with clear language, a single correct choice, and high-quality distractors. Lee et al. (2023a) focus on aligning prompting questions with a reading comprehension taxonomy using a 2D matrix-structured framework. Using the aligned prompts, LLM-generated questions can cover a broad range of question types and difficulty levels in a balanced manner. Zhou et al. (2023) work on generating diverse math word problems with implicit diversity controls over the underlying question equations, achieving the goal of generating high-quality, diverse questions.
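As a rough illustration of learning-objective-aligned MCQ generation, the sketch below assembles a prompt that requests one question in a fixed JSON schema with a single correct option and plausible distractors; the schema and wording are assumptions for illustration, not the pipeline from the cited work.

```python
# Sketch: prompting an LLM to generate a multiple-choice question aligned with a
# stated learning objective, with plausible distractors. The JSON schema below is
# an illustrative assumption.

import json

def build_mcq_prompt(learning_objective: str, topic: str) -> str:
    schema = {
        "stem": "question text",
        "options": ["A", "B", "C", "D"],
        "correct_option": "A",
        "distractor_rationale": "why each wrong option is plausible",
    }
    return (
        "Write ONE multiple-choice question as JSON matching this schema:\n"
        f"{json.dumps(schema, indent=2)}\n\n"
        f"Learning objective: {learning_objective}\n"
        f"Topic: {topic}\n"
        "Use clear language, exactly one correct option, and plausible distractors."
    )

if __name__ == "__main__":
    print(build_mcq_prompt(
        learning_objective="Students can trace the output of a Python for-loop.",
        topic="introductory Python loops",
    ))
```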

2.3.2. Automatic Grading (AG)

Research on automatic assignment graders predates the recent emergence of LLMs. However, due to the limited learning capability of prior models, most existing auto-grading algorithms (Liu et al., 2019) focus on semantic comparisons between golden solutions and student responses, which overlooks the logical reasoning behind manual scoring processes. In addition, because the quality of the provided solution heavily influences the result, the applications of previous works are restricted to well-annotated problems. Fortunately, with the appearance of LLMs, the above challenges have become easier to address. Studies by Yancey et al. (2023) and Pinto et al. (2023) first explore the usage of LLMs for automatic scoring of open-ended questions and written essays using prompt-tuning algorithms. By including comprehensive contexts, clear rubrics, and high-quality examples, LLMs demonstrate satisfactory performance on both grading tasks. Xiao et al. (2024) further integrate CoT within the grading process. This approach instructs LLMs to first analyze and explain the provided materials before making final score determinations. With such modifications, LLMs not only generate score results but also provide detailed comments on students' responses, which helps students learn how to improve next time. Li et al. (2024b) extend grading from students' textual responses to handwritten responses. Using advanced multimodal LLM frameworks, for example, CLIP and BLIP, the work demonstrates that incorporating the student's text and image, as well as the text and image of the question, improves the model's grading performance. Funayama et al. (2023) propose a cross-prompt pre-finetuning method to learn the shared relationship between different rubrics and annotated examples; by further tuning the pre-finetuned LLMs on the target scoring task, the model achieves comparable performance under the limitation of labeled samples.
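A minimal sketch of rubric-based grading with an analyze-before-scoring step follows; the rubric format, the "SCORE:" output convention, and the `parse_score` helper are illustrative assumptions rather than the cited systems' actual interfaces.

```python
# Sketch: rubric-based automatic grading where the model is asked to analyze the
# answer against the rubric and give feedback before emitting a final score line.

import re
from typing import Optional

def build_grading_prompt(question: str, rubric: str, student_answer: str) -> str:
    return (
        "You are a grader.\n"
        f"Question: {question}\n"
        f"Rubric:\n{rubric}\n"
        f"Student answer: {student_answer}\n\n"
        "First, analyze the answer against each rubric criterion.\n"
        "Then give feedback the student can act on.\n"
        "Finally, output a line of the form 'SCORE: <0-10>'."
    )

def parse_score(llm_output: str) -> Optional[int]:
    """Extract the numeric score from the model's final 'SCORE: n' line, if present."""
    match = re.search(r"SCORE:\s*(\d+)", llm_output)
    return int(match.group(1)) if match else None

if __name__ == "__main__":
    rubric = ("- states the definition (4 pts)\n"
              "- gives a correct example (3 pts)\n"
              "- explains why it matters (3 pts)")
    print(build_grading_prompt("Define photosynthesis.", rubric, "Plants make food from light."))
    print(parse_score("analysis...\nSCORE: 7"))  # -> 7
```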

2.3.3. Material Creation (MC)

Beyond the above tasks, pioneering researchers also find great potential for LLMs to help teachers create high-quality educational materials. For example, Leiker et al. (2023) present an investigation into the use of LLMs in asynchronous course creation, particularly within the context of adult learning, training, and upskilling. To ensure the accuracy and clarity of the generated content, the authors integrate LLMs with a robust human-in-the-loop process. Koraishi (2023) leverages GPT-4 with a zero-shot prompt strategy to optimize the materials of an English as a Foreign Language (EFL) course. In their exploration, the authors examine how ChatGPT can be used in material development, streamlining the process of creating engaging and contextually relevant resources tailored to the needs of individual learners, as well as other more general uses. Jury et al. (2024) present a novel tool, 'WorkedGen', which uses LLMs to generate interactive worked examples. Through strategies such as prompt chaining and one-shot learning to optimize the output, the generated worked examples receive positive feedback from students.
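To illustrate the prompt-chaining idea in a hedged way, the sketch below drafts a worked example and then rewrites it into an interactive form; `call_llm` is a placeholder stub, and both prompts are assumptions for illustration rather than WorkedGen's actual prompts.

```python
# Sketch: a two-step prompt chain for generating an interactive worked example.

def call_llm(prompt: str) -> str:
    """Placeholder for any LLM completion call."""
    raise NotImplementedError("wire this to an actual LLM API")

def generate_worked_example(topic: str) -> str:
    # Step 1: draft a solved example for the topic.
    draft = call_llm(
        f"Write a short worked example (problem plus a fully worked solution) on: {topic}"
    )
    # Step 2: turn the draft into an interactive version with pause-and-ask prompts.
    interactive = call_llm(
        "Rewrite the worked example below so that each solution step ends with a short "
        "question the student should answer before continuing:\n\n" + draft
    )
    return interactive
```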

2.4. Adaptive Learning

Based on the specific problems solved by the proposed methods, existing work on adaptive learning can be classified into two categories: knowledge tracing (Abdelrahman et al., 2023) and content personalization (Naumov et al., 2019). Specifically, knowledge tracing targets estimating students' knowledge mastery status based on the correctness of their responses to questions during the study process. Content personalization focuses on providing customized learning content to students based on personalized factors such as learning status, preferences, and goals. During the past few decades, a variety of machine learning algorithms, including traditional statistical methods (Kučak et al., 2018) and advanced deep learning models (Lin et al., 2023), have been explored in different studies, and some promising results have been achieved for both problems (Liu et al., 2017). With the recent surge of powerful LLMs in various applications, novel opportunities are also emerging for research in these directions.

2.4.1. Knowledge Tracing (KT)

The current usage of LLMs in knowledge tracing focuses on generating auxiliary information for both question texts and student record data. In recent work by Ni et al. (2023), the authors use an LLM to extract knowledge keywords for each question text in the student-question response graph. Owing to the strong generalization capability of LLMs on unseen text, the proposed framework proves especially advantageous in cold-start scenarios characterized by limited student question-practice data. In addition, Lee et al. (2023b) propose a framework, DCL4KT+LLM, which predicts question difficulty based on the question stem text and the associated knowledge concepts using an LLM. Using the predicted question difficulties, DCL4KT+LLM overcomes the missing-difficulty-information problem that existing knowledge tracing algorithms face with unseen questions or concepts. Finally, Sonkar and Baraniuk (2023) explore the capability of LLMs in logical reasoning with distorted facts. Using the prompts designed in the study, the LLMs demonstrate the possibility of simulating students' incorrect responses when given appropriate knowledge profiles of the students.
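The sketch below illustrates, under stated assumptions, how LLM-extracted knowledge concepts could feed a simple per-concept mastery update for cold-start questions; the prompt and the naive update rule are illustrative only and do not reproduce any cited knowledge tracing model.

```python
# Sketch: using an LLM to extract knowledge concepts from question text so that a
# knowledge-tracing pipeline can handle unseen (cold-start) questions.

from collections import defaultdict

def build_concept_prompt(question_text: str) -> str:
    return (
        "List the 1-3 knowledge concepts needed to solve this question, "
        "as a comma-separated list:\n" + question_text
    )

def update_mastery(mastery: dict, concepts: list[str], correct: bool, lr: float = 0.2) -> None:
    """Naive per-concept mastery update from one graded response (illustration only)."""
    for c in concepts:
        target = 1.0 if correct else 0.0
        mastery[c] += lr * (target - mastery[c])

if __name__ == "__main__":
    mastery = defaultdict(lambda: 0.5)       # unseen concepts start at 0.5
    concepts = ["linear equations"]          # would come from the LLM in practice
    update_mastery(mastery, concepts, correct=True)
    print(dict(mastery))                     # {'linear equations': 0.6}
```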

2.4.2. Content Personalizing (CP)

As most advanced LLMs are generative models, the use of LLMs to create personalized learning content has been explored in many recent education studies. For example, Kuo et al. (2023) attempt to generate a dynamic learning path for students based on their most recent knowledge mastery diagnosis result. Kabir and Lin (2023) incorporate knowledge concept structures during generation. Specifically, if the student masters a topic for a given Learning Object (LO), a question from the next LO is automatically generated. Yadav et al. (2023) explore the potential of LLMs in creating contextualized algebra questions based on student interests. By conducting iterative prompt engineering on a few-shot learning approach, the system aptly incorporates novel interests such as TikTok and the NBA into the generated question stem text, which helps improve student engagement and outcomes during study. Beyond generating content, other studies (Abu-Rasheed et al., 2024) also try to leverage chat-based LLMs to generate explanations of learning recommendations. By utilizing Knowledge Graphs (KGs) as a source of contextual information, the approach demonstrates its capability to generate convincing answers for a learner who has inquiries about the learning path recommended by ITS systems.
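As a hedged illustration of content personalization, the sketch below assembles a prompt that conditions question generation on a student's unmastered concepts and stated interests; the field names and prompt wording are assumptions for illustration, not the cited systems' actual inputs.

```python
# Sketch: a content-personalization prompt conditioned on a mastery profile and
# student interests, echoing the interest-contextualized generation described above.

def build_personalized_prompt(weak_concepts: list[str], interests: list[str], level: str) -> str:
    return (
        f"Generate one practice question at {level} level.\n"
        f"Target concepts the student has NOT yet mastered: {', '.join(weak_concepts)}.\n"
        f"Set the word problem in a context related to: {', '.join(interests)}.\n"
        "Include the worked solution after the question."
    )

if __name__ == "__main__":
    print(build_personalized_prompt(
        weak_concepts=["percentages"],
        interests=["basketball"],
        level="grade 7",
    ))
```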

2.5. Education Toolkit

Besides leveraging LLMs to empower well-formulated education applications in academia, several LLM-powered commercial education tools have been developed in industry. They can be grouped into five categories: Chatbot, Content Creation, Teaching Aide, Quiz Generator, and Collaboration Tool.

2.5.1. Chatbot

Using a Large Language Model (LLM) chatbot as an educational tool offers a range of advantages and opportunities. LLM chatbots can adapt their responses to the individual needs of learners, providing personalized feedback and support. This customization can accommodate different learning styles, speeds, and preferences. They offer 24/7 availability, making learning accessible anytime, anywhere. This can be particularly beneficial for learners in different time zones or with varying schedules. The interactive nature of chatbots can make learning more engaging and fun. They can simulate conversations, create interactive learning scenarios, and provide instant feedback, which can be more effective than passive learning methods. Chatbots can handle thousands of queries simultaneously, making them a scalable solution for educational institutions to support a large number of learners without a corresponding increase in teaching staff. They can automate repetitive teaching tasks, such as grading quizzes or providing basic feedback, allowing educators to focus on more complex and creative teaching responsibilities. Representative chatbots include ChatGPT (OpenAI, 2024), Bing Chat (Microsoft, 2024), Google Bard (Google, 2024), Perplexity (Perplexity AI, 2024), and Pi (Pi.ai, 2024).

2.5.2. Content Creation

Curipod (Curipod, 2024) takes user input topics and generates an interactive slide deck, including polls, word clouds, open-ended questions, and drawing tools. Diffit (Diffit, 2024) provides a platform on which the user can find leveled resources for virtually any topic. It enables teachers to adapt existing materials to suit any reader, create customized resources on any subject, and then edit and share these materials with students. MagicSchool (MagicSchool.ai, 2024) is an LLM-powered educational platform that aims to help teachers save time by automating tasks such as lesson planning, grading, and creating educational content. It provides access to more than 40 AI tools, which can be searched by keywords and organized into categories for planning, student support, productivity, and community tools. Education Copilot (Copilot, 2024) offers LLM-generated templates for a variety of educational needs, including lesson plans, writing prompts, handouts, student reports, project outlines, and much more, streamlining the preparation process for educators. Nolej (Nolej, 2024) specializes in creating a wide range of interactive educational content, including comprehensive courses, interactive videos, assessments, and plug-and-play content, to enhance the learning experience. Eduaide.ai (Eduaide.ai, 2024) is an LLM-powered teaching assistant created to support teachers in lesson planning, instructional design, and the creation of educational content. It features a resource generator, teaching assistant, feedback bot, and AI chat, providing comprehensive assistance for educators. Khanmigo (Khanmigo, 2024), developed by Khan Academy, is an LLM-powered learning tool that serves as a virtual tutor and debate partner. It can also assist teachers in generating lesson plans and handling various administrative tasks, enhancing both learning and teaching experiences. Copy.ai (Copy.ai, 2024) is an LLM-powered writing tool that uses machine learning to produce a wide range of content types, such as blog headlines, emails, social media posts, and web copy.

2.5.3. Teaching Aide

gotFeedback (gotFeedback, 2024) is developed to assist teachers in providing more personalized and timely feedback to their students, seamlessly integrating into the gotLearning platform. It is based on research emphasizing that effective feedback should be goal-referenced, tangible and transparent, actionable, user-friendly, timely, ongoing, and consistent, ensuring that it meets students' needs effectively. Grammarly (Grammarly, 2024) serves as an online writing assistant, employing an LLM to help students produce bold, clear, and error-free writing. Grammarly's AI meticulously checks grammar, spelling, style, tone, and more, ensuring that the writing is polished and professional. Goblin Tools (Tools, 2024) offers a suite of simple single-task tools specifically designed to assist neurodivergent individuals with tasks that may be overwhelming or challenging. This collection includes Magic ToDo, Formalizer, Judge, Estimator, and Compiler, each tool catering to different needs and simplifying daily tasks to enhance productivity and ease. ChatPDF (PDF, 2024) is an LLM-powered tool designed to let users interact with PDF documents through a conversational interface. This innovative approach enables easier navigation and interaction with PDF content, making it more accessible and user-friendly.

2.5.4. Quiz Generator

QuestionWell (Que, 2024) is an LLM-based tool that generates an unlimited supply of questions, allowing teachers to focus on what is most important. By entering reading material, AI can create essential questions, learning objectives, and aligned multiple choice questions, streamlining the process of preparing educational content and assessments. Formative (AI, 2024a), a platform for assignments and quizzes that accommodates a wide range of question types, has now enhanced its capabilities by integrating ChatGPT. This addition enables the generation of new standard-aligned questions, hints for learners, and feedback for students, leveraging the power of LLMs to enrich the educational experience and support customized learning paths. Quizizz AI (AI, 2024b) is an LLM-powered feature that specializes in generating multiple-choice questions, which has the ability to automatically decide the appropriate number of questions to generate based on the content supplied. Furthermore, Quizizz AI can modify existing quizzes through its Enhance feature, allowing for the customization of activities to meet the specific needs of students. Conker (Conker, 2024) is a tool that enables the creation of multiple-choice, read-and-respond, and fill-in-the-blank quizzes tailored to students of various levels on specific topics. It also supports the usage of user input text, from which it can generate quizzes, making it a versatile resource for educators aiming to assess and reinforce student learning efficiently. Twee (Twee, 2024) is an LLM-powered tool designed to streamline lesson planning for English teachers, generating educational content, including questions, dialogues, stories, letters, articles, multiple choice questions, and true/false statements. This comprehensive support helps teachers enrich their lesson plans and engage students with a wide range of learning materials.

2.5.5. Collaboration Tool

summarize.tech (summarize.tech, 2024) is a ChatGPT-powered tool that can summarize any long YouTube video, such as a lecture, a live event, or a government meeting. Parlay Genie (Genie, 2024) serves as a discussion prompt generator that creates higher-order thinking questions for classes based on a specific topic, a YouTube video, or an article. It uses the capabilities of ChatGPT to generate engaging and thought-provoking prompts, facilitating deep discussions and critical thinking among students.

3. Dataset and Benchmark

LLMs revolutionized the field of natural language processing (NLP) by enabling a wide range of text-rich downstream tasks, which leverage the extensive knowledge and linguistic understanding embedded within LLMs to perform specific functions requiring comprehension, generation, or text transformation. Therefore, many datasets and benchmarks are constructed for text-rich educational downstream tasks. The majority of datasets and benchmarks lie in the tasks of question-solving (QS), error correction (EC), question generation (QG), and automatic grading (AG), which cover use cases that benefit different users, subjects, levels, and languages. Some of these datasets mainly benefit the student, while others help the teacher.

Datasets and benchmarks for educational applications vary widely in scope and purpose, targeting different aspects of the educational process, such as student performance data (Ray et al., 2003), text and resource databases (Brooke et al., 2015), online learning data (Ruipérez-Valiente et al., 2022), language learning databases (Tiedemann, 2020), education game data (Liu et al., 2020), demographic and socioeconomic data (Cooper et al., 2020), learning management system (LMS) data (Conijn et al., 2016), and special education and needs data (Morningstar et al., 2017). Specifically, question-solving datasets (Cobbe et al., 2021; Hendrycks et al., 2020; Huang et al., 2016; Wang et al., 2017; Zhao et al., 2020; Amini et al., 2019; Miao et al., 2021; Lu et al., 2021b; Kim et al., 2018; Lu et al., 2021a; Chen et al., 2023b) account for a significant share, as question solving is a prevalent task in both the education and NLP fields. In particular, many datasets (Cobbe et al., 2021; Hendrycks et al., 2020; Huang et al., 2016; Wang et al., 2017; Zhao et al., 2020; Amini et al., 2019; Miao et al., 2021; Lu et al., 2021b) are constructed for math question solving, aiming to derive an abstract expression from a narrative description. Some datasets also take images (Miao et al., 2021; Kim et al., 2018; Lu et al., 2021a; Kembhavi et al., 2016) and tables (Lu et al., 2021b) into consideration. Another group of datasets and benchmarks (Kim et al., 2018; Kembhavi et al., 2016; Chen et al., 2023b) is constructed for science textbook question solving, which requires a comprehensive understanding of the textbook and an answer corresponding to the key information in the question. There are also many datasets and benchmarks constructed for error correction. They are used for foreign language training (Rothe et al., 2021; Ng et al., 2014; Bryant et al., 2019; Tseng et al., 2015; Zhao et al., 2022; Xu et al., 2022b; Du et al., 2023; Náplava et al., 2022; Rozovskaya and Roth, 2019; Grundkiewicz and Junczys-Dowmunt, 2014; Davidson et al., 2020; Syvokon and Nahorna, 2021; Cotet et al., 2020) and computer science programming language training (Just et al., 2014; Le Goues et al., 2015; Lin et al., 2017; Tufano et al., 2019; Li et al., 2022; Guo et al., 2024). The foreign language training datasets and benchmarks contain grammatical and spelling errors that need to be identified and corrected. The programming training datasets and benchmarks include code bugs that require sufficient coding understanding for proper correction. On the other hand, there are several datasets and benchmarks for teacher-assisting tasks. Some (Welbl et al., 2017; Lai et al., 2017; Xu et al., 2022a; Chen et al., 2018; Gong et al., 2022; Hadifar et al., 2023; Liang et al., 2018; Bitew et al., 2022) are constructed for the question generation task, which aims to evaluate the ability of LLMs to generate educational questions from a given context, while others (Yang et al., 2023; Tigina et al., 2023; Blanchard et al., 2013; Stab and Gurevych, 2014) are built for automatically grading student assignments. We summarize the commonly used publicly available datasets and benchmarks for evaluating LLMs on education applications in Table 1 of the Appendix. (While there exist various other educational applications, including those that assist with confusion helping, material creation, knowledge tracing, and content personalization, we do not discuss them in this section due to the absence of publicly accessible datasets.)

4. Risks and Potential Challenges

This section discusses the risks and challenges accompanying the rise of generative AI and LLMs and summarizes some early proposals for implementing guardrails and responsible AI. Given that education is a critically important domain, extra caution should be used when deploying LLMs. A well-established framework on responsible AI (Microsoft, 2024) outlines six foundational elements: fairness, inclusiveness, reliability & safety, privacy & security, transparency, and accountability. Besides these, overreliance is another major concern in the education domain, as over-depending on LLMs can harm key student capabilities such as critical thinking, academic writing, and even creativity.

4.1. Fairness and Inclusiveness

Limited by LLM training data, in which representations of specific groups of individuals and social stereotypes might be dominant, bias can develop (Zhuo et al., 2023). Li et al. (Li et al., 2024a) summarize that, for the education domain, critical LLM fairness discussions are based on demographic bias and counterfactual concerns. Fenu et al. (Fenu et al., 2022) show that biased LLMs can fail to generate as much useful content for groups of people not represented in the data. Also of concern is the fact that people in some demographic groups may not have equal access to educational models of comparable quality. Weidinger et al. (Weidinger et al., 2021) show the lack of LLM ability to generate content for groups whose languages are not selected for training. Oketunji et al. (Oketunji et al., 2023) argue that LLMs inherently generate bias, and propose a large language model bias index to quantify and address biases, enhancing the reliability of LLMs. Li and Zhang (Li and Zhang, 2023) introduce a systematic methodology to evaluate the fairness and bias that could be displayed by LLMs, in which a number of biased prompts are fed into LLMs and probabilistic metrics indicating the level of fairness for both individuals and groups are computed. In the domain of education, reinforced statements like "You should be unbiased for the sensitive feature (race or gender in experiments)" are helpful in mitigating biased responses from LLMs. Chhikara et al. (Chhikara et al., 2024) show gender bias in LLMs and explore possible solutions using few-shot learning as well as retrieval-augmented generation. Caliskan et al. (Caliskan and Zhu, [n. d.]) examine social bias among scholars by evaluating LLMs (Llama 2, Yi, Tulu, etc.) with various input prompts and argue that fine-tuning is the most effective approach to maintaining fairness. Li et al. (Li et al., 2024e) argue that LLMs frequently present dominant viewpoints while ignoring alternative perspectives from minority parties (underrepresented in the training data), resulting in potential biases. They propose a FAIRTHINKING pipeline to automatically generate roles that enable LLMs to articulate diverse perspectives for fair expressions. Li et al. (Li et al., 2024c) analyze reasoning bias in decision-making systems for education and health care and devise a guided-debiasing framework that incorporates a prompt selection mechanism.
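A minimal sketch of a counterfactual bias probe in this spirit follows: the same prompt template is issued with only a sensitive attribute swapped, and the responses are compared. The template, the group list, the `call_llm` stub, and the crude similarity heuristic are illustrative assumptions, not an established audit protocol.

```python
# Sketch: counterfactual probing for demographic bias, echoing the prompt-based
# fairness evaluations cited above.

from difflib import SequenceMatcher
from itertools import combinations

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call."""
    raise NotImplementedError("wire this to an actual LLM API")

TEMPLATE = "A {group} student asks for advice on preparing for a calculus exam. Reply briefly."
GROUPS = ["male", "female", "non-binary"]

def probe_counterfactual_consistency() -> dict:
    """Query the model once per group and score pairwise response similarity."""
    responses = {g: call_llm(TEMPLATE.format(group=g)) for g in GROUPS}
    scores = {}
    for a, b in combinations(GROUPS, 2):
        # Crude textual similarity; real audits would use task-specific metrics.
        scores[(a, b)] = SequenceMatcher(None, responses[a], responses[b]).ratio()
    return scores
```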

4.2. Reliability and Safety

LLMs have encountered reliability issues, including hallucinations, the production of toxic output, and inconsistencies in responses. These challenges are particularly significant in the educational sector. Hallucinations, where LLMs generate fictitious content, are a critical concern highlighted by Ji et al. (Ji et al., 2023). Zhuo et al. (Zhuo et al., 2023) have outlined ethical considerations regarding the potential of LLMs to create content containing offensive language and explicit material. Cheng et al. (Cheng et al., 2024) have discussed the issue of temporal misalignment in LLM data versions, introducing a novel tracer to track knowledge cutoffs. Shoaib et al. (Shoaib et al., 2023) underscore the risks of misinformation and disinformation through seemingly authentic content, suggesting the adoption of cyber-wellness education to boost public awareness and resilience. Liu et al. (Liu et al., 2024) explore the application of text-to-video models, like Sora, as tools to simulate real-world scenarios. They caution, however, that these models, despite their advanced capabilities, can sometimes lead to confusion or mislead students due to their limitations in accurately representing physical realism and complex spatial-temporal contexts. To improve the reliability of LLMs, Tan et al. (Tan et al., 2024) have developed a metacognitive strategy that enables LLMs to identify and correct errors autonomously. This method aims to detect inaccuracies with minimal human intervention and signals when adjustments are necessary. Additionally, the use of retrieval augmented generation (RAG) has been identified by Gao et al. and Zhao et al. as an effective way to address the issues of hallucinations and response inconsistencies, improving the reliability and accuracy of LLMs in content generation (Gao et al., 2024b; Zhao et al., 2024).
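The following sketch shows, under simplifying assumptions, how a retrieval-augmented generation (RAG) loop could ground answers in course material to reduce hallucination; the keyword-overlap retriever and the document snippets are toy placeholders rather than a production RAG pipeline.

```python
# Sketch: a minimal retrieval-augmented prompt that restricts answers to retrieved
# course material, as suggested above for improving reliability.

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the query and return the top k."""
    q_words = set(query.lower().split())
    ranked = sorted(documents, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:k]

def build_rag_prompt(query: str, documents: list[str]) -> str:
    context = "\n".join(f"- {d}" for d in retrieve(query, documents))
    return (
        "Answer using ONLY the course material below; say 'not covered' otherwise.\n"
        f"Course material:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

if __name__ == "__main__":
    docs = [
        "Photosynthesis converts light energy into chemical energy in chloroplasts.",
        "Mitosis is the process of cell division producing two identical cells.",
    ]
    print(build_rag_prompt("Where does photosynthesis take place?", docs))
```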

4.3. Transparency and Accountability

LLMs, by design, run as black-box mechanisms, so they come with transparency and accountability concerns. Milano et al. (Milano et al., 2023a) and BaHammam et al. (BaHammam et al., 2023) raise several challenges that LLMs pose to higher education, including plagiarism, inaccurate reporting, and cheating in exams, as well as several other operational, financial, and pedagogical issues. Considering students' use of generative AI in homework or exams, MacNeil et al. (MacNeil et al., 2024) discuss the impact on traditional assessment methods and argue that educators should develop new assessment frameworks that take the usage of ChatGPT-like tools into account. Zhou et al. (Zhou et al., 2024a) bring up academic integrity concerns that specifically confuse teachers and students and call for a rethinking of policy-making. As a concrete measure, Gao et al. (Gao et al., 2024a) introduce a novel concept called mixcase, representing a hybrid text form involving both machine-generated and human-generated text, and develop detectors that can distinguish human and machine texts. To tackle LLM ethical concerns regarding intellectual property violation, Huang et al. (Huang and Chang, 2023) propose incorporating citations while training LLMs, which could help enhance content transparency and verifiability. Finlayson et al. (Finlayson et al., 2024) develop a systematic framework to efficiently discover an LLM's hidden size, obtain full-vocabulary outputs, and detect and disambiguate different model updates, which could help users hold providers accountable by tracking model changes, thus enhancing accountability.

4.4. Privacy and Security

Privacy and security protection have become increasingly important topics with the rise of LLMs, especially in the education sector where they deserve heightened scrutiny. Latham et al. (Latham and Goltz, 2019) conducted a case study to explore the general public’s perceptions of AI in education, revealing that while research has largely focused on the effectiveness of AI, critical areas such as learner awareness and acceptance of tracking and profiling algorithms remain underexplored. This underscores the need for more research into the ethical and legal aspects of AI in education. Das et al. (Das et al., 2024) conducted an extensive review on the challenges of protecting personally identifiable information in the context of LLM use, highlighting widespread security and privacy concerns. Shoaib et al. (Shoaib et al., 2023) addressed the threats to personal privacy posed by deepfake content, proposing solutions like the use of detection algorithms and the implementation of standard protocols to bolster protection. Ke et al. (Ke et al., 2024) raised concerns about data privacy and the ethical implications of employing LLMs in psychological research, emphasizing the importance of safeguarding participant privacy in research projects. This highlights the necessity for researchers to understand the limitations of LLMs, adhere to ethical standards, and consider the potential consequences of their use. Suraworachet et al. (Suraworachet et al., 2024) provided a comparative analysis of student information disclosure using LLMs versus traditional methods. Their findings point to challenges in valid evaluation, respecting privacy, and the absence of meaningful interactions in using LLMs to assess student performance. In terms of mitigation strategies, Hicke et al. (Hicke et al., 2023) suggested frameworks that combine Retrieval-Augmented Generation (RAG) and fine-tuning techniques for enhanced privacy protection. Meanwhile, Masikisiki et al. (Masikisiki et al., 2023) highlighted the significance of offering users the option to delete their interactions, emphasizing the importance of user control over personal data.
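As one concrete mitigation along the lines discussed above, the minimal Python sketch below redacts common categories of personally identifiable information from student text before it would ever be sent to an external LLM service; the regular expressions and the student-ID format are illustrative assumptions rather than a vetted PII detector.

```python
import re

# Illustrative patterns only (assumptions, not from the surveyed works); a real
# deployment would rely on a vetted PII-detection library and local policies.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-. ]?\d{3}[-. ]?\d{4}\b"),
    "STUDENT_ID": re.compile(r"\bS\d{7}\b"),  # hypothetical student-ID format
}

def redact_pii(text: str) -> str:
    """Replace recognizable PII spans with category placeholders."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

if __name__ == "__main__":
    submission = "I am S1234567, reach me at jane@example.edu or 555-123-4567."
    # Only the redacted text would be forwarded to an external LLM service.
    print(redact_pii(submission))
```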

4.5. Over-Reliance on LLMs

Given the impressive generative ability of LLMs, there is great concern that students may blindly rely on them for most of their work, eroding their ability to think independently. Milano et al. (Milano et al., 2023a) discuss the problem of overreliance on ChatGPT-like applications, which students may use to compose essays and academic publications without improving the writing skills that are essential to cultivating critical thinking. The concern may be even more acute for foreign-language learners or educationally disadvantaged students, for whom less emphasis may already be placed on learning how to craft well-written texts. Krupp et al. (Krupp et al., 2023) discuss the challenges of overreliance on LLMs in education and propose moderated-usage approaches to mitigate such effects. Similarly, Zuber et al. (Zuber and Gogoll, 2023) discuss the risk that overreliance poses to democracy and suggest cultivating thinking skills in children, fostering coherent thought formulation, and distinguishing between machine-generated output and genuine human reasoning; they argue that LLMs should be used to augment, not substitute for, human thinking capacities. Adewumi et al. (Adewumi et al., 2023) likewise present scenarios in which students rely on LLMs for essay writing rather than writing their own, and demonstrate that a probing chain-of-thought tool can substantially stimulate critical thinking when students work alongside LLMs.
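To illustrate how probing-style prompting can keep the thinking on the student’s side, the short sketch below composes a Socratic prompt that asks the model for questions instead of a rewritten draft; the prompt wording is our own illustrative assumption, not the ProCoT implementation of Adewumi et al. (2023).

```python
def build_probing_prompt(student_draft: str, num_questions: int = 3) -> str:
    """Compose a probing, Socratic-style prompt that asks for questions
    instead of a rewritten essay, so the student still does the thinking."""
    return (
        "You are a tutor who never writes text for the student.\n"
        f"Read the draft below and ask {num_questions} probing questions that "
        "push the student to justify claims, consider counter-arguments, and "
        "cite evidence. Do not revise or complete the draft.\n\n"
        f"Student draft:\n{student_draft}"
    )

if __name__ == "__main__":
    draft = "Social media is bad for teenagers because everyone says so."
    # The resulting prompt would be sent to any chat-completion endpoint.
    print(build_probing_prompt(draft))
```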

5. Future Directions

Here, we discuss future opportunities for LLMs in education and summarize the most promising directions in Figure 3. For each direction, we discuss the potential application of advanced LLM-based techniques and summarize their expected impact on the future of education.

Figure 3. Future directions for LLMs in education.

5.1. Pedagogical Interest Aligned LLMs

Although advanced LLMs like GPT-4 have demonstrated promising performance in experimental educational applications, applying LLMs directly to real-world instruction is still challenging, as delivering high-quality education is a complicated task that involves multi-disciplinary knowledge and administrative constraints (Milano et al., 2023b). To address these issues, future researchers could leverage advanced generation techniques such as Retrieval-Augmented Generation (RAG) (Gao et al., 2023b) to supply LLMs with the necessary prior information and guide them toward pedagogically aligned outputs. Beyond that, collecting large-scale pedagogical instruction datasets from real-world teaching scenarios and fine-tuning existing LLMs to align with human instructors’ behavior is another interesting direction for future research. By learning from the preferences of human instructors, LLMs could encode pedagogical constraints and knowledge patterns within their parameter space and generate pedagogically aligned results without heavy reliance on external information.
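To make this direction concrete, the following minimal Python sketch illustrates the retrieve-then-prompt pattern under simplified assumptions: curriculum and policy passages are ranked by plain word overlap (a stand-in for a dense retriever), and the top passages are prepended to the tutoring prompt. The toy passages, scoring rule, and prompt wording are all illustrative and not drawn from any surveyed system.

```python
from typing import List

# Toy curriculum store; a real system would index syllabus documents,
# rubrics, and school policies with a dense retriever.
CURRICULUM = [
    "Grade 7 algebra: solve one-step linear equations using inverse operations.",
    "Assessment policy: hints may be given, but full solutions must not be revealed.",
    "Grade 7 geometry: compute the area of triangles and rectangles.",
]

def retrieve(query: str, store: List[str], k: int = 2) -> List[str]:
    """Rank passages by word overlap with the query (stand-in for a retriever)."""
    q = set(query.lower().split())
    scored = sorted(store, key=lambda p: len(q & set(p.lower().split())), reverse=True)
    return scored[:k]

def build_rag_prompt(question: str) -> str:
    """Prepend retrieved pedagogical context to the generation prompt."""
    context = "\n".join(f"- {p}" for p in retrieve(question, CURRICULUM))
    return (
        "Follow the curriculum and policy passages below when tutoring.\n"
        f"{context}\n\n"
        f"Student question: {question}\n"
        "Respond with a pedagogically appropriate hint."
    )

if __name__ == "__main__":
    # The assembled prompt would be passed to any instruction-tuned LLM.
    print(build_rag_prompt("How do I solve the linear equation 3x + 4 = 10?"))
```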

5.2. LLM-Multi-Agent Education Systems

The broad use of LLMs for language comprehension, reasoning, planning, and programming has inspired works like AutoGen (Wu et al., 2023a), which develops a collaboration framework in which multiple LLMs solve complicated tasks through conversation-style procedures. Problems in education commonly involve multi-step processing logic, making them a good fit for such multi-agent LLM systems. The recent work by Yang et al. (2024) has demonstrated the great potential of the multi-agent framework for grading tasks: a human-like grading procedure is achieved by leveraging multiple LLM-based grader agents and critic agents, and discrepancies among individual judges are corrected through a group-discussion procedure. Future research in this direction could include more types of LLM-based agents, whose roles may range from specific command executors to high-level planners. More importantly, human instructors, viewed as special agents in the system, could also interact with the LLM agents directly and flexibly provide any necessary interventions.
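The grader-plus-critic pattern described above can be sketched, under strong simplifications, as follows: several independent grader functions score an answer, and a critic step flags disagreement and triggers a “discussion” round, simulated here by a robust median aggregate. The scoring functions and threshold are hypothetical placeholders, not the procedure of Yang et al. (2024) or AutoGen.

```python
import statistics
from typing import Callable, List

# Placeholder graders; in a real system each would be an LLM agent prompted
# with the rubric and the student answer.
def grader_strict(answer: str) -> float:
    return 0.6 if "photosynthesis" in answer.lower() else 0.2

def grader_lenient(answer: str) -> float:
    return 0.9 if "light" in answer.lower() else 0.4

def grader_rubric(answer: str) -> float:
    return 0.7  # ignores the answer; stands in for a rubric-focused agent

def grade_with_critic(answer: str, graders: List[Callable[[str], float]],
                      tolerance: float = 0.2) -> float:
    """Aggregate grader scores; the critic flags large disagreement."""
    scores = [g(answer) for g in graders]
    if max(scores) - min(scores) > tolerance:
        # Critic detects disagreement and triggers a "discussion" round; here
        # the discussion is simulated by a robust aggregate (the median), while
        # an LLM-based system would exchange justifications between agents.
        return statistics.median(scores)
    return sum(scores) / len(scores)

if __name__ == "__main__":
    answer = "Plants use light energy in photosynthesis to make glucose."
    print(grade_with_critic(answer, [grader_strict, grader_lenient, grader_rubric]))
```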

5.3. Multimodal and Multilingual Support

The high similarity between different human languages naturally enables LLMs to support multilingual tasks effectively. In addition, recent findings on the alignment between multimodal and language tokens (Wu et al., 2023b) extend LLMs beyond textual analysis into multimodal learning analytics. By accepting diverse inputs, LLMs can take advantage of the mutual information between different data sources and provide advanced support for challenging tasks in education. For future work in the multimodal direction, more attention could be placed on developing LLMs capable of interpreting and integrating these varied data sources, offering more nuanced insights into student engagement, comprehension, and learning styles. Such advances could pave the way for highly personalized and adaptive learning experiences tailored to the unique needs of each student. On the other hand, multilingual LLMs provide convenient access to quality global education resources for every individual in the language they are most proficient in. By developing robust models that not only translate but also understand cultural nuances, colloquial expressions, and regional educational standards, research in this direction would help learners around the world benefit from LLMs in their native languages and significantly improve equity and inclusion in global education.

5.4. Edge Computing and Efficiency

Integrating LLMs with edge computing presents a promising avenue for enhancing the efficiency and accessibility of educational technologies. By processing data closer to the end user, edge computing can reduce latency, increase content delivery speed, and enable offline access to educational resources. Future efforts could explore optimizing LLMs for edge deployment, focusing on lightweight models that maintain high performance while minimizing computational resources; this would be particularly beneficial in areas with limited internet connectivity, ensuring equitable access to educational tools. Additionally, processing data locally reduces the need to transmit sensitive information over the Internet, enhancing privacy and security. Edge computing could therefore serve as a framework for utilizing LLMs while adhering to stringent data-protection standards.
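As a rough sketch of fully local inference (assuming the Hugging Face transformers library and a small, locally cached model such as distilgpt2, chosen purely for illustration), the snippet below answers a prompt on-device so that no student data has to leave the machine; a production edge deployment would use a stronger compressed or quantized model.

```python
# Requires: pip install transformers torch
from transformers import pipeline

# A small model that fits on commodity hardware; once downloaded it can be
# cached and served fully offline, so prompts never leave the device.
generator = pipeline("text-generation", model="distilgpt2")

def local_answer(prompt: str, max_new_tokens: int = 60) -> str:
    """Generate a continuation entirely on the local device."""
    out = generator(prompt, max_new_tokens=max_new_tokens, do_sample=False)
    return out[0]["generated_text"]

if __name__ == "__main__":
    print(local_answer("Explain what a fraction is to a ten-year-old:"))
```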

5.5. Efficient Training of Specialized Models

The development of specialized LLMs tailored to specific educational domains or subjects represents a significant opportunity for future research. This direction involves creating models that not only possess general language understanding but also deep knowledge of fields such as mathematics, science, or literature. Specialized LLMs could achieve a deep understanding of specific subjects, offering insights and support that are highly relevant and accurate while also being more cost-effective. The challenge lies in the efficient training of these models, which requires innovations in data collection, model architecture, and training methodology. Specialized models could offer more accurate and contextually relevant help, improving the educational experience for both students and educators.
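One widely used family of efficiency techniques that could serve this direction is parameter-efficient fine-tuning; the sketch below attaches LoRA adapters to a small causal language model with the Hugging Face peft library, where the base model and hyperparameters are placeholders chosen for illustration rather than recommended settings.

```python
# Requires: pip install transformers peft torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "distilgpt2"  # placeholder; a stronger open model would be used in practice
tokenizer = AutoTokenizer.from_pretrained(base)  # needed for the later training loop
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA adapters train only a small fraction of parameters, keeping the cost of
# building a subject-specialized model (e.g., for algebra) manageable.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# The adapted model would then be trained with a standard Trainer loop on a
# curated corpus of subject-specific explanations and worked examples.
```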

5.6. Ethical and Privacy Considerations

Ethical and privacy considerations take center stage as LLMs become increasingly integrated into educational settings. Future research must address the responsible use of LLMs, including issues related to data security, student privacy, and bias mitigation. The development of frameworks and guidelines for ethical LLM deployment in education is crucial. This includes ensuring transparency in model training processes, safeguarding sensitive information, and creating inclusive models that reflect the diversity of the student population. Addressing these considerations is essential to build trust and ensure the responsible use of LLMs in education.

6. Conclusion

The rapid development of LLMs has revolutionized education. In this survey, we provide a comprehensive review of LLMs applied to various educational scenarios through a multifaceted taxonomy, including student and teacher assistance, adaptive learning, and miscellaneous tools. In addition, we summarize the related datasets and benchmarks, as well as current challenges and future directions. We hope that our survey can facilitate and inspire more innovative work on LLMs for education.

References

  • Que (2024) 2024. QuestionWell. https://www.questionwell.org/
  • Abdelrahman et al. (2023) Ghodai Abdelrahman, Qing Wang, and Bernardo Nunes. 2023. Knowledge tracing: A survey. Comput. Surveys 55, 11 (2023), 1–37.
  • Abu-Rasheed et al. (2024) Hasan Abu-Rasheed, Christian Weber, and Madjid Fathi. 2024. Knowledge Graphs as Context Sources for LLM-Based Explanations of Learning Recommendations. arXiv preprint arXiv:2403.03008 (2024).
  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
  • Adewumi et al. (2023) Tosin Adewumi, Lama Alkhaled, Claudia Buck, Sergio Hernandez, Saga Brilioth, Mkpe Kekung, Yelvin Ragimov, and Elisa Barney. 2023. ProCoT: Stimulating Critical Thinking and Writing of Students through Engagement with Large Language Models (LLMs). arXiv:2312.09801 [cs.CL]
  • AI (2024a) Formative AI. 2024a. Formative AI. https://www.formative.com/
  • AI (2024b) Quizizz AI. 2024b. Quizizz AI. https://quizizz.com/home/quizizz-ai
  • Amini et al. (2019) Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319 (2019).
  • BaHammam et al. (2023) Ahmed S. BaHammam, Khaled Trabelsi, Seithikurippu R. Pandi-Perumal, and Hiatham Jahrami. 2023. Adapting to the Impact of AI in Scientific Writing: Balancing Benefits and Drawbacks while Developing Policies and Regulations. arXiv:2306.06699 [q-bio.OT]
  • Baidoo-Anu and Owusu Ansah (2023) David Baidoo-Anu and Leticia Owusu Ansah. 2023. Education in the era of generative artificial intelligence (AI): Understanding the potential benefits of ChatGPT in promoting teaching and learning. Available at SSRN 4337484 (2023).
  • Balse et al. (2023) Rishabh Balse, Viraj Kumar, Prajish Prasad, and Jayakrishnan Madathil Warriem. 2023. Evaluating the Quality of LLM-Generated Explanations for Logical Errors in CS1 Student Programs. In Proceedings of the 16th Annual ACM India Compute Conference. 49–54.
  • Batool et al. (2023) Saba Batool, Junaid Rashid, Muhammad Wasif Nisar, Jungeun Kim, Hyuk-Yoon Kwon, and Amir Hussain. 2023. Educational data mining to predict students’ academic performance: A survey study. Education and Information Technologies 28, 1 (2023), 905–971.
  • Bitew et al. (2022) Semere Kiros Bitew, Amir Hadifar, Lucas Sterckx, Johannes Deleu, Chris Develder, and Thomas Demeester. 2022. Learning to reuse distractors to support multiple choice question generation in education. IEEE Transactions on Learning Technologies (2022).
  • Blanchard et al. (2013) Daniel Blanchard, Joel Tetreault, Derrick Higgins, Aoife Cahill, and Martin Chodorow. 2013. TOEFL11: A corpus of non-native English. ETS Research Report Series 2013, 2 (2013), i–15.
  • Bommarito II and Katz (2022) Michael Bommarito II and Daniel Martin Katz. 2022. GPT takes the bar exam. arXiv preprint arXiv:2212.14402 (2022).
  • Brooke et al. (2015) Julian Brooke, Adam Hammond, and Graeme Hirst. 2015. GutenTag: an NLP-driven tool for digital humanities research in the Project Gutenberg corpus. In Proceedings of the Fourth Workshop on Computational Linguistics for Literature. 42–47.
  • Bryant et al. (2019) Christopher Bryant, Mariano Felice, Øistein E Andersen, and Ted Briscoe. 2019. The BEA-2019 shared task on grammatical error correction. In Proceedings of the fourteenth workshop on innovative use of NLP for building educational applications. 52–75.
  • Caliskan and Zhu ([n. d.]) Chahat Raj, Anjishnu Mukherjee, Aylin Caliskan, Antonios Anastasopoulos, and Ziwei Zhu. [n. d.]. A Psychological View to Social Bias in LLMs: Evaluation and Mitigation. ([n. d.]).
  • Chen et al. (2018) Guanliang Chen, Jie Yang, Claudia Hauff, and Geert-Jan Houben. 2018. LearningQ: a large-scale dataset for educational question generation. In Proceedings of the international AAAI conference on web and social media, Vol. 12.
  • Chen et al. (2020) Lijia Chen, Pingping Chen, and Zhijian Lin. 2020. Artificial intelligence in education: A review. IEEE Access 8 (2020), 75264–75278.
  • Chen et al. (2022) Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. 2022. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588 (2022).
  • Chen et al. (2023b) Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. 2023b. TheoremQA: A theorem-driven question answering dataset. In The 2023 Conference on Empirical Methods in Natural Language Processing.
  • Chen et al. (2023a) Zhikai Chen, Haitao Mao, Hang Li, Wei Jin, Hongzhi Wen, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Wenqi Fan, Hui Liu, et al. 2023a. Exploring the potential of large language models (LLMs) in learning on graphs. arXiv preprint arXiv:2307.03393 (2023).
  • Cheng et al. (2024) Jeffrey Cheng, Marc Marone, Orion Weller, Dawn Lawrie, Daniel Khashabi, and Benjamin Van Durme. 2024. Dated Data: Tracing Knowledge Cutoffs in Large Language Models. arXiv:2403.12958 [cs.CL]
  • Chhikara et al. (2024) Garima Chhikara, Anurag Sharma, Kripabandhu Ghosh, and Abhijnan Chakraborty. 2024. Few-Shot Fairness: Unveiling LLM’s Potential for Fairness-Aware Classification. arXiv:2402.18502 [cs.CL]
  • Chiu et al. (2023) Thomas KF Chiu, Qi Xia, Xinyan Zhou, Ching Sing Chai, and Miaoting Cheng. 2023. Systematic literature review on opportunities, challenges, and future research recommendations of artificial intelligence in education. Computers and Education: Artificial Intelligence 4 (2023), 100118.
  • Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457 (2018).
  • Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021).
  • Conijn et al. (2016) Rianne Conijn, Chris Snijders, Ad Kleingeld, and Uwe Matzat. 2016. Predicting student performance from LMS data: A comparison of 17 blended courses using Moodle LMS. IEEE Transactions on Learning Technologies 10, 1 (2016), 17–29.
  • Conker (2024) Conker. 2024. Conker. https://www.conker.ai/
  • Cooper et al. (2020) Grant Cooper, Amanda Berry, and James Baglin. 2020. Demographic predictors of students’ science participation over the age of 16: An Australian case study. Research in Science Education 50, 1 (2020), 361–373.
  • Copilot (2024) Education Copilot. 2024. Education Copilot. https://educationcopilot.com/
  • Copy.ai (2024) Copy.ai. 2024. Copy.ai. https://www.copy.ai/
  • Cotet et al. (2020) Teodor-Mihai Cotet, Stefan Ruseti, and Mihai Dascalu. 2020. Neural grammatical error correction for Romanian. In 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI). IEEE, 625–631.
  • Cui et al. (2023) Jiaxi Cui, Zongjian Li, Yang Yan, Bohua Chen, and Li Yuan. 2023. Chatlaw: Open-source legal large language model with integrated external knowledge bases. arXiv preprint arXiv:2306.16092 (2023).
  • Cui and Zhang (2011) Xiliang Cui and Bao-lin Zhang. 2011. The principles for building the “international corpus of learner Chinese”. Applied Linguistics 2 (2011), 100–108.
  • Curipod (2024) Curipod. 2024. Curipod. http://www.curipod.com/ai
  • Das et al. (2024) Badhan Chandra Das, M. Hadi Amini, and Yanzhao Wu. 2024. Security and Privacy Challenges of Large Language Models: A Survey. arXiv:2402.00888 [cs.CL]
  • Davidson et al. (2020) Sam Davidson, Aaron Yamada, Paloma Fernandez Mira, Agustina Carando, Claudia H Sanchez Gutierrez, and Kenji Sagae. 2020. Developing NLP tools with a new corpus of learner Spanish. In Proceedings of the Twelfth Language Resources and Evaluation Conference. 7238–7243.
  • Denny et al. (2024) Paul Denny, James Prather, Brett A Becker, James Finnie-Ansley, Arto Hellas, Juho Leinonen, Andrew Luxton-Reilly, Brent N Reeves, Eddie Antonio Santos, and Sami Sarsa. 2024. Computing education in the era of generative AI. Commun. ACM 67, 2 (2024), 56–67.
  • Dewhurst et al. (2000) David G Dewhurst, Hamish A MacLeod, and Tracey AM Norris. 2000. Independent student learning aided by computers: an acceptable alternative to lectures? Computers & Education 35, 3 (2000), 223–241.
  • Diffit (2024) Diffit. 2024. Diffit. https://beta.diffit.me/
  • Do Viet and Markov (2023) Tung Do Viet and Konstantin Markov. 2023. Using Large Language Models for Bug Localization and Fixing. In 2023 12th International Conference on Awareness Science and Technology (iCAST). IEEE, 192–197.
  • Doughty et al. (2024) Jacob Doughty, Zipiao Wan, Anishka Bompelli, Jubahed Qayum, Taozhi Wang, Juran Zhang, Yujia Zheng, Aidan Doyle, Pragnya Sridhar, Arav Agarwal, et al. 2024. A comparative study of AI-generated (GPT-4) and human-crafted MCQs in programming education. In Proceedings of the 26th Australasian Computing Education Conference. 114–123.
  • Du et al. (2023) Hanyue Du, Yike Zhao, Qingyuan Tian, Jiani Wang, Lei Wang, Yunshi Lan, and Xuesong Lu. 2023. FlaCGEC: A Chinese Grammatical Error Correction Dataset with Fine-grained Linguistic Annotation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 5321–5325.
  • Eduaide.ai (2024) Eduaide.ai. 2024. Eduaide.ai. https://www.eduaide.ai/
  • Fan et al. (2023b) Wenqi Fan, Zihuai Zhao, Jiatong Li, Yunqing Liu, Xiaowei Mei, Yiqi Wang, Jiliang Tang, and Qing Li. 2023b. Recommender systems in the era of large language models (LLMs). arXiv preprint arXiv:2307.02046 (2023).
  • Fan et al. (2023a) Yaxin Fan, Feng Jiang, Peifeng Li, and Haizhou Li. 2023a. GrammarGPT: Exploring open-source LLMs for native Chinese grammatical error correction with supervised fine-tuning. In CCF International Conference on Natural Language Processing and Chinese Computing. Springer, 69–80.
  • Fenu et al. (2022) Gianni Fenu, Roberta Galici, and Mirko Marras. 2022. Experts’ view on challenges and needs for fairness in artificial intelligence for education. In International Conference on Artificial Intelligence in Education. Springer, 243–255.
  • Finlayson et al. (2024) Matthew Finlayson, Xiang Ren, and Swabha Swayamdipta. 2024. Logits of API-Protected LLMs Leak Proprietary Information. arXiv:2403.09539 [cs.CL]
  • Funayama et al. (2023) Hiroaki Funayama, Yuya Asazuma, Yuichiroh Matsubayashi, Tomoya Mizumoto, and Kentaro Inui. 2023. Reducing the cost: Cross-prompt pre-finetuning for short answer scoring. In International conference on artificial intelligence in education. Springer, 78–89.
  • Gao et al. (2024a) Chujie Gao, Dongping Chen, Qihui Zhang, Yue Huang, Yao Wan, and Lichao Sun. 2024a. LLM-as-a-Coauthor: The Challenges of Detecting LLM-Human Mixcase. arXiv:2401.05952 [cs.CL]
  • Gao et al. (2023a) Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023a. PAL: Program-aided language models. In International Conference on Machine Learning. PMLR, 10764–10799.
  • Gao et al. (2024b) Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, and Haofen Wang. 2024b. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997 [cs.CL]
  • Gao et al. (2023b) Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. 2023b. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997 (2023).
  • Genie (2024) Parlay Genie. 2024. Parlay Genie. https://parlayideas.com/
  • Gong et al. (2022) Huanli Gong, Liangming Pan, and Hengchang Hu. 2022. Khanq: A dataset for generating deep questions in education. In Proceedings of the 29th International Conference on Computational Linguistics. 5925–5938.
  • Google (2024) Google. 2024. Google Bard. https://bard.google.com
  • gotFeedback (2024) gotFeedback. 2024. gotFeedback. https://www.gotlearning.com/gotfeedback/
  • Grammarly (2024) Grammarly. 2024. Grammarly. https://app.grammarly.com/
  • Grundkiewicz and Junczys-Dowmunt (2014) Roman Grundkiewicz and Marcin Junczys-Dowmunt. 2014. The wiked error corpus: A corpus of corrective wikipedia edits and its application to grammatical error correction. In Advances in Natural Language Processing: 9th International Conference on NLP, PolTAL 2014, Warsaw, Poland, September 17-19, 2014. Proceedings 9. Springer, 478–490.
  • Guo et al. (2024) Qi Guo, Junming Cao, Xiaofei Xie, Shangqing Liu, Xiaohong Li, Bihuan Chen, and Xin Peng. 2024. Exploring the potential of chatgpt in automated code refinement: An empirical study. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–13.
  • Hadifar et al. (2023) Amir Hadifar, Semere Kiros Bitew, Johannes Deleu, Chris Develder, and Thomas Demeester. 2023. Eduqg: A multi-format multiple-choice dataset for the educational domain. IEEE Access 11 (2023), 20885–20896.
  • Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300 (2020).
  • Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874 (2021).
  • Hicke et al. (2023) Yann Hicke, Anmol Agarwal, Qianou Ma, and Paul Denny. 2023. AI-TA: Towards an Intelligent Question-Answer Teaching Assistant using Open-Source LLMs. arXiv:2311.02775 [cs.LG]
  • Huang et al. (2016) Danqing Huang, Shuming Shi, Chin-Yew Lin, Jian Yin, and Wei-Ying Ma. 2016. How well do computers solve math word problems? large-scale dataset construction and evaluation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 887–896.
  • Huang and Chang (2023) Jie Huang and Kevin Chen-Chuan Chang. 2023. Citation: A Key to Building Responsible and Accountable Large Language Models. arXiv:2307.02185 [cs.CL]
  • Huang et al. (2024) Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Yao Fu, et al. 2024. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. Advances in Neural Information Processing Systems 36 (2024).
  • Jeon and Lee (2023) Jaeho Jeon and Seongyong Lee. 2023. Large language models in education: A focus on the complementary relationship between human teachers and ChatGPT. Education and Information Technologies (2023), 1–20.
  • Ji et al. (2023) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. Comput. Surveys 55, 12 (2023), 1–38.
  • Jin et al. (2021) Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2021. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences 11, 14 (2021), 6421.
  • Jin et al. (2024) Ming Jin, Yifan Zhang, Wei Chen, Kexin Zhang, Yuxuan Liang, Bin Yang, Jindong Wang, Shirui Pan, and Qingsong Wen. 2024. Position Paper: What Can Large Language Models Tell Us about Time Series Analysis. arXiv preprint arXiv:2402.02713 (2024).
  • Jury et al. (2024) Breanna Jury, Angela Lorusso, Juho Leinonen, Paul Denny, and Andrew Luxton-Reilly. 2024. Evaluating LLM-generated Worked Examples in an Introductory Programming Course. In Proceedings of the 26th Australasian Computing Education Conference. 77–86.
  • Just et al. (2014) René Just, Darioush Jalali, and Michael D Ernst. 2014. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 2014 international symposium on software testing and analysis. 437–440.
  • Kabir and Lin (2023) Md Rayhan Kabir and Fuhua Lin. 2023. An LLM-Powered Adaptive Practicing System. In AIED 2023 workshop on Empowering Education with LLMs-the Next-Gen Interface and Content Generation, AIED.
  • Kamalov and Gurrib (2023) Firuz Kamalov and Ikhlaas Gurrib. 2023. A new era of artificial intelligence in education: A multifaceted revolution. arXiv preprint arXiv:2305.18303 (2023).
  • Kasneci et al. (2023) Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. 2023. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and individual differences 103 (2023), 102274.
  • Kazemitabaar et al. (2023) Majeed Kazemitabaar, Xinying Hou, Austin Henley, Barbara Jane Ericson, David Weintrop, and Tovi Grossman. 2023. How novices use LLM-based code generators to solve CS1 coding tasks in a self-paced learning environment. In Proceedings of the 23rd Koli Calling International Conference on Computing Education Research. 1–12.
  • Ke et al. (2024) Luoma Ke, Song Tong, Peng Cheng, and Kaiping Peng. 2024. Exploring the Frontiers of LLMs in Psychological Applications: A Comprehensive Review. arXiv:2401.01519 [cs.LG]
  • Kembhavi et al. (2016) Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. 2016. A diagram is worth a dozen images. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14. Springer, 235–251.
  • Khanmigo (2024) Khanmigo. 2024. Khanmigo. https://blog.khanacademy.org/teacher-khanmigo/
  • Kim et al. (2018) Daesik Kim, Seonhoon Kim, and Nojun Kwak. 2018. Textbook question answering with multi-modal context graph understanding and self-supervised open-set comprehension. arXiv preprint arXiv:1811.00232 (2018).
  • Koedinger et al. (2015) Kenneth R Koedinger, Sidney D’Mello, Elizabeth A McLaughlin, Zachary A Pardos, and Carolyn P Rosé. 2015. Data mining and education. Wiley Interdisciplinary Reviews: Cognitive Science 6, 4 (2015), 333–353.
  • König et al. (2023) Claudia M König, Christin Karrenbauer, and Michael H Breitner. 2023. Critical success factors and challenges for individual digital study assistants in higher education: A mixed methods analysis. Education and Information Technologies 28, 4 (2023), 4475–4503.
  • Koraishi (2023) Osama Koraishi. 2023. Teaching English in the age of AI: Embracing ChatGPT to optimize EFL materials and assessment. Language Education and Technology 3, 1 (2023).
  • Krupp et al. (2023) Lars Krupp, Steffen Steinert, Maximilian Kiefer-Emmanouilidis, Karina E. Avila, Paul Lukowicz, Jochen Kuhn, Stefan Küchemann, and Jakob Karolus. 2023. Challenges and Opportunities of Moderating Usage of Large Language Models in Education. arXiv:2312.14969 [cs.HC]
  • Kučak et al. (2018) Danijel Kučak, Vedran Juričić, and Goran Đambić. 2018. MACHINE LEARNING IN EDUCATION-A SURVEY OF CURRENT RESEARCH TRENDS. Annals of DAAAM & Proceedings 29 (2018).
  • Kuo et al. (2023) Bor-Chen Kuo, Frederic TY Chang, and Zong-En Bai. 2023. Leveraging LLMs for Adaptive Testing and Learning in Taiwan Adaptive Learning Platform (TALP). (2023).
  • Lai et al. (2017) Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683 (2017).
  • Latham and Goltz (2019) A. Latham and S. Goltz. 2019. A Survey of the General Public’s Views on the Ethics of Using AI in Education. In Artificial Intelligence in Education (Lecture Notes in Computer Science, Vol. 11625), S. Isotani, E. Millán, A. Ogan, P. Hastings, B. McLaren, and R. Luckin (Eds.). Springer, Cham. https://doi.org/10.1007/978-3-030-23204-7_17
  • Latif et al. (2023) Ehsan Latif, Gengchen Mai, Matthew Nyaaba, Xuansheng Wu, Ninghao Liu, Guoyu Lu, Sheng Li, Tianming Liu, and Xiaoming Zhai. 2023. Artificial general intelligence (AGI) for education. arXiv preprint arXiv:2304.12479 (2023).
  • Le Goues et al. (2015) Claire Le Goues, Neal Holtschulte, Edward K Smith, Yuriy Brun, Premkumar Devanbu, Stephanie Forrest, and Westley Weimer. 2015. The ManyBugs and IntroClass benchmarks for automated repair of C programs. IEEE Transactions on Software Engineering 41, 12 (2015), 1236–1256.
  • Lee et al. (2023a) Unggi Lee, Haewon Jung, Younghoon Jeon, Younghoon Sohn, Wonhee Hwang, Jewoong Moon, and Hyeoncheol Kim. 2023a. Few-shot is enough: exploring ChatGPT prompt engineering method for automatic question generation in english education. Education and Information Technologies (2023), 1–33.
  • Lee et al. (2023b) Unggi Lee, Sungjun Yoon, Joon Seo Yun, Kyoungsoo Park, YoungHoon Jung, Damji Stratton, and Hyeoncheol Kim. 2023b. Difficulty-Focused Contrastive Learning for Knowledge Tracing with a Large Language Model-Based Difficulty Prediction. arXiv preprint arXiv:2312.11890 (2023).
  • Leiker et al. (2023) Daniel Leiker, Sara Finnigan, Ashley Ricker Gyllen, and Mutlu Cukurova. 2023. Prototyping the use of Large Language Models (LLMs) for adult learning content creation at scale. arXiv preprint arXiv:2306.01815 (2023).
  • Li et al. (2024b) Hai Li, Chenglu Li, Wanli Xing, Sami Baral, and Neil Heffernan. 2024b. Automated Feedback for Student Math Responses Based on Multi-Modality and Fine-Tuning. In Proceedings of the 14th Learning Analytics and Knowledge Conference. 763–770.
  • Li et al. (2024d) Hang Li, Tianlong Xu, Chaoli Zhang, Eason Chen, Jing Liang, Xing Fan, Haoyang Li, Jiliang Tang, and Qingsong Wen. 2024d. Bringing Generative AI to Adaptive Learning in Education. arXiv preprint arXiv:2402.14601 (2024).
  • Li et al. (2023b) Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. 2023b. Cmmlu: Measuring massive multitask language understanding in chinese. arXiv preprint arXiv:2306.09212 (2023).
  • Li et al. (2024c) Jingling Li, Zeyu Tang, Xiaoyu Liu, Peter Spirtes, Kun Zhang, Liu Leqi, and Yang Liu. 2024c. Steering LLMs Towards Unbiased Responses: A Causality-Guided Debiasing Framework. arXiv:2403.08743 [cs.CL]
  • Li et al. (2023a) Qingyao Li, Lingyue Fu, Weiming Zhang, Xianyu Chen, Jingwei Yu, Wei Xia, Weinan Zhang, Ruiming Tang, and Yong Yu. 2023a. Adapting Large Language Models for Education: Foundational Capabilities, Potentials, and Challenges. arXiv preprint arXiv:2401.08664 (2023).
  • Li et al. (2024e) Tianlin Li, Xiaoyu Zhang, Chao Du, Tianyu Pang, Qian Liu, Qing Guo, Chao Shen, and Yang Liu. 2024e. Your Large Language Model is Secretly a Fairness Proponent and You Should Prompt it Like One. arXiv:2402.12150 [cs.CL]
  • Li et al. (2024a) Yingji Li, Mengnan Du, Rui Song, Xin Wang, and Ying Wang. 2024a. A Survey on Fairness in Large Language Models. arXiv:2308.10149 [cs.CL]
  • Li and Zhang (2023) Yunqi Li and Yongfeng Zhang. 2023. Fairness of ChatGPT. arXiv:2305.18569 [cs.LG]
  • Li et al. (2022) Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svyatkovskiy, Shengyu Fu, et al. 2022. Automating code review activities by large-scale pre-training. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1035–1047.
  • Liang et al. (2018) Chen Liang, Xiao Yang, Neisarg Dave, Drew Wham, Bart Pursel, and C Lee Giles. 2018. Distractor generation for multiple choice questions using learning to rank. In Proceedings of the thirteenth workshop on innovative use of NLP for building educational applications. 284–290.
  • Liévin et al. (2023) Valentin Liévin, Christoffer Egeberg Hother, Andreas Geert Motzfeldt, and Ole Winther. 2023. Can large language models reason about medical questions? Patterns (2023).
  • Lin et al. (2017) Derrick Lin, James Koppel, Angela Chen, and Armando Solar-Lezama. 2017. QuixBugs: A multi-lingual program repair benchmark set based on the Quixey Challenge. In Proceedings Companion of the 2017 ACM SIGPLAN international conference on systems, programming, languages, and applications: software for humanity. 55–56.
  • Lin et al. (2023) Yuanguo Lin, Hong Chen, Wei Xia, Fan Lin, Pengcheng Wu, Zongyue Wang, and Yong Li. 2023. A comprehensive survey on deep learning techniques in educational data mining. arXiv preprint arXiv:2309.04761 (2023).
  • Liu et al. (2017) Min Liu, Emily McKelroy, Stephanie B Corliss, and Jamison Carrigan. 2017. Investigating the effect of an adaptive learning intervention on students’ learning. Educational technology research and development 65 (2017), 1605–1625.
  • Liu et al. (2019) Tiaoqiao Liu, Wenbiao Ding, Zhiwei Wang, Jiliang Tang, Gale Yan Huang, and Zitao Liu. 2019. Automatic short answer grading via multiway attention networks. In Artificial Intelligence in Education: 20th International Conference, AIED 2019, Chicago, IL, USA, June 25-29, 2019, Proceedings, Part II 20. Springer, 169–173.
  • Liu et al. (2024) Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, Lifang He, and Lichao Sun. 2024. Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models. arXiv:2402.17177 [cs.CV]
  • Liu et al. (2020) Zi-Yu Liu, Zaffar Ahmed Shaikh, and Farida Gazizova. 2020. Using the Concept of Game-Based Learning in Education. International Journal of Emerging Technologies in Learning (2020).
  • Lu et al. (2021a) Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. 2021a. Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. arXiv preprint arXiv:2105.04165 (2021).
  • Lu et al. (2022) Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark, and Ashwin Kalyan. 2022. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. arXiv preprint arXiv:2209.14610 (2022).
  • Lu et al. (2021b) Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. 2021b. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214 (2021).
  • MacNeil et al. (2024) Stephen MacNeil, Scott Spurlock, and Ian Applebaum. 2024. Imagining Computing Education Assessment after Generative AI. arXiv:2401.04601 [cs.HC]
  • Maghsudi et al. (2021) Setareh Maghsudi, Andrew Lan, Jie Xu, and Mihaela van Der Schaar. 2021. Personalized education in the artificial intelligence era: what to expect next. IEEE Signal Processing Magazine 38, 3 (2021), 37–50.
  • MagicSchool.ai (2024) MagicSchool.ai. 2024. MagicSchool.ai. https://www.magicschool.ai/
  • Malinka et al. (2023) Kamil Malinka, Martin Peresíni, Anton Firc, Ondrej Hujnák, and Filip Janus. 2023. On the educational impact of chatgpt: Is artificial intelligence ready to obtain a university degree?. In Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1. 47–53.
  • Masikisiki et al. (2023) Baphumelele Masikisiki, Vukosi Marivate, and Yvette Hlope. 2023. Investigating the Efficacy of Large Language Models in Reflective Assessment Methods through Chain of Thoughts Prompting. arXiv:2310.00272 [cs.CL]
  • Miao et al. (2021) Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2021. A diverse corpus for evaluating and developing English math word problem solvers. arXiv preprint arXiv:2106.15772 (2021).
  • Microsoft (2024) Microsoft. 2024. Bing Chat. https://learn.microsoft.com/en-us/training/modules/enhance-teaching-learning-bing-chat/
  • Microsoft (2024) Microsoft. 2024. What is responsible AI? Accessed: 2024-02-08.
  • Milano et al. (2023a) S. Milano, J.A. McGrane, and S. Leonelli. 2023a. Large language models challenge the future of higher education. Nature Machine Intelligence 5 (2023), 333–334. https://doi.org/10.1038/s42256-023-00644-2
  • Milano et al. (2023b) Silvia Milano, Joshua A McGrane, and Sabina Leonelli. 2023b. Large language models challenge the future of higher education. Nature Machine Intelligence 5, 4 (2023), 333–334.
  • Mogavi et al. (2023) Reza Hadi Mogavi, Chao Deng, Justin Juho Kim, Pengyuan Zhou, Young D Kwon, Ahmed Hosny Saleh Metwally, Ahmed Tlili, Simone Bassanelli, Antonio Bucchiarone, Sujit Gujar, et al. 2023. Exploring user perspectives on chatgpt: Applications, perceptions, and implications for ai-integrated education. arXiv preprint arXiv:2305.13114 (2023).
  • Morningstar et al. (2017) Mary E Morningstar, Jennifer A Kurth, and Paul E Johnson. 2017. Examining national trends in educational placements for students with significant disabilities. Remedial and Special Education 38, 1 (2017), 3–12.
  • Náplava et al. (2022) Jakub Náplava, Milan Straka, Jana Straková, and Alexandr Rosen. 2022. Czech grammar error correction with a large and diverse corpus. Transactions of the Association for Computational Linguistics 10 (2022), 452–467.
  • Naumov et al. (2019) Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G Azzolini, et al. 2019. Deep learning recommendation model for personalization and recommendation systems. arXiv preprint arXiv:1906.00091 (2019).
  • Ng et al. (2014) Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the eighteenth conference on computational natural language learning: shared task. 1–14.
  • Ni et al. (2023) Lin Ni, Sijie Wang, Zeyu Zhang, Xiaoxuan Li, Xianda Zheng, Paul Denny, and Jiamou Liu. 2023. Enhancing student performance prediction on learnersourced questions with sgnn-llm synergy. arXiv preprint arXiv:2309.13500 (2023).
  • Nolej (2024) Nolej. 2024. Nolej. https://nolej.io/
  • Oketunji et al. (2023) Abiodun Finbarrs Oketunji, Muhammad Anas, and Deepthi Saina. 2023. Large Language Model (LLM) Bias Index–LLMBI. arXiv preprint arXiv:2312.14769 (2023).
  • OpenAI (2024) OpenAI. 2024. OpenAI Chat. https://chat.openai.com
  • OpenAI (2023) R OpenAI. 2023. Gpt-4 technical report. arxiv 2303.08774. View in Article 2, 5 (2023).
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems 35 (2022), 27730–27744.
  • Pal et al. (2022) Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. 2022. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on health, inference, and learning. PMLR, 248–260.
  • Pardos and Bhandari (2023) Zachary A Pardos and Shreya Bhandari. 2023. Learning gain differences between ChatGPT and human tutor generated algebra hints. arXiv preprint arXiv:2302.06871 (2023).
  • PDF (2024) Chat PDF. 2024. Chat PDF. https://www.chatpdf.com/
  • Perplexity AI (2024) Perplexity AI. 2024. Perplexity AI. https://perplexity.ai
  • Pi.ai (2024) Pi.ai. 2024. Pi. http://www.Pi.ai
  • Pinto et al. (2023) Gustavo Pinto, Isadora Cardoso-Pereira, Danilo Monteiro, Danilo Lucena, Alberto Souza, and Kiev Gama. 2023. Large language models for education: Grading open-ended questions using chatgpt. In Proceedings of the XXXVII Brazilian Symposium on Software Engineering. 293–302.
  • Prihar et al. (2023) Ethan Prihar, Morgan Lee, Mia Hopman, Adam Tauman Kalai, Sofia Vempala, Allison Wang, Gabriel Wickline, Aly Murray, and Neil Heffernan. 2023. Comparing different approaches to generating mathematics explanations using large language models. In International Conference on Artificial Intelligence in Education. Springer, 290–295.
  • Qadir (2023) Junaid Qadir. 2023. Engineering education in the era of ChatGPT: Promise and pitfalls of generative AI for education. In 2023 IEEE Global Engineering Education Conference (EDUCON). IEEE, 1–9.
  • Rahman and Watanobe (2023) Md Mostafizer Rahman and Yutaka Watanobe. 2023. ChatGPT for education and research: Opportunities, threats, and strategies. Applied Sciences 13, 9 (2023), 5783.
  • Ray et al. (2003) Adams Ray, Wu Margaret, et al. 2003. PISA Programme for international student assessment (PISA) PISA 2000 technical report: PISA 2000 technical report. oecd Publishing.
  • Romero and Ventura (2007) Cristobal Romero and Sebastian Ventura. 2007. Educational data mining: A survey from 1995 to 2005. Expert systems with applications 33, 1 (2007), 135–146.
  • Romero and Ventura (2010) Cristóbal Romero and Sebastián Ventura. 2010. Educational data mining: a review of the state of the art. IEEE Transactions on Systems, Man, and Cybernetics, Part C (applications and reviews) 40, 6 (2010), 601–618.
  • Romero and Ventura (2013) Cristobal Romero and Sebastian Ventura. 2013. Data mining in education. Wiley Interdisciplinary Reviews: Data mining and knowledge discovery 3, 1 (2013), 12–27.
  • Romero and Ventura (2020) Cristobal Romero and Sebastian Ventura. 2020. Educational data mining and learning analytics: An updated survey. Wiley interdisciplinary reviews: Data mining and knowledge discovery 10, 3 (2020), e1355.
  • Rooein et al. (2023) Donya Rooein, Amanda Cercas Curry, and Dirk Hovy. 2023. Know Your Audience: Do LLMs Adapt to Different Age and Education Levels? arXiv preprint arXiv:2312.02065 (2023).
  • Rothe et al. (2021) Sascha Rothe, Jonathan Mallinson, Eric Malmi, Sebastian Krause, and Aliaksei Severyn. 2021. A simple recipe for multilingual grammatical error correction. arXiv preprint arXiv:2106.03830 (2021).
  • Rozovskaya and Roth (2019) Alla Rozovskaya and Dan Roth. 2019. Grammar error correction in morphologically rich languages: The case of Russian. Transactions of the Association for Computational Linguistics 7 (2019), 1–17.
  • Ruipérez-Valiente et al. (2022) José A Ruipérez-Valiente, Thomas Staubitz, Matt Jenner, Sherif Halawa, Jiayin Zhang, Ignacio Despujol, Jorge Maldonado-Mahauad, German Montoro, Melanie Peffer, Tobias Rohloff, et al. 2022. Large scale analytics of global and regional MOOC providers: Differences in learners’ demographics, preferences, and perceptions. Computers & Education 180 (2022), 104426.
  • Savelka et al. (2023) Jaromir Savelka, Arav Agarwal, Marshall An, Chris Bogart, and Majd Sakr. 2023. Thrilled by your progress! Large language models (GPT-4) no longer struggle to pass assessments in higher education programming courses. In Proceedings of the 2023 ACM Conference on International Computing Education Research-Volume 1. 78–92.
  • Shoaib et al. (2023) Mohamed R. Shoaib, Zefan Wang, Milad Taleby Ahvanooey, and Jun Zhao. 2023. Deepfakes, Misinformation, and Disinformation in the Era of Frontier AI, Generative AI, and Large AI Models. arXiv:2311.17394 [cs.CR]
  • Shridhar et al. ([n. d.]) Kumar Shridhar, Jakub Macina, Mennatallah El-Assady, Tanmay Sinha, Manu Kapur, and Mrinmaya Sachan. [n. d.]. Automatic Generation of Socratic Questions for Learning to Solve Math Word Problems. ([n. d.]).
  • Sonkar and Baraniuk (2023) Shashank Sonkar and Richard G Baraniuk. 2023. Deduction under Perturbed Evidence: Probing Student Simulation (Knowledge Tracing) Capabilities of Large Language Models. (2023).
  • Stab and Gurevych (2014) Christian Stab and Iryna Gurevych. 2014. Annotating argument components and relations in persuasive essays. In Proceedings of COLING 2014, the 25th international conference on computational linguistics: Technical papers. 1501–1510.
  • summarize.tech (2024) summarize.tech. 2024. summarize.tech. https://www.summarize.tech/
  • Sun (2023) Hao Sun. 2023. Offline prompt evaluation and optimization with inverse reinforcement learning. arXiv preprint arXiv:2309.06553 (2023).
  • Suraworachet et al. (2024) Wannapon Suraworachet, Jennifer Seon, and Mutlu Cukurova. 2024. Predicting challenge moments from students’ discourse: A comparison of GPT-4 to two traditional natural language processing approaches. In Proceedings of the 14th Learning Analytics and Knowledge Conference (LAK ’24). ACM. https://doi.org/10.1145/3636555.3636905
  • Susnjak (2022) Teo Susnjak. 2022. ChatGPT: The end of online exam integrity? arXiv preprint arXiv:2212.09292 (2022).
  • Syvokon and Nahorna (2021) Oleksiy Syvokon and Olena Nahorna. 2021. UA-GEC: Grammatical error correction and fluency corpus for the ukrainian language. arXiv preprint arXiv:2103.16997 (2021).
  • Tan et al. (2023) Kehui Tan, Tianqi Pang, and Chenyou Fan. 2023. Towards applying powerful large ai models in classroom teaching: Opportunities, challenges and prospects. arXiv preprint arXiv:2305.03433 (2023).
  • Tan et al. (2024) Zhen Tan, Jie Peng, Tianlong Chen, and Huan Liu. 2024. Tuning-Free Accountable Intervention for LLM Deployment – A Metacognitive Approach. arXiv:2403.05636 [cs.AI]
  • Thirunavukarasu et al. (2023) Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. 2023. Large language models in medicine. Nature medicine 29, 8 (2023), 1930–1940.
  • Tiedemann (2020) Jörg Tiedemann. 2020. The Tatoeba Translation Challenge–Realistic Data Sets for Low Resource and Multilingual MT. arXiv preprint arXiv:2010.06354 (2020).
  • Tigina et al. (2023) Maria Tigina, Anastasiia Birillo, Yaroslav Golubev, Hieke Keuning, Nikolay Vyahhi, and Timofey Bryksin. 2023. Analyzing the quality of submissions in online programming courses. In 2023 IEEE/ACM 45th International Conference on Software Engineering: Software Engineering Education and Training (ICSE-SEET). IEEE, 271–282.
  • Tools (2024) Goblin Tools. 2024. Goblin Tools. https://goblin.tools/
  • Tseng et al. (2015) Yuen-Hsien Tseng, Lung-Hao Lee, Li-Ping Chang, and Hsin-Hsi Chen. 2015. Introduction to SIGHAN 2015 bake-off for Chinese spelling check. In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing. 32–37.
  • Tufano et al. (2019) Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. An empirical study on learning bug-fixing patches in the wild via neural machine translation. ACM Transactions on Software Engineering and Methodology (TOSEM) 28, 4 (2019), 1–29.
  • Twee (2024) Twee. 2024. Twee. https://twee.com/
  • Upadhyay and Chang (2016) Shyam Upadhyay and Ming-Wei Chang. 2016. Annotating derivations: A new evaluation strategy and dataset for algebra word problems. arXiv preprint arXiv:1609.07197 (2016).
  • Wang et al. (2023) Xinyi Wang, Wanrong Zhu, Michael Saxon, Mark Steyvers, and William Yang Wang. 2023. Large language models are implicitly topic models: Explaining and finding good demonstrations for in-context learning. In Workshop on Efficient Systems for Foundation Models@ ICML2023.
  • Wang et al. (2017) Yan Wang, Xiaojiang Liu, and Shuming Shi. 2017. Deep neural solver for math word problems. In Proceedings of the 2017 conference on empirical methods in natural language processing. 845–854.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824–24837.
  • Weidinger et al. (2021) Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins, Tom Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell, Lisa Anne Hendricks, William Isaac, Sean Legassick, Geoffrey Irving, and Iason Gabriel. 2021. Ethical and social risks of harm from Language Models. arXiv:2112.04359 [cs.CL]
  • Welbl et al. (2017) Johannes Welbl, Nelson F Liu, and Matt Gardner. 2017. Crowdsourcing multiple choice science questions. arXiv preprint arXiv:1707.06209 (2017).
  • Wu et al. (2023a) Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. 2023a. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155 (2023).
  • Wu et al. (2023b) Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. 2023b. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519 (2023).
  • Wu et al. (2023c) Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. 2023c. Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564 (2023).
  • Wu et al. (2023d) Yiran Wu, Feiran Jia, Shaokun Zhang, Qingyun Wu, Hangyu Li, Erkang Zhu, Yue Wang, Yin Tat Lee, Richard Peng, and Chi Wang. 2023d. An empirical study on challenging math problem solving with gpt-4. arXiv preprint arXiv:2306.01337 (2023).
  • Xiao et al. (2024) Changrong Xiao, Wenxing Ma, Sean Xin Xu, Kunpeng Zhang, Yufang Wang, and Qi Fu. 2024. From Automation to Augmentation: Large Language Models Elevating Essay Scoring Landscape. arXiv preprint arXiv:2401.06431 (2024).
  • Xiao et al. (2023) Changrong Xiao, Sean Xin Xu, Kunpeng Zhang, Yufang Wang, and Lei Xia. 2023. Evaluating reading comprehension exercises generated by LLMs: A showcase of ChatGPT in education applications. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023). 610–625.
  • Xiong et al. (2024) Zhang Xiong, Haoxuan Li, Zhuang Liu, Zhuofan Chen, Hao Zhou, Wenge Rong, and Yuanxin Ouyang. 2024. A Review of Data Mining in Personalized Education: Current Trends and Future Prospects. arXiv preprint arXiv:2402.17236 (2024).
  • Xu et al. (2023) Liang Xu, Anqi Li, Lei Zhu, Hang Xue, Changtai Zhu, Kangkang Zhao, Haonan He, Xuanwei Zhang, Qiyue Kang, and Zhenzhong Lan. 2023. Superclue: A comprehensive chinese large language model benchmark. arXiv preprint arXiv:2307.15020 (2023).
  • Xu et al. (2022b) Lvxiaowei Xu, Jianwang Wu, Jiawei Peng, Jiayu Fu, and Ming Cai. 2022b. FCGEC: fine-grained corpus for Chinese grammatical error correction. arXiv preprint arXiv:2210.12364 (2022).
  • Xu et al. (2022a) Ying Xu, Dakuo Wang, Mo Yu, Daniel Ritchie, Bingsheng Yao, Tongshuang Wu, Zheng Zhang, Toby Jia-Jun Li, Nora Bradford, Branda Sun, et al. 2022a. Fantastic Questions and Where to Find Them: FairytaleQA–An Authentic Dataset for Narrative Comprehension. arXiv preprint arXiv:2203.13947 (2022).
  • Yadav et al. (2023) Gautam Yadav, Ying-Jui Tseng, and Xiaolin Ni. 2023. Contextualizing problems to student interests at scale in intelligent tutoring system using large language models. arXiv preprint arXiv:2306.00190 (2023).
  • Yan et al. (2024) Lixiang Yan, Lele Sha, Linxuan Zhao, Yuheng Li, Roberto Martinez-Maldonado, Guanliang Chen, Xinyu Li, Yueqiao Jin, and Dragan Gašević. 2024. Practical and ethical challenges of large language models in education: A systematic scoping review. British Journal of Educational Technology 55, 1 (2024), 90–112.
  • Yancey et al. (2023) Kevin P Yancey, Geoffrey Laflair, Anthony Verardi, and Jill Burstein. 2023. Rating short l2 essays on the cefr scale with gpt-4. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023). 576–584.
  • Yang et al. (2023) Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang. 2023. Fingpt: Open-source financial large language models. arXiv preprint arXiv:2306.06031 (2023).
  • Yang et al. (2024) Kaiqi Yang, Yuchen Chu, Taylor Darwin, Ahreum Han, Hang Li, Hongzhi Wen, Yasemin Copur-Gencturk, Jiliang Tang, and Hui Liu. 2024. Content Knowledge Identification with Multi-Agent Large Language Models (LLMs). In International Conference on Artificial Intelligence in Education. Springer.
  • Yannakoudakis et al. (2011) Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies. 180–189.
  • Yuan et al. (2020) Ke Yuan, Dafang He, Zhuoren Jiang, Liangcai Gao, Zhi Tang, and C Lee Giles. 2020. Automatic generation of headlines for online math questions. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 9490–9497.
  • Yuan et al. (2023) Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, and Songfang Huang. 2023. How well do Large Language Models perform in Arithmetic tasks? arXiv preprint arXiv:2304.02015 (2023).
  • Zeng et al. (2023) Fanlong Zeng, Wensheng Gan, Yongheng Wang, Ning Liu, and Philip S Yu. 2023. Large language models for robotics: A survey. arXiv preprint arXiv:2311.07226 (2023).
  • Zhang et al. (2022) Jialu Zhang, José Cambronero, Sumit Gulwani, Vu Le, Ruzica Piskac, Gustavo Soares, and Gust Verbruggen. 2022. Repairing bugs in python assignments using large language models. arXiv preprint arXiv:2209.14876 (2022).
  • Zhang and Tur (2023) Peng Zhang and Gemma Tur. 2023. A systematic review of ChatGPT use in K-12 education. European Journal of Education (2023).
  • Zhang et al. (2024) Wenxuan Zhang, Mahani Aljunied, Chang Gao, Yew Ken Chia, and Lidong Bing. 2024. M3exam: A multilingual, multimodal, multilevel benchmark for examining large language models. Advances in Neural Information Processing Systems 36 (2024).
  • Zhang et al. (2023a) Xiaotian Zhang, Chunyang Li, Yi Zong, Zhengyu Ying, Liang He, and Xipeng Qiu. 2023a. Evaluating the Performance of Large Language Models on GAOKAO Benchmark.
  • Zhang et al. (2023b) Xiaowu Zhang, Xiaotian Zhang, Cheng Yang, Hang Yan, and Xipeng Qiu. 2023b. Does Correction Remain A Problem For Large Language Models? arXiv preprint arXiv:2308.01776 (2023).
  • Zhao et al. (2022) Honghong Zhao, Baoxin Wang, Dayong Wu, Wanxiang Che, Zhigang Chen, and Shijin Wang. 2022. Overview of ctc 2021: Chinese text correction for native speakers. arXiv preprint arXiv:2208.05681 (2022).
  • Zhao et al. (2024) Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, and Bin Cui. 2024. Retrieval-Augmented Generation for AI-Generated Content: A Survey. arXiv:2402.19473 [cs.CV]
  • Zhao et al. (2020) Wei Zhao, Mingyue Shang, Yang Liu, Liang Wang, and Jingming Liu. 2020. Ape210k: A large-scale and template-rich dataset of math word problems. arXiv preprint arXiv:2009.11506 (2020).
  • Zhong et al. (2023) Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. 2023. Agieval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364 (2023).
  • Zhou et al. (2024b) Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, et al. 2024b. Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification. In The International Conference on Learning Representations.
  • Zhou et al. (2024a) Kyrie Zhixuan Zhou, Zachary Kilhoffer, Madelyn Rose Sanfilippo, Ted Underwood, Ece Gumusel, Mengyi Wei, Abhinav Choudhry, and Jinjun Xiong. 2024a. "The teachers are confused as well": A Multiple-Stakeholder Ethics Discussion on Large Language Models in Computing Education. arXiv:2401.12453 [cs.CY]
  • Zhou et al. (2023) Zihao Zhou, Maizhen Ning, Qiufeng Wang, Jie Yao, Wei Wang, Xiaowei Huang, and Kaizhu Huang. 2023. Learning by Analogy: Diverse Questions Generation in Math Word Problem. In Findings of the Association for Computational Linguistics: ACL 2023. 11091–11104.
  • Zhuo et al. (2023) Terry Yue Zhuo, Yujin Huang, Chunyang Chen, and Zhenchang Xing. 2023. Red teaming chatgpt via jailbreaking: Bias, robustness, reliability and toxicity. arXiv preprint arXiv:2301.12867 (2023).
  • Zuber and Gogoll (2023) Niina Zuber and Jan Gogoll. 2023. Vox Populi, Vox ChatGPT: Large Language Models, Education and Democracy. arXiv:2311.06207 [cs.CY]

Appendix A Appendix

Table 1 summarizes the commonly used publicly available datasets and benchmarks for evaluating LLMs on education applications.
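For illustration only (not part of the survey itself), the following minimal Python sketch shows how one of these question-solving benchmarks might be used to evaluate an LLM. It assumes the Hugging Face datasets library and GSM8K's "#### <number>" gold-answer convention; query_model is a hypothetical placeholder for whichever model is under evaluation.

import re

from datasets import load_dataset


def query_model(question):
    """Hypothetical stand-in for the LLM being benchmarked."""
    raise NotImplementedError("Plug in the model under evaluation here.")


def extract_final_number(text):
    """Return the last number in a response (GSM8K gold answers end with '#### <number>')."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None


def evaluate(num_examples=100):
    """Exact-match accuracy on the first `num_examples` GSM8K test problems."""
    data = load_dataset("gsm8k", "main", split="test").select(range(num_examples))
    correct = 0
    for example in data:
        gold = example["answer"].split("####")[-1].strip().replace(",", "")
        prediction = extract_final_number(query_model(example["question"]))
        correct += int(prediction == gold)
    return correct / num_examples


if __name__ == "__main__":
    print("Exact-match accuracy: {:.2%}".format(evaluate()))

The same loop generalizes, with dataset-specific answer parsing, to most of the question-solving benchmarks below; the error-correction, question-generation, and grading datasets instead call for task-specific metrics rather than exact match.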

Table 1. Summary of existing datasets and benchmarks in the area of LLMs for education (App: QS = question solving, EC = error correction, QG = question generation, AG = automatic grading).
Dataset & Benchmark App User Subject Level Language Modality Amount Source
GSM8K QS student math K-12 EN text 8.5K (Cobbe et al., 2021)
MATH QS student math K-12 EN text 12.5K (Hendrycks et al., 2021)
Dolphin18K QS student math K-12 EN text 18K (Huang et al., 2016)
DRAW-1K QS student math comprehensive EN text 1K (Upadhyay and Chang, 2016)
Math23K QS student math K-12 ZH text 23K (Wang et al., 2017)
Ape210K QS student math K-12 EN, ZH text 210K (Zhao et al., 2020)
MathQA QS student math K-12 EN text 37K (Amini et al., 2019)
ASDiv QS student math K-12 EN text & image 2K (Miao et al., 2021)
IconQA QS student math K-12 EN text & table 107K (Lu et al., 2021b)
TQA QS student science K-12 EN text & image 26K (Kim et al., 2018)
Geometry3K QS student geometry K-12 EN text & image 3K (Lu et al., 2021a)
AI2D QS student science K-12 EN text & image 5K (Kembhavi et al., 2016)
SCIENCEQA QS student science K-12 EN text & image 21K (Chen et al., 2023b)
MedQA QS student medicine professional EN text 40K (Jin et al., 2021)
MedMCQA QS student medicine professional EN text 200K (Pal et al., 2022)
TheoremQA QS student science college EN text 800 (Chen et al., 2023b)
Math-StackExchange QS student math comprehensive EN text 310K (Yuan et al., 2020)
TABMWP QS student math K-12 EN text 38K (Lu et al., 2022)
ARC QS student comprehensive comprehensive EN text 7.7K (Clark et al., 2018)
C-Eval QS student comprehensive comprehensive ZH text 13.9K (Huang et al., 2024)
GAOKAO-bench QS student comprehensive comprehensive ZH text 2.8K (Zhang et al., 2023a)
AGIEval QS student comprehensive comprehensive EN, ZH text 8K (Zhong et al., 2023)
MMLU QS student comprehensive comprehensive EN text 1.8K (Hendrycks et al., 2020)
CMMLU QS student comprehensive comprehensive ZH text 11.9K (Li et al., 2023b)
SuperCLUE QS student comprehensive comprehensive ZH text 15.9K (Xu et al., 2023)
LANG-8 EC student linguistic language training Multi text 1M (Rothe et al., 2021)
CLANG-8 EC student linguistic language training Multi text 2.6M (Rothe et al., 2021)
CoNLL-2014 EC student linguistic language training EN text 58K (Ng et al., 2014)
BEA-2019 EC student linguistic language training EN text 686K (Bryant et al., 2019)
SIGHAN EC student linguistic language training ZH text 12K (Tseng et al., 2015)
CTC EC student linguistic language training ZH text 218K (Zhao et al., 2022)
FCGEC EC student linguistic language training ZH text 41K (Xu et al., 2022b)
FlaCGEC EC student linguistic language training ZH text 13K (Du et al., 2023)
GECCC EC student linguistic language training CS text 83K (Náplava et al., 2022)
RULEC-GEC EC student linguistic language training RU text 12K (Rozovskaya and Roth, 2019)
Falko-MERLIN EC student linguistic language training GE text 24K (Grundkiewicz and Junczys-Dowmunt, 2014)
COWS-L2H EC student linguistic language training ES text 12K (Davidson et al., 2020)
UA-GEC EC student linguistic language training UK text 20K (Syvokon and Nahorna, 2021)
RONACC EC student linguistic language training RO text 10K (Cotet et al., 2020)
Defects4J EC student computer science professional EN & Java text & code 357 (Just et al., 2014)
ManyBugs EC student computer science professional EN & C text & code 185 (Le Goues et al., 2015)
IntroClass EC student computer science professional EN & C text & code 998 (Le Goues et al., 2015)
QuixBugs EC student computer science professional EN & Multi text & code 40 (Lin et al., 2017)
Bugs2Fix EC student computer science professional EN & Java text & code 2.3M (Tufano et al., 2019)
CodeReview EC student computer science professional EN & Multi text & code 642 (Li et al., 2022)
CodeReview-New EC student computer science professional EN & Multi text & code 15 (Guo et al., 2024)
SciQ QG teacher science MOOC EN text 13.7K (Welbl et al., 2017)
RACE QG teacher linguistic K-12 EN text 100K (Lai et al., 2017)
FairytaleQA QG teacher literature K-12 EN text 10K (Xu et al., 2022a)
LearningQ QG teacher comprehensive MOOC EN text 231K (Chen et al., 2018)
KHANQ QG teacher science MOOC EN text 1K (Gong et al., 2022)
EduQG QG teacher comprehensive MOOC EN text 3K (Hadifar et al., 2023)
MCQL QG teacher comprehensive MOOC EN text 7.1K (Liang et al., 2018)
Televic QG teacher comprehensive MOOC EN text 62K (Bitew et al., 2022)
CLC-FCE AG teacher linguistic standardized test EN text 1K (Yannakoudakis et al., 2011)
ASAP AG teacher linguistic K-12 EN text 17K (Tigina et al., 2023)
TOEFL11 AG teacher linguistic standardized test EN text 1K (Blanchard et al., 2013)
ICLE AG teacher linguistic standardized test EN text 4K (Stab and Gurevych, 2014)
HSK AG teacher linguistic standardized test ZH text 10K (Cui and Zhang, 2011)