
ALPHA: AnomaLous Physiological Health Assessment Using Large Language Models

Jiankai Tang
Tsinghua University
Beijing
tjk19@mails.tsinghua.edu.cn

Kegang Wang
Central China Normal University
Wuhan
kegangwang@mails.ccnu.edu.cn

Hongming Hu
Tsinghua University
Beijing
huhm19@mails.tsinghua.edu.cn

Xiyuxing Zhang
Tsinghua University
Beijing
zxyx22@mails.tsinghua.edu.cn

Peiyu Wang
Zhipu AI
Beijing
peiyu.wang@aminer.cn

Xin Liu
University of Washington
Seattle
xliu0@cs.washington.edu

Yuntao Wang
Tsinghua University
Beijing
yuntaowang@tsinghua.edu.cn

indicates the corresponding author
Abstract

This study concentrates on evaluating the efficacy of Large Language Models (LLMs) in healthcare, with a specific focus on their application in personal anomalous health monitoring. Our research primarily investigates the capabilities of LLMs in interpreting and analyzing physiological data obtained from FDA-approved devices. We conducted an extensive analysis using anomalous physiological data gathered in a simulated low-air-pressure plateau environment. This allowed us to assess the precision and reliability of LLMs in understanding and evaluating users' health status with notable specificity. Our findings reveal that LLMs exhibit exceptional performance in determining medical indicators, including a Mean Absolute Error (MAE) of less than 1 beat per minute for heart rate and less than 1% for oxygen saturation (SpO2). Furthermore, the Mean Absolute Percentage Error (MAPE) for these evaluations remained below 1%, with the overall accuracy of health assessments surpassing 85%. In image analysis tasks, such as interpreting photoplethysmography (PPG) data, our specially adapted GPT models demonstrated remarkable proficiency, achieving a cycle-count error of less than 1 and an MAE of 7.28 bpm for heart rate estimation. This study highlights LLMs' dual role as health data analysis tools and pivotal elements in advanced AI health assistants, offering personalized health insights and recommendations within the future health assistant framework shown in Figure 1.

Keywords: Healthcare · Large Language Model · Abnormity Diagnosis

1 Introduction

Large language models (LLMs), exemplified by systems such as GPT and GLM, have exhibited remarkable proficiency in capturing and generating diverse knowledge, attributed to their scaled neural architectures characterized by a proliferation of parameters and extensive training data (e.g., [1, 2, 3, 4, 5, 6]). These models have found applications across domains including software engineering, content generation (e.g., [7, 8]), and healthcare (e.g., [9]). Nevertheless, while medical LLMs have demonstrated competence in comprehending intricate medical knowledge (e.g., [10, 9]), their potential for analyzing physiological and behavioral time series data, crucial in consumer health applications, remains a realm yet to be fully explored.

Utilizing Large Language Models (LLMs) in healthcare presents both intriguing possibilities and inherent challenges. While LLMs demonstrate proficiency in question-answering tasks, their capability in medical diagnostics can be constrained by their limited foundation in physiological principles, as pointed out by Lubitz et al. [11]. However, when combined with physiological data, these models can potentially offer richer medical insights, as highlighted by studies from Ferguson et al. [12] and Liu et al. [13]. Notably, LLMs are capable of extracting valuable information from even sparse numerical health datasets, which showcases their potential in augmenting health-related decision-making [14]. Nevertheless, a deeper investigation is required to confirm their effectiveness in analyzing physiological and behavioral data, especially multimodal data [15].

This paper seeks to explore the potential of LLMs to understand physiological vital signs and to concisely represent users' health status, contextualized within everyday scenarios. Specifically, we evaluate the mathematical precision and medical diagnostic capabilities of LLMs, leveraging authentic clinical data and synthetically generated scenarios, thereby establishing the feasibility of LLMs as proficient health assistants. Results show that LLMs can achieve an MAE below 1 in vitals calculation, over 85% accuracy in health status assessment, and an MAE below 1 in visual cycle counting.

Figure 1: Framework of HealthCarer

Interpretation of medical terms: SpO2: Oxygen Saturation, BP: Blood Pressure, HR: Heart Rate, TEMP: Body Temperature, PPG: Photoplethysmogram, PI: Perfusion Index, ECG: Electrocardiogram, RR: Respiration Rate, EMR: Electronic Medical Record

2 Challenges

This research delves into the complex challenges of integrating Large Language Models (LLMs) in health assistant applications, a critical step for the advancement of medical technology. These challenges range from effectively combining various sensor modalities to ensuring the reliable dissemination of health advisories and grappling with the ethical implications of using generative AI in healthcare settings. The careful mitigation of these issues is essential for the responsible and effective integration of LLMs, paving the way for a new era in healthcare technology.

2.1 Modality Fusion

Integrating multimodal models like Langchain[16], BLIP-2[17], and LLaVA[18] into health assistants presents a formidable challenge. These systems must be enabled to process and represent data from multiple modalities, such as images and audio, as well as seamlessly convert them into coherent, actionable medical insights. The key lies in achieving precision in data representation and maintaining efficiency, which are crucial for their effective application in healthcare settings. This integration demands a nuanced understanding of both technology and healthcare needs.

2.2 Responsible Generative AI

The use of LLMs in healthcare introduces significant challenges, particularly due to issues like "hallucinations" [19]. Striking a balance between harnessing the capabilities of generative AI and ensuring the accuracy and reliability of the information it provides is paramount, especially in sensitive areas such as patient recovery and medical record management. It is essential to establish robust mechanisms to validate AI-generated advice and maintain transparency in AI-driven decision-making processes.

2.3 Patient Privacy

Ensuring patient privacy in the integration of generative AI into health assistants is a critical challenge [20]. The use of LLMs for processing sensitive information, including physiological signals and medical records, heightens concerns regarding data security and confidentiality. This challenge involves not only protecting against unauthorized data breaches but also ensuring compliance with stringent regulations like HIPAA (https://www.cdc.gov/phlp/publications/topic/hipaa.html). The goal is to balance the transformative benefits of AI in healthcare with the imperative of safeguarding patient confidentiality, necessitating the development of advanced encryption methods, strict access controls, and clear data usage policies.

3 Experiments

Tasks are chosen to validate the performance of mainstream LLMs on health-related tasks: (1) calculating average vital information from raw biomedical sensor signals (e.g., the average heart rate over 60 seconds) and (2) assessing health status based on medical knowledge. To run the experiments, open-source physiological datasets (e.g., [21, 22]) were first tested in our system. However, these datasets do not include data from individuals with abnormal health conditions (e.g., arrhythmia), which makes it hard to tell whether the LLMs output the correct assessment. To fill this gap, we conducted a user study under low air pressure to simulate a highland environment and collected multiple physiological signals from 12 subjects. Our experiment was approved by the Institutional Review Board (No. 20230076). Under the low-pressure condition, blood oxygen may drop to the point of hypoxia while heart rate and respiration rate increase, which enables the collection of physiological data under abnormal health conditions. The detailed results are shown below, and the related code is publicly available in our GitHub repository: https://github.com/McJackTang/LLM-HealthAssistant.
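For concreteness, the sketch below shows one way a window of raw sensor readings could be packaged into a text prompt covering both tasks. The prompt wording and the `query_llm` helper are illustrative assumptions, not the exact prompts or API wrapper used in our experiments.

```python
# Minimal sketch (assumed prompt wording): package a window of raw vitals
# into a single text prompt for the averaging and assessment tasks.
def build_vitals_prompt(hr_samples, spo2_samples, period_s):
    """Format raw HR/SpO2 readings (one sample per second) as an LLM prompt."""
    hr_str = ", ".join(f"{v:.0f}" for v in hr_samples)
    spo2_str = ", ".join(f"{v:.0f}" for v in spo2_samples)
    return (
        f"Below are {period_s} seconds of readings from an FDA-approved oximeter.\n"
        f"Heart rate (beats/min): {hr_str}\n"
        f"SpO2 (%): {spo2_str}\n"
        "1. Report the average heart rate and average SpO2 over this period.\n"
        "2. Assess whether the user's health status is normal, abnormal, or "
        "extremely abnormal, and briefly explain why."
    )

# Example usage with a hypothetical 60-second window:
# prompt = build_vitals_prompt(hr_window, spo2_window, period_s=60)
# reply = query_llm(prompt)   # query_llm wraps the GPT/GLM chat API of choice
```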

3.1 Vitals Calculation

For vitals calculation, the primary objective is to task the large language model with computing the average values of vital signs (e.g., HR, SpO2) over continuous time intervals. This experimental study encompasses two distinct tasks: (1) Single-Task, where a single vital sign is provided as input along with a corresponding prompt to compute one value; (2) Multi-Task, where two vital signs are input simultaneously and the model is prompted to compute two values concurrently. The Single-Task is designed to evaluate the accuracy and reliability of the LLM, while the Multi-Task assesses the model's proficiency in processing and interpreting more complex information efficiently. Our findings indicate that LLMs demonstrate commendable performance in both tasks. As depicted in Table 1, the lowest Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) remain consistently below 1 for both input formats. Notably, GPT outperforms GLM in the majority of vitals calculation tasks, the exception being the single SpO2 calculation task over 120 seconds. This discrepancy suggests that GPT exhibits superior floating-point capabilities, whereas GLM tends to produce integer outputs.

Table 1: Vitals Calculation on Average Vitals

Model  Period (s)  Single HR         Single SpO2       Multi HR          Multi SpO2
                   MAE↓    MAPE↓     MAE↓    MAPE↓     MAE↓    MAPE↓     MAE↓    MAPE↓
GLM    30          1.18    1.62      0.52    0.56      1.73    2.39      0.73    0.78
GPT    30          0.48    0.66      0.36    0.38      0.72    1.01      0.65    0.69
GLM    60          1.24    1.73      0.52    0.55      1.72    2.39      0.66    0.70
GPT    60          0.44    0.61      0.33    0.35      0.58    0.82      0.53    0.56
GLM    120         1.26    1.78      0.21    0.22      2.22    3.09      0.33    0.34
GPT    120         0.56    0.79      0.24    0.25      0.46    0.65      0.27    0.28

GLM stands for GLM-2-Pro and GPT stands for GPT-3.5-Turbo, as of October 2023. MAE = Mean Absolute Error in HR/SpO2 estimation (HR: beats/min, SpO2: %), MAPE = Mean Absolute Percentage Error in HR/SpO2 estimation (%).
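For reference, the MAE and MAPE reported in Table 1 follow their standard definitions; a minimal sketch of how the model outputs can be scored against the device ground truth is shown below (the variable names in the usage comment are illustrative).

```python
import numpy as np

def mae_mape(predictions, ground_truth):
    """Mean Absolute Error and Mean Absolute Percentage Error (in %)."""
    pred = np.asarray(predictions, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    mae = np.mean(np.abs(pred - gt))
    mape = np.mean(np.abs(pred - gt) / gt) * 100.0
    return mae, mape

# e.g., scoring the LLM's 60-second average-HR answers against the oximeter:
# mae, mape = mae_mape(llm_avg_hr, device_avg_hr)
```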

3.2 Health Status Assessment

The objective of this multi-classification task is to evaluate health status based on the analysis of vital signs, covering distinct and blended vital-sign inputs in a manner similar to the structure outlined in Section 3.1. This assessment adheres to established guidelines provided by the World Health Organization (WHO) (https://www.who.int/news-room/questions-and-answers/item/oxygen). For example, a heart rate (HR) above 100 bpm is considered abnormal and above 130 bpm extremely abnormal. Similarly, oxygen saturation (SpO2) is considered abnormal below 95% and extremely abnormal below 92%. For the combined indicators in the multi-task scenario, a sample is designated as normal only when both the HR and SpO2 readings are normal. Note that, in the distribution of our dataset, abnormal SpO2 statuses are more prevalent than abnormal heart-rate statuses. As illustrated in Table 2, GLM demonstrates superior performance on the single heart rate task, while GPT performs better on the single SpO2 task, exhibiting its adaptability to changes. It is noteworthy that GLM appears capable of handling multimodal data, as evidenced by the increase in accuracy when heart rate and SpO2 data are processed concurrently.

Table 2: Accuracy of Health Status Classification

Model  Period (s)  Single HR Health Acc↑  Single SpO2 Health Acc↑  Multi Health Acc↑
GLM    30          0.78                   0.72                      0.87
GPT    30          0.68                   0.81                      0.76
GLM    60          0.75                   0.75                      0.83
GPT    60          0.75                   0.81                      0.75
GLM    120         0.87                   0.67                      0.80
GPT    120         0.73                   0.80                      0.87

GLM stands for GLM-2-Pro and GPT stands for GPT-3.5-Turbo, as of October 2023. Accuracy (Acc) is the ratio of correct predictions to total predictions, within a range of 0 to 1.
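A minimal sketch of the labeling rule described above, using the thresholds stated in this section (the three-level label strings are our own naming), is:

```python
def hr_status(hr):
    """Classify average heart rate (beats/min) per the thresholds above."""
    if hr > 130:
        return "extremely abnormal"
    return "abnormal" if hr > 100 else "normal"

def spo2_status(spo2):
    """Classify average SpO2 (%) per the thresholds above."""
    if spo2 < 92:
        return "extremely abnormal"
    return "abnormal" if spo2 < 95 else "normal"

def multi_status(hr, spo2):
    """Multi-task label: normal only when both HR and SpO2 are normal."""
    if hr_status(hr) == "normal" and spo2_status(spo2) == "normal":
        return "normal"
    return "abnormal"
```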

3.3 Medical Image Analysis

In addition to textual information, dynamic medical vital images such as PPG or ECG traces play a crucial role in assessing a person's health condition. To evaluate the ability of the LLM to comprehend PPG graphs, a specialized GPT model was tasked with counting the cycles in the images and subsequently calculating the heart rate from that count. To the best of our knowledge, we are the first to design and fine-tune a GPTs model, referred to as the "visual counter", accessible at https://chat.openai.com/g/g-SR8nCXyWI-visual-counter. Our tailored GPTs model achieves an MAE of 0.556 and 0.889 for peak and trough counts respectively, which leads to an MAE of 7.283 bpm and a MAPE of 9% for heart rate estimation, as illustrated in Figure 2.

(a) Sample of BVP, prompted with: "Here is a graph of the PPG wave for 10 seconds. Please count the number of peaks and troughs, and calculate the heart rate."
(b) Bland-Altman plot for heart rate estimation
Figure 2: Visual Counter Sample and Result
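The post-processing behind Figure 2 amounts to converting the cycle count returned by the model into beats per minute. A sketch of this conversion, plus an optional offline peak-detection baseline using SciPy, is given below; the window length and the peak-spacing parameter are assumptions, not the exact settings of our pipeline.

```python
import numpy as np
from scipy.signal import find_peaks

def hr_from_cycle_count(num_peaks, window_s=10):
    """Heart rate (beats/min) from the number of PPG peaks in a window."""
    return num_peaks / window_s * 60.0

def hr_from_ppg(ppg, fs):
    """Offline baseline: detect peaks in a raw PPG trace sampled at fs Hz."""
    # Require peaks to be at least 0.33 s apart (i.e., HR below ~180 bpm).
    peaks, _ = find_peaks(np.asarray(ppg), distance=int(0.33 * fs))
    return hr_from_cycle_count(len(peaks), window_s=len(ppg) / fs)

# e.g., a GPT reply of "12 peaks" over a 10-second PPG plot implies
# hr_from_cycle_count(12)  ->  72.0 bpm
```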

4 Discussion

4.1 AI Agents

The inherent randomness and perceived unpredictability of large language models have traditionally posed challenges for performing complex calculations. Nevertheless, recent advances in AI agents make it increasingly feasible for these models to handle intricate numerical tasks. When the LLM is tasked solely with understanding user intentions and is empowered to invoke executable programs, it can achieve a much higher degree of accuracy in numerical calculations.
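A minimal sketch of this pattern, in which the LLM only selects a tool and the arithmetic is done by ordinary code, might look like the following; the JSON tool schema and the `chat` helper are illustrative assumptions rather than a specific vendor API.

```python
import json
import statistics

# Executable "tools" the agent may invoke; the numerical work never
# happens inside the language model itself.
TOOLS = {
    "average_hr": lambda samples: statistics.fmean(samples),
    "average_spo2": lambda samples: statistics.fmean(samples),
}

def run_agent(user_request, samples, chat):
    """chat(prompt) -> str is any LLM call; the model only picks a tool."""
    prompt = (
        "You can call one tool by replying with JSON such as "
        '{"tool": "average_hr"}. Available tools: '
        + ", ".join(TOOLS) + ".\nUser request: " + user_request
    )
    choice = json.loads(chat(prompt))          # e.g., {"tool": "average_hr"}
    result = TOOLS[choice["tool"]](samples)    # exact arithmetic in Python
    return chat(f"The tool returned {result:.2f}. Answer the user briefly.")
```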

4.2 Applications

Looking forward, the Large Language Model (LLM) goes beyond being a mere diagnostic tool, as it has the potential to offer valuable advice and care to users. When integrated with wearable devices, the LLM can function as a personal health manager connected to the real world. Here, we present several scenarios to illustrate how the LLM could assist us in everyday applications, including providing health suggestions and offering recovery reminders, as shown in Figure 3 (illustrations generated with GPT Plus and modified).

User: "I am a little uncomfortable recently. Could you take a look at my vitals and give me some health advice?"

LLM: "Sure, according to your ECG and PPG, you may have atrial fibrillation. It would be better for you to take more rest and go to the clinic."


(a) Health assessment with vitals from wearables
(b) Diet reminders using visual information
Figure 3: Applications in daily life

User: "Could you check the dishes on the table, according to the doctor’s suggestions, which ones can’t I eat?"
用户:“你能不能检查一下桌子上的菜,根据医生的建议,哪些是我不能吃的?“

LLM: "According to your EMR (diabetes), there is no harm among this except for alcohol. But you’d better control the intake of sugar under 25 grams."
LLM:“根据你的EMR(糖尿病),除了酒精之外,这中间没有任何伤害。但你最好把糖的摄入量控制在25克以下。“

4.3 Future Work

In addition to performing mathematical calculations using raw text input with the Large Language Model (LLM), another promising avenue for the future lies in the development of scheduling agents capable of generating executable code. When utilizing LLM agents for such purposes, the pivotal factors for successful execution will include intricate task decomposition and rigorous stage verification. Meanwhile, in the realm of medical applications, the ability to comprehend and interpret medical graphs may hold even greater significance, particularly when it comes to aligning and integrating diverse medical information across various devices and sources.
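As one example of the stage verification mentioned above, each intermediate result produced by an agent could be checked against physiologically plausible bounds before being passed to the next stage. A small illustrative sketch follows; the ranges are rough sanity checks of our own choosing, not clinical reference ranges.

```python
# Illustrative plausibility bounds for intermediate agent outputs.
PLAUSIBLE_RANGES = {
    "hr": (30.0, 220.0),      # beats/min
    "spo2": (70.0, 100.0),    # %
}

def verify_stage(name, value):
    """Reject an intermediate result that falls outside plausible bounds."""
    low, high = PLAUSIBLE_RANGES[name]
    if not (low <= value <= high):
        raise ValueError(f"{name}={value} failed stage verification "
                         f"(expected {low}-{high}); re-run or escalate.")
    return value

# e.g., verify_stage("hr", computed_hr) before composing the final report
```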

5 Conclusions

In this study, we evaluate the capability of large language models to deal with anomalous physiological data through three distinct tasks. Our contributions encompass the following key aspects: (1) Open-source Data and Code for Abnormal Health Condition Assessment: We provide an open-source dataset and code resources to facilitate the assessment of abnormal health conditions, promoting transparency and collaboration in the field. (2) Validation of Vitals Calculation using LLMs: We demonstrate the proficiency of large language models (LLMs) in the precise assessment of vital signs, delineating their substantial promise for deployment in healthcare settings. (3) Health Status Analysis with Vital Signs by LLMs: We validate the LLMs' capability to analyze abnormal health status based on vital sign data, showcasing their potential as a diagnostic tool. (4) Experiments on Medical Image Information Recognition: We are the first to conduct experiments assessing the LLM's ability to extract valuable information from medical images with GPTs, expanding their utility in the medical field.

Acknowledgments

This work is supported by the Natural Science Foundation of China (NSFC) under Grant No. 62132010 and No. 62002198, Young Elite Scientists Sponsorship Program by CAST under Grant No.2021QNRC001, Tsinghua University Initiative Scientific Research Program, Beijing Natural Science Foundation, Beijing Key Lab of Networked Multimedia, and Institute for Artificial Intelligence, Tsinghua University.

References

[1] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
[2] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
[3] OpenAI. GPT-4 technical report, 2023.
[4] Claude. www.anthropic.com, 2023. Conversational AI assistant.
[5] Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335, 2022.
[6] Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.
[7] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
[8] John Joon Young Chung, Wooseok Kim, Kang Min Yoo, Hwaran Lee, Eytan Adar, and Minsuk Chang. TaleBrush: Sketching stories with generative pretrained language models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pages 1–19, 2022.
[9] Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. arXiv preprint arXiv:2212.13138, 2022.
[10] Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3(1):1–23, 2021.
[11] Steven A Lubitz, Anthony Z Faranesh, Caitlin Selvaggi, Steven J Atlas, David D McManus, Daniel E Singer, Sherry Pagoto, Michael V McConnell, Alexandros Pantelopoulos, and Andrea S Foulkes. Detection of atrial fibrillation in a large population using wearable devices: the Fitbit Heart Study. Circulation, 146(19):1415–1424, 2022.
[12] Ty Ferguson, Timothy Olds, Rachel Curtis, Henry Blake, Alyson J Crozier, Kylie Dankiw, Dorothea Dumuid, Daiki Kasai, Edward O'Connor, Rosa Virgara, et al. Effectiveness of wearable activity trackers to increase physical activity and improve health: a systematic review of systematic reviews and meta-analyses. The Lancet Digital Health, 4(8):e615–e626, 2022.
[13] Xin Liu, Daniel McDuff, Geza Kovacs, Isaac Galatzer-Levy, Jacob Sunshine, Jiening Zhan, Ming-Zher Poh, Shun Liao, Paolo Di Achille, and Shwetak Patel. Large language models are few-shot health learners, 2023.
[14] Sara Montagna, Stefano Ferretti, Lorenz Cuno Klopfenstein, Antonio Florio, and Martino Francesco Pengo. Data decentralisation of LLM-based chatbot systems in chronic disease self-management. In Proceedings of the 2023 ACM Conference on Information Technology for Social Good, pages 205–212, 2023.
[15] Bertalan Meskó. The impact of multimodal large language models on health care's future. Journal of Medical Internet Research, 25:e52865, 2023.
[16] Harrison Chase. LangChain, 2022. LangChain AI.
[17] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023.
[18] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023.
[19] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, March 2023.
[20] Eunkyung Jo, Daniel A Epstein, Hyunhoon Jung, and Young-Ho Kim. Understanding the benefits and challenges of deploying conversational AI leveraging large language models for public health intervention. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pages 1–16, 2023.
[21] Jiankai Tang, Kequan Chen, Yuntao Wang, Yuanchun Shi, Shwetak Patel, Daniel McDuff, and Xin Liu. MMPD: Multi-domain mobile video physiology dataset, 2023.
[22] Kegang Wang, Yantao Wei, Mingwen Tong, Jie Gao, Yi Tian, YuJian Ma, and ZhongJin Zhao. PhysBench: A benchmark framework for remote physiological sensing with new dataset and baseline, 2023.